Anonymizing a BIDS dataset

Consider the following scenario:

  • You’ve created a BIDS dataset.

  • Now you want to make this dataset available to the public.

  • Therefore, all personally identifying information must be removed.

While mne_bids.write_raw_bids() and mne_bids.write_anat() can be used to store anonymized copies of data (by passing the anonymize and deface keyword arguments, respectively), using these functions to anonymize an entire existing dataset can be cumbersome and error-prone.

MNE-BIDS provides a dedicated function, mne_bids.anonymize_dataset(), to do the heavy lifting for you, automatically.

# Authors: The MNE-BIDS developers
# SPDX-License-Identifier: BSD-3-Clause
import shutil

import mne

from mne_bids import (
    BIDSPath,
    anonymize_dataset,
    print_dir_tree,
    write_anat,
    write_meg_calibration,
    write_meg_crosstalk,
    write_raw_bids,
)

data_path = mne.datasets.sample.data_path()
event_id = {
    "Auditory/Left": 1,
    "Auditory/Right": 2,
    "Visual/Left": 3,
    "Visual/Right": 4,
    "Smiley": 5,
    "Button": 32,
}

raw_path = data_path / "MEG" / "sample" / "sample_audvis_raw.fif"
raw_er_path = data_path / "MEG" / "sample" / "ernoise_raw.fif"  # empty-room
events_path = data_path / "MEG" / "sample" / "sample_audvis_raw-eve.fif"
cal_path = data_path / "SSS" / "sss_cal_mgh.dat"
ct_path = data_path / "SSS" / "ct_sparse_mgh.fif"
t1w_path = data_path / "subjects" / "sample" / "mri" / "T1.mgz"

bids_root = data_path.parent / "MNE-sample-data-bids"
bids_root_anon = data_path.parent / "MNE-sample-data-bids-anon"

To ensure the output paths don’t contain any leftover files from previous tests and example runs, we simply delete them.

Warning

Do not delete directories that may contain important data!
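
The cleanup step described above can be sketched as follows. This is a minimal sketch and not necessarily the exact code the example runs: `remove_if_exists` is a hypothetical helper, and it is demonstrated on a throwaway temporary directory rather than on `bids_root` and `bids_root_anon`.

```python
import shutil
import tempfile
from pathlib import Path


def remove_if_exists(path):
    """Delete a directory tree if present; return True if something was removed.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a scratch directory so that no real data is touched.
scratch = Path(tempfile.mkdtemp()) / "MNE-sample-data-bids-anon"
scratch.mkdir()
(scratch / "README").write_text("leftover from a previous run")

removed_first = remove_if_exists(scratch)   # the directory existed
removed_second = remove_if_exists(scratch)  # nothing left to delete
print(removed_first, removed_second)
```

Checking for existence first means the cleanup does not fail on a fresh machine where the output directories have never been created.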

bids_path = BIDSPath(
    subject="ABC123", task="audiovisual", root=bids_root, datatype="meg"
)
bids_path_er = bids_path.copy().update(
    subject="emptyroom", task="noise", session="20021206"
)

raw = mne.io.read_raw_fif(raw_path, verbose=False)
raw_er = mne.io.read_raw_fif(raw_er_path, verbose=False)
# specify power line frequency as required by BIDS
raw.info["line_freq"] = 60
raw_er.info["line_freq"] = 60

# Write empty-room data
write_raw_bids(raw=raw_er, bids_path=bids_path_er, verbose=False)

# Write experimental MEG data, fine-calibration and crosstalk files
write_raw_bids(
    raw=raw,
    bids_path=bids_path,
    events=events_path,
    event_id=event_id,
    empty_room=bids_path_er,
    verbose=False,
)
write_meg_calibration(cal_path, bids_path=bids_path, verbose=False)
write_meg_crosstalk(ct_path, bids_path=bids_path, verbose=False)

# Write anatomical scan
# We pass the MRI landmark coordinates, which will later be required for
# automated defacing
mri_landmarks = mne.channels.make_dig_montage(
    lpa=[66.08580, 51.33362, 46.52982],
    nasion=[41.87363, 32.24694, 74.55314],
    rpa=[17.23812, 53.08294, 47.01789],
    coord_frame="mri_voxel",
)
bids_path.datatype = "anat"
write_anat(image=t1w_path, bids_path=bids_path, landmarks=mri_landmarks, verbose=False)
BIDSPath(
root: /home/circleci/mne_data/MNE-sample-data-bids
datatype: anat
basename: sub-ABC123_T1w.nii.gz)

Basic anonymization

Now we’re ready to anonymize the dataset!

anonymize_dataset(bids_root_in=bids_root, bids_root_out=bids_root_anon)
Anonymizing BIDS dataset
Determining "daysback" for anonymization.

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Shifting recording dates by 30709 days (84.1 years).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-1
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/5 [00:00<?,       ?it/s]
 20%|██        | Anonymizing : 1/5 [00:00<00:01,    3.59it/s]
 40%|████      | Anonymizing : 2/5 [00:01<00:01,    1.96it/s]
100%|██████████| Anonymizing : 5/5 [00:03<00:00,    1.59it/s]
100%|██████████| Anonymizing : 5/5 [00:03<00:00,    1.60it/s]

That’s it! Let’s have a look at the directory structure of the anonymized dataset.

print_dir_tree(bids_root_anon)
|MNE-sample-data-bids-anon/
|--- README
|--- dataset_description.json
|--- participants.json
|--- participants.tsv
|--- sub-1/
|------ sub-1_scans.tsv
|------ anat/
|--------- sub-1_T1w.json
|--------- sub-1_T1w.nii.gz
|------ meg/
|--------- sub-1_acq-calibration_meg.dat
|--------- sub-1_acq-crosstalk_meg.fif
|--------- sub-1_coordsystem.json
|--------- sub-1_task-audiovisual_channels.tsv
|--------- sub-1_task-audiovisual_events.json
|--------- sub-1_task-audiovisual_events.tsv
|--------- sub-1_task-audiovisual_meg.fif
|--------- sub-1_task-audiovisual_meg.json
|--- sub-emptyroom/
|------ ses-19181108/
|--------- sub-emptyroom_ses-19181108_scans.tsv
|--------- meg/
|------------ sub-emptyroom_ses-19181108_task-noise_channels.tsv
|------------ sub-emptyroom_ses-19181108_task-noise_meg.fif
|------------ sub-emptyroom_ses-19181108_task-noise_meg.json

You can see that the subject ID was changed to a number (in this case, the digit 1), and the recording dates have been shifted backward in time (as indicated by the emptyroom session name). Anonymized IDs are zero-padded numbers ranging from 1 to \(N\), where \(N\) is the total number of participants (excluding the emptyroom pseudo-subject).
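
To make the ID scheme concrete, here is a minimal sketch of how zero-padded IDs from 1 to N could be generated. It is not the MNE-BIDS implementation: `make_subject_mapping` is a hypothetical helper, and both the padding width (the number of digits in N) and the shuffling step are assumptions made for illustration.

```python
import random


def make_subject_mapping(subjects, seed=None):
    """Map original subject IDs to zero-padded numbers from 1 to N.

    Hypothetical helper for illustration; not the MNE-BIDS implementation.
    """
    participants = sorted(s for s in subjects if s != "emptyroom")
    rng = random.Random(seed)
    rng.shuffle(participants)  # randomize which subject gets which number

    # Assumption: pad to the number of digits in N (2 digits for 10-99 subjects).
    width = len(str(len(participants)))
    mapping = {
        subject: str(number).zfill(width)
        for number, subject in enumerate(participants, start=1)
    }
    if "emptyroom" in subjects:
        mapping["emptyroom"] = "emptyroom"  # the pseudo-subject keeps its name
    return mapping


single = make_subject_mapping(["ABC123", "emptyroom"])
print(single)

many = make_subject_mapping([f"XYZ{i}" for i in range(12)], seed=0)
print(sorted(many.values()))
```

With a single participant, the only possible ID is 1, which matches the output above. With twelve participants, the sketch yields the IDs 01 through 12.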

Limiting to specific data types

By default, mne_bids.anonymize_dataset() will anonymize electrophysiological data and anatomical MR scans (T1-weighted and FLASH). You can limit which data types to convert using the datatypes keyword argument. The parameter can be a string (e.g., 'meg', 'eeg', 'anat') or a list of such strings.

shutil.rmtree(bids_root_anon)
anonymize_dataset(
    bids_root_in=bids_root,
    bids_root_out=bids_root_anon,
    datatypes="anat",  # Only anatomical data
)
print_dir_tree(bids_root_anon)
Anonymizing BIDS dataset

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Not shifting recording dates (found anatomical scans only).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-1
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/1 [00:00<?,       ?it/s]
100%|██████████| Anonymizing : 1/1 [00:02<00:00,    2.05s/it]
100%|██████████| Anonymizing : 1/1 [00:02<00:00,    2.05s/it]
|MNE-sample-data-bids-anon/
|--- README
|--- dataset_description.json
|--- participants.json
|--- sub-1/
|------ anat/
|--------- sub-1_T1w.json
|--------- sub-1_T1w.nii.gz

Specifying time shift

Anonymization involves shifting the recording dates back in time. MNE-BIDS will try to automatically choose a suitable time shift. You may also explicitly specify by how many days you wish to shift the recording dates back in time via the daysback parameter. To avoid the time shift, pass daysback=0.

shutil.rmtree(bids_root_anon)
anonymize_dataset(
    bids_root_in=bids_root,
    bids_root_out=bids_root_anon,
    datatypes="meg",  # Only MEG data
    daysback=10,
)
print_dir_tree(bids_root_anon / "sub-emptyroom")  # Easy to see effects here
Anonymizing BIDS dataset

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Shifting recording dates by 10 days (0.0 years).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-1
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/4 [00:00<?,       ?it/s]
 25%|██▌       | Anonymizing : 1/4 [00:00<00:00,    3.60it/s]
 50%|█████     | Anonymizing : 2/4 [00:01<00:01,    1.97it/s]
100%|██████████| Anonymizing : 4/4 [00:01<00:00,    3.96it/s]
|sub-emptyroom/
|--- ses-20021126/
|------ sub-emptyroom_ses-20021126_scans.tsv
|------ meg/
|--------- sub-emptyroom_ses-20021126_task-noise_channels.tsv
|--------- sub-emptyroom_ses-20021126_task-noise_meg.fif
|--------- sub-emptyroom_ses-20021126_task-noise_meg.json
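
The change in the emptyroom session label can be reproduced with plain date arithmetic. The sketch below assumes that the label is simply the recording date in YYYYMMDD form, as the outputs in this example suggest. `shift_session_label` is a hypothetical helper and not part of MNE-BIDS.

```python
from datetime import date, timedelta


def shift_session_label(session, daysback):
    """Shift a YYYYMMDD session label back in time by ``daysback`` days.

    Hypothetical helper for illustration only.
    """
    original = date(int(session[:4]), int(session[4:6]), int(session[6:8]))
    shifted = original - timedelta(days=daysback)
    return shifted.strftime("%Y%m%d")


# The original emptyroom session was "20021206".
print(shift_session_label("20021206", daysback=10))     # 20021126
print(shift_session_label("20021206", daysback=30709))  # 19181108
```

The first result matches the session name produced with daysback=10, and the second matches the automatically chosen shift of 30709 days from the basic anonymization run.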

Specifying subject IDs

Anonymized subject IDs are automatically generated as unique numbers in ascending order. You can control this behavior via the subject_mapping parameter. Set it to None to avoid changing the subject IDs, e.g., in case they’re already anonymized. You can pass a dictionary that maps original subject IDs to the anonymized IDs. Lastly, you can also pass a function that accepts a list of original IDs and returns such a dictionary.

shutil.rmtree(bids_root_anon)

subject_mapping = {"ABC123": "anonymous", "emptyroom": "emptyroom"}

anonymize_dataset(
    bids_root_in=bids_root,
    bids_root_out=bids_root_anon,
    datatypes="meg",
    subject_mapping=subject_mapping,
)
print_dir_tree(bids_root_anon)
Anonymizing BIDS dataset
Determining "daysback" for anonymization.

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Shifting recording dates by 35045 days (96.0 years).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-anonymous
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/4 [00:00<?,       ?it/s]
 25%|██▌       | Anonymizing : 1/4 [00:00<00:00,    3.59it/s]
 50%|█████     | Anonymizing : 2/4 [00:01<00:01,    1.96it/s]
100%|██████████| Anonymizing : 4/4 [00:01<00:00,    3.94it/s]
|MNE-sample-data-bids-anon/
|--- README
|--- dataset_description.json
|--- participants.json
|--- participants.tsv
|--- sub-anonymous/
|------ sub-anonymous_scans.tsv
|------ meg/
|--------- sub-anonymous_acq-calibration_meg.dat
|--------- sub-anonymous_acq-crosstalk_meg.fif
|--------- sub-anonymous_coordsystem.json
|--------- sub-anonymous_task-audiovisual_channels.tsv
|--------- sub-anonymous_task-audiovisual_events.json
|--------- sub-anonymous_task-audiovisual_events.tsv
|--------- sub-anonymous_task-audiovisual_meg.fif
|--------- sub-anonymous_task-audiovisual_meg.json
|--- sub-emptyroom/
|------ ses-19061225/
|--------- sub-emptyroom_ses-19061225_scans.tsv
|--------- meg/
|------------ sub-emptyroom_ses-19061225_task-noise_channels.tsv
|------------ sub-emptyroom_ses-19061225_task-noise_meg.fif
|------------ sub-emptyroom_ses-19061225_task-noise_meg.json
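
The function-based variant of subject_mapping mentioned above might look like the following sketch. `prefix_mapping` and the "anon" prefix are hypothetical choices for illustration. Including the emptyroom entry in the returned dictionary mirrors the dictionary example and is an assumption about what the function should return.

```python
def prefix_mapping(subjects):
    """Return a dict mapping each original subject ID to an anonymized ID.

    Hypothetical example of a callable that takes a list of original IDs.
    """
    mapping = {}
    counter = 0
    for subject in sorted(subjects):
        if subject == "emptyroom":
            mapping[subject] = "emptyroom"  # keep the pseudo-subject as-is
        else:
            counter += 1
            mapping[subject] = f"anon{counter:02d}"
    return mapping


result = prefix_mapping(["ABC123", "emptyroom"])
print(result)
```

Such a function would be passed as subject_mapping=prefix_mapping instead of a dictionary. Because this sketch numbers subjects in sorted order, the anonymized IDs preserve the ordering of the original ones; shuffle the list first if that matters for your dataset.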

Reproducibility

Every time you run this function, the automatically generated subject IDs and the time shift may differ (unless you explicitly specify them as described above), as they are determined randomly.

To ensure results are reproducible across runs, you can pass the random_state parameter, causing the random number generator to produce the same results every time you execute the function. This may come in handy in situations where you discover a problem with the data while working with the anonymized dataset, fix the issue in the original dataset, and run anonymization again.

(Note that throughout this example, we only had a single subject in our dataset, meaning it will always be assigned the anonymized ID 1. Only in a dataset with multiple subjects will the effects of randomly-picked IDs become apparent.)

A good random seed is truly random. Avoid using random seeds from popular culture, like “42” or “1337”. To obtain a truly random seed, you can paste the following into your console:

python -c "import secrets; print(secrets.randbits(31))"

Here, 31 bits correspond to the maximum seed “size” that NumPy’s legacy RandomState, which many scientific libraries still rely on, can accept. For more information, see also this blog post on NumPy RNG best practices.
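
The same idea can be sketched in a script. The example below draws a 31-bit seed and shows that two identically seeded generators produce identical results. It uses Python's standard-library `random` module purely for illustration, not the generator that MNE-BIDS uses internally.

```python
import random
import secrets

# Draw a non-negative integer below 2**31 to use as a seed.
seed = secrets.randbits(31)
print(f"random_state={seed}")

# Two generators seeded identically produce identical draws.
draws_a = random.Random(seed).sample(range(1000), k=5)
draws_b = random.Random(seed).sample(range(1000), k=5)
print(draws_a == draws_b)
```

In practice, you would generate the seed once and then hard-code the printed value in your anonymization script. Drawing a fresh seed on every run would defeat the purpose.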

Note

Passing random_state merely guarantees that subject IDs and time shift remain the same across anonymization runs if the original dataset remained unchanged. It does not allow you to incrementally add data (e.g., a new participant) to an anonymized dataset: If the original dataset changes and you want the changes anonymized, you will need to anonymize the entire dataset again.

for i in range(2):
    print(f"\n\nRun {i + 1}\n")
    shutil.rmtree(bids_root_anon)
    anonymize_dataset(
        bids_root_in=bids_root,
        bids_root_out=bids_root_anon,
        datatypes="meg",
        random_state=293201004,
    )
    print_dir_tree(bids_root_anon)
Run 1


Anonymizing BIDS dataset
Determining "daysback" for anonymization.

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Shifting recording dates by 36293 days (99.4 years).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-1
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/4 [00:00<?,       ?it/s]
 25%|██▌       | Anonymizing : 1/4 [00:00<00:00,    3.61it/s]
 50%|█████     | Anonymizing : 2/4 [00:01<00:01,    1.98it/s]
100%|██████████| Anonymizing : 4/4 [00:01<00:00,    3.97it/s]
|MNE-sample-data-bids-anon/
|--- README
|--- dataset_description.json
|--- participants.json
|--- participants.tsv
|--- sub-1/
|------ sub-1_scans.tsv
|------ meg/
|--------- sub-1_acq-calibration_meg.dat
|--------- sub-1_acq-crosstalk_meg.fif
|--------- sub-1_coordsystem.json
|--------- sub-1_task-audiovisual_channels.tsv
|--------- sub-1_task-audiovisual_events.json
|--------- sub-1_task-audiovisual_events.tsv
|--------- sub-1_task-audiovisual_meg.fif
|--------- sub-1_task-audiovisual_meg.json
|--- sub-emptyroom/
|------ ses-19030726/
|--------- sub-emptyroom_ses-19030726_scans.tsv
|--------- meg/
|------------ sub-emptyroom_ses-19030726_task-noise_channels.tsv
|------------ sub-emptyroom_ses-19030726_task-noise_meg.fif
|------------ sub-emptyroom_ses-19030726_task-noise_meg.json


Run 2


Anonymizing BIDS dataset
Determining "daysback" for anonymization.

    Input:  /home/circleci/mne_data/MNE-sample-data-bids
    Output: /home/circleci/mne_data/MNE-sample-data-bids-anon

Shifting recording dates by 36293 days (99.4 years).
Using the following subject ID anonymization mapping:

    sub-ABC123 → sub-1
    sub-emptyroom → sub-emptyroom


  0%|          | Anonymizing : 0/4 [00:00<?,       ?it/s]
 25%|██▌       | Anonymizing : 1/4 [00:00<00:00,    3.61it/s]
 50%|█████     | Anonymizing : 2/4 [00:01<00:01,    1.97it/s]
100%|██████████| Anonymizing : 4/4 [00:01<00:00,    3.95it/s]
|MNE-sample-data-bids-anon/
|--- README
|--- dataset_description.json
|--- participants.json
|--- participants.tsv
|--- sub-1/
|------ sub-1_scans.tsv
|------ meg/
|--------- sub-1_acq-calibration_meg.dat
|--------- sub-1_acq-crosstalk_meg.fif
|--------- sub-1_coordsystem.json
|--------- sub-1_task-audiovisual_channels.tsv
|--------- sub-1_task-audiovisual_events.json
|--------- sub-1_task-audiovisual_events.tsv
|--------- sub-1_task-audiovisual_meg.fif
|--------- sub-1_task-audiovisual_meg.json
|--- sub-emptyroom/
|------ ses-19030726/
|--------- sub-emptyroom_ses-19030726_scans.tsv
|--------- meg/
|------------ sub-emptyroom_ses-19030726_task-noise_channels.tsv
|------------ sub-emptyroom_ses-19030726_task-noise_meg.fif
|------------ sub-emptyroom_ses-19030726_task-noise_meg.json
