ABCD project

From CCN Wiki
Jump to navigation Jump to search

So the main issue is that we are working with the raw DICOM files, which include every bit of data that the scanners generate. However, these files may therefore contain data from the first several seconds of a scan where the MRI field is unstable and the data are saturated with noise that decreases exponentially over the first 8 seconds or so. From experience I have seen that if these volumes are present in the data, later linear detrending actually makes the data take on a parabolic shape. The problem is compounded by the fact that data were collected from different scanners that drop variable volumes of data automatically. Correctly preprocessing each functional run requires knowledge of which scanner model was used to collect that particular dataset -- different scanners are even used for the same participant at different times!

Details about the functional run data can be found in the .json data in each func directory. JSON stands for Javascript Object Notation, and it’s a popular format for recording and passing around data that can be organized into fields and sub-fields (and sub-sub-fields, etc.). These are plaintext files that you can view in any text editor.

For example in the directory:

/sub-NDARINVJKBDRF1B/ses-2YearFollowUpYArm1/func

You will see a .json file for each of the 2 n-back runs. Open it up and you will see a whole bunch of lines, including

"Manufacturer": "GE MEDICAL SYSTEMS",
 "ManufacturerModelName": "DISCOVERY MR750",
 "Modality": "MR",
 "NumFiles": "removed",
 "NumberOfTemporalPositions": "378",
 "OSLevel": "DV26.0",

This tells me that this scan was done on a GE scanner. It also tells me that the scanner used the DV26 scanner operating system (instead of DV25), which is relevant to the number of volumes dropped by GE scanners.

Identifying Scanners

This task is easily accomplished using the linux grep command. Grep is basically a command-line version of the “find” tool in most text editors, except it works on any file in your filesystem. It has many applications, so I’ll limit this discussion to the relevant usage. If you pass it a single filename, it will open the file and look for your search string, highlighting the line it appears in:

grep “search string” my_file.txt

However, you can also have it search recursively. And you can have it also limit its search to particular types of files (e.g., limited by file extension). So I went to the directory directory containing the unzipped ABCD dataset so that I could search recursively through the subdirectories contained therein, and tell it to inspect only the .json files. I was searching for a field called SeriesDescription because that was one of the fields that I found had the scanner name in it:

grep -r --include="*.json" "SeriesDescription" ./

When I did this, I got information from ALL the .json files, including the anatomical files, which was more information that I strictly required. It was totally manageable, but I could use grep to be even more specific by chaining grep together with the pipe function. Most linux commands that work on contents of files can also operate on the output from other linux command by using the pipe (“ | ”, which is shift-backslash, not an upper-case “i") command (the “pipe” metaphor comes from plumbing, where you use pipes to allow water to flow from one place to another). The above example found all the scanner-model mentions. We can pipe this output to a second application of grep to filter specifically for the nBack task data:

grep -r --include="*.json" "SeriesDescription" ./ | grep "nBack"

Among the files I looked at, it turned out that nearly every functional run in the gooddata folder came from SIEMENS scanners. The single exception is for subject sub-NDARINVJKBDRF1B, whose 2YearFollowUp data was acquired on a GE scanner (DV26). Of course, this may change when/if new participants are added to our downloaded dataset.

Application of Scanner ID

How do we use this information when working with our data? The recommendation is that the first 16 volumes (12.8 seconds) of each time series are dropped. However, different scanners handle the initial few seconds of data differently. As a result, the number of initial volumes that we exclude from our reconstructed NII files depends on the scanner used:

  • Siemens:8
  • Philips: 8
  • GE DV25: 5
  • GE DV26: 16

This also impacts the calculation of the event timings from the E-Prime data associated with each data set.

Dropping Volumes from BOLD Time Series

This is fortunately documented here on the wiki. You can use fslhd to inspect the header of an f.nii field to determine the number of volumes present, and ascertain whether any must be removed (e.g., depending on whether or not someone else has already dropped the number of volumes appropriate to the scanner type). The corrected files should have 362 volumes. Let's check out one of the functional runs:

fslhd /home/ABCD/gooddata/NII/NDARINVJKBDRF1B/func/001/f.nii

I see that the dim4 field (the time dimension) is 370 volumes. I knew that these data came from a Siemens scanner, and so I was expecting 8 extra volumes. 370-362=8, and so this meets my expectations for a 4-D BOLD time series that has not had the initial 8 volumes dropped.

Now check out the first run of the 2-year follow-up, which I know to hav come from a GE scanner running the DV26 operating system:

fslhd /home/ABCD/gooddata/NII/NDARINVJKBDRF1B/func/003/f.nii

The dim4 field is 378 volumes, 16 more than the 362 in a corrected file as would be expected for GE DV26 data.

Elsewhereon the wiki, I describe in more detail the method for dropping volumes from NII files. In a nutshell, it entails splitting up the 370 (or 378) volume 4-D f.nii file into a series of 370 (or 378) 3-D files (called f_0000.nii to f_0369.nii) and then deleting the files corresponding to the volumes you want to toss out. Then you pack them back up into a single 4D file again.

It may be possible to drop the first n volumes during the initial conversion from dicom to NII, but I’ve never looked into that. If that's possible and I figure it out, I will document that procedure as well and reference it here. Until then, before the f.nii files are preprocessed, we’ll need to ensure that our f.nii files have dim4 = 362.

Parsing E-Prime Data

The other place that the number of dropped volumes is relevant is when determining the event timings from the EventRelatedInformation.txt files found in each unzipped archive of functional data. These files are a disaster to try to interpret, partly because there are scores of columns, and partly because the timing starts well before the functional data are acquired. On top of that, the timings need to be adjusted for the number of volumes dropped from the data, which as we just went through, varies by scanner. Also, the data for multiple functional runs are recorded in a single file (note that this single file is duplicated, so that there is one such file for each of the two functional runs, but these two differently-named files are IDENTICAL). Fabulous. Fortunately, people involved with the ABCD project have a GitHub code respository for extracting timings from these files. You'll need to download the zip file from the site and extract the files and add them to your MATLAB path. Once available to MATLAB, you can run MATLAB and navigate to the directory containing your target E-Prime data and extract event timings thus:

>> fn=dir('*.txt'); %gets the name of all .txt files 
>> %I had previously deleted the redundant EventRelatedInformation.txt file so that there was only 1 such file
>> %This lets me refer to the fn.name field when I make the function call on the next line, saving me from typing out a long filename
>> [nruns,err]=abcd_extract_eprime_nback(fn.name, 'nskipTRs', 8, 'outstem', 'nback')

In the above example, my data came from a Siemens scanner. Before I preprocess the data, I'll need to drop the first 8 volumes of the data, and when I extract the timing events, I will indicate to skip the first 8 volumes using the nskipTRs, 8 parameters.

COVID-19 Method

If you're doing this remotely via SSH, you'll probably want to be able to run the script from the terminal window. Note that MATLAB can be run quite fine from the shell without the usual GUI:

matlab -nodesktop

You can run any script in your matlab PATH from the commandline. So if you've moved the abcd_extract_eprime code to your matlab path (usually ~/Documents/MATLAB or something similar), then you can run them from the BASH terminal:

matlab -nodisplay -nodesktop -r "abcd_extract_eprime_nback('ABCD-nBack-fMRI_run-20190801141911-EventRelatedInformation.txt', 'nskipTRs', 8, 'outstem', 'nback');exit"

I tacked on that last call to the exit command so that you quit matlab once the script executes. Otherwise you'll remain in the MATLAB terminal until you call the exit command to return to the BASH shell.

But Wait There's More!=

The MATLAB script solves many problems, but leaves an outstanding issue: It produces several different timing files, but none of them can be directly be used in our FreeSurfer pipeline. Additional scripting will be required to aggregate the multiple files and generate a usable PAR-file for FreeSurfer. Unfortunately, that's not begun yet.