ABCD project

From CCN Wiki
Jump to navigation Jump to search

So the main issue is that we are working with the raw DICOM files, which include every bit of data that the scanners generate. However, these files may therefore contain data from the first several seconds of a scan where the MRI field is unstable and the data are saturated with noise that decreases exponentially over the first 8 seconds or so. From experience I have seen that if these volumes are present in the data, later linear detrending actually makes the data take on a parabolic shape. The problem is compounded by the fact that data were collected from different scanners that drop variable volumes of data automatically. Correctly preprocessing each functional run requires knowledge of which scanner model was used to collect that particular dataset -- different scanners are even used for the same participant at different times!

Details about the functional run data can be found in the .json data in each func directory. JSON stands for Javascript Object Notation, and it’s a popular format for recording and passing around data that can be organized into fields and sub-fields (and sub-sub-fields, etc.). These are plaintext files that you can view in any text editor.

For example in the directory:

/sub-NDARINVJKBDRF1B/ses-2YearFollowUpYArm1/func

You will see a .json file for each of the 2 n-back runs. Open it up and you will see a whole bunch of lines, including

"Manufacturer": "GE MEDICAL SYSTEMS",
 "ManufacturerModelName": "DISCOVERY MR750",
 "Modality": "MR",
 "NumFiles": "removed",
 "NumberOfTemporalPositions": "378",
 "OSLevel": "DV26.0",

This tells me that this scan was done on a GE scanner. It also tells me that the scanner used the DV26 scanner operating system (instead of DV25), which is relevant to the number of volumes dropped by GE scanners.

Organization and Reconstruction of DICOM files

The type of scanner used is irrelevant to the unpacking of the zipped up dicom files that we download from the ABCD study, but it is relevant, as described above, for producing sets of .nii files for the functional runs that are consistent across participants. I have modified some existing scripts to separately manage the unpacking and organization of the DICOM data, and the conversion of a participant's DICOM files into .nii files with the appropriate number of volumes dropped. These files can be found in /ubfs/caset/cpmcnorg/Scripts/Shell.

abcd_unpack.sh

This script handles the collection of all related files associated with a particular subject. Your project directory may contain .tgz archives from multiple subjects. The abcd_unpack.sh script takes a subject name as a parameter and then unzips all the anatomical and n-back archives for that participant. All other data are moved to an archive subdirectory. Note that the name of the unpacked subject gets a sub_ prefix prepended to it.

abcd_unpack.sh Example Usage

./abcd_unpack.sh NDARINV4JVM9WML
#after you do this, you will have a folder called sub_NDARINV4JVM9WML
#all the .tgz files associated with that subject will have been moved into sub_NDARINV4JVM9WML/archive

abcd_dicomtonii.sh

This script handles the conversion of dicom files to .nii files, first using afni's dicom_hdr tool to output the dicom header information to a text file that can be used to identify the scanner model and software version. The model and software version are then used to determine the number of initial volumes to omit from the reconstructed file. The output directory is ./NII/${SUBJECT}. The script will convert the Baseline T1 and T2 anatomical files. The anatomical files will be found in ./NII/${SUBJECT}/anat/T1.nii (and T2.nii). The script will convert the Baseline and 2nd Year Followup files, dropping 5,8 or 16 volumes as appropriate to the scanner used to acquire the data. The functional files will be found in ./NII/${SUBJECT}/func/00[1-4]/f.nii.

AFNI Dependencies

Important: These scripts require that several afni tools are in your $PATH. To check whether this is the case, from the command prompt, type:

which dicom_hdr

You should see the following printed to the screen on the next line:

/usr/lib/afni/bin/dicom_hdr

If you don't see this, then AFNI is not on your path. To add things to your $PATH variable, you can edit your .bashrc file:

nano ~/.bashrc

and add the following to the very bottom of the file:

#AFNI
PATH=/usr/lib/afni/bin:$PATH

Save (ctrl-o) and exit (ctrl-x). Now your .bashrc file has been updated to include afni in the path. However you'll need to force Linux to re-run your .bashrc startup script:

source ~/.bashrc

Verify that dicom_hdr is now visible on your path:

which dicom_hdr

This time around, you should see the path to this command printed to the screen. The next time you log in, you won't have to do this again, because the .bashrc script is run every time you log in, so this PATH update will occur for every subsequent login.

abcd_dicomtonii.sh Example Usage

./abcd_dicomtonii.sh NDARINV4JVM9WML

After running this, you should find directories numbered 001 to 004 in ./NII/NDARINV4JVM9WML/func. In each of these subdirectories will be 2 files: f.nii, whcih is the converted BOLD data, and dcmhdr.tmp, which contains information from the dicom header. I decided it was worth retaining this information and storing it with the .nii file because the dicom header contains quite a bit of information that is not included in the .nii file header. Having the extra information available might make it easier at a later point. You can inspect the f.nii file header at any time to ensure that it meets your expectations; most notably that it should have exactly 362 volumes:

fslhd NII/NDARINV4JVM9WML/func/001/f.nii

This generates screen output that includes the following lines:

dim0		4
dim1		90
dim2		90
dim3		60
dim4		362

We see here that the file has 362 frames in the time dimension, which is what we were expecting. Presently, the script is also writing the scanner model/software and the session and run number to the descrip field, so we also see later on:

descrip          SIEMENS:SESS1:RUN1

This may change because we have only 80 characters to work with, but having the scanner model and OS written in the header will be helpful; usually this field just contains useless information (or is empty).

Identifying Scanners

If you were working with data for which the scanner type is not known (e.g., .nii files generated from dicom files prior to April 13), you may need to look for information about the scanner in various supplemental files that can be found in the unpacked files. This task is easily accomplished using the linux grep command. Grep is basically a command-line version of the “find” tool in most text editors, except it works on any file in your filesystem. It has many applications, so I’ll limit this discussion to the relevant usage. If you pass it a single filename, it will open the file and look for your search string, highlighting the line it appears in:

grep “search string” my_file.txt

However, you can also have it search recursively. And you can have it also limit its search to particular types of files (e.g., limited by file extension). So I went to the directory directory containing the unzipped ABCD dataset so that I could search recursively through the subdirectories contained therein, and tell it to inspect only the .json files. I was searching for a field called SeriesDescription because that was one of the fields that I found had the scanner name in it:

grep -r --include="*.json" "SeriesDescription" ./

When I did this, I got information from ALL the .json files, including the anatomical files, which was more information that I strictly required. It was totally manageable, but I could use grep to be even more specific by chaining grep together with the pipe function. Most linux commands that work on contents of files can also operate on the output from other linux command by using the pipe (“ | ”, which is shift-backslash, not an upper-case “i") command (the “pipe” metaphor comes from plumbing, where you use pipes to allow water to flow from one place to another). The above example found all the scanner-model mentions. We can pipe this output to a second application of grep to filter specifically for the nBack task data:

grep -r --include="*.json" "SeriesDescription" ./ | grep "nBack"

Among the files I looked at, it turned out that nearly every functional run in the gooddata folder came from SIEMENS scanners. The single exception is for subject sub-NDARINVJKBDRF1B, whose 2YearFollowUp data was acquired on a GE scanner (DV26). Of course, this may change when/if new participants are added to our downloaded dataset.

Application of Scanner ID

How do we use this information when working with our data? The recommendation is that the first 16 volumes (12.8 seconds) of each time series are dropped. However, different scanners handle the initial few seconds of data differently. As a result, the number of initial volumes that we exclude from our reconstructed NII files depends on the scanner used:

  • Siemens:8
  • Philips: 8
  • GE DV25: 5
  • GE DV26: 16

This also impacts the calculation of the event timings from the E-Prime data associated with each data set.

Dropping Volumes from BOLD Time Series

If you are working with .nii files generated prior to April 13, it is possible that the data still contain initial volumes that need to be dropped from the time series. This is fortunately documented here on the wiki. You can use fslhd to inspect the header of an f.nii field to determine the number of volumes present, and ascertain whether any must be removed (e.g., depending on whether or not someone else has already dropped the number of volumes appropriate to the scanner type). The corrected files should have 362 volumes. If you see any number other than 362, you'll have to drop the initial few volumes.

Let's check out one of the functional runs for a participant that may or may not have been converted using the fancy abcd_dicomtonii.sh script:

fslhd /home/ABCD/gooddata/NII/NDARINVJKBDRF1B/func/001/f.nii

I see that the dim4 field (the time dimension) is 370 volumes. Having used grep earlier, I knew that these data came from a Siemens scanner, and so I was expecting 8 extra volumes. 370-362=8, and so this meets my expectations for a 4D BOLD time series that has not had the initial 8 volumes dropped.

Now check out the first run of the 2-year follow-up, which I know to hav come from a GE scanner running the DV26 operating system:

fslhd /home/ABCD/gooddata/NII/NDARINVJKBDRF1B/func/003/f.nii

The dim4 field is 378 volumes, 16 more than the 362 in a corrected file as would be expected for GE DV26 data.

Elsewhere on the wiki, I describe in more detail the method for dropping volumes from NII files. In a nutshell, it entails splitting up the 370 (or 378) volume 4-D f.nii file into a series of 370 (or 378) 3-D files (called f_0000.nii to f_0369.nii) and then deleting the files corresponding to the volumes you want to toss out. Then you pack them back up into a single 4D file again.

As indicated at the top of this wiki page, the new abcd_dicomtonii.sh script automagically handles the dropping of an appropriate number of initial volumes, so this will not be an issue for data that you know to have been converted by that script.

Protip: An easy way to tell whether the abcd_dicomtonii.sh script was used to generate the file is to look to see whether the run directory contains both the f.nii file and the dcmhdr.tmp file. If both files are there, then the magic script was used, because that's the only tool we use that generates that specific file.

Parsing E-Prime Data

The other place that the number of dropped volumes is relevant is when determining the event timings from the EventRelatedInformation.txt files found in each unzipped archive of functional data. These files are a disaster to try to interpret, partly because there are scores of columns, and partly because the timing starts well before the functional data are acquired. On top of that, the timings need to be adjusted for the number of volumes dropped from the data, which as we just went through, varies by scanner. Also, the data for multiple functional runs are recorded in a single file (note that this single file is duplicated, so that there is one such file for each of the two functional runs, but these two differently-named files are IDENTICAL). Fabulous. Fortunately, people involved with the ABCD project have a GitHub code respository for extracting timings from these files. You'll need to download the zip file from the site and extract the files and add them to your MATLAB path. Once available to MATLAB, you can run MATLAB and navigate to the directory containing your target E-Prime data and extract event timings thus:

>> fn=dir('*.txt'); %gets the name of all .txt files 
>> %I had previously deleted the redundant EventRelatedInformation.txt file so that there was only 1 such file
>> %This lets me refer to the fn.name field when I make the function call on the next line, saving me from typing out a long filename
>> [nruns,err]=abcd_extract_eprime_nback(fn.name, 'nskipTRs', 8, 'outstem', 'nback')

In the above example, my data came from a Siemens scanner. Before I preprocess the data, I'll need to drop the first 8 volumes of the data, and when I extract the timing events, I will indicate to skip the first 8 volumes using the nskipTRs, 8 parameters.

COVID-19 Method

If you're doing this remotely via SSH, you'll probably want to be able to run the script from the terminal window. Note that MATLAB can be run quite fine from the shell without the usual GUI:

matlab -nodesktop

You can run any script in your matlab PATH from the commandline. So if you've moved the abcd_extract_eprime code to your matlab path (usually ~/Documents/MATLAB or something similar), then you can run them from the BASH terminal:

matlab -nodisplay -nodesktop -r "abcd_extract_eprime_nback('ABCD-nBack-fMRI_run-20190801141911-EventRelatedInformation.txt', 'nskipTRs', 8, 'outstem', 'nback');exit"

I tacked on that last call to the exit command so that you quit matlab once the script executes. Otherwise you'll remain in the MATLAB terminal until you call the exit command to return to the BASH shell.

But Wait There's More!

The MATLAB script solves many problems, but leaves an outstanding issue: It produces several different timing files, but none of them can be directly be used in our FreeSurfer pipeline. Additional scripting will be required to aggregate the multiple files and generate a usable PAR-file for FreeSurfer. Unfortunately, that's not begun yet.