BASH Tricks

From CCN Wiki

Text File Batch

Suppose we have a list of items to process -- like all the entries in the subjects file, for example. We want to use each line in the file in a loop:

#IFS= and -r keep leading spaces and backslashes in each line intact
while IFS= read -r s; do
  echo "$s"
done <subjects
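
A slightly more defensive variant of the loop above, as a rough sketch: it skips blank lines and still processes a final line that lacks a trailing newline. The temporary directory and the three subject names are made up for the demonstration.

```shell
# Demo data: a hypothetical subjects file whose last line has no trailing newline
tmp=$(mktemp -d)
printf 'FS_sub-001\n\nFS_sub-002\nFS_sub-003' > "$tmp/subjects"

count=0
# "|| [ -n "$s" ]" lets the loop handle a last line that is missing its newline
while IFS= read -r s || [ -n "$s" ]; do
  [ -z "$s" ] && continue        # skip blank lines
  echo "processing $s"
  count=$((count + 1))
done < "$tmp/subjects"
echo "$count subjects processed"
```

In real use you would replace the echo with whatever command should run once per subject.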

How many lines in my text file?

Totally useful when you have some kind of training file with many rows and columns:

FILENAME=myfile.csv
#wc -l reports the number of lines (nl would print a number for every line, and skips blank lines by default)
wc -l < "${FILENAME}"
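
For a file with rows and columns, it may also be handy to count the columns at the same time. The sketch below assumes a simple comma-separated file with no quoted commas; the file contents are invented for the demonstration.

```shell
# Demo data: a tiny hypothetical CSV with a header and two rows
tmp=$(mktemp -d)
FILENAME="$tmp/myfile.csv"
printf 'id,age,score\n1,20,0.5\n2,31,0.7\n' > "$FILENAME"

# number of lines (tr strips the padding that some wc versions add)
nlines=$(wc -l < "$FILENAME" | tr -d ' ')
# number of comma-separated fields on the first line
ncols=$(head -n 1 "$FILENAME" | awk -F, '{ print NF }')
echo "$nlines lines, $ncols columns"
```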

I want to drop the first line of my text file

tail echoes the last n lines (default: 10) of a text file to stdout. Giving the -n flag a number prefixed with a plus sign flips it around so that it echoes everything from line n onward. So -n +2 will echo back the file starting at the 2nd line (i.e., dropping the first line). We redirect this to a temp file (redirecting straight back onto the original would empty it before tail gets to read it), and then rename:

tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"

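PROTIP: The same trick applies using the head command to drop the last n lines from a file. GNU head accepts a negative count, so head -n -2 prints everything except the last 2 lines (the head that ships with macOS may not accept negative counts).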
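
As a small illustration of trimming both ends of a file, here is a sketch on a throwaway file. It uses sed '$d' to drop the last line because that form should work in any POSIX sed; the file name and contents are made up.

```shell
# Demo data: a hypothetical four-line file
tmp=$(mktemp -d)
FILE="$tmp/data.txt"
printf 'header\nrow1\nrow2\nfooter\n' > "$FILE"

# drop the first line (same command as above)
tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"

# drop the last line; with GNU head, "head -n -1" would do the same job
sed '$d' "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"

result=$(cat "$FILE")
echo "$result"
```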

Make a list of directory names

We often organize subject data so that each subject gets their own directory. Freesurfer uses a subjects file when batch processing. Rather than manually type out each folder name into a text file, it can be generated in one line of code:

ls -1 -d */ | sed "s,/$,," >  subjects

This lists all the directories in 1 column (-1 -d) and uses sed to snip off the trailing forward slash from each directory name.
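
If you would rather not parse the output of ls, a plain glob loop can produce the same list. This is only a sketch of an alternative; the directory names below are invented for the demonstration.

```shell
# Demo data: two hypothetical subject directories plus a stray file
tmp=$(mktemp -d)
mkdir "$tmp/FS_sub-001" "$tmp/FS_sub-002"
touch "$tmp/notes.txt"
cd "$tmp" || exit 1

# */ matches directories only; ${d%/} strips the trailing slash
for d in */; do
  [ -d "$d" ] || continue      # guards against the no-match case
  printf '%s\n' "${d%/}"
done > subjects

cat subjects
```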

What directories are in one directory but not the other?

Scenario: We had a directory, let's call it ALLSUBJECTS, that had a bunch of subject directories named NDARINVxxxxxxxx. Some of them had a full dataset, but many of them did not. Sophia made a directory called gooddata that contained only the subset of folders that had full datasets. What's the fastest way to figure out who has incomplete data? Look for folders appearing in ALLSUBJECTS that don't appear in gooddata.

cd ALLSUBJECTS
#next command lists only directories (-d), in a single column (-1), keeps only
#folder names starting with "ND" ( grep "^ND"), sorts the list (sort), and then uses
#sed to strip the trailing forward slash
#(> rather than >> so that rerunning the commands doesn't pile up duplicate entries)
ls -d -1  */ | grep "^ND" | sort | sed 's/\///g' > ../allsubs.txt
cd ../gooddata
ls -d -1  */ | grep "^ND" | sort | sed 's/\///g' > ../goodsubs.txt
#both lists were written to the parent directory, so go there before comparing
cd ..
#next line finds lines appearing in allsubs that do not appear in goodsubs:
comm -23 allsubs.txt goodsubs.txt
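
To see what comm is doing, the whole comparison can be tried on a throwaway directory tree. The sketch below uses three made-up subject IDs, one of which is missing from gooddata; the helper function name list_subs is hypothetical.

```shell
# Demo data: three hypothetical subjects, only two of them in gooddata
tmp=$(mktemp -d)
mkdir -p "$tmp/ALLSUBJECTS/NDARINV0001" "$tmp/ALLSUBJECTS/NDARINV0002" "$tmp/ALLSUBJECTS/NDARINV0003"
mkdir -p "$tmp/gooddata/NDARINV0001" "$tmp/gooddata/NDARINV0003"

# list the ND* directory names inside the directory given as $1, sorted
list_subs() {
  ( cd "$1" && ls -d -1 */ | grep "^ND" | sed 's,/$,,' | sort )
}

list_subs "$tmp/ALLSUBJECTS" > "$tmp/allsubs.txt"
list_subs "$tmp/gooddata"    > "$tmp/goodsubs.txt"

# -23 suppresses columns 2 and 3, leaving lines unique to the first file
incomplete=$(comm -23 "$tmp/allsubs.txt" "$tmp/goodsubs.txt")
# -12 suppresses columns 1 and 2, leaving lines common to both files
complete=$(comm -12 "$tmp/allsubs.txt" "$tmp/goodsubs.txt")
echo "incomplete: $incomplete"
```

comm expects both inputs to be sorted the same way, which is why the sort step matters.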

Making a tar archive containing only the minimal set of structural files for FreeSurfer

FreeSurfer makes a zillion files during the recon-all step. I have no idea what most of them are there for. Which do we need? The absolute minimal list is a work in progress, but I have made a file called fsaverage.required (copied to /ubfs/caset/cpmcnorg/Scripts/fsaverage.required) based on the contents of the fsaverage template subject directory contents. I dropped the obvious directories (e.g., mri 2mm), so what's left should hopefully be close to the minimal required set for getting things done with a subject. The idea is to reduce the number of superfluous files that you store or copy over the network so that we don't waste as much time and disk space with useless nonsense.

So here's what you do (note that parts of the commands below, such as the subject names, will vary - don't just blindly copy the code snippets and expect them to work; that's how Chernobyl happens! You need to understand what you're doing):

  1. Copy fsaverage.required to $SUBJECTS_DIR
  2. Inspect fsaverage.required to make sure that it has any idiosyncratic files that you might wish to include
    • e.g., the original version only includes f.nii.gz files. If you want to also grab all your preprocessed .mgz files, then you'll want to include *.mgz up at the top. Save any changes.
  3. Navigate to a subject directory: cd FS_sub-001
  4. The following command will use the files listed in $SUBJECTS_DIR/fsaverage.required to find and archive the desired files for this subject:
    • tar -czvf sub-001.minimal.tgz `find . | grep -G -f ${SUBJECTS_DIR}/fsaverage.required`
  5. When you're done, you'll have the bare-bones minimum files to permit FS-FAST analyses of your BOLD data for your subject.
  6. You can copy the .tgz files to an external drive or over the network. Be sure to unpack the .tgz archive in an empty subject directory
    • e.g.:
mkdir ~/new_project/   #starting a new project directory - in this case on the same computer, but it could be anywhere
cd ~/new_project       #enter the new project directory
mkdir FS_sub-001       #making an empty subject directory for the files we're about to unpack
cd FS_sub-001            #navigate into the new empty subject directory
#next line copies the minimal file archive from the source directory into the new empty subject directory
cp ${SUBJECTS_DIR}/FS_sub-001/sub-001.minimal.tgz ./
#next line unzips the file archive into the empty directory
tar -xzvf sub-001.minimal.tgz


With the minimal set of structural files, you should be able to unzip the surface and T1 anatomical files and inspect for reconstruction accuracy, or add BOLD files from elsewhere (the BOLD files are what really does you in, and I've developed a similar procedure to grab your blob analysis files)
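
Before trusting an archive, it can be worth previewing which files the pattern list selects and then listing what actually went into the .tgz. The following is a rough sketch on a fake subject directory; the pattern file required.demo and the file names are stand-ins, not the real contents of fsaverage.required.

```shell
# Demo data: a fake subject directory with three files (names are hypothetical)
tmp=$(mktemp -d)
mkdir -p "$tmp/FS_sub-001/surf" "$tmp/FS_sub-001/mri" "$tmp/FS_sub-001/scripts"
touch "$tmp/FS_sub-001/surf/lh.white" "$tmp/FS_sub-001/mri/T1.mgz" "$tmp/FS_sub-001/scripts/recon-all.log"

# a stand-in pattern list: one regular expression per line
printf 'lh\\.white$\nT1\\.mgz$\n' > "$tmp/required.demo"

cd "$tmp/FS_sub-001" || exit 1
# preview: which files match the patterns?
find . -type f | grep -f "$tmp/required.demo" | sort > "$tmp/filelist.txt"
cat "$tmp/filelist.txt"

# archive exactly the previewed files (-T reads the file list from a file)
tar -czf "$tmp/sub-001.minimal.tgz" -T "$tmp/filelist.txt"

# -t lists the archive contents without unpacking anything
nfiles=$(tar -tzf "$tmp/sub-001.minimal.tgz" | wc -l | tr -d ' ')
echo "$nfiles files archived"
```

Writing the list to a file first also sidesteps problems with file names containing spaces, which the backtick form cannot handle.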

Making a tar archive containing only the minimal set of GLM Analysis Files for FreeSurfer

After running a first level GLM analysis (a "blob" analysis) using selxavg3-sess, each of your subject/bold directories will contain an analysis directory for each of the surfaces you included in your analysis (typically for lh and rh, and possibly also for mni305). Assuming the analyses were done in fsaverage template space (and there's no good reason anymore why they wouldn't be), you can download the bare minimum set of files required to inspect the subject-level analyses with the following script:

#!/bin/bash
#usage: ./zip1stla.sh SUBJECT_ID ANALYSIS_DIR_1 [ANALYSIS_DIR_2 ... etc]
#this first step enforces that this only works when SUBJECTS_DIR is set (the script aborts otherwise)
cd "${SUBJECTS_DIR:?SUBJECTS_DIR is not set}" || exit 1
#first param is subject id (required)
SUB=${1:?usage: zip1stla.sh SUBJECT_ID ANALYSIS_DIR_1 [ANALYSIS_DIR_2 ...]}
shift
#remaining params are analysis directories, stored as an array
DIRS=("$@")
#we're going to clone the analysis directory structure inside the subject's bold directory
cd "${SUB}/bold" || exit 1
mkdir -p "${SUB}/bold"

#iterate through analysis directories
for DIR in "${DIRS[@]}";
do
  cp -r "${DIR}" "${SUB}/bold/"
done
#zip up our cloned directory structure 
tar -czvf "${SUBJECTS_DIR}/${SUB}.1stla.tgz" "${SUB}"
#delete the clone
rm -rf "${SUB}"
#go back to where we started
cd "${SUBJECTS_DIR}"


If you were to copy/paste the above script to a file named zip1stla.sh and make it executable (chmod ug+x zip1stla.sh) then you would run it this way:

#suppose my analysis directories are called FAM.sm6.lh, FAM.sm6.rh and FAM.sm6.mni
#(the leading ./ is needed unless the script lives in a directory on your PATH)
./zip1stla.sh FS_SUB01 FAM.sm6.lh FAM.sm6.rh FAM.sm6.mni

This will create a file called FS_SUB01.1stla.tgz. When you unzip the file, it will create a subject folder with the following structure:

  • FS_SUB01
    • bold
      • FAM.sm6.lh
        • {some files}
      • FAM.sm6.rh
        • {some files}
      • FAM.sm6.mni
        • {some files}

No other files will be included in the archive, which keeps the archive size to a minimum. If FS_SUB01 already exists, then the contents of this archive will be added to the existing directory. This can be useful if you previously used the method described above to archive a minimal set of FreeSurfer structural files. Note that the FreeSurfer structural files are not needed to view the first level GLM data if you ran the analysis in fsaverage space, because these data are mapped to the fsaverage template, which you will already have on your local machine if you have FreeSurfer installed.
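
The merge behaviour described above can be checked on dummy data. In this sketch the archive is unpacked with tar's -C option into a directory that already holds a subject folder; all file names (beta.nii.gz, lh.white) and paths are placeholders for the demonstration.

```shell
# Demo data: a fake analysis archive and a fake local subject directory
tmp=$(mktemp -d)
mkdir -p "$tmp/src/FS_SUB01/bold/FAM.sm6.lh"
touch "$tmp/src/FS_SUB01/bold/FAM.sm6.lh/beta.nii.gz"
tar -czf "$tmp/FS_SUB01.1stla.tgz" -C "$tmp/src" FS_SUB01

# the local copy already contains structural files for the same subject
mkdir -p "$tmp/local/FS_SUB01/surf"
touch "$tmp/local/FS_SUB01/surf/lh.white"

# -C unpacks relative to the given directory; existing files are left in place
tar -xzf "$tmp/FS_SUB01.1stla.tgz" -C "$tmp/local"

find "$tmp/local" -type f | sort
```

After unpacking, the subject folder holds both the earlier structural file and the newly added analysis directory.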

Make a series of numbered directories

FreeSurfer BOLD data goes in a series of directories, numbered 001, 002, ... , 0nn. A one-liner of code to create these directories in the command line:
for i in $(seq -f "%03g" 1 6); do mkdir ${i}; done
#this will create directories 001 to 006. Obviously, if you need more directories, change the second value from 6 to something else

Protip: If you want to also make the runs file that some of our scripts use at the same time, the above snippet can be modified:

 for i in $(seq -f "%03g" 1 6); do mkdir ${i}; echo ${i} >> runs; done
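
One thing to watch with the snippet above: because it appends with >>, running it twice leaves duplicate entries in the runs file. A possible variant that can be rerun safely is sketched below; the variable name nruns is just an illustrative choice.

```shell
# Work in a scratch directory for the demonstration
tmp=$(mktemp -d)
cd "$tmp" || exit 1

nruns=6
: > runs                      # start with an empty runs file on every invocation
i=1
while [ "$i" -le "$nruns" ]; do
  d=$(printf '%03d' "$i")     # zero-pad to three digits: 001, 002, ...
  mkdir -p "$d"               # -p: no error if the directory already exists
  echo "$d" >> runs
  i=$((i + 1))
done

cat runs
```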

Save list of numbered directories to file

Another protip: If you already had a set of numbered directories and want to save them to a list (e.g., a "runs" file):

while IFS= read -r s; do
  #keep only the entries whose names start with 0 (001, 002, ...)
  ls "$SUBJECTS_DIR/$s/bold" | grep "^0" > "$SUBJECTS_DIR/$s/bold/runs"
done <subjects

Restart Window Manager

This has happened a couple of times before: you step away from the computer for a while (maybe even overnight) and when you come back, you find it is locked up and completely unresponsive. The nuclear option is to reboot the whole machine:

sudo shutdown -r now #Sad for anyone running autorecon or a neural network

Unfortunately, that will stop anything that might be running in the background. A less severe solution might be to just restart the window manager (strictly speaking, lightdm is the display manager that starts the graphical session). To do this you will need to ssh into the locked-up computer from a different computer, and then restart the lightdm process. This will require superuser privileges.

ssh hostname

Then after you have connected to the frozen computer:

sudo restart lightdm   #Upstart syntax (older Ubuntu); on systemd-based systems use: sudo systemctl restart lightdm

Any processes that were dependent on the window manager will be terminated (e.g., if you had been in the middle of editing labels in tksurfer, you will find that tksurfer has been shut down and you will need to start over). However, anything that was running in the background (e.g., autorecon) should be unaffected.

Renaming Multiple Files

Rename Using rename

A perl command called rename might be available on your *nix system:

rename [OPTIONS] perlexpr files

Among the useful options is the -n flag, which just reports what all the file renames would be, but doesn't actually execute them. A handy application of rename is to hide files and/or directories. Files with names beginning with a dot are hidden by default and don't show up in directory listings. This can be a handy way of excluding chunks of data from your scripts.

Use-Case: Hiding Session 2 Data

In our Multisensory Imagery experiment, we collect 6 runs at time points 1 and 2. If we wish to be able to analyze all the data, these would be stored together as runs 001 to 012. Suppose we wish to temporarily hide the second time point data:

rename -n 's/01/\._01/' `find ./ -type d -name "01*"`

This would find all the directories ("-type d") named 01*, then it would show you how it would rename them. Look the proposed names over carefully: the substitution acts on the first "01" anywhere in each path, so a parent folder such as sub-001 would be matched before the run directory. If everything looked right, you would execute the same command again, but omit the -n flag so that the renaming actually takes place. Note that this example only gets the 010, 011 and 012 directories. You would do something similar for directories 00[7-9].

Use-Case: Unhiding Directories

This one is easier, since all the hidden directories start with "._" using the approach described above:

rename 's/\._//' `find ./ -type d -name "._0*"`

In case you're curious about the syntax of the perl expression, you might want to read up a bit about regular expressions, but in this case, 's/\._//' indicates we are doing a substitution that will replace the first instance of ._ in each name with an empty string (//); adding a g after the final slash would replace every instance. The extra backslash in front of the period is an escape character, which is needed because otherwise the dot (period) will be interpreted as a special character that matches any character.

Rename Using mv

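If you don't have access to the rename command (macOS), you can fake it:

PREFIX=LO
#${file%/*} is the file's directory and ${file##*/} is its name, so files in subdirectories get renamed in place
#(like the original, this loop assumes there are no spaces in the file names)
for file in `find . -name "*.txt"`; do mv "${file}" "${file%/*}/${PREFIX}_${file##*/}"; done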

Source: [1]
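
The same find-plus-mv idea can also stand in for the directory-hiding use case, with a home-made dry-run switch. The sketch below only ever changes the last path component, so a "01" elsewhere in the path is left alone. The DRYRUN variable and the directory layout are assumptions made for the demonstration.

```shell
# Demo data: a hypothetical subject with runs 009, 010 and 011
tmp=$(mktemp -d)
mkdir -p "$tmp/FS_sub-001/bold/009" "$tmp/FS_sub-001/bold/010" "$tmp/FS_sub-001/bold/011"

DRYRUN=0    # set to 1 to only print what would happen

# -depth visits children before parents, which is safer when renaming directories
find "$tmp" -depth -type d -name "01*" | while IFS= read -r d; do
  parent=${d%/*}      # everything before the last slash
  base=${d##*/}       # the directory's own name
  if [ "$DRYRUN" -eq 1 ]; then
    echo "would rename $d -> $parent/._$base"
  else
    mv "$d" "$parent/._$base"
  fi
done

ls -a "$tmp/FS_sub-001/bold"
```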

Related Trick: Collecting and Renaming Multiple Files in Subdirectories

Use case: I ran a bunch of model simulations. Each batch of simulations produced a series of 8 Keras files named model_0x.h5, and stored in directories named batch_##/. 10 batches of simulations produced 80 model files, except that they all had the same names. I wanted to run some tests on the complete set, so I needed to aggregate all the files in a single directory, but rename them from 01 to 80:

for run in $(seq 1 10) 
do
       r=$(printf "%02d" "$run")
       echo "Gathering run $r files"
       for m in {1..8}
       do
               basemodel=$(printf "%02d" "$m")
               blockstart=$(( (run-1)*8 ))
               #zero-pad the new number so the files run from model_01 to model_80
               newmodel=$(printf "%02d" $(( blockstart+m )))
               cp "batch_$r/model_$basemodel.h5" "./model_$newmodel.h5"
       done
done

sed Tricks

Replacing Text in Multiple Files

sed -i 's/oldtext/newtext/g' *.ext
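
Since -i overwrites the files in place, it may be worth previewing which files contain the text and keeping backups. The sketch below attaches a suffix directly to -i (here .bak), a form that both GNU and BSD sed accept; the file names and contents are invented.

```shell
# Demo data: two hypothetical files containing the text to replace
tmp=$(mktemp -d)
printf 'oldtext here\n' > "$tmp/a.ext"
printf 'more oldtext\n' > "$tmp/b.ext"

# preview: list the files that contain the text at all
grep -l 'oldtext' "$tmp"/*.ext

# edit in place, keeping the originals as a.ext.bak and b.ext.bak
sed -i.bak 's/oldtext/newtext/g' "$tmp"/*.ext

cat "$tmp/a.ext" "$tmp/b.ext"
```

Once you are happy with the result, the .bak files can be deleted.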

Remove punctuation and convert to lowercase

FILENAME=file.txt
sed 's/[[:punct:]]//g' "$FILENAME" | sed $'s/\t//g' | tr '[:upper:]' '[:lower:]'  > "lowercase.$FILENAME"
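
The same cleanup can probably be done with tr alone, which deletes the punctuation and tabs in one step before lowercasing. This is a sketch on a made-up one-line file rather than a drop-in replacement.

```shell
# Demo data: a hypothetical line with a leading tab, capitals and punctuation
tmp=$(mktemp -d)
FILENAME="$tmp/file.txt"
printf '\tThe Quick, Brown Fox!\n' > "$FILENAME"

# tr -d deletes every punctuation character and tab; the second tr lowercases
cleaned=$(tr -d '[:punct:]\t' < "$FILENAME" | tr '[:upper:]' '[:lower:]')
echo "$cleaned"
```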

Archiving Specific Files in a Directory Tree

The tar command has an --include switch (at least in some versions) which will archive only matching file patterns, however it appears that this filtering breaks when trying to archive files in subdirectories. Fortunately, the person who posed the question on StackExchange already had a workaround that works fine (it's just ugly):

 find ./ -name "*.wav.txt" -print0 | tar -cvzf ~/adhd.tgz --null -T -

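The -T option tells tar to read the list of files to archive from a file, and the trailing - means that the list comes from stdin (i.e., from find, through the pipe). The --null flag tells tar that the names in the list are separated by null characters, which matches find's -print0 and keeps file names containing spaces intact. Just replace the file pattern with whatever it is you want to archive, and of course specify an appropriate tgz archive name.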
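
To convince yourself that the pipeline picks up matching files in subdirectories (including ones with spaces in their names), it can be tried on a scratch tree first. The directory and file names below are placeholders.

```shell
# Demo data: matching files in two subdirectories, one with a space in its name
tmp=$(mktemp -d)
mkdir -p "$tmp/data/sub 01" "$tmp/data/sub02"
touch "$tmp/data/sub 01/a.wav.txt" "$tmp/data/sub02/b.wav.txt" "$tmp/data/sub02/b.wav"

cd "$tmp/data" || exit 1
# find emits null-separated names; tar reads them from stdin (-T -) with --null
find . -name "*.wav.txt" -print0 | tar -czf "$tmp/demo.tgz" --null -T -

# list what ended up in the archive
tar -tzf "$tmp/demo.tgz"
nfiles=$(tar -tzf "$tmp/demo.tgz" | wc -l | tr -d ' ')
```

Only the two .wav.txt files should be listed; b.wav is left out because it does not match the pattern.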

mysql on the terminal

So I learned tonight how to export query results to a text file from the shell interface. Note that MySQL server is running with the --secure-file-priv option enabled, so you can't just willy-nilly write files wherever you want. However /var/lib/mysql-files/ is fair game, so for example:

select * from conceptstats inner join concepts on conceptstats.concid=concepts.concid where pid=183 and norm=1 into outfile '/var/lib/mysql-files/0183.txt';