csvuniq installed/working (check ~/bin/csvuniq -h)?In which our flying, speaking beasts assemble and become legion
We can type these commands into text files, and instruct our Unix shell to execute them.
The Unix shell is also a mini-programming language!
Next to piping, this is our second route towards the promise of Unix composability
We’re going to create a build.sh script to automate the analyses we did last week.
Run vim build.sh to open the file, then hit i for insert mode, and enter the following text:
# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"Esc to return to command mode:w<Enter> to save the fileCtrl-a | (or switch to an existing one with Ctrl-a <arrow-key>)cd ~/bioinfclass
bash build.sh
ls output # note the new file
csvlook output/seqs_per_species.csv # take a lookGo back to your vim build.sh tmux pane (Ctrl-a <arrow-key>), and add the following lines:
#!/bin/bash
# ^ this magic comment is a "shebang"; It tells the shell how to interpret our script.
# Sane error handling settings (stop running for most errors)
set -euf -o pipefail
# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"Save when done (Esc for command mode, then :w<Enter>).
Switch back to your unix shell tmux pane (Ctrl-a <arrow-key>), and then execute:
chmod +x build.sh
# Note "x" when we use `ls -l`
ls -l build.sh
file build.sh
# Now it's a command/program!
./build.shGo back to your vim build.sh tmux pane, (Ctrl-a <arrow-key>), press down till you get to the bottom of the file, and add to the end:
# Compute number of sequences per specimen
csvuniq -zc specimen,species,location data/sfv.csv > output/seqs_per_specimen.csv
# Use those results to count specimens per species
csvuniq -zc species output/seqs_per_specimen.csv > output/specs_per_species.csv
# Also use them to count specimens by species and location
csvuniq -zc species,location output/seqs_per_specimen.csv > output/specs_per_species_location.csvIn the shell:
./build.sh
ls output
csvlook output/seqs_per_specimen.csv | less -SShell variables act as shorthand references to paths, files, or data. What the variable name points to can change, without having to change (much) of the code. This makes it easier to grow your scripts.
There are also some pretty valuable programming patterns which shell variables enable, and as we’ve seen, shell variables are important features of the Unix shell itself (see $PATH).
In the shell:
# Set the shell variable `birds`
# (Note: no spaces around the "=")
birds="are really dinosaurs"
# We can use that variable in further commands by prepending $
echo $birdsEdit our build.sh file to look as follows:
#!/bin/bash
set -euf -o pipefail
# Specify output directory, and name input data
outdir="output"
metadata="data/sfv.csv"
# Compute number of sequences per species
seqs_per_species="$outdir/seqs_per_species.csv"
csvuniq -zc species $metadata > $seqs_per_species
# Compute number of sequences per specimen
seqs_per_specimen="$outdir/seqs_per_specimen.csv"
csvuniq -zc specimen,species,location $metadata > $seqs_per_specimenNow note that when we add the following, we get to refer to past results using the defined variables:
# Use seqs_per_specimen to compute specimens per species
specs_per_species="$outdir/specs_per_species.csv"
csvuniq -zc species $seqs_per_specimen > $specs_per_species
# Also use them to compute number of specimens per species and location
specs_per_species_location="$outdir/specs_per_species_location.csv"
csvuniq -zc species,location $seqs_per_specimen > $specs_per_species_locationSave when done (Esc then :w<Enter>).
Back in your unix shell (Ctrl-a <arrow-key>):
./build.sh
ls output
csvlook output/seqs_per_specimen.csvWe’ve seen csvlook ___ | less -S quite a few times now for looking at csv data. So let’s give this a name!
csvless scriptOpen up a new file with vim ~/bin/csvless, and insert:
#!/bin/bash
csvlook $@ | less -SHere $@ is a “magic” variable that points to all of the arguments passed to the command. If you didn’t want positional arguments, you can use $1 or $2 for the 1st or 2nd (etc.) positional arguments passed to the script.
csvless scriptSave and close the file (Esc then :wq<Enter>), then run the following from the shell:
chmod +x ~/bin/csvless
csvless data/sfv.csv
csvcut -c sequence,specimen,species data/sfv.csv | csvlesscsvheadOpen up a new file with vim ~/bin/csvhead, and insert:
#!/bin/bash
csvlook $@ | head -n 20Again, save and close (Esc + :wq<Enter>), run chmod +x ~/bin/csvhead, then test it!
Most of this we’ll leave to the book…
One line in the terminal:
for i in $(ls); do echo "My file $i is cool"; doneIn a file, you can expand this:
for i in $(ls)
do
  echo "My file $i is cool"
doneWe could do a series of computations for partitions of our data. For instance:
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir
  
  # Do some computations for $location
done#!/bin/bash
if [command]
then
  [if-statements]
else
  [else-statements]
fi#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
  # commands to run if "pattern" is found
  echo "found 'pattern' in 'some_file.txt"
else
  echo "no such pattern found..."
fiQuestion: Why do we use > /dev/null here?
There is lots of explanation in the book, so you can learn more there. But to get a brief sense:
find: Lets you compute very specific lists of files, returned to stdout separated by linesxargs: Can take line separated items, and apply them as arguments to some command
parallel: More robust version of xargs, with better control over parallelizationScons and make offer the following advantages over shell scripts:How would we add the alignment and tree building we did in the last two sessions to our build.sh script?
Make the top of our script look like:
#!/bin/bash
# ...
inseqs="data/sfv.fasta"
# Alignment
alignment="$outdir/alignment.fasta"
muscle -maxiters 2 -in $inseqs -out $alignment
# Tree
tree="$outdir/tree.nw"
FastTree -nt $alignment > $treeLet’s start by creating a separate directory for each. How would we add location specific metadata using our csvkit tools?
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir
  # Create a subset of the metadata for just that location
  loc_metadata="$loc_outdir/metadata.csv"
  # XXX TODO! Add code here to create $loc_metadata
done# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir
  # Create a subset of the metadata for just that location
  loc_metadata="$loc_outdir/metadata.csv"
  csvgrep -c location -m $location $metadata > $loc_metadata
done# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # ... create location metadata, etc.
  # Create a list of sequences sampled from that location
  loc_sequences="$loc_outdir/sequences"
  csvcut -c sequence $loc_metadata > $loc_sequences
  # TODO:
  # * Subset Ex.1 alignment using `seqmagick convert`
  # * Build a tree from this alignment subset using FastTree
doneAdd the following to the end of our for loop
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # ... create location metadata, etc.
  # Create a list of sequences sampled from that location
  loc_sequences="$loc_outdir/sequences"
  csvcut -c sequence $loc_metadata > $loc_sequences
  # Subset our alignment to just that location
  loc_alignment="$loc_outdir/alignment.fasta"
  seqmagick convert --include-from-file $loc_sequences $alignment $loc_alignment
  # Build a location tree
  loc_tree="$loc_outdir/tree.nw"
  FastTree -nt $loc_alignment > $loc_tree
doneRecommended reading:
Reading for next class (if you want to read ahead):
vimtutor at the terminal…)