Lesson 3 - Unix Scripting & Automation

Introduction to Bioinformatics

First things first

Were you able to get csvuniq installed/working (check ~/bin/csvuniq -h)?
How is project planning and outside time going?

Shell scripting

In which our flying, speaking beasts assemble and become legion

A shell script is just a list of commands to be executed

We can type these commands into text files, and instruct our Unix shell to execute them.

Surprise!

The Unix shell is also a mini-programming language!

Next to piping, this is our second route towards the promise of Unix composability

For bioinformatics this means

Simple analysis tools/commands
Automation

Shell scripting is just “duct tape”

Perfect for connecting existing shell commands together
Less so for more complicated data flow and logic (for which we have python, R, etc.)

Back to biology

Our first script!

We’re going to create a build.sh script to automate the analyses we did last week.

Our first script!

Run vim build.sh to open the file, then hit i for insert mode, and enter the following text:

# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"

Hit Esc to return to command mode
Use the command :w<Enter> to save the file

Running our script

Open a new tmux pane with Ctrl-a | (or switch to an existing one with Ctrl-a <arrow-key>)
Then run the following in the shell:

cd ~/bioinfclass
bash build.sh

ls output # note the new file

csvlook output/seqs_per_species.csv # take a look

Making into a proper command

Go back to your vim build.sh tmux pane (Ctrl-a <arrow-key>), and add the following lines:

#!/bin/bash
# ^ this magic comment is a "shebang"; It tells the shell how to interpret our script.

# Sane error handling settings (stop running for most errors)
set -euf -o pipefail

# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"

Save when done (Esc for command mode, then :w<Enter>).

Now make the script executable!

Switch back to your unix shell tmux pane (Ctrl-a <arrow-key>), and then execute:

chmod +x build.sh

# Note "x" when we use `ls -l`
ls -l build.sh

file build.sh

# Now it's a command/program!
./build.sh

Back to our build script

Go back to your vim build.sh tmux pane, (Ctrl-a <arrow-key>), press down till you get to the bottom of the file, and add to the end:

# Compute number of sequences per specimen
csvuniq -zc specimen,species,location data/sfv.csv > output/seqs_per_specimen.csv

# Use those results to count specimens per species
csvuniq -zc species output/seqs_per_specimen.csv > output/specs_per_species.csv

# Also use them to count specimens by species and location
csvuniq -zc species,location output/seqs_per_specimen.csv > output/specs_per_species_location.csv

Run and investigate

In the shell:

./build.sh

ls output

csvlook output/seqs_per_specimen.csv | less -S

Questions?

Shell variables

Shell variables act as shorthand references to paths, files, or data. What the variable name points to can change, without having to change (much) of the code. This makes it easier to grow your scripts.

There are also some pretty valuable programming patterns which shell variables enable, and as we’ve seen, shell variables are important features of the Unix shell itself (see $PATH).

Shell variables

In the shell:

# Set the shell variable `birds`
# (Note: no spaces around the "=")
birds="are really dinosaurs"

# We can use that variable in further commands by prepending $
echo $birds

Cleaning up our build script

Edit our build.sh file to look as follows:

#!/bin/bash
set -euf -o pipefail

# Specify output directory, and name input data
outdir="output"
metadata="data/sfv.csv"

# Compute number of sequences per species
seqs_per_species="$outdir/seqs_per_species.csv"
csvuniq -zc species $metadata > $seqs_per_species

# Compute number of sequences per specimen
seqs_per_specimen="$outdir/seqs_per_specimen.csv"
csvuniq -zc specimen,species,location $metadata > $seqs_per_specimen

Cleaning up our build script

Now note that when we add the following, we get to refer to past results using the defined variables:

# Use seqs_per_specimen to compute specimens per species
specs_per_species="$outdir/specs_per_species.csv"
csvuniq -zc species $seqs_per_specimen > $specs_per_species

# Also use them to compute number of specimens per species and location
specs_per_species_location="$outdir/specs_per_species_location.csv"
csvuniq -zc species,location $seqs_per_specimen > $specs_per_species_location

Save when done (Esc then :w<Enter>).

Verify that it runs

Back in your unix shell (Ctrl-a <arrow-key>):

./build.sh

ls output

csvlook output/seqs_per_specimen.csv

Questions

Writing your own commands

The other side of shell scripting, and the dream of composition

Looking for common command patterns

Letting laziness lead the way…

We’ve seen csvlook ___ | less -S quite a few times now for looking at csv data. So let’s give this a name!

Writing a `csvless` script

Open up a new file with vim ~/bin/csvless, and insert:

#!/bin/bash

csvlook $@ | less -S

Here $@ is a “magic” variable that points to all of the arguments passed to the command. If you didn’t want positional arguments, you can use $1 or $2 for the 1st or 2nd (etc.) positional arguments passed to the script.

Testing our `csvless` script

Save and close the file (Esc then :wq<Enter>), then run the following from the shell:

chmod +x ~/bin/csvless

csvless data/sfv.csv

csvcut -c sequence,specimen,species data/sfv.csv | csvless

Another quick example: `csvhead`

Open up a new file with vim ~/bin/csvhead, and insert:

#!/bin/bash

csvlook $@ | head -n 20

Again, save and close (Esc + :wq<Enter>), run chmod +x ~/bin/csvhead, then test it!

Other things to know for shell scripting & power usage:

Iteration and arrays: Let us nest or loop over things like different species or locations and process separately
Conditionals: Let you run tests to determine what code to run
find/xargs/parallel: Run a command or program in parallel over a sequence of files

Most of this we’ll leave to the book…

Simple iteration

One line in the terminal:

for i in $(ls); do echo "My file $i is cool"; done

In a file, you can expand this:

for i in $(ls)
do
  echo "My file $i is cool"
done

How we might use this?

We could do a series of computations for partitions of our data. For instance:

# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir
  
  # Do some computations for $location
done

Conditionals

#!/bin/bash

if [command]
then
  [if-statements]
else
  [else-statements]
fi

Conditional example

#!/bin/bash

if grep "pattern" some_file.txt > /dev/null
then
  # commands to run if "pattern" is found
  echo "found 'pattern' in 'some_file.txt"
else
  echo "no such pattern found..."
fi

Question: Why do we use > /dev/null here?

find/xargs/parallel

There is lots of explanation in the book, so you can learn more there. But to get a brief sense:

find: Lets you compute very specific lists of files, returned to stdout separated by lines
xargs: Can take line separated items, and apply them as arguments to some command
- Good for large collections
- Can run in parallel
parallel: More robust version of xargs, with better control over parallelization

Build tools such as `Scons` and `make` offer the following advantages over shell scripts:

Only rebuild what’s needed (very important with long running programs) by tracking:
- whether data has changed
- whether commands have changed
Generally more robust
Automatically parallelize (sanely)

Excercise 1

How would we add the alignment and tree building we did in the last two sessions to our build.sh script?

Make the top of our script look like:

#!/bin/bash
# ...

inseqs="data/sfv.fasta"

# Alignment
alignment="$outdir/alignment.fasta"
muscle -maxiters 2 -in $inseqs -out $alignment

# Tree
tree="$outdir/tree.nw"
FastTree -nt $alignment > $tree

Excercise 2

Say we want to do some analyses for each location.

Let’s start by creating a separate directory for each. How would we add location specific metadata using our csvkit tools?

# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir

  # Create a subset of the metadata for just that location
  loc_metadata="$loc_outdir/metadata.csv"
  # XXX TODO! Add code here to create $loc_metadata

done

# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # Create a location outdir, if it doesn't already exist
  loc_outdir="$outdir/$location"
  mkdir -p $loc_outdir

  # Create a subset of the metadata for just that location
  loc_metadata="$loc_outdir/metadata.csv"
  csvgrep -c location -m $location $metadata > $loc_metadata

done

Excercise 3

How would we create a phylogenetic tree for each location in our data set?

# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # ... create location metadata, etc.

  # Create a list of sequences sampled from that location
  loc_sequences="$loc_outdir/sequences"
  csvcut -c sequence $loc_metadata > $loc_sequences

  # TODO:
  # * Subset Ex.1 alignment using `seqmagick convert`
  # * Build a tree from this alignment subset using FastTree
done

Add the following to the end of our for loop

# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
  # ... create location metadata, etc.

  # Create a list of sequences sampled from that location
  loc_sequences="$loc_outdir/sequences"
  csvcut -c sequence $loc_metadata > $loc_sequences

  # Subset our alignment to just that location
  loc_alignment="$loc_outdir/alignment.fasta"
  seqmagick convert --include-from-file $loc_sequences $alignment $loc_alignment

  # Build a location tree
  loc_tree="$loc_outdir/tree.nw"
  FastTree -nt $loc_alignment > $loc_tree

done

Other exercises

Write a handy little shell script for doing something (for example, printing the last 20 commands entered)

Reading

Resources

SCons for bioinformatics post - http://www.metasoarous.com/scons-for-data-science-and-compbio/
Seqmagick - http://fhcrc.github.io/seqmagick/ for munging sequence data
Vim for more powerful text editing - (type vimtutor at the terminal…)
Erick’s favorite video about shell scripting, and website

This slide intentionally left almost blank…

Back to homepage

Lesson 3 - Unix Scripting & Automation

Introduction to Bioinformatics

First things first

Shell scripting

A shell script is just a list of commands to be executed

Surprise!

For bioinformatics this means

Shell scripting is just “duct tape”

Back to biology

Our first script!

Our first script!

Running our script

Making into a proper command

Now make the script executable!

Back to our build script

Run and investigate

Questions?

Shell variables

Shell variables

Cleaning up our build script

Cleaning up our build script

Verify that it runs

Questions

Writing your own commands

The other side of shell scripting, and the dream of composition

Looking for common command patterns

Letting laziness lead the way…

Writing a csvless script

Testing our csvless script

Another quick example: csvhead

Other things to know for shell scripting & power usage:

Simple iteration

How we might use this?

Conditionals

Conditional example

find/xargs/parallel

Build tools such as Scons and make offer the following advantages over shell scripts:

Excercise 1

Excercise 2

Say we want to do some analyses for each location.

Excercise 3

How would we create a phylogenetic tree for each location in our data set?

Other exercises

Reading

Resources

This slide intentionally left almost blank…

Writing a `csvless` script

Testing our `csvless` script

Another quick example: `csvhead`

Build tools such as `Scons` and `make` offer the following advantages over shell scripts: