csvuniq
installed/working (check ~/bin/csvuniq -h
)?In which our flying, speaking beasts assemble and become legion
We can type these commands into text files, and instruct our Unix shell to execute them.
The Unix shell is also a mini-programming language!
Next to piping, this is our second route towards the promise of Unix composability
We’re going to create a build.sh
script to automate the analyses we did last week.
Run vim build.sh
to open the file, then hit i
for insert mode, and enter the following text:
# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"
Esc
to return to command mode:w<Enter>
to save the fileCtrl-a |
(or switch to an existing one with Ctrl-a <arrow-key>
)cd ~/bioinfclass
bash build.sh
ls output # note the new file
csvlook output/seqs_per_species.csv # take a look
Go back to your vim build.sh
tmux pane (Ctrl-a <arrow-key>
), and add the following lines:
#!/bin/bash
# ^ this magic comment is a "shebang"; It tells the shell how to interpret our script.
# Sane error handling settings (stop running for most errors)
set -euf -o pipefail
# Compute number of sequences per species
csvuniq -zc species data/sfv.csv > output/seqs_per_species.csv
echo "IT WORKED!"
Save when done (Esc
for command mode, then :w<Enter>
).
Switch back to your unix shell tmux pane (Ctrl-a <arrow-key>
), and then execute:
chmod +x build.sh
# Note "x" when we use `ls -l`
ls -l build.sh
file build.sh
# Now it's a command/program!
./build.sh
Go back to your vim build.sh
tmux pane, (Ctrl-a <arrow-key>
), press down till you get to the bottom of the file, and add to the end:
# Compute number of sequences per specimen
csvuniq -zc specimen,species,location data/sfv.csv > output/seqs_per_specimen.csv
# Use those results to count specimens per species
csvuniq -zc species output/seqs_per_specimen.csv > output/specs_per_species.csv
# Also use them to count specimens by species and location
csvuniq -zc species,location output/seqs_per_specimen.csv > output/specs_per_species_location.csv
In the shell:
./build.sh
ls output
csvlook output/seqs_per_specimen.csv | less -S
Shell variables act as shorthand references to paths, files, or data. What the variable name points to can change, without having to change (much) of the code. This makes it easier to grow your scripts.
There are also some pretty valuable programming patterns which shell variables enable, and as we’ve seen, shell variables are important features of the Unix shell itself (see $PATH
).
In the shell:
# Set the shell variable `birds`
# (Note: no spaces around the "=")
birds="are really dinosaurs"
# We can use that variable in further commands by prepending $
echo $birds
Edit our build.sh
file to look as follows:
#!/bin/bash
set -euf -o pipefail
# Specify output directory, and name input data
outdir="output"
metadata="data/sfv.csv"
# Compute number of sequences per species
seqs_per_species="$outdir/seqs_per_species.csv"
csvuniq -zc species $metadata > $seqs_per_species
# Compute number of sequences per specimen
seqs_per_specimen="$outdir/seqs_per_specimen.csv"
csvuniq -zc specimen,species,location $metadata > $seqs_per_specimen
Now note that when we add the following, we get to refer to past results using the defined variables:
# Use seqs_per_specimen to compute specimens per species
specs_per_species="$outdir/specs_per_species.csv"
csvuniq -zc species $seqs_per_specimen > $specs_per_species
# Also use them to compute number of specimens per species and location
specs_per_species_location="$outdir/specs_per_species_location.csv"
csvuniq -zc species,location $seqs_per_specimen > $specs_per_species_location
Save when done (Esc
then :w<Enter>
).
Back in your unix shell (Ctrl-a <arrow-key>
):
./build.sh
ls output
csvlook output/seqs_per_specimen.csv
We’ve seen csvlook ___ | less -S
quite a few times now for looking at csv data. So let’s give this a name!
csvless
scriptOpen up a new file with vim ~/bin/csvless
, and insert:
#!/bin/bash
csvlook $@ | less -S
Here $@
is a “magic” variable that points to all of the arguments passed to the command. If you didn’t want positional arguments, you can use $1
or $2
for the 1st or 2nd (etc.) positional arguments passed to the script.
csvless
scriptSave and close the file (Esc
then :wq<Enter>
), then run the following from the shell:
chmod +x ~/bin/csvless
csvless data/sfv.csv
csvcut -c sequence,specimen,species data/sfv.csv | csvless
csvhead
Open up a new file with vim ~/bin/csvhead
, and insert:
#!/bin/bash
csvlook $@ | head -n 20
Again, save and close (Esc
+ :wq<Enter>
), run chmod +x ~/bin/csvhead
, then test it!
Most of this we’ll leave to the book…
One line in the terminal:
for i in $(ls); do echo "My file $i is cool"; done
In a file, you can expand this:
for i in $(ls)
do
echo "My file $i is cool"
done
We could do a series of computations for partitions of our data. For instance:
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
# Create a location outdir, if it doesn't already exist
loc_outdir="$outdir/$location"
mkdir -p $loc_outdir
# Do some computations for $location
done
#!/bin/bash
if [command]
then
[if-statements]
else
[else-statements]
fi
#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
# commands to run if "pattern" is found
echo "found 'pattern' in 'some_file.txt"
else
echo "no such pattern found..."
fi
Question: Why do we use > /dev/null
here?
There is lots of explanation in the book, so you can learn more there. But to get a brief sense:
find
: Lets you compute very specific lists of files, returned to stdout
separated by linesxargs
: Can take line separated items, and apply them as arguments to some command
parallel
: More robust version of xargs, with better control over parallelizationScons
and make
offer the following advantages over shell scripts:How would we add the alignment and tree building we did in the last two sessions to our build.sh
script?
Make the top of our script look like:
#!/bin/bash
# ...
inseqs="data/sfv.fasta"
# Alignment
alignment="$outdir/alignment.fasta"
muscle -maxiters 2 -in $inseqs -out $alignment
# Tree
tree="$outdir/tree.nw"
FastTree -nt $alignment > $tree
Let’s start by creating a separate directory for each. How would we add location specific metadata using our csvkit tools?
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
# Create a location outdir, if it doesn't already exist
loc_outdir="$outdir/$location"
mkdir -p $loc_outdir
# Create a subset of the metadata for just that location
loc_metadata="$loc_outdir/metadata.csv"
# XXX TODO! Add code here to create $loc_metadata
done
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
# Create a location outdir, if it doesn't already exist
loc_outdir="$outdir/$location"
mkdir -p $loc_outdir
# Create a subset of the metadata for just that location
loc_metadata="$loc_outdir/metadata.csv"
csvgrep -c location -m $location $metadata > $loc_metadata
done
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
# ... create location metadata, etc.
# Create a list of sequences sampled from that location
loc_sequences="$loc_outdir/sequences"
csvcut -c sequence $loc_metadata > $loc_sequences
# TODO:
# * Subset Ex.1 alignment using `seqmagick convert`
# * Build a tree from this alignment subset using FastTree
done
Add the following to the end of our for loop
# For each location...
locations=($(csvuniq -c location $metadata | tail -n +2))
for location in ${locations[*]}
do
# ... create location metadata, etc.
# Create a list of sequences sampled from that location
loc_sequences="$loc_outdir/sequences"
csvcut -c sequence $loc_metadata > $loc_sequences
# Subset our alignment to just that location
loc_alignment="$loc_outdir/alignment.fasta"
seqmagick convert --include-from-file $loc_sequences $alignment $loc_alignment
# Build a location tree
loc_tree="$loc_outdir/tree.nw"
FastTree -nt $loc_alignment > $loc_tree
done
Recommended reading:
Reading for next class (if you want to read ahead):
vimtutor
at the terminal…)