All of this is in service of composability.
How does this help bioinformatics?
In which we see the static world of things
/dev/nullPaths are the addresses for files, and are based on directories.
/ - The “root” of the filesystem; everything lives here. - Current directory (depends on where you are).. - Parent directory (depends on where you are)~ - Home directoryWe can also use / to represent how paths relate to each other:
cd ~ # set our "working directory" to be home
# First try listing bioinfclass contents
ls bioinfclass
# Now try listing data directory contents
ls data
# ^ fails (no data in ~); Relative to current working directory tho?
ls bioinfclass/data
# We could also say
ls ./bioinfclass/data
ls ~/bioinfclass/data
ls /home/<yourusername>/bioinfclass/data  # ALL can be traced to /Every process in Unix (including your shell sessions) will have a working directory associated with it, and you can see this with pwd. Files can be accessed relative to this directory.
cd ~
pwd
# list the contents of this directory
ls bioinfclass/data
cd bioinfclass
pwd
# now data is fine relative to our current working directory
ls dataNow that we are in the ~/bioinformatics directory, how could we refer to another file in ~, like your .tmux.conf?
ls ~/.tmux.conf
ls /home/<yourusername>/.tmux.conf
# Using relative paths
ls ../.tmux.confmv# move the data directory to ~, our current directory
cd ~
mv bioinfclass/data .
ls
ls data
ls bioinfclassExercise: How do we get data back? (click down for answer)
mv data bioinfclassmv is for renaming as well# nothing to see here...
mv bioinfclass catgifs
lsThink of mv as updating the “path” stamp on a file
Exercise: How do we get bioinfclass back? (down for answer)
mv catgifs bioinfclassWe can make new directories or files with mkdir and touch, and remove with rm and rmdir
mkdir name will create a directory nametouch name will create an empty file namerm name will remove a file called namermdir name will remove an empty directory called namerm -r name will remove name and all it’s contents (CAREFUL!)Exercise: create a file called birds in a new subdirectory called dinosaurs of your home directory
In which things take flight and evolve!
command [flags] [operands]
-h)--help)-help; esp Java programs)-h or --help)-v -x -f the-file == -vxf the-file)dd if=somefile of=somenewfileUltimately, programs can interpret their arguments however they choose! These are just conventions.
As genetic sequences evolve, base pairs can be added or dropped. Alignment tries to figure out where these insertion/deletion events happened so that individual sites in an alignment can be compared between sequences:
AAGGCCTT            AAGGCCTT
ACGGCCTT    ==>>    ACGGCCTT
AAGCCTA             AAG-CCTA
ACGGCCTA            ACGGCCTANote: the - character in the alignment represents a putative deletion.
cd ~/bioinfclass
muscle -maxiters 2 -in data/sfv.fasta -out output/alignment.fasta
# Get bored waiting and want to work on something else...
<Ctrl-a |> # create new panewhich tree
# -l: "long" list output; lots of info that we will ignore except for `x` presence/absence
ls -l /usr/bin/tree
# super secret shortcut
ls -l $(which tree)
# Notice no x; not executable...
ls -l README.md/usr/bin?ls /usr/bin => LOTS OF STUFF!
This is one of the many places your shell looks for programs
echo $PATH => Your PATH environment variable; a : separated list of places your shell looks for programs.
This can be edited if you want to let your shell know to look other places as well (see .bashrc)
In which our constructions in flight don tongues and communicate!
std filesThere are three special files that programs often read from and write to by default to facilitate composability:
For the most part we’ll ignore stderr…
>Program vs Program > out-file
FastTree -nt output/alignment.fasta
vs
FastTree -nt output/alignment.fasta > output/tree.nw
[make sure you’re in ~/bioinfclass, and if your shell can’t see FastTree, don’t forget to module load intro-bio!]
| stdout of one command to stdin of anotherProg1 | Prog2:
column -t -s , data/sfv.csv | less -S
column (a program for working with tabular data) writes its output to stdout
-t specifies tabs for output-s , specifies command for separatorless reads from /dev/stdin if no file is specified
-S specifies don’t wrapsequence         specimen   species   gene   location     ...
BGH150WBGag2     BGH150     human     gag    Charmaguria  ...
BGH31WBGag2      BGH31      human     gag    Bormi        ...
BGH150WBGag4     BGH150     human     gag    Charmaguria  ...
BGH150WBGag8     BGH150     human     gag    Charmaguria  ...
MBG103WBGag101   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag3      BGH31      human     gag    Bormi        ...
MBG103WBGag102   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag1      BGH31      human     gag    Bormi        ...
MBG103WBGag104   MBG103     monkey    gag    Narayanganj  ...
MBG104WBGag101   MBG104     monkey    gag    Narayanganj  ...
...Programs have to know to use stdin / stdout for this to work. Some conventions:
stdoutstdin, output to stdoutstdout- as a shortcut for stdin/dev/stdout, /dev/stdin directly(down for answer)
A: Your keyboard!
Recall that cat writes to stdout whatever files you pass it as args. If no files are specified though, it reads from stdin.
catCtrl-d to close the stdin filecat will print out whatever you typed inYour bash shell works very much like cat:
stdin filestdout to screenNote: this is why Ctrl-d works to quit the shell! You’re literally closing the stdin file that the shell REPL is reading from.
Questions?
In which we apply our knowledge of these flying, speaking beasts
We’ll start with some more basic examples using piping to more effectively look through our data.
Head writes to stdout the first several lines of stdin.
head data/sfv.csv
column -t -s , data/sfv.csv | headThis is useful for looking at the very top – or head – of big files (particularly CSV; -> header + a few rows).
It’s important to always check your alignments to make sure they look “reasonable” (more or less, look like a plausable history, not completely “random”).
# We'll use the `alnvu` program
av -L 99999 -w 99999 -c output/alignment.fasta | less -Sav is an alignment viewer which – with these settings – prints the differences from consensus for each sequence, making it easy to scan.
nw_display output/tree.nw | less -SIn these examples we’ll be combinging several shell commands using piping in order to perform multistep computations, effectively making the shell into a programming language.
The way to work through and understand these examples:
This is analogous to how we typically write these computations, breaking a computation down into steps corresponding to known shell commands, one at a time.
There is a species column in our data set with human and monkey entries.
Question: How would we figure out how many human and monkey sequences there are in the data?
# How many human sequences are there?
grep human data/sfv.csv | wc -l
# What about monkey sequences?
grep monkey data/sfv.csv | wc -lgrep searches through files for a patternwc -l counts the number of linesgrep ... | wc -l counts the number of lines in a file matching some patternThere’s one line per sequence, so this gives us what we’re after.
We’re not accounting for the tabular structure:
How many sequences are there for each species?
cut -d , -f 3 data/sfv.csv | sort | uniq -ccut -d , -f 3 extracts the third , separated entry for each linesort | uniq -c counts occurrences for each unique entryThe number of times "monkey" shows up in the count then will be the number of sequences taking that value in the species column.
The first few columns in the data
|-----------------+----------+---------+------+-------------|
|  sequence       | specimen | species | gene | location    |
|-----------------+----------+---------+------+-------------|
|  BGH150WBGag2   | BGH150   | human   | gag  | Charmaguria |
|  BGH31WBGag2    | BGH31    | human   | gag  | Bormi       |
|  BGH150WBGag4   | BGH150   | human   | gag  | Charmaguria |
|  BGH150WBGag8   | BGH150   | human   | gag  | Charmaguria |
|  MBG103WBGag101 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag3    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag102 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag1    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag104 | MBG103   | monkey  | gag  | Narayanganj |
|  MBG104WBGag101 | MBG104   | monkey  | gag  | Narayanganj |
|  ...cut -d , -f 3 data/sfv.csv
|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| monkey  |
| human   |
| monkey  |
| human   |
| monkey  |
| monkey  |
| ...cut -d , -f 3 data/sfv.csv | sort
|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| human   |
| human   |
| ...
| monkey  |
| monkey  |
| monkey  |
| monkey  |
| ...cut -d , -f 3 data/sfv.csv | sort | uniq -c
  77 human
1097 monkey
   1 speciesWe can do this with a two step version of the cut | uniq | sort pattern we used above:
# First, uniq specimens, with species info
cut -d , -f 2,3 data/sfv.csv | sort | uniq | head
# Next count that by species
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -ccut -d , -f 2,3 data/sfv.csv
|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| BGH150   | human   |
| BGH150   | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...-c)cut -d , -f 2,3 data/sfv.csv | sort | uniq
|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...Now we now have exactly 1 row per specimen (species just “comes along for the ride”).
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c
  8 human
169 monkey
  1 speciesSince there’s 1 row per specimen in the previous result, counting these rows by species gives us the number of specimens per species.
What would we do?
cut -d , -f 2,3,5 data/sfv.csv | sort | uniq | cut -d , -f 2,3 | sort | uniq -cWhat do you notice about where we tend to see humans getting infected?
Small composable tools offer a lot of power and flexibility.
From very simple pieces we were able to compute answers to very nontrivial queries. This is the promise of Unix composability, and why we love Unix for bioinformatics.
-d , -f 2,3,5 business?csvkitWe have csv... versions of many of the standard Unix utils:
csvcut, csvgrep, csvjoin, csvlook, csvsort, csvstack, csvstat
-s ,/-d , every timecsvgrep -c species -m human data/sfv.csv | csvlook | less -S# ~/bin is one of the places we can put programs
mkdir ~/bin
# Download script, and mark as executable
wget https://goo.gl/ZdNoUL -O ~/bin/csvuniq
chmod +x ~/bin/csvuniq
# Test with help message
csvuniq -h(if this fails to work, try export PATH=~/bin:$PATH and module load intro-bio)
csvuniq -c specimen,species,location data/sfv.csv | csvuniq -zc species,location | csvlookNotice that csvuniq does all of the sorting and cutting for us…
For homework go over these examples, and translate into the equivalents using csvkit utils.
Recommended reading:
Reading for next class (if you want to read ahead):