All of this is in service of composability.
How does this help bioinformatics?
In which we see the static world of things
/dev/null
Paths are the addresses for files, and are based on directories.
/
- The “root” of the filesystem; everything lives here.
- Current directory (depends on where you are)..
- Parent directory (depends on where you are)~
- Home directoryWe can also use /
to represent how paths relate to each other:
cd ~ # set our "working directory" to be home
# First try listing bioinfclass contents
ls bioinfclass
# Now try listing data directory contents
ls data
# ^ fails (no data in ~); Relative to current working directory tho?
ls bioinfclass/data
# We could also say
ls ./bioinfclass/data
ls ~/bioinfclass/data
ls /home/<yourusername>/bioinfclass/data # ALL can be traced to /
Every process in Unix (including your shell sessions) will have a working directory associated with it, and you can see this with pwd
. Files can be accessed relative to this directory.
cd ~
pwd
# list the contents of this directory
ls bioinfclass/data
cd bioinfclass
pwd
# now data is fine relative to our current working directory
ls data
Now that we are in the ~/bioinformatics
directory, how could we refer to another file in ~
, like your .tmux.conf
?
ls ~/.tmux.conf
ls /home/<yourusername>/.tmux.conf
# Using relative paths
ls ../.tmux.conf
mv
# move the data directory to ~, our current directory
cd ~
mv bioinfclass/data .
ls
ls data
ls bioinfclass
Exercise: How do we get data
back? (click down for answer)
mv data bioinfclass
mv
is for renaming as well# nothing to see here...
mv bioinfclass catgifs
ls
Think of mv
as updating the “path” stamp on a file
Exercise: How do we get bioinfclass
back? (down for answer)
mv catgifs bioinfclass
We can make new directories or files with mkdir
and touch
, and remove with rm
and rmdir
mkdir name
will create a directory name
touch name
will create an empty file name
rm name
will remove a file called name
rmdir name
will remove an empty directory called name
rm -r name
will remove name
and all it’s contents (CAREFUL!)Exercise: create a file called birds
in a new subdirectory called dinosaurs
of your home directory
In which things take flight and evolve!
command [flags] [operands]
-h
)--help
)-help
; esp Java programs)-h
or --help
)-v -x -f the-file
== -vxf the-file
)dd if=somefile of=somenewfile
Ultimately, programs can interpret their arguments however they choose! These are just conventions.
As genetic sequences evolve, base pairs can be added or dropped. Alignment tries to figure out where these insertion/deletion events happened so that individual sites in an alignment can be compared between sequences:
AAGGCCTT AAGGCCTT
ACGGCCTT ==>> ACGGCCTT
AAGCCTA AAG-CCTA
ACGGCCTA ACGGCCTA
Note: the -
character in the alignment represents a putative deletion.
cd ~/bioinfclass
muscle -maxiters 2 -in data/sfv.fasta -out output/alignment.fasta
# Get bored waiting and want to work on something else...
<Ctrl-a |> # create new pane
which tree
# -l: "long" list output; lots of info that we will ignore except for `x` presence/absence
ls -l /usr/bin/tree
# super secret shortcut
ls -l $(which tree)
# Notice no x; not executable...
ls -l README.md
/usr/bin
?ls /usr/bin
=> LOTS OF STUFF!
This is one of the many places your shell looks for programs
echo $PATH
=> Your PATH
environment variable; a :
separated list of places your shell looks for programs.
This can be edited if you want to let your shell know to look other places as well (see .bashrc
)
In which our constructions in flight don tongues and communicate!
std
filesThere are three special files that programs often read from and write to by default to facilitate composability:
For the most part we’ll ignore stderr
…
>
Program
vs Program > out-file
FastTree -nt output/alignment.fasta
vs
FastTree -nt output/alignment.fasta > output/tree.nw
[make sure you’re in ~/bioinfclass
, and if your shell can’t see FastTree, don’t forget to module load intro-bio
!]
|
stdout
of one command to stdin
of anotherProg1 | Prog2
:
column -t -s , data/sfv.csv | less -S
column
(a program for working with tabular data) writes its output to stdout
-t
specifies tabs for output-s ,
specifies command for separatorless
reads from /dev/stdin
if no file is specified
-S
specifies don’t wrapsequence specimen species gene location ...
BGH150WBGag2 BGH150 human gag Charmaguria ...
BGH31WBGag2 BGH31 human gag Bormi ...
BGH150WBGag4 BGH150 human gag Charmaguria ...
BGH150WBGag8 BGH150 human gag Charmaguria ...
MBG103WBGag101 MBG103 monkey gag Narayanganj ...
BGH31WBGag3 BGH31 human gag Bormi ...
MBG103WBGag102 MBG103 monkey gag Narayanganj ...
BGH31WBGag1 BGH31 human gag Bormi ...
MBG103WBGag104 MBG103 monkey gag Narayanganj ...
MBG104WBGag101 MBG104 monkey gag Narayanganj ...
...
Programs have to know to use stdin / stdout for this to work. Some conventions:
stdout
stdin
, output to stdout
stdout
-
as a shortcut for stdin
/dev/stdout
, /dev/stdin
directly(down for answer)
A: Your keyboard!
Recall that cat writes to stdout whatever files you pass it as args. If no files are specified though, it reads from stdin.
cat
Ctrl-d
to close the stdin filecat
will print out whatever you typed inYour bash shell works very much like cat:
stdin
filestdout
to screenNote: this is why Ctrl-d works to quit the shell! You’re literally closing the stdin file that the shell REPL is reading from.
Questions?
In which we apply our knowledge of these flying, speaking beasts
We’ll start with some more basic examples using piping to more effectively look through our data.
Head writes to stdout
the first several lines of stdin
.
head data/sfv.csv
column -t -s , data/sfv.csv | head
This is useful for looking at the very top – or head – of big files (particularly CSV; -> header + a few rows).
It’s important to always check your alignments to make sure they look “reasonable” (more or less, look like a plausable history, not completely “random”).
# We'll use the `alnvu` program
av -L 99999 -w 99999 -c output/alignment.fasta | less -S
av
is an alignment viewer which – with these settings – prints the differences from consensus for each sequence, making it easy to scan.
nw_display output/tree.nw | less -S
In these examples we’ll be combinging several shell commands using piping in order to perform multistep computations, effectively making the shell into a programming language.
The way to work through and understand these examples:
This is analogous to how we typically write these computations, breaking a computation down into steps corresponding to known shell commands, one at a time.
There is a species column in our data set with human
and monkey
entries.
Question: How would we figure out how many human and monkey sequences there are in the data?
# How many human sequences are there?
grep human data/sfv.csv | wc -l
# What about monkey sequences?
grep monkey data/sfv.csv | wc -l
grep
searches through files for a patternwc -l
counts the number of linesgrep ... | wc -l
counts the number of lines in a file matching some patternThere’s one line per sequence, so this gives us what we’re after.
We’re not accounting for the tabular structure:
How many sequences are there for each species?
cut -d , -f 3 data/sfv.csv | sort | uniq -c
cut -d , -f 3
extracts the third ,
separated entry for each linesort | uniq -c
counts occurrences for each unique entryThe number of times "monkey"
shows up in the count then will be the number of sequences taking that value in the species column.
The first few columns in the data
|-----------------+----------+---------+------+-------------|
| sequence | specimen | species | gene | location |
|-----------------+----------+---------+------+-------------|
| BGH150WBGag2 | BGH150 | human | gag | Charmaguria |
| BGH31WBGag2 | BGH31 | human | gag | Bormi |
| BGH150WBGag4 | BGH150 | human | gag | Charmaguria |
| BGH150WBGag8 | BGH150 | human | gag | Charmaguria |
| MBG103WBGag101 | MBG103 | monkey | gag | Narayanganj |
| BGH31WBGag3 | BGH31 | human | gag | Bormi |
| MBG103WBGag102 | MBG103 | monkey | gag | Narayanganj |
| BGH31WBGag1 | BGH31 | human | gag | Bormi |
| MBG103WBGag104 | MBG103 | monkey | gag | Narayanganj |
| MBG104WBGag101 | MBG104 | monkey | gag | Narayanganj |
| ...
cut -d , -f 3 data/sfv.csv
|---------|
| species |
|---------|
| human |
| human |
| human |
| human |
| monkey |
| human |
| monkey |
| human |
| monkey |
| monkey |
| ...
cut -d , -f 3 data/sfv.csv | sort
|---------|
| species |
|---------|
| human |
| human |
| human |
| human |
| human |
| human |
| ...
| monkey |
| monkey |
| monkey |
| monkey |
| ...
cut -d , -f 3 data/sfv.csv | sort | uniq -c
77 human
1097 monkey
1 species
We can do this with a two step version of the cut | uniq | sort
pattern we used above:
# First, uniq specimens, with species info
cut -d , -f 2,3 data/sfv.csv | sort | uniq | head
# Next count that by species
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c
cut -d , -f 2,3 data/sfv.csv
|----------+---------|
| specimen | species |
|----------+---------|
| BGH150 | human |
| BGH31 | human |
| BGH150 | human |
| BGH150 | human |
| MBG103 | monkey |
| BGH31 | human |
| MBG103 | monkey |
| BGH31 | human |
| MBG103 | monkey |
| MBG104 | monkey |
| ...
-c
)cut -d , -f 2,3 data/sfv.csv | sort | uniq
|----------+---------|
| specimen | species |
|----------+---------|
| BGH150 | human |
| BGH31 | human |
| MBG103 | monkey |
| MBG104 | monkey |
| ...
Now we now have exactly 1 row per specimen (species just “comes along for the ride”).
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c
8 human
169 monkey
1 species
Since there’s 1 row per specimen in the previous result, counting these rows by species gives us the number of specimens per species.
What would we do?
cut -d , -f 2,3,5 data/sfv.csv | sort | uniq | cut -d , -f 2,3 | sort | uniq -c
What do you notice about where we tend to see humans getting infected?
Small composable tools offer a lot of power and flexibility.
From very simple pieces we were able to compute answers to very nontrivial queries. This is the promise of Unix composability, and why we love Unix for bioinformatics.
-d , -f 2,3,5
business?csvkit
We have csv...
versions of many of the standard Unix utils:
csvcut
, csvgrep
, csvjoin
, csvlook
, csvsort
, csvstack
, csvstat
-s ,
/-d ,
every timecsvgrep -c species -m human data/sfv.csv | csvlook | less -S
# ~/bin is one of the places we can put programs
mkdir ~/bin
# Download script, and mark as executable
wget https://goo.gl/ZdNoUL -O ~/bin/csvuniq
chmod +x ~/bin/csvuniq
# Test with help message
csvuniq -h
(if this fails to work, try export PATH=~/bin:$PATH
and module load intro-bio
)
csvuniq -c specimen,species,location data/sfv.csv | csvuniq -zc species,location | csvlook
Notice that csvuniq
does all of the sorting and cutting for us…
For homework go over these examples, and translate into the equivalents using csvkit
utils.
Recommended reading:
Reading for next class (if you want to read ahead):