All of this is in service of composability.
How does this help bioinformatics?
In which we see the static world of things
Paths are the addresses for files, and are based on directories.
/- The “root” of the filesystem; everything lives here
.- Current directory (depends on where you are)
..- Parent directory (depends on where you are)
~- Home directory
We can also use
/ to represent how paths relate to each other:
cd ~ # set our "working directory" to be home # First try listing bioinfclass contents ls bioinfclass # Now try listing data directory contents ls data # ^ fails (no data in ~); Relative to current working directory tho? ls bioinfclass/data # We could also say ls ./bioinfclass/data ls ~/bioinfclass/data ls /home/<yourusername>/bioinfclass/data # ALL can be traced to /
Every process in Unix (including your shell sessions) will have a working directory associated with it, and you can see this with
pwd. Files can be accessed relative to this directory.
cd ~ pwd # list the contents of this directory ls bioinfclass/data cd bioinfclass pwd # now data is fine relative to our current working directory ls data
Now that we are in the
~/bioinformatics directory, how could we refer to another file in
~, like your
ls ~/.tmux.conf ls /home/<yourusername>/.tmux.conf # Using relative paths ls ../.tmux.conf
# move the data directory to ~, our current directory cd ~ mv bioinfclass/data . ls ls data ls bioinfclass
Exercise: How do we get
data back? (click down for answer)
mv data bioinfclass
mvis for renaming as well
# nothing to see here... mv bioinfclass catgifs ls
mv as updating the “path” stamp on a file
Exercise: How do we get
bioinfclass back? (down for answer)
mv catgifs bioinfclass
We can make new directories or files with
touch, and remove with
mkdir namewill create a directory
touch namewill create an empty file
rm namewill remove a file called
rmdir namewill remove an empty directory called
rm -r namewill remove
nameand all it’s contents (CAREFUL!)
Exercise: create a file called
birds in a new subdirectory called
dinosaurs of your home directory
In which things take flight and evolve!
command [flags] [operands]
-help; esp Java programs)
-v -x -f the-file==
dd if=somefile of=somenewfile
Ultimately, programs can interpret their arguments however they choose! These are just conventions.
As genetic sequences evolve, base pairs can be added or dropped. Alignment tries to figure out where these insertion/deletion events happened so that individual sites in an alignment can be compared between sequences:
AAGGCCTT AAGGCCTT ACGGCCTT ==>> ACGGCCTT AAGCCTA AAG-CCTA ACGGCCTA ACGGCCTA
- character in the alignment represents a putative deletion.
cd ~/bioinfclass muscle -maxiters 2 -in data/sfv.fasta -out output/alignment.fasta # Get bored waiting and want to work on something else... <Ctrl-a |> # create new pane
which tree # -l: "long" list output; lots of info that we will ignore except for `x` presence/absence ls -l /usr/bin/tree # super secret shortcut ls -l $(which tree) # Notice no x; not executable... ls -l README.md
ls /usr/bin => LOTS OF STUFF!
This is one of the many places your shell looks for programs
echo $PATH => Your
PATH environment variable; a
: separated list of places your shell looks for programs.
This can be edited if you want to let your shell know to look other places as well (see
In which our constructions in flight don tongues and communicate!
There are three special files that programs often read from and write to by default to facilitate composability:
For the most part we’ll ignore
Program > out-file
FastTree -nt output/alignment.fasta
FastTree -nt output/alignment.fasta > output/tree.nw
[make sure you’re in
~/bioinfclass, and if your shell can’t see FastTree, don’t forget to
module load intro-bio!]
stdoutof one command to
Prog1 | Prog2:
column -t -s , data/sfv.csv | less -S
column(a program for working with tabular data) writes its output to
-tspecifies tabs for output
-s ,specifies command for separator
/dev/stdinif no file is specified
-Sspecifies don’t wrap
sequence specimen species gene location ... BGH150WBGag2 BGH150 human gag Charmaguria ... BGH31WBGag2 BGH31 human gag Bormi ... BGH150WBGag4 BGH150 human gag Charmaguria ... BGH150WBGag8 BGH150 human gag Charmaguria ... MBG103WBGag101 MBG103 monkey gag Narayanganj ... BGH31WBGag3 BGH31 human gag Bormi ... MBG103WBGag102 MBG103 monkey gag Narayanganj ... BGH31WBGag1 BGH31 human gag Bormi ... MBG103WBGag104 MBG103 monkey gag Narayanganj ... MBG104WBGag101 MBG104 monkey gag Narayanganj ... ...
Programs have to know to use stdin / stdout for this to work. Some conventions:
stdin, output to
-as a shortcut for
(down for answer)
A: Your keyboard!
Recall that cat writes to stdout whatever files you pass it as args. If no files are specified though, it reads from stdin.
Ctrl-dto close the stdin file
catwill print out whatever you typed in
Your bash shell works very much like cat:
Note: this is why Ctrl-d works to quit the shell! You’re literally closing the stdin file that the shell REPL is reading from.
In which we apply our knowledge of these flying, speaking beasts
Head writes to
stdout the first several lines of
head data/sfv.csv column -t -s , data/sfv.csv | head
This is useful for looking at the very top – or head – of big files (particularly CSV; -> header + a few rows).
It’s important to always check your alignments to make sure they look “reasonable” (more or less, look like a plausable history, not completely “random”).
# We'll use the `alnvu` program av -L 99999 -w 99999 -c output/alignment.fasta | less -S
av is an alignment viewer which – with these settings – prints the differences from consensus for each sequence, making it easy to scan.
nw_display output/tree.nw | less -S
In these examples we’ll be combinging several shell commands using piping in order to perform multistep computations, effectively making the shell into a programming language.
The way to work through and understand these examples:
This is analogous to how we typically write these computations, breaking a computation down into steps corresponding to known shell commands, one at a time.
There is a species column in our data set with
Question: How would we figure out how many human and monkey sequences there are in the data?
# How many human sequences are there? grep human data/sfv.csv | wc -l # What about monkey sequences? grep monkey data/sfv.csv | wc -l
grepsearches through files for a pattern
wc -lcounts the number of lines
grep ... | wc -lcounts the number of lines in a file matching some pattern
There’s one line per sequence, so this gives us what we’re after.
We’re not accounting for the tabular structure:
How many sequences are there for each species?
cut -d , -f 3 data/sfv.csv | sort | uniq -c
cut -d , -f 3extracts the third
,separated entry for each line
sort | uniq -ccounts occurrences for each unique entry
The number of times
"monkey" shows up in the count then will be the number of sequences taking that value in the species column.
The first few columns in the data
|-----------------+----------+---------+------+-------------| | sequence | specimen | species | gene | location | |-----------------+----------+---------+------+-------------| | BGH150WBGag2 | BGH150 | human | gag | Charmaguria | | BGH31WBGag2 | BGH31 | human | gag | Bormi | | BGH150WBGag4 | BGH150 | human | gag | Charmaguria | | BGH150WBGag8 | BGH150 | human | gag | Charmaguria | | MBG103WBGag101 | MBG103 | monkey | gag | Narayanganj | | BGH31WBGag3 | BGH31 | human | gag | Bormi | | MBG103WBGag102 | MBG103 | monkey | gag | Narayanganj | | BGH31WBGag1 | BGH31 | human | gag | Bormi | | MBG103WBGag104 | MBG103 | monkey | gag | Narayanganj | | MBG104WBGag101 | MBG104 | monkey | gag | Narayanganj | | ...
cut -d , -f 3 data/sfv.csv
|---------| | species | |---------| | human | | human | | human | | human | | monkey | | human | | monkey | | human | | monkey | | monkey | | ...
cut -d , -f 3 data/sfv.csv | sort
|---------| | species | |---------| | human | | human | | human | | human | | human | | human | | ... | monkey | | monkey | | monkey | | monkey | | ...
cut -d , -f 3 data/sfv.csv | sort | uniq -c
77 human 1097 monkey 1 species
We can do this with a two step version of the
cut | uniq | sort pattern we used above:
# First, uniq specimens, with species info cut -d , -f 2,3 data/sfv.csv | sort | uniq | head # Next count that by species cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c
cut -d , -f 2,3 data/sfv.csv
|----------+---------| | specimen | species | |----------+---------| | BGH150 | human | | BGH31 | human | | BGH150 | human | | BGH150 | human | | MBG103 | monkey | | BGH31 | human | | MBG103 | monkey | | BGH31 | human | | MBG103 | monkey | | MBG104 | monkey | | ...
cut -d , -f 2,3 data/sfv.csv | sort | uniq
|----------+---------| | specimen | species | |----------+---------| | BGH150 | human | | BGH31 | human | | MBG103 | monkey | | MBG104 | monkey | | ...
Now we now have exactly 1 row per specimen (species just “comes along for the ride”).
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c
8 human 169 monkey 1 species
Since there’s 1 row per specimen in the previous result, counting these rows by species gives us the number of specimens per species.
What would we do?
cut -d , -f 2,3,5 data/sfv.csv | sort | uniq | cut -d , -f 2,3 | sort | uniq -c
What do you notice about where we tend to see humans getting infected?
Small composable tools offer a lot of power and flexibility.
From very simple pieces we were able to compute answers to very nontrivial queries. This is the promise of Unix composability, and why we love Unix for bioinformatics.
-d , -f 2,3,5business?
csv... versions of many of the standard Unix utils:
-d ,every time
csvgrep -c species -m human data/sfv.csv | csvlook | less -S
# ~/bin is one of the places we can put programs mkdir ~/bin # Download script, and mark as executable wget https://goo.gl/ZdNoUL -O ~/bin/csvuniq chmod +x ~/bin/csvuniq # Test with help message csvuniq -h
(if this fails to work, try
export PATH=~/bin:$PATH and
module load intro-bio)
csvuniq -c specimen,species,location data/sfv.csv | csvuniq -zc species,location | csvlook
csvuniq does all of the sorting and cutting for us…
For homework go over these examples, and translate into the equivalents using
Reading for next class (if you want to read ahead):