Introduction to Bioinformatics

Lesson 2 - Unix Foundations

The Shell

Shell : Bioinformatician   ::   Water : Fish


  • Philosophy
  • Tools
  • Cluster computing

The Unix Philosophy

  • Small tools that do one thing right
  • Plumb tools together into pipelines/scripts
  • Embrace plain text/data

All of this is in service of composability.

How does this help bioinformatics?

Breaking down Unix

Directories, files and paths, oh my!

In which we see the static world of things

In Unix, (almost) everything is a file

  • programs
  • devices
  • /dev/null
  • etc.

And files have paths

Paths are the addresses for files, and are based on directories.

Directories (aka “folders”) are containers

…for more directories and files…

Some special paths & directories:

  • / - The “root” of the filesystem; everything lives here
  • . - Current directory (depends on where you are)
  • .. - Parent directory (depends on where you are)
  • ~ - Home directory

Composing paths

We can also use / to represent how paths relate to each other:

cd ~ # set our "working directory" to be home

# First try listing bioinfclass contents
ls bioinfclass

# Now try listing data directory contents
ls data

# ^ fails (there is no data directory in ~); what about a path relative to our current working directory?
ls bioinfclass/data

# We could also say
ls ./bioinfclass/data
ls ~/bioinfclass/data
ls /home/<yourusername>/bioinfclass/data  # ALL can be traced to /

More on current working directory

Every process in Unix (including your shell sessions) will have a working directory associated with it, and you can see this with pwd. Files can be accessed relative to this directory.

cd ~
pwd

# list the contents of this directory
ls bioinfclass/data

cd bioinfclass
pwd

# now data is fine relative to our current working directory
ls data

Exercise: going up the directory chain

Now that we are in the ~/bioinfclass directory, how could we refer to another file in ~, like your .tmux.conf?

Going up the directory chain

ls ~/.tmux.conf
ls /home/<yourusername>/.tmux.conf
# Using relative paths
ls ../.tmux.conf

We can change the path of a file using mv

# move the data directory to ~, our current directory
cd ~
mv bioinfclass/data .

ls

ls data

ls bioinfclass

Exercise: How do we get data back? (click down for answer)

Getting data back

mv data bioinfclass

mv is for renaming as well

# nothing to see here...
mv bioinfclass catgifs

ls

Think of mv as updating the “path” stamp on a file

Exercise: How do we get bioinfclass back? (down for answer)

Getting bioinfclass back

mv catgifs bioinfclass

Creating and removing files

We can make new directories or files with mkdir and touch, and remove with rm and rmdir

  • mkdir name will create a directory name
  • touch name will create an empty file name
  • rm name will remove a file called name
  • rmdir name will remove an empty directory called name
  • rm -r name will remove name and all its contents (CAREFUL!)

Exercise: create a file called birds in a new subdirectory called dinosaurs of your home directory
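
One possible solution, sketched using just the commands above (assuming you want dinosaurs directly under your home directory):

mkdir ~/dinosaurs        # create the new subdirectory in your home directory
touch ~/dinosaurs/birds  # create an empty file called birds inside it

# check your work
ls ~/dinosaurs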

Commands

In which things take flight and evolve!

Anatomy of a Unix command

command [flags] [operands]

  • command: name of the program
  • operands: positional based arguments
    • often but not always required
  • flags: name based arguments
    • position frequently ignored
    • usually for controlling details of program behavior
    • often not required
    • sometimes take arguments, sometimes not

More on flag shape

  • sometimes single dash and character (like -h)
  • sometimes double dash and word (like --help)
  • sometimes single dash and word (like -help; esp Java programs)
  • sometimes there are short and long options (like -h or --help)
  • sometimes short flags can be munged together (-v -x -f the-file == -vxf the-file)
  • occasional insanity like dd if=somefile of=somenewfile

Ultimately, programs can interpret their arguments however they choose! These are just conventions.
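
As a concrete dissection, here is a command we'll run later in this lesson (nothing new to type, just labels):

column -t -s , data/sfv.csv
# column        - the command (name of the program)
# -t            - a flag taking no argument (format the input as a table)
# -s ,          - a flag taking an argument (use , as the field separator)
# data/sfv.csv  - the operand (the file to operate on)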

Sequence alignments

As genetic sequences evolve, base pairs can be added or dropped. Alignment tries to figure out where these insertion/deletion events happened so that individual sites in an alignment can be compared between sequences:

AAGGCCTT            AAGGCCTT
ACGGCCTT    ==>>    ACGGCCTT
AAGCCTA             AAG-CCTA
ACGGCCTA            ACGGCCTA

Note: the - character in the alignment represents a putative deletion.

Making a sequence alignment

cd ~/bioinfclass
muscle -maxiters 2 -in data/sfv.fasta -out output/alignment.fasta

# Get bored waiting and want to work on something else...
<Ctrl-a |> # create new pane

Programs are just files!

which tree

# -l: "long" list output; lots of info that we will ignore except for `x` presence/absence
ls -l /usr/bin/tree

# super secret shortcut
ls -l $(which tree)

# Notice no x; not executable...
ls -l README.md

What else is in /usr/bin?

ls /usr/bin => LOTS OF STUFF!

This is one of the many places your shell looks for programs

How does the shell know where to look for programs?

echo $PATH => Your PATH environment variable; a : separated list of places your shell looks for programs.

This can be edited if you want to let your shell know to look other places as well (see .bashrc)
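
For example, a line like this in your .bashrc (a sketch; ~/bin is just one sensible choice, and we'll use it later in this lesson) adds an extra place for the shell to look:

export PATH=~/bin:$PATH  # prepend ~/bin to the list of directories searched for programs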

Questions?

Pipes and streams

In which our constructions in flight don tongues and communicate!

A bit about std files

There are three special files that programs often read from and write to by default to facilitate composability:

  • stdin (“standard input”) - where a program reads its input from by default
  • stdout (“standard output”) - where a program writes its main output by default
  • stderr (“standard error”) - where a program writes error and status messages

For the most part we’ll ignore stderr

By default, stdout gets printed to the screen

But we can redirect stdout to a file using >

Program vs Program > out-file

Example of redirection

FastTree -nt output/alignment.fasta

vs

FastTree -nt output/alignment.fasta > output/tree.nw

 

[make sure you’re in ~/bioinfclass, and if your shell can’t see FastTree, don’t forget to module load intro-bio!]

We can also | stdout of one command to stdin of another

Prog1 | Prog2

Example

column -t -s , data/sfv.csv | less -S

  • column (a program for working with tabular data) writes its output to stdout
    • -t produces aligned, tabular output
    • -s , specifies , (comma) as the input separator
  • less reads from /dev/stdin if no file is specified
    • -S tells less not to wrap long lines

You should see something like this

sequence         specimen   species   gene   location     ...
BGH150WBGag2     BGH150     human     gag    Charmaguria  ...
BGH31WBGag2      BGH31      human     gag    Bormi        ...
BGH150WBGag4     BGH150     human     gag    Charmaguria  ...
BGH150WBGag8     BGH150     human     gag    Charmaguria  ...
MBG103WBGag101   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag3      BGH31      human     gag    Bormi        ...
MBG103WBGag102   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag1      BGH31      human     gag    Bormi        ...
MBG103WBGag104   MBG103     monkey    gag    Narayanganj  ...
MBG104WBGag101   MBG104     monkey    gag    Narayanganj  ...
...

Programs have to know to use stdin / stdout for this to work. Some conventions:

  • Assume first argument is input, second output
  • If only one argument is specified, output to stdout
  • If no arguments specified, input from stdin, output to stdout
  • Sometimes there is no “output” argument; assume stdout
  • Sometimes programs treat - as a shortcut for stdin
  • When all else fails, you can read from & write to paths /dev/stdout, /dev/stdin directly
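
A couple of these conventions in action (sketches; whether a given program accepts - or /dev/stdin varies, so check its documentation):

# less treats - as a shortcut for stdin, like many programs
column -t -s , data/sfv.csv | less -S -

# reading from the /dev/stdin path directly
column -t -s , data/sfv.csv | less -S /dev/stdin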

More about stdin

Q: What does stdin default to?

(down for answer)

Q: What does stdin default to?

A: Your keyboard!

Reading from stdin

Recall that cat writes to stdout whatever files you pass it as args. If no files are specified though, it reads from stdin.

cat
  • Enter some text using your keyboard
  • Use the key command Ctrl-d to close the stdin file
  • cat will print out whatever you typed in

The shell reads from stdin!

Your bash shell works very much like cat:

  • Read line of stdin file
  • Evaluate that line
  • Print stdout to screen
  • Loop back and start over

Note: this is why Ctrl-d works to quit the shell! You’re literally closing the stdin file that the shell REPL is reading from.
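
You can see the “read from stdin, evaluate” behavior directly by feeding a command to a fresh bash process on its stdin (a small sketch):

echo "ls ~/bioinfclass" | bash  # bash reads the line from stdin, evaluates it, and the output appears on your screen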

That’s it for pipes and streams

Questions?

Practicum!

In which we apply our knowledge of these flying, speaking beasts

Head first…

The head command writes to stdout the first several lines of a file (or of stdin, if no file is given).

head data/sfv.csv

column -t -s , data/sfv.csv | head

This is useful for looking at the very top – or head – of big files (for a CSV, that means the header plus a few rows of data).

Looking at alignments

It’s important to always check your alignments to make sure they look “reasonable” (that is, more or less like a plausible history, not completely “random”).

# We'll use the `alnvu` program
av -L 99999 -w 99999 -c output/alignment.fasta | less -S

av is an alignment viewer which – with these settings – prints the differences from consensus for each sequence, making it easy to scan.

We can also look at our tree

nw_display output/tree.nw | less -S

Analyzing data

In these examples we’ll be combining several shell commands using pipes in order to perform multistep computations, effectively turning the shell into a programming language.

Side note: strategy

The way to work through and understand these examples:

  • run the first part of the command and think about what it’s doing
  • add the next part of the pipe to see what that does
  • repeat, till through the whole command
  • reflect and think about the strategy used to put each of these pieces together towards the end product

This is analogous to how we typically write these computations, breaking a computation down into steps corresponding to known shell commands, one at a time.

Sequences per species

There is a species column in our data set with human and monkey entries.


Question: How would we figure out how many human and monkey sequences there are in the data?

Sequences per species

# How many human sequences are there?
grep human data/sfv.csv | wc -l

# What about monkey sequences?
grep monkey data/sfv.csv | wc -l
  • grep searches through files for a pattern
  • wc -l counts the number of lines
  • Thus, grep ... | wc -l counts the number of lines in a file matching some pattern

There’s one line per sequence, so this gives us what we’re after.

What are some problems with this?

Some problems with this

We’re not accounting for the tabular structure:

  • We can’t restrict matches to just a particular column in the data
  • We have to do this twice; if we had many categories, this could take forever

A better way

How many sequences are there for each species?

cut -d , -f 3 data/sfv.csv | sort | uniq -c
  • cut -d , -f 3 extracts the third , separated entry for each line
  • sort | uniq -c counts occurrences for each unique entry

The count printed next to "monkey" will then be the number of sequences taking that value in the species column.

How to visualize this

The first few columns in the data

|-----------------+----------+---------+------+-------------|
|  sequence       | specimen | species | gene | location    |
|-----------------+----------+---------+------+-------------|
|  BGH150WBGag2   | BGH150   | human   | gag  | Charmaguria |
|  BGH31WBGag2    | BGH31    | human   | gag  | Bormi       |
|  BGH150WBGag4   | BGH150   | human   | gag  | Charmaguria |
|  BGH150WBGag8   | BGH150   | human   | gag  | Charmaguria |
|  MBG103WBGag101 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag3    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag102 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag1    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag104 | MBG103   | monkey  | gag  | Narayanganj |
|  MBG104WBGag101 | MBG104   | monkey  | gag  | Narayanganj |
|  ...

The results of cutting out the third column

cut -d , -f 3 data/sfv.csv

|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| monkey  |
| human   |
| monkey  |
| human   |
| monkey  |
| monkey  |
| ...

The result of sorting

cut -d , -f 3 data/sfv.csv | sort

|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| human   |
| human   |
| ...
| monkey  |
| monkey  |
| monkey  |
| monkey  |
| ...

The result of counting

cut -d , -f 3 data/sfv.csv | sort | uniq -c

  77 human
1097 monkey
   1 species

(The 1 species line is just the CSV header row, which gets counted along with the data.)

How would we count how many specimens there are per species?

Counting specimens per species

We can do this with a two step version of the cut | sort | uniq pattern we used above:

# First, uniq specimens, with species info
cut -d , -f 2,3 data/sfv.csv | sort | uniq | head

# Next count that by species
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c

Columns 2 and 3 cut out of data

cut -d , -f 2,3 data/sfv.csv

|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| BGH150   | human   |
| BGH150   | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...

The results of the first sort and uniq (without -c)

cut -d , -f 2,3 data/sfv.csv | sort | uniq

|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...

We now have exactly 1 row per specimen (species just “comes along for the ride”).

Now we cut out the species, and count

cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c

  8 human
169 monkey
  1 species

Since there’s 1 row per specimen in the previous result, counting these rows by species gives us the number of specimens per species.

What if we wanted to count number of specimens per species for each location?

What would we do?

Specimens by location and species

cut -d , -f 2,3,5 data/sfv.csv | sort | uniq | cut -d , -f 2,3 | sort | uniq -c

Look at these numbers

What do you notice about where we tend to see humans getting infected?

Moral of the story

Small composable tools offer a lot of power and flexibility.

From very simple pieces we were able to compute answers to very nontrivial queries. This is the promise of Unix composability, and why we love Unix for bioinformatics.

Is anyone else getting tired of all this -d , -f 2,3,5 business?

Enter csvkit

We have csv... versions of many of the standard Unix utils:

csvcut, csvgrep, csvjoin, csvlook, csvsort, csvstack, csvstat

Advantages of csvkit

  • Can use header names, not just indices
  • Don’t have to specify -s ,/-d , every time
  • Headers are treated separately from the data
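
For example, the species column can be selected by name instead of remembering that it is column 3 (a sketch; as before, the header line still shows up in the counts here):

cut -d , -f 3 data/sfv.csv | sort | uniq -c      # the plain Unix version from earlier
csvcut -c species data/sfv.csv | sort | uniq -c  # csvkit version, selecting the column by name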

Looking at just the human sequences

csvgrep -c species -m human data/sfv.csv | csvlook | less -S

The one missing piece: csvuniq

# ~/bin is one of the places we can put programs
mkdir ~/bin

# Download script, and mark as executable
wget https://goo.gl/ZdNoUL -O ~/bin/csvuniq
chmod +x ~/bin/csvuniq

# Test with help message
csvuniq -h

(if this fails to work, try export PATH=~/bin:$PATH and module load intro-bio)

Now let’s see what our last example looks like:

csvuniq -c specimen,species,location data/sfv.csv | csvuniq -zc species,location | csvlook

Notice that csvuniq does all of the sorting and cutting for us…

Homework

For homework, go over these examples and translate them into equivalents using the csvkit utils.

Recommended reading:

  • Chapter 3, and chapter 7 (till around page 148).

Reading for next class (if you want to read ahead):

  • Chapter 12

Resources
