Introduction to Bioinformatics

Lesson 2 - Unix Foundations

The Shell

Shell : Bioinformatician   ::   Water : Fish


  • Philosophy
  • Tools
  • Cluster computing

The Unix Philosophy

  • Small tools that do one thing right
  • Plumb tools together into pipelines/scripts
  • Embrace plain text/data

All of this is in service of composability.

How does this help bioinformatics?

Breaking down Unix

Directories, files and paths, oh my!

In which we see the static world of things

In Unix, (almost) everything is a file

  • programs
  • devices
  • /dev/null
  • etc.

And files have paths

Paths are the addresses for files, and are based on directories.

Directories (aka “folders”) are containers

…for more directories and files…

Some special paths & directories:

  • / - The “root” of the filesystem; everything lives here
  • . - Current directory (depends on where you are)
  • .. - Parent directory (depends on where you are)
  • ~ - Home directory

Composing paths

We can also use / to represent how paths relate to each other:

cd ~ # set our "working directory" to be home

# First try listing bioinfclass contents
ls bioinfclass

# Now try listing data directory contents
ls data

# ^ fails (there is no data directory in ~); what about a path relative to our current working directory?
ls bioinfclass/data

# We could also say
ls ./bioinfclass/data
ls ~/bioinfclass/data
ls /home/<yourusername>/bioinfclass/data  # ALL can be traced to /

More on current working directory

Every process in Unix (including your shell sessions) will have a working directory associated with it, and you can see this with pwd. Files can be accessed relative to this directory.

cd ~
pwd

# list the contents of this directory
ls bioinfclass/data

cd bioinfclass
pwd

# now data is fine relative to our current working directory
ls data

Exercise: going up the directory chain

Now that we are in the ~/bioinfclass directory, how could we refer to another file in ~, like your .tmux.conf?

Going up the directory chain

ls ~/.tmux.conf
ls /home/<yourusername>/.tmux.conf
# Using relative paths
ls ../.tmux.conf

We can change the path of a file using mv

# move the data directory to ~, our current directory
cd ~
mv bioinfclass/data .

ls

ls data

ls bioinfclass

Exercise: How do we get data back? (click down for answer)

Getting data back

mv data bioinfclass

mv is for renaming as well

# nothing to see here...
mv bioinfclass catgifs

ls

Think of mv as updating the “path” stamp on a file

Exercise: How do we get bioinfclass back? (down for answer)

Getting bioinfclass back

mv catgifs bioinfclass

Creating and removing files

We can make new directories or files with mkdir and touch, and remove with rm and rmdir

  • mkdir name will create a directory name
  • touch name will create an empty file name
  • rm name will remove a file called name
  • rmdir name will remove an empty directory called name
  • rm -r name will remove name and all its contents (CAREFUL!)

Exercise: create a file called birds in a new subdirectory called dinosaurs of your home directory
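
One possible solution, sketched using just the commands above (assuming you want dinosaurs directly under your home directory):

mkdir ~/dinosaurs        # create the new subdirectory in your home directory
touch ~/dinosaurs/birds  # create an empty file called birds inside it

# check your work
ls ~/dinosaurs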

Commands

In which things take flight and evolve!

Anatomy of a Unix command

command [flags] [operands]

  • command: name of the program
  • operands: positional based arguments
    • often but not always required
  • flags: name based arguments
    • position frequently ignored
    • usually for controlling details of program behavior
    • often not required
    • sometimes take arguments, sometimes not

More on flag shape

  • sometimes single dash and character (like -h)
  • sometimes double dash and word (like --help)
  • sometimes single dash and word (like -help; esp Java programs)
  • sometimes there are short and long options (like -h or --help)
  • sometimes short flags can be munged together (-v -x -f the-file == -vxf the-file)
  • occasional insanity like dd if=somefile of=somenewfile

Ultimately, programs can interpret their arguments however they choose! These are just conventions.
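
As a concrete dissection, here is a command we'll run later in this lesson (nothing new to type, just labels):

column -t -s , data/sfv.csv
# column        - the command (name of the program)
# -t            - a flag taking no argument (format the input as a table)
# -s ,          - a flag taking an argument (use , as the field separator)
# data/sfv.csv  - the operand (the file to operate on)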

Sequence alignments

As genetic sequences evolve, base pairs can be added or dropped. Alignment tries to figure out where these insertion/deletion events happened so that individual sites in an alignment can be compared between sequences:

AAGGCCTT            AAGGCCTT
ACGGCCTT    ==>>    ACGGCCTT
AAGCCTA             AAG-CCTA
ACGGCCTA            ACGGCCTA

Note: the - character in the alignment represents a putative deletion.

Making a sequence alignment

cd ~/bioinfclass
muscle -maxiters 2 -in data/sfv.fasta -out output/alignment.fasta

# Get bored waiting and want to work on something else...
<Ctrl-a |> # create new pane

Programs are just files!

which tree

# -l: "long" list output; lots of info that we will ignore except for `x` presence/absence
ls -l /usr/bin/tree

# super secret shortcut
ls -l $(which tree)

# Notice no x; not executable...
ls -l README.md

What else is in /usr/bin?

ls /usr/bin => LOTS OF STUFF!

This is one of the many places your shell looks for programs

How does the shell know where to look for programs?

echo $PATH => Your PATH environment variable; a : separated list of places your shell looks for programs.

This can be edited if you want to let your shell know to look other places as well (see .bashrc)
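
For example, a line like this in your .bashrc (a sketch; ~/bin is just one sensible choice, and we'll use it later in this lesson) adds an extra place for the shell to look:

export PATH=~/bin:$PATH  # prepend ~/bin to the list of directories searched for programs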

Questions?

Pipes and streams

In which our constructions in flight don tongues and communicate!

A bit about std files

There are three special files that programs often read from and write to by default to facilitate composability:

  • stdin (“standard input”) - where a program reads its input from by default
  • stdout (“standard output”) - where a program writes its main output by default
  • stderr (“standard error”) - where a program writes error and status messages

For the most part we’ll ignore stderr

By default, stdout gets printed to the screen

But we can redirect stdout to a file using >

Program vs Program > out-file

Example of redirection

FastTree -nt output/alignment.fasta

vs

FastTree -nt output/alignment.fasta > output/tree.nw

 

[make sure you’re in ~/bioinfclass, and if your shell can’t see FastTree, don’t forget to module load intro-bio!]

We can also | stdout of one command to stdin of another

Prog1 | Prog2

Example

column -t -s , data/sfv.csv | less -S

  • column (a program for working with tabular data) writes its output to stdout
    • -t produces aligned, tabular output
    • -s , specifies , (comma) as the input separator
  • less reads from /dev/stdin if no file is specified
    • -S tells less not to wrap long lines

You should see something like this

sequence         specimen   species   gene   location     ...
BGH150WBGag2     BGH150     human     gag    Charmaguria  ...
BGH31WBGag2      BGH31      human     gag    Bormi        ...
BGH150WBGag4     BGH150     human     gag    Charmaguria  ...
BGH150WBGag8     BGH150     human     gag    Charmaguria  ...
MBG103WBGag101   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag3      BGH31      human     gag    Bormi        ...
MBG103WBGag102   MBG103     monkey    gag    Narayanganj  ...
BGH31WBGag1      BGH31      human     gag    Bormi        ...
MBG103WBGag104   MBG103     monkey    gag    Narayanganj  ...
MBG104WBGag101   MBG104     monkey    gag    Narayanganj  ...
...

Programs have to know to use stdin / stdout for this to work. Some conventions:

  • Assume first argument is input, second output
  • If only one argument is specified, output to stdout
  • If no arguments specified, input from stdin, output to stdout
  • Sometimes there is no “output” argument; assume stdout
  • Sometimes programs treat - as a shortcut for stdin
  • When all else fails, you can read from & write to paths /dev/stdout, /dev/stdin directly
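
A couple of these conventions in action (sketches; whether a given program accepts - or /dev/stdin varies, so check its documentation):

# less treats - as a shortcut for stdin, like many programs
column -t -s , data/sfv.csv | less -S -

# reading from the /dev/stdin path directly
column -t -s , data/sfv.csv | less -S /dev/stdin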

More about stdin

Q: What does stdin default to?

(down for answer)

Q: What does stdin default to?

A: Your keyboard!

Reading from stdin

Recall that cat writes to stdout whatever files you pass it as args. If no files are specified though, it reads from stdin.

cat
  • Enter some text using your keyboard
  • Use the key command Ctrl-d to close the stdin file
  • cat will print out whatever you typed in

The shell reads from stdin!

Your bash shell works very much like cat:

  • Read line of stdin file
  • Evaluate that line
  • Print stdout to screen
  • Loop back and start over

Note: this is why Ctrl-d works to quit the shell! You’re literally closing the stdin file that the shell REPL is reading from.
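
You can see the “read from stdin, evaluate” behavior directly by feeding a command to a fresh bash process on its stdin (a small sketch):

echo "ls ~/bioinfclass" | bash  # bash reads the line from stdin, evaluates it, and the output appears on your screen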

That’s it for pipes and streams

Questions?

Practicum!

In which we apply our knowledge of these flying, speaking beasts

Head first…

The head command writes to stdout the first several lines of a file (or of stdin, if no file is given).

head data/sfv.csv

column -t -s , data/sfv.csv | head

This is useful for looking at the very top – or head – of big files (for a CSV, that means the header plus a few rows of data).

Looking at alignments

It’s important to always check your alignments to make sure they look “reasonable” (that is, more or less like a plausible history, not completely “random”).

# We'll use the `alnvu` program
av -L 99999 -w 99999 -c output/alignment.fasta | less -S

av is an alignment viewer which – with these settings – prints the differences from consensus for each sequence, making it easy to scan.

We can also look at our tree

nw_display output/tree.nw | less -S

Analyzing data

In these examples we’ll be combining several shell commands using pipes in order to perform multistep computations, effectively turning the shell into a programming language.

Side note: strategy

The way to work through and understand these examples:

  • run the first part of the command and think about what it’s doing
  • add the next part of the pipe to see what that does
  • repeat, till through the whole command
  • reflect and think about the strategy used to put each of these pieces together towards the end product

This is analogous to how we typically write these computations, breaking a computation down into steps corresponding to known shell commands, one at a time.

Sequences per species

There is a species column in our data set with human and monkey entries.


Question: How would we figure out how many human and monkey sequences there are in the data?

Sequences per species

# How many human sequences are there?
grep human data/sfv.csv | wc -l

# What about monkey sequences?
grep monkey data/sfv.csv | wc -l
  • grep searches through files for a pattern
  • wc -l counts the number of lines
  • Thus, grep ... | wc -l counts the number of lines in a file matching some pattern

There’s one line per sequence, so this gives us what we’re after.

What are some problems with this?

Some problems with this

We’re not accounting for the tabular structure:

  • We can’t restrict matches to just a particular column in the data
  • We have to do this twice; if we had many categories, this could take forever

A better way

How many sequences are there for each species?

cut -d , -f 3 data/sfv.csv | sort | uniq -c
  • cut -d , -f 3 extracts the third , separated entry for each line
  • sort | uniq -c counts occurrences for each unique entry

The count printed next to "monkey" will then be the number of sequences taking that value in the species column.

How to visualize this

The first few columns in the data

|-----------------+----------+---------+------+-------------|
|  sequence       | specimen | species | gene | location    |
|-----------------+----------+---------+------+-------------|
|  BGH150WBGag2   | BGH150   | human   | gag  | Charmaguria |
|  BGH31WBGag2    | BGH31    | human   | gag  | Bormi       |
|  BGH150WBGag4   | BGH150   | human   | gag  | Charmaguria |
|  BGH150WBGag8   | BGH150   | human   | gag  | Charmaguria |
|  MBG103WBGag101 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag3    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag102 | MBG103   | monkey  | gag  | Narayanganj |
|  BGH31WBGag1    | BGH31    | human   | gag  | Bormi       |
|  MBG103WBGag104 | MBG103   | monkey  | gag  | Narayanganj |
|  MBG104WBGag101 | MBG104   | monkey  | gag  | Narayanganj |
|  ...

The results of cutting out the third column

cut -d , -f 3 data/sfv.csv

|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| monkey  |
| human   |
| monkey  |
| human   |
| monkey  |
| monkey  |
| ...

The result of sorting

cut -d , -f 3 data/sfv.csv | sort

|---------|
| species |
|---------|
| human   |
| human   |
| human   |
| human   |
| human   |
| human   |
| ...
| monkey  |
| monkey  |
| monkey  |
| monkey  |
| ...

The result of counting

cut -d , -f 3 data/sfv.csv | sort | uniq -c

  77 human
1097 monkey
   1 species

(The 1 species line is just the CSV header row, which gets counted along with the data.)

How would we count how many specimens there are per species?

Counting specimens per species

We can do this with a two step version of the cut | sort | uniq pattern we used above:

# First, uniq specimens, with species info
cut -d , -f 2,3 data/sfv.csv | sort | uniq | head

# Next count that by species
cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c

Columns 2 and 3 cut out of data

cut -d , -f 2,3 data/sfv.csv

|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| BGH150   | human   |
| BGH150   | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...

The results of the first sort and uniq (without -c)

cut -d , -f 2,3 data/sfv.csv | sort | uniq

|----------+---------|
| specimen | species |
|----------+---------|
| BGH150   | human   |
| BGH31    | human   |
| MBG103   | monkey  |
| MBG104   | monkey  |
| ...

We now have exactly 1 row per specimen (species just “comes along for the ride”).

Now we cut out the species, and count

cut -d , -f 2,3 data/sfv.csv | sort | uniq | cut -d , -f 2 | sort | uniq -c

  8 human
169 monkey
  1 species

Since there’s 1 row per specimen in the previous result, counting these rows by species gives us the number of specimens per species.

What if we wanted to count number of specimens per species for each location?

What would we do?

Specimens by location and species

cut -d , -f 2,3,5 data/sfv.csv | sort | uniq | cut -d , -f 2,3 | sort | uniq -c

Look at these numbers

What do you notice about where we tend to see humans getting infected?

Moral of the story

Small composable tools offer a lot of power and flexibility.

From very simple pieces we were able to compute answers to very nontrivial queries. This is the promise of Unix composability, and why we love Unix for bioinformatics.

Is anyone else getting tired of all this -d , -f 2,3,5 business?

Enter csvkit

We have csv... versions of many of the standard Unix utils:

csvcut, csvgrep, csvjoin, csvlook, csvsort, csvstack, csvstat

Advantages of csvkit

  • Can use header names, not just indices
  • Don’t have to specify -s ,/-d , every time
  • Headers are treated separately from the data
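
For example, the species column can be selected by name instead of remembering that it is column 3 (a sketch; as before, the header line still shows up in the counts here):

cut -d , -f 3 data/sfv.csv | sort | uniq -c      # the plain Unix version from earlier
csvcut -c species data/sfv.csv | sort | uniq -c  # csvkit version, selecting the column by name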

Looking at just the human sequences

csvgrep -c species -m human data/sfv.csv | csvlook | less -S

The one missing piece: csvuniq

# ~/bin is one of the places we can put programs
mkdir ~/bin

# Download script, and mark as executable
wget https://goo.gl/ZdNoUL -O ~/bin/csvuniq
chmod +x ~/bin/csvuniq

# Test with help message
csvuniq -h

(if this fails to work, try export PATH=~/bin:$PATH and module load intro-bio)

Now let’s see what our last example looks like:

csvuniq -c specimen,species,location data/sfv.csv | csvuniq -zc species,location | csvlook

Notice that csvuniq does all of the sorting and cutting for us…

Homework

For homework, go over these examples and translate them into equivalents using the csvkit utils.

Recommended reading:

  • Chapter 3, and chapter 7 (till around page 148).

Reading for next class (if you want to read ahead):

  • Chapter 12

Resources
