Introduction to Bioinformatics

Lesson 7 - Writing Python programs

Writing your own program

Wherein with our knowledge we take flight and soar through the clouds of logic

Today

Dive head first into writing a python script to compute something interesting
Hook this up to our bash pipeline
Talk about some higher level things to keep in mind when writing programs
Reflect on everything we’ve learned, and think again about the “big picture”

A biological question

In one of our Unix classes we revealed there seems to be a much more human infection in Bormi. We’d like to know why.

Hypothesis: A more diverse viral population is responsible for the higher incidence of human infection.

Objective

Let’s write a script that:

Takes a newick tree file
Computes the average distance between sequences/tree-tips
Prints to stdout
Takes an optional argument to return data in csv format
Takes another optional argument to add a location to the csv output

We’ll hook this into our pipeline to compare the viral diversity between locations.

Starting things off

Go to your project directory and create a new file at scripts/treedists.py using vim:

tmux attach
cd ~/bioinfclass
vim scripts/treedists.py

Python shebang

When vim starts, insert the following shebang:

#!/usr/bin/env python

This tells the shell to use whatever python environment is currently active. This means our intro-bio modules will be available to us.

Setting up as executable (+ tmux panes)

Save the file. Then Create and switch to a new tmux pane with Ctrl-a |, and enter cd ~/bioinfclass into the new bash prompt.

In the bash shell type chmod +x scripts/treedists.py to mark our file as executable.

Use Ctrl-a → to switch back and forth between panes.

Command line arguments

In vim, add the following after our shebang:

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

args = get_args()
print "args is:", args
print "treefile is:", args.treefile

Try running with ./scripts/treedists.py sometreefile.

Setting up a `main` function

This is a common pattern in many languages. A main function is a place for putting all of your “top-level” program logic.

def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

This let’s us import the file from another program as a module, without executing the main function.

Whole file

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Loading our data

Let’s start by setting up our main function to load the tree data. Edit the main function to look like this:

# At the top below other imports, add:
from Bio import Phylo

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

Save and run ./scripts/treedists.py output/Karamjal/tree.nw You’ll see an object representation of the clades on the phylogenetic tree.

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Quick note on clades

A clade on a phylogenetic tree is some node in the tree together with all of it’s decendants.

Each tip forms a clade, and where two tips join, we have a clade, and where that joins we have a clade, etc, all the way to the root. Thus, we can think of the entire tree structure in terms of clades (and Bio.Phylo does this).

We want average distance between the tips of this tree

How should we do this?

What to do with the tree

Strategy:

Find all tree tips
For each pair of distinct tree tips:
- Find the distance through the tree between those tree tips
Average all such distances

REPL driven development

Switch to the bash pane (Ctrl-a →), then open a new vertical pane with Ctrl-a -, and cd ~/bioinfclass. Next, start a python REPL (python):

from Bio import Phylo
# This returns an generator of trees, so we'll use next to get the first (and only)
trees = Phylo.parse('output/tree.nw', 'newick')
t = trees.next()

# Inspect attributes
dir(t)

Question: How do we get the tips of the tree?

What do the terminals look like?

terminals = t.get_terminals()
print terminals

# Let's see what the first of these looks like
tip = terminals[0]
dir(tip)

How do we get the name?

Getting the name

Yup, you guessed it…

tip.name

How do we get a list of all names?

We have a list of tips.

How do we get a list of all the tips names?

List of all names

Let’s use list comprehension:

tipnames = [tip.name for tip in terminals]

What about all pairs of names?

(Note that we don’t want duplicate pairs like ["seq1", "seq2"] and ["seq2", "seq1"], and also don’t need to count ["seq1", "seq1"])

All pairs of names

name_pairs = set()
for tip1 in tipnames:
    for tip2 in tipnames:
        if tip1 != tip2:
            pair = [tip1, tip2]
            # So we don't have duplicates
            sorted_pair = sorted(pair)
            tuple_pair = tuple(sorted_pair)
            name_pairs.add(tuple_pair)

list(name_pairs)[0:10]
len(name_pairs)

Let’s make a function of this!

This is a good idea when you have a bunch of modular logic.

Back in our file (Ctrl-a →):

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

How about finding distance?

Back to the REPL (Ctrl-a → for over, and Ctrl-a ↑ for up/down):

# What function should we try here?
dir(t)

Finding distance

name_pair = list(name_pairs)[0]
# We can assign multiple variables at once :-)
# This is called "destructuring"
t1, t2 = name_pair
t.distance(t1, t2)

Putting it all together

Back in our treedist.py file:

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Hooking this up in `main`

def main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    print "Average distance:", average_distance(tree)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Let’s test

Save, then go back to the bash pane and try running again:

./scripts/treedists.py output/Karamjal/tree.nw

Adding a `--csv` output option

This will let us do a csvstack to get a single data set.

First the arguments (back in treedists.py file):

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("treefile")
    # Add optional args; store_true makes this a toggle
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

And in `main`:

def main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        # We'll define this later
        write_csv(result)
    else:
        print "Average distance:", result

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Filling out `write_csv`:

import csv
import sys

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Testing again

From bash:

# Argparse is awesome!
./scripts/treedists.py --help
# Now let's try running
./scripts/treedists.py --csv output/Karamjal/tree.nw
# Old behaviour still works
./scripts/treedists.py output/Karamjal/tree.nw

Adding location

Save, then back in treedists.py:

def get_args():
    # ...
    # Add the argument
    parser.add_argument("-l", "--location")
    # ...

def main():
    # ...
    # Next pass the location to write_csv
    if args.csv:
        write_csv(result, args.location)
    # ...

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Adding location (to `write_csv`)

def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Testing out

Save, then back in bash:

# Look at new help
./scripts/treedists.py --help
# Try out new option
./scripts/treedists.py --csv output/Karamjal/tree.nw -l karamjal
# Old behaviour still works
./scripts/treedists.py --csv output/Karamjal/tree.nw

Hooking this up in `build.sh`

Switch to the vim pane, and close treedists.py. Then vim build.sh and enter:

# At the top make sure scripts is in our PATH
PATH=./scripts/:$PATH

Hooking this up (ctnd)

treedist_files=()

for location in ${locations[*]}
do
  # ...

  # Build a location tree
  loc_tree="$loc_outdir/tree.nw"
  FastTree -seed 1234 -nt $loc_alignment > $loc_tree

  treedists="$loc_outdir/treedist.csv"
  treedists.py $loc_tree --csv -l $location > $treedists

  treedist_files+=($treedists)
done

Hooking this up (final)

  # ...
  treedist_files+=($treedists)
done

# Aggregate into single CSV
all_treedists="$outdir/all_treedists.csv"
csvstack ${treedist_files[*]} | csvsort -c average_distance > $all_treedists

Save, then try running with ./build.sh (but first comment out all FastTree and muscle lines to save time). When done run csvlook output/all_treedists.csv.

Other programming things

Importing our code from another namespace

To import code from this namespace, we have to create an empty __init__.py file in the directory in which it resides. You can do this in bash with touch scripts/__init__.py.

Now back in the python REPL:

# note this does't run the main function
from scripts import treedists
dir(treedists)
treedists.unique_pairs([1,2,3,4])

A bit about debugging

You’ll probably spend 2x the time you spent writing your software debugging it.

This is as much of an art as a science. Think of yourself as a detective.

Put the clues together
Investigate by printing things out as the program runs
Try import pdb; pdb.set_trace()

Be careful with state

When possible, emphasize functions that don’t modify data in place, but rather return new data. (See import copy; help(copy.copy) for copying mutable data)

Upside: Your programs will be a lot easier to analyze and think about in the portions that aren’t stateful.

Downside: Performance can hit you in some situations.

Back to the high level

Stop and think about everything we’ve learned:

Plain text data is awesome
Unix for gluing things together
Git for keeping track of what you do
Markdown README for explaining what you do
Python for more advanced logic
R for plotting & canned stats
Package up reusable things and make scripts/libraries/programs of them

Above all

Explore
Hypothesize
Verify

Questions?

Excercise 1

Add an argument to treedists.py that let’s you specify the output file.

Excercise 2

Pick a problem and write a little python program for yourself!

Resources

Building python packages/modules - For when you want to build bigger python programs, or share/release small/big programs.
Virtualenv - Managing python packages can be a pain, because you can’t install two versions of the same package 😒. Virtualenv is a tool that let’s you create and switch between different environments to avoid conflicts. (Also good for switching between python 2/3).
Bioscons - If you check out my post on using SCons for bioinformatics, you might be interested in checking out the slurm module of bioscons. This let’s you distribute computation across a SLURM cluster (such as the one FHCRC has).

(More next slide)

Resources (ctnd)

Biopython Tutorial and Cookbook - For further help using biopython (the biopython wiki documentation is also quite good).
Pandas - R-like data.frames in python, complete with plyr and reshape like semantics.
Scipi / Numpi - Tools for doing more computationally intensive work (scientific/math functions, stats, matrix algebra, etc.)
Codewars - Haven’t tried it, but looks like a fun set of programming challenges in multiple languages.

That’s all folks!

Back to homepage

Introduction to Bioinformatics

Lesson 7 - Writing Python programs

Writing your own program

Today

A biological question

Objective

Starting things off

Python shebang

Setting up as executable (+ tmux panes)

Command line arguments

Setting up a main function

Whole file

Loading our data

Whole file

Quick note on clades

We want average distance between the tips of this tree

What to do with the tree

REPL driven development

What do the terminals look like?

Getting the name

How do we get a list of all names?

List of all names

What about all pairs of names?

All pairs of names

Let’s make a function of this!

Whole file

How about finding distance?

Finding distance

Putting it all together

Whole file

Hooking this up in main

Whole file

Let’s test

Adding a --csv output option

Whole file

And in main:

Whole file

Filling out write_csv:

Whole file

Testing again

Adding location

Whole file

Adding location (to write_csv)

Whole file

Testing out

Hooking this up in build.sh

Hooking this up (ctnd)

Hooking this up (final)

Other programming things

Importing our code from another namespace

A bit about debugging

Be careful with state

Back to the high level

Above all

Questions?

Excercise 1

Excercise 2

Resources

Resources (ctnd)

That’s all folks!

Setting up a `main` function

Hooking this up in `main`

Adding a `--csv` output option

And in `main`:

Filling out `write_csv`:

Adding location (to `write_csv`)

Hooking this up in `build.sh`