Introduction to Bioinformatics

Lesson 7 - Writing Python programs

Writing your own program

Wherein with our knowledge we take flight and soar through the clouds of logic

Today

  • Dive head first into writing a python script to compute something interesting
  • Hook this up to our bash pipeline
  • Talk about some higher level things to keep in mind when writing programs
  • Reflect on everything we’ve learned, and think again about the “big picture”

A biological question

In one of our Unix classes we revealed there seems to be a much more human infection in Bormi. We’d like to know why.

Hypothesis: A more diverse viral population is responsible for the higher incidence of human infection.

Objective

Let’s write a script that:

  • Takes a newick tree file
  • Computes the average distance between sequences/tree-tips
  • Prints to stdout
  • Takes an optional argument to return data in csv format
  • Takes another optional argument to add a location to the csv output

We’ll hook this into our pipeline to compare the viral diversity between locations.

Starting things off

Go to your project directory and create a new file at scripts/treedists.py using vim:

tmux attach
cd ~/bioinfclass
vim scripts/treedists.py

Python shebang

When vim starts, insert the following shebang:

#!/usr/bin/env python

This tells the shell to use whatever python environment is currently active. This means our intro-bio modules will be available to us.

Setting up as executable (+ tmux panes)

Save the file. Then Create and switch to a new tmux pane with Ctrl-a |, and enter cd ~/bioinfclass into the new bash prompt.

In the bash shell type chmod +x scripts/treedists.py to mark our file as executable.

Use Ctrl-a → to switch back and forth between panes.

Command line arguments

In vim, add the following after our shebang:

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

args = get_args()
print "args is:", args
print "treefile is:", args.treefile

Try running with ./scripts/treedists.py sometreefile.

Setting up a main function

This is a common pattern in many languages. A main function is a place for putting all of your “top-level” program logic.

def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

This let’s us import the file from another program as a module, without executing the main function.

Whole file

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Loading our data

Let’s start by setting up our main function to load the tree data. Edit the main function to look like this:

# At the top below other imports, add:
from Bio import Phylo

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

Save and run ./scripts/treedists.py output/Karamjal/tree.nw You’ll see an object representation of the clades on the phylogenetic tree.

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Quick note on clades

A clade on a phylogenetic tree is some node in the tree together with all of it’s decendants.

Each tip forms a clade, and where two tips join, we have a clade, and where that joins we have a clade, etc, all the way to the root. Thus, we can think of the entire tree structure in terms of clades (and Bio.Phylo does this).

We want average distance between the tips of this tree

How should we do this?

What to do with the tree

Strategy:

  • Find all tree tips
  • For each pair of distinct tree tips:
    • Find the distance through the tree between those tree tips
  • Average all such distances

REPL driven development

Switch to the bash pane (Ctrl-a →), then open a new vertical pane with Ctrl-a -, and cd ~/bioinfclass. Next, start a python REPL (python):

from Bio import Phylo
# This returns an generator of trees, so we'll use next to get the first (and only)
trees = Phylo.parse('output/tree.nw', 'newick')
t = trees.next()

# Inspect attributes
dir(t)

Question: How do we get the tips of the tree?

What do the terminals look like?

terminals = t.get_terminals()
print terminals

# Let's see what the first of these looks like
tip = terminals[0]
dir(tip)

How do we get the name?

Getting the name

Yup, you guessed it…

tip.name

How do we get a list of all names?

We have a list of tips.

How do we get a list of all the tips names?

List of all names

Let’s use list comprehension:

tipnames = [tip.name for tip in terminals]

What about all pairs of names?

(Note that we don’t want duplicate pairs like ["seq1", "seq2"] and ["seq2", "seq1"], and also don’t need to count ["seq1", "seq1"])

All pairs of names

name_pairs = set()
for tip1 in tipnames:
    for tip2 in tipnames:
        if tip1 != tip2:
            pair = [tip1, tip2]
            # So we don't have duplicates
            sorted_pair = sorted(pair)
            tuple_pair = tuple(sorted_pair)
            name_pairs.add(tuple_pair)

list(name_pairs)[0:10]
len(name_pairs)

Let’s make a function of this!

This is a good idea when you have a bunch of modular logic.

Back in our file (Ctrl-a →):

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

How about finding distance?

Back to the REPL (Ctrl-a → for over, and Ctrl-a ↑ for up/down):

# What function should we try here?
dir(t)

Finding distance

name_pair = list(name_pairs)[0]
# We can assign multiple variables at once :-)
# This is called "destructuring"
t1, t2 = name_pair
t.distance(t1, t2)

Putting it all together

Back in our treedist.py file:

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Hooking this up in main

def main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    print "Average distance:", average_distance(tree)

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Let’s test

Save, then go back to the bash pane and try running again:

./scripts/treedists.py output/Karamjal/tree.nw

Adding a --csv output option

This will let us do a csvstack to get a single data set.

First the arguments (back in treedists.py file):

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("treefile")
    # Add optional args; store_true makes this a toggle
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

And in main:

def main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        # We'll define this later
        write_csv(result)
    else:
        print "Average distance:", result

Whole file

import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Filling out write_csv:

import csv
import sys

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Testing again

From bash:

# Argparse is awesome!
./scripts/treedists.py --help
# Now let's try running
./scripts/treedists.py --csv output/Karamjal/tree.nw
# Old behaviour still works
./scripts/treedists.py output/Karamjal/tree.nw

Adding location

Save, then back in treedists.py:

def get_args():
    # ...
    # Add the argument
    parser.add_argument("-l", "--location")
    # ...

def main():
    # ...
    # Next pass the location to write_csv
    if args.csv:
        write_csv(result, args.location)
    # ...

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Adding location (to write_csv)

def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])

Whole file

import csv
import sys
import argparse
from Bio import Phylo

def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()

def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)

def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)

def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])

def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result

# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()

Testing out

Save, then back in bash:

# Look at new help
./scripts/treedists.py --help
# Try out new option
./scripts/treedists.py --csv output/Karamjal/tree.nw -l karamjal
# Old behaviour still works
./scripts/treedists.py --csv output/Karamjal/tree.nw

Hooking this up in build.sh

Switch to the vim pane, and close treedists.py. Then vim build.sh and enter:

# At the top make sure scripts is in our PATH
PATH=./scripts/:$PATH

Hooking this up (ctnd)

treedist_files=()

for location in ${locations[*]}
do
  # ...

  # Build a location tree
  loc_tree="$loc_outdir/tree.nw"
  FastTree -seed 1234 -nt $loc_alignment > $loc_tree

  treedists="$loc_outdir/treedist.csv"
  treedists.py $loc_tree --csv -l $location > $treedists

  treedist_files+=($treedists)
done

Hooking this up (final)

  # ...
  treedist_files+=($treedists)
done

# Aggregate into single CSV
all_treedists="$outdir/all_treedists.csv"
csvstack ${treedist_files[*]} | csvsort -c average_distance > $all_treedists

Save, then try running with ./build.sh (but first comment out all FastTree and muscle lines to save time). When done run csvlook output/all_treedists.csv.

Other programming things

Importing our code from another namespace

To import code from this namespace, we have to create an empty __init__.py file in the directory in which it resides. You can do this in bash with touch scripts/__init__.py.

Now back in the python REPL:

# note this does't run the main function
from scripts import treedists
dir(treedists)
treedists.unique_pairs([1,2,3,4])

A bit about debugging

You’ll probably spend 2x the time you spent writing your software debugging it.

This is as much of an art as a science. Think of yourself as a detective.

  • Put the clues together
  • Investigate by printing things out as the program runs
  • Try import pdb; pdb.set_trace()

Be careful with state

When possible, emphasize functions that don’t modify data in place, but rather return new data. (See import copy; help(copy.copy) for copying mutable data)

Upside: Your programs will be a lot easier to analyze and think about in the portions that aren’t stateful.

Downside: Performance can hit you in some situations.

Back to the high level

Stop and think about everything we’ve learned:

  • Plain text data is awesome
  • Unix for gluing things together
  • Git for keeping track of what you do
  • Markdown README for explaining what you do
  • Python for more advanced logic
  • R for plotting & canned stats
  • Package up reusable things and make scripts/libraries/programs of them

Above all

  • Explore
  • Hypothesize
  • Verify

Questions?

Excercise 1

Add an argument to treedists.py that let’s you specify the output file.

Excercise 2

Pick a problem and write a little python program for yourself!

Resources

  • Building python packages/modules - For when you want to build bigger python programs, or share/release small/big programs.
  • Virtualenv - Managing python packages can be a pain, because you can’t install two versions of the same package 😒. Virtualenv is a tool that let’s you create and switch between different environments to avoid conflicts. (Also good for switching between python 2/3).
  • Bioscons - If you check out my post on using SCons for bioinformatics, you might be interested in checking out the slurm module of bioscons. This let’s you distribute computation across a SLURM cluster (such as the one FHCRC has).

(More next slide)

Resources (ctnd)

  • Biopython Tutorial and Cookbook - For further help using biopython (the biopython wiki documentation is also quite good).
  • Pandas - R-like data.frames in python, complete with plyr and reshape like semantics.
  • Scipi / Numpi - Tools for doing more computationally intensive work (scientific/math functions, stats, matrix algebra, etc.)
  • Codewars - Haven’t tried it, but looks like a fun set of programming challenges in multiple languages.

That’s all folks!

Back to homepage