Wherein with our knowledge we take flight and soar through the clouds of logic
In one of our Unix classes we revealed there seems to be a much more human infection in Bormi. We’d like to know why.
Hypothesis: A more diverse viral population is responsible for the higher incidence of human infection.
Let’s write a script that:
We’ll hook this into our pipeline to compare the viral diversity between locations.
Go to your project directory and create a new file at scripts/treedists.py using vim:
tmux attach
cd ~/bioinfclass
vim scripts/treedists.pyWhen vim starts, insert the following shebang:
#!/usr/bin/env pythonThis tells the shell to use whatever python environment is currently active. This means our intro-bio modules will be available to us.
Save the file. Then Create and switch to a new tmux pane with Ctrl-a |, and enter cd ~/bioinfclass into the new bash prompt.
In the bash shell type chmod +x scripts/treedists.py to mark our file as executable.
Use Ctrl-a → to switch back and forth between panes.
In vim, add the following after our shebang:
import argparse
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
args = get_args()
print "args is:", args
print "treefile is:", args.treefileTry running with ./scripts/treedists.py sometreefile.
main functionThis is a common pattern in many languages. A main function is a place for putting all of your “top-level” program logic.
def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()This let’s us import the file from another program as a module, without executing the main function.
import argparse
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
def main():
    args = get_args()
    print "args is:", args
    print "treefile is:", args.treefile
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()Let’s start by setting up our main function to load the tree data. Edit the main function to look like this:
# At the top below other imports, add:
from Bio import Phylo
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print treeSave and run ./scripts/treedists.py output/Karamjal/tree.nw You’ll see an object representation of the clades on the phylogenetic tree.
import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()A clade on a phylogenetic tree is some node in the tree together with all of it’s decendants.
Each tip forms a clade, and where two tips join, we have a clade, and where that joins we have a clade, etc, all the way to the root. Thus, we can think of the entire tree structure in terms of clades (and Bio.Phylo does this).
How should we do this?
Strategy:
Switch to the bash pane (Ctrl-a →), then open a new vertical pane with Ctrl-a -, and cd ~/bioinfclass. Next, start a python REPL (python):
from Bio import Phylo
# This returns an generator of trees, so we'll use next to get the first (and only)
trees = Phylo.parse('output/tree.nw', 'newick')
t = trees.next()
# Inspect attributes
dir(t)Question: How do we get the tips of the tree?
terminals = t.get_terminals()
print terminals
# Let's see what the first of these looks like
tip = terminals[0]
dir(tip)How do we get the name?
Yup, you guessed it…
tip.nameWe have a list of tips.
How do we get a list of all the tips names?
Let’s use list comprehension:
tipnames = [tip.name for tip in terminals](Note that we don’t want duplicate pairs like ["seq1", "seq2"] and ["seq2", "seq1"], and also don’t need to count ["seq1", "seq1"])
name_pairs = set()
for tip1 in tipnames:
    for tip2 in tipnames:
        if tip1 != tip2:
            pair = [tip1, tip2]
            # So we don't have duplicates
            sorted_pair = sorted(pair)
            tuple_pair = tuple(sorted_pair)
            name_pairs.add(tuple_pair)
list(name_pairs)[0:10]
len(name_pairs)This is a good idea when you have a bunch of modular logic.
Back in our file (Ctrl-a →):
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()Back to the REPL (Ctrl-a → for over, and Ctrl-a ↑ for up/down):
# What function should we try here?
dir(t)name_pair = list(name_pairs)[0]
# We can assign multiple variables at once :-)
# This is called "destructuring"
t1, t2 = name_pair
t.distance(t1, t2)Back in our treedist.py file:
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()maindef main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    print "Average distance:", average_distance(tree)import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()Save, then go back to the bash pane and try running again:
./scripts/treedists.py output/Karamjal/tree.nw--csv output optionThis will let us do a csvstack to get a single data set.
First the arguments (back in treedists.py file):
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("treefile")
    # Add optional args; store_true makes this a toggle
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    # For now, just print
    print "Average distance:", average_distance(tree)
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()main:def main():
    args = get_args()
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        # We'll define this later
        write_csv(result)
    else:
        print "Average distance:", resultimport argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()write_csv:import csv
import sys
def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])import csv
import sys
import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result)
    else:
        print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()From bash:
# Argparse is awesome!
./scripts/treedists.py --help
# Now let's try running
./scripts/treedists.py --csv output/Karamjal/tree.nw
# Old behaviour still works
./scripts/treedists.py output/Karamjal/tree.nwSave, then back in treedists.py:
def get_args():
    # ...
    # Add the argument
    parser.add_argument("-l", "--location")
    # ...
def main():
    # ...
    # Next pass the location to write_csv
    if args.csv:
        write_csv(result, args.location)
    # ...import csv
import sys
import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def write_csv(result):
    writer = csv.writer(sys.stdout)
    writer.writerow(["average_distance"])
    writer.writerow([result])
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()write_csv)def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])import csv
import sys
import argparse
from Bio import Phylo
def get_args():
    parser = argparse.ArgumentParser()
    # add parsing rules
    parser.add_argument("treefile")
    parser.add_argument("--csv", action="store_true")
    parser.add_argument("-l", "--location")
    return parser.parse_args()
def unique_pairs(xs):
    pairs = set()
    for x1 in xs:
        for x2 in xs:
            if x1 != x2:
                pair = [x1, x2]
                pairs.add(tuple(sorted(pair)))
    return list(pairs)
def average_distance(tree):
    tipnames = [tip.name for tip in tree.get_terminals()]
    tipname_pairs = unique_pairs(tipnames)
    tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
    return sum(tip_dists) / len(tip_dists)
def write_csv(result, location):
    writer = csv.writer(sys.stdout)
    # Note the inline if/else assignment
    header = ["location", "average_distance"] if location else ["average_distance"]
    main_row = [location, result] if location else [result]
    writer.writerows([header, main_row])
def main():
    args = get_args()
    # parse the tree file and return a generator
    trees = Phylo.parse(args.treefile, 'newick')
    tree = trees.next()
    result = average_distance(tree)
    if args.csv:
        write_csv(result, args.location)
    else:
        print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
    main()Save, then back in bash:
# Look at new help
./scripts/treedists.py --help
# Try out new option
./scripts/treedists.py --csv output/Karamjal/tree.nw -l karamjal
# Old behaviour still works
./scripts/treedists.py --csv output/Karamjal/tree.nwbuild.shSwitch to the vim pane, and close treedists.py. Then vim build.sh and enter:
# At the top make sure scripts is in our PATH
PATH=./scripts/:$PATHtreedist_files=()
for location in ${locations[*]}
do
  # ...
  # Build a location tree
  loc_tree="$loc_outdir/tree.nw"
  FastTree -seed 1234 -nt $loc_alignment > $loc_tree
  treedists="$loc_outdir/treedist.csv"
  treedists.py $loc_tree --csv -l $location > $treedists
  treedist_files+=($treedists)
done  # ...
  treedist_files+=($treedists)
done
# Aggregate into single CSV
all_treedists="$outdir/all_treedists.csv"
csvstack ${treedist_files[*]} | csvsort -c average_distance > $all_treedistsSave, then try running with ./build.sh (but first comment out all FastTree and muscle lines to save time). When done run csvlook output/all_treedists.csv.
To import code from this namespace, we have to create an empty __init__.py file in the directory in which it resides. You can do this in bash with touch scripts/__init__.py.
Now back in the python REPL:
# note this does't run the main function
from scripts import treedists
dir(treedists)
treedists.unique_pairs([1,2,3,4])You’ll probably spend 2x the time you spent writing your software debugging it.
This is as much of an art as a science. Think of yourself as a detective.
import pdb; pdb.set_trace()When possible, emphasize functions that don’t modify data in place, but rather return new data. (See import copy; help(copy.copy) for copying mutable data)
Upside: Your programs will be a lot easier to analyze and think about in the portions that aren’t stateful.
Downside: Performance can hit you in some situations.
Stop and think about everything we’ve learned:
Add an argument to treedists.py that let’s you specify the output file.
Pick a problem and write a little python program for yourself!
slurm module of bioscons. This let’s you distribute computation across a SLURM cluster (such as the one FHCRC has).(More next slide)
data.frames in python, complete with plyr and reshape like semantics.