Wherein with our knowledge we take flight and soar through the clouds of logic
In one of our Unix classes we revealed there seems to be a much more human infection in Bormi. We’d like to know why.
Hypothesis: A more diverse viral population is responsible for the higher incidence of human infection.
Let’s write a script that:
We’ll hook this into our pipeline to compare the viral diversity between locations.
Go to your project directory and create a new file at scripts/treedists.py
using vim
:
tmux attach
cd ~/bioinfclass
vim scripts/treedists.py
When vim starts, insert the following shebang:
#!/usr/bin/env python
This tells the shell to use whatever python environment is currently active. This means our intro-bio
modules will be available to us.
Save the file. Then Create and switch to a new tmux pane with Ctrl-a |
, and enter cd ~/bioinfclass
into the new bash prompt.
In the bash shell type chmod +x scripts/treedists.py
to mark our file as executable.
Use Ctrl-a →
to switch back and forth between panes.
In vim, add the following after our shebang:
import argparse
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
args = get_args()
print "args is:", args
print "treefile is:", args.treefile
Try running with ./scripts/treedists.py sometreefile
.
main
functionThis is a common pattern in many languages. A main function is a place for putting all of your “top-level” program logic.
def main():
args = get_args()
print "args is:", args
print "treefile is:", args.treefile
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
This let’s us import
the file from another program as a module, without executing the main
function.
import argparse
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
def main():
args = get_args()
print "args is:", args
print "treefile is:", args.treefile
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
Let’s start by setting up our main function to load the tree data. Edit the main function to look like this:
# At the top below other imports, add:
from Bio import Phylo
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print tree
Save and run ./scripts/treedists.py output/Karamjal/tree.nw
You’ll see an object representation of the clades on the phylogenetic tree.
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
A clade on a phylogenetic tree is some node in the tree together with all of it’s decendants.
Each tip forms a clade, and where two tips join, we have a clade, and where that joins we have a clade, etc, all the way to the root. Thus, we can think of the entire tree structure in terms of clades (and Bio.Phylo
does this).
How should we do this?
Strategy:
Switch to the bash pane (Ctrl-a →
), then open a new vertical pane with Ctrl-a -
, and cd ~/bioinfclass
. Next, start a python REPL (python
):
from Bio import Phylo
# This returns an generator of trees, so we'll use next to get the first (and only)
trees = Phylo.parse('output/tree.nw', 'newick')
t = trees.next()
# Inspect attributes
dir(t)
Question: How do we get the tips of the tree?
terminals = t.get_terminals()
print terminals
# Let's see what the first of these looks like
tip = terminals[0]
dir(tip)
How do we get the name?
Yup, you guessed it…
tip.name
We have a list of tips.
How do we get a list of all the tips names?
Let’s use list comprehension:
tipnames = [tip.name for tip in terminals]
(Note that we don’t want duplicate pairs like ["seq1", "seq2"]
and ["seq2", "seq1"]
, and also don’t need to count ["seq1", "seq1"]
)
name_pairs = set()
for tip1 in tipnames:
for tip2 in tipnames:
if tip1 != tip2:
pair = [tip1, tip2]
# So we don't have duplicates
sorted_pair = sorted(pair)
tuple_pair = tuple(sorted_pair)
name_pairs.add(tuple_pair)
list(name_pairs)[0:10]
len(name_pairs)
This is a good idea when you have a bunch of modular logic.
Back in our file (Ctrl-a →
):
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
Back to the REPL (Ctrl-a →
for over, and Ctrl-a ↑
for up/down):
# What function should we try here?
dir(t)
name_pair = list(name_pairs)[0]
# We can assign multiple variables at once :-)
# This is called "destructuring"
t1, t2 = name_pair
t.distance(t1, t2)
Back in our treedist.py
file:
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print tree
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
main
def main():
args = get_args()
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
print "Average distance:", average_distance(tree)
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print "Average distance:", average_distance(tree)
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
Save, then go back to the bash pane and try running again:
./scripts/treedists.py output/Karamjal/tree.nw
--csv
output optionThis will let us do a csvstack
to get a single data set.
First the arguments (back in treedists.py
file):
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("treefile")
# Add optional args; store_true makes this a toggle
parser.add_argument("--csv", action="store_true")
return parser.parse_args()
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
parser.add_argument("--csv", action="store_true")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
# For now, just print
print "Average distance:", average_distance(tree)
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
main
:def main():
args = get_args()
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
result = average_distance(tree)
if args.csv:
# We'll define this later
write_csv(result)
else:
print "Average distance:", result
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
parser.add_argument("--csv", action="store_true")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
result = average_distance(tree)
if args.csv:
write_csv(result)
else:
print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
write_csv
:import csv
import sys
def write_csv(result):
writer = csv.writer(sys.stdout)
writer.writerow(["average_distance"])
writer.writerow([result])
import csv
import sys
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
parser.add_argument("--csv", action="store_true")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def write_csv(result):
writer = csv.writer(sys.stdout)
writer.writerow(["average_distance"])
writer.writerow([result])
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
result = average_distance(tree)
if args.csv:
write_csv(result)
else:
print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
From bash:
# Argparse is awesome!
./scripts/treedists.py --help
# Now let's try running
./scripts/treedists.py --csv output/Karamjal/tree.nw
# Old behaviour still works
./scripts/treedists.py output/Karamjal/tree.nw
Save, then back in treedists.py
:
def get_args():
# ...
# Add the argument
parser.add_argument("-l", "--location")
# ...
def main():
# ...
# Next pass the location to write_csv
if args.csv:
write_csv(result, args.location)
# ...
import csv
import sys
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
parser.add_argument("--csv", action="store_true")
parser.add_argument("-l", "--location")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def write_csv(result):
writer = csv.writer(sys.stdout)
writer.writerow(["average_distance"])
writer.writerow([result])
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
result = average_distance(tree)
if args.csv:
write_csv(result, args.location)
else:
print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
write_csv
)def write_csv(result, location):
writer = csv.writer(sys.stdout)
# Note the inline if/else assignment
header = ["location", "average_distance"] if location else ["average_distance"]
main_row = [location, result] if location else [result]
writer.writerows([header, main_row])
import csv
import sys
import argparse
from Bio import Phylo
def get_args():
parser = argparse.ArgumentParser()
# add parsing rules
parser.add_argument("treefile")
parser.add_argument("--csv", action="store_true")
parser.add_argument("-l", "--location")
return parser.parse_args()
def unique_pairs(xs):
pairs = set()
for x1 in xs:
for x2 in xs:
if x1 != x2:
pair = [x1, x2]
pairs.add(tuple(sorted(pair)))
return list(pairs)
def average_distance(tree):
tipnames = [tip.name for tip in tree.get_terminals()]
tipname_pairs = unique_pairs(tipnames)
tip_dists = [tree.distance(t1, t2) for t1, t2 in tipname_pairs]
return sum(tip_dists) / len(tip_dists)
def write_csv(result, location):
writer = csv.writer(sys.stdout)
# Note the inline if/else assignment
header = ["location", "average_distance"] if location else ["average_distance"]
main_row = [location, result] if location else [result]
writer.writerows([header, main_row])
def main():
args = get_args()
# parse the tree file and return a generator
trees = Phylo.parse(args.treefile, 'newick')
tree = trees.next()
result = average_distance(tree)
if args.csv:
write_csv(result, args.location)
else:
print "Average distance:", result
# This only gets run if the file is being run as a script
if __name__ == '__main__':
main()
Save, then back in bash:
# Look at new help
./scripts/treedists.py --help
# Try out new option
./scripts/treedists.py --csv output/Karamjal/tree.nw -l karamjal
# Old behaviour still works
./scripts/treedists.py --csv output/Karamjal/tree.nw
build.sh
Switch to the vim
pane, and close treedists.py
. Then vim build.sh
and enter:
# At the top make sure scripts is in our PATH
PATH=./scripts/:$PATH
treedist_files=()
for location in ${locations[*]}
do
# ...
# Build a location tree
loc_tree="$loc_outdir/tree.nw"
FastTree -seed 1234 -nt $loc_alignment > $loc_tree
treedists="$loc_outdir/treedist.csv"
treedists.py $loc_tree --csv -l $location > $treedists
treedist_files+=($treedists)
done
# ...
treedist_files+=($treedists)
done
# Aggregate into single CSV
all_treedists="$outdir/all_treedists.csv"
csvstack ${treedist_files[*]} | csvsort -c average_distance > $all_treedists
Save, then try running with ./build.sh
(but first comment out all FastTree
and muscle
lines to save time). When done run csvlook output/all_treedists.csv
.
To import code from this namespace, we have to create an empty __init__.py
file in the directory in which it resides. You can do this in bash with touch scripts/__init__.py
.
Now back in the python REPL:
# note this does't run the main function
from scripts import treedists
dir(treedists)
treedists.unique_pairs([1,2,3,4])
You’ll probably spend 2x the time you spent writing your software debugging it.
This is as much of an art as a science. Think of yourself as a detective.
import pdb; pdb.set_trace()
When possible, emphasize functions that don’t modify data in place, but rather return new data. (See import copy; help(copy.copy)
for copying mutable data)
Upside: Your programs will be a lot easier to analyze and think about in the portions that aren’t stateful.
Downside: Performance can hit you in some situations.
Stop and think about everything we’ve learned:
Add an argument to treedists.py
that let’s you specify the output file.
Pick a problem and write a little python program for yourself!
slurm
module of bioscons. This let’s you distribute computation across a SLURM cluster (such as the one FHCRC has).(More next slide)
data.frame
s in python, complete with plyr and reshape like semantics.