Wherein from simple elements we construct towering monuments
OOP is a way of writing programs by bundling data and functionality into objects.
Large programs can be difficult to analyze; Keeping track of how data and computations relate to the state of the program is challenging.
OOP is a way of organizing data and computation with the goal of making it easier to conceptualize and deconstruct the system as a whole.
Objects are entities which bundle data together with computations we think of as the objects “behaviors”. The data is organized into attributes and the computations into methods.
Objects are typically mutable: Methods can modify (the state of) the attributes of the objects on which they’re called.
Classes are a way of defining the shape and behavior of objects. In class based OOP, we say that objects are instances of a class.
Example: the string "cat"
is an instance of the class str
(surprise!).
Class based OOP is the most common form in modern languages, and indeed, python’s OOP is class based.
Any questions?
This is the easy part.
You can consume OOP data and functionality without having to structure your own code using these techniques.
Let’s look at the Biopython library (this should be accessible as long as you’ve loaded your python modules).
# First import the namespace/module
import Bio.SeqIO as SeqIO
# Next, create a sequence parser, specifying file and format
seqrecords = SeqIO.parse("data/sfv.fasta", "fasta")
# Now let's inspect things a bit (bonus: intro to generators!)
seqrecords
# You can use the `next` method to get the "next" thing from a generator.
seqrecords.next()
seqrecords.next()
seqrecords
in more detail# More investigating
seqrecord = seqrecords.next()
seqrecord
type(seqrecord)
Objects have attributes. You can see what attributes an object has with dir(object)
, and access them with object.attribute_name
.
dir(seqrecord)
seqrecord.name
seq
# Here's where the actual "sequence" is
seqrecord.seq
s = seqrecord.seq
type(s)
dir(s)
So an attribute of an object can be another object. It’s objects all the way down!
One of the things OO encourages us to do is separate concerns; Classes and objects naturally do this..
A SeqRecord
is concerned with the abstract record of a sequence; name, id, description, etc. The Seq
is concerned with the actual nucleotide sequence. An instance of SeqRecord
will have a seq
attribute pointing to a Seq
instance.
We can add our own attributes to an object.
s.canhaz_catgifs = True
print s.canhaz_catgifs
This is an open system.
s.transcribe
# <bound method Seq.transcribe of Seq(...)>
# You can call these functions just like normal functions (appending `(...)`)
s.transcribe()
# Compare with
from Bio.Seq import Seq
Seq.transcribe
# <unbound method Seq.transcribe>
Seq.transcribe(s)
We call functions bound to objects methods. Methods describe the “behavior” of objects.
# as a function
Seq.transcribe(s)
# as a method
s.transcribe()
xs = [1, 2, 3]
# as a function
list.append(xs, 4)
# as a method
xs.append(4)
In a class, methods are attributes pointing to simple functions. Objects have attributes pointing to copies of these functions where the first argument is bound to the object in question.
We got this object from SeqIO, but we can also construct objects using constructors.
By default, classes act as constructors for their objects when called as a function.
Seq("AGCT")
object.attributename
object.method(arguments)
Any Questions?
A simple class, modeling data we saw last week:
# Specify the name of the class, and inherit from object
class Person(object):
# This defines the constructor function
def __init__(self, name, age, occupation):
# assign attributes passed through constructor
self.name = name
self.age = age
self.occupation = occupation
p = Person("Bob Jones", 42, "haxxor")
print p
(Note: 2 underscores (_
) on either side of init
)
Suppose we add email (this is pretend code & won’t run):
import email
class Person(object):
# add email to constructor
def __init__(self, name, age, occ, email):
self.name = name
self.age = age
self.occupation = occ
self.email = email
# adding a method looks like defining a function
def notify_via_email(self, message):
email.send(self.email, message)
p = Person("Bob Jones", 42, "haxxor", "bob@jon.es")
p.notify_via_email("your cat was found")
(And how this all works)
Person
, we get a constructor function named Person
that returns an instance of class Person
.__init__
function is called with self
bound to the new object. This function initializes the object.Person.notify_via_email
function, and every instance p
of Person
gets a notify_via_email
attribute set to a copy of this function with the first argument bound to p
.import email
def notify_via_email(person, message):
email.send(person["email"], message)
# This is just another way of creating a dictionary
p = dict(name="Bob Jones", age=42,
occupation="haxxor", email="bob@jon.es")
notify_via_email(person, "your cat was found")
Shorter, but it’s less clear in notify_via_email
what valid input is. Classes make this patent/explicit.
OOP: Have to write a separate method in the Company
class. (There are other ways around this, but they’re a bit of work).
Non-OOP: As long as the company
dictionary has an "email"
key/value pair, the function will still work.
Moral of the story: Structure can come at the cost of flexibility.
Methods are just attributes pointing to functions.
The attrs in an object mirror o.__dict__
, so you can treat objects like they’re dictionaries!
Let’s create it.
Close whatever else you were working on and/or open a new tmux-window with Ctrl-a c
, then
cd ~/bioinfclass
vim scripts/seqs.py
By default, vim autoindents whatever we type into the terminal. This is great for writing code, but it can mess things up on copy/paste.
Some terminals/vim installations can tell when you are pasting text and will automatically turn off auto-indent. But if you find that your indentation is off when you paste things, first run :set paste
from command mode before pasting to turn on paste-mode (thereby turning off autoindent), then afterwards run :set nopaste
to turn autoindent back on.
The following breaks in a REPL, but is ok in a script.
if something:
do_stuff()
else:
do_something_else()
Keep this in mind when copying pasting code.
Add the following to our scripts/seqs.py
file.
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
print s.name
print s.seq
With python, I like to be able to test my scripts as I write them using bash.
Open a new tmux pane using Ctrl-a |
. Then run the following in the bash shell:
# First make sure we're in the right directory
cd ~/bioinfclass
# Run the script file we just created
python scripts/seqs.py
What do you see?
In scripts/seqs.py
, add fasta_string
method so things look like:
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
def fasta_string(self):
string_rep = ">" + self.name + "\n"
string_rep += self.seq + "\n"
return string_rep
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()
seq
more robustSuppose we want to add more functionality to the seq
attribute? Add the following before our SeqRecord
class:
# Create Seq as a "subclass" of str
class Seq(str):
# If not adding any methods, must explicitly "pass"
pass
s = Seq("ACGT")
print s
As a subclass of str
, Seq
will behave exactly like a string, except where we override or add behavior.
# Create Seq as a "subclass" of str
class Seq(str):
# If not adding any methods, must explicitly "pass"
pass
s = Seq("ACGT")
print s
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
def fasta_string(self):
string_rep = ">" + self.name + "\n"
string_rep += self.seq + "\n"
return string_rep
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()
Seq
class Seq(str):
# A class attribute, available to all seqs
complements = dict(A="T", T="A", C="G", G="C")
def complement(self):
# Create list of complemented characters
bps = [self.complements[bp] for bp in self]
# Join characters together, and Seq them
joined_bps = "".join(bps)
return Seq(joined_bps)
s = Seq("ACGT")
print s.complement()
class Seq(str):
# A class attribute, available to all seqs
complements = dict(A="T", T="A", C="G", G="C")
def complement(self):
# Create list of complemented characters
bps = [self.complements[bp] for bp in self]
# Join characters together, and Seq them
joined_bps = "".join(bps)
return Seq(joined_bps)
s = Seq("ACGT")
print s.complement()
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
def fasta_string(self):
string_rep = ">" + self.name + "\n"
string_rep += self.seq + "\n"
return string_rep
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()
# The class and instances both get this attribute
print Seq.complements
print s.complements
Note: complements
would frequently be called a class attribute, since this data lives in the class. The codecademy course calls them member variables, which is a more general term you don’t hear in Python as often.
We can solve this by further subclassing Seq
.
class Seq(str):
# Same method as before...
def complement(self):
bps = [self.complements[x] for x in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
# But we move the class variable into subclasses
class DNASeq(Seq):
complements = dict(A="T", T="A", C="G", G="C")
class RNASeq(Seq):
complements = dict(A="U", U="A", C="G", G="C")
class Seq(str):
# A class attribute, available to all seqs
def complement(self):
bps = [self.complements[x] for x in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
class DNASeq(Seq):
complements = dict(A="T", T="A", C="G", G="C")
class RNASeq(Seq):
complements = dict(A="U", U="A", C="G", G="C")
s = Seq("ACGT")
print s.complement()
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
def fasta_string(self):
string_rep = ">" + self.name + "\n"
string_rep += self.seq + "\n"
return string_rep
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()
# Testing out our new classes
s = DNASeq("ACTG")
print s.complement()
s = RNASeq("ACUG")
print s.complement()
# Does our old class work by itself any more?
s = Seq("ACTG")
print s.complement()
class Seq(str):
# A class attribute, available to all seqs
def complement(self):
bps = [self.complements[x] for x in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
class DNASeq(Seq):
complements = dict(A="T", T="A", C="G", G="C")
class RNASeq(Seq):
complements = dict(A="U", U="A", C="G", G="C")
# Testing out our new classes
s = DNASeq("ACTG")
print s.complement()
s = RNASeq("ACUG")
print s.complement()
# Does our old class work by itself any more?
s = Seq("ACTG")
print s.complement()
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
self.seq = seq
def fasta_string(self):
string_rep = ">" + self.name + "\n"
string_rep += self.seq + "\n"
return string_rep
s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()
When we take the complement, we get a Seq
, but we should get either a DNASeq
or an RNASeq
(as appropriate)!
s = DNASeq("ACTG")
print type(s.complement())
s = RNASeq("ACUG")
print type(s.complement())
# Now we can't take complement of the complement of an RNASeq...
s.complement().complement()
Try solving this later :-) (Solutions at end of slides)
Any Questions?
Design guidance vs flexibility
Extreme Example: Java. Even printing “hello world” then exiting requires creating a class. All data and functionality must be bound to classes/objects. This can be very constraining.
The good with OOP: Easier to conceptualize, because we can think about programs being made of things, like we do in the real world.
The bad: Extra work, and code can become siloed and difficult to generalize or share between different parts of the program.
OOP is strong medicine; don’t overuse it.
Python gives us a choice here. Use classes when they help you conceptualize complexity. Avoid when they add complexity.
Try starting with plain functions. If you find yourself having trouble figuring out what functions should be getting which data, consider bundling things into classes.
(Answers at end of slides)
transcribe
Problem: Add a transcribe
method to the DNASeq
class and a reverse_transcribe
method to the RNASeq
class.
SideNote: biopython does this differently; Only one Seq class, and an alphabet
attribute pointing to DNA/RNA. In this approach, calling reverse_transcribe
on a Seq
instance with a DNA alphabet gives an error. In contrast, with our approach reverse_transcribe
is only defined if the Seq
is an RNASeq
.
Thinking through these architectural trade-offs is important in OOP.
reverse_complement
reserve_complement
method to your Seq
class, such that:
DNASeq
and RNASeq
objects.reverse_complement
(ctnd)reverse_complement
method in your SeqRecord
class, which returns a new seqrecord with:
seq
attribute the reverse_complement
of the original sequence.name
attribute (append "_rc"
to the original name).SeqRecord
, you can just pass in a string for seq, and it will figure out whether it’s a DNASeq
or a RNASeq
based on whether it contains U or and T:
RNASeq
or DNASeq
directly, without it checking.Fix up the Seq
, RNASeq
and DNASeq
class definitions so that DNASeq.complement
returns a DNASeq
object (similarly for RNASeq
). Ideally:
complement
code stays in the Seq
class definitionSeq.complement
definition in RNASeq
and DNASeq
so that the return value is casted as the correct classHint: To solve this “ideally”, look into python’s super
function.
transcbribe
: http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#transcribe (click the “source code” link to see the source code)Spoiler alert!
First the simple solution:
class DNASeq(Seq):
# ...
transcriptions = dict(T="U")
def transcribe(self):
# Note use of get
bps = [self.transcriptions.get(bp, bp) for bp in self]
joined_bps = "".join(bps)
return RNASeq(joined_bps)
class RNASeq(Seq):
# ...
transcriptions = dict(U="T")
def reverse_transcribe(self):
bps = [self.transcriptions.get(bp, bp) for bp in self]
joined_bps = "".join(bps)
return DNASeq(joined_bps)
s = DNASeq("ACGT")
print "Transcription of", s, "is:", s.transcribe()
s = RNASeq("ACGU")
print "Reverse trans of", s, "is:", s.reverse_transcribe()
This works, but note that it’s a lot of duplicate code. We’ll see a more elegant way to solve this using super
(see Bonus problem 2).
First let’s solve this without worrying too much about the return type (just ensuring it’s a Seq).
class Seq(str):
# ...
def reverse_complement(self):
# Take the complement and reverse the bps
bps = reversed(self.complement())
joined_bps = "".join(bps)
return Seq(joined_bps)
s = DNASeq("ACGTCC")
print "Reverse comp of", s, "is:", s.reverse_complement()
s = RNASeq("ACGUCC")
print "Reverse comp of", s, "is:", s.reverse_complement()
Note that if we felt like being more terse, we could omit all the intermediate variable assignments and just do everything in one line. You can decide for yourself what makes code most readable.
class Seq(str):
# ...
def reverse_complement(self):
return Seq("".join(reversed(self.complement())))
First, we’ll solve this just using simple string methods.
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
# We'll assume a sequence without U or T is a DNASeq
if seq.count("T") >= 0 and seq.count("U") == 0:
seq = DNASeq(seq)
elif seq.count("U") > 0 and seq.count("T") == 0:
seq = RNASeq(seq)
else:
# Then we have both U and T, so fail
raise ValueError, "Seq can't have both U and T"
self.seq = seq
# ...
s = SeqRecord("MBG1234", "AGTCTGATA")
print "Seqrecord seq type:", type(s.seq)
s = SeqRecord("MBG1234", "AGUCUGAUA")
print "Seqrecord seq type:", type(s.seq)
# This one should fail
SeqRecord("MBG1234", "AGTCUGATA")
Note that since the definition of SeqRecord
here depends on Seq
, you may have to move some code around so Seq
is defined before SeqRecord
is.
This isn’t the best solution: Right now, only capital letters are considered. A good way to fix this would be to look into regular expressions.
There is always something to do to make your code better :-)
The simple solution to this is to just write two methods, like we did in the solution to Ex 1. However, there is a better way.
class DNASeq(Seq):
# ...
def complement(self):
# Here we use the superclass' method
return DNASeq(Seq.complement(self))
class RNASeq(Seq):
# ...
def complement(self):
return RNASeq(Seq.complement(self))
s = DNASeq("ACGT")
print "Type of complement of", s, "is:", type(s.complement())
s = RNASeq("ACGU")
print "Type of complement of", s, "is:", type(s.complement())
Note: In the assignment, I suggested using super
to solve this. After looking for a good explanation of super as a reference, I found this stackoverflow post, which gave me the idea to just refer to the unbound Seq.complement
function directly. I think this is a lot easier to understand, and it accomplishes the same thing.
We can use this super
-like trick to fix our solution to Ex 2 as well:
class DNASeq(Seq):
# ...
def reverse_complement(self):
return DNASeq(Seq.reverse_complement(self))
class RNASeq(Seq):
# ...
def reverse_complement(self):
return RNASeq(Seq.reverse_complement(self))
s = DNASeq("ACGTCC")
print "Type of reverse comp of", s, "is:", type(s.reverse_complement()
s = RNASeq("ACGUCC")
print "Type of reverse comp of", s, "is:", type(s.reverse_complement()
Similarly, we can refactor transcribe
and reverse_transcribe
.
class Seq(str):
# ...
def abstract_transcribe(self):
bps = [self.transcriptions.get(bp, bp) for bp in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
class DNASeq(Seq):
# ...
def transcribe(self):
return RNASeq(Seq.abstract_transcribe(self))
class RNASeq(Seq):
# ...
def reverse_transcribe(self):
return DNASeq(Seq.abstract_transcribe(self))
class Seq(str):
def complement(self):
bps = [self.complements[bp] for bp in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
def reverse_complement(self):
return Seq("".join(reversed(self.complement())))
def abstract_transcribe(self):
bps = [self.transcriptions.get(bp, bp) for bp in self]
joined_bps = "".join(bps)
return Seq(joined_bps)
class DNASeq(Seq):
complements = dict(A="T", T="A", C="G", G="C")
transcriptions = dict(T="U")
def complement(self):
return DNASeq(Seq.complement(self))
def reverse_complement(self):
return DNASeq(Seq.reverse_complement(self))
def transcribe(self):
return RNASeq(Seq.abstract_transcribe(self))
class RNASeq(Seq):
complements = dict(A="U", U="A", C="G", G="C")
transcriptions = dict(U="T")
def complement(self):
return RNASeq(Seq.complement(self))
def reverse_complement(self):
return RNASeq(Seq.reverse_complement(self))
def reverse_transcribe(self):
return DNASeq(Seq.abstract_transcribe(self))
class SeqRecord(object):
def __init__(self, name, seq):
self.name = name
# We'll assume a sequence without U or T is a DNASeq
if seq.count("T") >= 0 and seq.count("U") == 0:
seq = DNASeq(seq)
elif seq.count("U") > 0 and seq.count("T") == 0:
seq = RNASeq(seq)
else:
# Then we have both U and T, so fail
raise ValueError, "Can't have a sequence with both U and T"
self.seq = seq
def fasta_string(self):
string = ">" + self.name + "\n"
string += self.seq + "\n"
return string
s = DNASeq("ACGT")
print "Transcription of", s, "is:", s.transcribe()
s = RNASeq("ACGU")
print "Reverse transcription of", s, "is:", s.reverse_transcribe()
print ""
s = DNASeq("ACGT")
print "Complement of", s, "is:", s.complement()
print "Type of complement of", s, "is:", type(s.complement())
s = RNASeq("ACGU")
print "Complement of", s, "is:", s.complement()
print "Type of complement of", s, "is:", type(s.complement())
print ""
s = DNASeq("ACGTCC")
print "Reverse complement of", s, "is:", s.reverse_complement()
print "Type of reverse complement of", s, "is:", type(s.reverse_complement())
s = RNASeq("ACGUCC")
print "Reverse complement of", s, "is:", s.reverse_complement()
print "Type of reverse complement of", s, "is:", type(s.reverse_complement())
print ""
s = SeqRecord("MBG1234", "AGTCTGATA")
print "Seqrecord seq type:", type(s.seq)
#print s.fasta_string()
s = SeqRecord("MBG1234", "AGUCUGAUA")
print "Seqrecord seq type:", type(s.seq)
s = SeqRecord("MBG1234", "AGTCUGATA")