Introduction to Bioinformatics

Lesson 6 - Object Oriented Programming with Python

Object Oriented Programming

OOP: yeah, you know me…

Wherein from simple elements we construct towering monuments

What we’ll cover today

  • What is OOP?
  • Why do we care about it?
  • How do you use it?
  • How do you do it?
  • How much should you do it?

What is OOP?

OOP is a way of writing programs by bundling data and functionality into objects.

Why do we have OOP?

Large programs can be difficult to analyze; Keeping track of how data and computations relate to the state of the program is challenging.

OOP is a way of organizing data and computation with the goal of making it easier to conceptualize and deconstruct the system as a whole.

The Objects in OOP

Objects are entities which bundle data together with computations we think of as the objects “behaviors”. The data is organized into attributes and the computations into methods.

Objects are typically mutable: Methods can modify (the state of) the attributes of the objects on which they’re called.

Classes

Classes are a way of defining the shape and behavior of objects. In class based OOP, we say that objects are instances of a class.

Example: the string "cat" is an instance of the class str (surprise!).

Class based OOP is the most common form in modern languages, and indeed, python’s OOP is class based.

That’s it in a nutshell

Any questions?

How do I use other’s OOP

This is the easy part.

You can consume OOP data and functionality without having to structure your own code using these techniques.

A pertinent example

Let’s look at the Biopython library (this should be accessible as long as you’ve loaded your python modules).

# First import the namespace/module
import Bio.SeqIO as SeqIO

# Next, create a sequence parser, specifying file and format
seqrecords = SeqIO.parse("data/sfv.fasta", "fasta")

# Now let's inspect things a bit (bonus: intro to generators!)
seqrecords
# You can use the `next` method to get the "next" thing from a generator.
seqrecords.next()
seqrecords.next()

Let’s look at one of these seqrecords in more detail

# More investigating
seqrecord = seqrecords.next()
seqrecord
type(seqrecord)

Attributes

Objects have attributes. You can see what attributes an object has with dir(object), and access them with object.attribute_name.

dir(seqrecord)
seqrecord.name

Let’s look at the seqrecord’s seq

# Here's where the actual "sequence" is
seqrecord.seq
s = seqrecord.seq
type(s)
dir(s)

So an attribute of an object can be another object. It’s objects all the way down!

Separation of Concerns

One of the things OO encourages us to do is separate concerns; Classes and objects naturally do this..

A SeqRecord is concerned with the abstract record of a sequence; name, id, description, etc. The Seq is concerned with the actual nucleotide sequence. An instance of SeqRecord will have a seq attribute pointing to a Seq instance.

Attributes are open

We can add our own attributes to an object.

s.canhaz_catgifs = True
print s.canhaz_catgifs

This is an open system.

Attributes can be functions

s.transcribe
# <bound method Seq.transcribe of Seq(...)>

# You can call these functions just like normal functions (appending `(...)`)
s.transcribe()

# Compare with
from Bio.Seq import Seq
Seq.transcribe
# <unbound method Seq.transcribe>
Seq.transcribe(s)

We call functions bound to objects methods. Methods describe the “behavior” of objects.

Let this sink in…

# as a function
Seq.transcribe(s)
# as a method
s.transcribe()

xs = [1, 2, 3]
# as a function
list.append(xs, 4)
# as a method
xs.append(4)

In a class, methods are attributes pointing to simple functions. Objects have attributes pointing to copies of these functions where the first argument is bound to the object in question.

Using constructors

We got this object from SeqIO, but we can also construct objects using constructors.

By default, classes act as constructors for their objects when called as a function.

Seq("AGCT")

That’s it for using OOP

  • Construct objects using constructors
  • Access object attributes
    • object.attributename
  • Call methods on objects
    • methods are just functions bound to objects as attributes of those objects
    • object.method(arguments)
  • Classes act as containers for the unbound methods associated with objects

Any Questions?

How do I do OOP?

A simple class, modeling data we saw last week:

# Specify the name of the class, and inherit from object
class Person(object):
    # This defines the constructor function
    def __init__(self, name, age, occupation):
        # assign attributes passed through constructor
        self.name = name
        self.age = age
        self.occupation = occupation

p = Person("Bob Jones", 42, "haxxor")
print p

(Note: 2 underscores (_) on either side of init)

What about the functionality bit

Suppose we add email (this is pretend code & won’t run):

import email

class Person(object):
    # add email to constructor
    def __init__(self, name, age, occ, email):
        self.name = name
        self.age = age
        self.occupation = occ
        self.email = email
    # adding a method looks like defining a function
    def notify_via_email(self, message):
        email.send(self.email, message)

p = Person("Bob Jones", 42, "haxxor", "bob@jon.es")
p.notify_via_email("your cat was found")

Some things about what just happened

(And how this all works)

  • When we define class Person, we get a constructor function named Person that returns an instance of class Person.
  • When this function is called, a blank object gets created, and the __init__ function is called with self bound to the new object. This function initializes the object.
  • Other methods follow the same pattern; We get a Person.notify_via_email function, and every instance p of Person gets a notify_via_email attribute set to a copy of this function with the first argument bound to p.

Let’s contrast this with non-OOP

import email

def notify_via_email(person, message):
    email.send(person["email"], message)

# This is just another way of creating a dictionary
p = dict(name="Bob Jones", age=42,
        occupation="haxxor", email="bob@jon.es")
notify_via_email(person, "your cat was found")

Shorter, but it’s less clear in notify_via_email what valid input is. Classes make this patent/explicit.

What if a company also has an email?

OOP: Have to write a separate method in the Company class. (There are other ways around this, but they’re a bit of work).

Non-OOP: As long as the company dictionary has an "email" key/value pair, the function will still work.

Moral of the story: Structure can come at the cost of flexibility.

But python is cool!

Methods are just attributes pointing to functions.

The attrs in an object mirror o.__dict__, so you can treat objects like they’re dictionaries!

Pretend Biopython didn’t exist

Let’s create it.

Setting up our editing environment

Close whatever else you were working on and/or open a new tmux-window with Ctrl-a c, then

cd ~/bioinfclass
vim scripts/seqs.py

Copying and pasting

By default, vim autoindents whatever we type into the terminal. This is great for writing code, but it can mess things up on copy/paste.

Some terminals/vim installations can tell when you are pasting text and will automatically turn off auto-indent. But if you find that your indentation is off when you paste things, first run :set paste from command mode before pasting to turn on paste-mode (thereby turning off autoindent), then afterwards run :set nopaste to turn autoindent back on.

Empty lines in REPL vs scripts

The following breaks in a REPL, but is ok in a script.

if something:
    do_stuff()

else:
    do_something_else()

Keep this in mind when copying pasting code.

Redefining SeqRecord

Add the following to our scripts/seqs.py file.

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
print s.name
print s.seq

Testing our code

With python, I like to be able to test my scripts as I write them using bash.

Open a new tmux pane using Ctrl-a |. Then run the following in the bash shell:

# First make sure we're in the right directory
cd ~/bioinfclass

# Run the script file we just created
python scripts/seqs.py

What do you see?

Now let’s add some functionality

In scripts/seqs.py, add fasta_string method so things look like:

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def fasta_string(self):
        string_rep = ">" + self.name + "\n"
        string_rep += self.seq + "\n"
        return string_rep

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()

Making the seq more robust

Suppose we want to add more functionality to the seq attribute? Add the following before our SeqRecord class:

# Create Seq as a "subclass" of str
class Seq(str):
    # If not adding any methods, must explicitly "pass"
    pass

s = Seq("ACGT")
print s

As a subclass of str, Seq will behave exactly like a string, except where we override or add behavior.

The whole file

# Create Seq as a "subclass" of str
class Seq(str):
    # If not adding any methods, must explicitly "pass"
    pass

s = Seq("ACGT")
print s

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def fasta_string(self):
        string_rep = ">" + self.name + "\n"
        string_rep += self.seq + "\n"
        return string_rep

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()

Adding functionality to Seq

class Seq(str):
    # A class attribute, available to all seqs
    complements = dict(A="T", T="A", C="G", G="C")
    def complement(self):
        # Create list of complemented characters
        bps = [self.complements[bp] for bp in self]
        # Join characters together, and Seq them
        joined_bps = "".join(bps)
        return Seq(joined_bps)

s = Seq("ACGT")
print s.complement()

The whole file

class Seq(str):
    # A class attribute, available to all seqs
    complements = dict(A="T", T="A", C="G", G="C")
    def complement(self):
        # Create list of complemented characters
        bps = [self.complements[bp] for bp in self]
        # Join characters together, and Seq them
        joined_bps = "".join(bps)
        return Seq(joined_bps)

s = Seq("ACGT")
print s.complement()

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def fasta_string(self):
        string_rep = ">" + self.name + "\n"
        string_rep += self.seq + "\n"
        return string_rep

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()

Note the class attribute

# The class and instances both get this attribute
print Seq.complements
print s.complements

Note: complements would frequently be called a class attribute, since this data lives in the class. The codecademy course calls them member variables, which is a more general term you don’t hear in Python as often.

What about RNA vs DNA?

We can solve this by further subclassing Seq.

class Seq(str):
    # Same method as before...
    def complement(self):
        bps = [self.complements[x] for x in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)

# But we move the class variable into subclasses
class DNASeq(Seq):
    complements = dict(A="T", T="A", C="G", G="C")

class RNASeq(Seq):
    complements = dict(A="U", U="A", C="G", G="C")

The whole file

class Seq(str):
    # A class attribute, available to all seqs
    def complement(self):
        bps = [self.complements[x] for x in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)

class DNASeq(Seq):
    complements = dict(A="T", T="A", C="G", G="C")

class RNASeq(Seq):
    complements = dict(A="U", U="A", C="G", G="C")

s = Seq("ACGT")
print s.complement()

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def fasta_string(self):
        string_rep = ">" + self.name + "\n"
        string_rep += self.seq + "\n"
        return string_rep

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()

Testing our subclasses

# Testing out our new classes
s = DNASeq("ACTG")
print s.complement()
s = RNASeq("ACUG")
print s.complement()

# Does our old class work by itself any more?
s = Seq("ACTG")
print s.complement()

The whole file

class Seq(str):
    # A class attribute, available to all seqs
    def complement(self):
        bps = [self.complements[x] for x in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)

class DNASeq(Seq):
    complements = dict(A="T", T="A", C="G", G="C")

class RNASeq(Seq):
    complements = dict(A="U", U="A", C="G", G="C")

# Testing out our new classes
s = DNASeq("ACTG")
print s.complement()
s = RNASeq("ACUG")
print s.complement()

# Does our old class work by itself any more?
s = Seq("ACTG")
print s.complement()

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def fasta_string(self):
        string_rep = ">" + self.name + "\n"
        string_rep += self.seq + "\n"
        return string_rep

s = SeqRecord("MBG234Gag1", "AGCTGTCGGTAAGTCGAGT")
# test out our new method
print s.fasta_string()

Woops! A bug!

When we take the complement, we get a Seq, but we should get either a DNASeq or an RNASeq (as appropriate)!

s = DNASeq("ACTG")
print type(s.complement())
s = RNASeq("ACUG")
print type(s.complement())

# Now we can't take complement of the complement of an RNASeq...
s.complement().complement()

Try solving this later :-) (Solutions at end of slides)

That’s all we’ll look at for doing OOP

Any Questions?

How hard should I OOP?

Design guidance vs flexibility

The pros and cons of OOP

Extreme Example: Java. Even printing “hello world” then exiting requires creating a class. All data and functionality must be bound to classes/objects. This can be very constraining.

The good with OOP: Easier to conceptualize, because we can think about programs being made of things, like we do in the real world.

The bad: Extra work, and code can become siloed and difficult to generalize or share between different parts of the program.

My recommendation

OOP is strong medicine; don’t overuse it.

Python gives us a choice here. Use classes when they help you conceptualize complexity. Avoid when they add complexity.

Try starting with plain functions. If you find yourself having trouble figuring out what functions should be getting which data, consider bundling things into classes.

Excercises

(Answers at end of slides)

Ex 1 - transcribe

Problem: Add a transcribe method to the DNASeq class and a reverse_transcribe method to the RNASeq class.


SideNote: biopython does this differently; Only one Seq class, and an alphabet attribute pointing to DNA/RNA. In this approach, calling reverse_transcribe on a Seq instance with a DNA alphabet gives an error. In contrast, with our approach reverse_transcribe is only defined if the Seq is an RNASeq.

Thinking through these architectural trade-offs is important in OOP.

Ex 2 - reverse_complement

  • Add a reserve_complement method to your Seq class, such that:
    • The method works with both DNASeq and RNASeq objects.

Ex 3 - reverse_complement (ctnd)

  • Add a reverse_complement method in your SeqRecord class, which returns a new seqrecord with:
    • seq attribute the reverse_complement of the original sequence.
    • a modified name attribute (append "_rc" to the original name).

Bonus Excercise

  • Make it so that when you create an instance of SeqRecord, you can just pass in a string for seq, and it will figure out whether it’s a DNASeq or a RNASeq based on whether it contains U or and T:
    • Should error out if it can’t tell.
    • Should still be possible to pass in an RNASeq or DNASeq directly, without it checking.

Bonus Excercise 2

Fix up the Seq, RNASeq and DNASeq class definitions so that DNASeq.complement returns a DNASeq object (similarly for RNASeq). Ideally:

  • Most of the complement code stays in the Seq class definition
  • Override the Seq.complement definition in RNASeq and DNASeq so that the return value is casted as the correct class

Hint: To solve this “ideally”, look into python’s super function.

Resources

Back to homepage

Excercise solutions:

Spoiler alert!

Ex 1

First the simple solution:

class DNASeq(Seq):
    # ...
    transcriptions = dict(T="U")
    def transcribe(self):
        # Note use of get
        bps = [self.transcriptions.get(bp, bp) for bp in self]
        joined_bps = "".join(bps)
        return RNASeq(joined_bps)

class RNASeq(Seq):
    # ...
    transcriptions = dict(U="T")
    def reverse_transcribe(self):
        bps = [self.transcriptions.get(bp, bp) for bp in self]
        joined_bps = "".join(bps)
        return DNASeq(joined_bps)

s = DNASeq("ACGT")
print "Transcription of", s, "is:", s.transcribe()

s = RNASeq("ACGU")
print "Reverse trans of", s, "is:", s.reverse_transcribe()

Ex 1 (ctnd)

This works, but note that it’s a lot of duplicate code. We’ll see a more elegant way to solve this using super (see Bonus problem 2).

Ex 2

First let’s solve this without worrying too much about the return type (just ensuring it’s a Seq).

class Seq(str):
    # ...
    def reverse_complement(self):
        # Take the complement and reverse the bps
        bps = reversed(self.complement())
        joined_bps = "".join(bps)
        return Seq(joined_bps)

s = DNASeq("ACGTCC")
print "Reverse comp of", s, "is:", s.reverse_complement()

s = RNASeq("ACGUCC")
print "Reverse comp of", s, "is:", s.reverse_complement()

Ex 2 (ctnd)

Note that if we felt like being more terse, we could omit all the intermediate variable assignments and just do everything in one line. You can decide for yourself what makes code most readable.

class Seq(str):
    # ...
    def reverse_complement(self):
        return Seq("".join(reversed(self.complement())))

Bonus 1

First, we’ll solve this just using simple string methods.

class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        # We'll assume a sequence without U or T is a DNASeq
        if seq.count("T") >= 0 and seq.count("U") == 0:
            seq = DNASeq(seq)
        elif seq.count("U") > 0 and seq.count("T") == 0:
            seq = RNASeq(seq)
        else:
            # Then we have both U and T, so fail
            raise ValueError, "Seq can't have both U and T"
        self.seq = seq

    # ...

s = SeqRecord("MBG1234", "AGTCTGATA")
print "Seqrecord seq type:", type(s.seq)

s = SeqRecord("MBG1234", "AGUCUGAUA")
print "Seqrecord seq type:", type(s.seq)

# This one should fail
SeqRecord("MBG1234", "AGTCUGATA")

Note that since the definition of SeqRecord here depends on Seq, you may have to move some code around so Seq is defined before SeqRecord is.

Bonus 1 (ctnd)

This isn’t the best solution: Right now, only capital letters are considered. A good way to fix this would be to look into regular expressions.

There is always something to do to make your code better :-)

Bonus 2

The simple solution to this is to just write two methods, like we did in the solution to Ex 1. However, there is a better way.

class DNASeq(Seq):
    # ...
    def complement(self):
        # Here we use the superclass' method
        return DNASeq(Seq.complement(self))

class RNASeq(Seq):
    # ...
    def complement(self):
        return RNASeq(Seq.complement(self))

s = DNASeq("ACGT")
print "Type of complement of", s, "is:", type(s.complement())

s = RNASeq("ACGU")
print "Type of complement of", s, "is:", type(s.complement())

Bonus 2 (ctnd)

Note: In the assignment, I suggested using super to solve this. After looking for a good explanation of super as a reference, I found this stackoverflow post, which gave me the idea to just refer to the unbound Seq.complement function directly. I think this is a lot easier to understand, and it accomplishes the same thing.

Revisiting Ex 2

We can use this super-like trick to fix our solution to Ex 2 as well:

class DNASeq(Seq):
    # ...
    def reverse_complement(self):
        return DNASeq(Seq.reverse_complement(self))

class RNASeq(Seq):
    # ...
    def reverse_complement(self):
        return RNASeq(Seq.reverse_complement(self))

s = DNASeq("ACGTCC")
print "Type of reverse comp of", s, "is:", type(s.reverse_complement()
s = RNASeq("ACGUCC")
print "Type of reverse comp of", s, "is:", type(s.reverse_complement()

Revisiting Ex 1

Similarly, we can refactor transcribe and reverse_transcribe.

class Seq(str):
    # ...
    def abstract_transcribe(self):
        bps = [self.transcriptions.get(bp, bp) for bp in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)


class DNASeq(Seq):
    # ...
    def transcribe(self):
        return RNASeq(Seq.abstract_transcribe(self))


class RNASeq(Seq):
    # ...
    def reverse_transcribe(self):
        return DNASeq(Seq.abstract_transcribe(self))

Everything all together

class Seq(str):
    def complement(self):
        bps = [self.complements[bp] for bp in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)

    def reverse_complement(self):
        return Seq("".join(reversed(self.complement())))

    def abstract_transcribe(self):
        bps = [self.transcriptions.get(bp, bp) for bp in self]
        joined_bps = "".join(bps)
        return Seq(joined_bps)


class DNASeq(Seq):
    complements = dict(A="T", T="A", C="G", G="C")
    transcriptions = dict(T="U")

    def complement(self):
        return DNASeq(Seq.complement(self))

    def reverse_complement(self):
        return DNASeq(Seq.reverse_complement(self))

    def transcribe(self):
        return RNASeq(Seq.abstract_transcribe(self))


class RNASeq(Seq):
    complements = dict(A="U", U="A", C="G", G="C")
    transcriptions = dict(U="T")

    def complement(self):
        return RNASeq(Seq.complement(self))

    def reverse_complement(self):
        return RNASeq(Seq.reverse_complement(self))

    def reverse_transcribe(self):
        return DNASeq(Seq.abstract_transcribe(self))



class SeqRecord(object):
    def __init__(self, name, seq):
        self.name = name
        # We'll assume a sequence without U or T is a DNASeq
        if seq.count("T") >= 0 and seq.count("U") == 0:
            seq = DNASeq(seq)
        elif seq.count("U") > 0 and seq.count("T") == 0:
            seq = RNASeq(seq)
        else:
            # Then we have both U and T, so fail
            raise ValueError, "Can't have a sequence with both U and T"
        self.seq = seq

    def fasta_string(self):
        string = ">" + self.name + "\n"
        string += self.seq + "\n"
        return string



s = DNASeq("ACGT")
print "Transcription of", s, "is:", s.transcribe()

s = RNASeq("ACGU")
print "Reverse transcription of", s, "is:", s.reverse_transcribe()

print ""


s = DNASeq("ACGT")
print "Complement of", s, "is:", s.complement()
print "Type of complement of", s, "is:", type(s.complement())

s = RNASeq("ACGU")
print "Complement of", s, "is:", s.complement()
print "Type of complement of", s, "is:", type(s.complement())

print ""


s = DNASeq("ACGTCC")
print "Reverse complement of", s, "is:", s.reverse_complement()
print "Type of reverse complement of", s, "is:", type(s.reverse_complement())

s = RNASeq("ACGUCC")
print "Reverse complement of", s, "is:", s.reverse_complement()
print "Type of reverse complement of", s, "is:", type(s.reverse_complement())

print ""


s = SeqRecord("MBG1234", "AGTCTGATA")
print "Seqrecord seq type:", type(s.seq)
#print s.fasta_string()

s = SeqRecord("MBG1234", "AGUCUGAUA")
print "Seqrecord seq type:", type(s.seq)

s = SeqRecord("MBG1234", "AGTCUGATA")

This page left intentionally almost blank