Introduction to Bioinformatics

Lesson 5 - Python basics

What we’ll cover today

Big picture overview and contextualization of Python:

  • Basic data types: What kinds of data are available
  • Syntax: What constitutes a valid python expression
  • Variables: How we keep track of and pass around python data
  • Control flow: How we specify the behaviour of our program
  • Functions: How we package functionality

Conceptually, 90% of python can be explained in terms of these things

Next week

We’ll fill in the final 90%, and start building our own data structures using Object Oriented Programming (OOP).

These concepts can be explained largely in terms of those from this session.

Filling in the pieces

Please refer to the Codecademy Python course for more practice and fluency with these concepts, and how they relate to each other.

Context & motivation

Why do we care about Python?

Why/when shell scripts?

  • Plumbing together existing command line programs (duct tape)…

Not so good for actual logic or anything too complex

Why/when R?

  • Very tabular data (data.frames are great)
  • Very tabular data operations (plyr, dplyr, reshape2)
  • Plotting (ggplot2, shiny)
  • Ready to go statistics functions

However, R has some pain points…

Why/when python?

  • More flexible data structures
  • Better performance
  • More flexible / cleaner programming models => better for algorithmic development
  • Better organization for building bigger programs
  • Better stream (Unix stdin/stdout) support than R

Together, the three make a great combination

Don’t limit yourself unnecessarily to one tool…

Living in Unix makes it easy to use both R and Python in the context of a single project.

Other languages?

There are other languages that fit some of these characteristics, but python is nice because

  • It’s become popular amongst the scientific community
  • It’s relatively simpler in design than other languages
  • It’s relatively quick to write (not necessarily to run)

Python Basics

In which we investigate the building blocks of logic

Firing up a python “REPL”

REPL is short for Read Evaluate Print Loop.

python
> print "Hello world"

Now you can enter in little bits of code as we go (type Ctrl-d or exit to get out)

Basic data types

The elements of logic; our nouns

Questioning the Unix plain text philosophy

The Unix philosophy embraces plain text data.

  • In theory means universal interoperability, and composability
  • But at the expense of richness/standardization in the structure of data.

Python has many data types

Data types let us more easily manage information, and relate it to the things in the world it represents.

This comes at the expense of universal interoperability.

Scalar data types

The most basic data types are scalars. We think of them as atomic values.

  • strings: "this is a string"
  • ints: 42
  • floats: 3.14159
  • booleans: True

Collection types

These are “containers” for other types of data (scalars, other collections, more complex data types…).

Collection types

  • tuples: ("Bob Jones", 42)
    • immutable & fixed length
    • efficient and hashable
  • dictionaries: {"name": "Bob Jones", "age": 42}
    • mutable & variable length
    • unordered “key/value pairs”
    • often for representing data about some thing
  • lists: ["Bob Jones", "Jane Doe", "Ralph Nader"]
    • mutable & variable length
    • ordered & positionally indexed
    • for collecting things

When we use these

Tuples and dictionaries for organizing data about some single thing:

  • ("Bob Jones", 42)
  • {"name": "Bob Jones", "age": 42}

Lists for arbitrary collections of individual data points:

  • ["Bob Jones", "Jane Doe", "Ralph Nader"]
  • [("Bob Jones", 42), ("Jane Doe", 31), ("Ralph N.", 85)]

Sometimes dictionaries blur the lines here;

Here we use them more or less like a list, but where we index entries by name instead of position:

{"Bob Jones": {"age": 42, "occupation": "haxxor"},
 "Jane Doe": {"age": 31, "occupation": "mathematician"},
 "Ralph Nader": {"age": 85, "occupation": "politician"}}

Other collection types

We won’t go into these as deeply today, but in short, we tend to use them like lists:

  • generators:
    • “recipes” for a collection
    • generates each thing as you need it
    • good for “big data”
  • sets
    • unique entries
    • no order guarantees
    • mutable & variable length

Operations

In which we manipulate and reason about objects; our verbs

Numeric

  • Addition: 3 + 4
  • Subtraction: 3 - 1
  • Multiplication: 4 * 5
  • Division: 6 / 5
  • Exponentiation: 10 ** 2
  • Modulo (remainder): 42 % 5

You can create more complex expressions using parentheses for grouping: (4 + 5) * 2

String

+ works for strings as well:

"this" + "that" => "thisthat"

Boolean

  • Logical and: True & False or True and True
    • return True only if both operands are True
  • Logical or: False | True or False or False (not exclusive…)
    • return True if at least one of operands is True

Predict what these evaluate to: True and False, True and True, False and False, True or False, True or True, False or False

Blurring the lines with booleans

  • You can also treat True and False like ints 1 and 0, respectively, and vice versa (try True + True)
  • Scalars other than 0, "", False or None will be treated as True in logical expression (“truthy”)
  • Empty collections will be treated as “falsey” in logical expressions, all other collections truthy

Variables

Like in the shell, we can give things names

age = 42
name = "Bob Jones"
person = {"name": name, "age": age}

# We can use variables just as though they were their values
age / 4

Variable name rules

  • Must start with an alphabetical character or an underscore (_)
  • Can contain alphabetical characters, numbers, and underscores; nothing else
  • Convention is to use all lower case, and separate words with underscores

Example: bobs_occupation = "haxxor"

Collection operations

We’ll mostly focus on lists and dictionaries

Lists

Writing xs = [1, 2, 3, 4, 5, 6, 7] defines a list xs.

  • Concatenation: +
    • ([1, 2] + [3, 4] => [1, 2, 3, 4])
  • Accessing: xs[4]
    • Get the 5th thing in the list
  • Changing: xs[3] = 999
    • Change the 4th thing in the list

Note: Python has 0-based indexing.

Dictionaries

These work similarly to lists, but we have more flexibility over the indices:

  • Get the value: person["name"]
  • Set the value: person["occupation"] = "haxxor"

Here "occupation" is the key, and "haxxor" is the value. Together, they form a key-value pair.

Dictionaries - arbitrary keys

Keys are often strings, but don’t have to be:

crazy_dict = {4: "some string", (1,2,3): 999}

crazy_dict[4]
crazy_dict[(1,2,3)]

They can be numbers, tuples or anything hashable:

  • Most immutable data is hashable.
  • You can tell whether data pointed to by variable x is hashable by running hash(x)
  • Lists and dictionaries are not hashable.

Data representation exercises

What kind of data is this?

("this", 1234)

Is it hashable?

How about this?

(4545, ["a", "b"])

Is it hashable?

How would you represent

A single DNA sequence?


(press down for answer)

Answer

String

seq = "AGCTAGCTACGT"

How would you represent

A DNA sequence, together with sequence name and other metadata?


(press down for answer)

Answer

Dictionary

seqrecord = {"seq": "AGCTAGCTACGT", "name": "MBG234"}

How would you represent

A collection of DNA sequences and names?


(press down for answer)

Answer

Dictionary

seqrecords = [{"seq": "AGCTAGCTACGT", "name": "MBG234"},
              {"seq": "AGCTTCCCACGT", "name": "MBG235"},
              {"seq": "AGATTCCTCCGT", "name": "MBG236"}]

How would you represent

A collection of DNA sequences and names, together with metadata about the collection (sampling date and such)?


(press down for answer)

Answer

Dictionary

{"seqrecords": [{"seq": "AGCTAGCTACGT", "name": "MBG234"},
                {"seq": "AGCTTCCCACGT", "name": "MBG235"},
                {"seq": "AGATTCCTCCGT", "name": "MBG236"}],
 "sampling_location": "Bangladesh",
 "technician": "Bob Jones"}

Functions

In which from our rules of logic we compose spells

Writing a simple function

def square(n):
    ans = n * n
    return ans

square(4)
  • When we call the function with the argument 4, that value gets passed in for the variable n in the body of the function.
  • To get a value out of the function we must use return
  • Python uses 4 spaces of indentation to separate what’s in the body of the function from what’s not
    • In other languages, indentation is just “good practice” for readability; Python enforces it as part of the syntax

Other things about functions

Functions are “first class”, in that they can be passed around as data just like numbers, strings, etc.

# Using our square function as a value
print square
[square, "data"]

We’ll see how this is useful later.

“Type functions”

Some types have functions associated with them, such as str.upper and list.append:

str.upper("this")

xs = [1, 2, 3]
list.append(xs, 9)
print xs

You can see what functions are associated with a type t using dir(t) (example: dir(list) shows all the list functions)

Control flow

In which we gain mastery over the rules of logic

If statements

x = 3
if x < 4:
    print "You should see this"
else:
    print "You should NOT see this!"

Things to note:

  • We used indentation here as well to determine the structure of the statements
  • We used the < (less than) operand, which behaves just as you’d expect, returning True/False depending on the operand values

You can have elif sections too

elif is a combination of an else and an if.

def f(n):
    if n < 5:
        print "Cond 1"
    elif n > 10:
        print "Cond 2"
    else:
        print "Cond 3"

f(3)
f(7)
f(12)

About collections again

What do you think will happen below?

data = []

if data:
    print "catgifs"

list.append(data, "puppies?")

if data:
    print "yay internetz!"

For statements

Do something for every thing in a collection:

for n in xs:
    print "On int:", n

Iterating over key-value pairs in a dictionary

dict has an items function that gives all of the key, value pairs. Try

dict.items(person)

Now loop over them:

for k, v in dict.items(person):
    print "key:", k, "val:", v

Problem:

Using just what we’ve learned already, how would you create a list ys that contained the square of every number in the list xs?

Using for to build results

ys = []
for x in xs:
    y = square(x)
    list.append(ys, y)

print ys

This is imperative; we’re telling the computer how to build ys

A better way

List comprehensions:

ys = [square(x) for x in xs]

print ys

This is more declarative.

The functional way

ys = map(square, xs)

print ys

Note that we’re passing the function square as an argument to the function map.

Declarative is generally better

List comprehension and functional approaches are more declarative.

We tell the computer what to compute, not how we want it computed. This is usually a good thing (cleaner code; sometimes better performance).

Higher level concerns

In which we rise to the level of systems

Importing modules

import math

dir(math)
help(math.log)

math.log(3.14)

math is a module. Modules, just like objects, have attributes, and these attributes are typically functions that we can use in our programs (though sometimes they’re just useful data, like math.e).

Namespaces

We also talk about modules as being namespaces. A namespace is just a “path” to some data or function or module. In this case math can also be thought of as a namespace.

Namespaces help us avoid naming conflicts. If we have a function called square, and we import a module like math with a function called square, namespaces let us distinguish between them.

Installing your own packages with pip

Collections of modules are called packages.

You can install Python packages using the pip command line tool (the “App Store” for Python packages). From your Unix shell:

pip install --user biopython

The --user flag tells pip to install things in your local python lib directory. You would only omit that if you were installing system wide on your own computer.

Beware Python 3

We’re using Python 2. Python 3 has been out for YEARS, but adoption has been slow. Eventually, Python 3 will become the system default and more widely used, but till then…

Bottom line: Python 3 is rather different in a number of ways, so be aware.

Other things in the Python ecosystem

  • Numpy & Scipy: Some work in the numerical computing and stats directions
  • Pandas: R like data frames in Python
  • Jupyter & IPython: Browser-based “notebook” environments for co-locating code, documentation and visualizations
    • Miss out on some of the Unix-centric goodness, but has advantages too

Resources

Back to homepage