Introduction to Bioinformatics

Lesson 1 - Gestalt & remote access

First things first

Did everyone successfully ssh into rhino?

Did everyone get the book?

Introductions

Name
What you’re working on
Technical Experience

Our goals for this course

High level overview of how to approach bioinformatics
Basic bioinformatics technical and problem solving skills
Enough momentum to initiate a self-learning process

What you won’t get from this course

…that might be useful.

Higher level programming
Statistics
Other mathematics
Knowledge of every bioinformatics program/method ever created

Our goal is to give you enough momentum to learn what you need as you go.

Please spend time outside class

This stuff doesn’t stick unless you keep doing it.

Have fun!
Start a project
Set up routine time to play
More time per week is great, but consistently doing something every week is better
Plan to continue after the course
Feel free to email Brian - he gets paid for this

What is bioinformatics?

The application of computation towards the analysis and processing of biological data

In practice this entails

data munging: cleaning & reshaping data
data exploration: visualization, intuition building, hypothesizing
drawing conclusions: statistical analyses, publication

Rinse and repeat… New insights and conclusions lead to new data and questions, etc.

So this is what we do

But how should we do it?

Specific goals for research

Reproducibility
Robustness
Accuracy
Clarity
Iterability

Things we can do towards these goals

Immutable data: never change data in place; data in data out (reproducibility, iterability, clarity)
Consistency: naming conventions, data & project layout, etc; (automation, clarity)
Documentation: keep a digital “lab notebook”; software version, data origins, run settings, explain choices/decisions; (reproducibility, clarity)
Automation: the best documentation; write scripts for everything; (reproducibility, iterability)
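
As a rough illustration of immutable data and automation working together, here is a minimal sketch of a script that reads from data/ and writes only to output/. The file names (raw.txt, cleaned.txt) and their contents are made up for the example, and the whole thing runs in a throwaway directory.

```shell
# Work in a throwaway directory so the sketch is safe to run anywhere.
project=$(mktemp -d)
mkdir -p "$project/data" "$project/output"

# Stand-in for raw data (hypothetical contents).
printf 'sample_B\nsample_A\nsample_B\n' > "$project/data/raw.txt"

# Data in, data out: read from data/, write a new file in output/.
# The raw file is never edited in place.
sort -u "$project/data/raw.txt" > "$project/output/cleaned.txt"

cat "$project/output/cleaned.txt"
```

Because the raw file is untouched, the cleaning step can be rerun or changed at any time without losing the original data.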

Polya’s How to Solve It

You will spend more time sitting around and thinking about how to solve problems than you will typing code.

https://en.wikipedia.org/wiki/How_to_Solve_It

Understand
Plan
Execute
Reflect

When this fails, try to break down into simpler pieces and iterate (solve an easier or related problem first).

Understanding

The most important, but hardest part.

Write out the problem
Isolate what makes the problem hard
Draw/sketch, talk (even out loud to yourself), go for a walk, look for patterns
List knowns, work backwards, eliminate possibilities, run examples
When you truly understand a problem, the solution should appear obvious, so don’t stop trying to understand until it is

Questions?

Unix

What is “Unix”?

Unix is a proprietary operating system whose development began at Bell Labs in 1969.

True Unix is rare these days, but its philosophy and design live on in “Unix-like” systems (including OS X). When we say “Unix”, we usually mean it in this general sense.

We’ll be using Ubuntu Linux, a Unix-like operating system.

The Unix Philosophy

Small, composable tools that do one thing right
Plumb tools together into pipelines & scripts
Embrace plain-text data

Bioinformatics naturally embraces this.
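
To make the pipeline idea concrete, here is a small sketch that counts how many samples come from each host in a plain-text table. The metadata contents are invented for the example; cut, sort, and uniq are standard Unix tools.

```shell
# Hypothetical metadata: one "sample,host" record per line.
metadata=$(mktemp)
printf 'id1,human\nid2,monkey\nid3,monkey\n' > "$metadata"

# Each tool does one small job; pipes (|) plumb them together:
#   cut  -> keep only the second comma-separated column
#   sort -> put identical values next to each other
#   uniq -> collapse repeats, and with -c count them
host_counts=$(cut -d, -f2 "$metadata" | sort | uniq -c)
echo "$host_counts"
```

None of these tools knows anything about biology; the plain-text format is what lets them work together.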

The Unix shell

The shell is a wrapper around the operating system, through which text commands and output are used to interact with the computer.

This is analogous to the desktop environment on your personal computer.

Remote servers

Unix was built for remote access. Computers were big and expensive, so people shared resources.

Today, this remains useful:

The Hutch servers are really powerful
You can leave long running computations without worrying about unexpected shutdowns
Access running programs and data from any computer
Distribute tasks across compute clusters, such as the Hutch’s gizmo clusters

Questions?

Practicum!

Connect to a remote server
Survival guide
Set up a project directory and load it with data
Investigate our data
Learn how to edit files using an in-shell text editor (vi)
Tmux for session management

SSH for remote connection

Do this however you figured out how to do it for your OS.

ssh <username>@rhino

Write down which rhino you connect to, so you can connect directly to it next time (e.g., if you get connected to rhino3, next time use ssh <username>@rhino3).

Let us know if you get an error message along the lines of “can't find home”.

From outside the Hutch

You should be able to connect to the rhinos from outside the Hutch by connecting to the Hutch VPN.

You can also ssh into <username>@snail.fhcrc.org, and then from there execute ssh <username>@rhino<N> to get into your rhino.

You’re now in a Unix shell!

We’ll cover things more thoroughly next class. For now, just some basic information for orientation.

What follows is your Zombie Apocalypse Unix guide.

Brief overview of a Unix command

command [flags] [operands]

ls -a ~

ls: The command, in this case for listing directory contents
-a: A flag which specifies that hidden files should be listed
~: An operand, in this case a special symbol which points to your home directory

Note: Spaces are important here (but not how many).
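
To see what a flag changes, here is a small sketch that runs ls with and without -a on a scratch directory. The directory and file names are made up for the example.

```shell
# Make a scratch directory containing one visible and one hidden file.
scratch=$(mktemp -d)
touch "$scratch/visible.txt" "$scratch/.hidden.txt"

# No flags: hidden files (names starting with a dot) are left out.
plain_listing=$(ls "$scratch")

# With the -a flag: hidden files are included, along with . and ..
all_listing=$(ls -a "$scratch")

echo "$plain_listing"
echo "$all_listing"
```

Here "$scratch" plays the role of the operand, just as ~ does in ls -a ~.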

Getting help within Unix

Most Unix programs will take either a -h, --help, or -help flag and return useful information about how to construct a valid command.
Some commands also have man pages, which you can access using man <command-name>.
You can also search for programs using apropos

Examples: ls --help, man ls, and apropos calculator

Stuck in a terminal?

If you are in the terminal and things seem “stuck” (a program is running that won’t stop), try the following:

Ctrl-c, Esc, q, Ctrl-d, Esc : q Enter

Note: These are key commands;

- means press the keys at the same time.
` ` (space between) means let go and press the next key

Autocomplete

Bash can “auto-complete” command names and filenames. Just start typing and hit Tab.

Try ls ~/bioi<Tab>.

History

Bash maintains a history of the last several commands you executed. You can access these by typing history.

If you want to run a command similar to one you just executed, you can use the up/down arrow keys to move through the history, edit the line, and re-execute.

Questions?

Supercharging our shell

SciComp has set up a module system for customizing your environment.

We’ll be using the intro-bio module, loaded by executing module load intro-bio (note: this only affects the current shell session!).

module load intro-bio

You might get a message that says Using already loaded python, but don’t worry about it. If you see nothing, it’s fine. If you see an error message, please raise your hand.

Customizing our environment

We don’t want to have to remember to load this module every time, and there are other things we want to have set up for us every time we open a shell session. We can make these customizations by editing our ~/.bashrc configuration file.

But first, we’re going to learn how to edit a text file from the terminal using vi.

Introducing vi and vim

Vi is a wonderful, powerful, but completely arcane editor. However, it’s worth being able to use because:

even the sparsest linux install will have some variant of vi.
when working with remote machines, it’s nice to be able to edit text in a powerful editor directly on the machine.
finally, sometimes another program (e.g. git) will plop you into vim without you realizing it, so it’s nice to know what to do in this situation.

Vim tutorial

Create a file called `vimtest.txt` with vim

Invoke with vim vimtest.txt
Type i to enter insert mode
Type to enter text, move around with arrow keys, edit as needed
Hit Esc to exit insert mode and enter command mode
Type : w <Enter> to save
When you are done editing, type : q <Enter>

Run cat vimtest.txt to see your text. That wasn’t so bad, was it?

(click down for more advanced usage)

Using the command mode

So far the only real action has been in insert mode. The other mode in vi is the command mode. In this mode you can quickly navigate and modify your file using key commands.

Moving around quickly

There are lots of ways to move your cursor around in command mode:

The arrow keys (and the “home row” hjkl keys)
0 moves to the beginning of the line, and $ moves to the end
b moves back one word, and w moves forward one
{ moves back one paragraph, and } moves forward one
( moves back one sentence, and ) moves forward one

You can prefix these commands with numbers to move faster, e.g. 3 w moves you forward three words.

Cutting and pasting

There is a simple way to cut and paste exactly analogous to a word processor: highlight a block of text, then copy or cut, then paste.

In command mode, move cursor to where you want to start your highlight
Press v; this places you in visual mode
Move cursor to the end of your highlighted region
Press d to cut, y (for yank) to copy
Move cursor to where you want to paste
Press p to paste

Combining motions and actions

Motions and actions can be combined. E.g.:

d w cuts one word, and d 2 w cuts two words
y w copies one word, and y 2 w copies two words

You can also cut/copy entire lines with the shortcuts d d and y y.

Undo/redo

If you ever mess anything up (which is easy to do in command mode), u is undo and Ctrl-r is redo (from command mode).

Vim resources

The vimtutor command, available wherever you find vim
Ben Crowder’s vim tips
Cheatsheet/poster
Wallpaper
An online vim tutorial

Note that you can certainly use the arrow keys in vim, though it’s not considered hip (because vim is all about efficiency, and moving your hands from the home position to the arrow keys is not efficient).

Customizing our shell

We can customize our environment using a “dotfile” that gets loaded when new shell sessions are created: ~/.bashrc.

Run vim ~/.bashrc to open the file
Hit i for insert mode
Use arrow keys to scroll to the bottom of the file, then add:

module load intro-bio
export PATH=~/bin/:$PATH

Save and quit (Esc then :wq<Enter>), then run source ~/.bashrc to reload the changes.
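
If you later reuse the same ~/.bashrc on a machine without the module system, or run source ~/.bashrc several times, a slightly more defensive variant can help. This is only a sketch of one possible approach, not something the class setup requires; it uses $HOME/bin in place of ~/bin/.

```shell
# Load the course module only where the `module` command exists,
# so this file does not print errors on machines without it.
if command -v module >/dev/null 2>&1; then
    module load intro-bio
fi

# Prepend ~/bin to PATH only if it is not already present, so that
# re-running `source ~/.bashrc` does not keep growing PATH.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;
    *) export PATH="$HOME/bin:$PATH" ;;
esac
```

You can check the result at any time with echo $PATH.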

Tmux - Terminal multiplexer

Tmux lets us combine multiple Unix shells into one.

We could open multiple terminal windows to multitask, but each would need its own ssh connection. Also, without tmux, if our connection dies, so do our programs. With tmux we can:

Put down and pick up work from multiple machines
Keep working on other things while a long running computation runs
Switch back and forth quickly between multiple projects

Setting up tmux

We’ll use tmux throughout this class to keep sessions running. But first let’s download a nice tmux configuration file for making things easier.

# This is a command that downloads a tmux configuration file
wget https://raw.githubusercontent.com/fredhutchio/intro-bioinformatics/gh-pages/config/tmux.conf

ls

# tmux knows to look for our config file at ~/.tmux.conf
mv tmux.conf ~/.tmux.conf

# Note that this file is hidden once we make it a "dotfile"
ls ~
ls -a ~

Starting a new tmux session

tmux

You should now see a fresh shell session inside of tmux.

Tmux cheatsheet/tutorial

Bare bones tmux

Ctrl-a - your “Command key”
Ctrl-a c - New window
Ctrl-a <Space> - Next window
Ctrl-a | - Split window into panes vertically
Ctrl-a - - Split window into panes horizontally
Ctrl-a <arrow> - Move between panes
Ctrl-d - Close a pane or window

For more tmux tips, click down.

Moving between panes and moving panes

You can also use h, j, k, and l in place of the arrow keys, as in vim.

You can also swap/reorder panes using Ctrl-a J and Ctrl-a K.

Resizing panes

Ctrl-a < - move vertical split left
Ctrl-a > - move vertical split right
Ctrl-a + - move horizontal split up
Ctrl-a + - move horizontal split down

Note: You can press Ctrl-a once, and hold the second key for big moves.

Naming & finding windows

Ctrl-a , will let you name a window
Ctrl-a ' presents a list of windows (by name)
<arrow> and Enter to switch to a window
Ctrl-a <numeric> switches to a window by number.

Scrolling and copy/paste in tmux

When a noisy program floods a tmux pane, your mouse wheel won’t let you scroll as it would in a normal shell session. Pressing Ctrl-a [ will place you in scroll mode. You can now use arrow keys or Ctrl-u/Ctrl-d to scroll through the history, and search with /.

From this mode, you can also press Space to enter copy-mode, <arrow> keys to specify a selection, and Enter to copy the selection. To paste the selection, use Ctrl-].

Tmux commands

You can also interact with tmux using commands, and you can see a list of commands and explanations on the man page:

man tmux

All of the key combos above are just bindings to these commands, so this is a good place to go if you’re trying to figure something out.

Detaching and attaching a tmux session

Close your terminal window, and ignore any warning prompts.
Reconnect just the way you did at the beginning of the class.
Type tmux attach to attach to an existing session.

Remember: tmux attach will fail if you don’t have a session open already; if that happens just enter tmux to start a new session.
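
The "attach, or else start fresh" rule can be wrapped in a small shell function. This is a sketch only; the function name tm is hypothetical, and you could add it to your ~/.bashrc if you find it handy.

```shell
# Hypothetical helper: reattach to an existing tmux session if there
# is one; if attaching fails (no session yet), start a new session.
tm() {
    tmux attach 2>/dev/null || tmux
}
```

After defining it, typing tm does the right thing whether or not a session already exists.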

Questions?

Setting up our project

This class builds around analysis of a real world dataset of Simian Foamy Virus (SFV). SFV is a retrovirus that infects non-human primates (NHPs), but can also infect humans bitten by NHPs. However, we appear to be a “dead-end” host.

This data set looks at viruses sampled from humans and monkeys in Bangladesh.

Project layout

Here’s how I typically organize things:

your-project
├── README.md
├── build.sh
├── data
│   ├── sequences.fasta
│   └── metadata.csv
├── scripts
│   ├── clean.py
│   └── plot.R
└── output
    ├── alignment.fasta
    ├── cleaned_metadata.csv
    ├── tree.newick
    └── tree_plot.png

Note: this is a little different from the organization in the book. Use what works for you, but be consistent.

Setting up our project directory

mkdir ~/bioinfclass

cd ~/bioinfclass
mkdir data output scripts

ls

(You’ll always enter in the things you see in these black boxes)

Download data with `wget`

wget https://goo.gl/8Nk5tZ -O data.tar
ls

Note that we now have a data.tar file in this directory. This is an archive type (like zip) common on Unix systems. To unpack it:

tar -v -x -f data.tar
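
It can be useful to look inside an archive before unpacking it. The sketch below builds a tiny stand-in archive (demo.tar, with made-up contents, not the course data) and then lists and extracts it; -t, -x, -v, -f, and -C are standard tar flags.

```shell
# Build a tiny stand-in archive in a scratch directory
# (file names here are hypothetical, not the course data).
workdir=$(mktemp -d)
mkdir "$workdir/data"
echo "example" > "$workdir/data/example.txt"
tar -c -f "$workdir/demo.tar" -C "$workdir" data
rm -r "$workdir/data"

# -t lists the archive's contents without unpacking anything.
archive_listing=$(tar -t -f "$workdir/demo.tar")
echo "$archive_listing"

# -x extracts; -v prints each file name as it is unpacked;
# -C says which directory to unpack into.
tar -v -x -f "$workdir/demo.tar" -C "$workdir"
```

For the class archive, the equivalent check would be tar -t -f data.tar.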

Adding a README

Start vim with vim README.md

Type i to enter insert mode
Enter some text about today’s lesson
Hit Esc to exit insert mode and enter command mode
Type : w <Enter> to save
When you are done editing, type : q <Enter>

Let’s look at what we’ve done

tree prints out a directory tree as ASCII art.


tree

Note: This may not turn out right on Windows PuTTY terminals.

Let’s explore the sequence file

cat prints the contents of file(s) to the screen.

cat data/sfv.fasta

Now with `less`

less lets us “page” through data, without flooding the screen.

less data/sfv.fasta

Press the q key to exit from less.

Now let’s look at the metadata

less data/sfv.csv

Press the / key to enter a search string

(Try human or monkey; use n and N to step forward and backward through the results.)
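
Besides paging with less, a couple of standard tools give a quick summary of a sequence file without flooding the screen. This sketch uses a made-up three-record FASTA file as a stand-in; the same commands should work on data/sfv.fasta.

```shell
# Hypothetical stand-in for a FASTA file: each record is a header
# line starting with ">" followed by sequence lines.
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCA\n>seq3\nTTAG\n' > "$fasta"

# head shows just the first few lines instead of the whole file.
head -n 2 "$fasta"

# grep -c counts lines matching a pattern; lines starting with ">"
# are headers, so this counts the sequences.
n_seqs=$(grep -c '^>' "$fasta")
echo "$n_seqs"
```

Note the quotes around '^>': without them the shell would treat > as output redirection.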

Questions?

Source code!

The source code for these slides is available on GitHub.

https://github.com/fredhutchio/intro-bioinformatics

For the code that generated these slides, look at the .mds files in the src directory. (This code is an extension of Markdown, which we’ll look at more next week)

To submit issues or questions about the class, go to the Issues page.

Homework

Please, please, please!

Go over tmux and vim and memorize the basics of how to use them
Set up a routine for practice and tinkering
Pick a personal project

Resources

Class home page: http://fredhutchio.github.io/intro-bioinformatics
Polya: How to solve it https://en.wikipedia.org/wiki/How_to_Solve_It
Unix command reference: https://ubuntudanmark.dk/filer/fwunixref.pdf
Tmux tutorial: http://www.fredhutch.io/articles/2014/04/27/terminal-multiplex
Stack overflow: https://stackoverflow.com
More on our data: http://dx.doi.org/10.1038/emi.2013.60
