Introduction to Bioinformatics

Lesson 1 - Gestalt & remote access

First things first

Did everyone successfully ssh into rhino?

Did everyone get the book?

Introductions

  • Name
  • What you’re working on
  • Technical Experience

Our goals for this course

  • High level overview of how to approach bioinformatics
  • Basic bioinformatics technical and problem solving skills
  • Enough momentum to initiate self-learning process

What you won’t get from this course

…that might be useful.

  • Higher level programming
  • Statistics
  • Other mathematics
  • Knowledge of every bioinformatics program/method ever created

Our goal is to give you enough momentum to learn what you need as you go.

Please spend time outside class

This stuff doesn’t stick unless you keep doing it.

  • Have fun!
  • Start a project
  • Set up routine time to play
  • More time per week is great, but consistently doing something every week better
  • Plan to continue after the course
  • Feel free to email Brian - he gets paid for this

What is bioinformatics?

What is bioinformatics?

The application of computation towards the analysis and processing of biological data

In practice this entails

  • data munging: cleaning & reshaping data
  • data exploration: vizualization, intuition building, hypothesizing
  • drawing conclusions: statistical analyses, publication

Rinse and repeat… New insights and conclusions lead to new data and questions, etc.

So this is what we do


But how should we do it?

Specific goals for research

  • Reproducibility
  • Robustness
  • Accuracy
  • Clarity
  • Iterability

Things we can do towards these goals

  • Immutable data: never change data in place; data in data out (reproducibility, iterability, clarity)
  • Consistency: naming conventions, data & project layout, etc; (automation, clarity)
  • Documentation: keep a digital “lab notebook”; software version, data origins, run settings, explain choices/decisions; (reproducibility, clarify)
  • Automation: the best documentation; write scripts for everything; (reproducibility, iterability)

Polya’s How to Solve It

You will spend more time sitting around and thinking about how to solve problems than you will typing code.

https://en.wikipedia.org/wiki/How_to_Solve_It

  1. Understand
  2. Plan
  3. Exectue
  4. Reflect

When this fails, try to break down into simpler pieces and iterate (solve an easier or related problem first).

Understanding

The most important, but hardest part.

  • Write out the problem
  • Isolate what makes the problem hard
  • Draw/sketch, talk (even out loud to yourself), go for a walk, look for patterns
  • List knows, work backwards, eliminate possibilities, run examples
  • When you truly understood a problem, the solution should appear obvious, so don’t stop trying to understand until it is

Questions?

Unix

What is “Unix”?

Unix is a proprietary operating system from the 60s.

True Unix is rare these days, but its philosophy and design live on in “Unix-like” systems (including OsX). When we say “Unix”, we usually mean it in this general sense.

We’ll be using Ubuntu Linux, a Unix-like operating system.

The Unix Philosophy

  • Small, composable tools that do one thing right
  • Plumb tools together into pipelines & scripts
  • Embrace plain-text data

Bioinformatics naturally embraces this.

The Unix shell

The shell is a wrapper around the operating system, through which text commands and output are used to interact with the computer.

This is analogous to the desktop environment on your personal computer.

Remote servers

Unix was built for remote access. Computers were big and expensive, so people shared resources.

Today, this remains useful:

  • The Hutch servers are really powerful
  • You can leave long running computations without worrying about unexpected shutdowns
  • Access running programs and data from any computer
  • Distribute tasks across compute clusters, such as the Hutch’s gizmo clusters

Questions?

Practicum!

  • Connect to a remote server
  • Survival guide
  • Set up a project directory and load it with data
  • Investigate our data
  • Learn how to edit files using an in-shell text editor (vi)
  • Tmux for session management

SSH for remote connection

Do this however you figured out how to do it for your OS.

ssh <username>@rhino

Write down which rhino you connect to, so you can directly connect to that next time. (e.g If you get connected to rhino3, next time use ssh <username>@rhino3.)

Let us know if you get an error message along the lines of “can't find home”.

From outside the Hutch

You should be able to connect to the rhinos from outside the hutch by connecting to the Hutch VPN.

You can also ssh into <username>@snail.fhcrc.org, and then from there execute ssh <username>@rhino<N> to get into your rhino.

You’re now in a Unix shell!

We’ll cover things more thoroughly next class. For now, just some basic information for orientation.

What follows is your Zombie Apocalypse Unix guide.

Brief overview of a Unix command

command [flags] [operands]

ls -a ~
  • ls: The command, in this case for listing directory contents
  • -a: A flag which specifies that hidden files should be listed
  • ~: An operand, in this case a special symbol which points to your home directory

Note: Spaces are important here (but not how many).

Getting help within Unix

  • Most Unix programs will take either a -h, --help, or -help flag and return useful information about how to construct a valid command.
  • Some commands also have man pages, which you can access using man <command-name>.
  • You can also search for programs using apropos

Example: ls --help & man ls & apropos calculator

Stuck in a terminal?

If you are in the terminal and things seem “stuck” (a program is running that won’t stop), try the following:

Ctrl-c, Esc, q, Ctrl-d, Esc : q Enter

Note: These are key commands;

  • - means press the keys at the same time.
  • ` ` (space between) means let go and press the next key

Autocomplete

Bash can “auto-complete” command and filenames. Just start typing and hit Tab.

Try ls ~/bioi<Tab>.

History

Bash maintains a history of the last several commands you executed. You can access these by typing history.

If you want to run a command similar to one you just executed, you can use the up/down arrow keys to move through the history, edit the line, and re-execute.

Questions?

Supercharging our shell

SciComp has set up a module system for customizing your environment.

We’ll be using the intro-bio module, loaded by executing module load intro-bio (note: this only affects the current shell session!).

module load intro-bio

You might get a message that says Using already loaded python, but don’t worry about it. If you see nothing, it’s fine. If you see an error message, please raise your hand.

Customizing our environment

We don’t want to have to remember to load this module every time, and there are other things we want to have set up for us every time we open a shell session. We can make these cusomtizations by editing our ~/.bashrc configuration file.

But first, we’re going to learn how to edit a text file from the terminal using vi.

Introducing vi and vim

Vi is a wonderful, powerful, but completely arcane editor. However, it’s worth being able to use because:

  • even the sparsest linux install will have some variant of vi.
  • when working with remote machines, it’s nice to be able to edit text in a powerful editor directly on the machine.
  • finally, sometimes another program (e.g. git) will plop you into vim without you realizing it, so it’s nice to know what to do in this situation.

Vim is a modal text editor

Vim has two primary modes:

  • insert mode: typing will now insert text into the file
  • command mode: key strokes become poweful commands, and can carry out high-level editing tasks

Fully understanding comamnd mode takes time, but basic usage is quite simple.

Vim tutorial

Create a file called vimtest.txt with vim

  • Invoke with vim vimtest.txt
  • Type i to enter insert mode
  • Type to enter text, move around with arrow keys, edit as needed
  • Hit Esc to exit insert mode and enter command mode
  • Type : w <Enter> to save
  • When you are done editing, type : q <Enter>

Run cat vimtest.txt to see your text. That wasn’t so bad, was it?

(click down for more advanced usage)

Using the command mode

So far the only real action has been in insert mode. The other mode in vi is the command mode. In this mode you can quickly navigate and modify your file using key commands.

Moving around quickly

There are lots of ways to move your cursor around in command mode:

  • The arrow keys (and the “home rowhjkl keys)
  • 0 moves to the beginning of the line, and $ moves to the end
  • b moves back one word, and w moves forward one
  • { moves back one paragraph, and } moves forward one
  • ( moves back one sentence, and ) moves forward one

You can prefix these commands with numbers to move faster, e.g. 3 w moves you forward three words.

Cutting and pasting

There is a simple way to cut and paste exactly analogous to a word processor: highlight a block of text, then copy or cut, then paste.

  • In command mode, move cursor to where you want to start your highlight
  • Press v; this places you in visual mode
  • Move cursor to the end of your highlighted region
  • Press d to cut, y (for yank) to copy
  • Move cursor to where you want to paste
  • Press p to paste

Combining motions and actions

Motions and actions can be combined. E.g.:

  • d w cuts one word, and d 2 w cuts two words
  • y w copies one word, and y 2 w copies two words

You can also cut/copy entire lines with the shortcuts d d and y y.

Undo/redo

If you ever mess anything up (which is easy to do in command mode), u is undo and Ctrl-r is redo (from command mode).

Vim resources

Note that you can totally use the arrow keys while you You certainly can, though it’s not considered hip (because vim is all about efficiency, and moving your hands from home position to the arrow keys is not efficient.)

Customizing our shell

We can customize our environment using a “dotfile” that gets loaded when new shell sessions are created: ~/.bashrc.

  • Run vim ~/.bashrc to open the file
  • Hit i for insert mode
  • Use arrow keys to scroll to the bottom of the file, then add:
module load intro-bio
export PATH=~/bin/:$PATH

Save and quit (Esc then :wq<Enter>), then run source ~/.bashrc to reload the changes.

Tmux - Terminal multiplexer

Tmux let’s combine multiple Unix shells into one.

We could open multiple terminal windows to multitask, but each would need its own ssh connection. Also, without tmux, if our connection dies, so do our programs. With tmux we can:

  • Put down and pick up work from multiple machines
  • Keep working on other things while a long running computation runs
  • Switch back and forth quickly between multiple projects

Setting up tmux

We’ll use tmux throughout this class to keep sessions running. But first let’s download a nice tmux configuration file for making things easier.

# This is a command that downloads a tmux configuration file
wget https://raw.githubusercontent.com/fredhutchio/intro-bioinformatics/gh-pages/config/tmux.conf

ls

# tmux knows to look for our config file at ~/.tmux.conf
mv tmux.conf ~/.tmux.conf

# Note that this file is hidden once we make it a "dotfile"
ls ~
ls -a ~

Starting a new tmux session

tmux

You should now see a fresh shell session inside of tmux.

Tmux cheatsheet/tutorial

Bare bones tmux

  • Ctrl-a - your “Command key”
  • Ctrl-a c - New window
  • Ctrl-a <Space> - Next window
  • Ctrl-a | - Split window into panes vertically
  • Ctrl-a - - Split window into panes horizontally
  • Ctrl-a <arrow> - Move between
  • Ctrl-d - Close a pane or window

For more tmux tips, click down.

Moving between panes and moving panes

You can also use h, j, k, and l in place of the arrow keys, as in vim.

You can also swap/reorder panes using Ctrl-a J and Ctrl-a K.

Resizing panes

  • Ctrl-a < - move vertical split left
  • Ctrl-a > - move vertical split right
  • Ctrl-a + - move horizontal split up
  • Ctrl-a + - move horizontal split down

Note: You can click Ctrl-a once, and hold the second key for big moves.

Naming & finding windows

  • Ctrl-a , will let you name a window
  • Ctrl-a ' presents a list of windows (by name)
  • <arrow> and Enter to switch to a window
  • Ctrl-a <numeric> switches to a window by number.

Scrolling and copy/paste in tmux

When a noisy program floods a tmux pane, your mouse wheel won’t let you scroll, like in a normal shell session. Pressing Ctrl-a [ will place you in scroll mode. Use can now use arrow keys or Ctrl-u/Ctrl-d to scroll through the history, and search with /.

From this mode, you can also press Space to enter copy-mode, <arrow> keys to specify a collection, and Enter to copy the selection. To paste the selection, use Ctrl-].

Tmux commands

You can also interact with tmux using commands, and you can see a list of commands and explanations on the man page:

man tmux

All of the key combos above are just bindings to these commands, so this is a good place to go if you’re trying to figure something out.

Detaching and attaching a tmux session

  1. Close your terminal window, and ignore any warning prompts.
  2. Reconnect just the way you did at the beginning of the class.
  3. Type tmux attach to attach to an existing session.

Remember: tmux attach will fail if you don’t have a session open already; if that happens just enter tmux to start a new session.

Questions?

Setting up our project

This class builds around analysis of a real world dataset of Simian Foamy Virus (SFV). SFV is a retrovirus that infects non-human primates, but can infect humans bitten by NHP. Howver, we appear to be a “dead-end” host.

This data set looks at viruses sampled from humans and monkeys in Bangladesh.

Project layout

Here’s how I typically organize things:

your-project
├── README.md
├── build.sh
├── data
│   ├── sequences.fasta
│   └── metadata.csv
├── scripts
│   ├── clean.py
│   └── plot.R
└── output
    ├── alignment.fasta
    ├── cleaned_metadata.csv
    ├── tree.newick
    └── tree_plot.png

Note: this is a little different than the organization from the book; Use what works for you, but be consistent.

Setting up our project directory

mkdir ~/bioinfclass

cd ~/bioinfclass
mkdir data output scripts

ls

(You’ll always enter in the things you see in these black boxes)

Download data with wget

wget https://goo.gl/8Nk5tZ -O data.tar
ls

Note that we now have a data.tar file in this directory. This is an archive type (like zip) common on Unix systems. To unpack it

tar -v -x -f data.tar

Adding a README

Start vim with vim README.md

  • Type i to enter insert mode
  • Enter some text about today’s lesson
  • Hit Esc to exit insert mode and enter command mode
  • Type : w <Enter> to save
  • When you are done editing, type : q <Enter>

Let’s look at what we’ve done

tree prints out a directory tree as ASCII art.


tree

Note: This may not turn out right on Windows PuTTY terminals.

Let’s explore the sequence file

cat prints the output of file(s) to the screen.

cat data/sfv.fasta

Now with less

less lets us “page” through data, without flooding the screen.

less data/sfv.fasta


Press the q key to exit from less.

Now let’s look at the metadata

less data/sfv.csv

Press the / key to enter a search string

(Try human or monkey; Use n and N to toggle through results).

Questions?

Source code!

The source code for these slides is available on GitHub.

https://github.com/fredhutchio/intro-bioinformatics

For the code that generated these slides, look at the .mds files in the src directory. (This code is an extension of Markdown, which we’ll look at more next week)

To submit issues or questions about the class, go to the Issues page.

Homework

Please, please, please!

  • Go over tmux and vim and memorize the basics of how to use them
  • Set up a routine for practice and tinkering
  • Pick a personal project

Homework

Recommended reading:

  • Chapters 1 & 2 if you haven’t already.
  • Chapters 2 & 3 for basic Unix stuff & tmux.

Reading for next class:

  • Chapter 3, and chapter 7 (till around page 148).

Resources