Introduction to Bioinformatics

Lesson 4 - Version control with git

Git for version control!

In which with our fleets of flying, speaking beasts we gain mastery over time

Everyone have a GitHub account?

If not please go to github.com and sign up while we get started (it’s free).

The basic idea

Like Back to the Future for data/code (on steroids, sans DeLorean):

  • Save current state, with a helpful message
  • Go back to an old state to try a different strategy, saving steps along the way
  • Take two alternate histories and merge results

In computational biology

Analogy from the book:

Imagine you keep a lab notebook in pencil, and each time you run a PCR you erase your past specifics and jot down the newest ones…

This is functionally equivalent to not versioning your code…

Some perspective on this class

There are four or five commands you should know and start using now. The goal for the rest of the class is to generally understand what is possible should you need it.

So relax and enjoy the ride.

Git

We’ll be using git to version our project from here on out.

There are other version control systems out there (svn, mercurial, etc.), but git is currently by far the most popular in bioinformatics, and it’s lovely!

Customizing git

git config --global user.name "<your-real-name>"
git config --global user.email "<your-email-address>"

# Some nice color modes for git output
git config --global color.ui true

(Note: You can see your global settings with cat ~/.gitconfig, or every setting currently in effect with git config --list)
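To check that a setting took effect, you can read it back with git config --get. A minimal sketch, assuming a throwaway repository created with mktemp and a made-up identity (Example User); it uses repository-local settings so your real ~/.gitconfig is left alone:

```shell
set -e

# Hypothetical scratch repository, so nothing global is changed
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# Without --global, the setting applies to this repository only
git config user.name "Example User"
git config user.email "user@example.com"

# Read the values back
configured_name=$(git config --get user.name)
configured_email=$(git config --get user.email)
echo "$configured_name <$configured_email>"
```

Repository-local settings override the global ones, which is handy if you use a different email for some projects.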

Initializing git

Before you can use git on a project, you have to initialize a git repository.

# Get to project dir
cd ~/bioinfclass
git status

# Initialize
git init
git status

Commits

Commits are the basis for most of git. They are our waypoints as we travel through time.

Initializing a project only sets things up; we still have to make our first commit (save state).

Making our first commit

# Tell git what files to track (staging)
git add *

# Make initial commit
git commit -m "Initial commit"

We now have a saved state on which to build without fear of “messing things up”.
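If you want to rehearse the whole init / add / commit cycle without touching your real project, here is a minimal sketch. It assumes a scratch directory from mktemp and a made-up identity; neither is part of the class project.

```shell
set -e

# Hypothetical scratch directory; nothing here touches ~/bioinfclass
demo_dir=$(mktemp -d)
cd "$demo_dir"

git init -q
# Repository-local identity, so the sketch does not depend on global settings
git config user.name "Example User"
git config user.email "user@example.com"

# Create a file, stage it, and save the state
echo "# Bioinfclass Notes" > README.md
git add README.md
git commit -q -m "Initial commit"

# Count the commits reachable from HEAD
commit_count=$(git rev-list --count HEAD)
echo "Commits so far: $commit_count"
```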

Let’s make some edits, and commit them

Write some things in README.md:

# Bioinfclass Notes

Where you type out notes and stuff...


## Jun 24, 2015

Learned how to use git!
It was pretty fun.

Note: This is formatted in Markdown.

First, seeing what’s changed

# Check the status of the repo
git status

# Seeing specific changes
git diff

Git diff uses + and - (and optionally, colors) to show what’s changed.
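To see the + and - markers on a tiny example, here is a sketch in a throwaway repository (the directory, identity, and file contents are all made up for illustration):

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'Learned how to use git!\n' > README.md
git add README.md
git commit -q -m "Initial commit"

# Edit the tracked file without staging it
printf 'Learned how to use git!\nIt was pretty fun.\n' > README.md

# Unstaged changes: added lines start with "+", removed lines with "-"
diff_output=$(git diff --no-color)
echo "$diff_output"
```

Only the second line is new, so it is the only content line in the output that starts with +.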

Committing our changes

# Stage the changed file
git add README.md

# Commit staged changes, with commit message
git commit -m "Add notes to README"

# Checking our status
git status
git log

That’s it for the basics!

Start using this immediately, and learn the rest as needed.

The remainder of this class is a survey of the more advanced features of git. Most of it is generally useful, but much of it is most valuable for collaboration. As such, it’s more important at this point to know what’s possible than to remember how to do it all.

GitHub!

The de facto home of open source on the internet.

Visit github.com and log in.

Forking

A way of copying a project over to your account

You are now on your “fork” or copy of the project

Cloning a repository

This is how we get a remote repository from GitHub checked out on our computers

Copy the “HTTPS” clone url of the project on GitHub, then

# Go home, and rename the directory we've been working on
cd ~
mv bioinfclass old_bioinfclass

# Make the checkout
git clone <paste-your-https-url> bioinfclass

# Enter the directory we created, and see what's there
cd bioinfclass
tree
git status

Looking at history

# See the list of commits
git log

Wow! Such commit history…

A slightly better way…

# Add an alias to a prettier log command
git config --global alias.glog "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%C(bold blue)<%an>%Creset' --abbrev-commit --all"


# Try it out!
git glog
git glog -n 5

Breaking this down

Notice the history branching, the sha hashes (a4b8893, etc), commit messages, author, and human friendly time string.

* 4467411 - (HEAD -> master, origin/master, origin/HEAD) Finishe
  ...
* a4b8893 - Added previous tree and alignment analysis to build.
* 2683a24 - Added csvhead script (9 months ago)<Christopher Smal
* 9596b65 - Added csvless script (9 months ago)<Christopher Smal
| * 5553b9e - (origin/other-idea) Ran results of sequences by lo
| * de48429 - Looking at sequences by location (9 months ago)<Ch
|/
* 3b0fac2 - Rewrote build.sh with env variables (9 months ago)<C
* 2c17799 - Added other metadata counting steps (9 months ago)<C
* d07378b - Computing number of sequences per species in build.s

Branches

Branches give us a way of referring to alternate histories.

  • master is generally the “main” or “production” branch
  • other branches can let us keep work out of master until it’s ready (see other-idea branch)

Grab a remote branch

When we clone, git only creates a local copy of the master branch, but the other branches are still fetched and show up as remote branches like origin/other-idea.

To check out one of these branches, we can do

git checkout other-idea
git glog

Note that we now see

| * 5553b9e - (HEAD -> other-idea, origin/other-idea) Ran results of sequences by location (1 year, 4 months ago)<Christopher T Small>
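The same behavior can be reproduced offline. The sketch below assumes a local repository (called upstream here, a made-up name) standing in for GitHub, and reads the default branch name from git rather than assuming master, since newer installs may use main:

```shell
set -e

work=$(mktemp -d)

# Hypothetical "upstream" repository with a main line and an other-idea branch
git init -q "$work/upstream"
cd "$work/upstream"
git config user.name "Example User"
git config user.email "user@example.com"
echo "notes" > README.md
git add README.md
git commit -q -m "Initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)
git checkout -q -b other-idea
echo "idea" > idea.txt
git add idea.txt
git commit -q -m "Try another idea"
git checkout -q "$main_branch"

# Clone it: only the default branch gets a local copy
git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"
local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
remote_branches=$(git for-each-ref --format='%(refname:short)' refs/remotes)
echo "Local:  $local_branches"
echo "Remote: $remote_branches"

# Checking out the name creates a local branch that tracks origin/other-idea
git checkout -q other-idea
current_branch=$(git rev-parse --abbrev-ref HEAD)
echo "Now on: $current_branch"
```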

Creating a new branch

You can do this any time you work on something you’re not sure you want to keep, or that follows a separate track of development.

git checkout -b my-new-branch

This will create a new branch from whatever commit you currently have checked out (HEAD).
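A small sketch of what "from whatever commit you currently have checked out" means, assuming a scratch repository with a single made-up commit: right after creation, the new branch and the branch you started from point at the same commit.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "notes" > README.md
git add README.md
git commit -q -m "Initial commit"

# Remember where we started (master or main, depending on your git setup)
start_branch=$(git rev-parse --abbrev-ref HEAD)

# Create the branch and switch to it in one step
git checkout -q -b my-new-branch
current_branch=$(git rev-parse --abbrev-ref HEAD)

# Both names resolve to the same commit until one of them gets new commits
start_commit=$(git rev-parse "$start_branch")
new_commit=$(git rev-parse my-new-branch)
echo "$start_branch -> $start_commit"
echo "$current_branch -> $new_commit"
```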

We’ll look at how you can reconcile (merge) histories a little later.

Getting back to the master branch

Make sure to do this before continuing…

git checkout master
git glog

Tracing history

We can see the diffs for each commit with git show:

git show 1aa457d
git show f566a9
git show cac1218
# Skipping a couple..
git show d07378b

Doing more with diff

We can also compare specific commits with git diff.

git diff 1aa457d d07378b

git show or diff select files

# Very long...
git show 2c17799

# If we really just care about build.sh changes...
git show 2c17799 build.sh

# Or with diff
git diff 1aa457d d07378b build.sh

This is pretty valuable as your project gets big and lots of things change.
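Here is a sketch of limiting a diff to one file, on a scratch repository with two made-up commits. It uses --name-only to keep the output short, and puts -- before the path, which is the unambiguous way to tell git "what follows is a file, not a commit".

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit: two files
echo "notes v1" > README.md
echo "echo step one" > build.sh
git add README.md build.sh
git commit -q -m "Add notes and build script"
first_commit=$(git rev-parse HEAD)

# Second commit: change both files
echo "notes v2" > README.md
echo "echo step two" >> build.sh
git add README.md build.sh
git commit -q -m "Update notes and build script"
second_commit=$(git rev-parse HEAD)

# Every file that differs between the two commits
all_changed=$(git diff --name-only "$first_commit" "$second_commit")

# Only build.sh; the "--" separates commits from file paths
only_build=$(git diff --name-only "$first_commit" "$second_commit" -- build.sh)

echo "All changed files:"
echo "$all_changed"
echo "Limited to build.sh:"
echo "$only_build"
```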

Pushing the data back up

So far, we’ve forked a repository and cloned that fork locally.

Let’s complete the circle by making some changes and pushing them back up to the main repository.

First, some changes

In README.md, add:

## June 24, 3:30PM

Ran the location trees.
Interesting data.
Thinking about some other studies now.

Adding a new commit, as before

git add README.md

git commit -m "Add location analyses notes"

Pushing changes to our GitHub fork

# Pushing changes on branch `master` to remote `origin`
git push origin master
# Can also do `git push`, which pushes current branch to origin

# When prompted, enter your GH username and a personal access token
# (GitHub no longer accepts account passwords for pushes over HTTPS)

Pull up GitHub, reload, and see the new commit there.

Sharing your changes with a “pull request”

Pull requests are a way of suggesting changes to other people’s repositories.

Like “forking”, it’s a GitHub-specific thing.

  • Click on the “New pull request” button just above the list of files on your fork’s page
  • Note “base” and “head” forks: base is where the changes will go, head is where they’re from
    • (Sometimes you might change the branches; not today though)
  • Click “Create pull request”

If you get a pull request, you can merge it

This will show up on the repository page.

Some pull requests can be merged automatically, others need to be done from the command line.

We’ve now seen changes make the whole circuit!

Some things to keep in mind

  • Changes move neither up nor down without being requested
  • You can’t change a repository you haven’t been given access/permissions to
  • Fork / Clone / Edit / Push / Pull Request
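The Clone / Edit / Push part of that loop can be practiced without a network connection. In this sketch a local bare repository (fork.git, a made-up name) stands in for your GitHub fork; forking and pull requests themselves happen on the GitHub website and are not shown.

```shell
set -e

work=$(mktemp -d)

# A bare repository stands in for the GitHub fork (assumption: no network needed)
git init -q --bare "$work/fork.git"

# Clone
git clone -q "$work/fork.git" "$work/clone"
cd "$work/clone"
git config user.name "Example User"
git config user.email "user@example.com"

# Edit and commit
echo "Ran the location trees." > README.md
git add README.md
git commit -q -m "Add location analyses notes"

# Push the current branch (master or main, depending on your git setup)
branch=$(git rev-parse --abbrev-ref HEAD)
git push -q origin "$branch"

# The remote copy now points at the same commit as our local branch
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/fork.git" rev-parse "$branch")
echo "local:  $local_head"
echo "remote: $remote_head"
```

Nothing moved to the remote until the explicit git push, which is the first point in the list above.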

How to go “back in time”

Pick a commit to check out, like b445eea

  • Type git checkout -b backintime b445eea: creates a new “branch” named backintime based on the desired commit
  • Reload the script file
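A self-contained sketch of the same move, assuming a scratch repository with two made-up commits: after branching from the older commit, the working copy of build.sh is back to its earlier contents, while the newer commit stays safe on the original branch.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "echo build trees" > build.sh
git add build.sh
git commit -q -m "Add build script"
old_commit=$(git rev-parse HEAD)

echo "echo run alignment" >> build.sh
git add build.sh
git commit -q -m "Add alignment step"

# Create a branch named backintime that starts at the older commit
git checkout -q -b backintime "$old_commit"
current_branch=$(git rev-parse --abbrev-ref HEAD)

# The file on disk now matches the older commit
restored=$(cat build.sh)
echo "On $current_branch, build.sh contains: $restored"
```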

Make some edits

Instead of worrying about making trees for each location, let’s just directly count the number of sequence names per location to make sure they match up.

  # ...

  # Directly count number of sequences
  loc_spec_count="$loc_outdir/seqcount"
  wc -l "$loc_sequences" > "$loc_spec_count"
done

# Combine sequence counts by location
loc_spec_counts="$outdir/location_specimen_counts.txt"
find "$outdir" -name seqcount | xargs cat > "$loc_spec_counts"

Now commit and look at our tree

git add build.sh
git commit -m "Add direct sequence per location count"

# Now look at our history; we've branched!
git glog

Merging back into master

Say we want to keep these changes and merge them into the most up to date code.

We have to do a merge.

Git merge

# Switch to branch into which you want to merge changes (aka the HEAD branch).
git checkout master

# Next merge backintime into master
git merge backintime

Clean merges and conflicting merges

If all of your changes are in different parts of the code than any changes made on the other branch since the histories split, you’re done! The branches can be automatically merged, and there will be peace in the kingdom.
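For comparison, here is a sketch of a merge that goes through cleanly. It assumes a scratch repository where the two branches touch different files (the simplest non-overlapping case); the file names are made up.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "echo build trees" > build.sh
git add build.sh
git commit -q -m "Add build script"
main_branch=$(git rev-parse --abbrev-ref HEAD)

# One history adds a counting script...
git checkout -q -b backintime
echo "echo count sequences" > counts.sh
git add counts.sh
git commit -q -m "Add direct sequence count"

# ...the other adds notes
git checkout -q "$main_branch"
echo "notes" > README.md
git add README.md
git commit -q -m "Add notes"

# No overlap, so git merges automatically; --no-edit accepts the default message
git merge -q --no-edit backintime

merged_files=$(git ls-files)
echo "$merged_files"
```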

In our case however, the changes overlap (try running git status). This means we need to resolve the conflicts.

Exercise

Open build.sh in vim and go down to the bottom where we made our changes.

  • Changes on the branch you merge into are placed between <<<<<<< HEAD and =======
  • Changes on the branch you merge from are placed between ======= and >>>>>>> backintime

In our case, we want to keep both changes, so simply delete the demarcation lines, then save, exit, and run

git add build.sh
git commit -m "Resolve build conflicts"
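If you want to practice conflict resolution on something smaller, here is a sketch that manufactures a conflict in a scratch repository and resolves it by keeping both sides. The file contents are made up, and deleting marker lines with grep is only a shortcut for this toy case; in real code you would read each side and edit by hand.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
# Use the default two-sided marker style regardless of global settings
git config merge.conflictStyle merge

echo "original line" > build.sh
git add build.sh
git commit -q -m "Add build script"
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Both branches rewrite the same line
git checkout -q -b backintime
echo "count sequences directly" > build.sh
git commit -q -am "Add direct sequence per location count"

git checkout -q "$main_branch"
echo "build location trees" > build.sh
git commit -q -am "Build location trees"

# The merge stops on the conflict and exits non-zero
if git merge backintime > /dev/null 2>&1; then
  merge_status="clean"
else
  merge_status="conflict"
fi
echo "Merge result: $merge_status"

# Look at the markers git wrote into the file
cat build.sh
marker_count=$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' build.sh)

# Keep both sides by dropping only the marker lines
grep -v -E '^(<<<<<<<|=======|>>>>>>>)' build.sh > build.sh.resolved
mv build.sh.resolved build.sh

git add build.sh
git commit -q -m "Resolve build conflicts"
```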

Some guidelines for using Git

  • Commit often
  • Try to make commits “atomic” (commit unrelated code/data changes separately)
  • Make messages short but good
    • Think of it as a mini command for the state of your repo; what will happen to your repo if you apply the commit
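As a sketch of what "atomic" looks like in practice, suppose you edited both your notes and your build script in one sitting. Staging the files one at a time lets each change get its own commit and message. The repository and file contents below are made up for illustration.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "notes" > README.md
echo "echo build trees" > build.sh
git add README.md build.sh
git commit -q -m "Initial commit"

# Two unrelated edits made in the same sitting
echo "Ran the location trees." >> README.md
echo "echo count sequences" >> build.sh

# Stage and commit each one separately
git add README.md
git commit -q -m "Add location analyses notes"
git add build.sh
git commit -q -m "Add direct sequence per location count"

# Files touched by the latest commit only
last_commit_files=$(git diff --name-only HEAD~1 HEAD)
commit_count=$(git rev-list --count HEAD)
echo "Latest commit touched: $last_commit_files"
echo "Total commits: $commit_count"
```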

Challenges for bigger data

Having biggish data that updates frequently can slow git down quite a bit.

One solution is to track the output data (and maybe even input data) in separate repositories, which you “ignore” from the main repository. This has a few problems too though:

  • It’s more work keeping multiple repositories up to date
  • It’s more work matching the code / data versions when split across repositories

GitHub is solving this for large files, but the problem remains for lots of smaller files…
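The "ignore" half of that approach is done with a .gitignore file. A minimal sketch, assuming generated results live in a directory called output/ (a made-up layout); setting up the separate data repository is not shown.

```shell
set -e

# Hypothetical scratch repository
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical layout: code at the top level, generated data under output/
mkdir output
echo "echo build trees" > build.sh
echo "(big generated file)" > output/results.txt

# Tell git to ignore everything under output/
echo "output/" > .gitignore

git add .
git commit -q -m "Add build script and ignore output"

# Only the code and the ignore file are tracked
tracked_files=$(git ls-files)
echo "$tracked_files"

# check-ignore exits 0 when the path is ignored
if git check-ignore -q output/results.txt; then
  ignored="yes"
else
  ignored="no"
fi
echo "output/results.txt ignored: $ignored"
```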

Don’t freak out!

Git can be intimidating…

When the office git expert has to come fix everything

But 90% of the time, you’ll be using

git init
git add ...
git commit -m "..."
git status
git log  # (or git glog, as you wish)
git diff

Learn those, start using them, and Google the rest as you need it.

Exercises

  • Create a new branch and add something to it, then merge back in to master
  • Create a new repository and push it to your account on GitHub
  • Create two clones of a repository, make conflicting changes in them, and then push them both up to master (hint: you’ll have to do a merge on one of the machines)

Reading

For this class:

  • Chapter 5

For next class (if you want to jump ahead):
