In which with our fleets of flying, speaking beasts we gain mastery over time
If you do not have a GitHub account yet, please go to github.com and sign up while we get started (it’s free).
Like Back to the Future for data/code (on steroids, sans DeLorean):
Analogy from the book:
Imagine you keep a lab notebook in pencil, and each time you run a PCR you erase your past specifics and jot down the newest ones…
This is functionally equivalent to not versioning your code…
There are four or five commands you should know and start using now. The goal for the rest of the class is to generally understand what is possible should you need it.
So relax and enjoy the ride.
We’ll be using git to version our project from here on out.
There are other version control systems out there (mercurial, etc.), but git is currently by far the most popular in bioinformatics, and it’s lovely!
git config --global user.name "<your-real-name>"
git config --global user.email "<your-email-address>"
# Some nice color modes for git output
git config --global color.ui true
(Note: You can see all of your current settings with
Before you can use git on a project, you have to initialize a git repository.
# Get to project dir
cd ~/bioinfclass
git status
# Initialize
git init
git status
Commits are the basis for most of git. They are our waypoints as we travel through time.
Initializing a project only sets things up; we still have to make our first commit (save state).
# Tell git what files to track (staging)
git add *
# Make initial commit
git commit -m "Initial commit"
We now have a saved state on which to build without fear of “messing things up”.
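To try the initialize, stage, commit sequence without touching real work, the same steps can be run in a throwaway directory. This is only a sketch: the scratch directory, the file name `notes.txt`, and the "Demo User" identity are placeholders, not part of the class project.

```shell
# Work in a throwaway directory (hypothetical; not the class project)
demo=$(mktemp -d)
cd "$demo"

# Initialize an empty repository
git init -q

# Identity for this repository only (placeholder values)
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create a file, stage it, and make the first commit
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"

# Count the commits reachable from HEAD and check for leftover changes
commit_count=$(git rev-list --count HEAD)
pending=$(git status --porcelain)
echo "commits so far: $commit_count"
```

After the commit, `git status --porcelain` prints nothing, which is the machine-readable way of saying the working directory is clean.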
Write some things in
# Bioinfclass Notes

Where you type out notes and stuff...

## Jun 24, 2015

Learned how to use git! It was pretty fun.
Note: This is formatted in Markdown.
# Check the status of the repo
git status
# Seeing specific changes
git diff
Git diff uses + and - (and optionally, colors) to show what’s changed.
# Stage the changed file
git add README.md
# Commit staged changes, with commit message
git commit -m "Add notes to README"
# Checking our status
git status
git log
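One detail of this cycle that often surprises people: plain `git diff` compares the working directory to the staging area, so it goes quiet once a change has been staged, while `git diff --staged` shows what will go into the next commit. The sketch below demonstrates this in a scratch repository; the directory and identity are placeholders.

```shell
# Scratch repository with one committed file (placeholder setup)
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# Notes" > README.md
git add README.md
git commit -q -m "Initial commit"

# Edit the tracked file
echo "Learned how to use git!" >> README.md

# Before staging: the change is visible to plain `git diff`
unstaged_before=$(git diff --name-only)

# After staging: plain `git diff` is empty, `git diff --staged` shows the change
git add README.md
unstaged_after=$(git diff --name-only)
staged=$(git diff --staged --name-only)

git commit -q -m "Add notes to README"
git log --oneline
```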
Start using this immediately, and learn the rest as needed.
The remainder of this class is a survey of the more advanced features of git. Most is generally useful, but a lot is most valuable for collaboration. As such, it’s more important at this point to know what’s possible than remember how to do it all.
The de facto home of open source on the internet.
Visit github.com and log in.
You are now on your “fork” or copy of the project
Copy the “HTTPS” clone url of the project on GitHub, then
# Go home, and rename the directory we've been working on
cd ~
mv bioinfclass old_bioinfclass
# Make the checkout
git clone <paste-your-https-url> bioinfclass
# Enter the directory we created, and see what's there
cd bioinfclass
tree
git status
# See the list of commits
git log
Wow! Such commit history…
# Add an alias to a prettier log command
git config --global alias.glog "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%C(bold blue)<%an>%Creset' --abbrev-commit --all"
# Try it out!
git glog
git glog -n 5
Notice the history branching, the SHA hashes (a4b8893, etc.), commit messages, authors, and human-friendly time strings.
* 4467411 - (HEAD -> master, origin/master, origin/HEAD) Finishe ...
* a4b8893 - Added previous tree and alignment analysis to build.
* 2683a24 - Added csvhead script (9 months ago)<Christopher Smal
* 9596b65 - Added csvless script (9 months ago)<Christopher Smal
| * 5553b9e - (origin/other-idea) Ran results of sequences by lo
| * de48429 - Looking at sequences by location (9 months ago)<Ch
|/
* 3b0fac2 - Rewrote build.sh with env variables (9 months ago)<C
* 2c17799 - Added other metadata counting steps (9 months ago)<C
* d07378b - Computing number of sequences per species in build.s
Branches give us a way of referring to alternate histories.
master is generally the “main” or “production” branch
master until it’s ready (see
When we clone, git only creates a local master branch (the other branches are fetched but not checked out), so we can still see remote branches like
To check out one of these branches, we can do
git checkout other-idea
git glog
Note that we now see
| * 5553b9e - (HEAD -> other-idea, origin/other-idea) Ran results of sequences by location (1 year, 4 months ago)<Christopher T Small>
You can do this any time you work on something you’re not sure you want to keep, or that follows a separate track of development.
git checkout -b my-new-branch
This will create a new branch from whatever commit you currently have checked out (
We’ll look at how you can reconcile (merge) histories a little later.
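To see that work on a new branch leaves the starting branch untouched, here is a small sketch in a scratch repository. The file name and identity are placeholders, and the starting branch name is read from git rather than assumed, since it may be `master` or `main` depending on how git is configured.

```shell
# Scratch repository with one commit (placeholder setup)
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "base" > file.txt
git add file.txt
git commit -q -m "Base commit"

# Remember which branch we started on
start_branch=$(git symbolic-ref --short HEAD)

# Create a branch from the current commit and commit an experiment on it
git checkout -q -b my-new-branch
echo "experiment" >> file.txt
git commit -q -a -m "Try an experiment"

# Back on the starting branch, the experiment is not there
git checkout -q "$start_branch"
lines_on_start=$(wc -l < file.txt)
commits_on_start=$(git rev-list --count "$start_branch")
commits_on_new=$(git rev-list --count my-new-branch)
```

If the experiment does not pan out, the branch can simply be left alone or deleted; the starting branch never saw it.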
Make sure to do this before continuing…
git checkout master
git glog
We can see the diffs for each commit with
git show 1aa457d
git show f566a9
git show cac1218
# Skipping a couple..
git show d07378b
We can also compare specific commits with
git diff 1aa457d d07378b
# Very long...
git show 2c17799
# If we really just care about build.sh changes...
git show 2c17799 build.sh
# Or with diff
git diff 1aa457d d07378b build.sh
This is pretty valuable as your project gets big and lots of things change.
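The path filter can be tried out in a scratch repository with two files. In this sketch the file contents and identity are placeholders, and the optional `--` separator is used to make it unambiguous that `build.sh` is a path rather than a commit or branch name.

```shell
# Scratch repository with two files and two commits (placeholder setup)
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "echo step 1" > build.sh
echo "notes" > README.md
git add build.sh README.md
git commit -q -m "First version"
first=$(git rev-parse --short HEAD)

echo "echo step 2" >> build.sh
echo "more notes" >> README.md
git commit -q -a -m "Change both files"
second=$(git rev-parse --short HEAD)

# Every file that changed between the two commits
all_changed=$(git diff --name-only "$first" "$second")

# Only the changes to build.sh
only_build=$(git diff --name-only "$first" "$second" -- build.sh)
git diff "$first" "$second" -- build.sh
```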
So far, we’ve forked a repository and cloned that fork locally.
Let’s complete the circle by making some changes and pushing them back up to the main repository.
In README.md, add:
## June 24, 3:30PM

Ran the location trees. Interesting data. Thinking about some other studies now.
git add README.md
git commit -m "Add location analyses notes"
# Pushing changes on branch `master` to remote `origin`
git push origin master
# Can also do `git push`, which pushes current branch to origin
# When prompted, enter your GitHub username and a personal access token
# (GitHub no longer accepts account passwords for HTTPS pushes)
Pull up GitHub, reload, and see the new commit there.
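The clone, commit, push loop can also be rehearsed without a network by letting a bare repository on disk stand in for GitHub. This is only a sketch: `remote.git`, the scratch directory, and the identity are hypothetical, and no credentials are involved because the "remote" is a local path.

```shell
# A bare repository on disk plays the role of the GitHub remote (hypothetical)
work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git

# Clone it (git warns that the repository is empty; that is expected here)
git clone -q remote.git bioinfclass 2>/dev/null
cd bioinfclass
git config user.name "Demo User"
git config user.email "demo@example.com"

# Make a commit locally
echo "## Notes" > README.md
git add README.md
git commit -q -m "Add location analyses notes"

# Push the current branch to the remote named `origin`
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# The remote's branch tip should now match our local commit
local_tip=$(git rev-parse HEAD)
remote_tip=$(git --git-dir="$work/remote.git" rev-parse "$branch")
```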
This will show up on the repository page.
Some pull requests can be merged automatically, others need to be done from the command line.
Some things to keep in mind
Pick a commit to check out, like
git checkout -b backintime b445eea: creates a new “branch” named backintime based on the desired commit
Instead of worrying about making trees for each location, let’s just directly count the number of sequence names per location to make sure they match up.
# ...
    # Directly count number of sequences
    loc_spec_count="$loc_outdir/seqcount"
    wc -l $loc_sequences > $loc_spec_count
done

# Combine sequence counts by location
loc_spec_counts="$outdir/location_specimen_counts.txt"
find $outdir -name seqcount | xargs cat > $loc_spec_counts
git add build.sh
git commit -m "Add direct sequence per location count"
# Now look at our history; we've branched!
git glog
Say we want to keep these changes and merge them into the most up to date code.
We have to do a merge.
# Switch to branch into which you want to merge changes (aka the HEAD branch).
git checkout master
# Next merge backintime into master
git merge backintime
If all of your changes are in different parts of the code than any changes made on the other branch since the histories split, you’re done! The branches can be automatically merged, and there will be peace in the kingdom.
In our case, however, the changes overlap (try running git status). This means we need to resolve the conflicts.
Open the build.sh file (for example, with vim build.sh) and go down to the bottom, where we made our changes.
In our case, we want to keep both changes, so simply delete the demarcation lines (the ones starting with <<<<<<<, =======, and >>>>>>>), then save, exit, and then
git add build.sh
git commit -m "Resolve build conflicts"
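The whole conflict-and-resolve sequence can be reproduced in a scratch repository to get a feel for it. In this sketch the contents of `build.sh` and the identity are placeholders, and the marker lines are stripped with `grep` purely for demonstration. In real work you would edit the file by hand, since a blanket filter cannot tell which side of a conflict you want.

```shell
# Scratch repository with a base commit (placeholder setup)
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
# Use git's default two-sided conflict markers regardless of global settings
git config merge.conflictStyle merge

echo "echo build" > build.sh
git add build.sh
git commit -q -m "Base"
base=$(git rev-parse HEAD)
main_branch=$(git symbolic-ref --short HEAD)

# One change on the main line...
echo "echo make trees" >> build.sh
git commit -q -a -m "Add tree step"

# ...and an overlapping change on a branch created from the older commit
git checkout -q -b backintime "$base"
echo "echo count sequences" >> build.sh
git commit -q -a -m "Add count step"

# Merging reports a conflict (non-zero exit status)
git checkout -q "$main_branch"
if git merge backintime > /dev/null 2>&1; then
    merge_result="clean"
else
    merge_result="conflict"
fi

# Keep both changes: drop only the three marker lines (demo shortcut)
grep -v -E '^(<<<<<<<|=======|>>>>>>>)' build.sh > build.sh.resolved
mv build.sh.resolved build.sh

# Staging the file marks the conflict as resolved; committing finishes the merge
git add build.sh
git commit -q -m "Resolve build conflicts"
merge_commits=$(git log --merges --oneline | wc -l)
```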
Having biggish data that updates frequently can slow git down quite a bit.
One solution is to track the output data (and maybe even input data) in separate repositories, which you “ignore” from the main repository. This has a few problems too though:
GitHub is solving this for large files, but the problem remains for lots of smaller files…
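The "ignore" part of that approach is usually done with a `.gitignore` file. A minimal sketch, assuming the output data lives in a directory called `output/` (a hypothetical name; the file names and identity are placeholders too):

```shell
# Scratch repository with some code and some output data (placeholder setup)
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir output
echo "big results" > output/results.txt
echo "echo build" > build.sh

# Tell git to ignore everything under output/
echo "output/" > .gitignore

# `git add .` now skips the ignored directory
git add .
git commit -q -m "Track code, ignore output"

# List what git tracks, and ask git whether the output file is ignored
tracked=$(git ls-files)
if git check-ignore -q output/results.txt; then
    ignored="yes"
else
    ignored="no"
fi
echo "$tracked"
```

The output directory could then be its own repository, or be left unversioned entirely; either way the main repository stays small.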
Git can be intimidating…
When the office git expert has to come fix everything
git init
git add ...
git commit -m "..."
git status
git log # (or git glog, as you wish)
git diff
Learn those, start using them, and Google the rest as you need it.
For this class:
For next class (if you want to jump ahead):