git
In which with our fleets of flying, speaking beasts we gain mastery over time
If you don’t already have a GitHub account, please go to github.com and sign up while we get started (it’s free).
Like Back to the Future for data/code (on steroids, sans DeLorean):
Analogy from the book:
Imagine you keep a lab notebook in pencil, and each time you run a PCR you erase your past specifics and jot down the newest ones…
This is functionally equivalent to not versioning your code…
There are four or five commands you should know and start using now. The goal for the rest of the class is to generally understand what is possible should you need it.
So relax and enjoy the ride.
We’ll be using git to version our project from here on out.
There are other version control systems out there (svn, mercurial, etc.), but git is currently by far the most popular in bioinformatics, and it’s lovely!
git config --global user.name "<your-real-name>"
git config --global user.email "<your-email-address>"
# Some nice color modes for git output
git config --global color.ui true
(Note: You can see all of your current settings with cat ~/.gitconfig)
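To check a single setting rather than reading the whole file, git config --get prints one value. A minimal sketch, run in a throwaway repository so nothing global is touched (the demo directory and the "Demo User" identity are placeholders, not part of the class project):

```shell
# Scratch repository so we don't touch the class project or ~/.gitconfig
demo="$(mktemp -d)"
cd "$demo"
git init -q

# Without --global, the setting is stored in this repo's .git/config
git config user.name "Demo User"

# Print a single value; a repository-local setting wins over the global one
name="$(git config --get user.name)"
echo "$name"

# List only the settings stored in this repository
git config --local --list
```

The same --get form works with --global to read back what the commands above stored.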
Before you can use git on a project, you have to initialize a git repository.
# Get to project dir
cd ~/bioinfclass
git status
# Initialize
git init
git status
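As a rough check of what git init actually did, here is a sketch in a scratch directory (demo is a placeholder name). All of git's bookkeeping lives in a hidden .git directory at the top of the project:

```shell
# Work in an empty scratch directory
demo="$(mktemp -d)"
cd "$demo"

# Before init there is no repository here, so rev-parse fails
git rev-parse --git-dir 2>/dev/null || echo "not a repository yet"

# Initialize
git init -q

# The hidden .git directory now holds the repository data
ls -a

# Prints "true" once we are inside a repository's working tree
inside="$(git rev-parse --is-inside-work-tree)"
echo "$inside"
```

Deleting the .git directory would remove the history but leave your files as they are.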
Commits are the basis for most of git. They are our waypoints as we travel through time.
Initializing a project only sets things up; we still have to make our first commit (save state).
# Tell git what files to track (staging)
git add *
# Make initial commit
git commit -m "Initial commit"
We now have a saved state on which to build without fear of “messing things up”.
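A small sketch of how staging and committing relate, again in a scratch repository (the file names and identity are made up for illustration). Only what has been staged with git add goes into the commit:

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "notes" > README.md
echo "scratch" > scratch.txt

# Stage only README.md; scratch.txt stays untracked
git add README.md
git commit -q -m "Initial commit"

# Short status: "??" marks files git is not tracking
git status --short

# Files recorded in the commit we just made
tracked="$(git ls-tree --name-only HEAD)"
echo "$tracked"
```

Note that git add * relies on the shell expanding *, so hidden files such as .gitignore are typically skipped; git add . stages everything under the current directory.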
Write some things in README.md:
# Bioinfclass Notes
Where you type out notes and stuff...
## Jun 24, 2015
Learned how to use git!
It was pretty fun.
Note: This is formatted in Markdown.
# Check the status of the repo
git status
# Seeing specific changes
git diff
Git diff uses + and - (and optionally, colors) to show what’s changed.
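To see the + and - lines on a tiny example, here is a sketch in a scratch repository (notes.txt and its contents are invented for illustration). It also shows that plain git diff only covers unstaged changes, while git diff --staged covers what has been staged:

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'line one\nline two\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Change the second line
printf 'line one\nline 2\n' > notes.txt

# Unstaged changes: working directory compared with the staging area
unstaged="$(git diff)"
echo "$unstaged"

# After staging, plain `git diff` shows nothing...
git add notes.txt
after_add="$(git diff)"

# ...and --staged shows what the next commit will contain
staged="$(git diff --staged)"
echo "$staged"
```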
# Stage the changed file
git add README.md
# Commit staged changes, with commit message
git commit -m "Add notes to README"
# Checking our status
git status
git log
Start using this immediately, and learn the rest as needed.
The remainder of this class is a survey of the more advanced features of git. Most of it is generally useful, though much of it is most valuable for collaboration. As such, it’s more important at this point to know what’s possible than to remember how to do it all.
The de facto home of open source on the internet.
Visit github.com and log in.
You are now on your “fork” or copy of the project
Copy the “HTTPS” clone URL of the project on GitHub, then
# Go home, and rename the directory we've been working on
cd ~
mv bioinfclass old_bioinfclass
# Make the checkout
git clone <paste-your-https-url> bioinfclass
# Enter the directory we created, and see what's there
cd bioinfclass
tree
git status
# See the list of commits
git log
Wow! Such commit history…
# Add an alias to a prettier log command
git config --global alias.glog "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%C(bold blue)<%an>%Creset' --abbrev-commit --all"
# Try it out!
git glog
git glog -n 5
Notice the history branching, the sha hashes (a4b8893, etc.), commit messages, author, and human-friendly time string.
* 4467411 - (HEAD -> master, origin/master, origin/HEAD) Finishe
...
* a4b8893 - Added previous tree and alignment analysis to build.
* 2683a24 - Added csvhead script (9 months ago)<Christopher Smal
* 9596b65 - Added csvless script (9 months ago)<Christopher Smal
| * 5553b9e - (origin/other-idea) Ran results of sequences by lo
| * de48429 - Looking at sequences by location (9 months ago)<Ch
|/
* 3b0fac2 - Rewrote build.sh with env variables (9 months ago)<C
* 2c17799 - Added other metadata counting steps (9 months ago)<C
* d07378b - Computing number of sequences per species in build.s
Branches give us a way of referring to alternate histories.
master is generally the “main” or “production” branch.
Work in progress is usually kept off master until it’s ready (see the other-idea branch).
When we clone, git only sets up a local master branch, but we can still see remote branches like origin/other-idea.
To check out one of these branches, we can do
git checkout other-idea
git glog
Note that we now see
| * 5553b9e - (HEAD -> other-idea, origin/other-idea) Ran results of sequences by location (1 year, 4 months ago)<Christopher T Small>
You can do this any time you work on something you’re not sure you want to keep, or that follows a separate track of development.
git checkout -b my-new-branch
This will create a new branch from whatever commit you currently have checked out (HEAD).
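A minimal sketch of what creating a branch does, run in a scratch repository (the file and identity are placeholders). The new branch starts out pointing at the very same commit as the branch it was created from:

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
# Name the first branch master regardless of local git defaults
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "v1" > file.txt
git add file.txt
git commit -q -m "First commit"

# Create a new branch from the current commit (HEAD) and switch to it
git checkout -q -b my-new-branch

# The asterisk marks the branch that is checked out
git branch

current="$(git rev-parse --abbrev-ref HEAD)"
echo "$current"

# Both branch names resolve to the same commit until one of them gets new commits
master_commit="$(git rev-parse master)"
branch_commit="$(git rev-parse my-new-branch)"
```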
We’ll look at how you can reconcile (merge) histories a little later.
Make sure to do this before continuing…
git checkout master
git glog
We can see the diffs for each commit with git show:
git show 1aa457d
git show f566a9
git show cac1218
# Skipping a couple..
git show d07378b
diff
We can also compare specific commits with git diff.
git diff 1aa457d d07378b
git show or diff on select files
# Very long...
git show 2c17799
# If we really just care about build.sh changes...
git show 2c17799 build.sh
# Or with diff
git diff 1aa457d d07378b build.sh
This is pretty valuable as your project gets big and lots of things change.
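Here is a sketch of limiting a diff to one file, using a scratch repository with two invented files and two commits. The -- separator is optional in the commands above, but it makes explicit where the commits end and the paths begin:

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "a1" > build.sh
echo "b1" > notes.txt
git add build.sh notes.txt
git commit -q -m "First"
first="$(git rev-parse HEAD)"

echo "a2" >> build.sh
echo "b2" >> notes.txt
git add build.sh notes.txt
git commit -q -m "Second"
second="$(git rev-parse HEAD)"

# Summary of every file that changed between the two commits
git diff --stat "$first" "$second"

# Only the changes to build.sh
only_build="$(git diff "$first" "$second" -- build.sh)"
echo "$only_build"

# Same idea for a single commit
git show --stat "$second" -- build.sh
```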
So far, we’ve forked a repository and cloned that fork locally.
Let’s complete the circle by making some changes and pushing them back up to our fork on GitHub.
In README.md, add:
## June 24, 3:30PM
Ran the location trees.
Interesting data.
Thinking about some other studies now.
git add README.md
git commit -m "Add location analyses notes"
# Pushing changes on branch `master` to remote `origin`
git push origin master
# Can also do `git push`, which pushes current branch to origin
# When prompted, enter your GitHub username and password
# (GitHub now requires a personal access token here instead of your account password)
Pull up GitHub, reload, and see the new commit there.
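If you want to rehearse the push without touching GitHub, a local "bare" repository can stand in for the remote. This is only a sketch: remote.git and checkout are placeholder names, and a real GitHub remote additionally asks for credentials.

```shell
work="$(mktemp -d)"
cd "$work"

# A bare repository (no working files) plays the role of the GitHub remote
git init -q --bare remote.git

# Clone it, just as we cloned our fork (cloning an empty repo prints a warning)
git clone -q remote.git checkout
cd checkout
# Name the first branch master regardless of local git defaults
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "Ran the location trees." > README.md
git add README.md
git commit -q -m "Add location analyses notes"

# Push branch master to the remote named origin
git push -q origin master

# The stand-in remote now has the commit
pushed="$(git -C ../remote.git log --format=%s master)"
echo "$pushed"
```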
Pull requests are a way of suggesting changes to other people’s repositories.
Like “forking”, they’re a GitHub-specific thing.
This will show up on the repository page.
Some pull requests can be merged automatically, others need to be done from the command line.
Some things to keep in mind
Pick a commit to check out, like b445eea
git checkout -b backintime b445eea: creates a new “branch” named backintime based on the desired commit
Instead of worrying about making trees for each location, let’s just directly count the number of sequence names per location to make sure they match up.
# ...
# Directly count number of sequences
loc_spec_count="$loc_outdir/seqcount"
wc -l "$loc_sequences" > "$loc_spec_count"
done
# Combine sequence counts by location
loc_spec_counts="$outdir/location_specimen_counts.txt"
find "$outdir" -name seqcount | xargs cat > "$loc_spec_counts"
git add build.sh
git commit -m "Add direct sequence per location count"
# Now look at our history; we've branched!
git glog
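The same "go back and branch" move can be tried in a scratch repository. In this sketch the three "Step" commits and the file contents are invented, and the old commit is found with HEAD~2 (two commits before HEAD) rather than a hash copied from the log:

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
# Name the first branch master regardless of local git defaults
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

# Build a short history of three commits on master
for n in 1 2 3; do
  echo "step $n" >> build.sh
  git add build.sh
  git commit -q -m "Step $n"
done

# Short hash of the first commit
old="$(git rev-parse --short HEAD~2)"

# Create a branch starting at that older commit and switch to it
git checkout -q -b backintime "$old"
cat build.sh

# Commit something new on top of the old state
echo "alternate step" >> build.sh
git add build.sh
git commit -q -m "Alternate history"

# The graph now shows two lines of history splitting after "Step 1"
git log --graph --oneline --all
```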
Say we want to keep these changes and merge them into the most up to date code.
We have to do a merge.
# Switch to branch into which you want to merge changes (aka the HEAD branch).
git checkout master
# Next merge backintime into master
git merge backintime
If all of your changes are in different parts of the code than any changes on the other branch since the histories split, you’re done! The branches can be automatically merged, and there will be peace in the kingdom.
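A sketch of that happy case, in a scratch repository with an invented eight-line file: one branch edits the first line, the other edits the last, and git combines them without asking. The branch name side is a placeholder.

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
# Name the first branch master regardless of local git defaults
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

printf '%s\n' one two three four five six seven eight > build.sh
git add build.sh
git commit -q -m "Base"

# On a side branch, change the first line
git checkout -q -b side
printf '%s\n' ONE two three four five six seven eight > build.sh
git commit -q -am "Change first line on side"

# Back on master, change the last line
git checkout -q master
printf '%s\n' one two three four five six seven EIGHT > build.sh
git commit -q -am "Change last line on master"

# The edits are far apart, so the merge completes on its own
git merge -q -m "Merge side into master" side

# The merged file carries both edits
cat build.sh
```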
In our case, however, the changes overlap (try running git status). This means we need to resolve the conflicts.
Open the build.sh file and go down to the bottom where we made our changes.
vim build.sh
The changes from master sit between <<<<<<< HEAD and =======
The changes from backintime sit between ======= and >>>>>>> backintime
In our case, we want to keep both changes, so simply delete the demarcation lines, then save, exit, and then
git add build.sh
git commit -m "Resolve build conflicts"
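To practice the whole conflict cycle somewhere safe, here is a sketch in a scratch repository. The file contents are invented, and the marker lines are stripped with grep instead of an editor so the example can run unattended; in real work you would read the conflict and edit by hand.

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
# Name the first branch master regardless of local git defaults
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "base" > build.sh
git add build.sh
git commit -q -m "Base"

# Each branch appends a different line at the same spot
git checkout -q -b backintime
echo "count sequences" >> build.sh
git commit -q -am "Add count on backintime"

git checkout -q master
echo "build trees" >> build.sh
git commit -q -am "Add trees on master"

# The changes overlap, so the merge stops and reports a conflict
git merge backintime || echo "merge stopped: conflict"

# The file now holds both versions, separated by marker lines
cat build.sh
markers="$(grep -c '^<<<<<<< ' build.sh)"

# Keep both changes by dropping the three marker lines
grep -v -e '^<<<<<<< ' -e '^=======$' -e '^>>>>>>> ' build.sh > build.sh.resolved
mv build.sh.resolved build.sh

# Staging the file marks the conflict as resolved; the commit concludes the merge
git add build.sh
git commit -q -m "Resolve build conflicts"
```

If a merge goes badly, git merge --abort returns you to the state before the merge started.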
Having biggish data that updates frequently can slow git down quite a bit.
One solution is to track the output data (and maybe even input data) in separate repositories, which you “ignore” from the main repository. This has a few problems too though:
GitHub is solving this for large files, but the problem remains for lots of smaller files…
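The "ignore" half of that approach is a .gitignore file. A minimal sketch in a scratch repository, assuming the data lives in a directory called output (a placeholder name, as are the files):

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir output
echo "big result" > output/results.txt
echo "echo building" > build.sh

# Tell git to skip everything under output/
echo "output/" > .gitignore

git add .
git commit -q -m "Track code, ignore output"

# Only .gitignore and build.sh were committed
tracked="$(git ls-files)"
echo "$tracked"

# Ask git whether a path is ignored (exit status 0 means yes)
git check-ignore -q output/results.txt && echo "output/results.txt is ignored"
```

The output directory could then hold its own repository, versioned separately from the code.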
Git can be intimidating…
When the office git expert has to come fix everything
git init
git add ...
git commit -m "..."
git status
git log # (or git glog, as you wish)
git diff
Learn those, start using them, and Google the rest as you need it.
For this class:
For next class (if you want to jump ahead):