tfcb_2020

Lecture 4: Reproducible Research, Markdown, Git and GitHub

Trevor Bedford (@trvrb, bedford.io)

Learning objectives

Class materials

  1. Reproducibility and collaborative science
  2. Markdown
  3. Git and GitHub

Reminders

Reproducible science

Motivation

There is a lot of interest in, and discussion of, the reproducibility “crisis”. In one example, “Estimating the reproducibility of psychological science” (Open Science Collaboration, Science 2015), the authors attempted to replicate 100 studies in psychology and found that only 36% of the replications had statistically significant results.

The Center for Open Science has also embarked on a Reproducibility Project for Cancer Biology, with results being reported in an ongoing fashion.

There are a lot of factors at play here, including “p hacking” led by the “garden of forking paths” and selective publication of significant results. I would call this a crisis of replication and treat it as a separate concept from reproducibility.

But even reproducibility is difficult to achieve. In “An empirical analysis of journal policy effectiveness for computational reproducibility” (Stodden et al., PNAS 2018), Stodden, Seiler and Ma:

Evaluate the effectiveness of journal policy that requires the data and code necessary for reproducibility be made available postpublication by the authors upon request. We assess the effectiveness of such a policy by (i) requesting data and code from authors and (ii) attempting replication of the published findings. We chose a random sample of 204 scientific papers published in the journal Science after the implementation of their policy in February 2011. We found that we were able to obtain artifacts from 44% of our sample and were able to reproduce the findings for 26%.

They get responses like:

“When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.”

“I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.”

The tables in the paper are very informative.

At the very least, it should be possible to take the raw data that forms the basis of a paper, run the same analysis that the author used, and confirm that it generates the same results. This is my bar for reproducibility.
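One way to make "generates the same results" concrete is to compare a regenerated output file against the published one. The sketch below is only an illustration of that idea, not a method from the papers cited above; the function names `file_sha256` and `results_match` are hypothetical, and it assumes the analysis writes its results to a file.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def results_match(regenerated: Path, published: Path) -> bool:
    """True if the regenerated output is byte-identical to the published one."""
    return file_sha256(regenerated) == file_sha256(published)
```

Byte-for-byte equality is a strict test. Analyses with random seeds or floating-point noise may need a tolerance-based comparison instead.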

Reproducible science guidelines

My number one suggestion for reproducible research is to have:

One paper = One GitHub repo

Put both data and code into this repository. This should be all someone needs to reproduce your results.
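As a rough illustration of what such a repository might contain, the sketch below creates a skeleton with separate places for data, code and figures plus a README. The directory names and the `scaffold_paper_repo` helper are assumptions made for this example, not a layout prescribed here.

```python
from pathlib import Path

# Hypothetical layout: these directory names are illustrative only.
LAYOUT = ("data", "scripts", "figures")


def scaffold_paper_repo(root: Path, title: str) -> list[Path]:
    """Create a skeleton for a one-paper repository and return its paths."""
    paths = []
    for name in LAYOUT:
        directory = root / name
        # exist_ok=True makes the function safe to re-run on an existing repo.
        directory.mkdir(parents=True, exist_ok=True)
        paths.append(directory)
    readme = root / "README.md"
    if not readme.exists():
        # Only write a README if there is none, so existing notes are never overwritten.
        readme.write_text(
            f"# {title}\n\nData are in `data/`, analysis code in `scripts/`.\n"
        )
    paths.append(readme)
    return paths
```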

Digression to demo GitHub.

Keeping one repo per paper has a few added benefits:

  1. Versioning data and code through GitHub allows you to collaborate with colleagues on code. It’s extremely difficult to work on the same computational project otherwise. Even Dropbox is messy.

  2. You’re always working with future you. Having a single clean repo with a documented README makes it possible to come back to a project years later and actually get something done.

  3. Other people can build off your work and we make science a better place.

I have a couple examples to look at here:

Some things to notice:

More sophisticated examples will use a workflow manager like Snakemake to automate builds. For example:

With GitHub as lingua franca for reproducible research, there are now services built on top of this model. For example:

Project communication

As PI, I enforce a further rule:

One paper = One GitHub repo = One Slack channel

It’s much easier if all project communication goes in one place.

Further reading

Some suggested readings on reproducible research include:

Markdown

Git and GitHub

Further reading