Trevor Bedford (@trvrb, bedford.io)
This class requires Microsoft Excel (or an equivalent program that can open .xlsx files); see Software installation for more information.
It’s important to keep a tidy project directory, even if something is not at the stage of being versioned on GitHub.
Some general advice:
- Keep a data/ subdirectory and a scripts/ (or src) subdirectory.
- Use relative paths (e.g., ../data/some_file.csv). If you were to use an absolute path (like ~/Projects/SomeProject/data/some_file.csv or C:\Users\SomeOne\Projects\SomeProject\data\some_file.csv), then anyone who wanted to reproduce your results but had the project placed in some other location would have to go in and edit all of those directory/file names.

Borrowing an excellent slide deck from Ciera Martinez and colleagues: Reproducible Science Workshop: File Naming
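To make the relative-path advice above concrete, here is a minimal Python sketch. It assumes a hypothetical layout in which the script lives in scripts/ and the data in a sibling data/ directory; the `data_path` helper and the file name are illustrative only. A literal `../data/some_file.csv` only works when the script is run from inside scripts/, so this sketch anchors the path to the script's own location instead, which is one of several reasonable options.

```python
from pathlib import Path

# Assumed (hypothetical) layout:
#   SomeProject/
#     data/some_file.csv
#     scripts/analysis.py   <- this script

# Directory containing this script. Fall back to the current working
# directory when __file__ is undefined (e.g. in an interactive session).
try:
    SCRIPT_DIR = Path(__file__).resolve().parent
except NameError:
    SCRIPT_DIR = Path.cwd()


def data_path(filename):
    """Return the path to a file in the data/ directory next to scripts/."""
    return SCRIPT_DIR.parent / "data" / filename


print(data_path("some_file.csv"))
```

Because nothing above names a user or a drive, the project directory can be moved or cloned elsewhere without editing the script.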
Continuing slide deck from Ciera Martinez and colleagues: Reproducible Science Workshop: Organization
Ideally, your data/ directory will include an additional README that (at bare minimum) includes a description of the data (e.g., what the rows and columns represent). Fully documented metadata (data about the data) will include:
Documenting data can be a time-consuming process, but it is often required to submit data to repositories. Since data publishing is a requirement for most academic research as a part of publication, keeping track of this information early on can save you time later and increase the chances of other researchers using your data (which means more citations for you).
More excellent advice from Karl Broman
Tidy data is a term from Hadley Wickham and refers to:
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
Data in this form is much easier to deal with programmatically. This is also known as a data frame. This tutorial presents a nice overview.
Observations as rows and variables as columns is an excellent standard to adhere to.
See, for example, single-cell RNA sequencing data, with cells as rows and genes as columns. This is also the way that relational databases (MySQL, Postgres, etc.) are constructed.
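As a rough illustration of moving from a wide table to observations-as-rows, the sketch below reshapes a small table using plain Python. The table, the virus and serum names, the titer values, and the `to_tidy` helper are all made up for illustration; they are not taken from the course data.

```python
# Hypothetical wide table: one row per virus, one column per serum.
# All names and titer values are invented for illustration.
wide = [
    {"virus": "virus_A", "serum_X": 640, "serum_Y": 80},
    {"virus": "virus_B", "serum_X": 160, "serum_Y": 320},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id cell of a wide table into its own row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on each output row instead
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


tidy = to_tidy(wide, id_column="virus", variable_name="serum", value_name="titer")
for record in tidy:
    print(record)
```

Each output record is one observation (a single virus measured against a single serum), and each key is one variable. For data frames, the pandas `melt` function performs the same kind of reshaping.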
Demonstrate conversion of a simple example dataset. Work from Table 2 in Bedford et al. 2014, available as an Excel table in the course repo.
Split into small groups of 3-4 people to work from an HI (haemagglutination-inhibition) table and convert to tidy data. Data available as an Excel table in the course repo.
Saving data as plain text files is necessary to process this data with either R or Python. You can export from Excel to .tsv (tab-delimited, my preferred format) or .csv (comma-delimited). A few things to note when exporting data files in these formats:
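Once data is in a tab-delimited file, it can be read with Python's standard `csv` module. The sketch below is a minimal round trip; the column names and values are made up, and an in-memory buffer stands in for a real file such as one in data/.

```python
import csv
import io

# Hypothetical tidy table; names and values are invented for illustration.
rows = [
    {"virus": "virus_A", "serum": "serum_X", "titer": "640"},
    {"virus": "virus_A", "serum": "serum_Y", "titer": "80"},
]

# Write tab-delimited text. The in-memory buffer stands in for a file on disk;
# with a real file, open it with newline="" as the csv documentation recommends.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["virus", "serum", "titer"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

# Read it back. Every field arrives as a string, so numeric columns
# need an explicit conversion.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
loaded = [{**record, "titer": int(record["titer"])} for record in reader]
print(loaded)
```

The only change needed for comma-delimited files is the `delimiter` argument, which defaults to a comma.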
Some suggested readings include: