Lack of reproducibility in science causes significant issues
Science retracted (without lead author's consent) a study of how canvassers can sway people's opinions about gay marriage
Original survey data was not made available for independent reproduction of results (additionally, survey incentives were misrepresented and sponsorship statements were incorrect)
Two Berkeley grad students attempted to replicate the study and discovered serious issues with the data (they were able to show that the data were almost certainly fabricated, and how they were generated).
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates (doi:10.1007/s12098-010-0331-7):
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.
Source: Retraction Watch
Reproducible science accelerates scientific progress.
Methods are codified by definition, yet still challenging to reproduce
See an experiment on reproducing reproducible computational research.
Day 1
Day 2
This is a two-part exercise:
Part 1: Analyze + document
Part 2: Swap + discuss
Complete the following task using whatever workflow and tools you feel most comfortable with. Your solution must include a brief write-up / documentation such that you could pass the file(s) to a collaborator and they would be able to reproduce your results.
Download the data: http://bit.ly/2sEPe4z (Full Link)
Visualize life expectancy over time for Canadians in the 1950s and 1960s using a line plot of these data.
Something should be clearly wrong with your plot. Figure out (and document) what it is and come up with a fix.
With the revised data, visualize life expectancy over time for Canadians again.
Stretch goal: Add lines for the life expectancy of Mexicans and Americans as well.
Introduce yourself to your collaborator(s) / neighbor(s).
Swap instructions / documentation with your collaborator and read through their results and write-up. As you read it over, think about how you would attempt to reproduce their work.
If your collaborator/neighbor does not have or is unfamiliar with the software/technology you used, we encourage you to give them a brief explanation of what it is and why you chose it. (Remember, this could be part of the irreproducibility problem!)
Finally, talk to each other about challenges you faced (or didn't face) or why you were or weren't able to reproduce their work.
This exercise:
In a “real life” setting:
Documentation: explanation and commenting of why and how an analysis is carried out in human readable language
Organization: tools to organize your projects so that you don't have a single folder with hundreds of files
Automation: the power of scripting to create automated (and self documenting) data analyses
Dissemination: publishing is not the end of your analysis, rather it is a way station towards your future research and the future research of others
Documentation takes many forms, but the goal should always be to make it as frictionless as possible for new and existing users / readers / whomever to figure out what is going on.
Making access easy is the most important thing
Provenance of copy and pasted figure:
Reported life expectancy shouldn't exceed even the most extreme age observed for humans.
if (any(gap_5060$lifeExp > 150)) {
  stop("Improbably high life expectancy.")
}
Error in eval(expr, envir, enclos): Improbably high life expectancy.
Another approach is using the testthat package, which allows us to make the test a little more readable:
expect_false(any(gap_5060$lifeExp > 150),
             "One or more life expectancies are improbably high.")
Note: If you run this in your console, you should get an error that
reads: Error: any(gap_5060$lifeExp > 150) isn't false. One or more
life expectancies are improbably high. Execution halted.
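The range check above can also be wrapped in a small function so that it is easy to reuse on new data. The following is only a minimal sketch: `check_life_exp()` is a hypothetical helper, and the toy data frame contains made-up values (one deliberately implausible), not the workshop data.

```r
# Hypothetical helper: stop if any life expectancy exceeds a plausible maximum.
check_life_exp <- function(dat, max_age = 150) {
  bad <- which(dat$lifeExp > max_age)
  if (length(bad) > 0) {
    stop("Improbably high life expectancy in row(s): ",
         paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data with made-up values; the second row is deliberately implausible.
toy <- data.frame(year = c(1952, 1957, 1962),
                  lifeExp = c(68.8, 999999, 71.3))

# Capture the error message instead of halting the script.
res <- tryCatch(check_life_exp(toy), error = function(e) conditionMessage(e))
print(res)
```

Reporting the offending rows, rather than only signalling that something is wrong, makes the problem quicker to document and fix.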
Our solution to “Motivating Reproducibility” exercise.
See material/01-mot-rep-soln.Rmd.
You got more data! New data come in two separate files: gap_7080.csv and
gap_90plus.csv.
Create the same plots that you created before (one for Canada and, as the stretch goal, one for North America) of life expectancy vs. GDP per capita.
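One possible way to combine several files before plotting is to read them in a loop and stack the rows. This is a sketch under stated assumptions: the file names come from the exercise, but the toy files written to a temporary directory, their column names other than `lifeExp`, and all values are made up for illustration.

```r
# Write two tiny stand-in files (made-up values) to a temporary directory.
tmp <- tempdir()
write.csv(data.frame(country = "Canada", year = c(1972, 1982),
                     lifeExp = c(72, 75)),
          file.path(tmp, "gap_7080.csv"), row.names = FALSE)
write.csv(data.frame(country = "Canada", year = c(1992, 2002),
                     lifeExp = c(78, 80)),
          file.path(tmp, "gap_90plus.csv"), row.names = FALSE)

# Read every file the same way and stack the results row-wise.
files <- file.path(tmp, c("gap_7080.csv", "gap_90plus.csv"))
gap_new <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

# Guard against the same country/year appearing in more than one file.
stopifnot(!anyDuplicated(gap_new[c("country", "year")]))

# Sort so that line plots connect the years in order.
gap_new <- gap_new[order(gap_new$country, gap_new$year), ]
print(gap_new)
```

Because the files are listed in one vector, a file that arrives later only needs to be added to `files`; the rest of the script stays unchanged.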
Our solution to “Extending your analysis” exercise.
See material/02-extend-soln.Rmd.
More literate programming with RMarkdown.
See material/03-more-lit-prog.Rmd.
03-more-lit-prog.Rmd and re-knit
params$country_1 or country_3)
File organization and naming are effective (and necessary) weapons against chaos.
Let's assume that you are collecting data on the amount of mRNA for a particular gene (BRAF) that is being produced by different strains of transgenic E. coli created via transformations using several different plasmids.
Devise a naming scheme for the files that is both “machine” and “human” readable. File names should contain the following information:
$ ls *Plsmd*
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv
...
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_H03.csv
> list.files(pattern = "MutFrac_A") %>% head
[1] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
[2] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
[3] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
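## assumption: flist holds the first few file names listed above
flist <- head(list.files(pattern = "Plsmd"))
## split each name on "_" or "." into its five components
meta <- stringr::str_split_fixed(flist, "[_\\.]", 5)
colnames(meta) <- c("date", "assay", "experiment",
                    "well", "ext")
meta[,1:4]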
date assay experiment well
[1,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A01"
[2,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A02"
[3,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A03"
[4,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B01"
[5,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B02"
[6,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B03"
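As a base R alternative to the stringr call above, the same names can be split into a data frame with typed columns. This is a sketch only: the two file names are copied from the listing, and the `row` and `col` columns are hypothetical additions for illustration.

```r
# Two file names copied from the listing above.
flist <- c("2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv",
           "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv")

# Drop the extension, then split on the underscore delimiter.
parts <- strsplit(tools::file_path_sans_ext(flist), "_", fixed = TRUE)
meta <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(meta) <- c("date", "assay", "experiment", "well")

# ISO 8601 dates parse without a format string and sort chronologically.
meta$date <- as.Date(meta$date)

# Hypothetical extra columns: plate row (letter) and plate column (number).
meta$row <- substr(meta$well, 1, 1)
meta$col <- as.integer(substr(meta$well, 2, 3))
print(meta)
```

Reserving the underscore for separating fields and the hyphen for separating words within a field is what makes this split unambiguous.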
Noble, William Stafford. 2009. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5 (7): e1000424.
|
+-- data-raw/
| |
| +-- gapminder-5060.csv
| +-- gapminder-7080.csv
| +-- ....
|
+-- data-output/
|
+-- fig/
|
+-- R/
| |
| +-- figures.R
| +-- data.R
| +-- utils.R
| +-- dependencies.R
|
+-- tests/
|
+-- manuscript.Rmd
+-- make.R
data-raw: the original data, do not edit or directly alter any of the files in this folder.
data-output: intermediate datasets that will be generated by the analysis.
fig: where we can store the figures used in the manuscript.
R: our R code (the functions)
tests: the code to test that our functions are behaving properly and that all our data is included in the analysis.
make_ms <- function() {
  rmarkdown::render("manuscript.Rmd",
                    "html_document")
  invisible(file.exists("manuscript.html"))
}
clean_ms <- function() {
  res <- file.remove("manuscript.html")
  invisible(res)
}
make_all <- function() {
  make_data()
  make_figures()
  make_tests()
  make_ms()
}
clean_all <- function() {
  clean_data()
  clean_figures()
  clean_ms()
}
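The figure helpers called above are not defined on these slides. Purely as an illustration of the make/clean pairing, they might look something like the sketch below; the `fig_dir` argument, the `example.pdf` file name, and the placeholder plot are all assumptions, not the workshop's code.

```r
# Hypothetical sketch of a figure target: create the folder, draw, return the path.
make_figures <- function(fig_dir = "fig") {
  dir.create(fig_dir, showWarnings = FALSE)
  out <- file.path(fig_dir, "example.pdf")
  pdf(out)
  on.exit(dev.off())        # always close the graphics device
  plot(1:10, type = "l")    # placeholder for the real plotting code
  invisible(out)
}

# Hypothetical counterpart: remove everything the make step generated.
clean_figures <- function(fig_dir = "fig") {
  res <- unlink(fig_dir, recursive = TRUE)
  invisible(res == 0)       # unlink() returns 0 on success
}

# Try it in a temporary directory so the working directory is untouched.
fig_dir <- file.path(tempdir(), "fig")
fig_file <- make_figures(fig_dir)
print(file.exists(fig_file))
clean_figures(fig_dir)
```

Pairing every `make_*` step with a `clean_*` step makes it cheap to delete all generated output and confirm that the analysis rebuilds from the raw data alone.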
testthat includes a function called test_dir that will run tests
included in files in a given directory. We can use it to run all the tests in
our tests/ folder.
test_dir("tests/")
Let's turn it into a function, so we'll be able to add some additional
functionalities to it a little later. We are also going to save it at the root
of our working directory in the file called make.R:
## add this to make.R
make_tests <- function() {
  testthat::test_dir("tests/")
}
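To make this concrete, the sketch below writes one small test file and runs it with `test_dir()`. It assumes the testthat package is installed; the temporary directory, the file name `test-lifeexp.R`, and the toy values are hypothetical and chosen only so that the example runs without the workshop data.

```r
library(testthat)

# Use a temporary folder as a stand-in for the project's tests/ directory.
tests_dir <- file.path(tempdir(), "tests")
dir.create(tests_dir, showWarnings = FALSE)

# test_dir() picks up files whose names start with "test".
writeLines(c(
  'test_that("life expectancies are plausible", {',
  '  toy <- data.frame(lifeExp = c(68.8, 70.0, 71.3))  # made-up values',
  '  expect_false(any(toy$lifeExp > 150))',
  '})'
), file.path(tests_dir, "test-lifeexp.R"))

# Run every test file in the folder and keep the results object.
res <- test_dir(tests_dir)
```

In a real project the test would read the actual data (for example from data-raw/) instead of building a toy data frame.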
Use informatively named files
2013-10-14_manuscriptFish.doc
2013-10-30_manuscriptFish.doc
2013-11-05_manuscriptFish_initialRyanEdits.doc
2013-11-10_manuscriptFish.doc
2013-11-11_manuscriptFish.doc
2013-11-15_manuscriptFish.doc
2013-11-30_manuscriptFish.doc
2013-12-01_manuscriptFish.doc
2013-12-02_manuscriptFish_PNASsubmitted.doc
2014-01-03_manuscriptFish_PLOSsubmitted.doc
2014-02-15_manuscriptFish_PLOSrevision.doc
2014-03-14_manuscriptFish_PLOSpublished.doc
Or zip the entire directory of your project files every time you make a change, and save it with the date
Use a version control system (e.g. git)
Why use Git?
Features of a hosting service like GitHub
RStudio's Git integration.
For more see http://happygitwithr.com/.
Piwowar & Vision (2013) “Data reuse and the open data citation advantage.” PeerJ, e175
Figure 1: Citation density for papers with and without publicly available microarray data, by year of study publication.
Wicherts et al (2011) “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.” PLoS ONE 6(11): e26828
Figure 1. Distribution of reporting errors per paper for papers from which data were shared and from which no data were shared.
Do's
Don'ts
Morin, Andrew, Jennifer Urban, and Piotr Sliz. 2012. “A Quick Guide to Software Licensing for the Scientist-Programmer.” PLoS Computational Biology 8 (7): e1002598.

From the Panton Principles:
[In] the scholarly research community the act of citation is a commonly held community norm when reusing another community member’s work. […] A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.
Peng, R. D. “Reproducible Research in Computational Science” Science 334, no. 6060 (2011): 1226–1227
roxygen: document your functions (easy to read, even if project not organized as package)
bookdown: provides support for cross-referencing, citations, etc. Works well even if output is not a book
ProjectTemplate: useful to automate project setup
The Markdown sources, and the HTML, are hosted on GitHub: https://github.com/fmichonneau/2017-useR-reproducibility
Entire suppl. doc generated from Rmarkdown: Finnegan et al. 2015. “Paleontological Baselines for Evaluating Extinction Risk in the Modern Oceans.” Science 348 (6234): 567–70.
Data Analysis for the Life Sciences - a book completely written in R markdown
FitzJohn et al. 2014. “How Much of the World Is Woody?” The Journal of Ecology. doi:10.1111/1365-2745.12260. Start to end replicable analysis on GitHub.
Boettiger et al. “RNeXML: A Package for Reading and Writing Richly Annotated Phylogenetic, Character, and Trait Data in R.” Methods in Ecology and Evolution, September. Code archive and DOI assignment at Zenodo
Hart et al. 2016 “Ten Simple Rules for Digital Data Storage”. PLOS Computational Biology.