Jessica Minnier
April 20, 2016
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generate the figures.
(Buckheit and Donoho 1995; De Leeuw 2001)
Jon F. Claerbout is the Cecil Green Professor Emeritus of Geophysics at Stanford University. He was one of the first scientists to emphasize that computational methods threaten the reproducibility of research unless open access is provided to both the data and the software underlying a publication. (Claerbout and Karrenbach 1992, Wikipedia)
It takes some effort to organize your research to be reproducible.
We found that although the effort seems to be directed to helping other people stand up on your shoulders, the principal beneficiary is generally the author herself.
This is because time turns each one of us into another person, and by making effort to communicate with strangers, we help ourselves to communicate with our future selves.
(Schwab, Karrenbach, and Claerbout 2000)
also from M Shotwell and JM Álvarez’ slides “Approaches and Barriers to Reproducible Practices in Biostatistics” and “Barriers to Reproducible Research and a Web-Based Solution” http://biostatmatt.com/uploads/shotwell-interface-2011.pdf and http://biostat.mc.vanderbilt.edu/wiki/pub/Main/MattShotwell/MSRetreat2013Slides.pdf
In J. P. Ioannidis (2014), “How to Make More Published Research True,” in PLOS Medicine, the author writes a follow-up to J. Ioannidis (2005), “Why Most Published Research Findings Are False.” He names reproducibility practices as one key part of the remedy:
“To make more published research true, practices that have improved credibility and efficiency in specific fields may be transplanted to others which would benefit from them—possibilities include
Research results are replicable if there is sufficient information available for independent researchers to make the same findings using the same procedures.
In computational sciences this means: the data and code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding.
In practice, research needs to be easy for independent researchers to reproduce.
– King (1995), Ball and Medeiros (2012), from Gandrud (2013)
Replicability has been a key part of scientific inquiry since perhaps the 1200s. It has even been called the “demarcation between science and non-science.”
– Gandrud (2013) and references therein, including Roger Bacon’s “Opera quaedam hactenus inedita Vol. 1” from 1267 https://books.google.com/books?id=wMUKAAAAYAAJ
“Computational science has led to exciting new developments, but the nature of the work has exposed limitations in our ability to evaluate published findings. Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible.”
“Ultimately, developing a culture of reproducibility in which it currently does not exist will require time and sustained effort from the scientific community.”
– Peng (2011)
Authors can choose to meet a subset of these criteria if they wish.
– Peng (2009)
“Reproducibility is important because it is the only thing that an investigator can guarantee about a study.”
“a study can be reproducible and still be wrong”
“These days, with the complexity of data analysis and the subtlety of many claims (particularly about complex diseases), reproducibility is pretty much the only thing we can hope for. Time will tell whether we are ultimately right or wrong about any claims, but reproducibility is something we can know right now.”
“By using the word reproducible, I mean that the original data (and original computer code) can be analyzed (by an independent investigator) to obtain the same results of the original study. In essence, it is the notion that the data analysis can be successfully repeated. Reproducibility is particularly important in large computational studies where the data analysis can often play an outsized role in supporting the ultimate conclusions.”
— Roger Peng’s 2014 blog post on Simply Statistics http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/
“Enhancing Reproducibility through Rigor and Transparency” http://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-103.html
Website: http://grants.nih.gov/reproducibility/index.htm
FAQs: http://grants.nih.gov/reproducibility/faqs.htm
NIH Training Module: https://grants.nih.gov/reproducibility/module_1/presentation.html
Note: Most of this concerns the science itself: experimental design and chemical and biological methods. There is essentially no language describing reproducibility of analyses, or data management for data or results generated by the grant.
NIH held a joint workshop in June 2014 with the Nature Publishing Group and Science on the issue of reproducibility and rigor of research findings.
A video/slide presentation about this topic and how it applies to grant applications and peer review can be found here: http://grants.nih.gov/grants/policy/rigor/NIH_Policy_Rigor_For_Reviewers/presentation.html
from NIH Guidelines & Landis et al. (2012) “A call for transparent reporting to optimize the predictive value of preclinical research”. Nature 490, 187–191.
Nature has a website containing editorials, features, news, and articles on various topics related to reproducible research: http://www.nature.com/news/reproducibility-1.17552
Including
Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated. (Knuth 1984)
Examples: Sweave, knitr (for R); SASweave, StatRep (for SAS); StatWeave (for Stata)
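To make Knuth's idea concrete, here is a minimal sketch of the "tangle" step, in which compilable source code is generated from a document that mixes prose and code. It is written in Python rather than with any of the tools listed above. The chunk markers (a line of the form `<<name>>=` to open a chunk and a lone `@` to close it) are an assumption modeled loosely on Sweave-style syntax, and the function name `tangle` and the sample document are hypothetical.

```python
def tangle(document: str) -> str:
    """Return only the code lines from a literate document.

    Assumed convention for this sketch: a line of the form "<<name>>="
    opens a code chunk and a line holding only "@" closes it.
    """
    code_lines = []
    in_chunk = False
    for line in document.splitlines():
        stripped = line.strip()
        if not in_chunk and stripped.startswith("<<") and stripped.endswith(">>="):
            in_chunk = True          # chunk header: start collecting code
        elif in_chunk and stripped == "@":
            in_chunk = False         # chunk terminator: back to prose
        elif in_chunk:
            code_lines.append(line)  # code line inside a chunk
        # prose lines outside chunks are dropped
    return "\n".join(code_lines)


# Hypothetical literate document: explanation interleaved with code.
EXAMPLE_DOC = """\
We first store the observations.
<<data>>=
speeds = [4, 7, 8, 9]
@
Then we compute their mean.
<<mean>>=
mean_speed = sum(speeds) / len(speeds)
@
"""

if __name__ == "__main__":
    print(tangle(EXAMPLE_DOC))
```

The extracted text is ordinary source code that can be run on its own, while the original document remains the human-readable explanation.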
This is knitr:
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00  
This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This is a document written in plain text (.Rmd file), with text and R code embedded using a special syntax. When you click the Knit button in RStudio, a document is generated that includes both the content and the output of any embedded R code chunks.
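The knitting step described here can be approximated in a few lines. The sketch below is in Python, not R, and is not how knitr works internally: it only shows the general idea of running each embedded chunk and placing its printed output after the code. The chunk markers (`<<name>>=` and a lone `@`), the function name `weave`, and the sample document are assumptions made for illustration; the `## ` output prefix mirrors the knitr output shown earlier.

```python
import contextlib
import io


def weave(document: str) -> str:
    """Run each code chunk and place its printed output after the code.

    Assumed convention for this sketch: "<<name>>=" opens a chunk and a
    line holding only "@" closes it. Output lines get a "## " prefix.
    """
    namespace = {}   # shared, so later chunks can use earlier variables
    woven = []
    chunk = None     # None while reading prose, a list while inside a chunk
    for line in document.splitlines():
        stripped = line.strip()
        if chunk is None and stripped.startswith("<<") and stripped.endswith(">>="):
            chunk = []
        elif chunk is not None and stripped == "@":
            # End of chunk: execute it and capture anything it prints.
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec("\n".join(chunk), namespace)
            woven.extend(chunk)
            woven.extend("## " + out for out in buffer.getvalue().splitlines())
            chunk = None
        elif chunk is not None:
            chunk.append(line)
        else:
            woven.append(line)   # prose passes through unchanged
    return "\n".join(woven)


# Hypothetical source document with one embedded chunk.
EXAMPLE_DOC = """\
The mean of the recorded speeds is computed below.
<<mean>>=
speeds = [4, 7, 8, 9]
print(sum(speeds) / len(speeds))
@
"""

if __name__ == "__main__":
    print(weave(EXAMPLE_DOC))
```

Because the output is regenerated from the code every time the document is built, the numbers in the report cannot drift away from the analysis that produced them.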
“Version control systems (VCS), which have long been used to maintain code repositories in the software industry, are now finding new applications in science. One such open source VCS, Git, provides a lightweight yet robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures, lab notes, and manuscripts. For individual researchers, Git provides a powerful way to track and compare versions, retrace errors, explore new approaches in a structured manner, while maintaining a full audit trail. For larger collaborative efforts, Git and Git hosting services make it possible for everyone to work asynchronously and merge their contributions at any time, all the while maintaining a complete authorship trail.”
(Ram 2013)
Have you ever:
In these cases, and no doubt others, a version control system should make your life easier.
http://stackoverflow.com/questions/1408450/why-should-i-use-version-control
Learn git: https://try.github.io/levels/1/challenges/1
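As a small illustration of what "track and compare versions" means in practice, the sketch below produces the kind of line-by-line comparison a version control system reports between two saved versions of a file. It uses Python's standard `difflib` module rather than Git itself, and the file name and file contents are made up for the example.

```python
import difflib


def compare_versions(old_text: str, new_text: str, name: str = "analysis.R") -> str:
    """Return a unified diff between two versions of one file."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=name + " (version 1)",
        tofile=name + " (version 2)",
        lineterm="",
    )
    return "\n".join(diff)


# Two hypothetical versions of the same analysis script.
VERSION_1 = "data <- read.csv('trial.csv')\nfit <- lm(y ~ x, data)\n"
VERSION_2 = "data <- read.csv('trial.csv')\nfit <- lm(y ~ x + age, data)\n"

if __name__ == "__main__":
    # Lines starting with "-" were removed, lines starting with "+" were added.
    print(compare_versions(VERSION_1, VERSION_2))
```

A real version control system stores every committed version along with who changed it and when, so a comparison like this is available for any two points in a project's history.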
Authors should submit the following:
Although not required, authors are encouraged to use literate programming tools […]
– Peng (2009)
make files to rerun analyses when certain files change
To do!
Adapt http://ropensci.github.io/reproducibility-guide/sections/checklist/
(Sandve et al. 2013)
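The idea behind the make files mentioned above, rerunning an analysis only when the files it depends on have changed, comes down to a timestamp check. The following is a minimal sketch of that rule in Python, not a substitute for make; the function name `needs_rebuild` and the file names are hypothetical.

```python
import os
from typing import Iterable


def needs_rebuild(target: str, dependencies: Iterable[str]) -> bool:
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # Rebuild when at least one input was modified after the output.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


if __name__ == "__main__":
    # Hypothetical project layout: a report built from data and a script.
    # The dependency files are assumed to exist.
    inputs = ["data.csv", "analysis.R"]
    if all(os.path.exists(path) for path in inputs):
        if needs_rebuild("report.html", inputs):
            print("report.html is out of date; rerun the analysis")
        else:
            print("report.html is up to date")
```

Recording dependencies this way means a change to the data or the code triggers exactly the steps that depend on it, with no manual bookkeeping about what needs to be rerun.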
Gandrud, Christopher. Reproducible Research with R and R Studio. CRC Press, 2013.
Xie, Yihui. Dynamic Documents with R and knitr. Vol. 29. CRC Press, 2013.
Karl Broman’s class “Tools for Reproducible Research” at UWisconsin-Madison http://kbroman.org/Tools4RR/
“Reproducible Research” by Johns Hopkins on Coursera (Peng, Leek, Caffo) https://www.coursera.org/learn/reproducible-research
ROpenSci’s “Reproducibility in Science” guide: http://ropensci.github.io/reproducibility-guide/ including the reproducibility checklist http://ropensci.github.io/reproducibility-guide/sections/checklist/
Matthew Shotwell’s slides (2011) “Approaches and Barriers to Reproducible Practices in Biostatistics”. http://biostatmatt.com/uploads/shotwell-interface-2011.pdf
ROpenSci’s blog post “Reproducible research is still a challenge” by R. FitzJohn, M. Pennell, A. Zanne, W. Cornwell, June 9, 2014, describes the experience of running an example analysis: https://ropensci.org/blog/2014/06/09/reproducibility/
StackOverflow question “Why should I use version control?” http://stackoverflow.com/questions/1408450/why-should-i-use-version-control
Karl Broman’s class “Tools for Reproducible Research” resource page http://kbroman.org/Tools4RR/pages/resources.html and “Why Reproducibility is Hard” https://kbroman.wordpress.com/2015/09/09/reproducibility-is-hard/
CRAN’s task view on Reproducible Research: https://cran.r-project.org/web/views/ReproducibleResearch.html
Frank Harrell’s wiki on statistical reporting: http://biostat.mc.vanderbilt.edu/wiki/Main/StatReport
University of Wisconsin-Madison’s Department of Statistics site on “Reproducible Research Tools”: https://www.stat.wisc.edu/reproducible
Ball, Richard, and Norm Medeiros. 2012. “Teaching Integrity in Empirical Research: A Protocol for Documenting Data Management and Analysis.” The Journal of Economic Education 43 (2). Taylor & Francis: 182–89.
Buckheit, Jonathan B, and David L Donoho. 1995. Wavelab and Reproducible Research. Springer.
Claerbout, Jon, and Martin Karrenbach. 1992. “Electronic Documents Give Reproducible Research a New Meaning.” In Proc. 62nd Ann. Int. Meeting of the Soc. of Exploration Geophysics, 601–4.
De Leeuw, Jan. 2001. “Reproducible Research: The Bottom Line.” Department of Statistics, UCLA.
Gandrud, Christopher. 2013. Reproducible Research with R and R Studio. CRC Press.
Ioannidis, John PA. 2014. “How to Make More Published Research True.” PLoS Med 11 (10): e1001747.
Ioannidis, John PA. 2005. “Why Most Published Research Findings Are False.” PLoS Med 2 (8): e124.
King, Gary. 1995. “Replication, Replication.” PS: Political Science & Politics 28 (03). Cambridge Univ Press: 444–52.
Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2). Br Computer Soc: 97–111.
Peng, Roger D. 2009. “Reproducible Research and Biostatistics.” Biostatistics 10 (3). Biometrika Trust: 405–8.
———. 2011. “Reproducible Research in Computational Science.” Science (New York, NY) 334 (6060). NIH Public Access: 1226.
Ram, Karthik. 2013. “Git Can Facilitate Greater Reproducibility and Increased Transparency in Science.” Source Code for Biology and Medicine 8 (1): 7.
Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput Biol 9 (10): e1003285.
Schwab, Matthias, Martin Karrenbach, and Jon Claerbout. 2000. “Making Scientific Computations Reproducible.” Computing in Science & Engineering 2 (6). AIP Publishing: 61–67.
Wilson, Greg, DA Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven HD Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLoS Biol 12 (1). Public Library of Science: e1001745.