Notes: Reproducible research

Breakdown of a typical presentation

Hierarchy of information in a document

  • Author
  • Abstract
  • Body
  • Materials
  • Code/gory details

Similar to email

  • Subject line
  • To
  • From
  • Body

Attachments

  • R Markdown
  • knitr report

Details

  • Code
  • GitHub repository

Here is a simple example with a dataset. Using knitr and the publish feature of the editor, the document can be placed directly on the RPubs website.

Keep in mind that a document is public from the moment it is published. It can of course be deleted afterwards if something private was published by mistake.

library(datasets)
data(airquality)
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
pairs(airquality)
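
As a rough sketch of how the example above could be kept as an R Markdown source and turned into a report, the snippet below writes a small .Rmd file to a temporary directory and knits it with knitr::knit(). The file names and the chunk label are made up for illustration, and the knitr package is assumed to be installed. Publishing to RPubs is still done from the editor and is not shown.

```r
library(knitr)

# Hypothetical file names inside a temporary directory (illustration only).
rmd_file <- file.path(tempdir(), "airquality_report.Rmd")
md_file  <- file.path(tempdir(), "airquality_report.md")

# A minimal R Markdown source: a title, one sentence, and one R chunk.
rmd_lines <- c(
  "---",
  "title: \"Air quality summary\"",
  "---",
  "",
  "Summary of the airquality dataset:",
  "",
  "```{r airquality-summary}",
  "library(datasets)",
  "data(airquality)",
  "summary(airquality)",
  "```"
)
writeLines(rmd_lines, rmd_file)

# Run the chunk and weave the code and its output into a Markdown file.
knit(rmd_file, output = md_file, quiet = TRUE)
```

Because the report is regenerated from the source each time, the text, the code, and the printed summary cannot drift apart.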

The following sections will describe good ways to present and publish data.

Reproducible checklist [Part 1]

  • Make sure a relevant question is being asked
  • Work with good collaborators
  • Do NOT do things by hand (for example, editing spreadsheets). Do all editing in R code so that every step is recorded and can be repeated.
  • Make a proper codebook
  • Be careful with interactive software. Graphs produced by clicking through a GUI leave no record of the steps, so a reader of the document cannot reproduce them.
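
To illustrate the points about hand edits and GUI-made graphs, here is a minimal sketch that does both steps in code, using the airquality dataset from the example above. The choice to drop rows with a missing Ozone value and the output file name are assumptions made for illustration, not a recommended analysis.

```r
library(datasets)

# Keep the raw data untouched and work on a copy.
raw <- airquality

# Scripted edit: drop rows with a missing Ozone reading,
# instead of deleting them by hand in a spreadsheet.
clean <- raw[!is.na(raw$Ozone), ]

# Scripted graph: written to a file by code, so it can be regenerated exactly.
# The file name is hypothetical.
plot_file <- file.path(tempdir(), "ozone_vs_temp.pdf")
pdf(plot_file, width = 7, height = 5)
plot(clean$Temp, clean$Ozone,
     xlab = "Temperature (degrees F)",
     ylab = "Ozone (ppb)",
     main = "Ozone against temperature")
dev.off()
```

Anyone with the script and the raw data can rerun it and obtain the same cleaned table and the same figure.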

Reproducible checklist [Part 2]

  • Try to automate everything possible
  • Eliminate clicks (example: use download.file() to download data)
  • Use version control software.
  • Don’t do massive commits.
  • Keep track of software requirements:
      • CPU architecture
      • Operating system
      • Software toolchain
      • Libraries and packages
      • External dependencies: websites, repositories, etc.
      • Version numbers for everything
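
The automation and bookkeeping items above can be sketched as follows. The helper fetch_raw_data() and all file names are hypothetical. To keep the sketch runnable offline, a local CSV exposed through a file:// URL stands in for a real data source (this form of URL assumes a Unix-like system); in practice the real URL would go there.

```r
# Hypothetical helper: download a file only once and report when it was fetched.
fetch_raw_data <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  # Return the local path and the file's modification time as a download stamp.
  list(path = dest, downloaded = file.info(dest)$mtime)
}

# Stand-in data source (assumption): a local CSV addressed via a file:// URL.
src <- tempfile(fileext = ".csv")
write.csv(datasets::airquality, src, row.names = FALSE)
raw <- fetch_raw_data(paste0("file://", src),
                      file.path(tempdir(), "raw_airquality.csv"))

# Record the computing environment: R version, OS, and package versions.
env_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), env_file)
```

Storing the session information next to the analysis makes it possible to see later which R version, platform, and package versions produced a result.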

Reproducible checklist [Part 3]

  • Do not save output
  • Set the seed of the random number generator with set.seed(). With a fixed seed, the same sequence of random numbers is generated on every run; without it, the numbers differ from run to run.
  • Think about the entire pipeline [Raw data -> processing -> analysis -> report]
  • How you get to the end result is more important than the result itself
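
A small sketch of the set.seed() point above: resetting the same seed reproduces the same draws. The seed value 2024 is arbitrary.

```r
# Fix the seed, then draw five standard normal values.
set.seed(2024)
first <- rnorm(5)

# Resetting the same seed restarts the generator at the same point.
set.seed(2024)
second <- rnorm(5)

# The two draws are exactly the same.
identical(first, second)  # TRUE
```

The seed is best set once near the top of a script, so that every later random step follows from it.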

Evidence-based data analysis [Part 1]

Evidence-based data analysis [Part 2]

Evidence-based data analysis [Part 3]

Evidence-based data analysis [Part 4]

Evidence-based data analysis [Part 5]