Reproducible Research

WEG Meeting

5th November 2013


Here is a dynamic report, in which changes will update as the data and results change.

Example - Iris dataset.

This famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

setosa versicolor virginica

Figure 1: The three Iris species.

We can load the data into R as part of this report, by enclosing it in a “chunk”:



data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We can silently load libraries, hide warnings, code etc. by specifying some extra parameters in the chunk definition:

Is sepal length different between the three species?

plot of chunk unnamed-chunk-4

We can examine it using lm:


fit1 <- lm(Sepal.Length ~ Species, data = iris)
summary(fit1)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.688 -0.329 -0.006  0.312  1.312 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728   68.76  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030    9.03  8.8e-16 ***
## Speciesvirginica    1.5820     0.1030   15.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.515 on 147 degrees of freedom
## Multiple R-squared: 0.619,   Adjusted R-squared: 0.614 
## F-statistic:  119 on 2 and 147 DF,  p-value: <2e-16

We can take results from the model and input them into the text. For example, the model intercept is 5.006; the adjusted R-squared is 0.6135.

But I can't be bothered putting all my R code into one of these documents.

No problem - think about using source to incorporate different R scripts or saved R objects. Here's an analysis I did earlier on some plant weights.

source("PlantWeights.R")
summary(lm(weight ~ group - 1))
## 
## Call:
## lm(formula = weight ~ group - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.4937  0.0685  0.2462  1.3690 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## groupCtl     5.03       0.22    22.9  9.5e-15 ***
## groupTrt     4.66       0.22    21.2  3.6e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.696 on 18 degrees of freedom
## Multiple R-squared: 0.982,   Adjusted R-squared: 0.98 
## F-statistic:  485 on 2 and 18 DF,  p-value: <2e-16

But doesn't this report look a bit crap?

Yes, a bit. But this is an excellent starting point for beginners or for quickly and easily producing reports which can then be archived and consulted in future. The HTML reports can be easily converted to PDFs or even Word documents.

Pandoc is an excellent programme to convert these files into Word documents. wkhtmltopdf will convert these files to PDF without installing LaTeX

For more experienced users, you could start to delve into Sweave and LaTeX, both of which can be implemented in R Studio.