Reproducible data analysis using RMarkdown

Aims for the rest of today

Sorting out your RStudio working environment.
Getting you ready to write reproducible reports in RMarkdown.

How can you trust someone’s results?

Results are communicated through reports such as peer-reviewed scientific articles.
Reported results should be reproducible! – Are they?
Data and code should be made available to others – Is this enough though?
It is difficult to relate code to figures, tables, numbers etc., especially where numbers were copy-pasted from software.
Also, manual processing of data and generation of reports is error-prone.

Why do we need reproducible analyses?

Open Science Collaboration led by Brian Nosek (Open Science Collaboration 2015).
Aimed to replicate important findings in psychological research.
Main finding: 36% were replicated; only 23% for Social Psychology.
Reasons: small samples, pressure to publish sensational results.
Make Psychology Science Again: pre-registration, replications, large samples (power analysis), abandoning significance testing, transparency.

A truly reproducible report

Data, code, and report need to be one unit.
Based on the concept of literate programming (Knuth 1984): text and code are linked in a single file that generates both the manual and the computer program.
This principle can be used to create reproducible reports, sometimes known as dynamic documents (Xie 2017).
RMarkdown is the best way of doing this using R.

Why RMarkdown?

Documentation of all analysis steps
Quickly updating and reproducing analysis
No copy-pasting between programmes
Producing texts, websites, supplementary materials, APA-style manuscripts, slides
Easy cross-referencing
Automatic reference lists
Easy type-setting of equations

Key concepts and tools

What is RMarkdown?

A plain-text file format: a script, much like an R script.
Open, edit, and run in RStudio like scripts.
RMarkdown documents are a mixture of Markdown and normal R code (and your data).
R code in RMarkdown documents occurs in R chunks, i.e. blocks of code, or as inline R code inside the Markdown text.
Markdown: minimal syntax that instructs how text should be formatted.
Renders to .pdf, .html, .docx
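
To make these ingredients concrete, here is a minimal sketch that writes a tiny .Rmd file from R using writeLines. The file name minimal.Rmd, the chunk label demo and the toy vector x are illustrative assumptions and not part of the course repository.

```r
# Build the chunk fence programmatically so this example can itself sit in a code block
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                   # YAML header: document metadata
  "title: \"Minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text with **formatting**.",
  "",
  paste0(fence, "{r demo}"),               # start of an R chunk labelled 'demo'
  "x <- c(2, 4, 6)",
  "mean(x)",
  fence,                                   # end of the R chunk
  "",
  "Inline R code: the mean of x is `r mean(x)`."
)

# Hypothetical file name; written to a temporary directory to keep the project clean
rmd_file <- file.path(tempdir(), "minimal.Rmd")
writeLines(rmd_lines, rmd_file)

# Inspect the file contents
cat(readLines(rmd_file), sep = "\n")
```

Rendering this file with rmarkdown::render(rmd_file) should show the chunk, its output, and the final sentence with the computed mean (4) in place of the inline code.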

Download repository

Go to github.com/jensroes/stats-iii
Click on Code > Download ZIP, then unzip the directory on your machine.
Open the project by double-clicking on stats-iii.Rproj
- xxx/slides.Rmd: slides in RMarkdown format
- xxx/exercises: R scripts that we will work with
- data/: my scripts will read in data from here

Installation

RStudio comes with the necessary R package rmarkdown.
To create PDF outputs, we require the package tinytex.

# Is tidyverse installed?
'tidyverse' %in% rownames(installed.packages())
# Should return TRUE but if not run
install.packages('tidyverse')

install.packages('tinytex')
tinytex::install_tinytex()
# then restart RStudio

# Then, try
tinytex:::is_tinytex() # should return TRUE
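
The same check can be run for several packages at once. The sketch below is one possible way to do this; the helper name missing_packages is hypothetical, and the list of packages is simply those loaded elsewhere in these slides, assuming they are available from your configured repository.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Packages used in these slides
needed <- c("rmarkdown", "tinytex", "tidyverse", "psyntur", "kableExtra")
to_install <- missing_packages(needed)
print(to_install)  # character(0) means everything is already installed

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```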

Installation

Test that rmarkdown will render pdf documents:

writeLines('Hello $x^2$', 'test.Rmd')
rmarkdown::render('test.Rmd', output_format = 'pdf_document')

writeLines creates an .Rmd file named test.Rmd.
rmarkdown::render renders .Rmd as pdf named test.pdf.

Minimal RMarkdown example

Open rmarkdown/examples/example.Rmd
You’ll see an R chunk and two pieces of inline R code.
Remainder is plain Markdown.
Knit > Knit to PDF to compile .Rmd to .pdf. Wow!!!
Notice how the code was interpreted in the PDF.
Important: R code is executed from top down.
Other demos and examples are provided.

From the RStudio menu …

Load data

Create a new R chunk: try CTRL+ALT+I or CMD+ALT+I (i.e. the letter “i”)
Create a chunk called packages and load libraries needed:

library(tidyverse)

Create a new chunk called loaddata and load Blomkvist et al. (2017) data:

blomkvist <- read_csv("data/blomkvist.csv") %>% 
  select(id, age, smoker, sex, rt = rt_hand_d) %>% 
  drop_na()

The setup chunk: global options

Note ```{r setup, echo=FALSE}
setup is a label of this chunk (optional; useful for cross-referencing of figures and tables).
Chunk configuration option echo = FALSE: don’t display chunk in output; echo = TRUE: display chunk.

knitr::opts_chunk$set(message = FALSE, # don't return messages
                      warning = FALSE, # don't return warnings
                      comment = NA, # no '##' prefix in front of output
                      echo = TRUE, # display chunk (is default)
                      eval = TRUE, # evaluate chunk (is default)
                      out.width = '45%',  # figure width
                      fig.align='center') # figure alignment
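
Global options are defaults: any individual chunk can still override them in its own header, e.g. {r, echo=FALSE}. As a small sketch, assuming the knitr package is installed, the current defaults can be set and queried from the console like this:

```r
library(knitr)

# Set two global defaults, as in the setup chunk
opts_chunk$set(echo = TRUE, out.width = "45%")

# Query the current defaults
opts_chunk$get("echo")       # is code displayed by default?
opts_chunk$get("out.width")  # default figure width
```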

Section headers

# This is a section header

## This is a subsection header

# This is another section header

## This is another subsection header

### This is a subsubsection header

Figures

Create a new R chunk with label myscatterplot
Set echo = FALSE because we only need the figure.
Add a figure caption fig.cap = "A scatterplot." in the chunk configurations.

library(psyntur)
scatterplot(x = age, y = rt, data = blomkvist)

Or using ggplot2

ggplot(blomkvist, aes(x = age, y = rt)) +
    geom_point() +
    theme_classic()

Figures

Change the default size to out.width = "50%".
Cross-reference figure using \ref{fig:myscatterplot} in the text.

"A scatterplot of age and reaction time can be found in Figure \ref{fig:myscatterplot}."
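
Putting the figure options together: the usual place for them is the chunk header, i.e. {r myscatterplot, echo=FALSE, fig.cap="A scatterplot.", out.width="50%"}. Assuming knitr 1.35 or later, the same options can alternatively be written as #| comments at the top of the chunk body, which is what the sketch below shows. The toy data frame is made up for illustration and only stands in for the blomkvist data.

```r
#| label: myscatterplot
#| echo: false
#| fig.cap: "A scatterplot."
#| out.width: "50%"

# Toy stand-in for the blomkvist data (made-up values, illustration only)
toy <- data.frame(age = c(25, 40, 55, 70, 85),
                  rt  = c(480, 510, 560, 640, 720))

# Base R scatterplot of reaction time against age
plot(rt ~ age, data = toy)
```

Outside of knitr the #| lines are ordinary comments, so the code also runs unchanged in the console.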

Add header options

header-includes:
- \usepackage{booktabs}

We just need booktabs to improve the typesetting of tables.

Formatted tables

Reporting results in tables formatted to a high standard.
Calculate descriptives:

library(psyntur)
(smoker_age <- describe(data = blomkvist, by = smoker, mean = mean(age), sd = sd(age)))

# A tibble: 3 × 3
  smoker  mean    sd
  <chr>  <dbl> <dbl>
1 former  65.2  16.9
2 no      53.0  21.3
3 yes     50.6  17.5

This format isn’t good enough for papers.

Formatted APA style tables

library(kableExtra)
smoker_age %>% 
  kable(format = 'latex',
        booktabs = TRUE,
        digits = 2,
        align = 'c', # centre value in each column
        caption = 'Descriptives of age by smoker.') %>% 
  kable_styling(position = 'center') # centre position of table

Also, label your chunk “smoker” and cross-reference the table in the text using Table \ref{tab:smoker}.
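
To see what kable actually hands over to LaTeX, the table source can be printed in the console. This is a sketch assuming only the knitr package (kableExtra is not needed for this step); the data frame re-types the descriptives from the output shown earlier rather than recomputing them.

```r
library(knitr)

# Descriptives copied from the describe() output shown earlier
smoker_age <- data.frame(
  smoker = c("former", "no", "yes"),
  mean   = c(65.2, 53.0, 50.6),
  sd     = c(16.9, 21.3, 17.5)
)

# kable() returns the LaTeX source of the table as text
latex_table <- kable(smoker_age,
                     format = "latex",
                     booktabs = TRUE,   # use \toprule, \midrule, \bottomrule
                     digits = 2,
                     align = "c",
                     caption = "Descriptives of age by smoker.")

# Print the raw LaTeX to see what ends up in the .tex file
cat(latex_table, sep = "\n")
```

The \toprule, \midrule and \bottomrule commands in the printed source are the reason the booktabs package has to be loaded in the header.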

Bibliography and citations

Add to the YAML preamble:

bibliography: refs.bib
biblio-style: apalike

Create a file (in RStudio) called refs.bib (save in same working directory as your .Rmd file)

Bibliography and citations

Get the .bib entry for Blomkvist et al. (2017) from Google Scholar and paste it into refs.bib:
- Copy the title “Reference Data on Reaction Time and Aging Using the Nintendo Wii Balance Board” into Google Scholar
- Click cite and BibTeX
- Copy the .bib entry into refs.bib
Note the citation key blomkvist2017reference
Cite Blomkvist et al. (2017) using @blomkvist2017reference or [@blomkvist2017reference].
At the end of your document create a section “# References”
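
R packages can be cited in the same way. As a sketch, assuming the knitr package is installed, write_bib generates BibTeX entries for packages; the file name packages.bib is a hypothetical choice, kept separate from refs.bib because write_bib overwrites its target file.

```r
library(knitr)

# Hypothetical file name; written to a temporary directory for this demonstration
bib_file <- file.path(tempdir(), "packages.bib")

# Write a BibTeX entry for base R; citation keys are prefixed with "R-" by default
write_bib("base", file = bib_file)

# Show the entry, including its citation key
cat(readLines(bib_file), sep = "\n")
```

With such a file saved next to the .Rmd, the YAML field can list both files, bibliography: [refs.bib, packages.bib], and base R can be cited using @R-base.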

Formatted inline R output

# Fit the model and get the summary
model <- lm(rt ~ sex, data = blomkvist)
model_summary <- summary(model)

# Extract R^2
r2 <- model_summary$r.squared

The $R^2$ for this model is `r round(r2, 2)`.

Renders “The $R^2$ for this model is 0.03.”

Formatted inline R output

# Extract F statistic
f_stat <- model_summary$fstatistic
p_value <- pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)

The model summary can be summarised like so: $F(`r round(f_stat[2])`, `r round(f_stat[3])`) =
`r round(f_stat[1],2)`$, $p `r format.pval(p_value, eps = 0.01)`$.

Renders “The model summary can be summarised like so: $F(1, 263) = 7.48$, $p <0.01$.”

p <- c(0.05, 0.02, 0.011, 0.005, 0.001)
format.pval(p, eps = 0.01)

[1] "0.05"  "0.02"  "0.01"  "<0.01" "<0.01"
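
Note that format.pval only adds the "<" sign for values below eps; larger p-values are returned without an "=" sign, so the inline text would read "p 0.02". One possible workaround is a small helper that always returns the comparison sign together with the value. The function report_p below is hypothetical and not part of any package.

```r
# Hypothetical helper: returns e.g. "= 0.02" or "< 0.01"
report_p <- function(p, threshold = 0.01, digits = 2) {
  ifelse(p < threshold,
         # below the threshold: report "< threshold"
         paste("<", formatC(threshold, format = "f", digits = digits)),
         # otherwise: report the rounded value with an equals sign
         paste("=", formatC(p, format = "f", digits = digits)))
}

report_p(c(0.2, 0.02, 0.005))
# returns "= 0.20" "= 0.02" "< 0.01"
```

In the text this would be used as $p `r report_p(p_value)`$.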

Mathematical typesetting

Strings between '$' symbols are parsed as LaTeX math and typeset accordingly (inline mode).
For example, $\beta$ renders as β.
Subscripts: $\beta_0$ renders as β₀; use '{}' for more than one symbol, as in $\beta_{01}$, which renders as β₀₁.
How would you write $\beta_{0_1}$?
Superscripts: use '^', as in $\sigma^2$, which renders as σ².
How would you write $\sigma^{2^2}$?
How about $\sigma_e^2$?

Some arithmetic operations and fractions

$x + y$, $x - y$
Multiplication: use either $\cdot$ or $\times$ to get · or ×, respectively, as in $3 \cdot 2$.
Division: use $/$ or $\div$ to get / or ÷, respectively, or $\frac{1}{2}$ for ½.
$\pm$ renders as ±.
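
Subscripts, superscripts and operators can be combined into a full equation. Wrapping the expression in double dollar signs typesets it in display mode, i.e. centred on its own line. The simple linear regression model below is chosen purely for illustration; the symbols $y_i$, $x_i$ and $\epsilon_i$ are assumptions and do not refer to a model fitted in these slides.

```latex
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_e^2)
$$
```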

For other formats …

# Install the released version from CRAN
install.packages("rmdformats")

# Or install the development version from GitHub
install.packages("remotes")
remotes::install_github("juba/rmdformats")

Create from template:
- File > New File > R Markdown > From Template (e.g. readthedown or robobook for documents)
- rmdshower::shower_presentation and ioslides_presentation for slides
Rmarkdown example
Online sharing on RPubs (“Publish”)

APA7 manuscripts using `papaja`

Example: https://osf.io/vayhq/ published in Roeser et al. (2021)
Installation: https://github.com/crsh/papaja

Task (homework)

You will use RMarkdown for your assignment (get started now).
Create a list of continuous response variables that can be found in Psychology.
Find a suitable data set (should have a mix of continuous and categorical variables).
Use the RMarkdown document we started and replace the data with your own, calculate descriptives, and create a visualisation.

References

Blomkvist, Andreas W., Fredrik Eika, Martin T. Rahbek, Karin D. Eikhof, Mette D. Hansen, Malene Søndergaard, Jesper Ryg, Stig Andersen, and Martin G. Jørgensen. 2017. “Reference Data on Reaction Time and Aging Using the Nintendo Wii Balance Board: A Cross-Sectional Study of 354 Subjects from 20 to 99 Years of Age.” PLoS One 12 (12): e0189598.

Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716.

Roeser, Jens, Sven De Maeyer, Mariëlle Leijten, and Luuk Van Waes. 2021. “Modelling Typing Disfluencies as Finite Mixture Process.” Reading and Writing, 1–26.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

Xie, Yihui. 2017. Dynamic Documents with R and Knitr. Chapman and Hall/CRC.
