Reproducible data analysis using RMarkdown

About me: Jens Roeser

Senior Lecturer in Psycholinguistics @ Psychology Department
Language production / comprehension / acquisition often with a focus on writing (e.g. Roeser, Torrance, and Baguley 2019; Garcia, Roeser, and Kidd 2023)
Bayesian modelling of production time course data (Roeser, De Maeyer, et al. 2024; Roeser, Conijn, et al. 2024); keystroke logging; eyetracking
Teaching: advanced statistical modelling, data wrangling, data visualisation, R package (psyntur, Andrews and Roeser 2021)


How can you trust someone’s results?

Results are communicated through reports such as peer reviewed scientific articles.
Analysis and results should be reproducible.
Data and code should be made publicly available.
Often difficult to relate code to figures, tables, numbers in text.
Copy-pasting numbers from software into the text can be error prone.

Why do we need reproducible analyses?

Open Science Collaboration led by Brian Nosek (Open Science Collaboration 2015).
Aimed to replicate important findings in psychological research.
Main finding: 36% were replicated; only 23% for Social Psychology.
Reasons: small samples, pressure to publish sensational results.
Make Psychology Science Again: pre-registration, replications, transparency, large samples (power analysis), abandoning significance testing.

Markdown: a truly reproducible report

Data, code, report need to be one unit.
Based on the concept of literate programming (Knuth 1984): text and code are linked in a single file to generate a manual or a computer program.
This principle can be used to create reproducible reports, sometimes known as dynamic documents (Xie 2017).
RMarkdown is the best way of doing this using R.

Why RMarkdown?

Documentation of all analysis steps
Easy integration between R and text
Quickly updating and reproducing analysis
No copy-pasting between programmes
Producing texts, websites, supplementary materials, APA-style manuscripts, slides
Easy cross-referencing (citations)
Automatic reference lists
Easy type-setting of equations

Key concepts and tools

What is RMarkdown?

Text file format, a script like an R script.
Open, edit, run in RStudio like scripts.
RMarkdown documents are a mixture of markdown code and normal R code (and your data).
R code in RMarkdown documents occurs in R chunks, i.e. blocks of code, or as inline R code inside of markdown code.
Markdown: minimal syntax that instructs how text should be formatted.
Rendering into .pdf, .html, .docx, .doc
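
To make the idea of chunks and inline code concrete, the sketch below writes a small .Rmd file from R. The file name minimal.Rmd, the chunk label mychunk and the toy vector x are made up for illustration. The three backticks that open and close a chunk are built with strrep() only so that the snippet can itself be shown inside a code block.

```r
fence <- strrep("`", 3)  # three backticks that open and close an R chunk

rmd <- c(
  "---",
  "title: 'A minimal example'",
  "output: pdf_document",
  "---",
  "",
  "Some *markdown* text.",          # plain markdown
  "",
  paste0(fence, "{r mychunk}"),     # start of an R chunk labelled mychunk
  "x <- c(2, 4, 6)",
  fence,                            # end of the R chunk
  "",
  "The mean of x is `r mean(x)`."   # inline R code inside markdown text
)

writeLines(rmd, "minimal.Rmd")
# rmarkdown::render("minimal.Rmd")  # uncomment to render (pdf output needs LaTeX)
```

When rendered, the inline code in the last line is replaced by the computed mean.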

Key concepts and tools

$\LaTeX$: a document preparation system for creating high-quality technical and scientific documents, especially those involving mathematical formulas and technical diagrams.
Great for cross referencing (citations, figure labels) and automatic generation of content tables, reference lists.
Widely used in statistics, computer science, physics
$\LaTeX$ documents are written in a .tex source code file and rendered to pdf

Outline

Setup
Examples
YAML preamble and chunk options
Headers
Figures and tables
Cross-referencing
Referencing
In-line R output
Mathematical typesetting

Download repository

Go to github.com/jensroes/bristol-ws-2025
Click on: Code > Download ZIP > unzip directory on your machine.
Open project by double clicking on bristol-ws-2025.Rproj
Use slides for code

Installation

RStudio comes with the necessary R package rmarkdown.
To create pdf outputs, we require the package tinytex

# Is tidyverse installed?
'tidyverse' %in% rownames(installed.packages())
# Should return TRUE but if not run
install.packages('tidyverse')

install.packages('tinytex')
tinytex::install_tinytex()
# then restart RStudio

# Then, try
tinytex:::is_tinytex() # should return TRUE
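
The same kind of check can be run for several packages at once. This is only a sketch: the vector pkgs is assembled from the packages mentioned in these slides, and the install line is commented out so that nothing is installed by accident.

```r
# Packages mentioned in these slides
pkgs <- c("tidyverse", "rmarkdown", "tinytex", "kableExtra",
          "lmerTest", "sjPlot", "rmdformats")

# TRUE if the package can be loaded, FALSE if it is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of the packages that still need installing
missing_pkgs <- names(installed)[!installed]
missing_pkgs
# install.packages(missing_pkgs)  # uncomment to install them
```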

Installation

Test that rmarkdown will render pdf documents:

writeLines('Hello $x^2$', 'test.Rmd')
rmarkdown::render('test.Rmd', output_format = 'pdf_document')

writeLines creates an .Rmd file named test.Rmd.
rmarkdown::render renders .Rmd as pdf named test.pdf.

Minimal RMarkdown example

Open rmarkdown/examples/example.Rmd
You’ll see an R chunk and two pieces of inline R code.
Remainder is plain Markdown.
Knit > Knit to PDF to compile .Rmd to .pdf. Wow!!!
Notice how the code was interpreted in pdf.
Important: R code is executed from top down.
Other demos and examples are provided.

Elaborate APA7 RMarkdown example

Open rmarkdown/examples/apa7_template.Rmd
To run this, you might need to install a few packages, so instead check out the file rmarkdown/examples/apa7_template.pdf that was rendered from this .Rmd file.

Example for online supp materials

Open rmarkdown/examples/robobook.Rmd
To run this, you might need to install a few packages, so instead check out the file rmarkdown/examples/robobook.html that was rendered from this .Rmd file.

From the RStudio menu …

Load data

Create a new R chunk: try CTRL+ALT+I or CMD+ALT+I (i.e. the letter “i”)
Create a chunk called packages and load libraries needed:

library(tidyverse)

Create a new chunk called loaddata and load Martin et al. (2010) data:

data <- read_csv("data/martin-etal-2010-exp3a.csv")

The setup chunk: global options

Note ```{r setup, echo=FALSE}
setup is a label of this chunk (optional; useful for cross-referencing of figures and tables).
Chunk configuration option echo = FALSE: don’t display chunk in output; echo = TRUE: display chunk.

knitr::opts_chunk$set(message = FALSE, # don't return messages
                      warning = FALSE, # don't return warnings
                      comment = NA, # don't comment output
                      echo = TRUE, # display chunk (is default)
                      eval = TRUE, # evaluate chunk (is default)
                      out.width = '45%',  # figure width
                      fig.align='center') # figure alignment
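
Global options are defaults, not fixed settings. As a small sketch (using only two of the options above), the current defaults can be inspected with opts_chunk$get(), and any single chunk can override them in its own header. The chunk label myplot in the comment is made up.

```r
library(knitr)

# Set two global defaults (a subset of the options shown above)
opts_chunk$set(echo = TRUE, out.width = "45%")

# Inspect the defaults that every subsequent chunk inherits
opts_chunk$get("echo")       # TRUE
opts_chunk$get("out.width")  # "45%"

# A chunk header can override a default for that chunk only, e.g.
#   {r myplot, echo = FALSE, out.width = '80%'}
# All other chunks keep the global values.
```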

Section headers

# This is a section header

## This is a subsection header

# This is another section header

## This is another subsection header

### This is a subsubsection header

Figures

Create a new R chunk with label mydensplot
Set echo = FALSE because we only need the figure.
Add a figure caption fig.cap = "A density plot." in the chunk configurations.

ggplot(data, aes(x = rt, colour = nptype, fill = nptype)) +
    geom_density(alpha = .25) +
    scale_x_log10() +
    theme_classic()
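
Written out in full, the chunk described above could look roughly like the output of the sketch below. The snippet only prints the chunk as text and does not draw the plot. fence is a helper variable for the three backticks, and the label and caption are the ones named in the steps above.

```r
fence <- strrep("`", 3)  # three backticks that open and close an R chunk

chunk <- c(
  # chunk header: label, then chunk options separated by commas
  paste0(fence, "{r mydensplot, echo=FALSE, fig.cap='A density plot.'}"),
  "ggplot(data, aes(x = rt, colour = nptype, fill = nptype)) +",
  "    geom_density(alpha = .25) +",
  "    scale_x_log10() +",
  "    theme_classic()",
  fence
)

# Print the chunk as it would appear in the .Rmd file
cat(chunk, sep = "\n")
```

The label comes first in the header and must not contain spaces; options follow as name = value pairs.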

Figures

Change the default size to out.width = '50%'.
Cross-reference figure using \ref{fig:mydensplot} in the text.

"A density plot of reaction time visualised by NP type can be found in Figure \ref{fig:mydensplot}."

Add header options

header-includes:
- \usepackage{booktabs}

We just need booktabs to improve the typesetting of tables.

Formatted tables

Reporting results in tables formatted to a high standard.
Calculate descriptive summary stats:

(rt_stats <- summarise(data, across(rt, list(mean = mean, sd = sd)), .by = nptype))

# A tibble: 2 × 3
  nptype    rt_mean rt_sd
  <chr>       <dbl> <dbl>
1 conjoined   1109.  296.
2 simple      1076.  267.

This format isn’t good enough for papers.

Formatted APA style tables

library(kableExtra)
kable(rt_stats,
      format = 'latex',
      booktabs = TRUE,
      digits = 2,
      align = 'c', # centre value in each column
      caption = 'Descriptive summary statistics of reaction time by NP type.') %>% 
  kable_styling(position = 'center') # centre position of table

Also, label your chunk “rtstats” and cross-reference the table in the text using Table \ref{tab:rtstats}.
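
The column names rt_mean and rt_sd are not very reader friendly. One option, sketched here with the col.names argument of knitr::kable, is to relabel the columns when the table is created. To keep the snippet self-contained, rt_stats is recreated by hand from the rounded values printed earlier; in the actual document you would pass the rt_stats object returned by summarise().

```r
library(knitr)

# Stand-in for rt_stats, typed in from the rounded output shown earlier
rt_stats <- data.frame(
  nptype  = c("conjoined", "simple"),
  rt_mean = c(1109, 1076),
  rt_sd   = c(296, 267)
)

tab <- kable(rt_stats,
             format = "latex",
             booktabs = TRUE,
             digits = 2,
             align = "c",
             col.names = c("NP type", "Mean", "SD"), # reader-friendly headers
             caption = "Descriptive summary statistics of reaction time by NP type.")
tab
```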

Bibliography and citations

Add to the YAML preamble:

bibliography: refs.bib
biblio-style: apalike

Create a file (in RStudio) called refs.bib (save in same working directory as your .Rmd file)
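
Taken together with the header options from earlier, the YAML preamble might look like the sketch below. The title and author values are placeholders, and pdf_document is assumed as the output format.

```yaml
---
title: "My reproducible report"   # placeholder
author: "Your name"               # placeholder
output: pdf_document              # assumed output format
header-includes:
  - \usepackage{booktabs}
bibliography: refs.bib
biblio-style: apalike
---
```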

Bibliography and citations

Get the .bib entry for Martin et al. (2010) from Google Scholar and paste it into refs.bib:
- Copy the title “Planning in sentence production: Evidence for the phrase as a default planning scope” into Google Scholar
- Click cite and BibTeX
- Copy the .bib entry into refs.bib
Note the citation key martin2010planning
Cite Martin et al. (2010) using @martin2010planning or [@martin2010planning].
At the end of your document create a section “# References”
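
For orientation, a BibTeX entry for this paper could look like the one written by the sketch below. It is assembled from the reference list of these slides, so the entry that Google Scholar returns may differ in its details. The snippet appends to refs.bib rather than overwriting it.

```r
# BibTeX entry assembled from the reference list (details may differ from Google Scholar)
bib_entry <- c(
  "@article{martin2010planning,",
  "  title = {Planning in sentence production: Evidence for the phrase as a default planning scope},",
  "  author = {Martin, Randi C and Crowther, Jason E and Knight, Meredith and Tamborello II, Franklin P and Yang, Chin-Lung},",
  "  journal = {Cognition},",
  "  volume = {116},",
  "  number = {2},",
  "  pages = {177--192},",
  "  year = {2010}",
  "}"
)

# Append the entry to refs.bib (the file is created if it does not exist)
cat(bib_entry, file = "refs.bib", sep = "\n", append = TRUE)

# In the .Rmd text:
#   @martin2010planning    gives a citation in the running text
#   [@martin2010planning]  gives a citation in parentheses
```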

Table exercise

# Fit model and get the summary
library(lmerTest)
model <- lmer(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data)
model_summary <- summary(model)$coefficients

Task: Use the tab_model function of the sjPlot package to create a nicely formatted table of the model object. Use cross-referencing as before to refer to the table from the text.

Formatted inline R output

# Extract t statistic
t_val <- model_summary[2, 4]
# Extract df
df <- model_summary[2, 3]
# Extract p value
p_val <- model_summary[2, 5]

The hypothesis test for the slope coefficient can be summarised like so: 
*t*(`r round(df, 1)`) = `r round(t_val, 1)`, *p* `r format.pval(p_val, eps = 0.05)`.

Which renders to: “The hypothesis test for the slope coefficient can be summarised like so: t(125.7) = -2.2, p <0.05.”
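
If the same statistics are reported repeatedly, the inline code can be wrapped in a small helper function. report_t below is a hypothetical helper, not part of any package, and the p value of 0.03 is made up for the demonstration; the t and df values are the rounded ones from the rendered sentence above.

```r
# Hypothetical helper: formats a t test as markdown text for inline use
report_t <- function(t_val, df, p_val, digits = 1) {
  # Report very small p values as "< .001", otherwise three decimals without leading zero
  p_txt <- if (p_val < .001) {
    "< .001"
  } else {
    paste("=", sub("^0", "", sprintf("%.3f", p_val)))
  }
  sprintf("*t*(%s) = %s, *p* %s", round(df, digits), round(t_val, digits), p_txt)
}

report_t(t_val = -2.2, df = 125.7, p_val = 0.03)
# "*t*(125.7) = -2.2, *p* = .030"
```

In the text, a single inline call such as `r report_t(t_val, df, p_val)` would then replace the three separate inline expressions.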

Formatted inline R output

# Extract slope coefficient
est <- model_summary[2, 1]

The $\hat\beta$ coefficient of this model shows a slowdown of `r round(est, 1)` ms.

Renders “The $\hat\beta$ coefficient of this model shows a slowdown of -33.6 ms.”

Task: Use confint(model) to obtain the 95% CI of the effect and add it to the text. Don’t copy paste the numbers but use inline R code.

Formatted inline R output

# Extract slope coefficient
est <- model_summary[2, 1]
ci <- as.numeric(confint(model, parm = "nptypesimple"))

The $\hat\beta$ coefficient of this model shows a slowdown of `r round(est, 1)` ms;
95% CI: `r round(ci[1], 2)`, `r round(ci[2], 2)`.

Renders “The $\hat\beta$ coefficient of this model shows a slowdown of -33.6 ms; 95% CI: -63.83, -2.94.”

Mathematical typesetting

Strings between `$` symbols are parsed as $\LaTeX$ and typeset in inline mode.
For example, `$\beta$` renders $\beta$.
Subscripts: `$\beta_0$` is $\beta_0$; use `{}` for more than one symbol, as in `$\beta_{01}$`, which is $\beta_{01}$.
How would you write $\beta_{0_1}$?
Superscripts: `^`, as in `$\sigma^2$`, which is $\sigma^2$.
How would you write $\sigma^{2^2}$?
How about $\sigma_e^2$?

Some arithmetic operations and fractions

$x + y$, $x - y$
Multiplication: use either `\cdot` or `\times` to get $\cdot$ or $\times$, respectively, as in $3 \cdot 2$.
Division: `/` or `\div` to get $/$ or $\div$, respectively, or `\frac{1}{2}` for $\frac{1}{2}$.
`\pm` renders to $\pm$.

$\LaTeX$ mathematical typesetting

The observed variable is modeled as coming from a normal distribution $y_i \sim \mathcal{N}(\mu_i, \sigma^2)$, with $\mu_i = \beta_0 + \beta_1 \cdot x_i$ where each $i \in 1 \dots N$.

Renders

The observed variable is modeled as coming from a normal distribution $y_i \sim \mathcal{N}(\mu_i, \sigma^2)$, with $\mu_i = \beta_0 + \beta_1 \cdot x_i$ where each $i \in 1 \dots N$.

Display mode using `'$$'` as delimiters

$$
y_i \sim \mathcal{N}(\mu_i, \sigma^2)
$$

Renders

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

Display maths

Using $\LaTeX$’s aligned environment to align multiple mathematical statements.
We will get back to what this means.

$$
\begin{aligned}
  y_i &\sim \mathcal{N}(\mu_i, \sigma^2),\\
  \mu_i &= \beta_0 + \beta_1 \cdot x_i + \beta_2 \cdot z_i
\end{aligned}
$$

The `&` is used to align the lines so that $\sim$ (`\sim`) and $=$ are aligned.
`\\` forces a line break.

For other formats …

install.packages("rmdformats")

Create from template:
- File > New File > R Markdown > From Template (e.g. readthedown or robobook for documents)
- rmdshower::shower_presentation and ioslides_presentation for slides
Online sharing on RPubs (“Publish”): RMarkdown example

APA7 manuscripts using `papaja`

Example: https://osf.io/vayhq/ published in Roeser, De Maeyer, et al. (2024)
Installation: https://github.com/crsh/papaja

Tips / Rules

Never copy numbers from output into text.
Don’t produce citations or cross-reference manually.
Be selective with which output and code you do and do not show your audience.
Don’t use RMarkdown instead of or like R-scripts.
Run more complex statistical models in a separate script, save the model output and read it into your RMarkdown document.
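
The last tip can be sketched with saveRDS() and readRDS(). The linear model on the built-in cars data is only a stand-in for a model that takes a long time to fit, and the file name model.rds is made up.

```r
# --- In a separate R script: fit the model once and store it ---
model <- lm(dist ~ speed, data = cars)  # stand-in for a slow model
saveRDS(model, "model.rds")

# --- In the .Rmd document: read the stored model instead of refitting ---
model <- readRDS("model.rds")
coef(model)  # use the stored model for tables and inline output
```

The document then renders quickly, and the model is only refitted when the separate script is run again.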

References

Andrews, Mark. 2021. Doing data science in R: An Introduction for Social Scientists. London, UK: SAGE Publications Ltd.

Andrews, Mark, and Jens Roeser. 2021. psyntur: Helper Tools for Teaching Statistical Data Analysis. https://CRAN.R-project.org/package=psyntur.

Garcia, Rowena, Jens Roeser, and Evan Kidd. 2023. “Finding Your Voice: Voice-Specific Effects in Tagalog Reveal the Limits of Word Order Priming.” Cognition 236: 105424.

Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.

Martin, Randi C, Jason E Crowther, Meredith Knight, Franklin P Tamborello II, and Chin-Lung Yang. 2010. “Planning in Sentence Production: Evidence for the Phrase as a Default Planning Scope.” Cognition 116 (2): 177–92.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716.

Roeser, Jens, Rianne Conijn, Evgeny Chukharev, Gunn Helen Ofstad, and Mark Torrance. 2024. “Typing in Tandem: Language Planning in Multi-Sentence Text Production Is Fundamentally Parallel.” Journal of Experimental Psychology: General.

Roeser, Jens, Sven De Maeyer, Mariëlle Leijten, and Luuk Van Waes. 2024. “Modelling Typing Disfluencies as Finite Mixture Process.” Reading and Writing 37 (2): 359–84.

Roeser, Jens, Mark Torrance, and Thom Baguley. 2019. “Advance Planning in Written and Spoken Sentence Production.” Journal of Experimental Psychology: Learning, Memory, and Cognition 45 (11): 1983–2009.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

Xie, Yihui. 2017. Dynamic Documents with R and Knitr. Chapman and Hall/CRC.
