This document provides a high-level perspective on the ideals of replication, the principles of literate programming, and the practicalities of managing data, analysis, and communication in a unified, efficient manner.
Replication is a cornerstone of scientific inquiry, ensuring that research findings are reliable and verifiable and allowing researchers to build on prior work, standing on the shoulders of giants.
Replication is as simple as this: running the same code on the same data and getting the same result. This sounds easy, but the replication crisis in social science shows it is anything but.
From a broad scientific perspective, replication is crucial. It serves as a fundamental check, reinforcing our confidence in the credibility of existing findings by allowing outsiders to audit the work. It is also essential for building knowledge, giving researchers a reliable foundation on which to extend previous studies.
But replication also matters for personal productivity. Reproducible code is a long-term memory aid: after a semester of grading or the arrival of a new co-author, you can still press Knit and trust the numbers. Documenting your work extensively helps “past you” assist “present you.” Nobody can remember the dozens or hundreds of decisions made over the course of a research project, and a reproducible document is a structured way to recover that logic after leaving a project for a while or bringing in collaborators.
The typical workflow of a social scientist involves collecting and cleaning data, analyzing it, and communicating the results in papers and presentations.
But this process is often fragmented, with researchers using different tools for each step (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), leading to inefficiencies and potential errors.
Adjust a covariate after you have drafted slides and a paper and you must rebuild every table and figure by hand. The extra clicks waste time and almost guarantee mismatches between text and evidence.
Introduced by Donald Knuth in 1984, literate programming prioritizes human readability over machine efficiency: write for people first, computers second, explaining the reasoning in ordinary prose right next to the command that carries it out.
R Markdown embodies this paradigm, blending text and code seamlessly.
The academic workflow consists of three key stages: managing data, analyzing it, and communicating the results.
Example: Loading and previewing data with R:
```r
library(readr)                   # read_csv() comes from the readr package
data <- read_csv("my_data.csv")  # read the raw file into a data frame
head(data)                       # preview the first six rows
```
Example: A simple analysis with R:
```r
fit <- lm(y ~ x, data = data)  # fit a simple linear regression of y on x
summary(fit)                   # coefficient estimates, standard errors, and fit statistics
```
Historically, many of us hopped between Stata for cleaning, Excel for figures, Word for papers, and PowerPoint for slides—gluing the pieces together by copy-and-paste. Change one line of code and the whole pile was out of date. R Markdown pulls everything—code, narrative, and output—into a single file, so one knit refreshes the entire project.
R Markdown combines Markdown text with R code chunks. When knitted, it executes the code and embeds the output into the document.
A code chunk like the one sketched below generates a plot directly within the knitted document.
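Here is a minimal sketch of such a chunk; the chunk label and the use of the built-in mtcars dataset are just illustrative:

````
```{r mileage-plot}
library(ggplot2)                        # load the plotting package inside the chunk
ggplot(mtcars, aes(x = wt, y = mpg)) +  # built-in mtcars data: car weight vs. fuel economy
  geom_point()                          # the scatterplot appears where the chunk sits
```
````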
| Tool | Strength | Limitation |
|---|---|---|
| Point-and-click (SPSS) | Fast for quick descriptives | Little audit trail |
| R script or Stata do-file | Fully reproducible | Output lives elsewhere |
| R Markdown | Code and output in one place | Steeper learning curve |
Working in a plain .R script keeps the focus on code and mirrors the workflow many of us know from Stata do-files. The trade-off is that every graph or table has to be written to disk (e.g., “table_17a.tex”) and then pasted into a slide or manuscript, adding a manual step that breaks reproducibility.
R Markdown is free, open-source, and embeds code, output, and prose in the same document, so automation and replication come for free.
The main hurdle is mental: you write analysis and exposition in the same file, which feels odd if you grew up on Stata do-files or SPSS menus.
Here’s a practical, hybrid research workflow, drawn from one of my own projects.
In 2023, I published a paper (coauthored with Michael Kistner) on the phenomenon of “Majority Party Rolls” in the United States. These are instances when the minority party votes with a dissenting faction of the majority party to pass a bill that is opposed by the majority of the majority party. It is a strange phenomenon which is incompatible with a major theory of legislative organization, and we wanted to understand it better.
In the Examples folder, I have a series of vignettes showing the workflow I used to produce the paper.
In teaching classes like this one, or my upcoming in-person data science class, I would use R Markdown exclusively. The combination of code, output, and narrative is, to my mind, the best way to teach this technical subject matter. Moreover, the notebooks remain a tangible record of the class that students can refer to later.
R, the programming language, was originally written by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in 1993. It was based on the S programming language, which was developed by John Chambers and his colleagues at Bell Labs in the 1970s. R has since evolved into a powerful and flexible language for data analysis and visualization.
The Tidyverse, a collection of R packages designed for data science, was created by Hadley Wickham, first as a graduate student at Iowa State and then as a professor at Rice University; he later joined RStudio (now Posit) as Chief Scientist. It provides a consistent and user-friendly framework for data manipulation, visualization, and analysis, built on the principles of tidy data: a standard, organized structure for datasets paired with a consistent set of tools for working with them.
Base R is the original, built-in way to work with data: you call functions directly on data frames or vectors and manage each step yourself.
Tidyverse wraps a set of popular community packages around a unified “verb + pipeline” style, so you think in terms of a sequence of actions rather than isolated function calls.
Some of the most popular packages in the Tidyverse include: dplyr, ggplot2, readr, stringr, forcats.
In practice, you can mix and match Base R and Tidyverse functions, but Tidyverse packages are designed to work together seamlessly. They also share a consistent syntax, so once you learn one function, you can apply that knowledge to others.
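As a small, hypothetical sketch of that mixing (df and its columns echo the examples further below):

```r
library(dplyr)

# a Tidyverse pipeline ending in a base R function
df %>%
  filter(x > 1) %>%   # dplyr verb: keep rows where x exceeds 1
  nrow()              # base R: count the surviving rows

# the reverse also works: base R subset() fed into a dplyr verb
subset(df, x > 1) %>%
  mutate(y = z / 2)
```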
Base R uses indexing (df[ , ]), the $ operator, and standalone functions like subset() and transform(). That’s concise for quick hacks but can be hard to follow once you’ve got several steps.
Tidyverse pipes (%>%) the result of one step into the next, and uses clear verbs like filter(), mutate(), and summarise(). This linear flow tends to read more like a recipe: “take the table, then filter, then add columns, then summarize.”
In Base R, you often create temporary objects or overwrite the original data at each stage.
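A sketch of the base R version of the two steps in the pipe example below (df, x, and z are the same placeholder names used there):

```r
# base R: each step leaves behind a named intermediate (or overwrites df)
df2 <- df[df$x > 1, ]   # keep the rows where x exceeds 1
df2$y <- df2$z / 2      # add a new column to the intermediate copy
```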
With the Tidyverse, you can write:
```r
df %>%
  filter(x > 1) %>%
  mutate(y = z / 2)
```
Notice that the original data frame is never modified. Because each verb passes its result forward, you skip the throw-away objects (df2, df3) that accumulate in base R scripts. Fewer temporary names mean fewer chances to rerun an outdated object by mistake.
Notice also that you don’t have to keep referring to the data frame by name: the pipe supplies it as the first argument of each verb, so you can leave it out and refer to columns directly. This is a big win for readability.
Search Stack Overflow, R-bloggers, Posit Community, or even ask a language model, and the first answer you see will almost always pipe data through dplyr verbs. That is not a conspiracy; it reflects two facts:
First, when you google “R merge two data frames,” the accepted answer pipes left_join(): the Tidyverse is what most of the community now writes. Second, it is simply easier to work within the dominant framework.
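A quick sketch of what those answers tend to look like (the two small data frames are made up for illustration):

```r
library(dplyr)

# hypothetical lookup tables
states <- data.frame(state = c("TX", "OH"), region  = c("South", "Midwest"))
rolls  <- data.frame(state = c("TX", "OH"), n_rolls = c(12, 7))

# the answer you will usually find: a dplyr pipe
rolls %>% left_join(states, by = "state")

# the base R equivalent, for comparison
merge(rolls, states, by = "state", all.x = TRUE)
```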