Introduction

This document provides a high-level perspective on the ideals of replication, the principles of literate programming, and the practicalities of managing data, analysis, and communication in a unified, efficient manner.

The Ideal of Replication

Replication is a cornerstone of scientific inquiry. It ensures that research findings are reliable and verifiable, and it allows researchers to build on prior work, standing on the shoulders of giants.

Replication is as simple as this: running the same code on the same data and getting the same result. This sounds easy, but the replication crisis in social science shows it is anything but.
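As a small illustration of "same code, same data, same result," here is a minimal sketch in base R. The function name bootstrap_mean and the simulated data are hypothetical stand-ins, not part of any real project. The point is only that fixing the random seed inside the analysis step makes a rerun return exactly the same number.

```r
# Hypothetical analysis step: a bootstrap estimate of a mean.
# The function name and the simulated data are illustrative assumptions.
bootstrap_mean <- function(x, n_draws = 200, seed = 2023) {
  set.seed(seed)  # fix the random number stream so reruns match
  draws <- replicate(n_draws, mean(sample(x, replace = TRUE)))
  mean(draws)
}

set.seed(1)
x <- rnorm(50)  # stand-in for "the same data"

first_run  <- bootstrap_mean(x)
second_run <- bootstrap_mean(x)

identical(first_run, second_run)  # TRUE
```

Without the set.seed() call inside the function, the two runs would generally differ slightly, which is one common way a replication quietly fails.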

Why Replication Matters

From a broad scientific perspective, replication is crucial. It justifies our confidence in existing findings by allowing outsiders to audit the work. It is also essential for building knowledge, because it gives researchers a reliable foundation for extending previous studies.

But replication also matters for personal productivity. For you, reproducible code is a long-term memory aid. After a semester of grading, or after a new co-author joins, you can still press Knit and trust the numbers. Documenting your work extensively helps “past you” assist “present you.” Nobody can remember the dozens or hundreds of decisions made during the research process. Reproducible code is a structured way to recover that logic after you leave a project for a while or invite collaborators.

Typical Workflow

The typical workflow of a social scientist involves:

  • Acquiring data.
  • Exploring it through plots and models.
  • Writing presentations and papers.
  • Submitting for publication and replication.

But this process is often fragmented, with researchers using different tools for each step (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), leading to inefficiencies and potential errors.

Adjust a covariate after you have drafted slides and a paper, and you must rebuild every table and figure by hand. The extra clicks waste time and almost guarantee mismatches between text and evidence.

Literate Programming: A Paradigm Shift

Introduced by Donald Knuth in 1984, literate programming prioritizes human readability over machine efficiency. It asks you to write for people first and computers second: explain the reasoning in ordinary prose right next to the command that carries it out.

Benefits

  • Readability: Code is explained in context, making it easier to follow.
  • Documentation: The narrative serves as built-in documentation.
  • Reproducibility: The approach supports transparent, reproducible research.

R Markdown embodies this paradigm, blending text and code seamlessly.

The Academic Workflow

The academic workflow consists of three key stages:

  • Stage 1 – Data. Collect, merge, and clean the raw files; record every choice in the same script.
  • Stage 2 – Analysis. Explore, model, and validate.
  • Stage 3 – Communication. Turn estimates into figures, tables, and text, the part the world sees.

1. Gathering and Processing Data

Example: Loading and previewing data with R:

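library(readr)   # fast, consistent CSV import

# "my_data.csv" is a placeholder; point this at your own file
data <- read_csv("my_data.csv")
head(data)       # preview the first six rows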

2. Doing Analysis

  • Exploration: Playing with data and creating visualizations.
  • Modeling: Running and iterating on statistical models.

Example: A simple analysis with R:

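# Regress y on x; assumes `data` (loaded above) has columns named y and x
fit <- lm(y ~ x, data = data)
summary(fit)   # coefficients, standard errors, and fit statistics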

3. Communicating Results

  • Presentations: Sharing findings with slides.
  • Web: Publishing results online.
  • Papers: Writing for publication.
  • Replication: Enabling others to reproduce your work.

Integrating the Workflow with R Markdown

Historically, many of us hopped between Stata for cleaning, Excel for figures, Word for papers, and PowerPoint for slides—gluing the pieces together by copy-and-paste. Change one line of code and the whole pile was out of date. R Markdown pulls everything—code, narrative, and output—into a single file, so one knit refreshes the entire project.

What is R Markdown?

R Markdown combines Markdown text with R code chunks. When knitted, it executes the code and embeds the output into the document.
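To make the knitting step concrete, here is a minimal sketch rather than a definitive template. It writes a tiny R Markdown file from R and, if the rmarkdown package and pandoc happen to be available, knits it. The file name, title, and chunk label are assumptions chosen for illustration; pressure is a data set that ships with R.

```r
# Build a tiny R Markdown file from R. The file name, title, and chunk
# label are illustrative assumptions.
fence <- strrep("`", 3)  # three backticks open and close a code chunk

rmd_lines <- c(
  "---",
  'title: "A minimal example"',
  "output: html_document",
  "---",
  "",
  "Vapor pressure rises with temperature in the built-in pressure data.",
  "",
  paste0(fence, "{r pressure-plot}"),
  "plot(pressure)",
  fence
)

rmd_file <- file.path(tempdir(), "minimal_example.Rmd")
writeLines(rmd_lines, rmd_file)

# Knitting runs the chunk and embeds the plot in the output document.
# The step is skipped when rmarkdown or pandoc is not installed.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```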

This code generates a plot directly within the document.

Tools Comparison

Tool                        Strength                       Limitation
--------------------------  -----------------------------  ----------------------
Point-and-click (SPSS)      Fast for quick descriptives    Little audit trail
R script or Stata do-file   Fully reproducible             Output lives elsewhere
R Markdown                  Code and output in one place   Steeper learning curve

Tools

R Scripts in RStudio

Working in a plain .R script keeps the focus on code and mirrors the workflow many of us know from Stata do-files. The trade-off is that every graph or table has to be written to disk (e.g., “table_17a.tex”) and then pasted into a slide or manuscript, adding a manual step that breaks reproducibility.
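The script-based pattern described above looks roughly like this. It is a sketch under assumptions: the model, the built-in mtcars data, and the file names are placeholders, and tempdir() is used only so the example runs anywhere.

```r
# Script-style workflow: each output is written to its own file.
# File names are hypothetical; tempdir() keeps the sketch self-contained.
out_dir <- tempdir()

fit <- lm(mpg ~ wt, data = mtcars)  # built-in data as a stand-in

# Save the coefficient table as CSV.
coef_table <- as.data.frame(summary(fit)$coefficients)
table_file <- file.path(out_dir, "table_coefficients.csv")
write.csv(coef_table, table_file)

# Save a figure as PDF.
figure_file <- file.path(out_dir, "figure_scatter.pdf")
pdf(figure_file)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
abline(fit)  # overlay the fitted regression line
dev.off()

file.exists(c(table_file, figure_file))
```

The code itself is reproducible. The fragile part is what happens next: someone has to copy those two files into the paper and the slides every time the model changes.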

Microsoft Office

  • PowerPoint/Word: Familiar but manual and error-prone.

LaTeX

  • Pros: Beautiful output for publication-ready papers and books.
  • Cons: Steep learning curve, brittle syntax.

R Markdown Advantages and Disadvantages

R Markdown is free, open-source, and embeds code, output, and prose in the same document, so automation and replication come for free.

The main hurdle is mental: you write analysis and exposition in the same file, which feels odd if you grew up on Stata do-files or SPSS menus.

My Workflow

Here’s a practical, hybrid research workflow:

  1. Data Munging: Use R scripts in RStudio.
  2. Analysis: Develop functions in scripts.
  3. Communication: Switch to R Markdown for presentations and papers.

Specific Example

In 2023, I published a paper (coauthored with Michael Kistner) on the phenomenon of “Majority Party Rolls” in the United States. These are instances when the minority party votes with a dissenting faction of the majority party to pass a bill that is opposed by the majority of the majority party. It is a strange phenomenon which is incompatible with a major theory of legislative organization, and we wanted to understand it better.

In the Examples folder, I have a series of vignettes showing the workflow I used to produce the paper, including:

  • Visualization of the data by state and majority party (“party barplot.R”)
  • Presentation using R Markdown and xaringan (“agenda_leviathan23.Rmd”)
  • Paper manuscript using R Markdown and a BibTeX-style bibliography (“roll23.Rmd”)
  • Published paper (“shor and kistner 2023.pdf”)

Teaching

In teaching classes like this one, or my upcoming in-person data science class, I would use R Markdown exclusively. To my mind, the combination of code, output, and narrative is the best way to teach this technical subject matter. Moreover, the notebooks remain as a tangible record of the class, which students can refer to later.

Base R vs Tidyverse

R, the programming language, was originally written by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in 1993. It was based on the S programming language, which was developed by John Chambers and his colleagues at Bell Labs in the 1970s. R has since evolved into a powerful and flexible language for data analysis and visualization.

The Tidyverse, a collection of R packages designed for data science, grew out of packages Hadley Wickham wrote as a graduate student and then as a professor at Rice University. He later became Chief Scientist at RStudio (now Posit). The Tidyverse provides a consistent and user-friendly framework for data manipulation, visualization, and analysis. It is built on the principles of tidy data, which emphasize an organized structure for datasets and a consistent set of tools for data analysis.

Core Philosophy

Base R is the original, built-in way to work with data: you call functions directly on data frames or vectors and manage each step yourself.

Tidyverse wraps a set of popular community packages around a unified “verb + pipeline” style, so you think in terms of a sequence of actions rather than isolated function calls.

Some of the most popular packages in the Tidyverse include: dplyr, ggplot2, readr, stringr, forcats.

In practice, you can mix and match Base R and Tidyverse functions, but the Tidyverse packages are designed to work together seamlessly. They also share a more consistent syntax, so once you learn one function, you can apply that knowledge to others.

Syntax and Readability

  • Base R uses indexing (df[ , ]), the $ operator, and standalone functions like subset() and transform(). That’s concise for quick hacks but can be hard to follow once you’ve got several steps.

  • Tidyverse pipes (%>%) the result of one step into the next, and uses clear verbs like filter(), mutate(), and summarise(). This linear flow tends to read more like a recipe: “take the table, then filter, then add columns, then summarize.”
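To see the two styles side by side, here is a hedged sketch that computes the same quantity both ways. It assumes the dplyr package is installed and uses the built-in mtcars data; the kpl column and the 0.425 conversion factor (miles per gallon to approximate kilometers per liter) are illustrative choices.

```r
library(dplyr)

# Base R: subset(), transform(), and the $ operator
heavy     <- subset(mtcars, wt > 3)
heavy     <- transform(heavy, kpl = mpg * 0.425)
base_mean <- mean(heavy$kpl)

# Tidyverse: take the table, then filter, then add a column, then summarize
tidy_mean <- mtcars %>%
  filter(wt > 3) %>%
  mutate(kpl = mpg * 0.425) %>%
  summarise(mean_kpl = mean(kpl))

# Both routes should agree on the answer
all.equal(base_mean, tidy_mean$mean_kpl)
```

Neither version is more correct. The difference is in how the steps read once there are more than two or three of them.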

Data-Transformation Workflow

  • In Base R, you often create temporary objects or overwrite the original data at each stage.

  • With the Tidyverse, you can write:

df %>%
  filter(x > 1) %>%
  mutate(y = z / 2)

Notice that the original data frame is never modified. Because each verb passes its result forward, you skip the throw-away objects (df2, df3) that accumulate in base R scripts. Fewer temporary names mean fewer chances to rerun an outdated object by mistake.
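The contrast with throw-away objects can be sketched on a toy data frame. This assumes dplyr is installed; the columns x and z mirror the pipeline above, and the values and the names df2, df3, and result are made up for illustration.

```r
library(dplyr)

df <- data.frame(x = c(0, 2, 4), z = c(10, 20, 30))  # toy data

# Base R with intermediate objects
df2   <- df[df$x > 1, ]   # keep rows where x exceeds 1
df3   <- df2
df3$y <- df3$z / 2        # add the new column

# Pipeline: the same two steps, with no intermediate names
result <- df %>%
  filter(x > 1) %>%
  mutate(y = z / 2)

names(df)  # "x" "z": the original data frame has no y column
```

The pipeline's output only persists if you assign it, as with result here. Assigning it back to df would overwrite the original, so that remains a deliberate choice rather than a side effect.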

Notice also that you don’t have to keep referring to the data frame. The first argument of each dplyr verb is the data frame, and the pipe supplies it for you, so you can leave it out. This is a big win for readability.
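A quick way to convince yourself of this is to write the same step with and without the pipe. This is a minimal sketch assuming dplyr is installed; the toy data frame is hypothetical.

```r
library(dplyr)

df <- data.frame(x = c(0, 2, 4), z = c(10, 20, 30))  # toy data

piped  <- df %>% filter(x > 1)  # the pipe passes df as the first argument
direct <- filter(df, x > 1)     # the same call written out explicitly

identical(piped, direct)  # TRUE
```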

Finding Help (and Why It Looks Like Tidyverse)

Search Stack Overflow, R-bloggers, Posit Community, or even ask a language model, and the first answer you see will almost always pipe data through dplyr verbs. That is not a conspiracy—it reflects two facts:

  1. Most R tutorials written since 2016 assume the Tidyverse is loaded.
  2. ChatGPT and other LLMs are trained on that same public corpus.

When you google “R merge two data frames,” the accepted answer pipes left_join(). So it’s easier to work with the dominant framework.
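For reference, here is how that merge looks in both dialects. This is a sketch on made-up tables, assuming dplyr is installed; the table and column names are hypothetical. Base R's merge() with all.x = TRUE keeps every row of the first table, which is what left_join() does.

```r
library(dplyr)

# Toy tables; names and values are illustrative
legislators <- data.frame(id = c(1, 2, 3), party = c("D", "R", "D"))
votes       <- data.frame(id = c(1, 2), yeas = c(10, 7))

# Base R: keep all rows of the first table
base_joined <- merge(legislators, votes, by = "id", all.x = TRUE)

# Tidyverse: left join on the same key
tidy_joined <- left_join(legislators, votes, by = "id")

# The legislator with no matching vote row gets NA in both versions
base_joined$yeas
tidy_joined$yeas
```

One difference to keep in mind: merge() sorts the result by the key by default, while left_join() keeps the row order of the first table. The two happen to coincide here because the ids are already sorted.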