Introduction

This document provides a high-level perspective on the ideals of replication, the principles of literate programming, and the practicalities of managing data, analysis, and communication in a unified, efficient manner.

The Ideal of Replication

Replication is a cornerstone of scientific inquiry. It ensures that research findings are reliable and verifiable, and it allows researchers to build on prior work, standing on the shoulders of giants.

Why Replication Matters

Big Picture: Science

Scientific Justification: Replication builds our confidence in the validity of existing findings.
Building Knowledge: It enables researchers to extend existing studies.

Micro Picture: Productivity

Personal Productivity: Documenting your work extensively helps “past you” assist “present you.” Nobody can remember the dozens or hundreds of decisions made during the research process. Documentation is a structured way to recover the logic after leaving a project for a while, or when bringing in collaborators.

Typical Workflow

The typical workflow of a social scientist involves:

Acquiring data.
Exploring it through plots and models.
Writing presentations and papers.
Submitting for publication and replication.

But this process is often fragmented, with researchers using different tools for each step (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), leading to inefficiencies and potential errors. For example, if you change a variable in your analysis, you have to remember to update the regression table and visualizations in both your presentation and your paper. This is a recipe for disaster. First, copying and pasting takes a long time. Second, there are many avenues for errors to creep in when you innocently forget one step or another.

Literate Programming: A Paradigm Shift

Introduced by Donald Knuth in 1984, literate programming prioritizes human readability over machine efficiency. Unlike traditional programming, which treats code primarily as instructions for a computer, literate programming is written for human readers, interspersing natural-language explanation with code snippets.

Benefits

Readability: Code is explained in context, making it easier to follow.
Documentation: The narrative serves as built-in documentation.
Reproducibility: The approach supports transparent, reproducible research.

R Markdown embodies this paradigm, blending text and code seamlessly.

The Academic Workflow

The academic workflow consists of three key stages:

1. Gathering and Processing Data

Data Acquisition: Surveys, downloads, or APIs.
Data Processing: Merging datasets, cleaning, and recoding.
Reality Check: This stage often takes longer than expected.

Example: Loading and previewing data with R:

# Load readr for CSV import
library(readr)
# Read the raw file into a data frame
data <- read_csv("my_data.csv")
# Preview the first few rows
head(data)
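The processing steps listed above (merging, cleaning, recoding) can be sketched in the same spirit. This is a minimal illustration in Base R, not a prescribed method; the `survey` and `regions` data frames and all their columns are hypothetical toy data invented for the example.

```r
# Hypothetical toy data standing in for two real sources
survey <- data.frame(
  id    = c(1, 2, 3),
  age   = c(34, NA, 51),
  party = c("D", "R", "I")
)
regions <- data.frame(
  id     = c(1, 2, 3),
  region = c("West", "South", "West")
)

# Merging: join the two sources on the shared key
merged <- merge(survey, regions, by = "id")

# Cleaning: drop rows with a missing age
cleaned <- merged[!is.na(merged$age), ]

# Recoding: collapse party into a two-category variable
cleaned$partisan <- ifelse(cleaned$party %in% c("D", "R"),
                           "partisan", "independent")
```

Each step is a separate, commented line, so a reader can see exactly which rows were dropped and how each variable was recoded.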

2. Doing Analysis

Exploration: Playing with data and creating visualizations.
Modeling: Running and iterating on statistical models.

Example: A simple analysis with R:

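# Regress y on x by ordinary least squares
fit <- lm(y ~ x, data = data)
# Print coefficients, standard errors, and fit statistics
summary(fit)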
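Exploration and iteration usually go together. The sketch below assumes the built-in `mtcars` dataset rather than any data from this document. It plots the raw relationship first, then compares a simple model against one with an added predictor.

```r
# Exploration: look at the raw relationship before modeling
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Modeling: start simple, then iterate by adding a predictor
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Overlay the simple fit on the scatterplot
abline(fit1)

# Compare the nested models with an F test
anova(fit1, fit2)
```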

3. Communicating Results

Presentations: Sharing findings with slides.
Papers: Writing for publication.
Replication: Enabling others to reproduce your work.

Integrating the Workflow with R Markdown

Traditionally, researchers used separate tools (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), manually integrating results. R Markdown unifies these steps in a single toolkit.

What is R Markdown?

R Markdown combines Markdown text with R code chunks. When knitted, it executes the code and embeds the output into the document.
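The original example chunk is not preserved in this copy. As a stand-in, here is a minimal sketch of what the body of an R code chunk might contain, assuming the built-in `cars` dataset:

```r
# Scatterplot of stopping distance against speed (built-in cars data)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
```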

This code generates a plot directly within the document.

Tools Comparison

Point-and-Click (e.g., SPSS): Easy but hard to replicate.
R Scripts: Reproducible but separate from output.
R Markdown: Combines code, output, and narrative for excellent replication.

Tools

R Scripts in RStudio

Advantages
- Simple, code-focused.
- Similar to workflow in tools like Stata.
Disadvantages
- Output must be stored externally (e.g., “table_17a.tex”).
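To make the disadvantage concrete, here is a hedged sketch of how a plain script might save a regression table to an external file. It assumes the `knitr` package is installed and uses the built-in `mtcars` data. The file is written to a temporary directory for illustration; in a real project it would go wherever the paper expects to find it.

```r
# Fit a model and extract the coefficient table as a matrix
fit <- lm(mpg ~ wt, data = mtcars)
coef_table <- coef(summary(fit))

# Render the table as LaTeX and store it in an external file
out_file <- file.path(tempdir(), "table_17a.tex")
latex_lines <- as.character(knitr::kable(coef_table, format = "latex", digits = 3))
writeLines(latex_lines, out_file)
```

The paper then has to pull in this file separately, and the file goes stale whenever the analysis changes and the script is not rerun.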

Microsoft Office

PowerPoint/Word: Familiar but manual and error-prone.

LaTeX

Pros: Beautiful output for publication-ready papers and books.
Cons: Steep learning curve, brittle syntax.

R Markdown Advantages

Free.
Unifies code, output, and narrative.
Literate programming style is close to natural language.
Minimizes manual labor, maximizes automation.
Simplifies replication.

R Markdown Disadvantages

Requires a mental adjustment for people used to more traditional statistical computing and programming workflows.

My Workflow

Here’s a practical, hybrid workflow:

Data Munging: Use R scripts in RStudio.
Analysis: Develop functions in scripts.
Communication: Switch to R Markdown for presentations and papers.
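One way the hybrid might look in practice is to define analysis functions in a script and then call them from an R Markdown document. This is a sketch under assumptions: the function name `fit_model`, the file name `analysis_functions.R`, and the use of the built-in `mtcars` data are all hypothetical.

```r
# Contents of a hypothetical script, analysis_functions.R:
# a small helper that fits a bivariate linear model
fit_model <- function(df, outcome, predictor) {
  # Build the formula outcome ~ predictor from character names
  model_formula <- reformulate(predictor, response = outcome)
  lm(model_formula, data = df)
}

# In an R Markdown chunk, the script would first be loaded with
# source("analysis_functions.R"); the function is then called directly
fit <- fit_model(mtcars, outcome = "mpg", predictor = "wt")
summary(fit)
```

Keeping the function in a script means the same code can feed both the slides and the paper.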

Base R vs Tidyverse

R, the programming language, was originally written by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in 1993. It was based on the S programming language, which was developed by John Chambers and his colleagues at Bell Labs in the 1970s. R has since evolved into a powerful and flexible language for data analysis and visualization.

The Tidyverse, a collection of R packages designed for data science, was created largely by Hadley Wickham, who began the work as a graduate student and continued it as a professor at Rice. He later became chief scientist at RStudio (now Posit). The Tidyverse provides a consistent and user-friendly framework for data manipulation, visualization, and analysis. It is built on the principles of tidy data, which emphasize a clean and organized structure for datasets, and on a coherent set of tools for data analysis.

Core Philosophy

Base R is the original, built-in way to work with data: you call functions directly on data frames or vectors and manage each step yourself.

Tidyverse wraps a set of popular community packages around a unified “verb + pipeline” style, so you think in terms of a sequence of actions rather than isolated function calls.

Some of the most popular packages in the Tidyverse include: dplyr, ggplot2, readr, stringr, forcats.

In practice, you can mix and match Base R and Tidyverse functions, but the Tidyverse packages are designed to work together seamlessly. They also share a consistent syntax, so once you learn one function, you can apply that knowledge to others.

Syntax and Readability

Base R uses indexing (df[ , ]), the $ operator, and standalone functions like subset() and transform(). That’s concise for quick hacks but can be hard to follow once you’ve got several steps.
Tidyverse pipes (%>%) the result of one step into the next, and uses clear verbs like filter(), mutate(), and summarise(). This linear flow tends to read more like a recipe: “take the table, then filter, then add columns, then summarize.”
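The single-step equivalents can be compared side by side. This is an illustrative sketch, assuming the `dplyr` package is installed and using the built-in `airquality` dataset; the object names are hypothetical.

```r
library(dplyr)

# Row selection: indexing, subset(), and the filter() verb
hot_index  <- airquality[airquality$Temp > 90, ]
hot_subset <- subset(airquality, Temp > 90)
hot_filter <- airquality %>% filter(Temp > 90)

# Adding a column: the $ operator, transform(), and the mutate() verb
aq_dollar <- airquality
aq_dollar$TempC <- (aq_dollar$Temp - 32) * 5 / 9
aq_transform <- transform(airquality, TempC = (Temp - 32) * 5 / 9)
aq_mutate    <- airquality %>% mutate(TempC = (Temp - 32) * 5 / 9)
```

All three variants in each group select the same rows or compute the same column. The difference is in how the code reads.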

Data-Transformation Workflow

In Base R, you often create temporary objects or overwrite the original data at each stage.
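The original example is not preserved in this copy. As a minimal sketch of the Base R style, assuming the built-in `mtcars` data and hypothetical object names:

```r
# Step 1: copy the data, then overwrite the copy at each stage
df <- mtcars
df <- df[df$cyl == 4, ]          # keep four-cylinder cars
df$hp_per_wt <- df$hp / df$wt    # add a power-to-weight column

# Step 2: store the summary in yet another named object
power_by_gear <- aggregate(hp_per_wt ~ gear, data = df, FUN = mean)
```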
With the Tidyverse, you can write the same steps as a single pipeline.
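The original pipeline is not preserved in this copy. A minimal sketch, assuming the `dplyr` package is installed and using the built-in `mtcars` data with hypothetical column names, might look like this:

```r
library(dplyr)

power_by_gear <- mtcars %>%
  filter(cyl == 4) %>%                  # keep four-cylinder cars
  mutate(hp_per_wt = hp / wt) %>%       # add a power-to-weight column
  group_by(gear) %>%                    # one group per gear count
  summarise(mean_hp_per_wt = mean(hp_per_wt))
```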

Notice that the original data frame is never modified. Instead, everything is done in line, passed to the next function in the pipeline. You never have to name the intermediate data, which reduces clutter and the chance of accidentally reusing an old object.

Notice also that you don’t have to keep referring to the data frame. Each verb takes the data frame as its first argument, and the pipe supplies it automatically, so you can leave it out. This is a big win for readability.

The 30,000 Foot View of the Research Production Process

Boris Shor

2025-05-30