This document provides a high-level perspective on the ideals of replication, the principles of literate programming, and the practicalities of managing data, analysis, and communication in a unified, efficient manner.
Replication is a cornerstone of scientific inquiry: it ensures that research findings are reliable and verifiable, and it allows researchers to build on prior work, standing on the shoulders of giants.
Big picture: Science
Micro picture: Productivity
The typical workflow of a social scientist involves:
But this process is often fragmented: researchers use a different tool for each step (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), which leads to inefficiencies and potential errors. For example, if you change a variable in your analysis, you have to remember to update the regression table and visualizations in both your presentation and your paper. This is a recipe for disaster. First, copying and pasting takes a long time. Second, there are many avenues for errors to creep in as you innocently forget one step or another.
Introduced by Donald Knuth in 1984, literate programming prioritizes human readability over machine efficiency. Unlike traditional programming, which focuses on code for computers, literate programming intersperses natural language with code snippets so that the logic is explained to human readers.
R Markdown embodies this paradigm, blending text and code seamlessly.
The academic workflow consists of three key stages: managing data, analyzing it, and communicating the results.
Example: Loading and previewing data with R:
library(readr)                    # provides read_csv()
data <- read_csv("my_data.csv")   # read the CSV file into a data frame (tibble)
head(data)                        # preview the first few rows
Example: A simple analysis with R:
fit <- lm(y ~ x, data = data)   # regress y on x by ordinary least squares
summary(fit)                    # print coefficients, standard errors, and R-squared
Traditionally, researchers used separate tools (e.g., Stata for analysis, Word for writing, PowerPoint for presentations), manually integrating results. R Markdown unifies these steps in a single toolkit.
R Markdown combines Markdown text with R code chunks. When knitted, it executes the code and embeds the output into the document.
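As a minimal sketch of what such a chunk might contain, the R code below uses the built-in cars dataset (an assumption made for illustration; it is not data from this document). The object name fit_cars is likewise hypothetical. In an .Rmd file these lines would sit inside a code chunk, and knitting would place the resulting figure right after them.

```r
# Body of a hypothetical R Markdown code chunk.
# cars ships with base R: speed (mph) and stopping distance (ft).
fit_cars <- lm(dist ~ speed, data = cars)

# Scatterplot of the raw data with the fitted regression line drawn on top
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit_cars)
```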
This code generates a plot directly within the document.
Here’s a practical, hybrid workflow
R, the programming language, was originally written by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in 1993. It was based on the S programming language, which was developed by John Chambers and his colleagues at Bell Labs in the 1970s. R has since evolved into a powerful and flexible language for data analysis and visualization.
The Tidyverse is a collection of R packages designed for data science. It was created largely by Hadley Wickham, first while he was a graduate student and then as a professor at Rice University; he later became chief scientist at RStudio (now Posit). It provides a consistent and user-friendly framework for data manipulation, visualization, and analysis. It is built on the principles of tidy data, which emphasize a clean and organized structure for datasets, and on a consistent and coherent set of tools for data analysis.
Base R is the original, built-in way to work with data: you call functions directly on data frames or vectors and manage each step yourself.
Tidyverse wraps a set of popular community packages around a unified “verb + pipeline” style, so you think in terms of a sequence of actions rather than isolated function calls.
Some of the most popular packages in the Tidyverse include: dplyr, ggplot2, readr, stringr, forcats.
In practice, you can mix and match Base R and Tidyverse functions, but the Tidyverse packages are designed to work together seamlessly. They also share a more consistent syntax, so once you learn one function, you can apply that knowledge to others.
Base R uses indexing (df[ , ]), the $ operator, and standalone functions like subset() and transform(). That’s concise for quick hacks but can be hard to follow once you’ve got several steps.
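To make these idioms concrete, here is a small sketch using the built-in mtcars data frame (chosen only for illustration; the cutoff of 3.5 and the new column kpl are hypothetical).

```r
df <- mtcars

# Row filtering: bracket indexing and subset() select the same rows
heavy_idx    <- df[df$wt > 3.5, ]
heavy_subset <- subset(df, wt > 3.5)

# Adding a column: the $ operator modifies df itself,
# while transform() returns a modified copy and leaves its input alone
df$kpl <- df$mpg * 0.425                      # approx. miles/gallon -> km/litre
df2    <- transform(mtcars, kpl = mpg * 0.425)
```

Each line is short on its own; the difficulty described above comes from chaining several of them and keeping track of which object holds which version of the data.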
Tidyverse pipes (%>%) the result of one step into the next, and uses clear verbs like filter(), mutate(), and summarise(). This linear flow tends to read more like a recipe: “take the table, then filter, then add columns, then summarize.”
In Base R, you often create temporary objects or overwrite the original data at each stage.
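A sketch of that step-by-step style, assuming the built-in mtcars data frame as a stand-in for your own data (the object names cars_heavy and mpg_by_cyl and the weight cutoff are hypothetical):

```r
# Step 1: filter rows into a new, named object
cars_heavy <- mtcars[mtcars$wt > 3, ]

# Step 2: overwrite that object to add a column (wt is recorded in 1000 lb)
cars_heavy$wt_kg <- cars_heavy$wt * 1000 * 0.4536

# Step 3: summarise by number of cylinders into yet another object
mpg_by_cyl <- aggregate(cbind(mpg, wt_kg) ~ cyl, data = cars_heavy, FUN = mean)
mpg_by_cyl
```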
With the Tidyverse, you can write the same kind of analysis as a single pipeline.
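A sketch of such a pipeline, assuming the dplyr package is installed and using the built-in mtcars data frame as a stand-in for your own data (the result name mpg_by_cyl_tidy and the weight cutoff are hypothetical):

```r
library(dplyr)

mpg_by_cyl_tidy <- mtcars %>%
  filter(wt > 3) %>%                          # keep the heavier cars
  mutate(wt_kg = wt * 1000 * 0.4536) %>%      # add weight in kg (wt is in 1000 lb)
  group_by(cyl) %>%                           # one group per cylinder count
  summarise(mean_mpg   = mean(mpg),           # collapse each group to one row
            mean_wt_kg = mean(wt_kg))

mpg_by_cyl_tidy
```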
Notice that the original data frame is never modified. Instead, everything is done in line, passed to the next function in the pipeline. You never have to name the intermediate data, which reduces clutter and the chance of accidentally reusing an old object.
Notice also that you don’t have to keep referring to the data frame. Each Tidyverse verb takes the data frame as its first argument, and the pipe supplies it automatically, so you can leave it out. This is a big win for readability.
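The following sketch illustrates that point, assuming dplyr is installed and again using the built-in mtcars data frame for illustration (the names piped and direct are hypothetical). Both calls run the same filter; the pipe simply passes the left-hand side in as the first argument.

```r
library(dplyr)

# With the pipe: the data frame is written once, on the left
piped <- mtcars %>% filter(cyl == 4)

# Without the pipe: the data frame is the explicit first argument
direct <- filter(mtcars, cyl == 4)

# Compare the two results
all.equal(piped, direct)
```

Since R 4.1, base R also provides a native pipe, written |>, which can be used in the same way for simple cases like this one.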