The Split-Apply-Combine Strategy for Data Analysis

Tahana Akram & Oliver Reece - ADC-405-S26 - Lawrence University

Introduction

Data analysis rarely involves a single operation on an entire dataset. More often we need to ask questions like: How did each baseball player’s performance change over their career? How does ozone vary by location and season? These questions share a common structure: break the data into groups, do something to each group, and collect the results.

This recurring pattern is what Hadley Wickham (2011) formalized as the split-apply-combine strategy, and it is the foundation of the plyr package for R.

The Three Steps

1. Split — Divide your dataset into meaningful subsets based on one or more grouping variables. For example, split a baseball dataset by player ID so each player’s records are handled separately. Subsets can be rows of a data frame, slices of an array, or elements of a list.

2. Apply — Run a function on each subset completely independently. This could be something simple like computing a mean, or something complex like fitting a linear regression model. Because each piece is independent, order does not matter and results are reproducible.

3. Combine — Collect all individual results and reassemble them into a single coherent output — a data frame, array, or list depending on what you need.
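As a minimal sketch of the three steps, the example below uses only base R and the built-in mtcars data. The grouping variable (cyl) and the summary computed are illustrative choices, not taken from Wickham's paper.

```r
# Split: one data frame per cylinder count (mtcars ships with base R)
pieces <- split(mtcars, mtcars$cyl)

# Apply: summarise each piece independently of the others
summaries <- lapply(pieces, function(df) {
  data.frame(cyl = df$cyl[1], n = nrow(df), mean_mpg = mean(df$mpg))
})

# Combine: stack the per-group results into a single data frame
result <- do.call(rbind, summaries)
print(result)
```

Each step is explicit here, which is exactly the bookkeeping that plyr is designed to hide.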

“Just recognizing the split-apply-combine strategy when it occurs is useful, because it allows you to see the similarity between problems that previously might have appeared unconnected.” — Wickham (2011)

Why plyr? Motivation

Before plyr, applying a function to every group in a dataset meant writing verbose for-loops full of bookkeeping code that buried the actual computation. The example below fits a model to every location in a 24×24 ozone grid. Compare what base R requires with the plyr version:

# Base R — 10 lines of bookkeeping
models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)
deseas <- array(NA, c(24, 24, 72))
for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])
    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}

# plyr — 2 lines, same result
models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

plyr eliminates hard-coded dimensions, removes the need to pre-allocate output structures, and makes the intent of the computation immediately clear. The function deseasf just does its job and plyr handles everything else.

The plyr Function Family

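Every plyr function is named [input][output]ply. Learn the three input types and three output types and the whole family makes sense: input and output can each be an array (a), data frame (d), or list (l), so ddply takes a data frame and returns a data frame, while ldply takes a list and returns a data frame. A fourth output type, written _, discards the results, which is useful when a function is called only for side effects such as saving plots.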

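Most functions take the same three arguments: .data (what to split), .variables (for data frames) or .margins (for arrays) to say how to split, and .fun (what to apply to each piece). List-input functions need only .data and .fun, because a list is always split into its elements. The dot prefix prevents name clashes with the arguments of your own function.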
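The naming scheme can be illustrated with a short sketch, assuming the plyr package is installed. The mtcars data and the small anonymous functions are illustrative choices only.

```r
library(plyr)

# d + d: data frame in, data frame out
by_cyl_df <- ddply(mtcars, .(cyl), function(df) data.frame(mean_mpg = mean(df$mpg)))

# d + l: data frame in, list out (one element per group)
by_cyl_list <- dlply(mtcars, .(cyl), function(df) mean(df$mpg))

# l + a: list in, array out (here a plain vector); no splitting argument is needed
lengths_vec <- laply(list(a = 1:3, b = 1:5), length)

# d + _: data frame in, output discarded; called only for its side effect
d_ply(mtcars, .(cyl), function(df) cat("cyl", df$cyl[1], ":", nrow(df), "rows\n"))
```

Changing only the second letter of the function name changes the output type, while the split and the applied function stay the same.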

Key Limitations

plyr makes one strong assumption that must always be respected: each piece of data is processed independently and only once. Together with plyr's preference for clarity over raw speed, this means there are important cases where plyr is simply not the right tool:

Running averages or moving windows — each calculation depends on neighboring values, so pieces are not truly independent
Dynamic simulations — the output of step N feeds directly into step N+1, meaning order matters and independence is broken
Very large datasets — plyr prioritizes code clarity over raw speed; for maximum performance, recoding key steps in C or using data.table may be necessary
Overlapping data windows — pieces must not share data with each other for plyr to work correctly

“Note that plyr makes the strong assumption that each piece of data will be processed only once and independently of all other pieces.” — Wickham (2011)

For the order-dependent and overlapping cases, a traditional for-loop is still the most appropriate tool. The goal of plyr is not to replace loops entirely but to replace loops where the split-apply-combine pattern genuinely applies.
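To make the independence requirement concrete, here is a small hypothetical simulation in base R. The growth factors and starting value are made up for illustration. Each step needs the result of the previous one, so the steps cannot be handed to plyr as independent pieces.

```r
# Hypothetical dynamic simulation: step t depends on step t - 1
set.seed(1)
n_steps <- 10
growth <- runif(n_steps, 0.9, 1.2)  # made-up growth factors

population <- numeric(n_steps)
population[1] <- 100                # made-up starting value
for (t in 2:n_steps) {
  # The previous value feeds directly into the next one, so order matters
  population[t] <- population[t - 1] * growth[t]
}
print(round(population, 1))
```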

Helper Functions

plyr ships with several helper functions that make common tasks much cleaner and more robust:

splat(f) — converts a function that takes multiple named arguments into one that accepts a single data frame. Extremely useful when your function’s argument names match column names.

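# Return a data frame so that groups of different sizes can be combined
hp_per_cyl <- function(hp, cyl, ...) data.frame(hp_per_cyl = hp / cyl)
# splat() passes each piece's columns to hp_per_cyl as named arguments
ddply(mtcars, .(round(wt)), splat(hp_per_cyl))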

each(f, g) — runs multiple functions on the same input and returns a named vector. Great for computing several summary statistics at once without writing wrapper functions.
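A brief sketch of each(), assuming plyr is installed; the choice of mean, sd, and median is only an example.

```r
library(plyr)

# Build one function that computes several summaries at once
summarise_vec <- each(mean, sd, median)
stats <- summarise_vec(mtcars$mpg)
print(stats)

# The same idea applied per group: one row of summaries per cylinder count
per_group <- ddply(mtcars, .(cyl), function(df) each(mean, sd)(df$mpg))
print(per_group)
```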

colwise(f) — applies a function column by column across a data frame and returns a data frame. The .if argument (named .cols in current plyr releases) restricts it to columns that pass a test, such as is.numeric or is.factor. Use colwise(f)(data) for every column, or numcolwise(f)(data) for only the numeric columns.
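The following sketch shows the column-wise helpers on the built-in iris data, assuming plyr is installed. The functions applied are illustrative.

```r
library(plyr)

# Mean of every numeric column; the Species factor is skipped
numeric_means <- numcolwise(mean)(iris)
print(numeric_means)

# Apply a function to every column, whatever its type
n_unique <- colwise(function(col) length(unique(col)))(iris)
print(n_unique)

# Column-wise functions combine naturally with ddply: numeric means per species
by_species <- ddply(iris, .(Species), numcolwise(mean))
print(by_species)
```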

failwith(default, f) — returns a default value such as NA instead of stopping with an error when f fails. Essential when fitting many models where some subsets may be too small or poorly conditioned.
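A hedged sketch of failwith() follows, assuming plyr is installed. The function fit_or_stop and its three-row threshold are hypothetical. They stand in for any model fit that can fail on small subsets.

```r
library(plyr)

# Hypothetical fitting function that refuses to run on tiny groups
fit_or_stop <- function(df) {
  if (nrow(df) < 3) stop("too few rows to fit a model")
  coef(lm(mpg ~ wt, data = df))[["wt"]]
}

# Wrap it so that a failure yields NA instead of aborting the whole run
safe_fit <- failwith(NA_real_, fit_or_stop, quiet = TRUE)

slopes <- ddply(mtcars, .(carb), function(df) data.frame(slope = safe_fit(df)))
print(slopes)
```

Groups with fewer than three cars get an NA slope, and the remaining groups are still fitted.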

.progress = "text" — displays a live text progress bar in the console. Setting .progress = "time" also shows an estimate of the time remaining. Particularly useful for long-running operations involving hundreds or thousands of subsets.
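A small sketch of the progress bar, assuming plyr is installed. The Sys.sleep() call only simulates a slow computation.

```r
library(plyr)

# Simulated slow task: each call pauses briefly before returning
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}

# The bar advances in the console as each piece finishes
squares <- llply(1:20, slow_square, .progress = "text")
```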

The 4 Step Process

1. Start small: extract a subset of the data for which the problem is easy to solve.
2. Solve the problem by hand on that subset, checking the results as you go.
3. Turn the hand-written solution into a reusable function.
4. Use the appropriate plyr function to split the full dataset, apply your function to each piece, and stitch the results back together.
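The four steps can be walked through on a toy problem. This is a sketch assuming plyr is installed. The question asked (how mpg changes with weight within each cylinder group of mtcars) and the helper name wt_slope are illustrative, not from the paper.

```r
library(plyr)

# Step 1: start with a small, easy subset
four_cyl <- subset(mtcars, cyl == 4)

# Step 2: solve the problem by hand on that subset
fit <- lm(mpg ~ wt, data = four_cyl)
hand_slope <- coef(fit)[["wt"]]

# Step 3: wrap the hand-written solution in a function
wt_slope <- function(df) {
  fit <- lm(mpg ~ wt, data = df)
  data.frame(slope = coef(fit)[["wt"]], n = nrow(df))
}

# Step 4: let plyr split, apply, and combine over the full dataset
slopes <- ddply(mtcars, .(cyl), wt_slope)
print(slopes)
```

Checking that the cyl == 4 row matches the value computed by hand is a quick way to confirm the function generalises correctly.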

Using these steps, Wickham proceeds to walk us through 2 case studies:

Baseball Case Study: He splits a data frame of career records by player ID, fits a linear model to each subset, and combines the resulting slopes, intercepts, and \(R^2\) values back into a single data frame.
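The baseball workflow can be sketched roughly as follows, assuming the baseball data frame bundled with plyr. The response (runs batted in per at bat), the career-year variable, and the 25 at-bat cutoff are recalled from the paper and the plyr documentation and should be checked against them. The names bb, model, and rsq are illustrative.

```r
library(plyr)

# Add a career-year column per player, then drop seasons with few at bats
bb <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)
bb <- subset(bb, ab >= 25)

# Split by player and fit one linear model per player
model <- function(df) lm(rbi / ab ~ cyear, data = df)
bmodels <- dlply(bb, .(id), model)

# Combine: one row per player with intercept, slope, and R^2
rsq <- function(fit) summary(fit)$r.squared
bcoefs <- ldply(bmodels, function(fit) c(coef(fit), rsquare = rsq(fit)))
head(bcoefs)
```

Players with very few seasons can produce degenerate fits and warnings, so the output is best inspected before drawing conclusions.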

Ozone Layer Case Study: He slices a 3D array along its two spatial dimensions (latitude and longitude), fits a robust linear model to each location’s monthly time series to remove the seasonal pattern, and puts the deseasonalised residuals back together into matching 3D array and data frame formats.
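A simplified sketch of the ozone workflow is shown below, assuming the ozone array bundled with plyr (24 x 24 locations by 72 months) and assuming the series starts in January. To avoid an extra dependency, an ordinary lm() is swapped in for the robust fit described in the paper, so the numbers will differ from Wickham's. The name deseason is illustrative.

```r
library(plyr)

# One month label per time point: 72 months = 6 repetitions of Jan..Dec
month <- factor(rep(month.abb, length.out = 72), levels = month.abb)

# Remove the monthly pattern from a single location's time series
# (ordinary least squares here, not the robust fit used in the paper)
deseason <- function(value) resid(lm(value ~ month - 1))

# Split along the two spatial margins, apply, and recombine into a 3D array
deseas <- aaply(ozone, 1:2, deseason)
dim(deseas)
```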

Reducing Cognitive Impedance

One of the main points of the article is that plyr is meant to reduce human effort, and it does so in a couple of ways. First, it carries labels (grouping values and row, column, or dimension names) through the whole process, so the combined output is labelled consistently with the original data. Base R's apply functions often drop or rearrange these labels, leaving you to restore them by hand. Second, the input and output types are chosen independently through the function name, so you can move between arrays, lists, and data frames without writing conversion code.
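The sketch below illustrates both points on mtcars, assuming plyr is installed. The grouping variables and the summary are illustrative.

```r
library(plyr)

# Data frame output: the grouping labels come back as ordinary columns
labelled <- ddply(mtcars, .(cyl, gear), summarise, mean_mpg = mean(mpg))
print(labelled)

# Same split and same computation, but array output:
# the labels now appear as dimension names instead
as_array <- daply(mtcars, .(cyl, gear), function(df) mean(df$mpg))
print(as_array)
```

Only the function name changes between the two calls, and in both cases the labels are attached by plyr rather than by hand.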

plyr Performance Advantages

Alongside reducing effort for the user, plyr also has some less visible benefits. Below is a lollipop graph intended to visualise one of the most important ones: parallelisation.
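As a rough sketch of how parallel execution can be switched on, the example below assumes that both plyr and the doParallel package are installed and that two cores are available. The slow_square function is a made-up stand-in for real work. Depending on platform and backend, extra setup may be needed.

```r
library(plyr)
library(doParallel)  # assumption: installed; provides a foreach backend

# Register a parallel backend with two workers
registerDoParallel(cores = 2)

# Made-up slow task
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Because pieces are independent, plyr can hand them to different workers
res <- llply(1:8, slow_square, .parallel = TRUE)

# Release any implicitly created cluster
stopImplicitCluster()
```

The only change from a serial call is the .parallel = TRUE argument, which is possible precisely because each piece is processed independently.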

Pre and Post Cleaned Data Example