Comparing Reproducibility Approaches

Overview

This document compares two approaches to ensuring reproducibility in analytical projects:

  1. Folder-based reproducibility (legacy approach)
  2. bcgovpond-based reproducibility (current approach)

Both approaches aim to make past results reproducible and auditable. The difference lies in how much of that responsibility is carried by people and process versus encoded directly into the system.


The legacy folder-based approach

Description

Under the folder-based approach:

  • Project code is tracked in git
  • R package versions are frozen using renv
  • Each analysis run uses a data/current/ directory
  • After a run, the contents of data/current/ are moved to data/past/
  • New data is introduced by copying files into data/current/
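
The rotation step above can be sketched in base R. This is a minimal illustration, not a script from the project: the function name archive_current() and the dated subfolder under data/past/ are assumptions added for the example. The workflow as described only says that files are moved to data/past/.

```r
# Hypothetical helper: move the inputs of a finished run out of data/current/.
# The run_label subfolder is an assumption made for this sketch.
archive_current <- function(root = ".",
                            run_label = format(Sys.Date(), "%Y-%m-%d")) {
  current <- file.path(root, "data", "current")
  dest <- file.path(root, "data", "past", run_label)
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)

  files <- list.files(current, full.names = TRUE)
  targets <- file.path(dest, basename(files))

  # Refuse to overwrite anything that was archived earlier.
  clash <- basename(targets)[file.exists(targets)]
  if (length(clash) > 0) {
    stop("Already archived under ", dest, ": ", paste(clash, collapse = ", "))
  }

  moved <- file.rename(files, targets)
  if (!all(moved)) stop("Some files could not be moved to ", dest)
  invisible(targets)
}
```

After this runs, data/current/ is empty and ready for the next set of files. Nothing in the repository records which archived files fed which result.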

To recreate or audit a past result, the user:

  1. Checks out the appropriate git commit
  2. Restores packages using renv::restore()
  3. Manually identifies the correct historical data files in data/past/
  4. Moves those files back into data/current/
  5. Re-runs the analysis script

This approach relies on consistent naming conventions, careful file movement, and institutional knowledge of which files belong to which run.
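
The audit steps above might look like the following in base R. This is a sketch under stated assumptions: restore_past_files() and the script name analysis.R are hypothetical. The helper also copies files rather than moving them, so the archive stays intact, which departs from the steps as described.

```r
# Steps 1 and 2 happen before this code runs:
#   git checkout <commit>      (in a terminal)
#   renv::restore()            (in R)

# Hypothetical helper for steps 3 and 4. The caller has to supply the file
# names (relative to data/past/); the repository does not record them.
restore_past_files <- function(file_names, root = ".") {
  past <- file.path(root, "data", "past", file_names)
  missing <- file_names[!file.exists(past)]
  if (length(missing) > 0) {
    stop("Not found in data/past/: ", paste(missing, collapse = ", "))
  }

  current <- file.path(root, "data", "current")
  dir.create(current, recursive = TRUE, showWarnings = FALSE)

  # Leftover files from another run would silently mix into the reproduction.
  leftover <- list.files(current)
  if (length(leftover) > 0) {
    stop("data/current/ is not empty: ", paste(leftover, collapse = ", "))
  }

  targets <- file.path(current, basename(file_names))
  copied <- file.copy(past, targets)
  if (!all(copied)) stop("Some files could not be copied into data/current/")
  invisible(targets)
}

# Step 5, with a hypothetical script name:
# source("analysis.R")
```

The two checks catch missing files and a non-empty data/current/, but they cannot tell whether the chosen files are the right ones for the commit being audited.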


The bcgovpond approach

Description

Under the bcgovpond approach:

  • Project code is tracked in git
  • R package versions are frozen using renv
  • Raw data is stored immutably in data_store/data_pond/
  • Metadata describing each dataset is stored in data_index/meta/
  • Logical pointers (“views”) defining which data is used are stored in data_index/views/
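
To make the layout concrete, here is one way immutable storage and per-file metadata could be approximated in base R. This is not the bcgovpond API: add_to_pond(), the checksum-prefixed file name, and the one-row CSV metadata format are all assumptions made for this sketch.

```r
# Hypothetical helper, not part of bcgovpond.
# Copies a file into data_store/data_pond/ under a name that includes its MD5
# checksum, marks it read-only, and writes a small metadata record.
add_to_pond <- function(src, root = ".") {
  pond <- file.path(root, "data_store", "data_pond")
  meta <- file.path(root, "data_index", "meta")
  dir.create(pond, recursive = TRUE, showWarnings = FALSE)
  dir.create(meta, recursive = TRUE, showWarnings = FALSE)

  md5 <- unname(tools::md5sum(src))
  stored_name <- paste0(md5, "_", basename(src))
  dest <- file.path(pond, stored_name)

  # Identical content maps to the same name, so existing files are never
  # overwritten; a changed file gets a new name instead.
  if (!file.exists(dest)) {
    if (!file.copy(src, dest)) stop("Could not copy ", src, " into the pond")
    Sys.chmod(dest, mode = "0444")  # read-only; mostly meaningful on Unix-alikes

    record <- data.frame(
      stored_name = stored_name,
      original_name = basename(src),
      md5 = md5,
      size_bytes = file.size(src),
      stringsAsFactors = FALSE
    )
    write.csv(record, file.path(meta, paste0(stored_name, ".csv")),
              row.names = FALSE)
  }
  invisible(stored_name)
}
```

Putting the checksum in the name means a later file with the same original name cannot replace an earlier one, which is the property that makes the store safe to point at from versioned metadata.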

Analysis scripts never refer to mutable directories. Instead, they resolve logical data names through versioned views.
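
A minimal sketch of what resolving a logical name through a view could look like follows. The function resolve_view(), the view file format (a CSV with logical_name and stored_name columns), and the view name "current" are assumptions for illustration, not the actual bcgovpond interface.

```r
# Hypothetical resolver, not part of bcgovpond.
# Assumed view format: data_index/views/<view>.csv with two columns,
# logical_name and stored_name.
resolve_view <- function(logical_name, view = "current", root = ".") {
  view_file <- file.path(root, "data_index", "views", paste0(view, ".csv"))
  mapping <- read.csv(view_file, colClasses = "character")

  hit <- mapping$stored_name[mapping$logical_name == logical_name]
  if (length(hit) != 1) {
    stop("Expected exactly one entry for '", logical_name, "' in ", view_file)
  }

  path <- file.path(root, "data_store", "data_pond", hit)
  if (!file.exists(path)) stop("View points at a missing pond file: ", path)
  path
}
```

An analysis script would then read data by logical name, for example read.csv(resolve_view("claims")), where "claims" is a made-up dataset name. Because the view file is committed with the code, checking out an older commit changes what the same call returns.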

To recreate or audit a past result, the user:

  1. Checks out the appropriate git commit
  2. Restores packages using renv::restore()
  3. Runs the analysis script

Because the views in data_index/views/ are versioned alongside the code, the mapping from logical datasets to physical files is fully defined by the checked-out commit.


Key differences

Procedural vs structural reproducibility

  Aspect                        Folder-based approach    bcgovpond approach
  ----------------------------  -----------------------  --------------------------------
  Code versioning               Explicit (git)           Explicit (git)
  Package versions              Explicit (renv)          Explicit (renv)
  Raw data preservation         Explicit (data/past/)    Explicit (data_store/data_pond/)
  Data selection                Implicit, procedural     Explicit, versioned
  Mutable runtime state         data/current/            None
  File movement during audits   Required                 None
  Audit trail                   Reconstructed            Native

Practical implications

Cognitive load

The folder-based approach requires users to remember and correctly execute a sequence of manual steps to reproduce results. Errors in file selection or movement can silently invalidate a reproduction attempt.

The bcgovpond approach eliminates these steps by encoding data selection directly in versioned metadata. Reproducibility does not depend on remembering procedural details.


Risk under time pressure

The legacy approach works well when users are careful and unhurried. Under time pressure, however, mutable directories and manual file movement increase the risk of subtle mistakes.

bcgovpond reduces this risk by removing mutable shared state and file shuffling from the reproduction process.


Auditability

With the folder-based approach, explaining how a result was produced often requires reconstructing the sequence of actions that led to it.

With bcgovpond, the explanation is structural: the git commit itself pins the code, the package environment (through the renv lockfile), and the data selection (through the views).


When the legacy approach may be sufficient

The folder-based approach may be adequate when:

  • A single analyst maintains the project
  • The project lifespan is short
  • Reproduction requests are rare
  • Institutional knowledge is stable

When bcgovpond provides clear advantages

The bcgovpond approach provides clear benefits when:

  • Projects span multiple years
  • Multiple analysts interact with the same data
  • Audits or external review are expected
  • Staff turnover is a concern
  • Results must be defensible long after initial production

Summary

Both approaches support reproducibility in principle. The key difference is where reproducibility lives:

  • The folder-based approach relies on procedural discipline and user memory.
  • The bcgovpond approach encodes reproducibility into filesystem structure and versioned metadata.

In practice, bcgovpond reduces cognitive load, lowers audit risk, and improves long-term defensibility by making the correct workflow the easiest one to follow.