```mermaid
flowchart LR
    subgraph End-to-end Research Project
        direction LR
        Science[Science]
        subgraph Unsafe in Isolation
            direction LR
            Data[Data]
            subgraph Applied Statistics
                direction LR
                Analysis[Analysis]
            end
        end
    end
    Science --> Data --> Analysis
```
Reproducible Research Guide

Presentation: The Importance of Reproducible Research in High-Throughput Biology
Replication vs Reproducibility
Replication - obtaining consistent results by conducting a new, independent instance of an experiment (using new test subjects, new data, and new conditions) - a central tenet of the scientific method. Its role has come into sharp focus with the replication crisis, which highlighted the importance of rigorous experimental design, transparent methods, and full disclosure of data and results.
Reproducibility - re-creating a specific instance of the research pipeline to validate the data analysis and results. This typically involves re-running the same code or analysis on the original dataset or closely related data. In modern data science, analysis techniques can be extremely complex, making reproducibility crucial for verifying findings. The stakes are high, especially when compliance, regulatory approval, or policy decisions depend on the results.
The Reproducibility Crisis
p-Hacking and Flexible Analyses
P‑hacking and flexible analyses occur when exploratory analyses are presented as confirmatory results, leading to spurious statistical significance through multiple testing and data dredging. By selectively reporting only favorable outcomes, trying different statistical methods, or reshaping data until a “significant” result emerges, researchers can unintentionally mislead themselves and others. This practice inflates the rate of false positives, undermines the reliability of findings, and obscures the true signal in the data, making it harder for future studies to build on a trustworthy foundation.
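To make this concrete, the short R simulation below (purely illustrative, with made-up variable names) regresses a random outcome on 20 unrelated noise predictors. Reporting only the smallest p-value from such a search is exactly the kind of flexible analysis that produces spurious significance:

```r
# Hypothetical simulation: with 20 pure-noise predictors, testing each one
# against a random outcome at alpha = 0.05 will often yield at least one
# "significant" p-value by chance alone.
set.seed(42)
n <- 100                                   # observations
k <- 20                                    # candidate predictors, all noise
outcome <- rnorm(n)
predictors <- matrix(rnorm(n * k), nrow = n)

# p-value of the slope from regressing the outcome on each predictor in turn
p_values <- apply(predictors, 2, function(x) {
  summary(lm(outcome ~ x))$coefficients[2, 4]
})

sum(p_values < 0.05)   # frequently > 0 even though no real effect exists
min(p_values)          # the "best" result a p-hacker would be tempted to report
```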
Underpowered Studies
Small sample sizes reduce statistical power, making results highly variable and prone to random fluctuations. In such studies, true effects can be overlooked, while noise is more easily mistaken for meaningful signal. This not only leads to a higher risk of false positives and false negatives, but also undermines the reliability and reproducibility of findings. Ensuring adequately powered studies is critical for producing robust, credible, and actionable results.
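As a rough illustration, base R's power.t.test() shows how the same moderate effect is detected reliably with large groups but usually missed with small ones (the effect size and sample sizes below are arbitrary choices for the example):

```r
# Two-sample t-test power for a moderate effect (Cohen's d = 0.5)
power.t.test(n = 15,  delta = 0.5, sd = 1, sig.level = 0.05)$power  # roughly 0.26
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power  # roughly 0.94

# Or work backwards: sample size per group needed for 80% power
power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)$n  # about 64
```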
Insufficient Transparency
A lack of open data, code, and protocols makes it challenging, and in many cases nearly impossible, to reproduce or verify published results. Without access to the underlying methods and materials, findings cannot be properly scrutinized, tested, or validated. Complex pipelines, hidden settings, and custom or poorly-documented code further obscure how results were generated, increasing the risk of error and misinterpretation. In some cases, misuse of statistical tools or unintended software bugs can go unnoticed for years. Together, these issues erode trust in the scientific process and hamper the ability of other researchers to build upon prior work, making transparency a cornerstone of credible, reproducible research.
Cultural Factors
The research culture itself can undermine reproducibility. An overemphasis on quantity, rewarding the sheer volume of publications over methodological rigour, often incentivises researchers to cut corners or chase only “exciting” results. Meanwhile, failed replications are viewed negatively, making researchers less likely to attempt, document, or publish them. This stigma discourages the very scrutiny that is vital for scientific progress, allowing errors and unverified claims to persist and ultimately eroding trust in the research enterprise.
Lack of Contextual Understanding
Data science is an ancillary discipline — it can reveal patterns, correlations, and anomalies, but it cannot, on its own as an abstract mathematical exercise, produce meaningful conclusions. To interpret results and draw actionable insights, one must have deep knowledge of the system that generated the data. Without that context, even the best analysis is prone to misinterpretation or irrelevance.
Reproducing the Research Pipeline
When conducting a study and authoring a report, the golden rule is to script every step to ensure transparency and reproducibility. This allows readers to examine the analysis in depth, tracing back to the raw data that underpins the study, although this level of detail is not always practical. Moreover, scripting facilitates the integration of new data as it becomes available, keeping the study “live” and easily updatable over time. Reproducibility is aided by literate programming tools (Sweave, knitr, Jupyter, MATLAB live scripts, Mathematica, Quarto, Google Colab…) combined with live data visualisation.
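As an illustration, here is a minimal sketch of a literate-programming document in R Markdown (the file path and chunk contents are hypothetical); knitr re-executes the chunks each time the report is rendered, so every figure and number is regenerated directly from the data:

````markdown
---
title: "Analysis Report"
output: html_document
---

We load the analytic data and summarise the primary outcome.

```{r load-data}
# hypothetical path to the processed (analytic) dataset
analytic <- read.csv("data/analytic_data.csv")
summary(analytic$outcome)
```

```{r outcome-plot, fig.cap="Distribution of the primary outcome"}
hist(analytic$outcome, main = "", xlab = "Outcome")
```
````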
```mermaid
%%{init: {'theme': 'forest'}}%%
flowchart LR
    subgraph The Research Pipeline
        direction LR
        subgraph Experiment
            direction LR
            Measured[Raw\nMeasured\nData]
        end
        Wrangled[Analytic\nData]
        Results[Computational\nResults]
        subgraph Presentational Code
            direction LR
            Figures[Figures]
            Tables[Tables]
            Misc[Numerical\nSummaries]
        end
        Article{Report\nArticle}
        Text[Narrative]
        Measured -->|Processing\nCode| Wrangled -->|Analytic\nCode| Results
        Results --> Figures
        Results --> Tables
        Results --> Misc
        Figures --> Article
        Tables --> Article
        Misc --> Article
        Text --> Article
        Article2[Report/Article]
        Experiment2[Raw Measured Data]
        Experiment2 -.-> |Author| Article2
        Article2 -.-> |Reader| Experiment2
    end
```
Data Analysis Steps
There is no single right path, but the sequence generally runs roughly as follows, with these actors:
- Analyst (does the work)
- Data Source (provides data)
- Stakeholders (define questions, review results)
- Computing Environment (runs code, stores data)
```mermaid
sequenceDiagram
    participant Stakeholder
    participant Analyst
    participant DataSource
    participant ComputeEnv
    Stakeholder->>Analyst: Define the question
    Analyst->>Stakeholder: Clarify question / goals
    Analyst->>Analyst: Define ideal dataset
    Analyst->>DataSource: Determine data availability
    DataSource-->>Analyst: Provide data access info
    Analyst->>DataSource: Obtain the data
    DataSource-->>ComputeEnv: Deliver raw data
    ComputeEnv->>Analyst: Data ready for cleaning
    Analyst->>ComputeEnv: Clean the data (munging)
    Analyst->>ComputeEnv: Exploratory data analysis
    Analyst->>ComputeEnv: Statistical prediction/modeling
    Analyst->>Stakeholder: Interpret results
    Stakeholder->>Analyst: Challenge results / feedback
    Analyst->>Analyst: Adjust analysis / revisit steps
    Analyst->>Stakeholder: Synthesise/write up results
    Analyst->>ComputeEnv: Create reproducible code
```
Defining the right question
Defining the right question upfront is the most powerful data reduction technique, and it saves the most time. The sooner we clarify precisely what we’re trying to answer, the more effectively we can filter out irrelevant data and focus our efforts where they truly matter. This drastically simplifies the problem and improves the quality and efficiency of the analysis.
Defining the ideal dataset
Defining the ideal dataset is essential for steering your research in the right direction and ensuring your results are trustworthy and actionable:
- Clarity of Purpose: Knowing what the ideal dataset looks like helps focus data collection and analysis on the variables and observations that truly matter for answering the research question.
- Quality over Quantity: It ensures that the data you gather is relevant, accurate, and complete—avoiding noise, missing values, or irrelevant information that can obscure insights.
- Efficient Use of Resources: By defining the ideal dataset upfront, you avoid wasting time and effort on collecting or processing unnecessary data.
- Improved Reproducibility: Clear definition aids in documenting exactly what data is needed, making it easier for others to replicate or build on your work.
- Better Modeling and Inference: The right data helps models capture true underlying patterns rather than artefacts, reducing bias and increasing confidence in conclusions.
Example: Data Analysis Workflow for Spam Email Detection
Problem: Can I automatically detect emails that are spam versus those that are not?
To turn this general question into a concrete data analysis project, you first need to translate it into terms suited for data science. A refined question might be:
Problem Refined: Can I use quantitative features of emails — like the frequency of certain words, sender information, or formatting — to classify emails as spam or not spam?
Step 1: Define the Question and Clarify Goals
You start by confirming the goal: build a classifier that can predict if an email is spam or “ham” (not spam). You might clarify: is this a descriptive, inferential, predictive, or causal question? In this case, it’s primarily predictive.
Step 2: Define the Ideal Dataset
Ideally, you want a dataset containing all emails in a given system (e.g., all Gmail emails) with labels indicating spam or not. This would give you a comprehensive population and eliminate sampling bias. However, this is practically impossible due to privacy and security constraints.
Step 3: Determine Data Availability
Since you can’t get access to all Gmail emails, you explore alternatives. Fortunately, publicly available datasets exist, such as the SpamBase dataset collected by Hewlett Packard researchers, which includes a few thousand labeled spam and non-spam emails. You find this dataset accessible in the UCI Machine Learning Repository or bundled in R’s kernlab package.
You ensure to respect data licenses and document your data sources thoroughly, including URLs and access dates.
Step 4: Obtain the Data
You download the SpamBase dataset or load it directly in R, making sure to keep track of where and when you got it. You also try reaching out politely to data providers if necessary.
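A minimal sketch of this step in R, assuming the kernlab package is installed (it bundles a version of the SpamBase data as the spam data frame):

```r
# Load the spam dataset bundled with kernlab (derived from the HP SpamBase data)
library(kernlab)
data(spam)
dim(spam)          # 4601 emails, 57 quantitative features plus the spam/nonspam label

# Record provenance alongside the analysis
dateObtained <- date()
dateObtained
```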
Step 5: Prepare the Data
Raw data usually needs cleaning: checking for missing values, formatting variables properly, and understanding how the data was preprocessed. You find documentation on how the SpamBase data was collected and processed, which helps you understand limitations.
You might also consider sub-sampling if the dataset is very large, or feature engineering to extract meaningful quantitative characteristics from email text.
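For example, a few basic preparation steps on the kernlab spam data might look like this (the seed and the 50/50 split are illustrative choices, not requirements):

```r
# Basic checks, then split the emails into training and test sets so that
# model evaluation later is honest. Assumes the previous chunk has been run.
sum(is.na(spam))                      # check for missing values
str(spam$type)                        # factor with levels "nonspam" and "spam"

set.seed(3435)                        # make the random split reproducible
trainIndicator <- rbinom(nrow(spam), size = 1, prob = 0.5)
trainSpam <- spam[trainIndicator == 1, ]
testSpam  <- spam[trainIndicator == 0, ]
table(trainSpam$type)
```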
Step 6: Exploratory Data Analysis (EDA)
You explore the dataset to understand distributions, spot anomalies, or identify informative features — for example, frequency of suspicious words, email length, or sender reputation.
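A couple of illustrative exploratory checks on the training set (the specific features examined here, such as capitalAve, are just examples of what you might look at):

```r
# Compare a few word-frequency features and the average run of capital
# letters between spam and non-spam emails in the training set.
summary(trainSpam[, 1:4])                          # first few word-frequency features
tapply(trainSpam$capitalAve, trainSpam$type, mean) # capitals tend to be heavier in spam

# Log-transform (plus 1 to avoid log(0)) to see the skewed feature more clearly
boxplot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type,
        ylab = "log10(average capital run length + 1)")
```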
Step 7: Statistical Prediction/Modeling
Using the prepared dataset, you build a classifier — say, logistic regression or a random forest — to predict spam versus ham. You split the data into training and testing sets to evaluate performance objectively.
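As a minimal sketch, the model below is a logistic regression on a single illustrative feature (charDollar, the frequency of the "$" character); a real analysis would compare several features and model types:

```r
# Fit a simple classifier on the training set only
trainSpam$isSpam <- as.numeric(trainSpam$type == "spam")
fit <- glm(isSpam ~ charDollar, data = trainSpam, family = binomial)
summary(fit)$coefficients
```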
Step 8: Interpret Results and Iterate
You present results to stakeholders (or yourself) and discuss whether the model performs well enough. You might need to refine features, adjust models, or even revisit data cleaning.
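One simple way to support that discussion is to score the held-out test set and tabulate predictions against the true labels, continuing the hypothetical model above:

```r
# Predicted probabilities on the test set, thresholded at 0.5
testProb <- predict(fit, newdata = testSpam, type = "response")
testPred <- ifelse(testProb > 0.5, "spam", "nonspam")

# Confusion matrix and overall misclassification rate
confusion <- table(predicted = testPred, actual = testSpam$type)
confusion
1 - sum(diag(confusion)) / sum(confusion)
```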
Step 9: Document and Create Reproducible Code
You script the entire process from data loading to final modeling in reproducible code (e.g., an R Markdown or Jupyter notebook), so that the analysis can be reviewed, updated, or extended later.
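At the end of such a script or notebook, it helps to record the computing environment, for example:

```r
# Capture the R version, platform, and loaded package versions so that others
# (and future you) can recreate the environment. Seeds set earlier, e.g.
# set.seed(3435) before the data split, make re-runs give the same partition.
sessionInfo()
```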
Important Considerations
If the dataset proves inadequate (too small, biased, or missing key variables), you may need to rethink your question or seek additional data. Transparency and documentation at every step ensure the work can be replicated and trusted. Always respect data privacy and terms of use.