What this document is

This document provides an ordered outline of common data processing and analysis steps, taking you from obtaining your data through to creating final outputs.

It stays mostly high level, linking to papers, guides, and other references where those are more detailed and helpful.

Some points are flagged as places to seek more guidance on methods or to revisit earlier steps.

However, this “road map” presumes many things, including that you have already done the much more important preparatory work leading up to this stage (see: What this document will not cover) and that you will properly set the outputs in context (papers, presentations, reports, etc.).

No data or analyses stand alone.

What this document will not cover

- How to develop a research question
- How to conduct a literature review
- How to prepare a research proposal (though methods sections should contain some of these details)
- How to do preparatory work (e.g. defining the target population, drawing a DAG, choosing variables of interest)
- How to write a data request or IRB application
- A detailed decision tree that can cover all potential challenges (needing more data, revising aims, etc.)
- How to write a scientific paper or white paper, design a conference poster or presentation, report to communities, etc.

1. Data Management and Preparation

2. Description

  • Stratified tables
  • Data visualizations to show group distributions and differences
    • Overlapping density plots
      • geom_density
      • fill = group_name (to split groups)
      • alpha = 0.5 (transparency)
    • Overlapping or side-by-side histograms
      • geom_histogram
      • fill = group_name (to split groups)
      • position = "dodge" (to make them side-by-side, not stacked)
    • Geographic maps / heat maps
      • rayshader / rayrender (for 3D-rendered maps and graphics)
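
The description steps above can be sketched in R as follows. This is a minimal illustration, not a prescribed workflow: the data frame `dat`, its columns `group_name` and `value`, and the simulated values are all hypothetical stand-ins for your own data. The table uses base R only, and the plots assume the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical example data: one continuous measure in two groups
set.seed(42)
dat <- data.frame(
  group_name = rep(c("A", "B"), each = 200),
  value = c(rnorm(200, mean = 50, sd = 10), rnorm(200, mean = 58, sd = 12))
)

# Stratified table: n, mean, and SD of `value` within each group
strat_table <- do.call(rbind, lapply(split(dat, dat$group_name), function(d) {
  data.frame(
    group_name = d$group_name[1],
    n = nrow(d),
    mean = mean(d$value),
    sd = sd(d$value)
  )
}))

# Overlapping density plots: fill splits the groups, alpha adds transparency
p_density <- ggplot(dat, aes(x = value, fill = group_name)) +
  geom_density(alpha = 0.5)

# Side-by-side histograms: position = "dodge" places bars next to each other
p_hist <- ggplot(dat, aes(x = value, fill = group_name)) +
  geom_histogram(position = "dodge", bins = 30)
```

Printing `strat_table`, `p_density`, or `p_hist` at the console displays the table or draws the plot. The choice of `bins = 30` is arbitrary here and should be adapted to the data.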

3. Prediction

  • What is the setting / clinical relevance?
  • What data are practically available as predictors? When?
  • Binary (classification) vs. continuous outcome
  • Fit the best-predicting models
    • Ensembles / stacked learning (SuperLearner)
    • Grid search or automated approaches for hyperparameters (AutoML)
    • Cross-validation
  • Assess metrics
    • Discrimination (accuracy, sensitivity, specificity, AUC)
    • Calibration (Brier score, calibration plots)
  • External validation
    • Test in independent data
  • Development for clinical practice
    • Decision Theory / Decision Curve Analysis
    • Cost-benefit trade-offs
    • Nomograms and clinical tools
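
As a rough sketch of the cross-validation, metric, and decision-curve steps above, the base R code below works through a binary outcome. Everything in it is an assumption for illustration: the simulated data, the variable names (`x1`, `x2`, `y`), the 0.5 classification threshold, and the 0.3 threshold probability. A plain logistic regression stands in for whatever learner or ensemble (for example SuperLearner) you would actually fit.

```r
# Hypothetical simulated data: binary outcome y, two predictors x1 and x2
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$x1 - 0.7 * dat$x2))

# 5-fold cross-validation: each row is predicted by a model that never saw it
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
pred <- numeric(n)
for (i in seq_len(k)) {
  fit <- glm(y ~ x1 + x2, data = dat[fold != i, ], family = binomial())
  pred[fold == i] <- predict(fit, newdata = dat[fold == i, ], type = "response")
}

# Discrimination at an (arbitrary) 0.5 classification threshold
pred_class  <- as.integer(pred >= 0.5)
accuracy    <- mean(pred_class == dat$y)
sensitivity <- sum(pred_class == 1 & dat$y == 1) / sum(dat$y == 1)
specificity <- sum(pred_class == 0 & dat$y == 0) / sum(dat$y == 0)

# AUC via the rank (Mann-Whitney) formulation
n1 <- sum(dat$y == 1)
n0 <- sum(dat$y == 0)
auc <- (sum(rank(pred)[dat$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Brier score: mean squared difference between predicted probability and outcome
# (it reflects both calibration and discrimination)
brier <- mean((pred - dat$y)^2)

# Decision curve analysis: net benefit at threshold probability pt
net_benefit <- function(p, y, pt) {
  tp <- sum(p >= pt & y == 1)  # true positives among those flagged
  fp <- sum(p >= pt & y == 0)  # false positives among those flagged
  tp / length(y) - (fp / length(y)) * pt / (1 - pt)
}
nb_model     <- net_benefit(pred, dat$y, pt = 0.3)
nb_treat_all <- net_benefit(rep(1, n), dat$y, pt = 0.3)
```

Comparing `nb_model` with `nb_treat_all` (and with 0, the net benefit of treating no one) across a range of `pt` values gives the decision curve. These metrics come from internal cross-validation only; external validation still requires independent data.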

4. Causal Inference

  • Well-defined exposure / intervention
  • Target trial methods
  • Threats to validity:
    • Confounding (and how to control for it)
    • Selection bias
    • Missing data
  • Subgroup analyses
    • Interactions
    • Heterogeneous effects
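
To make the confounding and subgroup items above concrete, here is a small base R sketch on simulated data. It is only one possible approach: standardization (g-computation) with a logistic outcome model is used as an example adjustment method, and the variables `L` (confounder), `A` (exposure), and `Y` (outcome) and their simulated relationships are hypothetical. It assumes `L` is the only confounder, which real analyses can rarely take for granted.

```r
# Hypothetical simulated data: confounder L affects both exposure A and outcome Y
set.seed(2)
n <- 5000
L <- rbinom(n, 1, 0.4)
A <- rbinom(n, 1, plogis(-1 + 1.5 * L))
Y <- rbinom(n, 1, plogis(-2 + 0.5 * A + 1.2 * L))
dat <- data.frame(L, A, Y)

# Crude (unadjusted) risk difference, which mixes the effect of A with that of L
rd_crude <- mean(dat$Y[dat$A == 1]) - mean(dat$Y[dat$A == 0])

# Standardization: fit an outcome model with an A-by-L interaction, then
# predict everyone's risk with A set to 1 and with A set to 0
fit <- glm(Y ~ A * L, data = dat, family = binomial())
risk1 <- mean(predict(fit, newdata = transform(dat, A = 1), type = "response"))
risk0 <- mean(predict(fit, newdata = transform(dat, A = 0), type = "response"))
rd_std <- risk1 - risk0

# Subgroup analysis: stratum-specific risk differences within levels of L
rd_by_L <- sapply(split(dat, dat$L), function(d) {
  mean(d$Y[d$A == 1]) - mean(d$Y[d$A == 0])
})
```

Here `rd_std` is the confounder-adjusted risk difference and `rd_by_L` shows whether the effect differs across strata. With a single binary confounder and a saturated model, `rd_std` equals the average of the stratum-specific differences weighted by the distribution of `L`.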