Model Building Exercises and Activities

Supplementary Materials for PHTH 6202

Author: Intermediate Epidemiology

Published: January 1, 2026

Exercise 1: Identifying Model Types

For each scenario below, identify whether a causal, prediction, or association model is most appropriate, and explain your reasoning.

Scenario A: Physical Activity and Dementia

“We want to know if increasing physical activity in middle-aged adults would reduce their risk of developing dementia later in life.”

Your answer:

  • Model type: _____________
  • Reasoning:
  • Key variables to consider:
  • Potential confounders:

Scenario B: Hospital Readmission Risk

“We need to identify patients at high risk of 30-day readmission so we can provide them with enhanced discharge planning and follow-up care.”

Your answer:

  • Model type: _____________
  • Reasoning:
  • Key predictors to consider:
  • How would you validate the model?

Scenario C: Diet and Cancer Screening

“We’re exploring whether dietary patterns are associated with participation in cancer screening programs in a population-based survey.”

Your answer:

  • Model type: _____________
  • Reasoning:
  • Analysis approach:
  • Limitations:

Exercise 2: Drawing DAGs

Exercise 2A: Coffee and Heart Disease

You’re investigating whether coffee consumption affects the risk of heart disease.

Known relationships:

  • Age affects both coffee consumption and heart disease risk
  • Smoking affects both coffee consumption and heart disease
  • Education affects smoking, coffee consumption, and healthcare access
  • Healthcare access affects heart disease diagnosis
  • Stress affects coffee consumption
  • Coffee consumption may affect blood pressure
  • Blood pressure affects heart disease

Tasks:

  1. Draw a DAG representing these relationships
  2. Identify:
    • Confounders that must be adjusted for
    • Variables that should NOT be adjusted for (mediators, colliders)
    • The minimal sufficient adjustment set

Bonus question: How would your DAG change if you were interested in the effect of coffee on mortality rather than heart disease?


Exercise 2B: Obesity and Mortality

Based on the Hernán & Taubman paper, consider this research question:

“Does obesity increase mortality risk?”

Tasks:

  1. Identify at least 4 different mechanisms by which someone might become obese
  2. For each mechanism, identify whether it has independent effects on mortality
  3. Draw a DAG showing obesity as a common outcome (collider) of these mechanisms
  4. Explain why “the effect of obesity on mortality” is poorly defined
  5. Propose 2 better-defined research questions

Exercise 3: Variable Selection Strategies

Case Study: Vitamin D and COVID-19

You have data on 5,000 patients hospitalized with COVID-19, including:

  • Vitamin D levels (measured at admission)
  • Age, sex, race/ethnicity
  • BMI
  • Comorbidities (diabetes, hypertension, kidney disease, etc.)
  • Smoking status
  • Socioeconomic indicators
  • Season of admission
  • Hospital outcomes (ICU admission, mortality)

Research question: Does vitamin D deficiency increase risk of severe COVID-19?

Task A: Identify the Variables

For each variable below, classify it as:

  • C = Confounder (must adjust)
  • M = Mediator (do not adjust)
  • Co = Collider (do not adjust)
  • P = Precision variable (optional)
  • ? = Unclear/depends on causal model

Variable       | Classification | Reasoning
Age            |                |
Sex            |                |
Race/ethnicity |                |
BMI            |                |
Diabetes       |                |
Smoking        |                |
Season         |                |
Kidney disease |                |

Task B: Selection Strategies

Three analysts approach this differently:

Analyst 1: Uses stepwise selection (p<0.05 to enter, p>0.10 to remove)

Analyst 2: Adjusts for everything in the table above

Analyst 3: Draws a DAG and identifies confounders, adjusts only for those

Questions:

  1. What are the pros and cons of each approach?
  2. Which is most appropriate for a causal question?
  3. If this were a prediction model instead, would your answer change?

Exercise 4: The Change-in-Estimate Approach

You’re studying the effect of sleep duration on type 2 diabetes risk.

Starting Model

Base model: Sleep duration → Diabetes
Crude OR = 1.45 (95% CI: 1.25-1.68)

Adding Variables One at a Time

Effect of Adding Variables (One at a Time)

Model               | OR   | 95% CI    | Change from crude
Crude               | 1.45 | 1.25-1.68 |
+ Age               | 1.38 | 1.18-1.61 | 4.8%
+ Sex               | 1.44 | 1.24-1.67 | 0.7%
+ BMI               | 1.22 | 1.04-1.43 | 15.9%
+ Physical activity | 1.36 | 1.16-1.59 | 6.2%
+ Diet quality      | 1.41 | 1.21-1.64 | 2.8%
+ Depression        | 1.32 | 1.13-1.54 | 9.0%
+ Shift work        | 1.28 | 1.09-1.50 | 11.7%

Questions:

  1. Using a 10% change threshold, which variables would you keep?
  2. Why might BMI cause such a large change?
  3. Could BMI be a mediator? How would you decide?
  4. What about shift work - confounder or mediator?
  5. Should you adjust for all variables that cause >10% change?

Exercise 5: Well-Defined Interventions

Evaluating Research Questions

For each research question, determine:

  1. Is the exposure well-defined?
  2. Can you describe a hypothetical trial?
  3. If not well-defined, what’s the problem?
  4. How would you reframe the question?

Question 1

“What is the effect of depression on cardiovascular disease?”

Analysis:

  • Well-defined exposure? (Yes/No): _____
  • Hypothetical trial:
  • Problems:
  • Better question:

Question 2

“Does high LDL cholesterol cause myocardial infarction?”

Analysis:

  • Well-defined exposure? (Yes/No): _____
  • Hypothetical trial:
  • Problems:
  • Better question:

Question 3

“What is the effect of a Mediterranean diet intervention (as defined by specific food groups and quantities) on 5-year cardiovascular mortality?”

Analysis:

  • Well-defined exposure? (Yes/No): _____
  • Hypothetical trial:
  • Problems (if any):
  • Is this a good question?

Question 4

“Does C-reactive protein (CRP) level cause increased mortality?”

Analysis:

  • Well-defined exposure? (Yes/No): _____
  • Hypothetical trial:
  • Problems:
  • Better question:

Exercise 6: Critique Published Studies

Instructions

Find 3 papers from your field that use multivariable regression. For each, complete the following analysis:

Paper 1

Citation:

Research Question:

Model Type (Causal/Prediction/Association):

Variable Selection Method:

Strengths:

Weaknesses:

What would you do differently?


Critical Evaluation Checklist

Use this checklist when evaluating papers:

For ALL models:

For CAUSAL models specifically:

For PREDICTION models specifically:


Exercise 7: The Many Analysts Problem

Simulation Activity

In groups of 3-4, you’ll each analyze the same simulated dataset to answer:

“Does drug X reduce the risk of outcome Y?”

Available Variables

  • Treatment (drug X vs placebo)
  • Age (continuous)
  • Sex (M/F)
  • Baseline disease severity (mild/moderate/severe)
  • Comorbidity count (0-5)
  • Smoking status (current/former/never)
  • BMI (continuous)
  • Blood pressure (continuous)
  • Cholesterol level (continuous)
  • Previous hospitalizations (count)

Rules

  1. Each person chooses their own:

    • Which confounders to adjust for
    • Whether to include interactions
    • How to handle continuous variables (linear, categories, splines)
    • Which model to use (logistic, Cox, etc.)
  2. Record your effect estimate and 95% CI

  3. Compare with your group

Discussion Questions

  1. Did you get the same answer? Why or why not?
  2. Which approach seems most defensible?
  3. How would you resolve disagreements?
  4. What does this teach us about analytic flexibility?

Exercise 8: Building a DAG for Your Research

Your Own Research Question

Take a research question from your own work or thesis.

Step 1: Define the Question

Exposure:

Outcome:

Is the exposure well-defined?

Can you describe a trial?


Step 2: List Variables

List all variables you think are relevant:

Confounders:

Potential Mediators:

Potential Colliders:

Other (precision variables, effect modifiers):


Step 3: Draw Your DAG

(Space for drawing or paste image)

Step 4: Identify Adjustment Set

Minimal sufficient adjustment set:

Variables to definitely NOT adjust for:

Uncertain about:


Step 5: Sensitivity Analyses

What unmeasured confounders might be important?

How would you assess robustness?


Exercise 9: Model Building Decisions

Decision Tree Activity

For the research question: “Does prenatal vitamin use reduce risk of neural tube defects?”

Work through this decision tree:

START: Research Question
    |
    v
Is exposure well-defined?
    |
    +-- No --> Reframe question
    |
    +-- Yes
        |
        v
    What's your goal?
        |
        +-- Causal effect
        |       |
        |       v
        |   Draw DAG
        |       |
        |       v
        |   Identify confounders
        |       |
        |       v
        |   Adjust for confounders only
        |       |
        |       v
        |   Check positivity
        |       |
        |       v
        |   Sensitivity analyses
        |
        +-- Prediction
        |       |
        |       v
        |   Define clinical use case
        |       |
        |       v
        |   Identify candidate predictors
        |       |
        |       v
        |   Split data (train/test)
        |       |
        |       v
        |   Build model (with CV)
        |       |
        |       v
        |   Validate externally
        |
        +-- Exploration/Association
                |
                v
            Univariate screening
                |
                v
            Adjust for multiple comparisons
                |
                v
            Avoid causal language

Questions:

  1. Walk through each branch - what decisions do you make?
  2. How would your approach differ for each goal?
  3. What checks and balances are built in?

Exercise 10: Teaching Exercise

Explain to a Colleague

Practice explaining these concepts in simple terms:

Concept 1: Consistency Condition

Explain why “the effect of obesity on mortality” is poorly defined:

(Write a 3-4 sentence explanation that a non-statistician could understand)


Concept 2: Confounding vs Mediation

Use a diagram to show the difference between a confounder and a mediator:

(Draw or describe)


Concept 3: Collider Bias

Give a real-world example of how adjusting for a collider can create bias:

(Explain with a specific scenario)


Concept 4: Prediction vs Causation

Explain to a clinician why their predictive model can’t tell them what causes the outcome:

(3-4 sentences)


Additional Resources

Online Tools

  1. DAGitty (dagitty.net)
    • Interactive DAG drawing
    • Automatic identification of adjustment sets
    • Testable implications
  2. Causal Fusion (causalfusion.net)
    • Teaching tool for causal concepts
    • Interactive examples

R Packages

# Install key packages
install.packages(c(
  "dagitty",    # DAG creation and analysis
  "ggdag",      # Beautiful DAG plotting
  "gt",         # Great tables
  "broom",      # Tidy model outputs
  "performance" # Model checking
))

Datasets for Practice

  1. NHANES - National Health and Nutrition Examination Survey
    • Complex survey data
    • Many variables for practicing adjustment strategies
  2. Framingham Heart Study (teaching dataset)
    • Classic cardiovascular risk factors
    • Good for causal inference exercises
  3. UCI Machine Learning Repository
    • Many datasets for prediction modeling practice

Answer Key (All Exercises)

Tip: About the DAG Solutions

Solutions for exercises involving DAGs include actual R code using the dagitty and ggdag packages.

You can:

  • Run the code yourself to see the DAGs
  • Modify the DAG structures to test alternatives
  • Use the adjustmentSets() function to verify minimal sufficient adjustment sets
  • Compare your DAG to the solution

Color scheme: Blue = Exposure | Red = Outcome | Green = Confounder | Orange = Mediator | Purple = Collider


Exercise 1 - Model Types

Scenario A: Physical Activity and Dementia

  • Model type: Causal
  • Reasoning: The question asks “would reduce” - this is asking about the effect of an intervention
  • Key considerations: Well-defined intervention possible (exercise program), long follow-up needed, many potential confounders

Scenario B: Hospital Readmission

  • Model type: Prediction
  • Reasoning: Goal is to identify high-risk patients, not to understand causes
  • Key considerations: Need actionable timeframe, validation essential, predictors must be available at discharge

Scenario C: Diet and Screening

  • Model type: Association/Exploratory
  • Reasoning: “Exploring” suggests hypothesis generation, not causal inference
  • Key considerations: Cross-sectional design limits causal inference, multiple testing, hypothesis-generating

Exercise 2A - Coffee and Heart Disease DAG

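Variables: C = Coffee (blue, exposure) | H = Heart Disease (red, outcome) | A = Age (green, confounder) | S = Smoking (green, confounder) | E = Education (green, confounder) | B = Blood Pressure (orange, mediator) | P = Healthcare Access (purple; lies on a backdoor path through Education)

Minimal sufficient adjustment sets: {Age, Smoking, Education} = {A, S, E}, or alternatively {Age, Smoking, Healthcare access} = {A, S, P}

Do NOT adjust for:

  • Blood pressure (B): mediator on the pathway from coffee to heart disease

A note on healthcare access (P): with the relationships as listed (education → healthcare access → heart disease diagnosis), P is not a collider. It sits on the backdoor path Coffee ← Education → Healthcare access → Heart disease, so adjusting for either E or P blocks that path. P would be a collider only in a DAG where heart disease affected healthcare access rather than the reverse, which the exercise does not state. Stress affects only coffee consumption, so it is not a confounder.

Why these adjustment sets work: Age and Smoking block their own backdoor paths, and Education (or Healthcare access) blocks the remaining path through Education, all without blocking the causal pathway through blood pressure.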

# Verify adjustment set using dagitty
adjustmentSets(coffee_dag)
#> { A, P, S }
#> { A, E, S }
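The object coffee_dag is not defined in the text above. The following is a minimal sketch of one way it could be built. It assumes the edges are exactly the relationships listed in Exercise 2A, plus a direct coffee → heart disease arrow. The label T for stress is chosen here for illustration and is not part of the original legend.

```r
library(dagitty)

# Assumed reconstruction of the Exercise 2A DAG (not the author's original code).
# T = Stress is a hypothetical label; all other letters follow the legend above.
coffee_dag <- dagitty("dag {
  C [exposure]
  H [outcome]
  A -> C
  A -> H
  S -> C
  S -> H
  E -> S
  E -> C
  E -> P
  P -> H
  T -> C
  C -> B
  B -> H
  C -> H
}")

# Minimal sufficient adjustment sets for the total effect of C on H
sets <- adjustmentSets(coffee_dag)
print(sets)

# List every path between exposure and outcome and whether it is open,
# which helps when checking the adjustment sets by hand
paths(coffee_dag, from = "C", to = "H")
```

Under these assumed edges, tracing the backdoor paths by hand gives two minimal sets, {A, E, S} and {A, P, S}, which agrees with the output shown above.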

Exercise 2B - Obesity and Mortality

1. Four Mechanisms Leading to Obesity:

  1. Dietary patterns (high calorie, high sugar/fat intake)
  2. Physical inactivity (sedentary lifestyle, lack of exercise)
  3. Genetic predisposition (metabolism genes, appetite regulation)
  4. Medical conditions (hypothyroidism, medications like corticosteroids)

2. Independent Effects on Mortality:

  1. Diet → Mortality: Yes (cardiovascular effects, inflammation, independent of weight)
  2. Physical inactivity → Mortality: Yes (cardiovascular fitness, muscle health, independent of weight)
  3. Genetics → Mortality: Yes (same genes may affect disease susceptibility)
  4. Medical conditions → Mortality: Yes (diseases themselves affect mortality)

3. DAG Showing Obesity as Common Outcome:

Variables: O = Obesity (blue, exposure) | M = Mortality (red, outcome) | D = Diet (green, confounder) | E = Exercise (green, confounder) | G = Genetics (green, confounder) | I = Illness (green, confounder)

Key insight: All pathways to obesity (D, E, G, I) are confounders because they also independently affect mortality. This creates multiple backdoor paths that are difficult to measure and adjust for.

4. Why “Effect of Obesity on Mortality” is Poorly Defined:

  • Problem 1: We don’t know which mechanism(s) led each person to their current BMI
  • Problem 2: Different mechanisms may have different mortality effects (e.g., obesity from overeating vs from medication)
  • Problem 3: Cannot adjust for all mechanisms (especially genetic/physiological ones we can’t measure)
  • Problem 4: Even if measured, adjusting for them leaves only “residual” obesity, not a meaningful causal effect

5. Two Better-Defined Research Questions:

  1. “What is the effect of a Mediterranean diet intervention on 10-year cardiovascular mortality?”
    • Well-defined: specific dietary pattern
    • Implementable in RCT
    • Clear mechanism
  2. “What is the effect of a structured exercise program (150 min/week moderate activity) on all-cause mortality in middle-aged adults?”
    • Well-defined: specific exercise prescription
    • Implementable in RCT
    • Measurable adherence

Exercise 3 - Vitamin D and COVID-19

Task A: Variable Classifications

Variable       | Classification | Reasoning
Age            | C              | Common cause of both vitamin D status and COVID severity
Sex            | C              | Affects vitamin D metabolism and COVID outcomes
Race/ethnicity | C              | Affects vitamin D levels (skin pigmentation) and COVID risk (socioeconomic factors)
BMI            | ?              | Could be confounder OR mediator - depends on causal model
Diabetes       | ?              | Could be confounder (causes low vit D) OR mediator (vit D affects diabetes)
Smoking        | C              | Affects vitamin D levels and COVID severity
Season         | C              | Affects vitamin D (sun exposure); related to COVID severity only indirectly (e.g., seasonal surges), and is a confounder only if such a pathway exists
Kidney disease | M              | Likely mediator - vitamin D may affect kidney function; kidney disease affects COVID

Note: BMI and diabetes are ambiguous - need to draw DAG to clarify roles!

Task B: Comparing Three Analysts

Analyst 1: Stepwise selection

  • Pros: Simple, automated
  • Cons: ❌ Atheoretical; may remove confounders; inflated Type I error; NOT appropriate for causal question
  • Verdict: Inappropriate for causal inference

Analyst 2: Adjust for everything

  • Pros: Thorough
  • Cons: ❌ Likely includes mediators (kidney disease); loses power; may include colliders
  • Verdict: “Kitchen sink” - not appropriate

Analyst 3: DAG-based

  • Pros: ✅ Theory-driven; identifies confounders correctly; transparent
  • Cons: Requires causal knowledge
  • Verdict: BEST approach for causal question

If this were a prediction model:

  • Analyst 1 or 2 might be acceptable IF using cross-validation
  • Analyst 3’s approach wouldn’t be necessary (don’t care about confounding)
  • Would focus on discrimination/calibration instead


Exercise 4 - Change-in-Estimate: Sleep and Diabetes

Using 10% threshold:

Keep these variables:

  • BMI (15.9% change) ✓
  • Shift work (11.7% change) ✓

Could argue for:

  • Depression (9.0% change - close to threshold)

Drop:

  • Age, Sex, Physical activity, Diet quality (all <10%)
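The percentages in the Exercise 4 table are consistent with the change being computed on the OR scale relative to the crude estimate, |crude − adjusted| / crude. A short sketch of that arithmetic in base R, using the ORs copied from the table (some analysts compute the change on the log-OR scale instead, which would give slightly different numbers):

```r
# ORs copied from the "Effect of Adding Variables" table in Exercise 4
crude_or <- 1.45
adjusted_or <- c(
  Age = 1.38, Sex = 1.44, BMI = 1.22, `Physical activity` = 1.36,
  `Diet quality` = 1.41, Depression = 1.32, `Shift work` = 1.28
)

# Percent change in the OR relative to the crude estimate
pct_change <- abs(crude_or - adjusted_or) / crude_or * 100

# Variables whose addition moves the estimate by more than 10%
keep <- names(pct_change)[pct_change > 10]

print(round(pct_change, 1))
print(keep)
```

The rule only flags variables that move the estimate. It cannot say whether a flagged variable is a confounder or a mediator, which is the point of the remaining questions.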

Why BMI causes large change:

BMI is likely a mediator:

  • Sleep duration → BMI → Diabetes
  • Short sleep may cause weight gain
  • Weight gain causes diabetes
  • Adjusting for BMI blocks part of the causal pathway

Question: Should we adjust for BMI?

  • If interested in total effect of sleep: NO
  • If interested in direct effect (not through BMI): Maybe, but need mediation analysis

Is BMI a mediator?

Evidence it’s a mediator:

  • Sleep affects weight
  • Weight affects diabetes
  • On the causal pathway

How to decide:

  • Draw a DAG showing temporal relationships
  • Consider: Does sleep affect BMI? (Yes)
  • Does BMI affect diabetes? (Yes)
  • Therefore: BMI is a mediator, provided it was measured after the sleep exposure (BMI measured beforehand could instead act as a confounder, e.g., through sleep apnea)

Shift work - confounder or mediator?

Could be either:

If shift work → sleep duration and shift work → diabetes:

  • Shift work is a confounder (causes both exposure and outcome)
  • Should adjust

If sleep duration → shift work (unlikely) → diabetes:

  • Shift work is a mediator
  • Should not adjust for total effect

Most likely: Shift work is a confounder - it causes people to sleep less AND independently affects diabetes risk (circadian disruption)

Should you adjust for all variables >10% change?

NO!

  • The 10% rule can identify associations, not necessarily confounders
  • BMI shows large change but is likely a mediator - shouldn’t adjust for total effect
  • Need to use causal reasoning (DAG), not just statistical criteria
  • This is why change-in-estimate is not recommended - it can mislead!

Exercise 5 - Well-Defined Interventions

Question 1: Depression

  • Well-defined? No
  • Problems: Depression is not an intervention; many ways to treat/prevent depression (CBT, SSRIs, exercise, etc.)
  • Better: “Does cognitive behavioral therapy reduce risk of CVD in adults with major depression?”

Question 2: LDL Cholesterol

  • Well-defined? No
  • Problems: Multiple ways to lower LDL (statins, diet, ezetimibe, PCSK9 inhibitors, each may have different effects)
  • Better: “Does statin therapy (vs placebo) reduce MI risk in adults with LDL >130 mg/dL?”

Question 3: Mediterranean Diet

  • Well-defined? Yes!
  • Problems: None - this is well-specified
  • Good because: Specific intervention described, could implement in trial, clear definition

Question 4: CRP

  • Well-defined? No
  • Problems: CRP is a biomarker, not an intervention; no way to specifically target CRP
  • Better: Either (a) study CRP as a predictor of outcomes, or (b) study interventions that affect CRP (e.g., “Does aspirin reduce CVD in adults with elevated CRP?”)

Exercise 6 - Critique Published Studies

Sample critique for a hypothetical paper:

Paper 1: “Association between coffee consumption and Type 2 diabetes”

Citation: [Example]

Research Question: Does coffee consumption reduce diabetes risk?

Model Type: Claims to be causal, but analysis suggests association/prediction hybrid

Variable Selection Method:

  • Started with 50 variables
  • Used stepwise selection (p<0.05 to enter)
  • Final model: 8 variables

Strengths:

  • Large sample size (n=50,000)
  • Long follow-up (20 years)
  • Validated coffee assessment

Weaknesses:

  • ❌ Used stepwise selection for “causal” inference
  • ❌ No DAG presented
  • ❌ Likely adjusted for mediators (BMI, glucose)
  • ❌ No sensitivity analyses
  • ❌ Causal language but inappropriate methods

What I would do differently:

  1. Draw a DAG identifying confounders
  2. Pre-specify adjustment variables based on DAG
  3. Do NOT use stepwise selection
  4. Conduct sensitivity analyses for unmeasured confounding
  5. Be more careful about causal language or clearly state this is exploratory


Exercise 7 - Many Analysts Problem

Expected outcomes from simulation:

Discussion Questions - Sample Answers:

1. Did you get the same answer?

Probably not! Even with the same data, different decisions lead to different results:

  • Different confounders selected
  • Different categorization of continuous variables (e.g., age as continuous vs categories vs splines)
  • Different interaction terms
  • Different model types (logistic vs Cox with different baseline hazards)
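If no class dataset is at hand, the point can be illustrated with simulated data. The sketch below is hypothetical throughout: the data-generating process, the coefficients, and the three specifications are arbitrary choices made here, not the course's simulation. Treatment is assumed to be given more often to sicker patients, so the choice of adjustment set matters.

```r
set.seed(6202)
n <- 20000

# Hypothetical data-generating process; all coefficients are arbitrary choices
age      <- rnorm(n, mean = 60, sd = 10)
severity <- rnorm(n)                               # standardized baseline severity
treat    <- rbinom(n, 1, plogis(0.8 * severity))   # sicker patients treated more often
outcome  <- rbinom(n, 1, plogis(-1 - 0.5 * treat + 1.0 * severity + 0.03 * (age - 60)))
dat <- data.frame(outcome, treat, age, severity)

# Three plausible-looking specifications for the same question
specs <- list(
  unadjusted     = outcome ~ treat,
  severity_only  = outcome ~ treat + severity,
  severity_age60 = outcome ~ treat + severity + I(age >= 60)
)

# Fit each logistic model and collect the treatment OR with a Wald 95% CI
or_table <- t(sapply(specs, function(f) {
  fit <- glm(f, family = binomial, data = dat)
  est <- unname(coef(fit)["treat"])
  se  <- unname(sqrt(diag(vcov(fit))["treat"]))
  exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se))
}))
print(round(or_table, 2))
```

In this setup the unadjusted OR sits well above the severity-adjusted ones, because severity drives both treatment and outcome. The two adjusted specifications differ only in how age is handled.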

2. Which approach seems most defensible?

The one that:

  • Has clear theoretical justification for variable selection
  • Uses DAG to identify confounders
  • Pre-specified the analysis approach
  • Includes appropriate sensitivity analyses
  • Acknowledges limitations

3. How would you resolve disagreements?

  • Examine the DAGs each person drew
  • Discuss which confounders are most important based on subject knowledge
  • Check if results are similar across reasonable specifications (robustness)
  • Consider presenting multiple models with different assumptions
  • Be transparent about the analytic choices made

4. What does this teach us?

  • Many reasonable decisions must be made in any analysis
  • Results depend on these decisions, not just on the data
  • Transparency is essential
  • Pre-specification helps limit researcher degrees of freedom
  • There’s rarely one “right” analysis
  • Uncertainty in results comes from analytic choices, not just sampling variability

Exercise 8 - Your Own Research

This is individualized - no single answer. But here are evaluation criteria:

Good DAG characteristics:

✅ Exposure and outcome clearly identified
✅ All major confounders included
✅ Mediators identified and noted
✅ Colliders identified and avoided
✅ Arrows represent causal relationships, not just associations
✅ Temporal ordering makes sense
✅ Based on subject matter knowledge, not data

Red flags to watch for:

❌ Too many variables (probably missing structure)
❌ No clear confounders identified
❌ Mixing confounders and mediators
❌ Exposure is ill-defined (biomarker, physiological measure)
❌ Arrows based on statistical associations rather than causal beliefs

Getting feedback:

  • Share with advisor/mentor
  • Present to lab group
  • Check against published DAGs in your field
  • Revise based on feedback
  • Remember: DAGs represent beliefs, can be wrong, should be revised

Exercise 9 - Prenatal Vitamin and Neural Tube Defects

Walking Through the Decision Tree:

START: Research Question “Does prenatal vitamin use reduce risk of neural tube defects?”

Is exposure well-defined? Yes - Prenatal vitamins are a specific intervention (can specify dose, timing, formulation)

What’s your goal? Causal effect (does vitamin use CAUSE reduction in NTDs?)

Path: Causal Effect

Step 1: Draw DAG

Variables: V = Prenatal Vitamin Use (blue, exposure) | N = Neural Tube Defects (red, outcome) | A = Maternal Age (green, confounder) | S = SES (green, confounder) | P = Planned Pregnancy (green, confounder) | D = Dietary Folate (green, confounder) | F = Folate Levels (orange, mediator) | H = Homocysteine (orange, mediator) | C = Prenatal Care Visits (purple, collider)

Step 2: Identify confounders

# Find minimal sufficient adjustment set
adjustmentSets(prenatal_dag)
#> { A, D, P, S }

Must adjust for:

  • Maternal age (A): affects vitamin use AND NTD risk
  • SES (S): affects vitamin access AND healthcare/nutrition
  • Planned pregnancy (P): affects vitamin use AND prenatal care
  • Baseline dietary folate intake (D)

Step 3: Adjust for confounders only

  • Do NOT adjust for: folate levels, homocysteine (mediators)
  • Do NOT adjust for: prenatal care visits (potential collider)

Step 4: Check positivity

  • Are there women in all confounder strata who take vitamins?
  • Are there women in all strata who don’t?
  • May have positivity violations in planned pregnancies (almost all take vitamins)
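A simple way to look for positivity problems is to cross-tabulate exposure within every confounder stratum and flag sparse cells. The sketch below uses a simulated, hypothetical cohort in which vitamin use is near-universal among planned pregnancies. The variable names, the probabilities, and the cutoff of 20 are assumptions chosen for illustration.

```r
set.seed(6202)
n <- 5000

# Hypothetical cohort: 99% of planned pregnancies use vitamins, 40% of unplanned
planned <- rbinom(n, 1, 0.5)
ses     <- sample(c("low", "mid", "high"), n, replace = TRUE)
vitamin <- rbinom(n, 1, ifelse(planned == 1, 0.99, 0.40))

# Count women in every planned x SES x vitamin-use cell
cells <- as.data.frame(table(planned = planned, ses = ses, vitamin = vitamin))

# Flag cells with few women (the cutoff of 20 is an arbitrary choice)
sparse <- cells[cells$Freq < 20, ]
print(sparse)
```

Here the sparse cells are the non-users within planned pregnancies, which is the pattern the last bullet above warns about. With real data, restricting or stratifying by planned pregnancy are possible responses.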

Step 5: Sensitivity analyses

  • Vary definitions of exposure (timing, dose)
  • Test different adjustment sets
  • Assess impact of unmeasured confounding (E-value)
  • Stratify by planned vs unplanned pregnancy
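For the E-value item, the standard formula for a risk ratio is RR + sqrt(RR × (RR − 1)), applied to 1/RR when the estimate is protective. A minimal sketch in base R follows. The function name e_value and the example RR of 0.50 are hypothetical, not results from this course. For a rare outcome such as NTDs, an odds ratio can be treated as an approximate risk ratio.

```r
# E-value for a risk ratio: RR + sqrt(RR * (RR - 1)) for RR >= 1.
# Protective estimates (RR < 1) are inverted first.
e_value <- function(rr) {
  stopifnot(all(rr > 0))
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

# Hypothetical adjusted RR of 0.50 for prenatal vitamin use
e_value(0.50)
```

The result is the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully explain away the estimate.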


Exercise 10 - Teaching Exercise

Concept 1: Consistency Condition

Explain to a non-statistician:

“Imagine we want to know if obesity causes early death. The problem is that there are many ways to become obese - overeating, not exercising, certain medications, genetic factors. Each of these might affect your health differently, even if they all lead to the same weight. So when we compare people who are obese to people who aren’t, we’re not comparing one thing - we’re comparing a complex mix of different paths to obesity. That’s why researchers say we need to study specific interventions like ‘Mediterranean diet’ or ‘exercise programs’ rather than obesity itself.”


Concept 2: Confounding vs Mediation

Diagram:

CONFOUNDER:

        Age
       ↙   ↘
  Exercise → Mortality

Age causes both exercise level AND mortality risk (older people exercise less AND have higher mortality). Must adjust.

MEDIATOR:

  Exercise → Weight Loss → Mortality

Exercise causes weight loss, which causes lower mortality. Weight loss is ON the pathway. Adjusting blocks the effect we want to measure.
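The two diagrams can be checked with a small simulation. This is a sketch under stated assumptions: the data are simulated, the coefficients are arbitrary, and a continuous risk score stands in for mortality so that linear regression coefficients can be compared directly.

```r
set.seed(6202)
n <- 50000

# Confounder: Age -> Exercise and Age -> risk; exercise has NO true effect here
age        <- rnorm(n)
exercise_c <- -0.8 * age + rnorm(n)
risk_c     <-  1.0 * age + rnorm(n)

crude_c    <- coef(lm(risk_c ~ exercise_c))["exercise_c"]        # biased away from 0
adjusted_c <- coef(lm(risk_c ~ exercise_c + age))["exercise_c"]  # close to the true 0

# Mediator: Exercise -> Weight loss -> risk, plus a direct effect of exercise
exercise_m  <- rnorm(n)
weight_loss <- 0.7 * exercise_m + rnorm(n)
risk_m      <- -0.3 * exercise_m - 0.5 * weight_loss + rnorm(n)

total_m  <- coef(lm(risk_m ~ exercise_m))["exercise_m"]                # about -0.3 + 0.7 * (-0.5) = -0.65
direct_m <- coef(lm(risk_m ~ exercise_m + weight_loss))["exercise_m"]  # about -0.3 (direct effect only)

round(c(crude_c, adjusted_c, total_m, direct_m), 2)
```

Adjusting for the confounder removes a spurious association. Adjusting for the mediator removes part of a real one.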


Concept 3: Collider Bias

Real-world example:

“Imagine you’re studying whether exercise affects heart disease mortality. You decide to adjust for ‘being hospitalized’ thinking it’s a confounder. But hospitalization is actually caused by BOTH exercise (less exercise → more hospitalizations) AND by underlying severe disease (which also causes death).

When you condition on hospitalization (look only at hospitalized people), you create a spurious association between exercise and underlying disease severity. Among hospitalized patients, those who exercise must be sicker (otherwise why are they hospitalized despite exercising?). This makes exercise look harmful when it’s actually protective!

This is called collider bias - adjusting for a common effect opens a non-causal path between its causes that was blocked before.”
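The hospitalization example can be reproduced with simulated data. In this sketch, exercise and disease severity are generated independently, and hospitalization depends on both. The threshold and the noise term are arbitrary choices made here.

```r
set.seed(6202)
n <- 100000

exercise <- rnorm(n)   # standardized exercise level
severity <- rnorm(n)   # underlying disease severity, independent of exercise

# Hospitalization is a common effect: less exercise and more severe disease both raise it
hospitalized <- (-exercise + severity + rnorm(n)) > 1

cor_all  <- cor(exercise, severity)                              # near 0 by construction
cor_hosp <- cor(exercise[hospitalized], severity[hospitalized])  # clearly positive

round(c(all = cor_all, hospitalized = cor_hosp), 2)
```

Among hospitalized patients, those who exercise more tend to be sicker, even though the two variables are unrelated in the full population.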


Concept 4: Prediction vs Causation

Explain to a clinician:

“Your predictive model is excellent at identifying which patients will be readmitted - it’s like a weather forecast that accurately predicts rain. But just like a weather forecast doesn’t tell you HOW to prevent rain, your model doesn’t tell you what CAUSES readmission.

For example, your model might include ‘number of prior hospitalizations’ as a strong predictor. But we can’t prevent readmissions by changing the number on someone’s medical chart! The number of prior hospitalizations is a marker of underlying illness, not a cause of future readmissions.

To know what causes readmissions, we’d need a different study design that identifies and adjusts for confounders, excludes mediators, and avoids colliders - which would likely give us a different (and possibly less accurate) prediction model. That’s okay - they serve different purposes!”


Additional Practice Problems

Problem 1: Alcohol and Liver Disease

Scenario: You’re studying the effect of alcohol consumption on liver cirrhosis.

Available data:

  • Alcohol intake (drinks/week)
  • Liver cirrhosis diagnosis
  • Age, sex, BMI
  • Hepatitis C infection
  • Coffee consumption
  • Education
  • Diabetes

Tasks:

  1. Draw a DAG
  2. Identify confounders
  3. Should you adjust for hepatitis C? Why or why not?
  4. Should you adjust for coffee? Why or why not?
  5. Is “alcohol consumption” well-defined enough for causal inference?

Answers:

1. Draw a DAG:

Variables: A = Alcohol (blue, exposure) | C = Cirrhosis (red, outcome) | G = Age (green, confounder) | E = Education (green, confounder) | B = BMI (green, confounder) | H = Hepatitis C (gray, NOT a confounder) | K = Coffee (orange, mediator - may protect liver)

2. Confounders: Age (G), Education (E), potentially BMI (B)

3. Hepatitis C (H):

  • If hepatitis causes alcohol use: Confounder, adjust
  • If hepatitis is unrelated to alcohol use: NOT a confounder, but may want to stratify
  • Most likely: NOT a confounder (hepatitis doesn’t cause drinking)
  • Do NOT adjust unless you have evidence it affects alcohol consumption

4. Coffee (K):

  • Coffee is associated with alcohol (social drinking)
  • Coffee → Cirrhosis pathway exists (protective effect)
  • Alcohol → Coffee pathway possible (both are beverages)
  • This makes coffee a MEDIATOR (on pathway from alcohol to cirrhosis)
  • Do NOT adjust if interested in total effect of alcohol
  • Adjusting would block protective pathway through coffee consumption

5. Well-defined?

  • Better than “obesity” but still some issues
  • Type of alcohol matters (wine vs spirits)
  • Pattern matters (daily vs binge)
  • Better question: “Does reducing alcohol intake from 4+ drinks/day to <1 drink/day reduce cirrhosis risk?”


Problem 2: Statins and Dementia

Scenario: Observational study finds statin users have lower dementia rates.

Possible confounders:

  • Age, sex, education
  • Cardiovascular disease
  • Cholesterol levels
  • Healthcare utilization
  • SES

Questions:

  1. Should you adjust for cholesterol levels? Why or why not?
  2. Should you adjust for cardiovascular disease?
  3. What’s the target trial?

Answers:

DAG showing confounding by indication:

Variables: S = Statin Use (blue, exposure) | D = Dementia (red, outcome) | A = Age (green, confounder) | E = Education/SES (green, confounder) | U = Healthcare Utilization (green, confounder) | L = Cholesterol Levels (orange-red, INDICATION - do NOT adjust!) | V = CVD (orange-red, INDICATION - do NOT adjust!)

1. Cholesterol levels (L):

  • ❌ Do NOT treat this as an ordinary covariate: it is an indication for treatment
  • People with high cholesterol get statins, so the reason for prescribing is tied to the outcome; this is the source of confounding by indication
  • Adjusting for a single measured level rarely removes it, and levels measured after statin initiation lie on the causal pathway
  • Instead: Use methods like instrumental variables or restriction

2. Cardiovascular disease (V):

  • Similar to cholesterol - it’s an indication for statins
  • ❌ Do not rely on simple adjustment for the indication
  • Handle it by design, e.g., by restricting to adults without CVD as in the target trial

3. Target trial:

  • Population: Adults 60-75 without dementia or CVD
  • Intervention: Statin therapy (specify dose/type)
  • Comparison: Placebo
  • Outcome: Incident dementia over 10 years
  • Assignment: Random
  • This trial would answer the causal question!

Answer Key Summary

Complete Solutions Provided For:

  • ✅ Exercise 1: Identifying Model Types (3 scenarios)
  • ✅ Exercise 2A: Coffee and Heart Disease DAG with dagitty visualization
  • ✅ Exercise 2B: Obesity and Mortality (5 tasks) with DAG visualization
  • ✅ Exercise 3: Vitamin D and COVID-19 (variable classification & analyst comparison)
  • ✅ Exercise 4: Change-in-Estimate - Sleep and Diabetes (5 questions)
  • ✅ Exercise 5: Well-Defined Interventions (4 questions)
  • ✅ Exercise 6: Critique Published Studies (sample critique provided)
  • ✅ Exercise 7: Many Analysts Problem (4 discussion questions)
  • ✅ Exercise 8: Building DAG for Your Research (evaluation criteria)
  • ✅ Exercise 9: Prenatal Vitamins Decision Tree with DAG visualization
  • ✅ Exercise 10: Teaching Exercise (all 4 concepts explained)

Bonus Problems with DAG Visualizations:

  • ✅ Problem 1: Alcohol and Liver Disease with DAG showing confounders vs non-confounders
  • ✅ Problem 2: Statins and Dementia with DAG showing confounding by indication

All DAGs use:

  • Color-coded nodes (blue=exposure, red=outcome, green=confounders, orange=mediators, purple=colliders)
  • Single-letter labels overlaid on nodes
  • dagitty code that students can modify and verify with adjustmentSets()


Instructor Notes

How to Use These Exercises

In-Class Activities (75 min class)

Recommended flow:

  • 10 min: Exercise 1 (individual → pair-share)
  • 20 min: Exercise 2A (groups of 3, draw DAGs)
  • 15 min: Exercise 4 (individual → discussion)
  • 15 min: Exercise 7 (group simulation if data available)
  • 15 min: Debrief and Q&A

Homework Assignments

  • Option 1: Exercises 2B, 3, 5, 6 (comprehensive homework)
  • Option 2: Exercise 6 + Exercise 8 (apply to own research)
  • Option 3: Exercise 10 (teaching exercise for deeper understanding)

Small Group Discussion

Exercises 3, 4, and 7 work particularly well in small groups where students can defend different analytic choices.

Assessment Ideas

  • Quiz: Exercise 1 variations (identifying model types)
  • Short paper: Exercise 6 (critique 2 published papers)
  • Presentation: Exercise 8 (present DAG for own research)