Lab Activity 1 Answer Key

This document provides complete solutions for all three guided practice tasks. Students should complete their own work first, then use this key to check their understanding.

Setup: Load Data and Packages

# Load required packages
library(tidyverse)
library(NHANES)
library(knitr)
library(kableExtra)

# Load the NHANES data
data(NHANES)

# Create analysis dataset (same as in lab)
nhanes_analysis <- NHANES %>%
  select(
    ID,
    Gender,
    Age,
    Race1,
    Education,
    BMI,
    Pulse,
    BPSys1,
    BPDia1,
    PhysActive,
    SmokeNow,
    Diabetes,
    HealthGen
  ) %>%
  mutate(
    Hypertension = ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"),
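    # Intervals are right-closed: (0, 20], (20, 35], ...; include.lowest keeps Age 0
    Age_Group = cut(Age,
                    breaks = c(0, 20, 35, 50, 65, 100),
                    labels = c("0-20", "21-35", "36-50", "51-65", "66+"),
                    include.lowest = TRUE)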
  )
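The hypertension flag above depends on how R's `|` operator treats missing values, which determines who ends up as `NA`. A minimal sketch with made-up readings (the vectors `sys` and `dia` are illustrative, not NHANES data):

```r
# Toy blood pressure readings (hypothetical values, not from NHANES)
sys <- c(150, 120, NA, NA, 120)
dia <- c(80, 70, 95, 70, NA)

# Same rule as the lab: hypertensive if either reading crosses its threshold
flag <- ifelse(sys >= 140 | dia >= 90, "Yes", "No")

# TRUE | NA is TRUE, but FALSE | NA is NA. A person is therefore "Yes" when
# the one available reading is high, and NA when the available reading is
# normal and the other reading is missing.
print(data.frame(sys, dia, flag))
```

Under this rule, a participant with one normal reading and one missing reading is excluded from the prevalence denominator rather than counted as "No".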

Task 1: Explore Health Disparities by Education

Research Question

“How does hypertension prevalence vary by education level?”

Solution Code

# Group by education level and calculate key statistics
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

print(health_by_education)

## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72

Interpretation Notes

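Key findings:
- Sample sizes vary widely across education groups (451 for 8th Grade up to 2,267 for Some College).
- Mean systolic BP rises as education decreases, from about 119 mmHg among college graduates to about 128 mmHg in the 8th Grade group.
- Hypertension prevalence is highest in the 8th Grade group (28.3%) and lowest among college graduates (13.1%); the middle categories (16.6% to 18.9%) do not fall strictly in order.

What to look for:
- Higher hypertension rates in lower education groups, consistent with education acting as a social determinant of health.
- This is a classic example of health inequality.
- Missing education data should be noted and reported: 2,779 participants have no recorded education. The NHANES package documentation describes Education as reported only for participants aged 20 or older, so this group is largely children and adolescents, which would explain its low mean systolic BP (106) and hypertension prevalence (0.72%).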
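One way to note and report the missing education data is to tabulate missingness against age. The sketch below uses a small made-up tibble (`toy`, with invented values) in place of `nhanes_analysis`; the same pipeline should carry over to the lab dataset, though the column names `Under_20`, `N_Missing`, and `Pct_Missing` are only suggestions.

```r
library(dplyr)

# Hypothetical stand-in for nhanes_analysis (values invented for illustration)
toy <- tibble(
  Age = c(8, 15, 19, 25, 40, 62, 71),
  Education = c(NA, NA, NA, "High School", "College Grad", NA, "8th Grade")
)

# Count and percentage of missing Education, split by under 20 vs. 20 and over
missing_by_age <- toy %>%
  mutate(Under_20 = Age < 20) %>%
  group_by(Under_20) %>%
  summarise(
    N = n(),
    N_Missing = sum(is.na(Education)),
    Pct_Missing = round(mean(is.na(Education)) * 100, 1),
    .groups = "drop"
  )

print(missing_by_age)
```

If nearly all missing values fall in the under-20 rows, the missingness is structural (the question was not asked) rather than nonresponse, and that distinction is worth stating in the write-up.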

Task 2: Create a Visualization

Solution Code

# Create bar chart
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar chart of hypertension prevalence by education level

Key Code Elements Explained

Code Element	Purpose
`filter(!is.na(Education))`	Removes missing education data
`geom_col()`	Creates bar chart
`geom_text()`	Adds percentage labels on bars
`vjust = -0.5`	Positions labels above bars
`ylim(0, 50)`	Fixes the y-axis range at 0 to 50, leaving headroom for the labels
`angle = 45`	Rotates x-axis labels for readability

Task 3: Data Interpretation

Research Question

“What does this pattern tell us about health disparities and social determinants?”

This pattern shows an education-health gradient, a classic example of how social determinants of health shape outcomes like hypertension.

What the pattern tells us

1. Lower education is associated with higher hypertension prevalence. People with the least formal education (the 8th Grade group) show the highest rate of hypertension (28.3%), while college graduates show the lowest (13.1%). This suggests that education is not just academic; it appears protective for health.

2. Education acts as a proxy for multiple social determinants. Education level is closely linked to:
- Income and job stability: ability to afford healthy food, housing, and healthcare
- Health literacy: understanding nutrition, medications, blood pressure screening, and risk factors
- Access to healthcare: insurance coverage, preventive care, regular checkups
- Chronic stress exposure: financial strain, job insecurity, unsafe neighborhoods
These factors accumulate over time and increase cardiovascular risk.

3. The gradient (not just the extremes) matters. Hypertension does not drop only at college graduation; prevalence trends downward across education levels. The decline is not perfectly stepwise (High School, at 18.9%, sits slightly above 9 - 11th Grade, at 17.3%), but the overall trend suggests disparities are systemic, not driven by individual choice alone. These are unadjusted estimates, so part of the gradient may reflect age differences between education groups.

4. It reflects structural, not biological, differences. There is no biological reason education itself changes blood pressure. The pattern points to structural inequities in:
- Economic opportunity
- Access to preventive care
- Environmental and occupational exposures
- Long-term stress and allostatic load

Sample Interpretation

Here’s a strong answer (2-3 sentences):

The data reveal a clear social gradient in hypertension prevalence across education levels, with lower-educated groups showing substantially higher hypertension rates than college-educated groups. This pattern reflects the broader concept of social determinants of health—factors like income, access to preventive care, stress, and health literacy that are closely linked to education. From a public health perspective, this disparity suggests that cardiovascular disease prevention programs should be targeted to reach underserved populations and that addressing education and socioeconomic inequality may be critical for reducing hypertension burden in the population.

Grading Rubric for Task 3

Criteria	Excellent (Full Credit)	Adequate	Needs Work
Identifies pattern	Explicitly states which groups have highest/lowest rates	Mentions direction but lacks specificity	Vague or incorrect about pattern
Explains mechanism	References social determinants, access, or health literacy	Mentions inequality but lacks detail	No explanation provided
Public health relevance	Discusses implications for policy or programs	Notes importance but general	Missing public health connection
Writing quality	Clear, 2-3 well-written sentences	Adequate but could be clearer	Incomplete or unclear

Additional Results and Extensions

Stratified by Gender AND Education

# How do gender differences in hypertension vary by education?
health_strat <- nhanes_analysis %>%
  group_by(Education, Gender) %>%
  summarise(
    N = n(),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  ) %>%
  filter(!is.na(Education))

print(health_strat)

## # A tibble: 10 × 4
##    Education      Gender     N Pct_Hypertension
##    <fct>          <fct>  <int>            <dbl>
##  1 8th Grade      female   209            22.2 
##  2 8th Grade      male     242            33.3 
##  3 9 - 11th Grade female   402            16.1 
##  4 9 - 11th Grade male     486            18.3 
##  5 High School    female   770            20.5 
##  6 High School    male     747            17.3 
##  7 Some College   female  1197            17.5 
##  8 Some College   male    1070            15.8 
##  9 College Grad   female  1099             9.85
## 10 College Grad   male     999            16.5

Visualization: Education AND Gender

health_strat %>%
  ggplot(aes(x = Education, y = Pct_Hypertension, fill = Gender)) +
  geom_col(position = "dodge", alpha = 0.8) +
  labs(
    title = "Hypertension Prevalence by Education and Gender",
    x = "Education Level",
    y = "Prevalence (%)",
    fill = "Gender",
    caption = "Source: NHANES"
  ) +
  ylim(0, 60) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar chart showing hypertension by education and gender

Common Student Mistakes and Corrections

Mistake 1: Forgetting `na.rm = TRUE`

❌ Incorrect:

Pct_Hypertension = sum(Hypertension == "Yes") / sum(!is.na(Hypertension)) * 100

✅ Correct:

Pct_Hypertension = sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100

Why it matters: Without na.rm = TRUE, if there are any NA values in the Hypertension variable, the sum will return NA instead of a number.
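The effect of `na.rm = TRUE` is easy to see on a small vector. This is a minimal sketch using a made-up vector `hyp` (not NHANES data); it also shows an equivalent shortcut using `mean()`, which is offered as an alternative rather than the lab's required approach.

```r
# Toy hypertension flags with one missing value (hypothetical data)
hyp <- c("Yes", "No", NA, "Yes", "No")

# Without na.rm, the NA propagates through sum()
sum(hyp == "Yes")                      # NA

# With na.rm = TRUE, count "Yes" and divide by the non-missing observations
pct <- sum(hyp == "Yes", na.rm = TRUE) / sum(!is.na(hyp)) * 100

# Equivalent shortcut: the mean of a logical vector is a proportion
pct_alt <- mean(hyp == "Yes", na.rm = TRUE) * 100

print(c(pct, pct_alt))                 # both 50
```

Both versions use the non-missing observations as the denominator, so they agree with the `Pct_Hypertension` formula in the solution code.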

Mistake 2: Not Filtering Missing Categories

❌ Incorrect:

ggplot(health_by_education, aes(x = Education, y = Pct_Hypertension)) +
  geom_col()

✅ Correct:

health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col()

Why it matters: Including NA as a category adds an extra bar for participants with no recorded education (0.72% here). It is not a real education level, and it makes the chart harder to read.

Mistake 3: Forgetting to Round Percentages

❌ Incorrect:

Pct_Hypertension = sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100

✅ Correct:

Pct_Hypertension = round(
  sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)

Why it matters: Rounding to 2 decimal places makes tables more readable and professional.

Mistake 4: Not Using `.groups = 'drop'` in Grouped Summarise

❌ Incorrect:

health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(N = n())

✅ Correct:

health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(N = n(), .groups = 'drop')

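Why it matters: By default, summarise() drops only the last grouping variable. With a single grouping variable, as here, the result is already ungrouped, so .groups = 'drop' simply makes that explicit. With two or more grouping variables (e.g., group_by(Education, Gender)), the result stays grouped by the remaining variables and dplyr prints a message; leftover grouping can cause unexpected behavior in downstream operations.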
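The grouping behavior of `summarise()` can be checked directly with `group_vars()`. The sketch below uses a small made-up tibble (`toy`, not NHANES data) to compare one and two grouping variables.

```r
library(dplyr)

# Hypothetical toy data (not NHANES)
toy <- tibble(
  Education = c("HS", "HS", "College", "College"),
  Gender = c("female", "male", "female", "male")
)

# One grouping variable: summarise() drops it, so the result is ungrouped
one_var <- toy %>% group_by(Education) %>% summarise(N = n())
group_vars(one_var)   # character(0)

# Two grouping variables: only Gender is dropped, Education remains
two_var <- toy %>% group_by(Education, Gender) %>% summarise(N = n())
group_vars(two_var)   # "Education"

# .groups = "drop" removes all grouping, however many variables there were
dropped <- toy %>%
  group_by(Education, Gender) %>%
  summarise(N = n(), .groups = "drop")
group_vars(dropped)   # character(0)
```

Setting `.groups` explicitly also silences the informational message that dplyr prints in the two-variable case.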

Assessment Rubric: Overall Lab Performance

Scoring Guide (100 points total)

Task 1: Code (25 points)

✓ Correct group_by() (5 pts)
✓ Calculates N correctly (5 pts)
✓ Calculates mean systolic BP correctly (5 pts)
✓ Calculates hypertension percentage correctly (10 pts)

Task 2: Visualization (25 points)

✓ Filters missing values (5 pts)
✓ Correct plot type and aesthetics (10 pts)
✓ Proper labels and formatting (5 pts)
✓ Readable axis labels (5 pts)

Task 3: Interpretation (25 points)

✓ Identifies specific pattern in data (8 pts)
✓ Explains mechanism/social determinants (8 pts)
✓ Connects to public health implications (9 pts)

Overall Code Quality (25 points)

✓ Comments explain code (5 pts)
✓ Code runs without errors (10 pts)
✓ Output is properly formatted (5 pts)
✓ Submitted as HTML file (5 pts)

Discussion Questions for Instructors

Why did we focus on education as a determinant? (Answer: Education is a key social determinant strongly linked to health outcomes; it’s also reliably measured in surveys)
What other stratifications would be informative? (Answer: Income, occupation, healthcare access, geographic region, time trends)
How would you explain the education gradient to a public health administrator? (Answer: Emphasize actionable policy implications—targeted interventions for low-education groups)
What is the causal pathway between education and hypertension? (Answer: Discuss mechanisms: health literacy, income, access to care, stress, health behaviors)
Are there potential confounders we haven’t considered? (Answer: Age, race/ethnicity, gender—note how students might conduct stratified analyses)

Additional Resources for Instructors

For follow-up: Students can explore multivariate logistic regression to model hypertension risk adjusted for multiple factors
Extension activity: Have students compare their findings to published literature on education and hypertension
Real-world connection: Link to Healthy People 2030 objectives on health equity and social determinants
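For the logistic regression follow-up mentioned above, a possible starting point is sketched here. It runs on simulated data so that it is self-contained: the data frame `sim`, its three-level `Education` variable, and the effect sizes are all invented for illustration and are not NHANES results. With the lab data, students would substitute `nhanes_analysis` and a 0/1 version of `Hypertension`.

```r
# Simulated data for illustration only (not NHANES; effect sizes are invented)
set.seed(553)
n <- 500
sim <- data.frame(
  Age = round(runif(n, 20, 80)),
  Gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  Education = factor(
    sample(c("High School or less", "Some College", "College Grad"),
           n, replace = TRUE),
    # College Grad listed first so it serves as the reference category
    levels = c("College Grad", "Some College", "High School or less")
  )
)

# Generate a binary outcome whose log-odds rise with age and lower education
log_odds <- -4 + 0.05 * sim$Age + 0.3 * (sim$Education == "High School or less")
sim$Hypertension <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: education effect adjusted for age and gender
fit <- glm(Hypertension ~ Education + Age + Gender,
           data = sim, family = binomial)

# Adjusted odds ratios (exponentiated coefficients)
print(round(exp(coef(fit)), 2))
```

Comparing the education odds ratios from this adjusted model with those from a model containing only `Education` gives students a concrete view of how much of the crude gradient is explained by age and gender.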

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 knitr_1.51       NHANES_2.1.0     lubridate_1.9.4  forcats_1.0.1   
##  [6] stringr_1.6.0    dplyr_1.1.4      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [11] tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1   xml2_1.5.2        
##  [6] jquerylib_0.1.4    textshaping_1.0.4  systemfonts_1.3.1  scales_1.4.0       yaml_2.3.12       
## [11] fastmap_1.2.0      R6_2.6.1           labeling_0.4.3     generics_0.1.4     svglite_2.2.2     
## [16] bslib_0.10.0       pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7       
## [21] utf8_1.2.6         stringi_1.8.7      cachem_1.1.0       xfun_0.56          sass_0.4.10       
## [26] S7_0.2.1           viridisLite_0.4.2  timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [31] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0  hms_1.1.4         
## [36] lifecycle_1.0.5    vctrs_0.7.1        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
## [41] rmarkdown_2.30     tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.9

Answer Key Last Updated: January 29, 2026

EPI 553 - Lab Activity 1: Answer Key

EMMANUEL NANA ARKO

January 29, 2026
