Welcome to Your First Lab Activity

Time: ~30 minutes
Goal: Learn to work with real public health survey data in R
Learning Objectives:

Load and explore a nationally representative health survey dataset
Use tidyverse functions to summarize and group data
Create a professional summary table for epidemiological questions
Practice the complete data exploration workflow
Develop skills for identifying health disparities

Part 1: Setting Up Your Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Troubleshooting: If library() reports that a package is not installed, install it once with install.packages() and then rerun the library() calls.
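The install command itself does not appear in this handout. As a minimal sketch, assuming the four packages loaded above are the only ones needed and that a CRAN mirror is reachable, a one-time install could look like this:

```r
# Assumed package list, taken from the library() calls above
required_pkgs <- c("tidyverse", "NHANES", "knitr", "kableExtra")

# Keep only the packages that are not yet installed on this machine
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

# Install them from CRAN (run once in the console, not on every knit)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Checking for missing packages first avoids reinstalling everything each time the script runs.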

Your Turn: Guided Practice

🎯 Task 1: Explore Health Disparities by Education (15 minutes)

Using the nhanes_analysis data, explore:

“How does hypertension prevalence vary by education level?”

Write code to:

Group by education level
Calculate sample size, mean systolic BP, and percent with hypertension
Print the results

# Summarize sample size, mean systolic BP, and hypertension prevalence by education level
health_by_education <- nhanes_analysis %>%
  filter(!is.na(Education)) %>%   # drop respondents with missing education before grouping
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    # Percent with hypertension among respondents with a known hypertension status
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)

## # A tibble: 5 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.             28.3
## 2 9 - 11th Grade   888       124.             17.3
## 3 High School     1517       124.             18.9
## 4 Some College    2267       122.             16.6
## 5 College Grad    2098       119.             13.1
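The Pct_Hypertension calculation divides by the number of respondents with a known hypertension status, not by the group size N. A small sketch on a made-up vector (the values are hypothetical, not NHANES data) shows why the choice of denominator matters when some values are missing:

```r
# Toy vector standing in for a Hypertension column (hypothetical values)
hypertension <- c("Yes", "No", NA, "Yes", "No")

# Numerator: respondents classified "Yes"; NA comparisons are dropped
n_yes <- sum(hypertension == "Yes", na.rm = TRUE)

# Denominator used in the lab code: respondents with a non-missing status
n_known <- sum(!is.na(hypertension))

# Prevalence among respondents with known status (the lab's approach)
pct_known <- n_yes / n_known * 100

# Prevalence if missing values were left in the denominator (biased downward)
pct_all <- n_yes / length(hypertension) * 100

# Equivalent shortcut: the mean of a logical vector is a proportion
pct_mean <- mean(hypertension == "Yes", na.rm = TRUE) * 100

print(c(known = pct_known, all = pct_all, shortcut = pct_mean))
```

Here two of the four respondents with a known status are hypertensive, so the first and third values agree, while keeping the missing respondent in the denominator gives a lower figure.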

🎯 Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Bar chart of hypertension prevalence by education level
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Figure: bar chart of hypertension prevalence (%) by education level, R output]

🎯 Task 3: Write a Data Interpretation (5 minutes)

Write 2-3 sentences:

“What does this pattern tell us about health disparities and social determinants?”

Consider:

Which education groups have highest/lowest hypertension?
What might explain these differences?
Why does this matter for public health?

Respondents with the lowest education level (8th grade) have the highest prevalence of hypertension (28.3%), whereas the most educated (college graduates) have the lowest (13.1%). The relationship is inverse: higher education levels generally correspond to lower hypertension prevalence, although the trend is not strictly monotonic (high school graduates, at 18.9%, are slightly above the 9 - 11th grade group, at 17.3%). One possible explanation is that people with more education tend to have more health knowledge, which may protect against hypertension through healthier behaviors related to diet, smoking, alcohol use, and physical activity. This pattern underscores the importance of public health campaigns that educate the public about behaviors that reduce the risk of hypertension.
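To put a number on the disparity, one option is to compare each group with college graduates using a prevalence difference and a prevalence ratio. The sketch below retypes the percentages from the Task 1 output by hand. The choice of college graduates as the reference group and the column names Prev_Diff and Prev_Ratio are assumptions for illustration, and these are crude, unweighted comparisons.

```r
# Prevalence values copied from the Task 1 summary table
prev <- data.frame(
  Education = c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad"),
  Pct_Hypertension = c(28.3, 17.3, 18.9, 16.6, 13.1)
)

# Reference group (assumed): college graduates
ref <- prev$Pct_Hypertension[prev$Education == "College Grad"]

# Absolute gap in percentage points and relative gap as a ratio
prev$Prev_Diff  <- round(prev$Pct_Hypertension - ref, 1)
prev$Prev_Ratio <- round(prev$Pct_Hypertension / ref, 2)

print(prev)
```

With these values, the 8th grade group is 15.2 percentage points above college graduates, a prevalence a little over twice as high.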

EPI 553 - R Lab Activity 1: Exploring Health Survey Data with NHANES

Matthew Goldman

January 27, 2026