Lab Activity 1

Time: ~30 minutes
Goal: Learn to work with real public health survey data in R
Learning Objectives:

Load and explore a nationally representative health survey dataset
Use tidyverse functions to summarize and group data
Create a professional summary table for epidemiological questions
Practice the complete data exploration workflow
Develop skills for identifying health disparities

Context: The NHANES Dataset

National Health and Nutrition Examination Survey (NHANES)

NHANES, conducted by the CDC’s National Center for Health Statistics, is widely regarded as the gold standard for population-based health and nutrition data in the United States. It combines:

Interviews - Health history, demographics, behaviors
Physical examinations - Blood pressure, BMI, clinical measurements
Laboratory tests - Blood work, biomarkers

Real-world use: NHANES data informs Healthy People objectives, food and nutrition guidelines, and health disparities research.

Today’s task: Explore NHANES data on cardiovascular health, physical activity, and demographic disparities—key epidemiological outcomes.

Part 1: Setting Up Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Part 2: Loading the NHANES Data

Load the NHANES Dataset

# Load the NHANES data
data(NHANES)
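Before selecting variables, it can help to look at what was actually loaded. The following is a minimal sketch, assuming the NHANES package is installed; the columns inspected here are just a convenient subset, not a required set.

```r
library(NHANES)   # provides the NHANES data frame
data(NHANES)

# Number of rows (participants) and columns (variables)
dim(NHANES)

# Variable types for a few columns used later in the lab
str(NHANES[, c("Age", "Education", "BPSys1")])

# Distribution of systolic BP, including the count of missing values
summary(NHANES$BPSys1)

# First few rows of selected columns
head(NHANES[, c("ID", "Gender", "Age", "Education")])
```

Checking the NA counts at this stage makes later results, such as a large missing-education group, less surprising.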

Part 3: Data Preparation

Create Analysis Dataset

# Select key variables for analysis
nhanes_analysis <- NHANES %>%
  dplyr::select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = factor(ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"))
  )
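The hypertension indicator depends on how R's `|` operator treats missing values. The sketch below uses hypothetical toy vectors (`sys` and `dia` are not part of the lab data) to show which combinations come out as "Yes", "No", or NA.

```r
# Hypothetical readings: systolic and diastolic BP for five people
sys <- c(150, 120, NA, NA, 120)
dia <- c(80,  70,  95, 70, NA)

# Same rule as the lab: systolic >= 140 OR diastolic >= 90
htn <- ifelse(sys >= 140 | dia >= 90, "Yes", "No")

# NA | TRUE is TRUE, but NA | FALSE is NA, so a person is only
# classified "No" when both readings are present and below threshold
data.frame(sys, dia, htn)
```

In other words, one elevated reading is enough for "Yes" even if the other is missing, while a single normal reading with the other missing stays NA.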

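# Create age groups for analysis
# mutate() adds new columns (or overwrites existing ones)
nhanes_analysis <- nhanes_analysis %>%
  mutate(
    # cut() intervals are right-closed: [0,20], (20,35], (35,50], (50,65], (65,100]
    Age_Group = cut(Age,
                    breaks = c(0, 20, 35, 50, 65, 100),
                    labels = c("0-20", "21-35", "36-50", "51-65", "66+"),
                    include.lowest = TRUE)   # keep age 0 in the first bin
  )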
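When binning a continuous variable with cut(), it is worth checking where boundary values land. This is a small sketch with hypothetical ages (not taken from the dataset) that uses the same break points as the lab.

```r
# Hypothetical ages chosen to sit on or near the bin boundaries
ages <- c(0, 5, 20, 21, 65, 66, 80)
breaks <- c(0, 20, 35, 50, 65, 100)

# Default: intervals are (lower, upper], so the lowest break itself is excluded
default_bins <- cut(ages, breaks = breaks)

# include.lowest = TRUE closes the first interval on the left: [0, 20]
closed_bins <- cut(ages, breaks = breaks, include.lowest = TRUE)

data.frame(ages, default_bins, closed_bins)
```

With the default settings, age 0 becomes NA, age 20 falls in the first bin, and age 65 falls in the fourth bin rather than the last, so the labels should describe the bins accordingly.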

Task 1: Explore Health Disparities by Education (15 minutes)

“How does hypertension prevalence vary by education level?”

Write code to:

Group by education level
Calculate sample size, mean systolic BP, and percent with hypertension
Print the results
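The steps above can be tried first on a tiny data set where the answers are easy to verify by hand. This is a sketch under stated assumptions: `toy` and its values are hypothetical, and `N_measured` is an extra column added here to make the prevalence denominator explicit.

```r
library(dplyr)

# Hypothetical toy data: six people, two education levels, one missing BP
toy <- tibble(
  Education    = c("High School", "High School", "High School",
                   "College Grad", "College Grad", "College Grad"),
  BPSys1       = c(150, 120, NA, 118, 142, 110),
  Hypertension = c("Yes", "No", NA, "No", "Yes", "No")
)

toy_summary <- toy %>%
  group_by(Education) %>%
  summarise(
    N = n(),                                  # all rows, including missing BP
    N_measured = sum(!is.na(Hypertension)),   # denominator for prevalence
    Mean_SysBP = mean(BPSys1, na.rm = TRUE),
    # The mean of a logical vector is the proportion of TRUE values
    Pct_Hypertension = mean(Hypertension == "Yes", na.rm = TRUE) * 100
  )

print(toy_summary)
```

Here the High School group has N = 3 but only 2 measured participants, so its prevalence is 1 of 2 (50%), not 1 of 3. The same distinction applies to the full data whenever blood pressure is missing.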

# Your code here:
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)

## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72
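For a knitted report, the printed tibble can be turned into a formatted table with knitr::kable(). The sketch below is one possible approach, not the required one; to stay self-contained it retypes the values from the printed output above into a hypothetical data frame named `edu_table` and omits the missing-education row.

```r
library(knitr)

# Values copied from the printed summary (missing-education row omitted)
edu_table <- data.frame(
  Education = c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad"),
  N = c(451, 888, 1517, 2267, 2098),
  Pct_Hypertension = c(28.3, 17.3, 18.9, 16.6, 13.1)
)

# Build the table with readable column names and a caption
edu_kable <- kable(
  edu_table,
  col.names = c("Education", "N", "Hypertension (%)"),
  digits = 1,
  caption = "Hypertension prevalence by education level (NHANES)"
)

edu_kable
```

In the lab itself, `health_by_education` could be passed to kable() directly, and when knitting to HTML the result could be piped into kableExtra::kable_styling() for additional formatting.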

My Interpretation

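Sample sizes vary across education groups, from 451 (8th Grade) to 2,267 (Some College); 2,779 participants have no recorded education level (in the NHANES package, Education is reported only for participants aged 20 years or older)
Mean systolic BP decreases as education level increases, from about 128 mmHg in the 8th Grade group to about 119 mmHg among college graduates
Hypertension prevalence is highest in the 8th Grade group (28.3%) and lowest among college graduates (13.1%), although the decline is not strictly monotonic (17.3% for 9 - 11th Grade versus 18.9% for High School)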

Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Create visualization here:
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Figure: bar chart of hypertension prevalence by education level]

Task 3: Write Data Interpretation (5 minutes)

“What does this pattern tell us about health disparities and social determinants?”

Consider:

Which education groups have highest/lowest hypertension?
What might explain these differences?
Why does this matter for public health?

My Data Interpretation

Hypertension prevalence is highest among participants with an 8th grade education (28.3%) and lowest among college graduates (13.1%). Education is a social determinant of health: people with less education may have less of the health literacy they need to prevent or manage high blood pressure, so they are more likely to experience poorer health outcomes. This matters for public health because it suggests that hypertension prevention programs should prioritize underserved populations.

Grading Rubric for Task 3

Criteria	Excellent (Full Credit)	Adequate	Needs Work
Identifies pattern	Explicitly states which groups have highest/lowest rates	Mentions direction but lacks specificity	Vague or incorrect about pattern
Explains mechanism	References social determinants, access, or health literacy	Mentions inequality but lacks detail	No explanation provided
Public health relevance	Discusses implications for policy or programs	Notes importance but general	Missing public health connection
Writing quality	Clear, 2-3 well-written sentences	Adequate but could be clearer	Incomplete or unclear

Assessment Rubric: Overall Lab Performance

Scoring Guide (100 points total)

Task 1: Code (25 points)

✓ Correct group_by() (5 pts)
✓ Calculates N correctly (5 pts)
✓ Calculates mean systolic BP correctly (5 pts)
✓ Calculates hypertension percentage correctly (10 pts)

Task 2: Visualization (25 points)

✓ Filters missing values (5 pts)
✓ Correct plot type and aesthetics (10 pts)
✓ Proper labels and formatting (5 pts)
✓ Readable axis labels (5 pts)

Task 3: Interpretation (25 points)

✓ Identifies specific pattern in data (8 pts)
✓ Explains mechanism/social determinants (8 pts)
✓ Connects to public health implications (9 pts)

Overall Code Quality (25 points)

✓ Comments explain code (5 pts)
✓ Code runs without errors (10 pts)
✓ Output is properly formatted (5 pts)
✓ Submitted as HTML file (5 pts)

Exporting Your Work

Save and Knit

Save: File → Save As → Lab01_NHANES_YourName.Rmd
Knit: Click the blue Knit button
Submit: Upload the RPubs link to Brightspace

Key Takeaways

Skills Practiced

✓ Loading data from R packages
✓ Data exploration with str(), summary(), head()
✓ Grouping and summarizing with group_by() and summarise()
✓ Creating derived variables with mutate()
✓ Calculating epidemiological statistics
✓ Stratification to reveal disparities
✓ Professional visualization with ggplot2
✓ Publication-ready tables

Troubleshooting

“object ‘NHANES’ not found”

→ Make sure you ran library(NHANES) first, then data(NHANES)

Missing values (NA) showing

→ This is normal! Use na.rm = TRUE in summary functions such as mean() and sum(), and check how many values are missing
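A short illustration of why missing values matter in summaries, using a hypothetical vector `bp` rather than the lab data:

```r
# Hypothetical readings with one missing value
bp <- c(120, 135, NA, 150)

mean(bp)                 # NA: a single missing value propagates
mean(bp, na.rm = TRUE)   # 135: mean of the three observed values
sum(is.na(bp))           # 1: how many values were dropped
```

Reporting the number of dropped values alongside the summary keeps the denominator transparent.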

Bar chart looks wrong

→ Use filter(!is.na(Variable)) to remove missing groups

Resources

tidyverse: https://www.tidyverse.org/
dplyr cheatsheet: https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf
ggplot2 cheatsheet: https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf
NHANES: https://wwwn.cdc.gov/nchs/nhanes/

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 knitr_1.50       NHANES_2.1.0     lubridate_1.9.4  forcats_1.0.0   
##  [6] stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4      readr_2.1.5      tidyr_1.3.1     
## [11] tibble_3.2.1     ggplot2_3.5.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1   xml2_1.3.8        
##  [6] jquerylib_0.1.4    textshaping_1.0.1  systemfonts_1.3.1  scales_1.4.0       yaml_2.3.10       
## [11] fastmap_1.2.0      R6_2.6.1           labeling_0.4.3     generics_0.1.4     svglite_2.2.2     
## [16] bslib_0.9.0        pillar_1.10.2      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.6       
## [21] utf8_1.2.5         cachem_1.1.0       stringi_1.8.7      xfun_0.52          sass_0.4.10       
## [26] viridisLite_0.4.2  timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.3    
## [31] digest_0.6.37      grid_4.5.2         rstudioapi_0.17.1  hms_1.1.3          lifecycle_1.0.4   
## [36] vctrs_0.6.5        evaluate_1.0.3     glue_1.8.0         farver_2.1.2       rmarkdown_2.29    
## [41] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.8.1

Lab Activity 1 Complete!

Last updated: January 29, 2026

EPI 553 - R Lab Activity 1: Exploring Health Survey Data with NHANES

Natalia Small

January 29, 2026