Lab Activity 1

Time: ~30 minutes
Goal: Learn to work with real public health survey data in R
Learning Objectives:

  • Load and explore a nationally representative health survey dataset
  • Use tidyverse functions to summarize and group data
  • Create a professional summary table for epidemiological questions
  • Practice the complete data exploration workflow
  • Develop skills for identifying health disparities

Context: The NHANES Dataset

National Health and Nutrition Examination Survey (NHANES)

The NHANES is the gold standard for population-based health and nutritional data in the United States, conducted by the CDC’s National Center for Health Statistics. It combines:

  • Interviews - Health history, demographics, behaviors
  • Physical examinations - Blood pressure, BMI, clinical measurements
  • Laboratory tests - Blood work, biomarkers

Real-world use: NHANES data informs Healthy People objectives, food and nutrition guidelines, and health disparities research.

Today’s task: Explore NHANES data on cardiovascular health, physical activity, and demographic disparities—key epidemiological outcomes.


Part 1: Setting Up Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Part 2: Loading the NHANES Data

Load the NHANES Dataset

# Load the NHANES data
data(NHANES)

Part 3: Data Preparation

Create Analysis Dataset

# Select key variables for analysis
nhanes_analysis <- NHANES %>%
  dplyr::select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = factor(ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"))
  )

# Create age groups for analysis
#mutate = create
nhanes_analysis <- nhanes_analysis %>%
  mutate(
    Age_Group = cut(Age, 
                    breaks = c(0, 20, 35, 50, 65, 100),
                    labels = c("18-20", "21-35", "36-50", "51-65", "65+"))
  )

Task 1: Explore Health Disparities by Education (15 minutes)

“How does hypertension prevalence vary by education level?”

Write code to:

  1. Group by education level
  2. Calculate sample size, mean systolic BP, and percent with hypertension
  3. Print the results
# Your code here:
health_by_education <- nhanes_analysis %>%
  group_by(Education) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
  )

print(health_by_education)
## # A tibble: 6 × 4
##   Education          N Mean_SysBP Pct_Hypertension
##   <fct>          <int>      <dbl>            <dbl>
## 1 8th Grade        451       128.            28.3 
## 2 9 - 11th Grade   888       124.            17.3 
## 3 High School     1517       124.            18.9 
## 4 Some College    2267       122.            16.6 
## 5 College Grad    2098       119.            13.1 
## 6 <NA>            2779       106.             0.72

My Interpretation

  • Sample sizes vary across education groups
  • As education levels increase, Mean systolic BP decreases
  • There is a strong association between education level and percentage with hypertension, with higher percentages among lower education levels

Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Create visualization here:
health_by_education %>%
  filter(!is.na(Education)) %>%
  ggplot(aes(x = Education, y = Pct_Hypertension)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    caption = "Source: NHANES"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

R output visualization


Task 3: Write Data Interpretation (5 minutes)

“What does this pattern tell us about health disparities and social determinants?”

Consider: - Which education groups have highest/lowest hypertension? - What might explain these differences? - Why does this matter for public health?

My Data Interpretation

The data reveals that lower education groups have significantly higher hypertension rates, compared to the college education groups who have the lower rates of hypertension. Individuals who face inequalities in regards to social determinants of health, such as education, are more likely to have poorer health outcomes when compared to those who are more fortunate, when there are inequalities in regards to education individuals are less likely to have the amount of health literacy they may need to keep themselves in good health. This matters for public health because it indicates programs must shift focus toward undeserved populations to effectively combat rising hypertension rates in the population.


Grading Rubric for Task 3

Criteria Excellent (Full Credit) Adequate Needs Work
Identifies pattern Explicitly states which groups have highest/lowest rates Mentions direction but lacks specificity Vague or incorrect about pattern
Explains mechanism References social determinants, access, or health literacy Mentions inequality but lacks detail No explanation provided
Public health relevance Discusses implications for policy or programs Notes importance but general Missing public health connection
Writing quality Clear, 2-3 well-written sentences Adequate but could be clearer Incomplete or unclear

Assessment Rubric: Overall Lab Performance

Scoring Guide (100 points total)

Task 1: Code (25 points)

  • ✓ Correct group_by() (5 pts)
  • ✓ Calculates N correctly (5 pts)
  • ✓ Calculates mean systolic BP correctly (5 pts)
  • ✓ Calculates hypertension percentage correctly (10 pts)

Task 2: Visualization (25 points)

  • ✓ Filters missing values (5 pts)
  • ✓ Correct plot type and aesthetics (10 pts)
  • ✓ Proper labels and formatting (5 pts)
  • ✓ Readable axis labels (5 pts)

Task 3: Interpretation (25 points)

  • ✓ Identifies specific pattern in data (8 pts)
  • ✓ Explains mechanism/social determinants (8 pts)
  • ✓ Connects to public health implications (9 pts)

Overall Code Quality (25 points)

  • ✓ Comments explain code (5 pts)
  • ✓ Code runs without errors (10 pts)
  • ✓ Output is properly formatted (5 pts)
  • ✓ Submitted as HTML file (5 pts)

Exporting Your Work

Save and Knit

  1. Save: File → Save As → Lab01_NHANES_YourName.Rmd
  2. Knit: Click the blue Knit button
  3. Submit: Upload the .Rpubs link to Brightspace

Key Takeaways

Skills Practiced

✓ Loading data from R packages
✓ Data exploration with str(), summary(), head()
✓ Grouping and summarizing with group_by() and summarise()
✓ Creating derived variables with mutate()
✓ Calculating epidemiological statistics
✓ Stratification to reveal disparities
✓ Professional visualization with ggplot2
✓ Publication-ready tables


Troubleshooting

“object ‘NHANES’ not found”

→ Make sure you ran data(NHANES) after loading the package

Missing values (NA) showing

→ This is normal! Always use na.rm = TRUE in calculations

Bar chart looks wrong

→ Use filter(!is.na(Variable)) to remove missing groups


Resources


sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 knitr_1.50       NHANES_2.1.0     lubridate_1.9.4  forcats_1.0.0   
##  [6] stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4      readr_2.1.5      tidyr_1.3.1     
## [11] tibble_3.2.1     ggplot2_3.5.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1   xml2_1.3.8        
##  [6] jquerylib_0.1.4    textshaping_1.0.1  systemfonts_1.3.1  scales_1.4.0       yaml_2.3.10       
## [11] fastmap_1.2.0      R6_2.6.1           labeling_0.4.3     generics_0.1.4     svglite_2.2.2     
## [16] bslib_0.9.0        pillar_1.10.2      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.6       
## [21] utf8_1.2.5         cachem_1.1.0       stringi_1.8.7      xfun_0.52          sass_0.4.10       
## [26] viridisLite_0.4.2  timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.3    
## [31] digest_0.6.37      grid_4.5.2         rstudioapi_0.17.1  hms_1.1.3          lifecycle_1.0.4   
## [36] vctrs_0.6.5        evaluate_1.0.3     glue_1.8.0         farver_2.1.2       rmarkdown_2.29    
## [41] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.8.1

Lab Activity 1 Complete!

Last updated: January 29, 2026