Welcome to Your First Lab Activity

Time: ~30 minutes
Goal: Learn to work with real public health survey data in R
Learning Objectives:

Load and explore a nationally representative health survey dataset
Use tidyverse functions to summarize and group data
Create a professional summary table for epidemiological questions
Practice the complete data exploration workflow
Develop skills for identifying health disparities

Context: The NHANES Dataset

National Health and Nutrition Examination Survey (NHANES)

The NHANES is the gold standard for population-based health and nutritional data in the United States, conducted by the CDC’s National Center for Health Statistics. It combines:

Interviews - Health history, demographics, behaviors
Physical examinations - Blood pressure, BMI, clinical measurements
Laboratory tests - Blood work, biomarkers

Real-world use: NHANES data informs Healthy People objectives, food and nutrition guidelines, and health disparities research.

Today’s task: You’ll explore NHANES data on cardiovascular health, physical activity, and demographic disparities—key epidemiological outcomes.

Part 1: Setting Up Your Workspace

Load Required Packages

# Load required packages
library(tidyverse)    # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES)       # NHANES dataset
library(knitr)        # For professional table output
library(kableExtra)   # Enhanced tables

Troubleshooting: If you see an error, run this once:

install.packages("NHANES")

Then reload: library(NHANES)

Part 2: Loading and Exploring the NHANES Data

Load and Examine the NHANES Dataset

# Load the NHANES data
data(NHANES)

# Examine the first few rows
head(NHANES, n = 10)

## # A tibble: 10 × 76
##       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education    MaritalStatus HHIncome  
##    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>        <fct>         <fct>     
##  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-349…
##  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>         <NA>          20000-249…
##  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some College LivePartner   35000-449…
##  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>         <NA>          75000-999…
##  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>         <NA>          55000-649…
##  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
##  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College Grad Married       75000-999…
## # ℹ 65 more variables: HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## #   Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## #   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
## #   BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
## #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>, HealthGen <fct>,
## #   DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, …

# View data structure
str(NHANES)

## tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
##  $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num [1:10000] 165 165 165 105 168 ...
##  $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
##  $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...

# Dimensions: rows (observations) and columns (variables)
dim(NHANES)

## [1] 10000    76

View All Available Variables

# What variables do we have?
names(NHANES)

##  [1] "ID"               "SurveyYr"         "Gender"           "Age"              "AgeDecade"       
##  [6] "AgeMonths"        "Race1"            "Race3"            "Education"        "MaritalStatus"   
## [11] "HHIncome"         "HHIncomeMid"      "Poverty"          "HomeRooms"        "HomeOwn"         
## [16] "Work"             "Weight"           "Length"           "HeadCirc"         "Height"          
## [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"            "BPSysAve"        
## [26] "BPDiaAve"         "BPSys1"           "BPDia1"           "BPSys2"           "BPDia2"          
## [31] "BPSys3"           "BPDia3"           "Testosterone"     "DirectChol"       "TotChol"         
## [36] "UrineVol1"        "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
## [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad"  "LittleInterest"  
## [46] "Depressed"        "nPregnancies"     "nBabies"          "Age1stBaby"       "SleepHrsNight"   
## [51] "SleepTrouble"     "PhysActive"       "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"      
## [56] "TVHrsDayChild"    "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
## [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"         "Marijuana"       
## [66] "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"      "HardDrugs"        "SexEver"         
## [71] "SexAge"           "SexNumPartnLife"  "SexNumPartYear"   "SameSex"          "SexOrientation"  
## [76] "PregnantNow"

Check for Missing Data

# Count missing values in each column
missing_summary <- data.frame(
  Variable = names(NHANES),
    # AM: Column "Variable" is character vector of column names
  Missing_Count = colSums(is.na(NHANES)),
    #Sum of binary result of "is.na" (0,1) in the NHANES df = count of missing vals
  Missing_Percent = round(colSums(is.na(NHANES)) / nrow(NHANES) * 100, 2)
    #Count of missing values divided by number of rows, formatted as percent 2 decimals
)

# Show variables with the most missing data
print(missing_summary[order(-missing_summary$Missing_Count), ][1:15, ])

##                          Variable Missing_Count Missing_Percent
## HeadCirc                 HeadCirc          9912           99.12
## Length                     Length          9457           94.57
## DiabetesAge           DiabetesAge          9371           93.71
## TVHrsDayChild       TVHrsDayChild          9347           93.47
## CompHrsDayChild   CompHrsDayChild          9347           93.47
## BMICatUnder20yrs BMICatUnder20yrs          8726           87.26
## AgeRegMarij           AgeRegMarij          8634           86.34
## UrineFlow2             UrineFlow2          8524           85.24
## UrineVol2               UrineVol2          8522           85.22
## PregnantNow           PregnantNow          8304           83.04
## Age1stBaby             Age1stBaby          8116           81.16
## nBabies                   nBabies          7584           75.84
## nPregnancies         nPregnancies          7396           73.96
## AgeFirstMarij       AgeFirstMarij          7109           71.09
## SmokeAge                 SmokeAge          6920           69.20

  #Order missing summary by Missing Count, show rows 1:15, no specification of column index, so all

Epidemiological Note: Always use na.rm = TRUE in functions like sum() and mean() to exclude missing values, but report how many were excluded.

Part 3: Data Preparation and Exploration

Create Analysis Dataset

# Select key variables for analysis
nhanes_analysis <- NHANES %>%
  ## "::" allows you to install individual element from package, without installing everything - here "select"
  ##      list columns to use in next transformation
  dplyr::select(
    ID,
    Gender,           # Sex (Male/Female)
    Age,              # Age in years
    Race1,            # Race/ethnicity
    Education,        # Education level
    BMI,              # Body Mass Index
    Pulse,            # Resting heart rate
    BPSys1,           # Systolic blood pressure (1st reading)
    BPDia1,           # Diastolic blood pressure (1st reading)
    PhysActive,       # Physically active (Yes/No)
    SmokeNow,         # Current smoking status
    Diabetes,         # Diabetes diagnosis (Yes/No)
    HealthGen         # General health rating
  ) %>%
  # Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
  mutate(
    Hypertension = factor(ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No"))
  )


# Remove rows with missing values for key variables
nhanes_analysis2 <- nhanes_analysis %>%
        filter(complete.cases(.))  # Complete cases only

# View the processed dataset
head(nhanes_analysis, 10)

## # A tibble: 10 × 14
##       ID Gender   Age Race1 Education      BMI Pulse BPSys1 BPDia1 PhysActive SmokeNow Diabetes
##    <int> <fct>  <int> <fct> <fct>        <dbl> <int>  <int>  <int> <fct>      <fct>    <fct>   
##  1 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  2 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  3 51624 male      34 White High School   32.2    70    114     88 No         No       No      
##  4 51625 male       4 Other <NA>          15.3    NA     NA     NA <NA>       <NA>     No      
##  5 51630 female    49 White Some College  30.6    86    118     82 No         Yes      No      
##  6 51638 male       9 White <NA>          16.8    82     84     50 <NA>       <NA>     No      
##  7 51646 male       8 White <NA>          20.6    72    114     46 <NA>       <NA>     No      
##  8 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
##  9 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## 10 51647 female    45 White College Grad  27.2    62    106     62 Yes        <NA>     No      
## # ℹ 2 more variables: HealthGen <fct>, Hypertension <fct>

# Check dimensions
dim(nhanes_analysis)

## [1] 10000    14

Part 4: Summary Statistics by Groups

Summary by Gender

Research Question: Do cardiovascular health indicators and physical activity differ by gender?

# Create summary statistics grouped by gender
health_by_gender <- nhanes_analysis %>%
  group_by(Gender) %>%
  summarise(
    N = n(),
    # Descriptive statistics
    Mean_Age = round(mean(Age, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_DiaBP = round(mean(BPDia1, na.rm = TRUE), 2),
    # Prevalence estimates (percentages)
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2),
    Pct_Current_Smoker = round(
      sum(SmokeNow == "Yes", na.rm = TRUE) / sum(!is.na(SmokeNow)) * 100, 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2)
  )

print(health_by_gender)

## # A tibble: 2 × 10
##   Gender     N Mean_Age Mean_BMI Mean_SysBP Mean_DiaBP Pct_Physically_Active Pct_Current_Smoker
##   <fct>  <int>    <dbl>    <dbl>      <dbl>      <dbl>                 <dbl>              <dbl>
## 1 female  5020     37.6     26.8       117.       67.0                  53.2               45.5
## 2 male    4980     35.8     26.6       121.       69.5                  58.6               45.8
## # ℹ 2 more variables: Pct_Hypertension <dbl>, Pct_Diabetes <dbl>

Summary by Age Group

Research Question: How do health outcomes and behaviors vary across the lifespan?

#What not to do:
# Summary by age
health_by_age <- nhanes_analysis %>%
  group_by(Age) %>%
  summarise(
    N = n(),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2),
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2)
  )

#Do this:
# Create age groups for analysis
nhanes_analysis <- nhanes_analysis %>%
  mutate(
    Age_Group = cut(Age, 
                    breaks = c(0, 20, 35, 50, 65, 100),
                    labels = c("18-20", "21-35", "36-50", "51-65", "65+"))
  )

# Summary by age group
health_by_age <- nhanes_analysis %>%
  group_by(Age_Group) %>%
  summarise(
    N = n(),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    Pct_Diabetes = round(
      sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2),
    Pct_Physically_Active = round(
      sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2)
  )

print(health_by_age)

## # A tibble: 6 × 7
##   Age_Group     N Mean_BMI Mean_SysBP Pct_Hypertension Pct_Diabetes Pct_Physically_Active
##   <fct>     <int>    <dbl>      <dbl>            <dbl>        <dbl>                 <dbl>
## 1 18-20      2769     20.8       107.             0.73         0.65                  76.8
## 2 21-35      1984     28.2       114.             4.72         1.46                  61.8
## 3 36-50      2138     29.1       118.            11.4          6.69                  55.5
## 4 51-65      1784     29.4       127.            22.6         17.7                   47.5
## 5 65+        1188     28.6       135.            40.4         21.4                   37.2
## 6 <NA>        137    NaN         NaN            NaN          NaN                    NaN

Stratified Analysis: Gender AND Race

Research Question: Are there racial/ethnic differences in hypertension that differ by gender?

# Stratified by gender and race/ethnicity
health_stratified <- nhanes_analysis %>%
  group_by(Gender, Race1) %>%
  summarise(
    N = n(),
    Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
    Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

print(health_stratified)

## # A tibble: 10 × 6
##    Gender Race1        N Mean_SysBP Mean_BMI Pct_Hypertension
##    <fct>  <fct>    <int>      <dbl>    <dbl>            <dbl>
##  1 female Black      614       120.     29.8            17.4 
##  2 female Hispanic   320       114.     26.3             8.88
##  3 female Mexican    452       112.     26.4             7.08
##  4 female White     3221       118.     26.6            13.8 
##  5 female Other      413       113.     24.0            10   
##  6 male   Black      583       123.     26.3            17.3 
##  7 male   Hispanic   290       119.     26.4            10.1 
##  8 male   Mexican    563       119.     26.6            12.4 
##  9 male   White     3151       121.     26.8            15.2 
## 10 male   Other      393       118.     24.8            12.8

Part 5: Data Visualization

Hypertension Prevalence by Gender and Age

# Create visualization
health_by_age_gender <- nhanes_analysis %>%
  group_by(Age_Group, Gender) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    .groups = 'drop'
  )

Physical Activity by Age Group

activity_by_age <- health_by_age %>%
  filter(!is.na(Age_Group))

ggplot(activity_by_age, aes(x = Age_Group, y = Pct_Physically_Active)) +
  geom_col(fill = "#2ecc71", alpha = 0.8) +
  geom_text(aes(label = paste0(Pct_Physically_Active, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Physical Activity Participation by Age Group",
    subtitle = "NHANES Data",
    x = "Age Group",
    y = "Prevalence (%)",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  ylim(0, 90) +
  # AM: Limiting Y axis to 70 was excluding the 18-20 category
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing physical activity prevalence by age group

Hypertension by Race/Ethnicity

hypertension_by_race <- nhanes_analysis %>%
  group_by(Race1) %>%
  summarise(
    Pct_Hypertension = round(
      sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
    N = n(),
    .groups = 'drop'
  ) %>%
  filter(!is.na(Race1))

ggplot(hypertension_by_race, aes(x = Race1, y = Pct_Hypertension)) +
  geom_col(fill = "#e74c3c", alpha = 0.8) +
  geom_text(aes(label = paste0(Pct_Hypertension, "%")), 
            vjust = -0.5, size = 3) +
  labs(
    title = "Hypertension Prevalence by Race/Ethnicity",
    subtitle = "NHANES Data",
    x = "Race/Ethnicity",
    y = "Prevalence (%)",
    caption = "Source: National Health and Nutrition Examination Survey"
  ) +
  ylim(0, 50) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Bar chart showing hypertension prevalence by race/ethnicity

Part 6: Professional Summary Table

# Create summary table by gender
summary_table <- health_by_gender %>%
  select(
    Gender,
    N,
    `Mean Age` = Mean_Age,
    `Mean BMI` = Mean_BMI,
    `Mean SysBP` = Mean_SysBP,
    `% Hypertension` = Pct_Hypertension,
    `% Physically Active` = Pct_Physically_Active
  )

# Display with kableExtra
kable(summary_table, 
      caption = "Cardiovascular Health and Physical Activity by Gender",
      format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )

Cardiovascular Health and Physical Activity by Gender
Gender	N	Mean Age	Mean BMI	Mean SysBP	% Hypertension	% Physically Active
female	5020	37.64	26.77	117.3	13.07	53.18
male	4980	35.83	26.55	120.9	14.68	58.56

# I've never used Kable, but this is great.

Your Turn: Guided Practice

🎯 Task 1: Explore Health Disparities by Education (15 minutes)

Using the nhanes_analysis data, explore:

“How does hypertension prevalence vary by education level?”

Write code to:

Group by education level
Calculate sample size, mean systolic BP, and percent with hypertension
Print the results

# Your code here:

#Identify what is in the dataset for Education category:
unique(nhanes_analysis$Education)

## [1] High School    <NA>           Some College   College Grad   9 - 11th Grade 8th Grade     
## Levels: 8th Grade 9 - 11th Grade High School Some College College Grad

#Create new DF from NHANES_analysis
health_by_education <- nhanes_analysis %>%
    group_by(Education) %>%
    summarize(
      #count
      N = n(),
      #percent with hypertension
      Pct_Hypertension = round( 
          sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
      #mean systolic bp
      Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
      #mean BMI
      Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
    .groups = 'drop'
        )

print(health_by_education)

## # A tibble: 6 × 5
##   Education          N Pct_Hypertension Mean_SysBP Mean_BMI
##   <fct>          <int>            <dbl>      <dbl>    <dbl>
## 1 8th Grade        451            28.3        128.     29.5
## 2 9 - 11th Grade   888            17.3        124.     29.1
## 3 High School     1517            18.9        124.     29.4
## 4 Some College    2267            16.6        122.     29.2
## 5 College Grad    2098            13.1        119.     27.6
## 6 <NA>            2779             0.72       106.     20.5

🎯 Task 2: Create a Visualization (10 minutes)

Create a bar chart showing hypertension by education level:

# Your visualization here:
#Count of missing values
missing_edu <- sum(is.na(health_by_education$Education))

#ggplot call
health_by_education %>%
  filter(!is.na(Education)) %>%
  
  ggplot(aes(x=Education, y=Pct_Hypertension))  +
  geom_col(fill="coral", alpha = 0.5) +
  #Data labels concat of value and "%", just above bar
  geom_text(aes(label= paste0(Pct_Hypertension,"%")), vjust = -0.5, size=3) +
  #Label axes, caption, title
  labs(
    title = "Hypertension Prevalence by Education Level",
    x = "Education Level",
    y = "Percent with Hypertension (%)",
    #append a linebreak, missing count to the caption
    caption = paste0("Source: NHANES \n", missing_edu, " observation excuded from visualization due to missing education")) +
  ylim(0, 40) +
  # Theme with no gridlines - my preference
  theme_classic()

R output visualization

🎯 Task 3: Write a Data Interpretation (5 minutes)

Write 2-3 sentences:

“What does this pattern tell us about health disparities and social determinants?”

Consider:
- Which education groups have highest/lowest hypertension?
- What might explain these differences?
- Why does this matter for public health?

Based on the NHANES data, populations with 8th grade education levels have the highest rates of hypertension (28.3%), while college graduates have the lowest rates of hypertension (13.09%), and increased level of education is broadly negatively associated with hypertension in the data. Level of education is strongly linked with other structural determinants of inequality, such as access to healthcare, labor market position resulting in increased stress and more demanding work, housing, income, and food insecurity, all of which are associated with negative healthcare outcomes. Because education level is linked to socioeconomic status, interventions and prevention policies for hypertension will be most effective in context of structural barriers - either addressing broad drivers like food and housing insecurity, or with targeted interventions like blood pressure screenings at flexible times and locations or in underserved neighborhoods.

Grading Rubric for Task 3

Criteria	Excellent (Full Credit)	Adequate	Needs Work
Identifies pattern	Explicitly states which groups have highest/lowest rates	Mentions direction but lacks specificity	Vague or incorrect about pattern
Explains mechanism	References social determinants, access, or health literacy	Mentions inequality but lacks detail	No explanation provided
Public health relevance	Discusses implications for policy or programs	Notes importance but general	Missing public health connection
Writing quality	Clear, 2-3 well-written sentences	Adequate but could be clearer	Incomplete or unclear

Assessment Rubric: Overall Lab Performance

Scoring Guide (100 points total)

Task 1: Code (25 points)

✓ Correct group_by() (5 pts)
✓ Calculates N correctly (5 pts)
✓ Calculates mean systolic BP correctly (5 pts)
✓ Calculates hypertension percentage correctly (10 pts)

Task 2: Visualization (25 points)

✓ Filters missing values (5 pts)
✓ Correct plot type and aesthetics (10 pts)
✓ Proper labels and formatting (5 pts)
✓ Readable axis labels (5 pts)

Task 3: Interpretation (25 points)

✓ Identifies specific pattern in data (8 pts)
✓ Explains mechanism/social determinants (8 pts)
✓ Connects to public health implications (9 pts)

Overall Code Quality (25 points)

✓ Comments explain code (5 pts)
✓ Code runs without errors (10 pts)
✓ Output is properly formatted (5 pts)
✓ Submitted as HTML file (5 pts)

Exporting Your Work

Save and Knit

Save: File → Save As → Lab01_NHANES_YourName.Rmd
Knit: Click the blue Knit button
Submit: Upload the .Rpubs link to Brightspace

Key Takeaways

Skills Practiced

✓ Loading data from R packages
✓ Data exploration with str(), summary(), head()
✓ Grouping and summarizing with group_by() and summarise()
✓ Creating derived variables with mutate()
✓ Calculating epidemiological statistics
✓ Stratification to reveal disparities
✓ Professional visualization with ggplot2
✓ Publication-ready tables

Troubleshooting

“object ‘NHANES’ not found”

→ Make sure you ran data(NHANES) after loading the package

Missing values (NA) showing

→ This is normal! Always use na.rm = TRUE in calculations

Bar chart looks wrong

→ Use filter(!is.na(Variable)) to remove missing groups

Resources

tidyverse: https://www.tidyverse.org/
dplyr cheatsheet: https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf
ggplot2 cheatsheet: https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf
NHANES: https://wwwn.cdc.gov/nchs/nhanes/

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 knitr_1.51       NHANES_2.1.0     lubridate_1.9.4  forcats_1.0.1   
##  [6] stringr_1.6.0    dplyr_1.1.4      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [11] tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.1     tidyselect_1.2.1   xml2_1.5.1        
##  [6] jquerylib_0.1.4    textshaping_1.0.4  systemfonts_1.3.1  scales_1.4.0       yaml_2.3.12       
## [11] fastmap_1.2.0      R6_2.6.1           labeling_0.4.3     generics_0.1.4     svglite_2.2.2     
## [16] bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7       
## [21] utf8_1.2.6         stringi_1.8.7      cachem_1.1.0       xfun_0.56          sass_0.4.10       
## [26] S7_0.2.1           viridisLite_0.4.2  timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [31] magrittr_2.0.4     digest_0.6.39      grid_4.5.1         rstudioapi_0.18.0  hms_1.1.4         
## [36] lifecycle_1.0.5    vctrs_0.7.0        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
## [41] rmarkdown_2.30     tools_4.5.1        pkgconfig_2.0.3    htmltools_0.5.9

Lab Activity 1 Complete!

Last updated: January 27, 2026

EPI 553 - R Lab Activity 1: Exploring Health Survey Data with NHANES

Alex Mossawir Instructor: Dr. Muntasir Masum

January 27, 2026