Time: ~30 minutes
Goal: Learn to work with real public health survey data
in R
Learning Objectives:
The NHANES is the gold standard for population-based health and nutritional data in the United States, conducted by the CDCβs National Center for Health Statistics. It combines:
Real-world use: NHANES data informs Healthy People objectives, food and nutrition guidelines, and health disparities research.
Todayβs task: Youβll explore NHANES data on cardiovascular health, physical activity, and demographic disparitiesβkey epidemiological outcomes.
# Load required packages
library(tidyverse) # Data manipulation (dplyr, ggplot2, etc.)
library(NHANES) # NHANES dataset
library(knitr) # For professional table output
library(kableExtra) # Enhanced tablesTroubleshooting: If you see an error, run this once:
Then reload: library(NHANES)
## # A tibble: 10 Γ 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High School Married 25000-349β¦
## 2 51624 2009_10 male 34 " 30-39" 409 White <NA> High School Married 25000-349β¦
## 3 51624 2009_10 male 34 " 30-39" 409 White <NA> High School Married 25000-349β¦
## 4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA> <NA> 20000-249β¦
## 5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some College LivePartner 35000-449β¦
## 6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA> <NA> 75000-999β¦
## 7 51646 2009_10 male 8 " 0-9" 101 White <NA> <NA> <NA> 55000-649β¦
## 8 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad Married 75000-999β¦
## 9 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad Married 75000-999β¦
## 10 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad Married 75000-999β¦
## # βΉ 65 more variables: HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## # Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
## # BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
## # BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## # Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
## # UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>, HealthGen <fct>,
## # DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, β¦
## tibble [10,000 Γ 76] (S3: tbl_df/tbl/data.frame)
## $ ID : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
## $ SurveyYr : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
## $ Age : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
## $ AgeDecade : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
## $ AgeMonths : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
## $ Race1 : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
## $ Race3 : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Education : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
## $ MaritalStatus : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
## $ HHIncome : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
## $ HHIncomeMid : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
## $ Poverty : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
## $ HomeRooms : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
## $ HomeOwn : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
## $ Work : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
## $ Weight : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
## $ Length : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HeadCirc : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ Height : num [1:10000] 165 165 165 105 168 ...
## $ BMI : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
## $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BMI_WHO : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
## $ Pulse : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
## $ BPSysAve : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
## $ BPDiaAve : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
## $ BPSys1 : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
## $ BPDia1 : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
## $ BPSys2 : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
## $ BPDia2 : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
## $ BPSys3 : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
## $ BPDia3 : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
## $ Testosterone : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ DirectChol : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
## $ TotChol : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
## $ UrineVol1 : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
## $ UrineFlow1 : num [1:10000] NA NA NA NA 0.094 ...
## $ UrineVol2 : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ UrineFlow2 : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ Diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ DiabetesAge : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HealthGen : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
## $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
## $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
## $ LittleInterest : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
## $ Depressed : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ nPregnancies : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
## $ nBabies : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
## $ Age1stBaby : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
## $ SleepHrsNight : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
## $ SleepTrouble : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PhysActive : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
## $ PhysActiveDays : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
## $ TVHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ CompHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ TVHrsDayChild : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
## $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
## $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AlcoholDay : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
## $ AlcoholYear : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
## $ SmokeNow : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
## $ Smoke100 : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ Smoke100n : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SmokeAge : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
## $ Marijuana : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AgeFirstMarij : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
## $ RegularMarij : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
## $ AgeRegMarij : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
## $ HardDrugs : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SexEver : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ SexAge : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
## $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
## $ SexNumPartYear : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
## $ SameSex : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
## $ SexOrientation : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PregnantNow : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
## [1] 10000 76
## [1] "ID" "SurveyYr" "Gender" "Age" "AgeDecade"
## [6] "AgeMonths" "Race1" "Race3" "Education" "MaritalStatus"
## [11] "HHIncome" "HHIncomeMid" "Poverty" "HomeRooms" "HomeOwn"
## [16] "Work" "Weight" "Length" "HeadCirc" "Height"
## [21] "BMI" "BMICatUnder20yrs" "BMI_WHO" "Pulse" "BPSysAve"
## [26] "BPDiaAve" "BPSys1" "BPDia1" "BPSys2" "BPDia2"
## [31] "BPSys3" "BPDia3" "Testosterone" "DirectChol" "TotChol"
## [36] "UrineVol1" "UrineFlow1" "UrineVol2" "UrineFlow2" "Diabetes"
## [41] "DiabetesAge" "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad" "LittleInterest"
## [46] "Depressed" "nPregnancies" "nBabies" "Age1stBaby" "SleepHrsNight"
## [51] "SleepTrouble" "PhysActive" "PhysActiveDays" "TVHrsDay" "CompHrsDay"
## [56] "TVHrsDayChild" "CompHrsDayChild" "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n" "SmokeAge" "Marijuana"
## [66] "AgeFirstMarij" "RegularMarij" "AgeRegMarij" "HardDrugs" "SexEver"
## [71] "SexAge" "SexNumPartnLife" "SexNumPartYear" "SameSex" "SexOrientation"
## [76] "PregnantNow"
# Count missing values in each column
missing_summary <- data.frame(
Variable = names(NHANES),
Missing_Count = colSums(is.na(NHANES)),
Missing_Percent = round(colSums(is.na(NHANES)) / nrow(NHANES) * 100, 2)
)
# Show variables with the most missing data
print(missing_summary[order(-missing_summary$Missing_Count), ][1:15, ])## Variable Missing_Count Missing_Percent
## HeadCirc HeadCirc 9912 99.12
## Length Length 9457 94.57
## DiabetesAge DiabetesAge 9371 93.71
## TVHrsDayChild TVHrsDayChild 9347 93.47
## CompHrsDayChild CompHrsDayChild 9347 93.47
## BMICatUnder20yrs BMICatUnder20yrs 8726 87.26
## AgeRegMarij AgeRegMarij 8634 86.34
## UrineFlow2 UrineFlow2 8524 85.24
## UrineVol2 UrineVol2 8522 85.22
## PregnantNow PregnantNow 8304 83.04
## Age1stBaby Age1stBaby 8116 81.16
## nBabies nBabies 7584 75.84
## nPregnancies nPregnancies 7396 73.96
## AgeFirstMarij AgeFirstMarij 7109 71.09
## SmokeAge SmokeAge 6920 69.20
Epidemiological Note: Always use
na.rm = TRUE in functions like sum() and
mean() to exclude missing values, but report how
many were excluded.
# Select key variables for analysis
nhanes_analysis <- NHANES %>%
select(
ID,
Gender, # Sex (Male/Female)
Age, # Age in years
Race1, # Race/ethnicity
Education, # Education level
BMI, # Body Mass Index
Pulse, # Resting heart rate
BPSys1, # Systolic blood pressure (1st reading)
BPDia1, # Diastolic blood pressure (1st reading)
PhysActive, # Physically active (Yes/No)
SmokeNow, # Current smoking status
Diabetes, # Diabetes diagnosis (Yes/No)
HealthGen # General health rating
) %>%
# Create a binary hypertension indicator (BPSys1 >= 140 OR BPDia1 >= 90)
mutate(
Hypertension = ifelse(BPSys1 >= 140 | BPDia1 >= 90, "Yes", "No")
)
# View the processed dataset
head(nhanes_analysis, 10)## # A tibble: 10 Γ 14
## ID Gender Age Race1 Education BMI Pulse BPSys1 BPDia1 PhysActive SmokeNow Diabetes
## <int> <fct> <int> <fct> <fct> <dbl> <int> <int> <int> <fct> <fct> <fct>
## 1 51624 male 34 White High School 32.2 70 114 88 No No No
## 2 51624 male 34 White High School 32.2 70 114 88 No No No
## 3 51624 male 34 White High School 32.2 70 114 88 No No No
## 4 51625 male 4 Other <NA> 15.3 NA NA NA <NA> <NA> No
## 5 51630 female 49 White Some College 30.6 86 118 82 No Yes No
## 6 51638 male 9 White <NA> 16.8 82 84 50 <NA> <NA> No
## 7 51646 male 8 White <NA> 20.6 72 114 46 <NA> <NA> No
## 8 51647 female 45 White College Grad 27.2 62 106 62 Yes <NA> No
## 9 51647 female 45 White College Grad 27.2 62 106 62 Yes <NA> No
## 10 51647 female 45 White College Grad 27.2 62 106 62 Yes <NA> No
## # βΉ 2 more variables: HealthGen <fct>, Hypertension <chr>
## [1] 10000 14
Research Question: Do cardiovascular health indicators and physical activity differ by gender?
# Create summary statistics grouped by gender
health_by_gender <- nhanes_analysis %>%
group_by(Gender) %>%
summarise(
N = n(),
# Descriptive statistics
Mean_Age = round(mean(Age, na.rm = TRUE), 2),
Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
Mean_DiaBP = round(mean(BPDia1, na.rm = TRUE), 2),
# Prevalence estimates (percentages)
Pct_Physically_Active = round(
sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2),
Pct_Current_Smoker = round(
sum(SmokeNow == "Yes", na.rm = TRUE) / sum(!is.na(SmokeNow)) * 100, 2),
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
Pct_Diabetes = round(
sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2)
)
print(health_by_gender)## # A tibble: 2 Γ 10
## Gender N Mean_Age Mean_BMI Mean_SysBP Mean_DiaBP Pct_Physically_Active Pct_Current_Smoker
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 5020 37.6 26.8 117. 67.0 53.2 45.5
## 2 male 4980 35.8 26.6 121. 69.5 58.6 45.8
## # βΉ 2 more variables: Pct_Hypertension <dbl>, Pct_Diabetes <dbl>
Research Question: How do health outcomes and behaviors vary across the lifespan?
# Create age groups for analysis
nhanes_analysis <- nhanes_analysis %>%
mutate(
Age_Group = cut(Age,
breaks = c(0, 20, 35, 50, 65, 100),
labels = c("18-20", "21-35", "36-50", "51-65", "65+"))
)
# Summary by age group
health_by_age <- nhanes_analysis %>%
group_by(Age_Group) %>%
summarise(
N = n(),
Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
Pct_Diabetes = round(
sum(Diabetes == "Yes", na.rm = TRUE) / sum(!is.na(Diabetes)) * 100, 2),
Pct_Physically_Active = round(
sum(PhysActive == "Yes", na.rm = TRUE) / sum(!is.na(PhysActive)) * 100, 2)
)
print(health_by_age)## # A tibble: 6 Γ 7
## Age_Group N Mean_BMI Mean_SysBP Pct_Hypertension Pct_Diabetes Pct_Physically_Active
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18-20 2769 20.8 107. 0.73 0.65 76.8
## 2 21-35 1984 28.2 114. 4.72 1.46 61.8
## 3 36-50 2138 29.1 118. 11.4 6.69 55.5
## 4 51-65 1784 29.4 127. 22.6 17.7 47.5
## 5 65+ 1188 28.6 135. 40.4 21.4 37.2
## 6 <NA> 137 NaN NaN NaN NaN NaN
Research Question: Are there racial/ethnic differences in hypertension that differ by gender?
# Stratified by gender and race/ethnicity
health_stratified <- nhanes_analysis %>%
group_by(Gender, Race1) %>%
summarise(
N = n(),
Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
Mean_BMI = round(mean(BMI, na.rm = TRUE), 2),
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
.groups = 'drop'
)
print(health_stratified)## # A tibble: 10 Γ 6
## Gender Race1 N Mean_SysBP Mean_BMI Pct_Hypertension
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 female Black 614 120. 29.8 17.4
## 2 female Hispanic 320 114. 26.3 8.88
## 3 female Mexican 452 112. 26.4 7.08
## 4 female White 3221 118. 26.6 13.8
## 5 female Other 413 113. 24.0 10
## 6 male Black 583 123. 26.3 17.3
## 7 male Hispanic 290 119. 26.4 10.1
## 8 male Mexican 563 119. 26.6 12.4
## 9 male White 3151 121. 26.8 15.2
## 10 male Other 393 118. 24.8 12.8
# Create visualization
health_by_age_gender <- nhanes_analysis %>%
group_by(Age_Group, Gender) %>%
summarise(
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
.groups = 'drop'
)
ggplot(health_by_age_gender, aes(x = Age_Group, y = Pct_Hypertension, fill = Gender)) +
geom_col(position = "dodge", alpha = 0.8) +
labs(
title = "Hypertension Prevalence by Age and Gender",
subtitle = "NHANES Data",
x = "Age Group",
y = "Prevalence (%)",
fill = "Gender",
caption = "Source: National Health and Nutrition Examination Survey"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)activity_by_age <- health_by_age %>%
filter(!is.na(Age_Group))
ggplot(activity_by_age, aes(x = Age_Group, y = Pct_Physically_Active)) +
geom_col(fill = "#2ecc71", alpha = 0.8) +
geom_text(aes(label = paste0(Pct_Physically_Active, "%")),
vjust = -0.5, size = 3) +
labs(
title = "Physical Activity Participation by Age Group",
subtitle = "NHANES Data",
x = "Age Group",
y = "Prevalence (%)",
caption = "Source: National Health and Nutrition Examination Survey"
) +
ylim(0, 70) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)hypertension_by_race <- nhanes_analysis %>%
group_by(Race1) %>%
summarise(
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2),
N = n(),
.groups = 'drop'
) %>%
filter(!is.na(Race1))
ggplot(hypertension_by_race, aes(x = Race1, y = Pct_Hypertension)) +
geom_col(fill = "#e74c3c", alpha = 0.8) +
geom_text(aes(label = paste0(Pct_Hypertension, "%")),
vjust = -0.5, size = 3) +
labs(
title = "Hypertension Prevalence by Race/Ethnicity",
subtitle = "NHANES Data",
x = "Race/Ethnicity",
y = "Prevalence (%)",
caption = "Source: National Health and Nutrition Examination Survey"
) +
ylim(0, 50) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)# Create summary table by gender
summary_table <- health_by_gender %>%
select(
Gender,
N,
`Mean Age` = Mean_Age,
`Mean BMI` = Mean_BMI,
`Mean SysBP` = Mean_SysBP,
`% Hypertension` = Pct_Hypertension,
`% Physically Active` = Pct_Physically_Active
)
# Display with kableExtra
kable(summary_table,
caption = "Cardiovascular Health and Physical Activity by Gender",
format = "html") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)| Gender | N | Mean Age | Mean BMI | Mean SysBP | % Hypertension | % Physically Active |
|---|---|---|---|---|---|---|
| female | 5020 | 37.64 | 26.77 | 117.3 | 13.07 | 53.18 |
| male | 4980 | 35.83 | 26.55 | 120.9 | 14.68 | 58.56 |
Using the nhanes_analysis data, explore:
βHow does hypertension prevalence vary by education level?β
Write code to:
# Your code here:
health_by_education <- nhanes_analysis %>%
group_by(Education) %>%
summarise(
N = n(),
Mean_SysBP = round(mean(BPSys1, na.rm = TRUE), 2),
Pct_Hypertension = round(
sum(Hypertension == "Yes", na.rm = TRUE) / sum(!is.na(Hypertension)) * 100, 2)
)
print(health_by_education)Create a bar chart showing hypertension by education level:
# Your visualization here:
health_by_education %>%
filter(!is.na(Education)) %>%
ggplot(aes(x = Education, y = Pct_Hypertension)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_text(aes(label = paste0(Pct_Hypertension, "%")),
vjust = -0.5, size = 3) +
labs(
title = "Hypertension Prevalence by Education Level",
x = "Education Level",
y = "Percent with Hypertension (%)",
caption = "Source: NHANES"
) +
ylim(0, 50) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Write 2-3 sentences:
βWhat does this pattern tell us about health disparities and social determinants?β
Consider: - Which education groups have highest/lowest hypertension? - What might explain these differences? - Why does this matter for public health?
| Criteria | Excellent (Full Credit) | Adequate | Needs Work |
|---|---|---|---|
| Identifies pattern | Explicitly states which groups have highest/lowest rates | Mentions direction but lacks specificity | Vague or incorrect about pattern |
| Explains mechanism | References social determinants, access, or health literacy | Mentions inequality but lacks detail | No explanation provided |
| Public health relevance | Discusses implications for policy or programs | Notes importance but general | Missing public health connection |
| Writing quality | Clear, 2-3 well-written sentences | Adequate but could be clearer | Incomplete or unclear |
group_by() (5 pts)Lab01_NHANES_YourName.Rmdβ Loading data from R packages
β Data exploration with str(), summary(),
head()
β Grouping and summarizing with group_by() and
summarise()
β Creating derived variables with mutate()
β Calculating epidemiological statistics
β Stratification to reveal disparities
β Professional visualization with ggplot2
β Publication-ready tables
β Make sure you ran data(NHANES) after loading the
package
β This is normal! Always use na.rm = TRUE in
calculations
β Use filter(!is.na(Variable)) to remove missing
groups
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 NHANES_2.1.0 lubridate_1.9.4 forcats_1.0.1
## [6] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.2
## [11] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 utf8_1.2.6 generics_0.1.4 xml2_1.5.1 stringi_1.8.7
## [6] hms_1.1.4 digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
## [11] timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0 viridisLite_0.4.2
## [16] scales_1.4.0 textshaping_1.0.4 jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.12 otel_0.2.0 tools_4.5.2
## [26] tzdb_0.5.0 vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4 pkgconfig_2.0.3
## [31] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 glue_1.8.0 systemfonts_1.3.1
## [36] xfun_0.55 tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.9
## [41] rmarkdown_2.30 svglite_2.2.2 labeling_0.4.3 compiler_4.5.2 S7_0.2.1
Lab Activity 1 Complete!
Last updated: January 22, 2026