The National Health and Nutrition Examination Survey (NHANES) data is a complex survey of tens of thousands of people designed to assess the health and nutritional status of adults and children in the United States. The NHANES data includes many measurements related to overall health, physical activity, diet, psychological health, socioeconomic factors and more.
Depending on the sampling design, each person has a sampling weight that quantifies how many people in the larger population their data is meant to represent. In this notebook, we’ll apply survey methods that use sampling weights to estimate and model relationships between measurements.
We are going to focus on a common health indicator, Body Mass Index (BMI kg/m2), and how it is related to physical activity. We’ll visualize the data and use survey-weighted regression to test for associations.
# Install packageS
# install.packages("NHANES")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("dplyr")
# Load the NHANES and dplyr packages
library(NHANES)
library(dplyr)
# Load the NHANESraw data
data("NHANESraw")
# Take a glimpse at the contents
glimpse(NHANESraw)
Rows: 20,293
Columns: 78
$ ID <int> 51624, 51625, 51626, 51627, 51628, 5~
$ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, ~
$ Gender <fct> male, male, male, male, female, male~
$ Age <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80~
$ AgeMonths <int> 409, 49, 202, 131, 722, 313, 596, 12~
$ Race1 <fct> White, Other, Black, Black, Black, M~
$ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ Education <fct> High School, NA, NA, NA, High School~
$ MaritalStatus <fct> Married, NA, NA, NA, Widowed, Marrie~
$ HHIncome <fct> 25000-34999, 20000-24999, 45000-5499~
$ HHIncomeMid <int> 30000, 22500, 50000, 22500, 12500, 3~
$ Poverty <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, ~
$ HomeRooms <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, ~
$ HomeOwn <fct> Own, Own, Own, Rent, Rent, Rent, Ren~
$ Work <fct> NotWorking, NA, NotWorking, NA, NotW~
$ Weight <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6,~
$ Length <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA~
$ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ Height <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 1~
$ BMI <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 3~
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ BMI_WHO <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, ~
$ Pulse <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, ~
$ BPSysAve <int> 113, NA, 109, 93, 150, 104, 112, NA,~
$ BPDiaAve <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, ~
$ BPSys1 <int> 114, NA, 112, 92, 154, 102, 118, NA,~
$ BPDia1 <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, ~
$ BPSys2 <int> 114, NA, 114, 94, 150, 104, 108, NA,~
$ BPDia2 <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, ~
$ BPSys3 <int> 112, NA, 104, 92, 150, 104, 116, NA,~
$ BPDia3 <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, ~
$ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ DirectChol <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.~
$ TotChol <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.~
$ UrineVol1 <int> 352, NA, 281, 139, 30, 202, 77, NA, ~
$ UrineFlow1 <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, ~
$ UrineVol2 <int> NA, NA, NA, NA, 246, NA, NA, NA, NA,~
$ UrineFlow2 <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA~
$ Diabetes <fct> No, No, No, No, Yes, No, No, No, No,~
$ DiabetesAge <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, ~
$ HealthGen <fct> Good, NA, Vgood, NA, Fair, Good, Goo~
$ DaysPhysHlthBad <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, N~
$ DaysMentHlthBad <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0~
$ LittleInterest <fct> Most, NA, NA, NA, Most, None, Severa~
$ Depressed <fct> Several, NA, NA, NA, Most, Most, Sev~
$ nPregnancies <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA~
$ nBabies <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA~
$ Age1stBaby <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, ~
$ SleepHrsNight <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA~
$ SleepTrouble <fct> Yes, NA, No, NA, No, No, Yes, NA, NA~
$ PhysActive <fct> No, NA, Yes, NA, No, Yes, No, NA, NA~
$ PhysActiveDays <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4,~
$ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
$ TVHrsDayChild <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA,~
$ CompHrsDayChild <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA,~
$ Alcohol12PlusYr <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, N~
$ AlcoholDay <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1~
$ AlcoholYear <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52~
$ SmokeNow <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA~
$ Smoke100 <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, ~
$ SmokeAge <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, ~
$ Marijuana <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, N~
$ AgeFirstMarij <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, ~
$ RegularMarij <fct> No, NA, NA, NA, NA, Yes, No, NA, NA,~
$ AgeRegMarij <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, ~
$ HardDrugs <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, N~
$ SexEver <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, ~
$ SexAge <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, N~
$ SexNumPartnLife <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA~
$ SexNumPartYear <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA,~
$ SameSex <fct> No, NA, NA, NA, No, No, Yes, NA, NA,~
$ SexOrientation <fct> Heterosexual, NA, NA, NA, NA, Hetero~
$ WTINT2YR <dbl> 80100.544, 53901.104, 13953.078, 116~
$ WTMEC2YR <dbl> 81528.772, 56995.035, 14509.279, 120~
$ SDMVPSU <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, ~
$ SDMVSTRA <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, ~
$ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
# Load the ggplot2 package
library(ggplot2)
# Use mutate to create a 4-year weight variable and call it WTMEC4YR
NHANESraw <- NHANESraw %>% mutate(WTMEC4YR = WTMEC2YR/2)
# Calculate the sum of this weight variable
NHANESraw %>% summarize(sum(WTMEC4YR))
# Plot the sample weights using boxplots, with Race1 on the x-axis
ggplot(NHANESraw, aes(x = Race1, y = WTMEC4YR)) + geom_boxplot()
# Load the survey package
library(survey)
# Specify the survey design
nhanes_design <- svydesign(
data = NHANESraw,
strata = ~SDMVSTRA,
id = ~SDMVPSU,
nest = TRUE,
weights = ~WTMEC4YR)
# Print a summary of this design
summary(nhanes_design)
Stratified 1 - level Cluster Sampling design (with replacement)
With (62) clusters.
svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU,
nest = TRUE, weights = ~WTMEC4YR)
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.986e-06 5.664e-05 1.054e-04 Inf 1.721e-04 Inf
Stratum Sizes:
75 76 77 78 79 80 81 82 83 84 85 86 87
obs 803 785 823 829 696 751 696 724 713 683 592 946 598
design.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2
actual.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2
88 89 90 91 92 93 94 95 96 97 98 99 100
obs 647 251 862 998 875 602 688 722 676 608 708 682 700
design.PSU 2 2 3 3 3 2 2 2 2 2 2 2 2
actual.PSU 2 2 3 3 3 2 2 2 2 2 2 2 2
101 102 103
obs 715 624 296
design.PSU 2 2 2
actual.PSU 2 2 2
Data variables:
[1] "ID" "SurveyYr" "Gender"
[4] "Age" "AgeMonths" "Race1"
[7] "Race3" "Education" "MaritalStatus"
[10] "HHIncome" "HHIncomeMid" "Poverty"
[13] "HomeRooms" "HomeOwn" "Work"
[16] "Weight" "Length" "HeadCirc"
[19] "Height" "BMI" "BMICatUnder20yrs"
[22] "BMI_WHO" "Pulse" "BPSysAve"
[25] "BPDiaAve" "BPSys1" "BPDia1"
[28] "BPSys2" "BPDia2" "BPSys3"
[31] "BPDia3" "Testosterone" "DirectChol"
[34] "TotChol" "UrineVol1" "UrineFlow1"
[37] "UrineVol2" "UrineFlow2" "Diabetes"
[40] "DiabetesAge" "HealthGen" "DaysPhysHlthBad"
[43] "DaysMentHlthBad" "LittleInterest" "Depressed"
[46] "nPregnancies" "nBabies" "Age1stBaby"
[49] "SleepHrsNight" "SleepTrouble" "PhysActive"
[52] "PhysActiveDays" "TVHrsDay" "CompHrsDay"
[55] "TVHrsDayChild" "CompHrsDayChild" "Alcohol12PlusYr"
[58] "AlcoholDay" "AlcoholYear" "SmokeNow"
[61] "Smoke100" "SmokeAge" "Marijuana"
[64] "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
[67] "HardDrugs" "SexEver" "SexAge"
[70] "SexNumPartnLife" "SexNumPartYear" "SameSex"
[73] "SexOrientation" "WTINT2YR" "WTMEC2YR"
[76] "SDMVPSU" "SDMVSTRA" "PregnantNow"
[79] "WTMEC4YR"
BMI categories are different for children and young adults younger than 20 so we will subset the data to only analyze adults of at least 20 years of age.
# Select adults of Age >= 20 with subset
nhanes_adult <- subset(nhanes_design, Age >= 20)
# Print a summary of this subset
summary(nhanes_adult)
Stratified 1 - level Cluster Sampling design (with replacement)
With (62) clusters.
subset(nhanes_design, Age >= 20)
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.986e-06 4.303e-05 8.107e-05 Inf 1.240e-04 Inf
Stratum Sizes:
75 76 77 78 79 80 81 82 83 84 85 86 87
obs 471 490 526 500 410 464 447 400 411 395 357 512 327
design.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2
actual.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2
88 89 90 91 92 93 94 95 96 97 98 99 100
obs 355 153 509 560 483 376 368 454 362 315 414 409 377
design.PSU 2 2 3 3 3 2 2 2 2 2 2 2 2
actual.PSU 2 2 3 3 3 2 2 2 2 2 2 2 2
101 102 103
obs 460 308 165
design.PSU 2 2 2
actual.PSU 2 2 2
Data variables:
[1] "ID" "SurveyYr" "Gender"
[4] "Age" "AgeMonths" "Race1"
[7] "Race3" "Education" "MaritalStatus"
[10] "HHIncome" "HHIncomeMid" "Poverty"
[13] "HomeRooms" "HomeOwn" "Work"
[16] "Weight" "Length" "HeadCirc"
[19] "Height" "BMI" "BMICatUnder20yrs"
[22] "BMI_WHO" "Pulse" "BPSysAve"
[25] "BPDiaAve" "BPSys1" "BPDia1"
[28] "BPSys2" "BPDia2" "BPSys3"
[31] "BPDia3" "Testosterone" "DirectChol"
[34] "TotChol" "UrineVol1" "UrineFlow1"
[37] "UrineVol2" "UrineFlow2" "Diabetes"
[40] "DiabetesAge" "HealthGen" "DaysPhysHlthBad"
[43] "DaysMentHlthBad" "LittleInterest" "Depressed"
[46] "nPregnancies" "nBabies" "Age1stBaby"
[49] "SleepHrsNight" "SleepTrouble" "PhysActive"
[52] "PhysActiveDays" "TVHrsDay" "CompHrsDay"
[55] "TVHrsDayChild" "CompHrsDayChild" "Alcohol12PlusYr"
[58] "AlcoholDay" "AlcoholYear" "SmokeNow"
[61] "Smoke100" "SmokeAge" "Marijuana"
[64] "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
[67] "HardDrugs" "SexEver" "SexAge"
[70] "SexNumPartnLife" "SexNumPartYear" "SameSex"
[73] "SexOrientation" "WTINT2YR" "WTMEC2YR"
[76] "SDMVPSU" "SDMVSTRA" "PregnantNow"
[79] "WTMEC4YR"
# Compare the number of observations in the full data to the adult data
nrow(nhanes_design)
[1] 20293
nrow(nhanes_adult)
[1] 11778
# Calculate the mean BMI in NHANESraw
bmi_mean_raw <- NHANESraw %>%
filter(Age >= 20) %>%
summarize(mean(BMI, na.rm = TRUE))
bmi_mean_raw
# Calculate the survey-weighted mean BMI of US adults
bmi_mean <- svymean(~BMI, design = nhanes_adult, na.rm = TRUE)
bmi_mean
mean SE
BMI 28.734 0.1235
# Draw a weighted histogram of BMI in the US population
NHANESraw %>%
filter(Age >= 20) %>%
ggplot(mapping = aes(x = BMI, weight = WTMEC4YR)) +
geom_histogram()+
geom_vline(xintercept = coef(bmi_mean), color="red")
NHANESraw %>%
filter(Age >= 20) %>%
ggplot(mapping = aes(x = BMI, weight = WTMEC4YR)) +
geom_density() +
geom_vline(xintercept = coef(bmi_mean), color="red")
From both the weighted histogram and density plot, one can clearly see that the distribution of BMI is slightly skewed to the left (positive skewness) as a few people have a much higher BMI
# Load the broom library
library(broom)
# Make a boxplot of BMI stratified by physically active status
NHANESraw %>%
filter(Age>=20) %>%
ggplot(mapping = aes(x = PhysActive, y = BMI, weight = WTMEC4YR)) + geom_boxplot()
# Conduct a t-test comparing mean BMI between physically active status
survey_ttest <- svyttest(BMI~PhysActive, design = nhanes_adult)
# Use broom to show the tidy results
tidy(survey_ttest)
# Estimate the proportion who are physically active by current smoking status
phys_by_smoke <- svyby(~PhysActive, by = ~SmokeNow,
FUN = svymean,
design = nhanes_adult,
keep.names = FALSE)
# Print the table
phys_by_smoke
# Plot the proportions
ggplot(data = phys_by_smoke,
aes(x = SmokeNow, y = PhysActiveYes, fill = SmokeNow)) +
geom_col() + labs(y = "Proportion Physically Active")
# Estimate mean BMI by current smoking status
BMI_by_smoke <- svyby(~BMI, by = ~SmokeNow,
FUN = svymean,
design = nhanes_adult,
na.rm = TRUE)
BMI_by_smoke
# Plot the distribution of BMI by current smoking status
NHANESraw %>%
filter(Age>=20, !is.na(SmokeNow)) %>%
ggplot(aes(x = SmokeNow, y = BMI, weight = WTMEC4YR)) + geom_boxplot()
We saw that people who smoke are less likely to be physically active and have a lower BMI on average. We also saw that people who are not physically active have a higher BMI on average.
# Plot the distribution of BMI by smoking and physical activity status
NHANESraw %>%
filter(Age>=20) %>%
ggplot(mapping = (aes(x=SmokeNow, y=BMI, weight = WTMEC4YR, color = PhysActive))) + geom_boxplot()
In the above plot, we see that people who are physically active tend to have lower BMI no matter their smoking status, and this is true even if they didn’t answer the question. However, we also see that smokers have lower BMI in general. Also, looking closely we see the difference in BMI comparing physically active people to non-physically active people is slightly smaller in smokers than in non-smokers.
# Fit a multiple regression model
mod1 <- svyglm(BMI ~ PhysActive*SmokeNow, design = nhanes_adult)
# Tidy the model results
tidy_mod1 <- tidy(mod1)
tidy_mod1
# Calculate expected mean difference in BMI for activity within non-smokers
diff_non_smoke <- tidy_mod1 %>%
filter(term=="PhysActiveYes") %>%
select(estimate)
diff_non_smoke
# Calculate expected mean difference in BMI for activity within smokers
diff_smoke <- tidy_mod1 %>%
filter(term%in%c("PhysActiveYes","PhysActiveYes:SmokeNowYes")) %>%
summarize(estimate = sum(estimate))
diff_smoke
We fit a linear regression model where the association of physical activity with BMI could vary by smoking status. The interaction between physical activity and smoking has a small p-value, which suggests the association does vary by smoking status. The difference between physically active and non-physically active people is larger in magnitude in the non-smoker population.
# Adjust mod1 for other possible confounders
mod2 <- svyglm(BMI ~ PhysActive*SmokeNow+Race1+Alcohol12PlusYr+Gender,
design = nhanes_adult)
# Tidy the output
tidy(mod2)