The purpose of this tutorial is to provide brief introduction to
conducting design-based analysis of complex sample survey data using the
survey package in R.
We will be using data from the National Survey on Drug Use and Health (NSDUH). NSDUH is a repeated cross-sectional household survey distributed annually to a sample of non-institutionalized residents of the United States. Its primary aim is to monitor drug and alcohol use by providing nationally-representative estimates of substance use behaviors, mental health status, and correlates of these health outcomes/behaviors. Additional information on NSDUH can be found here.
To improve estimate precision and reduce overall survey costs, households (and specific individuals within households) who participate in NSDUH are selected through a complex set of cluster and stratified sampling procedures. Certain population subgroups (including younger people, Hispanic people, Black/African-American people) are also oversampled relative to their general population size to produce more reliable estimates (i.e., through standard error reduction). This ultimately generates a sample in which there is unequal probability of selection across sampling units, and produces somewhat homogeneous samples within clusters due to shared community characteristics (i.e., nonindependence). Together, these violate simple random sample assumptions (equal probability of selection, independence of observations) common across basic statistical procedures.
In order to generate accurate nationally-representative estimates (e.g., proportions, means, standard errors) from complex sample survey data, we have to conduct design-based analysis (i.e., analysis that adjusts for unequal probability of selection). For more information on NSDUH sample design can be found in the NSDUH 2019 Sample Design Report. Guidance for statistical analysis can be found in the NSDUH 2019 Statistical Inference Report.
The data frame included in this tutorial
(nsduh_20152019_subset.RData) includes 5 combined waves
(2015, 2016, 2017, 2018, 2019) and a subset of 15 variables to reduce
the overall file size.
Combining waves: To increase the sample size of relatively small population subgroups, it is common practice in public health research to combine multiple waves from national cross-sectional datasets like NSDUH. However, additional procedures are needed to adjust sampling weights, and these simple steps are outlined in this tutorial.
Unconditional analysis of a subpopulation: In addition, researchers are often interested in studying a specific subpopulation (e.g., adolescents). Here, it is neccessary to identify the subpopulation within the overall dataset for the analysis at hand. Avoid dropping unused observations as this removes key information used to generate variance estimates. Additional background information on subpopulation analysis can be found in West, Berglund, & Herringa, 2008 and in Lumley, 2021.
svydesignsvymean,
svyby, and svycipropsvyglmsvydiags package
(under construction!)Note: A useful resource for analyzing complex sample data in R is available here.
First, we will load the necessary R packages relevant
for this analysis.
If you do not have these packages installed, follow these steps:
The most important package is the survey package (Lumley,
2021). This package provides flexible tools to generate design-based
estimates that account for the complex sample survey design of
NSDUH.
We will also use the svydiags package to compute
diagnostic data for regression models fitted with complex survey data Valliant,
2018
This project also uses the core suite of tidyverse (Wickham,
2021) packages and the naniar (Tierny
et al., 2021) packages to assist with data management procedures ,
the stats package to run statistical procedures (R
Core Team), and the jtools package for miscellaneous
statistical procedures (Long,
2021).
# load packages
library(survey) ## run design-based analysis
library(svydiags) ## regression diagnostics
library(tidyverse) ## data management
library(naniar) ## data management (recoding missing values)
library(stats) ## run statistical procedures
library(jtools) ## miscellaneous statistical tools## Warning: package 'jtools' was built under R version 4.4.1
NOTE: The data frame must be located in the same folder as this RMarkdown file in order to be properly loaded.
The following code outlines some initial data management steps. I have renamed some variables to make it easier to quickly identify and interpret results. See the NSDUH 2019 Public Use File Codebook for more information on variables and coding schemes in NSDUH.
Subpopulation: In this tutorial, the subpopulation of interest are adults (aged 18+) in the United States with non-missing data on the following variables: depression_bin (past-year major depressive episode, binary coded), stress_cont (Kessler K6 distress scale, continuous: range 0-24), newrace2 (race/ethnicity), irsex (gender), sexident (sexual orientation), catag6 (age category), income (annual household income level), and anyhlti2 (any health insurance).
Following the procedures below, we identify a subpopulation of N=207,479 participants who meet our study eligibility criteria.
The decision to identify a subpopulation in complex sample survey analysis has critical implications due to its impact on variance calculations. While mean and proportion estimates will not be affected, failing to conduct this step could lead to incorrect inferences (e.g., significant results that become insignificant when accounting for complete design-based variance information).
When we identify the subpop, R will (1) use our subpopulation data to generate estimates (means, proprotions, etc.) and (2) use the entire data frame to calculate standard errors and confidence intervals.
When reporting results from subpopulation analyses, you should report the subpopulation sample size as opposed to the entire data frame sample size. In our tutorial, this means that we will report a analytic sample size of N=207,479 as opposed to N=282,768 (total NSDUH 2015-2019 sample). In addition, if we were to additionally stratify our primary analysis by gender, we could report the total subpop N, female subpop N, and male subpop N.
# Load the data set
load(file="nsduh_20152019_subset.RData")
# If the data frame is not automatically named "nsduh", uncomment the code below and run it.
## nsduh <- dplyr::rename(nsduh_20152019_subset) # shorten the data set name
## rm(nsduh_20152019_subset) # remove old data frame
# set invalid responses to missing (NA)
nsduh <- nsduh %>%
replace_with_na(replace = list(sexident = c(85, 89, 94, 97, 98, 99))) %>%
replace_with_na(replace = list(anyhlti2 = c(94, 97, 98)))
# create factor variables for categorical vars
nsduh$amdeyr <- as.factor(nsduh$amdeyr) # past-year major depressive episode (MDE)
nsduh$newrace2 <- as.factor(nsduh$newrace2) # racial/ethnic identity
nsduh$irsex <- as.factor(nsduh$irsex) # gender (binary woman/man)
nsduh$sexident <- as.factor(nsduh$sexident) # sexual orientation
nsduh$catag6 <- as.factor(nsduh$catag6) # age category
nsduh$income <- as.factor(nsduh$income) # annual household income level
nsduh$anyhlti2 <- as.factor(nsduh$anyhlti2) # any health insurance (1=yes)
nsduh$year <- as.factor(nsduh$year) # year of NSDUH survey completion
# rename/mutate variables for ease of use in analysis
nsduh <- nsduh %>%
mutate(
white = factor(ifelse(newrace2 == "1", 1, 0)), # White
black = factor(ifelse(newrace2 == "2", 1, 0)), # Black/African-American
naan = factor(ifelse(newrace2 == "3", 1, 0)), # Native American or Alaska Native
nhpi = factor(ifelse(newrace2 == "4", 1, 0)), # Native Hawaiian/Pacific Islander
asian = factor(ifelse(newrace2 == "5", 1, 0)), # Asian
multi = factor(ifelse(newrace2 == "6", 1, 0)), # Multiracial
hispanic = factor(ifelse(newrace2 == "7", 1, 0)), # Hispanic
man = factor(ifelse(irsex == "1", 1, 0)), # Men
woman = factor(ifelse(irsex == "2", 1, 0)), # Women
hetero = factor(ifelse(sexident == "1", 1, 0)), # Heterosexual
gay = factor(ifelse(sexident == "2", 1, 0)), # Gay
bi = factor(ifelse(sexident == "3", 1, 0)), # Bisexual
depression_bin = factor(ifelse(amdeyr == "1", 1, 0)), # Past-year MDE
healthins = factor(ifelse(anyhlti2 == "1", 1, 0)), # any health insurance
stress_cont = k6scmon) %>% # K6 Distress Scale
mutate(
subpop = factor(ifelse(!is.na(newrace2) & !is.na(sexident) & !is.na(irsex) & !is.na(depression_bin) & !is.na(stress_cont) & !is.na(income) & !is.na(catag6) & !is.na(healthins) & (catag6!="1"), 1, 0))) # this creates a "subpop" variable to identify all observations in the dataset that have non-missing values on each of the study variables of interest.
table(nsduh$subpop) # This indicates that 207,479 participants have non-missing data on our study variables.##
## 0 1
## 75289 207479
NSDUH includes the following survey weight variables:
vestr (variance estimation stratum),
verep (variance estimation cluster replicates), and
analwt_c (person-level analysis weight). The
svydesign procedure takes these variables as arguments to
generate a design object that we can easily include in our analysis
going forward.
Combining waves: Since we are combining multiple data collection waves in this tutorial, we need to divide the person-level analysis weight by the number of waves included.
Subpopulation: Using the “subpop” variable we
generated in the prior step, we can simply subset the
svydesign object to include our subpopulation
(N=207,479).
# divide person-level analysis weight by 5 (5 waves included: 2015-2019)
nsduh$adjwt_5 <- nsduh$analwt_c/5
# Note: if only using one wave of data (e.g., only 2015), the above step is not required.
# build up the primary survey object
nsduh_design <-
svydesign(
id = ~verep,
strata = ~vestr,
weights = ~adjwt_5,
data = nsduh,
nest = TRUE)
# svydesign for the subpopulation -- we will use this throughout
nsduh_design_sp <- subset(
nsduh_design,
subpop=="1")
summary(nsduh_design_sp) ## describes the svydesign object## Stratified 1 - level Cluster Sampling design (with replacement)
## With (100) clusters.
## subset(nsduh_design, subpop == "1")
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000419 0.0006735 0.0013508 0.0033004 0.0030106 1.2997441
## Stratum Sizes:
## 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010 40011
## obs 4190 4032 3862 3839 4333 4070 4096 4733 4571 4477 4132
## design.PSU 2 2 2 2 2 2 2 2 2 2 2
## actual.PSU 2 2 2 2 2 2 2 2 2 2 2
## 40012 40013 40014 40015 40016 40017 40018 40019 40020 40021 40022
## obs 3949 4139 3744 4154 4077 4127 4213 4601 4587 4316 3872
## design.PSU 2 2 2 2 2 2 2 2 2 2 2
## actual.PSU 2 2 2 2 2 2 2 2 2 2 2
## 40023 40024 40025 40026 40027 40028 40029 40030 40031 40032 40033
## obs 3816 4122 4434 4101 3620 4225 4355 4096 3898 3695 3784
## design.PSU 2 2 2 2 2 2 2 2 2 2 2
## actual.PSU 2 2 2 2 2 2 2 2 2 2 2
## 40034 40035 40036 40037 40038 40039 40040 40041 40042 40043 40044
## obs 3902 4057 3773 3908 4157 4513 4701 4427 4093 4007 3815
## design.PSU 2 2 2 2 2 2 2 2 2 2 2
## actual.PSU 2 2 2 2 2 2 2 2 2 2 2
## 40045 40046 40047 40048 40049 40050
## obs 4561 3976 4363 4404 4415 4147
## design.PSU 2 2 2 2 2 2
## actual.PSU 2 2 2 2 2 2
## Data variables:
## [1] "questid2" "amhtxrc3" "k6scmon" "amdeyr"
## [5] "sexident" "irsex" "catag3" "catag6"
## [9] "newrace2" "anyhlti2" "income" "analwt_c"
## [13] "vestr" "verep" "year" "white"
## [17] "black" "naan" "nhpi" "asian"
## [21] "multi" "hispanic" "man" "woman"
## [25] "hetero" "gay" "bi" "depression_bin"
## [29] "healthins" "stress_cont" "subpop" "adjwt_5"
# Survey-weighted means and 95% CI
## Estimate mean K6 Distress Scale scores for entire subpop
k6mean <- svymean(~stress_cont, nsduh_design_sp)
k6mean## mean SE
## stress_cont 3.8621 0.013
## 2.5 % 97.5 %
## stress_cont 3.836671 3.887529
## Estimate mean for individual subgroups (gender)
k6mean_gender <- svyby(~stress_cont, ~woman, design=nsduh_design_sp, svymean)
k6mean_gender## woman stress_cont se
## 0 0 3.534490 0.01824574
## 1 1 4.170166 0.01944375
## 2.5 % 97.5 %
## 0 3.498729 3.570251
## 1 4.132057 4.208275
# Survey-weighted proportions and 95% CI
## Race/ethnicity
svyciprop(~white, nsduh_design_sp, method="logit")## 2.5% 97.5%
## white 0.646 0.641 0.651
## 2.5% 97.5%
## black 0.118 0.114 0.121
## 2.5% 97.5%
## hispanic 0.158 0.154 0.162
## 2.5% 97.5%
## asian 0.0528 0.0506 0.0551
## 2.5% 97.5%
## naan 0.00544 0.00497 0.00594
## 2.5% 97.5%
## nhpi 0.00359 0.00315 0.00408
## 2.5% 97.5%
## multi 0.0171 0.0165 0.0177
## 2.5% 97.5%
## woman 0.515 0.512 0.519
## 2.5% 97.5%
## man 0.485 0.481 0.488
## 2.5% 97.5%
## hetero 0.950 0.948 0.951
## 2.5% 97.5%
## gay 0.0193 0.0184 0.0202
## 2.5% 97.5%
## bi 0.0310 0.0302 0.0319
## 2.5% 97.5%
## depression_bin 0.0718 0.0702 0.0735
## 2.5% 97.5%
## healthins 0.903 0.901 0.905
## 2.5% 97.5%
## I(catag6 == 2) 0.137 0.136 0.139
## 2.5% 97.5%
## I(catag6 == 3) 0.159 0.157 0.162
## 2.5% 97.5%
## I(catag6 == 4) 0.247 0.244 0.250
## 2.5% 97.5%
## I(catag6 == 5) 0.254 0.250 0.258
## 2.5% 97.5%
## I(catag6 == 6) 0.203 0.199 0.206
## 2.5% 97.5%
## I(income == 1) 0.159 0.156 0.162
## 2.5% 97.5%
## I(income == 2) 0.293 0.289 0.297
## 2.5% 97.5%
## I(income == 3) 0.161 0.158 0.163
## 2.5% 97.5%
## I(income == 4) 0.387 0.382 0.393
## Graphing continuous variables
# Note: Currently ggplot2 is not equipped to handle complex sample data, but there are some packages online that are beginning to overcome this hurdle.
svyhist(~stress_cont, nsduh_design_sp) # positive skew# Do Kessler-6 Distress Scale scores differ between men and women?
## Independent samples t-test: svyttest(cont~cat, design)
t <- svyttest(stress_cont ~ woman,
design=nsduh_design_sp)
t ##
## Design-based t-test
##
## data: stress_cont ~ woman
## t = 23.235, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## 0.5806968 0.6906566
## sample estimates:
## difference in mean
## 0.6356767
## Yes, women have higher scores.
## Mean difference = 0.61 (95% CI: 0.56, 0.67)
## Notes:
## paired sample t-test can be run as: svyttest(I(var1-var2)~0, design)
## one-sample t-test can be run as: svyttest(var~0, design)
## Wilcoxon signed rank test (non-parametric independent t-test)
w <- svyranktest(stress_cont ~ woman, nsduh_design_sp, test=c("wilcoxon"))
w##
## Design-based KruskalWallis test
##
## data: stress_cont ~ woman
## t = 24.994, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score
## 0.0466829
# Does the depression prevalence differ by sexual orientation?
## Design-based Wald (chi-square) test of association: svychisq(~cat+cat, design, statistic="adjWald")
c <- svychisq(~depression_bin + sexident,
design=nsduh_design_sp,
statistic="adjWald")
c##
## Design-based Wald test of association
##
## data: svychisq(~depression_bin + sexident, design = nsduh_design_sp, statistic = "adjWald")
## F = 399.4, ndf = 2, ddf = 49, p-value < 2.2e-16
# Linear regression
## Are there differences in Kessler-6 Distress Scale scores by race/ethnicity?
### Simple linear regression
a1 <- svyglm(stress_cont ~ black + hispanic + asian + naan + nhpi + multi,
nsduh_design_sp,
family=gaussian())
summary(a1)##
## Call:
## svyglm(formula = stress_cont ~ black + hispanic + asian + naan +
## nhpi + multi, design = nsduh_design_sp, family = gaussian())
##
## Survey design:
## subset(nsduh_design, subpop == "1")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.92355 0.01639 239.390 < 2e-16 ***
## black1 -0.11665 0.03597 -3.243 0.002261 **
## hispanic1 -0.38058 0.05366 -7.093 8.36e-09 ***
## asian1 -0.18832 0.05021 -3.750 0.000513 ***
## naan1 0.42018 0.18459 2.276 0.027747 *
## nhpi1 0.23248 0.20797 1.118 0.269680
## multi1 1.11687 0.13213 8.453 9.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 20.16392)
##
## Number of Fisher Scoring iterations: 2
# Yes, compared to white participants: Black, Native Hawaiian/Pacific Islander, and Hispanic participants had lower stress scores; Multiracial participants had higher stress scores.
## Is race/ethnicity still a significant predictor of Kessler-6 Distress Scale scores after controlling for income level, age category, and health insurance status?
### Multivariate linear regression
a2 <- svyglm(stress_cont ~ black + hispanic + asian + naan + nhpi + multi + income + catag6 + anyhlti2,
nsduh_design_sp,
family=gaussian())
summary(a2) ## Note: we should assess regression diagnostics before reporting results##
## Call:
## svyglm(formula = stress_cont ~ black + hispanic + asian + naan +
## nhpi + multi + income + catag6 + anyhlti2, design = nsduh_design_sp,
## family = gaussian())
##
## Survey design:
## subset(nsduh_design, subpop == "1")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.51195 0.04480 167.661 < 2e-16 ***
## black1 -0.77646 0.03505 -22.153 < 2e-16 ***
## hispanic1 -1.11188 0.05068 -21.939 < 2e-16 ***
## asian1 -0.48648 0.05638 -8.629 2.73e-10 ***
## naan1 -0.34826 0.17264 -2.017 0.05117 .
## nhpi1 -0.30741 0.19378 -1.586 0.12139
## multi1 0.54902 0.12523 4.384 9.69e-05 ***
## income2 -1.00848 0.04295 -23.483 < 2e-16 ***
## income3 -1.47684 0.05788 -25.515 < 2e-16 ***
## income4 -1.92838 0.04597 -41.944 < 2e-16 ***
## catag63 -1.09563 0.04712 -23.254 < 2e-16 ***
## catag64 -1.93820 0.03774 -51.363 < 2e-16 ***
## catag65 -2.75443 0.05031 -54.747 < 2e-16 ***
## catag66 -3.53821 0.04039 -87.604 < 2e-16 ***
## anyhlti22 -0.16256 0.05033 -3.230 0.00264 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 18.42876)
##
## Number of Fisher Scoring iterations: 2
# Logistic regression
## What is the association between sexual orientation and depression?
a3 <- svyglm(depression_bin ~ gay + bi,
nsduh_design_sp,
family=binomial(link=logit))## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
##
## Call:
## svyglm(formula = depression_bin ~ gay + bi, design = nsduh_design_sp,
## family = binomial(link = logit))
##
## Survey design:
## subset(nsduh_design, subpop == "1")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.67103 0.01218 -219.27 <2e-16 ***
## gay1 0.83155 0.04875 17.06 <2e-16 ***
## bi1 1.57098 0.03730 42.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.000005)
##
## Number of Fisher Scoring iterations: 5
## (Intercept) gay1 bi1
## 0.06918116 2.29687014 4.81133797
## 2.5 % 97.5 %
## (Intercept) 0.0675073 0.07089653
## gay1 2.0824074 2.53342001
## bi1 4.4636861 5.18606649