Background and Purpose

The purpose of this tutorial is to provide brief introduction to conducting design-based analysis of complex sample survey data using the survey package in R.

We will be using data from the National Survey on Drug Use and Health (NSDUH). NSDUH is a repeated cross-sectional household survey distributed annually to a sample of non-institutionalized residents of the United States. Its primary aim is to monitor drug and alcohol use by providing nationally-representative estimates of substance use behaviors, mental health status, and correlates of these health outcomes/behaviors. Additional information on NSDUH can be found here.

To improve estimate precision and reduce overall survey costs, households (and specific individuals within households) who participate in NSDUH are selected through a complex set of cluster and stratified sampling procedures. Certain population subgroups (including younger people, Hispanic people, Black/African-American people) are also oversampled relative to their general population size to produce more reliable estimates (i.e., through standard error reduction). This ultimately generates a sample in which there is unequal probability of selection across sampling units, and produces somewhat homogeneous samples within clusters due to shared community characteristics (i.e., nonindependence). Together, these violate simple random sample assumptions (equal probability of selection, independence of observations) common across basic statistical procedures.

In order to generate accurate nationally-representative estimates (e.g., proportions, means, standard errors) from complex sample survey data, we have to conduct design-based analysis (i.e., analysis that adjusts for unequal probability of selection). For more information on NSDUH sample design can be found in the NSDUH 2019 Sample Design Report. Guidance for statistical analysis can be found in the NSDUH 2019 Statistical Inference Report.

Data frame information: NSDUH 2015-2019

The data frame included in this tutorial (nsduh_20152019_subset.RData) includes 5 combined waves (2015, 2016, 2017, 2018, 2019) and a subset of 15 variables to reduce the overall file size.

Combining waves: To increase the sample size of relatively small population subgroups, it is common practice in public health research to combine multiple waves from national cross-sectional datasets like NSDUH. However, additional procedures are needed to adjust sampling weights, and these simple steps are outlined in this tutorial.

Unconditional analysis of a subpopulation: In addition, researchers are often interested in studying a specific subpopulation (e.g., adolescents). Here, it is neccessary to identify the subpopulation within the overall dataset for the analysis at hand. Avoid dropping unused observations as this removes key information used to generate variance estimates. Additional background information on subpopulation analysis can be found in West, Berglund, & Herringa, 2008 and in Lumley, 2021.

Key issues and procedures covered

  1. Load packages
  2. Load the data frame and identify a subpopulation of interest
  3. Setup design-based analysis using svydesign
  4. Descriptive statistics using svymean, svyby, and svyciprop
  5. T-tests and design-based Wald (chi-square) tests of independence
  6. Regression analysis using svyglm
  7. Regression diagnositics using the svydiags package (under construction!)

Note: A useful resource for analyzing complex sample data in R is available here.

Setup

First, we will load the necessary R packages relevant for this analysis.

If you do not have these packages installed, follow these steps:

  1. Click on the Tools menu
  2. Type survey, svydiags, tidyverse, naniar, stats, and jtools into the dialog box that opens.
  3. Click install (R will do stuff for a while, this might look like errors but is fine)

The most important package is the survey package (Lumley, 2021). This package provides flexible tools to generate design-based estimates that account for the complex sample survey design of NSDUH.

We will also use the svydiags package to compute diagnostic data for regression models fitted with complex survey data Valliant, 2018

This project also uses the core suite of tidyverse (Wickham, 2021) packages and the naniar (Tierny et al., 2021) packages to assist with data management procedures , the stats package to run statistical procedures (R Core Team), and the jtools package for miscellaneous statistical procedures (Long, 2021).

# load packages
library(survey) ## run design-based analysis
library(svydiags) ## regression diagnostics
library(tidyverse) ## data management
library(naniar) ## data management (recoding missing values)
library(stats) ## run statistical procedures
library(jtools) ## miscellaneous statistical tools
## Warning: package 'jtools' was built under R version 4.4.1
library(haven)

Load the data frame and identify a subpopulation of interest

NOTE: The data frame must be located in the same folder as this RMarkdown file in order to be properly loaded.

The following code outlines some initial data management steps. I have renamed some variables to make it easier to quickly identify and interpret results. See the NSDUH 2019 Public Use File Codebook for more information on variables and coding schemes in NSDUH.

Subpopulation: In this tutorial, the subpopulation of interest are adults (aged 18+) in the United States with non-missing data on the following variables: depression_bin (past-year major depressive episode, binary coded), stress_cont (Kessler K6 distress scale, continuous: range 0-24), newrace2 (race/ethnicity), irsex (gender), sexident (sexual orientation), catag6 (age category), income (annual household income level), and anyhlti2 (any health insurance).

Following the procedures below, we identify a subpopulation of N=207,479 participants who meet our study eligibility criteria.

Inference and reporting implications for subpopulation analysis

The decision to identify a subpopulation in complex sample survey analysis has critical implications due to its impact on variance calculations. While mean and proportion estimates will not be affected, failing to conduct this step could lead to incorrect inferences (e.g., significant results that become insignificant when accounting for complete design-based variance information).

When we identify the subpop, R will (1) use our subpopulation data to generate estimates (means, proprotions, etc.) and (2) use the entire data frame to calculate standard errors and confidence intervals.

When reporting results from subpopulation analyses, you should report the subpopulation sample size as opposed to the entire data frame sample size. In our tutorial, this means that we will report a analytic sample size of N=207,479 as opposed to N=282,768 (total NSDUH 2015-2019 sample). In addition, if we were to additionally stratify our primary analysis by gender, we could report the total subpop N, female subpop N, and male subpop N.

# Load the data set
load(file="nsduh_20152019_subset.RData")

# If the data frame is not automatically named "nsduh", uncomment the code below and run it.
## nsduh <- dplyr::rename(nsduh_20152019_subset) # shorten the data set name
## rm(nsduh_20152019_subset) # remove old data frame

# set invalid responses to missing (NA)
nsduh <- nsduh %>% 
  replace_with_na(replace = list(sexident = c(85, 89, 94, 97, 98, 99))) %>% 
  replace_with_na(replace = list(anyhlti2 = c(94, 97, 98)))

# create factor variables for categorical vars
nsduh$amdeyr <- as.factor(nsduh$amdeyr) # past-year major depressive episode (MDE)
nsduh$newrace2 <- as.factor(nsduh$newrace2) # racial/ethnic identity
nsduh$irsex <- as.factor(nsduh$irsex) # gender (binary woman/man)
nsduh$sexident <- as.factor(nsduh$sexident) # sexual orientation
nsduh$catag6 <- as.factor(nsduh$catag6) # age category
nsduh$income <- as.factor(nsduh$income) # annual household income level
nsduh$anyhlti2 <- as.factor(nsduh$anyhlti2) # any health insurance (1=yes)
nsduh$year <- as.factor(nsduh$year) # year of NSDUH survey completion

# rename/mutate variables for ease of use in analysis
nsduh <- nsduh %>%
  mutate(
    white = factor(ifelse(newrace2 == "1", 1, 0)), # White
    black = factor(ifelse(newrace2 == "2", 1, 0)), # Black/African-American
    naan = factor(ifelse(newrace2 == "3", 1, 0)), # Native American or Alaska Native
    nhpi = factor(ifelse(newrace2 == "4", 1, 0)), # Native Hawaiian/Pacific Islander
    asian = factor(ifelse(newrace2 == "5", 1, 0)), # Asian
    multi = factor(ifelse(newrace2 == "6", 1, 0)), # Multiracial
    hispanic = factor(ifelse(newrace2 == "7", 1, 0)), # Hispanic
    man = factor(ifelse(irsex == "1", 1, 0)), # Men
    woman = factor(ifelse(irsex == "2", 1, 0)), # Women
    hetero = factor(ifelse(sexident == "1", 1, 0)), # Heterosexual
    gay = factor(ifelse(sexident == "2", 1, 0)), # Gay
    bi = factor(ifelse(sexident == "3", 1, 0)), # Bisexual
    depression_bin = factor(ifelse(amdeyr == "1", 1, 0)), # Past-year MDE
    healthins = factor(ifelse(anyhlti2 == "1", 1, 0)), # any health insurance
    stress_cont = k6scmon) %>% # K6 Distress Scale
  mutate(
    subpop = factor(ifelse(!is.na(newrace2) & !is.na(sexident) & !is.na(irsex) & !is.na(depression_bin) & !is.na(stress_cont) & !is.na(income) & !is.na(catag6) & !is.na(healthins) & (catag6!="1"), 1, 0))) # this creates a "subpop" variable to identify all observations in the dataset that have non-missing values on each of the study variables of interest.

table(nsduh$subpop) # This indicates that 207,479 participants have non-missing data on our study variables.
## 
##      0      1 
##  75289 207479

Setup design-based analysis

NSDUH includes the following survey weight variables: vestr (variance estimation stratum), verep (variance estimation cluster replicates), and analwt_c (person-level analysis weight). The svydesign procedure takes these variables as arguments to generate a design object that we can easily include in our analysis going forward.

Combining waves: Since we are combining multiple data collection waves in this tutorial, we need to divide the person-level analysis weight by the number of waves included.

Subpopulation: Using the “subpop” variable we generated in the prior step, we can simply subset the svydesign object to include our subpopulation (N=207,479).

# divide person-level analysis weight by 5 (5 waves included: 2015-2019)
nsduh$adjwt_5 <- nsduh$analwt_c/5
# Note: if only using one wave of data (e.g., only 2015), the above step is not required.

# build up the primary survey object
nsduh_design <- 
  svydesign(
    id = ~verep,
    strata = ~vestr,
    weights = ~adjwt_5,
    data = nsduh,
    nest = TRUE)

# svydesign for the subpopulation -- we will use this throughout
nsduh_design_sp <- subset(
  nsduh_design, 
  subpop=="1")
summary(nsduh_design_sp) ## describes the svydesign object
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (100) clusters.
## subset(nsduh_design, subpop == "1")
## Probabilities:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000419 0.0006735 0.0013508 0.0033004 0.0030106 1.2997441 
## Stratum Sizes: 
##            40001 40002 40003 40004 40005 40006 40007 40008 40009 40010 40011
## obs         4190  4032  3862  3839  4333  4070  4096  4733  4571  4477  4132
## design.PSU     2     2     2     2     2     2     2     2     2     2     2
## actual.PSU     2     2     2     2     2     2     2     2     2     2     2
##            40012 40013 40014 40015 40016 40017 40018 40019 40020 40021 40022
## obs         3949  4139  3744  4154  4077  4127  4213  4601  4587  4316  3872
## design.PSU     2     2     2     2     2     2     2     2     2     2     2
## actual.PSU     2     2     2     2     2     2     2     2     2     2     2
##            40023 40024 40025 40026 40027 40028 40029 40030 40031 40032 40033
## obs         3816  4122  4434  4101  3620  4225  4355  4096  3898  3695  3784
## design.PSU     2     2     2     2     2     2     2     2     2     2     2
## actual.PSU     2     2     2     2     2     2     2     2     2     2     2
##            40034 40035 40036 40037 40038 40039 40040 40041 40042 40043 40044
## obs         3902  4057  3773  3908  4157  4513  4701  4427  4093  4007  3815
## design.PSU     2     2     2     2     2     2     2     2     2     2     2
## actual.PSU     2     2     2     2     2     2     2     2     2     2     2
##            40045 40046 40047 40048 40049 40050
## obs         4561  3976  4363  4404  4415  4147
## design.PSU     2     2     2     2     2     2
## actual.PSU     2     2     2     2     2     2
## Data variables:
##  [1] "questid2"       "amhtxrc3"       "k6scmon"        "amdeyr"        
##  [5] "sexident"       "irsex"          "catag3"         "catag6"        
##  [9] "newrace2"       "anyhlti2"       "income"         "analwt_c"      
## [13] "vestr"          "verep"          "year"           "white"         
## [17] "black"          "naan"           "nhpi"           "asian"         
## [21] "multi"          "hispanic"       "man"            "woman"         
## [25] "hetero"         "gay"            "bi"             "depression_bin"
## [29] "healthins"      "stress_cont"    "subpop"         "adjwt_5"

Descriptive statistics

# Survey-weighted means and 95% CI
## Estimate mean K6 Distress Scale scores for entire subpop
k6mean <- svymean(~stress_cont, nsduh_design_sp)
k6mean
##               mean    SE
## stress_cont 3.8621 0.013
## Estimate 95% confidence interval for entire subpop
confint(k6mean)
##                2.5 %   97.5 %
## stress_cont 3.836671 3.887529
## Estimate mean for individual subgroups (gender)
k6mean_gender <- svyby(~stress_cont, ~woman, design=nsduh_design_sp, svymean)
k6mean_gender
##   woman stress_cont         se
## 0     0    3.534490 0.01824574
## 1     1    4.170166 0.01944375
## Estimate 95% confidence interval for individual subgroups
confint(k6mean_gender)
##      2.5 %   97.5 %
## 0 3.498729 3.570251
## 1 4.132057 4.208275
# Survey-weighted proportions and 95% CI
## Race/ethnicity
svyciprop(~white, nsduh_design_sp, method="logit")
##              2.5% 97.5%
## white 0.646 0.641 0.651
svyciprop(~black, nsduh_design_sp, method="logit")
##              2.5% 97.5%
## black 0.118 0.114 0.121
svyciprop(~hispanic, nsduh_design_sp, method="logit")
##                 2.5% 97.5%
## hispanic 0.158 0.154 0.162
svyciprop(~asian, nsduh_design_sp, method="logit")
##                2.5%  97.5%
## asian 0.0528 0.0506 0.0551
svyciprop(~naan, nsduh_design_sp, method="logit")
##                 2.5%   97.5%
## naan 0.00544 0.00497 0.00594
svyciprop(~nhpi, nsduh_design_sp, method="logit")
##                 2.5%   97.5%
## nhpi 0.00359 0.00315 0.00408
svyciprop(~multi, nsduh_design_sp, method="logit")
##                2.5%  97.5%
## multi 0.0171 0.0165 0.0177
## Gender
svyciprop(~woman, nsduh_design_sp, method="logit")
##              2.5% 97.5%
## woman 0.515 0.512 0.519
svyciprop(~man, nsduh_design_sp, method="logit")
##            2.5% 97.5%
## man 0.485 0.481 0.488
##Sexual orientation
svyciprop(~hetero, nsduh_design_sp, method="logit")
##               2.5% 97.5%
## hetero 0.950 0.948 0.951
svyciprop(~gay, nsduh_design_sp, method="logit")
##              2.5%  97.5%
## gay 0.0193 0.0184 0.0202
svyciprop(~bi, nsduh_design_sp, method="logit")
##             2.5%  97.5%
## bi 0.0310 0.0302 0.0319
## Past-year major depressive episode
svyciprop(~depression_bin, nsduh_design_sp, method="logit")
##                         2.5%  97.5%
## depression_bin 0.0718 0.0702 0.0735
## Any health insurance
svyciprop(~healthins, nsduh_design_sp, method="logit")
##                  2.5% 97.5%
## healthins 0.903 0.901 0.905
## Age category
svyciprop(~I(catag6==2), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(catag6 == 2) 0.137 0.136 0.139
svyciprop(~I(catag6==3), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(catag6 == 3) 0.159 0.157 0.162
svyciprop(~I(catag6==4), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(catag6 == 4) 0.247 0.244 0.250
svyciprop(~I(catag6==5), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(catag6 == 5) 0.254 0.250 0.258
svyciprop(~I(catag6==6), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(catag6 == 6) 0.203 0.199 0.206
## Income level
svyciprop(~I(income==1), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(income == 1) 0.159 0.156 0.162
svyciprop(~I(income==2), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(income == 2) 0.293 0.289 0.297
svyciprop(~I(income==3), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(income == 3) 0.161 0.158 0.163
svyciprop(~I(income==4), nsduh_design_sp, method="logit")
##                       2.5% 97.5%
## I(income == 4) 0.387 0.382 0.393
## Graphing continuous variables
# Note: Currently ggplot2 is not equipped to handle complex sample data, but there are some packages online that are beginning to overcome this hurdle.
svyhist(~stress_cont, nsduh_design_sp) # positive skew

svyboxplot(~stress_cont~1, nsduh_design_sp, all.outliers = TRUE) 

T-tests and design-based Wald (chi-square) tests of independence

# Do Kessler-6 Distress Scale scores differ between men and women?
## Independent samples t-test: svyttest(cont~cat, design)
t <- svyttest(stress_cont ~ woman, 
              design=nsduh_design_sp)
t 
## 
##  Design-based t-test
## 
## data:  stress_cont ~ woman
## t = 23.235, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  0.5806968 0.6906566
## sample estimates:
## difference in mean 
##          0.6356767
## Yes, women have higher scores.
## Mean difference = 0.61 (95% CI: 0.56, 0.67)

## Notes: 
## paired sample t-test can be run as: svyttest(I(var1-var2)~0, design)
## one-sample t-test can be run as: svyttest(var~0, design)

## Wilcoxon signed rank test (non-parametric independent t-test)
w <- svyranktest(stress_cont ~ woman, nsduh_design_sp, test=c("wilcoxon"))
w
## 
##  Design-based KruskalWallis test
## 
## data:  stress_cont ~ woman
## t = 24.994, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score 
##                     0.0466829
# Does the depression prevalence differ by sexual orientation?
## Design-based Wald (chi-square) test of association: svychisq(~cat+cat, design, statistic="adjWald")
c <- svychisq(~depression_bin + sexident,
              design=nsduh_design_sp, 
              statistic="adjWald")
c
## 
##  Design-based Wald test of association
## 
## data:  svychisq(~depression_bin + sexident, design = nsduh_design_sp,     statistic = "adjWald")
## F = 399.4, ndf = 2, ddf = 49, p-value < 2.2e-16
## Yes, there are differences in depression by sexual orientation.

Regression analysis

# Linear regression
## Are there differences in Kessler-6 Distress Scale scores by race/ethnicity?
### Simple linear regression
a1 <- svyglm(stress_cont ~ black + hispanic + asian + naan + nhpi + multi, 
             nsduh_design_sp, 
             family=gaussian())
summary(a1)
## 
## Call:
## svyglm(formula = stress_cont ~ black + hispanic + asian + naan + 
##     nhpi + multi, design = nsduh_design_sp, family = gaussian())
## 
## Survey design:
## subset(nsduh_design, subpop == "1")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.92355    0.01639 239.390  < 2e-16 ***
## black1      -0.11665    0.03597  -3.243 0.002261 ** 
## hispanic1   -0.38058    0.05366  -7.093 8.36e-09 ***
## asian1      -0.18832    0.05021  -3.750 0.000513 ***
## naan1        0.42018    0.18459   2.276 0.027747 *  
## nhpi1        0.23248    0.20797   1.118 0.269680    
## multi1       1.11687    0.13213   8.453 9.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 20.16392)
## 
## Number of Fisher Scoring iterations: 2
# Yes, compared to white participants: Black, Native Hawaiian/Pacific Islander, and Hispanic participants had lower stress scores; Multiracial participants had higher stress scores.

## Is race/ethnicity still a significant predictor of Kessler-6 Distress Scale scores after controlling for income level, age category, and health insurance status?
### Multivariate linear regression
a2 <- svyglm(stress_cont ~ black + hispanic + asian + naan + nhpi + multi + income + catag6 + anyhlti2,
             nsduh_design_sp, 
             family=gaussian())
summary(a2) ## Note: we should assess regression diagnostics before reporting results
## 
## Call:
## svyglm(formula = stress_cont ~ black + hispanic + asian + naan + 
##     nhpi + multi + income + catag6 + anyhlti2, design = nsduh_design_sp, 
##     family = gaussian())
## 
## Survey design:
## subset(nsduh_design, subpop == "1")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51195    0.04480 167.661  < 2e-16 ***
## black1      -0.77646    0.03505 -22.153  < 2e-16 ***
## hispanic1   -1.11188    0.05068 -21.939  < 2e-16 ***
## asian1      -0.48648    0.05638  -8.629 2.73e-10 ***
## naan1       -0.34826    0.17264  -2.017  0.05117 .  
## nhpi1       -0.30741    0.19378  -1.586  0.12139    
## multi1       0.54902    0.12523   4.384 9.69e-05 ***
## income2     -1.00848    0.04295 -23.483  < 2e-16 ***
## income3     -1.47684    0.05788 -25.515  < 2e-16 ***
## income4     -1.92838    0.04597 -41.944  < 2e-16 ***
## catag63     -1.09563    0.04712 -23.254  < 2e-16 ***
## catag64     -1.93820    0.03774 -51.363  < 2e-16 ***
## catag65     -2.75443    0.05031 -54.747  < 2e-16 ***
## catag66     -3.53821    0.04039 -87.604  < 2e-16 ***
## anyhlti22   -0.16256    0.05033  -3.230  0.00264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 18.42876)
## 
## Number of Fisher Scoring iterations: 2
# Logistic regression
## What is the association between sexual orientation and depression?
a3 <- svyglm(depression_bin ~ gay + bi,
             nsduh_design_sp,
             family=binomial(link=logit))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(a3)
## 
## Call:
## svyglm(formula = depression_bin ~ gay + bi, design = nsduh_design_sp, 
##     family = binomial(link = logit))
## 
## Survey design:
## subset(nsduh_design, subpop == "1")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.67103    0.01218 -219.27   <2e-16 ***
## gay1         0.83155    0.04875   17.06   <2e-16 ***
## bi1          1.57098    0.03730   42.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000005)
## 
## Number of Fisher Scoring iterations: 5
exp(coef(a3)) # get odds ratios (ORs)
## (Intercept)        gay1         bi1 
##  0.06918116  2.29687014  4.81133797
exp(confint(a3)) # get 95% CIs around ORs
##                 2.5 %     97.5 %
## (Intercept) 0.0675073 0.07089653
## gay1        2.0824074 2.53342001
## bi1         4.4636861 5.18606649

Regression diagnostics for complex sample data

# Coming soon!

# Linear regression diagnostics (`svydiags` package)
## Linearity between X variables and Y
## Multicollinearity
## Homoscedasticity of residual variance
## Normality of residuals

# Logistic regression assumptions
## X variables are linearly related to the logit of the Y variable