Basics of Complex Survey Design in R

Unless you work with healthcare or social surveys, you are unlikely to have encountered complex survey designs. Most of the data we have worked with in class, for example, came from simple random samples of a population or had no sampling design at all.

My first encounter with complex survey design was working with data from the National Health and Nutrition Examination Survey (NHANES), whose respondents are selected through a complex, multistage sampling design.

What do complex survey designs do:

They often stratify the population based on characteristics such as race, geography, or age. One way to think about this is as a nested structure of sampled characteristics, with the relevant levels depending on the question you’re trying to answer.

What to know:

PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same.
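
The district-then-school example can be sketched in base R. This is a toy illustration only: the `frame` data frame, the district and school labels, and the sample sizes are all made up, and both stages use simple random sampling for brevity.

```r
set.seed(42)

# Hypothetical sampling frame: 10 districts, each with 6 schools
frame <- data.frame(
  district = rep(paste0("D", 1:10), each = 6),
  school   = paste0("S", 1:60)
)

# Stage 1: sample 4 districts (the PSUs)
sampled_districts <- sample(unique(frame$district), size = 4)

# Stage 2: within each sampled district, sample 2 schools
stage2 <- lapply(sampled_districts, function(d) {
  schools <- frame$school[frame$district == d]
  data.frame(district = d, school = sample(schools, size = 2))
})
two_stage_sample <- do.call(rbind, stage2)

# Selection probability of a school = P(district) * P(school | district);
# the sampling weight is its inverse
two_stage_sample$prob   <- (4 / 10) * (2 / 6)
two_stage_sample$weight <- 1 / two_stage_sample$prob

two_stage_sample
```

Every sampled school comes from one of the four sampled districts, which is why the district, not the school, is the PSU here.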

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race, or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the sampling weights for men will likely be different from the sampling weights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates.

https://stats.idre.ucla.edu/other/mult-pkg/seminars/svy-intro/

Respondent Sample Weights: Each respondent in the sample is given a weight that represents how many people in the population that respondent stands for. This accounts for oversampling and undersampling of particular groups, e.g. by gender or race.
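
To make the role of weights concrete, here is a minimal sketch in base R. The population, the group labels, and the sample sizes are invented for illustration. Group B is deliberately oversampled, and each respondent's weight is taken as N_h / n_h (stratum population size over stratum sample size).

```r
set.seed(1)

# Hypothetical population: 9,000 people in group A, 1,000 in group B
N_h <- c(A = 9000, B = 1000)
pop <- data.frame(
  group = rep(names(N_h), times = N_h),
  y     = c(rnorm(9000, mean = 50, sd = 5), rnorm(1000, mean = 70, sd = 5)),
  stringsAsFactors = FALSE
)

# Stratified sample that oversamples B: 200 respondents from each stratum
n_h <- c(A = 200, B = 200)
samp <- do.call(rbind, lapply(names(n_h), function(g) {
  stratum <- pop[pop$group == g, ]
  stratum[sample(nrow(stratum), n_h[[g]]), ]
}))

# Sampling weight: how many population members each respondent represents
samp$weight <- unname(N_h[samp$group] / n_h[samp$group])

mean(pop$y)                         # population mean
mean(samp$y)                        # unweighted sample mean, pulled toward group B
weighted.mean(samp$y, samp$weight)  # weighted mean, much closer to the population mean
```

The weights sum to the population size, and ignoring them biases the estimate toward the oversampled group.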

Working with complex sample designs in R:

The package of choice is the survey package.

library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart

Load data:

# Sampled data from the HRS (Health and Retirement Study)
library(survey)
library(foreign)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Reading in Leave-behind file
hrs_df_2012_lb <- readRDS("C://Users//oghen//Documents//HRS//2012//h12core//h12lb_r.rds")
#Reading in Demographics data
hrs_demo <-read.dta("C://Users//oghen//Documents//HRS//2012//h12f2a_STATA//randhrs1992_2016v1.dta")
        

# Set demo hhidpn value to character
hrs_demo$hhidpn <- as.character(hrs_demo$hhidpn)

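# Create hhidpn in the leave-behind file.
# If hhid is stored as a character string, the numeric round trip drops any
# leading zeros so the pasted key has the same form as the hhidpn string
# built from the demographics file.
hrs_df_2012_lb$hhid <- as.character(as.numeric(hrs_df_2012_lb$hhid))
hrs_df_2012_lb$hhidpn <- paste0(hrs_df_2012_lb$hhid, hrs_df_2012_lb$pn)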

# Removing the respondents who did not complete the leave-behind questionnaire
# (keeps rows where nlbcomp is 1, 2, or 4)
hrs_df_2012_lb <- subset(hrs_df_2012_lb, nlbcomp %in% c(1, 2, 4))
#hrs_df$hhidpn<-as.character(hrs_df$hhidpn)

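# Merging the leave-behind and RAND files.
# right_join keeps every row of the leave-behind file and attaches the
# matching demographics where hhidpn matches.
HRS_combined <- right_join(hrs_demo, hrs_df_2012_lb, by = "hhidpn")

# n = 7,412 observations.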

keep_var <- c("hhidpn","nfinr","nlbcomp","r11agey_b","ragender","s11gender","r11mstat","h11cpl","raracem","rahispan","r11sayret","raedegrm","h11itot","h11atotb","h11atotn","h11afhous","h11amort","h11afdebt","r11ptyp1","h11aira","h11astck","nlb040","nlb040a_e","h11achck","r11rplnyr","r11higov","r11covr","r11wtresp","raestrat","raehsamp","racohbyr","inw11","nlb039e","nlb033d","nlb033h","nlb033l","nlb033q","nlb033a","nlb033f","nlb033j","nlb033u","nlb033z_2","nlb033m","nlb033o","nlb033s","nlb033t","nlb033w","nlb033z_3","nlb033z_4","nlb033b","nlb033g","nlb033k","nlb033p","nlb033y","nlb033c","nlb033e","nlb033i","nlb033n","nlb033r","nlb033v","nlb033x","nlb033z","nlb033z_5","nlb033z_6","nlb027a","nlb027b","nlb027c","nlb027d","nlb027e","nlb027f","nlb027g","nlb027h","nlb027i","nlb027j","nlb027k","nlb027l","nlb027m","nlb027n","nlb027o","nlb027p","nlb027q","nlb027r","nlb027s","nlb027t","nlb027u","nlb027v","nlb027w","nlb027x","nlb027y")

hrs_reduced <- HRS_combined[keep_var]

Creating the sample design:

library(survey)
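# Community residents aged 50+ in 2012: rows with a positive respondent weight
hrs_design2 <- 
    svydesign(
        id = ~ raehsamp ,        # sampling error computation unit (cluster / PSU)
        strata = ~ raestrat ,    # sampling stratum
        weights = ~ r11wtresp ,  # 2012 (wave 11) respondent-level weight
        nest = TRUE ,            # treat PSU ids as nested within strata (ids may repeat across strata)
        data = subset( hrs_reduced , r11wtresp > 0 )
    )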

# Add derived variables to the design object
hrs_design2 <- 
    update( 
        hrs_design2 , 

        # constant column, handy for weighted counts
        one = 1 ,

        # labeled marital status in 2012
        marital_stat_2012 =
            factor( r11mstat , levels = 1:8 , labels =
                c( "Married" , "Married, spouse absent" ,
                "Partnered" , "Separated" , "Divorced" ,
                "Separated/divorced" , "Widowed" ,
                "Never married" ) ) 
    )
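
To see what the `id`, `strata`, `weights`, and `nest` arguments do without access to the HRS files, here is a minimal sketch on made-up data. The `toy` data frame and its column names are hypothetical. PSU ids 1 and 2 are reused in every stratum, which is the situation `nest = TRUE` is meant for.

```r
library(survey)

set.seed(7)

# Hypothetical data: 3 strata, 2 PSUs per stratum, 20 respondents per PSU
toy <- data.frame(
  stratum = rep(1:3, each = 40),
  psu     = rep(rep(1:2, each = 20), times = 3),  # ids 1 and 2 repeat across strata
  wt      = rep(c(10, 20, 30), each = 40),
  y       = rnorm(120, mean = 100, sd = 15)
)

# Without nest = TRUE this call is expected to stop with an error,
# because the same PSU id appears in more than one stratum
no_nest <- try(
  svydesign(id = ~psu, strata = ~stratum, weights = ~wt, data = toy),
  silent = TRUE
)
inherits(no_nest, "try-error")

# With nest = TRUE, PSU ids are relabeled so they are unique within strata
toy_design <- svydesign(
  id = ~psu, strata = ~stratum, weights = ~wt,
  nest = TRUE, data = toy
)

summary(toy_design)      # strata sizes and selection probabilities
svymean(~y, toy_design)  # design-based mean and standard error
degf(toy_design)         # design degrees of freedom (PSUs minus strata)
```

The small number of design degrees of freedom is a reminder that standard errors come from variation between PSUs within strata, not from the number of respondents.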

Methods for doing analysis with complex survey designs:

http://asdfree.com/health-and-retirement-study-hrs.html

https://rpubs.com/corey_sparks/53683
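
The survey package has design-aware counterparts for the most common estimators. As a minimal sketch, the block below uses the `api` example data that ships with survey instead of the HRS files, so it can be run as is. The commented lines at the end show how the same calls might look for `hrs_design2`. They are untested assumptions about the variables kept earlier, not verified output.

```r
library(survey)

# Example data bundled with the survey package (California school API scores)
data(api)

# Stratified sample: strata are school types, pw is the sampling weight,
# fpc is the stratum population size
dstrat <- svydesign(
  id = ~1, strata = ~stype, weights = ~pw, fpc = ~fpc,
  data = apistrat
)

# Weighted mean with a design-based standard error
m <- svymean(~api00, dstrat)
m

# Weighted population total
svytotal(~enroll, dstrat)

# Estimates by subgroup
by_type <- svyby(~api00, ~stype, dstrat, svymean)
by_type

# Design-based regression
fit <- svyglm(api00 ~ ell + meals, design = dstrat)
summary(fit)

# Hypothetical equivalents for the HRS design defined earlier (not run here):
# svymean(~r11agey_b, hrs_design2, na.rm = TRUE)
# svytotal(~marital_stat_2012, hrs_design2, na.rm = TRUE)
# svyby(~r11agey_b, ~marital_stat_2012, hrs_design2, svymean, na.rm = TRUE)
```

Passing the design object rather than the raw data frame is what lets each function account for strata, clusters, and weights when computing standard errors.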