Working with Complex Survey Data from the American Community Survey

The American Community Survey (ACS) is conducted annually by the US Census Bureau. Because of its frequency, the ACS can be used to study trends more readily than the decennial US Census, which occurs every ten years. However, unlike the decennial census, data for the ACS are obtained from a random sample of US residents, and therefore the sampling structure must be taken into account in subsequent analyses.

1. Accessing the Data

The US Census Bureau releases ACS Public Use Microdata Sample (PUMS) files for free so that researchers can work with the data in a rawer form (i.e., not pretabulated). These data contain records at both the household and individual (person) levels; however, their geographies are aggregated so that households and individuals cannot be identified.

1.1 Example: 2019 1-Year Estimates

PUMS data come in 1-year and 5-year estimates. For this example, we will be using the 1-year estimates for 2019. Data can be obtained from data.census.gov or the Census Bureau's FTP site.

We are using 1-year estimates here in order to examine the trends at a more granular level.

### IMPORT DATA FOR ACS
## THE FULL DATASET REQUIRES BOTH PART A AND PART B (ONLY PART A IS LOADED HERE)
## PART A
psam_husa_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husa_2019.csv")
dim(psam_husa_2019)
## [1] 788908    235
## PART B
#psam_husb_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husb_2019.csv")

## COMBINE THE TWO
#all2019<-rbind(psam_husa_2019, psam_husb_2019)
#dim(all2019)
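
PUMS household records can also be linked to the person-level file through the SERIALNO housing-unit identifier. A minimal sketch, assuming a corresponding person file has been read in as psam_pusa_2019 (a hypothetical name mirroring the household file above):

library(dplyr)

## attach household-level variables to each person record via the housing unit id
person_hh <- psam_pusa_2019 %>%
  left_join(psam_husa_2019, by = "SERIALNO", suffix = c(".p", ".h"))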

2. Assessing Data Quality

Missing data in the ACS are treated in one of two ways: assignment or allocation. Assignment is used when the missing value can be completed through logical imputation (i.e., the information is already contained somewhere else in the record). Allocation methods, by contrast, use random processes such as hot-deck or nearest-neighbor imputation to complete the data record.
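
As a rough illustration of the hot-deck idea (a minimal sketch, not the Census Bureau's actual allocation procedure), each missing value is filled in with a value drawn at random from the observed "donor" responses:

## toy hot-deck imputation: fill NAs with randomly sampled observed values
set.seed(1)
x <- c(12, NA, 7, NA, 30, 18)
donors <- x[!is.na(x)]
x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
x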

2.1 Allocation Variables

The ACS data files contain indicator variables that flag whether the corresponding question item response was allocated (1) or not (0). These allocation flags follow a naming convention in which the original variable name is prefixed with an “F” and suffixed with a “P”.

In the 2019 ACS data the allocation flag variables are in columns 101 through 155.

## flag allocation
names(psam_husa_2019)[101:155]
##  [1] "FACCESSP"    "FACRP"       "FAGSP"       "FBATHP"      "FBDSP"      
##  [6] "FBLDP"       "FBROADBNDP"  "FCOMPOTHXP"  "FCONP"       "FDIALUPP"   
## [11] "FELEP"       "FFINCP"      "FFSP"        "FFULP"       "FGASP"      
## [16] "FGRNTP"      "FHFLP"       "FHINCP"      "FHISPEEDP"   "FHOTWATP"   
## [21] "FINSP"       "FKITP"       "FLAPTOPP"    "FMHP"        "FMRGIP"     
## [26] "FMRGP"       "FMRGTP"      "FMRGXP"      "FMVP"        "FOTHSVCEXP" 
## [31] "FPLMP"       "FPLMPRP"     "FREFRP"      "FRMSP"       "FRNTMP"     
## [36] "FRNTP"       "FRWATP"      "FRWATPRP"    "FSATELLITEP" "FSINKP"     
## [41] "FSMARTPHONP" "FSMOCP"      "FSMP"        "FSMXHP"      "FSMXSP"     
## [46] "FSTOVP"      "FTABLETP"    "FTAXP"       "FTELP"       "FTENP"      
## [51] "FVACSP"      "FVALP"       "FVEHP"       "FWATP"       "FYBLP"
2.1.1 Rate of Allocation

The allocation variables can take the values 0, 1, or NA: the flag is 1 if the variable was allocated, 0 if it was not, and NA when allocation is not needed or not applicable. To compute a per-household allocation rate, the numerator is the sum of the allocation flags, excluding NAs. The denominator is the number of variables for which allocation was possible (i.e., the flag is not NA).

## numerator: number of allocated items per household
alloc_num <- rowSums(psam_husa_2019[, 101:155], na.rm = TRUE)

## denominator: number of items for which allocation was possible (non-NA flags)
alloc_den <- length(101:155) - rowSums(is.na(psam_husa_2019[, 101:155]))
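
A quick sanity check (a minimal sketch): the flags should only take the values 0, 1, or NA, and the resulting per-household rate should lie between 0 and 1 (NaN can occur if no items were eligible for allocation).

## tabulate all flag values across the allocation columns
table(unlist(psam_husa_2019[, 101:155]), useNA = "ifany")

## distribution of the per-household allocation rate
summary(alloc_num / alloc_den)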

2.2 Data Wrangling

The data used for this study pertain to household language and English speaking proficiency.

For this project we are interested in the following variables:

  • PUMA: Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code)
  • ST : State Code
  • WGTP : Housing Unit Weight
  • NP : Number of persons in this household
  • RESMODE : Response mode
    • 1: Mail
    • 2: CATI/CAPI
    • 3: Internet
  • HHL : Household language
    • 1: English Only
    • 2: Spanish
    • 3: Other Indo-European languages
    • 4: Asian and Pacific Island languages
    • 5: Other
  • LNGI : Limited English speaking household
    • 1: At least one person in the household 14 and over speaks English only or speaks English ‘very well’
    • 2: No one in the household 14 and over speaks English only or speaks English ‘very well’
  • WGTP1:WGTP80 : Replicate weights 1 through 80
library(tidyverse)

keepACS <- psam_husa_2019 %>%
  mutate(YEAR = 2019) %>%
  select(YEAR, PUMA, ST, WGTP, NP, RESMODE, HHL, LNGI, WGTP1:WGTP80) %>%
  cbind(nAlloc = alloc_num, dAlloc = alloc_den) %>%
  mutate(rate = nAlloc / dAlloc)

#str(keepACS)

3. Using the survey package

3.1 Defining the Design

Documentation for each of the arguments to the svrepdesign function can be found here:

https://search.r-project.org/CRAN/refmans/survey/html/svrepdesign.html

For the PUMS data from the ACS we define the following design:

library(survey)
h_design<-svrepdesign(weights= ~WGTP,
                      repweights = 'WGTP[1-9]+',
                      scale=4/80,
                      rscales = rep(1, 80),
                      mse=T,
                      combined.weights = T,
                      type='JK1',
                      data= keepACS)

Defining the arguments for PUMS:

  • weights: the sampling weights, i.e. the number of units in the population that a single observation represents
  • repweights: This argument is used to pass in the column names with the replicate weights (note that regex is used to define the columns)
  • scale: The variance is given by the equation \[Var(x)=\frac{4}{80}\sum_{r=1}^{80}(x_r-x)^2\] Thus the scale is \(\frac{4}{80}\). The 4 in the numerator is a scaling factor needed to remove bias, while the 80 reflects the 80 replicate weights. Note here that \(x_r\) is the \(r^{th}\) replicate estimate and \(x\) is the full PUMS weighted estimate (a small by-hand check of this formula appears after this list).
  • mse: Compute variances based on sum of squares around the point estimate
  • combined.weights: TRUE if the repweights already include the sampling weights
  • type: Jackknife estimator
  • data: the data frame containing the analysis variables, sampling weights, and replicate weights (here keepACS)
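
As a by-hand check of the variance formula above (a minimal sketch, using the weighted total of NP as the statistic), the standard error computed directly from the 80 replicate weights should agree with the one reported by svytotal, given mse = TRUE:

## full-sample weighted total of household size
full_tot <- sum(keepACS$WGTP * keepACS$NP)

## the same total under each of the 80 replicate weights
rep_tots <- sapply(paste0("WGTP", 1:80), function(w) sum(keepACS[[w]] * keepACS$NP))

## replicate-based SE, per the formula above
sqrt((4/80) * sum((rep_tots - full_tot)^2))

## SE as reported by the survey package
svytotal(~NP, h_design)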

3.2 Survey weighted procedures

Within the survey package there is a host of procedures that perform survey-weighted analyses. These produce not only weighted estimates but also appropriate standard errors for a given complex sampling design.

  • svytable: Computes weighted tables
  • svychisq: Creates Rao-Scott Chi-squared tests for complex designs
  • svyby: Computes survey statistics (and standard errors) on subsets of a survey defined by factors.
  • svyglm: Fits a generalized linear model that takes the survey design into account when computing the model fit, estimates, and standard errors.
3.2.1 Example 1 of svytable

A weighted table for household language and response mode.

## WEIGHTED TABLE
svytable(~HHL+RESMODE, h_design)
##    RESMODE
## HHL          1          2          3
##   1 12646526.5 17026218.2 24817595.5
##   2  1218086.0  5333705.7  2869580.3
##   3   670074.8  1102597.3  1530238.8
##   4   534701.7   839108.5  1753752.3
##   5   129954.5   441135.9   341991.8
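
Because svytable returns an ordinary table of weighted counts, base R functions such as prop.table can be applied to it; for example, weighted row proportions of response mode within each household language (a minimal sketch):

## weighted proportion of each response mode within household language
round(prop.table(svytable(~HHL+RESMODE, h_design), margin = 1), 3)
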
3.2.2 Example 1 of svychisq

Test for association of household language and response mode.

## RAO SCOTT TEST
svychisq(~HHL+RESMODE, h_design)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~HHL + RESMODE, h_design)
## F = 1938.5, ndf = 6.402, ddf = 505.761, p-value < 2.2e-16
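
By default svychisq reports the Rao-Scott second-order (F) correction; other statistics are available through its statistic argument, for example a design-based Wald test (a minimal sketch):

## design-based Wald test of the same association
svychisq(~HHL+RESMODE, h_design, statistic = "Wald")
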
3.2.3 Example 1 of svyby

Computing survey-weighted means and standard errors of the allocation rate, split by household language (HHL) and limited English status (LNGI).

# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL<-svyby(~rate,~HHL+LNGI, svymean, design=h_design)
#rmRHL

this.year=2019

# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL<-as.data.frame(rmRHL)%>%
  mutate(YEAR=this.year)%>%
  filter(HHL!=5)

# LABELS FOR GRAPHIC
modeLabs<-data.frame(RESMODE=c(1, 2, 3), 
                     resmodeLab=c("1. Mail", "2. CATI/CAPI", "3. Internet"),
                     jitter2=c(-.1, 0, .1))

langLabs<-data.frame(HHL=c(1, 2, 3, 4, 5), 
                     hhlLab=c("1. English", "2. Spanish", "3. Indo European", "4. Asian / Pacific Islander", "5. Other"), 
                     jitter=c(-.2, -.1, 0, .1, .2))

lngiLab<-data.frame(LNGI=c(1, 2), 
                    lngiLab=c("1. English Proficient", 
                              "2. Limited English"))


## WRANGLING
allRHL2<-rmRHL%>%
  left_join(langLabs)%>%
  left_join(lngiLab)%>%
  filter(HHL!=5)
## Joining, by = "HHL"
## Joining, by = "LNGI"
## GRAPHICS
# adding error bars
#http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization

ggplot(allRHL2, aes(x=hhlLab, y=rate, color=lngiLab))+
  geom_errorbar(aes(ymin=rate-se, ymax=rate+se), width=.2)+
  geom_point()

3.2.4 Example 1 of svyglm

Modeling the allocation rate as a function of response mode (RESMODE) and household language (HHL), including their interaction.

# BUILD STATISTICAL TESTS
mod2<-svyglm(rate~as.factor(RESMODE)*as.factor(HHL), h_design)
summary(mod2)
## 
## Call:
## svyglm(formula = rate ~ as.factor(RESMODE) * as.factor(HHL), 
##     h_design)
## 
## Survey design:
## svrepdesign.default(weights = ~WGTP, repweights = "WGTP[1-9]+", 
##     scale = 4/80, rscales = rep(1, 80), mse = T, combined.weights = T, 
##     type = "JK1", data = keepACS)
## 
## Coefficients:
##                                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                          0.0644976  0.0002586  249.399  < 2e-16 ***
## as.factor(RESMODE)2                 -0.0100030  0.0004060  -24.639  < 2e-16 ***
## as.factor(RESMODE)3                 -0.0394881  0.0003334 -118.433  < 2e-16 ***
## as.factor(HHL)2                      0.0104922  0.0009836   10.667 6.41e-16 ***
## as.factor(HHL)3                      0.0053439  0.0012089    4.421 3.82e-05 ***
## as.factor(HHL)4                      0.0070464  0.0016941    4.159 9.54e-05 ***
## as.factor(HHL)5                      0.0073950  0.0028337    2.610  0.01124 *  
## as.factor(RESMODE)2:as.factor(HHL)2 -0.0125316  0.0011854  -10.572 9.32e-16 ***
## as.factor(RESMODE)3:as.factor(HHL)2  0.0035005  0.0011232    3.116  0.00273 ** 
## as.factor(RESMODE)2:as.factor(HHL)3 -0.0052306  0.0015467   -3.382  0.00122 ** 
## as.factor(RESMODE)3:as.factor(HHL)3  0.0011385  0.0015421    0.738  0.46301    
## as.factor(RESMODE)2:as.factor(HHL)4 -0.0052708  0.0023439   -2.249  0.02792 *  
## as.factor(RESMODE)3:as.factor(HHL)4 -0.0024705  0.0018011   -1.372  0.17489    
## as.factor(RESMODE)2:as.factor(HHL)5 -0.0077060  0.0033839   -2.277  0.02607 *  
## as.factor(RESMODE)3:as.factor(HHL)5  0.0074481  0.0035299    2.110  0.03871 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4115.264)
## 
## Number of Fisher Scoring iterations: 2
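
To test whether the response mode by household language interaction contributes as a whole, rather than inspecting individual coefficients, the survey package's regTermTest function can be applied to the fitted model (a minimal sketch):

## design-based Wald test of the full interaction term
regTermTest(mod2, ~as.factor(RESMODE):as.factor(HHL))
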
3.2.5 Example 2

Looking at the allocation rate by response mode (RESMODE), household language (HHL), and limited English status (LNGI).

# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL2<-svyby(~rate,~RESMODE+HHL+LNGI, svymean, design=h_design)
#rmRHL2

# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL2<-as.data.frame(rmRHL2)%>%
  mutate(YEAR=this.year)%>%
  filter(HHL!=5)


## HHL
## LNGI
## RESMODE

allRHL2<-rmRHL2%>%
  left_join(modeLabs)%>%
  left_join(langLabs)%>%
  left_join(lngiLab)%>%
  mutate(yearJ=YEAR+jitter)%>%
  filter(HHL!=5)

library(scales)

ggplot(allRHL2, aes(x=jitter2, y=rate, color=lngiLab))+
  geom_line(alpha=.5, aes(lty=resmodeLab))+
  geom_point(size=2, aes(pch=resmodeLab))+
  #geom_pointrange(aes(ymin=rate-1.96*se, ymax=rate+1.96*se))+
  geom_errorbar(alpha=.3, aes(ymin=rate-1.96*se, ymax=rate+1.96*se), size=.5, width=.1)+
  facet_grid(~hhlLab)+
  theme_bw()+
  scale_x_continuous(breaks=modeLabs$jitter2, labels=modeLabs$resmodeLab)+
  theme(legend.position = "bottom")+
  xlab("Response Mode")+
  ylab("Allocation Rate")#+

  #scale_color_grey(start=0.0, end=0.6, na.value="white")