Working with Complex Survey Data from the American Community Survey

The American Community Survey (ACS) is conducted annually by the US Census Bureau. Because of its frequency, the ACS can be used to study trends more readily than the decennial US Census, which occurs every ten years. However, unlike the decennial census, data for the ACS are obtained from a random sample of US residents, and therefore the sampling structure must be taken into account in subsequent analyses.

1. Accessing the Data

The US Census Bureau releases ACS Public Use Microdata Sample (PUMS) files for free so that researchers can work with the data in a rawer form (i.e., not pretabulated). These data contain records at both the household and individual (person) levels; however, their geographies are aggregated so that households and individuals cannot be identified.

1.1 Example: 2019 1-Year Estimates

PUMS data come in 1-year and 5-year estimates. For this example, we will be using the 1-year estimates for 2019. Data can be obtained from data.census.gov or the Census Bureau's FTP site.

We are using 1-year estimates here in order to examine the trends at a more granular level.

### IMPORT DATA FOR ACS
## THE FULL DATASET REQUIRES BOTH PART A AND PART B (ONLY PART A IS LOADED HERE)
## PART A
psam_husa_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husa_2019.csv")
dim(psam_husa_2019)
## [1] 788908    235
## PART B
#psam_husb_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husb_2019.csv")

## COMBINE THE TWO
#all2019<-rbind(psam_husa_2019, psam_husb_2019)
#dim(all2019)
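
PUMS household records can also be linked to the person-level file through the SERIALNO housing-unit identifier. A minimal sketch, assuming a corresponding person file has been read in as psam_pusa_2019 (a hypothetical name mirroring the household file above):

library(dplyr)

## attach household-level variables to each person record via the housing unit id
person_hh <- psam_pusa_2019 %>%
  left_join(psam_husa_2019, by = "SERIALNO", suffix = c(".p", ".h"))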

2. Assessing Data Quality

Missing data in the ACS are treated in one of two ways: assignment or allocation. Assignment is used when the missing value can be completed through logical imputation (i.e., the information is already contained somewhere else in the record). Allocation methods, by contrast, use random processes such as hot-deck or nearest-neighbor imputation to complete the data record.
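
As a rough illustration of the hot-deck idea (a minimal sketch, not the Census Bureau's actual allocation procedure), each missing value is filled in with a value drawn at random from the observed "donor" responses:

## toy hot-deck imputation: fill NAs with randomly sampled observed values
set.seed(1)
x <- c(12, NA, 7, NA, 30, 18)
donors <- x[!is.na(x)]
x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
x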

2.1 Allocation Variables

The ACS data files contain indicator variables that flag whether the corresponding question item response was allocated (1) or not (0). These allocation flags follow a naming convention in which the original variable name is prefixed with an “F” and suffixed with a “P”.

In the 2019 ACS data the allocation flag variables are in columns 101 through 155.

## flag allocation
names(psam_husa_2019)[101:155]
##  [1] "FACCESSP"    "FACRP"       "FAGSP"       "FBATHP"      "FBDSP"      
##  [6] "FBLDP"       "FBROADBNDP"  "FCOMPOTHXP"  "FCONP"       "FDIALUPP"   
## [11] "FELEP"       "FFINCP"      "FFSP"        "FFULP"       "FGASP"      
## [16] "FGRNTP"      "FHFLP"       "FHINCP"      "FHISPEEDP"   "FHOTWATP"   
## [21] "FINSP"       "FKITP"       "FLAPTOPP"    "FMHP"        "FMRGIP"     
## [26] "FMRGP"       "FMRGTP"      "FMRGXP"      "FMVP"        "FOTHSVCEXP" 
## [31] "FPLMP"       "FPLMPRP"     "FREFRP"      "FRMSP"       "FRNTMP"     
## [36] "FRNTP"       "FRWATP"      "FRWATPRP"    "FSATELLITEP" "FSINKP"     
## [41] "FSMARTPHONP" "FSMOCP"      "FSMP"        "FSMXHP"      "FSMXSP"     
## [46] "FSTOVP"      "FTABLETP"    "FTAXP"       "FTELP"       "FTENP"      
## [51] "FVACSP"      "FVALP"       "FVEHP"       "FWATP"       "FYBLP"
2.1.1 Rate of Allocation

The allocation variables can take the values 0, 1, or NA: the flag is 1 if the variable was allocated, 0 if it was not, and NA when allocation is not needed or not applicable. To compute a per-household allocation rate, the numerator is the sum of the allocation flags, excluding NAs. The denominator is the number of variables for which allocation was possible (i.e., the flag is not NA).

## numerator: number of allocated items per household
alloc_num <- rowSums(psam_husa_2019[, 101:155], na.rm = TRUE)

## denominator: number of items for which allocation was possible (non-NA flags)
alloc_den <- length(101:155) - rowSums(is.na(psam_husa_2019[, 101:155]))
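
A quick sanity check (a minimal sketch): the flags should only take the values 0, 1, or NA, and the resulting per-household rate should lie between 0 and 1 (NaN can occur if no items were eligible for allocation).

## tabulate all flag values across the allocation columns
table(unlist(psam_husa_2019[, 101:155]), useNA = "ifany")

## distribution of the per-household allocation rate
summary(alloc_num / alloc_den)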

2.2 Data Wrangling

The data used for this study pertain to household language and English speaking proficiency.

For this project we are interested in the following variables:

  • PUMA: Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code)
  • ST : State Code
  • WGTP : Housing Unit Weight
  • NP : Number of persons in this household
  • RESMODE : Response mode
    • 1: Mail
    • 2: CATI/CAPI
    • 3: Internet
  • HHL : Household language
    • 1: English Only
    • 2: Spanish
    • 3: Other Indo-European languages
    • 4: Asian and Pacific Island languages
    • 5: Other
  • LNGI : Limited English speaking household
    • 1: At least one person in the household 14 and over speaks English only or speaks English ‘very well’
    • 2: No one in the household 14 and over speaks English only or speaks English ‘very well’
  • WGTP1:WGTP80 : Replicate weights 1 through 80
library(tidyverse)

keepACS <- psam_husa_2019 %>%
  mutate(YEAR = 2019) %>%
  select(YEAR, PUMA, ST, WGTP, NP, RESMODE, HHL, LNGI, WGTP1:WGTP80) %>%
  cbind(nAlloc = alloc_num, dAlloc = alloc_den) %>%
  mutate(rate = nAlloc / dAlloc)

#str(keepACS)

3. Using the survey package

3.1 Defining the Design

Documentation for each of the arguments to the svrepdesign function can be found here:

https://search.r-project.org/CRAN/refmans/survey/html/svrepdesign.html

For the PUMS data from the ACS we define the following design:

library(survey)
h_design<-svrepdesign(weights= ~WGTP,
                      repweights = 'WGTP[1-9]+',
                      scale=4/80,
                      rscales = rep(1, 80),
                      mse=T,
                      combined.weights = T,
                      type='JK1',
                      data= keepACS)

Defining the arguments for PUMS:

  • weights: the sampling weights, i.e. the number of units in the population that a single observation represents
  • repweights: This argument is used to pass in the column names with the replicate weights (note that regex is used to define the columns)
  • scale: The variance is given by the equation \[Var(x)=\frac{4}{80}\sum_{r=1}^{80}(x_r-x)^2\] Thus the scale is \(\frac{4}{80}\). The 4 in the numerator is a scaling factor needed to remove bias, while the 80 reflects the 80 replicate weights. Note here that \(x_r\) is the \(r^{th}\) replicate estimate and \(x\) is the full PUMS weighted estimate (a small by-hand check of this formula appears after this list).
  • mse: Compute variances based on sum of squares around the point estimate
  • combined.weights: TRUE if the repweights already include the sampling weights
  • type: Jackknife estimator
  • data: the data frame containing the analysis variables, sampling weights, and replicate weights (here keepACS)
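
As a by-hand check of the variance formula above (a minimal sketch, using the weighted total of NP as the statistic), the standard error computed directly from the 80 replicate weights should agree with the one reported by svytotal, given mse = TRUE:

## full-sample weighted total of household size
full_tot <- sum(keepACS$WGTP * keepACS$NP)

## the same total under each of the 80 replicate weights
rep_tots <- sapply(paste0("WGTP", 1:80), function(w) sum(keepACS[[w]] * keepACS$NP))

## replicate-based SE, per the formula above
sqrt((4/80) * sum((rep_tots - full_tot)^2))

## SE as reported by the survey package
svytotal(~NP, h_design)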

3.2 Survey weighted procedures

Within the survey package there is a host of procedures that perform survey-weighted analyses. These produce not only weighted estimates but also appropriate standard errors for a given complex sampling design.

  • svytable: Computes weighted tables
  • svychisq: Creates Rao-Scott Chi-squared tests for complex designs
  • svyby: Computes survey statistics (and standard errors) on subsets of a survey defined by factors.
  • svyglm: Fits a generalized linear model that takes the survey design into account when computing the model fit, estimates, and standard errors.
3.2.1 Example 1 of svytable

A weighted table for household language and response mode.

## WEIGHTED TABLE
svytable(~HHL+RESMODE, h_design)
##    RESMODE
## HHL          1          2          3
##   1 12646526.5 17026218.2 24817595.5
##   2  1218086.0  5333705.7  2869580.3
##   3   670074.8  1102597.3  1530238.8
##   4   534701.7   839108.5  1753752.3
##   5   129954.5   441135.9   341991.8
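
Because svytable returns an ordinary table of weighted counts, base R functions such as prop.table can be applied to it; for example, weighted row proportions of response mode within each household language (a minimal sketch):

## weighted proportion of each response mode within household language
round(prop.table(svytable(~HHL+RESMODE, h_design), margin = 1), 3)
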
3.2.2 Example 1 of svychisq

Test for association of household language and response mode.

## RAO SCOTT TEST
svychisq(~HHL+RESMODE, h_design)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~HHL + RESMODE, h_design)
## F = 1938.5, ndf = 6.402, ddf = 505.761, p-value < 2.2e-16
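
By default svychisq reports the Rao-Scott second-order (F) correction; other statistics are available through its statistic argument, for example a design-based Wald test (a minimal sketch):

## design-based Wald test of the same association
svychisq(~HHL+RESMODE, h_design, statistic = "Wald")
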
3.2.3 Example 1 of svyby

Computing survey-weighted means and standard errors of the allocation rate, split by household language (HHL) and limited English status (LNGI).

# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL<-svyby(~rate,~HHL+LNGI, svymean, design=h_design)
#rmRHL

this.year=2019

# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL<-as.data.frame(rmRHL)%>%
  mutate(YEAR=this.year)%>%
  filter(HHL!=5)

# LABELS FOR GRAPHIC
modeLabs<-data.frame(RESMODE=c(1, 2, 3), 
                     resmodeLab=c("1. Mail", "2. CATI/CAPI", "3. Internet"),
                     jitter2=c(-.1, 0, .1))

langLabs<-data.frame(HHL=c(1, 2, 3, 4, 5), 
                     hhlLab=c("1. English", "2. Spanish", "3. Indo European", "4. Asian / Pacific Islander", "5. Other"), 
                     jitter=c(-.2, -.1, 0, .1, .2))

lngiLab<-data.frame(LNGI=c(1, 2), 
                    lngiLab=c("1. English Proficient", 
                              "2. Limited English"))


## WRANGLING
allRHL2<-rmRHL%>%
  left_join(langLabs)%>%
  left_join(lngiLab)%>%
  filter(HHL!=5)
## Joining, by = "HHL"
## Joining, by = "LNGI"
## GRAPHICS
# adding error bars
#http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization

ggplot(allRHL2, aes(x=hhlLab, y=rate, color=lngiLab))+
  geom_errorbar(aes(ymin=rate-se, ymax=rate+se), width=.2)+
  geom_point()

3.2.4 Example 1 of svyglm

Modeling the allocation rate as a function of response mode (RESMODE) and household language (HHL), including their interaction.

# BUILD STATISTICAL TESTS
mod2<-svyglm(rate~as.factor(RESMODE)*as.factor(HHL), h_design)
summary(mod2)
## 
## Call:
## svyglm(formula = rate ~ as.factor(RESMODE) * as.factor(HHL), 
##     h_design)
## 
## Survey design:
## svrepdesign.default(weights = ~WGTP, repweights = "WGTP[1-9]+", 
##     scale = 4/80, rscales = rep(1, 80), mse = T, combined.weights = T, 
##     type = "JK1", data = keepACS)
## 
## Coefficients:
##                                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                          0.0644976  0.0002586  249.399  < 2e-16 ***
## as.factor(RESMODE)2                 -0.0100030  0.0004060  -24.639  < 2e-16 ***
## as.factor(RESMODE)3                 -0.0394881  0.0003334 -118.433  < 2e-16 ***
## as.factor(HHL)2                      0.0104922  0.0009836   10.667 6.41e-16 ***
## as.factor(HHL)3                      0.0053439  0.0012089    4.421 3.82e-05 ***
## as.factor(HHL)4                      0.0070464  0.0016941    4.159 9.54e-05 ***
## as.factor(HHL)5                      0.0073950  0.0028337    2.610  0.01124 *  
## as.factor(RESMODE)2:as.factor(HHL)2 -0.0125316  0.0011854  -10.572 9.32e-16 ***
## as.factor(RESMODE)3:as.factor(HHL)2  0.0035005  0.0011232    3.116  0.00273 ** 
## as.factor(RESMODE)2:as.factor(HHL)3 -0.0052306  0.0015467   -3.382  0.00122 ** 
## as.factor(RESMODE)3:as.factor(HHL)3  0.0011385  0.0015421    0.738  0.46301    
## as.factor(RESMODE)2:as.factor(HHL)4 -0.0052708  0.0023439   -2.249  0.02792 *  
## as.factor(RESMODE)3:as.factor(HHL)4 -0.0024705  0.0018011   -1.372  0.17489    
## as.factor(RESMODE)2:as.factor(HHL)5 -0.0077060  0.0033839   -2.277  0.02607 *  
## as.factor(RESMODE)3:as.factor(HHL)5  0.0074481  0.0035299    2.110  0.03871 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4115.264)
## 
## Number of Fisher Scoring iterations: 2
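
To test whether the response mode by household language interaction contributes as a whole, rather than inspecting individual coefficients, the survey package's regTermTest function can be applied to the fitted model (a minimal sketch):

## design-based Wald test of the full interaction term
regTermTest(mod2, ~as.factor(RESMODE):as.factor(HHL))
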
3.2.5 Example 2

Looking at the allocation rate by response mode (RESMODE), household language (HHL), and limited English status (LNGI).

# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL2<-svyby(~rate,~RESMODE+HHL+LNGI, svymean, design=h_design)
#rmRHL2

# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL2<-as.data.frame(rmRHL2)%>%
  mutate(YEAR=this.year)%>%
  filter(HHL!=5)


## HHL
## LNGI
## RESMODE

allRHL2<-rmRHL2%>%
  left_join(modeLabs)%>%
  left_join(langLabs)%>%
  left_join(lngiLab)%>%
  mutate(yearJ=YEAR+jitter)%>%
  filter(HHL!=5)

library(scales)

ggplot(allRHL2, aes(x=jitter2, y=rate, color=lngiLab))+
  geom_line(alpha=.5, aes(lty=resmodeLab))+
  geom_point(size=2, aes(pch=resmodeLab))+
  #geom_pointrange(aes(ymin=rate-1.96*se, ymax=rate+1.96*se))+
  geom_errorbar(alpha=.3, aes(ymin=rate-1.96*se, ymax=rate+1.96*se), size=.5, width=.1)+
  facet_grid(~hhlLab)+
  theme_bw()+
  scale_x_continuous(breaks=modeLabs$jitter2, labels=modeLabs$resmodeLab)+
  theme(legend.position = "bottom")+
  xlab("Response Mode")+
  ylab("Allocation Rate")#+

  #scale_color_grey(start=0.0, end=0.6, na.value="white")