The American Community Survey (ACS) is conducted annually by the US Census Bureau. Because of its frequency, the ACS is better suited to studying trends than the US Census, which occurs every ten years. However, unlike the decennial census, ACS data are obtained from a random sample of US residents, so the sampling structure must be taken into account in subsequent analyses.
The US Census Bureau releases ACS Public Use Microdata Sample (PUMS) files for free so that researchers can use the data in a more raw form (i.e., without pretabulation). These files contain records at both the household and individual levels; however, their geographies are aggregated so that households and individuals cannot be identified.
PUMS data come in 1-year and 5-year estimates. For this example, we use the 1-year estimates for 2019, which let us examine trends at a more granular level. Data can be obtained from data.census.gov or from the Census Bureau's FTP site.
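As a sketch of the FTP route (the exact URL layout is an assumption based on the Census Bureau's published directory structure, so verify it before relying on it), the 2019 1-year housing files can be pulled directly:

```r
# Download and unzip the 2019 1-year PUMS housing files
# (URL layout assumed from the Census FTP directory structure; verify before use)
url <- "https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/csv_hus.zip"
download.file(url, destfile = "csv_hus.zip", mode = "wb")
unzip("csv_hus.zip", exdir = "~/Desktop/ACS/ALL_HOUSE")
```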
```r
### IMPORT DATA FOR ACS
## THE WHOLE DATASET NEEDS BOTH PART A AND PART B
## PART A
psam_husa_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husa_2019.csv")
dim(psam_husa_2019)
## [1] 788908    235
## PART B
#psam_husb_2019 <- read.csv("~/Desktop/ACS/ALL_HOUSE/psam_husb_2019.csv")
## COMBINE THE TWO
#all2019 <- rbind(psam_husa_2019, psam_husb_2019)
#dim(all2019)
```
Missing data in the ACS are treated in one of two ways: assignment or allocation. Assignment is used when the missing value can be completed through logical imputation (i.e., the information is already contained somewhere else in the record). Allocation, in contrast, uses statistical procedures, such as hot-deck or nearest-neighbor imputation, to complete the data record.
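As a toy illustration of the hot-deck idea (purely illustrative; this is not the Census Bureau's production procedure), missing values can be filled by sampling from the observed donor responses:

```r
# Toy hot-deck allocation: replace each NA with a value drawn at random
# from the observed (donor) responses. Illustrative only.
set.seed(1)
x <- c(3, NA, 5, 2, NA, 4)
donors <- x[!is.na(x)]
x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
x
```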
This research is inspired by the work of:
McGovern, P. D., & Griffin, D. H. (2003). Quality assessment of data collected from non-English speaking households in the American Community Survey. Bureau of the Census, Washington, DC.
The ACS data files contain indicator variables that flag whether each question item's response was allocated (1) or not (0). These allocation flag variables follow a naming convention: the original variable name is preceded by an “F” and the flag name ends with a “P”. In the 2019 ACS housing data, the allocation flag variables are in columns 101 through 155 (a pattern-based lookup is sketched after this listing).
```r
## flag allocation
names(psam_husa_2019)[101:155]
##  [1] "FACCESSP"    "FACRP"       "FAGSP"       "FBATHP"      "FBDSP"
##  [6] "FBLDP"       "FBROADBNDP"  "FCOMPOTHXP"  "FCONP"       "FDIALUPP"
## [11] "FELEP"       "FFINCP"      "FFSP"        "FFULP"       "FGASP"
## [16] "FGRNTP"      "FHFLP"       "FHINCP"      "FHISPEEDP"   "FHOTWATP"
## [21] "FINSP"       "FKITP"       "FLAPTOPP"    "FMHP"        "FMRGIP"
## [26] "FMRGP"       "FMRGTP"      "FMRGXP"      "FMVP"        "FOTHSVCEXP"
## [31] "FPLMP"       "FPLMPRP"     "FREFRP"      "FRMSP"       "FRNTMP"
## [36] "FRNTP"       "FRWATP"      "FRWATPRP"    "FSATELLITEP" "FSINKP"
## [41] "FSMARTPHONP" "FSMOCP"      "FSMP"        "FSMXHP"      "FSMXSP"
## [46] "FSTOVP"      "FTABLETP"    "FTAXP"       "FTELP"       "FTENP"
## [51] "FVACSP"      "FVALP"       "FVEHP"       "FWATP"       "FYBLP"
```
The allocation variables can take the values 0, 1, or NA: a flag takes the value 1 if the variable was allocated, 0 if it was not, and NA when allocation is not needed or not applicable. The numerator of a household's allocation rate is then the sum of its allocation flags, excluding the NAs. The denominator is the number of variables for which allocation is possible (i.e., not NA).
```r
## numerator: number of allocated flags per household (NAs excluded)
alloc_num <- rowSums(psam_husa_2019[, 101:155], na.rm = TRUE)
## denominator: number of flags for which allocation was possible (non-NA)
alloc_den <- length(101:155) - rowSums(is.na(psam_husa_2019[, 101:155]))
```
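As a quick sanity check of the NA handling, a hypothetical five-flag record with values 0, 1, NA, 1, 0 should give a rate of 2/4:

```r
# Toy check: numerator counts allocated flags, denominator counts non-NA flags
flags <- c(0, 1, NA, 1, 0)
sum(flags, na.rm = TRUE) / sum(!is.na(flags))
# [1] 0.5
```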
The data used for this study pertain to household language and English-speaking proficiency.
For this project we are interested in the following variables:

PUMA
: Public use microdata area code (PUMA), based on the 2010 Census definition (areas with a population of 100,000 or more; use with ST for a unique code)

ST
: State code

WGTP
: Housing unit weight

NP
: Number of persons in this household

RESMODE
: Response mode

HHL
: Household language

LNGI
: Limited English speaking household

WGTP1:WGTP80
: Replicate weights 1 through 80

```r
library(tidyverse)
keepACS <- psam_husa_2019 %>%
  mutate(YEAR = 2019) %>%
  select(PUMA, ST, WGTP, NP, RESMODE, HHL, LNGI, WGTP1:WGTP80) %>%
  cbind(nAlloc = alloc_num, dAlloc = alloc_den) %>%
  mutate(rate = nAlloc / dAlloc)
#str(keepACS)
```
### The `survey` package

Documentation for each of the arguments to the `svrepdesign` function can be found here:

https://search.r-project.org/CRAN/refmans/survey/html/svrepdesign.html
For the PUMS data from the ACS we define the following design:
```r
library(survey)
h_design <- svrepdesign(weights = ~WGTP,
                        repweights = 'WGTP[1-9]+',
                        scale = 4/80,
                        rscales = rep(1, 80),  # one scale factor per replicate
                        mse = TRUE,
                        combined.weights = TRUE,
                        type = 'JK1',
                        data = keepACS)
```
Defining the arguments for PUMS:

weights
: the sampling weights, i.e., the number of units a single observation represents

repweights
: the column names holding the replicate weights (note that a regular expression is used to select the columns)

scale
: the variance is given by the equation \[Var(x)=\frac{4}{80}\sum_{r=1}^{80}(x_r-x)^2\] Thus the scale is \(\frac{4}{80}\). The 4 in the numerator is a scaling factor needed to remove bias, while the 80 is due to the 80 replicate weights. Note here that \(x_r\) is the \(r^{th}\) replicate estimate and \(x\) is the full PUMS weighted estimate (see the sketch after this list).

mse
: compute variances based on sums of squares around the point estimate

combined.weights
: TRUE if the replicate weights already include the sampling weights

type
: the jackknife (JK1) estimator

data
: the data frame containing the analysis variables and replicate weights
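To make the scale argument concrete, here is a sketch that hand-computes the replicate variance for a weighted total of NP and compares it to the design-based answer (it assumes the keepACS and h_design objects built above):

```r
# Hand-computed replicate variance for the weighted total of NP,
# illustrating Var(x) = (4/80) * sum over replicates of (x_r - x)^2
full <- sum(keepACS$WGTP * keepACS$NP)                 # full-sample estimate x
reps <- sapply(1:80, function(r)
  sum(keepACS[[paste0("WGTP", r)]] * keepACS$NP))      # replicate estimates x_r
sqrt((4/80) * sum((reps - full)^2))                    # hand-computed SE
svytotal(~NP, h_design)                                # SE should match
```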
Within the `survey` package there is a host of procedures that perform survey-weighted analyses. These produce not only weighted estimates but also appropriate standard errors for a given complex sampling design.

svytable
: computes weighted tables

svychisq
: performs Rao-Scott chi-squared tests for complex designs

svyby
: computes survey statistics (and standard errors) on subsets of a survey defined by factors

svyglm
: fits a generalized linear model that takes the survey design into account in the fit, estimates, and standard errors

### svytable
A weighted table for household language and response mode.
```r
## WEIGHTED TABLE
svytable(~HHL+RESMODE, h_design)
##    RESMODE
## HHL          1          2          3
##   1 12646526.5 17026218.2 24817595.5
##   2  1218086.0  5333705.7  2869580.3
##   3   670074.8  1102597.3  1530238.8
##   4   534701.7   839108.5  1753752.3
##   5   129954.5   441135.9   341991.8
```
### svychisq

Test for association of household language and response mode.
```r
## RAO SCOTT TEST
svychisq(~HHL+RESMODE, h_design)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~HHL + RESMODE, h_design)
## F = 1938.5, ndf = 6.402, ddf = 505.761, p-value < 2.2e-16
```
### svyby

Computing sample means and standard errors for allocation rates, with splits by household language and English proficiency.
```r
# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL <- svyby(~rate, ~HHL+LNGI, svymean, design = h_design)
#rmRHL
this.year <- 2019
# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL <- as.data.frame(rmRHL) %>%
  mutate(YEAR = this.year) %>%
  filter(HHL != 5)
```
```r
# LABELS FOR GRAPHIC
modeLabs <- data.frame(RESMODE = c(1, 2, 3),
                       resmodeLab = c("1. Mail", "2. CATI/CAPI", "3. Internet"),
                       jitter2 = c(-.1, 0, .1))
langLabs <- data.frame(HHL = c(1, 2, 3, 4, 5),
                       hhlLab = c("1. English", "2. Spanish", "3. Indo European",
                                  "4. Asian / Pacific Islander", "5. Other"),
                       jitter = c(-.2, -.1, 0, .1, .2))
lngiLab <- data.frame(LNGI = c(1, 2),
                      lngiLab = c("1. English Proficient", "2. Limited English"))
```
```r
## WRANGLING
allRHL2 <- rmRHL %>%
  left_join(langLabs) %>%
  left_join(lngiLab) %>%
  filter(HHL != 5)
## Joining, by = "HHL"
## Joining, by = "LNGI"
```
```r
## GRAPHICS
# adding error bars:
# http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization
ggplot(allRHL2, aes(x = hhlLab, y = rate, color = lngiLab)) +
  geom_errorbar(aes(ymin = rate - se, ymax = rate + se), width = .2) +
  geom_point()
```
### svyglm

Modeling allocation rate as a function of response mode and household language.
```r
# BUILD STATISTICAL TESTS
mod2 <- svyglm(rate ~ as.factor(RESMODE)*as.factor(HHL), h_design)
summary(mod2)
## 
## Call:
## svyglm(formula = rate ~ as.factor(RESMODE) * as.factor(HHL), 
##     h_design)
## 
## Survey design:
## svrepdesign.default(weights = ~WGTP, repweights = "WGTP[1-9]+", 
##     scale = 4/80, rscales = rep(1, 80), mse = TRUE, combined.weights = TRUE, 
##     type = "JK1", data = keepACS)
## 
## Coefficients:
##                                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                          0.0644976  0.0002586  249.399  < 2e-16 ***
## as.factor(RESMODE)2                 -0.0100030  0.0004060  -24.639  < 2e-16 ***
## as.factor(RESMODE)3                 -0.0394881  0.0003334 -118.433  < 2e-16 ***
## as.factor(HHL)2                      0.0104922  0.0009836   10.667 6.41e-16 ***
## as.factor(HHL)3                      0.0053439  0.0012089    4.421 3.82e-05 ***
## as.factor(HHL)4                      0.0070464  0.0016941    4.159 9.54e-05 ***
## as.factor(HHL)5                      0.0073950  0.0028337    2.610  0.01124 *  
## as.factor(RESMODE)2:as.factor(HHL)2 -0.0125316  0.0011854  -10.572 9.32e-16 ***
## as.factor(RESMODE)3:as.factor(HHL)2  0.0035005  0.0011232    3.116  0.00273 ** 
## as.factor(RESMODE)2:as.factor(HHL)3 -0.0052306  0.0015467   -3.382  0.00122 ** 
## as.factor(RESMODE)3:as.factor(HHL)3  0.0011385  0.0015421    0.738  0.46301    
## as.factor(RESMODE)2:as.factor(HHL)4 -0.0052708  0.0023439   -2.249  0.02792 *  
## as.factor(RESMODE)3:as.factor(HHL)4 -0.0024705  0.0018011   -1.372  0.17489    
## as.factor(RESMODE)2:as.factor(HHL)5 -0.0077060  0.0033839   -2.277  0.02607 *  
## as.factor(RESMODE)3:as.factor(HHL)5  0.0074481  0.0035299    2.110  0.03871 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4115.264)
## 
## Number of Fisher Scoring iterations: 2
```
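The summary above reports the interaction coefficients one at a time; to test the response mode by household language interaction as a whole, the `survey` package's `regTermTest` can be applied (a sketch):

```r
# Wald test of the overall RESMODE:HHL interaction
regTermTest(mod2, ~as.factor(RESMODE):as.factor(HHL))
```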
### Looking at HHL, LNGI, RESMODE, and Allocation Rate
```r
# SVYBY IS FOR STATISTICS ON SUBSETS
# SVYMEAN
rmRHL2 <- svyby(~rate, ~RESMODE+HHL+LNGI, svymean, design = h_design)
#rmRHL2
# MEANS ACROSS SUBSETS FOR EACH YEAR
rmRHL2 <- as.data.frame(rmRHL2) %>%
  mutate(YEAR = this.year) %>%
  filter(HHL != 5)
```
```r
allRHL2 <- rmRHL2 %>%
  left_join(modeLabs) %>%
  left_join(langLabs) %>%
  left_join(lngiLab) %>%
  mutate(yearJ = YEAR + jitter) %>%
  filter(HHL != 5)
## Joining, by = "RESMODE"
## Joining, by = "HHL"
## Joining, by = "LNGI"
```
```r
library(scales)
ggplot(allRHL2, aes(x = jitter2, y = rate, color = lngiLab)) +
  geom_line(alpha = .5, aes(lty = resmodeLab)) +
  geom_point(size = 2, aes(pch = resmodeLab)) +
  #geom_pointrange(aes(ymin = rate - 1.96*se, ymax = rate + 1.96*se)) +
  geom_errorbar(alpha = .3, aes(ymin = rate - 1.96*se, ymax = rate + 1.96*se),
                size = .5, width = .1) +
  facet_grid(~hhlLab) +
  theme_bw() +
  # the x-axis here is only a jitter offset separating the three response
  # modes, so suppress its ticks (a year axis would apply in a multi-year version)
  scale_x_continuous(breaks = NULL) +
  theme(legend.position = "bottom") +
  xlab("") +
  ylab("Allocation Rate")#+
  #scale_color_grey(start = 0.0, end = 0.6, na.value = "white")
```