Most of the data available through tidycensus and the Census API is aggregated to certain geographic levels (tract, county, state, etc.). In other words, the data we get by executing get_acs()
has been summarized by the Census Bureau so that we are able to learn how many people live in a particular county or what the median household income of a state is. There are thousands of individual variables that the Census aggregates and publishes in tabular form.
For many purposes, these pre-aggregated tables have enough information to work with. But, the Census Bureau also releases microdata from the American Community Survey. Microdata is the individual-level responses to the ACS that is used to create the summary and detail tables the Census publishes. Instead of a getting one row per state from a table, we can get one row per respondent. For the American Community Survey, this data is called the [Public Use Microdata Sample (PUMS)] (https://www.census.gov/programs-surveys/acs/technical-documentation/pums/about.html).
Using PUMS data instead of the published tables allows you to create custom estimates that aren’t available in pre-aggregated tables. You can also use microdata to fit models on individual-level data instead of modeling block groups or census tracts.
One drawback to be aware of when using PUMS data (as compared to aggregated data) is that you only get the state and public use microdata area (PUMA) of each individual in the microdata. PUMAs are Census geographies that contain at least 100,000 people and are entirely within a single state. They are built from census tracts and counties and may or may not be similar to other recognized geographic boundaries. In New York City, for instance, PUMAs are closely aligned to Community Districts. So, if you are interested in pulling data about block groups, census tracts, or other small areas, you can’t use PUMS data.
The first place to start is to identify the variables you want to download. There are hundreds of PUMS variables and each describes characteristics of a person or a housing unit. pums_variables
is a dataset built into tidycensus that contains names, descriptions, codes, and other metadata about the PUMS variables.
Because PUMS variables and codes can change slightly by year, you should select which year’s data you will be working with and filter pums_variables
for only that year and survey. In the following example, we will be working with the 2019 1-year American Community Survey.
install.packages(c("survey", "srvyr"))
library(tidyverse)
library(tidycensus)
pums_vars_2019 <- pums_variables %>%
filter(year == 2019, survey == "acs1")
pums_variables
contains both the variables as well as their possible values. The following are the unique variables.
pums_vars_2019 %>%
distinct(var_code, var_label, data_type, level)
## # A tibble: 516 x 4
## var_code var_label data_type level
## <chr> <chr> <chr> <chr>
## 1 RT Record Type chr <NA>
## 2 SERIALNO Housing unit/GQ person serial number chr <NA>
## 3 DIVISION Division code based on 2010 Census definitions chr <NA>
## 4 PUMA Public use microdata area code (PUMA) based on 2010 Census definition (areas ~ chr <NA>
## 5 REGION Region code based on 2010 Census definitions chr <NA>
## # ... with 511 more rows
Some PUMS variables relate to individuals (e.g. age, race, educational attainment) while others relate to households/housing units (e.g. number of bedrooms, electricity cost, property value).
Individuals in PUMS data are always nested within a housing unit (The ACS is sent to an housing unit address and all people living in that unit complete the survey.) Housing units are uniquely identified by the SERIALNO
variable and persons are uniquely identified by the combination of SERIALNO
and SPORDER
.
pums_vars_2019 %>%
distinct(var_code, var_label, data_type, level) %>%
filter(level == "person")
## # A tibble: 281 x 4
## var_code var_label data_type level
## <chr> <chr> <chr> <chr>
## 1 SPORDER Person number num person
## 2 PWGTP Person's weight num person
## 3 AGEP Age num person
## 4 CIT Citizenship status chr person
## 5 CITWP Year of naturalization write-in num person
## # ... with 276 more rows
It is important to be mindful of whether the variables you choose to analyze are person- or household-level variables.
get_pums()
to download PUMS dataTo download PUMS data from the Census API, use the get_pums()
function. The key arguments to specify are variables
, state
, survey
, and year
. Here we get PUMA
, SEX
, AGEP
, and SCHL
variables for Vermont from the 2019 1-year ACS.
vt_pums <- get_pums(
variables = c("PUMA", "SEX", "AGEP", "SCHL"),
state = "VT",
survey = "acs1",
year = 2019
)
vt_pums
## # A tibble: 6,543 x 9
## SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 2019GQ0001326 1 0 32 19 00200 50 19 2
## 2 2019GQ0003653 1 0 18 85 00100 50 16 1
## 3 2019GQ0005151 1 0 57 19 00400 50 19 2
## 4 2019GQ0006126 1 0 127 20 00100 50 16 1
## 5 2019GQ0006733 1 0 127 21 00100 50 19 2
## # ... with 6,538 more rows
We get 6543 rows and 9 columns. In addition to the variables we specified, get_pums()
also always returns SERIALNO
, SPORDER
, WGTP
, PWGTP
, and ST
. SERIALNO
and SPORDER
are the variables that uniquely identify observations, WGTP
and PWGTP
are the housing-unit and person weights, and ST
is the state code.
ST
, SEX
and SCHL
return coded character variables. You can look up what these codes represent in the pums_variables
dataset provided or set recode = TRUE
in get_pums()
to return additional columns with the values of these variables recoded.
vt_pums_recoded <- get_pums(
variables = c("PUMA", "SEX", "AGEP", "SCHL"),
state = "VT",
survey = "acs1",
year = 2019,
recode = TRUE
)
vt_pums_recoded
## # A tibble: 6,543 x 12
## SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST SCHL SEX ST_label SCHL_label SEX_label
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <ord> <ord> <ord>
## 1 2019GQ000~ 1 0 32 19 00200 50 19 2 Vermont/~ 1 or more years of co~ Female
## 2 2019GQ000~ 1 0 18 85 00100 50 16 1 Vermont/~ Regular high school d~ Male
## 3 2019GQ000~ 1 0 57 19 00400 50 19 2 Vermont/~ 1 or more years of co~ Female
## 4 2019GQ000~ 1 0 127 20 00100 50 16 1 Vermont/~ Regular high school d~ Male
## 5 2019GQ000~ 1 0 127 21 00100 50 19 2 Vermont/~ 1 or more years of co~ Female
## # ... with 6,538 more rows
PUMS data is a sample of about 1% of the US population. When we want to estimate a variable that represents the entire population rather than a sample, we have to apply a weighting adjustment. For each observation in the sample, the weight variable tells us the number of people in the population that the observation represents. By adding up all the person weights in our VT PUMS data, we get the (estimated) population of Vermont.
sum(vt_pums_recoded$PWGTP)
## [1] 623989
Weighting PUMS data can also be approached through the wt
argument in dplyr::count()
. Here, we calculate the population by sex for each PUMA in Vermont (there are only four in the whole state!).
vt_pums_recoded %>%
count(PUMA, SEX_label, wt = PWGTP)
## # A tibble: 8 x 3
## PUMA SEX_label n
## <chr> <ord> <dbl>
## 1 00100 Male 107747
## 2 00100 Female 112159
## 3 00200 Male 73971
## 4 00200 Female 73741
## 5 00300 Male 62414
## 6 00300 Female 63558
## 7 00400 Male 65276
## 8 00400 Female 65123
Many of the variables included in the PUMS data are categorical and we might want to group some categories together and estimate the proportion of the population with these characteristics. In the example below, first we create a new variable that is whether or not the person has a Bachelor’s degree or above, group by PUMA and sex, then calculate the total population, average age, total with BA or above (only for people 25 and older), and percent with BA or above.
vt_pums_recoded %>%
mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>%
group_by(PUMA, SEX_label) %>%
summarize(
total_pop = sum(PWGTP),
mean_age = weighted.mean(AGEP, PWGTP),
ba_above = sum(PWGTP[ba_above == TRUE & AGEP >= 25]),
ba_above_pct = ba_above / sum(PWGTP[AGEP >= 25])
)
## # A tibble: 8 x 6
## # Groups: PUMA [4]
## PUMA SEX_label total_pop mean_age ba_above ba_above_pct
## <chr> <ord> <dbl> <dbl> <dbl> <dbl>
## 1 00100 Male 107747 38.3 30263 0.417
## 2 00100 Female 112159 40.4 35339 0.452
## 3 00200 Male 73971 41.3 17871 0.339
## 4 00200 Female 73741 43.4 22277 0.411
## 5 00300 Male 62414 43.4 15450 0.339
## 6 00300 Female 63558 46.4 22053 0.445
## 7 00400 Male 65276 42.1 13788 0.303
## 8 00400 Female 65123 44.3 18206 0.386
It is important to calculate standard errors when making custom estimates using PUMS data, because it can be easy to create a very small sub-group where there are not enough observations in the sample to make a reliable estimate.
In order to generate reliable standard errors, the Census Bureau provides a set of replicate weights for each observation in the PUMS dataset. These replicate weights are used to simulate multiple samples from the single PUMS sample and can be used to calculate more precise standard errors. PUMS data contains both person- and housing-unit-level replicate weights.
To download replicate weights along with PUMS data, set the rep_weights
argument in get_pums()
to "person"
, "housing"
, or "both"
. Here we get the same Vermont PUMS data as above in addition to the 80 person replicate weight columns.
vt_pums_rep_weights <- get_pums(
variables = c("PUMA", "SEX", "AGEP", "SCHL"),
state = "VT",
survey = "acs1",
year = 2019,
recode = TRUE,
rep_weights = "person"
)
To convert this data frame to a survey or srvyr object, we can use the to_survey()
function.
vt_survey_design <- to_survey(vt_pums_rep_weights)
By default, to_survey()
converts a data frame to a tbl_svy
object by using person replicate weights. You can change the arguments in to_survey
if you are analyzing housing-unit data (type = "housing"
).
With the srvyr design object defined, we can calculate the same estimates as above, but we now get column with standard errors as well.
library(srvyr, warn.conflicts = FALSE)
vt_survey_design %>%
survey_count(PUMA, SEX_label)
## # A tibble: 8 x 4
## PUMA SEX_label n n_se
## <chr> <ord> <dbl> <dbl>
## 1 00100 Male 107747 844.
## 2 00100 Female 112159 906.
## 3 00200 Male 73971 824.
## 4 00200 Female 73741 778.
## 5 00300 Male 62414 610.
## 6 00300 Female 63558 651.
## 7 00400 Male 65276 613.
## 8 00400 Female 65123 574.
The following is a use case for PUMS data that incorporates a regression analysis. It predicts an individual’s commute to work (JWMNP
) based on their wages, employer type (private, public, self), and which PUMA they live in. We’ll use the 2019 5-year estimates so we have a bigger sample to work with.
vt_pums_to_model <- get_pums(
variables = c("PUMA", "WAGP", "JWMNP", "COW", "ESR"),
state = "VT",
survey = "acs5",
year = 2019
)
Now, we filter out observations that aren’t relevant, do a little recoding of the class of worker variable, and finally convert the data frame to a survey design object. Note that we did not request replicate weights from the API, so we specify design = "cluster"
when we create the survey design. This creates the survey using a cluster design with the household (SERIALNO
) as the primary sampling unit, the PUMA as the strata, and PWGTP
as the weight.
vt_model_sd <- vt_pums_to_model %>%
filter(
ESR == 1, # civilian employed
WAGP > 0, # earned wages last year
JWMNP > 0 # commute more than zero min
) %>%
mutate(
emp_type = case_when(
COW %in% c("1", "2") ~ "private",
COW %in% c("3", "4", "5") ~ "public",
TRUE ~ "self"
)
) %>%
to_survey(design = "cluster")
View summary stats through srvyr
.
vt_model_sd %>%
summarize(
n = survey_total(1),
mean_wage = survey_mean(WAGP),
median_wage = survey_median(WAGP),
mean_commute = survey_mean(JWMNP),
median_commute = survey_median(JWMNP)
)
## n n_se mean_wage mean_wage_se median_wage median_wage_se mean_commute mean_commute_se
## 1 283906 3347.998 45348.25 534.6333 36058.82 688.6967 23.74635 0.2597597
## median_commute median_commute_se
## 1 20 0
vt_model_sd %>%
survey_count(emp_type)
## # A tibble: 3 x 3
## emp_type n n_se
## <chr> <dbl> <dbl>
## 1 private 226860 3168.
## 2 public 41740 1198.
## 3 self 15306 779.
Fit a simple linear regression model.
model <- survey::svyglm(log(JWMNP) ~ log(WAGP) + emp_type + PUMA, design = vt_model_sd)
summary(model)
##
## Call:
## svyglm(formula = log(JWMNP) ~ log(WAGP) + emp_type + PUMA, design = vt_model_sd)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.740090 0.108450 16.045 < 2e-16 ***
## log(WAGP) 0.111551 0.010102 11.043 < 2e-16 ***
## emp_typepublic -0.030211 0.024281 -1.244 0.213
## emp_typeself -0.267282 0.045974 -5.814 6.31e-09 ***
## PUMA00200 0.006020 0.028194 0.214 0.831
## PUMA00300 0.007023 0.028483 0.247 0.805
## PUMA00400 -0.118739 0.028305 -4.195 2.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.6851087)
##
## Number of Fisher Scoring iterations: 2
If you’ve used the spatial functionality in tidycensus before, you might also want to put your custom PUMS estimates on a map. To do this, let’s use tigris
to download PUMA boundaries for the states in New England as an sf
object.
ne_states <- c("VT", "NH", "ME", "MA", "CT", "RI")
ne_pumas <- map(ne_states, tigris::pumas, class = "sf", cb = TRUE) %>%
reduce(rbind)
Next we download the income-to-poverty ratio from the PUMS dataset and calculate the percentage of population below 200% of the poverty line for each PUMA.
ne_pums <- get_pums(
variables = c("PUMA", "POVPIP"),
state = ne_states,
survey = "acs1",
year = 2019
)
ne_pov <- ne_pums %>%
group_by(ST, PUMA) %>%
summarize(
total_pop = sum(PWGTP),
pct_in_pov = sum(PWGTP[POVPIP < 200]) / total_pop
)
And now we can make a choropleth map by joining the PUMA boundaries with the PUMS data.
ne_pumas %>%
left_join(ne_pov, by = c("STATEFP10" = "ST", "PUMACE10" = "PUMA")) %>%
ggplot(aes(fill = pct_in_pov)) +
geom_sf() +
scale_fill_viridis_b(
name = NULL,
option = "magma",
labels = scales::label_percent(1)
) +
labs(title = "Percentage of population below 200% of the poverty line") +
theme_void()