Working with Census microdata

Most of the data available through tidycensus and the Census API is aggregated to certain geographic levels (tract, county, state, etc.). In other words, the data we get by executing get_acs() has been summarized by the Census Bureau so that we are able to learn how many people live in a particular county or what the median household income of a state is. There are thousands of individual variables that the Census aggregates and publishes in tabular form.

For many purposes, these pre-aggregated tables have enough information to work with. But, the Census Bureau also releases microdata from the American Community Survey. Microdata is the individual-level responses to the ACS that is used to create the summary and detail tables the Census publishes. Instead of a getting one row per state from a table, we can get one row per respondent. For the American Community Survey, this data is called the [Public Use Microdata Sample (PUMS)] (https://www.census.gov/programs-surveys/acs/technical-documentation/pums/about.html).

Using PUMS data instead of the published tables allows you to create custom estimates that aren’t available in pre-aggregated tables. You can also use microdata to fit models on individual-level data instead of modeling block groups or census tracts.

PUMAs

One drawback to be aware of when using PUMS data (as compared to aggregated data) is that you only get the state and public use microdata area (PUMA) of each individual in the microdata. PUMAs are Census geographies that contain at least 100,000 people and are entirely within a single state. They are built from census tracts and counties and may or may not be similar to other recognized geographic boundaries. In New York City, for instance, PUMAs are closely aligned to Community Districts. So, if you are interested in pulling data about block groups, census tracts, or other small areas, you can’t use PUMS data.

PUMS data dictionaries

The first place to start is to identify the variables you want to download. There are hundreds of PUMS variables and each describes characteristics of a person or a housing unit. pums_variables is a dataset built into tidycensus that contains names, descriptions, codes, and other metadata about the PUMS variables.

Because PUMS variables and codes can change slightly by year, you should select which year’s data you will be working with and filter pums_variables for only that year and survey. In the following example, we will be working with the 2019 1-year American Community Survey.

install.packages(c("survey", "srvyr"))

library(tidyverse)
library(tidycensus)

pums_vars_2019 <- pums_variables %>% 
  filter(year == 2019, survey == "acs1")

pums_variables contains both the variables as well as their possible values. The following are the unique variables.

pums_vars_2019 %>% 
  distinct(var_code, var_label, data_type, level)

## # A tibble: 516 x 4
##   var_code var_label                                                                      data_type level
##   <chr>    <chr>                                                                          <chr>     <chr>
## 1 RT       Record Type                                                                    chr       <NA> 
## 2 SERIALNO Housing unit/GQ person serial number                                           chr       <NA> 
## 3 DIVISION Division code based on 2010 Census definitions                                 chr       <NA> 
## 4 PUMA     Public use microdata area code (PUMA) based on 2010 Census definition (areas ~ chr       <NA> 
## 5 REGION   Region code based on 2010 Census definitions                                   chr       <NA> 
## # ... with 511 more rows

Person vs. housing unit

Some PUMS variables relate to individuals (e.g. age, race, educational attainment) while others relate to households/housing units (e.g. number of bedrooms, electricity cost, property value).

Individuals in PUMS data are always nested within a housing unit (The ACS is sent to an housing unit address and all people living in that unit complete the survey.) Housing units are uniquely identified by the SERIALNO variable and persons are uniquely identified by the combination of SERIALNO and SPORDER.

pums_vars_2019 %>% 
  distinct(var_code, var_label, data_type, level) %>% 
  filter(level == "person")

## # A tibble: 281 x 4
##   var_code var_label                       data_type level 
##   <chr>    <chr>                           <chr>     <chr> 
## 1 SPORDER  Person number                   num       person
## 2 PWGTP    Person's weight                 num       person
## 3 AGEP     Age                             num       person
## 4 CIT      Citizenship status              chr       person
## 5 CITWP    Year of naturalization write-in num       person
## # ... with 276 more rows

It is important to be mindful of whether the variables you choose to analyze are person- or household-level variables.

Using `get_pums()` to download PUMS data

To download PUMS data from the Census API, use the get_pums() function. The key arguments to specify are variables, state, survey, and year. Here we get PUMA, SEX, AGEP, and SCHL variables for Vermont from the 2019 1-year ACS.

vt_pums <- get_pums(
  variables = c("PUMA", "SEX", "AGEP", "SCHL"),
  state = "VT",
  survey = "acs1",
  year = 2019
  )

vt_pums

## # A tibble: 6,543 x 9
##   SERIALNO      SPORDER  WGTP PWGTP  AGEP PUMA  ST    SCHL  SEX  
##   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 2019GQ0001326       1     0    32    19 00200 50    19    2    
## 2 2019GQ0003653       1     0    18    85 00100 50    16    1    
## 3 2019GQ0005151       1     0    57    19 00400 50    19    2    
## 4 2019GQ0006126       1     0   127    20 00100 50    16    1    
## 5 2019GQ0006733       1     0   127    21 00100 50    19    2    
## # ... with 6,538 more rows

We get 6543 rows and 9 columns. In addition to the variables we specified, get_pums() also always returns SERIALNO, SPORDER, WGTP, PWGTP, and ST. SERIALNO and SPORDER are the variables that uniquely identify observations, WGTP and PWGTP are the housing-unit and person weights, and ST is the state code.

ST, SEX and SCHL return coded character variables. You can look up what these codes represent in the pums_variables dataset provided or set recode = TRUE in get_pums() to return additional columns with the values of these variables recoded.

vt_pums_recoded <- get_pums(
  variables = c("PUMA", "SEX", "AGEP", "SCHL"),
  state = "VT",
  survey = "acs1",
  year = 2019,
  recode = TRUE
  )

vt_pums_recoded

## # A tibble: 6,543 x 12
##   SERIALNO   SPORDER  WGTP PWGTP  AGEP PUMA  ST    SCHL  SEX   ST_label  SCHL_label             SEX_label
##   <chr>        <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <ord>     <ord>                  <ord>    
## 1 2019GQ000~       1     0    32    19 00200 50    19    2     Vermont/~ 1 or more years of co~ Female   
## 2 2019GQ000~       1     0    18    85 00100 50    16    1     Vermont/~ Regular high school d~ Male     
## 3 2019GQ000~       1     0    57    19 00400 50    19    2     Vermont/~ 1 or more years of co~ Female   
## 4 2019GQ000~       1     0   127    20 00100 50    16    1     Vermont/~ Regular high school d~ Male     
## 5 2019GQ000~       1     0   127    21 00100 50    19    2     Vermont/~ 1 or more years of co~ Female   
## # ... with 6,538 more rows

Analyzing PUMS data

PUMS data is a sample of about 1% of the US population. When we want to estimate a variable that represents the entire population rather than a sample, we have to apply a weighting adjustment. For each observation in the sample, the weight variable tells us the number of people in the population that the observation represents. By adding up all the person weights in our VT PUMS data, we get the (estimated) population of Vermont.

sum(vt_pums_recoded$PWGTP)

## [1] 623989

Weighting PUMS data can also be approached through the wt argument in dplyr::count(). Here, we calculate the population by sex for each PUMA in Vermont (there are only four in the whole state!).

vt_pums_recoded %>% 
  count(PUMA, SEX_label, wt = PWGTP)

## # A tibble: 8 x 3
##   PUMA  SEX_label      n
##   <chr> <ord>      <dbl>
## 1 00100 Male      107747
## 2 00100 Female    112159
## 3 00200 Male       73971
## 4 00200 Female     73741
## 5 00300 Male       62414
## 6 00300 Female     63558
## 7 00400 Male       65276
## 8 00400 Female     65123

Many of the variables included in the PUMS data are categorical and we might want to group some categories together and estimate the proportion of the population with these characteristics. In the example below, first we create a new variable that is whether or not the person has a Bachelor’s degree or above, group by PUMA and sex, then calculate the total population, average age, total with BA or above (only for people 25 and older), and percent with BA or above.

vt_pums_recoded %>% 
  mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>% 
  group_by(PUMA, SEX_label) %>% 
  summarize(
    total_pop = sum(PWGTP),
    mean_age = weighted.mean(AGEP, PWGTP),
    ba_above = sum(PWGTP[ba_above == TRUE & AGEP >= 25]),
    ba_above_pct = ba_above / sum(PWGTP[AGEP >= 25])
  )

## # A tibble: 8 x 6
## # Groups:   PUMA [4]
##   PUMA  SEX_label total_pop mean_age ba_above ba_above_pct
##   <chr> <ord>         <dbl>    <dbl>    <dbl>        <dbl>
## 1 00100 Male         107747     38.3    30263        0.417
## 2 00100 Female       112159     40.4    35339        0.452
## 3 00200 Male          73971     41.3    17871        0.339
## 4 00200 Female        73741     43.4    22277        0.411
## 5 00300 Male          62414     43.4    15450        0.339
## 6 00300 Female        63558     46.4    22053        0.445
## 7 00400 Male          65276     42.1    13788        0.303
## 8 00400 Female        65123     44.3    18206        0.386

Calculating standard errors

It is important to calculate standard errors when making custom estimates using PUMS data, because it can be easy to create a very small sub-group where there are not enough observations in the sample to make a reliable estimate.

In order to generate reliable standard errors, the Census Bureau provides a set of replicate weights for each observation in the PUMS dataset. These replicate weights are used to simulate multiple samples from the single PUMS sample and can be used to calculate more precise standard errors. PUMS data contains both person- and housing-unit-level replicate weights.

To download replicate weights along with PUMS data, set the rep_weights argument in get_pums() to "person", "housing", or "both". Here we get the same Vermont PUMS data as above in addition to the 80 person replicate weight columns.

vt_pums_rep_weights <- get_pums(
  variables = c("PUMA", "SEX", "AGEP", "SCHL"),
  state = "VT",
  survey = "acs1",
  year = 2019,
  recode = TRUE,
  rep_weights = "person"
  )

To convert this data frame to a survey or srvyr object, we can use the to_survey() function.

vt_survey_design <- to_survey(vt_pums_rep_weights)

By default, to_survey() converts a data frame to a tbl_svy object by using person replicate weights. You can change the arguments in to_survey if you are analyzing housing-unit data (type = "housing").

With the srvyr design object defined, we can calculate the same estimates as above, but we now get column with standard errors as well.

library(srvyr, warn.conflicts = FALSE)

vt_survey_design %>% 
  survey_count(PUMA, SEX_label)

## # A tibble: 8 x 4
##   PUMA  SEX_label      n  n_se
##   <chr> <ord>      <dbl> <dbl>
## 1 00100 Male      107747  844.
## 2 00100 Female    112159  906.
## 3 00200 Male       73971  824.
## 4 00200 Female     73741  778.
## 5 00300 Male       62414  610.
## 6 00300 Female     63558  651.
## 7 00400 Male       65276  613.
## 8 00400 Female     65123  574.

Modeling with PUMS data

The following is a use case for PUMS data that incorporates a regression analysis. It predicts an individual’s commute to work (JWMNP) based on their wages, employer type (private, public, self), and which PUMA they live in. We’ll use the 2019 5-year estimates so we have a bigger sample to work with.

vt_pums_to_model <- get_pums(
  variables = c("PUMA", "WAGP", "JWMNP", "COW", "ESR"),
  state = "VT",
  survey = "acs5",
  year = 2019
  )

Now, we filter out observations that aren’t relevant, do a little recoding of the class of worker variable, and finally convert the data frame to a survey design object. Note that we did not request replicate weights from the API, so we specify design = "cluster" when we create the survey design. This creates the survey using a cluster design with the household (SERIALNO) as the primary sampling unit, the PUMA as the strata, and PWGTP as the weight.

vt_model_sd <- vt_pums_to_model %>% 
  filter(
    ESR == 1,   # civilian employed
    WAGP > 0,   # earned wages last year
    JWMNP > 0   # commute more than zero min
    ) %>%
  mutate(
    emp_type = case_when(
      COW %in% c("1", "2")      ~ "private",
      COW %in% c("3", "4", "5") ~ "public",
      TRUE                      ~ "self"
      )
    ) %>%
  to_survey(design = "cluster")

View summary stats through srvyr.

vt_model_sd %>% 
  summarize(
    n              = survey_total(1),
    mean_wage      = survey_mean(WAGP),
    median_wage    = survey_median(WAGP),
    mean_commute   = survey_mean(JWMNP),
    median_commute = survey_median(JWMNP)
    )

##        n     n_se mean_wage mean_wage_se median_wage median_wage_se mean_commute mean_commute_se
## 1 283906 3347.998  45348.25     534.6333    36058.82       688.6967     23.74635       0.2597597
##   median_commute median_commute_se
## 1             20                 0

vt_model_sd %>% 
  survey_count(emp_type)

## # A tibble: 3 x 3
##   emp_type      n  n_se
##   <chr>     <dbl> <dbl>
## 1 private  226860 3168.
## 2 public    41740 1198.
## 3 self      15306  779.

Fit a simple linear regression model.

model <- survey::svyglm(log(JWMNP) ~ log(WAGP) + emp_type + PUMA, design = vt_model_sd)
summary(model)

## 
## Call:
## svyglm(formula = log(JWMNP) ~ log(WAGP) + emp_type + PUMA, design = vt_model_sd)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.740090   0.108450  16.045  < 2e-16 ***
## log(WAGP)       0.111551   0.010102  11.043  < 2e-16 ***
## emp_typepublic -0.030211   0.024281  -1.244    0.213    
## emp_typeself   -0.267282   0.045974  -5.814 6.31e-09 ***
## PUMA00200       0.006020   0.028194   0.214    0.831    
## PUMA00300       0.007023   0.028483   0.247    0.805    
## PUMA00400      -0.118739   0.028305  -4.195 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6851087)
## 
## Number of Fisher Scoring iterations: 2

Mapping PUMS data

If you’ve used the spatial functionality in tidycensus before, you might also want to put your custom PUMS estimates on a map. To do this, let’s use tigris to download PUMA boundaries for the states in New England as an sf object.

ne_states <- c("VT", "NH", "ME", "MA", "CT", "RI")
ne_pumas <- map(ne_states, tigris::pumas, class = "sf", cb = TRUE) %>% 
  reduce(rbind)

Next we download the income-to-poverty ratio from the PUMS dataset and calculate the percentage of population below 200% of the poverty line for each PUMA.

ne_pums <- get_pums(
  variables = c("PUMA", "POVPIP"),
  state = ne_states,
  survey = "acs1",
  year = 2019
  )

ne_pov <- ne_pums %>%
  group_by(ST, PUMA) %>%
  summarize(
    total_pop = sum(PWGTP),
    pct_in_pov = sum(PWGTP[POVPIP < 200]) / total_pop
  )

And now we can make a choropleth map by joining the PUMA boundaries with the PUMS data.

ne_pumas %>%
  left_join(ne_pov, by = c("STATEFP10" = "ST", "PUMACE10" = "PUMA")) %>%
  ggplot(aes(fill = pct_in_pov)) +
  geom_sf() +
  scale_fill_viridis_b(
    name = NULL,
    option = "magma",
    labels = scales::label_percent(1)
    ) +
  labs(title = "Percentage of population below 200% of the poverty line") +
  theme_void()