Tidycensus demo

How to pull PUMS data in R

In this demo we will be creating Table 52 in the Consolidated Plan: “Estimated Number of Housing Units Occupied by Low- or Moderate-Income Families with LBP Hazards”

Setting up

If you are at the office or on your VPN, you will need to load the ‘curl’ package and run these three lines of code to get around the proxy server and be able to use the census API

#function to get around proxy server for tidycensus
library(curl)

## Using libcurl 7.64.1 with Schannel

companyproxy <- curl::ie_proxy_info()$Proxy
Sys.setenv(http_proxy=companyproxy)
Sys.setenv(https_proxy=companyproxy)

Tidyverse and Tidycensus are the two packages required to pull census data in R. The ‘srvyr’ package is useful to make estimations with weighted PUMS data.

#load packages
library(tidycensus)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x readr::parse_date() masks curl::parse_date()

library(srvyr)

## 
## Attaching package: 'srvyr'

## The following object is masked from 'package:stats':
## 
##     filter

Here we are creating vectors of Philadelphia’s PUMA codes and the PUMS variables that we are interested in.
(Actually some of these variables are not necessary for this table but I just pulled this over from the GAIN Act Analysis)

#PUMAs that make up Philadelphia
phillypuma <- c("03201","03202",
                "03203","03204", 
                "03205","03206", 
                "03207","03208", 
                "03209","03210", 
                "03211")

#variables of interest
pvariable <- c("ADJINC", #Adjustment factor for income and earnings dollar amounts
               "HINCP", #Household Income
               "RNTP", #Rent
               "GRNTP", #Gross Rent
               "ADJHSG", #Adjustment factor for housing dollar amounts
               "TEN",  #Housing Tenure
               "PWGTP", #Person Weight
               "WGTP", #Housing Unit Weight
               "NP", #Number of People
               "BDSP", #Number of Bedrooms
               "HISP", #Hispanic/not Hispanic
               "RAC1P", #Race
               "RELSHIPP", #relationship to the reference person
               "HHT", #Household/family type
               "HUPAC", #presence of children in the household
               "YBL")#Year Structure Built

To view variable codes and options, you can either:
1. Create an object in our environment with the year and survey that we want to view variables from.
2. Or we can simply type View(pums_variables) to view variables from all available PUMS data.
In the following code block we go with the first option so that we don’t have to sift through variables from surveys/years we’re not working with.

pumsvars19 <- pums_variables %>% filter(year == 2019, survey == "acs5")
pumsvars19

## # A tibble: 5,277 x 12
##    survey year  var_code var_label   data_type level val_min  val_max val_label 
##    <chr>  <chr> <chr>    <chr>       <chr>     <chr> <chr>    <chr>   <chr>     
##  1 acs5   2019  RT       Record Type chr       <NA>  H        H       Housing R~
##  2 acs5   2019  RT       Record Type chr       <NA>  P        P       Person Re~
##  3 acs5   2019  SERIALNO Housing un~ chr       <NA>  2015000~ 201799~ Unique id~
##  4 acs5   2019  SERIALNO Housing un~ chr       <NA>  2018GQ0~ 2019GQ~ GQ Unique~
##  5 acs5   2019  SERIALNO Housing un~ chr       <NA>  2018HU0~ 2019HU~ HU Unique~
##  6 acs5   2019  DIVISION Division c~ chr       <NA>  0        0       Puerto Ri~
##  7 acs5   2019  DIVISION Division c~ chr       <NA>  1        1       New Engla~
##  8 acs5   2019  DIVISION Division c~ chr       <NA>  2        2       Middle At~
##  9 acs5   2019  DIVISION Division c~ chr       <NA>  3        3       East Nort~
## 10 acs5   2019  DIVISION Division c~ chr       <NA>  4        4       West Nort~
## # ... with 5,267 more rows, and 3 more variables: recode <lgl>,
## #   val_length <int>, val_na <dbl>

Grabbing the data

The get_pums() function is where you are able to load a PUMS dataset into R. Most of the time, the required arguments are puma, state, variables, year, survey, and key (which is your Census API key). Go to this website to get a free API key from the Census Bureau. The following API call gets PUMA data from the 2019 5 year ACS for all PUMAs within Philadelphia. The variables (whose codes I found from the above table.) are from the pvariable object that we created above. The data frame resulting from this next code block returns a dataset where each row signifies an individual and each column is a variable.

Depending on what kind of analysis you’re doing you can select to include person replicate weights, household replicate weights, or both. For the purposes of this demo we will be using household replicate weights, but I will include both because it doesn’t hurt and if we decided that we wanted person replicate weight later down the line we wouldn’t have to go through these steps again. But more on replicate weights later.

The recode argument defaults to FALSE, but if you change it to TRUE there will be additional columns in your data that will tell you what your columns mean. I find this useful so I don’t have to keep scrolling through the large list of variables.

#function to get PUMS data 
phl_pums <- get_pums(
  variables = pvariable,
  state = "PA",
  puma = phillypuma,
  survey = "acs5",
  year = 2019,
  rep_weights = "both",
  recode = TRUE,
  key = "82bbcc6711ff4a4923edffb104d47daaa8d191cd"
)

## Getting data from the 2015-2019 5-year ACS Public Use Microdata Sample

Manipulating the data to create new variables

Because this is the 5 year ACS, we must adjust the income variable so that it is all in 2018 dollars. Multiplying HINCP AND ADJINC will do the trick. We can call the resulting column adjusted_inc.

#put the adjusted income variable into the data frame
phl_pums$adjusted_inc <- as.double(phl_pums$HINCP) * as.double(phl_pums$ADJINC)

Next, we want to create an Area Median Income (AMI) variable. There is probably a cleaner way to do this, but this way works so I don’t want to mess it up.

#create a column that tells you weather the household is below 80 AMI based on size of HH
phl_pums$AMI80 <- ifelse(phl_pums$NP == 1 & phl_pums$adjusted_inc <= 50500, "Below80AMI",
                  ifelse(phl_pums$NP == 2 & phl_pums$adjusted_inc <= 57700, "Below80AMI",
                  ifelse(phl_pums$NP == 3 & phl_pums$adjusted_inc <= 64900, "Below80AMI",
                  ifelse(phl_pums$NP == 4 & phl_pums$adjusted_inc <= 72100, "Below80AMI",
                  ifelse(phl_pums$NP == 5 & phl_pums$adjusted_inc <= 77900, "Below80AMI",
                  ifelse(phl_pums$NP == 6 & phl_pums$adjusted_inc <= 83650, "Below80AMI",
                  ifelse(phl_pums$NP == 7 & phl_pums$adjusted_inc <= 89450, "Below80AMI",
                  ifelse(phl_pums$NP == 8 & phl_pums$adjusted_inc <= 95200, "Below80AMI", 
                  ifelse(phl_pums$NP == 9 & phl_pums$adjusted_inc <= 100950, "Below80AMI",
                  ifelse(phl_pums$NP == 10 & phl_pums$adjusted_inc <= 106750, "Below80AMI",
                  ifelse(phl_pums$NP == 11 & phl_pums$adjusted_inc <= 112500, "Below80AMI",
                  ifelse(phl_pums$NP == 12 & phl_pums$adjusted_inc <= 118250, "Below80AMI", "Above80AMI"))))))))))))

The following code counts the number of households in the data that make above and below 80% AMI. By adding wt = WGTP in the dplyr::count() function, we will weight the data and get an estimation of how many people live in households making more or less than 80% AMI. *Note: the data that we pulled has data at the person level. If you want to restrict it to household level data, you can filter for SPORDER == 1 or RELSHIPP == 20. This will filter the data to only show the Reference Person for the household– the head of household.

#count number of observations above and below 80% AMI
phl_pums %>% group_by(AMI80) %>% dplyr::count(wt = PWGTP)

## # A tibble: 2 x 2
## # Groups:   AMI80 [2]
##   AMI80           n
##   <chr>       <dbl>
## 1 Above80AMI 638879
## 2 Below80AMI 940954

Here we do the same thing for 50% AMI and 30% AMI

#create a column that tells you weather the household is below 50 AMI based on size of HH
phl_pums$AMI50 <- ifelse(phl_pums$NP == 1 & phl_pums$adjusted_inc <= 31550, "Below50AMI",
                  ifelse(phl_pums$NP == 2 & phl_pums$adjusted_inc <= 36050, "Below50AMI",
                  ifelse(phl_pums$NP == 3 & phl_pums$adjusted_inc <= 40550, "Below50AMI",
                  ifelse(phl_pums$NP == 4 & phl_pums$adjusted_inc <= 45050, "Below50AMI",
                  ifelse(phl_pums$NP == 5 & phl_pums$adjusted_inc <= 48700, "Below50AMI",
                  ifelse(phl_pums$NP == 6 & phl_pums$adjusted_inc <= 52300, "Below50AMI",
                  ifelse(phl_pums$NP == 7 & phl_pums$adjusted_inc <= 55900, "Below50AMI",
                  ifelse(phl_pums$NP == 8 & phl_pums$adjusted_inc <= 59500, "Below50AMI", 
                  ifelse(phl_pums$NP == 9 & phl_pums$adjusted_inc <= 63100, "Below50AMI",
                  ifelse(phl_pums$NP == 10 & phl_pums$adjusted_inc <= 66700, "Below50AMI",
                  ifelse(phl_pums$NP == 11 & phl_pums$adjusted_inc <= 70300, "Below50AMI",
                  ifelse(phl_pums$NP == 12 & phl_pums$adjusted_inc <= 73900, "Below50AMI", "Above50AMI"))))))))))))

#count number of observations above and below 50% AMI
phl_pums %>% group_by(AMI50) %>% dplyr::count(wt=PWGTP)

## # A tibble: 2 x 2
## # Groups:   AMI50 [2]
##   AMI50           n
##   <chr>       <dbl>
## 1 Above50AMI 915988
## 2 Below50AMI 663845

#create a column that tells you weather the household is below 30 AMI based on size of HH
phl_pums$AMI30 <- ifelse(phl_pums$NP == 1 & phl_pums$adjusted_inc <= 18950, "Below30AMI",
                  ifelse(phl_pums$NP == 2 & phl_pums$adjusted_inc <= 21650, "Below30AMI",
                  ifelse(phl_pums$NP == 3 & phl_pums$adjusted_inc <= 24350, "Below30AMI",
                  ifelse(phl_pums$NP == 4 & phl_pums$adjusted_inc <= 27050, "Below30AMI",
                  ifelse(phl_pums$NP == 5 & phl_pums$adjusted_inc <= 30170, "Below30AMI",
                  ifelse(phl_pums$NP == 6 & phl_pums$adjusted_inc <= 34590, "Below30AMI",
                  ifelse(phl_pums$NP == 7 & phl_pums$adjusted_inc <= 39010, "Below30AMI",
                  ifelse(phl_pums$NP == 8 & phl_pums$adjusted_inc <= 43430, "Below30AMI", 
                  ifelse(phl_pums$NP == 9 & phl_pums$adjusted_inc <= 47850, "Below30AMI",
                  ifelse(phl_pums$NP == 10 & phl_pums$adjusted_inc <= 52270, "Below30AMI",
                  ifelse(phl_pums$NP == 11 & phl_pums$adjusted_inc <= 56690, "Below30AMI",
                  ifelse(phl_pums$NP == 12 & phl_pums$adjusted_inc <= 61110, "Below30AMI", "Above30AMI"))))))))))))


#count number of observations above and below 30% AMI
phl_pums %>% group_by(AMI30) %>% dplyr::count(wt=PWGTP)

## # A tibble: 2 x 2
## # Groups:   AMI30 [2]
##   AMI30            n
##   <chr>        <dbl>
## 1 Above30AMI 1148198
## 2 Below30AMI  431635

Here we are grouping people together based off of AMI. So far we have three separate variables telling us whether a household is above or below a certain percent AMI, but we need to be able to analyse them as groups. So we use another ifelse() function to create a single variable that tells us which AMI bucket a certain Household falls into.

#create buckets so there is no overlap between below80AMI, below50AMI, and below30AMI
phl_pums$buckets <- ifelse(phl_pums$AMI80 == "Below80AMI" & phl_pums$AMI50 == "Above50AMI","50-80", 
                    ifelse(phl_pums$AMI50 == "Below50AMI" & phl_pums$AMI30 == "Above30AMI","30-50", 
                    ifelse(phl_pums$AMI30 == "Below30AMI","0-30",
                    ifelse(phl_pums$AMI80 == "Above80AMI", ">80", NA))))
#count the number of observations in each bucket
phl_pums %>% group_by(buckets) %>% dplyr::count(wt=PWGTP)

## # A tibble: 4 x 2
## # Groups:   buckets [4]
##   buckets      n
##   <chr>    <dbl>
## 1 >80     638879
## 2 0-30    431635
## 3 30-50   232210
## 4 50-80   277109

For the table that we are creating we also need to know if the housing structure was built before 1980 and whether there are children under 6 years old in the home. We will again use the ifelse() function to create these variables. ( if you cant tell i love using ifelse() :D ) These variables will be dichotomous with a 1 meaning that there are children present/the house was built before 1980, and a 0 meaning the opposite.

#presence of children under 6 years old
phl_pums$childrenunder6 <- ifelse(phl_pums$HUPAC %in% c(1, 3), 1, 0)

#built before 1980
phl_pums$builtbefore1980 <- ifelse(phl_pums$YBL %in% c("01", "02", "03", "04", "05"), 1, 0)

Now we filter the data set to make sure that we are only getting one observation per household.

#filter for head of household (reference person) data
phl_pums1 <- phl_pums %>% filter(RELSHIPP == 20)
phl_pums1

## # A tibble: 19,234 x 197
##    SERIALNO    SPORDER  HINCP  RNTP GRNTP    NP  BDSP PUMA  ST    ADJHSG  ADJINC
##    <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>   <chr> 
##  1 2015000000~       1 100000     0     0     3     3 03201 42    1.0791~ 1.080~
##  2 2015000002~       1  81900     0     0     2     3 03201 42    1.0791~ 1.080~
##  3 2015000004~       1   9060   480   510     1     2 03207 42    1.0791~ 1.080~
##  4 2015000008~       1  65800     0     0     5     3 03208 42    1.0791~ 1.080~
##  5 2015000008~       1  32700     0     0     1     2 03202 42    1.0791~ 1.080~
##  6 2015000010~       1 124500  1000  1100     2     2 03209 42    1.0791~ 1.080~
##  7 2015000010~       1  25000     0     0     1     4 03207 42    1.0791~ 1.080~
##  8 2015000011~       1    100  1100  1210     1     1 03209 42    1.0791~ 1.080~
##  9 2015000012~       1  37400   600   830     2     2 03208 42    1.0791~ 1.080~
## 10 2015000014~       1  88000   630  1500     3     3 03202 42    1.0791~ 1.080~
## # ... with 19,224 more rows, and 186 more variables: TEN <chr>, YBL <chr>,
## #   HHT <chr>, HUPAC <chr>, RELSHIPP <chr>, HISP <chr>, RAC1P <chr>,
## #   ST_label <ord>, ADJHSG_label <ord>, ADJINC_label <ord>, TEN_label <ord>,
## #   YBL_label <ord>, HHT_label <ord>, HUPAC_label <ord>, RELSHIPP_label <ord>,
## #   HISP_label <ord>, RAC1P_label <ord>, WGTP <dbl>, PWGTP <dbl>, WGTP1 <dbl>,
## #   WGTP2 <dbl>, WGTP3 <dbl>, WGTP4 <dbl>, WGTP5 <dbl>, WGTP6 <dbl>,
## #   WGTP7 <dbl>, WGTP8 <dbl>, WGTP9 <dbl>, WGTP10 <dbl>, WGTP11 <dbl>, ...

Using the srvyr package to make estimates

And we create a survey object out of the filtered data. This step is crucial for being able to use the replicate weights to make estimations.

#change the data to a survey object
phlsurvey <- to_survey(phl_pums1, type = "housing")

From here we can use blocks of code like this to make estimations. survey_total() and survey_median() are especially useful for our purposes.
This code block takes the survey object that we named phlsurvey, groups observations by their AMI, and counts the number of observations in each group. This uses the weighted data and specifically the housing replicate weights. vartype = "ci"means that a confidence will be calculated and displayed as an upper bound a lower bound in two separate columns.

#estimate of total households by AMI category
totalhh <- phlsurvey %>%
  group_by(buckets) %>%
  summarize(
    totalHHs = survey_total(vartype = "ci")
  )
totalhh

## # A tibble: 4 x 4
##   buckets totalHHs totalHHs_low totalHHs_upp
##   <chr>      <dbl>        <dbl>        <dbl>
## 1 >80       236744      231286.      242202.
## 2 0-30      169524      164957.      174091.
## 3 30-50      88955       85095.       92815.
## 4 50-80     106114      102254.      109974.

The next two functions are almost the same as the previous one. The difference is that we are filtering for households living in homes built before 1980 in the first one and in the second one we are filtering for households with children who are living in homes built before 1980.

#estimate of households built before 1980 by AMI category
ninteen80 <- phlsurvey %>% 
  filter(builtbefore1980 == 1) %>% 
  group_by(buckets) %>%
  summarize(
    before1980 = survey_total(vartype = "ci")
  )
ninteen80

## # A tibble: 4 x 4
##   buckets before1980 before1980_low before1980_upp
##   <chr>        <dbl>          <dbl>          <dbl>
## 1 >80         201452        196163.        206741.
## 2 0-30        150631        146048.        155214.
## 3 30-50        78829         75167.         82491.
## 4 50-80        95707         92108.         99306.

#estimate of households built before 1980 with children under 6 present by AMI category
eightyndunder6 <- phlsurvey %>% 
  filter(builtbefore1980 == 1 & childrenunder6 == 1) %>% 
  group_by(buckets) %>%
  summarize(
    b41980ndunder6 = survey_total(vartype = "ci")
  )
eightyndunder6

## # A tibble: 4 x 4
##   buckets b41980ndunder6 b41980ndunder6_low b41980ndunder6_upp
##   <chr>            <dbl>              <dbl>              <dbl>
## 1 >80              21810             19578.             24042.
## 2 0-30             19482             17389.             21575.
## 3 30-50            11837             10303.             13371.
## 4 50-80            11647              9929.             13365.

Joining tables and exporting data

Finally we will join these three tables to each other based on the buckets variable. While the confidence intervals are good to know, we dont really need them for these purposes. So we will select the columns that contain the estimations as well as the AMI categories.

table52 <- left_join(totalhh, ninteen80, by = "buckets") %>% 
  left_join(., eightyndunder6, by = "buckets") %>%
  select(buckets, totalHHs, before1980, b41980ndunder6)
table52

## # A tibble: 4 x 4
##   buckets totalHHs before1980 b41980ndunder6
##   <chr>      <dbl>      <dbl>          <dbl>
## 1 >80       236744     201452          21810
## 2 0-30      169524     150631          19482
## 3 30-50      88955      78829          11837
## 4 50-80     106114      95707          11647

And now we have a table with all the information we need to fill out Table 52 in the Con Plan :)
You can export this table to a csv file by typing write.csv(name of datafram, “file directory you want it saved to.csv”)
For example:

write.csv(table52, "C:/Users/sean.finnegan/Documents/PUMS/table52.csv")

If you are interested in a jumping off point to go more in-depth, please go to [https://walker-data.com/tidycensus/]