DATA429/599: Introduction to PUMS and PUMAs

Learning Objectives

In this lesson students will learn to:

Work with household level PUMS data from the Census FTP
Aggregate data to the PUMA level
Map PUMAs using the tigris package

Public Use Microdata Sample (PUMS)

https://www.census.gov/programs-surveys/acs/microdata.html

In the following example we will use household level data from the 2023 American Community Survey (ACS) in Oregon.

The data can be accessed from the File Transfer Protocol (FTP) website, https://www2.census.gov/programs-surveys/acs/data/pums/.

For ease of working with the data for this example, you can access the data here:

## OREGON 2023
## HOUSEHOLD LEVEL DATA
or_house<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/DATA/PUMS/Oregon23/psam_h41.csv", 
                   header=TRUE)
dim(or_house)

## [1] 21004   241

Public Use Microdata Areas (PUMAs)

PUMAs are Census created geographic areas that:

Must have a minimum population of 100,000 people
May be created by combining two or more contiguous counties
May be created by combining census tracts that cross county boundaries
Entirely inside or outside of metropolitan or micropolitan statistical areas
Avoid splitting American Indian reservations and off-reservation trust lands
Nest within states or equivalent entities

PUMAs are the most detailed geographic areas available in the ACS PUMS. They are redrawn after each decennial census.

TIGER/Line Shapefiles

https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html

We will use the trigis package, which contains the shapefiles to visualize PUMAs.

### INSTALL AND LOAD PACKAGE
#install.packages("tigris")
library(tigris)

## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.

### GET PUMAs for OREGON
pumasOR <- pumas("OR", year=2020, cb=T)

## The 2020 CB PUMAs use the new 2020 PUMA boundary definitions.

##   |                                                                              |                                                                      |   0%  |                                                                              |======                                                                |   9%  |                                                                              |=============                                                         |  18%  |                                                                              |===================                                                   |  27%  |                                                                              |=========================                                             |  36%  |                                                                              |===============================                                       |  44%  |                                                                              |=====================================                                 |  53%  |                                                                              |===========================================                           |  62%  |                                                                              |=================================================                     |  71%  |                                                                              |==============================================================        |  88%  |                                                                              |======================================================================| 100%

STEP 1: Make a PUMA map!

### LIBRARY TIDYVERSE
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

### GGPLOT - GEOM_SF
ggplot(pumasOR)+
  geom_sf()+
  theme_minimal()

Question of Interest 1

How many households from the sample are in each PUMA?

STEP 1: Wrangling the Data

First let’s filter out the group quarters, since we only want to look at households.

### FILTER OUT THE GROUP QUARTERS
onlyHouse<-or_house%>%
  filter(TYPEHUGQ==1)

Now we will want to count how many households there are in each PUMA. While we’re at it, it might be fun to calculate how many people are in these households, the weighted number of households, and the weighted number of people in those households.

### HOUSEHOLDS BY PUMA
houseSamp<-onlyHouse%>%
  group_by(PUMA)%>%
  summarise(n=n(), 
            people=sum(NP, na.rm = TRUE), #SUM OF NUMBER OF PEOPLE IN HOUSE
            wgtN=sum(WGTP), #SUM WEIGHTS FOR EST NUM OF HOUSEHOLDS
            wgtP=sum(NP*WGTP)) #SUM WEIGHTS*PEOPLE FOR EST NUM PEOPLE IN HOUSEHOLDS

STEP 2: Join the Data

#### FIX DATA FORMAT
pumasOR$PUMACE20<-as.integer(pumasOR$PUMACE20)

#### JOIN DATA WITH MAP
pumaHouse<-pumasOR%>%
  left_join(houseSamp, by=c("PUMACE20"="PUMA"))

STEP 3: Map the Data

Maps showing sample data and weighted estimates.

### MAP 1
### NUMBER OF HOUSEHOLDS IN THE SAMPLE BY PUMA
ggplot(pumaHouse)+
  geom_sf(color='black', aes(fill=n))+
  scale_fill_viridis_c(option="viridis", "Sample Households")+
  ggtitle("PUMS Household Sample Size by PUMA", 
          subtitle = "2023 American Community Survey")+
  theme_minimal()

### MAP 2
### NUMBER OF PEOPLE IN HOUSEHOLDS IN THE SAMPLE BY PUMA
ggplot(pumaHouse)+
  geom_sf(color='black', aes(fill=people))+
  scale_fill_viridis_c(option="viridis", "Sample People")+
  ggtitle("PUMS People in Sampled Households by PUMA", 
          subtitle = "2023 American Community Survey")+
  theme_minimal()

### MAP 3
### WEIGHTED NUMBER OF HOUSEHOLDS BY PUMA
ggplot(pumaHouse)+
  geom_sf(color='black', aes(fill=wgtN))+
  scale_fill_viridis_c(option="viridis", "Weighted Households")+
  ggtitle("PUMS Weighted Household Count by PUMA", 
          subtitle = "2023 American Community Survey")+
  theme_minimal()

### MAP 4
### WEIGHTED NUMBER OF PEOPLE IN HOUSEHOLDS BY PUMA
ggplot(pumaHouse)+
  geom_sf(color='black', aes(fill=wgtP))+
  scale_fill_viridis_c(option="viridis", "Weighted People")+
  ggtitle("PUMS Weighted Count of People in Households by PUMA", 
          subtitle = "2023 American Community Survey")+
  theme_minimal()

Question of Interest 2

How many households from the sample have access to internet?

STEP 1: Wrangling the Data

### LETS LOOK AT HOUSEHOLD ACCESS TO INTERNET 
## INTERNET: ACCESSINET
## SEE DATA DICTIONARY 
orInternet<-onlyHouse%>%
  mutate(internet01=(ACCESSINET!=3))%>% 
  group_by(PUMA)%>%
  summarise(propInt=mean(internet01, na.rm=TRUE))

STEP 2: Join the Data

#### JOIN DATA WITH MAP
pumaHouse2<-pumasOR%>%
  left_join(orInternet, by=c("PUMACE20"="PUMA"))

STEP 3: Map the Data

Map of internet access:

### MAP OF INTERNET ACCESS
ggplot(pumaHouse2)+
  geom_sf(color='black', aes(fill=propInt))+
  scale_fill_viridis_c(option="viridis", "Households with Internet")+
  ggtitle("(Unweighted) Sample Proportion of Households with Internet by PUMA", 
          subtitle = "2023 American Community Survey")+
  theme_minimal()