In this lesson students will learn to:
https://www.census.gov/programs-surveys/acs/microdata.html
In the following example we will use household level data from the 2023 American Community Survey (ACS) in Oregon.
The data can be accessed from the File Transfer Protocol (FTP) website, https://www2.census.gov/programs-surveys/acs/data/pums/.
For ease of working with the data for this example, you can access the data here:
## OREGON 2023
## HOUSEHOLD LEVEL DATA
or_house<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/DATA/PUMS/Oregon23/psam_h41.csv",
header=TRUE)
dim(or_house)
## [1] 21004 241
PUMAs are Census created geographic areas that:
PUMAs are the most detailed geographic areas available in the ACS PUMS. They are redrawn after each decennial census.
https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
We will use the trigis
package, which contains the
shapefiles to visualize PUMAs.
### INSTALL AND LOAD PACKAGE
#install.packages("tigris")
library(tigris)
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
### GET PUMAs for OREGON
pumasOR <- pumas("OR", year=2020, cb=T)
## The 2020 CB PUMAs use the new 2020 PUMA boundary definitions.
## | | | 0% | |====== | 9% | |============= | 18% | |=================== | 27% | |========================= | 36% | |=============================== | 44% | |===================================== | 53% | |=========================================== | 62% | |================================================= | 71% | |============================================================== | 88% | |======================================================================| 100%
### LIBRARY TIDYVERSE
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
### GGPLOT - GEOM_SF
ggplot(pumasOR)+
geom_sf()+
theme_minimal()
How many households from the sample are in each PUMA?
First let’s filter out the group quarters, since we only want to look at households.
### FILTER OUT THE GROUP QUARTERS
onlyHouse<-or_house%>%
filter(TYPEHUGQ==1)
Now we will want to count how many households there are in each PUMA. While we’re at it, it might be fun to calculate how many people are in these households, the weighted number of households, and the weighted number of people in those households.
### HOUSEHOLDS BY PUMA
houseSamp<-onlyHouse%>%
group_by(PUMA)%>%
summarise(n=n(),
people=sum(NP, na.rm = TRUE), #SUM OF NUMBER OF PEOPLE IN HOUSE
wgtN=sum(WGTP), #SUM WEIGHTS FOR EST NUM OF HOUSEHOLDS
wgtP=sum(NP*WGTP)) #SUM WEIGHTS*PEOPLE FOR EST NUM PEOPLE IN HOUSEHOLDS
#### FIX DATA FORMAT
pumasOR$PUMACE20<-as.integer(pumasOR$PUMACE20)
#### JOIN DATA WITH MAP
pumaHouse<-pumasOR%>%
left_join(houseSamp, by=c("PUMACE20"="PUMA"))
Maps showing sample data and weighted estimates.
### MAP 1
### NUMBER OF HOUSEHOLDS IN THE SAMPLE BY PUMA
ggplot(pumaHouse)+
geom_sf(color='black', aes(fill=n))+
scale_fill_viridis_c(option="viridis", "Sample Households")+
ggtitle("PUMS Household Sample Size by PUMA",
subtitle = "2023 American Community Survey")+
theme_minimal()
### MAP 2
### NUMBER OF PEOPLE IN HOUSEHOLDS IN THE SAMPLE BY PUMA
ggplot(pumaHouse)+
geom_sf(color='black', aes(fill=people))+
scale_fill_viridis_c(option="viridis", "Sample People")+
ggtitle("PUMS People in Sampled Households by PUMA",
subtitle = "2023 American Community Survey")+
theme_minimal()
### MAP 3
### WEIGHTED NUMBER OF HOUSEHOLDS BY PUMA
ggplot(pumaHouse)+
geom_sf(color='black', aes(fill=wgtN))+
scale_fill_viridis_c(option="viridis", "Weighted Households")+
ggtitle("PUMS Weighted Household Count by PUMA",
subtitle = "2023 American Community Survey")+
theme_minimal()
### MAP 4
### WEIGHTED NUMBER OF PEOPLE IN HOUSEHOLDS BY PUMA
ggplot(pumaHouse)+
geom_sf(color='black', aes(fill=wgtP))+
scale_fill_viridis_c(option="viridis", "Weighted People")+
ggtitle("PUMS Weighted Count of People in Households by PUMA",
subtitle = "2023 American Community Survey")+
theme_minimal()
How many households from the sample have access to internet?
### LETS LOOK AT HOUSEHOLD ACCESS TO INTERNET
## INTERNET: ACCESSINET
## SEE DATA DICTIONARY
orInternet<-onlyHouse%>%
mutate(internet01=(ACCESSINET!=3))%>%
group_by(PUMA)%>%
summarise(propInt=mean(internet01, na.rm=TRUE))
#### JOIN DATA WITH MAP
pumaHouse2<-pumasOR%>%
left_join(orInternet, by=c("PUMACE20"="PUMA"))
Map of internet access:
### MAP OF INTERNET ACCESS
ggplot(pumaHouse2)+
geom_sf(color='black', aes(fill=propInt))+
scale_fill_viridis_c(option="viridis", "Households with Internet")+
ggtitle("(Unweighted) Sample Proportion of Households with Internet by PUMA",
subtitle = "2023 American Community Survey")+
theme_minimal()