Tidycensus-demo.knit

Tidycensus demo

How to pull ACS data in R

If you want more in depth info, please go to [https://walker-data.com/tidycensus/]

If you are at the office or on your VPN, you will need to load the ‘curl’ package and run these three lines of code to get around the proxy server and be able to use the census API

#function to get around proxy server for tidycensus
library(curl)

## Using libcurl 7.64.1 with Schannel

companyproxy <- curl::ie_proxy_info()$Proxy
Sys.setenv(http_proxy=companyproxy)
Sys.setenv(https_proxy=companyproxy)

Tidyverse and Tidycensus are the two packages required to pull census data in R. The ‘srvyr’ package is useful to make estimations with weighted PUMS data, but not required if you’re using Census or ACS data– because there are no weights for those.

#load packages
library(tidycensus)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x readr::parse_date() masks curl::parse_date()

library(srvyr)

## 
## Attaching package: 'srvyr'

## The following object is masked from 'package:stats':
## 
##     filter

The following line of code creates a table of variable information for the specified survey. The load_variables() function takes the year and dataset as its arguments. In the following code block I am looking at variables for the 2015-2019 5 year ACS.

vars <- load_variables(2019, "acs5")
vars

## # A tibble: 27,040 x 3
##    name       label                                    concept   
##    <chr>      <chr>                                    <chr>     
##  1 B01001_001 Estimate!!Total:                         SEX BY AGE
##  2 B01001_002 Estimate!!Total:!!Male:                  SEX BY AGE
##  3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE
##  4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE
##  5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE
##  6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE
##  7 B01001_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE
##  8 B01001_008 Estimate!!Total:!!Male:!!20 years        SEX BY AGE
##  9 B01001_009 Estimate!!Total:!!Male:!!21 years        SEX BY AGE
## 10 B01001_010 Estimate!!Total:!!Male:!!22 to 24 years  SEX BY AGE
## # ... with 27,030 more rows

The get_acs() function is where you are able to load a dataset into R. Most of the time, the required arguments are geography, state, variables, year, survey, and key (which is your Census API key). Go to this website to get a free API key from the Census Bureau. The following API call gets county level information for Pennsylvania from the 2019 5 year ACS. The variables (whos codes I found from the above table.) show the total population, population that lives in owner-occupied units, and population that lives in owner occupied units and drives to work.

census_api_key("82bbcc6711ff4a4923edffb104d47daaa8d191cd", overwrite = TRUE, install = TRUE)

## Your original .Renviron will be backed up and stored in your R HOME directory if needed.

## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`

## [1] "82bbcc6711ff4a4923edffb104d47daaa8d191cd"

drive2work <- get_acs(geography = "county",
                state = "PA",
                variables = c("B01001_001", "B08537_002", "B08537_005"), #population, owner occupied, owner occupied that drove to work 
                year = 2019,
                survey = "acs5",
                key = "82bbcc6711ff4a4923edffb104d47daaa8d191cd")

## Getting data from the 2015-2019 5-year ACS

drive2work

## # A tibble: 201 x 5
##    GEOID NAME                           variable   estimate   moe
##    <chr> <chr>                          <chr>         <dbl> <dbl>
##  1 42001 Adams County, Pennsylvania     B01001_001   102470    NA
##  2 42001 Adams County, Pennsylvania     B08537_002    26104   879
##  3 42001 Adams County, Pennsylvania     B08537_005    21672   816
##  4 42003 Allegheny County, Pennsylvania B01001_001  1221744    NA
##  5 42003 Allegheny County, Pennsylvania B08537_002   502386  4412
##  6 42003 Allegheny County, Pennsylvania B08537_005   393627  3703
##  7 42005 Armstrong County, Pennsylvania B01001_001    65867    NA
##  8 42005 Armstrong County, Pennsylvania B08537_002    14523   568
##  9 42005 Armstrong County, Pennsylvania B08537_005    12093   532
## 10 42007 Beaver County, Pennsylvania    B01001_001   165833    NA
## # ... with 191 more rows

The following block of code includes some data cleaning/manipulation, If you can see from the above table, there is a variable column with all of our variables. In order to make it easier to work with this data, we will use the pivot_wider() function. It takes the names from the variable column and makes them their own columns, and then uses the values from the estimate column and puts them under the correct variable. Then I filtered for some specific counties to make the upcoming graph easier to read.
Note: the first step in this section is unnecessary if you include output = "wider" in the get_acs() call

#pivoting data to make it clean and filtering for Philadelphia and aurrounding counties
padriving <- drive2work %>% pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>% filter(NAME %in% c("Philadelphia County, Pennsylvania", "Bucks County, Pennsylvania", "Montgomery County, Pennsylvania", "Chester County, Pennsylvania", "Delaware County, Pennsylvania"))

You can also create new variables from the existing variables. In these lines of code I am dividing the number of people who both live in an owner occupied unit and drive to work by the amount of people who live in an owner occupied unit and then the population as a whole.

#percent of people who live in an owner occupied home, who drive to work
padriving$pctowndrive <- padriving$estimate_B08537_005 / padriving$estimate_B08537_002
#percent of the total population that both lives in an owner occupied home and drives to work
padriving$pctdrive <- padriving$estimate_B08537_005 / padriving$estimate_B01001_001

Next we can use ggplot to visualize the difference between the counties that we selected

#bar graph
ggplot(padriving, aes(y = NAME, x = pctowndrive)) +
  geom_col(fill = "#9400c6") +
  ylab("") +
  theme(plot.title = element_text(hjust= 0, face="bold", size = 15),
        plot.background = element_rect(fill="#f7f7f7"),
        legend.title = element_blank(),
        legend.background = element_rect(fill = "#f7f7f7", color = NA),
        panel.background=element_rect(fill = "#f7f7f7", color = NA),
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        panel.grid.major.x=element_line(color = "#cfcfcf"),
        axis.ticks=element_blank(),
        plot.caption = element_text(hjust = -0.7, size = 10),
        legend.position="top",
        panel.border=element_blank())

If you want to make a quick map, its relatively easy with Tidycensus. in the get_acs() function just specify geometry = TRUE and a geometry column will be added to make your data mappable. In the following code, I loaded the number of renters per census tract in Philadelphia from the 2015-2019 ACS. Then you can use ggplot to create a map.

renters <- get_acs(geography = "tract",
                state = "PA",
                county = "Philadelphia",
                variables = "B25008_003",
                year = 2019,
                survey = "acs5",
                key = "82bbcc6711ff4a4923edffb104d47daaa8d191cd",
                geometry = TRUE)

## Getting data from the 2015-2019 5-year ACS

## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

renters

## Simple feature collection with 384 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.13799
## Geodetic CRS:  NAD83
## First 10 features:
##          GEOID                                                   NAME
## 1  42101014500    Census Tract 145, Philadelphia County, Pennsylvania
## 2  42101004202  Census Tract 42.02, Philadelphia County, Pennsylvania
## 3  42101004102  Census Tract 41.02, Philadelphia County, Pennsylvania
## 4  42101000804   Census Tract 8.04, Philadelphia County, Pennsylvania
## 5  42101025300    Census Tract 253, Philadelphia County, Pennsylvania
## 6  42101028100    Census Tract 281, Philadelphia County, Pennsylvania
## 7  42101008602  Census Tract 86.02, Philadelphia County, Pennsylvania
## 8  42101030700    Census Tract 307, Philadelphia County, Pennsylvania
## 9  42101031101 Census Tract 311.01, Philadelphia County, Pennsylvania
## 10 42101025900    Census Tract 259, Philadelphia County, Pennsylvania
##      variable estimate  moe                       geometry
## 1  B25008_003     1042  205 MULTIPOLYGON (((-75.15131 3...
## 2  B25008_003     1529  670 MULTIPOLYGON (((-75.15614 3...
## 3  B25008_003     4358 1083 MULTIPOLYGON (((-75.16396 3...
## 4  B25008_003     3342  445 MULTIPOLYGON (((-75.17067 3...
## 5  B25008_003     2015  558 MULTIPOLYGON (((-75.18653 4...
## 6  B25008_003     1842  638 MULTIPOLYGON (((-75.15237 4...
## 7  B25008_003     2372  401 MULTIPOLYGON (((-75.22156 3...
## 8  B25008_003     2458  553 MULTIPOLYGON (((-75.0935 40...
## 9  B25008_003     2404  678 MULTIPOLYGON (((-75.08267 4...
## 10 B25008_003     1071  375 MULTIPOLYGON (((-75.1809 40...

renters %>%
  ggplot(aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  scale_fill_viridis_c()

You can even write these datasets to a shapefile and use it in ArcGIS

library(sf)
st_write(renters, "tidycensus demo renters.shp")