Census Data Wrangling and Mapping in R

03/21/2019

Getting Started

Setup

Welcome! While we’re waiting:

Clone or download the workshop files from: https://github.com/dlab-geo/rCensus_workshop
- If you downloaded the zipfile, unzip it.
- Make a note of the folder in which the files reside.
Open RStudio
Open a new R script file

Introduction

About me
About you
- Your familiarity with US Census data
- with geospatial data
- with geospatial data in R

Outline

Describe primary Census data products
Introduce R packages for working with Census Data
Use those packages to fetch census data
Use those packages to fetch census data plus census geograpic boundary files
Make maps of census data

Census Data Overview

US Census Data

The “nation’s leading provider of quality data about its people and economy.”

Available at www.census.gov

Primary Census Products

Decennial Census
American Community Survey (ACS)

Decennial Census

Complete count of the population every 10 years since 1790

Includes data on

population, by age & race/ethnicity
housing, by occupancy & tenure (owned, rented)

American Community Survey (ACS)

Annual survey of a sample of about 3 million households
Provides estimates of demographic, social, economic & housing characteristics
Includes margin of error values for the estimates.

Decennial Census* vs ACS Data

Demographic*	Social	Economic	Housing
Sex	Families	Income	Tenure*
Age	Education	Benefits	Occupancy*
Race	Marital Status	Employment Status	Structure Type
Hispanic Origin	Fertility	Occupation	Housing Value
	Grandparents	Industry	Taxes & Insurance
	Veterans	Commuting	Utilities
	Disability Status	Place of Work	Mortgage
	Language at Home	Health Insurance	Monthly Rent
	Citizenship
	Mobility

Census Geographies

Census data are publicly available at one or more levels of geographic aggregation.

Census Data & Census Geographies

ACS 5 Year Dataset RECOMMENDED

ACS 1 year and 5 year products are currently available

ACS 3 year no longer available

ACS 5 year data provdes much better estimates, lower margins of error

More data available for ACS 5 Year product

Census Data Workflow

Identify your

topic of interest
year(s)
geographic level of detail
for what locations?

Then determine what specific tables and variables are available - ACS or Decennial?

CAUTION

“If you want to measure change you can’t change the measures!”

Census tables, variables, geographies, and geographic boundaries change over time!

Measuring change over time with census data is its own thing, complex and not covered by this workshop!

R Packages

Packages for Working with Census Data

These are the ones we recommend and will use today.

tidycensus & tigris

tidycensus

Functions for accessing census decennial and ACS 5 year datasets via Census APIs

only a subset of datasets / years available
requires a Census API key

tidycensus

Limited set of years available via tidycensus

decennial census: 1990, 2000, and 2010
ACS 5 yr: 2006-2010 through 2014-2018 are available.
Note: tidycensus refers to ACS 5year datasets by the endyear.
Need to check availability of latest census data releases in tidycensus

tigris

Provides access to census geographic data files

detailed TIGER/Line boundary files (e.g., shapefiles), or
simplified Cartographic boundary files

Also provides access to census feature data,

eg, rivers, roads, coastlands, landmarks, and more

Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.

Use tigris directly to access other census geographic data.

tidycensus & tigris

Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.

Check out his website to keep abreast of his great packages, blog posts, and tutorials.

Walker also develped a new DataCamp course: Analyzing US Census Data in R!

Highly recommended! First chapter free!

tidyverse

A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.

dplyr and tidyr for reshaping data
ggplot2 for plotting
purr, readr and tibble for improved performance

These packages are used by tidyverse under the hood.

sf

Simple features for geospatial data objects and methods.

Next generation R package for working with vector geospatial data
- superceding the sp package

sf includes the functionality of the sp, rgdal, rgeos and proj4 packages.

but with improved performance, simplified command syntax, and easier workflows.

Alternatives to Accessing Census Data in R

You can write code to access the Census APIs directly.

You can download Census data directly from:

American Factfinder or
NHGIS.org
Social Explorer
- Subscription service but FREE for UCB community

You can download Census geographic data directly on the census website

Tutorial Time!

Part 1

We will work through several exercises using tidycensus to fetch, wrangle and map census data.

Loading packages

Load the packages we will use today

library(tidycensus)
library(tidyverse) 
library(tigris)
library(sf)

If you are getting errors try importing dplyr or reinstalling dplyr package as that has worked for some.

Install any packages that you do not have on your computer

Also install any dependancies.

# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("sf")

Census API Key

You need a census API key to programmatically fetch census data.

Get it here (pretty quick):

(https://api.census.gov/data/key_signup.html)

For more info see:

https://www.census.gov/data/developers/data-sets.html

Install your Census API Key

Use the tidycensus function census_api_key to make tidycensus use your key when it fetches data from the census.

# Install your census api key - long alphanumeric string
census_api_key(THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS)

Set working directory

Be sure to Clone or downloaded & unzip the workshop files from: https://github.com/dlab-geo/rCensus_workshop

Then, set your working directory this folder, e.g.,

setwd("~/Documents/Dlab/workshops/2019/rCensus_workshop")

Fetching Decennial Census Data

Population Data

Let’s start by fetching population data from the 2010 Census for all states

In order to fetch census data you need to identify the census variables that contain the data of interest.

Topics, Tables & Variables

Census data variables are organized in tables

Which are organized by topic or concept.

The tidycensus load_variables function can help with this step.

First, take a look at the function documentation.

?load_variables

load_variables

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS
                           dataset = 'sf1',  # 'sf1' for decennial or 'acs5'
                           cache = TRUE)     # Whether to save fetched data locally

Decennial Census Variables

Let’s take a look at and discuss the resultant dataframe.

How many 2010 census variables are in the dataframe?

View(vars2010)

2010 Decennial Census Tables

Variables: 3,346
Topics: Population, housing
Tables: currenty 333 - that’s a lot!
- 177 population tables (identified with a ‘‘P’’) available to the block level
- 58 housing tables (identified with an ‘‘H’’) available to the block level
- 82 population tables (identified with a ‘‘PCT’’) available to the census tract level
- 4 housing tables (identified with an “HCT”) available to the census tract level
- 10 population tables (identified with a “PCO”) available to the county level
- plus 2 additoinal PCT tables

What Variable has the 2010 Total Population value?

We can sort and filter the vars2010 dataframe to find it.

get_decennial

We can use the tidycensus function get_decenial to fetch the 2010 census data for total population by state.

First, check the documentation for the function.

?get_decennial

get_decennial

Fetch total population by state (P001001) from the 2010 census using get_decennial.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year

## Getting data from the 2010 decennial Census

View the Data

How many rows and columns?
Do you see the expected number of states?
What column contains the population counts?
Do the data values see to be right?

#pop2010

Visualize results

We can visualize the data to get a quick overview of the distribution of data values.

It’s a first step in exploratory data analysis and a last step in data communication.

ggplot2 is the most commonly used R package for data visualization.

It is loaded when you load the tidyverse package.

Let’s use it to visualize the population data.

Plot 2010 Population by state

Use ggplot2 to create an ordered horizontal bar chart.

pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) + 
  geom_bar(stat="identity") + coord_flip() +
  theme_minimal() + 
  labs(title = "2010 US Population by State") +
  xlab("State") +
  ylab("in millions")

Display the plot

Challenge

Fetch population data by state for 2000.

Don’t assume variable names are the same across years. Check first!

Challenge Solution

Total Population in 2000

# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)

# Take a look and search in the dataframe
View(vars2000)

# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)

# Take a look (plot if time)
pop2000

Limiting by Area of Interest

In the previous example we retrieved population data for all states.

This is the default behavior if you don’t specify a subset.
But you can limit the data to be retrieved by subunits like state.

Limit Areas of Interest

Let’s fetch data for just 3 states.

state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
                         variables = "P001001",     # variables of interest
                         year = 2010,               # census year
                         state=c("CA","OR","WA"))   # Filter by states of interest

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '41' for state 'OR'

## Using FIPS code '53' for state 'WA'

Note we are referencing states by their abbrevation.

View Results

state_pop2010

## # A tibble: 3 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 06    California P001001  37253956
## 2 41    Oregon     P001001   3831074
## 3 53    Washington P001001   6724540

Changing Census Tabulation unit

get_decennial accepts a number of different values for tabulation unit.

Options include: state, county, tract, block group, block, and ZCTA.

Let’s change the tabulation unit from state to county.

co_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010)

## Getting data from the 2010 decennial Census

Changing Census Tabulation unit

View the county data to see what was retrieved.

co_pop2010

## # A tibble: 3,221 x 4
##    GEOID NAME                        variable  value
##    <chr> <chr>                       <chr>     <dbl>
##  1 05131 Sebastian County, Arkansas  P001001  125744
##  2 05133 Sevier County, Arkansas     P001001   17058
##  3 05135 Sharp County, Arkansas      P001001   17264
##  4 05137 Stone County, Arkansas      P001001   12394
##  5 05139 Union County, Arkansas      P001001   41639
##  6 05141 Van Buren County, Arkansas  P001001   17295
##  7 05143 Washington County, Arkansas P001001  203065
##  8 05145 White County, Arkansas      P001001   77076
##  9 05149 Yell County, Arkansas       P001001   22185
## 10 06011 Colusa County, California   P001001   21419
## # … with 3,211 more rows

Challenge

Fetch population by county for just California
Fetch population by county for Oregon & California

Try it before you look ahead at solutions.

Challenge Solution

## Fetch population by **county** for just California
co_pop2010_ca <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010,
                            state=c('CA'))

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

#co_pop2010_ca

## Fetch population by **county** for Oregon & California
co_pop2010_caor <- get_decennial(geography = "county",   # census tabulation unit
                               variables = "P001001",  # variables of interest
                               year = 2010,
                               state=c('CA','OR'))

## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'

## Using FIPS code '41' for state 'OR'

co_pop2010_caor

## # A tibble: 94 x 4
##    GEOID NAME                            variable   value
##    <chr> <chr>                           <chr>      <dbl>
##  1 06011 Colusa County, California       P001001    21419
##  2 06007 Butte County, California        P001001   220000
##  3 06001 Alameda County, California      P001001  1510271
##  4 06003 Alpine County, California       P001001     1175
##  5 06005 Amador County, California       P001001    38091
##  6 06009 Calaveras County, California    P001001    45578
##  7 06013 Contra Costa County, California P001001  1049025
##  8 06015 Del Norte County, California    P001001    28610
##  9 06031 Kings County, California        P001001   152982
## 10 06021 Glenn County, California        P001001    28122
## # … with 84 more rows

Challenge

Fetch population by tract for all states.
Fetch population by tract for California.

Challenge Solution

## Fetch population by **tract** for California.
cal_pop2010_tracts <- get_decennial(geography = "tract",   # census tabulation unit
                                 variables = "P001001",  # variables of interest
                                 year = 2010,
                                 state=c('CA'))
cal_pop2010_tracts


## Fetch population by **tract** for all states.
pop2010_tracts <- get_decennial(geography = "tract",   # census tabulation unit
                                    variables = "P001001",  # variables of interest
                                    year = 2010)

pop2010_tracts  ## DOES THIS WORK?

Fetching Census Tract Data

If you want census data at the tract level or below you must specifiy the state & county or counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=c("Alameda","Contra Costa"))  # and only these counties

## Getting data from the 2010 decennial Census

Fetching Census Tract Data

View the results! How many census tracts are in these 3 counties?

tract_pop2010

## # A tibble: 569 x 4
##    GEOID       NAME                                          variable value
##    <chr>       <chr>                                         <chr>    <dbl>
##  1 06001400100 Census Tract 4001, Alameda County, California P001001   2937
##  2 06001400200 Census Tract 4002, Alameda County, California P001001   1974
##  3 06001400300 Census Tract 4003, Alameda County, California P001001   4865
##  4 06001400400 Census Tract 4004, Alameda County, California P001001   3703
##  5 06001400500 Census Tract 4005, Alameda County, California P001001   3517
##  6 06001400600 Census Tract 4006, Alameda County, California P001001   1571
##  7 06001400700 Census Tract 4007, Alameda County, California P001001   4206
##  8 06001400800 Census Tract 4008, Alameda County, California P001001   3594
##  9 06001400900 Census Tract 4009, Alameda County, California P001001   2302
## 10 06001401000 Census Tract 4010, Alameda County, California P001001   5678
## # … with 559 more rows

Challenge

Fetch population by county for Alameda County, California
Fetch population by tract for the nine county Bay Area:

Alameda, SF, Contra Costa, Marin County, Napa,
San Mateo, Santa Clara, Solano, Sonoma, Santa Cruz

Note: You can use names, abbreviations or FIPS codes for your state and county.

# County FIPS Codes for
# Alameda, SF, Contra Costa, Marin County, Napa, 
# San Mateo, Santa Clara,  Solano,  Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

Challenge Solution

#  population by **county** for Alameda County, California
alco_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                                 variables = "P001001",  # variables of interest
                                 year = 2010,
                                 state=c('CA'),
                                 county=c('Alameda County'))

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '001' for 'Alameda County'

#alco_pop2010

Challenge Solution

Fetch population by tract for the nine county Bay Area

# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa, 
# San Mateo, Santa Clara,  Solano,  Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

bayarea_pop2010_tract <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=nine_counties)  # and only these counties

## Getting data from the 2010 decennial Census

#bayarea_pop2010_tract

RECAP & QUESTIONS

Fetch population by tract for the nine county Bay Area

# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa, 
# San Mateo, Santa Clara,  Solano,  Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

bayarea_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                      variables = "P001001",            # variable of interest
                      year = 2010,                      # census year
                      state="CA",                       # limit to state of California
                     county=nine_counties)             # and only these counties

# View the data
bayarea_pop2010

Fetching data for more than one census variable

What three things are new here?

#urban rural pop for 3 counties
ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '055' for 'Napa County'

## Using FIPS code '097' for 'Sonoma County'

## Using FIPS code '045' for 'Mendocino County'

Fetching data for more than one census variable

You can specify more than one variable:

variables = c("P002002","P002005")

You can rename the values in the output ‘variable’ column.

variables = c(urban="P002002",rural="P002005")

You can identify a summary_var (a denominator - here, the total count of all people or households surveyed. Can be used for calcuations like percent of total.)

summary_var = "P002001"

Take a look at the results

ur_pop10

## # A tibble: 6 x 5
##   GEOID NAME                         variable  value summary_value
##   <chr> <chr>                        <chr>     <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110         87841
## 2 06055 Napa County, California      urban    118194        136484
## 3 06097 Sonoma County, California    urban    424102        483878
## 4 06045 Mendocino County, California rural     39731         87841
## 5 06055 Napa County, California      rural     18290        136484
## 6 06097 Sonoma County, California    rural     59776        483878

Calculating Percents

The summary_value column comes in handy when you want to compute percent of total.

Here’s one way to do it.

# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))

Calculating Percents

Let’s take a look at the output

ur_pop10 # Take a look

## # A tibble: 6 x 6
##   GEOID NAME                         variable  value summary_value   pct
##   <chr> <chr>                        <chr>     <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110         87841  54.8
## 2 06055 Napa County, California      urban    118194        136484  86.6
## 3 06097 Sonoma County, California    urban    424102        483878  87.6
## 4 06045 Mendocino County, California rural     39731         87841  45.2
## 5 06055 Napa County, California      rural     18290        136484  13.4
## 6 06097 Sonoma County, California    rural     59776        483878  12.4

Plot it

Plots give us compact visual summaries of the data

myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

Don’t worry if you don’t get all the ggplot code now. It’s here for reference.

Plot it

myplot

Fetch all the data in one table

This is often helpful but you need to keep tract of the meaning of each variable.

alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
                           table =  "P002",      # Table of urban & rural population counts
                           year = 2010,          # Decennial census year
                           state='CA',           # Filter state
                           county="Alameda")     # Filter county

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '001' for 'Alameda County'

Take a look

unique(alco_pop10$variable) # What and how many unique vars in table?

## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"

head(alco_pop10,3)  # Take a look at output

## # A tibble: 3 x 4
##   GEOID       NAME                                          variable value
##   <chr>       <chr>                                         <chr>    <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, California P002001   2937
## 2 06001400200 Census Tract 4002, Alameda County, California P002001   1974
## 3 06001400300 Census Tract 4003, Alameda County, California P002001   4865

Output options

Let’s try all three of these commands and then look at the ouput to see what’s different?

get_decennial(geography = "state", variables = "P001001",
              year = 2010)

get_decennial(geography = "state", variables = c(pop10="P001001"),
              year = 2010)

get_decennial(geography = "state", variables = c(pop10="P001001"),
              year = 2010, output="wide")

Output options

head(get_decennial(geography = "state", variables = "P001001",
                   year = 2010), 2)

## Getting data from the 2010 decennial Census

## # A tibble: 2 x 4
##   GEOID NAME    variable   value
##   <chr> <chr>   <chr>      <dbl>
## 1 01    Alabama P001001  4779736
## 2 02    Alaska  P001001   710231

head(get_decennial(geography = "state", variables = c(pop10="P001001"),
                   year = 2010), 2)

## Getting data from the 2010 decennial Census

## # A tibble: 2 x 4
##   GEOID NAME    variable   value
##   <chr> <chr>   <chr>      <dbl>
## 1 01    Alabama pop10    4779736
## 2 02    Alaska  pop10     710231

head(get_decennial(geography = "state", variables = c(pop10="P001001"),
                   year = 2010, output="wide"), 2)

## Getting data from the 2010 decennial Census

## # A tibble: 2 x 3
##   GEOID NAME      pop10
##   <chr> <chr>     <dbl>
## 1 01    Alabama 4779736
## 2 02    Alaska   710231

Data Wrangling

Your R skills can help you reformat the data and make it more useable.

Let’s fetch population data for 2010 & 2000 by state with output=wide.

We will label the variables pop00 and pop10.

Then we will combine these into one data frame.

Data Wrangling

Fetch pop by state from both the 2000 and 2010 census

pop2000 <- get_decennial(geography = "state",
                         variables = c(pop00="P001001"), 
                         year = 2000, output="wide")

## Getting data from the 2000 decennial Census

pop2010 <- get_decennial(geography = "state",
                         variables = c(pop10="P001001"), 
                         year = 2010, output="wide")

## Getting data from the 2010 decennial Census

Merge population by state from both censuses

Save in a new dataframe with both columns

pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
                             select(NAME, pop00, pop10)

head(pop2000_2010,3)

##      NAME   pop00   pop10
## 1 Alabama 4447100 4779736
## 2  Alaska  626932  710231
## 3 Arizona 5130632 6392017

Save the data

Use write.csv to save a data frame to a CSV file.

write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)

TIME FOR QUESTIONS

Part 2. Mapping

Mapping Census Data with `tidycensus`

You can fetch geographic data by adding the parameter geometry=TRUE to tidycensus functions

Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.
Only a subset of data available via tigris can be accessed via tidycensus.

You can then use common mapping options like plot, ggplot and tmap to make maps.

Geometry Options

Before fetching geometry, we need to specify a few tigris options

Set the class of returned data to be sf objects (not sp, the default)
Set tigris_use_cache to TRUE

# Tigris options - used by tidycensus
options(tigris_class = "sf")      # SP is the default format returned by tigris
options(tigris_use_cache = TRUE)  # Save retrieved data locally

Caching the data is important because it speeds things up if you often fetch census data for the same geographies over and over again.

tigris cache directory

You may want to use the geographic data downloaded by tigris in other applications.

To do this, you need to know where the files are saved locally.

You can also specify where tigris should save cached data.

# Check the location of the tigris cached data
Sys.getenv('TIGRIS_CACHE_DIR') 

# Set it
tigris_cache_dir("~/Documents/gis_data/census")  # Folder for local data

# Check it again
Sys.getenv('TIGRIS_CACHE_DIR')

Fetch geographic boundary data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry with the data for mapping

## Getting data from the 2010 decennial Census

Take a look

Let’s take a minute to discuss the format of an sf spatial object.

pop2010geo

## Simple feature collection with 52 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 52 x 4
##    GEOID NAME        pop10                                         geometry
##    <chr> <chr>       <dbl>                               <MULTIPOLYGON [°]>
##  1 01    Alabama    4.78e6 (((-85.00237 31.00068, -85.02411 31.00068, -85.…
##  2 02    Alaska     7.10e5 (((-164.9762 54.13459, -164.9378 54.13668, -164…
##  3 04    Arizona    6.39e6 (((-109.0452 36.99908, -109.0452 36.96949, -109…
##  4 05    Arkansas   2.92e6 (((-94.55929 36.4995, -94.51948 36.49921, -94.3…
##  5 06    Californ…  3.73e7 (((-122.4463 37.86105, -122.4386 37.86837, -122…
##  6 22    Louisiana  4.53e6 (((-88.86507 29.75271, -88.88976 29.7182, -88.9…
##  7 21    Kentucky   4.34e6 (((-83.67541 36.60081, -83.67561 36.59858, -83.…
##  8 08    Colorado   5.03e6 (((-102.0422 36.99308, -102.0545 36.99311, -102…
##  9 09    Connecti…  3.57e6 (((-71.85957 41.3224, -71.86823 41.33094, -71.8…
## 10 10    Delaware   8.98e5 (((-75.55945 39.62981, -75.5591 39.62906, -75.5…
## # … with 42 more rows

Geospatial Data in R

R sf objects include

a dataframe with a geometry column named of geometry
- The geometry can be of type POINT, LINE, POLYGON
- or, MULTIPOINT, MULTILINE or MULTIPOLGYON
a CRS (coordinate reference system), specified by
- epsg(SRID) code
- proj4string

For a deeper understanding of the sf package and its functionality, we recommend our Geospatial-Fundamentals-in-R-with-sf workshop.

Census Data Coordinate Reference System (CRS)

All census geographic data use the NAD83 CRS, or coordinate reference system. NAD83 stands for North American Datum of 1983. The geographic coordinates are longitude and latitude values encoded as decimal degrees.

WGS84, or The World Geodetic System of 1984 is the most commonly used geographic CRS. The difference between points in these systems varies up to 1 meter in continental US.

Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.

An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.

Mapping sf Spatial Objects

We can use plot to make a quick map the geometry stored in an sf spatial object.

plot(pop2010geo$geometry)

Question

What do you get if you plot the sf object without specifying “$geometry”

The Challenge of US maps

The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.

Fetch geographic data with tidycensus, SHIFTED

tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)

## Getting data from the 2010 decennial Census

## Using feature geometry obtained from the albersusa package

## Please note: Alaska and Hawaii are being shifted and are not to scale.

Shift Happens!

plot(pop2010geo_shifted$geometry)

Save it

You can save sf data to a shapefile using st_write

st_write(pop2010geo_shifted,"usa_2010_shifted.shp")

Check your TIGRIS_CACHE_DIR to see it

my_cache_dir <- Sys.getenv('TIGRIS_CACHE_DIR') 

dir(my_cache_dir) # What files stored there?

Mapping Data Values

plot(pop2010geo_shifted['pop10'])

ggplot2 Maps

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()

ggplot2 Maps

Note the use of geom_sf which tells ggplot that spatial data objects are being mapped. - this is a huge improvement!!

Challenge

Create a map of CA Population in 2010 by county

Challenge Solution

2010 pop Data for California Counties

#fetch it
cal_pop10 <- get_decennial(geography = "county", 
                           variables = "P001001",
                           year = 2010, 
                           state='CA',
                           geometry=TRUE)

# map it
#plot(cal_pop10['value'])

Fetch County data for more than one state

We can fetch both the census data and the geometry for more than one state!

this is so much easier than any alternative approach!

west_pop10 <- get_decennial(geography = "county", 
                           variables =  "P001001",
                           year = 2010, 
                           state=c('CA','OR','NV',"AZ"),
                           geometry=T)

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '41' for state 'OR'

## Using FIPS code '32' for state 'NV'

## Using FIPS code '04' for state 'AZ'

Map it

These are just quick plots to make sure we got the right data!

plot(west_pop10['value'])

Census Tract Data

Fetching the data for all tracts in one state.

but you need to specify one or more counties.

# Fetch tract data 
alco_pop10 <- get_decennial(geography = "tract", 
                           variables = "P001001", 
                           year = 2010, 
                           state='CA',
                           county='Alameda',
                           geometry=T)

## Getting data from the 2010 decennial Census

## Using FIPS code '06' for state 'CA'

## Using FIPS code '001' for 'Alameda County'

Challenge

Fetch and map the 2010 population by census tract for Alameda and Contra Costa counties.

Challenge Solution

Fetch Tract population & geometry data for Alameda & Contra Costa Counties

alcc_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=c("Alameda","Contra Costa"),
                      geometry=T)

## Getting data from the 2010 decennial Census

Challenge Solution

Map it

plot(alcc_pop10['value'])

More Complex Challenge (if time)

Fetch and map the percent of San Francicso properties by census tract that were coded as rented in the 2010 Census.

To start, identify the variables for the

total number of hounsing units
number of renter occupied units

Complex Challenge Solution

SF Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004",
                           year = 2010, 
                           summary_var = "H004001",  # Total Urban - the denominator
                           state='CA',
                           county='San Francisco',
                           geometry=T)

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

plot(sf_pct_rented['pct'])

Questions?

Part 3. ACS 5 year data

ACS Data with tidycensus

The tidycensus workflow for ACS data is similar to that used for decennial census data.

But there are many more variables in the ACS.

Because the ACS contains sample data, each ACS variable of interest includes both an estimate of the value and a margin of error.

ACS 5 year

You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2006 - 2010 dataset.

The default end year for my version of tidycensus (as of April 9, 2020) is 2018 for the 2014-2018 ACS 5 year dataset.

Fetch List of ACS 5 year Variables

Let’s start by fetching ACS 5-year 2016 data on poverty (not all variables appear included in 2018 data yet).

We want to explore the number of folks living below the poverty level by census tract.

First we need to find the variable name(s)!

Load ACS Table Vars

Load the ACS 2012-2016 5 year data variables into a dataframe.

ACS 5-year datasets are referenced by end year in tidycensus!

Then take a look at the variable names, labels and concepts.

How many variables refer to the concept of poverty?

acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T)
View(acs2016vars)

ACS Tables and variables

Many thousands more than for decennial census!

See the documentation on the census website

Types of tables:

B prefix = base tables
C = collapsed tables
DP = data profiles
S = Subject tables

Census Reporter

ACS variables can be confusing.

The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

Let’s check it out to see what tables/variables we should use.

Filter the ACS Variables

In RStudio, view the dataframe acs2016vars and interactively filter the name column to display only the variables in the table C17002

Take a look at the different variables in this table.

What variable(s) contain the estimate of the number of people living below poverty?

get_acs

Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco

?get_acs

get_acs in action

Fetch the data in the table C17002 that contain the counts of people living below 100% of the poverty line.

sf_poor <- get_acs(geography = "tract",  
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   year = 2016,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)

## Getting data from the 2012-2016 5-year ACS

## Using FIPS code '06' for state 'CA'

## Using FIPS code '075' for 'San Francisco County'

View output

Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.

sf_poor

Create Poverty Map, try 2

What are we mapping!

# What are we mapping?
plot(sf_poor['estimate'])

Create Poverty Map, try 2

# Remove census tracts that have no people!
sf_poor <- subset(sf_poor, summary_est > 0)

# What are we mapping?
plot(sf_poor['estimate'])

Calculating percents

Let’s calculate the percent below poverty by tract.

sf_poor <- sf_poor %>%
  mutate(pct = 100 * (estimate / summary_est))

head(sf_poor, 3)

## Simple feature collection with 3 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79947 xmax: -122.3996 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##         GEOID                                               NAME
## 1 06075010100 Census Tract 101, San Francisco County, California
## 2 06075010100 Census Tract 101, San Francisco County, California
## 3 06075010200 Census Tract 102, San Francisco County, California
##     variable estimate moe summary_est summary_moe
## 1 C17002_002      314 196        3972         291
## 2 C17002_003      397 185        3972         291
## 3 C17002_002      160 123        4300         442
##                         geometry      pct
## 1 MULTIPOLYGON (((-122.4206 3... 7.905337
## 2 MULTIPOLYGON (((-122.4206 3... 9.994965
## 3 MULTIPOLYGON (((-122.4267 3... 3.720930

Group by and sum

We want to group the data by the geometry and then sum the data values so that we have one value per geometry.

sf_poor_summed <- sf_poor %>%
  select(GEOID, estimate, pct, geometry) %>%
  group_by(GEOID) %>% 
  summarise(count_below_pov = sum(estimate),
            pct_below_pov = sum(pct))

Group by and sum

head(sf_poor_summed)

## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 4
##   GEOID   count_below_pov pct_below_pov                            geometry
##   <chr>             <dbl>         <dbl>                  <MULTIPOLYGON [°]>
## 1 060750…             711         17.9  (((-122.4206 37.81111, -122.4075 3…
## 2 060750…             235          5.47 (((-122.4267 37.80964, -122.4249 3…
## 3 060750…             625         14.4  (((-122.4187 37.80593, -122.4169 3…
## 4 060750…             398          7.93 (((-122.4149 37.80354, -122.4116 3…
## 5 060750…             229          8.53 (((-122.4052 37.80476, -122.4035 3…
## 6 060750…             882         24.7  (((-122.411 37.80117, -122.4078 37…

Map Counts

Where are SF’s poorest areas?

plot(sf_poor_summed['count_below_pov'])

Map Percents

Where are SF’s poorest areas?

plot(sf_poor_summed['pct_below_pov'])

Challenge

The ACS 2013-2017 5 year dataset was released Dec 6, 2018.

Although my current version of tidycensus states that 2012-2016 is the latest ACS 5-year product, see if you can fetch & map the percent of people below poverty line in San Francisco using the 2013-2017 ACS 5-year data.

Challenge Solution

sf_poor_2017 <- get_acs(geography = "tract",  
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   year = 2017,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)   

head(sf_poor_2017)

Margins of Error (MOE)

We haven’t talked about it but it may be important in your work with ACS data.

Math is needed to combine MOEs when you combine variables.

tidycensus includes some nice functions for these calculations.

See this web page on how to handle MOEs in tidycensus

Questions?

Maps with tmap - Demo

tmap

The tmap package is great for making both static and interactive maps. It turns R into a GIS.

Let’s check it out with our last dataframe.

tmap

library(tmap)
tmap_mode("view") # set mode to interactive

## tmap mode set to interactive viewing

poverty_map <- tm_shape(sf_poor_summed) +
                  tm_polygons(col="pct_below_pov", alpha=0.7)

tmap

View the map - click on tracts

poverty_map

tmap

There are a number of great tutorials online for working with tmap.

See the References at the end of this workshop document.

Census Geographic Data Files

Cartographic Boundary vs Detailed TIGER/Line data

By default, tidycensus downloads census cartographic boundary data.

These are simplifed geometries, clipped to coastlines.

In get_acs you can also request the more detailed census TIGER/Line data.

The cartographic boundary data is great for mapping but the detailed data is often better for analysis.

Let’s check it out.

Fetch Cartographic Boundary Data

sf_poor_cb <- get_acs(geography = "tract",   
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   summary_var = "C17002_001",
                   year = 2016,           
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = TRUE)     # THIS IS THE DEFAULT!

## Getting data from the 2012-2016 5-year ACS

## Using FIPS code '06' for state 'CA'

## Using FIPS code '075' for 'San Francisco County'

Fetch Detailed TIGER/Line Geometry

sf_poor_tl <- get_acs(geography = "tract",   
                   variables = c('C17002_002','C17002_003'), # poverty variables       
                   summary_var = "C17002_001",
                   year = 2016,              
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = FALSE)  # Fetching the TIGER/Line data

## Getting data from the 2012-2016 5-year ACS

## Using FIPS code '06' for state 'CA'

## Using FIPS code '075' for 'San Francisco County'

Visualize differences with Tmap

zoom in to explore, especially around the coastline.

tm_shape(sf_poor_tl) + tm_borders() +
tm_shape(sf_poor_cb) + tm_borders(col="red")

## Warning: The shape sf_poor_cb contains empty units.

Questions?

Summary

tidycensus offers two key functions for fetching census tabular and geographic: get_acs and get_decennial
Using tidycensus to fetch the tabular data or both tabular and geographic data is IMHO way easier than any alternatives, IF you (1) know R, (2) know a bit about working with geographic data in R.
This approach is also scaleable if you want multiple census variables and geographies.
If you just want to fetch the geographic data it may be easier to use the tigris package or download it directly from the census.

References

Related D-Lab Workshops

Extras for Enthusiasts

Scaling Up Example

In this example we show you how you can read in census variables of interest from a file into an R dataframe. You can then use that dataframe to fetch data for all those variables using tidycensus.

# Load cenvar lookup table of vars of interest
my_cenvar_df <-read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)

my_cenvar_df

##              my_cen_var_names my_cen_vars
## 1          citizenship_totpop B05001_001E
## 2     citizenship_non_citizen B05001_006E
## 3                entry_totpop B05005_001E
## 4                  entry_2010 B05005_002E
## 5             entry_2000_2009 B05005_007E
## 6           birthplace_totpop B05007_001E
## 7            birthplace_europ B05007_014E
## 8            birthplace_asian B05007_027E
## 9     birthplace_latinAmerica B05007_040E
## 10    birthplace_southAmerica B05007_081E
## 11    birthplace_other_nonUSA B05007_094E
## 12    birthplace_byage_totpop B06001_001E
## 13     birthplace_byage_fborn B06001_049E
## 14             poverty_totpop B06012_001E
## 15                  below_pov B06012_002E
## 16                 below_pov2 B06012_003E
## 17       poverty_fborn_totpop B06012_017E
## 18            below_pov_fborn B06012_018E
## 19           below_pov2_fborn B06012_019E
## 20       health_native_totpop B27020_002E
## 21  health_native_noinsurance B27020_006E
## 22    health_fborn_nat_totpop B27020_008E
## 23 fborn_nohealth_naturalized B27020_012E
## 24 health_fborn_noncit_totpop B27020_013E
## 25  fborn_nohealth_noncitizen B27020_017E

Fetch the ACS data

Fetch the ACS data for these variables for the 9 county bay area

nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bay9_data <-get_acs(geography = "tract", 
                       variables = my_cenvar_df$my_cen_vars, 
                       year=2016,
                       state = "CA", 
                       county = nine_counties, 
                       geometry = T)

## Getting data from the 2012-2016 5-year ACS

## Using FIPS code '06' for state 'CA'
## Using FIPS code '06' for state 'CA'

bay9_data

## Simple feature collection with 39700 features and 5 fields (with 150 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                          NAME   variable
## 1  06001400100 Census Tract 4001, Alameda County, California B05001_001
## 2  06001400100 Census Tract 4001, Alameda County, California B05001_006
## 3  06001400100 Census Tract 4001, Alameda County, California B05005_001
## 4  06001400100 Census Tract 4001, Alameda County, California B05005_002
## 5  06001400100 Census Tract 4001, Alameda County, California B05005_007
## 6  06001400100 Census Tract 4001, Alameda County, California B05007_001
## 7  06001400100 Census Tract 4001, Alameda County, California B05007_014
## 8  06001400100 Census Tract 4001, Alameda County, California B05007_027
## 9  06001400100 Census Tract 4001, Alameda County, California B05007_040
## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081
##    estimate moe                       geometry
## 1      3018 195 MULTIPOLYGON (((-122.2469 3...
## 2       218 106 MULTIPOLYGON (((-122.2469 3...
## 3       944 168 MULTIPOLYGON (((-122.2469 3...
## 4        61  50 MULTIPOLYGON (((-122.2469 3...
## 5       176  99 MULTIPOLYGON (((-122.2469 3...
## 6       843 156 MULTIPOLYGON (((-122.2469 3...
## 7       208  75 MULTIPOLYGON (((-122.2469 3...
## 8       546 141 MULTIPOLYGON (((-122.2469 3...
## 9        46  43 MULTIPOLYGON (((-122.2469 3...
## 10       11  17 MULTIPOLYGON (((-122.2469 3...

Reformat Ouput

We only want to keep the estimate column for each variable of interest, plus the GEOID and geometry columns.
We then want to make the data wide using the spread function. This will put each estimate variable is in its own column.

bay9_data2 <- bay9_data %>%
  select("GEOID", "variable", "estimate") %>%
  spread(key=variable, value=estimate)

Take a look

bay9_data2

## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007
## 1  06001400100       3018        218        944         61        176
## 2  06001400200       1960         92        315         33         55
## 3  06001400300       5236        317        900        130        314
## 4  06001400400       4171        190        524         69        115
## 5  06001400500       3748        430        701        226        123
## 6  06001400600       1661         62        205         21          9
## 7  06001400700       4552        353        619        122        177
## 8  06001400800       3506        276        767        108        185
## 9  06001400900       2262        202        446         20         16
## 10 06001401000       6193        477        754         90        186
##    B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094
## 1         843        208        546         46         11         43
## 2         243         65        136         28         19         14
## 3         857        119         90        257         86        391
## 4         471        146        228         35         15         62
## 5         635        160         83        245         37        147
## 6         178         49         74         38          0         17
## 7         587         70        145        299         67         73
## 8         741        181        330        149          8         81
## 9         405         20        136        148         38        101
## 10        671         56         46        519         67         50
##    B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017
## 1        3018        843       3011        113         20        843
## 2        1960        243       1952        106         84        243
## 3        5236        857       5153        450        217        848
## 4        4171        471       4158        268        198        471
## 5        3748        635       3733        339        240        635
## 6        1661        178       1656        158        229        178
## 7        4552        587       4552        820        723        587
## 8        3506        741       3457        381        348        741
## 9        2262        405       2228        358        268        405
## 10       6193        671       6184       1466        768        671
##    B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012
## 1          31         12       2175         88        625         12
## 2          31         22       1717         38        151          0
## 3         124        183       4379        111        540         34
## 4          52         82       3694        152        281          0
## 5         151         58       3113        243        205          6
## 6          59         21       1483        143        116          8
## 7         107        103       3965        512        234         33
## 8          75        191       2759        184        465         99
## 9          66         39       1857        103        203         15
## 10        120         17       5513        463        194          0
##    B27020_013 B27020_017                       geometry
## 1         218         10 MULTIPOLYGON (((-122.2469 3...
## 2          92         35 MULTIPOLYGON (((-122.2574 3...
## 3         317         15 MULTIPOLYGON (((-122.2642 3...
## 4         190         21 MULTIPOLYGON (((-122.2618 3...
## 5         430        148 MULTIPOLYGON (((-122.2694 3...
## 6          62          0 MULTIPOLYGON (((-122.2681 3...
## 7         353        147 MULTIPOLYGON (((-122.2779 3...
## 8         276         79 MULTIPOLYGON (((-122.2887 3...
## 9         202         28 MULTIPOLYGON (((-122.2856 3...
## 10        477        111 MULTIPOLYGON (((-122.2787 3...

Rename the columns

Use the dataframe of census variables to rename the columns so that they are self-describing.

colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")

Take a look

bay9_data2

## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID citizenship_totpop citizenship_non_citizen entry_totpop
## 1  06001400100               3018                     218          944
## 2  06001400200               1960                      92          315
## 3  06001400300               5236                     317          900
## 4  06001400400               4171                     190          524
## 5  06001400500               3748                     430          701
## 6  06001400600               1661                      62          205
## 7  06001400700               4552                     353          619
## 8  06001400800               3506                     276          767
## 9  06001400900               2262                     202          446
## 10 06001401000               6193                     477          754
##    entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ
## 1          61             176               843              208
## 2          33              55               243               65
## 3         130             314               857              119
## 4          69             115               471              146
## 5         226             123               635              160
## 6          21               9               178               49
## 7         122             177               587               70
## 8         108             185               741              181
## 9          20              16               405               20
## 10         90             186               671               56
##    birthplace_asian birthplace_latinAmerica birthplace_southAmerica
## 1               546                      46                      11
## 2               136                      28                      19
## 3                90                     257                      86
## 4               228                      35                      15
## 5                83                     245                      37
## 6                74                      38                       0
## 7               145                     299                      67
## 8               330                     149                       8
## 9               136                     148                      38
## 10               46                     519                      67
##    birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn
## 1                       43                    3018                    843
## 2                       14                    1960                    243
## 3                      391                    5236                    857
## 4                       62                    4171                    471
## 5                      147                    3748                    635
## 6                       17                    1661                    178
## 7                       73                    4552                    587
## 8                       81                    3506                    741
## 9                      101                    2262                    405
## 10                      50                    6193                    671
##    poverty_totpop below_pov below_pov2 poverty_fborn_totpop
## 1            3011       113         20                  843
## 2            1952       106         84                  243
## 3            5153       450        217                  848
## 4            4158       268        198                  471
## 5            3733       339        240                  635
## 6            1656       158        229                  178
## 7            4552       820        723                  587
## 8            3457       381        348                  741
## 9            2228       358        268                  405
## 10           6184      1466        768                  671
##    below_pov_fborn below_pov2_fborn health_native_totpop
## 1               31               12                 2175
## 2               31               22                 1717
## 3              124              183                 4379
## 4               52               82                 3694
## 5              151               58                 3113
## 6               59               21                 1483
## 7              107              103                 3965
## 8               75              191                 2759
## 9               66               39                 1857
## 10             120               17                 5513
##    health_native_noinsurance health_fborn_nat_totpop
## 1                         88                     625
## 2                         38                     151
## 3                        111                     540
## 4                        152                     281
## 5                        243                     205
## 6                        143                     116
## 7                        512                     234
## 8                        184                     465
## 9                        103                     203
## 10                       463                     194
##    fborn_nohealth_naturalized health_fborn_noncit_totpop
## 1                          12                        218
## 2                           0                         92
## 3                          34                        317
## 4                           0                        190
## 5                           6                        430
## 6                           8                         62
## 7                          33                        353
## 8                          99                        276
## 9                          15                        202
## 10                          0                        477
##    fborn_nohealth_noncitizen                       geometry
## 1                         10 MULTIPOLYGON (((-122.2469 3...
## 2                         35 MULTIPOLYGON (((-122.2574 3...
## 3                         15 MULTIPOLYGON (((-122.2642 3...
## 4                         21 MULTIPOLYGON (((-122.2618 3...
## 5                        148 MULTIPOLYGON (((-122.2694 3...
## 6                          0 MULTIPOLYGON (((-122.2681 3...
## 7                        147 MULTIPOLYGON (((-122.2779 3...
## 8                         79 MULTIPOLYGON (((-122.2887 3...
## 9                         28 MULTIPOLYGON (((-122.2856 3...
## 10                       111 MULTIPOLYGON (((-122.2787 3...

Fetching data for multiple years

This requires variable name to be the same across years!

# use purr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) { 
  get_decennial(geography = "state",
  variables = c(totalpop = "P001001"),
  dataset = "sf1",
  year = x) %>%
  mutate(year = x) }
)

# View output
head(pop90_10)
tail(pop90_10)

# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) + 
             geom_bar(stat="identity", position=position_dodge()) + coord_flip()

Combining Census Data with Other Data

Area Weighted Interpolation

One of the strenghts of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.

Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometeries.

Read in a Shapefile

sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])

Check the CRS

st_crs(sfnhoods)
st_crs(sf_poor5)

CRS transformation

sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))

Area Weighted Interpolation

Reaggregate percent of people below poverty from census tract to neighborhood polygons.

sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods,
extensive = F) # True= aw sum; False= aw avg

Map it

par(mfrow=c(1,2))
plot(sf_poor5['pct_below_pov'])
plot(sfhoods2['pct_below_pov'])
par(mfrow=c(1,1))

Map it with `tmap`

tm_shape(sfhoods2) +
   tm_polygons(col="pct_below_pov")

Combine the values

head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov

# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
  tm_polygons(col="pct_below_pov", 
    popup.vars = c("nhood", "pct_below_pov")
  )

Getting Started

Setup

Introduction

Outline

Census Data Overview

US Census Data

Primary Census Products

Decennial Census

American Community Survey (ACS)

Decennial Census* vs ACS Data

Census Geographies

Census Data & Census Geographies

ACS 5 Year Dataset RECOMMENDED

Census Data Workflow

CAUTION

R Packages

Packages for Working with Census Data

tidycensus & tigris

tidycensus

tidycensus

tigris

tidycensus & tigris

tidyverse

sf

Alternatives to Accessing Census Data in R

Tutorial Time!

Part 1

Loading packages

Install any packages that you do not have on your computer

Census API Key

Install your Census API Key

Set working directory

Fetching Decennial Census Data

Population Data

Topics, Tables & Variables

load_variables

Decennial Census Variables

2010 Decennial Census Tables

What Variable has the 2010 Total Population value?

get_decennial

get_decennial

View the Data

Visualize results

Plot 2010 Population by state

Display the plot

Challenge

Challenge Solution

Limiting by Area of Interest

Limit Areas of Interest

View Results

Changing Census Tabulation unit

Changing Census Tabulation unit

Challenge

Challenge Solution

Challenge

Challenge Solution

Fetching Census Tract Data

Fetching Census Tract Data

Challenge

Challenge Solution

Challenge Solution

RECAP & QUESTIONS

Fetching data for more than one census variable

Fetching data for more than one census variable

Take a look at the results

Calculating Percents

Calculating Percents

Plot it

Plot it

Fetch all the data in one table

Take a look

Output options

Output options

Data Wrangling

Data Wrangling

Merge population by state from both censuses

Save the data

TIME FOR QUESTIONS

Part 2. Mapping

Mapping Census Data with tidycensus

Mapping Census Data with `tidycensus`