Decennial Census
American Community Survey (ACS)
03/21/2019
Welcome! While we’re waiting:
Clone or download the workshop files from: https://github.com/dlab-geo/rCensus_workshop
Open RStudio
Open a new R script file
About me
About you
Describe primary Census data products
Introduce R packages for working with Census Data
Use those packages to fetch census data
Use those packages to fetch census data plus census geograpic boundary files
Make maps of census data
The “nation’s leading provider of quality data about its people and economy.”
Available at www.census.gov
Decennial Census
American Community Survey (ACS)
Complete count of the population every 10 years since 1790
Includes data on
population, by age & race/ethnicity
housing, by occupancy & tenure (owned, rented)
Annual survey of a sample of about 3 million households
Provides estimates of demographic, social, economic & housing characteristics
Includes margin of error values for the estimates.
| Demographic* | Social | Economic | Housing |
|---|---|---|---|
| Sex | Families | Income | Tenure* |
| Age | Education | Benefits | Occupancy* |
| Race | Marital Status | Employment Status | Structure Type |
| Hispanic Origin | Fertility | Occupation | Housing Value |
| Grandparents | Industry | Taxes & Insurance | |
| Veterans | Commuting | Utilities | |
| Disability Status | Place of Work | Mortgage | |
| Language at Home | Health Insurance | Monthly Rent | |
| Citizenship | |||
| Mobility |
Census data are publicly available at one or more levels of geographic aggregation.
ACS 1 year and 5 year products are currently available
ACS 5 year data provdes much better estimates, lower margins of error
More data available for ACS 5 Year product
Identify your
Then determine what specific tables and variables are available - ACS or Decennial?
“If you want to measure change you can’t change the measures!”
Census tables, variables, geographies, and geographic boundaries change over time!
Measuring change over time with census data is its own thing, complex and not covered by this workshop!
These are the ones we recommend and will use today.
Functions for accessing census decennial and ACS 5 year datasets via Census APIs
Census API keyLimited set of years available via tidycensus
tidycensusProvides access to census geographic data files
Also provides access to census feature data,
Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.
tigris directly to access other census geographic data.Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.
Check out his website to keep abreast of his great packages, blog posts, and tutorials.
Walker also develped a new DataCamp course: Analyzing US Census Data in R!
A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.
dplyr and tidyr for reshaping data
ggplot2 for plotting
purr, readr and tibble for improved performance
These packages are used by tidyverse under the hood.
Simple features for geospatial data objects and methods.
sp packagesf includes the functionality of the sp, rgdal, rgeos and proj4 packages.
You can write code to access the Census APIs directly.
You can download Census data directly from:
You can download Census geographic data directly on the census website
We will work through several exercises using tidycensus to fetch, wrangle and map census data.
Load the packages we will use today
library(tidycensus) library(tidyverse) library(tigris) library(sf)
If you are getting errors try importing dplyr or reinstalling dplyr package as that has worked for some.
Also install any dependancies.
# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("sf")
You need a census API key to programmatically fetch census data.
Get it here (pretty quick):
For more info see:
Use the tidycensus function census_api_key to make tidycensus use your key when it fetches data from the census.
# Install your census api key - long alphanumeric string census_api_key(THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS)
Be sure to Clone or downloaded & unzip the workshop files from: https://github.com/dlab-geo/rCensus_workshop
Then, set your working directory this folder, e.g.,
setwd("~/Documents/Dlab/workshops/2019/rCensus_workshop")Let’s start by fetching population data from the 2010 Census for all states
In order to fetch census data you need to identify the census variables that contain the data of interest.
Census data variables are organized in tables
Which are organized by topic or concept.
The tidycensus load_variables function can help with this step.
First, take a look at the function documentation.
?load_variables
Use load_variables to fetch all variables used in the 2010 census into a dataframe.
vars2010 <- load_variables(year=2010, # Year or end year for ACS
dataset = 'sf1', # 'sf1' for decennial or 'acs5'
cache = TRUE) # Whether to save fetched data locally
Let’s take a look at and discuss the resultant dataframe.
View(vars2010)
Variables: 3,346
Topics: Population, housing
Tables: currenty 333 - that’s a lot!
We can sort and filter the vars2010 dataframe to find it.
We can use the tidycensus function get_decenial to fetch the 2010 census data for total population by state.
First, check the documentation for the function.
?get_decennial
Fetch total population by state (P001001) from the 2010 census using get_decennial.
pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010) # census year
## Getting data from the 2010 decennial Census
How many rows and columns?
Do you see the expected number of states?
What column contains the population counts?
Do the data values see to be right?
#pop2010
We can visualize the data to get a quick overview of the distribution of data values.
It’s a first step in exploratory data analysis and a last step in data communication.
ggplot2 is the most commonly used R package for data visualization.
tidyverse package.Let’s use it to visualize the population data.
Use ggplot2 to create an ordered horizontal bar chart.
pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) +
geom_bar(stat="identity") + coord_flip() +
theme_minimal() +
labs(title = "2010 US Population by State") +
xlab("State") +
ylab("in millions")
Fetch population data by state for 2000.
Don’t assume variable names are the same across years. Check first!
Total Population in 2000
# What is the variable name in 2000? vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T) # Take a look and search in the dataframe View(vars2000) # Fetch the 2000 pop data pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000) # Take a look (plot if time) pop2000
In the previous example we retrieved population data for all states.
This is the default behavior if you don’t specify a subset.
But you can limit the data to be retrieved by subunits like state.
Let’s fetch data for just 3 states.
state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010, # census year
state=c("CA","OR","WA")) # Filter by states of interest
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '41' for state 'OR'
## Using FIPS code '53' for state 'WA'
Note we are referencing states by their abbrevation.
state_pop2010
## # A tibble: 3 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 06 California P001001 37253956 ## 2 41 Oregon P001001 3831074 ## 3 53 Washington P001001 6724540
get_decennial accepts a number of different values for tabulation unit.
state, county, tract, block group, block, and ZCTA.Let’s change the tabulation unit from state to county.
co_pop2010 <- get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010)
## Getting data from the 2010 decennial Census
View the county data to see what was retrieved.
co_pop2010
## # A tibble: 3,221 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 05131 Sebastian County, Arkansas P001001 125744 ## 2 05133 Sevier County, Arkansas P001001 17058 ## 3 05135 Sharp County, Arkansas P001001 17264 ## 4 05137 Stone County, Arkansas P001001 12394 ## 5 05139 Union County, Arkansas P001001 41639 ## 6 05141 Van Buren County, Arkansas P001001 17295 ## 7 05143 Washington County, Arkansas P001001 203065 ## 8 05145 White County, Arkansas P001001 77076 ## 9 05149 Yell County, Arkansas P001001 22185 ## 10 06011 Colusa County, California P001001 21419 ## # … with 3,211 more rows
Fetch population by county for just California
Fetch population by county for Oregon & California
Try it before you look ahead at solutions.
## Fetch population by **county** for just California
co_pop2010_ca <- get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010,
state=c('CA'))
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
#co_pop2010_ca
## Fetch population by **county** for Oregon & California
co_pop2010_caor <- get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010,
state=c('CA','OR'))
## Getting data from the 2010 decennial Census ## Using FIPS code '06' for state 'CA'
## Using FIPS code '41' for state 'OR'
co_pop2010_caor
## # A tibble: 94 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 06011 Colusa County, California P001001 21419 ## 2 06007 Butte County, California P001001 220000 ## 3 06001 Alameda County, California P001001 1510271 ## 4 06003 Alpine County, California P001001 1175 ## 5 06005 Amador County, California P001001 38091 ## 6 06009 Calaveras County, California P001001 45578 ## 7 06013 Contra Costa County, California P001001 1049025 ## 8 06015 Del Norte County, California P001001 28610 ## 9 06031 Kings County, California P001001 152982 ## 10 06021 Glenn County, California P001001 28122 ## # … with 84 more rows
Fetch population by tract for all states.
Fetch population by tract for California.
## Fetch population by **tract** for California.
cal_pop2010_tracts <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010,
state=c('CA'))
cal_pop2010_tracts
## Fetch population by **tract** for all states.
pop2010_tracts <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010)
pop2010_tracts ## DOES THIS WORK?
If you want census data at the tract level or below you must specifiy the state & county or counties.
tract_pop2010 <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year
state="CA", # limit to state of California
county=c("Alameda","Contra Costa")) # and only these counties
## Getting data from the 2010 decennial Census
View the results! How many census tracts are in these 3 counties?
tract_pop2010
## # A tibble: 569 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 06001400100 Census Tract 4001, Alameda County, California P001001 2937 ## 2 06001400200 Census Tract 4002, Alameda County, California P001001 1974 ## 3 06001400300 Census Tract 4003, Alameda County, California P001001 4865 ## 4 06001400400 Census Tract 4004, Alameda County, California P001001 3703 ## 5 06001400500 Census Tract 4005, Alameda County, California P001001 3517 ## 6 06001400600 Census Tract 4006, Alameda County, California P001001 1571 ## 7 06001400700 Census Tract 4007, Alameda County, California P001001 4206 ## 8 06001400800 Census Tract 4008, Alameda County, California P001001 3594 ## 9 06001400900 Census Tract 4009, Alameda County, California P001001 2302 ## 10 06001401000 Census Tract 4010, Alameda County, California P001001 5678 ## # … with 559 more rows
Fetch population by county for Alameda County, California
Fetch population by tract for the nine county Bay Area:
Note: You can use names, abbreviations or FIPS codes for your state and county.
# County FIPS Codes for
# Alameda, SF, Contra Costa, Marin County, Napa,
# San Mateo, Santa Clara, Solano, Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
# population by **county** for Alameda County, California
alco_pop2010 <- get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variables of interest
year = 2010,
state=c('CA'),
county=c('Alameda County'))
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '001' for 'Alameda County'
#alco_pop2010
Fetch population by tract for the nine county Bay Area
# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa,
# San Mateo, Santa Clara, Solano, Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bayarea_pop2010_tract <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year
state="CA", # limit to state of California
county=nine_counties) # and only these counties
## Getting data from the 2010 decennial Census
#bayarea_pop2010_tract
Fetch population by tract for the nine county Bay Area
# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa,
# San Mateo, Santa Clara, Solano, Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bayarea_pop2010 <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year
state="CA", # limit to state of California
county=nine_counties) # and only these counties
# View the data
bayarea_pop2010
What three things are new here?
#urban rural pop for 3 counties
ur_pop10 <- get_decennial(geography = "county", # census tabulation unit
variables = c(urban="P002002",rural="P002005"),
year = 2010,
summary_var = "P002001", # The denominator
state='CA',
county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '055' for 'Napa County'
## Using FIPS code '097' for 'Sonoma County'
## Using FIPS code '045' for 'Mendocino County'
variables = c("P002002","P002005")
variables = c(urban="P002002",rural="P002005")
summary_var (a denominator - here, the total count of all people or households surveyed. Can be used for calcuations like percent of total.)summary_var = "P002001"
ur_pop10
## # A tibble: 6 x 5 ## GEOID NAME variable value summary_value ## <chr> <chr> <chr> <dbl> <dbl> ## 1 06045 Mendocino County, California urban 48110 87841 ## 2 06055 Napa County, California urban 118194 136484 ## 3 06097 Sonoma County, California urban 424102 483878 ## 4 06045 Mendocino County, California rural 39731 87841 ## 5 06055 Napa County, California rural 18290 136484 ## 6 06097 Sonoma County, California rural 59776 483878
The summary_value column comes in handy when you want to compute percent of total.
Here’s one way to do it.
# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
mutate(pct = 100 * (value / summary_value))
Let’s take a look at the output
ur_pop10 # Take a look
## # A tibble: 6 x 6 ## GEOID NAME variable value summary_value pct ## <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 06045 Mendocino County, California urban 48110 87841 54.8 ## 2 06055 Napa County, California urban 118194 136484 86.6 ## 3 06097 Sonoma County, California urban 424102 483878 87.6 ## 4 06045 Mendocino County, California rural 39731 87841 45.2 ## 5 06055 Napa County, California rural 18290 136484 13.4 ## 6 06097 Sonoma County, California rural 59776 483878 12.4
Plots give us compact visual summaries of the data
myplot <- ggplot(data = ur_pop10,
mapping = aes(x = NAME, fill = variable,
y = ifelse(test = variable == "urban",
yes = -pct, no = pct))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = abs, limits=c(-100,100)) +
labs(title="Urban & Rural Population in Wine Country",
x="County", y = " Percent of Population", fill="") +
coord_flip()
Don’t worry if you don’t get all the ggplot code now. It’s here for reference.
myplot
This is often helpful but you need to keep tract of the meaning of each variable.
alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
table = "P002", # Table of urban & rural population counts
year = 2010, # Decennial census year
state='CA', # Filter state
county="Alameda") # Filter county
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '001' for 'Alameda County'
unique(alco_pop10$variable) # What and how many unique vars in table?
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
head(alco_pop10,3) # Take a look at output
## # A tibble: 3 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 06001400100 Census Tract 4001, Alameda County, California P002001 2937 ## 2 06001400200 Census Tract 4002, Alameda County, California P002001 1974 ## 3 06001400300 Census Tract 4003, Alameda County, California P002001 4865
Let’s try all three of these commands and then look at the ouput to see what’s different?
get_decennial(geography = "state", variables = "P001001",
year = 2010)
get_decennial(geography = "state", variables = c(pop10="P001001"),
year = 2010)
get_decennial(geography = "state", variables = c(pop10="P001001"),
year = 2010, output="wide")
head(get_decennial(geography = "state", variables = "P001001",
year = 2010), 2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 01 Alabama P001001 4779736 ## 2 02 Alaska P001001 710231
head(get_decennial(geography = "state", variables = c(pop10="P001001"),
year = 2010), 2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 4 ## GEOID NAME variable value ## <chr> <chr> <chr> <dbl> ## 1 01 Alabama pop10 4779736 ## 2 02 Alaska pop10 710231
head(get_decennial(geography = "state", variables = c(pop10="P001001"),
year = 2010, output="wide"), 2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 3 ## GEOID NAME pop10 ## <chr> <chr> <dbl> ## 1 01 Alabama 4779736 ## 2 02 Alaska 710231
Your R skills can help you reformat the data and make it more useable.
Let’s fetch population data for 2010 & 2000 by state with output=wide.
Then we will combine these into one data frame.
Fetch pop by state from both the 2000 and 2010 census
pop2000 <- get_decennial(geography = "state",
variables = c(pop00="P001001"),
year = 2000, output="wide")
## Getting data from the 2000 decennial Census
pop2010 <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
year = 2010, output="wide")
## Getting data from the 2010 decennial Census
Save in a new dataframe with both columns
pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
select(NAME, pop00, pop10)
head(pop2000_2010,3)
## NAME pop00 pop10 ## 1 Alabama 4447100 4779736 ## 2 Alaska 626932 710231 ## 3 Arizona 5130632 6392017
Use write.csv to save a data frame to a CSV file.
write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)
tidycensusYou can fetch geographic data by adding the parameter geometry=TRUE to tidycensus functions
Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.
Only a subset of data available via tigris can be accessed via tidycensus.
You can then use common mapping options like plot, ggplot and tmap to make maps.
Before fetching geometry, we need to specify a few tigris options
Set the class of returned data to be sf objects (not sp, the default)
Set tigris_use_cache to TRUE
# Tigris options - used by tidycensus options(tigris_class = "sf") # SP is the default format returned by tigris options(tigris_use_cache = TRUE) # Save retrieved data locally
Caching the data is important because it speeds things up if you often fetch census data for the same geographies over and over again.
You may want to use the geographic data downloaded by tigris in other applications.
To do this, you need to know where the files are saved locally.
You can also specify where tigris should save cached data.
# Check the location of the tigris cached data
Sys.getenv('TIGRIS_CACHE_DIR')
# Set it
tigris_cache_dir("~/Documents/gis_data/census") # Folder for local data
# Check it again
Sys.getenv('TIGRIS_CACHE_DIR')
We fetch the geospatial data by setting geometry=TRUE.
pop2010geo <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
year = 2010,
output="wide",
geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census
Let’s take a minute to discuss the format of an sf spatial object.
pop2010geo
## Simple feature collection with 52 features and 3 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## # A tibble: 52 x 4 ## GEOID NAME pop10 geometry ## <chr> <chr> <dbl> <MULTIPOLYGON [°]> ## 1 01 Alabama 4.78e6 (((-85.00237 31.00068, -85.02411 31.00068, -85.… ## 2 02 Alaska 7.10e5 (((-164.9762 54.13459, -164.9378 54.13668, -164… ## 3 04 Arizona 6.39e6 (((-109.0452 36.99908, -109.0452 36.96949, -109… ## 4 05 Arkansas 2.92e6 (((-94.55929 36.4995, -94.51948 36.49921, -94.3… ## 5 06 Californ… 3.73e7 (((-122.4463 37.86105, -122.4386 37.86837, -122… ## 6 22 Louisiana 4.53e6 (((-88.86507 29.75271, -88.88976 29.7182, -88.9… ## 7 21 Kentucky 4.34e6 (((-83.67541 36.60081, -83.67561 36.59858, -83.… ## 8 08 Colorado 5.03e6 (((-102.0422 36.99308, -102.0545 36.99311, -102… ## 9 09 Connecti… 3.57e6 (((-71.85957 41.3224, -71.86823 41.33094, -71.8… ## 10 10 Delaware 8.98e5 (((-75.55945 39.62981, -75.5591 39.62906, -75.5… ## # … with 42 more rows
R sf objects include
a dataframe with a geometry column named of geometry
a CRS (coordinate reference system), specified by
For a deeper understanding of the sf package and its functionality, we recommend our Geospatial-Fundamentals-in-R-with-sf workshop.
All census geographic data use the NAD83 CRS, or coordinate reference system. NAD83 stands for North American Datum of 1983. The geographic coordinates are longitude and latitude values encoded as decimal degrees.
WGS84, or The World Geodetic System of 1984 is the most commonly used geographic CRS. The difference between points in these systems varies up to 1 meter in continental US.
Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.
An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.
We can use plot to make a quick map the geometry stored in an sf spatial object.
plot(pop2010geo$geometry)
What do you get if you plot the sf object without specifying “$geometry”
The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.
tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.
pop2010geo_shifted <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
output="wide",
year = 2010,
geometry=TRUE,
shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.
plot(pop2010geo_shifted$geometry)
You can save sf data to a shapefile using st_write
st_write(pop2010geo_shifted,"usa_2010_shifted.shp")
my_cache_dir <- Sys.getenv('TIGRIS_CACHE_DIR')
dir(my_cache_dir) # What files stored there?
plot(pop2010geo_shifted['pop10'])
ggplot(pop2010geo_shifted, aes(fill = pop10)) + geom_sf()
Note the use of geom_sf which tells ggplot that spatial data objects are being mapped. - this is a huge improvement!!
Create a map of CA Population in 2010 by county
2010 pop Data for California Counties
#fetch it
cal_pop10 <- get_decennial(geography = "county",
variables = "P001001",
year = 2010,
state='CA',
geometry=TRUE)
# map it
#plot(cal_pop10['value'])
We can fetch both the census data and the geometry for more than one state!
west_pop10 <- get_decennial(geography = "county",
variables = "P001001",
year = 2010,
state=c('CA','OR','NV',"AZ"),
geometry=T)
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '41' for state 'OR'
## Using FIPS code '32' for state 'NV'
## Using FIPS code '04' for state 'AZ'
These are just quick plots to make sure we got the right data!
plot(west_pop10['value'])
Fetching the data for all tracts in one state.
# Fetch tract data
alco_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county='Alameda',
geometry=T)
## Getting data from the 2010 decennial Census
## Using FIPS code '06' for state 'CA'
## Using FIPS code '001' for 'Alameda County'
Fetch and map the 2010 population by census tract for Alameda and Contra Costa counties.
Fetch Tract population & geometry data for Alameda & Contra Costa Counties
alcc_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county=c("Alameda","Contra Costa"),
geometry=T)
## Getting data from the 2010 decennial Census
Map it
plot(alcc_pop10['value'])
Fetch and map the percent of San Francicso properties by census tract that were coded as rented in the 2010 Census.
To start, identify the variables for the
total number of hounsing units
number of renter occupied units
SF Rented Units, 2010
sf_rented <- get_decennial(geography = "tract", # census tabulation unit
variables = "H004004",
year = 2010,
summary_var = "H004001", # Total Urban - the denominator
state='CA',
county='San Francisco',
geometry=T)
sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
mutate(pct = 100 * (value / summary_value))
plot(sf_pct_rented['pct'])
The tidycensus workflow for ACS data is similar to that used for decennial census data.
Because the ACS contains sample data, each ACS variable of interest includes both an estimate of the value and a margin of error.
You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2006 - 2010 dataset.
The default end year for my version of tidycensus (as of April 9, 2020) is 2018 for the 2014-2018 ACS 5 year dataset.
Let’s start by fetching ACS 5-year 2016 data on poverty (not all variables appear included in 2018 data yet).
We want to explore the number of folks living below the poverty level by census tract.
First we need to find the variable name(s)!
Load the ACS 2012-2016 5 year data variables into a dataframe.
end year in tidycensus!Then take a look at the variable names, labels and concepts.
How many variables refer to the concept of poverty?
acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T) View(acs2016vars)
Many thousands more than for decennial census!
See the documentation on the census website
Types of tables:
B prefix = base tablesC = collapsed tablesDP = data profilesS = Subject tablesACS variables can be confusing.
The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.
Let’s check it out to see what tables/variables we should use.
In RStudio, view the dataframe acs2016vars and interactively filter the name column to display only the variables in the table C17002
Take a look at the different variables in this table.
What variable(s) contain the estimate of the number of people living below poverty?
Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco
?get_acs
Fetch the data in the table C17002 that contain the counts of people living below 100% of the poverty line.
sf_poor <- get_acs(geography = "tract",
variables = c('C17002_002','C17002_003'), # poverty variables
year = 2016,
state="CA",
summary_var = "C17002_001", # Est of num people - denom
county="San Francisco",
geometry=T)
## Getting data from the 2012-2016 5-year ACS
## Using FIPS code '06' for state 'CA'
## Using FIPS code '075' for 'San Francisco County'
Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.
sf_poor
What are we mapping!
# What are we mapping? plot(sf_poor['estimate'])
# Remove census tracts that have no people! sf_poor <- subset(sf_poor, summary_est > 0) # What are we mapping? plot(sf_poor['estimate'])
Let’s calculate the percent below poverty by tract.
sf_poor <- sf_poor %>% mutate(pct = 100 * (estimate / summary_est)) head(sf_poor, 3)
## Simple feature collection with 3 features and 8 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -122.4267 ymin: 37.79947 xmax: -122.3996 ymax: 37.81144 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## GEOID NAME ## 1 06075010100 Census Tract 101, San Francisco County, California ## 2 06075010100 Census Tract 101, San Francisco County, California ## 3 06075010200 Census Tract 102, San Francisco County, California ## variable estimate moe summary_est summary_moe ## 1 C17002_002 314 196 3972 291 ## 2 C17002_003 397 185 3972 291 ## 3 C17002_002 160 123 4300 442 ## geometry pct ## 1 MULTIPOLYGON (((-122.4206 3... 7.905337 ## 2 MULTIPOLYGON (((-122.4206 3... 9.994965 ## 3 MULTIPOLYGON (((-122.4267 3... 3.720930
We want to group the data by the geometry and then sum the data values so that we have one value per geometry.
sf_poor_summed <- sf_poor %>%
select(GEOID, estimate, pct, geometry) %>%
group_by(GEOID) %>%
summarise(count_below_pov = sum(estimate),
pct_below_pov = sum(pct))
head(sf_poor_summed)
## Simple feature collection with 6 features and 3 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## # A tibble: 6 x 4 ## GEOID count_below_pov pct_below_pov geometry ## <chr> <dbl> <dbl> <MULTIPOLYGON [°]> ## 1 060750… 711 17.9 (((-122.4206 37.81111, -122.4075 3… ## 2 060750… 235 5.47 (((-122.4267 37.80964, -122.4249 3… ## 3 060750… 625 14.4 (((-122.4187 37.80593, -122.4169 3… ## 4 060750… 398 7.93 (((-122.4149 37.80354, -122.4116 3… ## 5 060750… 229 8.53 (((-122.4052 37.80476, -122.4035 3… ## 6 060750… 882 24.7 (((-122.411 37.80117, -122.4078 37…
Where are SF’s poorest areas?
plot(sf_poor_summed['count_below_pov'])
Where are SF’s poorest areas?
plot(sf_poor_summed['pct_below_pov'])
The ACS 2013-2017 5 year dataset was released Dec 6, 2018.
Although my current version of tidycensus states that 2012-2016 is the latest ACS 5-year product, see if you can fetch & map the percent of people below poverty line in San Francisco using the 2013-2017 ACS 5-year data.
sf_poor_2017 <- get_acs(geography = "tract",
variables = c('C17002_002','C17002_003'), # poverty variables
year = 2017,
state="CA",
summary_var = "C17002_001", # Est of num people - denom
county="San Francisco",
geometry=T)
head(sf_poor_2017)
We haven’t talked about it but it may be important in your work with ACS data.
Math is needed to combine MOEs when you combine variables.
See this web page on how to handle MOEs in tidycensus
The tmap package is great for making both static and interactive maps. It turns R into a GIS.
Let’s check it out with our last dataframe.
library(tmap)
tmap_mode("view") # set mode to interactive
## tmap mode set to interactive viewing
poverty_map <- tm_shape(sf_poor_summed) +
tm_polygons(col="pct_below_pov", alpha=0.7)
View the map - click on tracts
poverty_map
There are a number of great tutorials online for working with tmap.
See the References at the end of this workshop document.
Cartographic Boundary vs Detailed TIGER/Line data
By default, tidycensus downloads census cartographic boundary data.
In get_acs you can also request the more detailed census TIGER/Line data.
The cartographic boundary data is great for mapping but the detailed data is often better for analysis.
Let’s check it out.
sf_poor_cb <- get_acs(geography = "tract",
variables = c('C17002_002','C17002_003'), # poverty variables
summary_var = "C17002_001",
year = 2016,
state="CA",
county="San Francisco",
geometry=TRUE,
cb = TRUE) # THIS IS THE DEFAULT!
## Getting data from the 2012-2016 5-year ACS
## Using FIPS code '06' for state 'CA'
## Using FIPS code '075' for 'San Francisco County'
sf_poor_tl <- get_acs(geography = "tract",
variables = c('C17002_002','C17002_003'), # poverty variables
summary_var = "C17002_001",
year = 2016,
state="CA",
county="San Francisco",
geometry=TRUE,
cb = FALSE) # Fetching the TIGER/Line data
## Getting data from the 2012-2016 5-year ACS
## Using FIPS code '06' for state 'CA'
## Using FIPS code '075' for 'San Francisco County'
zoom in to explore, especially around the coastline.
tm_shape(sf_poor_tl) + tm_borders() + tm_shape(sf_poor_cb) + tm_borders(col="red")
## Warning: The shape sf_poor_cb contains empty units.
tidycensus offers two key functions for fetching census tabular and geographic: get_acs and get_decennial
Using tidycensus to fetch the tabular data or both tabular and geographic data is IMHO way easier than any alternatives, IF you (1) know R, (2) know a bit about working with geographic data in R.
This approach is also scaleable if you want multiple census variables and geographies.
If you just want to fetch the geographic data it may be easier to use the tigris package or download it directly from the census.
In this example we show you how you can read in census variables of interest from a file into an R dataframe. You can then use that dataframe to fetch data for all those variables using tidycensus.
# Load cenvar lookup table of vars of interest
my_cenvar_df <-read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)
my_cenvar_df
## my_cen_var_names my_cen_vars ## 1 citizenship_totpop B05001_001E ## 2 citizenship_non_citizen B05001_006E ## 3 entry_totpop B05005_001E ## 4 entry_2010 B05005_002E ## 5 entry_2000_2009 B05005_007E ## 6 birthplace_totpop B05007_001E ## 7 birthplace_europ B05007_014E ## 8 birthplace_asian B05007_027E ## 9 birthplace_latinAmerica B05007_040E ## 10 birthplace_southAmerica B05007_081E ## 11 birthplace_other_nonUSA B05007_094E ## 12 birthplace_byage_totpop B06001_001E ## 13 birthplace_byage_fborn B06001_049E ## 14 poverty_totpop B06012_001E ## 15 below_pov B06012_002E ## 16 below_pov2 B06012_003E ## 17 poverty_fborn_totpop B06012_017E ## 18 below_pov_fborn B06012_018E ## 19 below_pov2_fborn B06012_019E ## 20 health_native_totpop B27020_002E ## 21 health_native_noinsurance B27020_006E ## 22 health_fborn_nat_totpop B27020_008E ## 23 fborn_nohealth_naturalized B27020_012E ## 24 health_fborn_noncit_totpop B27020_013E ## 25 fborn_nohealth_noncitizen B27020_017E
Fetch the ACS data for these variables for the 9 county bay area
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bay9_data <-get_acs(geography = "tract",
variables = my_cenvar_df$my_cen_vars,
year=2016,
state = "CA",
county = nine_counties,
geometry = T)
## Getting data from the 2012-2016 5-year ACS
## Using FIPS code '06' for state 'CA' ## Using FIPS code '06' for state 'CA'
bay9_data
## Simple feature collection with 39700 features and 5 fields (with 150 geometries empty) ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## First 10 features: ## GEOID NAME variable ## 1 06001400100 Census Tract 4001, Alameda County, California B05001_001 ## 2 06001400100 Census Tract 4001, Alameda County, California B05001_006 ## 3 06001400100 Census Tract 4001, Alameda County, California B05005_001 ## 4 06001400100 Census Tract 4001, Alameda County, California B05005_002 ## 5 06001400100 Census Tract 4001, Alameda County, California B05005_007 ## 6 06001400100 Census Tract 4001, Alameda County, California B05007_001 ## 7 06001400100 Census Tract 4001, Alameda County, California B05007_014 ## 8 06001400100 Census Tract 4001, Alameda County, California B05007_027 ## 9 06001400100 Census Tract 4001, Alameda County, California B05007_040 ## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081 ## estimate moe geometry ## 1 3018 195 MULTIPOLYGON (((-122.2469 3... ## 2 218 106 MULTIPOLYGON (((-122.2469 3... ## 3 944 168 MULTIPOLYGON (((-122.2469 3... ## 4 61 50 MULTIPOLYGON (((-122.2469 3... ## 5 176 99 MULTIPOLYGON (((-122.2469 3... ## 6 843 156 MULTIPOLYGON (((-122.2469 3... ## 7 208 75 MULTIPOLYGON (((-122.2469 3... ## 8 546 141 MULTIPOLYGON (((-122.2469 3... ## 9 46 43 MULTIPOLYGON (((-122.2469 3... ## 10 11 17 MULTIPOLYGON (((-122.2469 3...
We only want to keep the estimate column for each variable of interest, plus the GEOID and geometry columns.
We then want to make the data wide using the spread function. This will put each estimate variable is in its own column.
bay9_data2 <- bay9_data %>%
select("GEOID", "variable", "estimate") %>%
spread(key=variable, value=estimate)
bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty) ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## First 10 features: ## GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007 ## 1 06001400100 3018 218 944 61 176 ## 2 06001400200 1960 92 315 33 55 ## 3 06001400300 5236 317 900 130 314 ## 4 06001400400 4171 190 524 69 115 ## 5 06001400500 3748 430 701 226 123 ## 6 06001400600 1661 62 205 21 9 ## 7 06001400700 4552 353 619 122 177 ## 8 06001400800 3506 276 767 108 185 ## 9 06001400900 2262 202 446 20 16 ## 10 06001401000 6193 477 754 90 186 ## B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094 ## 1 843 208 546 46 11 43 ## 2 243 65 136 28 19 14 ## 3 857 119 90 257 86 391 ## 4 471 146 228 35 15 62 ## 5 635 160 83 245 37 147 ## 6 178 49 74 38 0 17 ## 7 587 70 145 299 67 73 ## 8 741 181 330 149 8 81 ## 9 405 20 136 148 38 101 ## 10 671 56 46 519 67 50 ## B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017 ## 1 3018 843 3011 113 20 843 ## 2 1960 243 1952 106 84 243 ## 3 5236 857 5153 450 217 848 ## 4 4171 471 4158 268 198 471 ## 5 3748 635 3733 339 240 635 ## 6 1661 178 1656 158 229 178 ## 7 4552 587 4552 820 723 587 ## 8 3506 741 3457 381 348 741 ## 9 2262 405 2228 358 268 405 ## 10 6193 671 6184 1466 768 671 ## B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012 ## 1 31 12 2175 88 625 12 ## 2 31 22 1717 38 151 0 ## 3 124 183 4379 111 540 34 ## 4 52 82 3694 152 281 0 ## 5 151 58 3113 243 205 6 ## 6 59 21 1483 143 116 8 ## 7 107 103 3965 512 234 33 ## 8 75 191 2759 184 465 99 ## 9 66 39 1857 103 203 15 ## 10 120 17 5513 463 194 0 ## B27020_013 B27020_017 geometry ## 1 218 10 MULTIPOLYGON (((-122.2469 3... ## 2 92 35 MULTIPOLYGON (((-122.2574 3... ## 3 317 15 MULTIPOLYGON (((-122.2642 3... ## 4 190 21 MULTIPOLYGON (((-122.2618 3... ## 5 430 148 MULTIPOLYGON (((-122.2694 3... ## 6 62 0 MULTIPOLYGON (((-122.2681 3... ## 7 353 147 MULTIPOLYGON (((-122.2779 3... ## 8 276 79 MULTIPOLYGON (((-122.2887 3... ## 9 202 28 MULTIPOLYGON (((-122.2856 3... ## 10 477 111 MULTIPOLYGON (((-122.2787 3...
Use the dataframe of census variables to rename the columns so that they are self-describing.
colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")
bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty) ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424 ## epsg (SRID): 4269 ## proj4string: +proj=longlat +datum=NAD83 +no_defs ## First 10 features: ## GEOID citizenship_totpop citizenship_non_citizen entry_totpop ## 1 06001400100 3018 218 944 ## 2 06001400200 1960 92 315 ## 3 06001400300 5236 317 900 ## 4 06001400400 4171 190 524 ## 5 06001400500 3748 430 701 ## 6 06001400600 1661 62 205 ## 7 06001400700 4552 353 619 ## 8 06001400800 3506 276 767 ## 9 06001400900 2262 202 446 ## 10 06001401000 6193 477 754 ## entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ ## 1 61 176 843 208 ## 2 33 55 243 65 ## 3 130 314 857 119 ## 4 69 115 471 146 ## 5 226 123 635 160 ## 6 21 9 178 49 ## 7 122 177 587 70 ## 8 108 185 741 181 ## 9 20 16 405 20 ## 10 90 186 671 56 ## birthplace_asian birthplace_latinAmerica birthplace_southAmerica ## 1 546 46 11 ## 2 136 28 19 ## 3 90 257 86 ## 4 228 35 15 ## 5 83 245 37 ## 6 74 38 0 ## 7 145 299 67 ## 8 330 149 8 ## 9 136 148 38 ## 10 46 519 67 ## birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn ## 1 43 3018 843 ## 2 14 1960 243 ## 3 391 5236 857 ## 4 62 4171 471 ## 5 147 3748 635 ## 6 17 1661 178 ## 7 73 4552 587 ## 8 81 3506 741 ## 9 101 2262 405 ## 10 50 6193 671 ## poverty_totpop below_pov below_pov2 poverty_fborn_totpop ## 1 3011 113 20 843 ## 2 1952 106 84 243 ## 3 5153 450 217 848 ## 4 4158 268 198 471 ## 5 3733 339 240 635 ## 6 1656 158 229 178 ## 7 4552 820 723 587 ## 8 3457 381 348 741 ## 9 2228 358 268 405 ## 10 6184 1466 768 671 ## below_pov_fborn below_pov2_fborn health_native_totpop ## 1 31 12 2175 ## 2 31 22 1717 ## 3 124 183 4379 ## 4 52 82 3694 ## 5 151 58 3113 ## 6 59 21 1483 ## 7 107 103 3965 ## 8 75 191 2759 ## 9 66 39 1857 ## 10 120 17 5513 ## health_native_noinsurance health_fborn_nat_totpop ## 1 88 625 ## 2 38 151 ## 3 111 540 ## 4 152 281 ## 5 243 205 ## 6 143 116 ## 7 512 234 ## 8 184 465 ## 9 103 203 ## 10 463 194 ## fborn_nohealth_naturalized health_fborn_noncit_totpop ## 1 12 218 ## 2 0 92 ## 3 34 317 ## 4 0 190 ## 5 6 430 ## 6 8 62 ## 7 33 353 ## 8 99 276 ## 9 15 202 ## 10 0 477 ## fborn_nohealth_noncitizen geometry ## 1 10 MULTIPOLYGON (((-122.2469 3... ## 2 35 MULTIPOLYGON (((-122.2574 3... ## 3 15 MULTIPOLYGON (((-122.2642 3... ## 4 21 MULTIPOLYGON (((-122.2618 3... ## 5 148 MULTIPOLYGON (((-122.2694 3... ## 6 0 MULTIPOLYGON (((-122.2681 3... ## 7 147 MULTIPOLYGON (((-122.2779 3... ## 8 79 MULTIPOLYGON (((-122.2887 3... ## 9 28 MULTIPOLYGON (((-122.2856 3... ## 10 111 MULTIPOLYGON (((-122.2787 3...
This requires variable name to be the same across years!
# use purr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) {
get_decennial(geography = "state",
variables = c(totalpop = "P001001"),
dataset = "sf1",
year = x) %>%
mutate(year = x) }
)
# View output
head(pop90_10)
tail(pop90_10)
# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) +
geom_bar(stat="identity", position=position_dodge()) + coord_flip()
One of the strenghts of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.
Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometeries.
sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])
st_crs(sfnhoods) st_crs(sf_poor5)
sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))
Reaggregate percent of people below poverty from census tract to neighborhood polygons.
sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods, extensive = F) # True= aw sum; False= aw avg
par(mfrow=c(1,2)) plot(sf_poor5['pct_below_pov']) plot(sfhoods2['pct_below_pov']) par(mfrow=c(1,1))
tmaptm_shape(sfhoods2) + tm_polygons(col="pct_below_pov")
head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov
# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
tm_polygons(col="pct_below_pov",
popup.vars = c("nhood", "pct_below_pov")
)