The United States census has been a leading source for analyzing neighborhoods since 1790, and it has been of great help to data scientists. Still, data scientists often struggle to understand the data the Census Bureau publishes and to know which R packages on CRAN help analyze it.
In this session, learners will learn about the three approaches used for mapping census data and shapefiles, and how to create maps in R using the two most recent approaches.
The three approaches for mapping census data and shapefiles are:
The historical, or old-school, way of downloading census data and shapefiles into R. This involves going to the Census website and downloading the census data and shapefiles onto your computer before analyzing the data.
The second approach uses the API with tigris for maps and getCensus() (from the censusapi package) for census data. This approach is far more advanced and less time-consuming than the old-school method.
The most recent approach, which is the easiest and fastest, uses the API with tidycensus for both maps and census data. This approach gives access to census data and shapefiles together.
The old-school mapping approach involves the following steps:
To download census shapefiles with the tigris package, you simply call one of the following functions, depending on the geography you want:
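As a minimal sketch, the calls below show a few of the tigris download functions at different geographic levels. The choice of Pennsylvania and Delaware County is only an illustration, and each call downloads data from the Census Bureau, so it needs an internet connection.

```r
library(tigris)
options(tigris_use_cache = TRUE)   # cache downloads to avoid repeated requests

# cb = TRUE requests the smaller cartographic boundary files
us_states   <- states(cb = TRUE)                                    # all state boundaries
pa_counties <- counties(state = "PA", cb = TRUE)                    # counties in one state
pa_tracts   <- tracts(state = "PA", county = "Delaware", cb = TRUE) # tracts in one county
```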
We also need the censusapi package from Hannah Recht of Bloomberg, which helps download census demographic data.
Data scientists using R face a couple of problems when mapping census data. Most of the time, it is not easy to find the data published by the Census Bureau. Even when you get the census data you need, it is challenging to know which package(s) to use in R. This session will help you understand the steps and packages you need for mapping census data in R.
The objective is to enhance learners' skills in finding and mapping census data through the use of appropriate packages and functions.
For this session, we will concentrate on the two most recent approaches (API with tigris and API with tidycensus) for downloading census data and shapefiles, and on how we can take the analysis further by joining the census data and shapefiles to produce a well-informed report.
Install packages
Install and load all the relevant packages before running the code in this session.
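One way to do this, as a sketch, is to install whatever is missing and then load everything in one pass. The package list below is inferred from the functions used later in this session, so adjust it to your own needs.

```r
# Packages used in this session (list inferred from the code that follows)
pkgs <- c("tigris", "censusapi", "tidycensus", "dplyr", "tidyr",
          "ggplot2", "sf", "pander", "viridis")

# Install only the packages that are not already available
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Attach every package
invisible(lapply(pkgs, library, character.only = TRUE))
```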
Sign up for a Census Key
The second step is to apply for a Census API key on the Census Bureau website. Store the key in your .Renviron file for repeated use with the census_api_key() function from tidycensus. This function adds your Census API key to the .Renviron file, so you can call the key without storing it in your code.
Install your key using census_api_key("put your key here", install = TRUE). If you need to overwrite an existing key, use census_api_key("put your key here", overwrite = TRUE, install = TRUE).
After you have installed your key and restarted R (or run readRenviron("~/.Renviron")), it can be called at any time by typing Sys.getenv("CENSUS_API_KEY").
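Because a missing key usually shows up later as a confusing API error, it can help to check for it up front. The helper below is a hypothetical convenience function (not part of tidycensus or censusapi) that uses only base R.

```r
# Hypothetical helper: return the stored key, or stop with a clear message
require_census_key <- function() {
  key <- Sys.getenv("CENSUS_API_KEY")
  if (!nzchar(key)) {
    stop("CENSUS_API_KEY is not set; install it with census_api_key() and restart R.")
  }
  invisible(key)
}
```

Calling `key <- require_census_key()` at the top of a script fails early if the key is missing.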
The second approach, the API with tigris, is far more advanced and less time-consuming than the old-school method.
The getCensus() function from the censusapi package makes an API call and returns a data frame of results. To call getCensus(), you need the following arguments: name, the name of the Census data set; vintage, the year of the data set; vars, the names of the variables you want to access; and region, the geography level of the data, such as county or tract.
However, to access all the Census metadata, you need to run the metadata function as follows:

library(censusapi)
acs_vars <- listCensusMetadata(name = "acs/acs5",
                               type = "variables", vintage = 2016)

This returns all the variables in the acs/acs5 data set. From the resulting table you can search for the variable you want. For instance, you will see variable names like B21004_001E, B19013_001M, and so on. It is up to you, based on your data needs, to decide which variable you want. For example, A14009 is the variable name for Average Household Income by Race.
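Scrolling through thousands of variables is slow, so a pattern search on the label column is usually quicker. The sketch below uses a small stand-in table with shortened, illustrative labels (not the exact Census wording) and a hypothetical helper named find_vars(). It assumes the metadata table has `name` and `label` columns.

```r
# Stand-in for the table returned by listCensusMetadata(); labels are
# shortened, illustrative text rather than the exact Census wording.
acs_vars_demo <- data.frame(
  name  = c("B19013_001E", "B19013_001M", "B21004_001E"),
  label = c("Median household income (estimate)",
            "Median household income (margin of error)",
            "Median income by veteran status (estimate)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose label matches a pattern, ignoring case
find_vars <- function(meta, pattern) {
  meta[grepl(pattern, meta$label, ignore.case = TRUE), , drop = FALSE]
}

find_vars(acs_vars_demo, "household income")
```

The same call should work on the real table, for example find_vars(acs_vars, "median household income").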
Let's say we have searched through the metadata and want to download the median household income for 2016. Let's start by loading the relevant package and calling the function below:

library(censusapi)
# Read the key from .Renviron instead of hard-coding it in the script
pa_income <- getCensus(name = "acs/acs5",
                       vintage = 2016,
                       vars = c("NAME", "B19013_001E", "B19013_001M"),
                       region = "county:*",        # all counties ...
                       regionin = "state:42",      # ... within Pennsylvania (FIPS code 42)
                       key = Sys.getenv("CENSUS_API_KEY"))
library(dplyr)    # provides the %>% pipe
library(pander)   # formats the table
head(pa_income) %>% pander()

| state | county | NAME | B19013_001E | B19013_001M |
|---|---|---|---|---|
| 42 | 103 | Pike County, Pennsylvania | 61199 | 2849 |
| 42 | 109 | Snyder County, Pennsylvania | 51110 | 1675 |
| 42 | 115 | Susquehanna County, Pennsylvania | 50160 | 1362 |
| 42 | 039 | Crawford County, Pennsylvania | 45637 | 1267 |
| 42 | 049 | Erie County, Pennsylvania | 47094 | 710 |
| 42 | 057 | Fulton County, Pennsylvania | 49420 | 2138 |
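Let's continue with a basic example of wrangling the data using dplyr. Say we want to rename the variables used in the code chunk above: we rename B19013_001E to MHI_sss (the estimate) and B19013_001M to MHI_soe (the margin of error). ACS margins of error are published at the 90 percent confidence level, so dividing the margin of error by 1.645 gives the standard error, and the coefficient of variation is the standard error as a percentage of the estimate. See below:

library(dplyr)
pa_income %>%
  rename(MHI_sss = B19013_001E,
         MHI_soe = B19013_001M) %>%
  mutate(se = MHI_soe / 1.645,        # 90% margin of error to standard error
         cv = (se / MHI_sss) * 100)   # coefficient of variation, in percent

Now let's download the county shapefiles for the state. To do so, we first make sure the shapefiles are downloaded as sf objects.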
library(tigris)
options(tigris_use_cache = TRUE)
options(tigris_class = "sf")
# We can also use the state FIPS code in place of the abbreviation; Pennsylvania's is 42.
pa <- counties("PA", cb = TRUE)
View(pa)

Now let's go further by mapping Pennsylvania, and then join the census data for Pennsylvania with the map.
library(ggplot2)
ggplot(pa) +
  geom_sf() +
  theme_void() +
  theme(panel.grid.major = element_line(colour = "transparent")) +
  labs(title = "Pennsylvania counties")
Let's see how we can join our census data for Pennsylvania with the Pennsylvania map, as shown below:

head(pa_income)
pa[[2]]
## [1] "049" "051" "055" "027" "043" "013" "067" "099" "103" "101" "059" "125"
## [13] "061" "005" "031" "075" "017" "033" "097" "041" "113" "045" "127" "029"
## [25] "003" "089" "091" "133" "035" "071" "063" "111" "095" "123" "023" "079"
## [37] "077" "119" "105" "085" "115" "121" "107" "011" "025" "087" "053" "037"
## [49] "019" "109" "065" "015" "039" "083" "001" "069" "047" "117" "073" "009"
## [61] "021" "057" "007" "131" "093" "081" "129"
pa_for_now <- left_join(pa, pa_income, by = c("COUNTYFP" = "county"))
names(pa_for_now)
## [1] "STATEFP" "COUNTYFP" "COUNTYNS" "AFFGEOID" "GEOID"
## [6] "NAME.x" "NAMELSAD" "STUSPS" "STATE_NAME" "LSAD"
## [11] "ALAND" "AWATER" "state" "NAME.y" "B19013_001E"
## [16] "B19013_001M" "geometry"
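A join like this silently leaves NA values when keys fail to match, for example when county codes lose their leading zeros. As a small sketch with made-up toy tables (not Census data), an anti_join can confirm that every row on the map side found a partner:

```r
library(dplyr)

# Toy tables with made-up values; keys are character codes with leading zeros
shapes <- data.frame(COUNTYFP = c("001", "003", "005"),
                     stringsAsFactors = FALSE)
income <- data.frame(county = c("001", "003", "005"),
                     median_income = c(50000, 60000, 55000),
                     stringsAsFactors = FALSE)

# Same join pattern as above: differently named key columns on each side
joined    <- left_join(shapes, income, by = c("COUNTYFP" = "county"))
# Rows of `shapes` with no matching key in `income`
unmatched <- anti_join(shapes, income, by = c("COUNTYFP" = "county"))

nrow(unmatched)   # zero rows means every county code found a match
```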
head(pa_for_now, 1)

Using the steps above, we can now plot Pennsylvania's median income using ggplot, as shown below:
ggplot(pa_for_now) +
  geom_sf(aes(fill = B19013_001E), color = "white") +
  theme_void() +
  theme(panel.grid.major = element_line(colour = "transparent")) +
  scale_fill_distiller(palette = "Oranges",   # palette names are case-sensitive
                       direction = 1, name = "Median Income") +
  labs(title = "2016 Pennsylvania counties median income",
       caption = "Source: US Census/ACS 5-year")
Over the years, a lot has improved in mapping census data. Let us see how we can download census data and shapefiles together. The most recent package used to get census data and shapefiles is called tidycensus. With tidycensus we can download the shapefiles along with the data we need. With this approach, you still need to load your census key.
Let’s say we want to search for the median income as a variable as we did in the previous steps:
library(tidycensus)
library(dplyr)
# Use load_variables() to view and search for variables (year = 2016, survey = acs5)
var_search <- load_variables(2016, "acs5", cache = TRUE)
head(var_search, 3)
var_search$label <- toupper(var_search$label)
income <- var_search %>%
  mutate(contains.median_income = grepl("MEDIAN INCOME", label)) %>%
  filter(contains.median_income)
head(income, 3)

Now, we are interested in the unemployment rate for Delaware County in Pennsylvania. The variable for the unemployed population is B23025_005E, and the variable for the labor force is B23025_002E. Note that we need the labor force variable because it is the denominator when we calculate the unemployment rate.

library(tidycensus)
work_foc <- c(labor_force = "B23025_002",   # in labor force
              unemployed  = "B23025_005")   # unemployed
# get_acs() is part of the tidycensus package
Pennsylvanian <- get_acs(geography = "tract",
                         year = 2017,
                         survey = "acs5",
                         variables = work_foc,   # the two variables defined above
                         county = "Delaware",
                         state = "42",           # 42 is the FIPS code for Pennsylvania
                         geometry = TRUE)
head(Pennsylvanian)

Now let's transform the data using dplyr functions:
library(dplyr)
library(tidyr)
penns <- Pennsylvanian %>%
  mutate(variable = case_when(
    variable == "unemployed" ~ "populationUnemployed",
    variable == "labor_force" ~ "laborForce")) %>%
  select(-moe) %>%
  spread(variable, estimate) %>%   # spread() moves rows into columns
  mutate(percent_unemployed = round(populationUnemployed / laborForce, 2))
head(penns)

The next step is to arrange the tracts by their share of unemployed in descending order, as shown below:
arr_desc <- arrange(penns, desc(percent_unemployed))
head(arr_desc)

Let us close the analysis by showing a map of the unemployment rate in Delaware County, Pennsylvania, as shown below:
library(ggplot2)
library(viridis)   # provides scale_fill_viridis()
ggplot(arr_desc) +
  geom_sf(aes(fill = percent_unemployed), color = NA) +   # NA (unquoted) draws no tract borders
  labs(title = "Unemployment rate in Delaware County",
       subtitle = "",
       caption = "Source: ACS 5-year, 2017",
       fill = "Unemployment ratio") +
  scale_fill_viridis(direction = -1)
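tidyr now recommends pivot_wider() over spread() for new code. As a hedged sketch on a made-up toy table (the tract IDs and counts are invented for illustration), the reshaping and rate calculation used earlier could be written as:

```r
library(dplyr)
library(tidyr)

# Toy long-format table shaped like get_acs() output; all values are invented
toy <- tibble(
  GEOID    = rep(c("tract_a", "tract_b"), each = 2),
  variable = rep(c("laborForce", "populationUnemployed"), times = 2),
  estimate = c(2000, 100, 1500, 150)
)

rates <- toy %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%   # one row per tract
  mutate(percent_unemployed = round(populationUnemployed / laborForce, 2)) %>%
  arrange(desc(percent_unemployed))                                # highest share first

rates
```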
Learn more about mapping census data in R with the following:
Resource tidycensus
Resource The 5 verbs of dplyr
Resource Data Practicum Community Analytics by Prof. Anthony Howell