For this lab, we’ll work with the tidycensus package to
work with data from the U.S. Census Bureau. To do this, we’ll
investigate two phenomenon for the state of
Michigan:
Part A will investigate the data in a aspatial way, while Part B will focus on a spatial analysis.
library(tidycensus)
library(kableExtra)
library(dplyr)
library(stringr)
library(ggplot2)
library(forcats)
library(plotly)
library(sf)
library(mapview)
library(leafpop)
For Part A, we’ll investigate the graduate degree rates for the state of Michigan. While this information is at different geographic resolutions, we’ll focus on county-level data.
In order to get the data we need, we’ll use the
tidycensus::get_acs() function.
mi_county_graduate_degrees <- get_acs(
geography = "county",
variables = "DP02_0066P",
state = "Michigan"
)
This results in a dataset with 83 rows, which corresponds to the number of counties in our dataset. Let’s take a peek at the dataframe as it is from the tidycensus package.
mi_county_graduate_degrees %>%
head(5) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 26001 | Alcona County, Michigan | DP02_0066P | 8.1 | 1.2 |
| 26003 | Alger County, Michigan | DP02_0066P | 6.4 | 1.1 |
| 26005 | Allegan County, Michigan | DP02_0066P | 8.4 | 0.7 |
| 26007 | Alpena County, Michigan | DP02_0066P | 6.5 | 1.1 |
| 26009 | Antrim County, Michigan | DP02_0066P | 13.2 | 1.2 |
The data is already tidy enough for visualization, which is great!
Since we’re only focused on 1 state, we can trim the county names to
only be the actual county name & remove the “, Michigan” part. To do
this, we’ll make use of the stringr package.
mi_county_graduate_degrees <- mi_county_graduate_degrees %>%
mutate(NAME = str_remove(NAME, ", Michigan"))
mi_county_graduate_degrees %>%
head(5) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 26001 | Alcona County | DP02_0066P | 8.1 | 1.2 |
| 26003 | Alger County | DP02_0066P | 6.4 | 1.1 |
| 26005 | Allegan County | DP02_0066P | 8.4 | 0.7 |
| 26007 | Alpena County | DP02_0066P | 6.5 | 1.1 |
| 26009 | Antrim County | DP02_0066P | 13.2 | 1.2 |
First, let’s look at our extreme values: which county has the highest and lowest percentage of graduate degree holders?
max_pct <- max(mi_county_graduate_degrees$estimate)
min_pct <- min(mi_county_graduate_degrees$estimate)
max_county <- mi_county_graduate_degrees$NAME[mi_county_graduate_degrees$estimate == max_pct]
min_county <- mi_county_graduate_degrees$NAME[mi_county_graduate_degrees$estimate == min_pct]
data.frame(
Category = c("Highest", "Lowest"),
County = c(max_county, min_county),
Percent = c(scales::percent(max_pct/100), scales::percent(min_pct/100))
) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Category | County | Percent |
|---|---|---|
| Highest | Washtenaw County | 31% |
| Lowest | Lake County | 3% |
And we can create an error bar chart to visualize all of the counties & their margins-of-error (MoE).
mi_county_graduate_degrees %>%
mutate(NAME = fct_reorder(NAME, estimate)) %>%
ggplot(aes(x=NAME, y=estimate)) +
geom_pointrange(aes(ymin=estimate-moe, ymax=estimate+moe)) +
scale_y_continuous(labels = scales::percent_format(scale=1)) +
coord_flip() +
theme_minimal() +
labs(
title = "Graduate degree percentages in Michigan",
y = "Percent of Residents with Gradute Degree",
x = "County"
)
We can clearly see in the plot that Washtenaw County stands out as an outlier for the rate of graduate degree holders. This makes sense though - Washtenaw County contains the city of Ann Arbor, home of the University of Michigan; plenty of Wasthenaw County residents will be faculty with higher education degrees (including some Ph.D. students!).
We can use the plotly package to make this type of chart interactive.
mi_county_graduate_degrees %>%
mutate(NAME = fct_reorder(NAME, estimate)) %>%
plot_ly(
type = "scatter",
y = ~NAME,
x = ~estimate/100,
mode = "markers",
error_x = ~list(array = moe/100,
color = '#000000'),
height = 800
) %>%
layout(
title = "Graduate degree percentages in Michigan",
xaxis = list(
title = "Percent of Residents with Gradute Degree",
tickformat = ".0%"
),
yaxis = list(title = "County")
)
For a spatial analysis of U.S. Census Data, we’ll pivot to a different variable: median age.
Assuming that we don’t know which variable contains this information
in the ACS data, we’ll use the tidycensus::load_variables
function to find it. The table below shows a preview of what’s returned
by the function for the 2023 American Community Survey 5-Year
Estimates.
load_variables(2023, "acs5") %>%
head(5) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| name | label | concept | geography |
|---|---|---|---|
| B01001A_001 | Estimate!!Total: | Sex by Age (White Alone) | tract |
| B01001A_002 | Estimate!!Total:!!Male: | Sex by Age (White Alone) | tract |
| B01001A_003 | Estimate!!Total:!!Male:!!Under 5 years | Sex by Age (White Alone) | tract |
| B01001A_004 | Estimate!!Total:!!Male:!!5 to 9 years | Sex by Age (White Alone) | tract |
| B01001A_005 | Estimate!!Total:!!Male:!!10 to 14 years | Sex by Age (White Alone) | tract |
Thumbing through table, we find that the overall median age variable
is "B01002A_001". So then we go back to our
get_acs() function to get this table; except this time,
we’ll also use the function to fetch the spatial information for the
county. The table below shows a preview of what the data looks like
(very similar to the prior table, but with the additional spatial
column).
mi_median_age <- get_acs(
geography = "county",
variables = "B01002A_001",
state = "Michigan",
geometry = TRUE
) %>%
mutate(NAME = str_remove(NAME, ", Michigan"))
mi_median_age %>%
head(5) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| GEOID | NAME | variable | estimate | moe | geometry |
|---|---|---|---|---|---|
| 26041 | Delta County | B01002A_001 | 49.7 | 0.5 | MULTIPOLYGON (((-86.65922 4… |
| 26055 | Grand Traverse County | B01002A_001 | 44.5 | 0.3 | MULTIPOLYGON (((-85.58321 4… |
| 26007 | Alpena County | B01002A_001 | 48.7 | 0.6 | MULTIPOLYGON (((-83.21207 4… |
| 26103 | Marquette County | B01002A_001 | 40.0 | 0.4 | MULTIPOLYGON (((-87.40383 4… |
| 26147 | St. Clair County | B01002A_001 | 45.6 | 0.2 | MULTIPOLYGON (((-82.99626 4… |
Another data construction step we need for creating a proportional
symbol map: we need to calculate the centroid of each county -
this is very easily done via the sf::st_centroid
function.
mi_county_centroids <- st_centroid(mi_median_age, of_largest_polygon = TRUE)
Now that we have the dataset we need, we can calculate both an static & interactive spatial view of the data. Let’s start with the static map - this will be a proportional symbol map.
ggplot() +
geom_sf(data = mi_median_age, fill = "lightgrey") +
geom_sf(data = mi_county_centroids, aes(size = estimate, color = estimate)) +
scale_color_continuous(limits=c(30, 60), breaks=seq(30, 60, by=10)) +
scale_size_continuous(limits=c(30, 60), breaks=seq(30, 60, by=10)) +
guides(color= guide_legend(), size=guide_legend()) +
theme_void() +
labs(
title = "Median Age for Counties in Michigan",
color = "Median Age",
size = "Median Age"
)
From this static map, we can see a relatively clear pattern: the southern side of Michigan is seemingly younger than the northern part of the state. This also makes sense - northern Michigan is known to be the area retirees in Michigan live to enjoy a more natural, serene life.
To make the map more interactive, we can use the same dataset but instead use the mapview R package.
mapview(mi_median_age, alpha.regions = 0.2, legend = FALSE, popup = FALSE, label = FALSE) +
mapview(
mi_county_centroids,
layer.name = "Median Age",
zcol = "estimate",
popup = popupTable(
mi_county_centroids,
zcol=c("NAME", "estimate", "moe")
)
)