Overview

For this lab, we’ll work with the tidycensus package to work with data from the U.S. Census Bureau. To do this, we’ll investigate two phenomenon for the state of Michigan:

  • Part A: the percentage of the population (at a county level) with a graduate-level degree
  • Part B: TBD

Part A will investigate the data in a aspatial way, while Part B will focus on a spatial analysis.

library(tidycensus)
library(kableExtra)
library(dplyr)
library(stringr)
library(ggplot2)
library(forcats)
library(plotly)
library(sf)
library(mapview)
library(leafpop)

Part A: Graduate Degree Rates

For Part A, we’ll investigate the graduate degree rates for the state of Michigan. While this information is at different geographic resolutions, we’ll focus on county-level data.

Data Construction

In order to get the data we need, we’ll use the tidycensus::get_acs() function.

mi_county_graduate_degrees <- get_acs(
  geography = "county",
  variables = "DP02_0066P",
  state = "Michigan"
)

This results in a dataset with 83 rows, which corresponds to the number of counties in our dataset. Let’s take a peek at the dataframe as it is from the tidycensus package.

mi_county_graduate_degrees %>%
  head(5) %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
GEOID NAME variable estimate moe
26001 Alcona County, Michigan DP02_0066P 8.1 1.2
26003 Alger County, Michigan DP02_0066P 6.4 1.1
26005 Allegan County, Michigan DP02_0066P 8.4 0.7
26007 Alpena County, Michigan DP02_0066P 6.5 1.1
26009 Antrim County, Michigan DP02_0066P 13.2 1.2

The data is already tidy enough for visualization, which is great! Since we’re only focused on 1 state, we can trim the county names to only be the actual county name & remove the “, Michigan” part. To do this, we’ll make use of the stringr package.

mi_county_graduate_degrees <- mi_county_graduate_degrees %>%
  mutate(NAME = str_remove(NAME, ", Michigan"))

mi_county_graduate_degrees %>%
  head(5) %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
GEOID NAME variable estimate moe
26001 Alcona County DP02_0066P 8.1 1.2
26003 Alger County DP02_0066P 6.4 1.1
26005 Allegan County DP02_0066P 8.4 0.7
26007 Alpena County DP02_0066P 6.5 1.1
26009 Antrim County DP02_0066P 13.2 1.2

Results

First, let’s look at our extreme values: which county has the highest and lowest percentage of graduate degree holders?

max_pct <- max(mi_county_graduate_degrees$estimate)
min_pct <- min(mi_county_graduate_degrees$estimate)
max_county <- mi_county_graduate_degrees$NAME[mi_county_graduate_degrees$estimate == max_pct]
min_county <- mi_county_graduate_degrees$NAME[mi_county_graduate_degrees$estimate == min_pct]

data.frame(
  Category = c("Highest", "Lowest"),
  County = c(max_county, min_county),
  Percent = c(scales::percent(max_pct/100), scales::percent(min_pct/100))
) %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
Category County Percent
Highest Washtenaw County 31%
Lowest Lake County 3%

And we can create an error bar chart to visualize all of the counties & their margins-of-error (MoE).

mi_county_graduate_degrees %>%
  mutate(NAME = fct_reorder(NAME, estimate)) %>%
  ggplot(aes(x=NAME, y=estimate)) +
  geom_pointrange(aes(ymin=estimate-moe, ymax=estimate+moe)) +
  scale_y_continuous(labels = scales::percent_format(scale=1)) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Graduate degree percentages in Michigan",
    y = "Percent of Residents with Gradute Degree",
    x = "County"
  )

We can clearly see in the plot that Washtenaw County stands out as an outlier for the rate of graduate degree holders. This makes sense though - Washtenaw County contains the city of Ann Arbor, home of the University of Michigan; plenty of Wasthenaw County residents will be faculty with higher education degrees (including some Ph.D. students!).

We can use the plotly package to make this type of chart interactive.

mi_county_graduate_degrees %>%
  mutate(NAME = fct_reorder(NAME, estimate)) %>%
  plot_ly(
  type = "scatter",
  y = ~NAME,
  x = ~estimate/100,
  mode = "markers",
  error_x = ~list(array = moe/100,
                        color = '#000000'),
  height = 800
  ) %>%
  layout(
    title = "Graduate degree percentages in Michigan",
    xaxis = list(
      title = "Percent of Residents with Gradute Degree",
      tickformat = ".0%"
    ),
    yaxis = list(title = "County")
  )

Part B: Median Age

For a spatial analysis of U.S. Census Data, we’ll pivot to a different variable: median age.

Data Construction

Assuming that we don’t know which variable contains this information in the ACS data, we’ll use the tidycensus::load_variables function to find it. The table below shows a preview of what’s returned by the function for the 2023 American Community Survey 5-Year Estimates.

load_variables(2023, "acs5") %>%
  head(5) %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
name label concept geography
B01001A_001 Estimate!!Total: Sex by Age (White Alone) tract
B01001A_002 Estimate!!Total:!!Male: Sex by Age (White Alone) tract
B01001A_003 Estimate!!Total:!!Male:!!Under 5 years Sex by Age (White Alone) tract
B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years Sex by Age (White Alone) tract
B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years Sex by Age (White Alone) tract

Thumbing through table, we find that the overall median age variable is "B01002A_001". So then we go back to our get_acs() function to get this table; except this time, we’ll also use the function to fetch the spatial information for the county. The table below shows a preview of what the data looks like (very similar to the prior table, but with the additional spatial column).

mi_median_age <- get_acs(
  geography = "county",
  variables = "B01002A_001",
  state = "Michigan",
  geometry = TRUE
) %>%
  mutate(NAME = str_remove(NAME, ", Michigan"))
mi_median_age %>%
  head(5) %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
GEOID NAME variable estimate moe geometry
26041 Delta County B01002A_001 49.7 0.5 MULTIPOLYGON (((-86.65922 4…
26055 Grand Traverse County B01002A_001 44.5 0.3 MULTIPOLYGON (((-85.58321 4…
26007 Alpena County B01002A_001 48.7 0.6 MULTIPOLYGON (((-83.21207 4…
26103 Marquette County B01002A_001 40.0 0.4 MULTIPOLYGON (((-87.40383 4…
26147 St. Clair County B01002A_001 45.6 0.2 MULTIPOLYGON (((-82.99626 4…

Another data construction step we need for creating a proportional symbol map: we need to calculate the centroid of each county - this is very easily done via the sf::st_centroid function.

mi_county_centroids <- st_centroid(mi_median_age, of_largest_polygon = TRUE)

Results

Now that we have the dataset we need, we can calculate both an static & interactive spatial view of the data. Let’s start with the static map - this will be a proportional symbol map.

ggplot() +
  geom_sf(data = mi_median_age, fill = "lightgrey") +
  geom_sf(data = mi_county_centroids, aes(size = estimate, color = estimate)) +
  scale_color_continuous(limits=c(30, 60), breaks=seq(30, 60, by=10)) +
  scale_size_continuous(limits=c(30, 60), breaks=seq(30, 60, by=10)) +
  guides(color= guide_legend(), size=guide_legend()) +
  theme_void() +
  labs(
    title = "Median Age for Counties in Michigan",
    color = "Median Age",
    size = "Median Age"
  )

From this static map, we can see a relatively clear pattern: the southern side of Michigan is seemingly younger than the northern part of the state. This also makes sense - northern Michigan is known to be the area retirees in Michigan live to enjoy a more natural, serene life.

To make the map more interactive, we can use the same dataset but instead use the mapview R package.

mapview(mi_median_age, alpha.regions = 0.2, legend = FALSE, popup = FALSE, label = FALSE) +
  mapview(
    mi_county_centroids,
    layer.name = "Median Age",
    zcol = "estimate",
    popup = popupTable(
      mi_county_centroids,
      zcol=c("NAME", "estimate", "moe")
    )
  )