Lab 5 - Census Data

Lab Lessons

This week’s project introduced working with spatial and tabular U.S. Census Bureau data. To access the Census data the class used Dr. Walker’s tidycensus R library. This library interacts with the Census Bureau’s API to query and export data from the Census’s holdings. The main function for this is get_acs(), which queries for American Community Survey (ACS) data that fits the function’s parameters. In addition to querying for data, the lab involved creating charts and maps with the freshly queried data. Doing so introduced us to the mapview and ggiraph libraries.

Project Outline

For this week’s project I used ACS data to study a few attributes of different Census geographies in Virginia. I first explored the percentage of people with graduate degrees in Virginia counties, using 2022 data. I then explored median income and population data at the Census Tract level for Arlington, VA, using 2021 data.

Part A: Comparison of Residents with Graduate Degrees in Virginia

#Loading libraries ----
library(tidycensus)
library(tidyverse)
library(scales)
library(plotly)
library(ggiraph)
library(mapview)
library(tigris)
library(sf)
library(viridisLite)
library(classInt)
library(gridExtra)

#Querying for graduate degree holders for Virginia.
va_grad_degrees <- get_acs(
  geography = 'county',
  variables = "DP02_0066P",
  state = "VA",
  year = 2022
)

Now that we have the graduate degree data for Virginia, let’s see which counties have the highest and lowest percentages of people with graduate degrees.

#Counties with the highest percentage of graduate degrees ----
va_top_counties <- va_grad_degrees %>%
  top_n(5, estimate)

va_top_counties
## # A tibble: 5 × 5
##   GEOID NAME                           variable   estimate   moe
##   <chr> <chr>                          <chr>         <dbl> <dbl>
## 1 51013 Arlington County, Virginia     DP02_0066P     41.1   1.1
## 2 51059 Fairfax County, Virginia       DP02_0066P     32.3   0.4
## 3 51510 Alexandria city, Virginia      DP02_0066P     34.3   1.1
## 4 51540 Charlottesville city, Virginia DP02_0066P     30.9   2.2
## 5 51610 Falls Church city, Virginia    DP02_0066P     48.3   3.7

And now the counties that have the lowest percentages.

#Counties with the lowest percentage of graduate degrees ----
va_bottom_counties <- va_grad_degrees %>%
  top_n(-5, estimate)

va_bottom_counties
## # A tibble: 5 × 5
##   GEOID NAME                       variable   estimate   moe
##   <chr> <chr>                      <chr>         <dbl> <dbl>
## 1 51051 Dickenson County, Virginia DP02_0066P      2.1   0.8
## 2 51077 Grayson County, Virginia   DP02_0066P      3.8   1  
## 3 51105 Lee County, Virginia       DP02_0066P      3.1   1  
## 4 51111 Lunenburg County, Virginia DP02_0066P      3.7   1.2
## 5 51670 Hopewell city, Virginia    DP02_0066P      3.8   1.1
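
As an aside, top_n() still works, but the dplyr documentation marks it as superseded by slice_max() and slice_min(). Here is a minimal sketch of the same kind of selection. It uses a small hand-built table (toy_degrees, a hypothetical name) holding a few of the estimates printed above, so it runs without a live query:

```r
library(dplyr)

# Toy stand-in for va_grad_degrees, using estimates copied from the output above
toy_degrees <- tibble(
  NAME = c("Arlington County", "Fairfax County", "Falls Church city",
           "Dickenson County", "Lee County", "Lunenburg County"),
  estimate = c(41.1, 32.3, 48.3, 2.1, 3.1, 3.7)
)

# Two highest and two lowest estimates
toy_top <- toy_degrees %>% slice_max(estimate, n = 2)
toy_bottom <- toy_degrees %>% slice_min(estimate, n = 2)

toy_top
toy_bottom
```

Unlike top_n(), these functions return rows sorted by the ranking column, so the printed order would differ from the tables above.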

Now it’s time to compare the top counties against each other. This next block of code creates an error bar plot that displays the 20 Virginia counties with the highest estimates.

#Plotting the top 20 counties in VA with graduate degrees

va_plot_errorbar <- va_grad_degrees %>%
  top_n(20, estimate) %>% 
  ggplot(aes(x = estimate,y = reorder(NAME, estimate))) +
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
                width = 0.5, linewidth = 0.5) +
  geom_point(color = 'darkred', size = 2) +
  scale_x_continuous(labels = function(x) {scales::percent(x/100)}) +  #fixes percentage
  scale_y_discrete(labels = function(x) str_remove(x, " County, Virginia|, Virginia")) + 
  labs(title = "Percentage of Residents with Graduate Degrees, 2022",
       subtitle = "Counties in Virginia",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") +
  theme_minimal(base_size = 12)

va_plot_errorbar
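
When error bars overlap, two estimates may not be meaningfully different. One common way to check is a z-test on the difference. ACS margins of error are published at the 90 percent confidence level, so the standard error is MOE / 1.645. The sketch below uses moe_z_score, a hypothetical helper written for this illustration, with values from the top-five table above:

```r
# Hypothetical helper: z-score for the difference between two ACS estimates.
# Assumes both MOEs are at the 90% confidence level (z = 1.645).
moe_z_score <- function(est1, moe1, est2, moe2) {
  se1 <- moe1 / 1.645
  se2 <- moe2 / 1.645
  abs(est1 - est2) / sqrt(se1^2 + se2^2)
}

# Falls Church city (48.3 +/- 3.7) vs. Arlington County (41.1 +/- 1.1)
z_falls_church_arlington <- moe_z_score(48.3, 3.7, 41.1, 1.1)

# Alexandria city (34.3 +/- 1.1) vs. Fairfax County (32.3 +/- 0.4)
z_alexandria_fairfax <- moe_z_score(34.3, 1.1, 32.3, 0.4)

# A z-score above 1.645 indicates a difference at the 90% level
c(z_falls_church_arlington, z_alexandria_fairfax) > 1.645
```

Both comparisons come out above 1.645, so under these assumptions the differences are distinguishable despite the width of the Falls Church error bar.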

While this plot is interesting, it’s not an interactive plot. Let’s change that using the plotly library.

Plotly lets viewers hover their cursor over a chart item, and a pop-up appears showing the precise value.

# Part A: Question 2 - Making the same chart interactive using Plotly----
va_plotly <- ggplotly(va_plot_errorbar, tooltip = "x")
va_plotly
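
With tooltip = "x" the pop-up only shows the estimate. A common pattern for a richer pop-up is to build a string in a text aesthetic and pass tooltip = "text" to ggplotly(). The sketch below uses a toy table (toy_counties, a hypothetical name) with values copied from the top-five table, not the full query result:

```r
library(ggplot2)
library(plotly)

# Toy stand-in for the top counties, values copied from the table above
toy_counties <- data.frame(
  NAME = c("Falls Church city", "Arlington County", "Alexandria city"),
  estimate = c(48.3, 41.1, 34.3),
  moe = c(3.7, 1.1, 1.1)
)

toy_plot <- ggplot(toy_counties,
                   aes(x = estimate, y = reorder(NAME, estimate),
                       # 'text' is not used by ggplot2 itself; ggplotly reads it for tooltips
                       text = paste0(NAME, ": ", estimate, "% (+/- ", moe, ")"))) +
  geom_point(color = "darkred", size = 2)

toy_plotly <- ggplotly(toy_plot, tooltip = "text")
toy_plotly
```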

The final chart using the graduate degree data uses the ggiraph library. Like plotly, it makes a ggplot2 chart interactive, but tooltips and hover styling are set directly in the plot code. Try both for yourself to see the differences.

# Part A: Question 3 - Making the same chart interactive using ggiraph ----
va_plot_ggiraph <- va_grad_degrees %>%
  top_n(20, estimate) %>%
  ggplot(aes(x = estimate, 
             y = reorder(NAME, estimate),
             tooltip = estimate,
             data_id = GEOID)) +
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), 
                width = 0.5, linewidth = 0.5) + 
  geom_point_interactive(color = "darkred", size = 2) +
  scale_x_continuous(labels = label_percent(scale = 1)) +  #estimates are already in percent
  scale_y_discrete(labels = function(x) str_remove(x, " County, Virginia|, Virginia")) + 
  labs(title = "Percentage of Residents with Graduate Degrees",
       subtitle = "Counties in Virginia",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

girafe(ggobj = va_plot_ggiraph) %>%
  girafe_options(opts_hover(css = "fill:cyan;"))

Part B: Median Income and Population in Arlington Census Tracts

This part of the lab explores median income and population estimates from 2021 ACS data for Census Tracts in Arlington, VA.

This first block of code queries for the data, filters for just Arlington, VA, and uses mutate to create a new column that is a rough estimate of each tract’s GDP. This approach uses “wide” data, so there are multiple attribute columns for each geometry.

#VA Population and Income data for 2021
va_data <- get_acs(
  geography = "tract",
  variables = c(
    Income = "B19013_001",
    Population = "B01003_001"),
  state = "VA",
  year = 2021,
  output = "wide",
  geometry = TRUE
)

#Estimate and margin of error columns produced by the wide output
columns <- c("IncomeE", "IncomeM", "PopulationE", "PopulationM")

#Filtering and Conditioning data for Arlington, VA
#Filtering for Arlington
#Replacing missing values with 0
#Creating a new column that is Tract GDP
arlington_va <- va_data %>%
  filter(grepl("Arlington County", NAME)) %>%
  mutate(across(all_of(columns), ~replace_na(.x, 0))) %>%
  mutate(TractGDP = IncomeE * PopulationE)
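
The conditioning step above can be tried on a toy table to see what the NA replacement does to the result. The names toy_tracts and toy_gdp are hypothetical and the values are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for wide ACS output; the values are made up for illustration
toy_tracts <- tibble(
  NAME = c("Tract A", "Tract B", "Tract C"),
  IncomeE = c(100000, NA, 80000),
  PopulationE = c(3000, 2500, 4000)
)

toy_gdp <- toy_tracts %>%
  mutate(across(c(IncomeE, PopulationE), ~ replace_na(.x, 0))) %>%
  mutate(TractGDP = IncomeE * PopulationE)

toy_gdp$TractGDP
```

Tract B has no income estimate, so replacing NA with 0 gives it a GDP of 0. On a map it then falls in the lowest class rather than showing as missing. Leaving the NA in place would be an alternative if that distinction matters.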

Since we just conditioned the data, it is pretty easy to make three maps: Income, Population and GDP. For this example, however, I made one interactive map using mapview that shows the estimated GDP of Arlington Census Tracts. The formula used to find this rough estimate is Median Income * Population.

#Using Mapview to create a map of Arlington Tract GDP.
#Number of classes
n <- 5
# Calculate the Jenks natural breaks
breaks <- classIntervals(arlington_va$TractGDP, n, style = "jenks")$brks

colors <- inferno(n)

mapview(arlington_va, zcol='TractGDP',
        at = breaks,
        layer.name = "Arlington, VA<br/>2021 Tract GDP",
        col.regions = colors)
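
For n classes, classIntervals() returns n + 1 break points, which is why five classes pair with five colors in the mapview call above. The small sketch below runs on made-up values (toy_values is a hypothetical name), not the Arlington data:

```r
library(classInt)

# Made-up values with three visible clusters
toy_values <- c(1, 2, 3, 4, 20, 21, 22, 23, 90, 95, 100, 105)

# Three classes, so four break points
toy_breaks <- classIntervals(toy_values, n = 3, style = "jenks")$brks
toy_breaks

# cut() assigns each value to a class using those breaks
toy_classes <- cut(toy_values, breaks = toy_breaks, include.lowest = TRUE)
table(toy_classes)
```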

For the final map I created a comparison of each Census Tract’s population and median income. I did this using ggplot and gridExtra. This allows maps and charts to be displayed side by side, in a facet_wrap fashion, without needing data that fits facet_wrap’s schema, i.e. long (not wide) data.

However, I am still having issues with my PROJ.db not allowing me to use the sf library. I thought this issue was resolved, but it unfortunately came back. Below is my ggplot code. If there isn’t an output, feel free to copy the chunks of code into your console to run and see the maps.

knitr::include_graphics("C:\\PSU\\Geog588\\Week6\\ggplot_code.png")
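
Since the code above is only available as a screenshot, here is a rough, self-contained sketch of the side-by-side pattern. It is not the Arlington code. It assumes a working sf installation and uses the North Carolina counties file that ships with sf, with its BIR74 and SID74 columns standing in for two wide-format attributes such as IncomeE and PopulationE:

```r
library(sf)
library(ggplot2)
library(gridExtra)

# North Carolina counties ship with sf; BIR74 and SID74 are two wide-format columns
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

map_births <- ggplot(nc) +
  geom_sf(aes(fill = BIR74), color = NA) +
  scale_fill_viridis_c(option = "inferno") +
  labs(title = "BIR74") +
  theme_void()

map_sids <- ggplot(nc) +
  geom_sf(aes(fill = SID74), color = NA) +
  scale_fill_viridis_c(option = "inferno") +
  labs(title = "SID74") +
  theme_void()

# arrangeGrob() builds the two-column layout; grid.draw() renders it
side_by_side <- arrangeGrob(map_births, map_sids, ncol = 2)
grid::grid.draw(side_by_side)
```

Each map keeps its own fill scale and legend, which is the main advantage over facet_wrap when the two variables are in different units.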

Below is the error message I receive. There’s an issue with accessing the PROJ.db, a database file that holds the projection information needed for spatial data. I am not sure why, but since the rgdal changes my PROJ library points to a PROJ.db file within my Postgres application files instead of the one within the R libraries folder.

knitr::include_graphics("C:\\PSU\\Geog588\\Week6\\ErrorMessage.png")
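
One way to narrow this down is to check where PROJ is being told to look. The diagnostic sketch below makes one assumption that the screenshot alone cannot confirm: that a PROJ_LIB environment variable set by the Postgres/PostGIS install is what redirects the lookup. It should be run in a fresh R session:

```r
# Check the environment variable before sf is loaded; "" means it is not set
proj_lib_env <- Sys.getenv("PROJ_LIB")
proj_lib_env

library(sf)

# Versions of GEOS, GDAL, and PROJ that sf is linked against
sf_extSoftVersion()

# Directories PROJ searches for its database file
proj_paths <- sf_proj_search_paths()
proj_paths

# Assumption: if proj_lib_env points into a Postgres/PostGIS folder, clearing it
# in a fresh session before loading sf may let sf fall back to its own copy.
# Sys.unsetenv("PROJ_LIB")
```

If proj_lib_env is empty and proj_paths already points inside the R library folder, the cause is likely something else and this sketch does not apply.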