Introduction
The purpose of this lab is to work with US Census Data in the RStudio
environment. Using the tidycensus package to easily extract up to data
American Community Survey data directly from the US Census Bureau data
portal. This lab works with ggplotly and ggirafe packages to create
interactive visualizations for the census data. The sf and leaflet
packages were utilized to create interactive maps with census data.
# install.packages(c("tidycensus", "tidyverse"))
# install.packages(c("mapview", "plotly", "ggiraph", "survey", "srvyr"))
library(tidycensus)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggiraph)
library(sf)
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(leaflet)
library(mapview)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Part A - Non-spatial analysis and data visualization with
tidycensus
Part 1.2 Data navigation and retrieving percentages of graduate
degree holders
# Part 1.2. Finding which county have the smallest percentage of graduate degree holders?
lowest_grad_pops <- head(arrange(grad_pop, estimate))
lowest_grad_pops
## # A tibble: 6 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 51051 Dickenson County, Virginia DP02_0066P 1.9 0.8
## 2 51105 Lee County, Virginia DP02_0066P 3.6 1
## 3 51580 Covington city, Virginia DP02_0066P 3.6 1.6
## 4 51111 Lunenburg County, Virginia DP02_0066P 3.7 1.2
## 5 51141 Patrick County, Virginia DP02_0066P 3.7 1.2
## 6 51081 Greensville County, Virginia DP02_0066P 3.8 1.3
Part 1.3 Visualization of data with a margin of error plot
# Part 1.3. Making a margin of error plot for Virginia. This method does not work well for Virginia due to the large amount of counties. This work provides a good example of how to project target data with a margin of error and a high amount of control over labelling.
moe_grad_plot <- ggplot(grad_pop, aes(x = estimate, y = reorder(NAME, estimate))) +
geom_col(color = "darkblue") +
geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), width = .2, color = "red") +
scale_x_continuous(labels = scales::number) +
# Can use the "str_remove" function in scale_y_discrete to remove unwanted string values from the names of fields
scale_y_discrete(labels = function(x) str_remove(x, " County Virginia|, Virginia")) +
labs(
title = "Graduate Degree Holders by County in Virginia",
x = "Percentages of Graduate Degree Holders",
y = "County",
caption = "Data Source: U.S Census Bureau ACS 2019-2023"
)
moe_grad_plot

Part 2. Making interactive charts with plotly
# Part 2. Using the plotly package to convert the data created and plotted in Part 1.3 into an interactive chart with margin of error included.
grad_plot <- ggplot(data = grad_pop, aes(x = estimate,
y = reorder(NAME, estimate))) +
geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
width = 0.5, linewidth = 0.5) +
geom_point(color = "darkred", size = 2) +
scale_x_continuous(labels = label_number()) + # the label_number function is from the scales() package
scale_y_discrete(labels = function(x) str_remove(x, " County, Virginia|, Virginia")) +
labs(title = "Median graduate degree holder, 2019-2023 ACS",
subtitle = "Counties in Virginia",
caption = "Data acquired with R and tidycensus",
x = "ACS estimate",
y = "")
# ggplotly takes the created plot and adds visualization components automatically
# A user can click around the graph to zoom in on specific data. It makes it easier to interact with the underlying information.
ggplotly(grad_plot)
Part 3 Making interactive charts with ggiraph
# Part 3. This optional activity includes using the ggiraph package to make the plot created in step 1 & 2 interactive with a different interactive package.
# ggiraph is an alternative approach for making ggplot2 graphics interactive, it includes *_interactive() versions of ggplot2 geoms.
grad_plot_girafe <- ggplot(grad_pop, aes(x = estimate,
y = reorder(NAME, estimate))) +
geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
width = 0.5, linewidth = 0.5) +
geom_point_interactive(color = "darkred", size = 2) +
scale_x_continuous(labels = label_number()) +
scale_y_discrete(labels = function(x) str_remove(x, " County, Virginia|, Virginia")) +
labs(title = "Median graduate degree holder, 2019-2023 ACS",
subtitle = "Counties in Virginia",
caption = "Data acquired with R and tidycensus",
x = "ACS estimate",
y = "") +
theme_minimal(base_size= 12)
girafe(ggobj = grad_plot_girafe) %>%
girafe_options(opts_hover(css = "fill:cyan;"))
Part B - Spatial analysis and visualization with tidycensus
Story
This part of the project used get_acs() to fetch spatial ACS data of
Virginia census tracts. This focused on the five year estimates from ACS
2019-2023. This work flow was informed by a legal requirement my work at
the Virginia Department of Energy is tasked to complete every 3 years.
The Virginia Clean Economy Act requires my agency to perform an impact
analysis of the legislation on “Historically Economically Disadvantaged
Communities” (HEDCs). HEDCs are defined as a census tract where the
population is a community of color or a low income community. The
current workflow uses a python scripted ArcGIS Pro tool to update the
data. But this is game changing for pulling the data in from the web and
visualizing in real time.
Part 1. Using the load_variables function to understand the ACS
variable list
# Part 1. Using the load_variables() function to find the variables for the latest ACS data.
vars <- load_variables(2023, "acs5")
### From this list I was able to search for "total population" and "total white population" using the "var" object to identify B02001_001 and B02001_002 for race data. These datasets were used this to calculate the number of communities that qualified as communities of color in Virginia
Part 2. Calculating the population percentage of Virginia census
tracts for individuals who identify as people of color
# Step 2 looks to calculate the population of individuals who identify as people of color in Virginia. This will inform the reoccurring impact assessment my agency performs for the Virginia Clean Economy Act.
# To calculate the census tracts that are communities of color I need to obtain the total population and subtract by the total white population
pop <- get_acs(
geography = "tract",
variables = c(total_pop = "B02001_001", white_pop = "B02001_002"),
state = "VA",
geometry = TRUE
)
## Getting data from the 2019-2023 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## | | | 0% | |= | 1% | |= | 2% | |== | 3% | |== | 4% | |=== | 4% | |==== | 5% | |==== | 6% | |===== | 7% | |====== | 8% | |======= | 10% | |======== | 11% | |======== | 12% | |========= | 13% | |========== | 15% | |=========== | 16% | |============= | 18% | |============= | 19% | |================= | 24% | |=================== | 28% | |==================== | 29% | |===================== | 30% | |====================== | 32% | |======================== | 34% | |=========================== | 39% | |============================= | 41% | |============================== | 43% | |=============================== | 44% | |================================= | 47% | |================================== | 48% | |=================================== | 50% | |==================================== | 51% | |===================================== | 53% | |====================================== | 55% | |======================================= | 56% | |======================================== | 58% | |========================================= | 59% | |========================================== | 60% | |============================================= | 65% | |============================================== | 66% | |================================================ | 69% | |================================================== | 71% | |====================================================== | 77% | |======================================================== | 79% | |========================================================= | 81% | |========================================================= | 82% | |=========================================================== | 84% | |============================================================= | 87% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================= | 92% | |================================================================= | 93% | |=================================================================== | 96% | |==================================================================== | 97% | |======================================================================| 100%
# transforming the data from long to wide in order to make the "total_pop" and "white_pop" columns
pop_wide <- pop %>%
select(GEOID, NAME, variable, estimate) %>%
pivot_wider(names_from = variable, values_from = estimate)
# mutating the total_pop and "white_pop" columns together to create a "Community of color population" or "coc_pop" column.
pop_wide <- pop_wide %>%
group_by(GEOID) %>%
mutate(coc_pop = total_pop - white_pop) %>%
mutate(coc_perc = (coc_pop / total_pop) * 100)
Part 3. Displaying the data in mapview
# Part 3. Using mapview() to quickly display the data interactively.
mapview(pop_wide)
Part 4. Making a choropleth map of communities of color by
population percentage
# Part 4. Using ggplot to create choropleth map of community of color population percentage. A choropleth maps was chosen due to the use of percentages. It was chosen over a graduated symbols maps as raw counts were not displayed.
ggplot(data = pop_wide, aes(fill = coc_perc)) +
geom_sf() +
theme_void() +
scale_fill_viridis_c(option = "rocket") +
labs(title = "Census Tract Population Percent of People who Identify as People of Color",
fill = "ACS estimate",
caption = "2019-2023 ACS | tidycensus R package")
### Results The resulting choropleth map displays the distribution of
communities of color. Which are defined in the Virginia Clean Economy
Act as census tracts where more than 50% of the population are
individuals who identify as people of color. White population was
subtracted from total population ACS data to obtain an estimate for
people of color per census tract. The population percentage was
calculated and displayed using the “rocket” color scheme to show the
census tracts with higher communities of color. The resulting map
sharply displays a higher collection of communities of color of the
eastern areas of Virginia and collected in the urban centers of the
state. With notable rural communities of color being seen in Brunswick,
Greensville, and Nottoway County.