INTRODUCTION

This report uses the tidycensus package to obtain American Community Survey (ACS) data from the US Census Bureau in a “tidy” format and to manipulate it for analysis. Part A pulls county-level data for Washington state and looks at the percentage of graduate degree holders in each county. Part B pulls census tract data for Snohomish County, Washington, and looks at median retail trade earnings by tract. This report is part of Lab 5 of GEOG588.

Load the packages required.

library(tidycensus) # For getting tidy US Census data
library(tidyverse) # For data analysis and visualization
library(plotly) # For interactive graphs
library(ggiraph) # For interactive graphs as well
library(mapview) # For interactive map viewing
library(scales) # To control axis/legend labels

PART A - NON-SPATIAL ANALYSIS

DATA PREPARATION

Using the get_acs() function from tidycensus, we’ll pull county-level data for Washington state from the US Census Bureau. The variable of interest is “DP02_0066P”, the percentage of the population 25 years and over with a graduate or professional degree.

grad_deg_percent <- get_acs(
  geography = "county",     # We want to pull data for counties
  variables = "DP02_0066P",   # The variable is percentage of population with a graduate degree
  state = "WA",   # For the state of Washington
  year = 2021   # For the 2017-2021 5-year ACS. 
)
## Getting data from the 2017-2021 5-year ACS
## Using the ACS Data Profile

Next, we’ll view the data.

View(grad_deg_percent) # Opens the data viewer in RStudio (interactive sessions only).
glimpse(grad_deg_percent) # Prints a compact summary of the columns:
## Rows: 39
## Columns: 5
## $ GEOID    <chr> "53001", "53003", "53005", "53007", "53009", "53011", "53013"…
## $ NAME     <chr> "Adams County, Washington", "Asotin County, Washington", "Ben…
## $ variable <chr> "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02…
## $ estimate <dbl> 5.8, 8.5, 12.6, 10.3, 11.9, 11.4, 10.9, 5.7, 7.0, 6.4, 7.3, 1…
## $ moe      <dbl> 1.4, 1.3, 0.8, 1.3, 1.0, 0.4, 3.3, 0.6, 1.2, 2.4, 0.9, 4.0, 0…

ANALYSIS

In this section, we’ll answer some questions by conducting analysis on the data.

Question 1: Which counties have the largest percentages of graduate degree holders?

arrange(grad_deg_percent, desc(estimate)) # Sort in descending order, highest to lowest. arrange() has no na.rm argument; NA values are always placed last.
## # A tibble: 39 × 5
##    GEOID NAME                         variable   estimate   moe
##    <chr> <chr>                        <chr>         <dbl> <dbl>
##  1 53075 Whitman County, Washington   DP02_0066P     22.6   1.9
##  2 53033 King County, Washington      DP02_0066P     22.1   0.3
##  3 53055 San Juan County, Washington  DP02_0066P     22.1   1.1
##  4 53031 Jefferson County, Washington DP02_0066P     18.8   1.4
##  5 53029 Island County, Washington    DP02_0066P     14.1   1.1
##  6 53073 Whatcom County, Washington   DP02_0066P     14     0.7
##  7 53067 Thurston County, Washington  DP02_0066P     13.8   0.7
##  8 53039 Klickitat County, Washington DP02_0066P     12.7   2.2
##  9 53005 Benton County, Washington    DP02_0066P     12.6   0.8
## 10 53035 Kitsap County, Washington    DP02_0066P     12.2   0.5
## # ℹ 29 more rows

Whitman County has the largest percentage of graduate degree holders in the state at 22.6%, as shown in the “estimate” column. Interestingly, it is in far-eastern Washington, away from the Seattle metro area. Whitman County contains Pullman, home to Washington State University, which could explain the high rate.

King County, which contains Seattle, and San Juan County are tied for second at 22.1%. They are followed by Jefferson, Island, Whatcom, Thurston, Klickitat, Benton, and Kitsap counties, which round out the top ten.
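Because King and San Juan counties share the same estimate, it can help to see how a "top n" query treats ties. The sketch below is an optional alternative to arrange(), using dplyr's slice_max() on a small hand-built table whose values are copied from the output above; the object names are placeholders chosen for illustration.

```r
library(dplyr)

# Top four rows copied from the sorted output above
top_counties <- tibble(
  NAME     = c("Whitman County, Washington", "King County, Washington",
               "San Juan County, Washington", "Jefferson County, Washington"),
  estimate = c(22.6, 22.1, 22.1, 18.8),
  moe      = c(1.9, 0.3, 1.1, 1.4)
)

# with_ties = TRUE (the default) keeps every row tied for the last slot,
# so asking for two rows can return more than two
top_two_with_ties <- slice_max(top_counties, estimate, n = 2)

# with_ties = FALSE returns exactly n rows
top_two_strict <- slice_max(top_counties, estimate, n = 2, with_ties = FALSE)
```

With the default tie handling, both 22.1% counties are returned alongside Whitman County, which is usually the safer behavior when reporting a ranking.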

Question 2: Which counties have the smallest percentages of graduate degree holders?

arrange(grad_deg_percent, estimate) # Sort in ascending order, which is the default.
## # A tibble: 39 × 5
##    GEOID NAME                            variable   estimate   moe
##    <chr> <chr>                           <chr>         <dbl> <dbl>
##  1 53025 Grant County, Washington        DP02_0066P      5.1   0.7
##  2 53015 Cowlitz County, Washington      DP02_0066P      5.7   0.6
##  3 53001 Adams County, Washington        DP02_0066P      5.8   1.4
##  4 53027 Grays Harbor County, Washington DP02_0066P      6.1   0.8
##  5 53045 Mason County, Washington        DP02_0066P      6.1   0.9
##  6 53019 Ferry County, Washington        DP02_0066P      6.4   2.4
##  7 53077 Yakima County, Washington       DP02_0066P      6.6   0.6
##  8 53041 Lewis County, Washington        DP02_0066P      6.9   0.8
##  9 53017 Douglas County, Washington      DP02_0066P      7     1.2
## 10 53021 Franklin County, Washington     DP02_0066P      7.3   0.9
## # ℹ 29 more rows

Grant County has the lowest percentage of graduate degree holders at 5.1%. It is followed by Cowlitz, Adams, Grays Harbor, Mason, Ferry, Yakima, Lewis, Douglas, and Franklin counties, which round out the ten lowest.
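On missing values: arrange() sorts NA values to the end in both ascending and descending order, and dropping them takes an explicit filter(). The sketch below illustrates this on a tiny table. The NA row is invented purely to show the behavior (the county data above has no missing estimates in the rows shown), and the object names are placeholders.

```r
library(dplyr)

# Two real values from the output above plus one invented NA row
example_counties <- tibble(
  NAME     = c("Grant County, Washington",
               "Example County (hypothetical)",
               "Whitman County, Washington"),
  estimate = c(5.1, NA, 22.6)
)

# NA values land at the end regardless of sort direction
ascending  <- arrange(example_counties, estimate)
descending <- arrange(example_counties, desc(estimate))

# Dropping missing estimates has to be done explicitly
complete_only <- filter(example_counties, !is.na(estimate))
```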

Question 3: Create a Margin of Error Plot

grad_plot <- ggplot(grad_deg_percent, aes(x = estimate,   # We use ggplot to graph the estimate values
                               y = reorder(NAME, estimate))) +  # We'll order by the estimate values 
  geom_errorbar(aes(xmin=estimate - moe, xmax = estimate + moe),  # The errorbar is generated by creating a range that adds/subtracts the MOE to the estimate value
                width=0.5, linewidth = 0.5) + # Set some dimensions
  geom_point(color = "darkred", size = 2) + # Assign color to the estimate point
  scale_x_continuous(labels = label_percent(scale=1)) +  # Label with "%" on x axis.
  scale_y_discrete(labels = function(x) str_remove(x, " County, Washington|, Washington")) # Remove the "County, Washington" line.
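To make the error bar and label logic concrete, the sketch below computes the same bounds and shortened names as ordinary columns. It uses three rows copied from the glimpse() output earlier; the column names lower, upper, and short_name are placeholders introduced here, not part of the tidycensus output.

```r
library(dplyr)
library(stringr)

# First three counties from the glimpse() output above
county_sample <- tibble(
  NAME     = c("Adams County, Washington", "Asotin County, Washington",
               "Benton County, Washington"),
  estimate = c(5.8, 8.5, 12.6),
  moe      = c(1.4, 1.3, 0.8)
)

county_bounds <- county_sample %>%
  mutate(
    lower      = estimate - moe,  # same value passed to xmin in geom_errorbar()
    upper      = estimate + moe,  # same value passed to xmax in geom_errorbar()
    short_name = str_remove(NAME, " County, Washington|, Washington")  # y-axis label
  )
```

For Adams County this gives a range of 4.4% to 7.2% around the 5.8% estimate, labeled simply "Adams" on the axis.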

Now, let’s set up the labels:

grad_plot <- grad_plot + # This is a separate chunk for the labels
  labs(title = "Percent graduate degrees, 2021 ACS", # Add a title
       subtitle = "Counties in Washington state",  # Subtitle
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates", # Description
       x = "2017-2021 ACS estimate", # Data information
       y = "") + 
  theme_minimal(base_size = 12) # Set theme

And, here’s the plot:

grad_plot

Overall, this method works fairly well for this state, especially for identifying the counties with the highest and lowest estimates. Most counties fall between 5% and 15%, however, and in that range the margin of error varies widely from county to county. Garfield County (±4.0) and Columbia County (±3.3), for example, have the widest margins of error among the values printed in the glimpse() output earlier. Where error bars overlap, the ranking of counties with similar estimates is uncertain, which could affect analysis of average or median values or of a subset of the data.
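The margins of error also bear on the ranking from Question 1. A rough way to check whether two estimates really differ is a z-test on their difference, converting each margin of error to a standard error. The sketch below assumes the margins are at the ACS default 90 percent confidence level (critical value 1.645); differs_at_90() is a hypothetical helper written for this illustration, and tidycensus appears to provide a significance() function for the same purpose.

```r
# Returns TRUE when two ACS estimates differ at the 90 percent confidence level
differs_at_90 <- function(est1, moe1, est2, moe2) {
  z_crit <- 1.645            # critical value matching a 90 percent margin of error
  se1 <- moe1 / z_crit       # convert each margin of error to a standard error
  se2 <- moe2 / z_crit
  z <- abs(est1 - est2) / sqrt(se1^2 + se2^2)
  z > z_crit
}

# Values copied from the sorted output in Question 1
whitman_vs_king      <- differs_at_90(22.6, 1.9, 22.1, 0.3)
whitman_vs_jefferson <- differs_at_90(22.6, 1.9, 18.8, 1.4)
```

Under these assumptions, Whitman and King counties are not distinguishable, while Whitman and Jefferson counties are, so "highest in the state" is better read as "among the highest".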

Interactive chart

Next, we use the plotly package to create an interactive chart using the plot created above.

ggplotly(grad_plot, tooltip = "x")

In this chart, hovering over the data allows the user to see the estimate value for each county. Zooming in by clicking and dragging a square area also allows for a closer look at the data. It makes comparisons easier between counties with similar values.

Interactive chart with ggiraph

Next, we use the ggiraph package to create another interactive chart.

grad_plot_ggiraph <- ggplot(grad_deg_percent, aes(x = estimate, # We create the MOE chart again here.
                                       y = reorder(NAME, estimate),
                                       tooltip = estimate,
                                       data_id = GEOID)) + 
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), 
                width = 0.5, linewidth = 0.5) + 
  geom_point_interactive(color = "darkred", size = 2) +
  scale_x_continuous(labels = label_percent(scale=1)) +
  scale_y_discrete(labels = function(x) str_remove(x, " County, Washington|, Washington")) +
  labs(title = "Percent graduate degrees, 2021 ACS",
       subtitle = "By counties in Washington state",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

girafe(ggobj = grad_plot_ggiraph) %>%  # Then we set the ggobj in ggiraph.
  girafe_options(opts_hover(css = "fill:cyan;")) # Set the fill color of the points when hovered over.

The chart above uses ggiraph to create an interactive chart that behaves differently from the plotly version. Here, the estimate points change color when hovered over. The plot can also be downloaded as an image.
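The tooltip in the chart shows only the bare estimate. One possible refinement is to build a fuller label column before plotting. The sketch below does this for two counties from the output above; tooltip_label and tooltip_sample are placeholder names, and the label format is just one option.

```r
library(dplyr)
library(stringr)

# Two rows copied from the sorted output in Question 1
tooltip_sample <- tibble(
  GEOID    = c("53075", "53033"),
  NAME     = c("Whitman County, Washington", "King County, Washington"),
  estimate = c(22.6, 22.1),
  moe      = c(1.9, 0.3)
) %>%
  mutate(
    # Short county name, estimate, and margin of error in one string
    tooltip_label = paste0(
      str_remove(NAME, " County, Washington|, Washington"),
      ": ", estimate, "% (+/- ", moe, ")"
    )
  )
```

Mapping `tooltip = tooltip_label` inside aes() in place of `tooltip = estimate` would then show the county name and margin of error on hover.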

Map of Graduate Degree Percent by WA County

Lastly, we’ll create an interactive map of Washington state counties to visualize the different rates of graduate degree holders across the state. Though this section focuses on non-spatial analysis, the map helps to show where the counties studied above are located.

First, we’ll add the geometry argument to get_acs():

grad_deg_percent_geo <- get_acs(
  geography = "county",     # We want to pull data for counties
  variables = "DP02_0066P",   # The variable is percentage of population with a graduate degree
  state = "WA",   # For the state of Washington
  year = 2021,   # For the 2017-2021 5-year ACS. 
  geometry = TRUE,   # This allows for geometry info to be added to our data.
  progress_bar = FALSE
)
mapview(grad_deg_percent_geo, zcol = "estimate")

In the interactive map above, yellow-green hues mark counties with higher percentages of graduate degree holders and blue-purple hues mark counties with lower percentages. The three highest counties appear in yellow: Whitman County in the east at 22.6%, King County in the central-west at 22.1%, and San Juan County in the northwest at 22.1%. The county with the lowest percentage, Grant County, is located in central Washington.

PART B - SPATIAL ANALYSIS

In this section, we conduct spatial analysis on data obtained with the tidycensus package. We focus on Snohomish County, Washington, which lies directly north of King County (home to Seattle).

DATA PREPARATION

First, we use the load_variables() function to find a variable of interest.

vars <- load_variables(2021, "acs5") # We look at the 2017-2021 ACS dataset.
View(vars) # View the data in RStudio.
glimpse(vars) # Print a compact summary of the variable table.

After reviewing the variables, we selected B24031_008 (label: Estimate!!Total:!!Retail trade). This variable gives the median earnings in the past 12 months for the civilian employed population 16 years and over in the “Retail Trade” industry. The estimates are in 2021 inflation-adjusted dollars.
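Scrolling through the full variable table is slow, so a text filter can narrow it down. The sketch below runs the filter on a hand-typed stand-in for a few rows of the table. The column names (name, label, concept) follow what load_variables() is understood to return, and the concept text is paraphrased for illustration rather than copied from the Census Bureau.

```r
library(dplyr)
library(stringr)

# Hand-typed stand-in for a few rows of the load_variables() table (illustrative only)
vars_sample <- tibble(
  name    = c("B24031_001", "B24031_008", "B01001_001"),
  label   = c("Estimate!!Total:", "Estimate!!Total:!!Retail trade", "Estimate!!Total:"),
  concept = c("Industry by median earnings (paraphrased)",
              "Industry by median earnings (paraphrased)",
              "Sex by age (paraphrased)")
)

# Case-insensitive search of the label column
retail_vars <- vars_sample %>%
  filter(str_detect(label, regex("retail", ignore_case = TRUE)))
```

The same filter() call applied to the real `vars` object would list every variable whose label mentions retail.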

Next, we use get_acs() to fetch spatial ACS data for the variable for census tracts within Snohomish county.

sno_retail_income <- get_acs(  # Use get_acs() to retrieve data
  geography = "tract", # We look at census tracts.
  variables = "B24031_008",  # The variable for the median earnings in retail trade
  state = "WA",  # In the state of Washington
  county = "Snohomish",  # In Snohomish county.
  year = 2021,  # Request the 2017-2021 5-year ACS explicitly, since the default year depends on the tidycensus version.
  geometry = TRUE,   # This option lets us retrieve geometry information already joined to the data. 
  progress_bar = FALSE
)
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

ANALYSIS

Now, we use mapview() to display our data interactively.

mapview(sno_retail_income, zcol = "estimate")

The highest median retail earnings by census tract are on the yellow-green end of the color scale, while the lowest values are on the blue-purple end. Most of Snohomish County falls in the blue-purple range, indicating lower retail earnings, while the higher, greener values are located mostly in the southwest of the county. One tract in the southwest, census tract 520.09, has the highest estimate at $165,179, far above the other tracts.
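An outlier like this is worth checking against its margin of error, since tract-level estimates for a single industry can rest on small samples. One common screen is the coefficient of variation (standard error divided by estimate). In the sketch below only the $165,179 estimate comes from this report; the margins of error and the second tract are hypothetical values invented for illustration, and the 30 percent cutoff is an arbitrary assumption rather than an official threshold.

```r
library(dplyr)

# HYPOTHETICAL margins of error: only the 165179 estimate is taken from the report
tract_sample <- tibble(
  tract    = c("520.09", "Example tract (hypothetical)"),
  estimate = c(165179, 48000),
  moe      = c(90000, 6000)
)

tract_reliability <- tract_sample %>%
  mutate(
    se         = moe / 1.645,          # 90 percent margin of error to standard error
    cv_percent = 100 * se / estimate,  # coefficient of variation, in percent
    flagged    = cv_percent > 30       # assumed cutoff for "use with caution"
  )
```

Running the same mutate() on sno_retail_income would show whether the real margin of error for tract 520.09 is large enough to call the estimate into question.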

Lastly, we use ggplot to create a choropleth map of the data. Although the values are dollar amounts rather than percentages or rates, a choropleth is still appropriate because a median does not scale with tract population, so the values can be compared directly across tracts.

ggplot(sno_retail_income, aes(fill = estimate)) + # Create a plot with the estimate (retail earnings) as the fill
  geom_sf() + # Add geography
  theme_void() +  # Void theme
  scale_fill_viridis_c(option = "inferno", n.breaks = 4) +  # Continuous "inferno" palette from viridis; n.breaks requests about four legend breaks
  labs(title = "Median Earnings in Retail Trade Industry",  # Add title
       subtitle = "Census tracts in Snohomish County", # Subtitle
       fill = "Median Retail Earnings", # Legend
       caption = "2017-2021 ACS | tidycensus R package")

The figure above shows a choropleth map of median retail earnings for the civilian employed population 16 years and over, by census tract, in Snohomish County, Washington. To differentiate the style from the interactive map, this map uses the “inferno” palette with legend breaks that divide the range into roughly four bands: under $50,000, $50,000 to $99,999, $100,000 to $149,999, and $150,000 and over. Note that scale_fill_viridis_c() shades tracts on a continuous gradient, so these breaks label the legend rather than sort tracts into discrete classes. The simplified legend shows that for the majority of census tracts in the county, median retail earnings are around the $50,000 range. The highest median retail earnings are located in the southwest portion of the county.

It is worth noting, however, that some census tracts, particularly in the eastern portion of the county, have much larger land areas and so dominate the map visually regardless of how many people live there. Supplemental data that could help with further analysis include the number of retail businesses in each census tract or the percentage of persons employed in the retail industry in each census tract.
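If discrete classes are wanted on the map, one option is to bin the estimates before plotting. The sketch below uses base R's cut() with the four dollar ranges discussed above. Only the $165,179 value comes from this report; the other numbers are placeholders, and the object names are introduced here for illustration.

```r
# Class boundaries matching the four dollar ranges discussed above
earnings_breaks <- c(0, 50000, 100000, 150000, Inf)
earnings_labels <- c("Under $50,000", "$50,000 to $99,999",
                     "$100,000 to $149,999", "$150,000 and over")

# 165179 is the tract value reported above; the other values are placeholders
example_earnings <- c(32000, 51000, 120000, 165179, NA)

# right = FALSE makes each class include its lower bound, e.g. [50000, 100000)
earnings_class <- cut(example_earnings,
                      breaks = earnings_breaks,
                      labels = earnings_labels,
                      right  = FALSE)
```

Mapping `fill` to a column built this way and switching to scale_fill_viridis_d(option = "inferno") would draw four solid classes. Alternatively, keeping the numeric column and using scale_fill_viridis_b() with explicit breaks should give a similar binned legend.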

CONCLUSION

The tidycensus package allows us to access US Census Bureau data easily and in a format that is ready for analysis. In this report, tidycensus, in conjunction with other analysis packages, was used to look at the percentage of graduate degree holders in the counties of Washington state and the median retail earnings for census tracts in Snohomish County.

Chuckanut Drive in Western Washington
Samish Bay, Washington
Bellingham Bay, Washington

Thank you for reading!