Part A: Non-Spatial Data

First, I load R packages that will be used and load any functions for formatting.

# Load packages
library(tidycensus)
library(tidyverse)
library(plotly)
library(ggiraph)
library(ggiraphExtra)
library(mapview)
library(leaflet)

# Map theme
theme_map <- function(...) {
  theme_minimal() +
    theme(
      text = element_text(family = "Lato", color = "#22211d"),
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      # panel.grid.minor = element_line(color = "#ebebe5", size = 0.2),
      panel.grid.major = element_line(color = "#ebebe5", size = 0.2),
      panel.grid.minor = element_blank(),
      plot.background = element_rect(fill = "#ffffff", color = NA), 
      panel.background = element_rect(fill = "#ffffff", color = NA), 
      legend.background = element_rect(fill = "#ffffff", color = NA),
      panel.border = element_blank(),
      # caption
      plot.caption = element_text(hjust=0, colour="#7b7b7b",
                                  margin = margin(t = 5, r=0, b = 5, l =0)),
      ...
    )
}

# CSS to fix leaflet legend NA spacing
css_fix <- "div.info.legend.leaflet-control br {clear: both;}" 
# CSS to HTML
html_fix <- htmltools::tags$style(type = "text/css", css_fix)

Introduction

First, I use the get_acs() function to retrieve the percentage of the population of New Jersey that has a graduate degree at the county level.

# Get the percent of NJ residents with a graduate degree, by county, in 2021
nj_grad <- get_acs(variables = "DP02_0066P", geography = "county", state = "NJ", year = 2021)

Somerset (25.3%), Morris (22.8%), and Hunterdon (22.1%) are the counties with the highest percentage of residents with a graduate degree.

# Highest percentage of residents
nj_grad %>% arrange(desc(estimate))

## # A tibble: 21 × 5
##    GEOID NAME                         variable   estimate   moe
##    <chr> <chr>                        <chr>         <dbl> <dbl>
##  1 34035 Somerset County, New Jersey  DP02_0066P     25.7   0.8
##  2 34027 Morris County, New Jersey    DP02_0066P     23.5   0.6
##  3 34019 Hunterdon County, New Jersey DP02_0066P     22.6   1.1
##  4 34003 Bergen County, New Jersey    DP02_0066P     20.2   0.5
##  5 34021 Mercer County, New Jersey    DP02_0066P     20.2   0.7
##  6 34025 Monmouth County, New Jersey  DP02_0066P     18.8   0.5
##  7 34017 Hudson County, New Jersey    DP02_0066P     18.4   0.6
##  8 34023 Middlesex County, New Jersey DP02_0066P     18.3   0.4
##  9 34013 Essex County, New Jersey     DP02_0066P     15.3   0.4
## 10 34039 Union County, New Jersey     DP02_0066P     14.9   0.5
## # … with 11 more rows

Cumberland (5.5%), Salem (7.2%), and Passaic (9.6%) counties are the counties with the lowest percentage of residents with a graduate degree.

# Lowest percentage of residents
nj_grad %>% arrange(estimate)

## # A tibble: 21 × 5
##    GEOID NAME                          variable   estimate   moe
##    <chr> <chr>                         <chr>         <dbl> <dbl>
##  1 34011 Cumberland County, New Jersey DP02_0066P      5.9   0.6
##  2 34033 Salem County, New Jersey      DP02_0066P      7.6   0.9
##  3 34031 Passaic County, New Jersey    DP02_0066P      9.6   0.5
##  4 34001 Atlantic County, New Jersey   DP02_0066P     10.1   0.6
##  5 34029 Ocean County, New Jersey      DP02_0066P     11.1   0.4
##  6 34041 Warren County, New Jersey     DP02_0066P     11.5   0.8
##  7 34015 Gloucester County, New Jersey DP02_0066P     11.7   0.6
##  8 34009 Cape May County, New Jersey   DP02_0066P     13     0.9
##  9 34007 Camden County, New Jersey     DP02_0066P     13.1   0.4
## 10 34037 Sussex County, New Jersey     DP02_0066P     13.5   0.8
## # … with 11 more rows

Make a margin of error plot for the state you’ve chosen. Does this method work well for your state? For this, use the margin-of-error plot exampleLinks to an external site. in the workshop, and feel free to improve its design. Be sure to add informative titles, labels, figure captions, and subtitles if necessary.

Below is a margin-of-error plot for New Jersey counties. This method works fairly well for New Jersey due to the relatively low number of counties.

# Ggplot estimate with margin of error
nj_moe_plot <- ggplot(nj_grad, aes(x = estimate, 
                                   y = reorder(NAME, estimate))) +
  geom_point(color = "red", size = 2) +
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
                width = 0.5, linewidth = 0.5) +
  labs(title = "New Jersey Residents with a Graduate Degree",
       subtitle = "2016-2021",
       caption = "Source: ACS Data, 2017-2021",
       y = "",
       x = "Estimate (%)") +
  theme_minimal()

nj_moe_plot

Figure 1. A chart displaying the percent of New Jersey residents with a graduate degree, 2017-2021 data at the county level. Somerset, Morris, and Hunterdon counties have the highest percentage of residents with a graduate degree, while Cumberland, Salem, and Passaic counties have the lowest percentage of residents with a graduate degree. Margin of error is included for each county’s estimate.

Using plotly, the margin of error plot can be converted to an interactive plot.

# ggplot interactive plotly chart of margin of error plot
ggplotly(nj_moe_plot)

Figure 2. An interactive plotly chart displaying the percent of New Jersey residents with a graduate degree, 2017-2021 data at the county level. Margin of error is included for each county’s estimate.

Using the ggiraph package is another way to convert the static margin of error plot into an interactive plot.

# Create css for tooltip to change font
tooltip_css <- "background-colour:transparent;font-family: Lato, sans-serif;"

# Create ggiraph object
gg_nj_grad <- ggplot(data = nj_grad) +
  geom_point_interactive(aes(x = estimate, y = reorder(NAME,estimate), color = "red",
                             tooltip = estimate, data_id = NAME)) + 
  geom_errorbar_interactive(aes(y = NAME, x = estimate, xmin = estimate - moe, xmax = estimate + moe),
                            width = 0.5) +
  labs(title = "New Jersey Residents with a Graduate Degree",
       subtitle = "2017-2021 Data, with Margin of Error Included",
       caption = "Source: ACS Data, 2017-2021",
       y = "",
       x = "Estimate (%)") +
  theme_minimal() +
  theme(legend.position = "none")

# Put ggiraph chart together
girafe(ggobj = gg_nj_grad,  options = list(
  opts_sizing(width = 1), 
  opts_tooltip(css = tooltip_css)
))

Figure 3. An interactive ggiraph chart of the percent of New Jersey residents with a graduate degree, 2017-2021 data at the county level. Margin of error is included for each county’s estimate.

Part B: Spatial

Data Preparation

Load Variables

First, I use the load_variables() function to find a variable that interests me. I use the filter() and str_detect() functions to find relevant variables for analysis.

In this case, I was interested in variables relating to working from home.

# Load variables from 2021
acs_2021_vars <- load_variables(2021, dataset = "acs5")
acs_2021_vars

## # A tibble: 27,886 × 4
##    name        label                                    concept          geogr…¹
##    <chr>       <chr>                                    <chr>            <chr>  
##  1 B01001A_001 Estimate!!Total:                         SEX BY AGE (WHI… tract  
##  2 B01001A_002 Estimate!!Total:!!Male:                  SEX BY AGE (WHI… tract  
##  3 B01001A_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE (WHI… tract  
##  4 B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE (WHI… tract  
##  5 B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE (WHI… tract  
##  6 B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE (WHI… tract  
##  7 B01001A_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE (WHI… tract  
##  8 B01001A_008 Estimate!!Total:!!Male:!!20 to 24 years  SEX BY AGE (WHI… tract  
##  9 B01001A_009 Estimate!!Total:!!Male:!!25 to 29 years  SEX BY AGE (WHI… tract  
## 10 B01001A_010 Estimate!!Total:!!Male:!!30 to 34 years  SEX BY AGE (WHI… tract  
## # … with 27,876 more rows, and abbreviated variable name ¹geography

# Look at vars containing "work"
acs_2021_vars %>% filter(str_detect(concept, "WORK"))

## # A tibble: 4,036 × 4
##    name       label                                              concept geogr…¹
##    <chr>      <chr>                                              <chr>   <chr>  
##  1 B08006_001 Estimate!!Total:                                   SEX OF… tract  
##  2 B08006_002 Estimate!!Total:!!Car, truck, or van:              SEX OF… tract  
##  3 B08006_003 Estimate!!Total:!!Car, truck, or van:!!Drove alone SEX OF… tract  
##  4 B08006_004 Estimate!!Total:!!Car, truck, or van:!!Carpooled:  SEX OF… tract  
##  5 B08006_005 Estimate!!Total:!!Car, truck, or van:!!Carpooled:… SEX OF… tract  
##  6 B08006_006 Estimate!!Total:!!Car, truck, or van:!!Carpooled:… SEX OF… tract  
##  7 B08006_007 Estimate!!Total:!!Car, truck, or van:!!Carpooled:… SEX OF… tract  
##  8 B08006_008 Estimate!!Total:!!Public transportation (excludin… SEX OF… tract  
##  9 B08006_009 Estimate!!Total:!!Public transportation (excludin… SEX OF… tract  
## 10 B08006_010 Estimate!!Total:!!Public transportation (excludin… SEX OF… tract  
## # … with 4,026 more rows, and abbreviated variable name ¹geography

Since there are many variables that contain “work”, I further limited my search by searching directly for “worked from home”. I use the View() function to look more closely at the results.

# There are many, so I limit it furthe to "Worked from home"
acs_2021_vars %>% filter(str_detect(label, "Worked from home")) %>% View()

Retrieve Data

Next, I use get_acs() to retrieve ACS data that is relevant for analysis on the percentage of the population working from home. I chose to retrieve this data at the county level. This ensured that there would be sufficient population even while using 1-year estimates. I used a vector containing the relevant county FIPS codes within the D.C. metropolitan area to filter the relevant counties.

I decided to use the C08601: Means of Transportation to Work table, which is a simplified table that looks at workers 16 years and over and their means of transportation to work.

I determined that 1-year estimates were necessary due to the fact that working from home is a relevantly recent trend in most locations. I chose to use two different years - 2019 and 2021 - to show the changes that occurred as a result of the pandemic.

The following jurisdictions do not have 1-year estimates available:

Clarke County, VA
Falls Church, VA
Fairfax city, VA
Jefferson County, WV
Manassas, VA
Manassas Park, VA

Since there are missing jurisdictions, I decided to add them in so they would appear on the map (with NA values).

# Vector of relevant county FIPS codes
region_county_fips <- c(
  "11001", "24009", "24013", "24017", "24021", "24031", "24033",
  "51510", "51013", "51043", "51059", "51600", "51610", "51061",  
  "51630", "24027",  "51107", "51683", "51685",  "51153", "51177", 
  "51179", "54037"
)

# Get working from home data - county - 2021
acs_wfh_county_2019 <- get_acs(geography = "county", variables = "C08601_011", 
                               summary_var = "C08601_001",
                               survey="acs1", year = 2019) %>%
  filter(GEOID %in% region_county_fips) %>%
  mutate(pct_2019 = round((estimate/summary_est) * 100, 2))

acs_wfh_county_2019_geom <- get_acs(geography = "county", variables = "C08601_011", 
                               summary_var = "C08601_001",
                               survey="acs1", year = 2019, geometry = TRUE,
                               progress_bar = FALSE) %>%
  filter(GEOID %in% region_county_fips) %>%
  mutate(pct_2019 = round((estimate/summary_est) * 100, 2))

# Get working from home data - county - 2021
acs_wfh_county_2021 <- get_acs(geography = "county", variables = "C08601_011", 
                               summary_var = "C08601_001",
                               survey="acs1", year = 2021) %>%
  filter(GEOID %in% region_county_fips) %>%
  mutate(pct_2021 = round((estimate/summary_est) * 100, 2))

acs_sumvar <- get_acs(geography = "county", variables = "B08601_001",
                      survey="acs5", year = 2019, geometry = TRUE, 
                      progress_bar = FALSE) %>%
  filter(GEOID %in% region_county_fips) %>%
  select(GEOID, NAME, workers = estimate)

Next, I join the 2019 and 2021 datasets and calculate the difference between the two years.

# Join 2019 and 2021 data at MSA level

# Join 2019 and 2021 data at county level
acs_wfh_dc_county_join <- acs_wfh_county_2019 %>% 
  select(GEOID:variable, pct_2019) %>%
  left_join(., acs_wfh_county_2021, by = c("GEOID", "NAME", "variable")) %>%
  select(GEOID, NAME, variable, pct_2019, pct_2021) %>%
  mutate(diff_pct = pct_2021 - pct_2019) %>%
  arrange(desc(diff_pct))

# Join 2019 and 2021 data with 2019 geom data
acs_wfh_dc_county_join_geom <- acs_wfh_county_2019_geom %>% 
  select(GEOID:variable, pct_2019) %>%
  left_join(., acs_wfh_county_2021, by = c("GEOID", "NAME", "variable")) %>%
  select(GEOID, NAME, variable, pct_2019, pct_2021) %>%
  mutate(diff_pct = pct_2021 - pct_2019) %>%
  arrange(desc(diff_pct))

# Join 2019/2021 WFH data to the 5-year summary var
# this ensures all jurisdictions are included
acs_wfh_with_na <- acs_sumvar %>%
  left_join(., acs_wfh_dc_county_join, by = c("GEOID", "NAME"))

# DC area counties with the largest working from home changes
acs_wfh_dc_county_join %>% arrange(desc(diff_pct))

## # A tibble: 16 × 6
##    GEOID NAME                                    varia…¹ pct_2…² pct_2…³ diff_…⁴
##    <chr> <chr>                                   <chr>     <dbl>   <dbl>   <dbl>
##  1 51013 Arlington County, Virginia              C08601…    4.33    39.8    35.5
##  2 51510 Alexandria city, Virginia               C08601…    5.96    38.2    32.3
##  3 24031 Montgomery County, Maryland             C08601…    7.21    37.5    30.2
##  4 51179 Stafford County, Virginia               C08601…    6.86    34.5    27.7
##  5 51059 Fairfax County, Virginia                C08601…    6.4     33.7    27.3
##  6 51107 Loudoun County, Virginia                C08601…    9.81    36.6    26.8
##  7 11001 District of Columbia, District of Colu… C08601…    3.38    28.2    24.8
##  8 24017 Charles County, Maryland                C08601…    8.29    32.5    24.2
##  9 24033 Prince George's County, Maryland        C08601…    5.03    28.7    23.6
## 10 24027 Howard County, Maryland                 C08601…    8.17    28.7    20.5
## 11 51153 Prince William County, Virginia         C08601…    9.8     30.3    20.5
## 12 24013 Carroll County, Maryland                C08601…   10.7     29.6    18.9
## 13 24021 Frederick County, Maryland              C08601…    9.33    28.2    18.9
## 14 51177 Spotsylvania County, Virginia           C08601…    8.53    27.0    18.4
## 15 24009 Calvert County, Maryland                C08601…   13.8     32.1    18.4
## 16 51061 Fauquier County, Virginia               C08601…   12.3     27.2    14.9
## # … with abbreviated variable names ¹variable, ²pct_2019, ³pct_2021, ⁴diff_pct

# DC area counties with the smallest working from home changes
acs_wfh_dc_county_join %>% arrange(diff_pct)

## # A tibble: 16 × 6
##    GEOID NAME                                    varia…¹ pct_2…² pct_2…³ diff_…⁴
##    <chr> <chr>                                   <chr>     <dbl>   <dbl>   <dbl>
##  1 51061 Fauquier County, Virginia               C08601…   12.3     27.2    14.9
##  2 24009 Calvert County, Maryland                C08601…   13.8     32.1    18.4
##  3 51177 Spotsylvania County, Virginia           C08601…    8.53    27.0    18.4
##  4 24021 Frederick County, Maryland              C08601…    9.33    28.2    18.9
##  5 24013 Carroll County, Maryland                C08601…   10.7     29.6    18.9
##  6 51153 Prince William County, Virginia         C08601…    9.8     30.3    20.5
##  7 24027 Howard County, Maryland                 C08601…    8.17    28.7    20.5
##  8 24033 Prince George's County, Maryland        C08601…    5.03    28.7    23.6
##  9 24017 Charles County, Maryland                C08601…    8.29    32.5    24.2
## 10 11001 District of Columbia, District of Colu… C08601…    3.38    28.2    24.8
## 11 51107 Loudoun County, Virginia                C08601…    9.81    36.6    26.8
## 12 51059 Fairfax County, Virginia                C08601…    6.4     33.7    27.3
## 13 51179 Stafford County, Virginia               C08601…    6.86    34.5    27.7
## 14 24031 Montgomery County, Maryland             C08601…    7.21    37.5    30.2
## 15 51510 Alexandria city, Virginia               C08601…    5.96    38.2    32.3
## 16 51013 Arlington County, Virginia              C08601…    4.33    39.8    35.5
## # … with abbreviated variable names ¹variable, ²pct_2019, ³pct_2021, ⁴diff_pct

Analysis

Looking at the results, Arlington County, VA, Alexandria, VA, and Montgomery County, MD have the highest percentage of those working from home in 2021 (and the largest differences between 2019 and 2021).

Fauquier County, VA, Calvert County, MD, and Spotsylvania County, VA have the smallest changes between 2019 and 2021. These counties also have among the lowest percentages of those working from home. These counties are all exurban counties on the periphery of the D.C. metro area.

Maps

I used mapview() to view the data interactively. The interactive map allows you to view the change and the different years. Some counties started with a larger percentage of workers who worked from home, but with the interactive features it is possible to see that. For example, Fauquier County, VA and Calvert County, MD had lower changes than inner suburban counties, but they already had many workers who worked from home in 2019 - 12.26% and 13.75% respectively.

mapview(acs_wfh_with_na, 
        zcol = "diff_pct", 
        alpha.regions = .8,
        layer.name = "% Growth 2019-2021<br> Working From Home")

Figure 4. An interactive map of the percent difference in workers working from home between 2019 and 2021 within the Washington, D.C. metro area. The largest differences were in suburban counties in Virginia and Maryland.

After exploring the data interactively, I produce a map of percent working from home in 2021 in the D.C. metropolitan area using ggplot2.

ggplot() +
  geom_sf(data = acs_wfh_dc_county_join_geom,  aes(fill = pct_2021), color = "grey70", linewidth = 0.25) +
  coord_sf(datum = NA) + 
  #scale_fill_viridis(option = 'inferno') +
  labs(title = "% Working From Home in the D.C. Metro Area, 2021",
       subtitle = "Smaller jurisidictions were removed due to missing data",
       fill = "% Working\nFrom Home",
       caption = "Source: ACS Data 1-year Estimates, 2021") +
  scale_fill_distiller(palette = "GnBu", direction = 1, guide = "colorbar", na.value = "#7b7b7b") +
  guides(fill = guide_colorbar(nbin=300, raster=FALSE, ticks = TRUE)) +
  theme_map()

Figure 5. A map of the percent working from home in the Washington, D.C. Metropolitan Area in 2021. Arlington County, VA and Montgomery County, MD had the highest percentage of workers who worked from home in 2021.

Conclusion

Working from home increased across the Washington, D.C. metro area and around the United States. How much working from home increased varied. Counties on the periphery of the D.C. area often started off with a higher percentage of workers who worked from home before the pandemic, but these counties often had fewer workers working from home in 2021.

Lab 5

Simone Roy

2023-06-16