Commuting Data Analysis Lesson 3

#Background

For this lesson I decided to visualize what I’m currently working on for my work as the Census State Data Center Coordinator for the State of Connecticut working at the nonprofit CTData Collabortive. We are currently working on project to identify change in commuting patterns before and after the pandemic. While we don’t have a lot of data points, we wanted to visualize this and see if we can identiy any patterns.

This is not the dataset that I worked on for the other lessons.

library(readr)

## Warning: package 'readr' was built under R version 4.2.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)

## Warning: package 'skimr' was built under R version 4.2.3

library(janitor)

## Warning: package 'janitor' was built under R version 4.2.3

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)
library(ggplot2)

Here we load the data in.

# Read the CSV file
all_od_data <- read_csv("C:/Data/Commuting/combined_lodes_data.csv")

## Rows: 64936 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): work_town, home_town
## dbl (2): TotalJobs, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The next step would be to clean the data. This summarises the number of connections the work_town has to home_town. It also removes any data that work_town is equal to home_town. I plan on looking the American Community Survey 1-year estimates for work from home data to indentifty further patterns.

#get the total numbner of connections per work_town

sumall_od_data_2 <- all_od_data %>%
  filter(work_town != home_town) %>%
  group_by(work_town, year) %>%
  summarise(sum_connections = sum(TotalJobs, na.rm = TRUE), .groups = 'drop')%>%
  arrange(-sum_connections, year)

Next we needed to find the top 10 towns with the most connections.

#Calculate total connections for top 10 towns per year

top_10_per_year <- sumall_od_data_2 %>%
  group_by(year) %>%
  top_n(10, sum_connections) %>%
  summarise(total_connections = sum(sum_connections))

What is the general trend over time in total jobs in Work Towns? Less people were commuting during the pandemic and this data shows that. For 2021 the trend increases as expected. However, since we do not have other data points it is something that we should continue to look into.

# Plotting
ggplot(top_10_per_year, aes(x = year, y = total_connections)) +
  geom_line(color ="#7E90D0") +
  geom_point() +
  labs(x = "Year", y = "Total Connections", title = "Change Over Time of Top 10 Work Towns") +
  theme_minimal()

Now this trend line makes sense based on the COVID-19 pandemic resulting in more remote work.

# get the top 10 towns per year: 

# Group by year and town, and select top 10 rows within each year
top_10_per_year_per_town <- sumall_od_data_2 %>%
   mutate(work_town = str_replace_all(work_town, "\\s*\\([^\\)]+\\)", "")) %>% # Removes text within parentheses
  mutate(work_town = str_replace_all(work_town, "town", "")) %>%# Removes the word "town"

  group_by(year, work_town) %>%
  summarise(total_connections = sum(sum_connections)) %>%
  arrange(desc(total_connections)) %>%
  slice(1:10)

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

ggplot(top_10_per_year_per_town, aes(x = reorder(work_town, total_connections), y = total_connections, fill = as.factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Town", y = "Total Connections", fill = "Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 10 Towns per Year by Total Connections") +
  scale_fill_manual(values = c("2019" = "#7EB6D0", "2020" = "#D07EB5", "2021" = "#B6D07E"))

Here we see that some of the towns are only on the list for one or two years. If I had more time, I’d find a new top 10 that included connections from for each year for each of the top 10 towns.

top_10_per_year_per_town <- sumall_od_data_2 %>%
  group_by(year, work_town) %>%
  summarise(total_connections = sum(sum_connections), .groups = 'drop') %>%
  arrange(year, desc(total_connections)) %>%
  group_by(year) %>%
  slice_max(order_by = total_connections, n = 10)

# Since you want to filter based on a combination of year and work_town, ensure both datasets include these
# Join filtered_data with top_10_per_year_per_town to keep only the relevant rows
filtered_data <- sumall_od_data_2 %>%
  inner_join(top_10_per_year_per_town, by = c("year", "work_town"))

# Plotting
ggplot(filtered_data, aes(x = year, y = total_connections, color = work_town)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Total Connections", title = "Change Over Time for Top 10 Towns") +
  facet_wrap(~work_town, scales = "free_y", ncol = 2) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.y = element_text(size = 8))

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

The trend lines overtime are interesting. Windsor, North Haven, New Haven, all increased jobs throughout the pandemic.

# Filter the data for the years 2019 and 2021
data_2019 <- subset(sumall_od_data_2, year == 2019)
data_2021 <- subset(sumall_od_data_2, year == 2021)

# Calculate the sum of connections for each work_town in 2019
sum_2019 <- aggregate(sum_connections ~ work_town, data_2019, sum)

# Calculate the sum of connections for each work_town in 2021
sum_2021 <- aggregate(sum_connections ~ work_town, data_2021, sum)

# Merge the two data frames based on work_town
merged_data <- merge(sum_2019, sum_2021, by = "work_town", suffixes = c("_2019", "_2021"))

# Calculate percent change
merged_data$percent_change <- ((merged_data$sum_connections_2021 - merged_data$sum_connections_2019) / merged_data$sum_connections_2019) * 100

merged_data <- arrange(merged_data,(-percent_change))
# Print the result
#print(merged_data)

#create a histogram of percent change

# Create histogram for percent_change
ggplot(merged_data, aes(x = percent_change)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(x = "Percent Change", y = "Frequency", title = "Histogram of Percent Change") +
  theme_minimal()

#Conclusion

Further research needs to be conducted to identify the outliers of the histogram.

Data sourced frrom: U.S. Census Bureau. (2019-2021). Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES).

Commuting Data Analysis Lesson 3

CT Data: Jeff Borofsky

2024-03-12