Overview

The data set I am using is from data.gov and includes traffic crashes in Chicago from 2015 to 2026. It can be found here: https://catalog.data.gov/dataset/traffic-crashes-crashes. The data is collected by the Chicago Police Department and includes E-Crash data as well as self-reported, in-station data. Records are generated when traffic crashes within city limits are reported to the police. Codebooks and data dictionaries are available through the Chicago Data Portal. They provide column names, descriptions of the variables and how they are calculated, and the specific keys or codes used by the responding officers for fields like PRIM_CONTRIBUTORY_CAUSE and CRASH_TYPE. The data is collected for public safety monitoring, resource allocation, urban planning, and legal/insurance documentation.

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Linking to GEOS 3.13.1, GDAL 3.11.4, PROJ 9.7.0; sf_use_s2() is TRUE
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(lubridate)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Body

Traffic_Crashes_Crashes <- read_csv("C:/Users/mhe29/OneDrive - Drexel University/CJS 310 R Files/Traffic_Crashes_-_Crashes.csv")
## Rows: 1030368 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): CRASH_RECORD_ID, CRASH_DATE_EST_I, CRASH_DATE, TRAFFIC_CONTROL_DEV...
## dbl (17): POSTED_SPEED_LIMIT, LANE_CNT, STREET_NO, BEAT_OF_OCCURRENCE, NUM_U...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Traffic_Crashes_Crashes$CRASH_DATE <- mdy_hms(Traffic_Crashes_Crashes$CRASH_DATE)
Traffic_Crashes_Crashes <- Traffic_Crashes_Crashes %>%
  mutate(CRASH_YEAR = year(CRASH_DATE))

This data set is very large (1,030,368 rows and 48 columns), so a few cleaning and transformation steps were required to make it usable. The CRASH_DATE column was originally read as a character string, so it was converted to a date-time format using the mdy_hms() function from the lubridate package. A new year variable, CRASH_YEAR, was then extracted from the crash date to allow for longitudinal analysis. Both steps appear in the code above. For the spatial analysis, a sub-data set was created specifically for the year 2024, and any rows with missing spatial data (NA in LATITUDE or LONGITUDE) were filtered out so the map would render correctly. That subset is created in a later section.
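As a small illustration of the date handling described above, the sketch below parses one made-up timestamp written month/day/year with a 12-hour clock, which is the layout mdy_hms() expects, and then extracts the year. The sample string and the object names are hypothetical and are not taken from the crash file.

```r
library(lubridate)

# Hypothetical timestamp: month/day/year with a 12-hour clock
sample_date <- "03/25/2024 02:30:00 PM"

# Character string -> date-time (UTC unless a time zone is supplied)
parsed_date <- mdy_hms(sample_date)

# Numeric year, usable for grouping and filtering
crash_year <- year(parsed_date)

parsed_date
crash_year
```

Checking a handful of parsed values this way before converting the full column is a cheap way to confirm that the format assumption holds.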

Graph 1: Crashes By Year

Traffic_Crashes_Crashes %>%
  count(CRASH_YEAR) %>%
  ggplot(aes(x = CRASH_YEAR, y = n)) +
  geom_line() +
  geom_point() +
  labs(title = "Number of Traffic Crashes by Year (Overview)",
       x = "Year",
       y = "Number of Crashes") +
  theme_minimal()

This graph gives an overview of the data set by showing the total number of crashes occurring each year, which helps identify trends over time. For example, there is a noticeable drop around 2020, which is most likely the result of COVID-related changes in traffic and enforcement. The NHTSA found that while overall vehicle miles traveled dropped by 13.2% in 2020, the fatality rate increased to 1.37 fatalities per 100 million miles traveled. They attributed this to drivers engaging in riskier behaviors on emptier roads, specifically extreme speeding and failing to wear seatbelts. This is consistent with the drop in total crashes in 2020 in my first graph, and it supports my findings that driver behavior and speed are primary causes of severe crashes (NHTSA, 2021). Another source, a thesis by Hiya Chetia at UIC, showed that overall pedestrian crashes increased from 2017 to 2020, dropped suddenly after the COVID-19 lockdown, and resumed growing from 2020 to 2023 after the initial plunge. Despite the decrease in overall pedestrian crashes, fatal crashes increased from 78 to 114 after COVID-19 (Chetia, 2023). I thought it was interesting that the decrease was followed by a sharper increase. The study also finds that the changes varied by neighborhood, with some experiencing increases and others decreases. The sharp drop at the end of my graph represents an incomplete year.
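Because the final point on the line reflects an incomplete year, one option is to drop that year before counting so it is not read as a real decline. The sketch below is a minimal example on invented dates, not the report's actual code; toy_crashes, last_year, and complete_years are hypothetical names, and it assumes the year containing the most recent record is the partial one.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the crash table; the dates are invented
toy_crashes <- tibble(
  CRASH_DATE = ymd_hms(c(
    "2023-02-01 08:00:00", "2023-12-30 17:00:00",
    "2024-06-15 12:00:00", "2024-12-31 23:00:00",
    "2025-01-10 09:00:00", "2025-03-02 14:00:00"
  ))
) %>%
  mutate(CRASH_YEAR = year(CRASH_DATE))

# Assumption: the year of the most recent record is still in progress
last_year <- max(toy_crashes$CRASH_YEAR)

# Count crashes per year, keeping only years assumed to be complete
complete_years <- toy_crashes %>%
  filter(CRASH_YEAR < last_year) %>%
  count(CRASH_YEAR)

complete_years
```

The same check could be applied to the earliest year if reporting did not cover that full year.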

Specified Data sets/Maps

A specific sub-data set was created for the geographic map. I filtered the data to only include crashes from the year 2024 and removed any records missing longitude or latitude coordinates. This was necessary because plotting over 1 million points across a decade would overcrowd the map and obscure recent geographic trends.

crashes_subset_2024 <- Traffic_Crashes_Crashes %>%
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  filter(CRASH_YEAR == 2024)

Hypotheses

Based on this data set, I came up with these hypotheses:

Hypothesis 1: The majority of traffic crashes with a known cause are the result of driver behavior errors instead of environmental factors.

Hypothesis 2: Higher posted speed limits are associated with a higher average number of injuries per crash due to the increased force of impact.

Hypothesis 3: Traffic crashes are not evenly distributed geographically but are concentrated in high-traffic commercial and downtown hotspot areas of Chicago.

Below are the visualizations testing these hypotheses and exploring the data further.

Graph 2: Primary Contributory Causes

Traffic_Crashes_Crashes %>%
  count(PRIM_CONTRIBUTORY_CAUSE, sort = TRUE) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(PRIM_CONTRIBUTORY_CAUSE, n), y = n)) + 
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) + 
  labs(title = "Top 10 Primary Contributory Causes of Crashes",
       x = "Cause",
       y = "Number of Crashes")

This graph shows the most common causes of traffic crashes. While the most common tag is “UNABLE TO DETERMINE,” the next highest causes are “FAILING TO YIELD RIGHT-OF-WAY” and “FOLLOWING TOO CLOSELY.” This is consistent with Hypothesis 1: when a cause is recorded, specific driver behavior errors are the most common factors.
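Hypothesis 1 is a claim about a majority among crashes with a known cause, so a direct check is to drop the undetermined records and compute the share of each group. The sketch below runs on a small invented table. The split into "Driver behavior" and "Environmental" (here only "WEATHER" is treated as environmental) is my own assumption for illustration and would need to be built from the codes in the data dictionary.

```r
library(dplyr)

# Toy stand-in for the cause column; the counts are invented
toy_causes <- tibble(
  PRIM_CONTRIBUTORY_CAUSE = c(
    "UNABLE TO DETERMINE", "UNABLE TO DETERMINE",
    "FAILING TO YIELD RIGHT-OF-WAY", "FAILING TO YIELD RIGHT-OF-WAY",
    "FOLLOWING TOO CLOSELY", "WEATHER"
  )
)

# Assumed groupings, for illustration only
unknown_causes       <- c("UNABLE TO DETERMINE")
environmental_causes <- c("WEATHER")

cause_shares <- toy_causes %>%
  # Keep only crashes with a recorded cause
  filter(!PRIM_CONTRIBUTORY_CAUSE %in% unknown_causes) %>%
  # Label each remaining crash with its assumed group
  mutate(cause_group = if_else(
    PRIM_CONTRIBUTORY_CAUSE %in% environmental_causes,
    "Environmental",
    "Driver behavior"
  )) %>%
  count(cause_group) %>%
  # Share of known-cause crashes in each group
  mutate(share = n / sum(n))

cause_shares
```

A driver behavior share above 0.5 on the real data would support the "majority" wording directly, rather than relying on the ranking in the bar chart.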

Graph 3: Speed Limits vs. Injuries

Traffic_Crashes_Crashes %>%
  group_by(POSTED_SPEED_LIMIT) %>%
  summarise(avg_injuries = mean(INJURIES_TOTAL, na.rm = TRUE)) %>%
  ggplot(aes(x = POSTED_SPEED_LIMIT, y = avg_injuries)) +
  geom_point() +
  geom_line() +
  labs(title = "Posted Speed Limit vs Average Injuries",
       x = "Posted Speed Limit",
       y = "Average Number of Injuries") +
  theme_minimal()

This graph compares posted speed limits with the average number of injuries per crash. The posted limits in the data cover a wide range of values. While there are spikes in average injuries at certain higher speed limits, there is no consistent upward trend. This gives mixed support for Hypothesis 2: speed plays an important role, but other variables are probably just as important.
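Some of the spikes may come from speed limit values that appear in only a handful of records, where a single crash can swing the average. A possible refinement, sketched below on invented numbers, is to report the number of crashes behind each average and keep only the groups above a minimum size. The threshold min_crashes and the toy values are hypothetical.

```r
library(dplyr)

# Toy stand-in for the two columns used in the graph; values are invented
toy_speed <- tibble(
  POSTED_SPEED_LIMIT = c(30, 30, 30, 30, 45, 45, 45, 99),
  INJURIES_TOTAL     = c(0, 1, 0, NA, 1, 2, 0, 4)
)

# Hypothetical minimum group size for a trustworthy average
min_crashes <- 3

speed_summary <- toy_speed %>%
  group_by(POSTED_SPEED_LIMIT) %>%
  summarise(
    n_crashes    = n(),                               # rows behind each average
    avg_injuries = mean(INJURIES_TOTAL, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Drop speed limits supported by too few crashes
  filter(n_crashes >= min_crashes)

speed_summary
```

In this toy table the single crash recorded at a limit of 99 is dropped, so it no longer produces a spike.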

Graph 4: Geographic Hotspots

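# Read the Stadia Maps API key from an environment variable instead of hard-coding it.
# STADIA_MAPS_API_KEY is an assumed variable name; set it (for example in .Renviron) before knitting.
register_stadiamaps(key = Sys.getenv("STADIA_MAPS_API_KEY"))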

chicago_bbox <- c(left = -87.85, bottom = 41.64, right = -87.52, top = 42.02)
chicago_map <- get_stadiamap(
  bbox = chicago_bbox, 
  zoom = 11, 
  maptype = "stamen_toner_lite"
)
## ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
ggmap(chicago_map) +
  stat_density_2d(data = crashes_subset_2024,
                  aes(x = LONGITUDE, y = LATITUDE,
                      fill = after_stat(level),
                      alpha = after_stat(level)),
                  linewidth = 0.01,
                  bins = 35,
                  geom = "polygon") +
  scale_fill_gradient(low = "yellow", high = "red") +
  scale_alpha(range = c(0.1, 0.5), guide = "none") +
  labs(title = "2024 Chicago Traffic Crash Hotspots",
       x = "Longitude",
       y = "Latitude",
       fill = "Intensity") +
  theme_minimal()
## Warning: Removed 371 rows containing non-finite outside the scale range
## (`stat_density2d()`).

This map visualizes the density of traffic crashes across Chicago in 2024. The bright red hotspot supports Hypothesis 3, showing that crashes are highly concentrated in downtown Chicago rather than evenly spread throughout the city. The Vision Zero Chicago Downtown Action Plan specifically identifies the downtown area as having the highest number of severe crashes citywide (Chicago Department of Transportation, 2022). Eight identified High Crash Areas account for 36% of all fatal crashes annually despite holding only 25% of the city’s population.
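The warning printed under the map code (371 rows removed) most likely comes from records whose coordinates fall outside the map's bounding box. One way to make that step explicit, sketched below on invented points, is to filter to the bounding box before plotting and count how many rows are dropped. The bounding box values match chicago_bbox from the map code; toy_points, points_in_bbox, and n_dropped are hypothetical names.

```r
library(dplyr)

# Same bounding box as the map code
chicago_bbox <- c(left = -87.85, bottom = 41.64, right = -87.52, top = 42.02)

# Toy coordinates: two inside the box, two outside (invented values)
toy_points <- tibble(
  LONGITUDE = c(-87.63, -87.70, 0, -88.50),
  LATITUDE  = c(41.88, 41.75, 0, 41.90)
)

# Keep only points that fall inside the bounding box
points_in_bbox <- toy_points %>%
  filter(
    between(LONGITUDE, chicago_bbox[["left"]], chicago_bbox[["right"]]),
    between(LATITUDE, chicago_bbox[["bottom"]], chicago_bbox[["top"]])
  )

# Number of rows excluded by the filter
n_dropped <- nrow(toy_points) - nrow(points_in_bbox)

points_in_bbox
n_dropped
```

Reporting the dropped count in the text documents the exclusion instead of leaving it to a plotting warning.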

Graph 5: Crashes by Day of the Week

Traffic_Crashes_Crashes %>%
  filter(!is.na(CRASH_DAY_OF_WEEK)) %>%
  mutate(Day = factor(CRASH_DAY_OF_WEEK, 
                      levels = 1:7, 
                      labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>%
  count(Day) %>%
  ggplot(aes(x = Day, y = n, fill = Day)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Total Crashes by Day of the Week",
       x = "Day of the Week",
       y = "Total Crashes") +
  theme_minimal()

This bar chart displays the distribution of crashes by day of the week. It helps show whether weekdays (commute days) or weekends are more prone to crashes, providing additional context for the behavioral causes identified earlier. Sunday has the fewest crashes and Friday has the most, which could be because Friday combines the end of the work week with the start of the weekend.

Conclusion

Based on these visuals, Chicago traffic crashes are driven by specific driver behaviors and are concentrated in downtown hotspots. Crash volumes dropped during 2020 because of the pandemic, but crashes remain a consistent issue. There are a few concerns with this data set. Heavily policed areas might artificially look like crash hotspots simply because officers are there to record minor incidents, whereas minor incidents in under-policed neighborhoods might go unrecorded. The primary cause is determined at the discretion of the responding officer or by a self-report in the station, which can introduce human bias against drivers. While names and license plates are removed, exact crash coordinates and dates can sometimes be reverse engineered to identify individuals, which is a privacy concern. To address these concerns, researchers should avoid using this data to justify harsh over-policing in hotspot neighborhoods. Instead, the data should be used through a public health and infrastructure lens, with the hotspot map justifying traffic calming measures, like cameras or speed bumps, in high crash areas. This matches the approach of the Vision Zero action plan, which focuses on infrastructure improvements rather than punitive policing and is something that should be adopted in all cities. Furthermore, data in public dashboards should be aggregated to the neighborhood level to preserve individual privacy.

References

Chetia, H. (2023). Spatiotemporal Comparative Analysis of Pre/Post Covid-19 Pedestrian Crash Incidences in Chicago (Version 1). University of Illinois Chicago. https://doi.org/10.25417/uic.25392877.v1

Chicago Department of Transportation. (2022). Vision Zero Chicago Downtown Action Plan. City of Chicago. https://www.chicago.gov/content/dam/city/depts/cdot/CDOT%20Projects/VisionZero/2022/VZDT_PlanDocument_online.pdf

City of Chicago. (n.d.). Traffic Crashes - Crashes [Data set]. Chicago Data Portal, accessed via Data.gov. https://catalog.data.gov/dataset/traffic-crashes-crashes

National Highway Traffic Safety Administration, Office of Behavioral Safety Research. (2021). Continuation of Research on Traffic Safety during the COVID-19 Public Health Emergency: January – June 2021 [Traffic Safety Facts]. https://doi.org/10.21949/1526036