Traffic Crashes in Chicago, Illinois (2024)

Author

Chibogwu Onyeabo

Source: iStock

Introduction

Chicago is one of the major cities in the United States, with a population of about 2.6 million. As commuters, tourists, and locals get around on the busy streets, car accidents are inevitable. The Chicago Police Department has an electronic crash reporting system (E-Crash) that stores data from every traffic crash reported within city limits.

About half of crashes are self-reported by the driver, and the rest are recorded at the scene by the responding police officer. The responding officer records many of the crash parameters, such as weather condition, posted speed limit, and street condition. Crash reports are updated as information changes or new information arises.

With this data, I’d like to discover which factors contribute the most to traffic crashes. Crashes can occur at traffic lights or stop signs, and they may involve other drivers, pedestrians, objects, or property. This dataset includes all of that information. The crash reports also include road and weather conditions, notes detailing the cause of the accident as reported by police, injuries and their severity, the type of roadway, and the location of the crash.

Load in Libraries and Data

#load in libraries
library(tidyverse)
library(ggfortify)
library(ggplot2)
library(treemap)
library(RColorBrewer)
library(leaflet)
library(sf)
library(highcharter)
#load in data
chicago2024crashes <- read_csv("Traffic_Crashes_-_Crashes_20250506.csv")
Rows: 112043 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (31): CRASH_RECORD_ID, CRASH_DATE_EST_I, CRASH_DATE, TRAFFIC_CONTROL_DEV...
dbl (17): POSTED_SPEED_LIMIT, LANE_CNT, STREET_NO, BEAT_OF_OCCURRENCE, NUM_U...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view first 20 lines
head(chicago2024crashes, 20)
# A tibble: 20 × 48
   CRASH_RECORD_ID                CRASH_DATE_EST_I CRASH_DATE POSTED_SPEED_LIMIT
   <chr>                          <chr>            <chr>                   <dbl>
 1 8a82d14f6d2d392638a8c5f5bdaee… <NA>             12/31/202…                 30
 2 cc89da2a2705cf16fbe17bafd4205… <NA>             12/31/202…                 30
 3 3b32c74ced97162dcb27f4cab0e9b… <NA>             12/31/202…                 30
 4 783c000fe66fb73635b07ba9490bd… <NA>             12/31/202…                 30
 5 07aea53a5f52a70521c0738eedb58… <NA>             12/31/202…                 30
 6 2272baacf1b38aa7f3c0db112492f… <NA>             12/31/202…                 30
 7 2d2b538e2d1c53db870b396546fb6… <NA>             12/31/202…                 30
 8 4737abdb3074d26fe642c4743c5af… <NA>             12/31/202…                 30
 9 bb245c1dfc7bda57f500b5c1790e7… <NA>             12/31/202…                 30
10 045a192bc0f94ce01b8b56717ea44… <NA>             12/31/202…                 30
11 7cd5c1c860f7ef490e3fcd95272c5… <NA>             12/31/202…                 30
12 9e0d18a96161ea69dcc93b96962de… <NA>             12/31/202…                 30
13 533e52f9475a40ff5e2676b9f7ed7… <NA>             12/31/202…                 30
14 96b18d047a421e7f10d088501f0bb… <NA>             12/31/202…                 30
15 3bf251746b9b2075dc1b10cda2dfa… <NA>             12/31/202…                 30
16 66bfb3306a6f30e0d3eda148fd120… <NA>             12/31/202…                 30
17 29da7ca565f5584503f1c43126d11… <NA>             12/31/202…                 30
18 a1c213e6d318e793158f3afa96ce0… <NA>             12/31/202…                 30
19 a3686724212d06fb9c7c8dbe73eb5… <NA>             12/31/202…                 30
20 cd9097b1b2733706938c375a6dfda… <NA>             12/31/202…                 30
# ℹ 44 more variables: TRAFFIC_CONTROL_DEVICE <chr>, DEVICE_CONDITION <chr>,
#   WEATHER_CONDITION <chr>, LIGHTING_CONDITION <chr>, FIRST_CRASH_TYPE <chr>,
#   TRAFFICWAY_TYPE <chr>, LANE_CNT <dbl>, ALIGNMENT <chr>,
#   ROADWAY_SURFACE_COND <chr>, ROAD_DEFECT <chr>, REPORT_TYPE <chr>,
#   CRASH_TYPE <chr>, INTERSECTION_RELATED_I <chr>, NOT_RIGHT_OF_WAY_I <chr>,
#   HIT_AND_RUN_I <chr>, DAMAGE <chr>, DATE_POLICE_NOTIFIED <chr>,
#   PRIM_CONTRIBUTORY_CAUSE <chr>, SEC_CONTRIBUTORY_CAUSE <chr>, …

Introduction to the Variables

These are all determined by the police officer(s) responding to the crash scene.

STREET_NAME - street name of crash location

INJURIES_TOTAL - total persons sustaining fatal, incapacitating, non-incapacitating, and possible injuries

NUM_UNITS - number of vehicles, pedestrians, cyclists, or other roadway users involved in the crash

POSTED_SPEED_LIMIT - posted speed limit

PRIM_CONTRIBUTORY_CAUSE - most significant factor of the accident, as determined by officer judgement

SEC_CONTRIBUTORY_CAUSE - second most significant factor of the accident, as determined by officer judgement

ROADWAY_SURFACE_COND - road surface condition at the time of crash

WEATHER_CONDITION - weather condition at the time of crash

TRAFFICWAY_TYPE - traffic way type (ex. four-way, one-way)

TRAFFIC_CONTROL_DEVICE - traffic control device present at crash location (ex. stop sign, traffic light)

Statistical Analysis

The numerical variables in this dataset include injury count, lane count, speed limit, and the number of units involved (which are defined as vehicles, pedestrians, cyclists, and other non-passenger roadway users). Out of these, I’ll perform a linear regression to see how well the total injuries of a crash can be predicted by the number of units involved.

#linear model y = injuries, x = units
lm <- lm(chicago2024crashes$INJURIES_TOTAL ~ chicago2024crashes$NUM_UNITS)
summary(lm) 

Call:
lm(formula = chicago2024crashes$INJURIES_TOTAL ~ chicago2024crashes$NUM_UNITS)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4906 -0.2237 -0.2237 -0.2237 14.6144 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -0.100179   0.008602  -11.64   <2e-16 ***
chicago2024crashes$NUM_UNITS  0.161922   0.004122   39.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6155 on 111792 degrees of freedom
  (249 observations deleted due to missingness)
Multiple R-squared:  0.01361,   Adjusted R-squared:  0.01361 
F-statistic:  1543 on 1 and 111792 DF,  p-value: < 2.2e-16

Equation: (total injuries) = 0.161922 (number of units) - 0.100179

Adjusted R-squared: 0.01361

P-value: <2.2e-16 (0.00000000000000022)
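
To make the fitted equation concrete, here is a small sketch that plugs unit counts into it. The coefficients are copied from the summary() output above; the helper name `predicted_injuries` is hypothetical and not part of the original analysis. For a typical two-unit crash the equation predicts about 0.22 injuries, which lines up with the residual of -0.2237 that a two-unit crash with no injuries would have.

```r
# Coefficients copied from the summary() output above
intercept <- -0.100179
slope <- 0.161922

# Hypothetical helper (not in the original analysis):
# predicted total injuries for a given number of units
predicted_injuries <- function(units) intercept + slope * units

# Predictions for a single-unit crash, a typical two-unit crash,
# and the largest crash in the data (18 units)
predicted_injuries(c(1, 2, 18))

# Adjusted R-squared expressed as a percentage of explained variation
adj_r_squared <- 0.01361
adj_r_squared * 100
```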

Diagnostic Plots:

#diagnostic plots
autoplot(lm, 1:4, nrow=2, ncol=2)

#view max values for both variables
fivenum(chicago2024crashes$INJURIES_TOTAL)
[1]  0  0  0  0 15
fivenum(chicago2024crashes$NUM_UNITS)
[1]  1  2  2  2 18

This linear model shows a statistically significant, but weak, relationship between the number of injuries and the number of units per accident. The p-value is very close to zero (<2.2e-16) and the residuals vs. fitted line is relatively horizontal. However, the adjusted R-squared reveals that only about 1.4% of the variation in injuries is explained by this linear model. This is understandable: the five-number summaries show that most crashes involved two units and no injuries, with maximums of only 15 injuries and 18 units. I suspect that broader data, covering a longer period of time or other years, would show more variation in injury and unit counts.
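
As a side note on how these numbers can be read off a fit, the sketch below runs the same kind of regression on a tiny made-up data frame. The `toy` values are invented for illustration and are not from the crash dataset; `injury_fit` is a hypothetical name, chosen so the fit does not reuse the name of the `lm()` function. Passing `data =` also gives shorter coefficient labels than the `$` form.

```r
# Toy data, made up for illustration only (not from the crash dataset)
toy <- data.frame(
  NUM_UNITS      = c(1, 2, 2, 2, 3, 2, 4, 2),
  INJURIES_TOTAL = c(0, 0, 1, 0, 1, 0, 2, 0)
)

# Fit injuries as a linear function of units, using the data argument
injury_fit <- lm(INJURIES_TOTAL ~ NUM_UNITS, data = toy)

# Pull the pieces reported in the text out of the summary object
fit_summary <- summary(injury_fit)
r2_percent <- fit_summary$r.squared * 100   # share of variation explained, in %

coef(injury_fit)   # intercept and slope
r2_percent
```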

Where do crashes occur the most?

#delete crash record ID number
chicago2024crashes <- chicago2024crashes |>
  select(!c(CRASH_RECORD_ID))

#group reports by street name and record the count
chicago2024crashes |>
  group_by(STREET_NAME) |>
  summarize(sum = n()) |>
  arrange(desc(sum)) |>
  head(10)
# A tibble: 10 × 2
   STREET_NAME      sum
   <chr>          <int>
 1 WESTERN AVE     3105
 2 PULASKI RD      2755
 3 CICERO AVE      2585
 4 ASHLAND AVE     2454
 5 HALSTED ST      2091
 6 KEDZIE AVE      2041
 7 MICHIGAN AVE    1456
 8 NORTH AVE       1256
 9 IRVING PARK RD  1246
10 CLARK ST        1238

Western Avenue is Chicago’s longest continuous street at 24 miles, so it’s no wonder most crashes occur here. (1) Pulaski Road is another major north-south street; at 39.3 miles, it slices through the middle of Chicago. There seems to be a relationship between the number of incidents and street length. To make this easier to see, I’ll create a table of the top ten streets along with their lengths according to Wikipedia. For the streets whose lengths were not listed, I measured them myself using Google Maps.

Street                                  Distance (mi)   Number of Incidents
Western Ave                             27.38           3099
Pulaski Road                            39.3            2748
Cicero Ave (Illinois Route 50)          66.49           2579
Ashland Ave                             50.14*          2450
Halsted Street                          33.66*          2088
Kedzie Ave                              >20*            2033
Michigan Ave                            >20*            1455
North Ave                               32.31           1253
Irving Park Road (Illinois Route 19)    33.64           1244
Clark Street                            ~12*            1234

* = measured myself using Google Maps

A pattern I noticed is these streets are long, major roadways or highways in Chicago. Other potential factors that could contribute to the high car accident rates may be low quality roads or inadequate signage. However, none of these are apparent for these top ten streets (as determined by looking at the roads on Google Maps).
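
One rough way to check whether length alone explains the ranking is to divide each street's incident count by its length. The sketch below does this for the seven streets in the table that have a definite length; the data frame and column names are hypothetical. Some of the listed lengths, particularly for the state routes, may include stretches outside city limits, so these rates should be treated as approximate.

```r
library(dplyr)

# Values copied from the table above; streets listed only as ">20" or "~12"
# miles are left out because their lengths are not precise
street_lengths <- data.frame(
  street    = c("Western Ave", "Pulaski Road", "Cicero Ave", "Ashland Ave",
                "Halsted Street", "North Ave", "Irving Park Road"),
  miles     = c(27.38, 39.3, 66.49, 50.14, 33.66, 32.31, 33.64),
  incidents = c(3099, 2748, 2579, 2450, 2088, 1253, 1244)
)

# Incidents per mile, highest first
per_mile <- street_lengths |>
  mutate(incidents_per_mile = incidents / miles) |>
  arrange(desc(incidents_per_mile))

per_mile
```

If length were the only factor, the rates would be roughly equal across streets, so large differences would point to other contributors.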

#data to map the streets
westernave <- read_sf("westernave.geojson")
pulaskird <- read_sf("pulaski.geojson")
ciceroave <- read_sf("ciceroave.geojson")
ashland <- read_sf("ashland.geojson")
halsted <- read_sf("halsted.geojson")
kedzie <- read_sf("kedzie.geojson")
michigan <- read_sf("michigan.geojson")
northave <- read_sf("northave.geojson")
irvingpark <- read_sf("irvingpark.geojson")
clark <- read_sf("clark.geojson")

#map the ten streets with the most crashes in 2024
leaflet() |>
  setView(lng = -87.6298, lat = 41.8781, zoom = 10) |>
  addProviderTiles("Stadia.OSMBright") |>
  addPolygons(data = westernave, color = "darkgreen",
              opacity = 1, popup = "WESTERN AVE") |>
  addPolygons(data = pulaskird, color = "red",
              opacity = 1, popup = "PULASKI RD") |>
  addPolygons(data = ciceroave, color = "white",
              opacity = 1, popup = "CICERO AVE") |>
  addPolygons(data = ashland, color = "blue",
              opacity = 1, popup = "ASHLAND AVE") |>
  addPolygons(data = halsted, color = "orange",
              opacity = 1, popup = "HALSTED ST") |>
  addPolygons(data = kedzie, color = "skyblue",
              opacity = 1, popup = "KEDZIE AVE") |>
  addPolygons(data = michigan, color = "violet",
              opacity = 1, popup = "MICHIGAN AVE") |>
  addPolygons(data = northave, color = "grey",
              opacity = 1, popup = "NORTH AVE") |>
  addPolygons(data = irvingpark, color = "yellow",
              opacity = 1, popup = "IRVING PARK RD") |>
  addPolygons(data = clark, color = "turquoise",
              opacity = 1, popup = "CLARK ST")

Here is a map of Chicago, Illinois showing the ten streets (2) where accidents occurred the most in 2024. Western Avenue is in green, Pulaski Road is in red, Cicero Avenue is in white, Ashland Avenue is in blue, Halsted Street is in orange, Kedzie Avenue is in sky blue, Michigan Avenue is in violet, North Avenue is in gray, Irving Park Road is in yellow, and Clark Street is in turquoise. The name of each street appears when you click on it. As I noted earlier, what these streets have in common is that they’re very long and continuous. Other contributing factors that may be worth exploring include how these roads fare in bad weather, or what the intersections look like along these streets.

What are the leading causes for accidents in Chicago?

Luckily, the crash reports categorize causes, so I can count how many reports list each one.

#determine leading causes
primcauses <- chicago2024crashes |>
  group_by(PRIM_CONTRIBUTORY_CAUSE) |>
  filter(PRIM_CONTRIBUTORY_CAUSE != "UNABLE TO DETERMINE", PRIM_CONTRIBUTORY_CAUSE != "NOT APPLICABLE") |>
  summarize(sumPrimCause = n()) |>
  arrange(desc(sumPrimCause)) 

#treemap
#treemap(primcauses,
#        index = "PRIM_CONTRIBUTORY_CAUSE",
#        vSize = "sumPrimCause",
#        type = "index",
#        title = "Primary Causes for Accidents")

Above is a treemap demonstrating the primary causes listed for each accident, with ambiguous entries like “not applicable” and “unable to determine” excluded. The size of each square represents how many reports have that primary cause. Below is a similar treemap that represents the secondary contributory causes.

seccauses <- chicago2024crashes |>
  group_by(SEC_CONTRIBUTORY_CAUSE) |>
  filter(SEC_CONTRIBUTORY_CAUSE != "UNABLE TO DETERMINE", SEC_CONTRIBUTORY_CAUSE != "NOT APPLICABLE") |>
  summarize(sumSecCause = n()) |>
  arrange(desc(sumSecCause)) 

#treemap
#treemap(seccauses,
#        index = "SEC_CONTRIBUTORY_CAUSE",
#        vSize = "sumSecCause",
#        type = "index",
#        title = "Secondary Causes for Accidents")
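
The two cause tallies above follow the same group, count, and sort pattern, which dplyr's `count()` can express in one step. This is a sketch on a small made-up data frame, not the crash data; `toy_crashes`, `excluded`, and `primcauses_toy` are hypothetical names, and the cause labels are only sample values.

```r
library(dplyr)

# Toy data, made up for illustration only
toy_crashes <- data.frame(
  PRIM_CONTRIBUTORY_CAUSE = c(
    "FAILING TO YIELD RIGHT-OF-WAY", "UNABLE TO DETERMINE",
    "FOLLOWING TOO CLOSELY", "FAILING TO YIELD RIGHT-OF-WAY",
    "NOT APPLICABLE", "WEATHER", NA
  )
)

# Ambiguous entries to leave out of the tally
excluded <- c("UNABLE TO DETERMINE", "NOT APPLICABLE")

# count() groups, tallies, and (with sort = TRUE) orders from most to least
primcauses_toy <- toy_crashes |>
  filter(!is.na(PRIM_CONTRIBUTORY_CAUSE),
         !PRIM_CONTRIBUTORY_CAUSE %in% excluded) |>
  count(PRIM_CONTRIBUTORY_CAUSE, sort = TRUE, name = "sumPrimCause")

primcauses_toy
```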

Why is “failing to yield right-of-way” the most common cause for accidents?

I’ve been driving for more than two years now and, in my experience, drivers can fail to yield on practically any roadway. It could be a major intersection with a traffic light or even the parking lot of a grocery store. To get a grasp of the setting for these accidents, I’ll explore the “trafficway type” variable, which is the type of roadway where a crash occurred.

#select only reports w the most prevalent cause
primcause <- chicago2024crashes |>
  filter(PRIM_CONTRIBUTORY_CAUSE == "FAILING TO YIELD RIGHT-OF-WAY")

#which trafficway do they occur the most at
#primcause |>
#  ggplot(aes(y = TRAFFICWAY_TYPE, fill = TRAFFIC_CONTROL_DEVICE)) +
#  geom_bar()

Out of all the trafficways, most accidents occurred on roads that were not divided or divided with a flat median, and at four-way, one-way, and T-intersections. For more information, I included the type of traffic control device present, represented by color.

#filter to keep just crashes at these trafficways
primcause2 <- primcause |>
  filter(TRAFFICWAY_TYPE %in% c("NOT DIVIDED", "ONE-WAY", "FOUR WAY", "T-INTERSECTION", "DIVIDED - W/MEDIAN (NOT RAISED)"))

#bar graph
primcause2 |>
  ggplot(aes(y = TRAFFICWAY_TYPE, fill = TRAFFIC_CONTROL_DEVICE)) +
  geom_bar() +
  scale_fill_manual(values = c("hotpink", "aquamarine", "black", "blue", "green", "violet", "red", "darkred", "brown", "orange", "darkorange", "darkblue", "purple", "yellow", "gray", "lightblue", "limegreen")) +
  theme_classic() +
  labs(x = "Number of Accidents",
       y = "Trafficway type",
       title = "Trafficway type vs.\nTraffic Control Device",
       caption = "Source: Chicago PD",
       fill = "Traffic Control")

This chart reveals that, out of the selected trafficway types, crashes occurred the most at stop signs, at traffic lights, or where there were no traffic control devices at all. Beyond that, it’s hard to draw conclusions because there are so many categories of traffic control device. Next, I’ll explore the other categorical variables in this dataset.
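
One possible way to tame a legend with many categories is to keep the most frequent traffic control devices and lump the rest together before plotting. The sketch below uses `fct_lump_n()` from forcats on a small made-up data frame; the toy values, the `CONTROL_GROUP` column, and the "ALL OTHER DEVICES" label are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(forcats)

# Toy data, made up for illustration only
toy_devices <- data.frame(
  TRAFFIC_CONTROL_DEVICE = c(
    rep("NO CONTROLS", 5), rep("STOP SIGN/FLASHER", 4),
    rep("TRAFFIC SIGNAL", 3), "YIELD", "OTHER WARNING SIGN",
    "RAILROAD CROSSING GATE"
  )
)

# Keep the 3 most common devices; everything else becomes one level
lumped <- toy_devices |>
  mutate(CONTROL_GROUP = fct_lump_n(TRAFFIC_CONTROL_DEVICE, n = 3,
                                    other_level = "ALL OTHER DEVICES")) |>
  count(CONTROL_GROUP, sort = TRUE)

lumped
```

Mapping `fill` to a lumped column like `CONTROL_GROUP` would leave the bar chart with four colors instead of seventeen, at the cost of hiding the rarer devices.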

What were the conditions for accidents caused by weather?

Weather was a less common, yet relevant, contributory cause listed in the data. According to the Federal Highway Administration, approximately 21% of crashes are weather-related. (3) This includes fog, rain, and snow, which can affect road conditions and impair the ability to drive. The Chicago E-Crash reports include information on road, weather, and lighting conditions at the time of the accident. I’m curious to see how these variables may overlap.

#filter only for accidents attributed to weather
crashesweather <- chicago2024crashes |>
  filter(PRIM_CONTRIBUTORY_CAUSE == "WEATHER" | SEC_CONTRIBUTORY_CAUSE == "WEATHER")

#heat map w road condition, light condition, & weather
crashesweather |>
  ggplot(aes(x = ROADWAY_SURFACE_COND, y = WEATHER_CONDITION, fill = LIGHTING_CONDITION)) +
  geom_tile(color = "black") +
  theme_dark() +
  scale_fill_brewer(palette = "BrBG") +
  labs(x = "Road Surface Condition",
       y = "Weather Condition",
       fill = "Lighting Condition",
       caption = "Data from Chicago PD",
       title = "Accidents Caused by Weather") +
    theme(
    axis.text.x = element_text(angle = 55, vjust = 1, hjust = 1)
  )

This tile map demonstrates the different road conditions, lighting, and weather present for all accidents attributed to weather. On the map, most weather-related accidents appear to have occurred in low lighting and on wet roads. By count, though, it’s very interesting to me that more weather-related crashes occurred in daylight than in darkness (23 vs. 19). I would assume that snowy weather would be the main contributor to car accidents in Chicago; however, that doesn’t seem to be the case. According to the National Weather Service, 2024 was actually the warmest year on record for Chicago (4), which may explain the missing data in the snowy weather categories.
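
Because a tile plot draws one tile per row, rows that share the same road and weather combination are drawn on top of each other, and only one lighting color stays visible in that cell. A sketch of how the counts behind the plot could be tallied directly is shown below, using a small made-up data frame. The toy rows and the names `lighting_counts` and `cell_counts` are hypothetical.

```r
library(dplyr)

# Toy data, made up for illustration only
toy_weather <- data.frame(
  ROADWAY_SURFACE_COND = c("WET", "WET", "SNOW OR SLUSH", "WET", "ICE"),
  WEATHER_CONDITION    = c("RAIN", "RAIN", "SNOW", "RAIN",
                           "FREEZING RAIN/DRIZZLE"),
  LIGHTING_CONDITION   = c("DAYLIGHT", "DARKNESS, LIGHTED ROAD", "DAYLIGHT",
                           "DAYLIGHT", "DARKNESS")
)

# How many weather-attributed crashes fall under each lighting condition
lighting_counts <- toy_weather |>
  count(LIGHTING_CONDITION, sort = TRUE)

# How many crashes fall in each road-surface / weather cell of the tile map
cell_counts <- toy_weather |>
  count(ROADWAY_SURFACE_COND, WEATHER_CONDITION, name = "crashes")

lighting_counts
cell_counts
```

A table like `cell_counts` could also be passed to `geom_tile()` with `fill = crashes`, so that color shows how many crashes fall in each cell.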

When do accidents occur the most?

#make another column to sort accidents by month & level them in chronological order
#each pattern is anchored to the start of the string (^) so it matches the month, not the day
crashesbymonth <- chicago2024crashes |>
  mutate(MONTH = case_when(
    str_detect(CRASH_DATE, pattern = "^12/") ~ "December",
    str_detect(CRASH_DATE, pattern = "^11/") ~ "November",
    str_detect(CRASH_DATE, pattern = "^10/") ~ "October",
    str_detect(CRASH_DATE, pattern = "^0?9/") ~ "September",
    str_detect(CRASH_DATE, pattern = "^0?8/") ~ "August",
    str_detect(CRASH_DATE, pattern = "^0?7/") ~ "July",
    str_detect(CRASH_DATE, pattern = "^0?6/") ~ "June",
    str_detect(CRASH_DATE, pattern = "^0?5/") ~ "May",
    str_detect(CRASH_DATE, pattern = "^0?4/") ~ "April",
    str_detect(CRASH_DATE, pattern = "^0?3/") ~ "March",
    str_detect(CRASH_DATE, pattern = "^0?2/") ~ "February",
    str_detect(CRASH_DATE, pattern = "^0?1/") ~ "January")) |>
  mutate(MONTH = fct_relevel(MONTH, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

#barplot with weather condition by color
crashesbymonth |>
  ggplot(aes(y = MONTH, fill = WEATHER_CONDITION)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Number of Crash Reports",
       y = "Month",
       title = "Chicago, IL (2024) - Traffic Crashes by month",
       caption = "Source: Chicago PD",
       fill = "Weather Condition") +
  theme_bw()

Splitting the data up by month allows us to see how the number of crashes varies with the time of year. This bar plot further illustrates that accidents don’t occur the most in snowy weather. Instead, traffic crashes are more prevalent in early fall in clear weather. This is no surprise, as we already know that weather wasn’t the most prominent cause of car accidents in 2024. I assume that the rise in car accidents during this season may be due to increased traffic from travelers and tourists in the city.
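
An alternative to matching on the text of CRASH_DATE is to parse it into a real date-time and extract the month from that. The sketch below uses lubridate and assumes the column follows the "MM/DD/YYYY hh:mm:ss AM/PM" pattern suggested by the preview of the data; the three sample strings are made up. The first and third samples have a day of 12 and 9, which is the kind of value a bare "12/" or "9/" text match could mistake for a month.

```r
library(lubridate)

# Made-up sample strings in the assumed MM/DD/YYYY hh:mm:ss AM/PM format
crash_dates <- c("09/12/2024 03:15:00 PM",
                 "12/31/2024 11:59:00 PM",
                 "01/09/2024 08:00:00 AM")

# Parse month/day/year plus time into date-time values
parsed <- mdy_hms(crash_dates)

# Month as a number (1-12)
crash_month <- month(parsed)

# Month as an ordered factor with full names, already in calendar order
month_label <- month(parsed, label = TRUE, abbr = FALSE)

crash_month
month_label
```

Because `month(..., label = TRUE)` returns an ordered factor, a plot built on it would list the months chronologically without a separate releveling step.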

Sources:

  1. Western Avenue: Streets of Chicago, Apr. 6, 2015
  2. Street Center Lines in Chicago database https://data.cityofchicago.org/Transportation/transportation/pr57-gg9e/about_data
  3. How Do Weather Events Impact Roads? https://ops.fhwa.dot.gov/weather/q1_roadimpact.htm, Last updated Sep. 6, 2024
  4. 2024 Calendar Year Climate Summary - Chicago, IL https://www.weather.gov/lot/2024AnnualClimate#:~:text=2024%20finished%20as%20the%20warmest,both%20locations%20were%20below%20normal.