Hello,
In this report we will explore the automobile crash dataset for the city of Chicago.
The data can be obtained from the Chicago Data Portal (see Sources).
Through the use of data visualization, we will look into what may typically cause more accidents on Chicago’s roads.
Specifically, we will investigate whether lighting conditions are a factor in the number of hit and runs, and whether higher speed limits are associated with a larger share of accidents that result in injuries.
Before we begin our analysis we need to load the packages we will be using. For this analysis we will load the following: tidyverse, here, skimr, and janitor.
library(tidyverse)
library(here)
library(skimr)
library(janitor)
Next, we will import our dataset containing the crash data. We will use a fresh dataset that has been downloaded from the Chicago Data Portal. Since this is a raw dataset, some data wrangling will be necessary.
We will import our dataset using read_csv with a nested here function. The here function builds the file path relative to the project root, so the code does not depend on the current working directory. The read_csv function then reads the CSV file into a data frame.
plot_crashes <- read_csv(here("data", "Traffic_Crashes_-_Crashes_20260318.csv")) #importing the raw crash data downloaded from the Chicago Data Portal
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 108885 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (31): CRASH_RECORD_ID, CRASH_DATE_EST_I, CRASH_DATE, TRAFFIC_CONTROL_DEV...
## dbl (16): POSTED_SPEED_LIMIT, STREET_NO, BEAT_OF_OCCURRENCE, NUM_UNITS, INJU...
## lgl (1): LANE_CNT
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
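The parsing warning and the column specification message above can both be handled by declaring column types up front. The following is a minimal sketch on a small temporary CSV with made-up values, not the real crash file; the name toy_path and the three columns chosen are assumptions for illustration only.

```r
library(readr)

# Toy CSV written to a temporary file; the values are invented for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "CRASH_RECORD_ID,POSTED_SPEED_LIMIT,HIT_AND_RUN_I",
    "a1,30,Y",
    "a2,unknown,",
    "a3,45,N"
  ),
  toy_path
)

# Declaring the column types quiets the column specification message and makes
# unexpected values show up as parsing problems rather than as a guessed type.
toy_crashes <- read_csv(
  toy_path,
  col_types = cols(
    CRASH_RECORD_ID = col_character(),
    POSTED_SPEED_LIMIT = col_double(),
    HIT_AND_RUN_I = col_character()
  )
)

# List the cells that could not be parsed ("unknown" is not a number).
problems(toy_crashes)
```

On the real file, the same problems() call on plot_crashes would show which rows triggered the warning. The logical guess for LANE_CNT may be one source, though that is an assumption that would need checking against the problems() output.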
Now that we have our dataframe loaded, let’s take a quick look at the columns to ensure everything is cleaned up before proceeding.
names(plot_crashes) #looking at the column names in the plot_crashes dataframe
## [1] "CRASH_RECORD_ID" "CRASH_DATE_EST_I"
## [3] "CRASH_DATE" "POSTED_SPEED_LIMIT"
## [5] "TRAFFIC_CONTROL_DEVICE" "DEVICE_CONDITION"
## [7] "WEATHER_CONDITION" "LIGHTING_CONDITION"
## [9] "FIRST_CRASH_TYPE" "TRAFFICWAY_TYPE"
## [11] "LANE_CNT" "ALIGNMENT"
## [13] "ROADWAY_SURFACE_COND" "ROAD_DEFECT"
## [15] "REPORT_TYPE" "CRASH_TYPE"
## [17] "INTERSECTION_RELATED_I" "NOT_RIGHT_OF_WAY_I"
## [19] "HIT_AND_RUN_I" "DAMAGE"
## [21] "DATE_POLICE_NOTIFIED" "PRIM_CONTRIBUTORY_CAUSE"
## [23] "SEC_CONTRIBUTORY_CAUSE" "STREET_NO"
## [25] "STREET_DIRECTION" "STREET_NAME"
## [27] "BEAT_OF_OCCURRENCE" "PHOTOS_TAKEN_I"
## [29] "STATEMENTS_TAKEN_I" "DOORING_I"
## [31] "WORK_ZONE_I" "WORK_ZONE_TYPE"
## [33] "WORKERS_PRESENT_I" "NUM_UNITS"
## [35] "MOST_SEVERE_INJURY" "INJURIES_TOTAL"
## [37] "INJURIES_FATAL" "INJURIES_INCAPACITATING"
## [39] "INJURIES_NON_INCAPACITATING" "INJURIES_REPORTED_NOT_EVIDENT"
## [41] "INJURIES_NO_INDICATION" "INJURIES_UNKNOWN"
## [43] "CRASH_HOUR" "CRASH_DAY_OF_WEEK"
## [45] "CRASH_MONTH" "LATITUDE"
## [47] "LONGITUDE" "LOCATION"
Some of the column names could use some tidying up before proceeding. We will use the clean_names and rename functions to tidy them. We pipe plot_crashes through clean_names and rename and assign the result to cleaned_plot_crashes.
cleaned_plot_crashes <- plot_crashes %>%
clean_names() %>%
rename(hit_and_run = hit_and_run_i) %>% #renaming the hit_and_run_i column
rename(work_zone = work_zone_i) %>% #renaming the work_zone_i column
rename(workers_present = workers_present_i) #renaming the workers_present_i column
names(cleaned_plot_crashes) #showing the column names in the dataset
## [1] "crash_record_id" "crash_date_est_i"
## [3] "crash_date" "posted_speed_limit"
## [5] "traffic_control_device" "device_condition"
## [7] "weather_condition" "lighting_condition"
## [9] "first_crash_type" "trafficway_type"
## [11] "lane_cnt" "alignment"
## [13] "roadway_surface_cond" "road_defect"
## [15] "report_type" "crash_type"
## [17] "intersection_related_i" "not_right_of_way_i"
## [19] "hit_and_run" "damage"
## [21] "date_police_notified" "prim_contributory_cause"
## [23] "sec_contributory_cause" "street_no"
## [25] "street_direction" "street_name"
## [27] "beat_of_occurrence" "photos_taken_i"
## [29] "statements_taken_i" "dooring_i"
## [31] "work_zone" "work_zone_type"
## [33] "workers_present" "num_units"
## [35] "most_severe_injury" "injuries_total"
## [37] "injuries_fatal" "injuries_incapacitating"
## [39] "injuries_non_incapacitating" "injuries_reported_not_evident"
## [41] "injuries_no_indication" "injuries_unknown"
## [43] "crash_hour" "crash_day_of_week"
## [45] "crash_month" "latitude"
## [47] "longitude" "location"
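As a compact illustration of the renaming step, the sketch below applies clean_names and a single rename call to a toy data frame. The data frame toy_raw and its values are made up; only the column names mirror the real dataset. Passing all three name pairs to one rename call is equivalent to chaining three separate calls.

```r
library(dplyr)
library(tibble)
library(janitor)

# Toy data frame with raw-style column names (invented values).
toy_raw <- tibble(
  CRASH_RECORD_ID = c("a1", "a2"),
  HIT_AND_RUN_I = c("Y", NA),
  WORK_ZONE_I = c(NA, "Y"),
  WORKERS_PRESENT_I = c(NA, "N")
)

toy_clean <- toy_raw %>%
  clean_names() %>% # convert the names to lower snake_case
  rename(
    hit_and_run = hit_and_run_i, # new_name = old_name
    work_zone = work_zone_i,
    workers_present = workers_present_i
  )

names(toy_clean)
```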
Now that our data has been imported and cleaned up, we can begin our analysis. First let’s take a look at the number of hit and runs associated with each lighting condition.
To accomplish this we will pipe cleaned_plot_crashes through a short chain of functions. First, we keep only the hit and runs by filtering for rows that contain a “Y” in the hit_and_run column. Next, we use ggplot from the tidyverse to create a bar chart by lighting condition, adding geom_bar to the ggplot call. Lastly, we add coord_flip to place the lighting conditions on the vertical axis, which leaves room for the category names.
cleaned_plot_crashes %>% #piping the cleaned_plot_crashes through the below functions
filter(hit_and_run == "Y") %>% #filtering for hit_and_runs
ggplot(aes(x = lighting_condition, fill = lighting_condition)) + #setting the aesthetics for the plot
geom_bar() + #creating a bar chart
coord_flip() + #flipping the coordinates
labs(
title = "Lighting Conditions and Hit and Runs",
subtitle = "Hit and Runs Grouped by Lighting Condition",
x = "Lighting Condition",
y = "Number of Hit and Runs",
caption = "(Chicago Traffic Accident Data from 2025)"
)
Figure 1: Bar Chart of Number of Hit and Runs Grouped by Lighting Conditions in Which They Occurred
The above bar chart displays the number of hit and runs for each lighting condition. Daylight has the most hit and runs. This may reflect an underlying factor, such as more accidents of every kind happening in daylight, rather than anything specific to hit and runs. We can investigate further by removing the filter for hit and runs and running the analysis again. Without the filter, the bar chart shows the total number of observations for each lighting condition.
cleaned_plot_crashes %>% #piping the cleaned_plot_crashes through the below functions
ggplot(aes(x = lighting_condition, fill = lighting_condition)) + #setting the aesthetics for the plot
geom_bar() + #creating a bar chart
coord_flip() + #flipping the coordinates
labs(
title = "Lighting Conditions and Number of Accidents",
subtitle = "Number of Traffic Accidents Grouped by Lighting Condition",
x = "Lighting Condition",
y = "Number of Accidents",
caption = "(Chicago Traffic Accident Data from 2025)"
)
Figure 2: Bar Chart of Number of Accidents Grouped by Lighting Conditions in Which They Occurred
The above bar chart details the total number of traffic accidents for each lighting condition. Most traffic accidents happen in daylight, which matches the hit and run pattern in Figure 1. My hypothesis is that more drivers are on the road during daylight hours. We do not have data on the number of vehicles on the road under each lighting condition, so this cannot be confirmed here. More drivers on the road would plausibly mean more accidents and, in turn, more hit and runs.
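Because the raw counts in both charts track overall traffic volume, one way to compare lighting conditions more fairly is to look at the share of accidents that are hit and runs. The sketch below does this on a toy data frame with invented values. The names toy_crashes, hit_and_run_rates, and pct_hit_and_run are assumptions for illustration, and this is not a step the report itself performs.

```r
library(dplyr)
library(tibble)

# Toy data (invented values); a missing hit_and_run value means not flagged.
toy_crashes <- tribble(
  ~lighting_condition, ~hit_and_run,
  "DAYLIGHT",          "Y",
  "DAYLIGHT",          NA,
  "DAYLIGHT",          NA,
  "DAYLIGHT",          "N",
  "DARKNESS",          "Y",
  "DARKNESS",          NA
)

hit_and_run_rates <- toy_crashes %>%
  group_by(lighting_condition) %>%
  summarise(
    num_accidents = n(),
    # %in% treats NA as "not a hit and run" instead of propagating NA
    num_hit_and_runs = sum(hit_and_run %in% "Y")
  ) %>%
  mutate(pct_hit_and_run = 100 * num_hit_and_runs / num_accidents)

hit_and_run_rates
```

In this toy data, daylight has more accidents but a lower hit and run share than darkness. Whether the real data shows a similar gap would have to be checked by running the same steps on cleaned_plot_crashes.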
Next, let’s look into injuries. We will create a scatter plot looking at the percentage of injuries resulting from traffic accidents at various speed limits.
To create the scatter plot we will make use of the group_by, summarise, and mutate functions. First, we group the data by posted_speed_limit. Next, we use summarise to count, for each group, the number of accidents and the number of accidents with an injury total greater than zero. To generate the percentage of traffic accidents resulting in injuries, we use mutate to create a new column: the number of accidents with injuries divided by the number of accidents. This result is multiplied by 100 to get the percentage.
cleaned_plot_crashes %>%
group_by(posted_speed_limit) %>% #group the remaining observations by speed limit
summarise(num_accidents = n(), num_total_injuries = sum(injuries_total > 0)) %>%
mutate(percentage_of_injuries = num_total_injuries / num_accidents) %>%
ggplot(aes( x = posted_speed_limit, y = percentage_of_injuries * 100)) + #create the aesthetics for the visualization
geom_point() + #create the scatter plot
geom_smooth(se = FALSE ) +
ylim(0,100) +
labs(
title = "Percentage of Accidents with Injuries",
subtitle = "Percentage of Accidents with Injuries Grouped By Posted Speed Limit",
y = "Percentage of Accidents with Injuries",
x = "Posted Speed Limit",
caption = "(Chicago Traffic Accident Data from 2025)"
)
Figure 3: Scatterplot of Percentage of Accidents with Injuries Along the Posted Speed Limit
The above scatter plot reveals an interesting pattern. Despite some outliers, the percentage of accidents with injuries appears to rise with the posted speed limit. What might be causing the outlier around 30 miles per hour? One possible factor is that Chicago’s default speed limit is 30 miles per hour (Spielman 2025).
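When reading outliers in a plot like this, it can help to check how many accidents sit behind each point, since a percentage computed from a handful of crashes is much noisier than one computed from thousands. The sketch below repeats the group_by, summarise, and mutate steps on toy data with invented values and adds a flag for small groups. The cutoff min_accidents and the column small_group are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Toy data (invented values).
toy_crashes <- tibble(
  posted_speed_limit = c(30, 30, 30, 30, 45, 45, 60),
  injuries_total = c(0, 1, 0, 0, 2, 0, 1)
)

min_accidents <- 3 # hypothetical cutoff for "too few accidents to trust"

injury_rates <- toy_crashes %>%
  group_by(posted_speed_limit) %>%
  summarise(
    num_accidents = n(),
    # count accidents with at least one injury; na.rm skips missing totals
    num_with_injuries = sum(injuries_total > 0, na.rm = TRUE)
  ) %>%
  mutate(
    pct_with_injuries = 100 * num_with_injuries / num_accidents,
    small_group = num_accidents < min_accidents
  )

injury_rates
```

In the toy data, the 60 miles per hour group shows 100 percent, but it rests on a single accident, which is exactly the kind of point the small_group flag is meant to surface.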
Sources:
City of Chicago. (2025). Traffic Accidents. Chicago Data Portal. Retrieved March 19, 2026, from https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data
Spielman, Fran. (2025). City Council Votes Down Lower Speed Limit. Chicago Sun-Times. Retrieved March 19, 2026, from https://chicago.suntimes.com/city-hall/2025/02/19/city-council-votes-down-lower-speed-limit-united-center-unregistered-car-sales-settlements