For the Final Project, I decided to explore and analyze the Traffic Crash Reports (CPD) data set. This data set contains several records of crashes reported in Cincinnati that the Cincinnati Police Department responded to. The data set is available on City of Cincinnati open data and was last updated on February 18, 2025.
rm(list = ls()) # remove everything from current environment
library(tidyverse)
library(lubridate)
library(ggTimeSeries)
library(sugrrants)
library(gridExtra)
library(ggmap)
library(cowplot)
library(here)
library(knitr)
data <- read_csv(here('Final Project/Traffic_Crash_Reports_CPD.csv'))
Each row in this data set is a record of a response by the Cincinnati Police Department(CPD) to an accident. The table below shows a description for each column in the data set.
col_names <- names(data)
col_description <- c('Location of the accident',
'Approxamite latitude coordinates of the location of the accident',
'Approxamite longitude coordinates of the location of the accident',
'Age of the person involved in the accident',
'Community Council neighborhood in which this accident occured',
'CPD Neighborhood in which this accident occured',
'SNA Neighborhood in which this accident occured',
'Date the crash occured',
'Type of right of way that crash occured in (i.e. intersection, on ramp, driveway)',
'Classifies the crash as resulting in a fatality, injury or property damage',
'ID that Classifies the crash as resulting in a fatality, injury or property damage',
'Date the crash was reported to CPD',
'Day of the week the crash occured on',
'Gender of the person involved in the accident',
'Category of the injuries sustained to the person involved in the accident',
'Unique identifier to distinguish each crash record',
'Light conditions of the roadway at the time of the accident',
'Report number (unique identifier for each individual crash',
'The manner in which the crash occured',
'Condition of the roadway at the time of the accident',
'Contour of the roadway at the location of the accident',
'Road surface type',
'Class of the road on which the accident occured',
'Road class description',
'Type of unit/automobile that was involved in the accident',
'Description of the person involvement in the accident: driver, occupant or a pedestrain',
'Type of weather conditions at the time of the accident',
'Zip code of the location of the accident')
col_info <- data.frame(
Column_Names = col_names,
Description = col_description
)
kable(col_info)
| Column_Names | Description |
|---|---|
| ADDRESS_X | Location of the accident |
| LATITUDE_X | Approxamite latitude coordinates of the location of the accident |
| LONGITUDE_X | Approxamite longitude coordinates of the location of the accident |
| AGE | Age of the person involved in the accident |
| COMMUNITY_COUNCIL_NEIGHBORHOOD | Community Council neighborhood in which this accident occured |
| CPD_NEIGHBORHOOD | CPD Neighborhood in which this accident occured |
| SNA_NEIGHBORHOOD | SNA Neighborhood in which this accident occured |
| CRASHDATE | Date the crash occured |
| CRASHLOCATION | Type of right of way that crash occured in (i.e. intersection, on ramp, driveway) |
| CRASHSEVERITY | Classifies the crash as resulting in a fatality, injury or property damage |
| CRASHSEVERITYID | ID that Classifies the crash as resulting in a fatality, injury or property damage |
| DATECRASHREPORTED | Date the crash was reported to CPD |
| DAYOFWEEK | Day of the week the crash occured on |
| GENDER | Gender of the person involved in the accident |
| INJURIES | Category of the injuries sustained to the person involved in the accident |
| INSTANCEID | Unique identifier to distinguish each crash record |
| LIGHTCONDITIONSPRIMARY | Light conditions of the roadway at the time of the accident |
| LOCALREPORTNO | Report number (unique identifier for each individual crash |
| MANNEROFCRASH | The manner in which the crash occured |
| ROADCONDITIONSPRIMARY | Condition of the roadway at the time of the accident |
| ROADCONTOUR | Contour of the roadway at the location of the accident |
| ROADSURFACE | Road surface type |
| ROADCLASS | Class of the road on which the accident occured |
| ROADCLASSDESC | Road class description |
| UNITTYPE | Type of unit/automobile that was involved in the accident |
| TYPEOFPERSON | Description of the person involvement in the accident: driver, occupant or a pedestrain |
| WEATHER | Type of weather conditions at the time of the accident |
| ZIP | Zip code of the location of the accident |
Data Set Dimensions: This data set has 391912 rows and 28 columns.
Duplicate Rows: There are 3 duplicate rows in this data set.
Total Missing Values: The data set has a total of 605328 missing values.
Missing Values Per Column:
| x | |
|---|---|
| ADDRESS_X | 24 |
| LATITUDE_X | 26 |
| LONGITUDE_X | 28 |
| AGE | 49397 |
| COMMUNITY_COUNCIL_NEIGHBORHOOD | 0 |
| CPD_NEIGHBORHOOD | 0 |
| SNA_NEIGHBORHOOD | 0 |
| CRASHDATE | 19 |
| CRASHLOCATION | 197891 |
| CRASHSEVERITY | 5 |
| CRASHSEVERITYID | 5 |
| DATECRASHREPORTED | 26 |
| DAYOFWEEK | 17 |
| GENDER | 44875 |
| INJURIES | 346 |
| INSTANCEID | 0 |
| LIGHTCONDITIONSPRIMARY | 23 |
| LOCALREPORTNO | 0 |
| MANNEROFCRASH | 22 |
| ROADCONDITIONSPRIMARY | 24 |
| ROADCONTOUR | 21 |
| ROADSURFACE | 23 |
| ROADCLASS | 152292 |
| ROADCLASSDESC | 152389 |
| UNITTYPE | 350 |
| TYPEOFPERSON | 336 |
| WEATHER | 24 |
| ZIP | 7165 |
While this data set will need a lot of data wrangling due the the missing values, it should still help provide some insight into the recent accidents that have occurred across Cincinnati.
When looking at this data set, a few questions came to my mind that would be interesting to explore.
Prior to being able to explore these questions, we first need to perform some data tidying in order to perform our analysis.
# split date and time from the CRASHDATE column
data <- separate(data, CRASHDATE, into = c('CRASHDATE', "CRASHTIME"), sep = " ")
# trim off any trailing spaces in the CRASHDATE column
data$CRASHDATE <- trimws(data$CRASHDATE)
# remove missing values from the CRASHDATE column
data <- data %>% filter(!is.na(CRASHDATE))
# change to date datatype
data$CRASHDATE <- as.Date(data$CRASHDATE, format = '%m/%d/%Y', na.rm = TRUE)
data %>%
group_by(SNA_NEIGHBORHOOD) %>%
summarize(count = n()) %>%
ggplot() +
geom_col(aes(x = count, y = reorder(SNA_NEIGHBORHOOD, -count)), fill = 'darkred') +
scale_x_continuous(breaks = c(0, 5000, 10000, 15000, 20000, 25000, 30000),
labels = c('0', '5000', '10000', '15000', '20000', '25000', '30000')) +
labs(
x = 'Number of Traffic Crashes',
y = 'Appoximate Neighborhood',
)
As the graph shows above, Westwood and Downtown have the highest number
of reported traffic crashes. This count spans over decades of data. Even
if you filter to only see traffic crashes that occurred in 2025, most
traffic crashes still took place in the same two neighborhoods. I would
recommend looking deeper into the road conditions, demographics, and
traffic flows of these neighborhoods to determine if any
improvements/adjustments could be made to prevent crashes from
occurring.
data %>%
mutate(MONTH = month(CRASHDATE)) %>%
group_by(MONTH) %>%
summarize(count = n()) %>%
ggplot() +
geom_col(aes(x = MONTH, y = count), fill = 'darkblue') +
scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
labels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')) +
labs(
title = 'Number of Crashes Per Month',
x = '',
y = 'Number of Crashes'
)
With almost 400,000 reported traffic crashes, the number of traffic crashes in the month of October is higher than in any other month. However, this rose the question of why these traffic crashes occurred? Could it be because of the weather conditions during this time of year?
Since most traffic crashes have occurred in October, I decided to explore the weather conditions of this past October.
data %>%
filter(CRASHDATE <= ymd('2024-10-31'),
CRASHDATE >= ymd('2024-10-01')) %>% # only month of Oct.
group_by(CRASHDATE, WEATHER) %>%
summarize(count = n()) %>%
ggplot() +
geom_col(aes(x = WEATHER, y = count, fill = WEATHER, group = CRASHDATE)) +
facet_calendar(~CRASHDATE) +
theme(axis.text.x = element_blank()) +
scale_fill_manual(breaks = c('1 - CLEAR',
'2 - CLOUDY',
'3 - FOG, SMOG, SMOKE',
'4 - RAIN',
'5 - SLEET, HAIL',
'6 - SNOW',
'7 - SEVERE CROSSWINDS',
'8 - BLOWING SAND, SOIL, DIRT, SNOW',
'9 - FREEZING RAIN, FREEZING DRIZZLE',
'99 - OTHER/UNKNOWN'),
labels = c('Clear',
'Cloudy',
'Fog/Smog/Smoke',
'Rain',
'Sleet/Hail',
'Snow',
'Sever Crosswinds',
'Blowing Sand/Soil/Dirt/Snow',
'Freezing Rain/Drizzle',
'Other'),
values = c('salmon', 'lightblue', 'darkgrey', 'darkblue', 'yellow', 'white', 'purple', 'brown', 'green', 'black')) +
labs(
title = 'Traffic Crashes in October of 2024',
x = '',
y = 'Number of Crashes'
)
Surprisingly, most of the traffic crashes in October of 2024 occurred when the weather was clear. The only day that majority of crashes took place during rainy weather was on Halloween. This analysis disproves there being a strong correlation between the number of crashes and poor weather conditions. There doesn’t appear to be any weekday vs. weekend trend either.
This begs the question of what other conditions could have influenced these traffic crashes. Further analysis would be needed to confirm if there is any strong correlation between other conditions and the number of traffic crashes in October.
Because of the vast amount of data included in this data set, I decided to look into where most crashes have occurred so far in 2025. I looked into where the most severe injuries have occurred as well.
ggmap::register_google(key = 'AIzaSyBVC4hbOlw9oAD3xi1rtIs6R4jnoF4YPNA')
# import map of Cincinnati
cinci_map = get_map(location = 'Cincinnati',
zoom = 15, source = 'google',
maptype = 'terrain')
# filter data for crashes that occured in 2025
data2025 <- data %>%
mutate(Year = year(CRASHDATE)) %>%
filter(Year == '2025')
# create visualization to show the location and severity of crashes in 2025
v1 <- ggmap(cinci_map) +
geom_point(data = data2025,
aes(x = LONGITUDE_X, y = LATITUDE_X, color = CRASHSEVERITY, size = CRASHSEVERITY)) +
scale_size_manual(breaks = c('1 - FATAL', '2 - SERIOUS INJURY SUSPECTED', '3 - MINOR INJURY SUSPECTED', '4 - INJURY POSSIBLE', '5 - PROPERTY DAMAGE ONLY'),
values = c(6, 5, 3, 2, 1),
labels = c('Fatal', 'Serious Injury Suspected', 'Minor Injury Suspected', 'Injury Possible', 'Property Damage Only')) +
scale_color_manual(breaks = c('1 - FATAL', '2 - SERIOUS INJURY SUSPECTED', '3 - MINOR INJURY SUSPECTED', '4 - INJURY POSSIBLE', '5 - PROPERTY DAMAGE ONLY'),
values = c('purple', 'red', 'orange', 'yellow', 'darkgreen'),
labels = c('Fatal', 'Serious Injury Suspected', 'Minor Injury Suspected', 'Injury Possible', 'Property Damage Only')) +
theme(legend.position = 'none') +
labs(
x = 'Latitude',
y = 'Longitude',
color = 'Crash Severity',
size = 'Crash Severity'
)
# extract legend
legend <- get_legend(v1 + theme(legend.position = 'right'))
# create visualization to show the density of where crashes have taken place in 2025
v2 <- ggmap(cinci_map) +
geom_density_2d(data = data2025,
aes(x = LONGITUDE_X, y = LATITUDE_X), linewidth = 1.25) +
labs(
x = 'Latitude',
y = 'Longitude',
)
grid.arrange(v2, v1, legend, ncol = 3)
The density plot above shows that most traffic crashes have occurred around The Contemporary Arts Center and around the main I-71 intersection by the Great American Ball Park. However, even though most crashes have occurred around these locations, the most severe crashes this year have all been by the TQL stadium. Luckily there have not been any fatal crashes yet in 2025, but there were a few crashes with serious injuries suspected. I would recommend looking into why such severe crashes keep occurring in the same location. Is it due to the speed limit, road conditions, etc? Looking into these factors and making adjustments accordingly could help decrease the number of severe crashes on this street. I would recommend performing a similar investigation for the roads around The Contemporary Arts Center and the main I-71 intersection since so many crashes occur around these areas.