Final Project

For the Final Project, I decided to explore and analyze the Traffic Crash Reports (CPD) data set. This data set contains several records of crashes reported in Cincinnati that the Cincinnati Police Department responded to. The data set is available on City of Cincinnati open data and was last updated on February 18, 2025.

Import All Necessary Packages and Load the Data Set

rm(list = ls()) # remove everything from current environment
library(tidyverse)
library(lubridate)
library(ggTimeSeries)
library(sugrrants)
library(gridExtra)
library(ggmap)
library(cowplot)
library(here)
library(knitr)

data <- read_csv(here('Final Project/Traffic_Crash_Reports_CPD.csv'))

Data Set Description

Each row in this data set is a record of a response by the Cincinnati Police Department(CPD) to an accident. The table below shows a description for each column in the data set.

Click to expand the code chunk

col_names <- names(data)
col_description <- c('Location of the accident',
                     'Approxamite latitude coordinates of the location of the accident',
                    'Approxamite longitude coordinates of the location of the accident',
                    'Age of the person involved in the accident',
                    'Community Council neighborhood in which this accident occured',
                    'CPD Neighborhood in which this accident occured',
                    'SNA Neighborhood in which this accident occured',
                    'Date the crash occured',
                    'Type of right of way that crash occured in (i.e. intersection, on ramp, driveway)',
                    'Classifies the crash as resulting in a fatality, injury or property damage',
                    'ID that Classifies the crash as resulting in a fatality, injury or property damage',
                    'Date the crash was reported to CPD',
                    'Day of the week the crash occured on',
                    'Gender of the person involved in the accident',
                    'Category of the injuries sustained to the person involved in the accident',
                    'Unique identifier to distinguish each crash record',
                    'Light conditions of the roadway at the time of the accident',
                    'Report number (unique identifier for each individual crash',
                   'The manner in which the crash occured',
                   'Condition of the roadway at the time of the accident',
                   'Contour of the roadway at the location of the accident',
                   'Road surface type',
                   'Class of the road on which the accident occured',
                   'Road class description',
                   'Type of unit/automobile that was involved in the accident',
                   'Description of the person involvement in the accident: driver, occupant or a pedestrain',
                   'Type of weather conditions at the time of the accident',
                   'Zip code of the location of the accident')

col_info <- data.frame(
  Column_Names = col_names,
  Description = col_description
)

kable(col_info)

Column_Names	Description
ADDRESS_X	Location of the accident
LATITUDE_X	Approxamite latitude coordinates of the location of the accident
LONGITUDE_X	Approxamite longitude coordinates of the location of the accident
AGE	Age of the person involved in the accident
COMMUNITY_COUNCIL_NEIGHBORHOOD	Community Council neighborhood in which this accident occured
CPD_NEIGHBORHOOD	CPD Neighborhood in which this accident occured
SNA_NEIGHBORHOOD	SNA Neighborhood in which this accident occured
CRASHDATE	Date the crash occured
CRASHLOCATION	Type of right of way that crash occured in (i.e. intersection, on ramp, driveway)
CRASHSEVERITY	Classifies the crash as resulting in a fatality, injury or property damage
CRASHSEVERITYID	ID that Classifies the crash as resulting in a fatality, injury or property damage
DATECRASHREPORTED	Date the crash was reported to CPD
DAYOFWEEK	Day of the week the crash occured on
GENDER	Gender of the person involved in the accident
INJURIES	Category of the injuries sustained to the person involved in the accident
INSTANCEID	Unique identifier to distinguish each crash record
LIGHTCONDITIONSPRIMARY	Light conditions of the roadway at the time of the accident
LOCALREPORTNO	Report number (unique identifier for each individual crash
MANNEROFCRASH	The manner in which the crash occured
ROADCONDITIONSPRIMARY	Condition of the roadway at the time of the accident
ROADCONTOUR	Contour of the roadway at the location of the accident
ROADSURFACE	Road surface type
ROADCLASS	Class of the road on which the accident occured
ROADCLASSDESC	Road class description
UNITTYPE	Type of unit/automobile that was involved in the accident
TYPEOFPERSON	Description of the person involvement in the accident: driver, occupant or a pedestrain
WEATHER	Type of weather conditions at the time of the accident
ZIP	Zip code of the location of the accident

Descriptive Analytics

Data Set Dimensions: This data set has 391912 rows and 28 columns.

Duplicate Rows: There are 3 duplicate rows in this data set.

Total Missing Values: The data set has a total of 605328 missing values.

Missing Values Per Column:

	x
ADDRESS_X	24
LATITUDE_X	26
LONGITUDE_X	28
AGE	49397
COMMUNITY_COUNCIL_NEIGHBORHOOD	0
CPD_NEIGHBORHOOD	0
SNA_NEIGHBORHOOD	0
CRASHDATE	19
CRASHLOCATION	197891
CRASHSEVERITY	5
CRASHSEVERITYID	5
DATECRASHREPORTED	26
DAYOFWEEK	17
GENDER	44875
INJURIES	346
INSTANCEID	0
LIGHTCONDITIONSPRIMARY	23
LOCALREPORTNO	0
MANNEROFCRASH	22
ROADCONDITIONSPRIMARY	24
ROADCONTOUR	21
ROADSURFACE	23
ROADCLASS	152292
ROADCLASSDESC	152389
UNITTYPE	350
TYPEOFPERSON	336
WEATHER	24
ZIP	7165

While this data set will need a lot of data wrangling due the the missing values, it should still help provide some insight into the recent accidents that have occurred across Cincinnati.

Key Questions

When looking at this data set, a few questions came to my mind that would be interesting to explore.

In which neighborhoods have most traffic crashes occurred in?
What time of year do most traffic crashes occur in Cincinnati?
Where do most of these traffic crashes occur?

Tidying the Data

Prior to being able to explore these questions, we first need to perform some data tidying in order to perform our analysis.

# split date and time from the CRASHDATE column
data <- separate(data, CRASHDATE, into = c('CRASHDATE', "CRASHTIME"), sep = " ")

# trim off any trailing spaces in the CRASHDATE column
data$CRASHDATE <- trimws(data$CRASHDATE)

# remove missing values from the CRASHDATE column
data <- data %>% filter(!is.na(CRASHDATE))

# change to date datatype
data$CRASHDATE <- as.Date(data$CRASHDATE, format = '%m/%d/%Y', na.rm = TRUE)

Question #1:

In which neighborhoods have most traffic crashes occurred in?

data %>%
  group_by(SNA_NEIGHBORHOOD) %>%
  summarize(count = n()) %>%
  ggplot() +
  geom_col(aes(x = count, y = reorder(SNA_NEIGHBORHOOD, -count)), fill = 'darkred') +
  scale_x_continuous(breaks = c(0, 5000, 10000, 15000, 20000, 25000, 30000),
                     labels = c('0', '5000', '10000', '15000', '20000', '25000', '30000')) +
  labs(
    x = 'Number of Traffic Crashes',
    y = 'Appoximate Neighborhood',
  )

As the graph shows above, Westwood and Downtown have the highest number of reported traffic crashes. This count spans over decades of data. Even if you filter to only see traffic crashes that occurred in 2025, most traffic crashes still took place in the same two neighborhoods. I would recommend looking deeper into the road conditions, demographics, and traffic flows of these neighborhoods to determine if any improvements/adjustments could be made to prevent crashes from occurring.

Question #2:

Part 1: What time of year do most traffic crashes occur in Cincinnati?

data %>%
  mutate(MONTH = month(CRASHDATE)) %>%
  group_by(MONTH) %>%
  summarize(count = n()) %>%
  ggplot() +
  geom_col(aes(x = MONTH, y = count), fill = 'darkblue') +
  scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   labels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')) +
  labs(
    title = 'Number of Crashes Per Month',
    x = '',
    y = 'Number of Crashes'
  )

With almost 400,000 reported traffic crashes, the number of traffic crashes in the month of October is higher than in any other month. However, this rose the question of why these traffic crashes occurred? Could it be because of the weather conditions during this time of year?

Part 2: What were the weather conditions during that time of year?

Since most traffic crashes have occurred in October, I decided to explore the weather conditions of this past October.

data %>%
  filter(CRASHDATE <= ymd('2024-10-31'),
         CRASHDATE >= ymd('2024-10-01')) %>% # only month of Oct.
  group_by(CRASHDATE, WEATHER) %>%
  summarize(count = n()) %>%
  ggplot() +
  geom_col(aes(x = WEATHER, y = count, fill = WEATHER, group = CRASHDATE)) +
  facet_calendar(~CRASHDATE) +
  theme(axis.text.x = element_blank()) +
  scale_fill_manual(breaks = c('1 - CLEAR', 
                               '2 - CLOUDY', 
                               '3 - FOG, SMOG, SMOKE',
                               '4 - RAIN', 
                               '5 - SLEET, HAIL', 
                               '6 - SNOW', 
                               '7 - SEVERE CROSSWINDS',
                               '8 - BLOWING SAND, SOIL, DIRT, SNOW', 
                               '9 - FREEZING RAIN, FREEZING DRIZZLE', 
                               '99 - OTHER/UNKNOWN'),
                    labels = c('Clear',
                               'Cloudy', 
                               'Fog/Smog/Smoke',
                               'Rain',
                               'Sleet/Hail',
                               'Snow',
                               'Sever Crosswinds',
                               'Blowing Sand/Soil/Dirt/Snow',
                               'Freezing Rain/Drizzle',
                               'Other'),
                    values = c('salmon', 'lightblue', 'darkgrey', 'darkblue', 'yellow', 'white', 'purple', 'brown', 'green', 'black')) +
  labs(
    title = 'Traffic Crashes in October of 2024',
    x = '',
    y = 'Number of Crashes'
  )

Surprisingly, most of the traffic crashes in October of 2024 occurred when the weather was clear. The only day that majority of crashes took place during rainy weather was on Halloween. This analysis disproves there being a strong correlation between the number of crashes and poor weather conditions. There doesn’t appear to be any weekday vs. weekend trend either.

This begs the question of what other conditions could have influenced these traffic crashes. Further analysis would be needed to confirm if there is any strong correlation between other conditions and the number of traffic crashes in October.

Question #3:

Where do most of these traffic crashes occur?

Because of the vast amount of data included in this data set, I decided to look into where most crashes have occurred so far in 2025. I looked into where the most severe injuries have occurred as well.

ggmap::register_google(key = 'AIzaSyBVC4hbOlw9oAD3xi1rtIs6R4jnoF4YPNA')

# import map of Cincinnati  
cinci_map = get_map(location = 'Cincinnati', 
                      zoom = 15, source = 'google', 
                      maptype = 'terrain')

# filter data  for crashes that occured in 2025
data2025 <- data %>%
  mutate(Year = year(CRASHDATE)) %>%
  filter(Year == '2025')

# create visualization to show the location and severity of crashes in 2025
v1 <- ggmap(cinci_map) +
  geom_point(data = data2025, 
             aes(x = LONGITUDE_X, y = LATITUDE_X, color = CRASHSEVERITY, size = CRASHSEVERITY)) +
  scale_size_manual(breaks = c('1 - FATAL', '2 - SERIOUS INJURY SUSPECTED', '3 - MINOR INJURY SUSPECTED', '4 - INJURY POSSIBLE', '5 - PROPERTY DAMAGE ONLY'),
                    values = c(6, 5, 3, 2, 1),
                    labels =  c('Fatal', 'Serious Injury Suspected', 'Minor Injury Suspected', 'Injury Possible', 'Property Damage Only')) +
  scale_color_manual(breaks = c('1 - FATAL', '2 - SERIOUS INJURY SUSPECTED', '3 - MINOR INJURY SUSPECTED', '4 - INJURY POSSIBLE', '5 - PROPERTY DAMAGE ONLY'),
                     values = c('purple', 'red', 'orange', 'yellow', 'darkgreen'),
                     labels =  c('Fatal', 'Serious Injury Suspected', 'Minor Injury Suspected', 'Injury Possible', 'Property Damage Only')) +
  theme(legend.position = 'none') +
  labs(
    x =  'Latitude',
    y = 'Longitude',
    color = 'Crash Severity',
    size = 'Crash Severity'
  )

# extract legend
legend <- get_legend(v1 + theme(legend.position = 'right'))

# create visualization to show the density of where crashes have taken place in 2025
v2 <- ggmap(cinci_map) +
  geom_density_2d(data = data2025,
                  aes(x = LONGITUDE_X, y = LATITUDE_X), linewidth = 1.25) +
  labs(
    x =  'Latitude',
    y = 'Longitude',
  )

grid.arrange(v2, v1, legend, ncol = 3)

The density plot above shows that most traffic crashes have occurred around The Contemporary Arts Center and around the main I-71 intersection by the Great American Ball Park. However, even though most crashes have occurred around these locations, the most severe crashes this year have all been by the TQL stadium. Luckily there have not been any fatal crashes yet in 2025, but there were a few crashes with serious injuries suspected. I would recommend looking into why such severe crashes keep occurring in the same location. Is it due to the speed limit, road conditions, etc? Looking into these factors and making adjustments accordingly could help decrease the number of severe crashes on this street. I would recommend performing a similar investigation for the roads around The Contemporary Arts Center and the main I-71 intersection since so many crashes occur around these areas.