Data Description

The dataset includes reported vehicle crashes in the Cincinnati area. It encompasses all crashes within the Cincinnati city limits from 1/1/2012 to 2/22/2025.

There are a handful of descriptive analytic techniques I had to employ. As the data collection spans 12 years, the data collection progress has changed. For example, for the Crash Severity variable, it associates a certain number with the severity of the crash. A few years ago, level 3 represented Property Only Damage. Now, level 5 represents Property Only Damage. This problem is tackled in the second question we dive into. There are also a handful of missing values. These were omitted from the data analysis.

Total number of reported crashes in the city of Cincinnati (number of observations in the dataset):

nrow(crash <- read_csv("crash_report.csv"))
## [1] 388802

First Research Question

Where are crashes geographically concentrated across Cincinnati? Are there certain neighborhoods where crashes are more likely to happen?

We visualize crash volumes by ZIP code using a Google map of Cincinnati. Because there is no longitude or latitude variable in our dataset, I downloaded an online csv file that details Cincinnati Zip Codes and their respective longitude and latitude coordinates.

zip <- read_csv("zip_table.csv")
## Rows: 44 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): ZIP, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crash <- crash %>%
  left_join(zip, by = "ZIP")

register_google(key = "AIzaSyBVC4hbOlw9oAD3xi1rtIs6R4jnoF4YPNA")
cincinnatimap = get_map(location = "University of Cincinnati", zoom = 12, source = "google", maptype = "terrain")
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=University%20of%20Cincinnati&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx>
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=University+of+Cincinnati&key=xxx>
crash_summary <- crash %>%
  group_by(ZIP) %>%
  summarize(
    Crash_Count = n(),
    Latitude = mean(Latitude, na.rm = TRUE),
    Longitude = mean(Longitude, na.rm = TRUE))
ggmap(cincinnatimap) +
  geom_point(data = crash_summary,
             aes(x = Longitude, y = Latitude, size = Crash_Count),
             color = "red", alpha = 0.5) +
  labs(title = "Crash Volume by ZIP Map",
       x = "Longitude",
       y = "Latitude") +
  geom_text(data = crash_summary,
            aes(x = Longitude, y = Latitude, label = ZIP),
            size = 3,
            color = "black",
            vjust = -1)

Looking at the map above, we can see that the zip code with the most crashes is 45202, which is downtown. This makes sense as there are many narrow streets and a lot of work and residential traffic throughout the day. Generally speaking, as you move further away from the city, the number of crashes per zip code decreases. We can see 45233 and 45212 have very few crashes to support this argument.

Second Research Question

Does the day of the week have an affect on the severity of car crashes in Cincinnati?

As mentioned in the description of the dataset, the coding of crash severity has changed within the dataset. The below chunk shows combining codes 3 and 5, which both represent Property Damage Only in various years.

day_severity <- crash %>%
  filter(!is.na(DAYOFWEEK)) %>%
  mutate(
    DAYOFWEEK = factor(DAYOFWEEK, 
                       levels = c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")),
    CRASHSEVERITY = case_when(
      CRASHSEVERITY == "1 - FATAL" ~ "Fatal",
      CRASHSEVERITY == "1 - FATAL INJURY" ~ "Fatal Injury",
      CRASHSEVERITY == "2 - INJURY" ~ "Injury",
      CRASHSEVERITY == "2 - SERIOUS INJURY SUSPECTED" ~ "Serious Injury",
      CRASHSEVERITY == "3 - MINOR INJURY SUSPECTED" ~ "Minor Injury",
      CRASHSEVERITY %in% c("3 - PROPERTY DAMAGE ONLY (PDO)", "5 - PROPERTY DAMAGE ONLY") ~ "Property Damage Only",
      CRASHSEVERITY == "4 - INJURY POSSIBLE" ~ "Possible Injury")) %>%
  group_by(DAYOFWEEK, CRASHSEVERITY) %>%
  summarize(Crash_Count = n())

# Crash Severity by Day Plot

Now that we have the severity codes all in line and grouped by the day of the week, we can plot our findings using a dodged bar chart for each day of the week. Because Property Damage Only is the overwhelming majority of crashes, I decided to use a log transformation to make it easier to read for the viewer.

day_severity %>%
  ggplot(aes(x = DAYOFWEEK, y = Crash_Count, fill = CRASHSEVERITY)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_log10() +
  scale_fill_viridis_d() +
  theme(panel.grid.major.x = element_blank())

  labs(title = "Crash Severity by Day of the Week",
       x = "Day of the Week",
       y = "Number of Crashes")
## $x
## [1] "Day of the Week"
## 
## $y
## [1] "Number of Crashes"
## 
## $title
## [1] "Crash Severity by Day of the Week"
## 
## attr(,"class")
## [1] "labels"

We can see that crash severity does not dramatically change based on the day of the week. Fatal injuries are a bit more likely on the weekends, but everything else is relatively stagnant. This makes sense as there could be more impaired related crashes on the weekends, which likely would be more severe crashes.

Third Research Question

When during the day do crashes occur most often?

The final question I want to research is whether or not time of day has an affect on the number of crashes. I expect that the morning and evening rush hour times will have the most crashes as this is when roads can be crowded and congested.

Before we create our plot, we need to first turn our CRASHDATE variable into an hourly bin. We do this by creating the Crash_Hour variable. Then, we can sum the number of crashes in each given hour. Also in this step, for clarity, I created hour labels that we will see used on the graph.

crash <- crash %>%
  mutate(Crash_DateTime = mdy_hm(CRASHDATE),
         Crash_Hour = hour(Crash_DateTime))

hourly_crashes <- crash %>%
  filter(!is.na(Crash_Hour)) %>%
  group_by(Crash_Hour) %>%
  summarize(crash_count = n())

hour_labels <- c("Midnight", "1AM", "2AM", "3AM", "4AM", "5AM", "6AM", "7AM",
                 "8AM", "9AM", "10AM", "11AM", "Noon", "1PM", "2PM", "3PM",
                 "4PM", "5PM", "6PM", "7PM", "8PM", "9PM", "10PM", "11PM")

Now we can plot the number of crashes for each hour of the day and observe the findings.

ggplot(hourly_crashes, aes(x = Crash_Hour, y = crash_count, fill = Crash_Hour)) +
  geom_col() +
  scale_x_continuous(breaks = 0:23, labels = hour_labels) +
  scale_fill_viridis_c(option = "inferno") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()) +
  labs(
    title = "Crash Volume by Hour",
    x = "Hour",
    y = "Number of Crashes",
    fill = "Hour")

From this chart, we can see that the evening rush hour leads to the largest number of crashes. In particular, in the 4pm hour. We also see an uptick during the hours of 7 and 8 am, but it is significantly less than the number of crashes for the evening.