DATA110 Final Project

Author

Tikki Dibonge

NYC Motor Vehicle Collisions: An Exploration

Source: https://injurylawyer.com/blog/traffic-stats-show-main-causes-of-car-crash-injuries-in-nyc/

Introduction:

My project is an exploration of the motor vehicle collisions and the injuries/deaths that are associated with them, in New York City. My project exploration will show any patterns and trends that come from this data set. I chose this data set because I know New York City is a very congested city, so the amount of collisions must be a lot. A lot of times on social media I will see accidents that are based in NYC, so I know it’s a real issue that needs to be fixed. I thought it would be very interesting to analyze this data to see how prevalent this issue is in NYC, where and when these collisions happen, and the amount of injuries or deaths associated with the collisions. The source of this data set comes from the NYPD form MV-104AN, that uses FORMS, which is their Finest Online Records Management System. There is no file that describes how the data was collected but on the data.gov website, it is explained that officers manually enter the data in of each collision that caused injury, death, and/or a minimum of $1000 in damages, using the MV-104AN form.

Data set variables include:

CRASH DATE, CRASH TIME, BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION (both latitude and longitude), ON STREET NAME, CROSS STREET NAME (the street the collision was across from), OFF STREET NAME, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, CONTRIBUTING FACTOR VEHICLE 1 (what the vehicle/person did to cause collision), and VEHICLE TYPE CODE 1 (what kind of vehicle that caused the collision).

# Install janitor package because janitor uses 'clean_names() to standardize variable names, https://cran.r-project.org/web/packages/janitor/index.html
install.packages("~/Downloads/janitor_2.2.1.tar",
                 repos = NULL,
                 type = "source")

# Install snakecase package because snakecase helps with consistent variable naming, https://cran.r-project.org/web/packages/snakecase/index.html
install.packages("~/Downloads/snakecase_0.11.1.tar",
                 repos = NULL,
                 type = "source")

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(leaflet)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(broom)
library(ggthemes)

Load necessary libraries.

# Load and name downloaded dataset from NYPD MV-104AN
crashes_raw <- readr::read_csv("~/Downloads/DATA110/Motor_Vehicle_Collisions_-_Crashes.csv")

Rows: 2224893 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time  (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Define and read the CSV.

# Cleaning and converting the original column names into usable r-safe names, https://lubridate.tidyverse.org
crashes <- crashes_raw |>
  janitor::clean_names() |>
  mutate(
    crash_date = lubridate::mdy(crash_date),
    crash_time = hms::as_hms(crash_time),
    hour = lubridate::hour(crash_time), # creating new 'hour' column of only hour of times for visualizations
    borough = if_else(is.na(borough) | borough == "", "UNKNOWN", toupper(borough)), # standardizing the borough names
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude) # making sure the latitude and longitude are numeric
  )

names(crashes) # making sure new set is cleaned and changed correctly

 [1] "crash_date"                    "crash_time"                   
 [3] "borough"                       "zip_code"                     
 [5] "latitude"                      "longitude"                    
 [7] "location"                      "on_street_name"               
 [9] "cross_street_name"             "off_street_name"              
[11] "number_of_persons_injured"     "number_of_persons_killed"     
[13] "number_of_pedestrians_injured" "number_of_pedestrians_killed" 
[15] "number_of_cyclist_injured"     "number_of_cyclist_killed"     
[17] "number_of_motorist_injured"    "number_of_motorist_killed"    
[19] "contributing_factor_vehicle_1" "contributing_factor_vehicle_2"
[21] "contributing_factor_vehicle_3" "contributing_factor_vehicle_4"
[23] "contributing_factor_vehicle_5" "collision_id"                 
[25] "vehicle_type_code_1"           "vehicle_type_code_2"          
[27] "vehicle_type_code_3"           "vehicle_type_code_4"          
[29] "vehicle_type_code_5"           "hour"

Used the ‘clean_names()’ to change the headers into r-safe variable names with the snake_case package that parses dates and times. Also created a new hour column, and standardized borough names.

crashes_filtered <- crashes |>
  select(crash_date, crash_time, hour, borough, latitude, longitude, number_of_persons_injured, number_of_persons_killed, number_of_pedestrians_injured, number_of_cyclist_injured, number_of_motorist_injured, contributing_factor_vehicle_1, vehicle_type_code_1) |> # choosing only the relevant variables I'll need
  filter(!is.na(crash_date)) |>
  mutate(across(starts_with("number_of_"), ~replace_na(.x, 0))) |> # replacing any 'NA' observations in 'number_of_' columns, https://www.r-bloggers.com/2022/06/replace-na-with-zero-in-r/
  arrange(desc(crash_date))

crashes_filtered <- crashes_filtered |>
  mutate(
    hour = suppressWarnings(as.numeric(hour)) #https://www.statology.org/suppress-warnings-in-r/
  ) |>
  filter(!is.na(hour)) # makes the hour numeric and removes any corrupted observations

by_borough <- crashes_filtered |>
  group_by(borough) |> # combined crash, injured, and killed totals by borough
  summarize(
    crashes = n(),
    total_injured = sum(number_of_persons_injured, na.rm=TRUE),
    total_killed = sum(number_of_persons_killed, na.rm=TRUE)
  ) |>
  arrange(desc(crashes))

by_borough

# A tibble: 6 × 4
  borough       crashes total_injured total_killed
  <chr>           <int>         <dbl>        <dbl>
1 UNKNOWN        681152        248601         1407
2 BROOKLYN       494852        173788          722
3 QUEENS         413651        131929          589
4 MANHATTAN      341989         78106          383
5 BRONX          228612         79533          318
6 STATEN ISLAND   64637         19473          107

More cleaning so that making visualizations are easier and accurate, and combined columns for visualization.

# Summarizing the crash frequency and the average injuries by hour of the day and borough
hourly <- crashes_filtered |>
  group_by(hour, borough) |>
  summarize(
    count = n(),
    avg_injured = mean(number_of_persons_injured, na.rm = TRUE),
    .groups = "drop"
  )

# create visualization of hourly crash patterns
p_hour <- ggplot(hourly, aes(x = hour, y = count, color = borough)) + 
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = 0:23) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Hourly Vehicle Collisions By Borough",
    x = "Hour of Day (0-23)",
    y = "Number of Collisions",
    color = "Borough"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
  )

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

ggplotly(p_hour, width = 900, height = 500) |>
  layout(
    margin = list(b = 140), #https://plotly.com/r/setting-graph-size/
    annotations = list(
      list(
        text = "Data source: The NYPD form MV-104AN using FORMS (Finest Online Records Management System)",
        x = 0,
        y = -0.38,
        xref = "paper",
        yref = "paper",
        showarrow = FALSE,
        xanchor = "left",
        font = list(size = 11)
      ) #https://plotly.com/r/reference/layout/annotations/#:~:text=Simple%20Traces,Title
    )
  )

Plot shows the amount of collisions in a 24 hour cycle by borough. It shows the variance across the different boroughs, which also shows Brooklyn and Queens have the most collisions. Research also found that Brooklyn and Queens consistently lead in collisions because of the borough’s population density and lack road safety (Frekhtman & Associates). Collisions are generally highest in the morning at 8am and in the afternoon at 4/5pm, which align with commute hours. Brooklyn and Queens have higher peaks, which again could be because of their different structures (NYC DOT).

factors <- crashes_filtered |>
  filter(!is.na(contributing_factor_vehicle_1) & contributing_factor_vehicle_1 != "" ) |>
  group_by(contributing_factor_vehicle_1) |>
  summarize(count = n(), .groups='drop') |>
  arrange(desc(count)) |>
  slice_head(n = 12) # find the top 12 most contributing factors to collisions

p_factors <- ggplot(factors, aes(x = reorder(contributing_factor_vehicle_1, count), y = count, fill = contributing_factor_vehicle_1)) + # creating the bar chart of the top contributing factors
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02', '#a6761d', '#666666', '#a6cee3', '#1f78b4', '#b2df8a', '#fb9a99')) +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.2))) + #https://ggplot2.tidyverse.org/reference/expansion.html#:~:text=Source:%20R/scale%2Dexpansion,Examples
  geom_text(aes(label = count),
            hjust = -0.1, size = 3.5, color = "black") +
  labs(title = "Top 12 Reported Contributing Factors of Vehicle 1",
       x = "Contributing factor", y = "Number of crashes",
       fill = "Factor",
       caption = "Data source: The NYPD form MV-104AN using FORMS (Finest Online Records Management System)") +
  theme_minimal(base_size = 13, base_family = "Times-Roman") +
  theme(legend.position = "none",
        axis.text.x = element_text(size = 10))

p_factors

Besides ‘unspecified’, driver inattention/distraction is the highest contributing factor to collisions, which aligns with national and NYC-specific research that found that distraction is the leading cause of urban crashes (NYC DOT).

# Filter for only Brooklyn collisions in 2024 with injuries or deaths for leaflet map
brooklyn_2024 <- crashes_filtered |>
  filter(
    borough == "BROOKLYN",
    year(crash_date) == 2024,
    !is.na(latitude),
    !is.na(longitude),
    !(number_of_persons_injured == 0 & number_of_persons_killed == 0)
  ) |>
  mutate(
    crash_color = case_when(
      number_of_persons_killed > 0 & number_of_persons_injured == 0 ~ "orange",
      number_of_persons_injured > 0 & number_of_persons_killed == 0 ~ "yellow",
      number_of_persons_killed > 0 & number_of_persons_injured > 0 ~ "red",
      TRUE ~ "grey" # assign colors to whether someone was injured, killed, or both for map points. also removed points where none occured for specific leaflet map
    ),
    popup_text = paste0(
      "<b>Date: </b>", crash_date, "<br/>",
      "<b>Time: </b>", crash_time, "<br/>",
      "<b>Injured: </b>", number_of_persons_injured, "<br/>",
      "<b>Killed: </b>", number_of_persons_killed, "<br/>",
      "<b>Factor: </b>", contributing_factor_vehicle_1, "<br/>",
      "<b>Vehicle Type: </b>", vehicle_type_code_1
    )
  )


leaflet(brooklyn_2024) |> # created leaflet map visualization
  addTiles() |>
    setView(lng = -73.95, lat = 40.65, zoom = 11) |>
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = 6,
    color = ~crash_color,
    fillOpacity = 0.8,
    popup = ~popup_text,
    clusterOptions = markerClusterOptions() #https://rstudio.github.io/leaflet/articles/markers.html#:~:text=Marker%20Clusters,Leaflet%20%7C%20©%20OpenStreetMap%2C%20ODbL
  ) |>
  addLegend(
    position = "bottomright",
    colors = c("yellow", "orange", "red"),
    labels = c(
      "Injured Only (0 killed)",
      "Killed Only (0 injured)",
      "Injured + Killed"
    ),
    title = "Collision Severity",
    opacity = 1
  )

Map shows how many collisions where someone was injured, killed, or both in Brooklyn (the borough with the most collisions) in 2024. Most points are yellow, which indicates a low death rate in these last year collisions. Map shows the spatial clustering of these collisions happen on major corridors, which raises some questions about the layout and infrastructure of not only Brooklyn, but New York City as a whole.

Linear Regression Model

crashes_filtered <- crashes_filtered |> #https://rstudio-pubs-static.s3.amazonaws.com/1130523_aefc41d424894951af6d207a1d24fcee.html?utm_source=chatgpt.com
  mutate(borough = factor(borough))

injury_model <- lm(
  number_of_persons_injured ~ hour + borough,
  data = crashes_filtered
)

summary(injury_model)


Call:
lm(formula = number_of_persons_injured ~ hour + borough, data = crashes_filtered)

Residuals:
   Min     1Q Median     3Q    Max 
-0.388 -0.356 -0.325 -0.202 42.642 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.172e-01  1.846e-03 171.856   <2e-16 ***
hour                  2.333e-03  8.264e-05  28.236   <2e-16 ***
boroughBROOKLYN       2.955e-03  1.805e-03   1.638    0.101    
boroughMANHATTAN     -1.197e-01  1.928e-03 -62.106   <2e-16 ***
boroughQUEENS        -2.912e-02  1.860e-03 -15.658   <2e-16 ***
boroughSTATEN ISLAND -4.746e-02  3.179e-03 -14.929   <2e-16 ***
boroughUNKNOWN        1.738e-02  1.725e-03  10.077   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7136 on 2224886 degrees of freedom
Multiple R-squared:  0.004539,  Adjusted R-squared:  0.004536 
F-statistic:  1691 on 6 and 2224886 DF,  p-value: < 2.2e-16

Equation: ŷ = 0.3172 + 0.002333 * hour - 0.02912 * 1Queens + 0.002955 * 1Brooklyn

P-values: <2e16, 0.101

Adjusted R-squared: 0.004536

Diagnostic plots:

injury_model <- lm(number_of_persons_injured ~ hour + borough, data = crashes_filtered)

par(mfrow = c(2,2))
plot(injury_model)

End:

Overall, this was a very interesting data set to analyze. When I first created my leaflet map, there was an overwhelming amount of grey points that showed no one was injured or killed, so while there are a lot of collisions, most of them are not harmful or fatal. But it is still a major problem because there were thousands of points in just Brooklyn alone, and only in the year 2024. I know that New York City is heavily populated so naturally, there will be a lot of collisions but seeing the data and seeing the data on the map really put into perspective just how often collisions occur. Based on my linear regression model, the time of day and which borough is not that related to the whether someone gets injured, or even the prevalence of these collisions. But instead, it’s based on many other factors like the street layout of where the collision occurred. So while it’s still very interesting to see how the amount of collisions goes up and down throughout the day and in which borough, this collision issue seems bigger than that. This project made me think about how these boroughs, streets, and neighborhoods are structured in New York City and why there hasn’t been any change. I think because all these years, there hasn’t been any change so a sudden change in the structure and layout of the city would take a lot of time, cost a lot of money, and upset a lot of people and businesses. But still, I think it should happen because this is a major problem. And of course, wherever there’s a problem, there’s a lot of money to be made so that’s probably an incentive to not fix the structure or layout of these boroughs and New York City as a whole. I think one thing I would want to change/add to my project is a visualization or multiple visualizations that go more into depth about what kind of streets/roads these collisions happen on and what kind of street signs are on them. I think that would explain more about the layout of the boroughs and why collisions keep happening. A pattern I found interesting was that the most common factor of these collisions is a distracted driver. This was really shocking and concerning to see because people are all over New York City so I feel like that would be the last place I would want to be distracted while driving. It was also concerning to see because there were a lot of vehicles that were trucks, mopeds, and station wagons so again, if I was operating a vehicle like that, being distracted while driving would be the last thing I would do, especially in New York City.

Sources:

Serious Injury Response, Tracking & Analysis Program (SIRTA) Quarterly Report -Q4 2024 -NYC Department of Transportation Background.

https://www.nyc.gov/html/dot/downloads/pdf/sirta-report-q4-2024.pdf

Associates, Frekhtman . “”. Frekhtman & Associates | New York Injury and Accident Attorneys, 20 Nov. 2024, 866attylaw.com/brooklyn-car-accident-statistics/.

‌https://866attylaw.com/brooklyn-car-accident-statistics/#:~:text=Moving%20from%20the%20impact%20on,drivers%20can’t%20see%20well.