Project 3

Author

Zachary Rodavich

https://www.latimes.com/california/story/2023-09-05/two-killed-firetruck-crash-speeding-car-west-compton

For my final project, I will be focusing on motor vehicle accidents that occurred in New York City, as reported by the New York City Police Department. The dataset for this project was sourced from New York City’s Open Data Hub, which provides publicly accessible data about NYC. The variables used include the contributing factor recorded for each accident, such as drunk driving (DUI), distracted driving, and speeding, along with the borough, the location, and the number of motorists injured. I chose this project because, as a rookie driver, I am understandably concerned about safety when driving, and it upsets me when someone is hurt or killed in a car accident that could easily have been prevented.

#Loading in all the required libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(leaflet)
library(maps)


Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map

library(DataExplorer)

#Setting the working directory
setwd("/Users/zacharyrodavich/Downloads")

#Loading in the CSV file
nycaccidents <- read_csv("motor_vehicle_collisions_NYPD.csv")

Rows: 1048575 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time  (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nycaccidents

# A tibble: 1,048,575 × 29
   `CRASH DATE` `CRASH TIME` BOROUGH   `ZIP CODE` LATITUDE LONGITUDE LOCATION   
   <chr>        <time>       <chr>          <dbl>    <dbl>     <dbl> <chr>      
 1 9/11/2021    02:39        <NA>              NA     NA        NA   <NA>       
 2 3/26/2022    11:45        <NA>              NA     NA        NA   <NA>       
 3 6/29/2022    06:55        <NA>              NA     NA        NA   <NA>       
 4 9/11/2021    09:35        BROOKLYN       11208     40.7     -73.9 (40.667202…
 5 12/14/2021   08:13        BROOKLYN       11233     40.7     -73.9 (40.683304…
 6 4/14/2021    12:47        <NA>              NA     NA        NA   <NA>       
 7 12/14/2021   17:05        <NA>              NA     40.7     -74.0 (40.709183…
 8 12/14/2021   08:17        BRONX          10475     40.9     -73.8 (40.86816,…
 9 12/14/2021   21:10        BROOKLYN       11207     40.7     -73.9 (40.67172,…
10 12/14/2021   14:58        MANHATTAN      10017     40.8     -74.0 (40.75144,…
# ℹ 1,048,565 more rows
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
#   `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
#   `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
#   `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
#   `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
#   `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>, …

#Filtering by several contributing factors, including the more severe ones. We will focus on Brooklyn, the most populous of the NYC boroughs, and on accidents resulting in motorist injuries. The factor names must match the dataset's spelling exactly, because %in% only keeps exact matches.
accidents <- nycaccidents |>
  filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% c("Unsafe Speed","Unsafe Lane Changing","Passing Too Closely","Traffic Control Disregarded","Driver Inexperience","Passing or Lane Usage Improper","Driver Inattention/Distraction","Alcohol Involvment","Failure to Yied Right-Of-Way","Aggressive Driving/Road Rage")) |>
  filter(BOROUGH == "BROOKLYN") |>
  filter(`NUMBER OF MOTORIST INJURED` > 0)
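
Because `%in%` compares strings exactly, a misspelled factor name matches no rows and is dropped without any warning. The following is a minimal sketch of a sanity check, not part of the original analysis. It uses a small made-up data frame (`toy`) and a hypothetical `wanted` vector instead of the real dataset, and it sticks to base R so it runs on its own.

```r
# Toy stand-in for nycaccidents (made-up rows, for illustration only)
toy <- data.frame(
  `CONTRIBUTING FACTOR VEHICLE 1` = c("Unsafe Speed", "Passing Too Closely",
                                      "Unsafe Speed", "Driver Inexperience"),
  check.names = FALSE
)

# Factors we intend to keep; the last one is deliberately misspelled
wanted <- c("Unsafe Speed", "Passing Too Closely", "Driver Inexperiense")

# Values listed here never occur in the data, so filter() would silently skip them
unmatched <- setdiff(wanted, unique(toy$`CONTRIBUTING FACTOR VEHICLE 1`))
unmatched
```

Running the same `setdiff()` against the real column before filtering would list any factor names that need their spelling checked.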

#Defining specific variable names.
#This relabels nycaccidents only; accidents was already created above, so it keeps the original names.
factor_labels <- c(
  "Unsafe Speed" = "Speeding",
  "Unsafe Lane Changing" = "Unsafe/Improper Lane Change",
  "Traffic Control Disregarded" = "Disobeyed Traffic Signals or Signs",
  "Driver Inexperience" = "Inexperienced or Unlicensed Driver",
  "Passing or Lane Usage Improper" = "Improper Passing or Lane Use",
  "Alcohol Involvement" = "DUI/Drunk Driving",
  "Driver Inattention/Distraction" = "Distracted Driving",
  "Failure to Yield Right-of-Way" = "Failed to Yield when Required",
  "Passing Too Closely" = "Unsafe Passing Distance",
  "Aggressive Driving/Road Rage" = "Road Rage/Driving Aggressively"
)

#Replacing each matching value with its new name; values not listed (including NA) are left unchanged
old_names <- nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1`
matched <- old_names %in% names(factor_labels)
nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1`[matched] <- unname(factor_labels[old_names[matched]])

#Creating a short scatterplot with a linear regression line
s1 <- accidents |>
  count(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) |>
  ggplot(aes(x = `CONTRIBUTING FACTOR VEHICLE 1`, y = n)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm", color = "#A16", se = FALSE) +
  labs(
    title = "Accidents reported in Brooklyn, NYC, Resulting in Injuries",
    caption = "Source: New York City Open Data",
    x = "Cause of Accident",
    y = "Number of Incidents"
  ) +
  theme_minimal(base_size = 12) +
  coord_flip()

s1

`geom_smooth()` using formula = 'y ~ x'

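From the first data visualization, we can see that the number of incidents varies widely from one cause to another. Because the causes are categories plotted in alphabetical order rather than by severity, the fitted line summarizes that ordering and should not be read as a trend in severity.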

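#Creating our simple linear regression model (one predictor). as.numeric(as.factor()) numbers the causes in alphabetical order, so the slope describes that ordering only.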
summary_data <- accidents |>
  count(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`))

model <- lm(n ~ as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`)), data = summary_data)

summary(model)


Call:
lm(formula = n ~ as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`)), 
    data = summary_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2515.2 -1178.6  -657.7   569.8  5199.2 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                              3040.7     1971.0
as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`))   -320.5      390.3
                                                       t value Pr(>|t|)
(Intercept)                                              1.543    0.174
as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`))  -0.821    0.443

Residual standard error: 2530 on 6 degrees of freedom
Multiple R-squared:  0.101, Adjusted R-squared:  -0.04883 
F-statistic: 0.6741 on 1 and 6 DF,  p-value: 0.443
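
The predictor in the model above is `as.numeric(as.factor(...))`, which numbers the categories in alphabetical order. The sketch below uses made-up counts (`toy_counts` and its values are hypothetical, not taken from the dataset) to show that the fitted slope depends entirely on how the categories are numbered. Renumbering the same three categories by count flips the sign of the slope.

```r
# Made-up counts for three causes (illustration only)
toy_counts <- data.frame(
  cause = c("Unsafe Speed", "Driver Inexperience", "Passing Too Closely"),
  n = c(120, 45, 80)
)

# Alphabetical coding, as in the model above
codes <- as.numeric(as.factor(toy_counts$cause))
codes  # 3 1 2

# Slope when the causes are numbered alphabetically
slope_alpha <- coef(lm(n ~ codes, data = toy_counts))[2]

# Slope when the same causes are numbered from most to fewest incidents
codes_by_n <- rank(-toy_counts$n)
slope_by_n <- coef(lm(n ~ codes_by_n, data = toy_counts))[2]

c(alphabetical = unname(slope_alpha), by_count = unname(slope_by_n))
```

Since the numbering is arbitrary for unordered categories, a high p-value such as the 0.443 reported above is what one would expect from this kind of model.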

#For our second visualization, we will be looking at a bar graph of the number of accidents by cause
p1 <- accidents |>
  ggplot(aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `CONTRIBUTING FACTOR VEHICLE 1`, FUN = length),fill = `CONTRIBUTING FACTOR VEHICLE 1`)) +
  geom_bar(alpha=0.5, color = "white")+
  scale_fill_discrete(
    name = "Accidents", 
    labels = c("Road Rage/Driving Aggressively", "Distracted Driving","Inexperienced/Unlicensed Driver", "Improper Passing or Lane Use", "Unsafe Passing","Disobeyed Traffic Signs or Signals","Unsafe Lane Change","Speeding")
    ) +
  labs(
    x = "Cause of Accident",
    y = "Number of Incidents",
    title = "Accidents reported in Brooklyn, NYC, Resulting in Injuries",
    caption = "Source: New York City Open Data"
  ) +
  theme_bw() +
  coord_flip()
p1

#I added an extra visualization for a little bit of fun and to explore a different type of data visualization, so here's a treemap.
p2 <- accidents |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) |>
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  summarize(count = n()) |>
  ungroup()

library(treemap)

treemap(p2, 
        index="CONTRIBUTING FACTOR VEHICLE 1", 
        vSize="count",
        vColor="count",
        type="value",    
        palette="RdYlBu", 
        title = "Accidents reported in Brooklyn, NYC, Resulting in Injuries",  
        title.legend = "Accident Types"
        )

#Here is my third visualization, which is a map with user interactivity. You can click on any circle to see where the accident occurred and what caused it.
accidents_lat <- mean(accidents$LATITUDE, na.rm = TRUE)
accidents_lon <- mean(accidents$LONGITUDE, na.rm = TRUE)
m1 <- leaflet(data = accidents) |>
  setView(lng = accidents_lon, lat = accidents_lat, zoom = 11.5) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    radius = 50, 
    color = "#290",
    fillColor = "#250",
    fillOpacity = 0.25,
    label = ~`CONTRIBUTING FACTOR VEHICLE 1`,
    popup = ~paste("<strong>Accident Cause:</strong>", `CONTRIBUTING FACTOR VEHICLE 1`),
    highlightOptions = highlightOptions(
      weight = 4,
      color = "#606",
      fillOpacity = 0.7,
      bringToFront = TRUE
    )
  )

Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively

Warning in validateCoords(lng, lat, funcName): Data contains 275 rows with
either missing or invalid lat/lon values and will be ignored

m1
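
The warning above reports that leaflet ignored 275 rows with missing or invalid coordinates. One possible way to make that step explicit is to drop rows without coordinates before building the map. This is a minimal sketch on a made-up data frame (`toy_map` is hypothetical; only the column names follow the dataset), and it handles missing values only, not other kinds of invalid coordinates.

```r
# Made-up coordinates (illustration only); two rows are incomplete
toy_map <- data.frame(
  LATITUDE  = c(40.67, NA, 40.68, 40.70),
  LONGITUDE = c(-73.89, -73.95, NA, -73.95)
)

# Keep only the rows where both coordinates are present
mappable <- subset(toy_map, !is.na(LATITUDE) & !is.na(LONGITUDE))

# Report how many rows were dropped, so the loss is visible rather than silent
dropped <- nrow(toy_map) - nrow(mappable)
dropped
```

Passing a filtered data frame like `mappable` to `leaflet()` would also keep the count of dropped rows available for the write-up.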

What the data shows is that, among the accidents in Brooklyn, New York, that resulted in motorist injuries, the most common cause by a wide margin was a motorist who was distracted while driving, whether by talking to passengers in the car, using their cell phone, or engaging in other activities that took their attention away from driving and their surroundings. According to Pines Salomon, a personal injury law firm based in San Diego, California, distracted driving, including the use of electronic devices while driving, is the most common factor behind car accidents in America. Additionally, according to the Maryland MVA (Motor Vehicle Administration), the vast majority of individuals who are distracted while driving are using their cell phone or another electronic device. Driving responsibly and safely requires a driver’s full attention, and one distraction is all it takes to cause a massive accident with serious injuries, or even fatalities.

These visualizations represent the drivers who caused a crash because they were distracted by their phone, other passengers in their car, or something outside the window that took their attention from the safe operation of their vehicle and put other motorists and pedestrians in danger. As a rookie driver myself, I find these trends very concerning, as they show that far too many drivers do not follow the rules of the road. I wish I could have included some more visualizations, such as a heatmap or an alluvial plot, or filtered the distracted-driving incidents by borough to see which one has the most accidents involving a distracted driver.
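
The per-borough comparison mentioned above could be sketched roughly as follows. The data frame `toy_all` is made up for illustration, and `factor1` is a hypothetical short stand-in for the `CONTRIBUTING FACTOR VEHICLE 1` column, so the counts mean nothing on their own.

```r
# Made-up rows (illustration only); factor1 stands in for the contributing factor column
toy_all <- data.frame(
  BOROUGH = c("BROOKLYN", "QUEENS", "BROOKLYN", "BRONX", "QUEENS", NA),
  factor1 = c("Driver Inattention/Distraction", "Driver Inattention/Distraction",
              "Unsafe Speed", "Driver Inattention/Distraction",
              "Driver Inattention/Distraction", "Driver Inattention/Distraction")
)

# Keep distracted-driving rows that have a borough recorded
distracted <- subset(toy_all,
                     factor1 == "Driver Inattention/Distraction" & !is.na(BOROUGH))

# Count incidents per borough, largest first
by_borough <- sort(table(distracted$BOROUGH), decreasing = TRUE)
by_borough
```

On the real data, the same counts would need to be read alongside each borough's population or traffic volume before calling one borough worse than another.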

“San Diego Car Accident Lawyers - the 25 Top Causes of Car Accidents in the US.” Pines Salomon Personal Injury Lawyers, 7 Apr. 2026, seriousaccidents.com/personal-injury-resources/top-causes-of-car-accidents/.

“Common Causes of Distracted Driving.” Zero Deaths Maryland & Vision Zero - Maryland Highway Safety Office, Zero Deaths Maryland, 24 Mar. 2023, zerodeathsmd.gov/news/common-causes-of-distracted-driving/.

AI USE ATTRIBUTION STATEMENT
────────────────────────────────────────
Title: DATA 110 Final Project
Creator: Zachary Rodavich
Context: DATA 110
Document Type: Student assignment

AI Permission: AI-NO
AI Creation Categories: None selected

AI Tools Used:
  • Gemini 3 (used 2026-05-11) — Debugging
  • Gemini 3 (used 2026-05-12) — Debugging

AI Prompt: There is an error in this code that needs fixing. Please show me what went wrong and how I can fix my code.

Human Role: I edited any faulty lines of code with code suggested by the A.I. programs listed above.

Notes: All other work is written by me and me ONLY. A.I. is solely used for the purposes of debugging and finding problems within my code.

────────────────────────────────────────
Generated with AI Attribution Generator