https://www.latimes.com/california/story/2023-09-05/two-killed-firetruck-crash-speeding-car-west-compton
For my final project, I will be focusing on accidents that occurred in New York City, as reported by the New York City Police Department. The dataset for this project was sourced from New York City’s Open Data Hub, which provides publicly accessible data about NYC. Variables used include the number of accidents caused by specific factors, including drunk driving (DUI), distracted driving, and speeding. I chose this project because, as a rookie driver, I am understandably concerned about safety when driving, and it upsets me when someone is hurt or killed in a car accident that could easily have been prevented.
# Loading in all the required libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(leaflet)
library(maps)
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
library(DataExplorer)
# Setting the working directory
setwd("/Users/zacharyrodavich/Downloads")
# Loading in the CSV file
nycaccidents <- read_csv("motor_vehicle_collisions_NYPD.csv")
Rows: 1048575 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time (1): CRASH TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nycaccidents
# A tibble: 1,048,575 × 29
`CRASH DATE` `CRASH TIME` BOROUGH `ZIP CODE` LATITUDE LONGITUDE LOCATION
<chr> <time> <chr> <dbl> <dbl> <dbl> <chr>
1 9/11/2021 02:39 <NA> NA NA NA <NA>
2 3/26/2022 11:45 <NA> NA NA NA <NA>
3 6/29/2022 06:55 <NA> NA NA NA <NA>
4 9/11/2021 09:35 BROOKLYN 11208 40.7 -73.9 (40.667202…
5 12/14/2021 08:13 BROOKLYN 11233 40.7 -73.9 (40.683304…
6 4/14/2021 12:47 <NA> NA NA NA <NA>
7 12/14/2021 17:05 <NA> NA 40.7 -74.0 (40.709183…
8 12/14/2021 08:17 BRONX 10475 40.9 -73.8 (40.86816,…
9 12/14/2021 21:10 BROOKLYN 11207 40.7 -73.9 (40.67172,…
10 12/14/2021 14:58 MANHATTAN 10017 40.8 -74.0 (40.75144,…
# ℹ 1,048,565 more rows
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
# `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
# `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
# `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
# `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
# `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>, …
# Filtering on a number of contributing factors, including the more severe ones. We focus on Brooklyn, the most populous of the NYC boroughs, and on accidents that resulted in motorist injuries.
# Note: "Alcohol Involvment" and "Failure to Yied Right-Of-Way" appear to be misspelled (compare "Failure to Yield Right-of-Way" in the renaming step further down), so %in% matches no rows for them. The results in this report reflect the remaining eight factors.
accidents <- nycaccidents |>
  filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% c(
    "Unsafe Speed",
    "Unsafe Lane Changing",
    "Passing Too Closely",
    "Traffic Control Disregarded",
    "Driver Inexperience",
    "Passing or Lane Usage Improper",
    "Driver Inattention/Distraction",
    "Alcohol Involvment",
    "Failure to Yied Right-Of-Way",
    "Aggressive Driving/Road Rage"
  )) |>
  filter(BOROUGH == "BROOKLYN") |>
  filter(`NUMBER OF MOTORIST INJURED` > 0)
# Renaming two factors in the full dataset. This runs after `accidents` was created, so it does not change the values stored in `accidents`.
nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1`[nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1` == "Passing or Lane Usage Improper"] <- "Improper Passing or Lane Use"
nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1`[nycaccidents$`CONTRIBUTING FACTOR VEHICLE 1` == "Failure to Yield Right-of-Way"] <- "Failed to Yield when Required"
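The `%in%` filter above keeps only rows whose factor string matches one of the listed names exactly, so it can help to compare the requested names against the values that actually occur in the column. The following is a minimal sketch in base R, assuming a small made-up vector in place of the real `CONTRIBUTING FACTOR VEHICLE 1` column; the names `factor_col`, `wanted`, `unmatched`, and `kept` are hypothetical and not part of the original analysis.

```r
# Toy stand-in for the CONTRIBUTING FACTOR VEHICLE 1 column (made-up values).
factor_col <- c("Unsafe Speed", "Alcohol Involvement", "Unsafe Speed",
                "Failure to Yield Right-of-Way", "Driver Inexperience", NA)

# Factor names requested in a filter; the last two are deliberately misspelled.
wanted <- c("Unsafe Speed", "Driver Inexperience",
            "Alcohol Involvment", "Failure to Yied Right-Of-Way")

# Requested names that never occur in the data (matching is exact and case-sensitive).
unmatched <- setdiff(wanted, unique(factor_col))
print(unmatched)

# Values kept by the %in% filter; NA is not in `wanted`, so it is dropped as well.
kept <- factor_col[factor_col %in% wanted]
print(table(kept))
```

Running a check like this before filtering makes it visible when a requested name silently selects zero rows.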
# Creating a short scatterplot with a linear regression line
s1 <- accidents |>
  count(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) |>
  ggplot(aes(x = `CONTRIBUTING FACTOR VEHICLE 1`, y = n)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm", color = "#A16", se = FALSE) +
  labs(
    title = "Accidents reported in Brooklyn, NYC, Resulting in Injuries",
    caption = "Source: New York City Open Data",
    x = "Cause of Accident",
    y = "Number of Incidents"
  ) +
  theme_minimal(base_size = 12) +
  coord_flip()
s1
`geom_smooth()` using formula = 'y ~ x'
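The first data visualization shows the number of incidents for each cause of accident. Because the causes are categories placed in alphabetical order rather than ranked by severity, the fitted line does not measure how incidents change with severity. The regression summary that follows also gives a negative slope (-320.5) that is not statistically significant (p = 0.443, R-squared = 0.101), so the counts show no clear linear trend across the causes.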
Call:
lm(formula = n ~ as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`)),
data = summary_data)
Residuals:
Min 1Q Median 3Q Max
-2515.2 -1178.6 -657.7 569.8 5199.2
Coefficients:
Estimate Std. Error
(Intercept) 3040.7 1971.0
as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`)) -320.5 390.3
t value Pr(>|t|)
(Intercept) 1.543 0.174
as.numeric(as.factor(`CONTRIBUTING FACTOR VEHICLE 1`)) -0.821 0.443
Residual standard error: 2530 on 6 degrees of freedom
Multiple R-squared: 0.101, Adjusted R-squared: -0.04883
F-statistic: 0.6741 on 1 and 6 DF, p-value: 0.443
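The model call in the output above uses `as.numeric(as.factor(...))` as its predictor. The code that built `summary_data` is not shown in this report, so the following is only a sketch in base R with made-up counts (not the values from the report) of what that conversion does; the column names `cause` and `code` are hypothetical.

```r
# Made-up counts per cause; these are NOT the numbers from the report.
summary_data <- data.frame(
  cause = c("Unsafe Speed", "Driver Inexperience", "Aggressive Driving/Road Rage"),
  n     = c(120, 80, 40)
)

# as.factor() sorts the levels alphabetically; as.numeric() returns the level index.
summary_data$code <- as.numeric(as.factor(summary_data$cause))
print(summary_data)

# The slope therefore relates the count to alphabetical rank, not to severity.
fit <- lm(n ~ code, data = summary_data)
print(coef(fit))
```

Renaming a cause would change its numeric code and therefore the slope, which is why a coefficient from this kind of model is hard to interpret for unordered categories.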
# For our second visualization, we will look at a bar graph of the different accident causes
p1 <- accidents |>
  ggplot(aes(
    x = reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `CONTRIBUTING FACTOR VEHICLE 1`, FUN = length),
    fill = `CONTRIBUTING FACTOR VEHICLE 1`
  )) +
  geom_bar(alpha = 0.5, color = "white") +
  scale_fill_discrete(
    name = "Accidents",
    labels = c(
      "Road Rage/Aggressive Driving",
      "Distracted Driving",
      "Inexperienced/Unlicensed Driver",
      "Improper Passing or Lane Use",
      "Unsafe Passing",
      "Disobeyed Traffic Signs or Signals",
      "Unsafe Lane Change",
      "Speeding"
    )
  ) +
  labs(
    x = "Cause of Accident",
    y = "Number of Incidents",
    title = "Accidents reported in Brooklyn, NYC, resulting in Injuries",
    caption = "Source: New York City Open Data"
  ) +
  theme_bw() +
  coord_flip()
p1
# I added an extra visualization just for a little bit of fun and to explore a different type of data visualization, so here's a treemap.
p2 <- accidents |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) |>
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  summarize(count = n()) |>
  ungroup()
library(treemap)
treemap(
  p2,
  index = "CONTRIBUTING FACTOR VEHICLE 1",
  vSize = "count",
  vColor = "count",
  type = "value",
  palette = "RdYlBu",
  title = "Accidents reported in Brooklyn, NYC, resulting in Injuries",
  title.legend = "Accident Types"
)
# Here is my third visualization, which is a map with user interactivity. You can click on any of the circles to see where an accident occurred and what caused it.
accidents_lat <- mean(accidents$LATITUDE, na.rm = TRUE)
accidents_lon <- mean(accidents$LONGITUDE, na.rm = TRUE)
m1 <- leaflet(data = accidents) |>
  setView(lng = accidents_lon, lat = accidents_lat, zoom = 11.5) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    radius = 50,
    color = "#290",
    fillColor = "#250",
    fillOpacity = 0.25,
    label = ~`CONTRIBUTING FACTOR VEHICLE 1`,
    popup = ~paste("<strong>Accident Cause:</strong>", `CONTRIBUTING FACTOR VEHICLE 1`),
    highlightOptions = highlightOptions(
      weight = 4,
      color = "#606",
      fillOpacity = 0.7,
      bringToFront = TRUE
    )
  )
Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively
Warning in validateCoords(lng, lat, funcName): Data contains 275 rows with
either missing or invalid lat/lon values and will be ignored
m1
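The warning above reports rows with missing or invalid coordinates that the map ignores. One way to make that step explicit is to drop such rows before mapping and count how many were removed. This is a minimal base R sketch on a made-up data frame, assuming only that the coordinate columns are named `LATITUDE` and `LONGITUDE` as in the dataset; `toy`, `mappable`, `n_skipped`, and `center` are hypothetical names.

```r
# Toy stand-in for the accidents data (made-up coordinates).
toy <- data.frame(
  LATITUDE  = c(40.67, NA, 40.68, 40.70),
  LONGITUDE = c(-73.90, -73.95, NA, -73.96)
)

# TRUE for rows where both coordinates are present.
has_coords <- complete.cases(toy[, c("LATITUDE", "LONGITUDE")])
mappable <- toy[has_coords, ]

# Number of rows that cannot be placed on the map.
n_skipped <- sum(!has_coords)
print(n_skipped)

# Map centre computed from the usable rows only.
center <- colMeans(mappable)
print(center)
```

Passing a pre-filtered data frame such as `mappable` to the mapping call would keep the reported row count and the plotted points in agreement.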
The data shows that, of the causes examined, the vast majority of accidents in Brooklyn, New York, that resulted in motorist injuries involved a driver who was distracted: talking to passengers in the car, using a cell phone, or engaging in other activities that take attention away from driving and the surroundings. According to Pines Salomon, a personal injury law firm based in San Diego, California, distracted driving, including the use of electronic devices whilst driving, is the most common factor behind car accidents in America. Additionally, according to the Maryland MVA (Motor Vehicle Administration), the vast majority of individuals who are distracted whilst driving are using a cell phone or other electronic device. Driving responsibly and safely requires a driver’s full attention, and one distraction is all it takes to cause a massive accident with serious injuries, or even fatalities.
These visualizations represent drivers who caused a crash because they were distracted by their phone, by other passengers in their car, or by something outside the window that took their attention away from the safe operation of their vehicle and put other motorists and pedestrians in danger. As a rookie driver myself, I find these trends very concerning, as they show that far too many drivers do not follow the rules of the road. I wish I could have included more visualizations, such as a heatmap or an alluvial diagram, or a comparison of distracted-driving incidents across the boroughs to see which one has the most accidents involving a distracted driver.
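The borough comparison mentioned above was not carried out in this project. As a rough illustration of how it could be approached, here is a base R sketch on made-up rows; the column name `factor1` is a hypothetical stand-in for `CONTRIBUTING FACTOR VEHICLE 1`, and the counts say nothing about the real data.

```r
# Made-up rows; the real data would come from the full collisions dataset.
toy <- data.frame(
  BOROUGH = c("BROOKLYN", "QUEENS", "BROOKLYN", "BRONX", "QUEENS", "BROOKLYN", NA),
  factor1 = c("Driver Inattention/Distraction", "Driver Inattention/Distraction",
              "Unsafe Speed", "Driver Inattention/Distraction",
              "Unsafe Speed", "Driver Inattention/Distraction",
              "Driver Inattention/Distraction")
)

# Keep distracted-driving rows that have a known borough.
is_distracted <- toy$factor1 == "Driver Inattention/Distraction" & !is.na(toy$BOROUGH)
distracted <- toy[is_distracted, ]

# Count incidents per borough, largest first.
by_borough <- sort(table(distracted$BOROUGH), decreasing = TRUE)
print(by_borough)
```

Raw counts would favor the larger boroughs, so a fuller comparison might also divide by each borough's total number of accidents.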
“San Diego Car Accident Lawyers - the 25 Top Causes of Car Accidents in the US.” Pines Salomon Personal Injury Lawyers, 7 Apr. 2026, seriousaccidents.com/personal-injury-resources/top-causes-of-car-accidents/.
“Common Causes of Distracted Driving.” Zero Deaths Maryland & Vision Zero - Maryland Highway Safety Office, Zero Deaths Maryland, 24 Mar. 2023, zerodeathsmd.gov/news/common-causes-of-distracted-driving/.
AI USE ATTRIBUTION STATEMENT
────────────────────────────────────────
Title: DATA 110 Final Project
Creator: Zachary Rodavich
Context: DATA 110
Document Type: Student assignment
AI Permission: AI-NO
AI Creation Categories: None selected
AI Tools Used:
• Gemini 3 (used 2026-05-11) — Debugging
• Gemini 3 (used 2026-05-12) — Debugging
AI Prompt: There is an error in this code that needs fixing. Please show me what went wrong and how I can fix my code.
Human Role: I edited any faulty lines of code with code suggested by the A.I. programs listed above.
Notes: All other work is written by me and me ONLY. A.I. is solely used for the purposes of debugging and finding problems within my code.
────────────────────────────────────────
Generated with AI Attribution Generator