Youssef

Author

Youssef Mezni

Introduction

For this project, I used the NYC Motor Vehicle Collisions dataset to study severe traffic crashes in New York City. I chose this topic because car crashes are a serious public safety issue, and this dataset makes it possible to examine both the causes of crashes and where they happen. The dataset includes categorical variables such as borough, contributing factor, and vehicle type, and quantitative variables such as the number of persons injured and the number of persons killed. It also includes latitude and longitude, which made it possible to create a map for the GIS part of the project.

To clean the data, I renamed the variables to make them easier to use in R, selected only the columns needed for my analysis, removed rows with missing values in those columns, and kept only crashes with at least one person injured. From those, I kept the more severe crashes: at least 8 people injured or at least 1 person killed. This reduced the roughly 2 million rows in the original file to 722, a subset small enough to stay under the project limit for plotting.
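The cleaning code applies two filters in sequence: first at least one person injured, then at least 8 injured or at least 1 killed. A minimal sketch on made-up rows shows how the two steps combine. The `toy` table and its `collision` labels are hypothetical and are not taken from the dataset.

```r
library(dplyr)

# Made-up crashes, for illustration only
toy <- tibble(
  collision       = c("A", "B", "C", "D"),
  persons_injured = c(0, 1, 9, 3),
  persons_killed  = c(1, 1, 0, 0)
)

severe <- toy |>
  filter(persons_injured >= 1) |>                      # at least one injury
  filter(persons_injured >= 8 | persons_killed >= 1)   # severity rule

severe$collision
```

In this sketch, crash A (one death, no recorded injuries) is removed by the first filter, and crash D (3 injuries, no deaths) is removed by the second, so only B and C remain.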

Source: NYPD, via Kaggle: https://www.kaggle.com/datasets/tush32/motor-vehicle-collisions-crashes

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
crashes <- read_csv("Motor_Vehicle_Collisions_-_Crashes (1).csv")
Rows: 2026647 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time  (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(crashes)
# A tibble: 6 × 29
  `CRASH DATE` `CRASH TIME` BOROUGH  `ZIP CODE` LATITUDE LONGITUDE LOCATION     
  <chr>        <time>       <chr>         <dbl>    <dbl>     <dbl> <chr>        
1 09/11/2021   02:39        <NA>             NA     NA        NA   <NA>         
2 03/26/2022   11:45        <NA>             NA     NA        NA   <NA>         
3 06/29/2022   06:55        <NA>             NA     NA        NA   <NA>         
4 09/11/2021   09:35        BROOKLYN      11208     40.7     -73.9 (40.667202, …
5 12/14/2021   08:13        BROOKLYN      11233     40.7     -73.9 (40.683304, …
6 04/14/2021   12:47        <NA>             NA     NA        NA   <NA>         
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
#   `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
#   `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
#   `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
#   `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
#   `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>,
#   `CONTRIBUTING FACTOR VEHICLE 2` <chr>, …
names(crashes)
 [1] "CRASH DATE"                    "CRASH TIME"                   
 [3] "BOROUGH"                       "ZIP CODE"                     
 [5] "LATITUDE"                      "LONGITUDE"                    
 [7] "LOCATION"                      "ON STREET NAME"               
 [9] "CROSS STREET NAME"             "OFF STREET NAME"              
[11] "NUMBER OF PERSONS INJURED"     "NUMBER OF PERSONS KILLED"     
[13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED" 
[15] "NUMBER OF CYCLIST INJURED"     "NUMBER OF CYCLIST KILLED"     
[17] "NUMBER OF MOTORIST INJURED"    "NUMBER OF MOTORIST KILLED"    
[19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
[21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
[23] "CONTRIBUTING FACTOR VEHICLE 5" "COLLISION_ID"                 
[25] "VEHICLE TYPE CODE 1"           "VEHICLE TYPE CODE 2"          
[27] "VEHICLE TYPE CODE 3"           "VEHICLE TYPE CODE 4"          
[29] "VEHICLE TYPE CODE 5"          
# Rename the variables used in the analysis
crashes1 <- crashes |>
  rename(
    crash_date = "CRASH DATE",
    borough = "BOROUGH",
    latitude = "LATITUDE",
    longitude = "LONGITUDE",
    contributing_factor = "CONTRIBUTING FACTOR VEHICLE 1",
    vehicle_type = "VEHICLE TYPE CODE 1",
    persons_injured = "NUMBER OF PERSONS INJURED",
    persons_killed = "NUMBER OF PERSONS KILLED"
  )

head(crashes1)
#Select the main variables
crashes_clean <- crashes1 |>
  select(
    crash_date,
    borough,
    latitude,
    longitude,
    contributing_factor,
    vehicle_type,
    persons_injured,
    persons_killed
  )

head(crashes_clean)
# A tibble: 6 × 8
  crash_date borough  latitude longitude contributing_factor        vehicle_type
  <chr>      <chr>       <dbl>     <dbl> <chr>                      <chr>       
1 09/11/2021 <NA>         NA        NA   Aggressive Driving/Road R… Sedan       
2 03/26/2022 <NA>         NA        NA   Pavement Slippery          Sedan       
3 06/29/2022 <NA>         NA        NA   Following Too Closely      Sedan       
4 09/11/2021 BROOKLYN     40.7     -73.9 Unspecified                Sedan       
5 12/14/2021 BROOKLYN     40.7     -73.9 <NA>                       <NA>        
6 04/14/2021 <NA>         NA        NA   Unspecified                Dump        
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
# Remove rows with missing values in the main variables,
# then keep only crashes with at least one person injured
crashes_clean2 <- crashes_clean |>
  drop_na(borough, latitude, longitude, contributing_factor,
          vehicle_type, persons_injured, persons_killed) |>
  filter(persons_injured >= 1)

head(crashes_clean2)
# A tibble: 6 × 8
  crash_date borough       latitude longitude contributing_factor   vehicle_type
  <chr>      <chr>            <dbl>     <dbl> <chr>                 <chr>       
1 12/14/2021 BRONX             40.9     -73.8 Unspecified           Sedan       
2 12/14/2021 QUEENS            40.7     -73.8 Reaction to Uninvolv… Sedan       
3 12/14/2021 BROOKLYN          40.7     -74.0 Steering Failure      Sedan       
4 12/11/2021 BRONX             40.9     -73.9 Unspecified           Station Wag…
5 12/13/2021 STATEN ISLAND     40.6     -74.1 Unspecified           Sedan       
6 12/14/2021 BROOKLYN          40.6     -74.0 Unspecified           Sedan       
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
#Filter the data to more severe crashes
crashes_800 <- crashes_clean2 |>
  filter(persons_injured >= 8 | persons_killed >= 1)  

nrow(crashes_800)
[1] 722
head(crashes_800)
# A tibble: 6 × 8
  crash_date borough       latitude longitude contributing_factor   vehicle_type
  <chr>      <chr>            <dbl>     <dbl> <chr>                 <chr>       
1 08/27/2021 MANHATTAN         40.8     -73.9 Unspecified           Sedan       
2 09/11/2021 BROOKLYN          40.7     -74.0 Unsafe Speed          Sedan       
3 12/17/2021 STATEN ISLAND     40.5     -74.2 Unsafe Speed          Sedan       
4 04/21/2021 QUEENS            40.7     -73.8 Alcohol Involvement   Sedan       
5 09/14/2021 QUEENS            40.7     -73.7 Driver Inattention/D… Station Wag…
6 08/21/2021 QUEENS            40.8     -73.9 Traffic Control Disr… Station Wag…
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
#Explore categorical variables

crashes_800 |>
  count(borough)
# A tibble: 5 × 2
  borough           n
  <chr>         <int>
1 BRONX           132
2 BROOKLYN        285
3 MANHATTAN        87
4 QUEENS          171
5 STATEN ISLAND    47
crashes_800 |>
  count(contributing_factor)
# A tibble: 38 × 2
   contributing_factor                n
   <chr>                          <int>
 1 Aggressive Driving/Road Rage       8
 2 Alcohol Involvement               29
 3 Animals Action                     1
 4 Backing Unsafely                   8
 5 Brakes Defective                   2
 6 Driver Inattention/Distraction    91
 7 Driver Inexperience               19
 8 Drugs (illegal)                    6
 9 Failure to Keep Right              1
10 Failure to Yield Right-of-Way     63
# ℹ 28 more rows
# Boxplot before improvements
ggplot(crashes_800,
       aes(x = reorder(contributing_factor, persons_injured, median),
           y = persons_injured,
           fill = contributing_factor)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Persons Injured by Contributing Factor in Severe NYC Crashes",
    subtitle = "Filtered crashes with high injuries or at least one death",
    x = "Contributing Factor",
    y = "Number of Persons Injured",
    caption = "Source: NYC Motor Vehicle Collisions dataset"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 12)
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
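The warning above appears because the Set2 palette provides at most 8 colors, while the plot has 38 contributing factors. One possible way to reduce the number of categories without dropping any rows is to lump the rare factors into a single "Other" level with `forcats::fct_lump_n()`. This is only a sketch on made-up values; the vector `factors` is hypothetical and is not the project data.

```r
library(forcats)

# Made-up contributing factors, for illustration only
factors <- c(
  rep("Unspecified", 5),
  rep("Unsafe Speed", 3),
  "Animals Action",
  "Brakes Defective"
)

# Keep the 2 most common levels and collapse the rest into "Other"
lumped <- fct_lump_n(factors, n = 2)

table(lumped)
```

With real data, the same call could be used inside `mutate()` on `contributing_factor`, with `n` chosen so that the number of levels stays within the palette limit.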

# Keep the 7 most common contributing factors (the cutoff of 7 was chosen arbitrarily)
top_factors <- crashes_800 |>
  count(contributing_factor, sort = TRUE) |>
  slice_head(n = 7)

top_factors
# A tibble: 7 × 2
  contributing_factor                n
  <chr>                          <int>
1 Unspecified                      192
2 Driver Inattention/Distraction    91
3 Traffic Control Disregarded       91
4 Unsafe Speed                      76
5 Failure to Yield Right-of-Way     63
6 Alcohol Involvement               29
7 Physical Disability               23
crashes_top <- crashes_800 |>
  filter(contributing_factor %in% top_factors$contributing_factor)

head(crashes_top)
# A tibble: 6 × 8
  crash_date borough       latitude longitude contributing_factor   vehicle_type
  <chr>      <chr>            <dbl>     <dbl> <chr>                 <chr>       
1 08/27/2021 MANHATTAN         40.8     -73.9 Unspecified           Sedan       
2 09/11/2021 BROOKLYN          40.7     -74.0 Unsafe Speed          Sedan       
3 12/17/2021 STATEN ISLAND     40.5     -74.2 Unsafe Speed          Sedan       
4 04/21/2021 QUEENS            40.7     -73.8 Alcohol Involvement   Sedan       
5 09/14/2021 QUEENS            40.7     -73.7 Driver Inattention/D… Station Wag…
6 08/21/2021 QUEENS            40.8     -73.9 Traffic Control Disr… Station Wag…
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
# Boxplot after improvements
ggplot(crashes_top,
       aes(x = contributing_factor,
           y = persons_injured,
           fill = contributing_factor)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Persons Injured by Top 7 Contributing Factors in NYC Crashes",
    subtitle = "Filtered crashes with high injuries or at least one fatality",
    x = "Contributing Factor",
    y = "Number of Persons Injured",
    caption = "Source: NYC Motor Vehicle Collisions dataset"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_gray(base_size = 12)

#Mapping
pal <- colorNumeric(
  palette = "YlOrRd",
  domain = crashes_top$persons_injured
)

leaflet(crashes_top) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = ~persons_injured / 2,
    color = ~pal(persons_injured),
    fillOpacity = 0.7,
    stroke = FALSE
  ) |>
  addLegend(
    "bottomright",
    pal = pal,
    values = ~persons_injured,
    title = "Persons Injured",
    opacity = 1
  ) |>
  setView(lng = -73.94, lat = 40.70, zoom = 10)
# Second map: lighter basemap, with popups and hover labels
leaflet(crashes_top) |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = ~persons_injured / 2,
    color = ~pal(persons_injured),
    fillOpacity = 0.75,
    stroke = TRUE,
    weight = 1,
    popup = ~paste0(
      "<b>Borough:</b> ", borough, "<br/>",
      "<b>Contributing Factor:</b> ", contributing_factor, "<br/>",
      "<b>Vehicle Type:</b> ", vehicle_type, "<br/>",
      "<b>Persons Injured:</b> ", persons_injured, "<br/>",
      "<b>Persons Killed:</b> ", persons_killed
    ),
    label = ~paste0("Injured: ", persons_injured)
  ) |>
  addLegend(
    "topright",
    pal = pal,
    values = ~persons_injured,
    title = "Persons Injured",
    opacity = 1
  ) |>
  setView(lng = -73.94, lat = 40.70, zoom = 10)
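In both maps the marker radius is `persons_injured / 2`, so a crash with one injured person is drawn with a radius of 0.5 pixels, which may be hard to see. One possible adjustment, offered as an assumption rather than part of the original analysis, is a square-root scale with a minimum size. The helper `marker_radius()` and its default values are hypothetical.

```r
# Hypothetical helper: square-root scaling with a floor so small crashes stay visible
marker_radius <- function(injured, min_radius = 3, scale = 2) {
  pmax(min_radius, scale * sqrt(injured))
}

# Radii for crashes with 1, 4, 16, and 25 injured people
marker_radius(c(1, 4, 16, 25))
```

It could be passed to `addCircleMarkers()` as `radius = ~marker_radius(persons_injured)`.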

Conclusion

In this project, I used both a boxplot and a leaflet map to better understand severe motor vehicle crashes in New York City. The boxplot compared the seven most common contributing factors and showed how the number of persons injured varies across crash causes. Apart from "Unspecified," which was the largest category (192 of the 722 crashes), the factors that appeared most often were driver inattention/distraction and traffic control disregarded (91 crashes each), unsafe speed (76), and failure to yield right-of-way (63), while other categories had fewer cases. The data also had some outliers, which shows that a few crashes involved many more injuries than most of the others. The map showed the geographic side of the data by plotting where these severe crashes happened using latitude and longitude. The popups made the map more useful because they gave extra details about each crash when a point was clicked. Overall, the graph and the map worked well together: the graph showed differences between crash factors, while the map showed where the crashes were located. If I had more time, I would explore one borough in more detail or compare crash patterns by time of day to see whether location and timing reveal even stronger patterns.
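As a possible starting point for the time-of-day comparison, the sketch below parses the date text and groups crashes into day and night on made-up rows. It assumes the time is available as "HH:MM" text; in the actual file, readr reads `CRASH TIME` as a time column and the cleaned data does not keep it, so this step would need adapting. The `toy` table, the `crash_time` and `period` columns, and the 6:00 to 18:00 cutoff are all hypothetical.

```r
library(dplyr)
library(lubridate)

# Made-up rows, for illustration only
toy <- tibble(
  crash_date = c("09/11/2021", "03/26/2022", "12/14/2021"),
  crash_time = c("02:39", "11:45", "18:13"),
  borough    = c("BROOKLYN", "QUEENS", "BROOKLYN")
)

toy_time <- toy |>
  mutate(
    crash_date = mdy(crash_date),            # "MM/DD/YYYY" text to Date
    crash_hour = hour(hm(crash_time)),       # hour of day, 0 to 23
    period     = if_else(crash_hour >= 6 & crash_hour < 18, "Day", "Night")
  )

# Number of crashes per borough and period
toy_time |>
  count(borough, period)
```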

Sources: W3Schools, https://www.w3schools.com/