For this project, I used the NYC Motor Vehicle Collisions dataset to study severe traffic crashes in New York City. I chose this topic because car crashes are a serious public safety issue, and this dataset records both the reported causes of crashes and where they happened. The dataset includes categorical variables such as borough, contributing factor, and vehicle type, and quantitative variables such as the number of persons injured and the number of persons killed. It also includes latitude and longitude, which made it possible to create a map for the GIS part of the project. To clean the data, I renamed the variables to make them easier to use in R, selected only the columns needed for my analysis, removed rows with missing values in the main variables, and filtered the data to focus on severe crashes. I kept crashes with at least 8 people injured or at least 1 person killed, which reduced the roughly 2 million rows of the full dataset to a small subset that stayed under the project limit for plotting.
Rows: 2026647 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time (1): CRASH TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(crashes)
# A tibble: 6 × 29
`CRASH DATE` `CRASH TIME` BOROUGH `ZIP CODE` LATITUDE LONGITUDE LOCATION
<chr> <time> <chr> <dbl> <dbl> <dbl> <chr>
1 09/11/2021 02:39 <NA> NA NA NA <NA>
2 03/26/2022 11:45 <NA> NA NA NA <NA>
3 06/29/2022 06:55 <NA> NA NA NA <NA>
4 09/11/2021 09:35 BROOKLYN 11208 40.7 -73.9 (40.667202, …
5 12/14/2021 08:13 BROOKLYN 11233 40.7 -73.9 (40.683304, …
6 04/14/2021 12:47 <NA> NA NA NA <NA>
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
# `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
# `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
# `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
# `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
# `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>,
# `CONTRIBUTING FACTOR VEHICLE 2` <chr>, …
names(crashes)
[1] "CRASH DATE" "CRASH TIME"
[3] "BOROUGH" "ZIP CODE"
[5] "LATITUDE" "LONGITUDE"
[7] "LOCATION" "ON STREET NAME"
[9] "CROSS STREET NAME" "OFF STREET NAME"
[11] "NUMBER OF PERSONS INJURED" "NUMBER OF PERSONS KILLED"
[13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED"
[15] "NUMBER OF CYCLIST INJURED" "NUMBER OF CYCLIST KILLED"
[17] "NUMBER OF MOTORIST INJURED" "NUMBER OF MOTORIST KILLED"
[19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
[21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
[23] "CONTRIBUTING FACTOR VEHICLE 5" "COLLISION_ID"
[25] "VEHICLE TYPE CODE 1" "VEHICLE TYPE CODE 2"
[27] "VEHICLE TYPE CODE 3" "VEHICLE TYPE CODE 4"
[29] "VEHICLE TYPE CODE 5"
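# Rename the variables used in the analysis
crashes1 <- crashes |>
  rename(
    crash_date = "CRASH DATE",
    borough = "BOROUGH",
    latitude = "LATITUDE",
    longitude = "LONGITUDE",
    contributing_factor = "CONTRIBUTING FACTOR VEHICLE 1",
    vehicle_type = "VEHICLE TYPE CODE 1",
    persons_injured = "NUMBER OF PERSONS INJURED",
    persons_killed = "NUMBER OF PERSONS KILLED"
  )
# head(crashes) prints the original tibble, so the output below still shows the
# original column names; head(crashes1) would show the renamed ones
head(crashes)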
# A tibble: 6 × 29
`CRASH DATE` `CRASH TIME` BOROUGH `ZIP CODE` LATITUDE LONGITUDE LOCATION
<chr> <time> <chr> <dbl> <dbl> <dbl> <chr>
1 09/11/2021 02:39 <NA> NA NA NA <NA>
2 03/26/2022 11:45 <NA> NA NA NA <NA>
3 06/29/2022 06:55 <NA> NA NA NA <NA>
4 09/11/2021 09:35 BROOKLYN 11208 40.7 -73.9 (40.667202, …
5 12/14/2021 08:13 BROOKLYN 11233 40.7 -73.9 (40.683304, …
6 04/14/2021 12:47 <NA> NA NA NA <NA>
# ℹ 22 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
# `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
# `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
# `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
# `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
# `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>,
# `CONTRIBUTING FACTOR VEHICLE 2` <chr>, …
# Select the main variables
crashes_clean <- crashes1 |>
  select(
    crash_date,
    borough,
    latitude,
    longitude,
    contributing_factor,
    vehicle_type,
    persons_injured,
    persons_killed
  )
head(crashes_clean)
# A tibble: 6 × 8
crash_date borough latitude longitude contributing_factor vehicle_type
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 09/11/2021 <NA> NA NA Aggressive Driving/Road R… Sedan
2 03/26/2022 <NA> NA NA Pavement Slippery Sedan
3 06/29/2022 <NA> NA NA Following Too Closely Sedan
4 09/11/2021 BROOKLYN 40.7 -73.9 Unspecified Sedan
5 12/14/2021 BROOKLYN 40.7 -73.9 <NA> <NA>
6 04/14/2021 <NA> NA NA Unspecified Dump
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
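The code that removes missing values and keeps only severe crashes is not shown in this report. A minimal sketch of what that step could look like, based on the thresholds described in the introduction, is given below. The toy data and the names `toy_clean` and `toy_severe` are hypothetical and only illustrate the logic; the real analysis stores its result in `crashes_800`, whose exact definition is an assumption here.

```r
library(dplyr)

# Toy stand-in for crashes_clean (hypothetical values, not taken from the dataset)
toy_clean <- tibble::tibble(
  borough = c("BROOKLYN", NA, "QUEENS", "BRONX", "MANHATTAN"),
  latitude = c(40.70, NA, 40.73, 40.85, 40.78),
  longitude = c(-73.90, NA, -73.80, -73.90, -73.97),
  contributing_factor = c("Unsafe Speed", "Unspecified",
                          "Alcohol Involvement", NA, "Unspecified"),
  persons_injured = c(9, 12, 0, 8, 2),
  persons_killed = c(0, 0, 1, 0, 0)
)

toy_severe <- toy_clean |>
  # Drop rows that are missing any of the main variables
  filter(
    !is.na(borough),
    !is.na(latitude),
    !is.na(longitude),
    !is.na(contributing_factor)
  ) |>
  # Keep crashes with at least 8 people injured or at least 1 person killed
  filter(persons_injured >= 8 | persons_killed >= 1)

# Count the remaining severe crashes per borough
toy_severe |> count(borough)
```

In this toy example only the Brooklyn and Queens rows survive: two rows are dropped for missing values and the Manhattan row is dropped because it meets neither severity threshold.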
crashes_800 |> count(borough)
# A tibble: 5 × 2
borough n
<chr> <int>
1 BRONX 132
2 BROOKLYN 285
3 MANHATTAN 87
4 QUEENS 171
5 STATEN ISLAND 47
crashes_800 |> count(contributing_factor)
# A tibble: 38 × 2
contributing_factor n
<chr> <int>
1 Aggressive Driving/Road Rage 8
2 Alcohol Involvement 29
3 Animals Action 1
4 Backing Unsafely 8
5 Brakes Defective 2
6 Driver Inattention/Distraction 91
7 Driver Inexperience 19
8 Drugs (illegal) 6
9 Failure to Keep Right 1
10 Failure to Yield Right-of-Way 63
# ℹ 28 more rows
# Boxplot before improvements
ggplot(crashes_800,
       aes(x = reorder(contributing_factor, persons_injured, median),
           y = persons_injured,
           fill = contributing_factor)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Persons Injured by Contributing Factor in Severe NYC Crashes",
    subtitle = "Filtered crashes with high injuries or at least one death",
    x = "Contributing Factor",
    y = "Number of Persons Injured",
    caption = "Source: NYC Motor Vehicle Collisions dataset"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 12)
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
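The warning above appears because the Set2 palette defines only 8 colors, while the plot asks for one fill color per contributing factor (38 categories). Reducing the number of categories is one fix. As an alternative sketch, the 8 base colors could be interpolated into a longer palette; the names `set2_max`, `n_factors`, and `set2_extended` are hypothetical, and this is not the approach taken in the report.

```r
library(RColorBrewer)

# Maximum number of colors that Set2 defines
set2_max <- brewer.pal.info["Set2", "maxcolors"]

# Number of contributing factors reported by count() earlier in the report
n_factors <- 38

# Interpolate the base Set2 colors into one color per category
set2_extended <- colorRampPalette(brewer.pal(set2_max, "Set2"))(n_factors)
length(set2_extended)
```

The resulting vector could be passed to `scale_fill_manual(values = set2_extended)` in place of `scale_fill_brewer()`, although 38 interpolated colors are hard to tell apart, which is one more reason to limit the plot to the most common factors.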
# Keep the 7 most frequent contributing factors (the cutoff of 7 was an arbitrary choice)
top_factors <- crashes_800 |>
  count(contributing_factor, sort = TRUE) |>
  slice_head(n = 7)
top_factors
# A tibble: 7 × 2
contributing_factor n
<chr> <int>
1 Unspecified 192
2 Driver Inattention/Distraction 91
3 Traffic Control Disregarded 91
4 Unsafe Speed 76
5 Failure to Yield Right-of-Way 63
6 Alcohol Involvement 29
7 Physical Disability 23
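The step that builds `crashes_top` from this table is not shown in the report. One way it could be done, sketched here on toy data under that assumption, is a `semi_join()` that keeps only the crashes whose contributing factor appears in the top list. The names `toy_crashes`, `toy_top_factors`, and `toy_top` are hypothetical.

```r
library(dplyr)

# Toy stand-in for crashes_800 (hypothetical values)
toy_crashes <- tibble::tibble(
  contributing_factor = c("Unspecified", "Unspecified", "Unspecified",
                          "Unsafe Speed", "Unsafe Speed",
                          "Brakes Defective"),
  persons_injured = c(8, 10, 9, 12, 8, 9)
)

# Rank factors by frequency and keep the most common ones
# (n = 2 so the toy data shows a factor being dropped; the report uses n = 7)
toy_top_factors <- toy_crashes |>
  count(contributing_factor, sort = TRUE) |>
  slice_head(n = 2)

# Keep only the crashes whose factor is in the top list
toy_top <- toy_crashes |>
  semi_join(toy_top_factors, by = "contributing_factor")

toy_top |> count(contributing_factor)
```

A `filter()` with `%in%` on `toy_top_factors$contributing_factor` would give the same rows; `semi_join()` is used here only because it keeps the whole step in a single pipeline.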
# A tibble: 6 × 8
crash_date borough latitude longitude contributing_factor vehicle_type
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 08/27/2021 MANHATTAN 40.8 -73.9 Unspecified Sedan
2 09/11/2021 BROOKLYN 40.7 -74.0 Unsafe Speed Sedan
3 12/17/2021 STATEN ISLAND 40.5 -74.2 Unsafe Speed Sedan
4 04/21/2021 QUEENS 40.7 -73.8 Alcohol Involvement Sedan
5 09/14/2021 QUEENS 40.7 -73.7 Driver Inattention/D… Station Wag…
6 08/21/2021 QUEENS 40.8 -73.9 Traffic Control Disr… Station Wag…
# ℹ 2 more variables: persons_injured <dbl>, persons_killed <dbl>
# Boxplot after improvements
ggplot(crashes_top,
       aes(x = contributing_factor,
           y = persons_injured,
           fill = contributing_factor)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Persons Injured by Top 7 Contributing Factors in NYC Crashes",
    subtitle = "Filtered crashes with high injuries or at least one fatality",
    x = "Contributing Factor",
    y = "Number of Persons Injured",
    caption = "Source: NYC Motor Vehicle Collisions dataset"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_gray(base_size = 12)
In this project, I used both a boxplot and a leaflet map to better understand severe motor vehicle crashes in New York City. The boxplot compared the top 7 contributing factors and showed how the number of persons injured varies across crash causes. One pattern I noticed is that some factors appeared often in serious crashes, such as driver inattention/distraction (91 crashes), traffic control disregarded (91), and failure to yield right-of-way (63), while other categories had far fewer cases. The most common category, however, was "Unspecified" (192 crashes), so no specific cause was recorded for many of these crashes. I also noticed that the data had some outliers, which shows that a few crashes involved many more injuries than most of the others. The map showed the geographic side of the data by plotting where these severe crashes happened using latitude and longitude. The popup labels made the map more useful because they gave extra details about each crash when a point was clicked. Overall, the graph and the map worked well together because the graph showed differences between crash factors, while the map showed where the crashes were located. If I had more time, I would explore one borough in more detail or compare crash patterns by time of day to see whether location and timing reveal even stronger patterns.
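The map code is not included in this report, so the following is only a sketch of how a leaflet map with popups could be built from columns like the ones used here. The data frame `toy_map_data`, its coordinates, and the popup layout are hypothetical; the original map may have used different markers or labels.

```r
library(leaflet)

# Toy stand-in for the severe-crash subset (hypothetical values)
toy_map_data <- data.frame(
  borough = c("BROOKLYN", "QUEENS"),
  latitude = c(40.70, 40.73),
  longitude = c(-73.90, -73.80),
  contributing_factor = c("Unsafe Speed", "Alcohol Involvement"),
  persons_injured = c(9, 0),
  persons_killed = c(0, 1)
)

crash_map <- leaflet(toy_map_data) |>
  # Default OpenStreetMap background tiles
  addTiles() |>
  # One circle per crash, placed by longitude and latitude
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = 5,
    # HTML shown when a point is clicked
    popup = ~paste0(
      "<b>", borough, "</b><br>",
      "Factor: ", contributing_factor, "<br>",
      "Injured: ", persons_injured, "<br>",
      "Killed: ", persons_killed
    )
  )

crash_map
```

Building the popup text from several columns with `paste0()` is what lets each point show the borough, the contributing factor, and the injury and death counts together.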