Project2_NYC_Collisions

Author

Gamaliel Ngouafon

ESSAY Part 3 — Introduction to the Dataset and Topic

Motor vehicle crashes are one of the leading causes of death and injury in the United States, and New York City, with its dense population and mixed traffic of pedestrians, taxis, cyclists, trucks, and private vehicles, is one of the most complex traffic environments in the world. Understanding what causes these crashes and who gets hurt is crucial to designing safer streets and targeting enforcement resources. This project uses the NYC Motor Vehicle Collisions Crashes dataset to answer the question: which contributing factors lead to the most injuries in Brooklyn, and where do they cluster geographically?

The dataset is collected by the New York City Police Department (NYPD), which files an MV-104AN form for every crash involving injury, death, or property damage over $1,000. The data is published and maintained by the City of New York through the NYC OpenData portal and is updated weekly. The dataset was downloaded from: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95

Some of the variables present are:

NUMBER OF MOTORIST INJURED: Quantitative (discrete). Number of motorists injured in the crash.
LATITUDE: Quantitative (continuous). Decimal-degree WGS84 latitude of the crash.
LONGITUDE: Quantitative (continuous). Decimal-degree WGS84 longitude of the crash.
NUMBER OF PERSONS INJURED: Quantitative (discrete). Total number of people injured in the crash.
NUMBER OF PERSONS KILLED: Quantitative (discrete). Total number of people killed in the crash.

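Data Cleaning: Rows with a missing borough, missing coordinates, or missing injury and fatality counts were removed, and the CRASH DATE text column was parsed into a date column (crash_date) with lubridate's mdy().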

Why I chose this topic: The data is rich, direct, complete and recent, making it ideal for a mapping project. Pedestrians and cyclists are the most vulnerable road users.

Step 4: Loading necessary libraries

library(tidyverse)

library(leaflet)
library(lubridate)
library(webshot2)

Loading datasets

# Check the current working directory (the dataset itself is read from an absolute path below).
getwd()

[1] "/Users/darrenabou/Desktop/Spring 26/Data110/Project 2"

NYC_Collisions <- read_csv('/Users/darrenabou/Desktop/Spring 26/Data110/Data mapping/Motor_Vehicle_Collisions_-_Crashes.csv')

Rows: 1048575 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
time  (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

STEP 5: Cleaning

# Drop rows with missing coordinates, borough, or casualty counts, then parse CRASH DATE into a new Date column
NYC_Collisions_clean <- NYC_Collisions |>
  filter(
    !is.na(LONGITUDE), !is.na(LATITUDE), !is.na(BOROUGH),
    !is.na(`NUMBER OF PERSONS INJURED`), !is.na(`NUMBER OF PERSONS KILLED`),
    !is.na(`NUMBER OF PEDESTRIANS INJURED`), !is.na(`NUMBER OF PEDESTRIANS KILLED`),
    !is.na(`NUMBER OF CYCLIST INJURED`), !is.na(`NUMBER OF CYCLIST KILLED`),
    !is.na(`NUMBER OF MOTORIST INJURED`), !is.na(`NUMBER OF MOTORIST KILLED`)
  ) |>
  mutate(crash_date = mdy(`CRASH DATE`))
head(NYC_Collisions_clean)

# A tibble: 6 × 30
  `CRASH DATE` `CRASH TIME` BOROUGH   `ZIP CODE` LATITUDE LONGITUDE LOCATION    
  <chr>        <time>       <chr>          <dbl>    <dbl>     <dbl> <chr>       
1 11/1/23      01:29        BROOKLYN       11230     40.6     -74.0 (40.62179, …
2 9/11/21      09:35        BROOKLYN       11208     40.7     -73.9 (40.667202,…
3 12/14/21     08:13        BROOKLYN       11233     40.7     -73.9 (40.683304,…
4 12/14/21     08:17        BRONX          10475     40.9     -73.8 (40.86816, …
5 12/14/21     21:10        BROOKLYN       11207     40.7     -73.9 (40.67172, …
6 12/14/21     14:58        MANHATTAN      10017     40.8     -74.0 (40.75144, …
# ℹ 23 more variables: `ON STREET NAME` <chr>, `CROSS STREET NAME` <chr>,
#   `OFF STREET NAME` <chr>, `NUMBER OF PERSONS INJURED` <dbl>,
#   `NUMBER OF PERSONS KILLED` <dbl>, `NUMBER OF PEDESTRIANS INJURED` <dbl>,
#   `NUMBER OF PEDESTRIANS KILLED` <dbl>, `NUMBER OF CYCLIST INJURED` <dbl>,
#   `NUMBER OF CYCLIST KILLED` <dbl>, `NUMBER OF MOTORIST INJURED` <dbl>,
#   `NUMBER OF MOTORIST KILLED` <dbl>, `CONTRIBUTING FACTOR VEHICLE 1` <chr>,
#   `CONTRIBUTING FACTOR VEHICLE 2` <chr>, …
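
To see how many rows each cleaning condition costs before applying it, missing values can be counted per column first. The sketch below uses a small made-up table (the `toy` values are illustrative, not rows from the dataset) and assumes dplyr and lubridate are installed. It also shows that mdy() reads a two-digit year such as 23 as 2023.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the collisions table (illustrative values only)
toy <- tibble(
  `CRASH DATE` = c("11/1/23", "9/11/21", "12/14/21"),
  BOROUGH      = c("BROOKLYN", NA, "BRONX"),
  LATITUDE     = c(40.62179, 40.667202, NA),
  LONGITUDE    = c(-73.970024, -73.8665, NA)
)

# Count missing values per column before dropping anything
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep rows that are complete for the columns of interest, then parse the date text
toy_clean <- toy |>
  filter(!is.na(BOROUGH), !is.na(LATITUDE), !is.na(LONGITUDE)) |>
  mutate(crash_date = mdy(`CRASH DATE`))

na_counts
toy_clean
```

Running the same summarise() call on NYC_Collisions would show which columns account for most of the dropped rows.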

Exploration after cleaning

glimpse(NYC_Collisions_clean)

Rows: 661,751
Columns: 30
$ `CRASH DATE`                    <chr> "11/1/23", "9/11/21", "12/14/21", "12/…
$ `CRASH TIME`                    <time> 01:29:00, 09:35:00, 08:13:00, 08:17:0…
$ BOROUGH                         <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "B…
$ `ZIP CODE`                      <dbl> 11230, 11208, 11233, 10475, 11207, 100…
$ LATITUDE                        <dbl> 40.62179, 40.66720, 40.68330, 40.86816…
$ LONGITUDE                       <dbl> -73.97002, -73.86650, -73.91727, -73.8…
$ LOCATION                        <chr> "(40.62179, -73.970024)", "(40.667202,…
$ `ON STREET NAME`                <chr> "OCEAN PARKWAY", NA, "SARATOGA AVENUE"…
$ `CROSS STREET NAME`             <chr> "AVENUE K", NA, "DECATUR STREET", NA, …
$ `OFF STREET NAME`               <chr> NA, "1211      LORING AVENUE", NA, "34…
$ `NUMBER OF PERSONS INJURED`     <dbl> 1, 0, 0, 2, 0, 0, 0, 2, 0, 4, 1, 0, 0,…
$ `NUMBER OF PERSONS KILLED`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF PEDESTRIANS INJURED` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF PEDESTRIANS KILLED`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF CYCLIST INJURED`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF CYCLIST KILLED`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `NUMBER OF MOTORIST INJURED`    <dbl> 1, 0, 0, 2, 0, 0, 0, 2, 0, 4, 1, 0, 0,…
$ `NUMBER OF MOTORIST KILLED`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `CONTRIBUTING FACTOR VEHICLE 1` <chr> "Unspecified", "Unspecified", NA, "Uns…
$ `CONTRIBUTING FACTOR VEHICLE 2` <chr> "Unspecified", NA, NA, "Unspecified", …
$ `CONTRIBUTING FACTOR VEHICLE 3` <chr> "Unspecified", NA, NA, NA, NA, NA, NA,…
$ `CONTRIBUTING FACTOR VEHICLE 4` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `CONTRIBUTING FACTOR VEHICLE 5` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ COLLISION_ID                    <dbl> 4675373, 4456314, 4486609, 4486660, 44…
$ `VEHICLE TYPE CODE 1`           <chr> "Moped", "Sedan", NA, "Sedan", "Sedan"…
$ `VEHICLE TYPE CODE 2`           <chr> "Sedan", NA, NA, "Sedan", NA, "Station…
$ `VEHICLE TYPE CODE 3`           <chr> "Sedan", NA, NA, NA, NA, NA, NA, NA, N…
$ `VEHICLE TYPE CODE 4`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `VEHICLE TYPE CODE 5`           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ crash_date                      <date> 2023-11-01, 2021-09-11, 2021-12-14, 2…

summary(NYC_Collisions_clean$`NUMBER OF PERSONS INJURED`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.3503  1.0000 34.0000
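
The summary above (median 0, mean about 0.35, maximum 34) suggests that most crashes injure no one while a few injure many. One quick way to check this kind of skew is to compute the share of zero-injury crashes. The vector below is made up for illustration; on the real data, the column `NUMBER OF PERSONS INJURED` would take its place.

```r
# Toy injury counts per crash (illustrative values, not from the dataset)
injured <- c(0, 0, 0, 1, 0, 2, 0, 0, 1, 4)

# Share of crashes with no injuries
share_zero <- mean(injured == 0)

# Frequency of each injury count
injury_table <- table(injured)

share_zero
injury_table
```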

STEP 6: Visualizing Categorical variables

First categorical variable

# Crashes per borough, ordered from most to fewest
NYC_Collisions_clean |>
  count(BOROUGH) |>
  ggplot(aes(x = reorder(BOROUGH, -n), y = n, fill = BOROUGH)) +
  geom_col(color = "black") +
  scale_fill_manual(values = c(
    "BRONX" = "#e74c3c",
    "BROOKLYN" = "#3498db",
    "MANHATTAN" = "#2ecc71",
    "QUEENS" = "#f1c40f",
    "STATEN ISLAND" = "#9b59b6"
  )) +
  scale_y_log10() +
  labs(
    title = "Number of Crashes by Borough",
    x = "Borough",
    y = "Count"
  ) +
  theme_bw()

Second categorical variable

# This plot shows the top 10 contributing factors recorded for vehicle 1
NYC_Collisions_clean |>
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) |>
  count(`CONTRIBUTING FACTOR VEHICLE 1`, sort = TRUE) |>
  slice_head(n = 10) |>
  ggplot(aes(x = reorder(`CONTRIBUTING FACTOR VEHICLE 1`, n), y = n)) +
  geom_col(fill = "#f00a0a") +
  coord_flip() +
  scale_y_log10() +
  labs(
    title   = "Top 10 Contributing Factors in NYC Crashes (All Years)",
    x       = "Contributing Factor",
    y       = "Number of Crashes",
    caption = "Source: NYPD via NYC OpenData"
  ) +
  theme_bw()

Numerical Variables

First visualization

NYC_Collisions_clean |>
  filter(`NUMBER OF PEDESTRIANS INJURED` <= 10) |>
  ggplot(aes(x = `NUMBER OF PEDESTRIANS INJURED`)) +
  geom_histogram(fill = "cyan", binwidth = 1, color = "black") +  # one bar per integer count
  scale_y_log10() +
  labs(
    title = "Pedestrian Injuries (0–10 Range)",
    x = "Number of Pedestrians Injured",
    y = "Count of Crashes"
  ) +
  theme_bw()

Warning in scale_y_log10(): log-10 transformation introduced infinite values.
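
This warning appears because a log scale cannot represent a count of zero: log10(0) is negative infinity, so any empty bin is left out of the plot. A minimal base R illustration, with made-up bin counts:

```r
# Toy counts per bin; the middle bin is empty (illustrative values)
bin_counts <- c(100, 0, 10)

# log10 of an empty bin is -Inf, which cannot be drawn on a log axis
log_counts <- log10(bin_counts)

log_counts
```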

Second visualization

NYC_Collisions_clean |>
  mutate(fatal = ifelse(`NUMBER OF PERSONS KILLED` > 0, "Fatal", "Non-Fatal")) |>
  count(fatal) |>
  ggplot(aes(x = fatal, y = n, fill = fatal)) +
  geom_col() +
  scale_fill_manual( values = c("Fatal" = "red",
    "Non-Fatal" = "green")) +
  scale_y_log10() +
  labs(
    title = "Fatal vs Non-Fatal Crashes",
    x = "Crash Type",
    y = "Number of Crashes"
  ) +
  theme_bw()

Step 7 — Filtering: Inclusion/Exclusion Criteria (≤ 800 Observations)

List of all the boroughs of New York City present in the data

unique(NYC_Collisions_clean$BOROUGH)

[1] "BROOKLYN"      "BRONX"         "MANHATTAN"     "QUEENS"       
[5] "STATEN ISLAND"
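
Since the research question focuses on Brooklyn, one possible next step is to restrict the data to that borough before sampling. The sketch below uses a made-up table; applying the same filter() to NYC_Collisions_clean is an assumption about how the analysis could proceed, not a step taken in this project.

```r
library(dplyr)

# Toy stand-in for the cleaned table (illustrative values only)
toy <- tibble(
  BOROUGH = c("BROOKLYN", "BRONX", "BROOKLYN", "QUEENS"),
  `NUMBER OF PERSONS INJURED` = c(1, 0, 2, 0)
)

# Keep only crashes recorded in Brooklyn
brooklyn_only <- toy |>
  filter(BOROUGH == "BROOKLYN")

brooklyn_only
```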

Exclusion criteria to narrow the data down to 800 observations

# This chunk filters the dataset with dplyr, then draws a random sample of 800 observations

NYC_Sample <- NYC_Collisions_clean |>
  filter(
    !is.na(`CONTRIBUTING FACTOR VEHICLE 1`),
    `CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified",
    `NUMBER OF PERSONS INJURED` >= 0
  ) |>
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  filter(n() > 100) |>   # keep meaningful factors only
  ungroup() |>
  slice_sample(n = 800)
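
Because slice_sample() draws rows at random, the 800 rows differ on every run unless a seed is set first. A minimal sketch, using a made-up table and an arbitrary seed value (110 is a placeholder, not a value used in this project):

```r
library(dplyr)

# Toy table with 100 rows
toy <- tibble(id = 1:100)

# Setting the same seed before each draw gives the same sample
set.seed(110)  # hypothetical seed; any fixed integer works
sample_a <- slice_sample(toy, n = 5)

set.seed(110)
sample_b <- slice_sample(toy, n = 5)

identical(sample_a, sample_b)
```

Placing a single set.seed() call just before the chunk that builds NYC_Sample would keep the chart and the map consistent between renders.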

Crashes by year

# This chunk aggregates crashes by year and cause

library(highcharter)

library(RColorBrewer)

time_data <- NYC_Sample |>
  mutate(year = lubridate::year(crash_date)) |>
  group_by(year, `CONTRIBUTING FACTOR VEHICLE 1`) |>
  summarise(
    total_crashes = n(),
    .groups = "drop"
  ) |>
  arrange(year)
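
The same group-and-summarise pattern can also total injuries per factor, which speaks more directly to the question of which factors lead to the most injuries. A sketch on made-up rows (the counts are illustrative only):

```r
library(dplyr)

# Toy stand-in for the sampled crashes (illustrative values only)
toy <- tibble(
  `CONTRIBUTING FACTOR VEHICLE 1` = c(
    "Driver Inattention/Distraction", "Unsafe Speed",
    "Driver Inattention/Distraction", "Unsafe Speed",
    "Following Too Closely"
  ),
  `NUMBER OF PERSONS INJURED` = c(1, 3, 0, 2, 1)
)

# Crashes and total injuries per factor, most injuries first
injuries_by_factor <- toy |>
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) |>
  summarise(
    crashes = n(),
    total_injured = sum(`NUMBER OF PERSONS INJURED`),
    .groups = "drop"
  ) |>
  arrange(desc(total_injured))

injuries_by_factor
```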

STEP 8: Main visualization showing the main causes of crashes and their impact

# This chunk creates an interactive area chart

cols <- brewer.pal(5, "Set1")

highchart() |>
  hc_add_series(
    data = time_data,
    type = "area",
    hcaes(
      x = year,
      y = total_crashes,
      group = `CONTRIBUTING FACTOR VEHICLE 1`
    )
  ) |>
  hc_colors(cols) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Number of Crashes")) |>
  hc_title(text = "Crash Causes Over Time in NYC") |>
  hc_subtitle(text = "Interactive comparison of major contributing factors") |>
  hc_plotOptions(series = list(marker = list(symbol = "triangle"))) |>
  hc_legend(align = "right", verticalAlign = "bottom") |>
  hc_tooltip(
    shared = TRUE,
    borderColor = "black",
    pointFormat = "{point.series.name}: {point.y}<br>"
  )

Step 9: Map Visualization (GIS)

# This chunk creates an interactive map showing crash causes spatially

# Create a color palette based on contributing factors
map_pal <- colorFactor(
  palette = "Set1",
  domain = NYC_Sample$`CONTRIBUTING FACTOR VEHICLE 1`
)
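
Set1 has only nine colors, so a factor with more than nine levels triggers a warning from RColorBrewer. One possible workaround is to build a longer palette explicitly with colorRampPalette(). In the sketch below, the twelve level names are hypothetical placeholders, not factors from the dataset.

```r
library(RColorBrewer)

# Hypothetical factor levels standing in for the real contributing factors
factor_levels <- paste("Factor", 1:12)

# Interpolate the nine Set1 colors into one color per level
set1_extended <- colorRampPalette(brewer.pal(9, "Set1"))(length(factor_levels))

# Named lookup from factor level to hex color
factor_colors <- setNames(set1_extended, factor_levels)

factor_colors
```

The resulting vector could then be supplied as the palette argument of colorFactor() instead of the name "Set1".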

Creating labels for each point on the map

# Create popup labels
popup_info <- paste0(
  "<b>Cause:</b> ", NYC_Sample$`CONTRIBUTING FACTOR VEHICLE 1`, "<br>",
  "<b>Injured:</b> ", NYC_Sample$`NUMBER OF PERSONS INJURED`, "<br>",
  "<b>Killed:</b> ", NYC_Sample$`NUMBER OF PERSONS KILLED`, "<br>",
  "<b>Borough:</b> ", NYC_Sample$BOROUGH
)

Interactive Map

# Build the interactive map representing our main visualization
leaflet(NYC_Sample) |>
  setView(lng = -73.94, lat = 40.67, zoom = 11) |>
  addProviderTiles("Esri.NatGeoWorldMap") |>

  addCircleMarkers(
    lng = ~LONGITUDE,
    lat = ~LATITUDE,

    radius = ~sqrt(`NUMBER OF PERSONS INJURED`) * 10,

    color = ~map_pal(`CONTRIBUTING FACTOR VEHICLE 1`),
    fillOpacity = 0.7,
    stroke = FALSE,

    popup = popup_info
  )

Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

ESSAY PART 10: OUTRO - CONCLUSIONS

Visualization and Map Analysis

The visualizations in this project were designed to explore the relationship between crash causes and their impact. The main visualization is an interactive time series chart created using the highcharter package. This plot displays the number of crashes over time for different contributing factors, allowing users to explore trends and compare how causes of crashes change across years.

In addition to these visualizations, an interactive map was created using the leaflet package to represent the spatial distribution of crashes. Each point on the map corresponds to a crash location, with color indicating the contributing factor and point size representing the number of persons injured.

One interesting pattern observed across the visualizations is that a small number of contributing factors account for a large proportion of crashes, highlighting the uneven distribution of causes. Additionally, the skewness of the injury data emphasizes that severe crashes are relatively rare but impactful. A limitation of the project is the heavy skew of the data, which, despite my attempts to address it, still hinders the analysis.
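
One way to put a number on this concentration is a cumulative share table: count crashes per factor, sort from most to fewest, and accumulate the shares. The sketch below uses made-up labels A, B, and C rather than real contributing factors.

```r
library(dplyr)

# Toy crashes: cause A is recorded 6 times, B 3 times, C once
toy <- tibble(cause = c(rep("A", 6), rep("B", 3), "C"))

# Share of crashes per cause and running total of those shares
cause_share <- toy |>
  count(cause, sort = TRUE) |>
  mutate(
    share = n / sum(n),
    cum_share = cumsum(share)
  )

cause_share
```

On the real sample, the row where cum_share first passes a chosen threshold would show how many factors cover that proportion of crashes.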

Sources

The dataset used in this project was obtained from the NYC OpenData platform: New York City Police Department. NYC OpenData. https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95

Additional documentation on the dataset was accessed through Data.gov: U.S. General Services Administration. https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

Leaflet reference for mapping in R: Notes on making Leaflet maps in R. https://andrewpwheeler.com/2020/08/31/notes-on-making-leaflet-maps-in-r/

AI Assistance: I used AI for some coding guidance on addProviderTiles() for my map. It provided sources and explanations, such as the leaflet documentation, so I could look at other provider tile options.