Introduction

Motor vehicle collisions are a major public safety concern in New York City, with thousands of crashes occurring each year across the five boroughs. Injury rates may vary with neighborhood, traffic density, urban layout, and demographic factors. This project investigates the following research question: Is the proportion of motor vehicle collisions that result in injuries significantly different between Manhattan and Brooklyn?

The dataset used in this analysis is the Motor Vehicle Collisions – Crashes dataset from NYC Open Data, which contains information on reported collisions. Each row represents a single crash and records variables such as the borough, the date and time of the incident, contributing factors, vehicle types, and the number of injuries and fatalities. This project focuses on two key variables: BOROUGH and NUMBER OF PERSONS INJURED. The dataset was accessed through its Data.gov catalog listing: https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes.

Data Analysis

The data analysis begins by cleaning the dataset and preparing the variables needed to answer the research question. First, the data is filtered to include only collisions that occurred in Manhattan and Brooklyn, the two boroughs being compared. Rows with a missing injury count are also dropped. An injury indicator variable is then created to categorize each collision as resulting in injury (one or more persons injured) or no injury. Summary counts of injury and non-injury collisions are computed for each borough to provide initial insight into group differences. Finally, a proportional stacked bar chart compares the share of injury and non-injury collisions in each borough.

  1. Loading Data & Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("Motor_Vehicle_Collisions.csv")
## Rows: 2219379 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
## dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
## time  (1): CRASH TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  2. Clean data
df_clean <- df %>%
  dplyr::filter(BOROUGH %in% c("BROOKLYN", "MANHATTAN")) %>%
  dplyr::filter(!is.na(`NUMBER OF PERSONS INJURED`)) %>%
  mutate(injury = ifelse(`NUMBER OF PERSONS INJURED` > 0, 1, 0))
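The filtering and indicator logic above can be checked on a few made-up rows. This is only a sketch in base R: the `toy` data frame and its `injured` column are hypothetical stand-ins for the real dataset and for `NUMBER OF PERSONS INJURED`. It illustrates that `%in%` drops rows with a missing borough as well as rows from other boroughs.

```r
# Hypothetical toy data, not taken from the real dataset
toy <- data.frame(
  BOROUGH = c("BROOKLYN", "MANHATTAN", "QUEENS", NA, "BROOKLYN"),
  injured = c(0, 2, 1, 3, NA)
)

# NA %in% c(...) evaluates to FALSE, so rows with a missing borough are dropped
keep <- toy$BOROUGH %in% c("BROOKLYN", "MANHATTAN") & !is.na(toy$injured)
toy_clean <- toy[keep, ]

# 1 if at least one person was injured, 0 otherwise
toy_clean$injury <- as.integer(toy_clean$injured > 0)

toy_clean
```

Only the first two rows survive: the Queens row and the missing-borough row fail the borough filter, and the last Brooklyn row has no injury count.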
  3. Summary Table
df_clean %>%
  group_by(BOROUGH, injury) %>%
  summarize(count = n(), .groups = "drop")
## # A tibble: 4 × 3
##   BOROUGH   injury  count
##   <chr>      <dbl>  <int>
## 1 BROOKLYN       0 363665
## 2 BROOKLYN       1 129631
## 3 MANHATTAN      0 276136
## 4 MANHATTAN      1  65006
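The table above reports counts only. As a rough sketch, the injury proportion in each borough can be derived from those four counts. The snippet uses base R and retypes the counts by hand so that it runs without the full dataset; the object names `counts`, `tab`, and `injury_prop` are illustrative and not part of the original analysis.

```r
# Counts copied from the summary table above
counts <- data.frame(
  BOROUGH = c("BROOKLYN", "BROOKLYN", "MANHATTAN", "MANHATTAN"),
  injury  = c(0, 1, 0, 1),
  count   = c(363665, 129631, 276136, 65006)
)

# 2 x 2 table: rows are boroughs, columns are injury = 0 / 1
tab <- xtabs(count ~ BOROUGH + injury, data = counts)

# Row-wise proportions; the "1" column is the share of collisions with an injury
injury_prop <- prop.table(tab, margin = 1)[, "1"]
round(injury_prop, 4)
```

These values should agree with the sample estimates reported by `prop.test()` in the Statistical Analysis section.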
  4. Visualization
ggplot(df_clean, aes(x = BOROUGH, fill = factor(injury))) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Injury vs. Non-Injury Collisions",
    x = "Borough",
    y = "Proportion",
    fill = "Injury (1 = Yes)"
  )

Statistical Analysis

To determine whether the proportion of collisions that result in injuries differs between Manhattan and Brooklyn, a two-proportion z-test was conducted. This test compares the proportion of injury-causing collisions in two independent groups. The null hypothesis states that the injury proportions in Manhattan and Brooklyn are equal, while the alternative hypothesis states that the proportions differ. Using the cleaned dataset, the total number of collisions and the number of injury collisions were calculated for each borough. These values were then passed to the prop.test() function in R, which reports the result as a chi-squared statistic with one degree of freedom; without continuity correction, this statistic equals the square of the z statistic. The resulting p-value is compared against the 0.05 significance level to decide whether there is statistically significant evidence to reject the null hypothesis.
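For reference, the hypotheses and the standard pooled test statistic can be written out as follows. The notation is an assumption introduced here, not taken from the original analysis: $p_B$ and $p_M$ are the true injury proportions in Brooklyn and Manhattan, $x_B$ and $x_M$ the numbers of injury collisions, and $n_B$ and $n_M$ the total numbers of collisions.

```latex
\begin{aligned}
H_0 &: p_B = p_M, \qquad H_A : p_B \neq p_M \\[4pt]
\hat{p}_B &= \frac{x_B}{n_B}, \qquad
\hat{p}_M = \frac{x_M}{n_M}, \qquad
\hat{p} = \frac{x_B + x_M}{n_B + n_M} \\[4pt]
z &= \frac{\hat{p}_B - \hat{p}_M}
          {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_B} + \dfrac{1}{n_M}\right)}}
\end{aligned}
```

Under the null hypothesis, $z$ is approximately standard normal for large samples, and $z^2$ follows a chi-squared distribution with one degree of freedom.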

injury_summary <- df_clean %>%
  group_by(BOROUGH) %>%
  summarize(
    injury_count = sum(injury),
    total_collisions = n(),
    .groups = "drop"
  )

injury_summary
## # A tibble: 2 × 3
##   BOROUGH   injury_count total_collisions
##   <chr>            <dbl>            <int>
## 1 BROOKLYN        129631           493296
## 2 MANHATTAN        65006           341142
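# Rows are in alphabetical order, so element 1 is Brooklyn and element 2 is Manhattan
x <- injury_summary$injury_count          # number of collisions with at least one injury
n <- injury_summary$total_collisions      # total collisions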
prop.test(
  x = x,
  n = n,
  alternative = "two.sided",
  correct = FALSE
)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 5883.3, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07042982 0.07403287
## sample estimates:
##    prop 1    prop 2 
## 0.2627854 0.1905541
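The X-squared value and the confidence interval printed above can be reproduced by hand, which makes the link to the two-proportion z-test explicit. This is a sketch in base R with the counts retyped from the `injury_summary` table; names such as `p_pool`, `se_pooled`, and `se_unpooled` are illustrative. It assumes the usual large-sample formulas: a pooled standard error for the test statistic and an unpooled standard error for the interval.

```r
# Counts copied from the injury_summary table above
x <- c(BROOKLYN = 129631, MANHATTAN = 65006)   # injury collisions
n <- c(BROOKLYN = 493296, MANHATTAN = 341142)  # all collisions

p_hat  <- x / n              # sample injury proportions
p_pool <- sum(x) / sum(n)    # pooled proportion under the null hypothesis
diff   <- p_hat[["BROOKLYN"]] - p_hat[["MANHATTAN"]]

# z statistic with the pooled standard error
se_pooled <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- diff / se_pooled

# 95% confidence interval for the difference, using the unpooled standard error
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci <- diff + c(-1, 1) * qnorm(0.975) * se_unpooled

z^2   # should agree with the X-squared value reported by prop.test()
ci    # should agree with the reported 95 percent confidence interval
```

Because the alternative is two-sided, the p-value is twice the upper-tail normal probability of `abs(z)`, which is the same as the upper-tail chi-squared probability of `z^2` with one degree of freedom.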

Conclusions & Future Directions

The results of the two-proportion z-test provide clear evidence that the proportion of motor vehicle collisions resulting in injuries differs between Manhattan and Brooklyn. The estimated injury proportion for Brooklyn was about 26.28%, while the proportion for Manhattan was about 19.06%. Because the p-value for the test was less than 2.2 × 10⁻¹⁶, which is far below the significance level of 0.05, we reject the null hypothesis that the two boroughs have equal injury proportions. With more than 800,000 collisions in the sample, even small differences would reach statistical significance, so the size of the gap matters as much as the p-value: the 95% confidence interval for the difference (Brooklyn minus Manhattan) runs from about 7.0 to 7.4 percentage points. The observed difference is therefore very unlikely to be due to random chance alone and reflects a meaningful distinction in collision outcomes.

These findings suggest that a reported collision in Brooklyn is substantially more likely to involve an injury than one in Manhattan, by roughly 7 percentage points. This difference may be associated with variations in road design, traffic density, enforcement patterns, driver behavior, or neighborhood-level infrastructure across the two boroughs, although this analysis does not test any of these explanations. Understanding why Brooklyn’s injury rate is higher could help city planners, transportation officials, and policymakers target improvements where they are most needed. Future research could extend this analysis by examining additional boroughs, considering the role of contributing factors, or exploring time-based patterns such as peak hours or weekday versus weekend trends. Overall, this study highlights the importance of borough-specific strategies in improving traffic safety across New York City.

References

NYC Open Data. (n.d.). Motor Vehicle Collisions – Crashes. https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://ggplot2.tidyverse.org