Final Project

#Introduction

Motor vehicle crashes are a major public safety concern in large urban areas. Injury outcomes can vary across locations because of differences in traffic density, road design, population behavior, and emergency response capacity. Examining whether injury occurrence differs by borough can help policymakers and transportation officials identify locations that may require targeted safety interventions.

The dataset used for this analysis is Motor Vehicle Collisions - Crashes, obtained from Data.gov. It contains detailed records of reported vehicle crashes, including location, number of persons injured, and other crash-related characteristics. With more than 1.5 million crashes remaining after cleaning, the dataset easily satisfies the requirement of containing more than 1000 observations. The variables used in this study are borough and injury status. Injury status is defined as a binary variable indicating whether at least one person was injured in a crash.

#Datasets

Motor Vehicle Collisions - Crashes https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

#Data analysis

#Load Datasets
crashes <- read.csv("Motor_Vehicle_Collisions_-_Crashes (2).csv")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

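# Keep the two variables of interest, drop crashes with no recorded borough,
# and derive a binary injury indicator (1 = at least one person injured).
# Rows with a missing injury count are kept, so injury_status is NA for them.
crash_clean <- crashes %>%
  select(BOROUGH, NUMBER.OF.PERSONS.INJURED) %>%
  filter(!is.na(BOROUGH), BOROUGH != "") %>%
  mutate(
    injury_status = ifelse(NUMBER.OF.PERSONS.INJURED > 0, 1, 0),
    BOROUGH = factor(BOROUGH)
  )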

# Check results
summary(crash_clean)

##           BOROUGH       NUMBER.OF.PERSONS.INJURED injury_status   
##  BRONX        :228680   Min.   : 0.0000           Min.   :0.0000  
##  BROOKLYN     :495004   1st Qu.: 0.0000           1st Qu.:0.0000  
##  MANHATTAN    :342060   Median : 0.0000           Median :0.0000  
##  QUEENS       :413751   Mean   : 0.3128           Mean   :0.2379  
##  STATEN ISLAND: 64654   3rd Qu.: 0.0000           3rd Qu.:0.0000  
##                         Max.   :43.0000           Max.   :1.0000  
##                         NA's   :11                NA's   :11

table(crash_clean$BOROUGH, useNA = "ifany")

## 
##         BRONX      BROOKLYN     MANHATTAN        QUEENS STATEN ISLAND 
##        228680        495004        342060        413751         64654

table(crash_clean$injury_status)

## 
##       0       1 
## 1176760  367378

#Interpretation of Data Analysis

During data preparation, rows with missing or blank borough values were removed to ensure accurate categorization. A binary variable, injury_status, was created to indicate whether a crash resulted in at least one injury. Eleven crashes have a missing injury count; they remain in crash_clean with injury_status set to NA and are excluded from the injury tabulations, because table() omits NA values by default. The cleaned data were then summarized into a contingency table appropriate for a chi-square test of independence. Data manipulation was completed using dplyr functions such as select, filter, and mutate.
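
The summary output reports 11 missing injury counts, while the injury_status tabulation shows no NA category. The minimal sketch below illustrates that behavior on a small made-up data frame (the `toy` rows are hypothetical and are not taken from the crash file): `table()` silently drops NA values unless `useNA` is set, and the incomplete rows could instead be removed explicitly.

```r
# Hypothetical toy data, used only to illustrate how NA injury counts behave.
toy <- data.frame(
  BOROUGH = c("BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"),
  NUMBER.OF.PERSONS.INJURED = c(0, 2, NA, 1, 0)
)

# Same rule as in the analysis: 1 if at least one person was injured.
# A missing injury count yields a missing injury_status.
toy$injury_status <- ifelse(toy$NUMBER.OF.PERSONS.INJURED > 0, 1, 0)

# Default tabulation omits the NA row; useNA = "ifany" makes it visible.
default_counts <- table(toy$injury_status)
full_counts <- table(toy$injury_status, useNA = "ifany")

# Alternative: drop rows with a missing injury count explicitly.
toy_complete <- toy[!is.na(toy$NUMBER.OF.PERSONS.INJURED), ]

default_counts
full_counts
nrow(toy_complete)
```

Dropping the incomplete rows explicitly would make the row total of the cleaned data match the total of the contingency table; with only 11 such rows in the crash data, either choice should have a negligible effect on the test.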

#Statistical Analysis

The chi-square test of independence is appropriate for this analysis because both variables are categorical. The test evaluates whether the distribution of injury outcomes differs across boroughs beyond what would be expected by chance.
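
For reference, the quantities behind the test can be written out as follows. The notation is introduced here for illustration and does not appear in the original analysis: O_ij is the observed count in row i (borough) and column j (injury status), R_i and C_j are the row and column totals, N is the total number of crashes, and r = 5 and c = 2 are the numbers of boroughs and injury categories.

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1) = (5 - 1)(2 - 1) = 4
```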

#Hypotheses

Null Hypothesis (H₀): There is no association between borough and whether a crash results in at least one injury.

Alternative Hypothesis (H₁): There is an association between borough and whether a crash results in at least one injury.

#Contingency Table
injury_table <- table(crash_clean$BOROUGH, crash_clean$injury_status)
injury_table

##                
##                      0      1
##   BRONX         170420  58259
##   BROOKLYN      364602 130398
##   MANHATTAN     276666  65391
##   QUEENS        314646  99102
##   STATEN ISLAND  50426  14228

#Chi Square
chi_result <- chisq.test(injury_table)
chi_result

## 
##  Pearson's Chi-squared test
## 
## data:  injury_table
## X-squared = 6377.2, df = 4, p-value < 2.2e-16

min(chi_result$expected)

## [1] 15382.34

#Interpretation of Statistical Analysis

A Pearson chi-square test of independence was conducted to examine the association between borough and whether a motor vehicle crash resulted in at least one injury. The association was statistically significant, with a chi-square statistic of 6377.2 on 4 degrees of freedom and a p-value below 2.2e-16. The null hypothesis was therefore rejected: the proportion of crashes with at least one injury differs across boroughs. The minimum expected cell count was 15382.34, far above the conventional minimum of 5, so the expected-count assumption of the chi-square test is satisfied.
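
As a cross-check, the statistic reported by chisq.test() can be recomputed by hand from the contingency table. The sketch below re-enters the observed counts printed earlier as a matrix, so it runs without the original CSV file; the variable names `expected`, `chi_sq`, and `p_value` are chosen here for illustration.

```r
# Observed counts copied from the contingency table printed in this report.
injury_table <- matrix(
  c(170420,  58259,
    364602, 130398,
    276666,  65391,
    314646,  99102,
     50426,  14228),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
    c("0", "1")
  )
)

n <- sum(injury_table)

# Expected count per cell under independence: row total * column total / N.
expected <- outer(rowSums(injury_table), colSums(injury_table)) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
chi_sq <- sum((injury_table - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(injury_table) - 1) * (ncol(injury_table) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(chi_sq, 1)
df
min(expected)
```

The smallest expected count belongs to the Staten Island injury cell, which combines the smallest row total with the smaller column total.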

#Barplots

injury_prop <- prop.table(injury_table, margin = 1)

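# Transpose so each borough forms one group with a bar per injury status;
# this makes the legend labels and the x-axis label match the bars.
barplot(t(injury_prop),
        beside = TRUE,
        legend.text = c("No Injury", "At Least One Injury"),
        main = "Injury Outcomes by Borough",
        xlab = "Borough",
        ylab = "Proportion")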

rownames(injury_prop)

## [1] "BRONX"         "BROOKLYN"      "MANHATTAN"     "QUEENS"       
## [5] "STATEN ISLAND"

#Interpretation of Bar Plots

The bar plot displays, for each of the five boroughs, the proportion of crashes with no injury and with at least one injury. The share of crashes resulting in at least one injury is highest in Brooklyn (about 26.3%) and the Bronx (about 25.5%), followed by Queens (about 24.0%) and Staten Island (about 22.0%). Manhattan shows the lowest share of injury crashes (about 19.1%) and therefore the highest proportion of crashes without injury. Staten Island has far fewer crashes overall, but its injury share is only slightly below the citywide share of about 23.8%. These visual differences are consistent with the chi-square results and indicate that injury occurrence varies by borough.
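
The borough-level injury shares shown in the bar plot can also be read off numerically. The sketch below re-enters the contingency table counts from this report and computes the share of crashes with at least one injury in each borough; `injury_share` is a name introduced here for illustration.

```r
# Observed counts copied from the contingency table printed in this report.
injury_table <- matrix(
  c(170420,  58259,
    364602, 130398,
    276666,  65391,
    314646,  99102,
     50426,  14228),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
    c("0", "1")
  )
)

# Share of crashes with at least one injury, per borough.
injury_share <- injury_table[, "1"] / rowSums(injury_table)

# Boroughs ordered from highest to lowest injury share, in percent.
round(100 * sort(injury_share, decreasing = TRUE), 1)
```

This is the same quantity as the second column of prop.table(injury_table, margin = 1), sorted to make the ranking explicit.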

#Conclusion

This study examined whether borough is associated with injury occurrence in motor vehicle crashes. The Pearson chi-square test showed a statistically significant association, indicating that injury outcomes differ across the Bronx, Brooklyn, Manhattan, Queens, and Staten Island. The null hypothesis was rejected, which indicates that the likelihood of a crash resulting in injury is related to geographic location. These findings suggest that borough-specific traffic conditions may influence crash severity.

#Future Directions

Future research should incorporate additional variables such as time of day, weather conditions, traffic volume, and road type to better explain differences in injury outcomes. Further studies may also apply multivariable models such as logistic regression to control for confounding factors. Such analyses could provide more precise guidance for targeted traffic safety interventions across boroughs.
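
As a possible starting point for such a model, the sketch below fits a borough-only logistic regression to the counts from the contingency table in this report. It is an illustration rather than part of the analysis: the data frame `counts` and its column names are introduced here, and the additional covariates mentioned above are not included because they were not extracted for this project.

```r
# Grouped counts copied from the contingency table in this report.
counts <- data.frame(
  borough = factor(c("BRONX", "BROOKLYN", "MANHATTAN",
                     "QUEENS", "STATEN ISLAND")),
  injured = c(58259, 130398, 65391, 99102, 14228),
  not_injured = c(170420, 364602, 276666, 314646, 50426)
)

# Binomial GLM on grouped data: the response is a two-column matrix of
# (successes, failures). The Bronx is the reference level (first alphabetically).
fit <- glm(cbind(injured, not_injured) ~ borough,
           family = binomial, data = counts)

# Exponentiated coefficients: the intercept is the odds of injury in the
# Bronx; the remaining terms are odds ratios relative to the Bronx.
odds_ratios <- exp(coef(fit))
round(odds_ratios, 3)
```

With borough as the only predictor, the model reproduces the observed injury proportions exactly; its value would come from adding further terms to the formula once variables such as time of day or road type are available.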

#References

Motor Vehicle Collisions - Crashes https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

Nishan Ghale

2025-12-16