#Introduction
Motor vehicle crashes are a major public safety concern in large urban areas. Injury outcomes can vary across locations because of differences in traffic density, road design, population behavior, and emergency response capacity. Examining whether injury occurrence differs by borough can help policymakers and transportation officials identify locations that may need targeted safety interventions.
The dataset used for this analysis is Motor Vehicle Collisions - Crashes, obtained from Data.gov. It contains detailed records of reported vehicle crashes, including location, number of persons injured, and other crash-related characteristics. With more than 1.5 million crashes that have a recorded borough, it easily satisfies the requirement of containing more than 1,000 observations. The variables used in this study are borough and injury status. Injury status is a binary variable indicating whether at least one person was injured in a crash.
#Datasets
Motor Vehicle Collisions - Crashes: https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes
#Data analysis
#Load Datasets
crashes <- read.csv("Motor_Vehicle_Collisions_-_Crashes (2).csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
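# Keep the two variables of interest and drop crashes with no recorded borough
crash_clean <- crashes %>%
  select(BOROUGH, NUMBER.OF.PERSONS.INJURED) %>%
  filter(!is.na(BOROUGH), BOROUGH != "") %>%
  mutate(
    # 1 = at least one person injured, 0 = none (NA if the count is missing)
    injury_status = ifelse(NUMBER.OF.PERSONS.INJURED > 0, 1, 0),
    BOROUGH = factor(BOROUGH)
  )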
# Check results
summary(crash_clean)
## BOROUGH NUMBER.OF.PERSONS.INJURED injury_status
## BRONX :228680 Min. : 0.0000 Min. :0.0000
## BROOKLYN :495004 1st Qu.: 0.0000 1st Qu.:0.0000
## MANHATTAN :342060 Median : 0.0000 Median :0.0000
## QUEENS :413751 Mean : 0.3128 Mean :0.2379
## STATEN ISLAND: 64654 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :43.0000 Max. :1.0000
## NA's :11 NA's :11
table(crash_clean$BOROUGH, useNA = "ifany")
##
## BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
## 228680 495004 342060 413751 64654
table(crash_clean$injury_status)
##
## 0 1
## 1176760 367378
#Interpretation of Data Analysis
During data preparation, rows with a missing or blank borough were removed to ensure accurate categorization. A binary variable, injury_status, was created to indicate whether a crash resulted in at least one injury. Eleven crashes have a missing injury count, so their injury_status is NA; table() omits them by default, which leaves 1,544,138 crashes. The cleaned data were then summarized into a contingency table appropriate for a chi-square test of independence. Data manipulation was completed using the dplyr functions select(), filter(), and mutate().
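The cleaning steps described above can be tried on a handful of rows to see what each one does. The sketch below is illustrative only: the `toy` data frame and its values are made up for this example and are not drawn from the real dataset, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy records (invented for illustration, not real crash data)
toy <- data.frame(
  BOROUGH = c("BRONX", "", "QUEENS", NA, "QUEENS", "BRONX"),
  NUMBER.OF.PERSONS.INJURED = c(0, 2, 1, 0, NA, 3)
)

toy_clean <- toy %>%
  filter(!is.na(BOROUGH), BOROUGH != "") %>%                        # drop blank/missing borough
  mutate(injury_status = ifelse(NUMBER.OF.PERSONS.INJURED > 0, 1, 0))

# Four rows remain; the row with a missing injury count keeps injury_status = NA
nrow(toy_clean)
toy_clean$injury_status

# table() leaves the NA row out unless useNA is set, so only three rows are counted
toy_table <- table(toy_clean$BOROUGH, toy_clean$injury_status)
toy_table
```

The same behavior explains why a cross-tabulation can contain slightly fewer observations than the cleaned data frame has rows.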
#Statistical Analysis
The chi-square test of independence is appropriate for this analysis because both variables are categorical. The test evaluates whether the distribution of injury outcomes differs across boroughs beyond what would be expected by chance.
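For reference, the statistic behind this test can be written out explicitly. This is the standard textbook form of Pearson's statistic rather than anything specific to this dataset; the symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count), $R_i$ (row total), $C_j$ (column total), and $N$ (grand total) are notation introduced here for illustration.

```latex
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1)
\]
```

With five boroughs and two injury categories, the degrees of freedom are (5 - 1)(2 - 1) = 4.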
#Hypotheses
Null Hypothesis (H₀): There is no association between borough and whether a crash results in at least one injury.
Alternative Hypothesis (H₁): There is an association between borough and whether a crash results in at least one injury.
#Contingency Table
injury_table <- table(crash_clean$BOROUGH, crash_clean$injury_status)
injury_table
##
## 0 1
## BRONX 170420 58259
## BROOKLYN 364602 130398
## MANHATTAN 276666 65391
## QUEENS 314646 99102
## STATEN ISLAND 50426 14228
#Chi Square
chi_result <- chisq.test(injury_table)
chi_result
##
## Pearson's Chi-squared test
##
## data: injury_table
## X-squared = 6377.2, df = 4, p-value < 2.2e-16
min(chi_result$expected)
## [1] 15382.34
#Interpretation of Statistical Analysis
A Pearson chi-square test of independence was conducted to examine the association between borough and whether a motor vehicle crash resulted in at least one injury. The association was statistically significant, χ²(4) = 6377.2, p < 2.2e-16. The null hypothesis was therefore rejected: the proportion of crashes involving an injury is not the same in every borough. The minimum expected cell count was 15382.34, far above the conventional threshold of 5, so the large-sample approximation behind the chi-square test is appropriate.
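With more than 1.5 million crashes, even small differences between boroughs yield a very small p-value, so an effect size can help show how strong the association is. The sketch below computes Cramér's V, which is an addition for illustration and not part of the original analysis. It rebuilds the contingency table from the counts printed in this report; the names `injury_counts`, `k`, and `cramers_v` are chosen here for the example.

```r
# Counts copied from the contingency table printed in this report
injury_counts <- matrix(
  c(170420,  58259,
    364602, 130398,
    276666,  65391,
    314646,  99102,
     50426,  14228),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
    c("0", "1")
  )
)

chi <- chisq.test(injury_counts)
n <- sum(injury_counts)            # total crashes in the table
k <- min(dim(injury_counts)) - 1   # min(rows - 1, cols - 1)

# Cramér's V = sqrt(chi-square / (n * k)); it ranges from 0 to 1
cramers_v <- sqrt(unname(chi$statistic) / (n * k))
cramers_v
```

Under these counts V comes out at roughly 0.06, which common rules of thumb would describe as a weak association despite the very small p-value.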
#Bar Plots
injury_prop <- prop.table(injury_table, margin = 1)
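# Transpose so each borough is a group of two bars (no injury, at least one injury);
# legend.text labels the rows of the plotted matrix, which are now the injury categories
barplot(t(injury_prop),
        beside = TRUE,
        legend.text = c("No Injury", "At Least One Injury"),
        main = "Injury Outcomes by Borough",
        xlab = "Borough",
        ylab = "Proportion")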
rownames(injury_prop)
## [1] "BRONX" "BROOKLYN" "MANHATTAN" "QUEENS"
## [5] "STATEN ISLAND"
#Interpretation of Bar Plots
The bar plot displays the proportion of crashes with no injury and with at least one injury in each of the five boroughs. Based on the contingency table, the share of crashes without injury is highest in Manhattan (about 80.9%), followed by Staten Island (78.0%) and Queens (76.0%), and lowest in the Bronx (74.5%) and Brooklyn (73.7%). Correspondingly, the share of crashes with at least one injury is highest in Brooklyn (26.3%) and the Bronx (25.5%), followed by Queens (24.0%) and Staten Island (22.0%), and lowest in Manhattan (19.1%). Staten Island has the fewest crashes overall, but its injury share is still higher than Manhattan's. These visual differences are consistent with the chi-square result and indicate that injury occurrence varies by borough.
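The proportions behind the bars can also be read as a small table, which makes the ranking of boroughs easier to check than the plot alone. The sketch below rebuilds the table from the counts printed in this report; the object names `injury_counts`, `row_pct`, and `ranked` are chosen here for illustration.

```r
# Counts copied from the contingency table printed in this report
injury_counts <- matrix(
  c(170420,  58259,
    364602, 130398,
    276666,  65391,
    314646,  99102,
     50426,  14228),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
    c("No Injury", "At Least One Injury")
  )
)

# Row percentages: each borough's crashes split by injury outcome
row_pct <- round(100 * prop.table(injury_counts, margin = 1), 1)

# Boroughs ordered from highest to lowest injury share
ranked <- row_pct[order(row_pct[, "At Least One Injury"], decreasing = TRUE), ]
ranked
```

Sorting by the injury column places Brooklyn first and Manhattan last.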
#Conclusion
This study examined whether borough is associated with injury occurrence in motor vehicle crashes. The Pearson chi-square test showed a statistically significant association, indicating that injury outcomes differ across the Bronx, Brooklyn, Manhattan, Queens, and Staten Island. The null hypothesis was rejected, which supports the view that geographic location is related to whether a crash results in injury. These findings suggest that borough-specific traffic conditions may influence crash severity.
#Future Directions
Future research should incorporate additional variables such as time of day, weather conditions, traffic volume, and road type to better explain differences in injury outcomes. Further studies may also apply multivariable models, such as logistic regression, to control for confounding factors. Such analyses could provide more precise guidance for targeted traffic safety interventions across boroughs.
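As a starting point for the logistic regression mentioned above, a borough-only model can be fitted to the aggregated counts printed in this report. This is a minimal sketch, not the multivariable analysis itself: it contains no confounders, and the names `agg`, `fit`, and `odds_ratios` are chosen here for illustration. Adding covariates such as time of day would require the crash-level data rather than these aggregated counts.

```r
# Aggregated counts copied from the contingency table printed in this report
agg <- data.frame(
  borough = factor(c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")),
  injured = c(58259, 130398, 65391, 99102, 14228),
  not_injured = c(170420, 364602, 276666, 314646, 50426)
)

# Logistic regression on grouped data; BRONX is the reference level (first alphabetically)
fit <- glm(cbind(injured, not_injured) ~ borough, family = binomial, data = agg)

# Odds of at least one injury in each borough relative to the Bronx
odds_ratios <- exp(coef(fit))[-1]
round(odds_ratios, 2)
```

Because the model has one parameter per borough, these odds ratios simply reproduce the ratios of the observed odds; the value of the approach comes from extending the formula with further predictors.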
#References
Motor Vehicle Collisions - Crashes https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes