The dataset “Motor Vehicle Collisions - Crashes,” provided by John Ferrara and accessible through NYC Open Data, presents a valuable opportunity for analysis of traffic incidents in New York City. This research seeks to investigate several critical questions: Which boroughs exhibit the highest frequency of motor vehicle accidents? What factors contribute to the occurrence of these incidents? By exploring this untidy dataset, the goal is to extract meaningful insights that can inform public safety initiatives and guide policy decisions related to traffic management and vehicle safety.
Upon importing the dataset, it becomes evident that there are instances of missing data. The initial step in the data cleaning process involves prioritizing the removal of incomplete information across all relevant columns. Following this, specific columns will be selected for analysis. Given that the data is already structured appropriately, there is no need to convert it into a long format.
knitr::opts_chunk$set(echo = TRUE)
# Check the number of rows in the original data
cat("Original dimensions:", dim(MV_Collisions_Crashes), "\n")
## Original dimensions: 2124088 29
# Define key columns
key_columns <- c("BOROUGH", "LATITUDE", "LONGITUDE",
"NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED",
"CONTRIBUTING FACTOR VEHICLE 1", "VEHICLE TYPE CODE 1")
# Check NAs in key columns
na_counts <- sapply(MV_Collisions_Crashes[key_columns], function(x) sum(is.na(x)))
print(na_counts)
## BOROUGH LATITUDE
## 660586 251097
## LONGITUDE NUMBER OF PERSONS INJURED
## 251097 18
## NUMBER OF PERSONS KILLED CONTRIBUTING FACTOR VEHICLE 1
## 31 7134
## VEHICLE TYPE CODE 1
## 14432
# Remove missing values in key columns and create a new dataframe with selected columns
MV_Collisions_Crashes_Cleaned <- MV_Collisions_Crashes %>%
select(all_of(key_columns)) %>%
drop_na()
In the next chuck the count function tallied the number of accidents for each borough, and that finding was then mutated to calculate the percentage of accidents for each borough relative to the total number of accidents. which will then be visualized as a barplot.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
borough_accidents <- MV_Collisions_Crashes_Cleaned %>%
count(BOROUGH, name = "accident_count") %>%
mutate(percentage = (accident_count / sum(accident_count)) * 100) %>%
arrange(desc(accident_count))
print(borough_accidents)
## # A tibble: 5 × 3
## BOROUGH accident_count percentage
## <chr> <int> <dbl>
## 1 BROOKLYN 448254 32.0
## 2 QUEENS 378016 27.0
## 3 MANHATTAN 311442 22.2
## 4 BRONX 205780 14.7
## 5 STATEN ISLAND 58868 4.20
library(ggplot2)
ggplot(borough_accidents, aes(x = reorder(BOROUGH, -percentage), y = percentage)) +
geom_bar(stat = "identity", fill = "purple") +
labs(title = "Percentage of Accidents by Borough",
x = "Borough",
y = "Percentage of Accidents (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Brooklyn accounts for approximately 31.96% of all recorded accidents,
making it the borough with the highest proportion. Queens follows with
about 26.96%, indicating that just over a quarter of accidents occurred
there. Manhattan represents approximately 22.21% of the total accidents,
while the Bronx accounts for around 14.67%. Finally, Staten Island has
the smallest percentage at about 4.20%, reflecting its significantly
lower share of incidents compared to the other boroughs. Overall,
Brooklyn and Queens together comprise the majority of traffic accidents
in the city.
In the next chunk analyze the contributing factor to motor vehicle accidents. First the total number of accidents for each factor is calculate. Then determine the percentage of accidents attributed to each factor and rounds the percentages to two decimal places for clarity. Following this calculation, the top five contributing factors are extracted and presented in descending order based on accident counts. Finally, a bar plot is created to visually represent these top factors, enhancing understanding of their relative impact on accident rates, with appropriately adjusted axis labels for better readability.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
# contributing factor percentage
contributing_factor_percentages <- MV_Collisions_Crashes_Cleaned %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarize(accident_count = n(), .groups = 'drop') %>%
mutate(percentage = (accident_count / sum(accident_count)) * 100) %>%
mutate(percentage = round(percentage, 2)) # Round to 2 decimal places
# Get the top five contributing factors
top_five_factors <- contributing_factor_percentages %>%
arrange(desc(accident_count)) %>%
slice_head(n = 5)
print(top_five_factors)
## # A tibble: 5 × 3
## `CONTRIBUTING FACTOR VEHICLE 1` accident_count percentage
## <chr> <int> <dbl>
## 1 Unspecified 533182 38.0
## 2 Driver Inattention/Distraction 269892 19.2
## 3 Failure to Yield Right-of-Way 88570 6.32
## 4 Backing Unsafely 59130 4.22
## 5 Following Too Closely 45772 3.26
# bar plot
ggplot(top_five_factors, aes(x = reorder(`CONTRIBUTING FACTOR VEHICLE 1`, -percentage), y = percentage)) +
geom_bar(stat = "identity", fill = "gold") + # Bar color
labs(title = "Top Five Contributing Factors for Accidents",
x = "Contributing Factor Vehicle 1",
y = "Percentage of Accidents") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), # Rotate and adjust text size
axis.title.x = element_text(size = 14), # Adjust x-axis title size
axis.title.y = element_text(size = 14)) + # Adjust y-axis title size
scale_x_discrete(expand = expansion(mult = c(0.1, 0.1))) # Add space around x-axis labels
The analysis indicates that while numerous factors contribute to motor
vehicle accidents, the focus here is on the top five contributing
factors. A notable 38.02% of accidents were classified as “unspecified,”
suggesting a significant gap in data reporting or clarity regarding the
causes. The second most common factor was driver inattention or
distraction, accounting for 19.25% of incidents. Other key contributors
included failure to yield the right-of-way at 6.32%, backing unsafely at
4.22%, and following too closely at 3.26%. These statistics underscore
the critical importance of driver awareness and safe decision-making on
the road. While many factors contribute to accidents, these top five
highlight specific areas where targeted interventions and education
could make a meaningful impact in reducing collision rates.
knitr::opts_chunk$set(echo = TRUE)
# Count accidents for each contributing factor and get the top five
top_five_factors <- MV_Collisions_Crashes_Cleaned %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarize(accident_count = n(), .groups = 'drop') %>%
arrange(desc(accident_count)) %>%
slice_head(n = 5)
# Filter the original dataset for only the top five contributing factors
borough_contributing_factors <- MV_Collisions_Crashes_Cleaned %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% top_five_factors$`CONTRIBUTING FACTOR VEHICLE 1`) %>%
group_by(BOROUGH, `CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarize(accident_count = n(), .groups = 'drop') %>%
# Calculate the total accidents for each borough for percentage calculation
group_by(BOROUGH) %>%
mutate(percentage = (accident_count / sum(accident_count)) * 100) %>%
ungroup() %>%
arrange(BOROUGH, desc(accident_count))
# Print the results
print(borough_contributing_factors)
## # A tibble: 25 × 4
## BOROUGH `CONTRIBUTING FACTOR VEHICLE 1` accident_count percentage
## <chr> <chr> <int> <dbl>
## 1 BRONX Unspecified 86422 60.0
## 2 BRONX Driver Inattention/Distraction 33352 23.1
## 3 BRONX Failure to Yield Right-of-Way 9461 6.56
## 4 BRONX Backing Unsafely 8811 6.11
## 5 BRONX Following Too Closely 6097 4.23
## 6 BROOKLYN Unspecified 189682 57.8
## 7 BROOKLYN Driver Inattention/Distraction 78043 23.8
## 8 BROOKLYN Failure to Yield Right-of-Way 28618 8.72
## 9 BROOKLYN Backing Unsafely 17884 5.45
## 10 BROOKLYN Following Too Closely 14108 4.30
## # ℹ 15 more rows
# Create a bar plot to visualize the relationship
ggplot(borough_contributing_factors, aes(x = reorder(`CONTRIBUTING FACTOR VEHICLE 1`, -percentage), y = percentage, fill = BOROUGH)) +
geom_bar(stat = "identity", position = "dodge") + # Dodge position for side-by-side bars
labs(title = "Top Five Accident Contributing Factors by Borough",
x = "Contributing Factor Vehicle 1",
y = "Percentage of Accidents") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis text for readability
In the analysis of contributing factors to accidents, the category
labeled “Unspecified” is particularly noteworthy across the boroughs,
with the Bronx exhibiting the highest percentage at 59.96%. This
indicates a concerning lack of clarity in reporting accident causes in
that area. Brooklyn follows closely, with unspecified factors accounting
for 57.77% of accidents, highlighting a similar issue of unreported
circumstances that complicate safety interventions. In Manhattan, while
the percentage is somewhat lower at 49.65%, it still signifies a
significant portion of accidents, underscoring the need for improved
reporting practices. Queens has 47.92% of accidents attributed to
unspecified factors, while Staten Island reports 53.55%. The prevalence
of unspecified factors across these boroughs points to systemic
challenges in accurately identifying accident causes, which hampers
effective intervention strategies.
In addition to unspecified factors, “Driver Inattention/Distraction” emerges as a major concern, particularly in Manhattan, where it constitutes 32.69% of accidents. This high percentage reflects the unique challenges of the urban environment, where distractions are likely more pronounced due to the dense population and traffic. In Brooklyn and the Bronx, driver inattention accounts for 23.77% and 23.14% of accidents, respectively, further emphasizing the need to address distracted driving as a critical road safety issue in these areas.
Other contributing factors, such as “Failure to Yield Right-of-Way,” “Backing Unsafely,” and “Following Too Closely,” contribute to accident statistics as well, but at lower rates, typically ranging from about 4% to 8% across the boroughs. While these factors are less common, they still represent significant risks that could be mitigated through targeted education and enforcement efforts. Overall, the data presents a complex landscape of accident contributing factors in New York City, highlighting the necessity for a comprehensive approach to enhance road safety and reduce the incidence of unspecified incidents.