The dataset I chose for my final project covers almost 2 million car crashes that have occurred in New York City, published by NYC Open Data. (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95)
The categorical variables in this dataset are:
- BOROUGH: the borough of NYC where the crash occurred (Manhattan, Bronx, Brooklyn, Queens, Staten Island)
- ZIP CODE: the zip code of the crash location
- ON STREET NAME: the street name of the crash location
- CROSS STREET NAME: the cross street name of the crash location
- OFF STREET NAME: the off-street name of the crash location
- CONTRIBUTING FACTOR VEHICLE 1: the primary contributing factor for vehicle 1 in the crash
- CONTRIBUTING FACTOR VEHICLE 2: the primary contributing factor for vehicle 2 in the crash
- CONTRIBUTING FACTOR VEHICLE 3: the primary contributing factor for vehicle 3 in the crash
- CONTRIBUTING FACTOR VEHICLE 4: the primary contributing factor for vehicle 4 in the crash
- CONTRIBUTING FACTOR VEHICLE 5: the primary contributing factor for vehicle 5 in the crash
- COLLISION_ID: unique identifier for the crash
- VEHICLE TYPE CODE 1: vehicle type for vehicle 1 in the crash
- VEHICLE TYPE CODE 2: vehicle type for vehicle 2 in the crash
- VEHICLE TYPE CODE 3: vehicle type for vehicle 3 in the crash
- VEHICLE TYPE CODE 4: vehicle type for vehicle 4 in the crash
- VEHICLE TYPE CODE 5: vehicle type for vehicle 5 in the crash
The quantitative variables in this dataset are:
- CRASH DATE: date of the crash
- CRASH TIME: time of the crash
- LATITUDE: the latitude of the crash location
- LONGITUDE: the longitude of the crash location
- LOCATION: the latitude and longitude of the crash location combined
- NUMBER OF PERSONS INJURED: number of persons injured in the crash
- NUMBER OF PERSONS KILLED: number of persons killed in the crash
- NUMBER OF PEDESTRIANS INJURED: number of pedestrians injured in the crash
- NUMBER OF PEDESTRIANS KILLED: number of pedestrians killed in the crash
- NUMBER OF CYCLIST INJURED: number of cyclists injured in the crash
- NUMBER OF CYCLIST KILLED: number of cyclists killed in the crash
- NUMBER OF MOTORIST INJURED: number of motorists injured in the crash
- NUMBER OF MOTORIST KILLED: number of motorists killed in the crash
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Crashes <- read_csv("Motor_Vehicle_collisions_-_Crashes(1).csv")
## Rows: 1987321 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
## dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
## time (1): CRASH TIME
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
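The column specification above shows CRASH DATE being read as text (chr) rather than as a date. A minimal sketch of converting such a column in base R follows. The MM/DD/YYYY format and the toy values are assumptions for illustration, not values taken from the file.

```r
# Toy stand-in for the CRASH DATE column (hypothetical values).
# Assumption: dates are stored as MM/DD/YYYY text.
crash_date_chr <- c("09/11/2021", "03/26/2022", "not a date")

# as.Date() returns NA for strings that do not match the given format.
crash_date <- as.Date(crash_date_chr, format = "%m/%d/%Y")

# Pull out the year as an integer, which is convenient for grouping.
crash_year <- as.integer(format(crash_date, "%Y"))

print(crash_date)
print(crash_year)
```

Checking how many values become NA after the conversion is a quick way to see whether the assumed format actually matches the data.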
sum(is.na(Crashes))
## [1] 16993892
mean(is.na(Crashes))*100
## [1] 29.48675
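The overall percentage above does not show which columns the missing values come from. A minimal sketch of a per-column breakdown, using a small made-up data frame (the `toy` object and its values are hypothetical) in place of `Crashes`:

```r
# Hypothetical toy data standing in for the full Crashes table.
toy <- data.frame(
  BOROUGH = c("QUEENS", NA, "BRONX", NA),
  `ZIP CODE` = c(11101, NA, 10451, NA),
  `NUMBER OF PERSONS INJURED` = c(0, 1, 0, 2),
  check.names = FALSE  # keep the spaces in the column names
)

# is.na() on a data frame gives a logical matrix; colMeans() then gives
# the share of missing values in each column.
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)

print(missing_pct)
```

The same two calls should work on `Crashes` itself, and the result would show whether the missing values are concentrated in a few columns or spread across all of them.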
The two research questions I will explore are:
1. How does the number of injuries vary based on vehicle type?
2. How does the number of deaths vary based on the cause of the accident?
injuries_by_vehicle_type <- Crashes %>%
group_by(`VEHICLE TYPE CODE 1`) %>%
summarize(total_injuries = sum(`NUMBER OF PERSONS INJURED`)) %>%
arrange(desc(total_injuries))
head(injuries_by_vehicle_type, n = 12)
## # A tibble: 12 × 2
## `VEHICLE TYPE CODE 1` total_injuries
## <chr> <dbl>
## 1 PASSENGER VEHICLE 106857
## 2 SPORT UTILITY / STATION WAGON 45622
## 3 TAXI 8171
## 4 Pick-up Truck 7844
## 5 UNKNOWN 7652
## 6 Bus 6172
## 7 VAN 5320
## 8 Motorcycle 4701
## 9 OTHER 4415
## 10 Box Truck 3503
## 11 BUS 3349
## 12 LIVERY VEHICLE 3216
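In the table above, "Bus" and "BUS" appear as separate rows, so the same vehicle type is split across differently capitalized labels. One possible way to merge them is to standardize the labels before grouping. This is only a sketch on made-up values, and the column name `vehicle_clean` is a hypothetical helper.

```r
# Hypothetical toy data with inconsistent capitalization and stray spaces.
toy <- data.frame(
  vehicle_type = c("Bus", "BUS", "bus ", "TAXI", "Taxi"),
  injured = c(2, 1, 0, 1, 3)
)

# Trim whitespace and convert to upper case so variants share one label.
toy$vehicle_clean <- toupper(trimws(toy$vehicle_type))

# Total injuries per cleaned label.
totals <- tapply(toy$injured, toy$vehicle_clean, sum, na.rm = TRUE)

print(sort(totals, decreasing = TRUE))
```

Whether labels that differ only in capitalization really describe the same vehicle type is a judgment call that would need checking against the data.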
vehicle_types <- unique(Crashes$`VEHICLE TYPE CODE 1`)
avg_injuries <- tapply(Crashes$`NUMBER OF PERSONS INJURED`, Crashes$`VEHICLE TYPE CODE 1`, mean)
bar_colors <- c("red", "blue", "green", "orange", "purple", "gray", "black", "brown", "navy", "magenta", "pink", "cyan")
barplot(avg_injuries, main="Average Number of Injuries by Vehicle Type",
xlab="Vehicle Type", ylab="Average Number of Injuries",
col=bar_colors, las=2, cex.names=0.7, ylim=c(0, max(avg_injuries, na.rm=TRUE) + 1))
top_vehicle_types <- names(sort(avg_injuries, decreasing = TRUE)[1:12])
barplot(avg_injuries[top_vehicle_types], main = "Average Number of Injuries for Top 12 Vehicle Types",
xlab = "Vehicle Type", ylab = "Average Number of Injuries",
col = bar_colors, las = 2, cex.names = 0.7, ylim = c(0, max(avg_injuries[top_vehicle_types]) + 1))
deaths_by_cause <- Crashes %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(total_deaths))
head(deaths_by_cause, n = 10)
## # A tibble: 10 × 2
## `CONTRIBUTING FACTOR VEHICLE 1` total_deaths
## <chr> <dbl>
## 1 Unsafe Speed 355
## 2 Failure to Yield Right-of-Way 243
## 3 Traffic Control Disregarded 238
## 4 Alcohol Involvement 101
## 5 Pedestrian/Bicyclist/Other Pedestrian Error/Confusion 92
## 6 Illnes 53
## 7 Passenger Distraction 48
## 8 Backing Unsafely 40
## 9 Physical Disability 32
## 10 Passing or Lane Usage Improper 27
avg_deaths <- tapply(Crashes$`NUMBER OF PERSONS KILLED`, Crashes$`CONTRIBUTING FACTOR VEHICLE 1`, mean)
top_avg_deaths <- sort(avg_deaths, decreasing = TRUE)[1:12]
barplot(top_avg_deaths,
main = "Average Number of Deaths for Top 12 Contributing Factors",
xlab = "Contributing Factor Vehicle 1",
ylab = "Average Number of Deaths",
col = bar_colors[1:12],
las = 2, cex.names = 0.7
)
top_avg_deaths
## Illnes
## 0.034550196
## Unsafe Speed
## 0.013849335
## Pedestrian/Bicyclist/Other Pedestrian Error/Confusion
## 0.010501084
## Drugs (illegal)
## 0.009791922
## Drugs (Illegal)
## 0.007159905
## Traffic Control Disregarded
## 0.006947688
## Tinted Windows
## 0.006711409
## Passenger Distraction
## 0.006119327
## Alcohol Involvement
## 0.004790817
## Tow Hitch Defective
## 0.004784689
## Physical Disability
## 0.003427959
## Pavement Defective
## 0.003378378
For my first research question, it is interesting to compare the vehicle types with the most total injuries against those with the highest average injuries per crash. The vehicle types with the most total injuries are the most commonly driven ones, such as passenger vehicles, sport utility vehicles and station wagons, taxis, and pickup trucks. The vehicle types with the highest average injuries per crash seem to be larger construction vehicles and trucks, such as the 18-wheeler, as well as motorcycles.
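The difference between total and average injuries can be illustrated by putting the number of crashes, the total, and the mean side by side. The sketch below uses made-up numbers (the `toy` data frame is hypothetical) and base R only.

```r
# Hypothetical toy data: a common vehicle type and a rare one.
toy <- data.frame(
  vehicle_type = c(rep("PASSENGER VEHICLE", 6), rep("MOTORCYCLE", 2)),
  injured = c(0, 1, 0, 1, 1, 1, 1, 2)
)

# Count, total, and mean of injuries for each vehicle type.
n_crashes <- tapply(toy$injured, toy$vehicle_type, length)
total_inj <- tapply(toy$injured, toy$vehicle_type, sum)
mean_inj <- tapply(toy$injured, toy$vehicle_type, mean)

summary_tbl <- data.frame(
  n = as.vector(n_crashes),
  total = as.vector(total_inj),
  mean = as.vector(mean_inj),
  row.names = names(n_crashes)
)

print(summary_tbl)
```

In this toy example the common type has the larger total while the rare type has the larger mean. Showing `n` next to the mean also makes it easier to spot averages that rest on only a handful of crashes.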
For my second research question, the most total deaths once again come from very common causes of crashes, such as unsafe speed, alcohol involvement, and driving errors like failure to yield the right-of-way. The highest average deaths per crash, however, include defects in either the environment or the vehicle itself (defective pavement, a defective tow hitch, tinted windows), along with illness and drug use. Unsafe speed and alcohol involvement stand out as major causes of death in accidents because both rank highly in total deaths and in average deaths.
Research Question 3: Is there a significant difference in the average number of deaths across different vehicle types?
I assume the observations are independent, because each recorded crash is a separate event. Based on my boxplot below, I also assume that the data are approximately normally distributed with similar variance across groups.
H0: There is no significant difference in the average number of deaths between different vehicle types.
Ha: There is a significant difference in the average number of deaths between different vehicle types.
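The normality and equal-variance assumptions stated above can also be checked numerically, alongside the boxplot. A minimal sketch on made-up data (the `toy` data frame is hypothetical) that compares the variance and the share of zero-death crashes in each group:

```r
# Hypothetical toy data: deaths per crash for two vehicle types.
toy <- data.frame(
  vehicle_type = rep(c("SEDAN", "MOTORCYCLE"), each = 5),
  killed = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1)
)

# Sample variance of deaths within each vehicle type.
group_var <- tapply(toy$killed, toy$vehicle_type, var)

# Share of crashes in each group with zero deaths.
share_zero <- tapply(toy$killed == 0, toy$vehicle_type, mean)

print(group_var)
print(share_zero)
```

Large differences in variance between groups, or groups made up almost entirely of zeros, would be a sign that the assumptions deserve a closer look.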
Crashes_1000 <- Crashes %>%
group_by(`VEHICLE TYPE CODE 1`) %>%
summarize(n = n()) %>%
filter(n > 1000) %>%
pull(`VEHICLE TYPE CODE 1`)
Crashes_filtered <- Crashes %>%
filter(`VEHICLE TYPE CODE 1` %in% Crashes_1000)
model <- aov(`NUMBER OF PERSONS KILLED` ~ `VEHICLE TYPE CODE 1`, data = Crashes_filtered)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## `VEHICLE TYPE CODE 1` 36 6.1 0.17061 111.3 <2e-16 ***
## Residuals 1959962 3004.1 0.00153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 12260 observations deleted due to missingness
boxplot(`NUMBER OF PERSONS KILLED` ~ `VEHICLE TYPE CODE 1`, data = Crashes_filtered,
main = "Distribution of Number of Deaths by Vehicle Type",
xlab = "Vehicle Type",
ylab = "Number of Deaths",
col = "lightblue",
las=2, cex.axis = 0.6)
The p-value of this ANOVA test is less than 2e-16, so we reject the null hypothesis at any conventional significance level (for example, 0.01). There is statistically significant evidence that the average number of deaths per crash differs across vehicle types.
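With almost 2 million rows, even very small differences can produce a tiny p-value, so it can help to look at an effect size as well. One common choice is eta squared, the share of total variation explained by the grouping variable. The sketch below computes it from the rounded sums of squares printed in the ANOVA table, so the result is approximate.

```r
# Sums of squares copied from the ANOVA output (rounded values).
ss_between <- 6.1      # `VEHICLE TYPE CODE 1`
ss_residual <- 3004.1  # Residuals

# Eta squared: between-group variation as a share of total variation.
eta_sq <- ss_between / (ss_between + ss_residual)

print(round(eta_sq, 5))
```

The value comes out at about 0.002, which suggests that vehicle type accounts for roughly 0.2% of the variation in deaths per crash.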
Research Question 4: Is there a relationship between the number of persons injured in car accidents and the number of persons killed in car accidents?
plot(Crashes$`NUMBER OF PERSONS KILLED`, Crashes$`NUMBER OF PERSONS INJURED`,
xlab = "Number of Persons KILLED", ylab = "Number of Persons INJURED",
main = "Correlation between Number of Persons Killed and Number of Persons Injured")
# Fit the line with the same response (y = injured) and predictor (x = killed) as the plot axes
abline(lm(Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`))
reg <- lm(Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`)
summary(reg)
##
## Call:
## lm(formula = Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.365 -0.298 -0.298 -0.298 42.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2977570 0.0004893 608.52 <2e-16 ***
## Crashes$`NUMBER OF PERSONS KILLED` 0.2667885 0.0123001 21.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6893 on 1987282 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.0002367, Adjusted R-squared: 0.0002362
## F-statistic: 470.5 on 1 and 1987282 DF, p-value: < 2.2e-16
The equation for the linear model is: NUMBER OF PERSONS INJURED = 0.2977570 + 0.2667885 * NUMBER OF PERSONS KILLED
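To make the equation concrete, the sketch below plugs a few values into it. The function name `predict_injured` is a hypothetical helper, and the coefficients are copied from the regression output.

```r
# Fitted line from the regression summary.
predict_injured <- function(killed) {
  0.2977570 + 0.2667885 * killed
}

# Predicted number of persons injured for crashes with 0, 1, and 2 deaths.
predictions <- predict_injured(0:2)

print(predictions)
```

Under this model, each additional death is associated with about 0.27 more predicted injuries, starting from about 0.30 injuries for a crash with no deaths.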
The p-value for NUMBER OF PERSONS KILLED is less than 2e-16, which indicates that the number of persons killed is a statistically significant predictor of the number of persons injured in an accident. However, the R-squared value is only 0.0002367, so the model explains about 0.02% of the variation in injuries. The positive relationship between the two variables is very weak, and other factors most likely account for most of the variation.
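For a regression with a single predictor, the correlation coefficient is the square root of R-squared, with the sign of the slope. A short sketch using the values reported in the regression summary:

```r
# Values copied from the regression output.
r_squared <- 0.0002367
slope <- 0.2667885

# In simple linear regression, r has the same sign as the slope.
r <- sign(slope) * sqrt(r_squared)

print(round(r, 4))
```

This gives a correlation of roughly 0.015, which is very close to zero.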
There are multiple ethical concerns that could arise from this dataset. This dataset could be biased by underreporting or overreporting of certain kinds of accidents, such as minor fender benders going unreported while large accidents receive more attention. I also believe the accuracy of this dataset could be called into question because of the large number of missing values, which make up nearly 30 percent of the data.
An ethical concern for this project is that the results of my EDA could be misused. My EDA revealed that accidents involving construction vehicles, bigger trucks like 18-wheelers, and motorcycles resulted in more deaths on average, which could lead a viewer of this project to raise safety concerns about these vehicles. However, based on the values of the averages themselves, these vehicles do not cause even 1 extra death on average compared with any other vehicle type, so this data should not be used to justify a crackdown on those types of vehicles.