Data 101 Final Project

The Dataset

The dataset I chose for my final project covers almost 2 million car crashes that have occurred in New York City. It is published by NYC Open Data (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95).

The categorical variables in this dataset are:
- BOROUGH: the borough of NYC where the crash occurred (Manhattan, Bronx, Brooklyn, Queens, Staten Island)
- ZIP CODE: the zip code of the crash location
- ON STREET NAME: the street name of the crash location
- CROSS STREET NAME: the cross street name of the crash location
- OFF STREET NAME: the off-street name of the crash location
- CONTRIBUTING FACTOR VEHICLE 1: the primary contributing factor for vehicle 1 in the crash
- CONTRIBUTING FACTOR VEHICLE 2: the primary contributing factor for vehicle 2 in the crash
- CONTRIBUTING FACTOR VEHICLE 3: the primary contributing factor for vehicle 3 in the crash
- CONTRIBUTING FACTOR VEHICLE 4: the primary contributing factor for vehicle 4 in the crash
- CONTRIBUTING FACTOR VEHICLE 5: the primary contributing factor for vehicle 5 in the crash
- COLLISION_ID: unique identifier for the crash
- VEHICLE TYPE CODE 1: vehicle type for vehicle 1 in the crash
- VEHICLE TYPE CODE 2: vehicle type for vehicle 2 in the crash
- VEHICLE TYPE CODE 3: vehicle type for vehicle 3 in the crash
- VEHICLE TYPE CODE 4: vehicle type for vehicle 4 in the crash
- VEHICLE TYPE CODE 5: vehicle type for vehicle 5 in the crash

The quantitative variables in this dataset are:
- CRASH DATE: date of the crash
- CRASH TIME: time of the crash
- LATITUDE: the latitude of the crash location
- LONGITUDE: the longitude of the crash location
- LOCATION: the latitude and longitude of the crash location combined
- NUMBER OF PERSONS INJURED: number of persons injured in the crash
- NUMBER OF PERSONS KILLED: number of persons killed in the crash
- NUMBER OF PEDESTRIANS INJURED: number of pedestrians injured in the crash
- NUMBER OF PEDESTRIANS KILLED: number of pedestrians killed in the crash
- NUMBER OF CYCLIST INJURED: number of cyclists injured in the crash
- NUMBER OF CYCLIST KILLED: number of cyclists killed in the crash
- NUMBER OF MOTORIST INJURED: number of motorists injured in the crash
- NUMBER OF MOTORIST KILLED: number of motorists killed in the crash

Reading in the dataset and calculating missing values

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Crashes <- read_csv("Motor_Vehicle_collisions_-_Crashes(1).csv")

## Rows: 1987321 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
## dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
## time  (1): CRASH TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sum(is.na(Crashes))

## [1] 16993892

mean(is.na(Crashes))*100

## [1] 29.48675
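The overall percentage does not show which columns the missing values sit in. Here is a minimal sketch of a per-column breakdown, using a small made-up data frame (`toy`) in place of `Crashes`. The column names are borrowed from the dataset, but the values are invented for illustration.

```r
# Toy stand-in for Crashes; the values are invented, not taken from the dataset.
toy <- data.frame(
  BOROUGH = c("QUEENS", NA, "BRONX", NA),
  `VEHICLE TYPE CODE 1` = c("Sedan", "Taxi", "Bus", "Sedan"),
  `VEHICLE TYPE CODE 5` = c(NA, NA, NA, NA),
  check.names = FALSE
)

# Percentage of missing values in each column, largest first.
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(missing_pct)
```

On the real data the same idea would be `colMeans(is.na(Crashes)) * 100`, which would show whether the missing values are spread evenly or concentrated in a few columns.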

Research Questions for EDA

The two research questions I will explore are:
1. How does the number of injuries vary based on vehicle type?
2. How does the number of deaths vary based on the cause of the accident?

EDA for research question 1

injuries_by_vehicle_type <- Crashes %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  summarize(total_injuries = sum(`NUMBER OF PERSONS INJURED`)) %>%
  arrange(desc(total_injuries))

head(injuries_by_vehicle_type, n = 12)

## # A tibble: 12 × 2
##    `VEHICLE TYPE CODE 1`         total_injuries
##    <chr>                                  <dbl>
##  1 PASSENGER VEHICLE                     106857
##  2 SPORT UTILITY / STATION WAGON          45622
##  3 TAXI                                    8171
##  4 Pick-up Truck                           7844
##  5 UNKNOWN                                 7652
##  6 Bus                                     6172
##  7 VAN                                     5320
##  8 Motorcycle                              4701
##  9 OTHER                                   4415
## 10 Box Truck                               3503
## 11 BUS                                     3349
## 12 LIVERY VEHICLE                          3216
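One caveat about the table above: `sum()` returns `NA` when any value in a group is missing, and `arrange(desc())` sorts `NA` to the bottom, so a vehicle type with even one missing injury count can drop out of a ranking like this. Whether that happened here is not visible from the output. The sketch below shows the effect on a made-up data frame (`toy`, invented values) using base R. The same `na.rm = TRUE` argument can be passed to `sum()` inside `summarize()`.

```r
# Toy data: one Sedan crash has a missing injury count (values are invented).
toy <- data.frame(
  type = c("Sedan", "Sedan", "Sedan", "Taxi", "Taxi"),
  injured = c(2, NA, 3, 1, 0)
)

# Default behavior: a single NA makes the whole group total NA.
total_default <- tapply(toy$injured, toy$type, sum)

# With na.rm = TRUE the missing value is skipped and the rest are summed.
total_na_rm <- tapply(toy$injured, toy$type, sum, na.rm = TRUE)

print(total_default)
print(total_na_rm)
```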

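# All distinct vehicle type labels in the data
vehicle_types <- unique(Crashes$`VEHICLE TYPE CODE 1`)
# Average injuries per crash for each vehicle type (NA for any type with a missing injury count)
avg_injuries <- tapply(Crashes$`NUMBER OF PERSONS INJURED`, Crashes$`VEHICLE TYPE CODE 1`, mean)
# Twelve bar colors; they are recycled when a plot has more than 12 bars
bar_colors <- c("red", "blue", "green", "orange", "purple", "gray", "black", "brown", "navy", "magenta", "pink", "cyan")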

barplot(avg_injuries, main="Average Number of Injuries by Vehicle Type", 
        xlab="Vehicle Type", ylab="Average Number of Injuries",
        col=bar_colors, las=2, cex.names=0.7, ylim=c(0, max(avg_injuries, na.rm=TRUE) + 1))

top_vehicle_types <- names(sort(avg_injuries, decreasing = TRUE)[1:12])

barplot(avg_injuries[top_vehicle_types], main = "Average Number of Injuries for Top 12 Vehicle Types",
        xlab = "Vehicle Type", ylab = "Average Number of Injuries",
        col = bar_colors, las = 2, cex.names = 0.7, ylim = c(0, max(avg_injuries[top_vehicle_types]) + 1))
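An average for a vehicle type that appears in only a handful of crashes can be driven by a single crash. One possible way to guard against that is to keep only types with a minimum number of crashes before ranking the averages, similar to the `n > 1000` filter used later for the ANOVA. This is a sketch on made-up data: `toy`, `min_n`, and the values are all assumptions for illustration.

```r
# Toy data: "Crane" appears once, with an unusually high injury count (invented values).
toy <- data.frame(
  type = c(rep("Sedan", 5), rep("Taxi", 4), "Crane"),
  injured = c(0, 1, 0, 0, 1, 0, 0, 1, 0, 4)
)

min_n <- 3  # hypothetical minimum number of crashes per vehicle type

# Keep only vehicle types that occur at least min_n times.
counts <- table(toy$type)
keep <- names(counts)[counts >= min_n]
common <- toy[toy$type %in% keep, ]

# Rank the remaining types by average injuries per crash.
avg_common <- sort(tapply(common$injured, common$type, mean), decreasing = TRUE)
print(avg_common)
```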

EDA for research question 2

deaths_by_cause <- Crashes %>%
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
  summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(total_deaths))

head(deaths_by_cause, n = 10)

## # A tibble: 10 × 2
##    `CONTRIBUTING FACTOR VEHICLE 1`                       total_deaths
##    <chr>                                                        <dbl>
##  1 Unsafe Speed                                                   355
##  2 Failure to Yield Right-of-Way                                  243
##  3 Traffic Control Disregarded                                    238
##  4 Alcohol Involvement                                            101
##  5 Pedestrian/Bicyclist/Other Pedestrian Error/Confusion           92
##  6 Illnes                                                          53
##  7 Passenger Distraction                                           48
##  8 Backing Unsafely                                                40
##  9 Physical Disability                                             32
## 10 Passing or Lane Usage Improper                                  27

avg_deaths <- tapply(Crashes$`NUMBER OF PERSONS KILLED`, Crashes$`CONTRIBUTING FACTOR VEHICLE 1`, mean)
top_avg_deaths <- sort(avg_deaths, decreasing = TRUE)[1:12]

barplot(top_avg_deaths, 
        main = "Average Number of Deaths for Top 12 Contributing Factors", 
        xlab = "Contributing Factor Vehicle 1", 
        ylab = "Average Number of Deaths", 
        col = bar_colors[1:12], 
        las = 2, cex.names = 0.7
        )

top_avg_deaths

##                                                Illnes 
##                                           0.034550196 
##                                          Unsafe Speed 
##                                           0.013849335 
## Pedestrian/Bicyclist/Other Pedestrian Error/Confusion 
##                                           0.010501084 
##                                       Drugs (illegal) 
##                                           0.009791922 
##                                       Drugs (Illegal) 
##                                           0.007159905 
##                           Traffic Control Disregarded 
##                                           0.006947688 
##                                        Tinted Windows 
##                                           0.006711409 
##                                 Passenger Distraction 
##                                           0.006119327 
##                                   Alcohol Involvement 
##                                           0.004790817 
##                                   Tow Hitch Defective 
##                                           0.004784689 
##                                   Physical Disability 
##                                           0.003427959 
##                                    Pavement Defective 
##                                           0.003378378
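In the output above, `Drugs (illegal)` and `Drugs (Illegal)` are counted as separate factors, and the vehicle type table earlier lists both `Bus` and `BUS`. The labels differ only in capitalization, so the groups are split. A minimal sketch of one way to merge them before grouping, on made-up values (the vectors `factor_raw` and `killed` are invented for illustration):

```r
# Invented example labels that differ only in capitalization.
factor_raw <- c("Drugs (illegal)", "Drugs (Illegal)", "Unsafe Speed", "unsafe speed")
killed <- c(1, 0, 2, 0)

# Normalize the labels: trim stray whitespace and convert to upper case.
factor_clean <- toupper(trimws(factor_raw))

# Grouping on the cleaned labels merges the split categories.
deaths_clean <- tapply(killed, factor_clean, sum)
print(deaths_clean)
```

Applied to the real data, the cleaned column would replace `CONTRIBUTING FACTOR VEHICLE 1` (or `VEHICLE TYPE CODE 1`) in the `group_by()` and `tapply()` calls.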

Summary of EDA

For my first research question, it is interesting to compare the vehicle types with the most total injuries to those with the highest average injuries per crash. The types with the most total injuries are the most commonly driven ones, such as passenger vehicles, sport utility vehicles and station wagons, taxis, and pickup trucks. The types with the highest average injuries per crash seem to be larger construction vehicles and trucks, like the 18 wheeler, as well as motorcycles.

For my second research question, the most total deaths again come from very common causes of crashes, such as unsafe speed, failure to yield right-of-way, disregarded traffic controls, and alcohol involvement. The highest average deaths per crash, however, include illness, illegal drug use, and defects in either the vehicle or the road (tow hitch defective, pavement defective). Unsafe speed and alcohol involvement stand out because they rank high in both total deaths and average deaths per crash, which suggests they are among the major contributors to fatal crashes.

ANOVA Test

Research Question 3: Is there a significant difference in the average number of deaths per crash across different vehicle types?

Assumptions

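The observations are treated as independent because each recorded crash is a separate event. The normality and equal-variance assumptions are harder to defend: as the boxplot below shows, the number of deaths is a count that is 0 for nearly every crash, so the distribution within each vehicle type is heavily right-skewed rather than normal. Because every group in the test has more than 1000 crashes, I still proceed with the ANOVA, but the result should be read with this limitation in mind.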
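A minimal sketch of how these assumptions could be checked numerically, using a small made-up data frame rather than the real data (`toy` and its values are invented). It computes, for each vehicle type, the share of crashes with zero deaths and the variance of the death count.

```r
# Toy data: deaths per crash for two vehicle types (invented values).
toy <- data.frame(
  type = rep(c("Sedan", "Taxi"), each = 5),
  killed = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Share of crashes with zero deaths in each group; values near 1 indicate strong skew.
share_zero <- tapply(toy$killed == 0, toy$type, mean)

# Variance of deaths in each group; large differences argue against equal variance.
group_var <- tapply(toy$killed, toy$type, var)

print(share_zero)
print(group_var)
```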

Null and Alternate Hypotheses

H0: The average number of deaths per crash is the same for all vehicle types.

Ha: The average number of deaths per crash differs for at least one vehicle type.

The Test

Crashes_1000 <- Crashes %>% 
  group_by(`VEHICLE TYPE CODE 1`) %>% 
  summarize(n = n()) %>% 
  filter(n > 1000) %>% 
  pull(`VEHICLE TYPE CODE 1`)

Crashes_filtered <- Crashes %>% 
  filter(`VEHICLE TYPE CODE 1` %in% Crashes_1000)

model <- aov(`NUMBER OF PERSONS KILLED` ~ `VEHICLE TYPE CODE 1`, data = Crashes_filtered)
summary(model)

##                            Df Sum Sq Mean Sq F value Pr(>F)    
## `VEHICLE TYPE CODE 1`      36    6.1 0.17061   111.3 <2e-16 ***
## Residuals             1959962 3004.1 0.00153                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 12260 observations deleted due to missingness

boxplot(`NUMBER OF PERSONS KILLED` ~ `VEHICLE TYPE CODE 1`, data = Crashes_filtered,
        main = "Distribution of Number of Deaths by Vehicle Type",
        xlab = "Vehicle Type",
        ylab = "Number of Deaths",
        col = "lightblue",
        las=2, cex.axis = 0.6)

Explanation

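The p-value of this ANOVA test is less than 2e-16, so we reject the null hypothesis at any conventional significance level (for example 0.01). We conclude that the average number of deaths per crash differs for at least one vehicle type. With nearly 2 million crashes, even very small differences are statistically significant, so this result by itself says nothing about how large the differences are.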
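To get a rough sense of the size of the effect, the sums of squares printed in the ANOVA table can be turned into eta-squared, the share of the total variation in deaths that is associated with vehicle type. This is a worked calculation from the rounded values in the table (6.1 and 3004.1), so the result is approximate.

```r
# Sums of squares copied from the ANOVA table (rounded as printed).
ss_between <- 6.1      # `VEHICLE TYPE CODE 1`
ss_residual <- 3004.1  # Residuals

# Eta-squared: between-group variation as a share of total variation.
eta_squared <- ss_between / (ss_between + ss_residual)
print(round(eta_squared, 4))
```

The value is about 0.002, meaning vehicle type accounts for roughly 0.2% of the variation in deaths per crash.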

Correlation

Research Question 4: Is there a relationship between the number of persons injured in car accidents and the number of persons killed in car accidents?

Scatter Plot

# x-axis: persons killed, y-axis: persons injured
plot(Crashes$`NUMBER OF PERSONS KILLED`, Crashes$`NUMBER OF PERSONS INJURED`, 
     xlab = "Number of Persons KILLED", ylab = "Number of Persons INJURED", 
     main = "Correlation between Number of Persons Killed and Number of Persons Injured")
# Fit injured ~ killed so the line matches the plot axes (y ~ x)
abline(lm(Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`))
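Both variables are small whole-number counts, so many crashes land on exactly the same point in the scatter plot and the plot cannot show how many crashes each point represents. A cross-tabulation is one possible complement. The sketch below uses invented values (`killed` and `injured` are made-up vectors, not the real columns).

```r
# Invented counts for five crashes.
killed <- c(0, 0, 0, 1, 0)
injured <- c(0, 1, 0, 2, 0)

# Number of crashes for each (killed, injured) combination.
counts <- table(killed = killed, injured = injured)
print(counts)
```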

Linear regression

reg <- lm(Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`)
summary(reg)

## 
## Call:
## lm(formula = Crashes$`NUMBER OF PERSONS INJURED` ~ Crashes$`NUMBER OF PERSONS KILLED`)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.365 -0.298 -0.298 -0.298 42.702 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        0.2977570  0.0004893  608.52   <2e-16 ***
## Crashes$`NUMBER OF PERSONS KILLED` 0.2667885  0.0123001   21.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6893 on 1987282 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.0002367,  Adjusted R-squared:  0.0002362 
## F-statistic: 470.5 on 1 and 1987282 DF,  p-value: < 2.2e-16

The equation for the linear model is: NUMBER OF PERSONS INJURED = 0.2977570 + 0.2667885 * NUMBER OF PERSONS KILLED

The p-value for NUMBER OF PERSONS KILLED is less than 2e-16, which indicates that the number of persons killed is a statistically significant predictor of the number of persons injured in an accident. However, the R-squared value is only 0.0002367, so the model explains about 0.02% of the variation in injuries and the positive correlation between the two variables is very weak (r is about 0.015). The number of injuries is most likely driven by factors other than the number of deaths.
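As a worked example, the fitted equation and the R-squared value from the regression output can be used to compute the correlation coefficient and a few predicted values. The numbers are copied from the output above. The variable names are only for illustration.

```r
# Coefficients and R-squared copied from summary(reg).
intercept <- 0.2977570
slope <- 0.2667885
r_squared <- 0.0002367

# In simple linear regression, r is the square root of R-squared with the sign of the slope.
r <- sign(slope) * sqrt(r_squared)

# Predicted number of persons injured when 0, 1, or 2 persons are killed.
predicted_injured <- intercept + slope * c(0, 1, 2)

print(round(r, 4))
print(predicted_injured)
```

The predictions rise by only about 0.27 injuries for each additional death, and r is close to 0, which is consistent with a very weak relationship.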

Ethical Concerns

There are multiple ethical concerns that could arise from this dataset. The dataset could be biased by underreporting or overreporting of certain kinds of accidents, such as minor fender benders going unreported while large accidents receive more attention. I also believe the accuracy of this dataset could be called into question because of the large number of missing values in it, which make up nearly 30 percent of the data.

An ethical concern for this project is that the results of my EDA could be misused. My EDA revealed that accidents involving construction vehicles, bigger trucks like 18 wheelers, and motorcycles resulted in more deaths on average, which could lead a viewer of this project to raise safety concerns over these vehicles. However, based on the values of the averages themselves, these vehicles do not cause even 1 extra death on average compared with any other vehicle type, so this data should not be used to justify cracking down on the safety of those types of vehicles.