Final Project - DATA 101

The Research Questions…

QUESTION 1: How does the number of deaths vary based on the cause of the accident?

QUESTION 2: How does the number of injuries vary based on the vehicle?

Introduction-:

The source of the data set I chose for my final project is NYC Open Data. This data set includes information about car crashes that happened in New York City. The categorical variables in this data set are: “BOROUGH”, “ZIP CODE”, “ON STREET NAME”, “CROSS STREET NAME”, “OFF STREET NAME”, “CONTRIBUTING FACTOR VEHICLE 1”, “CONTRIBUTING FACTOR VEHICLE 2”, “CONTRIBUTING FACTOR VEHICLE 3”, “CONTRIBUTING FACTOR VEHICLE 4”, “CONTRIBUTING FACTOR VEHICLE 5”, “COLLISION_ID”, “VEHICLE TYPE CODE 1”, “VEHICLE TYPE CODE 2”, “VEHICLE TYPE CODE 3”, “VEHICLE TYPE CODE 4”, “VEHICLE TYPE CODE 5”.

The quantitative variables in this data set are: “CRASH DATE”, “CRASH TIME”, “LATITUDE”, “LONGITUDE”, “LOCATION”, “NUMBER OF PERSONS INJURED”, “NUMBER OF PERSONS KILLED”, “NUMBER OF PEDESTRIANS INJURED”, “NUMBER OF PEDESTRIANS KILLED”, “NUMBER OF CYCLIST INJURED”, “NUMBER OF CYCLIST KILLED”, “NUMBER OF MOTORIST INJURED”, “NUMBER OF MOTORIST KILLED”.

Data Analysis…

Based on my research questions, the variables I will mostly be looking at are “CONTRIBUTING FACTOR VEHICLE 1”, “NUMBER OF PERSONS KILLED”, “NUMBER OF PERSONS INJURED”, and “VEHICLE TYPE CODE 1”. For the first question, I plan to group the data by contributing factor 1, summarize total deaths, and arrange the result in descending order. For the second question, I will group the data by vehicle type 1, summarize the totals in descending order, and compare the average number of injuries per vehicle type.

Loading the libraries that I need to work on this…

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tinytex)
library(ggplot2)
library(tidyr)

Importing the Data Set for the Final Project…

setwd("/Users/janithrithilakasiri/Downloads")
Crashes_NYC <- read_csv("Motor_Vehicle_Collisions_-_Crashes.csv")

## Rows: 2008519 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
## dbl  (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
## time  (1): CRASH TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Checking for NA Values…

sum(is.na(Crashes_NYC))

## [1] 17178766

mean(is.na(Crashes_NYC))*100

## [1] 29.49294
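
The overall percentage does not show which columns the missing values come from. A minimal sketch, on a made-up data frame (the `toy` data and its values are assumptions for illustration, not taken from the crash data), computes the percentage of NA values per column:

```r
# Hypothetical toy data standing in for Crashes_NYC
toy <- data.frame(
  BOROUGH = c("BROOKLYN", NA, "QUEENS", NA),
  `NUMBER OF PERSONS KILLED` = c(0, 1, NA, 0),
  `VEHICLE TYPE CODE 5` = c(NA, NA, NA, NA_character_),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Percentage of missing values in each column, largest first
na_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
na_pct
```

Running the same two lines on Crashes_NYC would show whether the missing values are concentrated in a few sparsely filled columns.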

Question 01-: How does the number of deaths vary based on the cause of the accident?

Deaths_by_cause <- Crashes_NYC %>%
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
  summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(total_deaths))

head(Deaths_by_cause, n=5)

## # A tibble: 5 × 2
##   `CONTRIBUTING FACTOR VEHICLE 1`                       total_deaths
##   <chr>                                                        <dbl>
## 1 Unsafe Speed                                                   368
## 2 Failure to Yield Right-of-Way                                  247
## 3 Traffic Control Disregarded                                    245
## 4 Alcohol Involvement                                            103
## 5 Pedestrian/Bicyclist/Other Pedestrian Error/Confusion           95
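
One caveat about these totals: sum() returns NA for any group that contains a missing value, and arrange(desc()) places NA rows last, so a cause with even one missing death count could drop out of the top five. Whether that happens in this data set has not been checked here. The sketch below uses made-up values (the `toy` data frame is hypothetical) to show the effect and the `na.rm = TRUE` option:

```r
# Hypothetical toy data: one group has a missing death count
toy <- data.frame(
  factor1 = c("Unspecified", "Unspecified", "Unsafe Speed", "Unsafe Speed"),
  killed  = c(3, NA, 1, 1)
)

# Without na.rm, the group containing an NA sums to NA
totals_default <- tapply(toy$killed, toy$factor1, sum)

# With na.rm = TRUE, missing counts are skipped instead
totals_na_rm <- tapply(toy$killed, toy$factor1, sum, na.rm = TRUE)

totals_default
totals_na_rm
```

The same `na.rm = TRUE` argument can be passed to sum() inside summarize().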

Avg_deaths <- tapply(Crashes_NYC$`NUMBER OF PERSONS KILLED`, Crashes_NYC$`CONTRIBUTING FACTOR VEHICLE 1`, mean)
Top_5_avg_deaths <- sort(Avg_deaths, decreasing = TRUE)[1:5]

barplot(Top_5_avg_deaths, 
        main = "Average Number of Deaths for Top 5 Contributing Factors", 
        ylab = "Average Number of Deaths", 
        col = rainbow(5), 
        las = 2, cex.names = 0.7
        )

Top_5_avg_deaths

##                                                Illnes 
##                                           0.033844189 
##                                          Unsafe Speed 
##                                           0.013897281 
## Pedestrian/Bicyclist/Other Pedestrian Error/Confusion 
##                                           0.010540331 
##                                       Drugs (illegal) 
##                                           0.009580838 
##                                       Drugs (Illegal) 
##                                           0.007159905
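
The output lists “Drugs (illegal)” and “Drugs (Illegal)” as separate causes, which suggests that some labels differ only in capitalization. One possible fix, shown here on made-up values (the `toy` data frame is hypothetical), is to normalize the case before grouping:

```r
# Hypothetical toy data with two spellings of the same cause
toy <- data.frame(
  factor1 = c("Drugs (illegal)", "Drugs (Illegal)", "Unsafe Speed"),
  killed  = c(1, 2, 4)
)

# Lower-case the labels so spellings that differ only in case share one group
toy$factor1_clean <- tolower(toy$factor1)

avg_clean <- tapply(toy$killed, toy$factor1_clean, mean)
avg_clean
```

The same step could be applied to the vehicle type column before grouping by it.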

Question 02-: How does the number of injuries vary based on the Vehicle?

Kills_by_Vehicle <- Crashes_NYC %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(total_deaths))

head(Kills_by_Vehicle, n = 5)

## # A tibble: 5 × 2
##   `VEHICLE TYPE CODE 1`         total_deaths
##   <chr>                                <dbl>
## 1 PASSENGER VEHICLE                      398
## 2 SPORT UTILITY / STATION WAGON          218
## 3 Motorcycle                             174
## 4 MOTORCYCLE                              91
## 5 UNKNOWN                                 61

vehicle_types <- unique(Crashes_NYC$`VEHICLE TYPE CODE 1`)
avg_injuries <- tapply(Crashes_NYC$`NUMBER OF PERSONS INJURED`, Crashes_NYC$`VEHICLE TYPE CODE 1`, mean)
bar_colors <- c("red",  "purple", "magenta", "pink", "cyan")

Top_5_V <- names(sort(avg_injuries,decreasing = TRUE)[1:5])

barplot(avg_injuries[Top_5_V], main = "Average Number of Injuries for Top 5 Vehicle Types",
         ylab = "Average Number of Injuries",
        col = bar_colors, las = 2, cex.names = 0.7, ylim = c(0, max(avg_injuries[Top_5_V]) + 1))
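
Averages like these can be dominated by vehicle types that appear in only a handful of crashes. A hedged sketch, on made-up values and with an arbitrary cutoff (`toy` and `min_crashes` are hypothetical), keeps only the types with enough crashes before ranking them:

```r
# Hypothetical toy data: "Hearse" appears once with a large injury count
toy <- data.frame(
  vehicle = c("Sedan", "Sedan", "Sedan", "Sedan", "Hearse"),
  injured = c(0, 1, 0, 1, 5)
)

min_crashes <- 3  # assumed cutoff; choose one suited to the real data

n_crashes <- table(toy$vehicle)
toy_avg   <- tapply(toy$injured, toy$vehicle, mean, na.rm = TRUE)

# Keep only vehicle types seen in at least min_crashes crashes
common     <- names(n_crashes)[n_crashes >= min_crashes]
avg_common <- sort(toy_avg[common], decreasing = TRUE)
avg_common
```

In this toy case the single "Hearse" crash would otherwise top the ranking with an average of 5.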

Statistical Analysis…

Is there a correlation between the number of persons injured in car accidents and the number of persons killed in car accidents?

Scatter plot

plot(Crashes_NYC$`NUMBER OF PERSONS KILLED`, Crashes_NYC$`NUMBER OF PERSONS INJURED`, 
     xlab = "Number of Persons KILLED", ylab = "Number of Persons INJURED", 
     main = "Correlation between # of Persons Killed & # of Persons Injured")
# Regress the y-axis variable (injured) on the x-axis variable (killed) so the line matches the plot
abline(lm(`NUMBER OF PERSONS INJURED` ~ `NUMBER OF PERSONS KILLED`, data = Crashes_NYC), col = 'red')

Linear Regression

reg <- lm(Crashes_NYC$`NUMBER OF PERSONS INJURED` ~ Crashes_NYC$`NUMBER OF PERSONS KILLED`)
summary(reg)

## 
## Call:
## lm(formula = Crashes_NYC$`NUMBER OF PERSONS INJURED` ~ Crashes_NYC$`NUMBER OF PERSONS KILLED`)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.357 -0.301 -0.301 -0.301 42.699 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            0.3005707  0.0004887  615.08   <2e-16
## Crashes_NYC$`NUMBER OF PERSONS KILLED` 0.2640064  0.0122179   21.61   <2e-16
##                                           
## (Intercept)                            ***
## Crashes_NYC$`NUMBER OF PERSONS KILLED` ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6921 on 2008480 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.0002324,  Adjusted R-squared:  0.0002319 
## F-statistic: 466.9 on 1 and 2008480 DF,  p-value: < 2.2e-16

The p-value for Number of Persons Killed is less than 2e-16, so the slope is statistically significant. However, the R-squared value is only 0.0002324, which implies a very weak positive correlation between the two variables.
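
For a regression with a single predictor, R-squared equals the square of the Pearson correlation, so the correlation here is roughly sqrt(0.0002324) ≈ 0.015. The sketch below checks that identity on made-up values (`x` and `y` are hypothetical stand-ins, not the crash data):

```r
x <- c(0, 0, 1, 0, 2, 1)   # stand-in for persons killed
y <- c(1, 0, 2, 3, 1, 4)   # stand-in for persons injured

fit <- lm(y ~ x)
r   <- cor(x, y)

# R-squared from the model should match the squared correlation
r_squared <- summary(fit)$r.squared
all.equal(r_squared, r^2)
```

On the real data, cor() would need `use = "complete.obs"` to skip the rows with missing values, as lm() does by default.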

Summary of Research Questions…

For my first research question, the most deaths tend to come from very common causes of crashes such as unsafe speed, drunk driving, and driving errors (failure to yield, disregarding traffic controls). However, the highest average number of deaths per accident is for crashes attributed to illness. Unsafe speed is clearly a major cause of death in accidents, since it ranks near the top for both total deaths and average deaths. Alcohol involvement ranks fourth in total deaths but does not appear among the top five averages.

For my second research question, it is interesting to see the differences between the vehicle types with the highest totals and those with the highest average injuries. It appears that the vehicle types with the most injuries are the more commonly driven ones, like passenger vehicles, recreational vehicles, taxis, and pickup trucks. The table shown for this question reports total deaths, so the injury totals are not displayed here.

Ethical Concerns…

The main ethical concern that could arise from this data set is bias due to underreporting or overreporting of certain kinds of accidents. The accuracy of the data set could also be called into question because of the large number of missing values (about 29% of all cells).