The data set I chose for my final project comes from NYC Open Data. It contains information about car crashes that happened in New York City. The categorical variables in this data set are: “BOROUGH”, “ZIP CODE”, “ON STREET NAME”, “CROSS STREET NAME”, “OFF STREET NAME”, “CONTRIBUTING FACTOR VEHICLE 1”, “CONTRIBUTING FACTOR VEHICLE 2”, “CONTRIBUTING FACTOR VEHICLE 3”, “CONTRIBUTING FACTOR VEHICLE 4”, “CONTRIBUTING FACTOR VEHICLE 5”, “COLLISION_ID”, “VEHICLE TYPE CODE 1”, “VEHICLE TYPE CODE 2”, “VEHICLE TYPE CODE 3”, “VEHICLE TYPE CODE 4”, “VEHICLE TYPE CODE 5”.
The quantitative variables in this data set are: “CRASH DATE”,“CRASH TIME”,“LATITUDE”,“LONGITUDE”,“LOCATION”, “NUMBER OF PERSONS INJURED”,“NUMBER OF PERSONS KILLED”,“NUMBER OF PEDESTRIANS INJURED”,“NUMBER OF PEDESTRIANS KILLED”,“NUMBER OF CYCLIST INJURED”,“NUMBER OF CYCLIST KILLED”,“NUMBER OF MOTORIST INJURED”,“NUMBER OF MOTORIST KILLED”.
Based on my research questions, the variables I will mostly be looking at are “CONTRIBUTING FACTOR VEHICLE 1”, “NUMBER OF PERSONS KILLED”, “NUMBER OF PERSONS INJURED”, and “VEHICLE TYPE CODE 1”. For my first question, I plan to group the data by contributing factor 1, summarize the total deaths in each group, and arrange the results in descending order. For my second question, I will group the data by Borough, summarize the total number of injuries in each group, and arrange the results in descending order.
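The Borough plan described above could be sketched as follows. This is only an illustration on a small made-up table: `toy_crashes` and its values are hypothetical stand-ins for `Crashes_NYC`, and dropping rows with a missing borough and using `na.rm = TRUE` are assumptions about how missing values should be handled.

```r
library(dplyr)

# Hypothetical toy data standing in for Crashes_NYC (values are made up)
toy_crashes <- tibble(
  BOROUGH = c("BROOKLYN", "QUEENS", "BROOKLYN", NA, "BRONX", "QUEENS"),
  `NUMBER OF PERSONS INJURED` = c(2, 1, 0, 3, NA, 4)
)

injuries_by_borough <- toy_crashes %>%
  filter(!is.na(BOROUGH)) %>%                  # assumption: skip crashes with no borough
  group_by(BOROUGH) %>%
  summarize(total_injuries = sum(`NUMBER OF PERSONS INJURED`, na.rm = TRUE)) %>%
  arrange(desc(total_injuries))                # largest totals first

print(injuries_by_borough)
```

Without `na.rm = TRUE`, a single missing injury count would turn the whole borough total into `NA`.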
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tinytex)
library(ggplot2)
library(tidyr)
setwd("/Users/janithrithilakasiri/Downloads")
Crashes_NYC <- read_csv("Motor_Vehicle_Collisions_-_Crashes.csv")
## Rows: 2008519 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): CRASH DATE, BOROUGH, LOCATION, ON STREET NAME, CROSS STREET NAME,...
## dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER ...
## time (1): CRASH TIME
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sum(is.na(Crashes_NYC))
## [1] 17178766
mean(is.na(Crashes_NYC))*100
## [1] 29.49294
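The overall figure above does not show which columns the missing values sit in. A per-column breakdown might look like the sketch below. It runs on a small hypothetical data frame (`toy` and its values are made up), and the same two lines could be applied to `Crashes_NYC`.

```r
# Hypothetical toy data; check.names = FALSE keeps the spaces in column names
toy <- data.frame(
  BOROUGH = c("QUEENS", NA, "BRONX", NA),
  `ZIP CODE` = c(11101, NA, 10451, NA),
  `NUMBER OF PERSONS KILLED` = c(0, 0, 1, 0),
  check.names = FALSE
)

# is.na() gives a logical matrix; colMeans() turns it into the share missing per column
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(missing_pct)
```

If most of the missingness turns out to sit in columns the analysis never uses, the overall percentage overstates the problem for the research questions.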
Deaths_by_cause <- Crashes_NYC %>%
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
  summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(total_deaths))
head(Deaths_by_cause, n=5)
## # A tibble: 5 × 2
## `CONTRIBUTING FACTOR VEHICLE 1` total_deaths
## <chr> <dbl>
## 1 Unsafe Speed 368
## 2 Failure to Yield Right-of-Way 247
## 3 Traffic Control Disregarded 245
## 4 Alcohol Involvement 103
## 5 Pedestrian/Bicyclist/Other Pedestrian Error/Confusion 95
Avg_deaths <- tapply(Crashes_NYC$`NUMBER OF PERSONS KILLED`, Crashes_NYC$`CONTRIBUTING FACTOR VEHICLE 1`, mean)
Top_5_avg_deaths <- sort(Avg_deaths, decreasing = TRUE)[1:5]
barplot(Top_5_avg_deaths,
main = "Average Number of Deaths for Top 5 Contributing Factors",
ylab = "Average Number of Deaths",
col = rainbow(5),
las = 2, cex.names = 0.7
)
Top_5_avg_deaths
## Illnes
## 0.033844189
## Unsafe Speed
## 0.013897281
## Pedestrian/Bicyclist/Other Pedestrian Error/Confusion
## 0.010540331
## Drugs (illegal)
## 0.009580838
## Drugs (Illegal)
## 0.007159905
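The output above lists “Drugs (illegal)” and “Drugs (Illegal)” as separate factors, so labels that differ only in capitalization are being averaged separately. Averages for rarely reported factors can also be driven by a handful of crashes. One possible way to handle both is sketched below on made-up data. The names `toy`, `factor_1`, `killed`, and `factor_clean` and the threshold of 2 crashes are hypothetical, and lowercasing is only one of several reasonable normalizations.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  factor_1 = c("Drugs (illegal)", "Drugs (Illegal)", "Unsafe Speed",
               "Unsafe Speed", "Illnes", "Unsafe Speed"),
  killed = c(1, 0, 0, 0, 1, NA)
)

avg_by_factor <- toy %>%
  mutate(factor_clean = tolower(trimws(factor_1))) %>%  # merge case variants
  group_by(factor_clean) %>%
  summarize(
    n_crashes  = n(),                                   # how many crashes back each average
    avg_deaths = mean(killed, na.rm = TRUE)
  ) %>%
  filter(n_crashes >= 2) %>%                            # hypothetical minimum group size
  arrange(desc(avg_deaths))

print(avg_by_factor)
```

Reporting `n_crashes` next to each average makes it easier to judge whether a high value reflects a real pattern or a very small group.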
Kills_by_Vehicle <- Crashes_NYC %>%
  group_by(`VEHICLE TYPE CODE 1`) %>%
  summarize(total_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(total_deaths))
head(Kills_by_Vehicle, n = 5)
## # A tibble: 5 × 2
## `VEHICLE TYPE CODE 1` total_deaths
## <chr> <dbl>
## 1 PASSENGER VEHICLE 398
## 2 SPORT UTILITY / STATION WAGON 218
## 3 Motorcycle 174
## 4 MOTORCYCLE 91
## 5 UNKNOWN 61
vehicle_types <- unique(Crashes_NYC$`VEHICLE TYPE CODE 1`)
avg_injuries <- tapply(Crashes_NYC$`NUMBER OF PERSONS INJURED`, Crashes_NYC$`VEHICLE TYPE CODE 1`, mean)
bar_colors <- c("red", "purple", "magenta", "pink", "cyan")
Top_5_V <- names(sort(avg_injuries,decreasing = TRUE)[1:5])
barplot(avg_injuries[Top_5_V], main = "Average Number of Injuries for Top 5 Vehicle Types",
ylab = "Average Number of Injuries",
col = bar_colors, las = 2, cex.names = 0.7, ylim = c(0, max(avg_injuries[Top_5_V]) + 1))
Is there a correlation between the number of persons injured in car accidents and the number of persons killed in car accidents?
Scatter plot
plot(Crashes_NYC$`NUMBER OF PERSONS KILLED`, Crashes_NYC$`NUMBER OF PERSONS INJURED`,
xlab = "Number of Persons KILLED", ylab = "Number of Persons INJURED",
main = "Correlation between # of Persons Killed & # of Persons Injured")
# Fit injured (y) on killed (x) so the line matches the axes of the plot above
abline(lm(Crashes_NYC$`NUMBER OF PERSONS INJURED` ~ Crashes_NYC$`NUMBER OF PERSONS KILLED`), col = 'red')
Linear Regression
reg <- lm(Crashes_NYC$`NUMBER OF PERSONS INJURED` ~ Crashes_NYC$`NUMBER OF PERSONS KILLED`)
summary(reg)
##
## Call:
## lm(formula = Crashes_NYC$`NUMBER OF PERSONS INJURED` ~ Crashes_NYC$`NUMBER OF PERSONS KILLED`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.357 -0.301 -0.301 -0.301 42.699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3005707 0.0004887 615.08 <2e-16
## Crashes_NYC$`NUMBER OF PERSONS KILLED` 0.2640064 0.0122179 21.61 <2e-16
##
## (Intercept) ***
## Crashes_NYC$`NUMBER OF PERSONS KILLED` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6921 on 2008480 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.0002324, Adjusted R-squared: 0.0002319
## F-statistic: 466.9 on 1 and 2008480 DF, p-value: < 2.2e-16
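The p-value for Number of Persons Killed is less than 2e-16, so the slope (about 0.264) is statistically significant, which is not surprising with roughly two million observations. The R-squared value, however, is only 0.0002324, meaning that the number of persons killed explains about 0.02% of the variation in the number of persons injured. This implies a very weak positive correlation between the two variables.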
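The correlation coefficient itself can also be computed directly. In a regression with a single predictor, R-squared equals the square of the Pearson correlation, so the R-squared of 0.0002324 reported above corresponds to a correlation of about 0.015. The sketch below checks that relationship on made-up vectors. The names `injured` and `killed` and their values are hypothetical, and `use = "complete.obs"` is an assumption that mirrors how `lm()` drops rows with missing values by default.

```r
# Hypothetical toy vectors (values are made up), including missing entries
injured <- c(0, 1, 0, 2, 0, 3, NA, 1)
killed  <- c(0, 0, 0, 1, 0, 0, 0, NA)

# Pearson correlation on rows where both values are present
r <- cor(injured, killed, use = "complete.obs")

# Simple linear regression; lm() omits incomplete rows by default
fit <- lm(injured ~ killed)
r_squared <- summary(fit)$r.squared

# For one predictor, r^2 and R-squared should agree
print(c(r = r, r_squared = r_squared, r_times_r = r^2))
```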
For my first research question, the most deaths come from very common crash factors such as unsafe speed, drunk driving, and driving errors. However, the highest average number of deaths per crash belongs to illness. Unsafe speed stands out as a major factor in fatal crashes because it ranks high in both total deaths and average deaths. Alcohol involvement ranks fourth in total deaths but does not appear among the top five factors by average deaths.
For my second research question, it is interesting to see the differences between the vehicle types with the most total injuries and those with the highest average injuries. The vehicle types with the most injuries appear to be commonly driven ones such as passenger vehicles, recreational vehicles, taxis, and pickup trucks.
The only ethical concern that could arise from this data set is that it could be biased by underreporting or overreporting of certain kinds of accidents. Its accuracy could also be called into question because of the large number of missing values: about 29.5% of all cells in the table are missing.