This project investigates car crashes in New York. It looks at the most commonly cited causes, traffic fatalities and the types of vehicles involved, crashes by month and year, and the geographic spread of crashes, using latitude and longitude to map the crash locations captured in the data.
The data used in this project is attached in the final submission folder.
# Required packages
library(tidyverse)
library(RMariaDB)
library(readxl)
library("ggmap")

# Choose the crash CSV file interactively
data <- read_csv(file.choose())

# Keep only the 14 columns used in this analysis
# (a repeated `NUMBER OF PERSONS KILLED` entry has been removed from the list)
m <- data %>%
  select(`CRASH DATE`, `CRASH TIME`, LATITUDE, LONGITUDE,
         `NUMBER OF PERSONS INJURED`, `NUMBER OF PERSONS KILLED`,
         `NUMBER OF PEDESTRIANS INJURED`, `NUMBER OF PEDESTRIANS KILLED`,
         `NUMBER OF CYCLIST INJURED`, `NUMBER OF CYCLIST KILLED`,
         `NUMBER OF MOTORIST INJURED`, `CONTRIBUTING FACTOR VEHICLE 1`,
         `VEHICLE TYPE CODE 1`, `VEHICLE TYPE CODE 2`)
The data for this project comes from a Kaggle data set of car crashes in New York, primarily New York City, with details on each crash event. The original data set contains 29 columns, but this analysis uses only the 14 considered relevant. A summary of those columns is shown below:
## CRASH DATE         CRASH TIME         LATITUDE        LONGITUDE
## Length:2018963     Length:2018963     Min.   : 0.00   Min.   :-201.36
## Class :character   Class1:hms         1st Qu.:40.67   1st Qu.: -73.97
## Mode  :character   Class2:difftime    Median :40.72   Median : -73.93
##                    Mode  :numeric     Mean   :40.63   Mean   : -73.75
##                                       3rd Qu.:40.77   3rd Qu.: -73.87
##                                       Max.   :43.34   Max.   :   0.00
##                                       NA's   :229727  NA's   :229727
## NUMBER OF PERSONS INJURED   NUMBER OF PERSONS KILLED    NUMBER OF PEDESTRIANS INJURED
## Min.   : 0.0000             Min.   :0.000000            Min.   : 0.0000
## 1st Qu.: 0.0000             1st Qu.:0.000000            1st Qu.: 0.0000
## Median : 0.0000             Median :0.000000            Median : 0.0000
## Mean   : 0.3025             Mean   :0.001448            Mean   : 0.0552
## 3rd Qu.: 0.0000             3rd Qu.:0.000000            3rd Qu.: 0.0000
## Max.   :43.0000             Max.   :8.000000            Max.   :27.0000
## NA's   :18                  NA's   :31
## NUMBER OF PEDESTRIANS KILLED   NUMBER OF CYCLIST INJURED   NUMBER OF CYCLIST KILLED
## Min.   :0.000000               Min.   :0.00000             Min.   :0.0000000
## 1st Qu.:0.000000               1st Qu.:0.00000             1st Qu.:0.0000000
## Median :0.000000               Median :0.00000             Median :0.0000000
## Mean   :0.000725               Mean   :0.02614             Mean   :0.0001119
## 3rd Qu.:0.000000               3rd Qu.:0.00000             3rd Qu.:0.0000000
## Max.   :6.000000               Max.   :4.00000             Max.   :2.0000000
## NUMBER OF MOTORIST INJURED   CONTRIBUTING FACTOR VEHICLE 1   VEHICLE TYPE CODE 1   VEHICLE TYPE CODE 2
## Min.   : 0.0000              Length:2018963                  Length:2018963        Length:2018963
## 1st Qu.: 0.0000              Class :character                Class :character      Class :character
## Median : 0.0000              Mode  :character                Mode  :character      Mode  :character
## Mean   : 0.2181
## 3rd Qu.: 0.0000
## Max.   :43.0000
First, conduct some basic EDA. What are the top 10 reasons cited for crashes? Note: rows with a value of “UNSPECIFIED” must be filtered out. The fifth-ranked reason, OTHER VEHICULAR, is also too vague to be useful if we want this data to explain the top causes of crashes.
## # A tibble: 10 × 2
## `toupper(\`CONTRIBUTING FACTOR VEHICLE 1\`)` count_obs
## <chr> <int>
## 1 DRIVER INATTENTION/DISTRACTION 401420
## 2 FAILURE TO YIELD RIGHT-OF-WAY 119209
## 3 FOLLOWING TOO CLOSELY 107510
## 4 BACKING UNSAFELY 75053
## 5 OTHER VEHICULAR 62709
## 6 PASSING OR LANE USAGE IMPROPER 55474
## 7 TURNING IMPROPERLY 49928
## 8 PASSING TOO CLOSELY 49873
## 9 FATIGUED/DROWSY 47346
## 10 UNSAFE LANE CHANGING 39722
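A table like the one above could be produced along the following lines. This is only a sketch, not necessarily the code behind the report: the helper name `top_factors`, its `exclude` argument, and the toy rows are assumptions made for illustration, and dropping missing values is a choice the original may not have made.

```r
library(dplyr)

# Hypothetical helper: count crashes per cited factor, ignoring letter case,
# after dropping missing values and any labels listed in `exclude`.
top_factors <- function(df, n = 10, exclude = c("UNSPECIFIED")) {
  df %>%
    mutate(cause = toupper(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
    filter(!is.na(cause), !cause %in% exclude) %>%
    count(cause, name = "count_obs", sort = TRUE) %>%
    slice_head(n = n)
}

# Made-up rows, only to show the call; the real input would be `m`.
toy_factors <- tibble(
  `CONTRIBUTING FACTOR VEHICLE 1` = c("Unspecified", "Backing Unsafely",
                                      "BACKING UNSAFELY", "Other Vehicular", NA)
)
top_factors(toy_factors, exclude = c("UNSPECIFIED", "OTHER VEHICULAR"))
```

Passing "OTHER VEHICULAR" in `exclude` removes the uninformative fifth category as well, so the next most cited cause moves into the top 10.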
Next, let’s investigate the most accident-prone types of vehicles. To do this, we use the vehicle type code for both vehicle 1 and vehicle 2 (where a proper value is recorded), count the number of observations per vehicle type, and sort the counts in descending order. (Note: many observations in this data set are missing the vehicles involved in the accident, which is why even the largest count is relatively small. It may also be helpful to filter out values such as UNKNOWN or OTHER.)
## # A tibble: 20 × 2
## `toupper(value)` count_obs
## <chr> <int>
## 1 SEDAN 782
## 2 STATION WAGON/SPORT UTILITY VEHICLE 777
## 3 PASSENGER VEHICLE 502
## 4 MOTORCYCLE 363
## 5 SPORT UTILITY / STATION WAGON 289
## 6 BIKE 137
## 7 BUS 117
## 8 PICK-UP TRUCK 83
## 9 TAXI 83
## 10 UNKNOWN 74
## 11 BOX TRUCK 71
## 12 VAN 66
## 13 TRACTOR TRUCK DIESEL 59
## 14 LARGE COM VEH(6 OR MORE TIRES) 58
## 15 BICYCLE 52
## 16 E-BIKE 50
## 17 DUMP 38
## 18 OTHER 34
## 19 4 DR SEDAN 31
## 20 E-SCOOTER 22
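One way to count vehicle types across both columns is to stack them into a single `value` column and count on that. The sketch below assumes `tidyr::pivot_longer` for the stacking. The helper name `vehicle_counts` and the toy rows are illustrative, and it applies no filter beyond dropping missing values, so it is not expected to reproduce the counts above.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: stack both vehicle type columns, then count each
# upper-cased type in descending order.
vehicle_counts <- function(df, n = 20) {
  df %>%
    select(`VEHICLE TYPE CODE 1`, `VEHICLE TYPE CODE 2`) %>%
    pivot_longer(everything(), values_drop_na = TRUE) %>%
    count(vehicle = toupper(value), name = "count_obs", sort = TRUE) %>%
    slice_head(n = n)
}

# Made-up rows, only to show the call; the real input would be `m`.
toy_vehicles <- tibble(
  `VEHICLE TYPE CODE 1` = c("Sedan", "SEDAN", "Taxi"),
  `VEHICLE TYPE CODE 2` = c("sedan", NA, "Bike")
)
vehicle_counts(toy_vehicles)
```

Upper-casing before counting merges spellings that differ only by case, but labels such as BIKE and BICYCLE would still be counted separately.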
Now for some graphing. The data covers 2012 through 2023. New York gets snow in winter, so it is worth checking whether crashes vary by season. To do that, we look at the average number of crashes per month across the data set.
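A monthly average can be computed by counting crashes per year and month, then averaging those counts for each calendar month. The sketch below assumes `CRASH DATE` is stored as MM/DD/YYYY text (the summary shows it as a character column) and uses `lubridate` to parse it. The helper name `monthly_average` and the toy dates are illustrative.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical helper: crashes per (year, month), averaged by calendar month.
monthly_average <- function(df) {
  df %>%
    mutate(date  = mdy(`CRASH DATE`),          # assumes MM/DD/YYYY strings
           year  = year(date),
           month = month(date, label = TRUE)) %>%
    count(year, month, name = "crashes") %>%
    group_by(month) %>%
    summarise(avg_crashes = mean(crashes), .groups = "drop")
}

# Made-up dates, only to show the call; the real input would be `m`.
toy_dates <- tibble(`CRASH DATE` = c("01/05/2020", "01/20/2020",
                                     "01/07/2021", "02/01/2021"))
monthly <- monthly_average(toy_dates)

# Bar chart of the averages, one bar per calendar month.
p_month <- ggplot(monthly, aes(x = month, y = avg_crashes)) +
  geom_col() +
  labs(x = "Month", y = "Average crashes per month")
```

Months from partial years, such as the incomplete 2023 data, still enter the average here, so filtering those years out first may be preferable.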
What about the total crash observations by year? (Note: the 2023 data lags by about 6 months, so the 2023 total is incomplete and should not be compared with the other years.)
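Yearly totals can be sketched in the same way, with the incomplete year dropped before counting. As before, the MM/DD/YYYY date format is an assumption, and the helper name `yearly_totals`, its `drop_years` argument, and the toy dates are illustrative.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical helper: total crashes per year, excluding incomplete years.
yearly_totals <- function(df, drop_years = 2023) {
  df %>%
    mutate(year = year(mdy(`CRASH DATE`))) %>%   # assumes MM/DD/YYYY strings
    filter(!year %in% drop_years) %>%
    count(year, name = "crashes")
}

# Made-up dates, only to show the call; the real input would be `m`.
toy_years <- tibble(`CRASH DATE` = c("03/01/2021", "07/15/2021",
                                     "05/09/2022", "01/02/2023"))
yearly <- yearly_totals(toy_years)

# Bar chart of the yearly totals.
p_year <- ggplot(yearly, aes(x = factor(year), y = crashes)) +
  geom_col() +
  labs(x = "Year", y = "Crash observations")
```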
It is interesting that the number of crashes fell in 2020 and stayed down. Covid-19 was the likely cause of the 2020 drop-off, and the shift to working from home is a likely reason the number of observations did not rebound in 2021 and 2022.
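And finally, let’s look at where in New York these observations are taking place. The data set gives latitude and longitude coordinates, which can be used to plot the crashes over a map of New York. (Note: the coordinates need to be filtered first, because the data contains missing values and points that fall far outside New York, such as a latitude of 0 or a longitude of -201.36.)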
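The coordinate filter could look like the sketch below. The bounding box is an approximate, slightly padded box around New York City chosen for illustration, not a value taken from the report, and the helper name `filter_nyc` is hypothetical. The plot uses plain `ggplot2` points rather than a `ggmap` basemap, since fetching map tiles typically requires an API key.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep rows with coordinates inside an approximate
# New York City bounding box (assumed limits, padded slightly).
filter_nyc <- function(df,
                       lat_range = c(40.4, 41.0),
                       lon_range = c(-74.3, -73.6)) {
  df %>%
    filter(!is.na(LATITUDE), !is.na(LONGITUDE),
           between(LATITUDE, lat_range[1], lat_range[2]),
           between(LONGITUDE, lon_range[1], lon_range[2]))
}

# Made-up coordinates, only to show the call; the real input would be `m`.
toy_coords <- tibble(
  LATITUDE  = c(40.72, 0, NA, 43.34, 40.70),
  LONGITUDE = c(-73.93, 0, NA, -73.90, -201.36)
)
nyc <- filter_nyc(toy_coords)

# Scatter of crash locations; a low alpha lets dense areas stand out.
p_map <- ggplot(nyc, aes(x = LONGITUDE, y = LATITUDE)) +
  geom_point(alpha = 0.1, size = 0.2) +
  coord_quickmap()
```

With around two million rows, a low `alpha` or a 2D binning layer keeps the densest areas readable.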
Wow! Comparing this map to a population density map, we can see that the area with the highest concentration of car crashes is also the area with the highest concentration of people. With more people trying to move around in the same area, it makes sense that there are more accidents.