Investigating Car Crashes In New York

Background & Prep

This project investigates car crashes in New York. It examines the most frequently cited causes, traffic fatalities and the types of vehicles involved, crashes by month and year, and the geographic spread of the data, using latitude and longitude to map the crash locations captured in the data set.

The data used in this project is included in the final submission folder.

#Required Packages

library(tidyverse)
library(RMariaDB)
library(readxl)
library(ggmap)

data <- read_csv(file.choose())

# Keep only the 14 columns used in this analysis
# (a duplicated `NUMBER OF PERSONS KILLED` entry has been removed; select() keeps each column once)
m <- data %>% 
  select(`CRASH DATE`, `CRASH TIME`, LATITUDE, LONGITUDE, `NUMBER OF PERSONS INJURED`, `NUMBER OF PERSONS KILLED`,
         `NUMBER OF PEDESTRIANS INJURED`, `NUMBER OF PEDESTRIANS KILLED`, `NUMBER OF CYCLIST INJURED`,
         `NUMBER OF CYCLIST KILLED`, `NUMBER OF MOTORIST INJURED`, `CONTRIBUTING FACTOR VEHICLE 1`, 
         `VEHICLE TYPE CODE 1`, `VEHICLE TYPE CODE 2`)

Data and Summary

The data for this project comes from a Kaggle data set on car crashes in New York, primarily New York City, with details on each crash event. The original data set contains 29 columns, but this analysis considers only 14 of them relevant. A summary of those columns is shown below:

##   CRASH DATE         CRASH TIME          LATITUDE        LONGITUDE       NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED CONTRIBUTING FACTOR VEHICLE 1 VEHICLE TYPE CODE 1
##  Length:2018963     Length:2018963    Min.   : 0.00    Min.   :-201.36   Min.   : 0.0000           Min.   :0.000000         Min.   : 0.0000               Min.   :0.000000             Min.   :0.00000           Min.   :0.0000000        Min.   : 0.0000            Length:2018963                Length:2018963     
##  Class :character   Class1:hms        1st Qu.:40.67    1st Qu.: -73.97   1st Qu.: 0.0000           1st Qu.:0.000000         1st Qu.: 0.0000               1st Qu.:0.000000             1st Qu.:0.00000           1st Qu.:0.0000000        1st Qu.: 0.0000            Class :character              Class :character   
##  Mode  :character   Class2:difftime   Median :40.72    Median : -73.93   Median : 0.0000           Median :0.000000         Median : 0.0000               Median :0.000000             Median :0.00000           Median :0.0000000        Median : 0.0000            Mode  :character              Mode  :character   
##                     Mode  :numeric    Mean   :40.63    Mean   : -73.75   Mean   : 0.3025           Mean   :0.001448         Mean   : 0.0552               Mean   :0.000725             Mean   :0.02614           Mean   :0.0001119        Mean   : 0.2181                                                             
##                                       3rd Qu.:40.77    3rd Qu.: -73.87   3rd Qu.: 0.0000           3rd Qu.:0.000000         3rd Qu.: 0.0000               3rd Qu.:0.000000             3rd Qu.:0.00000           3rd Qu.:0.0000000        3rd Qu.: 0.0000                                                             
##                                       Max.   :43.34    Max.   :   0.00   Max.   :43.0000           Max.   :8.000000         Max.   :27.0000               Max.   :6.000000             Max.   :4.00000           Max.   :2.0000000        Max.   :43.0000                                                             
##                                       NA's   :229727   NA's   :229727    NA's   :18                NA's   :31                                                                                                                                                                                                         
##  VEHICLE TYPE CODE 2
##  Length:2018963     
##  Class :character   
##  Mode  :character   
##                     
##                     
##                     
## 

Part 1

First, conduct some basic EDA. What are the top 10 reasons cited for crashes? Note: rows with a value of “UNSPECIFIED” must be filtered out. Cited reason number 5, OTHER VEHICULAR, is also too generic to be useful if we want this data to explain the top causes of crashes.

## # A tibble: 10 × 2
##    `toupper(\`CONTRIBUTING FACTOR VEHICLE 1\`)` count_obs
##    <chr>                                            <int>
##  1 DRIVER INATTENTION/DISTRACTION                  401420
##  2 FAILURE TO YIELD RIGHT-OF-WAY                   119209
##  3 FOLLOWING TOO CLOSELY                           107510
##  4 BACKING UNSAFELY                                 75053
##  5 OTHER VEHICULAR                                  62709
##  6 PASSING OR LANE USAGE IMPROPER                   55474
##  7 TURNING IMPROPERLY                               49928
##  8 PASSING TOO CLOSELY                              49873
##  9 FATIGUED/DROWSY                                  47346
## 10 UNSAFE LANE CHANGING                             39722
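
A count like the one above could be produced along the following lines. This is a minimal sketch, not necessarily the code behind the table: the helper name `top_causes`, its `exclude` and `k` parameters, and the output column name `cause` are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: count crashes per cited factor (case-insensitive),
# drop missing values and any labels listed in `exclude`, keep the k largest.
top_causes <- function(crashes, exclude = "UNSPECIFIED", k = 10) {
  crashes %>%
    mutate(cause = toupper(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
    filter(!is.na(cause), !cause %in% exclude) %>%
    count(cause, name = "count_obs", sort = TRUE) %>%
    slice_head(n = k)
}
```

Calling `top_causes(m)` would give the ranking with “UNSPECIFIED” removed; passing `exclude = c("UNSPECIFIED", "OTHER VEHICULAR")` would also drop the generic fifth entry discussed above.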

Part 2

Now let’s investigate the most accident-prone types of vehicles. To do this, we use the vehicle type code for both vehicle 1 and vehicle 2 (where a valid value is recorded). We then count the number of observations per vehicle type and sort in descending order. (Note: Many observations in this data set are missing the vehicle type for the vehicles involved, which is why even the largest count is relatively small. It may also help to filter out values such as UNKNOWN and OTHER.)

## # A tibble: 20 × 2
##    `toupper(value)`                    count_obs
##    <chr>                                   <int>
##  1 SEDAN                                     782
##  2 STATION WAGON/SPORT UTILITY VEHICLE       777
##  3 PASSENGER VEHICLE                         502
##  4 MOTORCYCLE                                363
##  5 SPORT UTILITY / STATION WAGON             289
##  6 BIKE                                      137
##  7 BUS                                       117
##  8 PICK-UP TRUCK                              83
##  9 TAXI                                       83
## 10 UNKNOWN                                    74
## 11 BOX TRUCK                                  71
## 12 VAN                                        66
## 13 TRACTOR TRUCK DIESEL                       59
## 14 LARGE COM VEH(6 OR MORE TIRES)             58
## 15 BICYCLE                                    52
## 16 E-BIKE                                     50
## 17 DUMP                                       38
## 18 OTHER                                      34
## 19 4 DR SEDAN                                 31
## 20 E-SCOOTER                                  22
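
One way to combine the two vehicle type columns is to stack them into a single column and count. The sketch below assumes exactly that approach. The helper name `vehicle_counts` and its `k` parameter are hypothetical, and any additional filtering behind the counts shown above is not reproduced here.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: stack both vehicle type columns into one `value`
# column, drop missing entries, and count each type case-insensitively.
vehicle_counts <- function(crashes, k = 20) {
  crashes %>%
    select(`VEHICLE TYPE CODE 1`, `VEHICLE TYPE CODE 2`) %>%
    pivot_longer(everything(), names_to = "slot", values_to = "value") %>%
    filter(!is.na(value)) %>%
    count(vehicle = toupper(value), name = "count_obs", sort = TRUE) %>%
    slice_head(n = k)
}
```

Note that the table above lists some types under more than one label (for example BIKE and BICYCLE, or SEDAN and 4 DR SEDAN), so a recoding step before counting would change the ranking.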

Part 3

Now for some graphing. The data covers 2012-2023. New York gets snow in winter, so it is worth looking at the average number of crashes per month across the data set.
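
A monthly average can be computed by counting crashes per year and month, then averaging each calendar month across years. The sketch below assumes `CRASH DATE` is stored as text in month/day/year order (the summary shows it is a character column, but the exact format is an assumption). The helper names `monthly_average` and `plot_monthly_average` are hypothetical.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical helper: crashes per (year, month), averaged over years.
# Assumes `CRASH DATE` is text in month/day/year order.
monthly_average <- function(crashes) {
  crashes %>%
    mutate(date = mdy(`CRASH DATE`)) %>%
    filter(!is.na(date)) %>%
    count(year = year(date), month = month(date), name = "crashes") %>%
    group_by(month) %>%
    summarise(avg_crashes = mean(crashes), .groups = "drop")
}

# Hypothetical helper: bar chart of the monthly averages.
plot_monthly_average <- function(monthly) {
  ggplot(monthly, aes(x = month, y = avg_crashes)) +
    geom_col() +
    scale_x_continuous(breaks = 1:12, labels = month.abb) +
    labs(x = "Month", y = "Average crashes per month")
}
```

With the data frame from the prep step, `plot_monthly_average(monthly_average(m))` would draw the chart. Years with partial coverage pull down the averages for the months they do include, so it may be worth restricting the input to complete years first.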

What about total crash observations by year? (Note: the 2023 data lags by about 6 months, so the 2023 total is incomplete and should not be compared with the other years.)
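
Yearly totals follow the same pattern. This sketch again assumes `CRASH DATE` is text in month/day/year order. The helper names and the `last_full_year` parameter are hypothetical, with 2022 as the default cut-off because the 2023 data is incomplete.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical helper: total crashes per year, up to the last complete year.
# Assumes `CRASH DATE` is text in month/day/year order.
yearly_totals <- function(crashes, last_full_year = 2022) {
  crashes %>%
    mutate(year = year(mdy(`CRASH DATE`))) %>%
    filter(!is.na(year), year <= last_full_year) %>%
    count(year, name = "crashes")
}

# Hypothetical helper: bar chart of the yearly totals.
plot_yearly_totals <- function(yearly) {
  ggplot(yearly, aes(x = factor(year), y = crashes)) +
    geom_col() +
    labs(x = "Year", y = "Crash observations")
}
```

`plot_yearly_totals(yearly_totals(m))` would then show one bar per year through 2022.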

Interestingly, the number of crashes fell in 2020 and stayed down. Covid-19 was the likely cause of the 2020 drop, and the shift to working from home is a likely reason the number of observations did not rebound in 2021 and 2022.

Part 4

And finally, let’s look at where in New York these observations are taking place. The data set gives latitude and longitude coordinates, which can be used to plot this data over a map of New York. (Note: the coordinates need to be filtered first. The summary above shows values such as a latitude of 0 and a longitude of -201.36, which are not valid New York locations.)
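
The filtering step can be sketched as a simple bounding-box check. The box limits below are rough assumptions for the New York City area, not values taken from the original analysis, and the helper names are hypothetical. The sketch plots points with plain ggplot2 rather than a ggmap basemap, since basemap tile sources typically require a registered API key.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep only rows whose coordinates fall inside a
# rough bounding box around New York City (limits are assumptions).
nyc_points <- function(crashes,
                       lat_range = c(40.4, 41.0),
                       lon_range = c(-74.3, -73.6)) {
  crashes %>%
    filter(!is.na(LATITUDE), !is.na(LONGITUDE),
           between(LATITUDE, lat_range[1], lat_range[2]),
           between(LONGITUDE, lon_range[1], lon_range[2]))
}

# Hypothetical helper: scatter the crash locations. Low alpha lets dense
# areas show up darker than sparse ones.
plot_crash_points <- function(points) {
  ggplot(points, aes(x = LONGITUDE, y = LATITUDE)) +
    geom_point(alpha = 0.05, size = 0.1) +
    coord_quickmap() +
    labs(x = "Longitude", y = "Latitude")
}
```

`plot_crash_points(nyc_points(m))` would draw the filtered locations; the same filtered data frame could be layered over a ggmap basemap instead.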

Wow! Comparing this map with a population density map, we can see that the area with the highest concentration of car crashes is also the area with the highest concentration of people. With more people trying to move around in the same area, it makes sense that there are more accidents.