We live in NYC and love this city which full of people from different counties. As a melting pot, NYC has 8.3 million population and it also has the longest commute time via car and public transit. Nobody want to talk about motor vehicle collision, but it does exist and make our daily long commute even longer when it happens. Today, I will dig into the NYC motor vehicle collision data and present some findings which might be useful for those regulation/policy makers. I will present some graphs which shows the trend of the collision and also point out the top 10 reasons why those accidents happen.

library("data.table")
library(tidyverse)
## Warning in as.POSIXlt.POSIXct(Sys.time()): unable to identify current timezone 'H':
## please set environment variable 'TZ'
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(ggplot2)
library(dplyr)
library(shiny)
library(rsconnect)
## 
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
## 
##     serverInfo
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(leaflet)

Data Source

The dataset I use is from the New York City open data.https://data.cityofnewyork.us/Public-Safety/Motor- Vehicle-Collisions-Crashes/h9gi-nx95. The Motor Vehicle Collisions crash table contains details of different crash events, including crush date, time, borough, latitude, longitude, number of person killed/injured, contributing factor and vehicle type, etc. The dataset includes more than 1 million observations from August 2012 to October 2020 and miss data is inevitable.

Importing Data

GitHub can’t upload a file more than 25M, so I use fread() to import the data from local disk.

motor<-fread("Motor_Vehicle_Collisions_-_Crashes.csv", showProgress = FALSE, stringsAsFactors = FALSE)
head(motor)
summary(motor)
##   CRASH DATE         CRASH TIME          BOROUGH             ZIP CODE     
##  Length:1736981     Length:1736981     Length:1736981     Min.   :10000   
##  Class :character   Class :character   Class :character   1st Qu.:10305   
##  Mode  :character   Mode  :character   Mode  :character   Median :11207   
##                                                           Mean   :10832   
##                                                           3rd Qu.:11237   
##                                                           Max.   :11697   
##                                                           NA's   :532719  
##     LATITUDE        LONGITUDE         LOCATION         ON STREET NAME    
##  Min.   : 0.00    Min.   :-201.36   Length:1736981     Length:1736981    
##  1st Qu.:40.67    1st Qu.: -73.98   Class :character   Class :character  
##  Median :40.72    Median : -73.93   Mode  :character   Mode  :character  
##  Mean   :40.69    Mean   : -73.87                                        
##  3rd Qu.:40.77    3rd Qu.: -73.87                                        
##  Max.   :43.34    Max.   :   0.00                                        
##  NA's   :206846   NA's   :206846                                         
##  CROSS STREET NAME  OFF STREET NAME    NUMBER OF PERSONS INJURED
##  Length:1736981     Length:1736981     Min.   : 0.0000          
##  Class :character   Class :character   1st Qu.: 0.0000          
##  Mode  :character   Mode  :character   Median : 0.0000          
##                                        Mean   : 0.2711          
##                                        3rd Qu.: 0.0000          
##                                        Max.   :43.0000          
##                                        NA's   :17               
##  NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED
##  Min.   :0.000000         Min.   : 0.00000             
##  1st Qu.:0.000000         1st Qu.: 0.00000             
##  Median :0.000000         Median : 0.00000             
##  Mean   :0.001242         Mean   : 0.05128             
##  3rd Qu.:0.000000         3rd Qu.: 0.00000             
##  Max.   :8.000000         Max.   :27.00000             
##  NA's   :31                                            
##  NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED
##  Min.   :0.000000             Min.   :0.00000          
##  1st Qu.:0.000000             1st Qu.:0.00000          
##  Median :0.000000             Median :0.00000          
##  Mean   :0.000655             Mean   :0.02269          
##  3rd Qu.:0.000000             3rd Qu.:0.00000          
##  Max.   :6.000000             Max.   :4.00000          
##                                                        
##  NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
##  Min.   :0.00e+00         Min.   : 0.000             Min.   :0.000000         
##  1st Qu.:0.00e+00         1st Qu.: 0.000             1st Qu.:0.000000         
##  Median :0.00e+00         Median : 0.000             Median :0.000000         
##  Mean   :9.38e-05         Mean   : 0.197             Mean   :0.000492         
##  3rd Qu.:0.00e+00         3rd Qu.: 0.000             3rd Qu.:0.000000         
##  Max.   :2.00e+00         Max.   :43.000             Max.   :5.000000         
##                                                                               
##  CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
##  Length:1736981                Length:1736981               
##  Class :character              Class :character             
##  Mode  :character              Mode  :character             
##                                                             
##                                                             
##                                                             
##                                                             
##  CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4
##  Length:1736981                Length:1736981               
##  Class :character              Class :character             
##  Mode  :character              Mode  :character             
##                                                             
##                                                             
##                                                             
##                                                             
##  CONTRIBUTING FACTOR VEHICLE 5  COLLISION_ID     VEHICLE TYPE CODE 1
##  Length:1736981                Min.   :     22   Length:1736981     
##  Class :character              1st Qu.:2836887   Class :character   
##  Mode  :character              Median :3504640   Mode  :character   
##                                Mean   :2890128                      
##                                3rd Qu.:3939119                      
##                                Max.   :4373799                      
##                                                                     
##  VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4
##  Length:1736981      Length:1736981      Length:1736981     
##  Class :character    Class :character    Class :character   
##  Mode  :character    Mode  :character    Mode  :character   
##                                                             
##                                                             
##                                                             
##                                                             
##  VEHICLE TYPE CODE 5
##  Length:1736981     
##  Class :character   
##  Mode  :character   
##                     
##                     
##                     
## 

From the summary, we notice that the variable LATITUDE and LONGITUDE have 20,5623, or 12% missing data, which might affect our map when we plot in the data.

Data Handling

First, we need to separate the Crush Date column into three columns, Year, Month, and Day, which might help our following analysis.

#Format the date and Separate it into three columns
motor <- separate(motor, 1, c("Month", "Day","Year"))

Drop all the rows which don’t have borough information.

motor2<-motor %>%
       filter(BOROUGH!="")

Data Analysis

Collision by Borough over Years

borough<-motor2 %>%
         select(Year, BOROUGH)%>%
         group_by(Year,BOROUGH)%>%
         filter(Year > '2012')%>%
         count
borough
ggplot(borough, aes(x = Year, y = freq,group=BOROUGH,colour=BOROUGH)) +geom_point()+geom_line()+
  labs(x = "Year", y = "Number of Motor vehicle collision", 
              title = "Collision by Borough over Years")

The graph indicates Brooklyn has the highest collision number and Staten Island has the lowest number. From year 2013 to 2020, there is a decrease collision number for all boroughs. Among this five boroughs, Manhattan’s collision number decrease sharply since year 2015.

Total Collision number

What is the collision rate in NYC from 2013 to 2020?

YearNumber<-motor %>%
          select(Year)%>%
          filter(Year > '2012' )%>%
          group_by(Year)%>%
          count()
        
YearNumber
ggplot(YearNumber, aes(x = Year, y = freq,group=1)) +geom_point()+geom_line()+
  labs(x = "Year", y = "Number of Motor vehicle collision", 
              title = "Total Motor vehicle collision change from 2013 to 2019")

Apparently, total collision number in NYC increase sharply from 2013 to 2016. This number don’t have a lot of change during 2016 to 2018, but drop suddenly in 2019 because of Covid19.

Collisions by Month

MonthNumber<-motor %>%
          filter(Year > '2012')%>%
          select(Month)%>%
          group_by(Month)%>%
          count()
        
MonthNumber
ggplot(MonthNumber, aes(x=Month, y=freq)) +
    geom_bar(stat="identity", position=position_dodge(), colour="red", width = 0.9) +
    ggtitle("Motor Vehicle Collisions by Month") +
    xlab("Month") + ylab("Number") +geom_text(aes(label = freq), vjust = 1.3, colour = "white")

May, June and October has the highest collision number. New York has a long winter and People start joining outside activities in May and June.

Collisions by Hour

#separate the crash time to Hour and Min
time <- separate(motor, 'CRASH TIME', c("Hour", "min"))
head(time)
TimeNumber<-time %>%
          select(Hour)%>%
          group_by(Hour)%>%
          arrange(Hour)%>%
          count()

# Change the data type of Hour to number and then sort the number 
TimeNumber$Hour<-as.integer(as.character(TimeNumber$Hour))
TimeNumber<-TimeNumber%>%
            arrange(Hour)
        
TimeNumber
ggplot(TimeNumber, aes(x = Hour, y = freq,group=1)) +geom_point()+geom_line()+
  labs(x = "Hour", y = "Number of Motor vehicle collision", 
              title = "Collision by Hour")

The peak time has the highest collision number, especially during 4PM to 6PM. 3AM has the lowest collision rate for a whole day.

Causes of Motor Vehicle Collisions

reason<-motor %>%
          select('CONTRIBUTING FACTOR VEHICLE 1')%>%
          group_by('CONTRIBUTING FACTOR VEHICLE 1')%>%
          filter( 'CONTRIBUTING FACTOR VEHICLE 1'!="NA")%>%
          count

reason_count <-count(reason, "CONTRIBUTING.FACTOR.VEHICLE.1")
## Using freq as weighting variable
reason_count<-reason_count[order(reason_count$freq, decreasing = T),]
reason_count

The top Contributing factor is Unspecified and it is not a real reason. I will drop Unspecified reason and Choose the other top 10 factors.

#Drop Unspecified reason
reason_count<-reason_count%>%
              filter(freq>=33844 & freq<624121)
ggplot(reason_count, aes(reorder(CONTRIBUTING.FACTOR.VEHICLE.1,freq), y=freq,fill=CONTRIBUTING.FACTOR.VEHICLE.1)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +coord_flip()+
    ggtitle("Causes of Motor Vehicle Collisions") +
    xlab("Factor") + ylab("Number")

There are many reasons for the accidents. Driver inattention/Distraction is the #1 reason for the accident during those years. Failure to Yield Right of way and Following too closely are the other two reasons for accidents.Drivers need to pay more attention behind the wheel.

Map

Using all the latitude and longitude data, I first create a map to mark all the location of the collisions. You can check the code and map by clicking this link.

https://github.com/DaisyCai2019/Data608/blob/master/Final%20Project/Motor_vehicle%20Collision%20Map.ipynb

knitr::include_graphics("Map.png")

From the map, we notice that Manhattan, Bronx and north side of Queens have more collisions than other areas. We also find out some interesting yellow spots along the sea.

Shiny APP

Shiny App shows the collision data by Borough and also the collision location by year(2015-2020).

https://daisycai.shinyapps.io/Data608_Final/

Conclusion

Motor Vehicle collision is a serious issue for years. From my analysis, we are so happy to see the number gradually go down since 2018. As the top 10 collision reason suggest, drivers need to be more careful behind the wheel and always yield the right of way. I hope all those safe measures will sharply reduce those serious collisions and make our roads safer. For people who want to go out for outdoor activities, my suggestion is always avoid the peak time.