We live in NYC and love this city which full of people from different counties. As a melting pot, NYC has 8.3 million population and it also has the longest commute time via car and public transit. Nobody want to talk about motor vehicle collision, but it does exist and make our daily long commute even longer when it happens. Today, I will dig into the NYC motor vehicle collision data and present some findings which might be useful for those regulation/policy makers. I will present some graphs which shows the trend of the collision and also point out the top 10 reasons why those accidents happen.
library("data.table")
library(tidyverse)
## Warning in as.POSIXlt.POSIXct(Sys.time()): unable to identify current timezone 'H':
## please set environment variable 'TZ'
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(ggplot2)
library(dplyr)
library(shiny)
library(rsconnect)
##
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
##
## serverInfo
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(leaflet)
The dataset I use is from the New York City open data.https://data.cityofnewyork.us/Public-Safety/Motor- Vehicle-Collisions-Crashes/h9gi-nx95. The Motor Vehicle Collisions crash table contains details of different crash events, including crush date, time, borough, latitude, longitude, number of person killed/injured, contributing factor and vehicle type, etc. The dataset includes more than 1 million observations from August 2012 to October 2020 and miss data is inevitable.
GitHub can’t upload a file more than 25M, so I use fread() to import the data from local disk.
motor<-fread("Motor_Vehicle_Collisions_-_Crashes.csv", showProgress = FALSE, stringsAsFactors = FALSE)
head(motor)
summary(motor)
## CRASH DATE CRASH TIME BOROUGH ZIP CODE
## Length:1736981 Length:1736981 Length:1736981 Min. :10000
## Class :character Class :character Class :character 1st Qu.:10305
## Mode :character Mode :character Mode :character Median :11207
## Mean :10832
## 3rd Qu.:11237
## Max. :11697
## NA's :532719
## LATITUDE LONGITUDE LOCATION ON STREET NAME
## Min. : 0.00 Min. :-201.36 Length:1736981 Length:1736981
## 1st Qu.:40.67 1st Qu.: -73.98 Class :character Class :character
## Median :40.72 Median : -73.93 Mode :character Mode :character
## Mean :40.69 Mean : -73.87
## 3rd Qu.:40.77 3rd Qu.: -73.87
## Max. :43.34 Max. : 0.00
## NA's :206846 NA's :206846
## CROSS STREET NAME OFF STREET NAME NUMBER OF PERSONS INJURED
## Length:1736981 Length:1736981 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0000
## Mean : 0.2711
## 3rd Qu.: 0.0000
## Max. :43.0000
## NA's :17
## NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED
## Min. :0.000000 Min. : 0.00000
## 1st Qu.:0.000000 1st Qu.: 0.00000
## Median :0.000000 Median : 0.00000
## Mean :0.001242 Mean : 0.05128
## 3rd Qu.:0.000000 3rd Qu.: 0.00000
## Max. :8.000000 Max. :27.00000
## NA's :31
## NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.000655 Mean :0.02269
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :6.000000 Max. :4.00000
##
## NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
## Min. :0.00e+00 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.00e+00 1st Qu.: 0.000 1st Qu.:0.000000
## Median :0.00e+00 Median : 0.000 Median :0.000000
## Mean :9.38e-05 Mean : 0.197 Mean :0.000492
## 3rd Qu.:0.00e+00 3rd Qu.: 0.000 3rd Qu.:0.000000
## Max. :2.00e+00 Max. :43.000 Max. :5.000000
##
## CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
## Length:1736981 Length:1736981
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4
## Length:1736981 Length:1736981
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1
## Length:1736981 Min. : 22 Length:1736981
## Class :character 1st Qu.:2836887 Class :character
## Mode :character Median :3504640 Mode :character
## Mean :2890128
## 3rd Qu.:3939119
## Max. :4373799
##
## VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4
## Length:1736981 Length:1736981 Length:1736981
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## VEHICLE TYPE CODE 5
## Length:1736981
## Class :character
## Mode :character
##
##
##
##
From the summary, we notice that the variable LATITUDE and LONGITUDE have 20,5623, or 12% missing data, which might affect our map when we plot in the data.
First, we need to separate the Crush Date column into three columns, Year, Month, and Day, which might help our following analysis.
#Format the date and Separate it into three columns
motor <- separate(motor, 1, c("Month", "Day","Year"))
Drop all the rows which don’t have borough information.
motor2<-motor %>%
filter(BOROUGH!="")
borough<-motor2 %>%
select(Year, BOROUGH)%>%
group_by(Year,BOROUGH)%>%
filter(Year > '2012')%>%
count
borough
ggplot(borough, aes(x = Year, y = freq,group=BOROUGH,colour=BOROUGH)) +geom_point()+geom_line()+
labs(x = "Year", y = "Number of Motor vehicle collision",
title = "Collision by Borough over Years")
The graph indicates Brooklyn has the highest collision number and Staten Island has the lowest number. From year 2013 to 2020, there is a decrease collision number for all boroughs. Among this five boroughs, Manhattan’s collision number decrease sharply since year 2015.
What is the collision rate in NYC from 2013 to 2020?
YearNumber<-motor %>%
select(Year)%>%
filter(Year > '2012' )%>%
group_by(Year)%>%
count()
YearNumber
ggplot(YearNumber, aes(x = Year, y = freq,group=1)) +geom_point()+geom_line()+
labs(x = "Year", y = "Number of Motor vehicle collision",
title = "Total Motor vehicle collision change from 2013 to 2019")
Apparently, total collision number in NYC increase sharply from 2013 to 2016. This number don’t have a lot of change during 2016 to 2018, but drop suddenly in 2019 because of Covid19.
MonthNumber<-motor %>%
filter(Year > '2012')%>%
select(Month)%>%
group_by(Month)%>%
count()
MonthNumber
ggplot(MonthNumber, aes(x=Month, y=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="red", width = 0.9) +
ggtitle("Motor Vehicle Collisions by Month") +
xlab("Month") + ylab("Number") +geom_text(aes(label = freq), vjust = 1.3, colour = "white")
May, June and October has the highest collision number. New York has a long winter and People start joining outside activities in May and June.
#separate the crash time to Hour and Min
time <- separate(motor, 'CRASH TIME', c("Hour", "min"))
head(time)
TimeNumber<-time %>%
select(Hour)%>%
group_by(Hour)%>%
arrange(Hour)%>%
count()
# Change the data type of Hour to number and then sort the number
TimeNumber$Hour<-as.integer(as.character(TimeNumber$Hour))
TimeNumber<-TimeNumber%>%
arrange(Hour)
TimeNumber
ggplot(TimeNumber, aes(x = Hour, y = freq,group=1)) +geom_point()+geom_line()+
labs(x = "Hour", y = "Number of Motor vehicle collision",
title = "Collision by Hour")
The peak time has the highest collision number, especially during 4PM to 6PM. 3AM has the lowest collision rate for a whole day.
reason<-motor %>%
select('CONTRIBUTING FACTOR VEHICLE 1')%>%
group_by('CONTRIBUTING FACTOR VEHICLE 1')%>%
filter( 'CONTRIBUTING FACTOR VEHICLE 1'!="NA")%>%
count
reason_count <-count(reason, "CONTRIBUTING.FACTOR.VEHICLE.1")
## Using freq as weighting variable
reason_count<-reason_count[order(reason_count$freq, decreasing = T),]
reason_count
The top Contributing factor is Unspecified and it is not a real reason. I will drop Unspecified reason and Choose the other top 10 factors.
#Drop Unspecified reason
reason_count<-reason_count%>%
filter(freq>=33844 & freq<624121)
ggplot(reason_count, aes(reorder(CONTRIBUTING.FACTOR.VEHICLE.1,freq), y=freq,fill=CONTRIBUTING.FACTOR.VEHICLE.1)) +
geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +coord_flip()+
ggtitle("Causes of Motor Vehicle Collisions") +
xlab("Factor") + ylab("Number")
There are many reasons for the accidents. Driver inattention/Distraction is the #1 reason for the accident during those years. Failure to Yield Right of way and Following too closely are the other two reasons for accidents.Drivers need to pay more attention behind the wheel.
Using all the latitude and longitude data, I first create a map to mark all the location of the collisions. You can check the code and map by clicking this link.
knitr::include_graphics("Map.png")
From the map, we notice that Manhattan, Bronx and north side of Queens have more collisions than other areas. We also find out some interesting yellow spots along the sea.
Shiny App shows the collision data by Borough and also the collision location by year(2015-2020).
Motor Vehicle collision is a serious issue for years. From my analysis, we are so happy to see the number gradually go down since 2018. As the top 10 collision reason suggest, drivers need to be more careful behind the wheel and always yield the right of way. I hope all those safe measures will sharply reduce those serious collisions and make our roads safer. For people who want to go out for outdoor activities, my suggestion is always avoid the peak time.