Synopsis

Historical data on vehicular collisions in NYC indicates that an average of 586 collisions happen in New York City every day. These incidents have increased the demand for emergency services in the city.

To meet this demand, we predict the number of collisions that will occur in the city on a specific day using machine learning techniques such as Random Forest and statistical methods such as ARIMA modelling.

Further optimization is performed with Excel's Solver function to obtain the number of ambulances the city would require to provide satisfactory service.

Loading the required libraries

library(ggplot2)     # plotting
library(forecast)    # ARIMA modelling and forecasting
library(tseries)     # Augmented Dickey-Fuller test
library(dplyr)       # data manipulation
library(ggthemes)    # additional ggplot2 themes
library(lubridate)   # date handling
library(mice)        # multiple imputation of missing values
library(DT)          # interactive data tables
library(VIM)         # visualising missing-data patterns
library(xts)         # time-series objects
library(caret)       # model training and evaluation

Reading the data

The data is collected from the NYC Open Data library. It has 1.1 million rows and 29 variables, covering the type of vehicle involved in each accident and location variables such as LATITUDE, LONGITUDE and BOROUGH. The contributing factor behind each accident and the number of people killed and injured are also recorded by date.

dt <- read.csv("NYPD_Motor_Vehicle_Collisions.csv")
dtcopy <- dt                                       # work on a copy, keep the raw import intact
dtcopy$DATE <- as.Date(dtcopy$DATE, "%m/%d/%Y")    # parse DATE strings into Date class
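
As a quick sanity check, the dimensions and date range of the snapshot can be verified before cleaning (a minimal sketch; exact figures depend on when the data was downloaded):

dim(dtcopy)                        # roughly 1.1 million rows and 29 variables
range(dtcopy$DATE, na.rm = TRUE)   # the date span covered by this snapshot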

Cleaning Data

We are mainly interested in the instances where people were killed or injured. During data exploration, I found that there were many entries where nobody was killed or injured but the collision was still noted in the database. To remove these entries we perform data cleaning, obtaining a refined data set that will be useful for exploratory data analysis.

dt1 <- select(dtcopy, 11:18)   # the eight injured/killed count columns
dt1$sum <- rowSums(dt1)        # total casualties per collision
r <- dt1$sum != 0              # TRUE for collisions with at least one casualty
refined <- dtcopy[r, ]
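
To quantify how much the filter removes (a sketch using the objects above; the counts depend on the snapshot):

sum(!r)    # collisions recorded with no injuries or deaths
mean(r)    # share of records kept in the refined set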

The graphical representation of the data suggests that the series is stationary, with no seasonal or cyclic trend.

date_accidents <- data.frame(refined$DATE) %>%
  group_by(refined.DATE) %>%
  summarise(Crash = n())
colnames(date_accidents) <- c("DATE", "CRASH")
ggplot(date_accidents[1800:1919, ], aes(DATE, CRASH)) +
  geom_line() +
  scale_x_date('YEAR 2017') +
  ggtitle("Trend for Vehicular Collisions during Jun'17 to Oct'17") +
  theme_tufte()

Forming the Dataframe for Accidents

accidents1 <- data.frame(dtcopy$DATE)
accidents1$Day <- day(accidents1$dtcopy.DATE)     # day of month
accidents1$Month <- month(accidents1$dtcopy.DATE)
accidents1$Year <- year(accidents1$dtcopy.DATE)
accidents1$Wday <- wday(accidents1$dtcopy.DATE)   # day of week

Data Cleaning

### Forming Location datasets
just_location <- refined[, c(4,5,6)]
md.pattern(just_location)
##        LATITUDE LONGITUDE ZIP.CODE       
## 146518        1         1        1      0
##  27226        1         1        0      1
##   6482        0         0        1      2
##  31831        0         0        0      3
##           38313     38313    59057 135683
datatable(just_location[1:10000, ])
# Plot for missing values
aggr(just_location, col=c('orange', 'brown'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(just_location), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##   Variable     Count
##   ZIP.CODE 0.2784959
##   LATITUDE 0.1806731
##  LONGITUDE 0.1806731
##just_location.m <- mice(just_location, m = 5, maxit = 1)
##just_location.c <- complete(just_location.m, 2)
##-------------------------------------------------
just_location.c1 <- read.csv("complete_location.csv")
just_location.c1 <- just_location.c1[,2:4]
datatable(just_location.c1[1:10000, ])

From the data exploration we find that many values are empty. The plot above helps us visualize the missing values: about 70% of the location records are completely filled while 30% are only partially filled. Dropping these 30% would cause a serious loss of information, so we perform data imputation instead, using the mice package. mice applies PMM, which stands for Predictive Mean Matching: it fits a linear regression and imputes each missing value from observed donors with similar predicted means. As the command takes a long time to run, I ran it once and saved the output in CSV format to cut processing time. With the imputed ZIP.CODE, LATITUDE and LONGITUDE, exploratory data analysis becomes much easier. The two tables illustrate the successful imputation of the location variables.
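
For completeness, here is a sketch of how the cached complete_location.csv could have been produced. It mirrors the commented-out commands above, makes the PMM choice explicit, and assumes a write.csv step for caching (the location columns are treated as numeric):

# One-off imputation run, cached to CSV so later runs skip the slow step.
# method = "pmm" makes Predictive Mean Matching explicit (it is also
# mice's default for numeric columns).
just_location.m <- mice(just_location, m = 5, method = "pmm", maxit = 1)
just_location.c <- complete(just_location.m, 2)       # keep the 2nd imputed set
write.csv(just_location.c, "complete_location.csv")   # read back above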

Imputing Boroughs

We now impute the missing boroughs from the ZIP code prefix using a for loop with if/else branches.

## Imputing Borough from the three-digit ZIP code prefix
boroughs <- just_location.c1
boroughs$ZIP.CODE <- as.character(boroughs$ZIP.CODE)
b <- vector()
for(i in seq_len(nrow(boroughs)))
{
  prefix <- substr(boroughs[i, 1], 1, 3)   # anchor to the leading digits so "11040" is not matched as "104"
  if(prefix == "104"){
    b[i] <- "BRONX"
  }
  else if(prefix == "103"){
    b[i] <- "STATEN ISLAND"
  }
  else if(prefix == "112"){
    b[i] <- "BROOKLYN"
  }
  else if(prefix == "100"){
    b[i] <- "MANHATTAN"
  }
  else if(prefix %in% c("110", "111", "113", "114", "116")){
    b[i] <- "QUEENS"
  }
}
boroughs$BOROUGH <- b
just_location.c1$BOROUGH <- boroughs$BOROUGH
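
The same mapping can also be written without an explicit loop. A vectorized sketch using a named lookup vector over the three-digit ZIP prefix (b2 is a hypothetical name; it reproduces b above):

# Named lookup from ZIP prefix to borough
prefix_map <- c("104" = "BRONX", "103" = "STATEN ISLAND", "112" = "BROOKLYN",
                "100" = "MANHATTAN", "110" = "QUEENS", "111" = "QUEENS",
                "113" = "QUEENS", "114" = "QUEENS", "116" = "QUEENS")
prefixes <- substr(as.character(just_location.c1$ZIP.CODE), 1, 3)
b2 <- unname(prefix_map[prefixes])   # NA where the prefix is not mapped, as in the loop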

Data Preparation for Exploratory Data Analysis

injured <- refined[, c(11,13,15,17)]   # the four *_INJURED count columns
injured$Total <- rowSums(injured)
killed <- refined[, c(12,14,16,18)]    # the four *_KILLED count columns
killed$Total <- rowSums(killed)
just_location.c1$Injured <- injured$Total
just_location.c1$Killed <- killed$Total
datatable(just_location.c1[1:10000, ])

Exploratory Data Analysis

Summarising according to hour

The hourly timeline of injuries in NYC is plotted across the five boroughs of New York City.

injured$Time <- refined$TIME
injured$hour <- hour(as.POSIXct(injured$Time, format = "%H:%M"))
injured$BOROUGH <- boroughs$BOROUGH
# n() counts injury-involving collisions per borough and hour
hour_accidents_b <- injured %>% group_by(BOROUGH, hour) %>% summarise(Injured = n())
hour_accidents_b <- filter(hour_accidents_b, BOROUGH != "")
datatable(hour_accidents_b)
ggplot(hour_accidents_b, aes(hour, Injured, fill = BOROUGH)) +
  geom_bar(stat = "identity", show.legend = FALSE) + theme_calc() +
  ggtitle("No. of People Injured in NYC Collision according to Hour") +
  facet_grid(BOROUGH ~ .)

From the above plot we find that the maximum number of accidents occurs during the busy hours of 10 AM to 5 PM.

Summarising according to Month

This plot illustrates the density distribution across the five boroughs of NYC, visualizing the number of people injured from 2012 to 2017.

injured$Date <- refined$DATE
injured$Month <- month(injured$Date)
injured1 <- filter(injured, BOROUGH != "")
injured_month_b1 <- injured1 %>% group_by(BOROUGH, Month) %>% summarise(Injured = n())
ggplot(injured_month_b1, aes(Month, Injured, fill = BOROUGH, col = BOROUGH)) +
  geom_violin(alpha = 0.5, show.legend = FALSE) +
  theme_solarized_2() + ggtitle("Number of Injuries in NYC according to Month from Year 2012 to 2017")+
  facet_grid(. ~ BOROUGH) + scale_x_discrete(limits = c(4,8,12)) +
  geom_point(show.legend = FALSE, size = 2.5)

We find that Brooklyn had the most injuries. The distribution also suggests that the months from May to September saw the most injuries.

Top Reasons for accidents

The top reasons which caused accidents are shown in the data table below. We find that

  • Driver inattention
  • Failure to yield right-of-way
  • Drowsiness

were the top reasons, while alcohol involvement came in 10th. This is surprising, since drink-driving is usually assumed to cause the most collisions.

reason <- just_location.c1
reason$Reason <- refined$CONTRIBUTING.FACTOR.VEHICLE.1
top10_reasons <- reason %>% group_by(Reason) %>% summarise(Total_accidents = n()) %>% top_n(20, Total_accidents)
top10_reasons <- top10_reasons[-19,]   # drop one row (likely an "Unspecified" category)
top10_reasons <- top10_reasons[order(top10_reasons$Total_accidents, decreasing = TRUE),]
datatable(top10_reasons)

Vehicle type codes which lead to maximum injuries

The vehicle types that caused the most injuries are shown in the following data table. We find that

  • SUV
  • TAXI
  • VAN

cause the most injuries in vehicular collisions. Notably, ambulances are also found responsible for a considerable number of injuries.

vehical <- just_location.c1[,c("BOROUGH", "Injured")]
vehical$Vehical_type <- refined$VEHICLE.TYPE.CODE.1
top10_type <- vehical %>% group_by(Vehical_type) %>% summarise(Injured = sum(Injured)) %>% top_n(20, Injured)
top10_type <- top10_type[-c(1,4),]
top10_type <- top10_type[order(top10_type$Injured, decreasing = T),]
datatable(top10_type[-4,])

Number of people killed in every borough

This visualization was made in Tableau; the exported image is included here.

We find that Brooklyn had the maximum number of people killed by vehicular collisions, while Staten Island had a relatively small number.

Playing around with Leaflet

just_location.c1.r <- just_location.c1[just_location.c1$BOROUGH != "", ]   # filtered copy; the summary below uses the full imputed set
try_accidents <- just_location.c1 %>% group_by(ZIP.CODE) %>% summarise(LAT = mean(LATITUDE),
                                                                       LNG = mean(LONGITUDE),
                                                                       Crash = n(),
                                                                       Injured = sum(Injured),
                                                                       Killed = sum(Killed))
library(leaflet)
lati <- try_accidents$LAT
longi <- try_accidents$LNG
map2 <- try_accidents %>% leaflet() %>% addTiles() %>%
  addCircles(weight = 4, radius = try_accidents$Injured * 0.5) %>%
  addMarkers(lat = lati, lng = longi, popup = toupper(paste(as.character(try_accidents$ZIP.CODE),
                                                      paste("No. of Person Injured: ",
                                                            as.character(try_accidents$Injured)),
                                                      paste("No. of Person Killed: ",
                                                            as.character(try_accidents$Killed)),
                                                      sep = "<br/>")),
             clusterOptions = markerClusterOptions())
map2

When we click on a cluster, we see the total crashes, the total people injured and the total people killed in that specific ZIP code. This shows how prone each ZIP code is to vehicular collisions. The blue clusters are formed by grouping all the nearby ZIP codes together.

Prediction Modeling

ARIMA Modeling

We perform ARIMA (AutoRegressive Integrated Moving Average) modelling, a statistical method, to forecast the number of vehicular collisions that will occur in NYC. To do this we convert our data object into a time-series format. We then perform the Augmented Dickey-Fuller test to check stationarity; the data is stationary and therefore suitable for forecasting. We split it into train and test sets and evaluate the model's performance.

We then plot the residuals and find that they are normally distributed around 0, a good sign that suggests the model is performing well.

accidents_date <- dtcopy %>% group_by(DATE) %>% summarise(Crash = n())   # daily crash counts
accidents_date <- as.data.frame(accidents_date)
crashes <- data.frame(Crash = accidents_date$Crash)
rownames(crashes) <- as.character(accidents_date$DATE)
crashes_ts <- as.xts(crashes)   # convert to an xts time-series object
adf.test(crashes_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  crashes_ts
## Dickey-Fuller = -10.197, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
train <- window(crashes_ts, start = "2012-07-01", end = "2016-08-31")
test <- window(crashes_ts, start = "2016-09-01", end = "2017-10-01")
model_arima <- auto.arima(train, stationary = TRUE)
resid_Arima <- data.frame(Residuals = model_arima$residuals)
ggplot(resid_Arima, aes(Residuals)) +
  geom_histogram(aes(y = ..density..),fill = "white", col = "black") + 
  geom_density(fill = "magenta", alpha = 0.5) + ggtitle("Plot for Residuals vs Fitted values") +
  xlab("Residuals") + ylab("Density for Fitted Values") + theme_minimal()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
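
Before comparing models, the out-of-sample accuracy of the ARIMA fit can be checked against the held-out window (a sketch using the train/test objects above):

fc <- forecast(model_arima, h = length(test))   # forecast over the test horizon
accuracy(fc, as.numeric(test))                  # RMSE/MAE on train and test sets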

From the above plot we can say that ARIMA modelling is performing well. To draw a conclusion, however, we need a comparison, so we also build a Random Forest prediction model to decide which model should be used for further analysis.

Random Forest

Random Forest is an ensemble method that combines many decision trees into a single prediction system. For time series, Random Forest cannot detect seasonality or trend on its own, so we create attributes such as Day, Month, Weekday and Year from the date and summarise the vehicular crashes that happened on each specific day. In this way the model learns weights for different Day, Weekday, Month and Year combinations, which lets it predict the time series. We plot the residuals for this model and find very good results.

sumed_accidents2 <- accidents1 %>% group_by(Day, Month, Year, Wday) %>% summarise(Crashes = n())
sumed_accidents2$Day <- as.factor(sumed_accidents2$Day)
sumed_accidents2$Month <- as.factor(sumed_accidents2$Month)
sumed_accidents2$Year <- as.factor(sumed_accidents2$Year)
intrain <- createDataPartition(y = sumed_accidents2$Crashes, list = FALSE, p = 0.7)   # 70/30 train/test split
training <- sumed_accidents2[intrain, ]
testing <- sumed_accidents2[-intrain, ]
final_model <- train(Crashes ~ ., data = training, method = "rf", 
                     trControl = trainControl(method = "repeatedcv", repeats = 3,
                     number = 2, search = "random"), preProcess = c("center", "scale"))
p <- data.frame(predicted = predict(final_model, testing))
p$Actual <- testing$Crashes
p$resid <- (p$predicted - p$Actual)
pp <- filter(p, p$resid >= -250)   # drop a few extreme negative residuals before plotting
ggplot(pp, aes(resid)) + geom_histogram(aes(y = ..density..),fill = "white", col = "black") +
  geom_density(fill = "darkorange", alpha = 0.5) + ggtitle("Plot for Residuals vs Fitted values") +
  xlab("Residuals") + ylab("Density for Fitted Values") + theme_minimal()
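
caret's postResample reports the hold-out RMSE and R-squared (the variance explained quoted in the conclusions); a minimal check using the prediction frame above:

postResample(pred = p$predicted, obs = p$Actual)   # RMSE, Rsquared, MAE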

Optimization

We use the Solver function of MS Excel with the GRG Nonlinear algorithm for the optimization. We set constraints such as the number of ambulances deployed in the city, the number of ambulances required, the ambulances available in the warehouse, the response time for each ambulance, and the number of times an ambulance can be assigned, in order to obtain the extra number of ambulances that should be deployed to meet a specific day's vehicular-collision demand.
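
The Excel model itself is not reproduced here, but the same idea can be sketched in R. Since the simplified constraints below are linear, an integer program with the lpSolve package suffices; every figure is hypothetical and stands in for the real Solver inputs:

library(lpSolve)

# Hypothetical inputs (placeholders for the real Solver constraints):
predicted_crashes <- c(550, 600, 580)   # forecast collisions for three days
calls_per_ambulance <- 6                # assumed collisions one ambulance can serve per day
deployed <- 80                          # ambulances already on duty

# Minimise the total number of extra ambulances x_d over the three days,
# subject to (deployed + x_d) * calls_per_ambulance >= predicted_crashes[d].
obj <- rep(1, 3)
const_mat <- diag(3) * calls_per_ambulance
const_rhs <- predicted_crashes - deployed * calls_per_ambulance
sol <- lp("min", obj, const_mat, rep(">=", 3), const_rhs, all.int = TRUE)
sol$solution   # extra ambulances needed on each day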

Conclusions

  1. From exploratory data analysis, we find that 10 AM to 5 PM is the period when the majority of accidents occur.

  2. The months from May to September are found to have the most vehicular collisions, and hence extra availability of ambulances should be arranged for them.

  3. Brooklyn is found to be the borough with the highest number of vehicular collisions and the maximum number of people killed.

  4. SUVs, taxis and vans are responsible for the majority of crashes causing the most injuries.

  5. Driver inattention, drowsiness and fatigue are the top reasons that cause the most collisions.

  6. Random Forest explained 45% of the variance and had a better, normally distributed residual distribution, and hence was selected as the final model.

  7. Optimization using the GRG Nonlinear algorithm from the Solver function in MS Excel can be used to determine the number of ambulances required to meet this collision demand.