Predicting the probability of a Severe bodily injury in a Car accident in Tel Aviv

Introduction

A major part of practical actuarial problems in general insurance is the rating of risk factors in motor insurance. The understanding of what affects the likelihood and severity of car accidents is important not just to the insurance sector but to anyone who uses a car as a means of transportation.

In Israel, every accident which involves a bodily injury must be reported to the police. Due to this a reliable database was put together by the central bureau of statistics. This data was made available to me through the Israel National Road and safety Authority and could be used for research. Data that is collected is the time and date, location, severity of accident,type of vehicle involved and more. Many research hypotheses could be tested on this data but I will focus on one main research question.

Layout

Descriptive Statistics
Presenting the data on a map
Modeling the data
Model selection and predictions
Shiny app to present predictions
Conclusions

Data

The data consists of reported accidents that occurred in 2011 in Tel Aviv. The features that were collected where the location of accident,time and dates and details of the accidents and the vehicle involved. Except for the geo data all data is categorical (the time of the accident is in a hour band).

library(ggplot2)
library(dplyr)
library(plotly)
library(rpivotTable)
library(leaflet)
library(readxl)
library(mgcv)
library(randomForest)

Loading and cleaning data

TA_dat <- read_excel("TA_sev.xlsx")
TA_dat <- TA_dat[,1:18]

dim(TA_dat)

## [1] 10607    18

colnames(TA_dat)

##  [1] "Date"         "Month"        "DayorNight"   "DayType"     
##  [5] "Day"          "Hourband"     "Hour"         "Severity"    
##  [9] "SevFac"       "Typeacc"      "Intersection" "Street"      
## [13] "Crossingroad" "X"            "Y"            "CarType"     
## [17] "lat"          "lon"

TA_dat$Hour <- as.integer(TA_dat$Hour)
TA_dat$Sev_percent <- TA_dat$Severity*100

# regrouping day types

TA_dat$DayType2 <- ifelse(TA_dat$DayType %in% c("Fri-erev-hag","Shabbat-Hag"),"Fri-Hag","other")

## Setting the levels

TA_dat$Day <- factor(TA_dat$Day,levels=c("Sun","Mon","Tues","Wed","Thurs","Fri","Sat"))
TA_dat$DayorNight <- factor(TA_dat$DayorNight,levels=c("morning","noon","afternoon","evening","night"))
TA_dat$Month <- as.factor(TA_dat$Month)
TA_dat$Intersection <- as.factor(TA_dat$Intersection)

Descriptive Statistics

plot1 <-  TA_dat %>% ggplot(aes(SevFac)) +  geom_bar()
ggplotly(plot1)

plot2 <-  TA_dat %>% ggplot(aes(DayorNight)) +  geom_bar()
ggplotly(plot2)

plot3 <-  TA_dat %>% ggplot(aes(Day)) +  geom_bar()
ggplotly(plot3)

plot4 <-  TA_dat %>% ggplot(aes(Intersection)) +  geom_bar()
ggplotly(plot4)

plot5 <-  TA_dat %>% ggplot(aes(DayType)) +  geom_bar()
ggplotly(plot5)

# seperate plot for each one

Presentation of Geographical data

The geographical coordinates supplied by the Israel National Road and safety Authority where in the ITM coordinate system (Marked X/Y in the data). Unfortunately all mapping packages I know work with WGS84 coordinates. Converting ITM to WGS84 is not a straight forward task. But luckily someone has written python code which I could use. Thank you HarelM and Dr Zvika Ben Haim.

Longitude - is a geographic coordinate that specifies the east-west position of a point on the Earth’s surface.

p_lng <- ggplot(TA_dat, aes(x=lon))+ geom_histogram(aes(y=..density..),binwidth=.005,colour="black", fill="white") +
    geom_density(alpha=.2, fill="#FF6666") + ggtitle('Longitude distribution')
  
ggplotly(p_lng)

Latitude - is an angle which ranges from 0 degrees at the Equator to 90 degrees with North or South at the poles.

p_lat <- ggplot(TA_dat, aes(x=lat))+ geom_histogram(aes(y=..density..),binwidth=.005,colour="black", fill="white") +
    geom_density(alpha=.2, fill="#FF6666") + ggtitle('Latitude distribution')
  
ggplotly(p_lat)

Presenting on a Map

pal <- colorFactor(c("red", "blue"), domain = c("Light","Harsh"))

leaflet(data = TA_dat) %>% addTiles() %>% addCircles(~lon, ~lat,color = ~pal(SevFac))

Pivot tables

These pivot tables are interactive please feel free to explore the data.

pivot_dat <- select(TA_dat,Month,DayorNight,DayType,Day,Hourband,Sev_percent,Typeacc,Intersection,CarType)

rpivotTable(pivot_dat,cols="Typeacc",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")

Only intersections

rpivotTable(pivot_dat,rows="Intersection",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")

filter(TA_dat,Intersection==1) %>% leaflet() %>% addTiles() %>% addCircleMarkers(~lon, ~lat,color = ~pal(SevFac),popup=TA_dat$Typeacc,radius=1)

rpivotTable(pivot_dat,rows="CarType",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")

Modeling

My criteria for model selection was pvalue and deviance. Therefore I didn’t include variables that weren’t significant.

m1 <- gam(Severity~s(lat)+as.factor(Intersection)+s(Hour)+as.factor(DayType2),data=TA_dat,family=binomial)
summary(m1)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Severity ~ s(lat) + as.factor(Intersection) + s(Hour) + as.factor(DayType2)
## 
## Parametric coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.8656     0.2860 -13.517   <2e-16 ***
## as.factor(Intersection)1   1.5136     0.1486  10.182   <2e-16 ***
## as.factor(DayType2)other  -0.7940     0.2842  -2.794   0.0052 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df Chi.sq  p-value    
## s(lat)  5.237  6.394  18.93    0.005 ** 
## s(Hour) 6.813  7.898  35.26 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0246   Deviance explained = 9.17%
## UBRE = -0.82992  Scale est. = 1         n = 10607

Missing values in car type

In order to fill in the missing values in car type I built a model that models the type of car given all other available data. I chose to use a random forest classification model.

set.seed(123)

# making less vechile groups

TA_dat$CarType2 <- TA_dat$CarType
TA_dat$CarType2[which(TA_dat$CarType %in% c("Trucks","Work vehicle","Train","Bus","Taxi"))] <- "other"


rf1 <- randomForest(as.factor(CarType2) ~ Hour+Severity+Intersection+lat+lon,data=filter(TA_dat,CarType2 != "unknown"))

# Model predictions

TA_dat$Fixed_car <- predict(rf1,TA_dat)
#rpivotTable(select(TA_dat,Fixed_car,CarType2),cols="CarType2",rows="Fixed_car",aggregatorName = "Count",rendererName = "Row Heatmap")

# correcting just the unknown values

unk_ind <- which(TA_dat$CarType2 == "unknown")
TA_dat$Final_car <- TA_dat$CarType2
TA_dat$Final_car[unk_ind] <- as.character(predict(rf1,TA_dat[unk_ind,]))

rpivotTable(select(TA_dat,Final_car,CarType2),cols="CarType2",rows="Final_car",aggregatorName = "Count",rendererName = "Table")

rpivotTable(select(TA_dat,Final_car,CarType,Sev_percent),rows="Final_car",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")

Will now try to incorporate car type data into model and add interactions - one smoothing function to both variables is used. In the term s(lat,lon) simple smoothing terms s(lat) and s(lon) are also automatically included.

m4 <- gam(Severity~s(lat,lon)+as.factor(Intersection)+s(Hour)+as.factor(DayType2)+as.factor(Final_car),data=TA_dat,family=binomial)
summary(m4)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Severity ~ s(lat, lon) + as.factor(Intersection) + s(Hour) + 
##     as.factor(DayType2) + as.factor(Final_car)
## 
## Parametric coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -3.2895     0.4057  -8.109  5.1e-16 ***
## as.factor(Intersection)1         1.5266     0.1519  10.048  < 2e-16 ***
## as.factor(DayType2)other        -0.7821     0.2882  -2.714  0.00666 ** 
## as.factor(Final_car)Motorcycle  -0.6888     0.3497  -1.970  0.04886 *  
## as.factor(Final_car)other       -0.5653     0.3396  -1.664  0.09603 .  
## as.factor(Final_car)private     -0.8417     0.3371  -2.496  0.01254 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df Chi.sq  p-value    
## s(lat,lon) 22.440 26.492  57.80 0.000393 ***
## s(Hour)     6.663  7.764  37.82 9.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0319   Deviance explained =   12%
## UBRE = -0.83128  Scale est. = 1         n = 10607

vis.gam(m4, view=c("lat","lon"),ticktype = "detailed", main = "Lat/lon")

# Demonstrating the effect of location

n <- nrow(TA_dat)

# Middle of the week, seven O'clock in a private Car.

grid_table <- data.frame(Intersection=TA_dat$Intersection,DayType2=rep("other",n),lat=TA_dat$lat,lon=TA_dat$lon,Hour=rep(19,n),Final_car=rep("private",n))


# fitted values

grid_table$pred <- exp(predict(m4,grid_table))


pal <- colorNumeric(palette = topo.colors(10),domain = grid_table$pred) 
leaflet(data = grid_table) %>% addTiles() %>% addCircles(~lon, ~lat,color = ~pal(pred),popup=as.character(round(grid_table$pred,4)))  %>%  addLegend(pal = pal, values = ~pred, opacity = 3)

Prediction

Available here.

Conclusions

To conclude, it can be seen that intersections and weekends/holidays are more dangerous. Also, the center of tel aviv is safer than the northern and southern parts. As expected, bicycles are more dangerous than other vehicles. The most dangerous times of the day are the late hours of the night.

Further investigations that could be taken are -

Applying to all of Israel.
Using multi year data.
Cross referencing with other databases - such as a weather database, in order to add more explanatory variables.