A major part of practical actuarial problems in general insurance is the rating of risk factors in motor insurance. The understanding of what affects the likelihood and severity of car accidents is important not just to the insurance sector but to anyone who uses a car as a means of transportation.
In Israel, every accident which involves a bodily injury must be reported to the police. Due to this a reliable database was put together by the central bureau of statistics. This data was made available to me through the Israel National Road and safety Authority and could be used for research. Data that is collected is the time and date, location, severity of accident,type of vehicle involved and more. Many research hypotheses could be tested on this data but I will focus on one main research question.
The data consists of reported accidents that occurred in 2011 in Tel Aviv. The features that were collected where the location of accident,time and dates and details of the accidents and the vehicle involved. Except for the geo data all data is categorical (the time of the accident is in a hour band).
library(ggplot2)
library(dplyr)
library(plotly)
library(rpivotTable)
library(leaflet)
library(readxl)
library(mgcv)
library(randomForest)
TA_dat <- read_excel("TA_sev.xlsx")
TA_dat <- TA_dat[,1:18]
dim(TA_dat)
## [1] 10607 18
colnames(TA_dat)
## [1] "Date" "Month" "DayorNight" "DayType"
## [5] "Day" "Hourband" "Hour" "Severity"
## [9] "SevFac" "Typeacc" "Intersection" "Street"
## [13] "Crossingroad" "X" "Y" "CarType"
## [17] "lat" "lon"
TA_dat$Hour <- as.integer(TA_dat$Hour)
TA_dat$Sev_percent <- TA_dat$Severity*100
# regrouping day types
TA_dat$DayType2 <- ifelse(TA_dat$DayType %in% c("Fri-erev-hag","Shabbat-Hag"),"Fri-Hag","other")
## Setting the levels
TA_dat$Day <- factor(TA_dat$Day,levels=c("Sun","Mon","Tues","Wed","Thurs","Fri","Sat"))
TA_dat$DayorNight <- factor(TA_dat$DayorNight,levels=c("morning","noon","afternoon","evening","night"))
TA_dat$Month <- as.factor(TA_dat$Month)
TA_dat$Intersection <- as.factor(TA_dat$Intersection)
plot1 <- TA_dat %>% ggplot(aes(SevFac)) + geom_bar()
ggplotly(plot1)
plot2 <- TA_dat %>% ggplot(aes(DayorNight)) + geom_bar()
ggplotly(plot2)
plot3 <- TA_dat %>% ggplot(aes(Day)) + geom_bar()
ggplotly(plot3)
plot4 <- TA_dat %>% ggplot(aes(Intersection)) + geom_bar()
ggplotly(plot4)
plot5 <- TA_dat %>% ggplot(aes(DayType)) + geom_bar()
ggplotly(plot5)
# seperate plot for each one
The geographical coordinates supplied by the Israel National Road and safety Authority where in the ITM coordinate system (Marked X/Y in the data). Unfortunately all mapping packages I know work with WGS84 coordinates. Converting ITM to WGS84 is not a straight forward task. But luckily someone has written python code which I could use. Thank you HarelM and Dr Zvika Ben Haim.
Longitude - is a geographic coordinate that specifies the east-west position of a point on the Earth’s surface.
p_lng <- ggplot(TA_dat, aes(x=lon))+ geom_histogram(aes(y=..density..),binwidth=.005,colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") + ggtitle('Longitude distribution')
ggplotly(p_lng)
Latitude - is an angle which ranges from 0 degrees at the Equator to 90 degrees with North or South at the poles.
p_lat <- ggplot(TA_dat, aes(x=lat))+ geom_histogram(aes(y=..density..),binwidth=.005,colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") + ggtitle('Latitude distribution')
ggplotly(p_lat)
pal <- colorFactor(c("red", "blue"), domain = c("Light","Harsh"))
leaflet(data = TA_dat) %>% addTiles() %>% addCircles(~lon, ~lat,color = ~pal(SevFac))
These pivot tables are interactive please feel free to explore the data.
pivot_dat <- select(TA_dat,Month,DayorNight,DayType,Day,Hourband,Sev_percent,Typeacc,Intersection,CarType)
rpivotTable(pivot_dat,cols="Typeacc",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")
rpivotTable(pivot_dat,rows="Intersection",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")
filter(TA_dat,Intersection==1) %>% leaflet() %>% addTiles() %>% addCircleMarkers(~lon, ~lat,color = ~pal(SevFac),popup=TA_dat$Typeacc,radius=1)
rpivotTable(pivot_dat,rows="CarType",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")
My criteria for model selection was pvalue and deviance. Therefore I didn’t include variables that weren’t significant.
m1 <- gam(Severity~s(lat)+as.factor(Intersection)+s(Hour)+as.factor(DayType2),data=TA_dat,family=binomial)
summary(m1)
##
## Family: binomial
## Link function: logit
##
## Formula:
## Severity ~ s(lat) + as.factor(Intersection) + s(Hour) + as.factor(DayType2)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8656 0.2860 -13.517 <2e-16 ***
## as.factor(Intersection)1 1.5136 0.1486 10.182 <2e-16 ***
## as.factor(DayType2)other -0.7940 0.2842 -2.794 0.0052 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(lat) 5.237 6.394 18.93 0.005 **
## s(Hour) 6.813 7.898 35.26 2.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0246 Deviance explained = 9.17%
## UBRE = -0.82992 Scale est. = 1 n = 10607
In order to fill in the missing values in car type I built a model that models the type of car given all other available data. I chose to use a random forest classification model.
set.seed(123)
# making less vechile groups
TA_dat$CarType2 <- TA_dat$CarType
TA_dat$CarType2[which(TA_dat$CarType %in% c("Trucks","Work vehicle","Train","Bus","Taxi"))] <- "other"
rf1 <- randomForest(as.factor(CarType2) ~ Hour+Severity+Intersection+lat+lon,data=filter(TA_dat,CarType2 != "unknown"))
# Model predictions
TA_dat$Fixed_car <- predict(rf1,TA_dat)
#rpivotTable(select(TA_dat,Fixed_car,CarType2),cols="CarType2",rows="Fixed_car",aggregatorName = "Count",rendererName = "Row Heatmap")
# correcting just the unknown values
unk_ind <- which(TA_dat$CarType2 == "unknown")
TA_dat$Final_car <- TA_dat$CarType2
TA_dat$Final_car[unk_ind] <- as.character(predict(rf1,TA_dat[unk_ind,]))
rpivotTable(select(TA_dat,Final_car,CarType2),cols="CarType2",rows="Final_car",aggregatorName = "Count",rendererName = "Table")
rpivotTable(select(TA_dat,Final_car,CarType,Sev_percent),rows="Final_car",aggregatorName = "Average",vals="Sev_percent",rendererName = "Heatmap")
Will now try to incorporate car type data into model and add interactions - one smoothing function to both variables is used. In the term s(lat,lon) simple smoothing terms s(lat) and s(lon) are also automatically included.
m4 <- gam(Severity~s(lat,lon)+as.factor(Intersection)+s(Hour)+as.factor(DayType2)+as.factor(Final_car),data=TA_dat,family=binomial)
summary(m4)
##
## Family: binomial
## Link function: logit
##
## Formula:
## Severity ~ s(lat, lon) + as.factor(Intersection) + s(Hour) +
## as.factor(DayType2) + as.factor(Final_car)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2895 0.4057 -8.109 5.1e-16 ***
## as.factor(Intersection)1 1.5266 0.1519 10.048 < 2e-16 ***
## as.factor(DayType2)other -0.7821 0.2882 -2.714 0.00666 **
## as.factor(Final_car)Motorcycle -0.6888 0.3497 -1.970 0.04886 *
## as.factor(Final_car)other -0.5653 0.3396 -1.664 0.09603 .
## as.factor(Final_car)private -0.8417 0.3371 -2.496 0.01254 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(lat,lon) 22.440 26.492 57.80 0.000393 ***
## s(Hour) 6.663 7.764 37.82 9.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0319 Deviance explained = 12%
## UBRE = -0.83128 Scale est. = 1 n = 10607
vis.gam(m4, view=c("lat","lon"),ticktype = "detailed", main = "Lat/lon")
# Demonstrating the effect of location
n <- nrow(TA_dat)
# Middle of the week, seven O'clock in a private Car.
grid_table <- data.frame(Intersection=TA_dat$Intersection,DayType2=rep("other",n),lat=TA_dat$lat,lon=TA_dat$lon,Hour=rep(19,n),Final_car=rep("private",n))
# fitted values
grid_table$pred <- exp(predict(m4,grid_table))
pal <- colorNumeric(palette = topo.colors(10),domain = grid_table$pred)
leaflet(data = grid_table) %>% addTiles() %>% addCircles(~lon, ~lat,color = ~pal(pred),popup=as.character(round(grid_table$pred,4))) %>% addLegend(pal = pal, values = ~pred, opacity = 3)
To conclude, it can be seen that intersections and weekends/holidays are more dangerous. Also, the center of tel aviv is safer than the northern and southern parts. As expected, bicycles are more dangerous than other vehicles. The most dangerous times of the day are the late hours of the night.
Further investigations that could be taken are -