Traffic accidents in Russia caused 16,981 deaths and 210,877 injuries in 2019, according to official statistics. In 2016 the Russian government developed a strategy to reduce the number of traffic accidents; it targets 1.5-2 deaths per 100,000 population by the end of 2024. While much effort has focused on predicting whether Russia will reach the assigned level, what influences the severity of a traffic accident remains poorly understood.
A variety of weather and traffic conditions contribute to accidents. This paper attempts to find out whether quantitative characteristics can be used to predict the severity of traffic accidents in Russia.
This research is based on the data for Saint-Petersburg, Russia.
An ideal data set would include both weather and traffic conditions. Weather conditions include the variables: air temperature, road temperature, rain, snow, ice, fog. Traffic conditions include the variables: date, time, longitude, latitude, type of accident, road type, place type, road condition, road lights, number of vehicles, number of participants, fatalities, injuries, class of vehicles, model of vehicles, category of participants, sex of participants, cause of accident.
We download the data from the official police site. The police describe each accident in a unique accident card. It contains the following data: type of accident, date, city district, info about the accident, number of vehicles, number of participants, ID of the accident, fatalities, injured, time of the accident, etc. The data is available in .pdf, .xls, .csv and .xml formats. We use the .csv and .xml files, because we cannot read the data from the .pdf files, and the .xls representation is not suitable for automated reading. Each card is given as a separate spreadsheet within one file, with a varying number of rows and columns.
library(XML)        # provides xmlToDataFrame(), used below
library(reshape2)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)
library(stringi)
library(stringr)
library(ROSE)
library(rpart)
We start by reading the .xml format, using the data downloaded for 2019. The police server splits the data into 14-day intervals, so 2019 comes in 22 files. They cover the period from January 1, 2019 to November 26, 2019; the data for December is not yet available.
files <- list.files("./raw_data")
files <- as.list(files)
setwd("./raw_data")
list_file <- list.files(pattern = "*.xml") %>%
lapply(xmlToDataFrame, stringsAsFactors = F) %>% bind_rows
setwd("C:/Users/Георгий/Documents/GitHub/rus_traffic_accidents/")
Each accident card starts with a description of the region and the name of the document. This is not relevant to our research, so we filter out the rows that contain that information. We also transform the date to day-month-year format and the time to time format. Then we arrange the data by date, time, type of accident and district of the accident - we will need these columns to create a key later.
list_file <- list_file %>%
filter(!(is.na(DTPV))) %>% select(-text)
list_file$date <- dmy(list_file$date)
list_file$time <- as.POSIXct(list_file$time, format = "%H:%M")
xmldf <- list_file %>% arrange(date, time, DTPV, district)
rm(list_file)
A summary of the dataset shows that the columns KTS, KUCH, POG and RAN were read as character, although they hold numeric values.
summary(xmldf)
We transform them to numeric.
xmldf$KTS <- as.numeric(xmldf$KTS)
xmldf$KUCH <- as.numeric(xmldf$KUCH)
xmldf$POG <- as.numeric(xmldf$POG)
xmldf$RAN <- as.numeric(xmldf$RAN)
We check whether any missing values remain in the dataset after all the transformations by counting complete cases.
sum(complete.cases(xmldf))
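A per-column count makes the location of any gaps explicit:
# number of missing values per column
colSums(is.na(xmldf))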
Column infoDtp contains descriptions of accidents: weather conditions, road conditions, coordinates, etc. Each description is a string with no delimiters and varying length, and the order of description items also differs between rows. This limits our ability to extract data from the column. We start by extracting the driving mode after the accident, the road type and the coordinates.
driving_mode <- xmldf %>%
separate(infoDtp, into = c("text1"),
sep = "([:digit:]+\\.[:digit:]+\\.[:digit:]+)")
driving_mode <- driving_mode %>%
separate(text1, into = c("text1"),
sep = "([:digit:]+\\.[:digit:]+)") %>% select(text1)
xmldf <- cbind(xmldf, driving_mode)
xmldf$driving_mode <- xmldf$text1
xmldf <- xmldf %>% select(-text1)
coords_dtp <- xmldf %>%
select(date, infoDtp) %>%
extract(infoDtp, c("text1"),
"([:digit:]+\\.[:digit:]+\\.[:digit:]+)", remove = TRUE) %>%
select(text1)
xmldf <- cbind(xmldf, coords_dtp)
xmldf$coords_dtp <- xmldf$text1
xmldf <- xmldf %>% select(-text1)
road_type <- xmldf %>% select(infoDtp) %>%
separate(infoDtp, into = c("text1", "text2"),
sep = "([:digit:]+\\.[:digit:]+\\.[:digit:]+)") %>%
select(text2)
road_type <- road_type %>%
extract(text2, c("text1"), "([^A-z]+[:punct:])", remove = FALSE)
road_type <- road_type %>%
extract(text1, c("text1"), "([:alnum:]+[:alnum:]+)", remove = TRUE)
xmldf$road_type <- road_type$text1
This is as far as we can get with sensible regular expressions - there should be a simpler way to obtain the required data. We found that the .csv files contain some of the information from the accident cards in a format we can use for our analysis. The idea is to create a dataset with weather and road conditions and join it with the first dataset on a key. The key is the date, time, type of accident and district.
We download the .csv files and merge them, then construct the columns and join the two datasets.
### read all the csv files
setwd("./csv_data")
list_file <- list.files(pattern = "*.csv") %>%
lapply(read.csv, sep = ";", encoding = "UTF-8") %>% bind_rows
setwd("C:/Users/Георгий/Documents/GitHub/rus_traffic_accidents/")
raw_data <- data.frame(list_file)
### transform the .csv file and rename the columns
### (the header names in the source .csv appear shifted relative to the
### content, hence the seemingly mismatched assignments below)
raw_data <- raw_data %>% filter(!is.na(Номер))
raw_data <- unique(raw_data)
raw_data <- raw_data %>% select(Номер, Дата, Время, Схема, Широта, Вид.ДТП,
Адрес, Дорога, Категория.дороги, Состояние.погоды.1,
Состояние.проезжей.части, Освещение)
raw_data$weather_cond <- paste(raw_data$Состояние.погоды.1,
raw_data$Состояние.проезжей.части)
raw_data$road_cond <- raw_data$Освещение
raw_data$date <- raw_data$Дата
raw_data$time <- raw_data$Время
raw_data$id <- raw_data$Схема
raw_data$latitude <- raw_data$Широта
raw_data$longitude <- raw_data$Вид.ДТП
raw_data$road_cat <- raw_data$Категория.дороги
raw_data$type <- raw_data$Адрес
raw_data$district <- raw_data$Дорога
raw_data <- raw_data %>% select(date, time, type, district, latitude, longitude,
road_cat, road_cond, weather_cond)
raw_data$date <- dmy(raw_data$date)
raw_data <- raw_data %>% arrange(date, time, type, district)
raw_data$time <- as.POSIXct(raw_data$time, format = "%H:%M")
raw_data$key <- paste(raw_data$date, raw_data$time, raw_data$type, raw_data$district)
### make key in xml file
xmldf$key <- paste(xmldf$date, xmldf$time, xmldf$DTPV, xmldf$district)
### check if all rows match by the key
set_raw <- raw_data
set_xml <- xmldf
d1 <- data.frame(set_raw, set_xml)
d1$check <- d1$key == d1$key.1
sum(d1$check == TRUE)
rm(set_raw, set_xml, d1)
rm(files, coords_dtp, driving_mode, road_type)
raw_data_sel <- raw_data %>% select(key.1 = key, latitude, longitude, road_cat,
road_cond, weather_cond)
### merge two datasets
d2 <- cbind(xmldf, raw_data_sel)
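Since the key check above confirmed that the rows of both datasets align one-to-one, the column bind is safe. For reference, a sketch of an explicit join on the key, which would not depend on row order (the name d2_joined is ours):
# equivalent to the cbind above, but independent of row order
d2_joined <- xmldf %>%
  left_join(raw_data %>%
              select(key, latitude, longitude, road_cat,
                     road_cond, weather_cond),
            by = "key")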
rm(list_file, xmldf, raw_data, raw_data_sel)
Then we drop the temporary columns and write the file as our raw data.
### drop temporary columns
d2 <- d2 %>% select(dtpv = DTPV, date, time, district, kts = KTS, kuch = KUCH,
fatal = POG, injury = RAN, driving_mode, latitude,
longitude, road_cat, road_cond, weather_cond)
# save raw data
write.csv(d2, "raw_data.csv")
The table below summarizes which of the desired variables are available in the merged dataset:

| desired data | available |
|---|---|
| air temperature | no |
| road temperature | no |
| rain, snow, ice, fog | yes |
| date | yes |
| time | yes |
| longitude | yes |
| latitude | yes |
| type of accident | yes |
| road type | yes |
| place type | yes |
| road condition | yes |
| road lights | yes |
| number of vehicles | yes |
| number of participants | yes |
| fatal | yes |
| injury | yes |
| class of vehicles | no |
| model of vehicles | no |
| category of participant | no |
| sex of participants | no |
| cause of accident | no |
| weather condition | yes |
The data contains variable level descriptions in Russian. For the purposes of this research we translate all Russian labels into English.
raw_data <- read.csv("raw_data.csv")
raw_data <- raw_data %>% select(-X)
# check unique values in text columns and substitute them with English words
str(raw_data)
## check accident types
unique(raw_data$dtpv)
raw_data$dtpv <- plyr::revalue(raw_data$dtpv,
    c("Столкновение" = "collision",
      "Наезд на пешехода" = "pedestrian_hit",
      "Наезд на препятствие" = "obstacle_hit",
      "Наезд на стоящее ТС" = "hit_run",
      "Опрокидывание" = "rollover",
      "Съезд с дороги" = "off_road",
      "Падение пассажира" = "passenger_fall",
      "Иной вид ДТП" = "other",
      "Наезд на велосипедиста" = "bicycle_hit",
      "Наезд на внезапно возникшее препятствие" = "immediate_hit",
      "Наезд на лицо, не являющееся участником дорожного движения, осуществляющее несение службы" = "policeman_hit",
      "Отбрасывание предмета" = "throwing_object"))
raw_data$dtpv <- as.factor(raw_data$dtpv)
## convert date to date format
raw_data$date <- ymd(raw_data$date)
## convert time to time
raw_data$time <- as.POSIXct(raw_data$time)
## check districts
unique(raw_data$district)
raw_data$district <- plyr::revalue(raw_data$district,
c("Василеостровский район" = "vasil", "Калининский район" = "kalin",
"Кировский район" = "kirov", "Приморский район" = "primorsk",
"Адмиралтейский район" = "admiral", "Центральный район" = "center",
"Красногвардейский район" = "krasnogvard", "Фрунзенский район" = "frunz",
"Красносельский район" = "krasnosel", "Невский район" = "nevski",
"Московский район" = "mosk", "Петродворцовый район" = "petrodvorts",
"Пушкинский район" = "pushkin", "Выборгский район" = "vyborg",
"Петроградский район" = "petrograd", "Колпинский район" = "kolpino",
"Курортный район" = "kurort", "Кронштадтский район" = "kronshtadt"))
## check driving mode
unique(raw_data$driving_mode)
raw_data$driving_mode <- plyr::revalue(raw_data$driving_mode,
c("Режим движения не изменялся" = "open",
"Движение частично перекрыто" = "part_closed",
"Движение полностью перекрыто" = "closed"))
## check road category
unique(raw_data$road_cat)
raw_data$road_cat <- plyr::revalue(raw_data$road_cat,
c("Местного значения (дорога местного значения, включая относящиеся к собственности поселений, муниципальных районов, городских округов)" = "local",
"Региональная или межмуниципальная (дорога регионального или межмуниципального значения)" = "regional",
"Частная (дорога, относящиеся к частной и иным формам собственности)" = "private",
"Не указано" = "not_specified",
"Федеральная (дорога федерального значения)" = "federal",
"Другие места" = "other"))
## check road conditions
unique(raw_data$road_cond)
raw_data$road_cond <- plyr::revalue(raw_data$road_cond,
c("Обработанное противогололедными материалами" = "wet",
"Заснеженное" = "snow", "Мокрое" = "wet",
"Загрязненное" = "dirty", "Пасмурно" = "unknown",
"Ясно" = "unknown", "Снегопад" = "snow", "Сухое" = "dry",
"Со снежным накатом" = "snow", "Гололедица" = "sleet",
"Дождь" = "wet", "Пыльное" = "dust"))
## check weather conditions
unique(raw_data$weather_cond)
raw_data$weather_cond <- plyr::revalue(raw_data$weather_cond,
c("Ясно " = "clear", " Пасмурно" = "clouds",
"Пасмурно " = "clouds", "Снегопад " = "snow",
"Сведения отсутствуют " = "unknown",
" Снегопад" = "snow", " Ясно" = "clear",
"Пасмурно Снегопад" = "snow", "Метель " = "snow",
"Сужение проезжей части припаркованным транспортом " = "unknown",
"Дождь " = "rain", "Конструктивное сужение проезжей части вследствие уменьшения количества полос движения " = "unknown",
"Пасмурно Дождь" = "rain", "Пасмурно Метель" = "snow", " Дождь" = "rain",
"Неправильное применение, плохая видимость дорожных знаков Сведения отсутствуют" = "unknown",
"Сужение проезжей части вследствие проведения работ " = "unknown", "Ясно Дождь" = "rain", "Пасмурно Туман" = "fog",
"Туман " = "fog"))
# Clean data --------------------------------------------------------------
# check unspecified values in factor columns that came from second dataset
levels(raw_data$road_cat)
str(raw_data$road_cat)
levels(raw_data$road_cond) # we have empty rows that do not read as NA
str(raw_data$road_cond)
levels(raw_data$road_cond)[levels(raw_data$road_cond) == ""] <- "unknown"
levels(raw_data$weather_cond)
str(raw_data$weather_cond)
levels(raw_data$driving_mode)
str(raw_data$driving_mode)
levels(raw_data$dtpv)
str(raw_data$dtpv)
# final check on the summary
summary(raw_data)
# write clean data set
write.csv(raw_data, "clean_data.csv")
To explore the dataset we introduce the severity of an accident: the number of casualties (fatalities plus injuries) divided by the sum of the number of vehicles and the number of participants: \[ s_r = \frac{fatal + injury}{vehicles + participants} \]
We define two classes of accidents by level of severity: serious and moderate. An accident is considered serious if its severity rate exceeds 0.35. The share of fatal accidents in the serious class is about four times higher than in the moderate class.
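A minimal worked example with hypothetical numbers:
# hypothetical accident: 1 fatality, 2 injured, 2 vehicles, 4 participants
fatal <- 1; injury <- 2; vehicles <- 2; participants <- 4
severity <- (fatal + injury) / (vehicles + participants)  # 3 / 6 = 0.5
ifelse(severity > 0.35, "serious", "moderate")            # "serious"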
# read data
clean_data <- read.csv("clean_data.csv")
clean_data <- clean_data %>% select(-X)
clean_data <- data.frame(clean_data)
clean_data$date <- ymd(clean_data$date)
clean_data$time <- as.POSIXct(clean_data$time)
clean_data <- clean_data %>% mutate(hour = hour(time),
                                    casualties = fatal + injury)
clean_data$month <- month(clean_data$date, label = TRUE)
clean_data$day <- wday(clean_data$date, label = TRUE)
clean_data <- clean_data %>% mutate(cas_type = case_when(
fatal > 0 ~ "fatal",
injury > 0 ~ "injury",
TRUE ~ "non-injury"
))
# add severity rate and severity class to dataset
sev_rate <- clean_data %>%
mutate(kts_kuch = kts + kuch,
severity = casualties / kts_kuch) %>%
mutate(year = year(date),
mmonth = month(date),
mday = mday(date),
hhour = hour(time),
mminute = minute(time)) %>%
mutate(timeline = make_datetime(year, mmonth, mday, hhour, mminute))
sev_rate <- sev_rate %>% mutate(sev_class = case_when(
severity > 0.35 ~ "serious",
severity <= 0.35 ~ "moderate"
))
sev_rate$sev_class <- as.factor(sev_rate$sev_class)
sev_rate$cas_type <- as.factor(sev_rate$cas_type)
sev_rate <- sev_rate %>% select(-c(driving_mode, hour, year, mmonth, mday,
hhour, mminute))
class_serious <- sev_rate %>%
filter(sev_class == "serious") %>% select(cas_type)
class_moderate <- sev_rate %>%
filter(sev_class == "moderate") %>% select(cas_type)
summary(class_serious)
## cas_type
## fatal : 74
## injury:759
summary(class_moderate)
## cas_type
## fatal : 108
## injury:4919
The number of casualties per month increased from January to October 2019, then fell back to the January level in November 2019. The peak was in August, with 800 casualties compared to the lowest level of 460 casualties in February. Casualties in summer and early fall were higher than in other seasons.
Casualties were unevenly distributed across days of the week. Most accidents with casualties happened from Friday to Sunday; we call them weekend accidents. There were more weekend accidents in May - August 2019 than in other months, and the number of Monday accidents in June - August 2019 also increased compared to other months. The reason behind this seasonality could be summer weekend migration: people travel outside the city on Friday evening and return on Sunday evening.
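The charts behind these observations are not shown here; a sketch that reproduces them from the sev_rate data frame built above:
# casualties by day of week, faceted by month
sev_rate %>%
  group_by(month, day) %>%
  summarise(casualties = sum(casualties)) %>%
  ggplot(aes(day, casualties)) +
  geom_col() +
  facet_wrap(~ month) +
  labs(x = "", y = "casualties", title = "Casualties by weekday and month")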
The majority of accidents happened on dry roads; there were half as many accidents on wet roads. Police officers did not fill in the road conditions in 1216 cases - 21% of all accidents.
Breaking serious accidents down by day of the week, time of day and road condition gives the following picture.
Sunday. Accidents on dry roads happened at night, from 00:00AM to 04:00AM; the second peak was in the late evening, after 06:00PM. Few serious accidents were registered with snowy road conditions, and those occurred in the evening. The distribution of accidents with unknown road conditions follows that of dry roads. Sunday accidents on wet roads were evenly distributed over the time of day.
Monday. There were fewer serious accidents on Mondays than on Sundays. Accidents were distributed more or less evenly during the day for all types of road conditions, with a gap between 02:00AM and 05:00AM when there were very few accidents.
Tuesday. Patterns followed the Monday distributions for all types of road conditions, except for two outliers: two major accidents happened on Tuesday early mornings, on dry and wet roads.
Wednesday. Dry-road accidents were evenly distributed over the day. Snowy-road accidents happened mainly in the evening. The pattern for unknown road conditions differed from other days: the number of serious accidents progressively increased from 00:00AM to 01:00PM, then dropped. Serious accidents on wet roads occurred in the morning and in the late evening.
Thursday. The distribution of serious accidents on dry roads over the day looks roughly normal, with most accidents in daytime hours. The same is true for unknown road conditions. Accidents on wet roads happened during the night and evening hours.
Friday. Accidents on dry roads happened after 04:00PM. This may relate to the earlier finding that there were many Friday accidents during May - August. Friday accidents were evenly distributed over the day for other road conditions.
Saturday. Accidents were evenly distributed over the day for all road conditions. There were some outliers for dry roads in the evening - serious accidents with many casualties.
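A sketch of the kind of chart these day-by-day observations are read from (our reconstruction, not the original figure):
# serious accidents by hour of day, split by road condition and weekday
sev_rate %>%
  filter(sev_class == "serious") %>%
  mutate(hour = hour(timeline)) %>%
  ggplot(aes(hour)) +
  geom_histogram(binwidth = 1) +
  facet_grid(road_cond ~ day) +
  labs(x = "hour of day", y = "serious accidents")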
Collisions of two or more vehicles caused 4000 casualties in 2019 and were the major cause of injuries. Collisions happened in clear and cloudy weather equally often; only a minor share occurred in adverse weather conditions: snow, rain, fog.
Pedestrian hits were the second major cause of injury or death, with half as many casualties as collisions. Cars hit pedestrians equally often in clear and cloudy weather. Third by cause of injuries were passenger falls - many of these accidents involved public transport: buses and trams.
ggplot(sev_rate) +
geom_col(aes(x = dtpv, y = casualties, fill = weather_cond)) +
coord_flip() + xlab("accident type") +
ggtitle("Casualties by type")
Collisions caused serious injuries when the number of cars was 2-3 but the vehicles carried passengers, i.e. the number of participants exceeded 5. As the number of vehicles increased, severity decreased. There is an outlier: a collision with 2 vehicles and 17 participants - an accident involving public transport.
The severity of pedestrian hits decreased as the number of vehicles increased and increased as the number of participants increased. Pedestrian hits where one vehicle hit two or more pedestrians caused serious casualties.
Passenger falls involved 1 or 2 vehicles; the number of casualties grew with the number of participants. There were two outliers in this type of accident - passenger falls involving 11 and 14 passengers.
Obstacle-hit severity increased as the number of participants increased and decreased as the number of vehicles increased.
ggplot(sev_rate, aes(kts, dtpv, fill = severity)) +
geom_tile(color = "white") + xlab("number of vehicles") +
scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9)) +
facet_wrap(~ kuch, nrow = 2) + ylab("")
The accident density map shows that more accidents happened in the northern districts of the city: Vyborgskiy and Kalininskiy. There were fewer accidents in the southern and western parts of the city. Accidents concentrated along the main avenues.
Many serious accidents happened at intersections of avenues and streets with 3-4 lanes. The spatial distribution of accidents corresponds to the routes people take to work and back.
There was a high correlation between the density of accidents and the number of lanes on a road in the city center. The hottest point in this area was the intersection of Nevskiy and Liteyniy prospects; another cluster of hot points was on Sadovaya street.
library(plotly)
# geo styling
# note: the latitude and longitude columns appear swapped in the source
# data, which is why lat/lon are crossed in the mapping below
fig <- sev_rate
fig <- fig %>%
plot_ly(
type = 'densitymapbox',
lat = ~longitude,
lon = ~latitude,
coloraxis = 'coloraxis',
radius = 10
)
fig <- fig %>%
layout(
mapbox = list(
style = "open-street-map",
center = list(lon = 30, lat = 59)), coloraxis = list(colorscale = "Viridis")
)
fig
The distribution of accident severity followed the density of accidents. Severe accidents with many casualties occurred on the main avenues. Serious accidents clustered on the roads leading from the city suburbs - the main travel routes. The northern districts had more serious accidents than the southern parts of the city.
Local hot points were at intersections in the Central and Moskovskiy districts. The trend was that the more lanes a street has, the heavier the casualties.
library(leaflet)
mybins <- seq(0, 0.9, by = 0.1)
mypalette <- colorBin(palette="Reds", domain=sev_rate$severity,
na.color="transparent", bins=mybins)
mytext <- paste(
"Type: ", sev_rate$dtpv, "<br/>",
"Date: ", sev_rate$date, "<br/>",
"Severity: ", sev_rate$cas_type, sep = ""
) %>% lapply(htmltools::HTML)
m <- leaflet(sev_rate) %>%
addTiles() %>%
setView(lat = 59.93, lng = 30.2, zoom = 9) %>%
addCircleMarkers(
~latitude, ~longitude,
fillColor = ~mypalette(severity), fillOpacity = 0.7, color = "white",
radius = ~casualties, stroke = FALSE,
label = mytext,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px", direction = "auto")) %>%
addLegend( pal = mypalette, values = ~severity, opacity = 0.9,
title = "Severity", position = "bottomright")
m
Casualties from traffic accidents increased in summer, when there was more traffic on the roads. Serious accidents happened on weekends, when a lot of traffic was coming in and out of the city, and mostly on dry roads in clear conditions. Collisions of 2-3 vehicles at intersections of main avenues and multi-lane streets caused the most severe accidents. The northern part of the city suffered more serious accidents than the southern part.
Possible reasons behind the observed facts:
- road grid design allows high-speed intersections, despite the artificial bumps installed at many intersections;
- more populated parts of the city suffer from road traffic overload due to weekend migration;
- insufficient development of narrow two-lane streets with natural speed limits.
Features must be limited to data that an emergency service operator could collect during a telephone call. This reduces the number of features to a minimum:
- type of accident;
- day, month, time;
- district - obtained from address of an accident;
- number of vehicles;
- number of participants;
- road conditions.
long_list <- sev_rate %>% select(dtpv, day, month, date, time, district,
kts, kuch, road_cond, sev_class)
We need to examine the covariance of the predictors. We transform factor variables into the numeric codes of their levels. Then we split the data into training and testing sets using the caret package: 60% goes to the training set, 40% to the testing set.
long_list <- long_list %>%
select(-c(day, month)) %>% mutate(dtpvnum = as.numeric(dtpv),
distrnum = as.numeric(district),
roadnum = as.numeric(road_cond)) %>%
mutate(mon = month(date), d = mday(date), h = hour(time))
library(Hmisc)
library(corrplot)
library(caret)
set.seed(5860)
in_train <- createDataPartition(y = long_list$sev_class, p = 0.6, list = FALSE)
training <- long_list[in_train,]
testing <- long_list[-in_train,]
dim(training)
## [1] 3517 14
There is a moderate correlation between the number of vehicles and the number of participants, and a correlation between the number of vehicles and the type of accident. There is no other significant covariation between the features. The revealed relationships between predictors correspond to the logic of the phenomenon under study: in an accident with 2 or more cars, the number of participants increases.
cor_data <- training %>% select(kts, kuch, dtpvnum, distrnum, roadnum, mon, d, h)
res2 <- rcorr(as.matrix(cor_data))
res2
## kts kuch dtpvnum distrnum roadnum mon d h
## kts 1.00 0.58 -0.73 0.03 -0.01 -0.02 0.00 -0.06
## kuch 0.58 1.00 -0.37 0.00 -0.01 -0.01 0.00 -0.03
## dtpvnum -0.73 -0.37 1.00 0.01 0.03 0.02 0.01 0.07
## distrnum 0.03 0.00 0.01 1.00 0.01 0.03 -0.03 0.00
## roadnum -0.01 -0.01 0.03 0.01 1.00 -0.11 -0.08 0.00
## mon -0.02 -0.01 0.02 0.03 -0.11 1.00 -0.04 0.01
## d 0.00 0.00 0.01 -0.03 -0.08 -0.04 1.00 0.01
## h -0.06 -0.03 0.07 0.00 0.00 0.01 0.01 1.00
##
## n= 3517
##
##
## P
## kts kuch dtpvnum distrnum roadnum mon d h
## kts 0.0000 0.0000 0.0881 0.4354 0.2784 0.9468 0.0002
## kuch 0.0000 0.0000 0.9038 0.7528 0.3987 0.9900 0.0562
## dtpvnum 0.0000 0.0000 0.6525 0.1248 0.1685 0.4820 0.0000
## distrnum 0.0881 0.9038 0.6525 0.5199 0.1219 0.0532 0.8405
## roadnum 0.4354 0.7528 0.1248 0.5199 0.0000 0.0000 0.7892
## mon 0.2784 0.3987 0.1685 0.1219 0.0000 0.0173 0.5538
## d 0.9468 0.9900 0.4820 0.0532 0.0000 0.0173 0.4671
## h 0.0002 0.0562 0.0000 0.8405 0.7892 0.5538 0.4671
corrplot(res2$r, type = "upper", order = "hclust",
p.mat = res2$P, sig.level = 0.01, insig = "pch")
The number of participants and the number of vehicles are not normally distributed, so we additionally test the correlation using the Spearman rank correlation coefficient. The calculation confirms the presence of a correlation between these predictors.
spearman <- training %>% select(kts, kuch) %>% group_by(kts) %>%
summarise(kuch = mean(kuch))
cor.test(spearman$kts, spearman$kuch , method = "spearman")
##
## Spearman's rank correlation rho
##
## data: spearman$kts and spearman$kuch
## S = 6.5375, p-value = 0.001111
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9221722
The correlation between the predictors would allow us to reduce the dimensionality of the dataset, but reducing the dimension worsens the interpretability of each predictor. Interpretation plays an important role in this model, so we do not reduce the number of predictors.
short_list <- training %>% select(dtpv, mon, d, h, district, kts, kuch,
road_cond, sev_class)
dim(short_list)
## [1] 3517 9
In this research we must predict whether an ambulance is needed at the accident site. This is a binary classification problem: serious accidents require an ambulance, moderate accidents do not. The positive class for our classifier is a serious accident.
If a case is classified as a false negative, the price of the mistake is a person's life or health; the price of a false positive is the extra cost of the ambulance service. An increase in the number of FN leads to more deaths or injuries, while an increase in FP only requires more money. Therefore, sensitivity is critical for our model.
We also estimate classifier performance by ROC and by cost, where the cost of a model is measured as the time required to train it.
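For reference, sensitivity is the share of actual positives the classifier catches, TP / (TP + FN); a toy computation with hypothetical counts:
# hypothetical confusion-matrix counts for the positive class "serious"
TP <- 150; FN <- 180
TP / (TP + FN)  # ~0.45: more than half of serious accidents would be missed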
The baseline model is a GLM; the alternative models are a decision tree and a random forest. We train the models with repeated cross-validation on 10 folds with 3 repeats. For the linear model we use the numeric codes of the factor levels; for the other models we use the factors themselves. We determine the best model by ROC and sensitivity.
ctrl_lm <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
summaryFunction = twoClassSummary,
classProbs = TRUE)
num_list <- training %>% select(dtpvnum, mon, d, h, distrnum,
roadnum, kts, kuch, sev_class)
lm_start <- Sys.time()
lm.fit <- train(sev_class ~.,
                data = num_list,  # numeric feature codes for the linear model
                method = "glm",
                metric = "ROC",
                trControl = ctrl_lm)
lm_stop <- Sys.time()
lm_diff <- lm_stop - lm_start
rpart_start <- Sys.time()
rpart.fit <- train(sev_class ~.,
data = short_list,
method = "rpart",
metric = "ROC",
trControl = ctrl_lm)
rpart_stop <- Sys.time()
rpart_diff <- rpart_stop - rpart_start
rf_start <- Sys.time()
rf.fit <- train(sev_class ~.,
data = short_list,
method = "rf",
metric = "ROC",
trControl = ctrl_lm)
rf_stop <- Sys.time()
rf_diff <- rf_stop - rf_start
The random forest shows the best ROC and sensitivity. The linear model and the decision tree are worse than the random forest in overall performance; the linear model is better than the decision tree in terms of ROC, while the decision tree is preferable in terms of sensitivity. However, the cost of training the random forest exceeds that of the other two models by an order of magnitude.
lm_roc <- as.numeric(round(lm.fit$results[2], 4))
rpart_roc <- as.numeric(round(rpart.fit$results[1, 2], 4))
rf_roc <- as.numeric(round(rf.fit$results[2, 2], 4))
lm_sens <- as.numeric(round(lm.fit$results[4], 4))
rpart_sens <- as.numeric(round(rpart.fit$results[1, 4], 4))
rf_sens <- as.numeric(round(rf.fit$results[2, 4], 4))
compare_models <- tribble(~model, ~roc, ~sens, ~train_time,
"glm", lm_roc, lm_sens, lm_diff,
"rpart", rpart_roc, rpart_sens, rpart_diff,
"rf", rf_roc, rf_sens, rf_diff)
compare_models
## # A tibble: 3 x 4
## model roc sens train_time
## <chr> <dbl> <dbl> <drtn>
## 1 glm 0.878 0.373 6.806237 secs
## 2 rpart 0.828 0.383 3.434420 secs
## 3 rf 0.904 0.508 546.422988 secs
We choose the decision tree for its combination of performance and training cost. Its performance will be checked again at a later stage, after tuning the model and dealing with the imbalanced dataset.
An imbalanced dataset is one where one class outnumbers the other by a large proportion. Imbalanced datasets do not contain enough information on the minority class for an algorithm to make an accurate prediction, so it is desirable to train classification algorithms on balanced datasets.
The car accident dataset for Saint-Petersburg is imbalanced: the prevalence of serious accidents is 14.2 percent. Decision tree or random forest algorithms trained on this dataset return insufficient levels of sensitivity: 38 and 50 percent respectively.
A common way to deal with an imbalanced dataset is to transform the data into a balanced distribution. Four approaches are available in R for this purpose (a sketch of the first two follows the descriptions below):
- undersampling;
- oversampling;
- synthetic data generation;
- cost sensitive learning.
The undersampling method randomly removes majority class observations from the dataset until the majority and minority classes are balanced. It is applicable to large datasets, where reducing the number of observations benefits training time. The flip side of this approach is that you might lose important features of the majority class.
The oversampling method randomly replicates minority class observations. It is suitable for relatively small datasets, where you do not want to lose any valuable information. The downside is that it only replicates existing observations, which leads to overfitting.
Synthetic data generation creates an artificial dataset based on feature similarities, producing many different combinations of the observed minority class features. This avoids the overfitting problem of plain oversampling.
The cost sensitive learning method assigns misclassification costs to FP and FN - each total cost is the number of false classifications multiplied by its unit cost - and minimizes the overall cost.
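For comparison with the synthetic data generation used below, ROSE also provides under- and oversampling through ovun.sample(); a sketch:
# undersampling the majority class and oversampling the minority class
short_list_under <- ovun.sample(sev_class ~ ., data = short_list,
                                method = "under", seed = 1)$data
short_list_over  <- ovun.sample(sev_class ~ ., data = short_list,
                                method = "over", seed = 1)$data
prop.table(table(short_list_under$sev_class))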
We use the ROSE package to build a balanced dataset with the synthetic data generation approach.
prop.table(table(short_list$sev_class))
##
## moderate serious
## 0.8578334 0.1421666
short_list_balanced <- ROSE(sev_class ~., data = short_list, seed = 1)$data
prop.table(table(short_list_balanced$sev_class))
##
## moderate serious
## 0.5146432 0.4853568
Then we train a decision tree on the balanced dataset without any tuning and test it on the testing set. The result of this model is our reference point for model tuning.
test_list <- testing %>% select(dtpv, mon, d, h, district, kts, kuch,
road_cond, sev_class)
rpart.fit.bal <- train(sev_class ~.,
data = short_list_balanced,
method = "rpart",
metric = "ROC",
trControl = ctrl_lm)
rpart.pred2 <- predict(rpart.fit.bal, test_list)
confusionMatrix(rpart.pred2, test_list$sev_class, positive = "serious")
## Confusion Matrix and Statistics
##
## Reference
## Prediction moderate serious
## moderate 1828 181
## serious 182 152
##
## Accuracy : 0.8451
## 95% CI : (0.8298, 0.8595)
## No Information Rate : 0.8579
## P-Value [Acc > NIR] : 0.9632
##
## Kappa : 0.3655
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.45646
## Specificity : 0.90945
## Pos Pred Value : 0.45509
## Neg Pred Value : 0.90991
## Prevalence : 0.14213
## Detection Rate : 0.06487
## Detection Prevalence : 0.14255
## Balanced Accuracy : 0.68295
##
## 'Positive' Class : serious
##
roc.curve(test_list$sev_class, rpart.pred2, plotit = FALSE)
## Area under the curve (AUC): 0.683
We use three methods to tune the model: adding features, tuning the maximum depth of the tree, and tuning the complexity parameter. We start by adding features to our dataset.
clean_data <- read.csv("clean_data.csv")
clean_data <- clean_data %>% select(-X)
clean_data <- data.frame(clean_data)
clean_data$date <- ymd(clean_data$date)
clean_data$time <- as.POSIXct(clean_data$time)
clean_data <- clean_data %>% mutate(casualties = fatal + injury)
clean_data$month <- month(clean_data$date, label = TRUE)
clean_data$day <- wday(clean_data$date, label = TRUE)
clean_data <- clean_data %>% mutate(cas_type = case_when(
fatal > 0 ~ "fatal",
injury > 0 ~ "injury",
TRUE ~ "non-injury"
))
sev_rate <- clean_data %>%
mutate(kts_kuch = kts + kuch,
severity = casualties / kts_kuch) %>%
mutate(year = year(date),
mmonth = month(date),
mday = mday(date),
hhour = hour(time),
mminute = minute(time)) %>%
mutate(timeline = make_datetime(year, mmonth, mday, hhour, mminute))
sev_rate <- sev_rate %>% mutate(sev_class = case_when(
severity > 0.35 ~ "serious",
severity <= 0.35 ~ "moderate"
))
sev_rate$sev_class <- as.factor(sev_rate$sev_class)
sev_rate$cas_type <- as.factor(sev_rate$cas_type)
sev_rate <- sev_rate %>% select(-c(driving_mode, year, mmonth, mday,
hhour, mminute))
long_list <- sev_rate %>% select(dtpv, time, district, kts,
kuch, latitude,
longitude, weather_cond, road_cond, month,
day, sev_class)
long_list$dtpv <- as.factor(long_list$dtpv)
long_list$district <- as.factor(long_list$district)
long_list$road_cond <- as.factor(long_list$road_cond)
long_list$month <- as.numeric(long_list$month)
long_list$day <- as.numeric(long_list$day)
long_list$hour <- hour(long_list$time)
long_list$minute <- minute(long_list$time)
long_list <- long_list %>% mutate(hour_minute = hour + minute / 60)
long_list$time <- long_list$hour_minute
long_list <- long_list %>% select(-c(hour, minute, hour_minute))
set.seed(6000)
in_train <- createDataPartition(y = long_list$sev_class, p = 0.7, list = FALSE)
training <- long_list[in_train,]
testing <- long_list[-in_train,]
data_rose <- ROSE(sev_class ~., data = training, seed = 1)$data
Now we are ready to train a model, starting with the caret package. The tuneLength parameter is set to 15, and we use repeated cross-validation with 10 folds and 10 repeats. This model achieves a sensitivity of 71% and an AUC of 81.3 percent - much better than the original model.
set.seed(1)
mod_rpart <- caret::train(sev_class ~.,
method = "rpart",
data = data_rose,
tuneLength = 15,
metric = "ROC",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE))
pred_rpart <- predict(mod_rpart, testing)
confusionMatrix(pred_rpart, testing$sev_class, positive = "serious")
## Confusion Matrix and Statistics
##
## Reference
## Prediction moderate serious
## moderate 1381 72
## serious 127 177
##
## Accuracy : 0.8867
## 95% CI : (0.871, 0.9012)
## No Information Rate : 0.8583
## P-Value [Acc > NIR] : 0.0002503
##
## Kappa : 0.5737
##
## Mcnemar's Test P-Value : 0.0001292
##
## Sensitivity : 0.7108
## Specificity : 0.9158
## Pos Pred Value : 0.5822
## Neg Pred Value : 0.9504
## Prevalence : 0.1417
## Detection Rate : 0.1007
## Detection Prevalence : 0.1730
## Balanced Accuracy : 0.8133
##
## 'Positive' Class : serious
##
roc.curve(testing$sev_class, pred_rpart, plotit = FALSE)
## Area under the curve (AUC): 0.813
Let's see if we can improve the model manually. We use the rpart package to tune the model further. We take the complexity parameter from the best caret model, limit the depth of the tree to 6 with the maxdepth parameter, and require at least 100 observations in a node before attempting a split with the minsplit parameter. Tuning results in a sensitivity of 84.3% and an AUC of 82.0 percent - better than the previous model.
set.seed(1)
mod_tree <- rpart(sev_class ~.,
data = data_rose,
method = "class",
control = rpart.control(cp = mod_rpart$bestTune,
minsplit = 100,
maxdepth = 6))
pred_tree <- predict(mod_tree, testing, type = "class")
confusionMatrix(pred_tree, testing$sev_class, positive = "serious")
## Confusion Matrix and Statistics
##
## Reference
## Prediction moderate serious
## moderate 1201 39
## serious 307 210
##
## Accuracy : 0.8031
## 95% CI : (0.7837, 0.8214)
## No Information Rate : 0.8583
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4415
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8434
## Specificity : 0.7964
## Pos Pred Value : 0.4062
## Neg Pred Value : 0.9685
## Prevalence : 0.1417
## Detection Rate : 0.1195
## Detection Prevalence : 0.2943
## Balanced Accuracy : 0.8199
##
## 'Positive' Class : serious
##
roc.curve(testing$sev_class, pred_tree, plotit = FALSE)
## Area under the curve (AUC): 0.820
Exploration of the car accident data shows that the Saint-Petersburg road network has many hot points where severe car accidents happen frequently. There is a need to redesign intersections and calm the traffic on the streets, and understanding the severity of accidents could make those transformations effective.
One way to understand and predict the severity of car accidents is to use decision tree models. The decision tree in this research can be considered a good classifier for splitting accidents into severe and non-severe.
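A quick way to inspect the splits of the final classifier (assuming the rpart.plot package is installed):
# plot the tuned decision tree
library(rpart.plot)
rpart.plot(mod_tree, type = 2, extra = 104,
           main = "Accident severity decision tree")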