Introduction to the Analysis of Accident Severity in Proximity to Hotspots

Road safety is a paramount concern for both individuals and authorities, with the impact of traffic accidents ranging from minor to fatal. Understanding the factors that contribute to the severity of these incidents can inform preventive measures and policies. This analysis investigates how the distance from an accident site to a traffic hotspot (a location where accidents occur more frequently) relates to the severity of the accident. The working hypothesis is that accidents occurring closer to hotspots are more likely to result in slight injuries than in severe or fatal outcomes.

To conduct this investigation, data from the year 2017 were sourced from a comprehensive UK road safety dataset available on Kaggle. The dataset contains detailed reports of road accidents, providing a solid basis for spatial and quantitative analysis.

Methodology

The methodology of the analysis involves several stages:

  • Spatial Clustering: Using the DBSCAN clustering method, accident locations defined by longitude and latitude were grouped to identify areas of high accident concentration.

  • Hotspot Identification: The centers of the identified clusters were calculated, establishing the hotspots. Subsequently, the distances of specific accident sites to these hotspots were determined.

  • Data Modeling: With the distances and other relevant variables at hand, the XGBoost algorithm was employed to model the data and examine the relationship between proximity to hotspots and accident severity.

The outcome of this analysis aims to shed light on the spatial dynamics of road accidents and their outcomes, potentially offering insights into road safety enhancement strategies.

# Load libraries
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)
library(sf)
library(here)
library(dbscan)
library(leaflet)
library(geosphere)
library(randomForest)
library(pROC)
library(caret)
library(viridis)
library(ggridges)
library(xgboost)
library(DALEX)

# Set working directory
setwd(here())

Read data

# Load data
accident_data <- read_csv("accident_data.csv")
## Rows: 129982 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (19): Accident_Index, 1st_Road_Class, 2nd_Road_Class, Accident_Severity...
## dbl  (13): 1st_Road_Number, 2nd_Road_Number, Did_Police_Officer_Attend_Scene...
## date  (1): Date
## time  (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(accident_data)
##  Accident_Index     1st_Road_Class     1st_Road_Number 2nd_Road_Class    
##  Length:129982      Length:129982      Min.   :   0    Length:129982     
##  Class :character   Class :character   1st Qu.:   0    Class :character  
##  Mode  :character   Mode  :character   Median :  43    Mode  :character  
##                                        Mean   : 852                      
##                                        3rd Qu.: 587                      
##                                        Max.   :9786                      
##                                                                          
##  2nd_Road_Number  Accident_Severity  Carriageway_Hazards      Date           
##  Min.   :   0.0   Length:129982      Length:129982       Min.   :2017-01-01  
##  1st Qu.:   0.0   Class :character   Class :character    1st Qu.:2017-04-03  
##  Median :   0.0   Mode  :character   Mode  :character    Median :2017-07-04  
##  Mean   : 301.5                                          Mean   :2017-07-03  
##  3rd Qu.:   0.0                                          3rd Qu.:2017-10-04  
##  Max.   :9704.0                                          Max.   :2017-12-31  
##  NA's   :153                                                                 
##  Day_of_Week        Did_Police_Officer_Attend_Scene_of_Accident
##  Length:129982      Min.   :1.000                              
##  Class :character   1st Qu.:1.000                              
##  Mode  :character   Median :1.000                              
##                     Mean   :1.265                              
##                     3rd Qu.:2.000                              
##                     Max.   :3.000                              
##                                                                
##  Junction_Control   Junction_Detail       Latitude     Light_Conditions  
##  Length:129982      Length:129982      Min.   :49.93   Length:129982     
##  Class :character   Class :character   1st Qu.:51.47   Class :character  
##  Mode  :character   Mode  :character   Median :51.90   Mode  :character  
##                                        Mean   :52.44                     
##                                        3rd Qu.:53.39                     
##                                        Max.   :60.48                     
##                                        NA's   :29                        
##  Local_Authority_(District) Local_Authority_(Highway) Location_Easting_OSGR
##  Length:129982              Length:129982             Min.   : 73639       
##  Class :character           Class :character          1st Qu.:387279       
##  Mode  :character           Mode  :character          Median :457594       
##                                                       Mean   :451170       
##                                                       3rd Qu.:528910       
##                                                       Max.   :655391       
##                                                       NA's   :19           
##  Location_Northing_OSGR   Longitude       LSOA_of_Accident_Location
##  Min.   :  12107        Min.   :-7.4096   Length:129982            
##  1st Qu.: 176000        1st Qu.:-2.1908   Class :character         
##  Median : 224126        Median :-1.1498   Mode  :character         
##  Mean   : 283578        Mean   :-1.2684                            
##  3rd Qu.: 388829        3rd Qu.:-0.1417                            
##  Max.   :1177531        Max.   : 1.7596                            
##  NA's   :19             NA's   :29                                 
##  Number_of_Casualties Number_of_Vehicles Pedestrian_Crossing-Human_Control
##  Min.   : 1.000       Min.   : 1.000     Min.   :0.000                    
##  1st Qu.: 1.000       1st Qu.: 1.000     1st Qu.:0.000                    
##  Median : 1.000       Median : 2.000     Median :0.000                    
##  Mean   : 1.316       Mean   : 1.838     Mean   :0.023                    
##  3rd Qu.: 1.000       3rd Qu.: 2.000     3rd Qu.:0.000                    
##  Max.   :42.000       Max.   :23.000     Max.   :2.000                    
##                                          NA's   :2574                     
##  Pedestrian_Crossing-Physical_Facilities Police_Force      
##  Min.   :0.0000                          Length:129982     
##  1st Qu.:0.0000                          Class :character  
##  Median :0.0000                          Mode  :character  
##  Mean   :0.8631                                            
##  3rd Qu.:0.0000                                            
##  Max.   :8.0000                                            
##  NA's   :2765                                              
##  Road_Surface_Conditions  Road_Type         Special_Conditions_at_Site
##  Length:129982           Length:129982      Length:129982             
##  Class :character        Class :character   Class :character          
##  Mode  :character        Mode  :character   Mode  :character          
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##   Speed_limit        Time          Urban_or_Rural_Area Weather_Conditions
##  Min.   :20.00   Length:129982     Length:129982       Length:129982     
##  1st Qu.:30.00   Class1:hms        Class :character    Class :character  
##  Median :30.00   Class2:difftime   Mode  :character    Mode  :character  
##  Mean   :37.26   Mode  :numeric                                          
##  3rd Qu.:40.00                                                           
##  Max.   :70.00                                                           
##                                                                          
##       Year       InScotland       
##  Min.   :2017   Length:129982     
##  1st Qu.:2017   Class :character  
##  Median :2017   Mode  :character  
##  Mean   :2017                     
##  3rd Qu.:2017                     
##  Max.   :2017                     
## 
head(accident_data)
## # A tibble: 6 × 34
##   Accident_Index `1st_Road_Class` `1st_Road_Number` `2nd_Road_Class`
##   <chr>          <chr>                        <dbl> <chr>           
## 1 2017010001708  A                              105 <NA>            
## 2 2017010009342  A                                5 Unclassified    
## 3 2017010009344  A                               13 C               
## 4 2017010009348  A                             1010 B               
## 5 2017010009350  A                              107 A               
## 6 2017010009351  Unclassified                     0 <NA>            
## # ℹ 30 more variables: `2nd_Road_Number` <dbl>, Accident_Severity <chr>,
## #   Carriageway_Hazards <chr>, Date <date>, Day_of_Week <chr>,
## #   Did_Police_Officer_Attend_Scene_of_Accident <dbl>, Junction_Control <chr>,
## #   Junction_Detail <chr>, Latitude <dbl>, Light_Conditions <chr>,
## #   `Local_Authority_(District)` <chr>, `Local_Authority_(Highway)` <chr>,
## #   Location_Easting_OSGR <dbl>, Location_Northing_OSGR <dbl>, Longitude <dbl>,
## #   LSOA_of_Accident_Location <chr>, Number_of_Casualties <dbl>, …
# Remove unnecessary columns
accident_data <- accident_data %>%
  select(
    Longitude, Latitude,
    Accident_Severity,
    Date, Time,
    Weather_Conditions, Light_Conditions,
    Road_Surface_Conditions,
    Speed_limit,
    Urban_or_Rural_Area,
    Number_of_Casualties, Number_of_Vehicles
  )

Data adjustments

# Remove observations with missing values
accident_data <- accident_data %>%
  filter(!is.na(Longitude), !is.na(Latitude))
# Encode the target variable (0 = Slight, 1 = Severe/Fatal). Labels not listed
# in recode() (most likely the dataset's "Serious" category) become NA, which
# triggers the warning below; those rows are removed before modelling.
accident_data$Accident_Severity <- accident_data$Accident_Severity %>%
  recode("Slight" = 0, "Severe" = 1, "Fatal" = 1)
## Warning: Unreplaced values treated as NA as `.x` is not compatible.
## Please specify replacements exhaustively or supply `.default`.
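# Quick check (an optional sketch, not part of the original pipeline): tabulate
# the recoded target, including the NAs produced by unmatched severity labels.
table(accident_data$Accident_Severity, useNA = "ifany")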
# Format the date and time variables
accident_data$Date <- as.Date(accident_data$Date, format = "%Y-%m-%d")
accident_data$Year <- format(accident_data$Date, "%Y")

# Note: weekdays() returns day names in the session locale (Polish in the
# summary output further below)
accident_data$Day_of_Week <- weekdays(accident_data$Date)
accident_data$Month <- format(accident_data$Date, "%m")
accident_data$Hour <- hour(hms(accident_data$Time))

accident_data$TimeOfDay <- cut(accident_data$Hour,
  breaks = c(-Inf, 6, 12, 18, Inf),
  labels = c("Night", "Morning", "Afternoon", "Evening")
)

DBSCAN clustering

# Convert Longitude and Latitude to a matrix
coordinates <- as.matrix(accident_data[, c("Longitude", "Latitude")])
# Perform DBSCAN clustering
dbscan_result <- dbscan(coordinates, eps = 0.1, minPts = 500)

# Extract cluster labels
cluster_labels <- dbscan_result$cluster
accident_data$Cluster <- cluster_labels

# Filter the accident_data dataframe to include only observations in a cluster
clustered_accident_data <- accident_data[accident_data$Cluster > 0, ]
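The eps and minPts values above are judgment calls. A common heuristic for choosing eps, sketched below as an optional check rather than part of the original pipeline, is to plot the distance from each point to its minPts-th nearest neighbour and look for a sharp bend ("elbow").

# Optional sketch: k-nearest-neighbour distance plot to guide the choice of eps
kNNdistplot(coordinates, k = 500)
abline(h = 0.1, lty = 2) # the eps value used above, for reference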

Calculate distance to hotspots

# Calculate the centroids of the clusters
centroids <- tapply(
  seq_len(nrow(clustered_accident_data)),
  clustered_accident_data$Cluster,
  function(rows) colMeans(clustered_accident_data[rows, 1:2])
)

# Calculate the distance from each point to its closest centroid
accident_data$DistanceToHotspot <- apply(coordinates, 1, function(coord) {
  min(sapply(centroids, function(centroid) dist(rbind(coord, centroid))))
})
summary(accident_data$DistanceToHotspot)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000044 0.099050 0.220209 0.277845 0.372619 4.873858
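These distances are Euclidean distances in degrees of longitude and latitude, which is only a rough approximation at UK latitudes. A sketch of how great-circle distances in metres could be computed instead, using distHaversine() from the geosphere package loaded earlier (not used in the analysis below):

# Sketch: distance in metres from each accident to the nearest hotspot centre
centroid_matrix <- do.call(rbind, centroids) # one Longitude/Latitude row per cluster
distance_m <- apply(coordinates, 1, function(coord) {
  min(distHaversine(coord, centroid_matrix))
})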

Plot clustering outcome

# Create a color palette for clusters
cluster_palette <- colorFactor(
  palette = "Set1",
  domain = clustered_accident_data$Cluster
)

# Create a leaflet map with a random sample of points
map <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    data = sample_n(clustered_accident_data, size = 10000),
    lng = ~Longitude,
    lat = ~Latitude,
    color = ~ cluster_palette(Cluster),
    radius = 3,
    popup = ~ paste("Grid Cell:", Cluster)
  )
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
map

Clustering summary

The clustering analysis depicted on the map has successfully identified 29 distinct clusters, primarily situated near densely populated areas such as major cities and urban agglomerations. This pattern is not unexpected as higher traffic volumes in such areas typically lead to a greater number of incidents. The visualization clearly indicates these hotspots, with varying cluster sizes likely reflecting the intensity of traffic accident occurrences. Recognizing these clusters is instrumental for urban planners and public safety officials, as it can guide traffic management strategies, resource deployment for emergency services, and the implementation of safety measures to mitigate accident risks in these high-density zones.

Exploratory Data Analysis

# Filter unnecessary columns
accident_data <- accident_data[, !(names(accident_data) %in% c(
  "Longitude", "Latitude", "Cluster", "Hour",
  "Number_of_Casualties", "Date", "Time", "Year"
))]

# Remove observations with missing values
accident_data <- accident_data[!is.na(accident_data$Accident_Severity), ]

# Convert all columns to factors
accident_data <- accident_data %>%
  mutate_if(is.character, as.factor)

summary(accident_data)
##  Accident_Severity             Weather_Conditions
##  Min.   :0.0000    Fine no high winds   :86117   
##  1st Qu.:0.0000    Raining no high winds:11964   
##  Median :0.0000    Unknown              : 4066   
##  Mean   :0.0156    Other                : 2244   
##  3rd Qu.:0.0000    Fine + high winds    : 1007   
##  Max.   :1.0000    Raining + high winds :  893   
##                    (Other)              : 1134   
##                      Light_Conditions                 Road_Surface_Conditions
##  Darkness - lighting unknown : 2401   Data missing or out of range: 1819     
##  Darkness - lights lit       :21947   Dry                         :76340     
##  Darkness - lights unlit     :  767   Flood over 3cm. deep        :   86     
##  Darkness - no lighting      : 5107   Frost or ice                : 1949     
##  Data missing or out of range:    1   Snow                        :  377     
##  Daylight                    :77202   Wet or damp                 :26854     
##                                                                              
##   Speed_limit    Urban_or_Rural_Area Number_of_Vehicles       Day_of_Week   
##  Min.   :20.00   Rural:33846         Min.   : 1.000     czwartek    :16547  
##  1st Qu.:30.00   Urban:73579         1st Qu.: 1.000     niedziela   :11965  
##  Median :30.00                       Median : 2.000     piątek      :17590  
##  Mean   :36.95                       Mean   : 1.862     poniedziałek:14986  
##  3rd Qu.:40.00                       3rd Qu.: 2.000     sobota      :13998  
##  Max.   :70.00                       Max.   :23.000     środa       :16091  
##                                                         wtorek      :16248  
##      Month           TimeOfDay     DistanceToHotspot 
##  11     : 9903   Night    : 7696   Min.   :0.000044  
##  01     : 9367   Morning  :34583   1st Qu.:0.096830  
##  07     : 9335   Afternoon:47189   Median :0.216116  
##  10     : 9306   Evening  :17954   Mean   :0.272476  
##  09     : 9174   NA's     :    3   3rd Qu.:0.365946  
##  06     : 9079                     Max.   :4.873858  
##  (Other):51261

EDA

ggplot(
  accident_data,
  aes(x = factor(Accident_Severity), fill = factor(Accident_Severity))
) +
  geom_bar(stat = "count") +
  scale_fill_manual(
    values = c("0" = "blue", "1" = "red"),
    labels = c("0" = "Slight", "1" = "Severe or Fatal"),
    name = "Accident Severity"
  ) +
  labs(
    x = "Accident Severity",
    y = "Count",
    title = "Distribution of Accident Severity"
  ) +
  theme_minimal()

ggplot(accident_data, aes(x = Speed_limit, fill = after_stat(count))) +
  geom_histogram(binwidth = 10, color = "black", show.legend = FALSE) +
  scale_fill_viridis(name = "Frequency", option = "D") +
  labs(
    x = "Speed Limit",
    y = "Frequency",
    title = "Distribution of Speed Limits"
  ) +
  theme_minimal()

ggplot(accident_data, aes(x = Weather_Conditions, fill = Weather_Conditions)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Accidents in Different Weather Conditions",
    x = "Weather Conditions",
    y = "Count"
  )

ggplot(accident_data, aes(x = Day_of_Week, fill = factor(Accident_Severity))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(
    values = c("0" = "lightblue", "1" = "red"),
    labels = c("0" = "Slight", "1" = "Severe or Fatal"),
    name = "Accident Severity"
  ) +
  labs(
    x = "Day of the Week",
    y = "Count",
    title = "Day of the Week vs Accident Severity"
  ) +
  theme_light() +
  theme(
    panel.background = element_rect(fill = "white", colour = "white"),
    plot.background = element_rect(fill = "white", colour = NA),
    legend.background = element_rect(fill = "white", colour = NA),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

ggplot(accident_data, aes(x = TimeOfDay, fill = factor(Accident_Severity))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(
    values = c("0" = "lightblue", "1" = "red"),
    labels = c("0" = "Slight", "1" = "Severe or Fatal"),
    name = "Accident Severity"
  ) +
  labs(
    x = "Time of Day",
    y = "Count",
    title = "Time of Day vs Accident Severity"
  ) +
  theme_light() +
  theme(
    panel.background = element_rect(fill = "white", colour = "white"),
    plot.background = element_rect(fill = "white", colour = NA),
    legend.background = element_rect(fill = "white", colour = NA),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

ggplot(accident_data, aes(x = DistanceToHotspot)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  xlim(0, 2) +
  labs(
    x = "Distance to Hotspot",
    y = "Density",
    title = "Empirical Density Function of Distance to Hotspot"
  ) +
  theme_minimal()
## Warning: Removed 122 rows containing non-finite values (`stat_density()`).

ggplot(
  accident_data,
  aes(x = factor(Accident_Severity), y = DistanceToHotspot)
) +
  geom_violin(trim = FALSE, fill = "skyblue", color = "black") +
  scale_x_discrete(
    name = "Accident Severity",
    labels = c("0" = "Slight", "1" = "Severe or Fatal")
  ) +
  theme_minimal()

ggplot(
  accident_data,
  aes(x = DistanceToHotspot, fill = factor(Accident_Severity))
) +
  geom_density(alpha = 0.5) +
  facet_grid(Accident_Severity ~ .) +
  scale_x_continuous(limits = c(0, 2)) +
  scale_fill_manual(
    values = c("0" = "lightblue", "1" = "red"),
    labels = c("0" = "Slight", "1" = "Severe or Fatal"),
    name = "Accident Severity"
  ) +
  theme_minimal()
## Warning: Removed 122 rows containing non-finite values (`stat_density()`).

Exploratory Data Analysis Summary

The exploratory data analysis stage provided several key insights from the visualizations created:

Distance to Hotspot

  • Slight accidents are concentrated closer to hotspots, whereas severe or fatal accidents occur across a wider range of distances, suggesting that a different set of factors may drive the more serious incidents.

Weather Conditions

  • The majority of accidents occur in clear weather conditions (“Fine no high winds”), which is likely a reflection of the predominance of such weather in the dataset. Adverse weather conditions, like rain or snow, are associated with fewer accidents. This might be due to more cautious driving or fewer people traveling under these conditions.

Day of the Week

  • Fridays experience the highest number of accidents, while Sundays have the fewest. This pattern may relate to the rhythms of workweek travel and weekend behaviors. Notably, the severity of accidents does not vary markedly with the day of the week.

Time of Day

  • Afternoon hours see a spike in accident counts, possibly linked to increased traffic volume, while nighttime has the fewest accidents, which could be attributed to lower traffic density. The proportion of severe or fatal accidents does not appear to vary much with the time of day.

Speed Limits

  • The distribution of accidents peaks at a speed limit of 30 mph, suggesting that most accidents occur in areas with this limit, typically urban settings where such limits are common; a quick cross-check of severity rates by speed limit is sketched below.
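As a sketch (not one of the original figures), a simple cross-tabulation can verify whether the rate of severe or fatal outcomes, and not just the raw count, varies with the speed limit:

# Sketch: share of severe/fatal accidents by speed limit
accident_data %>%
  group_by(Speed_limit) %>%
  summarise(accidents = n(), severe_or_fatal_rate = mean(Accident_Severity))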

Splitting the Data into Training and Testing Sets

In this phase of our analysis, we take the crucial step of dividing our dataset into training and testing subsets. This is fundamental for any machine learning application, as it allows us to train our models on one set of data (training set) and then test their performance on a separate, unseen set of data (testing set). We use the createDataPartition function from the caret package to achieve a balanced split, ensuring that our training and testing sets have a similar distribution of accident severities. By setting set.seed(123), we guarantee that our results are reproducible, an important aspect of scientific analysis.

# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(
  accident_data$Accident_Severity,
  p = 0.7, list = FALSE
)
train_data <- accident_data[train_indices, ]
test_data <- accident_data[-train_indices, ]

# Convert the data to numeric
train_data_numeric <- train_data %>%
  mutate_if(is.factor, as.numeric)
test_data_numeric <- test_data %>%
  mutate_if(is.factor, as.numeric)
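Converting factors with as.numeric() imposes an arbitrary integer coding on nominal variables such as Weather_Conditions. Tree-based models can often cope with this, but one-hot encoding is a common alternative; a minimal sketch, not used in the model below:

# Sketch: one-hot (dummy) encoding as an alternative to the integer coding above
train_matrix_ohe <- model.matrix(Accident_Severity ~ . - 1, data = train_data)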

Preparing the XGBoost Model

We opt for the XGBoost algorithm, renowned for its efficiency and effectiveness in classification problems. XGBoost (eXtreme Gradient Boosting) fits an ensemble of gradient-boosted decision trees and handles non-linear relationships well; with appropriate weighting it can also cope with unbalanced outcomes. We set several hyperparameters for the model, such as the learning rate (eta), the maximum tree depth (max_depth), and the row and column subsampling rates (subsample and colsample_bytree), which together govern the trade-off between model fit and overfitting.

# Create the xgboost model
params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss",
  eta = 0.1,
  max_depth = 4,
  subsample = 0.8,
  colsample_bytree = 0.8
)

model <- xgboost(
  data = as.matrix(train_data_numeric %>% select(-Accident_Severity)),
  label = train_data$Accident_Severity,
  params = params,
  nrounds = 100
)
## [1]  train-logloss:0.603900 
## [2]  train-logloss:0.530794 
## [3]  train-logloss:0.470001 
## [4]  train-logloss:0.418688 
## [5]  train-logloss:0.374980 
## [6]  train-logloss:0.337497 
## [7]  train-logloss:0.305067 
## [8]  train-logloss:0.276948 
## [9]  train-logloss:0.252464 
## [10] train-logloss:0.231066 
## [11] train-logloss:0.212294 
## [12] train-logloss:0.195770 
## [13] train-logloss:0.181260 
## [14] train-logloss:0.168421 
## [15] train-logloss:0.157110 
## [16] train-logloss:0.147137 
## [17] train-logloss:0.138266 
## [18] train-logloss:0.130489 
## [19] train-logloss:0.123520 
## [20] train-logloss:0.117389 
## [21] train-logloss:0.111967 
## [22] train-logloss:0.107139 
## [23] train-logloss:0.102884 
## [24] train-logloss:0.099144 
## [25] train-logloss:0.095789 
## [26] train-logloss:0.092831 
## [27] train-logloss:0.090203 
## [28] train-logloss:0.087899 
## [29] train-logloss:0.085868 
## [30] train-logloss:0.084051 
## [31] train-logloss:0.082454 
## [32] train-logloss:0.081045 
## [33] train-logloss:0.079806 
## [34] train-logloss:0.078668 
## [35] train-logloss:0.077710 
## [36] train-logloss:0.076850 
## [37] train-logloss:0.076079 
## [38] train-logloss:0.075407 
## [39] train-logloss:0.074810 
## [40] train-logloss:0.074257 
## [41] train-logloss:0.073829 
## [42] train-logloss:0.073379 
## [43] train-logloss:0.073039 
## [44] train-logloss:0.072742 
## [45] train-logloss:0.072436 
## [46] train-logloss:0.072161 
## [47] train-logloss:0.071931 
## [48] train-logloss:0.071731 
## [49] train-logloss:0.071553 
## [50] train-logloss:0.071358 
## [51] train-logloss:0.071194 
## [52] train-logloss:0.071064 
## [53] train-logloss:0.070934 
## [54] train-logloss:0.070795 
## [55] train-logloss:0.070656 
## [56] train-logloss:0.070564 
## [57] train-logloss:0.070462 
## [58] train-logloss:0.070343 
## [59] train-logloss:0.070243 
## [60] train-logloss:0.070160 
## [61] train-logloss:0.070095 
## [62] train-logloss:0.070027 
## [63] train-logloss:0.069970 
## [64] train-logloss:0.069882 
## [65] train-logloss:0.069818 
## [66] train-logloss:0.069750 
## [67] train-logloss:0.069681 
## [68] train-logloss:0.069647 
## [69] train-logloss:0.069556 
## [70] train-logloss:0.069529 
## [71] train-logloss:0.069461 
## [72] train-logloss:0.069404 
## [73] train-logloss:0.069350 
## [74] train-logloss:0.069286 
## [75] train-logloss:0.069259 
## [76] train-logloss:0.069210 
## [77] train-logloss:0.069157 
## [78] train-logloss:0.069088 
## [79] train-logloss:0.069059 
## [80] train-logloss:0.069000 
## [81] train-logloss:0.068958 
## [82] train-logloss:0.068932 
## [83] train-logloss:0.068875 
## [84] train-logloss:0.068810 
## [85] train-logloss:0.068778 
## [86] train-logloss:0.068715 
## [87] train-logloss:0.068677 
## [88] train-logloss:0.068651 
## [89] train-logloss:0.068580 
## [90] train-logloss:0.068556 
## [91] train-logloss:0.068525 
## [92] train-logloss:0.068498 
## [93] train-logloss:0.068432 
## [94] train-logloss:0.068376 
## [95] train-logloss:0.068298 
## [96] train-logloss:0.068276 
## [97] train-logloss:0.068224 
## [98] train-logloss:0.068169 
## [99] train-logloss:0.068138 
## [100]    train-logloss:0.068088
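Severe or fatal accidents make up only about 1.6% of the observations, so an optional refinement, sketched here but not applied to the model trained above, is to up-weight the rare positive class via xgboost's scale_pos_weight parameter.

# Sketch: weight the positive class by the negative-to-positive ratio
n_pos <- sum(train_data$Accident_Severity == 1)
n_neg <- sum(train_data$Accident_Severity == 0)
params_weighted <- modifyList(params, list(scale_pos_weight = n_neg / n_pos))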

Making Predictions

Finally, we use our trained model to make predictions on the test dataset. These predictions will help us evaluate the performance of our model in predicting the severity of road accidents. By comparing these predictions with the actual outcomes in the test dataset, we can assess the accuracy and reliability of our model.

# Make predictions on the test data
predictions <- predict(
  model,
  as.matrix(test_data_numeric %>% select(-Accident_Severity))
)

Model Evaluation Metrics

Accuracy

We assessed the accuracy of our XGBoost model, which reflects the proportion of correct predictions. The model achieved an accuracy of approximately 98.42%. However, since roughly 98.4% of the observations are slight accidents, this figure is close to what a trivial always-slight classifier would achieve, so accuracy alone should be interpreted with caution under such class imbalance.

# Accuracy
mean(ifelse(predictions > 0.5, 1, 0) == test_data$Accident_Severity)
## [1] 0.9842058
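Because the positive class is rare, sensitivity and specificity are also informative; a sketch using confusionMatrix() from the caret package loaded earlier:

# Sketch: confusion matrix with the severe/fatal class treated as positive
pred_class <- factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(pred_class, factor(test_data$Accident_Severity, levels = c(0, 1)), positive = "1")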

Area Under the Curve (AUC)

In addition to accuracy, we evaluated the model's performance using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). The AUC score of 0.774 provides an aggregate measure of the model's ability to discriminate between the two outcome classes. A higher AUC indicates better discrimination, and this metric offers a more informative view of performance than accuracy on unbalanced data.

# AUC score
roc_obj <- roc(test_data$Accident_Severity, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_obj)
print(auc_value)
## Area under the curve: 0.774

Receiver Operating Characteristic (ROC) Curve

As seen in the graph, the ROC curve arches towards the upper left corner, indicating good predictive performance. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the classifier. Conversely, a curve that approaches the 45-degree diagonal represents a less effective model, with an AUC close to 0.5, no better than random guessing.

The area under the ROC curve (AUC) provides a single measure of overall performance and can be used to compare different models. In our case, the previously calculated AUC of 0.774 suggests that the model discriminates reasonably well between the two classes of accident severity.

# Plot the ROC curve
plot(roc_obj, main = "ROC")
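The 0.5 cut-off used for the accuracy figure above is conservative for such an imbalanced outcome. pROC can also suggest a probability threshold directly from the ROC curve; a sketch, not part of the original evaluation:

# Sketch: threshold that balances sensitivity and specificity (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))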

Feature Importance Analysis

The XGBoost model provides a mechanism to evaluate the importance of each feature in making predictions. Here we use the xgb.importance function from the xgboost package to extract feature importance metrics. The table below lists the features along with their importance scores across three metrics: Gain, Cover, and Frequency.

In our analysis, Speed_limit is identified as the most significant feature, with the highest Gain score, indicating that it has the highest impact on the outcome of the model. This is followed by DistanceToHotspot and Number_of_Vehicles, which also play substantial roles in predicting the severity of road accidents.

The DistanceToHotspot is the custom spatial variable we created, measuring the distance from each accident to the nearest accident cluster center identified by DBSCAN clustering. Its high importance in our model underscores the critical role spatial factors play in road accident analysis, indicating that accidents occurring near these hotspots may have distinct characteristics or risk factors. This insight is valuable for targeted road safety measures and urban planning.

# Feature importance
xgb.importance(model = model)
##                     Feature       Gain      Cover  Frequency
##  1:             Speed_limit 0.29141535 0.29054941 0.10260223
##  2:       DistanceToHotspot 0.17386799 0.10937970 0.27806691
##  3:      Number_of_Vehicles 0.15805308 0.27698453 0.13903346
##  4:     Urban_or_Rural_Area 0.09145477 0.06517493 0.03197026
##  5:               TimeOfDay 0.06472619 0.10213348 0.10037175
##  6:        Light_Conditions 0.05694244 0.07158009 0.07881041
##  7:             Day_of_Week 0.05295697 0.02755963 0.08401487
##  8:                   Month 0.05100284 0.01736964 0.08401487
##  9:      Weather_Conditions 0.03303740 0.02117634 0.05576208
## 10: Road_Surface_Conditions 0.02654298 0.01809226 0.04535316
# Show importance plot with ggplot2
xgb.importance(model = model) %>%
  as.data.frame() %>%
  ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_bw()

Explainable Machine Learning

Creating an explainer using the DALEX package is an essential step towards demystifying the complex decisions made by our XGBoost model. This process is not just about gaining technical clarity, but about ensuring that the predictive model we have built is transparent and its results are interpretable. With the explainer, we aim to break down the model’s predictions and understand the impact of each variable, particularly our spatial feature DistanceToHotspot.

accident_data_matrix <- as.matrix(
  accident_data %>%
    select(-Accident_Severity) %>%
    mutate_if(is.factor, as.numeric)
)
accident_data_y <- accident_data$Accident_Severity

# Create an explainer for your model
explainer <- DALEX::explain(
  model = model,
  data = accident_data_matrix,
  y = accident_data_y
)
## Preparation of a new explainer is initiated
##   -> model label       :  xgb.Booster  (  default  )
##   -> data              :  107425  rows  10  cols 
##   -> data              :  rownames to data was added ( from 1 to 107425 ) 
##   -> target variable   :  107425  values 
##   -> predict function  :  yhat.default will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package Model of class: xgb.Booster package unrecognized , ver. Unknown , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  0.0007605775 , mean =  0.01565999 , max =  0.5043826  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.2877295 , mean =  -5.840353e-05 , max =  0.9980441  
##   A new explainer has been created!

This plot is a graphical representation of the importance of different predictors in the XGBoost model, as computed from the DALEX explainer. Importance is assessed by permuting each variable and measuring how much the model's loss increases; variables are ranked accordingly, with longer bars indicating greater importance.

# Variable Importance Plot
vi <- variable_importance(explainer)
plot(vi)
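Because the explainer defaulted to a regression task (see the setup log above), this importance ranking is based on a regression-style loss. A classification-oriented loss could be requested instead; a sketch, assuming DALEX's model_parts() and loss_one_minus_auc() helpers:

# Sketch: permutation importance measured as the drop in AUC
vi_auc <- model_parts(explainer, loss_function = loss_one_minus_auc)
plot(vi_auc)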

In this plot, the x-axis represents different values of Speed_limit, while the y-axis shows the average predicted probability of a severe or fatal outcome. The shape of the curve suggests that predicted severity rises markedly as the speed limit increases up to a point, then plateaus and eventually declines. In other words, the relationship between speed limits and predicted severity is non-monotonic, which could provide actionable insight for policymakers considering speed regulation as part of road safety measures.

# Partial Dependence Plot for Speed_limit variable
sl_pdp <- model_profile(explainer, variables = "Speed_limit")
plot(sl_pdp)

This plot shows a distinct trend in how the predicted severity of accidents changes with distance to the nearest hotspot. Initially there is a sharp increase, suggesting that accidents occurring close to hotspots (though not at the hotspots themselves) are predicted to be more severe. Beyond a certain point, as the distance increases further, the predicted severity decreases, indicating that accidents far from these hotspots tend to be less severe.

# Partial Dependence Plot for DistanceToHotspot variable
dth_pdp <- model_profile(explainer, variables = "DistanceToHotspot")
plot(dth_pdp)

The Break Down Plot starts with a baseline, which is the intercept (the average prediction over the dataset). Each subsequent bar shows the contribution of an individual feature to the prediction, with bars pointing to the right indicating an increase in the predicted value and bars to the left indicating a decrease. The sum of these contributions and the intercept gives us the final prediction for this particular observation.

We can see the effects of features like TimeOfDay, Speed_limit, Urban_or_Rural_Area, and others, including the spatial variable DistanceToHotspot. The length and direction of each bar provide a clear and detailed picture of the factors leading to the prediction, which is particularly useful for understanding the model’s decision-making process for individual predictions. This insight can be especially valuable when investigating specific cases or anomalies.

# Break Down Plot for a single prediction
first_observation <- accident_data_matrix[1, , drop = FALSE]
bd <- predict_parts(
  explainer,
  new_observation = first_observation
)
plot(bd)

Each bar represents a feature’s SHAP value, which quantifies the impact of that feature on the model’s output. Red bars indicate features that push the prediction lower (decrease the predicted probability of the outcome), and green bars indicate features that push the prediction higher. The length of the bar shows the magnitude of the feature’s impact.

In SHAP value plots, the sum of all feature contributions along with the base value (not shown in the plot) equals the final prediction for the given observation. SHAP values provide a more nuanced and detailed explanation of the prediction, attributing the contribution of each feature in a way that sums up to the actual prediction made by the model.

# Shapley Values Plot for a single prediction
sv <- predict_parts(
  explainer,
  new_observation = first_observation,
  type = "shap"
)
plot(sv)

General Summary of Results

The comprehensive analysis conducted has led to several notable findings. The DBSCAN clustering method effectively identified 29 distinct clusters, predominantly located near highly populated areas such as major cities and urban agglomerations. This pattern aligns with the intuitive understanding that busier areas are likely to witness a higher frequency of traffic incidents.

The exploratory data analysis (EDA) further refined our insights, revealing that minor accidents tend to cluster around these hotspots, while severe or fatal accidents are dispersed over a range of distances from these focal points. Patterns across the days of the week and times of day were also observed, with Fridays and afternoons being particularly prone to higher accident counts. Interestingly, the severity of accidents did not show a significant variation with these temporal factors.

The peak in accident distribution at a speed limit of 30 mph suggests that most accidents occur within urban speed zones, where traffic density is high and numerous variables may contribute to accidents.

Through the application of the XGBoost model, we were able to quantify the relationship between these factors and accident severity. While the headline accuracy largely reflects the dominance of slight accidents in the data, the AUC of 0.774 indicates a reasonable ability to discriminate between slight and severe or fatal outcomes.

In conclusion, the spatial analysis of accident severity in proximity to hotspots has provided valuable insights, with the potential to inform targeted road safety measures and urban planning strategies. Future work may delve deeper into the causal factors and explore interventions to mitigate risks, especially in identified high-risk zones.