Road safety is a paramount concern for both individuals and authorities, with the impact of traffic accidents ranging from minor to fatal. Understanding the factors that contribute to the severity of these incidents can inform preventive measures and policies. This analysis aims to investigate the influence of the distance from accident sites to traffic hotspots—locations where accidents occur more frequently—on the severity of these accidents. The working hypothesis suggests that accidents occurring closer to hotspots are more likely to result in minor (slight) injuries rather than being fatal.
To conduct this investigation, data from the year 2017 were sourced from a comprehensive UK road safety dataset available at Kaggle. The dataset encompasses detailed reports of road accidents, providing a fertile ground for spatial and quantitative analysis.
The methodology of the analysis involves several stages:
Spatial Clustering: Using the DBSCAN clustering method, accident locations defined by longitude and latitude were grouped to identify areas of high accident concentration.
Hotspot Identification: The centers of the identified clusters were calculated, establishing the hotspots. Subsequently, the distances of specific accident sites to these hotspots were determined.
Data Modeling: With the distances and other relevant variables at hand, the XGBoost algorithm was employed to model the data and examine the relationship between proximity to hotspots and accident severity.
The outcome of this analysis aims to shed light on the spatial dynamics of road accidents and their outcomes, potentially offering insights into road safety enhancement strategies.
# Load libraries
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)
library(sf)
library(here)
library(dbscan)
library(leaflet)
library(geosphere)
library(randomForest)
library(pROC)
library(caret)
library(viridis)
library(ggridges)
library(xgboost)
library(DALEX)
# Set working directory
setwd(here())
# Load data
accident_data <- read_csv("accident_data.csv")
## Rows: 129982 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): Accident_Index, 1st_Road_Class, 2nd_Road_Class, Accident_Severity...
## dbl (13): 1st_Road_Number, 2nd_Road_Number, Did_Police_Officer_Attend_Scene...
## date (1): Date
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(accident_data)
## Accident_Index 1st_Road_Class 1st_Road_Number 2nd_Road_Class
## Length:129982 Length:129982 Min. : 0 Length:129982
## Class :character Class :character 1st Qu.: 0 Class :character
## Mode :character Mode :character Median : 43 Mode :character
## Mean : 852
## 3rd Qu.: 587
## Max. :9786
##
## 2nd_Road_Number Accident_Severity Carriageway_Hazards Date
## Min. : 0.0 Length:129982 Length:129982 Min. :2017-01-01
## 1st Qu.: 0.0 Class :character Class :character 1st Qu.:2017-04-03
## Median : 0.0 Mode :character Mode :character Median :2017-07-04
## Mean : 301.5 Mean :2017-07-03
## 3rd Qu.: 0.0 3rd Qu.:2017-10-04
## Max. :9704.0 Max. :2017-12-31
## NA's :153
## Day_of_Week Did_Police_Officer_Attend_Scene_of_Accident
## Length:129982 Min. :1.000
## Class :character 1st Qu.:1.000
## Mode :character Median :1.000
## Mean :1.265
## 3rd Qu.:2.000
## Max. :3.000
##
## Junction_Control Junction_Detail Latitude Light_Conditions
## Length:129982 Length:129982 Min. :49.93 Length:129982
## Class :character Class :character 1st Qu.:51.47 Class :character
## Mode :character Mode :character Median :51.90 Mode :character
## Mean :52.44
## 3rd Qu.:53.39
## Max. :60.48
## NA's :29
## Local_Authority_(District) Local_Authority_(Highway) Location_Easting_OSGR
## Length:129982 Length:129982 Min. : 73639
## Class :character Class :character 1st Qu.:387279
## Mode :character Mode :character Median :457594
## Mean :451170
## 3rd Qu.:528910
## Max. :655391
## NA's :19
## Location_Northing_OSGR Longitude LSOA_of_Accident_Location
## Min. : 12107 Min. :-7.4096 Length:129982
## 1st Qu.: 176000 1st Qu.:-2.1908 Class :character
## Median : 224126 Median :-1.1498 Mode :character
## Mean : 283578 Mean :-1.2684
## 3rd Qu.: 388829 3rd Qu.:-0.1417
## Max. :1177531 Max. : 1.7596
## NA's :19 NA's :29
## Number_of_Casualties Number_of_Vehicles Pedestrian_Crossing-Human_Control
## Min. : 1.000 Min. : 1.000 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:0.000
## Median : 1.000 Median : 2.000 Median :0.000
## Mean : 1.316 Mean : 1.838 Mean :0.023
## 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.:0.000
## Max. :42.000 Max. :23.000 Max. :2.000
## NA's :2574
## Pedestrian_Crossing-Physical_Facilities Police_Force
## Min. :0.0000 Length:129982
## 1st Qu.:0.0000 Class :character
## Median :0.0000 Mode :character
## Mean :0.8631
## 3rd Qu.:0.0000
## Max. :8.0000
## NA's :2765
## Road_Surface_Conditions Road_Type Special_Conditions_at_Site
## Length:129982 Length:129982 Length:129982
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Speed_limit Time Urban_or_Rural_Area Weather_Conditions
## Min. :20.00 Length:129982 Length:129982 Length:129982
## 1st Qu.:30.00 Class1:hms Class :character Class :character
## Median :30.00 Class2:difftime Mode :character Mode :character
## Mean :37.26 Mode :numeric
## 3rd Qu.:40.00
## Max. :70.00
##
## Year InScotland
## Min. :2017 Length:129982
## 1st Qu.:2017 Class :character
## Median :2017 Mode :character
## Mean :2017
## 3rd Qu.:2017
## Max. :2017
##
head(accident_data)
## # A tibble: 6 × 34
## Accident_Index `1st_Road_Class` `1st_Road_Number` `2nd_Road_Class`
## <chr> <chr> <dbl> <chr>
## 1 2017010001708 A 105 <NA>
## 2 2017010009342 A 5 Unclassified
## 3 2017010009344 A 13 C
## 4 2017010009348 A 1010 B
## 5 2017010009350 A 107 A
## 6 2017010009351 Unclassified 0 <NA>
## # ℹ 30 more variables: `2nd_Road_Number` <dbl>, Accident_Severity <chr>,
## # Carriageway_Hazards <chr>, Date <date>, Day_of_Week <chr>,
## # Did_Police_Officer_Attend_Scene_of_Accident <dbl>, Junction_Control <chr>,
## # Junction_Detail <chr>, Latitude <dbl>, Light_Conditions <chr>,
## # `Local_Authority_(District)` <chr>, `Local_Authority_(Highway)` <chr>,
## # Location_Easting_OSGR <dbl>, Location_Northing_OSGR <dbl>, Longitude <dbl>,
## # LSOA_of_Accident_Location <chr>, Number_of_Casualties <dbl>, …
# Remove unnecessary columns
accident_data <- accident_data %>%
select(
Longitude, Latitude,
Accident_Severity,
Date, Time,
Weather_Conditions, Light_Conditions,
Road_Surface_Conditions,
Speed_limit,
Urban_or_Rural_Area,
Number_of_Casualties, Number_of_Vehicles
)
# Remove observations with missing values
accident_data <- accident_data %>%
filter(!is.na(Longitude), !is.na(Latitude))
# Encode the target variable
accident_data$Accident_Severity <- accident_data$Accident_Severity %>%
recode("Slight" = 0, "Severe" = 1, "Fatal" = 1)
## Warning: Unreplaced values treated as NA as `.x` is not compatible.
## Please specify replacements exhaustively or supply `.default`.
# Format the date and time variables
accident_data$Date <- as.Date(accident_data$Date, format = "%Y-%m-%d")
accident_data$Year <- format(accident_data$Date, "%Y")
accident_data$Day_of_Week <- weekdays(accident_data$Date)
accident_data$Month <- format(accident_data$Date, "%m")
accident_data$Hour <- hour(hms(accident_data$Time))
accident_data$TimeOfDay <- cut(accident_data$Hour,
breaks = c(-Inf, 6, 12, 18, Inf),
labels = c("Night", "Morning", "Afternoon", "Evening")
)
# Convert Longitude and Latitude to a matrix
coordinates <- as.matrix(accident_data[, c("Longitude", "Latitude")])
# Perform DBSCAN clustering
dbscan_result <- dbscan(coordinates, eps = 0.1, minPts = 500)
# Extract cluster labels
cluster_labels <- dbscan_result$cluster
accident_data$Cluster <- cluster_labels
# Filter the accident_data dataframe to include only observations in a cluster
clustered_accident_data <- accident_data[accident_data$Cluster > 0, ]
# Calculate the centroids of the clusters
centroids <- tapply(
seq_len(nrow(clustered_accident_data)),
clustered_accident_data$Cluster,
function(rows) colMeans(clustered_accident_data[rows, 1:2])
)
# Calculate the distance from each point to its closest centroid
accident_data$DistanceToHotspot <- apply(coordinates, 1, function(coord) {
min(sapply(centroids, function(centroid) dist(rbind(coord, centroid))))
})
summary(accident_data$DistanceToHotspot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000044 0.099050 0.220209 0.277845 0.372619 4.873858
# Create a color palette for clusters
cluster_palette <- colorFactor(
palette = "Set1",
domain = clustered_accident_data$Cluster
)
# Create a leaflet map with a random sample of points
map <- leaflet() %>%
addTiles() %>%
addCircleMarkers(
data = sample_n(clustered_accident_data, size = 10000),
lng = ~Longitude,
lat = ~Latitude,
color = ~ cluster_palette(Cluster),
radius = 3,
popup = ~ paste("Grid Cell:", Cluster)
)
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
map
The clustering analysis depicted on the map has successfully identified 29 distinct clusters, primarily situated near densely populated areas such as major cities and urban agglomerations. This pattern is not unexpected as higher traffic volumes in such areas typically lead to a greater number of incidents. The visualization clearly indicates these hotspots, with varying cluster sizes likely reflecting the intensity of traffic accident occurrences. Recognizing these clusters is instrumental for urban planners and public safety officials, as it can guide traffic management strategies, resource deployment for emergency services, and the implementation of safety measures to mitigate accident risks in these high-density zones.
# Filter unnecessary columns
accident_data <- accident_data[, !(names(accident_data) %in% c(
"Longitude", "Latitude", "Cluster", "Hour",
"Number_of_Casualties", "Date", "Time", "Year"
))]
# Remove observations with missing values
accident_data <- accident_data[!is.na(accident_data$Accident_Severity), ]
# Convert all columns to factors
accident_data <- accident_data %>%
mutate_if(is.character, as.factor)
summary(accident_data)
## Accident_Severity Weather_Conditions
## Min. :0.0000 Fine no high winds :86117
## 1st Qu.:0.0000 Raining no high winds:11964
## Median :0.0000 Unknown : 4066
## Mean :0.0156 Other : 2244
## 3rd Qu.:0.0000 Fine + high winds : 1007
## Max. :1.0000 Raining + high winds : 893
## (Other) : 1134
## Light_Conditions Road_Surface_Conditions
## Darkness - lighting unknown : 2401 Data missing or out of range: 1819
## Darkness - lights lit :21947 Dry :76340
## Darkness - lights unlit : 767 Flood over 3cm. deep : 86
## Darkness - no lighting : 5107 Frost or ice : 1949
## Data missing or out of range: 1 Snow : 377
## Daylight :77202 Wet or damp :26854
##
## Speed_limit Urban_or_Rural_Area Number_of_Vehicles Day_of_Week
## Min. :20.00 Rural:33846 Min. : 1.000 czwartek :16547
## 1st Qu.:30.00 Urban:73579 1st Qu.: 1.000 niedziela :11965
## Median :30.00 Median : 2.000 piątek :17590
## Mean :36.95 Mean : 1.862 poniedziałek:14986
## 3rd Qu.:40.00 3rd Qu.: 2.000 sobota :13998
## Max. :70.00 Max. :23.000 środa :16091
## wtorek :16248
## Month TimeOfDay DistanceToHotspot
## 11 : 9903 Night : 7696 Min. :0.000044
## 01 : 9367 Morning :34583 1st Qu.:0.096830
## 07 : 9335 Afternoon:47189 Median :0.216116
## 10 : 9306 Evening :17954 Mean :0.272476
## 09 : 9174 NA's : 3 3rd Qu.:0.365946
## 06 : 9079 Max. :4.873858
## (Other):51261
EDA
ggplot(
accident_data,
aes(x = factor(Accident_Severity), fill = factor(Accident_Severity))
) +
geom_bar(stat = "count") +
scale_fill_manual(
values = c("0" = "blue", "1" = "red"),
labels = c("0" = "Slight", "1" = "Severe or Fatal"),
name = "Accident Severity"
) +
labs(
x = "Accident Severity",
y = "Count",
title = "Distribution of Accident Severity"
) +
theme_minimal()
ggplot(accident_data, aes(x = Speed_limit, fill = ..count..)) +
geom_histogram(binwidth = 10, color = "black", show.legend = FALSE) +
scale_fill_viridis(name = "Frequency", option = "D") +
labs(
x = "Speed Limit",
y = "Frequency",
title = "Distribution of Speed Limits"
) +
theme_minimal()
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(accident_data, aes(x = Weather_Conditions, fill = Weather_Conditions)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
title = "Accidents in Different Weather Conditions",
x = "Weather Conditions",
y = "Count"
)
ggplot(accident_data, aes(x = Day_of_Week, fill = factor(Accident_Severity))) +
geom_bar(position = "dodge") +
scale_fill_manual(
values = c("0" = "lightblue", "1" = "red"),
labels = c("0" = "Slight", "1" = "Severe or Fatal"),
name = "Accident Severity"
) +
labs(
x = "Day of the Week",
y = "Count",
title = "Day of the Week vs Accident Severity"
) +
theme_light() +
theme(
panel.background = element_rect(fill = "white", colour = "white"),
plot.background = element_rect(fill = "white", colour = NA),
legend.background = element_rect(fill = "white", colour = NA),
axis.text.x = element_text(angle = 45, hjust = 1)
)
ggplot(accident_data, aes(x = TimeOfDay, fill = factor(Accident_Severity))) +
geom_bar(position = "dodge") +
scale_fill_manual(
values = c("0" = "lightblue", "1" = "red"),
labels = c("0" = "Slight", "1" = "Severe or Fatal"),
name = "Accident Severity"
) +
labs(
x = "Time of Day",
y = "Count",
title = "Time of Day vs Accident Severity"
) +
theme_light() +
theme(
panel.background = element_rect(fill = "white", colour = "white"),
plot.background = element_rect(fill = "white", colour = NA),
legend.background = element_rect(fill = "white", colour = NA),
axis.text.x = element_text(angle = 45, hjust = 1)
)
ggplot(accident_data, aes(x = DistanceToHotspot)) +
geom_density(fill = "skyblue", alpha = 0.5) +
xlim(0, 2) +
labs(
x = "Distance to Hotspot",
y = "Density",
title = "Empirical Density Function of Distance to Hotspot"
) +
theme_minimal()
## Warning: Removed 122 rows containing non-finite values (`stat_density()`).
ggplot(
accident_data,
aes(x = factor(Accident_Severity), y = DistanceToHotspot)
) +
geom_violin(trim = FALSE, fill = "skyblue", color = "black") +
scale_x_discrete(
name = "Accident Severity",
labels = c("0" = "Slight", "1" = "Severe or Fatal")
) +
theme_minimal()
ggplot(
accident_data,
aes(x = DistanceToHotspot, fill = factor(Accident_Severity))
) +
geom_density(alpha = 0.5) +
facet_grid(Accident_Severity ~ .) +
scale_x_continuous(limits = c(0, 2)) +
scale_fill_manual(
values = c("0" = "lightblue", "1" = "red"),
labels = c("0" = "Slight", "1" = "Severe or Fatal"),
name = "Accident Severity"
) +
theme_minimal()
## Warning: Removed 122 rows containing non-finite values (`stat_density()`).
## Exploratory Data Analysis Summary
The exploratory data analysis stage provided several key insights from the visualizations created:
In this phase of our analysis, we take the crucial step of dividing our dataset into training and testing subsets. This is fundamental for any machine learning application, as it allows us to train our models on one set of data (training set) and then test their performance on a separate, unseen set of data (testing set). We use the createDataPartition function from the caret package to achieve a balanced split, ensuring that our training and testing sets have a similar distribution of accident severities. By setting set.seed(123), we guarantee that our results are reproducible, an important aspect of scientific analysis.
# Split the data into training and testing sets
set.seed(123)
train_indices <- createDataPartition(
accident_data$Accident_Severity,
p = 0.7, list = FALSE
)
train_data <- accident_data[train_indices, ]
test_data <- accident_data[-train_indices, ]
# Convert the data to numeric
train_data_numeric <- train_data %>%
mutate_if(is.factor, as.numeric)
test_data_numeric <- test_data %>%
mutate_if(is.factor, as.numeric)
We opt for the XGBoost algorithm, renowned for its efficiency and effectiveness in classification problems. XGBoost stands for eXtreme Gradient Boosting and is particularly useful in dealing with unbalanced datasets and non-linear relationships. We define various parameters for our XGBoost model, such as the learning rate (eta), maximum depth of trees (max_depth), and subsampling rates (subsample and colsample_bytree). These parameters are fine-tuned to optimize the model’s performance.
# Create the xgboost model
params <- list(
objective = "binary:logistic",
eval_metric = "logloss",
eta = 0.1,
max_depth = 4,
subsample = 0.8,
colsample_bytree = 0.8
)
model <- xgboost(
data = as.matrix(train_data_numeric %>% select(-Accident_Severity)),
label = train_data$Accident_Severity,
params = params,
nrounds = 100
)
## [1] train-logloss:0.603900
## [2] train-logloss:0.530794
## [3] train-logloss:0.470001
## [4] train-logloss:0.418688
## [5] train-logloss:0.374980
## [6] train-logloss:0.337497
## [7] train-logloss:0.305067
## [8] train-logloss:0.276948
## [9] train-logloss:0.252464
## [10] train-logloss:0.231066
## [11] train-logloss:0.212294
## [12] train-logloss:0.195770
## [13] train-logloss:0.181260
## [14] train-logloss:0.168421
## [15] train-logloss:0.157110
## [16] train-logloss:0.147137
## [17] train-logloss:0.138266
## [18] train-logloss:0.130489
## [19] train-logloss:0.123520
## [20] train-logloss:0.117389
## [21] train-logloss:0.111967
## [22] train-logloss:0.107139
## [23] train-logloss:0.102884
## [24] train-logloss:0.099144
## [25] train-logloss:0.095789
## [26] train-logloss:0.092831
## [27] train-logloss:0.090203
## [28] train-logloss:0.087899
## [29] train-logloss:0.085868
## [30] train-logloss:0.084051
## [31] train-logloss:0.082454
## [32] train-logloss:0.081045
## [33] train-logloss:0.079806
## [34] train-logloss:0.078668
## [35] train-logloss:0.077710
## [36] train-logloss:0.076850
## [37] train-logloss:0.076079
## [38] train-logloss:0.075407
## [39] train-logloss:0.074810
## [40] train-logloss:0.074257
## [41] train-logloss:0.073829
## [42] train-logloss:0.073379
## [43] train-logloss:0.073039
## [44] train-logloss:0.072742
## [45] train-logloss:0.072436
## [46] train-logloss:0.072161
## [47] train-logloss:0.071931
## [48] train-logloss:0.071731
## [49] train-logloss:0.071553
## [50] train-logloss:0.071358
## [51] train-logloss:0.071194
## [52] train-logloss:0.071064
## [53] train-logloss:0.070934
## [54] train-logloss:0.070795
## [55] train-logloss:0.070656
## [56] train-logloss:0.070564
## [57] train-logloss:0.070462
## [58] train-logloss:0.070343
## [59] train-logloss:0.070243
## [60] train-logloss:0.070160
## [61] train-logloss:0.070095
## [62] train-logloss:0.070027
## [63] train-logloss:0.069970
## [64] train-logloss:0.069882
## [65] train-logloss:0.069818
## [66] train-logloss:0.069750
## [67] train-logloss:0.069681
## [68] train-logloss:0.069647
## [69] train-logloss:0.069556
## [70] train-logloss:0.069529
## [71] train-logloss:0.069461
## [72] train-logloss:0.069404
## [73] train-logloss:0.069350
## [74] train-logloss:0.069286
## [75] train-logloss:0.069259
## [76] train-logloss:0.069210
## [77] train-logloss:0.069157
## [78] train-logloss:0.069088
## [79] train-logloss:0.069059
## [80] train-logloss:0.069000
## [81] train-logloss:0.068958
## [82] train-logloss:0.068932
## [83] train-logloss:0.068875
## [84] train-logloss:0.068810
## [85] train-logloss:0.068778
## [86] train-logloss:0.068715
## [87] train-logloss:0.068677
## [88] train-logloss:0.068651
## [89] train-logloss:0.068580
## [90] train-logloss:0.068556
## [91] train-logloss:0.068525
## [92] train-logloss:0.068498
## [93] train-logloss:0.068432
## [94] train-logloss:0.068376
## [95] train-logloss:0.068298
## [96] train-logloss:0.068276
## [97] train-logloss:0.068224
## [98] train-logloss:0.068169
## [99] train-logloss:0.068138
## [100] train-logloss:0.068088
Finally, we use our trained model to make predictions on the test dataset. These predictions will help us evaluate the performance of our model in predicting the severity of road accidents. By comparing these predictions with the actual outcomes in the test dataset, we can assess the accuracy and reliability of our model.
# Make predictions on the test data
predictions <- predict(
model,
as.matrix(test_data_numeric %>% select(-Accident_Severity))
)
We assessed the accuracy of our XGBoost model, which reflects the proportion of correct predictions. The model achieved an accuracy of approximately 98.42%, indicating a high level of effectiveness in predicting the severity of road accidents. This metric is straightforward but should be interpreted with caution in cases of class imbalance.
# Accuracy
mean(ifelse(predictions > 0.5, 1, 0) == test_data$Accident_Severity)
## [1] 0.9842058
In addition to accuracy, we evaluated the model’s performance using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). The AUC score, standing at 0.774, provides an aggregate measure of the model’s ability to discriminate between different outcome classes. A higher AUC value generally indicates a better model performance. This metric offers a more nuanced view of the model’s effectiveness, especially in unbalanced datasets.
# AUC score
roc_obj <- roc(test_data$Accident_Severity, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_obj)
print(auc_value)
## Area under the curve: 0.774
As seen in the graph, the ROC curve arches towards the upper left corner, indicating a good predictive performance. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. Conversely, a curve that approaches the 45-degree diagonal line represents a less effective model, with an AUC close to 0.5, which is no better than random guessing.
The area under the ROC curve (AUC) provides a single measure of overall performance and can be used to compare different models. In our case, the previously calculated AUC of 0.774 suggests that the model has a good level of discrimination capability, distinguishing between the different classes of accident severity.
# Plot the ROC curve
plot(roc_obj, main = "ROC")
The XGBoost model provides a mechanism to evaluate the importance of each feature in making predictions. Here we use the xgb.importance function from the XGBoost package to extract feature importance metrics. The table above lists the features along with their respective importance scores across three metrics: Gain, Cover, and Frequency.
In our analysis, Speed_limit is identified as the most significant feature, with the highest Gain score, indicating that it has the highest impact on the outcome of the model. This is followed by DistanceToHotspot and Number_of_Vehicles, which also play substantial roles in predicting the severity of road accidents.
The DistanceToHotspot is the custom spatial variable we
created, measuring the distance from each accident to the nearest
accident cluster center identified by DBSCAN clustering. Its high
importance in our model underscores the critical role spatial factors
play in road accident analysis, indicating that accidents occurring near
these hotspots may have distinct characteristics or risk factors. This
insight is valuable for targeted road safety measures and urban
planning.
# Feature importance
xgb.importance(model = model)
## Feature Gain Cover Frequency
## 1: Speed_limit 0.29141535 0.29054941 0.10260223
## 2: DistanceToHotspot 0.17386799 0.10937970 0.27806691
## 3: Number_of_Vehicles 0.15805308 0.27698453 0.13903346
## 4: Urban_or_Rural_Area 0.09145477 0.06517493 0.03197026
## 5: TimeOfDay 0.06472619 0.10213348 0.10037175
## 6: Light_Conditions 0.05694244 0.07158009 0.07881041
## 7: Day_of_Week 0.05295697 0.02755963 0.08401487
## 8: Month 0.05100284 0.01736964 0.08401487
## 9: Weather_Conditions 0.03303740 0.02117634 0.05576208
## 10: Road_Surface_Conditions 0.02654298 0.01809226 0.04535316
# Show importance plot with ggplot2
xgb.importance(model = model) %>%
as.data.frame() %>%
ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw()
Creating an explainer using the DALEX package is an essential step
towards demystifying the complex decisions made by our XGBoost model.
This process is not just about gaining technical clarity, but about
ensuring that the predictive model we have built is transparent and its
results are interpretable. With the explainer, we aim to break down the
model’s predictions and understand the impact of each variable,
particularly our spatial feature DistanceToHotspot.
accident_data_matrix <- as.matrix(
accident_data %>%
select(-Accident_Severity) %>%
mutate_if(is.factor, as.numeric)
)
accident_data_y <- accident_data$Accident_Severity
# Create an explainer for your model
explainer <- DALEX::explain(
model = model,
data = accident_data_matrix,
y = accident_data_y,
)
## Preparation of a new explainer is initiated
## -> model label : xgb.Booster ( default )
## -> data : 107425 rows 10 cols
## -> data : rownames to data was added ( from 1 to 107425 )
## -> target variable : 107425 values
## -> predict function : yhat.default will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package Model of class: xgb.Booster package unrecognized , ver. Unknown , task regression ( default )
## -> predicted values : numerical, min = 0.0007605775 , mean = 0.01565999 , max = 0.5043826
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.2877295 , mean = -5.840353e-05 , max = 0.9980441
## A new explainer has been created!
This plot is a graphical representation of the importance of different predictors in the XGBoost model, as determined by the DALEX package’s explainer object. In this plot, the variables are ranked according to their influence on the model’s predictive accuracy, with longer bars indicating greater importance.
# Variable Importance Plot
vi <- variable_importance(explainer)
plot(vi)
In this plot, the x-axis represents different values of the Speed_limit, while the y-axis shows the average predicted probability of the accident severity. The shape of the curve suggests that the predicted severity of accidents increases significantly as the speed limit goes up to a certain point, then plateaus, and starts to decrease after that. This could imply that there’s an optimal range of speed limits that are associated with the severity of accidents, which could provide actionable insights for policymakers when considering speed regulation as a part of road safety measures.
# Partial Dependence Plot for Speed_limit variable
sl_pdp <- model_profile(explainer, variables = "Speed_limit")
plot(sl_pdp)
This plot shows a distinct trend where the predicted severity of accidents changes with varying distances to hotspots. Initially, there is a sharp increase, which suggests that accidents occurring closer to hotspots (but not the hotspot themselves) are predicted to be more severe. Beyond a certain point, as the distance increases, the predicted severity decreases, indicating that accidents much further from these hotspots might be less fatal.
# Patrial Dependence Plot for DistanceToHotspot variable
dth_pdp <- model_profile(explainer, variables = "DistanceToHotspot")
plot(dth_pdp)
The Break Down Plot starts with a baseline, which is the intercept (the average prediction over the dataset). Each subsequent bar shows the contribution of an individual feature to the prediction, with bars pointing to the right indicating an increase in the predicted value and bars to the left indicating a decrease. The sum of these contributions and the intercept gives us the final prediction for this particular observation.
We can see the effects of features like TimeOfDay, Speed_limit, Urban_or_Rural_Area, and others, including the spatial variable DistanceToHotspot. The length and direction of each bar provide a clear and detailed picture of the factors leading to the prediction, which is particularly useful for understanding the model’s decision-making process for individual predictions. This insight can be especially valuable when investigating specific cases or anomalies.
# Break Down Plot for a single prediction
first_observation <- accident_data_matrix[1, , drop = FALSE]
bd <- predict_parts(
explainer,
new_observation = first_observation
)
plot(bd)
Each bar represents a feature’s SHAP value, which quantifies the impact of that feature on the model’s output. Red bars indicate features that push the prediction lower (decrease the predicted probability of the outcome), and green bars indicate features that push the prediction higher. The length of the bar shows the magnitude of the feature’s impact.
In SHAP value plots, the sum of all feature contributions along with the base value (not shown in the plot) equals the final prediction for the given observation. SHAP values provide a more nuanced and detailed explanation of the prediction, attributing the contribution of each feature in a way that sums up to the actual prediction made by the model.
# Shapley Values Plot for a single prediction
sv <- predict_parts(
explainer,
new_observation = first_observation,
type = "shap"
)
plot(sv)
The comprehensive analysis conducted has led to several notable findings. The DBSCAN clustering method effectively identified 29 distinct clusters, predominantly located near highly populated areas such as major cities and urban agglomerations. This pattern aligns with the intuitive understanding that busier areas are likely to witness a higher frequency of traffic incidents.
The exploratory data analysis (EDA) further refined our insights, revealing that minor accidents tend to cluster around these hotspots, while severe or fatal accidents are dispersed over a range of distances from these focal points. Patterns across the days of the week and times of day were also observed, with Fridays and afternoons being particularly prone to higher accident counts. Interestingly, the severity of accidents did not show a significant variation with these temporal factors.
The peak in accident distribution at a speed limit of 30 units suggests that most accidents occur within urban speed zones, where traffic density is high and numerous variables may contribute to accidents.
Through the application of the XGBoost model, we were able to quantify the relationship between these factors and accident severity. The model’s accuracy and the AUC score indicate a robust predictive capability, particularly in distinguishing between minor and severe accident outcomes.
In conclusion, the spatial analysis of accident severity in proximity to hotspots has provided valuable insights, with the potential to inform targeted road safety measures and urban planning strategies. Future work may delve deeper into the causal factors and explore interventions to mitigate risks, especially in identified high-risk zones.