A Comparative Study on Road Accidents

Li Anqi (23069701), Rithana Gunasegaran (23085726), Sara Javed Sani (23063175), Sidharrthan A/L Pichyalagan (23068657), Lin Liy Yee (17007181)

1.0 Project Background

Road accidents are a persistent problem worldwide, with the World Health Organization (WHO) estimating that over 1.35 million people die each year due to road traffic accidents. In the United Kingdom, road accidents are a significant cause of injury and death, with the Royal Society for the Prevention of Accidents (RoSPA) reporting that in 2020, there were 1,460 fatalities and 24,831 serious injuries due to road accidents. The causes of road accidents are multifaceted and include factors such as driver behavior, vehicle safety, road design, and environmental conditions.

The primary objective of this study is to analyze the “Road Accidents Dataset” to identify trends and patterns in road accidents and to investigate the relationship between the severity of casualties and the Index of Multiple Deprivation (IMD) decile of their home areas. This analysis aims to provide insights into the factors contributing to road accidents and to inform strategies for improving road safety.

The “Road Accidents Dataset” provides a comprehensive collection of information on road accidents reported over multiple years. This dataset includes details on the accident status, vehicle and casualty references, demographics, and severity of casualties. Key attributes in the dataset cover pedestrian details, casualty types, road maintenance worker involvement, and the Index of Multiple Deprivation (IMD) decile for casualties’ home areas.

Previous studies have identified various factors contributing to road accidents, including driver behavior, vehicle safety, road design, and environmental conditions. For example, the WHO attributes the majority of road traffic deaths and injuries to human factors such as speeding, drunk driving, and failure to wear seatbelts. Similarly, the European Commission reports that road accidents remain a significant cause of injury and death, with the majority of road deaths occurring on rural roads.

2.0 Data Cleaning

The data cleaning process for the “Road Accidents Dataset” involved several steps to ensure the quality and integrity of the data. The first step was to read the CSV file into R using the read.csv function. This allowed us to access the data for further analysis. Next, we viewed the structure of the dataset using the str function. This provided us with information about the number of observations and variables in the dataset.

The dataset consisted of 62674 observations and 19 variables. We checked for missing values using sum(is.na(df.road)), which returned zero, indicating that there were no missing values in the dataset. Rows with unknown values were then removed to preserve data quality: the columns to check were specified, and any row containing -1 in one of them was filtered out.

Functions were defined to map age bands to intervals, classify casualty home areas, casualty classes, casualty severity, and casualty gender. These functions were used to create new columns with descriptive labels. The functions were applied to the dataset using the mutate function from the dplyr package. This created new columns with descriptive labels, making it easier to analyze and interpret the data.

In summary, the data cleaning process removed rows with unknown values, defined labelling functions for the categorical codes, and applied them to the dataset. The cleaned dataset was then used for further analysis and visualization.

3.0 Data Modeling

3.1 Regression

3.1.1 Data Preparation

  • Splitting the Data: Split the cleaned dataset into a training set (70% of the data) and a testing set (30% of the data).

  • Defining Predictors: Define the predictors as the variables that will be used to predict the class of casualties. These include vehicle_reference, sex_of_casualty, age_of_casualty, age_band_of_casualty, casualty_severity, pedestrian_location, pedestrian_movement, car_passenger, bus_or_coach_passenger, pedestrian_road_maintenance_worker, casualty_home_area_type, and casualty_imd_decile.

3.1.2 Model Training

  • Random Forest Model:

Train a random forest model using the training set. The model is trained with 500 trees and the number of variables tried at each split is set to the square root of the number of predictors.

3.1.3 Model Evaluation

  • Mean Squared Error (MSE):

Calculate the MSE as the average of the squared differences between the predicted and actual values.

  • Root Mean Squared Error (RMSE):

Calculate the RMSE as the square root of the MSE, providing a measure of the average magnitude of the errors.

  • R-squared:

Calculate the R-squared value, indicating the proportion of the variability in the target variable that is explained by the model. The formulas for all three metrics are given below.
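
For reference, the three metrics above take their standard forms, where $y_i$ denotes an actual value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of test observations:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$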

3.1.4 Model Visualization

  • Density Plot of Predicted and Actual Values: Visualize the density plot of the predicted and actual values to check if the model captures the underlying distribution of the data.

  • Box Plot of Predicted and Actual Values: Visualize the box plot of the predicted and actual values to check if the model’s predictions are consistent with the actual values.

  • Density Plot of Residuals: Visualize the density plot of the residuals to check if the model has significant bias and if the residuals are centered around zero.

3.2 Classification

The classification process for predicting casualty severity involves the following steps:

3.2.1 Data Preparation

  • Drop columns that won’t be used for prediction, such as collision_index, collision_reference, lsoa_of_casualty, age_interval, casualty_home.area_desc, casualty_class_desc, casualty_severity_desc, casualty_sex_desc, and car_passenger_desc.
  • Convert the casualty_severity column to a factor variable.
  • Separate the features (X) and the target variable (y).
  • Identify the numeric and categorical columns in the feature set (X).
  • Fill in missing values for numeric columns using the median, and for categorical columns using the mode.
  • Encode the categorical variables as factors.
  • Standardize the numeric features by centering and scaling (a preprocessing sketch follows this list).
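
Since Section 4.3 ultimately only shows the column-dropping and factor-conversion steps, the following is a minimal sketch (not the code actually run) of how the imputation, encoding, and scaling steps could be carried out in R. The Mode() helper is a hypothetical utility, and df.road_clean_modified is the prepared data frame from Section 4.3.

# Minimal preprocessing sketch (assumed data frame: df.road_clean_modified)
library(dplyr)
library(caret)

# Hypothetical helper: most frequent value of a vector
Mode <- function(x) names(sort(table(x), decreasing = TRUE))[1]

# Separate the target variable (y) and the features (X)
y <- as.factor(df.road_clean_modified$casualty_severity)
X <- df.road_clean_modified %>% select(-casualty_severity)

# Identify numeric and categorical columns
num_cols <- names(X)[sapply(X, is.numeric)]
cat_cols <- setdiff(names(X), num_cols)

# Impute missing values: median for numeric, mode for categorical
X <- X %>%
  mutate(across(all_of(num_cols), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)),
         across(all_of(cat_cols), ~ ifelse(is.na(.), Mode(.), .)))

# Encode the categorical variables as factors
X <- X %>% mutate(across(all_of(cat_cols), as.factor))

# Standardize the numeric features by centering and scaling
pre_proc <- preProcess(X[num_cols], method = c("center", "scale"))
X[num_cols] <- predict(pre_proc, X[num_cols])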

3.2.2 Train-Test Split

  • Split the data into a training set (70%) and a testing set (30%) using the createDataPartition function from the caret package.
  • Combine the X_train and y_train datasets for training.

3.2.3 Model Training

  • Initialize a list of different classification models, including Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and Neural Network.
  • Train each model using the train function from the caret package, with casualty_severity as the target variable and the remaining features as predictors (a sketch is given after this list).
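
The results in Section 4.3 ultimately use multinomial logistic regression, so the snippet below is only a sketch of how this multi-model step could look with caret. The method strings ("rpart", "rf", "gbm", "svmRadial", "nnet") are the usual caret wrappers for the five model families listed above, and train_data is assumed to be the combined training set from 3.2.2; the underlying packages (rpart, randomForest, gbm, kernlab, nnet) need to be installed.

# Sketch: fit several classifiers on shared cross-validation folds
library(caret)

set.seed(123)
folds <- createFolds(train_data$casualty_severity, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

model_methods <- c(DecisionTree     = "rpart",
                   RandomForest     = "rf",
                   GradientBoosting = "gbm",
                   SVM              = "svmRadial",
                   NeuralNetwork    = "nnet")

models <- lapply(model_methods, function(m) {
  train(casualty_severity ~ ., data = train_data,
        method = m, trControl = ctrl)
})

# Compare resampled accuracy and kappa across the fitted models
summary(resamples(models))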

3.2.4 Model Evaluation

  • The classification models will be evaluated on the testing set to assess their performance in predicting the casualty severity.

  • Appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score, will be calculated to compare the models’ performance (the per-class definitions are given below).
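
For a given class $c$, writing $TP_c$, $FP_c$, and $FN_c$ for the true positives, false positives, and false negatives of that class, these metrics take their standard per-class forms (the macro-averaged values reported in Section 4.3 are simply their means over the three classes):

$$
\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad
\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\,\mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}
$$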

3.2.5 Feature Importance

  • For the best-performing model, the feature importance will be analyzed to identify the most predictive variables in the model.
  • This information can provide insights into the key factors that influence the severity of road accidents (a brief sketch of this step follows).
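
Feature importance is not computed in the results section, so the lines below are only a sketch of how it could be extracted from a fitted caret model; best_model is a placeholder name for whichever train() object performs best.

# Sketch: variable importance for the best-performing model
library(caret)

importance <- varImp(best_model)  # best_model: a fitted train() object (placeholder)
print(importance)
plot(importance, top = 10)        # show the ten most influential predictors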

4.0 Results

4.1 Data Cleaning

 \[ \]:

# Read CSV
df.road <- read.csv("dft-road-casualty-statistics-casualty-provisional-mid-year-unvalidated-2023.csv")

# View the structure of the dataset
str(df.road)
'data.frame':  62674 obs. of  19 variables:
 $ collision_index                   : chr  "2023010419171" "2023010419183" "2023010419183" "2023010419189" ...
 $ collision_year                    : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ collision_reference               : chr  "010419171" "010419183" "010419183" "010419189" ...
 $ vehicle_reference                 : int  1 2 3 1 2 2 1 1 1 1 ...
 $ casualty_reference                : int  1 1 2 1 1 1 1 1 1 1 ...
 $ casualty_class                    : int  3 1 2 1 1 1 3 1 1 3 ...
 $ sex_of_casualty                   : int  2 1 2 1 1 1 1 1 1 1 ...
 $ age_of_casualty                   : int  20 25 38 50 34 24 65 22 20 33 ...
 $ age_band_of_casualty              : int  4 5 7 8 6 5 9 5 4 6 ...
 $ casualty_severity                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ pedestrian_location               : int  5 0 0 0 0 0 4 0 0 8 ...
 $ pedestrian_movement               : int  1 0 0 0 0 0 4 0 0 1 ...
 $ car_passenger                     : int  0 0 2 0 0 0 0 0 0 0 ...
 $ bus_or_coach_passenger            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pedestrian_road_maintenance_worker: int  0 0 0 0 0 0 0 0 0 0 ...
 $ casualty_type                     : int  0 9 9 9 1 3 0 1 9 0 ...
 $ casualty_home_area_type           : int  1 1 -1 1 1 -1 1 2 1 1 ...
 $ casualty_imd_decile               : int  10 3 -1 5 2 -1 5 10 3 8 ...
 $ lsoa_of_casualty                  : chr  "E01030370" "E01001546" "-1" "E01002443" ...

From the 19 variables, we note that the majority of the data types are integers. However, many of these integers encode categorical values, so we continue cleaning the data by defining descriptive labels for them.

 \[ \]:

summary(df.road)
 collision_index    collision_year collision_reference vehicle_reference
 Length:62674       Min.   :2023   Length:62674        Min.   :  1.000  
 Class :character   1st Qu.:2023   Class :character    1st Qu.:  1.000  
 Mode  :character   Median :2023   Mode  :character    Median :  1.000  
                    Mean   :2023                       Mean   :  1.468  
                    3rd Qu.:2023                       3rd Qu.:  2.000  
                    Max.   :2023                       Max.   :992.000  
 casualty_reference casualty_class  sex_of_casualty  age_of_casualty 
 Min.   : 1.000     Min.   :1.000   Min.   :-1.000   Min.   : -1.00  
 1st Qu.: 1.000     1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 22.00  
 Median : 1.000     Median :1.000   Median : 1.000   Median : 34.00  
 Mean   : 1.375     Mean   :1.491   Mean   : 1.358   Mean   : 36.95  
 3rd Qu.: 1.000     3rd Qu.:2.000   3rd Qu.: 2.000   3rd Qu.: 51.00  
 Max.   :70.000     Max.   :3.000   Max.   : 9.000   Max.   :102.00  
 age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement
 Min.   :-1.000       Min.   :1.000     Min.   :-1.0000     Min.   :0.0000     
 1st Qu.: 5.000       1st Qu.:3.000     1st Qu.: 0.0000     1st Qu.:0.0000     
 Median : 6.000       Median :3.000     Median : 0.0000     Median :0.0000     
 Mean   : 6.315       Mean   :2.785     Mean   : 0.8087     Mean   :0.6634     
 3rd Qu.: 8.000       3rd Qu.:3.000     3rd Qu.: 0.0000     3rd Qu.:0.0000     
 Max.   :11.000       Max.   :3.000     Max.   :10.0000     Max.   :9.0000     
 car_passenger     bus_or_coach_passenger pedestrian_road_maintenance_worker
 Min.   :-1.0000   Min.   :-1.00000       Min.   :-1.00000                  
 1st Qu.: 0.0000   1st Qu.: 0.00000       1st Qu.: 0.00000                  
 Median : 0.0000   Median : 0.00000       Median : 0.00000                  
 Mean   : 0.2246   Mean   : 0.06234       Mean   : 0.03721                  
 3rd Qu.: 0.0000   3rd Qu.: 0.00000       3rd Qu.: 0.00000                  
 Max.   : 9.0000   Max.   : 9.00000       Max.   : 2.00000                  
 casualty_type    casualty_home_area_type casualty_imd_decile
 Min.   :-1.000   Min.   :-1.00           Min.   :-1.00      
 1st Qu.: 1.000   1st Qu.: 1.00           1st Qu.: 2.00      
 Median : 9.000   Median : 1.00           Median : 4.00      
 Mean   : 9.508   Mean   : 1.05           Mean   : 4.26      
 3rd Qu.: 9.000   3rd Qu.: 1.00           3rd Qu.: 7.00      
 Max.   :98.000   Max.   : 3.00           Max.   :10.00      
 lsoa_of_casualty  
 Length:62674      
 Class :character  
 Mode  :character  
                   
                   
                   

Proceeding with this, we check for missing values.

 \[ \]:

sum(is.na(df.road))

0

At this point, we have noted that there are no missing values within the dataset.

However, rows containing -1 values are removed, since -1 denotes an unknown value that would affect modelling.

 \[ \]:

# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)

# Specify the columns to check for -1
columns_to_check <- c("vehicle_reference", "sex_of_casualty", "pedestrian_road_maintenance_worker",
                      "pedestrian_location", "lsoa_of_casualty", "casualty_type",
                      "casualty_imd_decile", "casualty_home_area_type", "car_passenger",
                      "age_of_casualty", "age_band_of_casualty")

# Filter out rows with -1 in any of the specified columns
df.road_clean <- df.road %>%
  filter(!if_any(all_of(columns_to_check), ~ . == -1) & sex_of_casualty != 9)

# Print the first few rows of the cleaned data frame to verify
head(df.road_clean)
str(df.road_clean)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
collision_index collision_year collision_reference vehicle_reference casualty_reference casualty_class sex_of_casualty age_of_casualty age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement car_passenger bus_or_coach_passenger pedestrian_road_maintenance_worker casualty_type casualty_home_area_type casualty_imd_decile lsoa_of_casualty
<chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
1 2023010419171 2023 010419171 1 1 3 2 20 4 3 5 1 0 0 0 0 1 10 E01030370
2 2023010419183 2023 010419183 2 1 1 1 25 5 3 0 0 0 0 0 9 1 3 E01001546
3 2023010419189 2023 010419189 1 1 1 1 50 8 3 0 0 0 0 0 9 1 5 E01002443
4 2023010419191 2023 010419191 2 1 1 1 34 6 3 0 0 0 0 0 1 1 2 E01004679
5 2023010419198 2023 010419198 1 1 3 1 65 9 3 4 4 0 0 0 0 1 5 E01023593
6 2023010419201 2023 010419201 1 1 1 1 22 5 3 0 0 0 0 0 1 2 10 E01026413
'data.frame':  52065 obs. of  19 variables:
 $ collision_index                   : chr  "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
 $ collision_year                    : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ collision_reference               : chr  "010419171" "010419183" "010419189" "010419191" ...
 $ vehicle_reference                 : int  1 2 1 2 1 1 1 1 1 1 ...
 $ casualty_reference                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ casualty_class                    : int  3 1 1 1 3 1 1 3 1 1 ...
 $ sex_of_casualty                   : int  2 1 1 1 1 1 1 1 2 1 ...
 $ age_of_casualty                   : int  20 25 50 34 65 22 20 33 58 40 ...
 $ age_band_of_casualty              : int  4 5 8 6 9 5 4 6 9 7 ...
 $ casualty_severity                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ pedestrian_location               : int  5 0 0 0 4 0 0 8 0 0 ...
 $ pedestrian_movement               : int  1 0 0 0 4 0 0 1 0 0 ...
 $ car_passenger                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bus_or_coach_passenger            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pedestrian_road_maintenance_worker: int  0 0 0 0 0 0 0 0 0 0 ...
 $ casualty_type                     : int  0 9 9 1 0 1 9 0 9 9 ...
 $ casualty_home_area_type           : int  1 1 1 1 1 2 1 1 1 1 ...
 $ casualty_imd_decile               : int  10 3 5 2 5 10 3 8 5 2 ...
 $ lsoa_of_casualty                  : chr  "E01030370" "E01001546" "E01002443" "E01004679" ...

 \[ \]:

# Get the total number of lines in the dataset
total_lines <- nrow(df.road)

print(total_lines)

# Get the total number of lines in the clean dataset
total_lines_clean <- nrow(df.road_clean)

print(total_lines_clean)

# Proportion of rows removed by the cleaning step
result <- 1 - (total_lines_clean / total_lines)

# Print the result
print(result)
[1] 62674
[1] 52065
[1] 0.1692727

4.1.1 Defining the variables

The variables defined describe age, casualty home area, casualty class, casualty severity, casualty gender, and car passenger status. They were defined to help us proceed to data visualisation.

Age

 \[ \]:

# Define a function to map age bands to intervals
map_age_band <- function(age_band) {
  if (age_band == 1) {
    return("0 - 5")
  } else if (age_band <= 10 && age_band > 0) {
    lower_bound <- (age_band - 1) * 5 + 1
    upper_bound <- age_band * 5
    return(paste(lower_bound, "-", upper_bound))
  } else if (age_band == -1) {
    return("Unknown")
  } else {
    return("> 50")
  }
}

# Add a new column using mutate
df.road_clean <- df.road_clean %>%
  mutate(age_interval = sapply(age_band_of_casualty, map_age_band))

head(df.road_clean)
A data.frame: 6 × 20
collision_index collision_year collision_reference vehicle_reference casualty_reference casualty_class sex_of_casualty age_of_casualty age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement car_passenger bus_or_coach_passenger pedestrian_road_maintenance_worker casualty_type casualty_home_area_type casualty_imd_decile lsoa_of_casualty age_interval
<chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2023010419171 2023 010419171 1 1 3 2 20 4 3 5 1 0 0 0 0 1 10 E01030370 16 - 20
2 2023010419183 2023 010419183 2 1 1 1 25 5 3 0 0 0 0 0 9 1 3 E01001546 21 - 25
3 2023010419189 2023 010419189 1 1 1 1 50 8 3 0 0 0 0 0 9 1 5 E01002443 36 - 40
4 2023010419191 2023 010419191 2 1 1 1 34 6 3 0 0 0 0 0 1 1 2 E01004679 26 - 30
5 2023010419198 2023 010419198 1 1 3 1 65 9 3 4 4 0 0 0 0 1 5 E01023593 41 - 45
6 2023010419201 2023 010419201 1 1 1 1 22 5 3 0 0 0 0 0 1 2 10 E01026413 21 - 25

Casualty Home Area

 \[ \]:

# Define a function to classify casualty home area
casualty_home.area_desc <- function(casualty_home_area_type) {
  if (casualty_home_area_type == 1) {
    return("Urban")
  } else if (casualty_home_area_type == 2) {
    return("Semi-urban")
  } else if (casualty_home_area_type == 3) {
    return("Rural")
  } else {
    return("Other")
  }
}

# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
  mutate(casualty_home.area_desc = sapply(casualty_home_area_type, casualty_home.area_desc))

# Display the first few rows to check the result
head(df.road_clean)
A data.frame: 6 × 21
collision_index collision_year collision_reference vehicle_reference casualty_reference casualty_class sex_of_casualty age_of_casualty age_band_of_casualty casualty_severity ⋯ pedestrian_movement car_passenger bus_or_coach_passenger pedestrian_road_maintenance_worker casualty_type casualty_home_area_type casualty_imd_decile lsoa_of_casualty age_interval casualty_home.area_desc
<chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> ⋯ <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 2023010419171 2023 010419171 1 1 3 2 20 4 3 ⋯ 1 0 0 0 0 1 10 E01030370 16 - 20 Urban
2 2023010419183 2023 010419183 2 1 1 1 25 5 3 ⋯ 0 0 0 0 9 1 3 E01001546 21 - 25 Urban
3 2023010419189 2023 010419189 1 1 1 1 50 8 3 ⋯ 0 0 0 0 9 1 5 E01002443 36 - 40 Urban
4 2023010419191 2023 010419191 2 1 1 1 34 6 3 ⋯ 0 0 0 0 1 1 2 E01004679 26 - 30 Urban
5 2023010419198 2023 010419198 1 1 3 1 65 9 3 ⋯ 4 0 0 0 0 1 5 E01023593 41 - 45 Urban
6 2023010419201 2023 010419201 1 1 1 1 22 5 3 ⋯ 0 0 0 0 1 2 10 E01026413 21 - 25 Semi-urban

Casualty classes

 \[ \]:

# Define a function to classify casualty classes
casualty_class_desc <- function(casualty_class) {
  if (casualty_class == 1) {
    return("Driver")
  } else if (casualty_class == 2) {
    return("Passenger")
  } else if (casualty_class == 3) {
    return("Pedestrian")
  } else {
    return("Other")
  }
}

# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
  mutate(casualty_class_desc = sapply(casualty_class, casualty_class_desc))

# Define a function to classify casualty severity
casualty_severity_desc <- function(casualty_severity) {
  if (casualty_severity == 1) {
    return("Fatal")
  } else if (casualty_severity == 2) {
    return("Serious")
  } else if (casualty_severity == 3) {
    return("Slight")
  } else {
    return("Other")
  }
}

# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
  mutate(casualty_severity_desc = sapply(casualty_severity, casualty_severity_desc))

# Define a function to classify casualty gender
casualty_sex_desc <- function(sex_of_casualty) {
  if (sex_of_casualty == 1) {
    return("Male")
  } else if (sex_of_casualty == 2) {
    return("Female")
  } else {
    return("Unknown")
  }
}

# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
  mutate(casualty_sex_desc = sapply(sex_of_casualty, casualty_sex_desc))

# Define a function to classify car passenger
car_passenger_desc <- function(car_passenger) {
  if (car_passenger == 0) {
    return("Yes - Car")
  } else {
    return("No - Others")
  }
}

# Apply the function to create a new column
df.road_clean <- df.road_clean %>%
  mutate(car_passenger_desc = sapply(car_passenger, car_passenger_desc))

# Display the first few rows to check the result
head(df.road_clean)

# Check the summary of the cleaned data frame
str(df.road_clean)
A data.frame: 6 × 25
collision_index collision_year collision_reference vehicle_reference casualty_reference casualty_class sex_of_casualty age_of_casualty age_band_of_casualty casualty_severity ⋯ casualty_type casualty_home_area_type casualty_imd_decile lsoa_of_casualty age_interval casualty_home.area_desc casualty_class_desc casualty_severity_desc casualty_sex_desc car_passenger_desc
<chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> ⋯ <int> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2023010419171 2023 010419171 1 1 3 2 20 4 3 ⋯ 0 1 10 E01030370 16 - 20 Urban Pedestrian Slight Female Yes - Car
2 2023010419183 2023 010419183 2 1 1 1 25 5 3 ⋯ 9 1 3 E01001546 21 - 25 Urban Driver Slight Male Yes - Car
3 2023010419189 2023 010419189 1 1 1 1 50 8 3 ⋯ 9 1 5 E01002443 36 - 40 Urban Driver Slight Male Yes - Car
4 2023010419191 2023 010419191 2 1 1 1 34 6 3 ⋯ 1 1 2 E01004679 26 - 30 Urban Driver Slight Male Yes - Car
5 2023010419198 2023 010419198 1 1 3 1 65 9 3 ⋯ 0 1 5 E01023593 41 - 45 Urban Pedestrian Slight Male Yes - Car
6 2023010419201 2023 010419201 1 1 1 1 22 5 3 ⋯ 1 2 10 E01026413 21 - 25 Semi-urban Driver Slight Male Yes - Car
'data.frame':  52065 obs. of  25 variables:
 $ collision_index                   : chr  "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
 $ collision_year                    : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ collision_reference               : chr  "010419171" "010419183" "010419189" "010419191" ...
 $ vehicle_reference                 : int  1 2 1 2 1 1 1 1 1 1 ...
 $ casualty_reference                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ casualty_class                    : int  3 1 1 1 3 1 1 3 1 1 ...
 $ sex_of_casualty                   : int  2 1 1 1 1 1 1 1 2 1 ...
 $ age_of_casualty                   : int  20 25 50 34 65 22 20 33 58 40 ...
 $ age_band_of_casualty              : int  4 5 8 6 9 5 4 6 9 7 ...
 $ casualty_severity                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ pedestrian_location               : int  5 0 0 0 4 0 0 8 0 0 ...
 $ pedestrian_movement               : int  1 0 0 0 4 0 0 1 0 0 ...
 $ car_passenger                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bus_or_coach_passenger            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pedestrian_road_maintenance_worker: int  0 0 0 0 0 0 0 0 0 0 ...
 $ casualty_type                     : int  0 9 9 1 0 1 9 0 9 9 ...
 $ casualty_home_area_type           : int  1 1 1 1 1 2 1 1 1 1 ...
 $ casualty_imd_decile               : int  10 3 5 2 5 10 3 8 5 2 ...
 $ lsoa_of_casualty                  : chr  "E01030370" "E01001546" "E01002443" "E01004679" ...
 $ age_interval                      : chr  "16 - 20" "21 - 25" "36 - 40" "26 - 30" ...
 $ casualty_home.area_desc           : chr  "Urban" "Urban" "Urban" "Urban" ...
 $ casualty_class_desc               : chr  "Pedestrian" "Driver" "Driver" "Driver" ...
 $ casualty_severity_desc            : chr  "Slight" "Slight" "Slight" "Slight" ...
 $ casualty_sex_desc                 : chr  "Female" "Male" "Male" "Male" ...
 $ car_passenger_desc                : chr  "Yes - Car" "Yes - Car" "Yes - Car" "Yes - Car" ...

4.1.2 Data Visualisation

 \[ \]:

hist(df.road_clean$age_of_casualty,
     main = "Distribution of Casualty Age",
     xlab = "Age",
     col = "skyblue",
     border = "white")


# Convert age_interval to a factor variable with custom levels
df.road_clean$age_interval <- factor(df.road_clean$age_interval, levels = c("0 - 5", "6 - 10", "11 - 15", "16 - 20", "21 - 25", "26 - 30", "31 - 35", "36 - 40", "41 - 45", "46 - 50", "> 50"))

# Tabulate the counts of each age interval
age_interval_counts <- table(df.road_clean$age_interval)

# Create a bar plot
barplot(age_interval_counts,
        main = "Distribution of Age Intervals",
        xlab = "Age Intervals",
        ylab = "Count",
        col = "skyblue",
        border = "white",
        las = 3)  # Rotate x-axis labels by 90 degrees

From the above histogram, we note that the majority of road casualties are adults in their 20s to 40s.

 \[ \]:

# Tabulate the counts of each casualty class description
class_counts <- table(df.road_clean$casualty_class_desc)

# Bar plot of casualty class descriptions
barplot(class_counts,
        main = "Frequency of Casualty Types",
        xlab = "Casualty Class",
        ylab = "Frequency",
        col = "skyblue",
        border ="white",
        names.arg = names(class_counts))

# Load necessary libraries
library(dplyr)
library(RColorBrewer)

# Calculate frequency of each casualty class description
class_counts <- table(df.road_clean$casualty_class_desc)

# Define a pastel color palette
pastel_palette <- brewer.pal(n = 5, name = "Pastel1")

# Create a pie chart
pie(class_counts,
    main = "Distribution of Casualty Class Descriptions",
    col = pastel_palette,
    cex = 0.8,
    labels = paste(names(class_counts), ": ", class_counts))

# Add a legend
legend("topright",
       legend = names(class_counts),
       fill = pastel_palette,
       cex = 0.8,
       title = "Casualty Class Desc.")

The dataset is mostly made up of driver casualties, followed by passengers and finally pedestrians.

 \[ \]:

# Show a table of casualty severity codes and their descriptions
severity_table <- table(df.road_clean$casualty_severity, df.road_clean$casualty_severity_desc)
severity_table


barplot(table(df.road_clean$casualty_class),
        main = "Frequency of Casualty Classes",
        xlab = "Casualty Class",
        ylab = "Frequency",
        col = "skyblue")


boxplot(df.road_clean$age_of_casualty ~ df.road_clean$casualty_severity,
        main = "Age Distribution by Casualty Severity",
        xlab = "Casualty Severity",
        ylab = "Age",
        col = "skyblue")

severity_by_class <- table(df.road_clean$casualty_class, df.road_clean$casualty_severity)
barplot(severity_by_class,
        main = "Casualty Severity by Casualty Class",
        xlab = "Casualty Class",
        ylab = "Frequency",
        col = c("skyblue", "salmon","purple"),
        legend = rownames(severity_by_class))
    Fatal Serious Slight
  1   551       0      0
  2     0    9768      0
  3     0       0  41746

 \[ \]:

# Count the frequency of each combination of casualty severity and gender
severity_gender_counts <- table(df.road_clean$casualty_sex_desc, df.road_clean$casualty_severity_desc)

# Plot the bar chart
barplot(severity_gender_counts,
        beside = TRUE,  # To plot bars side by side
        main = "Casualty Severity by Gender",
        xlab = "Frequency",
        ylab = "Casualty Severity",
        col = c("skyblue", "salmon","purple"),  # Specify colors for bars
        legend = rownames(severity_gender_counts),  # Add legend
        args.legend = list(title = "Gender"))  # Add legend title

 \[ \]:

# Install the gplots package if not already installed
install.packages("gplots")

# Load the gplots package
library(gplots)

# Filter out non-numeric columns
numeric_df <- df.road_clean[sapply(df.road_clean, is.numeric)]

# Remove columns with zero variance
numeric_df <- numeric_df[, sapply(numeric_df, function(x) var(x, na.rm = TRUE) != 0)]

# Calculate correlation matrix
correlation_matrix <- cor(numeric_df, use = "pairwise.complete.obs")

# Create the heatmap with correlation values
heatmap.2(correlation_matrix,
          main = "Correlation Heatmap of Numerical Variables",
          col = cm.colors(256),
          trace = "none",
          symm = TRUE,
          density.info = "none",
          key = TRUE,
          keysize = 1.2,
          key.title = NA,
          key.xlab = "Correlation",
          cexRow = 0.8,
          cexCol = 0.8,
          labRow = rownames(correlation_matrix),
          labCol = colnames(correlation_matrix),
          notecol = "black",
          notecex = 0.8,
          margins = c(10, 10))
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘bitops’, ‘gtools’, ‘caTools’



Attaching package: ‘gplots’


The following object is masked from ‘package:stats’:

    lowess

Based on the heatmap above, we can see notable relationships between the following:

pedestrian_location and casualty_class: a positive correlation between these variables indicates that the class of the casualty (driver, passenger, or pedestrian) is related to where the pedestrian was located at the time of the accident. This is expected, since pedestrian_location only takes a non-zero value when the casualty is a pedestrian.

casualty_class and car_passenger: the heatmap also indicates a correlation between the casualty class and whether the casualty was a car passenger. This is largely because passenger casualties (casualty_class = 2) are exactly the cases in which car_passenger is recorded, so the two columns partly encode the same information; in addition, passengers are themselves vulnerable to injury, depending on seating position, type of collision, and so on.

 \[ \]:

str(df.road_clean)
summary(df.road_clean)
'data.frame':  52065 obs. of  25 variables:
 $ collision_index                   : chr  "2023010419171" "2023010419183" "2023010419189" "2023010419191" ...
 $ collision_year                    : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ collision_reference               : chr  "010419171" "010419183" "010419189" "010419191" ...
 $ vehicle_reference                 : int  1 2 1 2 1 1 1 1 1 1 ...
 $ casualty_reference                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ casualty_class                    : int  3 1 1 1 3 1 1 3 1 1 ...
 $ sex_of_casualty                   : int  2 1 1 1 1 1 1 1 2 1 ...
 $ age_of_casualty                   : int  20 25 50 34 65 22 20 33 58 40 ...
 $ age_band_of_casualty              : int  4 5 8 6 9 5 4 6 9 7 ...
 $ casualty_severity                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ pedestrian_location               : int  5 0 0 0 4 0 0 8 0 0 ...
 $ pedestrian_movement               : int  1 0 0 0 4 0 0 1 0 0 ...
 $ car_passenger                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bus_or_coach_passenger            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pedestrian_road_maintenance_worker: int  0 0 0 0 0 0 0 0 0 0 ...
 $ casualty_type                     : int  0 9 9 1 0 1 9 0 9 9 ...
 $ casualty_home_area_type           : int  1 1 1 1 1 2 1 1 1 1 ...
 $ casualty_imd_decile               : int  10 3 5 2 5 10 3 8 5 2 ...
 $ lsoa_of_casualty                  : chr  "E01030370" "E01001546" "E01002443" "E01004679" ...
 $ age_interval                      : Factor w/ 11 levels "0 - 5","6 - 10",..: 4 5 8 6 9 5 4 6 9 7 ...
 $ casualty_home.area_desc           : chr  "Urban" "Urban" "Urban" "Urban" ...
 $ casualty_class_desc               : chr  "Pedestrian" "Driver" "Driver" "Driver" ...
 $ casualty_severity_desc            : chr  "Slight" "Slight" "Slight" "Slight" ...
 $ casualty_sex_desc                 : chr  "Female" "Male" "Male" "Male" ...
 $ car_passenger_desc                : chr  "Yes - Car" "Yes - Car" "Yes - Car" "Yes - Car" ...
 collision_index    collision_year collision_reference vehicle_reference
 Length:52065       Min.   :2023   Length:52065        Min.   :  1.00   
 Class :character   1st Qu.:2023   Class :character    1st Qu.:  1.00   
 Mode  :character   Median :2023   Mode  :character    Median :  1.00   
                    Mean   :2023                       Mean   :  1.47   
                    3rd Qu.:2023                       3rd Qu.:  2.00   
                    Max.   :2023                       Max.   :992.00   
                                                                        
 casualty_reference casualty_class  sex_of_casualty age_of_casualty 
 Min.   : 1.000     Min.   :1.000   Min.   :1.000   Min.   :  0.00  
 1st Qu.: 1.000     1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 23.00  
 Median : 1.000     Median :1.000   Median :1.000   Median : 35.00  
 Mean   : 1.342     Mean   :1.456   Mean   :1.382   Mean   : 38.06  
 3rd Qu.: 1.000     3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 51.00  
 Max.   :70.000     Max.   :3.000   Max.   :2.000   Max.   :102.00  
                                                                    
 age_band_of_casualty casualty_severity pedestrian_location pedestrian_movement
 Min.   : 1.000       Min.   :1.000     Min.   : 0.0000     Min.   :0.0000     
 1st Qu.: 5.000       1st Qu.:3.000     1st Qu.: 0.0000     1st Qu.:0.0000     
 Median : 6.000       Median :3.000     Median : 0.0000     Median :0.0000     
 Mean   : 6.519       Mean   :2.791     Mean   : 0.7703     Mean   :0.6357     
 3rd Qu.: 8.000       3rd Qu.:3.000     3rd Qu.: 0.0000     3rd Qu.:0.0000     
 Max.   :11.000       Max.   :3.000     Max.   :10.0000     Max.   :9.0000     
                                                                               
 car_passenger    bus_or_coach_passenger pedestrian_road_maintenance_worker
 Min.   :0.0000   Min.   :-1.00000       Min.   :0.0000                    
 1st Qu.:0.0000   1st Qu.: 0.00000       1st Qu.:0.0000                    
 Median :0.0000   Median : 0.00000       Median :0.0000                    
 Mean   :0.2087   Mean   : 0.05232       Mean   :0.0362                    
 3rd Qu.:0.0000   3rd Qu.: 0.00000       3rd Qu.:0.0000                    
 Max.   :9.0000   Max.   : 4.00000       Max.   :2.0000                    
                                                                           
 casualty_type    casualty_home_area_type casualty_imd_decile
 Min.   : 0.000   Min.   :1.000           Min.   : 1.000     
 1st Qu.: 1.000   1st Qu.:1.000           1st Qu.: 2.000     
 Median : 9.000   Median :1.000           Median : 5.000     
 Mean   : 9.523   Mean   :1.281           Mean   : 4.918     
 3rd Qu.: 9.000   3rd Qu.:1.000           3rd Qu.: 7.000     
 Max.   :98.000   Max.   :3.000           Max.   :10.000     
                                                             
 lsoa_of_casualty    age_interval   casualty_home.area_desc casualty_class_desc
 Length:52065       26 - 30:11035   Length:52065            Length:52065       
 Class :character   31 - 35: 8441   Class :character        Class :character   
 Mode  :character   36 - 40: 6827   Mode  :character        Mode  :character   
                    21 - 25: 5695                                              
                    16 - 20: 5338                                              
                    41 - 45: 5193                                              
                    (Other): 9536                                              
 casualty_severity_desc casualty_sex_desc  car_passenger_desc
 Length:52065           Length:52065       Length:52065      
 Class :character       Class :character   Class :character  
 Mode  :character       Mode  :character   Mode  :character  
                                                             
                                                             
                                                             
                                                             

4.1.3 Summary of data cleaning

  1. We have removed rows with negative (unknown) ages, which contributed 2.2% of the entire population.
  2. We have added descriptive variable definitions for easy reference.
  3. We have noted that the bus_or_coach_passenger column overlaps with the car_passenger column, so it could be dropped to avoid repetition in the data.

4.2 Regression Modeling and Evaluation

Question: How can the class of casualties in road accidents be predicted based on accident characteristics, vehicle information, and other casualty demographics?

casualty_class value:

  • 1 - driver
  • 2 - passenger
  • 3 - pedestrian

 \[ \]:

# Install the package for splitting the train set and test set from the cleaned dataset
install.packages("caret")
library(caret)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘proxy’, ‘iterators’, ‘Rcpp’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘timeDate’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’


Loading required package: ggplot2

Loading required package: lattice

 \[ \]:

# Create a subset for modeling
modelSet <- subset(df.road_clean, select = c(vehicle_reference, casualty_class, sex_of_casualty, age_of_casualty,
age_band_of_casualty, casualty_severity, pedestrian_location, pedestrian_movement, car_passenger, bus_or_coach_passenger,
pedestrian_road_maintenance_worker, casualty_home_area_type, casualty_imd_decile))

# Set random seed for reproducibility
set.seed(123)

# Create training set index with 70% of the data
trainIndex <- createDataPartition(modelSet$casualty_class, p = 0.7, list = FALSE)

# Split the data into training and testing sets based on the index
trainSet <- modelSet[trainIndex, ]
testSet <- modelSet[-trainIndex, ]

# View the number of rows in each dataset
nrow(trainSet)
nrow(testSet)

36446

15619

 \[ \]:

# Install the package for random forest modeling
install.packages("randomForest")
library(randomForest)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:ggplot2’:

    margin


The following object is masked from ‘package:dplyr’:

    combine

 \[ \]:

# Define the predictors
predictors <- setdiff(names(trainSet), "casualty_class")

# Train the Random Forest model
# Convert the data to a data.frame for randomForest function
trainSet <- as.data.frame(trainSet)
testSet <- as.data.frame(testSet)

# Create the formula for the Random Forest model
formula <- as.formula(paste("casualty_class", "~", paste(predictors, collapse = " + ")))

# Train the model
set.seed(123)  # For reproducibility
rf_model <- randomForest(formula, data = trainSet, ntree = 500, mtry = floor(sqrt(length(predictors))), importance = TRUE)

# Summarize the model
print(rf_model)

# Make predictions on the test set
predictions <- predict(rf_model, testSet)
Warning message in randomForest.default(m, y, ...):
“The response has five or fewer unique values.  Are you sure you want to do regression?”
Call:
 randomForest(formula = formula, data = trainSet, ntree = 500,      mtry = floor(sqrt(length(predictors))), importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 0.01283855
                    % Var explained: 97.58

 \[ \]:

# Evaluate the model

# Calculate Mean Squared Error (MSE)
actuals <- testSet[["casualty_class"]]
mse <- mean((predictions - actuals)^2)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)

# Calculate R-squared
ss_total <- sum((actuals - mean(actuals))^2)
ss_residual <- sum((actuals - predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Print evaluation metrics
cat("Mean Squared Error:", mse, "\n")
cat("Root Mean Squared Error:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
Mean Squared Error: 0.01388033 
Root Mean Squared Error: 0.1178148 
R-squared: 0.9737804 

Low MSE and Low RMSE: These metrics indicate that the prediction errors are small on average, meaning the model's predictions are very close to the actual values.

High R-squared: This indicates that the model explains a large proportion of the variability in the target variable, suggesting a very good fit.

In the context of predictive modeling, a low Mean Squared Error (MSE) and a low Root Mean Squared Error (RMSE) indicate that the model’s predictions are, on average, very close to the actual values. MSE measures the average squared difference between predicted and actual values, while RMSE expresses the average magnitude of these errors in the same units as the target variable. When both MSE and RMSE are low, the model’s predictions are accurate and deviate little from the true values.

Additionally, a high R-squared (R²) value indicates that the model can explain a large proportion of the variability observed in the target variable. R² ranges from 0 to 1, with 1 indicating that the model perfectly predicts the target variable based on the predictors. A high R-squared suggests that the model fits the data well, capturing a significant portion of the variance in the target variable and thereby demonstrating its efficacy in explaining the relationships between predictors and the outcome.

Together, low MSE, low RMSE, and high R-squared collectively signify that the predictive model performs well in terms of accuracy and explanatory power, making it a reliable tool for making predictions and understanding the underlying relationships within the data.

 \[ \]:

# Install ggplot2 for visualization
install.packages("ggplot2")
library(ggplot2)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

 \[ \]:

# Density Plot of Predicted and Actual Values
plot(density(actuals),
     main = "Density Plot of Actual vs Predicted Values",
     xlab = "Values",
     col = "salmon", lwd = 2)
lines(density(predictions), col = "purple", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("salmon", "purple"), lwd = 2)

# Box Plot of Predicted and Actual Values
boxplot(actuals, predictions,
        names = c("Actual", "Predicted"),
        main = "Box Plot of Actual vs Predicted Values",
        col = c("salmon", "purple"))

# Density Plot of Residuals
residuals <- actuals - predictions
ggplot(data.frame(residuals), aes(x = residuals)) +
  geom_density(fill = "purple", alpha = 0.5) +
  labs(title = "Density Plot of Residuals",
       x = "Residuals",
       y = "Density")

4.2.1 Output Discussion

1. Density Plot of Actual vs. Predicted Values

The peaks and shapes of both densities are very similar, which means that the predicted values follow the same distribution as the actual values.

Small deviations may exist, but the model captures the overall underlying distribution of the actual values well.

2. Box Plot of Actual vs. Predicted Values

Both distributions appear similar, indicating that the predictions are closely aligned with the actual values.

There are no extreme outliers, suggesting that the model's predictions are consistent with the actual data.

3. Density Plot of Residuals

The residuals are centered around zero, indicating that the model does not have significant bias.

The peak at zero is sharp, suggesting that most of the residuals are small, which is a good sign of model accuracy.

The spread of the residuals is relatively narrow, indicating that the model's predictions are consistently close to the actual values.

These visualizations collectively provide insights into the model’s performance in predicting casualty classes in road accidents, showcasing its ability to closely match actual data distributions and minimize prediction errors effectively.

4.3 Classification Modeling and Evaluation

Question: Can machine learning techniques effectively predict casualty severity, and which variables are most predictive in the model?

Since casualty severity is a multinomial (multi-class) classification problem, multinomial logistic regression can be used.

 \[ \]:

# Load necessary libraries
if(!require(caret)) {
    install.packages("caret")
    library(caret)
}

if(!require(nnet)) {
    install.packages("nnet")
    library(nnet)
}
Loading required package: e1071

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()           masks base::%||%()
✖ randomForest::combine() masks dplyr::combine()
✖ dplyr::filter()         masks stats::filter()
✖ dplyr::lag()            masks stats::lag()
✖ purrr::lift()           masks caret::lift()
✖ randomForest::margin()  masks ggplot2::margin()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: gbm

Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
“there is no package called ‘gbm’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loaded gbm 2.1.9

This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

Loading required package: nnet

 \[ \]:

head(df.road_clean)
collision_index collision_year collision_reference vehicle_reference casualty_reference casualty_class sex_of_casualty age_of_casualty age_band_of_casualty casualty_severity ⋯ casualty_type casualty_home_area_type casualty_imd_decile lsoa_of_casualty age_interval casualty_home.area_desc casualty_class_desc casualty_severity_desc casualty_sex_desc car_passenger_desc
<chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> ⋯ <int> <int> <int> <chr> <fct> <chr> <chr> <chr> <chr> <chr>
1 2023010419171 2023 010419171 1 1 3 2 20 4 3 ⋯ 0 1 10 E01030370 16 - 20 Urban Pedestrian Slight Female Yes - Car
2 2023010419183 2023 010419183 2 1 1 1 25 5 3 ⋯ 9 1 3 E01001546 21 - 25 Urban Driver Slight Male Yes - Car
3 2023010419189 2023 010419189 1 1 1 1 50 8 3 ⋯ 9 1 5 E01002443 36 - 40 Urban Driver Slight Male Yes - Car
4 2023010419191 2023 010419191 2 1 1 1 34 6 3 ⋯ 1 1 2 E01004679 26 - 30 Urban Driver Slight Male Yes - Car
5 2023010419198 2023 010419198 1 1 3 1 65 9 3 ⋯ 0 1 5 E01023593 41 - 45 Urban Pedestrian Slight Male Yes - Car
6 2023010419201 2023 010419201 1 1 1 1 22 5 3 ⋯ 1 2 10 E01026413 21 - 25 Semi-urban Driver Slight Male Yes - Car

 \[ \]:

# Drop columns that won't be used for prediction
columns_to_drop <- c('collision_index', 'collision_reference', 'lsoa_of_casualty',
                     'age_interval', 'casualty_home.area_desc', 'casualty_class_desc',
                     'casualty_severity_desc', 'casualty_sex_desc', 'car_passenger_desc')
df.road_clean_modified <- df.road_clean %>% select(-one_of(columns_to_drop))

# Convert the outcome variable to a factor if it is not already.

df.road_clean_modified$casualty_severity <- as.factor(df.road_clean_modified$casualty_severity)

 \[ \]:

# Split data into training and testing sets
set.seed(123)
train_index <- sample(seq_len(nrow(df.road_clean_modified)), size = 0.7*nrow(df.road_clean_modified))
train_data <- df.road_clean_modified[train_index, ]
test_data <- df.road_clean_modified[-train_index, ]

 \[ \]:

# Train the multinomial logistic regression model
multinom_model <- multinom(casualty_severity ~ ., data = train_data)
# weights:  51 (32 variable)
initial  value 40038.924861 
iter  10 value 20200.784998
iter  20 value 19995.322055
iter  30 value 19252.683059
iter  40 value 18961.170221
iter  50 value 18958.209543
iter  60 value 18947.713461
final  value 18947.521949 
converged

 \[ \]:

# Make predictions
predictions <- predict(multinom_model, newdata = test_data)

 \[ \]:

# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_data$casualty_severity)
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction     1     2     3
         1     0     0     0
         2     2     6     7
         3   163  2924 12516

Overall Statistics
                                         
               Accuracy : 0.8018         
                 95% CI : (0.7954, 0.808)
    No Information Rate : 0.8018         
    P-Value [Acc > NIR] : 0.5128         
                                         
                  Kappa : 0.0026         
                                         
 Mcnemar's Test P-Value : <2e-16         

Statistics by Class:

                     Class: 1  Class: 2 Class: 3
Sensitivity           0.00000 0.0020478 0.999441
Specificity           1.00000 0.9992907 0.002585
Pos Pred Value            NaN 0.4000000 0.802153
Neg Pred Value        0.98944 0.8126001 0.533333
Prevalence            0.01056 0.1876040 0.801831
Detection Rate        0.00000 0.0003842 0.801383
Detection Prevalence  0.00000 0.0009604 0.999040
Balanced Accuracy     0.50000 0.5006692 0.501013

 \[ \]:

# Overall accuracy
accuracy <- conf_matrix$overall['Accuracy']
cat("Accuracy: ", accuracy, "\n")

# Class-specific metrics
class_metrics <- conf_matrix$byClass

# Macro-averaged metrics
macro_avg_precision <- mean(class_metrics[, 'Precision'], na.rm = TRUE)
macro_avg_recall <- mean(class_metrics[, 'Recall'], na.rm = TRUE)
macro_avg_f1 <- mean(class_metrics[, 'F1'], na.rm = TRUE)
macro_avg_specificity <- mean(class_metrics[, 'Specificity'], na.rm = TRUE)

cat("Macro-Averaged Precision: ", macro_avg_precision, "\n")
cat("Macro-Averaged Recall: ", macro_avg_recall, "\n")
cat("Macro-Averaged F1 Score: ", macro_avg_f1, "\n")
cat("Macro-Averaged Specificity: ", macro_avg_specificity, "\n")
Accuracy:  0.8017672 
Macro-Averaged Precision:  0.6010767 
Macro-Averaged Recall:  0.3338296 
Macro-Averaged F1 Score:  0.4470349 
Macro-Averaged Specificity:  0.6672918 

4.3.1 Evaluation of Classification Model

Referring to the confusion matrix, the model performs poorly for Class 1, with no correct predictions. Class 2 has a handful of correct predictions but is also misclassified frequently. Class 3 is predicted well, owing to its high prevalence in the dataset.

The model achieved an accuracy of 80.18%, indicating high overall correctness in its predictions, but this is inflated by the prevalence of Class 3: the accuracy is no better than the no-information rate.

Kappa is very low (0.0026), indicating poor agreement beyond what would be expected by chance.

The macro-averaged precision of 0.6010767 indicates that, on average, the model's precision across all classes is about 60.11%. A precision of 60.11% suggests that when the model predicts a certain class, it is correct 60.11% of the time on average.

The recall, at 33.38%, indicates that the model correctly identifies only a third of the actual instances for each class on average, highlighting a significant shortfall in sensitivity.

The F1 score, which balances precision and recall, stands at 44.70%, reflecting the overall harmonic mean of these two metrics and underscoring the trade-off between them.

The specificity of 66.73% shows how well the model correctly identifies cases that do not belong to a particular class. Overall, these results suggest that although the model is generally good at making correct predictions, it has difficulty in correctly identifying all the actual instances of each class.

In conclusion, the model has low recall and moderate F1 score, despite having high accuracy and moderate precision. This might be due to class imbalance where one class is much more frequent than the others in the training data. More complex algorithms such as random forest and gradient boosting machine models with class weight adjustment might be more suitable for the prediction.
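
As a possible follow-up (not run in this study), the sketch below shows one way such an adjustment could look, using a class-weighted random forest. The weights are illustrative placeholders rather than tuned values, and train_data and test_data are the splits from Section 4.3.

# Sketch: class-weighted random forest for the imbalanced severity classes
library(randomForest)
library(caret)

set.seed(123)
rf_weighted <- randomForest(casualty_severity ~ .,
                            data    = train_data,
                            ntree   = 500,
                            classwt = c("1" = 50, "2" = 5, "3" = 1))  # illustrative class priors

pred_weighted <- predict(rf_weighted, newdata = test_data)
confusionMatrix(pred_weighted, test_data$casualty_severity)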

5.0 Conclusion

In this comparative study on road accidents using machine learning, we explored predictive models to analyze road casualties using a comprehensive dataset, predicting the class of casualty with a random forest and the casualty severity with multinomial logistic regression. Our findings highlight the potential of machine learning techniques for this task, with variables describing casualty demographics, the casualty’s role in the collision, and accident details contributing to model performance. At the same time, the evaluation showed that severity prediction achieved high overall accuracy but was limited by class imbalance, which skews predictions towards the most common (slight) severity class.

Furthermore, our study underscored the importance of data-driven approaches in understanding the underlying factors contributing to road accidents. By leveraging advanced analytics, we not only identified significant predictors but also gained insights that can inform targeted interventions and policy decisions aimed at improving road safety and reducing casualties. Moving forward, continued research in this domain could focus on enhancing model robustness, incorporating real-time data streams, and expanding predictive capabilities to further enhance road accident prevention strategies.

Overall, this study contributes to the growing body of knowledge on applying machine learning in transportation safety, emphasizing its potential to mitigate risks and promote safer road environments for all stakeholders involved.

6.0 Reference List

  • R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

  • R Documentation. (2024). confusionMatrix (caret). https://www.rdocumentation.org/packages/caret/versions/6.0-96/topics/confusionMatrix