1. PROJECT INTRODUCTION

Road accidents are a major concern all over the world, causing injuries and fatalities each year as well as significant economic losses. The World Health Organization's 2018 report states that more than 1.35 million people die each year from causes related to road accidents, and an additional 20-50 million are injured or disabled [1]. If the factors that cause accidents can be better understood and predicted, it might be possible to take measures to mitigate accidents and their severity [2].


2. PROJECT RESEARCH QUESTIONS

Many factors are responsible for causing road accidents. Reducing the occurrence of fatal traffic accidents and improving road safety has been a significant problem for both governments and research institutions. Knowing what the influential factors are and how they affect accidents builds a better understanding of cause and effect, which in turn improves the estimation of accident severity and the preparation of countermeasures [3].

The purpose of this project is to predict road accident crash severity and analyse non-human factors causing accidents, using traffic crash data from 2011 to 2023 obtained from the New Zealand government's open data portal (data.gov.nz). The research questions that the project aims to answer are:

  1. What are the major accident vehicle types?
  2. Are there any non-human factors that cause more severe accidents?
  3. Can road accident injury severity be predicted from past data?

3. PROJECT RESEARCH OBJECTIVES

The research objectives for the project are:

  1. Identify major accident vehicle types involved.
  2. Identify non-human variables that may cause and increase the likelihood of severe accidents, including high fatalities.
  3. Develop machine learning models to predict road accident injury severity.

4. LITERATURE REVIEW

Given its significant impact on society and on mortality rates, accident prediction has been studied extensively with many different models and methodologies. In 2019, Hébert et al. [4] used a balanced random forest algorithm to study the accidents that occurred in Montreal; overall, their model predicted 85% of Montreal incidents, with a false positive rate (FPR) of 13%. Another 2019 study, which used a GIS-based data mining framework to analyse the influential factors behind fatal traffic accidents, found that XGBoost obtained the highest modeling accuracy [3]. Fiorentini and Losa (2020) showed that models based on random undersampling of the majority class (RUMC) improve the reliability of classifiers for detecting fatal crashes and those causing injury: on these imbalanced data, reported performance for the RUMC-based models spans from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbour). Organisations and decision-makers could make use of RUMC and machine learning algorithms to predict the severity of a crash occurrence, manage the present, and plan future works [6].

Many factors contribute to traffic accidents. Previous research shows that the factors most often contributing to accident occurrence are human factors, with driver inattention ranking highest; this can stem from several causes, e.g. distraction, attention overload, and monotonous driving [5]. However, non-human factors also cause accidents. One study by Jalilian et al. in 2019 [7] showed a significant relationship between fatal road traffic accidents (RTAs) and factors such as the type of road, hindered visibility, the location of the accident, the climate, and the lighting of the day (P<0.05). When it was cloudy, the chance of a fatal RTA was 2.60 times higher than when it was clear (P<0.05). However, the sample size was small, with only 2,314 accidents examined.


5. DATA COLLECTION & DATASET INFORMATION

The dataset comes from the Waka Kotahi Crash Analysis System (CAS), which records all traffic crashes reported by the New Zealand (NZ) Police; the data is published on the NZ government's open data portal (data.gov.nz). CAS covers crashes on all NZ roadways to which the public have legal access with a motor vehicle.

The Dataset URL is as follows:

  1. Dataset Website
  2. Actual Dataset URL

The dataset was downloaded from data.gov.nz on 30 Oct 2023. At that date, data was available from the 1999/2000 year to 2022/2023, comprising 821,744 observations.


6. DATA CLEANING (By Mithirendra/Yao Hong)

# Load the required libraries
library(dplyr)     # data manipulation (select, mutate, filter)
library(ggplot2)   # plotting
library(reshape2)  # melt() for reshaping data to long format

6a. LOAD THE DATASET

# retrieve large dataset
d1 <- read.csv("Crash_Analysis_System_(CAS)_data.csv")
str(d1)

The dataset is quite large, with over 800,000 observations. The first step is to select the key columns that are relevant to our data analysis.

# Subset data to smaller subset by the following categories:
# Year = Crash Year
# Vehicle Types = bicycle, bus, carStationWagon, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility
# Crash Location = crashLocation1, crashLocation2, region
# Severity of crash = crashSeverity, fatalCount, minorInjuryCount, seriousInjuryCount
# Accident location conditions = light, flatHill, roadLane, roadSurface, weatherA
# Others = holiday, areaUnitID

data <- select(d1, crashYear,  
                   bicycle, bus, carStationWagon, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility, 
                   crashLocation1, crashLocation2, region, 
                   crashSeverity, fatalCount, minorInjuryCount, seriousInjuryCount, 
                   light, flatHill, roadLane, roadSurface, weatherA, 
                   holiday, areaUnitID)

# Structure of dataset
str(data)
## 'data.frame':    821744 obs. of  25 variables:
##  $ crashYear         : int  2011 2012 2012 2011 2011 2011 2011 2011 2012 2012 ...
##  $ bicycle           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bus               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ carStationWagon   : int  2 2 0 2 1 1 1 1 2 2 ...
##  $ moped             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ motorcycle        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolBus         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ suv               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ taxi              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ truck             : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ vanOrUtility      : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ crashLocation1    : chr  "SH 35 WAINUI" "HALL ST" "SHARON ROAD" "SPRINGSTON ROLLESTON ROAD" ...
##  $ crashLocation2    : chr  "HIRINI ST" "LAKE ROAD" "RIDGE ROAD" "DYNES ROAD" ...
##  $ region            : chr  "Gisborne Region" "Waikato Region" "Auckland Region" "Canterbury Region" ...
##  $ crashSeverity     : chr  "Non-Injury Crash" "Non-Injury Crash" "Minor Crash" "Non-Injury Crash" ...
##  $ fatalCount        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ minorInjuryCount  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ seriousInjuryCount: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ light             : chr  "Overcast" "Overcast" "Bright sun" "Bright sun" ...
##  $ flatHill          : chr  "Flat" "Flat" "Hill Road" "Flat" ...
##  $ roadLane          : chr  "2-way" "2-way" "2-way" "2-way" ...
##  $ roadSurface       : chr  "Sealed" "Sealed" "Sealed" "Sealed" ...
##  $ weatherA          : chr  "Fine" "Fine" "Fine" "Fine" ...
##  $ holiday           : chr  "Christmas New Year" "" "" "" ...
##  $ areaUnitID        : int  544801 528900 507000 597513 611500 607300 568101 525420 528403 523601 ...
summary(data)
##    crashYear       bicycle             bus          carStationWagon 
##  Min.   :2000   Min.   :0.00000   Min.   :0.00000   Min.   : 0.000  
##  1st Qu.:2005   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 1.000  
##  Median :2010   Median :0.00000   Median :0.00000   Median : 1.000  
##  Mean   :2011   Mean   :0.02896   Mean   :0.01587   Mean   : 1.311  
##  3rd Qu.:2017   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 2.000  
##  Max.   :2023   Max.   :5.00000   Max.   :3.00000   Max.   :11.000  
##                 NA's   :5         NA's   :5         NA's   :5       
##      moped            motorcycle        schoolBus             suv        
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   :0.007098   Mean   :0.03627   Mean   :0.000753   Mean   :0.1051  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :4.000000   Max.   :8.00000   Max.   :3.000000   Max.   :6.0000  
##  NA's   :5          NA's   :5         NA's   :5          NA's   :5       
##       taxi            truck         vanOrUtility    crashLocation1    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Length:821744     
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :0.0107   Mean   :0.0804   Mean   :0.1758                     
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000                     
##  Max.   :5.0000   Max.   :5.0000   Max.   :6.0000                     
##  NA's   :5        NA's   :5        NA's   :5                          
##  crashLocation2        region          crashSeverity        fatalCount     
##  Length:821744      Length:821744      Length:821744      Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.00000  
##                                                           Mean   :0.01043  
##                                                           3rd Qu.:0.00000  
##                                                           Max.   :9.00000  
##                                                           NA's   :1        
##  minorInjuryCount  seriousInjuryCount    light             flatHill        
##  Min.   : 0.0000   Min.   : 0.0000    Length:821744      Length:821744     
##  1st Qu.: 0.0000   1st Qu.: 0.0000    Class :character   Class :character  
##  Median : 0.0000   Median : 0.0000    Mode  :character   Mode  :character  
##  Mean   : 0.3186   Mean   : 0.0692                                         
##  3rd Qu.: 1.0000   3rd Qu.: 0.0000                                         
##  Max.   :34.0000   Max.   :14.0000                                         
##  NA's   :1         NA's   :1                                               
##    roadLane         roadSurface          weatherA           holiday         
##  Length:821744      Length:821744      Length:821744      Length:821744     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    areaUnitID    
##  Min.   :500100  
##  1st Qu.:519400  
##  Median :536642  
##  Mean   :546242  
##  3rd Qu.:573523  
##  Max.   :626801  
##  NA's   :97

6b. CLEANING THE DATASET

The next step is to check whether there are any NAs or outliers. From the summary above, the number of NA values is very small, so these rows will be deleted. The box plots below also show no outliers. It was also found that some labels are missing; these missing labels were re-labelled as "Unknown".

6c. CHECKING FOR OUTLIERS AMONG VEHICLE TYPES

# Remove NAs
data <- na.omit(data)

# Finding outliers for vehicle type accidents
boxplot(data$bicycle, data$bus, data$carStationWagon, data$moped, data$motorcycle, data$schoolBus, data$suv,
        data$taxi, data$truck, data$vanOrUtility,
        names=c("Bicycle", "Bus", "Car", "Moped", "Motorcycle", "SchoolBus", "Suv", "Taxi", "Truck", "Van"))

6d. CHECKING FOR OUTLIERS AMONG INJURY TYPES

# Finding outliers for injury types
boxplot(data$fatalCount, data$minorInjuryCount, data$seriousInjuryCount,
        names=c("Fatality", "MinorInjury", "SeriousInjury"))

6e. REPLACING MISSING LABELS WITH UNKNOWN

# Replacing empty or "Null" labels with "Unknown"
data <- data %>% mutate(region = ifelse(region == "", 'Unknown', region))
data <- data %>% mutate(flatHill = ifelse(flatHill == "Null", 'Unknown', flatHill))
data <- data %>% mutate(roadLane = ifelse(roadLane == "Null", 'Unknown', roadLane))
data <- data %>% mutate(roadSurface = ifelse(roadSurface == "Null", 'Unknown', roadSurface))
data <- data %>% mutate(weatherA = ifelse(weatherA == "Null", 'Unknown', weatherA))
data <- data %>% mutate(holiday = ifelse(holiday == "", 'Unknown', holiday))

7. EXPLORATORY DATA ANALYSIS & VISUALIZATIONS (By Mithirendra/Yao Hong)

Next, an exploratory data analysis is conducted to get a feel for the data and to gain more insight into the variables.

7a. SHOWCASING OVERALL NUMBER OF ACCIDENTS IN DATASET

First, we try to understand the number of accidents in the whole dataset.
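The yearly totals give a quick overview; a minimal sketch of how they could be visualised (assuming the cleaned data object from Section 6, which still contains all crash years) is shown below.

# Sketch (assumed): number of accidents per year across the full dataset
barplot(table(data$crashYear),
        main = "Number of accidents per year (full dataset)",
        col = "red",
        xlab = "Year",
        ylab = "Number of accidents")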

7b. SUBSETTING THE DATASET

Due to the sheer size of the dataset, the data is subset to the last 8 years (2015 - 2022) for analysis. The new dataset consists of 278,823 observations and 26 variables; a sketch of how this subset could be produced is shown below.
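A minimal sketch of the subsetting step (assumed, since the original code is not shown): keep crash years 2015 to 2022 and add a total_vehicles column as the row sum of the vehicle-type counts, consistent with the structure output below.

# Sketch (assumed): subset to the last 8 years and derive total_vehicles
data <- filter(data, crashYear >= 2015 & crashYear <= 2022)
data$total_vehicles <- rowSums(select(data, bicycle, bus, carStationWagon, moped, motorcycle,
                                      schoolBus, suv, taxi, truck, vanOrUtility))

# Structure of the reduced dataset
str(data)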

## 'data.frame':    278823 obs. of  26 variables:
##  $ crashYear         : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ bicycle           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bus               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ carStationWagon   : int  1 1 0 2 1 1 1 1 1 0 ...
##  $ moped             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ motorcycle        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ schoolBus         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ suv               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ taxi              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ truck             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ vanOrUtility      : int  0 2 0 0 0 0 0 1 0 0 ...
##  $ crashLocation1    : chr  "SH 96" "VIEW ROAD" "LAVAUD ST" "MOORHOUSE AVENUE" ...
##  $ crashLocation2    : chr  "OTAUTAU-WREYS BUSH ROAD" "GREAT NORTH ROAD" "MOUNT ALBERT ROAD" "MADRAS ST" ...
##  $ region            : chr  "Southland Region" "Auckland Region" "Wellington Region" "Canterbury Region" ...
##  $ crashSeverity     : chr  "Minor Crash" "Minor Crash" "Minor Crash" "Minor Crash" ...
##  $ fatalCount        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ minorInjuryCount  : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ seriousInjuryCount: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ light             : chr  "Overcast" "Bright sun" "Overcast" "Dark" ...
##  $ flatHill          : chr  "Flat" "Hill Road" "Hill Road" "Flat" ...
##  $ roadLane          : chr  "2-way" "2-way" "2-way" "2-way" ...
##  $ roadSurface       : chr  "Sealed" "Sealed" "Sealed" "Sealed" ...
##  $ weatherA          : chr  "Fine" "Fine" "Fine" "Fine" ...
##  $ holiday           : chr  "" "" "" "" ...
##  $ areaUnitID        : int  612720 511800 576302 591500 549800 523816 521901 506653 609033 502001 ...
##  $ total_vehicles    : Named num  1 3 1 2 1 1 1 2 1 1 ...
##   ..- attr(*, "names")= chr [1:278823] "6224" "6316" "6482" "6641" ...
##  - attr(*, "na.action")= 'omit' Named int [1:104] 34715 80708 104127 105718 105815 106049 164160 193962 413044 413284 ...
##   ..- attr(*, "names")= chr [1:104] "34715" "80708" "104127" "105718" ...
# Summary of data
summary(data)

7c. CHECKING FOR NAs AFTER CLEANING & SUBSET

There are no more NAs in the dataset after cleaning, as the check below shows.

# Missing value checks
cat("Total number of missing values:", sum(is.na(data)), "\n")
## Total number of missing values: 0

7d. NUMBER OF ACCIDENTS IN THE LAST 8 YEARS

The following plot shows the number of accidents per year from 2015 to 2022. The number of accidents remains above 30,000 in every year except 2022.

# Showing number of accidents per year from reduced dataset (for last 8 years)
barplot(table(data$crashYear),
        main="Number of accidents per year from 2015 - 2022",
        ylim = c(0,50000),
        col = "red",
        xlab = "Year",
        ylab = "Number of accidents")

7e. ANSWERING RESEARCH QUESTION 1 - IDENTIFYING VEHICLE TYPES INVOLVED IN ACCIDENTS

The following line chart shows which vehicle types are involved in the most accidents. Cars and station wagons (the carStationWagon category) account for the highest number of accidents of any vehicle type in New Zealand.
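The vehicles_melted data frame used for the plot is not constructed in the code shown; a plausible sketch (assumed) aggregates the vehicle-type counts per year and reshapes them to long format with reshape2::melt.

# Sketch (assumed): total crashes per vehicle type and year, in long format
vehicles_by_year <- data %>%
  select(crashYear, bicycle, bus, carStationWagon, moped, motorcycle,
         schoolBus, suv, taxi, truck, vanOrUtility) %>%
  group_by(crashYear) %>%
  summarise(across(everything(), sum))
vehicles_melted <- melt(as.data.frame(vehicles_by_year),
                        id.vars = "crashYear",
                        value.name = "total_accident")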

ggplot(vehicles_melted, aes(x = crashYear, y = total_accident, color = variable)) +
  geom_line(linetype =1,
            lwd = 1.1) + 
  ggtitle("Number of crashes by vehicle year (2015 - 2022)") + xlab("Year") + ylab("Total crashes")

7f. ANSWERING RESEARCH QUESTION 2 - IDENTIFYING NON-HUMAN VARIABLES THAT COULD CAUSE ACCIDENTS

The following analysis examines whether accidents are associated with any non-human variables. The non-human variables analysed are:
* Road lighting
* Flat vs hill roads
* Number of road lanes
* Road conditions
* Weather conditions

Different graphs were plotted to understand the different variables.
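As an illustration of the plots used (a sketch only; the original plotting code is not shown), crashes by road lighting condition could be charted as follows, with the other variables handled analogously.

# Sketch (assumed): accidents by road lighting, coloured by crash severity
ggplot(data, aes(x = light, fill = crashSeverity)) +
  geom_bar(position = "stack") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ggtitle("Number of accidents by road lighting (2015 - 2022)") +
  xlab("Road lighting") + ylab("Total accidents")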

Here is a summary of the findings:
* Road lighting - accidents seem to happen during bright sunny days, dark days and even overcast days. There isn’t any particular road lighting that stands out.
* Flat vs Hill roads - a high number of accidents happened on flat roads.
* Number of Road lanes - most of the accidents happened on 2-way roads.
* Road conditions - most accidents happened on sealed roads.
* Weather conditions - most of the accidents happened when the weather was fine.

In summary, none of the non-human variables stands out as a cause of accidents, except perhaps flat, sealed, 2-way roads, which had high accident counts. The accidents on these roads may have been caused largely by human error such as speeding or losing control of the vehicle, since flat, sealed, 2-way roads could encourage drivers to speed.

7g. EXPLORING CRASH SEVERITY

In terms of crash severity, most of the accidents were non-injury crashes. The numbers of minor and serious crashes were much smaller.

# Showing number of accidents by crash severity types
barplot(table(data$crashSeverity),
        col = "blue",
        ylim = c(0,200000),
        ylab = "Number of accidents",
        xlab = "Crash Severity",
        main = "Number of crashes by severity after accidents (2015 - 2022)")

7h. EXPLORING INJURY SEVERITY

In terms of injury severity, most of the recorded injuries were minor, with more than 80,000 in total. The numbers of serious injuries and fatalities were much smaller, although the number of serious injuries was still notable, at close to 20,000.

# Showing number of injury by injury types
sumdata <- select(data, fatalCount, minorInjuryCount, seriousInjuryCount)
sumdata <- colSums(sumdata)
barplot(sumdata,
        col = "green",
        ylim = c(0,90000),
        ylab = "Number of accidents",
        xlab = "Injury Severity",
        main = "Number of injuries by severity after accidents (2015 - 2022)")

7i. INJURY TYPES TREND

By year, most accident injuries occurred in 2018 and 2019. The COVID-19 pandemic in 2020 and 2021 reduced the number of accident injuries; however, the number rose again in 2022.
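The injury_melted data frame used below is not constructed in the code shown; a plausible sketch (assumed) melts the three injury-count columns and sums them per year and injury type, which would also produce the summarise() message printed below.

# Sketch (assumed): total injuries per year and injury type, in long format
injury_melted <- data %>%
  select(crashYear, fatalCount, minorInjuryCount, seriousInjuryCount) %>%
  melt(id.vars = "crashYear") %>%
  group_by(crashYear, variable) %>%
  summarise(total_injury = sum(value))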

## `summarise()` has grouped output by 'crashYear'. You can override using the
## `.groups` argument.
ggplot(injury_melted, aes(x = crashYear, y = total_injury, color = variable)) +
  geom_line(linetype =1,
            lwd = 1.1) + 
  ggtitle("Number of injury types by year (2015 - 2022)") + xlab("Year)") + ylab("Total injuries")

7j. CHECKING IF FESTIVE CELEBRATIONS CAUSE MORE ACCIDENTS

We also wanted to understand whether festive holiday periods lead to a higher number of accidents. The graph below shows that festive holidays do not contribute to a noticeably higher number of accidents.

# Shows Holiday Period with highest accidents
ggplot(data, aes(x = holiday, fill = crashSeverity)) + geom_bar(position = "stack") + theme(axis.text.x = element_text(angle = 90, size = 10)) + 
  ggtitle("Checking if accidents happened during festive holidays") + xlab("Festive Type") + ylab("Total accidents")

7k. CHECKING IF CERTAIN REGIONS HAVE MORE ACCIDENTS

The graph below shows that a higher number of accidents occurred in the regions containing the big cities, namely the Auckland, Wellington and Waikato regions.

# Shows Region with highest accidents
ggplot(data, aes(x = region, fill = crashSeverity)) + geom_bar(position = "stack") + theme(axis.text.x = element_text(angle = 90, size = 10)) + 
  ggtitle("Number of accidents by region") + xlab("Region") + ylab("Total accidents")

In general, the exploratory data analysis above has given a good understanding of the data. The next step is data modeling, to determine whether road accident injury severity can be predicted.


8. DATA MODELING & PREDICTIONS (By Wei Ven/Ain)

In the initial phase of modeling, the dataset was curated to focus on relevant variables, including the counts of fatal, minor, and serious injuries, as well as factors such as the types of vehicles involved and road conditions. Categorical variables such as 'flatHill', 'light', 'roadLane', and 'roadSurface' underwent one-hot encoding to transform them into a format suitable for modeling. The dataset was then split into training (80%) and testing (20%) sets, using a fixed random seed for reproducibility.

The first step will be to load the necessary libraries to perform the modeling.

#Wei Ven
library(caret)  # For createDataPartition function
library(randomForest)
library(dplyr)
library(xgboost)

# data <- readRDS("data.rds")

Next, the data is subset and prepared for modeling by splitting it into training and test sets.

#Subset data
data_reg <- select(data, fatalCount, minorInjuryCount, seriousInjuryCount,
                   bicycle, bus, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility,
                   flatHill, light, roadLane, roadSurface)

# One-Hot Encode 'flatHill', 'light', 'roadLane', 'roadSurface'
data_reg <- cbind(data_reg, model.matrix(~ flatHill + light + roadLane + roadSurface - 1, data = data_reg))

# Drop the original categorical columns after encoding
data_reg <- select(data_reg, -flatHill, -light, -roadLane, -roadSurface)

# Split the data into training and test sets
set.seed(123)
trainIndex <- sample(seq_len(nrow(data_reg)), size = 0.8 * nrow(data_reg))
train_data_reg <- data_reg[trainIndex, ]
test_data_reg <- data_reg[-trainIndex, ]

Four different models were explored by our team to compare their performance: three regression models and one classification model. The best-performing model will be chosen from those explored.

8a. MODEL 1 - REGRESSION - LINEAR REGRESSION MODEL

The first model employed in this project is a Linear Regression Model, tailored to predicting 'fatalCount', 'minorInjuryCount', and 'seriousInjuryCount'. This regression task involves predicting a numerical outcome for each injury category. Linear regression serves as a foundational baseline for comparison with more complex models. Predictions were generated on the test set and, for practical application, rounded to whole numbers. The rationale behind using a linear regression model lies in its interpretability and simplicity, making it an appropriate starting point for understanding the basic relationships within the data in the context of a regression task.

# MODEL 1: Linear Regression Model

# Train Linear Regression Model for fatalCount
reg_model_fatal <- lm(fatalCount ~ ., data = train_data_reg)
predictions_fatal <- predict(reg_model_fatal, newdata = test_data_reg)
rounded_predictions_fatal <- round(predictions_fatal)

# Train Linear Regression Model for minorInjuryCount
reg_model_minor <- lm(minorInjuryCount ~ ., data = train_data_reg)
predictions_minor <- predict(reg_model_minor, newdata = test_data_reg)
rounded_predictions_minor <- round(predictions_minor)

# Train Linear Regression Model for seriousInjuryCount
reg_model_serious <- lm(seriousInjuryCount ~ ., data = train_data_reg)
predictions_serious <- predict(reg_model_serious, newdata = test_data_reg)
rounded_predictions_serious <- round(predictions_serious)

# Calculate MAE and MSE for fatalCount / minorInjuryCount / seriousInjuryCount
mae_fatal_lm <- mean(abs(test_data_reg$fatalCount - rounded_predictions_fatal))
mse_fatal_lm<- mean((test_data_reg$fatalCount - rounded_predictions_fatal)^2)
mae_minor_lm<- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_minor))
mse_minor_lm<- mean((test_data_reg$minorInjuryCount - rounded_predictions_minor)^2)
mae_serious_lm<- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_serious))
mse_serious_lm<- mean((test_data_reg$seriousInjuryCount - rounded_predictions_serious)^2)

8b. MODEL 2 - REGRESSION - XGBOOST MODEL

The second model employed is an XGBoost model, which is known for handling complex patterns effectively. It was trained on a specific set of features, excluding the target variable and the other injury counts, so that it adapts well to our dataset while keeping attention on the overall goal. The training process used 100 boosting rounds, allowing the model to improve its predictions iteratively. XGBoost naturally captures complex patterns and interactions among variables, making it more flexible than linear regression. The chosen features cover various aspects of the data, helping the model identify subtle relationships and patterns.

# MODEL 2: XGBoost Regression:

# Train XGBoost Model for fatalCount
xgb_model_fatal <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$fatalCount, nrounds = 100, print_every_n = 10)
## [1]  train-rmse:0.361019 
## [11] train-rmse:0.111809 
## [21] train-rmse:0.111268 
## [31] train-rmse:0.111223 
## [41] train-rmse:0.111200 
## [51] train-rmse:0.111180 
## [61] train-rmse:0.111172 
## [71] train-rmse:0.111140 
## [81] train-rmse:0.111119 
## [91] train-rmse:0.111112 
## [100]    train-rmse:0.111104
predictions_xgb_fatal <- predict(xgb_model_fatal, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_fatal <- round(predictions_xgb_fatal)

# Calculate MAE and MSE for fatalCount (XGBoost)
mae_fatal_xgb <- mean(abs(test_data_reg$fatalCount - rounded_predictions_xgb_fatal))
mse_fatal_xgb <- mean((test_data_reg$fatalCount - rounded_predictions_xgb_fatal)^2)

# XGBoost Regression for minorInjuryCount:
xgb_model_minor <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$minorInjuryCount, nrounds = 100, print_every_n=10)
## [1]  train-rmse:0.629876 
## [11] train-rmse:0.610864 
## [21] train-rmse:0.610297 
## [31] train-rmse:0.610155 
## [41] train-rmse:0.610039 
## [51] train-rmse:0.609959 
## [61] train-rmse:0.609909 
## [71] train-rmse:0.609863 
## [81] train-rmse:0.609710 
## [91] train-rmse:0.609661 
## [100]    train-rmse:0.609591
predictions_xgb_minor <- predict(xgb_model_minor, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_minor <-round(predictions_xgb_minor)

# Calculate MAE and MSE for minorInjuryCount (XGBoost)
mae_minor_xgb <- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor))
mse_minor_xgb <- mean((test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor)^2)

# XGBoost Regression for seriousInjuryCount:
xgb_model_serious <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$seriousInjuryCount, nrounds =100,print_every_n=10)
## [1]  train-rmse:0.422399 
## [11] train-rmse:0.292626 
## [21] train-rmse:0.292174 
## [31] train-rmse:0.292081 
## [41] train-rmse:0.291955 
## [51] train-rmse:0.291918 
## [61] train-rmse:0.291893 
## [71] train-rmse:0.291823 
## [81] train-rmse:0.291774 
## [91] train-rmse:0.291735 
## [100]    train-rmse:0.291701
predictions_xgb_serious <- predict(xgb_model_serious, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_serious <- round(predictions_xgb_serious)

# Calculate MAE and MSE for seriousInjuryCount (XGBoost)
mae_serious_xgb <- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious))
mse_serious_xgb <- mean((test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious)^2)

8c. MODEL 3 - REGRESSION - POISSON REGRESSION MODEL

The third model is a Poisson Regression Model, which capitalizes on the Poisson distribution's strength in handling count data. Beyond prediction, it offers a detailed perspective on how the selected features relate to the expected counts of road incidents, revealing the underlying dynamics that contribute to the frequency of fatal incidents. Key insights emerge from the model coefficients and their significance, which show how each feature influences the expected counts; a brief sketch of how to inspect them follows the fitting code below.

# MODEL 3: Fit Poisson Regression Model for fatalCount

poisson_model_fatal <- glm(fatalCount ~ ., data = train_data_reg, family = "poisson")

# Make predictions on the test set
predictions_poisson_fatal <- predict(poisson_model_fatal, newdata = test_data_reg, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Round predictions to the nearest integer
rounded_predictions_poisson <- round(predictions_poisson_fatal)

# Calculate MAE and MSE for fatalCount
mae_fatal_poisson <- mean(abs(test_data_reg$fatalCount - rounded_predictions_poisson))
mse_fatal_poisson <- mean((test_data_reg$fatalCount - rounded_predictions_poisson)^2)

# Fit Poisson Regression Model for minorInjuryCount
poisson_model_minor <- glm(minorInjuryCount ~ ., data = train_data_reg, family = "poisson")
predictions_poisson_minor <- predict(poisson_model_minor, newdata = test_data_reg, type = "response")
rounded_predictions_poisson_minor <- round(predictions_poisson_minor)
mae_minor_poisson <- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor))
mse_minor_poisson <- mean((test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor)^2)

# Fit Poisson Regression Model for seriousInjuryCount
poisson_model_serious <- glm(seriousInjuryCount ~ ., data = train_data_reg, family = "poisson")
predictions_poisson_serious <- predict(poisson_model_serious, newdata = test_data_reg, type = "response")
rounded_predictions_poisson_serious <- round(predictions_poisson_serious)
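
# Sketch: inspect the Poisson coefficients referred to above (illustrative;
# the summary output is not reproduced in this report). Exponentiated
# coefficients can be read as multiplicative effects (rate ratios) on the
# expected fatal count.
summary(poisson_model_fatal)
exp(coef(poisson_model_fatal))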

8d. MODEL 4 - CLASSIFICATION - RANDOM FOREST MODEL

The fourth model is a Random Forest classifier applied to the factorized crash severity variable. Training and test sets are created, and crash severity is predicted at the end.

library(randomForest)

# Select relevant columns for prediction
predictors <- data %>%
  select(
    light, flatHill, roadLane, roadSurface, weatherA, region,
    crashLocation1, crashLocation2, crashSeverity
  )

# Factorize crashSeverity for modeling
predictors$crashSeverity <- factor(
  predictors$crashSeverity,
  levels = c("Fatal Crash", "Minor Crash", "Non-Injury Crash", "Serious Crash")
)

# Helper that builds a random forest with 200 trees
buildmodel <- function(formula, data) {
  randomForest(formula, data, ntree = 200)
}
  
# Creating training and testing sets
set.seed(123)
trainIndex <- createDataPartition(predictors$crashSeverity, p = 0.8, 
                                list = FALSE, times = 1)
data_train <- predictors[trainIndex, ]
data_test <- predictors[-trainIndex, ]

# Training the Random Forest model
rf_formula <- crashSeverity ~ .
model_rf <- buildmodel(rf_formula, data_train)

# Predicting crashSeverity
predictions <- predict(model_rf, newdata = data_test)

9. EVALUATION OF MODELS (By Wei Ven/Ain)

This section calculates the Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Square Error (RMSE) for the three regression models, and evaluates the classification model using a confusion matrix.

9a. EVALUATION OF REGRESSION MODEL - CALCULATE MAE AND MSE

In this section, the MAE and MSE for the three regression models - Linear Regression, XGBoost and Poisson Regression - are calculated. The values obtained from the three models are then compared and interpreted.
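For reference, the metrics are simple functions of the prediction errors; the small helper below (illustrative only, not used in the remaining code) makes the definitions explicit.

# Illustrative definitions of the evaluation metrics (not used below)
mae  <- function(actual, predicted) mean(abs(actual - predicted))
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))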

# Calculate MAE and MSE for seriousInjuryCount (Poisson)
mae_serious_poisson <- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious))
mse_serious_poisson <- mean((test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious)^2)

# Create data frame for MAE
mae_results_df <- data.frame(
  Model = c("Linear Regression", "XGBoost", "Poisson"),
  MinorInjuryCount = c(mae_minor_lm, mae_minor_xgb, mae_minor_poisson),
  SeriousInjuryCount = c(mae_serious_lm, mae_serious_xgb, mae_serious_poisson),
  FatalCount = c(mae_fatal_lm, mae_fatal_xgb, mae_fatal_poisson)
)

# Create data frame for MSE
mse_results_df <- data.frame(
  Model = c("Linear Regression", "XGBoost", "Poisson"),
  MinorInjuryCount = c(mse_minor_lm, mse_minor_xgb, mse_minor_poisson),
  SeriousInjuryCount = c(mse_serious_lm, mse_serious_xgb, mse_serious_poisson),
  FatalCount = c(mse_fatal_lm, mse_fatal_xgb, mse_fatal_poisson)
)

# Print the MAE results data frame
print("MAE Results:")
## [1] "MAE Results:"
print(mae_results_df)
##               Model MinorInjuryCount SeriousInjuryCount FatalCount
## 1 Linear Regression        0.3120416         0.07224962 0.01043665
## 2           XGBoost        0.3082937         0.07117368 0.01056218
## 3           Poisson        0.3129024         0.07176544 0.01197884
# Print the MSE results data frame
print("MSE Results:")
## [1] "MSE Results:"
print(mse_results_df)
##               Model MinorInjuryCount SeriousInjuryCount FatalCount
## 1 Linear Regression        0.4692011         0.09699632 0.01423832
## 2           XGBoost        0.4644849         0.09642249 0.01432798
## 3           Poisson        0.4707433         0.10260916 0.13797185

9b. EVALUATION OF REGRESSION MODEL - CALCULATE RMSE

In this section, the RMSE for the three regression models - Linear Regression, XGBoost and Poisson Regression - is calculated. A data frame is then created from the values obtained for the three models so that they can be compared.

# Calculate RMSE for Linear Regression
rmse_fatal_lm <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_fatal)^2))
rmse_minor_lm <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_minor)^2))
rmse_serious_lm <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_serious)^2))

# Calculate RMSE for XGBoost
rmse_fatal_xgb <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_xgb_fatal)^2))
rmse_minor_xgb <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor)^2))
rmse_serious_xgb <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious)^2))

# Calculate RMSE for Poisson Regression
rmse_minor_poisson <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor)^2))
rmse_serious_poisson <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious)^2))
rmse_fatal_poisson <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_poisson)^2))

# Create a dataframe
rmse_df <- data.frame(
  Model = c("Linear Regression", "XGBoost", "Poisson Regression"),
  RMSE_Fatal = c(rmse_fatal_lm, rmse_fatal_xgb, rmse_fatal_poisson), 
  RMSE_Minor = c(rmse_minor_lm, rmse_minor_xgb, rmse_minor_poisson),
  RMSE_Serious = c(rmse_serious_lm, rmse_serious_xgb, rmse_serious_poisson)
)

# Print the RMSE dataframe
print(rmse_df)
##                Model RMSE_Fatal RMSE_Minor RMSE_Serious
## 1  Linear Regression  0.1193244  0.6849826    0.3114423
## 2            XGBoost  0.1196996  0.6815313    0.3105197
## 3 Poisson Regression  0.3714456  0.6861074    0.3203267

9c. EVALUATION OF CLASSIFICATION MODEL - CONFUSION MATRIX

The Random Forest model achieves an overall accuracy of approximately 69.3%, the proportion of correct predictions out of the total. The 95% confidence interval (CI) indicates that the true accuracy likely lies between 68.9% and 69.7%. The No Information Rate (NIR), the accuracy expected by always predicting the most frequent class, is approximately 68.9%.

# Random Forest Model evaluation
conf_matrix <- confusionMatrix(predictions, data_test$crashSeverity)

#output confusion matrix
conf_matrix
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Fatal Crash Minor Crash Non-Injury Crash Serious Crash
##   Fatal Crash                0           0                0             0
##   Minor Crash                8         241               12           105
##   Non-Injury Crash         484       13281            38419          3207
##   Serious Crash              0           6                0             0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6933          
##                  95% CI : (0.6894, 0.6971)
##     No Information Rate : 0.6892          
##     P-Value [Acc > NIR] : 0.01819         
##                                           
##                   Kappa : 0.0226          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Fatal Crash Class: Minor Crash
## Sensitivity                    0.000000           0.017815
## Specificity                    1.000000           0.997040
## Pos Pred Value                      NaN           0.658470
## Neg Pred Value                 0.991177           0.760149
## Prevalence                     0.008823           0.242598
## Detection Rate                 0.000000           0.004322
## Detection Prevalence           0.000000           0.006563
## Balanced Accuracy              0.500000           0.507428
##                      Class: Non-Injury Crash Class: Serious Crash
## Sensitivity                          0.99969            0.0000000
## Specificity                          0.02077            0.9998856
## Pos Pred Value                       0.69360            0.0000000
## Neg Pred Value                       0.96774            0.9405994
## Prevalence                           0.68918            0.0593942
## Detection Rate                       0.68897            0.0000000
## Detection Prevalence                 0.99333            0.0001076
## Balanced Accuracy                    0.51023            0.4999428

10. DATA INTERPRETATION & DISCUSSION OF OUTPUT (By Wei Ven/Ain)

ANSWERING RESEARCH QUESTION 3 - PREDICTION MODELS

10a. FOR REGRESSION MODELS

Analyzing the Mean Absolute Error (MAE) and Mean Squared Error (MSE) results provides valuable insights into the performance of our models. In terms of MAE, the Linear Regression model performs reasonably well, with errors of around 0.31 for minor injuries, 0.072 for serious injuries, and 0.0104 for fatal counts. The XGBoost model improves marginally on this baseline, achieving slightly lower MAE values for minor and serious injuries. The Poisson Regression model aligns closely with the Linear Regression and XGBoost results, with MAE values in the same range but generally slightly higher.

Shifting focus to MSE, the squared errors, which reflect the average squared differences between predicted and actual values, are larger than their absolute counterparts. Here the XGBoost model achieves slightly lower MSE values than Linear Regression for the minor and serious injury counts. The Poisson Regression model, while comparable to Linear Regression elsewhere, shows a markedly higher MSE for fatalCount (0.138 versus 0.014), suggesting an area for improvement.

The Root Mean Squared Error (RMSE) results further illuminate the precision of the models. XGBoost records the lowest RMSE for the minor and serious injury counts, Linear Regression is marginally lower for fatal counts, and Poisson Regression trails in all three categories. In summary, the combined analysis of MAE, MSE, and RMSE highlights the strengths of each model, with XGBoost standing out for its overall predictive precision and consistency across the different types of injury in a traffic accident.

10b. FOR CLASSIFICATION MODEL

The model performs poorly at predicting “Fatal Crash,” “Minor Crash,” and “Serious Crash,” as indicated by the low sensitivity values for these classes. High specificity is observed for most classes, reflecting good identification of non-occurrences of those crash types. The positive predictive value is reasonably good for “Non-Injury Crash” but is NaN for “Fatal Crash,” because the model never predicted that class. Overall, the model demonstrates moderate accuracy but struggles to correctly predict the critical classes “Fatal Crash” and “Serious Crash.” Further model tuning, class rebalancing, or feature engineering may be necessary to enhance predictions for these classes; one possible rebalancing approach is sketched below.
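In line with the balanced random forest approach noted in the literature review, one direction is stratified downsampling of the majority class during training. The sketch below is assumed and was not run in this project; it uses the strata and sampsize arguments of randomForest to draw equal samples from each severity class.

# Sketch (assumed, not run in this project): balanced random forest via
# stratified downsampling to the size of the rarest severity class
class_counts <- table(data_train$crashSeverity)
model_rf_balanced <- randomForest(
  crashSeverity ~ ., data = data_train, ntree = 200,
  strata = data_train$crashSeverity,
  sampsize = rep(min(class_counts), length(class_counts))
)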


11. CONCLUSION

In general, the analysis above has answered all three research questions set out at the beginning of this study. In the EDA portion, it was found that the major vehicle type involved in accidents was the car/station wagon, which contributed the highest number of accidents of any vehicle type. The study also examined whether non-human variables may increase the likelihood of severe accidents, including high fatalities. No non-human variable (such as road lighting, flat/hill roads, or number of road lanes) stood out as a cause of more accidents, except perhaps flat, sealed, 2-way roads. Most of the accidents that happened on these roads were most likely due to human error.

In terms of prediction, certain machine learning models can be used to predict traffic crash injury counts. Our analysis shows that XGBoost has the best overall predictive precision and consistency across the different injury types in a traffic accident. Absolute prediction errors are larger for minor injury counts than for serious and fatal counts, reflecting the much higher frequency of minor injuries. Further studies can explore whether additional models would improve these predictions.


12. REFERENCES

  1. WHO, "10 Facts on Global Road Safety." Accessed: Oct. 10, 2018. [Online]. Available: http://www.who.int/features/factfiles/roadsafety/en/
  2. Santos, D., Salas, J., Quaresma, P. Machine Learning Approaches to Traffic Analysis and Hotspot Prediction. Computers 2021, 10(12), 157; https://doi.org/10.3390/computers10120157
  3. Mai, J., Ding, Y., Cheng, J. C. P., Tan, Y., Gan, V. J. L., Zhang, J. C. Analyzing the Leading Causes of Traffic Fatalities Using XGBoost and Grid-Based Analysis: A City Management Perspective. IEEE Access 2019, Vol 7. 10.1109/ACCESS.2019.2946401
  4. Hébert, A., Guédon, T., Glatard, T., Jaumard, B. High-Resolution Road Vehicle Collision Prediction for the City of Montreal. Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9-12 December 2019.
  5. Bucsuházy, K., Matuchová, E., Zůvala, R., Moravcová, P., Kostíková, M., Mikulec, R. Human factors contributing to the road traffic accident occurrence. AIIT 2nd International Congress on Transport Infrastructure and Systems in a Changing World (TIS ROMA 2019), 23rd-24th September 2019, Rome, Italy.
  6. Fiorentini, N., Losa, M. Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms. Infrastructures 2020, 5(7), 61; https://doi.org/10.3390/infrastructures507006
  7. Jalilian, M. M., Safarpour, H., Bazyar, J., Keykaleh, M. S., Malekyan, L., Khorshidi, A. Environmental Related Risk Factors to Road Traffic Accidents in Ilam, Iran. Med Arch. 2019 Jun; 73(3): 169-172.