Road accidents are a major concern all over the world, since they cause casualties, injuries, and fatalities each year. Accidents also cause significant economic losses. The 2018 report of the World Health Organization states that more than 1.35 million people die each year from causes related to road accidents. An additional 20-50 million are injured or disabled [1]. If factors that cause accidents can be better understood and predicted, it might be possible to take measures to mitigate the damages and its severity [2].
Many factors are responsible for causing road accidents. How to reduce the occurrence of fatal traffic accidents and improve road safety has been a significant problem for both governments and research institutions. Knowing what the influential factors are and how they affect the accidents can help better understand the cause-effect. This is beneficial to improve the estimation of the accident severity and preparation of countermeasures.[3]
The purpose of this project is to predict road accident crash severity and analyse non-human factors causing accidents, using traffic crash data from 2011 to 2023 from the New Zealand governments open data portal (data.gov.nz). Here is research questions that the project is aiming to answer:
The research objectives for the project are:
Given the significant impact to society and mortality rates, accident prediction has been extensively studied with many different models. From a methodology perspective, there are many that have been explored. In 2019, Herbert et. al. [4] used a balanced random forest algorithm to study the accidents that occurred in Montreal. Overall, the algorithms predicted 85 percent of Montreal incidents, with a false positive rate (FPR) of 13%. Another study in 2019 on a GIS-based data mining methodology framework to analyze the influential factors of fatal traffic accidents showed that XGBoost obtained the highest modeling accuracy. [3] Fiorentini in 2020 gave outcomes from a random undersampling majority-class (RUMC) based models provide an enhancement in the reliability of the classifiers for detecting fatal crashes and those causing injury. Indeed, in imbalanced models, which showed that for the RUMC-based models, it spans from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could make use of RUMC and machine learning algorithms in predicting the severity of a crash occurrence, managing the present, and planning the future of their works [6].
There are many factors that cause traffic accidents. Previous research showed the most often factor contributing to the accident occurrence are human factors, with driver inattention being the highest, which could be caused by several causes as e.g.distraction, overloading attention, monotonous driving, etc.[5]. However, non-human factors have also caused accidents. One study by Jalilian et. al in 2019 [7] showed a significant relationship between fatal RTAs and factors such as; the sort of the road, the hindered visibility, the location of the accident, the accidents’ place, the climate, and lighting of the day (P<0.05). When it was cloudy, the chance was 2.60 times more than when was clear (P<0.05). But the sample size used was small, with only 2314 accidents dataset examined.
The dataset comes from the Waka Kotahi Crash Analysis System (CAS), which records all traffic crashes reported to data.gov.nz by the New Zealand (NZ) Police. CAS covers crashes on all NZ roadways where the public have legal access with a motor vehicle.
The Dataset URL is as follows:
The dataset was downloaded from data.gov.nz on 30 Oct 2023. As of 30 Oct 2023, data was available from economic year 1999/2000 to 2022/2023, and has 821,744 observations.
# Load the required libraries
library(dplyr)
library(ggplot2)
library('reshape2')
# retrieve large dataset
d1 <- read.csv("Crash_Analysis_System_(CAS)_data.csv")
str(d1)
The dataset that we read is quite large, over 500000 observations. The first step will be to select key columns that will be relevant to our data analysis.
# Subset data to smaller subset by the following categories:
# Year = Crash Year
# Vehicle Types = bicycle, bus, carStationWagon, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility
# Crash Location = crashLocation1, crashLocation2, region
# Severity of crash = crashSeverity, fatalCount, minorInjuryCount, seriousInjuryCount
# Accident location conditions = light, flatHill, roadLane, roadSurface, WeatherA
# Others = Holiday, areaUnitID
data <- select(d1, crashYear,
bicycle, bus, carStationWagon, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility,
crashLocation1, crashLocation2, region,
crashSeverity, fatalCount, minorInjuryCount, seriousInjuryCount,
light, flatHill, roadLane, roadSurface, weatherA,
holiday, areaUnitID)
# Structure of dataset
str(data)
## 'data.frame': 821744 obs. of 25 variables:
## $ crashYear : int 2011 2012 2012 2011 2011 2011 2011 2011 2012 2012 ...
## $ bicycle : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ carStationWagon : int 2 2 0 2 1 1 1 1 2 2 ...
## $ moped : int 0 0 0 0 0 0 0 0 0 0 ...
## $ motorcycle : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolBus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ suv : int 0 0 0 0 0 0 0 0 0 0 ...
## $ taxi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ truck : int 0 0 0 0 0 0 0 1 0 0 ...
## $ vanOrUtility : int 0 0 1 0 1 0 0 0 0 0 ...
## $ crashLocation1 : chr "SH 35 WAINUI" "HALL ST" "SHARON ROAD" "SPRINGSTON ROLLESTON ROAD" ...
## $ crashLocation2 : chr "HIRINI ST" "LAKE ROAD" "RIDGE ROAD" "DYNES ROAD" ...
## $ region : chr "Gisborne Region" "Waikato Region" "Auckland Region" "Canterbury Region" ...
## $ crashSeverity : chr "Non-Injury Crash" "Non-Injury Crash" "Minor Crash" "Non-Injury Crash" ...
## $ fatalCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ minorInjuryCount : int 0 0 1 0 0 0 0 0 0 0 ...
## $ seriousInjuryCount: int 0 0 0 0 0 0 0 0 0 0 ...
## $ light : chr "Overcast" "Overcast" "Bright sun" "Bright sun" ...
## $ flatHill : chr "Flat" "Flat" "Hill Road" "Flat" ...
## $ roadLane : chr "2-way" "2-way" "2-way" "2-way" ...
## $ roadSurface : chr "Sealed" "Sealed" "Sealed" "Sealed" ...
## $ weatherA : chr "Fine" "Fine" "Fine" "Fine" ...
## $ holiday : chr "Christmas New Year" "" "" "" ...
## $ areaUnitID : int 544801 528900 507000 597513 611500 607300 568101 525420 528403 523601 ...
summary(data)
## crashYear bicycle bus carStationWagon
## Min. :2000 Min. :0.00000 Min. :0.00000 Min. : 0.000
## 1st Qu.:2005 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 1.000
## Median :2010 Median :0.00000 Median :0.00000 Median : 1.000
## Mean :2011 Mean :0.02896 Mean :0.01587 Mean : 1.311
## 3rd Qu.:2017 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 2.000
## Max. :2023 Max. :5.00000 Max. :3.00000 Max. :11.000
## NA's :5 NA's :5 NA's :5
## moped motorcycle schoolBus suv
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.0000
## Mean :0.007098 Mean :0.03627 Mean :0.000753 Mean :0.1051
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :4.000000 Max. :8.00000 Max. :3.000000 Max. :6.0000
## NA's :5 NA's :5 NA's :5 NA's :5
## taxi truck vanOrUtility crashLocation1
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:821744
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :0.0000 Median :0.0000 Median :0.0000 Mode :character
## Mean :0.0107 Mean :0.0804 Mean :0.1758
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :5.0000 Max. :5.0000 Max. :6.0000
## NA's :5 NA's :5 NA's :5
## crashLocation2 region crashSeverity fatalCount
## Length:821744 Length:821744 Length:821744 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Mode :character Median :0.00000
## Mean :0.01043
## 3rd Qu.:0.00000
## Max. :9.00000
## NA's :1
## minorInjuryCount seriousInjuryCount light flatHill
## Min. : 0.0000 Min. : 0.0000 Length:821744 Length:821744
## 1st Qu.: 0.0000 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Median : 0.0000 Mode :character Mode :character
## Mean : 0.3186 Mean : 0.0692
## 3rd Qu.: 1.0000 3rd Qu.: 0.0000
## Max. :34.0000 Max. :14.0000
## NA's :1 NA's :1
## roadLane roadSurface weatherA holiday
## Length:821744 Length:821744 Length:821744 Length:821744
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## areaUnitID
## Min. :500100
## 1st Qu.:519400
## Median :536642
## Mean :546242
## 3rd Qu.:573523
## Max. :626801
## NA's :97
The next steps is to find whether they are any NAs and any outliers. From the summary above, it can be seen that number of NA values were found to be very small and will be deleted. Box plot below also shows no outliers. It was also found that there are some labels that are missing. These missing labels were re-labelled as “Unknown”.
# Remove NAs
data <- na.omit(data)
# Finding outliers for vehicle type accidents
boxplot(data$bicycle, data$bus, data$carStationWagon, data$moped, data$motorcycle, data$schoolBus, data$suv,
data$taxi, data$truck, data$vanOrUtility,
names=c("Bicycle", "Bus", "Car", "Moped", "Motorcycle", "SchoolBus", "Suv", "Taxi", "Truck", "Van"))
# Finding outliers for injury types
boxplot(data$fatalCount, data$minorInjuryCount, data$seriousInjuryCount,
names=c("Fatality", "MinorInjury", "SeriousInjury"))
# Replacing empty regions values with "Unknown" label
data <- data %>% mutate(region = ifelse(region == "", 'Unknown', region))
data <- data %>% mutate(flatHill = ifelse(flatHill == "Null", 'Unknown', flatHill))
data <- data %>% mutate(roadLane = ifelse(roadLane == "Null", 'Unknown', roadLane))
data <- data %>% mutate(roadSurface = ifelse(roadSurface == "Null", 'Unknown', roadSurface))
data <- data %>% mutate(weatherA = ifelse(weatherA == "Null", 'Unknown', weatherA))
data <- data %>% mutate(holiday = ifelse(holiday == "", 'Unknown', holiday))
Next, an exploratory data analysis is conducted on the data to get a feel of the data and to gain more insights on the variables.
First, we try to understand the number of accidents in the whole dataset.
Due to the sheer size of the database, the data is subsetted into the last 8 years of data to perform analysis. The new dataset consists of 278,823 observations and 26 variables.
## 'data.frame': 278823 obs. of 26 variables:
## $ crashYear : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ bicycle : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ carStationWagon : int 1 1 0 2 1 1 1 1 1 0 ...
## $ moped : int 0 0 1 0 0 0 0 0 0 0 ...
## $ motorcycle : int 0 0 0 0 0 0 0 0 0 1 ...
## $ schoolBus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ suv : int 0 0 0 0 0 0 0 0 0 0 ...
## $ taxi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ truck : int 0 0 0 0 0 0 0 0 0 0 ...
## $ vanOrUtility : int 0 2 0 0 0 0 0 1 0 0 ...
## $ crashLocation1 : chr "SH 96" "VIEW ROAD" "LAVAUD ST" "MOORHOUSE AVENUE" ...
## $ crashLocation2 : chr "OTAUTAU-WREYS BUSH ROAD" "GREAT NORTH ROAD" "MOUNT ALBERT ROAD" "MADRAS ST" ...
## $ region : chr "Southland Region" "Auckland Region" "Wellington Region" "Canterbury Region" ...
## $ crashSeverity : chr "Minor Crash" "Minor Crash" "Minor Crash" "Minor Crash" ...
## $ fatalCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ minorInjuryCount : int 1 1 1 1 2 1 1 1 1 1 ...
## $ seriousInjuryCount: int 0 0 0 0 0 0 0 0 0 0 ...
## $ light : chr "Overcast" "Bright sun" "Overcast" "Dark" ...
## $ flatHill : chr "Flat" "Hill Road" "Hill Road" "Flat" ...
## $ roadLane : chr "2-way" "2-way" "2-way" "2-way" ...
## $ roadSurface : chr "Sealed" "Sealed" "Sealed" "Sealed" ...
## $ weatherA : chr "Fine" "Fine" "Fine" "Fine" ...
## $ holiday : chr "" "" "" "" ...
## $ areaUnitID : int 612720 511800 576302 591500 549800 523816 521901 506653 609033 502001 ...
## $ total_vehicles : Named num 1 3 1 2 1 1 1 2 1 1 ...
## ..- attr(*, "names")= chr [1:278823] "6224" "6316" "6482" "6641" ...
## - attr(*, "na.action")= 'omit' Named int [1:104] 34715 80708 104127 105718 105815 106049 164160 193962 413044 413284 ...
## ..- attr(*, "names")= chr [1:104] "34715" "80708" "104127" "105718" ...
# Summary of data
summary(data)
There are no more NAs in the database because it has been cleaned, as shown below.
# Missing value checks
cat("Total number of missing values:", sum(is.na(data)), "\n")
## Total number of missing values: 0
The following is plotted to showcase the number of accidents that have happened from year 2015 to year 2022. The plot shows that number of accidents remains above 30,000 cases every year except for the year 2022.
# Showing number of accidents per year from reduced dataset (for last 8 years)
barplot(table(data$crashYear),
main="Number of accidents per year from 2015 - 2022",
ylim = c(0,50000),
col = "red",
xlab = "Year",
ylab = "Number of accidents")
The following line chart is plotted to understand which vehicle causes the most accidents. It can be seen that sedan vehicles (car/station wagons) cause the highest number of accidents versus any other vehicle types in New Zeland.
ggplot(vehicles_melted, aes(x = crashYear, y = total_accident, color = variable)) +
geom_line(linetype =1,
lwd = 1.1) +
ggtitle("Number of crashes by vehicle year (2015 - 2022)") + xlab("Year") + ylab("Total crashes")
The following analysis is to understand if accidents are causes by
any non human variables. The non human variables analysed are: * Road
lighting
* Flat vs Hill roads
* Number of Road lanes
* Road conditions
* Weather conditions
Different graphs were plotted to understand the different variables.
Here is the summary of findings.
* Road lighting - accidents seem to happen during bright sunny days,
dark days and even overcast days. There isn’t any particular road
lighting that stands out.
* Flat vs Hill roads - a high number of accidents happened on flat
roads.
* Number of Road lanes - most of the accidents happened on 2-way
roads.
* Road conditions - most accidents happened on sealed roads.
* Weather conditions - most of the accidents happened when the weather
was fine.
In summary, it can be seen that none of the non-human variables standout in causing any accidents, except maybe flat, sealed and 2 ways roads which had high accident rates. The accidents on these flat, sealed and 2 way lane roads may have been caused by human errors such as speeding or losing control of the vehicle, since flat, sealed and 2 way lane roads could be prone to drivers speeding.
In terms of crash severity, it can be seen that most of the accidents were non-injury crashes. The number of minor crashes and serious crashes were much smaller.
# Showing number of accidents by crash severity types
barplot(table(data$crashSeverity),
col = "blue",
ylim = c(0,200000),
ylab = "Number of accidents",
xlab = "Crash Severity",
main = "Number of crashes by severity after accidents (2015 - 2022)")
In terms of injurity severity, it can be seen that most of the accidents had minor injuries, with more than 80,000 accidents. The number of serious injuries and fatality accidents were much smaller. However, in terms of relative numbers, the number of serious injury accidents was still high, which was close 20,000 accidents.
# Showing number of injury by injury types
sumdata <- select(data, fatalCount, minorInjuryCount, seriousInjuryCount)
sumdata <- colSums(sumdata)
barplot(sumdata,
col = "green",
ylim = c(0,90000),
ylab = "Number of accidents",
xlab = "Injury Severity",
main = "Number of injuries by severity after accidents (2015 - 2022)")
By year, it can be seen that most accident injuries happened in year 2018 and 2019. The Covid-19 pandemic in year 2020 and 2021 reduced the number of accident injuries. However the number has been on an increasing trend in year 2022.
## `summarise()` has grouped output by 'crashYear'. You can override using the
## `.groups` argument.
ggplot(injury_melted, aes(x = crashYear, y = total_injury, color = variable)) +
geom_line(linetype =1,
lwd = 1.1) +
ggtitle("Number of injury types by year (2015 - 2022)") + xlab("Year)") + ylab("Total injuries")
We wanted to also understand if festive celebrations had caused higher number of accidents. The graph below shows that festive holidays do not contribute to higher number of accidents.
# Shows Holiday Period with highest accidents
ggplot(data, aes(x = holiday, fill = crashSeverity)) + geom_bar(position = "stack") + theme(axis.text.x = element_text(angle = 90, size = 10)) +
ggtitle("Checking if accidents happened during festive holidays") + xlab("Festive Type") + ylab("Total accidents")
The graph below shows that higher number of accidents happened in the big city regions, namely Auckland, Wellington and Waikato regions.
# Shows Region with highest accidents
ggplot(data, aes(x = region, fill = crashSeverity)) + geom_bar(position = "stack") + theme(axis.text.x = element_text(angle = 90, size = 10)) +
ggtitle("Number of accidents by region") + xlab("Region") + ylab("Total accidents")
In general, the Exploratory Data Analysis above has given a good understand of the data. The next step is to perform Data modeling to understand if road accident injury severity can be predicted.
In the initial phase of the project, the dataset was curated to focus on relevant variables, including counts of fatal, minor, and serious injuries, as well as various factors such as the type of vehicles involved and road conditions. Categorical variables such as ‘flatHill,’ ‘light,’ ‘roadLane,’ and ‘roadSurface’ underwent one-hot encoding to transform them into a suitable format for modeling. Subsequently, the dataset was split into training and testing sets using a random seed for reproducibility. The dataset was spilt into 80% for training, and 20% for testing.
The first step will be to load the necessary libraries to perform the modeling.
#Wei Ven
library(caret) # For createDataPartition function
library(randomForest)
library(dplyr)
library(xgboost)
# data <- readRDS("data.rds")
Then, the step of subsetting the data and preparing for modeling will be done, by spliting data into training and test sets.
#Subset data
data_reg <- select(data, fatalCount, minorInjuryCount, seriousInjuryCount,bicycle, bus, moped, motorcycle, schoolBus, suv, taxi, truck, vanOrUtility, flatHill, light, roadLane, roadSurface)
# One-Hot Encode 'flatHill', 'light', 'roadLane', 'roadSurface'
data_reg <- cbind(data_reg, model.matrix(~ flatHill + light + roadLane + roadSurface - 1, data = data_reg))
# Drop the original categorical columns after encoding
data_reg <- select(data_reg, -flatHill, -light, -roadLane, -roadSurface)
# Split the data into training and test sets
set.seed(123)
trainIndex <- sample(seq_len(nrow(data_reg)), size = 0.8 * nrow(data_reg))
train_data_reg <- data_reg[trainIndex, ]
test_data_reg <- data_reg[-trainIndex, ]
4 different models were explored by our team to understand the peformance of each model. The best model will be chosen the models that were explored. 3 models for Regression and 1 model for Classification.
The first model employed in this project is a Linear Regression Model, specifically tailored for predicting ‘fatalCount,’ ‘minorCount,’ and ‘seriousCount.’ This regression task involves predicting numerical outcomes for each injury category. The choice of linear regression serves as a foundational baseline for comparison with more complex models. Predictions were generated on the test set, and for practical application, these predictions were rounded to whole numbers. The rationale behind using a linear regression model lies in its interpretability and simplicity, making it an appropriate starting point for understanding the basic relationships within the data in the context of a regression task.
# MODEL 1: Linear Regression Model
# Train Linear Regression Model for fatalCount
reg_model_fatal <- lm(fatalCount ~ ., data = train_data_reg)
predictions_fatal <- predict(reg_model_fatal, newdata = test_data_reg)
rounded_predictions_fatal <- round(predictions_fatal)
# Train Linear Regression Model for minorInjuryCount
reg_model_minor <- lm(minorInjuryCount ~ ., data = train_data_reg)
predictions_minor <- predict(reg_model_minor, newdata = test_data_reg)
rounded_predictions_minor <- round(predictions_minor)
# Train Linear Regression Model for seriousInjuryCount
reg_model_serious <- lm(seriousInjuryCount ~ ., data = train_data_reg)
predictions_serious <- predict(reg_model_serious, newdata = test_data_reg)
rounded_predictions_serious <- round(predictions_serious)
# Calculate MAE and MSE for fatalCount/minorInjuryCount/seriousInjury Count
mae_fatal_lm <- mean(abs(test_data_reg$fatalCount - rounded_predictions_fatal))
mse_fatal_lm<- mean((test_data_reg$fatalCount - rounded_predictions_fatal)^2)
mae_minor_lm<- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_minor))
mse_minor_lm<- mean((test_data_reg$minorInjuryCount - rounded_predictions_minor)^2)
mae_serious_lm<- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_serious))
mse_serious_lm<- mean((test_data_reg$seriousInjuryCount - rounded_predictions_serious)^2)
The second model employed is a XGBoost model.Introducing the XGBoost model marks a significant step in predicting our intended outcomes. XGBoost is known for handling complex patterns effectively. It was trained on a specific set of features, excluding the target variable and other injury counts, to adapt well to our dataset while keeping attention on our overall goal. The training process used boosting rounds, allowing the model to improve predictions iteratively. XGBoost naturally deals with complex patterns and interactions among variables, making it more advanced than linear regression. The chosen features cover various aspects of the data, helping the model identify subtle relationships and patterns.
# MODEL 2: XGBoost Regression:
# Train XGBoost Model for fatalCount
xgb_model_fatal <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$fatalCount, nrounds = 100, print_every_n = 10)
## [1] train-rmse:0.361019
## [11] train-rmse:0.111809
## [21] train-rmse:0.111268
## [31] train-rmse:0.111223
## [41] train-rmse:0.111200
## [51] train-rmse:0.111180
## [61] train-rmse:0.111172
## [71] train-rmse:0.111140
## [81] train-rmse:0.111119
## [91] train-rmse:0.111112
## [100] train-rmse:0.111104
predictions_xgb_fatal <- predict(xgb_model_fatal, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_fatal <- round(predictions_xgb_fatal)
# Calculate MAE and MSE for fatalCount (XGBoost)
mae_fatal_xgb <- mean(abs(test_data_reg$fatalCount - rounded_predictions_xgb_fatal))
mse_fatal_xgb <- mean((test_data_reg$fatalCount - rounded_predictions_xgb_fatal)^2)
# XGBoost Regression for minorInjuryCount:
xgb_model_minor <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$minorInjuryCount, nrounds = 100, print_every_n=10)
## [1] train-rmse:0.629876
## [11] train-rmse:0.610864
## [21] train-rmse:0.610297
## [31] train-rmse:0.610155
## [41] train-rmse:0.610039
## [51] train-rmse:0.609959
## [61] train-rmse:0.609909
## [71] train-rmse:0.609863
## [81] train-rmse:0.609710
## [91] train-rmse:0.609661
## [100] train-rmse:0.609591
predictions_xgb_minor <- predict(xgb_model_minor, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_minor <-round(predictions_xgb_minor)
# Calculate MAE and MSE for minorInjuryCount (XGBoost)
mae_minor_xgb <- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor))
mse_minor_xgb <- mean((test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor)^2)
# XGBoost Regression for seriousInjuryCount:
xgb_model_serious <- xgboost(data = as.matrix(train_data_reg[, setdiff(names(train_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]),label = train_data_reg$seriousInjuryCount, nrounds =100,print_every_n=10)
## [1] train-rmse:0.422399
## [11] train-rmse:0.292626
## [21] train-rmse:0.292174
## [31] train-rmse:0.292081
## [41] train-rmse:0.291955
## [51] train-rmse:0.291918
## [61] train-rmse:0.291893
## [71] train-rmse:0.291823
## [81] train-rmse:0.291774
## [91] train-rmse:0.291735
## [100] train-rmse:0.291701
predictions_xgb_serious <- predict(xgb_model_serious, newdata = as.matrix(test_data_reg[, setdiff(names(test_data_reg), c("fatalCount", "minorInjuryCount", "seriousInjuryCount"))]))
rounded_predictions_xgb_serious <- round(predictions_xgb_serious)
# Calculate MAE and MSE for seriousInjuryCount (XGBoost)
mae_serious_xgb <- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious))
mse_serious_xgb <- mean((test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious)^2)
The third model is a Poisson Regression Model. This model capitalizes on the Poisson distribution’s strength in handling count data. Specialized for count data, the Poisson regression offers a detailed perspective on how our selected features interact with the expected counts of road incidents. It goes beyond prediction, revealing the underlying dynamics that contribute to the frequency of fatal incidents based on distinct features. As we examine the model’s summary, key insights emerge from coefficients and their significance. These provide a clear understanding of how each feature influences the expected counts of fatal incidents.
# MODEL 3: Fit Poisson Regression Model for fatalCount
poisson_model_fatal <- glm(fatalCount ~ ., data = train_data_reg, family = "poisson")
# Make predictions on the test set
predictions_poisson_fatal <- predict(poisson_model_fatal, newdata = test_data_reg, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Round predictions to the nearest integer
rounded_predictions_poisson <- round(predictions_poisson_fatal)
# Calculate MAE and MSE for fatalCount
mae_fatal_poisson <- mean(abs(test_data_reg$fatalCount - rounded_predictions_poisson))
mse_fatal_poisson <- mean((test_data_reg$fatalCount - rounded_predictions_poisson)^2)
# Fit Poisson Regression Model for minorInjuryCount
poisson_model_minor <- glm(minorInjuryCount ~ ., data = train_data_reg, family = "poisson")
predictions_poisson_minor <- predict(poisson_model_minor, newdata = test_data_reg, type = "response")
rounded_predictions_poisson_minor <- round(predictions_poisson_minor)
mae_minor_poisson <- mean(abs(test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor))
mse_minor_poisson <- mean((test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor)^2)
# Fit Poisson Regression Model for seriousInjuryCount
poisson_model_serious <- glm(seriousInjuryCount ~ ., data = train_data_reg, family = "poisson")
predictions_poisson_serious <- predict(poisson_model_serious, newdata = test_data_reg, type = "response")
rounded_predictions_poisson_serious <- round(predictions_poisson_serious)
The fourth model is using the classification random forest model on a factorized variable. Training and test sets are created and a prediction on Crash Severity is done at the end.
library(randomForest)
# Select relevant columns for prediction
predictors <- data %>%
select(
light, flatHill, roadLane, roadSurface, weatherA, region,
crashLocation1, crashLocation2, crashSeverity
)
# Factorize crashSeverity for modeling
predictors$crashSeverity <- factor(
predictors$crashSeverity,
levels = c("Fatal Crash", "Minor Crash", "Non-Injury Crash", "Serious Crash")
)
buildmodel<-function(format, data){
randomForest(format, data, ntree = 200)
}
# Creating training and testing sets
set.seed(123)
trainIndex <- createDataPartition(predictors$crashSeverity, p = 0.8,
list = FALSE, times = 1)
data_train <- predictors[trainIndex, ]
data_test <- predictors[-trainIndex, ]
# Training the Random Forest model
format <- crashSeverity ~ . - crashSeverity
model_rf <- buildmodel(format,data_train)
# Predicting crashSeverity
predictions <- predict(model_rf, newdata = data_test)
This section calculates the Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Square Error (RMSE) for the 3 models.
In this section, the MAE and MSE for the 3 Regression models - Linear Regression, XGBoost and Poisson Regression are calculated. The different values obtained from the 3 models are compared for interpretation.
# Calculate MAE and MSE for seriousInjuryCount
mae_serious_poisson <- mean(abs(test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious))
mse_serious_poisson <- mean((test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious)^2)
# Create data frame for MAE
mae_results_df <- data.frame(
Model = c("Linear Regression", "XGBoost", "Poisson"),
MinorInjuryCount = c(mae_minor_lm, mae_minor_xgb, mae_minor_poisson),
SeriousInjuryCount = c(mae_serious_lm, mae_serious_xgb, mae_serious_poisson),
FatalCount = c(mae_fatal_lm, mae_fatal_xgb, mae_fatal_poisson)
)
# Create data frame for MSE
mse_results_df <- data.frame(
Model = c("Linear Regression", "XGBoost", "Poisson"),
MinorInjuryCount = c(mse_minor_lm, mse_minor_xgb, mse_minor_poisson),
SeriousInjuryCount = c(mse_serious_lm, mse_serious_xgb, mse_serious_poisson),
FatalCount = c(mse_fatal_lm, mse_fatal_xgb, mse_fatal_poisson)
)
# Print the MAE results data frame
print("MAE Results:")
## [1] "MAE Results:"
print(mae_results_df)
## Model MinorInjuryCount SeriousInjuryCount FatalCount
## 1 Linear Regression 0.3120416 0.07224962 0.01043665
## 2 XGBoost 0.3082937 0.07117368 0.01056218
## 3 Poisson 0.3129024 0.07176544 0.01197884
# Print the MSE results data frame
print("MSE Results:")
## [1] "MSE Results:"
print(mse_results_df)
## Model MinorInjuryCount SeriousInjuryCount FatalCount
## 1 Linear Regression 0.4692011 0.09699632 0.01423832
## 2 XGBoost 0.4644849 0.09642249 0.01432798
## 3 Poisson 0.4707433 0.10260916 0.13797185
In this section, the RMSE for the 3 Regression models - Linear Regression, XGBoost and Poisson Regression are calculated. A data frame is then created for the different values obtain from the 3 models to compare values.
# Calculate RMSE for Linear Regression
rmse_fatal_lm <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_fatal)^2))
rmse_minor_lm <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_minor)^2))
rmse_serious_lm <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_serious)^2))
# Calculate RMSE for XGBoost
rmse_fatal_xgb <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_xgb_fatal)^2))
rmse_minor_xgb <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_xgb_minor)^2))
rmse_serious_xgb <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_xgb_serious)^2))
# Calculate RMSE for Poisson Regression
rmse_minor_poisson <- sqrt(mean((test_data_reg$minorInjuryCount - rounded_predictions_poisson_minor)^2))
rmse_serious_poisson <- sqrt(mean((test_data_reg$seriousInjuryCount - rounded_predictions_poisson_serious)^2))
rmse_fatal_poisson <- sqrt(mean((test_data_reg$fatalCount - rounded_predictions_poisson)^2))
# Create a dataframe
rmse_df <- data.frame(
Model = c("Linear Regression", "XGBoost", "Poisson Regression"),
RMSE_Fatal = c(rmse_fatal_lm, rmse_fatal_xgb, rmse_fatal_poisson),
RMSE_Minor = c(rmse_minor_lm, rmse_minor_xgb, rmse_minor_poisson),
RMSE_Serious = c(rmse_serious_lm, rmse_serious_xgb, rmse_serious_poisson)
)
# Print the RMSE dataframe
print(rmse_df)
## Model RMSE_Fatal RMSE_Minor RMSE_Serious
## 1 Linear Regression 0.1193244 0.6849826 0.3114423
## 2 XGBoost 0.1196996 0.6815313 0.3105197
## 3 Poisson Regression 0.3714456 0.6861074 0.3203267
The Random Forest model achieves an overall accuracy of approximately 69.3%, indicating the proportion of correct predictions out of the total. 95% Confidence Interval (CI): The range where the true accuracy likely lies is between 68.9% and 69.7%. The No Information Rate (NIR) accuracy expected by always predicting the most frequent class is approximately 68.92%.
# Random Forest Model evaluation
conf_matrix <- confusionMatrix(predictions, data_test$crashSeverity)
#output confusion matrix
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fatal Crash Minor Crash Non-Injury Crash Serious Crash
## Fatal Crash 0 0 0 0
## Minor Crash 8 241 12 105
## Non-Injury Crash 484 13281 38419 3207
## Serious Crash 0 6 0 0
##
## Overall Statistics
##
## Accuracy : 0.6933
## 95% CI : (0.6894, 0.6971)
## No Information Rate : 0.6892
## P-Value [Acc > NIR] : 0.01819
##
## Kappa : 0.0226
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Fatal Crash Class: Minor Crash
## Sensitivity 0.000000 0.017815
## Specificity 1.000000 0.997040
## Pos Pred Value NaN 0.658470
## Neg Pred Value 0.991177 0.760149
## Prevalence 0.008823 0.242598
## Detection Rate 0.000000 0.004322
## Detection Prevalence 0.000000 0.006563
## Balanced Accuracy 0.500000 0.507428
## Class: Non-Injury Crash Class: Serious Crash
## Sensitivity 0.99969 0.0000000
## Specificity 0.02077 0.9998856
## Pos Pred Value 0.69360 0.0000000
## Neg Pred Value 0.96774 0.9405994
## Prevalence 0.68918 0.0593942
## Detection Rate 0.68897 0.0000000
## Detection Prevalence 0.99333 0.0001076
## Balanced Accuracy 0.51023 0.4999428
By analyzing the Mean Absolute Error (MAE) and Mean Squared Error (MSE) results, it provides valuable insights into the performance of our models. In terms of MAE, the Linear Regression model demonstrates commendable accuracy, with predictions hovering around 0.31 for minor injuries, 0.072 for serious injuries, and 0.0104 for fatal incidents. Surpassing this baseline, the XGBoost model exhibits a marginal improvement, achieving slightly lower MAE values across all injury categories. Meanwhile, the Poisson Regression model aligns closely with the Linear Regression and XGBoost performance, showcasing MAE values in the same range, but slightly higher MAE output is seen.
Shifting focus to MSE, we observe that the squared errors, reflecting average squared differences between predicted and actual values, are larger than their absolute counterparts. In this context, the XGBoost model excels with slightly lower MSE values compared to Linear Regression, signifying its enhanced performance. On the other hand, the Poisson Regression model, while comparable to Linear Regression, reveals a relatively higher MSE for FatalCount, suggesting potential areas for improvement.
The Root Mean Squared Error (RMSE) results further illuminate the precision of our models. Looking at the RMSE values, XGBoost consistently outperforms both Linear Regression and Poisson Regression across all injury categories. The RMSE for fatal incidents in XGBoost is lower, indicating a superior ability to minimize errors in predicting the counts of fatal injuries. In summary, the combined analysis of MAE, MSE, and RMSE highlights the strengths of each model, with XGBoost standing out for its higher predictive precision and resilience across different types of injury in a traffic accident.
The model exhibits poor performance in predicting “Fatal Crash,” “Minor Crash,” and “Serious Crash,” as indicated by low sensitivity values for these classes. High specificity is observed for most classes, suggesting good identification of non-occurrences of certain crash types. The Positive Predictive Value is reasonably good for “Non-Injury Crash” but is NaN for “Fatal Crash,” implying the model failed to make any true positive predictions for this class. Overall, the model demonstrates moderate accuracy but struggles in correctly predicting certain critical classes like “Fatal Crash” and “Serious Crash.” Further model tuning or feature engineering may be necessary to enhance predictions for these classes.
In general, based on the research questions set out in the beginning of this study, the analysis above has managed to answer all 3 questions. In the EDA portion, it was found that major accident vehicle types involved were cars (Station Wagon) which contributed the highest number of accidents versus any other vehicle. The study also tried to find if non-human variables that may cause and increase the likelihood of severe accidents, including high fatalities. The study found there were no outstanding non-human variables (like road lighting, flat/hill road, number of road lanes etc.) that had been a standout to cause more accidents, except flat, sealed and 2 ways roads. Most of the accidents that happened were most likely due to human error that happened on these flat, sealed and 2 way roads.
In terms of prediction, it can be seen that certain machine learning models can be used to predict traffic crashes. Our analysis shows that the XGBoost has higher predictive precision and resilience across different types of injury in a traffic accident. The model predicts minor accidents at a higher accuracy versus fatality and serious accidents. Further studies can be explored to understand if there are additional models that could be used for prediction.
The project report above has been published to RPUBs. Here is the report link on RPUBs –> https://rpubs.com/mithirendra/1133528