Classification algorithms are a subset of supervised learning in which a model attempts to predict the correct label for a given observation. Whenever the outcome to be predicted is discrete, classification algorithms are the natural choice for decision-making and pattern identification.
In this document, we will go through an overview of various classification techniques, including the following:
Logistic regression (full and stepwise-reduced)
Regularized regression (Ridge, LASSO, and Elastic Net)
Support vector machines (linear, radial, and polynomial kernels)
Classification trees (unpruned and pruned)
Bagging (bootstrap aggregation)
Random forests
These methods will all be applied to the same training and test datasets so that they can be evaluated against one another through their resulting ROC curves and AUC values.
The dataset (https://www.kaggle.com/datasets/denkuznetz/traffic-accident-prediction) is meant to support prediction of whether a vehicular accident occurs based on a collection of potentially related features, listed below:
Weather (categorical): clear, rainy, foggy, snowy, stormy
Road_Type (categorical): Highway, city road, rural road, mountain road
Time_of_Day (categorical): Morning, Afternoon, Evening, Night
Traffic_Density (discrete): 0 (low density), 1 (moderate density), 2 (high density)
Speed_Limit (numeric): Posted speed limit of road
Number_of_Vehicles (discrete): Number of vehicles involved in the accident
Driver_Alcohol (binary): 0 (driver did not consume alcohol), 1 (driver consumed alcohol)
Accident_Severity (categorical): Low, Moderate, High
Road_Condition (categorical): Dry, Wet, Icy, Under Construction
Vehicle_Type (categorical): Car, Truck, Motorcycle, Bus
Driver_Age (numeric): Age of driver
Driver_Experience (numeric): Years of experience of driver
Road_Light_Condition (categorical): Daylight, Artificial Light, No Light
Accident (binary): 0 (no accident), 1 (had accident)
This document will utilize the following classification algorithms to create predictive models of whether or not an accident will occur based on these factors.
Logistic regression models the relationship between a binary dependent variable and one or more independent variables, using a logistic function to predict the probability that the dependent variable belongs to a particular category. It is often chosen for its simplicity, efficiency, and interpretability. The model predicts the log odds of a given event from parameter estimates of the given predictors, and it is used in predictive analysis and machine learning contexts as well as in more traditional statistical association analyses.
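For a binary outcome with success probability $p$, the model takes the standard log-odds form:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$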
Regularized regression techniques are an expansion on more traditional regression methods by introducing penalty terms to control the complexity of the model and account for issues with overfitting. They can be effective for situations with many predictors or when multicollinearity is present between features.
We will explore three of the most common forms of regularized regression, including Ridge Regression, LASSO, and Elastic Net Regression techniques. Each introduces penalties to the regression model in a slightly different way and may be more or less advantageous in certain modeling situations.
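Concretely, in the elastic net parameterization used by the glmnet package, all three estimators minimize the penalized negative log-likelihood

$$\hat{\beta} = \arg\min_{\beta}\left\{-\ell(\beta) + \lambda\left[\frac{1-\alpha}{2}\sum_{j}\beta_j^2 + \alpha\sum_{j}\lvert\beta_j\rvert\right]\right\},$$

where $\alpha = 0$ gives Ridge Regression, $\alpha = 1$ gives LASSO, and intermediate values of $\alpha$ give Elastic Net.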
SVM is another technique that can be used in classification contexts, and it is a distribution-free method. SVM separates the classes in a feature space by drawing a hyperplane, or decision boundary, that maximizes the margin between parallel supporting hyperplanes while minimizing misclassification errors. SVM can handle both linear and nonlinear class boundaries, using the kernel trick for nonlinear classification; the most common kernel transformations are the radial kernel and the polynomial kernel. We will explore linear, radial, and polynomial methods to create SVM classification models.
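For reference, these two kernel transformations take the forms

$$K_{\text{radial}}(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right), \qquad K_{\text{poly}}(x, x') = \left(\gamma \langle x, x' \rangle + c\right)^d.$$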
Decision (or classification) trees recursively partition the data into subsets that are as homogeneous as possible. They are valued for their interpretability but risk overfitting. Trees can therefore be “pruned” to control their complexity, with two major recommended approaches: 1) choosing the complexity parameter value that minimizes cross-validation error and 2) applying the 1-SE rule. We will create both unpruned and pruned classification trees according to these rules and compare their performance.
Bagging, or bootstrap aggregation, is an ensemble learning technique meant to improve the stability and accuracy of various machine learning methods. It operates on the principle of reducing variance by voting over multiple models built on bootstrap resamples of the original data. We will construct a bagged classification model.
Random forest is another ensemble learning technique; it constructs many decision trees instead of just one and aggregates their results. Building on bagging and similar techniques, it can mitigate the effects of overfitting and improve the accuracy of the model. Random forests can be particularly effective with higher-dimensional data, nonlinear relationships, and interactions, without requiring extensive feature engineering beforehand.
All of the classification models created will be assessed through ROC curves and their corresponding AUC values.
ROC Curves are a graphical technique used to measure the performance of a binary classification model by plotting the true positive rate against the false positive rate at various classification thresholds.
The AUC, or area under the curve, summarizes the performance shown by the ROC curve in a single value between 0 and 1. The closer the AUC is to 1, the better the performance of the model; an AUC of 0.5 corresponds to random guessing.
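As a minimal sketch of how these metrics are computed in R with the pROC package (the fitted model `fit` and test set `test` here are hypothetical placeholders):

```r
library(pROC)

# Predicted probabilities of the positive class on the test set
probs <- predict(fit, newdata = test, type = "response")

# Build the ROC curve and compute its AUC
roc_obj <- roc(response = test$Accident, predictor = probs)
auc(roc_obj)   # area under the curve
plot(roc_obj)  # plot the ROC curve
```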
To begin, we look at a summary of the dataset.
Truthfully, with over half of the observations including at least one missing value, best practice would be to not proceed with imputation and model building. However, for illustrative purposes, we will continue by imputing the missing values through multiple imputation.
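A minimal sketch of the imputation step with the mice package, assuming the data frame is named `accident_data`; the printout below shows the imputation method assigned to each variable:

```r
library(mice)

# Multiple imputation: pmm for numeric variables, logreg for binary
# factors, polyreg for unordered factors with more than two levels
imp <- mice(accident_data, m = 5, seed = 123, printFlag = FALSE)

# Use one completed dataset for the rest of the analysis
accident_data <- complete(imp, 1)
```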
## Weather Road_Type Time_of_Day
## "polyreg" "polyreg" "polyreg"
## Traffic_Density Speed_Limit Number_of_Vehicles
## "polyreg" "pmm" "pmm"
## Driver_Alcohol Accident_Severity Road_Condition
## "logreg" "polyreg" "polyreg"
## Vehicle_Type Driver_Age Driver_Experience
## "polyreg" "pmm" "pmm"
## Road_Light_Condition Accident
## "polyreg" "logreg"
## Weather Road_Type Time_of_Day Traffic_Density
## Clear :353 City Road :241 Afternoon:290 0:258
## Foggy :115 Highway :417 Evening :227 1:319
## Rainy :236 Mountain Road: 44 Morning :209 2:263
## Snowy : 92 Rural Road :138 Night :114
## Stormy: 44
##
## Speed_Limit Number_of_Vehicles Driver_Alcohol Accident_Severity
## Min. : 30.00 Min. : 1.000 0:705 High : 87
## 1st Qu.: 50.00 1st Qu.: 2.000 1:135 Low :502
## Median : 60.00 Median : 3.000 Moderate:251
## Mean : 70.82 Mean : 3.304
## 3rd Qu.: 80.00 3rd Qu.: 4.000
## Max. :213.00 Max. :14.000
## Road_Condition Vehicle_Type Driver_Age Driver_Experience
## Dry :413 Bus : 28 Min. :18.0 Min. : 9.00
## Icy :163 Car :618 1st Qu.:30.0 1st Qu.:26.00
## Under Construction:102 Motorcycle: 92 Median :43.0 Median :39.00
## Wet :162 Truck :102 Mean :43.3 Mean :38.81
## 3rd Qu.:56.0 3rd Qu.:52.00
## Max. :69.0 Max. :69.00
## Road_Light_Condition Accident
## Artificial Light:424 0:588
## Daylight :335 1:252
## No Light : 81
##
##
##
## Weather Road_Type Time_of_Day
## 0 0 0
## Traffic_Density Speed_Limit Number_of_Vehicles
## 0 0 0
## Driver_Alcohol Accident_Severity Road_Condition
## 0 0 0
## Vehicle_Type Driver_Age Driver_Experience
## 0 0 0
## Road_Light_Condition Accident
## 0 0
After handling the missing data, we can take a look at the distributions of the numeric predictors.
Due to the apparent sparseness of observations with higher values for the number of vehicles, we will take a closer look to see if there is a meaningful way to rebin this into a categorical variable.
## # A tibble: 10 × 2
## Number_of_Vehicles n
## <dbl> <int>
## 1 1 152
## 2 2 159
## 3 3 167
## 4 4 173
## 5 5 164
## 6 10 6
## 7 11 8
## 8 12 3
## 9 13 4
## 10 14 4
It appears that 1, 2, 3, 4, and 5+ vehicles would be an intuitive recategorization of the information in this feature, so we will proceed by rebinning it, as sketched below.
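A minimal sketch of the rebinning, again assuming the data frame is named `accident_data`:

```r
library(dplyr)

# Collapse counts of five or more vehicles into a single "5+" category
accident_data <- accident_data %>%
  mutate(Number_of_Vehicles = factor(
    ifelse(Number_of_Vehicles >= 5, "5+", as.character(Number_of_Vehicles)),
    levels = c("1", "2", "3", "4", "5+")
  ))

table(accident_data$Number_of_Vehicles)
```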
## 1 2 3 4 5+
## 152 159 167 173 189
The classification methods we will consider do not require normally distributed predictors, so for simplicity and interpretability we will not transform the remaining predictors further.
To continue, we will visualize the relationships between the features and the binary response variable of Accident.
Looking at the box plots, there do not appear to be any obvious differences in these three continuous variables across the groups of observations that did or did not have a vehicular accident.
Looking across the mosaic plots of the response against the categorical features in the dataset, we do see some differences in distributions for the different values of the binary response, most obviously in the features Weather and Road_Condition. On the other hand, Road_Light_Condition and Driver_Alcohol seem relatively homogeneous across the different values of Accident.
Which classification model performs best in the prediction of Accident based on the other features in the dataset?
We begin by fitting a full logistic regression model and looking at its parameter estimates.
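A minimal sketch of the fit (object names are hypothetical):

```r
# Full main-effects logistic regression on all predictors
full.logit <- glm(Accident ~ ., data = accident_data, family = binomial)
summary(full.logit)
```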
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -0.3574724 | 0.6311025 | -0.5664253 | 0.5711047 |
WeatherFoggy | -0.1422856 | 0.2435854 | -0.5841305 | 0.5591325 |
WeatherRainy | -0.4793253 | 0.1991645 | -2.4066806 | 0.0160982 |
WeatherSnowy | 0.0727269 | 0.2547388 | 0.2854961 | 0.7752641 |
WeatherStormy | 0.4202898 | 0.3367355 | 1.2481304 | 0.2119833 |
Road_TypeHighway | -0.0648255 | 0.1835125 | -0.3532484 | 0.7239023 |
Road_TypeMountain Road | -0.7436352 | 0.4281546 | -1.7368379 | 0.0824158 |
Road_TypeRural Road | 0.3204651 | 0.2317248 | 1.3829554 | 0.1666785 |
Time_of_DayEvening | 0.2805773 | 0.2016972 | 1.3910817 | 0.1642007 |
Time_of_DayMorning | 0.0533958 | 0.2100048 | 0.2542601 | 0.7992947 |
Time_of_DayNight | 0.1563274 | 0.2497255 | 0.6259967 | 0.5313171 |
Traffic_Density1 | -0.0829896 | 0.1885619 | -0.4401186 | 0.6598512 |
Traffic_Density2 | -0.1718677 | 0.1990654 | -0.8633733 | 0.3879323 |
Speed_Limit | -0.0041290 | 0.0026010 | -1.5874462 | 0.1124116 |
Number_of_Vehicles2 | -0.0663388 | 0.2600031 | -0.2551460 | 0.7986103 |
Number_of_Vehicles3 | -0.0458354 | 0.2581897 | -0.1775260 | 0.8590953 |
Number_of_Vehicles4 | 0.0612497 | 0.2526715 | 0.2424083 | 0.8084638 |
Number_of_Vehicles5+ | 0.2668742 | 0.2497410 | 1.0686038 | 0.2852482 |
Driver_Alcohol1 | -0.0149571 | 0.2124295 | -0.0704099 | 0.9438674 |
Accident_SeverityLow | 0.0997656 | 0.2698071 | 0.3697666 | 0.7115564 |
Accident_SeverityModerate | 0.1924592 | 0.2862052 | 0.6724517 | 0.5012962 |
Road_ConditionIcy | -0.0166074 | 0.2058786 | -0.0806659 | 0.9357077 |
Road_ConditionUnder Construction | -0.2919951 | 0.2576581 | -1.1332657 | 0.2571027 |
Road_ConditionWet | -0.4419176 | 0.2187348 | -2.0203352 | 0.0433486 |
Vehicle_TypeCar | -0.4394989 | 0.4273965 | -1.0283164 | 0.3038010 |
Vehicle_TypeMotorcycle | -0.2624986 | 0.4780259 | -0.5491304 | 0.5829159 |
Vehicle_TypeTruck | -0.0162652 | 0.4692625 | -0.0346611 | 0.9723500 |
Driver_Age | 0.0234894 | 0.0279049 | 0.8417663 | 0.3999188 |
Driver_Experience | -0.0197402 | 0.0274985 | -0.7178629 | 0.4728418 |
Road_Light_ConditionDaylight | -0.1195075 | 0.1662274 | -0.7189399 | 0.4721780 |
Road_Light_ConditionNo Light | -0.1380659 | 0.2756960 | -0.5007902 | 0.6165188 |
Many of the predictors have nonsignificant p-values. As such, we will also conduct stepwise feature selection to construct a reduced model.
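A sketch of the selection step, using AIC-based step() from base R on the full model above:

```r
# Stepwise selection (both directions) by AIC, starting from the full model
step.logit <- step(full.logit, direction = "both", trace = 0)
summary(step.logit)
```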
The parameter estimates for the stepwise reduced model are given below.
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -0.7124603 | 0.1643437 | -4.3351839 | 0.0000146 |
WeatherFoggy | -0.2060527 | 0.2365337 | -0.8711345 | 0.3836807 |
WeatherRainy | -0.5206374 | 0.1934613 | -2.6911709 | 0.0071202 |
WeatherSnowy | 0.0229886 | 0.2488195 | 0.0923906 | 0.9263877 |
WeatherStormy | 0.3344471 | 0.3283509 | 1.0185662 | 0.3084090 |
Road_TypeHighway | -0.0543196 | 0.1789070 | -0.3036191 | 0.7614181 |
Road_TypeMountain Road | -0.6029121 | 0.4178708 | -1.4428195 | 0.1490713 |
Road_TypeRural Road | 0.3429190 | 0.2266010 | 1.5133168 | 0.1301992 |
Using the R caret package, we conduct five-fold cross-validation to get an initial look at the accuracy estimates for both models.
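A sketch of the cross-validation setup; per the stepwise results above, the reduced model retains only Weather and Road_Type:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Five-fold CV accuracy for the full and stepwise-reduced models
cv.full <- train(Accident ~ ., data = accident_data,
                 method = "glm", trControl = ctrl)
cv.step <- train(Accident ~ Weather + Road_Type, data = accident_data,
                 method = "glm", trControl = ctrl)
```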
## Generalized Linear Model
##
## 840 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 673, 673, 672, 671, 671
## Resampling results:
##
## Accuracy Kappa
## 0.6797739 0.009110272
## Generalized Linear Model
##
## 840 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 671, 672, 673, 672, 672
## Resampling results:
##
## Accuracy Kappa
## 0.7011933 0.005420431
We can also construct confusion matrices and examine the sensitivity and specificity of both logistic regression models.
 | Specificity | Sensitivity |
---|---|---|
Full Model | 0.9438776 | 0.0634921 |
Stepwise Model | 1.0000000 | 0.0039683 |
We note that the sensitivity for both models is extremely low; both models almost never predict the positive class. We proceed by creating the training and test datasets that will be used for the rest of this analysis, then look at the five-fold CV ROC curves created from the training dataset, along with summary statistics of the AUC values for each of these curves.
 | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
Full Main Effect Model | 0.5980 | 0.5990 | 0.6226 | 0.63414 | 0.6434 | 0.7077 |
Stepwise Main Effect Model | 0.5202 | 0.5358 | 0.5991 | 0.58004 | 0.6027 | 0.6424 |
We will continue by finding the optimal cut-off probability on the test dataset and the AUC values for both models.
Model | AUC |
---|---|
Full Main Effects Logistic Model | 0.5972034 |
Stepwise Reduced Main Effects Logistic Model | 0.5832203 |
The following graphs give different measures of predictive model performance including the accuracy, sensitivity, and specificity at different cut-off probabilities.
Moving on to regularized logistic regression, we create LASSO, Ridge, and Elastic Net classification models and find their optimal cut-off probabilities.
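A sketch of the three fits with the glmnet package; the model matrices x_train and x_test and the response y_train are assumed to have been built from the training and test splits:

```r
library(glmnet)

# alpha = 1 (LASSO), alpha = 0 (Ridge), alpha = 0.5 (Elastic Net);
# cv.glmnet selects lambda by ten-fold cross-validation
cv.lasso   <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
cv.ridge   <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)
cv.elastic <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)

# Test-set probabilities at the lambda minimizing CV error
p.lasso <- predict(cv.lasso, newx = x_test, s = "lambda.min", type = "response")
```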
Using these optimal cut-off probabilities, we can compute several performance measures, presented below.
## Warning in confusionMatrix.default(pred.lab.lasso.fct, y_test): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Warning in confusionMatrix.default(pred.lab.elastic.fct, y_test): Levels are
## not in the same order for reference and data. Refactoring data to match.
 | lasso | ridge | elastic |
---|---|---|---|
Sensitivity | 1 | 0.9407 | 1 |
Specificity | 0 | 0.16 | 0 |
Pos Pred Value | 0.7024 | 0.7255 | 0.7024 |
Neg Pred Value | NA | 0.5333 | NA |
Precision | 0.7024 | 0.7255 | 0.7024 |
Recall | 1 | 0.9407 | 1 |
F1 | 0.8252 | 0.8192 | 0.8252 |
Prevalence | 0.7024 | 0.7024 | 0.7024 |
Detection Rate | 0.7024 | 0.6607 | 0.7024 |
Detection Prevalence | 1 | 0.9107 | 1 |
Balanced Accuracy | 0.5 | 0.5503 | 0.5 |
Finally, we once again use the test data to calculate the AUC values and create ROC curves to compare the models.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
We can see that among the regularized regression models, the LASSO model attains the highest test AUC of the three methods. However, all three have AUC values only slightly above 0.5, indicating discrimination barely better than random guessing.
Moving on to SVM, we create three candidate models, one linear and two nonlinear. Using the R library caret, we find the best value of the cost parameter C for each model and the corresponding accuracies.
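A sketch of the linear-kernel tuning with caret (methods "svmLinear", "svmRadial", and "svmPoly" from the kernlab backend); the training data frame `train_data` is an assumed name:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Tune the cost parameter C for the linear-kernel SVM; the radial and
# polynomial kernels are tuned analogously with their own grids
svm.linear <- train(Accident ~ ., data = train_data,
                    method = "svmLinear", trControl = ctrl,
                    tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10)))
svm.linear$bestTune
```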
## # A tibble: 3 × 2
## Model Accuracy
## <chr> <dbl>
## 1 SVM Linear w/ choice of cost 0.699
## 2 SVM Radial 0.699
## 3 SVM Poly 0.699
We also construct the confusion matrices and ROC curves for each of the SVM models.
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls > cases
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
Model | AUC |
---|---|
SVM Linear w/ choice of cost | 0.5072034 |
SVM Radial | 0.4931356 |
SVM Poly | 0.5211017 |
We construct three classification trees, one unpruned, one pruned through the 1-SE rule, and one pruned by the minimum CV error. We can begin by plotting the unpruned tree, as follows.
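A sketch of how the trees are grown and pruned with rpart, under assumed object names:

```r
library(rpart)

# Grow a large tree, then consult its CP table for pruning
tree.full <- rpart(Accident ~ ., data = train_data, method = "class",
                   control = rpart.control(cp = 0.001))
cp.tab <- tree.full$cptable

# Prune at the CP value minimizing cross-validation error (xerror)
cp.min <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
tree.min <- prune(tree.full, cp = cp.min)

# 1-SE rule: simplest tree whose xerror is within one SE of the minimum
thresh <- min(cp.tab[, "xerror"]) + cp.tab[which.min(cp.tab[, "xerror"]), "xstd"]
tree.1se <- prune(tree.full, cp = cp.tab[cp.tab[, "xerror"] <= thresh, "CP"][1])
```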
The trees pruned by minimum CV error and by the 1-SE rule are represented as follows.
CP | nsplit | rel error | xerror | xstd |
---|---|---|---|---|
0.005941 | 0 | 1 | 2 | 0.1177 |
0.002475 | 5 | 0.9703 | 2.05 | 0.1172 |
0.001 | 9 | 0.9604 | 2.059 | 0.1172 |
The ROC curves and AUC values of each of the classification trees are shown below.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
We begin constructing our bagged model by finding the best-tuned hyperparameters.
 | nbagg | minsplit | maxdepth | cp |
---|---|---|---|---|
17 | 50 | 20 | 10 | 0.01 |
With these hyperparameters, we construct the bagged model and examine its confusion matrix.
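A sketch of the bagged fit with the ipred package, plugging in the tuned values above (data frame names are assumed):

```r
library(ipred)
library(rpart)

# Bagged classification trees with the tuned hyperparameters
bag.fit <- bagging(Accident ~ ., data = train_data, nbagg = 50,
                   control = rpart.control(minsplit = 20, maxdepth = 10,
                                           cp = 0.01))
bag.pred <- predict(bag.fit, newdata = test_data, type = "class")
```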
## Reference
## Prediction 0 1
## 0 115 47
## 1 3 3
Finally, we conduct prediction on the test set and construct the ROC curve and find the AUC value.
Once again, we find the best hyperparameters for our Random Forest model and then examine the confusion matrix and accuracy estimates.
 | mtry | ntree | nodesize | maxnodes | best.auc |
---|---|---|---|---|---|
120 | 5 | 500 | 3 | 20 | 0.8853 |
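A sketch of the forest fit with these tuned values, using the randomForest package (object names are assumed):

```r
library(randomForest)

rf.fit <- randomForest(Accident ~ ., data = train_data,
                       mtry = 5, ntree = 500,
                       nodesize = 3, maxnodes = 20,
                       importance = TRUE)
varImpPlot(rf.fit)  # variable importance plots
```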
## Confusion Matrix and Statistics
##
## Reference
## Prediction NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Area under the curve: 0.5373
Variable importance is given in the plots below.
Finally, we construct an ROC curve and find the AUC for the random forest model as well.
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
Finally, we will compile all of the results from the above classification models into one plot of all the ROC curves as well as their AUC values.
Looking at this graph, we note that most of the models performed relatively similarly. Logistic regression, considering both the full main-effects and stepwise models, performed somewhat better than the other classification models in terms of AUC; the next best performers were the LASSO and random forest models. However, the performance of all of these models was poor overall: most achieved AUC values only slightly above 0.5, barely better than random guessing.
We can also create a table of all of the AUC values, sorted in descending order.
Model | AUC
---|---|
Full Logit Model | 0.5972034 |
Stepwise Logit Model | 0.5832203 |
LASSO Model | 0.5480508 |
Random Forest | 0.5372881 |
Ridge Model | 0.5358475 |
SVM Poly | 0.5211017 |
Full Tree | 0.5208475 |
SVM Linear | 0.5072034 |
Elastic Model | 0.5000000 |
Pruned Tree (1-SE) | 0.5000000 |
Pruned Tree (Min CV Error) | 0.5000000 |
SVM Radial | 0.4931356 |
BAGGING | 0.4754237 |
The logistic models performed the best of all candidate models. Therefore, out of the classification models considered, we would likely recommend logistic regression: it is the simplest, it is efficient to implement, and of the models considered it is probably the easiest to interpret. However, none of the models performed particularly well.
This may be because of imbalance in the response variable Accident. When we take a closer look at its distribution, we note that the variable is imbalanced, with Accident = 0 having more than twice as many observations as Accident = 1.
## # A tibble: 2 × 2
## Accident n
## <dbl> <int>
## 1 0 588
## 2 1 252
Also, in our initial exploratory analysis, many of the features did not appear to be effective predictors of Accident, which was corroborated by the number of nonsignificant p-values in the full logistic model. It is possible that the information contained in the dataset did not permit particularly effective classification models in the first place.
Other limitations of this analysis include the amount of missing data that had to be imputed, with over half of the observations in the original dataset containing missing values. In future work, these same methods could be applied to a dataset with a more balanced binary response and better-suited predictors, giving a more accurate reflection of each model's potential performance.