Classification algorithms are a subset of supervised learning in which a model attempts to predict the correct label for a given observation. Whenever the outcome to be predicted is discrete, classification algorithms are the natural choice for decision-making and pattern identification.
In this document, we will go through an overview of various classification techniques, including the following:
Logistic regression (full and stepwise-reduced)
Regularized regression (Ridge, LASSO, and Elastic Net)
Support vector machines (linear, radial, and polynomial kernels)
Classification trees (unpruned and pruned)
Bagging (bootstrap aggregation)
Random forests
These methods will all be applied to the same training and test datasets so that they can be evaluated against one another through their resulting ROC curves and AUC values.
The dataset (https://www.kaggle.com/datasets/denkuznetz/traffic-accident-prediction) is meant to support prediction of whether a vehicular accident occurs based on a collection of potentially related features, listed below:
Weather (categorical): clear, rainy, foggy, snowy, stormy
Road_Type (categorical): Highway, city road, rural road, mountain road
Time_of_Day (categorical): Morning, Afternoon, Evening, Night
Traffic_Density (discrete): 0 (low density), 1 (moderate density), 2 (high density)
Speed_Limit (numeric): Posted speed limit of road
Number_of_Vehicles (discrete): Number of vehicles involved in the accident
Driver_Alcohol (binary): 0 (driver did not consume alcohol), 1 (driver consumed alcohol)
Accident_Severity (categorical): Low, Moderate, High
Road_Condition (categorical): Dry, Wet, Icy, Under Construction
Vehicle_Type (categorical): Car, Truck, Motorcycle, Bus
Driver_Age (numeric): Age of driver
Driver_Experience (numeric): Years of experience of driver
Road_Light_Condition (categorical): Daylight, Artificial Light, No Light
Accident (binary): 0 (no accident), 1 (had accident)
This document will utilize the following classification algorithms to create predictive models of whether or not an accident will occur based on these factors.
Logistic regression models the relationship between a binary dependent variable and one or more independent variables, using a logistic function to predict the probability that the dependent variable belongs to a particular category. It is often chosen for its simplicity, efficiency, and interpretability. The model predicts the log odds of a given event from parameter estimates of the given predictors, and it is used in predictive analysis and machine learning contexts as well as in more traditional statistical association analyses.
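For a binary outcome with success probability $p$, the model takes the standard log-odds form:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$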
Regularized regression techniques are an expansion on more traditional regression methods by introducing penalty terms to control the complexity of the model and account for issues with overfitting. They can be effective for situations with many predictors or when multicollinearity is present between features.
We will explore three of the most common forms of regularized regression, including Ridge Regression, LASSO, and Elastic Net Regression techniques. Each introduces penalties to the regression model in a slightly different way and may be more or less advantageous in certain modeling situations.
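Concretely, in the elastic net parameterization used by the glmnet package, all three estimators minimize the penalized negative log-likelihood

$$\hat{\beta} = \arg\min_{\beta}\left\{-\ell(\beta) + \lambda\left[\frac{1-\alpha}{2}\sum_{j}\beta_j^2 + \alpha\sum_{j}\lvert\beta_j\rvert\right]\right\},$$

where $\alpha = 0$ gives Ridge Regression, $\alpha = 1$ gives LASSO, and intermediate values of $\alpha$ give Elastic Net.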
SVM is another technique that can be used in classification contexts, and it is a distribution-free method. SVM separates the classes in a feature space by drawing a hyperplane, or decision boundary, that maximizes the margin between parallel supporting hyperplanes while minimizing misclassification errors. SVM can handle both linear and nonlinear class boundaries, using the kernel trick for nonlinear classification; the most common kernel transformations are the radial kernel and the polynomial kernel. We will explore linear, radial, and polynomial methods to create SVM classification models.
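For reference, these two kernel transformations take the forms

$$K_{\text{radial}}(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right), \qquad K_{\text{poly}}(x, x') = \left(\gamma \langle x, x' \rangle + c\right)^d.$$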
Decision (or classification) trees recursively partition the data into subsets that are as homogeneous as possible. They are valued for their interpretability but risk overfitting. Trees can therefore be “pruned” to control their complexity, with two major recommended approaches: 1) choosing the complexity parameter value that minimizes cross-validation error and 2) applying the 1-SE rule. We will create both unpruned and pruned classification trees according to these rules and compare their performance.
Bagging, or bootstrap aggregation, is an ensemble learning technique meant to improve the stability and accuracy of various machine learning methods. It operates on the principle of reducing variance by voting over multiple models built on bootstrap resamples of the original data. We will construct a bagged classification model.
Random forest is another ensemble learning technique; it constructs many decision trees instead of just one and aggregates their results. Building on bagging and similar techniques, it can mitigate the effects of overfitting and improve the accuracy of the model. Random forests can be particularly effective with higher-dimensional data, nonlinear relationships, and interactions, without requiring extensive feature engineering beforehand.
All of the classification models created will be assessed through ROC curves and their corresponding AUC values.
ROC Curves are a graphical technique used to measure the performance of a binary classification model by plotting the true positive rate against the false positive rate at various classification thresholds.
The AUC, or area under the curve, summarizes the performance shown by the ROC curve in a single value between 0 and 1. The closer the AUC is to 1, the better the performance of the model; an AUC of 0.5 corresponds to random guessing.
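As a minimal sketch of how these metrics are computed in R with the pROC package (the fitted model `fit` and test set `test` here are hypothetical placeholders):

```r
library(pROC)

# Predicted probabilities of the positive class on the test set
probs <- predict(fit, newdata = test, type = "response")

# Build the ROC curve and compute its AUC
roc_obj <- roc(response = test$Accident, predictor = probs)
auc(roc_obj)   # area under the curve
plot(roc_obj)  # plot the ROC curve
```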
To begin, we look at a summary of the dataset.
Truthfully, with over half of the observations including at least one missing value, best practice would be to not proceed with imputation and model building. However, for illustrative purposes, we will continue by imputing the missing values through multiple imputation.
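A minimal sketch of the imputation step with the mice package, assuming the data frame is named `accident_data`; the printout below shows the imputation method assigned to each variable:

```r
library(mice)

# Multiple imputation: pmm for numeric variables, logreg for binary
# factors, polyreg for unordered factors with more than two levels
imp <- mice(accident_data, m = 5, seed = 123, printFlag = FALSE)

# Use one completed dataset for the rest of the analysis
accident_data <- complete(imp, 1)
```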
## Weather Road_Type Time_of_Day
## "polyreg" "polyreg" "polyreg"
## Traffic_Density Speed_Limit Number_of_Vehicles
## "polyreg" "pmm" "pmm"
## Driver_Alcohol Accident_Severity Road_Condition
## "logreg" "polyreg" "polyreg"
## Vehicle_Type Driver_Age Driver_Experience
## "polyreg" "pmm" "pmm"
## Road_Light_Condition Accident
## "polyreg" "logreg"
## Weather Road_Type Time_of_Day Traffic_Density
## Clear :353 City Road :241 Afternoon:290 0:258
## Foggy :115 Highway :417 Evening :227 1:319
## Rainy :236 Mountain Road: 44 Morning :209 2:263
## Snowy : 92 Rural Road :138 Night :114
## Stormy: 44
##
## Speed_Limit Number_of_Vehicles Driver_Alcohol Accident_Severity
## Min. : 30.00 Min. : 1.000 0:705 High : 87
## 1st Qu.: 50.00 1st Qu.: 2.000 1:135 Low :502
## Median : 60.00 Median : 3.000 Moderate:251
## Mean : 70.82 Mean : 3.304
## 3rd Qu.: 80.00 3rd Qu.: 4.000
## Max. :213.00 Max. :14.000
## Road_Condition Vehicle_Type Driver_Age Driver_Experience
## Dry :413 Bus : 28 Min. :18.0 Min. : 9.00
## Icy :163 Car :618 1st Qu.:30.0 1st Qu.:26.00
## Under Construction:102 Motorcycle: 92 Median :43.0 Median :39.00
## Wet :162 Truck :102 Mean :43.3 Mean :38.81
## 3rd Qu.:56.0 3rd Qu.:52.00
## Max. :69.0 Max. :69.00
## Road_Light_Condition Accident
## Artificial Light:424 0:588
## Daylight :335 1:252
## No Light : 81
##
##
##
## Weather Road_Type Time_of_Day
## 0 0 0
## Traffic_Density Speed_Limit Number_of_Vehicles
## 0 0 0
## Driver_Alcohol Accident_Severity Road_Condition
## 0 0 0
## Vehicle_Type Driver_Age Driver_Experience
## 0 0 0
## Road_Light_Condition Accident
## 0 0
After handling the missing data, we can take a look at the distributions of the numeric predictors.
Due to the apparent sparseness of observations with higher values for the number of vehicles, we will take a closer look to see if there is a meaningful way to rebin this into a categorical variable.
## # A tibble: 10 × 2
## Number_of_Vehicles n
## <dbl> <int>
## 1 1 152
## 2 2 159
## 3 3 167
## 4 4 173
## 5 5 164
## 6 10 6
## 7 11 8
## 8 12 3
## 9 13 4
## 10 14 4
It appears that 1, 2, 3, 4, and 5+ vehicles would be an intuitive recategorization of the information in this feature, so we will proceed by rebinning it, as sketched below.
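A minimal sketch of the rebinning, again assuming the data frame is named `accident_data`:

```r
library(dplyr)

# Collapse counts of five or more vehicles into a single "5+" category
accident_data <- accident_data %>%
  mutate(Number_of_Vehicles = factor(
    ifelse(Number_of_Vehicles >= 5, "5+", as.character(Number_of_Vehicles)),
    levels = c("1", "2", "3", "4", "5+")
  ))

table(accident_data$Number_of_Vehicles)
```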
## 1 2 3 4 5+
## 152 159 167 173 189
The classification methods we will consider do not require normally distributed predictors, so for simplicity and interpretability we will not transform the remaining predictors further.
To continue, we will visualize the relationships between the features and the binary response variable of Accident.
Looking at the box plots, there do not appear to be any obvious differences in these three continuous variables across the groups of observations that did or did not have a vehicular accident.
Looking across the mosaic plots of the response against the categorical features in the dataset, we do see some differences in distributions for the different values of the binary response, most obviously in the features Weather and Road_Condition. On the other hand, Road_Light_Condition and Driver_Alcohol seem relatively homogeneous across the different values of Accident.
Which classification model performs best in the prediction of Accident based on the other features in the dataset?
We begin by fitting a full logistic regression model and looking at its parameter estimates.
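A minimal sketch of the fit (object names are hypothetical):

```r
# Full main-effects logistic regression on all predictors
full.logit <- glm(Accident ~ ., data = accident_data, family = binomial)
summary(full.logit)
```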
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -0.3574724 | 0.6311025 | -0.5664253 | 0.5711047 |
WeatherFoggy | -0.1422856 | 0.2435854 | -0.5841305 | 0.5591325 |
WeatherRainy | -0.4793253 | 0.1991645 | -2.4066806 | 0.0160982 |
WeatherSnowy | 0.0727269 | 0.2547388 | 0.2854961 | 0.7752641 |
WeatherStormy | 0.4202898 | 0.3367355 | 1.2481304 | 0.2119833 |
Road_TypeHighway | -0.0648255 | 0.1835125 | -0.3532484 | 0.7239023 |
Road_TypeMountain Road | -0.7436352 | 0.4281546 | -1.7368379 | 0.0824158 |
Road_TypeRural Road | 0.3204651 | 0.2317248 | 1.3829554 | 0.1666785 |
Time_of_DayEvening | 0.2805773 | 0.2016972 | 1.3910817 | 0.1642007 |
Time_of_DayMorning | 0.0533958 | 0.2100048 | 0.2542601 | 0.7992947 |
Time_of_DayNight | 0.1563274 | 0.2497255 | 0.6259967 | 0.5313171 |
Traffic_Density1 | -0.0829896 | 0.1885619 | -0.4401186 | 0.6598512 |
Traffic_Density2 | -0.1718677 | 0.1990654 | -0.8633733 | 0.3879323 |
Speed_Limit | -0.0041290 | 0.0026010 | -1.5874462 | 0.1124116 |
Number_of_Vehicles2 | -0.0663388 | 0.2600031 | -0.2551460 | 0.7986103 |
Number_of_Vehicles3 | -0.0458354 | 0.2581897 | -0.1775260 | 0.8590953 |
Number_of_Vehicles4 | 0.0612497 | 0.2526715 | 0.2424083 | 0.8084638 |
Number_of_Vehicles5+ | 0.2668742 | 0.2497410 | 1.0686038 | 0.2852482 |
Driver_Alcohol1 | -0.0149571 | 0.2124295 | -0.0704099 | 0.9438674 |
Accident_SeverityLow | 0.0997656 | 0.2698071 | 0.3697666 | 0.7115564 |
Accident_SeverityModerate | 0.1924592 | 0.2862052 | 0.6724517 | 0.5012962 |
Road_ConditionIcy | -0.0166074 | 0.2058786 | -0.0806659 | 0.9357077 |
Road_ConditionUnder Construction | -0.2919951 | 0.2576581 | -1.1332657 | 0.2571027 |
Road_ConditionWet | -0.4419176 | 0.2187348 | -2.0203352 | 0.0433486 |
Vehicle_TypeCar | -0.4394989 | 0.4273965 | -1.0283164 | 0.3038010 |
Vehicle_TypeMotorcycle | -0.2624986 | 0.4780259 | -0.5491304 | 0.5829159 |
Vehicle_TypeTruck | -0.0162652 | 0.4692625 | -0.0346611 | 0.9723500 |
Driver_Age | 0.0234894 | 0.0279049 | 0.8417663 | 0.3999188 |
Driver_Experience | -0.0197402 | 0.0274985 | -0.7178629 | 0.4728418 |
Road_Light_ConditionDaylight | -0.1195075 | 0.1662274 | -0.7189399 | 0.4721780 |
Road_Light_ConditionNo Light | -0.1380659 | 0.2756960 | -0.5007902 | 0.6165188 |
Many of the predictors have nonsignificant p-values. As such, we will also conduct stepwise feature selection to construct a reduced model.
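A sketch of the selection step, using AIC-based step() from base R on the full model above:

```r
# Stepwise selection (both directions) by AIC, starting from the full model
step.logit <- step(full.logit, direction = "both", trace = 0)
summary(step.logit)
```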
The parameter estimates for the stepwise reduced model are given below.
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -0.7124603 | 0.1643437 | -4.3351839 | 0.0000146 |
WeatherFoggy | -0.2060527 | 0.2365337 | -0.8711345 | 0.3836807 |
WeatherRainy | -0.5206374 | 0.1934613 | -2.6911709 | 0.0071202 |
WeatherSnowy | 0.0229886 | 0.2488195 | 0.0923906 | 0.9263877 |
WeatherStormy | 0.3344471 | 0.3283509 | 1.0185662 | 0.3084090 |
Road_TypeHighway | -0.0543196 | 0.1789070 | -0.3036191 | 0.7614181 |
Road_TypeMountain Road | -0.6029121 | 0.4178708 | -1.4428195 | 0.1490713 |
Road_TypeRural Road | 0.3429190 | 0.2266010 | 1.5133168 | 0.1301992 |
Using the R caret package, we conduct five-fold cross-validation to get an initial look at the accuracy estimates for both models.
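A sketch of the cross-validation setup; per the stepwise results above, the reduced model retains only Weather and Road_Type:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Five-fold CV accuracy for the full and stepwise-reduced models
cv.full <- train(Accident ~ ., data = accident_data,
                 method = "glm", trControl = ctrl)
cv.step <- train(Accident ~ Weather + Road_Type, data = accident_data,
                 method = "glm", trControl = ctrl)
```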
## Generalized Linear Model
##
## 840 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 673, 673, 672, 671, 671
## Resampling results:
##
## Accuracy Kappa
## 0.6797739 0.009110272
## Generalized Linear Model
##
## 840 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 671, 672, 673, 672, 672
## Resampling results:
##
## Accuracy Kappa
## 0.7011933 0.005420431
We can also construct confusion matrices and examine the sensitivity and specificity of both logistic regression models.
 | Specificity | Sensitivity |
---|---|---|
Full Model | 0.9438776 | 0.0634921 |
Stepwise Model | 1.0000000 | 0.0039683 |
We note that the sensitivity for both models is extremely low; both models almost never predict the positive class. We proceed by creating the training and test datasets that will be used for the rest of this analysis, then look at the five-fold CV ROC curves created from the training dataset, along with summary statistics of the AUC values for each of these curves.
 | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
Full Main Effect Model | 0.5980 | 0.5990 | 0.6226 | 0.63414 | 0.6434 | 0.7077 |
Stepwise Main Effect Model | 0.5202 | 0.5358 | 0.5991 | 0.58004 | 0.6027 | 0.6424 |
We will continue by finding the optimal cut-off probability on the test dataset and the AUC values for both models.
Model | AUC |
---|---|
Full Main Effects Logistic Model | 0.5972034 |
Stepwise Reduced Main Effects Logistic Model | 0.5832203 |
The following graphs give different measures of predictive model performance including the accuracy, sensitivity, and specificity at different cut-off probabilities.
Moving on to regularized logistic regression, we create LASSO, Ridge, and Elastic Net classification models and find their optimal cut-off probabilities.
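A sketch of the three fits with the glmnet package; the model matrices x_train and x_test and the response y_train are assumed to have been built from the training and test splits:

```r
library(glmnet)

# alpha = 1 (LASSO), alpha = 0 (Ridge), alpha = 0.5 (Elastic Net);
# cv.glmnet selects lambda by ten-fold cross-validation
cv.lasso   <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
cv.ridge   <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)
cv.elastic <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)

# Test-set probabilities at the lambda minimizing CV error
p.lasso <- predict(cv.lasso, newx = x_test, s = "lambda.min", type = "response")
```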
Using these optimal cut-off probabilities, we can compute several performance measures, presented below.
## Warning in confusionMatrix.default(pred.lab.lasso.fct, y_test): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Warning in confusionMatrix.default(pred.lab.elastic.fct, y_test): Levels are
## not in the same order for reference and data. Refactoring data to match.
 | lasso | ridge | elastic |
---|---|---|---|
Sensitivity | 1 | 0.9407 | 1 |
Specificity | 0 | 0.16 | 0 |
Pos Pred Value | 0.7024 | 0.7255 | 0.7024 |
Neg Pred Value | NA | 0.5333 | NA |
Precision | 0.7024 | 0.7255 | 0.7024 |
Recall | 1 | 0.9407 | 1 |
F1 | 0.8252 | 0.8192 | 0.8252 |
Prevalence | 0.7024 | 0.7024 | 0.7024 |
Detection Rate | 0.7024 | 0.6607 | 0.7024 |
Detection Prevalence | 1 | 0.9107 | 1 |
Balanced Accuracy | 0.5 | 0.5503 | 0.5 |
Finally, we once again use the test data to calculate the AUC values and create ROC curves to compare the models.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
We can see that among the regularized regression models, the LASSO model attains the highest test AUC of the three methods. However, all three have AUC values only slightly above 0.5, indicating discrimination barely better than random guessing.
Moving on to SVM, we create three candidate models, one linear and two nonlinear. Using the R library caret, we find the best value of the cost parameter C for each model and the corresponding accuracies.
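A sketch of the linear-kernel tuning with caret (methods "svmLinear", "svmRadial", and "svmPoly" from the kernlab backend); the training data frame `train_data` is an assumed name:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Tune the cost parameter C for the linear-kernel SVM; the radial and
# polynomial kernels are tuned analogously with their own grids
svm.linear <- train(Accident ~ ., data = train_data,
                    method = "svmLinear", trControl = ctrl,
                    tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 10)))
svm.linear$bestTune
```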
## # A tibble: 3 × 2
## Model Accuracy
## <chr> <dbl>
## 1 SVM Linear w/ choice of cost 0.699
## 2 SVM Radial 0.699
## 3 SVM Poly 0.699
We also construct the confusion matrices and ROC curves for each of the SVM models.
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls > cases
## Confusion Matrix and Statistics
##
## true
## pred NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
Model | AUC |
---|---|
SVM Linear w/ choice of cost | 0.5072034 |
SVM Radial | 0.4931356 |
SVM Poly | 0.5211017 |
We construct three classification trees, one unpruned, one pruned through the 1-SE rule, and one pruned by the minimum CV error. We can begin by plotting the unpruned tree, as follows.
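A sketch of how the trees are grown and pruned with rpart, under assumed object names:

```r
library(rpart)

# Grow a large tree, then consult its CP table for pruning
tree.full <- rpart(Accident ~ ., data = train_data, method = "class",
                   control = rpart.control(cp = 0.001))
cp.tab <- tree.full$cptable

# Prune at the CP value minimizing cross-validation error (xerror)
cp.min <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
tree.min <- prune(tree.full, cp = cp.min)

# 1-SE rule: simplest tree whose xerror is within one SE of the minimum
thresh <- min(cp.tab[, "xerror"]) + cp.tab[which.min(cp.tab[, "xerror"]), "xstd"]
tree.1se <- prune(tree.full, cp = cp.tab[cp.tab[, "xerror"] <= thresh, "CP"][1])
```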
The trees pruned by minimum CV error and by the 1-SE rule are represented as follows.
CP | nsplit | rel error | xerror | xstd |
---|---|---|---|---|
0.005941 | 0 | 1 | 2 | 0.1177 |
0.002475 | 5 | 0.9703 | 2.05 | 0.1172 |
0.001 | 9 | 0.9604 | 2.059 | 0.1172 |
The ROC curves and AUC values of each of the classification trees are shown below.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
We begin constructing our bagged model by finding the best-tuned hyperparameters.
 | nbagg | minsplit | maxdepth | cp |
---|---|---|---|---|
17 | 50 | 20 | 10 | 0.01 |
With these hyperparameters, we construct the bagged model and examine its confusion matrix.
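A sketch of the bagged fit with the ipred package, plugging in the tuned values above (data frame names are assumed):

```r
library(ipred)
library(rpart)

# Bagged classification trees with the tuned hyperparameters
bag.fit <- bagging(Accident ~ ., data = train_data, nbagg = 50,
                   control = rpart.control(minsplit = 20, maxdepth = 10,
                                           cp = 0.01))
bag.pred <- predict(bag.fit, newdata = test_data, type = "class")
```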
## Reference
## Prediction 0 1
## 0 115 47
## 1 3 3
Finally, we conduct prediction on the test set and construct the ROC curve and find the AUC value.
Once again, we find the best hyperparameters for our Random Forest model and then examine the confusion matrix and accuracy estimates.
 | mtry | ntree | nodesize | maxnodes | best.auc |
---|---|---|---|---|---|
120 | 5 | 500 | 3 | 20 | 0.8853 |
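A sketch of the forest fit with these tuned values, using the randomForest package (object names are assumed):

```r
library(randomForest)

rf.fit <- randomForest(Accident ~ ., data = train_data,
                       mtry = 5, ntree = 500,
                       nodesize = 3, maxnodes = 20,
                       importance = TRUE)
varImpPlot(rf.fit)  # variable importance plots
```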
## Confusion Matrix and Statistics
##
## Reference
## Prediction NoAccident YesAccident
## NoAccident 118 50
## YesAccident 0 0
##
## Accuracy : 0.7024
## 95% CI : (0.6271, 0.7704)
## No Information Rate : 0.7024
## P-Value [Acc > NIR] : 0.5381
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7024
## Neg Pred Value : NaN
## Prevalence : 0.7024
## Detection Rate : 0.7024
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : NoAccident
##
## Area under the curve: 0.5373
Variable importance is given in the plots below.
Finally, we construct an ROC curve and find the AUC for the random forest model as well.
## Setting levels: control = NoAccident, case = YesAccident
## Setting direction: controls < cases
Finally, we will compile all of the results from the above classification models into one plot of all the ROC curves as well as their AUC values.
Looking at this graph, we note that most of the models performed relatively similarly. Logistic regression, considering both the full main-effects and stepwise models, performed somewhat better than the other classification models in terms of AUC; the next best performers were the LASSO and random forest models. However, the performance of all of these models was poor overall: most achieved AUC values only slightly above 0.5, barely better than random guessing.
We can also create a table of all of the AUC values, sorted in descending order.
Model | AUC
---|---|
Full Logit Model | 0.5972034 |
Stepwise Logit Model | 0.5832203 |
LASSO Model | 0.5480508 |
Random Forest | 0.5372881 |
Ridge Model | 0.5358475 |
SVM Poly | 0.5211017 |
Full Tree | 0.5208475 |
SVM Linear | 0.5072034 |
Elastic Model | 0.5000000 |
Pruned Tree (1-SE) | 0.5000000 |
Pruned Tree (Min CV Error) | 0.5000000 |
SVM Radial | 0.4931356 |
BAGGING | 0.4754237 |
The logistic models performed the best of all candidate models. Therefore, out of the classification models considered, we would likely recommend logistic regression: it is the simplest, it is efficient to implement, and of the models considered it is probably the easiest to interpret. However, none of the models performed particularly well.
This may be because of imbalance in the response variable Accident. When we take a closer look at its distribution, we note that the variable is imbalanced, with Accident = 0 having more than twice as many observations as Accident = 1.
## # A tibble: 2 × 2
## Accident n
## <dbl> <int>
## 1 0 588
## 2 1 252
Also, in our initial exploratory analysis, many of the features did not appear to be effective predictors of Accident, which was corroborated by the number of nonsignificant p-values in the full logistic model. It is possible that the information contained in the dataset did not permit particularly effective classification models in the first place.
Other limitations of this analysis include the amount of missing data that had to be imputed, with over half of the observations in the original dataset containing missing values. In future work, these same methods could be applied to a dataset with a more balanced binary response and better-suited predictors, giving a more accurate reflection of each model's potential performance.