Introduciton

They are many factors that contribute to motorcycle, accidents. These include “crash characteristics, traffic volume, travel speed, road’s horizontal profile, road’s vertical profile, weather conditions, helmet use, safety features installed in the motorcycle such as the crash cage, RUI, motorcyclist’s socio-demographic characteristics, time of day and motorcyclist’s visibility to other vehicle operators among other factors” (Farid and Ksaibati 2020). In the group analysis modelling some of these factors were undertaken to predict the likelihood of fatalities in motorcycle accidents. The factors considered in the original analysis included road features (intersection, roundabout, etc.,), crash nature (head on, sideswipe, etc.,) and rain. The modelling was done using logistic linear regression and was mainly limited to accidents occurring in 60mkh speed limit zones, where the majority of accidents occurred. This analysis left out a number of other factors.

One aspect of the results that stood out was that sections of road with the highest level of accidents appeared to have no special features or traffic controls. The following tables show the distribution of accidents for road features and traffic controls; in this analysis unlike the original various types of intersections are treated as one not multiple categories.

Roadway Feature Accidents
Intersection 11118
No_Roadway_Feature 11081
Bridge_Causeway 226
Median_Opening 158
Forestry_National_Park_Road 78
Merge_Lane 69
Railway_Crossing 20
Other 7
Traffic Control Accidents
No_traffic_control 16518
Give_way_sign 3091
Traffic_Lights 2308
Stop_sign 699
Pedestrian_Crossing 70
Human_Control 40
Railway_Crossing 19
Miscellaneous 12

The number of accidents occurring on sections of road with both no traffic control and no roadway feature is 10,931. In the early analysis it was suggested that this high level of accidents might be related to rider fatigue or rider error (aren’t all accidents at some level related to rider/driver error?). However, it is more likely that other factors that were not considered in the initial analysis may also be playing a role. One specific factor that was not examined was road topology; especially curves which present spefic challenges to motorcycle riders as will be discussed in the next section.

Others have found a hierarchical structure in traffic accidents (Lenguerrand, Martin, and Laumon 2006; Jones and Jørgensen 2003; Park et al. 2017; Usman, Fu, and Miranda-Moreno 2016), thus one obvious question is if logistic linear regression is the most appropriate modelling method to use in analysing motorcycle crash data and would a multilevel model might give better insight.

As will be discussed in the next section, conditions for a rider are very different when on a curved stretch of road compared to a straight section, effectively creating a grouping of accidents around curved or non-curved.. So in this analysis a multilevel model will be evaluated using a variable Curved, indicating curved or straight, as the random effect variable.

The aim of the analysis it to see if such a multilevel model as described does indeed result in a better model than found in the initial analysis.

Why is Road Topology important

Curved roads present more challenges to two-wheeled riders than to four-wheeled drivers. As well as the potential visibility issues on curves, motorcyclist riders also have to deal with changed bike dynamics while on the curve.

Steering on any two wheeled vehicle, travelling at 10kph or over, is initiated by counter steering (pushing the handlebars in the opposite direction the rider wishes to turn), this then throws the bike into a lean in the direction the rider wants to go (Wikipedia, n.d.a). The bike will then turn in the direction it is leaning, that is to say that motorbikes at speed are turned by leaning the bike int he direction you with to go. Whilst this is all very counter intuitive, riders actually eventually do it without even being consciously aware of what they are doing.

Higher lean angles carry increased risk of loss of control as the bikes centripetal force is being countered by the friction generated by the tyres. Since the bike is at an angle it is is the side of the bike tyre that is in contact with the ground, motorcycle tyres are rounded on the side in an attempt to increase contact areas when cornering. Any loss in traction can result in loss of control of the bike. Complicating things for the rider is that at higher lean angles a bike’s suspension becomes less effective. This in turn makes the bike more susceptible to loss of traction if unexpected hazards such as oil, sand, wet leaves or gravel are encountered.

A particularly nasty form of accident involves momentary loss of traction of the rear tyre during cornering. If the tyre then regains traction the rear suspension spring will become fully loaded before it then fully unloads launching the rider, sometimes meters, into the air (Wikipedia n.d.b).

Highside crash

Braking for motorcycles is also more complex than four-wheeled vehicles on corners. A bike’s breaks are most effective when the bike is vertical. Breaking during a corner while the bike is leaned over will alter a bike’s dynamics including the responsiveness of the suspension, which again increases the potential for loss of traction. The higher the lean angle the less the front front brake can be used (80% of a bike’s braking capacity is in the front brake). On a sports bike at maximum lean angle any application of brakes, or throttle, is likely to result in loss of traction and the bike will skid out to the edge of the corner on its side.

Lowside crash

This all adds up to a very different control situation for riders on corners than straight sections of road. Requiring a different set of responses and care on behalf of the rider while having less ability to use brakes and throttle.

Despite the risks, the data for days on which accidents occur seem to indicate there is a recreational aspect to riding curved roads, with the number of accidents on curved roads increasing on weekends for motorcyclists.

Accidents on curved roads by day
Day Accidents
Sunday 1719
Saturday 1314
Friday 952
Monday 737
Thursday 737
Wednesday 730
Tuesday 703

Additional Variables

Re-examining the available data with some textual analysis several new variables were engineered. Two related to road topology and one relating to rider behaviour:

Accident Severity When In/Out of Lane
Minor Hospitalisation Fatal Total
In Lane 7934 11103 677 19714
Out Of Lane 1468 1491 84 3043
Accident Severity on Straight vs Curved Roads
Minor Hospitalisation Fatal Total
Straight 7060 8412 393 15865
Curve 2342 4182 368 6892
Accident Severity on Level vs Crest/Dip
Minor Hospitalisation Fatal Total
Level 8620 11115 662 20397
Crest/Dip 782 1479 99 2360

Of the three new variables curved roads is the main one of interest. While there are more than twice as many total accidents on straight road sections compared to on curves, the number of fatal accidents occurring on both are quite close. Which would indicate that accidents on curved road sections are more dangerous for riders.

The topology variables may go to some way to explain why featureless roads may not be so featureless after all. The number of accidents on featureless roads which actually do have some topology feature is 4,952 out of a toal of 10,931.

Finally a response variable was created, Crash_Severity_Fatal, which will be TRUE if an accident resulted in a fatality, and FALSE otherwise. Again this is a broader socpe than the original analysis where the response variable had two classes Fatality and Hospitalisation.

Predictor Selection

Variables to be considered during the modelling process are:

Variable Description
Weekend Boolean: if TRUE if Saturday or Sunday, else FALSE
Crash_Roadway_Feature Factor: Intersection, roundabout etc.
Crash_Traffic_Control Factor: Traffic Lights, Stop Sign etc.
Crash_Speed_Limit Factor: indicates speed zone at location of accident
Curved Boolean: If TRUE road accident occurred on a Curve, else on a straight section of raod
CrestOrDip Boolean: If TRUE accident occurred on a crest or dip in the road, else on level road.
OutOfLane Boolean: If TRUE accident occurred with a vehcicle out of it’s lane (overtaking, u-turn), else vehcicle(s) were in their correct lane.
Daylight Boolean: If TRUE accident occurred during daylight, else dawn/dusk or night time.

We can broadly group the above as following;

Classification Variables
Road Topology Curved, CrestOrDip
Road Feature Crash_Roadway_Feature, Crash_Traffic_Control, Crash_Speed_Limit
Rider Behaviour OutOfLane, Weekend
Environmental Daylight

The inclusion of Weekend in Rider Behaviour is to emphasise the increased number of accidents on weekends which is assumed to be due to an increase in recreational riders on weekends.

Train, Test, Validation sets

For the modelling process the main data set was split into three smaller data sets train, test and validation sets, in the ratio of 60%, 20%, 20%. Due to the imbalance in the response variable it was decided that during partitioning it would be prudent to ensure that the number of TRUE cases were distributed in proportion across the three sets.

inds <- partition(crashDF$Crash_Severity_Fatal, p = c(train = 0.6, valid = 0.2, test = 0.2))

To address the imbalance in the response variable in the train dataset, as in the first analysis, the ROSE package (CRAN 2014) was used to synthetically create extra cases for the positive class of the response variable.

Dataset Total Rows Response == TRUE
Train 13654 6783
Test 4551 153
Validation 4552 152

Methodology

Modelling used a logistic multilevel model with one-level. The grouping variable will be Curved, this selection was made on the basis of the very different riding conditions and bike dynamics encountered on curves compared to straight road sections. The other variables were used as fixed effects.

Various combinations of the fixed effects variables were used. Suitability of inclusion of variables as fixed effect was based on statistical significance (Pr<|z|), as well as checks for multicollinearity using variance inflation factor (VIF) calculations.

Predictions were made for each model with the test data set. As the predictions are returned as probabilities of the response been TRUE, a threshold value was be required to convert the probabilities into actual responses. As in the first analysis the best threshold will be calculated via use of the colAUC() method in the caTools package (CRAN 2020a). Probabilities greater than or equal to the calculated threshold or cutoff value will be mapped to a TRUE response.

A confusion matrix is generated from the predicted responses vs the actual response. From the matrix several measures are calculated, Accuracy, F1 Score and Mathews Correlation Coefficient (MCC). While Accuracy was used in the original analysis its use can be considered flawed due to the imbalanced nature of the response variable. A model that would have simply predicted FALSE all the time would have achieved an Accuracy of around 96%. This is due to the Accuracy measures emphasis on the true negatives and true positives; in the test data the true negatives would swamp any change in number of the true positive responses

The F1 score gives a different view of the confusion matrix by emphasising the false negatives and false positives, but has similar issues. MCC by contrast takes into consideration all quadrants of the confusion matrix and gives a more truthful score when evaluating binary classifications than Accuracy or F1 score (Chicco and Jurman 2020).

In this analysis a false positive will be considered the more misleading result so BIC rather than AIC will be used as an indication of model fit (The Methodology Center n.d.).

Modelling will be performed using the glmer() method from the lme4 package (CRAN 2020b). The quadratic approximation (BOBYQA) optimizer will be used, with a set maximum of 200,000 iterations; this has been found to supply a good medium between speed and convergence (Miller 2018).

Model 1

outputStats <- function( model, pred=NULL, cutoff=NULL ) {
 print(summary(model), correlation=TRUE)

 if(is.null(pred)) {
   pred <- predict(model, newdata=testDF, type="response")
  }

 if(is.null(cutoff)){
   cutoff <-  colAUC(pred, testDF[["Crash_Severity_Fatal"]], plotROC = TRUE)[1,1]
 }
 
 print(paste("Cutoff=", cutoff))

 res <- factor( as.character(pred >= cutoff), ordered=c("FALSE", "TRUE"))
 orig <- factor(as.character(testDF$Crash_Severity_Fatal == 1), ordered=c("FALSE", "TRUE"))
 cm <- confusionMatrix(orig, res, positive="TRUE")
 print(check_collinearity(model))
 coef(model)
 
 print(cm)
 mAccuracy <- cm$overall["Accuracy"]
 mBIC <- BIC(model)
 mMCC <- mcc(TP=cm$table[4], FP=cm$table[3], TN=cm$table[1], FN=cm$table[2])
 print(paste("Accuracy=", mAccuracy))
 print(paste("F1=",cm$byClass["F1"]))
 print(paste("BIC =", mBIC))
 print(paste("Mathews correlation coefficent=", mMCC))
 
}

The are various approaches to developing a model such as starting with a few variables and adding or with many variables and removing, or even using automated techniques to find the “best” predictors. In this analysis the approach will be an all in model and then whittle down the variables based on calculated statistics.

model1 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control + Crash_Roadway_Feature +
                      Crash_Speed_Limit + CrestOrDip + Weekend + OutOfLane + Daylight +
                        (1|Curved), 
                      data=trainDF, 
                      family=binomial(link ="logit"),
                      control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))

Looking at the summary details for the model we see that the additional variance for the Curved random effect is disappointingly low, which could be an indication of issues to come. In the fixed effects we can see that very few of the Crash_Roadway_Feature levels are statistically significant, neither is the CrestOrDip, making both variables candidates for removal.

Examining the VIF calculations and the correlation matrix it can be seen that the colliniearty of predictors is in general low, with the exception of two of Crash_Roadway_Feature levels, No_Roadway_Feature and Intersection, which have a correlation of 0.95. In the remainder of this document the correlation matrix won’t be displayed unless the VIF scores indicate an issue.

print(summary(model1), correlation=TRUE)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Roadway_Feature +  
##     Crash_Speed_Limit + CrestOrDip + Weekend + OutOfLane + Daylight +  
##     (1 | Curved)
##    Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  16901.2  17081.7  -8426.6  16853.2    13630 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5088 -0.7833 -0.4633  0.8185  2.1584 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Curved (Intercept) 0.08266  0.2875  
## Number of obs: 13654, groups:  Curved, 2
## 
## Fixed effects:
##                                                   Estimate Std. Error z value
## (Intercept)                                       -0.03489    0.27554  -0.127
## Crash_Traffic_ControlHuman_Control               -13.21804  303.02948  -0.044
## Crash_Traffic_ControlMiscellaneous               -12.71834  508.74806  -0.025
## Crash_Traffic_ControlNo_traffic_control            0.22105    0.07315   3.022
## Crash_Traffic_ControlPedestrian_Crossing         -12.93334  197.25548  -0.066
## Crash_Traffic_ControlRailway_Crossing            -14.40521  364.10857  -0.040
## Crash_Traffic_ControlStop_sign                     0.27803    0.13452   2.067
## Crash_Traffic_ControlTraffic_Lights                0.44687    0.08899   5.022
## Crash_Roadway_FeatureForestry_National_Park_Road   0.22327    0.36685   0.609
## Crash_Roadway_FeatureIntersection                 -0.46783    0.16799  -2.785
## Crash_Roadway_FeatureMedian_Opening               -0.07935    0.32612  -0.243
## Crash_Roadway_FeatureMerge_Lane                    0.10585    0.37475   0.282
## Crash_Roadway_FeatureNo_Roadway_Feature           -0.10427    0.16409  -0.635
## Crash_Roadway_FeatureOther                       -12.90864  623.74985  -0.021
## Crash_Roadway_FeatureRailway_Crossing              1.33667    0.76748   1.742
## Crash_Speed_Limit60kmh                            -0.13009    0.05598  -2.324
## Crash_Speed_Limit70kmh                             0.60412    0.08578   7.043
## Crash_Speed_Limit80to90kmh                         1.09798    0.06951  15.795
## Crash_Speed_Limit100to110kmh                       1.15876    0.06394  18.124
## CrestOrDipTRUE                                    -0.01040    0.05909  -0.176
## WeekendTRUE                                        0.20851    0.03945   5.285
## OutOfLaneTRUE                                     -0.12669    0.05843  -2.168
## DaylightTRUE                                      -0.49294    0.04119 -11.966
##                                                  Pr(>|z|)    
## (Intercept)                                       0.89923    
## Crash_Traffic_ControlHuman_Control                0.96521    
## Crash_Traffic_ControlMiscellaneous                0.98006    
## Crash_Traffic_ControlNo_traffic_control           0.00251 ** 
## Crash_Traffic_ControlPedestrian_Crossing          0.94772    
## Crash_Traffic_ControlRailway_Crossing             0.96844    
## Crash_Traffic_ControlStop_sign                    0.03875 *  
## Crash_Traffic_ControlTraffic_Lights              5.12e-07 ***
## Crash_Roadway_FeatureForestry_National_Park_Road  0.54278    
## Crash_Roadway_FeatureIntersection                 0.00535 ** 
## Crash_Roadway_FeatureMedian_Opening               0.80776    
## Crash_Roadway_FeatureMerge_Lane                   0.77758    
## Crash_Roadway_FeatureNo_Roadway_Feature           0.52512    
## Crash_Roadway_FeatureOther                        0.98349    
## Crash_Roadway_FeatureRailway_Crossing             0.08157 .  
## Crash_Speed_Limit60kmh                            0.02013 *  
## Crash_Speed_Limit70kmh                           1.89e-12 ***
## Crash_Speed_Limit80to90kmh                        < 2e-16 ***
## Crash_Speed_Limit100to110kmh                      < 2e-16 ***
## CrestOrDipTRUE                                    0.86031    
## WeekendTRUE                                      1.26e-07 ***
## OutOfLaneTRUE                                     0.03014 *  
## DaylightTRUE                                      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of fixed effects could have been required in summary()
## 
## Correlation of Fixed Effects:
##             (Intr) C_T_CH C_T_CM C_T_CN C_T_CP C_T_CR C_T_CS C_T_CT C_R_FF
## Crsh_T_CH_C  0.000                                                        
## Crsh_Trf_CM  0.000  0.000                                                 
## Crsh_T_CN__ -0.256  0.000  0.000                                          
## Crsh_T_CP_C  0.000  0.000  0.000  0.000                                   
## Crsh_T_CR_C  0.000  0.000  0.000  0.000  0.000                            
## Crsh_Tr_CS_ -0.089  0.000  0.000  0.362  0.000  0.000                     
## Crsh_T_CT_L -0.138  0.000  0.000  0.553  0.000  0.000  0.300              
## C_R_FF_N_P_ -0.258  0.000  0.000 -0.007  0.000  0.000 -0.001 -0.011       
## Crsh_Rdw_FI -0.609  0.000  0.000  0.124  0.000  0.000 -0.005  0.002  0.424
## Crsh_R_FM_O -0.302  0.000  0.000  0.008  0.000  0.000  0.004  0.011  0.221
## Crsh_R_FM_L -0.248  0.000  0.000 -0.003  0.000  0.000 -0.002  0.007  0.187
## Cr_R_FN_R_F -0.581  0.000  0.000 -0.018  0.000  0.000 -0.007 -0.008  0.440
## Crsh_Rdw_FO  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## Crsh_R_FR_C -0.126  0.000  0.000 -0.002  0.000 -0.002 -0.001  0.000  0.094
## Crsh_Sp_L60 -0.147  0.000  0.000  0.017  0.000  0.000 -0.007 -0.069  0.001
## Crsh_Sp_L70 -0.092  0.000  0.000  0.008  0.000  0.000  0.001 -0.065  0.002
## Crs_S_L8090 -0.145  0.000  0.000  0.001  0.000  0.000  0.004 -0.018  0.036
## C_S_L100110 -0.142  0.000  0.000 -0.018  0.000  0.000 -0.004 -0.005 -0.028
## CrstOrDTRUE  0.005  0.000  0.000 -0.031  0.000  0.000 -0.052  0.001 -0.070
## WeekendTRUE -0.053  0.000  0.000 -0.012  0.000  0.000 -0.009 -0.007  0.042
## OutOfLnTRUE -0.001  0.000  0.000 -0.037  0.000  0.000  0.033 -0.017  0.008
## DaylghtTRUE -0.105  0.000  0.000  0.039  0.000  0.000 -0.012  0.036 -0.022
##             C_R_FI C_R_FM_O C_R_FM_L C_R_FN C_R_FO C_R_FR C_S_L6 C_S_L7 C_S_L8
## Crsh_T_CH_C                                                                   
## Crsh_Trf_CM                                                                   
## Crsh_T_CN__                                                                   
## Crsh_T_CP_C                                                                   
## Crsh_T_CR_C                                                                   
## Crsh_Tr_CS_                                                                   
## Crsh_T_CT_L                                                                   
## C_R_FF_N_P_                                                                   
## Crsh_Rdw_FI                                                                   
## Crsh_R_FM_O  0.490                                                            
## Crsh_R_FM_L  0.422  0.224                                                     
## Cr_R_FN_R_F  0.954  0.496    0.430                                            
## Crsh_Rdw_FO  0.000  0.000    0.000    0.000                                   
## Crsh_R_FR_C  0.203  0.104    0.089    0.207  0.000                            
## Crsh_Sp_L60 -0.007 -0.001   -0.030   -0.007  0.000 -0.003                     
## Crsh_Sp_L70  0.001  0.011   -0.021   -0.009  0.000 -0.017  0.485              
## Crs_S_L8090  0.069  0.054    0.017    0.045  0.000  0.006  0.588  0.391       
## C_S_L100110  0.065  0.055    0.027    0.028  0.000 -0.022  0.632  0.421  0.545
## CrstOrDTRUE -0.025 -0.046   -0.002   -0.028  0.000 -0.001 -0.022 -0.034 -0.042
## WeekendTRUE  0.020  0.015   -0.029    0.009  0.000  0.021  0.004  0.011 -0.017
## OutOfLnTRUE -0.001 -0.055   -0.079   -0.020  0.000  0.009 -0.026  0.015 -0.005
## DaylghtTRUE  0.004  0.001   -0.029    0.004  0.000  0.029  0.009 -0.033 -0.081
##             C_S_L1 CODTRU WkTRUE OOLTRU
## Crsh_T_CH_C                            
## Crsh_Trf_CM                            
## Crsh_T_CN__                            
## Crsh_T_CP_C                            
## Crsh_T_CR_C                            
## Crsh_Tr_CS_                            
## Crsh_T_CT_L                            
## C_R_FF_N_P_                            
## Crsh_Rdw_FI                            
## Crsh_R_FM_O                            
## Crsh_R_FM_L                            
## Cr_R_FN_R_F                            
## Crsh_Rdw_FO                            
## Crsh_R_FR_C                            
## Crsh_Sp_L60                            
## Crsh_Sp_L70                            
## Crs_S_L8090                            
## C_S_L100110                            
## CrstOrDTRUE -0.007                     
## WeekendTRUE -0.048 -0.026              
## OutOfLnTRUE -0.012  0.011  0.082       
## DaylghtTRUE -0.081 -0.030 -0.044 -0.082
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 1 negative eigenvalues
 print(check_collinearity(model1))
## # Check for Multicollinearity
## 
## Low Correlation
## 
##              Parameter  VIF Increased SE
##  Crash_Traffic_Control 1.61         1.27
##  Crash_Roadway_Feature 1.69         1.30
##      Crash_Speed_Limit 1.16         1.08
##             CrestOrDip 1.02         1.01
##                Weekend 1.02         1.01
##              OutOfLane 1.04         1.02
##               Daylight 1.04         1.02

Model1 is then used to make predictions on the data in the test data set. From it’s predicted probabilities a threshold or cut-off value is calculated that will maximise the area under the curve (AUC) a ROC curve is also displayed to give a visual representation of the process.

pred <- predict(model1, newdata=testDF, type="response")
cutoff <-  colAUC(pred, testDF[["Crash_Severity_Fatal"]], plotROC = TRUE)[1,1]

print(paste("Cutoff=", cutoff))
## [1] "Cutoff= 0.668310313362877"

Using the calculated cut-off value the predicted probabilities are then converted into actual predicted responses. The predicted responses and actual responses are used to generate a confusion matrix from which various measures are calculated.

# convert probability to response useing cutoff value
 res <- factor( as.character(pred >= cutoff), levels=c("FALSE", "TRUE"))
 orig <- factor(as.character(testDF$Crash_Severity_Fatal == 1), levels=c("FALSE", "TRUE"))
 
 cm <- confusionMatrix(orig, res, positive="TRUE")
 
 print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3845  553
##      TRUE    115   38
##                                           
##                Accuracy : 0.8532          
##                  95% CI : (0.8426, 0.8634)
##     No Information Rate : 0.8701          
##     P-Value [Acc > NIR] : 0.9996          
##                                           
##                   Kappa : 0.0515          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06430         
##             Specificity : 0.97096         
##          Pos Pred Value : 0.24837         
##          Neg Pred Value : 0.87426         
##              Prevalence : 0.12986         
##          Detection Rate : 0.00835         
##    Detection Prevalence : 0.03362         
##       Balanced Accuracy : 0.51763         
##                                           
##        'Positive' Class : TRUE            
## 
 m1Accuracy <- cm$overall["Accuracy"]
 m1BIC <- BIC(model1)
 m1MCC <- mcc(TP=cm$table[4], FP=cm$table[3], TN=cm$table[1], FN=cm$table[2])
 print(paste("Accuracy =", m1Accuracy))
## [1] "Accuracy = 0.853219072731268"
 print(paste("F1 =",cm$byClass["F1"]))
## [1] "F1 = 0.102150537634409"
 print(paste("Mathews correlation coefficent =", m1MCC))
## [1] "Mathews correlation coefficent = 0.0657534042826112"

While the Accuracy is reasonable, the F1 and MCC scores are rather low, reflecting the low TP vs FP results. Again stressing the inadequacy of relying just on the Accuracy score.

Due to the original analysis model been trained on different data (only data from 60kmh speed limit zones) it is not totally correct to compare that model to models in this analysis, however in the interests of curiosity we will anyway. In the original analysis the Accuracy was 81% which at flush blush would indicated that this current model is better with an Accuracy of 85%. But if we calculate the MCC score for the original result we can see that it’s MCC value is greater and therefore better, than. Indicating that the original analysis model over it’s limited data selection is more accurate than model1.

origMCC <- mcc(TP=26, FP=206, FN=23, TN=1005)
 print(paste("Mathews correlation coefficent =", origMCC))
## [1] "Mathews correlation coefficent = 0.179821132526949"

Model 2

As identified above in analysis of the summary variable for model 1 variables Crash_Roadway_Feature and CrestOrCurve will be removed from the model, to give us model 2. Again predictions and calculations are made.

The BIC score has decreased slightly which indicates that model2 is slightly better fit to the data than model1.

The variance explained by the random effect Curved is slightly increased. All the fixed predictors with the exception of OutOfLane are statistically significant. As such OutOfLane is a candidate for removal in the next model.

Collinearity for model2 still remains slow.

It can be seen that model2 has a slight increase in MCC over model1 which implies that model2 is more accurate.

model2 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control + Crash_Speed_Limit +
                  Weekend + OutOfLane + Daylight + (1|Curved), 
                  data=trainDF, 
                  family=binomial(link ="logit"),
                  control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Speed_Limit +  
##     Weekend + OutOfLane + Daylight + (1 | Curved)
##    Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  16948.5  17068.8  -8458.2  16916.5    13638 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3845 -0.7885 -0.4624  0.8298  2.1628 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Curved (Intercept) 0.1003   0.3167  
## Number of obs: 13654, groups:  Curved, 2
## 
## Fixed effects:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               -0.50116    0.23786  -2.107   0.0351
## Crash_Traffic_ControlHuman_Control       -12.88831    6.33928  -2.033   0.0420
## Crash_Traffic_ControlMiscellaneous       -12.44948    9.92579  -1.254   0.2097
## Crash_Traffic_ControlNo_traffic_control    0.47146    0.06457   7.302 2.84e-13
## Crash_Traffic_ControlPedestrian_Crossing -12.59571    2.75358  -4.574 4.78e-06
## Crash_Traffic_ControlRailway_Crossing    -12.78835   12.00149  -1.066   0.2866
## Crash_Traffic_ControlStop_sign             0.28061    0.13487   2.081   0.0375
## Crash_Traffic_ControlTraffic_Lights        0.46680    0.08918   5.234 1.66e-07
## Crash_Speed_Limit60kmh                    -0.12997    0.05581  -2.329   0.0199
## Crash_Speed_Limit70kmh                     0.63042    0.08545   7.378 1.61e-13
## Crash_Speed_Limit80to90kmh                 1.14766    0.06893  16.651  < 2e-16
## Crash_Speed_Limit100to110kmh               1.23501    0.06305  19.589  < 2e-16
## WeekendTRUE                                0.21721    0.03927   5.531 3.18e-08
## OutOfLaneTRUE                             -0.09865    0.05800  -1.701   0.0890
## DaylightTRUE                              -0.49742    0.04099 -12.134  < 2e-16
##                                             
## (Intercept)                              *  
## Crash_Traffic_ControlHuman_Control       *  
## Crash_Traffic_ControlMiscellaneous          
## Crash_Traffic_ControlNo_traffic_control  ***
## Crash_Traffic_ControlPedestrian_Crossing ***
## Crash_Traffic_ControlRailway_Crossing       
## Crash_Traffic_ControlStop_sign           *  
## Crash_Traffic_ControlTraffic_Lights      ***
## Crash_Speed_Limit60kmh                   *  
## Crash_Speed_Limit70kmh                   ***
## Crash_Speed_Limit80to90kmh               ***
## Crash_Speed_Limit100to110kmh             ***
## WeekendTRUE                              ***
## OutOfLaneTRUE                            .  
## DaylightTRUE                             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 15 > 12.
## Use print(summary(model2), correlation=TRUE)  or
##     vcov(summary(model2))        if you need it
## # Check for Multicollinearity
## 
## Low Correlation
## 
##              Parameter  VIF Increased SE
##  Crash_Traffic_Control 1.09         1.04
##      Crash_Speed_Limit 1.10         1.05
##                Weekend 1.02         1.01
##              OutOfLane 1.03         1.01
##               Daylight 1.03         1.02

## [1] "Cutoff= 0.681262576275015"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3808  590
##      TRUE    111   42
##                                           
##                Accuracy : 0.846           
##                  95% CI : (0.8351, 0.8563)
##     No Information Rate : 0.8611          
##     P-Value [Acc > NIR] : 0.9984          
##                                           
##                   Kappa : 0.0559          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.066456        
##             Specificity : 0.971676        
##          Pos Pred Value : 0.274510        
##          Neg Pred Value : 0.865848        
##              Prevalence : 0.138871        
##          Detection Rate : 0.009229        
##    Detection Prevalence : 0.033619        
##       Balanced Accuracy : 0.519066        
##                                           
##        'Positive' Class : TRUE            
## 
## [1] "Accuracy = 0.845967919138651"
## [1] "F1 = 0.107006369426752"
## [1] "Mathews correlation coefficent = 0.0731583783984442"

Model 3

For model 3 the OutOfLane variable was removed from the fixed effects in the formula for model 2.

model3 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control+ Crash_Speed_Limit +
                  Weekend + Daylight + (1|Curved), 
                  data=trainDF, 
                  family=binomial(link ="logit"),
                  control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))

The changes in BIC, Accuracy and MCC are all very slight, BIC indicating a slight improvement, while Accuracy and MCC showing a slight decline in accuracy. At this point all predictors are statistically significant. With the introduction of dummy variables it would be possible to remove the least significant levels in the factors, but since dummy variables weren’t used at the start of modelling this will be left for another analysis.

The MCC scores aren’t redically different between model 2 and model 3. Since model3 is the less complex it is the one that will be chosen as the winner.

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Speed_Limit +  
##     Weekend + Daylight + (1 | Curved)
##    Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  16949.4  17062.2  -8459.7  16919.4    13639 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3850 -0.7824 -0.4822  0.8306  2.0736 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Curved (Intercept) 0.1036   0.3219  
## Number of obs: 13654, groups:  Curved, 2
## 
## Fixed effects:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               -0.50164    0.24052  -2.086 0.037010
## Crash_Traffic_ControlHuman_Control       -12.88060    4.35976  -2.954 0.003133
## Crash_Traffic_ControlMiscellaneous       -12.44073    4.17885  -2.977 0.002910
## Crash_Traffic_ControlNo_traffic_control    0.46331    0.06441   7.193 6.32e-13
## Crash_Traffic_ControlPedestrian_Crossing -12.59823    3.70946  -3.396 0.000683
## Crash_Traffic_ControlRailway_Crossing    -12.77910    3.04198  -4.201 2.66e-05
## Crash_Traffic_ControlStop_sign             0.28868    0.13501   2.138 0.032492
## Crash_Traffic_ControlTraffic_Lights        0.46429    0.08921   5.205 1.94e-07
## Crash_Speed_Limit60kmh                    -0.13272    0.05576  -2.380 0.017308
## Crash_Speed_Limit70kmh                     0.63233    0.08542   7.402 1.34e-13
## Crash_Speed_Limit80to90kmh                 1.14684    0.06890  16.646  < 2e-16
## Crash_Speed_Limit100to110kmh               1.23367    0.06302  19.577  < 2e-16
## WeekendTRUE                                0.22235    0.03914   5.680 1.34e-08
## DaylightTRUE                              -0.50347    0.04084 -12.328  < 2e-16
##                                             
## (Intercept)                              *  
## Crash_Traffic_ControlHuman_Control       ** 
## Crash_Traffic_ControlMiscellaneous       ** 
## Crash_Traffic_ControlNo_traffic_control  ***
## Crash_Traffic_ControlPedestrian_Crossing ***
## Crash_Traffic_ControlRailway_Crossing    ***
## Crash_Traffic_ControlStop_sign           *  
## Crash_Traffic_ControlTraffic_Lights      ***
## Crash_Speed_Limit60kmh                   *  
## Crash_Speed_Limit70kmh                   ***
## Crash_Speed_Limit80to90kmh               ***
## Crash_Speed_Limit100to110kmh             ***
## WeekendTRUE                              ***
## DaylightTRUE                             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 14 > 12.
## Use print(summary(model3), correlation=TRUE)  or
##     vcov(summary(model3))        if you need it

## [1] "Cutoff= 0.680202230960597"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3800  598
##      TRUE    111   42
##                                           
##                Accuracy : 0.8442          
##                  95% CI : (0.8333, 0.8546)
##     No Information Rate : 0.8594          
##     P-Value [Acc > NIR] : 0.9983          
##                                           
##                   Kappa : 0.0546          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.065625        
##             Specificity : 0.971619        
##          Pos Pred Value : 0.274510        
##          Neg Pred Value : 0.864029        
##              Prevalence : 0.140628        
##          Detection Rate : 0.009229        
##    Detection Prevalence : 0.033619        
##       Balanced Accuracy : 0.518622        
##                                           
##        'Positive' Class : TRUE            
## 
## [1] "Accuracy = 0.844210063722259"
## [1] "F1 = 0.105926860025221"
## [1] "Mathews correlation coefficent = 0.0718308810296092"

Validation

As a validation of model 3 it was run against the validation set. This resulted in he MCC improving. So it would seem that the model is at least not over fitted to it’s training or test data.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3797  603
##      TRUE    100   52
##                                           
##                Accuracy : 0.8456          
##                  95% CI : (0.8347, 0.8559)
##     No Information Rate : 0.8561          
##     P-Value [Acc > NIR] : 0.979           
##                                           
##                   Kappa : 0.0789          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.07939         
##             Specificity : 0.97434         
##          Pos Pred Value : 0.34211         
##          Neg Pred Value : 0.86295         
##              Prevalence : 0.14389         
##          Detection Rate : 0.01142         
##    Detection Prevalence : 0.03339         
##       Balanced Accuracy : 0.52686         
##                                           
##        'Positive' Class : TRUE            
## 
## [1] "Accuracy = 0.845562390158172"
## [1] "F1 = 0.128872366790582"
## [1] "Mathews correlation coefficent = 0.104964593214166"

Summary

Model BIC Accuracy MCC
model1 17082 0.85 0.0657534
model2 17069 0.85 0.0731584
model3 17062 0.84 0.0718309
model3 - validation 0.85 0.1049646
original analysis 0.81 0.1798211

The model results aren’t great especially when comparing the MCC value to that obtained in the original analysis. Although it should be pointed out that the original analysis was only for one speed zone where as this analysis covered all zones.

The question then becomes was the assumption valid that here would be some sort of clustering or hierarchy on curves vs straight roads valid. As a check a regular linear regression model was run against the test data set using a similar formula to model 3, but the random effects are now fixed.

lmModel <- glm(Crash_Severity_Fatal ~ Crash_Traffic_Control+ Crash_Speed_Limit +
                  Weekend + Daylight + Curved,
                  data=trainDF,family = "binomial")
## [1] "Accuracy = 0.844210063722259"
## [1] "Mathews correlation coefficent = 0.0718308810296092"

As can be seen the Accuracy and MCC score are the same for both the simple linear regression model and the multilevel model. So the use of the random effect Curved doesn’t seem to be useful. This would seem to indicate that the proposal that the crash data will cluster along curve vs straight roads is in fact incorrect. While others have found hierarchically structure in accident data it would appear that grouping by curved roads vs straight roads is not one of them.

References

Chicco, Davide, and Giuseppe Jurman. 2020. “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.” BMC Genomics 21 (1): 6. https://doi.org/10.1186/s12864-019-6413-7.

———. 2020a. “Package caTools.” https://cran.r-project.org/web/packages/caTools/index.html.

———. 2020b. “Package lme4.” https://cran.r-project.org/web/packages/lme4/index.html.

Farid, Ahmed, and Khaled Ksaibati. 2020. “Modeling severities of motorcycle crashes using random parameters.” Journal of Traffic and Transportation Engineering (English Edition), June. https://doi.org/10.1016/j.jtte.2020.01.001.

Imprialou, Maria Ioanna M., Mohammed Quddus, and David E. Pitfield. 2015. “Multilevel logistic regression modeling for crash mapping in metropolitan areas.” National Research Council. https://doi.org/10.3141/2514-05.

Jones, Andrew P., and Stig H. Jørgensen. 2003. “The use of multilevel models for the prediction of road accident outcomes.” Accident Analysis and Prevention 35 (1): 59–69. https://doi.org/10.1016/S0001-4575(01)00086-0.

Kumar, Sachin, and Durga Toshniwal. 2015. “A data mining framework to analyze road accident data.” Journal of Big Data 2 (1): 26. https://doi.org/10.1186/s40537-015-0035-y.

Lenguerrand, E, J L Martin, and B Laumon. 2006. “Modelling the hierarchical structure of road crash data-Application to severity analysis.” Accident Analysis and Prevention 38: 43–53. https://doi.org/10.1016/j.aap.2005.06.021.

Miller, Steven V. 2018. “Mixed Effects Modeling Tips: Use a Fast Optimizer, but Perform Optimizer Checks.” http://svmiller.com/blog/2018/06/mixed-effects-models-optimizer-checks/.

Park, Ho Chul, Dong Kyu Kim, Seung Young Kho, and Peter Y. Park. 2017. “Cross-classified multilevel models for severity of commercial motor vehicle crashes considering heterogeneity among companies and regions.” Accident Analysis and Prevention 106 (September): 305–14. https://doi.org/10.1016/j.aap.2017.06.009.

The Methodology Center. n.d. “AIC vs. BIC.” Accessed November 7, 2020. https://www.methodology.psu.edu/resources/AIC-vs-BIC/.

Usman, Taimur, Liping Fu, and Luis F. Miranda-Moreno. 2016. “Injury severity analysis: Comparison of multilevel logistic regression models and effects of collision data aggregation.” Journal of Modern Transportation 24 (1): 73–87. https://doi.org/10.1007/s40534-016-0096-4.

———. n.d.b. “Highsider.” Accessed November 7, 2020. https://en.wikipedia.org/wiki/Highsider.

Zhang, Kairan, and Mohamed Hassan. 2019. “Crash severity analysis of nighttime and daytime highway work zone crashes.” PLoS ONE 14 (8). https://doi.org/10.1371/journal.pone.0221128.