They are many factors that contribute to motorcycle, accidents. These include “crash characteristics, traffic volume, travel speed, road’s horizontal profile, road’s vertical profile, weather conditions, helmet use, safety features installed in the motorcycle such as the crash cage, RUI, motorcyclist’s socio-demographic characteristics, time of day and motorcyclist’s visibility to other vehicle operators among other factors” (Farid and Ksaibati 2020). In the group analysis modelling some of these factors were undertaken to predict the likelihood of fatalities in motorcycle accidents. The factors considered in the original analysis included road features (intersection, roundabout, etc.,), crash nature (head on, sideswipe, etc.,) and rain. The modelling was done using logistic linear regression and was mainly limited to accidents occurring in 60mkh speed limit zones, where the majority of accidents occurred. This analysis left out a number of other factors.
One aspect of the results that stood out was that sections of road with the highest level of accidents appeared to have no special features or traffic controls. The following tables show the distribution of accidents for road features and traffic controls; in this analysis unlike the original various types of intersections are treated as one not multiple categories.
| Roadway Feature | Accidents |
|---|---|
| Intersection | 11118 |
| No_Roadway_Feature | 11081 |
| Bridge_Causeway | 226 |
| Median_Opening | 158 |
| Forestry_National_Park_Road | 78 |
| Merge_Lane | 69 |
| Railway_Crossing | 20 |
| Other | 7 |
| Traffic Control | Accidents |
|---|---|
| No_traffic_control | 16518 |
| Give_way_sign | 3091 |
| Traffic_Lights | 2308 |
| Stop_sign | 699 |
| Pedestrian_Crossing | 70 |
| Human_Control | 40 |
| Railway_Crossing | 19 |
| Miscellaneous | 12 |
The number of accidents occurring on sections of road with both no traffic control and no roadway feature is 10,931. In the early analysis it was suggested that this high level of accidents might be related to rider fatigue or rider error (aren’t all accidents at some level related to rider/driver error?). However, it is more likely that other factors that were not considered in the initial analysis may also be playing a role. One specific factor that was not examined was road topology; especially curves which present spefic challenges to motorcycle riders as will be discussed in the next section.
Others have found a hierarchical structure in traffic accidents (Lenguerrand, Martin, and Laumon 2006; Jones and Jørgensen 2003; Park et al. 2017; Usman, Fu, and Miranda-Moreno 2016), thus one obvious question is if logistic linear regression is the most appropriate modelling method to use in analysing motorcycle crash data and would a multilevel model might give better insight.
As will be discussed in the next section, conditions for a rider are very different when on a curved stretch of road compared to a straight section, effectively creating a grouping of accidents around curved or non-curved.. So in this analysis a multilevel model will be evaluated using a variable Curved, indicating curved or straight, as the random effect variable.
The aim of the analysis it to see if such a multilevel model as described does indeed result in a better model than found in the initial analysis.
Curved roads present more challenges to two-wheeled riders than to four-wheeled drivers. As well as the potential visibility issues on curves, motorcyclist riders also have to deal with changed bike dynamics while on the curve.
Steering on any two wheeled vehicle, travelling at 10kph or over, is initiated by counter steering (pushing the handlebars in the opposite direction the rider wishes to turn), this then throws the bike into a lean in the direction the rider wants to go (Wikipedia, n.d.a). The bike will then turn in the direction it is leaning, that is to say that motorbikes at speed are turned by leaning the bike int he direction you with to go. Whilst this is all very counter intuitive, riders actually eventually do it without even being consciously aware of what they are doing.
Higher lean angles carry increased risk of loss of control as the bikes centripetal force is being countered by the friction generated by the tyres. Since the bike is at an angle it is is the side of the bike tyre that is in contact with the ground, motorcycle tyres are rounded on the side in an attempt to increase contact areas when cornering. Any loss in traction can result in loss of control of the bike. Complicating things for the rider is that at higher lean angles a bike’s suspension becomes less effective. This in turn makes the bike more susceptible to loss of traction if unexpected hazards such as oil, sand, wet leaves or gravel are encountered.
A particularly nasty form of accident involves momentary loss of traction of the rear tyre during cornering. If the tyre then regains traction the rear suspension spring will become fully loaded before it then fully unloads launching the rider, sometimes meters, into the air (Wikipedia n.d.b).
Highside crash
Braking for motorcycles is also more complex than four-wheeled vehicles on corners. A bike’s breaks are most effective when the bike is vertical. Breaking during a corner while the bike is leaned over will alter a bike’s dynamics including the responsiveness of the suspension, which again increases the potential for loss of traction. The higher the lean angle the less the front front brake can be used (80% of a bike’s braking capacity is in the front brake). On a sports bike at maximum lean angle any application of brakes, or throttle, is likely to result in loss of traction and the bike will skid out to the edge of the corner on its side.
Lowside crash
This all adds up to a very different control situation for riders on corners than straight sections of road. Requiring a different set of responses and care on behalf of the rider while having less ability to use brakes and throttle.
Despite the risks, the data for days on which accidents occur seem to indicate there is a recreational aspect to riding curved roads, with the number of accidents on curved roads increasing on weekends for motorcyclists.
| Day | Accidents |
|---|---|
| Sunday | 1719 |
| Saturday | 1314 |
| Friday | 952 |
| Monday | 737 |
| Thursday | 737 |
| Wednesday | 730 |
| Tuesday | 703 |
Re-examining the available data with some textual analysis several new variables were engineered. Two related to road topology and one relating to rider behaviour:
| Minor | Hospitalisation | Fatal | Total | |
|---|---|---|---|---|
| In Lane | 7934 | 11103 | 677 | 19714 |
| Out Of Lane | 1468 | 1491 | 84 | 3043 |
| Minor | Hospitalisation | Fatal | Total | |
|---|---|---|---|---|
| Straight | 7060 | 8412 | 393 | 15865 |
| Curve | 2342 | 4182 | 368 | 6892 |
| Minor | Hospitalisation | Fatal | Total | |
|---|---|---|---|---|
| Level | 8620 | 11115 | 662 | 20397 |
| Crest/Dip | 782 | 1479 | 99 | 2360 |
Of the three new variables curved roads is the main one of interest. While there are more than twice as many total accidents on straight road sections compared to on curves, the number of fatal accidents occurring on both are quite close. Which would indicate that accidents on curved road sections are more dangerous for riders.
The topology variables may go to some way to explain why featureless roads may not be so featureless after all. The number of accidents on featureless roads which actually do have some topology feature is 4,952 out of a toal of 10,931.
Finally a response variable was created, Crash_Severity_Fatal, which will be TRUE if an accident resulted in a fatality, and FALSE otherwise. Again this is a broader socpe than the original analysis where the response variable had two classes Fatality and Hospitalisation.
Variables to be considered during the modelling process are:
| Variable | Description |
|---|---|
| Weekend | Boolean: if TRUE if Saturday or Sunday, else FALSE |
| Crash_Roadway_Feature | Factor: Intersection, roundabout etc. |
| Crash_Traffic_Control | Factor: Traffic Lights, Stop Sign etc. |
| Crash_Speed_Limit | Factor: indicates speed zone at location of accident |
| Curved | Boolean: If TRUE road accident occurred on a Curve, else on a straight section of raod |
| CrestOrDip | Boolean: If TRUE accident occurred on a crest or dip in the road, else on level road. |
| OutOfLane | Boolean: If TRUE accident occurred with a vehcicle out of it’s lane (overtaking, u-turn), else vehcicle(s) were in their correct lane. |
| Daylight | Boolean: If TRUE accident occurred during daylight, else dawn/dusk or night time. |
We can broadly group the above as following;
| Classification | Variables |
|---|---|
| Road Topology | Curved, CrestOrDip |
| Road Feature | Crash_Roadway_Feature, Crash_Traffic_Control, Crash_Speed_Limit |
| Rider Behaviour | OutOfLane, Weekend |
| Environmental | Daylight |
The inclusion of Weekend in Rider Behaviour is to emphasise the increased number of accidents on weekends which is assumed to be due to an increase in recreational riders on weekends.
For the modelling process the main data set was split into three smaller data sets train, test and validation sets, in the ratio of 60%, 20%, 20%. Due to the imbalance in the response variable it was decided that during partitioning it would be prudent to ensure that the number of TRUE cases were distributed in proportion across the three sets.
inds <- partition(crashDF$Crash_Severity_Fatal, p = c(train = 0.6, valid = 0.2, test = 0.2))
To address the imbalance in the response variable in the train dataset, as in the first analysis, the ROSE package (CRAN 2014) was used to synthetically create extra cases for the positive class of the response variable.
| Dataset | Total Rows | Response == TRUE |
|---|---|---|
| Train | 13654 | 6783 |
| Test | 4551 | 153 |
| Validation | 4552 | 152 |
Modelling used a logistic multilevel model with one-level. The grouping variable will be Curved, this selection was made on the basis of the very different riding conditions and bike dynamics encountered on curves compared to straight road sections. The other variables were used as fixed effects.
Various combinations of the fixed effects variables were used. Suitability of inclusion of variables as fixed effect was based on statistical significance (Pr<|z|), as well as checks for multicollinearity using variance inflation factor (VIF) calculations.
Predictions were made for each model with the test data set. As the predictions are returned as probabilities of the response been TRUE, a threshold value was be required to convert the probabilities into actual responses. As in the first analysis the best threshold will be calculated via use of the colAUC() method in the caTools package (CRAN 2020a). Probabilities greater than or equal to the calculated threshold or cutoff value will be mapped to a TRUE response.
A confusion matrix is generated from the predicted responses vs the actual response. From the matrix several measures are calculated, Accuracy, F1 Score and Mathews Correlation Coefficient (MCC). While Accuracy was used in the original analysis its use can be considered flawed due to the imbalanced nature of the response variable. A model that would have simply predicted FALSE all the time would have achieved an Accuracy of around 96%. This is due to the Accuracy measures emphasis on the true negatives and true positives; in the test data the true negatives would swamp any change in number of the true positive responses
The F1 score gives a different view of the confusion matrix by emphasising the false negatives and false positives, but has similar issues. MCC by contrast takes into consideration all quadrants of the confusion matrix and gives a more truthful score when evaluating binary classifications than Accuracy or F1 score (Chicco and Jurman 2020).
In this analysis a false positive will be considered the more misleading result so BIC rather than AIC will be used as an indication of model fit (The Methodology Center n.d.).
Modelling will be performed using the glmer() method from the lme4 package (CRAN 2020b). The quadratic approximation (BOBYQA) optimizer will be used, with a set maximum of 200,000 iterations; this has been found to supply a good medium between speed and convergence (Miller 2018).
outputStats <- function( model, pred=NULL, cutoff=NULL ) {
print(summary(model), correlation=TRUE)
if(is.null(pred)) {
pred <- predict(model, newdata=testDF, type="response")
}
if(is.null(cutoff)){
cutoff <- colAUC(pred, testDF[["Crash_Severity_Fatal"]], plotROC = TRUE)[1,1]
}
print(paste("Cutoff=", cutoff))
res <- factor( as.character(pred >= cutoff), ordered=c("FALSE", "TRUE"))
orig <- factor(as.character(testDF$Crash_Severity_Fatal == 1), ordered=c("FALSE", "TRUE"))
cm <- confusionMatrix(orig, res, positive="TRUE")
print(check_collinearity(model))
coef(model)
print(cm)
mAccuracy <- cm$overall["Accuracy"]
mBIC <- BIC(model)
mMCC <- mcc(TP=cm$table[4], FP=cm$table[3], TN=cm$table[1], FN=cm$table[2])
print(paste("Accuracy=", mAccuracy))
print(paste("F1=",cm$byClass["F1"]))
print(paste("BIC =", mBIC))
print(paste("Mathews correlation coefficent=", mMCC))
}
The are various approaches to developing a model such as starting with a few variables and adding or with many variables and removing, or even using automated techniques to find the “best” predictors. In this analysis the approach will be an all in model and then whittle down the variables based on calculated statistics.
model1 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control + Crash_Roadway_Feature +
Crash_Speed_Limit + CrestOrDip + Weekend + OutOfLane + Daylight +
(1|Curved),
data=trainDF,
family=binomial(link ="logit"),
control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
Looking at the summary details for the model we see that the additional variance for the Curved random effect is disappointingly low, which could be an indication of issues to come. In the fixed effects we can see that very few of the Crash_Roadway_Feature levels are statistically significant, neither is the CrestOrDip, making both variables candidates for removal.
Examining the VIF calculations and the correlation matrix it can be seen that the colliniearty of predictors is in general low, with the exception of two of Crash_Roadway_Feature levels, No_Roadway_Feature and Intersection, which have a correlation of 0.95. In the remainder of this document the correlation matrix won’t be displayed unless the VIF scores indicate an issue.
print(summary(model1), correlation=TRUE)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Roadway_Feature +
## Crash_Speed_Limit + CrestOrDip + Weekend + OutOfLane + Daylight +
## (1 | Curved)
## Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## 16901.2 17081.7 -8426.6 16853.2 13630
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5088 -0.7833 -0.4633 0.8185 2.1584
##
## Random effects:
## Groups Name Variance Std.Dev.
## Curved (Intercept) 0.08266 0.2875
## Number of obs: 13654, groups: Curved, 2
##
## Fixed effects:
## Estimate Std. Error z value
## (Intercept) -0.03489 0.27554 -0.127
## Crash_Traffic_ControlHuman_Control -13.21804 303.02948 -0.044
## Crash_Traffic_ControlMiscellaneous -12.71834 508.74806 -0.025
## Crash_Traffic_ControlNo_traffic_control 0.22105 0.07315 3.022
## Crash_Traffic_ControlPedestrian_Crossing -12.93334 197.25548 -0.066
## Crash_Traffic_ControlRailway_Crossing -14.40521 364.10857 -0.040
## Crash_Traffic_ControlStop_sign 0.27803 0.13452 2.067
## Crash_Traffic_ControlTraffic_Lights 0.44687 0.08899 5.022
## Crash_Roadway_FeatureForestry_National_Park_Road 0.22327 0.36685 0.609
## Crash_Roadway_FeatureIntersection -0.46783 0.16799 -2.785
## Crash_Roadway_FeatureMedian_Opening -0.07935 0.32612 -0.243
## Crash_Roadway_FeatureMerge_Lane 0.10585 0.37475 0.282
## Crash_Roadway_FeatureNo_Roadway_Feature -0.10427 0.16409 -0.635
## Crash_Roadway_FeatureOther -12.90864 623.74985 -0.021
## Crash_Roadway_FeatureRailway_Crossing 1.33667 0.76748 1.742
## Crash_Speed_Limit60kmh -0.13009 0.05598 -2.324
## Crash_Speed_Limit70kmh 0.60412 0.08578 7.043
## Crash_Speed_Limit80to90kmh 1.09798 0.06951 15.795
## Crash_Speed_Limit100to110kmh 1.15876 0.06394 18.124
## CrestOrDipTRUE -0.01040 0.05909 -0.176
## WeekendTRUE 0.20851 0.03945 5.285
## OutOfLaneTRUE -0.12669 0.05843 -2.168
## DaylightTRUE -0.49294 0.04119 -11.966
## Pr(>|z|)
## (Intercept) 0.89923
## Crash_Traffic_ControlHuman_Control 0.96521
## Crash_Traffic_ControlMiscellaneous 0.98006
## Crash_Traffic_ControlNo_traffic_control 0.00251 **
## Crash_Traffic_ControlPedestrian_Crossing 0.94772
## Crash_Traffic_ControlRailway_Crossing 0.96844
## Crash_Traffic_ControlStop_sign 0.03875 *
## Crash_Traffic_ControlTraffic_Lights 5.12e-07 ***
## Crash_Roadway_FeatureForestry_National_Park_Road 0.54278
## Crash_Roadway_FeatureIntersection 0.00535 **
## Crash_Roadway_FeatureMedian_Opening 0.80776
## Crash_Roadway_FeatureMerge_Lane 0.77758
## Crash_Roadway_FeatureNo_Roadway_Feature 0.52512
## Crash_Roadway_FeatureOther 0.98349
## Crash_Roadway_FeatureRailway_Crossing 0.08157 .
## Crash_Speed_Limit60kmh 0.02013 *
## Crash_Speed_Limit70kmh 1.89e-12 ***
## Crash_Speed_Limit80to90kmh < 2e-16 ***
## Crash_Speed_Limit100to110kmh < 2e-16 ***
## CrestOrDipTRUE 0.86031
## WeekendTRUE 1.26e-07 ***
## OutOfLaneTRUE 0.03014 *
## DaylightTRUE < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of fixed effects could have been required in summary()
##
## Correlation of Fixed Effects:
## (Intr) C_T_CH C_T_CM C_T_CN C_T_CP C_T_CR C_T_CS C_T_CT C_R_FF
## Crsh_T_CH_C 0.000
## Crsh_Trf_CM 0.000 0.000
## Crsh_T_CN__ -0.256 0.000 0.000
## Crsh_T_CP_C 0.000 0.000 0.000 0.000
## Crsh_T_CR_C 0.000 0.000 0.000 0.000 0.000
## Crsh_Tr_CS_ -0.089 0.000 0.000 0.362 0.000 0.000
## Crsh_T_CT_L -0.138 0.000 0.000 0.553 0.000 0.000 0.300
## C_R_FF_N_P_ -0.258 0.000 0.000 -0.007 0.000 0.000 -0.001 -0.011
## Crsh_Rdw_FI -0.609 0.000 0.000 0.124 0.000 0.000 -0.005 0.002 0.424
## Crsh_R_FM_O -0.302 0.000 0.000 0.008 0.000 0.000 0.004 0.011 0.221
## Crsh_R_FM_L -0.248 0.000 0.000 -0.003 0.000 0.000 -0.002 0.007 0.187
## Cr_R_FN_R_F -0.581 0.000 0.000 -0.018 0.000 0.000 -0.007 -0.008 0.440
## Crsh_Rdw_FO 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Crsh_R_FR_C -0.126 0.000 0.000 -0.002 0.000 -0.002 -0.001 0.000 0.094
## Crsh_Sp_L60 -0.147 0.000 0.000 0.017 0.000 0.000 -0.007 -0.069 0.001
## Crsh_Sp_L70 -0.092 0.000 0.000 0.008 0.000 0.000 0.001 -0.065 0.002
## Crs_S_L8090 -0.145 0.000 0.000 0.001 0.000 0.000 0.004 -0.018 0.036
## C_S_L100110 -0.142 0.000 0.000 -0.018 0.000 0.000 -0.004 -0.005 -0.028
## CrstOrDTRUE 0.005 0.000 0.000 -0.031 0.000 0.000 -0.052 0.001 -0.070
## WeekendTRUE -0.053 0.000 0.000 -0.012 0.000 0.000 -0.009 -0.007 0.042
## OutOfLnTRUE -0.001 0.000 0.000 -0.037 0.000 0.000 0.033 -0.017 0.008
## DaylghtTRUE -0.105 0.000 0.000 0.039 0.000 0.000 -0.012 0.036 -0.022
## C_R_FI C_R_FM_O C_R_FM_L C_R_FN C_R_FO C_R_FR C_S_L6 C_S_L7 C_S_L8
## Crsh_T_CH_C
## Crsh_Trf_CM
## Crsh_T_CN__
## Crsh_T_CP_C
## Crsh_T_CR_C
## Crsh_Tr_CS_
## Crsh_T_CT_L
## C_R_FF_N_P_
## Crsh_Rdw_FI
## Crsh_R_FM_O 0.490
## Crsh_R_FM_L 0.422 0.224
## Cr_R_FN_R_F 0.954 0.496 0.430
## Crsh_Rdw_FO 0.000 0.000 0.000 0.000
## Crsh_R_FR_C 0.203 0.104 0.089 0.207 0.000
## Crsh_Sp_L60 -0.007 -0.001 -0.030 -0.007 0.000 -0.003
## Crsh_Sp_L70 0.001 0.011 -0.021 -0.009 0.000 -0.017 0.485
## Crs_S_L8090 0.069 0.054 0.017 0.045 0.000 0.006 0.588 0.391
## C_S_L100110 0.065 0.055 0.027 0.028 0.000 -0.022 0.632 0.421 0.545
## CrstOrDTRUE -0.025 -0.046 -0.002 -0.028 0.000 -0.001 -0.022 -0.034 -0.042
## WeekendTRUE 0.020 0.015 -0.029 0.009 0.000 0.021 0.004 0.011 -0.017
## OutOfLnTRUE -0.001 -0.055 -0.079 -0.020 0.000 0.009 -0.026 0.015 -0.005
## DaylghtTRUE 0.004 0.001 -0.029 0.004 0.000 0.029 0.009 -0.033 -0.081
## C_S_L1 CODTRU WkTRUE OOLTRU
## Crsh_T_CH_C
## Crsh_Trf_CM
## Crsh_T_CN__
## Crsh_T_CP_C
## Crsh_T_CR_C
## Crsh_Tr_CS_
## Crsh_T_CT_L
## C_R_FF_N_P_
## Crsh_Rdw_FI
## Crsh_R_FM_O
## Crsh_R_FM_L
## Cr_R_FN_R_F
## Crsh_Rdw_FO
## Crsh_R_FR_C
## Crsh_Sp_L60
## Crsh_Sp_L70
## Crs_S_L8090
## C_S_L100110
## CrstOrDTRUE -0.007
## WeekendTRUE -0.048 -0.026
## OutOfLnTRUE -0.012 0.011 0.082
## DaylghtTRUE -0.081 -0.030 -0.044 -0.082
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
print(check_collinearity(model1))
## # Check for Multicollinearity
##
## Low Correlation
##
## Parameter VIF Increased SE
## Crash_Traffic_Control 1.61 1.27
## Crash_Roadway_Feature 1.69 1.30
## Crash_Speed_Limit 1.16 1.08
## CrestOrDip 1.02 1.01
## Weekend 1.02 1.01
## OutOfLane 1.04 1.02
## Daylight 1.04 1.02
Model1 is then used to make predictions on the data in the test data set. From it’s predicted probabilities a threshold or cut-off value is calculated that will maximise the area under the curve (AUC) a ROC curve is also displayed to give a visual representation of the process.
pred <- predict(model1, newdata=testDF, type="response")
cutoff <- colAUC(pred, testDF[["Crash_Severity_Fatal"]], plotROC = TRUE)[1,1]
print(paste("Cutoff=", cutoff))
## [1] "Cutoff= 0.668310313362877"
Using the calculated cut-off value the predicted probabilities are then converted into actual predicted responses. The predicted responses and actual responses are used to generate a confusion matrix from which various measures are calculated.
# convert probability to response useing cutoff value
res <- factor( as.character(pred >= cutoff), levels=c("FALSE", "TRUE"))
orig <- factor(as.character(testDF$Crash_Severity_Fatal == 1), levels=c("FALSE", "TRUE"))
cm <- confusionMatrix(orig, res, positive="TRUE")
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 3845 553
## TRUE 115 38
##
## Accuracy : 0.8532
## 95% CI : (0.8426, 0.8634)
## No Information Rate : 0.8701
## P-Value [Acc > NIR] : 0.9996
##
## Kappa : 0.0515
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06430
## Specificity : 0.97096
## Pos Pred Value : 0.24837
## Neg Pred Value : 0.87426
## Prevalence : 0.12986
## Detection Rate : 0.00835
## Detection Prevalence : 0.03362
## Balanced Accuracy : 0.51763
##
## 'Positive' Class : TRUE
##
m1Accuracy <- cm$overall["Accuracy"]
m1BIC <- BIC(model1)
m1MCC <- mcc(TP=cm$table[4], FP=cm$table[3], TN=cm$table[1], FN=cm$table[2])
print(paste("Accuracy =", m1Accuracy))
## [1] "Accuracy = 0.853219072731268"
print(paste("F1 =",cm$byClass["F1"]))
## [1] "F1 = 0.102150537634409"
print(paste("Mathews correlation coefficent =", m1MCC))
## [1] "Mathews correlation coefficent = 0.0657534042826112"
While the Accuracy is reasonable, the F1 and MCC scores are rather low, reflecting the low TP vs FP results. Again stressing the inadequacy of relying just on the Accuracy score.
Due to the original analysis model been trained on different data (only data from 60kmh speed limit zones) it is not totally correct to compare that model to models in this analysis, however in the interests of curiosity we will anyway. In the original analysis the Accuracy was 81% which at flush blush would indicated that this current model is better with an Accuracy of 85%. But if we calculate the MCC score for the original result we can see that it’s MCC value is greater and therefore better, than. Indicating that the original analysis model over it’s limited data selection is more accurate than model1.
origMCC <- mcc(TP=26, FP=206, FN=23, TN=1005)
print(paste("Mathews correlation coefficent =", origMCC))
## [1] "Mathews correlation coefficent = 0.179821132526949"
As identified above in analysis of the summary variable for model 1 variables Crash_Roadway_Feature and CrestOrCurve will be removed from the model, to give us model 2. Again predictions and calculations are made.
The BIC score has decreased slightly which indicates that model2 is slightly better fit to the data than model1.
The variance explained by the random effect Curved is slightly increased. All the fixed predictors with the exception of OutOfLane are statistically significant. As such OutOfLane is a candidate for removal in the next model.
Collinearity for model2 still remains slow.
It can be seen that model2 has a slight increase in MCC over model1 which implies that model2 is more accurate.
model2 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control + Crash_Speed_Limit +
Weekend + OutOfLane + Daylight + (1|Curved),
data=trainDF,
family=binomial(link ="logit"),
control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Speed_Limit +
## Weekend + OutOfLane + Daylight + (1 | Curved)
## Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## 16948.5 17068.8 -8458.2 16916.5 13638
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3845 -0.7885 -0.4624 0.8298 2.1628
##
## Random effects:
## Groups Name Variance Std.Dev.
## Curved (Intercept) 0.1003 0.3167
## Number of obs: 13654, groups: Curved, 2
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.50116 0.23786 -2.107 0.0351
## Crash_Traffic_ControlHuman_Control -12.88831 6.33928 -2.033 0.0420
## Crash_Traffic_ControlMiscellaneous -12.44948 9.92579 -1.254 0.2097
## Crash_Traffic_ControlNo_traffic_control 0.47146 0.06457 7.302 2.84e-13
## Crash_Traffic_ControlPedestrian_Crossing -12.59571 2.75358 -4.574 4.78e-06
## Crash_Traffic_ControlRailway_Crossing -12.78835 12.00149 -1.066 0.2866
## Crash_Traffic_ControlStop_sign 0.28061 0.13487 2.081 0.0375
## Crash_Traffic_ControlTraffic_Lights 0.46680 0.08918 5.234 1.66e-07
## Crash_Speed_Limit60kmh -0.12997 0.05581 -2.329 0.0199
## Crash_Speed_Limit70kmh 0.63042 0.08545 7.378 1.61e-13
## Crash_Speed_Limit80to90kmh 1.14766 0.06893 16.651 < 2e-16
## Crash_Speed_Limit100to110kmh 1.23501 0.06305 19.589 < 2e-16
## WeekendTRUE 0.21721 0.03927 5.531 3.18e-08
## OutOfLaneTRUE -0.09865 0.05800 -1.701 0.0890
## DaylightTRUE -0.49742 0.04099 -12.134 < 2e-16
##
## (Intercept) *
## Crash_Traffic_ControlHuman_Control *
## Crash_Traffic_ControlMiscellaneous
## Crash_Traffic_ControlNo_traffic_control ***
## Crash_Traffic_ControlPedestrian_Crossing ***
## Crash_Traffic_ControlRailway_Crossing
## Crash_Traffic_ControlStop_sign *
## Crash_Traffic_ControlTraffic_Lights ***
## Crash_Speed_Limit60kmh *
## Crash_Speed_Limit70kmh ***
## Crash_Speed_Limit80to90kmh ***
## Crash_Speed_Limit100to110kmh ***
## WeekendTRUE ***
## OutOfLaneTRUE .
## DaylightTRUE ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation matrix not shown by default, as p = 15 > 12.
## Use print(summary(model2), correlation=TRUE) or
## vcov(summary(model2)) if you need it
## # Check for Multicollinearity
##
## Low Correlation
##
## Parameter VIF Increased SE
## Crash_Traffic_Control 1.09 1.04
## Crash_Speed_Limit 1.10 1.05
## Weekend 1.02 1.01
## OutOfLane 1.03 1.01
## Daylight 1.03 1.02
## [1] "Cutoff= 0.681262576275015"
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 3808 590
## TRUE 111 42
##
## Accuracy : 0.846
## 95% CI : (0.8351, 0.8563)
## No Information Rate : 0.8611
## P-Value [Acc > NIR] : 0.9984
##
## Kappa : 0.0559
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.066456
## Specificity : 0.971676
## Pos Pred Value : 0.274510
## Neg Pred Value : 0.865848
## Prevalence : 0.138871
## Detection Rate : 0.009229
## Detection Prevalence : 0.033619
## Balanced Accuracy : 0.519066
##
## 'Positive' Class : TRUE
##
## [1] "Accuracy = 0.845967919138651"
## [1] "F1 = 0.107006369426752"
## [1] "Mathews correlation coefficent = 0.0731583783984442"
For model 3 the OutOfLane variable was removed from the fixed effects in the formula for model 2.
model3 <- glmer(Crash_Severity_Fatal~ Crash_Traffic_Control+ Crash_Speed_Limit +
Weekend + Daylight + (1|Curved),
data=trainDF,
family=binomial(link ="logit"),
control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
The changes in BIC, Accuracy and MCC are all very slight, BIC indicating a slight improvement, while Accuracy and MCC showing a slight decline in accuracy. At this point all predictors are statistically significant. With the introduction of dummy variables it would be possible to remove the least significant levels in the factors, but since dummy variables weren’t used at the start of modelling this will be left for another analysis.
The MCC scores aren’t redically different between model 2 and model 3. Since model3 is the less complex it is the one that will be chosen as the winner.
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Crash_Severity_Fatal ~ Crash_Traffic_Control + Crash_Speed_Limit +
## Weekend + Daylight + (1 | Curved)
## Data: trainDF
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## AIC BIC logLik deviance df.resid
## 16949.4 17062.2 -8459.7 16919.4 13639
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3850 -0.7824 -0.4822 0.8306 2.0736
##
## Random effects:
## Groups Name Variance Std.Dev.
## Curved (Intercept) 0.1036 0.3219
## Number of obs: 13654, groups: Curved, 2
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.50164 0.24052 -2.086 0.037010
## Crash_Traffic_ControlHuman_Control -12.88060 4.35976 -2.954 0.003133
## Crash_Traffic_ControlMiscellaneous -12.44073 4.17885 -2.977 0.002910
## Crash_Traffic_ControlNo_traffic_control 0.46331 0.06441 7.193 6.32e-13
## Crash_Traffic_ControlPedestrian_Crossing -12.59823 3.70946 -3.396 0.000683
## Crash_Traffic_ControlRailway_Crossing -12.77910 3.04198 -4.201 2.66e-05
## Crash_Traffic_ControlStop_sign 0.28868 0.13501 2.138 0.032492
## Crash_Traffic_ControlTraffic_Lights 0.46429 0.08921 5.205 1.94e-07
## Crash_Speed_Limit60kmh -0.13272 0.05576 -2.380 0.017308
## Crash_Speed_Limit70kmh 0.63233 0.08542 7.402 1.34e-13
## Crash_Speed_Limit80to90kmh 1.14684 0.06890 16.646 < 2e-16
## Crash_Speed_Limit100to110kmh 1.23367 0.06302 19.577 < 2e-16
## WeekendTRUE 0.22235 0.03914 5.680 1.34e-08
## DaylightTRUE -0.50347 0.04084 -12.328 < 2e-16
##
## (Intercept) *
## Crash_Traffic_ControlHuman_Control **
## Crash_Traffic_ControlMiscellaneous **
## Crash_Traffic_ControlNo_traffic_control ***
## Crash_Traffic_ControlPedestrian_Crossing ***
## Crash_Traffic_ControlRailway_Crossing ***
## Crash_Traffic_ControlStop_sign *
## Crash_Traffic_ControlTraffic_Lights ***
## Crash_Speed_Limit60kmh *
## Crash_Speed_Limit70kmh ***
## Crash_Speed_Limit80to90kmh ***
## Crash_Speed_Limit100to110kmh ***
## WeekendTRUE ***
## DaylightTRUE ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation matrix not shown by default, as p = 14 > 12.
## Use print(summary(model3), correlation=TRUE) or
## vcov(summary(model3)) if you need it
## [1] "Cutoff= 0.680202230960597"
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 3800 598
## TRUE 111 42
##
## Accuracy : 0.8442
## 95% CI : (0.8333, 0.8546)
## No Information Rate : 0.8594
## P-Value [Acc > NIR] : 0.9983
##
## Kappa : 0.0546
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.065625
## Specificity : 0.971619
## Pos Pred Value : 0.274510
## Neg Pred Value : 0.864029
## Prevalence : 0.140628
## Detection Rate : 0.009229
## Detection Prevalence : 0.033619
## Balanced Accuracy : 0.518622
##
## 'Positive' Class : TRUE
##
## [1] "Accuracy = 0.844210063722259"
## [1] "F1 = 0.105926860025221"
## [1] "Mathews correlation coefficent = 0.0718308810296092"
As a validation of model 3 it was run against the validation set. This resulted in he MCC improving. So it would seem that the model is at least not over fitted to it’s training or test data.
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 3797 603
## TRUE 100 52
##
## Accuracy : 0.8456
## 95% CI : (0.8347, 0.8559)
## No Information Rate : 0.8561
## P-Value [Acc > NIR] : 0.979
##
## Kappa : 0.0789
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07939
## Specificity : 0.97434
## Pos Pred Value : 0.34211
## Neg Pred Value : 0.86295
## Prevalence : 0.14389
## Detection Rate : 0.01142
## Detection Prevalence : 0.03339
## Balanced Accuracy : 0.52686
##
## 'Positive' Class : TRUE
##
## [1] "Accuracy = 0.845562390158172"
## [1] "F1 = 0.128872366790582"
## [1] "Mathews correlation coefficent = 0.104964593214166"
| Model | BIC | Accuracy | MCC |
|---|---|---|---|
| model1 | 17082 | 0.85 | 0.0657534 |
| model2 | 17069 | 0.85 | 0.0731584 |
| model3 | 17062 | 0.84 | 0.0718309 |
| model3 - validation | 0.85 | 0.1049646 | |
| original analysis | 0.81 | 0.1798211 |
The model results aren’t great especially when comparing the MCC value to that obtained in the original analysis. Although it should be pointed out that the original analysis was only for one speed zone where as this analysis covered all zones.
The question then becomes was the assumption valid that here would be some sort of clustering or hierarchy on curves vs straight roads valid. As a check a regular linear regression model was run against the test data set using a similar formula to model 3, but the random effects are now fixed.
lmModel <- glm(Crash_Severity_Fatal ~ Crash_Traffic_Control+ Crash_Speed_Limit +
Weekend + Daylight + Curved,
data=trainDF,family = "binomial")
## [1] "Accuracy = 0.844210063722259"
## [1] "Mathews correlation coefficent = 0.0718308810296092"
As can be seen the Accuracy and MCC score are the same for both the simple linear regression model and the multilevel model. So the use of the random effect Curved doesn’t seem to be useful. This would seem to indicate that the proposal that the crash data will cluster along curve vs straight roads is in fact incorrect. While others have found hierarchically structure in accident data it would appear that grouping by curved roads vs straight roads is not one of them.
Chicco, Davide, and Giuseppe Jurman. 2020. “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.” BMC Genomics 21 (1): 6. https://doi.org/10.1186/s12864-019-6413-7.
CRAN. 2014. “Package ROSE.” https://cran.r-project.org/web/packages/ROSE/index.html.
———. 2020a. “Package caTools.” https://cran.r-project.org/web/packages/caTools/index.html.
———. 2020b. “Package lme4.” https://cran.r-project.org/web/packages/lme4/index.html.
Farid, Ahmed, and Khaled Ksaibati. 2020. “Modeling severities of motorcycle crashes using random parameters.” Journal of Traffic and Transportation Engineering (English Edition), June. https://doi.org/10.1016/j.jtte.2020.01.001.
Imprialou, Maria Ioanna M., Mohammed Quddus, and David E. Pitfield. 2015. “Multilevel logistic regression modeling for crash mapping in metropolitan areas.” National Research Council. https://doi.org/10.3141/2514-05.
Jones, Andrew P., and Stig H. Jørgensen. 2003. “The use of multilevel models for the prediction of road accident outcomes.” Accident Analysis and Prevention 35 (1): 59–69. https://doi.org/10.1016/S0001-4575(01)00086-0.
Kumar, Sachin, and Durga Toshniwal. 2015. “A data mining framework to analyze road accident data.” Journal of Big Data 2 (1): 26. https://doi.org/10.1186/s40537-015-0035-y.
Lenguerrand, E, J L Martin, and B Laumon. 2006. “Modelling the hierarchical structure of road crash data-Application to severity analysis.” Accident Analysis and Prevention 38: 43–53. https://doi.org/10.1016/j.aap.2005.06.021.
Miller, Steven V. 2018. “Mixed Effects Modeling Tips: Use a Fast Optimizer, but Perform Optimizer Checks.” http://svmiller.com/blog/2018/06/mixed-effects-models-optimizer-checks/.
Park, Ho Chul, Dong Kyu Kim, Seung Young Kho, and Peter Y. Park. 2017. “Cross-classified multilevel models for severity of commercial motor vehicle crashes considering heterogeneity among companies and regions.” Accident Analysis and Prevention 106 (September): 305–14. https://doi.org/10.1016/j.aap.2017.06.009.
The Methodology Center. n.d. “AIC vs. BIC.” Accessed November 7, 2020. https://www.methodology.psu.edu/resources/AIC-vs-BIC/.
Usman, Taimur, Liping Fu, and Luis F. Miranda-Moreno. 2016. “Injury severity analysis: Comparison of multilevel logistic regression models and effects of collision data aggregation.” Journal of Modern Transportation 24 (1): 73–87. https://doi.org/10.1007/s40534-016-0096-4.
Wikipedia. n.d.a. Countersteering. http://sheldonbrown.com/gloss{\_}cn-z.html{\#}countersteering.
———. n.d.b. “Highsider.” Accessed November 7, 2020. https://en.wikipedia.org/wiki/Highsider.
Zhang, Kairan, and Mohamed Hassan. 2019. “Crash severity analysis of nighttime and daytime highway work zone crashes.” PLoS ONE 14 (8). https://doi.org/10.1371/journal.pone.0221128.