1. Project Objective:

This report addresses the following questions:

  1. Based on the available data, what are the factors that best discriminate between different severities of accidents? Can the knowledge of these factors be helpful in the practice of reducing frequency of higher-severity accidents?
  1. Can the given data be used to infer areas for improvement within different police jurisdictions? If so, provide the reasoning for your analysis with examples.

2. Assumptions:

The following assumptions were made as part of the research project:

1. Location-based numerical data (“Location_Easting_OSGR”, “Location_Northing_OSGR”, Longitude, Latitude) behave differently from ordinary numerical data: only the intersection of several attributes’ values carries an interpretable “value”. As a result, all location-based fields are ignored in this research project so that it can focus on attributes that can be interpreted individually.

2. Time-based numerical data (date, time) also behave differently from ordinary numerical data in that they are multi-cyclical. Time of day in the context of transportation typically shows two periods of heightened activity, centered on the morning and evening rush hours. Date in the context of transportation typically highlights periods of heightened travel (near holiday weekends, summer vacation, etc.). Thus, to better utilize our attributes, we converted date to an “Is Holiday” binary attribute and retained the “Day of Week” symbolic attribute to note days of high activity.

3. Predicting an accident severity value lower than the actual value (1 being fatal, 3 being slight), that is, predicting a more severe accident than actually occurred, is considered a false positive, or Type I error. Predicting a severity value higher than the actual value (for example, predicting 3 = slight when the actual value is 2 = serious) is considered a false negative, or Type II error.

4. The “cost” of a false negative far outweighs the cost of a false positive in this report, because the report attempts to predict the severity, including fatality, of vehicle accidents. Failing to predict a fatal accident has dire consequences for decisions based on this research, so we focus heavily on minimizing false negatives while also using overall accuracy to measure our classification models.

5. Due to the very large size of the dataset, individual Police jurisdictions (which split the data into 51 geographic groups of various sizes) are used as the primary training and testing sets in lieu of the full dataset. The assumption is that any single Police jurisdiction can serve as a representative sample of the larger dataset; this keeps the analysis simpler and reduces the computing resources and processing time needed.

6. Also due to the very large size of the dataset, 10 folds were used for all cross validation to save computing resources and processing time. Outside of an academic research project the number of folds could be far greater (approaching the total number of rows in the dataset, i.e. leave-one-out cross validation); a value of 10 was used to demonstrate the process of cross validating our chosen models.
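The fold construction described in assumption 6 can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report; the function name `k_fold_indices` and the fixed seed are assumptions made for the example. Setting `k` equal to the number of rows would give the leave-one-out case.

```python
import random


def k_fold_indices(n_rows: int, k: int = 10, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 1535 is the row count the report gives for the North Wales subset.
splits = list(k_fold_indices(1535, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each row appears in exactly one test fold, so every observation is predicted once by a model that did not see it during training.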

3. Data Summary

Dataset description

The associated data is a subset of the public collection of data on the circumstances of personal injury road accidents in Great Britain in 2012. The statistics relate only to personal injury accidents on public roads that are reported to the police, and subsequently recorded, using the STATS19 accident reporting form. Information on damage-only accidents with no human casualties, or on accidents on private roads or in car parks, is not included in this data.

Very few, if any, fatal accidents fail to become known to the police, although it is known that a considerable proportion of non-fatal injury accidents are not reported. Figures for deaths refer to persons killed immediately or who died within 30 days of the accident. This is the usual international definition, adopted by the Vienna Convention in 1968.

As well as giving details of date, time and location, the accident file gives a summary of all reported vehicles and pedestrians involved in road accidents and the total number of casualties, by severity.

Dependent / Outcome Variable: - ‘Accident_Severity’

The primary focus of this project is to observe, detect, and predict changes in the severity of each accident reported in the dataset. This is illustrated in the dataset by the variable: Accident_Severity. It is a categorical variable with the following volumes in the data:

Accident Severity by Volume
Severity Volume
Fatal 1637
Serious 20901
Slight 123033

Other Calculated Variables: - ‘IsWeekend’ - ‘IsHoliday’

Based on the date field and the published public holidays in the United Kingdom in 2012, we added two categorical variables as further candidate predictors for our classification models. The holidays flagged were Christmas, Easter, Good Friday, New Year’s Eve, and New Year’s Day.
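A minimal sketch of how the two calculated variables could be derived from a date string is shown below, in Python rather than the R used for the report. The day/month/year format of the raw date column, the reading of “Easter” as Easter Sunday, and the function name `derive_flags` are assumptions made for illustration.

```python
from datetime import date, datetime

# The five 2012 holidays named in the report ("Easter" is assumed to mean Easter Sunday).
HOLIDAYS_2012 = {
    date(2012, 1, 1),    # New Year's Day
    date(2012, 4, 6),    # Good Friday
    date(2012, 4, 8),    # Easter Sunday
    date(2012, 12, 25),  # Christmas Day
    date(2012, 12, 31),  # New Year's Eve
}


def derive_flags(date_text: str, fmt: str = "%d/%m/%Y"):
    """Return (is_weekend, is_holiday) for one accident date string.

    The day/month/year format is an assumption about the raw date column.
    """
    day = datetime.strptime(date_text, fmt).date()
    is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    is_holiday = day in HOLIDAYS_2012
    return is_weekend, is_holiday


flags = {d: derive_flags(d) for d in ["25/12/2012", "08/04/2012", "13/06/2012"]}
```

Because the weekend flag is fully determined by the day of the week, it carries no information beyond Day_of_Week, which is why the two are treated as overlapping later in the report.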

4. Feature Selection

Information Gain Analysis:

To provide more detail about our dataset variables, we calculate the Information Gain and Relative Information Gain of each variable to measure its effect on our target/dependent classification variable: Accident Severity.

Information Gain Table
Row Feature Information.Gain Relative.Information.Gain Percent.Missing
6 X1st_Road_Number 0.0396662 0.0583371 0.000000
12 X2nd_Road_Number 0.0278035 0.0408906 1.045538
2 Number_of_Vehicles 0.0115293 0.0169562 0.000000
1 Police_Force 0.0076984 0.0113221 0.000000
8 Speed_limit 0.0069141 0.0101686 0.000000
9 Junction_Detail 0.0053657 0.0078914 0.000000
20 Urban_or_Rural_Area 0.0042275 0.0062174 0.000000
11 X2nd_Road_Class 0.0041117 0.0060471 39.709145
10 Junction_Control 0.0034868 0.0051281 38.695894
15 Light_Conditions 0.0034634 0.0050936 0.000000
3 Number_of_Casualties 0.0034500 0.0050739 0.000000
7 Road_Type 0.0028268 0.0041573 0.000000
5 X1st_Road_Class 0.0012002 0.0017652 0.000000
4 Day_of_Week 0.0009866 0.0014510 0.000000
14 Pedestrian_Crossing.Physical_Facilities 0.0008043 0.0011829 0.000000
16 Weather_Conditions 0.0007014 0.0010316 0.000000
18 Special_Conditions_at_Site 0.0003504 0.0005153 0.000687
17 Road_Surface_Conditions 0.0001685 0.0002478 0.179294
19 Carriageway_Hazards 0.0001601 0.0002355 0.000000
21 isHoliday 0.0000660 0.0000971 0.000000
13 Pedestrian_Crossing.Human_Control 0.0000023 0.0000034 0.000000

Information Gain Findings:

Our findings for the information gain for the most part aligned with our expectations. However, we were surprised by X1st_Road_Number and X2nd_Road_Number having the two largest information gains. We believe that this may be due to the road numbers taking on a large number of distinct values, which splits the data into many small groups.
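The effect of many distinct values on information gain can be illustrated with a short Python sketch (the report itself used R). The toy feature values are hypothetical; only the severity volumes come from the Data Summary. Dividing the Information Gain column by the Relative Information Gain column in the table gives about 0.68, which matches the entropy of Accident_Severity in bits computed below, so relative gain appears to be gain divided by target entropy. That reading is an inference from the numbers, not something the report states.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Entropy of Accident_Severity from the volumes reported in the Data Summary.
severity = ["Fatal"] * 1637 + ["Serious"] * 20901 + ["Slight"] * 123033
target_entropy = entropy(severity)

# Hypothetical toy data: a feature with all-distinct values versus a two-valued flag.
toy_labels = ["Slight", "Slight", "Serious", "Slight", "Fatal", "Slight"]
road_number = [101, 102, 103, 104, 105, 106]  # every value distinct
flag = ["a", "b", "a", "b", "a", "b"]         # only two values
gain_unique = information_gain(road_number, toy_labels)
gain_flag = information_gain(flag, toy_labels)
```

In the toy data the all-distinct feature attains the maximum possible gain (the full entropy of the labels) even though it could not generalize to unseen rows, which is the behaviour suspected for the road number variables.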

In addition, X2nd_Road_Class and Junction_Control also appeared in the top ten. We decided that these variables were not useful because a very high share of their values were missing (roughly 39% each). Ultimately we decided our most useful variables are as follows: X1st_Road_Number, X2nd_Road_Number, Number_of_Vehicles, Police_Force, Speed_limit, Junction_Detail, Urban_or_Rural_Area, Light_Conditions, Number_of_Casualties, and Road_Type.
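The selection rule described above (rank by information gain, skip variables with many missing values) can be sketched in Python as follows. The rows are copied from the Information Gain Table; the 5% missing-value cutoff and the function name `select_features` are hypothetical choices for illustration, since the report does not state an exact threshold.

```python
# (feature, information gain, percent missing) copied from the Information Gain Table.
TABLE = [
    ("X1st_Road_Number", 0.0396662, 0.0),
    ("X2nd_Road_Number", 0.0278035, 1.045538),
    ("Number_of_Vehicles", 0.0115293, 0.0),
    ("Police_Force", 0.0076984, 0.0),
    ("Speed_limit", 0.0069141, 0.0),
    ("Junction_Detail", 0.0053657, 0.0),
    ("Urban_or_Rural_Area", 0.0042275, 0.0),
    ("X2nd_Road_Class", 0.0041117, 39.709145),
    ("Junction_Control", 0.0034868, 38.695894),
    ("Light_Conditions", 0.0034634, 0.0),
    ("Number_of_Casualties", 0.0034500, 0.0),
    ("Road_Type", 0.0028268, 0.0),
]


def select_features(rows, max_missing_pct=5.0, top_n=10):
    """Keep the top_n features by information gain among those with few missing values.

    The 5% cutoff is an illustrative assumption, not a value taken from the report.
    """
    kept = [row for row in rows if row[2] <= max_missing_pct]
    kept.sort(key=lambda row: row[1], reverse=True)
    return [name for name, _, _ in kept[:top_n]]


selected = select_features(TABLE)
```

With this cutoff the sketch returns the same ten variables listed above, in the same order.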

The sections below break down the selected categorical and numerical variables in more detail:

Selected Categorical Variables: - ‘Police Force’ - ‘Light_Conditions’ - ‘Road_Type’ - ‘Junction_Detail’ - ‘Urban_or_Rural_Area’

Categorical variables were in ample supply in this dataset, so a pared-down list was chosen carefully based on the following factors:

- Choose variables with the maximum information gain
- Avoid variables with a large number of missing values
- Avoid variables with possible covariance (e.g. IsWeekend and Day_of_Week)
- Maximize domain expertise on the causes of accidents (weather, intersection type, etc.)
- Maximize other influencers (holidays, etc.)
- Achieve an optimal spread of categorical and numeric variables

Police_Force: This basic descriptive variable identifies clusters of geographically similar crash data (in this case, by Police jurisdiction). Police_Force represents 51 different jurisdictional areas, separated by Police organization, across Great Britain. As noted later, this variable is also used to subset our large dataset.

Light_Conditions: This first categorical variable affects driving conditions and intuitively seemed likely to affect accident severity. It details the lighting conditions observed at the scene and time of the accident, as recorded by the police force on the accident report. This includes both the amount of daylight (daylight or darkness) and whether supplemental lighting was present and lit at the time of the accident. Available values and associated volumes are:

Data Volume by Light Conditions
Condition Volume
Darkness - lighting unknown 2593
Darkness - lights lit 28167
Darkness - lights unlit 761
Darkness - no lighting 7568
Daylight 106482

Junction_Detail: The next categorical variable details the setup/structure of the intersection at the scene and time of the accident, as reported on the accident report with the police force. Available values and associated volumes are:

Data Volume by Junction Detail
Condition Volume
Crossroads 14510
Mini-roundabout 1704
More than 4 arms (not roundabout) 2044
Not at junction or within 20 metres 56287
Other junction 3407
Private drive or entrance 5503
Roundabout 13256
Slip road 2148
T or staggered junction 46712

To better understand these variables and their effect on accident severity, we observe histograms illustrating the spread of values over each:

Urban_or_Rural_Area: Related to the location of the accident, as reported by the police force on the accident report. Possible values are Urban, Rural, or Unallocated; only Urban and Rural occur in this data:

Data Volume by Urban or Rural Area
Type Volume
Rural 50862
Urban 94709

Road_Type: Related to the type of road the accident occurred on, as reported by the police force on the accident report. Available values:

Data Volume by Road Type
Type Volume
Dual carriageway 20572
One way street 2747
Roundabout 10173
Single carriageway 109970
Slip road 1552
Unknown 557

Shown here are two histograms illustrating the spread of values over each (Urban/Rural and Road Type):

Selected Numerical Variables: - ‘X1st_Road_Number’ - ‘X2nd_Road_Number’ - ‘Number_of_Vehicles’ - ‘Speed_limit’ - ‘Number_of_Casualties’

Number_of_Vehicles: Related to the number of vehicles that were in the accident, as reported by the police force on the accident report. Available values:

Data Volume by Number of Vehicles
Number of Vehicles Volume
1 44086
2 87152
3 11147
4 2394
5 503
6 172
7 70
8 22
9 14
10 5
11 2
12 1
13 1
16 1
18 1

Speed_limit: Related to the speed limit on the road where the accident occurred, as reported by the police force on the accident report. Available values:

Data Volume by Speed Limit
Speed Limit Volume
10 1
20 2247
30 94995
40 11914
50 5220
60 21172
70 10022

Number_of_Casualties: Related to the number of casualties due to the accident, as reported by the police force on the accident report. Available values:

Data Volume by Number of Casualties
Number of Casualties Volume
1 112428
2 22732
3 6509
4 2442
5 879
6 348
7 111
8 56
9 21
10 13
11 11
12 4
13 5
14 1
15 2
16 2
17 1
19 1
25 2
33 1
38 1
42 1

To understand the numerical conditions, we observe histograms illustrating the spread of values over each:

5. Classification Approach:

To conduct our classification analysis on the selected subset of attributes, we will first conduct cross validation on 4 classifier models:

  1. Multinomial (Fit Multinomial Log-linear Models, using ‘multinom’ function from the ‘nnet’ library)
  1. Support Vector Machines (using ‘svm’ function of the ‘e1071’ library)
  1. Naive Bayes (using ‘naiveBayes’ function from the ‘e1071’ library)
  1. Classification Tree using rpart (using ‘rpart’ function from the ‘rpart’ library)

As mentioned previously in our assumptions, we also subset our data to 4 distinct police force jurisdictions to make the data size more manageable. We arbitrarily chose 4 Police Force values:

Subset A = Police Force of “Kent”

Subset B = Police Force of “North Wales”

Subset C = Police Force of “Metropolitan Police”

Subset D = Police Force of “Northumbria”

Cross Validation: Cross validation for each of our four models was conducted and results were compared for analysis of best fitting model:

CV: Kent- Multinomial
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 0.9997884 1.0000000 0.0002116 0.8907493 0.9879294 1.07 %
Severe 0.0000000 1.0000000 1.0000000 0.0000000 0.8907493 0.9005501 9.84 %
Slight 0.9997651 0.0000000 0.0002349 1.0000000 0.8907493 0.8907493 89.1 %
CV: Kent- Support Vector Machines
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.4313725 1.0000000 0.5686275 0.0000000 0.9935119 0.9939280 1.07 %
Severe 0.9978723 0.9939647 0.0021277 0.0060353 0.9935119 0.9943444 9.84 %
Slight 0.9997651 0.9904031 0.0002349 0.0095969 0.9935119 0.9987376 89.1 %
CV: Kent- Naive Bayes
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0392157 0.9947112 0.9607843 0.0052888 0.8679364 0.9824686 1.07 %
Severe 0.0468085 0.9742340 0.9531915 0.0257660 0.8679364 0.8812155 9.84 %
Slight 0.9685224 0.0499040 0.0314776 0.9500960 0.8679364 0.8682998 89.1 %
CV: Kent- Rpart Classification Tree
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0392157 0.9959805 0.9607843 0.0040195 0.8608204 0.9837359 1.07 %
Severe 0.1021277 0.9584494 0.8978723 0.0415506 0.8608204 0.8725074 9.84 %
Slight 0.9544280 0.1036468 0.0455720 0.8963532 0.8608204 0.8615417 89.1 %
CV: North Wales- Multinomial
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0454545 0.9993391 0.9545455 0.0006609 0.8117264 0.9826498 1.43 %
Severe 0.0037736 0.9976378 0.9962264 0.0023622 0.8117264 0.8235294 17.26 %
Slight 0.9967949 0.0069686 0.0032051 0.9930314 0.8117264 0.8117264 81.3 %
CV: North Wales- Support Vector Machines
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0 1.0000000 1 0.0000000 0.9856678 0.9856678 1.43 %
Severe 1 0.9874016 0 0.0125984 0.9856678 0.9895356 17.26 %
Slight 1 0.9790941 0 0.0209059 0.9856678 0.9960500 81.3 %
CV: North Wales- Naive Bayes
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.1818182 0.8162591 0.8181818 0.1837409 0.6618893 0.7743902 1.43 %
Severe 0.0226415 0.9818898 0.9773585 0.0181102 0.6618893 0.7827427 17.26 %
Slight 0.8060897 0.2404181 0.1939103 0.7595819 0.6618893 0.6883469 81.3 %
CV: North Wales- Rpart Classification Tree
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 0.9874422 1.0000000 0.0125578 0.734202 0.9648973 1.43 %
Severe 0.1547170 0.8811024 0.8452830 0.1188976 0.734202 0.7503329 17.26 %
Slight 0.8701923 0.1707317 0.1298077 0.8292683 0.734202 0.7380485 81.3 %
CV: Metropolitan Police- Multinomial
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0 1 1 0 0.8790523 0.9937569 0.55 %
Severe 0 1 1 0 0.8790523 0.8839339 11.54 %
Slight 1 0 0 1 0.8790523 0.8790523 87.91 %
CV: Metropolitan Police- Support Vector Machines
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.9847328 1.0000000 0.0152672 0.0000000 0.9998314 0.9999157 0.55 %
Severe 0.9992695 1.0000000 0.0007305 0.0000000 0.9998314 0.9999157 11.54 %
Slight 1.0000000 0.9986058 0.0000000 0.0013942 0.9998314 0.9998314 87.91 %
CV: Metropolitan Police- Naive Bayes
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0305344 0.9833404 0.9694656 0.0166596 0.8623161 0.9752086 0.55 %
Severe 0.0051132 0.9971405 0.9948868 0.0028595 0.8623161 0.8802014 11.54 %
Slight 0.9800978 0.0195190 0.0199022 0.9804810 0.8623161 0.8636997 87.91 %
CV: Metropolitan Police- Rpart Classification Tree
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0 1 1 0 0.8790523 0.9937569 0.55 %
Severe 0 1 1 0 0.8790523 0.8839339 11.54 %
Slight 1 0 0 1 0.8790523 0.8790523 87.91 %
CV: Northumbria- Multinomial
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 1.0000000 1.0000000 0.0000000 0.8575861 0.9871429 1.12 %
Severe 0.0047281 0.9992857 0.9952719 0.0007143 0.8575861 0.8672733 13.12 %
Slight 0.9992764 0.0043573 0.0007236 0.9956427 0.8575861 0.8575861 85.76 %
CV: Northumbria- Support Vector Machines
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 1.0000000 1.0000000 0.0000000 0.98852 0.9888268 1.12 %
Severe 0.9976359 0.9892857 0.0023641 0.0107143 0.98852 0.9903637 13.12 %
Slight 1.0000000 0.9847495 0.0000000 0.0152505 0.98852 0.9978077 85.76 %
CV: Northumbria- Naive Bayes
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0277778 0.9858801 0.9722222 0.0141199 0.8346261 0.9711191 1.12 %
Severe 0.0307329 0.9785714 0.9692671 0.0214286 0.8346261 0.8512658 13.12 %
Slight 0.9681621 0.0675381 0.0318379 0.9324619 0.8346261 0.8390518 85.76 %
CV: Northumbria- Rpart Classification Tree
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0277778 0.9934107 0.9722222 0.0065893 0.7924294 0.9785441 1.12 %
Severe 0.1513002 0.9064286 0.8486998 0.0935714 0.7924294 0.8044094 13.12 %
Slight 0.9005065 0.1590414 0.0994935 0.8409586 0.7924294 0.7944012 85.76 %
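The per-class columns in the tables above can be read as one-vs-rest rates. The Python sketch below (the report itself used R) uses the standard definitions and assumes they match how the table columns were computed. The labels are hypothetical and describe a model that always predicts the majority class.

```python
def one_vs_rest_metrics(actual, predicted, positive):
    """Per-class rates, treating `positive` as the positive class and all others as negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    # Rates that would be undefined (no positives or no negatives) fall back to 0.0
    # here for simplicity.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false.neg.rate": 1.0 - sensitivity,
        "false.pos.rate": 1.0 - specificity,
        "accuracy.observation": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 100 accidents, and a model that always predicts "Slight".
actual = ["Slight"] * 88 + ["Severe"] * 11 + ["Fatal"] * 1
predicted = ["Slight"] * 100
metrics = {c: one_vs_rest_metrics(actual, predicted, c) for c in ("Fatal", "Severe", "Slight")}
overall_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

A majority-class predictor reaches 88% overall accuracy on these labels while missing every fatal and severe accident. This is the pattern of 1s and 0s seen in the Metropolitan Police Multinomial and Rpart tables, and it is why accuracy alone is not a sufficient criterion here.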

CV Analysis:

The cross validation results of our 4 classifier models were striking. The stand-out model was clearly SVM, which had the highest accuracy on every subset and also the lowest false negative rate for fatal accidents on our 2 largest datasets (subsets A and C). Its false negative rate was higher on subset B (North Wales, 1535 values) and subset D (Northumbria, 3223 values). This may be a sign that the accuracy of the SVM model improves with the volume of its training set.

Also of note, the false negative rate for fatal accidents was at or near 100% (95% or higher) for both Multinomial and Rpart on every dataset. Given that the cost of a false negative (i.e. incorrectly predicting that an accident was less serious than it was) is very great in our analysis, we rely heavily on the false negative rate in addition to accuracy to choose the best model. Naive Bayes, while holding reasonable accuracy rates, still had false negative rates for fatal accidents between 82% and 97%, and thus would also not be as preferable as SVM in this case.
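The report’s ordinal definition of false positives and false negatives, together with the unequal cost placed on them, can be sketched as follows in Python. The 10:1 cost ratio and the function names are hypothetical; the report states only that false negatives cost far more, without giving a number.

```python
# Severity codes as used in the report: 1 = fatal, 2 = serious, 3 = slight.
FATAL, SERIOUS, SLIGHT = 1, 2, 3


def error_type(actual: int, predicted: int) -> str:
    """Classify one prediction using the report's ordinal convention."""
    if predicted == actual:
        return "correct"
    # A lower code means a more severe accident, so predicted < actual
    # over-states severity (false positive, Type I error).
    if predicted < actual:
        return "false positive"
    # predicted > actual under-states severity (false negative, Type II error).
    return "false negative"


def weighted_cost(pairs, fn_cost=10.0, fp_cost=1.0):
    """Sum a cost over (actual, predicted) pairs.

    The 10:1 weighting is an illustrative assumption, not a value from the report.
    """
    costs = {"correct": 0.0, "false positive": fp_cost, "false negative": fn_cost}
    return sum(costs[error_type(a, p)] for a, p in pairs)


examples = [(SERIOUS, SLIGHT), (SLIGHT, FATAL), (FATAL, FATAL)]
labels = [error_type(a, p) for a, p in examples]
total_cost = weighted_cost(examples)
```

Under such a weighting, a model that under-states severity even occasionally can score worse than a less accurate model that errs on the side of caution.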

City to City Learning: To further explore our models, we also train our best performing model on one jurisdiction and use it to predict values for another.

This is done without cross validation, to further explore a phenomenon we observed in which a model fitted once and applied directly appeared to perform better than the same model evaluated with cross validation.

Police Force A to B (Training on Kent, predicting North Wales)

Police Force B to C (Training on North Wales, predicting Metropolitan Police)

Police Force C to D (Training on Metropolitan Police, predicting Northumbria)

Police Force D to A (Training on Northumbria, predicting Kent)
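The train-on-one-jurisdiction, predict-on-another workflow listed above can be sketched as follows. The report used an SVM in R for this step; the Python sketch below swaps in a trivial lookup model (most common severity per speed limit) purely to show the workflow, and all records and function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_lookup(rows):
    """Fit a toy model: the most common severity seen for each speed limit."""
    by_limit = defaultdict(Counter)
    overall = Counter()
    for speed_limit, severity in rows:
        by_limit[speed_limit][severity] += 1
        overall[severity] += 1
    table = {limit: counts.most_common(1)[0][0] for limit, counts in by_limit.items()}
    default = overall.most_common(1)[0][0]  # used for speed limits unseen in training
    return table, default


def predict(model, rows):
    """Predict a severity for each (speed_limit, severity) row, ignoring the true label."""
    table, default = model
    return [table.get(speed_limit, default) for speed_limit, _ in rows]


# Hypothetical (speed_limit, severity) records for two jurisdictions.
force_a = [(30, "Slight"), (30, "Slight"), (30, "Slight"), (30, "Serious"),
           (60, "Serious"), (60, "Serious"), (60, "Slight"), (70, "Fatal")]
force_b = [(30, "Slight"), (30, "Slight"), (60, "Serious"),
           (60, "Slight"), (40, "Slight"), (70, "Serious")]

model = fit_lookup(force_a)            # train on jurisdiction A only
predictions = predict(model, force_b)  # predict jurisdiction B, never seen in training
actual_b = [severity for _, severity in force_b]
accuracy = sum(a == p for a, p in zip(actual_b, predictions)) / len(actual_b)
```

Because no row of the second jurisdiction is used for fitting, the resulting accuracy measures how well patterns learned in one police force transfer to another.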

Training on Kent, predicting North Wales
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.6818182 1.0000000 0.3181818 0.0000000 0.9876221 0.9954038 1.43 %
Severe 0.9547170 1.0000000 0.0452830 0.0000000 0.9876221 0.9921466 17.26 %
Slight 1.0000000 0.9337979 0.0000000 0.0662021 0.9876221 0.9876221 81.3 %
Training on North Wales, predicting Metropolitan Police
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 1.0000000 1.0000000 0.0000000 0.9944353 0.9944772 0.55 %
Severe 0.9996348 0.9979984 0.0003652 0.0020016 0.9944353 0.9981804 11.54 %
Slight 1.0000000 0.9686302 0.0000000 0.0313698 0.9944353 0.9961992 87.91 %
Training on Metropolitan Police, predicting Northumbria
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.6944444 1.0000000 0.3055556 0.0000000 0.9928638 0.9965743 1.12 %
Severe 0.9716312 1.0000000 0.0283688 0.0000000 0.9928638 0.9962640 13.12 %
Slight 1.0000000 0.9498911 0.0000000 0.0501089 0.9928638 0.9928638 85.76 %
Training on Northumbria, predicting Kent
observation sensitivity specificity false.neg.rate false.pos.rate accuracy.overall accuracy.observation pct.actual.values
Fatal 0.0000000 1.0000000 1.0000000 0.0000000 0.9891168 0.9893238 1.07 %
Severe 0.9978723 0.9928041 0.0021277 0.0071959 0.9891168 0.9932745 9.84 %
Slight 1.0000000 0.9596929 0.0000000 0.0403071 0.9891168 0.9955762 89.1 %

6. Final Conclusions and Recommendations

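As mentioned previously, based on our analysis of the data, the factors that appeared to best discriminate between different severities are as follows:

Variables found from info gain:

Categorical Variables: Police_Force, Light_Conditions, Road_Type, Junction_Detail, Urban_or_Rural_Area

Numerical Variables: X1st_Road_Number, X2nd_Road_Number, Number_of_Vehicles, Speed_limit, Number_of_Casualties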

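SVM model accuracy on our largest dataset (Metropolitan Police):

When using cross validation (within Metropolitan Police): 99.9831373%

When not using cross validation (trained on Metropolitan Police, predicting Northumbria): 99.2863791%

SVM model false negative rate for fatal accidents on our largest dataset (Metropolitan Police):

When using cross validation (within Metropolitan Police): 1.5267176%

When not using cross validation (trained on Metropolitan Police, predicting Northumbria): 30.5555556%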

Overall, our results indicate that this data could be used to make predictions that inform changes aimed at preventing future high-severity accidents. For example, a jurisdiction could examine the road types, speed limits, lighting conditions, or junction details associated with more severe accidents, and then target changes at whatever issues are found.

The analysis also indicated that accurate predictions can be applied across different jurisdictions. As displayed above, we applied the analysis to four arbitrarily selected jurisdictions and found high prediction accuracy, especially when the training subset was larger. Therefore, these predictors can be combined with each police force’s experience to infer areas for improvement.

Our finding of the most accurate model was not surprising in that, similar to the homework assignment, SVM was clearly the most accurate model. Naive Bayes also reached reasonable overall accuracy (66% to 87% across the four subsets), although its false negative rate for fatal accidents stayed high. All four models are ‘eager’ learners in that they fit a model during a training phase (a ‘lazy’ learner such as k-nearest neighbours defers that work to prediction time), so the eager/lazy distinction does not explain the differences we observed. Our multinomial model, which fits a multinomial log-linear model, reached 81% to 89% accuracy, but largely by predicting the majority class (Slight) for almost every accident. This was similar to rpart’s performance, although rpart builds its classifier in a different way, first growing a tree and then deriving predictions from its leaves.

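Interestingly, in some comparisons our results seemed more accurate without cross validation: training on Kent and predicting North Wales gave 98.76% accuracy, versus 98.57% for cross validation within North Wales. The Metropolitan Police figures above show the opposite, so this phenomenon was not consistent. Our cross validation was also limited in its number of folds by the size of the dataset and the available computing power.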

Going further, the results left us with a desire to dig deeper and ask why our results were significant. For instance, why were the X1st and X2nd road numbers significant? Are there multiple significant variables that could be linked and that jointly contribute to an accident? What would our results be had we been able to run cross validation with more folds on a computer with more computing power? Although outside the scope of our project, these are some of the questions we would be interested in exploring given our current analysis of the data.