Introduction

This exercise will use the Fredericton Traffic Accident open data set: http://data-fredericton.opendata.arcgis.com/datasets/traffic-accidents-accidents-de-la-circulation?selectedAttribute=Severity.

The dataset already contains several categorical variables, and this exercise adds a few more. The aim is to determine whether new categories, or bins, derived from the existing variables make it easier to predict when severe accidents are more likely to happen in Fredericton, which would also tell drivers when to take extra care.

Fredericton Traffic Accidents dataframe

## 'data.frame':    9301 obs. of  20 variables:
##  $ ï..X      : num  -66.6 -66.6 -66.6 -66.7 -66.7 ...
##  $ Y         : num  45.9 45.9 46 46 45.9 ...
##  $ FID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ OBJECTID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Severity  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Year_     : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month_    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day_      : int  25 19 20 11 25 26 27 27 14 27 ...
##  $ DayOfWeek : chr  "Tue" "Wed" "Wed" "Tue" ...
##  $ Hour_     : int  1200 1600 1800 1600 700 2200 600 1200 1900 800 ...
##  $ NoVehicles: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ NoInjured : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ NoKilled  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Street    : chr  "MacKenzie Road" "Waterloo Row" "Ring Road" "Clements Drive" ...
##  $ At_Near   : chr  "Near" "Near" "At" "At" ...
##  $ Near      : chr  "Hilton Road" "Beaverbrook Street" "Maple Street" "Regiment Creek Avenue" ...
##  $ Type      : chr  "Parking Lot" "Side" "Rear Ended" "Rear Ended" ...
##  $ Date      : chr  "2011-01-25 0:00" "2011-01-19 0:00" "2011-01-20 0:00" "2011-01-11 0:00" ...
##  $ Signalized: chr  " " " " "Yes" " " ...
##  $ Day_Night : chr  "D" "D" "N" "D" ...
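
The first column name, ï..X, is the usual symptom of a CSV file that starts with a UTF-8 byte-order mark being read without declaring its encoding. The sketch below reproduces the situation on a small temporary file (invented rows, not the real data) and shows one way to avoid it with the fileEncoding argument of read.csv. It also converts blank Day_Night entries and an Hour_ of 999 to NA. Treating 999 as a "time not recorded" code is an assumption made for illustration; the source does not say what it means.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (BOM).
# The rows are invented for illustration; they are not the Fredericton data.
csv_text <- paste(
  "X,Y,Hour_,Day_Night",
  "-66.6,45.9,1200,D",
  "-66.7,46.0,999, ",
  "-66.6,45.9,2200,N",
  sep = "\n"
)
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw(paste0(csv_text, "\n"))), tmp)

# Declaring the encoding strips the BOM, so the first column is named "X".
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)

# Assumption: Hour_ == 999 and a blank Day_Night both mean "not recorded".
demo$Hour_[demo$Hour_ == 999] <- NA
demo$Day_Night[trimws(demo$Day_Night) == ""] <- NA

str(demo)
```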

There are more drivers on the road during the week because of work, so it is not surprising that more accidents happen on weekdays.

Thursday and Friday have the most accidents, for reasons that are not clear from the data.

More accidents happen during the day than at night, but commuting to work or school presumably also puts more drivers on the road during the day.

It seems intuitive that more accidents would happen during weekday rush hour in Fredericton than during all of the other hours of the week.

Logistic Regression

Fredericton currently classifies traffic accidents as Severity 1 or 2, where Severity 2 accidents involve injuries. A new logical variable, Severe, was created from Severity to focus on the more serious accidents (class 2): the default Severity 1 is recoded as 0 (FALSE) and Severity 2 as 1 (TRUE).
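
The recoding can be sketched in one line. The vector below is made-up data standing in for Traffic_Accidents$Severity, and the exact statement used in the original analysis is not shown, so this is only one plausible way to build Severe.

```r
# Invented Severity codes standing in for Traffic_Accidents$Severity.
Severity <- c(1L, 1L, 2L, 1L, 2L, NA)

# Severity 2 (injury) becomes TRUE, Severity 1 becomes FALSE, NA stays NA.
Severe <- Severity == 2L

table(Severe, useNA = "ifany")
as.integer(Severe)  # the 0/1 coding described in the text
```

A logical response works directly with glm(family = binomial), which treats FALSE as failure and TRUE as success.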

Logistic regression is first performed on several pre-existing variables (as factors) in the dataset to predict Severe.

summary(glm.accidents.old)
## 
## Call:
## glm(formula = Severe ~ as.factor(Day_Night) + as.factor(Month_) + 
##     as.factor(DayOfWeek) + as.factor(Hour_), family = binomial, 
##     data = Traffic_Accidents)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9961  -0.6711  -0.6132  -0.5397   2.1510  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -14.15535  535.41124  -0.026  0.97891    
## as.factor(Day_Night)D   12.19590  535.41118   0.023  0.98183    
## as.factor(Day_Night)N   12.12639  535.41119   0.023  0.98193    
## as.factor(Month_)2      -0.09171    0.12595  -0.728  0.46651    
## as.factor(Month_)3       0.13690    0.13480   1.016  0.30985    
## as.factor(Month_)4      -0.09147    0.14506  -0.631  0.52832    
## as.factor(Month_)5       0.21653    0.13290   1.629  0.10325    
## as.factor(Month_)6       0.42661    0.12975   3.288  0.00101 ** 
## as.factor(Month_)7       0.29315    0.13190   2.222  0.02625 *  
## as.factor(Month_)8       0.51711    0.12738   4.060 4.91e-05 ***
## as.factor(Month_)9       0.36841    0.13003   2.833  0.00461 ** 
## as.factor(Month_)10      0.24907    0.12762   1.952  0.05097 .  
## as.factor(Month_)11      0.29313    0.12918   2.269  0.02327 *  
## as.factor(Month_)12      0.13095    0.12555   1.043  0.29693    
## as.factor(DayOfWeek).L   0.01043    0.07671   0.136  0.89182    
## as.factor(DayOfWeek).Q   0.12393    0.07799   1.589  0.11203    
## as.factor(DayOfWeek).C   0.05019    0.07268   0.691  0.48986    
## as.factor(DayOfWeek)^4   0.04311    0.07110   0.606  0.54434    
## as.factor(DayOfWeek)^5  -0.12252    0.06847  -1.789  0.07355 .  
## as.factor(DayOfWeek)^6   0.06806    0.06882   0.989  0.32272    
## as.factor(Hour_)100      0.97594    0.31079   3.140  0.00169 ** 
## as.factor(Hour_)200      0.56352    0.30766   1.832  0.06701 .  
## as.factor(Hour_)300      0.48165    0.35594   1.353  0.17600    
## as.factor(Hour_)400      0.01604    0.52688   0.030  0.97571    
## as.factor(Hour_)500      0.42577    0.42688   0.997  0.31856    
## as.factor(Hour_)600      0.71532    0.31731   2.254  0.02418 *  
## as.factor(Hour_)700      0.28218    0.26091   1.082  0.27946    
## as.factor(Hour_)800      0.09315    0.26003   0.358  0.72018    
## as.factor(Hour_)900      0.34207    0.26523   1.290  0.19715    
## as.factor(Hour_)999    -11.75625  177.65596  -0.066  0.94724    
## as.factor(Hour_)1000     0.30579    0.26468   1.155  0.24796    
## as.factor(Hour_)1100     0.05867    0.26261   0.223  0.82321    
## as.factor(Hour_)1200     0.27912    0.25372   1.100  0.27127    
## as.factor(Hour_)1300     0.18190    0.25428   0.715  0.47439    
## as.factor(Hour_)1400     0.29493    0.25451   1.159  0.24654    
## as.factor(Hour_)1500     0.30417    0.25253   1.204  0.22840    
## as.factor(Hour_)1600     0.19888    0.25047   0.794  0.42717    
## as.factor(Hour_)1700     0.29090    0.24032   1.210  0.22610    
## as.factor(Hour_)1800     0.43945    0.24263   1.811  0.07012 .  
## as.factor(Hour_)1900     0.36776    0.24512   1.500  0.13352    
## as.factor(Hour_)2000     0.57716    0.24400   2.365  0.01801 *  
## as.factor(Hour_)2100     0.68814    0.24352   2.826  0.00472 ** 
## as.factor(Hour_)2200     0.58004    0.26719   2.171  0.02994 *  
## as.factor(Hour_)2300     0.42979    0.28466   1.510  0.13108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8954.0  on 9294  degrees of freedom
## Residual deviance: 8854.8  on 9251  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 8942.8
## 
## Number of Fisher Scoring iterations: 12

Logistic regression is next performed on several new variables/bins (as factors) in the dataset to predict Severe.
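
The code that built the new bins is not shown, so the following is a sketch under stated assumptions rather than the original method. The season and rush-hour groupings were chosen because they are consistent with the printed output: summing the Month_ and Hour_ group means in the LDA output for the original variables reproduces the Season and Rush_hour group means in the LDA output for the new variables. The weekend definition (Sat and Sun) and the level name "Other" are assumptions, and the rows are invented.

```r
# Invented rows standing in for Traffic_Accidents; column names follow str().
acc <- data.frame(
  Month_    = c(1L, 4L, 7L, 10L, 12L),
  DayOfWeek = c("Tue", "Sat", "Fri", "Sun", "Mon"),
  Hour_     = c(800L, 1400L, 1700L, 1200L, 2300L),
  stringsAsFactors = FALSE
)

# Season: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov, with Winter as reference level.
season_of_month <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Autumn", "Autumn", "Autumn", "Winter")
acc$Season <- factor(season_of_month[acc$Month_],
                     levels = c("Winter", "Spring", "Summer", "Autumn"))

# Daytype: assumes Saturday and Sunday are labelled "Sat" and "Sun".
acc$Daytype <- factor(ifelse(acc$DayOfWeek %in% c("Sat", "Sun"),
                             "Weekend", "Weekday"),
                      levels = c("Weekday", "Weekend"))

# Rush_hour: morning, midday and evening peaks; "Other" is a hypothetical label.
rush_hours <- c(700, 800, 900, 1200, 1300, 1600, 1700, 1800)
acc$Rush_hour <- factor(ifelse(acc$Hour_ %in% rush_hours, "Rush Hour", "Other"),
                        levels = c("Other", "Rush Hour"))

acc
```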

summary(glm.accidents.new)
## 
## Call:
## glm(formula = Severe ~ Traffic_Accidents$Rush_hour + Traffic_Accidents$Daytype + 
##     Traffic_Accidents$Season, family = binomial, data = Traffic_Accidents)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7589  -0.6781  -0.6220  -0.5676   1.9520  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -1.64740    0.06086 -27.067  < 2e-16 ***
## Traffic_Accidents$Rush_hourRush Hour -0.09661    0.05372  -1.798   0.0721 .  
## Traffic_Accidents$DaytypeWeekend      0.13467    0.06395   2.106   0.0352 *  
## Traffic_Accidents$SeasonSpring        0.10273    0.07753   1.325   0.1852    
## Traffic_Accidents$SeasonSummer        0.41534    0.07271   5.712 1.12e-08 ***
## Traffic_Accidents$SeasonAutumn        0.29437    0.07372   3.993 6.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8954.0  on 9294  degrees of freedom
## Residual deviance: 8906.8  on 9289  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 8918.8
## 
## Number of Fisher Scoring iterations: 4

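With far fewer bins, the “new” model is easier to read. It also has the lower AIC (8918.8 versus 8942.8), although its residual deviance is higher (8906.8 versus 8854.8). Its coefficients indicate that summer and autumn are the worst seasons for serious traffic accidents.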
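
To make the comparison concrete, the figures printed in the two summaries above can be reworked by hand. The sketch below uses only numbers copied from that output; the variable names are illustrative. It converts the new model's coefficients to odds ratios, rebuilds both AIC values from the residual deviances, and tests each model against the intercept-only model.

```r
# Estimates copied from the summary(glm.accidents.new) output above.
est <- c(RushHour = -0.09661, Weekend = 0.13467,
         Spring = 0.10273, Summer = 0.41534, Autumn = 0.29437)

# exp() turns a log-odds coefficient into an odds ratio versus the reference level.
odds_ratio <- exp(est)
round(odds_ratio, 2)

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients).
aic_old <- 8854.8 + 2 * 44  # 43 dummy terms + intercept
aic_new <- 8906.8 + 2 * 6   # 5 dummy terms + intercept
c(old = aic_old, new = aic_new)

# Likelihood-ratio test of each model against the intercept-only model.
p_old <- pchisq(8954.0 - 8854.8, df = 9294 - 9251, lower.tail = FALSE)
p_new <- pchisq(8954.0 - 8906.8, df = 9294 - 9289, lower.tail = FALSE)
c(old = p_old, new = p_new)
```

The old model fits slightly better in raw deviance but spends 38 more parameters to do so, which is why AIC favours the binned model.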

Linear Discriminant Analysis

First, linear discriminant analysis is run on the newly created factors.

lda.accidents.new
## Call:
## lda(Severe ~ Traffic_Accidents$Rush_hour + Traffic_Accidents$Daytype + 
##     Traffic_Accidents$Season, data = Traffic_Accidents)
## 
## Prior probabilities of groups:
##     FALSE      TRUE 
## 0.8131253 0.1868747 
## 
## Group means:
##       Traffic_Accidents$Rush_hourRush Hour Traffic_Accidents$DaytypeWeekend
## FALSE                            0.5366499                        0.2093146
## TRUE                             0.5077720                        0.2337363
##       Traffic_Accidents$SeasonSpring Traffic_Accidents$SeasonSummer
## FALSE                      0.2247949                      0.2229426
## TRUE                       0.2049511                      0.2769142
##       Traffic_Accidents$SeasonAutumn
## FALSE                      0.2318074
## TRUE                       0.2556131
## 
## Coefficients of linear discriminants:
##                                             LD1
## Traffic_Accidents$Rush_hourRush Hour -0.5265643
## Traffic_Accidents$DaytypeWeekend      0.7520598
## Traffic_Accidents$SeasonSpring        0.5085284
## Traffic_Accidents$SeasonSummer        2.2828692
## Traffic_Accidents$SeasonAutumn        1.5573754
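
Because every predictor here is a 0/1 indicator column, each group mean in the output above is simply the share of accidents in that class that fall into the category. A minimal sketch, using only the printed group means (the shortened column names are illustrative), compares those shares between severe and non-severe accidents.

```r
# Group means copied from the lda.accidents.new output above.
group_means <- matrix(
  c(0.5366499, 0.2093146, 0.2247949, 0.2229426, 0.2318074,
    0.5077720, 0.2337363, 0.2049511, 0.2769142, 0.2556131),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("FALSE", "TRUE"),
                  c("RushHour", "Weekend", "Spring", "Summer", "Autumn"))
)

# Difference in category share between severe (TRUE) and non-severe (FALSE).
share_diff <- group_means["TRUE", ] - group_means["FALSE", ]
round(sort(share_diff, decreasing = TRUE), 4)

# The omitted season (inferred to be Winter) takes whatever share is left over.
winter_share <- 1 - rowSums(group_means[, c("Spring", "Summer", "Autumn")])
round(winter_share, 4)
```

Summer shows the largest positive gap and rush hour the largest negative one, which matches the signs of the LD1 coefficients.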

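The same analysis is then run on the original variables (as factors) for comparison. The collinearity warning is consistent with Month_ appearing with all twelve levels in the output below: twelve indicator columns for a single factor always sum to one.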

## Warning in lda.default(x, grouping, ...): variables are collinear
lda.accidents.old
## Call:
## lda(Severe ~ as.factor(Day_Night) + +as.factor(Month_) + as.factor(DayOfWeek) + 
##     as.factor(Hour_), data = Traffic_Accidents)
## 
## Prior probabilities of groups:
##     FALSE      TRUE 
## 0.8131253 0.1868747 
## 
## Group means:
##       as.factor(Day_Night)D as.factor(Day_Night)N as.factor(Month_)1
## FALSE             0.7787775             0.2210902          0.1080974
## TRUE              0.7587795             0.2412205          0.0875072
##       as.factor(Month_)2 as.factor(Month_)3 as.factor(Month_)4
## FALSE         0.11537444         0.07581371         0.07224133
## TRUE          0.08520438         0.07081174         0.05469200
##       as.factor(Month_)5 as.factor(Month_)6 as.factor(Month_)7
## FALSE         0.07673988         0.07462292         0.07607833
## TRUE          0.07944732         0.09326425         0.08462867
##       as.factor(Month_)8 as.factor(Month_)9 as.factor(Month_)10
## FALSE         0.07224133         0.07224133          0.08335539
## TRUE          0.09902130         0.08520438          0.08750720
##       as.factor(Month_)11 as.factor(Month_)12 as.factor(DayOfWeek).L
## FALSE          0.07621064          0.09698333             0.04178213
## TRUE           0.08290155          0.08981002             0.04362802
##       as.factor(DayOfWeek).Q as.factor(DayOfWeek).C as.factor(DayOfWeek)^4
## FALSE            -0.05692201           -0.012531570            -0.02944804
## TRUE             -0.03837972           -0.003055399            -0.02194325
##       as.factor(DayOfWeek)^5 as.factor(DayOfWeek)^6 as.factor(Hour_)100
## FALSE           -0.006510735          -0.0008313626         0.007806298
## TRUE            -0.025314280           0.0077082999         0.014968336
##       as.factor(Hour_)200 as.factor(Hour_)300 as.factor(Hour_)400
## FALSE          0.01098174         0.007144747         0.003704684
## TRUE           0.01381693         0.008635579         0.002878526
##       as.factor(Hour_)500 as.factor(Hour_)600 as.factor(Hour_)700
## FALSE         0.004763165          0.00939402          0.04194231
## TRUE          0.005181347          0.01324122          0.03972366
##       as.factor(Hour_)800 as.factor(Hour_)900 as.factor(Hour_)999
## FALSE          0.06972744          0.04538238         0.001190791
## TRUE           0.05469200          0.04663212         0.000000000
##       as.factor(Hour_)1000 as.factor(Hour_)1100 as.factor(Hour_)1200
## FALSE           0.04591162           0.06059804           0.07396137
## TRUE            0.04663212           0.04835924           0.07253886
##       as.factor(Hour_)1300 as.factor(Hour_)1400 as.factor(Hour_)1500
## FALSE           0.07568140           0.06919820           0.07488754
## TRUE            0.06850892           0.06966033           0.07714450
##       as.factor(Hour_)1600 as.factor(Hour_)1700 as.factor(Hour_)1800
## FALSE           0.09208785           0.08732469           0.05054247
## TRUE            0.08462867           0.08578008           0.05526770
##       as.factor(Hour_)1900 as.factor(Hour_)2000 as.factor(Hour_)2100
## FALSE           0.04458852           0.03466526           0.02950516
## TRUE            0.04663212           0.04490501           0.04145078
##       as.factor(Hour_)2200 as.factor(Hour_)2300
## FALSE           0.01918497           0.01640646
## TRUE            0.02417962           0.01784686
## 
## Coefficients of linear discriminants:
##                                 LD1
## as.factor(Day_Night)D   5.076239012
## as.factor(Day_Night)N   4.748798946
## as.factor(Month_)1     -0.711823501
## as.factor(Month_)2     -1.009204791
## as.factor(Month_)3     -0.249301157
## as.factor(Month_)4     -1.028552552
## as.factor(Month_)5      0.045832773
## as.factor(Month_)6      0.901518854
## as.factor(Month_)7      0.349613906
## as.factor(Month_)8      1.306013886
## as.factor(Month_)9      0.653329875
## as.factor(Month_)10     0.174898092
## as.factor(Month_)11     0.358082719
## as.factor(Month_)12    -0.256345864
## as.factor(DayOfWeek).L  0.040757187
## as.factor(DayOfWeek).Q  0.470760243
## as.factor(DayOfWeek).C  0.193779685
## as.factor(DayOfWeek)^4  0.167996382
## as.factor(DayOfWeek)^5 -0.455494215
## as.factor(DayOfWeek)^6  0.260772894
## as.factor(Hour_)100     4.061145865
## as.factor(Hour_)200     2.058752975
## as.factor(Hour_)300     1.733277672
## as.factor(Hour_)400    -0.001283306
## as.factor(Hour_)500     1.474858459
## as.factor(Hour_)600     2.699660648
## as.factor(Hour_)700     0.893825056
## as.factor(Hour_)800     0.221829986
## as.factor(Hour_)900     1.097412862
## as.factor(Hour_)999    -3.594886717
## as.factor(Hour_)1000    0.961164167
## as.factor(Hour_)1100    0.089296042
## as.factor(Hour_)1200    0.860215253
## as.factor(Hour_)1300    0.503051637
## as.factor(Hour_)1400    0.918900078
## as.factor(Hour_)1500    0.953990388
## as.factor(Hour_)1600    0.563300958
## as.factor(Hour_)1700    0.916870818
## as.factor(Hour_)1800    1.506751564
## as.factor(Hour_)1900    1.237257184
## as.factor(Hour_)2000    2.111172235
## as.factor(Hour_)2100    2.609012946
## as.factor(Hour_)2200    2.127358921
## as.factor(Hour_)2300    1.500694957