This exercise will use the Fredericton Traffic Accident open data set: http://data-fredericton.opendata.arcgis.com/datasets/traffic-accidents-accidents-de-la-circulation?selectedAttribute=Severity.
While this dataset already has several categorical variables, some additional variables will be added. The point of this exercise is to determine whether creating new categories or bins from the existing variables makes it easier to predict when severe accidents are more likely to happen in Fredericton, which would also tell drivers when to take extra care.
## 'data.frame': 9301 obs. of 20 variables:
## $ ï..X : num -66.6 -66.6 -66.6 -66.7 -66.7 ...
## $ Y : num 45.9 45.9 46 46 45.9 ...
## $ FID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Severity : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Year_ : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month_ : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day_ : int 25 19 20 11 25 26 27 27 14 27 ...
## $ DayOfWeek : chr "Tue" "Wed" "Wed" "Tue" ...
## $ Hour_ : int 1200 1600 1800 1600 700 2200 600 1200 1900 800 ...
## $ NoVehicles: int 2 2 2 2 2 2 2 2 2 2 ...
## $ NoInjured : int 0 0 0 0 0 0 0 1 0 0 ...
## $ NoKilled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Street : chr "MacKenzie Road" "Waterloo Row" "Ring Road" "Clements Drive" ...
## $ At_Near : chr "Near" "Near" "At" "At" ...
## $ Near : chr "Hilton Road" "Beaverbrook Street" "Maple Street" "Regiment Creek Avenue" ...
## $ Type : chr "Parking Lot" "Side" "Rear Ended" "Rear Ended" ...
## $ Date : chr "2011-01-25 0:00" "2011-01-19 0:00" "2011-01-20 0:00" "2011-01-11 0:00" ...
## $ Signalized: chr " " " " "Yes" " " ...
## $ Day_Night : chr "D" "D" "N" "D" ...
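The first column name in the listing above, `ï..X`, looks like the residue of a UTF-8 byte-order mark at the start of the CSV file rather than a real field name. The sketch below shows one way this might be avoided when reading the file. It writes a tiny stand-in file (the three columns and two rows are made up for illustration, not taken from the dataset) and reads it back with `fileEncoding = "UTF-8-BOM"`.

```r
# Build a small stand-in CSV that starts with a UTF-8 byte-order mark.
# The file contents are illustrative only, not the real dataset.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the BOM bytes
writeBin(charToRaw("X,Y,Severity\n-66.6,45.9,1\n-66.7,46.0,2\n"), con)
close(con)

# Declaring the encoding as "UTF-8-BOM" makes R skip the mark,
# so the first column keeps its intended name.
Traffic_demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM",
                         stringsAsFactors = FALSE)
names(Traffic_demo)
str(Traffic_demo)
```

With the real download, the same `fileEncoding` argument would be passed to `read.csv()` along with the actual file path.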
More people drive on weekdays because of work, so it is not surprising that more accidents occur on weekdays.
Thursday and Friday, for reasons that are not clear from the data, have the most accidents.
There are more total accidents during the day, but this is presumably because commuting to work or school puts more drivers on the road during the day.
It seems intuitive that there would be more total accidents during weekday rush hour in Fredericton than during the other hours of the week.
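The derived variables that appear in the later models (`Rush_hour`, `Daytype`, `Season`) are not defined in the text. The sketch below shows one way such bins might be built from `Month_`, `DayOfWeek`, and `Hour_`. The cut-offs are assumptions: meteorological seasons, Saturday and Sunday as the weekend, and weekday hours 700, 800, 1600, and 1700 as rush hour. The original definitions may well differ. The small data frame `demo` is made up for illustration.

```r
# Illustrative rows only; column names follow the dataset.
demo <- data.frame(
  Month_    = c(1, 4, 7, 10, 12),
  DayOfWeek = c("Mon", "Sat", "Wed", "Sun", "Fri"),
  Hour_     = c(800, 1200, 1700, 900, 2300),
  stringsAsFactors = FALSE
)

# Season: lookup table indexed by month number (assumed meteorological seasons).
# Winter is listed first so that it becomes the reference level.
season_of_month <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Autumn", "Autumn", "Autumn", "Winter")
demo$Season <- factor(season_of_month[demo$Month_],
                      levels = c("Winter", "Spring", "Summer", "Autumn"))

# Daytype: Saturday and Sunday count as weekend (assumption).
demo$Daytype <- factor(
  ifelse(demo$DayOfWeek %in% c("Sat", "Sun"), "Weekend", "Weekday"),
  levels = c("Weekday", "Weekend")
)

# Rush_hour: weekday morning and late-afternoon hours (assumed cut-offs).
is_rush <- demo$Daytype == "Weekday" &
  demo$Hour_ %in% c(700, 800, 1600, 1700)
demo$Rush_hour <- factor(ifelse(is_rush, "Rush Hour", "Other"),
                         levels = c("Other", "Rush Hour"))

demo
```

Setting the factor levels explicitly fixes the reference category, which determines which levels receive their own coefficient in a regression.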
Fredericton currently classifies traffic accidents as Severity 1 or 2. Accidents classified as Severity 2 involve injuries. A new logical variable, Severe, was created from Severity to focus on the more serious accidents (class 2): the default Severity 1 is recoded as 0 (FALSE) and Severity 2 as 1 (TRUE).
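A minimal sketch of this recoding, using a made-up `Severity` vector rather than the real column:

```r
# Illustrative values only; the real column has 9301 entries.
demo <- data.frame(Severity = c(1L, 2L, 1L, 1L, 2L))

# Severe is TRUE exactly when Severity is 2.
demo$Severe <- demo$Severity == 2L

table(demo$Severe)   # counts of FALSE and TRUE
mean(demo$Severe)    # share of severe accidents in this toy sample
```

Applied to the full dataset, the share computed by `mean()` should correspond to the prior probability of `TRUE` reported in the discriminant analysis output further down (about 0.187).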
Logistic regression is first performed on several pre-existing variables (as factors) in the dataset to predict Severe.
summary(glm.accidents.old)
##
## Call:
## glm(formula = Severe ~ as.factor(Day_Night) + as.factor(Month_) +
## as.factor(DayOfWeek) + as.factor(Hour_), family = binomial,
## data = Traffic_Accidents)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9961 -0.6711 -0.6132 -0.5397 2.1510
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.15535 535.41124 -0.026 0.97891
## as.factor(Day_Night)D 12.19590 535.41118 0.023 0.98183
## as.factor(Day_Night)N 12.12639 535.41119 0.023 0.98193
## as.factor(Month_)2 -0.09171 0.12595 -0.728 0.46651
## as.factor(Month_)3 0.13690 0.13480 1.016 0.30985
## as.factor(Month_)4 -0.09147 0.14506 -0.631 0.52832
## as.factor(Month_)5 0.21653 0.13290 1.629 0.10325
## as.factor(Month_)6 0.42661 0.12975 3.288 0.00101 **
## as.factor(Month_)7 0.29315 0.13190 2.222 0.02625 *
## as.factor(Month_)8 0.51711 0.12738 4.060 4.91e-05 ***
## as.factor(Month_)9 0.36841 0.13003 2.833 0.00461 **
## as.factor(Month_)10 0.24907 0.12762 1.952 0.05097 .
## as.factor(Month_)11 0.29313 0.12918 2.269 0.02327 *
## as.factor(Month_)12 0.13095 0.12555 1.043 0.29693
## as.factor(DayOfWeek).L 0.01043 0.07671 0.136 0.89182
## as.factor(DayOfWeek).Q 0.12393 0.07799 1.589 0.11203
## as.factor(DayOfWeek).C 0.05019 0.07268 0.691 0.48986
## as.factor(DayOfWeek)^4 0.04311 0.07110 0.606 0.54434
## as.factor(DayOfWeek)^5 -0.12252 0.06847 -1.789 0.07355 .
## as.factor(DayOfWeek)^6 0.06806 0.06882 0.989 0.32272
## as.factor(Hour_)100 0.97594 0.31079 3.140 0.00169 **
## as.factor(Hour_)200 0.56352 0.30766 1.832 0.06701 .
## as.factor(Hour_)300 0.48165 0.35594 1.353 0.17600
## as.factor(Hour_)400 0.01604 0.52688 0.030 0.97571
## as.factor(Hour_)500 0.42577 0.42688 0.997 0.31856
## as.factor(Hour_)600 0.71532 0.31731 2.254 0.02418 *
## as.factor(Hour_)700 0.28218 0.26091 1.082 0.27946
## as.factor(Hour_)800 0.09315 0.26003 0.358 0.72018
## as.factor(Hour_)900 0.34207 0.26523 1.290 0.19715
## as.factor(Hour_)999 -11.75625 177.65596 -0.066 0.94724
## as.factor(Hour_)1000 0.30579 0.26468 1.155 0.24796
## as.factor(Hour_)1100 0.05867 0.26261 0.223 0.82321
## as.factor(Hour_)1200 0.27912 0.25372 1.100 0.27127
## as.factor(Hour_)1300 0.18190 0.25428 0.715 0.47439
## as.factor(Hour_)1400 0.29493 0.25451 1.159 0.24654
## as.factor(Hour_)1500 0.30417 0.25253 1.204 0.22840
## as.factor(Hour_)1600 0.19888 0.25047 0.794 0.42717
## as.factor(Hour_)1700 0.29090 0.24032 1.210 0.22610
## as.factor(Hour_)1800 0.43945 0.24263 1.811 0.07012 .
## as.factor(Hour_)1900 0.36776 0.24512 1.500 0.13352
## as.factor(Hour_)2000 0.57716 0.24400 2.365 0.01801 *
## as.factor(Hour_)2100 0.68814 0.24352 2.826 0.00472 **
## as.factor(Hour_)2200 0.58004 0.26719 2.171 0.02994 *
## as.factor(Hour_)2300 0.42979 0.28466 1.510 0.13108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8954.0 on 9294 degrees of freedom
## Residual deviance: 8854.8 on 9251 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 8942.8
##
## Number of Fisher Scoring iterations: 12
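Three rows in this summary have very large standard errors: the two `Day_Night` levels (about 535) and `Hour_` 999 (about 178). This pattern usually means a factor level has very few observations and none of them are severe. It seems likely, though the source does not say so, that 999 is a code for an unknown hour and that the reference level of `Day_Night` is a blank string. The sketch below shows how such values might be turned into `NA` before fitting, on a made-up data frame.

```r
# Illustrative rows only; 999 and " " are treated as missing-value codes
# (an assumption, not something documented in the source).
demo <- data.frame(
  Hour_     = c(700, 999, 1600, 2200),
  Day_Night = c("D", " ", "D", "N"),
  stringsAsFactors = FALSE
)

# Replace the assumed sentinel values with NA.
demo$Hour_[demo$Hour_ == 999] <- NA
demo$Day_Night[trimws(demo$Day_Night) == ""] <- NA

# Factors built afterwards no longer carry the sparse levels.
demo$Hour_f      <- factor(demo$Hour_)
demo$Day_Night_f <- factor(demo$Day_Night)

levels(demo$Hour_f)
levels(demo$Day_Night_f)
```

Rows with `NA` in a predictor are dropped by `glm()` under its default missing-value handling, so this changes the number of observations used.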
Logistic regression is next performed on several new variables/bins (as factors) in the dataset to predict Severe.
summary(glm.accidents.new)
##
## Call:
## glm(formula = Severe ~ Traffic_Accidents$Rush_hour + Traffic_Accidents$Daytype +
## Traffic_Accidents$Season, family = binomial, data = Traffic_Accidents)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7589 -0.6781 -0.6220 -0.5676 1.9520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.64740 0.06086 -27.067 < 2e-16 ***
## Traffic_Accidents$Rush_hourRush Hour -0.09661 0.05372 -1.798 0.0721 .
## Traffic_Accidents$DaytypeWeekend 0.13467 0.06395 2.106 0.0352 *
## Traffic_Accidents$SeasonSpring 0.10273 0.07753 1.325 0.1852
## Traffic_Accidents$SeasonSummer 0.41534 0.07271 5.712 1.12e-08 ***
## Traffic_Accidents$SeasonAutumn 0.29437 0.07372 3.993 6.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8954.0 on 9294 degrees of freedom
## Residual deviance: 8906.8 on 9289 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 8918.8
##
## Number of Fisher Scoring iterations: 4
With far fewer bins, the model on the “new” variables is easier to read, and it indicates that summer and autumn are the worst seasons for serious traffic accidents. It also has the lower AIC (8918.8 versus 8942.8), despite a slightly higher residual deviance.
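The coefficients of the binned model are on the log-odds scale, so exponentiating them gives odds ratios relative to the reference levels. The sketch below copies the estimates from the summary above into a vector (the short names are chosen here for readability) and does the arithmetic. It is a reading aid, not a refit of the model.

```r
# Estimates copied from summary(glm.accidents.new); names are shortened here.
intercept <- -1.64740
coefs <- c(RushHour = -0.09661, Weekend = 0.13467,
           Spring = 0.10273, Summer = 0.41534, Autumn = 0.29437)

# Odds ratios relative to the reference levels.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)   # Summer is roughly 1.51, Autumn roughly 1.34

# Implied probability that an accident is severe at the reference levels,
# and for the same profile in summer.
p_reference <- plogis(intercept)
p_summer    <- plogis(intercept + coefs[["Summer"]])
round(c(reference = p_reference, summer = p_summer), 3)

# AIC values copied from the two summaries; lower is better.
aic_old <- 8942.8
aic_new <- 8918.8
aic_old - aic_new
```

On these numbers, the odds that an accident is severe are about 50% higher in summer than in the reference season, and the implied probability rises from about 0.16 to about 0.23.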
Linear discriminant analysis (LDA) is applied next, first to the newly created factors.
lda.accidents.new
## Call:
## lda(Severe ~ Traffic_Accidents$Rush_hour + Traffic_Accidents$Daytype +
## Traffic_Accidents$Season, data = Traffic_Accidents)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.8131253 0.1868747
##
## Group means:
## Traffic_Accidents$Rush_hourRush Hour Traffic_Accidents$DaytypeWeekend
## FALSE 0.5366499 0.2093146
## TRUE 0.5077720 0.2337363
## Traffic_Accidents$SeasonSpring Traffic_Accidents$SeasonSummer
## FALSE 0.2247949 0.2229426
## TRUE 0.2049511 0.2769142
## Traffic_Accidents$SeasonAutumn
## FALSE 0.2318074
## TRUE 0.2556131
##
## Coefficients of linear discriminants:
## LD1
## Traffic_Accidents$Rush_hourRush Hour -0.5265643
## Traffic_Accidents$DaytypeWeekend 0.7520598
## Traffic_Accidents$SeasonSpring 0.5085284
## Traffic_Accidents$SeasonSummer 2.2828692
## Traffic_Accidents$SeasonAutumn 1.5573754
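The output above lists priors, group means, and discriminant coefficients, but not how well the discriminant separates the two classes. The sketch below shows how a fit of this kind might be checked with a confusion matrix. It uses synthetic data whose effect sizes are invented for illustration, and it writes the formula with bare column names (`Season`, `Daytype`) rather than `Traffic_Accidents$...`, since `$` references inside a formula tend to get in the way of `predict()` on new data.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- 500

# Synthetic predictors; level names follow the document, values are made up.
demo <- data.frame(
  Season = factor(
    sample(c("Winter", "Spring", "Summer", "Autumn"), n, replace = TRUE),
    levels = c("Winter", "Spring", "Summer", "Autumn")
  ),
  Daytype = factor(
    sample(c("Weekday", "Weekend"), n, replace = TRUE, prob = c(0.8, 0.2)),
    levels = c("Weekday", "Weekend")
  )
)

# Synthetic outcome with invented effect sizes.
p <- plogis(-1.6 + 0.4 * (demo$Season == "Summer") +
              0.15 * (demo$Daytype == "Weekend"))
demo$Severe <- runif(n) < p

# Fit the discriminant and classify the same rows.
fit  <- lda(Severe ~ Season + Daytype, data = demo)
pred <- predict(fit, newdata = demo)

# Rows are the actual classes, columns the predicted classes.
confusion <- table(actual = demo$Severe, predicted = pred$class)
confusion
```

When one class is much rarer than the other, as with the 0.81 and 0.19 priors here, a classifier can score high overall accuracy while rarely predicting the rare class, so the full table is more informative than a single accuracy figure.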
LDA is then run on the original variables (as factors) for comparison.
## Warning in lda.default(x, grouping, ...): variables are collinear
lda.accidents.old
## Call:
## lda(Severe ~ as.factor(Day_Night) + +as.factor(Month_) + as.factor(DayOfWeek) +
## as.factor(Hour_), data = Traffic_Accidents)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.8131253 0.1868747
##
## Group means:
## as.factor(Day_Night)D as.factor(Day_Night)N as.factor(Month_)1
## FALSE 0.7787775 0.2210902 0.1080974
## TRUE 0.7587795 0.2412205 0.0875072
## as.factor(Month_)2 as.factor(Month_)3 as.factor(Month_)4
## FALSE 0.11537444 0.07581371 0.07224133
## TRUE 0.08520438 0.07081174 0.05469200
## as.factor(Month_)5 as.factor(Month_)6 as.factor(Month_)7
## FALSE 0.07673988 0.07462292 0.07607833
## TRUE 0.07944732 0.09326425 0.08462867
## as.factor(Month_)8 as.factor(Month_)9 as.factor(Month_)10
## FALSE 0.07224133 0.07224133 0.08335539
## TRUE 0.09902130 0.08520438 0.08750720
## as.factor(Month_)11 as.factor(Month_)12 as.factor(DayOfWeek).L
## FALSE 0.07621064 0.09698333 0.04178213
## TRUE 0.08290155 0.08981002 0.04362802
## as.factor(DayOfWeek).Q as.factor(DayOfWeek).C as.factor(DayOfWeek)^4
## FALSE -0.05692201 -0.012531570 -0.02944804
## TRUE -0.03837972 -0.003055399 -0.02194325
## as.factor(DayOfWeek)^5 as.factor(DayOfWeek)^6 as.factor(Hour_)100
## FALSE -0.006510735 -0.0008313626 0.007806298
## TRUE -0.025314280 0.0077082999 0.014968336
## as.factor(Hour_)200 as.factor(Hour_)300 as.factor(Hour_)400
## FALSE 0.01098174 0.007144747 0.003704684
## TRUE 0.01381693 0.008635579 0.002878526
## as.factor(Hour_)500 as.factor(Hour_)600 as.factor(Hour_)700
## FALSE 0.004763165 0.00939402 0.04194231
## TRUE 0.005181347 0.01324122 0.03972366
## as.factor(Hour_)800 as.factor(Hour_)900 as.factor(Hour_)999
## FALSE 0.06972744 0.04538238 0.001190791
## TRUE 0.05469200 0.04663212 0.000000000
## as.factor(Hour_)1000 as.factor(Hour_)1100 as.factor(Hour_)1200
## FALSE 0.04591162 0.06059804 0.07396137
## TRUE 0.04663212 0.04835924 0.07253886
## as.factor(Hour_)1300 as.factor(Hour_)1400 as.factor(Hour_)1500
## FALSE 0.07568140 0.06919820 0.07488754
## TRUE 0.06850892 0.06966033 0.07714450
## as.factor(Hour_)1600 as.factor(Hour_)1700 as.factor(Hour_)1800
## FALSE 0.09208785 0.08732469 0.05054247
## TRUE 0.08462867 0.08578008 0.05526770
## as.factor(Hour_)1900 as.factor(Hour_)2000 as.factor(Hour_)2100
## FALSE 0.04458852 0.03466526 0.02950516
## TRUE 0.04663212 0.04490501 0.04145078
## as.factor(Hour_)2200 as.factor(Hour_)2300
## FALSE 0.01918497 0.01640646
## TRUE 0.02417962 0.01784686
##
## Coefficients of linear discriminants:
## LD1
## as.factor(Day_Night)D 5.076239012
## as.factor(Day_Night)N 4.748798946
## as.factor(Month_)1 -0.711823501
## as.factor(Month_)2 -1.009204791
## as.factor(Month_)3 -0.249301157
## as.factor(Month_)4 -1.028552552
## as.factor(Month_)5 0.045832773
## as.factor(Month_)6 0.901518854
## as.factor(Month_)7 0.349613906
## as.factor(Month_)8 1.306013886
## as.factor(Month_)9 0.653329875
## as.factor(Month_)10 0.174898092
## as.factor(Month_)11 0.358082719
## as.factor(Month_)12 -0.256345864
## as.factor(DayOfWeek).L 0.040757187
## as.factor(DayOfWeek).Q 0.470760243
## as.factor(DayOfWeek).C 0.193779685
## as.factor(DayOfWeek)^4 0.167996382
## as.factor(DayOfWeek)^5 -0.455494215
## as.factor(DayOfWeek)^6 0.260772894
## as.factor(Hour_)100 4.061145865
## as.factor(Hour_)200 2.058752975
## as.factor(Hour_)300 1.733277672
## as.factor(Hour_)400 -0.001283306
## as.factor(Hour_)500 1.474858459
## as.factor(Hour_)600 2.699660648
## as.factor(Hour_)700 0.893825056
## as.factor(Hour_)800 0.221829986
## as.factor(Hour_)900 1.097412862
## as.factor(Hour_)999 -3.594886717
## as.factor(Hour_)1000 0.961164167
## as.factor(Hour_)1100 0.089296042
## as.factor(Hour_)1200 0.860215253
## as.factor(Hour_)1300 0.503051637
## as.factor(Hour_)1400 0.918900078
## as.factor(Hour_)1500 0.953990388
## as.factor(Hour_)1600 0.563300958
## as.factor(Hour_)1700 0.916870818
## as.factor(Hour_)1800 1.506751564
## as.factor(Hour_)1900 1.237257184
## as.factor(Hour_)2000 2.111172235
## as.factor(Hour_)2100 2.609012946
## as.factor(Hour_)2200 2.127358921
## as.factor(Hour_)2300 1.500694957
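The "variables are collinear" warning is consistent with the output above. All twelve `Month_` indicators are listed, and within each group their means sum to 1, so every row has exactly one of them set. A set of indicator columns that always sums to one is linearly dependent once the columns are centred. The sketch below illustrates that on a three-level toy factor and is not a diagnosis of the original fit.

```r
# Toy factor with three levels standing in for the twelve months.
Month_ <- factor(c(1, 2, 3, 1, 2, 3, 1, 3))

# Full indicator coding: one column per level, no intercept.
full <- model.matrix(~ Month_ - 1)
all(rowSums(full) == 1)   # every row has exactly one indicator set

# After centring, the columns sum to zero in every row,
# so one of them is redundant.
centred <- scale(full, center = TRUE, scale = FALSE)
qr(centred)$rank          # one less than the number of columns
ncol(full)
```

Dropping one level per factor, which is what treatment contrasts do in `glm()`, removes this particular dependency.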