Regression Analysis on Likelihood of Road Fatalities due to Single or Multiple crashes in Australia.
In Australia, numerous road fatalities are recorded every year. However, there has been significant decrease in the road fatalities over the past few years. Introduction of strict road safety rules and regulations, improvement in vehicle standards etc. could be some of the important factors for the reduction in road crashes. Inspite of these regulations in place, number of road fatalities in Australia is high compared to the global scale. This analysis would examine the relationship between different variables for predicting a road fatality. A thorough study of these factors would help in the future to reduce the number of crashes in particular state by planning road safety campaigns. There are numerous factors that are necessary for this study, viz. Involvement of Bus, Heavy Rigid or Articulated Truck, Week of Day, Time of the Day, who was the road user, Age_group of Road user, Gender of the Road User, Speed limit of the road where the crash took place, State in which crash took place and whether it was Christmas or Easter period.
To study the likelihood of a road fatality due to Single and Multiple crash type.
To study the important factors related to the likelihood of road fatality.
The predictor variable is Crash Type. It describes the involvement of number of vehicles in fatal accident. Single crash occurs when only one vehicle is involved in a crash. Multiple crash is when two or more vehicles are involved in a crash.
State - Australian state where the crash took place. It is one of the important variable which gives us information in which state maximum number of crashes take place.
Dayweek variable gives us information on which day the crash took place.
Bus Involvement variables gives information whether a bus was involved in a fatal crash.
Heavy Rigid Truck Involvement variables gives information whether a Heavy truck was involved in a fatal crash.
Articulated Truck Involvement variables gives information whether an Articulated Truck was involved in a fatal crash. An articulated truck is a truck with a permanent or semi-perm pivot joint which allows the vehicle to turn sharply.
Speed Limit variables gives information of the speed zone of the road where the crash took place.
Road User variables gives information which road user (Driver, Pedestrian, Passenger, Bicyclist etc.) involved in a fatal crash.
Gender variables gives information about whether the person was male or female who was involved in a fatal crash.
Christmas Period variables gives information whether there was christmas period at the time of a fatal crash.
Easter Period variables gives information whether there was Easter period at the time of a fatal crash.
Age Group variables gives information which age group the person belonged who was involved in a fatal crash.
Day of Week variables gives information whether it was Weekday or Weekend at the time of a fatal crash.
Time of Day variables gives information whether it was Night time or Day time when a fatal crash took place.
library(readr)
ardd <- read.csv("R/ardd_april21.csv")
head(ardd)
str(ardd)
'data.frame': 50924 obs. of 23 variables:
$ Crash.ID : int 20213015 20211009 20211054 20213068 20212022 20212076 20214022 20214002 20215031 20211072 ...
$ State : Factor w/ 8 levels "ACT","NSW","NT",..: 4 2 2 4 7 7 5 5 8 2 ...
$ Month : int 4 4 4 4 4 4 4 4 4 4 ...
$ Year : int 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
$ Dayweek : Factor w/ 7 levels "Friday","Monday",..: 4 5 3 1 3 5 3 5 3 5 ...
$ Time : Factor w/ 1411 levels "","0:00:00","0:01:00",..: 953 838 1088 833 118 299 745 245 823 843 ...
$ Crash.Type : Factor w/ 2 levels "Multiple","Single": 2 2 2 2 2 1 2 1 2 1 ...
$ Bus.Involvement : Factor w/ 3 levels "-9","No","Yes": 2 2 2 2 1 1 2 2 2 2 ...
$ Heavy.Rigid.Truck.Involvement: Factor w/ 3 levels "-9","No","Yes": 2 3 2 2 1 1 2 2 2 2 ...
$ Articulated.Truck.Involvement: Factor w/ 3 levels "-9","No","Yes": 2 2 2 2 1 1 2 2 2 2 ...
$ Speed.Limit : int 70 110 80 100 60 100 90 100 110 50 ...
$ Road.User : Factor w/ 8 levels "BICYCLIST","Driver",..: 2 8 6 6 2 4 2 6 6 2 ...
$ Gender : Factor w/ 4 levels "-9","Female",..: 3 2 3 2 3 3 3 2 3 3 ...
$ Age : int 21 21 5 23 19 65 24 62 16 23 ...
$ National.Remoteness.Areas : Factor w/ 6 levels "","Inner Regional Australia",..: 2 3 2 2 3 4 3 2 1 2 ...
$ SA4.Name.2016 : Factor w/ 89 levels "","Adelaide - Central and Hills",..: 22 51 66 22 48 32 3 64 1 45 ...
$ National.LGA.Name.2017 : Factor w/ 498 levels "","Adelaide (C)",..: 399 245 390 472 306 29 165 440 1 289 ...
$ National.Road.Type : Factor w/ 14 levels "","Access road",..: 10 10 10 7 4 7 13 13 1 4 ...
$ Christmas.Period : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Easter.Period : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Age.Group : Factor w/ 6 levels "0_to_16","17_to_25",..: 2 2 1 2 2 5 2 4 1 2 ...
$ Day.of.week : Factor w/ 2 levels "Weekday","Weekend": 2 1 2 2 2 1 2 1 2 1 ...
$ Time.of.day : Factor w/ 2 levels "Day","Night": 2 2 2 2 2 1 2 1 2 2 ...
ardd <- ardd[-c(1,3:4,6,14:18)] ## Dropping Unwanted columns
head(ardd)
a <- as.data.frame(table(ardd$State)) ### Checking for NA like values in State variable
a
b <- as.data.frame(table(ardd$Dayweek)) ### Checking for NA like values in Dayweek variable
b
c <- as.data.frame(table(ardd$Crash.Type)) ### Checking for NA like values in Crash type variable
c
as.data.frame(table(ardd$Bus.Involvement)) ### Checking for NA like values in Bus Involvement variable
is.character(ardd$Bus.Involvement)
[1] FALSE
ardd$Bus.Involvement <- as.character(ardd$Bus.Involvement)
is.character(ardd$Bus.Involvement)
[1] TRUE
ardd <- ardd[ardd$Bus.Involvement != '-9', ] ### Removing -9 values from Bus involvement
d <- as.data.frame(table(ardd$Bus.Involvement))
d
as.data.frame(table(ardd$Heavy.Rigid.Truck.Involvement)) ### Checking for NA like values in Heavy Rigid Truck Involvement variable
is.character(ardd$Heavy.Rigid.Truck.Involvement)
[1] FALSE
ardd$Heavy.Rigid.Truck.Involvement <- as.character(ardd$Heavy.Rigid.Truck.Involvement)
is.character(ardd$Heavy.Rigid.Truck.Involvement)
[1] TRUE
ardd <- ardd[ardd$Heavy.Rigid.Truck.Involvement != '-9', ] ### Removing -9 values from Heavy Rigid Truck involvement
e <- as.data.frame(table(ardd$Heavy.Rigid.Truck.Involvement))
e
as.data.frame(table(ardd$Articulated.Truck.Involvement)) ### Checking for NA like values in Articulated Truck Involvement variable
is.character(ardd$Articulated.Truck.Involvement)
[1] FALSE
ardd$Articulated.Truck.Involvement <- as.character(ardd$Articulated.Truck.Involvement)
is.character(ardd$Articulated.Truck.Involvement)
[1] TRUE
ardd <- ardd[ardd$Articulated.Truck.Involvement != '-9', ] ### Removing -9 values from Articulated Truck involvement
f <- as.data.frame(table(ardd$Articulated.Truck.Involvement))
f
g <- as.data.frame(table(ardd$Speed.Limit)) ### Checking for NA like values in Speed Limit variable
g
h <- as.data.frame(table(ardd$Road.User)) ### Checking for NA like values in Road User variable
h
as.data.frame(table(ardd$Gender)) ### Checking for NA like values in Gender variable
is.character(ardd$Gender)
[1] FALSE
ardd$Gender <- as.character(ardd$Gender)
is.character(ardd$Gender)
[1] TRUE
ardd <- ardd[ardd$Gender != '-9', ] ### Removing -9 values from Gender
as.data.frame(table(ardd$Gender))
ardd <- ardd[ardd$Gender != 'Unspecified', ] ### Removing Unspecified values from Gender
i <- as.data.frame(table(ardd$Gender))
i
j <- as.data.frame(table(ardd$Christmas.Period)) ### Checking for NA like values in Christmas Period variable
j
k <- as.data.frame(table(ardd$Easter.Period)) ### Checking for NA like values in Easter Period variable
k
l <- as.data.frame(table(ardd$Age.Group)) ### Checking for NA like values in Age Group variable
l
m <- as.data.frame(table(ardd$Day.of.week)) ### Checking for NA like values in DayofWeek variable
m
n <- as.data.frame(table(ardd$Time.of.day)) ### Checking for NA like values in TimeofDay variable
n
sum(is.na(ardd))
[1] 0
str(ardd)
'data.frame': 31291 obs. of 14 variables:
$ State : Factor w/ 8 levels "ACT","NSW","NT",..: 4 2 2 4 5 5 8 2 8 2 ...
$ Dayweek : Factor w/ 7 levels "Friday","Monday",..: 4 5 3 1 3 5 3 5 3 5 ...
$ Crash.Type : Factor w/ 2 levels "Multiple","Single": 2 2 2 2 2 1 2 1 2 1 ...
$ Bus.Involvement : chr "No" "No" "No" "No" ...
$ Heavy.Rigid.Truck.Involvement: chr "No" "Yes" "No" "No" ...
$ Articulated.Truck.Involvement: chr "No" "No" "No" "No" ...
$ Speed.Limit : int 70 110 80 100 90 100 110 50 110 100 ...
$ Road.User : Factor w/ 8 levels "BICYCLIST","Driver",..: 2 8 6 6 2 6 6 2 6 6 ...
$ Gender : chr "Male" "Female" "Male" "Female" ...
$ Christmas.Period : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Easter.Period : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Age.Group : Factor w/ 6 levels "0_to_16","17_to_25",..: 2 2 1 2 2 4 1 2 1 6 ...
$ Day.of.week : Factor w/ 2 levels "Weekday","Weekend": 2 1 2 2 2 1 2 1 2 1 ...
$ Time.of.day : Factor w/ 2 levels "Day","Night": 2 2 2 2 2 1 2 2 2 1 ...
ardd$State = factor(ardd$State, levels = c("ACT","NSW","NT","Qld","SA","Tas","Vic","WA" ),
labels = c(1,2,3,4,5,6,7,8), ordered = FALSE)
levels(ardd$State) ### One Hot encoding for State var
[1] "1" "2" "3" "4" "5" "6" "7" "8"
ardd$Dayweek = factor(ardd$Dayweek, levels = c("Friday","Monday","Saturday","Sunday","Thursday","Tuesday","Wednesday"),
labels = c(5,1,6,7,4,2,3), ordered = FALSE)
levels(ardd$Dayweek) ### One Hot encoding for DayWeek var
[1] "5" "1" "6" "7" "4" "2" "3"
ardd$Crash.Type = factor(ardd$Crash.Type, levels = c("Multiple","Single"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Crash.Type) ### One Hot encoding for Crash type var
[1] "1" "2"
ardd$Bus.Involvement = factor(ardd$Bus.Involvement, levels = c("No","Yes"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Bus.Involvement) ### One Hot encoding for Bus Involvement var
[1] "1" "2"
ardd$Heavy.Rigid.Truck.Involvement = factor(ardd$Heavy.Rigid.Truck.Involvement, levels = c("No","Yes"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Heavy.Rigid.Truck.Involvement) ### One Hot encoding for Heavy Rigid Truck Involvement var
[1] "1" "2"
ardd$Articulated.Truck.Involvement = factor(ardd$Articulated.Truck.Involvement, levels = c("No","Yes"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Articulated.Truck.Involvement) ### One Hot encoding for Articulated Truck Involvement var
[1] "1" "2"
sum(is.na(ardd$Articulated.Truck.Involvement))
[1] 0
sum(is.na(ardd$Road.User))
[1] 0
levels(ardd$Road.User)
[1] "BICYCLIST" "Driver" "Motorcycle pillion passenger" "Motorcycle rider"
[5] "Other" "Passenger" "Pedal cyclist" "Pedestrian"
ardd$Road.User = factor(ardd$Road.User, levels = c("BICYCLIST","Driver","Motorcycle pillion passenger","Motorcycle rider",
"Other","Passenger","Pedal cyclist","Pedestrian"),
labels = c(1,2,3,4,5,6,7,8), ordered = FALSE)
levels(ardd$Road.User) ### One Hot encoding for Road User var
[1] "1" "2" "3" "4" "5" "6" "7" "8"
sum(is.na(ardd$Road.User))
[1] 0
ardd$Gender = factor(ardd$Gender, levels = c("Male","Female"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Gender) ### One Hot encoding for Gender var
[1] "1" "2"
sum(is.na(ardd$Gender))
[1] 0
is.factor(ardd$Gender)
[1] TRUE
ardd$Christmas.Period = factor(ardd$Christmas.Period, levels = c("No","Yes"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Christmas.Period) ### One Hot encoding for Christmas period var
[1] "1" "2"
sum(is.na(ardd$Christmas.Period))
[1] 0
ardd$Easter.Period = factor(ardd$Easter.Period, levels = c("No","Yes"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Easter.Period) ### One Hot encoding for Easter Period var
[1] "1" "2"
sum(is.na(ardd$Easter.Period))
[1] 0
#levels(ardd$Age.Group)
ardd$Age.Group = factor(ardd$Age.Group, levels = c("0_to_16","17_to_25","26_to_39","40_to_64","65_to_74","75_or_older"),
labels = c(1,2,3,4,5,6), ordered = FALSE)
levels(ardd$Age.Group) ### One Hot encoding for Age group var
[1] "1" "2" "3" "4" "5" "6"
sum(is.na(ardd$Age.Group))
[1] 0
#levels(ardd$Day.of.week)
ardd$Day.of.week = factor(ardd$Day.of.week, levels = c("Weekday","Weekend"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Day.of.week) ### One Hot encoding for Day of Week var
[1] "1" "2"
sum(is.na(ardd$Day.of.week))
[1] 0
ardd$Time.of.day = factor(ardd$Time.of.day, levels = c("Day","Night"),
labels = c(1,2), ordered = FALSE)
levels(ardd$Time.of.day) ### One Hot encoding for Time of Day var
[1] "1" "2"
sum(is.na(ardd$Time.of.day))
[1] 0
str(ardd)
'data.frame': 31291 obs. of 14 variables:
$ State : Factor w/ 8 levels "1","2","3","4",..: 4 2 2 4 5 5 8 2 8 2 ...
$ Dayweek : Factor w/ 7 levels "5","1","6","7",..: 4 5 3 1 3 5 3 5 3 5 ...
$ Crash.Type : Factor w/ 2 levels "1","2": 2 2 2 2 2 1 2 1 2 1 ...
$ Bus.Involvement : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Heavy.Rigid.Truck.Involvement: Factor w/ 2 levels "1","2": 1 2 1 1 1 1 1 1 1 1 ...
$ Articulated.Truck.Involvement: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Speed.Limit : int 70 110 80 100 90 100 110 50 110 100 ...
$ Road.User : Factor w/ 8 levels "1","2","3","4",..: 2 8 6 6 2 6 6 2 6 6 ...
$ Gender : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 1 1 2 ...
$ Christmas.Period : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Easter.Period : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Age.Group : Factor w/ 6 levels "1","2","3","4",..: 2 2 1 2 2 4 1 2 1 6 ...
$ Day.of.week : Factor w/ 2 levels "1","2": 2 1 2 2 2 1 2 1 2 1 ...
$ Time.of.day : Factor w/ 2 levels "1","2": 2 2 2 2 2 1 2 2 2 1 ...
barplot(a[,2], names = a[,1],
main = "Number of Fatalities per state" ,
ylab = "Count of Fatalities", xlab = "States",
cex.names=0.8, ylim = c(0,20000),
col = "red"
)
From the barplots we can see State of New South Wales records highest number of fatal crashes followed by State of Victoria, Queensland, West Australia, South Australia, Tasmania, Northern Territory and Australian Capital Territory.
barplot(b[,2], names = b[,1],
main = "Number of Fatalities on each day of Week" ,
ylab = "Count of Fatalities", xlab = "Days",
cex.names=0.8, ylim = c(0,12500),
col = "red"
)
From the barplots, we see that Saturday records highest number of fatalities followed by remaining days in the week.
barplot(c[,2], names = c[,1],
main = "Number of Fatalities of each Crash Type" ,
ylab = "Count of Fatalities", xlab = "Crash Type",
cex.names=0.8, ylim = c(0,35000),
col = "red"
)
Here, we can see that number of Single crashes is on the rise. Multiple crashes are comparitively lower than single crash type.
barplot(d[,2], names = d[,1],
main = "If a Bus was involved in a Fatal accident" ,
ylab = "Count", xlab = "Bus Involvement",
cex.names=0.8, ylim = c(0,60000),
col = "red"
)
As we saw earlier, single crash type is on the rise and multiple crashes comparitively low. Involvement of bus in a crash is siginificantly lower.
barplot(e[,2], names = e[,1],
main = "If a Heavy Truck was involved in a Fatal accident" ,
ylab = "Count", xlab = "Heavy Rigid Truck Involvement",
cex.names=0.8, ylim = c(0,40000),
col = "red"
)
Involvement of a Heavy Rigid truck in a fatal crash is significantly lower.
barplot(f[,2], names = f[,1],
main = "If an Articulated truck was involved in a Fatal accident" ,
ylab = "Count", xlab = "Articulated Truck Involvement",
cex.names=0.8, ylim = c(0,40000),
col = "red"
)
Also, the involvement of an Articulated truck in a fatal crash is significantly low.
barplot(g[,2], names = g[,1],
main = "Speed Limit of Road where a Fatal accident happended" ,
ylab = "Count of Fatalities", xlab = "Speed Zone",
cex.names=0.8, ylim = c(0,15000),
col = "red"
)
From the above barplot, we can see than maximum fatal crashes take place in the speed zone of 100 followed by 60, 80 and 110.
barplot(h[,2], names = h[,1],
main = "Road User involved in a Fatal accident" ,
ylab = "Count of Fatalities", xlab = "Road User",
cex.names=0.8, ylim = c(0,20000),
col = "red"
)
In a fatal crash, death of the driver of the vehicle is highest followed by passenger travelling in the vehicle.
barplot(i[,2], names = i[,1],
main = "Genderwise Fatal accidents" ,
ylab = "Count of Fatalities", xlab = "Gender",
cex.names=0.8, ylim = c(0,30000),
col = "red"
)
In the number of fatalities reported, fatalities of gender male is on the rise.
barplot(j[,2], names = j[,1],
main = "Number of Fatalities on Christmas Period" ,
ylab = "Count", xlab = "Christmas Period",
cex.names=0.8, ylim = c(0,35000),
col = "red"
)
There was no christmas period when maximum number of fatalities took place.
barplot(k[,2], names = k[,1],
main = "Number of Fatalities on Easter Period" ,
ylab = "Count", xlab = "Easter Period",
cex.names=0.8, ylim = c(0,35000),
col = "red"
)
There was no easter period when maximum number of fatalities took place.
barplot(l[,2], names = l[,1],
main = "Number of Fatalities per Age Group" ,
ylab = "Count of Fatalities", xlab = "Age Group",
cex.names=0.8, ylim = c(0,12500),
col = "red"
)
Fatalities of people in age group of 46 to 64 is maximum followed by age groups 17 to 25 and 26 to 39.
barplot(m[,2], names = m[,1],
main = "Number of Fatalities on Weekday & Weekday" ,
ylab = "Count of Fatalities", xlab = "Weekday/Weekend",
cex.names=0.8, ylim = c(0,25000),
col = "red"
)
Maximum number of fatalities have taken place on a week day.
barplot(n[,2], names = n[,1],
main = "Number of Fatalities on Time of the Day" ,
ylab = "Count of Fatalities", xlab = "Time of Day (Day/Night)",
cex.names=0.8, ylim = c(0,22500),
col = "red"
)
Maximum number of fatalities have taken place during the day.
xtabs(~ Crash.Type + State, data = ardd)
State
Crash.Type 1 2 3 4 5 6 7 8
1 66 3554 196 2428 934 366 5462 1127
2 68 4300 519 3044 1257 298 5805 1867
We can see that, single crash type is recorded maximum times in almost every state of Australia.
xtabs(~ Crash.Type + Dayweek, data = ardd)
Dayweek
Crash.Type 5 1 6 7 4 2 3
1 2317 1768 2226 2016 2020 1829 1957
2 2700 1890 3431 3065 2193 1880 1999
We can see that, single crash type is recorded highest on every day in a week
xtabs(~ Crash.Type + Bus.Involvement, data = ardd)
Bus.Involvement
Crash.Type 1 2
1 13809 324
2 16969 189
We can see that, maximum times there is no bus involved in a crash.
xtabs(~ Crash.Type + Heavy.Rigid.Truck.Involvement, data = ardd)
Heavy.Rigid.Truck.Involvement
Crash.Type 1 2
1 12882 1251
2 16865 293
We can see that, maximum times there is no heavy rigid involved in a crash.
xtabs(~ Crash.Type + Articulated.Truck.Involvement, data = ardd)
Articulated.Truck.Involvement
Crash.Type 1 2
1 11666 2467
2 16495 663
We can see that, maximum times there is no Articulated Truck involved in a crash.
xtabs(~ Crash.Type + Speed.Limit, data = ardd)
Speed.Limit
Crash.Type 5 10 15 20 25 30 40 50 60 70 75 80 90 100 110 130
1 1 3 0 5 0 0 92 873 2970 831 137 2055 396 5520 1224 26
2 2 15 1 20 2 10 172 1808 3776 858 117 2021 315 6011 1943 87
We can see that, speed zone of 100 has recorded highest single and multiple crash types respectively.
xtabs(~ Crash.Type + Road.User, data = ardd)
Road.User
Crash.Type 1 2 3 4 5 6 7 8
1 2 6861 110 2624 13 3138 732 653
2 0 7743 102 1856 39 3397 135 3886
We can see that, whether it is a single or multiple Crash Type, fatality of the driver of the vehicle is the highest.
xtabs(~ Crash.Type + Gender, data = ardd)
Gender
Crash.Type 1 2
1 9763 4370
2 12865 4293
We can see that, number of male fatalities is on the rise compared to female in a single and multiple crashes.
xtabs(~ Crash.Type + Christmas.Period, data = ardd)
Christmas.Period
Crash.Type 1 2
1 13715 418
2 16603 555
xtabs(~ Crash.Type + Easter.Period, data = ardd)
Easter.Period
Crash.Type 1 2
1 14027 106
2 17022 136
We can see that, maximum times, there was no christmas or Easter period when single or multiple fatal crashes took place.
xtabs(~ Crash.Type + Age.Group, data = ardd)
Age.Group
Crash.Type 1 2 3 4 5 6
1 859 2911 3180 4295 1221 1667
2 1157 4675 4047 4467 1142 1670
We can see that, Age group 17 to 25 is involved in a single type fatal crash while Age group 40 to 64 is involved in a multiple crash type.
xtabs(~ Crash.Type + Day.of.week, data = ardd)
Day.of.week
Crash.Type 1 2
1 9169 4964
2 9366 7792
We can see that, maximum times, there was a weekday when single or multiple fatal crash took place.
xtabs(~ Crash.Type + Time.of.day, data = ardd)
Time.of.day
Crash.Type 1 2
1 9824 4309
2 8342 8816
We can see that, in day time, multiple crashes took place more than single crash type. And, in night time, single crash types are the highest.
model1 <- glm(Crash.Type ~ ., family = binomial(), data = ardd)
summary(model1)
Call:
glm(formula = Crash.Type ~ ., family = binomial(), data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7966 -1.0133 0.4079 0.9330 2.4606
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.037503 82.169754 -0.146 0.8835
State2 0.206263 0.193432 1.066 0.2863
State3 0.508028 0.213192 2.383 0.0172 *
State4 0.283314 0.194149 1.459 0.1445
State5 0.278438 0.197928 1.407 0.1595
State6 -0.205874 0.210831 -0.976 0.3288
State7 -0.154802 0.192964 -0.802 0.4224
State8 0.466257 0.196391 2.374 0.0176 *
Dayweek1 0.049161 0.050987 0.964 0.3349
Dayweek6 0.296277 0.062249 4.760 1.94e-06 ***
Dayweek7 0.357263 0.063357 5.639 1.71e-08 ***
Dayweek4 -0.035785 0.051613 -0.693 0.4881
Dayweek2 -0.017354 0.053042 -0.327 0.7435
Dayweek3 -0.069637 0.052233 -1.333 0.1825
Bus.Involvement2 -1.190690 0.109123 -10.911 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.941575 0.073603 -26.379 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988102 0.050974 -39.003 < 2e-16 ***
Speed.Limit 0.006626 0.000696 9.520 < 2e-16 ***
Road.User2 11.669541 82.169487 0.142 0.8871
Road.User3 11.335902 82.169616 0.138 0.8903
Road.User4 10.897915 82.169491 0.133 0.8945
Road.User5 12.603498 82.170212 0.153 0.8781
Road.User6 11.608560 82.169490 0.141 0.8877
Road.User7 9.860954 82.169542 0.120 0.9045
Road.User8 13.754009 82.169500 0.167 0.8671
Gender2 -0.488235 0.030770 -15.867 < 2e-16 ***
Christmas.Period2 0.053937 0.073513 0.734 0.4631
Easter.Period2 -0.151745 0.144993 -1.047 0.2953
Age.Group2 0.153878 0.061035 2.521 0.0117 *
Age.Group3 0.014051 0.062192 0.226 0.8213
Age.Group4 -0.068771 0.061082 -1.126 0.2602
Age.Group5 -0.304848 0.072771 -4.189 2.80e-05 ***
Age.Group6 -0.438126 0.068654 -6.382 1.75e-10 ***
Day.of.week2 -0.148728 0.063943 -2.326 0.0200 *
Time.of.day2 0.642789 0.030092 21.361 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35355 on 31256 degrees of freedom
AIC: 35425
Number of Fisher Scoring iterations: 9
The equation becomes:
Y_hat = -12.037503 + 0.206263State2 + 0.508028State3 + 0.283314State4 + 0.278438State5 - 0.205874State6 - 0.154802State7 + 0.466257State8 + 0.049161Dayweek1 + 0.296277Dayweek6 + 0.357263Dayweek7 - 0.035785Dayweek4 - 0.017354Dayweek2 - 0.069637Dayweek3 - 1.190690Bus.Involvement2 - 1.941575Heavy.Rigid.Truck.Involvement2 - 1.988102Articulated.Truck.Involvement2 + 0.006626Speed.Limit + 11.669541Road.User2 + 11.335902Road.User3 + 10.897915Road.User4 + 12.603498Road.User5 + 11.608560Road.User6 + 9.860954Road.User7 + 13.754009Road.User8 - 0.488235Gender2 + 0.053937Christmas.Period2 - 0.151745Easter.Period2 + 0.153878Age.Group2 + 0.014051Age.Group3 - 0.068771Age.Group4 - 0.438126Age.Group5 - 0.438126Age.Group6 - 0.148728Day.of.week2 + 0.642789Time.of.day2 + e
A regression model with all the regressors shows that, variables road user, Christmas Period and Easter Period are insignificant as their p_value is > 0.05.
All the other variables are statistically significant and can have significant effect on the likelihood of a fatal accident.
The Null and Residual deviance are 43086 and 35355 on 31290 and 31256 degrees of freedom respectively.
The AIC of the model is 35425
H0: the model fits the data well H1: the model does not fit the data well
[1] 35354.69
[1] 4.745029e-56
The chi-sqr test statistic is 4.745029e-56 on 35354.69 degrees of freedom.
model1_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model1_null,model1,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Christmas.Period + Easter.Period + Age.Group + Day.of.week +
Time.of.day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31256 35355 34 7731 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anova(model1)
Analysis of Deviance Table (Type II tests)
Response: Crash.Type
LR Chisq Df Pr(>Chisq)
State 314.2 7 < 2.2e-16 ***
Dayweek 35.7 6 3.096e-06 ***
Bus.Involvement 125.5 1 < 2.2e-16 ***
Heavy.Rigid.Truck.Involvement 884.5 1 < 2.2e-16 ***
Articulated.Truck.Involvement 1877.8 1 < 2.2e-16 ***
Speed.Limit 91.0 1 < 2.2e-16 ***
Road.User 3468.2 7 < 2.2e-16 ***
Gender 252.7 1 < 2.2e-16 ***
Christmas.Period 0.5 1 0.46275
Easter.Period 1.1 1 0.29647
Age.Group 164.7 5 < 2.2e-16 ***
Day.of.week 5.4 1 0.02028 *
Time.of.day 461.8 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis of deviance table gives a p_value of 2.2e-16 i.e. < 0.05
As the p_value is significant, we do not have conclusive evidence to reject H0.
Hence, we conclude that model fits the data well and logistic model is adequate.
In the ANOVA table, we found that variable road user has Pr(>Chisq) less than 0.05.
confint.default(model1, level = 0.95)
2.5 % 97.5 %
(Intercept) -1.730873e+02 149.012254236
State2 -1.728560e-01 0.585382769
State3 9.017965e-02 0.925875598
State4 -9.721042e-02 0.663838274
State5 -1.094932e-01 0.666368379
State6 -6.190945e-01 0.207347136
State7 -5.330044e-01 0.223401188
State8 8.133833e-02 0.851175147
Dayweek1 -5.077085e-02 0.149093245
Dayweek6 1.742714e-01 0.418283031
Dayweek7 2.330859e-01 0.481440261
Dayweek4 -1.369451e-01 0.065375427
Dayweek2 -1.213137e-01 0.086606549
Dayweek3 -1.720123e-01 0.032738356
Bus.Involvement2 -1.404567e+00 -0.976812601
Heavy.Rigid.Truck.Involvement2 -2.085834e+00 -1.797316528
Articulated.Truck.Involvement2 -2.088008e+00 -1.888195146
Speed.Limit 5.261835e-03 0.007990115
Road.User2 -1.493797e+02 172.718776954
Road.User3 -1.497136e+02 172.385389451
Road.User4 -1.501513e+02 171.947158384
Road.User5 -1.484472e+02 173.654154325
Road.User6 -1.494407e+02 172.657801463
Road.User7 -1.511884e+02 170.910297531
Road.User8 -1.472953e+02 174.803270087
Gender2 -5.485418e-01 -0.427927180
Christmas.Period2 -9.014683e-02 0.198020008
Easter.Period2 -4.359262e-01 0.132435818
Age.Group2 3.425154e-02 0.273503637
Age.Group3 -1.078428e-01 0.135945892
Age.Group4 -1.884893e-01 0.050947756
Age.Group5 -4.474768e-01 -0.162219397
Age.Group6 -5.726849e-01 -0.303566128
Day.of.week2 -2.740534e-01 -0.023401569
Time.of.day2 5.838108e-01 0.701767989
car::vif(model1)
GVIF Df GVIF^(1/(2*Df))
State 1.066763 7 1.004627
Dayweek 5.835786 6 1.158355
Bus.Involvement 1.023137 1 1.011502
Heavy.Rigid.Truck.Involvement 1.035202 1 1.017449
Articulated.Truck.Involvement 1.069302 1 1.034071
Speed.Limit 1.229618 1 1.108882
Road.User 1.725085 7 1.039717
Gender 1.163051 1 1.078448
Christmas.Period 1.005954 1 1.002973
Easter.Period 1.005037 1 1.002515
Age.Group 1.429913 5 1.036408
Day.of.week 5.976209 1 2.444629
Time.of.day 1.305329 1 1.142510
In the multicollinearity check, if the value is equal to one then we conclude there significant low correlation.
In the multicollinearity check, if the value is between one to five, we conclude a moderate correlation
In the multicollinearity check, if the value is greater than five, we conclude there is strong correlation.
All the variables show low or moderate correlation as their value is between 1 to 5
# Full model should contains all the variables
full = glm(Crash.Type ~ ., family = binomial, data=ardd)
summary(full)
Call:
glm(formula = Crash.Type ~ ., family = binomial, data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7966 -1.0133 0.4079 0.9330 2.4606
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.037503 82.169754 -0.146 0.8835
State2 0.206263 0.193432 1.066 0.2863
State3 0.508028 0.213192 2.383 0.0172 *
State4 0.283314 0.194149 1.459 0.1445
State5 0.278438 0.197928 1.407 0.1595
State6 -0.205874 0.210831 -0.976 0.3288
State7 -0.154802 0.192964 -0.802 0.4224
State8 0.466257 0.196391 2.374 0.0176 *
Dayweek1 0.049161 0.050987 0.964 0.3349
Dayweek6 0.296277 0.062249 4.760 1.94e-06 ***
Dayweek7 0.357263 0.063357 5.639 1.71e-08 ***
Dayweek4 -0.035785 0.051613 -0.693 0.4881
Dayweek2 -0.017354 0.053042 -0.327 0.7435
Dayweek3 -0.069637 0.052233 -1.333 0.1825
Bus.Involvement2 -1.190690 0.109123 -10.911 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.941575 0.073603 -26.379 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988102 0.050974 -39.003 < 2e-16 ***
Speed.Limit 0.006626 0.000696 9.520 < 2e-16 ***
Road.User2 11.669541 82.169487 0.142 0.8871
Road.User3 11.335902 82.169616 0.138 0.8903
Road.User4 10.897915 82.169491 0.133 0.8945
Road.User5 12.603498 82.170212 0.153 0.8781
Road.User6 11.608560 82.169490 0.141 0.8877
Road.User7 9.860954 82.169542 0.120 0.9045
Road.User8 13.754009 82.169500 0.167 0.8671
Gender2 -0.488235 0.030770 -15.867 < 2e-16 ***
Christmas.Period2 0.053937 0.073513 0.734 0.4631
Easter.Period2 -0.151745 0.144993 -1.047 0.2953
Age.Group2 0.153878 0.061035 2.521 0.0117 *
Age.Group3 0.014051 0.062192 0.226 0.8213
Age.Group4 -0.068771 0.061082 -1.126 0.2602
Age.Group5 -0.304848 0.072771 -4.189 2.80e-05 ***
Age.Group6 -0.438126 0.068654 -6.382 1.75e-10 ***
Day.of.week2 -0.148728 0.063943 -2.326 0.0200 *
Time.of.day2 0.642789 0.030092 21.361 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35355 on 31256 degrees of freedom
AIC: 35425
Number of Fisher Scoring iterations: 9
# null model contains no variable
null=glm(Crash.Type~1, family = binomial(), data=ardd)
summary(null)
Call:
glm(formula = Crash.Type ~ 1, family = binomial(), data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.261 -1.261 1.096 1.096 1.096
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.19395 0.01136 17.07 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 43086 on 31290 degrees of freedom
AIC: 43088
Number of Fisher Scoring iterations: 3
#Backward elimination using AIC values
bwd = step(full, data=ardd, direction="backward")
Start: AIC=35424.69
Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Christmas.Period + Easter.Period + Age.Group + Day.of.week +
Time.of.day
Df Deviance AIC
- Christmas.Period 1 35355 35423
- Easter.Period 1 35356 35424
<none> 35355 35425
- Day.of.week 1 35360 35428
- Dayweek 6 35390 35448
- Speed.Limit 1 35446 35514
- Bus.Involvement 1 35480 35548
- Age.Group 5 35519 35579
- Gender 1 35607 35675
- State 7 35669 35725
- Time.of.day 1 35817 35885
- Heavy.Rigid.Truck.Involvement 1 36239 36307
- Articulated.Truck.Involvement 1 37232 37300
- Road.User 7 38823 38879
Step: AIC=35423.23
Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Easter.Period + Age.Group + Day.of.week + Time.of.day
Df Deviance AIC
- Easter.Period 1 35356 35422
<none> 35355 35423
- Day.of.week 1 35361 35427
- Dayweek 6 35391 35447
- Speed.Limit 1 35446 35512
- Bus.Involvement 1 35481 35547
- Age.Group 5 35520 35578
- Gender 1 35608 35674
- State 7 35669 35723
- Time.of.day 1 35817 35883
- Heavy.Rigid.Truck.Involvement 1 36242 36308
- Articulated.Truck.Involvement 1 37235 37301
- Road.User 7 38823 38877
Step: AIC=35422.35
Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Age.Group + Day.of.week + Time.of.day
Df Deviance AIC
<none> 35356 35422
- Day.of.week 1 35362 35426
- Dayweek 6 35392 35446
- Speed.Limit 1 35447 35511
- Bus.Involvement 1 35482 35546
- Age.Group 5 35521 35577
- Gender 1 35609 35673
- State 7 35670 35722
- Time.of.day 1 35818 35882
- Heavy.Rigid.Truck.Involvement 1 36243 36307
- Articulated.Truck.Involvement 1 37236 37300
- Road.User 7 38824 38876
summary(bwd)
Call:
glm(formula = Crash.Type ~ State + Dayweek + Bus.Involvement +
Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement +
Speed.Limit + Road.User + Gender + Age.Group + Day.of.week +
Time.of.day, family = binomial, data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7959 -1.0137 0.4076 0.9337 2.4607
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.204e+01 8.217e+01 -0.146 0.8835
State2 2.070e-01 1.934e-01 1.070 0.2846
State3 5.072e-01 2.132e-01 2.379 0.0173 *
State4 2.837e-01 1.941e-01 1.461 0.1439
State5 2.792e-01 1.979e-01 1.411 0.1583
State6 -2.055e-01 2.108e-01 -0.975 0.3297
State7 -1.533e-01 1.929e-01 -0.794 0.4270
State8 4.665e-01 1.964e-01 2.376 0.0175 *
Dayweek1 4.925e-02 5.098e-02 0.966 0.3340
Dayweek6 2.964e-01 6.225e-02 4.762 1.91e-06 ***
Dayweek7 3.573e-01 6.336e-02 5.639 1.71e-08 ***
Dayweek4 -3.558e-02 5.161e-02 -0.689 0.4905
Dayweek2 -1.476e-02 5.300e-02 -0.279 0.7806
Dayweek3 -6.744e-02 5.220e-02 -1.292 0.1964
Bus.Involvement2 -1.191e+00 1.091e-01 -10.919 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.943e+00 7.359e-02 -26.399 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988e+00 5.097e-02 -39.015 < 2e-16 ***
Speed.Limit 6.630e-03 6.958e-04 9.528 < 2e-16 ***
Road.User2 1.167e+01 8.217e+01 0.142 0.8871
Road.User3 1.134e+01 8.217e+01 0.138 0.8903
Road.User4 1.090e+01 8.217e+01 0.133 0.8945
Road.User5 1.260e+01 8.217e+01 0.153 0.8782
Road.User6 1.161e+01 8.217e+01 0.141 0.8877
Road.User7 9.860e+00 8.217e+01 0.120 0.9045
Road.User8 1.375e+01 8.217e+01 0.167 0.8671
Gender2 -4.883e-01 3.077e-02 -15.870 < 2e-16 ***
Age.Group2 1.534e-01 6.103e-02 2.514 0.0119 *
Age.Group3 1.327e-02 6.219e-02 0.213 0.8310
Age.Group4 -6.930e-02 6.108e-02 -1.135 0.2565
Age.Group5 -3.059e-01 7.277e-02 -4.204 2.62e-05 ***
Age.Group6 -4.391e-01 6.865e-02 -6.396 1.59e-10 ***
Day.of.week2 -1.487e-01 6.394e-02 -2.325 0.0201 *
Time.of.day2 6.424e-01 3.009e-02 21.350 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35356 on 31258 degrees of freedom
AIC: 35422
Number of Fisher Scoring iterations: 9
The equation from Backward Elimination is:
Y_hat = -1.204e+01 + 2.070e-01State2 + 5.072e-01State3 + 2.837e-01State4 + 2.792e-01State5 - 2.055e-01State6 - 1.533e-01State7 + 4.665e-01State8 + 4.925e-02Dayweek1 + 2.964e-01Dayweek6 + 3.573e-01Dayweek7 - 3.558e-02Dayweek4 - 1.476e-02Dayweek2 - 6.744e-02Dayweek3 - 1.191e+00Bus.Involvement - 1.943e+00Heavy.Rigid.Truck.Involvement - 1.988e+00Articulated.Truck.Involvement + 6.630e-03Speed.Limit + 1.167e+01Road.User2 + 1.134e+01Road.User3 + 1.090e+01Road.User4 + 1.260e+01Road.User5 + 1.161e+01Road.User6 + 9.860e+00Road.User7 + 1.375e+01Road.User8 - 4.883e-01Gender2 + 1.534e-01Age.Group2 + 1.327e-02Age.Group3 - 6.930e-02Age.Group4 - 3.059e-01Age.Group5 - 4.391e-01Age.Group6 - 1.487e-01Day.of.week2 + 6.424e-01Time.of.day2 + e
A regression model with all the regressors shows that, variables road user, Christmas Period and Easter Period are insignificant as their p_value is > 0.05.
All the other variables are statistically significant and can have significant effect on the likelihood of a fatal accident.
The Null and Residual deviance are 43086 and 35355 on 31290 and 31256 degrees of freedom respectively.
The AIC of the model is 35425
H0: the model fits the data well H1: the model does not fit the data well
deviance(bwd)
[1] 35356.35
pchisq(bwd$deviance, df=bwd$df.residual, lower.tail=FALSE)
[1] 4.877108e-56
model2_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model2_null,bwd,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Age.Group + Day.of.week + Time.of.day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31258 35356 32 7729.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis of deviance table gives a p_value of 2.2e-16 i.e. < 0.05
As the p_value is significant, we do not have conclusive evidence to reject H0.
Hence, we conclude that model fits the data well and logistic model is statistically significant.
Anova(bwd)
Analysis of Deviance Table (Type II tests)
Response: Crash.Type
LR Chisq Df Pr(>Chisq)
State 313.2 7 < 2.2e-16 ***
Dayweek 35.6 6 3.298e-06 ***
Bus.Involvement 125.7 1 < 2.2e-16 ***
Heavy.Rigid.Truck.Involvement 886.2 1 < 2.2e-16 ***
Articulated.Truck.Involvement 1879.4 1 < 2.2e-16 ***
Speed.Limit 91.1 1 < 2.2e-16 ***
Road.User 3467.4 7 < 2.2e-16 ***
Gender 252.8 1 < 2.2e-16 ***
Age.Group 165.1 5 < 2.2e-16 ***
Day.of.week 5.4 1 0.02033 *
Time.of.day 461.4 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All the variables in the backward elimination method are statistically significant. Hence, these variables would have a significant on the probablity of fatalities.
#forward selection using AIC values
fwd = step(null, scope=list(lower=null, upper=full), direction="forward")
Start: AIC=43087.64
Crash.Type ~ 1
Df Deviance AIC
+ Road.User 7 40161 40177
+ Articulated.Truck.Involvement 1 41440 41444
+ Time.of.day 1 41678 41682
+ Heavy.Rigid.Truck.Involvement 1 42204 42208
+ Day.of.week 1 42744 42748
+ Age.Group 5 42810 42822
+ State 7 42837 42853
+ Dayweek 6 42860 42874
+ Gender 1 42951 42955
+ Bus.Involvement 1 43018 43022
+ Speed.Limit 1 43038 43042
<none> 43086 43088
+ Christmas.Period 1 43084 43088
+ Easter.Period 1 43085 43089
Step: AIC=40176.98
Crash.Type ~ Road.User
The equation from Forward selection is:
Y_hat = -1.204e+01 + 1.167e+01Road.User2 + 1.134e+01Road.User3 + 1.090e+01Road.User4 + 1.260e+01Road.User5 + 1.161e+01Road.User6 + 9.860e+00Road.User7 + 1.375e+01Road.User8 - 1.988e+00Articulated.Truck.Involvement + 6.424e-01Time.of.day2 - 1.943e+00Heavy.Rigid.Truck.Involvement + 2.070e-01State2 + 5.072e-01State3 + 2.837e-01State4 + 2.792e-01State5 - 2.055e-01State6 - 1.533e-01State7 + 4.665e-01State8 - 4.883e-01Gender2 + 1.534e-01Age.Group2 + 1.327e-02Age.Group3 - 6.930e-02Age.Group4 - 3.059e-01Age.Group5 - 4.391e-01Age.Group6 - 1.191e+00Bus.Involvement + 6.630e-03Speed.Limit + 4.925e-02Dayweek1 + 2.964e-01Dayweek6 + 3.573e-01Dayweek7 - 3.558e-02Dayweek4 - 1.476e-02Dayweek2 - 6.744e-02Dayweek3 - 1.487e-01Day.of.week2 + e
H0: the model fits the data well H1: the model does not fit the data well
deviance(fwd)
[1] 35356.35
pchisq(fwd$deviance, df=fwd$df.residual, lower.tail=FALSE)
[1] 4.877108e-56
The chi-sqr test statistic is 4.745029e-56 on 35356.35 degrees of freedom.
model2_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model2_null,fwd,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ Road.User + Articulated.Truck.Involvement + Time.of.day +
Heavy.Rigid.Truck.Involvement + State + Gender + Age.Group +
Bus.Involvement + Speed.Limit + Dayweek + Day.of.week
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31258 35356 32 7729.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis of deviance table gives a p_value of 2.2e-16 i.e. < 0.05
As the p_value is significant, we do not have conclusive evidence to reject H0.
Hence, we conclude that model fits the data well and logistic model is statistically significant.
Anova(fwd)
Analysis of Deviance Table (Type II tests)
Response: Crash.Type
LR Chisq Df Pr(>Chisq)
Road.User 3467.4 7 < 2.2e-16 ***
Articulated.Truck.Involvement 1879.4 1 < 2.2e-16 ***
Time.of.day 461.4 1 < 2.2e-16 ***
Heavy.Rigid.Truck.Involvement 886.2 1 < 2.2e-16 ***
State 313.2 7 < 2.2e-16 ***
Gender 252.8 1 < 2.2e-16 ***
Age.Group 165.1 5 < 2.2e-16 ***
Bus.Involvement 125.7 1 < 2.2e-16 ***
Speed.Limit 91.1 1 < 2.2e-16 ***
Dayweek 35.6 6 3.298e-06 ***
Day.of.week 5.4 1 0.02033 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All the variables in the forward selection model building strategy are statistically significant. Hence, these variables would have a significant on the probablity of fatalities.
model.glm <- glm(Crash.Type ~ ., family = binomial(link = logit), data = ardd)
summary(model.glm)
Call:
glm(formula = Crash.Type ~ ., family = binomial(link = logit),
data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7966 -1.0133 0.4079 0.9330 2.4606
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.037503 82.169754 -0.146 0.8835
State2 0.206263 0.193432 1.066 0.2863
State3 0.508028 0.213192 2.383 0.0172 *
State4 0.283314 0.194149 1.459 0.1445
State5 0.278438 0.197928 1.407 0.1595
State6 -0.205874 0.210831 -0.976 0.3288
State7 -0.154802 0.192964 -0.802 0.4224
State8 0.466257 0.196391 2.374 0.0176 *
Dayweek1 0.049161 0.050987 0.964 0.3349
Dayweek6 0.296277 0.062249 4.760 1.94e-06 ***
Dayweek7 0.357263 0.063357 5.639 1.71e-08 ***
Dayweek4 -0.035785 0.051613 -0.693 0.4881
Dayweek2 -0.017354 0.053042 -0.327 0.7435
Dayweek3 -0.069637 0.052233 -1.333 0.1825
Bus.Involvement2 -1.190690 0.109123 -10.911 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.941575 0.073603 -26.379 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988102 0.050974 -39.003 < 2e-16 ***
Speed.Limit 0.006626 0.000696 9.520 < 2e-16 ***
Road.User2 11.669541 82.169487 0.142 0.8871
Road.User3 11.335902 82.169616 0.138 0.8903
Road.User4 10.897915 82.169491 0.133 0.8945
Road.User5 12.603498 82.170212 0.153 0.8781
Road.User6 11.608560 82.169490 0.141 0.8877
Road.User7 9.860954 82.169542 0.120 0.9045
Road.User8 13.754009 82.169500 0.167 0.8671
Gender2 -0.488235 0.030770 -15.867 < 2e-16 ***
Christmas.Period2 0.053937 0.073513 0.734 0.4631
Easter.Period2 -0.151745 0.144993 -1.047 0.2953
Age.Group2 0.153878 0.061035 2.521 0.0117 *
Age.Group3 0.014051 0.062192 0.226 0.8213
Age.Group4 -0.068771 0.061082 -1.126 0.2602
Age.Group5 -0.304848 0.072771 -4.189 2.80e-05 ***
Age.Group6 -0.438126 0.068654 -6.382 1.75e-10 ***
Day.of.week2 -0.148728 0.063943 -2.326 0.0200 *
Time.of.day2 0.642789 0.030092 21.361 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35355 on 31256 degrees of freedom
AIC: 35425
Number of Fisher Scoring iterations: 9
We fit the above model with all the variables in the dataset.
We found that, varibles Easter period, Christmas period and Road user were insignificant.
To test the adeqaucy of the model we performed ANOVA testing
Anova(model.glm)
Analysis of Deviance Table (Type II tests)
Response: Crash.Type
LR Chisq Df Pr(>Chisq)
State 314.2 7 < 2.2e-16 ***
Dayweek 35.7 6 3.096e-06 ***
Bus.Involvement 125.5 1 < 2.2e-16 ***
Heavy.Rigid.Truck.Involvement 884.5 1 < 2.2e-16 ***
Articulated.Truck.Involvement 1877.8 1 < 2.2e-16 ***
Speed.Limit 91.0 1 < 2.2e-16 ***
Road.User 3468.2 7 < 2.2e-16 ***
Gender 252.7 1 < 2.2e-16 ***
Christmas.Period 0.5 1 0.46275
Easter.Period 1.1 1 0.29647
Age.Group 164.7 5 < 2.2e-16 ***
Day.of.week 5.4 1 0.02028 *
Time.of.day 461.8 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the ANOVA table, we find that, varible road is statistically significant. Hence, we do not remove this variable from the model.
The other two variables, Christmas period and easter period are statistically significant. These variables would not have a significant effect on the likelihood of a road fatality due to single or multiple crash.
Hence, we remove these variables and fit the reduced model.
Reduced Model - Removing the insignificant variables
reduced.model.glm <- glm(Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement + Speed.Limit +
Road.User + Gender + Age.Group + Day.of.week + Time.of.day, data = ardd, family = binomial(link = logit))
summary(reduced.model.glm)
Call:
glm(formula = Crash.Type ~ State + Dayweek + Bus.Involvement +
Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement +
Speed.Limit + Road.User + Gender + Age.Group + Day.of.week +
Time.of.day, family = binomial(link = logit), data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7959 -1.0137 0.4076 0.9337 2.4607
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.204e+01 8.217e+01 -0.146 0.8835
State2 2.070e-01 1.934e-01 1.070 0.2846
State3 5.072e-01 2.132e-01 2.379 0.0173 *
State4 2.837e-01 1.941e-01 1.461 0.1439
State5 2.792e-01 1.979e-01 1.411 0.1583
State6 -2.055e-01 2.108e-01 -0.975 0.3297
State7 -1.533e-01 1.929e-01 -0.794 0.4270
State8 4.665e-01 1.964e-01 2.376 0.0175 *
Dayweek1 4.925e-02 5.098e-02 0.966 0.3340
Dayweek6 2.964e-01 6.225e-02 4.762 1.91e-06 ***
Dayweek7 3.573e-01 6.336e-02 5.639 1.71e-08 ***
Dayweek4 -3.558e-02 5.161e-02 -0.689 0.4905
Dayweek2 -1.476e-02 5.300e-02 -0.279 0.7806
Dayweek3 -6.744e-02 5.220e-02 -1.292 0.1964
Bus.Involvement2 -1.191e+00 1.091e-01 -10.919 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.943e+00 7.359e-02 -26.399 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988e+00 5.097e-02 -39.015 < 2e-16 ***
Speed.Limit 6.630e-03 6.958e-04 9.528 < 2e-16 ***
Road.User2 1.167e+01 8.217e+01 0.142 0.8871
Road.User3 1.134e+01 8.217e+01 0.138 0.8903
Road.User4 1.090e+01 8.217e+01 0.133 0.8945
Road.User5 1.260e+01 8.217e+01 0.153 0.8782
Road.User6 1.161e+01 8.217e+01 0.141 0.8877
Road.User7 9.860e+00 8.217e+01 0.120 0.9045
Road.User8 1.375e+01 8.217e+01 0.167 0.8671
Gender2 -4.883e-01 3.077e-02 -15.870 < 2e-16 ***
Age.Group2 1.534e-01 6.103e-02 2.514 0.0119 *
Age.Group3 1.327e-02 6.219e-02 0.213 0.8310
Age.Group4 -6.930e-02 6.108e-02 -1.135 0.2565
Age.Group5 -3.059e-01 7.277e-02 -4.204 2.62e-05 ***
Age.Group6 -4.391e-01 6.865e-02 -6.396 1.59e-10 ***
Day.of.week2 -1.487e-01 6.394e-02 -2.325 0.0201 *
Time.of.day2 6.424e-01 3.009e-02 21.350 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35356 on 31258 degrees of freedom
AIC: 35422
Number of Fisher Scoring iterations: 9
From the reduced model, we find that variable Road.User is insignificant as the p_value is > 0.05
The other variables are statistically significant.
The Null and Residual deviance of the model is 43086 and 35356 on 31290 and 31258 degrees of freedom.
We check the adequacy of the model.
H0: the model fits the data well H1: the model does not fit the data well
deviance(reduced.model.glm)
[1] 35356.35
pchisq(reduced.model.glm$deviance, df=reduced.model.glm$df.residual, lower.tail=FALSE)
[1] 4.877108e-56
The chi-sqr stat for the reduced model us 4.877108e-56 on 35356.35 degrees of freedom
model3_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model3_null,reduced.model.glm,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + Road.User +
Gender + Age.Group + Day.of.week + Time.of.day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31258 35356 32 7729.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here, we find that the model is statistically significant as p_value is < 0.05.
reduced.model.glm1 <- glm(Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement + Speed.Limit +
+ Gender + Age.Group + Day.of.week + Time.of.day, data = ardd, family = binomial(link = logit))
summary(reduced.model.glm1)
Call:
glm(formula = Crash.Type ~ State + Dayweek + Bus.Involvement +
Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement +
Speed.Limit + +Gender + Age.Group + Day.of.week + Time.of.day,
family = binomial(link = logit), data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.984 -1.169 0.722 1.014 2.173
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0248056 0.1955565 -0.127 0.899062
State2 0.3822043 0.1830803 2.088 0.036831 *
State3 0.8925152 0.2019173 4.420 9.86e-06 ***
State4 0.3728566 0.1837896 2.029 0.042487 *
State5 0.4213235 0.1873411 2.249 0.024515 *
State6 -0.1027256 0.1997640 -0.514 0.607088
State7 0.1045859 0.1825810 0.573 0.566767
State8 0.6060186 0.1860077 3.258 0.001122 **
Dayweek1 0.0422310 0.0481566 0.877 0.380514
Dayweek6 0.2456655 0.0589419 4.168 3.07e-05 ***
Dayweek7 0.2568283 0.0599746 4.282 1.85e-05 ***
Dayweek4 -0.0276303 0.0487175 -0.567 0.570609
Dayweek2 -0.0141071 0.0500386 -0.282 0.778002
Dayweek3 -0.0548824 0.0492725 -1.114 0.265342
Bus.Involvement2 -0.7728701 0.0960518 -8.046 8.53e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.6778281 0.0679003 -24.710 < 2e-16 ***
Articulated.Truck.Involvement2 -1.7311366 0.0474949 -36.449 < 2e-16 ***
Speed.Limit 0.0004456 0.0006101 0.730 0.465176
Gender2 -0.2382037 0.0274555 -8.676 < 2e-16 ***
Age.Group2 -0.0692954 0.0547420 -1.266 0.205565
Age.Group3 -0.2095850 0.0548507 -3.821 0.000133 ***
Age.Group4 -0.2144510 0.0535680 -4.003 6.25e-05 ***
Age.Group5 -0.2073767 0.0651022 -3.185 0.001446 **
Age.Group6 -0.1149606 0.0608296 -1.890 0.058774 .
Day.of.week2 -0.1660151 0.0607037 -2.735 0.006241 **
Time.of.day2 0.8263358 0.0281772 29.326 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 38824 on 31265 degrees of freedom
AIC: 38876
Number of Fisher Scoring iterations: 4
From the reduced model, we find that variable Speed.Limit is insignificant as the p_value is > 0.05
The Null and Residual deviance of the model is 43086 and 38824 on 31290 and 31265 degrees of freedom.
We check the adequacy of the model.
H0: the model fits the data well H1: the model does not fit the data well
deviance(reduced.model.glm1)
[1] 38823.79
pchisq(reduced.model.glm1$deviance, df=reduced.model.glm1$df.residual, lower.tail=FALSE)
[1] 6.639288e-174
The chi-sqr stat for the reduced model us 6.639288e-174 on 38823.79 degrees of freedom
model3_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model3_null,reduced.model.glm1,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Speed.Limit + +Gender + Age.Group +
Day.of.week + Time.of.day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31265 38824 25 4261.9 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here, we find that the model is statistically significant as p_value is < 0.05.
reduced.model.glm2 <- glm(Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement
+ Gender + Age.Group + Day.of.week + Time.of.day, data = ardd, family = binomial(link = logit))
summary(reduced.model.glm2)
Call:
glm(formula = Crash.Type ~ State + Dayweek + Bus.Involvement +
Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement +
Gender + Age.Group + Day.of.week + Time.of.day, family = binomial(link = logit),
data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9821 -1.1705 0.7204 1.0145 2.1727
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.006702 0.190759 0.035 0.971974
State2 0.385906 0.183032 2.108 0.034996 *
State3 0.902017 0.201520 4.476 7.60e-06 ***
State4 0.377034 0.183723 2.052 0.040151 *
State5 0.427220 0.187188 2.282 0.022472 *
State6 -0.095967 0.199570 -0.481 0.630609
State7 0.109468 0.182480 0.600 0.548579
State8 0.612710 0.185805 3.298 0.000975 ***
Dayweek1 0.042376 0.048153 0.880 0.378849
Dayweek6 0.245760 0.058944 4.169 3.05e-05 ***
Dayweek7 0.257058 0.059977 4.286 1.82e-05 ***
Dayweek4 -0.027987 0.048714 -0.575 0.565621
Dayweek2 -0.014292 0.050035 -0.286 0.775153
Dayweek3 -0.054963 0.049270 -1.116 0.264619
Bus.Involvement2 -0.777645 0.095830 -8.115 4.86e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.677711 0.067896 -24.710 < 2e-16 ***
Articulated.Truck.Involvement2 -1.726242 0.047010 -36.721 < 2e-16 ***
Gender2 -0.237414 0.027434 -8.654 < 2e-16 ***
Age.Group2 -0.068454 0.054727 -1.251 0.211003
Age.Group3 -0.208238 0.054817 -3.799 0.000145 ***
Age.Group4 -0.212708 0.053512 -3.975 7.04e-05 ***
Age.Group5 -0.207127 0.065098 -3.182 0.001464 **
Age.Group6 -0.118356 0.060650 -1.951 0.051001 .
Day.of.week2 -0.165379 0.060698 -2.725 0.006437 **
Time.of.day2 0.824202 0.028024 29.411 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 38824 on 31266 degrees of freedom
AIC: 38874
Number of Fisher Scoring iterations: 4
In this model, we find all the variables are statistically significant we need not reduce the model further.
The Null and Residual deviance is 43086 and 38824 on 31290 and 31266 degrees of freedom.
We check the adequacy of the model
H0: the model fits the data well H1: the model does not fit the data well
deviance(reduced.model.glm2)
[1] 38824.32
pchisq(reduced.model.glm2$deviance, df=reduced.model.glm2$df.residual, lower.tail=FALSE)
[1] 7.024608e-174
The chi-sqr stat for the reduced model us7.024608e-174 on 38824.32 degrees of freedom
model3_null<-glm(Crash.Type ~ 1, family=binomial,ardd)
anova(model3_null,reduced.model.glm2,test="Chisq")
Analysis of Deviance Table
Model 1: Crash.Type ~ 1
Model 2: Crash.Type ~ State + Dayweek + Bus.Involvement + Heavy.Rigid.Truck.Involvement +
Articulated.Truck.Involvement + Gender + Age.Group + Day.of.week +
Time.of.day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 31290 43086
2 31266 38824 24 4261.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here, we find that the model is statistically significant as p_value is < 0.05.
From the above model, we find that State, DayWeek,Bus Involvement, Heavy Rigid Truck Involvement, Articulated Involvement, Age Group, Time of Day, Day of week are significant factors for likelihood of fatality due to single or multiple crash type.
However, the variables speed limit and road user are equally important factors. They may not be statistically significant factors, but they are significant from a subject point of view. These factors would have large effect on the likelihood of fatality whether or not we categorise them as statistically significant.
We further check the odds ratio
summary(bwd)
Call:
glm(formula = Crash.Type ~ State + Dayweek + Bus.Involvement +
Heavy.Rigid.Truck.Involvement + Articulated.Truck.Involvement +
Speed.Limit + Road.User + Gender + Age.Group + Day.of.week +
Time.of.day, family = binomial, data = ardd)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7959 -1.0137 0.4076 0.9337 2.4607
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.204e+01 8.217e+01 -0.146 0.8835
State2 2.070e-01 1.934e-01 1.070 0.2846
State3 5.072e-01 2.132e-01 2.379 0.0173 *
State4 2.837e-01 1.941e-01 1.461 0.1439
State5 2.792e-01 1.979e-01 1.411 0.1583
State6 -2.055e-01 2.108e-01 -0.975 0.3297
State7 -1.533e-01 1.929e-01 -0.794 0.4270
State8 4.665e-01 1.964e-01 2.376 0.0175 *
Dayweek1 4.925e-02 5.098e-02 0.966 0.3340
Dayweek6 2.964e-01 6.225e-02 4.762 1.91e-06 ***
Dayweek7 3.573e-01 6.336e-02 5.639 1.71e-08 ***
Dayweek4 -3.558e-02 5.161e-02 -0.689 0.4905
Dayweek2 -1.476e-02 5.300e-02 -0.279 0.7806
Dayweek3 -6.744e-02 5.220e-02 -1.292 0.1964
Bus.Involvement2 -1.191e+00 1.091e-01 -10.919 < 2e-16 ***
Heavy.Rigid.Truck.Involvement2 -1.943e+00 7.359e-02 -26.399 < 2e-16 ***
Articulated.Truck.Involvement2 -1.988e+00 5.097e-02 -39.015 < 2e-16 ***
Speed.Limit 6.630e-03 6.958e-04 9.528 < 2e-16 ***
Road.User2 1.167e+01 8.217e+01 0.142 0.8871
Road.User3 1.134e+01 8.217e+01 0.138 0.8903
Road.User4 1.090e+01 8.217e+01 0.133 0.8945
Road.User5 1.260e+01 8.217e+01 0.153 0.8782
Road.User6 1.161e+01 8.217e+01 0.141 0.8877
Road.User7 9.860e+00 8.217e+01 0.120 0.9045
Road.User8 1.375e+01 8.217e+01 0.167 0.8671
Gender2 -4.883e-01 3.077e-02 -15.870 < 2e-16 ***
Age.Group2 1.534e-01 6.103e-02 2.514 0.0119 *
Age.Group3 1.327e-02 6.219e-02 0.213 0.8310
Age.Group4 -6.930e-02 6.108e-02 -1.135 0.2565
Age.Group5 -3.059e-01 7.277e-02 -4.204 2.62e-05 ***
Age.Group6 -4.391e-01 6.865e-02 -6.396 1.59e-10 ***
Day.of.week2 -1.487e-01 6.394e-02 -2.325 0.0201 *
Time.of.day2 6.424e-01 3.009e-02 21.350 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43086 on 31290 degrees of freedom
Residual deviance: 35356 on 31258 degrees of freedom
AIC: 35422
Number of Fisher Scoring iterations: 9
exp(bwd$coefficients[18] - bwd$coefficients[33])
Speed.Limit
0.5295315
The odds ratio between Speed limit i.e. due to overspeeding and Time of Day shows that the odds of fatality due to over speeding is 0.5 times than odds of fatality at night.
Odds of Likelihood of fatality due to over speeding and person belong to age group 40-64
exp(bwd$coefficients[18] - bwd$coefficients[29])
Speed.Limit
1.078889
For predicting the likelihood of fatality, we select the backward elimination model as it is found to the most statistically significant one.
Final model thus becomes:
Y_hat = -1.204e+01 + 2.070e-01State2 + 5.072e-01State3 + 2.837e-01State4 + 2.792e-01State5 - 2.055e-01State6 - 1.533e-01State7 + 4.665e-01State8 + 4.925e-02Dayweek1 + 2.964e-01Dayweek6 + 3.573e-01Dayweek7 - 3.558e-02Dayweek4 - 1.476e-02Dayweek2 - 6.744e-02Dayweek3 - 1.191e+00Bus.Involvement - 1.943e+00Heavy.Rigid.Truck.Involvement - 1.988e+00Articulated.Truck.Involvement + 6.630e-03Speed.Limit + 1.167e+01Road.User2 + 1.134e+01Road.User3 + 1.090e+01Road.User4 + 1.260e+01Road.User5 + 1.161e+01Road.User6 + 9.860e+00Road.User7 + 1.375e+01Road.User8 - 4.883e-01Gender2 + 1.534e-01Age.Group2 + 1.327e-02Age.Group3 - 6.930e-02Age.Group4 - 3.059e-01Age.Group5 - 4.391e-01Age.Group6 - 1.487e-01Day.of.week2 + 6.424e-01Time.of.day2 + e
Now, we predict the likelihood of fatality by selecting random values in the equation.
1. Predicting the likelihood when State is NSW(2), Dayweek is Saturday(6), there is no bus, truck or articulated truck involved (1), Speed.Limit is 100, Road.User is driver(2), gender is male(1), Age.group is 40-64, Day of Week is Weekend(2), Time of Day is night(2)
predict.data1 <- data.frame(State = "2", Dayweek = "6", Bus.Involvement = "1", Heavy.Rigid.Truck.Involvement = "1", Articulated.Truck.Involvement = "1",
Speed.Limit = 100 , Road.User = "2", Gender = "1", Age.Group = "4", Day.of.week = "2", Time.of.day = "2")
predict(object = bwd, newdata = predict.data1, type = "response")
1
0.7724804
From the prediction we can see that, in the above scenario, the probablity of fatality is 0.7724
predict.data2 <- data.frame(State = "7", Dayweek = "5", Bus.Involvement = "1", Heavy.Rigid.Truck.Involvement = "1", Articulated.Truck.Involvement = "1",
Speed.Limit = 100 , Road.User = "2", Gender = "1", Age.Group = "4", Day.of.week = "1", Time.of.day = "1")
predict(object = bwd, newdata = predict.data2, type = "response")
1
0.5179869
From the prediction we can see that, in the above scenario, the probablity of fatality is 0.5179
predict.data3 <- data.frame(State = "4", Dayweek = "5", Bus.Involvement = "1", Heavy.Rigid.Truck.Involvement = "1", Articulated.Truck.Involvement = "1",
Speed.Limit = 100 , Road.User = "6", Gender = "1", Age.Group = "2", Day.of.week = "1", Time.of.day = "2")
predict(object = bwd, newdata = predict.data3, type = "response")
1
0.788046
From the prediction we can see that, in the above scenario, the probablity of fatality is 0.7880
From the above comparison, it is found that, probablity of fatality is high when the speed limit is 100, Age group is between 17-25, Time of day is night, gender is male, crash type is single as there was no bus or truck involved and day is weekday.
Also, we find that variable speed limit plays an important factor when it comes to predict the likelihood of fatality.
The data analysis reveals a link between speed limit, Age group, Time of Day and Day of week and the likelihood of a fatality due to Single or multiple crash type. When a model containing Easter or Christmas period was run, it was discovered that these two variables were not statistically significant with the other variables in the model.
After removing these variables, the model was run again. Using an ANOVA, all predictor variables were determined to be highly significant at p<0.05, suggesting that they all contribute highly in predicting the likelihood of fatality.
Model building strategy was used to find a significant model by backward elimination and forward selection.
Odds ratio was also calculated to find the odds of fatality due to overspeeding in the night time and likelihood of a fatality of age group 40-64. The odds ratio between Speed limit i.e. due to overspeeding and Time of Day shows that the odds of fatality due to over speeding is 0.5 times than odds of fatality at night.
Finally, future study might look at the role of driving when alcohol levels are high as a predictor variable, also the condition of roads would be a significant predictor variable. As incidents involving driving while alcohol levels are high, fatalities due to bad road conditions are more likely to result in a fatal crash.
Involving these variables in study would bring more clarity in predicting the likelihood of fatalities.