This purpose of this project was to demonstrate appropriate machine learning skills, relevant data wrangling/cleaning and data visualisation techniques in R. The data was obtained from the vic roads open data on crashes that have occurred between the 11th of August 2012 and 11th of August 2017 (https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/crashes-last-five-years). Data was converted from the raw form into a usable version with only the most appropriate variables retained.
The overall aim was to determine which driving conditions and/or elements would lead to road crash involving fatalities.
The original data had n=76541 observations with 65 variables. Irrelevant variables such as accident ID, accident status and co-ordinates were removed. Variables which had missing or ‘unknown’ levels had these rows removed which eliminated 7451 observations. The following variables were used and converted to create a smaller version of the original data:
The cleaned version of the data now had n=69089 observations and 10 variables.
Although the longitude and latitude variables were not included in the final data, they were extracted and used to create point maps around Victoria and Melbourne for crashes involving fatalities. It was hoped that additional insights could be gained from a geographic visualisation.
latvic <- c(-38.6,-34)
longvic <- c(140,149.5)
locvic <- make_bbox(longvic,latvic,f=0)
mapvic <- get_map(locvic)
ggmap(mapvic)
p1 <- ggmap(mapvic)
p1+geom_point(data=Crashes.map,aes(x=LONGITUDE,y=LATITUDE,size=FATALITY),alpha=0.5)+
labs(title="Point map showing crash fatalities in Victoria",
size="Number of fatalities",
x="Longitude",y="Latitude")+
theme(plot.title = element_text(hjust = 0.5))
Crashes.melb <- Crashes.map[Crashes.map$LONGITUDE>=144.60 & Crashes.map$LONGITUDE<=145.51 &
Crashes.map$LATITUDE>=-38.43 & Crashes.map$LATITUDE<=-37.51,]
From this map we can see that there are many crashes with fatalities clustered around area of Melbourne. The crashes taking place in the more rural areas of Victoria seem to take place on major roads such as highways and freeways, as indicated by the solid white lines. However, from this current image the points around Melbourne are not clearly discernible. To remedy this, we can produce another map of the Melbourne Metropolitan Area only.
Crashes.melb <- Crashes.map[Crashes.map$LONGITUDE>=144.60 & Crashes.map$LONGITUDE<=145.51 &
Crashes.map$LATITUDE>=-38.43 & Crashes.map$LATITUDE<=-37.51,]
mapmelb <- get_map("Melbourne:Australia",zoom=10)
ggmap(mapmelb)
p2 <- ggmap(mapmelb)
p2+geom_point(data=Crashes.melb,aes(x=LONGITUDE,y=LATITUDE,size=as.factor(FATALITY)),alpha=0.9)+
labs(title="Point map showing crash fatalities in Greater Metropolitan Area of Melbourne",
size="Number of fatalities",
x="Longitude",y="Latitude")+
theme(plot.title = element_text(hjust = 0.5))}
This close-up map of Melbourne tells a similar picture as many of the fatal car crashes occur on the major roads. Very few fatal crashes seem to occur in the CBD, likely due to lower speeds and the presence of constant traffic. Most crashes also only appear to have one fatality with more severe crashes occurring further out into the suburbs.
Before working with our data, we checked to see if the variables were in working order by looking at their correlation. The only medium-level correlation was between the number of persons and total number of vehicles involved which is somewhat expected given that these variables are very similar in nature. But for the most part, the variables had a small to medium correlation with each other. Since there was such low correlation, we decided that methods such as a principal component analysis would not be necessary.
cor(Crashes[,-1]) > 0.3
## SPEED_ZONE TOTAL_PERSONS BICYCLIST PEDESTRIAN MOTORIST
## SPEED_ZONE TRUE FALSE FALSE FALSE FALSE
## TOTAL_PERSONS FALSE TRUE FALSE FALSE FALSE
## BICYCLIST FALSE FALSE TRUE FALSE FALSE
## PEDESTRIAN FALSE FALSE FALSE TRUE FALSE
## MOTORIST FALSE FALSE FALSE FALSE TRUE
## OLD_DRIVER FALSE FALSE FALSE FALSE FALSE
## YOUNG_DRIVER FALSE FALSE FALSE FALSE FALSE
## NO_OF_VEHICLES FALSE TRUE FALSE FALSE FALSE
## FATALITY FALSE FALSE FALSE FALSE FALSE
## OLD_DRIVER YOUNG_DRIVER NO_OF_VEHICLES FATALITY
## SPEED_ZONE FALSE FALSE FALSE FALSE
## TOTAL_PERSONS FALSE FALSE TRUE FALSE
## BICYCLIST FALSE FALSE FALSE FALSE
## PEDESTRIAN FALSE FALSE FALSE FALSE
## MOTORIST FALSE FALSE FALSE FALSE
## OLD_DRIVER TRUE FALSE FALSE FALSE
## YOUNG_DRIVER FALSE TRUE FALSE FALSE
## NO_OF_VEHICLES FALSE FALSE TRUE FALSE
## FATALITY FALSE FALSE FALSE TRUE
Normality checks were attempted but since our data had more than 500 observations, these checks failed. Our summary statistics showed that many of the results in fact did not have a fatality with 1310 having 1 or more (~1.90%). These trends are similar for the other variables and need to be kept in mind when running our predictive analytics.
table(Crashes$FATALITY)
##
## 0 1 2 3 4 5
## 67779 1225 72 9 3 1
summary(Crashes$FATALITY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02045 0.00000 5.00000
The data will be analysed using a variety of machine learning techniques in order to find the variables most associated with car crash fatalities. Originally, the variable fatality was numeric but by changing it to a “Yes” or “No” variable, we can also allow ourselves to use classification techniques in combination with regression techniques.
The following regression techniques will be used:
The following classification techniques will be used:
We also needed to create our relevant training and test datasets with and empty dataframe to store our findings.
set.seed(123)
smp_size <- floor(0.75*nrow(Crashes))
split_value <- seq(1,smp_size)
train <- Crashes[split_value,]
test <- Crashes[-split_value,]
x_train <- train[,-10]
y_train <- train[,10]
x_test <- test[,-10]
y_test <- test[,10]
Crashes.results <- data.frame(
Technique=character(),
Error=numeric(),
Information=character()
)
Our first regression technique was a linear regression and produced a model fit that was significant. Many of the predictor variables were also highly significant.
lm.fit <- lm(FATALITY~.,data=Crashes,subset=split_value)
summary(lm.fit)
lm.pred <- predict(lm.fit,x_test)
mean((lm.pred-y_test)^2)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="Linear Regression",
Error=mean((lm.pred-y_test)^2),
Information="MSE"
))
Next we used a SVM technique. Normally, we would tune the model to obtain the best values for our parameters but since our dataset was so long, this would take many hours and is not essential for this project.
library(e1071)
svm.model <- svm(FATALITY~.,data=Crashes,subset=split_value)
summary(svm.model)
svm.pred <- predict(svm.model,x_test)
mean((svm.pred-y_test)^2)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="SVM",
Error=mean((svm.pred-y_test)^2),
Information="MSE"
))
Finally, a regression tree was used. A maximal tree was created, pruned using cross-validation and then used to made predictions on number of fatalities.
library(tree)
tree.reg <- tree(FATALITY~.,Crashes,subset=split_value) # Creates maximal tree
cv.reg <- cv.tree(tree.reg) # Cross-validation for optimal tree complexity
summary(cv.reg)
reg.results <- data.frame(cv.reg$size,cv.reg$dev)
prune.reg <- prune.tree(tree.reg,best=reg.results[which.min(cv.reg$dev),1])
tree.reg.pred <- predict(tree.reg,x_test)
mean((tree.reg.pred-y_test)^2)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="Regression Tree",
Error=mean((tree.reg.pred-y_test)^2),
Information="MSE"
))
We also ran predictions with classification techniques once we had changed the fatality variable from a numeric to a factor. Understanding that some of the nuances in the variable would be lost, these tests were still carried out in the hopes that more information could be obtained.
Crashes.class <- Crashes
Crashes.class$FATALITY[Crashes.class$FATALITY>=1] <- "Yes"
Crashes.class$FATALITY[Crashes.class$FATALITY<1] <- "No"
Crashes.class$FATALITY <- as.factor(Crashes.class$FATALITY)
train.class <- Crashes.class[split_value,]
test.class <- Crashes.class[-split_value,]
x_train.class <- train.class[,-10]
y_train.class <- train.class[,10]
x_test.class <- test.class[,-10]
y_test.class <- test.class[,10]
The first classification technique was KNN but some modifications had to be made. Since there were so few results with “Yes” for fatality, the dataset had to be reduced to 2620 observations. This meant that the original proportion of roughly 1.90% was changed to a flat 50%. This was done because the KNN would not run otherwise as the results were originally too close together to determine neighbours.
library(class)
library(dplyr)
Crashes.knn <- rbind(filter(Crashes,Crashes$FATALITY==0) %>% sample_n(.,1310),
filter(Crashes,Crashes$FATALITY>=1))
Crashes.knn$FATALITY[Crashes.knn$FATALITY>=1] <- 1
Crashes.knn <- Crashes.knn[,-1]
table(Crashes.knn$FATALITY)
smp_size.knn <- floor(0.75*nrow(Crashes.knn))
split_value.knn <- sample(seq_len(nrow(Crashes.knn)),size=smp_size.knn,replace=F)
train.knn <- Crashes.knn[split_value.knn,]
test.knn <- Crashes.knn[-split_value.knn,]
knn.factor <- train.knn$FATALITY
knn.results <- data.frame(K=integer(),Misclassification=numeric(),Information=character())
for(i in 1:8){
knn.fit <- knn(train.knn,test.knn,knn.factor,k=i,prob=TRUE)
misclass <- mean(knn.fit!=test.knn$FATALITY)
single.results <- data.frame(K=i,Misclassification=misclass)
knn.results <- rbind(knn.results,single.results)
}
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="KNN",
Error=knn.results$Misclassification[which.min(knn.results$Misclassification)],
Information=as.character(paste0("K=",knn.results$K[which.min(knn.results$Misclassification)]))
))
A Naive-Bayes prediction for the categorical variable we have for fatalities. The function can be found within the e1071 library, the same library used for the svm function.
nb.model <- naiveBayes(FATALITY~.,Crashes.class,subset=split_value)
nb.pred <- predict(nb.model,x_test.class)
mean(nb.pred!=y_test.class)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="Naive-Bayes",
Error=mean(nb.pred!=y_test.class),
Information="Misclassification"
))
The logistic regression technique was used next and a naive bayes classifier of 0.5 was used to predict group classification.
glm.model <- glm(FATALITY~.,Crashes.class,subset=split_value,family="binomial")
glm.pred <- predict(glm.model,x_test.class,type="response")
glm.pred <- ifelse(glm.pred<=0.5,"No","Yes")
mean(glm.pred!=y_test.class)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="Logistic Regression",
Error=mean(glm.pred!=y_test.class),
Information="Misclassification"
))
Next, a svm technique was used and similar to the regression svm, tuning would normally be done but the code was running for many many hours without completion so it was omitted from this project.
svm.class <- svm(FATALITY~.,data=Crashes.class,subset=split_value)
summary(svm.class)
svm.pred.class <- predict(svm.class,x_test.class)
mean(svm.pred.class!=y_test.class)
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="SVM",
Error=mean(svm.pred.class!=y_test.class),
Information="Misclassification"
))
Finally, a classification tree technique was used with methods similar to the regression tree used earlier.
class.tree <- tree(FATALITY~.,data=Crashes.class,subset=split_value)
summary(class.tree) # Variable used are Speed zone and Alcohol
ctree.prune <- cv.tree(class.tree)
which.min(ctree.prune$dev) # size of 4
pruned.ctree <- prune.misclass(class.tree,best=4)
pruned.results <- summary(pruned.ctree)
pruned.results$misclass[1]/pruned.results$misclass[2]
Crashes.results <- rbind(Crashes.results,data.frame(
Technique="Classification Tree",
Error=pruned.results$misclass[1]/pruned.results$misclass[2],
Information="Misclassification"
))
We can now see the full dataframe with all of our findings with the appropriate error values.
Crashes.results <- read.csv("Crashes(results).csv")
Crashes.results
## X Technique Error Information
## 1 4 KNN 0.00610687 K=1
## 2 8 Classification Tree 0.01825691 Misclassification
## 3 7 SVM 0.02147861 Misclassification
## 4 1 Linear Regression 0.02383617 MSE
## 5 2 SVM 0.02400970 MSE
## 6 3 Regression Tree 0.02424985 MSE
## 7 5 Naive-Bayes 0.03218897 Misclassification
## 8 6 Logistic Regression 0.29398483 Misclassification
From our findings, the speed zone of the car crash was an important factor in the outcome of the crash. As a result, we can visualise our findings in terms of frequency and proportion in relation to the presence of fatalities.
Crashes.plot <- rbind(subset(Crashes.class,Crashes.class$FATALITY=="Yes") %>%
group_by(SPEED_ZONE,FATALITY) %>% summarise(prop=length(FATALITY)/1310,count=n()),
subset(Crashes.class,Crashes.class$FATALITY=="No") %>%
group_by(SPEED_ZONE,FATALITY) %>% summarise(prop=length(FATALITY)/67779,count=n()))
ggplot(Crashes.plot,aes(x=FATALITY,y=prop,fill=SPEED_ZONE))+geom_bar(stat="identity")+
scale_fill_gradient(low="grey",high="black")+
labs(title="Stacked bar plot showing proportion of speed zone fatalities",
x="Involving fatality",
y="Proportion",
fill="Speed zone") +
theme_bw()
This shows a simple plot of proportional values. The darker colour indicates a higher speed and we can see that a large number of fatalities occur when the speed zone is 100km/h which are generally seen on major highways or freeways. There is a higher frequency of crashes that occur at lower speeds that do not result in any fatalities.
Crashes.plot2 <- Crashes.plot
Crashes.plot2$SPEED_ZONE <- as.factor(Crashes.plot2$SPEED_ZONE)
Crashes.plot2$prop <- as.vector(rbind(18/(18+3798),102/(102+12521),256/(256+24495),88/(88+5225),1/(1+10),207/(207+9947),
13/(13+297),573/(573+10611),52/(52+767),108/108,3798/(18+3798),12521/(102+12521),24495/(256+24495),
5225/(88+5225),10/(10+1),9947/(207+9947),297/(13+297),10611/(573+10611),767/(52+767)))
Crashes.plot2$prop <- round(Crashes.plot2$prop,4)
colnames(Crashes.plot2) <- c("Speed","Fatality","Proportion","Count")
barplot <- ggplot(Crashes.plot2,aes(x=Speed,y=Proportion,fill=Fatality,text=paste("Count:",Count)))+
geom_bar(stat="identity",position="dodge",colour="black",alpha=0.75)+
scale_fill_manual(values=c("dodger blue","#EBEE2D")) + theme_bw()+
labs(title="Proportional bar charts for fatalities across speed zones",
fill="Presence \n of fatality")
ggplotlybar <- ggplotly(barplot,tooltip=c("x","y","fill","text"))
ggplotlybar
This visualisation again shows the proportion of crashes with fatalities but within their speed zone groups. There is a increase of fatality proportions as the speed zone increases with an exception in the 75 km/h bracket. This is because there is only a total of 11 recorded crashes in this speed zone while the neighbouring speeds have values in their thousands.
Crashes.results
## X Technique Error Information
## 1 4 KNN 0.00610687 K=1
## 2 8 Classification Tree 0.01825691 Misclassification
## 3 7 SVM 0.02147861 Misclassification
## 4 1 Linear Regression 0.02383617 MSE
## 5 2 SVM 0.02400970 MSE
## 6 3 Regression Tree 0.02424985 MSE
## 7 5 Naive-Bayes 0.03218897 Misclassification
## 8 6 Logistic Regression 0.29398483 Misclassification
We can see that our test error rates are exceedingly low all except for the logistic regression. The K=1 for our KNN indicates that our predictions and findings produced some errors, most likely due to the nature of the data where many results were 0 or close to. Additionally, the fact that only 1.90% of our observations had the outcome we were searching may have lead to poor predictions.
The logistic regression was the only predictive technique with somewhat reasonable findings.
glm.model <- glm(FATALITY~.,Crashes.class,subset=split_value,family="binomial")
summary(glm.model)
##
## Call:
## glm(formula = FATALITY ~ ., family = "binomial", data = Crashes.class,
## subset = split_value)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6929 -0.1932 -0.1317 -0.1039 3.5066
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.771696 0.190971 -40.696 < 2e-16 ***
## ALCOHOL_RELATEDYes 2.060648 0.090638 22.735 < 2e-16 ***
## SPEED_ZONE 0.044591 0.001862 23.942 < 2e-16 ***
## TOTAL_PERSONS 0.072959 0.017011 4.289 1.80e-05 ***
## BICYCLIST 0.042700 0.161404 0.265 0.791356
## PEDESTRIAN 1.061474 0.079054 13.427 < 2e-16 ***
## MOTORIST 0.362342 0.086425 4.193 2.76e-05 ***
## OLD_DRIVER 0.871025 0.106347 8.190 2.60e-16 ***
## YOUNG_DRIVER -0.274336 0.076533 -3.585 0.000338 ***
## NO_OF_VEHICLES -0.082802 0.050235 -1.648 0.099293 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9448.7 on 51815 degrees of freedom
## Residual deviance: 8246.8 on 51806 degrees of freedom
## AIC: 8266.8
##
## Number of Fisher Scoring iterations: 7
The logistic regression shows that while many of the variables were still present, it had a very high AIC value of 8266.8. The formula for the logistic regression has the ‘highest’ value for the intercept estimate and not any of the predictor variables. This indicates that there were problems with the data or the way it was changed and had interpretations made from it.
Potential ways to fix these problems could have been to change the proportion of fatality vs. no fatality rows since only 1.90% had the target outcome. Additionally, subset selection could have been used or regularisation techniques such as lasso regression to find the best variables. However, many of the variables were factors with many levels and the variables used were essentialy the most useful numeric variables.