Overview

This purpose of this project was to demonstrate appropriate machine learning skills, relevant data wrangling/cleaning and data visualisation techniques in R. The data was obtained from the vic roads open data on crashes that have occurred between the 11th of August 2012 and 11th of August 2017 (https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/crashes-last-five-years). Data was converted from the raw form into a usable version with only the most appropriate variables retained.

The overall aim was to determine which driving conditions and/or elements would lead to road crash involving fatalities.

Reading in the original data

The original data had n=76541 observations with 65 variables. Irrelevant variables such as accident ID, accident status and co-ordinates were removed. Variables which had missing or ‘unknown’ levels had these rows removed which eliminated 7451 observations. The following variables were used and converted to create a smaller version of the original data:

The cleaned version of the data now had n=69089 observations and 10 variables.

Mapping crashes in Victoria

Although the longitude and latitude variables were not included in the final data, they were extracted and used to create point maps around Victoria and Melbourne for crashes involving fatalities. It was hoped that additional insights could be gained from a geographic visualisation.

latvic <- c(-38.6,-34)
longvic <- c(140,149.5)
locvic <- make_bbox(longvic,latvic,f=0)
mapvic <- get_map(locvic)
ggmap(mapvic)
p1 <- ggmap(mapvic)
p1+geom_point(data=Crashes.map,aes(x=LONGITUDE,y=LATITUDE,size=FATALITY),alpha=0.5)+
  labs(title="Point map showing crash fatalities in Victoria",
       size="Number of fatalities",
       x="Longitude",y="Latitude")+
  theme(plot.title = element_text(hjust = 0.5))

Crashes.melb <- Crashes.map[Crashes.map$LONGITUDE>=144.60 & Crashes.map$LONGITUDE<=145.51 &
                   Crashes.map$LATITUDE>=-38.43 & Crashes.map$LATITUDE<=-37.51,]

From this map we can see that there are many crashes with fatalities clustered around area of Melbourne. The crashes taking place in the more rural areas of Victoria seem to take place on major roads such as highways and freeways, as indicated by the solid white lines. However, from this current image the points around Melbourne are not clearly discernible. To remedy this, we can produce another map of the Melbourne Metropolitan Area only.

Crashes.melb <- Crashes.map[Crashes.map$LONGITUDE>=144.60 & Crashes.map$LONGITUDE<=145.51 &
                   Crashes.map$LATITUDE>=-38.43 & Crashes.map$LATITUDE<=-37.51,]

mapmelb <- get_map("Melbourne:Australia",zoom=10)
ggmap(mapmelb)
p2 <- ggmap(mapmelb)
p2+geom_point(data=Crashes.melb,aes(x=LONGITUDE,y=LATITUDE,size=as.factor(FATALITY)),alpha=0.9)+
  labs(title="Point map showing crash fatalities in Greater Metropolitan Area of Melbourne",
       size="Number of fatalities",
       x="Longitude",y="Latitude")+
  theme(plot.title = element_text(hjust = 0.5))}

This close-up map of Melbourne tells a similar picture as many of the fatal car crashes occur on the major roads. Very few fatal crashes seem to occur in the CBD, likely due to lower speeds and the presence of constant traffic. Most crashes also only appear to have one fatality with more severe crashes occurring further out into the suburbs.

Data pre-analysis

Before working with our data, we checked to see if the variables were in working order by looking at their correlation. The only medium-level correlation was between the number of persons and total number of vehicles involved which is somewhat expected given that these variables are very similar in nature. But for the most part, the variables had a small to medium correlation with each other. Since there was such low correlation, we decided that methods such as a principal component analysis would not be necessary.

cor(Crashes[,-1]) > 0.3
##                SPEED_ZONE TOTAL_PERSONS BICYCLIST PEDESTRIAN MOTORIST
## SPEED_ZONE           TRUE         FALSE     FALSE      FALSE    FALSE
## TOTAL_PERSONS       FALSE          TRUE     FALSE      FALSE    FALSE
## BICYCLIST           FALSE         FALSE      TRUE      FALSE    FALSE
## PEDESTRIAN          FALSE         FALSE     FALSE       TRUE    FALSE
## MOTORIST            FALSE         FALSE     FALSE      FALSE     TRUE
## OLD_DRIVER          FALSE         FALSE     FALSE      FALSE    FALSE
## YOUNG_DRIVER        FALSE         FALSE     FALSE      FALSE    FALSE
## NO_OF_VEHICLES      FALSE          TRUE     FALSE      FALSE    FALSE
## FATALITY            FALSE         FALSE     FALSE      FALSE    FALSE
##                OLD_DRIVER YOUNG_DRIVER NO_OF_VEHICLES FATALITY
## SPEED_ZONE          FALSE        FALSE          FALSE    FALSE
## TOTAL_PERSONS       FALSE        FALSE           TRUE    FALSE
## BICYCLIST           FALSE        FALSE          FALSE    FALSE
## PEDESTRIAN          FALSE        FALSE          FALSE    FALSE
## MOTORIST            FALSE        FALSE          FALSE    FALSE
## OLD_DRIVER           TRUE        FALSE          FALSE    FALSE
## YOUNG_DRIVER        FALSE         TRUE          FALSE    FALSE
## NO_OF_VEHICLES      FALSE        FALSE           TRUE    FALSE
## FATALITY            FALSE        FALSE          FALSE     TRUE

Normality checks were attempted but since our data had more than 500 observations, these checks failed. Our summary statistics showed that many of the results in fact did not have a fatality with 1310 having 1 or more (~1.90%). These trends are similar for the other variables and need to be kept in mind when running our predictive analytics.

table(Crashes$FATALITY)
## 
##     0     1     2     3     4     5 
## 67779  1225    72     9     3     1
summary(Crashes$FATALITY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02045 0.00000 5.00000

Predictive techniques

The data will be analysed using a variety of machine learning techniques in order to find the variables most associated with car crash fatalities. Originally, the variable fatality was numeric but by changing it to a “Yes” or “No” variable, we can also allow ourselves to use classification techniques in combination with regression techniques.

The following regression techniques will be used:

The following classification techniques will be used:

We also needed to create our relevant training and test datasets with and empty dataframe to store our findings.

set.seed(123)
smp_size <- floor(0.75*nrow(Crashes))
split_value <- seq(1,smp_size)
train <- Crashes[split_value,]
test <- Crashes[-split_value,]
x_train <- train[,-10]
y_train <- train[,10]
x_test <- test[,-10]
y_test <- test[,10]

Crashes.results <- data.frame(
  Technique=character(),
  Error=numeric(),
  Information=character()
)

Regressions

Our first regression technique was a linear regression and produced a model fit that was significant. Many of the predictor variables were also highly significant.

lm.fit <- lm(FATALITY~.,data=Crashes,subset=split_value)
summary(lm.fit)
lm.pred <- predict(lm.fit,x_test)
mean((lm.pred-y_test)^2)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="Linear Regression",
  Error=mean((lm.pred-y_test)^2),
  Information="MSE"
))

Next we used a SVM technique. Normally, we would tune the model to obtain the best values for our parameters but since our dataset was so long, this would take many hours and is not essential for this project.

library(e1071)
svm.model <- svm(FATALITY~.,data=Crashes,subset=split_value)
summary(svm.model)
svm.pred <- predict(svm.model,x_test)
mean((svm.pred-y_test)^2)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="SVM",
  Error=mean((svm.pred-y_test)^2),
  Information="MSE"
))

Finally, a regression tree was used. A maximal tree was created, pruned using cross-validation and then used to made predictions on number of fatalities.

library(tree)
tree.reg <- tree(FATALITY~.,Crashes,subset=split_value) # Creates maximal tree

cv.reg <- cv.tree(tree.reg) # Cross-validation for optimal tree complexity
summary(cv.reg)
reg.results <- data.frame(cv.reg$size,cv.reg$dev)
prune.reg <- prune.tree(tree.reg,best=reg.results[which.min(cv.reg$dev),1])

tree.reg.pred <- predict(tree.reg,x_test)
mean((tree.reg.pred-y_test)^2)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="Regression Tree",
  Error=mean((tree.reg.pred-y_test)^2),
  Information="MSE"
))

Classifications

We also ran predictions with classification techniques once we had changed the fatality variable from a numeric to a factor. Understanding that some of the nuances in the variable would be lost, these tests were still carried out in the hopes that more information could be obtained.

Crashes.class <- Crashes
Crashes.class$FATALITY[Crashes.class$FATALITY>=1] <- "Yes"
Crashes.class$FATALITY[Crashes.class$FATALITY<1] <- "No"
Crashes.class$FATALITY <- as.factor(Crashes.class$FATALITY)

train.class <- Crashes.class[split_value,]
test.class <- Crashes.class[-split_value,]
x_train.class <- train.class[,-10]
y_train.class <- train.class[,10]
x_test.class <- test.class[,-10]
y_test.class <- test.class[,10]

The first classification technique was KNN but some modifications had to be made. Since there were so few results with “Yes” for fatality, the dataset had to be reduced to 2620 observations. This meant that the original proportion of roughly 1.90% was changed to a flat 50%. This was done because the KNN would not run otherwise as the results were originally too close together to determine neighbours.

library(class)
library(dplyr)
Crashes.knn <- rbind(filter(Crashes,Crashes$FATALITY==0) %>% sample_n(.,1310),
                     filter(Crashes,Crashes$FATALITY>=1))
Crashes.knn$FATALITY[Crashes.knn$FATALITY>=1] <- 1
Crashes.knn <- Crashes.knn[,-1]
table(Crashes.knn$FATALITY)

smp_size.knn <- floor(0.75*nrow(Crashes.knn))
split_value.knn <- sample(seq_len(nrow(Crashes.knn)),size=smp_size.knn,replace=F) 
train.knn <- Crashes.knn[split_value.knn,]
test.knn <- Crashes.knn[-split_value.knn,]
knn.factor <- train.knn$FATALITY

knn.results <- data.frame(K=integer(),Misclassification=numeric(),Information=character())

for(i in 1:8){
  knn.fit <- knn(train.knn,test.knn,knn.factor,k=i,prob=TRUE)
  misclass <- mean(knn.fit!=test.knn$FATALITY)
  single.results <- data.frame(K=i,Misclassification=misclass)
  knn.results <- rbind(knn.results,single.results)
}

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="KNN",
  Error=knn.results$Misclassification[which.min(knn.results$Misclassification)],
  Information=as.character(paste0("K=",knn.results$K[which.min(knn.results$Misclassification)]))
))

A Naive-Bayes prediction for the categorical variable we have for fatalities. The function can be found within the e1071 library, the same library used for the svm function.

nb.model <- naiveBayes(FATALITY~.,Crashes.class,subset=split_value)
nb.pred <- predict(nb.model,x_test.class)
mean(nb.pred!=y_test.class)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="Naive-Bayes",
  Error=mean(nb.pred!=y_test.class),
  Information="Misclassification"
))

The logistic regression technique was used next and a naive bayes classifier of 0.5 was used to predict group classification.

glm.model <- glm(FATALITY~.,Crashes.class,subset=split_value,family="binomial")
glm.pred <- predict(glm.model,x_test.class,type="response")
glm.pred <- ifelse(glm.pred<=0.5,"No","Yes")
mean(glm.pred!=y_test.class)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="Logistic Regression",
  Error=mean(glm.pred!=y_test.class),
  Information="Misclassification"
))

Next, a svm technique was used and similar to the regression svm, tuning would normally be done but the code was running for many many hours without completion so it was omitted from this project.

svm.class <- svm(FATALITY~.,data=Crashes.class,subset=split_value)
summary(svm.class)
svm.pred.class <- predict(svm.class,x_test.class)
mean(svm.pred.class!=y_test.class)

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="SVM",
  Error=mean(svm.pred.class!=y_test.class),
  Information="Misclassification"
))

Finally, a classification tree technique was used with methods similar to the regression tree used earlier.

class.tree <- tree(FATALITY~.,data=Crashes.class,subset=split_value)
summary(class.tree) # Variable used are Speed zone and Alcohol
ctree.prune <- cv.tree(class.tree)
which.min(ctree.prune$dev) # size of 4

pruned.ctree <- prune.misclass(class.tree,best=4)
pruned.results <- summary(pruned.ctree)
pruned.results$misclass[1]/pruned.results$misclass[2]

Crashes.results <- rbind(Crashes.results,data.frame(
  Technique="Classification Tree",
  Error=pruned.results$misclass[1]/pruned.results$misclass[2],
  Information="Misclassification"
))

We can now see the full dataframe with all of our findings with the appropriate error values.

Crashes.results <- read.csv("Crashes(results).csv")
Crashes.results
##   X           Technique      Error       Information
## 1 4                 KNN 0.00610687               K=1
## 2 8 Classification Tree 0.01825691 Misclassification
## 3 7                 SVM 0.02147861 Misclassification
## 4 1   Linear Regression 0.02383617               MSE
## 5 2                 SVM 0.02400970               MSE
## 6 3     Regression Tree 0.02424985               MSE
## 7 5         Naive-Bayes 0.03218897 Misclassification
## 8 6 Logistic Regression 0.29398483 Misclassification

Visualisations of data

From our findings, the speed zone of the car crash was an important factor in the outcome of the crash. As a result, we can visualise our findings in terms of frequency and proportion in relation to the presence of fatalities.

Crashes.plot <- rbind(subset(Crashes.class,Crashes.class$FATALITY=="Yes") %>%
  group_by(SPEED_ZONE,FATALITY) %>% summarise(prop=length(FATALITY)/1310,count=n()),
subset(Crashes.class,Crashes.class$FATALITY=="No") %>%
  group_by(SPEED_ZONE,FATALITY) %>% summarise(prop=length(FATALITY)/67779,count=n()))

ggplot(Crashes.plot,aes(x=FATALITY,y=prop,fill=SPEED_ZONE))+geom_bar(stat="identity")+
  scale_fill_gradient(low="grey",high="black")+
  labs(title="Stacked bar plot showing proportion of speed zone fatalities",
       x="Involving fatality",
       y="Proportion",
       fill="Speed zone") +
  theme_bw()

This shows a simple plot of proportional values. The darker colour indicates a higher speed and we can see that a large number of fatalities occur when the speed zone is 100km/h which are generally seen on major highways or freeways. There is a higher frequency of crashes that occur at lower speeds that do not result in any fatalities.

Crashes.plot2 <- Crashes.plot
Crashes.plot2$SPEED_ZONE <- as.factor(Crashes.plot2$SPEED_ZONE)
Crashes.plot2$prop <- as.vector(rbind(18/(18+3798),102/(102+12521),256/(256+24495),88/(88+5225),1/(1+10),207/(207+9947),
  13/(13+297),573/(573+10611),52/(52+767),108/108,3798/(18+3798),12521/(102+12521),24495/(256+24495),
  5225/(88+5225),10/(10+1),9947/(207+9947),297/(13+297),10611/(573+10611),767/(52+767)))
Crashes.plot2$prop <- round(Crashes.plot2$prop,4)
colnames(Crashes.plot2) <- c("Speed","Fatality","Proportion","Count")

barplot <- ggplot(Crashes.plot2,aes(x=Speed,y=Proportion,fill=Fatality,text=paste("Count:",Count)))+
  geom_bar(stat="identity",position="dodge",colour="black",alpha=0.75)+
  scale_fill_manual(values=c("dodger blue","#EBEE2D")) + theme_bw()+
  labs(title="Proportional bar charts for fatalities across speed zones",
       fill="Presence \n of fatality")
ggplotlybar <- ggplotly(barplot,tooltip=c("x","y","fill","text"))
ggplotlybar

This visualisation again shows the proportion of crashes with fatalities but within their speed zone groups. There is a increase of fatality proportions as the speed zone increases with an exception in the 75 km/h bracket. This is because there is only a total of 11 recorded crashes in this speed zone while the neighbouring speeds have values in their thousands.

Interpretation of findings

Crashes.results
##   X           Technique      Error       Information
## 1 4                 KNN 0.00610687               K=1
## 2 8 Classification Tree 0.01825691 Misclassification
## 3 7                 SVM 0.02147861 Misclassification
## 4 1   Linear Regression 0.02383617               MSE
## 5 2                 SVM 0.02400970               MSE
## 6 3     Regression Tree 0.02424985               MSE
## 7 5         Naive-Bayes 0.03218897 Misclassification
## 8 6 Logistic Regression 0.29398483 Misclassification

We can see that our test error rates are exceedingly low all except for the logistic regression. The K=1 for our KNN indicates that our predictions and findings produced some errors, most likely due to the nature of the data where many results were 0 or close to. Additionally, the fact that only 1.90% of our observations had the outcome we were searching may have lead to poor predictions.

The logistic regression was the only predictive technique with somewhat reasonable findings.

glm.model <- glm(FATALITY~.,Crashes.class,subset=split_value,family="binomial")
summary(glm.model)
## 
## Call:
## glm(formula = FATALITY ~ ., family = "binomial", data = Crashes.class, 
##     subset = split_value)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6929  -0.1932  -0.1317  -0.1039   3.5066  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -7.771696   0.190971 -40.696  < 2e-16 ***
## ALCOHOL_RELATEDYes  2.060648   0.090638  22.735  < 2e-16 ***
## SPEED_ZONE          0.044591   0.001862  23.942  < 2e-16 ***
## TOTAL_PERSONS       0.072959   0.017011   4.289 1.80e-05 ***
## BICYCLIST           0.042700   0.161404   0.265 0.791356    
## PEDESTRIAN          1.061474   0.079054  13.427  < 2e-16 ***
## MOTORIST            0.362342   0.086425   4.193 2.76e-05 ***
## OLD_DRIVER          0.871025   0.106347   8.190 2.60e-16 ***
## YOUNG_DRIVER       -0.274336   0.076533  -3.585 0.000338 ***
## NO_OF_VEHICLES     -0.082802   0.050235  -1.648 0.099293 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9448.7  on 51815  degrees of freedom
## Residual deviance: 8246.8  on 51806  degrees of freedom
## AIC: 8266.8
## 
## Number of Fisher Scoring iterations: 7

The logistic regression shows that while many of the variables were still present, it had a very high AIC value of 8266.8. The formula for the logistic regression has the ‘highest’ value for the intercept estimate and not any of the predictor variables. This indicates that there were problems with the data or the way it was changed and had interpretations made from it.

Potential ways to fix these problems could have been to change the proportion of fatality vs. no fatality rows since only 1.90% had the target outcome. Additionally, subset selection could have been used or regularisation techniques such as lasso regression to find the best variables. However, many of the variables were factors with many levels and the variables used were essentialy the most useful numeric variables.