## Data Acquisition
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(neuralnet)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(nnet)
#Reading the data:
accidents.df <- read.csv("E:/R/Accidentsnn.csv")
#Selecting the variables:
vars <- c("ALCHL_I", "PROFIL_I_R", "VEH_INVL")
str(accidents.df)
## 'data.frame': 999 obs. of 5 variables:
## $ ALCHL_I : int 2 2 1 2 2 2 2 2 2 2 ...
## $ PROFIL_I_R: int 0 1 0 0 1 0 0 1 1 0 ...
## $ SUR_COND : int 1 1 1 2 1 1 2 2 1 1 ...
## $ VEH_INVL : int 1 1 1 2 2 1 1 1 1 1 ...
## $ MAX_SEV_IR: int 0 2 0 1 1 0 2 1 1 0 ...
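A quick sanity check on the load (illustrative, not part of the original script) confirms the expected dimensions and flags any missing values before we partition:
dim(accidents.df)         #expect 999 rows and 5 columns
sum(is.na(accidents.df))  #count of missing values in the file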
#Partitioning the data 60/40 by row name:
set.seed(122)
training <- sample(row.names(accidents.df), dim(accidents.df)[1] * 0.6)
validation <- setdiff(row.names(accidents.df), training)
With the seed set, we use a 60/40 split of the data (999 × 0.6 ≈ 599), leaving 599 observations in the training set and 400 in the validation set.
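A quick way to confirm those partition sizes (illustrative check, not in the original script):
length(training)    #599 training row names
length(validation)  #400 validation row names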
#When y has multiple classes you need to dummify:
trainData <- cbind(accidents.df[training, vars],
                   class.ind(accidents.df[training, ]$SUR_COND),
                   class.ind(accidents.df[training, ]$MAX_SEV_IR))
names(trainData) <- c(vars,
                      paste("SUR_COND_", c(1, 2, 3, 4, 9), sep = ""),
                      paste("MAX_SEV_IR_", c(0, 1, 2), sep = ""))
validData <- cbind(accidents.df[validation, vars],
                   class.ind(accidents.df[validation, ]$SUR_COND),
                   class.ind(accidents.df[validation, ]$MAX_SEV_IR))
names(validData) <- c(vars,
                      paste("SUR_COND_", c(1, 2, 3, 4, 9), sep = ""),
                      paste("MAX_SEV_IR_", c(0, 1, 2), sep = ""))
dim(trainData)
## [1] 599 11
dim(validData)
## [1] 400 11
Above, we appended indicator (dummy) columns for surface condition (SUR_COND_1 through SUR_COND_9) and injury severity (MAX_SEV_IR_0 through MAX_SEV_IR_2) to both the training and validation sets, giving each 11 columns. The dummy variables are now in place, and the output classes will be wired up in the next step.
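To make the dummification concrete, here is a minimal, standalone illustration of nnet::class.ind(), which converts a categorical vector into a 0/1 indicator matrix with one column per level (the toy vector here is made up for the example):
#Illustration: class.ind() builds one indicator column per level
class.ind(c(0, 1, 2, 1))
##      0 1 2
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [4,] 0 1 0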
#Running nn with 2 hidden nodes:
#Use hidden= with a vector of integers specifying the number of hidden nodes in each layer
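#(Illustrative, not run here: hidden = c(4, 2) would instead fit two hidden
#layers with 4 and 2 nodes, respectively.)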
nn.acc <- neuralnet(MAX_SEV_IR_0 + MAX_SEV_IR_1 + MAX_SEV_IR_2 ~
                      ALCHL_I + PROFIL_I_R + VEH_INVL + SUR_COND_1 + SUR_COND_2 +
                      SUR_COND_3 + SUR_COND_4,
                    data = trainData, hidden = 2, rep = 100, linear.output = FALSE)
#Displaying weights:
#nn.acc$weights
#Displaying predictions:
#prediction(nn.acc)
#Plotting the network (rep = "best" plots the repetition with the lowest error):
plot(nn.acc, rep = "best")
#Scoring the training data (drop SUR_COND_9 and the three MAX_SEV_IR output columns):
training.prediction <- compute(nn.acc, trainData[, -c(8:11)])
#Predicted class = output node with the highest activation (0, 1, or 2):
training.class <- apply(training.prediction$net.result, 1, which.max) - 1
confusionMatrix(as.factor(training.class), as.factor(accidents.df[training, ]$MAX_SEV_IR))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 326 0 18
## 1 0 162 41
## 2 11 7 34
##
## Overall Statistics
##
## Accuracy : 0.8715
## 95% CI : (0.842, 0.8972)
## No Information Rate : 0.5626
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7736
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9674 0.9586 0.36559
## Specificity 0.9313 0.9047 0.96443
## Pos Pred Value 0.9477 0.7980 0.65385
## Neg Pred Value 0.9569 0.9823 0.89214
## Prevalence 0.5626 0.2821 0.15526
## Detection Rate 0.5442 0.2705 0.05676
## Detection Prevalence 0.5743 0.3389 0.08681
## Balanced Accuracy 0.9493 0.9316 0.66501
#Scoring the validation data the same way:
validation.prediction <- compute(nn.acc, validData[, -c(8:11)])
validation.class <- apply(validation.prediction$net.result, 1, which.max) - 1
confusionMatrix(as.factor(validation.class), as.factor(accidents.df[validation, ]$MAX_SEV_IR))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 205 0 19
## 1 0 126 22
## 2 9 4 15
##
## Overall Statistics
##
## Accuracy : 0.865
## 95% CI : (0.8276, 0.8969)
## No Information Rate : 0.535
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7633
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9579 0.9692 0.2679
## Specificity 0.8978 0.9185 0.9622
## Pos Pred Value 0.9152 0.8514 0.5357
## Neg Pred Value 0.9489 0.9841 0.8898
## Prevalence 0.5350 0.3250 0.1400
## Detection Rate 0.5125 0.3150 0.0375
## Detection Prevalence 0.5600 0.3700 0.0700
## Balanced Accuracy 0.9279 0.9439 0.6150
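Before interpreting the results, a quick check (illustrative, not in the original script) confirms that the reported validation accuracy matches the confusion matrix: the diagonal holds the correct predictions, so accuracy is their sum over all 400 validation cases.
#Illustrative check: overall accuracy recomputed from the validation matrix
cm <- table(Prediction = validation.class,
            Reference = accidents.df[validation, ]$MAX_SEV_IR)
sum(diag(cm)) / sum(cm)
## [1] 0.865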
In this model, the seven predictors form the seven nodes of the input layer, and three neurons form the output layer, one for each class (0, 1, and 2), with a single hidden layer of two nodes and rep = 100 (100 training repetitions). The output nodes correspond to MAX_SEV_IR, the injury/fatality variable, with 0 representing no injuries, 1 representing injuries, and 2 representing fatalities.

Looking at the plot of the best repetition, we can trace the fitted weights from the input nodes through the hidden layer to the output nodes. Each hidden and output neuron takes a weighted sum of its incoming values and applies the logistic activation function, 1 / (1 + e^(-x)), where x is that weighted sum; the weights themselves are displayed numerically on the plot. (A small numeric sketch of this activation follows below.)

Turning to the confusion matrices, overall accuracy is around 87% on the training data and 86.5% on the validation data, with predictions shown in the rows and reference (actual) classes in the columns. The true positives on the diagonal and the by-class statistics show that Class 0 predictions are the most accurate, with a much higher detection rate and prevalence than Class 1 or 2, which matches what I would expect. The least accurate predictions are for the fatal class, which also checks out: its prevalence is quite low (lucky for automobile drivers), and its other statistics, such as balanced accuracy, are correspondingly lower. The model's accuracy would be noticeably higher if the Class 2 neuron were set aside, but overall, this model is an effective predictor of accident outcomes.
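As a minimal numeric sketch of that activation (the weights, bias, and inputs here are made up for illustration, not taken from the fitted network):
#Illustration: logistic activation of one neuron on a weighted sum of inputs
logistic <- function(x) 1 / (1 + exp(-x))
w <- c(0.5, -1.2, 0.8)  #hypothetical weights
b <- 0.1                #hypothetical bias
inputs <- c(1, 0, 2)    #hypothetical input values
logistic(b + sum(w * inputs))
## [1] 0.9002495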