We will use a new package, the NeuralNetTools package, to visualize our ANN. If you do not already have the NeuralNetTools package installed, you will first install it using the install.packages() function.
install.packages("NeuralNetTools")
In this lesson, we will also use the DescTools and caret packages. Next, we load the three packages for use in the session.
library(DescTools)
library(caret)
library(NeuralNetTools)
In the lesson that follows, we will use the BostonHousing.csv file. The famous Boston Housing dataset contains data about census tracts in Boston and the value of homes in each tract. The variable of interest is median_val, which indicates whether the median value of occupied homes in the tract is greater than $30,000 (Above) or less than $30,000 (Below). The census bureau wants to create a predictive model to predict the median_val variable for new census tracts.
The variables include:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitric oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centers
- rad: index of accessibility to radial highways
- tax: full-value property tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- b: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- median_val: whether the median home value is Above or Below $30,000 (our target)
We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.
BH <- read.csv(file = "BostonHousing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the BH dataframe to look at the variable types and to check for missing (NA) values.
Abstract(BH)
## ------------------------------------------------------------------------------
## BH
##
## data frame: 505 obs. of 13 variables
## 505 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 crim numeric .
## 2 zn numeric .
## 3 indus numeric .
## 4 chas integer .
## 5 nox numeric .
## 6 rm numeric .
## 7 age numeric .
## 8 dis numeric .
## 9 rad integer .
## 10 tax integer .
## 11 ptratio numeric .
## 12 b numeric .
## 13 median_val character .
Next, we can convert our target class variable that we want to predict, median_val, to a nominal factor variable.
BH$median_val <- factor(x = BH$median_val)
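Note that factor() orders levels alphabetically by default, so "Above" will be the first level of median_val; we can confirm this with the levels() function. This is relevant later, when we specify "Above" as the positive class in confusionMatrix().
levels(x = BH$median_val) # "Above" is listed first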
We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more Below median census tracts than Above.
plot(BH$median_val,
main = "Median Value")
All of our potential predictors are numeric, but based on the data description, chas is categorical (nominal) and rad could either be treated as categorical or numeric. We can keep chas as-is, since it is already binary. We will also keep rad as-is, and use the numeric representation of the categorical (ordinal) variable as input to our model.
We know that ANNs can handle redundant variables, but missing values must be handled, categorical variables must be binarized, and numeric variables should be rescaled.
To handle missing values, we can remove incomplete observations (na.omit()) or perform imputation. We can visualize any missing values using the PlotMiss() function from the DescTools package; as the Abstract() output above showed, this dataset has no missing values.
PlotMiss(x = BH)
To binarize categorical variables, we can use the class2ind() function from the caret package for categorical variables with 2 class levels and the dummyVars() (and predict()) functions from caret for categorical variables with more than 2 class levels. The chas variable is already binary, so it will be kept as-is. For ordinal variables, we can preserve the ordering by converting them from ordinal factors to numeric using the as.numeric() function. Since rad is already numeric, we keep it as-is.
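As an illustration only (hypothetical, since chas is already 0/1 and rad is already numeric), binarizing might look like the following sketch.
# Hypothetical sketch: binarize a two-level factor with class2ind()
chas_factor <- factor(x = BH$chas) # treat chas as a nominal factor
chas_ind <- class2ind(x = chas_factor, drop2nd = TRUE) # one 0/1 column

# Hypothetical sketch: binarize a multi-level nominal factor with
# dummyVars() and predict()
rad_df <- data.frame(rad = factor(x = BH$rad))
dums <- dummyVars(formula = ~ rad, data = rad_df)
rad_dums <- predict(object = dums, newdata = rad_df) # one column per level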
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility. We use the BH dataframe to create our train and test sets.
set.seed(831) # initialize random seed
sub <- createDataPartition(y = BH$median_val, # target variable
p = 0.80, # proportion in train
list = FALSE)
Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- BH[sub, ]
test <- BH[-sub, ]
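As a quick, optional check, we can confirm that the stratified split preserved the class distribution of median_val in both sets.
prop.table(x = table(train$median_val)) # class proportions in train
prop.table(x = table(test$median_val)) # class proportions in test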
Since there is no true default ANN model, and we must choose the number of hidden nodes in the hidden layer with no default rule to guide us, we go straight to hyperparameter tuning to find the optimal number of hidden nodes and weight decay.
We can use the train() function from the caret package to tune our hyperparameters. Here, we will use the nnet package (method = "nnet"). We can tune the size and decay hyperparameters.
- size: the number of nodes in the hidden layer. Note: there can only be one hidden layer when using nnet.
- decay: weight decay, a regularization parameter that helps avoid overfitting by adding a penalty for complexity.
We will use a grid search and 5-fold cross validation repeated 3 times.
First, we set up the grid using the expand.grid() function. We will consider hidden node sizes (size) of 1, 3, 5 and 7 and decay values ranging from 0 to 0.1 in 0.01 increments.
grids <- expand.grid(size = seq(from = 1,
to = 7,
by = 2),
decay = seq(from = 0,
to = 0.1,
by = 0.01))
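This gives us 4 values of size (1, 3, 5, 7) and 11 values of decay (0, 0.01, ..., 0.1), or 4 x 11 = 44 hyperparameter combinations in total.
nrow(grids) # 44 combinations to evaluate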
Next, we set up our control object for input in the train() function for the trControl argument.
ctrl <- trainControl(method = "repeatedcv",
number = 5, # 5 folds
repeats = 3, # 3 repeats
search = "grid") # grid search
Next, we initialize a random seed for our resampling.
set.seed(831)
Then, we use the train() function to train the ANN model using 5-Fold Cross Validation (repeated 3 times) to search over the hyperparameter grid (grids). We use the preProcess argument to specify that we want to apply min-max (range) normalization to the numeric variables in our train data. We set trace = FALSE to suppress output from each of the iterations of the algorithm.
annMod <- train(form = median_val ~., # use all other variables to predict target
data = train, # training data
preProcess = "range", # apply min-max normalization
method = "nnet", # use nnet()
trControl = ctrl,
tuneGrid = grids, # search over the created grid
trace = FALSE) # suppress output
We can view the Accuracy and Kappa across our hyperparameter grid and obtain the optimal values of size and decay.
annMod
## Neural Network
##
## 405 samples
## 12 predictor
## 2 classes: 'Above', 'Below'
##
## Pre-processing: re-scaling to [0, 1] (12)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0.00 0.9292934 0.7172720
## 1 0.01 0.9333384 0.7549879
## 1 0.02 0.9333180 0.7546748
## 1 0.03 0.9341616 0.7568195
## 1 0.04 0.9325053 0.7507855
## 1 0.05 0.9341616 0.7551163
## 1 0.06 0.9341514 0.7546235
## 1 0.07 0.9333283 0.7519121
## 1 0.08 0.9358180 0.7575649
## 1 0.09 0.9349847 0.7550847
## 1 0.10 0.9349950 0.7534247
## 3 0.00 0.9357967 0.7585337
## 3 0.01 0.9407565 0.7782681
## 3 0.02 0.9481236 0.8057421
## 3 0.03 0.9481441 0.8075790
## 3 0.04 0.9481444 0.8028828
## 3 0.05 0.9440287 0.7878713
## 3 0.06 0.9431953 0.7840357
## 3 0.07 0.9423723 0.7814717
## 3 0.08 0.9407262 0.7751457
## 3 0.09 0.9399031 0.7711532
## 3 0.10 0.9374440 0.7615445
## 5 0.00 0.9317025 0.7488274
## 5 0.01 0.9415796 0.7791584
## 5 0.02 0.9465083 0.8015618
## 5 0.03 0.9506236 0.8132657
## 5 0.04 0.9464777 0.7966578
## 5 0.05 0.9456645 0.7937885
## 5 0.06 0.9456544 0.7941847
## 5 0.07 0.9423723 0.7810426
## 5 0.08 0.9407262 0.7751457
## 5 0.09 0.9399031 0.7711532
## 5 0.10 0.9374440 0.7615445
## 7 0.00 0.9292329 0.7470980
## 7 0.01 0.9448419 0.7932675
## 7 0.02 0.9465078 0.7986076
## 7 0.03 0.9522596 0.8196623
## 7 0.04 0.9497802 0.8088619
## 7 0.05 0.9464980 0.7966562
## 7 0.06 0.9440083 0.7883038
## 7 0.07 0.9431953 0.7847887
## 7 0.08 0.9407262 0.7751457
## 7 0.09 0.9390901 0.7683924
## 7 0.10 0.9382671 0.7640045
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.03.
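We can also visualize the resampled Accuracy across the tuning grid and pull the winning combination directly from the train object.
plot(annMod) # resampled Accuracy across the tuning grid
annMod$bestTune # optimal combination: size = 7, decay = 0.03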
We can visualize the tuned ANN model using the plotnet() function in the NeuralNetTools package. By setting pos_col = "darkgreen", weights that are positive will display in dark green. By setting neg_col = "darkred", weights that are negative will display in dark red. The size of the connections indicates the size of the weight.
plotnet(mod_in = annMod$finalModel, # nnet object
pos_col = "darkgreen", # positive weights are shown in green
neg_col = "darkred", # negative weights are shown in red
bias = FALSE, # do not plot bias
circle_cex = 4, # reduce circle size (default is 5)
cex_val = 0.6) # reduce text label size (default is 1)
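NeuralNetTools also provides variable importance methods for the fitted nnet object. For instance, garson() implements Garson's algorithm (olden() is an alternative that preserves the sign of the input-output relationships). A minimal sketch:
garson(mod_in = annMod$finalModel) # relative importance of each input variable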
We use the predict() function to obtain class predictions for our target variable, median_val, using the ANN model.
tune.tr.preds <- predict(object = annMod, # tuned model
newdata = train) # training data
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the training dataset (train).
tune_tr_conf <- confusionMatrix(data = tune.tr.preds, # predictions
reference = train$median_val, # actual
positive = "Above",
mode = "everything")
We will wait to view the output until we consider goodness of fit.
We use the predict() function to generate class predictions for our testing data set and evaluate model performance.
tune.te.preds <- predict(object = annMod, # tuned model
newdata = test) # testing data
Next, we get performance measures using the confusionMatrix() function.
tune_te_conf <- confusionMatrix(data = tune.te.preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
tune_te_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Above Below
## Above 11 3
## Below 5 81
##
## Accuracy : 0.92
## 95% CI : (0.8484, 0.9648)
## No Information Rate : 0.84
## P-Value [Acc > NIR] : 0.01474
##
## Kappa : 0.6865
##
## Mcnemar's Test P-Value : 0.72367
##
## Sensitivity : 0.6875
## Specificity : 0.9643
## Pos Pred Value : 0.7857
## Neg Pred Value : 0.9419
## Precision : 0.7857
## Recall : 0.6875
## F1 : 0.7333
## Prevalence : 0.1600
## Detection Rate : 0.1100
## Detection Prevalence : 0.1400
## Balanced Accuracy : 0.8259
##
## 'Positive' Class : Above
##
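To make the output concrete, we can recover some of these measures directly from the confusion matrix. With Above as the positive class, TP = 11, FN = 5, FP = 3 and TN = 81:
11 / (11 + 5) # Sensitivity (Recall) = TP / (TP + FN) = 0.6875
81 / (81 + 3) # Specificity = TN / (TN + FP) = 0.9643
(11 + 81) / 100 # Accuracy = (TP + TN) / n = 0.92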
Based on the output, the model has high Accuracy and a good Kappa value. Based on the Specificity, the model does a very good job of predicting the negative class, Below. However, based on the Sensitivity/Recall and F1 measure, it struggles more to predict the positive class, Above, which is our class level of interest. This is likely due to the class imbalance present in the data, with most of our observations belonging to the majority class, Below.
To assess whether the model is balanced, underfitting or overfitting, we compare the performance on the training and testing sets. We can use the cbind() function to compare them side-by-side.
Overall
cbind(Training = tune_tr_conf$overall,
Testing = tune_te_conf$overall)
## Training Testing
## Accuracy 9.827160e-01 0.92000000
## Kappa 9.370350e-01 0.68652038
## AccuracyLower 9.647141e-01 0.84844236
## AccuracyUpper 9.930234e-01 0.96482844
## AccuracyNull 8.320988e-01 0.84000000
## AccuracyPValue 2.349860e-23 0.01474205
## McnemarPValue 4.496918e-01 0.72367361
Based on the overall output, we see that the training performance is almost perfect. The testing Accuracy is fairly similar, but there is a big difference between the training and testing sets for Kappa. This suggests that the model is overfitting.
By Class
cbind(Training = tune_tr_conf$byClass,
Testing = tune_te_conf$byClass)
## Training Testing
## Sensitivity 0.9264706 0.6875000
## Specificity 0.9940653 0.9642857
## Pos Pred Value 0.9692308 0.7857143
## Neg Pred Value 0.9852941 0.9418605
## Precision 0.9692308 0.7857143
## Recall 0.9264706 0.6875000
## F1 0.9473684 0.7333333
## Prevalence 0.1679012 0.1600000
## Detection Rate 0.1555556 0.1100000
## Detection Prevalence 0.1604938 0.1400000
## Balanced Accuracy 0.9602679 0.8258929
Based on the class-level output, we see a big difference between the training and testing performance across the class-level measures, confirming that the model is overfitting.
Based on our conclusion that the model is overfitting, we need to reduce the flexibility of our model. To improve the fit of the model, we can increase the training set size, increase regularization (i.e., use a larger decay value) and employ feature selection methods. We can also use resampling methods to handle the class imbalance present in the data.
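For example (a sketch only, not retuned here), caret's trainControl() accepts a sampling argument that rebalances the classes within each resampling iteration; we could then rerun train() with this control object in place of ctrl.
ctrl_bal <- trainControl(method = "repeatedcv",
                         number = 5, # 5 folds
                         repeats = 3, # 3 repeats
                         search = "grid", # grid search
                         sampling = "up") # upsample the minority class (Above)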