Preliminary

We will use a new package, NeuralNetTools, to visualize our ANN. If you do not already have NeuralNetTools installed, first install it using the install.packages() function.

install.packages("NeuralNetTools")

In this lesson, we will also use the DescTools and caret packages. Next, we load the three packages for use in the session.

library(DescTools)
library(caret)
library(NeuralNetTools)

In the lesson that follows we will use the BostonHousing.csv file. The famous Boston Housing dataset contains data about census tracts in Boston. The variable of interest is median_val, which indicates whether the median value of owner-occupied homes in the tract is above (Above) or below (Below) the overall median value of $30,000. The census bureau wants to create a predictive model for the median_val variable for new census tracts.

The variables include:

  • crim: Per capita crime rate by town
  • zn: Proportion of residential land zoned for lots over 25,000 sq. ft.
  • indus: Proportion of nonretail business acres per town
  • chas: Charles River dummy variable (1 if tract bounds river, 0 otherwise)
  • nox: Nitric oxide concentration (parts per 10 million)
  • rm: Average number of rooms per dwelling
  • age: Proportion of owner-occupied units built prior to 1940
  • dis: Weighted distances to five Boston employment centers
  • rad: Index of accessibility to radial highways (1-8,24)
  • tax: Full-value property-tax rate per $10,000
  • ptratio: Pupil/teacher ratio by town
  • b: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • median_val: Above if the median home value is above the overall median, $30k, otherwise Below

We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.

BH <- read.csv(file = "BostonHousing.csv",
                   stringsAsFactors = FALSE)

Data Exploration & Variable Preparation

First, we can obtain high-level information about the BH dataframe to look at the variable types and to check for missing (NA) values.

Abstract(BH)
## ------------------------------------------------------------------------------ 
## BH
## 
## data frame:  505 obs. of  13 variables
##      505 complete cases (100.0%)
## 
##   Nr  ColName     Class      NAs  Levels
##   1   crim        numeric    .          
##   2   zn          numeric    .          
##   3   indus       numeric    .          
##   4   chas        integer    .          
##   5   nox         numeric    .          
##   6   rm          numeric    .          
##   7   age         numeric    .          
##   8   dis         numeric    .          
##   9   rad         integer    .          
##   10  tax         integer    .          
##   11  ptratio     numeric    .          
##   12  b           numeric    .          
##   13  median_val  character  .

Preparing Target (Y) Variables

Next, we convert our target class variable, median_val, the variable that we want to predict, to a nominal factor variable.

BH$median_val <- factor(x = BH$median_val)

We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more Below median census tracts than Above.

plot(BH$median_val, 
     main = "Median Value")

Preparing Predictor (X) Variables

All of our potential predictors are numeric, but based on the data description, chas is categorical (nominal) and rad could be treated as either categorical or numeric. We keep chas as-is, since it is already binary. We also keep rad as-is, using the numeric representation of this ordinal variable as input to our model.

Data Preprocessing & Transformation

We know that ANNs can handle redundant variables, but missing values must be handled, categorical variables must be binarized, and numeric variables should be rescaled.

  1. Missing Values. First, we check for missing values. If any are present, we can either remove them row-wise (na.omit()) or perform imputation.
PlotMiss(x = BH)

  2. Binarization of Categorical Variables. If categorical predictor (X) variables are used in the analysis, they must be converted to binary variables (if they are not already binary, 0/1). We can use the class2ind() function from the caret package for categorical variables with 2 class levels, and the dummyVars() (and predict()) functions from the caret package for categorical variables with more than 2 class levels; a sketch follows this list. The chas variable is already binary, so it will be kept as-is.

For ordinal variables, we can preserve the ordering by converting them from ordinal factors to numeric using the as.numeric() function. Since rad is already numeric, we keep it as-is.

  3. Rescale Numeric Variables. We will apply min-max (range) normalization to the numeric variables during hyperparameter tuning.
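
For illustration only, here is a minimal sketch of what binarization would look like if rad were instead treated as a nominal factor (the names rad_fac, dv, and rad_dummies are hypothetical; in this lesson we keep rad numeric and chas as-is):

rad_fac <- factor(BH$rad) # hypothetical: treat rad as nominal
dv <- dummyVars(formula = ~ rad_fac, # variable(s) to binarize
                data = data.frame(rad_fac))
rad_dummies <- predict(object = dv, # dummyVars object
                       newdata = data.frame(rad_fac)) # one 0/1 column per level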

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility. We use the BH dataframe to create our train and test sets.

set.seed(831) # initialize random seed
sub <- createDataPartition(y = BH$median_val, # target variable
                           p = 0.80, # proportion in train
                           list = FALSE)

Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- BH[sub, ] 
test <- BH[-sub, ]
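
As an optional sanity check (not part of the original lesson), we can verify that the split preserved the class distribution of median_val:

prop.table(table(train$median_val)) # class proportions in train
prop.table(table(test$median_val)) # class proportions in test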

Analysis

Since there is no true default ANN model and no standard rule to guide the choice of the number of hidden nodes, we go straight to hyperparameter tuning to find the optimal number of hidden nodes and weight decay.

We can use the train() function from the caret package to tune our hyperparameters. Here, we will use the nnet package (method = "nnet"). We can tune the size and decay hyperparameters.

  • size: the number of nodes in the hidden layer. Note: nnet supports only one hidden layer.
  • decay: weight decay, a regularization parameter that adds a penalty for complexity to avoid overfitting.

We will use a grid search and 5-fold cross-validation repeated 3 times.

First, we set up the grid using the expand.grid() function. We will consider hidden node sizes (size) of 1, 3, 5, and 7, and decay values ranging from 0 to 0.1 in 0.01 increments.

grids <-  expand.grid(size = seq(from = 1, 
                                 to = 7, 
                                 by = 2),
                      decay = seq(from = 0,
                                  to = 0.1,
                                  by = 0.01))
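
The grid contains 4 size values x 11 decay values, or 44 candidate combinations, which we can confirm if desired:

nrow(grids) # 44 hyperparameter combinations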

Next, we set up our control object for input in the train() function for the trControl argument.

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # 5 folds
                     repeats = 3, # 3 repeats
                     search = "grid") # grid search

Next, we initialize a random seed for our resampling.

set.seed(831)

Then, we use the train() function to train the ANN model using 5-Fold Cross Validation (repeated 3 times) to search over the hyperparameter grid (grids). We use the preProcess argument to specify that we want to apply min-max (range) normalization to the numeric variables in our train data. We set trace = FALSE to suppress output from each of the iterations of the algorithm.

annMod <- train(form = median_val ~., # use all other variables to predict target
                data = train, # training data
                preProcess = "range", # apply min-max normalization
                method = "nnet", # use nnet()
                trControl = ctrl, 
                tuneGrid = grids, # search over the created grid
                trace = FALSE) # suppress output

We can view the Accuracy and Kappa across our hyperparameter grid and obtain the optimal values of size and decay.

annMod
## Neural Network 
## 
## 405 samples
##  12 predictor
##   2 classes: 'Above', 'Below' 
## 
## Pre-processing: re-scaling to [0, 1] (12) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0.00   0.9292934  0.7172720
##   1     0.01   0.9333384  0.7549879
##   1     0.02   0.9333180  0.7546748
##   1     0.03   0.9341616  0.7568195
##   1     0.04   0.9325053  0.7507855
##   1     0.05   0.9341616  0.7551163
##   1     0.06   0.9341514  0.7546235
##   1     0.07   0.9333283  0.7519121
##   1     0.08   0.9358180  0.7575649
##   1     0.09   0.9349847  0.7550847
##   1     0.10   0.9349950  0.7534247
##   3     0.00   0.9357967  0.7585337
##   3     0.01   0.9407565  0.7782681
##   3     0.02   0.9481236  0.8057421
##   3     0.03   0.9481441  0.8075790
##   3     0.04   0.9481444  0.8028828
##   3     0.05   0.9440287  0.7878713
##   3     0.06   0.9431953  0.7840357
##   3     0.07   0.9423723  0.7814717
##   3     0.08   0.9407262  0.7751457
##   3     0.09   0.9399031  0.7711532
##   3     0.10   0.9374440  0.7615445
##   5     0.00   0.9317025  0.7488274
##   5     0.01   0.9415796  0.7791584
##   5     0.02   0.9465083  0.8015618
##   5     0.03   0.9506236  0.8132657
##   5     0.04   0.9464777  0.7966578
##   5     0.05   0.9456645  0.7937885
##   5     0.06   0.9456544  0.7941847
##   5     0.07   0.9423723  0.7810426
##   5     0.08   0.9407262  0.7751457
##   5     0.09   0.9399031  0.7711532
##   5     0.10   0.9374440  0.7615445
##   7     0.00   0.9292329  0.7470980
##   7     0.01   0.9448419  0.7932675
##   7     0.02   0.9465078  0.7986076
##   7     0.03   0.9522596  0.8196623
##   7     0.04   0.9497802  0.8088619
##   7     0.05   0.9464980  0.7966562
##   7     0.06   0.9440083  0.7883038
##   7     0.07   0.9431953  0.7847887
##   7     0.08   0.9407262  0.7751457
##   7     0.09   0.9390901  0.7683924
##   7     0.10   0.9382671  0.7640045
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.03.
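
As an optional step (not shown in the original output), the tuning results can also be inspected graphically using caret's plot method for train objects:

plot(annMod) # Accuracy across size and decay values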

We can visualize the tuned ANN model using the plotnet() function from the NeuralNetTools package. By setting pos_col = "darkgreen", positive weights will display in dark green. By setting neg_col = "darkred", negative weights will display in dark red. The thickness of each connection indicates the magnitude of its weight.

plotnet(mod_in = annMod$finalModel, # nnet object
        pos_col = "darkgreen", # positive weights are shown in green
        neg_col = "darkred", # negative weights are shown in red
        bias = FALSE, # do not plot bias
        circle_cex = 4, # reduce circle size (default is 5)
        cex_val = 0.6) # reduce text label size (default is 1)

Model Performance & Fit

Training Performance

We use the predict() function to obtain class predictions for our target variable, median_val, using the ANN model.

tune.tr.preds <- predict(object = annMod, # tuned model
                         newdata = train) # training data
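
By default, predict() on a caret train object returns class labels. If predicted probabilities are needed instead (an optional variation; the name tune.tr.probs is hypothetical), we can set type = "prob":

tune.tr.probs <- predict(object = annMod, # tuned model
                         newdata = train, # training data
                         type = "prob") # class probabilities instead of labels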

We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the training dataset (train).

tune_tr_conf <- confusionMatrix(data = tune.tr.preds, # predictions
                                reference = train$median_val, # actual
                                positive = "Above",
                                mode = "everything")

We will wait to view the output until we consider goodness of fit.

Testing Performance

We use the predict() function to generate class predictions for our testing data set and evaluate model performance.

tune.te.preds <- predict(object = annMod, # tuned model
                         newdata = test) # testing data

Next, we get performance measures using the confusionMatrix() function.

tune_te_conf <- confusionMatrix(data = tune.te.preds, # predictions
                                reference = test$median_val, # actual
                                positive = "Above",
                                mode = "everything")
tune_te_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Above Below
##      Above    11     3
##      Below     5    81
##                                           
##                Accuracy : 0.92            
##                  95% CI : (0.8484, 0.9648)
##     No Information Rate : 0.84            
##     P-Value [Acc > NIR] : 0.01474         
##                                           
##                   Kappa : 0.6865          
##                                           
##  Mcnemar's Test P-Value : 0.72367         
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.9643          
##          Pos Pred Value : 0.7857          
##          Neg Pred Value : 0.9419          
##               Precision : 0.7857          
##                  Recall : 0.6875          
##                      F1 : 0.7333          
##              Prevalence : 0.1600          
##          Detection Rate : 0.1100          
##    Detection Prevalence : 0.1400          
##       Balanced Accuracy : 0.8259          
##                                           
##        'Positive' Class : Above           
## 

Based on the output, the model has high Accuracy and a good Kappa value. Based on the Specificity, the model does a very good job of predicting the negative class, Below, but, based on the Sensitivity/Recall and F1 measure, it struggles more to predict the positive class, Above, which is our class level of interest. This is likely due to the class imbalance present, with most of our observations belonging to the majority class, Below.

Goodness of Fit

To assess whether the model is balanced, underfitting, or overfitting, we compare its performance on the training and testing sets. We can use the cbind() function to compare the two side-by-side.

Overall

cbind(Training = tune_tr_conf$overall,
      Testing = tune_te_conf$overall)
##                    Training    Testing
## Accuracy       9.827160e-01 0.92000000
## Kappa          9.370350e-01 0.68652038
## AccuracyLower  9.647141e-01 0.84844236
## AccuracyUpper  9.930234e-01 0.96482844
## AccuracyNull   8.320988e-01 0.84000000
## AccuracyPValue 2.349860e-23 0.01474205
## McnemarPValue  4.496918e-01 0.72367361

Based on the overall output, we see that the training performance is nearly perfect. The testing Accuracy is reasonably close, but there is a large gap between the training and testing Kappa values. This suggests that the model is overfitting.

By Class

cbind(Training = tune_tr_conf$byClass,
      Testing = tune_te_conf$byClass)
##                       Training   Testing
## Sensitivity          0.9264706 0.6875000
## Specificity          0.9940653 0.9642857
## Pos Pred Value       0.9692308 0.7857143
## Neg Pred Value       0.9852941 0.9418605
## Precision            0.9692308 0.7857143
## Recall               0.9264706 0.6875000
## F1                   0.9473684 0.7333333
## Prevalence           0.1679012 0.1600000
## Detection Rate       0.1555556 0.1100000
## Detection Prevalence 0.1604938 0.1400000
## Balanced Accuracy    0.9602679 0.8258929

Based on the class-level performance, we see a large difference between the training and testing sets on the class-level measures, confirming that the model is overfitting.

Based on our conclusion that the model is overfitting, we need to reduce the flexibility of our model. To improve the fit, we can increase the training set size, increase regularization (i.e., weight decay), and employ feature selection methods. We can also use resampling methods to handle the class imbalance present in the data; one possible approach is sketched below.
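
For example, here is a minimal sketch of one such resampling remedy, assuming we re-tune with caret's built-in down-sampling (the object name ctrl_down is hypothetical; trainControl() accepts a sampling argument for this purpose):

ctrl_down <- trainControl(method = "repeatedcv",
                          number = 5, # 5 folds
                          repeats = 3, # 3 repeats
                          search = "grid", # grid search
                          sampling = "down") # down-sample the majority class in each resample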