Exploration of Neural Networks

This is a brief exploration of the mathematical underpinnings of neural networks and a model application in R unsing the MNIST dataset to correctly identify handwritting digits. It follows and builds on chapter 11 Neural Networks in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.

Neural Networks describe a class of learning method that blends input from the statistical learning and artificial intelligence research fields. The big idea is to find linear combinations of the inputs as derived features and model the target as a nonlinear function of these features.

Project Pursuit Regression

Assume we have input vector $X$ with $p$ components and target $Y$. Let $\omega_m, m=1,2,...,M$ be unit p-vectors of unknown parameters. Then the *project pursuit regression (PPR) model has the form

\[f(X)=\sum_{m=1}^Mg_m(\omega_m^T)\] This is an additive model in the features, say, $V_m=\omega_m^T X$ instead of the actual inputs themselves. The functions $g_m$ are unknown but estimated using a flexible smoothing method.

The function $g_m(\omega_m^T X)$ is a ridge function in $\mathbb{R}^p$ that varies in the direction of the identified vector $\omega_m$. The scalar variable $V_m=\omega_m^T X$ is the projection of X onto the unit vector $\omega_m$. The goal of the PPR model is to find the best $\omega_m$ so that the model fits well.

The image on the left below shows $\omega = (1/\sqrt{2})(1,1)^T$ where the function only varies in the direction $X_1+X_2$. In the example on the right $\omega = (1, 0)$.

The PPR model is very general because the process of forming nonlinear functions of linear combinations can produce a broad array of model classes. If $M$ is arbitrarily large, for some choice of $g_m$ and continuous function in $\mathbb{R}^p$ can be approximated. This makes $PPR$ a universal approximator as a class type. The tradeoff is in explainability due to the way each input enters the model in complex and multifaceted ways. PPR models are good for prediction but not so good for understanding the data.

Fitting a PPR model involves finding the minimizers of the error function over functions $g_m$ \[\sum_{i=1}^N\Big[y_i-\sum_{m=1}^Mg_m(\omega_m^Tx_i)\Big]^2\]

and direction vectors $\omega_m, m=1,2,...,M$. Complexity constraints are required on $g_m$ to avoid overfitting.

To illustrate the PPR, assume given $g$ we want to minimize the error function shown above over $\omega$. We can use a Gauss-Newton search in which part of the Hessian involving the second derivative of $g$ is thrown out.

Let $_{old} be the current estimate for $\omega$. Then \[g(\omega^Tx_i)\approx g(\omega_{old}^Tx_i)+g^\prime(\omega_{old}x_i)(\omega-\omega_{old})^Tx_i\]

to generate

\[\sum_{i=1}^N[y_i-g(\omega^Tx_i)]^2\approx\sum_{i=1}^Ng^\prime(\omega_{old}^Tx_i)^2\Big[\Big(\omega_{old}^T+\frac{y_i-g(\omega_{old}^Tx_i)}{g^\prime(\omega_{old}^Tx_i)}\Big)\Big]^2\]

To minimize the right-hand side, we use least squares regression with the target $\omega_{old}^T+\frac{y_i-g(\omega_{old}^Tx_i)}{g^\prime(\omega_{old}^Tx_i)}$ on the input $x_i$ and weights $g^\prime(\omega_{old}^Tx_i)^2$ and no intercept term.

This generates the updated coefficient vector $\omega_{new}$.

These two steps, estimating $g$ and $\omega$ are iterated until convergence. If we have more than one term in the PPR model the model is built in a forward stage-wise manner, adding a new $(\omega_m, g_m)$ at each stage.

Neural Networks

There are many models that fit under the class of neural networks so this discussion sticks to the single hidden layer back-propagation network, known as the perceptron which forms the backbone of the class.

The perceptron is modeled after understanding of the neurons in the human brain. Each unit represents a neuron and the connections represent synapses. The neurons fired when the total signal passed to the unit exceeded some threshold.

A neural network is simply a nonlinear statistical learning model similar to the PPR. NN’s are two-stage regression or classifications models that can be visually represented by the figure below.

This network applies to both regression and classification models. With regression, typically $K=1$ and there is only one output $Y_1$ at the top of the network. The models can handle multiple quantitative responses without issue.

For $K$-class classification there are $K$ units at the top, with the $k$th unit modeling the probability of class $k$. There are $K$ target measures $Y_k, k=1,...,K$, each coded as a $0-1$ variable for the $k$th class.

The derived features $Z_m$ are created from linear combinations of the $Z_m$: \[\begin{aligned} Z_m&=\sigma(\alpha_{0m}+\alpha_m^TX), m=1,...,M\\ T_k&=\beta_{0m}+\beta_k^TZ, k=1,...,K\\ f_k(X)&=g_k(T), k=1,...,K\\ \end{aligned}\]

where $Z=(Z_1,Z_2,..,Z_M)$ and $T=(T_1,T_2,...,T_K)$.

Ther activation function $\sigma(\nu)$ is usually chosen to be the sigmoid $\sigma(\nu)=\frac{1}{1+e^{\nu}}$ which plots as shown below.

There are a number of activation functions that are used classed by mathematical properties as nonlinear, range, and continuously differentiable. They include

Linear activation: $\sigma(\nu)=a+\nu^\prime b$
ReLU activation: $\sigma(\nu)=max(0, a+\nu^\prime b)$
Heaviside activation: $\sigma(\nu)=1_{a+\nu^\prime b>0}$
Logistic Activation: $\sigma(\nu)=(1+exp(-a-\nu^\prime b))^{-1}$

Network diagrams as shown above sometimes include additional bias units feeding into each unit in the hidden and output layers. If we consider the constant $1$ are the additional input features, this hidden bias unit captures the intercepts $\alpha_{0m}$ and $\beta_{0k}$ in the model for target $Y_k$ described above.

The output function $g_k(T)$ generates a final transformation of the vector of outputs $T$. With regression the identity function, $g_k(T)$, is typically used. In classification problems the softmax function

\[g_k(T)=\frac{e^{T_k}}{\sum_{l=1}^K e^{T_l}}\]

which produces estimates that sum to one.

The units in the middle of the network compute the derived features $Z_m$ and are called the hidden units because the values of $Z_m$ are not directly seen. There can be more than one hidden layer and sometimes there are many layers in more complex models.

Consider $Z_m$ as a basis expansion of the original inputs $X$ so that the neural network is then a standard linear model, or a linear logit model, using the transformations as inputs.

If $\sigma$ is represented as the identity function then the model becomes a linear model in the inputs so a neural network can be considered a nonlinear generalization of the linear model for both regression and classification. If a nonlinear transformation $\sigma$ is introduces the class of linear models is greatly increased.

In the plot of the sigmoid function above note that the rate of activation of the sigmoid depends on the norm of $\alpha_m$ and if the norm $\Vert\alpha_m\Vert$ is very small, the unit will operate in the linear part of its activation function.

The neural network model with one hidden layer has the same form as the PPR model with the difference being that the PPR model uses nonparametric functions $g_m(\nu)$ while the neural network uses the simpler function $\sigma(\nu)$ with three free parameters in its argument.

Considering the neural network model as a PPR model we have \[\begin{aligned} g_m(\omega_m^TX)&=\beta_m\sigma(\alpha_{0m}+\alpha_m^TX)\\ &=\beta_m\sigma(\alpha_{0m}+\Vert\alpha_m\Vert(\omega_m^TX))\\ \end{aligned}\]

where $\omega_m=\alpha_m/\Vert\alpha_m\Vert$ is the $m$th unit vector. Since $\alpha_{\beta, \alpha_0, s}(\nu)=\beta\sigma(\alpha_o+s\nu)$ has lower complexity than a more general nonparametric $g(\nu)$ and note that a neural network may have 20 to 100 of these functions or more.

Fitting Neural Networks

The neural network has unknown parameters, usually called weights. Finding the proper weight values will make the model fit the training data well.

The complete set of weights is denoted $\theta$ of the form \[\{\alpha_{0m}, \alpha_m:m=1,2,...,M\}M(p+1) \text{ weights }\] \[\{\beta_{0k}, \beta_k: k=1,2,...,K\}K(M+1)\text{ weights }\]

For regression, we use the sum-of-squared errors as out measure of fit (error function): \[R(\theta)=\sum_{k=1}^K\sum_{i=1}^N(y_{ik}-f_k(x_i))^2.\] For classification, we use either the squared error of cross-entropy (deviance): \[R(\theta)=-\sum_{i=1}^N\sum_{k=1}^Ky_{ik}\text{ log }f_k(x_i),\]

and the corresponding classifier is $G(x)=\text{ argmax }_{k}f_k(x)$. If the softmax activation function and the cross-entropy error function are used the neural network model is the linear logistic regression model in the hidden units where all parameters are estimated using the maximum likelihood.

Because we want to avoid overfitting it is not generally advised to minimize $R(\theta)$. The use of a penalty term for regularization purposes or by stopping the optimization early are used for this purpose.

Back-Propagation

The use of gradient descent to minimize $R(\theta)$ using back-propagation is the generic approach used here. The gradient is found using the chain rule for differentiation and can be computed through either a backward or forward sweep through the network keeping track of local quantities.

Let $z_{mi}=\sigma(\alpha_{0m}+\alpha_m^Tx_i)$ and let $z_i=(z_{1i}, z_{2i},...,z_{Mi})$. Then we have \[\begin{aligned} R(\theta)&=\sum_{i=1}^NR_i\\ &=\sum_{i=1}^N\sum_{k=1}^K(y_{-k}-f_k(x_i))^2,\\ \end{aligned}\]

with the derivatives \[\begin{aligned} \frac{\partial R_i}{\partial \beta_{km}}&= -2(y_{ik}-f_k(x_i))g_k^\prime(\beta_k^Tz_i)z_{mi}\\ \frac{\partial R_i}{\partial \alpha_{ml}}&= -\sum_{k=1}^K2(y_{ik}-f_k(x_i))g_k^\prime(\beta_k^Tz_i)\beta_{km}\sigma^\prime(\alpha_m^Tx_i)x_{il}\\ \end{aligned}\]

with these derivatives, the gradient descent update at the $(r+1)$st iteration is of the form \[\begin{aligned} \beta_{km}^{(r+1)}&=\beta_{km}^{(r)}-\gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \beta_{km}^{(r)}}\\ \alpha_{ml}^{(r+1)}&=\alpha_{ml}^{(r)}-\gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \alpha_{ml}^{(r)}}\\ \end{aligned}\]

where $\gamma_r$ is the learning rate.

We can rewrite the equation directly above as

\[\begin{aligned} \beta_{km}^{(r+1)}&=\delta_{ki}z_{mi},\\ \alpha_{ml}^{(r+1)}&=s_{mi}x_{il}.\\ \end{aligned\]s_{mi}

The values for $\delta_{ki}$ and $s_{mi}$ are the **errors"* from the current model at the output and at the hidden layer units, respectively.

These errors satisfy \[s_{mi}=\sigma^\prime(\alpha_m^Tx_i)\sum_{k=1}^K\beta_{km}\delta_{ki},\] and are called the back-propagation equations.

The gradient descent updates can be implemented with a two-pass algorithm that uses a forward pass and a backward pass.

The forward pass fixes the current weights and the predicted values $\hat{f}_k(x_i)$ are computed using the equation for the linear combinations of the $Z_m$: \[\begin{aligned} Z_m&=\sigma(\alpha_{0m}+\alpha_m^TX), m=1,...,M\\ T_k&=\beta_{0m}+\beta_k^TZ, k=1,...,K\\ f_k(X)&=g_k(T), k=1,...,K\\ \end{aligned}\] where $Z=(Z_1,Z_2,..,Z_M)$ and $T=(T_1,T_2,...,T_K)$.

The backward pass computes the errors $\delta_{ki}$, and then are back propagated via \[s_{mi}=\sigma^\prime(\alpha_m^Tx_i)\sum_{k=1}^K\beta_{km}\delta_{ki},\] to generate the errors $s_{mi}$.

The gradients are then computed using both sets of errors to generate the updates in \[\begin{aligned} \beta_{km}^{(r+1)}&=\beta_{km}^{(r)}-\gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \beta_{km}^{(r)}}\\ \alpha_{ml}^{(r+1)}&=\alpha_{ml}^{(r)}-\gamma_r\sum_{i=1}^N\frac{\partial R_i}{\partial \alpha_{ml}^{(r)}}\\ \end{aligned}\]

where $\gamma_r$ is the learning rate.

The back propagation process is also known as the delta rule.

The updates from the computation of the gradient descent are described as a form of batch learning where the parameter updates are a sum over all of the training cases.

This can also be done computationaly, where each observation can be processed with the gradient being updated after each training case adn cycling through the training cases many times. The summands are replaced by a single summand. One weep through all the training cases is called an epoch.

The learning rate, $\gamma_r$ for batch learning is a constant and for computational learning the learning rate decreases to zero as $r\rightarrow\infty$.

There are other techniques that are faster than back propagation that include using conjugate gradients and variable metric methods.

Training Issues with Neural Networks

Art meets science comes to mind. The model is generally overparameterized and we have a non-convext optimization problem that can be unstable unless certain rules are followed. The issues include

Starting Values:

If the starting weights are zero the derivatives are zero and the model never moves. If the weights are too large, the solutions can be poor. Usually, starting with small randomly valued weights is the best approach.

Overfitting:

Solutions to overfitting include early stopping according to some predetermined rule designed to stop before a global minimum is achieved and the process of weight decay. Weight decay adds a penalty to the error function $R(\theta)+\lambda J(\theta)$ where \[J(\theta)=\sum_{k,m}\beta_{km}^2+\sum_{m, l}\alpha_{ml}^2\] and $\lambda \geq 0$ is a tuning parameter. Larger values of $\lambda$ will shrink the weights toward zero. Cross-validation is typically used to compute $\lambda$.

The penalty adds $2\beta_{km}$ and $2\alpha_{ml}$ to the gradient descent update functions shown before.

There is also a weight elimination penalty that can serve to shrink smaller weights more efficiently.

Scaling the Inputs

Scaling the inputs determines the effective scaling of the weights in the bottom layer and can have a large effect on the quality of the solution. It is best to standardize all inputs before running the model to have mean zero and standard deviation one. The ensures all inputs are treated equally in the regularization process and allows the selection of random starting weights to have a more meaningful range.

Number of Hidden Units and Layers

It is generally better to have a larger number of hidden units than the opposite. Too few hidden units and the model might non have enough flexibility to capture the nonlinearities in the data. With a larger number than necessary, the weights can be shrunk towards zero if appropriate regularization is used. Typically the number of hidden units is somewhere in the range of $5$ to $100$ with the number increasing with the number of inputs and training cases. This is generally guided by experience but cross-validation can be used to estimate the optimal numbers.

Multiple Minima

The error function $R(\theta)$ is nonconvex with many local minima. Therefore the final solution depends highly on the starting weights. The solution with the lowest penalized error is the goal so a number of random starting configurations to try is a reasonable option.

Building a Neural Net Model on MNIST Data

Using the nnet and mxnet packages, neural networks are constructed below to evaluate the MNIST database. The MNIST database (Modified National Institute of Standards and Technology Database) is a database consisting of training images and testing images comprised of handwritten digits from $0$ to $9$ that is largely used for evaluating image recognition programs in machine learning applications. The code is adapted from https://rpubs.com/meisenbach/284590.

To begin, we read in the data set and name it train.

# read in data
library(readr)

train <- read_csv("../Math647_HW6/train.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.

# Running spec() on the train data shows there are 1783 pixel column labels
#spec(train)

Examine the size of the data set using the dim() function.

dim(train)

## [1] 42000   785

We can see that the training data set has $42,000$ observations and $784$ pixel variables from $0$ to $255$. This represents the distribution in the data of the handwritten digits.

Create a data set test by importing the test data set.

test <- read_csv("../Math647_HW6/test.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.

Next, create the training labels and save them as train_labels and output the summary() function.

# create and save the training labels
train_labels <- train[, 1]
train_labels <- as.factor(train_labels$label)
summary(train_labels)

##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Using the nnet package

The nnet package is designed for feed forward neural networks with a single hidden layer, and for multinomial log-linear models.

Begin the model building process by creating a training and testing data set and normalizing the data.

library(nnet)

# split the training data into train and test to do local evaluation
set.seed(123)
rows <- sample(1:nrow(train), as.integer(0.7*nrow(train)))

# Get train and test labels
train_labels <- train[rows, 1]
test_labels <- train[-rows, 1]

# convert the labels to factors
train_labels <- as.factor(train_labels$label)

Normalize the data

Create the normalization function to center the data.

# create a custom normalization function
normalize <- function(x) { 
  return(x / 255)
}

# create the train and test datasets and apply normalization
train_norm <- as.data.frame(lapply(train[rows, -1], normalize))
test_norm <- as.data.frame(lapply(train[-rows,-1], normalize))

Check a random pixel in the training set to see if the normalization worked before and after.

summary(train$pixel350)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   89.51  228.00  255.00

summary(train_norm$pixel350)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3487  0.8902  1.0000

Check a random pixel in the test set to see if the normalization worked.

summary(test_norm$pixel350)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.01176 0.35629 0.90196 1.00000

Create the class indicator matrix. The class indicator matrix is zero except for the columns corresponding to the class.

# create the class indicator matrix
train_labels_matrix = class.ind(train_labels)
head(train_labels)

## [1] 7 6 3 3 1 4
## Levels: 0 1 2 3 4 5 6 7 8 9

Output the head of the class indicator matrix. nnet requires the class labels in the form of a one-of-n matrix

head(train_labels_matrix)

##      0 1 2 3 4 5 6 7 8 9
## [1,] 0 0 0 0 0 0 0 1 0 0
## [2,] 0 0 0 0 0 0 1 0 0 0
## [3,] 0 0 0 1 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0
## [5,] 0 1 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 1 0 0 0 0 0

Training the Model

Build a single hidden layer feed forward neural network using the nnet() function.

Use the set.seed() function for reproducibility. Name the model nnet1 and use the nnet() function with the training data, the training labels matrix, a single hidden layer and the softmax activation function.

Note that a single layer is not likely to be sufficient. This step is just to build a model to try out the nnet() function. We use the softmax activation function on a classifier. The default maximum number of iterations is 100 so we note that the models algorithm did not converge before reaching the maximum number of iterations and the model took 112 seconds to run.

# train model
set.seed(42)
startTime <- proc.time()
nnet1 <- nnet(train_norm, train_labels_matrix, 
            size = 1, 
            softmax = TRUE 
            )

## # weights:  805
## initial  value 69578.961643 
## iter  10 value 61396.451577
## iter  20 value 58735.961180
## iter  30 value 56017.732579
## iter  40 value 54547.602514
## iter  50 value 53443.583161
## iter  60 value 52881.985092
## iter  70 value 52599.343350
## iter  80 value 52399.078142
## iter  90 value 52231.477305
## iter 100 value 52063.948017
## final  value 52063.948017 
## stopped after 100 iterations

Display the time it took to run the model using the proc.time() function.

proc.time() - startTime

##    user  system elapsed 
##   99.56    0.56  103.42

Output the nnet1 model

nnet1

## a 784-1-10 network with 805 weights
## options were - softmax modelling

Evaluating Model Performance

Take a look at the predictions from the nnet1 model using the predict() function on the testing data with type = “class”. Use the cbind() function to display the prediction and test label columns.

We can see the predictions are pretty awful.

# get predictions
pred = predict(nnet1, test_norm, type="class")
cbind(head(pred), head(test_labels))

##   head(pred) label
## 1          7     9
## 2          1     2
## 3          1     0
## 4          7     6
## 5          7     9
## 6          1     1

Evaluate the model accuracy by setting the predictions equal to the labels and evaluating the accuracy. We see the accuracy of the predictions are only about $21\%$ which is not so hot.

# evaluate the model
accuracy <- mean(pred == test_labels)
print(paste('Accuracy:', accuracy))

## [1] "Accuracy: 0.205618601698278"

Improving Model Performance

We can tune the model to improve the performance using the caret package. The number of hidden nodes can be increased and we can also perform K-fold cross validation.

# load the caret package
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

We can ask out machine to use multiple processors to increase performance.

# Enable parallel processing
library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

The machine this used has four cores.

detectCores()

## [1] 4

Use the makeCluster() function to create a set of copies of R running in parallel and communicating over sockets using the default arguments and name it “c1”.

# subtract 1 from detectCores() to reduce the number of cores used
num_cores <- detectCores() 
cl <- makeCluster(num_cores)
cl

## socket cluster with 4 nodes on host 'localhost'

Run the registerDoParallel() function to enable the caret package to utilize more than one processor.

registerDoParallel(cl)

Set up the training parameters to use cross-validation. Set the folds equal to “3” to reduce the time needed to process the data. We leave the “decay” argument at 0 because the model has less than 10 nodes and overfitting is not a concern.

# Set up training parameters
TrainingParameters <- trainControl(method = "cv", number = 3)

grid_nn <- expand.grid(.size = c(1, 3, 5, 10),
                       .decay = 0)
grid_nn

##   .size .decay
## 1     1      0
## 2     3      0
## 3     5      0
## 4    10      0

Use the train() function from the caret package to evaluate all of the training data in a neural network model we call “nnet2”

# use all of the given training data
train_norm <- as.data.frame(lapply(train[, -1], normalize))
train_labels <- train[, 1]
# convert the labels to factors
train_labels <- as.factor(train_labels$label)

startTime <- proc.time()
set.seed(123)
nnet2 <- train(train_norm, train_labels,
             trControl= TrainingParameters,
             method = "nnet",
             tuneGrid = grid_nn,
             MaxNWts = 20000
             )

## # weights:  7960
## initial  value 116441.516855 
## iter  10 value 48202.891620
## iter  20 value 36495.980990
## iter  30 value 28805.180943
## iter  40 value 25213.636994
## iter  50 value 21948.319543
## iter  60 value 19687.442029
## iter  70 value 17958.897582
## iter  80 value 16581.959778
## iter  90 value 15517.824442
## iter 100 value 14758.523216
## final  value 14758.523216 
## stopped after 100 iterations

We can see the process ran the full 100 iterations and the time to process the data took almost 70 minutes on my laptop.

proc.time() - startTime

##    user  system elapsed 
##  701.48    3.91 5910.11

When using cross-validation with the caret package the accuracy for each model is automatically reported therefore there is no need to partition the data into test and train datasets.

nnet2

## Neural Network 
## 
## 42000 samples
##   784 predictor
##    10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 28001, 28002, 27997 
## Resampling results across tuning parameters:
## 
##   size  Accuracy   Kappa    
##    1    0.2310496  0.1419212
##    3    0.6085942  0.5645412
##    5    0.6926278  0.6580393
##   10    0.8793790  0.8659480
## 
## Tuning parameter 'decay' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 10 and decay = 0.

We can see the best model has 10 hidden nodes and an estimated accuracy of $87\%$.

Next, we want to normalize the test data and make predictions on this dataset.

# normalize test data and predict on it
test_norm <- as.data.frame(lapply(test, normalize))
NNPredictions <-predict(nnet2, test)

Create a dataframe to output predictions for submission and take a look at the head of the dataframe.

# output predictions for submission
predictions <- data.frame(ImageId=1:nrow(test), 
                          Label=levels(train_labels)[NNPredictions])
head(predictions)

##   ImageId Label
## 1       1     2
## 2       2     0
## 3       3     9
## 4       4     2
## 5       5     3
## 6       6     7

Output the prediction file to the working directory.

write_csv(predictions, "../Math647_HW6/caret_nnet.csv")

Stop the parallel processing cluster from running.

# stop the cluster for parallel processing
stopCluster(cl)

Deep Neural Net using MXNet

The mxnet package brings fast gpu computation and state-of-the-art deep learning processes to the R community. Configure a neural net model.

load the mxnet package.

# load the namespace of mxnet package and attach it
require(mxnet)

## Loading required package: mxnet

Transpose the input matrix to npixel $\times$ nexamples, which is the column major format accepted by the mxnet package. Then normalize the test and train matrices and view the class distribution.

train_matrix <- data.matrix(train)
test_matrix <- data.matrix(test)

train_matrix.x <- train_matrix[,-1]
train_matrix.y <- train_matrix[,1]

# normalize 
train_matrix.x <- t(train_matrix.x/255)
test_matrix <- t(test_matrix/255)

# check the distribution of the classes
table(train_matrix.y)

## train_matrix.y
##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Configure the neural network. This is a neural network with a single hidden layer and 10 nodes. To produce a relevant comparison with the models built with the nnet package.

The configuration variables used are:

1. “data” representing the input layer.
1. “fc1” is the name of the first hidden layer.
1. “act1” is the first activation layer.
1. “fc2” is the name of the output layer with the number of neurons set to 10.
1. “sm” is the softmax activation function used.

## Network Configuration
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=10)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc2, name="sm")

Use the mx.cpu() function to set the computer to use the cpu and train the model using the mx.model.FeedForward.create() function using the arguments below.

# set the device to the cpu
devices <- mx.cpu()

# train the model
mx.set.seed(0)
model <- mx.model.FeedForward.create(softmax, X=train_matrix.x, y=train_matrix.y,
                                     ctx=devices, num.round=10, array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
                                     initializer=mx.init.uniform(0.07),
                                     epoch.end.callback=mx.callback.log.train.metric(100))

## Warning in mx.model.select.layout.train(X, y): Auto detect layout input matrix, use colmajor..

## Start training with 1 devices

## [1] Train-accuracy=0.836047618694249

## [2] Train-accuracy=0.915404763675871

## [3] Train-accuracy=0.922523811743373

## [4] Train-accuracy=0.927690479868934

## [5] Train-accuracy=0.931190478234064

## [6] Train-accuracy=0.933333335320155

## [7] Train-accuracy=0.935000000965028

## [8] Train-accuracy=0.936619048317273

## [9] Train-accuracy=0.937619047789347

## [10] Train-accuracy=0.938285714955557

We can see that our training accuracy is both greatly improved and faster. The model achieved a $95\%$ accuracy level.

## Network Configuration
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

How about accuracy with two hidden layers? Below is a model run with two hidden layers.

# train the model
mx.set.seed(0)
model <- mx.model.FeedForward.create(softmax, X=train_matrix.x, y=train_matrix.y,
                                     ctx=devices, num.round=10, array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
                                     initializer=mx.init.uniform(0.07),
                                     epoch.end.callback=mx.callback.log.train.metric(100))

## Warning in mx.model.select.layout.train(X, y): Auto detect layout input matrix, use colmajor..

## Start training with 1 devices

## [1] Train-accuracy=0.860404763388492

## [2] Train-accuracy=0.959642860435304

## [3] Train-accuracy=0.972357150202706

## [4] Train-accuracy=0.979452391181673

## [5] Train-accuracy=0.982119058143525

## [6] Train-accuracy=0.984976199978874

## [7] Train-accuracy=0.988023818816458

## [8] Train-accuracy=0.989761912964639

## [9] Train-accuracy=0.992095245066143

## [10] Train-accuracy=0.992690482309886

With two layers the accuracy improved to over $99\%$!

Generate the predictions of the model using the predict() function.

# get predictions
preds <- predict(model, test_matrix)

## Warning in mx.model.select.layout.predict(X, model): Auto detect layout input matrix, use colmajor..

# preds

View the dimensions of the predictions which we name “pred” above.

dim(preds)

## [1]    10 28000

pred.label <- max.col(t(preds)) - 1
table(pred.label)

## pred.label
##    0    1    2    3    4    5    6    7    8    9 
## 2882 3266 2708 2813 2743 2497 2698 2834 2813 2746

# output submission
submission <- data.frame(ImageId=1:ncol(test_matrix), Label=pred.label)
write.csv(submission, file="../Math647_HW6/mxnet.csv", row.names=FALSE,  quote=FALSE)

Conclusion

We saw that the single layer neural network model nnet2 with cross-validation and built using the nnet package was able to achieve an accuracy level of $87\%$ and it took almost 70 minutes to train on my laptop.

The single layer neural network model with cross-validation, model built using the mxnet package was able to acheive an accuracy level of $93\%$ and the run time was less than a minute.

The two-layer model built using mxnet showed an accuracy over $99\%$ and was almost as fast the the single layer model.

Exploration of Neural Networks

Chris Schmidt

6/7/2021