Building a Keras neural network with the MNIST dataset

What is Keras?

Keras is a model-level library meaning it provides high-level functions for specifying and training deep learning models. It cannot by itself perform low-level operations like tensor manipulation or differentiation. Keras depends on a well-optimized tensor library to do so, which serves as the backend engine. Several backend engines can be plugged into Keras - TensorFlow developed by Google, or CNTK developed by Microsoft and Theano.

The default and recommended backend is TensorFlow. In R, when you install Keras library it will automatically install TensorFlow.

#install.packages("keras")
library(keras)

MNIST Dataset

The MNIST dataset is often considered the “hello world” of deep learning. It’s a set of grayscale images (28 x 28 pixels) of hand written digits and associated labels (0 through 9).

This is a standard dataset that comes with a standard training and testing split. There are 60,000 images in the training set and 10,000 images in the test set.

The MNIST dataset comes preloaded in Keras, in the form of train and test lists, each of which includes a set of images (x) and their associated labels (y).

The training images (x) are structured in R as a 3-dimensional array (a tensor) of dimension 60,000 x 28 x 28; in other words 60,000 28x28 images. The training labels (y) are stored as a numeric vector of length 60,000.

Load the MNIST Dataset

mnist <- dataset_mnist() # Load the MNIST dataset
train_images <- mnist$train$x # create a 60,000x28x28 tensor for the training images
train_labels <- mnist$train$y # create a 60,000-element vector for the training labels
test_images <- mnist$test$x   # create a 10,000x28x28 tensor for the test images
test_labels <- mnist$test$y   # create a 10,000-element vector for the test labels

These commands show how the data is structured in R:

str(train_images)

##  int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...

str(train_labels)

##  int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...

Each image is grayscale with 8-bit pixel values (0 to 255). For example, let’s look at the 200th digit in the training set:

digit <- train_images[200,,]      # select the 200th training image
plot(as.raster(digit, max = 255)) # plot it!

Building the Neural Network

There are two broad things that we need to do - first, select a network architecture and second, compile and optimize it for our chosen backpropagation algorithms.

Define the network architecture

To build the network we first specify:

The number of layers
The size of each layer
The activation function for each layer

In the following code, we create a new neural network called network. In this case, we want the input layer to have 784 (=28*28) neurons and the output layer to have 10 neurons.

network <- keras_model_sequential() %>%  # create a linear stack of layers
  layer_dense(units = 784, activation = "relu", input_shape = c(28 * 28)) %>%  # input layer
  layer_dense(units = 512, activation = "relu") %>%                            # hidden layer
  layer_dense(units = 10,  activation = "softmax")                            # output layer

Keras offers several choices of activation function:

linear
sigmoid
ReLU
SeLU
exponential
softmax
arctangent

Softmax is often used in the last layer. It’s a generalized logistic function where instead of 2 classes we have 10 classes. Each class is a probability and the total of the last layer should sum to 1.

layer_dense tells keras that the layers are to be densely connected, i.e. every neuron is connected to every neuron in the previous layer. But it is also possible to create sparse networks which work better for some applications.

Compile the network

Next we compile the network, which optimizes the network data structure for training. Here we have three choices to make:

Gradient descent optimization algorithm. Choices include Stochastic Gradient Descent, RMSprop, Adagrad… See this page for a visual overview of different algorithms. Each algorithm in turn has its own parameters which can be tweaked. For example, for SGD, you can adjust the learning rate and momentum.
Loss function. This is the cost function that will be used during training. Choices include various flavors of Squared Error, Hinge, Crossentropy, etc.
Performance metric. Similar to a loss function, but used to evaluate the model performance on the test set.

network %>% compile(
  optimizer = "rmsprop",             # network will update itself based on the training data & loss
  loss = "categorical_crossentropy", # measure mismatch between y_pred and y, calculated after each minibatch
  metrics = c("accuracy")            # measure of performace - correctly classified images
)

Pre-processing the data

As with all machine learning, the models are very picky about the format of the input data. We need to do three things:

Reshape the 28x28 image arrays into 784-unit vectors. This is necessary because the first layer of the network is expecting a vector input.
Stack the image vectors into a 60,000 x 768 matrix so we now have a single matrix storing all of the training images.
Currently the pixels range from [0,255], but we need to normalize them to [0,1].

train_images <- array_reshape(train_images, c(60000, 28 * 28)) # Unfold each 28x28 image into a linear vector of 784. Train_images was an image cube, now it's a 2d matrix where each row is an image.
train_images <- train_images / 255  # Normalize each element in the matrix to [0,1]

test_images <- array_reshape(test_images, c(10000, 28 * 28)) # Do the same for the test images
test_images <- test_images / 255

dim(train_images) # Check to see that the dimensions are now correct

## [1] 60000   784

dim(test_images)

## [1] 10000   784

Prepare the y labels

The y labels also need to be prepared. Currently train_labels is a vector of length 60,000 containing the 0-9 labels for each training image - there is a single digit for each image. But the network doesn’t output a single digit, it outputs a vector of ten probabilities. So for each training image, we need to convert the numeric label into a set of ten boolean dummy variables. The resulting label data structure is a 60,000x10 matrix of boolean dummy variables.

So [3, 0, 9] would become:

000100000
100000000
000000001

# The dimensions of our y labels need to match the dimensions of our output layer
train_labels <- to_categorical(train_labels) # makes key-value boolean dummy vars out of numerical vectors
test_labels <- to_categorical(test_labels)   # do the same with the test labels

dim(train_labels)

## [1] 60000    10

dim(test_labels)

## [1] 10000    10

Train the model

Finally, we’re ready to train the network. Keras treats networks as just another type of R model, so many of the standard R functions for models, like fit(), evalute() and predict() work just as well on neural networks as on regressions, trees, etc. In this case, we use fit to train the model.

For efficiency, we’ll divide the training set into minibatches of 1,000 images and train on each set until we’ve trained on all the training images. This is one epoch. Then we repeat the process for 20 epochs.

Importantly, we’d like to calculate the training loss at the end of each epoch, but we’d also like to see how the model performs on data that it hasn’t seen yet. To avoid overfitting we don’t want to use the actual testing data to do this. We should only use the testing data to evaluate the loss of the fully trained model at the very end.

So instead, we reserve some fraction of the minibatch as a ‘validation set’. The model will not be trained on this data and will only use it to calculate the loss at the end of each epoch. We’ll call this the ‘validation loss’, in contrast to the training loss.

history <- network %>% fit(train_images, train_labels, epochs = 20, batch_size = 1000, validation_split = 0.1)

We can plot both the training loss and the validation loss at the end of each epoch:

plot(history)

The gap between training and validation loss suggests that the model is overfitting, but let’s look at the full set of test data.

Before running the full test data though, we can peek at the predicted categories for the first ten test images and compare them to the actual y values:

network %>% predict_classes(test_images[1:10,]) # display the predicted category for the first ten test images

##  [1] 7 2 1 0 4 1 4 9 5 9

mnist$test$y[1:10] # display the actual labels for the first ten test images (subtract one from the array index to get the predicted value)

##  [1] 7 2 1 0 4 1 4 9 5 9

Testing the model

Now that the model is fully trained, we’re ready to evaluate its performance on the 10,000-image test set.

network %>% evaluate(test_images, test_labels)

## $loss
## [1] 0.08416831
## 
## $acc
## [1] 0.9822

The test loss here is greater than the training loss above, indicating overfitting. So let’s…

Try another architecture!

We’ll re-specify the model, adding a second hidden layer with 512 neurons. We’ll also use a technique called ‘dropout’, which randomly sets some fraction of the neurons in each layer to zero during each training step, which can help to avoid overfitting.

network2 <- keras_model_sequential()
network2 %>% 
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 512, activation = "relu") %>%  # new layer
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax")

network2 %>% compile(
  optimizer = "rmsprop", 
  loss = "categorical_crossentropy", 
  metrics = c("accuracy"))

Fit the model

history <- network2 %>% fit(train_images, train_labels, 
                            epochs = 20, batch_size = 1000, 
                            validation_split = 0.1)

plot(history)

Since training and validation accuracy is almost equal this model has a better fit.

network2 %>% evaluate(test_images, test_labels) # run the model on the full test set

## $loss
## [1] 0.06807244
## 
## $acc
## [1] 0.983

With the additional hidden layer and dropout added, the test loss decreased. State of the art as of August 2018 is 0.0027 with a 6-layer convnet (784-50-100-500-1000). The Wiki page for the MNIST dataset shows typical loss rates for various machine learning algorithms.

Break down the architecture

Some thoughts from the Keras tutorial: What we’ve seen so far in designing and fitting a neural network is a whole bunch of tensor operations - adding a tensor, multiplying a tensor etc…

When we use tensors in R we dont have to use for loops, we can just apply matrix algebra via the Tensorflow pckage. This significantly improves performance in training complex networks with 10’s or 100’s of layers. This is called vectorized implementation.

z = Weights . x + bias

at every unit of the 1st layer we are calculating

max(W.X+b, 0)

Gradient-based optimization

W, b are tensors that are attributes of a layer. Initially, the weights are given random values - called random initialization.

What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is basically the learning that machine learning is all about.

What happens in a training loop?

Draw a batch of training samples x and corresponding targets y.
Run the network on x (a step called the forward pass) to obtain predictions y_pred forward propagation
Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
Update all the weights in a way that it slightly reduces the loss in this batch. backward propagation

Step 1-3 is straightforward. Step 4 is the challenge.

The difficult part is step 4: updating the network’s weights. Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?

All operations used in the network are differentiable, and compute the gradient of the loss with regard to the network’s coefficients.

A gradient is a derivative of a multidimensional input like a tensor. You can think of back prop as a chain of partial differential equations. If a function is differentiable then we can find a min of that function.

In a NN framework, what this means is that we need to find the combination of weights such that yields the lowest possible loss.

1 Draw a batch of training samples x and corresponding targets y. 2 Run the network on x to obtain predictions y_pred. 3 Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y. 4 Compute the gradient of the loss with regard to the network’s parameters (a backward pass). 5 Move the parameters a little in the opposite direction from the gradient; for example, W = W - (step*gradient), thus reducing the loss on the batch a bit.

This is called a mini-batch Stochastic Gradient Descent.

The step factor is called learning rate. Picking a reasonable learning rate is important in DL models because if it’s too small then the model will learn very slowly, and if it’s too large then the model might miss the global min of loss function!