Keras is a model-level library, meaning it provides high-level building blocks for specifying and training deep learning models. It does not by itself perform low-level operations like tensor manipulation or differentiation; instead, it relies on a well-optimized tensor library to do so, which serves as its backend engine. Several backend engines can be plugged into Keras: TensorFlow (developed by Google), CNTK (developed by Microsoft), and Theano.
The default and recommended backend is TensorFlow. In R, the keras package provides a helper, install_keras(), that installs TensorFlow for you.
#install.packages("keras") # run once to install the package
#install_keras() # run once to install the TensorFlow backend
library(keras)
The MNIST dataset is often considered the “hello world” of deep learning. It’s a set of grayscale images (28 x 28 pixels) of handwritten digits and their associated labels (0 through 9).
This is a standard dataset that comes with a standard training and testing split. There are 60,000 images in the training set and 10,000 images in the test set.
The MNIST dataset comes preloaded in Keras, in the form of train and test lists, each of which includes a set of images (x) and their associated labels (y).
The training images (x) are structured in R as a 3-dimensional array (a tensor) of dimension 60,000 x 28 x 28; in other words, 60,000 images of 28 x 28 pixels each. The training labels (y) are stored as a numeric vector of length 60,000.
mnist <- dataset_mnist() # Load the MNIST dataset
train_images <- mnist$train$x # create a 60,000x28x28 tensor for the training images
train_labels <- mnist$train$y # create a 60,000-element vector for the training labels
test_images <- mnist$test$x # create a 10,000x28x28 tensor for the test images
test_labels <- mnist$test$y # create a 10,000-element vector for the test labels
These commands show how the data is structured in R:
str(train_images)
## int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
str(train_labels)
## int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
Each image is grayscale with 8-bit pixel values (0 to 255). For example, let’s look at the 200th digit in the training set:
digit <- train_images[200,,] # select the 200th training image
plot(as.raster(digit, max = 255)) # plot it!
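As a quick, illustrative follow-up, we can also print the label that goes with this image:
train_labels[200] # the label for the 200th image (a digit from 0 to 9)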
There are two broad things that we need to do: first, specify a network architecture, and second, compile it, choosing the optimizer, loss function, and metrics used during training.
To build the network, we specify its layers and their sizes. In the following code, we create a new neural network called network. In this case, we want the input layer to have 784 (= 28 * 28) neurons and the output layer to have 10 neurons.
network <- keras_model_sequential() %>% # create a linear stack of layers
  layer_dense(units = 784, activation = "relu", input_shape = c(28 * 28)) %>% # input layer
  layer_dense(units = 512, activation = "relu") %>% # hidden layer
  layer_dense(units = 10, activation = "softmax") # output layer
Keras offers several choices of activation function (relu, sigmoid, tanh, softmax, and others). Softmax is often used in the last layer. It’s a generalized logistic function where, instead of 2 classes, we have 10. Each output is a probability, and the outputs of the last layer sum to 1.
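To make this concrete, here is a minimal plain-R sketch of the softmax function itself (illustrative only; Keras applies its own implementation inside the layer):
softmax <- function(z) {
  e <- exp(z - max(z)) # subtracting max(z) avoids numerical overflow
  e / sum(e)
}
softmax(c(2.0, 1.0, 0.1)) # three scores become three probabilities summing to 1; ten scores would give ten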
layer_dense tells Keras that the layers are densely connected, i.e. every neuron is connected to every neuron in the previous layer. But it is also possible to create sparser networks, which work better for some applications.
Next we compile the network, which configures it for training. Here we have three choices to make: an optimizer, a loss function, and the metrics to monitor:
network %>% compile(
optimizer = "rmsprop", # network will update itself based on the training data & loss
loss = "categorical_crossentropy", # measure mismatch between y_pred and y, calculated after each minibatch
metrics = c("accuracy") # measure of performace - correctly classified images
)
As with all machine learning, the models are very picky about the format of the input data. We need to do three things: reshape each image into a flat vector, rescale the pixel values to [0,1], and (further below) one-hot encode the labels.
train_images <- array_reshape(train_images, c(60000, 28 * 28)) # Unfold each 28x28 image into a vector of 784. train_images was an image cube; now it's a 2d matrix where each row is an image.
train_images <- train_images / 255 # Normalize each element in the matrix to [0,1]
test_images <- array_reshape(test_images, c(10000, 28 * 28)) # Do the same for the test images
test_images <- test_images / 255
dim(train_images) # Check to see that the dimensions are now correct
## [1] 60000 784
dim(test_images)
## [1] 10000 784
The y labels also need to be prepared. Currently train_labels is a vector of length 60,000 containing the 0-9 label for each training image - a single digit per image. But the network doesn’t output a single digit; it outputs a vector of ten probabilities. So for each training image, we need to convert the numeric label into a set of ten boolean dummy variables (one-hot encoding). The resulting label data structure is a 60,000 x 10 matrix of boolean dummy variables.
So [3, 0, 9] would become:
0001000000
1000000000
0000000001
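We can sanity-check this concretely with keras’s to_categorical() before applying it to the real labels:
to_categorical(c(3, 0, 9)) # returns a 3 x 10 matrix matching the rows above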
# The dimensions of our y labels need to match the dimensions of our output layer
train_labels <- to_categorical(train_labels) # one-hot encode the numeric labels as boolean dummy variables
test_labels <- to_categorical(test_labels) # do the same with the test labels
dim(train_labels)
## [1] 60000 10
dim(test_labels)
## [1] 10000 10
Finally, we’re ready to train the network. Keras treats networks as just another type of R model, so many of the standard R functions for models, like fit(), evaluate() and predict(), work just as well on neural networks as on regressions, trees, etc. In this case, we use fit() to train the model.
For efficiency, we’ll divide the training set into minibatches of 1,000 images and update the weights after each minibatch. One pass through all 60,000 training images (60 minibatches) is one epoch. Then we repeat the process for 20 epochs.
Importantly, we’d like to calculate the training loss at the end of each epoch, but we’d also like to see how the model performs on data that it hasn’t seen yet. To avoid overfitting we don’t want to use the actual testing data to do this. We should only use the testing data to evaluate the loss of the fully trained model at the very end.
So instead, we reserve some fraction of the training data as a ‘validation set’. The model is not trained on this data; it is used only to calculate the loss at the end of each epoch. We’ll call this the ‘validation loss’, in contrast to the training loss.
history <- network %>% fit(train_images, train_labels,
                           epochs = 20, batch_size = 1000,
                           validation_split = 0.1)
We can plot both the training loss and the validation loss at the end of each epoch:
plot(history)
The gap between training and validation loss suggests that the model is overfitting, but let’s look at the full set of test data.
Before running the full test data though, we can peek at the predicted categories for the first ten test images and compare them to the actual y values:
network %>% predict_classes(test_images[1:10,]) # display the predicted category for the first ten test images
## [1] 7 2 1 0 4 1 4 9 5 9
mnist$test$y[1:10] # display the actual labels for the first ten test images (the raw 0-9 labels, not the one-hot versions)
## [1] 7 2 1 0 4 1 4 9 5 9
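As a quick, illustrative check, we can also compute the fraction of these ten predictions that match the labels:
preds <- network %>% predict_classes(test_images[1:10, ])
mean(preds == mnist$test$y[1:10]) # 1 means all ten predictions match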
Now that the model is fully trained, we’re ready to evaluate its performance on the 10,000-image test set.
network %>% evaluate(test_images, test_labels)
## $loss
## [1] 0.08416831
##
## $acc
## [1] 0.9822
The test loss here is greater than the training loss above, indicating overfitting. So let’s try to do better.
We’ll re-specify the model, adding another hidden layer with 512 neurons. We’ll also use a technique called ‘dropout’, which randomly sets some fraction of each layer’s outputs to zero during each training step; this helps to avoid overfitting.
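As a rough plain-R illustration of what dropout does during training (illustrative only; layer_dropout() below handles this for us, and the numbers here are made up):
activations <- runif(8) # pretend these are one layer's outputs
rate <- 0.4 # the dropout rate used in the first hidden layer below
mask <- rbinom(8, size = 1, prob = 1 - rate) # each unit survives with probability 0.6
activations * mask / (1 - rate) # zero out dropped units; scale up survivors so the expected sum is unchanged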
network2 <- keras_model_sequential()
network2 %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 512, activation = "relu") %>% # new layer
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax")
network2 %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy"))
history <- network2 %>% fit(train_images, train_labels,
                            epochs = 20, batch_size = 1000,
                            validation_split = 0.1)
plot(history)
Since training and validation accuracy are almost equal, this model is a better fit.
network2 %>% evaluate(test_images, test_labels) # run the model on the full test set
## $loss
## [1] 0.06807244
##
## $acc
## [1] 0.983
With the additional hidden layer and dropout added, the test loss decreased. State of the art as of August 2018 is an error rate of 0.0027 with a 6-layer convnet (784-50-100-500-1000). The Wikipedia page for the MNIST dataset lists typical error rates for various machine learning algorithms.
Some thoughts from the Keras tutorial: what we’ve seen so far in designing and fitting a neural network is a whole bunch of tensor operations - adding tensors, multiplying tensors, and so on.
When we use tensors in R, we don’t have to write explicit for loops; we can apply matrix algebra via the TensorFlow package instead. This significantly improves performance when training complex networks with tens or hundreds of layers. This is called a vectorized implementation.
At every unit of the first layer we are calculating
z = W . x + b
and, after the relu activation, the unit’s output is
max(W . x + b, 0)
W and b are tensors that are attributes of the layer. Initially, the weights are given random values - this is called random initialization.
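Here is a minimal plain-R sketch of this vectorized computation for a whole minibatch at once (purely illustrative; the shapes follow the network above, and TensorFlow does the real work on the backend):
set.seed(1)
X <- matrix(runif(32 * 784), nrow = 32) # a minibatch of 32 flattened images
W <- matrix(rnorm(784 * 512, sd = 0.01), nrow = 784) # randomly initialized weights
b <- rnorm(512) # randomly initialized biases
Z <- sweep(X %*% W, 2, b, `+`) # z = x . W + b for the whole batch at once, no loops
A <- pmax(Z, 0) # relu, applied element-wise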
What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is basically the learning that machine learning is all about.
What happens in a training loop?
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x to obtain predictions y_pred.
3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
5. Move the parameters a little in the opposite direction from the gradient; for example, W = W - (step * gradient), thus reducing the loss on the batch a bit.
This is called mini-batch stochastic gradient descent (mini-batch SGD).
Steps 1-3 are straightforward. The difficult part is step 4: updating the network’s weights. Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?
All operations used in the network are differentiable, so we can compute the gradient of the loss with regard to the network’s coefficients. A gradient is the derivative of a function of a multidimensional input such as a tensor; you can think of backpropagation as a chain of partial derivatives (the chain rule) applied backward through the network. If a function is differentiable, we can search for a minimum of that function. In a NN framework, this means finding the combination of weights that yields the lowest possible loss.
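To see steps 4-5 in miniature, here is a toy gradient-descent loop in plain R on a one-parameter loss whose gradient we can compute by hand (illustrative only; Keras computes gradients via backpropagation):
w <- 0 # starting value for our single weight
lr <- 0.1 # the step factor, i.e. the learning rate
for (i in 1:50) {
  grad <- 2 * (w - 3) # gradient of the loss (w - 3)^2, computed by hand
  w <- w - lr * grad # move a little in the opposite direction from the gradient
}
w # converges toward 3, the value that minimizes the loss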
The step factor is called the learning rate. Picking a reasonable learning rate is important in DL models: if it’s too small, the model will learn very slowly; if it’s too large, the updates may overshoot and the model might miss the minimum of the loss function!