The problem we’re trying to solve here is to classify grayscale images of handwritten digits from MNIST, a classic dataset in machine learning.
The MNIST dataset comes preloaded in Keras, in the form of train and test lists, each of which includes a set of images (x) and associated labels (y):
library(keras)
mnist <- dataset_mnist()
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y
train_images and train_labels form the training set, the data that the model will learn from. The model will then be tested on the test set, test_images and test_labels.
The str() function is a convenient way to get a quick glimpse at the structure of an array.
str(train_images)
## int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
str(train_labels)
## int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
Let’s have a look at the test data:
str(test_images)
## int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
str(test_labels)
## int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
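Each 28 × 28 slice of these arrays is one digit image. Before we reshape anything, we can eyeball one of them with base R (a minimal sketch; the index 5 is arbitrary):

digit <- train_images[5, , ]   # the fifth training image, a 28 x 28 integer matrix
plot(as.raster(digit / 255))   # as.raster() expects grey levels in [0, 1]
train_labels[5]                # its label: 9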
We’ll feed the neural network the training data, train_images and train_labels. We’ll then ask the network to produce predictions for test_images, and we’ll verify whether these predictions match the labels from test_labels. Before training, we’ll preprocess the data by reshaping it into the shape the network expects and scaling it so that all values are in the [0, 1] interval. Previously, our training images were stored in an array of shape (60000, 28, 28) of type integer with values in the [0, 255] interval. We transform it into a double array of shape (60000, 28 * 28) with values between 0 and 1.
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255
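A quick sanity check that the reshaping and rescaling worked as intended (the printed values are illustrative):

str(train_images)    # num [1:60000, 1:784] 0 0 0 0 0 ...
range(train_images)  # 0 1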
We also need to categorically encode the labels, a step which we explain in chapter 3:
train_labels <- to_categorical(train_labels)
test_labels <- to_categorical(test_labels)
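to_categorical() one-hot encodes each integer label as a length-10 vector with a 1 in the position of the digit and 0s elsewhere. For example:

to_categorical(c(5, 0), num_classes = 10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0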
network <- keras_model_sequential() %>%
layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
layer_dense(units = 10, activation = "softmax")
Here our network consists of a sequence of two layers, which are densely connected (also called fully connected) neural layers. The second (and last) layer is a 10-way softmax layer, which means it will return an array of 10 probability scores (summing to 1). Each score will be the probability that the current digit image belongs to one of our 10 digit classes.
summary(network)
## Model: "sequential"
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense (Dense) (None, 512) 401920
## ___________________________________________________________________________
## dense_1 (Dense) (None, 10) 5130
## ===========================================================================
## Total params: 407,050
## Trainable params: 407,050
## Non-trainable params: 0
## ___________________________________________________________________________
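The parameter counts follow directly from the layer shapes: a dense layer has (inputs × units) weights plus one bias per unit.

784 * 512 + 512  # 401920, the first layer (28 * 28 = 784 inputs)
512 * 10 + 10    # 5130, the second layer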
What is an activation function? It’s used to introduce non-linearity into the network. Without activation functions, the output of each layer would be just a linear combination of the previous layer’s outputs, so the whole stack of layers would collapse into a single linear transformation. Non-linear activations are what allow the network to model complex patterns.
# ReLU is a widely used activation function
knitr::include_graphics("activation.png")
To make the network ready for training, we need to pick three more things as part of the compilation step: a loss function, an optimizer, and the metrics to monitor.
network %>% compile(
optimizer = "rmsprop",
loss = "categorical_crossentropy",
metrics = c("accuracy")
)
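The loss we picked, categorical crossentropy, reduces for a one-hot target to minus the log of the probability the network assigns to the true class. A toy calculation (hypothetical numbers):

y_true <- c(0, 0, 0, 1, 0)              # one-hot target: class 3
y_pred <- c(0.05, 0.05, 0.1, 0.7, 0.1)  # predicted probabilities
-sum(y_true * log(y_pred))              # 0.357, i.e. -log(0.7)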
We are now ready to train our network, which in Keras is done via a call to the fit method of the network: we “fit” the model to its training data.
network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)
knitr::include_graphics("123.png")
Two quantities are being displayed during training: the “loss” of the network over the training data, and the accuracy of the network over the training data.
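Incidentally, fit() invisibly returns a history object; had we captured it, we could redraw these curves later (a sketch of the same call as above, just assigning the result):

history <- network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)
plot(history)  # plots loss and accuracy per epoch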
We quickly reach an accuracy of 0.989 (i.e. 98.9%) on the training data. Now let’s check that our model performs well on the test set too:
metrics <- network %>% evaluate(test_images, test_labels, verbose = 0)
metrics
## $loss
## [1] 0.06574973
##
## $acc
## [1] 0.9792
Our test-set accuracy turns out to be 97.9%, quite a bit lower than the training-set accuracy. This gap between training accuracy and test accuracy is an example of “overfitting”: the fact that machine learning models tend to perform worse on new data than on their training data.
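As a final spot check, we can look at individual predictions: predict() returns a row of 10 probabilities per image, and the predicted digit is the position of the largest one (minus 1, since R indexes from 1 while the digits start at 0):

probs <- network %>% predict(test_images[1:5, ])
apply(probs, 1, which.max) - 1  # predicted digits for the first five test images
mnist$test$y[1:5]               # the true labels: 7 2 1 0 4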