library(keras)
library(dplyr)
library(tensorflow)
library(rsample)
library(caret)

MNIST Prediction

Today we’ll be doing some digit prediction using a data set that you can get from here: https://www.kaggle.com/c/digit-recognizer/data?select=train.csv The data set contains information regarding images of different numbers. We’ll be using it to create a keras model.

Pre-processing

Let’s read the data as always using read.csv.

train_mnist <- read.csv("train.csv")

Before we take a look at our data, let’s check if there are any missing values.

train_mnist %>% anyNA()
## [1] FALSE

Seems like we’re safe in this regard. Now let’s take a look using head().

head(train_mnist)

Seems like our data has a whole lot of columns, 785 of them to be exact! They are all integers as well. The label column seems to represent the actual number that is shown in the image. The rest of the columns represents a pixel in the image, with the value of 0 being black and 255 being white. Let’s use the function below in order to visualize our data.

vizTrain <- function(input){
  
  dimmax <- sqrt(ncol(input[,-1]))
  
  dimn <- ceiling(sqrt(nrow(input)))
  par(mfrow=c(dimn, dimn), mar=c(.1, .1, .1, .1))
  
  for (i in 1:nrow(input)){
      m1 <- as.matrix(input[i,2:785])
      dim(m1) <- c(28,28)
      
      m1 <- apply(apply(m1, 1, rev), 1, t)
      
      image(1:28, 1:28, 
            m1, col=grey.colors(255), 
            #  remove axis text
            xaxt = 'n', yaxt = 'n')
      text(2, 20, col="white", cex=1.2, input[i, 1])
  }
}
vizTrain(head(train_mnist,36))


As we can see above, these images are of handwritten numbers. They also match the label column that we’ve mentioned earlier.

Cross-Validation

As we usually do, we have to split our data into training and testing data sets by using initial_split. We’ll use the usual 80:20 proportion.

set.seed(123)

split <- initial_split(data=train_mnist,
                       prop=0.8,
                       strata="label")

We’ll store it like below, into train_data and test_data respectively.

train_data <- training(split)
test_data <- testing(split)

Before we proceed, let’s check the distribution proportion of our data.

prop.table(table(train_mnist$label))
## 
##          0          1          2          3          4          5          6 
## 0.09838095 0.11152381 0.09945238 0.10359524 0.09695238 0.09035714 0.09850000 
##          7          8          9 
## 0.10478571 0.09673810 0.09971429

It seems like the data was distributed quite evenly, I see no large difference in values. Next, we’ll have to separate the label column, as it is the target variable.

train_x <- train_data %>% select(-label)
train_y <- train_data %>% select(label)

test_x <- test_data %>% select(-label)
test_y <- test_data %>% select(label)

Scaling

In order to make the modeling process faster, we should scale our data. In this case where all the columns have the same range of values, we can simply divide by the maximum amount. I’ve already mentioned the range previously, but let’s check it using range() to be sure.

range(train_x)
## [1]   0 255

Seems like we were correct. Now let’s simply divide by 255.

train_x <- train_x/255
test_x <- test_x/255

Changing Data Format

As keras only accepts array format for the predictors, we’ll have to convert our data set into an array as below.

train_x <- array_reshape(x=as.matrix(train_x),
                         dim=dim(train_x))

test_x <- array_reshape(x=as.matrix(test_x),
                         dim=dim(test_x))

And since we’re doing classification, we’ll have to do One Hot Encoding, which just means turning the target variable into a categorical value.

train_y <- to_categorical(train_y)
test_y <- to_categorical(test_y)

Modeling

Now we can begin modeling. We’ll start by creating the base of the model using keras_model_sequential. We’ll then add an input and hidden layer with these values:
- 784 input_shape to represent all the pixels
- 256 neurons, typically decided using 2^n
- relu for data sets whose target predictor falls between 0 and +inf
As well as an output layer with these values:
- 10 units representing the range of the target variable (0-9)
- softmax for multiclass classification

set_random_seed(777)

model <- keras_model_sequential(name="keras") %>%
  
  # hidden layer
  layer_dense(input_shape=784,
              units=256,
              activation="relu") %>%
# output layer
layer_dense(units=10,
            activation="softmax")
model
## Model: "keras"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_1 (Dense)                    (None, 256)                     200960      
##  dense (Dense)                      (None, 10)                      2570        
## ================================================================================
## Total params: 203,530
## Trainable params: 203,530
## Non-trainable params: 0
## ________________________________________________________________________________

Our next step would be to compile our model. For this we’ll be using:
- categorical_crossentropy for the loss method, as our model is for multiclass classification
- optimizer_adam for the optimizer, as it’s a further development from SGD. I’ll simply set the learning_rate as 0.15
- accuracy for metrics for model evaluation purposes

model %>% compile(loss="categorical_crossentropy",
                  optimizer= optimizer_adam(learning_rate=0.15),
                  metrics="accuracy")

I’ll set my batch size to 4200 and have it run for 15 epochs.

modelHistory <- model %>% fit(x=train_x,
                         y=train_y,
                         epochs=15,
                         batch_size=4200,
                         validation_data=list(test_x,test_y))
## Epoch 1/15
## 8/8 - 1s - loss: 42.1994 - accuracy: 0.3305 - val_loss: 3.9949 - val_accuracy: 0.4706 - 1s/epoch - 135ms/step
## Epoch 2/15
## 8/8 - 0s - loss: 1.8472 - accuracy: 0.5148 - val_loss: 1.4061 - val_accuracy: 0.5200 - 231ms/epoch - 29ms/step
## Epoch 3/15
## 8/8 - 0s - loss: 1.2068 - accuracy: 0.5816 - val_loss: 1.0513 - val_accuracy: 0.6694 - 246ms/epoch - 31ms/step
## Epoch 4/15
## 8/8 - 0s - loss: 0.9164 - accuracy: 0.6993 - val_loss: 0.8393 - val_accuracy: 0.7307 - 234ms/epoch - 29ms/step
## Epoch 5/15
## 8/8 - 0s - loss: 0.7416 - accuracy: 0.7467 - val_loss: 0.7214 - val_accuracy: 0.7762 - 226ms/epoch - 28ms/step
## Epoch 6/15
## 8/8 - 0s - loss: 0.6311 - accuracy: 0.7941 - val_loss: 0.6218 - val_accuracy: 0.8148 - 234ms/epoch - 29ms/step
## Epoch 7/15
## 8/8 - 0s - loss: 0.5417 - accuracy: 0.8242 - val_loss: 0.5600 - val_accuracy: 0.8381 - 229ms/epoch - 29ms/step
## Epoch 8/15
## 8/8 - 0s - loss: 0.4833 - accuracy: 0.8432 - val_loss: 0.5080 - val_accuracy: 0.8494 - 236ms/epoch - 30ms/step
## Epoch 9/15
## 8/8 - 0s - loss: 0.4451 - accuracy: 0.8580 - val_loss: 0.4785 - val_accuracy: 0.8557 - 224ms/epoch - 28ms/step
## Epoch 10/15
## 8/8 - 0s - loss: 0.4027 - accuracy: 0.8710 - val_loss: 0.4483 - val_accuracy: 0.8637 - 230ms/epoch - 29ms/step
## Epoch 11/15
## 8/8 - 0s - loss: 0.3645 - accuracy: 0.8825 - val_loss: 0.4117 - val_accuracy: 0.8759 - 225ms/epoch - 28ms/step
## Epoch 12/15
## 8/8 - 0s - loss: 0.3268 - accuracy: 0.8968 - val_loss: 0.3782 - val_accuracy: 0.8885 - 217ms/epoch - 27ms/step
## Epoch 13/15
## 8/8 - 0s - loss: 0.2972 - accuracy: 0.9056 - val_loss: 0.3581 - val_accuracy: 0.8967 - 218ms/epoch - 27ms/step
## Epoch 14/15
## 8/8 - 0s - loss: 0.2705 - accuracy: 0.9134 - val_loss: 0.3494 - val_accuracy: 0.9015 - 216ms/epoch - 27ms/step
## Epoch 15/15
## 8/8 - 0s - loss: 0.2524 - accuracy: 0.9207 - val_loss: 0.3528 - val_accuracy: 0.9045 - 225ms/epoch - 28ms/step

Evaluation

Before we evaluate our model, we can take a look at the visualization of the fitting process by simply plotting it like below.

plot(modelHistory)

We can see that our model’s accuracy continues to increase, while the loss decreases along with the amount of epochs that was ran. In order to evaluate our model, we have to predict the testing data set first so we have something to validate. We’ll put it through k_argmax so that it outputs the class (in this case, label value) with the highest probability instead of just the probabilities of all the classes. We’ll finally convert it into a vector.

prediction <- predict(model, test_x) %>% k_argmax() %>% as.vector()
## 263/263 - 1s - 748ms/epoch - 3ms/step
head(prediction)
## [1] 0 8 2 9 7 5

Now we can evaluate our model using confusionMatrix and inputting the predicted data into data and the original data into reference.

confusionMatrix(data = as.factor(prediction),
                reference = as.factor(test_data$label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 790   0  11   1   0   4   7   2   3   4
##          1   0 888   8   2   1   3   3  13   6   7
##          2   2   3 788  30   5   2   0  14   3   0
##          3   7   2  28 785   2 105   0  10  85  15
##          4   1   0   8   2 750   5   4   5   6  44
##          5   5   5   2  31   1 536   9   0   9   5
##          6   9   0   2   0   7  26 847   0   5   0
##          7   0   1  14   3   4   3   0 797   1  23
##          8   6   8  10  28   3  43   4   3 686  13
##          9   1   0   0   5  25   5   1  16   3 733
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9045          
##                  95% CI : (0.8981, 0.9107)
##     No Information Rate : 0.108           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8939          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.96224   0.9791  0.90471  0.88501  0.93985  0.73224
## Specificity           0.99578   0.9943  0.99217  0.96620  0.99014  0.99126
## Pos Pred Value        0.96107   0.9538  0.93034  0.75553  0.90909  0.88889
## Neg Pred Value        0.99591   0.9975  0.98901  0.98615  0.99367  0.97487
## Prevalence            0.09771   0.1080  0.10367  0.10557  0.09498  0.08712
## Detection Rate        0.09403   0.1057  0.09379  0.09343  0.08926  0.06379
## Detection Prevalence  0.09783   0.1108  0.10081  0.12366  0.09819  0.07177
## Balanced Accuracy     0.97901   0.9867  0.94844  0.92560  0.96499  0.86175
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9680  0.92674  0.85006  0.86848
## Specificity            0.9935  0.99350  0.98446  0.99259
## Pos Pred Value         0.9453  0.94208  0.85323  0.92902
## Neg Pred Value         0.9963  0.99166  0.98407  0.98542
## Prevalence             0.1041  0.10236  0.09605  0.10045
## Detection Rate         0.1008  0.09486  0.08165  0.08724
## Detection Prevalence   0.1066  0.10069  0.09569  0.09391
## Balanced Accuracy      0.9807  0.96012  0.91726  0.93054

Seems like our model is already quite nice, with an overall accuracy of 91%

Conclusion

Using keras, we can create image prediction/classification models based off the pixel data of the images. There are also a lot of components when building a keras model, such as the input, hidden, and output layers; which all have a lot of different parameters that we can tune in order to create a better model.