INTRODUCTION

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value indicating its lightness or darkness, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
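
As a quick illustration (a hypothetical helper, not part of the dataset), integer division and modulo recover i and j from x:

# hypothetical helper: recover the (row, col) position of pixelx
pixel_position <- function(x) c(row = x %/% 28, col = x %% 28)
pixel_position(350) # pixel350 = 12 * 28 + 14, so row 12, column 14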

This dataset is taken from the Kaggle Digit Recognizer competition: https://www.kaggle.com/competitions/digit-recognizer/data.

Our goal is to build a deep learning model that predicts which digit an image shows, using the pixel values as predictors.

IMPORT LIBRARIES

library(keras) # deep learning (TensorFlow backend)
library(dplyr) # data wrangling
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret) # confusionMatrix() for evaluation
## Loading required package: ggplot2
## Loading required package: lattice
p <- keras_model_sequential() # instantiating a model here triggers TensorFlow loading
## Loaded Tensorflow version 2.0.0
tensorflow::set_random_seed(42) # fix the random seed for reproducibility

DATA CLEANING

df_train <- read.csv("MNIST in CV/train.csv") # labelled training data
df_test <- read.csv("MNIST in CV/test.csv") # unlabelled competition test data
head(df_test)
dim(df_train)
## [1] 42000   785

The training data contains 42,000 observations (rows) and 785 columns.

VISUALIZATION

vizTrain <- function(input){
  
  # grid size: enough panels to show every row of the input
  dimn <- ceiling(sqrt(nrow(input)))
  par(mfrow=c(dimn, dimn), mar=c(.1, .1, .1, .1))
  
  for (i in 1:nrow(input)){
      # columns 2:785 hold the 784 pixel values; fold them into a 28 x 28 matrix
      m1 <- as.matrix(input[i,2:785])
      dim(m1) <- c(28,28)
      
      # rotate the matrix so the digit is drawn upright by image()
      m1 <- apply(apply(m1, 1, rev), 1, t)
      
      image(1:28, 1:28, 
            m1, col=grey.colors(255), 
            # remove axis text
            xaxt = 'n', yaxt = 'n')
      # print the true label in the corner of each panel
      text(2, 20, col="white", cex=1.2, input[i, 1])
  }
  
}

The function above renders each row of pixel values as an image so we can inspect the digits visually.

# visualize the first 36 training images
vizTrain(head(df_train, 36))

This is what our data look like: hand-written digits from 0 to 9.

TRAIN-VALIDATION SPLIT

library(rsample) # for initial_split()

set.seed(100)
initializer <- initializer_random_normal(seed = 100) # seeded weight initializer

index <- initial_split(df_train, prop=0.8, strata="label") # stratified 80/20 split

data_train <- training(index)
data_test <- testing(index)
prop.table(table(data_train$label))
## 
##          0          1          2          3          4          5          6 
## 0.09833919 0.11197095 0.09905352 0.10354783 0.09699982 0.09051134 0.09893446 
##          7          8          9 
## 0.10420263 0.09705935 0.09938092

We split the data 80:20 into a training set and a validation set, stratified by label so every digit keeps roughly the same proportion in both sets (as the table above confirms). The first set is used to train the model, the second to evaluate it later.
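
As a quick sanity check (a sketch using the objects above), the validation split should show nearly the same class proportions:

# class proportions in the validation set should mirror the training set
prop.table(table(data_test$label))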

SCALING

# separate the predictors (scaled to 0-1) from the labels
train_x <- data_train %>% select(-label) %>% as.matrix() / 255
train_y <- data_train %>% select(label)

test_x <- data_test %>% select(-label) %>% as.matrix() / 255
test_y <- data_test %>% select(label)

range(train_x)
## [1] 0 1

Pixel values run from 0 to 255, so the raw predictors are not on the scale we want for training.

We divide everything by 255 so the values fall between 0 and 1, and we convert the predictors to a matrix; deep learning models train much better on scaled data.

We now have two objects for the training split, one holding all the predictors and one holding only the labels, and likewise for the validation split.

# reshape to arrays in row-major (C-style) order, as Keras expects
train_x <- array_reshape(train_x, dim=dim(train_x))
test_x <- array_reshape(test_x, dim=dim(test_x))

For the predictors (x), we convert the matrices to arrays. The shape is unchanged; array_reshape matters because it fills values in row-major order, which is what Keras expects (R's own dim<- fills column-major).
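
A quick sanity check (a sketch; the exact row count depends on the split):

# the shape is preserved: roughly 33,600 x 784 for an 80% split of 42,000 rows
dim(train_x)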

ONE-HOT ENCODING

# One-hot encoding target variable
train_y <- to_categorical(train_y$label, num_classes = 10)
test_y <- to_categorical(test_y$label, num_classes = 10)

We one-hot encode the target variable: each label is converted into a binary indicator vector whose length matches the number of classes (digits 0 through 9, so 10 classes).
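
As a small illustration (a sketch; the printed layout may vary across keras versions):

# the label 3 becomes a length-10 indicator vector with a single 1
# in the slot for class 3: 0 0 0 1 0 0 0 0 0 0
to_categorical(3, num_classes = 10)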

MODEL ARCHITECTURE

# Build the model architecture
model1 <- keras_model_sequential(name="model_keras") %>% 
  layer_dense(units=256, activation="relu", input_shape=784, name="hidden_1") %>% 
  layer_dense(units=128, activation="relu", name="hidden_2") %>%
  # layer_dense(units=16, activation="relu", name="hidden_3") %>%
  layer_dense(units=10, activation="softmax", name="output")

model1
## Model: "model_keras"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## hidden_1 (Dense)                    (None, 256)                     200960      
## ________________________________________________________________________________
## hidden_2 (Dense)                    (None, 128)                     32896       
## ________________________________________________________________________________
## output (Dense)                      (None, 10)                      1290        
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________

For the model architecture, we build a sequential Keras model with two hidden layers using the ReLU activation function (a third hidden layer is left commented out) and a softmax output layer with 10 units, one per digit class.
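
The parameter counts in the summary follow directly from the layer sizes; a quick arithmetic check:

# params per dense layer = inputs * units + units (one bias per unit)
784 * 256 + 256 # hidden_1: 200,960
256 * 128 + 128 # hidden_2: 32,896
128 * 10 + 10   # output:   1,290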

COMPILE MODEL

The next step is to specify the loss function, the optimizer, and the metrics to report during training.

# compile: categorical cross-entropy loss, SGD optimizer, report accuracy
model1 %>% compile(loss=loss_categorical_crossentropy(),
                   optimizer=optimizer_sgd(learning_rate=0.1), 
                   metrics="accuracy")

For compilation we use categorical cross-entropy as the loss, since the target is a multi-class categorical variable; SGD with a learning rate of 0.1 is the optimizer, and accuracy is the metric shown during training.
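
As a sketch of an equivalent call (assuming a reasonably recent keras version), the loss can also be passed by its string name:

# equivalent compile call using the string name of the loss
model1 %>% compile(loss = "categorical_crossentropy",
                   optimizer = optimizer_sgd(learning_rate = 0.1),
                   metrics = "accuracy")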

MODEL TRAINING

In this step we fit the model and visualize the training history, using the validation data to measure out-of-sample performance.

# fit the model; with ~33,600 training rows, batch_size 21000 gives 2 updates per epoch
history <- model1 %>% fit(x=train_x, 
                          y=train_y, 
                          validation_data=list(test_x, test_y), 
                          batch_size=21000,
                          epochs=20)
plot(history)

A healthy fit is one where the red line (our model on the training data) and the blue line (the validation data) stay close together, as they do above; a widening gap between them would signal overfitting.

Since accuracy climbs above 0.8 on both sets, we can say the model is good enough.

MODEL PREDICT AND EVALUATION

PREDICT WITH DATA VALIDATION

Next, we generate predictions on the validation data using the trained model.

# predict class probabilities, take the most probable class, convert to a factor
pred <- predict(model1, test_x) %>%  k_argmax() %>% as.array() %>% as.factor()
head(pred)
## [1] 1 3 9 3 7 6
## Levels: 0 1 2 3 4 5 6 7 8 9
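
Under the hood (a quick sketch), predict() returns one row of class probabilities per image, and k_argmax() picks the column with the highest probability:

# the raw output is a probability matrix: one row per image, one column per digit
prob <- predict(model1, test_x)
dim(prob) # number of validation images x 10 classes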

EVALUATION WITH DATA VALIDATION

After that, we evaluate the model; if it performs poorly, we can go back and tune it.

confusionMatrix(pred, reference = as.factor(data_test$label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 784   0  14   5   3   9  11  10   3   9
##          1   0 900  16   6   7  13  14  20  25  13
##          2   8   9 697  33   6  18  14  12  17   2
##          3   1   3  23 712   0  80   0   2  54  14
##          4   1   0  18   2 699  35  10  12   8  54
##          5  13   6   2  47   0 539  16   5  37   4
##          6   8   1  32   7  15  19 743   1  13   1
##          7   2   0  11  13   2   8   0 789   4  43
##          8  11   3  32  39   3  25   5  10 609  12
##          9   0   0   4   8  78   8   0  39  32 697
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8532          
##                  95% CI : (0.8455, 0.8608)
##     No Information Rate : 0.1097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8369          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.94686   0.9761  0.82097  0.81651  0.85978  0.71485
## Specificity           0.99155   0.9848  0.98424  0.97649  0.98155  0.98300
## Pos Pred Value        0.92453   0.8876  0.85417  0.80090  0.83313  0.80568
## Neg Pred Value        0.99418   0.9970  0.97996  0.97870  0.98493  0.97220
## Prevalence            0.09855   0.1097  0.10105  0.10378  0.09676  0.08974
## Detection Rate        0.09331   0.1071  0.08296  0.08474  0.08319  0.06415
## Detection Prevalence  0.10093   0.1207  0.09712  0.10581  0.09986  0.07962
## Balanced Accuracy     0.96920   0.9804  0.90261  0.89650  0.92067  0.84893
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.91390  0.87667  0.75935  0.82097
## Specificity           0.98722  0.98894  0.98158  0.97762
## Pos Pred Value        0.88452  0.90482  0.81308  0.80485
## Neg Pred Value        0.99074  0.98526  0.97478  0.97983
## Prevalence            0.09676  0.10712  0.09545  0.10105
## Detection Rate        0.08843  0.09391  0.07248  0.08296
## Detection Prevalence  0.09998  0.10378  0.08915  0.10307
## Balanced Accuracy     0.95056  0.93280  0.87047  0.89930

As we can see above, the model reaches about 85% accuracy (0.8532), which we can consider good enough for this simple architecture.
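
As a cross-check (a sketch; the exact numbers depend on the training run), keras can report the loss and accuracy on the validation data directly:

# evaluate loss and accuracy with keras itself
model1 %>% evaluate(test_x, test_y)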

MODEL ATTEMPT

MODEL PREDICT WITH UNSEEN DATA

Since we loaded df_test earlier, let's apply the model to genuinely unseen data: this file has no labels at all.

PREPROCESSING

We apply the same preprocessing as before: scale the pixel values and convert the result to an array.

preprocess_x <- function(x){
    # scale pixel values to the 0-1 range and convert to a matrix
    x <- as.matrix(x) / 255
    # reshape to a row-major array for Keras
    array_reshape(x, dim = dim(x))
}

testt_x <- preprocess_x(df_test)

PREDICT

We generate predictions on the unseen data with the model trained above.

# predict labels, attach them to df_test, and inspect the last few columns
pred2 <- predict(model1, testt_x) %>% k_argmax() %>% as.array() %>% as.factor()
df_test$label <- pred2
df_test[,c(780:785)]

These are the model's predictions on the unseen data.

Based on the validation results, we expect roughly 85% of these predictions to be correct.

We cannot evaluate them with a confusion matrix, however, because df_test genuinely has no labels.
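
If we wanted to submit these predictions to the competition, a hedged sketch might look like this (the ImageId / Label column names are an assumption based on Kaggle's digit-recognizer sample submission):

# assumed Kaggle submission format: ImageId, Label
submission <- data.frame(ImageId = 1:nrow(df_test),
                         Label = as.integer(as.character(pred2)))
write.csv(submission, "submission.csv", row.names = FALSE)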

CONCLUSION

🖋 Insight :

  • The model reaches 0.8532 accuracy. Reading the confusion matrix above, for example:
    • 784 images were predicted as label 0 and really are 0 (the diagonal entry for class 0); the other entries in that row are images wrongly predicted as 0.
    • 8 images that are actually label 0 were misclassified as label 2.
  • As this is a multi-class problem with roughly balanced classes, accuracy is a reasonable single metric to focus on.

In other words, we expect the model to classify about 85% of unseen images correctly.

For choosing the best model for our Neural Network / Deep Learning, we should consider a few things:
- Choose the simplest model that performs well
- Account for training time
- Make sure the model is neither overfit nor underfit, because we need it to perform well on both data sets (train & test)