Digit Recognizer Using XGBoost

Overview

The following codes are designed to create a machine learning model to determine a number from a database of handwritten numbers from 1 to 9. The data was taken from “THE MNIST DATABASE of handwritten digits”. It has originally 60,000 observations. However, since the data is too big to process for my computer, I decided to truncate it down to 6499 observations. There are 785 columns, one of which is the label which determines the true number while the rest of 784 columns are scaled numbers from a 28x28 pixel image.

The machine learning algorithm I used is XGBoost. A very popular algorithm which has a very good reputation to Data Scientist especially in competitions in Kaggle.

Load the necessary libraries and the data

set.seed(311)
library(xgboost)
library(Matrix)
library(RCurl)

url <- "https://raw.githubusercontent.com/GrejSegura/DataBank/master/digit_train_trunc.csv"
get <- getURL(url)
digit_data <- read.csv(text = get)

Data Pre-processing

Convert all values to numeric as xgboost only accepts numeric type of data.

digit_data <- sapply(digit_data, as.numeric)
#sapply in the last operation converted the data to matrix, 
#bring back the structure to data frame
digit_data <- as.data.frame(digit_data)

Create a training and test data. 80% of the data was allotted to training while the rest are for Testing.

#create training and test data
d <- 1:nrow(digit_data)
index_digit <- sample(d, round(nrow(digit_data)*.8))

train_digit <- digit_data[index_digit,]
test_digit <- digit_data[-index_digit,]

#create training/test data
train_1 <- train_digit[,2:length(digit_data)]
test_1 <- test_digit[,2:length(digit_data)]

#important to separate the label or classification variable
train_2 <- train_digit[,1]
test_2 <- test_digit[,1]

Model Building

Now, we will create a model using XGBoost with decision tree as the booster.

xgtree_digit <- xgboost(as.matrix(train_1), train_2, 
                        booster = 'gbtree',
                        objective = 'multi:softmax',
                        num_class = 10,
                        max.depth = 5,
                        eta = 0.1,
                        nthread = 4,
                        nrounds = 120,
                        min_child_weight = 1,
                        subsample = 0.5, 
                        colsample_bytree = 1, 
                        num_parallel_tree = 1,
                        verbose = 0)

The mean error for the training was very small at 0.000385. This is a very good indication that we have a strong predicting model. Let us see by testing the model to the Test data.

pred_xgtree_digit <- predict(xgtree_digit, as.matrix(test_1))

msexgtree_digit.table <- table(as.factor(pred_xgtree_digit), as.factor(test_2))
wrong <- ifelse(abs(test_2 - pred_xgtree_digit) > 0, 1, 0)
error <- sum(wrong) / length(as.factor(test_2))

msexgtree_digit.table

##    
##       0   1   2   3   4   5   6   7   8   9
##   0 122   0   0   2   0   1   2   0   0   2
##   1   0 145   3   1   0   2   0   1   2   1
##   2   0   1 138   3   1   0   0   2   3   0
##   3   0   0   2 116   0   1   0   0   1   3
##   4   0   0   1   0 107   0   0   2   2   2
##   5   0   1   0   5   0 104   2   0   2   2
##   6   2   2   1   1   1   2 115   0   1   0
##   7   0   0   2   1   0   0   2 131   1   1
##   8   1   0   2   4   0   0   1   0 110   1
##   9   0   0   2   3   4   0   0   5   5 114

error

## [1] 0.07538462

The table shows that most of the images were predicted correctly by the model. The error rate justifies it with a mere 0.075 of the data or 7.5% are incorrectly identified by the model.

Image Visualizer

We wish to visualize the data since we need to see if indeed the handwritten digits were correctly identified by the model.

digit_visual <- function(x){
        #create NULL vectors
        image_raw <- data.frame(matrix(ncol = 28, nrow = 28))
        i = 0
        for (i in 0:27){
                image_raw[28 - i,] <- x[(1:28) + (i*28)]
        }
        heatmap(as.matrix((image_raw)), Rowv=NA, Colv=NA, col = heat.colors(256),  
                symm = FALSE, margins=c(5,10), labRow = FALSE, labCol = FALSE)
}

We now try if the visualizer is working fine. Let us visualize the 18th observation of the Test data and identify the number.

number <- as.vector(as.matrix(test_1[18,]))
label.18 <- test_2[18]
label.18

## [1] 9

digit_visual(number)

The label says the number is 9 and by looking at our visualizer it is indeed number 9.

Guessing Game - by the Prediction Model

Here we will test the capability of our model by creating a very simple guessing game, which the model will play - not us. The model will determine the number shown in the image and we will be the judge.

try_luck <- function(x, y){
        result <- predict(xgtree_digit, as.matrix(x))
        #check if the number is correctly guessed
        if(y == result){
                say <- paste('CORRECTLY GUESSED NUMBER ', result, ', CHECK THE IMAGE TO CONFIRM.', sep = '')
                print(say)
        } else {
                say <- paste('OOPS! CORRECT GUESS WOULD BE ', result, '.', sep = '')
                print(say)  
        }
}
# -----------------------#
# CREATE ANOTHER FUNCTION FOR TRY GUESSING NUMBERS
guess <- function(x){
        num <- test_1[row,]
        # assign the correct number
        real <- test_2[row]
        # result of the model
        try_luck(num, real)
        digit_visual(num)
}
# OK, LET'S TRY THIS ONE -- GAME TIME!!!
# pick a row in test_2, only 1-1300 allowed ----> assign it to row variable below

Ok! Let’s see if the model will correctly predict. First, the model will identify the number and the image will follow. By checking the label in the data, the game will decide if the model predicted the correct number by saying “CORRECTLY GUESSED NUMBER” or “OOPS! CORRECT GUESS WOULD BE (correct number)” if it is wrong.

Let us try 5 images.

index <- c(300, 451, 654, 765, 123) ### the index were the images are located in our test data

for (i in index){
row <- i
# let the model guess
guess(row)
}

## [1] "CORRECTLY GUESSED NUMBER 8, CHECK THE IMAGE TO CONFIRM."

## [1] "CORRECTLY GUESSED NUMBER 0, CHECK THE IMAGE TO CONFIRM."

## [1] "CORRECTLY GUESSED NUMBER 0, CHECK THE IMAGE TO CONFIRM."

## [1] "CORRECTLY GUESSED NUMBER 5, CHECK THE IMAGE TO CONFIRM."

## [1] "CORRECTLY GUESSED NUMBER 0, CHECK THE IMAGE TO CONFIRM."

Great! the model predicted all 5 numbers correctly.

Conclusion

We built a model to predict a number by looking at the images. The accuracy was at 92.5% which is impressive given that we have only built 1 model. The accuracy should improve however if we advance it by ensembling multiple models.