Digit Recognizer with R

For data science gymnastics, I usually head over to Kaggle.
Kaggle is cool, and if you don’t know what they are about but love data science, I urge you to check ’em out.
They were hosting a competition, Classify handwritten digits using the MNIST data, which is simply about implementing a piece of software that is able to read handwritten numbers. With a knack for machine learning and a few hours to kill, I was in… though, I have to confess, I wasn’t in it to win it.

Getting the big picture

I sprinted off by executing the exploratory analysis scripts provided by Kaggle to visualize the dataset. This would give a fine indication of the level of data transformation, such as a histogram of oriented gradients (HOG), needed before fitting the machine learning model.

To my dismay, the only thing that I managed to visualize was a cryptic error message.

Team Kaggle, it seems, was using an old version of ggplot2, and thus the scripts would not execute on ggplot2 version 2.1.0 (or later). So, my very first step was to rewrite the script to work with the most recent version, the result of which can be found here.

With the script fixed, I could plot the pixel mean and standard deviation grouped by digit. Two things worth noticing here are that the digits are nicely centered, so trimming the empty space might be a sufficient transformation, but also that the 7s resemble the 9s way too much for my liking.

The standard deviations of the digits
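
The aggregation behind that plot comes from Kaggle's script; a rough sketch of the same idea (assuming df.training has a label column followed by the 784 pixel columns) could look like this:

# Mean and standard deviation of every pixel, grouped by digit label
pixel.means <- aggregate(. ~ label, data = df.training, FUN = mean)
pixel.sds <- aggregate(. ~ label, data = df.training, FUN = sd)

# Reshape one digit's mean intensities into a 28 x 28 image and display it
digit.7 <- matrix(unlist(pixel.means[pixel.means$label == 7, -1]),
    nrow = 28, byrow = TRUE)
image(t(apply(digit.7, 2, rev)), col = grey.colors(255), axes = FALSE)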

Visualising the MNIST dataset using t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique to map high-dimensional data onto a 2D or 3D plane for visualization purposes. It is a very popular and effective tool; however, it is computationally expensive and scales quadratically, O(N^2), with the number of samples.

Barnes-Hut t-SNE is a further improvement of the t-SNE algorithm and scales as O(N log N) instead of O(N^2).

Due to limited computing power, I am using t-SNE on a subset of the full MNIST dataset, to see if the high-dimensional data contains sufficient variability to be modeled by a classifier. The t-SNE visualization shows highly separated clusters, indicating that the digits are linearly or non-linearly separable.
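
The subsetting step is not part of the snippet below; a minimal sketch of it (with 10,000 rows as an arbitrary sample size for illustration) could be:

set.seed(42)  # for reproducibility
# Keep a random subset of the training rows before running t-SNE
df.training <- df.training[sample(nrow(df.training), 10000), ]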

library(Rtsne)
# Barnes-Hut t-SNE (theta = 0.5) down to two dimensions, with an initial PCA step
tsne <- Rtsne(as.matrix(df.training), check_duplicates = FALSE, pca = TRUE, 
    perplexity = 30, theta = 0.5, dims = 2)

t-SNE clustering of the MNIST dataset
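
The plot is produced from the embedding returned by Rtsne; a minimal ggplot2 sketch (assuming df.training$label holds the digit labels of the sampled rows) might look like this:

library(ggplot2)

# Scatter the 2-D embedding, coloured by the true digit label
embedding <- data.frame(x = tsne$Y[, 1], y = tsne$Y[, 2], 
    label = factor(df.training$label))
ggplot(embedding, aes(x = x, y = y, colour = label)) + 
    geom_point(size = 1, alpha = 0.5) + 
    ggtitle("t-SNE clustering of the MNIST dataset")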

The rise of the model

The competition provided a benchmark model: a randomized decision tree (a.k.a. a random forest), which achieved an accuracy of 93% and suggested a fairly clean dataset.
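
Kaggle's benchmark code is not reproduced here, but a comparable random-forest baseline could be sketched as follows (the randomForest package and ntree = 100 are my own choices, not the competition's):

library(randomForest)

# Random forest on the raw pixel columns, with the digit label as the target
rf.model <- randomForest(x = df.training[, -1], 
    y = as.factor(df.training$label), ntree = 100)
rf.model  # prints the out-of-bag error estimate and confusion matrix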

The Support Vector Machine (SVM) model has been used extensively in both optical character recognition and face recognition. It has been gaining popularity as computing power has increased.

Choosing between a neural-network-based model and an SVM model, I ended up using the SVM, suspecting that my laptop would otherwise run out of memory.

Due to the quality and cleanness of the source data, I merely removed the zero-sum columns from the dataset and rushed to fit the model.

# Keep the label plus every pixel column whose sum is greater than zero
colums.sum.larger.than.zero <- names(which(colSums(df.training[, -1]) > 0))
df.training <- df.training[c("label", colums.sum.larger.than.zero)]

In the case of digit recognition, non-linear kernels provide lower error rates, though they are considerably heavier to compute. I also experimented with non-standard C values, as I was clearly overfitting the model when training with 70% (approx. 29,000 rows) of the training set.
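
Such a search could be set up with e1071's tune.svm, roughly as below; the parameter grid and the df.training.sample name for the 70% split are illustrative rather than the exact values used:

library(e1071)

# 10-fold cross-validated grid search over cost (C) and gamma;
# the label column should be a factor for classification
tuned <- tune.svm(label ~ ., data = df.training.sample, 
    cost = c(1, 10, 100), gamma = c(1e-04, 1e-03, 1e-02))
summary(tuned)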

require(caret)
require(e1071)
require(kernlab)

# RBF-kernel SVM; in kernlab the kernel width is supplied as sigma via kpar
# (rather than a gamma argument), and the label must be a factor for C-svc
model <- ksvm(label ~ ., data = df.data, type = "C-svc", kernel = "rbfdot", 
    C = 100, kpar = list(sigma = 0.001), scaled = FALSE)

The result

The SVM model achieves an accuracy of 0.99 (99 percent) when run on the test set. At the moment of writing (with 3 months left of the competition to run), the model ranks within the top 33 percent of submissions on Kaggle!
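
The submission step itself is not shown above; assuming df.test holds the Kaggle test set with the same zero-sum columns removed, it could be sketched as:

# Predict the digit labels for the test set and write a Kaggle submission file
predictions <- predict(model, df.test)
submission <- data.frame(ImageId = 1:nrow(df.test), Label = predictions)
write.csv(submission, file = "submission.csv", row.names = FALSE)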