Final Project DATA 607

Digit Recognition

The purpose of the project is to read hand written digits. To do this I have used MNIST database. This database is also widely used for training and testing in the field of machine learning and for this project I have use three models (explained later).

Load Data

I have used MNIST database and then divided the original database in 80:20 ratio. The 80% of data will be used for training the models and rest will be used for testing the model.

digit<- read.csv("mnist.csv")
nrow(digit)

## [1] 59999

digit <- digit[1:20000,]
# Divide data in 80:20 ratio - training:test
samp_size <-floor(0.80* nrow(digit))
train_ind <-sample(seq_len(nrow(digit)), size = samp_size)

# Training data
# train_set <- digit[train_ind,]
DATASET.train <- as.data.frame(digit[train_ind,])

# Test Data
# test_set  <-digit[-train_ind,]
DATASET.test <-  as.data.frame(digit[-train_ind,])

Visualization

MNIST images

Lets see what hand written images MNIST data set has

flip <- function(matrix){
    apply(matrix, 2, rev)
}

par(mfrow=c(3,3))
for (i in 1:28){
    dit <- flip(matrix(rev(as.numeric(DATASET.train[i,-c(1, 786)])), nrow = 28)) #look at one digit
    image(dit, col = grey.colors(255))
}

Visualising MNIST Dataset using t-SNE

t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It visualizes high-dimensional data by giving each data point a location in a two or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that allows optimization, and produces significantly better visualizations by reducing the tendency to lump points together in the center of the map that often renders the visualization ineffective and unreadable. Since it visualizes high-dimensional data, it is one of the best technique to visualize a large data set such as MNIST.

tsne <- Rtsne(as.matrix(DATASET.test), check_duplicates = TRUE, pca = TRUE, perplexity = 30, theta = 0.5, dims = 2)

tSNE:

Training and Test Data Set

Training Data Set

Lets visualize the training data using a Bar Plot. This will give us an idea of frequency of each digits in the DATASET.training data frame.

 barplot(table(DATASET.train$X5), main="Total Number of Digits (Training Set)", col=brewer.pal(10,"Set3"),
    xlab="Numbers", ylab = "Frequency of Numbers")

Test Data Set

The ratio of digits in test data set is very similar to that of training data set, as seen below.

 barplot(table(DATASET.test$X5), main="Total Number of Digits (Test Set)", col=brewer.pal(10,"Set2"),
    xlab="Numbers", ylab = "Frequency of Numbers")

Models

For this project I’ll be using three Models namely :

RPart
kNN
FNN

RPart

RPart or Recursive PARTitioning build classification or regression models of a very general structure using a two stage procedure (described below), the resulting models can be represented as binary trees. It is very important to understand how a decision tree works in RPart. Consider a case of 168 cardiac arrest patients.The goal is to predict which patients can be successfully revived in the field. The resultant model separated the patients into four groups as shown below, where

X1 = initial heart rhythm 1= VF/VT 2=EMD 3=Asystole 4=Other

X2 = initial response to defibrillator 1=Improved 2=No change 3=Worse

X3 = initial response to drugs 1=Improved 2=No change 3=Worse

RPart decision tree:

As seen above, the tree is built by the following process: first the single variable is found which best splits the data into two groups. The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or until no improvement can be made.

For in depth understanding of RPart, please follow the below link An Introduction to Recursive Partitioning

Model creation

Lets create a model based on RPart Algorithm. The first argument of the function is a model formula, with the ~ symbol standing for “is modeled as”. Since we are dealing with categorical data (0-9), we will give method as “class” signifying classification. The Proc.time function will give us the time and CPU resource consumed in building this model. We should take note of Elapsed Time which will tell us the actual duration consumed in building this model.

pc <- proc.time()
model.rpart <- rpart(DATASET.train$X5 ~ .,method = "class", data = DATASET.train)
proc.time() - pc

##    user  system elapsed 
##   44.58    0.29   44.96

printcp(model.rpart)

## 
## Classification tree:
## rpart(formula = DATASET.train$X5 ~ ., data = DATASET.train, method = "class")
## 
## Variables actually used in tree construction:
##  [1] X0.127  X0.207  X0.227  X0.228  X0.277  X0.343  X0.389  X0.431 
##  [9] X1      X186    X253.12 X253.30 X66     X70    
## 
## Root node error: 14190/16000 = 0.88687
## 
## n= 16000 
## 
##          CP nsplit rel error  xerror      xstd
## 1  0.098802      0   1.00000 1.00000 0.0028235
## 2  0.079986      1   0.90120 0.90226 0.0035644
## 3  0.077801      2   0.82121 0.83347 0.0039140
## 4  0.063777      3   0.74341 0.74482 0.0042210
## 5  0.054545      4   0.67963 0.65215 0.0044020
## 6  0.047216      5   0.62509 0.59824 0.0044487
## 7  0.042072      6   0.57787 0.55152 0.0044560
## 8  0.019168      7   0.53580 0.51896 0.0044429
## 9  0.016279      8   0.51663 0.49655 0.0044252
## 10 0.013319      9   0.50035 0.48978 0.0044185
## 11 0.013037     10   0.48703 0.48048 0.0044081
## 12 0.012967     11   0.47400 0.47837 0.0044056
## 13 0.012755     12   0.46103 0.47569 0.0044023
## 14 0.010007     13   0.44827 0.46004 0.0043810
## 15 0.010000     14   0.43827 0.45553 0.0043741

prp(model.rpart, extra=6, main="Classification (RPART). Tree of Handwritten Digit Recognition ",
box.col=brewer.pal(10,"Set3")[model.rpart$frame$yval])

draw.tree(model.rpart, cex = 0.52, nodeinfo = TRUE, col = gray(0:8/8))

heat.tree <- function(tree, low.is.green=FALSE, ...) { # dots args passed to prp
y <- model.rpart$frame$yval
if(low.is.green)
y <- -y
max <- max(y)
min <- min(y)
cols <- rainbow(99, end=.36)[
ifelse(y > y[1], (y-y[1]) * (99-50) / (max-y[1]) + 50,
(y-min) * (50-1) / (y[1]-min) + 1)]
prp(model.rpart, branch.col=brewer.pal(10,"Set3"), box.col=brewer.pal(10,"Set3"), ...)
}

heat.tree(model.rpart, type=4, varlen=0, faclen=0, fallen.leaves=TRUE)

Confusion Matrix - RPart

The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.

prediction.rpart <- predict(model.rpart, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$X5, `Predicted Class` = prediction.rpart)

##             Predicted Class
## Actual Class   0   1   2   3   4   5   6   7   8   9
##            0 342   0   2  11   0   2   1  33  16  18
##            1   2 382   6  26   3   1   0   9  41   1
##            2  27  21 144  19  19   1  16  18  93  12
##            3   9   7   8 244   6  32   0  17  31  57
##            4   0  18   6   8 216  11   4  58  30  35
##            5  37   3   0  42  30 133  15  38  30  14
##            6  15   3   9  20  11  13 233  36  41   3
##            7   3   4   9   2   7   1   0 298  41  25
##            8   5  38  25  33   8  15  29   8 184  46
##            9   2  16   7  13  11  16   6  72  42 245

Prediction Accuracy

Lets find out the overall prediction accuracy of our model.

error.rate.rpart <- sum(DATASET.test$X5 != prediction.rpart)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.rpart) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))

## [1] "Prediction Accuracy: 60.53"

Predict Digit for RPart

row <- 1
prediction.digit <- as.vector(predict(model.rpart, newdata = DATASET.test[row, ], type = "class"))
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))

## [1] "Current Digit: 4"

print(paste0("Predicted Digit: ", prediction.digit))

## [1] "Predicted Digit: 4"

kNN

kNN stands for k - nearest neighbor. This is one of my favorite model because of the fact that its simplest amongst all. Imagine you need to classify a vehicle based on parameters such as off road capability and gas consumption. Assume you already have four groups - EV, SUV, Sedans and Trucks. Now if you have to classify Toyota Rav4 Hybrid for k = 3, you need to see the nearest three neighbors and then as per the animation below, you will find SUV nearest to Rav4 and hence it can be classified as SUV.

kNN classification:

The above is an oversimplification of kNN, kNN is an non parametric lazy learning algorithm and by non parametric I mean that it does not make any assumptions on the underlying data distribution i.e. it has no bias.

Model creation

pc <- proc.time()
model.knn <- IBk(DATASET.train$X5 ~ ., data = DATASET.train)
proc.time() - pc

##    user  system elapsed 
##  956.83    0.75  958.53

Confusion Matrix - (kNN)

The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.

prediction.knn <- predict(model.knn, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$X5, `Predicted Class` = prediction.knn)

##             Predicted Class
## Actual Class   0   1   2   3   4   5   6   7   8   9
##            0 422   0   0   0   0   0   3   0   0   0
##            1   0 467   1   2   1   0   0   0   0   0
##            2   3  10 343   2   0   2   1   6   3   0
##            3   0   1   4 385   0   7   1   2   8   3
##            4   0   2   0   0 377   0   0   0   0   7
##            5   1   0   0   1   0 335   4   0   1   0
##            6   2   0   0   0   0   1 380   0   1   0
##            7   0   1   3   0   1   0   0 379   0   6
##            8   1   4   4   7   4   7   2   2 353   7
##            9   1   1   1   1   4   1   2  10   2 407

Prediction Accuracy

Lets find out the overall prediction accuracy of our model.

error.rate.knn <- sum(DATASET.test$X5 != prediction.knn)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.knn) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))

## [1] "Prediction Accuracy: 96.2"

Predict Digit for kNN

row <- 1
prediction.digit <- as.vector(predict(model.knn, newdata = DATASET.test[row,  ], type = "class"))
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))

## [1] "Current Digit: 4"

print(paste0("Predicted Digit: ", prediction.digit))

## [1] "Predicted Digit: 4"

FNN

FNN stands for Fast Nearest Neighbor. As you may have guessed it by now FNN belongs to the same family of Models to which k-NN belongs i.e. Nearest neighbor search. What is Random Forest to Rpart or other decision tree based algorithms, FNN is to k-NN.In applications areas, such as in digit recognition, recommendation systems and computer vision, where fast response times are critical, using brute force linear search is often not feasible and it areas such as these where FNN shines.

For detailed understanding of Fast Nearest Neighbor, please refer to the following link

Fast k-NN Search

Model creation

pc <- proc.time()
model.FNN <- FNN::knn(DATASET.train[, -1], DATASET.test[, -1], DATASET.train$X5,  k = 10, algorithm = "cover_tree")
proc.time() - pc

##    user  system elapsed 
##  211.70    0.36  212.67

Confusion Matrix - (fNN)

The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.

table(`Actual Class` = DATASET.test$X5, `Predicted Class` = model.FNN)

##             Predicted Class
## Actual Class   0   1   2   3   4   5   6   7   8   9
##            0 424   0   0   0   0   0   1   0   0   0
##            1   0 467   2   1   0   0   1   0   0   0
##            2   6  15 335   1   0   1   2   7   3   0
##            3   0   3   6 378   1  10   1   4   6   2
##            4   0   5   0   0 369   0   3   0   0   9
##            5   1   3   0   5   2 328   2   0   1   0
##            6   1   1   0   0   1   2 379   0   0   0
##            7   0   5   1   0   1   0   0 380   0   3
##            8   2   6   2   9   4   7   3   1 343  14
##            9   4   3   2   3   2   2   1  10   2 401

Prediction Accuracy

Lets find out the overall prediction accuracy of our model.

error.rate.fnn <- sum(DATASET.test$X5 != model.FNN)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.fnn) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))

## [1] "Prediction Accuracy: 95.1"

Predict Digit for fNN

row <- 1
prediction.digit <- model.FNN[row]
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))

## [1] "Current Digit: 4"

print(paste0("Predicted Digit: ", prediction.digit))

## [1] "Predicted Digit: 4"

Conclusion

After several runs and comparing all the three models, the Nearest Neighbor models appear to be superior to Rpart- tree based model. Both kNN and FNN repeatedly returned with above 95% Accuracy (which can also be the case of Model over fitting). The choice between kNN and FNN is rather not that difficult since kNN took comparatively greater time to run than FNN, and hence my Model of choice for Digit Recognition has to be FNN.

Final Project DATA 607

Dhananjay Kumar

December 11, 2016

Digit Recognition

Load Data

Visualization

MNIST images

Visualising MNIST Dataset using t-SNE

Training and Test Data Set

Training Data Set

Test Data Set

Models

RPart

Model creation

Confusion Matrix - RPart

Prediction Accuracy

Predict Digit for RPart

kNN

Model creation

Confusion Matrix - (kNN)

Prediction Accuracy

Predict Digit for kNN

FNN

Model creation

Confusion Matrix - (fNN)

Prediction Accuracy

Predict Digit for fNN

Conclusion