The purpose of the project is to read hand written digits. To do this I have used MNIST database. This database is also widely used for training and testing in the field of machine learning and for this project I have use three models (explained later).
I have used MNIST database and then divided the original database in 80:20 ratio. The 80% of data will be used for training the models and rest will be used for testing the model.
digit<- read.csv("mnist.csv")
nrow(digit)
## [1] 59999
digit <- digit[1:20000,]
# Divide data in 80:20 ratio - training:test
samp_size <-floor(0.80* nrow(digit))
train_ind <-sample(seq_len(nrow(digit)), size = samp_size)
# Training data
# train_set <- digit[train_ind,]
DATASET.train <- as.data.frame(digit[train_ind,])
# Test Data
# test_set <-digit[-train_ind,]
DATASET.test <- as.data.frame(digit[-train_ind,])
Lets see what hand written images MNIST data set has
flip <- function(matrix){
apply(matrix, 2, rev)
}
par(mfrow=c(3,3))
for (i in 1:28){
dit <- flip(matrix(rev(as.numeric(DATASET.train[i,-c(1, 786)])), nrow = 28)) #look at one digit
image(dit, col = grey.colors(255))
}
t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It visualizes high-dimensional data by giving each data point a location in a two or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that allows optimization, and produces significantly better visualizations by reducing the tendency to lump points together in the center of the map that often renders the visualization ineffective and unreadable. Since it visualizes high-dimensional data, it is one of the best technique to visualize a large data set such as MNIST.
tsne <- Rtsne(as.matrix(DATASET.test), check_duplicates = TRUE, pca = TRUE, perplexity = 30, theta = 0.5, dims = 2)
tSNE:
Lets visualize the training data using a Bar Plot. This will give us an idea of frequency of each digits in the DATASET.training data frame.
barplot(table(DATASET.train$X5), main="Total Number of Digits (Training Set)", col=brewer.pal(10,"Set3"),
xlab="Numbers", ylab = "Frequency of Numbers")
The ratio of digits in test data set is very similar to that of training data set, as seen below.
barplot(table(DATASET.test$X5), main="Total Number of Digits (Test Set)", col=brewer.pal(10,"Set2"),
xlab="Numbers", ylab = "Frequency of Numbers")
For this project I’ll be using three Models namely :
RPart or Recursive PARTitioning build classification or regression models of a very general structure using a two stage procedure (described below), the resulting models can be represented as binary trees. It is very important to understand how a decision tree works in RPart. Consider a case of 168 cardiac arrest patients.The goal is to predict which patients can be successfully revived in the field. The resultant model separated the patients into four groups as shown below, where
X1 = initial heart rhythm 1= VF/VT 2=EMD 3=Asystole 4=Other
X2 = initial response to defibrillator 1=Improved 2=No change 3=Worse
X3 = initial response to drugs 1=Improved 2=No change 3=Worse
RPart decision tree:
As seen above, the tree is built by the following process: first the single variable is found which best splits the data into two groups. The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or until no improvement can be made.
For in depth understanding of RPart, please follow the below link An Introduction to Recursive Partitioning
Lets create a model based on RPart Algorithm. The first argument of the function is a model formula, with the ~ symbol standing for “is modeled as”. Since we are dealing with categorical data (0-9), we will give method as “class” signifying classification. The Proc.time function will give us the time and CPU resource consumed in building this model. We should take note of Elapsed Time which will tell us the actual duration consumed in building this model.
pc <- proc.time()
model.rpart <- rpart(DATASET.train$X5 ~ .,method = "class", data = DATASET.train)
proc.time() - pc
## user system elapsed
## 44.58 0.29 44.96
printcp(model.rpart)
##
## Classification tree:
## rpart(formula = DATASET.train$X5 ~ ., data = DATASET.train, method = "class")
##
## Variables actually used in tree construction:
## [1] X0.127 X0.207 X0.227 X0.228 X0.277 X0.343 X0.389 X0.431
## [9] X1 X186 X253.12 X253.30 X66 X70
##
## Root node error: 14190/16000 = 0.88687
##
## n= 16000
##
## CP nsplit rel error xerror xstd
## 1 0.098802 0 1.00000 1.00000 0.0028235
## 2 0.079986 1 0.90120 0.90226 0.0035644
## 3 0.077801 2 0.82121 0.83347 0.0039140
## 4 0.063777 3 0.74341 0.74482 0.0042210
## 5 0.054545 4 0.67963 0.65215 0.0044020
## 6 0.047216 5 0.62509 0.59824 0.0044487
## 7 0.042072 6 0.57787 0.55152 0.0044560
## 8 0.019168 7 0.53580 0.51896 0.0044429
## 9 0.016279 8 0.51663 0.49655 0.0044252
## 10 0.013319 9 0.50035 0.48978 0.0044185
## 11 0.013037 10 0.48703 0.48048 0.0044081
## 12 0.012967 11 0.47400 0.47837 0.0044056
## 13 0.012755 12 0.46103 0.47569 0.0044023
## 14 0.010007 13 0.44827 0.46004 0.0043810
## 15 0.010000 14 0.43827 0.45553 0.0043741
prp(model.rpart, extra=6, main="Classification (RPART). Tree of Handwritten Digit Recognition ",
box.col=brewer.pal(10,"Set3")[model.rpart$frame$yval])
draw.tree(model.rpart, cex = 0.52, nodeinfo = TRUE, col = gray(0:8/8))
heat.tree <- function(tree, low.is.green=FALSE, ...) { # dots args passed to prp
y <- model.rpart$frame$yval
if(low.is.green)
y <- -y
max <- max(y)
min <- min(y)
cols <- rainbow(99, end=.36)[
ifelse(y > y[1], (y-y[1]) * (99-50) / (max-y[1]) + 50,
(y-min) * (50-1) / (y[1]-min) + 1)]
prp(model.rpart, branch.col=brewer.pal(10,"Set3"), box.col=brewer.pal(10,"Set3"), ...)
}
heat.tree(model.rpart, type=4, varlen=0, faclen=0, fallen.leaves=TRUE)
The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.
prediction.rpart <- predict(model.rpart, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$X5, `Predicted Class` = prediction.rpart)
## Predicted Class
## Actual Class 0 1 2 3 4 5 6 7 8 9
## 0 342 0 2 11 0 2 1 33 16 18
## 1 2 382 6 26 3 1 0 9 41 1
## 2 27 21 144 19 19 1 16 18 93 12
## 3 9 7 8 244 6 32 0 17 31 57
## 4 0 18 6 8 216 11 4 58 30 35
## 5 37 3 0 42 30 133 15 38 30 14
## 6 15 3 9 20 11 13 233 36 41 3
## 7 3 4 9 2 7 1 0 298 41 25
## 8 5 38 25 33 8 15 29 8 184 46
## 9 2 16 7 13 11 16 6 72 42 245
Lets find out the overall prediction accuracy of our model.
error.rate.rpart <- sum(DATASET.test$X5 != prediction.rpart)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.rpart) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))
## [1] "Prediction Accuracy: 60.53"
row <- 1
prediction.digit <- as.vector(predict(model.rpart, newdata = DATASET.test[row, ], type = "class"))
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))
## [1] "Current Digit: 4"
print(paste0("Predicted Digit: ", prediction.digit))
## [1] "Predicted Digit: 4"
kNN stands for k - nearest neighbor. This is one of my favorite model because of the fact that its simplest amongst all. Imagine you need to classify a vehicle based on parameters such as off road capability and gas consumption. Assume you already have four groups - EV, SUV, Sedans and Trucks. Now if you have to classify Toyota Rav4 Hybrid for k = 3, you need to see the nearest three neighbors and then as per the animation below, you will find SUV nearest to Rav4 and hence it can be classified as SUV.
kNN classification:
The above is an oversimplification of kNN, kNN is an non parametric lazy learning algorithm and by non parametric I mean that it does not make any assumptions on the underlying data distribution i.e. it has no bias.
pc <- proc.time()
model.knn <- IBk(DATASET.train$X5 ~ ., data = DATASET.train)
proc.time() - pc
## user system elapsed
## 956.83 0.75 958.53
The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.
prediction.knn <- predict(model.knn, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$X5, `Predicted Class` = prediction.knn)
## Predicted Class
## Actual Class 0 1 2 3 4 5 6 7 8 9
## 0 422 0 0 0 0 0 3 0 0 0
## 1 0 467 1 2 1 0 0 0 0 0
## 2 3 10 343 2 0 2 1 6 3 0
## 3 0 1 4 385 0 7 1 2 8 3
## 4 0 2 0 0 377 0 0 0 0 7
## 5 1 0 0 1 0 335 4 0 1 0
## 6 2 0 0 0 0 1 380 0 1 0
## 7 0 1 3 0 1 0 0 379 0 6
## 8 1 4 4 7 4 7 2 2 353 7
## 9 1 1 1 1 4 1 2 10 2 407
Lets find out the overall prediction accuracy of our model.
error.rate.knn <- sum(DATASET.test$X5 != prediction.knn)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.knn) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))
## [1] "Prediction Accuracy: 96.2"
row <- 1
prediction.digit <- as.vector(predict(model.knn, newdata = DATASET.test[row, ], type = "class"))
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))
## [1] "Current Digit: 4"
print(paste0("Predicted Digit: ", prediction.digit))
## [1] "Predicted Digit: 4"
FNN stands for Fast Nearest Neighbor. As you may have guessed it by now FNN belongs to the same family of Models to which k-NN belongs i.e. Nearest neighbor search. What is Random Forest to Rpart or other decision tree based algorithms, FNN is to k-NN.In applications areas, such as in digit recognition, recommendation systems and computer vision, where fast response times are critical, using brute force linear search is often not feasible and it areas such as these where FNN shines.
For detailed understanding of Fast Nearest Neighbor, please refer to the following link
pc <- proc.time()
model.FNN <- FNN::knn(DATASET.train[, -1], DATASET.test[, -1], DATASET.train$X5, k = 10, algorithm = "cover_tree")
proc.time() - pc
## user system elapsed
## 211.70 0.36 212.67
The confusion Matrix will give us the idea of how precise and accurate our Model was for each digit.
table(`Actual Class` = DATASET.test$X5, `Predicted Class` = model.FNN)
## Predicted Class
## Actual Class 0 1 2 3 4 5 6 7 8 9
## 0 424 0 0 0 0 0 1 0 0 0
## 1 0 467 2 1 0 0 1 0 0 0
## 2 6 15 335 1 0 1 2 7 3 0
## 3 0 3 6 378 1 10 1 4 6 2
## 4 0 5 0 0 369 0 3 0 0 9
## 5 1 3 0 5 2 328 2 0 1 0
## 6 1 1 0 0 1 2 379 0 0 0
## 7 0 5 1 0 1 0 0 380 0 3
## 8 2 6 2 9 4 7 3 1 343 14
## 9 4 3 2 3 2 2 1 10 2 401
Lets find out the overall prediction accuracy of our model.
error.rate.fnn <- sum(DATASET.test$X5 != model.FNN)/nrow(DATASET.test)
accuracy <- round((1 - error.rate.fnn) *100,2)
print(paste0("Prediction Accuracy: ", accuracy))
## [1] "Prediction Accuracy: 95.1"
row <- 1
prediction.digit <- model.FNN[row]
print(paste0("Current Digit: ", as.character(DATASET.test$X5[row])))
## [1] "Current Digit: 4"
print(paste0("Predicted Digit: ", prediction.digit))
## [1] "Predicted Digit: 4"
After several runs and comparing all the three models, the Nearest Neighbor models appear to be superior to Rpart- tree based model. Both kNN and FNN repeatedly returned with above 95% Accuracy (which can also be the case of Model over fitting). The choice between kNN and FNN is rather not that difficult since kNN took comparatively greater time to run than FNN, and hence my Model of choice for Digit Recognition has to be FNN.