Summary

Digit recognition and pattern analysis are well-known topics in the area of image recognition and are great starting points to become familiar with the machine learning techniques, such as deep learning, that is used in image recognition problems. This is a classification problem and I applied three different algorithms to predict the MNIST handwritten digit database: SVM (Support Vector Machine), NN (Neural Network), and Fast K-Nearest Neighbor (FNN). Among these, FNN predicts with the highest accuracy followed by SVM, and NN. Not only FNN is the most accurate, it is the fastest too. However, voting between all three techniques produces more-accurate prediction. In addition to combining three predictions, a new technique described below, improves the accuracy in all algorithms. The contrast of pixels color in grayscale is increased and decreased in a way that the aforementioned algorithms can more accurately distinguish the digits that are likely to be confused. This technique could be considered as a simple feature engineering for the digit recognition problem. Some examples of the digits that are likely to be confused with each other are 4 and 9, 3 and 8, 5 and 8, or 2 and 7. This method resulted in an accuracy of ~0.98%.

Getting Data

I downloaded the train and test data from Kaggle Digit Recognizer Competition. Our models are trained using “train” data set and the “test” set is used for the final submission. The first step is to read both data sets:

library(readr)
train <- read_csv("train.csv")
test <- read_csv("test.csv")

what is the size of each data set?

dim(train) ; dim(test)
## [1] 42000   785
## [1] 28000   784

Train data set has one extra column, called “label”, containing the “digit” that is represented by the other 784 columns. The 784 (28x28 cells) columns contain a value from 0 to 255. This value tells us the lightness or darkness of each cell. Let’s convert the “label” column to factor:

train[, 1] <- as.factor(train[, 1]$label)  # As Category

All the other columns are numeric:

head(sapply(train[1,], class))
##     label    pixel0    pixel1    pixel2    pixel3    pixel4 
##  "factor" "integer" "integer" "integer" "integer" "integer"

For more information about the data see the data description on Kaggle. There are 42000 observation with 784 features. However, some columns contain zero for all observations or they have near zero variance. These columns need to be removed:

train_orig <- train
test_orig <- test
library(caret) 
nzv.data <- nearZeroVar(train, saveMetrics = TRUE)
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
train <- train[,!names(train) %in% drop.cols]
test <- test[,!names(test) %in% drop.cols]

Now, let’s do some exploratory data analysis. Thanks to Jose Maria Gomez and Guillermo Santos Garcia for sharing their code, so I use for data visualization and EDA:

library(RColorBrewer)
BNW <- c("white", "black")
CUSTOM_BNW <- colorRampPalette(colors = BNW)

par(mfrow = c(4, 3), pty = "s", mar = c(1, 1, 1, 1), xaxt = "n", yaxt = "n")
images_digits_0_9 <- array(dim = c(10, 28 * 28))
for (digit in 0:9) {
  images_digits_0_9[digit + 1, ] <- apply(train_orig[train_orig[, 1] == digit, -1], 2, sum)
  images_digits_0_9[digit + 1, ] <- images_digits_0_9[digit + 1, ]/max(images_digits_0_9[digit + 1, ]) * 255
  z <- array(images_digits_0_9[digit + 1, ], dim = c(28, 28))
  z <- z[, 28:1]
  image(1:28, 1:28, z, main = digit, col = CUSTOM_BNW(256))
}

More blurriness, more chance of misprediction. For example, 0 has a smooth and fully dark line but see how blurry is 9 or 4 or even 1. That means there is a higher chance of incorrect prediction of such numbers. We will explore this more in detail when we predict our validation data set. What is the proportion of each digit in the train set?

CUSTOM_BNW_PLOT <- colorRampPalette(brewer.pal(10, "Set3"))
LabTable <- table(train_orig$label)
par(mfrow = c(1, 1))
percentage <- round(LabTable/sum(LabTable) * 100)
labels <- paste0(row.names(LabTable), " (", percentage, "%) ")
pie(LabTable, labels = labels, col = CUSTOM_BNW_PLOT(10), main = "Percentage of Digits (Training Set)")

So, all digits contribute almost equally to the data set implying that the train set is appropriately randomly selected. I chose SVM (Support Vector Mechine) as the first model which is a good fit for classification problems. Let’s see how SVM performs in Digit Recognition Classification. To speed up our initial model runs, I only use 10% of the entire train set (~4200) for training and use almost the same size for validation. Consequently, is it expected to have less accuracy in comparison with the case where all train set is applied. For the final prediction, the entire train set will be utilized.

set.seed(43210)
trainIndex <- createDataPartition(train$label, p = 0.1, list = FALSE, times = 1)
allindices <- c(1:42000)
training <- train[trainIndex,]
validating <- train[-trainIndex,]
vali0_index <- allindices[! allindices %in% trainIndex]
validIndex <- createDataPartition(validating$label, p = 0.11, list = FALSE, times = 1)
validating <- validating[validIndex,]
original_validindex <- vali0_index[validIndex]

Model 1: SVM (Support Vector Machine)

Parallel computation is applied to improve the efficiency of our model. I use “doMC”" package for parallelization. K-fold cross validation (k=4) technique will examine the performance of SVM training process.

library(doMC)
registerDoMC(cores = 3)
tc <- trainControl(method = "cv", number = 4, verboseIter = F, allowParallel = T)
modSVMR1 <- train(label ~. , data= training, method = "svmRadial", trControl = tc)
SVMRadial_predict1 <- as.numeric(predict(modSVMR1,newdata = validating))-1

The actual versus predicted labels:

confusionMatrix(SVMRadial_predict1, validating$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 403   0   1   2   0   4   8   2   2   2
##          1   0 456   1   3   3   5   2   6   5   2
##          2   0   3 381   8   3   1   2   9   6   3
##          3   0   2   6 388   0   5   0   1   7   7
##          4   0   0   4   0 381   2   0   3   2  11
##          5   3   2   3  13   0 353   2   0  12   0
##          6   2   0   7   3   5   3 394   0   2   0
##          7   0   1   4   5   2   0   0 398   2  12
##          8   1   0   4   7   5   2   2   1 360   4
##          9   0   0   3   2   5   1   0  16   5 374
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9342          
##                  95% CI : (0.9262, 0.9415)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9268          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98533   0.9828  0.92029  0.90023  0.94307  0.93883
## Specificity           0.99440   0.9927  0.99066  0.99250  0.99415  0.99076
## Pos Pred Value        0.95047   0.9441  0.91587  0.93269  0.94541  0.90979
## Neg Pred Value        0.99839   0.9978  0.99119  0.98852  0.99388  0.99391
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09683   0.1096  0.09154  0.09322  0.09154  0.08481
## Detection Prevalence  0.10187   0.1160  0.09995  0.09995  0.09683  0.09322
## Balanced Accuracy     0.98987   0.9877  0.95548  0.94636  0.96861  0.96479
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96098  0.91284  0.89330  0.90120
## Specificity           0.99414  0.99302  0.99308  0.99146
## Pos Pred Value        0.94712  0.93868  0.93264  0.92118
## Neg Pred Value        0.99573  0.98983  0.98861  0.98908
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09467  0.09563  0.08650  0.08986
## Detection Prevalence  0.09995  0.10187  0.09274  0.09755
## Balanced Accuracy     0.97756  0.95293  0.94319  0.94633

The Confusion Matrix provide us with valuable information. First of all, SVM did a great job on only 10% of the whole data. It predicts the validating set with 93.4% accuracy. Highest accuracy belongs to labels “0”, “1”, and “6”. However, it had difficulty with predicting digits “8”, “9”, and “3”. For example, in several cases “4” and “9”, “3” and “8”, or “7” and “9” are misclassified. The specificity value in Confusion Matrix shows how each digit is confused with other labels. Take a look at the specificity for digit “2”. It has the lowest value among all digits. It means that ~0.93% of the time SVM mispredicts other digits as “2” (i.e., the digit is not 2, but is predicted as 2). In contrast, digit “0” has the highest specificity. It is very unlikely that other digits are falsely predicted as “0” (when they are not actually zero). Most of the time, “6” is mispredicted as “0” by SVM.

SVM06 <- which(SVMRadial_predict1 != validating$label & validating$label == 6)
head(SVM06)
## [1]   24  186  515  865 1176 1806

Some examples are:

rotate <- function(x) t(apply(x, 2, rev))
par(mfrow = c(1, 3), pty ='s')
for (i06 in 1:3){
  m = rotate(matrix(unlist(train_orig[original_validindex[SVM06[i06]],-1]),ncol = 28,byrow = T))
  image(m,col=CUSTOM_BNW(255), main = "SVM, false prediction of 6")  
}

and here are some examples of “6” correctly predicted:

SVM66 <- which(SVMRadial_predict1 == validating$label & validating$label == 6)
par(mfrow = c(1, 3), pty ='s')
for (i66 in 1:3){
  m = rotate(matrix(unlist(train_orig[original_validindex[SVM66[i66]],-1]),ncol = 28,byrow = T))
  image(m,col=CUSTOM_BNW(255), main = "SVM, true prediction of 6")  
}

I wonder if I could enhance the accuracy of the model by manipulating the raw data. Let’s go back to the first figure. I found out some digits are more blurry than the others. For instance, take a look at the digits 4, 9, and 1. I use my photography knowledge to improve the prediction. I am thinking of playing with the contrast. If the darkness (or the value) of a cell is increased, it is more likely that the cell is detected by a classifier. That means we boost the weak features. So, the light gray cells become more gray, and dark gray becomes darker. If the light-color cells become darker and the dark-color cells remain the same, it would be equivalent to decreasing the image contrast. If we change the function and reduce the color of light-gray cells to white and increase the darkness of dark cells, that means we increase the contrast. To decrease the contrast, I boost the small cell values using the function: 255 * (cell value/255) ^ (1/power) for power > 1. To increase the contrast, I apply this function: 255 * (cell value/255) ^ (power). When the lighter-colored pixels become darker that is like the digit is written with a thicker pen or font and therefor, the digits are more distinguishable for the classifier. I tried different “power”s, 3, 5, 7, 10, and 1000 and cross validation (not presented here) showed that power=3 is the best choice. Let’s see what is the effect of changing the contrast:

power = 3
Contrast <- function (DATASET, POWER) {
  outDATASET <- cbind(DATASET$label, as.data.frame((DATASET[,-1]/255)^(POWER)*255))
  names(outDATASET)[1] <- "label"  
  outDATASET
}

train_orig_low_contrast <- Contrast(train_orig, 1/power) 
train_orig_high_contrast <- Contrast(train_orig, power) 

plotIndex = 4
par(mfrow = c(1, 3), pty ='s')
m = rotate(matrix(unlist(train_orig_low_contrast[plotIndex,-1]),ncol = 28,byrow = T))
image(m,col=CUSTOM_BNW(255), main = "lower contrast")
m = rotate(matrix(unlist(train_orig[plotIndex,-1]),ncol = 28,byrow = T))
image(m,col=CUSTOM_BNW(255), main = "original")
m = rotate(matrix(unlist(train_orig_high_contrast[plotIndex,-1]),ncol = 28,byrow = T))
image(m,col=CUSTOM_BNW(255), main = "higher contrast")

Now, let’s repeat SVM training with the two new data sets and see if changing the contrast impacts SVM performance:

training2   <- Contrast(training, 1/power)
training3   <- Contrast(training,   power)
validating2 <- Contrast(validating, 1/power)
validating3 <- Contrast(validating, power)

registerDoMC(cores = 3)
modSVMR2 <- train(label ~. , data= training2, method = "svmRadial", trControl = tc)
SVMRadial_predict2 <- as.numeric(predict(modSVMR2,newdata = validating2))-1
confusionMatrix(SVMRadial_predict2, validating2$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 401   0   0   1   1   2   7   2   3   1
##          1   0 457   1   2   2   6   1   4   5   2
##          2   1   4 386   8   2   1   3  10   5   3
##          3   1   0   6 389   0   2   0   2   7   9
##          4   1   0   3   0 382   2   0   1   3   7
##          5   2   1   2  14   0 355   2   0  10   0
##          6   2   0   7   3   3   3 394   0   3   0
##          7   0   1   3   5   3   0   0 402   2  10
##          8   1   0   3   6   2   4   3   1 361   4
##          9   0   1   3   3   9   1   0  14   4 379
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9385          
##                  95% CI : (0.9308, 0.9456)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9316          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98044   0.9849  0.93237  0.90255  0.94554  0.94415
## Specificity           0.99547   0.9938  0.99013  0.99276  0.99548  0.99181
## Pos Pred Value        0.95933   0.9521  0.91253  0.93510  0.95739  0.91969
## Neg Pred Value        0.99786   0.9981  0.99251  0.98879  0.99415  0.99444
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09635   0.1098  0.09274  0.09346  0.09178  0.08530
## Detection Prevalence  0.10043   0.1153  0.10163  0.09995  0.09587  0.09274
## Balanced Accuracy     0.98796   0.9893  0.96125  0.94766  0.97051  0.96798
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96098  0.92202  0.89578  0.91325
## Specificity           0.99440  0.99356  0.99362  0.99066
## Pos Pred Value        0.94940  0.94366  0.93766  0.91546
## Neg Pred Value        0.99573  0.99090  0.98888  0.99039
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09467  0.09659  0.08674  0.09106
## Detection Prevalence  0.09971  0.10235  0.09250  0.09947
## Balanced Accuracy     0.97769  0.95779  0.94470  0.95196
registerDoMC(cores = 3)
modSVMR3 <- train(label ~. , data= training3, method = "svmRadial", trControl = tc)
SVMRadial_predict3 <- as.numeric(predict(modSVMR3,newdata = validating3))-1
confusionMatrix(SVMRadial_predict3, validating3$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 400   0   2   1   0   4   7   0   3   3
##          1   0 455   6   3   3   7   2   6   5   1
##          2   1   3 369   5   2   3   4   9   8   4
##          3   0   1   6 389   0   9   0   2   8   7
##          4   0   0   6   0 379   2   1   5   2  17
##          5   3   2   3  15   0 343   3   2  14   2
##          6   4   2  10   3   6   2 391   0   3   0
##          7   0   1   5   4   2   0   0 396   3  15
##          8   1   0   4   7   5   3   2   1 351   3
##          9   0   0   3   4   7   3   0  15   6 363
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9217          
##                  95% CI : (0.9131, 0.9297)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9129          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97800   0.9806  0.89130  0.90255  0.93812  0.91223
## Specificity           0.99467   0.9911  0.98959  0.99116  0.99122  0.98838
## Pos Pred Value        0.95238   0.9324  0.90441  0.92180  0.91990  0.88630
## Neg Pred Value        0.99759   0.9976  0.98801  0.98877  0.99333  0.99126
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09611   0.1093  0.08866  0.09346  0.09106  0.08241
## Detection Prevalence  0.10091   0.1173  0.09803  0.10139  0.09899  0.09298
## Balanced Accuracy     0.98633   0.9858  0.94045  0.94685  0.96467  0.95031
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.95366  0.90826  0.87097  0.87470
## Specificity           0.99200  0.99195  0.99308  0.98986
## Pos Pred Value        0.92874  0.92958  0.93103  0.90524
## Neg Pred Value        0.99492  0.98929  0.98626  0.98617
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09395  0.09515  0.08433  0.08722
## Detection Prevalence  0.10115  0.10235  0.09058  0.09635
## Balanced Accuracy     0.97283  0.95010  0.93203  0.93228

I knew it! Boosting the cells with lighter colors reduced the error (93.8% accurate, comparing 93.4%). Another idea is to create a training set that is a mixture of “training” (original data), “training2” (lower contrast), and “training3” (higher contrast) and train SVM using this new data set. That means the classifier has more samples of the digits that is predicted incorrectly. Validation set shows that this technique also increases the accuracy of SVM. In addition, I found out that using the lower contrast validation set would increase the prediction accuracy. I defined three different cross validation sets (not presented here) and all of them confirmed the results above. This indicates that our model is not overfitting.

training4 <- rbind(training, training2, training3)
registerDoMC(cores = 3)
modSVMR4 <- train(label ~. , data= training4, method = "svmRadial", trControl = tc)
SVMRadial_predict4 <- as.numeric(predict(modSVMR4,newdata = validating2))-1
confusionMatrix(SVMRadial_predict4, validating2$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 403   0   0   2   0   5   8   3   4   1
##          1   0 456   1   1   1   6   1   3   4   0
##          2   1   1 383   6   4   1   4  10   5   3
##          3   1   2   6 396   0   5   0   2   8   5
##          4   0   0   4   0 380   2   0   1   0   8
##          5   0   3   0  14   0 349   2   0   5   0
##          6   2   0   6   1   3   3 393   0   6   0
##          7   0   0   3   4   2   0   0 403   1  10
##          8   2   1   8   5   4   3   2   2 368   8
##          9   0   1   3   2  10   2   0  12   2 380
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9397         
##                  95% CI : (0.932, 0.9467)
##     No Information Rate : 0.1115         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.933          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98533   0.9828  0.92512  0.91879  0.94059  0.92819
## Specificity           0.99387   0.9954  0.99066  0.99223  0.99601  0.99366
## Pos Pred Value        0.94601   0.9641  0.91627  0.93176  0.96203  0.93566
## Neg Pred Value        0.99839   0.9978  0.99172  0.99063  0.99363  0.99287
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09683   0.1096  0.09202  0.09515  0.09130  0.08385
## Detection Prevalence  0.10235   0.1136  0.10043  0.10211  0.09491  0.08962
## Balanced Accuracy     0.98960   0.9891  0.95789  0.95551  0.96830  0.96093
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.95854  0.92431  0.91315  0.91566
## Specificity           0.99440  0.99463  0.99069  0.99146
## Pos Pred Value        0.94928  0.95272  0.91315  0.92233
## Neg Pred Value        0.99546  0.99117  0.99069  0.99067
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09443  0.09683  0.08842  0.09130
## Detection Prevalence  0.09947  0.10163  0.09683  0.09899
## Balanced Accuracy     0.97647  0.95947  0.95192  0.95356

About 94% is the accuracy we get by adding up “training2” (lower contrast) and “training3” (higher contrast) and original “training” set. We stop SVM training here and move on to the next model, K-Nearest Neighbor.

Model 2: FNN (Fast K-Nearest Neighbor)

The K-Nearest Neighbor (KNN) is a simple yet accurate algorithm to solve the Digit Recognition problem. For predicting a new instance, KNN calculates the Euclidean Distance between the new instance and all the instances in the entire training set. Then, the algorithm looks for the top K nearest (most similar) instances and outputs the class with the highest frequency (most vote) as prediction. The question is how to choose K? Cross Validation can be used to choose the best value for K that results in the highest accuracy. I use CARET package to train KNN and choose K. Then, I will switch to FNN which is much faster.

registerDoMC(cores = 3)
ctrl <- trainControl(method="repeatedcv",repeats = 1, number = 4, verboseIter = T, allowParallel = T)
knnFit <- train(label ~ ., data = training, method = "knn", trControl = ctrl)
## Aggregating results
## Selecting tuning parameters
## Fitting k = 5 on full training set
plot(knnFit)

So, we choose K=5. At first, FNN will be trained with three training sets: 1) original “training” 2) “training2” (lower contrast) 3) “training3” (higher contrast). See the FNN documentation for different “algorithm” options and other settings. I found that “kd-tree” was the best choice.

library(FNN)
fnn.kd1 <- FNN::knn(training[,-1], validating[,-1], training$label, k=5, algorithm = c("kd_tree"))
fnn.kd.pred1 <- as.numeric(fnn.kd1)-1
fnn.kd2 <- FNN::knn(training2[,-1], validating2[,-1], training2$label, k=5, algorithm = c("kd_tree"))
fnn.kd.pred2 <- as.numeric(fnn.kd2)-1
fnn.kd3 <- FNN::knn(training3[,-1], validating3[,-1], training3$label, k=5, algorithm = c("kd_tree"))
fnn.kd.pred3 <- as.numeric(fnn.kd3)-1
confusionMatrix(fnn.kd.pred1, validating$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 406   0   5   2   0   4   9   0   5   3
##          1   0 460  13   2  10   4   4  14   8   4
##          2   1   2 370   6   0   0   0   2   5   3
##          3   0   1   4 401   0  13   0   0  17   6
##          4   0   0   1   0 371   0   0   2   2   5
##          5   0   0   0  11   0 346   5   0  13   0
##          6   2   0   5   1   3   7 391   0   9   0
##          7   0   1  15   3   5   0   1 408   2  22
##          8   0   0   0   1   0   0   0   0 329   0
##          9   0   0   1   4  15   2   0  10  13 372
## 
## Overall Statistics
##                                           
##                Accuracy : 0.926           
##                  95% CI : (0.9176, 0.9338)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9177          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99267   0.9914  0.89372  0.93039  0.91832  0.92021
## Specificity           0.99254   0.9840  0.99493  0.98901  0.99734  0.99234
## Pos Pred Value        0.93548   0.8863  0.95116  0.90724  0.97375  0.92267
## Neg Pred Value        0.99920   0.9989  0.98834  0.99194  0.99127  0.99208
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09755   0.1105  0.08890  0.09635  0.08914  0.08313
## Detection Prevalence  0.10428   0.1247  0.09346  0.10620  0.09154  0.09010
## Balanced Accuracy     0.99260   0.9877  0.94433  0.95970  0.95783  0.95628
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.95366  0.93578  0.81638  0.89639
## Specificity           0.99280  0.98685  0.99973  0.98799
## Pos Pred Value        0.93541  0.89278  0.99697  0.89209
## Neg Pred Value        0.99493  0.99244  0.98069  0.98852
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09395  0.09803  0.07905  0.08938
## Detection Prevalence  0.10043  0.10980  0.07929  0.10019
## Balanced Accuracy     0.97323  0.96131  0.90806  0.94219
confusionMatrix(fnn.kd.pred2, validating2$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 406   0   5   2   0   3   5   0   3   1
##          1   0 460   9   2   8   3   1   8  12   1
##          2   1   2 371   8   0   0   0   1   2   3
##          3   0   1   6 398   0   9   0   0  13   6
##          4   0   0   1   0 366   0   0   1   3   6
##          5   0   0   1  12   1 351   5   0   8   0
##          6   2   0   5   0   4   7 398   0   7   1
##          7   0   1  15   3   5   1   0 415   3  12
##          8   0   0   0   1   0   0   1   0 343   0
##          9   0   0   1   5  20   2   0  11   9 385
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9354          
##                  95% CI : (0.9275, 0.9426)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9281          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99267   0.9914  0.89614  0.92343  0.90594  0.93351
## Specificity           0.99494   0.9881  0.99546  0.99062  0.99707  0.99287
## Pos Pred Value        0.95529   0.9127  0.95619  0.91917  0.97082  0.92857
## Neg Pred Value        0.99920   0.9989  0.98861  0.99115  0.98996  0.99339
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09755   0.1105  0.08914  0.09563  0.08794  0.08433
## Detection Prevalence  0.10211   0.1211  0.09322  0.10404  0.09058  0.09082
## Balanced Accuracy     0.99380   0.9897  0.94580  0.95703  0.95151  0.96319
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.97073  0.95183  0.85112  0.92771
## Specificity           0.99307  0.98926  0.99947  0.98719
## Pos Pred Value        0.93868  0.91209  0.99420  0.88915
## Neg Pred Value        0.99679  0.99434  0.98428  0.99195
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09563  0.09971  0.08241  0.09250
## Detection Prevalence  0.10187  0.10932  0.08289  0.10404
## Balanced Accuracy     0.98190  0.97055  0.92529  0.95745
confusionMatrix(fnn.kd.pred3, validating3$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 401   1   9   3   0   8  11   0   5   2
##          1   0 461  20  10  15  12   5  14  28   6
##          2   1   1 345   6   0   0   0   1  11   4
##          3   0   0   2 385   0  21   0   0  23   5
##          4   1   0   7   1 369   4   2   5   3  17
##          5   0   0   2  14   0 317   5   1  28   2
##          6   6   0   6   3   2   7 386   0   8   0
##          7   0   1  20   6   5   5   1 404   5  26
##          8   0   0   2   1   0   0   0   0 279   0
##          9   0   0   1   2  13   2   0  11  13 353
## 
## Overall Statistics
##                                           
##                Accuracy : 0.889           
##                  95% CI : (0.8791, 0.8984)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8766          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98044   0.9935  0.83333   0.8933  0.91337  0.84309
## Specificity           0.98961   0.9703  0.99360   0.9863  0.98936  0.98627
## Pos Pred Value        0.91136   0.8074  0.93496   0.8830  0.90220  0.85908
## Neg Pred Value        0.99785   0.9992  0.98181   0.9877  0.99067  0.98445
## Prevalence            0.09827   0.1115  0.09947   0.1036  0.09707  0.09034
## Detection Rate        0.09635   0.1108  0.08289   0.0925  0.08866  0.07617
## Detection Prevalence  0.10572   0.1372  0.08866   0.1048  0.09827  0.08866
## Balanced Accuracy     0.98502   0.9819  0.91346   0.9398  0.95136  0.91468
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.94146  0.92661  0.69231  0.85060
## Specificity           0.99147  0.98148  0.99920  0.98879
## Pos Pred Value        0.92344  0.85412  0.98936  0.89367
## Neg Pred Value        0.99359  0.99133  0.96804  0.98354
## Prevalence            0.09851  0.10476  0.09683  0.09971
## Detection Rate        0.09274  0.09707  0.06704  0.08481
## Detection Prevalence  0.10043  0.11365  0.06776  0.09491
## Balanced Accuracy     0.96647  0.95404  0.84575  0.91970

Again, among these three models, the one with lower contrast had better accuracy (93.5%). Now, what if the three training sets are combined again:

fnn.kd4 <- FNN::knn(training4[,-1], validating2[,-1], training4$label, k=5, algorithm = c("kd_tree"))
fnn.kd.pred4 <- as.numeric(fnn.kd4)-1
confusionMatrix(fnn.kd.pred4, validating2$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 405   0   3   0   1   3   8   1   5   2
##          1   0 458   4   1   6   1   1   6   7   0
##          2   1   3 387   7   0   0   1   2   4   3
##          3   1   1   5 403   0   6   0   0  15   5
##          4   0   0   1   0 374   0   0   0   2   3
##          5   0   0   0  12   0 356   6   0   7   0
##          6   2   0   1   0   1   7 394   0   7   0
##          7   0   1  12   3   5   0   0 418   2   7
##          8   0   0   0   3   0   0   0   0 345   2
##          9   0   1   1   2  17   3   0   9   9 393
## 
## Overall Statistics
##                                           
##                Accuracy : 0.945           
##                  95% CI : (0.9376, 0.9517)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9388          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99022   0.9871  0.93478  0.93503  0.92574  0.94681
## Specificity           0.99387   0.9930  0.99440  0.99116  0.99840  0.99340
## Pos Pred Value        0.94626   0.9463  0.94853  0.92431  0.98421  0.93438
## Neg Pred Value        0.99893   0.9984  0.99281  0.99249  0.99207  0.99471
## Prevalence            0.09827   0.1115  0.09947  0.10356  0.09707  0.09034
## Detection Rate        0.09731   0.1100  0.09298  0.09683  0.08986  0.08554
## Detection Prevalence  0.10284   0.1163  0.09803  0.10476  0.09130  0.09154
## Balanced Accuracy     0.99205   0.9900  0.96459  0.96309  0.96207  0.97010
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96098   0.9587  0.85608  0.94699
## Specificity           0.99520   0.9919  0.99867  0.98879
## Pos Pred Value        0.95631   0.9330  0.98571  0.90345
## Neg Pred Value        0.99573   0.9952  0.98478  0.99410
## Prevalence            0.09851   0.1048  0.09683  0.09971
## Detection Rate        0.09467   0.1004  0.08289  0.09443
## Detection Prevalence  0.09899   0.1076  0.08409  0.10452
## Balanced Accuracy     0.97809   0.9753  0.92737  0.96789

We obtained higher accuracy ~94.5% combining three training sets. Note that the lower-contrast data set is used for prediction. So far, SVM and FNN performed well, specially FNN which is much faster than SVM. Remember, when we train the models with the entire train set, we will get a better result (higher accuracy) on a larger test set. We will examine this at the end of this report. Let’s move to the next and our final model, beloved Neural Network (NN).

Model 3: NN (Neural Network)

There are several packages in R for Neural Network: nnet, neuralnet, MXNet, and h2o. I picked the simplest one, neuralnet, which has less capabilities than the other three. Feel free to try others and share your findings with me. For example, see the Kaggle Kernel that uses the h2o package.

There are lot of resources available to get familiar with the Neural Network technique. Check out this interesting YouTube Video about Deep Learning and Neural Network. I dive into our Neural Network modeling with a brief discussion on the number of hidden layers and hidden units. Check out the top comment on this thread: “…the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems”. Take another look at the YouTube video above. If your data is separable with one straight line, then only one simple hidden layer can do it for you. I choose one hidden layer for our problem. Cross Validation sets proved that more layers will not improve the performance. I did not find any robust rule for the number of hidden nodes. It is recommended not to choose more than twice the number of input nodes. The safe choice is something between or in the range of the nodes in input and output layer. I tested different hidden layers (350, 252, 128, and 64), but 252 produced the highest accuracy. Neuralnet does not like any factor class. Therefore, we have to create dummy features for the “label”s. We follow the algorithm used here, thanks Michy! Also, other features (pixel’s value) should be scaled and centered.

library(neuralnet)
class.ind <- function(cl)
{
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)) )
  x[(1:n) + n*(unclass(cl)-1)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}
training5 <- cbind(class.ind(as.factor(training$label)), training[,-1])
names(training5) <- c("l0","l1","l2","l3","l4","l5","l6","l7","l8","l9",
                      names(training)[-1])

Normalizing the data:

scl <- function(x){ (x - min(x))/(max(x) - min(x)) }
training5[, 11:ncol(training5)] <- data.frame(lapply(training5[, 11:ncol(training5)], scl))

For Neuralnet, unlike the caret package, we need to provide the extended formula:

n <- names(training5)
form <- as.formula(paste("l0 + l1 + l2 + l3 + l4 + l5 + l6 + l7 + l8 + l9 ~", 
                      paste(n[!n %in% c("l0","l1","l2", "l3","l4","l5",
                                        "l6","l7","l8","l9")], collapse = " + ")))

And here is the exciting training part. Please see the neuralnet documentation and here for the details of settings. I appreciate if anybody shares how neuralnet can be parallelized in R on Mac (Thank you in advance).

neuralModel1 <- neuralnet (formula = form , hidden = c(252), 
                          act.fct = "logistic" ,
                          linear.output = F , 
                          data = training5,
                          threshold = 0.01,
                          err.fct = "ce",
                          lifesign.step = 20,
                          lifesign = "full")
## hidden: 252    thresh: 0.01    rep: 1/1    steps:      20    min thresh: 73.7919605
##                                                        40    min thresh: 12.96359142
##                                                        60    min thresh: 2.026276384
##                                                        80    min thresh: 0.4248037432
##                                                       100    min thresh: 0.07348435173
##                                                       117    error: 0.23792  time: 1.16 mins
validating5 <- validating
validating5[, -1] <- data.frame(lapply(validating5[, -1], scl))
neuralModel.pred_ <- compute(neuralModel1, validating5[,-1])
neuralModel.pred <- neuralModel.pred_$net.result
nn_pred1 <- max.col(neuralModel.pred)-1
confusionMatrix(nn_pred1, validating5$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 395   0   0   2   0   6   5   1   4   2
##          1   0 452   2   1   4   3   1   3   6   1
##          2   2   6 372   7   1   3   4   6   4   3
##          3   0   0   5 387   0   4   2   2   6  10
##          4   2   2   7   2 374   5   3   0   1  13
##          5   5   0   0  15   2 342   2   1   8   3
##          6   2   0   5   2   4   6 388   1   4   0
##          7   0   0  12   4   3   1   0 405   3  10
##          8   2   3   9   7   3   3   5   1 360   9
##          9   1   1   2   4  13   3   0  16   7 364
## 
## Overall Statistics
##                                                   
##                Accuracy : 0.9223931               
##                  95% CI : (0.9138445, 0.930343)   
##     No Information Rate : 0.1114849               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.9137423               
##  Mcnemar's Test P-Value : NA                      
## 
## Statistics by Class:
## 
##                        Class: 0  Class: 1   Class: 2   Class: 3   Class: 4
## Sensitivity          0.96577017 0.9741379 0.89855072 0.89791183 0.92574257
## Specificity          0.99467093 0.9943213 0.99039488 0.99222728 0.99068654
## Pos Pred Value       0.95180723 0.9556025 0.91176471 0.93028846 0.91442543
## Neg Pred Value       0.99626368 0.9967471 0.98881193 0.98825414 0.99200639
## Prevalence           0.09827006 0.1114849 0.09947141 0.10355598 0.09706872
## Detection Rate       0.09490630 0.1086016 0.08938011 0.09298414 0.08986064
## Detection Prevalence 0.09971168 0.1136473 0.09802979 0.09995195 0.09827006
## Balanced Accuracy    0.98022055 0.9842296 0.94447280 0.94506956 0.95821455
##                        Class: 5   Class: 6   Class: 7   Class: 8
## Sensitivity          0.90957447 0.94634146 0.92889908 0.89330025
## Specificity          0.99049128 0.99360341 0.99114332 0.98882682
## Pos Pred Value       0.90476190 0.94174757 0.92465753 0.89552239
## Neg Pred Value       0.99101480 0.99413333 0.99167562 0.98856383
## Prevalence           0.09034118 0.09851033 0.10475733 0.09682845
## Detection Rate       0.08217203 0.09322441 0.09730899 0.08649688
## Detection Prevalence 0.09082172 0.09899087 0.10523787 0.09658818
## Balanced Accuracy    0.95003288 0.96997244 0.96002120 0.94106353
##                        Class: 9
## Sensitivity          0.87710843
## Specificity          0.98745663
## Pos Pred Value       0.88564477
## Neg Pred Value       0.98640363
## Prevalence           0.09971168
## Detection Rate       0.08745795
## Detection Prevalence 0.09875060
## Balanced Accuracy    0.93228253

Total accuracy is 92.1%. It is Okay! Let’s see how NN performs using low-contrast data set.

training6 <- cbind(class.ind(as.factor(training2$label)), training2[,-1])
names(training6) <- c("l0","l1","l2","l3","l4","l5","l6","l7","l8","l9",
                      names(training2)[-1])
training6[, 11:ncol(training6)] <- data.frame(lapply(training6[, 11:ncol(training6)], scl))
neuralModel2 <- neuralnet (formula = form , hidden = c(252), 
                          act.fct = "logistic" ,
                          linear.output = F , 
                          data = training6,
                          threshold = 0.01,
                          err.fct = "ce",
                          lifesign.step = 20,
                          lifesign = "full")
## hidden: 252    thresh: 0.01    rep: 1/1    steps:      20    min thresh: 68.87873934
##                                                        40    min thresh: 19.27089182
##                                                        60    min thresh: 3.312580699
##                                                        80    min thresh: 0.9725664917
##                                                       100    min thresh: 0.1656871058
##                                                       120    min thresh: 0.04584637883
##                                                       140    min thresh: 0.01316531347
##                                                       141    error: 0.17856  time: 1.43 mins
validating6 <- validating2
validating6[, -1] <- data.frame(lapply(validating6[, -1], scl))
neuralModel.pred_ <- compute(neuralModel2, validating6[,-1])
neuralModel.pred <- neuralModel.pred_$net.result
nn_pred2 <- max.col(neuralModel.pred)-1
confusionMatrix(nn_pred2, validating6$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 396   0   1   2   0   7   3   2   2   2
##          1   0 451   2   1   1   3   1   5   9   1
##          2   3   4 372  11   3   1   3   8   0   4
##          3   2   0   5 378   0  17   5   3   9   6
##          4   2   2   7   0 377   4   3   2   4  16
##          5   2   1   0  16   1 329   3   0  12   4
##          6   2   1   4   1   3   8 385   0   3   0
##          7   0   2   8   8   2   1   0 401   4  11
##          8   2   2  12   8   5   3   7   3 350   6
##          9   0   1   3   6  12   3   0  12  10 365
## 
## Overall Statistics
##                                                   
##                Accuracy : 0.9139837               
##                  95% CI : (0.9050517, 0.9223294)  
##     No Information Rate : 0.1114849               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.9043922               
##  Mcnemar's Test P-Value : NA                      
## 
## Statistics by Class:
## 
##                        Class: 0  Class: 1   Class: 2   Class: 3   Class: 4
## Sensitivity          0.96821516 0.9719828 0.89855072 0.87703016 0.93316832
## Specificity          0.99493738 0.9937804 0.99012807 0.98740284 0.98935604
## Pos Pred Value       0.95421687 0.9514768 0.90953545 0.88941176 0.90407674
## Neg Pred Value       0.99653056 0.9964751 0.98880895 0.98581750 0.99279039
## Prevalence           0.09827006 0.1114849 0.09947141 0.10355598 0.09706872
## Detection Rate       0.09514656 0.1083614 0.08938011 0.09082172 0.09058145
## Detection Prevalence 0.09971168 0.1138876 0.09827006 0.10211437 0.10019222
## Balanced Accuracy    0.98157627 0.9828816 0.94433940 0.93221650 0.96126218
##                        Class: 5   Class: 6   Class: 7   Class: 8
## Sensitivity          0.87500000 0.93902439 0.91972477 0.86848635
## Specificity          0.98969889 0.99413646 0.99033816 0.98723065
## Pos Pred Value       0.89402174 0.94594595 0.91762014 0.87939698
## Neg Pred Value       0.98761202 0.99334221 0.99060403 0.98591923
## Prevalence           0.09034118 0.09851033 0.10475733 0.09682845
## Detection Rate       0.07904853 0.09250360 0.09634791 0.08409419
## Detection Prevalence 0.08841903 0.09778952 0.10499760 0.09562710
## Balanced Accuracy    0.93234945 0.96658043 0.95503147 0.92785850
##                        Class: 9
## Sensitivity          0.87951807
## Specificity          0.98745663
## Pos Pred Value       0.88592233
## Neg Pred Value       0.98666667
## Prevalence           0.09971168
## Detection Rate       0.08769822
## Detection Prevalence 0.09899087
## Balanced Accuracy    0.93348735

No! It did not do as well as the first Neural Network model. How about the combination of three training sets?

training7 <- cbind(class.ind(as.factor(training4$label)), training4[,-1])
names(training7) <- c("l0","l1","l2","l3","l4","l5","l6","l7","l8","l9",
                      names(training4)[-1])
training7[, 11:ncol(training7)] <- data.frame(lapply(training7[, 11:ncol(training7)], scl))
neuralModel3 <- neuralnet (formula = form , hidden = c(252), 
                          act.fct = "logistic" ,
                          linear.output = F , 
                          data = training7,
                          threshold = 0.01,
                          err.fct = "ce",
                          lifesign.step = 20,
                          lifesign = "full")
## hidden: 252    thresh: 0.01    rep: 1/1    steps:      20    min thresh: 155.0722869
##                                                        40    min thresh: 41.14946403
##                                                        60    min thresh: 9.638728292
##                                                        80    min thresh: 3.955521324
##                                                       100    min thresh: 1.39377885
##                                                       120    min thresh: 0.4240035251
##                                                       140    min thresh: 0.1961348431
##                                                       160    min thresh: 0.0577045562
##                                                       180    min thresh: 0.01582375663
##                                                       185    error: 0.30213  time: 5.43 mins
validating6 <- validating2
validating6[, -1] <- data.frame(lapply(validating6[, -1], scl))
neuralModel.pred_ <- compute(neuralModel3, validating6[,-1])
neuralModel.pred <- neuralModel.pred_$net.result
nn_pred2 <- max.col(neuralModel.pred)-1
confusionMatrix(nn_pred2, validating6$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 398   0   2   1   1   4   5   3   4   3
##          1   0 450   0   0   2   2   2   2   4   0
##          2   1   4 374   6   0   4   1   8   3   2
##          3   4   1  10 390   0   8   0   7   4   6
##          4   1   3   4   1 379   4   0   4   1   6
##          5   0   1   3  20   1 342   2   1   9   3
##          6   3   1   5   1   3   4 393   0   5   0
##          7   0   0   5   5   5   0   0 397   2  12
##          8   2   3   8   4   3   7   7   0 361   6
##          9   0   1   3   3  10   1   0  14  10 377
## 
## Overall Statistics
##                                                   
##                Accuracy : 0.927679                
##                  95% CI : (0.9193855, 0.935366)   
##     No Information Rate : 0.1114849               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.919621                
##  Mcnemar's Test P-Value : NA                      
## 
## Statistics by Class:
## 
##                        Class: 0  Class: 1   Class: 2   Class: 3   Class: 4
## Sensitivity          0.97310513 0.9698276 0.90338164 0.90487239 0.93811881
## Specificity          0.99387157 0.9967550 0.99226254 0.98927901 0.99361362
## Pos Pred Value       0.94536817 0.9740260 0.92803970 0.90697674 0.94044665
## Neg Pred Value       0.99705961 0.9962162 0.98935887 0.98901393 0.99334930
## Prevalence           0.09827006 0.1114849 0.09947141 0.10355598 0.09706872
## Detection Rate       0.09562710 0.1081211 0.08986064 0.09370495 0.09106199
## Detection Prevalence 0.10115329 0.1110043 0.09682845 0.10331571 0.09682845
## Balanced Accuracy    0.98348835 0.9832913 0.94782209 0.94707570 0.96586622
##                        Class: 5   Class: 6   Class: 7   Class: 8
## Sensitivity          0.90957447 0.95853659 0.91055046 0.89578164
## Specificity          0.98943476 0.99413646 0.99221685 0.98935887
## Pos Pred Value       0.89528796 0.94698795 0.93192488 0.90024938
## Neg Pred Value       0.99100529 0.99546304 0.98956103 0.98883276
## Prevalence           0.09034118 0.09851033 0.10475733 0.09682845
## Detection Rate       0.08217203 0.09442576 0.09538683 0.08673715
## Detection Prevalence 0.09178280 0.09971168 0.10235464 0.09634791
## Balanced Accuracy    0.94950461 0.97633652 0.95138366 0.94257025
##                        Class: 9
## Sensitivity          0.90843373
## Specificity          0.98879103
## Pos Pred Value       0.89976134
## Neg Pred Value       0.98984772
## Prevalence           0.09971168
## Detection Rate       0.09058145
## Detection Prevalence 0.10067275
## Balanced Accuracy    0.94861238

This is the best (92.9%) we get from the Neuralnet. Let’s see what is the accuracy of Neural Network on the original training set itself:

neuralModel.pred_ <- compute(neuralModel1, training5[,11:ncol(training5)])
neuralModel.pred <- neuralModel.pred_$net.result
nn_pred3 <- max.col(neuralModel.pred)-1
actual_values <- max.col(training5[,1:10])-1
confusionMatrix(nn_pred3, actual_values)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 414   0   0   0   0   0   0   0   0   0
##          1   0 469   0   0   0   0   0   0   0   0
##          2   0   0 418   0   0   0   0   0   0   0
##          3   0   0   0 436   0   0   0   0   0   0
##          4   0   0   0   0 408   0   0   0   0   0
##          5   0   0   0   0   0 380   0   0   0   0
##          6   0   0   0   0   0   0 414   0   0   0
##          7   0   0   0   0   0   0   0 441   0   0
##          8   0   0   0   0   0   0   0   0 407   0
##          9   0   0   0   0   0   0   0   0   0 419
## 
## Overall Statistics
##                                                   
##                Accuracy : 1                       
##                  95% CI : (0.9991233, 1)          
##     No Information Rate : 0.1115074               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 1                       
##  Mcnemar's Test P-Value : NA                      
## 
## Statistics by Class:
## 
##                        Class: 0  Class: 1   Class: 2  Class: 3   Class: 4
## Sensitivity          1.00000000 1.0000000 1.00000000 1.0000000 1.00000000
## Specificity          1.00000000 1.0000000 1.00000000 1.0000000 1.00000000
## Pos Pred Value       1.00000000 1.0000000 1.00000000 1.0000000 1.00000000
## Neg Pred Value       1.00000000 1.0000000 1.00000000 1.0000000 1.00000000
## Prevalence           0.09843081 0.1115074 0.09938184 0.1036614 0.09700428
## Detection Rate       0.09843081 0.1115074 0.09938184 0.1036614 0.09700428
## Detection Prevalence 0.09843081 0.1115074 0.09938184 0.1036614 0.09700428
## Balanced Accuracy    1.00000000 1.0000000 1.00000000 1.0000000 1.00000000
##                        Class: 5   Class: 6  Class: 7   Class: 8   Class: 9
## Sensitivity          1.00000000 1.00000000 1.0000000 1.00000000 1.00000000
## Specificity          1.00000000 1.00000000 1.0000000 1.00000000 1.00000000
## Pos Pred Value       1.00000000 1.00000000 1.0000000 1.00000000 1.00000000
## Neg Pred Value       1.00000000 1.00000000 1.0000000 1.00000000 1.00000000
## Prevalence           0.09034712 0.09843081 0.1048502 0.09676652 0.09961959
## Detection Rate       0.09034712 0.09843081 0.1048502 0.09676652 0.09961959
## Detection Prevalence 0.09034712 0.09843081 0.1048502 0.09676652 0.09961959
## Balanced Accuracy    1.00000000 1.00000000 1.0000000 1.00000000 1.00000000

That’s right, the accuracy is 100%! That means our neural network model predicts the “training” set 100% correctly. This tells us that the model is overfitted. Some suggest to increase the “threshold” in neuralnet settings to stop the training before the model gets overfitted. Others suggest to use lower number of hidden nodes. None of them fixed our issue. I appreciate if anybody knows how regularization is implemented in neuralnet package. Is it already implemented?

Predicting Test Set

We need to train all the classifiers with the entire train set. We start with SVM.

SVM

train2 <- Contrast (train, 1/power)
train3 <- Contrast (train,   power)
test4  <- as.data.frame((test/255)^(1/power)*255)

train4 <- rbind(train, train2, train3)

set.seed(43210)
# registerDoMC(cores = 2)
# modSVMR_all <- train(label ~. , data= train4, method = "svmRadial", trControl = tc)
# SVMRadial_predict_all <- as.numeric(predict(modSVMR_all, newdata = test4))-1

FNN

# fnn.kd_all <- FNN::knn(train4[,-1], test4, train4$label, k=5, algorithm = c("kd_tree"))
# fnn.kd.pred_all <- as.numeric(fnn.kd_all)-1

Neural Network

train5 <- cbind(class.ind(as.factor(train4$label)), train4[,-1])
names(train5) <- c("l0","l1","l2","l3","l4","l5","l6","l7","l8","l9",
                      names(train4)[-1])
n <- names(train5)
form <- as.formula(paste("l0 + l1 + l2 + l3 + l4 + l5 + l6 + l7 + l8 + l9 ~", 
                      paste(n[!n %in% c("l0","l1","l2", "l3","l4","l5",
                                        "l6","l7","l8","l9")], collapse = " + ")))

# train5[, 11:ncol(train5)] <- data.frame(lapply(train5[, 11:ncol(train5)], scl))
# neuralModel_all <- neuralnet (formula = form , hidden = c(252), 
#                           act.fct = "logistic" ,
#                           linear.output = F , 
#                           data = train5,
#                           threshold = 0.01,
#                           err.fct = "ce",
#                           lifesign.step = 200,
#                           lifesign = "full")
# test5 <- data.frame(lapply(test4, scl))
# neuralModel.pred_ <- compute(neuralModel_all, test5)
# neuralModel.pred <- neuralModel.pred_$net.result
# nn_pred_all <- max.col(neuralModel.pred)-1

We have three predictions. We combine them together in one data frame and the final prediction comes from voting between three models. If all models predict different digits, we rely on FNN model since it had the highest accuracy.

# all_preds <- data.frame(SVM= SVMRadial_predict_all, FNN = fnn.kd.pred_all, NN = nn_pred_all)

This is the voting function:

# most_frequent <-function (row) {
#   ll<-data.frame(table(unlist(row)))
#   high_freq <- ll[which.max(ll$Freq),]
#   if (as.numeric(high_freq[,2]) > 1) {
#     as.numeric(levels(high_freq[,1]))[high_freq[,1]]
#   } else {
#     row[2] # this will pick FNN in case of no consensus!
#   }
# }

Let’s do the voting and save as CSV format …

# elected <- apply(all_preds,1, most_frequent)
# SVM_FNN_NN_Submit = data.frame(ImageId = seq(1,28000), Label = elected)
# write.csv(SVM_FNN_NN_Submit, file = "FNN_SVM_NN_Power3.csv", row.names=F)

This prediction scored 97.9% on Kaggle’s leaderboard.