Digit Recognizer using Support Vector Machines

The following R code addresses the ‘Digit Recognizer’ problem on Kaggle. The model is a Support Vector Machine (SVM) with a radial kernel. Each observation has 784 attributes, and each attribute is the intensity of one pixel on the image grid. Because there are so many attributes per observation, Principal Component Analysis (PCA) was used for dimensionality reduction, keeping the first 50 components. The model reached about 99% accuracy on the 30,000 rows used for training and about 98% on the remaining rows of train.csv, which were held out for testing.

set.seed(0)

# Load the libs required for the analysis
library(class)
library(readr)
library(caret)
library(e1071)

# Load the dataset:
train <- read.csv("C:/Users/Adi/Desktop/digit recognizer/train.csv")
test <- read.csv("C:/Users/Adi/Desktop/digit recognizer/test.csv")


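# Split train.csv: 30,000 random rows for training, the rest held out for testing
rows <- sample(1:nrow(train), 30000)
labels <- as.factor(train[rows, 1])   # first column holds the digit label
train_train <- train[rows, -1]        # remaining columns are pixel intensities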


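# Apply PCA (centred, unscaled) to the training pixels
pca.train <- prcomp(train_train, scale. = FALSE, center = TRUE)

# Cumulative proportion of variance explained by the first k components
varEx <- as.data.frame(pca.train$sdev^2 / sum(pca.train$sdev^2))
varEx <- cbind(seq_along(pca.train$sdev), cumsum(varEx[, 1]))
colnames(varEx) <- c("Nmbr_PCs", "Cum_Var")
VarianceExplanation <- varEx[seq(50, 700, 50), ]  # every 50th component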
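
A cumulative-variance table like this can be used to decide how many components to keep. As a minimal sketch of how such a cut-off could be picked automatically, the snippet below runs `prcomp` on a small synthetic matrix and selects the smallest number of components whose cumulative variance reaches a threshold. The synthetic data, the 0.90 threshold, and the names `X`, `cum_var`, and `k` are illustrative assumptions, not part of the original analysis.

```r
set.seed(1)

# Synthetic data (assumption): 200 observations, 10 features driven by 3 latent factors
latent   <- matrix(rnorm(200 * 3), ncol = 3)
loadings <- matrix(rnorm(3 * 10), nrow = 3)
X <- latent %*% loadings + matrix(rnorm(200 * 10, sd = 0.1), ncol = 10)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest k whose cumulative variance reaches the assumed 90% threshold
threshold <- 0.90
k <- which(cum_var >= threshold)[1]

print(k)
print(round(cum_var, 3))
```

The same two lines that compute `cum_var` and `k` would apply to `pca.train$sdev` from the digit data.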

# Keep the first 50 principal components and project the centred training data onto them
rotate <- pca.train$rotation[, 1:50]
trainFinal2 <- as.matrix(scale(train_train, center = TRUE, scale = FALSE)) %*% rotate
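
The manual projection above (centre, then multiply by the rotation matrix) gives the same scores that `prcomp` stores in `$x`, and `predict()` on a `prcomp` object applies the training centre to new rows in the same way. The sketch below checks both equivalences on small synthetic data; `X_train`, `X_new`, and `k` are hypothetical names used only for illustration.

```r
set.seed(2)

# Synthetic stand-ins (assumption) for the training pixels and unseen rows
X_train <- matrix(rnorm(100 * 6), ncol = 6)
X_new   <- matrix(rnorm(20 * 6), ncol = 6)
k <- 3

pca <- prcomp(X_train, center = TRUE, scale. = FALSE)
rotate <- pca$rotation[, 1:k]

# Manual projection of the training rows, as in the code above
manual_train <- scale(X_train, center = TRUE, scale = FALSE) %*% rotate
same_train <- isTRUE(all.equal(manual_train, pca$x[, 1:k],
                               check.attributes = FALSE))

# New rows must be centred with the TRAINING means, not their own
manual_new <- sweep(X_new, 2, pca$center) %*% rotate
same_new <- isTRUE(all.equal(manual_new, predict(pca, X_new)[, 1:k],
                             check.attributes = FALSE))

print(c(same_train = same_train, same_new = same_new))
```

Using `predict()` avoids having to carry the training means around separately when unseen data is projected later.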

# SVM models
svm.fit <- svm(trainFinal2,labels, kernel='radial') 
summary(svm.fit)

## 
## Call:
## svm.default(x = trainFinal2, y = labels, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.02 
## 
## Number of Support Vectors:  9610
## 
##  ( 1218 833 351 843 1107 801 1149 997 1118 1193 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9
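
The summary reports `gamma: 0.02`, which is consistent with e1071's default of one over the number of input columns (here 1/50). For the radial kernel, the e1071 documentation gives the form exp(-gamma * |u - v|^2). The sketch below evaluates that formula in base R; `rbf_kernel` is a hypothetical helper written for illustration, not a function from e1071.

```r
# Hypothetical helper: radial (RBF) kernel value between two numeric vectors
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

n_features <- 50          # number of principal components kept above
gamma <- 1 / n_features   # consistent with the 0.02 shown in the summary

u <- rep(0, n_features)
v <- c(1, rep(0, n_features - 1))  # differs from u by 1 in one coordinate

k_same <- rbf_kernel(u, u, gamma)  # identical points: squared distance 0
k_near <- rbf_kernel(u, v, gamma)  # squared distance 1

print(c(gamma = gamma, k_same = k_same, k_near = k_near))
```

A larger `gamma` makes the kernel value fall off faster with distance, so each support vector influences a smaller neighbourhood.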

# Check predictions on the training rows
yhat <- predict(svm.fit, trainFinal2)
confusionMatrix(yhat, labels)  # recent caret versions require a factor reference

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 2996    0    0    0    0    5    1    0    1    6
##          1    0 3361    2    0    5    2    1    9    4    1
##          2    0    3 2997    7    0    0    1    8    1    1
##          3    0    1    1 3091    0    4    0    0    6    2
##          4    2    3    4    0 2882    1    3    6    4   13
##          5    0    0    0    8    0 2688    1    0    4    0
##          6    0    1    0    0    1    4 2961    0    1    0
##          7    0    6    4    5    3    2    0 3073    1   18
##          8    1    3    3   10    0    2    1    1 2768    1
##          9    2    1    2    5    7    0    0    5    2 2970
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9929          
##                  95% CI : (0.9919, 0.9938)
##     No Information Rate : 0.1126          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9921          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99833   0.9947   0.9947   0.9888  0.99448  0.99261
## Specificity           0.99952   0.9991   0.9992   0.9995  0.99867  0.99952
## Pos Pred Value        0.99568   0.9929   0.9930   0.9955  0.98766  0.99519
## Neg Pred Value        0.99981   0.9993   0.9994   0.9987  0.99941  0.99927
## Prevalence            0.10003   0.1126   0.1004   0.1042  0.09660  0.09027
## Detection Rate        0.09987   0.1120   0.0999   0.1030  0.09607  0.08960
## Detection Prevalence  0.10030   0.1128   0.1006   0.1035  0.09727  0.09003
## Balanced Accuracy     0.99893   0.9969   0.9970   0.9941  0.99658  0.99607
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.99731   0.9907  0.99140   0.9861
## Specificity           0.99974   0.9986  0.99919   0.9991
## Pos Pred Value        0.99764   0.9875  0.99211   0.9920
## Neg Pred Value        0.99970   0.9989  0.99912   0.9984
## Prevalence            0.09897   0.1034  0.09307   0.1004
## Detection Rate        0.09870   0.1024  0.09227   0.0990
## Detection Prevalence  0.09893   0.1037  0.09300   0.0998
## Balanced Accuracy     0.99852   0.9946  0.99530   0.9926

# Check on the held-out rows of train.csv (not used to fit the PCA or the SVM)
trainMeans <- colMeans(train_train)
# Centre the held-out pixels with the training means, then project onto the same 50 components
testFinal <- sweep(as.matrix(train[-rows, -1]), 2, trainMeans)
testfinal2 <- testFinal %*% rotate

yhat <- predict(svm.fit, testfinal2)
confusionMatrix(yhat, factor(train[-rows, 1], levels = levels(labels)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1119    0    3    0    1    1    3    2    2    0
##          1    0 1288    2    0    0    1    0    4    6    2
##          2    0    3 1129    7    3    0    0    7    5    2
##          3    0    3    5 1194    1    8    0    0   10    8
##          4    1    1    6    0 1151    2    3    1    3   16
##          5    3    0    0    8    1 1062    4    0    3    3
##          6    6    2    0    1    4    4 1153    0    1    0
##          7    0    4   13    5    0    1    0 1271    2   12
##          8    2    3    6    8    0    4    5    5 1237    5
##          9    0    1    0    2   13    4    0    9    2 1128
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9777          
##                  95% CI : (0.9749, 0.9802)
##     No Information Rate : 0.1088          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9752          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98939   0.9870  0.96993   0.9747  0.98041  0.97700
## Specificity           0.99890   0.9986  0.99751   0.9968  0.99695  0.99798
## Pos Pred Value        0.98939   0.9885  0.97664   0.9715  0.97213  0.97970
## Neg Pred Value        0.99890   0.9984  0.99677   0.9971  0.99787  0.99771
## Prevalence            0.09425   0.1087  0.09700   0.1021  0.09783  0.09058
## Detection Rate        0.09325   0.1073  0.09408   0.0995  0.09592  0.08850
## Detection Prevalence  0.09425   0.1086  0.09633   0.1024  0.09867  0.09033
## Balanced Accuracy     0.99414   0.9928  0.98372   0.9857  0.98868  0.98749
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.98716   0.9784   0.9732  0.95918
## Specificity           0.99834   0.9965   0.9965  0.99714
## Pos Pred Value        0.98463   0.9717   0.9702  0.97325
## Neg Pred Value        0.99861   0.9974   0.9968  0.99557
## Prevalence            0.09733   0.1082   0.1059  0.09800
## Detection Rate        0.09608   0.1059   0.1031  0.09400
## Detection Prevalence  0.09758   0.1090   0.1062  0.09658
## Balanced Accuracy     0.99275   0.9875   0.9849  0.97816
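
The headline figures can be recomputed from the table itself: accuracy is the sum of the diagonal divided by the total count, and each class's sensitivity is its diagonal entry divided by its column (reference) total. A minimal sketch in base R, with the held-out confusion matrix above typed in by hand (`cm` is just an illustrative name):

```r
# Held-out confusion matrix from the output above
# (rows = prediction, columns = reference)
cm <- matrix(c(
  1119,    0,    3,    0,    1,    1,    3,    2,    2,    0,
     0, 1288,    2,    0,    0,    1,    0,    4,    6,    2,
     0,    3, 1129,    7,    3,    0,    0,    7,    5,    2,
     0,    3,    5, 1194,    1,    8,    0,    0,   10,    8,
     1,    1,    6,    0, 1151,    2,    3,    1,    3,   16,
     3,    0,    0,    8,    1, 1062,    4,    0,    3,    3,
     6,    2,    0,    1,    4,    4, 1153,    0,    1,    0,
     0,    4,   13,    5,    0,    1,    0, 1271,    2,   12,
     2,    3,    6,    8,    0,    4,    5,    5, 1237,    5,
     0,    1,    0,    2,   13,    4,    0,    9,    2, 1128
), nrow = 10, byrow = TRUE,
   dimnames = list(Prediction = 0:9, Reference = 0:9))

# Overall accuracy: correctly classified rows over all held-out rows
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class sensitivity (recall): correct predictions over true class counts
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))     # 0.9777, matching the confusionMatrix output
print(round(sensitivity, 4))
```

Digit 9 has the lowest sensitivity in this table, mostly because of confusions with 4 and 7.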

Aditya S Nakate

7 January 2017