The following is R code for the ‘Digit Recognizer’ problem on Kaggle. In this model I have used a Support Vector Machine (SVM) with a radial kernel. The dataset contains 784 attributes per observation, each of which is the intensity of one pixel on the image grid. Because there are so many attributes per observation, Principal Component Analysis (PCA) was used for dimensionality reduction, keeping the first 50 components. The model was fitted on 30,000 randomly sampled rows of train.csv and checked on the remaining rows. It achieved an accuracy of 99.3% on the rows it was trained on and 97.8% on the held-out rows.
set.seed(0)
# Load the libs required for the analysis
library(class)
library(readr)
library(caret)
library(e1071)
# Load the dataset:
train <- read.csv("C:/Users/Adi/Desktop/digit recognizer/train.csv")
test <- read.csv("C:/Users/Adi/Desktop/digit recognizer/test.csv")
# Split the Kaggle training set: 30,000 random rows to fit the model, the remaining rows held out for testing
rows <- sample(1:nrow(train), 30000)
labels <- as.factor(train[rows,1])
train_train <- train[rows,-1]
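#Applying PCA (pixels are centred but not scaled)
pca.train <- prcomp(train_train, scale. = FALSE, center = TRUE)
# Cumulative proportion of variance explained by the first k components
varEx<-as.data.frame(pca.train$sdev^2/sum(pca.train$sdev^2))
varEx<-cbind(c(1:784),cumsum(varEx[,1]))
colnames(varEx)<-c("Nmbr_PCs","Cum_Var")
VarianceExplanation<-varEx[seq(50,700,50),]
# Keep the first 50 components and project the centred training data onto them
rotate<-pca.train$rotation[,1:50]
trainFinal2<-as.matrix(scale(train_train,center = TRUE, scale=FALSE))%*%(rotate)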
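To make the PCA step concrete, here is a minimal sketch on synthetic data. The matrix `X`, the 150/50 split and the 95% threshold are made up for illustration and are not taken from the digits data. It shows one way to pick the number of components from the cumulative variance, and that new rows should be centred with the training means before rotating, which is also what `predict()` does for a `prcomp` object.

```r
# Synthetic stand-in for the pixel data (hypothetical): 200 rows, 10 correlated columns
set.seed(1)
base <- matrix(rnorm(200 * 3), ncol = 3)
X <- base %*% matrix(rnorm(3 * 10), nrow = 3) +
  matrix(rnorm(200 * 10, sd = 0.1), ncol = 10)

train_idx <- 1:150
X_train <- X[train_idx, ]
X_new <- X[-train_idx, ]

# Fit PCA on the training rows only (centred, not scaled)
pca <- prcomp(X_train, center = TRUE, scale. = FALSE)

# Cumulative share of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]  # smallest k reaching 95% (arbitrary threshold)

rot <- pca$rotation[, 1:k, drop = FALSE]

# Training scores: centre with the training means, then rotate
scores_manual <- scale(X_train, center = pca$center, scale = FALSE) %*% rot

# New rows: centre with the *training* means (not their own), then rotate
new_manual <- sweep(X_new, 2, pca$center) %*% rot

# predict() on a prcomp object applies the same centring and rotation
new_predict <- predict(pca, newdata = X_new)[, 1:k, drop = FALSE]
```

In this sketch `scores_manual` should match `pca$x[, 1:k]` and `new_manual` should match `new_predict` up to floating-point error, so the manual projection and the built-in one are interchangeable.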
# SVM models
svm.fit <- svm(trainFinal2,labels, kernel='radial')
summary(svm.fit)
##
## Call:
## svm.default(x = trainFinal2, y = labels, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.02
##
## Number of Support Vectors: 9610
##
## ( 1218 833 351 843 1107 801 1149 997 1118 1193 )
##
##
## Number of Classes: 10
##
## Levels:
## 0 1 2 3 4 5 6 7 8 9
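The cost of 1 and gamma of 0.02 in the summary were not set in the call. To my understanding, `svm()` in e1071 defaults to `cost = 1` and `gamma = 1 / ncol(x)`, which for 50 components gives 1/50 = 0.02. A small sketch on the built-in `iris` data (a stand-in with 4 predictors, not the digits data) to check that default:

```r
library(e1071)

# iris is only a stand-in dataset here: 4 numeric predictors, 3 classes
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# No cost or gamma supplied, as in the digits model
fit <- svm(x, y, kernel = "radial")

# Values the fitted object reports for the defaults
default_gamma <- fit$gamma  # expected to equal 1 / ncol(x)
default_cost <- fit$cost

# Passing the same values explicitly should give the same fit
fit_explicit <- svm(x, y, kernel = "radial", gamma = 1 / ncol(x), cost = 1)
```

Since both values are just defaults, they are natural candidates for tuning if a higher held-out accuracy is needed.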
#Checking on train data
yhat <- predict(svm.fit,trainFinal2 )
confusionMatrix(yhat, labels)  # reference must be a factor with the same levels as yhat
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 2996    0    0    0    0    5    1    0    1    6
##          1    0 3361    2    0    5    2    1    9    4    1
##          2    0    3 2997    7    0    0    1    8    1    1
##          3    0    1    1 3091    0    4    0    0    6    2
##          4    2    3    4    0 2882    1    3    6    4   13
##          5    0    0    0    8    0 2688    1    0    4    0
##          6    0    1    0    0    1    4 2961    0    1    0
##          7    0    6    4    5    3    2    0 3073    1   18
##          8    1    3    3   10    0    2    1    1 2768    1
##          9    2    1    2    5    7    0    0    5    2 2970
##
## Overall Statistics
##
## Accuracy : 0.9929
## 95% CI : (0.9919, 0.9938)
## No Information Rate : 0.1126
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.99833   0.9947   0.9947   0.9888  0.99448  0.99261
## Specificity           0.99952   0.9991   0.9992   0.9995  0.99867  0.99952
## Pos Pred Value        0.99568   0.9929   0.9930   0.9955  0.98766  0.99519
## Neg Pred Value        0.99981   0.9993   0.9994   0.9987  0.99941  0.99927
## Prevalence            0.10003   0.1126   0.1004   0.1042  0.09660  0.09027
## Detection Rate        0.09987   0.1120   0.0999   0.1030  0.09607  0.08960
## Detection Prevalence  0.10030   0.1128   0.1006   0.1035  0.09727  0.09003
## Balanced Accuracy     0.99893   0.9969   0.9970   0.9941  0.99658  0.99607
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.99731   0.9907  0.99140   0.9861
## Specificity           0.99974   0.9986  0.99919   0.9991
## Pos Pred Value        0.99764   0.9875  0.99211   0.9920
## Neg Pred Value        0.99970   0.9989  0.99912   0.9984
## Prevalence            0.09897   0.1034  0.09307   0.1004
## Detection Rate        0.09870   0.1024  0.09227   0.0990
## Detection Prevalence  0.09893   0.1037  0.09300   0.0998
## Balanced Accuracy     0.99852   0.9946  0.99530   0.9926
#Checking on the held-out rows of train.csv (not used to fit the PCA or the SVM)
# Centre with the training column means, then apply the same rotation
trainMeans<-colMeans(train_train)
testFinal<-sweep(as.matrix(train[-rows,-1]),2,trainMeans)
testfinal2<-testFinal%*%(rotate)
yhat <- predict(svm.fit,testfinal2 )
confusionMatrix(yhat, as.factor(train[-rows,1]))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1119    0    3    0    1    1    3    2    2    0
##          1    0 1288    2    0    0    1    0    4    6    2
##          2    0    3 1129    7    3    0    0    7    5    2
##          3    0    3    5 1194    1    8    0    0   10    8
##          4    1    1    6    0 1151    2    3    1    3   16
##          5    3    0    0    8    1 1062    4    0    3    3
##          6    6    2    0    1    4    4 1153    0    1    0
##          7    0    4   13    5    0    1    0 1271    2   12
##          8    2    3    6    8    0    4    5    5 1237    5
##          9    0    1    0    2   13    4    0    9    2 1128
##
## Overall Statistics
##
## Accuracy : 0.9777
## 95% CI : (0.9749, 0.9802)
## No Information Rate : 0.1088
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9752
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98939   0.9870  0.96993   0.9747  0.98041  0.97700
## Specificity           0.99890   0.9986  0.99751   0.9968  0.99695  0.99798
## Pos Pred Value        0.98939   0.9885  0.97664   0.9715  0.97213  0.97970
## Neg Pred Value        0.99890   0.9984  0.99677   0.9971  0.99787  0.99771
## Prevalence            0.09425   0.1087  0.09700   0.1021  0.09783  0.09058
## Detection Rate        0.09325   0.1073  0.09408   0.0995  0.09592  0.08850
## Detection Prevalence  0.09425   0.1086  0.09633   0.1024  0.09867  0.09033
## Balanced Accuracy     0.99414   0.9928  0.98372   0.9857  0.98868  0.98749
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.98716   0.9784   0.9732  0.95918
## Specificity           0.99834   0.9965   0.9965  0.99714
## Pos Pred Value        0.98463   0.9717   0.9702  0.97325
## Neg Pred Value        0.99861   0.9974   0.9968  0.99557
## Prevalence            0.09733   0.1082   0.1059  0.09800
## Detection Rate        0.09608   0.1059   0.1031  0.09400
## Detection Prevalence  0.09758   0.1090   0.1062  0.09658
## Balanced Accuracy     0.99275   0.9875   0.9849  0.97816
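As a check on how the headline figures come out of the table, the sketch below rebuilds the held-out confusion matrix from the counts printed above and recomputes the accuracy and two of the per-class statistics by hand. The variable names (`cm`, `accuracy`, `sensitivity`, `ppv`) are my own and not part of caret.

```r
# Held-out confusion table copied from the output above
# (rows = predicted digit, columns = reference digit)
cm <- matrix(c(
  1119,    0,    3,    0,    1,    1,    3,    2,    2,    0,
     0, 1288,    2,    0,    0,    1,    0,    4,    6,    2,
     0,    3, 1129,    7,    3,    0,    0,    7,    5,    2,
     0,    3,    5, 1194,    1,    8,    0,    0,   10,    8,
     1,    1,    6,    0, 1151,    2,    3,    1,    3,   16,
     3,    0,    0,    8,    1, 1062,    4,    0,    3,    3,
     6,    2,    0,    1,    4,    4, 1153,    0,    1,    0,
     0,    4,   13,    5,    0,    1,    0, 1271,    2,   12,
     2,    3,    6,    8,    0,    4,    5,    5, 1237,    5,
     0,    1,    0,    2,   13,    4,    0,    9,    2, 1128
), nrow = 10, byrow = TRUE,
dimnames = list(Prediction = as.character(0:9), Reference = as.character(0:9)))

# Overall accuracy: correctly classified (diagonal) over all held-out rows
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per digit: correct predictions over the true count of that digit
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value per digit: correct predictions over all predictions of that digit
ppv <- diag(cm) / rowSums(cm)

round(accuracy, 4)  # 0.9777, matching the Accuracy line above
```

The weakest class by sensitivity is 9 (about 0.959), mostly because true 9s are predicted as 4 or 7.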