The following is the R code for the ‘Digit Recognizer’ problem on Kaggle. In this model I have used the Random Forest algorithm, with the ‘tuneRF’ function from the randomForest package to choose the number of variables tried at each split (mtry). The dataset contains 784 attributes per observation, one for each pixel of a 28 x 28 grid, and each attribute is the intensity of that pixel. Because the dataset has so many attributes per observation, Principal Component Analysis (PCA) was used for dimensionality reduction, and the first 100 principal components were kept as predictors.
set.seed(0)
# Load the libs required for the analysis
library(class)
library(readr)
library(randomForest)
library(caret)
# Load the training and test datasets
train <- read.csv("C:/Users/Adi/Desktop/digit recognizer/train.csv")
test <- read.csv("C:/Users/Adi/Desktop/digit recognizer/test.csv")
# Apply PCA to the pixel columns (column 1 holds the label); centre but do not rescale
pca.train3 <- prcomp(train[,-1], scale. = FALSE, center = TRUE)
# Identify the amount of variance explained by the PCs
varEx<-as.data.frame(pca.train3$sdev^2/sum(pca.train3$sdev^2))
varEx<-cbind(c(1:784),cumsum(varEx[,1]))
colnames(varEx)<-c("Nmbr PCs","Cum Var")
# Cumulative variance at every 50th component (R silently drops index 0, so start at 50)
VarianceExplanation<-varEx[seq(50,700,50),]
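The cumulative-variance table above is what guides the choice of how many components to keep. A minimal sketch of turning it into a number, using synthetic data and a hypothetical 95% target (neither comes from the analysis here, which simply keeps 100 components):

```r
# Synthetic stand-in for the pixel matrix: 200 rows, 10 columns,
# with most of the variance lying along 3 latent directions.
set.seed(1)
latent   <- matrix(rnorm(200 * 3), ncol = 3)
loadings <- matrix(rnorm(3 * 10), nrow = 3)
X <- latent %*% loadings + matrix(rnorm(200 * 10, sd = 0.1), ncol = 10)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance per component and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches the target
target <- 0.95  # hypothetical threshold, chosen only for illustration
n_keep <- which(cum_var >= target)[1]

print(round(cum_var, 3))
print(n_keep)
```

The same `which(cum_var >= target)[1]` line should work on the second column of `varEx`, since that column already holds the cumulative proportions.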
rotate<-pca.train3$rotation[,1:100]
trainFinal2<-as.matrix(scale(train[,-1],center = TRUE, scale=FALSE))%*%(rotate)
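The projection above centres the training pixels and multiplies by the rotation matrix. Rows that were not used to fit the PCA (for example the Kaggle `test` data loaded earlier) need the same treatment, but centred with the training means rather than their own. A small sketch on synthetic data, where `X_train`, `X_new` and `k` are illustrative names and not part of the original script:

```r
set.seed(2)
X_train <- matrix(rnorm(100 * 6), ncol = 6)
X_new   <- matrix(rnorm(20 * 6, mean = 0.5), ncol = 6)
colnames(X_train) <- colnames(X_new) <- paste0("pixel", 0:5)

pca <- prcomp(X_train, center = TRUE, scale. = FALSE)
k <- 3                              # number of components kept (illustrative)
rotate <- pca$rotation[, 1:k]

# Manual projection of new rows: subtract the TRAINING means, then rotate
manual <- scale(X_new, center = pca$center, scale = FALSE) %*% rotate

# predict() on a prcomp object applies the same centring and rotation
via_predict <- predict(pca, newdata = X_new)[, 1:k]
print(all.equal(manual, via_predict, check.attributes = FALSE))

# For the training rows themselves, the scores are already stored in pca$x
train_manual <- scale(X_train, center = TRUE, scale = FALSE) %*% rotate
print(all.equal(train_manual, pca$x[, 1:k], check.attributes = FALSE))
```

In the script above this would correspond to something like `predict(pca.train3, newdata = test)[, 1:100]`, assuming `test` has the same pixel column names as `train`.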
# Split the 42000 labelled rows: 30000 for training, the remaining 12000 held out for testing
rows<-sample(1:42000,30000)
testFinal2<-trainFinal2[-rows,]
trainFinal2<-trainFinal2[rows,]
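# Tune mtry (the number of predictors tried at each split) with tuneRF
set.seed(0)
# Note: the response is passed as a number here, so tuneRF fits regression forests
# and the OOB error printed below is a mean squared error, not a misclassification rate.
tune <- tuneRF(trainFinal2,train[rows,1],ntreeTry = 150, stepFactor = 5, improve = 0.5,
trace = TRUE, plot = TRUE, doBest = FALSE)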
## mtry = 33 OOB error = 1.23398
## Searching left ...
## mtry = 7 OOB error = 1.95475
## -0.584102 0.5
## Searching right ...
## mtry = 100 OOB error = 1.299972
## -0.05347939 0.5
mtryfinal <- tune[as.numeric(which.min(tune[,"OOBError"])),"mtry"]
rf <- randomForest(trainFinal2,as.factor(train[rows,1]), ntree=300, mtry=mtryfinal,keep.forest=TRUE)
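Both `randomForest` and `tuneRF` decide between classification and regression from the type of the response: a factor gives a classification forest, a numeric vector gives a regression forest. The sketch below illustrates the difference on the built-in `iris` data (a stand-in chosen for speed, not the digit data), and tunes mtry against a factor response so that the OOB error is a misclassification rate:

```r
library(randomForest)

set.seed(0)
x <- iris[, 1:4]
y_factor  <- iris$Species              # factor response: classification
y_numeric <- as.numeric(iris$Species)  # numeric codes: treated as regression

rf_class <- randomForest(x, y_factor, ntree = 100)
# randomForest warns when a regression response has very few unique values;
# the warning is suppressed here because the mismatch is the point of the demo.
rf_reg <- suppressWarnings(randomForest(x, y_numeric, ntree = 100))

print(rf_class$type)
print(rf_reg$type)

# Tuning against the factor response makes OOBError a proportion in [0, 1]
tune_class <- tuneRF(x, y_factor, ntreeTry = 100, stepFactor = 2, improve = 0.05,
                     trace = FALSE, plot = FALSE, doBest = FALSE)
best_mtry <- tune_class[which.min(tune_class[, "OOBError"]), "mtry"]
print(tune_class)
print(best_mtry)
```

Applied to the digit data, the equivalent change would be to pass `as.factor(train[rows,1])` to `tuneRF`, as is already done in the `randomForest` call.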
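# Prediction on the training rows (the forest has already seen these, so the accuracy is optimistic)
pred <- predict(rf, newdata=trainFinal2)
# Recent versions of caret require both arguments to be factors with the same levels
confusionMatrix(pred,as.factor(train[rows,1]))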
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3001    0    0    0    0    0    0    0    0    0
##          1    0 3379    0    0    0    0    0    0    0    0
##          2    0    0 3013    0    0    0    0    0    0    0
##          3    0    0    0 3126    0    0    0    0    0    0
##          4    0    0    0    0 2898    0    0    0    0    0
##          5    0    0    0    0    0 2708    0    0    0    0
##          6    0    0    0    0    0    0 2969    0    0    0
##          7    0    0    0    0    0    0    0 3102    0    0
##          8    0    0    0    0    0    0    0    0 2792    0
##          9    0    0    0    0    0    0    0    0    0 3012
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.1126
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity               1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Specificity               1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Pos Pred Value            1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Neg Pred Value            1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Prevalence                0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Detection Rate            0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Detection Prevalence      0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Balanced Accuracy         1.0   1.0000   1.0000   1.0000   1.0000  1.00000
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           1.00000   1.0000  1.00000   1.0000
## Specificity           1.00000   1.0000  1.00000   1.0000
## Pos Pred Value        1.00000   1.0000  1.00000   1.0000
## Neg Pred Value        1.00000   1.0000  1.00000   1.0000
## Prevalence            0.09897   0.1034  0.09307   0.1004
## Detection Rate        0.09897   0.1034  0.09307   0.1004
## Detection Prevalence  0.09897   0.1034  0.09307   0.1004
## Balanced Accuracy     1.00000   1.0000  1.00000   1.0000
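# Prediction on the 12000 held-out rows (split off from train.csv, not the Kaggle test.csv)
pred <- predict(rf, newdata=testFinal2)
# Recent versions of caret require both arguments to be factors with the same levels
confusionMatrix(pred,as.factor(train[-rows,1]))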
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1105    0    7    5    3    6   14    2    6    3
##          1    0 1272    3    1    2    0    3    3   12    3
##          2    3    7 1067   23   17    6    3   13    6    0
##          3    3    7   25 1114    2   21    0    0   29   23
##          4    1    1   16    0 1088   10    4    7    9   38
##          5    2    2    2   21    2 1008   19    3   29    7
##          6   11    5    3    6    8    7 1123    0    8    2
##          7    1    3   19   10    4    3    0 1239    4   20
##          8    3    4   16   34   10    9    2    7 1155   14
##          9    2    4    6   11   38   17    0   25   13 1066
##
## Overall Statistics
##
## Accuracy : 0.9364
## 95% CI : (0.9319, 0.9407)
## No Information Rate : 0.1088
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9293
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97701   0.9747  0.91667  0.90939  0.92675  0.92732
## Specificity           0.99577   0.9975  0.99280  0.98979  0.99206  0.99203
## Pos Pred Value        0.96003   0.9792  0.93188  0.91013  0.92675  0.92055
## Neg Pred Value        0.99760   0.9969  0.99106  0.98970  0.99206  0.99276
## Prevalence            0.09425   0.1087  0.09700  0.10208  0.09783  0.09058
## Detection Rate        0.09208   0.1060  0.08892  0.09283  0.09067  0.08400
## Detection Prevalence  0.09592   0.1082  0.09542  0.10200  0.09783  0.09125
## Balanced Accuracy     0.98639   0.9861  0.95473  0.94959  0.95940  0.95968
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96147   0.9538  0.90873  0.90646
## Specificity           0.99538   0.9940  0.99077  0.98928
## Pos Pred Value        0.95737   0.9509  0.92105  0.90186
## Neg Pred Value        0.99584   0.9944  0.98921  0.98983
## Prevalence            0.09733   0.1082  0.10592  0.09800
## Detection Rate        0.09358   0.1032  0.09625  0.08883
## Detection Prevalence  0.09775   0.1086  0.10450  0.09850
## Balanced Accuracy     0.97843   0.9739  0.94975  0.94787
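The headline figures reported by `confusionMatrix` follow directly from the counts in the table. As a check, the sketch below re-enters the held-out confusion matrix shown above and recomputes accuracy, per-class sensitivity and positive predictive value with base R only (the variable names are illustrative):

```r
# Held-out confusion matrix copied from the output above
# (rows = predicted digit, columns = reference digit)
cm <- matrix(c(
  1105,    0,    7,    5,    3,    6,   14,    2,    6,    3,
     0, 1272,    3,    1,    2,    0,    3,    3,   12,    3,
     3,    7, 1067,   23,   17,    6,    3,   13,    6,    0,
     3,    7,   25, 1114,    2,   21,    0,    0,   29,   23,
     1,    1,   16,    0, 1088,   10,    4,    7,    9,   38,
     2,    2,    2,   21,    2, 1008,   19,    3,   29,    7,
    11,    5,    3,    6,    8,    7, 1123,    0,    8,    2,
     1,    3,   19,   10,    4,    3,    0, 1239,    4,   20,
     3,    4,   16,   34,   10,    9,    2,    7, 1155,   14,
     2,    4,    6,   11,   38,   17,    0,   25,   13, 1066
), nrow = 10, byrow = TRUE,
dimnames = list(Prediction = as.character(0:9), Reference = as.character(0:9)))

# Overall accuracy: correctly classified rows over all rows
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions divided by the true count of each digit
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions divided by the predicted count
pos_pred_value <- diag(cm) / rowSums(cm)

# Largest off-diagonal count, i.e. the most frequent single confusion
off_diag <- cm
diag(off_diag) <- 0
worst <- which(off_diag == max(off_diag), arr.ind = TRUE)

print(round(accuracy, 4))
print(round(sensitivity, 5))
print(round(pos_pred_value, 5))
print(worst)
```

The largest off-diagonal counts sit in the 4 and 9 cells, in both directions, which suggests that pair is the hardest for this model to separate.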