Digit Recognizer using Random Forest Algorithm

The following is the R code for the ‘Digit Recognizer’ problem on Kaggle. The model uses the Random Forest algorithm, with the ‘tuneRF’ function used to choose the ‘mtry’ parameter of the forest. Each observation in the dataset has 784 attributes, and each attribute is the intensity of one pixel on a 28 x 28 grid. Because the dataset has so many attributes per observation, Principal Component Analysis (PCA) was used for dimensionality reduction before fitting the forest.

set.seed(0)
# Load the libs required for the analysis
library(class)
library(readr)
library(randomForest)
library(caret)

# Load the training and test datasets
train <- read.csv("C:/Users/Adi/Desktop/digit recognizer/train.csv")
test <- read.csv("C:/Users/Adi/Desktop/digit recognizer/test.csv")


# Apply PCA to the pixel columns (column 1 is the label); centre but do not scale
pca.train3 <- prcomp(train[,-1], scale. = FALSE, center = TRUE)


# Identify the amount of variance explained by the PCs
varEx <- pca.train3$sdev^2 / sum(pca.train3$sdev^2)
varEx <- cbind(seq_along(varEx), cumsum(varEx))
colnames(varEx) <- c("Nmbr PCs", "Cum Var")
# Cumulative variance at 50, 100, ..., 700 components
# (R silently drops index 0, so the sequence starts at 50)
VarianceExplanation <- varEx[seq(50, 700, 50), ]
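
The cumulative-variance table can also be used to pick the number of components instead of fixing it by hand. The following is a minimal sketch on synthetic data, not the Kaggle pixels; the 95% threshold and the names `X`, `threshold` and `n_comp` are assumptions made for illustration only.

```r
set.seed(1)

# Synthetic stand-in for the pixel matrix: 200 rows, 20 correlated columns
latent <- matrix(rnorm(200 * 3), nrow = 200)
mixing <- matrix(rnorm(3 * 20), nrow = 3)
X <- latent %*% mixing + matrix(rnorm(200 * 20, sd = 0.1), nrow = 200)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Cumulative share of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the (assumed) 95% target
threshold <- 0.95
n_comp <- which(cum_var >= threshold)[1]
print(n_comp)
```

In the analysis here the number of components was simply set to 100; a threshold rule like this is one possible alternative.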

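# Keep the loadings of the first 100 principal components
rotate <- pca.train3$rotation[, 1:100]

# Project the centred pixel data onto those 100 components
trainFinal2 <- as.matrix(scale(train[,-1], center = TRUE, scale = FALSE)) %*% rotate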
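
The manual projection above gives the same scores that ‘prcomp’ stores in its ‘x’ component. New data, such as the Kaggle test.csv, would need to be centred with the training means before being multiplied by the same loadings. The sketch below shows both points on small synthetic matrices; `X_train`, `X_new` and `k` are hypothetical names, not objects from the analysis.

```r
set.seed(1)

# Synthetic stand-ins: 100 "training" rows and 20 "new" rows with 6 columns
X_train <- matrix(rnorm(100 * 6), nrow = 100)
X_new   <- matrix(rnorm(20 * 6), nrow = 20)
colnames(X_train) <- colnames(X_new) <- paste0("pixel", 0:5)

pca <- prcomp(X_train, center = TRUE, scale. = FALSE)
k <- 3
rotate <- pca$rotation[, 1:k]

# Manual projection of the training data, as done in the analysis
manual_scores <- scale(X_train, center = TRUE, scale = FALSE) %*% rotate
same_as_prcomp <- isTRUE(all.equal(unname(manual_scores), unname(pca$x[, 1:k])))

# New data must be centred with the TRAINING means stored in pca$center
new_scores  <- scale(X_new, center = pca$center, scale = FALSE) %*% rotate
via_predict <- predict(pca, newdata = X_new)[, 1:k]
same_as_predict <- isTRUE(all.equal(unname(new_scores), unname(via_predict)))

print(c(same_as_prcomp, same_as_predict))
```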

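# Split the labelled data: 30000 rows for training, the remaining 12000 held out
rows <- sample(1:42000, 30000)

# Note: testFinal2 is the held-out part of train.csv, not the Kaggle test.csv loaded above
testFinal2 <- trainFinal2[-rows, ]
trainFinal2 <- trainFinal2[rows, ]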

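# Run a random forest benchmark for comparison: first tune mtry with tuneRF
set.seed(0)

# Note: the label is passed as a number here, so tuneRF runs in regression mode and
# the OOB error printed below is a mean squared error, not a misclassification rate.
# Pass as.factor(train[rows,1]) instead to tune on classification error.
tune <- tuneRF(trainFinal2, train[rows,1], ntreeTry = 150, stepFactor = 5, improve = 0.5,
               trace = TRUE, plot = TRUE, doBest = FALSE)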

## mtry = 33  OOB error = 1.23398 
## Searching left ...
## mtry = 7     OOB error = 1.95475 
## -0.584102 0.5 
## Searching right ...
## mtry = 100   OOB error = 1.299972 
## -0.05347939 0.5

mtryfinal <- tune[as.numeric(which.min(tune[,"OOBError"])),"mtry"]

rf <- randomForest(trainFinal2,as.factor(train[rows,1]), ntree=300,  mtry=mtryfinal,keep.forest=TRUE)
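
Whether ‘randomForest’ and ‘tuneRF’ fit a regression or a classification forest depends only on the type of the response: a numeric label gives regression, a factor gives classification. The following sketch illustrates this on the built-in iris data rather than the digit data; the object names are hypothetical and the tuning settings are arbitrary choices for a small example.

```r
library(randomForest)
set.seed(0)

x <- iris[, 1:4]
y_factor  <- iris$Species              # factor label  -> classification
y_numeric <- as.numeric(iris$Species)  # numeric label -> regression

# randomForest warns when a numeric response has very few unique values
rf_reg <- suppressWarnings(randomForest(x, y_numeric, ntree = 50))
rf_cls <- randomForest(x, y_factor, ntree = 50)
print(c(rf_reg$type, rf_cls$type))

# Tuning mtry with a factor response reports OOB misclassification error
tune_cls <- tuneRF(x, y_factor, ntreeTry = 50, stepFactor = 2, improve = 0.05,
                   trace = FALSE, plot = FALSE, doBest = FALSE)
best_mtry <- tune_cls[which.min(tune_cls[, "OOBError"]), "mtry"]
print(best_mtry)
```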

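# Prediction on the training set (a random forest is expected to fit its own
# training data almost perfectly, so this does not measure generalisation)
pred <- predict(rf, newdata=trainFinal2)
# Recent versions of caret require both arguments to be factors
confusionMatrix(pred, as.factor(train[rows,1]))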

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3001    0    0    0    0    0    0    0    0    0
##          1    0 3379    0    0    0    0    0    0    0    0
##          2    0    0 3013    0    0    0    0    0    0    0
##          3    0    0    0 3126    0    0    0    0    0    0
##          4    0    0    0    0 2898    0    0    0    0    0
##          5    0    0    0    0    0 2708    0    0    0    0
##          6    0    0    0    0    0    0 2969    0    0    0
##          7    0    0    0    0    0    0    0 3102    0    0
##          8    0    0    0    0    0    0    0    0 2792    0
##          9    0    0    0    0    0    0    0    0    0 3012
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.1126     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity               1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Specificity               1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Pos Pred Value            1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Neg Pred Value            1.0   1.0000   1.0000   1.0000   1.0000  1.00000
## Prevalence                0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Detection Rate            0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Detection Prevalence      0.1   0.1126   0.1004   0.1042   0.0966  0.09027
## Balanced Accuracy         1.0   1.0000   1.0000   1.0000   1.0000  1.00000
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           1.00000   1.0000  1.00000   1.0000
## Specificity           1.00000   1.0000  1.00000   1.0000
## Pos Pred Value        1.00000   1.0000  1.00000   1.0000
## Neg Pred Value        1.00000   1.0000  1.00000   1.0000
## Prevalence            0.09897   0.1034  0.09307   0.1004
## Detection Rate        0.09897   0.1034  0.09307   0.1004
## Detection Prevalence  0.09897   0.1034  0.09307   0.1004
## Balanced Accuracy     1.00000   1.0000  1.00000   1.0000
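
The perfect accuracy above is measured on the same rows the forest was grown on. A fairer estimate that needs no extra data is the out-of-bag (OOB) prediction, which ‘predict’ returns when it is called without ‘newdata’. The sketch below contrasts the two on the built-in iris data; `rf_demo`, `train_acc` and `oob_acc` are hypothetical names used only for this illustration.

```r
library(randomForest)
set.seed(0)

rf_demo <- randomForest(iris[, 1:4], iris$Species, ntree = 200)

# Accuracy when predicting the rows the trees were trained on (optimistic)
train_acc <- mean(predict(rf_demo, newdata = iris[, 1:4]) == iris$Species)

# Accuracy of out-of-bag predictions: each row is predicted only by the
# trees that did not see it during training
oob_acc <- mean(predict(rf_demo) == iris$Species)

print(c(train = train_acc, oob = oob_acc))
```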

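# Prediction on the held-out set (the 12000 rows of train.csv not used for fitting)
pred <- predict(rf, newdata=testFinal2)
# Recent versions of caret require both arguments to be factors
confusionMatrix(pred, as.factor(train[-rows,1]))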

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1105    0    7    5    3    6   14    2    6    3
##          1    0 1272    3    1    2    0    3    3   12    3
##          2    3    7 1067   23   17    6    3   13    6    0
##          3    3    7   25 1114    2   21    0    0   29   23
##          4    1    1   16    0 1088   10    4    7    9   38
##          5    2    2    2   21    2 1008   19    3   29    7
##          6   11    5    3    6    8    7 1123    0    8    2
##          7    1    3   19   10    4    3    0 1239    4   20
##          8    3    4   16   34   10    9    2    7 1155   14
##          9    2    4    6   11   38   17    0   25   13 1066
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9364          
##                  95% CI : (0.9319, 0.9407)
##     No Information Rate : 0.1088          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9293          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97701   0.9747  0.91667  0.90939  0.92675  0.92732
## Specificity           0.99577   0.9975  0.99280  0.98979  0.99206  0.99203
## Pos Pred Value        0.96003   0.9792  0.93188  0.91013  0.92675  0.92055
## Neg Pred Value        0.99760   0.9969  0.99106  0.98970  0.99206  0.99276
## Prevalence            0.09425   0.1087  0.09700  0.10208  0.09783  0.09058
## Detection Rate        0.09208   0.1060  0.08892  0.09283  0.09067  0.08400
## Detection Prevalence  0.09592   0.1082  0.09542  0.10200  0.09783  0.09125
## Balanced Accuracy     0.98639   0.9861  0.95473  0.94959  0.95940  0.95968
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96147   0.9538  0.90873  0.90646
## Specificity           0.99538   0.9940  0.99077  0.98928
## Pos Pred Value        0.95737   0.9509  0.92105  0.90186
## Neg Pred Value        0.99584   0.9944  0.98921  0.98983
## Prevalence            0.09733   0.1082  0.10592  0.09800
## Detection Rate        0.09358   0.1032  0.09625  0.08883
## Detection Prevalence  0.09775   0.1086  0.10450  0.09850
## Balanced Accuracy     0.97843   0.9739  0.94975  0.94787
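
The summary statistics above can be recomputed directly from the confusion matrix, which makes it clear what each one means. The sketch below copies the held-out confusion matrix from the output above into a matrix (the name `cm` is just for illustration) and derives the overall accuracy, the per-class sensitivity and the positive predictive value.

```r
# Held-out confusion matrix copied from the output above:
# rows are predictions, columns are the reference labels
cm <- matrix(c(
  1105,    0,    7,    5,    3,    6,   14,    2,    6,    3,
     0, 1272,    3,    1,    2,    0,    3,    3,   12,    3,
     3,    7, 1067,   23,   17,    6,    3,   13,    6,    0,
     3,    7,   25, 1114,    2,   21,    0,    0,   29,   23,
     1,    1,   16,    0, 1088,   10,    4,    7,    9,   38,
     2,    2,    2,   21,    2, 1008,   19,    3,   29,    7,
    11,    5,    3,    6,    8,    7, 1123,    0,    8,    2,
     1,    3,   19,   10,    4,    3,    0, 1239,    4,   20,
     3,    4,   16,   34,   10,    9,    2,    7, 1155,   14,
     2,    4,    6,   11,   38,   17,    0,   25,   13, 1066
), nrow = 10, byrow = TRUE, dimnames = list(Prediction = 0:9, Reference = 0:9))

# Overall accuracy: correctly classified (diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions of a digit over all true cases of that digit
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over all predictions of that digit
pos_pred_value <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 5))
```

The digits with the lowest sensitivity (9, 8 and 3) are also the ones with the largest off-diagonal counts, for example 38 fours predicted as nine and 38 nines predicted as four.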

Aditya S Nakate

7 January 2017