Digit Recognizer

The purpose of this script is to evaluate the XGBoost package to classify the MNIST digit dataset. The secondary objective is to see if caret helps optimize the model.

Load Libraries

print(date())

## [1] "Fri Aug 26 09:01:52 2016"

library(plyr)
library(ggplot2)
library(lattice)
library(caret)
library(xgboost)

Load Training Data

setwd("~/Academic/Kaggle/DigitRecognizer")
train.raw <- read.csv("data/train.csv")

digitClasses <- as.character(0:9)
numClasses <- length(digitClasses)

Partition Training Data

set.seed(8675309)
inTrain <- createDataPartition(y=train.raw$label,p=0.8,list=FALSE)
data.train <- train.raw[inTrain,]
data.test  <- train.raw[-inTrain,]

train.images <- data.train[,-1]
train.digits <- data.train[,1]
test.images  <- data.test[,-1]
test.digits  <- data.test[,1]

XGBoost Basic Classifier

Let’s first calculate a basic XGBoost classifier,based on pixels as input features and default parameter values.

set.seed(8675309)
print(date()); xgb.basic.tTime <- proc.time()
xgb.basic.mod <- xgboost(data=xgb.DMatrix(model.matrix(~.,data=train.images),label=train.digits),
                         # verbose=0,
                         num_class=numClasses,
                         nrounds=100,
                         early.stop.round=3,
                         params=list(objective="multi:softmax",eval_metric="merror"))
xgb.basic.tTime <- proc.time() - xgb.basic.tTime
saveRDS(xgb.basic.mod,"xgb.basic.mod")
print(date())

Evaluate the basic XGBoost classifier as a sanity check.

xgb.basic.pred <- predict(xgb.basic.mod,newdata=xgb.DMatrix(model.matrix(~.,data=test.images),label=test.digits))
confusionMatrix(xgb.basic.pred,test.digits)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 791   0   1   2   1   2   2   1   0   1
##          1   0 957   3   5   2   0   1   2   1   0
##          2   2   4 787  10   3   4   0  15   2   0
##          3   0   3   5 827   0   7   0   1   7   5
##          4   2   3   3   1 775   3   4   2   1  13
##          5   1   0   0  11   1 694   4   1   2   3
##          6   2   0   3   2   4   6 822   0   5   1
##          7   0   1   8   6   2   4   0 865   2   7
##          8   8   3   4  10   2   4   2   1 799   4
##          9   1   2   4   6  14   5   0  14   7 790
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9653          
##                  95% CI : (0.9612, 0.9692)
##     No Information Rate : 0.1159          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9615          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98017   0.9836  0.96210  0.93977  0.96393  0.95199
## Specificity           0.99868   0.9981  0.99472  0.99628  0.99579  0.99700
## Pos Pred Value        0.98752   0.9856  0.95163  0.96725  0.96035  0.96792
## Neg Pred Value        0.99789   0.9978  0.99591  0.99297  0.99618  0.99544
## Prevalence            0.09609   0.1159  0.09740  0.10479  0.09574  0.08681
## Detection Rate        0.09419   0.1140  0.09371  0.09848  0.09228  0.08264
## Detection Prevalence  0.09538   0.1156  0.09848  0.10181  0.09609  0.08538
## Balanced Accuracy     0.98943   0.9908  0.97841  0.96802  0.97986  0.97449
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.98443   0.9590  0.96731  0.95874
## Specificity           0.99696   0.9960  0.99498  0.99300
## Pos Pred Value        0.97278   0.9665  0.95460  0.93713
## Neg Pred Value        0.99828   0.9951  0.99643  0.99550
## Prevalence            0.09943   0.1074  0.09836  0.09812
## Detection Rate        0.09788   0.1030  0.09514  0.09407
## Detection Prevalence  0.10062   0.1066  0.09967  0.10038
## Balanced Accuracy     0.99070   0.9775  0.98115  0.97587

XGBoost Classifier Optimized by PCA and Caret

Caret is an R package that generalizes the calculation of models for many classification and regression algorithms. It also provides convenience functions for data partitioning and classifier evaluation.

The basic idea is that each classification algorithm has a small set of parameters that can be adjusted to conform its objective function (like XGBoost) to any particular optimization landscape (like digit recognition). Parameter values are specified in a grid and the train() algorithm iterates over nodes in the parameter grid. For each node, it calculates (i.e., trains) the implied model and evaluates the model’s performance. Then the best model is returned. (Since training occurs at each node, I rather wish the function were actually called search() or iterate() rather than train()).

Because of this searching nature of the caret algorithm, it can be very expensive in terms of time to build a model. Fortunately, the digit recognition problem, or the MNIST dataset in particular includes relatively few primary component features. While each image has (28x28=) 784 pixels, which can be considered raw features, Harshit Mahapatra, and probably others, have shown that 99.3% of the variance is held in only 50 of the pixel positions. Therefore, as a first approximation, it makes sense to use PCA analysis to reduce the feature space and save significant training time during the search.

Primary Component Analysis (PCA)

First we build the PCA rotation matrix which isolates the dataset variance along some number of orthogonal axes. Since this pre-processing step does not consider labels or perform classifications, but only analyzes the variance in the data, it is acceptable to use the entire dataset. Portions of the data will be set aside for cross-validation by the actual training steps, in order to prevent over-fitting in the gradient descents of model training. It should be noted that since PCA effectively ignores small variances, it is also a guard against over-fitting the data.

Here is an average image of the entire training dataset. It clearly shows that there are many pixels around the edges that do not contain much information. So, we would expect to be able to reduce the classifier input features accordingly.

image(matrix(apply(train.raw[,-1],2,mean),28,28),col=terrain.colors(32))

# image(matrix(unlist(train.images[16,]),28,28))        # 6 (rotated around X axis?)

Perform PCA analysis and select the first 50 eigenvectors as input features. The plot shows that the contribution of additional components does not change the information content provided by using additional features.

pcaDims <- 50
X <- train.raw[,-1] / 255.0             # Scale all pixel values to [0,1]
covX <- cov(X)                          # 784x784 covariance matrix
pcaX <- prcomp(covX)                    # Perform PCA
pca <- pcaX$rotation[,1:pcaDims]        # Save the important part of the rotaion matrix (784x50)
preProcPCA <- function(images) {        # Function to reduce observations from N x 784 to N x 50
        as.matrix(images / 255.0) %*% pca
}
plot(summary(pcaX)$importance[3,],
     main="Importance of Primary Components",
     xlab="Primary Component Index",
     ylab="Cumulative Variance")

# screeplot(pcaX,npcs=50,type="lines")

Apply PCA transformation to training data, and note the reduction in size of the feature set.

train.images.pca <- preProcPCA(train.images)
test.images.pca  <- preProcPCA(test.images)
dim(train.images.pca)

## [1] 33602    50

Let’s see if there is any obvious structure to the PCA-reduced average images. Green is the minimum and white is the maximum. Bottom-left is PC1 and top-right is PC49.

I would not necressarily expect any structure since each primary component is a linear combination of all raw pixels. However, notice how the average values become almost the same at higher PCs.

avg.image <- apply(train.images.pca[,-50],2,mean)
summary(avg.image)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.60800 -0.14380 -0.02021 -0.13780  0.04863  2.19500

image(matrix(avg.image,7,7),col=terrain.colors(32))

Search for Optimized XGB Model

Use the caret package to search for an optimal set of parameter values.

print(date()); xgb.opt.tTime <- proc.time()

xgb.grid <- expand.grid(
        nrounds=100,
        max_depth=c(5,10,15),
        eta=c(0.3,0.03,0.003),
        gamma=c(0.5,1.0),
        colsample_bytree=1,
        min_child_weight=1
        )

xgb.trcontrol <- trainControl(
        method="cv",
        number=5,
        verboseIter=FALSE,
        returnData=FALSE,
        returnResamp="all",
        classProbs=TRUE,
        allowParallel=TRUE
        )

xgb.opt.mod <- train(x=train.images.pca,y=train.digits,
                     verbose=0,
                     objective="multi:softmax",
                     eval_metric="merror",
                     early.stop.round=3,
                     num_class=numClasses,
                     watchlist=list(train=xgb.DMatrix(data=train.images.pca,label=train.digits)),
                     trControl=xgb.trcontrol,
                     tuneGrid=xgb.grid,
                     method="xgbTree"
                     )
xgb.opt.tTime <- proc.time() - xgb.opt.tTime
saveRDS(xgb.opt.mod,"xgb.opt.mod")

xgb.opt.pred <- predict(xgb.opt.mod,test.images.pca)
confusionMatrix(xgb.opt.pred,test.digits)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 783   0   1   1   1   4   5   1   0   3
##          1   0 957   1   2   1   0   1   2   3   0
##          2   4   5 779  14   3   2   4  11   1   2
##          3   0   3   4 813   1  16   1   0  21  12
##          4   1   0   8   1 772   6   4   5   2  20
##          5   4   1   0  22   1 690   5   3   9   1
##          6  11   0   6   3   2   4 811   0   4   1
##          7   0   0   8   5   4   3   0 866   3  10
##          8   4   1   8   9   1   2   4   2 775   8
##          9   0   6   3  10  18   2   0  12   8 767
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9542          
##                  95% CI : (0.9495, 0.9585)
##     No Information Rate : 0.1159          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.949           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97026   0.9836  0.95232  0.92386  0.96020  0.94650
## Specificity           0.99789   0.9987  0.99393  0.99229  0.99381  0.99400
## Pos Pred Value        0.97997   0.9897  0.94424  0.93341  0.94261  0.93750
## Neg Pred Value        0.99684   0.9978  0.99485  0.99110  0.99578  0.99491
## Prevalence            0.09609   0.1159  0.09740  0.10479  0.09574  0.08681
## Detection Rate        0.09324   0.1140  0.09276  0.09681  0.09193  0.08216
## Detection Prevalence  0.09514   0.1151  0.09824  0.10372  0.09752  0.08764
## Balanced Accuracy     0.98408   0.9911  0.97313  0.95807  0.97700  0.97025
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.97126   0.9601  0.93826  0.93083
## Specificity           0.99590   0.9956  0.99485  0.99221
## Pos Pred Value        0.96318   0.9633  0.95209  0.92857
## Neg Pred Value        0.99682   0.9952  0.99328  0.99247
## Prevalence            0.09943   0.1074  0.09836  0.09812
## Detection Rate        0.09657   0.1031  0.09228  0.09133
## Detection Prevalence  0.10026   0.1070  0.09693  0.09836
## Balanced Accuracy     0.98358   0.9778  0.96655  0.96152

Submit Results to Kaggle

Todo: Will this fit/run on Kaggle? Or does it take too many resources like a previous project? Answer: Takes about 5 hours to run on laptop.

Now apply one of the models to the test dataset and submit the results to Kaggle for scoring.

test.raw <- read.csv("data/test.csv")
# ss <- read.csv("data/sample_submission.csv")

final.results <- data.frame(ImageId=1:nrow(test.raw),
                            Label=predict(xgb.opt.mod,preProcPCA(test.raw)))
write.csv(final.results,"xgb_opt_solution.csv",row.names=FALSE)

Appendix A - Environment

Development of this report was done on a dual core laptop with 16 GB of RAM, running Windows 10.

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xgboost_0.4-4   caret_6.0-70    lattice_0.20-33 ggplot2_2.1.0  
## [5] plyr_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6        compiler_3.3.1     formatR_1.4       
##  [4] nloptr_1.0.4       class_7.3-14       iterators_1.0.8   
##  [7] tools_3.3.1        digest_0.6.9       lme4_1.1-12       
## [10] evaluate_0.9       nlme_3.1-128       gtable_0.2.0      
## [13] mgcv_1.8-12        Matrix_1.2-6       foreach_1.4.3     
## [16] yaml_2.1.13        parallel_3.3.1     SparseM_1.7       
## [19] e1071_1.6-7        stringr_1.0.0      knitr_1.13        
## [22] MatrixModels_0.4-1 stats4_3.3.1       grid_3.3.1        
## [25] nnet_7.3-12        data.table_1.9.6   rmarkdown_1.0     
## [28] minqa_1.2.4        reshape2_1.4.1     car_2.1-2         
## [31] magrittr_1.5       scales_0.4.0       codetools_0.2-14  
## [34] htmltools_0.3.5    MASS_7.3-45        splines_3.3.1     
## [37] pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.26     
## [40] stringi_1.1.1      munsell_0.4.3      chron_2.3-47

print(date())

## [1] "Fri Aug 26 11:00:48 2016"

Digit Recognizer - XGBoost

Fred Smith

Monday, July 18, 2016