The purpose of this script is to evaluate the XGBoost package to classify the MNIST digit dataset. The secondary objective is to see if caret helps optimize the model.
XGBoost Basic Classifier
Let’s first calculate a basic XGBoost classifier,based on pixels as input features and default parameter values.
set.seed(8675309)
print(date()); xgb.basic.tTime <- proc.time()
xgb.basic.mod <- xgboost(data=xgb.DMatrix(model.matrix(~.,data=train.images),label=train.digits),
# verbose=0,
num_class=numClasses,
nrounds=100,
early.stop.round=3,
params=list(objective="multi:softmax",eval_metric="merror"))
xgb.basic.tTime <- proc.time() - xgb.basic.tTime
saveRDS(xgb.basic.mod,"xgb.basic.mod")
print(date())
Evaluate the basic XGBoost classifier as a sanity check.
xgb.basic.pred <- predict(xgb.basic.mod,newdata=xgb.DMatrix(model.matrix(~.,data=test.images),label=test.digits))
confusionMatrix(xgb.basic.pred,test.digits)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 791 0 1 2 1 2 2 1 0 1
## 1 0 957 3 5 2 0 1 2 1 0
## 2 2 4 787 10 3 4 0 15 2 0
## 3 0 3 5 827 0 7 0 1 7 5
## 4 2 3 3 1 775 3 4 2 1 13
## 5 1 0 0 11 1 694 4 1 2 3
## 6 2 0 3 2 4 6 822 0 5 1
## 7 0 1 8 6 2 4 0 865 2 7
## 8 8 3 4 10 2 4 2 1 799 4
## 9 1 2 4 6 14 5 0 14 7 790
##
## Overall Statistics
##
## Accuracy : 0.9653
## 95% CI : (0.9612, 0.9692)
## No Information Rate : 0.1159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9615
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.98017 0.9836 0.96210 0.93977 0.96393 0.95199
## Specificity 0.99868 0.9981 0.99472 0.99628 0.99579 0.99700
## Pos Pred Value 0.98752 0.9856 0.95163 0.96725 0.96035 0.96792
## Neg Pred Value 0.99789 0.9978 0.99591 0.99297 0.99618 0.99544
## Prevalence 0.09609 0.1159 0.09740 0.10479 0.09574 0.08681
## Detection Rate 0.09419 0.1140 0.09371 0.09848 0.09228 0.08264
## Detection Prevalence 0.09538 0.1156 0.09848 0.10181 0.09609 0.08538
## Balanced Accuracy 0.98943 0.9908 0.97841 0.96802 0.97986 0.97449
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.98443 0.9590 0.96731 0.95874
## Specificity 0.99696 0.9960 0.99498 0.99300
## Pos Pred Value 0.97278 0.9665 0.95460 0.93713
## Neg Pred Value 0.99828 0.9951 0.99643 0.99550
## Prevalence 0.09943 0.1074 0.09836 0.09812
## Detection Rate 0.09788 0.1030 0.09514 0.09407
## Detection Prevalence 0.10062 0.1066 0.09967 0.10038
## Balanced Accuracy 0.99070 0.9775 0.98115 0.97587
XGBoost Classifier Optimized by PCA and Caret
Caret is an R package that generalizes the calculation of models for many classification and regression algorithms. It also provides convenience functions for data partitioning and classifier evaluation.
The basic idea is that each classification algorithm has a small set of parameters that can be adjusted to conform its objective function (like XGBoost) to any particular optimization landscape (like digit recognition). Parameter values are specified in a grid and the train() algorithm iterates over nodes in the parameter grid. For each node, it calculates (i.e., trains) the implied model and evaluates the model’s performance. Then the best model is returned. (Since training occurs at each node, I rather wish the function were actually called search() or iterate() rather than train()).
Because of this searching nature of the caret algorithm, it can be very expensive in terms of time to build a model. Fortunately, the digit recognition problem, or the MNIST dataset in particular includes relatively few primary component features. While each image has (28x28=) 784 pixels, which can be considered raw features, Harshit Mahapatra, and probably others, have shown that 99.3% of the variance is held in only 50 of the pixel positions. Therefore, as a first approximation, it makes sense to use PCA analysis to reduce the feature space and save significant training time during the search.
Primary Component Analysis (PCA)
First we build the PCA rotation matrix which isolates the dataset variance along some number of orthogonal axes. Since this pre-processing step does not consider labels or perform classifications, but only analyzes the variance in the data, it is acceptable to use the entire dataset. Portions of the data will be set aside for cross-validation by the actual training steps, in order to prevent over-fitting in the gradient descents of model training. It should be noted that since PCA effectively ignores small variances, it is also a guard against over-fitting the data.
Here is an average image of the entire training dataset. It clearly shows that there are many pixels around the edges that do not contain much information. So, we would expect to be able to reduce the classifier input features accordingly.
image(matrix(apply(train.raw[,-1],2,mean),28,28),col=terrain.colors(32))

# image(matrix(unlist(train.images[16,]),28,28)) # 6 (rotated around X axis?)
Perform PCA analysis and select the first 50 eigenvectors as input features. The plot shows that the contribution of additional components does not change the information content provided by using additional features.
pcaDims <- 50
X <- train.raw[,-1] / 255.0 # Scale all pixel values to [0,1]
covX <- cov(X) # 784x784 covariance matrix
pcaX <- prcomp(covX) # Perform PCA
pca <- pcaX$rotation[,1:pcaDims] # Save the important part of the rotaion matrix (784x50)
preProcPCA <- function(images) { # Function to reduce observations from N x 784 to N x 50
as.matrix(images / 255.0) %*% pca
}
plot(summary(pcaX)$importance[3,],
main="Importance of Primary Components",
xlab="Primary Component Index",
ylab="Cumulative Variance")

# screeplot(pcaX,npcs=50,type="lines")
Apply PCA transformation to training data, and note the reduction in size of the feature set.
train.images.pca <- preProcPCA(train.images)
test.images.pca <- preProcPCA(test.images)
dim(train.images.pca)
## [1] 33602 50
Let’s see if there is any obvious structure to the PCA-reduced average images. Green is the minimum and white is the maximum. Bottom-left is PC1 and top-right is PC49.
I would not necressarily expect any structure since each primary component is a linear combination of all raw pixels. However, notice how the average values become almost the same at higher PCs.
avg.image <- apply(train.images.pca[,-50],2,mean)
summary(avg.image)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.60800 -0.14380 -0.02021 -0.13780 0.04863 2.19500
image(matrix(avg.image,7,7),col=terrain.colors(32))

Search for Optimized XGB Model
Use the caret package to search for an optimal set of parameter values.
print(date()); xgb.opt.tTime <- proc.time()
xgb.grid <- expand.grid(
nrounds=100,
max_depth=c(5,10,15),
eta=c(0.3,0.03,0.003),
gamma=c(0.5,1.0),
colsample_bytree=1,
min_child_weight=1
)
xgb.trcontrol <- trainControl(
method="cv",
number=5,
verboseIter=FALSE,
returnData=FALSE,
returnResamp="all",
classProbs=TRUE,
allowParallel=TRUE
)
xgb.opt.mod <- train(x=train.images.pca,y=train.digits,
verbose=0,
objective="multi:softmax",
eval_metric="merror",
early.stop.round=3,
num_class=numClasses,
watchlist=list(train=xgb.DMatrix(data=train.images.pca,label=train.digits)),
trControl=xgb.trcontrol,
tuneGrid=xgb.grid,
method="xgbTree"
)
xgb.opt.tTime <- proc.time() - xgb.opt.tTime
saveRDS(xgb.opt.mod,"xgb.opt.mod")
xgb.opt.pred <- predict(xgb.opt.mod,test.images.pca)
confusionMatrix(xgb.opt.pred,test.digits)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 783 0 1 1 1 4 5 1 0 3
## 1 0 957 1 2 1 0 1 2 3 0
## 2 4 5 779 14 3 2 4 11 1 2
## 3 0 3 4 813 1 16 1 0 21 12
## 4 1 0 8 1 772 6 4 5 2 20
## 5 4 1 0 22 1 690 5 3 9 1
## 6 11 0 6 3 2 4 811 0 4 1
## 7 0 0 8 5 4 3 0 866 3 10
## 8 4 1 8 9 1 2 4 2 775 8
## 9 0 6 3 10 18 2 0 12 8 767
##
## Overall Statistics
##
## Accuracy : 0.9542
## 95% CI : (0.9495, 0.9585)
## No Information Rate : 0.1159
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.949
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.97026 0.9836 0.95232 0.92386 0.96020 0.94650
## Specificity 0.99789 0.9987 0.99393 0.99229 0.99381 0.99400
## Pos Pred Value 0.97997 0.9897 0.94424 0.93341 0.94261 0.93750
## Neg Pred Value 0.99684 0.9978 0.99485 0.99110 0.99578 0.99491
## Prevalence 0.09609 0.1159 0.09740 0.10479 0.09574 0.08681
## Detection Rate 0.09324 0.1140 0.09276 0.09681 0.09193 0.08216
## Detection Prevalence 0.09514 0.1151 0.09824 0.10372 0.09752 0.08764
## Balanced Accuracy 0.98408 0.9911 0.97313 0.95807 0.97700 0.97025
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.97126 0.9601 0.93826 0.93083
## Specificity 0.99590 0.9956 0.99485 0.99221
## Pos Pred Value 0.96318 0.9633 0.95209 0.92857
## Neg Pred Value 0.99682 0.9952 0.99328 0.99247
## Prevalence 0.09943 0.1074 0.09836 0.09812
## Detection Rate 0.09657 0.1031 0.09228 0.09133
## Detection Prevalence 0.10026 0.1070 0.09693 0.09836
## Balanced Accuracy 0.98358 0.9778 0.96655 0.96152
Submit Results to Kaggle
Todo: Will this fit/run on Kaggle? Or does it take too many resources like a previous project? Answer: Takes about 5 hours to run on laptop.
Now apply one of the models to the test dataset and submit the results to Kaggle for scoring.
test.raw <- read.csv("data/test.csv")
# ss <- read.csv("data/sample_submission.csv")
final.results <- data.frame(ImageId=1:nrow(test.raw),
Label=predict(xgb.opt.mod,preProcPCA(test.raw)))
write.csv(final.results,"xgb_opt_solution.csv",row.names=FALSE)