caret demostration with kaggle bikeshare data (I)

Purpose

This is a demostration to solve regression problem with some (popular?) models in caret.
The data is from the kaggle: bicycle sharing competition. The features of data are quite clean, straightforward, and complete. No need to do much data cleanning, so that we can focus on comparing the models.
Included models: tree-based models (ctree, rpart), boosting models (gbm, gamboost), bagged models (treebag, bagEarth), random forest (rf, cforest, qrf), linear regression models (enet, pcr, glmnet), Radial-kernal regression (rvmRadial), and neural network (nnet, pcaNNet, neuralnet).
The list is by no means complete. For each model, I only try a grid of simple/convenient parameters. The purpose is to taste a’s many models as I could, and I am not intended to do a serious benchmark comparisons between models.

Data

data loading
split the datetime to hour, month, year, and wday using strptime
calculate log(count+1) to replace count, which will be used for prediction. Advantage: 1. the evalutation RMSLE metrics will be transofrmed to RMSE, one of the default metrics in caret::train. 2. The prediction of regression won’t be negative.
removing the unnecesary features 1. casual, register: Useless – not appear in the test set. 2. datetime, count: Redaudant – replaced by new features

train <- read.csv("../DATA/train.csv", colClasses=test_classes)
test <- read.csv("../DATA/test.csv", colClasses=test_classes[1:9])
train$count <- as.integer(train$count)

train_factor <- set_up_features(train)
test_factor <- set_up_features(test)

train_factor$lgcount <- log(train_factor$count+1)
train_factor<- train_factor[,-12]
train_factor<- train_factor[,-11]
train_factor<- train_factor[,-10]
train_factor<- train_factor[,-1]
test_factor<- test_factor[,-1]

Sample split and CV optimization – use caret

Split the sample into 20% testing subsample and 80% trainning subsample.

set.seed(212312)
trainIndex <- createDataPartition(train_factor$lgcount, p = 0.8, list=FALSE, times=1)
subTrain <- train_factor[trainIndex,]
subTest <- train_factor[-trainIndex,]

A simple tree model

a simple test using Conditional Inference Tree (party::ctree)
Use the entire dataset with no sample splitting

library(party)
#create our formula
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
#build our model
fit.ctree.party <- ctree(formula, data=subTrain,controls=ctree_control(mincriterion=0.95,savesplitstats=FALSE))

##             Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## ctree.party  0.4177777  55.68874   0.3658089   46.97015

## [1] object.size:  45.7 Mb

The RMSLE of left-out testing subsample is 0.418. Not bad, actually.
The tree is quite large (~46MBs), and I don’t know how to examine the result easily. Let’s try caret instead.

Modeling with caret

Try to reproduce the party::ctree result with trainning subsample

# all features 
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.ctree <- train(formula, data=subTrain,method='ctree',tuneGrid=expand.grid(mincriterion=0.95))
ctreeVarImp = varImp(fit.ctree)

##                   Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## ctree.caret.plain  0.5436062  84.36475   0.5015594   77.92729

## [1] object.size:  51.4 Mb

Quite disappointing. The difference between the rmsle of testing subsample (0.54) and the training subsample (~0.50) is large. There may be serious outfitting problem.
But, the most serious problem is apparently the ctree using caret is different from using it directly in party. The default parameters must be different. I have tried to force the mincriterion=0.95 (the tunning parameter of caret/ctree) for both models, but it doesn’t work.
I didn’t find a way to tune other ctree parameters, like maxdepth, in caret/ctree.
Apparently, there are many things I don’t know to use the two ctree models. But, let’s move on to try other models.

Model1: Try ctree2 with 6-fold CV optimization with caret

-- here I use ctree2, since the parameter (maxdepth) is easier to understand.
-- use RMSE to select the best model

  ##ctree2 with CV
  fitControl <- trainControl(method = 'cv', number=6,summaryFunction=defaultSummary)
  set.seed(123)
  Grid <- expand.grid(maxdepth = seq(15, 50,5))
  formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
  fit.ctree2CV <- train(formula, data=subTrain, method = 'ctree2', trControl=fitControl,tuneGrid=Grid,metric='RMSE')

##           Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## ctree2.CV  0.5140868  77.55127   0.4329095   65.68954

## [1] object.size:  117.6 Mb

The rmsle of the testing set becomes 0.51, and that of the trainning set is 0.43. Both are improved. The overfitting is even more serious, though. However, I still can’t reach the result using party::ctree (default parameters). :(
hour is the most important feature. But, surprisingly, the feature of wday, workingday, and holiday are all quite useless. Temperature seems to outweight than the other weather factor, e.g. windspeed and humidity.

Model2: Try caret::rpart with 6-fold CV optimization

try another tree-like model: CART using caret::rpart

##model2a: CART using rpart with CV
set.seed(123)
fitControl <- trainControl(method = 'cv', number=6)
Grid <- expand.grid(cp=seq(0, 0.05, 0.005))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.rpartCV <- train(formula, data=subTrain, method = 'rpart', trControl=fitControl, metric='RMSE',maximize=FALSE, tuneGrid = Grid)
##model2b: rpart2 with CV
set.seed(123)
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fitControl <- trainControl(method = 'cv', number=6)
Grid<-expand.grid(.maxdepth=seq(5,20,5))
fit.rpart2CV <- train(formula, data=subTrain, method = 'rpart2', trControl=fitControl, metric = 'RMSE', maximize=FALSE, tuneGrid=Grid)
plot(fit.rpartCV)
plot(fit.rpart2CV)

##           Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## rpart.CV   0.5325579  79.97178   0.4049704   61.81217
## rpart2.CV  0.7847154 148.60228   0.7871033  145.48955

## [1] rpart.CV size:  6.6 Mb

## [1] rpart2.CV size:  6.2 Mb

Ok. The performance is not better than caret::ctree. Especially, rpart2 seems to stop improving at maxdepth=10. The rmsle is fixed to around 0.78 for the training set, which is almost the same for the testing set. The rpart model, on the other hand, converges to non-contrained model with cp=0. The RMSLE of the Training subsample is much smaller than the Testing subsample, which means overfitting.
The models will treat the factorized features with more than two levels as multiple variables. The feature importance plot looks much more complex.
However, the physical size of the models are much smaller than those from ctree.

Esemble model: boosting / bagging / random forest

Model 3a: stocasting gradient boosting (gbm, 6-fold CV)

## gbm fitting
set.seed(123)
fitControl <- trainControl(method = 'cv', number = 6, summaryFunction=defaultSummary)
Grid <- expand.grid( n.trees = seq(50,1000,50), interaction.depth = c(30), shrinkage = c(0.1))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.gbm <- train(formula, data=subTrain, method = 'gbm', trControl=fitControl,tuneGrid=Grid,metric='RMSE',maximize=FALSE)
plot(fit.gbm)
plot(gbmVarImp)

##        Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## gbm.CV  0.3145828  43.72667   0.1658346   29.22251

## [1] object.size:  11.8 Mb

The result seems to be very good.
The model is optimized at n.tree\(=650\) for shrinkage=0.1 and interaction=30. I will explore more on the parameter space later. Smaller shrinkage may reduce the risk of overfitting.
The feature of hour is even more distinguished with this model. Also, gbm, like rpart, will treat multi-level features are multiple variables. I need to consider whether this is good or not. In this way, features like hour may be splitted into too many parameters which increase the complexity of models. On the other hand, we know the response of hour won’t be linear, which means we may not gain to use a single hour feature for naive linear regression.

Model 3b: Another boosting model - gamboost in caret

The model is from mboost. It is another realization of gradient bossting model with parameter mstop, the number of initial boosting iterations, and prune, which is not found in the original mboost guide.

Note: 1. prune seems to be not tunnable using tuneGrid. It will be held constant at the first number of the sequence. 2. It also doesn’t change the result. I have tried prune\(=0.1, 0.5, 0.9, 5\). It all gives the same result at mstop\(=300\).

## gamboost fitting
set.seed(123)
fitControl <- trainControl(method = 'cv', number=6, summaryFunction=defaultSummary)
Grid <- expand.grid(.mstop=seq(100,1000,100),.prune=c(5))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.gamboost <- train(formula, data=subTrain, method = 'gamboost', trControl=fitControl,tuneGrid=Grid,metric='RMSE',maximize=FALSE)

##             Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## gamboost.CV  0.6140908  98.05113   0.6068851   96.41029

## [1] object.size:  10.9 Mb

The RMSLE is worse than the model of caret::gbm. But, there shows no sign of overfitting.
I suppose there should be some way to allow prune to vary, or there should be a parameter similar to shrinkage in gbm. Well, I don’t want to spend too much time on this, and move on to the next model.

Model 4a: bagged CART - treebag in caret

The model is from ipred. In caret, it is simple, and no grid parameter to tune.

## treebag fitting
set.seed(123)
fitControl <- trainControl(method = 'none', summaryFunction=defaultSummary)
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.treebag <- train(formula, data=subTrain, method = 'treebag', trControl=fitControl)
show(fit.treebag)
save(fit.treebag,file='fit_treebag_v1.RData')

##            Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## treebag.CV  0.7499689  147.1881   0.7484233   143.1211

## [1] object.size:  132.5 Mb

Ok. Not surprisingly, the performance is worse than the boosted models.
I suppose this is based on rpart models, so that the multi-level features will be splitted. In addition, the result is not storage-friendly, and the calculation is quite slow.

Model 4b: bagged MARS - bagEarth in caret/earth

According to the caret guide, bagEarth is ‘A bagging wrapper for multivariate adaptive regression splines (MARS) via the earth function’.

## bagEarth fitting
set.seed(123)
fitControl <- trainControl(method = 'cv', number = 6, summaryFunction=defaultSummary)
Grid <- expand.grid(degree=c(2), nprune = seq(10,90,20))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.bagEarth <- train(formula, data=subTrain, method = 'bagEarth', trControl=fitControl,tuneGrid=Grid,metric='RMSE',maximize=FALSE,keepX=FALSE)
#show(fit.bagEarth)

There is also similar problem with caret::treebag, the tunning parameters aren’t under my control. degree seems to be held at the first value I feed in.
Another problem is the result is unnecessarily large (>300MBs), and very time consuming to run. It is even slower than random forest.

##             Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## bagEarth.CV  0.4242703  69.00662   0.4115398    68.0136

## [1] object.size:  383.9 Mb

The RMSLE is ok ~ 0.42 for Testing subsample with degree\(=2\). It seems to be not overfitting.

Model 5: random forest (rf, oob)

# random forest
set.seed(123)
tc <- trainControl("oob")
Grid <- expand.grid(mtry = seq(4,16,4))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.rf <- train(formula, data=subTrain , method='rf', trControl=tc,tuneGrid=Grid,metric='RMSE')

##        RMSE  Rsquared mtry
## 1 0.5943940 0.8238771    4
## 2 0.4905211 0.8800549    8
## 3 0.4718308 0.8890213   12
## 4 0.4618798 0.8936531   16

##        Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## rf.oob  0.4718695  74.40514   0.2334819   38.81704

## [1] object.size:  91.9 Mb

Disappointing, random forest behave worse than other models. In addition, there seems to be some overfitting. The mode is optimized at mtry=16.
Could it be due to the difference of oob and 4-fold CV?

Model 5b: try random forest with 4-fold CV

# random forest
set.seed(123)
tc <- trainControl("cv",number=4)
Grid <- expand.grid(mtry = seq(4,16,4))
formula <- lgcount ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + wday + month + year
fit.rf.cv <- train(formula, data=subTrain , method='rf', trControl=tc,tuneGrid=Grid,metric='RMSE')

##   mtry      RMSE  Rsquared      RMSESD  RsquaredSD
## 1    4 0.6070140 0.8597777 0.012837362 0.005337229
## 2    8 0.5076205 0.8808282 0.010867871 0.005845244
## 3   12 0.4881242 0.8852147 0.008540236 0.006206083
## 4   16 0.4788093 0.8878842 0.009833839 0.006777247

##        Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## rf.cv   0.4685932  73.94653   0.2325217   38.51168
## rf.oob  0.4718695  74.40514   0.2334819   38.81704

All right. No surprise. They are similar.

A quick table of all model performances

  show(compare)

##             Test.rmsle Test.rmse Train.RMSLE Train.RMSE
## ctree.party  0.4177777  55.68874   0.3658089   46.97015
## ctree        0.5436062  84.36475   0.5015594   77.92729
## ctree2.CV    0.5140868  77.55127   0.4329095   65.68954
## rpart.CV     0.5325579  79.97178   0.4049704   61.81217
## rpart2.CV    0.7847154 148.60228   0.7871033  145.48955
## gbm.CV       0.3145828  43.72667   0.1658346   29.22251
## gamboost.CV  0.6140908  98.05113   0.6068851   96.41029
## treebag      0.7499689 147.18809   0.7484233  143.12107
## bagEarth.CV  0.4242703  69.00662   0.4115398   68.01360
## rf.oob       0.5011784  84.06601   0.2918352   52.45717
## rf.cv        0.4685932  73.94653   0.2325217   38.51168

Submission of the Prediction

Let’s try submiting the predictions of gbm.cv, rf.cv, ctree.party, bagEarth.cv.

#run model against test data set
predict.rf <- predict(fit.rf.cv, test_factor)
predict.rf <- exp(predict.rf) - 1 
#build a dataframe with our results
submit.rf <- data.frame(datetime = test$datetime, count=predict.rf)
#write results to .csv for submission
write.csv(submit.rf, file="submit_rf_v1.csv",row.names=FALSE,quote=FALSE)

#run model against test data set
predict.gbm <- predict(fit.gbm, test_factor)
predict.gbm <- exp(predict.gbm) - 1 
#build a dataframe with our results
submit.gbm <- data.frame(datetime = test$datetime, count=predict.gbm)
#write results to .csv for submission
write.csv(submit.gbm, file="submit_gbm_v1.csv",row.names=FALSE,quote=FALSE)

#bagEarth
#run model against test data set
predict.bagEarth <- predict(fit.bagEarth, test_factor)
predict.bagEarth <- exp(predict.bagEarth) - 1 
#build a dataframe with our results
submit.bagEarth <- data.frame(datetime = test$datetime, count=predict.bagEarth)
#write results to .csv for submission
write.csv(submit.bagEarth, file="submit_bagEarth_v1.csv",row.names=FALSE,quote=FALSE)

#ctree.party
#run model against test data set
predict.ctree <- predict(fit.ctree.party, test_factor)
colnames(predict.ctree) = 'count'
predict.ctree <- exp(predict.ctree) - 1 
#build a dataframe with our results
submit.ctree <- data.frame(datetime = test$datetime, count=predict.ctree)
#write results to .csv for submission
write.csv(submit.ctree, file="submit_ctree_v1.csv",row.names=FALSE,quote=FALSE)

Models	scores	Test.rmsle
rf.cv	0.517	0.469
gbm.cv	0.402	0.315
bagEarth.cv	0.451	0.424
ctree.party	0.524	0.417

Good news is ctree.party is the worst prediction. The effort is not in vain.
My Testing.rmsle underestimates the rmsle, although at least it reflects the order of the score. Well, it is perhaps expected, since the left-out testing sample is more similar to the other 80% of training sample than the testing sample.
The best score is the gbm.cv, which ends at rank ~ 92/1823. Within top 5%, not bad for the first competition.
The strange thing is rf results bad. I probably did something stupid.

Summary

caret provides a nice integrated user interface of the large model set. I won’t be able to try so many different models in these few days without caret. However, we may need to go back to the original package for further tunning, unless if there are some mysterious tricks I don’t know.
Some models are tremendously storage-consuming, which may cause slow calculation.
I need to consider more carefully, whether the parameters (especially multi-leveled parameters like hours) should be leveled or not.
My random forest model must be overfitted. I will fix it in the next run.

To Be Continued …

I will use more random forest, linear regression, and radial-kernal regression in the next file: bikeshare-demand.2.Rmd.