Comparing R caret models in action … and in practice: does model accuracy always matter more than scalability? And how much of this is about models rather than implementations?

Contents

  1. Introduction
  2. Data set and quick exploratory analysis
  3. Fitting models
  4. Measuring performances
  5. Accuracy vs scalability
  6. Models vs implementations
  7. References

Introduction

Here we’ll compare regression models in action from the caret R package.

The solubility data can be obtained from the AppliedPredictiveModeling R package. A few notes on the setup:

  • Models taking more than 15 minutes to fit on the training set have been discarded.
  • Machine: MacBook Pro, 2.4 GHz Intel Core i7 processor, 8 GB 1600 MHz DDR3 memory
  • Accuracy measure: RMSE (Root Mean Squared Error), as sketched below
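
For reference, RMSE is the square root of the mean squared difference between observed and predicted values. caret’s RMSE() function, used throughout this post, computes exactly this one-liner:

# RMSE: square root of the mean squared residual (equivalent to caret::RMSE)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))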

Data set & quick explanotory analysis

Predictors for the training and test sets are stored in solTrainX and solTestX, respectively.

# solubility data from the AppliedPredictiveModeling package
library(caret)
library(AppliedPredictiveModeling)
data(solubility)
ls()
## [1] "metadata"       "solTestX"       "solTestXtrans"  "solTestY"      
## [5] "solTrainX"      "solTrainXtrans" "solTrainY"

Each column of the data corresponds to a predictor (i.e., chemical descriptor) and the rows correspond to compounds. There are 228 columns in the data.
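
A quick sanity check on the dimensions (951 compounds, as the describe() output below also confirms, by 228 predictors):

# rows are compounds, columns are chemical descriptors
dim(solTrainX)
## [1] 951 228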

The following code produces a concise statistical description of a random sample of 16 predictors.

ps <- sample(ncol(solTrainX), 16)  # pick 16 predictor columns at random
library(Hmisc)
describe(solTrainX[,ps])
## solTrainX[, ps] 
## 
##  16  Variables      951  Observations
## ---------------------------------------------------------------------------
## FP205 
##       n missing  unique     Sum    Mean 
##     951       0       2      74 0.07781 
## ---------------------------------------------------------------------------
## FP027 
##       n missing  unique     Sum    Mean 
##     951       0       2      93 0.09779 
## ---------------------------------------------------------------------------
## FP083 
##       n missing  unique     Sum    Mean 
##     951       0       2     260  0.2734 
## ---------------------------------------------------------------------------
## FP064 
##       n missing  unique     Sum    Mean 
##     951       0       2     396  0.4164 
## ---------------------------------------------------------------------------
## FP069 
##       n missing  unique     Sum    Mean 
##     951       0       2     344  0.3617 
## ---------------------------------------------------------------------------
## FP022 
##       n missing  unique     Sum    Mean 
##     951       0       2      99  0.1041 
## ---------------------------------------------------------------------------
## FP175 
##       n missing  unique     Sum    Mean 
##     951       0       2     128  0.1346 
## ---------------------------------------------------------------------------
## FP012 
##       n missing  unique     Sum    Mean 
##     951       0       2     168  0.1767 
## ---------------------------------------------------------------------------
## FP167 
##       n missing  unique     Sum    Mean 
##     951       0       2     312  0.3281 
## ---------------------------------------------------------------------------
## FP098 
##       n missing  unique     Sum    Mean 
##     951       0       2     226  0.2376 
## ---------------------------------------------------------------------------
## FP078 
##       n missing  unique     Sum    Mean 
##     951       0       2     289  0.3039 
## ---------------------------------------------------------------------------
## FP025 
##       n missing  unique     Sum    Mean 
##     951       0       2     110  0.1157 
## ---------------------------------------------------------------------------
## FP014 
##       n missing  unique     Sum    Mean 
##     951       0       2     153  0.1609 
## ---------------------------------------------------------------------------
## FP070 
##       n missing  unique     Sum    Mean 
##     951       0       2     338  0.3554 
## ---------------------------------------------------------------------------
## FP163 
##       n missing  unique     Sum    Mean 
##     951       0       2     453  0.4763 
## ---------------------------------------------------------------------------
## FP200 
##       n missing  unique     Sum    Mean 
##     951       0       2      47 0.04942 
## ---------------------------------------------------------------------------

Scatter plots of the sampled predictors versus the response variable (solubility):

featurePlot(x = solTrainX[,ps], y = solTrainY,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth")) 

(Figure: scatter plots of the sampled predictors versus solubility)

The “FP” columns correspond to binary 0/1 fingerprint predictors associated with the presence or absence of a particular chemical substructure. Alternate versions of these data that have been Box–Cox transformed are contained in the data frames solTrainXtrans and solTestXtrans; these modified versions are used alongside the raw predictors in the analyses below. The solubility values for each compound are contained in the numeric vectors solTrainY and solTestY.
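
As a quick check (a small sketch, not part of the original analysis), the fingerprint columns can be identified by their “FP” name prefix and verified to be binary:

# identify fingerprint predictors by name and confirm they are 0/1 coded
fp <- grep("^FP", names(solTrainX), value = TRUE)
length(fp)
all(sapply(solTrainX[fp], function(x) all(x %in% c(0, 1))))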

Fitting models

Each model uses repeated 10-fold cross-validation (10 folds, 5 repeats), specified with the trainControl function.

controlObject <- trainControl(method = "repeatedcv", repeats = 5, number = 10)

The following helper function is used to predict and assess model performance.

predictAndMeasure <- function(model, model.label, trainingData, ytrain, testData, ytest, tm, grid = NULL, verbose = FALSE) {
  # RMSE on the training set
  RMSE.train <- RMSE(pred = predict(model, trainingData), obs = ytrain)
  # RMSE on the held-out test set
  RMSE.test <- RMSE(pred = predict(model, testData), obs = ytest)

  if (verbose) cat("****** RMSE(train) =", RMSE.train, " -  RMSE(test) =", RMSE.test, "  --  Time elapsed (sec.):", tm[[3]], "...\n")

  # append one row per model; rbind(NULL, row) simply returns the row
  rbind(grid, data.frame(predictor = model.label, RMSE.train = RMSE.train,
                         RMSE.test = RMSE.test, time = tm[[3]]))
}

The following code fits each model, with and without Box–Cox transformed predictors, recording the elapsed fitting time.

verbose = F 
trainingData = solTrainX
trainingData$Solubility = solTrainY

testData = solTestX
#testData$Solubility = solTestY

trainingData.trans = solTrainXtrans
trainingData.trans$Solubility = solTrainY

testData.trans = solTestXtrans
#testData.trans$Solubility = solTestY

######################################################## linear regression 
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] linear regression ...  \n")
set.seed(669); ptm <- proc.time()
linearReg <- train(Solubility ~  . , data = trainingData, method = "lm", trControl = controlObject) 
if (verbose) linearReg
tm = proc.time() - ptm
perf.grid = predictAndMeasure (linearReg,"Linear Reg",trainingData,solTrainY,testData,solTestY,tm , grid = NULL , verbose)

if (verbose) cat("****** [WITH TRANSFORMATIONS] linear regression ...  \n")
set.seed(669); ptm <- proc.time()
linearReg.trans <- train(Solubility ~  . , data = trainingData.trans, method = "lm", trControl = controlObject) 
if (verbose) linearReg.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (linearReg.trans,"Linear Reg (Trans)",trainingData.trans,solTrainY,testData.trans,solTestY,tm, grid = perf.grid)

######################################################## Elastic Net
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] Elastic Net ...  \n")
set.seed(669); ptm <- proc.time()
enetGrid <- expand.grid(.lambda = c(0, .001, .01, .1), .fraction = seq(0.05, 1, length = 20))
enetModel <- train(Solubility ~ . , data = trainingData , method = "enet", preProc = c("center", "scale"), tuneGrid = enetGrid, trControl = controlObject)
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
if (verbose) enetModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (enetModel,"Elastic Net",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] Elastic Net ...  \n")
set.seed(669); ptm <- proc.time()
enetModel.trans <- train(Solubility ~ . , data = trainingData.trans , method = "enet", preProc = c("center", "scale"), tuneGrid = enetGrid, trControl = controlObject)
if (verbose) enetModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (enetModel.trans,"Elastic Net (Trans)",trainingData.trans,solTrainY,testData.trans,solTestY,tm, grid = perf.grid , verbose )

######################################################## Partial Least Squares
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] Partial Least Squares ...  \n")
set.seed(669); ptm <- proc.time()
plsModel <- train(Solubility ~ . , data = trainingData , method = "pls", preProc = c("center", "scale"), tuneLength = 15, trControl = controlObject)
if (verbose) plsModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (plsModel,"PLS",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] Partial Least Squares ...  \n")
set.seed(669); ptm <- proc.time()
plsModel.trans <- train(Solubility ~ . , data = trainingData.trans , method = "pls", preProc = c("center", "scale"), tuneLength = 15, trControl = controlObject)
if (verbose) plsModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (plsModel.trans,"PLS (Trans)",trainingData.trans,solTrainY,testData.trans,solTestY,tm, grid = perf.grid , verbose )

######################################################## Support Vector Machines 
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] Support Vector Machines ...  \n")
set.seed(669); ptm <- proc.time()
svmRModel <- train(Solubility ~ . , data = trainingData, method = "svmRadial",
                   tuneLength = 15, preProc = c("center", "scale"),  trControl = controlObject)
## Loading required package: kernlab
if (verbose) svmRModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (svmRModel,"SVM",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] Support Vector Machines ...  \n")
set.seed(669); ptm <- proc.time()
svmRModel.trans <- train(Solubility ~ . , data = trainingData.trans, method = "svmRadial",
                   tuneLength = 15, preProc = c("center", "scale"),  trControl = controlObject)
if (verbose) svmRModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (svmRModel.trans,"SVM (Trans)",trainingData.trans,solTrainY,testData.trans,solTestY,tm, grid = perf.grid , verbose )

######################################################## Bagged Tree
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] Bagged Tree ...  \n")
set.seed(669); ptm <- proc.time()
treebagModel <- train(Solubility ~ . , data = trainingData, method = "treebag", trControl = controlObject)
if (verbose) treebagModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (treebagModel,"Bagged Tree",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] Bagged Tree ...  \n")
set.seed(669); ptm <- proc.time()
treebagModel.trans <- train(Solubility ~ . , data = trainingData.trans, method = "treebag", trControl = controlObject)

if (verbose) treebagModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (treebagModel.trans,"Bagged Tree (Trans)",trainingData.trans,solTrainY,
                               testData.trans,solTestY,tm, grid = perf.grid , verbose )

######################################################## Cond Inf Tree
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] Cond Inf Tree ...  \n")
set.seed(669); ptm <- proc.time()
ctreeModel <- train(Solubility ~ . , data = trainingData , method = "ctree", tuneLength = 10, trControl = controlObject)
if (verbose) ctreeModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (ctreeModel,"Cond Inf Tree",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] Cond Inf Tree ...  \n")
set.seed(669); ptm <- proc.time()
ctreeModel.trans <- train(Solubility ~ . , data = trainingData.trans , method = "ctree", tuneLength = 10, trControl = controlObject)

if (verbose) ctreeModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (ctreeModel.trans,"Cond Inf Tree (Trans)",trainingData.trans,solTrainY,
                               testData.trans,solTestY,tm, grid = perf.grid , verbose )

######################################################## CART
if (verbose) cat("****** [WITHOUT TRANSFORMATIONS] CART ...  \n")
set.seed(669); ptm <- proc.time()
rpartModel <- train(Solubility ~ . , data = trainingData , method = "rpart", tuneLength = 30, trControl = controlObject)
## Loading required package: rpart
if (verbose) rpartModel
tm = proc.time() - ptm
perf.grid = predictAndMeasure (rpartModel,"CART",trainingData,solTrainY,testData,solTestY,tm , grid = perf.grid , verbose )

if (verbose) cat("****** [WITH TRANSFORMATIONS] CART ...  \n")
set.seed(669); ptm <- proc.time()
rpartModel.trans <- train(Solubility ~ . , data = trainingData.trans , method = "rpart", tuneLength = 30, trControl = controlObject)

if (verbose) rpartModel.trans
tm = proc.time() - ptm
perf.grid = predictAndMeasure (rpartModel.trans,"CART (Trans)",trainingData.trans,solTrainY,
                               testData.trans,solTestY,tm, grid = perf.grid , verbose )

Measuring performances

The same cross-validation folds were used for each model. The following figures show parallel-coordinate plots of the resampling results across the models, first for RMSE and then for Rsquared. Each line corresponds to a common cross-validation holdout.

allResamples <- resamples(list("Linear Reg" = linearReg, "Linear Reg (Trans)" = linearReg.trans, 
                               "SVM" = svmRModel , "SVM (Trans)" = svmRModel.trans , 
                               "PLS" = plsModel , "PLS (Trans)" = plsModel.trans , 
                               "Elastic Net" = enetModel , "Elastic Net (Trans)" = enetModel.trans , 
                               "Bagged Tree" = treebagModel , "Bagged Tree (Trans)" = treebagModel.trans , 
                               "Cond Inf Tree" = ctreeModel , "Cond Inf Tree (Trans)" = ctreeModel.trans , 
                               "CART" = rpartModel , "CART (Trans)" = rpartModel.trans
                               ))
parallelplot(allResamples)

(Figure: parallel-coordinate plot of resampled RMSE across models)

parallelplot(allResamples , metric = "Rsquared")

(Figure: parallel-coordinate plot of resampled Rsquared across models)

From this, the top-performing models are the Support Vector Machines, with and without Box–Cox transformations. Linear Regression, Partial Least Squares, and Elastic Net, with and without Box–Cox transformations, are middle performers. Bagged Trees, Conditional Inference Trees, and CART show modest results.

The SVM with Box–Cox transformations achieves a test-set RMSE of 0.60797, while the one without transformations achieves 0.61259.

perf.grid[order(perf.grid$RMSE.test, decreasing=F),]
##                predictor RMSE.train RMSE.test    time
## 8            SVM (Trans)     0.1974    0.6080  424.03
## 7                    SVM     0.2101    0.6126  429.61
## 4    Elastic Net (Trans)     0.5497    0.7041  510.13
## 6            PLS (Trans)     0.5543    0.7200   12.90
## 3            Elastic Net     0.5394    0.7415  473.14
## 2     Linear Reg (Trans)     0.4813    0.7456   10.17
## 5                    PLS     0.5327    0.7646   14.46
## 1             Linear Reg     0.4875    0.7969   15.20
## 10   Bagged Tree (Trans)     0.8412    0.8382  436.69
## 9            Bagged Tree     0.8425    0.8387  444.39
## 13                  CART     0.7454    0.8533   75.39
## 14          CART (Trans)     0.7454    0.8533   84.50
## 12 Cond Inf Tree (Trans)     0.6867    0.9799 1352.04
## 11         Cond Inf Tree     0.7149    1.0340 1260.13

Plotting observed values against predicted values helps us understand how well the model fits. A scatter plot of residuals against fitted values helps detect potential heteroscedasticity (when the error variance is not constant across observations, the error is said to be heteroscedastic).

SVMs

predicted.SVM.trans = predict(svmRModel.trans , solTestXtrans) 
residualValues.SVM <- solTestY - predicted.SVM.trans
summary(residualValues.SVM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8700 -0.3450  0.0214  0.0017  0.3010  2.3800
sd(residualValues.SVM)
## [1] 0.6089
# Observed values versus predicted values
axisRange <- extendrange(c(solTestY, predicted.SVM.trans))
plot(solTestY, predicted.SVM.trans, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

(Figure: observed solubility versus SVM (Trans) predictions on the test set)

# Predicted values versus residuals
plot(predicted.SVM.trans, residualValues.SVM, ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

(Figure: SVM (Trans) predicted values versus residuals)

Linear Regression

predicted.lin_reg.trans = predict(linearReg.trans, solTestXtrans) 
residualValues.reg_lin <- solTestY - predicted.lin_reg.trans
summary(residualValues.reg_lin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.760  -0.427   0.017  -0.005   0.395   3.280
sd(residualValues.reg_lin)
## [1] 0.7467
# Observed values versus predicted values
axisRange <- extendrange(c(solTestY, predicted.lin_reg.trans))
plot(solTestY, predicted.lin_reg.trans, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

(Figure: observed solubility versus Linear Reg (Trans) predictions on the test set)

# Predicted values versus residuals
plot(predicted.lin_reg.trans, residualValues.reg_lin, ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

(Figure: Linear Reg (Trans) predicted values versus residuals)

Accuracy vs scalability

New application scenarios are arising where model scalability matters more than model accuracy.

Consider, for instance, online learning applications. Take a shipping service website where a user specifies an origin and a destination, you offer to ship the package for some asking price, and users sometimes choose your shipping service (y = 1) and sometimes do not (y = 0). The features x capture properties of the user, of the origin/destination, and of the asking price. We want to learn p(y = 1 | x) to optimize the price.

As a consequence, new kinds of algorithms have been gaining popularity, such as stochastic gradient descent, mini-batch gradient descent, and map-reduce batch gradient descent; a small sketch of the first follows.
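
Here is a minimal sketch (not from the original analysis) of stochastic gradient descent for logistic regression, i.e. learning p(y = 1 | x) one example at a time; X and y stand for a hypothetical feature matrix and 0/1 response vector:

# one-example-at-a-time SGD for logistic regression
sgd_logistic <- function(X, y, alpha = 0.01, epochs = 1) {
  theta <- rep(0, ncol(X))                           # start from the zero vector
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {                     # shuffle the visiting order
      p <- 1 / (1 + exp(-sum(X[i, ] * theta)))       # current estimate of p(y = 1 | x_i)
      theta <- theta + alpha * (y[i] - p) * X[i, ]   # gradient step on a single example
    }
  }
  theta
}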

Here I focus only on the time a model needs to fit on the training set as a proxy for scalability. Clearly, the longer a model takes to fit, the lower the throughput on a cluster node of my architecture. Memory and CPU consumption should also be assessed; a possible starting point is sketched below.
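
As a starting point for that, here is a minimal sketch (base R only, not part of the measurements above) of how fit time and memory footprint could be recorded together:

# a rough scalability probe: wall-clock fit time plus size of the fitted object
tm <- system.time(
  fit <- train(Solubility ~ ., data = trainingData.trans, method = "pls",
               tuneLength = 15, trControl = controlObject)
)
tm[["elapsed"]]                           # seconds to fit
format(object.size(fit), units = "MB")    # memory held by the fitted model object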

Coming back to our models, the SVMs need more than 7 minutes to fit on the training set, while PLS with Box–Cox transformations needs only about 13 seconds, with a test-set RMSE of 0.72. The fastest model to fit on the training data is Linear Regression with Box–Cox transformations (10.17 seconds).

perf.grid[order(perf.grid$time, decreasing=F),]
##                predictor RMSE.train RMSE.test    time
## 2     Linear Reg (Trans)     0.4813    0.7456   10.17
## 6            PLS (Trans)     0.5543    0.7200   12.90
## 5                    PLS     0.5327    0.7646   14.46
## 1             Linear Reg     0.4875    0.7969   15.20
## 13                  CART     0.7454    0.8533   75.39
## 14          CART (Trans)     0.7454    0.8533   84.50
## 8            SVM (Trans)     0.1974    0.6080  424.03
## 7                    SVM     0.2101    0.6126  429.61
## 10   Bagged Tree (Trans)     0.8412    0.8382  436.69
## 9            Bagged Tree     0.8425    0.8387  444.39
## 3            Elastic Net     0.5394    0.7415  473.14
## 4    Elastic Net (Trans)     0.5497    0.7041  510.13
## 11         Cond Inf Tree     0.7149    1.0340 1260.13
## 12 Cond Inf Tree (Trans)     0.6867    0.9799 1352.04

Models vs implementations

The problem as posed in the previous section looks like a zero-sum game: as with bias and variance, there seems to be a clear trade-off between accuracy and scalability. On the other hand, continuing the metaphor, just as in machine learning problems one should check that there is no additional noise beyond bias, variance, and irreducible error, here we need to check whether the loss of scalability for the top-performing models is intrinsically bound to the problem or merely to the implementation.

For instance, a follow-up will show how to improve the RMSE of these regressors with an Octave-based model. Similarly, it is possible to build a nu-SVR based model that improves on caret’s SVM RMSE while fitting on the training set in less than a minute (a rough sketch of such a fit follows) … stay tuned
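
As a teaser, here is a minimal sketch of what such a fit could look like, using the svm function from the e1071 package (a libsvm wrapper) instead of going through caret; the nu and cost values are illustrative assumptions, not tuned results:

# nu-SVR fitted directly with e1071 (libsvm), bypassing caret's tuning loop
library(e1071)
nusvr <- svm(x = as.matrix(solTrainXtrans), y = solTrainY,
             type = "nu-regression", kernel = "radial",
             nu = 0.5, cost = 2)
RMSE(pred = predict(nusvr, as.matrix(solTestXtrans)), obs = solTestY)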

References

  1. Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2013