---
title: "A Horse Race of Carets"
#output: rmarkdown::github_document
output:
  html_notebook: default
  pdf_document: default
author: Majeed Simaan
date: '`r format(Sys.Date(), "%Y-%B-%d")`'
fig_width: 50
---


# Overview
In this vignette, I compare the statistical performance of four different machine learning (ML) algorithms using the `caret` library, focusing on the out-of-sample predictive power of each. To demonstrate this, I refer to the `Hitters` data, which is commonly used in ML illustrations (see, e.g., James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An Introduction to Statistical Learning* (Vol. 112). New York: Springer).


# Qualitative Approach
I consider a number of libraries to conduct this investigation:
```{r,warning=F,message=F}
library(xgboost)
library(caret)
library(ISLR)
library(plyr)
library(ggplot2)
```
I focus mainly on the toy dataset named `Hitters` from the `ISLR` package.
```{r}
ds <- Hitters
ds <- na.omit(ds)
head(ds)
```
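
After dropping rows with missing values (the `Salary` variable carries the missing entries), the cleaned sample contains 263 players:
```{r}
# sample size after na.omit: 263 players and 20 variables
dim(ds)
```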

Similar to James et al. (2013), I try to explain (for now) the salary of the player using two variables, `Hits` and `Years`. The response variable is given in thousands of dollars. Since salary is a price-level variable bounded below by zero, it is unlikely to exhibit a bell-shaped distribution:
```{r,fig.align = "center"}
hist(ds$Salary)
```
On the other hand, if we consider the natural logarithm of the salary, the distribution becomes far more symmetric:
```{r}
ds$Salary <- log(ds$Salary)
hist(ds$Salary)
```
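
As a rough numerical check, we can compare the sample skewness before and after the transformation. The `skew` helper below is defined here for illustration (it is not part of any loaded package):
```{r}
# third-moment sample skewness; positive values indicate right skew
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)
# ds$Salary is already on the log scale, so exp() recovers the raw salaries
c(raw = skew(exp(ds$Salary)), log = skew(ds$Salary))
```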

## Perspective on the Data
Before we start torturing the data, let's take a descriptive approach to identify the relationship between the feature space and the response variable. Since the feature space is low-dimensional, this does not constitute a cumbersome task. To motivate this, let's plot the number of hits against the number of years. At the same time, let's highlight the observations that correspond to the players in the highest and lowest salary quartiles.
```{r,fig.align = "center"}
plot(Hits~Years,data = ds, pch = 20, cex = 0.75)
points(Hits~Years,data = ds[ds$Salary > quantile(ds$Salary,0.75),],col = 3)
points(Hits~Years,data = ds[ds$Salary < quantile(ds$Salary,0.25),],col = 2)

```
The green dots indicate salaries ranked in the top quartile of the sample, while the red ones denote those in the bottom quartile; the remaining points correspond to salaries in between. Intuitively, compensation should reflect performance, approximated by the number of hits, along with the number of years as a proxy for experience. We observe that the more (less) compensated baseball players are located in the top-right (bottom-left) of the plot above. While this is roughly the case, we also observe a few violations of this rationale.

We know that real-life relationships do not exhibit linearity per se. If the data were fully linear, then a simple linear regression model should capture the heterogeneity in the data very well, such that an increase along each dimension implies an increase in compensation. However, as I demonstrate later in this vignette, real-life relationships do not necessarily conform to such linearity.


# Torturing Data and Machine Learning
The previous section provides a simple demonstration of the data and how one can qualitatively assess its structure using simple descriptive statistics. In this section, I take a more data-driven approach to uncover how moving from linearity to non-linearity improves the predictive power of the model. In particular, I deploy a linear regression with an elastic net penalty, support vector machines (SVM) with linear and radial kernels, and extreme gradient boosting (XGB) for decision trees.

## Training and Testing
Our investigation deploys a number of machine learning models, relying mainly on the `caret` package to do so. The performance of each model is established using training and testing samples.

To test the predictive power of each model, we split the sample into 80-20 training and testing sub-samples.
```{r}
set.seed(13)
index <- 1:nrow(ds)
index_train <-sample(index,round(0.8*nrow(ds)) )
index_test <- index[-index_train]
```
To double-check that both samples are mutually exclusive, we use the `intersect` command to confirm that there are no overlapping observations:
```{r}
intersect(index_test,index_train)
```
Hence, we have two sub-samples:
```{r}
ds_train <- ds[index_train,]
ds_test <- ds[index_test,]
```
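
As an aside, `caret` ships its own splitting helper, `createDataPartition`, which additionally stratifies the split on quantiles of the response. A minimal sketch of the equivalent call (not evaluated here):
```{r,eval=FALSE}
# stratified 80-20 split on the response using caret's helper
index_train2 <- createDataPartition(ds$Salary, p = 0.8, list = FALSE)
```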
For the feature space, we use two variables, the number of hits and the years of experience, whereas the response variable is the log-salary:
```{r}
x_var <- c("Hits","Years")
y_var <- "Salary"
```

The following commands run each ML algorithm by simply defining the name of the model, the training set, the testing set, the names of the regressors, and the name of the response variable. For each defined model, we run a loop with 10-fold cross-validation.
```{r,message=F,warning=F}
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1, allowParallel = TRUE)
model_list <- c("glmnet", "svmLinear", "svmRadial", "xgbTree")
predict_y <- numeric()
model_formula <- formula(paste(y_var, "~", paste(x_var, collapse = " + ")))

for (model_i in model_list) {
  set.seed(13)
  if (model_i == "xgbTree") {
    # no trControl here: caret falls back on its default resampling (see the note below)
    train_model <- train(model_formula, data = ds_train[, c(y_var, x_var)], method = model_i,
                         preProcess = c("center", "scale"))
  } else {
    train_model <- train(model_formula, data = ds_train[, c(y_var, x_var)], method = model_i,
                         trControl = trctrl,
                         preProcess = c("center", "scale"),
                         tuneLength = 10)
  }
  # out-of-sample predictions on the held-out test set
  y_hat_i <- predict(train_model, ds_test[, x_var])
  predict_y <- cbind(predict_y, y_hat_i)
}

colnames(predict_y) <- model_list
predict_y <- data.frame(predict_y, Actual = ds_test[, y_var])
```
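For any fitted `caret` model, the hyper-parameters selected by resampling are stored in the `bestTune` element. As a quick illustration, here they are for the last model fitted in the loop above (`xgbTree`):
```{r}
# tuning parameters chosen for the last model trained in the loop (xgbTree)
train_model$bestTune
```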
Note that for `xgbTree` we omit `trControl`, in which case `train` falls back on its default resampling scheme (25 bootstrap replicates) rather than the repeated cross-validation used for the other models; this keeps the cost of tuning the large boosting grid manageable.
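
Being explicit about that default, the omitted control object is equivalent to the following sketch (not evaluated):
```{r,eval=FALSE}
# caret's default resampling when trControl is omitted
trctrl_default <- trainControl(method = "boot", number = 25)
```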

To assess the performance of each model, we regress its fitted values on the actual values and report the adjusted $R^2$:
```{r}
# regress each model's predictions on the actual values; adjusted R^2 measures fit
lm.list <- lapply(seq_along(model_list), function(i) lm(predict_y[, i] ~ predict_y[, "Actual"]))
R_sq <- sapply(lm.list, function(x) summary(x)$adj.r.squared)
names(R_sq) <- model_list
R_sq
```
We observe that the non-linear models outperform the linear ones. Comparing the non-linear SVM with `xgbTree`, we note that the latter returns a higher $R^2$. However, to gain greater confidence in this conclusion, we need to repeat the experiment using a bootstrap approach.
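
Before doing so, note that adjusted $R^2$ is only one way to score the predictions. As a sanity check, we can also compare out-of-sample root mean squared errors using `caret`'s `RMSE` helper (lower is better):
```{r}
# out-of-sample RMSE for each model; RMSE() takes predictions first, then observations
sapply(model_list, function(m) RMSE(predict_y[, m], predict_y[, "Actual"]))
```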

## Bootstrap Performance
To gain more confidence in the above evidence, we repeat the same investigation 100 times. For each trial, we split the data randomly into training and testing sets using a distinct seed, and that same seed also governs the cross-validation used for tuning.

Let's stack all of the above in a single function:
```{r}
my_ml_function <- function(x_var, y_var, n) {
  # seed the trial up front so the train/test split differs across trials
  set.seed(n)
  index <- 1:nrow(ds)
  index_train <- sample(index, round(0.8 * nrow(ds)))
  index_test <- index[-index_train]
  ds_train <- ds[index_train, ]
  ds_test <- ds[index_test, ]
  trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
  predict_y <- numeric()
  model_formula <- formula(paste(y_var, "~", paste(x_var, collapse = " + ")))
  
  for (model_i in model_list) {
    # re-seed per model so the resampling folds are reproducible within a trial
    set.seed(n)
    if (model_i == "xgbTree") {
      train_model <- train(model_formula, data = ds_train[, c(y_var, x_var)], method = model_i,
                           preProcess = c("center", "scale"))
    } else {
      train_model <- train(model_formula, data = ds_train[, c(y_var, x_var)], method = model_i,
                           trControl = trctrl,
                           preProcess = c("center", "scale"),
                           tuneLength = 10)
    }
    y_hat_i <- predict(train_model, ds_test[, x_var])
    predict_y <- cbind(predict_y, y_hat_i)
  }
  
  colnames(predict_y) <- model_list
  predict_y <- data.frame(predict_y, Actual = ds_test[, y_var])
  lm.list <- lapply(seq_along(model_list), function(i) lm(predict_y[, i] ~ predict_y[, "Actual"]))
  R_sq <- sapply(lm.list, function(x) summary(x)$adj.r.squared)
  names(R_sq) <- model_list
  return(R_sq)
}

```


Given the above function, we can run 100 repetitions to compute the distribution of the $R^2$ resulting from each model: 
```{r,warning=F,message=F}
library(parallel)
run_f <- function(n) my_ml_function(x_var, y_var, n)
# mclapply relies on forking, so on Windows it falls back to a single core
R_list <- mclapply(1:100, run_f)
```


To determine the best fit among all models, we stack the results into a single data frame:
```{r}
# bind the 100 named R^2 vectors into a 100 x 4 data frame
ds_R <- data.frame(do.call(rbind, R_list))
names(ds_R) <- model_list
summary(ds_R)
```

We observe that the SVM with the non-linear kernel and `xgbTree` perform the best in terms of average/median $R^2$, implying low bias in the performance of the non-linear models. At the same time, we also note a lower variance associated with both non-linear models compared to the linear ones. This is reflected in the standard deviation of the $R^2$ from each model:
```{r}
apply(ds_R,2,sd)
```
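Since all four models are evaluated on the same 100 splits, we can also compare them pairwise. For instance, here is the share of trials in which `xgbTree` beats the radial SVM (a small check added for illustration):
```{r}
# fraction of the 100 trials in which xgbTree attains a higher R^2 than svmRadial
mean(ds_R$xgbTree > ds_R$svmRadial)
```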
To visualize these insights, we refer to the violin plot from `ggplot2`:
```{r,message=F,warning=F,fig.align="center"}
# reshape the wide data frame into long format for ggplot
ds.plot <- lapply(seq_along(ds_R), function(i) data.frame(R = ds_R[, i], Model = names(ds_R)[i]))
ds.plot <- ldply(ds.plot, data.frame)

p <- ggplot(ds.plot, aes(x = factor(Model), y = R)) +
  geom_violin() +
  geom_jitter(aes(colour = Model), size = 1, height = 0, width = 0.1) +
  geom_hline(yintercept = median(apply(ds_R, 2, median)), linetype = "dashed") +
  labs(x = "Model", y = "R Squared")
print(p)
```
From the above plot, we observe that the SVM with the radial kernel performs the best, exhibiting the least negative skewness in its $R^2$ distribution.
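
We can back this visual impression numerically with the `skew` helper defined earlier in this vignette:
```{r}
# sample skewness of the R^2 distribution for each model (more negative = longer left tail)
apply(ds_R, 2, skew)
```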

# Conclusion
We ran a horse race among four different algorithms using the `caret` package. While we used a simple toy dataset, we observed a significant improvement in the predictive power of the non-linear models over the linear ones. Consistent with the sentiment among ML users, `xgbTree` seems to be a very strong candidate in terms of improved bias and variance. Nonetheless, the major concern with such a model is computational efficiency, especially when one needs to retrain it on a recurring basis. It remains unclear whether it provides the upper hand over the SVM with a radial kernel, which resembles an artificial neural network with a single hidden layer.


