Introduction

While conducting regression analysis, there are a few ways with which we can examine model performance. Model performance involves assessing the resulting model with test data and using standard error metrics to qualify its output results. The basic idea behind model performance is to evaluate the quality of the derived model and to establish if there are possible ways to improve this.

The basic approach to model evaluation starts with splitting the data set into training and test sets, usually in a ratio of 75/25 or 80/20. The split ratio to run with is sometimes dependent on the volume of the data, the number of predictors or other qualifying assumptions. The training set, as it is called, is used to “train” the model, while the eventual output is “tested” with data from the test set. The most common, and typically trivial process, is to simply conduct one-time training with the training set and then test the model with data from the test portion. However, this lacks some rigour and may not generally produce robust metrics. Most practitioners would however conduct a cross-validation instead.

Cross-validation methods measure the performance of a model on new test data sets, with the basic idea involving fitting the same statistical method multiple times to different portions of a data set, while testing on another portion. In achieving cross-validation, attempts are made to ensure that the model does not see the test data during any steps before it is fed into it. The basic approach described above is sometimes also classed under cross-validation, though this can be a little misleading since cross-validation is viewed as a refined strategy which is involved multiple times with the training or test data.

Using the mtcars data set, We will be discussing a few validation techniques and examining some qualifying metrics; and later introduce an additional technique which may or may not complement existing ones.

The basic approach

In the basic approach, we divide the data set into two: one for the training of the model, and another for testing. We can examine the metrics produced by the two different results and use that to qualify the model performance. So, this proceeds simply on one track, but sometimes can have the adverse effect that the modeling is not thorough.

library(broom)
## Warning: package 'broom' was built under R version 3.6.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(caret)
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We will create a regression model by describing mpg as a function of wt and disp.

#set seed value
set.seed(100)
#Create random sampling indices
n_samples <- sample(1:nrow(mtcars), size = floor(0.75*nrow(mtcars)))

#Divide data set into training and test sets
mtcars.train <- mtcars[n_samples,]
mtcars.test <- mtcars[-n_samples,]

#Fit model with training data
lma <- lm(mpg ~ wt + disp, data = mtcars.train)

#Summary of model
summary(lma)
## 
## Call:
## lm(formula = mpg ~ wt + disp, data = mtcars.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.056 -2.018 -1.203  1.376  6.810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.83759    2.60094  12.625 2.84e-11 ***
## wt          -2.54938    1.34830  -1.891   0.0725 .  
## disp        -0.02083    0.01061  -1.964   0.0630 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.874 on 21 degrees of freedom
## Multiple R-squared:  0.7419, Adjusted R-squared:  0.7173 
## F-statistic: 30.18 on 2 and 21 DF,  p-value: 6.67e-07
pred <- predict(lma, newdata=mtcars.train[,c("wt","disp")])
data.frame(R2=R2(pred, mtcars.train$mpg),
           RMSE = RMSE(pred, mtcars.train$mpg),
           MAE = MAE(pred, mtcars.train$mpg))
##          R2     RMSE      MAE
## 1 0.7418812 2.688419 2.269659

The model output from the training data show fairly good results. The R2 value is ~71%, and the model coefficients look appreciably good. We will have to take a look at the residuals plot to gain additional insight into the quality of the model or if there are some outliers. But since that is not the general aim of this article, I’ll skip doing that.

Predicting mpg using this model will be done using the test set data. Since we already have mpg present in that test data, we have the luxury of being able to compare our results with existing (and supposedly correct) ones and could infer if the model is indeed good.

predictions <- mtcars.test %>% select(wt, disp) %>%
    predict(lma, newdata = .)
data.frame(R2 = R2(predictions, mtcars.test$mpg),
           RMSE = RMSE(predictions, mtcars.test$mpg),
           MAE = MAE(predictions, mtcars.test$mpg))
##          R2     RMSE      MAE
## 1 0.8565924 3.210083 2.571201

These are the metrics after comparing the predicted values to the actual values. They seem fairly good. The R2 in particular is an improvement from that obtained with the training data.

We could also use the augment function from the broom package to predict the new values of mpg using the model obtained from the training set.

augment(lma, newdata = mtcars.test[,c("wt","disp", "mpg")])
## # A tibble: 8 x 6
##   .rownames             wt  disp   mpg .fitted .se.fit
##   <chr>              <dbl> <dbl> <dbl>   <dbl>   <dbl>
## 1 Mazda RX4           2.62 160    21     22.8    0.723
## 2 Hornet Sportabout   3.44 360    18.7   16.6    1.29 
## 3 Merc 230            3.15 141.   22.8   21.9    1.03 
## 4 Cadillac Fleetwood  5.25 472    10.4    9.62   1.42 
## 5 Toyota Corolla      1.84  71.1  33.9   26.7    1.10 
## 6 Camaro Z28          3.84 350    13.3   15.8    0.891
## 7 Lotus Europa        1.51  95.1  30.4   27.0    1.42 
## 8 Volvo 142E          2.78 121    21.4   23.2    0.908

Getting a visual assessment of how these points appear with reference to the predicted values may actually help in better informing how well the model did.

augment(lma, newdata = mtcars.test[,c("wt","disp", "mpg")]) %>%
  ggplot(aes(x=1,y=mpg)) + geom_point(size=3) +geom_point(aes(x=2, y=.fitted), color="blue", size=3) + geom_text(aes(x=1, y=mpg, label=rownames(mtcars.test)), size=3, hjust=1.5) + geom_text(aes(x=2, y=.fitted, label=rownames(mtcars.test)), size=3, hjust=-1) + scale_x_continuous(limits=c(0,3))

With the exception of Toyota Corolla and Lotus Europa, the other points from the predictions seem to congregate at fairly the same points as those from the original test data. The output doesn’t seem awfully skewed and can still be considered appreciably good. Perhaps a better modification of the model could follow from careful feature selection since there are a handful more predictors to choose from, but that’s not entirely the objective of this article.

K-fold cross-validation

K-fold cross-validation departs quite significantly from the basic approach by introducing a split of the training set into k samller samples or folds. A test set is oftentimes still required for later evaluation of the model. For each of the k folds the model is trained using k-1 of the folds as training data as above, while the resulting model is validated on the remaining part of the data to obtain performance metrics. The resulting performance measure is the average of all the metrics computed in the k loops conducted. This average is sometimes called the cross-validation error.

k-fold cross-validation is clearly more robust than the basic method earlier discussed. Because it is computational and sees more variation in the data, it often gives a better estimate of the error rate than the basic method or even the leave-one out cross-validation method (LOOCV) which we will be discussing next.

Choosing the value of k could be the hard question though, and presents the ubiquitous bias-variance argument in modeling. A lower value of k could be more biased, though it may have less variance; while a higher value of k could present large variability, even though it may be less biased. Care needs to be taken in choosing this value, and sometimes trial-and-error is permissible.

Using the caret package, a 10-fold cross-validation is conducted to estimate the error rate for the same mtcars data set. This means the training data will be divided into 10 different subsets, with the model built using 9 and tested on the remainder subset. This could be achieved using standard loops in base R, but it is often more handy to use an existing package such as caret.

#Set seed value
set.seed(100)

#Define control parameters
train.control <- trainControl(method = "cv", number = 10)

#Train model
cvlma <- train(mpg ~ wt + disp, data = mtcars.train, method = "lm",
               trControl = train.control)

cvlma
## Linear Regression 
## 
## 24 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 21, 22, 21, 22, 22, 22, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.738461  0.9145291  2.511833
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Compared with the basic method, these results are quite good. The RMSE is lower and the R2 value is higher. Having gone through different portions of the training data, it appears the model has seen enough variability to form a better idea of the data set. the computational This could suggest that our model.

We next predict the outcomes using the derived model cvlma.

cvpredictions <- mtcars.test %>% select(wt, disp) %>%
    predict(cvlma, newdata = .)
data.frame(R2 = R2(cvpredictions, mtcars.test$mpg),
           RMSE = RMSE(cvpredictions, mtcars.test$mpg),
           MAE = MAE(cvpredictions, mtcars.test$mpg))
##          R2     RMSE      MAE
## 1 0.8565924 3.210083 2.571201

The results from the cross-validation output are basically the same as those obtained earlier. This is could be because the data set is very small, and we have trained on the same chunk of data in both instances. Were the data set larger, there could be a lot more variability introduced into the regression output. However, if anything, the results from cvlma show that our model was perhaps much better than lma indicated.

Leave One Out Cross-validation

Leave One Out Cross-validation, also known as LOOCV, is k-fold cross-validation taken to the extreme, so that in a repeated process all the members of the training set are used for building the model and one data point which is left out is used to evaluate the model. The major advantage of LOOCV is that it uses all data points in deriving the model. However, this introduces bias (as discussed under k-fold CV).

Naturally, this would have been an extremely computationally expensive task, but the availability today of better processors make this process quite seamless. Still, execution time could be rather high compared to other validation techniques. In addition, because the model is tested each time using only one data point there could be some variation in the prediction error and the average could be thrown off. There is typically enough variability to ensure that this smoothens out, however.

The caret package offers a method for conducting LOOCV by adjusting the training control parameters.

#set seed
set.seed(100)

#Define training control parameters
train.control <- trainControl(method = "LOOCV")

#Train model
loocvlma <- train(mpg ~ wt + disp, data = mtcars.train, method = "lm",
                  trControl = train.control)

loocvlma
## Linear Regression 
## 
## 24 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 23, 23, 23, 23, 23, 23, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.171228  0.6478849  2.640849
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Though the RMSE and MAE are quite good, the R2 shows a rather poor value, but that also is not unexpected given the large bias in using only one test point per N-1 training points for each iteration of the LOOCV.

loocvpredictions <- mtcars.test %>% select(wt, disp) %>%
    predict(loocvlma, newdata = .)
data.frame(R2 = R2(loocvpredictions, mtcars.test$mpg),
           RMSE = RMSE(loocvpredictions, mtcars.test$mpg),
           MAE = MAE(loocvpredictions, mtcars.test$mpg))
##          R2     RMSE      MAE
## 1 0.8565924 3.210083 2.571201

The predictions on the test set also return the same values as earlier.

Repeated k-fold cross-validation

This is a variant of k-fold cross-validation which includes repeats a specified number of times. This validation method takes advantage of the fact that there are many more ways of generating the k folds. The final model error is taken as the mean error from the number of repeats conducted

Implementing repeated k-fold cross-validation in caret is quite straightforward. We adjust the training control parameters as desired and we ensure that the method is set to repeatedcv.

#set seed
set.seed(100)

#Define training control parameters
train.control <- trainControl(method = "repeatedcv", number = 10,
                              repeats = 3)

#Model training
repeatedcvlma <- train(mpg ~ wt + disp, data = mtcars.train, method = "lm",
                       trControl = train.control)

repeatedcvlma
## Linear Regression 
## 
## 24 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 21, 22, 21, 22, 22, 22, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.789112  0.9065175  2.539772
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Again, the model appears to be a good fit. Its R2 is quite high and the MAE and RMSE are comparable to what we obtained with the earlier validation methods.

We could examine the variability in the folds in repeatedcvlma. The Resample column contains details about which fold and which repeat in the validation loop.

repeatedcvlma$resample
##         RMSE  Rsquared       MAE    Resample
## 1  2.9520378 0.9836324 2.8001635 Fold01.Rep1
## 2  2.0826732 1.0000000 1.9341582 Fold02.Rep1
## 3  5.5161529 0.7659046 5.0799549 Fold03.Rep1
## 4  2.2882639 1.0000000 2.0613054 Fold04.Rep1
## 5  2.3509615 1.0000000 1.8584507 Fold05.Rep1
## 6  2.1891585 1.0000000 2.1872482 Fold06.Rep1
## 7  3.6871518 0.4035601 3.0006708 Fold07.Rep1
## 8  1.9188247 1.0000000 1.8607632 Fold08.Rep1
## 9  1.4356400 0.9921942 1.3872700 Fold09.Rep1
## 10 2.9637468 1.0000000 2.9483449 Fold10.Rep1
## 11 1.3585048 0.9979123 1.3323775 Fold01.Rep2
## 12 2.4066920 1.0000000 2.0294835 Fold02.Rep2
## 13 1.7861584 1.0000000 1.7048720 Fold03.Rep2
## 14 2.2877949 1.0000000 2.2822071 Fold04.Rep2
## 15 2.2557985 1.0000000 2.1419506 Fold05.Rep2
## 16 3.0405169 1.0000000 2.8335510 Fold06.Rep2
## 17 4.8009043 0.8295598 4.2342430 Fold07.Rep2
## 18 3.3106309 0.7145891 2.4662163 Fold08.Rep2
## 19 1.9923618 1.0000000 1.6007974 Fold09.Rep2
## 20 4.6466522 0.9702191 4.2821767 Fold10.Rep2
## 21 5.6103640 1.0000000 5.0446332 Fold01.Rep3
## 22 2.1658656 0.9662299 1.7931937 Fold02.Rep3
## 23 2.7611713 1.0000000 2.7601504 Fold03.Rep3
## 24 4.3413028 0.1736630 4.0465952 Fold04.Rep3
## 25 2.7473187 0.9999998 2.3375618 Fold05.Rep3
## 26 1.1294381 1.0000000 1.1272943 Fold06.Rep3
## 27 3.6863776 0.3980599 3.3317429 Fold07.Rep3
## 28 0.9979942 1.0000000 0.8958156 Fold08.Rep3
## 29 2.2261291 1.0000000 2.1449617 Fold09.Rep3
## 30 2.7367756 1.0000000 2.6850102 Fold10.Rep3

With this information, we could take a look at the variability in the R2 output, for example, or in the RMSE. We see here that the R2 has rather low variability.

sd(repeatedcvlma$resample$Rsquared)
## [1] 0.2122463

Predicting on the available test set data using the derived model produces the same results as earlier obtained with the other validation methods

repeatedcvpredictions <- mtcars.test %>% select(wt, disp) %>%
    predict(repeatedcvlma, newdata = .)
data.frame(R2 = R2(repeatedcvpredictions, mtcars.test$mpg),
           RMSE = RMSE(repeatedcvpredictions, mtcars.test$mpg),
           MAE = MAE(repeatedcvpredictions, mtcars.test$mpg))
##          R2     RMSE      MAE
## 1 0.8565924 3.210083 2.571201

Regression Forests?

Following the analysis of these different cross-validation or model evaluation methods, it seems quite suggestive that we may have room for some exploration of a kind of model building process that lends itself to some foresting technique, i.e. the the combination of several models into one. The idea behind foresting after all is to combine several outputs deliberately generated into one in order to reduce variability and to achieve predictions closer to average.

In building a regression model, might it be possible to also generate several regression models from the available data and then make all these models available as the model of choice? This means, we would have the below as a regression model built on only a subset of the training set:

\(Y_{i} = \beta_{0} + \beta_{1}X_{i} + \epsilon_{i}\)

Many more of \(Y_{i}\) is built with different parts (could be ordered?) of the training set and the conglomeration of models is offered as the final model to which we could apply the test data and obtain evaluation metrics, or test future data.

The final output model for future data will be:

\(Y=\frac{1}{n}\sum_{i}^{n} Y_i\)

for as many subsets of the available data as possible.

To examine this logic, we could use the mtcars data set again.

head(mtcars, 4)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

We could sample mtcars.train with replacement for as many as 20-30 times, obtaining only 18 (~75%) data points each time and then generate the needed multiple regression models.

#Set seed value
set.seed(100)

#Choose number of sampling times
k=25
forestlist <- list()

#Generate n regression models
for (i in 1:k){
  f_samples <- sample(1:nrow(mtcars.train), size = floor(0.75*nrow(mtcars.train)), replace=FALSE)
  forestlma <- lm(mpg ~ wt + disp, data = mtcars.train[f_samples,])
  forestlist[[i]] <- forestlma
}

Now, we can make a prediction using each of this model, and obtain the corresponding predicted values for each test point.

list_of_predictions <- lapply(forestlist, predict, newdata = mtcars.test)

We then compute the mean of each point obtained. Here we use a nested for-loop. But any other method of averaging over a list will also suffice.

mtcars.test.results <- rep(0, nrow(mtcars.test))
for (i in 1:8){
  x <- 0
  mean_x <- 0
  for (j in 1:25){
    x <- x + list_of_predictions[[j]][i]
  }
  mean_x <- x/25
  mtcars.test.results[i] <- mean_x
}
results <- data.frame(car_type = rownames(mtcars.test), predicted = mtcars.test.results, actual = mtcars.test[,c("mpg")])
head(results)
##             car_type predicted actual
## 1          Mazda RX4 22.682994   21.0
## 2  Hornet Sportabout 16.225832   18.7
## 3           Merc 230 21.673179   22.8
## 4 Cadillac Fleetwood  8.993379   10.4
## 5     Toyota Corolla 26.679017   33.9
## 6         Camaro Z28 15.368019   13.3
data.frame(R2=R2(results$predicted, results$actual),
           RMSE = RMSE(results$predicted, results$actual),
           MAE = MAE(results$predicted, results$actual))
##          R2     RMSE      MAE
## 1 0.8572128 3.217573 2.629452

In the end, we obtain fairly the same value of metrics, except the MAE is now a little lower than others. The R2 is also slightly higher.

This method of using multiple regression models as the model to go with could perhaps help with better averaging over the variability present in the output.

We can visualize the regression models by using abline and plotting the line for each of the model. This shows us that we have some kind of forest of regression models and are simply averaging out the results of each of the model at every data point.

Conclusion

In this article, we examined the different means of cross-validation which helps to assess the robustness of a fitted model. We considered linear regression in this case. Firstly, we looked at the basic method which consists simply of splitting the data into a training and test set. We next examined the different cross-validation methods, including k-fold cross-validation, repeated k-fold and leave-one out cross-validation. These are validation models that help with evaluating a model. Finally, we examined some kind of regression forests, a model building process, and posit that such aggregation may as well serve as a complete model to run with, if permissible.

References

  1. scikit-learn developers (2019). Cross-validation. Retrieved from https://scikit-learn.org/stable/modules/cross_validation.html

  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2014). An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated

  3. STHDA (2019). Cross-validation essentials in R. Retrieved from http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/