Cross Validation

After discussing model quality through \(R^2\) and adjusted \(R^2\) in our first post, here we will look at another important aspect of model quality: cross validation. In our previous post we used the entire dataset to build our models, and then used the same data to determine their quality. In this post we will instead judge model quality by exposing the model to data that was not used to derive it. The motivation is simple: it is possible to build models so tightly bound to the data used to construct them that they are of little use on new data. This is called overfitting, and it happens when a model predicts the data used to design it extremely well but fails to predict new data correctly.

We will first look at a very simple model with a single predictor, and then present an example with a popular dataset. Below is a scatter plot of our first set of data.

##          x1      Y
## 1  20.39292 196.00
## 2  54.79652 300.00
## 3  17.12275 172.00
## 4  33.70603 274.88
## 5  58.34702 356.00
## 6  83.69874 548.00
## 7  18.08815 196.00
## 8  33.00347 204.00
## 9  66.33106 460.00
## 10 41.76767 276.00

For this dataset of 41 rows (only the first 10 are shown above) we wish to build a simple linear regression model:

\(\hat{y} = \beta_0 + \beta_1 x\)

We can take the entire dataset, as we did in our first post, and calculate a model.

With this we obtain a very good model, with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 25.56 and \(\beta_1 =\) 7.18.
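The fit itself is just a call to lm; a minimal sketch, assuming the data frame is named df as in the splits below (the model name m1 is only illustrative):

m1 <- lm(Y~x1,df)          # fit on the complete dataset
summary(m1)$r.squared      # the R-squared reported above
coef(m1)                   # the intercept (beta_0) and slope (beta_1)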

Validation

Before we get knee deep into cross validation, let’s look at a simple first approach. If we want to expose our model to data not used in defining (training) it, we can simply split the data into two sets. We can use 80% of the data for training, and the remaining 20% to test predicted results against known values.

When running validation, we typically assign points to the two groups at random. R libraries such as caret make this really easy, and we will use it later, but first let’s simply pick the first 80% of our points for training and the last 20% for evaluation.

train<-head(df,n=nrow(df)*.8)   # first 80% of the rows
test<-tail(df,n=nrow(df)*.2)    # last 20% of the rows

Because of rounding, this code may leave a few rows in neither train nor test, but we won’t bother with that in this example.

m2<-lm(Y~x1,train)
plot(train$x1,train$Y)
abline(coef(m2),lty=5)

With this we obtain a very good model with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 20.76 and \(\beta_1 =\) 7.3. This is very similar to the first model, in which we used all of the data. But since here we have split our data, we have a test set we can use to measure the model’s performance against data it hasn’t seen. To do this we first calculate predictions using the model, then calculate the Root Mean Squared Error (RMSE). We will use the RMSE function from the caret library.
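For \(n\) test points, the RMSE is simply the square root of the average squared prediction error:

\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\)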

library(caret)     # for RMSE() and createDataPartition(), if not already loaded
library(magrittr)  # for the %>% pipe, if not already loaded
predictions <- m2 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)

With this we now know our model’s Root Mean Squared Error is 50.01.

But here we split our data taking the first 80% for training and the rest for testing. There was no particular reason to choose the first 80% rather than the last 80%, so let’s do the same thing with the split inverted.

train<-tail(df,n=nrow(df)*.8)   # last 80% of the rows
test<-head(df,n=nrow(df)*.2)    # first 20% of the rows

m3<-lm(Y~x1,train)
plot(train$x1,train$Y)
abline(coef(m3),lty=5)

predictions <- m3 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)

Once again we get basically the same model, with an \(R^2\) of 0.89 and coefficients \(\beta_0 =\) 34.2 and \(\beta_1 =\) 7.17. The RMSE is also similar, at 69.57. Some difference is expected, as we are testing our model against different data. But is this difference in RMSE meaningful? Should we expect different model quality depending on the data chosen for the split? How do we know which split gives us the true model quality metric?

In practice we do not split data into head and tail, but rather do it randomly. The caret library provides easy-to-use tools for such splits, so let’s rerun the model and test with rows assigned at random to the train and test datasets.

training.samples <- df$Y %>% createDataPartition(p = 0.8, list = FALSE)
train  <- df[training.samples, ]
test <- df[-training.samples, ]

m4<-lm(Y~x1,train)
plot(train$x1,train$Y)
abline(coef(m4),lty=5)

predictions <- m4 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)

Once again we get basically the same model, with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 31.05 and \(\beta_1 =\) 7.24. The RMSE, however, is again somewhat different: 62.11. The choice of split seems to have some effect on the model and its error metric.

Issues with data splitting

How we split the data affects our model quality calculation. We saw some of that effect in the previous exercise, training on the head, the tail, or random rows from the dataset. The effect becomes even more evident if the dataset contains points with high leverage. These can be outliers, but also genuine data points that simply carry more weight in the model. Depending on which group these points fall into, training or testing, their effect on our error calculations will be different.
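As a quick aside, the leverage of each observation can be inspected directly from any fitted lm object, for example the model m2 trained above:

hatvalues(m2)         # leverage of each training observation
plot(m2, which=5)     # residuals vs. leverage diagnostic plot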

Below are the last rows of basically the same dataset, to which we have added a few new points.

##           X2        Y2
## 35  71.71852  531.9473
## 36  42.51791  384.6785
## 37  68.53734  429.3861
## 38 113.92475  841.4961
## 39  74.91034  536.0291
## 40  86.25216  649.5631
## 41 113.22574  773.9779
## 42 120.00000 1250.0000
## 43 140.00000 1550.0000
## 44 150.00000 1700.0000
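The construction of df2 is not shown here; one plausible sketch, assuming the original columns were carried over as X2 and Y2 and the three round-valued points in rows 42 through 44 were appended, would be:

df2 <- data.frame(X2 = c(df$x1, 120, 140, 150),
                  Y2 = c(df$Y, 1250, 1550, 1700))   # assumed: original data plus three high-leverage points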

Now if we again train models on the head and on the tail, we get very different results.

train<-head(df2,n=nrow(df2)*.8)
test<-tail(df2,n=nrow(df2)*.2)

m5<-lm(Y2~X2,train)
plot(train$X2,train$Y2)
abline(coef(m5),lty=5)

b0<-round(summary(m5)$coefficients[1],2)
b1<-round(summary(m5)$coefficients[2],2)
r2<-round(summary(m5)$r.squared,2)

predictions <- m5 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y2),2)

With this split the model’s \(R^2\) is 0.91 and the coefficients are \(\beta_0 =\) 19.48 and \(\beta_1 =\) 7.33, very similar to our previous models. But now the RMSE has jumped to 285.67. This is because the model is being trained with basically the same data as before, but it is now being tested against new data it hasn’t seen, data which doesn’t fit the model well.

So what if we use the tail of the dataset, so that the points we added are included in training the model?

train<-tail(df2,n=nrow(df2)*.8)
test<-head(df2,n=nrow(df2)*.2)

m6<-lm(Y2~X2,train)
plot(train$X2,train$Y2)
abline(coef(m6),lty=5)

b0<-round(summary(m6)$coefficients[1],2)
b1<-round(summary(m6)$coefficients[2],2)
r2<-round(summary(m6)$r.squared,2)

predictions <- m6 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y2),2)

Now the RMSE is a more reasonable 104.27, but our model has changed: \(R^2\) is 0.84, not as good as before, and the coefficients are \(\beta_0 =\) -109.87 and \(\beta_1 =\) 9.47. It is not the same model.

If we split the data randomly using caret, results will differ depending on which set, training or test, our recently added points fall into. So how we split the data definitely has an effect on our modeling metrics.
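To see this variability directly, we could repeat the random split a few times and compare the resulting test errors. A minimal sketch (the seed and loop are only for illustration):

set.seed(42)                                  # illustrative seed
errors <- sapply(1:5, function(i) {
  idx   <- createDataPartition(df2$Y2, p = 0.8, list = FALSE)
  fit   <- lm(Y2~X2, data = df2[idx, ])       # train on this random 80%
  preds <- predict(fit, df2[-idx, ])          # predict the remaining 20%
  RMSE(preds, df2[-idx, "Y2"])                # test error for this split
})
round(errors, 2)                              # a different RMSE for each random split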

So how do we really know how good our model is?

Cross Validation

Cross validation works to ameliorate the issues we see with model metrics when choosing how to split the data for testing. There are several ways to split data in cross validation, but here we will present an approach called K-fold.

In K-fold cross validation we randomly split the dataset into k subsets, or folds. We then reserve one subset and train the model on all the others. The model is tested on the reserved subset and the prediction error is recorded. This is repeated until each of the k subsets has served as the test set. The average of the k recorded errors, known as the cross-validation error, serves as the performance metric for our model. The number of folds k is usually 5 or 10.
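To make the procedure concrete, here is a hand-rolled sketch of the K-fold loop in base R plus caret’s RMSE (the fold assignment and variable names are only for illustration; in practice we let caret do this, as shown next):

k <- 10
set.seed(123)
folds <- sample(rep(1:k, length.out = nrow(df2)))   # randomly assign each row to a fold
cv.errors <- sapply(1:k, function(i) {
  fit   <- lm(Y2~X2, data = df2[folds != i, ])      # train on the other k-1 folds
  preds <- predict(fit, df2[folds == i, ])          # predict the held-out fold
  RMSE(preds, df2[folds == i, "Y2"])                # error for this fold
})
mean(cv.errors)                                     # cross-validation error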

The caret library makes this computation an easy process in R.

set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
m7 <- train(Y2 ~X2, data = df2, method = "lm", trControl = train.control)
print(m7)
## Linear Regression 
## 
## 44 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 40, 39, 39, 40, 40, 39, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   131.3398  0.9429153  109.5061
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Here we see the final model quality parameters. The RMSE is lower than our previous worst error (training on the head) but not as good as when we trained on the tail. The actual model derived using the cross validation tools in caret can be inspected using summary, just as for a standard R lm.

summary(m7)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -203.67  -97.86  -17.61   76.11  418.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -78.3114    46.3121  -1.691   0.0983 .  
## X2            9.0677     0.5749  15.771   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131 on 42 degrees of freedom
## Multiple R-squared:  0.8555, Adjusted R-squared:  0.8521 
## F-statistic: 248.7 on 1 and 42 DF,  p-value: < 2.2e-16

And this is the model trained on the complete dataset. We can compare it to the model resulting from the standard R lm function:

m8<-lm(Y2~X2,df2)
summary(m8)
## 
## Call:
## lm(formula = Y2 ~ X2, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -203.67  -97.86  -17.61   76.11  418.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -78.3114    46.3121  -1.691   0.0983 .  
## X2            9.0677     0.5749  15.771   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131 on 42 degrees of freedom
## Multiple R-squared:  0.8555, Adjusted R-squared:  0.8521 
## F-statistic: 248.7 on 1 and 42 DF,  p-value: < 2.2e-16

Boston Housing dataset

K-fold cross validation can easily be applied to “real” datasets. Here we follow all the steps to calculate the RMSE quality metric for a naive model using all the predictors in the Boston Housing dataset, where the median value of homes (medv) is the variable to predict.

# Boston Housing Data
library(mlbench)      # provides the BostonHousing dataset
data(BostonHousing)
dim(BostonHousing)
## [1] 506  14
head(BostonHousing)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
# Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(medv ~., data = BostonHousing, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
## Linear Regression 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 454, 457, 456, 455, 455, 455, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   4.787689  0.732889  3.368666
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas1        2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## b            9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16