After discussing model quality through \(R^2\) and adjusted \(R^2\) in our first post, here we look at another important aspect of model quality: cross validation. In our previous post we used the entire dataset to build our models, and then used the same data to assess their quality. With cross validation, we instead judge a model by exposing it to data that was not used to derive it. The motivation is simple: it is possible to build models so tightly bound to the data used to construct them that they are of little use on new data. This is called overfitting: the model predicts the training data extremely well, but fails to predict new data correctly.
We will first look at a very simple model with a single predictor, and then present an example using a popular dataset. Below is a scatter plot of our first set of data, along with its first 10 rows.
## x1 Y
## 1 20.39292 196.00
## 2 54.79652 300.00
## 3 17.12275 172.00
## 4 33.70603 274.88
## 5 58.34702 356.00
## 6 83.69874 548.00
## 7 18.08815 196.00
## 8 33.00347 204.00
## 9 66.33106 460.00
## 10 41.76767 276.00
For this dataset of 41 rows (the first 10 of which are shown above), we wish to build a simple linear regression model:
\(\hat{y} = \beta_0 + \beta_1 x\)
We can take the entire dataset, as we did in our first post, and calculate a model.
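The fit itself is a single lm() call. A minimal sketch, assuming the data sits in a data frame named df with columns x1 and Y (as in the split code further below); the name m1 is just illustrative:
m1 <- lm(Y~x1, df)        # fit using the entire dataset
summary(m1)$r.squared     # R squared reported below
coef(m1)                  # beta_0 (intercept) and beta_1 (slope)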
With this we obtain a very good model, with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 25.56 and \(\beta_1 =\) 7.18.
Before we get knee deep into cross validation, let’s look at a simple first approach. If we want to expose our model to data not used in defining (training) it, we can simply split the data into two sets: 80% of it for training, and the remaining 20% to test predicted results against known values.
When running this kind of validation, we typically assign points to the two groups at random. R libraries such as caret make this really easy, and we will show how shortly, but first let’s simply pick the first 80% of our points for training and the last 20% for evaluation.
train<-head(df,n=nrow(df)*.8)   # first 80% of rows for training
test<-tail(df,n=nrow(df)*.2)    # last 20% of rows for testing
With this code we might leave out a row or two that falls into neither train nor test (the row counts are truncated), but we won’t worry about that in this example.
m2<-lm(Y~x1,train)        # fit using only the training rows
plot(train$x1,train$Y)
abline(coef(m2),lty=5)    # overlay the fitted line
With this we obtain a very good model, with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 20.76 and \(\beta_1 =\) 7.3. This is very similar to the first model, in which we used all of the data. But since we have now split our data, we have a test set we can use to measure the model’s performance against data it hasn’t seen. To do this we first calculate predictions using the model, and then compute the Root Mean Squared Error (RMSE). We will use the RMSE function from the caret library.
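For reference, for a test set with \(n\) observations the RMSE is \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\), i.e. the typical size of a prediction error expressed in the units of \(Y\).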
library(caret)      # provides RMSE(), and later createDataPartition() and train()
library(magrittr)   # provides the %>% pipe (dplyr works too)
predictions <- m2 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)
With this we now know that our model’s Root Mean Squared Error is 50.01.
But here we split our data by taking the first 80% for training and the rest for testing. There was no particular reason to use the first 80% rather than the last 80%, so let’s repeat the exercise inverting our split selection.
train<-tail(df,n=nrow(df)*.8)   # now train on the last 80%
test<-head(df,n=nrow(df)*.2)    # and test on the first 20%
m3<-lm(Y~x1,train)
plot(train$x1,train$Y)
abline(coef(m3),lty=5)
predictions <- m3 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)
Once again we get basically the same model, with an \(R^2\) of 0.89 and coefficients \(\beta_0 =\) 34.2 and \(\beta_1 =\) 7.17. The RMSE, 69.57, is also similar. Some difference is expected, as we are testing the model against different data. But is this difference in RMSE meaningful? Should we expect different model quality depending on the data chosen for the split? How do we know which split gives us the true model quality metric?
In practice we do not split data into head and tail, but rather do it randomly. The caret library provides easy-to-use tools for these splits, so let’s refit and test the model using randomly selected rows for the train and test datasets.
training.samples <- df$Y %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test <- df[-training.samples, ]
m4<-lm(Y~x1,train)
plot(train$x1,train$Y)
abline(coef(m4),lty=5)
predictions <- m4 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y),2)
Once again we get basically the same model, with an \(R^2\) of 0.91 and coefficients \(\beta_0 =\) 31.05 and \(\beta_1 =\) 7.24. The RMSE, however, is again somewhat different at 62.11. The choice of split seems to have some effect on the model metrics.
How we split the data affects our model quality calculation. We saw some of that effect in the previous exercise, training with the tail, the head, or random rows from the dataset. The effect becomes even more evident if the dataset contains points with high leverage. These can be outliers, but also genuine data points that simply carry more weight in the model. Depending on which group these points fall into, training or testing, their effect on our error calculations will differ.
The plot below is for essentially the same dataset, to which we have added a few new points (we check their leverage right after the listing).
## X2 Y2
## 35 71.71852 531.9473
## 36 42.51791 384.6785
## 37 68.53734 429.3861
## 38 113.92475 841.4961
## 39 74.91034 536.0291
## 40 86.25216 649.5631
## 41 113.22574 773.9779
## 42 120.00000 1250.0000
## 43 140.00000 1550.0000
## 44 150.00000 1700.0000
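Before splitting, we can check which observations actually carry the most leverage with base R’s hatvalues(); under a fit to all of df2, the appended points at the largest X2 values should rank at the top (the name lev is illustrative):
lev <- hatvalues(lm(Y2~X2, df2))     # leverage of each observation
head(sort(lev, decreasing = TRUE))   # the appended high-X2 rows should appear here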
Now if we build models from head and tail splits of this data, we get very different results.
train<-head(df2,n=nrow(df2)*.8)
test<-tail(df2,n=nrow(df2)*.2)
m5<-lm(Y2~X2,train)
plot(train$X2,train$Y2)
abline(coef(m5),lty=5)
b0<-round(summary(m5)$coefficients[1],2)
b1<-round(summary(m5)$coefficients[2],2)
r2<-round(summary(m5)$r.squared,2)
predictions <- m5 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y2),2)
With this split the model’s \(R^2\) is 0.91 and the coefficients are \(\beta_0 =\) 19.48 and \(\beta_1 =\) 7.33, very similar to our previous models. But the RMSE has now jumped to 285.67. This is because the model is being trained on basically the same data as before, but is now tested against the new points it hasn’t seen, which do not fit the model well.
So what happens if we instead train on the tail of the dataset, so that the points we added are included in training?
train<-tail(df2,n=nrow(df2)*.8)
test<-head(df2,n=nrow(df2)*.2)
m6<-lm(Y2~X2,train)
plot(train$X2,train$Y2)
abline(coef(m6),lty=5)
b0<-round(summary(m6)$coefficients[1],2)
b1<-round(summary(m6)$coefficients[2],2)
r2<-round(summary(m6)$r.squared,2)
predictions <- m6 %>% predict(test)
RMSE <- round(RMSE(predictions, test$Y2),2)
Now the RMSE is a more reasonable 104.27, but our model has changed: \(R^2\) is 0.84, not as good as before, and the coefficients are \(\beta_0 =\) -109.87 and \(\beta_1 =\) 9.47. It is not the same model.
If we split the data randomly using caret, the results will differ depending on which set, training or test, the recently added points fall into. Clearly, how we split the data has an effect on our modeling metrics.
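To see this variability concretely, here is a minimal sketch (the name rmses and the seed are illustrative) that repeats caret’s random 80/20 split a few times on df2 and records the test RMSE of each repetition:
set.seed(42)                                                  # illustrative seed
rmses <- replicate(5, {
  idx <- createDataPartition(df2$Y2, p = 0.8, list = FALSE)   # random 80% for training
  fit <- lm(Y2~X2, data = df2[idx, ])
  RMSE(predict(fit, df2[-idx, ]), df2$Y2[-idx])               # test-set error
})
round(rmses, 2)                                               # a different RMSE for each split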
So how do we really know how good our model is?
Cross validation ameliorates the sensitivity of model metrics to how we choose to split the data for testing. There are several ways to split data in cross validation; here we will present an approach called K-fold.
In K-fold cross validation we randomly split the dataset into k subsets (folds). We reserve one subset and train the model on all the others, then test the model on the reserved subset and record the prediction error. This is repeated until each of the k subsets has served as the test set. The average of the k recorded errors, known as the cross-validation error, serves as the performance metric for our model. The number of folds k is usually 5 or 10.
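To make the mechanics concrete, here is a minimal manual sketch of 10-fold cross validation in base R (the names folds and fold.rmse are illustrative), using our augmented data frame df2 and RMSE as the per-fold error:
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(df2)))     # randomly assign each row to a fold
fold.rmse <- sapply(1:k, function(i) {
  fit  <- lm(Y2~X2, data = df2[folds != i, ])         # train on the other k-1 folds
  pred <- predict(fit, newdata = df2[folds == i, ])   # predict the held-out fold
  sqrt(mean((df2$Y2[folds == i] - pred)^2))           # RMSE for this fold
})
mean(fold.rmse)                                       # the cross-validation error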
The caret library makes this computation an easy process in R.
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
m7 <- train(Y2 ~X2, data = df2, method = "lm", trControl = train.control)
print(m7)
## Linear Regression
##
## 44 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40, 39, 39, 40, 40, 39, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 131.3398 0.9429153 109.5061
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Here we see the final model quality metrics. The cross-validated RMSE, 131.34, is lower than our previous worst error (training with the head), but not as good as when we trained with the tail. The actual model derived using the cross validation tools in caret can be inspected with summary, just as for a standard R lm fit.
summary(m7)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -203.67 -97.86 -17.61 76.11 418.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -78.3114 46.3121 -1.691 0.0983 .
## X2 9.0677 0.5749 15.771 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131 on 42 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8521
## F-statistic: 248.7 on 1 and 42 DF, p-value: < 2.2e-16
This final model is trained on the complete dataset; cross validation was used only to estimate its out-of-sample error. We can compare it to the model produced by the standard R lm function:
m8<-lm(Y2~X2,df2)
summary(m8)
##
## Call:
## lm(formula = Y2 ~ X2, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -203.67 -97.86 -17.61 76.11 418.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -78.3114 46.3121 -1.691 0.0983 .
## X2 9.0677 0.5749 15.771 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131 on 42 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8521
## F-statistic: 248.7 on 1 and 42 DF, p-value: < 2.2e-16
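The two summaries are identical, confirming that the reported model is refit on all of the data. If we want that fitted object directly, it is stored in the finalModel component of the caret train object:
coef(m7$finalModel)   # the model kept by caret
coef(m8)              # matches the plain lm() fit above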
K-fold cross validation can just as easily be applied to “real” datasets. Here we follow the same steps to calculate the RMSE quality metric for a naive model that uses all the predictors in the Boston Housing dataset (from the mlbench package), where the median home value (medv) is the variable to predict.
# Boston Housing Data (from the mlbench package)
library(mlbench)
data(BostonHousing)
dim(BostonHousing)
## [1] 506 14
head(BostonHousing)
## crim zn indus chas nox rm age dis rad tax ptratio b
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(medv ~., data = BostonHousing, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 454, 457, 456, 455, 455, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.787689 0.732889 3.368666
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas1 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## b 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16