In this method, we divide the available training data into two sets. We fit the model on one set and use the other (the validation set) to validate the model and to estimate the test set MSE.
We will use the ‘Auto’ dataset from the ISLR library. We set a seed before splitting the dataset so that the results can be reproduced exactly later.
library(ISLR)
set.seed(7)
train = sample(392, 196)
Now we have the row numbers that make up the training set. We will use the ‘subset’ argument of the ‘lm’ function to fit our regression model on the training set only.
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
We estimate the test error by computing the residuals on the held-out observations ([-train]) and using them to calculate the MSE of the test set.
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
## [1] 23.16501
This is our test MSE. Now let us change the seed and repeat the process.
set.seed(8)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
attach(Auto)
## The following objects are masked from Auto (pos = 3):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
mean((mpg - predict(lm.fit, Auto))[-train]^2)
## [1] 24.26852
As the two runs show (23.17 versus 24.27), the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
In the validation approach, only a subset of the observations (those that are included in the training set rather than in the validation set) are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
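The variability described above can be seen by repeating the split over several seeds. The following is only a sketch: to keep it runnable in base R it uses the built-in `mtcars` data (with `hp` standing in for `horsepower`) rather than ‘Auto’, and the helper name `validation_mse` is made up for this illustration.

```r
# Stand-in data: built-in mtcars (32 rows), NOT the Auto data used above.
data(mtcars)
n <- nrow(mtcars)

# Hypothetical helper: validation-set MSE for one random half/half split.
validation_mse <- function(seed) {
  set.seed(seed)
  train <- sample(n, n / 2)                       # row indices of the training half
  fit <- lm(mpg ~ hp, data = mtcars, subset = train)
  held_out <- mtcars[-train, ]                    # validation half
  mean((held_out$mpg - predict(fit, held_out))^2)
}

# Repeat the split for ten different seeds and look at the spread.
seeds <- 1:10
mse_by_seed <- sapply(seeds, validation_mse)
range(mse_by_seed)
```

Passing the held-out rows to `predict` directly also avoids the need for `attach`. To try this on ‘Auto’, replace `mtcars` with `Auto` and `hp` with `horsepower`.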
In this method, we divide the training set into k different sets (folds). The model is fit on k-1 of the folds and tested on the remaining fold. This is repeated k times, so that each of the k folds serves as the test set once. Each fold, when used as a test set, yields its own MSE. The average of these MSEs is our cross-validation test MSE (CV error). We use the ‘glm’ function to fit our regression model and the ‘delta’ component returned by ‘cv.glm’ to get the CV error. ‘cv.glm’ is part of the ‘boot’ library.
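To make the procedure concrete, here is a hand-written sketch of k-fold cross-validation in base R. It is an illustration of the steps described above, not the code inside ‘cv.glm’, whose internal details (for example how folds of unequal size are weighted) may differ. The built-in `mtcars` data and k = 5 are assumptions chosen so the example runs without extra packages.

```r
# Stand-in data: built-in mtcars, NOT the Auto data used in this post.
data(mtcars)
set.seed(17)
k <- 5
n <- nrow(mtcars)

# Randomly assign every row to one of k folds of nearly equal size.
fold <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (j in 1:k) {
  test_rows <- which(fold == j)                     # fold j is the test set
  fit <- lm(mpg ~ hp, data = mtcars[-test_rows, ])  # fit on the other k-1 folds
  pred <- predict(fit, mtcars[test_rows, ])
  fold_mse[j] <- mean((mtcars$mpg[test_rows] - pred)^2)
}

# Unweighted average of the k fold MSEs, as described in the text.
cv_error_manual <- mean(fold_mse)
cv_error_manual
```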
We create a vector ‘cv_error’ to store the cross-validation error for each polynomial degree.
library(boot)
set.seed(17)
cv_error = rep(0, 10)
Now we use ‘glm’ to fit regression models with polynomial terms of increasing order. We will store the CV errors corresponding to the polynomial fits of orders one to ten, computed with the ‘cv.glm’ function.
for (i in 1:10)
{
  glm_fit = glm(mpg ~ poly(horsepower, i), data = Auto)
  cv_error[i] = cv.glm(Auto, glm_fit, K = 10)$delta[1]
}
Now let us look at our test MSEs for different orders.
cv_error
## [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
## [8] 19.71201 18.95140 19.50196
plot(cv_error, type = 'l')
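From the graph, we can see that the 5th and 7th order polynomials have the lowest CV error, although the largest drop comes from moving from the linear to the quadratic fit and the differences beyond that are small. As for the number of folds, k = 5 or k = 10 is the usual choice.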
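The same reading can be checked numerically instead of from the plot. The sketch below simply re-enters the ten values printed above into a new vector (the name `cv_error_reported` is made up for this illustration) and asks R for the smallest entries.

```r
# CV errors copied from the printed output above, for degrees 1 to 10.
cv_error_reported <- c(24.20520, 19.18924, 19.30662, 19.33799, 18.87911,
                       19.02103, 18.89609, 19.71201, 18.95140, 19.50196)

# Degree with the lowest CV error.
best_degree <- which.min(cv_error_reported)
best_degree                      # 5

# The two degrees with the lowest CV errors, in increasing order of error.
order(cv_error_reported)[1:2]    # 5 7

# Improvement from the linear to the quadratic fit.
cv_error_reported[1] - cv_error_reported[2]
```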
LOOCV (leave-one-out cross-validation) is a special case of k-fold cross-validation with k = n. Usually we use the ‘lm’ function to fit a linear regression model, but we can use the ‘glm’ function to do the same. It has some additional features that come in handy at times. We can use the ‘cv.glm’ function to perform both LOOCV and k-fold cross-validation. The ‘cv.glm’ function is part of the ‘boot’ library.
library(boot)
glm_fit <- glm(mpg ~ horsepower, data=Auto)
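We did not specify the ‘family’ argument, so ‘glm’ uses its default (gaussian) and fits a linear regression rather than a classification model. Because we did not pass ‘K’ to ‘cv.glm’, it defaults to the number of observations, which gives LOOCV. To get the MSE, we use the ‘delta’ component of the result: its first element is the raw cross-validation estimate and its second is a bias-adjusted version.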
error <- cv.glm(Auto, glm_fit)
error$delta
## [1] 24.23151 24.23114
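LOOCV can be expensive to run, since the model has to be fit n times. This can be very time consuming if n is large and each individual model is slow to fit. However, it has far less bias than the validation set approach, because each fit uses n-1 observations. And since there is no randomness in how the data are split, repeating LOOCV always gives the same test MSE estimate, unlike the validation set approach, whose estimate changes with the split.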
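The cost mentioned above is easy to see when LOOCV is written out by hand: the loop below refits the model once per observation. This is only a sketch of the idea, not what ‘cv.glm’ runs internally, and it again uses the built-in `mtcars` data as a stand-in for ‘Auto’ so that it needs no extra packages.

```r
# Stand-in data: built-in mtcars, NOT the Auto data used in this post.
data(mtcars)
n <- nrow(mtcars)

sq_err <- numeric(n)
for (i in 1:n) {
  fit <- lm(mpg ~ hp, data = mtcars[-i, ])          # fit without observation i
  pred <- predict(fit, mtcars[i, ])                  # predict the left-out row
  sq_err[i] <- (mtcars$mpg[i] - pred)^2
}

# LOOCV estimate of the test MSE: average of the n squared errors.
loocv_mse <- mean(sq_err)
loocv_mse
```

No seed is set because nothing here is random: every observation is left out exactly once, so rerunning the loop returns the same value.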