Validation-set approach: general concept

To estimate the test error associated with a given statistical learning method in order to evaluate its performance.

This is done by splitting the training set in training set and validation set. Fit the model using training set. Predict the results of validation set using the fitted model and calculate the estimate of test-error.

Drawback of this method:

(1) Estimated test-error using this method has high variability as we run training and validation set once and it also depends on which observations were part of training or validation set due to the randomness of which observations are used for training vs. validation.

(2) Splitting data in two set leave fewer observations in a training set and fitting the model with fewer observations over estimate the errors due to increased bias.

suppressMessages(library(ISLR))

data("Auto")
dim(Auto)
## [1] 392   9
set.seed(1)

n = nrow(Auto)
index = sample(1:n, size = n*0.50, replace = FALSE)
length(index)
## [1] 196
train.set = Auto[index,]
test.set = Auto[-index,]
dim(train.set)
## [1] 196   9
dim(test.set)
## [1] 196   9
lm.fit1 = lm(mpg ~ horsepower, data=train.set)
lm.fit1
## 
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
## 
## Coefficients:
## (Intercept)   horsepower  
##     40.3404      -0.1617
summary(lm.fit1)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.698  -3.085  -0.216   2.680  16.770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.340377   1.002269   40.25   <2e-16 ***
## horsepower  -0.161701   0.008809  -18.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.692 on 194 degrees of freedom
## Multiple R-squared:  0.6346, Adjusted R-squared:  0.6327 
## F-statistic: 336.9 on 1 and 194 DF,  p-value: < 2.2e-16
mpg.pred = predict(lm.fit1, test.set)

test.MSE1 = mean((test.set$mpg - mpg.pred)^2)
test.MSE1
## [1] 26.14142
lm.fit2 = lm(mpg ~ poly(horsepower,2), data=train.set)
lm.fit2
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
## 
## Coefficients:
##          (Intercept)  poly(horsepower, 2)1  poly(horsepower, 2)2  
##                23.00                -86.12                 26.19
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8507  -2.6415  -0.0274   2.2932  14.7340 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           23.0020     0.3079  74.718  < 2e-16 ***
## poly(horsepower, 2)1 -86.1241     4.3099 -19.983  < 2e-16 ***
## poly(horsepower, 2)2  26.1868     4.3099   6.076 6.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.31 on 193 degrees of freedom
## Multiple R-squared:  0.6933, Adjusted R-squared:  0.6901 
## F-statistic: 218.1 on 2 and 193 DF,  p-value: < 2.2e-16
mpg.pred2 = predict(lm.fit2, test.set)

test.MSE2 = mean((test.set$mpg - mpg.pred2)^2)
test.MSE2
## [1] 19.82259

Now, we will re-run the entire process of fitting both models but using different data for training and test sets. we will do this by changing the seed.

set.seed(1234)

n = nrow(Auto)
index = sample(1:n, size = n*0.50, replace = FALSE)
length(index)
## [1] 196
train.set = Auto[index,]
test.set = Auto[-index,]
dim(train.set)
## [1] 196   9
dim(test.set)
## [1] 196   9
lm.fit3 = lm(mpg ~ horsepower, data=train.set)
lm.fit3
## 
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
## 
## Coefficients:
## (Intercept)   horsepower  
##     39.8348      -0.1583
summary(lm.fit3)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4376  -3.0581  -0.1184   2.8042  17.0543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.834786   0.981123   40.60   <2e-16 ***
## horsepower  -0.158294   0.008884  -17.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.851 on 194 degrees of freedom
## Multiple R-squared:  0.6207, Adjusted R-squared:  0.6188 
## F-statistic: 317.5 on 1 and 194 DF,  p-value: < 2.2e-16
mpg.pred = predict(lm.fit3, test.set)

test.MSE3 = mean((test.set$mpg - mpg.pred)^2)
test.MSE3
## [1] 24.63678
lm.fit4 = lm(mpg ~ poly(horsepower,2), data=train.set)
lm.fit4
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
## 
## Coefficients:
##          (Intercept)  poly(horsepower, 2)1  poly(horsepower, 2)2  
##                23.48                -86.44                 30.49
summary(lm.fit4)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3631  -2.4309   0.0471   2.2233  15.2213 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             23.48       0.31  75.731  < 2e-16 ***
## poly(horsepower, 2)1   -86.44       4.34 -19.915  < 2e-16 ***
## poly(horsepower, 2)2    30.49       4.34   7.024 3.59e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.341 on 193 degrees of freedom
## Multiple R-squared:  0.6979, Adjusted R-squared:  0.6948 
## F-statistic:   223 on 2 and 193 DF,  p-value: < 2.2e-16
mpg.pred2 = predict(lm.fit4, test.set)

test.MSE4 = mean((test.set$mpg - mpg.pred2)^2)
test.MSE4
## [1] 19.58817
df = data.frame(set1_model1 = test.MSE1,set2_model1 = test.MSE3,set1_model2 = test.MSE2,set2_model2 = test.MSE4 )
df
##   set1_model1 set2_model1 set1_model2 set2_model2
## 1    26.14142    24.63678    19.82259    19.58817
anova(lm.fit1, lm.fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ horsepower
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1    194 4270.8                      
## 2    194 4565.8  0      -295
anova(lm.fit2, lm.fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ poly(horsepower, 2)
## Model 2: mpg ~ poly(horsepower, 2)
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1    193 3585.1                      
## 2    193 3636.2  0   -51.121

above table shows the difference in variability in estimating test-MSE. (compare set-1 with set-2)