Drawback of this method:
(1) Estimated test-error using this method has high variability as we run training and validation set once and it also depends on which observations were part of training or validation set due to the randomness of which observations are used for training vs. validation.
(2) Splitting data in two set leave fewer observations in a training set and fitting the model with fewer observations over estimate the errors due to increased bias.
suppressMessages(library(ISLR))
data("Auto")
dim(Auto)
## [1] 392 9
set.seed(1)
n = nrow(Auto)
index = sample(1:n, size = n*0.50, replace = FALSE)
length(index)
## [1] 196
train.set = Auto[index,]
test.set = Auto[-index,]
dim(train.set)
## [1] 196 9
dim(test.set)
## [1] 196 9
lm.fit1 = lm(mpg ~ horsepower, data=train.set)
lm.fit1
##
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
##
## Coefficients:
## (Intercept) horsepower
## 40.3404 -0.1617
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.698 -3.085 -0.216 2.680 16.770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.340377 1.002269 40.25 <2e-16 ***
## horsepower -0.161701 0.008809 -18.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.692 on 194 degrees of freedom
## Multiple R-squared: 0.6346, Adjusted R-squared: 0.6327
## F-statistic: 336.9 on 1 and 194 DF, p-value: < 2.2e-16
mpg.pred = predict(lm.fit1, test.set)
test.MSE1 = mean((test.set$mpg - mpg.pred)^2)
test.MSE1
## [1] 26.14142
lm.fit2 = lm(mpg ~ poly(horsepower,2), data=train.set)
lm.fit2
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
##
## Coefficients:
## (Intercept) poly(horsepower, 2)1 poly(horsepower, 2)2
## 23.00 -86.12 26.19
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8507 -2.6415 -0.0274 2.2932 14.7340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.0020 0.3079 74.718 < 2e-16 ***
## poly(horsepower, 2)1 -86.1241 4.3099 -19.983 < 2e-16 ***
## poly(horsepower, 2)2 26.1868 4.3099 6.076 6.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.31 on 193 degrees of freedom
## Multiple R-squared: 0.6933, Adjusted R-squared: 0.6901
## F-statistic: 218.1 on 2 and 193 DF, p-value: < 2.2e-16
mpg.pred2 = predict(lm.fit2, test.set)
test.MSE2 = mean((test.set$mpg - mpg.pred2)^2)
test.MSE2
## [1] 19.82259
Now, we will re-run the entire process of fitting both models but using different data for training and test sets. we will do this by changing the seed.
set.seed(1234)
n = nrow(Auto)
index = sample(1:n, size = n*0.50, replace = FALSE)
length(index)
## [1] 196
train.set = Auto[index,]
test.set = Auto[-index,]
dim(train.set)
## [1] 196 9
dim(test.set)
## [1] 196 9
lm.fit3 = lm(mpg ~ horsepower, data=train.set)
lm.fit3
##
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
##
## Coefficients:
## (Intercept) horsepower
## 39.8348 -0.1583
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ horsepower, data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4376 -3.0581 -0.1184 2.8042 17.0543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.834786 0.981123 40.60 <2e-16 ***
## horsepower -0.158294 0.008884 -17.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.851 on 194 degrees of freedom
## Multiple R-squared: 0.6207, Adjusted R-squared: 0.6188
## F-statistic: 317.5 on 1 and 194 DF, p-value: < 2.2e-16
mpg.pred = predict(lm.fit3, test.set)
test.MSE3 = mean((test.set$mpg - mpg.pred)^2)
test.MSE3
## [1] 24.63678
lm.fit4 = lm(mpg ~ poly(horsepower,2), data=train.set)
lm.fit4
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
##
## Coefficients:
## (Intercept) poly(horsepower, 2)1 poly(horsepower, 2)2
## 23.48 -86.44 30.49
summary(lm.fit4)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3631 -2.4309 0.0471 2.2233 15.2213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.48 0.31 75.731 < 2e-16 ***
## poly(horsepower, 2)1 -86.44 4.34 -19.915 < 2e-16 ***
## poly(horsepower, 2)2 30.49 4.34 7.024 3.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.341 on 193 degrees of freedom
## Multiple R-squared: 0.6979, Adjusted R-squared: 0.6948
## F-statistic: 223 on 2 and 193 DF, p-value: < 2.2e-16
mpg.pred2 = predict(lm.fit4, test.set)
test.MSE4 = mean((test.set$mpg - mpg.pred2)^2)
test.MSE4
## [1] 19.58817
df = data.frame(set1_model1 = test.MSE1,set2_model1 = test.MSE3,set1_model2 = test.MSE2,set2_model2 = test.MSE4 )
df
## set1_model1 set2_model1 set1_model2 set2_model2
## 1 26.14142 24.63678 19.82259 19.58817
anova(lm.fit1, lm.fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ horsepower
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 194 4270.8
## 2 194 4565.8 0 -295
anova(lm.fit2, lm.fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ poly(horsepower, 2)
## Model 2: mpg ~ poly(horsepower, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 193 3585.1
## 2 193 3636.2 0 -51.121