Q1. We now review k-fold cross-validation.
k-fold cross-validation is implemented by randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. The process yields k estimates of the test error, MSE1, MSE2, ..., MSEk, and the k-fold CV estimate is computed by averaging these values.
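Writing this out, the k-fold CV estimate is CV(k) = (1/k)(MSE1 + MSE2 + ... + MSEk).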
Advantage of k-fold cross-validation relative to the validation set approach: the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. Moreover, because only a subset of the observations is used to fit the model, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
Disadvantage of k-fold cross-validation relative to the validation set approach: the validation set approach is conceptually simpler and easier to implement.
The LOOCV approach is a special case of k-fold cross-validation in which k = n.
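In this case the estimate becomes CV(n) = (1/n)(MSE1 + ... + MSEn), where each MSEi = (yi − ŷi)^2 is computed from the single held-out observation i.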
Advantage of k-fold cross-validation relative to LOOCV: LOOCV requires fitting the statistical learning method n times, which can be computationally expensive. Moreover, k-fold CV often gives more accurate estimates of the test error rate than LOOCV, because the n fits in LOOCV are trained on nearly identical data sets, so their errors are highly correlated and the LOOCV estimate has higher variance.
Disadvantage of k-fold cross-validation relative to LOOCV: if the main purpose is bias reduction, LOOCV should be preferred to k-fold CV, since it tends to have less bias (each training set contains n − 1 observations, almost as many as the full data set).
So there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically, using k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
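As a concrete illustration (a minimal sketch, assuming the boot and ISLR packages; cv.glm() performs LOOCV when K is left at its default of n):
library(ISLR)
library(boot)
fit <- glm(mpg ~ horsepower, data = Auto)        # linear regression via glm()
loocv.err <- cv.glm(Auto, fit)$delta[1]          # LOOCV: n model fits
set.seed(1)                                      # the 10-fold split is random
kfold.err <- cv.glm(Auto, fit, K = 10)$delta[1]  # only 10 model fits
c(LOOCV = loocv.err, kfold10 = kfold.err)
Both values estimate the same test MSE; the 10-fold version is far cheaper to compute.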
cv.qda <- function(data, model = Direction ~ Lag1 + Lag2, yname = "Direction",
                   K = 10, seed = 123) {
  n <- nrow(data)
  set.seed(seed)
  library(MASS)
  # partition the observations into K roughly equal folds:
  # repeat the fold labels 1:K enough times, then randomly assign n of them
  f <- ceiling(n / K)
  s <- sample(rep(1:K, f), n)
  # K-fold cross-validated error
  CV <- NULL
  for (i in 1:K) {
    test.index  <- seq_len(n)[s == i]  # test data
    train.index <- seq_len(n)[s != i]  # training data
    # fit the model on the training folds
    qda.fit <- qda(model, data = data[train.index, ])
    # observed test-set y
    qda.y <- data[test.index, yname]
    # predicted test-set y
    qda.predy <- predict(qda.fit, data[test.index, ])$class
    # misclassification rate on the held-out fold
    error <- mean(qda.y != qda.predy)
    CV <- c(CV, error)
  }
  # output
  list(call = model, K = K, qda_error_rate = mean(CV), seed = seed)
}
#Run this function on Smarket for two models.
library(ISLR)
data(Smarket)
# drop the variable Today, because Direction is determined from it
er1 <- cv.qda(data = Smarket[, -8], model = Direction ~ Lag1 + Lag2,
              yname = "Direction", K = 10, seed = 123)
er2 <- cv.qda(data = Smarket[, -8], model = Direction ~ ., yname = "Direction",
              K = 10, seed = 123)
er1;er2
## $call
## Direction ~ Lag1 + Lag2
##
## $K
## [1] 10
##
## $qda_error_rate
## [1] 0.4848
##
## $seed
## [1] 123
## $call
## Direction ~ .
##
## $K
## [1] 10
##
## $qda_error_rate
## [1] 0.508
##
## $seed
## [1] 123
er1$qda_error_rate;er2$qda_error_rate
## [1] 0.4848
## [1] 0.508
We now turn to the Auto data and compare LDA, QDA, and KNN using 10-fold CV.
library(ISLR)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
data_auto <- Auto[, -9]  # drop the name column
data_auto$origin <- factor(data_auto$origin,
                           labels = c("American", "European", "Japanese"))
summary(data_auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 American:245
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 European: 68
## Median :2804 Median :15.50 Median :76.00 Japanese: 79
## Mean :2978 Mean :15.54 Mean :75.98
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00
## Max. :5140 Max. :24.80 Max. :82.00
dim(data_auto)
## [1] 392 8
The cross-validated error function for LDA:
cv.lda <- function(data, model = origin ~ ., yname = "origin", K = 10, seed = 123) {
  n <- nrow(data)
  set.seed(seed)
  library(MASS)
  # partition the observations into K roughly equal folds
  f <- ceiling(n / K)
  s <- sample(rep(1:K, f), n)
  # K-fold cross-validated error
  CV <- NULL
  for (i in 1:K) {
    test.index  <- seq_len(n)[s == i]  # test data
    train.index <- seq_len(n)[s != i]  # training data
    # fit the model on the training folds
    lda.fit <- lda(model, data = data[train.index, ])
    # observed test-set y
    lda.y <- data[test.index, yname]
    # predicted test-set y
    lda.predy <- predict(lda.fit, data[test.index, ])$class
    # misclassification rate on the held-out fold
    error <- mean(lda.y != lda.predy)
    CV <- c(CV, error)
  }
  # output
  list(call = model, K = K, lda_error_rate = mean(CV), seed = seed)
}
The corresponding function for QDA:
cv.qda <- function(data, model = origin ~ ., yname = "origin", K = 10, seed = 123) {
  n <- nrow(data)
  set.seed(seed)
  library(MASS)
  # partition the observations into K roughly equal folds
  f <- ceiling(n / K)
  s <- sample(rep(1:K, f), n)
  # K-fold cross-validated error
  CV <- NULL
  for (i in 1:K) {
    test.index  <- seq_len(n)[s == i]  # test data
    train.index <- seq_len(n)[s != i]  # training data
    # fit the model on the training folds
    qda.fit <- qda(model, data = data[train.index, ])
    # observed test-set y
    qda.y <- data[test.index, yname]
    # predicted test-set y
    qda.predy <- predict(qda.fit, data[test.index, ])$class
    # misclassification rate on the held-out fold
    error <- mean(qda.y != qda.predy)
    CV <- c(CV, error)
  }
  # output
  list(call = model, K = K, qda_error_rate = mean(CV), seed = seed)
}
er_lda <- cv.lda(data = data_auto, model = origin ~ ., yname = "origin", K = 10, seed = 123)
er_lda$lda_error_rate
## [1] 0.2674595
er_qda <- cv.qda(data = data_auto, model = origin ~ ., yname = "origin", K = 10, seed = 123)
er_qda$qda_error_rate
## [1] 0.2066937
The LDA error rate is 0.2675 and the QDA error rate is 0.2067, so QDA achieves the lower cross-validated error rate on this data.
cv.knn <- function(dataY, dataX, kn = 1, K = 10, seed = 123) {
  n <- nrow(dataX)
  set.seed(seed)
  library(class)
  # partition the observations into K roughly equal folds
  f <- ceiling(n / K)
  s <- sample(rep(1:K, f), n)
  # standardize the predictors so no variable dominates the distance metric
  dataX <- scale(dataX)
  CV <- NULL; PvsO <- NULL
  for (i in 1:K) {
    test.index  <- seq_len(n)[s == i]  # test data
    train.index <- seq_len(n)[s != i]  # training data
    train.X <- dataX[train.index, ]
    test.X  <- dataX[test.index, ]
    train.y <- dataY[train.index]
    test.y  <- dataY[test.index]
    # predicted test-set y from the kn nearest neighbors
    knn.pred <- knn(train.X, test.X, train.y, k = kn)
    # misclassification rate on the held-out fold
    error <- mean(knn.pred != test.y)
    CV <- c(CV, error)
    # accumulate predicted vs. observed pairs for the confusion table
    predvsobs <- data.frame(knn.pred, test.y)
    PvsO <- rbind(PvsO, predvsobs)
  }
  # output (note: the list element k reports the number of folds K)
  list(k = K, knn_error_rate = mean(CV),
       confusion = table(PvsO[, 1], PvsO[, 2]), seed = seed)
}
cv.knn(dataY = data_auto$origin, dataX = data_auto[, -8], kn = 1, K = 10, seed = 123)
## $k
## [1] 10
##
## $knn_error_rate
## [1] 0.2576653
##
## $confusion
##
## American European Japanese
## American 210 17 22
## European 13 35 11
## Japanese 22 16 46
##
## $seed
## [1] 123
cv.error <- NULL
for (i in 1:100) {
  cv.error[i] <- cv.knn(dataY = data_auto$origin, dataX = data_auto[, -8],
                        kn = i, K = 10, seed = 123)$knn_error_rate
}
k <- which(cv.error == min(cv.error))
print(k)
## [1] 5
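To visualize the tuning (a minimal sketch using base graphics and the cv.error vector just computed):
plot(1:100, cv.error, type = "b",
     xlab = "kn (number of neighbors)", ylab = "10-fold CV error rate")
abline(v = k, lty = 2)  # mark the error-minimizing neighborhood size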
er_knn <- cv.knn(dataY = data_auto$origin, dataX = data_auto[, -8], kn = 5, K = 10, seed = 123)
er_knn
## $k
## [1] 10
##
## $knn_error_rate
## [1] 0.2399561
##
## $confusion
##
## American European Japanese
## American 214 15 18
## European 11 34 11
## Japanese 20 19 50
##
## $seed
## [1] 123
The best neighborhood size is kn = 5, chosen by minimizing the 10-fold CV error.
er_knn$knn_error_rate
## [1] 0.2399561
er_lda$lda_error_rate
## [1] 0.2674595
er_qda$qda_error_rate
## [1] 0.2066937
The results indicate that the KNN error rate (0.2400) is higher than the QDA error rate (0.2067) but lower than the LDA error rate (0.2675).
set.seed(123)
library(boot)
glm.fit <- glm(mpg ~ ., data = data_auto)
cv.error <- cv.glm(data_auto, glm.fit, K = 10)$delta[1]
cv.error  # 10-fold CV estimate of the test MSE
## [1] 11.33641
1 - cv.error / var(Auto$mpg)  # cross-validated R^2
## [1] 0.8139075
The cross-validated R^2 is 0.8139, which means that about 81.39% of the variance of mpg can be explained by the predictors.
Note: if we use glm() to fit a model without passing the family argument, it performs linear regression, just like the lm() function.
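As a quick check (a minimal sketch), the two fits should give coefficient estimates that agree up to numerical tolerance:
all.equal(coef(glm(mpg ~ ., data = data_auto)),
          coef(lm(mpg ~ ., data = data_auto)))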