Cross validation

R Markdown

Q1. We now review k-fold cross-validation.

Explain how k-fold cross-validation is implemented.

The k-fold cross validation is implemented by randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k???1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1,MSE2, . .,MSEk. The k-fold CV estimate is computed by averaging these values.

What are the advantages and disadvantages of k-fold cross-validation relative to:
The validation set approach?

Advantage of k-fold cross validation relative to the validation set: The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set. Moreover, validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Disadvantage of k-fold cross validation relative to the validation set: Validation set approach is conceptually simple and easy to implement.

LOOCV?

The LOOCV cross-validation approach is a special case of k-fold cross-validation in which k=n.

Advantage of k-fold cross validation relative to LOOCV: LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive. Moreover, k-fold CV often gives more accurate estimates of the test error rate than does LOOCV.

Disadvantage of k-fold cross validation relative to LOOCV: If the main purpose bias reduction, LOOCV should be preffered to k-fold CV since it tends to has less bias.

So, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically using k=5 or k=10 yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

2. Build R function for K-fold Cross-Validated error for QDA

cv.qda <-
  function (data, model=Direction~Lag1+Lag2, yname="Direction", K=10, seed=123) {
    n <- nrow(data)
    set.seed(seed)
    datay=data[,yname] #response variable
    library(MASS)
    #partition the data into K subsets
    f <- ceiling(n/K)
    s <- sample(rep(1:K, f), n)  
    #generate indices 1:10 and sample n of them  
    # K fold cross-validated error
    
    CV=NULL
    
    for (i in 1:K) { #i=1
      test.index <- seq_len(n)[(s == i)] #test data
      train.index <- seq_len(n)[(s != i)] #training data
      
      #model with training data
      qda.fit=qda(model, data=data[train.index,])
      #observed test set y
      qda.y <- data[test.index, yname]
      #predicted test set y
      qda.predy=predict(qda.fit, data[test.index,])$class
      
      #observed - predicted on test data
      error= mean(qda.y!=qda.predy)
      #error rates 
      CV=c(CV,error)
    }
    #Output
    list(call = model, K = K, 
         qda_error_rate = mean(CV), seed = seed)  
  }

#Run this function on Smarket for two models.
library(ISLR)

## Warning: package 'ISLR' was built under R version 3.3.2

data(Smarket)
#need to eliminate variable Today b/c Direction is determined by it
er1=cv.qda(data=Smarket[,-8],model=Direction~Lag1+Lag2, yname="Direction", K=10, seed=123)

## Warning: package 'MASS' was built under R version 3.3.2

er2=cv.qda(data=Smarket[,-8], model=Direction~. , yname="Direction", K=10, seed=123)
er1;er2

## $call
## Direction ~ Lag1 + Lag2
## 
## $K
## [1] 10
## 
## $qda_error_rate
## [1] 0.4848
## 
## $seed
## [1] 123

## $call
## Direction ~ .
## 
## $K
## [1] 10
## 
## $qda_error_rate
## [1] 0.508
## 
## $seed
## [1] 123

er1$qda_error_rate;er2$qda_error_rate

## [1] 0.4848

## [1] 0.508

3. Auto data set is used for classification of origin of car from the other characteristics.

Eliminate name variable from Auto data and change the origin variable to factor variable with labels of country they are from.

library(ISLR)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

data_auto=Auto[,-9]
data_auto$origin <- factor(data_auto$origin,labels=c("American","European","Japanese"))
summary(data_auto)

##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##      weight      acceleration        year            origin   
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   American:245  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   European: 68  
##  Median :2804   Median :15.50   Median :76.00   Japanese: 79  
##  Mean   :2978   Mean   :15.54   Mean   :75.98                 
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00                 
##  Max.   :5140   Max.   :24.80   Max.   :82.00

dim(data_auto)

## [1] 392   8

b) Compare the LDA, QDA using 10 fold-CV error.

LDA function

cv.lda <-
  function (data, model=origin~., yname="origin", K=10, seed=123) {
    n <- nrow(data)
    set.seed(seed)
    datay=data[,yname] #response variable
    library(MASS)
    #partition the data into K subsets
    f <- ceiling(n/K)
    s <- sample(rep(1:K, f), n)  
    #generate indices 1:10 and sample n of them  
    # K fold cross-validated error
    
    CV=NULL
    
    for (i in 1:K) { #i=1
      test.index <- seq_len(n)[(s == i)] #test data
      train.index <- seq_len(n)[(s != i)] #training data
      
      #model with training data
      lda.fit=lda(model, data=data[train.index,])
      #observed test set y
      lda.y <- data[test.index, yname]
      #predicted test set y
      lda.predy=predict(lda.fit, data[test.index,])$class
      
      #observed - predicted on test data
      error= mean(lda.y!=lda.predy)
      #error rates 
      CV=c(CV,error)
    }
    #Output
    list(call = model, K = K, 
         lda_error_rate = mean(CV), seed = seed)  
  }

QDA function

cv.qda <-
  function (data, model=origin~., yname="origin", K=10, seed=123) {
    n <- nrow(data)
    set.seed(seed)
    datay=data[,yname] #response variable
    library(MASS)
    #partition the data into K subsets
    f <- ceiling(n/K)
    s <- sample(rep(1:K, f), n)  
    #generate indices 1:10 and sample n of them  
    # K fold cross-validated error
    
    CV=NULL
    
    for (i in 1:K) { #i=1
      test.index <- seq_len(n)[(s == i)] #test data
      train.index <- seq_len(n)[(s != i)] #training data
      
      #model with training data
      qda.fit=qda(model, data=data[train.index,])
      #observed test set y
      qda.y <- data[test.index, yname]
      #predicted test set y
      qda.predy=predict(qda.fit, data[test.index,])$class
      
      #observed - predicted on test data
      error= mean(qda.y!=qda.predy)
      #error rates 
      CV=c(CV,error)
    }
    #Output
    list(call = model, K = K, 
         qda_error_rate = mean(CV), seed = seed)  
  }

Comparison of LDA and QDA error rate

er_lda=cv.lda(data=data_auto,model=origin~., yname="origin", K=10, seed=123)
er_lda$lda_error_rate

## [1] 0.2674595

er_qda=cv.qda(data=data_auto,model=origin~., yname="origin", K=10, seed=123)
er_qda$qda_error_rate

## [1] 0.2066937

LDA error rate is 0.2675 and QDA error rate is 0.2193.The result indicates that QDA error rate is lower than LDA error rate.

c) Find the best k size (neighborhood size) using 10 fold-CV errors.

cv.knn<- function (dataY, dataX, kn=1, K=10, seed=123) {
  n <- nrow(dataX)
  set.seed(seed)
  library(class)
  
  f <- ceiling(n/K)
  s <- sample(rep(1:K, f), n)  
  dataX=scale(dataX)
  CV=NULL;PvsO=NULL
  
  for (i in 1:K) { 
    test.index <- seq_len(n)[(s == i)] #test data
    train.index <- seq_len(n)[(s != i)] #training data
   
    train.X <- dataX[train.index,]
    test.X <- dataX[test.index,]
    train.y <- dataY[train.index]
    test.y <- dataY[test.index]
    #predicted test set y
    knn.pred=knn(train.X, test.X, train.y, k=kn) 
    #observed - predicted on test data 
    error= mean(knn.pred!=test.y) 
    #error rates 
    CV=c(CV,mean(error))
    predvsobs=data.frame(knn.pred,test.y)
    PvsO=rbind(PvsO,predvsobs)
  } 
  
  #Output
  list(k = K,
       knn_error_rate = mean(CV), confusion=table(PvsO[,1],PvsO[,2]), seed=seed)
}

cv.knn(dataY=data_auto$origin, dataX=data_auto[,-8], kn=1, K=10, seed=123)

## $k
## [1] 10
## 
## $knn_error_rate
## [1] 0.2576653
## 
## $confusion
##           
##            American European Japanese
##   American      210       17       22
##   European       13       35       11
##   Japanese       22       16       46
## 
## $seed
## [1] 123

cv.error=NULL
for (i in 1:100) {
  cv.error[i] <- cv.knn(dataY=data_auto$origin, dataX=data_auto[,-8], kn=i, 
                        K=10, seed=123)$knn_error_rate
 
}
k=which(cv.error==min(cv.error))
print(k)

## [1] 5

er_knn=cv.knn(dataY=data_auto$origin, dataX=data_auto[,-8], kn=5, K=10, seed=123)
er_knn

## $k
## [1] 10
## 
## $knn_error_rate
## [1] 0.2399561
## 
## $confusion
##           
##            American European Japanese
##   American      214       15       18
##   European       11       34       11
##   Japanese       20       19       50
## 
## $seed
## [1] 123

The best k size (neighborhood size) is 5 using 10 fold-CV errors

Compare the best knn model with LDA and QDA.

er_knn$knn_error_rate

## [1] 0.2399561

er_lda$lda_error_rate

## [1] 0.2674595

er_qda$qda_error_rate

## [1] 0.2066937

The result indicate that error rate of knn is more than QDA but less than LDA.

4. Find the prediction R^2 using cv.glm for auto data with regression model “mpg ~ .” .

set.seed(123)
library(boot)
glm.fit=glm(mpg~.,data=data_auto)
cv.error= cv.glm(data_auto, glm.fit,K=10)$delta[1]
cv.error #MSE for each model

## [1] 11.33641

1-cv.error/var(Auto$mpg) #R^2

## [1] 0.8139075

R^2 is 0.8139 which means than 81.39% of the variance of mpg can be explained by the pedictors

Note- If we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function.

```