a. Explain how k-fold cross-validation is implemented. The k-fold cross-validation is based on the principle of spliting and testing the dataset. For this purpose we first take the random splited data k out of n toal observation. These groups acts as a validation set and the remainder acts as a training set. The test error is then estimated by averaging the k resulting MSE estimates. b.what are the advantages and disadvantages of k fold crossvalidation relative to: i. Validation set approach: The disadvantages of validation set approach is the estimate of the test error rate can be highly variable depending upon selection of testing and validating sets. Since only a subset of the observations are used to fit the model, the validation set error may overestimate the test error rate for the model fit on the entire dataset. ii. LOOCV It has less bias. We repeatedly fit the statistical learning method using training data that contains n-1 observation.The validation approach produces different MSE because of the randomness in the splitting process. whereas in LOOCV, multiple attempt always give the same result because we split based on the 1 observation. Each time LOOCV is computationally intensive. We fit each model n number of times.
a.Fit a logistic regression model that uses income and balance to predict default.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.2
attach(Default)
set.seed(1)
fit.glm <- glm(default ~ income+balance, family='binomial')
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train <- sample(dim(Default)[1],dim(Default)[1]/2)
fit.glm <- glm(default ~ income + balance, data = Default, family = 'binomial', subset = train)
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
probs <- predict(fit.glm, newdata = Default[-train,], type = 'response')
pred.glm <- rep('No', length(probs))
pred.glm[probs >0.5] <- 'Yes'
mean(pred.glm != Default [-train,] $default)
## [1] 0.0254
The validation set approach gave us 2.54% test error rate.
train <- sample(dim(Default)[1], dim(Default)[1]/2)
fit.glm <- glm(default ~ income + balance, data = Default, family = 'binomial', subset = train )
probs <- predict (fit.glm, newdata = Default [-train,], type = 'response')
pred.glm <- rep('No', length(probs))
pred.glm[probs > 0.5] <- 'Yes'
mean(pred.glm != Default[-train,]$default)
## [1] 0.0274
train <- sample(dim(Default)[1], dim(Default)[1]/2)
fit.glm <- glm(default ~ income + balance, data = Default, family = 'binomial', subset = train )
probs <- predict (fit.glm, newdata = Default [-train,], type = 'response')
pred.glm <- rep('No', length(probs))
pred.glm[probs > 0.5] <- 'Yes'
mean(pred.glm != Default[-train,]$default)
## [1] 0.0244
train <- sample(dim(Default)[1], dim(Default)[1]/2)
fit.glm <- glm(default ~ income + balance, data = Default, family = 'binomial', subset = train )
probs <- predict (fit.glm, newdata = Default [-train,], type = 'response')
pred.glm <- rep('No', length(probs))
pred.glm[probs > 0.5] <- 'Yes'
mean(pred.glm != Default[-train,]$default)
## [1] 0.0244
We can see that three different splits produce a different resulting error rate which shows that the rate varies by which observation are in the training/validation sets.
train <- sample(dim(Default)[1],dim(Default)[1]/2)
fit.glm <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
pred.glm <- rep("No", length(probs))
probs <- predict(fit.glm, newdata = Default[-train,], type = "response")
pred.glm[probs > 0.5 ] <- "Yes"
mean(pred.glm != Default[-train, ] $default)
## [1] 0.0278
We can see that including a dummy variable did not result in a reduction in the test error rate.
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
set.seed(1)
attach(Default)
## The following objects are masked from Default (pos = 3):
##
## balance, default, income, student
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The glm() estimates of the standard errors for the coefficients intercept = 4.348e-01, income =4.985e-06 and balance = 2.274e-04respectively.
boot.fn <- function(data, index){
fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return (coef(fit))
}
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The bootstrap estimates of the standard errors for the coefficients are: B0 =4.340879e-01, B1=4.946329e-06 and B2=2.277448e-04.
The estimated standard errors obtained by each of the two methods are very similar.
We will now consider the Boston housing data set from the MASS library. a. Based on this data set, provide an estimate for the population mean of medv. Call this estimate m.hat
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.2
attach(Boston)
m.hat <- mean(medv)
m.hat
## [1] 22.53281
s.hat <- sd(medv) /sqrt(dim(Boston)[1])
s.hat
## [1] 0.4088611
set.seed(1)
boot.fn <- function(data, index){
m <- mean(data [index])
return(m)
}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The bootstrap estimated standard error of m.hat of 0.410662 is similar to the estimate found in b) of 0.4088611
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
CI.m.hat <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4110)
CI.m.hat
## [1] 21.7062 23.3520
We observed the confidence interval is very close to the one provided by the t.test() function.
med.hat <- median(medv)
med.hat
## [1] 21.2
boot.fn <- function(data, index){
mu <- median(data[index])
return (mu)
}
boot(medv,boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
we got estimated median = 21.2 which is equal to the value obtained in (e), with a standard error of 0.377 which is relatively smalle compared to median value.
percent.hat <- quantile(medv,c(0.1))
percent.hat
## 10%
## 12.75
h.Use the bootstrap to estimate the standard error of (mu-hat)0.1. Comment on your findings.
boot.fn <- function(data, index){
mu <- quantile (data[index], c(0.1))
return(mu)
}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
We get an estimated 10th percentile value of 12.75 which is equal to the value obtained in (g) with a standard error of 0.5113 which is relatively small compared to percentile value