Problem 3 - Part A
k-fold cross-validation is implemented by randomly dividing the set of observations into k groups (folds) of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k-1 folds. The mean squared error is then computed on the observations in the held-out fold. This procedure is repeated k times, with a different fold serving as the validation set each time, and the k resulting MSE estimates are averaged to give the k-fold CV estimate.
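To make the procedure concrete, here is a minimal sketch of k-fold CV written by hand. It assumes a hypothetical data frame df with a numeric response y, and uses a linear model purely for illustration.
set.seed(1)
k <- 5
n <- nrow(df)                                # df and y are hypothetical placeholders
folds <- sample(rep(1:k, length.out = n))    # randomly assign each observation to a fold
mse <- numeric(k)
for (i in 1:k) {
  fit <- lm(y ~ ., data = df[folds != i, ])            # fit on the remaining k-1 folds
  pred <- predict(fit, newdata = df[folds == i, ])
  mse[i] <- mean((df$y[folds == i] - pred)^2)          # MSE on the held-out fold
}
mean(mse)                                    # the k-fold CV estimate of the test MSE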
Problem 3 - Part B
An advantage of k-fold cross-validation over the validation set approach is that the validation set approach tends to overestimate the test error rate, because the model is trained on only a subset of the observations; k-fold CV repeats the fit k times and ultimately uses every observation for both training and validation. The disadvantage is computational cost: the model must be fit k times, which can be expensive for large data sets.
An advantage of k-fold cross-validation over leave-one-out cross-validation (LOOCV) is computational expense: LOOCV fits the model n times, once per observation, which can be prohibitive for large data sets. LOOCV has the advantage of always yielding the same result, since there is no randomness in the splits, but its error estimate has higher variance than the k-fold estimate because the n trained models are highly correlated with one another.
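In practice, both variants are available through cv.glm() in the boot package; LOOCV is simply the default K = n case. A brief sketch, assuming a hypothetical data frame df and a fitted glm object named fit:
library(boot)
cv10  <- cv.glm(df, fit, K = 10)$delta[1]   # 10-fold CV: the model is refit 10 times
cvLoo <- cv.glm(df, fit)$delta[1]           # LOOCV: default K = n, so n refits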
Problem 5 - Part A
library(ISLR)   # Default data set
library(boot)   # boot() and cv.glm()
default <- Default
attach(default)
## The following object is masked _by_ .GlobalEnv:
##
## default
In this chunk, I am creating a logistic regression model that uses income and balance to predict default.
set.seed(1)
defaultGLM <- glm(default~income+balance, data=default, family=binomial)
Problem 5 - Part B
set.seed(1)
split <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain <- default[split,]
defaultTest <- default[-split,]
defaultGLM.2 <- glm(default~income+balance, data=defaultTrain, family=binomial)
defaultProb <- predict(defaultGLM.2, defaultTest, type="response")
defaultPred <- ifelse(defaultProb > 0.5, "Yes", "No")
table(defaultPred, defaultTest$default)
##
## defaultPred No Yes
## No 4824 108
## Yes 19 49
2.54% of the observations in the validation set are misclassified.
mean(defaultTest$default != defaultPred)
## [1] 0.0254
Problem 5 - Part C
Repeating the process three times with different random splits of the observations does not substantially change the test error rate: in each case the error rate falls between 2.4% and 2.9%, averaging approximately 2.6%.
set.seed(7)
split.2 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.2 <- default[split.2,]
defaultTest.2 <- default[-split.2,]
defaultGLM.3 <- glm(default~income+balance, data=defaultTrain.2, family=binomial)
defaultProb.2 <- predict(defaultGLM.3, defaultTest.2, type="response")
defaultPred.2 <- ifelse(defaultProb.2 > 0.5, "Yes", "No")
table(defaultPred.2, defaultTest.2$default)
##
## defaultPred.2 No Yes
## No 4823 99
## Yes 22 56
mean(defaultTest.2$default != defaultPred.2)
## [1] 0.0242
set.seed(10)
split.3 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.3 <- default[split.3,]
defaultTest.3 <- default[-split.3,]
defaultGLM.4 <- glm(default~income+balance, data=defaultTrain.3, family=binomial)
defaultProb.3 <- predict(defaultGLM.4, defaultTest.3, type="response")
defaultPred.3 <- ifelse(defaultProb.3 > 0.5, "Yes", "No")
table(defaultPred.3, defaultTest.3$default)
##
## defaultPred.3 No Yes
## No 4815 125
## Yes 19 41
mean(defaultTest.3$default != defaultPred.3)
## [1] 0.0288
set.seed(4)
split.4 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.4 <- default[split.4,]
defaultTest.4 <- default[-split.4,]
defaultGLM.5 <- glm(default~income+balance, data=defaultTrain.4, family=binomial)
defaultProb.4 <- predict(defaultGLM.5, defaultTest.4, type="response")
defaultPred.4 <- ifelse(defaultProb.4 > 0.5, "Yes", "No")
table(defaultPred.4, defaultTest.4$default)
##
## defaultPred.4 No Yes
## No 4815 107
## Yes 21 57
mean(defaultTest.4$default != defaultPred.4)
## [1] 0.0256
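The three repetitions above could equally be condensed into a single loop over the seeds; a sketch, assuming the same split-and-fit logic:
sapply(c(7, 10, 4), function(s) {
  set.seed(s)
  idx <- sample(nrow(default), nrow(default) * 0.5)
  fit <- glm(default ~ income + balance, data = default[idx, ], family = binomial)
  prob <- predict(fit, default[-idx, ], type = "response")
  mean(default$default[-idx] != ifelse(prob > 0.5, "Yes", "No"))  # test error rate
})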
Problem 5 - Part D
set.seed(1)
split.5 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.5 <- default[split.5,]
defaultTest.5 <- default[-split.5,]
defaultGLM.6 <- glm(default~income+balance+student, data=defaultTrain.5, family=binomial)
defaultProb.5 <- predict(defaultGLM.6, defaultTest.5, type="response")
defaultPred.5 <- ifelse(defaultProb.5 > 0.5, "Yes", "No")
table(defaultPred.5, defaultTest.5$default)
##
## defaultPred.5 No Yes
## No 4825 112
## Yes 18 45
The inclusion of the student variable does not seem to have a noticeable effect on the test error rate. The test error rate is 2.6%, which is very similar to the rates for the previous models.
mean(defaultTest.5$default != defaultPred.5)
## [1] 0.026
Problem 6 - Part A
In this GLM, the standard errors for the coefficients on income and balance are 4.985e-06 and 2.274e-04, respectively.
defaultGLM.7 <- glm(default~income+balance, data=default, family=binomial)
summary(defaultGLM.7)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Problem 6 - Part B
In this chunk, I am creating the statistic function for boot(), which returns the coefficient estimates for income and balance from the logistic regression model.
defaultBoot.fn <- function(data, index)
return(coef(glm(default~income+balance, data=data, family=binomial, subset=index)))
Problem 6 - Part C
In this chunk, I am using boot() to estimate the standard errors of the logistic regression coefficients for income (4.866284e-06) and balance (2.298949e-04).
set.seed(1)
boot(default, defaultBoot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = default, statistic = defaultBoot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
Problem 6 - Part D
The bootstrap standard errors for income and balance are 4.866284e-06 and 2.298949e-04, respectively. These values are very close to the standard errors reported by glm() (4.985e-06 and 2.274e-04).
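The two sets of standard errors can also be compared side by side; a small sketch, assuming the boot() call from Part C was saved to a hypothetical object defaultBoot:
seGlm  <- summary(defaultGLM.7)$coefficients[, "Std. Error"]  # formula-based SEs
seBoot <- apply(defaultBoot$t, 2, sd)                         # SEs from the bootstrap replicates
cbind(glm = seGlm, bootstrap = seBoot)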
Problem 9 - Part A
library(MASS)   # Boston data set
boston <- Boston
attach(boston)
In this chunk, I create the variable medvMu, which is the estimate of the population mean of medv in the Boston data set.
medvMu <- mean(medv)
medvMu
## [1] 22.53281
Problem 9 - Part B
In this chunk, I obtain the standard error estimate of medvMu.
medvSE <- sd(medv)/sqrt(length(medv))   # standard error of the sample mean
medvSE
## [1] 0.4088611
Problem 9 - Part C
bostonBoot.fn <- function(data, index)
return(mean(data[index]))
The standard error obtained from the bootstrap is 0.4106622, which is only slightly higher than the estimate of 0.4088611 obtained in Part B.
set.seed(1)
bostonBoot <- boot(medv, bostonBoot.fn, 1000)
bostonBoot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
Problem 9 - Part D
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The confidence interval obtained from the bootstrap below is nearly identical to the confidence interval from the one-sample t-test (the bounds differ by less than 0.02).
medvCI <- c(22.53281 - 2 * 0.4106622, 22.53281 + 2 * 0.4106622)
medvCI
## [1] 21.71149 23.35413
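An alternative to the mean ± 2·SE interval above is boot.ci() from the boot package, which can compute, for example, a percentile interval directly from the bostonBoot object saved in Part C:
boot.ci(bostonBoot, type = "perc")   # percentile bootstrap CI for the mean of medv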
Problem 9 - Part E
In this chunk, I obtain the estimate for the median value of medv.
medvMedian <- median(medv)
medvMedian
## [1] 21.2
Problem 9 - Part F
bostonBoot.fn.2 <- function(data, index)
return(median(data[index]))
The estimated standard error of the median obtained from the bootstrap is 0.3778075. The median value of 21.2 is identical to the median obtained in Part E. Relative to the median, the estimated standard error is quite small.
set.seed(1)
boot(medv, bostonBoot.fn.2, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn.2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
Problem 9 - Part G
In this chunk, I am creating the variable that holds the estimate of the tenth percentile of medv.
medvMu0.1 <- quantile(medv, 0.1)
medvMu0.1
## 10%
## 12.75
Problem 9 - Part H
bostonBoot.fn.3 <- function(data, index)
return(quantile(data[index], 0.1))
The estimated standard error of the tenth percentile of medv obtained from the bootstrap is 0.4767526. Relative to the tenth-percentile value, this standard error is quite small.
set.seed(1)
boot(medv, bostonBoot.fn.3, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn.3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526