This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Load Libraries
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
library(boot)
# Question 3 We now review k-fold cross-validation.
# (a) Explain how k-fold cross-validation is implemented.
# K-fold cross-validation is implemented by initially dividing the dataset into
# K groups or folds that are equal in observation size. Each fold is
# used once in the test set and should utilize the rest of the k-1 folds as a
# training set to fit the model. The performance is then evaluated on the
# trained model based on the remaining fold that was not utilize during the
# training process. Once the training and testing in the model occurs on each
# fold, different performance metrics like accuracy, F1 Score, ect are
# generated for each test set. The training and evaluation steps are repeated
# for each of the k folds, and after all the iterations of k, the results
# that are produced from these iterations are then averaged to compute the
# overall performance of the model. This provides a generalized estimate of the
# models overall performance.
# (b) What are the advantages and disadvantages of k-fold cross validation
# relative to:
# i. The validation set approach?
# The advantages of the k fold approach compared to the validation set approach
# are that in the k-fold approach all the data is used for both
# training and testing in the model so it utilizes all available data
# for efficient. This approach produces are more reliable approximation
# of the model's overall performance compared to the validation set approach.
# The disadvantage of the k fold cross validation compared to validation set
# is that k0fold cross validation is computationally more expensive because
# it requires running the model many times compared to a single split that is
# is utilized in validation set approach. Additionally the k fold may be more
# time consuming because it requires training the models multiple times
# and this can be slower compared to validation set approach.
# ii. LOOCV?
# The advantages of the k fold approach compared to LOOCV approach
# are that in the k-fold the training of the model occurs based on k amount
# of times and is more computationally efficient compared to LOOCV because
# in LOOCV the training of the model occurs for each data point and thus
# it becomes taxing especially in large datasets. As a result K-fold
# has a faster execution compared to LOOCV especially in large datasets.
# The disadvantage of the k fold cross validation compared to LOOCV is that it
# is not as accurate as LOOCV because LOOCV test the model on each data point
# while K fold test the model on a smaller subset of data points in each fold
# and thus is not as accurate as LOOCV. Furthermore, LOOCV will produce a model
# that can be more generalized especially if the value of K in k fold is not
# chosen properly which can dramatically influence results.
# Question 5
# In Chapter 4, we used logistic regression to predict the probability of
# default using income and balance on the Default data set. We will
# now estimate the test error of this logistic regression model using the
# validation set approach. Do not forget to set a random seed before
# beginning your analysis.
# Part (a) Fit a logistic regression model that uses income and balance to
# predict default.
data("Default")
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# Question 5 Part (a) Fit a logistic regression model that uses income and
# balance to predict default. Do not forget to set a random seed before
# beginning your analysis.
set.seed(7)
log_model1 = glm(default ~ income + balance, data = Default, family = binomial)
summary(log_model1)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Question 5 part (b) Using the validation set approach, estimate the test
# error of this model. In order to do this, you must perform the following
# steps: i. Split the sample set into a training set and a validation set.
set.seed(7)
train = sample(nrow(Default), size = nrow(Default) * 0.5)
train_default = Default[train,]
test_default = Default[-train,]
dim(train_default)
## [1] 5000 4
dim(test_default)
## [1] 5000 4
# Question 5 part (b) ii. Fit a multiple logistic regression model using only
# the training observations.
set.seed(7)
log_model2 = glm(default ~ income + balance, data = Default, family = binomial,
subset = train)
summary(log_model2)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default, subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.119e+01 5.867e-01 -19.079 < 2e-16 ***
## income 1.818e-05 6.800e-06 2.674 0.00751 **
## balance 5.546e-03 3.079e-04 18.011 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1536.99 on 4999 degrees of freedom
## Residual deviance: 839.24 on 4997 degrees of freedom
## AIC: 845.24
##
## Number of Fisher Scoring iterations: 8
# Question 5 part (b) iii. Obtain a prediction of default status for each
#individual in the validation set by computing the posterior probability of
# default for that individual, and classifying the individual to
# the default category if the posterior probability is greater
# than 0.5.
attach(Default)
log_probs = predict(log_model2, test_default, type="response")
log_pred = rep("No", 5000)
log_pred[log_probs>0.5] = "Yes"
default_test = default[-train]
table(log_pred, default_test)
## default_test
## log_pred No Yes
## No 4823 99
## Yes 22 56
# Question 5 part (b) iv. Compute the validation set error, which is the
# fraction of the observations in the validation set that are misclassified.
mean(log_pred!=default_test)
## [1] 0.0242
# Question 5 part (c) Repeat the process in (b) three times, using three
# different splits of the observations into a training set and a validation set.
# Comment on the results obtained.
# First split will be a 60-40 split
set.seed(7)
train = sample(nrow(Default), size = nrow(Default) * 0.6)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)
## default_test
## glm_pred No Yes
## No 3860 76
## Yes 17 47
mean(glm_pred!=default_test)
## [1] 0.02325
# From the result generated the validation set from a 60-40 split
# produced a test error of 2.33%
# Question 5 part (c)
# Second split will be a 70-30 split
set.seed(7)
train = sample(nrow(Default), size = nrow(Default) * 0.7)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)
## default_test
## glm_pred No Yes
## No 2895 55
## Yes 14 36
mean(glm_pred!=default_test)
## [1] 0.023
# From the result generated the validation set from a 70-30 split
# produced a test error of 2.3%
# Question 5 part (c)
# Third split will be a 80-20 split
set.seed(7)
train = sample(nrow(Default), size = nrow(Default) * 0.8)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)
## default_test
## glm_pred No Yes
## No 1926 38
## Yes 10 26
mean(glm_pred!=default_test)
## [1] 0.024
# From the result generated the validation set from a 80-20 split
# produced a test error of 2.4%
# Question 5
# (d) Now consider a logistic regression model that predicts the probability
# of default using income, balance, and a dummy variable
# for student. Estimate the test error for this model using the validation
# set approach. Comment on whether or not including a
# dummy variable for student leads to a reduction in the test error
# rate.
set.seed(7)
train = sample(nrow(Default), size = nrow(Default) * 0.5)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_default1 = glm(default ~ income + balance + student, data = Default,
family = binomial, subset = train)
glm_probs = predict(glm_default1, test_default, type = "response")
glm_pred = rep("No", length(default_test))
glm_pred[glm_probs>0.5] = "Yes"
table(glm_pred, default_test)
## default_test
## glm_pred No Yes
## No 4824 100
## Yes 21 55
mean(glm_pred!=default_test)
## [1] 0.0242
# The validation approach without "student" dummy variable present produced
# a test error of 2.42% and if "student" is included the test error is still
# 2.42%. As a result the inclusion of the dummy variable for student did not
# lead to a decrease in the test error rate.
# Question 6
# We continue to consider the use of a logistic regression model to
# predict the probability of default using income and balance on the
# Default data set. In particular, we will now compute estimates for
# the standard errors of the income and balance logistic regression coefficients
# in two different ways: (1) using the bootstrap, and (2) using
# the standard formula for computing the standard errors in the glm()
# function. Do not forget to set a random seed before beginning your
# analysis.
# Part (a) Using the summary() and glm() functions, determine the estimated
# standard errors for the coefficients associated with income
# and balance in a multiple logistic regression model that uses
# both predictors.
set.seed(7)
glm_q6 = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_q6)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Based on the summary function of the model the glm()
# estimated standard errors for the coefficients
# intercept, income and balance are 4.348e-01, 4.985e-06 and
# 2.274e-04 respectively.
# Question 6 (b) Write a function, boot.fn(), that takes as input the
# Default data set as well as an index of the observations, and that outputs the
# coefficient estimates for income and balance in the multiple logistic
# regression model.
boot_fn = function(data, index){
fit = glm(default ~ income +balance, data=data, family=binomial, subset=index)
return(coef(fit))
}
boot_fn(Default,1:10000)
## (Intercept) income balance
## -1.154047e+01 2.080898e-05 5.647103e-03
# Question 6 (c) Use the boot() function together with your boot.fn()
# function to estimate the standard errors of the logistic regression
# coefficients for income and balance.
boot(Default,boot_fn,500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -1.849433e-03 4.524018e-01
## t2* 2.080898e-05 2.692252e-08 4.904948e-06
## t3* 5.647103e-03 -1.613802e-06 2.369834e-04
# Based on the
# estimated standard errors for the coefficients are 4.347e-01, 4.971e-06 and
# 2.301e-04 respectively.
# Question 6 Part (d) Comment on the estimated standard errors obtained using
# the glm() function and using your bootstrap function.
# The standard errors obtained using the glm() function for the coefficients
# intercept, income and balance are 4.348e-01, 4.985e-06 and
# 2.274e-04 respectively. For the bootstrap function the standard errors
# obtained for B0, B1, and B2 are 4.347e-01, 4.971e-06 and
# 2.301e-04 respectively.
# Comparing these the two results from the different functions, the estimated
# standard errors that are obtained form both methods are very similar.
# The very slight difference between the two may be attributed due to
# differences in methodology between the two different functions with glm()
# relying more on distributional assumptions while bootstrap is more empirical.
# Detach the default dataset.
detach(Default)
# Question 9 We will now consider the Boston housing data set, from the ISLR2
# library.(a) Based on this data set, provide an estimate for the population
# mean of medv. Call this estimate u^.
attach(Boston)
mu_hat = mean(medv)
mu_hat
## [1] 22.53281
# The estimate for the population mean of medv denoted by u^ (mu hat) is
# 22.53281
# Question 9 part (b) Provide an estimate of the standard error of u^.
# Interpret this result.
# Hint: We can compute the standard error of the sample mean by
# dividing the sample standard deviation by the square root of the
# number of observations.
stderr_mean = sd(medv)/sqrt(length(medv))
stderr_mean
## [1] 0.4088611
# The estimate of the standard error of u^ (sample mean) is 0.4088611
# Question 9 part (c) Now estimate the standard error of u^ using the bootstrap.
# How does this compare to your answer from (b)?
set.seed(7)
boot_fn=function(data, index){
mu=mean(data[index])
return (mu)
}
boot(medv,boot_fn, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_fn, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.01973043 0.4083676
# Question 9 part (c)
# The standard error using bootstrap is 0.4083676 which is very similar to
# the standard error from part b which is 0.4088611. This makes sense because
# both methods are estimating variability of the sample mean using the same
# concept. The very slight difference between these two values can be
# attributed to the how in the bootstrap method randomness is introduced
# because it inolves resampling and recalculating the sample mean on each
# bootstrap sample that is present.
# Question 9 part (d) Based on your bootstrap estimate from (c), provide a
# 95% confidence interval for the mean of medv. Compare it to the results
# obtained using t.test(Boston$medv).
# Hint: You can approximate a 95% confidence interval using the
# formula
CI_mu_hat = c(22.53 - 2 * 0.408, 22.53 + 2 * 0.408)
CI_mu_hat
## [1] 21.714 23.346
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# The 95% confidence interval from the bootstrap is between 21.714 23.346
# and using t=test the confidence interval is 22.53. The results from the
# two methods are are almost the same from one another.
# Question 9 part (e) Based on this data set, provide an estimate, ^umed, for
# the median value of medv in the population.
median(medv)
## [1] 21.2
# Question 9 part(f) We now would like to estimate the standard error of ^umed
# Unfortunately, there is no simple formula for computing the standard
# error of the median. Instead, estimate the standard error of the
# median using the bootstrap. Comment on your findings.
set.seed(7)
boot_fn2 = function(data, index){
mu=median(data[index])
return (mu)
}
boot(medv,boot_fn2,500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_fn2, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0064 0.3740171
# Using the bootstrap method, the median value is determined to be 21.2 which
# is equal to the value that is determined in part (e). In the bootstrap method
# the standard error is 0.3740171.
# Question 9 (g) Based on this data set, provide an estimate for the
# tenth percentile of medv in Boston census tracts. Call this quantity ^u0.1
# (You can use the quantile() function.)
mu_tenth_medv = quantile(medv, c(0.1))
mu_tenth_medv
## 10%
## 12.75
# Question 9 (h) Use the bootstrap to estimate the standard error of ^u0.1
# Comment on your findings.
set.seed(7)
boot_fn3=function(data, index){
mu=quantile(data[index],0.1)
return (mu)
}
boot(medv, boot_fn3, 500)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot_fn3, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0152 0.4880218
# The estimated tenth percentile that is determined is 12.75 which is the same
# value that is determined in the previous part (g). The standard error from the
# bootstrap method is 0.4880218.