Assignment4

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Load Libraries 
library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.3

library(boot)

# Question 3 We now review k-fold cross-validation.

# (a) Explain how k-fold cross-validation is implemented.
# K-fold cross-validation is implemented by initially dividing the dataset into
# K groups or folds that are equal in observation size. Each fold is 
# used once in the test set and should utilize the rest of the k-1 folds as a
# training set to fit the model. The performance is then evaluated on the
# trained model based on the remaining fold that was not utilize during the 
# training process. Once the training and testing in the model occurs on each
# fold, different performance metrics like accuracy, F1 Score, ect are
# generated for each test set. The training and evaluation steps are repeated
# for each of the k folds, and after all the iterations of k, the results
# that are produced from these iterations are then averaged to compute the 
# overall performance of the model. This provides a generalized estimate of the
# models overall performance. 


# (b) What are the advantages and disadvantages of k-fold cross validation
# relative to:
# i. The validation set approach?
# The advantages of the k fold approach compared to the validation set approach 
# are that in the k-fold approach all the data is used for both 
# training and testing in the model so it utilizes all available data
# for efficient. This approach produces are more reliable approximation 
# of the model's overall performance compared to the validation set approach. 
# The disadvantage of the k fold cross validation compared to validation set 
# is that k0fold cross validation is computationally more expensive because
# it requires running the model many times compared to a single split that is
# is utilized in validation set approach. Additionally the k fold may be more
# time consuming because it requires training the models multiple times
# and this can be slower compared to validation set approach. 

# ii. LOOCV?
# The advantages of the k fold approach compared to LOOCV approach 
# are that in the k-fold the training of the model occurs based on k amount
# of times and is more computationally efficient compared to LOOCV because 
# in LOOCV the training of the model occurs for each data point and thus
# it becomes taxing especially in large datasets. As a result K-fold 
# has a faster execution compared to LOOCV especially in large datasets. 
# The disadvantage of the k fold cross validation compared to LOOCV is that it
# is not as accurate as LOOCV because LOOCV test the model on each data point 
# while K fold test the model on a smaller subset of data points in each fold 
# and thus is not as accurate as LOOCV. Furthermore, LOOCV will produce a model
# that can be more generalized especially if the value of K in k fold is not
# chosen properly which can dramatically influence results.

# Question 5 
# In Chapter 4, we used logistic regression to predict the probability of
# default using income and balance on the Default data set. We will
# now estimate the test error of this logistic regression model using the
# validation set approach. Do not forget to set a random seed before
# beginning your analysis.

# Part (a) Fit a logistic regression model that uses income and balance to
# predict default.

data("Default")
summary(Default)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

str(Default)

## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

# Question 5 Part (a) Fit a logistic regression model that uses income and 
# balance to predict default. Do not forget to set a random seed before
# beginning your analysis.
set.seed(7)

log_model1 = glm(default ~ income + balance, data = Default, family = binomial) 
summary(log_model1)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

# Question 5 part (b) Using the validation set approach, estimate the test
# error of this model. In order to do this, you must perform the following 
# steps: i. Split the sample set into a training set and a validation set.

set.seed(7) 
train = sample(nrow(Default), size = nrow(Default) * 0.5)
train_default = Default[train,]
test_default = Default[-train,]

dim(train_default)

## [1] 5000    4

dim(test_default)

## [1] 5000    4

# Question 5 part (b) ii. Fit a multiple logistic regression model using only
# the training observations.

set.seed(7)
log_model2 = glm(default ~ income + balance, data = Default, family = binomial,
                 subset = train)
summary(log_model2)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default, subset = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.119e+01  5.867e-01 -19.079  < 2e-16 ***
## income       1.818e-05  6.800e-06   2.674  0.00751 ** 
## balance      5.546e-03  3.079e-04  18.011  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1536.99  on 4999  degrees of freedom
## Residual deviance:  839.24  on 4997  degrees of freedom
## AIC: 845.24
## 
## Number of Fisher Scoring iterations: 8

# Question 5 part (b) iii. Obtain a prediction of default status for each 
#individual in the validation set by computing the posterior probability of
# default for that individual, and classifying the individual to
# the default category if the posterior probability is greater
# than 0.5.
attach(Default)
log_probs = predict(log_model2, test_default, type="response")
log_pred = rep("No", 5000)
log_pred[log_probs>0.5] = "Yes"
default_test = default[-train]
table(log_pred, default_test)

##         default_test
## log_pred   No  Yes
##      No  4823   99
##      Yes   22   56

# Question 5 part (b) iv. Compute the validation set error, which is the 
# fraction of the observations in the validation set that are misclassified.

mean(log_pred!=default_test)

## [1] 0.0242

# Question 5 part (c) Repeat the process in (b) three times, using three
# different splits of the observations into a training set and a validation set.
# Comment on the results obtained.

# First split will be a 60-40 split
set.seed(7) 
train = sample(nrow(Default), size = nrow(Default) * 0.6)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)

##         default_test
## glm_pred   No  Yes
##      No  3860   76
##      Yes   17   47

mean(glm_pred!=default_test)

## [1] 0.02325

# From the result generated the validation set from a 60-40 split
# produced a test error of 2.33%

# Question 5 part (c)
# Second split will be a 70-30 split
set.seed(7) 
train = sample(nrow(Default), size = nrow(Default) * 0.7)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)

##         default_test
## glm_pred   No  Yes
##      No  2895   55
##      Yes   14   36

mean(glm_pred!=default_test)

## [1] 0.023

# From the result generated the validation set from a 70-30 split
# produced a test error of 2.3%

# Question 5 part (c)
# Third split will be a 80-20 split
set.seed(7) 
train = sample(nrow(Default), size = nrow(Default) * 0.8)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_probs = predict(log_model2, test_default, type="response")
glm_pred<- rep("No", length(default_test))
glm_pred[glm_probs>0.5]="Yes"
table(glm_pred, default_test)

##         default_test
## glm_pred   No  Yes
##      No  1926   38
##      Yes   10   26

mean(glm_pred!=default_test)

## [1] 0.024

# From the result generated the validation set from a 80-20 split
# produced a test error of 2.4%

# Question 5
# (d) Now consider a logistic regression model that predicts the probability
# of default using income, balance, and a dummy variable
# for student. Estimate the test error for this model using the validation
# set approach. Comment on whether or not including a
# dummy variable for student leads to a reduction in the test error
# rate.

set.seed(7) 
train = sample(nrow(Default), size = nrow(Default) * 0.5)
train_default = Default[train,]
test_default = Default[-train,]
default_test = default[-train]
glm_default1 = glm(default ~ income + balance + student, data = Default,
                                   family = binomial, subset = train)
glm_probs = predict(glm_default1, test_default, type = "response")
glm_pred = rep("No", length(default_test))
glm_pred[glm_probs>0.5] = "Yes"
table(glm_pred, default_test)

##         default_test
## glm_pred   No  Yes
##      No  4824  100
##      Yes   21   55

mean(glm_pred!=default_test)

## [1] 0.0242

# The validation approach without "student" dummy variable present produced
# a test error of 2.42% and if "student" is included the test error is still
# 2.42%. As a result the inclusion of the dummy variable for student did not
# lead to a decrease in the test error rate.

# Question 6
# We continue to consider the use of a logistic regression model to
# predict the probability of default using income and balance on the
# Default data set. In particular, we will now compute estimates for
# the standard errors of the income and balance logistic regression coefficients
# in two different ways: (1) using the bootstrap, and (2) using
# the standard formula for computing the standard errors in the glm()
# function. Do not forget to set a random seed before beginning your
# analysis.
# Part (a) Using the summary() and glm() functions, determine the estimated
# standard errors for the coefficients associated with income
# and balance in a multiple logistic regression model that uses
# both predictors.

set.seed(7)
glm_q6 = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_q6)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

# Based on the summary function of the model the glm() 
# estimated standard errors for the coefficients 
# intercept, income and balance are  4.348e-01, 4.985e-06 and
# 2.274e-04 respectively.

# Question 6 (b) Write a function, boot.fn(), that takes as input the
# Default data set as well as an index of the observations, and that outputs the
# coefficient estimates for income and balance in the multiple logistic
# regression model.

boot_fn = function(data, index){
  fit = glm(default ~ income +balance, data=data, family=binomial, subset=index)
  return(coef(fit))
}

boot_fn(Default,1:10000)

##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

# Question 6 (c) Use the boot() function together with your boot.fn()
# function to estimate the standard errors of the logistic regression 
# coefficients for income and balance.

boot(Default,boot_fn,500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot_fn, R = 500)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -1.849433e-03 4.524018e-01
## t2*  2.080898e-05  2.692252e-08 4.904948e-06
## t3*  5.647103e-03 -1.613802e-06 2.369834e-04

# Based on the  
# estimated standard errors for the coefficients are 4.347e-01, 4.971e-06 and
# 2.301e-04 respectively.

# Question 6 Part (d) Comment on the estimated standard errors obtained using
# the glm() function and using your bootstrap function.

# The standard errors obtained using the glm() function for the coefficients 
# intercept, income and balance are  4.348e-01, 4.985e-06 and
# 2.274e-04 respectively. For the bootstrap function the standard errors
# obtained for B0, B1, and B2 are 4.347e-01, 4.971e-06 and
# 2.301e-04 respectively.
# Comparing these the two results from the different functions, the estimated
# standard errors that are obtained form both methods are very similar. 
# The very slight difference between the two may be attributed due to 
# differences in methodology between the two different functions with glm()
# relying more on distributional assumptions while bootstrap is more empirical.

# Detach the default dataset. 
detach(Default)

# Question 9 We will now consider the Boston housing data set, from the ISLR2
# library.(a) Based on this data set, provide an estimate for the population
# mean of medv. Call this estimate u^.

attach(Boston)
mu_hat = mean(medv)
mu_hat

## [1] 22.53281

# The estimate for the population mean of medv denoted by u^ (mu hat) is
# 22.53281

# Question 9 part (b) Provide an estimate of the standard error of u^.
# Interpret this result.
# Hint: We can compute the standard error of the sample mean by
# dividing the sample standard deviation by the square root of the
# number of observations.

stderr_mean = sd(medv)/sqrt(length(medv))
stderr_mean

## [1] 0.4088611

# The estimate of the standard error of u^ (sample mean) is 0.4088611

# Question 9 part (c) Now estimate the standard error of u^ using the bootstrap.
# How does this compare to your answer from (b)?
set.seed(7)
boot_fn=function(data, index){
    mu=mean(data[index])
    return (mu)
}

boot(medv,boot_fn, 500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot_fn, R = 500)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.01973043   0.4083676

# Question 9 part (c) 
# The standard error using bootstrap is 0.4083676 which is very similar to
# the standard error from part b which is 0.4088611. This makes sense because
# both methods are estimating variability of the sample mean using the same 
# concept. The very slight difference between these two values can be 
# attributed to the how in the bootstrap method randomness is introduced
# because it inolves resampling and recalculating the sample mean on each 
# bootstrap sample that is present.

# Question 9 part (d) Based on your bootstrap estimate from (c), provide a
# 95% confidence interval for the mean of medv. Compare it to the results
# obtained using t.test(Boston$medv).
# Hint: You can approximate a 95% confidence interval using the
# formula

CI_mu_hat = c(22.53 - 2 * 0.408, 22.53 + 2 * 0.408)
CI_mu_hat

## [1] 21.714 23.346

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

# The 95% confidence interval from the bootstrap is between 21.714 23.346
# and using t=test the confidence interval is 22.53. The results from the
# two methods are are almost the same from one another.

# Question 9 part (e) Based on this data set, provide an estimate, ^umed, for
# the median value of medv in the population.

median(medv)

## [1] 21.2

# Question 9 part(f) We now would like to estimate the standard error of ^umed
# Unfortunately, there is no simple formula for computing the standard
# error of the median. Instead, estimate the standard error of the
# median using the bootstrap. Comment on your findings.

set.seed(7)
boot_fn2 = function(data, index){
    mu=median(data[index])
    return (mu)
}

boot(medv,boot_fn2,500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot_fn2, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0064   0.3740171

# Using the bootstrap method, the median value is determined to be 21.2 which
# is equal to the value that is determined in part (e). In the bootstrap method
# the standard error is 0.3740171.

# Question 9 (g) Based on this data set, provide an estimate for the
# tenth percentile of medv in Boston census tracts. Call this quantity ^u0.1
# (You can use the quantile() function.)

mu_tenth_medv = quantile(medv, c(0.1))
mu_tenth_medv

##   10% 
## 12.75

# Question 9 (h) Use the bootstrap to estimate the standard error of ^u0.1
# Comment on your findings.

set.seed(7)
boot_fn3=function(data, index){
    mu=quantile(data[index],0.1)
    return (mu)
}
boot(medv, boot_fn3, 500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot_fn3, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0152   0.4880218

# The estimated tenth percentile that is determined is 12.75 which is the same
# value that is determined in the previous part (g). The standard error from the
# bootstrap method is 0.4880218.

Assignment4

Daniel O.

2025-03-23

R Markdown