- Explain how k-fold cross-validation is implemented.
Overall, we randomly divide the set of observations into k groups (folds) of roughly equal size. The first fold is treated as the validation set, and the method is fit on the remaining k - 1 folds. MSE1 is then computed on the observations in the held-out fold. This is repeated k times, each time treating a different fold as the validation set, which yields k estimates of the test error, MSE1, …, MSEk.
Lastly, the k-fold cross-validation estimate is computed by averaging these values: CV(k) = (1/k)(MSE1 + … + MSEk).
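As a rough illustration (not part of the graded answer), a manual k-fold loop for a generic regression fit might look like the following; the data frame dat and the variables y and x are placeholders, not objects defined in this assignment:
set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(dat)))    # randomly assign each observation to one of k folds
cv.mse = numeric(k)
for (i in 1:k) {
  fit = lm(y ~ x, data = dat[folds != i, ])          # fit on the k - 1 training folds
  pred = predict(fit, newdata = dat[folds == i, ])   # predict on the held-out fold i
  cv.mse[i] = mean((dat$y[folds == i] - pred)^2)     # MSE_i on the validation fold
}
mean(cv.mse)                                         # the k-fold CV estimate of the test MSE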
- What are the advantages and disadvantages of k-fold cross-validation relative to:
- the validation set approach
Advantage: Relative to a single train/test split, k-fold CV uses every observation for both fitting and validation, which gives a more stable error estimate and helps guard against overfitting to one particular split.
Disadvantage of the validation set approach: its estimate of the test error rate can be highly variable, because it depends on exactly which observations end up in the training set and which end up in the validation set, and only a subset of the observations is used to fit the model.
In addition, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set, because a model trained on less data generally performs worse.
- LOOCV
Advantage: It provides a much less biased measure of the test MSE than a single validation set, because we repeatedly fit the model to data sets containing n - 1 observations.
Disadvantage: It is very computationally expensive, since the model must be fit n times rather than k times as in k-fold cross-validation. In addition, its estimate of the test error is only approximately unbiased, since each training set contains n - 1 observations rather than all n.
It is also important to note that LOOCV has higher variance than k-fold CV with k < n: the n fitted models are trained on nearly identical data sets, so their outputs are highly correlated, and the mean of highly correlated quantities has higher variance than the mean of less correlated ones.
In practice, k = 5 or k = 10 is typically used for k-fold CV, since these values yield test error estimates that suffer neither from excessively high bias nor from very high variance.
- Fit a logistic regression model that uses “income” and “balance” to predict “default”.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
set.seed(1)
fit.glm = glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
- Using a validation set approach, estimate the test error of this model. Perform the following steps:
- Split the sample into a training set and a validation set
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#make this example reproducible
set.seed(1)
#create ID column
Default = as.data.frame(Default)
Default$id <- 1:nrow(Default)
#use 70% of dataset as training set and 30% as test set
train <- Default %>% dplyr::sample_frac(0.70)
test <- dplyr::anti_join(Default, train, by = 'id')
- Fit a multiple logistic regression model using only the training observations
fit.glm = glm(default ~ income + balance, data = train, family ="binomial")
# Note: no subset argument is needed, since the model is fit directly on the newly created training data
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4481 -0.1402 -0.0561 -0.0211 3.3484
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.167e+01 5.214e-01 -22.379 < 2e-16 ***
## income 2.560e-05 6.012e-06 4.258 2.06e-05 ***
## balance 5.574e-03 2.678e-04 20.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2030.3 on 6999 degrees of freedom
## Residual deviance: 1079.6 on 6997 degrees of freedom
## AIC: 1085.6
##
## Number of Fisher Scoring iterations: 8
- Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the “default” category if the posterior probability is greater than 0.5.
probability = predict(fit.glm, newdata = test, type = "response")
predicted = ifelse(probability > 0.5, "Yes", "No")
- Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified
misclassified = sum(predicted != test$default)
# predicted != test$default is a logical vector that is TRUE for each observation whose predicted class does not match its actual class in the test set.
# sum(predicted != test$default) counts the TRUE elements, i.e. the number of misclassified observations, and that count is stored in misclassified.
error = misclassified / nrow(test)
# divide the number of misclassified observations by the total number of observations in the test set to get the validation set error.
cat("Validation set error: ", error, "\n")
## Validation set error: 0.02666667
- Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on these results.
# round 1
#make this example reproducible
set.seed(1)
#create ID column
Default = as.data.frame(Default)
Default$id <- 1:nrow(Default)
#use 50% of dataset as training set and 50% as test set
train <- Default %>% dplyr::sample_frac(0.5)
test <- dplyr::anti_join(Default, train, by = 'id')
#Fit the reg model
fit.glm = glm(default ~ income + balance, data = train, family ="binomial")
# Note: no subset argument is needed, since the model is fit directly on the new training data
#summary(fit.glm)
# predict on the held-out test data
probability = predict(fit.glm, newdata = test, type = "response")
predicted = ifelse(probability > 0.5, "Yes", "No")
misclassified = sum(predicted != test$default)
error1 = misclassified / nrow(test)
cat("Validation set error for 50% training and testing is: ", error1, "\n")
## Validation set error for 50% training and testing is: 0.016
##############################################################################
# round 2
#make this example reproducible
set.seed(1)
#create ID column
Default = as.data.frame(Default)
Default$id <- 1:nrow(Default)
#use 75% of dataset as training set and 25% as test set
train <- Default %>% dplyr::sample_frac(0.75)
test <- dplyr::anti_join(Default, train, by = 'id')
#Fit the reg model
fit.glm = glm(default ~ income + balance, data = train, family ="binomial")
# Note: no subset argument is needed, since the model is fit directly on the new training data
#summary(fit.glm)
# predict on the held-out test data
probability = predict(fit.glm, newdata = test, type = "response")
predicted = ifelse(probability > 0.5, "Yes", "No")
misclassified = sum(predicted != test$default)
error2 = misclassified / nrow(test)
cat("Validation set error for 75% train, 25% test is: ", error2, "\n")
## Validation set error for 75% train, 25% test is: 0.032
# round 3
#make this example reproducible
set.seed(1)
#create ID column
Default = as.data.frame(Default)
Default$id <- 1:nrow(Default)
#use 90% of dataset as training set and 10% as test set
train <- Default %>% dplyr::sample_frac(0.90)
test <- dplyr::anti_join(Default, train, by = 'id')
#Fit the reg model
fit.glm = glm(default ~ income + balance, data = train, family ="binomial")
# Note: no subset argument is needed, since the model is fit directly on the new training data
#summary(fit.glm)
# predict on the held-out test data
probability = predict(fit.glm, newdata = test, type = "response")
predicted = ifelse(probability > 0.5, "Yes", "No")
misclassified = sum(predicted != test$default)
error3 = misclassified / nrow(test)
cat("Validation set error for 90% train, 10% testing is: ", error3, "\n")
## Validation set error for 90% train, 10% testing is: 0.08
As we can see from the values printed to the console, the validation set estimates vary from split to split. One might expect that more training data would always give a lower test error, but that is not the case for the 75/25 split here. This could be because a given random split happens to place easier or harder observations in the training and/or validation set. To mitigate this variability, one could use k-fold cross-validation, which averages over several splits and yields more stable error estimates; a sketch is given below.
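For reference (an optional sketch, not part of the required answer), the boot package's cv.glm() automates k-fold CV; with a misclassification cost function and K = 10 on the full Default data it gives a single, more stable estimate of the test error. The object names fit.full, cost, and cv.err below are new:
library(boot)
# misclassification cost: a fitted probability pi is classified "Yes" when pi > 0.5
cost = function(r, pi) mean(abs(r - pi) > 0.5)
fit.full = glm(default ~ income + balance, data = Default, family = "binomial")
set.seed(1)
cv.err = cv.glm(Default, fit.full, cost, K = 10)$delta[1]   # 10-fold CV error estimate
cv.err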
- Now consider a logistic regression model that predicts the probability of “default” using “income”, “balance”, and a dummy variable for “student”. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for “student” leads to a reduction in the test error rate.
#make this example reproducible
set.seed(1)
#create ID column
Default = as.data.frame(Default)
Default$id <- 1:nrow(Default)
#use 70% of dataset as training set and 30% as test set
train <- Default %>% dplyr::sample_frac(0.70)
test <- dplyr::anti_join(Default, train, by = 'id')
fit.glm = glm(default ~ income + balance + student, data = train, family ="binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4482 -0.1374 -0.0540 -0.0202 3.4027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.095e+01 5.880e-01 -18.616 <2e-16 ***
## income 6.273e-06 9.845e-06 0.637 0.524
## balance 5.678e-03 2.741e-04 20.715 <2e-16 ***
## studentYes -7.167e-01 2.886e-01 -2.483 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2030.3 on 6999 degrees of freedom
## Residual deviance: 1073.5 on 6996 degrees of freedom
## AIC: 1081.5
##
## Number of Fisher Scoring iterations: 8
probability = predict(fit.glm, newdata = test, type = "response")
predicted = ifelse(probability > 0.5, "Yes", "No")
misclassified = sum(predicted != test$default)
error = misclassified / nrow(test)
cat("Validation set error is: ", error, "\n")
## Validation set error is: 0.02666667
Adding the student variable does not lead to an appreciable reduction in the test error. One reason is that, in our logistic regression model, the studentYes estimate is -7.167e-01 with a p-value of 0.013, a much weaker effect than balance; its contribution is small, so there is little evidence that student status meaningfully improves the prediction of default on held-out data.
- Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with “income” and “balance” in a multiple logistic regression model that uses both predictors.
set.seed(1)
fit.glm = glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Beta_0's (intercept) standard error: 4.348e-01; Beta_1's (income) standard error: 4.985e-06; Beta_2's (balance) standard error: 2.274e-04.
- Write a function, boot.fn(), that takes as input the “Default” data set as well as an index of the observations, and that outputs the coefficient estimates for “income” and “balance” in the multiple logistic regression model.
set.seed(1)
boot.fn = function(data, index){
fit = glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return(coef(fit))
}
#boot.fn() is designed to be used with the bootstrap, a resampling technique for estimating the uncertainty of a statistical estimator: we repeatedly draw random samples with replacement from the original data set.
# The index argument is the vector of row indices that make up one bootstrap sample. Passing it to the subset argument of glm() tells the function to fit the logistic regression using only the observations in that bootstrap sample.
- Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for “income” and “balance”.
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The bootstrap estimates of the standard errors for Beta_0, Beta_1, and Beta_2 are 4.344722e-01, 4.866284e-06, and 2.298949e-04, respectively.
- Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
Comparing our bootstrap estimates from 6.c to the glm() estimates from 6.a, the two sets of standard errors are very close: both approaches give similar estimates of the standard deviation of the coefficients' sampling distributions.
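For a quick side-by-side check (an optional sketch; se.glm, boot.out, and se.boot are new object names, and the boot package is already loaded above), the glm() standard errors can be pulled from the summary table and the bootstrap standard errors recomputed from a saved boot object:
# standard errors from the asymptotic glm() formulas
se.glm = summary(fit.glm)$coefficients[, "Std. Error"]
# bootstrap standard errors, recomputed from the saved replicates
set.seed(1)
boot.out = boot(Default, boot.fn, 1000)
se.boot = apply(boot.out$t, 2, sd)
rbind(glm = se.glm, bootstrap = se.boot)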
- Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ^.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.3
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Default
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
boston = as.data.frame(Boston)
mean(boston$medv)
## [1] 22.53281
cat("The mean of medv is: ", mean(boston$medv), "\n")
## The mean of medv is: 22.53281
- Provide an estimate of the standard error of μ^. Interpret this result. (Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.)
counts = nrow(boston)
SE = (sd(boston$medv)/ sqrt(counts))
cat("The standard error of mu for medv is ",SE, ". This is smaller in comparison to the mean of medv")
## The standard error of mu for medv is 0.4088611 . This is smaller in comparison to the mean of medv
- Now estimate the standard error of μ^ using the bootstrap. How does this compare to your answer from (b)?
set.seed(1)
boot.fn = function(data, index){
mu = mean(data[index, "medv"])
return(mu)
}
boot(boston, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The estimated standard error of μ^ using the bootstrap is 0.4106622. Compared to the standard error found in (b) without bootstrapping (0.4088611), the two values are very similar, with only a small difference.
- Based on your bootstrap estimate from c provide a 95% CI for the mean of medv. Compare it to the results obtained using t.test(Boston$medv)
t.test(boston$medv)
##
## One Sample t-test
##
## data: boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
a = mean(boston$medv)
lower = a - 2*0.4106622
upper = a + 2*0.4106622
cat("The BootStrapped estimate confidence interval is [",lower, ",", upper, "]", "\n")
## The BootStrapped estimate confidence interval is [ 21.71148 , 23.35413 ]
cat("This is approximately the same to the values provided by t.test, which are [21.72953, 23.33608] with very little differnce", "\n")
## This is approximately the same to the values provided by t.test, which are [21.72953, 23.33608] with very little differnce
cat(" It is important to note that The mean calculated from one sample t-test of medv falls in 95% confidence interval obtained from mean by bootstrap methods.")
## It is important to note that The mean calculated from one sample t-test of medv falls in 95% confidence interval obtained from mean by bootstrap methods.
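As an optional cross-check (a sketch; boot.mu is a new object name, and the boot package is already loaded above), boot.ci() builds confidence intervals directly from a saved boot object, which avoids the hand-computed mean ± 2*SE approximation:
set.seed(1)
boot.mu = boot(boston, boot.fn, 1000)          # same bootstrap of the sample mean as above, saved this time
boot.ci(boot.mu, type = c("norm", "perc"))     # normal-approximation and percentile intervals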
- Based on this data set, provide an estimate, μ^med, for the median value of medv in the population.
median(boston$medv)
## [1] 21.2
The median for medv is 21.2
- We now would like to estimate the standard error of ^µmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
set.seed(1)
boot.med = function(data, index){
med = median(data[index, "medv"])
return(med)
}
boot(boston, boot.med, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = boot.med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
We obtain a bootstrap standard error of 0.3778075 for μ^med, which is small relative to the estimated median of 21.2. Note also that the median (21.2) falls below the lower bounds of both the bootstrap and t.test() confidence intervals for the mean.
- Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μ^0.1 (You can use the quantile() function.)
quantile(boston$medv,.1)
## 10%
## 12.75
- Use the bootstrap to estimate the standard error of ^µ0.1. Comment on your findings
set.seed(1)
boot.quant = function(data, index){
quant = quantile(data[index, "medv"], .1)
return(quant)
}
boot(boston, boot.quant, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = boot.quant, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
The standard error of the tenth percentile of medv is 0.4767526, which is fairly small relative to the estimate of 12.75.
The bootstrap comes in handy when we want to estimate the uncertainty of an estimator for which no simple formula exists. It is quick to apply and reproducible, without relying on specialized statistical software.