Ch.5 Problems 3,5,6,9

3.

We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented. (b) What are the advantages and disadvantages of k-fold cross validation relative to: i. The validation set approach? ii. LOOCV?

K-fold cross validation is implemented by splitting the data randomly into k groups of approximately equal size. The first group is treated as the validation set, and the other k-1 groups are the groups from which the model is fit. The MSE is calculated for the validation set, and then the process is repeated. This process is done k times and each time a different set of observations as the validation set. Finally, once you have the MSE for each group, the CV estimate can then be computed by taking the average of the MSE values.

One advantage of using k-fold cross validation relative to the validation set approach is that the k-fold cross validation approach decreases the bias. However, k-fold cross validation is more computationally expensive than the validation set approcah because you have to repeat the procedure k times.
one advantage k-fold cross validation has over LOOCV is that it is less computationally expensive than LOOCV. in LOOCV you have to repeat the procedure n times, meaning as many times as there are observations in the data. In k-fold cross validation you split the data into k equal sections and then repeat the procedure k times. assuming k isnt equal to n, this is less expensive to do. k-fold cross validation also often gives more accurate estimates of the test error rate than LOOCV.

5.

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

(a) Fit a logistic regression model that uses income and balance to predict default.

library(ISLR)
library(MASS)
library(caTools)
set.seed(8)
glm.fit=glm(default ~income+balance, data = Default, family = binomial)

Here is the logistic regression model using income and balance to predict default.

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set. ii. Fit a multiple logistic regression model using only the training observations. iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5. iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

sample_data = sample.split(Default$default, SplitRatio = 0.5)
Training.set = subset(Default, sample_data==TRUE)
Test.set = subset(Default, sample_data==FALSE)
test.default = Test.set$default

glm1.fit = glm(default~income+balance, data=Training.set, family = binomial)
glm1.probs = predict(glm1.fit,Test.set, type="response")
glm1.preds = rep("No",length(Test.set$default))
glm1.preds[glm1.probs>0.5]= "Yes"
table(glm1.preds, test.default)

##           test.default
## glm1.preds   No  Yes
##        No  4808  114
##        Yes   25   53

Here we can see that the model incorrectly predicted yes 114 times and incorrectly predicted no 25 times.

(114+25)/5000

## [1] 0.0278

The model has a test error rate of 2.78%

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(81)
sample_data = sample.split(Default$default, SplitRatio = 0.5)
Training.set = subset(Default, sample_data==TRUE)
Test.set = subset(Default, sample_data==FALSE)
test.default = Test.set$default

glm1.fit = glm(default~income+balance, data=Training.set, family = binomial)
glm1.probs = predict(glm1.fit,Test.set, type="response")
glm1.preds = rep("No",length(Test.set$default))
glm1.preds[glm1.probs>0.5]= "Yes"
table(glm1.preds, test.default)

##           test.default
## glm1.preds   No  Yes
##        No  4812  113
##        Yes   21   54

Using a different seed, we get that the model incorrectly predicts yes 113 times and incorrectly predicts no 21 times.

(113+21)/5000

## [1] 0.0268

The test error rate this time was 2.68%

set.seed(812)
sample_data = sample.split(Default$default, SplitRatio = 0.5)
Training.set = subset(Default, sample_data==TRUE)
Test.set = subset(Default, sample_data==FALSE)
test.default = Test.set$default

glm1.fit = glm(default~income+balance, data=Training.set, family = binomial)
glm1.probs = predict(glm1.fit,Test.set, type="response")
glm1.preds = rep("No",length(Test.set$default))
glm1.preds[glm1.probs>0.5]= "Yes"
table(glm1.preds, test.default)

##           test.default
## glm1.preds   No  Yes
##        No  4813  115
##        Yes   20   52

Using a different seed, we can see that the model incorrectly predicted yes 115 times and incorrectly predicted no 20 times.

(115+20)/5000

## [1] 0.027

This model had a test error rate of 2.7%

set.seed(8123)
sample_data = sample.split(Default$default, SplitRatio = 0.5)
Training.set = subset(Default, sample_data==TRUE)
Test.set = subset(Default, sample_data==FALSE)
test.default = Test.set$default

glm1.fit = glm(default~income+balance, data=Training.set, family = binomial)
glm1.probs = predict(glm1.fit,Test.set, type="response")
glm1.preds = rep("No",length(Test.set$default))
glm1.preds[glm1.probs>0.5]= "Yes"
table(glm1.preds, test.default)

##           test.default
## glm1.preds   No  Yes
##        No  4816  123
##        Yes   17   44

We can see from this table that the model incorrectly predicted yes 123 times and incorrectly predicted no 17 times.

(123+17)/5000

## [1] 0.028

This model had a test error rate of 2.8% All three models of these models have close to the same test error rate.

D. Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

glm1.fit = glm(default~income+balance+student, data=Training.set, family = binomial)
glm1.probs = predict(glm1.fit,Test.set, type="response")
glm1.preds = rep("No",length(Test.set$default))
glm1.preds[glm1.probs>0.5]= "Yes"
contrasts(Default$student)

##     Yes
## No    0
## Yes   1

table(glm1.preds, test.default)

##           test.default
## glm1.preds   No  Yes
##        No  4815  124
##        Yes   18   43

The table indicates that the model incorrectly predicted yes 124 times and no 18 times.

(124+18)/5000

## [1] 0.0284

The test error rate is 2.84% which is very very slightly higher than the test error rate of the model not including the student variable.

6.

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(7)
glm2.fit=glm(default~income+balance, data = Default, family = binomial)
summary(glm2.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Based on this result we can see that the estimated standard error for income is 4.985e-06 and the estimated standard error for balance is 2.274e-04.

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn=function(data, index){return(coef(glm(default~income+balance, data=data, subset = index,family = binomial)))}

Above is the written function boot.fn.

**(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
boot(Default, boot.fn, R= 100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01  3.279628e-03 4.032749e-01
## t2*  2.080898e-05  4.685383e-07 4.589626e-06
## t3*  5.647103e-03 -1.132436e-05 2.206669e-04

The estimates for standard error for income and balance are 4.589626e-06 and 2.206669e-04 respectively.

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

based on the results in part (a) and the results in part (c) we can see that the estimated standard errors when using the glm() function and using the bootstrap function are very similar.

9.

We will now consider the Boston housing data set, from the MASS library. (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.

mean(Boston$medv)

## [1] 22.53281

The mean value for the medv variable is 22.53281.

(b) Provide an estimate of the standard error of ˆμ. Interpret this result.

sd(Boston$medv)/sqrt(dim(Boston)[1]-1)

## [1] 0.4092658

The standard error of the mean is 0.4092658. the standard error tells us how much variability there is in the sample data. With a standard error of .40 there doesnt seem to be a lot of variability in the values of the medv variable.

(c) Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)?

mean.fn=function(data, index){return(mean(Boston$medv[index]))}
boot(Boston, mean.fn, 100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = mean.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 22.53281 0.0141996   0.4160302

We can see from the result of this bootstrap that the standard error is .4160302. This is close to the value for the standard error we obtained in part (b).

(d) Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

22.53281+ 2*0.4160302

## [1] 23.36487

22.53281- 2*0.4160302

## [1] 21.70075

We can see that the 95% confidence interval for the mean value of medv is (21.70075, 23.36487).

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

We can see from this t-test that the 95% confidence interval is (21.72953, 23.33608). This 95% confidence interval is very similar, but slightly smaller than the first one.

(e) Based on this data set, provide an estimate, ˆμmed, for the median value of medv in the population.

median(Boston$medv)

## [1] 21.2

This result indicates that the value of the median for the medv variable is 21.2.

**(f) We now would like to estimate the standard error of ˆμmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

median.fn=function(data, index){return(median(Boston$medv[index]))}
boot(Boston, median.fn, 100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = median.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2   0.045   0.4042389

The standard error of the median is 0.4042389. This is very close to the result we got for standard error of the bootstrap based on the mean of the medv variable.

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity ˆμ0.1. (You can use the quantile() function.)

quantile(Boston$medv, 0.1)

##   10% 
## 12.75

The estimate for the 10th percentile of the medv variable is 12.75.

(h) Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings.

quant.fn=function(data, index){return(quantile(Boston$medv[index], 0.1))}
boot(Boston, quant.fn, 100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = quant.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 -0.1185   0.5221196

This result indicates that the standard error on the 10th percentile of the medv variable is 0.5221196. This is similar to, but slightly larger than the estimates we got for the mean or median of the medv variable.

DataMining Assignment 4

Daniel Bloemker

3/8/2022

Ch.5 Problems 3,5,6,9

3.

5.

6.

9.