Assignment 4

3. We now review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

K fold cross-validation is implemented by creating groups of observations into folds. These group sizes are specified by the k variable. out of these k folds, one is specified as the holdout group while the rest (k-1) are used to fit the model. Then the last set is used to test the model that was formed from the other folds. This process is repeated until each fold is used as the holdout. The test error is then calculated using the average MSE of the k test models.

(b) What are the advantages and disadvantages of k-fold crossvalidation relative to:

i. The validation set approach?

The validation set approach has the advantage of being simpler, a clean 70-30 or 80-20 train-test split is much faster to compute and with large enough datasets is an effective division of the observations. The disadvantages of this approach come from its simplicity. Only one split means that smaller datasets can be represented unevenly and models typically perform worse when only trained on one set of variables. The K-folds method allows for multiple trains across varried data.

ii. LOOCV?

Since this model involves using as many folds as there are observations, it can put a large strain on computational power as compared to k-folds. While k-folds may only need to run 10 sets, LOOCV must run many times that to account for every observation. Another issue of LOOCV is that training the data on every single point except one can create many models with very high correlation. Since these models MSE is averaged, this correlation leads to high variance overall.

5. In Chapter 4, we used logisitc regression to predict the probability of “default” using “income” and “balance” on the “Default” data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

(a) Fit a logistic regression model that uses “income” and “balance” to predict “default”.

data("Default")
glm.default.fit = glm(default ~ income + balance, Default, family = "binomial")
summary(glm.default.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Using the validation set approach, estimate the test error of this -model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

set.seed(12345)

sample = sample.split(Default$default, SplitRatio = .5)
train =  subset(Default, sample == TRUE)
test  = subset(Default, sample == FALSE)

ii. Fit a multiple logistic regression model using only the training observations.

glm.default.fitb = glm(default ~ income + balance, train, family = "binomial")
summary(glm.default.fitb)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9768  -0.1488  -0.0590  -0.0223   3.7002  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.144e+01  6.082e-01 -18.815  < 2e-16 ***
## income       2.276e-05  6.875e-06   3.311  0.00093 ***
## balance      5.548e-03  3.187e-04  17.406  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1456.95  on 4999  degrees of freedom
## Residual deviance:  810.91  on 4997  degrees of freedom
## AIC: 816.91
## 
## Number of Fisher Scoring iterations: 8

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the “default” category if the posterior probability is greater than 0.5.

glm.probs = predict(glm.default.fitb, newdata = test, type = "response")
glm.preds = rep("No", length(glm.probs)) 
glm.preds[glm.probs > 0.5] = "Yes"

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

table(glm.preds, test$default)

##          
## glm.preds   No  Yes
##       No  4812  108
##       Yes   21   59

mean(glm.preds == test$default) #accuracy

## [1] 0.9742

mean(glm.preds != test$default) #Error/misclassification

## [1] 0.0258

only 2.8% of the observations were misclassified by the glm model.

Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(54321)

sample2 = sample.split(Default$default, SplitRatio = .5)
train2 =  subset(Default, sample2 == TRUE)
test2  = subset(Default, sample2 == FALSE)

glm.default.fit2 = glm(default ~ income + balance, train2, family = "binomial")

glm.probs2 = predict(glm.default.fit2, newdata = test2, type = "response")
glm.preds2 = rep("No", length(glm.probs2)) 
glm.preds2[glm.probs2 > 0.5] = "Yes"

table(glm.preds2, test2$default)

##           
## glm.preds2   No  Yes
##        No  4811  109
##        Yes   22   58

1 - mean(glm.preds2 == test2$default)

## [1] 0.0262

set.seed(415661)

sample3 = sample.split(Default$default, SplitRatio = .5)
train3 =  subset(Default, sample3 == TRUE)
test3  = subset(Default, sample3 == FALSE)

glm.default.fit3 = glm(default ~ income + balance, train3, family = "binomial")

glm.probs3 = predict(glm.default.fit3, newdata = test3, type = "response")
glm.preds3 = rep("No", length(glm.probs3)) 
glm.preds3[glm.probs3 > 0.5] = "Yes"

table(glm.preds3, test3$default)

##           
## glm.preds3   No  Yes
##        No  4809  112
##        Yes   24   55

1 - mean(glm.preds3 == test3$default)

## [1] 0.0272

set.seed(75982)

sample4 = sample.split(Default$default, SplitRatio = .5)
train4 =  subset(Default, sample4 == TRUE)
test4  = subset(Default, sample4 == FALSE)

glm.default.fit4 = glm(default ~ income + balance, train4, family = "binomial")

glm.probs4 = predict(glm.default.fit4, newdata = test4, type = "response")
glm.preds4 = rep("No", length(glm.probs4)) 
glm.preds4[glm.probs4 > 0.5] = "Yes"

table(glm.preds4, test4$default)

##           
## glm.preds4   No  Yes
##        No  4820  113
##        Yes   13   54

1 - mean(glm.preds4 == test4$default)

## [1] 0.0252

Between the three models, there is a small variance in the error rate recorded as well as the classifications made by the models.

d. Now consider a logistic regression model that predicts the probability of “default” using “income”, “balance”, and a dummy variable for “student”. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for “student” leads to a reduction in the test error rate.

set.seed(45127385)

sample = sample.split(Default$default, SplitRatio = .5)
train =  subset(Default, sample == TRUE)
test  = subset(Default, sample == FALSE)

glm.default.fit = glm(default ~ income + balance + student, train, family = "binomial")

glm.probs = predict(glm.default.fit, newdata = test, type = "response")
glm.preds = rep("No", length(glm.probs)) 
glm.preds[glm.probs > 0.5] = "Yes"

table(glm.preds, test$default)

##          
## glm.preds   No  Yes
##       No  4820  115
##       Yes   13   52

1 - mean(glm.preds == test$default)

## [1] 0.0256

the addition of the student variable doesn’t appear to have any negative or positive effect on the model’s ability to predict accurately or testing error rate.

6. We continue to consider the use of a logistic regression model to predict the probability of “default” using “income” and “balance” on the “Default” data set. In particular, we will now computes estimates for the standard errors of the “income” and “balance” logistic regression coefficients in two different ways : (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with “income” and “balance” in a multiple logistic regression model that uses both predictors.

set.seed(123)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

The standard errors for the intercept is 0.4347564, while the coefficients for income and balance’s standard errors are 4.985^-6 and 2.274^-4.

(b) Write a function, boot.fn(), that takes as input the “Default” data set as well as an index of the observations, and that outputs the coefficient estimates for “income” and “balance” in the multiple logistic regression model.

boot.fn = function(data, index) {
    return(coef(glm(default ~ income + balance, data = data, family = "binomial", subset = index)))
}

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for “income” and “balance”.

boot(Default, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -2.754771e-02 4.204817e-01
## t2*  2.080898e-05  1.582518e-07 4.729534e-06
## t3*  5.647103e-03  1.296980e-05 2.217214e-04

The standard errors from bootstrapping 1000 iterations are as follows: the intercept is 0.4204817, while the coefficients for income and balance’s standard errors are 4.729534^-6 and 2.217214^-4.

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard errors from both methods are fairly close in this situation. It might be said that in this case using a standard glm is simpler as opposed to the processing of 1000 iterations for bootstrapping in terms of computational power.

9. We will now consider the “Boston” housing data set, from the “MASS” library.

(a) Based on this data set, provide an estimate for the population mean of “medv”. Call this estimate μ^.

data("Boston")
mu.hat = mean(Boston$medv)
mu.hat

## [1] 22.53281

(b) Provide an estimate of the standard error of μ^. Interpret this result

se.hat <- sd(Boston$medv) / sqrt(dim(Boston)[1])
se.hat

## [1] 0.4088611

The standard error of the mean helps us understand how varied our sample mean would be if we used a new sample from the population. The Standard error in this case is quite low.

(c) Now estimate the standard error of μ^ using the bootstrap. How does this compare to your answer from (b) ?

boot.fn = function(data, index) {
    return(mean(data[index]))
}
boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.02171561   0.4163195

The bootstrapped standard error is .4331926, which is very close to the calculated error of .4088611.

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of “medv”. Compare it to the results obtained using t.test(Boston$medv).

confidence.int = c(22.53 - 2 * 0.4331926, 22.53 + 2 * 0.4331926)
confidence.int

## [1] 21.66361 23.39639

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Because the bootstrap standard error was a bit larger, the confidence interval for it was also a bit larger. That being said, both intervals are about the same size with only a tenth difference between the upper and lower bounds.

(e) Based on this data set, provide an estimate, μ^med, for the median value of “medv” in the population.

med.hat <- median(Boston$medv)
med.hat

## [1] 21.2

(f) We now would like to estimate the standard error of μ^med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

boot.fn = function(data, index) {
    return(median(data[index]))
}
boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.01105   0.3887242

Our median value is exactly the same at 21.2. The standard error is 0.3841776 which is quite low compared to our overall median.

(g) Based on this data set, provide an estimate for the tenth percentile of “medv” in Boston suburbs. Call this quantity μ^0.1.

mu01.hat <- quantile(Boston$medv, c(0.1))
mu01.hat

##   10% 
## 12.75

(h) Use the bootstrap to estimate the standard error of μ^0.1. Comment on your findings.

boot.fn <- function(data, index) {
    return(quantile(data[index], c(0.1)))
}
boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 -0.0169   0.5202692

Looking at the 10th percentile of houses, we can see that the value from (g) and (h) are the same (12.75) but the standard error for the bootstrap was .4879856. This is a small value compared to the original percentile.