Applied Exercises

Exercise 5 in section 5.4

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. Fit a logistic regression model that uses income and balance to predict default.

# Remove the # below
#install.packages("ISLR")
library(ISLR)
attach(Default)
set.seed(1)

(a). Fit a logistic regression model that uses income and balance to predict default.

# logistic regression
glm.fit = glm(default~income+balance,data = Default, family="binomial")

(b).Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: (i). Split the sample set into a training set and a validation set.

# 50% as train
set.seed(1)
train = sample(nrow(Default), nrow(Default)*0.5, replace = FALSE)
test_data = Default[-train,]

(ii). Fit a multiple logistic regression model using only the training observations.

# logistic regression with train dataset
glm.fit_train = glm(default~income+balance,family="binomial",data = Default,subset = train)

(iii). Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

glm.probs=predict(glm.fit_train,test_data,type="response")
glm.pred=rep(0,nrow(test_data)) 
glm.pred[glm.probs >.5]=1

(iv). Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified

# change the Yes/No into 1/0
Default$num_default = as.numeric(Default$default) # No=1, Yes= 2
z = Default$num_default ==2 # FALSE, FALSE, FALSE, ....
Default[,'num_default'] = as.numeric(z)

mean(glm.pred != test_data$num_default) # test error = 2.54%
## [1] NaN

(c). Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

# 1st
set.seed(123)
train1 = sample(nrow(Default), nrow(Default)*0.4, replace = FALSE)
test_data1 = Default[-train1,]
glm.fit_train1 = glm(default~income+balance,family="binomial",data = Default,subset = train1)
glm.probs1=predict(glm.fit_train1,test_data1,type="response")
glm.pred1=rep(0,nrow(test_data1)) 
glm.pred1[glm.probs1 >.5]=1
mean(glm.pred1 != test_data1$num_default) # test error = 2.57%
## [1] 0.02566667
# 2nd
set.seed(66)
train2 = sample(nrow(Default), nrow(Default)*0.6, replace = FALSE)
test_data2 = Default[-train2,]
glm.fit_train2 = glm(default~income+balance,family="binomial",data = Default,subset = train2)
glm.probs2=predict(glm.fit_train2,test_data2,type="response")
glm.pred2=rep(0,nrow(test_data2)) 
glm.pred2[glm.probs2 >.5]=1
mean(glm.pred2 != test_data2$num_default) # test error = 2.98%
## [1] 0.02975
# 3rd
set.seed(888)
train3 = sample(nrow(Default), nrow(Default)*0.7, replace = FALSE)
test_data3 = Default[-train3,]
glm.fit_train3 = glm(default~income+balance,family="binomial",data = Default,subset = train3)
glm.probs3=predict(glm.fit_train3,test_data3,type="response")
glm.pred3=rep(0,nrow(test_data3)) 
glm.pred3[glm.probs3 >.5]=1
mean(glm.pred3 != test_data3$num_default) # test error = 2.57%
## [1] 0.02566667
# the test error we got
# the proportion of training data: 40%,50%,60%,70%
test_err = c(2.57,2.54,2.98,2.57)
summary(test_err)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.540   2.562   2.570   2.665   2.672   2.980
var(test_err)
## [1] 0.0443

Comments:
The mean test error of validation dataset is 2.67%, and the variation of the test error is equal to 0.04. Because we use different seeds and proportions, it will determine which observations can enter training data, then these observations are used to fit the logistic model. From the textbook we know that statistical methods tend to perform worse when trained on few observations,this phenomenon could also be seen in our case, when we choose 40% of observations to build model, the test error is worse than we choose 50%. However, in this case, when we add more observations in training data, the test error could be larger, so we can also find there is a overfitting problem on training data set with Validation Set Approach. This is really not an ideal and rigorous verification method.

(d). Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

# change the student variables into numeric
attach(Default)
Default$num_stu = as.numeric(student)
s = Default$num_stu ==2
Default[,'num_stu'] = as.numeric(s)

# reset training data and validation data set
set.seed(1)
train = sample(nrow(Default), nrow(Default)*0.5, replace = FALSE)
test_data = Default[-train,]

# build a model and predict
glm.fit2 = glm(default~income+balance+num_stu, data = Default, family = "binomial", subset = train)
glm.probs=predict(glm.fit2,test_data,type="response")
glm.pred=rep(0,nrow(test_data)) 
glm.pred[glm.probs >.5]=1
mean(glm.pred != test_data$num_stu) # test error rate: 29.68% 
## [1] 0.2968

Comments:
When I use the same seed number(123), it means the traning data sets are same between the models whether I add dummy variable for student. But the test error rate are totally different and adding a dummy varible will not reduce the test error rate. When the model without student variable, the test error rate is 2.54%, after adding student, the test error rate soared to 29.68%, which rate is also much larger than all the test error rates we mentioned before.

Exercise 6 in section 5.4

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now computes estimates for the standard errors of the income and balance logistic regression coefficients in two different ways : (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

# Remove the # below
attach(Default)
set.seed(1)

(a).Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

#Insert your code here
glm.fit = glm(default~income+balance,data=Default,family = "binomial")
summary(glm.fit)$coefficients[,2]
##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

(b). Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

#Insert your code here
boot.fn = function(data,index)
  return(coef(glm(default~income+balance, data =data,subset=index,family = "binomial")))
boot.fn(Default,1:10000)
##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

(c).Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
boot(Default, boot.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

(d).Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

Comments:
The estimated standard errors from glm function and bootstrap function are pretty close. We know the estimate sigma square rely on the linear model being correct, and the bootstrap approach does not rely on this assumption. It means bootstrap method gives a more accurate estimate of the standard error of beta1 and beta2. Now the results from these 2 methods are similar, this means the logistic regression model provides a good fit to the Default data.



Exercise 7 in section 5.4

In sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).

# Remove the # below
set.seed(1)
attach(Weekly)
Weekly = Weekly

(a).Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

glm.fit = glm(Direction~Lag1+Lag2,data = Weekly,family = "binomial")

(b).Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.

glm.fit1 = glm(Direction~Lag1+Lag2,data = Weekly[-1,],family = "binomial")

(c).Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(direction = Up|Lag1,Lag2)>0.5. Was this observation correctly classified ?

predict(glm.fit1,Weekly[1,],type = "response") > 0.5
##    1 
## TRUE
Weekly[1,'Direction']
## [1] Down
## Levels: Down Up

Answer:
The prediction direction is UP, but the true direction is DOWN.

(d).Write a loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps : (i). Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2. (ii). Compute the posterior probability of the market moving up for the ith observation. (iii). Use the posterior probability for the ith observation in order to predict whether or not the market moves up. (iv). Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

count = rep(0,nrow(Weekly))
for(i in 1:nrow(Weekly)){
  glm.fit = glm(Direction~Lag1+Lag2,data = Weekly[-i,], family= 'binomial')
  pred = predict(glm.fit,Weekly[i,],type = "response") > 0.5
  true_dir = Weekly[i,]$Direction == "Up"
  if (pred != true_dir)
    count[i] = 1
}
sum(count)
## [1] 490

(e).Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.

mean(count)
## [1] 0.4499541

The test error of LOOCV estimate method in this case is 45%, it means the logistic regression model we built is not so good.



Exercise 8 in section 5.4

We will now perform cross-validation on a simulated data set. (a). Generate a simulated data set as follows:

# Remove the # below
set.seed(1)
y <- rnorm(100) # Generate 100 random numbers that obey the normal distribution
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

In this data set, what is n and what is p ? Write out the model used to generate the data in equation form. Answer:
n = 100, p = 2.
Equation: Y = X - 2X^2 + \(\epsilon\)
\(\epsilon\) ~ N(0,1)

(b).Create a scatterplot of X against Y. Comment on what you find.

plot(x,y)

Comments:
The figure is a parabola, X ranges from -2 to 2 and Y ranges from -8 to 2.

(c).Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares :

# Remove the # below
library(boot)
set.seed(1)
Data = data.frame(x,y)

(i). \(Y= \beta_{0}+\beta_{1}X+\epsilon\)

glm.fit = glm(y~x, data = Data)
cv.err = cv.glm(Data, glm.fit)
cv.err$delta
## [1] 5.890979 5.888812

(ii). \(Y= \beta_{0}+\beta_{1}X+\beta_{2}X^2+\epsilon\)

glm.fit = glm(y~poly(x,2),data = Data)
cv.err = cv.glm(Data, glm.fit)
cv.err$delta
## [1] 1.086596 1.086326

(iii). \(Y= \beta_{0}+\beta_{1}X+\beta_{2}X^2+\beta_{3}X^3+\epsilon\)

glm.fit = glm(y~poly(x,3),data = Data)
cv.err = cv.glm(Data, glm.fit)
cv.err$delta
## [1] 1.102585 1.102227

(iv). \(Y= \beta_{0}+\beta_{1}X+\beta_{2}X^2+\beta_{3}X^3+\beta_{4}X^4+\epsilon\)

glm.fit = glm(y~poly(x,4),data = Data)
cv.err = cv.glm(Data, glm.fit)
cv.err$delta
## [1] 1.114772 1.114334

Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y .

(d). Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?

# Remove the # below
set.seed(10)
cv.error = rep(0,4)
for (i in 1:4){
  glm.fit = glm(y~poly(x,i),data = Data)
  cv.error[i] = cv.glm(Data, glm.fit)$delta[1]
}
cv.error
## [1] 5.890979 1.086596 1.102585 1.114772

Answer:
The results is same as what we got in (c). Because there is no randomness in LOOCV method, we don’t need to use seed here. The steps of LOOCV are:
(1) Using 1 observation as test set, and all the other n-1 observations as traning data set.
(2) Build a model on traning data set, then evaluate it on 1 observation.
(3) Repeat (1) and (2) until each observation has been used as test data only once.
(4) Average these n errors to get the final error estimate.
So in this method, we can only use one way to calculate the LOOCV error, and there is no randomness in splitting data set.

(e). Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
Answer:
The quadratic polynomial had the smallest LOOCV error, this is what we espected because it matches the true distribution of Y in (a).

(f). Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

summary(glm(y ~ x, data = Data))$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -1.8185184  0.2364064 -7.6923393 1.143164e-11
## x            0.2430443  0.2478503  0.9806091 3.292002e-01
summary(glm(y ~ poly(x,2), data = Data))$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  -1.827707  0.1032351 -17.70431 3.804657e-32
## poly(x, 2)1   2.316401  1.0323515   2.24381 2.711854e-02
## poly(x, 2)2 -21.058587  1.0323515 -20.39866 7.333860e-37
summary(glm(y ~ poly(x,3), data = Data))$coefficients
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  -1.8277074  0.1037248 -17.6207390 7.610579e-32
## poly(x, 3)1   2.3164010  1.0372479   2.2332183 2.785714e-02
## poly(x, 3)2 -21.0585869  1.0372479 -20.3023667 1.636959e-36
## poly(x, 3)3  -0.3048398  1.0372479  -0.2938929 7.694742e-01
summary(glm(y ~ poly(x,4), data = Data))$coefficients
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  -1.8277074  0.1041467 -17.5493533 1.444977e-31
## poly(x, 4)1   2.3164010  1.0414671   2.2241711 2.850549e-02
## poly(x, 4)2 -21.0585869  1.0414671 -20.2201171 3.457023e-36
## poly(x, 4)3  -0.3048398  1.0414671  -0.2927023 7.703881e-01
## poly(x, 4)4  -0.4926249  1.0414671  -0.4730105 6.372907e-01

Answer:
From the p-value, we can not reject the null hypotheses for x^3 or x^4 at 5% level(H0: the coefficient is 0), which agrees with the conclusions drawn based on the cross-validation results.



Exercise 9 in section 5.4

We will now consider the Boston housing data set, from the MASS library.

# Remove the # below
library(MASS)
Boston = Boston
attach(Boston)

(a).Based on this data set, provide an estimate for the population mean of medv. Call this estimate \(\hat{μ}\)

medv_mean = mean(medv)
medv_mean
## [1] 22.53281

(b).Provide an estimate of the standard error of \(\hat{μ}\). Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

se = sd(medv)/sqrt(length(medv))
se
## [1] 0.4088611

(c).Now estimate the standard error of \(\hat{μ}\) using the bootstrap. How does this compare to your answer from (b) ?

set.seed(1)
boot.fn = function(data,index)
  mean(data[index])
boot = boot(medv,boot.fn,1000)
boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

Answer:
The results are slightly different, which are 0.4107 and 0.4089 respectively.

(d). Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).
Hint: You can approximate a 95 % confidence interval using the formula [ \(\hat{μ}-2SE(\hat{μ}),\hat{μ}+2SE(\hat{μ})\) ].

# lower bound 
lower = boot$t0 - 2*0.4106622  # boot$t0 = medv_mean
# upper bound 
upper = medv_mean + 2*0.4106622
c(lower,upper)
## [1] 21.71148 23.35413
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Answer:
The 95%CI for mean of our calculation is [21.71,23.35], the mean of t.test is [21.73,23.34]. The difference between them is only a few percentiles.

(e). Based on this data set, provide an estimate, \(\hat{μ}_{med}\), for the median value of medv in the population.

median(medv)
## [1] 21.2

(f). We now would like to estimate the standard error of \(\hat{μ}_{med}\). Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

set.seed(1)
boot.fn = function(data,index)
  median(data[index])
boot1 = boot(medv,boot.fn,1000)
boot1
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

Answer: Median of 21.2 with standard error of 0.378, this value is slightly smaller than the SE for mean calculated by boostrap.

(g). Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity \(\hat{μ}_{0.1}\). (You can use the quantile() function.)

quantile(medv,0.1)
##   10% 
## 12.75

(h). Use the bootstrap to estimate the standard error of \(\hat{μ}_{0.1}\). Comment on your findings.

set.seed(1)
boot.fn = function(data,index)
  quantile(data[index],0.1)
boot2 = boot(medv,boot.fn,1000)
boot2
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526

Answer:
Tenth percentile of 12.75 with standard error of 0.477. This value is a little bit larger than the se for mean or median, but it still small.