R Notebook

Question 3:

We now review k-fold cross-validation.

Part A Explain how k-fold cross-validation is implemented.

K-fold cross-validation is an alternative to LOOCV. It involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k-1 folds. The mean squared error, MSE, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different fold is treated as the validation set. The process yields k estimates of the test error, MSE1, MSE2, …, MSEk, and the k-fold CV estimate is their average, $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i$.
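
The procedure can be sketched by hand in a few lines of R. This is only an illustration under my own assumptions: it uses the built-in mtcars data and a simple linear model (mpg on wt) rather than anything from the exercises, and the names folds, fold_mse, and cv_estimate are made up for the example.

```r
# Minimal k-fold cross-validation for a linear model (illustrative sketch).
set.seed(1)
k <- 5
n <- nrow(mtcars)

# Randomly assign each observation to one of k folds of near-equal size.
folds <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (i in 1:k) {
  held_out <- folds == i
  fit <- lm(mpg ~ wt, data = mtcars[!held_out, ])       # fit on the other k-1 folds
  pred <- predict(fit, newdata = mtcars[held_out, ])    # predict the held-out fold
  fold_mse[i] <- mean((mtcars$mpg[held_out] - pred)^2)  # MSE for fold i
}

cv_estimate <- mean(fold_mse)  # average of the k fold MSEs
cv_estimate
```

Each observation is held out exactly once, so every row contributes to both fitting and validation over the course of the loop.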

Part B What are the advantages and disadvantages of k-fold cross validation relative to:

The validation set approach? The validation set approach is conceptually simple and easy to implement. However, it has two key drawbacks compared to k-fold cross-validation. First, the validation set estimate of the test error can be highly variable, depending on which observations land in the training and validation sets, whereas the k-fold estimate averages over k splits. Second, only a subset of the observations (those in the training set rather than the validation set) is used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, the validation set error rate may overestimate the test error rate for the model fit on the entire data set.
LOOCV? Compared to LOOCV, k-fold cross-validation is computationally cheaper. LOOCV requires fitting the statistical learning method n times, which can be very time-consuming if n is large and each individual model is slow to fit; k-fold needs only k fits. The trade-off is one of bias and variance: each k-fold model is trained on fewer observations than in LOOCV, so its error estimate is somewhat more biased, but because the k fitted models overlap less, the estimate typically has lower variance than the LOOCV estimate.
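
To make the cost difference concrete, here is a small sketch using cv.glm() from the boot package. The data set (the built-in cars data) and the model are my own choices for illustration, not part of the exercise.

```r
library(boot)  # provides cv.glm()

set.seed(1)
fit <- glm(dist ~ speed, data = cars)  # gaussian glm, equivalent to lm()

# LOOCV: K defaults to n, so the model is refit once per observation (50 fits).
loocv_err <- cv.glm(cars, fit)$delta[1]

# 10-fold CV: only 10 refits, regardless of n.
kfold_err <- cv.glm(cars, fit, K = 10)$delta[1]

c(LOOCV = loocv_err, ten_fold = kfold_err)
```

With 50 rows the difference in run time is negligible, but the number of refits grows with n for LOOCV and stays at K for k-fold.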

Question 5:

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

Part A Fit a logistic regression model that uses income and balance to predict default.

library(ISLR)
set.seed(1)
attach(Default)
defaultfit.glm1<-glm(default~income + balance, data = Default, family = "binomial")
summary(defaultfit.glm1)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

defaultprobs1<-predict(defaultfit.glm1, newdata=Default, type="response")
defaultpred.glm1<-ifelse(defaultprobs1>0.5, "Yes", "No")
mean(Default$default==defaultpred.glm1)

## [1] 0.9737

mean(Default$default!=defaultpred.glm1)

## [1] 0.0263

When predicting on the entire data set, the misclassification rate is 0.0263; that is, 2.63% of the observations are misclassified. Because the model is evaluated on the same data used to fit it, this is a training error rate, which tends to underestimate the error on new data.

Part B Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

Split the sample set into a training set and a validation set.
Fit a multiple logistic regression model using only the training observations.

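# Note: sample(1000,500) draws the 500 training indices from rows 1 to 1000 only;
# Default has 10,000 rows, so the other 9,500 rows end up in the validation set.
defaulttrain=sample(1000,500)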
defaultglm.fit2=glm(default~income + balance, data=Default, family="binomial", subset=defaulttrain)
summary(defaultglm.fit2)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = defaulttrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.61918  -0.17565  -0.07825  -0.03426   3.11898  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.162e+01  1.741e+00  -6.672 2.53e-11 ***
## income       5.420e-05  2.136e-05   2.537   0.0112 *  
## balance      5.079e-03  8.906e-04   5.703 1.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 155.02  on 499  degrees of freedom
## Residual deviance:  91.69  on 497  degrees of freedom
## AIC: 97.69
## 
## Number of Fisher Scoring iterations: 8

Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

defaultprobs2<-predict(defaultglm.fit2, newdata=Default[-defaulttrain,], type="response")
defaultpred.glm2<-ifelse(defaultprobs2>0.5, "Yes", "No")

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

mean(Default[-defaulttrain,]$default==defaultpred.glm2)

## [1] 0.9729474

mean(Default[-defaulttrain,]$default!=defaultpred.glm2)

## [1] 0.02705263

With the validation set approach, we get a test error rate of 0.0271; that is, 2.71% of the validation observations are misclassified. Here the model is trained on 500 observations and validated on the remaining 9,500. Compared to the error of the model fitted and evaluated on the entire data set (0.0263), the estimated error is only slightly higher, by about 0.0008.
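
For comparison, a split that draws the training indices from every row could look like the sketch below. It runs on simulated data (sim, x1, x2, and y are placeholders I made up so the example is self-contained); with the Default data the same pattern would use nrow(Default).

```r
# Illustrative data (simulated; stands in for a data set such as Default).
set.seed(1)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(-2 + 1.5 * sim$x1 + sim$x2), "Yes", "No"))

# Draw the training indices from all rows, half for training.
train <- sample(nrow(sim), nrow(sim) / 2)

fit <- glm(y ~ x1 + x2, data = sim, family = "binomial", subset = train)
probs <- predict(fit, newdata = sim[-train, ], type = "response")
pred <- ifelse(probs > 0.5, "Yes", "No")

val_error <- mean(pred != sim$y[-train])  # validation set error rate
val_error
```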

Part C Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(1)
defaulttrain2=sample(1000,750)
defaultglm.fit3=glm(default~income + balance, data=Default, family="binomial", subset=defaulttrain2)
summary(defaultglm.fit3)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = defaulttrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5884  -0.1764  -0.0793  -0.0343   3.2053  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.089e+01  1.384e+00  -7.865 3.68e-15 ***
## income       3.587e-05  1.673e-05   2.143   0.0321 *  
## balance      5.001e-03  7.227e-04   6.920 4.50e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 219.22  on 749  degrees of freedom
## Residual deviance: 134.00  on 747  degrees of freedom
## AIC: 140
## 
## Number of Fisher Scoring iterations: 8

defaultprobs3<-predict(defaultglm.fit3, newdata=Default[-defaulttrain2,], type="response")
defaultpred.glm3<-ifelse(defaultprobs3>0.5, "Yes", "No")
mean(Default[-defaulttrain2,]$default==defaultpred.glm3)

## [1] 0.9730811

mean(Default[-defaulttrain2,]$default!=defaultpred.glm3)

## [1] 0.02691892

With 750 training observations, we obtain a slightly lower test error rate of 0.0269; that is, 2.69% of the validation observations are misclassified. Compared to the split with 500 training observations, the misclassification rate is lower by only about 0.0001.

set.seed(1)
defaulttrain3=sample(1000,250)
defaultglm.fit4=glm(default~income + balance, data=Default, family="binomial", subset=defaulttrain3)
summary(defaultglm.fit4)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = defaulttrain3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.55680  -0.07651  -0.02198  -0.00727   3.15773  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.707e+01  4.428e+00  -3.855 0.000116 ***
## income       1.147e-04  5.299e-05   2.165 0.030372 *  
## balance      6.597e-03  1.914e-03   3.446 0.000568 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 56.611  on 249  degrees of freedom
## Residual deviance: 22.072  on 247  degrees of freedom
## AIC: 28.072
## 
## Number of Fisher Scoring iterations: 9

defaultprobs4<-predict(defaultglm.fit4, newdata=Default[-defaulttrain3,], type="response")
defaultpred.glm4<-ifelse(defaultprobs4>0.5, "Yes", "No")
mean(Default[-defaulttrain3,]$default==defaultpred.glm4)

## [1] 0.9700513

mean(Default[-defaulttrain3,]$default!=defaultpred.glm4)

## [1] 0.02994872

When training the model on a much smaller set of 250 observations, we get a test error rate of 0.0299; that is, 2.99% of the validation observations are misclassified. This is the highest test error rate of the splits so far. A likely reason is that the model is trained on a much smaller sample, so its coefficient estimates are less precise (note the larger standard errors above) and it generalizes less well to the large validation set.

defaulttrain4=sample(1000,100)
defaultglm.fit5=glm(default~income + balance, data=Default, family="binomial", subset=defaulttrain4)
summary(defaultglm.fit5)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default, subset = defaulttrain4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.96206  -0.03682  -0.01111  -0.00204   1.46223  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.765e+01  7.895e+00  -2.235   0.0254 *
## income      -7.394e-06  8.399e-05  -0.088   0.9298  
## balance      9.666e-03  4.300e-03   2.248   0.0246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.9484  on 99  degrees of freedom
## Residual deviance:  8.6831  on 97  degrees of freedom
## AIC: 14.683
## 
## Number of Fisher Scoring iterations: 10

defaultprobs5<-predict(defaultglm.fit5, newdata=Default[-defaulttrain4,], type="response")
defaultpred.glm5<-ifelse(defaultprobs5>0.5, "Yes", "No")
mean(Default[-defaulttrain4,]$default==defaultpred.glm5)

## [1] 0.9714141

mean(Default[-defaulttrain4,]$default!=defaultpred.glm5)

## [1] 0.02858586

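When fitting on an even smaller training set of 100 observations, we get a test error rate of 0.0286; that is, 2.86% of the validation observations are misclassified. This is higher than with 500 or 750 training observations but lower than with 250, so the error does not increase steadily as the training set shrinks. Across the splits the estimate ranges from 0.0269 to 0.0299, which shows that the validation set estimate depends on which observations are used for training. With so little training data the coefficients are estimated imprecisely (income is no longer significant in this fit), so more data are needed to predict default reliably from income and balance.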
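
The variability across splits can be examined more systematically by repeating the split many times at each training size. The sketch below does this on simulated data; the data frame sim and the helper split_error() are hypothetical names for the illustration, not part of the assignment.

```r
set.seed(42)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(-2 + 1.5 * sim$x1 + sim$x2), "Yes", "No"))

# Validation error for one random split with a given training-set size.
split_error <- function(data, n_train) {
  train <- sample(nrow(data), n_train)
  fit <- glm(y ~ x1 + x2, data = data, family = "binomial", subset = train)
  probs <- predict(fit, newdata = data[-train, ], type = "response")
  mean(ifelse(probs > 0.5, "Yes", "No") != data$y[-train])
}

# Ten random splits at each training size; compare the spread of the estimates.
errors_small <- replicate(10, split_error(sim, 100))
errors_large <- replicate(10, split_error(sim, 1000))
rbind(small = range(errors_small), large = range(errors_large))
```

Looking at the range over several splits, rather than a single split per size, makes it easier to separate the effect of training-set size from the luck of one particular draw.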

Part D Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

set.seed(1)
strain<-sample(1000, 500)
sglm.fit=glm(default~income + balance + student, data=Default, family="binomial", subset=strain)
summary(sglm.fit)

## 
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial", 
##     data = Default, subset = strain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.23260  -0.15085  -0.06748  -0.02663   3.06443  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.486e+01  2.686e+00  -5.534 3.13e-08 ***
## income       1.196e-04  4.110e-05   2.909  0.00362 ** 
## balance      5.129e-03  9.106e-04   5.633 1.77e-08 ***
## studentYes   2.355e+00  1.240e+00   1.899  0.05761 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 155.017  on 499  degrees of freedom
## Residual deviance:  87.706  on 496  degrees of freedom
## AIC: 95.706
## 
## Number of Fisher Scoring iterations: 8

sprobs<-predict(sglm.fit, newdata=Default[-strain,], type="response")
spred.glm<-ifelse(sprobs>0.5, "Yes", "No")
mean(Default[-strain,]$default==spred.glm)

## [1] 0.9703158

mean(Default[-strain,]$default!=spred.glm)

## [1] 0.02968421

With the student dummy variable, we get a test error rate of 0.0297; that is, 2.97% of the validation observations are misclassified when using income, balance, and student status. Adding the student dummy variable does not appear to reduce the test error rate: the model with just income and balance, fit on the same split in Part B, had a test error rate of 0.0271, so the error is higher by about 0.0026. This difference is small relative to the variation across splits seen in Part C, so there is no evidence that the additional variable improves prediction.

Question 6:

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

Part A Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

library(ISLR)
set.seed(1)
defaultglm.fit<-glm(default~income+balance, data=Default, family="binomial")
summary(defaultglm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

The logistic model estimates of the standard errors for the coefficients for intercept ($\beta_0$), income ($\beta_1$), and balance ($\beta_2$) are respectively 0.4348, 0.000004985, and 0.0002274.

Part B Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

defaultboot.fn=function(Default, index) { 
  fit<-glm(default~income+balance, data=Default, family="binomial", subset = index)
  return(coef(fit))
}

Part C Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
boot(Default, defaultboot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = defaultboot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -8.008379e-03 4.239273e-01
## t2*  2.080898e-05  5.870933e-08 4.582525e-06
## t3*  5.647103e-03  2.299970e-06 2.267955e-04

The bootstrap estimates of the standard errors for the intercept (t1), income (t2), and balance (t3) coefficients are respectively 0.4239, 0.000004583, and 0.0002268.

Part D Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard errors obtained from the glm() function and from the bootstrap are extremely similar. The difference between the glm and bootstrap standard errors is 0.0109 for the intercept, 0.0000004 for income, and 0.0000006 for balance.
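
To see what the bootstrap is doing behind boot(), the resampling can also be written out by hand. The sketch below uses simulated data and a one-predictor logistic model (sim, x, y, and R = 200 are my own choices to keep it quick); it is not the Default analysis.

```r
set.seed(1)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 0.8 * sim$x))

fit <- glm(y ~ x, data = sim, family = "binomial")
formula_se <- summary(fit)$coefficients[, "Std. Error"]

# Bootstrap by hand: resample rows with replacement, refit, keep the coefficients.
R <- 200
boot_coefs <- t(replicate(R, {
  idx <- sample(n, n, replace = TRUE)
  coef(glm(y ~ x, data = sim[idx, ], family = "binomial"))
}))
boot_se <- apply(boot_coefs, 2, sd)  # sd of each coefficient over the resamples

rbind(formula = formula_se, bootstrap = boot_se)
```

The bootstrap standard error is simply the standard deviation of the refitted coefficients, which is why it does not rely on the model assumptions behind the formula used by summary().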

Question 9:

We will now consider the Boston housing data set, from the MASS library.

library(MASS)
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

set.seed(1)

Part A Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.

medv.mean=mean(Boston$medv)
medv.mean

## [1] 22.53281

Part B Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result.

medv.sterr=sd(Boston$medv)/sqrt(length(Boston$medv))
medv.sterr

## [1] 0.4088611

The standard error of the mean of medv is 0.4089. The standard error measures how precisely the sample statistic estimates the population parameter. Here it tells us that, across repeated samples of this size, the sample mean ($\hat{\mu}$) would typically differ from the population mean ($\mu$) by about 0.4089. Since medv is recorded in thousands of dollars, that is roughly 409 dollars.
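
For reference, the calculation in the code above is the usual formula for the standard error of a sample mean, where $s$ is the sample standard deviation and $n$ the number of observations. The value $n = 506$ is the number of rows in Boston, and $s \approx 9.197$ is implied by the output above rather than printed directly:

$$\mathrm{SE}(\hat{\mu}) = \frac{s}{\sqrt{n}} \approx \frac{9.197}{\sqrt{506}} \approx 0.4089$$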

Part C Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (b)?

medvboot.fn=function(Boston, index) { 
  return(mean(Boston[index]))
}
bootstrap.mean=boot(Boston$medv, medvboot.fn, 1000)
bootstrap.mean

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = medvboot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.008517589   0.4119374

The standard error obtained by bootstrapping (0.4119) is very close to the one calculated from the formula in (b) (0.4089); the two differ by only about 0.003.

Part D Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

# T- Test 
t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

#Bootstrap 
c(bootstrap.mean$t0-2*0.4136, bootstrap.mean$t0+2*0.4136)

## [1] 21.70561 23.36001

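The bootstrap interval is slightly wider than the t.test interval: its lower bound is about 0.024 lower and its upper bound about 0.024 higher. Note that the interval above uses a hard-coded standard error of 0.4136, whereas the bootstrap run shown in (c) reported 0.4119; with 0.4119 the interval would be marginally narrower but still very close to the t.test interval.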
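
One way to avoid typing the standard error by hand is to read it from the boot object. The sketch below is an alternative under my own naming (mean_fn, b, se, ci_2se are illustrative), and because it reruns the bootstrap its numbers need not match the output above exactly.

```r
library(MASS)  # Boston data
library(boot)  # boot(), boot.ci()

set.seed(1)
mean_fn <- function(x, index) mean(x[index])
b <- boot(Boston$medv, mean_fn, R = 1000)

se <- sd(b$t[, 1])  # bootstrap standard error, read from the resampled means
ci_2se <- c(b$t0 - 2 * se, b$t0 + 2 * se)

# boot.ci() offers ready-made intervals; "norm" and "perc" are two of its types.
ci_boot <- boot.ci(b, type = c("norm", "perc"))

ci_2se
ci_boot
```

Reading the standard error from b keeps the interval in sync with whichever bootstrap run produced it.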

Part E Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in the population.

medv.median=median(Boston$medv)
medv.median

## [1] 21.2

Part F We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

medvboot.fn2=function(Boston, index) { 
  return(median(Boston[index]))
}
bootstrap.median=boot(Boston$medv, medvboot.fn2, 1000)
bootstrap.median

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = medvboot.fn2, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0098   0.3874004

The bootstrap estimate of the standard error of the median of medv is 0.3874. As with the mean, the standard error measures how precisely the sample statistic estimates the population parameter: across repeated samples, the sample median ($\hat{\mu}_{med}$) would typically differ from the population median ($\mu_{med}$) by about 0.3874. This is small relative to the estimate of 21.2 and slightly smaller than the standard error of the mean.

Part G Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You can use the quantile() function.)

medv.tenth=quantile(Boston$medv, c(0.1))
medv.tenth

##   10% 
## 12.75

The estimate for the tenth percentile of medv ($\hat{\mu}_{0.1}$) in Boston suburbs is 12.75.

Part H Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.

medvboot.fn3=function(Boston, index) { 
  return(quantile(Boston[index], c(0.1)))
}
bootstrap.tenth=boot(Boston$medv, medvboot.fn3, 1000)
bootstrap.tenth

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = medvboot.fn3, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.00515   0.5113487

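The bootstrap reproduces the estimate of 12.75 for the tenth percentile of medv in the Boston suburbs data set and gives a standard error of 0.5113. This is larger than the standard errors of the mean and the median, but still small relative to the estimate of 12.75.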
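
Since the mean, median, and tenth percentile were all bootstrapped the same way, the three steps can be folded into one helper. This is a base-R sketch on simulated data; boot_se() and the vector x are hypothetical names for the illustration and do not use the boot package.

```r
# Generic bootstrap standard error for any statistic of a numeric vector.
boot_se <- function(x, stat, R = 1000) {
  reps <- replicate(R, stat(sample(x, length(x), replace = TRUE)))
  sd(reps)
}

set.seed(1)
x <- rexp(500, rate = 1 / 20)  # simulated right-skewed values (placeholder data)

res <- c(mean   = boot_se(x, mean),
         median = boot_se(x, median),
         p10    = boot_se(x, function(v) quantile(v, 0.1)))
res
```

Passing the statistic as a function argument means the same resampling code covers statistics, such as the median or a percentile, that have no simple standard error formula.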