Question 3

Question 3a

Explain how K-fold Cross validation is implemented

Steps For K-fold Crossvalidation

  • Dataset is divided in to k number of groups(folds)
  • The first fold is held out as validation fold.
  • Model is trained with Remaining K- number of folds which in turn used to predict the outcome for the observations from validation fold.
  • The above process repeated for K number of times until all the folds are used as validation fold
  • K fold cross validation calculates the Mean squared error rate from the validation fold. It results in k number of test error rate . The final output from k fold cross validation is the average estimate of k number of Mean squared error rate as per the below formula.

\[CV_{(k)} = \frac{1}{k} \sum_{i=1}^k MSE_i\]

Question 3b(i)

what are the advantages and disadvantages of K-fold cross validation relative to “The validation set approach”

Advantages

  • In K fold cross validation,Model is validated K number of terms instead of one time as in Validation set approach using training and validation set.
  • The Error Estimate of K fold cross validation is highly reliable as all the k folds are used as a validation fold instead of model is trained on fewer observation and predict the output on fewer observation as in Validation set approach.
  • K fold cross validation will not over estimate the error rate as in validation set approach, because of the number of observation involved in training the model.

Disadvantages

  • K fold cross validation is a complex method and needs little bit computation compared to validation set approach.

Question 3b(ii)

what are the advantages and disadvantages of K-fold cross validation relative to “LOOCV”

Advantages

  • Data set is divided into K number of folds and validated against the model for K number of times, where k is less than the Number of observation. whereas in LOOCV all the observation is validated using the model which result in n number of validation of the model.
  • The output of k fold estimates shows high variance as the model are validated k<n times, the estimates are less correlated with each other whereas in LOOCV, the model is validated n number of terms, the estimates are highly correlated with each other.

Disadvantages

  • K fold cross validation provides biased estimates compared to LOOCV. Because in K fold validation the model is trained with fewer observations only whereas in LOOCV the model is trained with n-1 observation. So bias reduction is more in LOOCV than k-fold cross validation.

Question 5

Question 5a

Fit a logistic regression model that uses income and balance to predict default

glm.default.fit = glm(default~income+balance,family = binomial,data = Default)
summary(glm.default.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

As per the logistic regression model, the p- value for the income and balance predictors are very small which implies that the income and balance predictors are significant predictors in this model. Also it has the positive coefficient which implies that if there is a increase in the income and balance, then there will be high chance for a person to default their debt

glm.default.probs= predict(glm.default.fit,type = "response")
glm.default.pred = rep("No", 10000)
glm.default.pred[glm.default.probs>0.5]="Yes"
table(glm.default.pred,Default$default)
##                 
## glm.default.pred   No  Yes
##              No  9629  225
##              Yes   38  108
mean(glm.default.pred==Default$default)
## [1] 0.9737
mean(glm.default.pred!=Default$default)
## [1] 0.0263

Question 5b

Using the validation set approach, estimate the test error of this model. In order to do this you must perform the following steps:

Split the sample set into a training set and a validation set.

set.seed(1)
train=sample(10000,5000)
train.default= Default[train,]
test.default = Default[-train,]
default.test = test.default$default
head(train.default)
##      default student   balance   income
## 1017      No      No  939.0985 45519.02
## 8004      No     Yes  397.5425 22710.87
## 4775     Yes      No 1511.6110 53506.94
## 9725      No      No  301.3194 51539.95
## 8462      No      No  878.4461 29561.78
## 4050     Yes      No 1673.4863 49310.33
head(test.default)
##   default student   balance   income
## 1      No      No  729.5265 44361.63
## 2      No     Yes  817.1804 12106.13
## 3      No      No 1073.5492 31767.14
## 5      No      No  785.6559 38463.50
## 8      No     Yes  808.6675 17600.45
## 9      No      No 1161.0579 37468.53

fit a multiple logistic regression using only the training observations

glm.default.fit1 = glm(default~income+balance,family = binomial,data = train.default)
summary(glm.default.fit1)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train.default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5830  -0.1428  -0.0573  -0.0213   3.3395  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.194e+01  6.178e-01 -19.333  < 2e-16 ***
## income       3.262e-05  7.024e-06   4.644 3.41e-06 ***
## balance      5.689e-03  3.158e-04  18.014  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.8  on 4999  degrees of freedom
## Residual deviance:  803.3  on 4997  degrees of freedom
## AIC: 809.3
## 
## Number of Fisher Scoring iterations: 8

Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual and classifying the individual to the default category if the posterior probability is greater than 0.5

glm.default.probs1= predict(glm.default.fit1,data=test.default,type = "response")
glm.default.pred1 = rep("No", 5000)
glm.default.pred1[glm.default.probs1>0.5]="Yes"

Compute the Validation set error which is the fraction of the observations in the Validation set that are mis-classified

table(glm.default.pred1,default.test)
##                  default.test
## glm.default.pred1   No  Yes
##               No  4762  157
##               Yes   81    0
mean(glm.default.pred1!=test.default$default)
## [1] 0.0476

The test error rate for this model is 4.6%

Question 5c

Repeat the process in (b) three times using three different splits of the observation into training set and Validation set. Comment on the results obtained

Split 1

set.seed(12)
train1=sample(10000,5000)
train.default1= Default[train1,]
test.default1 = Default[-train1,]
default.test1 = test.default1$default
head(train.default1)
##      default student   balance   income
## 4546      No     Yes  250.1730 17439.63
## 4442      No      No    0.0000 46542.99
## 1271      No     Yes    0.0000 19622.58
## 2222      No     Yes 1181.5132 11784.80
## 9285      No      No  252.1070 38494.65
## 1500      No      No  449.4127 52892.25
head(test.default1)
##    default student  balance    income
## 2       No     Yes 817.1804 12106.135
## 5       No      No 785.6559 38463.496
## 6       No     Yes 919.5885  7491.559
## 8       No     Yes 808.6675 17600.451
## 11      No     Yes   0.0000 21871.073
## 13      No      No 237.0451 28251.695
glm.default.fit2 = glm(default~income+balance,family = binomial,data = train.default1)
summary(glm.default.fit2)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train.default1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6052  -0.1249  -0.0444  -0.0148   3.8710  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.229e+01  6.715e-01 -18.303   <2e-16 ***
## income       1.627e-05  7.204e-06   2.258   0.0239 *  
## balance      6.203e-03  3.587e-04  17.294   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1429.88  on 4999  degrees of freedom
## Residual deviance:  718.38  on 4997  degrees of freedom
## AIC: 724.38
## 
## Number of Fisher Scoring iterations: 8
glm.default.probs2= predict(glm.default.fit2,data=test.default1,type = "response")
glm.default.pred2 = rep("No", 5000)
glm.default.pred2[glm.default.probs2>0.5]="Yes"
table(glm.default.pred2,default.test1)
##                  default.test1
## glm.default.pred2   No  Yes
##               No  4754  167
##               Yes   75    4
mean(glm.default.pred2!=default.test1)
## [1] 0.0484

Validation set Test Error Rate is 4.8% for the split 1

Split 2

set.seed(123)
train2=sample(10000,5000)
train.default2= Default[train2,]
test.default2 = Default[-train2,]
default.test2 = test.default2$default
head(train.default2)
##      default student   balance   income
## 2463      No      No  559.4686 61187.81
## 2511      No      No  905.6415 47271.35
## 8718      No     Yes 1806.5517 17648.20
## 2986      No      No 1615.2250 55219.52
## 1842      No     Yes 1229.4415 12158.04
## 9334      No     Yes 1723.2161 23279.00
head(test.default2)
##   default student   balance    income
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 6      No     Yes  919.5885  7491.559
## 7      No      No  825.5133 24905.227
## 8      No     Yes  808.6675 17600.451
glm.default.fit3 = glm(default~income+balance,family = binomial,data = train.default2)
summary(glm.default.fit3)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train.default2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2810  -0.1348  -0.0529  -0.0185   3.7767  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.194e+01  6.451e-01 -18.504  < 2e-16 ***
## income       2.210e-05  7.381e-06   2.995  0.00275 ** 
## balance      5.874e-03  3.362e-04  17.474  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1429.88  on 4999  degrees of freedom
## Residual deviance:  752.69  on 4997  degrees of freedom
## AIC: 758.69
## 
## Number of Fisher Scoring iterations: 8
glm.default.probs3= predict(glm.default.fit3,data=test.default2,type = "response")
glm.default.pred3 = rep("No", 5000)
glm.default.pred3[glm.default.probs3>0.5]="Yes"
table(glm.default.pred3,default.test2)
##                  default.test2
## glm.default.pred3   No  Yes
##               No  4754  169
##               Yes   75    2
mean(glm.default.pred3!=default.test2)
## [1] 0.0488

Validation set Test Error Rate is 4.9% for the split 2

Split 3

set.seed(1234)
train3=sample(10000,5000)
train.default3= Default[train3,]
test.default3 = Default[-train3,]
default.test3 = test.default3$default
head(train.default3)
##      default student    balance   income
## 7452      No     Yes  311.32186 22648.76
## 8016      No     Yes  697.13558 18377.15
## 7162      No     Yes  470.10718 16014.11
## 8086      No      No 1200.04162 56081.08
## 7269      No      No  553.64902 47021.49
## 9196      No      No   10.23149 27237.38
head(test.default3)
##   default student  balance    income
## 1      No      No 729.5265 44361.625
## 2      No     Yes 817.1804 12106.135
## 4      No      No 529.2506 35704.494
## 6      No     Yes 919.5885  7491.559
## 7      No      No 825.5133 24905.227
## 8      No     Yes 808.6675 17600.451
glm.default.fit4 = glm(default~income+balance,family = binomial,data = train.default3)
summary(glm.default.fit4)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train.default3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0011  -0.1550  -0.0648  -0.0242   3.6746  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.121e+01  6.002e-01 -18.682  < 2e-16 ***
## income       2.171e-05  7.102e-06   3.056  0.00224 ** 
## balance      5.401e-03  3.133e-04  17.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1423.08  on 4999  degrees of freedom
## Residual deviance:  822.85  on 4997  degrees of freedom
## AIC: 828.85
## 
## Number of Fisher Scoring iterations: 8
glm.default.probs4= predict(glm.default.fit4,data=test.default3,type = "response")
glm.default.pred4 = rep("No", 5000)
glm.default.pred4[glm.default.probs4>0.5]="Yes"
table(glm.default.pred4,default.test3)
##                  default.test3
## glm.default.pred4   No  Yes
##               No  4772  169
##               Yes   56    3
mean(glm.default.pred4!=default.test3)
## [1] 0.045

Validation set Test Error Rate is 4.5% for the split 3

By comparing the Confusion matrix for all the three above splits , the results are seems to be similar. Also there is very very minor variability in the test error rate

Question 5d

Now Consider a logistic regression model that predicts the probability of default using income,balance and a dummy variable using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate

set.seed(1)
train=sample(10000,5000)
train.default= Default[train,]
test.default = Default[-train,]
default.test = test.default$default
head(train.default)
##      default student   balance   income
## 1017      No      No  939.0985 45519.02
## 8004      No     Yes  397.5425 22710.87
## 4775     Yes      No 1511.6110 53506.94
## 9725      No      No  301.3194 51539.95
## 8462      No      No  878.4461 29561.78
## 4050     Yes      No 1673.4863 49310.33
head(test.default)
##   default student   balance   income
## 1      No      No  729.5265 44361.63
## 2      No     Yes  817.1804 12106.13
## 3      No      No 1073.5492 31767.14
## 5      No      No  785.6559 38463.50
## 8      No     Yes  808.6675 17600.45
## 9      No      No 1161.0579 37468.53

fit a multiple logistic regression using only the training observations

glm.default.fit5 = glm(default~income+balance+student,family = binomial,data = train.default)
summary(glm.default.fit5)
## 
## Call:
## glm(formula = default ~ income + balance + student, family = binomial, 
##     data = train.default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5823  -0.1419  -0.0554  -0.0210   3.3961  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.134e+01  6.937e-01 -16.346   <2e-16 ***
## income       1.686e-05  1.122e-05   1.502   0.1331    
## balance      5.767e-03  3.213e-04  17.947   <2e-16 ***
## studentYes  -5.992e-01  3.324e-01  -1.803   0.0715 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.77  on 4999  degrees of freedom
## Residual deviance:  800.07  on 4996  degrees of freedom
## AIC: 808.07
## 
## Number of Fisher Scoring iterations: 8
glm.default.probs5= predict(glm.default.fit5,data=test.default,type = "response")
glm.default.pred5 = rep("No", 5000)
glm.default.pred5[glm.default.probs5>0.5]="Yes"
table(glm.default.pred5,default.test)
##                  default.test
## glm.default.pred5   No  Yes
##               No  4758  157
##               Yes   85    0
mean(glm.default.pred5!=default.test)
## [1] 0.0484

Validation set test error rate is 4.8 which is unchanged for the inclusion of the predictor student. So it implies that the student predictor inclusion doesn’t leads to the reduction of the test error rate

Question 6

We continue to the use of a logistic regression model to predict the probability of Default using income and balance on the Default data set. In particular, we ill now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways:1) Using the Bootstrap, and 2)Using the standard formula for computing the standard errors in the glm() function

Question 6a

Using the summary() and glm() functions determine the estimated standard errors fr the coefficients associated with the income and balance in a multiple linear regression that uses the predictors

set.seed(111)
glm.default.fit6 = glm(default~income+balance,family = binomial,data = Default)
summary(glm.default.fit6)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
(summary(glm.default.fit6))$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

Question 6b

Write a function boot.fn() that takes as input the default data set as well as an index of the observations and that outputs the coefficients for income and balance

boot.fn = function(data,index){
  default= data$default
  income = data$income
  balance = data$balance
  glm.default.fit7 = glm(default~income+balance,family=binomial,data = data)
  return(summary(glm.default.fit7)$coefficients)
}
boot.fn(Default,1:10000)
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

Question 6c

Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance

boot(Default,boot.fn,R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##            original  bias    std. error
## t1*   -1.154047e+01       0           0
## t2*    2.080898e-05       0           0
## t3*    5.647103e-03       0           0
## t4*    4.347564e-01       0           0
## t5*    4.985167e-06       0           0
## t6*    2.273731e-04       0           0
## t7*   -2.654468e+01       0           0
## t8*    4.174178e+00       0           0
## t9*    2.483628e+01       0           0
## t10*  2.958355e-155       0           0
## t11*   2.990638e-05       0           0
## t12*  3.638120e-136       0           0

Question 6d

Comment on the estimated standard errors obtained using the glm() function an the bootstrap function

Inference:

Based on the glm() function we got the standard error for the income coefficient as 4.985167e-06 and for balance coefficient as 2.273731e-04. These standard errors are calculated based on the assumptions to calculate the logistic regression. whereas in case of bootstrap function, it doesn’t rely on any assumption to calculate the standard errors for the coefficients . So it implies that the use of bootstrap function drastically reduces the standard errors in order to calculate the default using balance and income.

Question 9

Question 9a

Based on this data set provide an estimate for the population mean of medv. call this estimates as \(\hat{\mu}\).

mu = mean(Boston$medv)

Question 9b

Provide an estimate of the standard error of \(\hat{\mu}\).

sd.medv = sd(Boston$medv)
se.medv = (sd.medv/sqrt(length(Boston$medv)))
se.medv
## [1] 0.4088611

Question 9c

Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. how does this compare to your answer from (b)

mu.fn=function(data, index)
{
  medv= data$medv[index]
  
  return(mean(medv))
}
mu.fn(Boston,1:length(Boston$medv))
## [1] 22.53281
boot(Boston,mu.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = mu.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 22.53281 0.0100417   0.3914634

The standard error calculated by bootstrap function is 0.41 which is same as the standard error calculated nu (b)

Question 9d

Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. compare it to the results obtained using t.test(Boston$medv)

95% confidence interval

lower.bound=mu-2*se.medv
lower.bound
## [1] 21.71508
upper.bound= mu+2*se.medv
upper.bound
## [1] 23.35053
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The confidence interval created using the standard error for the mean of medv is almost same as created by the one sample t test. Both the values are differ by the decimal values only

Question 9e

Based on this data set provide an estimate \(\hat{\mu_{med}}\), for the median value of medv in the population

median.medv = median(Boston$medv)

Question 9f

med.fn=function(data, index)
{
  medv= data$medv[index]
  
  return(median(medv))
}
med.fn(Boston,1:length(Boston$medv))
## [1] 21.2
boot(Boston,med.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = med.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  -0.041    0.380577

The standard error for the median value is 0.37 which is quite less when compared to the median value 21.2. so it implies that estimate of the median value is quite accurate.

Question 9g

Based on this data set, provide an estimate for the tenth percentile of medv in boston suburbs. call this quantity as \(\hat{\mu_{0.1}}\)

tenth.percentile = quantile(Boston$medv, 0.1)

Question 9h

use the bootstap to estimate the standard error of \(\hat{\mu_{0.1}}\)

quantile.fn=function(data, index)
{
  medv= data$medv[index]
  
  return(quantile(medv, 0.1))
}
quantile.fn(Boston,1:length(Boston$medv))
##   10% 
## 12.75
boot(Boston,quantile.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = quantile.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75   0.011   0.4771023

The standard error for the 10th percentile is 0.51 which is bit higher, more than half percentage so the value by this bootstarp can be little trusted