Explain how K-fold Cross validation is implemented
\[CV_{(k)} = \frac{1}{k} \sum_{i=1}^k MSE_i\]
what are the advantages and disadvantages of K-fold cross validation relative to “The validation set approach”
what are the advantages and disadvantages of K-fold cross validation relative to “LOOCV”
Fit a logistic regression model that uses income and balance to predict default
glm.default.fit = glm(default~income+balance,family = binomial,data = Default)
summary(glm.default.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
As per the logistic regression model, the p- value for the income and balance predictors are very small which implies that the income and balance predictors are significant predictors in this model. Also it has the positive coefficient which implies that if there is a increase in the income and balance, then there will be high chance for a person to default their debt
glm.default.probs= predict(glm.default.fit,type = "response")
glm.default.pred = rep("No", 10000)
glm.default.pred[glm.default.probs>0.5]="Yes"
table(glm.default.pred,Default$default)
##
## glm.default.pred No Yes
## No 9629 225
## Yes 38 108
mean(glm.default.pred==Default$default)
## [1] 0.9737
mean(glm.default.pred!=Default$default)
## [1] 0.0263
Using the validation set approach, estimate the test error of this model. In order to do this you must perform the following steps:
Split the sample set into a training set and a validation set.
set.seed(1)
train=sample(10000,5000)
train.default= Default[train,]
test.default = Default[-train,]
default.test = test.default$default
head(train.default)
## default student balance income
## 1017 No No 939.0985 45519.02
## 8004 No Yes 397.5425 22710.87
## 4775 Yes No 1511.6110 53506.94
## 9725 No No 301.3194 51539.95
## 8462 No No 878.4461 29561.78
## 4050 Yes No 1673.4863 49310.33
head(test.default)
## default student balance income
## 1 No No 729.5265 44361.63
## 2 No Yes 817.1804 12106.13
## 3 No No 1073.5492 31767.14
## 5 No No 785.6559 38463.50
## 8 No Yes 808.6675 17600.45
## 9 No No 1161.0579 37468.53
fit a multiple logistic regression using only the training observations
glm.default.fit1 = glm(default~income+balance,family = binomial,data = train.default)
summary(glm.default.fit1)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train.default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual and classifying the individual to the default category if the posterior probability is greater than 0.5
glm.default.probs1= predict(glm.default.fit1,data=test.default,type = "response")
glm.default.pred1 = rep("No", 5000)
glm.default.pred1[glm.default.probs1>0.5]="Yes"
Compute the Validation set error which is the fraction of the observations in the Validation set that are mis-classified
table(glm.default.pred1,default.test)
## default.test
## glm.default.pred1 No Yes
## No 4762 157
## Yes 81 0
mean(glm.default.pred1!=test.default$default)
## [1] 0.0476
The test error rate for this model is 4.6%
Repeat the process in (b) three times using three different splits of the observation into training set and Validation set. Comment on the results obtained
set.seed(12)
train1=sample(10000,5000)
train.default1= Default[train1,]
test.default1 = Default[-train1,]
default.test1 = test.default1$default
head(train.default1)
## default student balance income
## 4546 No Yes 250.1730 17439.63
## 4442 No No 0.0000 46542.99
## 1271 No Yes 0.0000 19622.58
## 2222 No Yes 1181.5132 11784.80
## 9285 No No 252.1070 38494.65
## 1500 No No 449.4127 52892.25
head(test.default1)
## default student balance income
## 2 No Yes 817.1804 12106.135
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
## 8 No Yes 808.6675 17600.451
## 11 No Yes 0.0000 21871.073
## 13 No No 237.0451 28251.695
glm.default.fit2 = glm(default~income+balance,family = binomial,data = train.default1)
summary(glm.default.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train.default1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6052 -0.1249 -0.0444 -0.0148 3.8710
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.229e+01 6.715e-01 -18.303 <2e-16 ***
## income 1.627e-05 7.204e-06 2.258 0.0239 *
## balance 6.203e-03 3.587e-04 17.294 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1429.88 on 4999 degrees of freedom
## Residual deviance: 718.38 on 4997 degrees of freedom
## AIC: 724.38
##
## Number of Fisher Scoring iterations: 8
glm.default.probs2= predict(glm.default.fit2,data=test.default1,type = "response")
glm.default.pred2 = rep("No", 5000)
glm.default.pred2[glm.default.probs2>0.5]="Yes"
table(glm.default.pred2,default.test1)
## default.test1
## glm.default.pred2 No Yes
## No 4754 167
## Yes 75 4
mean(glm.default.pred2!=default.test1)
## [1] 0.0484
Validation set Test Error Rate is 4.8% for the split 1
set.seed(123)
train2=sample(10000,5000)
train.default2= Default[train2,]
test.default2 = Default[-train2,]
default.test2 = test.default2$default
head(train.default2)
## default student balance income
## 2463 No No 559.4686 61187.81
## 2511 No No 905.6415 47271.35
## 8718 No Yes 1806.5517 17648.20
## 2986 No No 1615.2250 55219.52
## 1842 No Yes 1229.4415 12158.04
## 9334 No Yes 1723.2161 23279.00
head(test.default2)
## default student balance income
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 6 No Yes 919.5885 7491.559
## 7 No No 825.5133 24905.227
## 8 No Yes 808.6675 17600.451
glm.default.fit3 = glm(default~income+balance,family = binomial,data = train.default2)
summary(glm.default.fit3)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train.default2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2810 -0.1348 -0.0529 -0.0185 3.7767
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.451e-01 -18.504 < 2e-16 ***
## income 2.210e-05 7.381e-06 2.995 0.00275 **
## balance 5.874e-03 3.362e-04 17.474 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1429.88 on 4999 degrees of freedom
## Residual deviance: 752.69 on 4997 degrees of freedom
## AIC: 758.69
##
## Number of Fisher Scoring iterations: 8
glm.default.probs3= predict(glm.default.fit3,data=test.default2,type = "response")
glm.default.pred3 = rep("No", 5000)
glm.default.pred3[glm.default.probs3>0.5]="Yes"
table(glm.default.pred3,default.test2)
## default.test2
## glm.default.pred3 No Yes
## No 4754 169
## Yes 75 2
mean(glm.default.pred3!=default.test2)
## [1] 0.0488
Validation set Test Error Rate is 4.9% for the split 2
set.seed(1234)
train3=sample(10000,5000)
train.default3= Default[train3,]
test.default3 = Default[-train3,]
default.test3 = test.default3$default
head(train.default3)
## default student balance income
## 7452 No Yes 311.32186 22648.76
## 8016 No Yes 697.13558 18377.15
## 7162 No Yes 470.10718 16014.11
## 8086 No No 1200.04162 56081.08
## 7269 No No 553.64902 47021.49
## 9196 No No 10.23149 27237.38
head(test.default3)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 4 No No 529.2506 35704.494
## 6 No Yes 919.5885 7491.559
## 7 No No 825.5133 24905.227
## 8 No Yes 808.6675 17600.451
glm.default.fit4 = glm(default~income+balance,family = binomial,data = train.default3)
summary(glm.default.fit4)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train.default3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0011 -0.1550 -0.0648 -0.0242 3.6746
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.121e+01 6.002e-01 -18.682 < 2e-16 ***
## income 2.171e-05 7.102e-06 3.056 0.00224 **
## balance 5.401e-03 3.133e-04 17.243 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1423.08 on 4999 degrees of freedom
## Residual deviance: 822.85 on 4997 degrees of freedom
## AIC: 828.85
##
## Number of Fisher Scoring iterations: 8
glm.default.probs4= predict(glm.default.fit4,data=test.default3,type = "response")
glm.default.pred4 = rep("No", 5000)
glm.default.pred4[glm.default.probs4>0.5]="Yes"
table(glm.default.pred4,default.test3)
## default.test3
## glm.default.pred4 No Yes
## No 4772 169
## Yes 56 3
mean(glm.default.pred4!=default.test3)
## [1] 0.045
Validation set Test Error Rate is 4.5% for the split 3
By comparing the Confusion matrix for all the three above splits , the results are seems to be similar. Also there is very very minor variability in the test error rate
Now Consider a logistic regression model that predicts the probability of default using income,balance and a dummy variable using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate
set.seed(1)
train=sample(10000,5000)
train.default= Default[train,]
test.default = Default[-train,]
default.test = test.default$default
head(train.default)
## default student balance income
## 1017 No No 939.0985 45519.02
## 8004 No Yes 397.5425 22710.87
## 4775 Yes No 1511.6110 53506.94
## 9725 No No 301.3194 51539.95
## 8462 No No 878.4461 29561.78
## 4050 Yes No 1673.4863 49310.33
head(test.default)
## default student balance income
## 1 No No 729.5265 44361.63
## 2 No Yes 817.1804 12106.13
## 3 No No 1073.5492 31767.14
## 5 No No 785.6559 38463.50
## 8 No Yes 808.6675 17600.45
## 9 No No 1161.0579 37468.53
fit a multiple logistic regression using only the training observations
glm.default.fit5 = glm(default~income+balance+student,family = binomial,data = train.default)
summary(glm.default.fit5)
##
## Call:
## glm(formula = default ~ income + balance + student, family = binomial,
## data = train.default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5823 -0.1419 -0.0554 -0.0210 3.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.134e+01 6.937e-01 -16.346 <2e-16 ***
## income 1.686e-05 1.122e-05 1.502 0.1331
## balance 5.767e-03 3.213e-04 17.947 <2e-16 ***
## studentYes -5.992e-01 3.324e-01 -1.803 0.0715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.77 on 4999 degrees of freedom
## Residual deviance: 800.07 on 4996 degrees of freedom
## AIC: 808.07
##
## Number of Fisher Scoring iterations: 8
glm.default.probs5= predict(glm.default.fit5,data=test.default,type = "response")
glm.default.pred5 = rep("No", 5000)
glm.default.pred5[glm.default.probs5>0.5]="Yes"
table(glm.default.pred5,default.test)
## default.test
## glm.default.pred5 No Yes
## No 4758 157
## Yes 85 0
mean(glm.default.pred5!=default.test)
## [1] 0.0484
Validation set test error rate is 4.8 which is unchanged for the inclusion of the predictor student. So it implies that the student predictor inclusion doesn’t leads to the reduction of the test error rate
We continue to the use of a logistic regression model to predict the probability of Default using income and balance on the Default data set. In particular, we ill now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways:1) Using the Bootstrap, and 2)Using the standard formula for computing the standard errors in the glm() function
Using the summary() and glm() functions determine the estimated standard errors fr the coefficients associated with the income and balance in a multiple linear regression that uses the predictors
set.seed(111)
glm.default.fit6 = glm(default~income+balance,family = binomial,data = Default)
summary(glm.default.fit6)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
(summary(glm.default.fit6))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
Write a function boot.fn() that takes as input the default data set as well as an index of the observations and that outputs the coefficients for income and balance
boot.fn = function(data,index){
default= data$default
income = data$income
balance = data$balance
glm.default.fit7 = glm(default~income+balance,family=binomial,data = data)
return(summary(glm.default.fit7)$coefficients)
}
boot.fn(Default,1:10000)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance
boot(Default,boot.fn,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 0 0
## t2* 2.080898e-05 0 0
## t3* 5.647103e-03 0 0
## t4* 4.347564e-01 0 0
## t5* 4.985167e-06 0 0
## t6* 2.273731e-04 0 0
## t7* -2.654468e+01 0 0
## t8* 4.174178e+00 0 0
## t9* 2.483628e+01 0 0
## t10* 2.958355e-155 0 0
## t11* 2.990638e-05 0 0
## t12* 3.638120e-136 0 0
Comment on the estimated standard errors obtained using the glm() function an the bootstrap function
Based on the glm() function we got the standard error for the income coefficient as 4.985167e-06 and for balance coefficient as 2.273731e-04. These standard errors are calculated based on the assumptions to calculate the logistic regression. whereas in case of bootstrap function, it doesn’t rely on any assumption to calculate the standard errors for the coefficients . So it implies that the use of bootstrap function drastically reduces the standard errors in order to calculate the default using balance and income.
Based on this data set provide an estimate for the population mean of medv. call this estimates as \(\hat{\mu}\).
mu = mean(Boston$medv)
Provide an estimate of the standard error of \(\hat{\mu}\).
sd.medv = sd(Boston$medv)
se.medv = (sd.medv/sqrt(length(Boston$medv)))
se.medv
## [1] 0.4088611
Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. how does this compare to your answer from (b)
mu.fn=function(data, index)
{
medv= data$medv[index]
return(mean(medv))
}
mu.fn(Boston,1:length(Boston$medv))
## [1] 22.53281
boot(Boston,mu.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = mu.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0100417 0.3914634
The standard error calculated by bootstrap function is 0.41 which is same as the standard error calculated nu (b)
Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. compare it to the results obtained using t.test(Boston$medv)
lower.bound=mu-2*se.medv
lower.bound
## [1] 21.71508
upper.bound= mu+2*se.medv
upper.bound
## [1] 23.35053
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The confidence interval created using the standard error for the mean of medv is almost same as created by the one sample t test. Both the values are differ by the decimal values only
Based on this data set provide an estimate \(\hat{\mu_{med}}\), for the median value of medv in the population
median.medv = median(Boston$medv)
med.fn=function(data, index)
{
medv= data$medv[index]
return(median(medv))
}
med.fn(Boston,1:length(Boston$medv))
## [1] 21.2
boot(Boston,med.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = med.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.041 0.380577
The standard error for the median value is 0.37 which is quite less when compared to the median value 21.2. so it implies that estimate of the median value is quite accurate.
Based on this data set, provide an estimate for the tenth percentile of medv in boston suburbs. call this quantity as \(\hat{\mu_{0.1}}\)
tenth.percentile = quantile(Boston$medv, 0.1)
use the bootstap to estimate the standard error of \(\hat{\mu_{0.1}}\)
quantile.fn=function(data, index)
{
medv= data$medv[index]
return(quantile(medv, 0.1))
}
quantile.fn(Boston,1:length(Boston$medv))
## 10%
## 12.75
boot(Boston,quantile.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = quantile.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.011 0.4771023
The standard error for the 10th percentile is 0.51 which is bit higher, more than half percentage so the value by this bootstarp can be little trusted