Lovaas answer: To do k-fold cross-validation, divide the data set into K (usually 5 or 10) equal parts. For each fold, hold that segment out, fit the model to the remaining K - 1 segments, and compute the test MSE on the held-out segment. The k-fold CV estimate is the unweighted average of the 5 or 10 MSEs you get from this procedure.
Lovaas answer: The test MSE estimate from the validation set approach can be highly variable, depending on which observations land in the training set; this is much less of a problem with k-fold CV. The validation set approach also doesn't use the entire dataset to train, which tends to reduce model accuracy and overstate the test error, especially when data is limited.
Lovaas answer: Leave-one-out cross-validation (LOOCV) is more computationally expensive and tends to have higher variance than k-fold CV. The trade-off is that k-fold CV has more bias than LOOCV, which is its downside.
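Not part of the original answers: a minimal sketch of how 10-fold CV could actually be run here, using cv.glm() from the boot package (both ISLR and boot are loaded later in this write-up). The def01 column and the cost() function are my own additions, since cv.glm() wants a numeric 0/1 response to score classification error.
library(ISLR)
library(boot)
set.seed(42)
Default$def01 = as.numeric(Default$default == "Yes") #numeric 0/1 copy of the response
glm.k = glm(def01 ~ income + balance, data = Default, family = binomial)
cost = function(y, prob) mean(abs(y - prob) > 0.5) #0/1 error at a 0.5 threshold
cv.glm(Default, glm.k, cost = cost, K = 10)$delta[1] #10-fold CV error estimate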
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
set.seed(42)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
attach(Default) #For some reason I had to do this again to get the file to knit
train = 1:(dim(Default)[1]/2) #train data
test = (dim(Default)[1]/2 + 1):dim(Default)[1] #test data; assuming I will need this later
Default.train = Default[train, ]
Default.test = Default[test, ]
defaultvar.test = default[test]
glm.fit2 = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2803 -0.1432 -0.0561 -0.0197 3.7407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.198e+01 6.279e-01 -19.078 < 2e-16 ***
## income 2.885e-05 7.060e-06 4.086 4.38e-05 ***
## balance 5.822e-03 3.232e-04 18.013 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1517.14 on 4999 degrees of freedom
## Residual deviance: 797.11 on 4997 degrees of freedom
## AIC: 803.11
##
## Number of Fisher Scoring iterations: 8
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0258
train = sample(dim(Default)[1], dim(Default)[1]/2) #train data, random sample
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2 = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.026
train = sample(dim(Default)[1], dim(Default)[1]/2)
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2 = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.026
train = sample(dim(Default)[1], dim(Default)[1]/2)
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2 = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0294
Lovaas answer ex 5c: The test error rate lands around 2.6-2.9% on each random split; not too bad!
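Since the three blocks above repeat the same split/fit/predict steps, here is a compact way they could be wrapped up (validation.err() is a name I made up; this is an illustration, not part of the original code).
validation.err = function() {
  train = sample(dim(Default)[1], dim(Default)[1]/2) #fresh random half/half split
  fit = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
  probs = predict(fit, Default[-train, ], type = "response")
  pred = ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default$default[-train]) #validation set error rate
}
replicate(3, validation.err())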
#using same train and test from last question.
glm.fit3 = glm(default ~ income + balance + student, data = Default, subset = train, family = binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit3, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0298
Lovaas answer ex 5d: Adding the student dummy variable didn’t seem to reduce the test error; it may have actually increased it.
set.seed(42) #I googled it: you need to reset the seed fairly frequently to keep results reproducible
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Lovaas answer ex 6a: The standard errors are small relative to the coefficient estimates; that's great.
boot.fn = function(data, index) {
  #refit the logistic model on the bootstrap sample and return its coefficients
  coef(glm(default ~ income + balance, data = data, family = binomial, subset = index))
}
library(boot)
## Warning: package 'boot' was built under R version 4.0.5
boot(Default, boot.fn, 50)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 50)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -9.976565e-02 4.207689e-01
## t2* 2.080898e-05 2.821594e-07 5.178128e-06
## t3* 5.647103e-03 4.988990e-05 2.157373e-04
Lovaas answer ex 6d: Standard errors are extremely low under both approaches, and very close to each other. Maybe we would see more of a difference if we were using a model with a worse fit / higher error rate.
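One caveat worth a sketch (my addition, not in the original): with only R = 50 bootstrap replicates, the bootstrap SEs above are themselves fairly noisy. Re-running boot() with more replicates, reusing boot.fn from above, gives a steadier comparison against the summary() standard errors.
set.seed(42)
boot(Default, boot.fn, R = 1000) #more replicates, more stable SE estimates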
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
attach(Boston)
mu = mean(medv)
mu
## [1] 22.53281
se = sd(medv)/sqrt(length(medv))
se
## [1] 0.4088611
Lovaas answer ex 9b: The standard error of the mean is about 0.41. This measures the variability of the sample mean itself, not of the individual observations: the sample mean should fall within one standard error of the true population mean (mean - 0.41, mean + 0.41) roughly 68% of the time.
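As a quick sanity check on that interpretation (my addition, not part of the original answer):
c(mu - se, mu + se) #~68% interval for the sample mean
c(mu - 2*se, mu + 2*se) #rough 95% interval via the two-standard-error rule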
boot.fn = function(data, index) return(mean(data[index]))
library(boot)
bstrap = boot(medv, boot.fn, 1000)
bstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.01940771 0.4138961
Lovaas answer ex 9c: The standard error from the bootstrap (0.414) is about the same as the estimate from (b) (0.409). Neat!
standarderror9c = 0.4138961
alpha = 0.05 #1 - 0.95
degrees.freedom = length(medv) - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
margin.error <- t.score * standarderror9c
mu - margin.error
## [1] 21.71964
mu + margin.error
## [1] 23.34598
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Lovaas answer ex 9d: The 95% confidence interval using the bootstrap standard error is (21.72, 23.35), and the confidence interval from a standard t-test is (21.73, 23.34). These are pretty similar!
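As an aside (my addition, not in the original): boot.ci() from the boot package can compute a bootstrap confidence interval directly from the bstrap object created above.
boot.ci(bstrap, conf = 0.95, type = "perc") #percentile bootstrap CI for the mean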
med = median(medv)
med
## [1] 21.2
boot.fn = function(data, index) return(median(data[index])) #change mean to median
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0077 0.3957175
Lovaas answer ex 9f: The standard error of the median is a bit lower, 0.396; still quite similar to the standard error of the mean.
medv.tenth = quantile(medv, c(0.1))
medv.tenth
## 10%
## 12.75
boot.fn = function(data, index) return(quantile(data[index], c(0.1))) #change median to quantile
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.0025 0.5230318
Lovaas answer ex 9h: The standard error is 0.523; compared to the 10th percentile value of 12.75, this isn't too bad.