Chapter 5 (page 197): exercises 3, 5, 6, 9

Chapter 5 ex 3

  a. Explain how k-fold cross-validation is implemented.

Lovaas answer: To do k-fold cross-validation, divide the data set into K (usually 5 or 10) roughly equal parts. Then, for each of the K folds in turn, hold that fold out, fit the model to the remaining K-1 folds, and compute the MSE on the held-out fold. The k-fold CV estimate is the unweighted average of the K resulting MSEs.
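
A minimal sketch of this recipe in R, assuming a hypothetical data frame df with response y and a single predictor x:

set.seed(1)
k = 10
folds = sample(rep(1:k, length.out = nrow(df))) #randomly assign each row to one of the k folds
cv.errors = rep(NA, k)
for (j in 1:k) {
  fit = lm(y ~ x, data = df[folds != j, ]) #fit on the other k-1 folds
  pred = predict(fit, newdata = df[folds == j, ])
  cv.errors[j] = mean((df$y[folds == j] - pred)^2) #MSE on the held-out fold
}
mean(cv.errors) #unweighted average of the k fold MSEs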

  b. What are the advantages and disadvantages of k-fold cross-validation relative to:
     i. The validation set approach?

Lovaas answer: The validation set approach produces a test error estimate that is highly variable, depending on which observations happen to land in the validation set; k-fold CV averages over K held-out folds and is much more stable. The validation set approach also trains on only part of the data, which tends to overstate the test error, especially when data is limited.

     ii. LOOCV (leave-one-out cross-validation)?

Lovaas answer: LOOCV is more computationally expensive (n model fits instead of K) and its error estimate tends to have higher variance than k-fold CV, since the n fitted models are nearly identical and hence highly correlated. K-fold CV does have more bias than LOOCV, though, because each fit uses fewer observations; that trade-off is the downside.
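
In R, boot::cv.glm runs both flavors in one call; a sketch assuming the same hypothetical df with response y and predictor x:

library(boot)
fit = glm(y ~ x, data = df) #a linear model fit via glm() so that cv.glm() applies
cv.glm(df, fit)$delta[1] #K defaults to n, i.e. LOOCV: n model fits
cv.glm(df, fit, K = 10)$delta[1] #10-fold CV: only 10 fits, at the cost of a little bias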

Chapter 5 ex 5

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
set.seed(42)
glm.fit=glm(default~income + balance,data=Default,family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
attach(Default) #for some reason I had to attach the data again to get the file to knit
train = 1:(dim(Default)[1]/2) #train data: first half of the rows (a deterministic split)
test = (dim(Default)[1]/2 + 1):dim(Default)[1] #test data: second half; assuming I will need this later
Default.train = Default[train, ]
Default.test = Default[test, ]
defaultvar.test = default[test]
glm.fit2=glm(default~income + balance, data=Default, subset=train, family=binomial)
summary(glm.fit2)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2803  -0.1432  -0.0561  -0.0197   3.7407  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.198e+01  6.279e-01 -19.078  < 2e-16 ***
## income       2.885e-05  7.060e-06   4.086 4.38e-05 ***
## balance      5.822e-03  3.232e-04  18.013  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1517.14  on 4999  degrees of freedom
## Residual deviance:  797.11  on 4997  degrees of freedom
## AIC: 803.11
## 
## Number of Fisher Scoring iterations: 8
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0258
train = sample(dim(Default)[1], dim(Default)[1]/2) #train data, random sample
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2=glm(default~income + balance, data=Default, subset=train, family=binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.026
train = sample(dim(Default)[1], dim(Default)[1]/2) 
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2=glm(default~income + balance, data=Default, subset=train, family=binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.026
train = sample(dim(Default)[1], dim(Default)[1]/2) 
Default.train = Default[train, ]
Default.test = Default[-train, ]
defaultvar.test = default[-train]
glm.fit2=glm(default~income + balance, data=Default, subset=train, family=binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit2, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0294

Lovaas answer ex 5c: The test error rate comes out around 2.6-2.9% on every split; not too bad!
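
The three copy-pasted blocks above could also be collapsed into a helper function (val.err is a hypothetical name; same logic, with a fresh random split on each call):

val.err = function() {
  train = sample(dim(Default)[1], dim(Default)[1]/2)
  fit = glm(default ~ income + balance, data = Default, subset = train, family = binomial)
  probs = predict(fit, Default[-train, ], type = "response")
  mean(ifelse(probs > 0.5, "Yes", "No") != Default$default[-train])
}
replicate(3, val.err()) #three validation-set error estimates in one line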

#using the same train/test split as in the previous part

glm.fit3=glm(default~income + balance + student, data=Default, subset=train, family=binomial)
glm.pred = rep("No", dim(Default)[1]/2)
glm.probs = predict(glm.fit3, Default.test, type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != defaultvar.test)
## [1] 0.0298

Lovaas answer ex 5d: Adding the student dummy variable didn’t reduce the test error; on the same split it actually rose slightly (0.0298 vs. 0.0294 without it).

Chapter 5 ex 6

set.seed(42) #reset the seed so the results below are reproducible; this needs doing fairly frequently

glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Lovaas answer ex 6a: the formula-based standard errors are small relative to the coefficient estimates: about 4.35e-01 for the intercept, 4.99e-06 for income, and 2.27e-04 for balance.
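
To pull out just the standard errors for comparison with the bootstrap below, one option is to index into the coefficient table:

summary(glm.fit)$coefficients[, "Std. Error"]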

boot.fn = function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, family = binomial, subset = index)))
}
library(boot)
## Warning: package 'boot' was built under R version 4.0.5
boot(Default, boot.fn, 50)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 50)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -9.976565e-02 4.207689e-01
## t2*  2.080898e-05  2.821594e-07 5.178128e-06
## t3*  5.647103e-03  4.988990e-05 2.157373e-04

Lovaas answer ex 6d: The bootstrap standard errors are extremely close to the formula-based ones from summary(): 4.21e-01 vs. 4.35e-01 for the intercept, 5.18e-06 vs. 4.99e-06 for income, and 2.16e-04 vs. 2.27e-04 for balance. Maybe we would see more of a difference with a model that fit worse or had a higher error rate.
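
Saving the boot object under a name (boot.out here is a hypothetical name; the call above discarded it) makes the comparison explicit. The numbers will wobble from run to run since R = 50 is small:

boot.out = boot(Default, boot.fn, 50) #save the result this time
cbind(glm.se = summary(glm.fit)$coefficients[, "Std. Error"],
      boot.se = apply(boot.out$t, 2, sd)) #bootstrap SE = sd of the 50 replicates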

Chapter 5 ex 9

library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
attach(Boston)
mu = mean(medv)
mu
## [1] 22.53281
se = sd(medv)/sqrt(length(medv))
se
## [1] 0.4088611

Lovaas answer ex 9b: The standard error of the mean is about 0.41. This measures the sampling variability of the mean itself, not the spread of the data: in repeated samples, roughly 68% of sample means would fall within 0.41 of the true mean.
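
As a quick illustration, the one-standard-error interval around the sample mean:

mu + c(-1, 1) * se #interval of one SE around the sample mean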

boot.fn = function(data, index) return(mean(data[index]))
library(boot)
bstrap = boot(medv, boot.fn, 1000)
bstrap
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.01940771   0.4138961

Lovaas answer ex 9c: The bootstrap standard error (0.414) is almost identical to the formula-based estimate from (b) (0.409). Neat!

standarderror9c =  0.4138961
alpha = 0.05 #1 - 0.95
degrees.freedom = length(medv) - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
margin.error <- t.score * standarderror9c
mu - margin.error
## [1] 21.71964
mu + margin.error
## [1] 23.34598
t.test(medv)
## 
##  One Sample t-test
## 
## data:  medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The confidence interval built from the bootstrap standard error is (21.72, 23.35), and the confidence interval from the one-sample t test is (21.73, 23.34). These are nearly identical!
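
ISLR also suggests a quick rule of thumb, mean plus or minus two standard errors, which lands in nearly the same place:

mu + c(-2, 2) * standarderror9c #rule-of-thumb 95% CI using the bootstrap SE from (c)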

med = median(medv)
med
## [1] 21.2
boot.fn = function(data, index) return(median(data[index])) #change mean to median
boot(medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0077   0.3957175

Lovaas answer 9f: the bootstrap standard error of the median is 0.396, a bit lower than the SE of the mean (0.414). Still quite similar.

medv.tenth = quantile(medv, c(0.1))
medv.tenth
##   10% 
## 12.75
boot.fn = function(data, index) return(quantile(data[index], c(0.1))) #change median to the 10th percentile
boot(medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 -0.0025   0.5230318

Lovaas answer ex 9h: the bootstrap standard error is 0.523; compared to the 10th-percentile estimate of 12.75, that’s relatively small.
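
For a full interval rather than just a standard error, the boot object (saved to a name this time; boot.out is hypothetical) can be passed to boot.ci for a percentile interval:

boot.out = boot(medv, boot.fn, 1000)
boot.ci(boot.out, type = "perc") #95% percentile bootstrap CI for the 10th percentile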