HW4 - Justin Howard

Load libraries

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.1.2

library(MASS)

## Warning: package 'MASS' was built under R version 4.1.2

library(class)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(boot)

##3. We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented. (b) What are the advantages and disadvantages of k-fold cross validation relative to: i. The validation set approach? ii. LOOCV? (a) This approach involves randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2,…, MSEk. The k-fold CV estimate is computed by averaging these values.

The above explanation was taken from the book, in my own words, k-fold CV takes a data set and breaks out into a certain number (k) of data sets. If the breakout number (k) is 5, the dataset is broken into 5 datasets. 1 dataset is used as the test (validation) data set and the other 4 datasets are combined as the training set. The model is fitted with the selected test set and trained on the 4 datasets combined into one training set. The MSE of that model is taken, then the process is repeated 4 more times, switching out the test set each time until each data set is used as the test set, while the other datasets are combined as the training data set. The MSEs from each model are then averaged and the result is the average MSE of the k number of models processed.

advantages of k-fold CV relative to validation set: k-fold CV is less variable than validation set and k-fold CV trains on more data than the validation set, leading to lower test error rate compared to validation test error rate.

Disadvantage of k-fold CV relative to validation set: validation is more simple and is less computational than k-fold CV

Advantages of k-fold CV relative to LOOCV: Computational cost for k-fold CV is less compared to LOOCV, especially for more complex models. K-fold CV often gives more accurate estimates of the test error rate than does LOOCV. K-fold CV will have less variance than LOOCV.

Disadvantages of k-fold CV relative to LOOCV: LOOCV leads to unbiased estimates of the test error because it is training on n-1 observations and k-fold CV trains on k-1 folds. LOOCV will have less bias than k-fold CV

##5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. (a) Fit a logistic regression model that uses income and balance to predict default.

Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set. ii. Fit a multiple logistic regression model using only the train ing observations. iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5. iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com ment on the results obtained.
Now consider a logistic regression model that predicts the prob ability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the val idation set approach. Comment on whether or not including a dummy

Looking at the Default data set

str(Default)

## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

?Default

## starting httpd help server ... done

summary(Default)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

table(Default$default)

## 
##   No  Yes 
## 9667  333

head(Default)

##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

##5. (a) Fit a logistic regression model that uses income and balance to predict default.

glm.d1 = glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.d1)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

##5. (b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set.

set.seed(22)
tr_ind = sample(nrow(Default), 0.5*nrow(Default), replace = F)  ##using a 50/50 validationa train/test split

Default_train = Default[tr_ind,] ##5000 obs and 4 variables
Default_test = Default[-tr_ind,] ##5000 obs and 4 variables

Create the labels

default.test.labels = Default$default[-tr_ind]
summary(default.test.labels) ##5000 labels

##   No  Yes 
## 4831  169

Fit a multiple logistic regression model using only the train ing observations.

glm.d2 = glm(default ~ balance + income, data = Default_train, family = binomial)
summary(glm.d2)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4800  -0.1431  -0.0558  -0.0207   3.7215  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.170e+01  6.314e-01 -18.535  < 2e-16 ***
## balance      5.668e-03  3.255e-04  17.413  < 2e-16 ***
## income       2.344e-05  7.092e-06   3.305 0.000951 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1443.44  on 4999  degrees of freedom
## Residual deviance:  771.54  on 4997  degrees of freedom
## AIC: 777.54
## 
## Number of Fisher Scoring iterations: 8

Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5. iv.

glm.probsDefault = predict(glm.d2, Default_test, type = "response")
glm.predDefault= ifelse(glm.probsDefault >=0.5,1,0)
table(glm.predDefault, default.test.labels, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual   No  Yes
##      0 4813  119
##      1   18   50

Accuracy rate

accuracy_rateglm.d2 <- (50+4813)/5000
accuracy_rateglm.d2

## [1] 0.9726

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.d2 = 1 - accuracy_rateglm.d2
error_rateglm.d2

## [1] 0.0274

##mean(glm.predDefault==default.test.labels)

Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com ment on the results obtained. 70/30: Error rate = 0.02633 80/20: Error rate = 0.02600 90/10: Error rate = 0.01700 Appears the more training observations the better the accuracy rate, error rate decreased as the number of training observations increased

70/30 train/test split

set.seed(22)
tr_ind7 = sample(nrow(Default), 0.7*nrow(Default), replace = F)  ##using a 70/30 validation train/test split

Default_train7 = Default[tr_ind7,] ##7000 obs and 4 variables
Default_test3 = Default[-tr_ind7,] ##3000 obs and 4 variables

Create the labels

default.test.labels7 = Default$default[-tr_ind7]
summary(default.test.labels7) ##3000 labels

##   No  Yes 
## 2909   91

Fit a multiple logistic regression model using only the train ing observations.

glm.d3 = glm(default ~ balance + income, data = Default_train7, family = binomial)
summary(glm.d3)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default_train7)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4762  -0.1489  -0.0583  -0.0217   3.7093  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.148e+01  5.129e-01 -22.381  < 2e-16 ***
## balance      5.613e-03  2.670e-04  21.022  < 2e-16 ***
## income       2.141e-05  5.855e-06   3.657 0.000255 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2104.1  on 6999  degrees of freedom
## Residual deviance: 1133.3  on 6997  degrees of freedom
## AIC: 1139.3
## 
## Number of Fisher Scoring iterations: 8

glm.probsDefault7 = predict(glm.d3, Default_test3, type = "response")
glm.predDefault7= ifelse(glm.probsDefault7 >=0.5,1,0)
table(glm.predDefault7, default.test.labels7, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual   No  Yes
##      0 2896   66
##      1   13   25

Accuracy rate

accuracy_rateglm.d3 <- (25+2896)/3000
accuracy_rateglm.d3

## [1] 0.9736667

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.d3 = 1 - accuracy_rateglm.d3
error_rateglm.d3

## [1] 0.02633333

80/20 train/test split

set.seed(22)
tr_ind8 = sample(nrow(Default), 0.8*nrow(Default), replace = F)  ##using a 80/20 validation train/test split

Default_train8 = Default[tr_ind8,] ##8000 obs and 4 variables
Default_test2 = Default[-tr_ind8,] ##2000 obs and 4 variables

Create the labels

default.test.labels8 = Default$default[-tr_ind8]
summary(default.test.labels8) ##2000 labels

##   No  Yes 
## 1942   58

Fit a multiple logistic regression model using only the training observations.

glm.d4 = glm(default ~ balance + income, data = Default_train8, family = binomial)
summary(glm.d4)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default_train8)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4821  -0.1476  -0.0577  -0.0212   3.7180  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.150e+01  4.822e-01 -23.852  < 2e-16 ***
## balance      5.655e-03  2.523e-04  22.412  < 2e-16 ***
## income       2.015e-05  5.507e-06   3.659 0.000253 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2394.2  on 7999  degrees of freedom
## Residual deviance: 1287.2  on 7997  degrees of freedom
## AIC: 1293.2
## 
## Number of Fisher Scoring iterations: 8

glm.probsDefault8 = predict(glm.d4, Default_test2, type = "response")
glm.predDefault8= ifelse(glm.probsDefault8 >=0.5,1,0)
table(glm.predDefault8, default.test.labels8, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual   No  Yes
##      0 1932   42
##      1   10   16

Accuracy rate

accuracy_rateglm.d4 <- (16+1932)/2000
accuracy_rateglm.d4

## [1] 0.974

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.d4 = 1 - accuracy_rateglm.d4
error_rateglm.d4

## [1] 0.026

90/10 train/test split

set.seed(22)
tr_ind9 = sample(nrow(Default), 0.9*nrow(Default), replace = F)  ##using a 90/10 validation train/test split

Default_train9 = Default[tr_ind9,] ##9000 obs and 4 variables
Default_test1 = Default[-tr_ind9,] ##1000 obs and 4 variables

Create the labels

default.test.labels9 = Default$default[-tr_ind9]
summary(default.test.labels9) ##1000 labels

##  No Yes 
## 975  25

Fit a multiple logistic regression model using only the train ing observations.

glm.d5 = glm(default ~ balance + income, data = Default_train9, family = binomial)
summary(glm.d5)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default_train9)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4728  -0.1489  -0.0590  -0.0218   3.7078  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.144e+01  4.526e-01  -25.28  < 2e-16 ***
## balance      5.616e-03  2.366e-04   23.74  < 2e-16 ***
## income       2.031e-05  5.208e-06    3.90  9.6e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2684.3  on 8999  degrees of freedom
## Residual deviance: 1457.2  on 8997  degrees of freedom
## AIC: 1463.2
## 
## Number of Fisher Scoring iterations: 8

glm.probsDefault9 = predict(glm.d5, Default_test1, type = "response")
glm.predDefault9= ifelse(glm.probsDefault9 >=0.5,1,0)
table(glm.predDefault9, default.test.labels9, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual  No Yes
##      0 972  14
##      1   3  11

Accuracy rate

accuracy_rateglm.d5 <- (11+972)/1000
accuracy_rateglm.d5

## [1] 0.983

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.d5 = 1 - accuracy_rateglm.d5
error_rateglm.d5

## [1] 0.017

##mean(glm.predDefault==default.test.labels)

#5. (d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy I added student to the model and trained on the 80/20 and 90/10 train/test splits, there was not an increase in performance adding student, the 80/20 model predicted on less and the 90/10 model predicted one less compared to the 80/20 and 90/10 models that did not include student. Simple models are always the goal, adding student to the model doesn’t increase performance and I would suggest not including it in the model.

Fit a multiple logistic regression model using only the training observations 80/20 train/test split

glm.ds = glm(default ~ balance + income + student, data = Default_train8, family = binomial)
summary(glm.ds)

## 
## Call:
## glm(formula = default ~ balance + income + student, family = binomial, 
##     data = Default_train8)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4747  -0.1457  -0.0567  -0.0208   3.7281  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.089e+01  5.514e-01 -19.757   <2e-16 ***
## balance      5.719e-03  2.557e-04  22.366   <2e-16 ***
## income       4.629e-06  9.073e-06   0.510   0.6100    
## studentYes  -5.641e-01  2.609e-01  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2394.2  on 7999  degrees of freedom
## Residual deviance: 1282.6  on 7996  degrees of freedom
## AIC: 1290.6
## 
## Number of Fisher Scoring iterations: 8

glm.probsDefaultds = predict(glm.ds, Default_test2, type = "response")
glm.predDefaultds= ifelse(glm.probsDefaultds >=0.5,1,0)
table(glm.predDefaultds, default.test.labels8, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual   No  Yes
##      0 1931   42
##      1   11   16

Accuracy rate

accuracy_rateglm.ds <- (16+1931)/2000
accuracy_rateglm.ds

## [1] 0.9735

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.ds = 1 - accuracy_rateglm.ds
error_rateglm.ds

## [1] 0.0265

Using the 90/10 train/test split

glm.ds2 = glm(default ~ balance + income + student, data = Default_train9, family = binomial)
summary(glm.ds2)

## 
## Call:
## glm(formula = default ~ balance + income + student, family = binomial, 
##     data = Default_train9)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4672  -0.1469  -0.0576  -0.0212   3.7195  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.081e+01  5.145e-01 -21.007   <2e-16 ***
## balance      5.691e-03  2.403e-04  23.688   <2e-16 ***
## income       3.882e-06  8.520e-06   0.456   0.6486    
## studentYes  -5.973e-01  2.442e-01  -2.446   0.0144 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2684.3  on 8999  degrees of freedom
## Residual deviance: 1451.3  on 8996  degrees of freedom
## AIC: 1459.3
## 
## Number of Fisher Scoring iterations: 8

glm.probsDefaultds2 = predict(glm.ds2, Default_test1, type = "response")
glm.predDefaultds2= ifelse(glm.probsDefaultds2 >=0.5,1,0)
table(glm.predDefaultds2, default.test.labels9, dnn = c("Actual", "Predicted"))

##       Predicted
## Actual  No Yes
##      0 971  15
##      1   4  10

Accuracy rate

accuracy_rateglm.ds2 <- (10+971)/1000
accuracy_rateglm.ds2

## [1] 0.981

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate

error_rateglm.ds2 = 1 - accuracy_rateglm.d5
error_rateglm.ds2

## [1] 0.017

##mean(glm.predDefault==default.test.labels)

##6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression co efficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. (a) Using the summary() and glm() functions, determine the esti mated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors. (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model. (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

#6. (a) Using the summary() and glm() functions, determine the esti mated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

glm.6 = glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.6)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

#6. (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn = function(data, index){ # initialize function to accept a data and index object
  return(coef(glm(default ~ balance + income, data = Default, family = binomial, subset = index)))}
boot.fn(Default, 1:10000)

##   (Intercept)       balance        income 
## -1.154047e+01  5.647103e-03  2.080898e-05

#6. (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

set.seed(22)
boot.fn(Default,sample(10000,10000,replace=T))

##   (Intercept)       balance        income 
## -1.216864e+01  5.945671e-03  2.316788e-05

boot(Default, boot.fn, 1000) #compute the standard errors of 1,000 bootstrap estimates of the linear model coefficients using the boot function

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.427835e-02 4.320665e-01
## t2*  5.647103e-03  1.417378e-05 2.264517e-04
## t3*  2.080898e-05  1.629532e-07 4.983567e-06

#6. (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. glm() std. errors: Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 balance 5.647e-03 2.274e-04 24.836 < 2e-16 income 2.081e-05 4.985e-06 4.174 2.99e-05 ***

bootstrap std. errors: original bias std. error t1* -1.154047e+01 -3.954546e-02 4.285173e-01 t2* 5.647103e-03 1.731616e-05 2.258669e-04 t3* 2.080898e-05 1.470388e-07 4.901097e-06

Difference in standard errors between glm() and bootstrap function is minimal

##9. We will now consider the Boston housing data set, from the MASS library.

Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆµ. (b) Provide an estimate of the standard error of ˆµ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations. (c) Now estimate the standard error of ˆµ using the bootstrap. How does this compare to your answer from (b)? (d) Based on your bootstrap estimate from (c), provide a 95 % con fidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)]. (e) Based on this data set, provide an estimate, ˆµmed, for the median value of medv in the population. (f) We now would like to estimate the standard error of ˆµmed. Unfor tunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings. (g) Based on this data set, provide an estimate for the tenth per centile of medv in Boston suburbs. Call this quantity ˆµ0.1. (You can use the quantile() function.) (h) Use the bootstrap to estimate the standard error of ˆµ0.1. Com ment on your findings

Looking at the Boston data set

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

?Boston
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

table(Boston$medv)

## 
##    5  5.6  6.3    7  7.2  7.4  7.5  8.1  8.3  8.4  8.5  8.7  8.8  9.5  9.6  9.7 
##    2    1    1    2    3    1    1    1    2    2    2    1    2    1    1    1 
## 10.2 10.4 10.5 10.8 10.9   11 11.3 11.5 11.7 11.8 11.9   12 12.1 12.3 12.5 12.6 
##    3    2    2    1    2    1    1    1    2    2    2    1    1    1    1    1 
## 12.7 12.8   13 13.1 13.2 13.3 13.4 13.5 13.6 13.8 13.9   14 14.1 14.2 14.3 14.4 
##    3    1    1    4    1    3    4    2    2    5    2    1    3    1    2    2 
## 14.5 14.6 14.8 14.9   15 15.1 15.2 15.3 15.4 15.6 15.7   16 16.1 16.2 16.3 16.4 
##    3    2    1    3    3    1    3    1    2    5    1    1    3    2    1    1 
## 16.5 16.6 16.7 16.8   17 17.1 17.2 17.3 17.4 17.5 17.6 17.7 17.8 17.9   18 18.1 
##    2    2    2    2    1    3    3    1    3    3    1    1    5    1    1    1 
## 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9   19 19.1 19.2 19.3 19.4 19.5 19.6 19.7 
##    3    2    3    4    2    3    2    4    2    4    2    5    6    4    5    2 
## 19.8 19.9   20 20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 20.9   21 21.1 21.2 21.4 
##    3    4    5    5    2    4    4    3    6    2    3    2    3    2    5    5 
## 21.5 21.6 21.7 21.8 21.9   22 22.1 22.2 22.3 22.4 22.5 22.6 22.7 22.8 22.9   23 
##    2    2    7    2    3    7    1    5    2    2    3    5    2    4    4    4 
## 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8 23.9   24 24.1 24.2 24.3 24.4 24.5 24.6 
##    7    4    4    2    1    2    4    4    5    2    3    1    3    4    3    2 
## 24.7 24.8   25 25.1 25.2 25.3 26.2 26.4 26.5 26.6 26.7   27 27.1 27.5 27.9   28 
##    3    4    8    1    1    1    1    2    1    3    1    1    2    4    2    1 
## 28.1 28.2 28.4 28.5 28.6 28.7   29 29.1 29.4 29.6 29.8 29.9 30.1 30.3 30.5 30.7 
##    1    1    2    1    1    3    2    2    1    2    2    1    3    1    1    1 
## 30.8   31 31.1 31.2 31.5 31.6 31.7   32 32.2 32.4 32.5 32.7 32.9   33 33.1 33.2 
##    1    1    1    1    2    2    1    2    1    1    1    1    1    1    2    2 
## 33.3 33.4 33.8 34.6 34.7 34.9 35.1 35.2 35.4   36 36.1 36.2 36.4 36.5   37 37.2 
##    1    2    1    1    1    3    1    1    2    1    1    2    1    1    1    1 
## 37.3 37.6 37.9 38.7 39.8 41.3 41.7 42.3 42.8 43.1 43.5 43.8   44 44.8 45.4   46 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 46.7 48.3 48.5 48.8   50 
##    1    1    1    1   16

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆµ. 22.53281

?mean
µ = mean(Boston$medv)
µ

## [1] 22.53281

Provide an estimate of the standard error of ˆµ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations. 0.4088611

?sd
sd_medv = sd(Boston$medv)
sd_medv

## [1] 9.197104

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

str(Boston$medv) #(506 observations)

##  num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

?sqrt
sqrt_medv = sqrt(506)
sqrt_medv

## [1] 22.49444

std_err_medv = sd_medv / sqrt_medv
std_err_medv

## [1] 0.4088611

Now estimate the standard error of ˆµ using the bootstrap. How does this compare to your answer from (b)? (b):0.4088611 vs. (c): 0.4257536, (c) is slightly higher than (b)

boot.fn.medv = function(data, index){ # initialize function to accept a data and index object
  return(mean(data[index]))} #calculate the mean of the data column

set.seed(22)
btstrp_medv = boot(Boston$medv, boot.fn.medv, 1000) 
btstrp_medv

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.01964348   0.4257536

#compute the standard errors of 1,000 bootstrap estimates of the mean of the medv observations using the boot function

Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)].

Method 1 CI: 21.68130 - 23.38432 Method 2 (t.test) CI: 21.72953 - 23.33608 Method 2 (t.test) - slightly more narrow than method 1

ci_medv = c(µ - (2*sd(btstrp_medv$t)), µ + (2*sd(btstrp_medv$t)))
ci_medv

## [1] 21.68130 23.38431

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Based on this data set, provide an estimate, ˆµmed, for the median value of medv in the population.

µmed = median(Boston$medv)
µmed

## [1] 21.2

We now would like to estimate the standard error of ˆµmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings. standard error 0.3965364, pretty small standard error relative to median value of 21.2

boot.fn.medv_med = function(data, index){ # initialize function to accept a data and index object
  return(median(data[index]))} #calculate the median of the data column

set.seed(22)
btstrp_medv_med = boot(Boston$medv, boot.fn.medv_med, 1000) 
btstrp_medv_med

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv_med, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.02395   0.3965364

#compute the standard errors of 1,000 bootstrap estimates of the median of the medv observations using the boot function

Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity ˆµ0.1. (You can use the quantile() function.)

µ0.1 = quantile(Boston$medv,0.1)
µ0.1

##   10% 
## 12.75

Use the bootstrap to estimate the standard error of ˆµ0.1. Com ment on your findings standard error 0.5091694, pretty small standard error relative to 10% quantile value of 12.75

boot.fn.medv_quantile = function(data, index){ # initialize function to accept a data and index object
  return(quantile(data[index],0.1))} #calculate the quantile of the data column

set.seed(22)
btstrp_medv_quantile = boot(Boston$medv, boot.fn.medv_quantile, 1000) 
btstrp_medv_quantile

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv_quantile, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0024   0.5091694