Load libraries
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
library(class)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot)
##3. We now review k-fold cross-validation. (a) Explain how k-fold cross-validation is implemented. (b) What are the advantages and disadvantages of k-fold cross validation relative to: i. The validation set approach? ii. LOOCV? (a) This approach involves randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2,…, MSEk. The k-fold CV estimate is computed by averaging these values.
The above explanation was taken from the book, in my own words, k-fold CV takes a data set and breaks out into a certain number (k) of data sets. If the breakout number (k) is 5, the dataset is broken into 5 datasets. 1 dataset is used as the test (validation) data set and the other 4 datasets are combined as the training set. The model is fitted with the selected test set and trained on the 4 datasets combined into one training set. The MSE of that model is taken, then the process is repeated 4 more times, switching out the test set each time until each data set is used as the test set, while the other datasets are combined as the training data set. The MSEs from each model are then averaged and the result is the average MSE of the k number of models processed.
Disadvantage of k-fold CV relative to validation set: validation is more simple and is less computational than k-fold CV
Advantages of k-fold CV relative to LOOCV: Computational cost for k-fold CV is less compared to LOOCV, especially for more complex models. K-fold CV often gives more accurate estimates of the test error rate than does LOOCV. K-fold CV will have less variance than LOOCV.
Disadvantages of k-fold CV relative to LOOCV: LOOCV leads to unbiased estimates of the test error because it is training on n-1 observations and k-fold CV trains on k-1 folds. LOOCV will have less bias than k-fold CV
##5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. (a) Fit a logistic regression model that uses income and balance to predict default.
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set. ii. Fit a multiple logistic regression model using only the train ing observations. iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5. iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Com ment on the results obtained.
Now consider a logistic regression model that predicts the prob ability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the val idation set approach. Comment on whether or not including a dummy
Looking at the Default data set
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
?Default
## starting httpd help server ... done
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
table(Default$default)
##
## No Yes
## 9667 333
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
##5. (a) Fit a logistic regression model that uses income and balance to predict default.
glm.d1 = glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.d1)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
##5. (b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: i. Split the sample set into a training set and a validation set.
set.seed(22)
tr_ind = sample(nrow(Default), 0.5*nrow(Default), replace = F) ##using a 50/50 validationa train/test split
Default_train = Default[tr_ind,] ##5000 obs and 4 variables
Default_test = Default[-tr_ind,] ##5000 obs and 4 variables
Create the labels
default.test.labels = Default$default[-tr_ind]
summary(default.test.labels) ##5000 labels
## No Yes
## 4831 169
glm.d2 = glm(default ~ balance + income, data = Default_train, family = binomial)
summary(glm.d2)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4800 -0.1431 -0.0558 -0.0207 3.7215
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.170e+01 6.314e-01 -18.535 < 2e-16 ***
## balance 5.668e-03 3.255e-04 17.413 < 2e-16 ***
## income 2.344e-05 7.092e-06 3.305 0.000951 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1443.44 on 4999 degrees of freedom
## Residual deviance: 771.54 on 4997 degrees of freedom
## AIC: 777.54
##
## Number of Fisher Scoring iterations: 8
glm.probsDefault = predict(glm.d2, Default_test, type = "response")
glm.predDefault= ifelse(glm.probsDefault >=0.5,1,0)
table(glm.predDefault, default.test.labels, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 4813 119
## 1 18 50
Accuracy rate
accuracy_rateglm.d2 <- (50+4813)/5000
accuracy_rateglm.d2
## [1] 0.9726
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.d2 = 1 - accuracy_rateglm.d2
error_rateglm.d2
## [1] 0.0274
##mean(glm.predDefault==default.test.labels)
70/30 train/test split
set.seed(22)
tr_ind7 = sample(nrow(Default), 0.7*nrow(Default), replace = F) ##using a 70/30 validation train/test split
Default_train7 = Default[tr_ind7,] ##7000 obs and 4 variables
Default_test3 = Default[-tr_ind7,] ##3000 obs and 4 variables
Create the labels
default.test.labels7 = Default$default[-tr_ind7]
summary(default.test.labels7) ##3000 labels
## No Yes
## 2909 91
glm.d3 = glm(default ~ balance + income, data = Default_train7, family = binomial)
summary(glm.d3)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default_train7)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4762 -0.1489 -0.0583 -0.0217 3.7093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.148e+01 5.129e-01 -22.381 < 2e-16 ***
## balance 5.613e-03 2.670e-04 21.022 < 2e-16 ***
## income 2.141e-05 5.855e-06 3.657 0.000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2104.1 on 6999 degrees of freedom
## Residual deviance: 1133.3 on 6997 degrees of freedom
## AIC: 1139.3
##
## Number of Fisher Scoring iterations: 8
glm.probsDefault7 = predict(glm.d3, Default_test3, type = "response")
glm.predDefault7= ifelse(glm.probsDefault7 >=0.5,1,0)
table(glm.predDefault7, default.test.labels7, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 2896 66
## 1 13 25
Accuracy rate
accuracy_rateglm.d3 <- (25+2896)/3000
accuracy_rateglm.d3
## [1] 0.9736667
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.d3 = 1 - accuracy_rateglm.d3
error_rateglm.d3
## [1] 0.02633333
80/20 train/test split
set.seed(22)
tr_ind8 = sample(nrow(Default), 0.8*nrow(Default), replace = F) ##using a 80/20 validation train/test split
Default_train8 = Default[tr_ind8,] ##8000 obs and 4 variables
Default_test2 = Default[-tr_ind8,] ##2000 obs and 4 variables
Create the labels
default.test.labels8 = Default$default[-tr_ind8]
summary(default.test.labels8) ##2000 labels
## No Yes
## 1942 58
glm.d4 = glm(default ~ balance + income, data = Default_train8, family = binomial)
summary(glm.d4)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default_train8)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4821 -0.1476 -0.0577 -0.0212 3.7180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.150e+01 4.822e-01 -23.852 < 2e-16 ***
## balance 5.655e-03 2.523e-04 22.412 < 2e-16 ***
## income 2.015e-05 5.507e-06 3.659 0.000253 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2394.2 on 7999 degrees of freedom
## Residual deviance: 1287.2 on 7997 degrees of freedom
## AIC: 1293.2
##
## Number of Fisher Scoring iterations: 8
glm.probsDefault8 = predict(glm.d4, Default_test2, type = "response")
glm.predDefault8= ifelse(glm.probsDefault8 >=0.5,1,0)
table(glm.predDefault8, default.test.labels8, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 1932 42
## 1 10 16
Accuracy rate
accuracy_rateglm.d4 <- (16+1932)/2000
accuracy_rateglm.d4
## [1] 0.974
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.d4 = 1 - accuracy_rateglm.d4
error_rateglm.d4
## [1] 0.026
90/10 train/test split
set.seed(22)
tr_ind9 = sample(nrow(Default), 0.9*nrow(Default), replace = F) ##using a 90/10 validation train/test split
Default_train9 = Default[tr_ind9,] ##9000 obs and 4 variables
Default_test1 = Default[-tr_ind9,] ##1000 obs and 4 variables
Create the labels
default.test.labels9 = Default$default[-tr_ind9]
summary(default.test.labels9) ##1000 labels
## No Yes
## 975 25
glm.d5 = glm(default ~ balance + income, data = Default_train9, family = binomial)
summary(glm.d5)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default_train9)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4728 -0.1489 -0.0590 -0.0218 3.7078
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.144e+01 4.526e-01 -25.28 < 2e-16 ***
## balance 5.616e-03 2.366e-04 23.74 < 2e-16 ***
## income 2.031e-05 5.208e-06 3.90 9.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2684.3 on 8999 degrees of freedom
## Residual deviance: 1457.2 on 8997 degrees of freedom
## AIC: 1463.2
##
## Number of Fisher Scoring iterations: 8
glm.probsDefault9 = predict(glm.d5, Default_test1, type = "response")
glm.predDefault9= ifelse(glm.probsDefault9 >=0.5,1,0)
table(glm.predDefault9, default.test.labels9, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 972 14
## 1 3 11
Accuracy rate
accuracy_rateglm.d5 <- (11+972)/1000
accuracy_rateglm.d5
## [1] 0.983
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.d5 = 1 - accuracy_rateglm.d5
error_rateglm.d5
## [1] 0.017
##mean(glm.predDefault==default.test.labels)
#5. (d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy I added student to the model and trained on the 80/20 and 90/10 train/test splits, there was not an increase in performance adding student, the 80/20 model predicted on less and the 90/10 model predicted one less compared to the 80/20 and 90/10 models that did not include student. Simple models are always the goal, adding student to the model doesn’t increase performance and I would suggest not including it in the model.
Fit a multiple logistic regression model using only the training observations 80/20 train/test split
glm.ds = glm(default ~ balance + income + student, data = Default_train8, family = binomial)
summary(glm.ds)
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = Default_train8)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.1457 -0.0567 -0.0208 3.7281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.089e+01 5.514e-01 -19.757 <2e-16 ***
## balance 5.719e-03 2.557e-04 22.366 <2e-16 ***
## income 4.629e-06 9.073e-06 0.510 0.6100
## studentYes -5.641e-01 2.609e-01 -2.162 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2394.2 on 7999 degrees of freedom
## Residual deviance: 1282.6 on 7996 degrees of freedom
## AIC: 1290.6
##
## Number of Fisher Scoring iterations: 8
glm.probsDefaultds = predict(glm.ds, Default_test2, type = "response")
glm.predDefaultds= ifelse(glm.probsDefaultds >=0.5,1,0)
table(glm.predDefaultds, default.test.labels8, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 1931 42
## 1 11 16
Accuracy rate
accuracy_rateglm.ds <- (16+1931)/2000
accuracy_rateglm.ds
## [1] 0.9735
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.ds = 1 - accuracy_rateglm.ds
error_rateglm.ds
## [1] 0.0265
Using the 90/10 train/test split
glm.ds2 = glm(default ~ balance + income + student, data = Default_train9, family = binomial)
summary(glm.ds2)
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = Default_train9)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4672 -0.1469 -0.0576 -0.0212 3.7195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.081e+01 5.145e-01 -21.007 <2e-16 ***
## balance 5.691e-03 2.403e-04 23.688 <2e-16 ***
## income 3.882e-06 8.520e-06 0.456 0.6486
## studentYes -5.973e-01 2.442e-01 -2.446 0.0144 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2684.3 on 8999 degrees of freedom
## Residual deviance: 1451.3 on 8996 degrees of freedom
## AIC: 1459.3
##
## Number of Fisher Scoring iterations: 8
glm.probsDefaultds2 = predict(glm.ds2, Default_test1, type = "response")
glm.predDefaultds2= ifelse(glm.probsDefaultds2 >=0.5,1,0)
table(glm.predDefaultds2, default.test.labels9, dnn = c("Actual", "Predicted"))
## Predicted
## Actual No Yes
## 0 971 15
## 1 4 10
Accuracy rate
accuracy_rateglm.ds2 <- (10+971)/1000
accuracy_rateglm.ds2
## [1] 0.981
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. Error rate
error_rateglm.ds2 = 1 - accuracy_rateglm.d5
error_rateglm.ds2
## [1] 0.017
##mean(glm.predDefault==default.test.labels)
##6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression co efficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. (a) Using the summary() and glm() functions, determine the esti mated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors. (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model. (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
#6. (a) Using the summary() and glm() functions, determine the esti mated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
glm.6 = glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.6)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
#6. (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn = function(data, index){ # initialize function to accept a data and index object
return(coef(glm(default ~ balance + income, data = Default, family = binomial, subset = index)))}
boot.fn(Default, 1:10000)
## (Intercept) balance income
## -1.154047e+01 5.647103e-03 2.080898e-05
#6. (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
set.seed(22)
boot.fn(Default,sample(10000,10000,replace=T))
## (Intercept) balance income
## -1.216864e+01 5.945671e-03 2.316788e-05
boot(Default, boot.fn, 1000) #compute the standard errors of 1,000 bootstrap estimates of the linear model coefficients using the boot function
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.427835e-02 4.320665e-01
## t2* 5.647103e-03 1.417378e-05 2.264517e-04
## t3* 2.080898e-05 1.629532e-07 4.983567e-06
#6. (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. glm() std. errors: Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 balance 5.647e-03 2.274e-04 24.836 < 2e-16 income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
bootstrap std. errors: original bias std. error t1* -1.154047e+01 -3.954546e-02 4.285173e-01 t2* 5.647103e-03 1.731616e-05 2.258669e-04 t3* 2.080898e-05 1.470388e-07 4.901097e-06
Difference in standard errors between glm() and bootstrap function is minimal
##9. We will now consider the Boston housing data set, from the MASS library.
Looking at the Boston data set
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
?Boston
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
table(Boston$medv)
##
## 5 5.6 6.3 7 7.2 7.4 7.5 8.1 8.3 8.4 8.5 8.7 8.8 9.5 9.6 9.7
## 2 1 1 2 3 1 1 1 2 2 2 1 2 1 1 1
## 10.2 10.4 10.5 10.8 10.9 11 11.3 11.5 11.7 11.8 11.9 12 12.1 12.3 12.5 12.6
## 3 2 2 1 2 1 1 1 2 2 2 1 1 1 1 1
## 12.7 12.8 13 13.1 13.2 13.3 13.4 13.5 13.6 13.8 13.9 14 14.1 14.2 14.3 14.4
## 3 1 1 4 1 3 4 2 2 5 2 1 3 1 2 2
## 14.5 14.6 14.8 14.9 15 15.1 15.2 15.3 15.4 15.6 15.7 16 16.1 16.2 16.3 16.4
## 3 2 1 3 3 1 3 1 2 5 1 1 3 2 1 1
## 16.5 16.6 16.7 16.8 17 17.1 17.2 17.3 17.4 17.5 17.6 17.7 17.8 17.9 18 18.1
## 2 2 2 2 1 3 3 1 3 3 1 1 5 1 1 1
## 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9 19 19.1 19.2 19.3 19.4 19.5 19.6 19.7
## 3 2 3 4 2 3 2 4 2 4 2 5 6 4 5 2
## 19.8 19.9 20 20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 20.9 21 21.1 21.2 21.4
## 3 4 5 5 2 4 4 3 6 2 3 2 3 2 5 5
## 21.5 21.6 21.7 21.8 21.9 22 22.1 22.2 22.3 22.4 22.5 22.6 22.7 22.8 22.9 23
## 2 2 7 2 3 7 1 5 2 2 3 5 2 4 4 4
## 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8 23.9 24 24.1 24.2 24.3 24.4 24.5 24.6
## 7 4 4 2 1 2 4 4 5 2 3 1 3 4 3 2
## 24.7 24.8 25 25.1 25.2 25.3 26.2 26.4 26.5 26.6 26.7 27 27.1 27.5 27.9 28
## 3 4 8 1 1 1 1 2 1 3 1 1 2 4 2 1
## 28.1 28.2 28.4 28.5 28.6 28.7 29 29.1 29.4 29.6 29.8 29.9 30.1 30.3 30.5 30.7
## 1 1 2 1 1 3 2 2 1 2 2 1 3 1 1 1
## 30.8 31 31.1 31.2 31.5 31.6 31.7 32 32.2 32.4 32.5 32.7 32.9 33 33.1 33.2
## 1 1 1 1 2 2 1 2 1 1 1 1 1 1 2 2
## 33.3 33.4 33.8 34.6 34.7 34.9 35.1 35.2 35.4 36 36.1 36.2 36.4 36.5 37 37.2
## 1 2 1 1 1 3 1 1 2 1 1 2 1 1 1 1
## 37.3 37.6 37.9 38.7 39.8 41.3 41.7 42.3 42.8 43.1 43.5 43.8 44 44.8 45.4 46
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 46.7 48.3 48.5 48.8 50
## 1 1 1 1 16
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
?mean
µ = mean(Boston$medv)
µ
## [1] 22.53281
?sd
sd_medv = sd(Boston$medv)
sd_medv
## [1] 9.197104
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
str(Boston$medv) #(506 observations)
## num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
?sqrt
sqrt_medv = sqrt(506)
sqrt_medv
## [1] 22.49444
std_err_medv = sd_medv / sqrt_medv
std_err_medv
## [1] 0.4088611
boot.fn.medv = function(data, index){ # initialize function to accept a data and index object
return(mean(data[index]))} #calculate the mean of the data column
set.seed(22)
btstrp_medv = boot(Boston$medv, boot.fn.medv, 1000)
btstrp_medv
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.01964348 0.4257536
#compute the standard errors of 1,000 bootstrap estimates of the mean of the medv observations using the boot function
Method 1 CI: 21.68130 - 23.38432 Method 2 (t.test) CI: 21.72953 - 23.33608 Method 2 (t.test) - slightly more narrow than method 1
ci_medv = c(µ - (2*sd(btstrp_medv$t)), µ + (2*sd(btstrp_medv$t)))
ci_medv
## [1] 21.68130 23.38431
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
µmed = median(Boston$medv)
µmed
## [1] 21.2
boot.fn.medv_med = function(data, index){ # initialize function to accept a data and index object
return(median(data[index]))} #calculate the median of the data column
set.seed(22)
btstrp_medv_med = boot(Boston$medv, boot.fn.medv_med, 1000)
btstrp_medv_med
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv_med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02395 0.3965364
#compute the standard errors of 1,000 bootstrap estimates of the median of the medv observations using the boot function
µ0.1 = quantile(Boston$medv,0.1)
µ0.1
## 10%
## 12.75
boot.fn.medv_quantile = function(data, index){ # initialize function to accept a data and index object
return(quantile(data[index],0.1))} #calculate the quantile of the data column
set.seed(22)
btstrp_medv_quantile = boot(Boston$medv, boot.fn.medv_quantile, 1000)
btstrp_medv_quantile
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.medv_quantile, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0024 0.5091694