# Load dependencies
pacman::p_load(ISLR, ISLR2, dplyr, car, caret, boot)
Explain how k-fold cross-validation is implemented.
K-fold cross-validation is a technique used to asses performance of a machine learning model. It does this by dividing the dataset into k subsets/folds. The model is trained and evaluated k number of times, each time using a different fold as the test set while the remaining k-1 folds serve as the training set.
What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
The advantages of k-fold cross-validation relative to the validation set approach is it uses the entire dataset for training and testing over multiple iterations. This leads to a more accurate result. The disadvantages are it’s computationally more expensive to run and more complex to implement.
The advantages of k-fold cross-validation compared to the Leave-One-Out Cross-Validation (LOOCV) is it’s more efficient as LOOCV requires n model training iterations while k-fold requires less. K-fold also provides lower variance in the model. The disadvantages compared to LOOCV is k-fold is slightly more biased because of the lower variance.
Fit a logistic regression model that uses income and balance to predict default.
# Read in data from ISLR package
df = Default
head(df)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
# Logistic Regression Model
logit_model = glm(default ~ income + balance, data= df, family= 'binomial')
summary(logit_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Check for multicollinearity
vif(logit_model)
## income balance
## 1.045605 1.045605
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.
ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
set.seed(42)
# Split data into train and test
train_index = createDataPartition(df$default, p= 0.8, list= FALSE)
train = df[train_index,]
test = df[-train_index,]
# Train Logistic Regression Model
logit_model = glm(default ~ income + balance, data= train, family= 'binomial')
# Predictions and Classifications
pred_prob = predict(logit_model, test, type= 'response')
pred_labels = ifelse(pred_prob > 0.5, 'Yes', 'No')
pred_labels = as.factor(pred_labels)
# Compute test error (misclassification rate)
mean(pred_labels != test$default)
## [1] 0.02451226
Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
set.seed(42)
# Store test errors
test_errors = numeric(3)
for (i in 1:3) {
# Split data into train and test
train_index = createDataPartition(df$default, p = 0.8, list = FALSE)
train = df[train_index, ]
test = df[-train_index, ]
# Train logistic regression model
logit_model = glm(default ~ income + balance, data = train, family = "binomial")
# Make predictions
pred_prob = predict(logit_model, test, type = "response")
pred_labels = ifelse(pred_prob > 0.5, "Yes", "No")
pred_labels = as.factor(pred_labels)
# Compute test error
test_errors[i] = mean(pred_labels != test$default)
}
# Display test errors
test_errors
## [1] 0.02451226 0.02651326 0.02301151
Now consider a logistic regression model that predicts the
probability of default using income,
balance, and a dummy variable for student.
Estimate the test error for this model using the validation set
approach. Comment on whether or not including a dummy variable for
student leads to a reduction in the test error
rate.
Including the dummy variable did result in a slightly lower misclassification rate going from 0.0245 to 0.024. Seeing a decrease of 0.0005.
set.seed(42)
# Create dummy variable for Student
df$student = ifelse(df$student == 'Yes', 1, 0)
# Split data into train and test
train_index = createDataPartition(df$default, p = 0.8, list = FALSE)
train = df[train_index, ]
test = df[-train_index, ]
# Train logistic regression model
logit_model = glm(default ~ ., data = train, family = "binomial")
# Make predictions
pred_prob = predict(logit_model, test, type = "response")
pred_labels = ifelse(pred_prob > 0.5, "Yes", "No")
pred_labels = as.factor(pred_labels)
# Compute test error
mean(pred_labels != test$default)
## [1] 0.02401201
Using the summary() and glm() functions, determine the estimated
standard errors for the coefficients associated with income
and balance in a multiple logistic regression model that
uses both predictors.
The estimated standard error for the coefficient income
is 4.985E-1 and the standard error for the coefficient
balance is 2.274E-4.
# Logistic Regression Model
logit_model = glm(default ~ income + balance, data= df, family= 'binomial')
summary(logit_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Write a function, boot.fn(), that takes as input the Default data
set as well as an index of the observations, and that outputs the
coefficient estimates for income and balance
in the multiple logistic regression model.
set.seed(42)
# Create a bootstrap function to compute a logistic regression model
boot.glm = function(data, index) {
logit_model = glm(default ~ income + balance, data= data[index, ], family= 'binomial')
return(coef(logit_model)[c('income', 'balance')])
}
Use the boot() function together with your boot.fn() function to
estimate the standard errors of the logistic regression coefficients for
income and balance.
# Perform bootstrapping
boot_results = boot(data=df, statistic= boot.glm, R= 1000)
# Compute the standard errors from bootstrap
apply(boot_results$t, 2, sd)
## [1] 5.073444e-06 2.299133e-04
Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
There is a strong similarity in the estimated standard errors obtained for income and balance with the glm() at 4.985E-06 and 2.274E-04 and bootstrap at 4.884E-06 and 2.269E-04 respectively. This suggests that both methods provide consistent estimates of uncertainty and that the results are stable.
Based on this data set, provide an estimate for the population
mean of medv, call this estimate µˆ.
The estimate for the population mean is 22.53 or $22,530.
# Read in the data from ISLR2 package
boston = Boston
head(boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
# Estimate population mean
mean(boston$medv)
## [1] 22.53281
Provide an estimate of the standard error of µˆ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
The standard error of µˆ is 0.409 (or $409). This means the estimated mean home value is expected to vary by approximately $409 across different samples of the same size (506). This small value indicates that the sample mean is a precise and reliable estimate of the population mean.
sd(boston$medv)/sqrt(nrow(boston))
## [1] 0.4088611
Now estimate the standard error of µˆ using the bootstrap. How does this compare to your answer from (b)?
The standard error decreased by about 0.0079395. This suggests the bootstrap samples are producing more consistent or stable estimates to the original sample.
set.seed(42)
# Create a bootstrap function to compute the mean
boot.mean = function(data, indices) {
return(mean(data[indices]))
}
# Perform bootstrapping
boot_results = boot(boston$medv, boot.mean, R = 1000)
# Print the standard error
sd(boot_results$t)
## [1] 0.4009216
Based on your bootstrap estimate from (c), provide a 95 %
confidence interval for the mean of medv. Compare it to the
results obtained using t.test(Boston$medv). Hint: You can approximate a
95 % confidence interval using the formula [ˆµ − 2SE(ˆµ), µˆ +
2SE(ˆµ)].
The results produces nearly identical ranges. The 95% confidence interval from the one-sample t-test produced (21.72953, 23.33608). The interval from the bootstrap estimate produced (21.73096, 23.33465). The bootstrap interval is slightly narrower at 1.60369 compared to 1.60655, a difference of 0.00286. The t-test interval has a lower starting point but a higher endpoint than the bootstrap interval.
# Find 95% CI of population mean using t-test
t.test(boston$medv)
##
## One Sample t-test
##
## data: boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# Store population mean and sd
pop_mean = mean(boston$medv)
boot_sd = sd(boot_results$t)
# Lower bound confidence interval
pop_mean - 2 * boot_sd
## [1] 21.73096
# Upper bound confidence interval
pop_mean + 2 * boot_sd
## [1] 23.33465
Based on this data set, provide an estimate, µˆ_med, for the
median value of medv in the population.
The estimate for the population median is 21.2 or $21,200.
median(boston$medv)
## [1] 21.2
We now would like to estimate the standard error of µˆ_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
The bootstrap standard error for the median is 0.366 (or $366). This suggests that the estimated median home value is expected to vary by approximately $366 across different samples of the same size (506). This small value indicates that the sample median is a precise and reliable estimate of the population median.
set.seed(42)
# Create a bootstrap function to compute the median
boot.median = function(data, indices) {
return(median(data[indices]))
}
# Perform bootstrapping
boot_results = boot(boston$medv, boot.median, R = 1000)
# Print the standard error
sd(boot_results$t)
## [1] 0.3661785
Based on this data set, provide an estimate for the tenth
percentile of medv in Boston census tracts. Call this
quantity µˆ_0.1. (You can use the quantile() function.)
quantile(boston$medv, probs= 0.1)
## 10%
## 12.75
Use the bootstrap to estimate the standard error of µˆ_0.1. Comment on your findings.
The bootstrap standard error for the 10th percentile is 0.495 (or $495). This suggests that the estimated 10th percentile value of $12,750 is expected to vary by approximately $495 across different samples of the same size (506). While this variation is not particularly large, it indicates some level of uncertainty in the estimate.
set.seed(42)
# Create a bootstrap function to compute the 10th percentile
boot.10percentile = function(data, indices) {
return(quantile(data[indices], probs = 0.1))
}
# Perform bootstrapping
boot_results = boot(boston$medv, boot.10percentile, R = 1000)
# Print the standard error
sd(boot_results$t)
## [1] 0.4948966