Question 3

a. To implement k-fold cross-validation:

1. Divide the dataset: split the entire dataset into k equally (or nearly equally) sized folds. If the data are not inherently ordered, it is good practice to shuffle them before splitting to ensure randomness.
2. Perform k iterations. For each fold:
   - Use one fold as the test set: hold out one of the k folds as the test set (data the model has not seen).
   - Use the remaining folds as the training set: combine the other k − 1 folds into the training set.
   - Train the model: fit the model on the training set.
   - Evaluate the model: assess the model's performance on the held-out test set.
3. Calculate the average performance: once the model has been trained and evaluated on each of the k folds, average its performance across all k tests. This average serves as the overall performance estimate for the model.
b. k-fold cross-validation offers a balanced approach to model evaluation: it provides more reliable estimates than the validation set approach without being as computationally intensive as LOOCV. It does, however, require more computation than the validation set approach and may introduce slightly more bias than LOOCV, making it a compromise between those two extremes.
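As a concrete illustration, the following is a minimal sketch of the procedure from part (a) on the Default data used below, assuming k = 5 and the misclassification rate as the performance metric (cv.glm() in the boot package provides a packaged alternative to this manual loop):
# Minimal k-fold cross-validation sketch (assumes k = 5)
library(ISLR)
data(Default)
set.seed(123)
k <- 5
# Shuffle the row indices into k nearly equal folds
folds <- sample(rep(1:k, length.out = nrow(Default)))
fold_errors <- numeric(k)
for (j in 1:k) {
  train <- Default[folds != j, ]  # the remaining k - 1 folds
  test <- Default[folds == j, ]   # the held-out fold
  fit <- glm(default ~ income + balance, data = train, family = binomial)
  probs <- predict(fit, newdata = test, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  fold_errors[j] <- mean(preds != test$default)
}
# Average test error across the k folds
mean(fold_errors)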
# Install and load the ISLR package to access the Default dataset
if(!require(ISLR)){install.packages("ISLR")}
## Loading required package: ISLR
## Warning: package 'ISLR' was built under R version 4.3.3
library(ISLR)
# Load the Default dataset
data(Default)
# Fit a logistic regression model
model <- glm(default ~ income + balance, data = Default, family = binomial)
# Summary of the model
summary(model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Set seed for reproducibility
set.seed(123)
# Load the ISLR package and the Default dataset
library(ISLR)
data(Default)
# Determine the size of the dataset
n <- nrow(Default)
# Create a random sample of row indices for the training set (70% of the data)
train_indices <- sample(1:n, size = 0.7*n)
# Split the data into training and validation sets
train_data <- Default[train_indices, ]
validation_data <- Default[-train_indices, ]
# Fit the logistic regression model on the training data
model_train <- glm(default ~ income + balance, data = train_data, family = binomial)
# Display the summary of the model to see the coefficients and model statistics
summary(model_train)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.164e+01 5.202e-01 -22.375 < 2e-16 ***
## income 2.190e-05 5.981e-06 3.662 0.00025 ***
## balance 5.692e-03 2.729e-04 20.857 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2043.8 on 6999 degrees of freedom
## Residual deviance: 1098.2 on 6997 degrees of freedom
## AIC: 1104.2
##
## Number of Fisher Scoring iterations: 8
# Predict the posterior probabilities of default on the validation set
predicted_probs <- predict(model_train, newdata = validation_data, type = "response")
# Classify individuals as 'Yes' for default if the posterior probability > 0.5, otherwise 'No'
predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
# Add the predictions to the validation dataset to compare with actual default status
validation_data$predicted_default_status <- predicted_default_status
# View the first few rows to verify
head(validation_data)
## default student balance income predicted_default_status
## 6 No Yes 919.5885 7491.559 No
## 19 No No 485.9369 61566.106 No
## 22 No No 954.2618 32457.509 No
## 23 No No 1055.9566 51317.883 No
## 24 No No 641.9844 30466.103 No
## 29 No No 615.7043 39376.395 No
# Calculate the number of misclassified observations
misclassified <- sum(validation_data$predicted_default_status != validation_data$default)
# Calculate the total number of observations in the validation set
total_observations <- nrow(validation_data)
# Calculate the validation set error rate
validation_set_error <- misclassified / total_observations
# Print the validation set error rate
validation_set_error
## [1] 0.02633333
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Placeholder for storing error rates
error_rates <- numeric(3)
set.seed(123) # For reproducibility
for (i in 1:3) {
  # Split the dataset into training and validation sets
  train_indices <- sample(1:nrow(Default), size = 0.7 * nrow(Default))
  train_data <- Default[train_indices, ]
  validation_data <- Default[-train_indices, ]
  # Fit logistic regression model on the training data
  model <- glm(default ~ income + balance, data = train_data, family = binomial)
  # Predict default status on the validation data
  predicted_probs <- predict(model, newdata = validation_data, type = "response")
  predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
  # Compute the validation set error
  misclassified <- sum(predicted_default_status != validation_data$default)
  validation_set_error <- misclassified / nrow(validation_data)
  # Store the error rate
  error_rates[i] <- validation_set_error
}
# Output the error rates for each split
error_rates
## [1] 0.02633333 0.02700000 0.02666667
These results demonstrate the variability in model performance that can arise from different random splits of the data into training and validation sets. Although the differences in error rates are small in this example, the variability can be more pronounced in practice, depending on the dataset's characteristics and size.
These differences highlight the value of cross-validation techniques, such as k-fold cross-validation, for obtaining a more reliable estimate of model performance. By using multiple splits and averaging the results, the impact of the randomness of any single split is mitigated, yielding a more stable and generalizable assessment of how the model will perform on unseen data.
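For example, averaging the three error rates above yields a single, more stable estimate:
# Average the validation error across the three random splits
mean(error_rates)
## [1] 0.02666667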
library(ISLR)
# Load the Default dataset
data(Default)
# Prepare data: Convert 'student' to a numeric dummy variable
Default$student_dummy <- ifelse(Default$student == "Yes", 1, 0)
# Set seed for reproducibility
set.seed(123)
# Splitting the dataset into training and validation sets
train_indices <- sample(1:nrow(Default), size = 0.7 * nrow(Default))
train_data <- Default[train_indices, ]
validation_data <- Default[-train_indices, ]
# Fit logistic regression model on the training data, now including 'student_dummy'
model <- glm(default ~ income + balance + student_dummy, data = train_data, family = binomial)
# Predicting default status on the validation data
predicted_probs <- predict(model, newdata = validation_data, type = "response")
predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
# Compute validation set error
misclassified <- sum(predicted_default_status != validation_data$default)
validation_set_error <- misclassified / nrow(validation_data)
# Output the validation set error
validation_set_error
## [1] 0.027
This very small change in the validation set error relative to the previous results suggests that including student status as a predictor did not improve the model's performance on this particular dataset.
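As a side note, the explicit dummy variable is not strictly necessary: since student is a factor, glm() creates an equivalent encoding automatically. A minimal alternative specification, assuming the train_data split above:
# Equivalent model letting glm() encode the 'student' factor itself
model_factor <- glm(default ~ income + balance + student,
                    data = train_data, family = binomial)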
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Convert 'default' and 'student' to a factor if they're not already
Default$default <- as.factor(Default$default)
Default$student <- as.factor(Default$student)
# Fit the logistic regression model using 'income' and 'balance' as predictors
model <- glm(default ~ income + balance, data = Default, family = binomial)
# Display the summary of the model
summary_model <- summary(model)
# Extract and display the estimated standard errors for 'income' and 'balance'
standard_errors <- summary_model$coefficients[, "Std. Error"]
standard_errors[c("income", "balance")]
## income balance
## 4.985167e-06 2.273731e-04
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Define the boot_fn function
boot_fn <- function(data, index) {
  # Ensure 'default' is a factor for logistic regression
  data$default <- as.factor(data$default)
  # Fit the model using the specified index
  fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
  # Return the coefficients for income and balance
  return(coef(fit)[c("income", "balance")])
}
# Example usage of boot_fn with the entire dataset
boot_fn(Default, 1:nrow(Default))
## income balance
## 2.080898e-05 5.647103e-03
# Load necessary libraries
if (!require(ISLR)) install.packages("ISLR")
if (!require(boot)) install.packages("boot")
## Loading required package: boot
library(ISLR)
library(boot)
# Redefine boot_fn (as above) so this chunk is self-contained
boot_fn <- function(data, index) {
  data$default <- as.factor(data$default)
  fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
  return(coef(fit)[c("income", "balance")])
}
# Load the Default dataset
data(Default)
# Perform bootstrap resampling to estimate the standard errors of the coefficients
boot_results <- boot(data = Default, statistic = boot_fn, R = 1000)
# Display the results
print(boot_results)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.610485e-07 4.733164e-06
## t2* 5.647103e-03 1.305788e-05 2.215018e-04
# Extract the bootstrap standard errors for 'income' and 'balance'
# (the columns of boot_results$t hold the bootstrap replicates of each coefficient)
apply(boot_results$t, 2, sd)
## [1] 4.733164e-06 2.215018e-04
The standard errors for the income and balance coefficients are very similar between the glm() results and the bootstrap estimates. This close agreement between the theoretical, model-based approach and the empirical bootstrap approach suggests that the logistic regression model is reasonably well specified for this dataset.
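To make the comparison explicit, the two sets of estimates (values copied from the outputs above) can be tabulated side by side:
# Formula-based vs. bootstrap standard errors
data.frame(
  glm_se = c(income = 4.985167e-06, balance = 2.273731e-04),
  boot_se = c(income = 4.733164e-06, balance = 2.215018e-04)
)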
# Set the file path
file_path <- "C:/Users/ngaku/Downloads/Boston(2).csv"
# Read the CSV file into a data frame
boston_data <- read.csv(file_path)
# Check the first few rows of the data frame
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Assuming 'medv' is the column for median value of owner-occupied homes
mu_hat <- mean(boston_data$medv)
# Display the estimate
mu_hat
## [1] 22.53281
# Calculate the standard deviation of 'medv'
s <- sd(boston_data$medv)
# Number of observations
n <- length(boston_data$medv)
# Calculate the standard error of the mean (SE)
SE_mu_hat <- s / sqrt(n)
# Display the standard error
SE_mu_hat
## [1] 0.4088611
A standard error of 0.4088611 means that if one were to repeatedly sample from the population and calculate the sample mean each time, the distribution of those sample means would have a standard deviation of approximately 0.41. Concretely, with s ≈ 9.197 and n = 506, SE = s/√n = 9.197/√506 ≈ 0.4089.
# Load the boot library
library(boot)
# Define the bootstrap statistic function for the mean
boot_fn_mean <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  mean(d)            # Calculate the mean of the bootstrap sample
}
# Perform the bootstrap
set.seed(123) # For reproducibility
bootstrap_results <- boot(data = boston_data$medv, statistic = boot_fn_mean, R = 1000)
# Estimate of the standard error of mu_hat using bootstrap
bootstrap_se <- sd(bootstrap_results$t)
bootstrap_se
## [1] 0.4045557
The two results are very similar: the bootstrap estimate (0.4046) is close to the formula-based standard error (0.4089). This agreement suggests the analysis is on solid footing and that the estimated standard error is a reliable measure of the precision of the sample mean as an estimator of the population mean of medv.
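As a cross-check, boot.ci() can construct a confidence interval directly from the bootstrap object; the sketch below requests the normal-approximation interval, one of several types boot.ci() supports:
# 95% normal-approximation CI for the mean from the bootstrap results
boot.ci(bootstrap_results, type = "norm")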
# Reuse the sample mean and its standard error computed above
mu_hat <- mean(boston_data$medv)
SE_mu_hat <- sd(boston_data$medv) / sqrt(nrow(boston_data))
# Calculate the 95% CI using the two standard error rule
ci_lower <- mu_hat - 2 * SE_mu_hat
ci_upper <- mu_hat + 2 * SE_mu_hat
ci_two_se_rule <- c(ci_lower, ci_upper)
ci_two_se_rule
## [1] 21.71508 23.35053
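For comparison, the t-based interval from t.test() should be very close to the two-standard-error interval above:
# t-based 95% confidence interval for the mean of medv
t.test(boston_data$medv)$conf.int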
# Calculate the median of 'medv'
mu_med_hat <- median(boston_data$medv)
# Display the estimate for the population median
mu_med_hat
## [1] 21.2
# Define a bootstrap function for the median
boot_fn_median <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  median(d)          # Calculate the median of the bootstrap sample
}
# Perform the bootstrap for the median
set.seed(123) # For reproducibility
bootstrap_results_median <- boot(data = boston_data$medv, statistic = boot_fn_median, R = 1000)
# Estimate of the standard error of mu_med_hat using bootstrap
bootstrap_se_median <- sd(bootstrap_results_median$t)
bootstrap_se_median
## [1] 0.3676453
The standard error of the median (0.368) is slightly lower than that of the mean (0.405 from the bootstrap above), which could suggest that the median is the more robust estimator for this dataset, possibly due to outliers or a skewed distribution of medv.
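A quick check is consistent with the skewness explanation: the mean of medv exceeds the median (about 22.53 versus 21.2), as expected for a right-skewed distribution:
# Mean minus median; a positive value indicates right skew
mean(boston_data$medv) - median(boston_data$medv)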
# Calculate the 10th percentile of 'medv'
mu_0.1_hat <- quantile(boston_data$medv, 0.1)
# Display the estimate
mu_0.1_hat
## 10%
## 12.75
# Define a bootstrap function for the 10th percentile
boot_fn_p10 <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  quantile(d, 0.1)   # Calculate the 10th percentile of the bootstrap sample
}
# Perform the bootstrap for the 10th percentile
set.seed(123) # For reproducibility
bootstrap_results_p10 <- boot(data = boston_data$medv, statistic = boot_fn_p10, R = 1000)
# Estimate of the standard error of mu_0.1_hat using bootstrap
bootstrap_se_p10 <- sd(bootstrap_results_p10$t)
bootstrap_se_p10
## [1] 0.527868
Given the variability indicated by this standard error, the bootstrap provides a data-driven way to assess the precision of the 10th percentile as an estimate. A standard error of this magnitude allows confidence intervals to be constructed around the 10th percentile estimate, giving a range within which the true population 10th percentile is likely to lie.
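For instance, a rough 95% interval via the two-standard-error rule, using the estimates above:
# Rough 95% CI for the 10th percentile (two-standard-error rule)
unname(c(mu_0.1_hat - 2 * bootstrap_se_p10,
         mu_0.1_hat + 2 * bootstrap_se_p10))
## [1] 11.69426 13.80574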