Question 3

a. To implement k-fold cross-validation:

1. Divide the dataset: split the entire dataset into k equally (or nearly equally) sized folds. If the data are not inherently ordered, it is good practice to shuffle them before splitting to ensure randomness.
2. Perform k iterations. For each fold:
   - Use one fold as the test set: hold out one of the k folds as the test set (data the model has not seen).
   - Use the remaining folds as the training set: combine the other k − 1 folds into the training set.
   - Train the model: fit the model on the training set.
   - Evaluate the model: assess the model's performance on the held-out test set.
3. Calculate the average performance: once the model has been trained and evaluated on each of the k folds, average its performance across all k tests. This average serves as the overall performance estimate for the model.
b. k-fold cross-validation offers a balanced approach to model evaluation: it provides more reliable estimates than the validation set approach without being as computationally intensive as LOOCV. It does, however, require more computation than the validation set approach and may introduce slightly more bias than LOOCV, making it a compromise between those two extremes.
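As a concrete illustration, the following is a minimal sketch of the procedure from part (a) on the Default data used below, assuming k = 5 and the misclassification rate as the performance metric (cv.glm() in the boot package provides a packaged alternative to this manual loop):
# Minimal k-fold cross-validation sketch (assumes k = 5)
library(ISLR)
data(Default)
set.seed(123)
k <- 5
# Shuffle the row indices into k nearly equal folds
folds <- sample(rep(1:k, length.out = nrow(Default)))
fold_errors <- numeric(k)
for (j in 1:k) {
  train <- Default[folds != j, ]  # the remaining k - 1 folds
  test <- Default[folds == j, ]   # the held-out fold
  fit <- glm(default ~ income + balance, data = train, family = binomial)
  probs <- predict(fit, newdata = test, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  fold_errors[j] <- mean(preds != test$default)
}
# Average test error across the k folds
mean(fold_errors)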
# Install and load the ISLR package to access the Default dataset
if(!require(ISLR)){install.packages("ISLR")}
## Loading required package: ISLR
## Warning: package 'ISLR' was built under R version 4.3.3
library(ISLR)
# Load the Default dataset
data(Default)
# Fit a logistic regression model
model <- glm(default ~ income + balance, data = Default, family = binomial)
# Summary of the model
summary(model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Set seed for reproducibility
set.seed(123)
# Load the ISLR package and the Default dataset
library(ISLR)
data(Default)
# Determine the size of the dataset
n <- nrow(Default)
# Create a random sample of row indices for the training set (70% of the data)
train_indices <- sample(1:n, size = 0.7*n)
# Split the data into training and validation sets
train_data <- Default[train_indices, ]
validation_data <- Default[-train_indices, ]
# Fit the logistic regression model on the training data
model_train <- glm(default ~ income + balance, data = train_data, family = binomial)
# Display the summary of the model to see the coefficients and model statistics
summary(model_train)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.164e+01 5.202e-01 -22.375 < 2e-16 ***
## income 2.190e-05 5.981e-06 3.662 0.00025 ***
## balance 5.692e-03 2.729e-04 20.857 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2043.8 on 6999 degrees of freedom
## Residual deviance: 1098.2 on 6997 degrees of freedom
## AIC: 1104.2
##
## Number of Fisher Scoring iterations: 8
# Predict the posterior probabilities of default on the validation set
predicted_probs <- predict(model_train, newdata = validation_data, type = "response")
# Classify individuals as 'Yes' for default if the posterior probability > 0.5, otherwise 'No'
predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
# Add the predictions to the validation dataset to compare with actual default status
validation_data$predicted_default_status <- predicted_default_status
# View the first few rows to verify
head(validation_data)
## default student balance income predicted_default_status
## 6 No Yes 919.5885 7491.559 No
## 19 No No 485.9369 61566.106 No
## 22 No No 954.2618 32457.509 No
## 23 No No 1055.9566 51317.883 No
## 24 No No 641.9844 30466.103 No
## 29 No No 615.7043 39376.395 No
# Calculate the number of misclassified observations
misclassified <- sum(validation_data$predicted_default_status != validation_data$default)
# Calculate the total number of observations in the validation set
total_observations <- nrow(validation_data)
# Calculate the validation set error rate
validation_set_error <- misclassified / total_observations
# Print the validation set error rate
validation_set_error
## [1] 0.02633333
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Placeholder for storing error rates
error_rates <- numeric(3)
set.seed(123) # For reproducibility
for (i in 1:3) {
  # Split the dataset into training and validation sets
  train_indices <- sample(1:nrow(Default), size = 0.7 * nrow(Default))
  train_data <- Default[train_indices, ]
  validation_data <- Default[-train_indices, ]
  # Fit logistic regression model on the training data
  model <- glm(default ~ income + balance, data = train_data, family = binomial)
  # Predict default status on the validation data
  predicted_probs <- predict(model, newdata = validation_data, type = "response")
  predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
  # Compute the validation set error
  misclassified <- sum(predicted_default_status != validation_data$default)
  validation_set_error <- misclassified / nrow(validation_data)
  # Store the error rate
  error_rates[i] <- validation_set_error
}
# Output the error rates for each split
error_rates
## [1] 0.02633333 0.02700000 0.02666667
These results demonstrate the variability in model performance that can arise from different random splits of the data into training and validation sets. Although the differences in error rates are small in this example, the variability can be more pronounced in practice, depending on the dataset's characteristics and size.
These differences highlight the value of cross-validation techniques, such as k-fold cross-validation, for obtaining a more reliable estimate of model performance. By using multiple splits and averaging the results, the impact of the randomness of any single split is mitigated, yielding a more stable and generalizable assessment of how the model will perform on unseen data.
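For example, averaging the three error rates above yields a single, more stable estimate:
# Average the validation error across the three random splits
mean(error_rates)
## [1] 0.02666667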
library(ISLR)
# Load the Default dataset
data(Default)
# Prepare data: Convert 'student' to a numeric dummy variable
Default$student_dummy <- ifelse(Default$student == "Yes", 1, 0)
# Set seed for reproducibility
set.seed(123)
# Splitting the dataset into training and validation sets
train_indices <- sample(1:nrow(Default), size = 0.7 * nrow(Default))
train_data <- Default[train_indices, ]
validation_data <- Default[-train_indices, ]
# Fit logistic regression model on the training data, now including 'student_dummy'
model <- glm(default ~ income + balance + student_dummy, data = train_data, family = binomial)
# Predicting default status on the validation data
predicted_probs <- predict(model, newdata = validation_data, type = "response")
predicted_default_status <- ifelse(predicted_probs > 0.5, "Yes", "No")
# Compute validation set error
misclassified <- sum(predicted_default_status != validation_data$default)
validation_set_error <- misclassified / nrow(validation_data)
# Output the validation set error
validation_set_error
## [1] 0.027
This very small change in the validation set error relative to the previous results suggests that including student status as a predictor did not improve the model's performance on this particular dataset.
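As a side note, the explicit dummy variable is not strictly necessary: since student is a factor, glm() creates an equivalent encoding automatically. A minimal alternative specification, assuming the train_data split above:
# Equivalent model letting glm() encode the 'student' factor itself
model_factor <- glm(default ~ income + balance + student,
                    data = train_data, family = binomial)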
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Convert 'default' and 'student' to a factor if they're not already
Default$default <- as.factor(Default$default)
Default$student <- as.factor(Default$student)
# Fit the logistic regression model using 'income' and 'balance' as predictors
model <- glm(default ~ income + balance, data = Default, family = binomial)
# Display the summary of the model
summary_model <- summary(model)
# Extract and display the estimated standard errors for 'income' and 'balance'
standard_errors <- summary_model$coefficients[, "Std. Error"]
standard_errors[c("income", "balance")]
## income balance
## 4.985167e-06 2.273731e-04
# Load necessary library
library(ISLR)
# Load the Default dataset
data(Default)
# Define the boot_fn function
boot_fn <- function(data, index) {
  # Ensure 'default' is a factor for logistic regression
  data$default <- as.factor(data$default)
  # Fit the model using the specified index
  fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
  # Return the coefficients for income and balance
  return(coef(fit)[c("income", "balance")])
}
# Example usage of boot_fn with the entire dataset
boot_fn(Default, 1:nrow(Default))
## income balance
## 2.080898e-05 5.647103e-03
# Load necessary libraries
if (!require(ISLR)) install.packages("ISLR")
if (!require(boot)) install.packages("boot")
## Loading required package: boot
library(ISLR)
library(boot)
# Redefine boot_fn (as above) so this chunk is self-contained
boot_fn <- function(data, index) {
  data$default <- as.factor(data$default)
  fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
  return(coef(fit)[c("income", "balance")])
}
# Load the Default dataset
data(Default)
# Perform bootstrap resampling to estimate the standard errors of the coefficients
boot_results <- boot(data = Default, statistic = boot_fn, R = 1000)
# Display the results
print(boot_results)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.610485e-07 4.733164e-06
## t2* 5.647103e-03 1.305788e-05 2.215018e-04
# Extract the bootstrap standard errors for 'income' and 'balance'
# (the columns of boot_results$t hold the bootstrap replicates of each coefficient)
apply(boot_results$t, 2, sd)
## [1] 4.733164e-06 2.215018e-04
The standard errors for the income and balance coefficients are very similar between the glm() results and the bootstrap estimates. This close agreement between the theoretical, model-based approach and the empirical bootstrap approach suggests that the logistic regression model is reasonably well specified for this dataset.
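To make the comparison explicit, the two sets of estimates (values copied from the outputs above) can be tabulated side by side:
# Formula-based vs. bootstrap standard errors
data.frame(
  glm_se = c(income = 4.985167e-06, balance = 2.273731e-04),
  boot_se = c(income = 4.733164e-06, balance = 2.215018e-04)
)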
# Set the file path
file_path <- "C:/Users/ngaku/Downloads/Boston(2).csv"
# Read the CSV file into a data frame
boston_data <- read.csv(file_path)
# Check the first few rows of the data frame
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Assuming 'medv' is the column for median value of owner-occupied homes
mu_hat <- mean(boston_data$medv)
# Display the estimate
mu_hat
## [1] 22.53281
# Calculate the standard deviation of 'medv'
s <- sd(boston_data$medv)
# Number of observations
n <- length(boston_data$medv)
# Calculate the standard error of the mean (SE)
SE_mu_hat <- s / sqrt(n)
# Display the standard error
SE_mu_hat
## [1] 0.4088611
A standard error of 0.4088611 means that if one were to repeatedly sample from the population and calculate the sample mean each time, the distribution of those sample means would have a standard deviation of approximately 0.41. Concretely, with s ≈ 9.197 and n = 506, SE = s/√n = 9.197/√506 ≈ 0.4089.
# Load the boot library
library(boot)
# Define the bootstrap statistic function for the mean
boot_fn_mean <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  mean(d)            # Calculate the mean of the bootstrap sample
}
# Perform the bootstrap
set.seed(123) # For reproducibility
bootstrap_results <- boot(data = boston_data$medv, statistic = boot_fn_mean, R = 1000)
# Estimate of the standard error of mu_hat using bootstrap
bootstrap_se <- sd(bootstrap_results$t)
bootstrap_se
## [1] 0.4045557
The two results are very similar: the bootstrap estimate (0.4046) is close to the formula-based standard error (0.4089). This agreement suggests the analysis is on solid footing and that the estimated standard error is a reliable measure of the precision of the sample mean as an estimator of the population mean of medv.
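As a cross-check, boot.ci() can construct a confidence interval directly from the bootstrap object; the sketch below requests the normal-approximation interval, one of several types boot.ci() supports:
# 95% normal-approximation CI for the mean from the bootstrap results
boot.ci(bootstrap_results, type = "norm")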
# Reuse the sample mean and its standard error computed above
mu_hat <- mean(boston_data$medv)
SE_mu_hat <- sd(boston_data$medv) / sqrt(nrow(boston_data))
# Calculate the 95% CI using the two standard error rule
ci_lower <- mu_hat - 2 * SE_mu_hat
ci_upper <- mu_hat + 2 * SE_mu_hat
ci_two_se_rule <- c(ci_lower, ci_upper)
ci_two_se_rule
## [1] 21.71508 23.35053
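For comparison, the t-based interval from t.test() should be very close to the two-standard-error interval above:
# t-based 95% confidence interval for the mean of medv
t.test(boston_data$medv)$conf.int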
# Calculate the median of 'medv'
mu_med_hat <- median(boston_data$medv)
# Display the estimate for the population median
mu_med_hat
## [1] 21.2
# Define a bootstrap function for the median
boot_fn_median <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  median(d)          # Calculate the median of the bootstrap sample
}
# Perform the bootstrap for the median
set.seed(123) # For reproducibility
bootstrap_results_median <- boot(data = boston_data$medv, statistic = boot_fn_median, R = 1000)
# Estimate of the standard error of mu_med_hat using bootstrap
bootstrap_se_median <- sd(bootstrap_results_median$t)
bootstrap_se_median
## [1] 0.3676453
The standard error of the median (0.368) is slightly lower than that of the mean (0.405 from the bootstrap above), which could suggest that the median is the more robust estimator for this dataset, possibly due to outliers or a skewed distribution of medv.
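A quick check is consistent with the skewness explanation: the mean of medv exceeds the median (about 22.53 versus 21.2), as expected for a right-skewed distribution:
# Mean minus median; a positive value indicates right skew
mean(boston_data$medv) - median(boston_data$medv)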
# Calculate the 10th percentile of 'medv'
mu_0.1_hat <- quantile(boston_data$medv, 0.1)
# Display the estimate
mu_0.1_hat
## 10%
## 12.75
# Define a bootstrap function for the 10th percentile
boot_fn_p10 <- function(data, indices) {
  d <- data[indices] # Extract the bootstrap sample
  quantile(d, 0.1)   # Calculate the 10th percentile of the bootstrap sample
}
# Perform the bootstrap for the 10th percentile
set.seed(123) # For reproducibility
bootstrap_results_p10 <- boot(data = boston_data$medv, statistic = boot_fn_p10, R = 1000)
# Estimate of the standard error of mu_0.1_hat using bootstrap
bootstrap_se_p10 <- sd(bootstrap_results_p10$t)
bootstrap_se_p10
## [1] 0.527868
Given the variability indicated by this standard error, the bootstrap provides a data-driven way to assess the precision of the 10th percentile as an estimate. A standard error of this magnitude allows confidence intervals to be constructed around the 10th percentile estimate, giving a range within which the true population 10th percentile is likely to lie.
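For instance, a rough 95% interval via the two-standard-error rule, using the estimates above:
# Rough 95% CI for the 10th percentile (two-standard-error rule)
unname(c(mu_0.1_hat - 2 * bootstrap_se_p10,
         mu_0.1_hat + 2 * bootstrap_se_p10))
## [1] 11.69426 13.80574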