Victorius Chendryanto (22/496953/PA/21379)

TASK 1

Importing Default dataset from ISLR

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.3.3

dataset <- Default
dataset

Splitting dataset into three train and validation sets

set.seed(29092023) # for reproducibility
# First split
train_indices1 <- sample(1:nrow(dataset), nrow(dataset) * 0.5, replace = FALSE)
train1 <- dataset[train_indices1, ]
validation1 <- dataset[-train_indices1, ]
# Second split
train_indices2 <- sample(1:nrow(dataset), nrow(dataset) * 0.5, replace = FALSE)
train2 <- dataset[train_indices2, ]
validation2 <- dataset[-train_indices2, ]
# Third split
train_indices3 <- sample(1:nrow(dataset), nrow(dataset) * 0.5, replace = FALSE)
train3 <- dataset[train_indices3, ]
validation3 <- dataset[-train_indices3, ]

Creating a validation set error to ease the validation test of three different splits

validation_set_error <- function(train_dataset, validation_dataset){
  # Fit logistic regression model to predict probability of default using income and balance
  fit_logit <- glm(default ~ income + balance, data = train_dataset, family = binomial)

  # Predict the model on validation set
  test_pred <- predict(fit_logit, newdata = validation_dataset, type = "response")
  test_class_pred <- ifelse(test_pred > 0.5, "Yes", "No")

  # Calculate validation set error
  validation_error <- mean(test_class_pred != validation_dataset$default)
  
  # Return validation set error value
  return(validation_error)
}

# Calculating validation set errors for each split
error1 <- validation_set_error(train1, validation1)
error2 <- validation_set_error(train2, validation2)
error3 <- validation_set_error(train3, validation3)

# Reporting errors
cat("Validation set error of the first split:", error1, "\nValidation set error of the second split:", error2, "\nValidation set error of the third split:", error3, "\nAverage validation set error:", mean(c(error1, error2, error3)))

## Validation set error of the first split: 0.0262 
## Validation set error of the second split: 0.0274 
## Validation set error of the third split: 0.0288 
## Average validation set error: 0.02746667

Analysis (a) and (b)

The validation set errors for this generalized linear model (GLM) are consistently low across the three splits: 0.0262, 0.0274, and 0.0288, with an average error of 0.0274 (2.74%). This indicates that the model is performing well and generalizes effectively across different subsets of data. The small variation between the errors across splits suggests that the GLM is capturing the linear relationships between the predictors and the outcome with minimal risk of overfitting.

In GLMs, the training-validation split is used to assess how well the model can generalize to unseen data. The low validation error across all splits is a strong indicator that the chosen link function and predictors are appropriate for the data at hand. The consistency of the validation error also suggests that the model specification is robust and that the estimated coefficients from the training data hold up well when applied to the validation data.

Furthermore, the average error of 0.0274 translates to a prediction accuracy of about 97.26%, which implies that the model can explain a significant portion of the variability in the data. Given the simplicity of GLMs compared to more complex machine learning models, this level of accuracy is impressive and suggests that the linear relationships in the data are well-represented by the model.

Seeing the different from adding dummy variable student as a variable

# Creating a validation set error to ease the validation test of three different splits
validation_set_error_student <- function(train_dataset, validation_dataset){
  # Fit logistic regression model to predict probability of default using income and balance
  fit_logit <- glm(default ~ income + balance + student, data = train_dataset, family = binomial)

  # Predict the model on validation set
  test_pred <- predict(fit_logit, newdata = validation_dataset, type = "response")
  test_class_pred <- ifelse(test_pred > 0.5, "Yes", "No")

  # Calculate validation set error
  validation_error <- mean(test_class_pred != validation_dataset$default)
  
  # Return validation set error value
  return(validation_error)
}

# Calculating validation set errors for each split
error1_student <- validation_set_error_student(train1, validation1)
error2_student <- validation_set_error_student(train2, validation2)
error3_student <- validation_set_error_student(train3, validation3)

# Reporting errors
cat("Validation set error of the first split (with student):", error1_student, "\nValidation set error of the second split (with student):", error2_student, "\nValidation set error of the third split (with student):", error3_student)

## Validation set error of the first split (with student): 0.027 
## Validation set error of the second split (with student): 0.028 
## Validation set error of the third split (with student): 0.0288

Creating a better output for analysis

df <- data.frame("Split" = c(1,2,3), "With_Student" = c(error1_student, error2_student, error3_student), "without_student" = c(error1, error2, error3))
print(df)

##   Split With_Student without_student
## 1     1       0.0270          0.0262
## 2     2       0.0280          0.0274
## 3     3       0.0288          0.0288

Analysis (c)

The inclusion of the “Student” variable does not significantly improve model performance and may even slightly worsen it. The model without the “Student” variable consistently performs as well or better, indicating that this variable may not be necessary for the final model. Further investigation could be done to see if other variables or interactions could explain the target variable more effectively.

TASK 2

library(boot)

## Warning: package 'boot' was built under R version 4.3.3

library(ISLR)
set.seed(29092023) # Set a random seed for reproducibility
# (a) Fit the logistic regression model
fit_logit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(fit_logit)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

# (b)
boot.fn <- function(data, index) {
  fit_logit <- glm(default ~ income + balance, data = data, subset = index, family = binomial)
  return(coef(fit_logit))
}

# (c)
boot.results <- boot(Default, boot.fn, R = 1000)
print(boot.results)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -4.395504e-02 4.354043e-01
## t2*  2.080898e-05  2.263395e-07 4.796893e-06
## t3*  5.647103e-03  1.963309e-05 2.348164e-04

Creating an analysis help table

# Extract standard errors from the glm object
glm_se <- summary(fit_logit)$coefficients[, "Std. Error"]

# Extract bootstrap standard errors
boot_se <- apply(boot.results$t, 2, sd)

# Create a data frame to compare the results
comparison_table <- data.frame(
  Coefficient = c("Intercept", "Income", "Balance"),
  GLM_Estimate = coef(fit_logit),
  GLM_Std_Error = glm_se,
  Bootstrap_Estimate = boot.results$t0,
  Bootstrap_Std_Error = boot_se
)

# Print the comparison table
print(comparison_table)

##             Coefficient  GLM_Estimate GLM_Std_Error Bootstrap_Estimate
## (Intercept)   Intercept -1.154047e+01  4.347564e-01      -1.154047e+01
## income           Income  2.080898e-05  4.985167e-06       2.080898e-05
## balance         Balance  5.647103e-03  2.273731e-04       5.647103e-03
##             Bootstrap_Std_Error
## (Intercept)        4.354043e-01
## income             4.796893e-06
## balance            2.348164e-04

Analysis

The comparison between the standard errors obtained from the glm() function and those obtained through the bootstrap method reveals interesting insights about the variability of the logistic regression model’s coefficient estimates. For the intercept term, both methods yield very similar standard errors, with the glm standard error being 0.4348 and the bootstrap standard error being 0.4354. This close match indicates that the glm model accurately captures the variability in the intercept estimate. It suggests that the model is stable and the original data is representative of the underlying distribution for this parameter.

For the income variable, the standard errors are also quite close, with the glm method producing a standard error of 4.985e-06 and the bootstrap method resulting in a slightly lower value of 4.797e-06. This consistency reinforces the reliability of the glm standard error estimate for the income coefficient, indicating that the effect of income on the probability of default is estimated with low variability and high precision. The small magnitude of the standard errors also suggests that income has a relatively minor influence on default when compared to balance.

In the case of the balance variable, there is a small difference between the standard errors: 2.274e-04 from the glm model and 2.348e-04 from the bootstrap method. Although this difference is minor, it indicates that the bootstrap method might capture slightly more variability in the estimate of the balance coefficient. This could be due to the fact that the balance variable has a stronger influence on default, as evidenced by the larger coefficient and smaller p-value in the logistic regression model.

Overall, the similarity between the standard errors obtained from the glm() function and the bootstrap method suggests that the logistic regression model is robust and provides reliable estimates of the standard errors for the income and balance coefficients. The bootstrap results confirm the precision of the glm() standard errors, indicating that the model’s assumptions hold well for this data set. Thus, both methods agree that balance is a more significant predictor of default than income, and the standard errors obtained reflect the true variability of the coefficients in the population.

Tugas 2 Analitika Data