Chapter 4

Question 16

First, I overwrite the crim variable with a binary factor that takes the value 1 if the crime rate is above the median, and 0 otherwise.

The updated dataset is shown below:

library(ISLR2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(Boston)
Boston$crim <- factor(ifelse(Boston$crim > median(Boston$crim), 1, 0))

glimpse(Boston)
## Rows: 506
## Columns: 13
## $ crim    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

I begin by fitting all four model types (logistic regression, LDA, naive Bayes, and KNN) using the full set of predictors. Instead of using a train/test split, I assess each model by its 10-fold cross-validation accuracy.

Logistic regression

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)
log_regression <- train(crim ~ ., 
                 data = Boston, 
                 method = "glm", 
                 family = "binomial", 
                 trControl = ctrl)

log_regression
## Generalized Linear Model 
## 
## 506 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 456, 455, 456, 454, 456, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8972775  0.7944931

LDA

set.seed(2)
lda <- train(crim ~ ., 
                 data = Boston, 
                 method = "lda", 
                 trControl = ctrl)

lda
## Linear Discriminant Analysis 
## 
## 506 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 456, 455, 456, 456, 455, 454, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8479457  0.6953226

Naive Bayes

set.seed(3)
naive_b <- train(crim ~ ., 
                 data = Boston, 
                 method = "naive_bayes", 
                 trControl = ctrl)
naive_b
## Naive Bayes 
## 
## 506 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 456, 456, 455, 455, 455, 456, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.8080241  0.6160362
##    TRUE      0.8674389  0.7348430
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

KNN

set.seed(4)
knn <- train(crim ~ ., 
                 data = Boston, 
                 method = "knn", 
                 preProcess = c("center", "scale"), 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(k = seq(1, 20, 2)))

knn
## k-Nearest Neighbors 
## 
## 506 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 455, 456, 456, 455, 455, 456, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9147406  0.8295187
##    3  0.9185867  0.8372412
##    5  0.9166244  0.8332668
##    7  0.9009744  0.8019980
##    9  0.8812836  0.7626315
##   11  0.8713982  0.7428413
##   13  0.8754389  0.7509752
##   15  0.8753620  0.7507975
##   17  0.8793243  0.7587595
##   19  0.8734027  0.7469133
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.

KNN surprisingly gave the best performance, despite no prior variable selection—even though KNN is usually sensitive to weak or irrelevant predictors.

To address this and potentially improve the model, I took the following steps:

Used caret::varImp() to rank all 12 predictors from most to least important.

Trained KNN models using incremental subsets of predictors (starting with the top 1, then top 2, top 3, and so on).

Evaluated which subset of predictors, together with the tuned value of k, yielded the highest 10-fold cross-validation accuracy (reusing the same trControl object as above).

Below are the variable importance rankings produced by caret:

varImp(knn)
## ROC curve variable importance
## 
##         Importance
## nox         100.00
## dis          86.05
## age          82.64
## indus        80.85
## tax          78.00
## rad          73.48
## lstat        59.43
## medv         50.63
## zn           47.35
## ptratio      45.59
## rm           20.66
## chas          0.00

By ranking the variables in order of importance, I was able to fit, tune, and evaluate all models efficiently within a compact loop. The output is shown below:

library(tibble)
knn_best_to_worst <- as.data.frame(varImp(knn)$importance) %>%
  rownames_to_column("var") %>%
  arrange(desc(X0)) %>%
  pull(var)

Boston <- Boston %>%
  dplyr::select(all_of(knn_best_to_worst), crim)
cv_accuracy <- c()
chosen_k <- c()

set.seed(5)

for (i in 1:13) {
  knn_crim_temp <- Boston[, c(1:i, ncol(Boston))]  # first i predictors + the 'crim' column
  # Note: when i = 13, column 13 is 'crim' itself, so the response is duplicated into the
  # predictor set and that iteration's near-perfect accuracy is an artifact.
  knn_crim_temp <- train(crim ~ ., 
                         data = knn_crim_temp,
                         method = "knn", 
                         preProcess = c("center", "scale"), 
                         trControl = ctrl, 
                         tuneGrid = expand.grid(k = seq(1, 20, 2)))
  
  cv_accuracy[i] <- max(knn_crim_temp$results$Accuracy)
  chosen_k[i] <- as.numeric(knn_crim_temp$bestTune)
}

data.frame(pred_num = 1:13, cv_accuracy = cv_accuracy, chosen_k = chosen_k) %>%
  arrange(desc(cv_accuracy))
##    pred_num cv_accuracy chosen_k
## 1        13   0.9980392        3
## 2         1   0.9507406        1
## 3        10   0.9447753        5
## 4         4   0.9426667        3
## 5         5   0.9388235        5
## 6         7   0.9367783        1
## 7         9   0.9366983        1
## 8         6   0.9347451        3
## 9         8   0.9327768        1
## 10       11   0.9326652        5
## 11       12   0.9210483        3
## 12        2   0.8930166        5
## 13        3   0.8913183        7

Setting aside the top row (pred_num = 13), whose near-perfect accuracy is an artifact of the duplicated crim column noted above, the most accurate model by a noticeable margin is also the simplest possible: a KNN model using just one predictor (p = 1) and one nearest neighbor (K = 1). Its logic can be summed up in a single sentence:

“To predict the binary crime rate (crim) for a new suburb, simply find the suburb in the training set with the most similar nitrogen oxide concentration (nox) and assign its binary crime label as the prediction.”
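For completeness, here is a minimal sketch of that final model, refit on nox alone with k fixed at 1 (the object name knn_final is introduced here for illustration; this reuses the ctrl object defined earlier, and the cross-validated accuracy will differ slightly from the table above because of the resampling seed):

set.seed(6)
knn_final <- train(crim ~ nox,
                   data = Boston,
                   method = "knn",
                   preProcess = c("center", "scale"),
                   trControl = ctrl,
                   tuneGrid = expand.grid(k = 1))
knn_final$results$Accuracy  # cross-validated accuracy of the one-predictor, 1-NN model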

Chapter 5

Question 5

glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

a.

log_regression1 <- glm(default ~ income + balance, data = Default, family = "binomial")

summary(log_regression1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

b.

i.

I used a 70/30 split for the training and test sets, respectively.

set.seed(123, sample.kind = "Rounding")
## Warning in set.seed(123, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)

train <- Default[index, ]
test <- Default[-index, ]

nrow(train) / nrow(Default)
## [1] 0.7001

ii.

As before, I restrict the model to the income and balance predictors.

log_regression_2 <- glm(default ~ income + balance, data = train, family = "binomial")

summary(log_regression_2)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.141e+01  5.099e-01 -22.379  < 2e-16 ***
## income       2.256e-05  5.898e-06   3.825 0.000131 ***
## balance      5.531e-03  2.651e-04  20.866  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1119.2  on 6998  degrees of freedom
## AIC: 1125.2
## 
## Number of Fisher Scoring iterations: 8

iii.

I stored the predicted class labels in the vector pred_def, with the first 10 predictions shown below:

pred_def <- factor(ifelse(predict(log_regression_2, newdata = test, type = "response") > 0.5, "Yes", "No"))
pred_def[1:10]
##  2  3  4 12 13 14 15 17 18 20 
## No No No No No No No No No No 
## Levels: No Yes

iv.

round(mean(test$default != pred_def), 5)
## [1] 0.02601

c.

To assess the stability and generalizability of the logistic regression model, I performed three additional train/test splits using a loop. For each split, I trained a logistic regression model on the training set and evaluated its performance on the corresponding test set. The test error from part (b) is stored as the first element of the vector log_reg_errors, and the three new test errors follow it. The results are summarized below:

log_reg_errors <- c()
log_reg_errors[1] <- mean(test$default != pred_def)

set.seed(42, sample.kind = "Rounding")
## Warning in set.seed(42, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
for (i in 2:4) {
  index <- createDataPartition(y = Default$default, p = 0.7, list = F)
  train <- Default[index, ]
  test <- Default[-index, ]
  
  log_regression <- glm(default ~ income + balance, data = train, family = "binomial")
  pred_def <- factor(ifelse(predict(log_regression, newdata = test, type = "response") > 0.5, "Yes", "No"))
  
  log_reg_errors[i] <- mean(test$default != pred_def)
}


round(log_reg_errors, 5)
## [1] 0.02601 0.02534 0.02601 0.02534

As expected with the validation set approach, there is some variation in the error across splits. The average test error over the four splits is approximately 0.0257.
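For reference, the average can be computed directly from the stored errors (a one-line sketch; the exact value depends on the unrounded entries of log_reg_errors):

round(mean(log_reg_errors), 5)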

d.

set.seed(43, sample.kind = "Rounding")
## Warning in set.seed(43, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]

log_regression <- glm(default ~ ., data = train, family = "binomial")
summary(log_regression)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.100e+01  5.957e-01 -18.473   <2e-16 ***
## studentYes  -6.353e-01  2.825e-01  -2.249   0.0245 *  
## balance      5.780e-03  2.789e-04  20.721   <2e-16 ***
## income       4.235e-06  9.788e-06   0.433   0.6653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1093.3  on 6997  degrees of freedom
## AIC: 1101.3
## 
## Number of Fisher Scoring iterations: 8
pred_def <- factor(ifelse(predict(log_regression, newdata = test, type = "response") > 0.5, "Yes", "No"))

The studentYes variable is statistically significant and appears to have reduced the significance of the income variable, likely due to correlation between the two. However, our primary focus is on the model’s out-of-sample performance rather than individual coefficient significance. The corresponding test error is shown below:

round(mean(test$default != pred_def), 5)
## [1] 0.02634

It appears that the test error for this model is higher than that of the simpler model—and, in fact, higher than all four previous models. This suggests that including the student dummy variable does not improve, and may even slightly worsen, the model’s test error rate.
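As a quick check on the suspected overlap between student status and income mentioned above, one can compare average incomes by student status (a minimal sketch; in the Default data, students have noticeably lower incomes on average, which helps explain why income loses significance once student is included):

tapply(Default$income, Default$student, mean)  # mean income for non-students vs. students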

Chapter 6

Question 9

a.

I randomly divide the dataset into two equal halves: one for training and one for testing. This lets me assess how well each model generalizes to new, unseen data.

library(ISLR2)     # For College data
library(MASS)      # For ridge regression via lm.ridge
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Boston
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(pls)       # For PCR and PLS
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)

# Create train/test split
train_index <- sample(1:nrow(College), nrow(College) / 2)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

b.

I fit a multiple linear regression model using all predictors. This model assumes a linear relationship between the predictors and the response (Apps). It serves as a baseline to compare against more advanced models.

lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$Apps)^2)
lm_mse
## [1] 1108531

c.

Ridge regression is a type of regularized linear regression that adds a penalty for large coefficient values (L2 penalty). This can reduce overfitting, especially when predictors are highly correlated. In the absence of glmnet, I use the lm.ridge() function and select the best penalty parameter (lambda) based on generalized cross-validation (GCV).

# Build model matrices for prediction (lm.ridge() standardizes the predictors internally)
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_train <- train_data$Apps
y_test <- test_data$Apps

# Try multiple lambda values
lambda_seq <- seq(0, 1000, length = 100)
ridge_cv <- lm.ridge(Apps ~ ., data = train_data, lambda = lambda_seq)

# Choose lambda with lowest GCV (Generalized Cross Validation)
best_lambda <- lambda_seq[which.min(ridge_cv$GCV)]
best_lambda
## [1] 0
# Predict manually using the coefficients at the GCV-optimal lambda (coef() returns them on the original scale)
ridge_coefs <- coef(ridge_cv)[which.min(ridge_cv$GCV), ]
ridge_pred <- cbind(1, x_test) %*% ridge_coefs
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 1108531

Because GCV selects lambda = 0, the ridge fit reduces to ordinary least squares here, which is why its test MSE is identical to that of the linear regression model above.

d.

I was unable to fit a lasso model because the glmnet package is unavailable on Posit Cloud, and the lasso requires a specialized solver that base R does not provide.
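For reference, if glmnet were available, the lasso fit would look roughly like the sketch below (not run here; the object names are illustrative, alpha = 1 requests the L1 penalty, and cv.glmnet() selects lambda by cross-validation):

library(glmnet)                                     # not installed in this environment
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)  # cross-validate over a grid of lambda values
lasso_pred <- predict(lasso_cv, newx = x_test, s = "lambda.min")
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse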

e.

PCR reduces the predictor variables to a smaller set of uncorrelated components (principal components), which are then used to predict Apps. It is helpful when multicollinearity is present. I use cross-validation to select the optimal number of components.

pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr_fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4335     4194     2363     2378     2106     1858     1858
## adjCV         4335     4194     2359     2379     1892     1845     1850
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1859     1867     1824      1822      1824      1829      1837
## adjCV     1852     1860     1815      1814      1815      1821      1828
##        14 comps  15 comps  16 comps  17 comps
## CV         1830      1819      1267      1285
## adjCV      1823      1790      1255      1272
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.216    57.68    64.73    70.55    76.33    81.30    85.01    88.40
## Apps    6.976    71.47    71.58    83.32    83.44    83.45    83.46    83.47
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.16     93.36     95.38     96.94     97.96     98.76     99.40
## Apps    84.53     84.86     84.98     84.98     84.99     85.24     90.87
##       16 comps  17 comps
## X        99.87    100.00
## Apps     93.93     93.97
# Choose number of components with lowest CV error
validationplot(pcr_fit, val.type = "MSEP")

# Predict with M = 10 components (note: the CV error above is actually lowest at 16 components)
pcr_pred <- predict(pcr_fit, test_data, ncomp = 10)
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 1505718
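The number of components could also be chosen programmatically. A sketch using pls::selectNcomp(), which applies a one-standard-error rule to the cross-validation curve (its suggestion may differ from the 10 components used above):

selectNcomp(pcr_fit, method = "onesigma", plot = FALSE)  # suggested number of components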

f.

PLS, like PCR, creates new components, but unlike PCR these components are constructed to maximize the covariance between the predictors and the response. This often lets PLS explain the response with fewer components, making it well suited to prediction tasks. Again, I use cross-validation to choose how many components to include.

pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls_fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4335     2196     1894     1752     1672     1496     1416
## adjCV         4335     2188     1891     1743     1636     1461     1393
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1395     1398     1386      1384      1384      1385      1384
## adjCV     1374     1377     1366      1365      1365      1366      1365
##        14 comps  15 comps  16 comps  17 comps
## CV         1385      1385      1385      1385
## adjCV      1365      1366      1366      1366
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.91    43.08    63.26    65.16    68.50    73.75    76.10    79.03
## Apps    76.64    83.93    87.14    91.90    93.49    93.85    93.91    93.94
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       81.76     85.41     89.03     91.38     93.31     95.43     97.41
## Apps    93.96     93.96     93.96     93.97     93.97     93.97     93.97
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.97     93.97
# Plot CV error
validationplot(pls_fit, val.type = "MSEP")

# Predict with M = 10 components (the CV error is essentially flat from about 10 components onward)
pls_pred <- predict(pls_fit, test_data, ncomp = 10)
pls_mse <- mean((pls_pred - y_test)^2)
pls_mse
## [1] 1134531

g.

For each model, I compute the test set MSE. Comparing these errors helps us determine which modeling technique is most accurate and robust for this prediction task.

cat("Test MSEs:\n")
## Test MSEs:
cat("Linear Regression:", lm_mse, "\n")
## Linear Regression: 1108531
cat("Ridge Regression:", ridge_mse, "\n")
## Ridge Regression: 1108531
cat("PCR:", pcr_mse, "\n")
## PCR: 1505718
cat("PLS:", pls_mse, "\n")
## PLS: 1134531

The unpenalized linear model and the ridge fit (which GCV reduced to least squares) achieve the lowest test MSE, PLS with 10 components is close behind, and PCR with 10 components performs noticeably worse.

20% Predicting Credit Card Default (Logistic Regression)

1. Load & Explore the Data

# Load libraries
library(ISLR)          # dataset
## 
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
## 
##     Auto, Credit
library(caret)         # data splitting & evaluation
library(pROC)          # ROC/AUC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Load dataset
data(Default)
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

This section loads the dataset and libraries. Default is from the ISLR package and includes customer features:

default: whether they defaulted (“Yes” or “No”)

income: annual income (numeric)

balance: average credit card balance (numeric)

student: whether the person is a student

I also check the data structure to ensure variables are properly formatted.

2. Financial Context

This model simulates how a bank or financial institution assesses the risk of borrower default using logistic regression. Understanding who is more likely to default allows companies to:

Adjust credit limits

Set interest rates

Require collateral

This is crucial in risk management and lending decisions.

3. Data Preparation

# Convert factors
Default$default <- as.factor(Default$default)
Default$student <- as.factor(Default$student)

# Split into training (70%) and test (30%)
set.seed(123)
trainIndex <- createDataPartition(Default$default, p = 0.7, list = FALSE)
trainData <- Default[trainIndex, ]
testData  <- Default[-trainIndex, ]

I first ensure that the response and categorical predictors are treated as factors. Then, I use caret::createDataPartition() to split the data into a training set (to build the model) and a test set (to evaluate model performance on new data). This mimics how models are used in practice: train on historical data, test on future cases.

4. Fit Model on Training Data

log_model <- glm(default ~ income + balance + student,
                 data = trainData, family = binomial)
summary(log_model)
## 
## Call:
## glm(formula = default ~ income + balance + student, family = binomial, 
##     data = trainData)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.097e+01  5.889e-01 -18.632   <2e-16 ***
## income       1.129e-05  9.836e-06   1.148    0.251    
## balance      5.582e-03  2.690e-04  20.753   <2e-16 ***
## studentYes  -4.074e-01  2.840e-01  -1.435    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1117.1  on 6997  degrees of freedom
## AIC: 1125.1
## 
## Number of Fisher Scoring iterations: 8

This model predicts the probability of default based on income, balance, and student status.

The logistic regression model is appropriate because the response (default) is binary.

The summary() shows which variables are statistically significant (e.g., balance is usually highly significant). This aligns with ISLR Chapter 4 and the financial goal of predicting default.

5. Predict on Test Data

# Predicted probabilities
testData$pred_prob <- predict(log_model, newdata = testData, type = "response")

# Binary predictions at 0.5 threshold
testData$pred_class <- ifelse(testData$pred_prob > 0.5, "Yes", "No")
testData$pred_class <- as.factor(testData$pred_class)

I apply the trained model to new (test) data to generate predicted probabilities of default. Then, I classify each observation as “Yes” or “No” based on a threshold (typically 0.5). This step simulates how banks would use the model to flag high-risk customers.

6. Model Evaluation

# Confusion matrix
conf <- confusionMatrix(testData$pred_class, testData$default, positive = "Yes")
print(conf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2889   70
##        Yes   11   29
##                                           
##                Accuracy : 0.973           
##                  95% CI : (0.9665, 0.9785)
##     No Information Rate : 0.967           
##     P-Value [Acc > NIR] : 0.03383         
##                                           
##                   Kappa : 0.406           
##                                           
##  Mcnemar's Test P-Value : 1.16e-10        
##                                           
##             Sensitivity : 0.29293         
##             Specificity : 0.99621         
##          Pos Pred Value : 0.72500         
##          Neg Pred Value : 0.97634         
##              Prevalence : 0.03301         
##          Detection Rate : 0.00967         
##    Detection Prevalence : 0.01334         
##       Balanced Accuracy : 0.64457         
##                                           
##        'Positive' Class : Yes             
## 
# Accuracy
accuracy <- conf$overall['Accuracy']
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 97.3 %"
# ROC and AUC
roc_obj <- roc(testData$default, testData$pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
print(paste("AUC:", round(auc(roc_obj), 4)))
## [1] "AUC: 0.954"
plot(roc_obj, main = "ROC Curve for Credit Default Model")

The confusion matrix shows correct versus incorrect predictions for each class.

Accuracy tells how often the model predicts correctly. Because only about 3% of customers default, the 97.3% accuracy is only slightly above the no-information rate of 96.7%, and the sensitivity of roughly 29% shows that most actual defaulters are missed at the 0.5 threshold.

ROC curve and AUC measure the model’s ability to distinguish between default and non-default cases:

AUC closer to 1 = better performance.

Banks use this to judge whether a model is reliable for real-world use.
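If missing defaulters is costlier than flagging safe customers, the cutoff does not have to stay at 0.5. A sketch (not part of the analysis above) of reading an alternative threshold off the ROC curve with pROC::coords(), using Youden's J statistic:

# Threshold that maximizes sensitivity + specificity (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"),
       best.method = "youden")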

7. Interpretation

Accuracy: % of correct predictions on unseen data.

AUC: Measures discrimination ability; it ranges from 0.5 (no better than chance) to 1 (perfect separation), so higher is better.

Coefficients: Quantify how balance, income, and student status shift the log-odds (and hence the probability) of default.

E.g., a higher balance strongly increases default risk, which is critical for credit policy; the odds-ratio sketch below makes this concrete.
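Exponentiating the fitted coefficients converts them from log-odds to odds ratios (a minimal sketch using the log_model object fitted above):

exp(coef(log_model))                   # multiplicative change in the odds of default per unit increase
exp(coef(log_model)["balance"] * 100)  # per $100 of balance, which is easier to interpret

With the balance coefficient of roughly 0.0056 per dollar shown in the model summary, each additional $100 of balance multiplies the odds of default by about exp(0.56) ≈ 1.75.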

Financial Insight:

Balance usually has the strongest effect—those with higher balances are much more likely to default.

Student status might influence default risk, depending on income and debt levels.

Accuracy and AUC help institutions decide whether to deploy the model in a real setting.

Business Application:

Targeted credit card offers (exclude high-risk users)

Dynamic interest rates based on predicted default risk

Portfolio risk monitoring