library(tidymodels)
library(discrim)
library(readr)
We read in the Default dataset and convert default and
student to factor variables for classification.
Default <- read_csv("Default.csv") %>%
select(-1) %>%
mutate(
default = factor(default, levels = c("No", "Yes")),
student = factor(student, levels = c("No", "Yes"))
)
glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
We use a 70/30 split, stratified by the response variable
default, to ensure both sets have a similar proportion of
defaulters.
set.seed(123)
default_split <- initial_split(Default, prop = 0.7, strata = default)
default_train <- training(default_split)
default_test <- testing(default_split)
Logistic regression models the probability of default using
balance, income, and student as
predictors.
lr_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
lr_fit <- lr_spec %>%
fit(default ~ balance + income + student, data = default_train)
tidy(lr_fit)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.7 0.584 -18.4 1.38e-75
## 2 balance 0.00571 0.000275 20.8 8.62e-96
## 3 income 0.000000173 0.00000979 0.0177 9.86e- 1
## 4 studentYes -0.700 0.281 -2.49 1.28e- 2
LDA assumes that the predictors are normally distributed within each class and that the classes share a common covariance matrix.
lda_spec <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_fit <- lda_spec %>%
fit(default ~ balance + income + student, data = default_train)
lda_fit
## parsnip model object
##
## Call:
## lda(default ~ balance + income + student, data = data)
##
## Prior probabilities of groups:
## No Yes
## 0.96714286 0.03285714
##
## Group means:
## balance income studentYes
## No 800.9643 33610.00 0.2889217
## Yes 1756.7220 32190.23 0.3739130
##
## Coefficients of linear discriminants:
## LD1
## balance 2.244335e-03
## income 2.028998e-06
## studentYes -2.010832e-01
QDA is similar to LDA but allows each class to have its own covariance matrix, which can capture more complex decision boundaries.
qda_spec <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_fit <- qda_spec %>%
fit(default ~ balance + income + student, data = default_train)
qda_fit
## parsnip model object
##
## Call:
## qda(default ~ balance + income + student, data = data)
##
## Prior probabilities of groups:
## No Yes
## 0.96714286 0.03285714
##
## Group means:
## balance income studentYes
## No 800.9643 33610.00 0.2889217
## Yes 1756.7220 32190.23 0.3739130
We gather predictions from all three models on the test set and compute confusion matrices along with accuracy, sensitivity, and specificity.
models <- list(
"Logistic Regression" = lr_fit,
"LDA" = lda_fit,
"QDA" = qda_fit
)
preds <- imap_dfr(models, augment,
new_data = default_test, .id = "model")
# Confusion matrix for each model
preds %>%
group_by(model) %>%
conf_mat(truth = default, estimate = .pred_class) %>%
print()
## # A tibble: 3 × 2
## model conf_mat
## <chr> <list>
## 1 LDA <conf_mat>
## 2 Logistic Regression <conf_mat>
## 3 QDA <conf_mat>
# Multiple performance metrics
multi_metric <- metric_set(accuracy, sensitivity, specificity)
preds %>%
group_by(model) %>%
multi_metric(truth = default, estimate = .pred_class)
## # A tibble: 9 × 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 LDA accuracy binary 0.971
## 2 Logistic Regression accuracy binary 0.973
## 3 QDA accuracy binary 0.973
## 4 LDA sensitivity binary 0.998
## 5 Logistic Regression sensitivity binary 0.997
## 6 QDA sensitivity binary 0.998
## 7 LDA specificity binary 0.214
## 8 Logistic Regression specificity binary 0.282
## 9 QDA specificity binary 0.262
The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) at various threshold values. A curve closer to the top-left corner indicates better classification performance.
preds %>%
group_by(model) %>%
roc_curve(default, .pred_No) %>%
autoplot() +
labs(
title = "ROC Curves for Default Dataset",
subtitle = "Comparing Logistic Regression, LDA, and QDA"
) +
theme_minimal()
The Area Under the ROC Curve (AUC) summarizes the overall performance of each model. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents random guessing.
preds %>%
group_by(model) %>%
roc_auc(default, .pred_No)
## # A tibble: 3 × 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 LDA roc_auc binary 0.945
## 2 Logistic Regression roc_auc binary 0.945
## 3 QDA roc_auc binary 0.944
The ROC curves and AUC values show that all three models perform similarly well on the Default dataset. This is expected because logistic regression and LDA often produce nearly identical results when the normality assumption approximately holds. QDA allows for more flexible decision boundaries, but in this case the added flexibility does not lead to a significant improvement, suggesting that a linear decision boundary is sufficient for separating defaulters from non-defaulters.