First, I overwrite the crim variable with a binary factor that takes the value 1 if the crime rate is above the median, and 0 otherwise.
The updated dataset is shown below:
library(ISLR2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(Boston)
Boston$crim <- factor(ifelse(Boston$crim > median(Boston$crim), 1, 0))
glimpse(Boston)
## Rows: 506
## Columns: 13
## $ crim <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
I begin by fitting all four model types (logistic regression, LDA, naive Bayes, and KNN) using the full set of predictors. Instead of using a train/test split, I assess model performance based on cross-validation accuracy, using 10-fold cross-validation.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)
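# Note: plain 10-fold CV is used for all models below; a repeated-CV variant,
# e.g. trainControl(method = "repeatedcv", number = 10, repeats = 5), would give
# a more stable accuracy estimate at extra computational cost.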
log_regression <- train(crim ~ .,
data = Boston,
method = "glm",
family = "binomial",
trControl = ctrl)
log_regression
## Generalized Linear Model
##
## 506 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 456, 455, 456, 454, 456, 456, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8972775 0.7944931
set.seed(2)
lda <- train(crim ~ .,
data = Boston,
method = "lda",
trControl = ctrl)
lda
## Linear Discriminant Analysis
##
## 506 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 456, 455, 456, 456, 455, 454, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8479457 0.6953226
set.seed(3)
naive_b <- train(crim ~ .,
data = Boston,
method = "naive_bayes",
trControl = ctrl)
naive_b
## Naive Bayes
##
## 506 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 456, 456, 455, 455, 455, 456, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.8080241 0.6160362
## TRUE 0.8674389 0.7348430
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
set.seed(4)
knn <- train(crim ~ .,
data = Boston,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 20, 2)))
knn
## k-Nearest Neighbors
##
## 506 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 455, 456, 456, 455, 455, 456, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9147406 0.8295187
## 3 0.9185867 0.8372412
## 5 0.9166244 0.8332668
## 7 0.9009744 0.8019980
## 9 0.8812836 0.7626315
## 11 0.8713982 0.7428413
## 13 0.8754389 0.7509752
## 15 0.8753620 0.7507975
## 17 0.8793243 0.7587595
## 19 0.8734027 0.7469133
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
KNN surprisingly gave the best performance, despite no prior variable selection—even though KNN is usually sensitive to weak or irrelevant predictors.
To address this and potentially improve the model, I took the following steps:
Used caret::varImp() to rank all 12 predictors from most to least important.
Trained KNN models using incremental subsets of predictors (starting with the top 1, then top 2, top 3, and so on).
Evaluated which combination of predictors, and the corresponding optimal value of k, yielded the highest accuracy under the same 10-fold cross-validation.
Below are the variable importance rankings produced by caret:
varImp(knn)
## ROC curve variable importance
##
## Importance
## nox 100.00
## dis 86.05
## age 82.64
## indus 80.85
## tax 78.00
## rad 73.48
## lstat 59.43
## medv 50.63
## zn 47.35
## ptratio 45.59
## rm 20.66
## chas 0.00
By ranking the variables in order of importance, I was able to fit, tune, and evaluate all models efficiently within a compact loop. The output is shown below:
library(tibble)
knn_best_to_worst <- as.data.frame(varImp(knn)$importance) %>%
rownames_to_column("var") %>%
arrange(desc(X0)) %>%
pull(var)
Boston <- Boston %>%
dplyr::select(all_of(knn_best_to_worst), crim)
cv_accuracy <- c()
chosen_k <- c()
set.seed(5)
for (i in 1:13)
{
knn_crim_temp <- Boston[, c(1:i, ncol(Boston))] # first i ranked predictors + the 'crim' column (note: only 12 predictors exist, so i = 13 selects crim twice; see below)
knn_crim_temp <- train(crim ~ .,
data = knn_crim_temp,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 20, 2)))
cv_accuracy[i] <- max(knn_crim_temp$results$Accuracy)
chosen_k[i] <- as.numeric(knn_crim_temp$bestTune)
}
data.frame(pred_num = 1:13, cv_accuracy = cv_accuracy, chosen_k = chosen_k) %>%
arrange(desc(cv_accuracy))
## pred_num cv_accuracy chosen_k
## 1 13 0.9980392 3
## 2 1 0.9507406 1
## 3 10 0.9447753 5
## 4 4 0.9426667 3
## 5 5 0.9388235 5
## 6 7 0.9367783 1
## 7 9 0.9366983 1
## 8 6 0.9347451 3
## 9 8 0.9327768 1
## 10 11 0.9326652 5
## 11 12 0.9210483 3
## 12 2 0.8930166 5
## 13 3 0.8913183 7
The top row (pred_num = 13) should be discounted: with only 12 predictors available, selecting 13 columns duplicates the crim column, so the response leaks into the predictors and the near-perfect accuracy is an artifact. Setting that row aside, the most accurate model, by a noticeable margin, is surprisingly also the simplest possible: a KNN model using just one predictor (p = 1) and one nearest neighbor (K = 1). Its logic can be summed up in just two lines:
“To predict the binary crime rate (crim) for a new suburb, simply find the suburb in the training set with the most similar nitrogen oxide concentration (nox) and assign its binary crime label as the prediction.”
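In code, that rule amounts to something like the minimal sketch below (illustrative only: the helper name predict_crim_1nn is mine, and the caret model above additionally centers and scales the predictor, which does not change the nearest neighbor when there is only one variable).
# 1-NN with a single predictor: find the training suburb with the closest nox
# value and return its binary crim label.
predict_crim_1nn <- function(new_nox, train_nox, train_crim) {
  nearest <- which.min(abs(train_nox - new_nox))
  train_crim[nearest]
}
# e.g. predict_crim_1nn(0.5, Boston$nox, Boston$crim)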
Turning to the Default dataset, I start by inspecting the data and then fit a logistic regression of default on income and balance using all observations.
glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
log_regression1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_regression1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
I used a 70/30 split for the training and test sets, respectively.
set.seed(123, sample.kind = "Rounding")
## Warning in set.seed(123, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
nrow(train) / nrow(Default)
## [1] 0.7001
I continue to restrict the models to using only the income and balance predictors.
log_regression_2 <- glm(default ~ income + balance, data = train, family = "binomial")
summary(log_regression_2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.141e+01 5.099e-01 -22.379 < 2e-16 ***
## income 2.256e-05 5.898e-06 3.825 0.000131 ***
## balance 5.531e-03 2.651e-04 20.866 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1119.2 on 6998 degrees of freedom
## AIC: 1125.2
##
## Number of Fisher Scoring iterations: 8
I stored the predicted class labels in the vector pred_def, with the first 10 predictions shown below:
pred_def <- factor(ifelse(predict(log_regression_2, newdata = test, type = "response") > 0.5, "Yes", "No"))
pred_def[1:10]
## 2 3 4 12 13 14 15 17 18 20
## No No No No No No No No No No
## Levels: No Yes
round(mean(test$default != pred_def), 5)
## [1] 0.02601
To assess the stability and generalizability of the logistic regression model, I performed three additional train/test splits using a loop. For each split, I trained a logistic regression model on the training set and evaluated its performance on the corresponding test set. The resulting test errors, together with the error from the original split, were recorded in the vector log_reg_errors. The results are summarized below:
log_reg_errors <- c()
log_reg_errors[1] <- mean(test$default != pred_def)
set.seed(42, sample.kind = "Rounding")
## Warning in set.seed(42, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
for (i in 2:4) {
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_regression <- glm(default ~ income + balance, data = train, family = "binomial")
pred_def <- factor(ifelse(predict(log_regression, newdata = test, type = "response") > 0.5, "Yes", "No"))
log_reg_errors[i] <- mean(test$default != pred_def)
}
round(log_reg_errors, 5)
## [1] 0.02601 0.02534 0.02601 0.02534
As expected with the validation set approach, there is some variation in the results from split to split. The average test error across the four splits is approximately 0.0257.
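This average can be computed directly from the stored errors:
round(mean(log_reg_errors), 5)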
set.seed(43, sample.kind = "Rounding")
## Warning in set.seed(43, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_regression <- glm(default ~ ., data = train, family = "binomial")
summary(log_regression)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.100e+01 5.957e-01 -18.473 <2e-16 ***
## studentYes -6.353e-01 2.825e-01 -2.249 0.0245 *
## balance 5.780e-03 2.789e-04 20.721 <2e-16 ***
## income 4.235e-06 9.788e-06 0.433 0.6653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1093.3 on 6997 degrees of freedom
## AIC: 1101.3
##
## Number of Fisher Scoring iterations: 8
pred_def <- factor(ifelse(predict(log_regression, newdata = test, type = "response") > 0.5, "Yes", "No"))
The studentYes variable is statistically significant and appears to have reduced the significance of the income variable, likely due to correlation between the two. However, our primary focus is on the model’s out-of-sample performance rather than individual coefficient significance. The corresponding test error is shown below:
round(mean(test$default != pred_def), 5)
## [1] 0.02634
It appears that the test error for this model is higher than that of the simpler model—and, in fact, higher than all four previous models. This suggests that including the student dummy variable does not improve, and may even slightly worsen, the model’s test error rate.
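As a quick aside on the correlation point above, the link between student status and income can be checked directly, for example by comparing average income across the two groups:
# Students in the Default data have markedly lower average income, which is
# consistent with income losing significance once student enters the model.
tapply(Default$income, Default$student, mean)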
Next, I turn to the College dataset and model the number of applications received (Apps). I randomly divide the data into two equal halves, one for training and one for testing, so that I can assess how well each model generalizes to new, unseen data.
library(ISLR2) # For College data
library(MASS) # For ridge regression via lm.ridge
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:ISLR2':
##
## Boston
library(pls) # For PCR and PLS
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
# Create train/test split
train_index <- sample(1:nrow(College), nrow(College) / 2)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
I fit a multiple linear regression model using all predictors. This model assumes a linear relationship between the predictors and the response (Apps). It serves as a baseline to compare against more advanced models.
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$Apps)^2)
lm_mse
## [1] 1108531
Ridge regression is a type of regularized linear regression that adds a penalty for large coefficient values (L2 penalty). This can reduce overfitting, especially when predictors are highly correlated. In the absence of glmnet, I use the lm.ridge() function and select the best penalty parameter (lambda) based on generalized cross-validation (GCV).
# Build model matrices (intercept column dropped); x_test is used below for manual prediction
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_train <- train_data$Apps
y_test <- test_data$Apps
# Try multiple lambda values
lambda_seq <- seq(0, 1000, length = 100)
ridge_cv <- lm.ridge(Apps ~ ., data = train_data, lambda = lambda_seq)
# Choose lambda with lowest GCV (Generalized Cross Validation)
best_lambda <- lambda_seq[which.min(ridge_cv$GCV)]
best_lambda
## [1] 0
# Predict manually
ridge_coefs <- coef(ridge_cv)[which.min(ridge_cv$GCV), ]
ridge_pred <- cbind(1, x_test) %*% ridge_coefs
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 1108531
I was unable to fit a lasso model, since the glmnet package is unavailable on Posit Cloud and the lasso requires iterative solvers that are not part of base R.
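For completeness, had glmnet been available, the lasso fit would have looked roughly like the sketch below (not evaluated here): alpha = 1 requests the L1 penalty, and cv.glmnet chooses lambda by 10-fold cross-validation.
# Not run (glmnet unavailable in this environment):
# library(glmnet)
# cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
# lasso_pred <- predict(cv_lasso, newx = x_test, s = "lambda.min")
# lasso_mse <- mean((lasso_pred - y_test)^2)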
PCR reduces the predictor variables to a smaller set of uncorrelated components (principal components), which are then used to predict Apps. It is helpful when multicollinearity is present. I use cross-validation to select the optimal number of components.
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr_fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4335 4194 2363 2378 2106 1858 1858
## adjCV 4335 4194 2359 2379 1892 1845 1850
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1859 1867 1824 1822 1824 1829 1837
## adjCV 1852 1860 1815 1814 1815 1821 1828
## 14 comps 15 comps 16 comps 17 comps
## CV 1830 1819 1267 1285
## adjCV 1823 1790 1255 1272
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.216 57.68 64.73 70.55 76.33 81.30 85.01 88.40
## Apps 6.976 71.47 71.58 83.32 83.44 83.45 83.46 83.47
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.16 93.36 95.38 96.94 97.96 98.76 99.40
## Apps 84.53 84.86 84.98 84.98 84.99 85.24 90.87
## 16 comps 17 comps
## X 99.87 100.00
## Apps 93.93 93.97
# Plot CV error to help choose the number of components
validationplot(pcr_fit, val.type = "MSEP")
# CV error is minimized at 16 components; M = 10 components are used here for prediction
pcr_pred <- predict(pcr_fit, test_data, ncomp = 10)
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 1505718
PLS, like PCR, creates new components — but unlike PCR, these components are created by maximizing the covariance between predictors and the response. This often makes PLS more efficient and better-suited to prediction tasks. Again, I use cross-validation to choose how many components to include.
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls_fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4335 2196 1894 1752 1672 1496 1416
## adjCV 4335 2188 1891 1743 1636 1461 1393
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1395 1398 1386 1384 1384 1385 1384
## adjCV 1374 1377 1366 1365 1365 1366 1365
## 14 comps 15 comps 16 comps 17 comps
## CV 1385 1385 1385 1385
## adjCV 1365 1366 1366 1366
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.91 43.08 63.26 65.16 68.50 73.75 76.10 79.03
## Apps 76.64 83.93 87.14 91.90 93.49 93.85 93.91 93.94
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 81.76 85.41 89.03 91.38 93.31 95.43 97.41
## Apps 93.96 93.96 93.96 93.97 93.97 93.97 93.97
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.97 93.97
# Plot CV error
validationplot(pls_fit, val.type = "MSEP")
# CV error levels off by around 10 components; use M = 10 for prediction
pls_pred <- predict(pls_fit, test_data, ncomp = 10)
pls_mse <- mean((pls_pred - y_test)^2)
pls_mse
## [1] 1134531
For each model, I compute the test set MSE. Comparing these errors helps determine which modeling technique is most accurate and robust for this prediction task. Note that GCV selected lambda = 0 for ridge regression, so it coincides with the least squares fit, which is why their test MSEs are identical.
cat("Test MSEs:\n")
## Test MSEs:
cat("Linear Regression:", lm_mse, "\n")
## Linear Regression: 1108531
cat("Ridge Regression:", ridge_mse, "\n")
## Ridge Regression: 1108531
cat("PCR:", pcr_mse, "\n")
## PCR: 1505718
cat("PLS:", pls_mse, "\n")
## PLS: 1134531
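The same numbers can be gathered into one sorted table for easier comparison (a small convenience sketch; the object name mse_summary is mine):
# Collect the four test MSEs and sort from best to worst.
mse_summary <- data.frame(model = c("Linear regression", "Ridge regression", "PCR", "PLS"),
                          test_mse = c(lm_mse, ridge_mse, pcr_mse, pls_mse))
mse_summary[order(mse_summary$test_mse), ]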
# Load libraries
library(ISLR) # dataset
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
library(caret) # data splitting & evaluation
library(pROC) # ROC/AUC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Load dataset
data(Default)
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
This section loads the dataset and libraries. Default is from the ISLR package and includes customer features:
default: whether they defaulted (“Yes” or “No”)
income: annual income (numeric)
balance: average credit card balance (numeric)
student: whether the person is a student
I also check the data structure to ensure variables are properly formatted.
This model simulates how a bank or financial institution assesses the risk of borrower default using logistic regression. Understanding who is more likely to default allows companies to:
Adjust credit limits
Set interest rates
Require collateral
This is crucial in risk management and lending decisions.
# Ensure response and categorical predictor are factors (they already are in Default)
Default$default <- as.factor(Default$default)
Default$student <- as.factor(Default$student)
# Split into training (70%) and test (30%)
set.seed(123)
trainIndex <- createDataPartition(Default$default, p = 0.7, list = FALSE)
trainData <- Default[trainIndex, ]
testData <- Default[-trainIndex, ]
I first ensure that the response and categorical predictors are treated as factors. Then, I use caret::createDataPartition() to split the data into a training set (to build the model) and a test set (to evaluate model performance on new data). This mimics how models are used in practice: train on historical data, test on future cases.
log_model <- glm(default ~ income + balance + student,
data = trainData, family = binomial)
summary(log_model)
##
## Call:
## glm(formula = default ~ income + balance + student, family = binomial,
## data = trainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.097e+01 5.889e-01 -18.632 <2e-16 ***
## income 1.129e-05 9.836e-06 1.148 0.251
## balance 5.582e-03 2.690e-04 20.753 <2e-16 ***
## studentYes -4.074e-01 2.840e-01 -1.435 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1117.1 on 6997 degrees of freedom
## AIC: 1125.1
##
## Number of Fisher Scoring iterations: 8
This model predicts the probability of default based on income, balance, and student status.
The logistic regression model is appropriate because the response (default) is binary.
The summary() shows which variables are statistically significant (e.g., balance is usually highly significant). This aligns with ISLR Chapter 4 and the financial goal of predicting default.
# Predicted probabilities
testData$pred_prob <- predict(log_model, newdata = testData, type = "response")
# Binary predictions at 0.5 threshold
testData$pred_class <- ifelse(testData$pred_prob > 0.5, "Yes", "No")
testData$pred_class <- as.factor(testData$pred_class)
I apply the trained model to new (test) data to generate predicted probabilities of default. Then, I classify each observation as “Yes” or “No” based on a threshold (typically 0.5). This step simulates how banks would use the model to flag high-risk customers.
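Because defaults are rare (333 of the 10,000 customers), a 0.5 threshold flags relatively few accounts; lowering the threshold trades specificity for sensitivity. A minimal sketch, using 0.2 purely as an illustrative cutoff:
# Reclassify the same predicted probabilities at a lower, illustrative threshold.
testData$pred_class_low <- factor(ifelse(testData$pred_prob > 0.2, "Yes", "No"),
                                  levels = c("No", "Yes"))
table(Predicted = testData$pred_class_low, Actual = testData$default)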
# Confusion matrix
conf <- confusionMatrix(testData$pred_class, testData$default, positive = "Yes")
print(conf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2889 70
## Yes 11 29
##
## Accuracy : 0.973
## 95% CI : (0.9665, 0.9785)
## No Information Rate : 0.967
## P-Value [Acc > NIR] : 0.03383
##
## Kappa : 0.406
##
## Mcnemar's Test P-Value : 1.16e-10
##
## Sensitivity : 0.29293
## Specificity : 0.99621
## Pos Pred Value : 0.72500
## Neg Pred Value : 0.97634
## Prevalence : 0.03301
## Detection Rate : 0.00967
## Detection Prevalence : 0.01334
## Balanced Accuracy : 0.64457
##
## 'Positive' Class : Yes
##
# Accuracy
accuracy <- conf$overall['Accuracy']
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 97.3 %"
# ROC and AUC
roc_obj <- roc(testData$default, testData$pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
print(paste("AUC:", round(auc(roc_obj), 4)))
## [1] "AUC: 0.954"
plot(roc_obj, main = "ROC Curve for Credit Default Model")
Confusion matrix shows correct vs incorrect predictions.
Accuracy tells how often the model predicts correctly.
ROC curve and AUC measure the model’s ability to distinguish between default and non-default cases:
AUC closer to 1 = better performance.
Banks use this to judge whether a model is reliable for real-world use.
Accuracy: % of correct predictions on unseen data.
AUC: measures discrimination ability, ranging from 0.5 (no better than chance) to 1 (perfect separation); higher is better.
Coefficients: quantify how balance, income, and student status affect the log-odds of default (odds ratios are sketched just after this list).
E.g., a higher balance strongly increases default risk—critical for credit policy.
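To interpret the coefficients on a more intuitive scale, they can be exponentiated into odds ratios (a quick sketch based on the log_model fitted above):
# Multiplicative change in the odds of default per one-unit increase in each
# predictor (one dollar for balance and income, hence values very close to 1).
exp(coef(log_model))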
Financial Insight:
Balance usually has the strongest effect—those with higher balances are much more likely to default.
Student status might influence default risk, depending on income and debt levels.
Accuracy and AUC help institutions decide whether to deploy the model in a real setting.
Business Application:
Targeted credit card offers (exclude high-risk users)
Dynamic interest rates based on predicted default risk
Portfolio risk monitoring