1 Introduction

This analysis compares three logistic regression models of increasing complexity for predicting credit default.

Model performance is assessed using ROC curves and the Area Under the Curve (AUC) metric to measure their ability to distinguish between default and non-default cases.

2 Data Understanding

This section provides an initial exploration of the dataset to understand its structure and variables.

# Load data
data <- read.csv("Default.csv")

# Display structure
str(data)
## 'data.frame':    10000 obs. of  5 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ default: chr  "No" "No" "No" "No" ...
##  $ student: chr  "No" "Yes" "No" "No" ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
# Summary statistics
summary(data)
##        X           default            student             balance      
##  Min.   :    1   Length:10000       Length:10000       Min.   :   0.0  
##  1st Qu.: 2501   Class :character   Class :character   1st Qu.: 481.7  
##  Median : 5000   Mode  :character   Mode  :character   Median : 823.6  
##  Mean   : 5000                                         Mean   : 835.4  
##  3rd Qu.: 7500                                         3rd Qu.:1166.3  
##  Max.   :10000                                         Max.   :2654.3  
##      income     
##  Min.   :  772  
##  1st Qu.:21340  
##  Median :34553  
##  Mean   :33517  
##  3rd Qu.:43808  
##  Max.   :73554

The target variable is default, which indicates whether a customer failed to repay their credit. The predictor variables are balance, income, and student. The column X is only a row index and is not used in any model.

3 Data Preprocessing

This step ensures that the dataset is properly formatted and ready for modeling.

# Convert variables to appropriate types
data$default <- as.factor(data$default)
data$student <- as.factor(data$student)

# Check missing values
colSums(is.na(data))
##       X default student balance  income 
##       0       0       0       0       0

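No column contains missing values, so no imputation or row removal is needed. Converting default and student to factors lets glm treat default as a binary outcome and student as a categorical predictor.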
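The factor conversion also determines which outcome the models treat as the event. The following is a minimal sketch on made-up data (the vectors `y_toy` and `x_toy` are hypothetical and not taken from Default.csv). It illustrates that `as.factor()` orders levels alphabetically and that `glm()` with a binomial family models the probability of the second level, here "Yes".

```r
# Toy data, invented for illustration only
y_toy <- as.factor(c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"))
x_toy <- c(1, 2, 2, 3, 3, 4, 1.5, 5)

# as.factor() sorts levels alphabetically: "No" first, "Yes" second
lev <- levels(y_toy)
print(lev)

# With a factor response, glm() treats the first level as failure,
# so fitted values are estimates of P(y = "Yes")
fit_toy <- glm(y_toy ~ x_toy, family = "binomial")
p_toy <- fitted(fit_toy)

# Larger x goes with more "Yes" in this toy data, so the slope is positive
slope <- coef(fit_toy)[["x_toy"]]
print(slope)
```

Under this convention, the probabilities produced later by `predict(..., type="response")` can be read as estimated probabilities of default.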

4 Data Splitting

The dataset is divided into training and testing sets to evaluate model performance objectively.

library(caret)

set.seed(123)
trainIndex <- createDataPartition(data$default, p=0.7, list=FALSE)
train <- data[trainIndex,]
test <- data[-trainIndex,]

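Approximately 70% of the data (7,001 rows) is used for training, while the remaining 30% (2,999 rows) is used for testing. Because createDataPartition samples within each level of default, both sets keep a similar share of default cases.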
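To illustrate what a stratified split does, the sketch below reproduces the idea in base R on simulated labels. The vector `y_sim` and the 3% default rate are assumptions made for the example, and the code is not caret's implementation.

```r
set.seed(1)

# Simulated, imbalanced labels (about 3% "Yes"); not the real data
y_sim <- factor(sample(c("No", "Yes"), 10000, replace = TRUE,
                       prob = c(0.97, 0.03)))

# Sample 70% of the row indices separately within each class
idx_by_class <- split(seq_along(y_sim), y_sim)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = ceiling(0.7 * length(i)))
}))

# Compare the share of "Yes" in the two parts
rate_train <- mean(y_sim[train_idx] == "Yes")
rate_test  <- mean(y_sim[-train_idx] == "Yes")
print(c(train = rate_train, test = rate_test))
```

With a rare outcome, sampling within each class keeps the two rates nearly identical, which a purely random split does not guarantee.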

5 Model Development

Three logistic regression models are fitted on the training set.

5.1 Model 1 (Balance Only)

model1 <- glm(default ~ balance, data=train, family="binomial")
summary(model1)
## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.039e+01  4.191e-01  -24.78   <2e-16 ***
## balance      5.357e-03  2.587e-04   20.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1155.1  on 6999  degrees of freedom
## AIC: 1159.1
## 
## Number of Fisher Scoring iterations: 8

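This model uses only balance as the predictor variable. The coefficient is positive and highly significant, so higher balances are associated with higher odds of default.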
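The Model 1 coefficients are on the log-odds scale, which is hard to read directly. As a rough sketch, the reported estimates can be converted into an odds ratio and into predicted probabilities. The balances 1000 and 2000 are arbitrary illustration points, and the coefficients are typed in by hand from the summary above rather than taken from the fitted object.

```r
# Coefficients copied from the Model 1 summary above
b0 <- -10.39      # intercept
b1 <- 0.005357    # change in log-odds per one unit of balance

# Odds ratio for a 100-unit increase in balance
or_100 <- exp(100 * b1)

# Predicted default probability at two example balances
# (1000 and 2000 are arbitrary illustration points)
p_1000 <- plogis(b0 + b1 * 1000)
p_2000 <- plogis(b0 + b1 * 2000)

print(round(c(or_100 = or_100, p_1000 = p_1000, p_2000 = p_2000), 4))
```

The odds of default rise by roughly 70% for every additional 100 units of balance. The predicted probability stays below 1% at a balance of 1000 and exceeds 50% at 2000.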

5.2 Model 2 (Student Only)

model2 <- glm(default ~ student, data=train, family="binomial")
summary(model2)
## 
## Call:
## glm(formula = default ~ student, family = "binomial", data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.43336    0.08187 -41.935   <2e-16 ***
## studentYes   0.21648    0.14037   1.542    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 2048.3  on 6999  degrees of freedom
## AIC: 2052.3
## 
## Number of Fisher Scoring iterations: 6

This model evaluates the effect of student status on credit default. The estimated coefficient is positive but not statistically significant (p = 0.123).

5.3 Model 3 (Full Model)

model3 <- glm(default ~ balance + income + student, data=train, family="binomial")
summary(model3)
## 
## Call:
## glm(formula = default ~ balance + income + student, family = "binomial", 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.048e+01  5.782e-01 -18.122  < 2e-16 ***
## balance      5.638e-03  2.733e-04  20.628  < 2e-16 ***
## income      -5.519e-07  9.660e-06  -0.057  0.95444    
## studentYes  -8.937e-01  2.774e-01  -3.221  0.00128 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1128.8  on 6997  degrees of freedom
## AIC: 1136.8
## 
## Number of Fisher Scoring iterations: 8

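This model combines all available predictors. Balance and student are statistically significant, whereas income is not (p = 0.954). The student coefficient is negative here, in contrast to the positive but non-significant estimate in Model 2.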
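Because Model 1 is nested in Model 3, the two fits can also be compared with a likelihood-ratio test. The sketch below works only from the residual deviances printed in the summaries above; with the fitted objects available, `anova(model1, model3, test = "Chisq")` would be the usual route.

```r
# Residual deviances and degrees of freedom copied from the summaries above
dev_m1 <- 1155.1; df_m1 <- 6999   # Model 1: balance
dev_m3 <- 1128.8; df_m3 <- 6997   # Model 3: balance + income + student

# Likelihood-ratio statistic: drop in deviance from adding two predictors
lr_stat <- dev_m1 - dev_m3
lr_df   <- df_m1 - df_m3

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

A small p-value here means that Model 3 fits the training data better than Model 1. It says nothing by itself about whether the extra predictors improve classification on the test set.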

6 Model Prediction

Each model generates predicted probabilities for the default class.

prob1 <- predict(model1, test, type="response")
prob2 <- predict(model2, test, type="response")
prob3 <- predict(model3, test, type="response")

These probabilities are used to construct ROC curves.

7 ROC Curve Analysis

ROC curves are used to evaluate the classification performance of each model.

library(pROC)

roc1 <- roc(test$default, prob1)
roc2 <- roc(test$default, prob2)
roc3 <- roc(test$default, prob3)

plot(roc1, col="blue", main="ROC Curve Comparison")
lines(roc2, col="red")
lines(roc3, col="green")

legend("bottomright",
       legend=c("Model 1: balance",
                "Model 2: student",
                "Model 3: full model"),
       col=c("blue","red","green"),
       lwd=2)

The ROC curve illustrates the trade-off between the True Positive Rate and False Positive Rate.
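To make the trade-off concrete, the following sketch builds the points of a ROC curve by hand on a small invented example. The vectors `truth` and `score` are hypothetical; pROC does the equivalent work internally and places its thresholds between observed scores rather than on them.

```r
# Toy labels and predicted probabilities, invented for illustration
truth <- c("No", "No", "Yes", "No", "Yes", "No", "Yes", "No")
score <- c(0.05, 0.10, 0.80, 0.30, 0.60, 0.20, 0.25, 0.02)

# Try every observed score as a cutoff, plus one above all of them
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# For each cutoff, classify as "Yes" when score >= cutoff and
# record the true positive rate and false positive rate
roc_points <- t(sapply(cutoffs, function(cut) {
  pred_yes <- score >= cut
  c(cutoff = cut,
    tpr = sum(pred_yes & truth == "Yes") / sum(truth == "Yes"),
    fpr = sum(pred_yes & truth == "No")  / sum(truth == "No"))
}))
print(roc_points)
```

Lowering the cutoff moves along the curve from (0, 0) to (1, 1): more defaults are caught, but more non-defaults are flagged as well.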

8 AUC Evaluation

AUC provides a numerical measure of model performance.

auc(roc1)
## Area under the curve: 0.957
auc(roc2)
## Area under the curve: 0.5958
auc(roc3)
## Area under the curve: 0.9561

The AUC values further confirm the ROC findings. Model 1 achieves an AUC of 0.957, and Model 3 achieves 0.9561, indicating excellent classification performance. In contrast, Model 2 has an AUC of only 0.5958, which is only slightly better than random guessing.

This result highlights that balance is a highly informative predictor, while student status alone provides very limited predictive power.
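AUC can also be read as the probability that a randomly chosen default case receives a higher predicted probability than a randomly chosen non-default case, with ties counted as one half. A minimal sketch of that pairwise calculation, on invented `truth` and `score` vectors rather than the report's data:

```r
# Toy labels and predicted probabilities, invented for illustration
truth <- c("No", "No", "Yes", "No", "Yes", "No", "Yes", "No")
score <- c(0.05, 0.10, 0.80, 0.30, 0.60, 0.20, 0.25, 0.02)

pos <- score[truth == "Yes"]
neg <- score[truth == "No"]

# Difference in score for every (default, non-default) pair
cmp <- outer(pos, neg, FUN = "-")

# Share of pairs ranked correctly, counting ties as one half
auc_manual <- (sum(cmp > 0) + 0.5 * sum(cmp == 0)) / length(cmp)
print(auc_manual)
```

In this toy example 14 of the 15 pairs are ranked correctly. Under the same reading, an AUC close to 0.5 means that the scores rank the two classes hardly better than chance.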

9 Threshold Optimization

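For each model, the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1), the default criterion behind "best" in pROC's coords, is extracted from the ROC curve.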

coords(roc1, "best", ret="threshold")
coords(roc2, "best", ret="threshold")
coords(roc3, "best", ret="threshold")

The optimal thresholds for all models are around 0.03–0.04, far below the conventional threshold of 0.5. This reflects the strong class imbalance: only about 3.3% of the test cases are defaults (99 of 2,999), so the fitted probabilities are small for most customers. A lower threshold is therefore needed to improve the detection of default cases, at the cost of more false alarms.
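The selection rule behind a "best" threshold can be sketched by hand. The example below assumes Youden's J as the criterion and uses invented `truth` and `score` vectors; it evaluates cutoffs at the observed scores, whereas pROC uses midpoints between them, so the exact threshold values would differ slightly.

```r
# Toy labels and predicted probabilities, invented for illustration
truth <- c("No", "No", "Yes", "No", "Yes", "No", "Yes", "No")
score <- c(0.05, 0.10, 0.80, 0.30, 0.60, 0.20, 0.25, 0.02)

cutoffs <- sort(unique(score))

# Youden's J = sensitivity + specificity - 1, with "Yes" as the event
youden <- sapply(cutoffs, function(cut) {
  pred_yes <- score >= cut
  sens <- sum(pred_yes & truth == "Yes") / sum(truth == "Yes")
  spec <- sum(!pred_yes & truth == "No") / sum(truth == "No")
  sens + spec - 1
})

# Cutoff with the largest J
best_cut <- cutoffs[which.max(youden)]
print(data.frame(cutoff = cutoffs, J = round(youden, 3)))
print(best_cut)
```

Youden's J weights both error types equally. If missing a default is costlier than a false alarm, a different criterion or an explicit cost weighting may be more appropriate.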

10 Confusion Matrix

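Further evaluation is conducted using confusion matrices at the conventional 0.5 threshold. Note that caret reports "No" as the positive class, so Sensitivity below is the share of non-default cases classified correctly, and Specificity is the share of default cases detected.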

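# Classify at the 0.5 cutoff; fixing the factor levels keeps both classes
# in the table even when a model never predicts "Yes"
pred1 <- factor(ifelse(prob1 > 0.5, "Yes", "No"), levels = c("No", "Yes"))
pred2 <- factor(ifelse(prob2 > 0.5, "Yes", "No"), levels = c("No", "Yes"))
pred3 <- factor(ifelse(prob3 > 0.5, "Yes", "No"), levels = c("No", "Yes"))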

confusionMatrix(as.factor(pred1), test$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2885   61
##        Yes   15   38
##                                         
##                Accuracy : 0.9747        
##                  95% CI : (0.9684, 0.98)
##     No Information Rate : 0.967         
##     P-Value [Acc > NIR] : 0.008743      
##                                         
##                   Kappa : 0.4882        
##                                         
##  Mcnemar's Test P-Value : 2.445e-07     
##                                         
##             Sensitivity : 0.9948        
##             Specificity : 0.3838        
##          Pos Pred Value : 0.9793        
##          Neg Pred Value : 0.7170        
##              Prevalence : 0.9670        
##          Detection Rate : 0.9620        
##    Detection Prevalence : 0.9823        
##       Balanced Accuracy : 0.6893        
##                                         
##        'Positive' Class : No            
## 
confusionMatrix(as.factor(pred2), test$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2900   99
##        Yes    0    0
##                                         
##                Accuracy : 0.967         
##                  95% CI : (0.96, 0.9731)
##     No Information Rate : 0.967         
##     P-Value [Acc > NIR] : 0.5267        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 1.000         
##             Specificity : 0.000         
##          Pos Pred Value : 0.967         
##          Neg Pred Value :   NaN         
##              Prevalence : 0.967         
##          Detection Rate : 0.967         
##    Detection Prevalence : 1.000         
##       Balanced Accuracy : 0.500         
##                                         
##        'Positive' Class : No            
## 
confusionMatrix(as.factor(pred3), test$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2884   63
##        Yes   16   36
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.9673, 0.9791)
##     No Information Rate : 0.967           
##     P-Value [Acc > NIR] : 0.0204          
##                                           
##                   Kappa : 0.4646          
##                                           
##  Mcnemar's Test P-Value : 2.274e-07       
##                                           
##             Sensitivity : 0.9945          
##             Specificity : 0.3636          
##          Pos Pred Value : 0.9786          
##          Neg Pred Value : 0.6923          
##              Prevalence : 0.9670          
##          Detection Rate : 0.9617          
##    Detection Prevalence : 0.9827          
##       Balanced Accuracy : 0.6791          
##                                           
##        'Positive' Class : No              
## 

10.1 General Insight

The confusion matrix results show that accuracy alone can be misleading under class imbalance. All models achieve high accuracy (about 97%), yet predicting "No" for every customer already gives 96.7% (the no-information rate), and the models' ability to identify default cases differs substantially.
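The default-class metrics can be recomputed directly from the counts in the three confusion matrices. The sketch below copies those counts by hand; the column names `caught`, `missed`, and `false_alarms` are labels chosen for this example.

```r
# Counts copied from the three confusion matrices above
cm <- data.frame(
  model        = c("Model 1", "Model 2", "Model 3"),
  caught       = c(38, 0, 36),   # predicted Yes, actually Yes
  missed       = c(61, 99, 63),  # predicted No, actually Yes
  false_alarms = c(15, 0, 16)    # predicted Yes, actually No
)

# Share of actual defaults that each model flags at the 0.5 cutoff
cm$recall_default <- cm$caught / (cm$caught + cm$missed)

# Share of flagged customers that really default
# (NaN when a model flags nobody)
cm$precision_default <- cm$caught / (cm$caught + cm$false_alarms)

print(cm)
```

These values match the Specificity and Neg Pred Value rows in the caret output, because caret treats "No" as the positive class there.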

10.2 Model 1 (balance)

Model 1 has the highest accuracy (97.47%) and the highest balanced accuracy (0.6893) of the three models, but its performance on the two classes is far from balanced. Because "No" is the positive class, the sensitivity of 0.9948 means that almost all non-default cases are classified correctly, while the specificity of 0.3838 means that only 38 of the 99 default cases are detected at the 0.5 threshold.

10.3 Model 2 (student)

Model 2 performs poorly despite its high accuracy (96.7%), which merely equals the no-information rate. It predicts every observation as non-default, so the specificity, which here measures the share of default cases detected, is zero and kappa is 0. The model fails to distinguish between the classes at this threshold and is not useful for prediction.
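Why Model 2 never predicts a default follows from its coefficients. With a single binary predictor, the model can only produce two distinct fitted probabilities, as this sketch based on the reported Model 2 estimates suggests:

```r
# Coefficients copied from the Model 2 summary
b0        <- -3.43336   # intercept (non-students)
b_student <-  0.21648   # shift in log-odds for students

# The only two probabilities the model can output
p_nonstudent <- plogis(b0)
p_student    <- plogis(b0 + b_student)
print(round(c(non_student = p_nonstudent, student = p_student), 4))

# Neither value exceeds 0.5, so every customer is classified "No"
any_above_half <- any(c(p_nonstudent, p_student) > 0.5)
print(any_above_half)
```

Both probabilities are between 3% and 4%, far below the 0.5 cutoff. Having only two score values also means the ROC curve for Model 2 has a single interior point, which is consistent with its low AUC.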

10.4 Model 3 (full model)

Model 3 achieves performance comparable to Model 1, with slightly lower accuracy (97.37%) and 36 of the 99 default cases detected. Although student is statistically significant in the fitted model, the extra variables do not improve test-set classification over Model 1, suggesting that income and student add little predictive power beyond balance.

11 Conclusion

Based on the overall evaluation using ROC curves, AUC, and confusion matrices, Model 1 (balance only) and Model 3 (full model) demonstrate superior performance compared to Model 2.

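Interestingly, Model 3 does not outperform Model 1 on the test set (AUC 0.9561 versus 0.957), indicating that balance alone is already a very strong predictor of credit default. Meanwhile, the student variable used on its own gives poor classification performance. Its coefficient is positive and not significant in Model 2 but negative and significant in Model 3, so conclusions drawn from Model 2 alone would be misleading.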

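These findings emphasize the importance of selecting relevant predictors and highlight that adding more variables does not always improve model performance. In this case, a simpler model using balance alone performs as well on the test set as the more complex model. Even so, at the 0.5 threshold both models detect fewer than 40% of default cases, so the choice of threshold matters as much as the choice of predictors.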