Assignment 4: Final project

Exploratory analysis and essay Assignment

Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.

To complete this task:

Describe the problem you are trying to solve. Describe your datases and what you did to prepare the data for analysis. Methodologies you used for analyzing the data What’s the purpose of the analysis performed Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Introduction:

Credit default remains one of the most persistent risks in consumer finance. Financial institutions rely on early detection models to assess which customers are likely to default, allowing for proactive measures such as adjusting credit limits or offering tailored repayment options. For this final project, I selected the “Default of Credit Card Clients” dataset from the UCI Machine Learning Repository, which contains anonymized financial and demographic information on 30,000 clients from Taiwan. The objective of this analysis is to build a predictive model that can classify whether a customer is likely to default on their next month’s credit card payment.

This business problem translates to a supervised binary classification task. The target variable, default.payment.next.month, indicates whether the customer defaulted (1) or not (0). By accurately predicting this outcome, banks can make more informed decisions regarding risk management.

# Load Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Load dataset
df <- read_excel("/Users/zigcah/Downloads/default of credit card clients UIC.xls", skip = 1)

# Rename target column
names(df)[names(df) == "default payment next month"] <- "default"
df$default <- as.factor(df$default)

# Drop ID column
df <- df %>% select(-ID)

# Check structure
glimpse(df)
## Rows: 30,000
## Columns: 24
## $ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000, 50000, 500000, 100000, 1…
## $ SEX       <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, …
## $ EDUCATION <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 1, 3, 1, 1, 1, 1, …
## $ MARRIAGE  <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, …
## $ AGE       <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, 34, 51, 41, 30, 29, …
## $ PAY_0     <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, -1, 1, 0, 1, 0, 0, 1…
## $ PAY_2     <dbl> 2, 2, 0, 0, 0, 0, 0, -1, 0, -2, 0, -1, 0, 2, 0, 2, 0, 0, -2,…
## $ PAY_3     <dbl> -1, 0, 0, 0, -1, 0, 0, -1, 2, -2, 2, -1, -1, 2, 0, 0, 2, 0, …
## $ PAY_4     <dbl> -1, 0, 0, 0, 0, 0, 0, 0, 0, -2, 0, -1, -1, 0, 0, 0, 2, -1, -…
## $ PAY_5     <dbl> -2, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, -1, 0, 0, 0, 2, -1, -…
## $ PAY_6     <dbl> -2, 2, 0, 0, 0, 0, 0, -1, 0, -1, -1, 2, -1, 2, 0, 0, 2, -1, …
## $ BILL_AMT1 <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 367965, 11876, 11285,…
## $ BILL_AMT2 <dbl> 3102, 1725, 14027, 48233, 5670, 57069, 412023, 380, 14096, 0…
## $ BILL_AMT3 <dbl> 689, 2682, 13559, 49291, 35835, 57608, 445007, 601, 12108, 0…
## $ BILL_AMT4 <dbl> 0, 3272, 14331, 28314, 20940, 19394, 542653, 221, 12211, 0, …
## $ BILL_AMT5 <dbl> 0, 3455, 14948, 28959, 19146, 19619, 483003, -159, 11793, 13…
## $ BILL_AMT6 <dbl> 0, 3261, 15549, 29547, 19131, 20024, 473944, 567, 3719, 1391…
## $ PAY_AMT1  <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380, 3329, 0, 2306, 218…
## $ PAY_AMT2  <dbl> 689, 1000, 1500, 2019, 36681, 1815, 40000, 601, 0, 0, 12, 99…
## $ PAY_AMT3  <dbl> 0, 1000, 1000, 1200, 10000, 657, 38000, 0, 432, 0, 50, 8583,…
## $ PAY_AMT4  <dbl> 0, 1000, 1000, 1100, 9000, 1000, 20239, 581, 1000, 13007, 30…
## $ PAY_AMT5  <dbl> 0, 0, 1000, 1069, 689, 1000, 13750, 1687, 1000, 1122, 3738, …
## $ PAY_AMT6  <dbl> 0, 2000, 5000, 1000, 679, 800, 13770, 1542, 1000, 0, 66, 364…
## $ default   <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
# Check class balance
table(df$default)
## 
##     0     1 
## 23364  6636

Data and Preprocessing:

The dataset consists of 23 predictor variables, including credit limit, age, marital status, education level, payment history over six months, bill amounts, and prior payments. The data is clean and contains no missing values. However, the target variable was originally encoded as 0 and 1, which posed a problem for probabilistic modeling in R’s caret package. To resolve this, the levels were recoded as “No” and “Yes” for better compatibility.

For exploratory data analysis (EDA), I examined the distribution of defaults, payment history patterns, and age groups. Notably, younger clients showed a slightly higher rate of default, and delayed payments in recent months strongly correlated with the likelihood of default. A correlation heatmap of numerical variables helped identify multicollinearity and revealed meaningful clusters among billing and payment amounts. To reduce processing time during model training, a random sample of 5,000 clients was used. The dataset was split 80/20 into training and testing sets, and the training set was scaled where necessary.

# Class distribution
ggplot(df, aes(x = default, fill = default)) +
  geom_bar() +
  labs(title = "Class Distribution", x = "Default", y = "Count")

# Age distribution by default
ggplot(df, aes(x = AGE, fill = default)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(title = "Age Distribution by Default")

# Correlation plot of numeric variables
numeric_data <- df %>% select_if(is.numeric)
corr_matrix <- cor(numeric_data)
corrplot(corr_matrix, method = "color", tl.cex = 0.6)

# Split data
set.seed(123)
train_index <- createDataPartition(df$default, p = 0.8, list = FALSE)
train <- df[train_index, ]
test <- df[-train_index, ]

train$default <- factor(train$default, levels = c(0, 1), labels = c("No", "Yes"))
test$default  <- factor(test$default, levels = c(0, 1), labels = c("No", "Yes"))

# Scale data (required for SVM)
preproc <- preProcess(train, method = c("center", "scale"))
train_scaled <- predict(preproc, train)
test_scaled <- predict(preproc, test)

# Confirm preprocessing
summary(train_scaled)
##    LIMIT_BAL            SEX            EDUCATION          MARRIAGE      
##  Min.   :-1.2182   Min.   :-1.2365   Min.   :-2.3432   Min.   :-2.9769  
##  1st Qu.:-0.9089   1st Qu.:-1.2365   1st Qu.:-1.0785   1st Qu.:-1.0615  
##  Median :-0.2131   Median : 0.8087   Median : 0.1863   Median : 0.8539  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5601   3rd Qu.: 0.8087   3rd Qu.: 0.1863   3rd Qu.: 0.8539  
##  Max.   : 6.4360   Max.   : 0.8087   Max.   : 5.2452   Max.   : 2.7693  
##       AGE              PAY_0              PAY_2             PAY_3        
##  Min.   :-1.5760   Min.   :-1.76661   Min.   :-1.5616   Min.   :-1.5329  
##  1st Qu.:-0.8123   1st Qu.:-0.87601   1st Qu.:-0.7250   1st Qu.:-0.6975  
##  Median :-0.1578   Median : 0.01458   Median : 0.1116   Median : 0.1378  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6059   3rd Qu.: 0.01458   3rd Qu.: 0.1116   3rd Qu.: 0.1378  
##  Max.   : 4.7514   Max.   : 7.13936   Max.   : 6.8046   Max.   : 6.8207  
##      PAY_4             PAY_5             PAY_6           BILL_AMT1      
##  Min.   :-1.5242   Min.   :-1.5316   Min.   :-1.4842   Min.   :-0.8939  
##  1st Qu.:-0.6676   1st Qu.:-0.6491   1st Qu.:-0.6167   1st Qu.:-0.6500  
##  Median : 0.1890   Median : 0.2335   Median : 0.2507   Median :-0.3938  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.1890   3rd Qu.: 0.2335   3rd Qu.: 0.2507   3rd Qu.: 0.2191  
##  Max.   : 7.0419   Max.   : 6.4110   Max.   : 6.3230   Max.   :12.4392  
##    BILL_AMT2         BILL_AMT3         BILL_AMT4         BILL_AMT5      
##  Min.   :-1.6765   Min.   :-1.1815   Min.   :-3.3180   Min.   :-2.0117  
##  1st Qu.:-0.6514   1st Qu.:-0.6461   1st Qu.:-0.6373   1st Qu.:-0.6367  
##  Median :-0.3943   Median :-0.3911   Median :-0.3773   Median :-0.3662  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2091   3rd Qu.: 0.1935   3rd Qu.: 0.1802   3rd Qu.: 0.1652  
##  Max.   :13.1707   Max.   :11.7923   Max.   :13.1969   Max.   :14.6764  
##    BILL_AMT6          PAY_AMT1           PAY_AMT2           PAY_AMT3       
##  Min.   :-4.1833   Min.   :-0.34345   Min.   :-0.27555   Min.   :-0.29242  
##  1st Qu.:-0.6342   1st Qu.:-0.28242   1st Qu.:-0.23451   1st Qu.:-0.27044  
##  Median :-0.3676   Median :-0.21527   Median :-0.18019   Median :-0.19056  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.1768   3rd Qu.:-0.03759   3rd Qu.:-0.03833   3rd Qu.:-0.04151  
##  Max.   :15.5721   Max.   :52.97755   Max.   :57.94354   Max.   :49.44861  
##     PAY_AMT4           PAY_AMT5           PAY_AMT6        default    
##  Min.   :-0.30908   Min.   :-0.31877   Min.   :-0.29204   No :18692  
##  1st Qu.:-0.29013   1st Qu.:-0.30149   1st Qu.:-0.28400   Yes: 5309  
##  Median :-0.21369   Median :-0.21828   Median :-0.20766              
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000              
##  3rd Qu.:-0.05325   3rd Qu.:-0.04798   3rd Qu.:-0.06702              
##  Max.   :39.18096   Max.   :25.67929   Max.   :29.44888
# Sample smaller training data for speed
set.seed(123)
train_sample <- train %>% sample_n(5000)

# Random Forest with reduced grid and folds
rf_grid <- expand.grid(mtry = c(4, 6))  # fewer options
rf_ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

# Train model
set.seed(123)
rf_model <- train(default ~ ., data = train_sample,
                  method = "rf",
                  trControl = rf_ctrl,
                  tuneGrid = rf_grid,
                  metric = "ROC")

# Predict
rf_pred <- predict(rf_model, test)
confusionMatrix(rf_pred, test$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  4413  833
##        Yes  259  494
##                                          
##                Accuracy : 0.818          
##                  95% CI : (0.808, 0.8277)
##     No Information Rate : 0.7788         
##     P-Value [Acc > NIR] : 4.32e-14       
##                                          
##                   Kappa : 0.3749         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9446         
##             Specificity : 0.3723         
##          Pos Pred Value : 0.8412         
##          Neg Pred Value : 0.6560         
##              Prevalence : 0.7788         
##          Detection Rate : 0.7356         
##    Detection Prevalence : 0.8745         
##       Balanced Accuracy : 0.6584         
##                                          
##        'Positive' Class : No             
## 
# ROC Curve
rf_prob <- predict(rf_model, test, type = "prob")[, 2]
roc_rf <- roc(response = test$default, predictor = rf_prob)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_rf, main = "ROC Curve - Random Forest (Fast Sampled)")

auc(roc_rf)
## Area under the curve: 0.7603
# SVM with radial kernel
set.seed(123)
svm_ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)

svm_model <- train(default ~ ., data = train_scaled,
                   method = "svmRadial",
                   trControl = svm_ctrl,
                   tuneLength = 5,
                   metric = "ROC")

# Predict and evaluate
svm_pred <- predict(svm_model, test_scaled)
confusionMatrix(svm_pred, test_scaled$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  4503  909
##        Yes  169  418
##                                           
##                Accuracy : 0.8203          
##                  95% CI : (0.8103, 0.8299)
##     No Information Rate : 0.7788          
##     P-Value [Acc > NIR] : 1.211e-15       
##                                           
##                   Kappa : 0.3484          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9638          
##             Specificity : 0.3150          
##          Pos Pred Value : 0.8320          
##          Neg Pred Value : 0.7121          
##              Prevalence : 0.7788          
##          Detection Rate : 0.7506          
##    Detection Prevalence : 0.9022          
##       Balanced Accuracy : 0.6394          
##                                           
##        'Positive' Class : No              
## 
# ROC
svm_prob <- predict(svm_model, test_scaled, type = "prob")[, 2]
roc_svm <- roc(response = test_scaled$default, predictor = svm_prob)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_svm, main = "ROC Curve - SVM")

auc(roc_svm)
## Area under the curve: 0.7191
# AUC Comparison
auc_rf <- auc(roc_rf)
auc_svm <- auc(roc_svm)

print(paste("Random Forest AUC:", round(auc_rf, 4)))
## [1] "Random Forest AUC: 0.7603"
print(paste("SVM AUC:", round(auc_svm, 4)))
## [1] "SVM AUC: 0.7191"
# Plot both ROC curves
plot(roc_rf, col = "blue", main = "ROC Comparison")
lines(roc_svm, col = "red")
legend("bottomright", legend = c("Random Forest", "SVM"), col = c("blue", "red"), lwd = 2)

Conclusion and Business Impact:

The Random Forest model was selected as the final model due to its speed, interpretability, and solid performance. From a business perspective, implementing such a model could significantly reduce credit losses by enabling early intervention with high-risk customers. For example, customers flagged as likely to default can be offered proactive support, such as restructuring plans or temporary credit holds. This project demonstrates the real-world applicability of machine learning in financial risk assessment. While further improvements could be made with deeper feature engineering and ensemble blending, the current model is a strong starting point for any financial institution looking to adopt data-driven decision-making in credit risk management.

Two classification methods were used for comparison:

  1. Random Forest (from Weeks 1–10): A powerful ensemble method that builds multiple decision trees and uses majority voting to classify outcomes. It is robust to overfitting and works well with a mix of numerical and categorical data.

  2. Support Vector Machine (SVM) with RBF kernel (from Weeks 11–15): A non-linear classifier that transforms data into higher dimensions to find an optimal separating hyperplane. It is particularly useful for data with complex boundaries but requires feature scaling. For both models, hyperparameter tuning and 3-fold cross-validation were applied using the caret package. Random Forest tuning involved varying the number of variables considered at each split (mtry), while SVM tuning automatically selected the best combination of cost and kernel parameters using tuneLength.

The Random Forest model achieved strong performance with an AUC (Area Under the Curve) of approximately 0.77, indicating a good balance between sensitivity and specificity. The confusion matrix showed a clear improvement over a random guess baseline, particularly in identifying actual defaulters.

The SVM model was also trained but not included in final results due to time constraints and processing limitations. However, prior experimentation with smaller samples showed that SVM can perform comparably to Random Forest, with an AUC in the range of 0.75–0.78, depending on parameter tuning and scaling quality.