Outline

1. Introduction
2. Problem Statement
3. Motivation
4. Data Preprocessing/Handling
5. Exploratory Analysis
6. Modeling Details and Key Insights
7. Final Takeaways and Conclusions
8. Flow of Presentation

1. Introduction

The dataset used in this project is sourced from Kaggle. It contains information on default payments, demographic factors, credit limits, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Dataset Overview

Number of observations: 30,000
Number of variables: 25

Variables include:

LIMIT_BAL: Amount of given credit (NT dollars)
SEX: Gender (1 = male; 2 = female)
EDUCATION: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others; 5, 6 = unknown)
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others)
AGE: Age in years
PAY_0 to PAY_6: Past monthly repayment status (PAY_0 = September 2005, ..., PAY_6 = April 2005)
BILL_AMT1 to BILL_AMT6: Monthly bill statement amounts
PAY_AMT1 to PAY_AMT6: Amounts of previous monthly payments
default.payment.next.month: Default payment next month (1 = yes; 0 = no)

2. Problem Statement

Banks worldwide face the critical challenge of accurately predicting which credit card customers will default on upcoming payments. With rising credit card usage globally, traditional static credit scoring methods are no longer sufficient. Financial institutions need data-driven approaches that leverage machine learning to identify high-risk customers before defaults occur, allowing for timely intervention while maintaining competitive credit offerings. This project aims to develop an accurate, interpretable predictive model using a comprehensive Taiwanese credit card dataset to enhance default risk assessment capabilities that can be applied across various banking environments.

3. Motivation

As an aspiring financial analyst with a keen interest in banking operations, I’m driven to understand how data science can transform credit risk management. This Taiwanese dataset offers me practical experience with real-world financial data analysis that’s directly applicable to my career goals. By developing predictive models for credit default, I’ll gain valuable skills in risk assessment that are highly sought after in the banking industry. The insights from this analysis will deepen my understanding of how payment behaviors and demographic factors influence default probability—knowledge I can leverage in future financial analysis roles. This project allows me to demonstrate my ability to translate raw financial data into actionable business insights, positioning me as a data-driven financial professional who can help banks make more informed lending decisions and improve their bottom line.

#Load packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forcats)
library(ROCR)
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.3
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ROSE)
## Loaded ROSE 0.0-4
library(e1071)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.4.3
## 
##  randomForestSRC 3.3.3 
##  
##  Type rfsrc.news() to see new features, changes, and bug fixes. 
##  
## 
## 
## Attaching package: 'randomForestSRC'
## 
## The following objects are masked from 'package:e1071':
## 
##     impute, tune
## 
## The following object is masked from 'package:purrr':
## 
##     partial

4. Data Preprocessing/Handling

#Reading in the dataset
default <- read.csv("C:/Users/doris/OneDrive/Documents/UCI_Credit_Card.csv/UCI_Credit_Card.csv")

# Copy the dataset
d1 <- default

# Remove ID since it's just a row identifier
d1 <- d1[, -1]

# verify the change
dim(default)
## [1] 30000    25
dim(d1)
## [1] 30000    24
str(d1)
## 'data.frame':    30000 obs. of  24 variables:
##  $ LIMIT_BAL                 : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : num  3913 2682 29239 46990 8617 ...
##  $ BILL_AMT2                 : num  3102 1725 14027 48233 5670 ...
##  $ BILL_AMT3                 : num  689 2682 13559 49291 35835 ...
##  $ BILL_AMT4                 : num  0 3272 14331 28314 20940 ...
##  $ BILL_AMT5                 : num  0 3455 14948 28959 19146 ...
##  $ BILL_AMT6                 : num  0 3261 15549 29547 19131 ...
##  $ PAY_AMT1                  : num  0 0 1518 2000 2000 ...
##  $ PAY_AMT2                  : num  689 1000 1500 2019 36681 ...
##  $ PAY_AMT3                  : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : num  0 1000 1000 1100 9000 ...
##  $ PAY_AMT5                  : num  0 0 1000 1069 689 ...
##  $ PAY_AMT6                  : num  0 2000 5000 1000 679 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

Data Cleaning

#Check missing values
colSums(is.na(d1))
##                  LIMIT_BAL                        SEX 
##                          0                          0 
##                  EDUCATION                   MARRIAGE 
##                          0                          0 
##                        AGE                      PAY_0 
##                          0                          0 
##                      PAY_2                      PAY_3 
##                          0                          0 
##                      PAY_4                      PAY_5 
##                          0                          0 
##                      PAY_6                  BILL_AMT1 
##                          0                          0 
##                  BILL_AMT2                  BILL_AMT3 
##                          0                          0 
##                  BILL_AMT4                  BILL_AMT5 
##                          0                          0 
##                  BILL_AMT6                   PAY_AMT1 
##                          0                          0 
##                   PAY_AMT2                   PAY_AMT3 
##                          0                          0 
##                   PAY_AMT4                   PAY_AMT5 
##                          0                          0 
##                   PAY_AMT6 default.payment.next.month 
##                          0                          0
# Check for duplicate rows
duplicates <- d1[duplicated(d1), ]
cat("Number of duplicate rows: ", nrow(duplicates), "\n")
## Number of duplicate rows:  35
# If duplicates are found remove duplicate rows
d1 <- d1[!duplicated(d1), ]
cat("Number of rows after removing duplicates: ", nrow(d1), "\n")
## Number of rows after removing duplicates:  29965
#Cleaning the column names
d1 = clean_names(d1)
colnames(d1)
##  [1] "limit_bal"                  "sex"                       
##  [3] "education"                  "marriage"                  
##  [5] "age"                        "pay_0"                     
##  [7] "pay_2"                      "pay_3"                     
##  [9] "pay_4"                      "pay_5"                     
## [11] "pay_6"                      "bill_amt1"                 
## [13] "bill_amt2"                  "bill_amt3"                 
## [15] "bill_amt4"                  "bill_amt5"                 
## [17] "bill_amt6"                  "pay_amt1"                  
## [19] "pay_amt2"                   "pay_amt3"                  
## [21] "pay_amt4"                   "pay_amt5"                  
## [23] "pay_amt6"                   "default_payment_next_month"
summary(d1)
##    limit_bal            sex          education        marriage    
##  Min.   :  10000   Min.   :1.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 140000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   : 167442   Mean   :1.604   Mean   :1.854   Mean   :1.552  
##  3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :1000000   Max.   :2.000   Max.   :6.000   Max.   :3.000  
##       age            pay_0              pay_2             pay_3        
##  Min.   :21.00   Min.   :-2.00000   Min.   :-2.0000   Min.   :-2.0000  
##  1st Qu.:28.00   1st Qu.:-1.00000   1st Qu.:-1.0000   1st Qu.:-1.0000  
##  Median :34.00   Median : 0.00000   Median : 0.0000   Median : 0.0000  
##  Mean   :35.49   Mean   :-0.01675   Mean   :-0.1319   Mean   :-0.1644  
##  3rd Qu.:41.00   3rd Qu.: 0.00000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :79.00   Max.   : 8.00000   Max.   : 8.0000   Max.   : 8.0000  
##      pay_4             pay_5             pay_6           bill_amt1      
##  Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-165580  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:   3595  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median :  22438  
##  Mean   :-0.2189   Mean   :-0.2645   Mean   :-0.2894   Mean   :  51283  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:  67260  
##  Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 964511  
##    bill_amt2        bill_amt3         bill_amt4         bill_amt5     
##  Min.   :-69777   Min.   :-157264   Min.   :-170000   Min.   :-81334  
##  1st Qu.:  3010   1st Qu.:   2711   1st Qu.:   2360   1st Qu.:  1787  
##  Median : 21295   Median :  20135   Median :  19081   Median : 18130  
##  Mean   : 49236   Mean   :  47068   Mean   :  43313   Mean   : 40358  
##  3rd Qu.: 64109   3rd Qu.:  60201   3rd Qu.:  54601   3rd Qu.: 50247  
##  Max.   :983931   Max.   :1664089   Max.   : 891586   Max.   :927171  
##    bill_amt6          pay_amt1         pay_amt2          pay_amt3     
##  Min.   :-339603   Min.   :     0   Min.   :      0   Min.   :     0  
##  1st Qu.:   1262   1st Qu.:  1000   1st Qu.:    850   1st Qu.:   390  
##  Median :  17124   Median :  2102   Median :   2010   Median :  1804  
##  Mean   :  38917   Mean   :  5670   Mean   :   5928   Mean   :  5232  
##  3rd Qu.:  49252   3rd Qu.:  5008   3rd Qu.:   5000   3rd Qu.:  4512  
##  Max.   : 961664   Max.   :873552   Max.   :1684259   Max.   :896040  
##     pay_amt4         pay_amt5         pay_amt6      default_payment_next_month
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :0.0000            
##  1st Qu.:   300   1st Qu.:   261   1st Qu.:   131   1st Qu.:0.0000            
##  Median :  1500   Median :  1500   Median :  1500   Median :0.0000            
##  Mean   :  4832   Mean   :  4805   Mean   :  5222   Mean   :0.2213            
##  3rd Qu.:  4016   3rd Qu.:  4042   3rd Qu.:  4000   3rd Qu.:0.0000            
##  Max.   :621000   Max.   :426529   Max.   :528666   Max.   :1.0000

Data Preprocessing

# First check the levels, do they match the references
levels(as.factor(d1$sex))
## [1] "1" "2"
levels(as.factor(d1$education))
## [1] "0" "1" "2" "3" "4" "5" "6"
levels(as.factor(d1$marriage))
## [1] "0" "1" "2" "3"
levels(as.factor(d1$default_payment_next_month))
## [1] "0" "1"

Based on the dataset documentation, MARRIAGE should take the values 1 = married, 2 = single, 3 = others, but the data contains an additional, undocumented level '0'. EDUCATION likewise has several unknown levels (0, 5, 6) alongside a single 'others' category (4). We recode all of these into one 'others'/unknown category to simplify the analysis.

# View the frequency count of each level of marriage and education
d1 %>% count(education)
d1 %>% count(marriage)
# Restructuring the categories to a simpler form and converting the 
# appropriate variables into factors

library(dplyr)

# Rename and recode relevant variables
d1 <- d1 %>%
  # Rename pay_ variables to corresponding months
  rename(
    pay_september = pay_0,
    pay_august    = pay_2,
    pay_july      = pay_3,
    pay_june      = pay_4,
    pay_may       = pay_5,
    pay_april     = pay_6
  ) %>%
  mutate(
    # Recode marriage
    marriage = case_when(
      marriage %in% c(1, 2, 3) ~ marriage,
      marriage == 0 ~ 3  # unknown → others
    ),
    
    # Recode education
    education = case_when(
      education %in% c(1, 2, 3) ~ education,
      education %in% c(0, 4, 5, 6) ~ 4  # unknowns → others
    ),
    
    # Convert categorical vars to factors
    marriage = factor(marriage,
                      levels = c(1, 2, 3),
                      labels = c("married", "single", "others")),
    
    education = factor(education,
                       levels = c(1, 2, 3, 4),
                       labels = c("graduate school", "university", "high school", "others")),
    
    sex = factor(sex,
                 levels = c(1, 2),
                 labels = c("male", "female")),
    
    default_payment_next_month = factor(default_payment_next_month,
                                        levels = c(0, 1),
                                        labels = c("no", "yes"))
)
# Check total NAs in the dataset
#colSums(is.na(d1))
library(dplyr)

# Convert pay_* variables to numeric
d1 <- d1 %>%
  mutate(
    age = as.numeric(age),  # Convert age to numeric
    across(
      .cols = starts_with("pay_"),  
      .fns = ~ as.numeric(.),       
      .names = "{col}"              
    )
  )

str(d1)
## 'data.frame':    29965 obs. of  24 variables:
##  $ limit_bal                 : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ sex                       : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 1 2 2 1 ...
##  $ education                 : Factor w/ 4 levels "graduate school",..: 2 2 2 2 2 1 1 2 3 3 ...
##  $ marriage                  : Factor w/ 3 levels "married","single",..: 1 2 2 1 1 2 2 2 1 2 ...
##  $ age                       : num  24 26 34 37 57 37 29 23 28 35 ...
##  $ pay_september             : num  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ pay_august                : num  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ pay_july                  : num  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ pay_june                  : num  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ pay_may                   : num  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ pay_april                 : num  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ bill_amt1                 : num  3913 2682 29239 46990 8617 ...
##  $ bill_amt2                 : num  3102 1725 14027 48233 5670 ...
##  $ bill_amt3                 : num  689 2682 13559 49291 35835 ...
##  $ bill_amt4                 : num  0 3272 14331 28314 20940 ...
##  $ bill_amt5                 : num  0 3455 14948 28959 19146 ...
##  $ bill_amt6                 : num  0 3261 15549 29547 19131 ...
##  $ pay_amt1                  : num  0 0 1518 2000 2000 ...
##  $ pay_amt2                  : num  689 1000 1500 2019 36681 ...
##  $ pay_amt3                  : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ pay_amt4                  : num  0 1000 1000 1100 9000 ...
##  $ pay_amt5                  : num  0 0 1000 1069 689 ...
##  $ pay_amt6                  : num  0 2000 5000 1000 679 ...
##  $ default_payment_next_month: Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...

5. Exploratory Analysis

I’ll begin by visually inspecting the relationship between default status and each demographic variable (education, marital status, sex, and age), and then perform a chi-square test for each to assess its significance.
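Since the same group-by/summarise/chi-square pattern is repeated for each demographic variable below, here is a minimal sketch of that pattern as a helper function. The name default_rate_by is my own, hypothetical choice and is not used elsewhere in this report:

# Sketch: default rate and chi-square p-value for any categorical predictor
# (assumes the cleaned data frame d1 created above)
default_rate_by <- function(data, var) {
  rates <- data %>%
    group_by(.data[[var]]) %>%
    summarise(default_rate = mean(default_payment_next_month == "yes") * 100,
              n = n(), .groups = "drop")
  p_value <- chisq.test(table(data[[var]], data$default_payment_next_month))$p.value
  list(rates = rates, p_value = p_value)
}

# Example usage:
# default_rate_by(d1, "sex")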

# First, create a bar plot of the response variable with percentages
# Visualize the dependent variable distribution

# Create a table with counts and percentages
class_dist <- d1 %>%
  tabyl(default_payment_next_month) %>%
  mutate(percent = round(percent * 100, 1))  # Make percentage nicely rounded

# Plot
ggplot(class_dist, aes(x = default_payment_next_month, y = percent, fill = default_payment_next_month)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.5, size = 5) +   # Add percentages on top
  scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.05))) +  # Y-axis from 0 to 100
  scale_fill_manual(values = c("no" = "red", "yes" = "blue")) +
  labs(
    title = "Non-Default vs Default",
    x = "Default Status",
    y = "Percentage"
  ) +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(size = 14))

1) Which demographic groups have the highest default rates?

Default rates by demographic groups

# Default rate by Sex
sex_default <- d1 %>%
  group_by(sex, default_payment_next_month) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(sex) %>%
  mutate(prop = count / sum(count),
         percent = prop * 100)

sex_plot <- ggplot(sex_default %>% filter(default_payment_next_month == "yes"), 
                  aes(x = sex, y = percent, fill = sex)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.1, size = 4) +
  scale_fill_manual(values = c("male" = "#4393c3", "female" = "#d6604d")) +
  labs(title = "Default Rate by Gender",
       x = "Gender",
       y = "Default Rate (%)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none")

print(sex_plot)

# Chi-square test for sex and default
sex_table <- table(d1$sex, d1$default_payment_next_month)
sex_chi <- chisq.test(sex_table)
print(sex_chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  sex_table
## X-squared = 47.131, df = 1, p-value = 6.639e-12
if(sex_chi$p.value < 0.05) {
  cat("Gender: Significant differences in default rates (p =", sex_chi$p.value, ")\n")
} else {
  cat("Gender: No significant differences in default rates (p =", sex_chi$p.value, ")\n")
}
## Gender: Significant differences in default rates (p = 6.639289e-12 )
# Default rate by Education
edu_default <- d1 %>%
  group_by(education, default_payment_next_month) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(education) %>%
  mutate(prop = count / sum(count),
         percent = prop * 100)

edu_plot <- ggplot(edu_default %>% filter(default_payment_next_month == "yes"), 
                  aes(x = reorder(education, -percent), y = percent, fill = education)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.005, size = 4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Default Rate by Education Level",
       x = "Education Level",
       y = "Default Rate (%)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

print(edu_plot)

# Chi-square test for education and default
edu_table <- table(d1$education, d1$default_payment_next_month)
edu_chi <- chisq.test(edu_table)
print(edu_chi)
## 
##  Pearson's Chi-squared test
## 
## data:  edu_table
## X-squared = 160.46, df = 3, p-value < 2.2e-16
# Interpretation with full p-value
if(edu_chi$p.value < 0.05) {
  cat("Education: Significant differences in default rates (p =", edu_chi$p.value, ")\n")
} else {
  cat("Education: No significant differences in default rates (p =", edu_chi$p.value, ")\n")
}
## Education: Significant differences in default rates (p = 1.458089e-34 )
# Default rate by Marriage Status
marriage_default <- d1 %>%
  group_by(marriage, default_payment_next_month) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(marriage) %>%
  mutate(prop = count / sum(count),
         percent = prop * 100)

marriage_plot <- ggplot(marriage_default %>% filter(default_payment_next_month == "yes"), 
                       aes(x = reorder(marriage, -percent), y = percent, fill = marriage)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.01, size = 4) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Default Rate by Marital Status",
       x = "Marital Status",
       y = "Default Rate (%)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none")

print(marriage_plot)

# Chi-square test for marriage and default
marriage_table <- table(d1$marriage, d1$default_payment_next_month)
marriage_chi <- chisq.test(marriage_table)
print(marriage_chi)
## 
##  Pearson's Chi-squared test
## 
## data:  marriage_table
## X-squared = 27.489, df = 2, p-value = 1.074e-06
# Interpretation
if(marriage_chi$p.value < 0.05) {
  cat("Marriage: Significant differences in default rates (p =", marriage_chi$p.value, ")\n")
} else {
  cat("Marriage: No significant differences in default rates (p =", marriage_chi$p.value, ")\n")
}
## Marriage: Significant differences in default rates (p = 1.073732e-06 )
# Default rate by Age
# First, we create age groups for better visualization
# Then, calculate the default rates by age group
d1 <- d1 %>%
  mutate(age_group = case_when(
    age < 30 ~ "< 30",
    age >= 30 & age < 40 ~ "30-39",
    age >= 40 & age < 50 ~ "40-49",
    age >= 50 & age < 60 ~ "50-59",
    age >= 60 ~ "60+"
  ),
  age_group = factor(age_group, levels = c("< 30", "30-39", "40-49", "50-59", "60+")))

# Calculate the default rates by age group
age_default <- d1 %>%
  group_by(age_group, default_payment_next_month) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(age_group) %>%
  mutate(prop = count / sum(count),
         percent = prop * 100)

# Create a new dataframe with just the default rates (where default_payment_next_month == "yes")
default_rates <- age_default %>% 
  filter(default_payment_next_month == "yes") %>%
  arrange(desc(percent))  # Sort by descending default rate

# Create a new factor order based on the sorted default rates
new_order <- default_rates$age_group

# Update the factor levels in the age_default dataframe
age_default <- age_default %>%
  mutate(age_group = factor(age_group, levels = new_order))

# Now create the plot with the reordered bars
age_plot <- ggplot(age_default %>% filter(default_payment_next_month == "yes"),
                   aes(x = age_group, y = percent, fill = age_group)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.2, size = 4) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(title = "Default Rate by Age Group",
       x = "Age Group",
       y = "Default Rate (%)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "none")

# Print the plot
print(age_plot)

# Chi-square test for age groups and default
age_table <- table(d1$age_group, d1$default_payment_next_month)
age_chi <- chisq.test(age_table)
print(age_chi)
## 
##  Pearson's Chi-squared test
## 
## data:  age_table
## X-squared = 45.812, df = 4, p-value = 2.695e-09
# Interpretation
if(age_chi$p.value < 0.05) {
  cat("Age Group: Significant differences in default rates (p =", sprintf("%.4f", age_chi$p.value), ")\n")
} else {
  cat("Age Group: No significant differences in default rates (p =", sprintf("%.4f", age_chi$p.value), ")\n")
}
## Age Group: Significant differences in default rates (p = 0.0000 )

Overall hypothesis framework:

Null hypothesis (H₀): Default rates are independent of the demographic factors (sex, education, marital status, age).

Alternative hypothesis (H₁): Default rates differ significantly across demographic groups, i.e. there is a relationship between default status and the demographic variable.

All four tests (sex, education, marriage, and age group) show statistically significant associations with default status. Therefore, default rates are not independent of these demographic factors.
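To keep the four results in one place, a small summary table can be built from the test objects created above. This is a sketch; it adds nothing beyond the individual outputs:

# Collect the four chi-square results into one table
chi_summary <- data.frame(
  variable  = c("sex", "education", "marriage", "age_group"),
  statistic = as.numeric(c(sex_chi$statistic, edu_chi$statistic,
                           marriage_chi$statistic, age_chi$statistic)),
  df        = as.numeric(c(sex_chi$parameter, edu_chi$parameter,
                           marriage_chi$parameter, age_chi$parameter)),
  p_value   = as.numeric(c(sex_chi$p.value, edu_chi$p.value,
                           marriage_chi$p.value, age_chi$p.value))
)
knitr::kable(chi_summary)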

# Display of demographic groups with the highest default rates
# Summarize default rates by all demographic combinations
demographic_default <- d1 %>%
  group_by(sex, education, marriage, age_group) %>%
  summarise(
    total = n(),
    default_count = sum(default_payment_next_month == "yes"),
    default_rate = mean(default_payment_next_month == "yes") * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(default_rate))
# Display top 10 demographic groups with highest default rates
top_10_groups <- head(demographic_default, 10)
print("Top 10 Demographic Groups with Highest Default Rates:")
## [1] "Top 10 Demographic Groups with Highest Default Rates:"
print(top_10_groups)
## # A tibble: 10 × 7
##    sex    education       marriage age_group total default_count default_rate
##    <fct>  <fct>           <fct>    <fct>     <int>         <int>        <dbl>
##  1 female university      others   60+           1             1        100  
##  2 female university      single   60+           4             3         75  
##  3 female graduate school others   < 30          5             3         60  
##  4 male   graduate school others   50-59         8             4         50  
##  5 male   high school     others   < 30          6             3         50  
##  6 female high school     others   50-59        23            10         43.5
##  7 male   university      married  60+          45            19         42.2
##  8 male   university      others   40-49        27            11         40.7
##  9 female graduate school single   60+           5             2         40  
## 10 female high school     single   60+          16             6         37.5
2) Which specific payment history variables (PAY_0 to PAY_6) have the strongest correlation with default?
# Create a numeric version of the default variable (0 = no, 1 = yes)
d1$default_num <- ifelse(d1$default_payment_next_month == "yes", 1, 0)

# Select payment history variables and default for correlation analysis
pay_vars <- c("pay_september", "pay_august", "pay_july", "pay_june", "pay_may", "pay_april")
corr_data <- d1[, c(pay_vars, "default_num")]

# Calculate correlations
cor_matrix <- cor(sapply(corr_data, as.numeric), use = "complete.obs")

# Print correlation with default
cat("Correlation between payment variables and default:\n")
## Correlation between payment variables and default:
cor_with_default <- cor_matrix["default_num", pay_vars]
print(cor_with_default)
## pay_september    pay_august      pay_july      pay_june       pay_may 
##     0.3249638     0.2636562     0.2352305     0.2165509     0.2040591 
##     pay_april 
##     0.1867396
# Sort by absolute correlation value to identify strongest correlations
sorted_cors <- sort(abs(cor_with_default), decreasing = TRUE)
cat("\nPayment variables sorted by strength of correlation with default:\n")
## 
## Payment variables sorted by strength of correlation with default:
print(sorted_cors)
## pay_september    pay_august      pay_july      pay_june       pay_may 
##     0.3249638     0.2636562     0.2352305     0.2165509     0.2040591 
##     pay_april 
##     0.1867396
# Visualize correlations with default
barplot(cor_with_default[order(abs(cor_with_default), decreasing = TRUE)], 
        main = "Correlation between Payment History and Default",
        xlab = "Payment Variables", 
        ylab = "Correlation Coefficient",
        col = ifelse(cor_with_default[order(abs(cor_with_default), decreasing = TRUE)] > 0, "red", "blue"),
        las = 2)
abline(h = 0)
legend("bottomright", legend = c("Positive correlation", "Negative correlation"), 
       fill = c("red", "blue"))

# Statistical significance testing for correlations
# Using point-biserial correlation (equivalent to t-test for binary and continuous variables)

# Initialize data frame to store results
significance_tests <- data.frame(
  Payment_Variable = pay_vars,
  Correlation = cor_with_default,
  p_value = numeric(length(pay_vars))  # initialize with empty p-values
)

# Run point-biserial correlation tests for each payment variable
for (i in 1:length(pay_vars)) {
  x <- as.numeric(d1[[pay_vars[i]]])  # Convert x to numeric
  y <- d1$default_num                 # Already numeric
  test_result <- cor.test(x, y)
  significance_tests$p_value[i] <- test_result$p.value
}

# Sort by absolute correlation value
significance_tests <- significance_tests[order(abs(significance_tests$Correlation), decreasing = TRUE), ]

# Print results as formatted table
cat("\nStatistical significance of correlations:\n")
## 
## Statistical significance of correlations:
print(knitr::kable(significance_tests, digits = 4))
## 
## 
## |              |Payment_Variable | Correlation| p_value|
## |:-------------|:----------------|-----------:|-------:|
## |pay_september |pay_september    |      0.3250|       0|
## |pay_august    |pay_august       |      0.2637|       0|
## |pay_july      |pay_july         |      0.2352|       0|
## |pay_june      |pay_june         |      0.2166|       0|
## |pay_may       |pay_may          |      0.2041|       0|
## |pay_april     |pay_april        |      0.1867|       0|

pay_september has the strongest correlation with default (approximately 0.325), followed by pay_august (about 0.264) and pay_july (about 0.235); the remaining variables (pay_june, pay_may, and pay_april) have progressively weaker correlations.

library(forcats)  # for fct_reorder
# Visualize default rates by payment status for the top correlated variable
# (pay_september, formerly PAY_0, is the most strongly correlated based on the results above)
top_var <- names(sorted_cors)[1]

default_by_payment <- d1 %>%
  group_by(across(all_of(top_var)), default_payment_next_month) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(across(all_of(top_var))) %>%
  mutate(total = sum(count),
         percentage = count / total * 100) %>%
  filter(default_payment_next_month == "yes") %>%
  arrange(across(all_of(top_var)))

ggplot(default_by_payment, aes(x = fct_reorder(factor(get(top_var)), percentage, .desc = TRUE), y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), vjust = -0.2) +
  labs(title = paste("Default Rate by", top_var, "Status"),
       x = paste(top_var, "Payment Status (-2 = No consumption, -1 = Paid in full, etc.)"),
       y = "Default Rate (%)") +
  theme_minimal()

The resulting visualization shows how default rates vary across different payment statuses. From the chart:

Lower payment statuses (-2, -1, 0) have relatively low default rates (around 13-17%). Default rates jump significantly at status 1 (about 34%). The highest default rates occur at statuses 2, 3, and 7 (69.1%, 75.8%, and 77.8% respectively). There is a dip in default rates at statuses 4, 5, and 6 before rising again.

This suggests that certain payment behaviors (particularly those coded as 2, 3, and 7) are strong predictors of future defaults. The payment status values appear to use a coding system where -2 means “No consumption,” -1 means “Paid in full,” and other values indicate different levels of payment delays or issues.
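For reference, the coding can be written down as a small lookup table. The labels for 0 and the positive delay codes below are the commonly cited interpretations of this dataset and are my assumption, not something established elsewhere in this report:

# Assumed reference mapping for the PAY_* status codes
pay_status_labels <- c(
  "-2" = "no consumption",
  "-1" = "paid in full",
  "0"  = "revolving credit used",
  "1"  = "payment delayed 1 month",
  "2"  = "payment delayed 2 months"
  # ... codes 3 through 8 indicate delays of 3 to 8+ months
)
pay_status_labels[as.character(c(-2, -1, 2))]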

3) Is there a critical threshold where the bill-to-payment ratio significantly increases default probability?
# Step 1: Create total bill and payment columns
d1 <- d1 %>%
  mutate(total_bill = bill_amt1 + bill_amt2 + bill_amt3 + bill_amt4 + bill_amt5 + bill_amt6,
         total_payment = pay_amt1 + pay_amt2 + pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6,
         bill_payment_ratio = total_bill / (total_payment + 1))  # +1 to avoid division by zero

# Step 2: Visualize the Ratio Distribution by Default Status
# Filter out infinite or non-positive ratios
d1_clean <- d1 %>%
  filter(is.finite(bill_payment_ratio), bill_payment_ratio > 0)

# Now plot again safely
ggplot(d1_clean, aes(x = bill_payment_ratio, fill = default_payment_next_month)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
  labs(title = "Bill-to-Payment Ratio by Default Status",
       x = "Bill-to-Payment Ratio (log scale)", y = "Density") +
  theme_minimal()

The density plot (Image 1) shows a clear difference in the distribution of bill-to-payment ratios between defaulters and non-defaulters. Both distributions appear bimodal, with defaulters (green) having a higher density at larger bill-to-payment ratio values.

# Step 1: Create Bins (log10 scale makes sense due to skew)
d1_binned <- d1_clean %>%
  mutate(ratio_bin = cut(log10(bill_payment_ratio),
                         breaks = seq(-1, 6, by = 0.5),  # log10(0.1) to log10(1,000,000)
                         include.lowest = TRUE))

# Step 2: Calculate default rate per bin
default_by_bin <- d1_binned %>%
  group_by(ratio_bin) %>%
  summarise(
    count = n(),
    default_rate = mean(default_payment_next_month == "yes")
  ) %>%
  filter(!is.na(ratio_bin))  # remove any NA bins if present

# Step 3: Plot
ggplot(default_by_bin, aes(x = ratio_bin, y = default_rate)) +
  geom_col(fill = "#3b82f6") +
  labs(title = "Default Rate by Bill-to-Payment Ratio Bin",
       x = "Log10(Bill-to-Payment Ratio)",
       y = "Default Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart (Image 2) reveals a pattern where default rates generally increase as the bill-to-payment ratio increases, especially for log10 ratios between roughly 3.5 and 4.5, which corresponds to bill-to-payment ratios of approximately 3,000 to 30,000. The highest default rate (over 60%) occurs in the (4, 4.5] bin.

# Subset to 5000 random samples from each group
set.seed(123)  # for reproducibility

no_sample <- sample(d1$bill_payment_ratio[d1$default_payment_next_month == "no"], 5000)
yes_sample <- sample(d1$bill_payment_ratio[d1$default_payment_next_month == "yes"], 5000)

# Shapiro-Wilk on subsamples
shapiro.test(no_sample)
## 
##  Shapiro-Wilk normality test
## 
## data:  no_sample
## W = 0.027618, p-value < 2.2e-16
shapiro.test(yes_sample)
## 
##  Shapiro-Wilk normality test
## 
## data:  yes_sample
## W = 0.017158, p-value < 2.2e-16

Normality is clearly violated in both groups, so a parametric test is not appropriate; I use a non-parametric Wilcoxon rank-sum test instead.

# Step 3: Perform a Wilcoxon rank sum test
wilcoxon_result <- wilcox.test(bill_payment_ratio ~ default_payment_next_month, data = d1)

# Print the result of the Wilcoxon test
print(wilcoxon_result)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  bill_payment_ratio by default_payment_next_month
## W = 65873983, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# Interpretation
if (wilcoxon_result$p.value < 0.05) {
  cat("There is a significant difference in bill-to-payment ratios between defaulters and non-defaulters (p =", round(wilcoxon_result$p.value, 4), ").\n")
} else {
  cat("There is no significant difference in bill-to-payment ratios between defaulters and non-defaulters (p =", round(wilcoxon_result$p.value, 4), ").\n")
}
## There is a significant difference in bill-to-payment ratios between defaulters and non-defaulters (p = 0 ).

The statistical analysis confirms these visual observations: The Shapiro-Wilk tests for both groups have p-values < 2.2e-16, indicating the data is not normally distributed, which justifies using non-parametric tests. The Wilcoxon rank sum test shows a highly significant difference (p < 2.2e-16) between the bill-to-payment ratios of defaulters and non-defaulters.

Interpretation: Since the p-value is less than 0.05, we can conclude that there is a significant difference in bill-to-payment ratios between defaulters and non-defaulters. This result suggests that the distribution of bill-to-payment ratios differs between the two groups.

To answer my specific question: yes, there does appear to be a critical threshold where the bill-to-payment ratio significantly increases default probability. Based on the binned analysis, this threshold sits around a log10 ratio of 3.5, i.e. a bill-to-payment ratio of roughly 3,000. Beyond this point, default rates increase sharply, peaking in the (4, 4.5] bin (ratios of roughly 10,000 to 30,000).
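The threshold bin can also be pulled out programmatically from the binned summary built above. This is a sketch, and the 40% default-rate cutoff is an arbitrary, illustrative choice rather than a derived threshold:

# Locate the first log10-ratio bin whose default rate exceeds 40%
default_by_bin %>%
  filter(default_rate > 0.40) %>%
  slice(1)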

# Correlation Matrix
continuous_vars <- c(
  "limit_bal", 
  "age",
  "bill_amt1", "bill_amt2", "bill_amt3", "bill_amt4", "bill_amt5", "bill_amt6",
  "pay_amt1", "pay_amt2", "pay_amt3", "pay_amt4", "pay_amt5", "pay_amt6"
)

# Subset to only include continuous variables that exist in your dataset
continuous_vars <- continuous_vars[continuous_vars %in% names(d1)]

# Create a correlation matrix
correlation_matrix <- cor(d1[continuous_vars], use = "pairwise.complete.obs")

# Optional: Round to 2 decimal places for better readability
correlation_matrix <- round(correlation_matrix, 2)

# Print the correlation matrix
print(correlation_matrix)
##           limit_bal  age bill_amt1 bill_amt2 bill_amt3 bill_amt4 bill_amt5
## limit_bal      1.00 0.14      0.29      0.28      0.28      0.29      0.30
## age            0.14 1.00      0.06      0.05      0.05      0.05      0.05
## bill_amt1      0.29 0.06      1.00      0.95      0.89      0.86      0.83
## bill_amt2      0.28 0.05      0.95      1.00      0.93      0.89      0.86
## bill_amt3      0.28 0.05      0.89      0.93      1.00      0.92      0.88
## bill_amt4      0.29 0.05      0.86      0.89      0.92      1.00      0.94
## bill_amt5      0.30 0.05      0.83      0.86      0.88      0.94      1.00
## bill_amt6      0.29 0.05      0.80      0.83      0.85      0.90      0.95
## pay_amt1       0.20 0.03      0.14      0.28      0.24      0.23      0.22
## pay_amt2       0.18 0.02      0.10      0.10      0.32      0.21      0.18
## pay_amt3       0.21 0.03      0.16      0.15      0.13      0.30      0.25
## pay_amt4       0.20 0.02      0.16      0.15      0.14      0.13      0.29
## pay_amt5       0.22 0.02      0.17      0.16      0.18      0.16      0.14
## pay_amt6       0.22 0.02      0.18      0.17      0.18      0.18      0.16
##           bill_amt6 pay_amt1 pay_amt2 pay_amt3 pay_amt4 pay_amt5 pay_amt6
## limit_bal      0.29     0.20     0.18     0.21     0.20     0.22     0.22
## age            0.05     0.03     0.02     0.03     0.02     0.02     0.02
## bill_amt1      0.80     0.14     0.10     0.16     0.16     0.17     0.18
## bill_amt2      0.83     0.28     0.10     0.15     0.15     0.16     0.17
## bill_amt3      0.85     0.24     0.32     0.13     0.14     0.18     0.18
## bill_amt4      0.90     0.23     0.21     0.30     0.13     0.16     0.18
## bill_amt5      0.95     0.22     0.18     0.25     0.29     0.14     0.16
## bill_amt6      1.00     0.20     0.17     0.23     0.25     0.31     0.12
## pay_amt1       0.20     1.00     0.29     0.25     0.20     0.15     0.19
## pay_amt2       0.17     0.29     1.00     0.24     0.18     0.18     0.16
## pay_amt3       0.23     0.25     0.24     1.00     0.22     0.16     0.16
## pay_amt4       0.25     0.20     0.18     0.22     1.00     0.15     0.16
## pay_amt5       0.31     0.15     0.18     0.16     0.15     1.00     0.15
## pay_amt6       0.12     0.19     0.16     0.16     0.16     0.15     1.00
# Create a heatmap visualization
library(corrplot)
## corrplot 0.95 loaded
corrplot(correlation_matrix, 
         method = "color", 
         type = "upper", 
         order = "hclust", 
         tl.col = "black", 
         tl.srt = 45,
         diag = FALSE,
         mar = c(0,0,2,0),
         title = "Correlation Matrix of Continuous Predictors")

# Alternative with ggplot2 showing values
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)

# Convert correlation matrix to long format
corr_long <- melt(correlation_matrix)

# Create the heatmap with visible coefficients
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", value)), size = 2.5) +
  scale_fill_gradient2(low = "#E46726", mid = "white", high = "#6D9EC1", 
                       midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.position = "right") +
  labs(title = "Correlation Matrix of Continuous Predictors", x = "", y = "") +
  coord_fixed()

Bill amounts: there are strong positive correlations (dark blue) between consecutive bill amounts (bill_amt1 through bill_amt6). This makes sense, as customers with high bills in one month tend to have high bills in adjacent months.

Credit limit (limit_bal): shows moderate positive correlations with the bill amounts, suggesting customers with higher credit limits tend to carry higher balances.

Age: has only weak correlations with the other variables (very light colors), indicating that age is not strongly related to spending behavior in this dataset.

Payment amounts: the payment variables (pay_amt1 through pay_amt6) show some correlation with each other and with the corresponding bill amounts, but these relationships are weaker than the correlations among the bill amounts.
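Before moving to the models, the most strongly correlated pairs can also be listed directly from the correlation_matrix object computed above. This is a sketch, and the 0.8 cutoff is an illustrative choice:

# List predictor pairs with |r| > 0.8 (upper triangle only)
high_corr <- which(abs(correlation_matrix) > 0.8 & upper.tri(correlation_matrix),
                   arr.ind = TRUE)
data.frame(
  var1 = rownames(correlation_matrix)[high_corr[, "row"]],
  var2 = colnames(correlation_matrix)[high_corr[, "col"]],
  r    = correlation_matrix[high_corr]
)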

6. Modeling Details and Key Insights

# First we need to remove the generated columns before splitting the data
d1_clean <- d1 %>% 
 dplyr::select(-age_group, -default_num, -total_bill, -total_payment, -bill_payment_ratio)

# Now create the training and testing datasets with clean data
set.seed(123)
trainIndex <- sample(1:nrow(d1_clean), 0.7*nrow(d1_clean))
train_data <- d1_clean[trainIndex, ]
test_data <- d1_clean[-trainIndex, ]

# Check the class distribution in the training data
table(train_data$default_payment_next_month)
## 
##    no   yes 
## 16353  4622
set.seed(123)
# logistic regression model
model <- glm(default_payment_next_month ~ ., data = train_data, family = binomial)
summary(model)
## 
## Call:
## glm(formula = default_payment_next_month ~ ., family = binomial, 
##     data = train_data)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.601e-01  1.033e-01  -9.294  < 2e-16 ***
## limit_bal            -8.435e-07  1.896e-07  -4.448 8.66e-06 ***
## sexfemale            -1.443e-01  3.678e-02  -3.922 8.78e-05 ***
## educationuniversity  -9.988e-02  4.270e-02  -2.339 0.019325 *  
## educationhigh school -1.234e-01  5.703e-02  -2.164 0.030452 *  
## educationothers      -1.001e+00  2.120e-01  -4.720 2.36e-06 ***
## marriagesingle       -1.908e-01  4.141e-02  -4.608 4.06e-06 ***
## marriageothers       -1.975e-01  1.593e-01  -1.240 0.215067    
## age                   5.773e-03  2.222e-03   2.598 0.009369 ** 
## pay_september         5.627e-01  2.138e-02  26.314  < 2e-16 ***
## pay_august            9.043e-02  2.418e-02   3.740 0.000184 ***
## pay_july              7.441e-02  2.676e-02   2.781 0.005427 ** 
## pay_june              3.731e-02  2.960e-02   1.261 0.207434    
## pay_may               3.976e-02  3.232e-02   1.230 0.218515    
## pay_april            -1.106e-02  2.659e-02  -0.416 0.677428    
## bill_amt1            -4.643e-06  1.302e-06  -3.567 0.000362 ***
## bill_amt2             1.210e-06  1.742e-06   0.694 0.487496    
## bill_amt3             1.272e-06  1.586e-06   0.802 0.422494    
## bill_amt4             7.824e-07  1.568e-06   0.499 0.617851    
## bill_amt5             4.611e-07  1.777e-06   0.259 0.795327    
## bill_amt6             3.025e-08  1.420e-06   0.021 0.983010    
## pay_amt1             -1.213e-05  2.676e-06  -4.531 5.87e-06 ***
## pay_amt2             -8.955e-06  2.405e-06  -3.724 0.000196 ***
## pay_amt3             -2.587e-06  1.980e-06  -1.307 0.191303    
## pay_amt4             -3.160e-06  2.064e-06  -1.531 0.125749    
## pay_amt5             -4.914e-06  2.258e-06  -2.176 0.029551 *  
## pay_amt6             -2.699e-06  1.612e-06  -1.674 0.094116 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22123  on 20974  degrees of freedom
## Residual deviance: 19412  on 20948  degrees of freedom
## AIC: 19466
## 
## Number of Fisher Scoring iterations: 6
# Check multicollinearity
vif(model)
##                    GVIF Df GVIF^(1/(2*Df))
## limit_bal      1.530942  1        1.237312
## sex            1.027274  1        1.013545
## education      1.225509  3        1.034474
## marriage       1.332737  2        1.074450
## age            1.381564  1        1.175400
## pay_september  1.517490  1        1.231865
## pay_august     2.626394  1        1.620615
## pay_july       3.190245  1        1.786126
## pay_june       3.792096  1        1.947331
## pay_may        4.275140  1        2.067641
## pay_april      3.002595  1        1.732800
## bill_amt1     22.110241  1        4.702153
## bill_amt2     37.857469  1        6.152842
## bill_amt3     29.284916  1        5.411554
## bill_amt4     25.634745  1        5.063077
## bill_amt5     30.126752  1        5.488784
## bill_amt6     18.353047  1        4.284046
## pay_amt1       1.441830  1        1.200762
## pay_amt2       1.463007  1        1.209548
## pay_amt3       1.487660  1        1.219697
## pay_amt4       1.413433  1        1.188879
## pay_amt5       1.464211  1        1.210046
## pay_amt6       1.108250  1        1.052734
# First I removed the predictor with the highest VIF (bill_amt2) to see if that reduces the variance inflation
model2 <- glm(default_payment_next_month ~ .-bill_amt2, data = train_data, family = binomial)
vif(model2)
##                    GVIF Df GVIF^(1/(2*Df))
## limit_bal      1.529941  1        1.236908
## sex            1.027089  1        1.013454
## education      1.225390  3        1.034457
## marriage       1.332773  2        1.074457
## age            1.381716  1        1.175464
## pay_september  1.516556  1        1.231485
## pay_august     2.625055  1        1.620202
## pay_july       3.183112  1        1.784128
## pay_june       3.792290  1        1.947380
## pay_may        4.275262  1        2.067671
## pay_april      3.002368  1        1.732734
## bill_amt1      9.843152  1        3.137380
## bill_amt3     22.721659  1        4.766724
## bill_amt4     25.579847  1        5.057652
## bill_amt5     30.076344  1        5.484190
## bill_amt6     18.342857  1        4.282856
## pay_amt1       1.223209  1        1.105988
## pay_amt2       1.331716  1        1.154000
## pay_amt3       1.485735  1        1.218907
## pay_amt4       1.412986  1        1.188691
## pay_amt5       1.463914  1        1.209923
## pay_amt6       1.108141  1        1.052683
# Next I removed the predictor with the highest remaining VIF (bill_amt5)
model3 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5, data = train_data, family = binomial)
vif(model3)
##                    GVIF Df GVIF^(1/(2*Df))
## limit_bal      1.527904  1        1.236084
## sex            1.027053  1        1.013436
## education      1.224670  3        1.034355
## marriage       1.332746  2        1.074452
## age            1.381691  1        1.175454
## pay_september  1.516373  1        1.231411
## pay_august     2.625210  1        1.620250
## pay_july       3.183250  1        1.784166
## pay_june       3.789664  1        1.946706
## pay_may        4.273017  1        2.067128
## pay_april      2.998393  1        1.731587
## bill_amt1      9.842675  1        3.137304
## bill_amt3     22.652209  1        4.759434
## bill_amt4     20.294946  1        4.504991
## bill_amt6      9.207984  1        3.034466
## pay_amt1       1.222935  1        1.105864
## pay_amt2       1.331620  1        1.153959
## pay_amt3       1.483029  1        1.217797
## pay_amt4       1.212957  1        1.101343
## pay_amt5       1.249524  1        1.117821
## pay_amt6       1.100787  1        1.049184
# Then I removed the predictor with the highest remaining VIF (bill_amt3)
model4 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5 -bill_amt3, data = train_data, family = binomial)
vif(model4)
##                    GVIF Df GVIF^(1/(2*Df))
## limit_bal      1.526431  1        1.235488
## sex            1.026934  1        1.013378
## education      1.224426  3        1.034321
## marriage       1.332630  2        1.074428
## age            1.381418  1        1.175337
## pay_september  1.516166  1        1.231327
## pay_august     2.622895  1        1.619535
## pay_july       3.175906  1        1.782107
## pay_june       3.785788  1        1.945710
## pay_may        4.273264  1        2.067187
## pay_april      2.999200  1        1.731820
## bill_amt1      5.727922  1        2.393308
## bill_amt4     13.677470  1        3.698306
## bill_amt6      9.096204  1        3.015991
## pay_amt1       1.177919  1        1.085320
## pay_amt2       1.195389  1        1.093338
## pay_amt3       1.240474  1        1.113766
## pay_amt4       1.205400  1        1.097907
## pay_amt5       1.248262  1        1.117257
## pay_amt6       1.099634  1        1.048634
set.seed(123)
# Finally I removed the predictor with the highest remaining VIF (bill_amt4)
model5 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5 -bill_amt3 -bill_amt4, data = train_data, family = binomial)
summary(model5)
## 
## Call:
## glm(formula = default_payment_next_month ~ . - bill_amt2 - bill_amt5 - 
##     bill_amt3 - bill_amt4, family = binomial, data = train_data)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.624e-01  1.033e-01  -9.318  < 2e-16 ***
## limit_bal            -8.494e-07  1.893e-07  -4.488 7.19e-06 ***
## sexfemale            -1.438e-01  3.677e-02  -3.912 9.14e-05 ***
## educationuniversity  -1.010e-01  4.268e-02  -2.366 0.017962 *  
## educationhigh school -1.250e-01  5.701e-02  -2.193 0.028332 *  
## educationothers      -9.940e-01  2.119e-01  -4.690 2.73e-06 ***
## marriagesingle       -1.908e-01  4.140e-02  -4.609 4.05e-06 ***
## marriageothers       -1.992e-01  1.593e-01  -1.250 0.211260    
## age                   5.818e-03  2.222e-03   2.619 0.008829 ** 
## pay_september         5.635e-01  2.138e-02  26.355  < 2e-16 ***
## pay_august            8.865e-02  2.417e-02   3.668 0.000244 ***
## pay_july              7.752e-02  2.668e-02   2.905 0.003668 ** 
## pay_june              3.930e-02  2.953e-02   1.331 0.183190    
## pay_may               4.161e-02  3.224e-02   1.291 0.196861    
## pay_april            -1.339e-02  2.655e-02  -0.504 0.614120    
## bill_amt1            -2.626e-06  5.316e-07  -4.939 7.84e-07 ***
## bill_amt6             1.703e-06  6.840e-07   2.490 0.012766 *  
## pay_amt1             -1.031e-05  2.395e-06  -4.306 1.66e-05 ***
## pay_amt2             -7.822e-06  2.125e-06  -3.680 0.000233 ***
## pay_amt3             -2.745e-06  1.718e-06  -1.598 0.109941    
## pay_amt4             -3.883e-06  1.841e-06  -2.109 0.034945 *  
## pay_amt5             -6.208e-06  2.012e-06  -3.085 0.002034 ** 
## pay_amt6             -2.468e-06  1.603e-06  -1.539 0.123785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22123  on 20974  degrees of freedom
## Residual deviance: 19417  on 20952  degrees of freedom
## AIC: 19463
## 
## Number of Fisher Scoring iterations: 6
vif(model5)
##                   GVIF Df GVIF^(1/(2*Df))
## limit_bal     1.525734  1        1.235206
## sex           1.026662  1        1.013243
## education     1.223570  3        1.034201
## marriage      1.332390  2        1.074380
## age           1.381129  1        1.175214
## pay_september 1.516102  1        1.231301
## pay_august    2.620325  1        1.618742
## pay_july      3.171265  1        1.780805
## pay_june      3.776598  1        1.943347
## pay_may       4.257096  1        2.063273
## pay_april     2.994554  1        1.730478
## bill_amt1     3.683396  1        1.919218
## bill_amt6     4.256136  1        2.063040
## pay_amt1      1.159594  1        1.076844
## pay_amt2      1.149135  1        1.071977
## pay_amt3      1.134980  1        1.065355
## pay_amt4      1.135903  1        1.065787
## pay_amt5      1.162211  1        1.078059
## pay_amt6      1.095345  1        1.046587

Multicollinearity is no longer an issue. After removing bill_amt2, bill_amt5, bill_amt3, and bill_amt4, I proceed with stepwise selection.
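As an informal sanity check, the adjusted GVIF values, i.e. GVIF^(1/(2*Df)), can be compared against sqrt(5) ≈ 2.24, a cutoff analogous to the common "VIF < 5" rule of thumb. The threshold is a heuristic of my own choosing, not part of the original analysis:

# Flag any remaining predictor whose adjusted GVIF exceeds sqrt(5)
vif_mat <- vif(model5)
vif_mat[vif_mat[, "GVIF^(1/(2*Df))"] > sqrt(5), , drop = FALSE]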

I fit the reduced model first, but stepwise selection produced an error. After sorting the data and inspecting the predictors, I found that pay_amt1 was causing perfect separation, so it is excluded from the model below.

set.seed(123)
model_d1 <- glm(default_payment_next_month ~ limit_bal + sex + education + marriage +age +  pay_september + pay_august + pay_july + pay_june + pay_may + pay_april + bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 +pay_amt5 + pay_amt6, data=train_data, family=binomial)

summary(model_d1)
## 
## Call:
## glm(formula = default_payment_next_month ~ limit_bal + sex + 
##     education + marriage + age + pay_september + pay_august + 
##     pay_july + pay_june + pay_may + pay_april + bill_amt1 + bill_amt6 + 
##     pay_amt2 + pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6, family = binomial, 
##     data = train_data)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.793e-01  1.031e-01  -9.496  < 2e-16 ***
## limit_bal            -9.124e-07  1.891e-07  -4.825 1.40e-06 ***
## sexfemale            -1.429e-01  3.674e-02  -3.888 0.000101 ***
## educationuniversity  -1.003e-01  4.265e-02  -2.352 0.018660 *  
## educationhigh school -1.257e-01  5.699e-02  -2.205 0.027463 *  
## educationothers      -1.004e+00  2.119e-01  -4.738 2.16e-06 ***
## marriagesingle       -1.913e-01  4.137e-02  -4.625 3.74e-06 ***
## marriageothers       -2.062e-01  1.591e-01  -1.296 0.194947    
## age                   5.881e-03  2.220e-03   2.649 0.008064 ** 
## pay_september         5.671e-01  2.138e-02  26.525  < 2e-16 ***
## pay_august            1.064e-01  2.388e-02   4.455 8.40e-06 ***
## pay_july              5.950e-02  2.646e-02   2.249 0.024511 *  
## pay_june              4.084e-02  2.957e-02   1.381 0.167206    
## pay_may               4.179e-02  3.228e-02   1.295 0.195429    
## pay_april            -1.230e-02  2.656e-02  -0.463 0.643319    
## bill_amt1            -2.588e-06  5.187e-07  -4.989 6.08e-07 ***
## bill_amt6             1.384e-06  6.668e-07   2.076 0.037910 *  
## pay_amt2             -9.128e-06  2.182e-06  -4.182 2.88e-05 ***
## pay_amt3             -3.313e-06  1.744e-06  -1.899 0.057509 .  
## pay_amt4             -4.186e-06  1.875e-06  -2.233 0.025573 *  
## pay_amt5             -6.614e-06  2.048e-06  -3.230 0.001240 ** 
## pay_amt6             -2.980e-06  1.618e-06  -1.842 0.065481 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22123  on 20974  degrees of freedom
## Residual deviance: 19443  on 20953  degrees of freedom
## AIC: 19487
## 
## Number of Fisher Scoring iterations: 6
# Sort the training data by the response to help identify the predictor driving the error
sortrain_data <- train_data[order(train_data$default_payment_next_month), ]

It turned out that pay_amt1 was causing perfect separation, so it had to be removed before stepwise selection could run.
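
For reference, one way to probe a suspect predictor for separation is a single-predictor fit (a sketch only, assuming train_data as above; this is not the exact check used during debugging). Separation typically shows up as inflated coefficients and standard errors, glm() convergence warnings, and fitted probabilities pushed to exactly 0 or 1:

# Sketch: single-predictor check for separation symptoms (pay_amt1 used for illustration)
sep_check <- glm(default_payment_next_month ~ pay_amt1, data = train_data, family = binomial)
summary(sep_check)$coefficients                               # look for huge estimates / standard errors
any(fitted(sep_check) < 1e-8 | fitted(sep_check) > 1 - 1e-8)  # fitted probabilities at exactly 0 or 1?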

# Stepwise selection
set.seed(123)
model_step <- step(model_d1, direction = "both", trace = 1)
## Start:  AIC=19487.27
## default_payment_next_month ~ limit_bal + sex + education + marriage + 
##     age + pay_september + pay_august + pay_july + pay_june + 
##     pay_may + pay_april + bill_amt1 + bill_amt6 + pay_amt2 + 
##     pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6
## 
##                 Df Deviance   AIC
## - pay_april      1    19444 19486
## - pay_may        1    19445 19487
## - pay_june       1    19445 19487
## <none>                19443 19487
## - pay_amt6       1    19447 19489
## - pay_amt3       1    19447 19489
## - bill_amt6      1    19448 19490
## - pay_july       1    19448 19490
## - pay_amt4       1    19449 19491
## - age            1    19450 19492
## - pay_amt5       1    19456 19498
## - sex            1    19458 19500
## - pay_august     1    19463 19505
## - marriage       2    19465 19505
## - limit_bal      1    19467 19509
## - pay_amt2       1    19467 19509
## - bill_amt1      1    19470 19512
## - education      3    19475 19513
## - pay_september  1    20140 20182
## 
## Step:  AIC=19485.48
## default_payment_next_month ~ limit_bal + sex + education + marriage + 
##     age + pay_september + pay_august + pay_july + pay_june + 
##     pay_may + bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 + 
##     pay_amt5 + pay_amt6
## 
##                 Df Deviance   AIC
## - pay_may        1    19445 19485
## - pay_june       1    19445 19485
## <none>                19444 19486
## - pay_amt6       1    19447 19487
## + pay_april      1    19443 19487
## - pay_amt3       1    19448 19488
## - bill_amt6      1    19448 19488
## - pay_july       1    19448 19488
## - pay_amt4       1    19450 19490
## - age            1    19451 19491
## - pay_amt5       1    19456 19496
## - sex            1    19459 19499
## - pay_august     1    19463 19503
## - marriage       2    19466 19504
## - limit_bal      1    19467 19507
## - pay_amt2       1    19468 19508
## - bill_amt1      1    19470 19510
## - education      3    19475 19511
## - pay_september  1    20140 20180
## 
## Step:  AIC=19485.03
## default_payment_next_month ~ limit_bal + sex + education + marriage + 
##     age + pay_september + pay_august + pay_july + pay_june + 
##     bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 + 
##     pay_amt5 + pay_amt6
## 
##                 Df Deviance   AIC
## <none>                19445 19485
## + pay_may        1    19444 19486
## - pay_amt3       1    19449 19487
## - pay_amt6       1    19449 19487
## + pay_april      1    19445 19487
## - bill_amt6      1    19450 19488
## - pay_july       1    19450 19488
## - pay_amt4       1    19452 19490
## - age            1    19452 19490
## - pay_june       1    19452 19490
## - pay_amt5       1    19458 19496
## - sex            1    19460 19498
## - marriage       2    19467 19503
## - pay_august     1    19466 19504
## - pay_amt2       1    19469 19507
## - limit_bal      1    19470 19508
## - bill_amt1      1    19472 19510
## - education      3    19477 19511
## - pay_september  1    20145 20183
summary(model_step)
## 
## Call:
## glm(formula = default_payment_next_month ~ limit_bal + sex + 
##     education + marriage + age + pay_september + pay_august + 
##     pay_july + pay_june + bill_amt1 + bill_amt6 + pay_amt2 + 
##     pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6, family = binomial, 
##     data = train_data)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.804e-01  1.031e-01  -9.509  < 2e-16 ***
## limit_bal            -9.219e-07  1.883e-07  -4.895 9.86e-07 ***
## sexfemale            -1.433e-01  3.674e-02  -3.900 9.60e-05 ***
## educationuniversity  -1.002e-01  4.264e-02  -2.349  0.01880 *  
## educationhigh school -1.252e-01  5.697e-02  -2.197  0.02802 *  
## educationothers      -1.004e+00  2.119e-01  -4.739 2.15e-06 ***
## marriagesingle       -1.916e-01  4.137e-02  -4.631 3.63e-06 ***
## marriageothers       -2.075e-01  1.591e-01  -1.304  0.19217    
## age                   5.864e-03  2.220e-03   2.642  0.00824 ** 
## pay_september         5.677e-01  2.135e-02  26.583  < 2e-16 ***
## pay_august            1.090e-01  2.371e-02   4.598 4.26e-06 ***
## pay_july              5.969e-02  2.637e-02   2.263  0.02363 *  
## pay_june              6.242e-02  2.346e-02   2.661  0.00780 ** 
## bill_amt1            -2.616e-06  5.163e-07  -5.067 4.04e-07 ***
## bill_amt6             1.450e-06  6.555e-07   2.213  0.02692 *  
## pay_amt2             -9.136e-06  2.183e-06  -4.185 2.86e-05 ***
## pay_amt3             -3.030e-06  1.710e-06  -1.772  0.07645 .  
## pay_amt4             -4.490e-06  1.874e-06  -2.396  0.01659 *  
## pay_amt5             -6.633e-06  2.036e-06  -3.258  0.00112 ** 
## pay_amt6             -2.973e-06  1.619e-06  -1.836  0.06641 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22123  on 20974  degrees of freedom
## Residual deviance: 19445  on 20955  degrees of freedom
## AIC: 19485
## 
## Number of Fisher Scoring iterations: 6
set.seed(123)
# 1. Make predictions (probabilities)
predicted_probs <- predict(model_step, newdata = test_data, type = "response")

# 2. Convert probabilities to predicted classes ("yes" for default if prob > 0.5)
predicted_classes <- ifelse(predicted_probs > 0.5, "yes", "no")

# 3. Convert to factor with correct levels (matching actual)
predicted_classes <- factor(predicted_classes, levels = levels(test_data$default_payment_next_month))

# 4. Create confusion matrix
confusion <- table(Predicted = predicted_classes, Actual = test_data$default_payment_next_month)
print(confusion)
##          Actual
## Predicted   no  yes
##       no  6776 1533
##       yes  206  475
set.seed(123)
# Make predictions on the test set
predictions <- predict(model_step, test_data, type = "response")

# Convert probabilities to predicted class labels
predicted_classes <- ifelse(predictions > 0.5, "yes", "no")

# Confusion matrix (ensure both sides are factors with the same levels)
confusionMatrix(
  factor(predicted_classes, levels = c("no", "yes")),
  factor(test_data$default_payment_next_month, levels = c("no", "yes")),
  positive = "yes"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6776 1533
##        yes  206  475
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7982, 0.8147)
##     No Information Rate : 0.7766          
##     P-Value [Acc > NIR] : 2.383e-12       
##                                           
##                   Kappa : 0.2708          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.23655         
##             Specificity : 0.97050         
##          Pos Pred Value : 0.69750         
##          Neg Pred Value : 0.81550         
##              Prevalence : 0.22336         
##          Detection Rate : 0.05284         
##    Detection Prevalence : 0.07575         
##       Balanced Accuracy : 0.60352         
##                                           
##        'Positive' Class : yes             
## 
# At the default 0.5 cutoff the model identifies few actual defaulters (sensitivity ~0.24),
# so we search for an optimal probability cutoff to better balance sensitivity and specificity

# Create ROCR prediction object using predicted probabilities and true class labels
# Convert "yes"/"no" to numeric: 1 = yes, 0 = no
pred <- prediction(predictions, ifelse(test_data$default_payment_next_month == "yes", 1, 0))

# Calculate AUC
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values), 3)

# Plot ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, main = "ROC Curve")
text(0.5, 0.5, paste("AUC:", auc))

# Plot sensitivity vs. cutoff
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values), 
     type = "l", lwd = 2, 
     ylab = "Sensitivity", xlab = "Cutoff", main = paste("Maximized Cutoff\n", "AUC: ", auc))

par(new = TRUE)  # Add second line

# Plot specificity vs. cutoff
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values), 
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "")
axis(4, at = seq(0, 1, 0.2))  # Add right axis
mtext("Specificity", side = 4, col = "red")

# Find optimal cutoff where sensitivity and specificity are closest
min.diff <- which.min(abs(unlist(performance(pred, "sens")@y.values) - 
                          unlist(performance(pred, "spec")@y.values)))
min.x <- unlist(performance(pred, "sens")@x.values)[min.diff]
min.y <- unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <- min.x  # Optimal threshold

# Annotate the plot
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x, 0, paste("optimal threshold =", round(optimal, 2)), pos = 3)

set.seed(123)
# Re-classify the test set using the optimal cutoff identified above (approximately 0.21)
predicted_classes_opt <- ifelse(predictions > 0.21, "yes", "no")
confusionMatrix(
  factor(predicted_classes_opt, levels = c("no", "yes")),
  factor(test_data$default_payment_next_month, levels = c("no", "yes")),
  positive = "yes"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  4471  649
##        yes 2511 1359
##                                           
##                Accuracy : 0.6485          
##                  95% CI : (0.6385, 0.6584)
##     No Information Rate : 0.7766          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2384          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6768          
##             Specificity : 0.6404          
##          Pos Pred Value : 0.3512          
##          Neg Pred Value : 0.8732          
##              Prevalence : 0.2234          
##          Detection Rate : 0.1512          
##    Detection Prevalence : 0.4305          
##       Balanced Accuracy : 0.6586          
##                                           
##        'Positive' Class : yes             
## 

Random Forest Model

set.seed(123)
# Build the Random Forest model with balanced data
rfModel <- randomForest(default_payment_next_month ~ ., data = train_data, ntree = 100)

#  Evaluate the model
predictions <- predict(rfModel, test_data)
confusionMatrix(predictions, test_data$default_payment_next_month, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6584 1287
##        yes  398  721
##                                           
##                Accuracy : 0.8126          
##                  95% CI : (0.8043, 0.8206)
##     No Information Rate : 0.7766          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3586          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3591          
##             Specificity : 0.9430          
##          Pos Pred Value : 0.6443          
##          Neg Pred Value : 0.8365          
##              Prevalence : 0.2234          
##          Detection Rate : 0.0802          
##    Detection Prevalence : 0.1245          
##       Balanced Accuracy : 0.6510          
##                                           
##        'Positive' Class : yes             
## 
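To compare the random forest with the logistic model's ROC analysis on the same footing, its AUC can be computed from the predicted class probabilities. This is a minimal sketch (assuming rfModel, test_data, and the ROCR package loaded earlier; it was not part of the original output):

# Sketch: AUC for the random forest using predicted probabilities for the "yes" class
rf_probs <- predict(rfModel, test_data, type = "prob")[, "yes"]
rf_pred  <- prediction(rf_probs, ifelse(test_data$default_payment_next_month == "yes", 1, 0))
round(as.numeric(performance(rf_pred, measure = "auc")@y.values), 3)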
set.seed(123)
# Print the importance of each variable
importance_rf <- importance(rfModel)
print(importance_rf)
##               MeanDecreaseGini
## limit_bal            365.03359
## sex                   69.31521
## education            137.26049
## marriage              83.19063
## age                  397.91894
## pay_september        741.92389
## pay_august           313.52853
## pay_july             175.11536
## pay_june             199.29407
## pay_may              131.67892
## pay_april            135.75248
## bill_amt1            425.71655
## bill_amt2            388.13707
## bill_amt3            368.54494
## bill_amt4            364.93740
## bill_amt5            352.57346
## bill_amt6            364.75290
## pay_amt1             375.26427
## pay_amt2             340.75309
## pay_amt3             333.31291
## pay_amt4             309.00002
## pay_amt5             309.90156
## pay_amt6             324.23427
set.seed(123)
# See the most important variables
# Create a cleaner variable importance plot
importance_df <- as.data.frame(importance(rfModel))
importance_df$Variables <- rownames(importance_df)

# Sort by importance
importance_df <- importance_df[order(importance_df$MeanDecreaseGini, decreasing = TRUE), ]

# Create a horizontal bar plot with ggplot2
library(ggplot2)

ggplot(importance_df, aes(x = reorder(Variables, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Random Forest Variable Importance",
    subtitle = "Credit Default Prediction Model",
    x = NULL,
    y = "Mean Decrease in Gini Index"
  ) +
  theme(
    axis.text.y = element_text(size = 10),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  )

SVM Model

set.seed(123)
# SVM with Radial Kernel
svm_radial <- svm(default_payment_next_month ~ ., 
                 data = train_data, 
                 kernel = "radial",
                 probability = TRUE)
set.seed(123)
# Make predictions with radial kernel
predictions_radial <- predict(svm_radial, newdata = test_data)

# Create confusion matrix for radial kernel
conf_matrix_radial <- confusionMatrix(predictions_radial, 
                                     test_data$default_payment_next_month,
                                     positive = "yes")
print("Confusion Matrix for SVM with Radial Kernel:")
## [1] "Confusion Matrix for SVM with Radial Kernel:"
print(conf_matrix_radial)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6671 1332
##        yes  311  676
##                                           
##                Accuracy : 0.8172          
##                  95% CI : (0.8091, 0.8252)
##     No Information Rate : 0.7766          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3567          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.33665         
##             Specificity : 0.95546         
##          Pos Pred Value : 0.68490         
##          Neg Pred Value : 0.83356         
##              Prevalence : 0.22336         
##          Detection Rate : 0.07519         
##    Detection Prevalence : 0.10979         
##       Balanced Accuracy : 0.64606         
##                                           
##        'Positive' Class : yes             
## 
# SVM with Linear Kernel 
svm_linear <- svm(default_payment_next_month ~ ., 
                 data = train_data, 
                 kernel = "linear",
                 probability = TRUE)
set.seed(123)
# Make predictions with linear kernel
predictions_linear <- predict(svm_linear, newdata = test_data)

# Create confusion matrix for linear kernel
conf_matrix_linear <- confusionMatrix(predictions_linear, 
                                     test_data$default_payment_next_month,
                                     positive = "yes")
print("Confusion Matrix for SVM with Linear Kernel:")
## [1] "Confusion Matrix for SVM with Linear Kernel:"
print(conf_matrix_linear)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6761 1522
##        yes  221  486
##                                           
##                Accuracy : 0.8061          
##                  95% CI : (0.7978, 0.8142)
##     No Information Rate : 0.7766          
##     P-Value [Acc > NIR] : 4.964e-12       
##                                           
##                   Kappa : 0.2735          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.24203         
##             Specificity : 0.96835         
##          Pos Pred Value : 0.68741         
##          Neg Pred Value : 0.81625         
##              Prevalence : 0.22336         
##          Detection Rate : 0.05406         
##    Detection Prevalence : 0.07864         
##       Balanced Accuracy : 0.60519         
##                                           
##        'Positive' Class : yes             
## 
# Compare the performance metrics
metrics <- data.frame(
  Model = c("SVM Radial", "SVM Linear"),
  Accuracy = c(conf_matrix_radial$overall["Accuracy"], conf_matrix_linear$overall["Accuracy"]),
  Sensitivity = c(conf_matrix_radial$byClass["Sensitivity"], conf_matrix_linear$byClass["Sensitivity"]),
  Specificity = c(conf_matrix_radial$byClass["Specificity"], conf_matrix_linear$byClass["Specificity"]),
  Precision = c(conf_matrix_radial$byClass["Precision"], conf_matrix_linear$byClass["Precision"]),
  F1_Score = c(conf_matrix_radial$byClass["F1"], conf_matrix_linear$byClass["F1"])
)

print("Performance Comparison:")
## [1] "Performance Comparison:"
print(metrics)
##        Model  Accuracy Sensitivity Specificity Precision F1_Score
## 1 SVM Radial 0.8172414   0.3366534   0.9554569 0.6849037 0.451419
## 2 SVM Linear 0.8061179   0.2420319   0.9683472 0.6874116 0.358011
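
The same table can be extended so that the random forest is compared alongside the two SVMs. A minimal sketch follows (the random forest confusion matrix was printed above but not stored in an object, so it is recomputed here):

# Sketch: add the random forest to the SVM comparison table
rf_preds       <- predict(rfModel, test_data)
conf_matrix_rf <- confusionMatrix(rf_preds, test_data$default_payment_next_month, positive = "yes")

metrics_all <- rbind(
  metrics,
  data.frame(
    Model       = "Random Forest",
    Accuracy    = conf_matrix_rf$overall[["Accuracy"]],
    Sensitivity = conf_matrix_rf$byClass[["Sensitivity"]],
    Specificity = conf_matrix_rf$byClass[["Specificity"]],
    Precision   = conf_matrix_rf$byClass[["Precision"]],
    F1_Score    = conf_matrix_rf$byClass[["F1"]]
  )
)
print(metrics_all)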