1. Introduction
2. Problem Statement
3. Motivations (20 points)
4. Data Preprocessing/Handling
5. Exploratory Analysis (20 points)
6. Modeling Details and Key Insights (20 points)
7. Final Takeaways and Conclusions (20 points)
8. Flow of Presentation (20 points)
The dataset used in this project is sourced from Kaggle and contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
Number of observations: 30,000
Number of variables: 25

Variables include:
LIMIT_BAL: amount of given credit (NT dollars)
SEX: gender (1 = male; 2 = female)
EDUCATION: education level (1 = graduate school; 2 = university; 3 = high school; 4 = others; 5, 6 = unknown)
MARRIAGE: marital status (1 = married; 2 = single; 3 = others)
AGE: age in years
PAY_0 to PAY_6: past monthly repayment status
BILL_AMT1 to BILL_AMT6: bill statement amounts
PAY_AMT1 to PAY_AMT6: amounts of previous payments
default.payment.next.month: default payment next month (1 = yes; 0 = no)
Banks worldwide face the critical challenge of accurately predicting which credit card customers will default on upcoming payments. With rising credit card usage globally, traditional static credit scoring methods are no longer sufficient. Financial institutions need data-driven approaches that leverage machine learning to identify high-risk customers before defaults occur, allowing for timely intervention while maintaining competitive credit offerings. This project aims to develop an accurate, interpretable predictive model using a comprehensive Taiwanese credit card dataset to enhance default risk assessment capabilities that can be applied across various banking environments.
As an aspiring financial analyst with a keen interest in banking operations, I’m driven to understand how data science can transform credit risk management. This Taiwanese dataset offers me practical experience with real-world financial data analysis that’s directly applicable to my career goals. By developing predictive models for credit default, I’ll gain valuable skills in risk assessment that are highly sought after in the banking industry. The insights from this analysis will deepen my understanding of how payment behaviors and demographic factors influence default probability—knowledge I can leverage in future financial analysis roles. This project allows me to demonstrate my ability to translate raw financial data into actionable business insights, positioning me as a data-driven financial professional who can help banks make more informed lending decisions and improve their bottom line.
#Load packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
library(ROCR)
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 3.5.2 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:dplyr':
##
## recode
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ROSE)
## Loaded ROSE 0.0-4
library(e1071)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.4.3
##
## randomForestSRC 3.3.3
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
##
##
## Attaching package: 'randomForestSRC'
##
## The following objects are masked from 'package:e1071':
##
## impute, tune
##
## The following object is masked from 'package:purrr':
##
## partial
#Reading in the dataset
default <- read.csv("C:/Users/doris/OneDrive/Documents/UCI_Credit_Card.csv/UCI_Credit_Card.csv")
# Copy the dataset
d1 <- default
# Remove ID since it's just a row identifier
d1 <- d1[, -1]
# verify the change
dim(default)
## [1] 30000 25
dim(d1)
## [1] 30000 24
str(d1)
## 'data.frame': 30000 obs. of 24 variables:
## $ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : int 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : int 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : int 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : int 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_0 : int 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ PAY_2 : int 2 2 0 0 0 0 0 -1 0 -2 ...
## $ PAY_3 : int -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ PAY_4 : int -1 0 0 0 0 0 0 0 0 -2 ...
## $ PAY_5 : int -2 0 0 0 0 0 0 0 0 -1 ...
## $ PAY_6 : int -2 2 0 0 0 0 0 -1 0 -1 ...
## $ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
## $ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
## $ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
## $ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
## $ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
## $ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
## $ PAY_AMT1 : num 0 0 1518 2000 2000 ...
## $ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
## $ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
## $ PAY_AMT5 : num 0 0 1000 1069 689 ...
## $ PAY_AMT6 : num 0 2000 5000 1000 679 ...
## $ default.payment.next.month: int 1 1 0 0 0 0 0 0 0 0 ...
#Check missing values
colSums(is.na(d1))
## LIMIT_BAL SEX
## 0 0
## EDUCATION MARRIAGE
## 0 0
## AGE PAY_0
## 0 0
## PAY_2 PAY_3
## 0 0
## PAY_4 PAY_5
## 0 0
## PAY_6 BILL_AMT1
## 0 0
## BILL_AMT2 BILL_AMT3
## 0 0
## BILL_AMT4 BILL_AMT5
## 0 0
## BILL_AMT6 PAY_AMT1
## 0 0
## PAY_AMT2 PAY_AMT3
## 0 0
## PAY_AMT4 PAY_AMT5
## 0 0
## PAY_AMT6 default.payment.next.month
## 0 0
# Check for duplicate rows
duplicates <- d1[duplicated(d1), ]
cat("Number of duplicate rows: ", nrow(duplicates), "\n")
## Number of duplicate rows: 35
# If duplicates are found, remove them
d1 <- d1[!duplicated(d1), ]
cat("Number of rows after removing duplicates: ", nrow(d1), "\n")
## Number of rows after removing duplicates: 29965
#Cleaning the column names
d1 = clean_names(d1)
colnames(d1)
## [1] "limit_bal" "sex"
## [3] "education" "marriage"
## [5] "age" "pay_0"
## [7] "pay_2" "pay_3"
## [9] "pay_4" "pay_5"
## [11] "pay_6" "bill_amt1"
## [13] "bill_amt2" "bill_amt3"
## [15] "bill_amt4" "bill_amt5"
## [17] "bill_amt6" "pay_amt1"
## [19] "pay_amt2" "pay_amt3"
## [21] "pay_amt4" "pay_amt5"
## [23] "pay_amt6" "default_payment_next_month"
summary(d1)
## limit_bal sex education marriage
## Min. : 10000 Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 140000 Median :2.000 Median :2.000 Median :2.000
## Mean : 167442 Mean :1.604 Mean :1.854 Mean :1.552
## 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :1000000 Max. :2.000 Max. :6.000 Max. :3.000
## age pay_0 pay_2 pay_3
## Min. :21.00 Min. :-2.00000 Min. :-2.0000 Min. :-2.0000
## 1st Qu.:28.00 1st Qu.:-1.00000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :34.00 Median : 0.00000 Median : 0.0000 Median : 0.0000
## Mean :35.49 Mean :-0.01675 Mean :-0.1319 Mean :-0.1644
## 3rd Qu.:41.00 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :79.00 Max. : 8.00000 Max. : 8.0000 Max. : 8.0000
## pay_4 pay_5 pay_6 bill_amt1
## Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-165580
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.: 3595
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 22438
## Mean :-0.2189 Mean :-0.2645 Mean :-0.2894 Mean : 51283
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 67260
## Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 964511
## bill_amt2 bill_amt3 bill_amt4 bill_amt5
## Min. :-69777 Min. :-157264 Min. :-170000 Min. :-81334
## 1st Qu.: 3010 1st Qu.: 2711 1st Qu.: 2360 1st Qu.: 1787
## Median : 21295 Median : 20135 Median : 19081 Median : 18130
## Mean : 49236 Mean : 47068 Mean : 43313 Mean : 40358
## 3rd Qu.: 64109 3rd Qu.: 60201 3rd Qu.: 54601 3rd Qu.: 50247
## Max. :983931 Max. :1664089 Max. : 891586 Max. :927171
## bill_amt6 pay_amt1 pay_amt2 pay_amt3
## Min. :-339603 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1262 1st Qu.: 1000 1st Qu.: 850 1st Qu.: 390
## Median : 17124 Median : 2102 Median : 2010 Median : 1804
## Mean : 38917 Mean : 5670 Mean : 5928 Mean : 5232
## 3rd Qu.: 49252 3rd Qu.: 5008 3rd Qu.: 5000 3rd Qu.: 4512
## Max. : 961664 Max. :873552 Max. :1684259 Max. :896040
## pay_amt4 pay_amt5 pay_amt6 default_payment_next_month
## Min. : 0 Min. : 0 Min. : 0 Min. :0.0000
## 1st Qu.: 300 1st Qu.: 261 1st Qu.: 131 1st Qu.:0.0000
## Median : 1500 Median : 1500 Median : 1500 Median :0.0000
## Mean : 4832 Mean : 4805 Mean : 5222 Mean :0.2213
## 3rd Qu.: 4016 3rd Qu.: 4042 3rd Qu.: 4000 3rd Qu.:0.0000
## Max. :621000 Max. :426529 Max. :528666 Max. :1.0000
# First check the levels, do they match the references
levels(as.factor(d1$sex))
## [1] "1" "2"
levels(as.factor(d1$education))
## [1] "0" "1" "2" "3" "4" "5" "6"
levels(as.factor(d1$marriage))
## [1] "0" "1" "2" "3"
levels(as.factor(d1$default_payment_next_month))
## [1] "0" "1"
According to the dataset documentation, marriage should take the values 1 = married, 2 = single, and 3 = others, but the data contain an additional undocumented level '0'. Education likewise has several unknown levels alongside a single 'others' category, so we recode all of these into one 'others' category to simplify the analysis.
# View the frequency count of marriage and education
# View frequency of each level
d1 %>% count(education)
d1 %>% count(marriage)
# Restructuring the categories to a simpler form and converting the
# appropriate variables into factors
library(dplyr)
# Rename and recode relevant variables
d1 <- d1 %>%
# Rename pay_ variables to corresponding months
rename(
pay_september = pay_0,
pay_august = pay_2,
pay_july = pay_3,
pay_june = pay_4,
pay_may = pay_5,
pay_april = pay_6
) %>%
mutate(
# Recode marriage
marriage = case_when(
marriage %in% c(1, 2, 3) ~ marriage,
marriage == 0 ~ 3 # unknown → others
),
# Recode education
education = case_when(
education %in% c(1, 2, 3) ~ education,
education %in% c(0, 4, 5, 6) ~ 4 # unknowns → others
),
# Convert categorical vars to factors
marriage = factor(marriage,
levels = c(1, 2, 3),
labels = c("married", "single", "others")),
education = factor(education,
levels = c(1, 2, 3, 4),
labels = c("graduate school", "university", "high school", "others")),
sex = factor(sex,
levels = c(1, 2),
labels = c("male", "female")),
default_payment_next_month = factor(default_payment_next_month,
levels = c(0, 1),
labels = c("no", "yes"))
)
# Check total NAs in the dataset
#colSums(is.na(d1))
library(dplyr)
# Convert pay_* variables to numeric
d1 <- d1 %>%
mutate(
age = as.numeric(age), # Convert age to numeric
across(
.cols = starts_with("pay_"),
.fns = ~ as.numeric(.),
.names = "{col}"
)
)
str(d1)
## 'data.frame': 29965 obs. of 24 variables:
## $ limit_bal : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 1 2 2 1 ...
## $ education : Factor w/ 4 levels "graduate school",..: 2 2 2 2 2 1 1 2 3 3 ...
## $ marriage : Factor w/ 3 levels "married","single",..: 1 2 2 1 1 2 2 2 1 2 ...
## $ age : num 24 26 34 37 57 37 29 23 28 35 ...
## $ pay_september : num 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ pay_august : num 2 2 0 0 0 0 0 -1 0 -2 ...
## $ pay_july : num -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ pay_june : num -1 0 0 0 0 0 0 0 0 -2 ...
## $ pay_may : num -2 0 0 0 0 0 0 0 0 -1 ...
## $ pay_april : num -2 2 0 0 0 0 0 -1 0 -1 ...
## $ bill_amt1 : num 3913 2682 29239 46990 8617 ...
## $ bill_amt2 : num 3102 1725 14027 48233 5670 ...
## $ bill_amt3 : num 689 2682 13559 49291 35835 ...
## $ bill_amt4 : num 0 3272 14331 28314 20940 ...
## $ bill_amt5 : num 0 3455 14948 28959 19146 ...
## $ bill_amt6 : num 0 3261 15549 29547 19131 ...
## $ pay_amt1 : num 0 0 1518 2000 2000 ...
## $ pay_amt2 : num 689 1000 1500 2019 36681 ...
## $ pay_amt3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ pay_amt4 : num 0 1000 1000 1100 9000 ...
## $ pay_amt5 : num 0 0 1000 1069 689 ...
## $ pay_amt6 : num 0 2000 5000 1000 679 ...
## $ default_payment_next_month: Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
# Verify the change
str(d1)
## 'data.frame': 29965 obs. of 24 variables:
## $ limit_bal : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 1 2 2 1 ...
## $ education : Factor w/ 4 levels "graduate school",..: 2 2 2 2 2 1 1 2 3 3 ...
## $ marriage : Factor w/ 3 levels "married","single",..: 1 2 2 1 1 2 2 2 1 2 ...
## $ age : num 24 26 34 37 57 37 29 23 28 35 ...
## $ pay_september : num 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ pay_august : num 2 2 0 0 0 0 0 -1 0 -2 ...
## $ pay_july : num -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ pay_june : num -1 0 0 0 0 0 0 0 0 -2 ...
## $ pay_may : num -2 0 0 0 0 0 0 0 0 -1 ...
## $ pay_april : num -2 2 0 0 0 0 0 -1 0 -1 ...
## $ bill_amt1 : num 3913 2682 29239 46990 8617 ...
## $ bill_amt2 : num 3102 1725 14027 48233 5670 ...
## $ bill_amt3 : num 689 2682 13559 49291 35835 ...
## $ bill_amt4 : num 0 3272 14331 28314 20940 ...
## $ bill_amt5 : num 0 3455 14948 28959 19146 ...
## $ bill_amt6 : num 0 3261 15549 29547 19131 ...
## $ pay_amt1 : num 0 0 1518 2000 2000 ...
## $ pay_amt2 : num 689 1000 1500 2019 36681 ...
## $ pay_amt3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ pay_amt4 : num 0 1000 1000 1100 9000 ...
## $ pay_amt5 : num 0 0 1000 1069 689 ...
## $ pay_amt6 : num 0 2000 5000 1000 679 ...
## $ default_payment_next_month: Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
I’ll begin by visually inspecting the relationship between default status and each demographic variable (education, marital status, sex, and age), and then perform a chi-square test for each to assess statistical significance.
# First, create a bar plot of the response variable with percentages
# Visualize the dependent variable distribution
# Create a table with counts and percentages
class_dist <- d1 %>%
tabyl(default_payment_next_month) %>%
mutate(percent = round(percent * 100, 1)) # Make percentage nicely rounded
# Plot
ggplot(class_dist, aes(x = default_payment_next_month, y = percent, fill = default_payment_next_month)) +
geom_bar(stat = "identity", width = 0.5) +
geom_text(aes(label = paste0(percent, "%")), vjust = -0.5, size = 5) + # Add percentages on top
scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.05))) + # Y-axis from 0 to 100
scale_fill_manual(values = c("no" = "red", "yes" = "blue")) +
labs(
title = "Non-Default vs Default",
x = "Default Status",
y = "Percentage"
) +
theme_minimal() +
theme(legend.position = "none",
text = element_text(size = 14))
# Default rate by Sex
sex_default <- d1 %>%
group_by(sex, default_payment_next_month) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(sex) %>%
mutate(prop = count / sum(count),
percent = prop * 100)
sex_plot <- ggplot(sex_default %>% filter(default_payment_next_month == "yes"),
aes(x = sex, y = percent, fill = sex)) +
geom_col(width = 0.5) +
geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.1, size = 4) +
scale_fill_manual(values = c("male" = "#4393c3", "female" = "#d6604d")) +
labs(title = "Default Rate by Gender",
x = "Gender",
y = "Default Rate (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
print(sex_plot)
# Chi-square test for sex and default
sex_table <- table(d1$sex, d1$default_payment_next_month)
sex_chi <- chisq.test(sex_table)
print(sex_chi)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: sex_table
## X-squared = 47.131, df = 1, p-value = 6.639e-12
if(sex_chi$p.value < 0.05) {
cat("Gender: Significant differences in default rates (p =", sex_chi$p.value, ")\n")
} else {
cat("Gender: No significant differences in default rates (p =", sex_chi$p.value, ")\n")
}
## Gender: Significant differences in default rates (p = 6.639289e-12 )
# Default rate by Education
edu_default <- d1 %>%
group_by(education, default_payment_next_month) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(education) %>%
mutate(prop = count / sum(count),
percent = prop * 100)
edu_plot <- ggplot(edu_default %>% filter(default_payment_next_month == "yes"),
aes(x = reorder(education, -percent), y = percent, fill = education)) +
geom_col(width = 0.5) +
geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.005, size = 4) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Default Rate by Education Level",
x = "Education Level",
y = "Default Rate (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
print(edu_plot)
# Chi-square test for education and default
edu_table <- table(d1$education, d1$default_payment_next_month)
edu_chi <- chisq.test(edu_table)
print(edu_chi)
##
## Pearson's Chi-squared test
##
## data: edu_table
## X-squared = 160.46, df = 3, p-value < 2.2e-16
# Interpretation with full p-value
if(edu_chi$p.value < 0.05) {
cat("Education: Significant differences in default rates (p =", edu_chi$p.value, ")\n")
} else {
cat("Education: No significant differences in default rates (p =", edu_chi$p.value, ")\n")
}
## Education: Significant differences in default rates (p = 1.458089e-34 )
# Default rate by Marriage Status
marriage_default <- d1 %>%
group_by(marriage, default_payment_next_month) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(marriage) %>%
mutate(prop = count / sum(count),
percent = prop * 100)
marriage_plot <- ggplot(marriage_default %>% filter(default_payment_next_month == "yes"),
aes(x = reorder(marriage, -percent), y = percent, fill = marriage)) +
geom_col(width = 0.5) +
geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.01, size = 4) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Default Rate by Marital Status",
x = "Marital Status",
y = "Default Rate (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
print(marriage_plot)
# Chi-square test for marriage and default
marriage_table <- table(d1$marriage, d1$default_payment_next_month)
marriage_chi <- chisq.test(marriage_table)
print(marriage_chi)
##
## Pearson's Chi-squared test
##
## data: marriage_table
## X-squared = 27.489, df = 2, p-value = 1.074e-06
# Interpretation
if(marriage_chi$p.value < 0.05) {
cat("Marriage: Significant differences in default rates (p =", marriage_chi$p.value, ")\n")
} else {
cat("Marriage: No significant differences in default rates (p =", marriage_chi$p.value, ")\n")
}
## Marriage: Significant differences in default rates (p = 1.073732e-06 )
# Default rate by Age
# First, we create age groups for better visualization
# Then, calculate the default rates by age group
d1 <- d1 %>%
mutate(age_group = case_when(
age < 30 ~ "< 30",
age >= 30 & age < 40 ~ "30-39",
age >= 40 & age < 50 ~ "40-49",
age >= 50 & age < 60 ~ "50-59",
age >= 60 ~ "60+"
),
age_group = factor(age_group, levels = c("< 30", "30-39", "40-49", "50-59", "60+")))
# Calculate the default rates by age group
age_default <- d1 %>%
group_by(age_group, default_payment_next_month) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(age_group) %>%
mutate(prop = count / sum(count),
percent = prop * 100)
# Create a new dataframe with just the default rates (where default_payment_next_month == "yes")
default_rates <- age_default %>%
filter(default_payment_next_month == "yes") %>%
arrange(desc(percent)) # Sort by descending default rate
# Create a new factor order based on the sorted default rates
new_order <- default_rates$age_group
# Update the factor levels in the age_default dataframe
age_default <- age_default %>%
mutate(age_group = factor(age_group, levels = new_order))
# Now create the plot with the reordered bars
age_plot <- ggplot(age_default %>% filter(default_payment_next_month == "yes"),
aes(x = age_group, y = percent, fill = age_group)) +
geom_col(width = 0.5) +
geom_text(aes(label = sprintf("%.1f%%", percent)), vjust = -0.2, size = 4) +
scale_fill_brewer(palette = "YlOrRd") +
labs(title = "Default Rate by Age Group",
x = "Age Group",
y = "Default Rate (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
# Print the plot
print(age_plot)
# Chi-square test for age groups and default
age_table <- table(d1$age_group, d1$default_payment_next_month)
age_chi <- chisq.test(age_table)
print(age_chi)
##
## Pearson's Chi-squared test
##
## data: age_table
## X-squared = 45.812, df = 4, p-value = 2.695e-09
# Interpretation
if(age_chi$p.value < 0.05) {
cat("Age Group: Significant differences in default rates (p =", sprintf("%.4f", age_chi$p.value), ")\n")
} else {
cat("Age Group: No significant differences in default rates (p =", sprintf("%.4f", age_chi$p.value), ")\n")
}
## Age Group: Significant differences in default rates (p = 0.0000 )
Overall hypothesis framework:
Null hypothesis (H₀): default rates are independent of the demographic factors (sex, education, marital status, age).
Alternative hypothesis (H₁): default rates differ across demographic groups, i.e., there is an association between default status and the demographic variable.
All four tests (sex, education, marriage, and age group) show statistically significant associations with default status; therefore, default rates are not independent of these demographic factors.
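As a compact summary, the four test results computed above can be gathered into one table (a sketch that reuses the sex_chi, edu_chi, marriage_chi, and age_chi objects):

# Collect the four chi-square results into a single summary table
chi_summary <- data.frame(
  Variable  = c("Sex", "Education", "Marriage", "Age group"),
  Statistic = c(sex_chi$statistic, edu_chi$statistic,
                marriage_chi$statistic, age_chi$statistic),
  df        = c(sex_chi$parameter, edu_chi$parameter,
                marriage_chi$parameter, age_chi$parameter),
  p_value   = c(sex_chi$p.value, edu_chi$p.value,
                marriage_chi$p.value, age_chi$p.value)
)
knitr::kable(chi_summary, digits = 4)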
# Interactive display of demographic groups with highest default rates
# Summarize default rates by all demographic combinations
demographic_default <- d1 %>%
group_by(sex, education, marriage, age_group) %>%
summarise(
total = n(),
default_count = sum(default_payment_next_month == "yes"),
default_rate = mean(default_payment_next_month == "yes") * 100,
.groups = "drop"
) %>%
arrange(desc(default_rate))
# Display top 10 demographic groups with highest default rates
top_10_groups <- head(demographic_default, 10)
print("Top 10 Demographic Groups with Highest Default Rates:")
## [1] "Top 10 Demographic Groups with Highest Default Rates:"
print(top_10_groups)
## # A tibble: 10 × 7
## sex education marriage age_group total default_count default_rate
## <fct> <fct> <fct> <fct> <int> <int> <dbl>
## 1 female university others 60+ 1 1 100
## 2 female university single 60+ 4 3 75
## 3 female graduate school others < 30 5 3 60
## 4 male graduate school others 50-59 8 4 50
## 5 male high school others < 30 6 3 50
## 6 female high school others 50-59 23 10 43.5
## 7 male university married 60+ 45 19 42.2
## 8 male university others 40-49 27 11 40.7
## 9 female graduate school single 60+ 5 2 40
## 10 female high school single 60+ 16 6 37.5
# Create a numeric version of the default variable (0 = no, 1 = yes)
d1$default_num <- ifelse(d1$default_payment_next_month == "yes", 1, 0)
# Select payment history variables and default for correlation analysis
pay_vars <- c("pay_september", "pay_august", "pay_july", "pay_june", "pay_may", "pay_april")
corr_data <- d1[, c(pay_vars, "default_num")]
# Calculate correlations
cor_matrix <- cor(sapply(corr_data, as.numeric), use = "complete.obs")
# Print correlation with default
cat("Correlation between payment variables and default:\n")
## Correlation between payment variables and default:
cor_with_default <- cor_matrix["default_num", pay_vars]
print(cor_with_default)
## pay_september pay_august pay_july pay_june pay_may
## 0.3249638 0.2636562 0.2352305 0.2165509 0.2040591
## pay_april
## 0.1867396
# Sort by absolute correlation value to identify strongest correlations
sorted_cors <- sort(abs(cor_with_default), decreasing = TRUE)
cat("\nPayment variables sorted by strength of correlation with default:\n")
##
## Payment variables sorted by strength of correlation with default:
print(sorted_cors)
## pay_september pay_august pay_july pay_june pay_may
## 0.3249638 0.2636562 0.2352305 0.2165509 0.2040591
## pay_april
## 0.1867396
# Visualize correlations with default
barplot(cor_with_default[order(abs(cor_with_default), decreasing = TRUE)],
main = "Correlation between Payment History and Default",
xlab = "Payment Variables",
ylab = "Correlation Coefficient",
col = ifelse(cor_with_default[order(abs(cor_with_default), decreasing = TRUE)] > 0, "red", "blue"),
las = 2)
abline(h = 0)
legend("bottomright", legend = c("Positive correlation", "Negative correlation"),
fill = c("red", "blue"))
# Statistical significance testing for correlations
# Using point-biserial correlation (equivalent to t-test for binary and continuous variables)
# Initialize data frame to store results
significance_tests <- data.frame(
Payment_Variable = pay_vars,
Correlation = cor_with_default,
p_value = numeric(length(pay_vars)) # initialize with empty p-values
)
# Run point-biserial correlation tests for each payment variable
for (i in 1:length(pay_vars)) {
x <- as.numeric(d1[[pay_vars[i]]]) # Convert x to numeric
y <- d1$default_num # Already numeric
test_result <- cor.test(x, y)
significance_tests$p_value[i] <- test_result$p.value
}
# Sort by absolute correlation value
significance_tests <- significance_tests[order(abs(significance_tests$Correlation), decreasing = TRUE), ]
# Print results as formatted table
cat("\nStatistical significance of correlations:\n")
##
## Statistical significance of correlations:
print(knitr::kable(significance_tests, digits = 4))
##
##
## | |Payment_Variable | Correlation| p_value|
## |:-------------|:----------------|-----------:|-------:|
## |pay_september |pay_september | 0.3250| 0|
## |pay_august |pay_august | 0.2637| 0|
## |pay_july |pay_july | 0.2352| 0|
## |pay_june |pay_june | 0.2166| 0|
## |pay_may |pay_may | 0.2041| 0|
## |pay_april |pay_april | 0.1867| 0|
pay_september has the strongest correlation with default (approximately 0.325), followed by pay_august (0.264) and pay_july (0.235); the remaining variables (pay_june, pay_may, pay_april) show progressively weaker correlations.
library(forcats) # for fct_reorder
# Visualize default rates by payment status for the top correlated variable
# (pay_september, formerly PAY_0, is the most correlated based on the results above)
top_var <- names(sorted_cors)[1]
default_by_payment <- d1 %>%
group_by(across(all_of(top_var)), default_payment_next_month) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(across(all_of(top_var))) %>%
mutate(total = sum(count),
percentage = count / total * 100) %>%
filter(default_payment_next_month == "yes") %>%
arrange(across(all_of(top_var)))
ggplot(default_by_payment, aes(x = fct_reorder(factor(get(top_var)), percentage, .desc = TRUE), y = percentage)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = sprintf("%.1f%%", percentage)), vjust = -0.2) +
labs(title = paste("Default Rate by", top_var, "Status"),
x = paste(top_var, "Payment Status (-2 = No consumption, -1 = Paid in full, etc.)"),
y = "Default Rate (%)") +
theme_minimal()
The resulting visualization shows how default rates vary across different payment statuses. From the chart:
- Lower payment statuses (-2, -1, 0) have relatively low default rates (around 13-17%).
- Default rates jump significantly at status 1 (34%).
- The highest default rates occur at statuses 2, 3, and 7 (69.1%, 75.8%, and 77.8% respectively).
- Default rates dip at statuses 4, 5, and 6 before rising again.
This suggests that certain payment behaviors (particularly those coded as 2, 3, and 7) are strong predictors of future defaults. The payment status values appear to use a coding system where -2 means “No consumption,” -1 means “Paid in full,” and other values indicate different levels of payment delays or issues.
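Since the raw status codes are cryptic in plots, a small labelling helper can make them readable. This is a sketch: the labels for -2 and -1 follow the interpretation above, while the label for 0 is an assumption.

# Sketch: human-readable labels for the pay_* status codes
# (-2 and -1 follow the interpretation given above; labelling 0 as "revolving" is
#  an assumption; positive values are read as months of payment delay)
label_pay_status <- function(x) {
  dplyr::case_when(
    x == -2 ~ "no consumption",
    x == -1 ~ "paid in full",
    x ==  0 ~ "revolving / minimum payment",
    TRUE    ~ paste0("delayed ", x, " month(s)")
  )
}
table(label_pay_status(d1$pay_september))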
# Step 1: Create total bill and payment columns
d1 <- d1 %>%
mutate(total_bill = bill_amt1 + bill_amt2 + bill_amt3 + bill_amt4 + bill_amt5 + bill_amt6,
total_payment = pay_amt1 + pay_amt2 + pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6,
bill_payment_ratio = total_bill / (total_payment + 1)) # +1 to avoid division by zero
# Step 2: Visualize the Ratio Distribution by Default Status
# Filter out infinite or non-positive ratios
d1_clean <- d1 %>%
filter(is.finite(bill_payment_ratio), bill_payment_ratio > 0)
# Now plot again safely
ggplot(d1_clean, aes(x = bill_payment_ratio, fill = default_payment_next_month)) +
geom_density(alpha = 0.5) +
scale_x_log10() +
labs(title = "Bill-to-Payment Ratio by Default Status",
x = "Bill-to-Payment Ratio (log scale)", y = "Density") +
theme_minimal()
The density plot above shows a clear difference in the distribution of bill-to-payment ratios between defaulters and non-defaulters. Both distributions appear bimodal, with defaulters having a higher density at larger bill-to-payment ratios.
# Step 1: Create Bins (log10 scale makes sense due to skew)
d1_binned <- d1_clean %>%
mutate(ratio_bin = cut(log10(bill_payment_ratio),
breaks = seq(-1, 6, by = 0.5), # log10(0.1) to log10(1,000,000)
include.lowest = TRUE))
# Step 2: Calculate default rate per bin
default_by_bin <- d1_binned %>%
group_by(ratio_bin) %>%
summarise(
count = n(),
default_rate = mean(default_payment_next_month == "yes")
) %>%
filter(!is.na(ratio_bin)) # remove any NA bins if present
# Step 3: Plot
ggplot(default_by_bin, aes(x = ratio_bin, y = default_rate)) +
geom_col(fill = "#3b82f6") +
labs(title = "Default Rate by Bill-to-Payment Ratio Bin",
x = "Log10(Bill-to-Payment Ratio)",
y = "Default Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart of default rate by ratio bin reveals that default rates generally increase as the bill-to-payment ratio increases, especially where log10(ratio) lies between 3.5 and 4.5, corresponding to bill-to-payment ratios of approximately 3,000 to 30,000. The highest default rate (over 60%) occurs in the (4, 4.5] bin.
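The same conclusion can be read off programmatically (a sketch using the default_by_bin summary computed above):

# Sketch: the ratio bins with the highest default rates
default_by_bin %>%
  arrange(desc(default_rate)) %>%
  head(3)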
# Subset to 5000 random samples from each group
set.seed(123) # for reproducibility
no_sample <- sample(d1$bill_payment_ratio[d1$default_payment_next_month == "no"], 5000)
yes_sample <- sample(d1$bill_payment_ratio[d1$default_payment_next_month == "yes"], 5000)
# Shapiro-Wilk on subsamples
shapiro.test(no_sample)
##
## Shapiro-Wilk normality test
##
## data: no_sample
## W = 0.027618, p-value < 2.2e-16
shapiro.test(yes_sample)
##
## Shapiro-Wilk normality test
##
## data: yes_sample
## W = 0.017158, p-value < 2.2e-16
Normality is violated, so a parametric test is not appropriate; I use a non-parametric test instead.
# Step 3: Perform a Wilcoxon rank sum test
wilcoxon_result <- wilcox.test(bill_payment_ratio ~ default_payment_next_month, data = d1)
# Print the result of the Wilcoxon test
print(wilcoxon_result)
##
## Wilcoxon rank sum test with continuity correction
##
## data: bill_payment_ratio by default_payment_next_month
## W = 65873983, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# Interpretation
if (wilcoxon_result$p.value < 0.05) {
cat("There is a significant difference in bill-to-payment ratios between defaulters and non-defaulters (p =", round(wilcoxon_result$p.value, 4), ").\n")
} else {
cat("There is no significant difference in bill-to-payment ratios between defaulters and non-defaulters (p =", round(wilcoxon_result$p.value, 4), ").\n")
}
## There is a significant difference in bill-to-payment ratios between defaulters and non-defaulters (p = 0 ).
The statistical analysis confirms these visual observations: The Shapiro-Wilk tests for both groups have p-values < 2.2e-16, indicating the data is not normally distributed, which justifies using non-parametric tests. The Wilcoxon rank sum test shows a highly significant difference (p < 2.2e-16) between the bill-to-payment ratios of defaulters and non-defaulters.
Interpretation: since the p-value is less than 0.05, we conclude that there is a significant difference in bill-to-payment ratios between defaulters and non-defaulters; the distribution of the ratio differs between the two groups.
To answer my specific question: yes, there does appear to be a critical threshold at which the bill-to-payment ratio substantially increases default probability. Based on the binned analysis, this threshold sits around a log10 ratio of 3.5 (a raw ratio of roughly 3,000). Beyond this point, default rates rise sharply, peaking in the (4, 4.5] bin (ratios of approximately 10,000 to 30,000).
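For reference, the bin boundaries quoted above convert from the log10 scale back to raw ratios as follows:

# Convert the log10 bin boundaries back to raw bill-to-payment ratios
10^c(3.5, 4, 4.5)   # approx. 3162, 10000, 31623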
# Correlation Matrix
continuous_vars <- c(
"limit_bal",
"age",
"bill_amt1", "bill_amt2", "bill_amt3", "bill_amt4", "bill_amt5", "bill_amt6",
"pay_amt1", "pay_amt2", "pay_amt3", "pay_amt4", "pay_amt5", "pay_amt6"
)
# Subset to only include continuous variables that exist in your dataset
continuous_vars <- continuous_vars[continuous_vars %in% names(d1)]
# Create a correlation matrix
correlation_matrix <- cor(d1[continuous_vars], use = "pairwise.complete.obs")
# Optional: Round to 2 decimal places for better readability
correlation_matrix <- round(correlation_matrix, 2)
# Print the correlation matrix
print(correlation_matrix)
## limit_bal age bill_amt1 bill_amt2 bill_amt3 bill_amt4 bill_amt5
## limit_bal 1.00 0.14 0.29 0.28 0.28 0.29 0.30
## age 0.14 1.00 0.06 0.05 0.05 0.05 0.05
## bill_amt1 0.29 0.06 1.00 0.95 0.89 0.86 0.83
## bill_amt2 0.28 0.05 0.95 1.00 0.93 0.89 0.86
## bill_amt3 0.28 0.05 0.89 0.93 1.00 0.92 0.88
## bill_amt4 0.29 0.05 0.86 0.89 0.92 1.00 0.94
## bill_amt5 0.30 0.05 0.83 0.86 0.88 0.94 1.00
## bill_amt6 0.29 0.05 0.80 0.83 0.85 0.90 0.95
## pay_amt1 0.20 0.03 0.14 0.28 0.24 0.23 0.22
## pay_amt2 0.18 0.02 0.10 0.10 0.32 0.21 0.18
## pay_amt3 0.21 0.03 0.16 0.15 0.13 0.30 0.25
## pay_amt4 0.20 0.02 0.16 0.15 0.14 0.13 0.29
## pay_amt5 0.22 0.02 0.17 0.16 0.18 0.16 0.14
## pay_amt6 0.22 0.02 0.18 0.17 0.18 0.18 0.16
## bill_amt6 pay_amt1 pay_amt2 pay_amt3 pay_amt4 pay_amt5 pay_amt6
## limit_bal 0.29 0.20 0.18 0.21 0.20 0.22 0.22
## age 0.05 0.03 0.02 0.03 0.02 0.02 0.02
## bill_amt1 0.80 0.14 0.10 0.16 0.16 0.17 0.18
## bill_amt2 0.83 0.28 0.10 0.15 0.15 0.16 0.17
## bill_amt3 0.85 0.24 0.32 0.13 0.14 0.18 0.18
## bill_amt4 0.90 0.23 0.21 0.30 0.13 0.16 0.18
## bill_amt5 0.95 0.22 0.18 0.25 0.29 0.14 0.16
## bill_amt6 1.00 0.20 0.17 0.23 0.25 0.31 0.12
## pay_amt1 0.20 1.00 0.29 0.25 0.20 0.15 0.19
## pay_amt2 0.17 0.29 1.00 0.24 0.18 0.18 0.16
## pay_amt3 0.23 0.25 0.24 1.00 0.22 0.16 0.16
## pay_amt4 0.25 0.20 0.18 0.22 1.00 0.15 0.16
## pay_amt5 0.31 0.15 0.18 0.16 0.15 1.00 0.15
## pay_amt6 0.12 0.19 0.16 0.16 0.16 0.15 1.00
# Create a heatmap visualization
library(corrplot)
## corrplot 0.95 loaded
corrplot(correlation_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.col = "black",
tl.srt = 45,
diag = FALSE,
mar = c(0,0,2,0),
title = "Correlation Matrix of Continuous Predictors")
# Alternative with ggplot2 showing values
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
# Convert correlation matrix to long format
corr_long <- melt(correlation_matrix)
# Create the heatmap with visible coefficients
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", value)), size = 2.5) +
scale_fill_gradient2(low = "#E46726", mid = "white", high = "#6D9EC1",
midpoint = 0, limit = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.position = "right") +
labs(title = "Correlation Matrix of Continuous Predictors", x = "", y = "") +
coord_fixed()
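As a numerical complement to the heatmap, the strongly correlated pairs can be listed directly (a sketch using the correlation_matrix object computed above; the 0.8 cutoff is an arbitrary choice):

# Sketch: list predictor pairs with |r| >= 0.8
high_corr <- which(abs(correlation_matrix) >= 0.8 & upper.tri(correlation_matrix),
                   arr.ind = TRUE)
data.frame(
  var1 = rownames(correlation_matrix)[high_corr[, 1]],
  var2 = colnames(correlation_matrix)[high_corr[, 2]],
  r    = correlation_matrix[high_corr]
)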
Bill amounts: there appear to be strong positive correlations (dark blue) between consecutive bill amounts (bill_amt1 through bill_amt6). This makes sense, as customers with high bills in one month likely have high bills in adjacent months.
Credit limit (limit_bal): shows moderate positive correlations with the bill amounts, suggesting customers with higher credit limits tend to carry higher bills.
Age: has only weak correlations with the other variables (very light colors), indicating age is not strongly related to spending behavior in this dataset.
Payment amounts: the payment variables (pay_amt1 through pay_amt6) show some correlation with each other and with the corresponding bill amounts, but these relationships are weaker than the correlations among the bill amounts.
# Models
# First we need to remove the generated columns before splitting the data
d1_clean <- d1 %>%
dplyr::select(-age_group, -default_num, -total_bill, -total_payment, -bill_payment_ratio)
# Now create the training and testing datasets with clean data
set.seed(123)
trainIndex <- sample(1:nrow(d1_clean), 0.7*nrow(d1_clean))
train_data <- d1_clean[trainIndex, ]
test_data <- d1_clean[-trainIndex, ]
# Check the class distribution in the training data
table(train_data$default_payment_next_month)
##
## no yes
## 16353 4622
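The training classes are imbalanced (roughly 78% non-defaulters versus 22% defaulters). The ROSE package loaded earlier is not applied in the models below, but as a sketch, the minority class could be rebalanced by random over-sampling like this (ovun.sample is a standard ROSE function; the balanced data set is not used further here):

# Sketch (not used below): rebalance the training classes with ROSE over-sampling
train_bal <- ovun.sample(default_payment_next_month ~ .,
                         data = train_data, method = "over",
                         p = 0.5, seed = 123)$data
table(train_bal$default_payment_next_month)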
set.seed(123)
# logistic regression model
model <- glm(default_payment_next_month ~ ., data = train_data, family = binomial)
summary(model)
##
## Call:
## glm(formula = default_payment_next_month ~ ., family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.601e-01 1.033e-01 -9.294 < 2e-16 ***
## limit_bal -8.435e-07 1.896e-07 -4.448 8.66e-06 ***
## sexfemale -1.443e-01 3.678e-02 -3.922 8.78e-05 ***
## educationuniversity -9.988e-02 4.270e-02 -2.339 0.019325 *
## educationhigh school -1.234e-01 5.703e-02 -2.164 0.030452 *
## educationothers -1.001e+00 2.120e-01 -4.720 2.36e-06 ***
## marriagesingle -1.908e-01 4.141e-02 -4.608 4.06e-06 ***
## marriageothers -1.975e-01 1.593e-01 -1.240 0.215067
## age 5.773e-03 2.222e-03 2.598 0.009369 **
## pay_september 5.627e-01 2.138e-02 26.314 < 2e-16 ***
## pay_august 9.043e-02 2.418e-02 3.740 0.000184 ***
## pay_july 7.441e-02 2.676e-02 2.781 0.005427 **
## pay_june 3.731e-02 2.960e-02 1.261 0.207434
## pay_may 3.976e-02 3.232e-02 1.230 0.218515
## pay_april -1.106e-02 2.659e-02 -0.416 0.677428
## bill_amt1 -4.643e-06 1.302e-06 -3.567 0.000362 ***
## bill_amt2 1.210e-06 1.742e-06 0.694 0.487496
## bill_amt3 1.272e-06 1.586e-06 0.802 0.422494
## bill_amt4 7.824e-07 1.568e-06 0.499 0.617851
## bill_amt5 4.611e-07 1.777e-06 0.259 0.795327
## bill_amt6 3.025e-08 1.420e-06 0.021 0.983010
## pay_amt1 -1.213e-05 2.676e-06 -4.531 5.87e-06 ***
## pay_amt2 -8.955e-06 2.405e-06 -3.724 0.000196 ***
## pay_amt3 -2.587e-06 1.980e-06 -1.307 0.191303
## pay_amt4 -3.160e-06 2.064e-06 -1.531 0.125749
## pay_amt5 -4.914e-06 2.258e-06 -2.176 0.029551 *
## pay_amt6 -2.699e-06 1.612e-06 -1.674 0.094116 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22123 on 20974 degrees of freedom
## Residual deviance: 19412 on 20948 degrees of freedom
## AIC: 19466
##
## Number of Fisher Scoring iterations: 6
# Check multicollinearity
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## limit_bal 1.530942 1 1.237312
## sex 1.027274 1 1.013545
## education 1.225509 3 1.034474
## marriage 1.332737 2 1.074450
## age 1.381564 1 1.175400
## pay_september 1.517490 1 1.231865
## pay_august 2.626394 1 1.620615
## pay_july 3.190245 1 1.786126
## pay_june 3.792096 1 1.947331
## pay_may 4.275140 1 2.067641
## pay_april 3.002595 1 1.732800
## bill_amt1 22.110241 1 4.702153
## bill_amt2 37.857469 1 6.152842
## bill_amt3 29.284916 1 5.411554
## bill_amt4 25.634745 1 5.063077
## bill_amt5 30.126752 1 5.488784
## bill_amt6 18.353047 1 4.284046
## pay_amt1 1.441830 1 1.200762
## pay_amt2 1.463007 1 1.209548
## pay_amt3 1.487660 1 1.219697
## pay_amt4 1.413433 1 1.188879
## pay_amt5 1.464211 1 1.210046
## pay_amt6 1.108250 1 1.052734
# First I removed the predictor with the highest VIF to see if that reduces the variance inflation
model2 <- glm(default_payment_next_month ~ .-bill_amt2, data = train_data, family = binomial)
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## limit_bal 1.529941 1 1.236908
## sex 1.027089 1 1.013454
## education 1.225390 3 1.034457
## marriage 1.332773 2 1.074457
## age 1.381716 1 1.175464
## pay_september 1.516556 1 1.231485
## pay_august 2.625055 1 1.620202
## pay_july 3.183112 1 1.784128
## pay_june 3.792290 1 1.947380
## pay_may 4.275262 1 2.067671
## pay_april 3.002368 1 1.732734
## bill_amt1 9.843152 1 3.137380
## bill_amt3 22.721659 1 4.766724
## bill_amt4 25.579847 1 5.057652
## bill_amt5 30.076344 1 5.484190
## bill_amt6 18.342857 1 4.282856
## pay_amt1 1.223209 1 1.105988
## pay_amt2 1.331716 1 1.154000
## pay_amt3 1.485735 1 1.218907
## pay_amt4 1.412986 1 1.188691
## pay_amt5 1.463914 1 1.209923
## pay_amt6 1.108141 1 1.052683
# Next I removed the predictor with the highest remaining VIF to see if that further reduces the variance inflation
model3 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5, data = train_data, family = binomial)
vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## limit_bal 1.527904 1 1.236084
## sex 1.027053 1 1.013436
## education 1.224670 3 1.034355
## marriage 1.332746 2 1.074452
## age 1.381691 1 1.175454
## pay_september 1.516373 1 1.231411
## pay_august 2.625210 1 1.620250
## pay_july 3.183250 1 1.784166
## pay_june 3.789664 1 1.946706
## pay_may 4.273017 1 2.067128
## pay_april 2.998393 1 1.731587
## bill_amt1 9.842675 1 3.137304
## bill_amt3 22.652209 1 4.759434
## bill_amt4 20.294946 1 4.504991
## bill_amt6 9.207984 1 3.034466
## pay_amt1 1.222935 1 1.105864
## pay_amt2 1.331620 1 1.153959
## pay_amt3 1.483029 1 1.217797
## pay_amt4 1.212957 1 1.101343
## pay_amt5 1.249524 1 1.117821
## pay_amt6 1.100787 1 1.049184
# Once more I removed the predictor with the highest remaining VIF
model4 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5 -bill_amt3, data = train_data, family = binomial)
vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## limit_bal 1.526431 1 1.235488
## sex 1.026934 1 1.013378
## education 1.224426 3 1.034321
## marriage 1.332630 2 1.074428
## age 1.381418 1 1.175337
## pay_september 1.516166 1 1.231327
## pay_august 2.622895 1 1.619535
## pay_july 3.175906 1 1.782107
## pay_june 3.785788 1 1.945710
## pay_may 4.273264 1 2.067187
## pay_april 2.999200 1 1.731820
## bill_amt1 5.727922 1 2.393308
## bill_amt4 13.677470 1 3.698306
## bill_amt6 9.096204 1 3.015991
## pay_amt1 1.177919 1 1.085320
## pay_amt2 1.195389 1 1.093338
## pay_amt3 1.240474 1 1.113766
## pay_amt4 1.205400 1 1.097907
## pay_amt5 1.248262 1 1.117257
## pay_amt6 1.099634 1 1.048634
set.seed(123)
# Again I removed the predictor with the highest remaining VIF
model5 <- glm(default_payment_next_month ~ .-bill_amt2 -bill_amt5 -bill_amt3 -bill_amt4, data = train_data, family = binomial)
summary(model5)
##
## Call:
## glm(formula = default_payment_next_month ~ . - bill_amt2 - bill_amt5 -
## bill_amt3 - bill_amt4, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.624e-01 1.033e-01 -9.318 < 2e-16 ***
## limit_bal -8.494e-07 1.893e-07 -4.488 7.19e-06 ***
## sexfemale -1.438e-01 3.677e-02 -3.912 9.14e-05 ***
## educationuniversity -1.010e-01 4.268e-02 -2.366 0.017962 *
## educationhigh school -1.250e-01 5.701e-02 -2.193 0.028332 *
## educationothers -9.940e-01 2.119e-01 -4.690 2.73e-06 ***
## marriagesingle -1.908e-01 4.140e-02 -4.609 4.05e-06 ***
## marriageothers -1.992e-01 1.593e-01 -1.250 0.211260
## age 5.818e-03 2.222e-03 2.619 0.008829 **
## pay_september 5.635e-01 2.138e-02 26.355 < 2e-16 ***
## pay_august 8.865e-02 2.417e-02 3.668 0.000244 ***
## pay_july 7.752e-02 2.668e-02 2.905 0.003668 **
## pay_june 3.930e-02 2.953e-02 1.331 0.183190
## pay_may 4.161e-02 3.224e-02 1.291 0.196861
## pay_april -1.339e-02 2.655e-02 -0.504 0.614120
## bill_amt1 -2.626e-06 5.316e-07 -4.939 7.84e-07 ***
## bill_amt6 1.703e-06 6.840e-07 2.490 0.012766 *
## pay_amt1 -1.031e-05 2.395e-06 -4.306 1.66e-05 ***
## pay_amt2 -7.822e-06 2.125e-06 -3.680 0.000233 ***
## pay_amt3 -2.745e-06 1.718e-06 -1.598 0.109941
## pay_amt4 -3.883e-06 1.841e-06 -2.109 0.034945 *
## pay_amt5 -6.208e-06 2.012e-06 -3.085 0.002034 **
## pay_amt6 -2.468e-06 1.603e-06 -1.539 0.123785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22123 on 20974 degrees of freedom
## Residual deviance: 19417 on 20952 degrees of freedom
## AIC: 19463
##
## Number of Fisher Scoring iterations: 6
vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## limit_bal 1.525734 1 1.235206
## sex 1.026662 1 1.013243
## education 1.223570 3 1.034201
## marriage 1.332390 2 1.074380
## age 1.381129 1 1.175214
## pay_september 1.516102 1 1.231301
## pay_august 2.620325 1 1.618742
## pay_july 3.171265 1 1.780805
## pay_june 3.776598 1 1.943347
## pay_may 4.257096 1 2.063273
## pay_april 2.994554 1 1.730478
## bill_amt1 3.683396 1 1.919218
## bill_amt6 4.256136 1 2.063040
## pay_amt1 1.159594 1 1.076844
## pay_amt2 1.149135 1 1.071977
## pay_amt3 1.134980 1 1.065355
## pay_amt4 1.135903 1 1.065787
## pay_amt5 1.162211 1 1.078059
## pay_amt6 1.095345 1 1.046587
Multicollinearity is no longer an issue. After removing bill_amt2, bill_amt3, bill_amt4, and bill_amt5, I proceed with stepwise selection.
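Before stepwise selection, a quick sanity check (a sketch comparing the full model and the reduced model5 fitted above) confirms that dropping the collinear bill_amt terms barely changes model fit:

# Compare the full logistic model with the reduced model (no bill_amt2-bill_amt5)
AIC(model, model5)
anova(model5, model, test = "Chisq")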
I initially ran stepwise selection on a model that included pay_amt1 and got an error; after sorting the data I found that pay_amt1 was causing perfect separation, so the model below excludes it.
set.seed(123)
model_d1 <- glm(default_payment_next_month ~ limit_bal + sex + education + marriage +age + pay_september + pay_august + pay_july + pay_june + pay_may + pay_april + bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 +pay_amt5 + pay_amt6, data=train_data, family=binomial)
summary(model_d1)
##
## Call:
## glm(formula = default_payment_next_month ~ limit_bal + sex +
## education + marriage + age + pay_september + pay_august +
## pay_july + pay_june + pay_may + pay_april + bill_amt1 + bill_amt6 +
## pay_amt2 + pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.793e-01 1.031e-01 -9.496 < 2e-16 ***
## limit_bal -9.124e-07 1.891e-07 -4.825 1.40e-06 ***
## sexfemale -1.429e-01 3.674e-02 -3.888 0.000101 ***
## educationuniversity -1.003e-01 4.265e-02 -2.352 0.018660 *
## educationhigh school -1.257e-01 5.699e-02 -2.205 0.027463 *
## educationothers -1.004e+00 2.119e-01 -4.738 2.16e-06 ***
## marriagesingle -1.913e-01 4.137e-02 -4.625 3.74e-06 ***
## marriageothers -2.062e-01 1.591e-01 -1.296 0.194947
## age 5.881e-03 2.220e-03 2.649 0.008064 **
## pay_september 5.671e-01 2.138e-02 26.525 < 2e-16 ***
## pay_august 1.064e-01 2.388e-02 4.455 8.40e-06 ***
## pay_july 5.950e-02 2.646e-02 2.249 0.024511 *
## pay_june 4.084e-02 2.957e-02 1.381 0.167206
## pay_may 4.179e-02 3.228e-02 1.295 0.195429
## pay_april -1.230e-02 2.656e-02 -0.463 0.643319
## bill_amt1 -2.588e-06 5.187e-07 -4.989 6.08e-07 ***
## bill_amt6 1.384e-06 6.668e-07 2.076 0.037910 *
## pay_amt2 -9.128e-06 2.182e-06 -4.182 2.88e-05 ***
## pay_amt3 -3.313e-06 1.744e-06 -1.899 0.057509 .
## pay_amt4 -4.186e-06 1.875e-06 -2.233 0.025573 *
## pay_amt5 -6.614e-06 2.048e-06 -3.230 0.001240 **
## pay_amt6 -2.980e-06 1.618e-06 -1.842 0.065481 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22123 on 20974 degrees of freedom
## Residual deviance: 19443 on 20953 degrees of freedom
## AIC: 19487
##
## Number of Fisher Scoring iterations: 6
# Sort the training data by the response (used to investigate the separation issue)
sortrain_data <- train_data[order(train_data$default_payment_next_month), ]
It turned out that pay_amt1 was causing perfect separation, so it had to be removed from the model.
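One quick way to look for separation symptoms (a sketch applied to model5, which still includes pay_amt1) is to check whether any fitted probabilities are pinned at 0 or 1:

# Sketch: fitted probabilities at (or extremely close to) 0 or 1 are a warning sign of separation
range(fitted(model5))
sum(fitted(model5) < 1e-8 | fitted(model5) > 1 - 1e-8)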
# Stepwise selection
set.seed(123)
model_step <- step(model_d1, direction = "both", trace = 1)
## Start: AIC=19487.27
## default_payment_next_month ~ limit_bal + sex + education + marriage +
## age + pay_september + pay_august + pay_july + pay_june +
## pay_may + pay_april + bill_amt1 + bill_amt6 + pay_amt2 +
## pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6
##
## Df Deviance AIC
## - pay_april 1 19444 19486
## - pay_may 1 19445 19487
## - pay_june 1 19445 19487
## <none> 19443 19487
## - pay_amt6 1 19447 19489
## - pay_amt3 1 19447 19489
## - bill_amt6 1 19448 19490
## - pay_july 1 19448 19490
## - pay_amt4 1 19449 19491
## - age 1 19450 19492
## - pay_amt5 1 19456 19498
## - sex 1 19458 19500
## - pay_august 1 19463 19505
## - marriage 2 19465 19505
## - limit_bal 1 19467 19509
## - pay_amt2 1 19467 19509
## - bill_amt1 1 19470 19512
## - education 3 19475 19513
## - pay_september 1 20140 20182
##
## Step: AIC=19485.48
## default_payment_next_month ~ limit_bal + sex + education + marriage +
## age + pay_september + pay_august + pay_july + pay_june +
## pay_may + bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 +
## pay_amt5 + pay_amt6
##
## Df Deviance AIC
## - pay_may 1 19445 19485
## - pay_june 1 19445 19485
## <none> 19444 19486
## - pay_amt6 1 19447 19487
## + pay_april 1 19443 19487
## - pay_amt3 1 19448 19488
## - bill_amt6 1 19448 19488
## - pay_july 1 19448 19488
## - pay_amt4 1 19450 19490
## - age 1 19451 19491
## - pay_amt5 1 19456 19496
## - sex 1 19459 19499
## - pay_august 1 19463 19503
## - marriage 2 19466 19504
## - limit_bal 1 19467 19507
## - pay_amt2 1 19468 19508
## - bill_amt1 1 19470 19510
## - education 3 19475 19511
## - pay_september 1 20140 20180
##
## Step: AIC=19485.03
## default_payment_next_month ~ limit_bal + sex + education + marriage +
## age + pay_september + pay_august + pay_july + pay_june +
## bill_amt1 + bill_amt6 + pay_amt2 + pay_amt3 + pay_amt4 +
## pay_amt5 + pay_amt6
##
## Df Deviance AIC
## <none> 19445 19485
## + pay_may 1 19444 19486
## - pay_amt3 1 19449 19487
## - pay_amt6 1 19449 19487
## + pay_april 1 19445 19487
## - bill_amt6 1 19450 19488
## - pay_july 1 19450 19488
## - pay_amt4 1 19452 19490
## - age 1 19452 19490
## - pay_june 1 19452 19490
## - pay_amt5 1 19458 19496
## - sex 1 19460 19498
## - marriage 2 19467 19503
## - pay_august 1 19466 19504
## - pay_amt2 1 19469 19507
## - limit_bal 1 19470 19508
## - bill_amt1 1 19472 19510
## - education 3 19477 19511
## - pay_september 1 20145 20183
summary(model_step)
##
## Call:
## glm(formula = default_payment_next_month ~ limit_bal + sex +
## education + marriage + age + pay_september + pay_august +
## pay_july + pay_june + bill_amt1 + bill_amt6 + pay_amt2 +
## pay_amt3 + pay_amt4 + pay_amt5 + pay_amt6, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.804e-01 1.031e-01 -9.509 < 2e-16 ***
## limit_bal -9.219e-07 1.883e-07 -4.895 9.86e-07 ***
## sexfemale -1.433e-01 3.674e-02 -3.900 9.60e-05 ***
## educationuniversity -1.002e-01 4.264e-02 -2.349 0.01880 *
## educationhigh school -1.252e-01 5.697e-02 -2.197 0.02802 *
## educationothers -1.004e+00 2.119e-01 -4.739 2.15e-06 ***
## marriagesingle -1.916e-01 4.137e-02 -4.631 3.63e-06 ***
## marriageothers -2.075e-01 1.591e-01 -1.304 0.19217
## age 5.864e-03 2.220e-03 2.642 0.00824 **
## pay_september 5.677e-01 2.135e-02 26.583 < 2e-16 ***
## pay_august 1.090e-01 2.371e-02 4.598 4.26e-06 ***
## pay_july 5.969e-02 2.637e-02 2.263 0.02363 *
## pay_june 6.242e-02 2.346e-02 2.661 0.00780 **
## bill_amt1 -2.616e-06 5.163e-07 -5.067 4.04e-07 ***
## bill_amt6 1.450e-06 6.555e-07 2.213 0.02692 *
## pay_amt2 -9.136e-06 2.183e-06 -4.185 2.86e-05 ***
## pay_amt3 -3.030e-06 1.710e-06 -1.772 0.07645 .
## pay_amt4 -4.490e-06 1.874e-06 -2.396 0.01659 *
## pay_amt5 -6.633e-06 2.036e-06 -3.258 0.00112 **
## pay_amt6 -2.973e-06 1.619e-06 -1.836 0.06641 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22123 on 20974 degrees of freedom
## Residual deviance: 19445 on 20955 degrees of freedom
## AIC: 19485
##
## Number of Fisher Scoring iterations: 6
set.seed(123)
# 1. Make predictions (probabilities)
predicted_probs <- predict(model_step, newdata = test_data, type = "response")
# 2. Convert probabilities to predicted classes ("yes" for default if prob > 0.5)
predicted_classes <- ifelse(predicted_probs > 0.5, "yes", "no")
# 3. Convert to factor with correct levels (matching actual)
predicted_classes <- factor(predicted_classes, levels = levels(test_data$default_payment_next_month))
# 4. Create confusion matrix
confusion <- table(Predicted = predicted_classes, Actual = test_data$default_payment_next_month)
print(confusion)
## Actual
## Predicted no yes
## no 6776 1533
## yes 206 475
set.seed(123)
# Make predictions on the test set
predictions <- predict(model_step, test_data, type = "response")
# Convert probabilities to predicted class labels
predicted_classes <- ifelse(predictions > 0.5, "yes", "no")
# Confusion matrix (ensure both sides are factors with the same levels)
confusionMatrix(
factor(predicted_classes, levels = c("no", "yes")),
factor(test_data$default_payment_next_month, levels = c("no", "yes")),
positive = "yes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6776 1533
## yes 206 475
##
## Accuracy : 0.8066
## 95% CI : (0.7982, 0.8147)
## No Information Rate : 0.7766
## P-Value [Acc > NIR] : 2.383e-12
##
## Kappa : 0.2708
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23655
## Specificity : 0.97050
## Pos Pred Value : 0.69750
## Neg Pred Value : 0.81550
## Prevalence : 0.22336
## Detection Rate : 0.05284
## Detection Prevalence : 0.07575
## Balanced Accuracy : 0.60352
##
## 'Positive' Class : yes
##
# Since the model is not doing a good job predicting actual defaulters
# we need to try the optimal cutoff to see if that helps
# Create ROCR prediction object using predicted probabilities and true class labels
# Convert "yes"/"no" to numeric: 1 = yes, 0 = no
pred <- prediction(predictions, ifelse(test_data$default_payment_next_month == "yes", 1, 0))
# Calculate AUC
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values), 3)
# Plot ROC curve
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, main = "ROC Curve")
text(0.5, 0.5, paste("AUC:", auc))
# Plot sensitivity vs. cutoff
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type = "l", lwd = 2,
ylab = "Sensitivity", xlab = "Cutoff", main = paste("Maximized Cutoff\n", "AUC: ", auc))
par(new = TRUE) # Add second line
# Plot specificity vs. cutoff
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type = "l", lwd = 2, col = "red", ylab = "", xlab = "")
axis(4, at = seq(0, 1, 0.2)) # Add right axis
mtext("Specificity", side = 4, col = "red")
# Find optimal cutoff where sensitivity and specificity are closest
min.diff <- which.min(abs(unlist(performance(pred, "sens")@y.values) -
unlist(performance(pred, "spec")@y.values)))
min.x <- unlist(performance(pred, "sens")@x.values)[min.diff]
min.y <- unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <- min.x # Optimal threshold
# Annotate the plot
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x, 0, paste("optimal threshold =", round(optimal, 2)), pos = 3)
set.seed(123)
predicted_classes_opt <- ifelse(predictions > 0.21, "yes", "no")
confusionMatrix(
factor(predicted_classes_opt, levels = c("no", "yes")),
factor(test_data$default_payment_next_month, levels = c("no", "yes")),
positive = "yes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 4471 649
## yes 2511 1359
##
## Accuracy : 0.6485
## 95% CI : (0.6385, 0.6584)
## No Information Rate : 0.7766
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2384
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6768
## Specificity : 0.6404
## Pos Pred Value : 0.3512
## Neg Pred Value : 0.8732
## Prevalence : 0.2234
## Detection Rate : 0.1512
## Detection Prevalence : 0.4305
## Balanced Accuracy : 0.6586
##
## 'Positive' Class : yes
##
set.seed(123)
# Build the Random Forest model on the training data
rfModel <- randomForest(default_payment_next_month ~ ., data = train_data, ntree = 100)
# Evaluate the model
predictions <- predict(rfModel, test_data)
confusionMatrix(predictions, test_data$default_payment_next_month, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6584 1287
## yes 398 721
##
## Accuracy : 0.8126
## 95% CI : (0.8043, 0.8206)
## No Information Rate : 0.7766
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3586
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3591
## Specificity : 0.9430
## Pos Pred Value : 0.6443
## Neg Pred Value : 0.8365
## Prevalence : 0.2234
## Detection Rate : 0.0802
## Detection Prevalence : 0.1245
## Balanced Accuracy : 0.6510
##
## 'Positive' Class : yes
##
set.seed(123)
# Print the importance of each variable
importance_rf <- importance(rfModel)
print(importance_rf)
## MeanDecreaseGini
## limit_bal 365.03359
## sex 69.31521
## education 137.26049
## marriage 83.19063
## age 397.91894
## pay_september 741.92389
## pay_august 313.52853
## pay_july 175.11536
## pay_june 199.29407
## pay_may 131.67892
## pay_april 135.75248
## bill_amt1 425.71655
## bill_amt2 388.13707
## bill_amt3 368.54494
## bill_amt4 364.93740
## bill_amt5 352.57346
## bill_amt6 364.75290
## pay_amt1 375.26427
## pay_amt2 340.75309
## pay_amt3 333.31291
## pay_amt4 309.00002
## pay_amt5 309.90156
## pay_amt6 324.23427
set.seed(123)
# See the most important variables
# Create a cleaner variable importance plot
importance_df <- as.data.frame(importance(rfModel))
importance_df$Variables <- rownames(importance_df)
# Sort by importance
importance_df <- importance_df[order(importance_df$MeanDecreaseGini, decreasing = TRUE), ]
# Create a horizontal bar plot with ggplot2
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Variables, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(
title = "Random Forest Variable Importance",
subtitle = "Credit Default Prediction Model",
x = NULL,
y = "Mean Decrease in Gini Index"
) +
theme(
axis.text.y = element_text(size = 10),
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank()
)
set.seed(123)
# SVM with Radial Kernel
svm_radial <- svm(default_payment_next_month ~ .,
data = train_data,
kernel = "radial",
probability = TRUE)
set.seed(123)
# Make predictions with radial kernel
predictions_radial <- predict(svm_radial, newdata = test_data)
# Create confusion matrix for radial kernel
conf_matrix_radial <- confusionMatrix(predictions_radial,
test_data$default_payment_next_month,
positive = "yes")
print("Confusion Matrix for SVM with Radial Kernel:")
## [1] "Confusion Matrix for SVM with Radial Kernel:"
print(conf_matrix_radial)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6671 1332
## yes 311 676
##
## Accuracy : 0.8172
## 95% CI : (0.8091, 0.8252)
## No Information Rate : 0.7766
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3567
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33665
## Specificity : 0.95546
## Pos Pred Value : 0.68490
## Neg Pred Value : 0.83356
## Prevalence : 0.22336
## Detection Rate : 0.07519
## Detection Prevalence : 0.10979
## Balanced Accuracy : 0.64606
##
## 'Positive' Class : yes
##
# SVM with Linear Kernel
svm_linear <- svm(default_payment_next_month ~ .,
data = train_data,
kernel = "linear",
probability = TRUE)
set.seed(123)
# Make predictions with linear kernel
predictions_linear <- predict(svm_linear, newdata = test_data)
# Create confusion matrix for linear kernel
conf_matrix_linear <- confusionMatrix(predictions_linear,
test_data$default_payment_next_month,
positive = "yes")
print("Confusion Matrix for SVM with Linear Kernel:")
## [1] "Confusion Matrix for SVM with Linear Kernel:"
print(conf_matrix_linear)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6761 1522
## yes 221 486
##
## Accuracy : 0.8061
## 95% CI : (0.7978, 0.8142)
## No Information Rate : 0.7766
## P-Value [Acc > NIR] : 4.964e-12
##
## Kappa : 0.2735
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.24203
## Specificity : 0.96835
## Pos Pred Value : 0.68741
## Neg Pred Value : 0.81625
## Prevalence : 0.22336
## Detection Rate : 0.05406
## Detection Prevalence : 0.07864
## Balanced Accuracy : 0.60519
##
## 'Positive' Class : yes
##
# Compare the performance metrics
metrics <- data.frame(
Model = c("SVM Radial", "SVM Linear"),
Accuracy = c(conf_matrix_radial$overall["Accuracy"], conf_matrix_linear$overall["Accuracy"]),
Sensitivity = c(conf_matrix_radial$byClass["Sensitivity"], conf_matrix_linear$byClass["Sensitivity"]),
Specificity = c(conf_matrix_radial$byClass["Specificity"], conf_matrix_linear$byClass["Specificity"]),
Precision = c(conf_matrix_radial$byClass["Precision"], conf_matrix_linear$byClass["Precision"]),
F1_Score = c(conf_matrix_radial$byClass["F1"], conf_matrix_linear$byClass["F1"])
)
print("Performance Comparison:")
## [1] "Performance Comparison:"
print(metrics)
## Model Accuracy Sensitivity Specificity Precision F1_Score
## 1 SVM Radial 0.8172414 0.3366534 0.9554569 0.6849037 0.451419
## 2 SVM Linear 0.8061179 0.2420319 0.9683472 0.6874116 0.358011
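To round off the comparison, the headline metrics of all the classifiers fitted above can be gathered into one table. This is a sketch: it recomputes the logistic regression (at the 0.21 cutoff chosen earlier) and random forest confusion matrices, since those objects were not stored, and reuses the SVM results from above.

# Sketch: side-by-side comparison of all models on the test set
log_probs <- predict(model_step, newdata = test_data, type = "response")
log_class <- factor(ifelse(log_probs > 0.21, "yes", "no"), levels = c("no", "yes"))
cm_log <- confusionMatrix(log_class, test_data$default_payment_next_month, positive = "yes")
cm_rf  <- confusionMatrix(predict(rfModel, test_data),
                          test_data$default_payment_next_month, positive = "yes")
comparison <- data.frame(
  Model       = c("Logistic (cutoff 0.21)", "Random Forest", "SVM Radial", "SVM Linear"),
  Accuracy    = c(cm_log$overall["Accuracy"], cm_rf$overall["Accuracy"],
                  conf_matrix_radial$overall["Accuracy"], conf_matrix_linear$overall["Accuracy"]),
  Sensitivity = c(cm_log$byClass["Sensitivity"], cm_rf$byClass["Sensitivity"],
                  conf_matrix_radial$byClass["Sensitivity"], conf_matrix_linear$byClass["Sensitivity"]),
  Specificity = c(cm_log$byClass["Specificity"], cm_rf$byClass["Specificity"],
                  conf_matrix_radial$byClass["Specificity"], conf_matrix_linear$byClass["Specificity"])
)
print(comparison)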