CTBC Bank, a globally recognized Taiwanese financial institution, seeks to improve its credit risk management practices by developing a predictive model that leverages a wide range of data sources, including demographic indicators and comprehensive customer financial profiles. This initiative aims to proactively identify potential credit default risks and enhance risk mitigation strategies, recognizing the critical impact that credit risk can have on a bank’s financial health. By adopting advanced data analytics techniques, CTBC Bank aims to remain competitive in the financial sector, in line with research showing that robust credit risk assessment models help banks weather economic downturns and maintain high-quality loan portfolios with lower default rates.
The expected loss per customer is computed by multiplying the probability of default (PD) by the loss given default (LGD). Our analysis focuses on the probability of default. The specific data mining problem is to predict the dummy variable “default” (yes/no) using predictor variables in the form of demographic information and payment history (e.g. age, gender, past payments). Since the project has a clearly defined outcome - the prediction and reduction of the target variable (customer defaults) - a supervised learning approach is pursued. The goal is to match the predicted defaults across a sample of clients as closely as possible to the actual defaults. This enables the bank to extend credit only to customers with a low expected probability of default, improving its profitability.
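To make this relationship concrete, here is a minimal illustration in R with made-up figures (both the PD and the LGD below are hypothetical; in this report we only model the probability of default):
# Illustration with hypothetical figures: expected loss per customer as the
# product of the probability of default (PD) and the loss given default (LGD),
# here expressed in NT dollars.
pd <- 0.22 # hypothetical probability of default, roughly the sample default rate
lgd <- 30000 # hypothetical loss given default in NT dollars
expected_loss <- pd * lgd
expected_loss
## [1] 6600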
The data come from the UC Irvine Machine Learning Repository and originate from a study conducted by a Taiwanese bank in 2005. Clients were granted a credit card with a randomly assigned limit balance; neither their repayment capability nor their credit history was considered. Their card usage was then tracked for six months. The dataset contains demographic information as well as the payment behavior and default status of the 30,000 credit card clients. The dataset can be found here: UC Irvine Machine Learning Repository
The original dataset includes 30,000 observations and 25 variables: 5 categorical and 20 numerical. It does not contain any NAs.
ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment in the following month (1=yes, 0=no)
library("readxl")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bank_df <- data.frame(read_excel("default of credit card clients.xls"))
## New names:
## • `` -> `...1`
head(bank_df)
## ...1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## 1 ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
## 2 1 20000 2 2 1 24 2 2 -1 -1 -2 -2
## 3 2 120000 2 2 2 26 -1 2 0 0 0 2
## 4 3 90000 2 2 2 34 0 0 0 0 0 0
## 5 4 50000 2 2 1 37 0 0 0 0 0 0
## 6 5 50000 1 2 1 57 -1 0 -1 0 0 0
## X12 X13 X14 X15 X16 X17 X18 X19
## 1 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
## 2 3913 3102 689 0 0 0 0 689
## 3 2682 1725 2682 3272 3455 3261 0 1000
## 4 29239 14027 13559 14331 14948 15549 1518 1500
## 5 46990 48233 49291 28314 28959 29547 2000 2019
## 6 8617 5670 35835 20940 19146 19131 2000 36681
## X20 X21 X22 X23 Y
## 1 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
## 2 0 0 0 0 1
## 3 1000 1000 0 2000 1
## 4 1000 1000 1000 5000 0
## 5 1200 1100 1069 1000 0
## 6 10000 9000 689 679 0
The head already shows some work that has to be done before we explore the data. The column names are currently stored in the first row, so we assign the first row as column names and remove it from the dataset. The last column has spaces in its name, which could cause problems in the further analysis; hence, we rename it to get rid of the spaces. The first column with the client ID does not hold any relevant information, so we can drop it. Lastly, the naming of the PAY_ variables is not in line with PAY_AMT and BILL_AMT: both PAY_AMT and BILL_AMT start with 1, representing September 2005, whereas PAY_ also starts in September but assigns it the number 0. We want our dataset to be as uniform as possible for easier understanding, which is why we rename PAY_0 to PAY_1.
# Use the first row as column names
colnames(bank_df) <- bank_df[1, ]
# Remove the first row from the data frame
bank_df <- bank_df[-1, ]
# Rename the column with spaces in the name
names(bank_df)[names(bank_df) == "default payment next month"] <- "default.payment.next.month"
# Remove the first column (ID)
bank_df <- bank_df[, -1]
# Rename PAY_0 to match the other _XY variables
names(bank_df)[names(bank_df) == "PAY_0"] <- "PAY_1"
Now the data is ready to be explored. The summary confirms that there are no NAs in the dataset. However, the variables are not classified correctly as categorical and numerical, so assigning the variables their correct classes is the first step. Secondly, EDUCATION and MARRIAGE contain values that were either not specified in the data dictionary or do not hold informational value (e.g. the prediction that clients with an unknown educational or relationship status are more/less likely to default is not actionable). Given our high number of observations and the relatively low number of rows containing these uninformative values, we simply remove the respective rows. Furthermore, BILL_AMT has negative values in some cases. Since it is not possible to have a negative bill, these values are either errors in the dataset or due to the possibility of overpaying the credit card. Most banks, including our client, do not offer the possibility of overpaying, so we treat these cases as if they had a bill statement of 0 (credit card not used). The PAY_ variables also contain values that were not specified in the data dictionary: besides -1 and values above 1, the data contains 0. We assumed that either there is no difference between -1 and 0 (a payment delay of 0 months) or that -1 means the payment was made in advance and 0 means it was made just in time. To decrease the complexity of the dataset, and because our client does not differentiate between advance and punctual payments, we replaced all negative values with 0. PAY_ is now a numerical variable representing the payment delay in months.
# Explore the data
str(bank_df)
## 'data.frame': 30000 obs. of 24 variables:
## $ LIMIT_BAL : chr "20000" "120000" "90000" "50000" ...
## $ SEX : chr "2" "2" "2" "2" ...
## $ EDUCATION : chr "2" "2" "2" "2" ...
## $ MARRIAGE : chr "1" "2" "2" "1" ...
## $ AGE : chr "24" "26" "34" "37" ...
## $ PAY_1 : chr "2" "-1" "0" "0" ...
## $ PAY_2 : chr "2" "2" "0" "0" ...
## $ PAY_3 : chr "-1" "0" "0" "0" ...
## $ PAY_4 : chr "-1" "0" "0" "0" ...
## $ PAY_5 : chr "-2" "0" "0" "0" ...
## $ PAY_6 : chr "-2" "2" "0" "0" ...
## $ BILL_AMT1 : chr "3913" "2682" "29239" "46990" ...
## $ BILL_AMT2 : chr "3102" "1725" "14027" "48233" ...
## $ BILL_AMT3 : chr "689" "2682" "13559" "49291" ...
## $ BILL_AMT4 : chr "0" "3272" "14331" "28314" ...
## $ BILL_AMT5 : chr "0" "3455" "14948" "28959" ...
## $ BILL_AMT6 : chr "0" "3261" "15549" "29547" ...
## $ PAY_AMT1 : chr "0" "0" "1518" "2000" ...
## $ PAY_AMT2 : chr "689" "1000" "1500" "2019" ...
## $ PAY_AMT3 : chr "0" "1000" "1000" "1200" ...
## $ PAY_AMT4 : chr "0" "1000" "1000" "1100" ...
## $ PAY_AMT5 : chr "0" "0" "1000" "1069" ...
## $ PAY_AMT6 : chr "0" "2000" "5000" "1000" ...
## $ default.payment.next.month: chr "1" "1" "0" "0" ...
summary(bank_df)
## LIMIT_BAL SEX EDUCATION MARRIAGE
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## AGE PAY_1 PAY_2 PAY_3
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## PAY_4 PAY_5 PAY_6 BILL_AMT1
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## PAY_AMT4 PAY_AMT5 PAY_AMT6
## Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## default.payment.next.month
## Length:30000
## Class :character
## Mode :character
# Variable Classification
bank_df$LIMIT_BAL <- as.numeric(bank_df$LIMIT_BAL)
bank_df$SEX <- as.factor(bank_df$SEX)
bank_df$EDUCATION <- as.factor(bank_df$EDUCATION)
bank_df$MARRIAGE <- as.factor(bank_df$MARRIAGE)
bank_df$default.payment.next.month <- as.factor(bank_df$default.payment.next.month)
bank_df$AGE <- as.numeric(bank_df$AGE)
bank_df$PAY_1 <- as.numeric(bank_df$PAY_1)
bank_df$PAY_2 <- as.numeric(bank_df$PAY_2)
bank_df$PAY_3 <- as.numeric(bank_df$PAY_3)
bank_df$PAY_4 <- as.numeric(bank_df$PAY_4)
bank_df$PAY_5 <- as.numeric(bank_df$PAY_5)
bank_df$PAY_6 <- as.numeric(bank_df$PAY_6)
bank_df$BILL_AMT1 <- as.numeric(bank_df$BILL_AMT1)
bank_df$BILL_AMT2 <- as.numeric(bank_df$BILL_AMT2)
bank_df$BILL_AMT3 <- as.numeric(bank_df$BILL_AMT3)
bank_df$BILL_AMT4 <- as.numeric(bank_df$BILL_AMT4)
bank_df$BILL_AMT5 <- as.numeric(bank_df$BILL_AMT5)
bank_df$BILL_AMT6 <- as.numeric(bank_df$BILL_AMT6)
bank_df$PAY_AMT1 <- as.numeric(bank_df$PAY_AMT1)
bank_df$PAY_AMT2 <- as.numeric(bank_df$PAY_AMT2)
bank_df$PAY_AMT3 <- as.numeric(bank_df$PAY_AMT3)
bank_df$PAY_AMT4 <- as.numeric(bank_df$PAY_AMT4)
bank_df$PAY_AMT5 <- as.numeric(bank_df$PAY_AMT5)
bank_df$PAY_AMT6 <- as.numeric(bank_df$PAY_AMT6)
# Check how many values were not mentioned in the data dictionary or do not hold informational value (e.g. "unknown")
education_rows_to_remove <- sum(bank_df$EDUCATION %in% c(0, 4, 5, 6))
marriage_rows_to_remove <- sum(bank_df$MARRIAGE %in% c(0, 3))
education_rows_to_remove
## [1] 468
marriage_rows_to_remove
## [1] 377
# Remove the rows with those values
bank_df <- bank_df[!(bank_df$EDUCATION %in% c(0,4,5,6)), ]
bank_df <- bank_df[!(bank_df$MARRIAGE %in% c(0,3)), ]
# Re-create the EDUCATION and MARRIAGE factors so that the unused levels are dropped
bank_df$MARRIAGE <- factor(bank_df$MARRIAGE)
bank_df$EDUCATION <- factor(bank_df$EDUCATION)
# Substitute negative values in the BILL_AMT_XY and PAY_XY with 0s
bill_cols <- c("BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6")
bank_df[, bill_cols][bank_df[, bill_cols] < 0] <- 0
pay_cols <- c("PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
bank_df[, pay_cols][bank_df[, pay_cols] < 0] <- 0
For the purpose of dimension reduction, we computed the average of the PAY_, BILL_AMT, and PAY_AMT variables for each client. In this way, we now have information about the average payment delay, average bill statement, and average payment amount across the 6 months of the study. These variables will be used for our models and should give a sufficient summary of the credit history. Afterwards, we explored the data again to check that it is clean and that we have not forgotten anything.
# Compute the variables PAY_AVG, BILL_AVG, and PAY_AMT_AVG
bank_df$PAY_AVG <- rowMeans(bank_df[, c("PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")], na.rm = TRUE)
bank_df$BILL_AVG <- rowMeans(bank_df[, c("BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6")], na.rm = TRUE)
bank_df$PAY_AMT_AVG <- rowMeans(bank_df[, c("PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6")], na.rm = TRUE)
# Place the average column behind the respective inputs
bank_df <- bank_df %>% relocate(PAY_AVG, .after=PAY_6)
bank_df <- bank_df %>% relocate(BILL_AVG, .after=BILL_AMT6)
bank_df <- bank_df %>% relocate(PAY_AMT_AVG, .after=PAY_AMT6)
# Explore the cleaned data
View(bank_df)
summary(bank_df)
## LIMIT_BAL SEX EDUCATION MARRIAGE AGE
## Min. : 10000 1:11575 1:10531 1:13425 Min. :21.00
## 1st Qu.: 50000 2:17588 2:13862 2:15738 1st Qu.:28.00
## Median : 140000 3: 4770 Median :34.00
## Mean : 168094 Mean :35.39
## 3rd Qu.: 240000 3rd Qu.:41.00
## Max. :1000000 Max. :79.00
## PAY_1 PAY_2 PAY_3 PAY_4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3589 Mean :0.3238 Mean :0.3075 Mean :0.2617
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :8.0000 Max. :8.0000 Max. :8.0000 Max. :8.0000
## PAY_5 PAY_6 PAY_AVG BILL_AMT1
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. : 0
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3510
## Median :0.000 Median :0.0000 Median :0.0000 Median : 22303
## Mean :0.225 Mean :0.2294 Mean :0.2844 Mean : 51060
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.3333 3rd Qu.: 66761
## Max. :8.000 Max. :8.0000 Max. :6.0000 Max. :964511
## BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2960 1st Qu.: 2640 1st Qu.: 2329 1st Qu.: 1780
## Median : 21127 Median : 20073 Median : 19050 Median : 18123
## Mean : 49082 Mean : 46957 Mean : 43305 Mean : 40425
## 3rd Qu.: 63720 3rd Qu.: 59947 3rd Qu.: 54509 3rd Qu.: 50302
## Max. :983931 Max. :1664089 Max. :891586 Max. :927171
## BILL_AMT6 BILL_AVG PAY_AMT1 PAY_AMT2
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 1288 1st Qu.: 4789 1st Qu.: 990.5 1st Qu.: 820
## Median : 17143 Median : 21010 Median : 2100.0 Median : 2008
## Mean : 39068 Mean : 44983 Mean : 5635.9 Mean : 5870
## 3rd Qu.: 49348 3rd Qu.: 57084 3rd Qu.: 5006.0 3rd Qu.: 5000
## Max. :961664 Max. :877314 Max. :505000.0 Max. :1684259
## PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 390 1st Qu.: 296 1st Qu.: 264 1st Qu.: 138
## Median : 1800 Median : 1500 Median : 1502 Median : 1500
## Mean : 5168 Mean : 4824 Mean : 4797 Mean : 5204
## 3rd Qu.: 4500 3rd Qu.: 4028 3rd Qu.: 4078 3rd Qu.: 4001
## Max. :896040 Max. :528897 Max. :426529 Max. :528666
## PAY_AMT_AVG default.payment.next.month
## Min. : 0 0:22649
## 1st Qu.: 1117 1: 6514
## Median : 2399
## Mean : 5250
## 3rd Qu.: 5579
## Max. :385692
str(bank_df)
## 'data.frame': 29163 obs. of 27 variables:
## $ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_1 : num 2 0 0 0 0 0 0 0 0 0 ...
## $ PAY_2 : num 2 2 0 0 0 0 0 0 0 0 ...
## $ PAY_3 : num 0 0 0 0 0 0 0 0 2 0 ...
## $ PAY_4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PAY_5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PAY_6 : num 0 2 0 0 0 0 0 0 0 0 ...
## $ PAY_AVG : num 0.667 0.667 0 0 0 ...
## $ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
## $ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
## $ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
## $ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
## $ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
## $ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
## $ BILL_AVG : num 1284 2846 16942 38556 18223 ...
## $ PAY_AMT1 : num 0 0 1518 2000 2000 ...
## $ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
## $ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
## $ PAY_AMT5 : num 0 0 1000 1069 689 ...
## $ PAY_AMT6 : num 0 2000 5000 1000 679 ...
## $ PAY_AMT_AVG : num 115 833 1836 1398 9842 ...
## $ default.payment.next.month: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
Before building the first model, we computed descriptive statistics to get an overview of the sample. We plotted the distributions of the variables and computed measures such as the mean, median, or percentage share for several of them.
Regarding demographics, age is approximately normally distributed with a mean of 35 and a median of 34; the histogram shows that age groups around 30 make up the majority of the sample. As for gender, women dominate the sample with a share of 60%. The chart of education levels shows that the majority of clients have finished university, and a relatively high number have obtained a graduate degree. In contrast, only a few study participants have not pursued higher education beyond high school. Lastly, the sample is distributed relatively equally across single and married individuals.
library(ggplot2)
#Age Distribution Table
ggplot(data = bank_df, aes(x = AGE)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 20)+
labs(title = "Age Distribution Histogram",
x = "Age",
y = "Number of Cases")
# Calculation Mean/Median of Age
mean(as.numeric(bank_df$AGE))
## [1] 35.39056
median(as.numeric(bank_df$AGE))
## [1] 34
# Gender Distribution
ggplot(data = bank_df, aes(x = SEX)) +
geom_bar(fill = "dodgerblue", color = "black") +
scale_x_discrete(labels = c("Male", "Female")) +
labs(title = "Gender Distribution",
x = "Gender",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
# Investigate the number of women further
percentage_women <- sum(bank_df$SEX == 2)/dim(bank_df)[1]
print(percentage_women)
## [1] 0.603093
# Education Chart
ggplot(data = bank_df, aes(x = factor(EDUCATION), fill = factor(EDUCATION))) +
geom_bar() +
scale_x_discrete(labels = c("Graduate School", "University", "High School")) +
scale_fill_manual(values = c("1" = "dodgerblue", "2" = "darkorange", "3" = "forestgreen"),
name = "Education Level",
labels = c("Graduate School", "University", "High School")) +
labs(title = "Education Levels",
x = "Education Level",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
# Marital Status
ggplot(data = bank_df, aes(x = factor(MARRIAGE))) +
geom_bar(fill = "dodgerblue", color = "black") +
scale_x_discrete(labels = c("Married", "Single")) +
labs(title = "Marital Status Distribution",
x = "Marital Status",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
Regarding credit card use, the large majority of clients does not have payment delays, with a mean average delay of 0.28 months. The average monthly bill amount distribution resembles the right half of a normal distribution with a mean of 44,983 NT dollars (about 1,390 USD). The limit balance shows an irregular, right-skewed distribution with a mean of 168,094 NT dollars. This is significantly higher than the average bill amount, hinting that many clients do not fully use their credit limit.
# Payment Delay
ggplot(data = bank_df, aes(x = PAY_AVG)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 20)+
labs(title = "Payment Delay Distribution",
x = "Average Payment Delay in Months",
y = "Number of Cases")
mean(bank_df$PAY_AVG)
## [1] 0.2843958
# Bill Amount
ggplot(data = bank_df, aes(x = BILL_AVG)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 40)+
labs(title = "Bill Amount Distribution",
x = "Average Monthly Bill Amount in NT Dollars",
y = "Number of Cases")
mean(bank_df$BILL_AVG)
## [1] 44982.82
# Limit Balance
ggplot(data = bank_df, aes(x = LIMIT_BAL)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 40)+
labs(title = "Limit Balance Distribution",
x = "Average Monthly Limit Balance NT Dollars",
y = "Number of Cases")
mean(bank_df$LIMIT_BAL)
## [1] 168093.9
Lastly, we considered the number of defaulting clients. In our sample, the share of defaulting clients is relatively high at 22%. This might be attributed to the fact that the study did not base the selection of clients on past credit history, so a considerable number of non-creditworthy individuals were granted credit, resulting in a high number of defaults. However, this stands in contrast to the low average payment delay. To investigate the matter further, we computed the percentage of clients who never paid with delay during the time of the study. This group makes up only 66% of the sample, i.e. about a third of clients were late at least once. Hence, despite the right-skewed distribution chart and the low mean, payment delays in the sample should not be underestimated.
# Default
library(ggplot2)
ggplot(data = bank_df, aes(x = factor(default.payment.next.month))) +
geom_bar(fill = "#00857F", color = "black") +
scale_x_discrete(labels = c("No Default", "Default")) +
labs(title = "Number of Defaulting Clients",
x = "Default (yes/no)",
y = "Count") +
theme_minimal() +
theme(
panel.grid.major = element_line(colour = "white", size = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"),
panel.background = element_rect(fill = "white"))
percentage_default <- sum(bank_df$default.payment.next.month == 1)/dim(bank_df)[1]
print(percentage_default)
## [1] 0.2233652
percentage_no_delay <- sum(bank_df$PAY_AVG == 0)/dim(bank_df)[1]
print(percentage_no_delay)
## [1] 0.6619689
Our target is to predict whether a client will default or not. Since this is a categorical outcome, we use a random forest and a logistic regression approach and subsequently combine them in an ensemble model. Of these three models, the logistic regression performed best.
Before modeling, we partition the data into a training set (60%), a validation set (20%), and a test set (20%). The training data is used to build the models, the validation data to compare and tune them, and the test data to evaluate the final performance on unseen data.
# set seed for reproducibility
set.seed(101)
# Define the proportions for each set
train_prop <- 0.6
valid_prop <- 0.2
test_prop <- 0.2
# Calculate the number of rows for each set
total_rows <- nrow(bank_df)
train_size <- round(train_prop * total_rows)
valid_size <- round(valid_prop * total_rows)
# Create a vector of row indices. With the sample function, we scramble the rows around. We keep all records, but they are not in the same order as before.
indices <- sample(1:total_rows, total_rows, replace = FALSE)
# Split the dataset into training, validation, and testing sets
bank_df_train <- bank_df[indices[1:train_size], ]
bank_df_valid <- bank_df[indices[(train_size + 1):(train_size + valid_size)], ]
bank_df_test <- bank_df[indices[(train_size + valid_size + 1):total_rows], ]
We built our first logistic regression model using the variables LIMIT_BAL, MARRIAGE, AGE, PAY_AVG, BILL_AVG, EDUCATION, and SEX. We decided to use only the average variables for the monthly measurements, since each average contains information about all months. We dropped the payment amounts (PAY_AMT) because they strongly correlate with the bill amounts (BILL_AMT), which would introduce collinearity issues. It should be noted that we kept LIMIT_BAL - one might argue that the limit balance is itself determined by default risk and is therefore not a good predictor. However, in the bank’s study the limit balances were randomly assigned to investigate the effect this has on spending and default, so it may be used as a predictor.
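As a quick check of the collinearity argument, one could compute the correlation between the averaged bill and payment amounts on the training data; this sketch is an addition and its exact result is not reported here.
# Collinearity check (illustrative): pairwise correlations of the candidate
# numeric predictors on the training data.
cor(bank_df_train$BILL_AVG, bank_df_train$PAY_AMT_AVG)
cor(bank_df_train[, c("LIMIT_BAL", "AGE", "PAY_AVG", "BILL_AVG", "PAY_AMT_AVG")])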
After building the first model, we checked the predictors for significance. In our second and final model we dropped AGE, BILL_AVG, and EDUCATION due to insignificance. The remaining predictors (LIMIT_BAL, MARRIAGE, PAY_AVG, and SEX) are all significant at the 0.001 level.
# 1. Create the first model using all variables.
logreg1 <- glm(default.payment.next.month ~ LIMIT_BAL + MARRIAGE + AGE + PAY_AVG + BILL_AVG + EDUCATION + SEX, data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg1)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## AGE + PAY_AVG + BILL_AVG + EDUCATION + SEX, family = "binomial",
## data = rbind(bank_df_train, bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.354e+00 9.970e-02 -13.586 < 2e-16 ***
## LIMIT_BAL -1.861e-06 1.672e-07 -11.124 < 2e-16 ***
## MARRIAGE2 -1.861e-01 3.937e-02 -4.727 2.28e-06 ***
## AGE 3.786e-03 2.144e-03 1.766 0.0774 .
## PAY_AVG 1.301e+00 3.021e-02 43.064 < 2e-16 ***
## BILL_AVG -1.286e-07 3.130e-07 -0.411 0.6813
## EDUCATION2 3.717e-03 3.997e-02 0.093 0.9259
## EDUCATION3 -5.403e-02 5.425e-02 -0.996 0.3193
## SEX2 -1.428e-01 3.498e-02 -4.083 4.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21728 on 23322 degrees of freedom
## AIC: 21746
##
## Number of Fisher Scoring iterations: 4
# 2. Create the final model dropping the insignificant predictors.
logreg_final <- glm(default.payment.next.month ~ LIMIT_BAL + MARRIAGE + PAY_AVG + SEX, data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg_final)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## PAY_AVG + SEX, family = "binomial", data = rbind(bank_df_train,
## bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.218e+00 4.286e-02 -28.421 < 2e-16 ***
## LIMIT_BAL -1.831e-06 1.494e-07 -12.255 < 2e-16 ***
## MARRIAGE2 -2.135e-01 3.442e-02 -6.203 5.55e-10 ***
## PAY_AVG 1.299e+00 2.987e-02 43.474 < 2e-16 ***
## SEX2 -1.503e-01 3.465e-02 -4.339 1.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21732 on 23326 degrees of freedom
## AIC: 21742
##
## Number of Fisher Scoring iterations: 4
We created the first random forest model using the same variables as in our first logistic regression model (see above for a more detailed justification) and started with relatively low values for ntree, mtry, and nodesize. We trained the model on the training data, predicted on the validation data, and evaluated the model performance based on the confusion matrix. We repeated this process for several random forest models with tweaked parameters to find the best one. The confusion matrix for the first model showed a high accuracy of 78% with a sensitivity of 99% - speaking in favor of the model. However, the specificity is very low at 4%, so the model has to be adjusted to achieve a higher specificity. This matters because a low specificity means high opportunity cost: the specificity tells us how well our model identifies true negatives (clients who do not default). If the specificity is low, our model advises the bank to decline a high number of clients who would not actually default, which means lost profit for the bank. Before creating our second random forest, we therefore checked whether the low specificity might be caused by the class distribution by counting the 0s in the full, training, and validation sets; if the imbalance were the issue, oversampling might solve the problem. This turned out not to be the case, so we proceeded by tweaking the parameters to find the best random forest.
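For reference, this is what oversampling could look like; it is a sketch only and was not used in any of our models.
# Illustrative sketch (not used in our final models): oversample the minority
# class (defaulting clients, class 1) in the training data to balance the classes.
set.seed(101)
train_minority <- bank_df_train[bank_df_train$default.payment.next.month == 1, ]
train_majority <- bank_df_train[bank_df_train$default.payment.next.month == 0, ]
train_oversampled <- train_minority[sample(nrow(train_minority), nrow(train_majority), replace = TRUE), ]
bank_df_train_balanced <- rbind(train_majority, train_oversampled)
table(bank_df_train_balanced$default.payment.next.month)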
After rf_2, we tried increasing ntree and nodesize further to increase sensitivity. This did not work; we only added complexity without gaining sensitivity or accuracy. Since there is no simple linear relationship between the model’s accuracy and mtry, we then decided to start with mtry = 5 and decrease it incrementally by 1 to see whether this improves the model. Decreasing mtry improved the accuracy and specificity of our model, but only marginally. Therefore, we decided to stay with mtry = 3.
Another parameter we tweaked for our final model to achieve a higher specificity is the cutoff score. The default cutoff is 0.5. If we increase the cutoff score for the non-default class, the number of predicted negatives, and thereby the specificity, increases. We tried cutoff scores of 0.6 and 0.7 and decided to stay with 0.7 because it produces a decent specificity and we did not want to sacrifice more sensitivity. In fact, we prioritize sensitivity, given that undetected positives (= defaulting clients) result in a direct loss for CTBC.
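The cutoff comparison can also be automated. The following helper is an addition to the original analysis and assumes a fitted forest such as rf_5 from the code below; it sweeps several cutoffs on the validation set and tabulates the resulting metrics.
# Illustrative helper: evaluate several cutoff scores on the validation data.
# Predicting class 1 when its vote share exceeds 1 - cutoff roughly mirrors
# randomForest's cutoff = c(cutoff, 1 - cutoff) for the class labels ("0", "1").
sweep_cutoffs <- function(model, newdata, cutoffs = c(0.5, 0.6, 0.7)) {
  probs <- predict(model, newdata, type = "prob")[, "1"]
  actual <- factor(newdata$default.payment.next.month, levels = c("0", "1"))
  t(sapply(cutoffs, function(ct) {
    pred <- factor(ifelse(probs >= 1 - ct, "1", "0"), levels = c("0", "1"))
    cm <- caret::confusionMatrix(pred, actual) # positive class is "0"
    c(cutoff = ct, cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")])
  }))
}
# Example call (after the forests below have been built):
# sweep_cutoffs(rf_5, bank_df_valid)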
# Import libraries
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
# 1. Create the first random forest
rf <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 10,
mtry = 1, nodesize = 5, importance = TRUE)
# 2. Look at confusion matrix.
rf.pred <- predict(rf, bank_df_valid)
confusionMatrix(rf.pred, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4484 1256
## 1 35 58
##
## Accuracy : 0.7787
## 95% CI : (0.7678, 0.7893)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.2408
##
## Kappa : 0.0543
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99225
## Specificity : 0.04414
## Pos Pred Value : 0.78118
## Neg Pred Value : 0.62366
## Prevalence : 0.77473
## Detection Rate : 0.76873
## Detection Prevalence : 0.98406
## Balanced Accuracy : 0.51820
##
## 'Positive' Class : 0
##
# 3. Check if our specificity issue is due to a low number of 0s.
count_0 <- table(bank_df$default.payment.next.month)[as.character(0)]
count_0
## 0
## 22649
count_train_0 <- table(bank_df_train$default.payment.next.month)[as.character(0)]
count_train_0
## 0
## 13582
count_valid_0 <- table(bank_df_valid$default.payment.next.month)[as.character(0)]
count_valid_0
## 0
## 4519
# 4.1 Create a second random forest with tweaked inputs (higher ntree and mtry)
rf_2 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 5, nodesize = 5, importance = TRUE)
# 4.2 Look at the specificity in the confusion matrix
rf.pred_2 <- predict(rf_2, bank_df_valid)
confusionMatrix(rf.pred_2, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4143 891
## 1 376 423
##
## Accuracy : 0.7828
## 95% CI : (0.772, 0.7933)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.07205
##
## Kappa : 0.2772
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9168
## Specificity : 0.3219
## Pos Pred Value : 0.8230
## Neg Pred Value : 0.5294
## Prevalence : 0.7747
## Detection Rate : 0.7103
## Detection Prevalence : 0.8630
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
# 5.1 Create a third random forest with tweaked inputs (higher ntree and nodesize)
rf_3 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 80,
mtry = 5, nodesize = 20, importance = TRUE)
# 5.2 Look at the specificity in the confusion matrix
rf.pred_3 <- predict(rf_3, bank_df_valid)
confusionMatrix(rf.pred_3, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4215 883
## 1 304 431
##
## Accuracy : 0.7965
## 95% CI : (0.7859, 0.8068)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 3.046e-05
##
## Kappa : 0.309
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9327
## Specificity : 0.3280
## Pos Pred Value : 0.8268
## Neg Pred Value : 0.5864
## Prevalence : 0.7747
## Detection Rate : 0.7226
## Detection Prevalence : 0.8740
## Balanced Accuracy : 0.6304
##
## 'Positive' Class : 0
##
# 6.1 Create a fourth random forest with tweaked inputs (mtry = 4)
rf_4 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 4, nodesize = 5, importance = TRUE)
# 6.2 Look at the specificity in the confusion matrix
rf.pred_4 <- predict(rf_4, bank_df_valid)
confusionMatrix(rf.pred_4, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4169 894
## 1 350 420
##
## Accuracy : 0.7867
## 95% CI : (0.776, 0.7972)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.01429
##
## Kappa : 0.2839
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9225
## Specificity : 0.3196
## Pos Pred Value : 0.8234
## Neg Pred Value : 0.5455
## Prevalence : 0.7747
## Detection Rate : 0.7147
## Detection Prevalence : 0.8680
## Balanced Accuracy : 0.6211
##
## 'Positive' Class : 0
##
# 7.1 Create a fifth random forest with tweaked inputs (mtry = 3)
rf_5 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE)
# 7.2 Look at the specificity in the confusion matrix
rf.pred_5 <- predict(rf_5, bank_df_valid)
confusionMatrix(rf.pred_5, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4195 900
## 1 324 414
##
## Accuracy : 0.7902
## 95% CI : (0.7795, 0.8005)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.00236
##
## Kappa : 0.2882
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9283
## Specificity : 0.3151
## Pos Pred Value : 0.8234
## Neg Pred Value : 0.5610
## Prevalence : 0.7747
## Detection Rate : 0.7192
## Detection Prevalence : 0.8735
## Balanced Accuracy : 0.6217
##
## 'Positive' Class : 0
##
# 8.1 Create a sixth random forest by increasing the cutoff score to 0.6
rf_6 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE, cutoff = c(0.6,0.4))
# 8.2 Look at the specificity in the confusion matrix
rf.pred_6 <- predict(rf_6, bank_df_valid)
confusionMatrix(rf.pred_6, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3988 747
## 1 531 567
##
## Accuracy : 0.7809
## 95% CI : (0.7701, 0.7915)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.1328
##
## Kappa : 0.3334
##
## Mcnemar's Test P-Value : 1.809e-09
##
## Sensitivity : 0.8825
## Specificity : 0.4315
## Pos Pred Value : 0.8422
## Neg Pred Value : 0.5164
## Prevalence : 0.7747
## Detection Rate : 0.6837
## Detection Prevalence : 0.8118
## Balanced Accuracy : 0.6570
##
## 'Positive' Class : 0
##
# 9.1 Create the final random forest by increasing the cutoff score to 0.7
rf_final <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE, cutoff = c(0.7,0.3))
# 9.2 Look at the specificity in the confusion matrix
rf.pred_final <- predict(rf_final, bank_df_valid)
confusionMatrix(rf.pred_final, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3699 610
## 1 820 704
##
## Accuracy : 0.7548
## 95% CI : (0.7436, 0.7658)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.3353
##
## Mcnemar's Test P-Value : 3.26e-08
##
## Sensitivity : 0.8185
## Specificity : 0.5358
## Pos Pred Value : 0.8584
## Neg Pred Value : 0.4619
## Prevalence : 0.7747
## Detection Rate : 0.6342
## Detection Prevalence : 0.7387
## Balanced Accuracy : 0.6772
##
## 'Positive' Class : 0
##
As a third model, we built an ensemble model based on the random forest and the logistic regression. In a first step, we stored the predicted probabilities of the random forest in a data frame (without doing so, we ran into an error when setting up the res data frame). Afterwards, we built the data frame res, which for every observation in the test dataset shows the actual class as well as the class predicted by each of the two models together with the corresponding probability. Note that TREEPred is derived with the increased cutoff score of 0.7 instead of 0.5. We then set options(digits = 1, scipen = 2) so that printed numbers are shown with fewer significant digits and fixed rather than scientific notation is preferred. Finally, we added a column with the average of the two model probabilities to the data frame. This average is the probability of the ensemble model and will be used to arrive at the class predicted by the ensemble. Given that we only have two models, the ensemble can only be built by averaging probabilities. With a larger (especially odd) number of input models, an ensemble can also be built by majority vote; with two models this does not make sense, because the models either agree or are tied, so we would have to prioritize the vote of one model, making the ensemble identical to that model.
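As a purely hypothetical illustration of the majority-vote alternative with three models (pred_a, pred_b, and pred_c are made-up 0/1 prediction vectors, not outputs of our models):
# Hypothetical example: majority vote across three classifiers.
pred_a <- c(0, 1, 1, 0)
pred_b <- c(1, 1, 0, 0)
pred_c <- c(1, 1, 1, 0)
majority_vote <- ifelse(pred_a + pred_b + pred_c >= 2, 1, 0)
majority_vote
## [1] 1 1 1 0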
# Build a dataframe with the probability predictions of the random forest
predictions_tree <- data.frame(predict(rf_final, bank_df_test, type = "prob"))
# Build dataframe showing the actual class and predicted probability/class by the models
res <- data.frame(ActualClass = bank_df_test$default.payment.next.month,
LRProb = predict(logreg_final, bank_df_test, type = "response"),
LRPred = ifelse(predict(logreg_final, bank_df_test, type = "response")>0.5, 1, 0),
TREEProb = predictions_tree[,2],
TREEPred = ifelse(predictions_tree[,2]>0.7, 1, 0))
# Set options
options(digits = 1, scipen = 2)
# Add average probability
res$avg <- rowMeans(data.frame(res$LRProb, res$TREEProb))
To evaluate the performance of our models on new data, we used the test data. First, we ran each model on the test dataset to predict if a client defaults or not. Then, we computed the confusion matrix to get an overview of the accuracy, sensitivity, and specificity of the models.
The logistic regression has an accuracy of 80% with a sensitivity of 82% and specificity of 65%. We decided not to tweak the cutoff score of the logistic regression model because we prioritize sensitivity to prevent losses and already have a relatively high specificity.
For the random forest, we already fixed the cutoff score at 0.7 during the model building process. The random forest has a lower accuracy and specificity, at 76% and 54% respectively. Its sensitivity, at 82%, is as high as that of the logistic regression. Judged by the confusion matrix, the random forest therefore performs worse than the logistic regression. The only way to increase its accuracy would be to decrease the cutoff score, which would, however, mean a very low specificity (resulting in high opportunity cost, as explained during the model building process).
Regarding the ensemble model, the cutoff score had to be lowered significantly to obtain a valuable model. With the default cutoff score of 0.5, the sensitivity was only 27% (note that for the ensemble the confusion matrices below define the positive class as 1, i.e. default). A lower cutoff score results in a higher number of individuals classified as positives (defaulting), and hence a higher number of true positives. After incrementally decreasing the cutoff score, we found that a cutoff of 0.2 offered the most attractive tradeoff between decreasing accuracy and increasing sensitivity. With a cutoff score of 0.2, the model has an accuracy of 72%, a sensitivity of 65%, and a specificity of 74%. In conclusion, the ensemble model has the worst performance: its accuracy is the lowest, and its sensitivity of 65% is relatively low. Given that every false negative results in a loss for the bank, such a low sensitivity is not acceptable, making the ensemble the least appropriate model for our case.
Based on the confusion matrices, the logistic regression outperforms the other two models by far.
# 1. Predict the values for the testing dataset with the logistic regression.
logit_pred <- predict(logreg_final, bank_df_test, type = "response")
confusionMatrix(factor(bank_df_test$default.payment.next.month), factor(ifelse(logit_pred >=0.5,1,0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4389 159
## 1 984 300
##
## Accuracy : 0.804
## 95% CI : (0.794, 0.814)
## No Information Rate : 0.921
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.258
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.817
## Specificity : 0.654
## Pos Pred Value : 0.965
## Neg Pred Value : 0.234
## Prevalence : 0.921
## Detection Rate : 0.753
## Detection Prevalence : 0.780
## Balanced Accuracy : 0.735
##
## 'Positive' Class : 0
##
# 2. Predict the values for the testing dataset with the random forest.
rf.pred_final_test <- predict(rf_final, bank_df_test)
confusionMatrix(predict(rf_final,bank_df_test), as.factor(bank_df_test$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3726 589
## 1 822 695
##
## Accuracy : 0.758
## 95% CI : (0.747, 0.769)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.338
##
## Mcnemar's Test P-Value : 6.56e-10
##
## Sensitivity : 0.819
## Specificity : 0.541
## Pos Pred Value : 0.863
## Neg Pred Value : 0.458
## Prevalence : 0.780
## Detection Rate : 0.639
## Detection Prevalence : 0.740
## Balanced Accuracy : 0.680
##
## 'Positive' Class : 0
##
# 3. Predict the values for the testing dataset with the Ensemble Model
# Whenever we ran the confusion matrix on the bank_df_test, we got an error message. The error message was fixed by simply duplicating the df and using the duplicate.
bank_df_test_new <- data.frame(bank_df_test)
bank_df_test_new$default_factor <- factor(bank_df_test_new[, "default.payment.next.month"])
# Confusion Matrix with a cutoff score of 0.5
res$avg_result <- ifelse(res$avg > 0.5,1,0)
confusionMatrix(factor((res$avg > 0.5)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4349 937
## 1 199 347
##
## Accuracy : 0.805
## 95% CI : (0.795, 0.815)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.00000114
##
## Kappa : 0.285
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.2702
## Specificity : 0.9562
## Pos Pred Value : 0.6355
## Neg Pred Value : 0.8227
## Prevalence : 0.2202
## Detection Rate : 0.0595
## Detection Prevalence : 0.0936
## Balanced Accuracy : 0.6132
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.3
confusionMatrix(factor((res$avg > 0.3)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3931 651
## 1 617 633
##
## Accuracy : 0.783
## 95% CI : (0.772, 0.793)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.313
##
## Kappa : 0.361
##
## Mcnemar's Test P-Value : 0.354
##
## Sensitivity : 0.493
## Specificity : 0.864
## Pos Pred Value : 0.506
## Neg Pred Value : 0.858
## Prevalence : 0.220
## Detection Rate : 0.109
## Detection Prevalence : 0.214
## Balanced Accuracy : 0.679
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.2
confusionMatrix(factor((res$avg > 0.2)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3346 455
## 1 1202 829
##
## Accuracy : 0.716
## 95% CI : (0.704, 0.727)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.315
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.646
## Specificity : 0.736
## Pos Pred Value : 0.408
## Neg Pred Value : 0.880
## Prevalence : 0.220
## Detection Rate : 0.142
## Detection Prevalence : 0.348
## Balanced Accuracy : 0.691
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.1
confusionMatrix(factor((res$avg > 0.1)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1541 127
## 1 3007 1157
##
## Accuracy : 0.463
## 95% CI : (0.45, 0.476)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.133
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.901
## Specificity : 0.339
## Pos Pred Value : 0.278
## Neg Pred Value : 0.924
## Prevalence : 0.220
## Detection Rate : 0.198
## Detection Prevalence : 0.714
## Balanced Accuracy : 0.620
##
## 'Positive' Class : 1
##
As a second performance measure, we plotted the gains charts of the models. The gains chart of a model shows how many true positives are cumulatively identified (y-axis) for a given number of observations (x-axis), with the observations ranked by the model’s predicted probability. As a benchmark, the result of a random assignment is plotted, which would be equivalent to shuffling the 1s and 0s randomly in the ActualClass column. The gap between the benchmark line and the model curve tells us how much better the model performs compared to random assignment.
Looking at the gains charts, it is evident that there are only small differences between the models. The curves are very close to each other and overlap to a great extent. All of them lie above the benchmark, demonstrating that all models are valuable and perform better than random assignment. However, we cannot extract much more information from the graphs; given their proximity, the gains charts are not useful for comparing the models, so our evaluation remains based on the confusion matrix results. As a complementary single-number summary, an AUC sketch is shown after the gains chart code below.
# 1. Compute Gains
library(gains)
# Convert the 0/1 factor via as.character before as.numeric so that the factor
# codes (1/2) do not inflate the cumulative totals
gain <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(logit_pred), groups=10)
gain_rf <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(predictions_tree[,2]), groups=10)
gain_ens <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(res$avg), groups=10)
# 2. Plot Charts
plot(c(0,gain$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain$cume.obs),xlab="# Cases", ylab="Cumulative Defaults", main="Gains Chart", type="l")
lines(c(0,gain_rf$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain_rf$cume.obs), lty=2)
lines(c(0,gain_ens$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain_ens$cume.obs), lty=3)
# 3. Add Benchmark
lines(c(0,sum(bank_df_test$default.payment.next.month == 1))~c(0, dim(bank_df_test)[1]), lty=1)
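Because the gains curves lie so close together, a single-number summary can complement them. The following sketch is an addition to the original analysis and assumes the pROC package is installed; it computes the area under the ROC curve (AUC) for each model on the test set.
# Complementary comparison (illustrative): AUC on the test set for the
# logistic regression, the random forest, and the ensemble model.
library(pROC)
actual_test <- bank_df_test$default.payment.next.month
auc(roc(actual_test, as.numeric(logit_pred)))
auc(roc(actual_test, as.numeric(predictions_tree[, 2])))
auc(roc(actual_test, as.numeric(res$avg)))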
Lastly, we would like to explore which variables are the most important predictors.
For the random forest, we can do this by plotting the variable importance plot. Variable importance can be measured either as the mean decrease in accuracy when a predictor is randomly permuted (type = 1, which we plot below) or as the total decrease in the Gini index attributable to that predictor, summed over all trees in the forest (type = 2). For the logistic regression, we can look at the model equation and gauge the importance of each predictor by comparing the absolute values of the coefficient estimates.
The variable importance plot of the random forest shows that PAY_AVG is by far the most important predictor. Given that this variable captures the average payment delay of a client, this is plausible: clients who frequently pay with delay are more likely to default. Other important predictors are LIMIT_BAL, BILL_AVG, AGE, and MARRIAGE. EDUCATION and SEX have only little influence on the default risk of a client.
The summary of the logistic regression indicates that the default risk of a client is lower if she has a higher limit balance, has no payment delays, is single, and is female. The absolute values of the coefficients give some insight into the influence of each factor: as for the random forest, the most influential predictor is the payment delay captured by PAY_AVG, followed by marital status and sex, while the limit balance has the smallest coefficient. Note, however, that coefficients are not directly comparable across predictors measured on different scales - LIMIT_BAL is measured in NT dollars, so its tiny per-dollar coefficient does not imply a negligible effect. The observed patterns might be driven by several behavioral and socio-demographic factors; it should be noted that any such explanations are only assumptions and not the product of scientific research.
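One way to make the coefficients easier to interpret (an addition to the analysis, not part of the original code) is to convert them to odds ratios:
# Odds ratios for the final logistic regression: exp() of the coefficients.
# exp(coef) for PAY_AVG gives the multiplicative change in the odds of default
# per additional month of average payment delay.
exp(coef(logreg_final))
# Odds ratio for a 100,000 NT dollar increase in the credit limit:
exp(coef(logreg_final)["LIMIT_BAL"] * 100000)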
# 1. Plot a variable importance plot for the random forest.
varImpPlot(rf_final, type = 1)
# 2. Use summary to see the coefficient estimates
summary(logreg_final)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## PAY_AVG + SEX, family = "binomial", data = rbind(bank_df_train,
## bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.22e+00 4.29e-02 -28.42 < 2e-16 ***
## LIMIT_BAL -1.83e-06 1.49e-07 -12.26 < 2e-16 ***
## MARRIAGE2 -2.13e-01 3.44e-02 -6.20 5.5e-10 ***
## PAY_AVG 1.30e+00 2.99e-02 43.47 < 2e-16 ***
## SEX2 -1.50e-01 3.46e-02 -4.34 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21732 on 23326 degrees of freedom
## AIC: 21742
##
## Number of Fisher Scoring iterations: 4
In summary, we built a logistic regression, a random forest, and an ensemble model to predict whether a bank client will default or not. After tweaking several parameters and comparing model performance, the logistic regression emerged as the best model, and we would therefore propose it as the final model to our client. It has an accuracy of 80%, a sensitivity of 82%, and a specificity of 65%. Its predictors are past payment delays (PAY_AVG), marital status (MARRIAGE), sex (SEX), and limit balance (LIMIT_BAL). Its main advantage over the random forest and the ensemble model is its relatively balanced sensitivity and specificity: the random forest has a low specificity, whereas the ensemble model has a low sensitivity, so the cutoff scores of both had to be tweaked, which decreased their accuracy.
We believe our final model is the best we can build with the available data. We also tried to build a model without information on credit history, using solely demographic data, in order to advise our client on how to deal with new customers who cannot provide a past credit history. However, regardless of how we tweaked the cutoff score, we were not able to achieve a specificity higher than 20%. This means the model would classify only a very small number of clients as creditworthy. The cost and complexity of implementing such a model are therefore likely to surpass its business value, and we advise our client to simply give new clients without a credit history low limit balances with a high interest rate on debt. For customers with a credit history, the bank should implement our final model.
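For reference, here is a sketch of what such a demographics-only model could look like; the exact specification we used is not reproduced in this report, so the variables and cutoff below are an approximation.
# Illustrative reconstruction of a model without credit-history information,
# using only demographic predictors that would be available for new customers.
logreg_demo <- glm(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE + AGE,
data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg_demo)
demo_pred <- predict(logreg_demo, bank_df_test, type = "response")
confusionMatrix(factor(ifelse(demo_pred >= 0.5, 1, 0), levels = c(0, 1)),
factor(bank_df_test$default.payment.next.month), positive = "1")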