CTBC Bank, a globally recognized Taiwanese financial institution, seeks to improve its credit risk management practices by developing a predictive model that leverages a wide range of data sources, including demographic indicators and comprehensive customer financial profiles. This initiative aims to proactively identify potential credit default risks and enhance risk mitigation strategies, recognizing the critical impact that credit risk can have on a bank’s financial health. By adopting advanced data analytics techniques, CTBC Bank aims to remain competitive in the financial sector, in line with research showing that robust credit risk assessment models help banks weather economic downturns and maintain high-quality loan portfolios with lower default rates.
The expected loss per customer is computed by multiplying the probability of default (PD) by the loss given default (LGD). Our analysis focuses on the probability of default. The specific data mining problem is to predict the dummy variable “default” (yes/no) using predictor variables in the form of demographic information and payment history (e.g. age, gender, past payments). Since the project has a clearly defined outcome - the prediction and reduction of the target variable (customer defaults) - a supervised learning approach is pursued. The goal is to match the predicted defaults across a sample of clients as closely as possible to the actual defaults. This enables the bank to extend credit only to customers with a low expected probability of default, improving its profitability.
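To make this relationship concrete, here is a minimal illustration in R with made-up figures (both the PD and the LGD below are hypothetical; in this report we only model the probability of default):
# Illustration with hypothetical figures: expected loss per customer as the
# product of the probability of default (PD) and the loss given default (LGD),
# here expressed in NT dollars.
pd <- 0.22 # hypothetical probability of default, roughly the sample default rate
lgd <- 30000 # hypothetical loss given default in NT dollars
expected_loss <- pd * lgd
expected_loss
## [1] 6600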
The data come from the UC Irvine Machine Learning Repository and originate from a study conducted by a Taiwanese bank in 2005. Clients were granted a credit card with a randomly assigned limit balance; neither their repayment capability nor their credit history was considered. Their card usage was then tracked for six months. The dataset contains demographic information as well as the payment behavior and default status of the 30,000 credit card clients. The dataset can be found here: UC Irvine Machine Learning Repository
The original dataset includes 30,000 observations and 25 variables: 5 categorical and 20 numerical. It does not contain any NAs.
ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment in the following month (1=yes, 0=no)
library("readxl")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bank_df <- data.frame(read_excel("default of credit card clients.xls"))
## New names:
## • `` -> `...1`
head(bank_df)
## ...1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## 1 ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
## 2 1 20000 2 2 1 24 2 2 -1 -1 -2 -2
## 3 2 120000 2 2 2 26 -1 2 0 0 0 2
## 4 3 90000 2 2 2 34 0 0 0 0 0 0
## 5 4 50000 2 2 1 37 0 0 0 0 0 0
## 6 5 50000 1 2 1 57 -1 0 -1 0 0 0
## X12 X13 X14 X15 X16 X17 X18 X19
## 1 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
## 2 3913 3102 689 0 0 0 0 689
## 3 2682 1725 2682 3272 3455 3261 0 1000
## 4 29239 14027 13559 14331 14948 15549 1518 1500
## 5 46990 48233 49291 28314 28959 29547 2000 2019
## 6 8617 5670 35835 20940 19146 19131 2000 36681
## X20 X21 X22 X23 Y
## 1 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
## 2 0 0 0 0 1
## 3 1000 1000 0 2000 1
## 4 1000 1000 1000 5000 0
## 5 1200 1100 1069 1000 0
## 6 10000 9000 689 679 0
The head already shows some work that has to be done before we explore the data. The column names are currently stored in the first row, so we assign the first row as column names and remove it from the dataset. The last column has spaces in its name, which could cause problems in the further analysis; hence, we rename it to get rid of the spaces. The first column with the client ID does not hold any relevant information, so we can drop it. Lastly, the naming of the PAY_ variables is not in line with PAY_AMT and BILL_AMT: both PAY_AMT and BILL_AMT start with 1, representing September 2005, whereas PAY_ also starts in September but assigns it the number 0. We want our dataset to be as uniform as possible for easier understanding, which is why we rename PAY_0 to PAY_1.
# Use the first row as column names
colnames(bank_df) <- bank_df[1, ]
# Remove the first row from the data frame
bank_df <- bank_df[-1, ]
# Rename the column with spaces in the name
names(bank_df)[names(bank_df) == "default payment next month"] <- "default.payment.next.month"
# Remove the first column (ID)
bank_df <- bank_df[, -1]
# Rename PAY_0 to match the other _XY variables
names(bank_df)[names(bank_df) == "PAY_0"] <- "PAY_1"
Now the data is ready to be explored. The summary confirms that there are no NAs in the dataset. However, the variables are not classified correctly as categorical and numerical, so assigning the variables their correct classes is the first step. Secondly, EDUCATION and MARRIAGE contain values that were either not specified in the data dictionary or do not hold informational value (e.g. the prediction that clients with an unknown educational or relationship status are more/less likely to default is not actionable). Given our high number of observations and the relatively low number of rows containing these uninformative values, we simply remove the respective rows. Furthermore, BILL_AMT has negative values in some cases. Since it is not possible to have a negative bill, these values are either errors in the dataset or due to the possibility of overpaying the credit card. Most banks, including our client, do not offer the possibility of overpaying, so we treat these cases as if they had a bill statement of 0 (credit card not used). The PAY_ variables also contain values that were not specified in the data dictionary: besides -1 and values above 1, the data contains 0. We assumed that either there is no difference between -1 and 0 (a payment delay of 0 months) or that -1 means the payment was made in advance and 0 means it was made just in time. To decrease the complexity of the dataset, and because our client does not differentiate between advance and punctual payments, we replaced all negative values with 0. PAY_ is now a numerical variable representing the payment delay in months.
# Explore the data
str(bank_df)
## 'data.frame': 30000 obs. of 24 variables:
## $ LIMIT_BAL : chr "20000" "120000" "90000" "50000" ...
## $ SEX : chr "2" "2" "2" "2" ...
## $ EDUCATION : chr "2" "2" "2" "2" ...
## $ MARRIAGE : chr "1" "2" "2" "1" ...
## $ AGE : chr "24" "26" "34" "37" ...
## $ PAY_1 : chr "2" "-1" "0" "0" ...
## $ PAY_2 : chr "2" "2" "0" "0" ...
## $ PAY_3 : chr "-1" "0" "0" "0" ...
## $ PAY_4 : chr "-1" "0" "0" "0" ...
## $ PAY_5 : chr "-2" "0" "0" "0" ...
## $ PAY_6 : chr "-2" "2" "0" "0" ...
## $ BILL_AMT1 : chr "3913" "2682" "29239" "46990" ...
## $ BILL_AMT2 : chr "3102" "1725" "14027" "48233" ...
## $ BILL_AMT3 : chr "689" "2682" "13559" "49291" ...
## $ BILL_AMT4 : chr "0" "3272" "14331" "28314" ...
## $ BILL_AMT5 : chr "0" "3455" "14948" "28959" ...
## $ BILL_AMT6 : chr "0" "3261" "15549" "29547" ...
## $ PAY_AMT1 : chr "0" "0" "1518" "2000" ...
## $ PAY_AMT2 : chr "689" "1000" "1500" "2019" ...
## $ PAY_AMT3 : chr "0" "1000" "1000" "1200" ...
## $ PAY_AMT4 : chr "0" "1000" "1000" "1100" ...
## $ PAY_AMT5 : chr "0" "0" "1000" "1069" ...
## $ PAY_AMT6 : chr "0" "2000" "5000" "1000" ...
## $ default.payment.next.month: chr "1" "1" "0" "0" ...
summary(bank_df)
## LIMIT_BAL SEX EDUCATION MARRIAGE
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## AGE PAY_1 PAY_2 PAY_3
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## PAY_4 PAY_5 PAY_6 BILL_AMT1
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3
## Length:30000 Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## PAY_AMT4 PAY_AMT5 PAY_AMT6
## Length:30000 Length:30000 Length:30000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## default.payment.next.month
## Length:30000
## Class :character
## Mode :character
# Variable Classification
bank_df$LIMIT_BAL <- as.numeric(bank_df$LIMIT_BAL)
bank_df$SEX <- as.factor(bank_df$SEX)
bank_df$EDUCATION <- as.factor(bank_df$EDUCATION)
bank_df$MARRIAGE <- as.factor(bank_df$MARRIAGE)
bank_df$default.payment.next.month <- as.factor(bank_df$default.payment.next.month)
bank_df$AGE <- as.numeric(bank_df$AGE)
bank_df$PAY_1 <- as.numeric(bank_df$PAY_1)
bank_df$PAY_2 <- as.numeric(bank_df$PAY_2)
bank_df$PAY_3 <- as.numeric(bank_df$PAY_3)
bank_df$PAY_4 <- as.numeric(bank_df$PAY_4)
bank_df$PAY_5 <- as.numeric(bank_df$PAY_5)
bank_df$PAY_6 <- as.numeric(bank_df$PAY_6)
bank_df$BILL_AMT1 <- as.numeric(bank_df$BILL_AMT1)
bank_df$BILL_AMT2 <- as.numeric(bank_df$BILL_AMT2)
bank_df$BILL_AMT3 <- as.numeric(bank_df$BILL_AMT3)
bank_df$BILL_AMT4 <- as.numeric(bank_df$BILL_AMT4)
bank_df$BILL_AMT5 <- as.numeric(bank_df$BILL_AMT5)
bank_df$BILL_AMT6 <- as.numeric(bank_df$BILL_AMT6)
bank_df$PAY_AMT1 <- as.numeric(bank_df$PAY_AMT1)
bank_df$PAY_AMT2 <- as.numeric(bank_df$PAY_AMT2)
bank_df$PAY_AMT3 <- as.numeric(bank_df$PAY_AMT3)
bank_df$PAY_AMT4 <- as.numeric(bank_df$PAY_AMT4)
bank_df$PAY_AMT5 <- as.numeric(bank_df$PAY_AMT5)
bank_df$PAY_AMT6 <- as.numeric(bank_df$PAY_AMT6)
# Check how many values were not mentioned in the data dictionary or do not hold informational value (e.g. "unknown")
education_rows_to_remove <- sum(bank_df$EDUCATION %in% c(0, 4, 5, 6))
marriage_rows_to_remove <- sum(bank_df$MARRIAGE %in% c(0, 3))
education_rows_to_remove
## [1] 468
marriage_rows_to_remove
## [1] 377
# Remove the rows with those values
bank_df <- bank_df[!(bank_df$EDUCATION %in% c(0,4,5,6)), ]
bank_df <- bank_df[!(bank_df$MARRIAGE %in% c(0,3)), ]
# Re-create the EDUCATION and MARRIAGE factors so that the unused levels are dropped
bank_df$MARRIAGE <- factor(bank_df$MARRIAGE)
bank_df$EDUCATION <- factor(bank_df$EDUCATION)
# Substitute negative values in the BILL_AMT_XY and PAY_XY with 0s
bill_cols <- c("BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6")
bank_df[, bill_cols][bank_df[, bill_cols] < 0] <- 0
pay_cols <- c("PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
bank_df[, pay_cols][bank_df[, pay_cols] < 0] <- 0
For the purpose of dimension reduction, we computed the average of the PAY_, BILL_AMT, and PAY_AMT variables for each client. In this way, we now have information about the average payment delay, average bill statement, and average payment amount across the 6 months of the study. These variables will be used for our models and should give a sufficient summary of the credit history. Afterwards, we explored the data again to check that it is clean and that we have not forgotten anything.
# Compute the variables PAY_AVG, BILL_AVG, and PAY_AMT_AVG
bank_df$PAY_AVG <- rowMeans(bank_df[, c("PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")], na.rm = TRUE)
bank_df$BILL_AVG <- rowMeans(bank_df[, c("BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6")], na.rm = TRUE)
bank_df$PAY_AMT_AVG <- rowMeans(bank_df[, c("PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6")], na.rm = TRUE)
# Place the average column behind the respective inputs
bank_df <- bank_df %>% relocate(PAY_AVG, .after=PAY_6)
bank_df <- bank_df %>% relocate(BILL_AVG, .after=BILL_AMT6)
bank_df <- bank_df %>% relocate(PAY_AMT_AVG, .after=PAY_AMT6)
# Explore the cleaned data
View(bank_df)
summary(bank_df)
## LIMIT_BAL SEX EDUCATION MARRIAGE AGE
## Min. : 10000 1:11575 1:10531 1:13425 Min. :21.00
## 1st Qu.: 50000 2:17588 2:13862 2:15738 1st Qu.:28.00
## Median : 140000 3: 4770 Median :34.00
## Mean : 168094 Mean :35.39
## 3rd Qu.: 240000 3rd Qu.:41.00
## Max. :1000000 Max. :79.00
## PAY_1 PAY_2 PAY_3 PAY_4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3589 Mean :0.3238 Mean :0.3075 Mean :0.2617
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :8.0000 Max. :8.0000 Max. :8.0000 Max. :8.0000
## PAY_5 PAY_6 PAY_AVG BILL_AMT1
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. : 0
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3510
## Median :0.000 Median :0.0000 Median :0.0000 Median : 22303
## Mean :0.225 Mean :0.2294 Mean :0.2844 Mean : 51060
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.3333 3rd Qu.: 66761
## Max. :8.000 Max. :8.0000 Max. :6.0000 Max. :964511
## BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2960 1st Qu.: 2640 1st Qu.: 2329 1st Qu.: 1780
## Median : 21127 Median : 20073 Median : 19050 Median : 18123
## Mean : 49082 Mean : 46957 Mean : 43305 Mean : 40425
## 3rd Qu.: 63720 3rd Qu.: 59947 3rd Qu.: 54509 3rd Qu.: 50302
## Max. :983931 Max. :1664089 Max. :891586 Max. :927171
## BILL_AMT6 BILL_AVG PAY_AMT1 PAY_AMT2
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 1288 1st Qu.: 4789 1st Qu.: 990.5 1st Qu.: 820
## Median : 17143 Median : 21010 Median : 2100.0 Median : 2008
## Mean : 39068 Mean : 44983 Mean : 5635.9 Mean : 5870
## 3rd Qu.: 49348 3rd Qu.: 57084 3rd Qu.: 5006.0 3rd Qu.: 5000
## Max. :961664 Max. :877314 Max. :505000.0 Max. :1684259
## PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 390 1st Qu.: 296 1st Qu.: 264 1st Qu.: 138
## Median : 1800 Median : 1500 Median : 1502 Median : 1500
## Mean : 5168 Mean : 4824 Mean : 4797 Mean : 5204
## 3rd Qu.: 4500 3rd Qu.: 4028 3rd Qu.: 4078 3rd Qu.: 4001
## Max. :896040 Max. :528897 Max. :426529 Max. :528666
## PAY_AMT_AVG default.payment.next.month
## Min. : 0 0:22649
## 1st Qu.: 1117 1: 6514
## Median : 2399
## Mean : 5250
## 3rd Qu.: 5579
## Max. :385692
str(bank_df)
## 'data.frame': 29163 obs. of 27 variables:
## $ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_1 : num 2 0 0 0 0 0 0 0 0 0 ...
## $ PAY_2 : num 2 2 0 0 0 0 0 0 0 0 ...
## $ PAY_3 : num 0 0 0 0 0 0 0 0 2 0 ...
## $ PAY_4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PAY_5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PAY_6 : num 0 2 0 0 0 0 0 0 0 0 ...
## $ PAY_AVG : num 0.667 0.667 0 0 0 ...
## $ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
## $ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
## $ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
## $ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
## $ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
## $ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
## $ BILL_AVG : num 1284 2846 16942 38556 18223 ...
## $ PAY_AMT1 : num 0 0 1518 2000 2000 ...
## $ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
## $ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
## $ PAY_AMT5 : num 0 0 1000 1069 689 ...
## $ PAY_AMT6 : num 0 2000 5000 1000 679 ...
## $ PAY_AMT_AVG : num 115 833 1836 1398 9842 ...
## $ default.payment.next.month: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
Before building the first model, we computed descriptive statistics to get an overview of the sample. We plotted the distributions of the variables and computed measures such as the mean, median, or percentage share for several of them.
Regarding demographics, age is approximately normally distributed with a mean of 35 and a median of 34; the histogram shows that age groups around 30 make up the majority of the sample. As for gender, women dominate the sample with a share of 60%. The chart of education levels shows that the majority of clients have finished university, and a relatively high number have obtained a graduate degree. In contrast, only a few study participants have not pursued higher education beyond high school. Lastly, the sample is distributed relatively equally across single and married individuals.
library(ggplot2)
#Age Distribution Table
ggplot(data = bank_df, aes(x = AGE)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 20)+
labs(title = "Age Distribution Histogram",
x = "Age",
y = "Number of Cases")
# Calculation Mean/Median of Age
mean(as.numeric(bank_df$AGE))
## [1] 35.39056
median(as.numeric(bank_df$AGE))
## [1] 34
# Gender Distribution
ggplot(data = bank_df, aes(x = SEX)) +
geom_bar(fill = "dodgerblue", color = "black") +
scale_x_discrete(labels = c("Male", "Female")) +
labs(title = "Gender Distribution",
x = "Gender",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
# Investigate the number of women further
percentage_women <- sum(bank_df$SEX == 2)/dim(bank_df)[1]
print(percentage_women)
## [1] 0.603093
# Education Chart
ggplot(data = bank_df, aes(x = factor(EDUCATION), fill = factor(EDUCATION))) +
geom_bar() +
scale_x_discrete(labels = c("Graduate School", "University", "High School")) +
scale_fill_manual(values = c("1" = "dodgerblue", "2" = "darkorange", "3" = "forestgreen"),
name = "Education Level",
labels = c("Graduate School", "University", "High School")) +
labs(title = "Education Levels",
x = "Education Level",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
# Marital Status
ggplot(data = bank_df, aes(x = factor(MARRIAGE))) +
geom_bar(fill = "dodgerblue", color = "black") +
scale_x_discrete(labels = c("Married", "Single")) +
labs(title = "Marital Status Distribution",
x = "Marital Status",
y = "Count") +
theme_minimal() +
theme(panel.grid.major = element_line(colour = "lightgray", linewidth = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"))
Regarding credit card use, the large majority of clients does not have payment delays, with a mean average delay of 0.28 months. The average monthly bill amount distribution resembles the right half of a normal distribution with a mean of 44,983 NT dollars (about 1,390 USD). The limit balance shows an irregular, right-skewed distribution with a mean of 168,094 NT dollars. This is significantly higher than the average bill amount, hinting that many clients do not fully use their credit limit.
# Payment Delay
ggplot(data = bank_df, aes(x = PAY_AVG)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 20)+
labs(title = "Payment Delay Distribution",
x = "Average Payment Delay in Months",
y = "Number of Cases")
mean(bank_df$PAY_AVG)
## [1] 0.2843958
# Bill Amount
ggplot(data = bank_df, aes(x = BILL_AVG)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 40)+
labs(title = "Bill Amount Distribution",
x = "Average Monthly Bill Amount in NT Dollars",
y = "Number of Cases")
mean(bank_df$BILL_AVG)
## [1] 44982.82
# Limit Balance
ggplot(data = bank_df, aes(x = LIMIT_BAL)) +
geom_histogram(fill = "dodgerblue", color = "black", bins = 40)+
labs(title = "Limit Balance Distribution",
x = "Average Monthly Limit Balance NT Dollars",
y = "Number of Cases")
mean(bank_df$LIMIT_BAL)
## [1] 168093.9
Lastly, we considered the number of defaulting clients. In our sample, the share of defaulting clients is relatively high at 22%. This might be attributed to the fact that the study did not base the selection of clients on past credit history, so a considerable number of non-creditworthy individuals were granted credit, resulting in a high number of defaults. However, this stands in contrast to the low average payment delay. To investigate the matter further, we computed the percentage of clients who never paid with delay during the time of the study. This group makes up only 66% of the sample, i.e. about a third of clients were late at least once. Hence, despite the right-skewed distribution chart and the low mean, payment delays in the sample should not be underestimated.
# Default
library(ggplot2)
ggplot(data = bank_df, aes(x = factor(default.payment.next.month))) +
geom_bar(fill = "#00857F", color = "black") +
scale_x_discrete(labels = c("No Default", "Default")) +
labs(title = "Number of Defaulting Clients",
x = "Default (yes/no)",
y = "Count") +
theme_minimal() +
theme(
panel.grid.major = element_line(colour = "white", size = 0.2),
plot.title = element_text(hjust = 0.5, face = "bold"),
panel.background = element_rect(fill = "white"))
percentage_default <- sum(bank_df$default.payment.next.month == 1)/dim(bank_df)[1]
print(percentage_default)
## [1] 0.2233652
percentage_no_delay <- sum(bank_df$PAY_AVG == 0)/dim(bank_df)[1]
print(percentage_no_delay)
## [1] 0.6619689
Our target is to predict whether a client will default or not. Since this is a categorical outcome, we use a random forest and a logistic regression approach and subsequently combine them in an ensemble model. Of these three models, the logistic regression performed best.
Before modeling, we partition the data into a training set (60%), a validation set (20%), and a test set (20%). The training data is used to build the models, the validation data to compare and tune them, and the test data to evaluate the final performance on unseen data.
# set seed for reproducibility
set.seed(101)
# Define the proportions for each set
train_prop <- 0.6
valid_prop <- 0.2
test_prop <- 0.2
# Calculate the number of rows for each set
total_rows <- nrow(bank_df)
train_size <- round(train_prop * total_rows)
valid_size <- round(valid_prop * total_rows)
# Create a vector of row indices. With the sample function, we scramble the rows around. We keep all records, but they are not in the same order as before.
indices <- sample(1:total_rows, total_rows, replace = FALSE)
# Split the dataset into training, validation, and testing sets
bank_df_train <- bank_df[indices[1:train_size], ]
bank_df_valid <- bank_df[indices[(train_size + 1):(train_size + valid_size)], ]
bank_df_test <- bank_df[indices[(train_size + valid_size + 1):total_rows], ]
We built our first logistic regression model using the variables LIMIT_BAL, MARRIAGE, AGE, PAY_AVG, BILL_AVG, EDUCATION, and SEX. We decided to use only the average variables for the monthly measurements, since each average contains information about all months. We dropped the payment amounts (PAY_AMT) because they strongly correlate with the bill amounts (BILL_AMT), which would introduce collinearity issues. It should be noted that we kept LIMIT_BAL - one might argue that the limit balance is itself determined by default risk and is therefore not a good predictor. However, in the bank’s study the limit balances were randomly assigned to investigate the effect this has on spending and default, so it may be used as a predictor.
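As a quick check of the collinearity argument, one could compute the correlation between the averaged bill and payment amounts on the training data; this sketch is an addition and its exact result is not reported here.
# Collinearity check (illustrative): pairwise correlations of the candidate
# numeric predictors on the training data.
cor(bank_df_train$BILL_AVG, bank_df_train$PAY_AMT_AVG)
cor(bank_df_train[, c("LIMIT_BAL", "AGE", "PAY_AVG", "BILL_AVG", "PAY_AMT_AVG")])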
After building the first model, we checked the predictors for significance. In our second and final model we dropped AGE, BILL_AVG, and EDUCATION due to insignificance. The remaining predictors (LIMIT_BAL, MARRIAGE, PAY_AVG, and SEX) are all significant at the 0.001 level.
# 1. Create the first model using all variables.
logreg1 <- glm(default.payment.next.month ~ LIMIT_BAL + MARRIAGE + AGE + PAY_AVG + BILL_AVG + EDUCATION + SEX, data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg1)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## AGE + PAY_AVG + BILL_AVG + EDUCATION + SEX, family = "binomial",
## data = rbind(bank_df_train, bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.354e+00 9.970e-02 -13.586 < 2e-16 ***
## LIMIT_BAL -1.861e-06 1.672e-07 -11.124 < 2e-16 ***
## MARRIAGE2 -1.861e-01 3.937e-02 -4.727 2.28e-06 ***
## AGE 3.786e-03 2.144e-03 1.766 0.0774 .
## PAY_AVG 1.301e+00 3.021e-02 43.064 < 2e-16 ***
## BILL_AVG -1.286e-07 3.130e-07 -0.411 0.6813
## EDUCATION2 3.717e-03 3.997e-02 0.093 0.9259
## EDUCATION3 -5.403e-02 5.425e-02 -0.996 0.3193
## SEX2 -1.428e-01 3.498e-02 -4.083 4.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21728 on 23322 degrees of freedom
## AIC: 21746
##
## Number of Fisher Scoring iterations: 4
# 2. Create the final model dropping the insignificant predictors.
logreg_final <- glm(default.payment.next.month ~ LIMIT_BAL + MARRIAGE + PAY_AVG + SEX, data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg_final)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## PAY_AVG + SEX, family = "binomial", data = rbind(bank_df_train,
## bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.218e+00 4.286e-02 -28.421 < 2e-16 ***
## LIMIT_BAL -1.831e-06 1.494e-07 -12.255 < 2e-16 ***
## MARRIAGE2 -2.135e-01 3.442e-02 -6.203 5.55e-10 ***
## PAY_AVG 1.299e+00 2.987e-02 43.474 < 2e-16 ***
## SEX2 -1.503e-01 3.465e-02 -4.339 1.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21732 on 23326 degrees of freedom
## AIC: 21742
##
## Number of Fisher Scoring iterations: 4
We created the first random forest model using the same variables as in our first logistic regression model (see above for a more detailed justification) and started with relatively low values for ntree, mtry, and nodesize. We trained the model on the training data, predicted on the validation data, and evaluated the model performance based on the confusion matrix. We repeated this process for several random forest models with tweaked parameters to find the best one. The confusion matrix for the first model showed a high accuracy of 78% with a sensitivity of 99% - speaking in favor of the model. However, the specificity is very low at 4%, so the model has to be adjusted to achieve a higher specificity. This matters because a low specificity means high opportunity cost: the specificity tells us how well our model identifies true negatives (clients who do not default). If the specificity is low, our model advises the bank to decline a high number of clients who would not actually default, which means lost profit for the bank. Before creating our second random forest, we therefore checked whether the low specificity might be caused by the class distribution by counting the 0s in the full, training, and validation sets; if the imbalance were the issue, oversampling might solve the problem. This turned out not to be the case, so we proceeded by tweaking the parameters to find the best random forest.
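For reference, this is what oversampling could look like; it is a sketch only and was not used in any of our models.
# Illustrative sketch (not used in our final models): oversample the minority
# class (defaulting clients, class 1) in the training data to balance the classes.
set.seed(101)
train_minority <- bank_df_train[bank_df_train$default.payment.next.month == 1, ]
train_majority <- bank_df_train[bank_df_train$default.payment.next.month == 0, ]
train_oversampled <- train_minority[sample(nrow(train_minority), nrow(train_majority), replace = TRUE), ]
bank_df_train_balanced <- rbind(train_majority, train_oversampled)
table(bank_df_train_balanced$default.payment.next.month)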
After rf_2, we tried increasing ntree and nodesize further to increase sensitivity. This did not work; we only added complexity without gaining sensitivity or accuracy. Since there is no simple linear relationship between the model’s accuracy and mtry, we then decided to start with mtry = 5 and decrease it incrementally by 1 to see whether this improves the model. Decreasing mtry improved the accuracy and specificity of our model, but only marginally. Therefore, we decided to stay with mtry = 3.
Another parameter we tweaked for our final model to achieve a higher specificity is the cutoff score. The default cutoff is 0.5. If we increase the cutoff score for the non-default class, the number of predicted negatives, and thereby the specificity, increases. We tried cutoff scores of 0.6 and 0.7 and decided to stay with 0.7 because it produces a decent specificity and we did not want to sacrifice more sensitivity. In fact, we prioritize sensitivity, given that undetected positives (= defaulting clients) result in a direct loss for CTBC.
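The cutoff comparison can also be automated. The following helper is an addition to the original analysis and assumes a fitted forest such as rf_5 from the code below; it sweeps several cutoffs on the validation set and tabulates the resulting metrics.
# Illustrative helper: evaluate several cutoff scores on the validation data.
# Predicting class 1 when its vote share exceeds 1 - cutoff roughly mirrors
# randomForest's cutoff = c(cutoff, 1 - cutoff) for the class labels ("0", "1").
sweep_cutoffs <- function(model, newdata, cutoffs = c(0.5, 0.6, 0.7)) {
  probs <- predict(model, newdata, type = "prob")[, "1"]
  actual <- factor(newdata$default.payment.next.month, levels = c("0", "1"))
  t(sapply(cutoffs, function(ct) {
    pred <- factor(ifelse(probs >= 1 - ct, "1", "0"), levels = c("0", "1"))
    cm <- caret::confusionMatrix(pred, actual) # positive class is "0"
    c(cutoff = ct, cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")])
  }))
}
# Example call (after the forests below have been built):
# sweep_cutoffs(rf_5, bank_df_valid)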
# Import libraries
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
# 1. Create the first random forest
rf <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 10,
mtry = 1, nodesize = 5, importance = TRUE)
# 2. Look at confusion matrix.
rf.pred <- predict(rf, bank_df_valid)
confusionMatrix(rf.pred, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4484 1256
## 1 35 58
##
## Accuracy : 0.7787
## 95% CI : (0.7678, 0.7893)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.2408
##
## Kappa : 0.0543
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99225
## Specificity : 0.04414
## Pos Pred Value : 0.78118
## Neg Pred Value : 0.62366
## Prevalence : 0.77473
## Detection Rate : 0.76873
## Detection Prevalence : 0.98406
## Balanced Accuracy : 0.51820
##
## 'Positive' Class : 0
##
# 3. Check if our specificity issue is due to a low number of 0s.
count_0 <- table(bank_df$default.payment.next.month)[as.character(0)]
count_0
## 0
## 22649
count_train_0 <- table(bank_df_train$default.payment.next.month)[as.character(0)]
count_train_0
## 0
## 13582
count_valid_0 <- table(bank_df_valid$default.payment.next.month)[as.character(0)]
count_valid_0
## 0
## 4519
# 4.1 Create a second random forest with tweaked inputs (higher ntree and mtry)
rf_2 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 5, nodesize = 5, importance = TRUE)
# 4.2 Look at the specificity in the confusion matrix
rf.pred_2 <- predict(rf_2, bank_df_valid)
confusionMatrix(rf.pred_2, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4143 891
## 1 376 423
##
## Accuracy : 0.7828
## 95% CI : (0.772, 0.7933)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.07205
##
## Kappa : 0.2772
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9168
## Specificity : 0.3219
## Pos Pred Value : 0.8230
## Neg Pred Value : 0.5294
## Prevalence : 0.7747
## Detection Rate : 0.7103
## Detection Prevalence : 0.8630
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
# 5.1 Create a third random forest with tweaked inputs (higher ntree and nodesize)
rf_3 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 80,
mtry = 5, nodesize = 20, importance = TRUE)
# 5.2 Look at the specificity in the confusion matrix
rf.pred_3 <- predict(rf_3, bank_df_valid)
confusionMatrix(rf.pred_3, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4215 883
## 1 304 431
##
## Accuracy : 0.7965
## 95% CI : (0.7859, 0.8068)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 3.046e-05
##
## Kappa : 0.309
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9327
## Specificity : 0.3280
## Pos Pred Value : 0.8268
## Neg Pred Value : 0.5864
## Prevalence : 0.7747
## Detection Rate : 0.7226
## Detection Prevalence : 0.8740
## Balanced Accuracy : 0.6304
##
## 'Positive' Class : 0
##
# 6.1 Create a fourth random forest with tweaked inputs (mtry = 4)
rf_4 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 4, nodesize = 5, importance = TRUE)
# 6.2 Look at the specificity in the confusion matrix
rf.pred_4 <- predict(rf_4, bank_df_valid)
confusionMatrix(rf.pred_4, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4169 894
## 1 350 420
##
## Accuracy : 0.7867
## 95% CI : (0.776, 0.7972)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.01429
##
## Kappa : 0.2839
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9225
## Specificity : 0.3196
## Pos Pred Value : 0.8234
## Neg Pred Value : 0.5455
## Prevalence : 0.7747
## Detection Rate : 0.7147
## Detection Prevalence : 0.8680
## Balanced Accuracy : 0.6211
##
## 'Positive' Class : 0
##
# 7.1 Create a fifth random forest with tweaked inputs (mtry = 3)
rf_5 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE)
# 7.2 Look at the specificity in the confusion matrix
rf.pred_5 <- predict(rf_5, bank_df_valid)
confusionMatrix(rf.pred_5, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4195 900
## 1 324 414
##
## Accuracy : 0.7902
## 95% CI : (0.7795, 0.8005)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.00236
##
## Kappa : 0.2882
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9283
## Specificity : 0.3151
## Pos Pred Value : 0.8234
## Neg Pred Value : 0.5610
## Prevalence : 0.7747
## Detection Rate : 0.7192
## Detection Prevalence : 0.8735
## Balanced Accuracy : 0.6217
##
## 'Positive' Class : 0
##
# 8.1 Create a sixth random forest by increasing the cutoff score to 0.6
rf_6 <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE, cutoff = c(0.6,0.4))
# 8.2 Look at the specificity in the confusion matrix
rf.pred_6 <- predict(rf_6, bank_df_valid)
confusionMatrix(rf.pred_6, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3988 747
## 1 531 567
##
## Accuracy : 0.7809
## 95% CI : (0.7701, 0.7915)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.1328
##
## Kappa : 0.3334
##
## Mcnemar's Test P-Value : 1.809e-09
##
## Sensitivity : 0.8825
## Specificity : 0.4315
## Pos Pred Value : 0.8422
## Neg Pred Value : 0.5164
## Prevalence : 0.7747
## Detection Rate : 0.6837
## Detection Prevalence : 0.8118
## Balanced Accuracy : 0.6570
##
## 'Positive' Class : 0
##
# 9.1 Create the final random forest by increasing the cutoff score to 0.7
rf_final <- randomForest(as.factor(default.payment.next.month) ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_AVG + BILL_AVG, data = bank_df_train, ntree = 50,
mtry = 3, nodesize = 5, importance = TRUE, cutoff = c(0.7,0.3))
# 9.2 Look at the specificity in the confusion matrix
rf.pred_final <- predict(rf_final, bank_df_valid)
confusionMatrix(rf.pred_final, as.factor(bank_df_valid$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3699 610
## 1 820 704
##
## Accuracy : 0.7548
## 95% CI : (0.7436, 0.7658)
## No Information Rate : 0.7747
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.3353
##
## Mcnemar's Test P-Value : 3.26e-08
##
## Sensitivity : 0.8185
## Specificity : 0.5358
## Pos Pred Value : 0.8584
## Neg Pred Value : 0.4619
## Prevalence : 0.7747
## Detection Rate : 0.6342
## Detection Prevalence : 0.7387
## Balanced Accuracy : 0.6772
##
## 'Positive' Class : 0
##
As a third model, we built an ensemble model based on the random forest and the logistic regression. In a first step, we stored the predicted probabilities of the random forest in a data frame (without doing so, we ran into an error when setting up the res data frame). Afterwards, we built the data frame res, which for every observation in the test dataset shows the actual class as well as the class predicted by each of the two models together with the corresponding probability. Note that TREEPred is derived with the increased cutoff score of 0.7 instead of 0.5. We then set options(digits = 1, scipen = 2) so that printed numbers are shown with fewer significant digits and fixed rather than scientific notation is preferred. Finally, we added a column with the average of the two model probabilities to the data frame. This average is the probability of the ensemble model and will be used to arrive at the class predicted by the ensemble. Given that we only have two models, the ensemble can only be built by averaging probabilities. With a larger (especially odd) number of input models, an ensemble can also be built by majority vote; with two models this does not make sense, because the models either agree or are tied, so we would have to prioritize the vote of one model, making the ensemble identical to that model.
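As a purely hypothetical illustration of the majority-vote alternative with three models (pred_a, pred_b, and pred_c are made-up 0/1 prediction vectors, not outputs of our models):
# Hypothetical example: majority vote across three classifiers.
pred_a <- c(0, 1, 1, 0)
pred_b <- c(1, 1, 0, 0)
pred_c <- c(1, 1, 1, 0)
majority_vote <- ifelse(pred_a + pred_b + pred_c >= 2, 1, 0)
majority_vote
## [1] 1 1 1 0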
# Build a dataframe with the probability predictions of the random forest
predictions_tree <- data.frame(predict(rf_final, bank_df_test, type = "prob"))
# Build dataframe showing the actual class and predicted probability/class by the models
res <- data.frame(ActualClass = bank_df_test$default.payment.next.month,
LRProb = predict(logreg_final, bank_df_test, type = "response"),
LRPred = ifelse(predict(logreg_final, bank_df_test, type = "response")>0.5, 1, 0),
TREEProb = predictions_tree[,2],
TREEPred = ifelse(predictions_tree[,2]>0.7, 1, 0))
# Set options
options(digits = 1, scipen = 2)
# Add average probability
res$avg <- rowMeans(data.frame(res$LRProb, res$TREEProb))
To evaluate the performance of our models on new data, we used the test data. First, we ran each model on the test dataset to predict if a client defaults or not. Then, we computed the confusion matrix to get an overview of the accuracy, sensitivity, and specificity of the models.
The logistic regression has an accuracy of 80% with a sensitivity of 82% and specificity of 65%. We decided not to tweak the cutoff score of the logistic regression model because we prioritize sensitivity to prevent losses and already have a relatively high specificity.
For the random forest, we already fixed the cutoff score at 0.7 during the model building process. The random forest has a lower accuracy and specificity, at 76% and 54% respectively. Its sensitivity, at 82%, is as high as that of the logistic regression. Judged by the confusion matrix, the random forest therefore performs worse than the logistic regression. The only way to increase its accuracy would be to decrease the cutoff score, which would, however, mean a very low specificity (resulting in high opportunity cost, as explained during the model building process).
Regarding the ensemble model, the cutoff score had to be lowered significantly to obtain a valuable model. With the default cutoff score of 0.5, the sensitivity was only 27% (note that for the ensemble the confusion matrices below define the positive class as 1, i.e. default). A lower cutoff score results in a higher number of individuals classified as positives (defaulting), and hence a higher number of true positives. After incrementally decreasing the cutoff score, we found that a cutoff of 0.2 offered the most attractive tradeoff between decreasing accuracy and increasing sensitivity. With a cutoff score of 0.2, the model has an accuracy of 72%, a sensitivity of 65%, and a specificity of 74%. In conclusion, the ensemble model has the worst performance: its accuracy is the lowest, and its sensitivity of 65% is relatively low. Given that every false negative results in a loss for the bank, such a low sensitivity is not acceptable, making the ensemble the least appropriate model for our case.
Based on the confusion matrices, the logistic regression outperforms the other two models by far.
# 1. Predict the values for the testing dataset with the logistic regression.
logit_pred <- predict(logreg_final, bank_df_test, type = "response")
confusionMatrix(factor(bank_df_test$default.payment.next.month), factor(ifelse(logit_pred >=0.5,1,0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4389 159
## 1 984 300
##
## Accuracy : 0.804
## 95% CI : (0.794, 0.814)
## No Information Rate : 0.921
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.258
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.817
## Specificity : 0.654
## Pos Pred Value : 0.965
## Neg Pred Value : 0.234
## Prevalence : 0.921
## Detection Rate : 0.753
## Detection Prevalence : 0.780
## Balanced Accuracy : 0.735
##
## 'Positive' Class : 0
##
# 2. Predict the values for the testing dataset with the random forest.
rf.pred_final_test <- predict(rf_final, bank_df_test)
confusionMatrix(predict(rf_final,bank_df_test), as.factor(bank_df_test$default.payment.next.month))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3726 589
## 1 822 695
##
## Accuracy : 0.758
## 95% CI : (0.747, 0.769)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.338
##
## Mcnemar's Test P-Value : 6.56e-10
##
## Sensitivity : 0.819
## Specificity : 0.541
## Pos Pred Value : 0.863
## Neg Pred Value : 0.458
## Prevalence : 0.780
## Detection Rate : 0.639
## Detection Prevalence : 0.740
## Balanced Accuracy : 0.680
##
## 'Positive' Class : 0
##
# 3. Predict the values for the testing dataset with the Ensemble Model
# Whenever we ran the confusion matrix on the bank_df_test, we got an error message. The error message was fixed by simply duplicating the df and using the duplicate.
bank_df_test_new <- data.frame(bank_df_test)
bank_df_test_new$default_factor <- factor(bank_df_test_new[, "default.payment.next.month"])
# Confusion Matrix with a cutoff score of 0.5
res$avg_result <- ifelse(res$avg > 0.5,1,0)
confusionMatrix(factor((res$avg > 0.5)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4349 937
## 1 199 347
##
## Accuracy : 0.805
## 95% CI : (0.795, 0.815)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.00000114
##
## Kappa : 0.285
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.2702
## Specificity : 0.9562
## Pos Pred Value : 0.6355
## Neg Pred Value : 0.8227
## Prevalence : 0.2202
## Detection Rate : 0.0595
## Detection Prevalence : 0.0936
## Balanced Accuracy : 0.6132
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.3
confusionMatrix(factor((res$avg > 0.3)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3931 651
## 1 617 633
##
## Accuracy : 0.783
## 95% CI : (0.772, 0.793)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 0.313
##
## Kappa : 0.361
##
## Mcnemar's Test P-Value : 0.354
##
## Sensitivity : 0.493
## Specificity : 0.864
## Pos Pred Value : 0.506
## Neg Pred Value : 0.858
## Prevalence : 0.220
## Detection Rate : 0.109
## Detection Prevalence : 0.214
## Balanced Accuracy : 0.679
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.2
confusionMatrix(factor((res$avg > 0.2)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3346 455
## 1 1202 829
##
## Accuracy : 0.716
## 95% CI : (0.704, 0.727)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.315
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.646
## Specificity : 0.736
## Pos Pred Value : 0.408
## Neg Pred Value : 0.880
## Prevalence : 0.220
## Detection Rate : 0.142
## Detection Prevalence : 0.348
## Balanced Accuracy : 0.691
##
## 'Positive' Class : 1
##
# Confusion Matrix with a cutoff score of 0.1
confusionMatrix(factor((res$avg > 0.1)* 1), factor(bank_df_test_new[,28]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1541 127
## 1 3007 1157
##
## Accuracy : 0.463
## 95% CI : (0.45, 0.476)
## No Information Rate : 0.78
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.133
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.901
## Specificity : 0.339
## Pos Pred Value : 0.278
## Neg Pred Value : 0.924
## Prevalence : 0.220
## Detection Rate : 0.198
## Detection Prevalence : 0.714
## Balanced Accuracy : 0.620
##
## 'Positive' Class : 1
##
As a second performance measure, we plotted the gains charts of the models. The gains chart of a model shows how many true positives are cumulatively identified (y-axis) for a given number of observations (x-axis), with the observations ranked by the model’s predicted probability. As a benchmark, the result of a random assignment is plotted, which would be equivalent to shuffling the 1s and 0s randomly in the ActualClass column. The gap between the benchmark line and the model curve tells us how much better the model performs compared to random assignment.
Looking at the gains charts, it is evident that there are only small differences between the models. The curves are very close to each other and overlap to a great extent. All of them lie above the benchmark, demonstrating that all models are valuable and perform better than random assignment. However, we cannot extract much more information from the graphs; given their proximity, the gains charts are not useful for comparing the models, so our evaluation remains based on the confusion matrix results. As a complementary single-number summary, an AUC sketch is shown after the gains chart code below.
# 1. Compute Gains
library(gains)
# Convert the 0/1 factor via as.character before as.numeric so that the factor
# codes (1/2) do not inflate the cumulative totals
gain <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(logit_pred), groups=10)
gain_rf <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(predictions_tree[,2]), groups=10)
gain_ens <- gains(as.numeric(as.character(bank_df_test$default.payment.next.month)), as.numeric(res$avg), groups=10)
# 2. Plot Charts
plot(c(0,gain$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain$cume.obs),xlab="# Cases", ylab="Cumulative Defaults", main="Gains Chart", type="l")
lines(c(0,gain_rf$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain_rf$cume.obs), lty=2)
lines(c(0,gain_ens$cume.pct.of.total*sum(bank_df_test$default.payment.next.month == 1))~c(0,gain_ens$cume.obs), lty=3)
# 3. Add Benchmark
lines(c(0,sum(bank_df_test$default.payment.next.month == 1))~c(0, dim(bank_df_test)[1]), lty=1)
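Because the gains curves lie so close together, a single-number summary can complement them. The following sketch is an addition to the original analysis and assumes the pROC package is installed; it computes the area under the ROC curve (AUC) for each model on the test set.
# Complementary comparison (illustrative): AUC on the test set for the
# logistic regression, the random forest, and the ensemble model.
library(pROC)
actual_test <- bank_df_test$default.payment.next.month
auc(roc(actual_test, as.numeric(logit_pred)))
auc(roc(actual_test, as.numeric(predictions_tree[, 2])))
auc(roc(actual_test, as.numeric(res$avg)))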
Lastly, we would like to explore which variables are the most important predictors.
For the random forest, we can do this by plotting the variable importance plot. Variable importance can be measured either as the mean decrease in accuracy when a predictor is randomly permuted (type = 1, which we plot below) or as the total decrease in the Gini index attributable to that predictor, summed over all trees in the forest (type = 2). For the logistic regression, we can look at the model equation and gauge the importance of each predictor by comparing the absolute values of the coefficient estimates.
The variable importance plot of the random forest shows that PAY_AVG is by far the most important predictor. Given that this variable captures the average payment delay of a client, this is plausible: clients who frequently pay with delay are more likely to default. Other important predictors are LIMIT_BAL, BILL_AVG, AGE, and MARRIAGE. EDUCATION and SEX have only little influence on the default risk of a client.
The summary of the logistic regression indicates that the default risk of a client is lower if she has a higher limit balance, has no payment delays, is single, and is female. The absolute values of the coefficients give some insight into the influence of each factor: as for the random forest, the most influential predictor is the payment delay captured by PAY_AVG, followed by marital status and sex, while the limit balance has the smallest coefficient. Note, however, that coefficients are not directly comparable across predictors measured on different scales - LIMIT_BAL is measured in NT dollars, so its tiny per-dollar coefficient does not imply a negligible effect. The observed patterns might be driven by several behavioral and socio-demographic factors; it should be noted that any such explanations are only assumptions and not the product of scientific research.
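One way to make the coefficients easier to interpret (an addition to the analysis, not part of the original code) is to convert them to odds ratios:
# Odds ratios for the final logistic regression: exp() of the coefficients.
# exp(coef) for PAY_AVG gives the multiplicative change in the odds of default
# per additional month of average payment delay.
exp(coef(logreg_final))
# Odds ratio for a 100,000 NT dollar increase in the credit limit:
exp(coef(logreg_final)["LIMIT_BAL"] * 100000)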
# 1. Plot a variable importance plot for the random forest.
varImpPlot(rf_final, type = 1)
# 2. Use summary to see the coefficient estimates
summary(logreg_final)
##
## Call:
## glm(formula = default.payment.next.month ~ LIMIT_BAL + MARRIAGE +
## PAY_AVG + SEX, family = "binomial", data = rbind(bank_df_train,
## bank_df_valid))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.22e+00 4.29e-02 -28.42 < 2e-16 ***
## LIMIT_BAL -1.83e-06 1.49e-07 -12.26 < 2e-16 ***
## MARRIAGE2 -2.13e-01 3.44e-02 -6.20 5.5e-10 ***
## PAY_AVG 1.30e+00 2.99e-02 43.47 < 2e-16 ***
## SEX2 -1.50e-01 3.46e-02 -4.34 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24830 on 23330 degrees of freedom
## Residual deviance: 21732 on 23326 degrees of freedom
## AIC: 21742
##
## Number of Fisher Scoring iterations: 4
In summary, we built a logistic regression, a random forest, and an ensemble model to predict whether a bank client will default or not. After tweaking several parameters and comparing model performance, the logistic regression emerged as the best model, and we would therefore propose it as the final model to our client. It has an accuracy of 80%, a sensitivity of 82%, and a specificity of 65%. Its predictors are past payment delays (PAY_AVG), marital status (MARRIAGE), sex (SEX), and limit balance (LIMIT_BAL). Its main advantage over the random forest and the ensemble model is its relatively balanced sensitivity and specificity: the random forest has a low specificity, whereas the ensemble model has a low sensitivity, so the cutoff scores of both had to be tweaked, which decreased their accuracy.
We believe our final model is the best we can build with the available data. We also tried to build a model without information on credit history, using solely demographic data, in order to advise our client on how to deal with new customers who cannot provide a past credit history. However, regardless of how we tweaked the cutoff score, we were not able to achieve a specificity higher than 20%. This means the model would classify only a very small number of clients as creditworthy. The cost and complexity of implementing such a model are therefore likely to surpass its business value, and we advise our client to simply give new clients without a credit history low limit balances with a high interest rate on debt. For customers with a credit history, the bank should implement our final model.
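For reference, here is a sketch of what such a demographics-only model could look like; the exact specification we used is not reproduced in this report, so the variables and cutoff below are an approximation.
# Illustrative reconstruction of a model without credit-history information,
# using only demographic predictors that would be available for new customers.
logreg_demo <- glm(default.payment.next.month ~ SEX + EDUCATION + MARRIAGE + AGE,
data = rbind(bank_df_train, bank_df_valid), family = "binomial")
summary(logreg_demo)
demo_pred <- predict(logreg_demo, bank_df_test, type = "response")
confusionMatrix(factor(ifelse(demo_pred >= 0.5, 1, 0), levels = c(0, 1)),
factor(bank_df_test$default.payment.next.month), positive = "1")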