Introduction
Data
Exploratory Data Analysis
- Exploration of Predictors
- Exploration of Dependent Variable
Modelling
Conclusion
- Model
- Predictors’ impact
Future work

Introduction

This report contains two parts devoted to Exploratory Analysis and Prediction Models. In the chapters, I tried to touch on the most popular methods and algorithms in order to find the best model which will help predict default and to answer the questions:

How does the probability of default payment vary by categories of different demographic variables?
Which variables are the strongest predictors of default payment?

I hope you will enjoy this report.

In this research, I use Google’s R Style Guide

Data

Data Overview

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The data set comes from the UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science.

Number of Instances: 30000

Data Dictionary

There are 25 variables:

ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment in June, 2005 (1=yes, 0=no)

Importing Data

Let’s load the data, remove some extra data and make basic transformations:

library(XLConnect)
data.file.path <- 
  file.path(getwd(), "Data", "default of credit card clients.xls")

raw.data <- 
  readWorksheet(loadWorkbook(data.file.path), sheet = 1, 
                startCol = 1,
                startRow = 2,  # first row is synthetic names such as X1
                autofitCol = TRUE)

# Converting columns to factors according to the data description 
pay.cols.names  <- paste0("PAY_", c(0, 2:6))  # "PAY_0" "PAY_2" "PAY_3"...
cols.to.factors <- c("SEX", "EDUCATION", "MARRIAGE", 
                     pay.cols.names, "default.payment.next.month")
raw.data[cols.to.factors] <- lapply(raw.data[cols.to.factors], factor)

Exploratory Data Analysis

Exploration of Predictors

Categorical

Repayment Status (PAY_X)

Let’s compare the documentation provided for the data and a real situation. According to the description, this PAY_X is a set of categorical variables with the levels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above. Let’s look at real PAY_X columns:

library(ggplot2)
# Generating bar plots for PAY_0..6 variables 
pay.histograms <- 
  lapply(pay.cols.names, 
         function (pay.col.name){ 
           ggplot(data = raw.data[, pay.cols.names], 
                  aes(x = raw.data[, pay.col.name])) +
                  geom_bar(stat = "count") + 
                  theme_minimal() +
                  xlab(paste0("Repayment status ", pay.col.name)) +
                  ylab("Observations count")
         })
MultiPlot(plotlist = pay.histograms, cols = 3)

We observe undocumented values for repayment status variables: -2 and 0. Moreover, fraction of it is 86.5%. Strictly speaking, it is “NAs”. Due to this, there are several standard strategies to deal with:

Remove observations with NAs. After that we will have 4061 from 30000 observations. Obviously, the loss is too great.
Delete variables. I would suggest exploring the prediction power of this before removing these variables.
Replace NAs by major class. But our NA is a major class already. Additionally, it does not look like NAs are random. So, this is a wrong approach in our case.
Keep this data as an “NA classes”. I think this is the best decision at this stage of the research and we should keep this data “as is”.

Social Status Predictors (SEX, EDUCATION, MARRIAGE)

# Getting social predictors only 
dem.raw.data <- raw.data %>% select(SEX:MARRIAGE)

# Generating bar plots for PAY_0..6 variables 
dem.histograms <- 
  lapply(colnames(dem.raw.data), 
         function (pay.col.name){ 
           ggplot(data = raw.data[, pay.cols.names], 
                  aes(x = raw.data[, pay.col.name])) +
                  geom_bar(stat = "count") + 
                  theme_minimal() +
                  xlab(pay.col.name) +
                  ylab("Observations count")
         })
MultiPlot(plotlist = dem.histograms, cols = 3)

We have some undocumented value 0 for EDUCATION and MARRIGE. I suggest this as NA. Also, let’s rename levels SEX to Male and Female

library(plyr)
# Renaming Male/Female levels for SEX variable
levels(raw.data$SEX)[levels(raw.data$SEX) == "1"]             <- "Male"
levels(raw.data$SEX)[levels(raw.data$SEX) == "2"]             <- "Female"
str(raw.data$SEX)

##  Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 1 ...

Numerical

Age (AGE)

# Getting ID and AGE columns only
age.raw.data <- raw.data[c("ID", "AGE")]

# Calculate a cutoff for the "rule of thumb"
age.thumb.cutoffs <- 
  apply(age.raw.data, 2,
        function (t){
          Q3 = quantile(t, .75) + 1.5 * IQR(t) 
        }
) %>% t()

# Generating a histogram for the AGE variable with a vertical cutoff line
age.hist.plot <-
  ggplot(data = age.raw.data, 
         aes(x = age.raw.data$AGE)) +
         geom_histogram(binwidth = 1,
                        colour = "black", fill = "white") +
         geom_vline(aes(xintercept = age.thumb.cutoffs[1, "AGE"]), 
                        colour = "#BB0000", linetype = "dashed") +
         theme_minimal() +
         xlab("Age")

# Generating a scatterplot for the AGE variable with a horizontal cutoff line
age.scatter.plot <-
  ggplot(data = age.raw.data, 
         aes(x = 1:length(age.raw.data$AGE),
             y = age.raw.data$AGE)) +
         geom_point(shape = 1,
                    position = position_jitter(height = .5)) +
         geom_hline(aes(yintercept = age.thumb.cutoffs[1, "AGE"]), 
                    colour = "#BB0000", linetype = "dashed") +
         theme_minimal() +
         xlab("Index") + ylab("Age") 

# Printing both plots near each other
MultiPlot(age.hist.plot, age.scatter.plot)

Look at the scatterplot. There is something strange in the structure of the data. We can explore over-plotting for customers with indexes more than 15000. It looks like table forming was changed for these customers and a data provider started to populate it by packs of data with some order by Age. Let’s draw separated histograms for this data to be sure that both part of the data has the same statistical structure.

# Marking both parts of the AGE data
age.raw.data <- age.raw.data %>% 
  mutate(isClear = ifelse(ID < 15000, "Clear part", "Questionable part"))

# Drawing violin histograms and boxplots for both parts of data
ggplot(data = age.raw.data, aes(x = isClear, y = AGE)) +
  geom_violin(aes(fill = isClear), trim = FALSE, alpha = .3) +
  geom_boxplot(aes(fill = isClear), width = .2, outlier.colour = NA) +
  theme(legend.position = "NA") +
  labs(x = "", y = "Age")

So, both parts of data have the same density. No more doubts in data with ID > 15000.

Amount of Bill Statement (BILL_AMTX)

Outliers

Let’s explore histograms and check data for outliers using our expert judgment and the “rule of thumb” (Han, Jiawei and M. Kamber, 2006. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2006): outlier if bigger than \[ Q3 + 1.5*IQR \] or less than \[ Q1 - 1.5*IQR \] Note: The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate parts are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

# Getting BILL_AMT1..6 columns only
bill.raw.data <- raw.data %>% select(starts_with("BILL_AMT"))  

# Calculate cutoffs for the "rule of thumb"
(r.thumb.cutoffs <-
  apply(bill.raw.data, 2, function (t){
          c(Q1 = quantile(t, .25) - 1.5 * IQR(t),
            Q3 = quantile(t, .75) + 1.5 * IQR(t))
        }
  ) %>% t())

##              Q1.25%   Q3.75%
## BILL_AMT1 -91739.62 162389.4
## BILL_AMT2 -88547.50 155538.5
## BILL_AMT3 -83581.50 146412.5
## BILL_AMT4 -75942.12 132774.9
## BILL_AMT5 -70878.25 122831.8
## BILL_AMT6 -70657.38 121111.6

# Drawing histograms with cutoff lines
HistogramsFromVariables <- 
  function(data.columns, cutoffs.table = data.frame(), 
           bin.width = 5000, text = "") {
    # Generates list of ggplot histograms with vertical lines (cutoffs.table)
    #
    # Args:
    #   data.columns:  variables
    #   cutoffs.table: table with values for drawing vertical lines
    #   bin.width:     bin size for histograms. 
    #   text:          comment for the x axe 
    #
    # Returns:
    #   List of ggplot objects
    lapply(colnames(data.columns),
           function (col.name){ 
             h <- ggplot(data = data.columns, 
                    aes(x = data.columns[, col.name])) +
                    geom_histogram(binwidth = bin.width, 
                                   colour = "black", fill = "white") +
                    theme_minimal() +
                    xlab(paste0(text, " ", col.name, " ")) +
                    ylab("Obs. count") 
             
             # Adjusting plot to data contained in cutoffs.table:
             # two, one or without cutoffs
             if (dim(cutoffs.table)[1] > 1) {
               h + geom_vline(aes(xintercept = cutoffs.table[col.name, "Q1.25%"]), 
                                  colour = "#BB0000", linetype = "dashed") +
                   geom_vline(aes(xintercept = cutoffs.table[col.name, "Q3.75%"]), 
                                  colour = "#BB0000", linetype = "dashed")  
             } else if (dim(cutoffs.table)[1] == 1) {
               h + geom_vline(aes(xintercept = cutoffs.table[1, col.name]), 
                     colour = "#BB0000", linetype = "dashed")
             } else {
               return(h)
             }
           })
  }
ScatterplotsFromVariables <- 
  function(data.columns, cutoffs.table = data.frame(), text = "Amount") {
    # Generates list of scatterplots with 'rule of thumb' boundries.
    #
    # Args:
    #   data.columns:  variables
    #   cutoffs.table: table with values for drawing horizontal lines
    #   text:          comment for the y axe 
    #
    # Returns:
    #   List of ggplot objects
    lapply(colnames(data.columns),
           function (col.name){ 
             p <- ggplot(data = data.columns, 
                    aes(x = 1:length(data.columns[, col.name]),
                        y = data.columns[, col.name])) +
                    geom_point(shape=1) +
                    theme_minimal() +
                    xlab(paste0("Index ", col.name, " ")) +
                    ylab(text)
             
             # Adjusting the plot to data contained in cutoffs.table:
             # two, one or without cutoffs
             if (dim(cutoffs.table)[1] > 1) {
               p + geom_hline(aes(yintercept = cutoffs.table[col.name, "Q1.25%"]), 
                                  colour = "#BB0000", linetype = "dashed") +
                   geom_hline(aes(yintercept=cutoffs.table[col.name, "Q3.75%"]), 
                                  colour = "#BB0000", linetype = "dashed")  
             } else if (dim(cutoffs.table)[1] == 1) {
               p + geom_hline(aes(yintercept = cutoffs.table[1, col.name]), 
                              colour  ="#BB0000", linetype = "dashed")
             } else {
               return(p)
             }
           })
}
MultiPlot(plotlist = HistogramsFromVariables(bill.raw.data, 
                                             r.thumb.cutoffs,
                                             text = "Amount of bill statement"))

MultiPlot(plotlist = ScatterplotsFromVariables(bill.raw.data, 
                                               r.thumb.cutoffs))

It looks like we have outliers but Q3 limit is too low for our purposes. I suggest removing the data with the amount of the bill statements under Q1 cutoff and bigger than Q3 cutoff + 220000

Variable Without Outliers

# Collecting indexes to remove
to.remove.indexes <- 
  lapply(colnames(bill.raw.data),
         function (bill.col.name){
           which(bill.raw.data[, bill.col.name] > 
                   r.thumb.cutoffs[bill.col.name, "Q3.75%"] + 220000 |
                 bill.raw.data[, bill.col.name] < 
                   r.thumb.cutoffs[bill.col.name, "Q1.25%"])
         }
  ) %>% unlist()

# Getting BILL_AMT1..6 columns without outliers
bill.raw.data <- bill.raw.data[-to.remove.indexes, ]

# Collecting observation indexes to remove from 
# the main data
to.remove.indexes.total <- to.remove.indexes

Let’s look at histograms without outliers:

Let’s take a closer look at histograms for bills less than 5000 with bin.width = 100 because most of the payments are here:

The generated histogram shows the most common bill statement interval is under 100 dollars.

Amount of Previous Payment (PAY_AMTX)

Outliers

Let’s explore histograms and check data for outliers using our expert judgment and the “rule of thumb”. In that case, we do not have negative values and because of this we will use upper bound only: \[ Q3 + 1.5*IQR \]

# Getting PAY_AMT1..6 columns only
pay.raw.data <- raw.data %>% select(starts_with("PAY_AMT"))  # Getting columns

# Calculate cutoffs for the "rule of thumb"
(pay.thumb.cutoffs <- 
  apply(pay.raw.data, 2,
        function (t){
          Q3 = quantile(t, .75) + 1.5 * IQR(t) 
        }
) %>% t())

##      PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## [1,]    11015  11250.5  10677.5 9589.125     9700 9823.375

# Generating histograms for PAY_AMT1..6
MultiPlot(plotlist = HistogramsFromVariables(pay.raw.data, pay.thumb.cutoffs, 
                                             text = "Amount of previous payment"))

# Generating scatterplots for PAY_AMT1..6
MultiPlot(plotlist = ScatterplotsFromVariables(pay.raw.data, pay.thumb.cutoffs))

It looks like we have outliers but Q3 limit is too low for our purposes. I suggest removing the data with the amount of the previous payment bigger than Q3 cutoff + 200000

Variable Without Outliers

# Collecting indexes to remove
to.remove.indexes <- 
  lapply(colnames(pay.raw.data),
         function (pay.col.name){
           which(pay.raw.data[, pay.col.name] > 
                   pay.thumb.cutoffs[1, pay.col.name] + 200000)
         }
) %>% unlist()

# Getting PAY_AMT1..6 columns without outliers
pay.raw.data <- pay.raw.data[-to.remove.indexes, ]

# Collecting observation indexes to remove from 
# the main data
to.remove.indexes.total <- c(to.remove.indexes.total, to.remove.indexes)

Analysis

Taking a closer look at histograms for payments less than 5000, it looks like people prefer to pay round figures of money.

Limit of Balance (LIMIT_BAL)

Outliers

Let’s explore Limit of balanse variable

# Getting ID and LIMIT_BAL columns only
bal.raw.data <- raw.data[c("ID", "LIMIT_BAL")]

# Calculate cutoffs for LIMIT_BAL variable using "rule of thumb"
bal.thumb.cutoffs <-
  apply(bal.raw.data, 2,
        function (t){
          Q3 = quantile(t, .75) + 1.5 * IQR(t) 
        }
) %>% t()

# Generating a histogram for the LIMIT_BAL variable with a vertical cutoff line
ggplot(data = bal.raw.data, 
  aes(x = bal.raw.data$LIMIT_BAL)) +
  geom_histogram(binwidth = 100, colour = "black", fill = "white") +
  geom_vline(aes(xintercept = bal.thumb.cutoffs[1, "LIMIT_BAL"]), 
             colour = "#BB0000", linetype = "dashed") +
  theme_minimal() +
  labs(x = "Limit of Balance", y = "Observations count")

We can detect on the histogram above some popular limits of balance: 20000, 50000, 80000, 200000, 360000, 500000

# Generating a scatterplot for the LIMIT_BAL variable 
# with a horizontal cutoff line       
ggplot(data = bal.raw.data, 
  aes(x = 1:length(bal.raw.data$LIMIT_BAL),
      y = bal.raw.data$LIMIT_BAL)) +
  geom_point(shape = 1, position=position_jitter(height = 100)) +
  geom_hline(aes(yintercept=bal.thumb.cutoffs[1, "LIMIT_BAL"]), 
             colour="#BB0000", linetype="dashed") +
  theme_minimal() +
  labs(x = "Index", y = "Limit of Balance")

I suggest filtering observations with limits higher than 750000

Variable Without Outliers

# Collecting observation indexes to remove from 
# the main data - adding indexes with LIMIT_BAL > 750000
to.remove.indexes.total <- c(to.remove.indexes.total, 
                             which(bal.raw.data$LIMIT_BAL > 750000))

Outliers Exploration Results

After exploration of all variables for outliers we have collected 1283 indexes to remove.
Number of unique observation to remove is 457 which is 36% of all remove marks and 2% of all the data.

Let’s remove outliers for further research:

# Saving final dataset without outliers
fin.data <- raw.data[-to.remove.indexes.total, -1]

Predictors’ Correlation

Numerical

Correlogram for numerical variables:

library(corrgram)
corrgram(fin.data, order = TRUE, lower.panel = panel.shade,
         upper.panel = panel.pie, text.panel = panel.txt)

corrgram(fin.data, order = TRUE, lower.panel = panel.ellipse,
         upper.panel = panel.pts, text.panel = panel.txt,
         diag.panel = panel.minmax)

We see a high level of linear correlations between the amount of bill statements in different months. In case of the multicollinearity we need to use such technics as Ridge and Lasso regression and Principal components method. We can even drop some variables if we need to, but price of this is unbiasedness of estimates and this is not the best decision.

Categorical

To find correlations between Categorical variables we need to perform Chi-Square Test. In our case we will test the null hypothesis whether the one variable is independent of another at 0.05 significance level. We will filter all combinations with p-value less than 0.05 significance level because under this level we reject the null hypothesis that variables are independent.

# Creating combinations of pairs for categorical variables
# Categorical cols
cat.var.names <- colnames(fin.data[sapply(fin.data, is.factor)])  
# Combinations of pairs
cat.combs     <- as.data.frame(t(combn(cat.var.names, 2)))  

# Performing Chi-Square tests for all pairs
cat.chisq.test.res <- 
  apply(t(cat.combs), 2, function(x){
                           c(Var1   = x[1], 
                             Var2   = x[2], 
                             pvalue = chisq.test(fin.data[, x[1]], 
                                                 fin.data[, x[2]])$p.value)
                         }
  )
cat.chisq.test.res        <- cat.chisq.test.res %>% 
                               t() %>% 
                               data.frame(stringsAsFactors=FALSE) 
cat.chisq.test.res$pvalue <- as.numeric(cat.chisq.test.res$pvalue)

cat.chisq.test.res        <- cat.chisq.test.res %>% 
                               filter(pvalue <= .05) %>%  # Filter p-values < 0.05
                               arrange(pvalue)            # Most dependent at top
knitr::kable(cat.chisq.test.res, caption = "Table with kable", 
             format = "markdown", padding = 0,
             col.names = c("", "", "p-value"))

		p-value
PAY_0	PAY_2	0.00e+00
PAY_0	PAY_3	0.00e+00
PAY_0	PAY_4	0.00e+00
PAY_0	PAY_5	0.00e+00
PAY_0	PAY_6	0.00e+00
PAY_0	default.payment.next.month	0.00e+00
PAY_2	PAY_3	0.00e+00
PAY_2	PAY_4	0.00e+00
PAY_2	PAY_5	0.00e+00
PAY_2	PAY_6	0.00e+00
PAY_2	default.payment.next.month	0.00e+00
PAY_3	PAY_4	0.00e+00
PAY_3	PAY_5	0.00e+00
PAY_3	PAY_6	0.00e+00
PAY_3	default.payment.next.month	0.00e+00
PAY_4	PAY_5	0.00e+00
PAY_4	PAY_6	0.00e+00
PAY_4	default.payment.next.month	0.00e+00
PAY_5	PAY_6	0.00e+00
PAY_5	default.payment.next.month	0.00e+00
PAY_6	default.payment.next.month	0.00e+00
EDUCATION	PAY_2	0.00e+00
EDUCATION	PAY_3	0.00e+00
EDUCATION	MARRIAGE	0.00e+00
EDUCATION	PAY_0	0.00e+00
EDUCATION	PAY_4	0.00e+00
EDUCATION	PAY_5	0.00e+00
EDUCATION	PAY_6	0.00e+00
SEX	PAY_2	0.00e+00
EDUCATION	default.payment.next.month	0.00e+00
SEX	PAY_3	0.00e+00
SEX	PAY_0	0.00e+00
SEX	PAY_4	0.00e+00
SEX	PAY_5	0.00e+00
MARRIAGE	PAY_4	0.00e+00
MARRIAGE	PAY_0	0.00e+00
MARRIAGE	PAY_5	0.00e+00
MARRIAGE	PAY_2	0.00e+00
MARRIAGE	PAY_3	0.00e+00
MARRIAGE	PAY_6	0.00e+00
SEX	default.payment.next.month	0.00e+00
SEX	PAY_6	0.00e+00
MARRIAGE	default.payment.next.month	1.00e-07
SEX	MARRIAGE	2.00e-07
SEX	EDUCATION	9.35e-05

We can see that PAY_X variables are highly correlated to each other and for further exploration we can use one of them - PAY_0 for example. It looks like all categorical variables are dependent on each other and impact of PAY_X variables to default.payment.next.month is high. In addition, we can not say we have even one independent pair.

Near Zero Variables

Let’s try to find variables with zero variance.

library(caret) 
# Testing for existing of near-zero predictors in our data
nearZeroVar(select(fin.data, -default.payment.next.month))

## integer(0)

So, we do not have such variables. In case you have near zero variance predictors, it is better to remove it to reduce model-fitting time without reducing model accuracy.

Exploration of Dependent Variable

20 EPV (events per predictor variable) rule

Do we have enough observations to implement a good model?

library(gmodels)
def.ct <- (CrossTable(fin.data$default.payment.next.month))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29543 
## 
##  
##            |          0 |          1 | 
##            |------------|------------|
##            |      22996 |       6547 | 
##            |      0.778 |      0.222 | 
##            |------------|------------|
## 
## 
## 
##

20 EPV (events per predictor variable) approach (Harrell, Frank. Regression modeling strategies. NY: Springer, 2001):

Get the smallest presented value of Dependent variable. In this case, it is 6547
Divide this by number of predictors: 6547 / 23 = 285
We have enough observations if we have more than 20 observations per smallest presented predictor. It is our case, we have enough observations.

Majority Class Classifier

Majority Class Classifier is an elementary model for prediction. Accuracy will be equal to the fraction of instances belonging to the majority class. In our case, majority class is nondefault or 0. It is the lowest level of accuracy for our future modelling. We can just say that all credits cards are nondefault and we will have a level of accuracy equal to the number of negatives / number of observations = 22996 / 29543 = 77.8%

Сorrelation between Preditors and Dependent Variable

Social Status Predictors (SEX, EDUCATION, MARRIAGE, AGE)

In this chapter, we will explore the structure of correlation between different variables and a dependent variable.

AGE, SEX and MARRIAGE combination

As we seen before, there is a correlation between default and AGE, SEX and MARRIAGE. Let’s try to understand nature of this and also, add AGE as another social predictor.

# Adding classifier for AGE
fin.data <- fin.data %>% 
              mutate(AGE_F = ifelse(AGE < 25, "-25", 
                             ifelse(AGE < 35, "25-34",        
                             ifelse(AGE < 45, "35-44", "45-")))
                     )

# Generating mosaic diagrams with Chi-Square tests
library(vcd)
cotabplot(~ EDUCATION + default.payment.next.month | SEX + AGE_F, 
          data = fin.data, 
          panel = cotab_coindep, n = 5000, type = "mosaic", 
          test = "maxchisq", interpolate = 1:2)

In the diagram above we can see that there is a correlation between EDUCATION (1=graduate school, 2=university, 3=high school, 4=others) and default. The more educated cardholders have the less chances to default.

However, the situation changes with the increasing of AGE parameter. With age, the impact of education to default decreases. Moreover, for females, it is quicker. For cardholders younger than 25, there is no significant correlation between EDUCATION and default.

Here we can see that the impact of EDUCATION to default depends on MARRIAGE (1=married, 2=single, 3=others). The biggest impact for single cardholders is between 25 and 35. For married cardholders, at the same AGE interval, having high school education is important for default probability.

# Making MARRIAGE levels more easy-to-read
levels(fin.data$MARRIAGE)[levels(fin.data$MARRIAGE) == "1"] <- "Married"
levels(fin.data$MARRIAGE)[levels(fin.data$MARRIAGE) == "2"] <- "Single"
levels(fin.data$MARRIAGE)[levels(fin.data$MARRIAGE) == "3"] <- "Other"

# Generating mosaic diagrams with Chi-Square tests
library(vcd)
cotabplot(~ EDUCATION + default.payment.next.month | MARRIAGE + AGE_F, 
          data = fin.data, 
          panel = cotab_coindep, n = 5000, type = "mosaic", 
          test = "maxchisq", interpolate = 1:2)

We can also notice that married status increases the probability of default for ages 25-34. For 35-44 there is no significant impact, and for 45 and older, again, single status decreases the probability of default.

# Generating mosaic diagrams with Chi-Square tests
library(vcd)
cotabplot(~  MARRIAGE + default.payment.next.month | AGE_F, 
          data = fin.data, 
          panel = cotab_coindep, n = 5000, type = "mosaic", 
          test = "maxchisq", interpolate = 1:2)

Male persons have more chances to default for all age groups:

# Generating mosaic diagrams with Chi-Square tests
library(vcd)
cotabplot(~ SEX + default.payment.next.month | AGE_F, 
          data = fin.data, 
          panel = cotab_coindep, n = 5000, type = "mosaic", 
          test = "maxchisq", interpolate = 1:2)

Age (AGE)

The biggest chances to default for age group under 25 and smallest for 25 - 34 age group.

# Plotting mosaic diagram with Chi-Square tests
mosaic(~ AGE_F + default.payment.next.month, 
       data = fin.data, shade=TRUE, legend=TRUE)

Financial Predictors

Limit of Balance (LIMIT_BAL)

Using the diagrams below, we cannote that the fraction of default credit cards with a limit of up to 100000 dollars is higher. In contrast, the higher limits are more spread out for non-default cards.

# Generating violin histograms and boxplots for LIMIT_BAL
# splitted for non-default and default classes
library("tidyr")

# Preparing special data for violin diagrams
limit.spec.data <- fin.data %>% 
  filter(LIMIT_BAL < 600000) %>%
  select(LIMIT_BAL, default.payment.next.month) %>%
  gather(LIMIT_BAL, key = "Month", value = "Balance") 

ggplot(data = limit.spec.data, aes(x = default.payment.next.month, 
                                   y = Balance)) +
       geom_violin(aes(fill = default.payment.next.month), trim = FALSE,
       alpha = 0.3) +
       geom_boxplot(aes(fill = default.payment.next.month), width = 0.2,
       outlier.colour = NA) +
       theme(legend.position = "NA") +
       labs(colour = "default",
            subtitle  = paste0("Amount of given credit in NT dollars for",
                               " default (1) and non-default (0) cards"),
            y = "Amount of given credit in NT dollars",
            x = "non-default (0) ~ default (1)")

Repayment Status (PAY_X)

PAY_X description: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above.

Having a delay even for 1 month in any of the previous months, increases chances of default. Moreover, it works for any social subgroups:

# Generating mosaic diagrams for PAY_0 and 
# dependent variable with Chi-Square tests
mosaic(~ PAY_0 + default.payment.next.month, 
       data = fin.data, shade=TRUE, legend=TRUE)

# Generating mosaic diagrams for PAY_0 and 
# dependent variable with Chi-Square tests folded by AGE
mosaic(~ PAY_0 + default.payment.next.month | AGE_F, 
       data = fin.data, shade=TRUE, legend=TRUE)

Amount of Previous Payment (PAY_AMTX) & Amount of Bill Statement (BILL_AMTX)

Diagrams below show us that for default cards sizes of payments are smaller than for non-defaults cards and at the same time sizes of bills for default cards are higher (especially for pre-default month). It make sense.

# Preparing special data for violin diagrams generating
bill.spec.data <- fin.data %>% 
  filter(LIMIT_BAL < 600000, BILL_AMT1 < 300000) %>%
  select(BILL_AMT1, BILL_AMT3, BILL_AMT6, default.payment.next.month) %>%
  gather(BILL_AMT1, BILL_AMT3, BILL_AMT6, key = "Month", value = "Amount") 

pay.spec.data <- fin.data %>% 
  filter(PAY_AMT1  < 20000, PAY_AMT3  < 20000, PAY_AMT6  < 20000) %>%
  select(PAY_AMT1, PAY_AMT3, PAY_AMT6, default.payment.next.month) %>%
  gather(PAY_AMT1, PAY_AMT3, PAY_AMT6, key = "Month", value = "Amount") 

# Generating violin diagrams for me BILL_AMT1, 3, 6
bill.amt.plot <- 
  ggplot(data = bill.spec.data, aes(x = default.payment.next.month, 
                                    y = Amount)) +
         geom_violin(aes(fill = default.payment.next.month), 
                     trim = FALSE, alpha = .3) +
         geom_boxplot(aes(fill = default.payment.next.month), 
                      width = .2, outlier.colour = NA) +
         theme(legend.position = "NA") +
         facet_wrap( ~ Month) + 
         labs(colour = "default", 
              subtitle  = paste0("Amount of bill statement for 1th,",
                                 " 3rd and 6th months for default (1)",
                                 " and non-default (0) cards"),
              y = "Amount of bill statement",
              x = "non-default (0) ~ default (1)")

# Generating violin diagrams for me PAY_AMT1, 3, 6
pay.amt.plot <- 
  ggplot(data = pay.spec.data, aes(x = default.payment.next.month, 
                                   y = Amount)) +
         geom_violin(aes(fill = default.payment.next.month), 
                     trim = FALSE, alpha = .3) +
         geom_boxplot(aes(fill = default.payment.next.month), 
                      width = .2, outlier.colour = NA) +
         theme(legend.position = "NA") +
         facet_wrap( ~ Month) + 
         labs(colour = "default", 
              subtitle  = paste0("Amount of previous payment for 1th,",
                                 " 3rd and 6th months for default (1)",
                                 " and non-default (0) cards"),
              y = "Amount of previous payment",
              x = "non-default (0) ~ default (1)")

# Printing both diagrams near each other
MultiPlot(bill.amt.plot, pay.amt.plot, cols = 1)

Using diagrams below we can compare bills and payments in dynamic and make some observations:

Amount of payments for default cards is significantly lower from the very first month of observations;
Dynamic of payments for default and non-default are very similar, but dynamic of bill statements are different. Over time, non-default cardholders decrease their bill statements but default cardholders decrease their bill statements not at the same pace.

# Creating table which contains mean and median values of
# BILL_AMT1..6 and PAY_AMT1..6 predictors for default and non-default cards
full.pay.amt <- fin.data %>%
  select(BILL_AMT1:default.payment.next.month) %>%
  group_by(default.payment.next.month) %>%
  summarise_each(funs(median = median, mean = mean)) %>%
  gather(-default.payment.next.month,
         key = "name", value = "value") %>%
  separate(name, c("bptype", "name", "ftype")) %>%
  mutate(var = paste0(bptype, "_", name)) %>%
  select(-name)

# Generating line chart for mean and median values for 
# BILL_AMT1..6 and PAY_AMT1..6 for different months 
ggplot(full.pay.amt, 
       aes(x = var, y = value, 
           group  = default.payment.next.month, 
           colour = default.payment.next.month)) +
           geom_line() + geom_point() +
           scale_color_hue(labels = c("non-default (0)", "default (1)")) +
           theme(legend.position = "bottom") +
           labs(colour    = "", x = "",
                subtitle  = paste0("Average and median amount",
                                   " of previous payment / bill by month"),
                y         = "Amount of payment/bill") +
           facet_wrap(bptype ~ ftype, scales = "free")

Also, turning to corellogram from the “Predictors’ Correlation - Numerical” chapter and looking at the graphs above, we see that despite BILL_AMTX variables having high correlation, it would be a mistake to just drop all of them because they are important for the explanation of dependent variables.

Summary

After exploratory analysis we already have a picture about predictors’ impact to response variable:

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- Impact to default (details): The lower the amount of given credit limit of the balance owing, the bigger the chances to default. (In general)
SEX: Gender (1=male, 2=female)
- Impact to default (details): Male persons have more chances to default. (In general)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others)
- Impact to default (details): The better education the lower chances to default. (In general)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
- Impact to default (details): Married persons have more chances to default. (In general)
AGE: Age in years
- Impact to default (details): The biggest chance of default is in the age group under 25 and the smallest for 25 - 34 age group.
PAY_0..6: Repayment status in September .. April, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- Impact to default (details): Having a delay, even for 1 month in any of the previous months, increases the chance of default.
BILL_AMT1..6: Amount of bill statement in September .. April, 2005 (NT dollar)
- Impact to default (details): The smaller the difference between the amount owed on the bill in September and April, the bigger the chances to default. (In general)
PAY_AMT1..6: Amount of previous payment in September .. April, 2005 (NT dollar)
- Impact to default (details): The smaller the payment amount, the bigger the chance of default. (In general)

Modelling

Introduction

In this chapter, we will check hypothesizes about predictors’ impact on the dependent variable made in the previous chapter Also, we will try to implement a model which can predict default with a level higher than Majority Class Classifier can provide us. It means that accuracy of our future models should be higher than 77.8%

Libraries

# Load libraries
library(caret)   # Classification and Regression Training
library(tidyr)   # Easily Tidy Data 
library(dplyr)   # A Grammar of Data Manipulation
library(pROC)    # Display and Analyze ROC Curves
library(car)     # Box-Cox, Yeo-Johnson and Basic Power Transformations
library(lmtest)  # Testing Linear Regression Models
library(e1071)   # Tuning of Functions Using Grid Search, Support Vector Machines
library(nnet)    # Feed-Forward Neural Networks and Multinomial Log-Linear Models
library(ranger)  # A Fast Implementation of Random Forests
library(sandwich)    # Robust Covariance Matrix Estimators
library(rpart.plot)  # Decision Tree

Functions

We will create a few functions for automation of collecting and visualising results of prediction process for different models.

ModelPredictAndEvaluate <- function(model, data, model.name, tag = "train") {
  # Make prediction on given dataset and return AUC, Accuracy, Sensitivity
  # Specificity, best Threshold and ROC data 
  # based on the best Sensitivity and Specificity combination
  # and based on the best Accuracy
  #
  # Args:
  #   model:        model
  #   data:         dataset
  #   model.name:   title for a model
  #   tag:          text tag, i.e. "train", "test"
  #
  # Returns:
  #   List:
  #     confusion.matrix.best.SS:  Confusion matrix based on threshold for best 
  #                                Sensitivity and Specificity combination.
  #     confusion.matrix.best.Acc: Confusion matrix based on fitted threshold 
  #                                for the best possible Accuracy for the data.  
  #     evaluation: data.frame with AUC, Accuracy, Sensitivity, Specificity, 
  #                 best Threshold, Model (model.name), Dataset (tag) for
  #                 both thresholds: best Sensitivity and Specificity 
  #                 combination and for best fitted Accuracy.
  #     roc:        roc() function results for the model and the data.
  #                 Possible to use results for plotting ROC curve.
  #         
  
  # Choose proper option for predict() based on model's option
  prob.type <- ifelse(!is.null(model$modelInfo$prob), "prob", "response")
  
  # Make prediction
  pred.model <- predict(model, data, type = prob.type)
  
  # Create ROC curve
  if (!is.null(model$modelInfo$prob)) {
    probabilities <- pred.model[, 2] 
  } else {
    probabilities <- pred.model 
  }
  model.roc <- roc(data$default.payment.next.month, probabilities)
  
  # Get the best threshold for the model for 
  # maximal sensitivity and specificity combination
  thres.index <- which.max(model.roc$sensitivities + model.roc$specificities)
  best.threshold <- model.roc$thresholds[thres.index]
  
  # Calculate class probabilities for threshold which is relevant to 
  # maximal sensitivity and specificity combination
  p.class.pred.model <- ifelse(probabilities > best.threshold, 'X1', 'X0')
  
  # Confusion matrix for the model which is relevant to 
  # maximal sensitivity and specificity combination
  cf.pred.model <- confusionMatrix(p.class.pred.model, 
                                   data$default.payment.next.month)
  
  # Save model results based on best Sensitivity and Specificity combination
  model.result <-
    data.frame(Model       = model.name,
               Dataset     = paste0(tag, "_best_SS"),
               AUC         = model.roc$auc,
               Accuracy    = cf.pred.model$overall['Accuracy'],
               Sensitivity = cf.pred.model$byClass['Sensitivity'],
               Specificity = cf.pred.model$byClass['Specificity'],
               Threshold   = best.threshold,
               row.names   = NULL
    ) 
  
  #
  # Fitting threshold trying to find best accuracy
  #
  # Initial values for Confusion matrix and the best accuracy
  current.best.accuracy <- 0
  best.accuracy.cf      <- NA
  
  for (thres in seq(.05, .95, by = .05)) {
    # Calculate class probabilities for current threshold
    p.class.pred.model.acc <- 
      ifelse(probabilities > thres, 'X1', 'X0')
    
    # Confusion matrix for the model and current threshold
    cf.pred.model.acc <- confusionMatrix(p.class.pred.model.acc, 
                                         data$default.payment.next.month)
    
    # Save final results only for best accuracy
    if ((cf.pred.model.acc$overall['Accuracy'] > current.best.accuracy) &
        (cf.pred.model.acc$byClass['Sensitivity'] >= .1) & 
        (cf.pred.model.acc$byClass['Specificity'] >= .1)){
      best.accuracy.thres <- 
        c(cf.pred.model.acc$byClass['Sensitivity'],
          cf.pred.model.acc$byClass['Specificity'],
          cf.pred.model.acc$overall['Accuracy'], 
          Threshold = thres,
          cf.pred.model.acc) 
      best.accuracy.cf      <- cf.pred.model.acc$table
      current.best.accuracy <- cf.pred.model.acc$overall['Accuracy']
    }
  }
  
  # Add results of evaluation to the final table
  if (current.best.accuracy > 0) {
    model.result <-       
      rbind(model.result,
            data.frame(Model       = model.name,
                       Dataset     = paste0(tag, "_best_Acc"),
                       AUC         = model.roc$auc,
                       Accuracy    = best.accuracy.thres['Accuracy'],
                       Sensitivity = best.accuracy.thres['Sensitivity'],
                       Specificity = best.accuracy.thres['Specificity'],
                       Threshold   = best.accuracy.thres['Threshold'],
                       row.names   = NULL
            ))
  }
  
  return(list("confusion.matrix.best.SS"  = t(cf.pred.model$table), 
              "confusion.matrix.best.Acc" = t(best.accuracy.cf),
              "evaluation"                = model.result,
              "roc"                       = model.roc))
}

ModelPredictEvaluateCompareTable <- function(model, xtrain, xtest, model.name) {
  # Make and print predictions for train, test dataset 
  # and return AUC, Accuracy, Sensitivity
  # Specificity, best Threshold and ROC data for them 
  # based on the best Sensitivity and Specificity combination
  # and based on the best Accuracy. Also, it adds train() function's best result.
  #
  # Args:
  #   model:        model
  #   xtrain:       train dataset
  #   xtest:        test dataset
  #   model.name:   title for a model
  #
  # Returns:
  #   List of:
  #    table: data.frame with AUC, Accuracy, Sensitivity, Specificity, 
  #           best Threshold, Model (model.name), Dataset (tag) for
  #           for train and test datasets plus train() function 
  #           best results for both thresholds: best Sensitivity and Specificity 
  #           combination and for best fitted Accuracy.
  #    train: ModelPredictAndEvaluate() function's results for train dataset.
  #    test:  ModelPredictAndEvaluate() function's results for test dataset.  
  #         
  
  # Make prediction on train data and save results: model.train.res
  model.train.res <- ModelPredictAndEvaluate(model, xtrain, 
                                             model.name, "train")
  
  # Collecting model's efficiency parameters 
  # for the final comparison: models.results
  models.results <- model.train.res$evaluation
  
  # Make prediction on test data and save results: model.test.res
  model.test.res <- ModelPredictAndEvaluate(model, xtest, 
                                            model.name, "test")
  
  # Add models efficiency parameters for the final comparison
  models.results <- models.results %>% rbind(model.test.res$evaluation)
  
  # Add train() function's best results to the final table 
  # in cases it is train() produced model
  if ("train" %in% class(model)) {
    if (!is.null(model$results$ROC)) {
      best.fit <-
        model$results[which.max(model$results[, 'ROC']), ]
      model.result <-
        data.frame(Model       = model.name,
                   Dataset     = "train_cv",
                   AUC         = best.fit$ROC,
                   Accuracy    = NA,
                   Sensitivity = best.fit$Sens,
                   Specificity = best.fit$Spec,
                   Threshold   = NA,
                   row.names   = NULL) 
    } else {
      model.result <-
        data.frame(Model       = model.name,
                   Dataset     = "train_cv",
                   AUC         = NA,
                   Accuracy    = 
                     model$results[which.max(model$results[, 'Accuracy']), 
                                   'Accuracy'],
                   Sensitivity = NA,
                   Specificity = NA,
                   Threshold   = NA,
                   row.names   = NULL) 
      
    }
    
    models.results <- models.results %>% rbind(model.result)
  }
  
  # Print final table with results
  print(models.results[-1])
  
  # Return final table and ModelPredictAndEvaluate() results using invisible mode  
  invisible(list(table = models.results,
                 train = model.train.res, 
                 test  = model.test.res))
}

Splitting and Data Preparation

#
# Removing very rare classes
#
fin.data <- fin.data %>%
  filter(!(PAY_2 %in% c(8))) %>% 
  filter(!(PAY_3 %in% c(1))) %>% 
  filter(!(PAY_4 %in% c(1, 8))) %>% 
  filter(!(PAY_5 %in% c(8))) %>% 
  filter(!(PAY_6 %in% c(8)))  %>% 
  select(-AGE_F)
#
# Splitting the Data
#
set.seed(123) 
# Shuffle row indices: rows
rows <- sample(nrow(fin.data))

# Randomly order data: fin.data
fin.data <- fin.data[rows, ]

# Identify row to split on (70%): split
split <- round(nrow(fin.data) * .7)

# Create train
train <- fin.data[1:split, ]

# Create test
test <- fin.data[(split + 1):nrow(fin.data), ]

#
# Rename factors to Syntactically Valid R Names
#
xtrain <- train
xtest  <- test

# Get categorical columns only
cat.var.names <- colnames(fin.data[sapply(fin.data, is.factor)])  

# Rename categorical columns to Syntactically Valid R Names
for (x in cat.var.names) {
  levels(xtrain[, x]) <- make.names(levels(xtrain[, x]))
  levels(xtest[, x])  <- make.names(levels(xtest[, x]))
}

# Change reference level for PAY_X columns to X.1 i.e. pay duly 
xtrain[pay.cols.names] <- lapply(xtrain[pay.cols.names], relevel, ref = "X.1")
xtest[pay.cols.names]  <- lapply(xtest[pay.cols.names], relevel, ref = "X.1")

# Change reference level for EDUCATION column to X1 i.e. graduate school
xtrain["EDUCATION"] <- lapply(xtrain["EDUCATION"], relevel, ref = "X1")
xtest["EDUCATION"]  <- lapply(xtest["EDUCATION"], relevel, ref = "X1")

# Change reference level for MARRIAGE column to Married
xtrain["MARRIAGE"] <- lapply(xtrain["MARRIAGE"], relevel, ref = "Married")
xtest["MARRIAGE"]  <- lapply(xtest["MARRIAGE"], relevel, ref = "Married")

Fitting models

Linear models

Simple logit model on all variables

set.seed(123) 

# Fit logit model on all variables: glm.s
glm.s <- glm(default.payment.next.month ~ ., xtrain, family = "binomial")

# Make predictions on train and test data for the logit model
glm.s.pred <- 
  ModelPredictEvaluateCompareTable(glm.s, xtrain, xtest, "GLM_Logit_Full")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7738768 0.7718501   0.8207781   0.5976175 0.2182288
## 2 train_best_Acc 0.7738768 0.8222491   0.9459175   0.3818663 0.4500000
## 3   test_best_SS 0.7720944 0.7641350   0.8112133   0.6038767 0.2088148
## 4  test_best_Acc 0.7720944 0.8206749   0.9557600   0.3608350 0.5000000

In the table above you can see:

train_best_SS - results of the model on the train data with the Threshold chosen by maximum possible sum of Sensitivity and Specificity;
train_best_Acc - results of the model on the train data with the Threshold chosen by maximum Accuracy on the data;
test_best_SS - results of the model on the test data with the Threshold chosen by maximum possible sum of Sensitivity and Specificity;
test_best_Acc - results of the model on the test data with the Threshold chosen by maximum Accuracy on the data.

# Print confusion matrix for the best accuracy on the test data
glm.s.pred$test$confusion.matrix.best.Acc

##          Prediction
## Reference   X0   X1
##        X0 6546  303
##        X1 1286  726

# Collect results for final comparison
compare.table <- glm.s.pred$table

Simple logit model - choosing variables using AIC

set.seed(123) 
# Fit restricted logit model: glm.step
(glm.step <- step(glm.s, trace = FALSE))

## 
## Call:  glm(formula = default.payment.next.month ~ LIMIT_BAL + SEX + 
##     EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_3 + PAY_4 + PAY_5 + 
##     PAY_6 + BILL_AMT3 + PAY_AMT1 + PAY_AMT2 + PAY_AMT4 + PAY_AMT5 + 
##     PAY_AMT6, family = "binomial", data = xtrain)
## 
## Coefficients:
##    (Intercept)       LIMIT_BAL       SEXFemale     EDUCATIONX0  
##   -1.147426330    -0.000001824    -0.148910257   -11.767265887  
##    EDUCATIONX2     EDUCATIONX3     EDUCATIONX4     EDUCATIONX5  
##   -0.002209573    -0.009917952    -1.069532994    -1.181418152  
##    EDUCATIONX6      MARRIAGEX0  MARRIAGESingle   MARRIAGEOther  
##    0.127894044    -1.386538545    -0.174993553    -0.021774927  
##            AGE        PAY_0X.2         PAY_0X0         PAY_0X1  
##    0.004100447    -0.430496272    -0.714677862     0.386050776  
##        PAY_0X2         PAY_0X3         PAY_0X4         PAY_0X5  
##    1.536331693     1.612749993     1.216557335     0.454799576  
##        PAY_0X6         PAY_0X7         PAY_0X8        PAY_3X.2  
##    0.985328322    14.086482466   -11.711405062    -0.006196304  
##        PAY_3X0         PAY_3X2         PAY_3X3         PAY_3X4  
##    0.120215352     0.418855347     0.333266584    -0.224613586  
##        PAY_3X5         PAY_3X6         PAY_3X7         PAY_3X8  
##   -0.417305835    13.721163797    -0.635514166   -24.143617853  
##       PAY_4X.2         PAY_4X0         PAY_4X2         PAY_4X3  
##    0.130089630     0.066179344     0.394366581     0.525589003  
##        PAY_4X4         PAY_4X5         PAY_4X6         PAY_4X7  
##    0.595267915    -1.486964706   -27.910411946    -0.809925568  
##       PAY_5X.2         PAY_5X0         PAY_5X2         PAY_5X3  
##    0.053098876     0.137649436     0.404776105     0.266832691  
##        PAY_5X4         PAY_5X5         PAY_5X6         PAY_5X7  
##    0.567755203     1.952513564    37.093328133    12.498994583  
##       PAY_6X.2         PAY_6X0         PAY_6X2         PAY_6X3  
##    0.163445485    -0.180199205     0.147618635     0.599422391  
##        PAY_6X4         PAY_6X5         PAY_6X6         PAY_6X7  
##   -0.446868293    -0.475878063     1.497783353   -10.287563999  
##      BILL_AMT3        PAY_AMT1        PAY_AMT2        PAY_AMT4  
##    0.000002231    -0.000014316    -0.000015906    -0.000004494  
##       PAY_AMT5        PAY_AMT6  
##   -0.000004158    -0.000002781  
## 
## Degrees of Freedom: 20674 Total (i.e. Null);  20613 Residual
## Null Deviance:       21750 
## Residual Deviance: 17810     AIC: 17930

# Make predictions on train and test data for restricted logit model 
glm.step.pred <- 
  ModelPredictEvaluateCompareTable(glm.step, xtrain, 
                                   xtest, "GLM_Logit_Restricted")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7734269 0.7624184   0.8042993   0.6132804 0.2073663
## 2 train_best_Acc 0.7734269 0.8226844   0.9378020   0.4127509 0.4000000
## 3   test_best_SS 0.7710716 0.7677463   0.8180756   0.5964215 0.2140302
## 4  test_best_Acc 0.7710716 0.8198849   0.9547379   0.3608350 0.5000000

# Collect results for final comparison
compare.table <- rbind(compare.table, glm.step.pred$table)

According to the Stepwise Algorithm the best formula is: default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT3 + PAY_AMT1 + PAY_AMT2 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6

Comparison of restricted and unrestricted logit models

set.seed(123) 
# Confident intervals for AUC for restricted and unrestricted models
ci.auc(glm.s.pred$test$roc); ci.auc(glm.step.pred$test$roc)

## 95% CI: 0.7598-0.7844 (DeLong)

## 95% CI: 0.7588-0.7834 (DeLong)

# Test for statistically significant difference between AUC on test data
roc.test(glm.s.pred$test$roc, glm.step.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.s.pred$test$roc and glm.step.pred$test$roc
## Z = 1.641, p-value = 0.1008
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7720944   0.7710716

p-value is more than 0.05 It means we cannot reject HO hypothesis that AUC true difference in AUC is equal to 0. Stated another way, models are equal in a meaning of AUC.

Check restricted and unrestricted logit model for Multicollinearity

# Let's try to find linearly dependent variables
# Get names of linearly dependent variables
attributes(alias(glm.s)$Complete)$dimnames[[1]]

## NULL

attributes(alias(glm.step)$Complete)$dimnames[[1]]

## NULL

We do not have any linearly dependent variables for both models.

# Get Variance Inflation Factors for variables
vif(glm.s)

##                       GVIF Df GVIF^(1/(2*Df))
## LIMIT_BAL         1.611870  1        1.269595
## SEX               1.032801  1        1.016268
## EDUCATION         1.266043  6        1.019853
## MARRIAGE          1.372604  3        1.054203
## AGE               1.405157  1        1.185393
## PAY_0      50928611.056569 10        2.428556
## PAY_2     619819757.739603  9        3.079351
## PAY_3     598954170.869274  9        3.073498
## PAY_4     632129259.170845  8        3.548545
## PAY_5     224114042.549988  8        3.325863
## PAY_6       7237053.911423  8        2.683630
## BILL_AMT1        19.961809  1        4.467864
## BILL_AMT2        37.505261  1        6.124154
## BILL_AMT3        31.965799  1        5.653830
## BILL_AMT4        27.014888  1        5.197585
## BILL_AMT5        35.538099  1        5.961384
## BILL_AMT6        21.839599  1        4.673286
## PAY_AMT1          1.470808  1        1.212769
## PAY_AMT2          1.477856  1        1.215671
## PAY_AMT3          1.541702  1        1.241653
## PAY_AMT4          1.525965  1        1.235300
## PAY_AMT5          1.652673  1        1.285563
## PAY_AMT6          1.097983  1        1.047847

vif(glm.step)

##                       GVIF Df GVIF^(1/(2*Df))
## LIMIT_BAL         1.552810  1        1.246118
## SEX               1.030503  1        1.015137
## EDUCATION         1.256465  6        1.019207
## MARRIAGE          1.370162  3        1.053890
## AGE               1.403560  1        1.184719
## PAY_0       1023734.820906 10        1.997604
## PAY_3      89472504.560333  9        2.765416
## PAY_4     492058644.841769  8        3.493422
## PAY_5     210380851.309903  8        3.312744
## PAY_6       6769487.240504  8        2.672451
## BILL_AMT3         1.894709  1        1.376484
## PAY_AMT1          1.159114  1        1.076622
## PAY_AMT2          1.267219  1        1.125708
## PAY_AMT4          1.163630  1        1.078717
## PAY_AMT5          1.143264  1        1.069235
## PAY_AMT6          1.078451  1        1.038485

Due to the data above, we do not have Multicollinearity for both restricted and unrestricted models. Because we do not have any VIFs > 10.

Lasso and Ridge regression

set.seed(123)
# Create custom trainControl: control
control <- trainControl(
  method = "cv", number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE
)

# Fit Ridge-Lasso regression using train(): glmnet
glmnet <-
  train(default.payment.next.month ~ ., xtrain,
        tuneGrid = expand.grid(alpha = seq(0, 1, length = 20),
                               lambda = seq(.0001, 1, length = 20)),
        method = "glmnet",
        trControl = control)

According to adjusting results the best model is a Ridge regression (alpha = 0). Let’s call this model appropriate.

glmnet.ridge <- glmnet
# Make predictions on train and test data for restricted logit model and 
glmnet.ridge.pred <- 
  ModelPredictEvaluateCompareTable(glmnet.ridge, xtrain, 
                                   xtest, "GLM_Ridge")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7733142 0.7644498   0.8062817   0.6154864 0.2050717
## 2 train_best_Acc 0.7733142 0.8206530   0.9518647   0.3534083 0.4500000
## 3   test_best_SS 0.7726450 0.7826430   0.8439188   0.5740557 0.2275486
## 4  test_best_Acc 0.7726450 0.8168378   0.9410133   0.3941352 0.4000000
## 5       train_cv 0.7700446        NA   0.9602277   0.3006773        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, glmnet.ridge.pred$table)

Predictors’ importance

In this chapter, we will explore importance of predictors and interpret them.

Heteroskedasticity test

set.seed(123)
# Perform the Breusch-Pagan test against heteroskedasticity on 
# restricted logit model
bptest(glm.step)

## 
##  studentized Breusch-Pagan test
## 
## data:  glm.step
## BP = 1196, df = 61, p-value < 0.00000000000000022

p-value is less than 0.05 and it means we have heteroskedasticity in the data. Because of this, we have to use robust Covariance Matrix for Coefficients` Significance Levels estimation for next steps.

Higher-order serial correlation tests

# Perform the Breusch-Godfrey test for higher-order serial correlation
bgtest(glm.step)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  glm.step
## LM test = 1.5421, df = 1, p-value = 0.2143

# Perform the Durbin-Watson test for autocorrelation of disturbances
dwtest(glm.step)

## 
##  Durbin-Watson test
## 
## data:  glm.step
## DW = 1.9827, p-value = 0.1072
## alternative hypothesis: true autocorrelation is greater than 0

For both tests, p-values are higher than 5% significant level. It means we cannot reject null hypothesis where there is no autocorrelation.

Significance Levels

# Show Estimated Coefficients and Significance Levels for restricted logit model
# Note that because of heteroskedasticity we have to use 
# Heteroskedasticity-Consistent Covariance Matrix Estimation
coeftest(glm.step, vcov. = vcovHC(glm.step, type = "HC1"))

## 
## z test of coefficients:
## 
##                   Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)    -1.1474e+00  1.2786e-01  -8.9740 < 2.2e-16 ***
## LIMIT_BAL      -1.8242e-06  2.0760e-07  -8.7870 < 2.2e-16 ***
## SEXFemale      -1.4891e-01  3.9170e-02  -3.8016 0.0001438 ***
## EDUCATIONX0    -1.1767e+01  3.7903e-01 -31.0459 < 2.2e-16 ***
## EDUCATIONX2    -2.2096e-03  4.4870e-02  -0.0492 0.9607249    
## EDUCATIONX3    -9.9180e-03  6.1520e-02  -0.1612 0.8719239    
## EDUCATIONX4    -1.0695e+00  4.5962e-01  -2.3270 0.0199660 *  
## EDUCATIONX5    -1.1814e+00  3.2603e-01  -3.6236 0.0002905 ***
## EDUCATIONX6     1.2789e-01  5.4743e-01   0.2336 0.8152750    
## MARRIAGEX0     -1.3865e+00  6.8442e-01  -2.0259 0.0427801 *  
## MARRIAGESingle -1.7499e-01  4.4924e-02  -3.8953 9.808e-05 ***
## MARRIAGEOther  -2.1775e-02  1.7844e-01  -0.1220 0.9028749    
## AGE             4.1004e-03  2.4375e-03   1.6822 0.0925285 .  
## PAY_0X.2       -4.3050e-01  1.0555e-01  -4.0785 4.533e-05 ***
## PAY_0X0        -7.1468e-01  7.2052e-02  -9.9190 < 2.2e-16 ***
## PAY_0X1         3.8605e-01  7.0097e-02   5.5073 3.643e-08 ***
## PAY_0X2         1.5363e+00  8.3104e-02  18.4869 < 2.2e-16 ***
## PAY_0X3         1.6127e+00  1.8163e-01   8.8793 < 2.2e-16 ***
## PAY_0X4         1.2166e+00  3.1514e-01   3.8604 0.0001132 ***
## PAY_0X5         4.5480e-01  5.0970e-01   0.8923 0.3722408    
## PAY_0X6         9.8533e-01  9.4684e-01   1.0407 0.2980378    
## PAY_0X7         1.4086e+01  1.1159e+00  12.6232 < 2.2e-16 ***
## PAY_0X8        -1.1711e+01  9.6193e-01 -12.1749 < 2.2e-16 ***
## PAY_3X.2       -6.1963e-03  1.0897e-01  -0.0569 0.9546528    
## PAY_3X0         1.2022e-01  9.2012e-02   1.3065 0.1913755    
## PAY_3X2         4.1886e-01  9.7453e-02   4.2980 1.723e-05 ***
## PAY_3X3         3.3327e-01  2.1960e-01   1.5176 0.1291109    
## PAY_3X4        -2.2461e-01  4.7974e-01  -0.4682 0.6396396    
## PAY_3X5        -4.1731e-01  9.3477e-01  -0.4464 0.6552893    
## PAY_3X6         1.3721e+01  1.2088e+00  11.3512 < 2.2e-16 ***
## PAY_3X7        -6.3551e-01  9.5377e-01  -0.6663 0.5052087    
## PAY_3X8        -2.4144e+01  1.3422e+00 -17.9878 < 2.2e-16 ***
## PAY_4X.2        1.3009e-01  1.2938e-01   1.0055 0.3146693    
## PAY_4X0         6.6179e-02  9.5195e-02   0.6952 0.4869302    
## PAY_4X2         3.9437e-01  1.1307e-01   3.4877 0.0004873 ***
## PAY_4X3         5.2559e-01  2.9187e-01   1.8008 0.0717400 .  
## PAY_4X4         5.9527e-01  6.2800e-01   0.9479 0.3431882    
## PAY_4X5        -1.4870e+00  1.1199e+00  -1.3278 0.1842453    
## PAY_4X6        -2.7910e+01  2.3026e+00 -12.1212 < 2.2e-16 ***
## PAY_4X7        -8.0993e-01  1.3733e+00  -0.5898 0.5553544    
## PAY_5X.2        5.3099e-02  1.2675e-01   0.4189 0.6752659    
## PAY_5X0         1.3765e-01  9.2409e-02   1.4896 0.1363369    
## PAY_5X2         4.0478e-01  1.1905e-01   3.4002 0.0006735 ***
## PAY_5X3         2.6683e-01  2.8402e-01   0.9395 0.3474732    
## PAY_5X4         5.6776e-01  5.8939e-01   0.9633 0.3353988    
## PAY_5X5         1.9525e+00  1.5695e+00   1.2441 0.2134727    
## PAY_5X6         3.7093e+01  1.9044e+00  19.4775 < 2.2e-16 ***
## PAY_5X7         1.2499e+01  1.3969e+00   8.9476 < 2.2e-16 ***
## PAY_6X.2        1.6345e-01  9.7765e-02   1.6718 0.0945585 .  
## PAY_6X0        -1.8020e-01  8.1899e-02  -2.2003 0.0277887 *  
## PAY_6X2         1.4762e-01  1.0524e-01   1.4027 0.1607093    
## PAY_6X3         5.9942e-01  2.9580e-01   2.0264 0.0427199 *  
## PAY_6X4        -4.4687e-01  5.6734e-01  -0.7877 0.4308983    
## PAY_6X5        -4.7588e-01  7.2703e-01  -0.6546 0.5127573    
## PAY_6X6         1.4978e+00  9.2867e-01   1.6128 0.1067805    
## PAY_6X7        -1.0288e+01  1.1447e+00  -8.9868 < 2.2e-16 ***
## BILL_AMT3       2.2306e-06  5.0592e-07   4.4090 1.038e-05 ***
## PAY_AMT1       -1.4316e-05  3.8248e-06  -3.7430 0.0001818 ***
## PAY_AMT2       -1.5906e-05  4.3287e-06  -3.6745 0.0002383 ***
## PAY_AMT4       -4.4945e-06  2.5196e-06  -1.7838 0.0744568 .  
## PAY_AMT5       -4.1578e-06  2.6783e-06  -1.5524 0.1205636    
## PAY_AMT6       -2.7806e-06  2.1657e-06  -1.2839 0.1991728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Variable importance chart for Ridge regression
plot(varImp(glmnet.ridge, scale = FALSE))

In the exploratory analysis chapter, we made some conclusions about the impact of predictors on the dependent variable. Here, we can first of all confirm our guesses using signs before the Estimated Coefficients. For instance, we can see a negative for Coefficient for SEXFemale. It means Male has more chances for default than Female. We have made the same conclusion in the Exploratory analysis summary

Secondly, we can see how important each variable is. Here we can see that having payment delays is most important for default prediction. It is PAY variables with index 1..8 where the number indicates the number of months of payment delay. Again, we observed this conformity in the Exploratory analysis but now we can see the measure of importance. The more payment delay, the more chances to default.

Please read more about interpreting in the Conclusion chapter

Models with BoxCox, Spatial Sign Pre-Processing

Let’s try to improve efficiency with BoxCox, Spatial Sign Pre-Processing

BoxCox

set.seed(123)
# Fit restricted linear classificator with BoxCox Pre-Processing 
glm.boxcox <-
  train(default.payment.next.month ~ LIMIT_BAL + SEX + 
          EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_2 + PAY_4 + PAY_5 + 
          PAY_6 + BILL_AMT1 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 + 
          PAY_AMT5 + PAY_AMT6, xtrain,
        method = "glm",
        preProcess = c("center", "scale", "BoxCox"),
        trControl = control)

# Make predictions on train and test data for
# linear classificator with BoxCox Pre-Processing  
glm.boxcox.pred <- 
  ModelPredictEvaluateCompareTable(glm.boxcox, xtrain, xtest, "GLM_BoxCox")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7744741 0.7496010   0.7819353   0.6344584 0.1967552
## 2 train_best_Acc 0.7744741 0.8227328   0.9460414   0.3836311 0.4500000
## 3   test_best_SS 0.7735816 0.7675206   0.8186597   0.5934394 0.2161305
## 4  test_best_Acc 0.7735816 0.8201106   0.9472916   0.3871769 0.4500000
## 5       train_cv 0.7696977        NA   0.9523602   0.3525216        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, glm.boxcox.pred$table)

Ridge regression with BoxCox Pre-Processing

set.seed(123)
# Fit Ridge regression with BoxCox Pre-Processing 
glm.boxcox.ridge <- 
  train(default.payment.next.month ~ ., xtrain,
        tuneGrid = expand.grid(alpha  = glmnet.ridge$bestTune$alpha, 
                               lambda = glmnet.ridge$bestTune$lambda),
        method = "glmnet",
        preProcess = c("center", "scale", "BoxCox"),
        trControl = control)

# Make predictions on train and test data for
# Ridge regression with BoxCox Pre-Processing   
glm.boxcox.ridge.pred <- 
  ModelPredictEvaluateCompareTable(glm.boxcox.ridge, xtrain, 
                                   xtest, "GLM_boxcox.ridge")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7753960 0.7646433   0.8059720   0.6174719 0.2053471
## 2 train_best_Acc 0.7753960 0.8211366   0.9397844   0.3986323 0.4000000
## 3   test_best_SS 0.7747679 0.7748561   0.8291721   0.5899602 0.2184333
## 4  test_best_Acc 0.7747679 0.8174021   0.9413053   0.3956262 0.4000000
## 5       train_cv 0.7721462        NA   0.9594224   0.3042079        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, glm.boxcox.ridge.pred$table)

Restricted model with the Spatial Sign kernel

set.seed(123)
# Fit restricted linear classificator with Spatial Sign kernel
glm.spatial <- 
  train(default.payment.next.month ~ LIMIT_BAL + SEX + 
          EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_2 + PAY_4 + PAY_5 + 
          PAY_6 + BILL_AMT1 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 + 
          PAY_AMT5 + PAY_AMT6, xtrain,
        method = "glm",
        preProcess = c("center", "scale", "spatialSign"),
        trControl = control)

# Make predictions on train and test data for
# linear classificator with BoxCox Pre-Processing  
glm.spatial.pred <- 
  ModelPredictEvaluateCompareTable(glm.spatial, xtrain, xtest, "GLM_Spatial")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7776518 0.7585973   0.7950688   0.6287227 0.2219315
## 2 train_best_Acc 0.7776518 0.8213785   0.9514310   0.3582616 0.5000000
## 3   test_best_SS 0.7794517 0.7588308   0.7985107   0.6237575 0.2227850
## 4  test_best_Acc 0.7794517 0.8183049   0.9525478   0.3613320 0.5000000
## 5       train_cv 0.7729955        NA   0.9503155   0.3547272        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, glm.spatial.pred$table)

Comparison between linear models

In this chapter, we will compare results of linear models on test data and train data with cross-validation. We will compare metrics in different combinations: AUC and Accuracy; AUC + Accuracy + Sensitivity + Specificity; AUC.

# Comparison on test data using maximum AUC and Accuracy combination
compare.table %>% 
  filter(substring(Dataset, 1, 5) != "train") %>%
  mutate(sum_score = AUC + Accuracy) %>%
  top_n(6, sum_score) %>%
  arrange(desc(sum_score))

##                  Model       Dataset       AUC  Accuracy Sensitivity
## 1          GLM_Spatial test_best_Acc 0.7794517 0.8183049   0.9525478
## 2           GLM_BoxCox test_best_Acc 0.7735816 0.8201106   0.9472916
## 3       GLM_Logit_Full test_best_Acc 0.7720944 0.8206749   0.9557600
## 4     GLM_boxcox.ridge test_best_Acc 0.7747679 0.8174021   0.9413053
## 5 GLM_Logit_Restricted test_best_Acc 0.7710716 0.8198849   0.9547379
## 6            GLM_Ridge test_best_Acc 0.7726450 0.8168378   0.9410133
##   Specificity Threshold sum_score
## 1   0.3613320      0.50  1.597757
## 2   0.3871769      0.45  1.593692
## 3   0.3608350      0.50  1.592769
## 4   0.3956262      0.40  1.592170
## 5   0.3608350      0.50  1.590956
## 6   0.3941352      0.40  1.589483

# Comparison on test data using maximum AUC,
# Sensitivity, Specificity and Accuracy combination
compare.table %>% 
  filter(substring(Dataset, 1, 5) != "train") %>%
  mutate(sum_score = AUC + Accuracy + Sensitivity + Specificity) %>%
  top_n(6, sum_score) %>%
  arrange(desc(sum_score))

##                  Model      Dataset       AUC  Accuracy Sensitivity
## 1            GLM_Ridge test_best_SS 0.7726450 0.7826430   0.8439188
## 2     GLM_boxcox.ridge test_best_SS 0.7747679 0.7748561   0.8291721
## 3          GLM_Spatial test_best_SS 0.7794517 0.7588308   0.7985107
## 4 GLM_Logit_Restricted test_best_SS 0.7710716 0.7677463   0.8180756
## 5           GLM_BoxCox test_best_SS 0.7735816 0.7675206   0.8186597
## 6       GLM_Logit_Full test_best_SS 0.7720944 0.7641350   0.8112133
##   Specificity Threshold sum_score
## 1   0.5740557 0.2275486  2.973263
## 2   0.5899602 0.2184333  2.968756
## 3   0.6237575 0.2227850  2.960551
## 4   0.5964215 0.2140302  2.953315
## 5   0.5934394 0.2161305  2.953201
## 6   0.6038767 0.2088148  2.951319

# Comparison of cross-validation on train data
compare.table %>% 
  filter(substring(Dataset, 1, 8) == "train_cv") %>%
  arrange(desc(AUC))

##              Model  Dataset       AUC Accuracy Sensitivity Specificity
## 1      GLM_Spatial train_cv 0.7729955       NA   0.9503155   0.3547272
## 2 GLM_boxcox.ridge train_cv 0.7721462       NA   0.9594224   0.3042079
## 3        GLM_Ridge train_cv 0.7700446       NA   0.9602277   0.3006773
## 4       GLM_BoxCox train_cv 0.7696977       NA   0.9523602   0.3525216
##   Threshold
## 1        NA
## 2        NA
## 3        NA
## 4        NA

Finally, our champion is a linear model with spatial sign kernel.

Decision Tree

set.seed(123)
fit.cart <- train(default.payment.next.month ~ .,
                  data = xtrain,
                  method = "rpart", trControl = control)
fit.cart

## CART 
## 
## 20675 samples
##    23 predictor
##     2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens       Spec     
##   0.004301787  0.6819729  0.9567596  0.3187696
##   0.010589014  0.6564273  0.9605991  0.2973563
##   0.146481359  0.5557021  0.9817224  0.1296817
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.004301787.

# Plot the decision tree
rpart.plot(fit.cart$finalModel)

# Make predictions on train and test data 
# using Decision Tree
fit.cart.pred <- 
  ModelPredictEvaluateCompareTable(fit.cart, xtrain, xtest, "CART")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.6770878 0.8038210   0.9073845   0.4350320 0.2751099
## 2 train_best_Acc 0.6770878 0.8175091   0.9605997   0.3079638 0.4500000
## 3   test_best_SS 0.6816073 0.8049882   0.9126880   0.4383698 0.2751099
## 4  test_best_Acc 0.6816073 0.8169507   0.9653964   0.3116302 0.4500000
## 5       train_cv 0.6819729        NA   0.9567596   0.3187696        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, fit.cart.pred$table)

Decision tree gives us ability to predict default with 82% accuracy using only 3 predictors: PAY_0X2, PAY_2X2, PAY_0X3.

PAY_0X2 means having payment delay for two months in September, 2005
PAY_2X2 means having payment delay for two months in August, 2005
PAY_0X3 means having payment delay for three months in September, 2005
X1 - default, X0 - non-default

Please read more about interpreting in the Conclusion chapter

Random Forest Сlassification

set.seed(123)
# Train Random Forest using more fast algorithm "ranger" 
fit.rf <- train(default.payment.next.month ~ .,
                data = xtrain,
                method = "ranger", trControl = control)

fit.rf

## Random Forest 
## 
## 20675 samples
##    23 predictor
##     2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.7718726  0.9866186  0.1275063
##   42    0.7650784  0.9443689  0.3677446
##   82    0.7623392  0.9422001  0.3675249
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

# Make predictions on train and test data 
# using Random Forest classification algorithm
fit.rf.pred <- 
  ModelPredictEvaluateCompareTable(fit.rf, xtrain, 
                                   xtest, "RF")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7987047 0.7651753   0.7894932   0.6785793 0.2233039
## 2 train_best_Acc 0.7987047 0.8207497   0.9469706   0.3712773 0.3500000
## 3   test_best_SS 0.7757078 0.7555581   0.7856621   0.6530815 0.2239535
## 4  test_best_Acc 0.7757078 0.8060038   0.9427654   0.3404573 0.3500000
## 5       train_cv 0.7718726        NA   0.9866186   0.1275063        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, fit.rf.pred$table)

Neural Network

set.seed(123)
fit.nnet <- train(default.payment.next.month ~ .,
                  data = xtrain,
                  method = "nnet", trace = FALSE,
                  tuneGrid = expand.grid(.decay = c(0, .05, .2), .size = 4:9),
                  trControl = control)
fit.nnet

## Neural Network 
## 
## 20675 samples
##    23 predictor
##     2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  ROC        Sens       Spec        
##   0.00   4     0.5696450  1.0000000  0.0002207506
##   0.00   5     0.5675889  0.9999380  0.0000000000
##   0.00   6     0.5676826  1.0000000  0.0000000000
##   0.00   7     0.5535240  0.9999380  0.0000000000
##   0.00   8     0.5754733  0.9998761  0.0000000000
##   0.00   9     0.5772597  0.9998761  0.0000000000
##   0.05   4     0.5606549  1.0000000  0.0002207506
##   0.05   5     0.5751728  0.9999380  0.0000000000
##   0.05   6     0.5830525  0.9995663  0.0011037528
##   0.05   7     0.5798257  0.9998141  0.0000000000
##   0.05   8     0.5735729  0.9999380  0.0002207506
##   0.05   9     0.5793253  0.9999380  0.0002207506
##   0.20   4     0.5844926  0.9999380  0.0000000000
##   0.20   5     0.5821064  0.9999381  0.0000000000
##   0.20   6     0.5836424  1.0000000  0.0000000000
##   0.20   7     0.5999705  0.9999380  0.0000000000
##   0.20   8     0.5893835  0.9999380  0.0000000000
##   0.20   9     0.5995681  1.0000000  0.0000000000
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were size = 7 and decay = 0.2.

# Make predictions on train and test data for
# Neural Network
fit.nnet.pred <- 
  ModelPredictEvaluateCompareTable(fit.nnet, xtrain, xtest, "NNET")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.5784914 0.6986699   0.7990336   0.3412751 0.2599924
## 2 train_best_Acc 0.5784914 0.7001693   0.8018833   0.3379660 0.3000000
## 3   test_best_SS 0.5821877 0.5949667   0.6148343   0.5273360 0.1898918
## 4  test_best_Acc 0.5821877 0.6963097   0.8018689   0.3369781 0.3000000
## 5       train_cv 0.5999705        NA   0.9999380   0.0000000        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, fit.nnet.pred$table)

# Load nnet visualise module
source("nnet_plot_update.r") 

# Visualise neural network
plot.nnet(fit.nnet)

The Neural Network shows very poor results.

Fit Neural Network Based on PCA Transformed Data

set.seed(123)
# Fit Neural Network based on PCA transformed data 
fit.pcaNNet <- train(default.payment.next.month ~ .,
                     data = xtrain, trace = FALSE,
                     method = "pcaNNet",
                     trControl = control)
fit.pcaNNet

## Neural Networks with Feature Extraction 
## 
## 20675 samples
##    23 predictor
##     2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC        Sens       Spec      
##   1     0e+00  0.5248100  0.9939319  0.04128035
##   1     1e-04  0.6507448  0.9879220  0.07660044
##   1     1e-01  0.7611355  0.9396598  0.39178750
##   3     0e+00  0.5802201  0.9837051  0.11285167
##   3     1e-04  0.6228223  0.9863097  0.10236505
##   3     1e-01  0.7688423  0.9468465  0.35692933
##   5     0e+00  0.5000000  1.0000000  0.00000000
##   5     1e-04  0.6297742  0.9776332  0.14266418
##   5     1e-01  0.7732246  0.9470945  0.36398557
## 
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were size = 5 and decay = 0.1.

# Make predictions on train and test data for
# Neural Network based on PCA transformed data 
fit.pcaNNet.pred <- 
  ModelPredictEvaluateCompareTable(fit.pcaNNet, xtrain, xtest, "PCA_NNET")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7909592 0.7794921   0.8225746   0.6260754 0.2352238
## 2 train_best_Acc 0.7909592 0.8266022   0.9454838   0.4032649 0.5000000
## 3   test_best_SS 0.7757733 0.7578151   0.7964666   0.6262425 0.2201007
## 4  test_best_Acc 0.7757733 0.8189821   0.9455395   0.3881710 0.5000000
## 5       train_cv 0.7732246        NA   0.9470945   0.3639856        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, fit.pcaNNet.pred$table)

# Visualize tuning steps 
plot(fit.pcaNNet)

Neural Network, based on PCA transformed data results, are much better than results of regular nnet. Let’s try to improve results of pcaNNet. There is a clue in the visualisation of the tuning model above that we can try to increase decay for size = 5.

set.seed(123)
# Tuning of Neural Network based on PCA transformed data 
fit.pcaNNet.tuned <- 
  (train(default.payment.next.month ~ .,
        data = xtrain,
        method = "pcaNNet", trace = FALSE,
        tuneGrid = expand.grid(.decay = c(0, .05, .2), .size = fit.pcaNNet$bestTune$size),
        trControl = control)) 

# Make predictions on train and test data for
# tuned Neural Network based on PCA transformed data 
fit.pcaNNet.tuned.pred <- 
  ModelPredictEvaluateCompareTable(fit.pcaNNet.tuned, xtrain, 
                                   xtest, "PCA_NNET_tuned")

##          Dataset       AUC  Accuracy Sensitivity Specificity Threshold
## 1  train_best_SS 0.7919578 0.7776058   0.8180523   0.6335760 0.2373869
## 2 train_best_Acc 0.7919578 0.8276663   0.9526081   0.3827487 0.5000000
## 3   test_best_SS 0.7724154 0.7818531   0.8405607   0.5820080 0.2612996
## 4  test_best_Acc 0.7724154 0.8179664   0.9512338   0.3643141 0.5000000
## 5       train_cv 0.7728598        NA   0.9474038   0.3578172        NA

# Collect results for final comparison
compare.table <- rbind(compare.table, fit.pcaNNet.tuned.pred$table)

With decay = 0.2 pcaNNet works better.

Final Comparison Between Models

In this chapter, we will compare results of all models we fitted on test data and train data with cross-validation. We will compare metrics in different combinations: AUC and Accuracy; AUC + Accuracy + Sensitivity + Specificity; AUC. Let’s compare efficiency for the train data too.

Train Data

caret.models <- list(GLM_Spatial          = glm.spatial, 
                     GLM_boxcox.ridge     = glm.boxcox.ridge,
                     GLM_BoxCox           = glm.boxcox,
                     GLM_Ridge            = glmnet.ridge,
                     PCA_NNET             = fit.nnet,
                     PCA_NNET_tuned       = fit.pcaNNet.tuned,
                     CART                 = fit.cart,
                     RF                   = fit.rf)

# Models' collection resampling
results <- resamples(caret.models)
# Models' difference summary
summary(results, Metric = "ROC")

## 
## Call:
## summary.resamples(object = results, Metric = "ROC")
## 
## Models: GLM_Spatial, GLM_boxcox.ridge, GLM_BoxCox, GLM_Ridge, PCA_NNET, PCA_NNET_tuned, CART, RF 
## Number of resamples: 10 
## 
## ROC 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## GLM_Spatial      0.7473  0.7594 0.7740 0.7730  0.7873 0.7931    0
## GLM_boxcox.ridge 0.7454  0.7609 0.7724 0.7721  0.7855 0.7988    0
## GLM_BoxCox       0.7464  0.7553 0.7715 0.7697  0.7843 0.7914    0
## GLM_Ridge        0.7435  0.7582 0.7700 0.7700  0.7837 0.7965    0
## PCA_NNET         0.5764  0.5931 0.6028 0.6000  0.6086 0.6195    0
## PCA_NNET_tuned   0.7515  0.7597 0.7750 0.7729  0.7824 0.7967    0
## CART             0.6500  0.6749 0.6820 0.6820  0.6856 0.7159    0
## RF               0.7495  0.7607 0.7705 0.7719  0.7818 0.8028    0
## 
## Sens 
##                    Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## GLM_Spatial      0.9380  0.9473 0.9523 0.9503  0.9543 0.9579    0
## GLM_boxcox.ridge 0.9455  0.9587 0.9603 0.9594  0.9621 0.9647    0
## GLM_BoxCox       0.9436  0.9504 0.9514 0.9524  0.9554 0.9610    0
## GLM_Ridge        0.9473  0.9591 0.9607 0.9602  0.9636 0.9659    0
## PCA_NNET         0.9994  1.0000 1.0000 0.9999  1.0000 1.0000    0
## PCA_NNET_tuned   0.9374  0.9481 0.9486 0.9474  0.9490 0.9536    0
## CART             0.9443  0.9517 0.9591 0.9568  0.9621 0.9634    0
## RF               0.9827  0.9847 0.9876 0.9866  0.9881 0.9913    0
## 
## Spec 
##                     Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## GLM_Spatial      0.30240  0.3245 0.3653 0.3547  0.3795 0.3921    0
## GLM_boxcox.ridge 0.26490  0.2804 0.3113 0.3042  0.3262 0.3348    0
## GLM_BoxCox       0.28480  0.3234 0.3642 0.3525  0.3800 0.3929    0
## GLM_Ridge        0.27150  0.2788 0.3046 0.3007  0.3212 0.3304    0
## PCA_NNET         0.00000  0.0000 0.0000 0.0000  0.0000 0.0000    0
## PCA_NNET_tuned   0.30460  0.3339 0.3653 0.3578  0.3763 0.3929    0
## CART             0.25610  0.2882 0.3201 0.3188  0.3453 0.3753    0
## RF               0.09692  0.1104 0.1159 0.1275  0.1467 0.1700    0

# Summarize p-values for pair-wise comparisons
diffs <- diff(results)
summary(diffs)

## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## ROC 
##                  GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge 
## GLM_Spatial                   0.0008493        0.0032977  0.0029509
## GLM_boxcox.ridge 1.000000                      0.0024484  0.0021016
## GLM_BoxCox       0.797694    1.000000                    -0.0003468
## GLM_Ridge        1.000000    0.007495         1.000000             
## PCA_NNET         9.320e-09   4.349e-09        5.936e-09  3.739e-09 
## PCA_NNET_tuned   1.000000    1.000000         1.000000   1.000000  
## CART             1.506e-06   1.620e-06        2.656e-06  2.465e-06 
## RF               1.000000    1.000000         1.000000   1.000000  
##                  PCA_NNET   PCA_NNET_tuned CART       RF        
## GLM_Spatial       0.1730250  0.0001357      0.0910225  0.0011229
## GLM_boxcox.ridge  0.1721757 -0.0007136      0.0901732  0.0002736
## GLM_BoxCox        0.1697272 -0.0031621      0.0877248 -0.0021749
## GLM_Ridge         0.1700741 -0.0028152      0.0880717 -0.0018280
## PCA_NNET                    -0.1728893     -0.0820024 -0.1719021
## PCA_NNET_tuned   1.675e-09                  0.0908869  0.0009872
## CART             6.185e-05  2.177e-06                 -0.0898997
## RF               1.641e-09  1.000000       1.996e-06            
## 
## Sens 
##                  GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge 
## GLM_Spatial                  -0.0091068       -0.0020447 -0.0099122
## GLM_boxcox.ridge 5.734e-05                     0.0070621 -0.0008054
## GLM_BoxCox       1.0000000   0.0009657                   -0.0078675
## GLM_Ridge        1.550e-05   0.4937066        0.0003742            
## PCA_NNET         2.698e-08   5.435e-08        5.507e-09  5.115e-08 
## PCA_NNET_tuned   0.6688182   2.490e-06        0.0560344  1.780e-06 
## CART             0.3328771   1.0000000        1.0000000  1.0000000 
## RF               5.595e-08   3.205e-07        1.142e-08  4.924e-07 
##                  PCA_NNET   PCA_NNET_tuned CART       RF        
## GLM_Spatial      -0.0496225  0.0029117     -0.0064441 -0.0363030
## GLM_boxcox.ridge -0.0405157  0.0120185      0.0026627 -0.0271962
## GLM_BoxCox       -0.0475778  0.0049564     -0.0043994 -0.0342583
## GLM_Ridge        -0.0397103  0.0128239      0.0034681 -0.0263908
## PCA_NNET                     0.0525342      0.0431784  0.0133195
## PCA_NNET_tuned   1.173e-09                 -0.0093558 -0.0392147
## CART             2.183e-07  0.1062087                 -0.0298589
## RF               3.447e-06  8.636e-09      3.405e-06            
## 
## Spec 
##                  GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge
## GLM_Spatial                   0.050519         0.002206   0.054050
## GLM_boxcox.ridge 1.302e-06                    -0.048314   0.003531
## GLM_BoxCox       1.0000000   2.336e-05                    0.051844
## GLM_Ridge        1.020e-05   1.0000000        0.0001541           
## PCA_NNET         2.543e-09   1.483e-09        6.787e-09  4.970e-10
## PCA_NNET_tuned   1.0000000   1.558e-06        1.0000000  2.262e-05
## CART             0.0679122   1.0000000        0.0077162  1.0000000
## RF               5.976e-09   2.620e-09        6.593e-09  4.769e-10
##                  PCA_NNET  PCA_NNET_tuned CART      RF       
## GLM_Spatial       0.354727 -0.003090       0.035958  0.227221
## GLM_boxcox.ridge  0.304208 -0.053609      -0.014562  0.176702
## GLM_BoxCox        0.352522 -0.005296       0.033752  0.225015
## GLM_Ridge         0.300677 -0.057140      -0.018092  0.173171
## PCA_NNET                   -0.357817      -0.318770 -0.127506
## PCA_NNET_tuned   1.005e-09                 0.039048  0.230311
## CART             3.244e-08 0.0384245                 0.191263
## RF               1.485e-06 8.220e-09      1.800e-08

Results show us that there is no statistically significant difference in AUC between most of the models.

Test Data

Let’s compare results on the test data using AUC:

# Comparison on test data using maximum AUC 
compare.table %>% 
  filter(substring(Dataset, 1, 5) != "train") %>%
  filter(substring(Dataset, 11, 13) != "SS") %>%
  top_n(6, AUC) %>%
  arrange(desc(AUC))

##              Model       Dataset       AUC  Accuracy Sensitivity
## 1      GLM_Spatial test_best_Acc 0.7794517 0.8183049   0.9525478
## 2         PCA_NNET test_best_Acc 0.7757733 0.8189821   0.9455395
## 3               RF test_best_Acc 0.7757078 0.8060038   0.9427654
## 4 GLM_boxcox.ridge test_best_Acc 0.7747679 0.8174021   0.9413053
## 5       GLM_BoxCox test_best_Acc 0.7735816 0.8201106   0.9472916
## 6        GLM_Ridge test_best_Acc 0.7726450 0.8168378   0.9410133
##   Specificity Threshold
## 1   0.3613320      0.50
## 2   0.3881710      0.50
## 3   0.3404573      0.35
## 4   0.3956262      0.40
## 5   0.3871769      0.45
## 6   0.3941352      0.40

The first 6 models have very similar results and we will check statistically significant differences between them:

set.seed(123) 
# Confident intervals for AUC 
ci.auc(fit.rf.pred$test$roc); ci.auc(glm.spatial.pred$test$roc); ci.auc(fit.pcaNNet.pred$test$roc)

## 95% CI: 0.7637-0.7877 (DeLong)

## 95% CI: 0.7675-0.7914 (DeLong)

## 95% CI: 0.7636-0.7879 (DeLong)

# Test for statistically significant difference between AUC on test data
# GLM_Spatial vs RF
roc.test(glm.spatial.pred$test$roc, fit.rf.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.spatial.pred$test$roc and fit.rf.pred$test$roc
## Z = 1.1926, p-value = 0.233
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7794517   0.7757078

# GLM_Spatial vs PCA_NNET
roc.test(glm.spatial.pred$test$roc, fit.pcaNNet.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.spatial.pred$test$roc and fit.pcaNNet.pred$test$roc
## Z = 1.2673, p-value = 0.205
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7794517   0.7757733

# GLM_Spatial vs GLM_boxcox.ridge
roc.test(glm.spatial.pred$test$roc, glm.boxcox.ridge.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.spatial.pred$test$roc and glm.boxcox.ridge.pred$test$roc
## Z = 2.2599, p-value = 0.02383
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7794517   0.7747679

# GLM_Spatial vs GLM_BoxCox
roc.test(glm.spatial.pred$test$roc, glm.boxcox.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.spatial.pred$test$roc and glm.boxcox.pred$test$roc
## Z = 3.462, p-value = 0.0005361
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7794517   0.7735816

# GLM_Spatial vs GLM_Logit_Full
roc.test(glm.spatial.pred$test$roc, glm.s.pred$test$roc)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  glm.spatial.pred$test$roc and glm.s.pred$test$roc
## Z = 3.6682, p-value = 0.0002443
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7794517   0.7720944

There is no statistically significant difference in AUC metric between GLM_Spatial, RF and PCA_NNET models.

Let’s try to compare models by combination of other parameters:

# Comparison on test data using maximum AUC,
# Sensitivity, Specificity and Accuracy combination
compare.table %>% 
  filter(substring(Dataset, 1, 5) != "train") %>%
  mutate(sum_score = AUC + Accuracy + Sensitivity + Specificity) %>%
  top_n(6, sum_score) %>%
  arrange(desc(sum_score))

##              Model      Dataset       AUC  Accuracy Sensitivity
## 1   PCA_NNET_tuned test_best_SS 0.7724154 0.7818531   0.8405607
## 2        GLM_Ridge test_best_SS 0.7726450 0.7826430   0.8439188
## 3               RF test_best_SS 0.7757078 0.7555581   0.7856621
## 4 GLM_boxcox.ridge test_best_SS 0.7747679 0.7748561   0.8291721
## 5      GLM_Spatial test_best_SS 0.7794517 0.7588308   0.7985107
## 6         PCA_NNET test_best_SS 0.7757733 0.7578151   0.7964666
##   Specificity Threshold sum_score
## 1   0.5820080 0.2612996  2.976837
## 2   0.5740557 0.2275486  2.973263
## 3   0.6530815 0.2239535  2.970010
## 4   0.5899602 0.2184333  2.968756
## 5   0.6237575 0.2227850  2.960551
## 6   0.6262425 0.2201007  2.956298

I would choose the linear model with spatial sign kernel model or the Random Forest model because they are still the top choices by combination of other parameters and show stable results for both train and test data. The combination of AUC, Accuracy, Sensitivity and Specificity for Random Forest is a little bit better than for the linear model with spatial sign kernel and ridge regression with BoxCox pre-processing.

Confusion matrices (gives the best combination of Accuracy and Sensitivity and Specificity) for The linear model with spatial sign kernel

# Confusion matrix for GLM_Spatial model for best Accuracy 
glm.spatial.pred$test$confusion.matrix.best.Acc

##          Prediction
## Reference   X0   X1
##        X0 6524  325
##        X1 1285  727

# Confusion matrix for GLM_Spatial model 
# for best Sensitivity and Specificity combination
glm.spatial.pred$test$confusion.matrix.best.SS

##          Prediction
## Reference   X0   X1
##        X0 5469 1380
##        X1  757 1255

Confusion matrices (gives the best combination of Accuracy and Sensitivity and Specificity) for Random Forest

# Confusion matrix for RF model for best Accuracy 
fit.rf.pred$test$confusion.matrix.best.Acc

##          Prediction
## Reference   X0   X1
##        X0 6457  392
##        X1 1327  685

# Confusion matrix for RF model 
# for best Sensitivity and Specificity combination
fit.rf.pred$test$confusion.matrix.best.SS

##          Prediction
## Reference   X0   X1
##        X0 5381 1468
##        X1  698 1314

Confusion matrices (best Accuracy and best Sensitivity and Specificity combination) for the ridge regression with BoxCox pre-processing

# Confusion matrix for RF model for best Accuracy 
glm.boxcox.ridge.pred$test$confusion.matrix.best.Acc

##          Prediction
## Reference   X0   X1
##        X0 6447  402
##        X1 1216  796

# Confusion matrix for RF model 
# for best Sensitivity and Specificity combination
glm.boxcox.ridge.pred$test$confusion.matrix.best.SS

##          Prediction
## Reference   X0   X1
##        X0 5679 1170
##        X1  825 1187

Conclusion

Model

From the models we explored, I would choose the Linear model with Spatial Sign kernel with AUC = 0.7794517 and Accuracy = 0.8183049 or Random Forest with AUC = 0.7757078 and Accuracy = 0.8060038.

Predictors’ impact

In this chapter, we will update the summary we wrote after Exploratory analysis using interpretable models. We can use Decision Tree and Significance Levels research for restricted, unrestricted and Ridge regression models. First of all, let’s order predictors by their approximate importance:

PAY_0..6: Repayment status in September .. April, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- Impact to default (details): Having a delay, even for 1 month in any of the previous months, increases the chance of default.
- Update: Our hypothesis is confirmed through modelling. Coefficients for linear models and the importance levels diagram shows that PAY_X predictor is the most important, and the odds for default decrease with increased payments delay. Of course, we have PAY_3X8, PAY_0X8, PAY_4X6 with big coefficients with minus sign but there are few of such observations. According to Decision Tree, we can predict default with about 82% accuracy using only 3 predictors: PAY_0X2, PAY_2X2, PAY_0X3.
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others)
- Impact to default (details): The better education the lower chances to default. (In general)
- Update: According to linear models education graduate school, university, high school are not significant. However, “Other” level (X4) of education and “NA” levels of education (X0, X5) are significant and decrease odds to default.
SEX: Gender (1=male, 2=female)
- Impact to default (details): Male persons have more chances to default. (In general)
- Update: This variable is significant and linear models confirm the hypothesis above.
MARRIAGE: Marital status (1=married, 2=single, 3=others)
- Impact to default (details): Married persons have more chances to default. (In general)
- Update: This variable is significant and linear models confirm the hypothesis above.
AGE: Age in years
- Impact to default (details): The biggest chance of default is in the age group under 25 and the smallest for 25 - 34 age group.
- Update: This variable is significant according to linear models but on the border of significance.
PAY_AMT1..6: Amount of previous payment in September .. April, 2005 (NT dollar)
- Impact to default (details): The smaller the payment amount, the bigger the chance of default. (In general)
- Update: According to linear models this factor is most significant for prediction and the hypothesis above is confirmed - each dollar decreases the odds to default.
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- Impact to default (details): The lower the amount of given credit limit of the balance owing, the bigger the chances to default. (In general)
- Update: This variable is significant and linear models confirm the hypothesis above.
BILL_AMT1..6: Amount of bill statement in September .. April, 2005 (NT dollar)
- Impact to default (details): The smaller the difference between the amount owed on the bill in September and April, the bigger the chances to default. (In general)
- Update: In the whole, this variable is not significant for prediction.

Future work

Try to investigate closer strange population of the dataset for index > 15000 we observed here.
It makes sense to conduct an ensembling models experiment and to try to tune some models in more depth.

Default Payments of Credit Card Clients in Taiwan from 2005

Vladimir G. Drugov

Introduction

Data

Data Overview

Data Dictionary

Importing Data

Exploratory Data Analysis

Exploration of Predictors

Categorical

Repayment Status (PAY_X)

Social Status Predictors (SEX, EDUCATION, MARRIAGE)

Numerical

Age (AGE)

Amount of Bill Statement (BILL_AMTX)

Outliers

Variable Without Outliers

Amount of Previous Payment (PAY_AMTX)

Outliers

Variable Without Outliers

Analysis

Limit of Balance (LIMIT_BAL)

Outliers

Variable Without Outliers

Outliers Exploration Results

Predictors’ Correlation

Numerical

Categorical

Near Zero Variables

Exploration of Dependent Variable

20 EPV (events per predictor variable) rule

Majority Class Classifier

Сorrelation between Preditors and Dependent Variable

Social Status Predictors (SEX, EDUCATION, MARRIAGE, AGE)

AGE, SEX and MARRIAGE combination

Age (AGE)

Financial Predictors

Limit of Balance (LIMIT_BAL)

Repayment Status (PAY_X)

Amount of Previous Payment (PAY_AMTX) & Amount of Bill Statement (BILL_AMTX)

Summary

Modelling

Introduction

Libraries

Functions

Splitting and Data Preparation

Fitting models

Linear models

Simple logit model on all variables

Simple logit model - choosing variables using AIC

Comparison of restricted and unrestricted logit models

Check restricted and unrestricted logit model for Multicollinearity

Lasso and Ridge regression

Predictors’ importance

Heteroskedasticity test

Higher-order serial correlation tests

Significance Levels

Models with BoxCox, Spatial Sign Pre-Processing

BoxCox

Ridge regression with BoxCox Pre-Processing

Restricted model with the Spatial Sign kernel

Comparison between linear models

Decision Tree

Random Forest Сlassification

Neural Network

Fit Neural Network Based on PCA Transformed Data

Final Comparison Between Models

Train Data

Test Data

Conclusion

Model

Predictors’ impact

Future work