This report contains two parts devoted to Exploratory Analysis and Prediction Models. In the chapters, I tried to touch on the most popular methods and algorithms in order to find the best model which will help predict default and to answer the questions:
I hope you will enjoy this report.
In this research, I use Google’s R Style Guide
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The data set comes from the UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science.
Number of Instances: 30000
There are 25 variables:
Let’s load the data, remove some extra data and make basic transformations:
library(XLConnect)
data.file.path <-
file.path(getwd(), "Data", "default of credit card clients.xls")
raw.data <-
readWorksheet(loadWorkbook(data.file.path), sheet = 1,
startCol = 1,
startRow = 2, # first row is synthetic names such as X1
autofitCol = TRUE)
# Converting columns to factors according to the data description
pay.cols.names <- paste0("PAY_", c(0, 2:6)) # "PAY_0" "PAY_2" "PAY_3"...
cols.to.factors <- c("SEX", "EDUCATION", "MARRIAGE",
pay.cols.names, "default.payment.next.month")
raw.data[cols.to.factors] <- lapply(raw.data[cols.to.factors], factor)
Let’s compare the documentation provided for the data and a real situation. According to the description, this PAY_X is a set of categorical variables with the levels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above. Let’s look at real PAY_X columns:
library(ggplot2)
# Generating bar plots for PAY_0..6 variables
pay.histograms <-
lapply(pay.cols.names,
function (pay.col.name){
ggplot(data = raw.data[, pay.cols.names],
aes(x = raw.data[, pay.col.name])) +
geom_bar(stat = "count") +
theme_minimal() +
xlab(paste0("Repayment status ", pay.col.name)) +
ylab("Observations count")
})
MultiPlot(plotlist = pay.histograms, cols = 3)
We observe undocumented values for repayment status variables: -2
and 0
. Moreover, fraction of it is 86.5%.
Strictly speaking, it is “NAs”. Due to this, there are several standard strategies to deal with:
# Getting ID and AGE columns only
age.raw.data <- raw.data[c("ID", "AGE")]
# Calculate a cutoff for the "rule of thumb"
age.thumb.cutoffs <-
apply(age.raw.data, 2,
function (t){
Q3 = quantile(t, .75) + 1.5 * IQR(t)
}
) %>% t()
# Generating a histogram for the AGE variable with a vertical cutoff line
age.hist.plot <-
ggplot(data = age.raw.data,
aes(x = age.raw.data$AGE)) +
geom_histogram(binwidth = 1,
colour = "black", fill = "white") +
geom_vline(aes(xintercept = age.thumb.cutoffs[1, "AGE"]),
colour = "#BB0000", linetype = "dashed") +
theme_minimal() +
xlab("Age")
# Generating a scatterplot for the AGE variable with a horizontal cutoff line
age.scatter.plot <-
ggplot(data = age.raw.data,
aes(x = 1:length(age.raw.data$AGE),
y = age.raw.data$AGE)) +
geom_point(shape = 1,
position = position_jitter(height = .5)) +
geom_hline(aes(yintercept = age.thumb.cutoffs[1, "AGE"]),
colour = "#BB0000", linetype = "dashed") +
theme_minimal() +
xlab("Index") + ylab("Age")
# Printing both plots near each other
MultiPlot(age.hist.plot, age.scatter.plot)
Look at the scatterplot. There is something strange in the structure of the data. We can explore over-plotting for customers with indexes more than 15000. It looks like table forming was changed for these customers and a data provider started to populate it by packs of data with some order by Age. Let’s draw separated histograms for this data to be sure that both part of the data has the same statistical structure.
# Marking both parts of the AGE data
age.raw.data <- age.raw.data %>%
mutate(isClear = ifelse(ID < 15000, "Clear part", "Questionable part"))
# Drawing violin histograms and boxplots for both parts of data
ggplot(data = age.raw.data, aes(x = isClear, y = AGE)) +
geom_violin(aes(fill = isClear), trim = FALSE, alpha = .3) +
geom_boxplot(aes(fill = isClear), width = .2, outlier.colour = NA) +
theme(legend.position = "NA") +
labs(x = "", y = "Age")
So, both parts of data have the same density. No more doubts in data with ID > 15000.
Let’s explore histograms and check data for outliers using our expert judgment and the “rule of thumb” (Han, Jiawei and M. Kamber, 2006. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann, 2006): outlier if bigger than \[ Q3 + 1.5*IQR \] or less than \[ Q1 - 1.5*IQR \] Note: The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate parts are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
# Getting BILL_AMT1..6 columns only
bill.raw.data <- raw.data %>% select(starts_with("BILL_AMT"))
# Calculate cutoffs for the "rule of thumb"
(r.thumb.cutoffs <-
apply(bill.raw.data, 2, function (t){
c(Q1 = quantile(t, .25) - 1.5 * IQR(t),
Q3 = quantile(t, .75) + 1.5 * IQR(t))
}
) %>% t())
## Q1.25% Q3.75%
## BILL_AMT1 -91739.62 162389.4
## BILL_AMT2 -88547.50 155538.5
## BILL_AMT3 -83581.50 146412.5
## BILL_AMT4 -75942.12 132774.9
## BILL_AMT5 -70878.25 122831.8
## BILL_AMT6 -70657.38 121111.6
# Drawing histograms with cutoff lines
HistogramsFromVariables <-
function(data.columns, cutoffs.table = data.frame(),
bin.width = 5000, text = "") {
# Generates list of ggplot histograms with vertical lines (cutoffs.table)
#
# Args:
# data.columns: variables
# cutoffs.table: table with values for drawing vertical lines
# bin.width: bin size for histograms.
# text: comment for the x axe
#
# Returns:
# List of ggplot objects
lapply(colnames(data.columns),
function (col.name){
h <- ggplot(data = data.columns,
aes(x = data.columns[, col.name])) +
geom_histogram(binwidth = bin.width,
colour = "black", fill = "white") +
theme_minimal() +
xlab(paste0(text, " ", col.name, " ")) +
ylab("Obs. count")
# Adjusting plot to data contained in cutoffs.table:
# two, one or without cutoffs
if (dim(cutoffs.table)[1] > 1) {
h + geom_vline(aes(xintercept = cutoffs.table[col.name, "Q1.25%"]),
colour = "#BB0000", linetype = "dashed") +
geom_vline(aes(xintercept = cutoffs.table[col.name, "Q3.75%"]),
colour = "#BB0000", linetype = "dashed")
} else if (dim(cutoffs.table)[1] == 1) {
h + geom_vline(aes(xintercept = cutoffs.table[1, col.name]),
colour = "#BB0000", linetype = "dashed")
} else {
return(h)
}
})
}
ScatterplotsFromVariables <-
function(data.columns, cutoffs.table = data.frame(), text = "Amount") {
# Generates list of scatterplots with 'rule of thumb' boundries.
#
# Args:
# data.columns: variables
# cutoffs.table: table with values for drawing horizontal lines
# text: comment for the y axe
#
# Returns:
# List of ggplot objects
lapply(colnames(data.columns),
function (col.name){
p <- ggplot(data = data.columns,
aes(x = 1:length(data.columns[, col.name]),
y = data.columns[, col.name])) +
geom_point(shape=1) +
theme_minimal() +
xlab(paste0("Index ", col.name, " ")) +
ylab(text)
# Adjusting the plot to data contained in cutoffs.table:
# two, one or without cutoffs
if (dim(cutoffs.table)[1] > 1) {
p + geom_hline(aes(yintercept = cutoffs.table[col.name, "Q1.25%"]),
colour = "#BB0000", linetype = "dashed") +
geom_hline(aes(yintercept=cutoffs.table[col.name, "Q3.75%"]),
colour = "#BB0000", linetype = "dashed")
} else if (dim(cutoffs.table)[1] == 1) {
p + geom_hline(aes(yintercept = cutoffs.table[1, col.name]),
colour ="#BB0000", linetype = "dashed")
} else {
return(p)
}
})
}
MultiPlot(plotlist = HistogramsFromVariables(bill.raw.data,
r.thumb.cutoffs,
text = "Amount of bill statement"))
MultiPlot(plotlist = ScatterplotsFromVariables(bill.raw.data,
r.thumb.cutoffs))
It looks like we have outliers but Q3 limit is too low for our purposes. I suggest removing the data with the amount of the bill statements under Q1 cutoff
and bigger than Q3 cutoff + 220000
# Collecting indexes to remove
to.remove.indexes <-
lapply(colnames(bill.raw.data),
function (bill.col.name){
which(bill.raw.data[, bill.col.name] >
r.thumb.cutoffs[bill.col.name, "Q3.75%"] + 220000 |
bill.raw.data[, bill.col.name] <
r.thumb.cutoffs[bill.col.name, "Q1.25%"])
}
) %>% unlist()
# Getting BILL_AMT1..6 columns without outliers
bill.raw.data <- bill.raw.data[-to.remove.indexes, ]
# Collecting observation indexes to remove from
# the main data
to.remove.indexes.total <- to.remove.indexes
Let’s look at histograms without outliers:
Let’s take a closer look at histograms for bills less than 5000 with bin.width = 100
because most of the payments are here:
The generated histogram shows the most common bill statement interval is under 100 dollars.
Let’s explore histograms and check data for outliers using our expert judgment and the “rule of thumb”. In that case, we do not have negative values and because of this we will use upper bound only: \[ Q3 + 1.5*IQR \]
# Getting PAY_AMT1..6 columns only
pay.raw.data <- raw.data %>% select(starts_with("PAY_AMT")) # Getting columns
# Calculate cutoffs for the "rule of thumb"
(pay.thumb.cutoffs <-
apply(pay.raw.data, 2,
function (t){
Q3 = quantile(t, .75) + 1.5 * IQR(t)
}
) %>% t())
## PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## [1,] 11015 11250.5 10677.5 9589.125 9700 9823.375
# Generating histograms for PAY_AMT1..6
MultiPlot(plotlist = HistogramsFromVariables(pay.raw.data, pay.thumb.cutoffs,
text = "Amount of previous payment"))
# Generating scatterplots for PAY_AMT1..6
MultiPlot(plotlist = ScatterplotsFromVariables(pay.raw.data, pay.thumb.cutoffs))
It looks like we have outliers but Q3 limit is too low for our purposes. I suggest removing the data with the amount of the previous payment bigger than Q3 cutoff + 200000
# Collecting indexes to remove
to.remove.indexes <-
lapply(colnames(pay.raw.data),
function (pay.col.name){
which(pay.raw.data[, pay.col.name] >
pay.thumb.cutoffs[1, pay.col.name] + 200000)
}
) %>% unlist()
# Getting PAY_AMT1..6 columns without outliers
pay.raw.data <- pay.raw.data[-to.remove.indexes, ]
# Collecting observation indexes to remove from
# the main data
to.remove.indexes.total <- c(to.remove.indexes.total, to.remove.indexes)
Taking a closer look at histograms for payments less than 5000, it looks like people prefer to pay round figures of money.
Let’s explore Limit of balanse variable
# Getting ID and LIMIT_BAL columns only
bal.raw.data <- raw.data[c("ID", "LIMIT_BAL")]
# Calculate cutoffs for LIMIT_BAL variable using "rule of thumb"
bal.thumb.cutoffs <-
apply(bal.raw.data, 2,
function (t){
Q3 = quantile(t, .75) + 1.5 * IQR(t)
}
) %>% t()
# Generating a histogram for the LIMIT_BAL variable with a vertical cutoff line
ggplot(data = bal.raw.data,
aes(x = bal.raw.data$LIMIT_BAL)) +
geom_histogram(binwidth = 100, colour = "black", fill = "white") +
geom_vline(aes(xintercept = bal.thumb.cutoffs[1, "LIMIT_BAL"]),
colour = "#BB0000", linetype = "dashed") +
theme_minimal() +
labs(x = "Limit of Balance", y = "Observations count")
We can detect on the histogram above some popular limits of balance: 20000, 50000, 80000, 200000, 360000, 500000
# Generating a scatterplot for the LIMIT_BAL variable
# with a horizontal cutoff line
ggplot(data = bal.raw.data,
aes(x = 1:length(bal.raw.data$LIMIT_BAL),
y = bal.raw.data$LIMIT_BAL)) +
geom_point(shape = 1, position=position_jitter(height = 100)) +
geom_hline(aes(yintercept=bal.thumb.cutoffs[1, "LIMIT_BAL"]),
colour="#BB0000", linetype="dashed") +
theme_minimal() +
labs(x = "Index", y = "Limit of Balance")
I suggest filtering observations with limits higher than 750000
# Collecting observation indexes to remove from
# the main data - adding indexes with LIMIT_BAL > 750000
to.remove.indexes.total <- c(to.remove.indexes.total,
which(bal.raw.data$LIMIT_BAL > 750000))
1283
indexes to remove.457
which is 36
% of all remove marks and 2
% of all the data.Let’s remove outliers for further research:
# Saving final dataset without outliers
fin.data <- raw.data[-to.remove.indexes.total, -1]
Correlogram for numerical variables:
library(corrgram)
corrgram(fin.data, order = TRUE, lower.panel = panel.shade,
upper.panel = panel.pie, text.panel = panel.txt)
corrgram(fin.data, order = TRUE, lower.panel = panel.ellipse,
upper.panel = panel.pts, text.panel = panel.txt,
diag.panel = panel.minmax)
We see a high level of linear correlations between the amount of bill statements in different months. In case of the multicollinearity we need to use such technics as Ridge and Lasso regression and Principal components method. We can even drop some variables if we need to, but price of this is unbiasedness of estimates and this is not the best decision.
To find correlations between Categorical variables we need to perform Chi-Square Test. In our case we will test the null hypothesis whether the one variable is independent of another at 0.05 significance level. We will filter all combinations with p-value less than 0.05 significance level because under this level we reject the null hypothesis that variables are independent.
# Creating combinations of pairs for categorical variables
# Categorical cols
cat.var.names <- colnames(fin.data[sapply(fin.data, is.factor)])
# Combinations of pairs
cat.combs <- as.data.frame(t(combn(cat.var.names, 2)))
# Performing Chi-Square tests for all pairs
cat.chisq.test.res <-
apply(t(cat.combs), 2, function(x){
c(Var1 = x[1],
Var2 = x[2],
pvalue = chisq.test(fin.data[, x[1]],
fin.data[, x[2]])$p.value)
}
)
cat.chisq.test.res <- cat.chisq.test.res %>%
t() %>%
data.frame(stringsAsFactors=FALSE)
cat.chisq.test.res$pvalue <- as.numeric(cat.chisq.test.res$pvalue)
cat.chisq.test.res <- cat.chisq.test.res %>%
filter(pvalue <= .05) %>% # Filter p-values < 0.05
arrange(pvalue) # Most dependent at top
knitr::kable(cat.chisq.test.res, caption = "Table with kable",
format = "markdown", padding = 0,
col.names = c("", "", "p-value"))
p-value | ||
---|---|---|
PAY_0 | PAY_2 | 0.00e+00 |
PAY_0 | PAY_3 | 0.00e+00 |
PAY_0 | PAY_4 | 0.00e+00 |
PAY_0 | PAY_5 | 0.00e+00 |
PAY_0 | PAY_6 | 0.00e+00 |
PAY_0 | default.payment.next.month | 0.00e+00 |
PAY_2 | PAY_3 | 0.00e+00 |
PAY_2 | PAY_4 | 0.00e+00 |
PAY_2 | PAY_5 | 0.00e+00 |
PAY_2 | PAY_6 | 0.00e+00 |
PAY_2 | default.payment.next.month | 0.00e+00 |
PAY_3 | PAY_4 | 0.00e+00 |
PAY_3 | PAY_5 | 0.00e+00 |
PAY_3 | PAY_6 | 0.00e+00 |
PAY_3 | default.payment.next.month | 0.00e+00 |
PAY_4 | PAY_5 | 0.00e+00 |
PAY_4 | PAY_6 | 0.00e+00 |
PAY_4 | default.payment.next.month | 0.00e+00 |
PAY_5 | PAY_6 | 0.00e+00 |
PAY_5 | default.payment.next.month | 0.00e+00 |
PAY_6 | default.payment.next.month | 0.00e+00 |
EDUCATION | PAY_2 | 0.00e+00 |
EDUCATION | PAY_3 | 0.00e+00 |
EDUCATION | MARRIAGE | 0.00e+00 |
EDUCATION | PAY_0 | 0.00e+00 |
EDUCATION | PAY_4 | 0.00e+00 |
EDUCATION | PAY_5 | 0.00e+00 |
EDUCATION | PAY_6 | 0.00e+00 |
SEX | PAY_2 | 0.00e+00 |
EDUCATION | default.payment.next.month | 0.00e+00 |
SEX | PAY_3 | 0.00e+00 |
SEX | PAY_0 | 0.00e+00 |
SEX | PAY_4 | 0.00e+00 |
SEX | PAY_5 | 0.00e+00 |
MARRIAGE | PAY_4 | 0.00e+00 |
MARRIAGE | PAY_0 | 0.00e+00 |
MARRIAGE | PAY_5 | 0.00e+00 |
MARRIAGE | PAY_2 | 0.00e+00 |
MARRIAGE | PAY_3 | 0.00e+00 |
MARRIAGE | PAY_6 | 0.00e+00 |
SEX | default.payment.next.month | 0.00e+00 |
SEX | PAY_6 | 0.00e+00 |
MARRIAGE | default.payment.next.month | 1.00e-07 |
SEX | MARRIAGE | 2.00e-07 |
SEX | EDUCATION | 9.35e-05 |
We can see that PAY_X
variables are highly correlated to each other and for further exploration we can use one of them - PAY_0
for example. It looks like all categorical variables are dependent on each other and impact of PAY_X
variables to default.payment.next.month
is high. In addition, we can not say we have even one independent pair.
Let’s try to find variables with zero variance.
library(caret)
# Testing for existing of near-zero predictors in our data
nearZeroVar(select(fin.data, -default.payment.next.month))
## integer(0)
So, we do not have such variables. In case you have near zero variance predictors, it is better to remove it to reduce model-fitting time without reducing model accuracy.
Do we have enough observations to implement a good model?
library(gmodels)
def.ct <- (CrossTable(fin.data$default.payment.next.month))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 29543
##
##
## | 0 | 1 |
## |------------|------------|
## | 22996 | 6547 |
## | 0.778 | 0.222 |
## |------------|------------|
##
##
##
##
20 EPV (events per predictor variable) approach (Harrell, Frank. Regression modeling strategies. NY: Springer, 2001):
Get the smallest presented value of Dependent variable. In this case, it is 6547
Divide this by number of predictors: 6547
/ 23
= 285
We have enough observations if we have more than 20 observations per smallest presented predictor. It is our case, we have enough observations.
Majority Class Classifier is an elementary model for prediction. Accuracy will be equal to the fraction of instances belonging to the majority class. In our case, majority class is nondefault
or 0
. It is the lowest level of accuracy for our future modelling. We can just say that all credits cards are nondefault
and we will have a level of accuracy equal to the number of negatives
/ number of observations
= 22996
/ 29543
= 77.8
%
Using the diagrams below, we cannote that the fraction of default credit cards with a limit of up to 100000 dollars is higher. In contrast, the higher limits are more spread out for non-default cards.
# Generating violin histograms and boxplots for LIMIT_BAL
# splitted for non-default and default classes
library("tidyr")
# Preparing special data for violin diagrams
limit.spec.data <- fin.data %>%
filter(LIMIT_BAL < 600000) %>%
select(LIMIT_BAL, default.payment.next.month) %>%
gather(LIMIT_BAL, key = "Month", value = "Balance")
ggplot(data = limit.spec.data, aes(x = default.payment.next.month,
y = Balance)) +
geom_violin(aes(fill = default.payment.next.month), trim = FALSE,
alpha = 0.3) +
geom_boxplot(aes(fill = default.payment.next.month), width = 0.2,
outlier.colour = NA) +
theme(legend.position = "NA") +
labs(colour = "default",
subtitle = paste0("Amount of given credit in NT dollars for",
" default (1) and non-default (0) cards"),
y = "Amount of given credit in NT dollars",
x = "non-default (0) ~ default (1)")
PAY_X description: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above.
Having a delay even for 1 month in any of the previous months, increases chances of default. Moreover, it works for any social subgroups:
# Generating mosaic diagrams for PAY_0 and
# dependent variable with Chi-Square tests
mosaic(~ PAY_0 + default.payment.next.month,
data = fin.data, shade=TRUE, legend=TRUE)
# Generating mosaic diagrams for PAY_0 and
# dependent variable with Chi-Square tests folded by AGE
mosaic(~ PAY_0 + default.payment.next.month | AGE_F,
data = fin.data, shade=TRUE, legend=TRUE)
Diagrams below show us that for default cards sizes of payments are smaller than for non-defaults cards and at the same time sizes of bills for default cards are higher (especially for pre-default month). It make sense.
# Preparing special data for violin diagrams generating
bill.spec.data <- fin.data %>%
filter(LIMIT_BAL < 600000, BILL_AMT1 < 300000) %>%
select(BILL_AMT1, BILL_AMT3, BILL_AMT6, default.payment.next.month) %>%
gather(BILL_AMT1, BILL_AMT3, BILL_AMT6, key = "Month", value = "Amount")
pay.spec.data <- fin.data %>%
filter(PAY_AMT1 < 20000, PAY_AMT3 < 20000, PAY_AMT6 < 20000) %>%
select(PAY_AMT1, PAY_AMT3, PAY_AMT6, default.payment.next.month) %>%
gather(PAY_AMT1, PAY_AMT3, PAY_AMT6, key = "Month", value = "Amount")
# Generating violin diagrams for me BILL_AMT1, 3, 6
bill.amt.plot <-
ggplot(data = bill.spec.data, aes(x = default.payment.next.month,
y = Amount)) +
geom_violin(aes(fill = default.payment.next.month),
trim = FALSE, alpha = .3) +
geom_boxplot(aes(fill = default.payment.next.month),
width = .2, outlier.colour = NA) +
theme(legend.position = "NA") +
facet_wrap( ~ Month) +
labs(colour = "default",
subtitle = paste0("Amount of bill statement for 1th,",
" 3rd and 6th months for default (1)",
" and non-default (0) cards"),
y = "Amount of bill statement",
x = "non-default (0) ~ default (1)")
# Generating violin diagrams for me PAY_AMT1, 3, 6
pay.amt.plot <-
ggplot(data = pay.spec.data, aes(x = default.payment.next.month,
y = Amount)) +
geom_violin(aes(fill = default.payment.next.month),
trim = FALSE, alpha = .3) +
geom_boxplot(aes(fill = default.payment.next.month),
width = .2, outlier.colour = NA) +
theme(legend.position = "NA") +
facet_wrap( ~ Month) +
labs(colour = "default",
subtitle = paste0("Amount of previous payment for 1th,",
" 3rd and 6th months for default (1)",
" and non-default (0) cards"),
y = "Amount of previous payment",
x = "non-default (0) ~ default (1)")
# Printing both diagrams near each other
MultiPlot(bill.amt.plot, pay.amt.plot, cols = 1)
Using diagrams below we can compare bills and payments in dynamic and make some observations:
Amount of payments for default cards is significantly lower from the very first month of observations;
Dynamic of payments for default and non-default are very similar, but dynamic of bill statements are different. Over time, non-default cardholders decrease their bill statements but default cardholders decrease their bill statements not at the same pace.
# Creating table which contains mean and median values of
# BILL_AMT1..6 and PAY_AMT1..6 predictors for default and non-default cards
full.pay.amt <- fin.data %>%
select(BILL_AMT1:default.payment.next.month) %>%
group_by(default.payment.next.month) %>%
summarise_each(funs(median = median, mean = mean)) %>%
gather(-default.payment.next.month,
key = "name", value = "value") %>%
separate(name, c("bptype", "name", "ftype")) %>%
mutate(var = paste0(bptype, "_", name)) %>%
select(-name)
# Generating line chart for mean and median values for
# BILL_AMT1..6 and PAY_AMT1..6 for different months
ggplot(full.pay.amt,
aes(x = var, y = value,
group = default.payment.next.month,
colour = default.payment.next.month)) +
geom_line() + geom_point() +
scale_color_hue(labels = c("non-default (0)", "default (1)")) +
theme(legend.position = "bottom") +
labs(colour = "", x = "",
subtitle = paste0("Average and median amount",
" of previous payment / bill by month"),
y = "Amount of payment/bill") +
facet_wrap(bptype ~ ftype, scales = "free")
Also, turning to corellogram from the “Predictors’ Correlation - Numerical” chapter and looking at the graphs above, we see that despite BILL_AMTX variables having high correlation, it would be a mistake to just drop all of them because they are important for the explanation of dependent variables.
After exploratory analysis we already have a picture about predictors’ impact to response variable:
In this chapter, we will check hypothesizes about predictors’ impact on the dependent variable made in the previous chapter Also, we will try to implement a model which can predict default with a level higher than Majority Class Classifier can provide us. It means that accuracy of our future models should be higher than 77.8%
# Load libraries
library(caret) # Classification and Regression Training
library(tidyr) # Easily Tidy Data
library(dplyr) # A Grammar of Data Manipulation
library(pROC) # Display and Analyze ROC Curves
library(car) # Box-Cox, Yeo-Johnson and Basic Power Transformations
library(lmtest) # Testing Linear Regression Models
library(e1071) # Tuning of Functions Using Grid Search, Support Vector Machines
library(nnet) # Feed-Forward Neural Networks and Multinomial Log-Linear Models
library(ranger) # A Fast Implementation of Random Forests
library(sandwich) # Robust Covariance Matrix Estimators
library(rpart.plot) # Decision Tree
We will create a few functions for automation of collecting and visualising results of prediction process for different models.
ModelPredictAndEvaluate <- function(model, data, model.name, tag = "train") {
# Make prediction on given dataset and return AUC, Accuracy, Sensitivity
# Specificity, best Threshold and ROC data
# based on the best Sensitivity and Specificity combination
# and based on the best Accuracy
#
# Args:
# model: model
# data: dataset
# model.name: title for a model
# tag: text tag, i.e. "train", "test"
#
# Returns:
# List:
# confusion.matrix.best.SS: Confusion matrix based on threshold for best
# Sensitivity and Specificity combination.
# confusion.matrix.best.Acc: Confusion matrix based on fitted threshold
# for the best possible Accuracy for the data.
# evaluation: data.frame with AUC, Accuracy, Sensitivity, Specificity,
# best Threshold, Model (model.name), Dataset (tag) for
# both thresholds: best Sensitivity and Specificity
# combination and for best fitted Accuracy.
# roc: roc() function results for the model and the data.
# Possible to use results for plotting ROC curve.
#
# Choose proper option for predict() based on model's option
prob.type <- ifelse(!is.null(model$modelInfo$prob), "prob", "response")
# Make prediction
pred.model <- predict(model, data, type = prob.type)
# Create ROC curve
if (!is.null(model$modelInfo$prob)) {
probabilities <- pred.model[, 2]
} else {
probabilities <- pred.model
}
model.roc <- roc(data$default.payment.next.month, probabilities)
# Get the best threshold for the model for
# maximal sensitivity and specificity combination
thres.index <- which.max(model.roc$sensitivities + model.roc$specificities)
best.threshold <- model.roc$thresholds[thres.index]
# Calculate class probabilities for threshold which is relevant to
# maximal sensitivity and specificity combination
p.class.pred.model <- ifelse(probabilities > best.threshold, 'X1', 'X0')
# Confusion matrix for the model which is relevant to
# maximal sensitivity and specificity combination
cf.pred.model <- confusionMatrix(p.class.pred.model,
data$default.payment.next.month)
# Save model results based on best Sensitivity and Specificity combination
model.result <-
data.frame(Model = model.name,
Dataset = paste0(tag, "_best_SS"),
AUC = model.roc$auc,
Accuracy = cf.pred.model$overall['Accuracy'],
Sensitivity = cf.pred.model$byClass['Sensitivity'],
Specificity = cf.pred.model$byClass['Specificity'],
Threshold = best.threshold,
row.names = NULL
)
#
# Fitting threshold trying to find best accuracy
#
# Initial values for Confusion matrix and the best accuracy
current.best.accuracy <- 0
best.accuracy.cf <- NA
for (thres in seq(.05, .95, by = .05)) {
# Calculate class probabilities for current threshold
p.class.pred.model.acc <-
ifelse(probabilities > thres, 'X1', 'X0')
# Confusion matrix for the model and current threshold
cf.pred.model.acc <- confusionMatrix(p.class.pred.model.acc,
data$default.payment.next.month)
# Save final results only for best accuracy
if ((cf.pred.model.acc$overall['Accuracy'] > current.best.accuracy) &
(cf.pred.model.acc$byClass['Sensitivity'] >= .1) &
(cf.pred.model.acc$byClass['Specificity'] >= .1)){
best.accuracy.thres <-
c(cf.pred.model.acc$byClass['Sensitivity'],
cf.pred.model.acc$byClass['Specificity'],
cf.pred.model.acc$overall['Accuracy'],
Threshold = thres,
cf.pred.model.acc)
best.accuracy.cf <- cf.pred.model.acc$table
current.best.accuracy <- cf.pred.model.acc$overall['Accuracy']
}
}
# Add results of evaluation to the final table
if (current.best.accuracy > 0) {
model.result <-
rbind(model.result,
data.frame(Model = model.name,
Dataset = paste0(tag, "_best_Acc"),
AUC = model.roc$auc,
Accuracy = best.accuracy.thres['Accuracy'],
Sensitivity = best.accuracy.thres['Sensitivity'],
Specificity = best.accuracy.thres['Specificity'],
Threshold = best.accuracy.thres['Threshold'],
row.names = NULL
))
}
return(list("confusion.matrix.best.SS" = t(cf.pred.model$table),
"confusion.matrix.best.Acc" = t(best.accuracy.cf),
"evaluation" = model.result,
"roc" = model.roc))
}
ModelPredictEvaluateCompareTable <- function(model, xtrain, xtest, model.name) {
# Make and print predictions for train, test dataset
# and return AUC, Accuracy, Sensitivity
# Specificity, best Threshold and ROC data for them
# based on the best Sensitivity and Specificity combination
# and based on the best Accuracy. Also, it adds train() function's best result.
#
# Args:
# model: model
# xtrain: train dataset
# xtest: test dataset
# model.name: title for a model
#
# Returns:
# List of:
# table: data.frame with AUC, Accuracy, Sensitivity, Specificity,
# best Threshold, Model (model.name), Dataset (tag) for
# for train and test datasets plus train() function
# best results for both thresholds: best Sensitivity and Specificity
# combination and for best fitted Accuracy.
# train: ModelPredictAndEvaluate() function's results for train dataset.
# test: ModelPredictAndEvaluate() function's results for test dataset.
#
# Make prediction on train data and save results: model.train.res
model.train.res <- ModelPredictAndEvaluate(model, xtrain,
model.name, "train")
# Collecting model's efficiency parameters
# for the final comparison: models.results
models.results <- model.train.res$evaluation
# Make prediction on test data and save results: model.test.res
model.test.res <- ModelPredictAndEvaluate(model, xtest,
model.name, "test")
# Add models efficiency parameters for the final comparison
models.results <- models.results %>% rbind(model.test.res$evaluation)
# Add train() function's best results to the final table
# in cases it is train() produced model
if ("train" %in% class(model)) {
if (!is.null(model$results$ROC)) {
best.fit <-
model$results[which.max(model$results[, 'ROC']), ]
model.result <-
data.frame(Model = model.name,
Dataset = "train_cv",
AUC = best.fit$ROC,
Accuracy = NA,
Sensitivity = best.fit$Sens,
Specificity = best.fit$Spec,
Threshold = NA,
row.names = NULL)
} else {
model.result <-
data.frame(Model = model.name,
Dataset = "train_cv",
AUC = NA,
Accuracy =
model$results[which.max(model$results[, 'Accuracy']),
'Accuracy'],
Sensitivity = NA,
Specificity = NA,
Threshold = NA,
row.names = NULL)
}
models.results <- models.results %>% rbind(model.result)
}
# Print final table with results
print(models.results[-1])
# Return final table and ModelPredictAndEvaluate() results using invisible mode
invisible(list(table = models.results,
train = model.train.res,
test = model.test.res))
}
#
# Removing very rare classes
#
fin.data <- fin.data %>%
filter(!(PAY_2 %in% c(8))) %>%
filter(!(PAY_3 %in% c(1))) %>%
filter(!(PAY_4 %in% c(1, 8))) %>%
filter(!(PAY_5 %in% c(8))) %>%
filter(!(PAY_6 %in% c(8))) %>%
select(-AGE_F)
#
# Splitting the Data
#
set.seed(123)
# Shuffle row indices: rows
rows <- sample(nrow(fin.data))
# Randomly order data: fin.data
fin.data <- fin.data[rows, ]
# Identify row to split on (70%): split
split <- round(nrow(fin.data) * .7)
# Create train
train <- fin.data[1:split, ]
# Create test
test <- fin.data[(split + 1):nrow(fin.data), ]
#
# Rename factors to Syntactically Valid R Names
#
xtrain <- train
xtest <- test
# Get categorical columns only
cat.var.names <- colnames(fin.data[sapply(fin.data, is.factor)])
# Rename categorical columns to Syntactically Valid R Names
for (x in cat.var.names) {
levels(xtrain[, x]) <- make.names(levels(xtrain[, x]))
levels(xtest[, x]) <- make.names(levels(xtest[, x]))
}
# Change reference level for PAY_X columns to X.1 i.e. pay duly
xtrain[pay.cols.names] <- lapply(xtrain[pay.cols.names], relevel, ref = "X.1")
xtest[pay.cols.names] <- lapply(xtest[pay.cols.names], relevel, ref = "X.1")
# Change reference level for EDUCATION column to X1 i.e. graduate school
xtrain["EDUCATION"] <- lapply(xtrain["EDUCATION"], relevel, ref = "X1")
xtest["EDUCATION"] <- lapply(xtest["EDUCATION"], relevel, ref = "X1")
# Change reference level for MARRIAGE column to Married
xtrain["MARRIAGE"] <- lapply(xtrain["MARRIAGE"], relevel, ref = "Married")
xtest["MARRIAGE"] <- lapply(xtest["MARRIAGE"], relevel, ref = "Married")
set.seed(123)
# Fit logit model on all variables: glm.s
glm.s <- glm(default.payment.next.month ~ ., xtrain, family = "binomial")
# Make predictions on train and test data for the logit model
glm.s.pred <-
ModelPredictEvaluateCompareTable(glm.s, xtrain, xtest, "GLM_Logit_Full")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7738768 0.7718501 0.8207781 0.5976175 0.2182288
## 2 train_best_Acc 0.7738768 0.8222491 0.9459175 0.3818663 0.4500000
## 3 test_best_SS 0.7720944 0.7641350 0.8112133 0.6038767 0.2088148
## 4 test_best_Acc 0.7720944 0.8206749 0.9557600 0.3608350 0.5000000
In the table above you can see:
train_best_SS
- results of the model on the train data with the Threshold chosen by maximum possible sum of Sensitivity and Specificity;train_best_Acc
- results of the model on the train data with the Threshold chosen by maximum Accuracy on the data;test_best_SS
- results of the model on the test data with the Threshold chosen by maximum possible sum of Sensitivity and Specificity;test_best_Acc
- results of the model on the test data with the Threshold chosen by maximum Accuracy on the data.# Print confusion matrix for the best accuracy on the test data
glm.s.pred$test$confusion.matrix.best.Acc
## Prediction
## Reference X0 X1
## X0 6546 303
## X1 1286 726
# Collect results for final comparison
compare.table <- glm.s.pred$table
set.seed(123)
# Fit restricted logit model: glm.step
(glm.step <- step(glm.s, trace = FALSE))
##
## Call: glm(formula = default.payment.next.month ~ LIMIT_BAL + SEX +
## EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_3 + PAY_4 + PAY_5 +
## PAY_6 + BILL_AMT3 + PAY_AMT1 + PAY_AMT2 + PAY_AMT4 + PAY_AMT5 +
## PAY_AMT6, family = "binomial", data = xtrain)
##
## Coefficients:
## (Intercept) LIMIT_BAL SEXFemale EDUCATIONX0
## -1.147426330 -0.000001824 -0.148910257 -11.767265887
## EDUCATIONX2 EDUCATIONX3 EDUCATIONX4 EDUCATIONX5
## -0.002209573 -0.009917952 -1.069532994 -1.181418152
## EDUCATIONX6 MARRIAGEX0 MARRIAGESingle MARRIAGEOther
## 0.127894044 -1.386538545 -0.174993553 -0.021774927
## AGE PAY_0X.2 PAY_0X0 PAY_0X1
## 0.004100447 -0.430496272 -0.714677862 0.386050776
## PAY_0X2 PAY_0X3 PAY_0X4 PAY_0X5
## 1.536331693 1.612749993 1.216557335 0.454799576
## PAY_0X6 PAY_0X7 PAY_0X8 PAY_3X.2
## 0.985328322 14.086482466 -11.711405062 -0.006196304
## PAY_3X0 PAY_3X2 PAY_3X3 PAY_3X4
## 0.120215352 0.418855347 0.333266584 -0.224613586
## PAY_3X5 PAY_3X6 PAY_3X7 PAY_3X8
## -0.417305835 13.721163797 -0.635514166 -24.143617853
## PAY_4X.2 PAY_4X0 PAY_4X2 PAY_4X3
## 0.130089630 0.066179344 0.394366581 0.525589003
## PAY_4X4 PAY_4X5 PAY_4X6 PAY_4X7
## 0.595267915 -1.486964706 -27.910411946 -0.809925568
## PAY_5X.2 PAY_5X0 PAY_5X2 PAY_5X3
## 0.053098876 0.137649436 0.404776105 0.266832691
## PAY_5X4 PAY_5X5 PAY_5X6 PAY_5X7
## 0.567755203 1.952513564 37.093328133 12.498994583
## PAY_6X.2 PAY_6X0 PAY_6X2 PAY_6X3
## 0.163445485 -0.180199205 0.147618635 0.599422391
## PAY_6X4 PAY_6X5 PAY_6X6 PAY_6X7
## -0.446868293 -0.475878063 1.497783353 -10.287563999
## BILL_AMT3 PAY_AMT1 PAY_AMT2 PAY_AMT4
## 0.000002231 -0.000014316 -0.000015906 -0.000004494
## PAY_AMT5 PAY_AMT6
## -0.000004158 -0.000002781
##
## Degrees of Freedom: 20674 Total (i.e. Null); 20613 Residual
## Null Deviance: 21750
## Residual Deviance: 17810 AIC: 17930
# Make predictions on train and test data for restricted logit model
glm.step.pred <-
ModelPredictEvaluateCompareTable(glm.step, xtrain,
xtest, "GLM_Logit_Restricted")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7734269 0.7624184 0.8042993 0.6132804 0.2073663
## 2 train_best_Acc 0.7734269 0.8226844 0.9378020 0.4127509 0.4000000
## 3 test_best_SS 0.7710716 0.7677463 0.8180756 0.5964215 0.2140302
## 4 test_best_Acc 0.7710716 0.8198849 0.9547379 0.3608350 0.5000000
# Collect results for final comparison
compare.table <- rbind(compare.table, glm.step.pred$table)
According to the Stepwise Algorithm the best formula is: default.payment.next.month ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_3 + PAY_4 + PAY_5 +
PAY_6 + BILL_AMT3 + PAY_AMT1 + PAY_AMT2 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6
set.seed(123)
# Confident intervals for AUC for restricted and unrestricted models
ci.auc(glm.s.pred$test$roc); ci.auc(glm.step.pred$test$roc)
## 95% CI: 0.7598-0.7844 (DeLong)
## 95% CI: 0.7588-0.7834 (DeLong)
# Test for statistically significant difference between AUC on test data
roc.test(glm.s.pred$test$roc, glm.step.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.s.pred$test$roc and glm.step.pred$test$roc
## Z = 1.641, p-value = 0.1008
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7720944 0.7710716
p-value is more than 0.05 It means we cannot reject HO hypothesis that AUC true difference in AUC is equal to 0. Stated another way, models are equal in a meaning of AUC.
# Let's try to find linearly dependent variables
# Get names of linearly dependent variables
attributes(alias(glm.s)$Complete)$dimnames[[1]]
## NULL
attributes(alias(glm.step)$Complete)$dimnames[[1]]
## NULL
We do not have any linearly dependent variables for both models.
# Get Variance Inflation Factors for variables
vif(glm.s)
## GVIF Df GVIF^(1/(2*Df))
## LIMIT_BAL 1.611870 1 1.269595
## SEX 1.032801 1 1.016268
## EDUCATION 1.266043 6 1.019853
## MARRIAGE 1.372604 3 1.054203
## AGE 1.405157 1 1.185393
## PAY_0 50928611.056569 10 2.428556
## PAY_2 619819757.739603 9 3.079351
## PAY_3 598954170.869274 9 3.073498
## PAY_4 632129259.170845 8 3.548545
## PAY_5 224114042.549988 8 3.325863
## PAY_6 7237053.911423 8 2.683630
## BILL_AMT1 19.961809 1 4.467864
## BILL_AMT2 37.505261 1 6.124154
## BILL_AMT3 31.965799 1 5.653830
## BILL_AMT4 27.014888 1 5.197585
## BILL_AMT5 35.538099 1 5.961384
## BILL_AMT6 21.839599 1 4.673286
## PAY_AMT1 1.470808 1 1.212769
## PAY_AMT2 1.477856 1 1.215671
## PAY_AMT3 1.541702 1 1.241653
## PAY_AMT4 1.525965 1 1.235300
## PAY_AMT5 1.652673 1 1.285563
## PAY_AMT6 1.097983 1 1.047847
vif(glm.step)
## GVIF Df GVIF^(1/(2*Df))
## LIMIT_BAL 1.552810 1 1.246118
## SEX 1.030503 1 1.015137
## EDUCATION 1.256465 6 1.019207
## MARRIAGE 1.370162 3 1.053890
## AGE 1.403560 1 1.184719
## PAY_0 1023734.820906 10 1.997604
## PAY_3 89472504.560333 9 2.765416
## PAY_4 492058644.841769 8 3.493422
## PAY_5 210380851.309903 8 3.312744
## PAY_6 6769487.240504 8 2.672451
## BILL_AMT3 1.894709 1 1.376484
## PAY_AMT1 1.159114 1 1.076622
## PAY_AMT2 1.267219 1 1.125708
## PAY_AMT4 1.163630 1 1.078717
## PAY_AMT5 1.143264 1 1.069235
## PAY_AMT6 1.078451 1 1.038485
Due to the data above, we do not have Multicollinearity for both restricted and unrestricted models. Because we do not have any VIFs > 10.
set.seed(123)
# Create custom trainControl: control
control <- trainControl(
method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE
)
# Fit Ridge-Lasso regression using train(): glmnet
glmnet <-
train(default.payment.next.month ~ ., xtrain,
tuneGrid = expand.grid(alpha = seq(0, 1, length = 20),
lambda = seq(.0001, 1, length = 20)),
method = "glmnet",
trControl = control)
According to adjusting results the best model is a Ridge regression (alpha = 0). Let’s call this model appropriate.
glmnet.ridge <- glmnet
# Make predictions on train and test data for restricted logit model and
glmnet.ridge.pred <-
ModelPredictEvaluateCompareTable(glmnet.ridge, xtrain,
xtest, "GLM_Ridge")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7733142 0.7644498 0.8062817 0.6154864 0.2050717
## 2 train_best_Acc 0.7733142 0.8206530 0.9518647 0.3534083 0.4500000
## 3 test_best_SS 0.7726450 0.7826430 0.8439188 0.5740557 0.2275486
## 4 test_best_Acc 0.7726450 0.8168378 0.9410133 0.3941352 0.4000000
## 5 train_cv 0.7700446 NA 0.9602277 0.3006773 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, glmnet.ridge.pred$table)
In this chapter, we will explore importance of predictors and interpret them.
set.seed(123)
# Perform the Breusch-Pagan test against heteroskedasticity on
# restricted logit model
bptest(glm.step)
##
## studentized Breusch-Pagan test
##
## data: glm.step
## BP = 1196, df = 61, p-value < 0.00000000000000022
p-value is less than 0.05 and it means we have heteroskedasticity in the data. Because of this, we have to use robust Covariance Matrix for Coefficients` Significance Levels estimation for next steps.
# Perform the Breusch-Godfrey test for higher-order serial correlation
bgtest(glm.step)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: glm.step
## LM test = 1.5421, df = 1, p-value = 0.2143
# Perform the Durbin-Watson test for autocorrelation of disturbances
dwtest(glm.step)
##
## Durbin-Watson test
##
## data: glm.step
## DW = 1.9827, p-value = 0.1072
## alternative hypothesis: true autocorrelation is greater than 0
For both tests, p-values are higher than 5% significant level. It means we cannot reject null hypothesis where there is no autocorrelation.
# Show Estimated Coefficients and Significance Levels for restricted logit model
# Note that because of heteroskedasticity we have to use
# Heteroskedasticity-Consistent Covariance Matrix Estimation
coeftest(glm.step, vcov. = vcovHC(glm.step, type = "HC1"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1474e+00 1.2786e-01 -8.9740 < 2.2e-16 ***
## LIMIT_BAL -1.8242e-06 2.0760e-07 -8.7870 < 2.2e-16 ***
## SEXFemale -1.4891e-01 3.9170e-02 -3.8016 0.0001438 ***
## EDUCATIONX0 -1.1767e+01 3.7903e-01 -31.0459 < 2.2e-16 ***
## EDUCATIONX2 -2.2096e-03 4.4870e-02 -0.0492 0.9607249
## EDUCATIONX3 -9.9180e-03 6.1520e-02 -0.1612 0.8719239
## EDUCATIONX4 -1.0695e+00 4.5962e-01 -2.3270 0.0199660 *
## EDUCATIONX5 -1.1814e+00 3.2603e-01 -3.6236 0.0002905 ***
## EDUCATIONX6 1.2789e-01 5.4743e-01 0.2336 0.8152750
## MARRIAGEX0 -1.3865e+00 6.8442e-01 -2.0259 0.0427801 *
## MARRIAGESingle -1.7499e-01 4.4924e-02 -3.8953 9.808e-05 ***
## MARRIAGEOther -2.1775e-02 1.7844e-01 -0.1220 0.9028749
## AGE 4.1004e-03 2.4375e-03 1.6822 0.0925285 .
## PAY_0X.2 -4.3050e-01 1.0555e-01 -4.0785 4.533e-05 ***
## PAY_0X0 -7.1468e-01 7.2052e-02 -9.9190 < 2.2e-16 ***
## PAY_0X1 3.8605e-01 7.0097e-02 5.5073 3.643e-08 ***
## PAY_0X2 1.5363e+00 8.3104e-02 18.4869 < 2.2e-16 ***
## PAY_0X3 1.6127e+00 1.8163e-01 8.8793 < 2.2e-16 ***
## PAY_0X4 1.2166e+00 3.1514e-01 3.8604 0.0001132 ***
## PAY_0X5 4.5480e-01 5.0970e-01 0.8923 0.3722408
## PAY_0X6 9.8533e-01 9.4684e-01 1.0407 0.2980378
## PAY_0X7 1.4086e+01 1.1159e+00 12.6232 < 2.2e-16 ***
## PAY_0X8 -1.1711e+01 9.6193e-01 -12.1749 < 2.2e-16 ***
## PAY_3X.2 -6.1963e-03 1.0897e-01 -0.0569 0.9546528
## PAY_3X0 1.2022e-01 9.2012e-02 1.3065 0.1913755
## PAY_3X2 4.1886e-01 9.7453e-02 4.2980 1.723e-05 ***
## PAY_3X3 3.3327e-01 2.1960e-01 1.5176 0.1291109
## PAY_3X4 -2.2461e-01 4.7974e-01 -0.4682 0.6396396
## PAY_3X5 -4.1731e-01 9.3477e-01 -0.4464 0.6552893
## PAY_3X6 1.3721e+01 1.2088e+00 11.3512 < 2.2e-16 ***
## PAY_3X7 -6.3551e-01 9.5377e-01 -0.6663 0.5052087
## PAY_3X8 -2.4144e+01 1.3422e+00 -17.9878 < 2.2e-16 ***
## PAY_4X.2 1.3009e-01 1.2938e-01 1.0055 0.3146693
## PAY_4X0 6.6179e-02 9.5195e-02 0.6952 0.4869302
## PAY_4X2 3.9437e-01 1.1307e-01 3.4877 0.0004873 ***
## PAY_4X3 5.2559e-01 2.9187e-01 1.8008 0.0717400 .
## PAY_4X4 5.9527e-01 6.2800e-01 0.9479 0.3431882
## PAY_4X5 -1.4870e+00 1.1199e+00 -1.3278 0.1842453
## PAY_4X6 -2.7910e+01 2.3026e+00 -12.1212 < 2.2e-16 ***
## PAY_4X7 -8.0993e-01 1.3733e+00 -0.5898 0.5553544
## PAY_5X.2 5.3099e-02 1.2675e-01 0.4189 0.6752659
## PAY_5X0 1.3765e-01 9.2409e-02 1.4896 0.1363369
## PAY_5X2 4.0478e-01 1.1905e-01 3.4002 0.0006735 ***
## PAY_5X3 2.6683e-01 2.8402e-01 0.9395 0.3474732
## PAY_5X4 5.6776e-01 5.8939e-01 0.9633 0.3353988
## PAY_5X5 1.9525e+00 1.5695e+00 1.2441 0.2134727
## PAY_5X6 3.7093e+01 1.9044e+00 19.4775 < 2.2e-16 ***
## PAY_5X7 1.2499e+01 1.3969e+00 8.9476 < 2.2e-16 ***
## PAY_6X.2 1.6345e-01 9.7765e-02 1.6718 0.0945585 .
## PAY_6X0 -1.8020e-01 8.1899e-02 -2.2003 0.0277887 *
## PAY_6X2 1.4762e-01 1.0524e-01 1.4027 0.1607093
## PAY_6X3 5.9942e-01 2.9580e-01 2.0264 0.0427199 *
## PAY_6X4 -4.4687e-01 5.6734e-01 -0.7877 0.4308983
## PAY_6X5 -4.7588e-01 7.2703e-01 -0.6546 0.5127573
## PAY_6X6 1.4978e+00 9.2867e-01 1.6128 0.1067805
## PAY_6X7 -1.0288e+01 1.1447e+00 -8.9868 < 2.2e-16 ***
## BILL_AMT3 2.2306e-06 5.0592e-07 4.4090 1.038e-05 ***
## PAY_AMT1 -1.4316e-05 3.8248e-06 -3.7430 0.0001818 ***
## PAY_AMT2 -1.5906e-05 4.3287e-06 -3.6745 0.0002383 ***
## PAY_AMT4 -4.4945e-06 2.5196e-06 -1.7838 0.0744568 .
## PAY_AMT5 -4.1578e-06 2.6783e-06 -1.5524 0.1205636
## PAY_AMT6 -2.7806e-06 2.1657e-06 -1.2839 0.1991728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Variable importance chart for Ridge regression
plot(varImp(glmnet.ridge, scale = FALSE))
In the exploratory analysis chapter, we made some conclusions about the impact of predictors on the dependent variable. Here, we can first of all confirm our guesses using signs before the Estimated Coefficients. For instance, we can see a negative for Coefficient for SEXFemale
. It means Male has more chances for default than Female. We have made the same conclusion in the Exploratory analysis summary
Secondly, we can see how important each variable is. Here we can see that having payment delays is most important for default prediction. It is PAY variables with index 1..8 where the number indicates the number of months of payment delay. Again, we observed this conformity in the Exploratory analysis but now we can see the measure of importance. The more payment delay, the more chances to default.
Please read more about interpreting in the Conclusion chapter
Let’s try to improve efficiency with BoxCox, Spatial Sign Pre-Processing
set.seed(123)
# Fit restricted linear classificator with BoxCox Pre-Processing
glm.boxcox <-
train(default.payment.next.month ~ LIMIT_BAL + SEX +
EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_2 + PAY_4 + PAY_5 +
PAY_6 + BILL_AMT1 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 +
PAY_AMT5 + PAY_AMT6, xtrain,
method = "glm",
preProcess = c("center", "scale", "BoxCox"),
trControl = control)
# Make predictions on train and test data for
# linear classificator with BoxCox Pre-Processing
glm.boxcox.pred <-
ModelPredictEvaluateCompareTable(glm.boxcox, xtrain, xtest, "GLM_BoxCox")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7744741 0.7496010 0.7819353 0.6344584 0.1967552
## 2 train_best_Acc 0.7744741 0.8227328 0.9460414 0.3836311 0.4500000
## 3 test_best_SS 0.7735816 0.7675206 0.8186597 0.5934394 0.2161305
## 4 test_best_Acc 0.7735816 0.8201106 0.9472916 0.3871769 0.4500000
## 5 train_cv 0.7696977 NA 0.9523602 0.3525216 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, glm.boxcox.pred$table)
set.seed(123)
# Fit Ridge regression with BoxCox Pre-Processing
glm.boxcox.ridge <-
train(default.payment.next.month ~ ., xtrain,
tuneGrid = expand.grid(alpha = glmnet.ridge$bestTune$alpha,
lambda = glmnet.ridge$bestTune$lambda),
method = "glmnet",
preProcess = c("center", "scale", "BoxCox"),
trControl = control)
# Make predictions on train and test data for
# Ridge regression with BoxCox Pre-Processing
glm.boxcox.ridge.pred <-
ModelPredictEvaluateCompareTable(glm.boxcox.ridge, xtrain,
xtest, "GLM_boxcox.ridge")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7753960 0.7646433 0.8059720 0.6174719 0.2053471
## 2 train_best_Acc 0.7753960 0.8211366 0.9397844 0.3986323 0.4000000
## 3 test_best_SS 0.7747679 0.7748561 0.8291721 0.5899602 0.2184333
## 4 test_best_Acc 0.7747679 0.8174021 0.9413053 0.3956262 0.4000000
## 5 train_cv 0.7721462 NA 0.9594224 0.3042079 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, glm.boxcox.ridge.pred$table)
set.seed(123)
# Fit restricted linear classificator with Spatial Sign kernel
glm.spatial <-
train(default.payment.next.month ~ LIMIT_BAL + SEX +
EDUCATION + MARRIAGE + AGE + PAY_0 + PAY_2 + PAY_4 + PAY_5 +
PAY_6 + BILL_AMT1 + BILL_AMT3 + BILL_AMT5 + PAY_AMT1 + PAY_AMT2 +
PAY_AMT5 + PAY_AMT6, xtrain,
method = "glm",
preProcess = c("center", "scale", "spatialSign"),
trControl = control)
# Make predictions on train and test data for
# linear classificator with BoxCox Pre-Processing
glm.spatial.pred <-
ModelPredictEvaluateCompareTable(glm.spatial, xtrain, xtest, "GLM_Spatial")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7776518 0.7585973 0.7950688 0.6287227 0.2219315
## 2 train_best_Acc 0.7776518 0.8213785 0.9514310 0.3582616 0.5000000
## 3 test_best_SS 0.7794517 0.7588308 0.7985107 0.6237575 0.2227850
## 4 test_best_Acc 0.7794517 0.8183049 0.9525478 0.3613320 0.5000000
## 5 train_cv 0.7729955 NA 0.9503155 0.3547272 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, glm.spatial.pred$table)
In this chapter, we will compare results of linear models on test data and train data with cross-validation. We will compare metrics in different combinations: AUC and Accuracy; AUC + Accuracy + Sensitivity + Specificity; AUC.
# Comparison on test data using maximum AUC and Accuracy combination
compare.table %>%
filter(substring(Dataset, 1, 5) != "train") %>%
mutate(sum_score = AUC + Accuracy) %>%
top_n(6, sum_score) %>%
arrange(desc(sum_score))
## Model Dataset AUC Accuracy Sensitivity
## 1 GLM_Spatial test_best_Acc 0.7794517 0.8183049 0.9525478
## 2 GLM_BoxCox test_best_Acc 0.7735816 0.8201106 0.9472916
## 3 GLM_Logit_Full test_best_Acc 0.7720944 0.8206749 0.9557600
## 4 GLM_boxcox.ridge test_best_Acc 0.7747679 0.8174021 0.9413053
## 5 GLM_Logit_Restricted test_best_Acc 0.7710716 0.8198849 0.9547379
## 6 GLM_Ridge test_best_Acc 0.7726450 0.8168378 0.9410133
## Specificity Threshold sum_score
## 1 0.3613320 0.50 1.597757
## 2 0.3871769 0.45 1.593692
## 3 0.3608350 0.50 1.592769
## 4 0.3956262 0.40 1.592170
## 5 0.3608350 0.50 1.590956
## 6 0.3941352 0.40 1.589483
# Comparison on test data using maximum AUC,
# Sensitivity, Specificity and Accuracy combination
compare.table %>%
filter(substring(Dataset, 1, 5) != "train") %>%
mutate(sum_score = AUC + Accuracy + Sensitivity + Specificity) %>%
top_n(6, sum_score) %>%
arrange(desc(sum_score))
## Model Dataset AUC Accuracy Sensitivity
## 1 GLM_Ridge test_best_SS 0.7726450 0.7826430 0.8439188
## 2 GLM_boxcox.ridge test_best_SS 0.7747679 0.7748561 0.8291721
## 3 GLM_Spatial test_best_SS 0.7794517 0.7588308 0.7985107
## 4 GLM_Logit_Restricted test_best_SS 0.7710716 0.7677463 0.8180756
## 5 GLM_BoxCox test_best_SS 0.7735816 0.7675206 0.8186597
## 6 GLM_Logit_Full test_best_SS 0.7720944 0.7641350 0.8112133
## Specificity Threshold sum_score
## 1 0.5740557 0.2275486 2.973263
## 2 0.5899602 0.2184333 2.968756
## 3 0.6237575 0.2227850 2.960551
## 4 0.5964215 0.2140302 2.953315
## 5 0.5934394 0.2161305 2.953201
## 6 0.6038767 0.2088148 2.951319
# Comparison of cross-validation on train data
compare.table %>%
filter(substring(Dataset, 1, 8) == "train_cv") %>%
arrange(desc(AUC))
## Model Dataset AUC Accuracy Sensitivity Specificity
## 1 GLM_Spatial train_cv 0.7729955 NA 0.9503155 0.3547272
## 2 GLM_boxcox.ridge train_cv 0.7721462 NA 0.9594224 0.3042079
## 3 GLM_Ridge train_cv 0.7700446 NA 0.9602277 0.3006773
## 4 GLM_BoxCox train_cv 0.7696977 NA 0.9523602 0.3525216
## Threshold
## 1 NA
## 2 NA
## 3 NA
## 4 NA
Finally, our champion is a linear model with spatial sign kernel.
set.seed(123)
fit.cart <- train(default.payment.next.month ~ .,
data = xtrain,
method = "rpart", trControl = control)
fit.cart
## CART
##
## 20675 samples
## 23 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.004301787 0.6819729 0.9567596 0.3187696
## 0.010589014 0.6564273 0.9605991 0.2973563
## 0.146481359 0.5557021 0.9817224 0.1296817
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.004301787.
# Plot the decision tree
rpart.plot(fit.cart$finalModel)
# Make predictions on train and test data
# using Decision Tree
fit.cart.pred <-
ModelPredictEvaluateCompareTable(fit.cart, xtrain, xtest, "CART")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.6770878 0.8038210 0.9073845 0.4350320 0.2751099
## 2 train_best_Acc 0.6770878 0.8175091 0.9605997 0.3079638 0.4500000
## 3 test_best_SS 0.6816073 0.8049882 0.9126880 0.4383698 0.2751099
## 4 test_best_Acc 0.6816073 0.8169507 0.9653964 0.3116302 0.4500000
## 5 train_cv 0.6819729 NA 0.9567596 0.3187696 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, fit.cart.pred$table)
Decision tree gives us ability to predict default with 82% accuracy using only 3 predictors: PAY_0X2, PAY_2X2, PAY_0X3.
Please read more about interpreting in the Conclusion chapter
set.seed(123)
# Train Random Forest using more fast algorithm "ranger"
fit.rf <- train(default.payment.next.month ~ .,
data = xtrain,
method = "ranger", trControl = control)
fit.rf
## Random Forest
##
## 20675 samples
## 23 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7718726 0.9866186 0.1275063
## 42 0.7650784 0.9443689 0.3677446
## 82 0.7623392 0.9422001 0.3675249
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Make predictions on train and test data
# using Random Forest classification algorithm
fit.rf.pred <-
ModelPredictEvaluateCompareTable(fit.rf, xtrain,
xtest, "RF")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7987047 0.7651753 0.7894932 0.6785793 0.2233039
## 2 train_best_Acc 0.7987047 0.8207497 0.9469706 0.3712773 0.3500000
## 3 test_best_SS 0.7757078 0.7555581 0.7856621 0.6530815 0.2239535
## 4 test_best_Acc 0.7757078 0.8060038 0.9427654 0.3404573 0.3500000
## 5 train_cv 0.7718726 NA 0.9866186 0.1275063 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, fit.rf.pred$table)
set.seed(123)
fit.nnet <- train(default.payment.next.month ~ .,
data = xtrain,
method = "nnet", trace = FALSE,
tuneGrid = expand.grid(.decay = c(0, .05, .2), .size = 4:9),
trControl = control)
fit.nnet
## Neural Network
##
## 20675 samples
## 23 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ...
## Resampling results across tuning parameters:
##
## decay size ROC Sens Spec
## 0.00 4 0.5696450 1.0000000 0.0002207506
## 0.00 5 0.5675889 0.9999380 0.0000000000
## 0.00 6 0.5676826 1.0000000 0.0000000000
## 0.00 7 0.5535240 0.9999380 0.0000000000
## 0.00 8 0.5754733 0.9998761 0.0000000000
## 0.00 9 0.5772597 0.9998761 0.0000000000
## 0.05 4 0.5606549 1.0000000 0.0002207506
## 0.05 5 0.5751728 0.9999380 0.0000000000
## 0.05 6 0.5830525 0.9995663 0.0011037528
## 0.05 7 0.5798257 0.9998141 0.0000000000
## 0.05 8 0.5735729 0.9999380 0.0002207506
## 0.05 9 0.5793253 0.9999380 0.0002207506
## 0.20 4 0.5844926 0.9999380 0.0000000000
## 0.20 5 0.5821064 0.9999381 0.0000000000
## 0.20 6 0.5836424 1.0000000 0.0000000000
## 0.20 7 0.5999705 0.9999380 0.0000000000
## 0.20 8 0.5893835 0.9999380 0.0000000000
## 0.20 9 0.5995681 1.0000000 0.0000000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.2.
# Make predictions on train and test data for
# Neural Network
fit.nnet.pred <-
ModelPredictEvaluateCompareTable(fit.nnet, xtrain, xtest, "NNET")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.5784914 0.6986699 0.7990336 0.3412751 0.2599924
## 2 train_best_Acc 0.5784914 0.7001693 0.8018833 0.3379660 0.3000000
## 3 test_best_SS 0.5821877 0.5949667 0.6148343 0.5273360 0.1898918
## 4 test_best_Acc 0.5821877 0.6963097 0.8018689 0.3369781 0.3000000
## 5 train_cv 0.5999705 NA 0.9999380 0.0000000 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, fit.nnet.pred$table)
# Load nnet visualise module
source("nnet_plot_update.r")
# Visualise neural network
plot.nnet(fit.nnet)
The Neural Network shows very poor results.
set.seed(123)
# Fit Neural Network based on PCA transformed data
fit.pcaNNet <- train(default.payment.next.month ~ .,
data = xtrain, trace = FALSE,
method = "pcaNNet",
trControl = control)
fit.pcaNNet
## Neural Networks with Feature Extraction
##
## 20675 samples
## 23 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18608, 18608, 18607, 18607, 18608, 18607, ...
## Resampling results across tuning parameters:
##
## size decay ROC Sens Spec
## 1 0e+00 0.5248100 0.9939319 0.04128035
## 1 1e-04 0.6507448 0.9879220 0.07660044
## 1 1e-01 0.7611355 0.9396598 0.39178750
## 3 0e+00 0.5802201 0.9837051 0.11285167
## 3 1e-04 0.6228223 0.9863097 0.10236505
## 3 1e-01 0.7688423 0.9468465 0.35692933
## 5 0e+00 0.5000000 1.0000000 0.00000000
## 5 1e-04 0.6297742 0.9776332 0.14266418
## 5 1e-01 0.7732246 0.9470945 0.36398557
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
# Make predictions on train and test data for
# Neural Network based on PCA transformed data
fit.pcaNNet.pred <-
ModelPredictEvaluateCompareTable(fit.pcaNNet, xtrain, xtest, "PCA_NNET")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7909592 0.7794921 0.8225746 0.6260754 0.2352238
## 2 train_best_Acc 0.7909592 0.8266022 0.9454838 0.4032649 0.5000000
## 3 test_best_SS 0.7757733 0.7578151 0.7964666 0.6262425 0.2201007
## 4 test_best_Acc 0.7757733 0.8189821 0.9455395 0.3881710 0.5000000
## 5 train_cv 0.7732246 NA 0.9470945 0.3639856 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, fit.pcaNNet.pred$table)
# Visualize tuning steps
plot(fit.pcaNNet)
Neural Network, based on PCA transformed data results, are much better than results of regular nnet. Let’s try to improve results of pcaNNet. There is a clue in the visualisation of the tuning model above that we can try to increase decay for size = 5.
set.seed(123)
# Tuning of Neural Network based on PCA transformed data
fit.pcaNNet.tuned <-
(train(default.payment.next.month ~ .,
data = xtrain,
method = "pcaNNet", trace = FALSE,
tuneGrid = expand.grid(.decay = c(0, .05, .2), .size = fit.pcaNNet$bestTune$size),
trControl = control))
# Make predictions on train and test data for
# tuned Neural Network based on PCA transformed data
fit.pcaNNet.tuned.pred <-
ModelPredictEvaluateCompareTable(fit.pcaNNet.tuned, xtrain,
xtest, "PCA_NNET_tuned")
## Dataset AUC Accuracy Sensitivity Specificity Threshold
## 1 train_best_SS 0.7919578 0.7776058 0.8180523 0.6335760 0.2373869
## 2 train_best_Acc 0.7919578 0.8276663 0.9526081 0.3827487 0.5000000
## 3 test_best_SS 0.7724154 0.7818531 0.8405607 0.5820080 0.2612996
## 4 test_best_Acc 0.7724154 0.8179664 0.9512338 0.3643141 0.5000000
## 5 train_cv 0.7728598 NA 0.9474038 0.3578172 NA
# Collect results for final comparison
compare.table <- rbind(compare.table, fit.pcaNNet.tuned.pred$table)
With decay = 0.2 pcaNNet works better.
In this chapter, we will compare results of all models we fitted on test data and train data with cross-validation. We will compare metrics in different combinations: AUC and Accuracy; AUC + Accuracy + Sensitivity + Specificity; AUC. Let’s compare efficiency for the train data too.
caret.models <- list(GLM_Spatial = glm.spatial,
GLM_boxcox.ridge = glm.boxcox.ridge,
GLM_BoxCox = glm.boxcox,
GLM_Ridge = glmnet.ridge,
PCA_NNET = fit.nnet,
PCA_NNET_tuned = fit.pcaNNet.tuned,
CART = fit.cart,
RF = fit.rf)
# Models' collection resampling
results <- resamples(caret.models)
# Models' difference summary
summary(results, Metric = "ROC")
##
## Call:
## summary.resamples(object = results, Metric = "ROC")
##
## Models: GLM_Spatial, GLM_boxcox.ridge, GLM_BoxCox, GLM_Ridge, PCA_NNET, PCA_NNET_tuned, CART, RF
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GLM_Spatial 0.7473 0.7594 0.7740 0.7730 0.7873 0.7931 0
## GLM_boxcox.ridge 0.7454 0.7609 0.7724 0.7721 0.7855 0.7988 0
## GLM_BoxCox 0.7464 0.7553 0.7715 0.7697 0.7843 0.7914 0
## GLM_Ridge 0.7435 0.7582 0.7700 0.7700 0.7837 0.7965 0
## PCA_NNET 0.5764 0.5931 0.6028 0.6000 0.6086 0.6195 0
## PCA_NNET_tuned 0.7515 0.7597 0.7750 0.7729 0.7824 0.7967 0
## CART 0.6500 0.6749 0.6820 0.6820 0.6856 0.7159 0
## RF 0.7495 0.7607 0.7705 0.7719 0.7818 0.8028 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GLM_Spatial 0.9380 0.9473 0.9523 0.9503 0.9543 0.9579 0
## GLM_boxcox.ridge 0.9455 0.9587 0.9603 0.9594 0.9621 0.9647 0
## GLM_BoxCox 0.9436 0.9504 0.9514 0.9524 0.9554 0.9610 0
## GLM_Ridge 0.9473 0.9591 0.9607 0.9602 0.9636 0.9659 0
## PCA_NNET 0.9994 1.0000 1.0000 0.9999 1.0000 1.0000 0
## PCA_NNET_tuned 0.9374 0.9481 0.9486 0.9474 0.9490 0.9536 0
## CART 0.9443 0.9517 0.9591 0.9568 0.9621 0.9634 0
## RF 0.9827 0.9847 0.9876 0.9866 0.9881 0.9913 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GLM_Spatial 0.30240 0.3245 0.3653 0.3547 0.3795 0.3921 0
## GLM_boxcox.ridge 0.26490 0.2804 0.3113 0.3042 0.3262 0.3348 0
## GLM_BoxCox 0.28480 0.3234 0.3642 0.3525 0.3800 0.3929 0
## GLM_Ridge 0.27150 0.2788 0.3046 0.3007 0.3212 0.3304 0
## PCA_NNET 0.00000 0.0000 0.0000 0.0000 0.0000 0.0000 0
## PCA_NNET_tuned 0.30460 0.3339 0.3653 0.3578 0.3763 0.3929 0
## CART 0.25610 0.2882 0.3201 0.3188 0.3453 0.3753 0
## RF 0.09692 0.1104 0.1159 0.1275 0.1467 0.1700 0
# Summarize p-values for pair-wise comparisons
diffs <- diff(results)
summary(diffs)
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge
## GLM_Spatial 0.0008493 0.0032977 0.0029509
## GLM_boxcox.ridge 1.000000 0.0024484 0.0021016
## GLM_BoxCox 0.797694 1.000000 -0.0003468
## GLM_Ridge 1.000000 0.007495 1.000000
## PCA_NNET 9.320e-09 4.349e-09 5.936e-09 3.739e-09
## PCA_NNET_tuned 1.000000 1.000000 1.000000 1.000000
## CART 1.506e-06 1.620e-06 2.656e-06 2.465e-06
## RF 1.000000 1.000000 1.000000 1.000000
## PCA_NNET PCA_NNET_tuned CART RF
## GLM_Spatial 0.1730250 0.0001357 0.0910225 0.0011229
## GLM_boxcox.ridge 0.1721757 -0.0007136 0.0901732 0.0002736
## GLM_BoxCox 0.1697272 -0.0031621 0.0877248 -0.0021749
## GLM_Ridge 0.1700741 -0.0028152 0.0880717 -0.0018280
## PCA_NNET -0.1728893 -0.0820024 -0.1719021
## PCA_NNET_tuned 1.675e-09 0.0908869 0.0009872
## CART 6.185e-05 2.177e-06 -0.0898997
## RF 1.641e-09 1.000000 1.996e-06
##
## Sens
## GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge
## GLM_Spatial -0.0091068 -0.0020447 -0.0099122
## GLM_boxcox.ridge 5.734e-05 0.0070621 -0.0008054
## GLM_BoxCox 1.0000000 0.0009657 -0.0078675
## GLM_Ridge 1.550e-05 0.4937066 0.0003742
## PCA_NNET 2.698e-08 5.435e-08 5.507e-09 5.115e-08
## PCA_NNET_tuned 0.6688182 2.490e-06 0.0560344 1.780e-06
## CART 0.3328771 1.0000000 1.0000000 1.0000000
## RF 5.595e-08 3.205e-07 1.142e-08 4.924e-07
## PCA_NNET PCA_NNET_tuned CART RF
## GLM_Spatial -0.0496225 0.0029117 -0.0064441 -0.0363030
## GLM_boxcox.ridge -0.0405157 0.0120185 0.0026627 -0.0271962
## GLM_BoxCox -0.0475778 0.0049564 -0.0043994 -0.0342583
## GLM_Ridge -0.0397103 0.0128239 0.0034681 -0.0263908
## PCA_NNET 0.0525342 0.0431784 0.0133195
## PCA_NNET_tuned 1.173e-09 -0.0093558 -0.0392147
## CART 2.183e-07 0.1062087 -0.0298589
## RF 3.447e-06 8.636e-09 3.405e-06
##
## Spec
## GLM_Spatial GLM_boxcox.ridge GLM_BoxCox GLM_Ridge
## GLM_Spatial 0.050519 0.002206 0.054050
## GLM_boxcox.ridge 1.302e-06 -0.048314 0.003531
## GLM_BoxCox 1.0000000 2.336e-05 0.051844
## GLM_Ridge 1.020e-05 1.0000000 0.0001541
## PCA_NNET 2.543e-09 1.483e-09 6.787e-09 4.970e-10
## PCA_NNET_tuned 1.0000000 1.558e-06 1.0000000 2.262e-05
## CART 0.0679122 1.0000000 0.0077162 1.0000000
## RF 5.976e-09 2.620e-09 6.593e-09 4.769e-10
## PCA_NNET PCA_NNET_tuned CART RF
## GLM_Spatial 0.354727 -0.003090 0.035958 0.227221
## GLM_boxcox.ridge 0.304208 -0.053609 -0.014562 0.176702
## GLM_BoxCox 0.352522 -0.005296 0.033752 0.225015
## GLM_Ridge 0.300677 -0.057140 -0.018092 0.173171
## PCA_NNET -0.357817 -0.318770 -0.127506
## PCA_NNET_tuned 1.005e-09 0.039048 0.230311
## CART 3.244e-08 0.0384245 0.191263
## RF 1.485e-06 8.220e-09 1.800e-08
Results show us that there is no statistically significant difference in AUC between most of the models.
Let’s compare results on the test data using AUC:
# Comparison on test data using maximum AUC
compare.table %>%
filter(substring(Dataset, 1, 5) != "train") %>%
filter(substring(Dataset, 11, 13) != "SS") %>%
top_n(6, AUC) %>%
arrange(desc(AUC))
## Model Dataset AUC Accuracy Sensitivity
## 1 GLM_Spatial test_best_Acc 0.7794517 0.8183049 0.9525478
## 2 PCA_NNET test_best_Acc 0.7757733 0.8189821 0.9455395
## 3 RF test_best_Acc 0.7757078 0.8060038 0.9427654
## 4 GLM_boxcox.ridge test_best_Acc 0.7747679 0.8174021 0.9413053
## 5 GLM_BoxCox test_best_Acc 0.7735816 0.8201106 0.9472916
## 6 GLM_Ridge test_best_Acc 0.7726450 0.8168378 0.9410133
## Specificity Threshold
## 1 0.3613320 0.50
## 2 0.3881710 0.50
## 3 0.3404573 0.35
## 4 0.3956262 0.40
## 5 0.3871769 0.45
## 6 0.3941352 0.40
The first 6 models have very similar results and we will check statistically significant differences between them:
set.seed(123)
# Confident intervals for AUC
ci.auc(fit.rf.pred$test$roc); ci.auc(glm.spatial.pred$test$roc); ci.auc(fit.pcaNNet.pred$test$roc)
## 95% CI: 0.7637-0.7877 (DeLong)
## 95% CI: 0.7675-0.7914 (DeLong)
## 95% CI: 0.7636-0.7879 (DeLong)
# Test for statistically significant difference between AUC on test data
# GLM_Spatial vs RF
roc.test(glm.spatial.pred$test$roc, fit.rf.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.spatial.pred$test$roc and fit.rf.pred$test$roc
## Z = 1.1926, p-value = 0.233
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7794517 0.7757078
# GLM_Spatial vs PCA_NNET
roc.test(glm.spatial.pred$test$roc, fit.pcaNNet.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.spatial.pred$test$roc and fit.pcaNNet.pred$test$roc
## Z = 1.2673, p-value = 0.205
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7794517 0.7757733
# GLM_Spatial vs GLM_boxcox.ridge
roc.test(glm.spatial.pred$test$roc, glm.boxcox.ridge.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.spatial.pred$test$roc and glm.boxcox.ridge.pred$test$roc
## Z = 2.2599, p-value = 0.02383
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7794517 0.7747679
# GLM_Spatial vs GLM_BoxCox
roc.test(glm.spatial.pred$test$roc, glm.boxcox.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.spatial.pred$test$roc and glm.boxcox.pred$test$roc
## Z = 3.462, p-value = 0.0005361
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7794517 0.7735816
# GLM_Spatial vs GLM_Logit_Full
roc.test(glm.spatial.pred$test$roc, glm.s.pred$test$roc)
##
## DeLong's test for two correlated ROC curves
##
## data: glm.spatial.pred$test$roc and glm.s.pred$test$roc
## Z = 3.6682, p-value = 0.0002443
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7794517 0.7720944
There is no statistically significant difference in AUC metric between GLM_Spatial, RF and PCA_NNET models.
Let’s try to compare models by combination of other parameters:
# Comparison on test data using maximum AUC,
# Sensitivity, Specificity and Accuracy combination
compare.table %>%
filter(substring(Dataset, 1, 5) != "train") %>%
mutate(sum_score = AUC + Accuracy + Sensitivity + Specificity) %>%
top_n(6, sum_score) %>%
arrange(desc(sum_score))
## Model Dataset AUC Accuracy Sensitivity
## 1 PCA_NNET_tuned test_best_SS 0.7724154 0.7818531 0.8405607
## 2 GLM_Ridge test_best_SS 0.7726450 0.7826430 0.8439188
## 3 RF test_best_SS 0.7757078 0.7555581 0.7856621
## 4 GLM_boxcox.ridge test_best_SS 0.7747679 0.7748561 0.8291721
## 5 GLM_Spatial test_best_SS 0.7794517 0.7588308 0.7985107
## 6 PCA_NNET test_best_SS 0.7757733 0.7578151 0.7964666
## Specificity Threshold sum_score
## 1 0.5820080 0.2612996 2.976837
## 2 0.5740557 0.2275486 2.973263
## 3 0.6530815 0.2239535 2.970010
## 4 0.5899602 0.2184333 2.968756
## 5 0.6237575 0.2227850 2.960551
## 6 0.6262425 0.2201007 2.956298
I would choose the linear model with spatial sign kernel model or the Random Forest model because they are still the top choices by combination of other parameters and show stable results for both train and test data. The combination of AUC, Accuracy, Sensitivity and Specificity for Random Forest is a little bit better than for the linear model with spatial sign kernel and ridge regression with BoxCox pre-processing.
Confusion matrices (gives the best combination of Accuracy and Sensitivity and Specificity) for The linear model with spatial sign kernel
# Confusion matrix for GLM_Spatial model for best Accuracy
glm.spatial.pred$test$confusion.matrix.best.Acc
## Prediction
## Reference X0 X1
## X0 6524 325
## X1 1285 727
# Confusion matrix for GLM_Spatial model
# for best Sensitivity and Specificity combination
glm.spatial.pred$test$confusion.matrix.best.SS
## Prediction
## Reference X0 X1
## X0 5469 1380
## X1 757 1255
Confusion matrices (gives the best combination of Accuracy and Sensitivity and Specificity) for Random Forest
# Confusion matrix for RF model for best Accuracy
fit.rf.pred$test$confusion.matrix.best.Acc
## Prediction
## Reference X0 X1
## X0 6457 392
## X1 1327 685
# Confusion matrix for RF model
# for best Sensitivity and Specificity combination
fit.rf.pred$test$confusion.matrix.best.SS
## Prediction
## Reference X0 X1
## X0 5381 1468
## X1 698 1314
Confusion matrices (best Accuracy and best Sensitivity and Specificity combination) for the ridge regression with BoxCox pre-processing
# Confusion matrix for RF model for best Accuracy
glm.boxcox.ridge.pred$test$confusion.matrix.best.Acc
## Prediction
## Reference X0 X1
## X0 6447 402
## X1 1216 796
# Confusion matrix for RF model
# for best Sensitivity and Specificity combination
glm.boxcox.ridge.pred$test$confusion.matrix.best.SS
## Prediction
## Reference X0 X1
## X0 5679 1170
## X1 825 1187
From the models we explored, I would choose the Linear model with Spatial Sign kernel with AUC = 0.7794517
and Accuracy = 0.8183049
or Random Forest with AUC = 0.7757078
and Accuracy = 0.8060038
.
In this chapter, we will update the summary we wrote after Exploratory analysis using interpretable models. We can use Decision Tree and Significance Levels research for restricted, unrestricted and Ridge regression models. First of all, let’s order predictors by their approximate importance:
index > 15000
we observed here.
Social Status Predictors (SEX, EDUCATION, MARRIAGE)
We have some undocumented value
0
forEDUCATION
andMARRIGE
. I suggest this asNA
. Also, let’s rename levels SEX toMale
andFemale