Jorge Marcos Martos
library(dplyr)
library(tidyverse)
library(ggplot2)
library(fastDummies)
library(missForest)
library(corrplot)
library(glmnet)
library(caret)
library(lattice)
library(e1071)
library(MASS)
In this project we will build an automatic credit card approval predictor using machine learning techniques.
For that purpose we will use the data set Credit Screening from the UCI Machine Learning Repository.
All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. Nonetheless, we will try to infer what these variables might represent.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
crx <- read.csv(url, sep = ",", header = FALSE)
After checking the data set, we can see that missing values are encoded as ‘?’, so we need to replace them with NA.
crx[crx == "?"] <- NA
# Let's see how many missing values we have
sapply(crx, function(x) sum(is.na(x))); sum(sapply(crx, function(x) sum(is.na(x))))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 12 12 0 6 6 9 9 0 0 0 0 0 0 13 0 0
## [1] 67
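As an alternative to replacing ‘?’ after loading, the missing-value marker can be declared at read time. The following is a minimal sketch on a made-up three-row CSV (the `toy_csv` string is an assumption for illustration, not the real file); it relies only on the `na.strings` argument of `read.csv`.

```r
# Toy CSV in the same shape as crx.data: no header, '?' marks a missing value.
toy_csv <- "b,30.83,0,u\na,?,4.46,u\n?,24.50,0.5,y"

# na.strings turns '?' into NA while parsing, so numeric columns stay numeric
# and factor columns never gain a spurious "?" level.
toy <- read.csv(text = toy_csv, header = FALSE, na.strings = "?",
                stringsAsFactors = TRUE)

# Missing values per column, and in total
colSums(is.na(toy))
sum(is.na(toy))
```

With this approach the later `type.convert()` call would be unnecessary for the numeric columns, since they are never read as text in the first place.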
We have a total of 67 missing values. We will impute them using the function missForest() from the missForest library. This function starts from a simple initial guess (the mean for numerical variables and the mode for categorical variables) and then iteratively refines the imputed values with random forests fitted on the observed data.
crx <- type.convert(crx, as.is=FALSE) # This code converts categorical variables into
# factors, so the missForest can use it.
crx.i <- missForest(as.data.frame(crx))
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
crx <- crx.i$ximp
# Double-checking that the missing values have been successfully imputed:
sapply(crx, function(x) sum(is.na(x)))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
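To make the role of the mean and the mode more concrete: to my understanding, missForest uses them only as a starting guess and then refines the values iteratively with random forests. The base R sketch below shows just that starting fill on a toy data frame. The helper `initial_guess()` is hypothetical and is not part of the missForest package.

```r
# Hypothetical helper: fill NAs with the column mean (numeric) or mode (factor).
# This is only the kind of starting guess described above; the random forest
# refinement that missForest performs afterwards is not shown here.
initial_guess <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy <- data.frame(
  Age     = c(30, NA, 20, 40),
  Married = factor(c("u", "u", NA, "y"))
)
filled <- initial_guess(toy)
filled
```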
After looking for information about which variables are normally considered for card approval, I decided to rename the variables using this blog as a reference: http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html
The variables will be renamed as: “Gender”, “Age”, “Debt”, “Married”, “BankCustomer”, “EducationLevel”, “Ethnicity”, “YearsEmployed”, “PriorDefault”, “Employed”, “CreditScore”, “DriversLicense”, “Citizen”, “ZipCode”, “Income” and “ApprovalStatus”.
# Renaming variables
crx <- crx %>%
rename(Gender = V1,
Age = V2,
Debt = V3,
Married = V4,
BankCustomer = V5,
EducationLevel = V6,
Ethnicity = V7,
YearsEmployed = V8,
PriorDefault = V9,
Employed = V10,
CreditScore = V11,
DriversLicense = V12,
Citizen = V13,
ZipCode = V14,
Income = V15,
ApprovalStatus = V16)
## 'data.frame': 690 obs. of 16 variables:
## $ Gender : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
## $ Age : num 30.8 58.7 24.5 27.8 20.2 ...
## $ Debt : num 0 4.46 0.5 1.54 5.62 ...
## $ Married : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
## $ BankCustomer : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
## $ EducationLevel: Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
## $ Ethnicity : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
## $ YearsEmployed : num 1.25 3.04 1.5 3.75 1.71 ...
## $ PriorDefault : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ Employed : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
## $ CreditScore : int 1 6 0 5 0 0 0 0 0 0 ...
## $ DriversLicense: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
## $ Citizen : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
## $ ZipCode : num 202 43 280 100 120 360 164 80 180 52 ...
## $ Income : int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ ApprovalStatus: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
The ZipCode variable should be a categorical variable (a factor). Let’s see which distinct values it takes:
## [1] 202.00000 43.00000 280.00000 100.00000 120.00000 360.00000
## [7] 164.00000 80.00000 180.00000 52.00000 128.00000 260.00000
## [13] 0.00000 320.00000 396.00000 96.00000 200.00000 300.00000
## [19] 145.00000 500.00000 168.00000 434.00000 583.00000 30.00000
## [25] 240.00000 70.00000 455.00000 311.00000 216.00000 491.00000
## [31] 400.00000 239.00000 160.00000 711.00000 250.00000 520.00000
## [37] 515.00000 420.00000 224.54611 980.00000 443.00000 140.00000
## [43] 94.00000 368.00000 288.00000 928.00000 188.00000 112.00000
## [49] 171.00000 268.00000 167.00000 75.00000 152.00000 176.00000
## [55] 329.00000 212.00000 410.00000 274.00000 375.00000 408.00000
## [61] 350.00000 204.00000 40.00000 181.00000 399.00000 440.00000
## [67] 93.00000 60.00000 395.00000 393.00000 21.00000 29.00000
## [73] 102.00000 431.00000 370.00000 24.00000 20.00000 129.00000
## [79] 510.00000 195.00000 144.00000 380.00000 144.45333 49.00000
## [85] 50.00000 91.66967 381.00000 150.00000 117.00000 56.00000
## [91] 211.00000 230.00000 156.00000 22.00000 228.00000 519.00000
## [97] 253.00000 487.00000 220.00000 119.54927 88.00000 73.00000
## [103] 121.00000 470.00000 136.00000 132.00000 292.00000 154.00000
## [109] 272.00000 219.71467 340.00000 91.16295 108.00000 720.00000
## [115] 450.00000 232.00000 170.00000 1160.00000 411.00000 144.02600
## [121] 460.00000 348.00000 480.00000 640.00000 372.00000 276.00000
## [127] 221.00000 352.00000 141.00000 178.00000 600.00000 550.00000
## [133] 187.63619 2000.00000 225.00000 210.00000 110.00000 356.00000
## [139] 45.00000 62.00000 92.00000 174.00000 17.00000 86.00000
## [145] 90.22924 454.00000 214.25681 254.00000 28.00000 263.00000
## [151] 333.00000 312.00000 290.00000 371.00000 99.00000 252.00000
## [157] 760.00000 560.00000 130.00000 523.00000 680.00000 163.00000
## [163] 208.00000 383.00000 330.00000 422.00000 840.00000 432.00000
## [169] 32.00000 186.00000 303.00000 147.29800 349.00000 224.00000
## [175] 369.00000 148.43755 235.47510 76.00000 231.00000 309.00000
## [181] 416.00000 465.00000 256.00000
It returns 183 distinct values. 170 of them were present before imputation; the other 13 (the non-integer values) were produced by missForest, which treated the variable as numeric. The function would not have been able to handle it as a categorical variable anyway, since it does not accept factors with more than 53 levels.
Treating it as a factor would give us too many categories and hinder our analysis, so I decided to drop this variable.
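The code for this check is not shown in the source. As a rough sketch on a made-up vector (the values in `zip` are illustrative, not the real column), the distinct values can be counted, the imputed-looking entries spotted by their fractional part, and the column dropped like this:

```r
# Toy stand-in for crx$ZipCode: integer-like codes plus two imputed values
zip <- c(202, 43, 280, 202, 224.54611, 43, 144.45333)

length(unique(zip))                     # number of distinct values
imputed_like <- zip[zip != round(zip)]  # values with a fractional part
imputed_like

# Dropping a column from a data frame (toy data frame, not the real crx)
toy <- data.frame(ZipCode = zip, Income = seq_along(zip))
toy$ZipCode <- NULL
names(toy)
```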
Our target variable is ‘ApprovalStatus’. It is better to transform it into a binary factor, with level ‘0’ for the negative class and level ‘1’ for the positive class.
## 'data.frame': 690 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
## $ Age : num 30.8 58.7 24.5 27.8 20.2 ...
## $ Debt : num 0 4.46 0.5 1.54 5.62 ...
## $ Married : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
## $ BankCustomer : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
## $ EducationLevel: Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
## $ Ethnicity : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
## $ YearsEmployed : num 1.25 3.04 1.5 3.75 1.71 ...
## $ PriorDefault : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ Employed : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
## $ CreditScore : int 1 6 0 5 0 0 0 0 0 0 ...
## $ DriversLicense: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
## $ Citizen : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
## $ Income : int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ ApprovalStatus: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
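The recoding code itself is not shown in the source. One possible way to do it, sketched on a toy factor (the vector `status` is made up for illustration), is to map the symbols explicitly and fix the level order:

```r
# Toy target with the original '-' / '+' coding
status <- factor(c("+", "-", "+", "-", "-"))

# Map '-' to "0" and '+' to "1", fixing the level order explicitly
status01 <- factor(ifelse(status == "+", "1", "0"), levels = c("0", "1"))

# Cross-tabulate to confirm that every '+' became "1" and every '-' became "0"
table(original = status, recoded = status01)
```

Setting `levels` explicitly avoids depending on how the locale happens to sort the symbols.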
We will now compare every variable against the target variable, which takes the value 0 when the credit has not been approved and 1 when it has been granted.
Gender vs Credit Approval
ggplot(data = crx, aes(x = Gender, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Gender') + ggtitle('Gender vs Credit Approval')
Gender ‘a’ seems to have a higher proportion of approvals than gender ‘b’, but the difference between the two rates does not look large. We will test the dependency between the variables later on.
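The bar heights drawn by `geom_bar(position = "fill")` are the within-category proportions of each target level. As a small sketch on made-up data (the `toy` data frame is an assumption, not the real data set), the same rates can be read as numbers with `table()` and `prop.table()`:

```r
# Toy data: a two-level predictor and a 0/1 target
toy <- data.frame(
  Gender         = factor(c("a", "a", "a", "b", "b", "b", "b", "b")),
  ApprovalStatus = factor(c("1", "1", "0", "1", "0", "0", "1", "0"))
)

counts <- table(toy$Gender, toy$ApprovalStatus)
rates  <- prop.table(counts, margin = 1)  # margin = 1: each row sums to 1
round(rates, 2)
```

Looking at `counts` next to `rates` also shows at a glance when a rate rests on very few observations.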
Marital Status vs Credit Approval
ggplot(data = crx, aes(x = Married, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Marital Status') + ggtitle('Marital Status vs Credit Approval')
There seems to be a clear difference in the credit approval rate across marital statuses. For marital status ‘l’ the approval rate appears to be 100%, which might be caused by a sample too small to be representative. Let’s check:
## # A tibble: 3 x 2
## # Groups: Married [3]
## Married n
## <fct> <int>
## 1 l 2
## 2 u 525
## 3 y 163
Indeed, there are only two observations with marital status ‘l’, which would explain this anomaly.
Bank Customer vs Credit Approval
ggplot(data = crx, aes(x = BankCustomer, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Bank Customer') + ggtitle('Bank Customer vs Credit Approval')
There seems to be a relationship between bank customer status and the credit approval rate. We again see an anomalous 100% approval rate in one of the categories, so let’s check how many observations fall within it:
## # A tibble: 3 x 2
## # Groups: BankCustomer [3]
## BankCustomer n
## <fct> <int>
## 1 g 525
## 2 gg 2
## 3 p 163
As in the previous case, there are just two observations in that category, which would explain the anomaly.
Education Level vs Credit Approval
ggplot(data = crx, aes(x = EducationLevel, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Education Level') + ggtitle('Education Level vs Credit Approval')
The education level also seems to affect the credit approval rate: education levels ‘x’ and ‘cc’ show a high approval rate, whereas level ‘ff’ shows a lower one.
Ethnic Group vs Credit Approval
ggplot(data = crx, aes(x = Ethnicity, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Ethnic Group') + ggtitle('Ethnic Group vs Credit Approval')
The ethnic group also seems to be related to the credit approval rate: people in the ethnic groups ‘z’ and ‘h’ are less likely to be granted credit than those in other ethnic groups such as ‘ff’.
Prior Default vs Credit Approval
ggplot(data = crx, aes(x = PriorDefault, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Prior Default') + ggtitle('Prior Default vs Credit Approval')
There is a clear relationship between prior default and the credit approval rate. We assume that the bank does not easily approve credit for people who have already defaulted, so we can recode the two categories accordingly:
Employed vs Credit Approval
ggplot(data = crx, aes(x = Employed, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Employed') + ggtitle('Employed vs Credit Approval')
There seems to be a clear relationship between whether a person is employed and the credit approval rate. We can assume that category ‘f’ corresponds to those who are unemployed and category ‘t’ to those who are employed. Let’s recode the categories accordingly:
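The recoding code is not shown in the source. A minimal sketch of one way to relabel the levels, on a toy factor (the vector `employed` is made up; the labels ‘unemployed’ and ‘employed’ are the ones that appear later in the stepwise trace):

```r
# Toy factor with the original 'f' / 't' coding
employed <- factor(c("t", "t", "f", "t", "f"))

# Relabel the levels; the mapping f -> unemployed, t -> employed is the
# assumption stated in the text, not something the data set documents.
employed <- factor(employed, levels = c("f", "t"),
                   labels = c("unemployed", "employed"))
table(employed)
```

Passing `levels` and `labels` together keeps the mapping explicit, so it does not depend on the default level ordering.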
Driver’s License vs Credit Approval
ggplot(data = crx, aes(x = DriversLicense, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Drivers License') + ggtitle('Drivers License vs Credit Approval')
There doesn’t seem to be a clear relationship between the two variables.
Citizenship vs Credit Approval
ggplot(data = crx, aes(x = Citizen, fill = ApprovalStatus)) +
geom_bar(position = "fill") +
labs(y = "Rate", x = 'Citizenship') + ggtitle('Citizenship vs Credit Approval')
There seems to be a relationship between citizenship status and the credit approval rate.
In order to check whether the categorical variables are independent of the target variable, we will run a chi-square test of independence at a 5% significance level (95% confidence). The following code prints the name of each variable and the resulting p-value.
categoricVars <- crx %>% dplyr::select(Gender, Married, BankCustomer, EducationLevel,
Ethnicity, PriorDefault, Employed, DriversLicense,
Citizen)
sapply(categoricVars,
function(x) round(chisq.test(table(x, crx$ApprovalStatus))$p.value,2))
## Gender Married BankCustomer EducationLevel Ethnicity
## 0.49 0.00 0.00 0.00 0.00
## PriorDefault Employed DriversLicense Citizen
## 0.00 0.00 0.45 0.01
The p-values for Married, BankCustomer, EducationLevel, Ethnicity, PriorDefault, Employed and Citizen are below 0.05, so we reject independence for them, whereas for Gender and DriversLicense we cannot reject independence from the target variable. We will remove Gender and DriversLicense from our model.
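To see what a single chi-square test of independence computes, here is a small worked sketch on a made-up 2x2 table (the counts in `tab` are illustrative only). It prints the counts expected under independence and the p-value, and adds Fisher's exact test as a common alternative when some cells are very small, as happens with categories that hold only a couple of observations.

```r
# Toy contingency table: rows are categories of a predictor, columns the target
tab <- matrix(c(40, 10,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("g1", "g2"), Approved = c("0", "1")))

res <- chisq.test(tab)
res$expected   # counts expected if Group and Approved were independent
res$p.value    # a small p-value is evidence against independence

# With very small cells the chi-square approximation is unreliable;
# fisher.test() is an exact alternative for small tables.
fisher.test(tab)$p.value
```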
Age vs Credit Approval
cdplot(crx$ApprovalStatus ~ crx$Age, main = "Age vs Credit Approval",
xlab = "Age", ylab = "Conditional Density" )
The plot shows that older applicants have a higher chance of getting the credit approved, although beyond roughly 75 years of age the probability of approval seems to drop drastically.
Let’s see if a boxplot could provide more information:
ggplot(crx, aes(x= ApprovalStatus, y= Age, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Age", x = 'Credit Approval') + ggtitle('Age vs Credit Approval') +
scale_fill_brewer(palette = "Set2")
We can also see a probable relationship between age and credit approval: older applicants seem to have a higher chance of getting approved.
Debt vs Credit Approval
cdplot(crx$ApprovalStatus ~ crx$Debt, main = "Debt vs Credit Approval",
xlab = "Debt", ylab = "Conditional Density" )
The plot describes a relationship between debt and credit approval in which the more debt you have, the higher your chances of getting a credit, although the curve dips around 26 on the Debt axis and then goes up again.
ggplot(crx, aes(x= ApprovalStatus, y= Debt, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Debt", x = 'Credit Approval') +
ggtitle('Debt vs Credit Approval') +
scale_fill_brewer(palette = "Set2")
The box plot seems to hint at the same as the previous plot.
Years Employed vs Credit Approval
ggplot(crx, aes(x= ApprovalStatus, y= YearsEmployed, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Years Employed", x = 'Credit Approval') +
ggtitle('Years Employed vs Credit Approval') +
scale_fill_brewer(palette = "Set2")
There seems to be a positive correlation between the years employed and the credit approval.
Credit Score vs Credit Approval
ggplot(crx, aes(x= ApprovalStatus, y= CreditScore, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Credit Score", x = 'Credit Approval') +
ggtitle('Credit Score vs Credit Approval') +
scale_fill_brewer(palette = "Set2")
The plot hints at a clear positive correlation between credit score and credit approval.
Income vs Credit Approval
# This plot contains extreme outliers, so we need to zoom it in
ggplot(crx, aes(x= ApprovalStatus, y= Income, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Income", x = 'Credit Approval') +
ggtitle('Income vs Credit Approval') +
scale_fill_brewer(palette = "Set2") +
coord_cartesian(ylim=c(0, 2000)) # zoom
The graph shows what seems to be a clear positive correlation between income and credit approval.
We’ll now plot a correlation matrix in order to check whether there is collinearity between the numeric variables.
numericVars <- data.frame(crx$Age, crx$Debt, crx$YearsEmployed, crx$CreditScore, crx$Income)
corrplot(cor(numericVars), method = "number", type="upper")
The largest value is 0.4, between YearsEmployed and Age, which makes sense. This value is not large enough to cause collinearity problems, so we will include both variables in our model.
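The same check can be done numerically instead of by eye. The sketch below uses simulated data (the `toy` data frame and its coefficients are assumptions for illustration) and extracts the largest absolute pairwise correlation from the upper triangle of the matrix:

```r
set.seed(42)
n <- 200
age <- rnorm(n, mean = 35, sd = 10)
toy <- data.frame(
  Age           = age,
  YearsEmployed = 0.1 * age + rnorm(n),  # built to correlate with Age
  Income        = rnorm(n)               # unrelated to the other two
)

cm <- cor(toy)
# Largest absolute correlation among distinct pairs of variables
max_abs <- max(abs(cm[upper.tri(cm)]))
max_abs
```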
It is good practice to normalize the numeric variables if they are not normally distributed.
Let’s draw a normal Q-Q plot for each of our numeric variables and see whether they follow a normal distribution.
# Draw a normal Q-Q plot for every non-factor (numeric) column
for (columna in 1:ncol(crx)){
  if (!is.factor(crx[,columna])){
    qqnorm(crx[,columna],
           main = paste("Normality Plot: ", colnames(crx)[columna]))
    qqline(crx[,columna])
  }
}
None of them seems to follow a normal distribution, but let’s double-check using the Shapiro-Wilk test.
## crx.Age crx.Debt crx.YearsEmployed crx.CreditScore
## 0 0 0 0
## crx.Income
## 0
The p-values obtained in the Shapiro-Wilk test are all close to 0, so in every case we reject the null hypothesis of normality and conclude that none of the variables follows a normal distribution.
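The code behind these p-values is not shown in the source. The sketch below runs the Shapiro-Wilk test on simulated data (the skewed vector `x` is an assumption standing in for a variable such as Income). It also illustrates that centering and scaling with `scale()` changes location and spread but not the shape of the distribution, so the test statistic W stays the same.

```r
set.seed(123)
x <- rexp(200, rate = 1)   # simulated right-skewed variable

before <- shapiro.test(x)
after  <- shapiro.test(as.numeric(scale(x)))  # centered and scaled copy

before$p.value                        # far below 0.05: normality is rejected
c(before$statistic, after$statistic)  # W is unchanged by centering and scaling

# Applying the test to every column of a data frame, rounded to two decimals
toy <- data.frame(skewed = x, normalish = rnorm(200))
round(sapply(toy, function(v) shapiro.test(v)$p.value), 2)
```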
We need to normalize (standardize) all numeric variables.
There is no collinearity between the numeric variables.
The categorical variables Gender and DriversLicense don’t seem to influence the target variable; the rest do, to different degrees.
The categories ‘l’ and ‘gg’ of the variables ‘Married’ and ‘BankCustomer’, respectively, have only two observations each, and credit was granted in all of those cases. Since both variables are supposed to be binary, these two categories may have been recorded by mistake. We should remove them from our model.
Before we split the data set into training and test sets, we need to normalize our numeric variables. Also, since we will use a binomial regression model, it is good practice to one-hot encode our categorical variables.
Let’s first normalize the numeric variables:
crx$Age <- scale(crx$Age)
crx$Debt <- scale(crx$Debt)
crx$YearsEmployed <- scale(crx$YearsEmployed)
crx$CreditScore <- scale(crx$CreditScore)
crx$Income <- scale(crx$Income)
Now, let’s remove the Gender and DriversLicense variables from our model:
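The removal code is not shown in the source. One simple way to drop columns by name, sketched on a toy data frame (the `toy` object and the `drop_cols` vector are illustrative names):

```r
# Toy data frame with the two columns to drop plus one to keep
toy <- data.frame(Gender = c("a", "b"),
                  DriversLicense = c("t", "f"),
                  Age = c(30, 40))

drop_cols <- c("Gender", "DriversLicense")
# drop = FALSE keeps the result a data frame even if one column remains
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy)
```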
We can now one-hot encode the categorical variables, creating a dummy variable for each category.
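The encoding code is not shown in the source, which loads fastDummies and presumably uses it for this step. To illustrate the idea without that package, here is a base R sketch. The helper `one_hot()` is hypothetical, and the `<variable>_<level>` naming is chosen to match the column names seen later in this document.

```r
# Hypothetical helper: one 0/1 column per factor level, named <prefix>_<level>
one_hot <- function(f, prefix) {
  stopifnot(is.factor(f))
  m <- sapply(levels(f), function(lv) as.integer(f == lv))
  colnames(m) <- paste(prefix, levels(f), sep = "_")
  as.data.frame(m)
}

married <- factor(c("u", "y", "u", "l"))
dummies <- one_hot(married, "Married")
dummies
rowSums(dummies)  # always 1: the dummies of one variable are linearly dependent
```

Because the dummies of a variable always sum to one, the two dummies of a two-level variable carry the same information. This is consistent with pairs such as PriorDefault_No and PriorDefault_Yes showing identical deviance in the stepwise trace further down.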
Our target variable is now split into two dummy variables; we should remove them and add the original variable back. It is also a good moment to drop the two categories (now dummy variables) that we considered removing before: ‘Married_l’ and ‘BankCustomer_gg’.
df$ApprovalStatus_0 <- NULL
df$ApprovalStatus_1 <- NULL
df$Married_l <- NULL
df$BankCustomer_gg <- NULL
df$ApprovalStatus <- crx$ApprovalStatus
Now, let’s use a binomial regression model to decide which variables to keep. We will fit the model to the data and then perform stepwise selection in both the forward and backward directions. To do so, we first need two fits: the model of the target variable against all the predictors (our last step) and the model of the target variable against the intercept only (our first step).
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Now that we have fitted the first and last steps, we will run stepwise selection in both directions (from first to last and from last to first) by setting the parameter ‘direction’ to ‘both’. We will use the stepAIC function, which takes the Akaike Information Criterion (AIC) as its selection metric: the lower the AIC, the better the model.
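The fitting and selection code is not shown in the source. As a hedged sketch of the procedure just described, the example below uses simulated data rather than the credit data: the `toy` data frame, its coefficients and the object names `first_step`, `last_step` and `selected` are all assumptions. It fits the intercept-only and full binomial models and lets `MASS::stepAIC` search between them in both directions. It also checks that, for a 0/1 response, the AIC equals the residual deviance plus twice the number of coefficients, which matches the numbers in the trace (for example 948.16 + 2 = 950.16 for the intercept-only model).

```r
library(MASS)  # provides stepAIC

set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# Simulated 0/1 target that depends on x1 and x2 only
toy$y <- factor(rbinom(n, 1, plogis(1.5 * toy$x1 - 1.0 * toy$x2)))

first_step <- glm(y ~ 1, data = toy, family = binomial)  # intercept only
last_step  <- glm(y ~ ., data = toy, family = binomial)  # all predictors

# Start from the intercept-only model and allow terms to be added or dropped
selected <- stepAIC(first_step,
                    scope = list(lower = formula(first_step),
                                 upper = formula(last_step)),
                    direction = "both", trace = FALSE)
formula(selected)

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients)
c(AIC(first_step), deviance(first_step) + 2 * length(coef(first_step)))
```

The sketch sets `trace = FALSE`, so it prints no step-by-step output. The trace that follows in this document comes from the original credit data, not from this toy example.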
## Start: AIC=950.16
## ApprovalStatus ~ 1
##
## Df Deviance AIC
## + PriorDefault_No 1 540.95 544.95
## + PriorDefault_Yes 1 540.95 544.95
## + CreditScore 1 762.74 766.74
## + Employed_employed 1 798.66 802.66
## + Employed_unemployed 1 798.66 802.66
## + Income 1 862.80 866.80
## + YearsEmployed 1 863.32 867.32
## + Debt 1 918.36 922.36
## + EducationLevel_x 1 920.99 924.99
## + Married_y 1 922.64 926.64
## + BankCustomer_p 1 922.64 926.64
## + Ethnicity_ff 1 924.15 928.15
## + Ethnicity_h 1 924.16 928.16
## + EducationLevel_ff 1 924.72 928.72
## + Married_u 1 924.92 928.92
## + BankCustomer_g 1 924.92 928.92
## + Age 1 929.91 933.91
## + EducationLevel_q 1 932.62 936.62
## + EducationLevel_cc 1 935.90 939.90
## + EducationLevel_i 1 936.81 940.81
## + Citizen_s 1 939.43 943.43
## + EducationLevel_k 1 941.39 945.39
## + EducationLevel_d 1 942.09 946.09
## + Citizen_g 1 942.51 946.51
## + EducationLevel_aa 1 946.06 950.06
## <none> 948.16 950.16
## + Ethnicity_v 1 946.22 950.22
## + Ethnicity_z 1 946.34 950.34
## + EducationLevel_w 1 946.74 950.74
## + Citizen_p 1 947.10 951.10
## + EducationLevel_e 1 947.52 951.52
## + EducationLevel_r 1 947.55 951.55
## + Ethnicity_dd 1 947.68 951.68
## + EducationLevel_j 1 947.85 951.85
## + EducationLevel_m 1 947.95 951.95
## + Ethnicity_bb 1 948.06 952.06
## + Ethnicity_n 1 948.11 952.11
## + EducationLevel_c 1 948.11 952.11
## + Ethnicity_o 1 948.13 952.13
## + Ethnicity_j 1 948.16 952.16
##
## Step: AIC=544.95
## ApprovalStatus ~ PriorDefault_No
##
## Df Deviance AIC
## + CreditScore 1 503.21 509.21
## + Income 1 505.09 511.09
## + Employed_unemployed 1 507.67 513.67
## + Employed_employed 1 507.67 513.67
## + Citizen_p 1 523.48 529.48
## + Married_y 1 528.70 534.70
## + BankCustomer_p 1 528.70 534.70
## + EducationLevel_x 1 531.38 537.38
## + Married_u 1 531.98 537.98
## + BankCustomer_g 1 531.98 537.98
## + YearsEmployed 1 532.97 538.97
## + EducationLevel_aa 1 534.80 540.80
## + EducationLevel_cc 1 534.81 540.81
## + EducationLevel_ff 1 536.36 542.36
## + Ethnicity_ff 1 536.69 542.69
## + Ethnicity_h 1 537.32 543.32
## + EducationLevel_k 1 537.95 543.95
## + Citizen_s 1 537.97 543.97
## + Ethnicity_o 1 538.22 544.22
## + EducationLevel_i 1 538.60 544.60
## + Ethnicity_n 1 538.72 544.72
## <none> 540.95 544.95
## + EducationLevel_d 1 539.15 545.15
## + EducationLevel_q 1 539.20 545.20
## + Ethnicity_j 1 539.35 545.35
## + Ethnicity_bb 1 539.36 545.36
## + Debt 1 539.69 545.69
## + EducationLevel_w 1 539.88 545.88
## + Ethnicity_dd 1 539.93 545.93
## + EducationLevel_e 1 540.26 546.26
## + Ethnicity_v 1 540.48 546.48
## + EducationLevel_r 1 540.59 546.59
## + Age 1 540.69 546.69
## + EducationLevel_m 1 540.69 546.69
## + EducationLevel_j 1 540.83 546.83
## + Ethnicity_z 1 540.85 546.85
## + EducationLevel_c 1 540.89 546.89
## + Citizen_g 1 540.92 546.92
## - PriorDefault_No 1 948.16 950.16
##
## Step: AIC=509.21
## ApprovalStatus ~ PriorDefault_No + CreditScore
##
## Df Deviance AIC
## + Income 1 477.87 485.87
## + Citizen_p 1 484.26 492.26
## + EducationLevel_x 1 492.76 500.76
## + Married_y 1 494.87 502.87
## + BankCustomer_p 1 494.87 502.87
## + Ethnicity_ff 1 497.51 505.51
## + Married_u 1 497.62 505.62
## + BankCustomer_g 1 497.62 505.62
## + EducationLevel_ff 1 497.72 505.72
## + EducationLevel_cc 1 497.84 505.84
## + Employed_employed 1 498.33 506.33
## + Employed_unemployed 1 498.33 506.33
## + YearsEmployed 1 499.68 507.68
## + Ethnicity_h 1 499.76 507.76
## + EducationLevel_aa 1 499.85 507.85
## + Ethnicity_o 1 500.21 508.21
## + EducationLevel_k 1 500.79 508.79
## + EducationLevel_i 1 500.88 508.88
## <none> 503.21 509.21
## + Ethnicity_j 1 501.27 509.27
## + Ethnicity_bb 1 501.27 509.27
## + Ethnicity_n 1 501.28 509.28
## + EducationLevel_d 1 501.70 509.70
## + EducationLevel_w 1 501.72 509.72
## + Ethnicity_dd 1 502.00 510.00
## + Citizen_g 1 502.06 510.06
## + EducationLevel_q 1 502.17 510.17
## + EducationLevel_e 1 502.47 510.47
## + Citizen_s 1 502.69 510.69
## + Ethnicity_z 1 502.82 510.82
## + EducationLevel_r 1 502.86 510.86
## + Ethnicity_v 1 503.05 511.05
## + EducationLevel_m 1 503.06 511.06
## + EducationLevel_j 1 503.13 511.13
## + Age 1 503.17 511.17
## + Debt 1 503.19 511.19
## + EducationLevel_c 1 503.21 511.21
## - CreditScore 1 540.95 544.95
## - PriorDefault_No 1 762.74 766.74
##
## Step: AIC=485.87
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income
##
## Df Deviance AIC
## + Citizen_p 1 462.37 472.37
## + EducationLevel_x 1 468.30 478.30
## + EducationLevel_ff 1 470.00 480.00
## + Married_y 1 470.29 480.29
## + BankCustomer_p 1 470.29 480.29
## + Married_u 1 471.72 481.72
## + BankCustomer_g 1 471.72 481.72
## + Ethnicity_ff 1 472.52 482.52
## + EducationLevel_cc 1 472.64 482.64
## + Employed_unemployed 1 473.35 483.35
## + Employed_employed 1 473.35 483.35
## + YearsEmployed 1 473.93 483.93
## + Ethnicity_h 1 474.45 484.45
## + EducationLevel_aa 1 475.33 485.33
## + Ethnicity_j 1 475.57 485.57
## + Ethnicity_n 1 475.74 485.74
## + EducationLevel_i 1 475.76 485.76
## + Ethnicity_bb 1 475.85 485.85
## <none> 477.87 485.87
## + EducationLevel_k 1 476.08 486.08
## + Citizen_g 1 476.19 486.19
## + EducationLevel_w 1 476.44 486.44
## + Ethnicity_dd 1 476.57 486.57
## + EducationLevel_q 1 476.57 486.57
## + EducationLevel_d 1 477.00 487.00
## + Ethnicity_z 1 477.05 487.05
## + EducationLevel_e 1 477.35 487.35
## + EducationLevel_m 1 477.71 487.71
## + EducationLevel_j 1 477.72 487.72
## + Debt 1 477.73 487.73
## + Ethnicity_o 1 477.76 487.76
## + Age 1 477.82 487.82
## + Citizen_s 1 477.84 487.84
## + Ethnicity_v 1 477.85 487.85
## + EducationLevel_r 1 477.86 487.86
## + EducationLevel_c 1 477.87 487.87
## - Income 1 503.21 509.21
## - CreditScore 1 505.09 511.09
## - PriorDefault_No 1 723.20 729.20
##
## Step: AIC=472.37
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p
##
## Df Deviance AIC
## + EducationLevel_x 1 452.34 464.34
## + EducationLevel_ff 1 454.25 466.25
## + Married_y 1 456.07 468.07
## + BankCustomer_p 1 456.07 468.07
## + EducationLevel_cc 1 456.58 468.58
## + Employed_unemployed 1 456.73 468.73
## + Employed_employed 1 456.73 468.73
## + Ethnicity_ff 1 456.89 468.89
## + Married_u 1 457.40 469.40
## + BankCustomer_g 1 457.40 469.40
## + YearsEmployed 1 457.91 469.91
## + Ethnicity_h 1 458.34 470.34
## + EducationLevel_i 1 459.38 471.38
## + Ethnicity_bb 1 459.40 471.40
## + Ethnicity_n 1 459.86 471.86
## + EducationLevel_aa 1 460.15 472.15
## <none> 462.37 472.37
## + EducationLevel_w 1 460.55 472.55
## + EducationLevel_q 1 460.83 472.83
## + EducationLevel_k 1 460.94 472.94
## + Ethnicity_j 1 461.40 473.40
## + Ethnicity_z 1 461.55 473.55
## + EducationLevel_d 1 461.69 473.69
## + Age 1 462.15 474.15
## + EducationLevel_m 1 462.28 474.28
## + Ethnicity_o 1 462.28 474.28
## + Ethnicity_v 1 462.30 474.30
## + EducationLevel_e 1 462.34 474.34
## + EducationLevel_r 1 462.35 474.35
## + Debt 1 462.36 474.36
## + EducationLevel_j 1 462.36 474.36
## + Ethnicity_dd 1 462.36 474.36
## + EducationLevel_c 1 462.36 474.36
## + Citizen_g 1 462.37 474.37
## + Citizen_s 1 462.37 474.37
## - Citizen_p 1 477.87 485.87
## - Income 1 484.26 492.26
## - CreditScore 1 490.12 498.12
## - PriorDefault_No 1 719.88 727.88
##
## Step: AIC=464.34
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x
##
## Df Deviance AIC
## + Married_y 1 444.97 458.97
## + BankCustomer_p 1 444.97 458.97
## + EducationLevel_ff 1 445.13 459.13
## + EducationLevel_cc 1 445.65 459.65
## + Married_u 1 446.44 460.44
## + BankCustomer_g 1 446.44 460.44
## + Ethnicity_ff 1 447.63 461.63
## + Employed_unemployed 1 447.75 461.75
## + Employed_employed 1 447.75 461.75
## + YearsEmployed 1 448.32 462.32
## + Ethnicity_n 1 449.70 463.70
## + EducationLevel_w 1 449.77 463.77
## + EducationLevel_q 1 449.92 463.92
## + Ethnicity_bb 1 449.97 463.97
## + EducationLevel_i 1 449.99 463.99
## + Ethnicity_h 1 450.18 464.18
## <none> 452.34 464.34
## + EducationLevel_aa 1 450.82 464.82
## + Ethnicity_j 1 451.25 465.25
## + EducationLevel_k 1 451.37 465.37
## + Ethnicity_z 1 451.69 465.69
## + EducationLevel_d 1 451.88 465.88
## + Ethnicity_v 1 452.08 466.08
## + Age 1 452.14 466.14
## + EducationLevel_c 1 452.15 466.15
## + EducationLevel_e 1 452.25 466.25
## + Ethnicity_o 1 452.26 466.26
## + Debt 1 452.26 466.26
## + Citizen_g 1 452.30 466.30
## + Citizen_s 1 452.30 466.30
## + EducationLevel_r 1 452.31 466.31
## + Ethnicity_dd 1 452.32 466.32
## + EducationLevel_m 1 452.33 466.33
## + EducationLevel_j 1 452.34 466.34
## - EducationLevel_x 1 462.37 472.37
## - Citizen_p 1 468.30 478.30
## - Income 1 473.18 483.18
## - CreditScore 1 480.55 490.55
## - PriorDefault_No 1 697.98 707.98
##
## Step: AIC=458.97
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y
##
## Df Deviance AIC
## + EducationLevel_cc 1 437.84 453.84
## + EducationLevel_ff 1 438.29 454.29
## + Married_u 1 438.70 454.70
## + BankCustomer_g 1 438.70 454.70
## + Ethnicity_ff 1 440.73 456.73
## + Employed_unemployed 1 441.06 457.06
## + Employed_employed 1 441.06 457.06
## + YearsEmployed 1 441.28 457.28
## + EducationLevel_w 1 441.97 457.97
## + Ethnicity_bb 1 442.12 458.12
## + EducationLevel_i 1 442.24 458.24
## + Ethnicity_n 1 442.33 458.33
## + Ethnicity_h 1 442.66 458.66
## <none> 444.97 458.97
## + EducationLevel_q 1 443.61 459.61
## + Ethnicity_j 1 443.89 459.89
## + EducationLevel_aa 1 443.92 459.92
## + Ethnicity_z 1 444.02 460.02
## + EducationLevel_k 1 444.07 460.07
## + Age 1 444.41 460.41
## + EducationLevel_d 1 444.63 460.63
## + Ethnicity_v 1 444.64 460.64
## + EducationLevel_c 1 444.66 460.66
## + Debt 1 444.83 460.83
## + Ethnicity_o 1 444.87 460.87
## + EducationLevel_m 1 444.90 460.90
## + EducationLevel_e 1 444.92 460.92
## + EducationLevel_r 1 444.93 460.93
## + Ethnicity_dd 1 444.96 460.96
## + EducationLevel_j 1 444.97 460.97
## + Citizen_g 1 444.97 460.97
## + Citizen_s 1 444.97 460.97
## - Married_y 1 452.34 464.34
## - EducationLevel_x 1 456.07 468.07
## - Citizen_p 1 459.56 471.56
## - Income 1 465.83 477.83
## - CreditScore 1 471.44 483.44
## - PriorDefault_No 1 686.84 698.84
##
## Step: AIC=453.84
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc
##
## Df Deviance AIC
## + EducationLevel_ff 1 431.93 449.93
## + Employed_employed 1 433.94 451.94
## + Employed_unemployed 1 433.94 451.94
## + EducationLevel_w 1 433.96 451.96
## + Married_u 1 434.02 452.02
## + BankCustomer_g 1 434.02 452.02
## + Ethnicity_ff 1 434.06 452.06
## + YearsEmployed 1 434.90 452.90
## + Ethnicity_n 1 435.02 453.02
## + Ethnicity_bb 1 435.49 453.49
## + EducationLevel_i 1 435.65 453.65
## + EducationLevel_q 1 435.79 453.79
## <none> 437.84 453.84
## + Ethnicity_h 1 436.15 454.15
## + Ethnicity_j 1 436.62 454.62
## + Ethnicity_z 1 437.03 455.03
## + EducationLevel_c 1 437.07 455.07
## + EducationLevel_aa 1 437.22 455.22
## + EducationLevel_k 1 437.27 455.27
## + Age 1 437.46 455.46
## + Ethnicity_v 1 437.49 455.49
## + EducationLevel_d 1 437.65 455.65
## + EducationLevel_e 1 437.73 455.73
## + Ethnicity_o 1 437.75 455.75
## + Debt 1 437.77 455.77
## + EducationLevel_r 1 437.79 455.79
## + Ethnicity_dd 1 437.82 455.82
## + EducationLevel_m 1 437.83 455.83
## + Citizen_g 1 437.83 455.83
## + Citizen_s 1 437.83 455.83
## + EducationLevel_j 1 437.84 455.84
## - EducationLevel_cc 1 444.97 458.97
## - Married_y 1 445.65 459.65
## - EducationLevel_x 1 449.94 463.94
## - Citizen_p 1 453.02 467.02
## - Income 1 458.56 472.56
## - CreditScore 1 462.58 476.58
## - PriorDefault_No 1 677.25 691.25
##
## Step: AIC=449.93
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff
##
## Df Deviance AIC
## + Employed_unemployed 1 427.78 447.78
## + Employed_employed 1 427.78 447.78
## + Married_u 1 428.14 448.14
## + BankCustomer_g 1 428.14 448.14
## + Ethnicity_bb 1 428.77 448.77
## + EducationLevel_w 1 428.89 448.89
## + EducationLevel_i 1 428.96 448.96
## + Ethnicity_n 1 429.33 449.33
## + YearsEmployed 1 429.73 449.73
## <none> 431.93 449.93
## + Ethnicity_ff 1 430.28 450.28
## + EducationLevel_q 1 430.47 450.47
## + Ethnicity_h 1 430.80 450.80
## + Ethnicity_j 1 430.93 450.93
## + EducationLevel_aa 1 430.94 450.94
## + Ethnicity_z 1 431.00 451.00
## + EducationLevel_k 1 431.03 451.03
## + EducationLevel_d 1 431.60 451.60
## + EducationLevel_c 1 431.61 451.61
## + Ethnicity_o 1 431.83 451.83
## + EducationLevel_m 1 431.85 451.85
## + EducationLevel_r 1 431.88 451.88
## + EducationLevel_e 1 431.90 451.90
## + Ethnicity_v 1 431.91 451.91
## + EducationLevel_j 1 431.93 451.93
## + Citizen_g 1 431.93 451.93
## + Citizen_s 1 431.93 451.93
## + Debt 1 431.93 451.93
## + Age 1 431.93 451.93
## + Ethnicity_dd 1 431.93 451.93
## - EducationLevel_ff 1 437.84 453.84
## - EducationLevel_cc 1 438.29 454.29
## - Married_y 1 439.22 455.22
## - EducationLevel_x 1 443.05 459.05
## - Citizen_p 1 447.36 463.36
## - Income 1 453.80 469.80
## - CreditScore 1 456.92 472.92
## - PriorDefault_No 1 655.07 671.07
##
## Step: AIC=447.78
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff +
## Employed_unemployed
##
## Df Deviance AIC
## + Married_u 1 423.81 445.81
## + BankCustomer_g 1 423.81 445.81
## + YearsEmployed 1 425.22 447.22
## + Ethnicity_bb 1 425.25 447.25
## + EducationLevel_i 1 425.31 447.31
## + EducationLevel_w 1 425.37 447.37
## + Ethnicity_n 1 425.40 447.40
## <none> 427.78 447.78
## + Ethnicity_ff 1 425.92 447.92
## + Ethnicity_h 1 426.39 448.39
## + Ethnicity_j 1 426.65 448.65
## + Ethnicity_z 1 426.66 448.66
## + EducationLevel_q 1 426.81 448.81
## + EducationLevel_k 1 426.91 448.91
## + EducationLevel_aa 1 427.04 449.04
## + EducationLevel_c 1 427.44 449.44
## + Ethnicity_v 1 427.64 449.64
## + Citizen_s 1 427.64 449.64
## + Citizen_g 1 427.64 449.64
## + EducationLevel_d 1 427.67 449.67
## + EducationLevel_m 1 427.67 449.67
## + Ethnicity_o 1 427.69 449.69
## + EducationLevel_r 1 427.71 449.71
## + Age 1 427.74 449.74
## + EducationLevel_e 1 427.77 449.77
## + EducationLevel_j 1 427.77 449.77
## + Debt 1 427.78 449.78
## + Ethnicity_dd 1 427.78 449.78
## - Employed_unemployed 1 431.93 449.93
## - CreditScore 1 433.61 451.61
## - EducationLevel_ff 1 433.94 451.94
## - EducationLevel_cc 1 434.11 452.11
## - Married_y 1 434.28 452.28
## - EducationLevel_x 1 437.59 455.59
## - Citizen_p 1 444.13 462.13
## - Income 1 448.23 466.23
## - PriorDefault_No 1 643.83 661.83
##
## Step: AIC=445.81
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff +
## Employed_unemployed + Married_u
##
## Df Deviance AIC
## + Ethnicity_bb 1 421.31 445.31
## + EducationLevel_i 1 421.37 445.37
## + Ethnicity_n 1 421.37 445.37
## + EducationLevel_w 1 421.41 445.41
## <none> 423.81 445.81
## + YearsEmployed 1 422.09 446.09
## + Ethnicity_h 1 422.26 446.26
## + Ethnicity_j 1 422.66 446.66
## + Ethnicity_z 1 422.67 446.67
## + EducationLevel_q 1 422.85 446.85
## + EducationLevel_k 1 422.94 446.94
## + EducationLevel_aa 1 423.05 447.05
## + EducationLevel_c 1 423.47 447.47
## + EducationLevel_d 1 423.70 447.70
## + EducationLevel_m 1 423.70 447.70
## + Ethnicity_o 1 423.72 447.72
## + Age 1 423.72 447.72
## + EducationLevel_r 1 423.74 447.74
## + Ethnicity_v 1 423.76 447.76
## - Married_u 1 427.78 447.78
## - EducationLevel_cc 1 427.79 447.79
## + EducationLevel_e 1 423.79 447.79
## + Debt 1 423.80 447.80
## + Ethnicity_ff 1 423.80 447.80
## + EducationLevel_j 1 423.80 447.80
## + Citizen_g 1 423.81 447.81
## + Citizen_s 1 423.81 447.81
## + Ethnicity_dd 1 423.81 447.81
## - Employed_unemployed 1 428.14 448.14
## - Married_y 1 429.08 449.08
## - CreditScore 1 429.55 449.55
## - EducationLevel_ff 1 429.93 449.93
## - EducationLevel_x 1 433.53 453.53
## - Citizen_p 1 440.46 460.46
## - Income 1 442.35 462.35
## - PriorDefault_No 1 642.38 662.38
##
## Step: AIC=445.31
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff +
## Employed_unemployed + Married_u + Ethnicity_bb
##
## Df Deviance AIC
## + Ethnicity_n 1 418.96 444.96
## <none> 421.31 445.31
## + EducationLevel_w 1 419.31 445.31
## + YearsEmployed 1 419.57 445.57
## - Ethnicity_bb 1 423.81 445.81
## + Ethnicity_z 1 419.99 445.99
## + Ethnicity_v 1 420.19 446.19
## + EducationLevel_k 1 420.22 446.22
## + EducationLevel_aa 1 420.24 446.24
## + Ethnicity_j 1 420.36 446.36
## + Ethnicity_h 1 420.37 446.37
## + EducationLevel_i 1 420.52 446.52
## + EducationLevel_q 1 420.75 446.75
## + EducationLevel_c 1 420.83 446.83
## - EducationLevel_cc 1 424.85 446.85
## - Employed_unemployed 1 425.00 447.00
## + Age 1 421.12 447.12
## + EducationLevel_m 1 421.14 447.14
## + EducationLevel_d 1 421.20 447.20
## + Ethnicity_o 1 421.22 447.22
## - Married_u 1 425.25 447.25
## + EducationLevel_r 1 421.25 447.25
## + Ethnicity_dd 1 421.29 447.29
## + Ethnicity_ff 1 421.30 447.30
## + Citizen_g 1 421.30 447.30
## + Citizen_s 1 421.30 447.30
## + Debt 1 421.30 447.30
## + EducationLevel_e 1 421.31 447.31
## + EducationLevel_j 1 421.31 447.31
## - Married_y 1 426.59 448.59
## - CreditScore 1 427.45 449.45
## - EducationLevel_ff 1 428.12 450.12
## - EducationLevel_x 1 430.50 452.50
## - Citizen_p 1 438.80 460.80
## - Income 1 440.11 462.11
## - PriorDefault_No 1 641.92 663.92
##
## Step: AIC=444.96
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff +
## Employed_unemployed + Married_u + Ethnicity_bb + Ethnicity_n
##
## Df Deviance AIC
## + EducationLevel_w 1 416.78 444.78
## <none> 418.96 444.96
## + YearsEmployed 1 417.21 445.21
## - Ethnicity_n 1 421.31 445.31
## - Ethnicity_bb 1 421.37 445.37
## + Ethnicity_z 1 417.67 445.67
## + Ethnicity_h 1 417.93 445.93
## + EducationLevel_k 1 417.96 445.96
## + EducationLevel_aa 1 417.96 445.96
## + Ethnicity_j 1 417.97 445.97
## + EducationLevel_i 1 418.23 446.23
## + Ethnicity_v 1 418.24 446.24
## - Employed_unemployed 1 422.46 446.46
## + EducationLevel_q 1 418.47 446.47
## + EducationLevel_c 1 418.58 446.58
## - EducationLevel_cc 1 422.61 446.61
## + Age 1 418.71 446.71
## + EducationLevel_r 1 418.73 446.73
## + EducationLevel_m 1 418.82 446.82
## + EducationLevel_d 1 418.87 446.87
## + Ethnicity_o 1 418.87 446.87
## + Citizen_g 1 418.94 446.94
## + Citizen_s 1 418.94 446.94
## + Ethnicity_dd 1 418.94 446.94
## + Ethnicity_ff 1 418.95 446.95
## - Married_u 1 422.95 446.95
## + EducationLevel_e 1 418.95 446.95
## + Debt 1 418.96 446.96
## + EducationLevel_j 1 418.96 446.96
## - Married_y 1 424.30 448.30
## - CreditScore 1 425.11 449.11
## - EducationLevel_ff 1 425.50 449.50
## - EducationLevel_x 1 428.30 452.30
## - Citizen_p 1 436.80 460.80
## - Income 1 437.87 461.87
## - PriorDefault_No 1 641.88 665.88
##
## Step: AIC=444.78
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p +
## EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff +
## Employed_unemployed + Married_u + Ethnicity_bb + Ethnicity_n +
## EducationLevel_w
##
## Df Deviance AIC
## <none> 416.78 444.78
## - Ethnicity_bb 1 418.78 444.78
## - EducationLevel_w 1 418.96 444.96
## + YearsEmployed 1 415.01 445.01
## - Ethnicity_n 1 419.31 445.31
## + Ethnicity_h 1 415.41 445.41
## + Ethnicity_j 1 415.64 445.64
## + Ethnicity_z 1 415.65 445.65
## + Ethnicity_v 1 415.66 445.66
## - Employed_unemployed 1 419.80 445.80
## + EducationLevel_q 1 415.82 445.82
## + EducationLevel_c 1 415.89 445.89
## + EducationLevel_k 1 416.11 446.11
## + EducationLevel_aa 1 416.16 446.16
## + EducationLevel_i 1 416.24 446.24
## + Age 1 416.47 446.47
## + EducationLevel_r 1 416.55 446.55
## + Ethnicity_o 1 416.70 446.70
## + EducationLevel_m 1 416.74 446.74
## + EducationLevel_d 1 416.74 446.74
## + EducationLevel_e 1 416.74 446.74
## + Citizen_g 1 416.75 446.75
## + Citizen_s 1 416.75 446.75
## + Debt 1 416.77 446.77
## - Married_u 1 420.77 446.77
## + Ethnicity_ff 1 416.77 446.77
## + Ethnicity_dd 1 416.78 446.78
## + EducationLevel_j 1 416.78 446.78
## - EducationLevel_cc 1 420.96 446.96
## - Married_y 1 422.16 448.16
## - EducationLevel_ff 1 422.43 448.43
## - CreditScore 1 423.52 449.52
## - EducationLevel_x 1 426.98 452.98
## - Citizen_p 1 435.02 461.02
## - Income 1 435.26 461.26
## - PriorDefault_No 1 640.08 666.08
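As a sanity check on the trace above, the AIC column can be reproduced by hand. For a logistic regression on 0/1 outcomes the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2k, where k counts the estimated coefficients including the intercept. The sketch below uses only figures copied from the trace; the helper name `aic_from_deviance` is ours and is not part of any package.

```r
# AIC for a binary logistic regression: deviance + 2 * (number of coefficients).
# `aic_from_deviance` is a hypothetical helper written for this illustration.
aic_from_deviance <- function(deviance, n_predictors) {
  k <- n_predictors + 1  # + 1 for the intercept
  deviance + 2 * k
}

# Deviances copied from the <none> rows of the stepwise trace above
aic_8  <- aic_from_deviance(431.93, 8)   # model with 8 predictors
aic_13 <- aic_from_deviance(416.78, 13)  # last model shown, with 13 predictors

print(c(aic_8, aic_13))
```

Both results agree with the `Step: AIC=449.93` and `Step: AIC=444.78` headers in the trace.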
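The stepwise search shown above stops at AIC = 444.78. We will keep the following features for our final model (Ethnicity_bb is left out; the last step shows that removing it leaves the AIC unchanged):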
myvars <- c("PriorDefault_No", "CreditScore", "Income", "Citizen_p", "EducationLevel_x",
"EducationLevel_ff", "Married_y", "EducationLevel_cc", "Employed_unemployed",
"Married_u", "Ethnicity_n", "EducationLevel_w", "ApprovalStatus")
df <- df[myvars]
We can now split our data into the feature set X and the target variable Y. We convert X to a numeric matrix and Y to a numeric vector so that they can be processed by the glmnet function.
# X and Y
X <- data.matrix(subset(df, select = -ApprovalStatus))
Y <- as.double(as.matrix(df$ApprovalStatus))
We are now ready to split both into training and test sets.
# TRAIN (first 590 rows; R indices start at 1, not 0)
X_Train <- X[1:590, ]
Y_Train <- Y[1:590]
# TEST (remaining rows)
X_Test <- X[591:nrow(X), ]
Y_Test <- Y[591:length(Y)]
We have a binary classification problem (whether to approve credit or not), so we will fit a logistic regression model.
We need a model that predicts approvals as accurately as possible while keeping the number of false positives low, since every false positive means our bank grants credit that it should not have granted and may lose money. For that reason, we will use the Area Under the ROC Curve (AUC) as our evaluation metric.
The ROC curve plots the false positive rate (x-axis) against the true positive rate (y-axis) for a range of candidate thresholds between 0.0 and 1.0. The area under this curve summarizes in a single number how well the model trades true positives against false positives across all thresholds, which makes it a suitable metric for our goal.
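To make the metric concrete, the AUC can be computed without drawing the curve: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal base-R sketch on made-up scores; the function name `auc_rank` and the toy data are illustrative assumptions, not part of the project.

```r
# Pairwise (rank-based) AUC: share of positive/negative pairs ranked correctly.
# `auc_rank` is a hypothetical helper written for this illustration.
auc_rank <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # Compare every positive score with every negative score
  wins <- outer(pos, neg, ">")
  ties <- outer(pos, neg, "==")
  (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
}

# Toy data: three positives and three negatives
labels <- c(1, 1, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.4, 0.5, 0.3, 0.1)

auc_toy <- auc_rank(scores, labels)
print(auc_toy)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A model that ranks every positive above every negative would score 1, and random scores would hover around 0.5.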
To improve results we will also apply regularization. Rather than committing to Lasso or Ridge up front, we will use an Elastic-Net model, which covers both.
Elastic-Net gives Ridge regularization when the alpha parameter is set to 0 and Lasso regularization when it is set to 1. We will fit both and keep the one with the higher AUC.
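For reference, the penalized objective that glmnet documents for its Elastic-Net fits can be written as below, where $N$ is the number of training rows, $\ell$ is the negative log-likelihood contribution of one row, and $\lambda$ and $\alpha$ are the strength and mixing parameters. Setting $\alpha = 0$ leaves only the squared (Ridge) term and $\alpha = 1$ leaves only the absolute-value (Lasso) term. The notation is ours, chosen for illustration.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\ \beta_0 + x_i^{\top}\beta\right)
  \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
  \;+\; \alpha\,\lVert\beta\rVert_1\right]
```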
# We will use glmnet as Elastic-Net; the cv. prefix stands for cross-validation
cv.ridge <- cv.glmnet(X_Train, Y_Train, family='binomial', alpha=0, parallel=TRUE, standardize=TRUE, type.measure='auc')
## Warning: executing %dopar% sequentially: no parallel backend registered
Now that we have the model, let’s check the value for the biggest AUC.
## [1] 0.9230839
The highest AUC is approximately 0.923. Let’s see which lambda value it corresponds to:
## [1] 94
## [1] FALSE
## [1] 0.06161378
The lambda value that gives us the highest AUC is approximately 0.062.
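The code that produced the values above is not shown in the text. One way to read such values off a `cv.glmnet` fit is sketched below on simulated data, so that it runs on its own; `X_sim`, `y_sim`, the seed and the fold count are assumptions made for illustration. With `type.measure = "auc"`, the `cvm` component holds the mean cross-validated AUC for each lambda, and `lambda.min` is the lambda with the highest AUC.

```r
library(glmnet)

# Simulated stand-in for the training data (illustrative assumption)
set.seed(1)
n <- 300
X_sim <- matrix(rnorm(n * 5), nrow = n)
# The outcome depends on the first two columns only
y_sim <- rbinom(n, 1, plogis(2 * X_sim[, 1] - X_sim[, 2]))

# Ridge (alpha = 0) logistic regression, cross-validated on AUC
cv_fit <- cv.glmnet(X_sim, y_sim, family = "binomial", alpha = 0,
                    type.measure = "auc", nfolds = 10)

best_auc    <- max(cv_fit$cvm)        # highest mean cross-validated AUC
best_index  <- which.max(cv_fit$cvm)  # its position on the lambda path
best_lambda <- cv_fit$lambda.min      # lambda selected by cross-validation

print(c(best_auc = best_auc, best_index = best_index,
        best_lambda = best_lambda))
```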
# We set the parameter 'alpha' to 1 so we can use Lasso
cv.lasso <- cv.glmnet(X_Train, Y_Train, family='binomial', alpha=1, parallel=TRUE, standardize=TRUE, type.measure='auc')
Let’s see how the AUC behaves across different lambda values.
Let’s calculate the optimal lambda:
## [1] 0.001102055
Let’s compare both maximum AUCs (the Lasso maximum, followed by the Ridge maximum minus the Lasso maximum):
## [1] 0.9294144
## [1] -0.006330498
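Both return almost the same result: the Lasso AUC (0.929) is about 0.006 higher than the Ridge AUC (0.923), as the negative difference above shows. Since the gap is this small, we will continue with Ridge.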
Let’s try our Logistic Regression model using Ridge regularization to see how well it predicts.
## [1] 1 1 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
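The prediction call is not shown in the text. A possible way to produce 0/1 predictions like the ones above from a cross-validated Ridge fit is sketched here on simulated data; `X_sim`, `y_sim`, the 250/50 split and the seed are illustrative assumptions. `predict()` on a `cv.glmnet` object accepts `s = "lambda.min"` and, for the binomial family, `type = "class"`.

```r
library(glmnet)

# Simulated stand-in for the project data (illustrative assumption)
set.seed(2)
n <- 300
X_sim <- matrix(rnorm(n * 4), nrow = n)
y_sim <- rbinom(n, 1, plogis(2 * X_sim[, 1]))

train <- 1:250
test  <- 251:n

cv_fit <- cv.glmnet(X_sim[train, ], y_sim[train], family = "binomial",
                    alpha = 0, type.measure = "auc")

# type = "class" returns the predicted label as a character matrix,
# so we convert it back to a numeric 0/1 vector
y_pred_sim <- as.numeric(predict(cv_fit, newx = X_sim[test, ],
                                 s = "lambda.min", type = "class"))

print(table(actual = y_sim[test], predicted = y_pred_sim))
```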
Now, we create a confusion matrix so we can compare the actual outcome and our predicted outcome:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 1
## 1 8 6
##
## Accuracy : 0.91
## 95% CI : (0.836, 0.958)
## No Information Rate : 0.93
## P-Value [Acc > NIR] : 0.8380
##
## Kappa : 0.5273
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9140
## Specificity : 0.8571
## Pos Pred Value : 0.9884
## Neg Pred Value : 0.4286
## Precision : 0.9884
## Recall : 0.9140
## F1 : 0.9497
## Prevalence : 0.9300
## Detection Rate : 0.8500
## Detection Prevalence : 0.8600
## Balanced Accuracy : 0.8856
##
## 'Positive' Class : 0
##
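Taking 0 (denied) as the positive class, caret reports an Accuracy of 91%, a Recall of 91.4%, an F1 of 94.97% and a Precision of 98.84%. Note that the ‘Reference’ column totals (93 and 7) match the printed predictions (93 zeros and 7 ones) rather than the test labels, so Precision and Recall are interchanged in this output; Accuracy and F1 are unaffected.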
## y_pred
## Y_Test 0 1 Sum
## 0 85 1 86
## 1 8 6 14
## Sum 93 7 100
In the confusion matrix we had just one false positive out of 100 predictions; 6 applications were correctly approved and 85 were correctly denied. We also had 8 false negatives.
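The caret summary treats 0 (denied) as the positive class. Since the concern stated earlier is wrongly approved credit, it can also help to read the same counts with 1 (approved) as the positive class. The following base-R sketch recomputes the metrics from the counts in the Y_Test by y_pred table; it assumes that 1 encodes an approval, and the variable names are ours.

```r
# Counts taken from the Y_Test x y_pred table, with 1 (approved) as positive
tp <- 6   # actual 1, predicted 1
fp <- 1   # actual 0, predicted 1
fn <- 8   # actual 1, predicted 0
tn <- 85  # actual 0, predicted 0

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)  # share of predicted approvals that were correct
recall    <- tp / (tp + fn)  # share of actual approvals that were found
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 4))
```

Under this reading, precision is about 85.7% and recall about 42.9%: the model rarely approves credit wrongly, but it also misses more than half of the applicants who were actually approved.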
Let’s check which variables have the most influence on our model.
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.1246888
## PriorDefault_No 2.1701118
## CreditScore 0.3653171
## Income 0.2671944
## Citizen_p 1.2752108
## EducationLevel_x 1.1378969
## EducationLevel_ff -0.8193439
## Married_y -0.3173088
## EducationLevel_cc 0.9025020
## Employed_unemployed -0.7563359
## Married_u 0.1924204
## Ethnicity_n 1.6256479
## EducationLevel_w 0.2872622
It seems that not having defaulted before, being of ethnicity ‘n’, having citizenship status ‘p’ and having education level ‘x’ are positively associated with credit approval, whereas having education level ‘ff’ and being ‘unemployed’ have the largest negative impact on getting credit approved.
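Because these are logistic-regression coefficients, each one is a change in the log-odds of approval, and exponentiating it gives an odds ratio. The sketch below does this for the coefficients printed above; the values are copied by hand. Since Ridge shrinks coefficients towards zero, and the numeric features are measured per unit while the dummies are 0/1, the resulting ratios should be read as rough indications rather than unbiased, directly comparable estimates.

```r
# Coefficients copied from the sparse matrix printed above (intercept omitted)
coefs <- c(
  PriorDefault_No     =  2.1701118,
  CreditScore         =  0.3653171,
  Income              =  0.2671944,
  Citizen_p           =  1.2752108,
  EducationLevel_x    =  1.1378969,
  EducationLevel_ff   = -0.8193439,
  Married_y           = -0.3173088,
  EducationLevel_cc   =  0.9025020,
  Employed_unemployed = -0.7563359,
  Married_u           =  0.1924204,
  Ethnicity_n         =  1.6256479,
  EducationLevel_w    =  0.2872622
)

# exp(coefficient) = multiplicative change in the odds of approval
odds_ratios <- exp(coefs)
print(round(sort(odds_ratios, decreasing = TRUE), 2))
```

For instance, PriorDefault_No has an odds ratio of about 8.8, while every feature with a negative coefficient has an odds ratio below 1.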