Credit Screening Project

Jorge Marcos Martos

library(dplyr)
library(tidyverse)
library(ggplot2)
library(fastDummies)
library(missForest)
library(corrplot)
library(glmnet)
library(caret)
library(lattice)
library(e1071)
library(MASS)

Data Loading and Overview

In this project we will build an automatic credit card approval predictor using machine learning techniques.

For that purpose we will use the data set Credit Screening from the UCI Machine Learning Repository.

All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. Nonetheless, we will try to infer what these variables might represent.

url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
crx <- read.csv(url, sep = ",", header = FALSE)

Data Preprocessing

Handling Missing Values

Inspecting the data set shows that missing values are encoded as ‘?’, so we need to replace them with NA.

crx[crx == "?"] <- NA

# Let's see how many missing values we have (per column, then in total)
sapply(crx, function(x) sum(is.na(x))); sum(sapply(crx, function(x) sum(is.na(x))))

##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
##  12  12   0   6   6   9   9   0   0   0   0   0   0  13   0   0

## [1] 67
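As a side note, the ‘?’ markers can also be turned into NA while reading the file. The sketch below shows this on a tiny made-up CSV string (the values and the object names `csv_text` and `toy` are assumptions for illustration, not the real data); it relies on the `na.strings` argument of `read.csv()`.

```r
# Toy CSV text standing in for crx.data (values are made up for illustration).
csv_text <- "b,30.83,0,u\na,?,4.46,u\n?,24.50,0.5,y"

# na.strings = "?" converts every "?" field to NA while reading,
# so numeric columns are parsed as numbers instead of character strings.
toy <- read.csv(text = csv_text, header = FALSE, na.strings = "?")

colSums(is.na(toy))  # missing values per column
sum(is.na(toy))      # total number of missing values
```

With this approach the numeric columns are read as numbers straight away, because the ‘?’ never reaches the type detection step.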

There are 67 missing values in total. We will impute them with the missForest() function from the missForest library. It starts from an initial guess (the mean for numeric variables and the most frequent category for categorical ones) and then iteratively trains a random forest on the observed values of each variable to predict its missing values.

crx <- type.convert(crx, as.is=FALSE) # This code converts categorical variables into
# factors, so the missForest can use it.

crx.i <- missForest(as.data.frame(crx))

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

crx <- crx.i$ximp

# Double-checking that the missing values have been successfully imputed:
sapply(crx, function(x) sum(is.na(x)))

##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
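To make the starting point of the imputation more concrete, here is a minimal base R sketch of a mean / most-frequent-level fill on a toy data frame. It only illustrates the initial guess that a method like missForest then refines; the random forest step is not reproduced here. The helper `impute_simple()` and the toy data are hypothetical and not part of the project code.

```r
# Hypothetical helper: mean imputation for numeric columns,
# most-frequent-level imputation for factor columns.
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      # which.max() returns the first level with the highest count
      most_frequent <- names(which.max(table(df[[col]])))
      df[[col]][missing] <- most_frequent
    }
  }
  df
}

# Made-up data with one missing value in each column.
toy <- data.frame(
  age    = c(20, 30, NA, 40),
  status = factor(c("u", "u", NA, "y"))
)
toy_imputed <- impute_simple(toy)
toy_imputed
```

A fill like this ignores the relationships between variables, which is the motivation for refining it with a model-based method.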

Recoding Variables

After looking into which variables are typically considered in credit card approval, I decided to rename the variables using this blog post as a reference: http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html

The variables will be renamed as: “Gender”, “Age”, “Debt”, “Married”, “BankCustomer”, “EducationLevel”, “Ethnicity”, “YearsEmployed”, “PriorDefault”, “Employed”, “CreditScore”, “DriversLicense”, “Citizen”, “ZipCode”, “Income” and “ApprovalStatus”.

# Renaming variables
crx <- crx %>% 
  rename(Gender = V1,
         Age = V2,
         Debt =  V3,
         Married = V4,
         BankCustomer = V5,
         EducationLevel = V6,
         Ethnicity = V7,
         YearsEmployed = V8,
         PriorDefault = V9,
         Employed = V10,
         CreditScore = V11,
         DriversLicense = V12,
         Citizen = V13,
         ZipCode = V14,
         Income = V15,
         ApprovalStatus = V16)

# Checking variables
str(crx)

## 'data.frame':    690 obs. of  16 variables:
##  $ Gender        : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
##  $ Age           : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ Debt          : num  0 4.46 0.5 1.54 5.62 ...
##  $ Married       : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
##  $ BankCustomer  : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
##  $ EducationLevel: Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
##  $ Ethnicity     : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
##  $ YearsEmployed : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ PriorDefault  : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Employed      : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
##  $ CreditScore   : int  1 6 0 5 0 0 0 0 0 0 ...
##  $ DriversLicense: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
##  $ Citizen       : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
##  $ ZipCode       : num  202 43 280 100 120 360 164 80 180 52 ...
##  $ Income        : int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ ApprovalStatus: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...

The ZipCode variable should be a categorical variable (a factor). Let’s see which values it takes:

unique(crx$ZipCode)

##   [1]  202.00000   43.00000  280.00000  100.00000  120.00000  360.00000
##   [7]  164.00000   80.00000  180.00000   52.00000  128.00000  260.00000
##  [13]    0.00000  320.00000  396.00000   96.00000  200.00000  300.00000
##  [19]  145.00000  500.00000  168.00000  434.00000  583.00000   30.00000
##  [25]  240.00000   70.00000  455.00000  311.00000  216.00000  491.00000
##  [31]  400.00000  239.00000  160.00000  711.00000  250.00000  520.00000
##  [37]  515.00000  420.00000  224.54611  980.00000  443.00000  140.00000
##  [43]   94.00000  368.00000  288.00000  928.00000  188.00000  112.00000
##  [49]  171.00000  268.00000  167.00000   75.00000  152.00000  176.00000
##  [55]  329.00000  212.00000  410.00000  274.00000  375.00000  408.00000
##  [61]  350.00000  204.00000   40.00000  181.00000  399.00000  440.00000
##  [67]   93.00000   60.00000  395.00000  393.00000   21.00000   29.00000
##  [73]  102.00000  431.00000  370.00000   24.00000   20.00000  129.00000
##  [79]  510.00000  195.00000  144.00000  380.00000  144.45333   49.00000
##  [85]   50.00000   91.66967  381.00000  150.00000  117.00000   56.00000
##  [91]  211.00000  230.00000  156.00000   22.00000  228.00000  519.00000
##  [97]  253.00000  487.00000  220.00000  119.54927   88.00000   73.00000
## [103]  121.00000  470.00000  136.00000  132.00000  292.00000  154.00000
## [109]  272.00000  219.71467  340.00000   91.16295  108.00000  720.00000
## [115]  450.00000  232.00000  170.00000 1160.00000  411.00000  144.02600
## [121]  460.00000  348.00000  480.00000  640.00000  372.00000  276.00000
## [127]  221.00000  352.00000  141.00000  178.00000  600.00000  550.00000
## [133]  187.63619 2000.00000  225.00000  210.00000  110.00000  356.00000
## [139]   45.00000   62.00000   92.00000  174.00000   17.00000   86.00000
## [145]   90.22924  454.00000  214.25681  254.00000   28.00000  263.00000
## [151]  333.00000  312.00000  290.00000  371.00000   99.00000  252.00000
## [157]  760.00000  560.00000  130.00000  523.00000  680.00000  163.00000
## [163]  208.00000  383.00000  330.00000  422.00000  840.00000  432.00000
## [169]   32.00000  186.00000  303.00000  147.29800  349.00000  224.00000
## [175]  369.00000  148.43755  235.47510   76.00000  231.00000  309.00000
## [181]  416.00000  465.00000  256.00000

It returns 183 distinct values. 170 of them were present before imputation; the other 13 are non-integer values produced by missForest, which treated the variable as numeric. Treating it as a factor would not have worked either, since the function cannot handle a categorical variable with more than 53 levels.

This would provide us with too many categories, hindering our analysis, so I decided to get rid of this variable.

crx <- subset(crx, select = -ZipCode)

Our target variable is ‘ApprovalStatus’. It is more convenient as a binary factor, with level ‘0’ for the negative class and level ‘1’ for the positive class.

crx <- crx %>% 
    mutate(ApprovalStatus = recode(ApprovalStatus, 
                      "+" = "1", 
                      "-" = "0")) 

str(crx)

## 'data.frame':    690 obs. of  15 variables:
##  $ Gender        : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
##  $ Age           : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ Debt          : num  0 4.46 0.5 1.54 5.62 ...
##  $ Married       : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
##  $ BankCustomer  : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
##  $ EducationLevel: Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
##  $ Ethnicity     : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
##  $ YearsEmployed : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ PriorDefault  : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Employed      : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
##  $ CreditScore   : int  1 6 0 5 0 0 0 0 0 0 ...
##  $ DriversLicense: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
##  $ Citizen       : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
##  $ Income        : int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ ApprovalStatus: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
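For comparison, the same relabeling can be sketched in base R with `factor()`, where the mapping between old and new levels is spelled out explicitly. The vector `status` below is a made-up example, not the real column.

```r
# Toy vector with the same two symbols used in the raw data.
status <- factor(c("+", "-", "+", "+", "-"))

# levels = gives the order of the original values, labels = their new names,
# so "-" becomes "0" and "+" becomes "1" regardless of the default sort order.
status01 <- factor(status, levels = c("-", "+"), labels = c("0", "1"))

# Cross-tabulate old against new values to confirm the mapping.
table(original = status, recoded = status01)
```

Cross-tabulating the original and recoded values is a quick way to confirm that no category was swapped by accident.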

Exploratory Data Analysis

We will now compare every variable against the target variable, which takes the value 0 when the credit was not approved and 1 when it was granted.

Categorical Variables vs Target Variable

Gender vs Credit Approval

ggplot(data = crx, aes(x = Gender, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Gender') + ggtitle('Gender vs Credit Approval')

Gender ‘a’ seems to have a higher proportion of approvals than gender ‘b’, but the difference between the two rates does not look large. We will test the dependence between the variables later on.

Marital Status vs Credit Approval

ggplot(data = crx, aes(x = Married, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Marital Status') + ggtitle('Marital Status vs Credit Approval')

There seems to be a clear difference in approval rate across marital statuses. Marital status ‘l’ shows a 100% approval rate, which might be caused by a sample too small to be representative. Let’s check:

crx %>% 
  group_by(Married) %>% 
  count()

## # A tibble: 3 x 2
## # Groups:   Married [3]
##   Married     n
##   <fct>   <int>
## 1 l           2
## 2 u         525
## 3 y         163

Indeed, there are only two observations with marital status ‘l’, which would explain this anomaly.

Bank Customer vs Credit Approval

ggplot(data = crx, aes(x = BankCustomer, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Bank Customer') + ggtitle('Bank Customer vs Credit Approval')

There seems to be an association between bank customer status and the credit approval rate. Once again one category shows an anomalous 100% approval rate, so let’s check how many observations fall in it:

crx %>% group_by(BankCustomer) %>% count()

## # A tibble: 3 x 2
## # Groups:   BankCustomer [3]
##   BankCustomer     n
##   <fct>        <int>
## 1 g              525
## 2 gg               2
## 3 p              163

As in the previous case, there are just two observations in that category, which would explain the anomaly.
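Since a proportion plot hides the group sizes, it can help to look at counts and approval rates side by side. The following is a minimal base R sketch on made-up data (the data frame `toy` and the object `summary_by_group` are assumptions for illustration).

```r
# Made-up data: category "l" has only two rows, both approved.
toy <- data.frame(
  Married  = c("u", "u", "u", "u", "y", "y", "y", "l", "l"),
  Approved = c(1, 0, 1, 0, 0, 0, 1, 1, 1)
)

# Group size and approval rate for each category.
n    <- tapply(toy$Approved, toy$Married, length)
rate <- tapply(toy$Approved, toy$Married, mean)

summary_by_group <- data.frame(n = as.vector(n), rate = as.vector(rate),
                               row.names = names(n))
summary_by_group
```

A rate of 1 next to a group size of 2 immediately signals that the 100% figure should not be taken at face value.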

Education Level vs Credit Approval

ggplot(data = crx, aes(x = EducationLevel, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Education Level') + ggtitle('Education Level vs Credit Approval')

Education level also seems to affect the credit approval rate: levels ‘x’ and ‘cc’ show a high approval rate, whereas level ‘ff’ shows a lower one.

Ethnic Group vs Credit Approval

ggplot(data = crx, aes(x = Ethnicity, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Ethnic Group') + ggtitle('Ethnic Group vs Credit Approval')

The ethnic group also seems to be related to the credit approval rate: people in groups ‘z’ and ‘h’ are less likely to be granted credit than those in other groups such as ‘ff’.

Prior Default vs Credit Approval

ggplot(data = crx, aes(x = PriorDefault, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Prior Default') + ggtitle('Prior Default vs Credit Approval')

There is a clear association between prior default and credit approval. We assume that the bank does not easily approve credit for people who have already defaulted, so we can recode the two categories accordingly:

crx$PriorDefault <- recode(crx$PriorDefault,
                               "f" = "Yes",
                               "t" = "No")

Employed vs Credit Approval

ggplot(data = crx, aes(x = Employed, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Employed') + ggtitle('Employed vs Credit Approval')

There seems to be a clear association between employment status and the credit approval rate. We can assume that category ‘f’ corresponds to the unemployed and category ‘t’ to the employed, so let’s recode the categories accordingly:

crx$Employed <- recode(crx$Employed,
                               "f" = "unemployed",
                               "t" = "employed")

Driver’s License vs Credit Approval

ggplot(data = crx, aes(x = DriversLicense, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Drivers License') + ggtitle('Drivers License vs Credit Approval')

There doesn’t seem to be a clear correlation between both variables.

Citizenship vs Credit Approval

ggplot(data = crx, aes(x = Citizen, fill = ApprovalStatus)) +
  geom_bar(position = "fill") +
  labs(y = "Rate", x = 'Citizenship') + ggtitle('Citizenship vs Credit Approval')

There seems to be a relation between the citizenship status and the credit approval rate.

Independence Test of the Categorical Variables against the Target Variable

To check whether each categorical variable is independent of the target variable, we will use the chi-square test of independence at the 5% significance level (95% confidence). The following code returns the p-value for each variable, rounded to two decimals.

categoricVars <- crx %>% dplyr::select(Gender, Married, BankCustomer, EducationLevel,
                                       Ethnicity, PriorDefault, Employed, DriversLicense,
                                       Citizen) 

sapply(categoricVars, 
       function(x) round(chisq.test(table(x, crx$ApprovalStatus))$p.value,2))

##         Gender        Married   BankCustomer EducationLevel      Ethnicity 
##           0.49           0.00           0.00           0.00           0.00 
##   PriorDefault       Employed DriversLicense        Citizen 
##           0.00           0.00           0.45           0.01

The variables Married, BankCustomer, EducationLevel, Ethnicity, PriorDefault, Employed and Citizen are dependent on the target variable (p < 0.05), whereas for Gender and DriversLicense we cannot reject independence. We will remove Gender and DriversLicense from our model.
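To show what the test does on a single table, here is a small sketch with a made-up 2 x 2 contingency table (the counts and the names `tab`, `test` and `small_cells` are assumptions for illustration). It also checks the expected counts, since the chi-square approximation becomes unreliable when they are small, which may be the case for categories with only a couple of observations.

```r
# Made-up 2 x 2 contingency table: rows = category, columns = approval status.
tab <- matrix(c(30, 10,
                15, 45),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("f", "t"), Approved = c("0", "1")))

test <- chisq.test(tab)   # Yates continuity correction is applied to 2 x 2 tables
test$p.value              # small p-value: reject independence at the 5% level
test$expected             # expected counts under independence

# Rule of thumb: the approximation is questionable if expected counts fall below 5.
# fisher.test() is an exact alternative for such tables.
small_cells <- any(test$expected < 5)
small_cells
```

In this toy table all expected counts are well above 5, so the chi-square approximation is reasonable.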

Numeric Variables vs Target Variable

Age vs Credit Approval

cdplot(crx$ApprovalStatus ~ crx$Age, main = "Age vs Credit Approval", 
       xlab = "Age", ylab = "Conditional Density" )

The plot shows that older applicants are more likely to have their credit approved, although beyond roughly 75 years of age the probability of approval seems to drop drastically.

Let’s see if a boxplot could provide more information:

ggplot(crx, aes(x= ApprovalStatus, y= Age, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Age", x = 'Credit Approval') + ggtitle('Age vs Credit Approval') +
scale_fill_brewer(palette = "Set2")

Here we can also see a probable association between age and credit approval: older applicants seem more likely to be approved.

Debt vs Credit Approval

cdplot(crx$ApprovalStatus ~ crx$Debt, main = "Debt vs Credit Approval", 
       xlab = "Debt", ylab = "Conditional Density" )

The plot describes a relationship in which the more debt applicants have, the more likely they are to be granted credit, although the approval rate dips at around 26 on the Debt axis before rising again.

ggplot(crx, aes(x= ApprovalStatus, y= Debt, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Debt", x = 'Credit Approval') + 
  ggtitle('Debt vs Credit Approval') +
scale_fill_brewer(palette = "Set2")

The box-plot seems to hint at the same as the previous plot.

Years Employed vs Credit Approval

ggplot(crx, aes(x= ApprovalStatus, y= YearsEmployed, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Years Employed", x = 'Credit Approval') + 
  ggtitle('Years Employed vs Credit Approval') +
scale_fill_brewer(palette = "Set2")

There seems to be a positive correlation between the years employed and the credit approval.

Credit Score vs Credit Approval

ggplot(crx, aes(x= ApprovalStatus, y= CreditScore, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Credit Score", x = 'Credit Approval') + 
  ggtitle('Credit Score vs Credit Approval') +
scale_fill_brewer(palette = "Set2")

The plot hints at a clear positive correlation between credit score and credit approval.

Income vs Credit Approval

# This plot contains extreme outliers, so we need to zoom it in

ggplot(crx, aes(x= ApprovalStatus, y= Income, fill= ApprovalStatus)) +
geom_boxplot() +
labs(y = "Income", x = 'Credit Approval') + 
  ggtitle('Income vs Credit Approval') +
scale_fill_brewer(palette = "Set2") +
  coord_cartesian(ylim=c(0, 2000)) #zoom

The graph shows what seems to be a clear positive relationship between income and credit approval.

Correlation Matrix

We’ll now plot a correlation matrix to check whether there is collinearity between the numeric variables.

numericVars <- data.frame(crx$Age, crx$Debt, crx$YearsEmployed, crx$CreditScore, crx$Income)

corrplot(cor(numericVars), method = "number", type="upper")

The largest value is 0.4, between YearsEmployed and Age, which makes sense. It is not large enough to cause collinearity problems, so we will include both variables in our model.
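The largest pairwise correlation can also be located programmatically instead of being read off the plot. The sketch below does this on simulated data (the variables `age`, `years` and `debt` are stand-ins built for illustration, not the real columns).

```r
set.seed(1)
# Simulated stand-ins for three numeric predictors (not the real data).
age   <- rnorm(200, mean = 35, sd = 10)
years <- 0.1 * age + rnorm(200)        # built to correlate with age
debt  <- rnorm(200)
toy   <- data.frame(age, years, debt)

cors <- cor(toy)
diag(cors) <- 0                        # ignore each variable's correlation with itself

# Position of the largest absolute correlation (first match of the symmetric pair).
idx <- which(abs(cors) == max(abs(cors)), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(cors)[idx["row"]], colnames(cors)[idx["col"]])
strongest_pair
```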

Normality in the Numeric Variables

It’s common practice to rescale numeric variables before modeling, so let’s first look at how they are distributed.

Let’s draw a normal Q-Q plot for each numeric variable and see whether it follows a normal distribution.

# Draw a normal Q-Q plot for every non-factor (numeric) column
for (col in names(crx)) {
  if (!is.factor(crx[[col]])) {
    qqnorm(crx[[col]], main = paste("Normality Plot:", col))
    qqline(crx[[col]])
  }
}

None of them seems to follow a normal distribution, but let’s double-check it using the Shapiro test.

sapply(numericVars, function(x) round(shapiro.test(x)$p.value,2))

##           crx.Age          crx.Debt crx.YearsEmployed   crx.CreditScore 
##                 0                 0                 0                 0 
##        crx.Income 
##                 0

The p-values obtained in the Shapiro-Wilk test are all close to 0, so in every case we reject the null hypothesis of normality and conclude that none of the variables follows a normal distribution.
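As a reference for how the test behaves, the sketch below applies it to a simulated right-skewed sample (the vector `skewed` is made up for illustration).

```r
set.seed(42)
skewed <- rexp(200)    # simulated right-skewed sample, clearly non-normal

result <- shapiro.test(skewed)
result$p.value         # far below 0.05, so normality is rejected

# A small p-value is evidence against normality; a large one is not proof of it.
reject_normality <- result$p.value < 0.05
reject_normality
```

Note that with several hundred observations the test detects even mild departures from normality, so the Q-Q plots remain a useful complement.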

EDA conclusions:

We need to normalize (standardize) all numeric variables.
There is no problematic collinearity between numeric variables.
The categorical variables Gender and DriversLicense don’t seem to influence the target variable; the rest do, to different degrees.
The categories ‘l’ and ‘gg’ of the variables ‘Married’ and ‘BankCustomer’, respectively, have only two observations each, all of which were granted credit. Both variables are presumably meant to be binary, so these two categories may have been recorded by mistake. We should remove them from our model.

Data Preparation

Before we split the data set into train and test sets, we need to normalize our numeric variables. Also, since we will use a binomial (logistic) regression model, it is good practice to one-hot encode our categorical variables.

Let’s first normalize the numeric variables with scale(), which centers each one to mean 0 and rescales it to standard deviation 1:

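# scale() returns a one-column matrix; as.numeric() keeps the columns as plain vectors
crx$Age <- as.numeric(scale(crx$Age))
crx$Debt <- as.numeric(scale(crx$Debt))
crx$YearsEmployed <- as.numeric(scale(crx$YearsEmployed))
crx$CreditScore <- as.numeric(scale(crx$CreditScore))
crx$Income <- as.numeric(scale(crx$Income))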
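It is worth being clear about what this step does and does not do. The sketch below uses a simulated skewed variable (`income` here is made up, not the real column) to show that scale() changes the location and spread of a variable but not the shape of its distribution.

```r
set.seed(7)
income <- rexp(500, rate = 1 / 1000)   # simulated right-skewed variable

z <- as.numeric(scale(income))         # (x - mean(x)) / sd(x)

c(mean = mean(z), sd = sd(z))          # approximately 0 and 1
cor(income, z)                         # 1: a linear rescaling keeps the shape
shapiro.test(z)$p.value                # still tiny: standardizing does not make data normal
```

In other words, the variables end up on a comparable scale, but they remain as skewed as they were before.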

Now, let’s remove the Gender and DriversLicense variables from our model:

crx$Gender <- NULL
crx$DriversLicense <- NULL

We can proceed with one hot encoding the categorical variables creating dummy variables for each of the categories.

df <- dummy_cols(crx, remove_selected_columns = TRUE)

Our target variable has now been split into two dummy variables, so we should remove them and add the original variable back. This is also a good moment to drop the two categories (now dummy variables) that we decided to remove earlier: ‘Married_l’ and ‘BankCustomer_gg’.

df$ApprovalStatus_0 <- NULL
df$ApprovalStatus_1 <- NULL
df$Married_l <- NULL
df$BankCustomer_gg <- NULL

df$ApprovalStatus <- crx$ApprovalStatus
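One property of a full set of dummy variables is worth keeping in mind: for each original variable the dummy columns add up to 1, so one of them is redundant once the model has an intercept. The base R sketch below illustrates this with `model.matrix()` on a made-up two-level factor (the data frame `toy` is an assumption for illustration). It is also why a pair such as PriorDefault_No and PriorDefault_Yes shows exactly the same deviance in the step-wise output further down.

```r
# Made-up two-level factor, mirroring a variable such as PriorDefault.
toy <- data.frame(PriorDefault = factor(c("Yes", "No", "No", "Yes", "No")))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
full_dummies <- model.matrix(~ PriorDefault - 1, data = toy)

# The columns always sum to 1, so one of them is redundant next to an intercept.
rowSums(full_dummies)

# Default treatment coding keeps the intercept and drops the first level.
reduced <- model.matrix(~ PriorDefault, data = toy)
colnames(reduced)
```

When a formula is given the original factor, R applies this reduced coding automatically.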

Now, let’s use a binomial generalized linear model (logistic regression) to decide which variables to keep. We will fit the model to the data and then run a step-wise selection that can both add and remove variables. For that we need two models: the full model, with the target variable regressed on all the predictors, as the upper limit of the search, and the intercept-only model as the starting point and lower limit.

fit1 <- glm(ApprovalStatus~., data=df, family=binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

fit0 <- glm(ApprovalStatus~1, data=df, family=binomial)

Now that we have fitted the full and the intercept-only models, we will run the step-wise selection with the parameter ‘direction’ set to ‘both’, so that at each step a variable can be added or removed. We will use the stepAIC function, which uses the Akaike Information Criterion (AIC) as its selection metric: the lower the AIC, the better the model.
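To make the AIC values in the step-wise output easier to read: for a logistic regression on a 0/1 response, the AIC equals the residual deviance plus twice the number of estimated coefficients (for example, 948.16 + 2 = 950.16 for the intercept-only model). The sketch below checks this relationship on simulated data; the variables `x` and `y` are made up and unrelated to the credit data.

```r
set.seed(123)
# Simulated binary outcome driven by one predictor (not the credit data).
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(-0.5 + 1.5 * x))

null_fit <- glm(y ~ 1, family = binomial)
full_fit <- glm(y ~ x, family = binomial)

# For a 0/1 response the deviance equals -2 * log-likelihood,
# so AIC = deviance + 2 * (number of estimated coefficients).
aic_by_hand <- deviance(full_fit) + 2 * length(coef(full_fit))
c(by_hand = aic_by_hand, builtin = AIC(full_fit))

# The model with the lower AIC is preferred.
AIC(null_fit) > AIC(full_fit)
```

The penalty of 2 per coefficient is what stops the selection from adding variables that barely reduce the deviance.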

step <- stepAIC(fit0, direction = "both", scope = list(upper = fit1, lower = fit0))

## Start:  AIC=950.16
## ApprovalStatus ~ 1
## 
##                       Df Deviance    AIC
## + PriorDefault_No      1   540.95 544.95
## + PriorDefault_Yes     1   540.95 544.95
## + CreditScore          1   762.74 766.74
## + Employed_employed    1   798.66 802.66
## + Employed_unemployed  1   798.66 802.66
## + Income               1   862.80 866.80
## + YearsEmployed        1   863.32 867.32
## + Debt                 1   918.36 922.36
## + EducationLevel_x     1   920.99 924.99
## + Married_y            1   922.64 926.64
## + BankCustomer_p       1   922.64 926.64
## + Ethnicity_ff         1   924.15 928.15
## + Ethnicity_h          1   924.16 928.16
## + EducationLevel_ff    1   924.72 928.72
## + Married_u            1   924.92 928.92
## + BankCustomer_g       1   924.92 928.92
## + Age                  1   929.91 933.91
## + EducationLevel_q     1   932.62 936.62
## + EducationLevel_cc    1   935.90 939.90
## + EducationLevel_i     1   936.81 940.81
## + Citizen_s            1   939.43 943.43
## + EducationLevel_k     1   941.39 945.39
## + EducationLevel_d     1   942.09 946.09
## + Citizen_g            1   942.51 946.51
## + EducationLevel_aa    1   946.06 950.06
## <none>                     948.16 950.16
## + Ethnicity_v          1   946.22 950.22
## + Ethnicity_z          1   946.34 950.34
## + EducationLevel_w     1   946.74 950.74
## + Citizen_p            1   947.10 951.10
## + EducationLevel_e     1   947.52 951.52
## + EducationLevel_r     1   947.55 951.55
## + Ethnicity_dd         1   947.68 951.68
## + EducationLevel_j     1   947.85 951.85
## + EducationLevel_m     1   947.95 951.95
## + Ethnicity_bb         1   948.06 952.06
## + Ethnicity_n          1   948.11 952.11
## + EducationLevel_c     1   948.11 952.11
## + Ethnicity_o          1   948.13 952.13
## + Ethnicity_j          1   948.16 952.16
## 
## Step:  AIC=544.95
## ApprovalStatus ~ PriorDefault_No
## 
##                       Df Deviance    AIC
## + CreditScore          1   503.21 509.21
## + Income               1   505.09 511.09
## + Employed_unemployed  1   507.67 513.67
## + Employed_employed    1   507.67 513.67
## + Citizen_p            1   523.48 529.48
## + Married_y            1   528.70 534.70
## + BankCustomer_p       1   528.70 534.70
## + EducationLevel_x     1   531.38 537.38
## + Married_u            1   531.98 537.98
## + BankCustomer_g       1   531.98 537.98
## + YearsEmployed        1   532.97 538.97
## + EducationLevel_aa    1   534.80 540.80
## + EducationLevel_cc    1   534.81 540.81
## + EducationLevel_ff    1   536.36 542.36
## + Ethnicity_ff         1   536.69 542.69
## + Ethnicity_h          1   537.32 543.32
## + EducationLevel_k     1   537.95 543.95
## + Citizen_s            1   537.97 543.97
## + Ethnicity_o          1   538.22 544.22
## + EducationLevel_i     1   538.60 544.60
## + Ethnicity_n          1   538.72 544.72
## <none>                     540.95 544.95
## + EducationLevel_d     1   539.15 545.15
## + EducationLevel_q     1   539.20 545.20
## + Ethnicity_j          1   539.35 545.35
## + Ethnicity_bb         1   539.36 545.36
## + Debt                 1   539.69 545.69
## + EducationLevel_w     1   539.88 545.88
## + Ethnicity_dd         1   539.93 545.93
## + EducationLevel_e     1   540.26 546.26
## + Ethnicity_v          1   540.48 546.48
## + EducationLevel_r     1   540.59 546.59
## + Age                  1   540.69 546.69
## + EducationLevel_m     1   540.69 546.69
## + EducationLevel_j     1   540.83 546.83
## + Ethnicity_z          1   540.85 546.85
## + EducationLevel_c     1   540.89 546.89
## + Citizen_g            1   540.92 546.92
## - PriorDefault_No      1   948.16 950.16
## 
## Step:  AIC=509.21
## ApprovalStatus ~ PriorDefault_No + CreditScore
## 
##                       Df Deviance    AIC
## + Income               1   477.87 485.87
## + Citizen_p            1   484.26 492.26
## + EducationLevel_x     1   492.76 500.76
## + Married_y            1   494.87 502.87
## + BankCustomer_p       1   494.87 502.87
## + Ethnicity_ff         1   497.51 505.51
## + Married_u            1   497.62 505.62
## + BankCustomer_g       1   497.62 505.62
## + EducationLevel_ff    1   497.72 505.72
## + EducationLevel_cc    1   497.84 505.84
## + Employed_employed    1   498.33 506.33
## + Employed_unemployed  1   498.33 506.33
## + YearsEmployed        1   499.68 507.68
## + Ethnicity_h          1   499.76 507.76
## + EducationLevel_aa    1   499.85 507.85
## + Ethnicity_o          1   500.21 508.21
## + EducationLevel_k     1   500.79 508.79
## + EducationLevel_i     1   500.88 508.88
## <none>                     503.21 509.21
## + Ethnicity_j          1   501.27 509.27
## + Ethnicity_bb         1   501.27 509.27
## + Ethnicity_n          1   501.28 509.28
## + EducationLevel_d     1   501.70 509.70
## + EducationLevel_w     1   501.72 509.72
## + Ethnicity_dd         1   502.00 510.00
## + Citizen_g            1   502.06 510.06
## + EducationLevel_q     1   502.17 510.17
## + EducationLevel_e     1   502.47 510.47
## + Citizen_s            1   502.69 510.69
## + Ethnicity_z          1   502.82 510.82
## + EducationLevel_r     1   502.86 510.86
## + Ethnicity_v          1   503.05 511.05
## + EducationLevel_m     1   503.06 511.06
## + EducationLevel_j     1   503.13 511.13
## + Age                  1   503.17 511.17
## + Debt                 1   503.19 511.19
## + EducationLevel_c     1   503.21 511.21
## - CreditScore          1   540.95 544.95
## - PriorDefault_No      1   762.74 766.74
## 
## Step:  AIC=485.87
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income
## 
##                       Df Deviance    AIC
## + Citizen_p            1   462.37 472.37
## + EducationLevel_x     1   468.30 478.30
## + EducationLevel_ff    1   470.00 480.00
## + Married_y            1   470.29 480.29
## + BankCustomer_p       1   470.29 480.29
## + Married_u            1   471.72 481.72
## + BankCustomer_g       1   471.72 481.72
## + Ethnicity_ff         1   472.52 482.52
## + EducationLevel_cc    1   472.64 482.64
## + Employed_unemployed  1   473.35 483.35
## + Employed_employed    1   473.35 483.35
## + YearsEmployed        1   473.93 483.93
## + Ethnicity_h          1   474.45 484.45
## + EducationLevel_aa    1   475.33 485.33
## + Ethnicity_j          1   475.57 485.57
## + Ethnicity_n          1   475.74 485.74
## + EducationLevel_i     1   475.76 485.76
## + Ethnicity_bb         1   475.85 485.85
## <none>                     477.87 485.87
## + EducationLevel_k     1   476.08 486.08
## + Citizen_g            1   476.19 486.19
## + EducationLevel_w     1   476.44 486.44
## + Ethnicity_dd         1   476.57 486.57
## + EducationLevel_q     1   476.57 486.57
## + EducationLevel_d     1   477.00 487.00
## + Ethnicity_z          1   477.05 487.05
## + EducationLevel_e     1   477.35 487.35
## + EducationLevel_m     1   477.71 487.71
## + EducationLevel_j     1   477.72 487.72
## + Debt                 1   477.73 487.73
## + Ethnicity_o          1   477.76 487.76
## + Age                  1   477.82 487.82
## + Citizen_s            1   477.84 487.84
## + Ethnicity_v          1   477.85 487.85
## + EducationLevel_r     1   477.86 487.86
## + EducationLevel_c     1   477.87 487.87
## - Income               1   503.21 509.21
## - CreditScore          1   505.09 511.09
## - PriorDefault_No      1   723.20 729.20
## 
## Step:  AIC=472.37
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p
## 
##                       Df Deviance    AIC
## + EducationLevel_x     1   452.34 464.34
## + EducationLevel_ff    1   454.25 466.25
## + Married_y            1   456.07 468.07
## + BankCustomer_p       1   456.07 468.07
## + EducationLevel_cc    1   456.58 468.58
## + Employed_unemployed  1   456.73 468.73
## + Employed_employed    1   456.73 468.73
## + Ethnicity_ff         1   456.89 468.89
## + Married_u            1   457.40 469.40
## + BankCustomer_g       1   457.40 469.40
## + YearsEmployed        1   457.91 469.91
## + Ethnicity_h          1   458.34 470.34
## + EducationLevel_i     1   459.38 471.38
## + Ethnicity_bb         1   459.40 471.40
## + Ethnicity_n          1   459.86 471.86
## + EducationLevel_aa    1   460.15 472.15
## <none>                     462.37 472.37
## + EducationLevel_w     1   460.55 472.55
## + EducationLevel_q     1   460.83 472.83
## + EducationLevel_k     1   460.94 472.94
## + Ethnicity_j          1   461.40 473.40
## + Ethnicity_z          1   461.55 473.55
## + EducationLevel_d     1   461.69 473.69
## + Age                  1   462.15 474.15
## + EducationLevel_m     1   462.28 474.28
## + Ethnicity_o          1   462.28 474.28
## + Ethnicity_v          1   462.30 474.30
## + EducationLevel_e     1   462.34 474.34
## + EducationLevel_r     1   462.35 474.35
## + Debt                 1   462.36 474.36
## + EducationLevel_j     1   462.36 474.36
## + Ethnicity_dd         1   462.36 474.36
## + EducationLevel_c     1   462.36 474.36
## + Citizen_g            1   462.37 474.37
## + Citizen_s            1   462.37 474.37
## - Citizen_p            1   477.87 485.87
## - Income               1   484.26 492.26
## - CreditScore          1   490.12 498.12
## - PriorDefault_No      1   719.88 727.88
## 
## Step:  AIC=464.34
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x
## 
##                       Df Deviance    AIC
## + Married_y            1   444.97 458.97
## + BankCustomer_p       1   444.97 458.97
## + EducationLevel_ff    1   445.13 459.13
## + EducationLevel_cc    1   445.65 459.65
## + Married_u            1   446.44 460.44
## + BankCustomer_g       1   446.44 460.44
## + Ethnicity_ff         1   447.63 461.63
## + Employed_unemployed  1   447.75 461.75
## + Employed_employed    1   447.75 461.75
## + YearsEmployed        1   448.32 462.32
## + Ethnicity_n          1   449.70 463.70
## + EducationLevel_w     1   449.77 463.77
## + EducationLevel_q     1   449.92 463.92
## + Ethnicity_bb         1   449.97 463.97
## + EducationLevel_i     1   449.99 463.99
## + Ethnicity_h          1   450.18 464.18
## <none>                     452.34 464.34
## + EducationLevel_aa    1   450.82 464.82
## + Ethnicity_j          1   451.25 465.25
## + EducationLevel_k     1   451.37 465.37
## + Ethnicity_z          1   451.69 465.69
## + EducationLevel_d     1   451.88 465.88
## + Ethnicity_v          1   452.08 466.08
## + Age                  1   452.14 466.14
## + EducationLevel_c     1   452.15 466.15
## + EducationLevel_e     1   452.25 466.25
## + Ethnicity_o          1   452.26 466.26
## + Debt                 1   452.26 466.26
## + Citizen_g            1   452.30 466.30
## + Citizen_s            1   452.30 466.30
## + EducationLevel_r     1   452.31 466.31
## + Ethnicity_dd         1   452.32 466.32
## + EducationLevel_m     1   452.33 466.33
## + EducationLevel_j     1   452.34 466.34
## - EducationLevel_x     1   462.37 472.37
## - Citizen_p            1   468.30 478.30
## - Income               1   473.18 483.18
## - CreditScore          1   480.55 490.55
## - PriorDefault_No      1   697.98 707.98
## 
## Step:  AIC=458.97
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y
## 
##                       Df Deviance    AIC
## + EducationLevel_cc    1   437.84 453.84
## + EducationLevel_ff    1   438.29 454.29
## + Married_u            1   438.70 454.70
## + BankCustomer_g       1   438.70 454.70
## + Ethnicity_ff         1   440.73 456.73
## + Employed_unemployed  1   441.06 457.06
## + Employed_employed    1   441.06 457.06
## + YearsEmployed        1   441.28 457.28
## + EducationLevel_w     1   441.97 457.97
## + Ethnicity_bb         1   442.12 458.12
## + EducationLevel_i     1   442.24 458.24
## + Ethnicity_n          1   442.33 458.33
## + Ethnicity_h          1   442.66 458.66
## <none>                     444.97 458.97
## + EducationLevel_q     1   443.61 459.61
## + Ethnicity_j          1   443.89 459.89
## + EducationLevel_aa    1   443.92 459.92
## + Ethnicity_z          1   444.02 460.02
## + EducationLevel_k     1   444.07 460.07
## + Age                  1   444.41 460.41
## + EducationLevel_d     1   444.63 460.63
## + Ethnicity_v          1   444.64 460.64
## + EducationLevel_c     1   444.66 460.66
## + Debt                 1   444.83 460.83
## + Ethnicity_o          1   444.87 460.87
## + EducationLevel_m     1   444.90 460.90
## + EducationLevel_e     1   444.92 460.92
## + EducationLevel_r     1   444.93 460.93
## + Ethnicity_dd         1   444.96 460.96
## + EducationLevel_j     1   444.97 460.97
## + Citizen_g            1   444.97 460.97
## + Citizen_s            1   444.97 460.97
## - Married_y            1   452.34 464.34
## - EducationLevel_x     1   456.07 468.07
## - Citizen_p            1   459.56 471.56
## - Income               1   465.83 477.83
## - CreditScore          1   471.44 483.44
## - PriorDefault_No      1   686.84 698.84
## 
## Step:  AIC=453.84
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc
## 
##                       Df Deviance    AIC
## + EducationLevel_ff    1   431.93 449.93
## + Employed_employed    1   433.94 451.94
## + Employed_unemployed  1   433.94 451.94
## + EducationLevel_w     1   433.96 451.96
## + Married_u            1   434.02 452.02
## + BankCustomer_g       1   434.02 452.02
## + Ethnicity_ff         1   434.06 452.06
## + YearsEmployed        1   434.90 452.90
## + Ethnicity_n          1   435.02 453.02
## + Ethnicity_bb         1   435.49 453.49
## + EducationLevel_i     1   435.65 453.65
## + EducationLevel_q     1   435.79 453.79
## <none>                     437.84 453.84
## + Ethnicity_h          1   436.15 454.15
## + Ethnicity_j          1   436.62 454.62
## + Ethnicity_z          1   437.03 455.03
## + EducationLevel_c     1   437.07 455.07
## + EducationLevel_aa    1   437.22 455.22
## + EducationLevel_k     1   437.27 455.27
## + Age                  1   437.46 455.46
## + Ethnicity_v          1   437.49 455.49
## + EducationLevel_d     1   437.65 455.65
## + EducationLevel_e     1   437.73 455.73
## + Ethnicity_o          1   437.75 455.75
## + Debt                 1   437.77 455.77
## + EducationLevel_r     1   437.79 455.79
## + Ethnicity_dd         1   437.82 455.82
## + EducationLevel_m     1   437.83 455.83
## + Citizen_g            1   437.83 455.83
## + Citizen_s            1   437.83 455.83
## + EducationLevel_j     1   437.84 455.84
## - EducationLevel_cc    1   444.97 458.97
## - Married_y            1   445.65 459.65
## - EducationLevel_x     1   449.94 463.94
## - Citizen_p            1   453.02 467.02
## - Income               1   458.56 472.56
## - CreditScore          1   462.58 476.58
## - PriorDefault_No      1   677.25 691.25
## 
## Step:  AIC=449.93
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff
## 
##                       Df Deviance    AIC
## + Employed_unemployed  1   427.78 447.78
## + Employed_employed    1   427.78 447.78
## + Married_u            1   428.14 448.14
## + BankCustomer_g       1   428.14 448.14
## + Ethnicity_bb         1   428.77 448.77
## + EducationLevel_w     1   428.89 448.89
## + EducationLevel_i     1   428.96 448.96
## + Ethnicity_n          1   429.33 449.33
## + YearsEmployed        1   429.73 449.73
## <none>                     431.93 449.93
## + Ethnicity_ff         1   430.28 450.28
## + EducationLevel_q     1   430.47 450.47
## + Ethnicity_h          1   430.80 450.80
## + Ethnicity_j          1   430.93 450.93
## + EducationLevel_aa    1   430.94 450.94
## + Ethnicity_z          1   431.00 451.00
## + EducationLevel_k     1   431.03 451.03
## + EducationLevel_d     1   431.60 451.60
## + EducationLevel_c     1   431.61 451.61
## + Ethnicity_o          1   431.83 451.83
## + EducationLevel_m     1   431.85 451.85
## + EducationLevel_r     1   431.88 451.88
## + EducationLevel_e     1   431.90 451.90
## + Ethnicity_v          1   431.91 451.91
## + EducationLevel_j     1   431.93 451.93
## + Citizen_g            1   431.93 451.93
## + Citizen_s            1   431.93 451.93
## + Debt                 1   431.93 451.93
## + Age                  1   431.93 451.93
## + Ethnicity_dd         1   431.93 451.93
## - EducationLevel_ff    1   437.84 453.84
## - EducationLevel_cc    1   438.29 454.29
## - Married_y            1   439.22 455.22
## - EducationLevel_x     1   443.05 459.05
## - Citizen_p            1   447.36 463.36
## - Income               1   453.80 469.80
## - CreditScore          1   456.92 472.92
## - PriorDefault_No      1   655.07 671.07
## 
## Step:  AIC=447.78
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff + 
##     Employed_unemployed
## 
##                       Df Deviance    AIC
## + Married_u            1   423.81 445.81
## + BankCustomer_g       1   423.81 445.81
## + YearsEmployed        1   425.22 447.22
## + Ethnicity_bb         1   425.25 447.25
## + EducationLevel_i     1   425.31 447.31
## + EducationLevel_w     1   425.37 447.37
## + Ethnicity_n          1   425.40 447.40
## <none>                     427.78 447.78
## + Ethnicity_ff         1   425.92 447.92
## + Ethnicity_h          1   426.39 448.39
## + Ethnicity_j          1   426.65 448.65
## + Ethnicity_z          1   426.66 448.66
## + EducationLevel_q     1   426.81 448.81
## + EducationLevel_k     1   426.91 448.91
## + EducationLevel_aa    1   427.04 449.04
## + EducationLevel_c     1   427.44 449.44
## + Ethnicity_v          1   427.64 449.64
## + Citizen_s            1   427.64 449.64
## + Citizen_g            1   427.64 449.64
## + EducationLevel_d     1   427.67 449.67
## + EducationLevel_m     1   427.67 449.67
## + Ethnicity_o          1   427.69 449.69
## + EducationLevel_r     1   427.71 449.71
## + Age                  1   427.74 449.74
## + EducationLevel_e     1   427.77 449.77
## + EducationLevel_j     1   427.77 449.77
## + Debt                 1   427.78 449.78
## + Ethnicity_dd         1   427.78 449.78
## - Employed_unemployed  1   431.93 449.93
## - CreditScore          1   433.61 451.61
## - EducationLevel_ff    1   433.94 451.94
## - EducationLevel_cc    1   434.11 452.11
## - Married_y            1   434.28 452.28
## - EducationLevel_x     1   437.59 455.59
## - Citizen_p            1   444.13 462.13
## - Income               1   448.23 466.23
## - PriorDefault_No      1   643.83 661.83
## 
## Step:  AIC=445.81
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff + 
##     Employed_unemployed + Married_u
## 
##                       Df Deviance    AIC
## + Ethnicity_bb         1   421.31 445.31
## + EducationLevel_i     1   421.37 445.37
## + Ethnicity_n          1   421.37 445.37
## + EducationLevel_w     1   421.41 445.41
## <none>                     423.81 445.81
## + YearsEmployed        1   422.09 446.09
## + Ethnicity_h          1   422.26 446.26
## + Ethnicity_j          1   422.66 446.66
## + Ethnicity_z          1   422.67 446.67
## + EducationLevel_q     1   422.85 446.85
## + EducationLevel_k     1   422.94 446.94
## + EducationLevel_aa    1   423.05 447.05
## + EducationLevel_c     1   423.47 447.47
## + EducationLevel_d     1   423.70 447.70
## + EducationLevel_m     1   423.70 447.70
## + Ethnicity_o          1   423.72 447.72
## + Age                  1   423.72 447.72
## + EducationLevel_r     1   423.74 447.74
## + Ethnicity_v          1   423.76 447.76
## - Married_u            1   427.78 447.78
## - EducationLevel_cc    1   427.79 447.79
## + EducationLevel_e     1   423.79 447.79
## + Debt                 1   423.80 447.80
## + Ethnicity_ff         1   423.80 447.80
## + EducationLevel_j     1   423.80 447.80
## + Citizen_g            1   423.81 447.81
## + Citizen_s            1   423.81 447.81
## + Ethnicity_dd         1   423.81 447.81
## - Employed_unemployed  1   428.14 448.14
## - Married_y            1   429.08 449.08
## - CreditScore          1   429.55 449.55
## - EducationLevel_ff    1   429.93 449.93
## - EducationLevel_x     1   433.53 453.53
## - Citizen_p            1   440.46 460.46
## - Income               1   442.35 462.35
## - PriorDefault_No      1   642.38 662.38
## 
## Step:  AIC=445.31
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff + 
##     Employed_unemployed + Married_u + Ethnicity_bb
## 
##                       Df Deviance    AIC
## + Ethnicity_n          1   418.96 444.96
## <none>                     421.31 445.31
## + EducationLevel_w     1   419.31 445.31
## + YearsEmployed        1   419.57 445.57
## - Ethnicity_bb         1   423.81 445.81
## + Ethnicity_z          1   419.99 445.99
## + Ethnicity_v          1   420.19 446.19
## + EducationLevel_k     1   420.22 446.22
## + EducationLevel_aa    1   420.24 446.24
## + Ethnicity_j          1   420.36 446.36
## + Ethnicity_h          1   420.37 446.37
## + EducationLevel_i     1   420.52 446.52
## + EducationLevel_q     1   420.75 446.75
## + EducationLevel_c     1   420.83 446.83
## - EducationLevel_cc    1   424.85 446.85
## - Employed_unemployed  1   425.00 447.00
## + Age                  1   421.12 447.12
## + EducationLevel_m     1   421.14 447.14
## + EducationLevel_d     1   421.20 447.20
## + Ethnicity_o          1   421.22 447.22
## - Married_u            1   425.25 447.25
## + EducationLevel_r     1   421.25 447.25
## + Ethnicity_dd         1   421.29 447.29
## + Ethnicity_ff         1   421.30 447.30
## + Citizen_g            1   421.30 447.30
## + Citizen_s            1   421.30 447.30
## + Debt                 1   421.30 447.30
## + EducationLevel_e     1   421.31 447.31
## + EducationLevel_j     1   421.31 447.31
## - Married_y            1   426.59 448.59
## - CreditScore          1   427.45 449.45
## - EducationLevel_ff    1   428.12 450.12
## - EducationLevel_x     1   430.50 452.50
## - Citizen_p            1   438.80 460.80
## - Income               1   440.11 462.11
## - PriorDefault_No      1   641.92 663.92
## 
## Step:  AIC=444.96
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff + 
##     Employed_unemployed + Married_u + Ethnicity_bb + Ethnicity_n
## 
##                       Df Deviance    AIC
## + EducationLevel_w     1   416.78 444.78
## <none>                     418.96 444.96
## + YearsEmployed        1   417.21 445.21
## - Ethnicity_n          1   421.31 445.31
## - Ethnicity_bb         1   421.37 445.37
## + Ethnicity_z          1   417.67 445.67
## + Ethnicity_h          1   417.93 445.93
## + EducationLevel_k     1   417.96 445.96
## + EducationLevel_aa    1   417.96 445.96
## + Ethnicity_j          1   417.97 445.97
## + EducationLevel_i     1   418.23 446.23
## + Ethnicity_v          1   418.24 446.24
## - Employed_unemployed  1   422.46 446.46
## + EducationLevel_q     1   418.47 446.47
## + EducationLevel_c     1   418.58 446.58
## - EducationLevel_cc    1   422.61 446.61
## + Age                  1   418.71 446.71
## + EducationLevel_r     1   418.73 446.73
## + EducationLevel_m     1   418.82 446.82
## + EducationLevel_d     1   418.87 446.87
## + Ethnicity_o          1   418.87 446.87
## + Citizen_g            1   418.94 446.94
## + Citizen_s            1   418.94 446.94
## + Ethnicity_dd         1   418.94 446.94
## + Ethnicity_ff         1   418.95 446.95
## - Married_u            1   422.95 446.95
## + EducationLevel_e     1   418.95 446.95
## + Debt                 1   418.96 446.96
## + EducationLevel_j     1   418.96 446.96
## - Married_y            1   424.30 448.30
## - CreditScore          1   425.11 449.11
## - EducationLevel_ff    1   425.50 449.50
## - EducationLevel_x     1   428.30 452.30
## - Citizen_p            1   436.80 460.80
## - Income               1   437.87 461.87
## - PriorDefault_No      1   641.88 665.88
## 
## Step:  AIC=444.78
## ApprovalStatus ~ PriorDefault_No + CreditScore + Income + Citizen_p + 
##     EducationLevel_x + Married_y + EducationLevel_cc + EducationLevel_ff + 
##     Employed_unemployed + Married_u + Ethnicity_bb + Ethnicity_n + 
##     EducationLevel_w
## 
##                       Df Deviance    AIC
## <none>                     416.78 444.78
## - Ethnicity_bb         1   418.78 444.78
## - EducationLevel_w     1   418.96 444.96
## + YearsEmployed        1   415.01 445.01
## - Ethnicity_n          1   419.31 445.31
## + Ethnicity_h          1   415.41 445.41
## + Ethnicity_j          1   415.64 445.64
## + Ethnicity_z          1   415.65 445.65
## + Ethnicity_v          1   415.66 445.66
## - Employed_unemployed  1   419.80 445.80
## + EducationLevel_q     1   415.82 445.82
## + EducationLevel_c     1   415.89 445.89
## + EducationLevel_k     1   416.11 446.11
## + EducationLevel_aa    1   416.16 446.16
## + EducationLevel_i     1   416.24 446.24
## + Age                  1   416.47 446.47
## + EducationLevel_r     1   416.55 446.55
## + Ethnicity_o          1   416.70 446.70
## + EducationLevel_m     1   416.74 446.74
## + EducationLevel_d     1   416.74 446.74
## + EducationLevel_e     1   416.74 446.74
## + Citizen_g            1   416.75 446.75
## + Citizen_s            1   416.75 446.75
## + Debt                 1   416.77 446.77
## - Married_u            1   420.77 446.77
## + Ethnicity_ff         1   416.77 446.77
## + Ethnicity_dd         1   416.78 446.78
## + EducationLevel_j     1   416.78 446.78
## - EducationLevel_cc    1   420.96 446.96
## - Married_y            1   422.16 448.16
## - EducationLevel_ff    1   422.43 448.43
## - CreditScore          1   423.52 449.52
## - EducationLevel_x     1   426.98 452.98
## - Citizen_p            1   435.02 461.02
## - Income               1   435.26 461.26
## - PriorDefault_No      1   640.08 666.08
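
The AIC column in the trace above can be reproduced from the Deviance column. The sketch below assumes the usual relation for a binary-response logistic model, AIC = deviance + 2k, where k counts the predictors plus the intercept; the helper name aic_from_deviance is made up for illustration.

```r
# AIC as reported in the stepwise trace: deviance + 2 * (number of parameters).
# aic_from_deviance is a hypothetical helper, not part of any package.
aic_from_deviance <- function(deviance, n_predictors) {
  k <- n_predictors + 1  # predictors plus the intercept
  deviance + 2 * k
}

# Last step of the trace: 13 predictors, residual deviance 416.78
aic_last <- aic_from_deviance(416.78, 13)

# Earlier step with 5 predictors, residual deviance 452.34
aic_early <- aic_from_deviance(452.34, 5)

print(c(aic_last = aic_last, aic_early = aic_early))
```

Both values agree with the "Step: AIC=" lines printed for those models (444.78 and 464.34).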

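The stepwise search stops at AIC = 444.78, and we will carry the selected variables into our final model. The list below leaves out Ethnicity_bb, whose removal the last step reports as leaving the AIC unchanged at 444.78: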

myvars <- c("PriorDefault_No", "CreditScore", "Income", "Citizen_p", "EducationLevel_x",
            "EducationLevel_ff", "Married_y", "EducationLevel_cc", "Employed_unemployed",
            "Married_u", "Ethnicity_n", "EducationLevel_w", "ApprovalStatus")
df <- df[myvars]

We can now separate the features (X) from the target variable (Y). Both are converted to numeric form (a matrix for X, a numeric vector for Y) so that they can be processed by the glmnet functions.

# X and Y
X <- data.matrix(subset(df, select= - ApprovalStatus))
Y <- as.double(as.matrix(df$ApprovalStatus))

We are now ready to split both data sets into training and test sets. The first 590 rows are used for training and the remaining 100 for testing.

# TRAIN
X_Train <- X[1:590, ]  # R indices start at 1
Y_Train <- Y[1:590]

# TEST
X_Test <- X[591:nrow(X), ]
Y_Test <- Y[591:length(Y)]
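
The split above is positional: it keeps the rows in their original order. If the row order carries any structure, a shuffled split is a common alternative. The following is a minimal sketch on toy data with the same number of rows (690); the seed and every name ending in _toy are assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

# Toy stand-ins for X and Y with 690 rows
n <- 690
X_toy <- matrix(rnorm(n * 3), nrow = n)
Y_toy <- rbinom(n, size = 1, prob = 0.45)

# Shuffle the row indices, then keep 590 for training and the rest for testing
idx <- sample(n)
train_idx <- idx[1:590]
test_idx  <- idx[591:n]

X_Train_toy <- X_toy[train_idx, ]
Y_Train_toy <- Y_toy[train_idx]
X_Test_toy  <- X_toy[test_idx, ]
Y_Test_toy  <- Y_toy[test_idx]

# Compare the share of positives in each part
print(c(train = mean(Y_Train_toy), test = mean(Y_Test_toy)))
```

Comparing the share of positives in the two parts is a quick way to see whether the split is roughly balanced.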

Logistic Regression Model:

We have a binary classification problem (whether to approve credit or not), so we will build a Logistic Regression model.

We need a model that predicts as accurately as possible whether to approve a credit, but we also need to minimize the number of false positives, since false positives would make our bank lose money by granting credits that it shouldn’t. For that reason, we will use the Area Under the ROC Curve (AUC) as our evaluation metric.

The ROC curve plots the false positive rate (x-axis) against the true positive rate (y-axis) for a range of candidate thresholds between 0.0 and 1.0. The area under this curve summarizes, in a single threshold-independent number, how well the model ranks approved applications above denied ones: 0.5 corresponds to random ranking and 1.0 to perfect separation.
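
To make the definition concrete, here is a small base-R sketch that computes ROC points and the AUC by hand. The labels and scores are invented for illustration, and the AUC uses the rank (Mann-Whitney) formulation rather than whatever glmnet does internally.

```r
# Hypothetical labels and scores, purely for illustration
y      <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90)

# One ROC point per threshold: predict 1 when the score is >= the threshold
roc_point <- function(threshold) {
  pred <- as.numeric(scores >= threshold)
  c(fpr = sum(pred == 1 & y == 0) / sum(y == 0),
    tpr = sum(pred == 1 & y == 1) / sum(y == 1))
}
roc <- t(sapply(seq(0, 1, by = 0.05), roc_point))

# AUC via ranks: the probability that a randomly chosen positive
# receives a higher score than a randomly chosen negative
n1  <- sum(y == 1)
n0  <- sum(y == 0)
auc <- (sum(rank(scores)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc)
```

In this toy example 15 of the 16 positive-negative pairs are ordered correctly, so the AUC is 15/16.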

For better results we will also use regularization. To choose between Lasso and Ridge, we will use an Elastic-Net model.

Elastic-Net lets us select Ridge regularization by setting the alpha parameter to 0 and Lasso regularization by setting it to 1. We will fit both and compare their AUCs.
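
The role of alpha can be seen in the penalty term itself. The sketch below writes out the Elastic-Net penalty in the form glmnet documents it, lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))); the function name and the coefficient values are hypothetical.

```r
# Elastic-Net penalty in the glmnet parameterisation.
# elastic_net_penalty is a hypothetical helper written for this sketch.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(2, -1, 0.5)  # hypothetical coefficients (intercept excluded)

ridge_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # pure Ridge
lasso_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # pure Lasso
mixed_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # a blend

print(c(ridge = ridge_pen, lasso = lasso_pen, mixed = mixed_pen))
```

With alpha = 0 only the squared term remains, and with alpha = 1 only the absolute-value term remains, which is why the same function covers both models.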

Ridge Regularization

# We will use the glmnet as Elastic-Net, cv. stands for cross-validation
cv.ridge <- cv.glmnet(X_Train, Y_Train, family='binomial', alpha=0, parallel=TRUE, standardize=TRUE, type.measure='auc')

## Warning: executing %dopar% sequentially: no parallel backend registered

plot(cv.ridge)

Now that we have the model, let’s check the value for the biggest AUC.

max(cv.ridge$cvm)

## [1] 0.9230839

The biggest AUC value is approximately 0.923. Let’s see which lambda corresponds to it:

match(max(cv.ridge$cvm),cv.ridge$cvm)

## [1] 94

cv.ridge$lambda[95] == cv.ridge$lambda.min

## [1] FALSE

cv.ridge$lambda.min

## [1] 0.06161378

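The lambda value that gives us the biggest AUC is lambda.min, approximately 0.062. The comparison above returns FALSE because it checks index 95, whereas match() reported index 94.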

Lasso Regularization:

# We set the parameter 'alpha' to 1 so we can use Lasso
cv.lasso <- cv.glmnet(X_Train, Y_Train, family='binomial', alpha=1, parallel=TRUE, standardize=TRUE, type.measure='auc')

Let’s see how the AUC behaves across different lambda values:

plot(cv.lasso)

Let’s calculate the optimal lambda:

cv.lasso$lambda.min

## [1] 0.001102055

Let’s compare both maximum AUCs:

max(cv.lasso$cvm)

## [1] 0.9294144

max(cv.ridge$cvm) - max(cv.lasso$cvm)

## [1] -0.006330498

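Both return almost the same result. The difference is negative, so Lasso’s cross-validated AUC is in fact higher, by about 0.006. The gap is negligible, and the rest of the analysis uses the Ridge model.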

Testing the model.

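Let’s try our Logistic Regression model with Ridge regularization to see how well it predicts. Note that predict.glmnet() returns the linear predictor (log-odds) rather than a probability, so the 0.5 cut-off below corresponds to a predicted probability of about 0.62.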

y_pred <- as.numeric(predict.glmnet(cv.ridge$glmnet.fit, newx=X_Test, s=cv.ridge$lambda.min)>.5)

y_pred

##   [1] 1 1 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
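
The relation between a cut-off on the linear predictor and a cut-off on the probability can be checked with base R alone. This is a minimal sketch using invented log-odds values in place of the model’s output; plogis() is the standard logistic function.

```r
# Hypothetical linear predictors (log-odds), standing in for model output
link <- c(-2.0, 0.2, 0.5, 0.6, 1.5)

# The logistic function maps log-odds to probabilities
prob <- plogis(link)

# Cutting the log-odds at 0.5 is the same as cutting the probability at plogis(0.5)
cut_prob      <- plogis(0.5)
pred_link_cut <- as.numeric(link > 0.5)
pred_prob_cut <- as.numeric(prob > cut_prob)

# A probability cut-off of 0.5 corresponds to a log-odds cut-off of 0
pred_half <- as.numeric(prob > 0.5)

print(rbind(link, prob, pred_link_cut, pred_half))
```

If probabilities are wanted directly from the fitted object, one option would be predict(cv.ridge, newx = X_Test, s = "lambda.min", type = "response"); this would change which observations cross the 0.5 line, so the results shown in this section would not carry over unchanged.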

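Now we create a confusion matrix to compare the actual outcome with our predicted outcome. Because the call passes the actual values first and the predictions second, the rows labeled Prediction hold the actual outcomes and the columns labeled Reference hold the model’s predictions: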

confusionMatrix(as.factor(Y_Test), as.factor(y_pred), mode="everything")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85  1
##          1  8  6
##                                         
##                Accuracy : 0.91          
##                  95% CI : (0.836, 0.958)
##     No Information Rate : 0.93          
##     P-Value [Acc > NIR] : 0.8380        
##                                         
##                   Kappa : 0.5273        
##                                         
##  Mcnemar's Test P-Value : 0.0455        
##                                         
##             Sensitivity : 0.9140        
##             Specificity : 0.8571        
##          Pos Pred Value : 0.9884        
##          Neg Pred Value : 0.4286        
##               Precision : 0.9884        
##                  Recall : 0.9140        
##                      F1 : 0.9497        
##              Prevalence : 0.9300        
##          Detection Rate : 0.8500        
##    Detection Prevalence : 0.8600        
##       Balanced Accuracy : 0.8856        
##                                         
##        'Positive' Class : 0             
##

We have a model with an accuracy of 91%, a recall of 91.4%, an F1 of 94.97% and a precision of 98.84%, all computed with class 0 (denial) as the positive class.

cTab    <- table(Y_Test, y_pred)    # Confusion Matrix
addmargins(cTab)

##       y_pred
## Y_Test   0   1 Sum
##    0    85   1  86
##    1     8   6  14
##    Sum  93   7 100

In the confusion matrix we had just one false positive out of 100 predictions; 6 applications were correctly approved and 85 were correctly denied. We also had 8 false negatives.
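
The same counts can be turned into metrics that treat approval (class 1) as the positive class, which is the view taken in the paragraph above. The sketch below only uses the four counts from the table; treating class 1 as positive is an assumption of this example, not what the caret output reports.

```r
# Counts from the table above (rows: actual, columns: predicted)
tn <- 85  # actual 0, predicted 0
fp <- 1   # actual 0, predicted 1
fn <- 8   # actual 1, predicted 0
tp <- 6   # actual 1, predicted 1

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)  # share of predicted approvals that were correct
recall    <- tp / (tp + fn)  # share of actual approvals that were found
fpr       <- fp / (fp + tn)  # false positive rate
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              fpr = fpr, f1 = f1), 4))
```

Seen this way, the model rarely approves a credit it should deny (1 of 86), but it finds only 6 of the 14 applications that should be approved.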

Let’s check which variables have the most influence on our model.

coef(cv.ridge, s=cv.ridge$lambda.min)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)         -1.1246888
## PriorDefault_No      2.1701118
## CreditScore          0.3653171
## Income               0.2671944
## Citizen_p            1.2752108
## EducationLevel_x     1.1378969
## EducationLevel_ff   -0.8193439
## Married_y           -0.3173088
## EducationLevel_cc    0.9025020
## Employed_unemployed -0.7563359
## Married_u            0.1924204
## Ethnicity_n          1.6256479
## EducationLevel_w     0.2872622

It seems that not having defaulted before, being of ethnicity ‘n’, having a citizenship status of ‘p’ and having an education level of ‘x’ are positively associated with credit approval, whereas having an education level of ‘ff’ and being ‘unemployed’ have the largest negative effect on getting a credit approved.
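
Since the coefficients of a logistic model are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to read. The sketch below copies a few coefficients from the output above; keep in mind that Ridge shrinks the coefficients, so these are rough indications rather than unpenalized estimates.

```r
# Coefficients copied from the output above (log-odds scale)
coefs <- c(PriorDefault_No     =  2.1701118,
           Ethnicity_n         =  1.6256479,
           Citizen_p           =  1.2752108,
           EducationLevel_x    =  1.1378969,
           EducationLevel_ff   = -0.8193439,
           Employed_unemployed = -0.7563359)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
print(round(sort(odds_ratios, decreasing = TRUE), 2))
```

An odds ratio above 1 means the variable raises the odds of approval with the other variables held fixed, and a value below 1 means it lowers them.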