Business Understanding

The dataset is obtained from: https://www.kaggle.com/c/home-credit-default-risk/data

Home Credit, a leading international multi-channel provider of consumer finance, is looking to broaden its offerings to more customers. Reaching more customers allows Home Credit to grow its top line. However, the more loans the company offers, the more high-risk individuals it will be exposed to, so a critical decision for the company is to estimate the likelihood that a customer will default. If the default rate among customers rises, the company will incur losses from its expansion. Home Credit's objective is to offer loans to individuals who can pay them back and to turn away those who cannot. The challenge lies in the great diversity of backgrounds among the individuals who come to Home Credit for a loan. In this analysis, we apply a variety of techniques to assess the idiosyncrasies of each customer and predict whether a customer will default.

Data Understanding

Our data consist of seven datasets. The main dataset is “application_train.csv”. It contains 307,511 observations of 122 variables and provides static data for all applicants. The target variable resides in this dataset and indicates whether a client has difficulties in meeting payments. For our analysis, we treat this as default, although not everyone who has payment difficulties will actually default. Each observation is a loan application and includes the target value, demographic variables, and some other information. The other datasets contain data on previous applications, credit card balances, repayment history, and credit balances at the Credit Bureau. Because of the size of the data and our limited time and resources, this project focuses solely on the main dataset. Its 122 variables can be grouped into five categories: personal information about the customer, information about the loan, information about the area where the customer lives, the documents the customer provided, and the inquiries made to the Credit Bureau.

Data Preparation

## Loading libraries
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(dplyr)
library(tree)
library(partykit)
library(randomForest)
library(glmnet)
library(GGally)

source("DataAnalyticsFunctions.R")
source("causal_source.R")
options(warn=-1)

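# Loading the data
# stringsAsFactors = TRUE keeps text columns as factors (the default changed to FALSE in R 4.0),
# which the levels() recoding and tree() calls below rely on
application = read.csv("application_train.csv", stringsAsFactors = TRUE)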

set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
## Drop variables related to housing
intermediate = subset(application, select = -c(AMT_GOODS_PRICE,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,
                                NONLIVINGAREA_MEDI,TOTALAREA_MODE,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,
                                    ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,
                                      YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,
                                      ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,
                                      FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE
))

## Drop observations with NA in annuity, family size, or days since last phone change
app=subset(intermediate,!is.na(AMT_ANNUITY) & 
             !is.na(CNT_FAM_MEMBERS) & !is.na(DAYS_LAST_PHONE_CHANGE) )

## Count NA values in each variable
na_count <-sapply(app, function(y) sum(is.na(y)))
na_count = data.frame(na_count)

## Count blank strings in each variable (this overwrites the NA counts above)
na_count <-sapply(app, function(y) sum(y == "", na.rm = TRUE))
na_count = data.frame(na_count)
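
As a minimal sketch of the two checks above, the following counts NA values and blank strings per column on a small toy data frame. The data frame `toy` and its values are hypothetical and only serve to illustrate the pattern.

```r
# Toy data frame (hypothetical values) with one NA and two blank strings
toy <- data.frame(
  income = c(100, NA, 300),
  suite  = c("Family", "", ""),
  stringsAsFactors = FALSE
)

# Number of NA values per column
na_per_col <- sapply(toy, function(y) sum(is.na(y)))

# Number of empty strings per column; na.rm = TRUE skips comparisons that return NA
blank_per_col <- sapply(toy, function(y) sum(y == "", na.rm = TRUE))

print(na_per_col)
print(blank_per_col)
```

Keeping the two counts in separate objects makes it possible to inspect both at the same time.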

## Drop observations that do not have car age despite car ownership
app$checking=ifelse(app$FLAG_OWN_CAR=="Y" & is.na(app$OWN_CAR_AGE),1,0)
app = subset(app,app$checking == 0)
app$checking = NULL

## Drop observations that do not have valid gender type
app = subset(app,app$CODE_GENDER != "XNA")

## Create dummy variables for EXT_SOURCE
app$EXTS1 = 1
app$EXTS1[is.na(app$EXT_SOURCE_1)] = 0
app$EXTS2 = 1
app$EXTS2[is.na(app$EXT_SOURCE_2)] = 0
app$EXTS3 = 1
app$EXTS3[is.na(app$EXT_SOURCE_3)] = 0

## Create dummy variable for Social Circle
app$Social = 1
app$Social[is.na(app$OBS_30_CNT_SOCIAL_CIRCLE)] = 0

## Create dummy variable for Credit Bureau
app$CreditInq = 1
app$CreditInq[is.na(app$AMT_REQ_CREDIT_BUREAU_HOUR)] = 0

## Recode categorical variables into 0 and 1
# Car ownership
app$CAR_OWNERSHIP = 0
app$CAR_OWNERSHIP[app$FLAG_OWN_CAR == "Y"] = 1
app$FLAG_OWN_CAR = NULL

# Real estate ownership
app$REALTY_OWNERSHIP = 0
app$REALTY_OWNERSHIP[app$FLAG_OWN_REALTY == "Y"] = 1
app$FLAG_OWN_REALTY = NULL

# Gender
app$Female=ifelse(app$CODE_GENDER=="F",1,0)
app$CODE_GENDER = NULL

# Recode OWN_CAR_AGE
app$OWN_CAR_AGE_EDIT = app$OWN_CAR_AGE
app$OWN_CAR_AGE_EDIT[is.na(app$OWN_CAR_AGE)] = 9999
app$OWN_CAR_AGE = NULL

# Recode EXT_SOURCE
app$EXT_SOURCE_1_EDIT=ifelse(is.na(app$EXT_SOURCE_1),9999,app$EXT_SOURCE_1)
app$EXT_SOURCE_2_EDIT=ifelse(is.na(app$EXT_SOURCE_2),9999,app$EXT_SOURCE_2)
app$EXT_SOURCE_3_EDIT=ifelse(is.na(app$EXT_SOURCE_3),9999,app$EXT_SOURCE_3)
app$EXT_SOURCE_1=NULL
app$EXT_SOURCE_2=NULL
app$EXT_SOURCE_3=NULL

# Recode CREDIT BUREAU
app$AMT_REQ_CREDIT_BUREAU_DAY_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_DAY),9999,app$AMT_REQ_CREDIT_BUREAU_DAY)
app$AMT_REQ_CREDIT_BUREAU_DAY=NULL
app$AMT_REQ_CREDIT_BUREAU_HOUR_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_HOUR),9999,app$AMT_REQ_CREDIT_BUREAU_HOUR)
app$AMT_REQ_CREDIT_BUREAU_HOUR=NULL
app$AMT_REQ_CREDIT_BUREAU_WEEK_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_WEEK),9999,app$AMT_REQ_CREDIT_BUREAU_WEEK)
app$AMT_REQ_CREDIT_BUREAU_WEEK=NULL
app$AMT_REQ_CREDIT_BUREAU_MON_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_MON),9999,app$AMT_REQ_CREDIT_BUREAU_MON)
app$AMT_REQ_CREDIT_BUREAU_MON=NULL
app$AMT_REQ_CREDIT_BUREAU_QRT_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_QRT),9999,app$AMT_REQ_CREDIT_BUREAU_QRT)
app$AMT_REQ_CREDIT_BUREAU_QRT=NULL
app$AMT_REQ_CREDIT_BUREAU_YEAR_EDIT=ifelse(is.na(app$AMT_REQ_CREDIT_BUREAU_YEAR),9999,app$AMT_REQ_CREDIT_BUREAU_YEAR)
app$AMT_REQ_CREDIT_BUREAU_YEAR=NULL

# Recode SOCIAL CIRCLE
app$OBS_30_CNT_SOCIAL_CIRCLE_EDIT=ifelse(is.na(app$OBS_30_CNT_SOCIAL_CIRCLE),9999,app$OBS_30_CNT_SOCIAL_CIRCLE)
app$OBS_30_CNT_SOCIAL_CIRCLE=NULL
app$OBS_60_CNT_SOCIAL_CIRCLE_EDIT=ifelse(is.na(app$OBS_60_CNT_SOCIAL_CIRCLE),9999,app$OBS_60_CNT_SOCIAL_CIRCLE)
app$OBS_60_CNT_SOCIAL_CIRCLE=NULL
app$DEF_30_CNT_SOCIAL_CIRCLE_EDIT=ifelse(is.na(app$DEF_30_CNT_SOCIAL_CIRCLE),9999,app$DEF_30_CNT_SOCIAL_CIRCLE)
app$DEF_30_CNT_SOCIAL_CIRCLE=NULL
app$DEF_60_CNT_SOCIAL_CIRCLE_EDIT=ifelse(is.na(app$DEF_60_CNT_SOCIAL_CIRCLE),9999,app$DEF_60_CNT_SOCIAL_CIRCLE)
app$DEF_60_CNT_SOCIAL_CIRCLE=NULL

# Recode Name_Type_Suite
app$NAME_TYPE_SUITE_EDIT = app$NAME_TYPE_SUITE
levels(app$NAME_TYPE_SUITE_EDIT)[levels(app$NAME_TYPE_SUITE_EDIT)==""] <- "Unknown"
app$NAME_TYPE_SUITE = NULL

# Recode Occupation Type
app$OCCUPATION_TYPE_EDIT = app$OCCUPATION_TYPE
levels(app$OCCUPATION_TYPE_EDIT)[levels(app$OCCUPATION_TYPE_EDIT)==""] = "Unknown"
app$OCCUPATION_TYPE = NULL
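
The level renaming used for NAME_TYPE_SUITE and OCCUPATION_TYPE can be sketched on a toy factor as follows. The vector `suite` is hypothetical. Note that the pattern only works on factor columns; a character column has no levels, so the text columns must have been read as factors.

```r
# Hypothetical factor with a blank level standing in for a missing category
suite <- factor(c("Family", "", "Unaccompanied", ""))

# Rename the blank level in place; every observation keeps its position
levels(suite)[levels(suite) == ""] <- "Unknown"

print(table(suite))
```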

# Check the whole dataset for remaining blank strings
na_count <-sapply(app, function(y) sum(y == "", na.rm = TRUE))
na_count = data.frame(na_count)

# Exclude Outliers
hist(app$AMT_INCOME_TOTAL) # There are a lot of outliers, most observations fall below 10m

## Closer examination shows that most observations fall below 600k, so we use that as the cutoff point

app = subset(app,app$AMT_INCOME_TOTAL<600000)
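
One way to check how much data a cutoff like this removes is to compute the share of observations below it. The sketch below uses a hypothetical income vector, not the actual data, and the same 600k threshold.

```r
# Hypothetical incomes with a few extreme values
income <- c(45000, 90000, 135000, 180000, 225000, 270000,
            450000, 1170000, 9000000, 117000000)
cutoff <- 600000

# Share of observations that survive the cutoff
share_kept <- mean(income < cutoff)
print(share_kept)

# Upper quantiles help judge whether the cutoff is reasonable
print(quantile(income, probs = c(0.5, 0.9, 0.99)))

# Apply the cutoff
kept <- income[income < cutoff]
```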
### Create Interaction Variables
## Transforming the variables

# Car Ownership * Car Age
app$carOwner_carAge=app$CAR_OWNERSHIP*app$OWN_CAR_AGE_EDIT
app=subset(app, select=-c(OWN_CAR_AGE_EDIT))

# Social Circles
app$DEF30=app$DEF_30_CNT_SOCIAL_CIRCLE_EDIT*app$Social
app$OBS30=app$OBS_30_CNT_SOCIAL_CIRCLE_EDIT*app$Social
app$DEF60=app$DEF_60_CNT_SOCIAL_CIRCLE_EDIT*app$Social
app$OBS60=app$OBS_60_CNT_SOCIAL_CIRCLE_EDIT*app$Social
app$DEF_30_CNT_SOCIAL_CIRCLE_EDIT=NULL
app$DEF_60_CNT_SOCIAL_CIRCLE_EDIT=NULL
app$OBS_60_CNT_SOCIAL_CIRCLE_EDIT=NULL
app$OBS_30_CNT_SOCIAL_CIRCLE_EDIT=NULL

# Credit Bureau
app$Credit_Hour=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_HOUR_EDIT
app$AMT_REQ_CREDIT_BUREAU_HOUR_EDIT=NULL
app$Credit_Day=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_DAY_EDIT
app$AMT_REQ_CREDIT_BUREAU_DAY_EDIT=NULL
app$Credit_Week=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_WEEK_EDIT
app$AMT_REQ_CREDIT_BUREAU_WEEK_EDIT=NULL
app$Credit_Month=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_MON_EDIT
app$AMT_REQ_CREDIT_BUREAU_MON_EDIT=NULL
app$Credit_Quarter=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_QRT_EDIT
app$AMT_REQ_CREDIT_BUREAU_QRT_EDIT=NULL
app$Credit_Year=app$CreditInq*app$AMT_REQ_CREDIT_BUREAU_YEAR_EDIT
app$AMT_REQ_CREDIT_BUREAU_YEAR_EDIT=NULL

# External Sources
app$External1=app$EXTS1*app$EXT_SOURCE_1_EDIT
app$External2=app$EXTS2*app$EXT_SOURCE_2_EDIT
app$External3=app$EXTS3*app$EXT_SOURCE_3_EDIT
app$EXT_SOURCE_1_EDIT=NULL
app$EXT_SOURCE_2_EDIT=NULL
app$EXT_SOURCE_3_EDIT=NULL
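
The effect of combining a missingness dummy with a sentinel-filled value can be seen on a small example. The vector `ext_source` below is hypothetical. Multiplying by the dummy turns the 9999 placeholder into 0, so a missing score contributes nothing to the continuous term, while the dummy itself captures any shift associated with the score being absent.

```r
# Hypothetical external scores, one of them missing
ext_source <- c(0.25, NA, 0.80)

# Dummy: 1 if the score is available, 0 if it is missing
has_source <- ifelse(is.na(ext_source), 0, 1)

# Replace the missing score with a sentinel value
ext_filled <- ifelse(is.na(ext_source), 9999, ext_source)

# The product keeps observed scores and maps the sentinel to 0
external <- has_source * ext_filled

print(data.frame(ext_source, has_source, ext_filled, external))
```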

##### Convert variables into factors

# Drop ID variable
app$SK_ID_CURR = NULL

# Drop FLAG_MOBIL variable as it is all 1
app$FLAG_MOBIL = NULL

# Transforming categorical variables to factor type
cols = c("TARGET","FLAG_EMP_PHONE","FLAG_WORK_PHONE","FLAG_CONT_MOBILE","FLAG_PHONE","FLAG_EMAIL",
         "HOUR_APPR_PROCESS_START","REG_REGION_NOT_LIVE_REGION","REG_REGION_NOT_WORK_REGION","LIVE_REGION_NOT_WORK_REGION",
         "REG_CITY_NOT_LIVE_CITY","REG_CITY_NOT_WORK_CITY","LIVE_CITY_NOT_WORK_CITY","FLAG_DOCUMENT_2",
         "FLAG_DOCUMENT_3","FLAG_DOCUMENT_4","FLAG_DOCUMENT_5","FLAG_DOCUMENT_6","FLAG_DOCUMENT_7",
         "FLAG_DOCUMENT_8","FLAG_DOCUMENT_9","FLAG_DOCUMENT_10","FLAG_DOCUMENT_11","FLAG_DOCUMENT_12",
         "FLAG_DOCUMENT_13","FLAG_DOCUMENT_14","FLAG_DOCUMENT_15","FLAG_DOCUMENT_16","FLAG_DOCUMENT_17",
         "FLAG_DOCUMENT_18","FLAG_DOCUMENT_19","FLAG_DOCUMENT_20","FLAG_DOCUMENT_21","EXTS1","EXTS2","EXTS3",
         "Social","CreditInq","CAR_OWNERSHIP","REALTY_OWNERSHIP","Female")
app[cols] = lapply(app[cols], factor)
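
A minimal sketch of the same conversion on a toy data frame (hypothetical values) shows that only the listed columns become factors. Modeling functions then treat them as categorical and create one dummy column per non-reference level.

```r
# Toy data frame with two 0/1 flags and one numeric variable
toy <- data.frame(
  TARGET     = c(0, 1, 0),
  FLAG_PHONE = c(1, 1, 0),
  AMT_CREDIT = c(100000, 200000, 300000)
)

# Convert only the listed columns to factors
cols <- c("TARGET", "FLAG_PHONE")
toy[cols] <- lapply(toy[cols], factor)

str(toy)
```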

##### Select a subset of data to deal with limited computing power ###############
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
split <- 30
nLine <- nrow(app) # the number of observations
### create a vector of fold memberships (random order)
splitID <- rep(1:split,each=ceiling(nLine/split))[sample(1:nLine)]

app_Full = app
app_Train = app_Full[splitID == 10,]
app_Test = app_Full[splitID == 15,]
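
The fold assignment above can be sketched on a small example. The sizes `n_obs` and `n_folds` are hypothetical. `rep()` builds a slightly over-long vector of fold labels, and indexing it with a random permutation of `1:n_obs` assigns each row one label, so the last fold is smaller when the number of rows is not a multiple of the number of folds. With 30 folds, each fold holds roughly one thirtieth of the rows.

```r
set.seed(10)
n_obs   <- 95   # hypothetical number of rows
n_folds <- 10   # hypothetical number of folds

# Fold labels 1..n_folds in random order, one per row
fold_id <- rep(1:n_folds, each = ceiling(n_obs / n_folds))[sample(1:n_obs)]

# Fold sizes: folds 1 to 9 have 10 rows each, fold 10 has the remaining 5
print(table(fold_id))
```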

## Exploratory Analysis

## Histogram of Income
ggplot(app_Train,aes(x=AMT_INCOME_TOTAL))+
  geom_histogram(binwidth = 20000,color="darkblue", fill="lightblue") +
  geom_vline(xintercept = mean(app_Train$AMT_INCOME_TOTAL), color = 'darkred')+
  theme(axis.text.x = element_text(hjust = 1, size = 12))+
  theme(axis.text.y = element_text(hjust = 1, size = 12))+
  xlab("Income") +  
  ylab("Frequency") + 
  ggtitle("Histogram of Income")

## AMT Credit vs. AMT Annuity
ggplot(app_Train, aes(x = AMT_CREDIT, y = AMT_ANNUITY, colour = AMT_INCOME_TOTAL)) + geom_point()

The plot above shows the relationship between the total loan amount and the loan annuity for Home Credit loan applicants. The two variables are positively related. The points also fan out: they clearly separate into several distinct lines, each containing a mixture of income levels.

## Loan Type vs Loan Amount
ggplot(data=app_Train, aes(x=NAME_CONTRACT_TYPE, y=(AMT_CREDIT),group=NAME_FAMILY_STATUS,
                           fill=NAME_FAMILY_STATUS))+
  geom_bar(stat="identity")+
  scale_fill_brewer(palette = "Set2") +
  xlab("Loan Type")+
  ylab("Loan Amount")+
  ggtitle("Loan Type VS Loan Amount for Different Family Type")

The stacked bar plot shows the total loan amount by loan type, broken down by family status. The total amount of revolving loans is much smaller than that of cash loans. Married applicants account for the largest portion of both cash loans and revolving loans.

## Pairwise correlation matrix plot
ggpairs(app_Train[,c(1,2,4,5)], mapping = aes(color = TARGET), legend = 1) + 
  theme(legend.position = "bottom")

The pairwise plot matrix illustrates the relationships between our target variable (0 indicates no default and 1 indicates default) and several other variables (contract type, total income, and total loan amount). From the plots, we can see that, regardless of contract type, the median income of customers who did not default is higher than that of customers who defaulted.
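
The median comparison read off the plot can also be checked numerically. The sketch below uses a toy data frame with hypothetical values and computes the median income for each combination of target and contract type with `aggregate()`; on the real data, `toy` would be replaced by `app_Train`.

```r
# Toy data with hypothetical incomes
toy <- data.frame(
  TARGET = factor(c(0, 0, 1, 1, 0, 1)),
  NAME_CONTRACT_TYPE = c("Cash loans", "Cash loans", "Cash loans",
                         "Revolving loans", "Revolving loans", "Revolving loans"),
  AMT_INCOME_TOTAL = c(180000, 220000, 150000, 90000, 135000, 110000)
)

# Median income for each target value within each contract type
med <- aggregate(AMT_INCOME_TOTAL ~ TARGET + NAME_CONTRACT_TYPE,
                 data = toy, FUN = median)
print(med)
```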

Modeling

Classification and regression tree (CART) & Principal component analysis (PCA)

## Custom Classification Tree
model.tree_Cust <- tree(TARGET~AMT_INCOME_TOTAL + AMT_CREDIT + DAYS_BIRTH + DAYS_EMPLOYED +
                     DAYS_REGISTRATION + DAYS_ID_PUBLISH + DAYS_LAST_PHONE_CHANGE + External1 + External2 +
                     External3 +Credit_Hour + Credit_Day + Credit_Week + Credit_Month +
                     Credit_Quarter + Credit_Year + DEF30 + OBS30+ DEF60+OBS60+
                     carOwner_carAge + FLAG_DOCUMENT_2, data=app_Train)

plot(model.tree_Cust)
text(model.tree_Cust, label="yval")
text(model.tree_Cust,label="yprob")

## Classification Tree with PCA
# Create the model matrix 
Mx<- model.matrix(TARGET ~ ., data=app_Train)[,-1]
Mx = Mx[, colMeans(Mx) != 0, drop = FALSE] # drop columns which only have zero values (a logical mask is safe even when no column qualifies)
My = app_Train$TARGET
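
A note on dropping all-zero columns: negative indexing with `which()` has a pitfall when no column matches, because `-integer(0)` selects nothing and the result has zero columns. A logical mask does not have this problem. The toy matrix `M` below is hypothetical. The test `colMeans(M) == 0` identifies all-zero columns reliably only when the entries are non-negative, as with dummy variables.

```r
# Toy matrix without any all-zero column
M <- cbind(a = c(1, 0, 2), b = c(0, 3, 1))

# which() returns integer(0) because no column mean is zero
zero_cols <- which(colMeans(M) == 0)

# Negative indexing with integer(0) drops every column
dropped_all <- M[, -zero_cols, drop = FALSE]
print(ncol(dropped_all))

# A logical mask keeps all columns that are not all-zero
kept_all <- M[, colMeans(M) != 0, drop = FALSE]
print(ncol(kept_all))
```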

# PCA
xdata = scale(Mx) # center and scale each column
pca.data <- prcomp(xdata) # xdata is already standardized, so no further scaling is needed

# First plot of top 10 components
par(mar=c(4,4,4,4)+0.3)
plot(pca.data,main="PCA: Variance Explained by Factors")
mtext(side=1, "Factors",  line=1, font=2)

# Compute standard deviation of each principal component
pr_std <-pca.data$sdev

# Compute variance
pr_var <- pr_std^2

# Proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
plot(cumsum(prop_varex), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")

# The top 30 components explain about 40% of the variance, so we will use these 30 components if the dimension
# needs to be reduced
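
As a sketch of how the number of components could be chosen from the cumulative proportion of variance, the following uses simulated data and a hypothetical 80% threshold. Neither the data nor the threshold comes from the analysis above.

```r
set.seed(1)

# Simulated data: 200 rows, 6 columns, with column 2 strongly tied to column 1
X <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
X[, 2] <- X[, 1] + 0.1 * X[, 2]

# PCA on centered and scaled columns
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variance explained
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components reaching the threshold
k <- which(cum_var >= 0.80)[1]
print(round(cum_var, 3))
print(k)
```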

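app_Train_Tree = data.frame(app_Train,pca.data$x[,1:30])
# Select the target and the 30 principal components by name rather than by column position
model.tree_PCA = tree(TARGET~.,data = app_Train_Tree[,c("TARGET",paste0("PC",1:30))])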

plot(model.tree_PCA)
text(model.tree_PCA, label="yval")
text(model.tree_PCA,label="yprob")

Logistic regression

We chose logistic regression, which predicts the probability of default for a given loan. We run the logistic regression without automatically generated interactions, since our limited computational power does not allow for running lasso models with interactions on a dataset of this size. However, we do include a few interactions manually where deemed appropriate, such as between car ownership and car age.
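
A minimal sketch of this setup on simulated data is shown below. All variable names, coefficients, and values are hypothetical. The model includes one manually constructed interaction (ownership times car age) and returns a predicted default probability for a new application.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical)
owns_car <- rbinom(n, 1, 0.4)
car_age  <- ifelse(owns_car == 1, sample(0:20, n, replace = TRUE), 0)
score    <- runif(n)

# Simulated default indicator from an assumed logistic relationship
p_true  <- plogis(-1 - 2 * score + 0.05 * owns_car * car_age)
default <- rbinom(n, 1, p_true)

# Manually built interaction term, analogous to car ownership * car age
sim <- data.frame(default, owns_car, score,
                  owner_car_age = owns_car * car_age)

fit <- glm(default ~ owns_car + score + owner_car_age,
           data = sim, family = "binomial")

# Predicted default probability for one hypothetical applicant
new_app <- data.frame(owns_car = 1, score = 0.3, owner_car_age = 8)
p_hat <- predict(fit, newdata = new_app, type = "response")
print(p_hat)
```

Using `type = "response"` returns a probability between 0 and 1 rather than the log-odds.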

app_Logit = data.frame(Mx)
model.logistic = glm(My~., data=app_Logit, family="binomial")

summary(model.logistic)
## 
## Call:
## glm(formula = My ~ ., family = "binomial", data = app_Logit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6428  -0.4201  -0.2809  -0.1773   3.2545  
## 
## Coefficients: (3 not defined because of singularities)
##                                                    Estimate Std. Error
## (Intercept)                                      -1.377e+12  6.874e+12
## NAME_CONTRACT_TYPERevolving.loans                 1.127e-01  3.471e-01
## CNT_CHILDREN                                     -1.533e-02  5.691e-02
## AMT_INCOME_TOTAL                                 -4.894e-07  6.554e-07
## AMT_CREDIT                                       -6.014e-08  1.648e-07
## AMT_ANNUITY                                       1.223e-05  4.547e-06
## NAME_INCOME_TYPECommercial.associate              1.377e+12  6.874e+12
## NAME_INCOME_TYPEPensioner                         1.377e+12  6.874e+12
## NAME_INCOME_TYPEState.servant                     1.377e+12  6.874e+12
## NAME_INCOME_TYPEWorking                           1.377e+12  6.874e+12
## NAME_EDUCATION_TYPEHigher.education               2.309e+01  2.064e+05
## NAME_EDUCATION_TYPEIncomplete.higher              2.265e+01  2.064e+05
## NAME_EDUCATION_TYPELower.secondary                2.372e+01  2.064e+05
## NAME_EDUCATION_TYPESecondary...secondary.special  2.324e+01  2.064e+05
## NAME_FAMILY_STATUSMarried                        -9.933e-02  1.243e-01
## NAME_FAMILY_STATUSSeparated                      -3.314e-01  2.158e-01
## NAME_FAMILY_STATUSSingle...not.married           -9.258e-02  1.501e-01
## NAME_FAMILY_STATUSWidow                          -7.801e-01  2.755e-01
## NAME_HOUSING_TYPEHouse...apartment               -1.172e-01  6.305e-01
## NAME_HOUSING_TYPEMunicipal.apartment             -2.632e-01  6.727e-01
## NAME_HOUSING_TYPEOffice.apartment                 2.472e-01  7.428e-01
## NAME_HOUSING_TYPERented.apartment                -4.789e-02  6.858e-01
## NAME_HOUSING_TYPEWith.parents                    -1.840e-01  6.492e-01
## REGION_POPULATION_RELATIVE                        2.770e+00  3.766e+00
## DAYS_BIRTH                                        4.488e-06  1.445e-05
## DAYS_EMPLOYED                                     6.776e-05  2.578e-05
## DAYS_REGISTRATION                                 3.078e-06  1.266e-05
## DAYS_ID_PUBLISH                                   4.238e-05  2.776e-05
## FLAG_EMP_PHONE1                                          NA         NA
## FLAG_WORK_PHONE1                                  1.516e-01  1.027e-01
## FLAG_CONT_MOBILE1                                 2.421e+01  7.440e+04
## FLAG_PHONE1                                       1.138e-01  9.612e-02
## FLAG_EMAIL1                                      -3.425e-01  1.848e-01
## CNT_FAM_MEMBERS                                          NA         NA
## REGION_RATING_CLIENT                             -2.098e-01  2.659e-01
## REGION_RATING_CLIENT_W_CITY                       3.622e-01  2.658e-01
## WEEKDAY_APPR_PROCESS_STARTMONDAY                 -2.395e-01  1.375e-01
## WEEKDAY_APPR_PROCESS_STARTSATURDAY               -9.018e-02  1.535e-01
## WEEKDAY_APPR_PROCESS_STARTSUNDAY                 -3.490e-01  2.066e-01
## WEEKDAY_APPR_PROCESS_STARTTHURSDAY               -2.173e-02  1.318e-01
## WEEKDAY_APPR_PROCESS_STARTTUESDAY                -4.790e-02  1.298e-01
## WEEKDAY_APPR_PROCESS_STARTWEDNESDAY              -7.904e-02  1.337e-01
## HOUR_APPR_PROCESS_START1                          7.529e-01  2.638e+05
## HOUR_APPR_PROCESS_START2                         -2.220e-01  2.344e+05
## HOUR_APPR_PROCESS_START3                          2.374e+01  2.184e+05
## HOUR_APPR_PROCESS_START4                          2.380e+01  2.184e+05
## HOUR_APPR_PROCESS_START5                          2.421e+01  2.184e+05
## HOUR_APPR_PROCESS_START6                          2.420e+01  2.184e+05
## HOUR_APPR_PROCESS_START7                          2.409e+01  2.184e+05
## HOUR_APPR_PROCESS_START8                          2.407e+01  2.184e+05
## HOUR_APPR_PROCESS_START9                          2.421e+01  2.184e+05
## HOUR_APPR_PROCESS_START10                         2.402e+01  2.184e+05
## HOUR_APPR_PROCESS_START11                         2.417e+01  2.184e+05
## HOUR_APPR_PROCESS_START12                         2.416e+01  2.184e+05
## HOUR_APPR_PROCESS_START13                         2.419e+01  2.184e+05
## HOUR_APPR_PROCESS_START14                         2.414e+01  2.184e+05
## HOUR_APPR_PROCESS_START15                         2.397e+01  2.184e+05
## HOUR_APPR_PROCESS_START16                         2.404e+01  2.184e+05
## HOUR_APPR_PROCESS_START17                         2.410e+01  2.184e+05
## HOUR_APPR_PROCESS_START18                         2.421e+01  2.184e+05
## HOUR_APPR_PROCESS_START19                         2.411e+01  2.184e+05
## HOUR_APPR_PROCESS_START20                         2.458e+01  2.184e+05
## HOUR_APPR_PROCESS_START21                         2.630e-01  2.408e+05
## HOUR_APPR_PROCESS_START22                        -2.824e-01  2.633e+05
## REG_REGION_NOT_LIVE_REGION1                      -2.368e+01  4.832e+04
## REG_REGION_NOT_WORK_REGION1                       2.286e+01  4.832e+04
## LIVE_REGION_NOT_WORK_REGION1                     -2.288e+01  4.832e+04
## REG_CITY_NOT_LIVE_CITY1                           2.470e-01  2.076e-01
## REG_CITY_NOT_WORK_CITY1                           1.604e-01  2.317e-01
## LIVE_CITY_NOT_WORK_CITY1                          8.139e-03  2.239e-01
## ORGANIZATION_TYPEAgriculture                      1.539e-01  1.158e+00
## ORGANIZATION_TYPEBank                            -7.990e-02  1.222e+00
## ORGANIZATION_TYPEBusiness.Entity.Type.1           2.092e-02  1.139e+00
## ORGANIZATION_TYPEBusiness.Entity.Type.2           3.540e-01  1.114e+00
## ORGANIZATION_TYPEBusiness.Entity.Type.3           3.663e-01  1.098e+00
## ORGANIZATION_TYPECleaning                         2.228e-01  1.748e+00
## ORGANIZATION_TYPEConstruction                     6.077e-01  1.119e+00
## ORGANIZATION_TYPECulture                         -2.315e+01  1.349e+05
## ORGANIZATION_TYPEElectricity                     -4.637e-02  1.364e+00
## ORGANIZATION_TYPEEmergency                        1.163e+00  1.336e+00
## ORGANIZATION_TYPEGovernment                       3.780e-01  1.116e+00
## ORGANIZATION_TYPEHotel                           -5.620e-01  1.525e+00
## ORGANIZATION_TYPEHousing                         -1.028e-01  1.229e+00
## ORGANIZATION_TYPEIndustry..type.1                 6.903e-01  1.211e+00
## ORGANIZATION_TYPEIndustry..type.10                4.147e+00  1.716e+00
## ORGANIZATION_TYPEIndustry..type.11                1.235e-01  1.182e+00
## ORGANIZATION_TYPEIndustry..type.12                2.059e+00  1.393e+00
## ORGANIZATION_TYPEIndustry..type.13               -2.527e+01  3.689e+05
## ORGANIZATION_TYPEIndustry..type.2                 8.883e-01  1.580e+00
## ORGANIZATION_TYPEIndustry..type.3                 6.497e-01  1.142e+00
## ORGANIZATION_TYPEIndustry..type.4                -5.875e-02  1.352e+00
## ORGANIZATION_TYPEIndustry..type.5                 4.976e-01  1.351e+00
## ORGANIZATION_TYPEIndustry..type.6                -2.322e+01  1.798e+05
## ORGANIZATION_TYPEIndustry..type.7                -3.822e-01  1.329e+00
## ORGANIZATION_TYPEIndustry..type.8                -2.283e+01  3.992e+05
## ORGANIZATION_TYPEIndustry..type.9                 3.927e-02  1.174e+00
## ORGANIZATION_TYPEInsurance                       -3.147e-01  1.538e+00
## ORGANIZATION_TYPEKindergarten                     5.744e-01  1.131e+00
## ORGANIZATION_TYPELegal.Services                  -2.316e+01  1.381e+05
## ORGANIZATION_TYPEMedicine                        -4.099e-02  1.137e+00
## ORGANIZATION_TYPEMilitary                         1.089e-01  1.190e+00
## ORGANIZATION_TYPEMobile                          -2.401e+01  1.331e+05
## ORGANIZATION_TYPEOther                            4.411e-01  1.107e+00
## ORGANIZATION_TYPEPolice                           8.936e-01  1.175e+00
## ORGANIZATION_TYPEPostal                          -1.065e+00  1.514e+00
## ORGANIZATION_TYPERealtor                          2.201e+00  1.330e+00
## ORGANIZATION_TYPEReligion                        -4.504e+15  3.299e+07
## ORGANIZATION_TYPERestaurant                       1.164e+00  1.173e+00
## ORGANIZATION_TYPESchool                          -5.773e-01  1.168e+00
## ORGANIZATION_TYPESecurity                        -2.876e-01  1.185e+00
## ORGANIZATION_TYPESecurity.Ministries             -1.662e+00  1.515e+00
## ORGANIZATION_TYPESelf.employed                    5.088e-01  1.100e+00
## ORGANIZATION_TYPEServices                         1.962e-01  1.309e+00
## ORGANIZATION_TYPETelecom                         -2.295e+01  8.555e+04
## ORGANIZATION_TYPETrade..type.1                    3.510e-01  1.575e+00
## ORGANIZATION_TYPETrade..type.2                   -7.233e-01  1.269e+00
## ORGANIZATION_TYPETrade..type.3                    9.319e-01  1.135e+00
## ORGANIZATION_TYPETrade..type.4                   -2.378e+01  3.742e+05
## ORGANIZATION_TYPETrade..type.5                   -2.339e+01  4.058e+05
## ORGANIZATION_TYPETrade..type.6                   -3.146e-01  1.521e+00
## ORGANIZATION_TYPETrade..type.7                    3.067e-01  1.121e+00
## ORGANIZATION_TYPETransport..type.1               -2.313e+01  1.518e+05
## ORGANIZATION_TYPETransport..type.2                7.681e-02  1.228e+00
## ORGANIZATION_TYPETransport..type.3                8.018e-01  1.193e+00
## ORGANIZATION_TYPETransport..type.4                5.916e-01  1.125e+00
## ORGANIZATION_TYPEUniversity                       1.103e+00  1.195e+00
## ORGANIZATION_TYPEXNA                                     NA         NA
## DAYS_LAST_PHONE_CHANGE                            9.797e-05  5.394e-05
## FLAG_DOCUMENT_21                                  3.573e+00  1.511e+00
## FLAG_DOCUMENT_31                                  3.357e-01  3.328e-01
## FLAG_DOCUMENT_41                                 -2.404e+01  3.672e+05
## FLAG_DOCUMENT_51                                  1.173e-01  4.273e-01
## FLAG_DOCUMENT_61                                  2.446e-01  3.866e-01
## FLAG_DOCUMENT_71                                  1.377e+12  6.874e+12
## FLAG_DOCUMENT_81                                  1.250e-01  3.614e-01
## FLAG_DOCUMENT_91                                 -3.356e-01  8.122e-01
## FLAG_DOCUMENT_111                                -6.942e-01  8.296e-01
## FLAG_DOCUMENT_131                                 7.350e-01  7.014e-01
## FLAG_DOCUMENT_141                                 1.885e-01  7.788e-01
## FLAG_DOCUMENT_151                                -2.401e+01  1.240e+05
## FLAG_DOCUMENT_161                                -2.164e+00  1.041e+00
## FLAG_DOCUMENT_171                                -2.362e+01  1.629e+05
## FLAG_DOCUMENT_181                                -1.751e+00  1.052e+00
## FLAG_DOCUMENT_191                                -2.391e+01  2.240e+05
## FLAG_DOCUMENT_201                                -2.439e+01  1.445e+05
## FLAG_DOCUMENT_211                                 2.810e-01  1.138e+00
## EXTS11                                            5.233e-01  1.649e-01
## EXTS21                                            3.534e-01  8.013e-01
## EXTS31                                            1.741e+00  2.069e-01
## Social1                                           1.061e+00  1.059e+00
## CreditInq1                                       -6.610e-01  2.067e-01
## CAR_OWNERSHIP1                                   -3.882e-01  1.200e-01
## REALTY_OWNERSHIP1                                 8.919e-03  9.117e-02
## Female1                                          -2.844e-01  1.057e-01
## NAME_TYPE_SUITE_EDITChildren                      1.413e-01  7.559e-01
## NAME_TYPE_SUITE_EDITFamily                       -1.354e-01  6.548e-01
## NAME_TYPE_SUITE_EDITGroup.of.people              -2.438e+01  1.922e+05
## NAME_TYPE_SUITE_EDITOther_A                       1.518e+00  8.273e-01
## NAME_TYPE_SUITE_EDITOther_B                      -5.587e-02  8.516e-01
## NAME_TYPE_SUITE_EDITSpouse..partner              -6.206e-01  6.885e-01
## NAME_TYPE_SUITE_EDITUnaccompanied                -7.965e-02  6.463e-01
## OCCUPATION_TYPE_EDITAccountants                   1.407e-01  2.537e-01
## OCCUPATION_TYPE_EDITCleaning.staff                3.131e-01  2.970e-01
## OCCUPATION_TYPE_EDITCooking.staff                -6.996e-01  3.473e-01
## OCCUPATION_TYPE_EDITCore.staff                   -2.369e-01  2.093e-01
## OCCUPATION_TYPE_EDITDrivers                       3.541e-01  1.813e-01
## OCCUPATION_TYPE_EDITHigh.skill.tech.staff        -2.363e-01  2.629e-01
## OCCUPATION_TYPE_EDITHR.staff                     -4.456e-01  1.098e+00
## OCCUPATION_TYPE_EDITIT.staff                     -2.373e+01  7.435e+04
## OCCUPATION_TYPE_EDITLaborers                     -8.498e-02  1.412e-01
## OCCUPATION_TYPE_EDITLow.skill.Laborers            6.689e-01  3.234e-01
## OCCUPATION_TYPE_EDITManagers                      1.929e-02  1.979e-01
## OCCUPATION_TYPE_EDITMedicine.staff                5.200e-01  3.128e-01
## OCCUPATION_TYPE_EDITPrivate.service.staff        -6.519e-02  4.859e-01
## OCCUPATION_TYPE_EDITRealty.agents                -2.463e+01  7.121e+04
## OCCUPATION_TYPE_EDITSales.staff                   8.075e-02  1.664e-01
## OCCUPATION_TYPE_EDITSecretaries                  -1.206e+00  1.055e+00
## OCCUPATION_TYPE_EDITSecurity.staff                3.134e-01  3.064e-01
## OCCUPATION_TYPE_EDITWaiters.barmen.staff         -8.302e-02  6.428e-01
## carOwner_carAge                                   2.152e-03  5.444e-03
## DEF30                                             1.059e-01  1.576e-01
## OBS30                                             4.638e-02  2.905e-01
## DEF60                                             5.235e-02  1.896e-01
## OBS60                                            -4.630e-02  2.926e-01
## Credit_Hour                                      -1.467e+00  1.068e+00
## Credit_Day                                       -1.449e+00  1.046e+00
## Credit_Week                                      -3.491e-01  2.708e-01
## Credit_Month                                     -3.255e-02  5.686e-02
## Credit_Quarter                                   -3.807e-02  7.117e-02
## Credit_Year                                       6.839e-02  2.258e-02
## External1                                        -1.506e+00  3.359e-01
## External2                                        -2.445e+00  2.046e-01
## External3                                        -2.853e+00  2.360e-01
##                                                     z value Pr(>|z|)    
## (Intercept)                                      -2.000e-01  0.84120    
## NAME_CONTRACT_TYPERevolving.loans                 3.250e-01  0.74530    
## CNT_CHILDREN                                     -2.690e-01  0.78767    
## AMT_INCOME_TOTAL                                 -7.470e-01  0.45521    
## AMT_CREDIT                                       -3.650e-01  0.71519    
## AMT_ANNUITY                                       2.689e+00  0.00716 ** 
## NAME_INCOME_TYPECommercial.associate              2.000e-01  0.84120    
## NAME_INCOME_TYPEPensioner                         2.000e-01  0.84120    
## NAME_INCOME_TYPEState.servant                     2.000e-01  0.84120    
## NAME_INCOME_TYPEWorking                           2.000e-01  0.84120    
## NAME_EDUCATION_TYPEHigher.education               0.000e+00  0.99991    
## NAME_EDUCATION_TYPEIncomplete.higher              0.000e+00  0.99991    
## NAME_EDUCATION_TYPELower.secondary                0.000e+00  0.99991    
## NAME_EDUCATION_TYPESecondary...secondary.special  0.000e+00  0.99991    
## NAME_FAMILY_STATUSMarried                        -7.990e-01  0.42434    
## NAME_FAMILY_STATUSSeparated                      -1.536e+00  0.12463    
## NAME_FAMILY_STATUSSingle...not.married           -6.170e-01  0.53733    
## NAME_FAMILY_STATUSWidow                          -2.831e+00  0.00463 ** 
## NAME_HOUSING_TYPEHouse...apartment               -1.860e-01  0.85249    
## NAME_HOUSING_TYPEMunicipal.apartment             -3.910e-01  0.69557    
## NAME_HOUSING_TYPEOffice.apartment                 3.330e-01  0.73924    
## NAME_HOUSING_TYPERented.apartment                -7.000e-02  0.94432    
## NAME_HOUSING_TYPEWith.parents                    -2.830e-01  0.77690    
## REGION_POPULATION_RELATIVE                        7.350e-01  0.46206    
## DAYS_BIRTH                                        3.110e-01  0.75618    
## DAYS_EMPLOYED                                     2.629e+00  0.00857 ** 
## DAYS_REGISTRATION                                 2.430e-01  0.80791    
## DAYS_ID_PUBLISH                                   1.526e+00  0.12696    
## FLAG_EMP_PHONE1                                          NA       NA    
## FLAG_WORK_PHONE1                                  1.476e+00  0.13996    
## FLAG_CONT_MOBILE1                                 0.000e+00  0.99974    
## FLAG_PHONE1                                       1.184e+00  0.23649    
## FLAG_EMAIL1                                      -1.853e+00  0.06385 .  
## CNT_FAM_MEMBERS                                          NA       NA    
## REGION_RATING_CLIENT                             -7.890e-01  0.43008    
## REGION_RATING_CLIENT_W_CITY                       1.363e+00  0.17298    
## WEEKDAY_APPR_PROCESS_STARTMONDAY                 -1.742e+00  0.08156 .  
## WEEKDAY_APPR_PROCESS_STARTSATURDAY               -5.870e-01  0.55687    
## WEEKDAY_APPR_PROCESS_STARTSUNDAY                 -1.689e+00  0.09113 .  
## WEEKDAY_APPR_PROCESS_STARTTHURSDAY               -1.650e-01  0.86904    
## WEEKDAY_APPR_PROCESS_STARTTUESDAY                -3.690e-01  0.71219    
## WEEKDAY_APPR_PROCESS_STARTWEDNESDAY              -5.910e-01  0.55435    
## HOUR_APPR_PROCESS_START1                          0.000e+00  1.00000    
## HOUR_APPR_PROCESS_START2                          0.000e+00  1.00000    
## HOUR_APPR_PROCESS_START3                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START4                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START5                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START6                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START7                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START8                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START9                          0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START10                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START11                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START12                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START13                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START14                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START15                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START16                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START17                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START18                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START19                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START20                         0.000e+00  0.99991    
## HOUR_APPR_PROCESS_START21                         0.000e+00  1.00000    
## HOUR_APPR_PROCESS_START22                         0.000e+00  1.00000    
## REG_REGION_NOT_LIVE_REGION1                       0.000e+00  0.99961    
## REG_REGION_NOT_WORK_REGION1                       0.000e+00  0.99962    
## LIVE_REGION_NOT_WORK_REGION1                      0.000e+00  0.99962    
## REG_CITY_NOT_LIVE_CITY1                           1.190e+00  0.23400    
## REG_CITY_NOT_WORK_CITY1                           6.920e-01  0.48874    
## LIVE_CITY_NOT_WORK_CITY1                          3.600e-02  0.97101    
## ORGANIZATION_TYPEAgriculture                      1.330e-01  0.89428    
## ORGANIZATION_TYPEBank                            -6.500e-02  0.94788    
## ORGANIZATION_TYPEBusiness.Entity.Type.1           1.800e-02  0.98534    
## ORGANIZATION_TYPEBusiness.Entity.Type.2           3.180e-01  0.75068    
## ORGANIZATION_TYPEBusiness.Entity.Type.3           3.340e-01  0.73864    
## ORGANIZATION_TYPECleaning                         1.270e-01  0.89859    
## ORGANIZATION_TYPEConstruction                     5.430e-01  0.58717    
## ORGANIZATION_TYPECulture                          0.000e+00  0.99986    
## ORGANIZATION_TYPEElectricity                     -3.400e-02  0.97289    
## ORGANIZATION_TYPEEmergency                        8.700e-01  0.38412    
## ORGANIZATION_TYPEGovernment                       3.390e-01  0.73476    
## ORGANIZATION_TYPEHotel                           -3.680e-01  0.71251    
## ORGANIZATION_TYPEHousing                         -8.400e-02  0.93335    
## ORGANIZATION_TYPEIndustry..type.1                 5.700e-01  0.56862    
## ORGANIZATION_TYPEIndustry..type.10                2.417e+00  0.01564 *  
## ORGANIZATION_TYPEIndustry..type.11                1.050e-01  0.91677    
## ORGANIZATION_TYPEIndustry..type.12                1.478e+00  0.13930    
## ORGANIZATION_TYPEIndustry..type.13                0.000e+00  0.99995    
## ORGANIZATION_TYPEIndustry..type.2                 5.620e-01  0.57409    
## ORGANIZATION_TYPEIndustry..type.3                 5.690e-01  0.56952    
## ORGANIZATION_TYPEIndustry..type.4                -4.300e-02  0.96535    
## ORGANIZATION_TYPEIndustry..type.5                 3.680e-01  0.71265    
## ORGANIZATION_TYPEIndustry..type.6                 0.000e+00  0.99990    
## ORGANIZATION_TYPEIndustry..type.7                -2.880e-01  0.77357    
## ORGANIZATION_TYPEIndustry..type.8                 0.000e+00  0.99995    
## ORGANIZATION_TYPEIndustry..type.9                 3.300e-02  0.97331    
## ORGANIZATION_TYPEInsurance                       -2.050e-01  0.83783    
## ORGANIZATION_TYPEKindergarten                     5.080e-01  0.61154    
## ORGANIZATION_TYPELegal.Services                   0.000e+00  0.99987    
## ORGANIZATION_TYPEMedicine                        -3.600e-02  0.97125    
## ORGANIZATION_TYPEMilitary                         9.200e-02  0.92708    
## ORGANIZATION_TYPEMobile                           0.000e+00  0.99986    
## ORGANIZATION_TYPEOther                            3.990e-01  0.69016    
## ORGANIZATION_TYPEPolice                           7.600e-01  0.44702    
## ORGANIZATION_TYPEPostal                          -7.030e-01  0.48177    
## ORGANIZATION_TYPERealtor                          1.655e+00  0.09794 .  
## ORGANIZATION_TYPEReligion                        -1.365e+08  < 2e-16 ***
## ORGANIZATION_TYPERestaurant                       9.930e-01  0.32088    
## ORGANIZATION_TYPESchool                          -4.940e-01  0.62116    
## ORGANIZATION_TYPESecurity                        -2.430e-01  0.80818    
## ORGANIZATION_TYPESecurity.Ministries             -1.097e+00  0.27281    
## ORGANIZATION_TYPESelf.employed                    4.630e-01  0.64370    
## ORGANIZATION_TYPEServices                         1.500e-01  0.88086    
## ORGANIZATION_TYPETelecom                          0.000e+00  0.99979    
## ORGANIZATION_TYPETrade..type.1                    2.230e-01  0.82365    
## ORGANIZATION_TYPETrade..type.2                   -5.700e-01  0.56869    
## ORGANIZATION_TYPETrade..type.3                    8.210e-01  0.41166    
## ORGANIZATION_TYPETrade..type.4                    0.000e+00  0.99995    
## ORGANIZATION_TYPETrade..type.5                    0.000e+00  0.99995    
## ORGANIZATION_TYPETrade..type.6                   -2.070e-01  0.83615    
## ORGANIZATION_TYPETrade..type.7                    2.740e-01  0.78447    
## ORGANIZATION_TYPETransport..type.1                0.000e+00  0.99988    
## ORGANIZATION_TYPETransport..type.2                6.300e-02  0.95012    
## ORGANIZATION_TYPETransport..type.3                6.720e-01  0.50145    
## ORGANIZATION_TYPETransport..type.4                5.260e-01  0.59906    
## ORGANIZATION_TYPEUniversity                       9.230e-01  0.35592    
## ORGANIZATION_TYPEXNA                                     NA       NA    
## DAYS_LAST_PHONE_CHANGE                            1.816e+00  0.06933 .  
## FLAG_DOCUMENT_21                                  2.365e+00  0.01803 *  
## FLAG_DOCUMENT_31                                  1.009e+00  0.31301    
## FLAG_DOCUMENT_41                                  0.000e+00  0.99995    
## FLAG_DOCUMENT_51                                  2.740e-01  0.78370    
## FLAG_DOCUMENT_61                                  6.330e-01  0.52699    
## FLAG_DOCUMENT_71                                  2.000e-01  0.84120    
## FLAG_DOCUMENT_81                                  3.460e-01  0.72950    
## FLAG_DOCUMENT_91                                 -4.130e-01  0.67950    
## FLAG_DOCUMENT_111                                -8.370e-01  0.40270    
## FLAG_DOCUMENT_131                                 1.048e+00  0.29471    
## FLAG_DOCUMENT_141                                 2.420e-01  0.80875    
## FLAG_DOCUMENT_151                                 0.000e+00  0.99985    
## FLAG_DOCUMENT_161                                -2.078e+00  0.03774 *  
## FLAG_DOCUMENT_171                                 0.000e+00  0.99988    
## FLAG_DOCUMENT_181                                -1.664e+00  0.09616 .  
## FLAG_DOCUMENT_191                                 0.000e+00  0.99991    
## FLAG_DOCUMENT_201                                 0.000e+00  0.99987    
## FLAG_DOCUMENT_211                                 2.470e-01  0.80498    
## EXTS11                                            3.173e+00  0.00151 ** 
## EXTS21                                            4.410e-01  0.65913    
## EXTS31                                            8.416e+00  < 2e-16 ***
## Social1                                           1.002e+00  0.31658    
## CreditInq1                                       -3.198e+00  0.00139 ** 
## CAR_OWNERSHIP1                                   -3.234e+00  0.00122 ** 
## REALTY_OWNERSHIP1                                 9.800e-02  0.92206    
## Female1                                          -2.691e+00  0.00713 ** 
## NAME_TYPE_SUITE_EDITChildren                      1.870e-01  0.85167    
## NAME_TYPE_SUITE_EDITFamily                       -2.070e-01  0.83618    
## NAME_TYPE_SUITE_EDITGroup.of.people               0.000e+00  0.99990    
## NAME_TYPE_SUITE_EDITOther_A                       1.835e+00  0.06655 .  
## NAME_TYPE_SUITE_EDITOther_B                      -6.600e-02  0.94769    
## NAME_TYPE_SUITE_EDITSpouse..partner              -9.010e-01  0.36737    
## NAME_TYPE_SUITE_EDITUnaccompanied                -1.230e-01  0.90192    
## OCCUPATION_TYPE_EDITAccountants                   5.550e-01  0.57921    
## OCCUPATION_TYPE_EDITCleaning.staff                1.054e+00  0.29178    
## OCCUPATION_TYPE_EDITCooking.staff                -2.014e+00  0.04396 *  
## OCCUPATION_TYPE_EDITCore.staff                   -1.132e+00  0.25780    
## OCCUPATION_TYPE_EDITDrivers                       1.953e+00  0.05084 .  
## OCCUPATION_TYPE_EDITHigh.skill.tech.staff        -8.990e-01  0.36877    
## OCCUPATION_TYPE_EDITHR.staff                     -4.060e-01  0.68490    
## OCCUPATION_TYPE_EDITIT.staff                      0.000e+00  0.99975    
## OCCUPATION_TYPE_EDITLaborers                     -6.020e-01  0.54714    
## OCCUPATION_TYPE_EDITLow.skill.Laborers            2.068e+00  0.03861 *  
## OCCUPATION_TYPE_EDITManagers                      9.800e-02  0.92232    
## OCCUPATION_TYPE_EDITMedicine.staff                1.663e+00  0.09641 .  
## OCCUPATION_TYPE_EDITPrivate.service.staff        -1.340e-01  0.89326    
## OCCUPATION_TYPE_EDITRealty.agents                 0.000e+00  0.99972    
## OCCUPATION_TYPE_EDITSales.staff                   4.850e-01  0.62739    
## OCCUPATION_TYPE_EDITSecretaries                  -1.143e+00  0.25302    
## OCCUPATION_TYPE_EDITSecurity.staff                1.023e+00  0.30635    
## OCCUPATION_TYPE_EDITWaiters.barmen.staff         -1.290e-01  0.89724    
## carOwner_carAge                                   3.950e-01  0.69266    
## DEF30                                             6.720e-01  0.50160    
## OBS30                                             1.600e-01  0.87316    
## DEF60                                             2.760e-01  0.78246    
## OBS60                                            -1.580e-01  0.87429    
## Credit_Hour                                      -1.373e+00  0.16989    
## Credit_Day                                       -1.386e+00  0.16582    
## Credit_Week                                      -1.289e+00  0.19732    
## Credit_Month                                     -5.720e-01  0.56705    
## Credit_Quarter                                   -5.350e-01  0.59269    
## Credit_Year                                       3.028e+00  0.00246 ** 
## External1                                        -4.485e+00 7.29e-06 ***
## External2                                        -1.195e+01  < 2e-16 ***
## External3                                        -1.209e+01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5744.6  on 10195  degrees of freedom
## Residual deviance: 4810.9  on 10006  degrees of freedom
## AIC: 5190.9
## 
## Number of Fisher Scoring iterations: 25

Logistic regression is more robust and allows a wider variety of predictions. However, it requires careful selection of variables. Both the classification tree and the logistic regression output the probability that a loan will default. These probabilities must be converted into the classes “Default” or “Not Default” by means of a threshold: when the predicted probability exceeds the threshold, the application is classified as “Default”.
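
As a minimal sketch of this conversion, assume a hypothetical vector of predicted probabilities `prob` and an illustrative threshold of 0.5; neither is taken from the fitted models. The comparison uses `>=`, which matches the threshold sweep used later in this report.

```r
# Hypothetical predicted default probabilities (illustrative values only)
prob <- c(0.05, 0.32, 0.50, 0.81)

# Illustrative threshold; the report chooses model-specific thresholds separately
threshold <- 0.5

# Classify as "Default" when the probability is at or above the threshold
pred_class <- ifelse(prob >= threshold, "Default", "Not Default")
pred_class
```

Raising the threshold flags fewer applications as “Default”, while lowering it flags more, so the choice trades false positives against false negatives.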

To determine the threshold for each model, we used the ROC curve. We plotted the ROC curve to check the in-sample performance of three models: Logistic Regression (blue), Classification Tree (green), and Classification Tree with PCA (red). The logistic model performs best in-sample, with a higher Area Under the Curve (AUC) than the other two models. Using the same sweep over candidate thresholds, we identified the optimal threshold for each model, defined as the threshold at which that model attains its highest in-sample accuracy. These thresholds are also applied in the out-of-sample (OOS) evaluation.

## Visualization to determine best thresholds 
plot( c( 0, 1 ), c(0, 1), type="n", xlim=c(0,1), ylim=c(0,1), bty="n", xlab = "False positive rate", ylab="True positive rate")
lines(c(0,1),c(0,1), lty=2)
logit = rbind()
tree_Cust = rbind()
tree_PCA = rbind()

for( val in seq(from=0,to=1,by=0.01)){
  values <- FPR_TPR((model.logistic$fitted >= val) , model.logistic$y )
  points( values$FPR , values$TPR, pch = 21, bg="blue" )
  logit = rbind(logit,c(val,values$ACC))
  values <- FPR_TPR( (predict(model.tree_Cust,type="vector")[,2] >= val) , model.logistic$y )
  points( values$FPR , values$TPR, pch = 23, bg="green" ) 
  tree_Cust = rbind(tree_Cust,c(val,values$ACC))
  values <- FPR_TPR( (predict(model.tree_PCA,type="vector")[,2] >= val) , model.logistic$y )
  points( values$FPR , values$TPR, pch = 23, bg="red" )
  tree_PCA = rbind(tree_PCA,c(val,values$ACC))
}

logit = logit[which.max(logit[,2]),]
tree_Cust = tree_Cust[which.max(tree_Cust[,2]),]
tree_PCA = tree_PCA[(which.max(tree_PCA[,2])),]
vectorMax = rbind(logit,tree_Cust,tree_PCA)
colnames(vectorMax) = c("Threshold","ACC")
vectorMax
##           Threshold       ACC
## logit          0.65 0.9195763
## tree_Cust      0.19 0.9187917
## tree_PCA       0.14 0.9187917

Evaluation

In the evaluation section, the different models are compared on OOS R-squared and OOS accuracy using K-fold cross-validation. Per convention, we used 10 folds to evaluate our models.
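
The `R2` helper used in the code that follows is defined outside this section. One common definition of R-squared for a binomial model is one minus the ratio of the model deviance to the null deviance. The sketch below assumes that definition and a 0/1 numeric outcome; the names `deviance_binomial` and `r2_sketch` are hypothetical, and the report's own helper may differ.

```r
# Binomial deviance for 0/1 outcomes and predicted probabilities
deviance_binomial <- function(y, pred) {
  # Clip probabilities away from 0 and 1 so that log() stays finite
  eps <- 1e-15
  pred <- pmin(pmax(pred, eps), 1 - eps)
  -2 * sum(y * log(pred) + (1 - y) * log(1 - pred))
}

# Deviance-based R-squared: 1 - (model deviance / null deviance)
r2_sketch <- function(y, pred) {
  dev <- deviance_binomial(y, pred)
  # Null model: predict the mean of y for every observation
  dev0 <- deviance_binomial(y, rep(mean(y), length(y)))
  1 - dev / dev0
}

# Toy held-out outcomes and predicted probabilities (illustrative values only)
y_toy <- c(0, 0, 1, 1)
pred_toy <- c(0.1, 0.2, 0.7, 0.9)
r2_toy <- r2_sketch(y_toy, pred_toy)
r2_toy
```

Under this definition, a value of zero means the model predicts no better than the null model on the held-out fold, and a negative value means it predicts worse.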

#### K-fold Cross Validation ####

### K-fold for R-squared
## K-fold for logistic regression and trees
nfold <- 10
n <- nrow(app_Train)
foldid <- rep(1:nfold,each=ceiling(n/nfold))[sample(1:n)]
### create an empty dataframe of results
OOS <- data.frame(logistic=rep(NA,nfold),tree=rep(NA,nfold),
                  tree_pca=rep(NA,nfold),null=rep(NA,nfold)) 

## Use a for loop to run through the nfold trials
for(k in 1:nfold){ 
  train <- which(foldid!=k) # train on all but fold `k'
  
  model.logistic <-glm(My~., data=app_Logit,subset = train, family="binomial")
  model.tree_Cust <- tree(TARGET~AMT_INCOME_TOTAL + AMT_CREDIT + DAYS_BIRTH + DAYS_EMPLOYED +
                            DAYS_REGISTRATION + DAYS_ID_PUBLISH + DAYS_LAST_PHONE_CHANGE + External1 + External2 +
                            External3 +Credit_Hour + Credit_Day + Credit_Week + Credit_Month +
                            Credit_Quarter + Credit_Year + DEF30 + OBS30+ DEF60+OBS60+
                            carOwner_carAge + FLAG_DOCUMENT_2, data=app_Train,subset = train) 
  model.tree_PCA = tree(TARGET~.,data = app_Train_Tree[,c(1,78:107)],subset = train)
  model.null = glm(TARGET~1, data=app_Train, subset=train,family="binomial")
  ## get predictions: type=response so we have probabilities
  pred.logistic <- predict(model.logistic, newdata=app_Logit[-train,], type="response")
  pred.tree <- predict(model.tree_Cust, newdata=app_Train[-train,], type="vector")
  pred.tree <- pred.tree[,2]
  pred.tree.pca <- predict(model.tree_PCA, newdata=app_Train_Tree[-train,], type="vector")
  pred.tree.pca <- pred.tree.pca[,2]
  pred.null <- predict(model.null, newdata=app_Train[-train,], type="response")
  
  ## calculate and log the OOS R2 for each model
  ## (bare expressions are not printed inside a for loop, so the results
  ## are only stored in OOS)
  # Logistic
  OOS$logistic[k] <- R2(y=My[-train], pred=pred.logistic, family="binomial")
  # Tree
  OOS$tree[k] <- R2(y=app_Train$TARGET[-train], pred=pred.tree, family="binomial")
  # PCA tree
  OOS$tree_pca[k] <- R2(y=app_Train_Tree$TARGET[-train], pred=pred.tree.pca, family="binomial")
  # Null
  OOS$null[k] <- R2(y=app_Train$TARGET[-train], pred=pred.null, family="binomial")
  # Null model guess: the share of defaults in the training folds
  # (computed for reference only; the value is not stored)
  sum(app_Train$TARGET[train] == 1)/length(train)
  
  ## We will loop this nfold times
  ## this will print the progress (iteration that finished)
  print(paste("Iteration",k,"of",nfold,"(thank you for your patience)"))
}
## [1] "Iteration 1 of 10 (thank you for your patience)"
## [1] "Iteration 2 of 10 (thank you for your patience)"
## [1] "Iteration 3 of 10 (thank you for your patience)"
## [1] "Iteration 4 of 10 (thank you for your patience)"
## [1] "Iteration 5 of 10 (thank you for your patience)"
## [1] "Iteration 6 of 10 (thank you for your patience)"
## [1] "Iteration 7 of 10 (thank you for your patience)"
## [1] "Iteration 8 of 10 (thank you for your patience)"
## [1] "Iteration 9 of 10 (thank you for your patience)"
## [1] "Iteration 10 of 10 (thank you for your patience)"
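
The loop above records only the OOS R-squared. OOS accuracy could be recorded in the same loop by classifying the held-out predictions at each model's threshold. The helper `accuracy_at` below is a hypothetical sketch that assumes a 0/1 numeric outcome; the fold data are toy values, and 0.65 is the logistic threshold found earlier.

```r
# Share of correct classifications at a given threshold (a sketch)
accuracy_at <- function(prob, y, threshold) {
  # Predict default (1) when the probability is at or above the threshold
  pred <- as.integer(prob >= threshold)
  mean(pred == y)
}

# Toy held-out fold (illustrative values only)
prob_toy <- c(0.10, 0.70, 0.40, 0.90)
y_fold <- c(0, 1, 1, 0)
acc_toy <- accuracy_at(prob_toy, y_fold, threshold = 0.65)
acc_toy
```

Because defaults are a small minority of the applications, accuracy should be read against the null model's accuracy rather than on its own.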

Deployment

The results of the data mining will be deployed by implementing the model as an algorithm behind a web page interface that sales representatives can use to enter applicants’ data. Once the data are entered, a suggestion of whether the application carries a risk of default will appear on the screen; the sales representative can then use it for further assessment. The obstacles we anticipate in the implementation are described below.

First, data accuracy might be an issue since we don’t have the means to verify every response. Some applicants might provide incorrect data to increase their chance of getting the loan. However, this is also the case under the status quo. Even with this noise, we were still able to build a predictive model. There is no reason to believe that new customers from the expansion will produce noisier data. Thus, this risk is relatively low.

Second, it might be time-consuming to acquire the data needed from customers. If customers have to spend more time being evaluated for the loan, we might lose some of them during the data input process. However, since all the variables that the model uses were already being recorded, deploying the model does not increase the time required. On the contrary, our model filters out many demographic variables that turn out not to be relevant for predicting default. These no longer have to be recorded, which simplifies the loan application process. This could both cut costs and lower the bar for potential new customers to apply for a loan.

Furthermore, since we don’t have visibility into customers’ profiles, we cannot consider factors like geography or applicant behavior. The model built on these data might therefore not be applicable in a different market. Entering a new market is always a risky step. To mitigate some of this risk, a new market should be entered in slow, small steps. By starting with a small group, Home Credit can establish whether the market is similar to the ones in which it is currently active and whether the product is successful there. By not going all-in right away, the risk is reduced.

Last but not least, using the optimal model raises an ethical problem: the variable Female is relevant in predicting whether an applicant will default. However, including it in the model to determine whether or not to approve a loan would, by definition, lead to gender discrimination, which is both unethical and unlawful. The only way to solve the gender discrimination problem is to take this variable out of the model entirely. Instead, we could include more latent variables related to both gender and default probability, such as a proxy for risk-seeking behavior.

Improvement

We believe the current model is useful and will help Home Credit significantly in its expansion endeavors. Nevertheless, there are ample opportunities to improve the model as Home Credit’s needs become more complicated. Additional sources of data, such as payment history, should be used to improve the accuracy rate. Moreover, for now, we only predict whether or not a loan defaults, without taking the loan amount into consideration. Naturally, the loan amount is a relevant factor: defaults on larger loans are far more detrimental than defaults on smaller ones. A more robust model can be made using extra information like the demand curve. In this way, we can find the optimal interest rate and improve the current loan offering. For example, for a high-risk loan, we can charge a higher interest rate to make the expected profit worthwhile. Finally, since we currently use one general model for all types of loans, it is probably not optimal. Different models should be created for different loan types to avoid losses and to maximize profits in Home Credit’s expansion. It is very likely that the highly different nature of revolving loans and cash loans makes for different optimal models.
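
To illustrate how the loan amount and interest rate could enter the decision, the sketch below uses a deliberately simplified, hypothetical one-period model. It assumes that interest is earned only if the loan is repaid and that a fixed fraction of the principal is lost on default. The function names and all input values are illustrative assumptions, not part of the analysis above.

```r
# Expected profit of a one-period loan under the simplifying assumptions above
expected_profit <- function(amount, rate, p_default, loss_fraction = 1) {
  # Interest earned if repaid, minus principal lost if the loan defaults
  (1 - p_default) * rate * amount - p_default * loss_fraction * amount
}

# Interest rate at which the expected profit is exactly zero
break_even_rate <- function(p_default, loss_fraction = 1) {
  p_default * loss_fraction / (1 - p_default)
}

# Illustrative inputs only
rate_low_risk <- break_even_rate(p_default = 0.02)
rate_high_risk <- break_even_rate(p_default = 0.08)
profit_example <- expected_profit(amount = 10000, rate = 0.10, p_default = 0.08)
c(rate_low_risk, rate_high_risk, profit_example)
```

Under these assumptions the break-even rate rises with the default probability, which is the sense in which a high-risk loan would need a higher interest rate.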