Introduction

Credit scorecards are a common risk-control method in the financial industry. A scorecard uses the personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so that the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.

Generally speaking, credit scorecards are built on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring because it is suitable for binary classification tasks and yields a coefficient for each feature. To make the result easy to understand and operate, a scorecard multiplies each logistic regression coefficient by a certain value (such as 100) and rounds it.
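
As a toy sketch of that last step, the snippet below scales and rounds a handful of coefficients into scorecard points; the names and values are invented purely for illustration:

# Illustrative only: made-up logistic regression coefficients
# turned into scorecard points by scaling and rounding.
coefs <- c(AGE = 0.8, INCOME = -1.2, OWNS_REALTY = 0.3)
round(coefs * 100)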

At present, with the development of machine learning algorithms, more predictive methods such as boosting, random forests, and support vector machines have been introduced into credit card scoring. However, these methods often lack transparency, which may make it difficult to provide customers and regulators with a reason for rejection or acceptance.

Data preparation

Define libraries
library(rsample)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(class)
library(gtools)
## 
## Attaching package: 'gtools'
## The following object is masked from 'package:rsample':
## 
##     permutations
library(cmplot)
## Warning: replacing previous import 'ggplot2::last_plot' by 'plotly::last_plot'
## when loading 'cmplot'
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Read data
credit_record <- read.csv("data/credit_record.csv")
application_record <- read.csv("data/application_record.csv")

  • credit_record : data containing monthly transaction records for each customer; this information will later be used to construct the target for our model
  • application_record : data containing customers' personal information, which will serve as our predictors

Preview Credit Record
glimpse(credit_record)
## Rows: 1,048,575
## Columns: 3
## $ ID             <int> 5001711, 5001711, 5001711, 5001711, 5001712, 5001712, 5~
## $ MONTHS_BALANCE <int> 0, -1, -2, -3, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -~
## $ STATUS         <chr> "X", "0", "0", "0", "C", "C", "C", "C", "C", "C", "C", ~

Description:

  • ID : Client number
  • MONTHS_BALANCE : The month of the record, counted backwards from the month the data was extracted: 0 is the current month, -1 is the previous month, and so on
  • STATUS : 0 = 1-29 days past due; 1 = 30-59 days past due; 2 = 60-89 days past due; 3 = 90-119 days past due; 4 = 120-149 days past due; 5 = overdue or bad debt, written off, for more than 150 days; C = paid off that month; X = no loan for the month

A customer 90 or more days past due is considered a ‘bad’ customer. Based on this, we recode the “STATUS” column to 1 (bad customer) or 0 (good customer) and use it as the target for our model.

Change the values of STATUS and remove MONTHS_BALANCE
credit_record <- credit_record %>% 
  mutate(STATUS = case_when(STATUS %in% c("3", "4", "5") ~ 1,
                            T ~ 0)) %>%
  group_by(ID) %>%
  summarise(STATUS = max(STATUS)) %>%
  ungroup() %>% 
  mutate(STATUS = as.factor(case_when(STATUS == 1 ~ "Bad Customer",
                                      T ~ "Good Customer")))
Preview Application Record
glimpse(application_record)
## Rows: 438,557
## Columns: 18
## $ ID                  <int> 5008804, 5008805, 5008806, 5008808, 5008809, 50088~
## $ CODE_GENDER         <chr> "M", "M", "M", "F", "F", "F", "F", "F", "F", "F", ~
## $ FLAG_OWN_CAR        <chr> "Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "N", ~
## $ FLAG_OWN_REALTY     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", ~
## $ CNT_CHILDREN        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AMT_INCOME_TOTAL    <dbl> 427500, 427500, 112500, 270000, 270000, 270000, 27~
## $ NAME_INCOME_TYPE    <chr> "Working", "Working", "Working", "Commercial assoc~
## $ NAME_EDUCATION_TYPE <chr> "Higher education", "Higher education", "Secondary~
## $ NAME_FAMILY_STATUS  <chr> "Civil marriage", "Civil marriage", "Married", "Si~
## $ NAME_HOUSING_TYPE   <chr> "Rented apartment", "Rented apartment", "House / a~
## $ DAYS_BIRTH          <int> -12005, -12005, -21474, -19110, -19110, -19110, -1~
## $ DAYS_EMPLOYED       <int> -4542, -4542, -1134, -3051, -3051, -3051, -3051, 3~
## $ FLAG_MOBIL          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ FLAG_WORK_PHONE     <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,~
## $ FLAG_PHONE          <int> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,~
## $ FLAG_EMAIL          <int> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,~
## $ OCCUPATION_TYPE     <chr> "", "", "Security staff", "Sales staff", "Sales st~
## $ CNT_FAM_MEMBERS     <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,~

Description:

  • ID : Client number
  • CODE_GENDER : Gender
  • FLAG_OWN_CAR : If users have a car
  • FLAG_OWN_REALTY : Is there a property?
  • CNT_CHILDREN : Number of children
  • AMT_INCOME_TOTAL : Annual income
  • NAME_INCOME_TYPE : Income category
  • NAME_EDUCATION_TYPE : Education level
  • NAME_FAMILY_STATUS : Marital status
  • NAME_HOUSING_TYPE : Way of living
  • DAYS_BIRTH : Birthday
  • DAYS_EMPLOYED : Start date
  • FLAG_MOBIL : Is there a mobile phone?
  • FLAG_WORK_PHONE : Is there a work phone?
  • FLAG_PHONE : Is there a phone?
  • FLAG_EMAIL : Is there an email?
  • OCCUPATION_TYPE : Occupation
  • CNT_FAM_MEMBERS : Family size

As most of the columns contain binary values, it is best to change their data type to factor. For numerical binary values and values without decimals, it is best to change the data type to integer.

Change the data type and adjust some character values to binary
application_record <- application_record %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate_if(is.double, as.integer) %>% 
  mutate(CODE_GENDER = as.integer(ifelse(CODE_GENDER == "M", 1, 0)),
         FLAG_OWN_CAR = as.integer(ifelse(FLAG_OWN_CAR == "Y", 1, 0)),
         FLAG_OWN_REALTY = as.integer(ifelse(FLAG_OWN_REALTY == "Y", 1, 0)))

After we are done preparing our data, we can combine both data sets into one data frame.

Combine credit_record and application_record
credit_data <- merge(application_record, credit_record, by = "ID") %>% 
  select(-ID)
credit_data

Exploratory Data Analysis

For a target with a binary data type, it is important to make sure that each predictor level appears in both categories of the target. If not, it might cause inaccurate predictions.

Inspect NAME_INCOME_TYPE
credit_data %>% 
  select(NAME_INCOME_TYPE, STATUS) %>% 
  group_by(NAME_INCOME_TYPE, STATUS) %>% 
  summarise(Freq = n())
## `summarise()` has grouped output by 'NAME_INCOME_TYPE'. You can override using the `.groups` argument.
Inspect NAME_EDUCATION_TYPE
credit_data %>% 
  select(NAME_EDUCATION_TYPE, STATUS) %>% 
  group_by(NAME_EDUCATION_TYPE, STATUS) %>% 
  summarise(Freq = n())
## `summarise()` has grouped output by 'NAME_EDUCATION_TYPE'. You can override using the `.groups` argument.
Inspect NAME_FAMILY_STATUS
credit_data %>% 
  select(NAME_FAMILY_STATUS, STATUS) %>% 
  group_by(NAME_FAMILY_STATUS, STATUS) %>% 
  summarise(Freq = n())
## `summarise()` has grouped output by 'NAME_FAMILY_STATUS'. You can override using the `.groups` argument.
Inspect NAME_HOUSING_TYPE
credit_data %>% 
  select(NAME_HOUSING_TYPE, STATUS) %>% 
  group_by(NAME_HOUSING_TYPE, STATUS) %>% 
  summarise(Freq = n())
## `summarise()` has grouped output by 'NAME_HOUSING_TYPE'. You can override using the `.groups` argument.
Inspect OCCUPATION_TYPE
credit_data %>% 
  select(OCCUPATION_TYPE, STATUS) %>% 
  group_by(OCCUPATION_TYPE, STATUS) %>% 
  summarise(Freq = n())
## `summarise()` has grouped output by 'OCCUPATION_TYPE'. You can override using the `.groups` argument.
Inspect FLAG_MOBIL
unique(credit_data$FLAG_MOBIL)
## [1] 1
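
The chunk-by-chunk checks above can also be written in one pass; a compact sketch over the same columns:

# Cross-tabulate each categorical predictor against STATUS at once.
cat_cols <- c("NAME_INCOME_TYPE", "NAME_EDUCATION_TYPE",
              "NAME_FAMILY_STATUS", "NAME_HOUSING_TYPE", "OCCUPATION_TYPE")
lapply(credit_data[cat_cols], function(col) table(col, credit_data$STATUS))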

In the columns shown above, we can observe that some levels appear only in the good-customer category of the STATUS column. In addition, FLAG_MOBIL has only one value. We need to remove them before we move on to modelling.

Remove problematic values and column
credit_data <- credit_data %>% 
  select(-FLAG_MOBIL) %>% 
  filter(NAME_INCOME_TYPE != "Student",
         NAME_EDUCATION_TYPE != "Academic degree",
         !OCCUPATION_TYPE %in% c("",
                                "HR staff",
                                "Private service staff",
                                "Realty agents"))

Modeling

Logistic regression

Logistic regression is a modeling method that can predict a target with binary values, or two distinct values that can be categorized as “yes” and “no”. I will create a couple of models and compare them against each other in terms of performance.
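
Under the hood, logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function, mapping it to a probability between 0 and 1. A one-line sketch:

# The logistic function squashes any real value into (0, 1).
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(0) # 0.5: a linear predictor of 0 corresponds to even odds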

Cross Validation

Splitting the data into a training set and a testing set lets us evaluate the model on data it has never seen, which gives a more realistic estimate of its predictive performance. This also helps us reduce the chance of creating an overfitted model.

Separate train and test data
set.seed(129)

index <- sample(nrow(credit_data), nrow(credit_data) * 0.75)
credit_train <- credit_data[index,]
credit_test <- credit_data[-index,]
Separate train and test data only for numerical data
set.seed(129)

credit_data_num <- credit_data %>% 
  select(STATUS, where(is.numeric))

index <- sample(nrow(credit_data_num), nrow(credit_data_num) * 0.75)
credit_train_num <- credit_data_num[index,]
Inspect the data balance of our target
table(credit_train$STATUS)
## 
##  Bad Customer Good Customer 
##           142         18309
table(credit_train_num$STATUS)
## 
##  Bad Customer Good Customer 
##           142         18309
prop.table(table(credit_train$STATUS))
## 
##  Bad Customer Good Customer 
##    0.00769606    0.99230394
prop.table(table(credit_train_num$STATUS))
## 
##  Bad Customer Good Customer 
##    0.00769606    0.99230394

Our data appears to be imbalanced: looking at the tables of the STATUS column above, bad customers are rare compared to good customers. The gap between the number of bad and good customers can be closed with the upSample() function. If we used downSample() instead, the information contained in our data would be significantly reduced, which we certainly want to avoid (a short downSample() sketch is shown after the balancing step, for contrast only). Even so, we first drop a portion of the good-customer rows, both to limit the amount of duplicated data that upsampling generates for the bad-customer category and to reduce the load on our computer when fitting the models.

Balancing

Reduce half of the good customer data
ct_0 <- credit_train %>% 
  filter(STATUS == "Good Customer") %>% 
  slice(-c(1:9000))

ct_1 <- credit_train %>% 
  filter(STATUS == "Bad Customer")

credit_train <- rbind(ct_0, ct_1)


ctn_0 <- credit_train_num %>% 
  filter(STATUS == "Good Customer") %>% 
  slice(-c(1:9000))

ctn_1 <- credit_train_num %>% 
  filter(STATUS == "Bad Customer")

credit_train_num <- rbind(ctn_0, ctn_1)
Balancing data by upSample() method
set.seed(129)
ups_train <- upSample(x = credit_train %>% select(-STATUS),
                      y = credit_train$STATUS,
                      yname = "STATUS")

ups_train_num <- upSample(x = credit_train_num %>% select(-STATUS),
                          y = credit_train_num$STATUS,
                          yname = "STATUS")

table(ups_train$STATUS)
## 
##  Bad Customer Good Customer 
##          9309          9309
table(ups_train_num$STATUS)
## 
##  Bad Customer Good Customer 
##          9309          9309
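
For contrast only, the downSample() counterpart from caret has the same signature; it discards majority-class rows until both classes match the minority count, which is why we avoid it here:

# Not used in this analysis: shown only to contrast with upSample().
dns_train <- downSample(x = credit_train %>% select(-STATUS),
                        y = credit_train$STATUS,
                        yname = "STATUS")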

Logistic modeling

Modeling using all predictors
model_all <- glm(STATUS ~ ., ups_train, family = "binomial")
model_all_num <- glm(STATUS ~ ., ups_train_num, family = "binomial")
Modeling using stepwise method
model_step <- step(glm(STATUS ~ ., ups_train, family = "binomial"), direction = "backward")
## Start:  AIC=23348.5
## STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + NAME_INCOME_TYPE + NAME_EDUCATION_TYPE + 
##     NAME_FAMILY_STATUS + NAME_HOUSING_TYPE + DAYS_BIRTH + DAYS_EMPLOYED + 
##     FLAG_WORK_PHONE + FLAG_PHONE + FLAG_EMAIL + OCCUPATION_TYPE + 
##     CNT_FAM_MEMBERS
## 
##                       Df Deviance   AIC
## - CNT_FAM_MEMBERS      1    23267 23347
## - FLAG_OWN_CAR         1    23268 23348
## <none>                      23267 23349
## - CNT_CHILDREN         1    23270 23350
## - FLAG_WORK_PHONE      1    23270 23350
## - FLAG_EMAIL           1    23273 23353
## - FLAG_PHONE           1    23280 23360
## - FLAG_OWN_REALTY      1    23284 23364
## - AMT_INCOME_TOTAL     1    23316 23396
## - CODE_GENDER          1    23328 23408
## - DAYS_BIRTH           1    23402 23482
## - NAME_FAMILY_STATUS   4    23519 23593
## - NAME_INCOME_TYPE     3    23525 23601
## - DAYS_EMPLOYED        1    23535 23615
## - NAME_EDUCATION_TYPE  3    23581 23657
## - NAME_HOUSING_TYPE    5    23712 23784
## - OCCUPATION_TYPE     14    23978 24032
## 
## Step:  AIC=23346.61
## STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + NAME_INCOME_TYPE + NAME_EDUCATION_TYPE + 
##     NAME_FAMILY_STATUS + NAME_HOUSING_TYPE + DAYS_BIRTH + DAYS_EMPLOYED + 
##     FLAG_WORK_PHONE + FLAG_PHONE + FLAG_EMAIL + OCCUPATION_TYPE
## 
##                       Df Deviance   AIC
## - FLAG_OWN_CAR         1    23268 23346
## <none>                      23267 23347
## - FLAG_WORK_PHONE      1    23271 23349
## - FLAG_EMAIL           1    23273 23351
## - FLAG_PHONE           1    23280 23358
## - FLAG_OWN_REALTY      1    23284 23362
## - CNT_CHILDREN         1    23292 23370
## - AMT_INCOME_TOTAL     1    23316 23394
## - CODE_GENDER          1    23329 23407
## - DAYS_BIRTH           1    23402 23480
## - NAME_INCOME_TYPE     3    23525 23599
## - DAYS_EMPLOYED        1    23535 23613
## - NAME_EDUCATION_TYPE  3    23581 23655
## - NAME_FAMILY_STATUS   4    23585 23657
## - NAME_HOUSING_TYPE    5    23712 23782
## - OCCUPATION_TYPE     14    23978 24030
## 
## Step:  AIC=23346.34
## STATUS ~ CODE_GENDER + FLAG_OWN_REALTY + CNT_CHILDREN + AMT_INCOME_TOTAL + 
##     NAME_INCOME_TYPE + NAME_EDUCATION_TYPE + NAME_FAMILY_STATUS + 
##     NAME_HOUSING_TYPE + DAYS_BIRTH + DAYS_EMPLOYED + FLAG_WORK_PHONE + 
##     FLAG_PHONE + FLAG_EMAIL + OCCUPATION_TYPE
## 
##                       Df Deviance   AIC
## <none>                      23268 23346
## - FLAG_WORK_PHONE      1    23272 23348
## - FLAG_EMAIL           1    23275 23351
## - FLAG_PHONE           1    23282 23358
## - FLAG_OWN_REALTY      1    23286 23362
## - CNT_CHILDREN         1    23294 23370
## - AMT_INCOME_TOTAL     1    23317 23393
## - CODE_GENDER          1    23329 23405
## - DAYS_BIRTH           1    23405 23481
## - NAME_INCOME_TYPE     3    23526 23598
## - DAYS_EMPLOYED        1    23536 23612
## - NAME_EDUCATION_TYPE  3    23582 23654
## - NAME_FAMILY_STATUS   4    23600 23670
## - NAME_HOUSING_TYPE    5    23714 23782
## - OCCUPATION_TYPE     14    23989 24039
model_step_num <- step(glm(STATUS ~ ., ups_train_num, family = "binomial"), direction = "backward")
## Start:  AIC=25105.49
## STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + DAYS_BIRTH + DAYS_EMPLOYED + FLAG_WORK_PHONE + 
##     FLAG_PHONE + FLAG_EMAIL + CNT_FAM_MEMBERS
## 
##                    Df Deviance   AIC
## - FLAG_PHONE        1    25082 25104
## - FLAG_WORK_PHONE   1    25083 25105
## <none>                   25082 25106
## - FLAG_OWN_CAR      1    25088 25110
## - FLAG_OWN_REALTY   1    25092 25114
## - AMT_INCOME_TOTAL  1    25100 25122
## - FLAG_EMAIL        1    25110 25132
## - CODE_GENDER       1    25115 25137
## - CNT_FAM_MEMBERS   1    25144 25166
## - CNT_CHILDREN      1    25154 25176
## - DAYS_BIRTH        1    25158 25180
## - DAYS_EMPLOYED     1    25481 25503
## 
## Step:  AIC=25104.32
## STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + DAYS_BIRTH + DAYS_EMPLOYED + FLAG_WORK_PHONE + 
##     FLAG_EMAIL + CNT_FAM_MEMBERS
## 
##                    Df Deviance   AIC
## - FLAG_WORK_PHONE   1    25083 25103
## <none>                   25082 25104
## - FLAG_OWN_CAR      1    25089 25109
## - FLAG_OWN_REALTY   1    25093 25113
## - AMT_INCOME_TOTAL  1    25102 25122
## - FLAG_EMAIL        1    25111 25131
## - CODE_GENDER       1    25115 25135
## - CNT_FAM_MEMBERS   1    25144 25164
## - CNT_CHILDREN      1    25154 25174
## - DAYS_BIRTH        1    25160 25180
## - DAYS_EMPLOYED     1    25487 25507
## 
## Step:  AIC=25102.96
## STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + DAYS_BIRTH + DAYS_EMPLOYED + FLAG_EMAIL + 
##     CNT_FAM_MEMBERS
## 
##                    Df Deviance   AIC
## <none>                   25083 25103
## - FLAG_OWN_CAR      1    25090 25108
## - FLAG_OWN_REALTY   1    25093 25111
## - AMT_INCOME_TOTAL  1    25104 25122
## - FLAG_EMAIL        1    25113 25131
## - CODE_GENDER       1    25117 25135
## - CNT_FAM_MEMBERS   1    25145 25163
## - CNT_CHILDREN      1    25155 25173
## - DAYS_BIRTH        1    25161 25179
## - DAYS_EMPLOYED     1    25487 25505
Model summary
summary(model_all)
## 
## Call:
## glm(formula = STATUS ~ ., family = "binomial", data = ups_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0539  -1.0882  -0.1161   1.0650   3.0686  
## 
## Coefficients:
##                                                    Estimate Std. Error z value
## (Intercept)                                       6.099e-01  2.554e-01   2.388
## CODE_GENDER                                      -3.377e-01  4.300e-02  -7.855
## FLAG_OWN_CAR                                      4.457e-02  3.573e-02   1.247
## FLAG_OWN_REALTY                                   1.506e-01  3.624e-02   4.155
## CNT_CHILDREN                                     -1.300e-01  7.327e-02  -1.775
## AMT_INCOME_TOTAL                                 -1.167e-06  1.677e-07  -6.957
## NAME_INCOME_TYPEPensioner                        -3.339e+00  4.649e-01  -7.183
## NAME_INCOME_TYPEState servant                     7.295e-01  6.901e-02  10.571
## NAME_INCOME_TYPEWorking                           1.654e-01  3.786e-02   4.369
## NAME_EDUCATION_TYPEIncomplete higher              1.356e-02  8.319e-02   0.163
## NAME_EDUCATION_TYPELower secondary               -1.780e+00  1.637e-01 -10.876
## NAME_EDUCATION_TYPESecondary / secondary special  4.197e-01  4.098e-02  10.242
## NAME_FAMILY_STATUSMarried                        -6.169e-01  7.013e-02  -8.798
## NAME_FAMILY_STATUSSeparated                      -2.666e-01  1.215e-01  -2.194
## NAME_FAMILY_STATUSSingle / not married           -1.185e+00  9.872e-02 -12.003
## NAME_FAMILY_STATUSWidow                          -5.022e-01  1.521e-01  -3.301
## NAME_HOUSING_TYPEHouse / apartment                7.225e-01  1.679e-01   4.303
## NAME_HOUSING_TYPEMunicipal apartment              3.562e-02  1.831e-01   0.195
## NAME_HOUSING_TYPEOffice apartment                 1.135e+00  2.543e-01   4.463
## NAME_HOUSING_TYPERented apartment                 1.644e+01  1.154e+02   0.143
## NAME_HOUSING_TYPEWith parents                     1.627e+00  1.861e-01   8.746
## DAYS_BIRTH                                        6.672e-05  5.771e-06  11.561
## DAYS_EMPLOYED                                    -1.402e-04  8.749e-06 -16.025
## FLAG_WORK_PHONE                                  -7.910e-02  3.995e-02  -1.980
## FLAG_PHONE                                        1.390e-01  3.839e-02   3.621
## FLAG_EMAIL                                        1.395e-01  5.674e-02   2.459
## OCCUPATION_TYPECleaning staff                     5.235e-01  1.544e-01   3.390
## OCCUPATION_TYPECooking staff                     -4.232e-01  1.262e-01  -3.354
## OCCUPATION_TYPECore staff                        -8.423e-01  8.313e-02 -10.132
## OCCUPATION_TYPEDrivers                            1.582e-01  1.017e-01   1.555
## OCCUPATION_TYPEHigh skill tech staff              2.821e-02  1.025e-01   0.275
## OCCUPATION_TYPEIT staff                          -2.869e+00  2.655e-01 -10.806
## OCCUPATION_TYPELaborers                          -1.502e-01  8.559e-02  -1.755
## OCCUPATION_TYPELow-skill Laborers                -1.580e-01  1.978e-01  -0.799
## OCCUPATION_TYPEManagers                           5.497e-01  9.082e-02   6.053
## OCCUPATION_TYPEMedicine staff                    -3.170e-01  1.126e-01  -2.815
## OCCUPATION_TYPESales staff                        4.750e-02  8.604e-02   0.552
## OCCUPATION_TYPESecretaries                       -7.099e-01  1.979e-01  -3.587
## OCCUPATION_TYPESecurity staff                    -1.614e-01  1.239e-01  -1.303
## OCCUPATION_TYPEWaiters/barmen staff              -4.095e-01  2.037e-01  -2.011
## CNT_FAM_MEMBERS                                   2.338e-02  7.051e-02   0.332
##                                                  Pr(>|z|)    
## (Intercept)                                      0.016926 *  
## CODE_GENDER                                      4.00e-15 ***
## FLAG_OWN_CAR                                     0.212252    
## FLAG_OWN_REALTY                                  3.25e-05 ***
## CNT_CHILDREN                                     0.075921 .  
## AMT_INCOME_TOTAL                                 3.48e-12 ***
## NAME_INCOME_TYPEPensioner                        6.82e-13 ***
## NAME_INCOME_TYPEState servant                     < 2e-16 ***
## NAME_INCOME_TYPEWorking                          1.25e-05 ***
## NAME_EDUCATION_TYPEIncomplete higher             0.870560    
## NAME_EDUCATION_TYPELower secondary                < 2e-16 ***
## NAME_EDUCATION_TYPESecondary / secondary special  < 2e-16 ***
## NAME_FAMILY_STATUSMarried                         < 2e-16 ***
## NAME_FAMILY_STATUSSeparated                      0.028209 *  
## NAME_FAMILY_STATUSSingle / not married            < 2e-16 ***
## NAME_FAMILY_STATUSWidow                          0.000964 ***
## NAME_HOUSING_TYPEHouse / apartment               1.69e-05 ***
## NAME_HOUSING_TYPEMunicipal apartment             0.845714    
## NAME_HOUSING_TYPEOffice apartment                8.07e-06 ***
## NAME_HOUSING_TYPERented apartment                0.886668    
## NAME_HOUSING_TYPEWith parents                     < 2e-16 ***
## DAYS_BIRTH                                        < 2e-16 ***
## DAYS_EMPLOYED                                     < 2e-16 ***
## FLAG_WORK_PHONE                                  0.047692 *  
## FLAG_PHONE                                       0.000293 ***
## FLAG_EMAIL                                       0.013927 *  
## OCCUPATION_TYPECleaning staff                    0.000699 ***
## OCCUPATION_TYPECooking staff                     0.000795 ***
## OCCUPATION_TYPECore staff                         < 2e-16 ***
## OCCUPATION_TYPEDrivers                           0.119948    
## OCCUPATION_TYPEHigh skill tech staff             0.783159    
## OCCUPATION_TYPEIT staff                           < 2e-16 ***
## OCCUPATION_TYPELaborers                          0.079177 .  
## OCCUPATION_TYPELow-skill Laborers                0.424520    
## OCCUPATION_TYPEManagers                          1.42e-09 ***
## OCCUPATION_TYPEMedicine staff                    0.004880 ** 
## OCCUPATION_TYPESales staff                       0.580925    
## OCCUPATION_TYPESecretaries                       0.000335 ***
## OCCUPATION_TYPESecurity staff                    0.192713    
## OCCUPATION_TYPEWaiters/barmen staff              0.044378 *  
## CNT_FAM_MEMBERS                                  0.740168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25810  on 18617  degrees of freedom
## Residual deviance: 23266  on 18577  degrees of freedom
## AIC: 23348
## 
## Number of Fisher Scoring iterations: 14
summary(model_all_num)
## 
## Call:
## glm(formula = STATUS ~ ., family = "binomial", data = ups_train_num)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.74775  -1.14556  -0.08721   1.15937   1.89847  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.770e-02  9.800e-02  -0.283  0.77743    
## CODE_GENDER      -1.889e-01  3.282e-02  -5.757 8.55e-09 ***
## FLAG_OWN_CAR      8.482e-02  3.265e-02   2.598  0.00937 ** 
## FLAG_OWN_REALTY   1.060e-01  3.297e-02   3.215  0.00131 ** 
## CNT_CHILDREN     -3.643e-01  4.320e-02  -8.432  < 2e-16 ***
## AMT_INCOME_TOTAL -6.290e-07  1.458e-07  -4.315 1.59e-05 ***
## DAYS_BIRTH        4.255e-05  4.870e-06   8.737  < 2e-16 ***
## DAYS_EMPLOYED    -1.547e-04  7.992e-06 -19.354  < 2e-16 ***
## FLAG_WORK_PHONE   3.919e-02  3.753e-02   1.044  0.29633    
## FLAG_PHONE       -3.180e-02  3.501e-02  -0.908  0.36365    
## FLAG_EMAIL       -2.623e-01  4.939e-02  -5.311 1.09e-07 ***
## CNT_FAM_MEMBERS   2.760e-01  3.513e-02   7.856 3.96e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25810  on 18617  degrees of freedom
## Residual deviance: 25081  on 18606  degrees of freedom
## AIC: 25105
## 
## Number of Fisher Scoring iterations: 4
summary(model_step)
## 
## Call:
## glm(formula = STATUS ~ CODE_GENDER + FLAG_OWN_REALTY + CNT_CHILDREN + 
##     AMT_INCOME_TOTAL + NAME_INCOME_TYPE + NAME_EDUCATION_TYPE + 
##     NAME_FAMILY_STATUS + NAME_HOUSING_TYPE + DAYS_BIRTH + DAYS_EMPLOYED + 
##     FLAG_WORK_PHONE + FLAG_PHONE + FLAG_EMAIL + OCCUPATION_TYPE, 
##     family = "binomial", data = ups_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0413  -1.0883  -0.1151   1.0628   3.0663  
## 
## Coefficients:
##                                                    Estimate Std. Error z value
## (Intercept)                                       6.875e-01  2.120e-01   3.243
## CODE_GENDER                                      -3.277e-01  4.206e-02  -7.792
## FLAG_OWN_REALTY                                   1.496e-01  3.615e-02   4.140
## CNT_CHILDREN                                     -1.068e-01  2.149e-02  -4.972
## AMT_INCOME_TOTAL                                 -1.153e-06  1.673e-07  -6.890
## NAME_INCOME_TYPEPensioner                        -3.343e+00  4.645e-01  -7.196
## NAME_INCOME_TYPEState servant                     7.255e-01  6.888e-02  10.533
## NAME_INCOME_TYPEWorking                           1.649e-01  3.780e-02   4.363
## NAME_EDUCATION_TYPEIncomplete higher              1.570e-02  8.322e-02   0.189
## NAME_EDUCATION_TYPELower secondary               -1.784e+00  1.639e-01 -10.880
## NAME_EDUCATION_TYPESecondary / secondary special  4.184e-01  4.096e-02  10.215
## NAME_FAMILY_STATUSMarried                        -6.169e-01  7.006e-02  -8.805
## NAME_FAMILY_STATUSSeparated                      -2.943e-01  9.982e-02  -2.948
## NAME_FAMILY_STATUSSingle / not married           -1.216e+00  7.730e-02 -15.726
## NAME_FAMILY_STATUSWidow                          -5.290e-01  1.354e-01  -3.906
## NAME_HOUSING_TYPEHouse / apartment                7.148e-01  1.677e-01   4.262
## NAME_HOUSING_TYPEMunicipal apartment              2.316e-02  1.827e-01   0.127
## NAME_HOUSING_TYPEOffice apartment                 1.117e+00  2.537e-01   4.404
## NAME_HOUSING_TYPERented apartment                 1.643e+01  1.154e+02   0.142
## NAME_HOUSING_TYPEWith parents                     1.617e+00  1.858e-01   8.707
## DAYS_BIRTH                                        6.682e-05  5.767e-06  11.588
## DAYS_EMPLOYED                                    -1.395e-04  8.725e-06 -15.988
## FLAG_WORK_PHONE                                  -7.483e-02  3.984e-02  -1.878
## FLAG_PHONE                                        1.395e-01  3.840e-02   3.633
## FLAG_EMAIL                                        1.415e-01  5.666e-02   2.497
## OCCUPATION_TYPECleaning staff                     5.174e-01  1.543e-01   3.353
## OCCUPATION_TYPECooking staff                     -4.263e-01  1.260e-01  -3.382
## OCCUPATION_TYPECore staff                        -8.465e-01  8.301e-02 -10.198
## OCCUPATION_TYPEDrivers                            1.578e-01  1.014e-01   1.556
## OCCUPATION_TYPEHigh skill tech staff              1.821e-02  1.022e-01   0.178
## OCCUPATION_TYPEIT staff                          -2.890e+00  2.651e-01 -10.904
## OCCUPATION_TYPELaborers                          -1.572e-01  8.543e-02  -1.840
## OCCUPATION_TYPELow-skill Laborers                -1.608e-01  1.956e-01  -0.822
## OCCUPATION_TYPEManagers                           5.506e-01  9.062e-02   6.077
## OCCUPATION_TYPEMedicine staff                    -3.253e-01  1.124e-01  -2.894
## OCCUPATION_TYPESales staff                        3.994e-02  8.584e-02   0.465
## OCCUPATION_TYPESecretaries                       -7.289e-01  1.974e-01  -3.692
## OCCUPATION_TYPESecurity staff                    -1.817e-01  1.229e-01  -1.478
## OCCUPATION_TYPEWaiters/barmen staff              -4.265e-01  2.033e-01  -2.097
##                                                  Pr(>|z|)    
## (Intercept)                                      0.001181 ** 
## CODE_GENDER                                      6.61e-15 ***
## FLAG_OWN_REALTY                                  3.47e-05 ***
## CNT_CHILDREN                                     6.63e-07 ***
## AMT_INCOME_TOTAL                                 5.56e-12 ***
## NAME_INCOME_TYPEPensioner                        6.19e-13 ***
## NAME_INCOME_TYPEState servant                     < 2e-16 ***
## NAME_INCOME_TYPEWorking                          1.28e-05 ***
## NAME_EDUCATION_TYPEIncomplete higher             0.850379    
## NAME_EDUCATION_TYPELower secondary                < 2e-16 ***
## NAME_EDUCATION_TYPESecondary / secondary special  < 2e-16 ***
## NAME_FAMILY_STATUSMarried                         < 2e-16 ***
## NAME_FAMILY_STATUSSeparated                      0.003195 ** 
## NAME_FAMILY_STATUSSingle / not married            < 2e-16 ***
## NAME_FAMILY_STATUSWidow                          9.38e-05 ***
## NAME_HOUSING_TYPEHouse / apartment               2.02e-05 ***
## NAME_HOUSING_TYPEMunicipal apartment             0.899128    
## NAME_HOUSING_TYPEOffice apartment                1.06e-05 ***
## NAME_HOUSING_TYPERented apartment                0.886735    
## NAME_HOUSING_TYPEWith parents                     < 2e-16 ***
## DAYS_BIRTH                                        < 2e-16 ***
## DAYS_EMPLOYED                                     < 2e-16 ***
## FLAG_WORK_PHONE                                  0.060374 .  
## FLAG_PHONE                                       0.000280 ***
## FLAG_EMAIL                                       0.012540 *  
## OCCUPATION_TYPECleaning staff                    0.000798 ***
## OCCUPATION_TYPECooking staff                     0.000719 ***
## OCCUPATION_TYPECore staff                         < 2e-16 ***
## OCCUPATION_TYPEDrivers                           0.119616    
## OCCUPATION_TYPEHigh skill tech staff             0.858628    
## OCCUPATION_TYPEIT staff                           < 2e-16 ***
## OCCUPATION_TYPELaborers                          0.065698 .  
## OCCUPATION_TYPELow-skill Laborers                0.411112    
## OCCUPATION_TYPEManagers                          1.23e-09 ***
## OCCUPATION_TYPEMedicine staff                    0.003800 ** 
## OCCUPATION_TYPESales staff                       0.641713    
## OCCUPATION_TYPESecretaries                       0.000223 ***
## OCCUPATION_TYPESecurity staff                    0.139360    
## OCCUPATION_TYPEWaiters/barmen staff              0.035962 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25810  on 18617  degrees of freedom
## Residual deviance: 23268  on 18579  degrees of freedom
## AIC: 23346
## 
## Number of Fisher Scoring iterations: 14
summary(model_step_num)
## 
## Call:
## glm(formula = STATUS ~ CODE_GENDER + FLAG_OWN_CAR + FLAG_OWN_REALTY + 
##     CNT_CHILDREN + AMT_INCOME_TOTAL + DAYS_BIRTH + DAYS_EMPLOYED + 
##     FLAG_EMAIL + CNT_FAM_MEMBERS, family = "binomial", data = ups_train_num)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.74950  -1.15249  -0.09078   1.15876   1.90310  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -9.934e-03  9.615e-02  -0.103  0.91771    
## CODE_GENDER      -1.895e-01  3.257e-02  -5.817 5.97e-09 ***
## FLAG_OWN_CAR      8.726e-02  3.242e-02   2.691  0.00712 ** 
## FLAG_OWN_REALTY   9.870e-02  3.200e-02   3.085  0.00204 ** 
## CNT_CHILDREN     -3.632e-01  4.317e-02  -8.414  < 2e-16 ***
## AMT_INCOME_TOTAL -6.550e-07  1.439e-07  -4.551 5.34e-06 ***
## DAYS_BIRTH        4.282e-05  4.863e-06   8.805  < 2e-16 ***
## DAYS_EMPLOYED    -1.550e-04  7.964e-06 -19.464  < 2e-16 ***
## FLAG_EMAIL       -2.678e-01  4.917e-02  -5.446 5.16e-08 ***
## CNT_FAM_MEMBERS   2.741e-01  3.497e-02   7.838 4.59e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25810  on 18617  degrees of freedom
## Residual deviance: 25083  on 18608  degrees of freedom
## AIC: 25103
## 
## Number of Fisher Scoring iterations: 4

Some of the predictors appear to be insignificant for the target, as the stepwise procedure removed them from the model's predictor list. Let us compare the performance of the two approaches, starting with the stepwise model that contains both numerical and categorical predictors. A negative coefficient reduces the odds of being a good customer, while a positive coefficient raises them.

Odds and Probability
exp(-3.3429657987)
## [1] 0.03533201
inv.logit(-3.3429657987)
## [1] 0.03412626

For interpretation purposes, I took the NAME_INCOME_TYPEPensioner coefficient (-3.343) and derived its odds and probability. The odds ratio exp(-3.343) ≈ 0.035 means that, holding the other predictors constant, a pensioner's odds of being a good customer are about 0.035 times those of the reference income category; the corresponding inv.logit value of about 0.034 is the probability implied by that log-odds alone. In other words, a customer who is a pensioner has a considerably higher risk of not paying their loan on time compared to other customers.
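
To read the remaining coefficients the same way, all of them can be converted to odds ratios at once; a minimal sketch using the fitted stepwise model from above:

# Odds ratios for every coefficient in the stepwise model: values below 1
# lower the odds of the "Good Customer" outcome, values above 1 raise them.
exp(coef(model_step))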

Prediction of the models
pred_all <- predict(model_all, credit_test, type = "response")
pred_all_num <- predict(model_all_num, credit_test, type = "response")
pred_step <- predict(model_step, credit_test, type = "response")
pred_step_num <- predict(model_step_num, credit_test, type = "response")
Convert the predicted probabilities into class labels
test_all <- credit_test
test_all_num <- credit_test
test_step <- credit_test
test_step_num <- credit_test

test_all$pred <- factor(ifelse(pred_all > 0.5, "Bad Customer", "Good Customer"))
test_all_num$pred <- factor(ifelse(pred_all_num > 0.5, "Bad Customer", "Good Customer"))
test_step$pred <- factor(ifelse(pred_step > 0.5, "Bad Customer", "Good Customer"))
test_step_num$pred <- factor(ifelse(pred_step_num > 0.5, "Bad Customer", "Good Customer"))
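
One thing worth confirming before settling on a labeling rule: for a binomial glm() with a factor response, R treats the first factor level as “failure” and the second as “success”, so the fitted probabilities refer to the second level of STATUS. Checking the level order makes explicit which class a cutoff rule is applied to:

# The second level of the response factor is the one glm() models.
levels(ups_train$STATUS)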
Preview the performance of each model
confusionMatrix(test_all$pred, test_all$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            30          4119
##   Good Customer           31          1971
##                                           
##                Accuracy : 0.3253          
##                  95% CI : (0.3136, 0.3372)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0054         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.323645        
##             Specificity : 0.491803        
##          Pos Pred Value : 0.984515        
##          Neg Pred Value : 0.007231        
##              Prevalence : 0.990083        
##          Detection Rate : 0.320436        
##    Detection Prevalence : 0.325476        
##       Balanced Accuracy : 0.407724        
##                                           
##        'Positive' Class : Good Customer   
## 
confusionMatrix(test_all_num$pred, test_all_num$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            22          3369
##   Good Customer           39          2721
##                                           
##                Accuracy : 0.4459          
##                  95% CI : (0.4335, 0.4585)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0069         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.446798        
##             Specificity : 0.360656        
##          Pos Pred Value : 0.985870        
##          Neg Pred Value : 0.006488        
##              Prevalence : 0.990083        
##          Detection Rate : 0.442367        
##    Detection Prevalence : 0.448708        
##       Balanced Accuracy : 0.403727        
##                                           
##        'Positive' Class : Good Customer   
## 
confusionMatrix(test_step$pred, test_step$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            30          4118
##   Good Customer           31          1972
##                                           
##                Accuracy : 0.3255          
##                  95% CI : (0.3138, 0.3373)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0054         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.323810        
##             Specificity : 0.491803        
##          Pos Pred Value : 0.984523        
##          Neg Pred Value : 0.007232        
##              Prevalence : 0.990083        
##          Detection Rate : 0.320598        
##    Detection Prevalence : 0.325638        
##       Balanced Accuracy : 0.407806        
##                                           
##        'Positive' Class : Good Customer   
## 
confusionMatrix(test_step_num$pred, test_step_num$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            23          3371
##   Good Customer           38          2719
##                                           
##                Accuracy : 0.4458          
##                  95% CI : (0.4333, 0.4583)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0063         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.446470        
##             Specificity : 0.377049        
##          Pos Pred Value : 0.986217        
##          Neg Pred Value : 0.006777        
##              Prevalence : 0.990083        
##          Detection Rate : 0.442042        
##    Detection Prevalence : 0.448220        
##       Balanced Accuracy : 0.411759        
##                                           
##        'Positive' Class : Good Customer   
## 

There is almost no difference between the results of the two modeling approaches (all predictors versus stepwise). There is, however, a noticeable difference between the model with only numerical predictors and the model that also includes the categorical predictors. If we focus on one of them, for example the stepwise model with both categorical and numerical predictors, we can see that it performed badly: an accuracy of only 32.55%, a specificity of 49.18%, and a negative predictive value of just 0.72%. This suggests that, on this data, logistic regression performed better with the purely numerical predictors than with the categorical ones included. Still, we can try some tuning if logistic regression is really required for its interpretability.

Model tuning

Performance plot
confmat_plot(pred_step,
             test_step$STATUS,
             postarget = "Good Customer",
             negtarget = "Bad Customer")

By plotting the performance of our model, we can look for the best threshold. Note that with the labeling rule used here (probabilities above the cutoff are labeled Bad Customer), raising the cutoff labels more customers as good, so sensitivity rises while specificity falls; lowering the cutoff does the opposite.
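
To see the trade-off numerically, a sketch that scans a grid of cutoffs and tabulates sensitivity and specificity, reusing pred_step and test_step from above:

# Scan cutoffs and compute sensitivity/specificity for each,
# with "Good Customer" as the positive class.
cutoffs <- seq(0.1, 0.9, by = 0.1)
sapply(cutoffs, function(ct) {
  pred <- factor(ifelse(pred_step > ct, "Bad Customer", "Good Customer"),
                 levels = levels(test_step$STATUS))
  cm <- table(Prediction = pred, Reference = test_step$STATUS)
  c(cutoff = ct,
    sensitivity = cm["Good Customer", "Good Customer"] / sum(cm[, "Good Customer"]),
    specificity = cm["Bad Customer", "Bad Customer"] / sum(cm[, "Bad Customer"]))
})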

Tuning
test_step$pred <- factor(ifelse(pred_step > 0.9, "Bad Customer", "Good Customer"))
confusionMatrix(test_step$pred, test_step$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer             3           186
##   Good Customer           58          5904
##                                           
##                Accuracy : 0.9603          
##                  95% CI : (0.9551, 0.9651)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0091          
##                                           
##  Mcnemar's Test P-Value : 4.281e-16       
##                                           
##             Sensitivity : 0.96946         
##             Specificity : 0.04918         
##          Pos Pred Value : 0.99027         
##          Neg Pred Value : 0.01587         
##              Prevalence : 0.99008         
##          Detection Rate : 0.95984         
##    Detection Prevalence : 0.96927         
##       Balanced Accuracy : 0.50932         
##                                           
##        'Positive' Class : Good Customer   
## 
test_step$pred <- factor(ifelse(pred_step > 0.1, "Bad Customer", "Good Customer"))
confusionMatrix(test_step$pred, test_step$STATUS, positive = "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            59          6070
##   Good Customer            2            20
##                                          
##                Accuracy : 0.0128         
##                  95% CI : (0.0102, 0.016)
##     No Information Rate : 0.9901         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : -6e-04         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.003284       
##             Specificity : 0.967213       
##          Pos Pred Value : 0.909091       
##          Neg Pred Value : 0.009626       
##              Prevalence : 0.990083       
##          Detection Rate : 0.003252       
##    Detection Prevalence : 0.003577       
##       Balanced Accuracy : 0.485249       
##                                          
##        'Positive' Class : Good Customer  
## 

Even after tuning, it is still not possible to get a well-performing model for this target. If the business case is to identify possibly-bad customers accurately (reducing false positives), we have to accept that 95.08% of the actual bad customers (58 of 61 at the 0.9 cutoff) are still falsely predicted as good. The same goes when we tune the model to flag as many possibly-bad customers as possible: 99.67% of the actual good customers (6,070 of 6,090 at the 0.1 cutoff) fall into the false-negative category. And from the beginning, as much as 67.47% of the data was predicted incorrectly when the threshold was set at 0.5.

K Nearest Neighbor

Another method we can use is K-NN. Although it cannot be interpreted the way logistic regression can, it may improve the performance of our model.

Data Preparation

Although we did some data pre-processing at the beginning, a bit more is needed before we can model the data with K-NN, since it can only build a model from numerical values.

Encode categorical values as integer codes
knn_data <- credit_data %>% 
mutate(NAME_INCOME_TYPE = case_when(NAME_INCOME_TYPE == "Working" ~ 0,
                                    NAME_INCOME_TYPE == "Commercial associate" ~ 1,
                                    NAME_INCOME_TYPE == "State servant" ~ 2,
                                    NAME_INCOME_TYPE == "Pensioner" ~ 3),
       NAME_EDUCATION_TYPE = case_when(NAME_EDUCATION_TYPE == "Secondary / secondary special" ~ 0,
                                       NAME_EDUCATION_TYPE == "Higher education" ~ 1,
                                       NAME_EDUCATION_TYPE == "Incomplete higher" ~ 2,
                                       NAME_EDUCATION_TYPE == "Lower secondary" ~ 3),
       NAME_FAMILY_STATUS = case_when(NAME_FAMILY_STATUS == "Married" ~ 0,
                                      NAME_FAMILY_STATUS == "Single / not married" ~ 1,
                                      NAME_FAMILY_STATUS == "Civil marriage" ~ 2,
                                      NAME_FAMILY_STATUS == "Separated" ~ 3,
                                      NAME_FAMILY_STATUS == "Widow" ~ 4),
       NAME_HOUSING_TYPE = case_when(NAME_HOUSING_TYPE == "House / apartment" ~ 0,
                                     NAME_HOUSING_TYPE == "Rented apartment" ~ 1,
                                     NAME_HOUSING_TYPE == "Municipal apartment" ~ 2,
                                     NAME_HOUSING_TYPE == "With parents" ~ 3,
                                     NAME_HOUSING_TYPE == "Co-op apartment" ~ 4,
                                     NAME_HOUSING_TYPE == "Office apartment" ~ 5),
       OCCUPATION_TYPE = case_when(OCCUPATION_TYPE == "Security staff" ~ 0,
                                   OCCUPATION_TYPE == "Sales staff" ~ 1,
                                   OCCUPATION_TYPE == "Accountants" ~ 2,
                                   OCCUPATION_TYPE == "Laborers" ~ 3,
                                   OCCUPATION_TYPE == "Managers" ~ 4,
                                   OCCUPATION_TYPE == "Drivers" ~ 5,
                                   OCCUPATION_TYPE == "Core staff" ~ 6,
                                   OCCUPATION_TYPE == "High skill tech staff" ~ 7,
                                   OCCUPATION_TYPE == "Cleaning staff" ~ 8,
                                   OCCUPATION_TYPE == "Cooking staff" ~ 9,
                                   OCCUPATION_TYPE == "Low-skill Laborers" ~ 10,
                                   OCCUPATION_TYPE == "Medicine staff" ~ 11,
                                   OCCUPATION_TYPE == "Secretaries" ~ 12,
                                   OCCUPATION_TYPE == "Waiters/barmen staff" ~ 13,
                                   OCCUPATION_TYPE == "IT staff" ~ 14))

Cross validation

Cross validation
set.seed(129)

index <- sample(nrow(knn_data), nrow(knn_data) * 0.75)
knn_train <- knn_data[index,]
knn_test <- knn_data[-index,]
Scaling using Min-Max method
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

knn_train <- knn_train %>% mutate_if(is.numeric, normalize)
knn_test <- knn_test %>% mutate_if(is.numeric, normalize)
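
One caveat: the code above scales the test set with its own minima and maxima, so a given scaled value does not mean exactly the same thing in both sets. An alternative is to derive the scaling from the training split and apply it to both; a sketch, written against hypothetical unscaled copies train_raw and test_raw:

# Sketch (train_raw / test_raw are hypothetical unscaled splits):
# reuse the training minima and maxima when scaling the test set.
num_cols <- names(train_raw)[sapply(train_raw, is.numeric)]
mins <- sapply(train_raw[num_cols], min)
maxs <- sapply(train_raw[num_cols], max)
scale_with <- function(df) {
  df[num_cols] <- Map(function(x, lo, hi) (x - lo) / (hi - lo),
                      df[num_cols], mins, maxs)
  df
}
train_scaled <- scale_with(train_raw)
test_scaled  <- scale_with(test_raw)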
Balancing
knn_0 <- knn_train %>% 
  filter(STATUS == "Good Customer") %>% 
  slice(-c(1:9000))

knn_1 <- knn_train %>% 
  filter(STATUS == "Bad Customer")

knn_train <- rbind(knn_0, knn_1)

set.seed(129)
knn_ups_train <- upSample(x = knn_train %>% select(-STATUS),
                          y = knn_train$STATUS,
                          yname = "STATUS")

knn_train_x <-knn_ups_train %>% select(-STATUS)
knn_test_x <-knn_test %>% select(-STATUS)
knn_train_y <-knn_ups_train$STATUS
knn_test_y <-knn_test$STATUS

K-NN Modeling

Before we begin modeling, we need the “k” value to pass to the knn() function as one of its parameters. A common rule of thumb is to round the square root of the number of rows in the training data; if the result is even, we reduce it by 1 so that a majority vote between the two classes cannot end in a tie.

Optimum K value
round(sqrt(nrow(knn_train_x)))
## [1] 136
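The helper below wraps this rule of thumb (choose_k is just an illustrative name):

# Round sqrt(n), then step down to the nearest odd number so a
# two-class majority vote cannot tie.
choose_k <- function(n) {
  k <- round(sqrt(n))
  if (k %% 2 == 0) k - 1 else k
}
choose_k(nrow(knn_train_x)) # 135, since round(sqrt(n)) = 136 is even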
K-NN prediction
pred_knn <- knn(train = knn_train_x, 
                test = knn_test_x, 
                cl = knn_train_y, 
                k = 135)

confusionMatrix(pred_knn, knn_test_y, "Good Customer")
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Bad Customer Good Customer
##   Bad Customer            43          2803
##   Good Customer           18          3287
##                                           
##                Accuracy : 0.5414          
##                  95% CI : (0.5288, 0.5539)
##     No Information Rate : 0.9901          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0104          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.53974         
##             Specificity : 0.70492         
##          Pos Pred Value : 0.99455         
##          Neg Pred Value : 0.01511         
##              Prevalence : 0.99008         
##          Detection Rate : 0.53438         
##    Detection Prevalence : 0.53731         
##       Balanced Accuracy : 0.62233         
##                                           
##        'Positive' Class : Good Customer   
## 

With the K-NN method, we managed to improve the model: only 45.86% of the test data was predicted incorrectly, down from 67.47% with the logistic regression method, a decrease in error of 21.61 percentage points. The precision is good compared to the logistic regression model at 99.45%, with a sensitivity of 53.97% and an accuracy of 54.14%.

Conclusion

Depending on the needs of our business case, we may predetermine which model to use for predicting future data. However, if we do not take the characteristics of our data into consideration when choosing a method, we may end up with a poorly performing model, as demonstrated here. For data that is dominated by categorical values, a tree-based method such as a decision tree or random forest is likely a better fit.
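
As a possible follow-up, a minimal sketch of fitting a random forest on the same balanced training data (assuming the randomForest package is installed; not run in this notebook):

# Sketch: random forest on the balanced training set from the
# logistic regression section, evaluated on the same test split.
library(randomForest)
set.seed(129)
model_rf <- randomForest(STATUS ~ ., data = ups_train, ntree = 500)
pred_rf <- predict(model_rf, credit_test)
confusionMatrix(pred_rf, credit_test$STATUS, positive = "Good Customer")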