1 Intro

We will build a classification model using the “credit.csv” dataset. The goal is to learn the relationship between the predictor variables and the target variable “default”, which records whether a customer defaulted on a loan or not. Using the historical data in “credit.csv”, we want to predict whether a newly disbursed loan will end in default.

2 Data Preparation

-> Load the required packages:

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

-> Load the dataset with ‘stringsAsFactors = TRUE’ so that string/character columns are read as factors, the type R uses to represent categorical variables in modelling.

credit <- read.csv('data_input/credit.csv',stringsAsFactors = TRUE)

-> Check the data structure to see whether any variable has the wrong data type:

str(credit)

#> 'data.frame':    1000 obs. of  17 variables:
#>  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#>  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
#>  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#>  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#>  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#>  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#>  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#>  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
#>  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
#>  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
#>  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#>  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
#>  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#>  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
#>  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#>  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

From the data structure we can see that every variable already has an appropriate data type. The data has 1000 rows (observations) and 17 columns (variables). The target variable is default, which indicates whether the loan ended in default. The amount variable is the loan amount approved and disbursed by the bank and serves as one of the predictors.

This is the data dictionary obtained from the source material on Kaggle:

checking_balance: Applicant’s checking account balance.
months_loan_duration: Duration of the loan in months.
credit_history: Credit history of the applicant.
purpose: Purpose of the loan.
amount: Loan amount.
savings_balance: Savings account balance.
employment_duration: Length of employment.
percent_of_income: Percentage of income allocated to loan repayment.
years_at_residence: Years at the current residence.
age: Applicant’s age.
other_credit: Presence of other credit agreements.
housing: Housing status (e.g., rent, own).
existing_loans_count: Number of existing loans.
job: Job type or classification.
dependents: Number of dependents.
phone: Availability of a telephone.
default: Target variable indicating loan default (“yes” or “no”).

Reference for the dataset: https://www.kaggle.com/datasets/daniellopez01/credit-risk

-> Check whether any column contains missing values:

is.na(credit) %>% colSums()

#>     checking_balance months_loan_duration       credit_history 
#>                    0                    0                    0 
#>              purpose               amount      savings_balance 
#>                    0                    0                    0 
#>  employment_duration    percent_of_income   years_at_residence 
#>                    0                    0                    0 
#>                  age         other_credit              housing 
#>                    0                    0                    0 
#> existing_loans_count                  job           dependents 
#>                    0                    0                    0 
#>                phone              default 
#>                    0                    0

We can see that no values are missing, so the dataframe is ready for modelling.

3 Logistic Regression

3.1 Preprocessing Data

Check the proportion of each class of our target variable (default) in the dataframe:

prop.table(table(credit$default))

#> 
#>  no yes 
#> 0.7 0.3

table(credit$default)

#> 
#>  no yes 
#> 700 300

summary(credit)

#>    checking_balance months_loan_duration   credit_history
#>  < 0 DM    :274     Min.   : 4.0         critical :293   
#>  > 200 DM  : 63     1st Qu.:12.0         good     :530   
#>  1 - 200 DM:269     Median :18.0         perfect  : 40   
#>  unknown   :394     Mean   :20.9         poor     : 88   
#>                     3rd Qu.:24.0         very good: 49   
#>                     Max.   :72.0                         
#>                  purpose        amount           savings_balance
#>  business            : 97   Min.   :  250   < 100 DM     :603   
#>  car                 :337   1st Qu.: 1366   > 1000 DM    : 48   
#>  car0                : 12   Median : 2320   100 - 500 DM :103   
#>  education           : 59   Mean   : 3271   500 - 1000 DM: 63   
#>  furniture/appliances:473   3rd Qu.: 3972   unknown      :183   
#>  renovations         : 22   Max.   :18424                       
#>   employment_duration percent_of_income years_at_residence      age       
#>  < 1 year   :172      Min.   :1.000     Min.   :1.000      Min.   :19.00  
#>  > 7 years  :253      1st Qu.:2.000     1st Qu.:2.000      1st Qu.:27.00  
#>  1 - 4 years:339      Median :3.000     Median :3.000      Median :33.00  
#>  4 - 7 years:174      Mean   :2.973     Mean   :2.845      Mean   :35.55  
#>  unemployed : 62      3rd Qu.:4.000     3rd Qu.:4.000      3rd Qu.:42.00  
#>                       Max.   :4.000     Max.   :4.000      Max.   :75.00  
#>  other_credit  housing    existing_loans_count         job        dependents   
#>  bank :139    other:108   Min.   :1.000        management:148   Min.   :1.000  
#>  none :814    own  :713   1st Qu.:1.000        skilled   :630   1st Qu.:1.000  
#>  store: 47    rent :179   Median :1.000        unemployed: 22   Median :1.000  
#>                           Mean   :1.407        unskilled :200   Mean   :1.155  
#>                           3rd Qu.:2.000                         3rd Qu.:1.000  
#>                           Max.   :4.000                         Max.   :2.000  
#>  phone     default  
#>  no :596   no :700  
#>  yes:404   yes:300  
#>                     
#>                     
#>                     
#>

3.2 Train-Test Split

Before modelling, we split the data into a train dataset and a test dataset. The train dataset is used to fit the model, and the test dataset is used for comparison, to see whether the model overfits or underfits and how well it predicts data it has not seen during training. We will use 80% of the data for training and the remaining 20% for testing.

set.seed(123)
samplesize <- round(0.8 * nrow(credit), 0)
index <- sample(seq_len(nrow(credit)), size = samplesize)

data_train <- credit[index, ]
data_test <- credit[-index, ]
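
As a quick sanity check on a split like the one above, one can compare the row counts and class proportions of the two subsets. The sketch below is only an illustration: toy_credit, toy_train and toy_test are hypothetical names, and the data frame is simulated with the same 70/30 class mix, so the proportions it prints are not those of the real dataset.

```r
# Toy stand-in for the credit data: 1000 rows, 30% "yes" (assumed for illustration)
set.seed(123)
toy_credit <- data.frame(
  amount  = round(runif(1000, 250, 18424)),
  default = factor(rep(c("no", "yes"), times = c(700, 300)))
)

# Same splitting logic as above: sample 80% of the row indices
samplesize <- round(0.8 * nrow(toy_credit), 0)
index <- sample(seq_len(nrow(toy_credit)), size = samplesize)
toy_train <- toy_credit[index, ]
toy_test  <- toy_credit[-index, ]

# Row counts of each subset
c(train = nrow(toy_train), test = nrow(toy_test))

# Class proportions should be roughly similar in both subsets
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```

If the proportions differ noticeably between the two subsets, a stratified split (sampling within each class) may be worth considering.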

3.3 Modelling

model <- glm(formula = default~., family = "binomial", 
             data = data_train)
summary(model)

#> 
#> Call:
#> glm(formula = default ~ ., family = "binomial", data = data_train)
#> 
#> Coefficients:
#>                                   Estimate  Std. Error z value      Pr(>|z|)
#> (Intercept)                    -2.22038984  1.01870535  -2.180      0.029286
#> checking_balance> 200 DM       -0.65631428  0.38750312  -1.694      0.090322
#> checking_balance1 - 200 DM     -0.33314110  0.22819241  -1.460      0.144314
#> checking_balanceunknown        -1.70449853  0.25416655  -6.706 0.00000000002
#> months_loan_duration            0.02787660  0.00970347   2.873      0.004068
#> credit_historygood              0.73942240  0.28248098   2.618      0.008855
#> credit_historyperfect           1.24001177  0.48545271   2.554      0.010639
#> credit_historypoor              0.68930135  0.37030358   1.861      0.062681
#> credit_historyvery good         1.50822902  0.46370755   3.253      0.001144
#> purposecar                      0.37395694  0.35550290   1.052      0.292841
#> purposecar0                    -0.46649792  0.89134647  -0.523      0.600722
#> purposeeducation                0.67425267  0.48981254   1.377      0.168651
#> purposefurniture/appliances     0.05398491  0.34733021   0.155      0.876484
#> purposerenovations              1.05455087  0.66796827   1.579      0.114395
#> amount                          0.00011608  0.00004625   2.510      0.012086
#> savings_balance> 1000 DM       -1.00847769  0.51518782  -1.957      0.050289
#> savings_balance100 - 500 DM    -0.23350523  0.30457218  -0.767      0.443280
#> savings_balance500 - 1000 DM   -0.63685418  0.48031712  -1.326      0.184872
#> savings_balanceunknown         -0.98291416  0.29259844  -3.359      0.000782
#> employment_duration> 7 years   -0.57749964  0.31691579  -1.822      0.068417
#> employment_duration1 - 4 years -0.29433412  0.25506277  -1.154      0.248514
#> employment_duration4 - 7 years -1.13445442  0.32690284  -3.470      0.000520
#> employment_durationunemployed  -0.21962897  0.45871658  -0.479      0.632088
#> percent_of_income               0.33310932  0.09397215   3.545      0.000393
#> years_at_residence             -0.00717754  0.09313119  -0.077      0.938569
#> age                            -0.01024356  0.00958065  -1.069      0.284983
#> other_creditnone               -0.31676684  0.25570865  -1.239      0.215427
#> other_creditstore              -0.37448691  0.47222930  -0.793      0.427767
#> housingown                     -0.22158550  0.31366315  -0.706      0.479912
#> housingrent                     0.14400956  0.36374795   0.396      0.692175
#> existing_loans_count            0.22413377  0.21560034   1.040      0.298535
#> jobskilled                      0.19354648  0.31604581   0.612      0.540273
#> jobunemployed                  -0.08891266  0.69952570  -0.127      0.898858
#> jobunskilled                    0.02132330  0.38309118   0.056      0.955612
#> dependents                      0.23933059  0.26280071   0.911      0.362457
#> phoneyes                       -0.28486394  0.21679383  -1.314      0.188851
#>                                   
#> (Intercept)                    *  
#> checking_balance> 200 DM       .  
#> checking_balance1 - 200 DM        
#> checking_balanceunknown        ***
#> months_loan_duration           ** 
#> credit_historygood             ** 
#> credit_historyperfect          *  
#> credit_historypoor             .  
#> credit_historyvery good        ** 
#> purposecar                        
#> purposecar0                       
#> purposeeducation                  
#> purposefurniture/appliances       
#> purposerenovations                
#> amount                         *  
#> savings_balance> 1000 DM       .  
#> savings_balance100 - 500 DM       
#> savings_balance500 - 1000 DM      
#> savings_balanceunknown         ***
#> employment_duration> 7 years   .  
#> employment_duration1 - 4 years    
#> employment_duration4 - 7 years ***
#> employment_durationunemployed     
#> percent_of_income              ***
#> years_at_residence                
#> age                               
#> other_creditnone                  
#> other_creditstore                 
#> housingown                        
#> housingrent                       
#> existing_loans_count              
#> jobskilled                        
#> jobunemployed                     
#> jobunskilled                      
#> dependents                        
#> phoneyes                          
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 959.84  on 799  degrees of freedom
#> Residual deviance: 754.33  on 764  degrees of freedom
#> AIC: 826.33
#> 
#> Number of Fisher Scoring iterations: 5

3.4 Model Fitting

In the first model, quite a number of predictor variables are not significant for the target, so we will perform variable selection using the backward stepwise method, which removes predictors one at a time as long as doing so lowers the AIC.

library(MASS)
model2 <- stepAIC(model, direction = "backward")

#> Start:  AIC=826.33
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     purpose + amount + savings_balance + employment_duration + 
#>     percent_of_income + years_at_residence + age + other_credit + 
#>     housing + existing_loans_count + job + dependents + phone
#> 
#>                        Df Deviance    AIC
#> - job                   3   755.20 821.20
#> - purpose               5   761.18 823.18
#> - other_credit          2   755.95 823.95
#> - years_at_residence    1   754.34 824.34
#> - housing               2   756.57 824.57
#> - dependents            1   755.15 825.15
#> - existing_loans_count  1   755.41 825.41
#> - age                   1   755.49 825.49
#> - phone                 1   756.07 826.07
#> <none>                      754.33 826.33
#> - amount                1   760.68 830.68
#> - employment_duration   4   768.31 832.31
#> - months_loan_duration  1   762.68 832.68
#> - credit_history        4   769.93 833.93
#> - savings_balance       4   770.20 834.20
#> - percent_of_income     1   767.42 837.42
#> - checking_balance      3   809.38 875.38
#> 
#> Step:  AIC=821.2
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     purpose + amount + savings_balance + employment_duration + 
#>     percent_of_income + years_at_residence + age + other_credit + 
#>     housing + existing_loans_count + dependents + phone
#> 
#>                        Df Deviance    AIC
#> - purpose               5   762.18 818.18
#> - other_credit          2   756.82 818.82
#> - years_at_residence    1   755.20 819.20
#> - housing               2   757.43 819.43
#> - dependents            1   755.89 819.89
#> - existing_loans_count  1   756.28 820.28
#> - age                   1   756.58 820.58
#> <none>                      755.20 821.20
#> - phone                 1   757.34 821.34
#> - amount                1   761.36 825.36
#> - employment_duration   4   768.89 826.89
#> - months_loan_duration  1   764.39 828.39
#> - credit_history        4   770.87 828.87
#> - savings_balance       4   770.95 828.95
#> - percent_of_income     1   768.44 832.44
#> - checking_balance      3   809.94 869.94
#> 
#> Step:  AIC=818.18
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     years_at_residence + age + other_credit + housing + existing_loans_count + 
#>     dependents + phone
#> 
#>                        Df Deviance    AIC
#> - other_credit          2   763.47 815.47
#> - years_at_residence    1   762.19 816.19
#> - dependents            1   763.11 817.11
#> - age                   1   763.12 817.12
#> - existing_loans_count  1   763.24 817.24
#> - housing               2   765.57 817.57
#> <none>                      762.18 818.18
#> - phone                 1   764.79 818.79
#> - amount                1   768.73 822.73
#> - months_loan_duration  1   770.15 824.15
#> - employment_duration   4   776.32 824.32
#> - savings_balance       4   777.04 825.04
#> - credit_history        4   777.88 825.88
#> - percent_of_income     1   775.41 829.41
#> - checking_balance      3   815.93 865.93
#> 
#> Step:  AIC=815.47
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     years_at_residence + age + housing + existing_loans_count + 
#>     dependents + phone
#> 
#>                        Df Deviance    AIC
#> - years_at_residence    1   763.48 813.48
#> - age                   1   764.29 814.29
#> - existing_loans_count  1   764.49 814.49
#> - dependents            1   764.54 814.54
#> - housing               2   767.04 815.04
#> <none>                      763.47 815.47
#> - phone                 1   766.06 816.06
#> - amount                1   769.91 819.91
#> - employment_duration   4   777.23 821.23
#> - months_loan_duration  1   771.36 821.36
#> - savings_balance       4   778.20 822.20
#> - credit_history        4   781.18 825.18
#> - percent_of_income     1   776.34 826.34
#> - checking_balance      3   817.09 863.09
#> 
#> Step:  AIC=813.48
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     age + housing + existing_loans_count + dependents + phone
#> 
#>                        Df Deviance    AIC
#> - age                   1   764.29 812.29
#> - existing_loans_count  1   764.51 812.51
#> - dependents            1   764.56 812.56
#> <none>                      763.48 813.48
#> - housing               2   767.50 813.50
#> - phone                 1   766.06 814.06
#> - amount                1   769.91 817.91
#> - employment_duration   4   777.31 819.31
#> - months_loan_duration  1   771.38 819.38
#> - savings_balance       4   778.22 820.22
#> - credit_history        4   781.20 823.20
#> - percent_of_income     1   776.37 824.37
#> - checking_balance      3   817.42 861.42
#> 
#> Step:  AIC=812.29
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     housing + existing_loans_count + dependents + phone
#> 
#>                        Df Deviance    AIC
#> - dependents            1   765.22 811.22
#> - existing_loans_count  1   765.25 811.25
#> - housing               2   768.27 812.27
#> <none>                      764.29 812.29
#> - phone                 1   767.17 813.17
#> - amount                1   770.64 816.64
#> - months_loan_duration  1   772.59 818.59
#> - employment_duration   4   779.04 819.04
#> - savings_balance       4   779.07 819.07
#> - credit_history        4   782.38 822.38
#> - percent_of_income     1   777.13 823.13
#> - checking_balance      3   817.82 859.82
#> 
#> Step:  AIC=811.22
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     housing + existing_loans_count + phone
#> 
#>                        Df Deviance    AIC
#> - existing_loans_count  1   766.41 810.41
#> <none>                      765.22 811.22
#> - housing               2   769.24 811.24
#> - phone                 1   768.05 812.05
#> - amount                1   771.44 815.44
#> - months_loan_duration  1   773.40 817.40
#> - savings_balance       4   779.60 817.60
#> - employment_duration   4   779.67 817.67
#> - percent_of_income     1   777.48 821.48
#> - credit_history        4   783.78 821.78
#> - checking_balance      3   818.81 858.81
#> 
#> Step:  AIC=810.41
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     housing + phone
#> 
#>                        Df Deviance    AIC
#> - housing               2   770.30 810.30
#> <none>                      766.41 810.41
#> - phone                 1   768.95 810.95
#> - amount                1   772.56 814.56
#> - employment_duration   4   780.13 816.13
#> - months_loan_duration  1   774.31 816.31
#> - savings_balance       4   780.90 816.90
#> - credit_history        4   783.89 819.89
#> - percent_of_income     1   778.40 820.40
#> - checking_balance      3   819.64 857.64
#> 
#> Step:  AIC=810.3
#> default ~ checking_balance + months_loan_duration + credit_history + 
#>     amount + savings_balance + employment_duration + percent_of_income + 
#>     phone
#> 
#>                        Df Deviance    AIC
#> <none>                      770.30 810.30
#> - phone                 1   772.67 810.67
#> - amount                1   777.03 815.03
#> - employment_duration   4   784.16 816.16
#> - months_loan_duration  1   778.28 816.28
#> - savings_balance       4   784.49 816.49
#> - percent_of_income     1   781.62 819.62
#> - credit_history        4   788.94 820.94
#> - checking_balance      3   826.07 860.07

The model obtained from the stepwise method, model2, is as follows:

summary(model2)

#> 
#> Call:
#> glm(formula = default ~ checking_balance + months_loan_duration + 
#>     credit_history + amount + savings_balance + employment_duration + 
#>     percent_of_income + phone, family = "binomial", data = data_train)
#> 
#> Coefficients:
#>                                   Estimate  Std. Error z value         Pr(>|z|)
#> (Intercept)                    -1.76834190  0.43009162  -4.112 0.00003930168120
#> checking_balance> 200 DM       -0.77346331  0.37859974  -2.043         0.041057
#> checking_balance1 - 200 DM     -0.41142105  0.22089928  -1.862         0.062535
#> checking_balanceunknown        -1.69409691  0.24704577  -6.857 0.00000000000701
#> months_loan_duration            0.02630094  0.00935363   2.812         0.004926
#> credit_historygood              0.55627923  0.22566072   2.465         0.013697
#> credit_historyperfect           1.28694622  0.47109761   2.732         0.006299
#> credit_historypoor              0.63396610  0.35767990   1.772         0.076322
#> credit_historyvery good         1.46748062  0.40256082   3.645         0.000267
#> amount                          0.00011344  0.00004394   2.582         0.009825
#> savings_balance> 1000 DM       -0.85199878  0.49738961  -1.713         0.086723
#> savings_balance100 - 500 DM    -0.12624929  0.29420360  -0.429         0.667834
#> savings_balance500 - 1000 DM   -0.72289042  0.47106289  -1.535         0.124884
#> savings_balanceunknown         -0.88205593  0.28261210  -3.121         0.001802
#> employment_duration> 7 years   -0.62176575  0.28214924  -2.204         0.027547
#> employment_duration1 - 4 years -0.34107446  0.24926570  -1.368         0.171213
#> employment_duration4 - 7 years -1.09704060  0.31560067  -3.476         0.000509
#> employment_durationunemployed  -0.37384135  0.39660237  -0.943         0.345880
#> percent_of_income               0.29623457  0.08957000   3.307         0.000942
#> phoneyes                       -0.30102704  0.19645773  -1.532         0.125455
#>                                   
#> (Intercept)                    ***
#> checking_balance> 200 DM       *  
#> checking_balance1 - 200 DM     .  
#> checking_balanceunknown        ***
#> months_loan_duration           ** 
#> credit_historygood             *  
#> credit_historyperfect          ** 
#> credit_historypoor             .  
#> credit_historyvery good        ***
#> amount                         ** 
#> savings_balance> 1000 DM       .  
#> savings_balance100 - 500 DM       
#> savings_balance500 - 1000 DM      
#> savings_balanceunknown         ** 
#> employment_duration> 7 years   *  
#> employment_duration1 - 4 years    
#> employment_duration4 - 7 years ***
#> employment_durationunemployed     
#> percent_of_income              ***
#> phoneyes                          
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 959.84  on 799  degrees of freedom
#> Residual deviance: 770.30  on 780  degrees of freedom
#> AIC: 810.3
#> 
#> Number of Fisher Scoring iterations: 5
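
The AIC values reported for the two models can be reproduced by hand from the printed output. For a logistic regression on a binary (0/1) response, the AIC equals the residual deviance plus twice the number of estimated coefficients. The helper aic_from_deviance below is a hypothetical name introduced only for this sketch, and the inputs are copied from the two summaries above.

```r
# AIC for a binary logistic regression: residual deviance + 2 * number of coefficients
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

# Full model: 799 - 764 = 35 non-intercept coefficients, plus the intercept = 36
aic_full <- aic_from_deviance(754.33, 36)

# Stepwise model: 799 - 780 = 19 non-intercept coefficients, plus the intercept = 20
aic_step <- aic_from_deviance(770.30, 20)

c(full = aic_full, stepwise = aic_step)
#>     full stepwise 
#>   826.33   810.30
```

The stepwise model has a slightly higher deviance but 16 fewer coefficients, which is why its AIC is lower.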

3.5 Prediction

Using model2 from the stepwise method, we will now predict on the test data.

data_test$prob_default<-predict(model2, type = "response", newdata = data_test)

We will now visualise the distribution of the predicted probabilities:

ggplot(data_test, aes(x=prob_default)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

data_test

subset_data <-data_test[, c("default", "prob_default")]

str(subset_data)

#> 'data.frame':    200 obs. of  2 variables:
#>  $ default     : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
#>  $ prob_default: num  0.1093 0.0319 0.0825 0.0266 0.8066 ...

# Filter and calculate min/max in one step
result <- data_test %>%
  filter(default == "no") %>%
  summarise(
    min_prob_default = min(prob_default, na.rm = TRUE),
    max_prob_default = max(prob_default, na.rm = TRUE)
  )

# Display results
result

# Filter and calculate min/max in one step
result_yes <- data_test %>%
  filter(default == "yes") %>%
  summarise(
    min_prob_default = min(prob_default, na.rm = TRUE),
    max_prob_default = max(prob_default, na.rm = TRUE)
  )

# Display results
result_yes

We classify an observation as a default (“yes”) when its predicted probability is greater than 0.5.

data_test$pred_default <- factor(ifelse(data_test$prob_default > 0.5, "yes", "no"), levels = c("no", "yes"))
data_test[1:10, c("pred_default", "default")]
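
The 0.5 cutoff is a choice rather than a requirement. The sketch below illustrates, on simulated labels and probabilities (not the output of model2), how moving the cutoff trades sensitivity against specificity. The names actual, prob and rates_at are hypothetical and exist only for this example.

```r
set.seed(42)
# Simulated labels and predicted probabilities (hypothetical, not from model2)
actual <- factor(sample(c("no", "yes"), 200, replace = TRUE, prob = c(0.65, 0.35)),
                 levels = c("no", "yes"))
prob <- ifelse(actual == "yes", rbeta(200, 4, 3), rbeta(200, 2, 5))

# Sensitivity and specificity at a given cutoff
rates_at <- function(cutoff) {
  pred <- factor(ifelse(prob > cutoff, "yes", "no"), levels = c("no", "yes"))
  c(cutoff      = cutoff,
    sensitivity = mean(pred[actual == "yes"] == "yes"),
    specificity = mean(pred[actual == "no"] == "no"))
}

# Compare a few candidate cutoffs, one row per cutoff
t(sapply(c(0.3, 0.5, 0.7), rates_at))
```

A lower cutoff flags more loans as likely defaults, which can only raise sensitivity and lower specificity. Which side of that trade matters more depends on the relative cost of a missed default versus a wrongly rejected loan.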

3.6 Model Evaluation

library(caret)
log_conf <- confusionMatrix(data_test$pred_default, data_test$default, positive = 'yes')
log_conf

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  113  37
#>        yes  17  33
#>                                           
#>                Accuracy : 0.73            
#>                  95% CI : (0.6628, 0.7902)
#>     No Information Rate : 0.65            
#>     P-Value [Acc > NIR] : 0.009749        
#>                                           
#>                   Kappa : 0.3647          
#>                                           
#>  Mcnemar's Test P-Value : 0.009722        
#>                                           
#>             Sensitivity : 0.4714          
#>             Specificity : 0.8692          
#>          Pos Pred Value : 0.6600          
#>          Neg Pred Value : 0.7533          
#>              Prevalence : 0.3500          
#>          Detection Rate : 0.1650          
#>    Detection Prevalence : 0.2500          
#>       Balanced Accuracy : 0.6703          
#>                                           
#>        'Positive' Class : yes             
#>

Based on the results of the confusion matrix above, we can gather the following information:

Summary: The model achieves 73% accuracy, which is only moderately above the no-information rate of 65% (the accuracy obtained by always predicting “no” on the test set). Sensitivity (47.14%) is low, indicating that the model misses more than half of the actual “yes” cases. Specificity (86.92%) is high, showing the model performs well at identifying “no” cases. Precision for the “yes” class (66.00%) means that when the model predicts “yes”, it is correct about two times out of three. The balanced accuracy (67.03%) gives a clearer picture under class imbalance: the model performs better than random guessing but still struggles with the minority class (“yes”).
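
The main statistics in the confusion matrix output can be recomputed directly from its four cell counts, which makes their definitions explicit. The variable names below (TP, FN, FP, TN and so on) are introduced only for this sketch, with “yes” as the positive class.

```r
# Counts read from the confusion matrix above (positive class = "yes")
TP <- 33   # predicted yes, actually yes
FN <- 37   # predicted no,  actually yes
FP <- 17   # predicted yes, actually no
TN <- 113  # predicted no,  actually no

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)          # share of actual defaults that are caught
specificity <- TN / (TN + FP)          # share of non-defaults correctly identified
precision   <- TP / (TP + FP)          # "Pos Pred Value" in the caret output
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        precision = precision, balanced_accuracy = balanced), 4)
```

The rounded values are 0.73, 0.4714, 0.8692, 0.66 and 0.6703, matching the caret output.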

3.7 Model Interpretation

# Odds ratio all coefficients
exp(model2$coefficients) %>% 
  data.frame()

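Model interpretation: 1. Holding the other predictors fixed, an applicant with a checking balance above 200 DM has odds of default about 0.46 times those of an applicant in the reference category (checking balance < 0 DM), i.e. roughly 54% lower. An applicant with a checking balance of 1 - 200 DM has odds about 0.66 times those of the reference category, i.e. roughly 34% lower.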
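
To make the step from coefficient to odds ratio concrete, the sketch below exponentiates two of the checking_balance coefficients copied from summary(model2). The reference level is “< 0 DM”, the first factor level. The names coefs, odds_ratio and pct_change are introduced only for this example.

```r
# Two coefficients copied from summary(model2); reference level is "< 0 DM"
coefs <- c("checking_balance> 200 DM"   = -0.77346331,
           "checking_balance1 - 200 DM" = -0.41142105)

odds_ratio <- exp(coefs)               # multiplicative change in the odds of default
pct_change <- (odds_ratio - 1) * 100   # the same effect as a percentage change

round(data.frame(odds_ratio, pct_change), 2)
```

An odds ratio below 1 means lower odds of default than the reference level. These are ratios of odds, not of probabilities.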

4 K-Nearest Neighbour

4.1 Pre-Processing

str(credit)

#> 'data.frame':    1000 obs. of  17 variables:
#>  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#>  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
#>  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#>  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#>  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#>  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#>  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#>  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
#>  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
#>  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
#>  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#>  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
#>  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#>  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
#>  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#>  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

-> Create dummy variables from the variables that will be used for classification modelling.

dmy <- dummyVars(" ~default+checking_balance+months_loan_duration+credit_history+purpose+amount+savings_balance+employment_duration+percent_of_income+years_at_residence+age+other_credit+housing+existing_loans_count+job+dependents+phone", data = credit)
dmy <- data.frame(predict(dmy, newdata = credit))
str(dmy)

#> 'data.frame':    1000 obs. of  46 variables:
#>  $ default.no                     : num  1 0 1 1 0 1 1 1 1 0 ...
#>  $ default.yes                    : num  0 1 0 0 1 0 0 0 0 1 ...
#>  $ checking_balance...0.DM        : num  1 0 0 1 1 0 0 0 0 0 ...
#>  $ checking_balance...200.DM      : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ checking_balance.1...200.DM    : num  0 1 0 0 0 0 0 1 0 1 ...
#>  $ checking_balance.unknown       : num  0 0 1 0 0 1 1 0 1 0 ...
#>  $ months_loan_duration           : num  6 48 12 42 24 36 24 36 12 30 ...
#>  $ credit_history.critical        : num  1 0 1 0 0 0 0 0 0 1 ...
#>  $ credit_history.good            : num  0 1 0 1 0 1 1 1 1 0 ...
#>  $ credit_history.perfect         : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ credit_history.poor            : num  0 0 0 0 1 0 0 0 0 0 ...
#>  $ credit_history.very.good       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ purpose.business               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ purpose.car                    : num  0 0 0 0 1 0 0 1 0 1 ...
#>  $ purpose.car0                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ purpose.education              : num  0 0 1 0 0 1 0 0 0 0 ...
#>  $ purpose.furniture.appliances   : num  1 1 0 1 0 0 1 0 1 0 ...
#>  $ purpose.renovations            : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ amount                         : num  1169 5951 2096 7882 4870 ...
#>  $ savings_balance...100.DM       : num  0 1 1 1 1 0 0 1 0 1 ...
#>  $ savings_balance...1000.DM      : num  0 0 0 0 0 0 0 0 1 0 ...
#>  $ savings_balance.100...500.DM   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ savings_balance.500...1000.DM  : num  0 0 0 0 0 0 1 0 0 0 ...
#>  $ savings_balance.unknown        : num  1 0 0 0 0 1 0 0 0 0 ...
#>  $ employment_duration...1.year   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ employment_duration...7.years  : num  1 0 0 0 0 0 1 0 0 0 ...
#>  $ employment_duration.1...4.years: num  0 1 0 0 1 1 0 1 0 0 ...
#>  $ employment_duration.4...7.years: num  0 0 1 1 0 0 0 0 1 0 ...
#>  $ employment_duration.unemployed : num  0 0 0 0 0 0 0 0 0 1 ...
#>  $ percent_of_income              : num  4 2 2 2 3 2 3 2 2 4 ...
#>  $ years_at_residence             : num  4 2 3 4 4 4 4 2 4 2 ...
#>  $ age                            : num  67 22 49 45 53 35 53 35 61 28 ...
#>  $ other_credit.bank              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ other_credit.none              : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ other_credit.store             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ housing.other                  : num  0 0 0 1 1 1 0 0 0 0 ...
#>  $ housing.own                    : num  1 1 1 0 0 0 1 0 1 1 ...
#>  $ housing.rent                   : num  0 0 0 0 0 0 0 1 0 0 ...
#>  $ existing_loans_count           : num  2 1 1 1 2 1 1 1 1 2 ...
#>  $ job.management                 : num  0 0 0 0 0 0 0 1 0 1 ...
#>  $ job.skilled                    : num  1 1 0 1 1 0 1 0 0 0 ...
#>  $ job.unemployed                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ job.unskilled                  : num  0 0 1 0 0 1 0 0 1 0 ...
#>  $ dependents                     : num  1 1 2 2 2 2 1 1 1 1 ...
#>  $ phone.no                       : num  0 1 1 1 1 0 1 0 1 1 ...
#>  $ phone.yes                      : num  1 0 0 0 0 1 0 1 0 0 ...

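Delete the redundant dummy column of each variable that has only two categories (default and phone), since the remaining column already carries all the information.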

dmy$default.no <- NULL
dmy$phone.no <- NULL
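Why one of the two columns can be dropped can be checked on a small example. The sketch below uses a made-up five-row phone factor (not the credit data) and builds the two dummy columns by hand in base R; the object names are illustrative only.

```r
# Toy two-level factor (hypothetical values, not taken from credit.csv)
phone <- factor(c("yes", "no", "no", "yes", "no"))

# One 0/1 column per level, built by hand
dummies <- sapply(levels(phone), function(lv) as.numeric(phone == lv))
colnames(dummies) <- paste0("phone.", levels(phone))

# phone.no is always 1 - phone.yes, so keeping both adds no information
redundant <- all(dummies[, "phone.no"] == 1 - dummies[, "phone.yes"])
redundant
```

The same holds for default.no and default.yes, which is why only default.yes is kept as the label.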

Check the names of the dummy variables:

names(dmy)

#>  [1] "default.yes"                     "checking_balance...0.DM"        
#>  [3] "checking_balance...200.DM"       "checking_balance.1...200.DM"    
#>  [5] "checking_balance.unknown"        "months_loan_duration"           
#>  [7] "credit_history.critical"         "credit_history.good"            
#>  [9] "credit_history.perfect"          "credit_history.poor"            
#> [11] "credit_history.very.good"        "purpose.business"               
#> [13] "purpose.car"                     "purpose.car0"                   
#> [15] "purpose.education"               "purpose.furniture.appliances"   
#> [17] "purpose.renovations"             "amount"                         
#> [19] "savings_balance...100.DM"        "savings_balance...1000.DM"      
#> [21] "savings_balance.100...500.DM"    "savings_balance.500...1000.DM"  
#> [23] "savings_balance.unknown"         "employment_duration...1.year"   
#> [25] "employment_duration...7.years"   "employment_duration.1...4.years"
#> [27] "employment_duration.4...7.years" "employment_duration.unemployed" 
#> [29] "percent_of_income"               "years_at_residence"             
#> [31] "age"                             "other_credit.bank"              
#> [33] "other_credit.none"               "other_credit.store"             
#> [35] "housing.other"                   "housing.own"                    
#> [37] "housing.rent"                    "existing_loans_count"           
#> [39] "job.management"                  "job.skilled"                    
#> [41] "job.unemployed"                  "job.unskilled"                  
#> [43] "dependents"                      "phone.yes"

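Create the training and testing data from dmy:

set.seed(123)
# samplesize comes from earlier in the report; it must be a vector of
# training row indices (not a single row count) for this split to work.
# Column 1 (default.yes) is the label; columns 2:21 are the predictors used here.
dmy_train <- dmy[samplesize,2:21]
dmy_test <- dmy[-samplesize,2:21]

dmy_train_label <- dmy[samplesize,1]
dmy_test_label <- dmy[-samplesize,1]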
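The split above relies on the indexing object being a vector of row numbers. A minimal sketch of that pattern on simulated data is shown below; the data frame toy, the 80/20 proportion, and the name train_idx are assumptions for illustration, not values taken from this report.

```r
set.seed(123)

# Simulated stand-in for dmy: label in column 1, predictors after it
toy <- data.frame(default.yes = rep(c(0, 1), times = c(70, 30)),
                  x1 = rnorm(100),
                  x2 = runif(100))

# Draw a vector of row indices for training (assumed 80% split)
n_train <- round(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

# Positive indices select training rows, negative indices select the rest
toy_train <- toy[train_idx, -1]
toy_test <- toy[-train_idx, -1]
toy_train_label <- toy[train_idx, 1]
toy_test_label <- toy[-train_idx, 1]

c(train = nrow(toy_train), test = nrow(toy_test))
```

If a single number were used in place of train_idx, the training set would contain one row and the test set all the others, so checking nrow() on both parts is a cheap safeguard.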

Create predictions with K-NN:

pred_knn <- class::knn(train = dmy_train,
                       test = dmy_test, 
                       cl = dmy_train_label, 
                       k = 17)
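K-NN classifies by distance, and the str() output above shows amount in the thousands next to 0/1 dummy columns, so unscaled features can let one variable dominate the distance. The sketch below shows one common remedy, min-max scaling with the training range, on simulated data. It is not part of the original analysis; the data, the 160/40 split, and the helper rescale() are assumptions.

```r
library(class)
set.seed(123)

# Simulated predictors on very different scales (hypothetical values)
n <- 200
toy <- data.frame(amount = round(runif(n, 250, 18000)),
                  housing.own = rbinom(n, 1, 0.7))
label <- factor(rbinom(n, 1, 0.3), levels = c(0, 1))
train_idx <- sample(seq_len(n), size = 160)

# Learn min and max on the training rows only, then apply to both parts
rng_min <- sapply(toy[train_idx, ], min)
rng_max <- sapply(toy[train_idx, ], max)
rescale <- function(df) {
  as.data.frame(scale(df, center = rng_min, scale = rng_max - rng_min))
}
train_x <- rescale(toy[train_idx, ])
test_x <- rescale(toy[-train_idx, ])

pred <- knn(train = train_x, test = test_x, cl = label[train_idx], k = 17)
length(pred)
```

Taking the range from the training rows keeps information from the test set out of the preprocessing step.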

Create the confusion matrix from the K-NN predictions, with "1" (default) as the positive class:

pred_knn_conf <- caret::confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label), positive = "1")
pred_knn_conf

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 699 300
#>          1   0   0
#>                                              
#>                Accuracy : 0.6997             
#>                  95% CI : (0.6702, 0.728)    
#>     No Information Rate : 0.6997             
#>     P-Value [Acc > NIR] : 0.5156             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.0000             
#>             Specificity : 1.0000             
#>          Pos Pred Value :    NaN             
#>          Neg Pred Value : 0.6997             
#>              Prevalence : 0.3003             
#>          Detection Rate : 0.0000             
#>    Detection Prevalence : 0.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : 1                  
#>

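Summary of Model Performance: The model predicts only the majority class (“0”) and never predicts the minority class (“1”). Specificity is 100% because every negative case is labelled negative, while sensitivity is 0% because none of the 300 positive cases is detected. The accuracy (69.97%) is misleading: it equals the no-information rate, so the model does no better than always predicting “0” on this imbalanced dataset, and Kappa is 0. Note also that the matrix covers 999 test rows out of the 1,000 observations, so the train/test indexing is worth re-checking before attributing this result to class imbalance alone.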
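The statistics reported by the confusion matrix can be reproduced by hand from its four cells, which makes it easier to see why accuracy and the no-information rate coincide. The sketch below uses only the counts printed above; the variable names are illustrative.

```r
# Cells of the confusion matrix above, with "1" as the positive class
tn <- 699  # predicted 0, reference 0
fn <- 300  # predicted 0, reference 1
fp <- 0    # predicted 1, reference 0
tp <- 0    # predicted 1, reference 1
total <- tn + fn + fp + tp

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)          # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)          # 0/0, which R reports as NaN
nir         <- max(tp + fn, tn + fp) / total
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir, balanced = balanced), 4)
```

Precision is undefined here because the model never predicts the positive class, which matches the NaN shown for Pos Pred Value.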

4.2 Model Evaluation: Logistic Regression and K-NN

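# Collect the headline metrics from each caret confusion matrix.
# "Pos Pred Value" is the precision for the positive class.
eval_logit <- tibble(Accuracy = log_conf$overall["Accuracy"],
           Recall = log_conf$byClass["Sensitivity"],
           Specificity = log_conf$byClass["Specificity"],
           Precision = log_conf$byClass["Pos Pred Value"])

# The K-NN row must read every metric from pred_knn_conf, including specificity.
eval_knn <- tibble(Accuracy = pred_knn_conf$overall["Accuracy"],
           Recall = pred_knn_conf$byClass["Sensitivity"],
           Specificity = pred_knn_conf$byClass["Specificity"],
           Precision = pred_knn_conf$byClass["Pos Pred Value"])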

# Model Evaluation Logit
eval_logit

# Model Evaluation K-NN
eval_knn

Key Observations:

KNN’s Failure with the Positive Class: KNN fails entirely to detect the positive class (recall = 0%). This is likely due to the imbalanced nature of the dataset, where the positive class is underrepresented, causing KNN to classify everything as the majority class.

Logistic Regression’s Balanced Performance: Logistic Regression, while not perfect, balances recall and specificity better. It correctly identifies 45.16% of positive cases while maintaining good specificity and precision.

Imbalanced Dataset Effects: Both models are influenced by the class imbalance. The high specificity and low recall highlight the difficulty in predicting the minority class.
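One simple way to probe the effect of class imbalance is to downsample the majority class in the training data and refit. The sketch below illustrates the idea in base R on simulated data with a 70/30 split of labels; it is an assumption about a possible follow-up, not a step performed in this report, and the names toy, keep_majority, and balanced are illustrative.

```r
set.seed(123)

# Simulated training data with a 70/30 class ratio (hypothetical values)
toy <- data.frame(default.yes = rep(c(0, 1), times = c(70, 30)),
                  amount = round(runif(100, 250, 18000)))

minority_idx <- which(toy$default.yes == 1)
majority_idx <- which(toy$default.yes == 0)

# Keep every minority row and an equally sized random sample of majority rows
keep_majority <- sample(majority_idx, size = length(minority_idx))
balanced <- toy[c(minority_idx, keep_majority), ]

class_counts <- table(balanced$default.yes)
class_counts
```

Resampling of this kind would be applied to the training rows only, so that the test set keeps the original class proportions.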

CM1 model_TeamAlgo_LBB_YUF

Yudhofp

2024-11-20