We will build a first classification model using the “credit.csv” dataset. The goal is to learn the relationship between the predictor variables and the target variable “default”, which records whether a customer defaulted or not. Using the historical data in the “credit.csv” data frame, we want to predict whether a newly disbursed loan will default.
-> Load the required packages:
-> Load the dataset with ‘stringsAsFactors = TRUE’ so that string/character columns are read as factors and are ready for modelling.
-> Check the data structure and see whether any variables have the wrong data type
#> 'data.frame': 1000 obs. of 17 variables:
#> $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
#> $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
#> $ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
#> $ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
#> $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
#> $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
#> $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
#> $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
#> $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
#> $ age : int 67 22 49 45 53 35 53 35 61 28 ...
#> $ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
#> $ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
#> $ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
#> $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
#> $ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
From the data structure we can see that all variables already have an appropriate data type. The data has 1000 rows (observations) and 17 columns (variables). The target variable is default, which indicates whether the loan defaulted (“yes”) or not (“no”).
This is the data dictionary obtained from the source material on Kaggle:
checking_balance: Applicant’s checking account balance.
months_loan_duration: Duration of the loan in months.
credit_history: Credit history of the applicant.
purpose: Purpose of the loan.
amount: Loan amount.
savings_balance: Savings account balance.
employment_duration: Length of employment.
percent_of_income: Percentage of income allocated to loan repayment.
years_at_residence: Years at the current residence.
age: Applicant’s age.
other_credit: Presence of other credit agreements.
housing: Housing status (e.g., rent, own).
existing_loans_count: Number of existing loans.
job: Job type or classification.
dependents: Number of dependents.
phone: Availability of a telephone.
default: Target variable indicating loan default (“yes” or “no”).
Reference for the dataset: https://www.kaggle.com/datasets/daniellopez01/credit-risk
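The loading and inspection steps described above might look like the following sketch. Since `credit.csv` may not be at hand, the example writes a three-row stand-in file with made-up values and only four of the 17 columns to a temporary path. With the real file, `read.csv("credit.csv", stringsAsFactors = TRUE)` followed by `str()` would presumably be all that is needed; the name `credit_demo` is hypothetical.

```r
# Stand-in for credit.csv: a tiny file with made-up rows (assumption for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "checking_balance,months_loan_duration,amount,default",
  "< 0 DM,6,1169,no",
  "1 - 200 DM,48,5951,yes",
  "unknown,12,2096,no"
), csv_path)

# stringsAsFactors = TRUE converts character columns to factors on load
credit_demo <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect column types: text columns should be factors, counts should be integers
str(credit_demo)
```

Since R 4.0 the default is `stringsAsFactors = FALSE`, which is why the argument has to be set explicitly here.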
-> Check whether there are any missing values in each column
#> checking_balance months_loan_duration credit_history
#> 0 0 0
#> purpose amount savings_balance
#> 0 0 0
#> employment_duration percent_of_income years_at_residence
#> 0 0 0
#> age other_credit housing
#> 0 0 0
#> existing_loans_count job dependents
#> 0 0 0
#> phone default
#> 0 0
There are no missing values, so the data frame is ready for modelling.
Check the proportion of each class of our target variable (default) in the data frame, followed by a summary of every variable
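Per-column counts like the ones above are what `colSums(is.na(...))` returns. The sketch below runs it on a toy data frame with one deliberately missing value; the data are made up, and whether the original notebook used exactly this call is an assumption.

```r
# Toy data frame with one missing value (made-up data for illustration)
df <- data.frame(
  amount  = c(1169, NA, 2096),
  default = factor(c("no", "yes", "no"))
)

# Count missing values per column; all zeros means nothing is missing
na_counts <- colSums(is.na(df))
print(na_counts)

# Single TRUE/FALSE answer for the whole data frame
anyNA(df)
```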
#>
#> no yes
#> 0.7 0.3
#>
#> no yes
#> 700 300
#> checking_balance months_loan_duration credit_history
#> < 0 DM :274 Min. : 4.0 critical :293
#> > 200 DM : 63 1st Qu.:12.0 good :530
#> 1 - 200 DM:269 Median :18.0 perfect : 40
#> unknown :394 Mean :20.9 poor : 88
#> 3rd Qu.:24.0 very good: 49
#> Max. :72.0
#> purpose amount savings_balance
#> business : 97 Min. : 250 < 100 DM :603
#> car :337 1st Qu.: 1366 > 1000 DM : 48
#> car0 : 12 Median : 2320 100 - 500 DM :103
#> education : 59 Mean : 3271 500 - 1000 DM: 63
#> furniture/appliances:473 3rd Qu.: 3972 unknown :183
#> renovations : 22 Max. :18424
#> employment_duration percent_of_income years_at_residence age
#> < 1 year :172 Min. :1.000 Min. :1.000 Min. :19.00
#> > 7 years :253 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:27.00
#> 1 - 4 years:339 Median :3.000 Median :3.000 Median :33.00
#> 4 - 7 years:174 Mean :2.973 Mean :2.845 Mean :35.55
#> unemployed : 62 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:42.00
#> Max. :4.000 Max. :4.000 Max. :75.00
#> other_credit housing existing_loans_count job dependents
#> bank :139 other:108 Min. :1.000 management:148 Min. :1.000
#> none :814 own :713 1st Qu.:1.000 skilled :630 1st Qu.:1.000
#> store: 47 rent :179 Median :1.000 unemployed: 22 Median :1.000
#> Mean :1.407 unskilled :200 Mean :1.155
#> 3rd Qu.:2.000 3rd Qu.:1.000
#> Max. :4.000 Max. :2.000
#> phone default
#> no :596 no :700
#> yes:404 yes:300
#>
#>
#>
#>
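The class counts and proportions at the top of the output above can be reproduced with `table()` and `prop.table()`. This sketch rebuilds a target vector with the same 700/300 split rather than reading the real data; the object names are illustrative only.

```r
# Target vector with the same 700/300 split as the output above
default <- factor(rep(c("no", "yes"), times = c(700, 300)))

counts <- table(default)        # absolute counts per class
props  <- prop.table(counts)    # counts divided by the total

print(counts)
print(props)
```

A 70/30 split means that a model which always predicts "no" already reaches 70% accuracy, which is the baseline any classifier here has to beat.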
Before modelling, we have to split the data into a training dataset and a test dataset. The model is fitted on the training dataset, while the test dataset is held out to check how well the model predicts data it has not seen during training, which reveals whether the model is overfitted or underfitted. We will use 80% of the data as training data and the remaining 20% as test data.
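An 80/20 split of this kind is usually done by sampling row numbers. The following is a minimal sketch on made-up data; the seed, the stand-in data frame and the names `demo`, `train_idx`, `demo_train` and `demo_test` are assumptions, not the notebook's own objects.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Stand-in data with 1000 rows (made-up values)
demo <- data.frame(
  amount  = round(runif(1000, 250, 18424)),
  default = factor(sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.7, 0.3)))
)

# Draw 80% of the row numbers; train_idx is a vector of 800 indices
train_idx <- sample(nrow(demo), size = 0.8 * nrow(demo))

demo_train <- demo[train_idx, ]   # rows used to fit the model
demo_test  <- demo[-train_idx, ]  # remaining rows, held out for evaluation

dim(demo_train)
dim(demo_test)
```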
#>
#> Call:
#> glm(formula = default ~ ., family = "binomial", data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.22038984 1.01870535 -2.180 0.029286
#> checking_balance> 200 DM -0.65631428 0.38750312 -1.694 0.090322
#> checking_balance1 - 200 DM -0.33314110 0.22819241 -1.460 0.144314
#> checking_balanceunknown -1.70449853 0.25416655 -6.706 0.00000000002
#> months_loan_duration 0.02787660 0.00970347 2.873 0.004068
#> credit_historygood 0.73942240 0.28248098 2.618 0.008855
#> credit_historyperfect 1.24001177 0.48545271 2.554 0.010639
#> credit_historypoor 0.68930135 0.37030358 1.861 0.062681
#> credit_historyvery good 1.50822902 0.46370755 3.253 0.001144
#> purposecar 0.37395694 0.35550290 1.052 0.292841
#> purposecar0 -0.46649792 0.89134647 -0.523 0.600722
#> purposeeducation 0.67425267 0.48981254 1.377 0.168651
#> purposefurniture/appliances 0.05398491 0.34733021 0.155 0.876484
#> purposerenovations 1.05455087 0.66796827 1.579 0.114395
#> amount 0.00011608 0.00004625 2.510 0.012086
#> savings_balance> 1000 DM -1.00847769 0.51518782 -1.957 0.050289
#> savings_balance100 - 500 DM -0.23350523 0.30457218 -0.767 0.443280
#> savings_balance500 - 1000 DM -0.63685418 0.48031712 -1.326 0.184872
#> savings_balanceunknown -0.98291416 0.29259844 -3.359 0.000782
#> employment_duration> 7 years -0.57749964 0.31691579 -1.822 0.068417
#> employment_duration1 - 4 years -0.29433412 0.25506277 -1.154 0.248514
#> employment_duration4 - 7 years -1.13445442 0.32690284 -3.470 0.000520
#> employment_durationunemployed -0.21962897 0.45871658 -0.479 0.632088
#> percent_of_income 0.33310932 0.09397215 3.545 0.000393
#> years_at_residence -0.00717754 0.09313119 -0.077 0.938569
#> age -0.01024356 0.00958065 -1.069 0.284983
#> other_creditnone -0.31676684 0.25570865 -1.239 0.215427
#> other_creditstore -0.37448691 0.47222930 -0.793 0.427767
#> housingown -0.22158550 0.31366315 -0.706 0.479912
#> housingrent 0.14400956 0.36374795 0.396 0.692175
#> existing_loans_count 0.22413377 0.21560034 1.040 0.298535
#> jobskilled 0.19354648 0.31604581 0.612 0.540273
#> jobunemployed -0.08891266 0.69952570 -0.127 0.898858
#> jobunskilled 0.02132330 0.38309118 0.056 0.955612
#> dependents 0.23933059 0.26280071 0.911 0.362457
#> phoneyes -0.28486394 0.21679383 -1.314 0.188851
#>
#> (Intercept) *
#> checking_balance> 200 DM .
#> checking_balance1 - 200 DM
#> checking_balanceunknown ***
#> months_loan_duration **
#> credit_historygood **
#> credit_historyperfect *
#> credit_historypoor .
#> credit_historyvery good **
#> purposecar
#> purposecar0
#> purposeeducation
#> purposefurniture/appliances
#> purposerenovations
#> amount *
#> savings_balance> 1000 DM .
#> savings_balance100 - 500 DM
#> savings_balance500 - 1000 DM
#> savings_balanceunknown ***
#> employment_duration> 7 years .
#> employment_duration1 - 4 years
#> employment_duration4 - 7 years ***
#> employment_durationunemployed
#> percent_of_income ***
#> years_at_residence
#> age
#> other_creditnone
#> other_creditstore
#> housingown
#> housingrent
#> existing_loans_count
#> jobskilled
#> jobunemployed
#> jobunskilled
#> dependents
#> phoneyes
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 959.84 on 799 degrees of freedom
#> Residual deviance: 754.33 on 764 degrees of freedom
#> AIC: 826.33
#>
#> Number of Fisher Scoring iterations: 5
In the first model, quite a number of predictor variables are not significant for the target, so we will refine the model using the stepwise method, which selects variables based on AIC.
#> Start: AIC=826.33
#> default ~ checking_balance + months_loan_duration + credit_history +
#> purpose + amount + savings_balance + employment_duration +
#> percent_of_income + years_at_residence + age + other_credit +
#> housing + existing_loans_count + job + dependents + phone
#>
#> Df Deviance AIC
#> - job 3 755.20 821.20
#> - purpose 5 761.18 823.18
#> - other_credit 2 755.95 823.95
#> - years_at_residence 1 754.34 824.34
#> - housing 2 756.57 824.57
#> - dependents 1 755.15 825.15
#> - existing_loans_count 1 755.41 825.41
#> - age 1 755.49 825.49
#> - phone 1 756.07 826.07
#> <none> 754.33 826.33
#> - amount 1 760.68 830.68
#> - employment_duration 4 768.31 832.31
#> - months_loan_duration 1 762.68 832.68
#> - credit_history 4 769.93 833.93
#> - savings_balance 4 770.20 834.20
#> - percent_of_income 1 767.42 837.42
#> - checking_balance 3 809.38 875.38
#>
#> Step: AIC=821.2
#> default ~ checking_balance + months_loan_duration + credit_history +
#> purpose + amount + savings_balance + employment_duration +
#> percent_of_income + years_at_residence + age + other_credit +
#> housing + existing_loans_count + dependents + phone
#>
#> Df Deviance AIC
#> - purpose 5 762.18 818.18
#> - other_credit 2 756.82 818.82
#> - years_at_residence 1 755.20 819.20
#> - housing 2 757.43 819.43
#> - dependents 1 755.89 819.89
#> - existing_loans_count 1 756.28 820.28
#> - age 1 756.58 820.58
#> <none> 755.20 821.20
#> - phone 1 757.34 821.34
#> - amount 1 761.36 825.36
#> - employment_duration 4 768.89 826.89
#> - months_loan_duration 1 764.39 828.39
#> - credit_history 4 770.87 828.87
#> - savings_balance 4 770.95 828.95
#> - percent_of_income 1 768.44 832.44
#> - checking_balance 3 809.94 869.94
#>
#> Step: AIC=818.18
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> years_at_residence + age + other_credit + housing + existing_loans_count +
#> dependents + phone
#>
#> Df Deviance AIC
#> - other_credit 2 763.47 815.47
#> - years_at_residence 1 762.19 816.19
#> - dependents 1 763.11 817.11
#> - age 1 763.12 817.12
#> - existing_loans_count 1 763.24 817.24
#> - housing 2 765.57 817.57
#> <none> 762.18 818.18
#> - phone 1 764.79 818.79
#> - amount 1 768.73 822.73
#> - months_loan_duration 1 770.15 824.15
#> - employment_duration 4 776.32 824.32
#> - savings_balance 4 777.04 825.04
#> - credit_history 4 777.88 825.88
#> - percent_of_income 1 775.41 829.41
#> - checking_balance 3 815.93 865.93
#>
#> Step: AIC=815.47
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> years_at_residence + age + housing + existing_loans_count +
#> dependents + phone
#>
#> Df Deviance AIC
#> - years_at_residence 1 763.48 813.48
#> - age 1 764.29 814.29
#> - existing_loans_count 1 764.49 814.49
#> - dependents 1 764.54 814.54
#> - housing 2 767.04 815.04
#> <none> 763.47 815.47
#> - phone 1 766.06 816.06
#> - amount 1 769.91 819.91
#> - employment_duration 4 777.23 821.23
#> - months_loan_duration 1 771.36 821.36
#> - savings_balance 4 778.20 822.20
#> - credit_history 4 781.18 825.18
#> - percent_of_income 1 776.34 826.34
#> - checking_balance 3 817.09 863.09
#>
#> Step: AIC=813.48
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> age + housing + existing_loans_count + dependents + phone
#>
#> Df Deviance AIC
#> - age 1 764.29 812.29
#> - existing_loans_count 1 764.51 812.51
#> - dependents 1 764.56 812.56
#> <none> 763.48 813.48
#> - housing 2 767.50 813.50
#> - phone 1 766.06 814.06
#> - amount 1 769.91 817.91
#> - employment_duration 4 777.31 819.31
#> - months_loan_duration 1 771.38 819.38
#> - savings_balance 4 778.22 820.22
#> - credit_history 4 781.20 823.20
#> - percent_of_income 1 776.37 824.37
#> - checking_balance 3 817.42 861.42
#>
#> Step: AIC=812.29
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> housing + existing_loans_count + dependents + phone
#>
#> Df Deviance AIC
#> - dependents 1 765.22 811.22
#> - existing_loans_count 1 765.25 811.25
#> - housing 2 768.27 812.27
#> <none> 764.29 812.29
#> - phone 1 767.17 813.17
#> - amount 1 770.64 816.64
#> - months_loan_duration 1 772.59 818.59
#> - employment_duration 4 779.04 819.04
#> - savings_balance 4 779.07 819.07
#> - credit_history 4 782.38 822.38
#> - percent_of_income 1 777.13 823.13
#> - checking_balance 3 817.82 859.82
#>
#> Step: AIC=811.22
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> housing + existing_loans_count + phone
#>
#> Df Deviance AIC
#> - existing_loans_count 1 766.41 810.41
#> <none> 765.22 811.22
#> - housing 2 769.24 811.24
#> - phone 1 768.05 812.05
#> - amount 1 771.44 815.44
#> - months_loan_duration 1 773.40 817.40
#> - savings_balance 4 779.60 817.60
#> - employment_duration 4 779.67 817.67
#> - percent_of_income 1 777.48 821.48
#> - credit_history 4 783.78 821.78
#> - checking_balance 3 818.81 858.81
#>
#> Step: AIC=810.41
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> housing + phone
#>
#> Df Deviance AIC
#> - housing 2 770.30 810.30
#> <none> 766.41 810.41
#> - phone 1 768.95 810.95
#> - amount 1 772.56 814.56
#> - employment_duration 4 780.13 816.13
#> - months_loan_duration 1 774.31 816.31
#> - savings_balance 4 780.90 816.90
#> - credit_history 4 783.89 819.89
#> - percent_of_income 1 778.40 820.40
#> - checking_balance 3 819.64 857.64
#>
#> Step: AIC=810.3
#> default ~ checking_balance + months_loan_duration + credit_history +
#> amount + savings_balance + employment_duration + percent_of_income +
#> phone
#>
#> Df Deviance AIC
#> <none> 770.30 810.30
#> - phone 1 772.67 810.67
#> - amount 1 777.03 815.03
#> - employment_duration 4 784.16 816.16
#> - months_loan_duration 1 778.28 816.28
#> - savings_balance 4 784.49 816.49
#> - percent_of_income 1 781.62 819.62
#> - credit_history 4 788.94 820.94
#> - checking_balance 3 826.07 860.07
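The trace above has the shape printed by `step()`, and it only ever removes terms, which is consistent with backward elimination. A minimal sketch of that workflow on made-up data follows; the data, seed and object names are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for the made-up data

# Made-up training data: x1 drives the outcome, noise does not
n <- 300
toy <- data.frame(x1 = rnorm(n), noise = rnorm(n))
toy$default <- factor(ifelse(runif(n) < plogis(-1 + 1.5 * toy$x1), "yes", "no"))

# Full logistic regression: all predictors
model_full <- glm(default ~ ., family = "binomial", data = toy)

# Backward elimination: drop terms one at a time while AIC decreases
model_step <- step(model_full, direction = "backward", trace = 0)

formula(model_step)
AIC(model_full)
AIC(model_step)
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints a table per step like the ones shown above. The procedure stops when `<none>` has the lowest AIC, as in the final table.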
The model obtained from the stepwise method, model2, is as follows:
#>
#> Call:
#> glm(formula = default ~ checking_balance + months_loan_duration +
#> credit_history + amount + savings_balance + employment_duration +
#> percent_of_income + phone, family = "binomial", data = data_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.76834190 0.43009162 -4.112 0.00003930168120
#> checking_balance> 200 DM -0.77346331 0.37859974 -2.043 0.041057
#> checking_balance1 - 200 DM -0.41142105 0.22089928 -1.862 0.062535
#> checking_balanceunknown -1.69409691 0.24704577 -6.857 0.00000000000701
#> months_loan_duration 0.02630094 0.00935363 2.812 0.004926
#> credit_historygood 0.55627923 0.22566072 2.465 0.013697
#> credit_historyperfect 1.28694622 0.47109761 2.732 0.006299
#> credit_historypoor 0.63396610 0.35767990 1.772 0.076322
#> credit_historyvery good 1.46748062 0.40256082 3.645 0.000267
#> amount 0.00011344 0.00004394 2.582 0.009825
#> savings_balance> 1000 DM -0.85199878 0.49738961 -1.713 0.086723
#> savings_balance100 - 500 DM -0.12624929 0.29420360 -0.429 0.667834
#> savings_balance500 - 1000 DM -0.72289042 0.47106289 -1.535 0.124884
#> savings_balanceunknown -0.88205593 0.28261210 -3.121 0.001802
#> employment_duration> 7 years -0.62176575 0.28214924 -2.204 0.027547
#> employment_duration1 - 4 years -0.34107446 0.24926570 -1.368 0.171213
#> employment_duration4 - 7 years -1.09704060 0.31560067 -3.476 0.000509
#> employment_durationunemployed -0.37384135 0.39660237 -0.943 0.345880
#> percent_of_income 0.29623457 0.08957000 3.307 0.000942
#> phoneyes -0.30102704 0.19645773 -1.532 0.125455
#>
#> (Intercept) ***
#> checking_balance> 200 DM *
#> checking_balance1 - 200 DM .
#> checking_balanceunknown ***
#> months_loan_duration **
#> credit_historygood *
#> credit_historyperfect **
#> credit_historypoor .
#> credit_historyvery good ***
#> amount **
#> savings_balance> 1000 DM .
#> savings_balance100 - 500 DM
#> savings_balance500 - 1000 DM
#> savings_balanceunknown **
#> employment_duration> 7 years *
#> employment_duration1 - 4 years
#> employment_duration4 - 7 years ***
#> employment_durationunemployed
#> percent_of_income ***
#> phoneyes
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 959.84 on 799 degrees of freedom
#> Residual deviance: 770.30 on 780 degrees of freedom
#> AIC: 810.3
#>
#> Number of Fisher Scoring iterations: 5
Using model2 from the stepwise method, we will now generate predictions on the test data.
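For a logistic regression, predicted probabilities come from `predict()` with `type = "response"`. The sketch below shows the call on made-up data; the toy model and the names `fit`, `toy_train` and `toy_test` are hypothetical, and only the column name `prob_default` is taken from this document.

```r
set.seed(2)  # arbitrary seed for the made-up data

# Made-up data: probability of default rises with x
toy <- data.frame(x = rnorm(200))
toy$default <- factor(ifelse(runif(200) < plogis(toy$x), "yes", "no"))
toy_train <- toy[1:160, ]
toy_test  <- toy[161:200, ]

fit <- glm(default ~ x, family = "binomial", data = toy_train)

# type = "response" returns P(default = "yes"); the default type returns log-odds
toy_test$prob_default <- predict(fit, newdata = toy_test, type = "response")

range(toy_test$prob_default)
```

Because "yes" is the second factor level, `glm()` models the probability of "yes", so higher values mean a higher predicted risk of default.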
We will now visualise the distribution of the predicted probabilities
ggplot(data_test, aes(x = prob_default)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()
#> 'data.frame': 200 obs. of 2 variables:
#> $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
#> $ prob_default: num 0.1093 0.0319 0.0825 0.0266 0.8066 ...
# Range of predicted probabilities for customers who did not default
result <- data_test %>%
  filter(default == "no") %>%
  summarise(
    min_prob_default = min(prob_default, na.rm = TRUE),
    max_prob_default = max(prob_default, na.rm = TRUE)
  )
# Display results
result
# Range of predicted probabilities for customers who defaulted
result_yes <- data_test %>%
  filter(default == "yes") %>%
  summarise(
    min_prob_default = min(prob_default, na.rm = TRUE),
    max_prob_default = max(prob_default, na.rm = TRUE)
  )
# Display results
result_yes
We classify an observation as a default when its predicted probability on the test data is greater than 0.5.
# Fix the factor levels so both classes exist even if one is never predicted
data_test$pred_default <- factor(ifelse(data_test$prob_default > 0.5, "yes", "no"),
                                 levels = c("no", "yes"))
data_test[1:10, c("pred_default", "default")]
library(caret)
log_conf <- confusionMatrix(data_test$pred_default, data_test$default, positive = 'yes')
log_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 113 37
#> yes 17 33
#>
#> Accuracy : 0.73
#> 95% CI : (0.6628, 0.7902)
#> No Information Rate : 0.65
#> P-Value [Acc > NIR] : 0.009749
#>
#> Kappa : 0.3647
#>
#> Mcnemar's Test P-Value : 0.009722
#>
#> Sensitivity : 0.4714
#> Specificity : 0.8692
#> Pos Pred Value : 0.6600
#> Neg Pred Value : 0.7533
#> Prevalence : 0.3500
#> Detection Rate : 0.1650
#> Detection Prevalence : 0.2500
#> Balanced Accuracy : 0.6703
#>
#> 'Positive' Class : yes
#>
Based on the results of the confusion matrix above, we can gather the following information:
Summary: The model achieves 73% accuracy, which looks moderately high but is only a modest gain over the 65% no-information rate, so it may be misleading under class imbalance. Sensitivity (47.14%) is relatively low, indicating that the model misses more than half of the actual “yes” cases. Specificity (86.92%) is high, showing the model performs well at identifying “no” cases. Precision for the “yes” class (66.00%) means that when the model predicts “yes,” it is correct about two times out of three. The balanced accuracy (67.03%) provides a clearer picture, showing the model performs better than random guessing but still struggles with the minority class (“yes”).
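Each statistic in the `confusionMatrix()` output can be recomputed by hand from the four cell counts, which is a useful check on the figures quoted in a summary. The sketch below uses the counts from the matrix above, with "yes" as the positive class; the variable names are illustrative.

```r
# Counts from the confusion matrix above (positive class = "yes")
tp <- 33   # predicted yes, actually yes
fp <- 17   # predicted yes, actually no
fn <- 37   # predicted no,  actually yes
tn <- 113  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        balanced_accuracy = balanced), 4)
```

The rounded values are 0.73, 0.4714, 0.8692, 0.66 and 0.6703, matching the printed output.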
-> For the K-NN model, create dummy variables from the variables that will be used for classification modelling.
dmy <- dummyVars(" ~default+checking_balance+months_loan_duration+credit_history+purpose+amount+savings_balance+employment_duration+percent_of_income+years_at_residence+age+other_credit+housing+existing_loans_count+job+dependents+phone", data = credit)
dmy <- data.frame(predict(dmy, newdata = credit))
str(dmy)
#> 'data.frame': 1000 obs. of 46 variables:
#> $ default.no : num 1 0 1 1 0 1 1 1 1 0 ...
#> $ default.yes : num 0 1 0 0 1 0 0 0 0 1 ...
#> $ checking_balance...0.DM : num 1 0 0 1 1 0 0 0 0 0 ...
#> $ checking_balance...200.DM : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ checking_balance.1...200.DM : num 0 1 0 0 0 0 0 1 0 1 ...
#> $ checking_balance.unknown : num 0 0 1 0 0 1 1 0 1 0 ...
#> $ months_loan_duration : num 6 48 12 42 24 36 24 36 12 30 ...
#> $ credit_history.critical : num 1 0 1 0 0 0 0 0 0 1 ...
#> $ credit_history.good : num 0 1 0 1 0 1 1 1 1 0 ...
#> $ credit_history.perfect : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ credit_history.poor : num 0 0 0 0 1 0 0 0 0 0 ...
#> $ credit_history.very.good : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ purpose.business : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ purpose.car : num 0 0 0 0 1 0 0 1 0 1 ...
#> $ purpose.car0 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ purpose.education : num 0 0 1 0 0 1 0 0 0 0 ...
#> $ purpose.furniture.appliances : num 1 1 0 1 0 0 1 0 1 0 ...
#> $ purpose.renovations : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ amount : num 1169 5951 2096 7882 4870 ...
#> $ savings_balance...100.DM : num 0 1 1 1 1 0 0 1 0 1 ...
#> $ savings_balance...1000.DM : num 0 0 0 0 0 0 0 0 1 0 ...
#> $ savings_balance.100...500.DM : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ savings_balance.500...1000.DM : num 0 0 0 0 0 0 1 0 0 0 ...
#> $ savings_balance.unknown : num 1 0 0 0 0 1 0 0 0 0 ...
#> $ employment_duration...1.year : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ employment_duration...7.years : num 1 0 0 0 0 0 1 0 0 0 ...
#> $ employment_duration.1...4.years: num 0 1 0 0 1 1 0 1 0 0 ...
#> $ employment_duration.4...7.years: num 0 0 1 1 0 0 0 0 1 0 ...
#> $ employment_duration.unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
#> $ percent_of_income : num 4 2 2 2 3 2 3 2 2 4 ...
#> $ years_at_residence : num 4 2 3 4 4 4 4 2 4 2 ...
#> $ age : num 67 22 49 45 53 35 53 35 61 28 ...
#> $ other_credit.bank : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ other_credit.none : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ other_credit.store : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ housing.other : num 0 0 0 1 1 1 0 0 0 0 ...
#> $ housing.own : num 1 1 1 0 0 0 1 0 1 1 ...
#> $ housing.rent : num 0 0 0 0 0 0 0 1 0 0 ...
#> $ existing_loans_count : num 2 1 1 1 2 1 1 1 1 2 ...
#> $ job.management : num 0 0 0 0 0 0 0 1 0 1 ...
#> $ job.skilled : num 1 1 0 1 1 0 1 0 0 0 ...
#> $ job.unemployed : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ job.unskilled : num 0 0 1 0 0 1 0 0 1 0 ...
#> $ dependents : num 1 1 2 2 2 2 1 1 1 1 ...
#> $ phone.no : num 0 1 1 1 1 0 1 0 1 1 ...
#> $ phone.yes : num 1 0 0 0 0 1 0 1 0 0 ...
Remove the redundant dummy column of each variable that has only two categories (default.no and phone.no).
Check the names of the remaining dummy variables
#> [1] "default.yes" "checking_balance...0.DM"
#> [3] "checking_balance...200.DM" "checking_balance.1...200.DM"
#> [5] "checking_balance.unknown" "months_loan_duration"
#> [7] "credit_history.critical" "credit_history.good"
#> [9] "credit_history.perfect" "credit_history.poor"
#> [11] "credit_history.very.good" "purpose.business"
#> [13] "purpose.car" "purpose.car0"
#> [15] "purpose.education" "purpose.furniture.appliances"
#> [17] "purpose.renovations" "amount"
#> [19] "savings_balance...100.DM" "savings_balance...1000.DM"
#> [21] "savings_balance.100...500.DM" "savings_balance.500...1000.DM"
#> [23] "savings_balance.unknown" "employment_duration...1.year"
#> [25] "employment_duration...7.years" "employment_duration.1...4.years"
#> [27] "employment_duration.4...7.years" "employment_duration.unemployed"
#> [29] "percent_of_income" "years_at_residence"
#> [31] "age" "other_credit.bank"
#> [33] "other_credit.none" "other_credit.store"
#> [35] "housing.other" "housing.own"
#> [37] "housing.rent" "existing_loans_count"
#> [39] "job.management" "job.skilled"
#> [41] "job.unemployed" "job.unskilled"
#> [43] "dependents" "phone.yes"
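Judging by the names listed above, the columns `default.no` and `phone.no` were dropped. One way to do that in base R is sketched here on a toy frame with the same naming pattern; the values and the name `dmy_demo` are made up, and the original notebook may have removed the columns differently.

```r
# Toy dummy-coded frame with the same naming pattern (made-up values)
dmy_demo <- data.frame(
  default.no = c(1, 0, 1), default.yes = c(0, 1, 0),
  amount     = c(1169, 5951, 2096),
  phone.no   = c(0, 1, 1), phone.yes   = c(1, 0, 0)
)

# For a two-level factor, one dummy is the exact complement of the other,
# so keeping only the ".yes" column loses no information
redundant <- c("default.no", "phone.no")
dmy_demo  <- dmy_demo[, !(names(dmy_demo) %in% redundant)]

names(dmy_demo)
```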
Create the training data and test data from dmy
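set.seed(123)
# Sample 80% of the row numbers (a vector of indices, not a single count)
train_idx <- sample(nrow(dmy), size = 0.8 * nrow(dmy))
# Column 1 (default.yes) is the label; the remaining columns are predictors
dmy_train <- dmy[train_idx, -1]
dmy_test <- dmy[-train_idx, -1]
dmy_train_label <- dmy[train_idx, 1]
dmy_test_label <- dmy[-train_idx, 1]
Create prediction with K-NN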
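The K-NN call itself is not shown in this document. A common choice is `knn()` from the `class` package, a recommended package bundled with most R distributions, and the sketch below assumes that function. The data are made up, and scaling the predictors and setting k near the square root of the training size are conventional choices rather than steps confirmed by the source.

```r
library(class)  # provides knn()

set.seed(3)  # arbitrary seed for the made-up data

# Two made-up groups separated along both predictors
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))
y <- factor(rep(c(0, 1), each = 100))

train_idx <- sample(nrow(x), size = 0.8 * nrow(x))

# Scale with the training means and standard deviations, then apply to the test set
x_train <- scale(x[train_idx, ])
x_test  <- scale(x[-train_idx, ],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# A common rule of thumb sets k near the square root of the training size
k <- round(sqrt(nrow(x_train)))
pred_knn <- knn(train = x_train, test = x_test, cl = y[train_idx], k = k)

table(Prediction = pred_knn, Reference = y[-train_idx])
```

K-NN relies on distances, so unscaled variables with large ranges (such as a loan amount in the thousands) would otherwise dominate 0/1 dummy columns.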
Create the confusion matrix from the K-NN predictions
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 699 300
#> 1 0 0
#>
#> Accuracy : 0.6997
#> 95% CI : (0.6702, 0.728)
#> No Information Rate : 0.6997
#> P-Value [Acc > NIR] : 0.5156
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.0000
#> Specificity : 1.0000
#> Pos Pred Value : NaN
#> Neg Pred Value : 0.6997
#> Prevalence : 0.3003
#> Detection Rate : 0.0000
#> Detection Prevalence : 0.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : 1
#>
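Summary of Model Performance: The model predicts only the majority class (“0”) and never predicts the minority class (“1”). Specificity is 100% only because every observation is assigned to the negative class, while sensitivity is 0%, so no default is detected. The accuracy (69.97%) equals the no-information rate, meaning the model is no better than always guessing “0”. Note also that the matrix covers 999 observations (699 + 300) rather than the 200 expected from an 80/20 split. This suggests the test set was formed by dropping a single row, which would leave K-NN with only one training row, so this result most likely reflects the data split rather than the K-NN algorithm or the class imbalance alone.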
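Before fitting K-NN it is worth confirming the size of each split. The sketch below, on made-up data, contrasts indexing with a single count against indexing with a vector of sampled row numbers. How the split index was actually built is not shown in this document, so this is a check to run, not a diagnosis; `dmy_demo`, `n_train` and `train_idx` are hypothetical names.

```r
# Stand-in for the 1000-row dummy-coded data (made-up values)
dmy_demo <- data.frame(default.yes = rep(c(0, 1), times = c(700, 300)),
                       amount      = seq_len(1000))

# A single number selects one row; its negative drops one row
n_train <- 0.8 * nrow(dmy_demo)          # 800, a count
nrow(dmy_demo[n_train, ])                # 1 row
nrow(dmy_demo[-n_train, ])               # 999 rows

# A vector of sampled row numbers gives the intended 800/200 split
set.seed(123)
train_idx <- sample(nrow(dmy_demo), size = n_train)
nrow(dmy_demo[train_idx, ])              # 800 rows
nrow(dmy_demo[-train_idx, ])             # 200 rows
```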
# Collect the key metrics of each model by name
eval_logit <- data.frame(Accuracy = log_conf$overall[["Accuracy"]],
                         Recall = log_conf$byClass[["Sensitivity"]],
                         Specificity = log_conf$byClass[["Specificity"]],
                         Precision = log_conf$byClass[["Pos Pred Value"]])
eval_knn <- data.frame(Accuracy = pred_knn_conf$overall[["Accuracy"]],
                       Recall = pred_knn_conf$byClass[["Sensitivity"]],
                       Specificity = pred_knn_conf$byClass[["Specificity"]],
                       Precision = pred_knn_conf$byClass[["Pos Pred Value"]])
Key Observations:
KNN’s Failure with the Positive Class:
KNN fails entirely to detect the positive class (recall = 0%) and classifies everything as the majority class. The imbalanced dataset, in which the positive class is underrepresented, may contribute to this, but the result should be confirmed on a correctly sized test set (200 rows rather than the 999 evaluated here) before it is attributed to the algorithm.
Logistic Regression’s Balanced Performance:
Logistic Regression, while not perfect, balances recall and specificity better. It correctly identifies 47.14% of positive cases while maintaining good specificity (86.92%) and precision (66.00%).
Imbalanced Dataset Effects:
Both models are influenced by the class imbalance. The high specificity and low recall highlight the difficulty of predicting the minority class.