B.Li 2023.04.20
本次使用 scorecard 包内置的德国信贷数据,并选择逻辑回归模型进行信用评分卡构建。
目录
探索性分析
特征工程
建模&模型评估
评分卡构建
#加载所用包
library(scorecard)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::replace_na() masks scorecard::replace_na()
library(corrplot)
## corrplot 0.92 loaded
library(naniar)
library(glmnet)
## 载入需要的程辑包:Matrix
##
## 载入程辑包:'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
首先,查看数据的基础统计信息:
data("germancredit")#导入数据
glimpse(germancredit)
## Rows: 1,000
## Columns: 21
## $ status.of.existing.checking.account <fct> ... < 0 DM, 0…
## $ duration.in.month <dbl> 6, 48, 12, 42…
## $ credit.history <fct> critical acco…
## $ purpose <chr> "radio/televi…
## $ credit.amount <dbl> 1169, 5951, 2…
## $ savings.account.and.bonds <fct> unknown/ no s…
## $ present.employment.since <fct> ... >= 7 year…
## $ installment.rate.in.percentage.of.disposable.income <dbl> 4, 2, 2, 2, 3…
## $ personal.status.and.sex <fct> male : divorc…
## $ other.debtors.or.guarantors <fct> none, none, n…
## $ present.residence.since <dbl> 4, 2, 3, 4, 4…
## $ property <fct> "real estate"…
## $ age.in.years <dbl> 67, 22, 49, 4…
## $ other.installment.plans <fct> none, none, n…
## $ housing <fct> own, own, own…
## $ number.of.existing.credits.at.this.bank <dbl> 2, 1, 1, 1, 2…
## $ job <fct> skilled emplo…
## $ number.of.people.being.liable.to.provide.maintenance.for <dbl> 1, 1, 2, 2, 2…
## $ telephone <fct> "yes, registe…
## $ foreign.worker <fct> yes, yes, yes…
## $ creditability <fct> good, bad, go…
summary(germancredit)
## status.of.existing.checking.account
## ... < 0 DM :274
## 0 <= ... < 200 DM :269
## ... >= 200 DM / salary assignments for at least 1 year: 63
## no checking account :394
##
##
## duration.in.month
## Min. : 4.0
## 1st Qu.:12.0
## Median :18.0
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## credit.history
## no credits taken/ all credits paid back duly : 40
## all credits at this bank paid back duly : 49
## existing credits paid back duly till now :530
## delay in paying off in the past : 88
## critical account/ other credits existing (not at this bank):293
##
## purpose credit.amount savings.account.and.bonds
## Length:1000 Min. : 250 ... < 100 DM :603
## Class :character 1st Qu.: 1366 100 <= ... < 500 DM :103
## Mode :character Median : 2320 500 <= ... < 1000 DM : 63
## Mean : 3271 ... >= 1000 DM : 48
## 3rd Qu.: 3972 unknown/ no savings account:183
## Max. :18424
## present.employment.since
## unemployed : 62
## ... < 1 year :172
## 1 <= ... < 4 years:339
## 4 <= ... < 7 years:174
## ... >= 7 years :253
##
## installment.rate.in.percentage.of.disposable.income
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.973
## 3rd Qu.:4.000
## Max. :4.000
## personal.status.and.sex other.debtors.or.guarantors
## male : divorced/separated : 50 none :907
## female : divorced/separated/married:310 co-applicant: 41
## male : single :548 guarantor : 52
## male : married/widowed : 92
## female : single : 0
##
## present.residence.since
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.845
## 3rd Qu.:4.000
## Max. :4.000
## property age.in.years
## real estate :282 Min. :19.00
## building society savings agreement/ life insurance :232 1st Qu.:27.00
## car or other, not in attribute Savings account/bonds:332 Median :33.00
## unknown / no property :154 Mean :35.55
## 3rd Qu.:42.00
## Max. :75.00
## other.installment.plans housing number.of.existing.credits.at.this.bank
## bank :139 rent :179 Min. :1.000
## stores: 47 own :713 1st Qu.:1.000
## none :814 for free:108 Median :1.000
## Mean :1.407
## 3rd Qu.:2.000
## Max. :4.000
## job
## unemployed/ unskilled - non-resident : 22
## unskilled - resident :200
## skilled employee / official :630
## management/ self-employed/ highly qualified employee/ officer:148
##
##
## number.of.people.being.liable.to.provide.maintenance.for
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
## telephone foreign.worker creditability
## none :596 yes:963 bad :300
## yes, registered under the customers name:404 no : 37 good:700
##
##
##
##
发现:
1)当前样本分布(7:3)的均衡性尚可,认为不会导致模型仅学得先验认知。
2)部分特征可能取值一致性过高,没有预测能力。如有,需剔除此类特征。
3)未发现特征存在异常值,但特征取值范围差距明显,需要特征归一化。
其次,观察特征缺失情况:
gg_miss_var(germancredit)
4)未发现特征存在缺失值。
再其次,观察各特征的相关水平:
#对于连续型变量,进行相关性检验
X1<-germancredit%>%select(-creditability)%>%
select_if(is.numeric)
colnames(X1)<-c('duration','credit amount','installment rate','present residence','age','number of credit','number of liabel')
corrplot(corr = cor(X1),method = "square")
#对于离散型变量,进行卡方检验
chisq.test(germancredit$status.of.existing.checking.account,germancredit$creditability)
##
## Pearson's Chi-squared test
##
## data: germancredit$status.of.existing.checking.account and germancredit$creditability
## X-squared = 123.72, df = 3, p-value < 2.2e-16
5)大部分特征之间无明显相关关系。若模型表现欠佳,可尝试对高相关性变量择一后重新建模。
综上所述,认为无需对当前数据进行缺失值、非均衡性等问题的处理,经过特征选择及归一化后便可进行建模。
尝试通过IV值进行特征筛选。首先计算并观察各特征的IV值:
info_value = iv(germancredit, y = "creditability")
info_value
## variable info_value
## 1: status.of.existing.checking.account 6.660115e-01
## 2: duration.in.month 3.345035e-01
## 3: credit.history 2.932335e-01
## 4: age.in.years 2.596514e-01
## 5: savings.account.and.bonds 1.960096e-01
## 6: purpose 1.691951e-01
## 7: property 1.126383e-01
## 8: present.employment.since 8.643363e-02
## 9: housing 8.329343e-02
## 10: other.installment.plans 5.761454e-02
## 11: foreign.worker 4.387741e-02
## 12: credit.amount 3.895727e-02
## 13: other.debtors.or.guarantors 3.201932e-02
## 14: installment.rate.in.percentage.of.disposable.income 2.632209e-02
## 15: number.of.existing.credits.at.this.bank 1.326652e-02
## 16: personal.status.and.sex 8.839919e-03
## 17: job 8.762766e-03
## 18: telephone 6.377605e-03
## 19: present.residence.since 3.588773e-03
## 20: number.of.people.being.liable.to.provide.maintenance.for 4.339223e-05
发现达到最低可用门槛(IV>.02)的特征有13个,其中预测能力较强的特征(.1<IV<.5)有6个。关于IV值的高低,直观的理解为”特征在各切片下好坏样本的概率分布与先验概率的差异”:
ggplot(germancredit,aes(x = duration.in.month,fill = creditability))+geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(germancredit,aes(x = present.residence.since ,fill = creditability))+geom_bar()
此处取duration.in.month(IV>.3)及present.residence.since(IV>.003)进行比较,可以看出前者部分切片下的样本分布明显有别于先验认知(3:7),而后者无明显区别。
基于IV值等信息进行特征筛选,并将过滤后的数据集划分成训练集/测试集:
df<-var_filter(germancredit,y="creditability",lims = list(identical_rate = .95,info_value = .02),positive = "bad|1")%>%
select(-status.of.existing.checking.account) #剔除IV过高的特征
## ℹ Filtering variables via missing_rate, identical_rate, info_value ...
## ✔ 1 variables are removed via identical_rate
## ✔ 6 variables are removed via info_value
## ✔ Variable filtering on 1000 rows and 20 columns in 00:00:00
## ✔ 7 variables are removed in total
df_list<-split_df(df,y="creditability",ratios=c(.7,.3),seed = 121)
df_train<-df_list$train
df_test<-df_list$test
出于特征归一化的需要,对特征值进行woe变换:
bins_tree<-woebin(df, y="creditability",method = 'tree',positive = "bad|1")#分箱,分箱方法选择决策树
## ℹ Creating woe binning ...
## ✔ Binning on 1000 rows and 13 columns in 00:00:03
df_woe<-lapply(df_list,function(x){woebin_ply(x,bins = bins_tree, to = "woe")})#基于分箱计算woe值
## ℹ Converting into woe values ...
## ✔ Woe transformating on 676 rows and 12 columns in 00:00:01
## ℹ Converting into woe values ...
## ✔ Woe transformating on 324 rows and 12 columns in 00:00:01
head(df_woe)
## $train
## creditability duration.in.month_woe credit.history_woe purpose_woe
## 1: 0 -1.3121864 -0.73374058 -0.4100628
## 2: 1 1.1349799 0.08831862 -0.4100628
## 3: 0 0.5245245 0.08831862 0.2799201
## 4: 1 0.1086883 0.08515781 0.2799201
## 5: 0 0.1086883 0.08831862 0.2799201
## ---
## 672: 0 0.5245245 0.08831862 0.2799201
## 673: 0 -0.3466246 0.08831862 0.2799201
## 674: 0 -0.3466246 0.08831862 0.2799201
## 675: 0 -0.3466246 0.08831862 -0.4100628
## 676: 1 1.1349799 0.08831862 -0.4100628
## credit.amount_woe savings.account.and.bonds_woe
## 1: 0.03366128 -0.7621401
## 2: 0.39053946 0.2713578
## 3: 0.39053946 0.2713578
## 4: 0.39053946 0.2713578
## 5: -0.25830746 -0.7621401
## ---
## 672: -0.25830746 0.2713578
## 673: -0.25830746 -0.7621401
## 674: -0.72823850 0.2713578
## 675: 0.03366128 0.2713578
## 676: -0.25830746 0.2713578
## present.employment.since_woe
## 1: -0.23556607
## 2: 0.03210325
## 3: -0.39441527
## 4: 0.03210325
## 5: -0.23556607
## ---
## 672: 0.43113746
## 673: -0.23556607
## 674: -0.39441527
## 675: -0.23556607
## 676: 0.03210325
## installment.rate.in.percentage.of.disposable.income_woe
## 1: 0.15730029
## 2: -0.19047277
## 3: -0.19047277
## 4: -0.06453852
## 5: -0.06453852
## ---
## 672: 0.15730029
## 673: 0.15730029
## 674: -0.06453852
## 675: 0.15730029
## 676: 0.15730029
## other.debtors.or.guarantors_woe property_woe age.in.years_woe
## 1: 0.02797385 -0.46103496 -0.2123715
## 2: 0.02797385 -0.46103496 0.5288441
## 3: -0.58778666 0.02857337 -0.2123715
## 4: 0.02797385 0.58608236 -0.2123715
## 5: 0.02797385 0.02857337 -0.2123715
## ---
## 672: 0.02797385 0.02857337 0.1424546
## 673: 0.02797385 0.03419136 -0.2123715
## 674: 0.02797385 -0.46103496 0.1424546
## 675: 0.02797385 0.03419136 -0.2123715
## 676: 0.02797385 0.58608236 0.5288441
## other.installment.plans_woe housing_woe
## 1: -0.1211786 -0.1941560
## 2: -0.1211786 -0.1941560
## 3: -0.1211786 0.4726044
## 4: -0.1211786 0.4726044
## 5: -0.1211786 -0.1941560
## ---
## 672: -0.1211786 -0.1941560
## 673: -0.1211786 -0.1941560
## 674: -0.1211786 -0.1941560
## 675: -0.1211786 -0.1941560
## 676: -0.1211786 0.4726044
##
## $test
## creditability duration.in.month_woe credit.history_woe purpose_woe
## 1: 0 -0.3466246 -0.73374058 0.2799201
## 2: 0 0.5245245 0.08831862 0.2799201
## 3: 0 0.1086883 1.23407084 0.2799201
## 4: 0 0.1086883 0.08831862 -0.4100628
## 5: 0 -1.3121864 1.23407084 -0.4100628
## ---
## 320: 0 -0.3466246 -0.73374058 0.2799201
## 321: 0 -0.3466246 1.23407084 -0.4100628
## 322: 0 0.1086883 0.08831862 -0.4100628
## 323: 0 0.1086883 0.08831862 -0.8056252
## 324: 0 1.1349799 -0.73374058 -0.8056252
## credit.amount_woe savings.account.and.bonds_woe
## 1: -0.25830746 0.2713578
## 2: 0.39053946 -0.7621401
## 3: 0.39053946 -0.7621401
## 4: -0.25830746 -0.7621401
## 5: 0.03366128 0.2713578
## ---
## 320: -0.25830746 -0.7621401
## 321: -0.72823850 0.1395519
## 322: -0.25830746 -0.7621401
## 323: -0.25830746 0.2713578
## 324: 0.39053946 0.1395519
## present.employment.since_woe
## 1: -0.39441527
## 2: 0.03210325
## 3: 0.43113746
## 4: -0.23556607
## 5: -0.23556607
## ---
## 320: 0.43113746
## 321: -0.23556607
## 322: -0.39441527
## 323: 0.03210325
## 324: 0.43113746
## installment.rate.in.percentage.of.disposable.income_woe
## 1: -0.19047277
## 2: -0.19047277
## 3: -0.19047277
## 4: -0.06453852
## 5: 0.15730029
## ---
## 320: -0.19047277
## 321: 0.15730029
## 322: -0.19047277
## 323: 0.15730029
## 324: -0.06453852
## other.debtors.or.guarantors_woe property_woe age.in.years_woe
## 1: 0.02797385 -0.46103496 -0.2123715
## 2: 0.02797385 0.58608236 -0.8724881
## 3: 0.02797385 0.03419136 0.5288441
## 4: 0.02797385 0.03419136 0.1424546
## 5: 0.02797385 0.03419136 -0.2123715
## ---
## 320: 0.02797385 0.02857337 -0.2123715
## 321: 0.02797385 0.03419136 0.1424546
## 322: 0.02797385 0.03419136 0.5288441
## 323: 0.02797385 0.02857337 -0.2123715
## 324: 0.02797385 0.03419136 -0.1609304
## other.installment.plans_woe housing_woe
## 1: -0.1211786 -0.1941560
## 2: -0.1211786 0.4726044
## 3: 0.4775508 -0.1941560
## 4: -0.1211786 -0.1941560
## 5: -0.1211786 -0.1941560
## ---
## 320: -0.1211786 -0.1941560
## 321: 0.4775508 -0.1941560
## 322: -0.1211786 0.4044452
## 323: -0.1211786 -0.1941560
## 324: -0.1211786 -0.1941560
此时特征已初步筛选完成(13个),取值范围同为0-1,且数据集已划分完毕。接下来可以进行建模及模型评估。
模型选择逻辑回归:
m1<-glm(creditability~.,family = binomial,data = df_woe$train)
summary(m1)
##
## Call:
## glm(formula = creditability ~ ., family = binomial, data = df_woe$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1101 -0.7685 -0.4724 0.8565 2.6345
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.9060 0.0982
## duration.in.month_woe 0.5994 0.1995
## credit.history_woe 0.8006 0.1781
## purpose_woe 1.0313 0.2538
## credit.amount_woe 1.0754 0.2468
## savings.account.and.bonds_woe 1.0294 0.2368
## present.employment.since_woe 0.7972 0.3324
## installment.rate.in.percentage.of.disposable.income_woe 1.7796 0.6314
## other.debtors.or.guarantors_woe 0.9177 0.7727
## property_woe 0.6061 0.3322
## age.in.years_woe 0.6477 0.2753
## other.installment.plans_woe 0.6994 0.3879
## housing_woe 0.4922 0.3562
## z value Pr(>|z|)
## (Intercept) -9.226 < 2e-16 ***
## duration.in.month_woe 3.005 0.00266 **
## credit.history_woe 4.495 6.96e-06 ***
## purpose_woe 4.064 4.82e-05 ***
## credit.amount_woe 4.358 1.31e-05 ***
## savings.account.and.bonds_woe 4.347 1.38e-05 ***
## present.employment.since_woe 2.398 0.01647 *
## installment.rate.in.percentage.of.disposable.income_woe 2.819 0.00482 **
## other.debtors.or.guarantors_woe 1.188 0.23495
## property_woe 1.824 0.06812 .
## age.in.years_woe 2.352 0.01866 *
## other.installment.plans_woe 1.803 0.07136 .
## housing_woe 1.382 0.16697
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 819.35 on 675 degrees of freedom
## Residual deviance: 662.78 on 663 degrees of freedom
## AIC: 688.78
##
## Number of Fisher Scoring iterations: 5
发现少量特征的预测作用不显著,尝试用双向搜索进一步筛选特征:
m2<-step(m1, direction="both", trace = FALSE)
summary(m2)
##
## Call:
## glm(formula = creditability ~ duration.in.month_woe + credit.history_woe +
## purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## property_woe + age.in.years_woe + other.installment.plans_woe,
## family = binomial, data = df_woe$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0554 -0.7653 -0.4872 0.8676 2.6298
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.90777 0.09797
## duration.in.month_woe 0.58563 0.19754
## credit.history_woe 0.79633 0.17649
## purpose_woe 1.04282 0.25281
## credit.amount_woe 1.08733 0.24582
## savings.account.and.bonds_woe 0.99939 0.23514
## present.employment.since_woe 0.82819 0.33029
## installment.rate.in.percentage.of.disposable.income_woe 1.76511 0.62848
## property_woe 0.82920 0.30459
## age.in.years_woe 0.69000 0.27325
## other.installment.plans_woe 0.63804 0.38476
## z value Pr(>|z|)
## (Intercept) -9.266 < 2e-16 ***
## duration.in.month_woe 2.965 0.00303 **
## credit.history_woe 4.512 6.42e-06 ***
## purpose_woe 4.125 3.71e-05 ***
## credit.amount_woe 4.423 9.73e-06 ***
## savings.account.and.bonds_woe 4.250 2.14e-05 ***
## present.employment.since_woe 2.507 0.01216 *
## installment.rate.in.percentage.of.disposable.income_woe 2.809 0.00498 **
## property_woe 2.722 0.00648 **
## age.in.years_woe 2.525 0.01156 *
## other.installment.plans_woe 1.658 0.09726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 819.35 on 675 degrees of freedom
## Residual deviance: 666.10 on 665 degrees of freedom
## AIC: 688.1
##
## Number of Fisher Scoring iterations: 5
模型表现:
result_list<-lapply(df_woe,function(x){predict.glm(m2,x,type = "response")})#预测结果
label_list<-lapply(df_woe,function(x){x$creditability})#真实标签
perf1<-perf_eva(result_list,label = label_list, show_plot = c("ks","roc","pr","f1"),positive = "bad|1")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the scorecard package.
## Please report the issue at <]8;;https://github.com/ShichenXie/scorecard/issueshttps://github.com/ShichenXie/scorecard/issues]8;;>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
当前模型KS=0.47,ROC=0.80,认为性能良好。接下来将模型映射为评分卡:
card = scorecard(bins_tree, m2)#基于模型所学系数,将各特征按其分箱赋予分值
score_list = lapply(df_list, function(x) scorecard_ply(x, card))#基于评分规则,计算各对象的最终得分
head(score_list)
## $train
## score
## 1: 672
## 2: 410
## 3: 417
## 4: 359
## 5: 534
## ---
## 672: 356
## 673: 526
## 674: 538
## 675: 480
## 676: 354
##
## $test
## score
## 1: 582
## 2: 465
## 3: 328
## 4: 568
## 5: 454
## ---
## 320: 577
## 321: 438
## 322: 575
## 323: 497
## 324: 461
计算目前的评分卡的PSI,从而判定其预测结果是否稳定:
perf_psi(score = score_list, label = label_list)
## $pic
## $pic$score
##
##
## $psi
## variable dataset psi
## 1: score train_test 0.06038522
PSI=.03,认为模型稳定。同时,上图显示了各分数段代表的坏账概率(Positive probability),分数越高则坏账率越低。
可根据业务需要指定界定好坏的分数阈值,或基于模型性能指标(KS/AUC/F1等)得出”相对最佳”的阈值:
#观察perf1,可知KS与F1的最大值出现于将预测概率排名前40%左右的样本判为正样本时,由此可以计算模型的判定阈值:
table_result<- cbind(result_list$test,label_list$test)%>%
as.data.frame()%>%
rename(p_estimate = V1,label = V2)%>%
arrange(desc(p_estimate))%>%
mutate(cnt = 1)%>%
mutate(cnt_positive = cumsum(cnt))%>%
mutate(rate_positive = cnt_positive/nrow(df_list$test))
p_estimate<-mean(table_result[which(table_result$rate_positive>=.38 & table_result$rate_positive<=.42),1])
p_estimate
## [1] 0.2830725
即,可将模型输出概率0.28作为判定好坏样本的阈值,与评分卡关联,可知其对应的确切分数。粗略比对,可知目标分数约为500。
至此,信用评分卡开发完成。