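Credit scorecards fall into two main types:
Both types are developed with the same approach, but they are applied in different scenarios:
There is usually no single standard for what counts as normal versus default; the definition generally depends on the lender. Most scorecards, however, are built on a 60-, 90-, or 180-day past-due threshold. For example, the rule could be that a customer whose loan is more than 60 days past due is classified as a bad customer.
Once normal and default are defined, the data must be labeled, typically with 1 for default and 0 for normal.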
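As a minimal sketch of the labeling step, assuming a hypothetical data frame with a days_past_due column and a 60-day cutoff (both the data and the column names are illustrative, not from the original dataset):

```r
# Hypothetical days-past-due values for five loans (illustrative data only)
loans <- data.frame(
  customer_id   = 1:5,
  days_past_due = c(0, 15, 61, 95, 60)
)

# Assumed rule: more than 60 days past due counts as default
cutoff_days <- 60
loans$bad <- as.integer(loans$days_past_due > cutoff_days)

loans
```

A loan that is exactly 60 days past due stays labeled 0 under this rule; whether the boundary is inclusive is a business decision.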
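Suppose the scorecard uses three variables:
1. Age: age
2. TmAtAddress: years at the current address
3. EmpStatus: employment status
This is the credit scorecard produced from the credit scoring model. Now suppose an applicant has the following attributes:
Age = 37
TmAtAddress = 3.5
EmpStatus = 'full-time'
The applicant's score is then 485 + 39 + 36 + 38 = 598, which is this applicant's credit score.
What does the scorecard development process look like? In fact, any data mining project follows a similar process:
In practice, data may be scattered across many places, so all usable data has to be consolidated. This step is not easy: which data is available, which is appropriate, and which is actually useful may only become clear after many attempts.
Exploratory analysis is the process of inspecting and understanding the data. It generally involves the following analyses:
With hundreds or thousands of candidate features, you need to select the variables that are both highly predictive and easy to interpret. There are many feature selection methods. The most common one for scorecards is to filter by information value (IV) and then, after fitting a logistic regression model, to refine the selection with stepwise regression. Machine learning methods such as random forest and Boruta can also be used.
Build a logistic regression model on the selected features.
Model validation generally has to satisfy four basic requirements:
Once the logistic regression model is built, its results have to be converted into scorecard form. The method is explained later.
Once the scorecard is built, it has to be translated into deployable code, and score cutoffs have to be set to map scores to the required business actions.
After deployment the scorecard must be monitored, because the environment in which it is applied keeps changing. Both the scorecard's actual performance and shifts in the characteristics of the scored population need to be tracked.
Data for credit scorecards generally falls into the following groups:
Ownership and status variables are usually binary (0, 1). Transactions provide two types of data: frequencies and aggregates. A frequency records how often a particular event occurs, for example the number of times a customer uses Taobao in a given period. An aggregate is a computed summary statistic of account balances or transaction values, for example a customer's average daily transaction amount.
Aggregates come in several kinds, summarized here:
Sometimes a customer has multiple records, so aggregates are used to collapse them into a single record.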
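The collapse from many records to one row per customer can be sketched in base R as follows. The transaction log and its column names are hypothetical.

```r
# Hypothetical transaction log: several rows per customer
tx <- data.frame(
  customer_id = c(1, 1, 1, 2, 2, 3),
  amount      = c(100, 250, 50, 400, 600, 80)
)

# One aggregate per customer: a frequency (count) and two summary values
n_tx  <- tapply(tx$amount, tx$customer_id, length)
total <- tapply(tx$amount, tx$customer_id, sum)
avg   <- tapply(tx$amount, tx$customer_id, mean)

# Assemble a single row per customer
summary_df <- data.frame(
  customer_id = as.numeric(names(n_tx)),
  n_tx        = as.vector(n_tx),
  total       = as.vector(total),
  mean_amount = as.vector(avg)
)
summary_df
```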
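To integrate data from different sources, there are usually two operations: merging and concatenation.
Merging combines data from different sources using a common key variable, such as a customer ID.
Concatenation appends different records that share the same fields.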
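The two operations map onto base R's merge() and rbind(). The small data frames below are made up for illustration.

```r
# Two hypothetical sources that share the key customer_id
demo     <- data.frame(customer_id = c(1, 2, 3), age = c(37, 45, 29))
balances <- data.frame(customer_id = c(1, 2, 4), balance = c(1200, 560, 900))

# Merge: combine columns by key; all.x = TRUE keeps every customer in demo
# and fills missing balances with NA
merged <- merge(demo, balances, by = "customer_id", all.x = TRUE)

# Concatenation: stack records that have the same fields
batch_a <- data.frame(customer_id = c(1, 2), balance = c(1200, 560))
batch_b <- data.frame(customer_id = c(4, 5), balance = c(900, 300))
stacked <- rbind(batch_a, batch_b)
```

Customer 3 has no balance record, so the merged row carries NA, which is exactly the kind of gap the integrity checks that follow should catch.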
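After the data has been acquired and integrated, a series of integrity checks is needed, including:
Exploratory analysis involves the following tasks:
The examples here use a credit scoring dataset from the scorecard package.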
library(scorecard)
## Warning: package 'scorecard' was built under R version 3.4.4
data("germancredit")
names(germancredit)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "purpose"
## [5] "credit.amount"
## [6] "savings.account.and.bonds"
## [7] "present.employment.since"
## [8] "installment.rate.in.percentage.of.disposable.income"
## [9] "personal.status.and.sex"
## [10] "other.debtors.or.guarantors"
## [11] "present.residence.since"
## [12] "property"
## [13] "age.in.years"
## [14] "other.installment.plans"
## [15] "housing"
## [16] "number.of.existing.credits.at.this.bank"
## [17] "job"
## [18] "number.of.people.being.liable.to.provide.maintenance.for"
## [19] "telephone"
## [20] "foreign.worker"
## [21] "creditability"
The statistics generally fall into the following parts:
In R, summary() is usually enough to obtain the univariate statistics of a dataset.
summary(germancredit)
## status.of.existing.checking.account
## ... < 0 DM :274
## 0 <= ... < 200 DM :269
## ... >= 200 DM / salary assignments for at least 1 year: 63
## no checking account :394
##
##
## duration.in.month
## Min. : 4.0
## 1st Qu.:12.0
## Median :18.0
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## credit.history
## no credits taken/ all credits paid back duly : 40
## all credits at this bank paid back duly : 49
## existing credits paid back duly till now :530
## delay in paying off in the past : 88
## critical account/ other credits existing (not at this bank):293
##
## purpose credit.amount
## Length:1000 Min. : 250
## Class :character 1st Qu.: 1366
## Mode :character Median : 2320
## Mean : 3271
## 3rd Qu.: 3972
## Max. :18424
## savings.account.and.bonds present.employment.since
## ... < 100 DM :603 unemployed : 62
## 100 <= ... < 500 DM :103 ... < 1 year :172
## 500 <= ... < 1000 DM : 63 1 <= ... < 4 years:339
## ... >= 1000 DM : 48 4 <= ... < 7 years:174
## unknown/ no savings account:183 ... >= 7 years :253
##
## installment.rate.in.percentage.of.disposable.income
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.973
## 3rd Qu.:4.000
## Max. :4.000
## personal.status.and.sex
## male : divorced/separated : 0
## female : divorced/separated/married:360
## male : single :548
## male : married/widowed : 92
## female : single : 0
##
## other.debtors.or.guarantors present.residence.since
## none :907 Min. :1.000
## co-applicant: 41 1st Qu.:2.000
## guarantor : 52 Median :3.000
## Mean :2.845
## 3rd Qu.:4.000
## Max. :4.000
## property
## real estate :282
## building society savings agreement/ life insurance :232
## car or other, not in attribute Savings account/bonds:332
## unknown / no property :154
##
##
## age.in.years other.installment.plans housing
## Min. :19.00 bank :139 rent :179
## 1st Qu.:27.00 stores: 47 own :713
## Median :33.00 none :814 for free:108
## Mean :35.55
## 3rd Qu.:42.00
## Max. :75.00
## number.of.existing.credits.at.this.bank
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.407
## 3rd Qu.:2.000
## Max. :4.000
## job
## unemployed/ unskilled - non-resident : 22
## unskilled - resident :200
## skilled employee / official :630
## management/ self-employed/ highly qualified employee/ officer:148
##
##
## number.of.people.being.liable.to.provide.maintenance.for
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
## telephone foreign.worker
## none :596 yes:963
## yes, registered under the customers name:404 no : 37
##
##
##
##
## creditability
## bad :300
## good:700
##
##
##
##
A histogram of a continuous variable is a common way to inspect its distribution and to see whether the data is skewed or shows a trend:
require(ggplot2)
## Loading required package: ggplot2
qplot(germancredit$credit.amount,binwidth = 300)+xlab("credit amount")
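As the plot shows, the distribution of credit amount is right-skewed: most values are small and the tail extends to the right (the mean of 3271 is well above the median of 2320). Because the data is already labeled, we can also check whether the credit amount distribution differs between good and bad customers.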
require(ggplot2)
qplot(germancredit$credit.amount,fill = germancredit$creditability,binwidth = 300)+xlab("credit amount")
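This shows that the credit amount distributions of good and bad customers do not differ much.
For continuous variables we look at the distribution; to get an overall picture of a discrete variable, we look at its contingency table.
For example, suppose we want to compare housing status with the good/bad label: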
table(germancredit$housing,germancredit$creditability)
##
## bad good
## rent 70 109
## own 186 527
## for free 44 64
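Among renters, bad customers make up about 0.39; among home owners, about 0.26; and in the third group (for free), about 0.41. A customer who owns their home is therefore more likely to be a good customer.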
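These shares can be reproduced from the contingency table with prop.table(). The sketch below re-enters the counts by hand so that it runs without the scorecard package.

```r
# Counts copied from the housing x creditability table above
tab <- matrix(
  c(70, 186, 44,    # bad
    109, 527, 64),  # good
  nrow = 3,
  dimnames = list(housing = c("rent", "own", "for free"),
                  creditability = c("bad", "good"))
)

# Row-wise proportions: share of bad customers within each housing type
bad_rate <- prop.table(tab, margin = 1)[, "bad"]
round(bad_rate, 2)
```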
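Credit scoring model development rests on two implicit conditions:
In practice these two conditions do not necessarily hold, which makes it hard to say which observations are extreme values. Extreme values are identified from the variation in the data.
There are generally four ways to identify extreme values:
As for treatment: if extreme values make up more than 10% of the data, the data may come from several distributions, and separate scorecards may be needed for different groups. If there are only a few extreme values, they can simply be removed.
In feature selection, features should ideally not be correlated with each other. Correlated variables carry redundant information, which can be handled with principal component analysis.
Correlation between continuous variables is generally tested with a correlation test; association between discrete variables is tested with a chi-squared test.
Let us run these tests on the data used above:
# Extract the continuous variables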
require(tidyverse)
## Loading required package: tidyverse
## ─ Attaching packages ─────────────────────────── tidyverse 1.2.1 ─
## ✔ tibble 2.0.0 ✔ purrr 0.2.5
## ✔ tidyr 0.8.2 ✔ dplyr 0.7.8
## ✔ readr 1.3.1 ✔ stringr 1.3.1
## ✔ tibble 2.0.0 ✔ forcats 0.3.0
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## ─ Conflicts ──────────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(corrplot) # load the package first
## corrplot 0.84 loaded
tmp1 <- germancredit %>% select(duration.in.month,credit.amount,installment.rate.in.percentage.of.disposable.income,present.residence.since,age.in.years,number.of.existing.credits.at.this.bank,number.of.people.being.liable.to.provide.maintenance.for)
# Rename the columns for readability:
names(tmp1) <- c('duration','credit amount','installment rate','present residence','age','number of credit','number of liable')
corrplot(cor(tmp1))
Run the correlation test:
library(PerformanceAnalytics)
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(tmp1, histogram=TRUE, pch=19)
chisq.test(germancredit$status.of.existing.checking.account,germancredit$creditability)
##
## Pearson's Chi-squared test
##
## data: germancredit$status.of.existing.checking.account and germancredit$creditability
## X-squared = 123.72, df = 3, p-value < 2.2e-16
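The null hypothesis is rejected, so checking account status and creditability are not independent: the distribution of account status differs between good and bad customers.
These two are the main tests for relationships between variables. Some other tests are listed below:
Traditional credit scoring uses information value (IV) for feature selection. In essence, IV measures the association between two discrete variables, one of which is binary, so it can be used for feature selection in binary classification problems. It is defined as follows: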
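The formula did not survive in this copy of the text. The standard definition, which reproduces the bin_iv values printed by woebin later in this document, can be written as follows, where Bad_i and Good_i are the counts in category i and Bad_T and Good_T are the overall totals (the symbol names are chosen here for illustration):

```latex
\mathrm{IV} = \sum_{i=1}^{k}
  \left( \frac{\mathrm{Bad}_i}{\mathrm{Bad}_T} - \frac{\mathrm{Good}_i}{\mathrm{Good}_T} \right)
  \ln\!\left( \frac{\mathrm{Bad}_i / \mathrm{Bad}_T}{\mathrm{Good}_i / \mathrm{Good}_T} \right)
```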
The iv function in the scorecard package computes the information value:
info_value = iv(germancredit, y = "creditability")
info_value
## variable info_value
## 1: status.of.existing.checking.account 6.660115e-01
## 2: duration.in.month 3.345035e-01
## 3: credit.history 2.932335e-01
## 4: age.in.years 2.596514e-01
## 5: savings.account.and.bonds 1.960096e-01
## 6: purpose 1.691951e-01
## 7: property 1.126383e-01
## 8: present.employment.since 8.643363e-02
## 9: housing 8.329343e-02
## 10: other.installment.plans 5.761454e-02
## 11: foreign.worker 4.387741e-02
## 12: personal.status.and.sex 4.268938e-02
## 13: credit.amount 3.895727e-02
## 14: other.debtors.or.guarantors 3.201932e-02
## 15: installment.rate.in.percentage.of.disposable.income 2.632209e-02
## 16: number.of.existing.credits.at.this.bank 1.326652e-02
## 17: job 8.762766e-03
## 18: telephone 6.377605e-03
## 19: present.residence.since 3.588773e-03
## 20: number.of.people.being.liable.to.provide.maintenance.for 4.339223e-05
In general:
We can therefore keep the variables with relatively large IV:
dt_f = var_filter(germancredit, y="creditability",iv_limit = 0.1)
## [INFO] filtering variables ...
names(dt_f)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "purpose"
## [5] "savings.account.and.bonds"
## [6] "property"
## [7] "age.in.years"
## [8] "creditability"
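This selects the 7 variables with IV greater than 0.1; the eighth column is the label, creditability.
Many machine learning models can be used for feature selection, random forest among them. The principle behind random forest feature selection is simple: measure how much each feature contributes in every tree of the forest, average those contributions, and then compare the averages across features.
The contribution is usually measured by the Gini index or by the out-of-bag (OOB) error rate.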
require(randomForest)
tmp <- germancredit
tmp <- apply(tmp,MARGIN = 2,function(x){as.numeric(as.factor(x))}) %>% data.frame()
tmp$creditability <- as.factor(tmp$creditability) # convert the label to a factor
ran <- randomForest(creditability~.,data = tmp) # fit the model
ran
##
## Call:
## randomForest(formula = creditability ~ ., data = tmp)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 24.6%
## Confusion matrix:
## 1 2 class.error
## 1 118 182 0.60666667
## 2 64 636 0.09142857
importance(ran) # get the variable importance
## MeanDecreaseGini
## status.of.existing.checking.account 45.170465
## duration.in.month 43.095049
## credit.history 23.160291
## purpose 24.732708
## credit.amount 57.584559
## savings.account.and.bonds 19.098886
## present.employment.since 21.546569
## installment.rate.in.percentage.of.disposable.income 17.753737
## personal.status.and.sex 12.532627
## other.debtors.or.guarantors 7.540228
## present.residence.since 17.225932
## property 18.519705
## age.in.years 44.539842
## other.installment.plans 12.415455
## housing 10.686786
## number.of.existing.credits.at.this.bank 9.462637
## job 13.516909
## number.of.people.being.liable.to.provide.maintenance.for 5.596389
## telephone 8.058962
## foreign.worker 1.717192
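varImpPlot(ran) # plot the variable importance
This gives the importance of every variable from the random forest, which can be used to pick the important variables for the next step.
Boruta is a feature selection method based on random forests. It works as follows: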
require(Boruta)
Bo <- Boruta(creditability~.,data = tmp,pValue = 0.01,doTrace=2,maxRuns = 20)
Bo
## Boruta performed 19 iterations in 16.67126 secs.
## 9 attributes confirmed important: age.in.years, credit.amount,
## credit.history, duration.in.month, other.installment.plans and 4
## more;
## 2 attributes confirmed unimportant:
## number.of.people.being.liable.to.provide.maintenance.for,
## present.residence.since;
## 9 tentative attributes left: foreign.worker, housing,
## installment.rate.in.percentage.of.disposable.income, job,
## number.of.existing.credits.at.this.bank and 4 more;
plot(Bo)
getSelectedAttributes(Bo)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "credit.amount"
## [5] "savings.account.and.bonds"
## [6] "present.employment.since"
## [7] "property"
## [8] "age.in.years"
## [9] "other.installment.plans"
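Weight of evidence (WOE) makes it possible to convert a logistic regression model into the standard scorecard format.
WOE is defined as follows:
The numerator is the share of all bad samples that fall into a given category, and the denominator is the share of all good samples that fall into that category. If the ratio inside the parentheses is less than 1, the category's share of bad samples is lower than its share of good samples and the WOE is negative; otherwise it is positive.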
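The formula itself is missing from this copy of the text. Based on the description of the numerator and denominator, and consistent with the woe column printed by woebin below, it can be written as (symbol names chosen here for illustration):

```latex
\mathrm{WOE}_i = \ln\!\left( \frac{\mathrm{Bad}_i / \mathrm{Bad}_T}{\mathrm{Good}_i / \mathrm{Good}_T} \right)
```

Here Bad_i and Good_i are the bad and good counts in category i, and Bad_T and Good_T are the totals over all categories.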
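Note that a continuous variable must be binned before its WOE can be computed. There are many binning methods, such as equal-width and equal-frequency binning; another option is to bin with a decision tree: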
bins = woebin(germancredit, y="creditability",method = 'tree')
## [INFO] creating woe binning ...
bins$age.in.years
## variable bin count count_distr good bad badprob woe
## 1: age.in.years [-Inf,26) 190 0.190 110 80 0.4210526 0.5288441
## 2: age.in.years [26,28) 101 0.101 74 27 0.2673267 -0.1609304
## 3: age.in.years [28,35) 257 0.257 172 85 0.3307393 0.1424546
## 4: age.in.years [35,37) 79 0.079 67 12 0.1518987 -0.8724881
## 5: age.in.years [37, Inf) 373 0.373 277 96 0.2573727 -0.2123715
## bin_iv total_iv breaks is_special_values
## 1: 0.057921024 0.1304985 26 FALSE
## 2: 0.002528906 0.1304985 28 FALSE
## 3: 0.005359008 0.1304985 35 FALSE
## 4: 0.048610052 0.1304985 37 FALSE
## 5: 0.016079553 0.1304985 Inf FALSE
For the first bin of age.in.years the WOE is 0.5288441. Let us review how it is computed:
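The worked calculation is not included in this copy of the text, so here is a short sketch that recomputes it from the counts in the table above (110 good and 80 bad in the first bin, 700 good and 300 bad overall):

```r
# Counts for the first bin [-Inf,26) of age.in.years, from the woebin output
good_i <- 110
bad_i  <- 80

# Class totals across all five bins
good_total <- 700
bad_total  <- 300

# WOE: log of (share of bads in the bin) over (share of goods in the bin)
woe_1 <- log((bad_i / bad_total) / (good_i / good_total))

# The bin's IV contribution uses the same two shares
iv_1 <- (bad_i / bad_total - good_i / good_total) * woe_1

round(c(woe = woe_1, bin_iv = iv_1), 7)
```

Both values agree with the woe and bin_iv columns of the first row.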
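We have selected features and applied the WOE transformation, so we can now build the model. Once the model is built it has to be evaluated. A logistic regression model generally has to meet three criteria:
TN: True Negative, a negative sample classified correctly
TP: True Positive, a positive sample classified correctly
FN: False Negative, a positive sample classified incorrectly
FP: False Positive, a negative sample classified incorrectly
The confusion matrix shows clearly how the samples were classified.
The KS curve is built by sorting the population in descending order of predicted default probability, splitting it into 10 equal parts, and computing the cumulative percentages of defaulted and normal customers in each part. The KS statistic is the largest gap between the two cumulative curves.
As a rule of thumb, a KS of 0.2 means the model is usable, and a KS above 0.3 indicates a fairly good model.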
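A minimal sketch of the KS calculation on made-up predictions follows. For brevity it evaluates the gap after every observation instead of after each of 10 groups; the idea is the same.

```r
# Hypothetical predicted default probabilities and true labels (1 = default)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05)
label <- c(1,   1,   0,   1,   0,    1,   0,   0,   0,   0)

# Sort by predicted probability, highest risk first
ord <- order(prob, decreasing = TRUE)

# Cumulative share of defaulted and of normal customers captured so far
cum_bad  <- cumsum(label[ord] == 1) / sum(label == 1)
cum_good <- cumsum(label[ord] == 0) / sum(label == 0)

# KS is the largest gap between the two cumulative curves
ks <- max(abs(cum_bad - cum_good))
ks
```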
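The ROC curve is drawn in a similar way to the KS curve, but the axes have different meanings:
Here, sensitivity is the ratio of true positives to all actual positives.
Specificity is the ratio of true negatives to all actual negatives.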
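To make the two definitions concrete, the sketch below plugs in the counts from the perf_eva confusion matrix printed later in this document, assuming label 1 (bad) is the positive class:

```r
# Counts from the confusion matrix in the perf_eva output (training set)
tp <- 137  # label 1 predicted as 1
fn <- 43   # label 1 predicted as 0
tn <- 350  # label 0 predicted as 0
fp <- 90   # label 0 predicted as 1

sensitivity <- tp / (tp + fn)  # true positives over all actual positives
specificity <- tn / (tn + fp)  # true negatives over all actual negatives

round(c(sensitivity = sensitivity, specificity = specificity), 4)
```

These are one minus the per-class error rates shown in that output (0.2388889 and 0.2045455).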
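The area under the ROC curve is called the AUC statistic. The larger it is, the better the model; as a rule of thumb, an AUC above 0.75 indicates a reliable model.
The population stability index (PSI) formula:
PSI = sum((actual share - expected share) * ln(actual share / expected share))
To illustrate, suppose you train a logistic regression model; at prediction time it outputs a class probability p.
Call the output on your test dataset p1. Sort it in ascending order and split the dataset into 10 equal parts (each group has the same number of samples, i.e. equal-frequency grouping), then compute the minimum and maximum predicted probability in each part.
Now use the model to predict on new samples and call the result p2. Using the lower and upper bounds of the 10 parts obtained from the test dataset, split the new samples into 10 groups by p2; these groups are no longer necessarily equal in size. The actual share is the proportion of new samples whose p2 falls within each interval defined by p1, and the expected share is the proportion of test-set samples in each part.
The idea is that if the model is stable, the predicted probabilities on new data should follow the same distribution as at development time, so the share of samples falling into each interval derived from the development data should match the development shares. Otherwise the model has shifted, usually because the structure of the predictor variables has changed.
PSI is commonly used for model monitoring. As a rule of thumb, a PSI below 0.1 means the model is very stable; 0.1 to 0.2 is moderate and calls for further investigation; above 0.2 the model is unstable and should be rebuilt.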
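The procedure just described can be sketched in base R as below. The function name psi and the two probability vectors are hypothetical, and the sketch assumes every interval contains at least one sample in both datasets (an empty interval would make the logarithm undefined).

```r
psi <- function(expected_p, actual_p, n_bins = 10) {
  # Interval bounds come from the quantiles of the development-time probabilities
  breaks <- quantile(expected_p, probs = seq(0, 1, length.out = n_bins + 1))
  # Open the outer bounds so new values outside the old range still land in a bin
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf

  expected_share <- as.numeric(table(cut(expected_p, breaks))) / length(expected_p)
  actual_share   <- as.numeric(table(cut(actual_p, breaks))) / length(actual_p)

  sum((actual_share - expected_share) * log(actual_share / expected_share))
}

# Illustrative data: development probabilities and slightly shifted new ones
expected_p <- seq(0.01, 1, by = 0.01)
actual_p   <- expected_p^1.1

psi_value <- psi(expected_p, actual_p)
psi_value
```

Applying psi() to two identical vectors returns 0, the no-shift baseline.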
The perf_eva function in the scorecard package makes model evaluation convenient and supports many more metrics:
data("germancredit")
# var_filter selects features by the given criteria; by default it keeps those with IV greater than 0.02,
# a missing rate below 0.95, and an identical value rate below 0.95
dt_f = var_filter(germancredit, y="creditability")
## [INFO] filtering variables ...
# Split the dataset
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
# Get the sample labels
label_list = lapply(dt_list, function(x) x$creditability)
# WOE binning; the default method is the tree method
bins = woebin(dt_f, y="creditability") # returns the bins and the details of the WOE transformation
## [INFO] creating woe binning ...
# The breaks can also be specified manually
breaks_adj = list(
age.in.years=c(26, 35, 40),
other.debtors.or.guarantors=c("none", "co-applicant%,%guarantor"))
bins_adj = woebin(dt_f, y="creditability", breaks_list=breaks_adj)
## [INFO] creating woe binning ...
## Warning in check_breaks_list(breaks_list, xs): There are 12 x variables
## that donot specified in breaks_list are using optimal binning.
# Convert the data into WOE values
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins_adj))
## [INFO] converting into woe values ...
## [INFO] converting into woe values ...
# Fit the logistic regression model
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
# Stepwise regression
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
# Predict
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
# Model evaluation results
perf = perf_eva(pred = pred_list$train,label = label_list$train,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))
## [INFO] The threshold of confusion matrix is 0.3262.
perf
## $binomial_metric
## $binomial_metric$dat
## MSE RMSE LogLoss R2 KS AUC Gini
## 1: 0.1482852 0.3850781 0.4536391 0.2802927 0.5598485 0.8248359 0.6496717
##
##
## $confusion_matrix
## $confusion_matrix$dat
## label pred_0 pred_1 error
## 1: 0 350 90 0.2045455
## 2: 1 43 137 0.2388889
## 3: total 393 227 0.2145161
##
##
## $pic
## TableGrob (3 x 3) "arrange": 8 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (2-2,1-1) arrange gtable[layout]
## 5 5 (2-2,2-2) arrange gtable[layout]
## 6 6 (2-2,3-3) arrange gtable[layout]
## 7 7 (3-3,1-1) arrange gtable[layout]
## 8 8 (3-3,2-2) arrange gtable[layout]
card = scorecard(bins_adj, m2)
score_list = lapply(dt_list, function(x) scorecard_ply(x, card))
perf_psi(score = score_list, label = label_list)
## $pic
## $pic$score
##
##
## $psi
## variable dataset psi
## 1: score train_test 0.0109612
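A detailed explanation follows later.
Next, we explain how to build a scorecard once the model has been fitted and each customer's default probability is available.
Let p denote the estimated default probability, so the estimated probability of being normal is 1-p. Therefore: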
odds = p / (1 - p)
or
p = odds / (1 + odds)
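The scorecard scale is expressed as:
Score = A - B*log(odds)
where A and B are constants, and the minus sign makes the score higher when the default probability is lower.
The odds in a logistic regression model are computed as follows: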
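The formula is missing from this copy of the text. For a logistic regression with predictors x_1, ..., x_n (here the WOE-transformed variables) and coefficients beta_0, ..., beta_n, the standard relationship is:

```latex
\ln(\mathrm{odds}) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
```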
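In other words, logistic regression gives us p or the odds (in fact any model that outputs p would do), so once A and B are set we can compute the actual score. Next we show how to obtain A and B.
Computing A and B requires plugging two known score points into the formula. Usually two assumptions are needed: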
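Assume the score is p1 when the odds equal odds, and that the score drops by pdo to p1-pdo when the odds double to 2*odds (these are the odds of default, so doubling them lowers the score).
Substituting into the formula:
p1 = A - B*log(odds)
p1-pdo = A - B*log(2*odds)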
Solving these equations gives:
B = pdo/log(2)
A = p1 + B*log(odds)
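As a quick numerical check of these two formulas, the sketch below uses the default values that appear in the scorecard() signature shown later in this document (points0 = 600, odds0 = 1/19, pdo = 50). The helper score_from_p is a hypothetical name introduced here.

```r
points0 <- 600     # score assigned at the reference odds (p1 in the text)
odds0   <- 1 / 19  # reference odds of default
pdo     <- 50      # points lost when the odds of default double

B <- pdo / log(2)
A <- points0 + B * log(odds0)

# Score for a given default probability p: Score = A - B*log(odds)
score_from_p <- function(p) A - B * log(p / (1 - p))

score_from_p(1 / 20)  # odds = 1/19, about 600
score_from_p(2 / 21)  # odds = 2/19, about 550 (one pdo lower)
```

With these settings B is about 72.13 and A about 387.60.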
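This turns each person's default probability into a score. Note that a few parameters have to be specified:
We also need to know how each variable's points contribute to the total score. The formula for the points of each value of each variable is given below:
Suppose variable x has k possible values. The points for each value of x are then computed as:
- B*(model coefficient of x)*(WOE of the k-th value of x)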
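A small sketch of this formula follows. The WOE values are the three bins of status.of.existing.checking.account printed elsewhere in this document, while the coefficient 0.8 is a made-up value for illustration and is not taken from any fitted model.

```r
B    <- 50 / log(2)  # scaling factor for pdo = 50
beta <- 0.8          # hypothetical model coefficient of the variable

# WOE of each bin of status.of.existing.checking.account
woe <- c(0.6142040, -0.4054651, -1.1762632)

# Points per bin: -B * coefficient * WOE, rounded to whole points
points <- round(-B * beta * woe)
points
```

Bins with positive WOE (relatively more bad customers) receive negative points, and bins with negative WOE receive positive points.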
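After a model goes live it must be monitored to check that it is running well and whether it needs to be updated. This chapter introduces several model monitoring reports.
The purpose of this report is to produce an index that captures how the overall score distribution changes over time. The PSI introduced above can be used here.
The stability report monitors whether the overall score distribution has changed, while the characteristic analysis monitors changes in the distributions of the independent variables. The characteristic analysis for Age is used as an example below:
This report computes the difference between the actual (A%) and expected (E%) score distributions. The index is computed with the following formula:
The scorecard package provides a complete solution for developing credit scoring models in R. This section covers it in detail.
First, download and install it:
install.packages("scorecard")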
require(scorecard)
This function splits the dataset. Usage:
split_df(dt, y = NULL, ratio = 0.7, seed = 618)
data(germancredit)
# Example I
dt_list = split_df(germancredit, y="creditability")
train = dt_list[[1]]
test = dt_list[[2]]
dim(germancredit)
## [1] 1000 21
dim(train)
## [1] 681 21
dim(test)
## [1] 319 21
This function computes the IV of each feature as a reference for feature selection. Usage:
iv(dt, y, x = NULL, positive = "bad|1", order = TRUE)
data(germancredit)
# information values
info_value = iv(germancredit, y = "creditability")
info_value
## variable info_value
## 1: status.of.existing.checking.account 6.660115e-01
## 2: duration.in.month 3.345035e-01
## 3: credit.history 2.932335e-01
## 4: age.in.years 2.596514e-01
## 5: savings.account.and.bonds 1.960096e-01
## 6: purpose 1.691951e-01
## 7: property 1.126383e-01
## 8: present.employment.since 8.643363e-02
## 9: housing 8.329343e-02
## 10: other.installment.plans 5.761454e-02
## 11: foreign.worker 4.387741e-02
## 12: personal.status.and.sex 4.268938e-02
## 13: credit.amount 3.895727e-02
## 14: other.debtors.or.guarantors 3.201932e-02
## 15: installment.rate.in.percentage.of.disposable.income 2.632209e-02
## 16: number.of.existing.credits.at.this.bank 1.326652e-02
## 17: job 8.762766e-03
## 18: telephone 6.377605e-03
## 19: present.residence.since 3.588773e-03
## 20: number.of.people.being.liable.to.provide.maintenance.for 4.339223e-05
This function filters features by specified criteria such as information value and missing rate. Usage:
var_filter(dt, y, x = NULL, iv_limit = 0.02, missing_limit = 0.95,
identical_limit = 0.95, var_rm = NULL, var_kp = NULL,
return_rm_reason = FALSE, positive = "bad|1")
data(germancredit)
# variable filter
dt_sel = var_filter(germancredit, y = "creditability")
## [INFO] filtering variables ...
names(dt_sel)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "purpose"
## [5] "credit.amount"
## [6] "savings.account.and.bonds"
## [7] "present.employment.since"
## [8] "installment.rate.in.percentage.of.disposable.income"
## [9] "personal.status.and.sex"
## [10] "other.debtors.or.guarantors"
## [11] "property"
## [12] "age.in.years"
## [13] "other.installment.plans"
## [14] "housing"
## [15] "creditability"
This function performs WOE binning of variables, including continuous ones. Usage:
woebin(dt, y, x = NULL, var_skip = NULL, breaks_list = NULL,
special_values = NULL, stop_limit = 0.1, count_distr_limit = 0.05,
bin_num_limit = 8, positive = "bad|1", no_cores = NULL,
print_step = 0L, method = "tree", save_breaks_list = NULL,
ignore_const_cols = TRUE, ignore_datetime_cols = TRUE,
check_cate_num = TRUE, replace_blank_na = TRUE, ...)
bins2_tree = woebin(germancredit, y="creditability", method="tree")
## [INFO] creating woe binning ...
bins2_tree$status.of.existing.checking.account
## variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
## bin count
## 1: ... < 0 DM%,%0 <= ... < 200 DM 543
## 2: ... >= 200 DM / salary assignments for at least 1 year 63
## 3: no checking account 394
## count_distr good bad badprob woe bin_iv total_iv
## 1: 0.543 303 240 0.4419890 0.6142040 0.225500603 0.639372
## 2: 0.063 49 14 0.2222222 -0.4054651 0.009460853 0.639372
## 3: 0.394 348 46 0.1167513 -1.1762632 0.404410499 0.639372
## breaks
## 1: ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3: no checking account
## is_special_values
## 1: FALSE
## 2: FALSE
## 3: FALSE
With the WOE binning rules in place, woebin_ply converts the raw data into WOE values. Usage:
woebin_ply(dt, bins, no_cores = NULL, print_step = 0L,
replace_blank_na = TRUE, ...)
dt_woe = woebin_ply(germancredit, bins=bins2_tree)
## [INFO] converting into woe values ...
head(dt_woe)
## creditability status.of.existing.checking.account_woe
## 1: good 0.614204
## 2: bad 0.614204
## 3: good -1.176263
## 4: good 0.614204
## 5: bad 0.614204
## 6: good -1.176263
## duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe
## 1: -1.3121864 -0.73374058 -0.4100628 0.03366128
## 2: 1.1349799 0.08831862 -0.4100628 0.39053946
## 3: -0.3466246 -0.73374058 0.2799201 -0.25830746
## 4: 0.5245245 0.08831862 0.2799201 0.39053946
## 5: 0.1086883 0.08515781 0.2799201 0.39053946
## 6: 0.5245245 0.08831862 0.2799201 0.39053946
## savings.account.and.bonds_woe present.employment.since_woe
## 1: -0.7621401 -0.23556607
## 2: 0.2713578 0.03210325
## 3: 0.2713578 -0.39441527
## 4: 0.2713578 -0.39441527
## 5: 0.2713578 0.03210325
## 6: -0.7621401 0.03210325
## installment.rate.in.percentage.of.disposable.income_woe
## 1: 0.1039609
## 2: -0.1554665
## 3: -0.1554665
## 4: -0.1554665
## 5: 0.1039609
## 6: -0.1554665
## personal.status.and.sex_woe other.debtors.or.guarantors_woe
## 1: -0.1655476 0.02797385
## 2: 0.2646926 0.02797385
## 3: -0.1655476 0.02797385
## 4: -0.1655476 -0.58778666
## 5: -0.1655476 0.02797385
## 6: -0.1655476 0.02797385
## present.residence.since_woe property_woe age.in.years_woe
## 1: -0.01359409 -0.46103496 -0.2123715
## 2: 0.07015071 -0.46103496 0.5288441
## 3: -0.01359409 -0.46103496 -0.2123715
## 4: -0.01359409 0.02857337 -0.2123715
## 5: -0.01359409 0.58608236 -0.2123715
## 6: -0.01359409 0.58608236 -0.8724881
## other.installment.plans_woe housing_woe
## 1: -0.1211786 -0.1941560
## 2: -0.1211786 -0.1941560
## 3: -0.1211786 -0.1941560
## 4: -0.1211786 0.4726044
## 5: -0.1211786 0.4726044
## 6: -0.1211786 0.4726044
## number.of.existing.credits.at.this.bank_woe job_woe
## 1: -0.1347806 -0.02278003
## 2: 0.0748775 -0.02278003
## 3: 0.0748775 -0.07847162
## 4: 0.0748775 -0.02278003
## 5: -0.1347806 -0.02278003
## 6: 0.0748775 -0.07847162
## number.of.people.being.liable.to.provide.maintenance.for_woe
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
## telephone_woe foreign.worker_woe
## 1: -0.09863759 0
## 2: 0.06469132 0
## 3: 0.06469132 0
## 4: 0.06469132 0
## 5: 0.06469132 0
## 6: -0.09863759 0
scorecard builds the scorecard rules from the model and the woebin result. Usage:
scorecard(bins, model, points0 = 600, odds0 = 1/19, pdo = 50,
basepoints_eq0 = FALSE)
dt_woe$creditability <- as.character(dt_woe$creditability)
dt_woe$creditability[as.character(dt_woe$creditability)=='good']=0
dt_woe$creditability[as.character(dt_woe$creditability)=='bad']=1
dt_woe$creditability <- as.factor(dt_woe$creditability)
l <- glm(creditability~.,data = dt_woe,family = binomial())
l <- step(l)
## Start: AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe +
## credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## personal.status.and.sex_woe + other.debtors.or.guarantors_woe +
## present.residence.since_woe + property_woe + age.in.years_woe +
## other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe +
## job_woe + number.of.people.being.liable.to.provide.maintenance.for_woe +
## telephone_woe + foreign.worker_woe
##
##
## Step: AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe +
## credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## personal.status.and.sex_woe + other.debtors.or.guarantors_woe +
## present.residence.since_woe + property_woe + age.in.years_woe +
## other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe +
## job_woe + number.of.people.being.liable.to.provide.maintenance.for_woe +
## telephone_woe
##
##
## Step: AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe +
## credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## personal.status.and.sex_woe + other.debtors.or.guarantors_woe +
## present.residence.since_woe + property_woe + age.in.years_woe +
## other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe +
## job_woe + telephone_woe
##
## Df Deviance
## - job_woe 1 901.21
## - number.of.existing.credits.at.this.bank_woe 1 901.86
## <none> 901.13
## - property_woe 1 903.33
## - housing_woe 1 903.37
## - telephone_woe 1 903.84
## - present.employment.since_woe 1 905.78
## - other.installment.plans_woe 1 906.97
## - personal.status.and.sex_woe 1 907.44
## - other.debtors.or.guarantors_woe 1 907.52
## - present.residence.since_woe 1 907.56
## - age.in.years_woe 1 908.87
## - installment.rate.in.percentage.of.disposable.income_woe 1 916.85
## - duration.in.month_woe 1 917.07
## - savings.account.and.bonds_woe 1 919.14
## - credit.amount_woe 1 920.39
## - purpose_woe 1 921.29
## - credit.history_woe 1 921.81
## - status.of.existing.checking.account_woe 1 960.39
## AIC
## - job_woe 937.21
## - number.of.existing.credits.at.this.bank_woe 937.86
## <none> 939.13
## - property_woe 939.33
## - housing_woe 939.37
## - telephone_woe 939.84
## - present.employment.since_woe 941.78
## - other.installment.plans_woe 942.97
## - personal.status.and.sex_woe 943.44
## - other.debtors.or.guarantors_woe 943.52
## - present.residence.since_woe 943.56
## - age.in.years_woe 944.87
## - installment.rate.in.percentage.of.disposable.income_woe 952.85
## - duration.in.month_woe 953.07
## - savings.account.and.bonds_woe 955.14
## - credit.amount_woe 956.39
## - purpose_woe 957.29
## - credit.history_woe 957.81
## - status.of.existing.checking.account_woe 996.39
##
## Step: AIC=937.21
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe +
## credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## personal.status.and.sex_woe + other.debtors.or.guarantors_woe +
## present.residence.since_woe + property_woe + age.in.years_woe +
## other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe +
## telephone_woe
##
## Df Deviance
## - number.of.existing.credits.at.this.bank_woe 1 901.95
## <none> 901.21
## - property_woe 1 903.33
## - housing_woe 1 903.43
## - telephone_woe 1 904.73
## - present.employment.since_woe 1 905.81
## - other.installment.plans_woe 1 907.01
## - personal.status.and.sex_woe 1 907.49
## - other.debtors.or.guarantors_woe 1 907.54
## - present.residence.since_woe 1 907.58
## - age.in.years_woe 1 909.11
## - installment.rate.in.percentage.of.disposable.income_woe 1 916.85
## - duration.in.month_woe 1 917.16
## - savings.account.and.bonds_woe 1 919.14
## - credit.amount_woe 1 920.47
## - purpose_woe 1 921.67
## - credit.history_woe 1 921.94
## - status.of.existing.checking.account_woe 1 960.43
## AIC
## - number.of.existing.credits.at.this.bank_woe 935.95
## <none> 937.21
## - property_woe 937.33
## - housing_woe 937.43
## - telephone_woe 938.73
## - present.employment.since_woe 939.81
## - other.installment.plans_woe 941.01
## - personal.status.and.sex_woe 941.49
## - other.debtors.or.guarantors_woe 941.54
## - present.residence.since_woe 941.58
## - age.in.years_woe 943.11
## - installment.rate.in.percentage.of.disposable.income_woe 950.85
## - duration.in.month_woe 951.16
## - savings.account.and.bonds_woe 953.14
## - credit.amount_woe 954.47
## - purpose_woe 955.67
## - credit.history_woe 955.94
## - status.of.existing.checking.account_woe 994.43
##
## Step: AIC=935.95
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe +
## credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe +
## present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe +
## personal.status.and.sex_woe + other.debtors.or.guarantors_woe +
## present.residence.since_woe + property_woe + age.in.years_woe +
## other.installment.plans_woe + housing_woe + telephone_woe
##
## Df Deviance
## <none> 901.95
## - property_woe 1 904.03
## - housing_woe 1 904.14
## - telephone_woe 1 905.43
## - present.employment.since_woe 1 906.30
## - personal.status.and.sex_woe 1 908.09
## - other.installment.plans_woe 1 908.26
## - other.debtors.or.guarantors_woe 1 908.33
## - present.residence.since_woe 1 908.51
## - age.in.years_woe 1 909.56
## - installment.rate.in.percentage.of.disposable.income_woe 1 917.50
## - duration.in.month_woe 1 918.09
## - savings.account.and.bonds_woe 1 920.43
## - credit.amount_woe 1 921.73
## - credit.history_woe 1 922.18
## - purpose_woe 1 923.04
## - status.of.existing.checking.account_woe 1 960.57
## AIC
## <none> 935.95
## - property_woe 936.03
## - housing_woe 936.14
## - telephone_woe 937.43
## - present.employment.since_woe 938.30
## - personal.status.and.sex_woe 940.09
## - other.installment.plans_woe 940.26
## - other.debtors.or.guarantors_woe 940.33
## - present.residence.since_woe 940.51
## - age.in.years_woe 941.56
## - installment.rate.in.percentage.of.disposable.income_woe 949.50
## - duration.in.month_woe 950.09
## - savings.account.and.bonds_woe 952.43
## - credit.amount_woe 953.73
## - credit.history_woe 954.18
## - purpose_woe 955.04
## - status.of.existing.checking.account_woe 992.57
score <- scorecard(bins = bins2_tree,model = l)
score$status.of.existing.checking.account
## variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
## bin count
## 1: ... < 0 DM%,%0 <= ... < 200 DM 543
## 2: ... >= 200 DM / salary assignments for at least 1 year 63
## 3: no checking account 394
## count_distr good bad badprob woe bin_iv total_iv
## 1: 0.543 303 240 0.4419890 0.6142040 0.225500603 0.639372
## 2: 0.063 49 14 0.2222222 -0.4054651 0.009460853 0.639372
## 3: 0.394 348 46 0.1167513 -1.1762632 0.404410499 0.639372
## breaks
## 1: ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3: no checking account
## is_special_values points
## 1: FALSE -36
## 2: FALSE 24
## 3: FALSE 68
To compute a new customer's score from their raw data, use scorecard_ply. Usage:
scorecard_ply(dt, card, only_total_score = TRUE, print_step = 0L,
replace_blank_na = TRUE, var_kp = NULL)
result <- scorecard_ply(dt = germancredit,card = score)
result
## score
## 1: 648
## 2: 314
## 3: 638
## 4: 439
## 5: 310
## ---
## 996: 535
## 997: 462
## 998: 559
## 999: 342
## 1000: 402
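This gives every customer's score. In the next chapter we walk through an example that ties the whole process together.
The dataset comes from a German bank and is already included in the scorecard package; use data(germancredit) to load it: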
data(germancredit)
head(germancredit)
## status.of.existing.checking.account duration.in.month
## 1 ... < 0 DM 6
## 2 0 <= ... < 200 DM 48
## 3 no checking account 12
## 4 ... < 0 DM 42
## 5 ... < 0 DM 24
## 6 no checking account 36
## credit.history
## 1 critical account/ other credits existing (not at this bank)
## 2 existing credits paid back duly till now
## 3 critical account/ other credits existing (not at this bank)
## 4 existing credits paid back duly till now
## 5 delay in paying off in the past
## 6 existing credits paid back duly till now
## purpose credit.amount savings.account.and.bonds
## 1 radio/television 1169 unknown/ no savings account
## 2 radio/television 5951 ... < 100 DM
## 3 education 2096 ... < 100 DM
## 4 furniture/equipment 7882 ... < 100 DM
## 5 car (new) 4870 ... < 100 DM
## 6 education 9055 unknown/ no savings account
## present.employment.since
## 1 ... >= 7 years
## 2 1 <= ... < 4 years
## 3 4 <= ... < 7 years
## 4 4 <= ... < 7 years
## 5 1 <= ... < 4 years
## 6 1 <= ... < 4 years
## installment.rate.in.percentage.of.disposable.income
## 1 4
## 2 2
## 3 2
## 4 2
## 5 3
## 6 2
## personal.status.and.sex other.debtors.or.guarantors
## 1 male : single none
## 2 female : divorced/separated/married none
## 3 male : single none
## 4 male : single guarantor
## 5 male : single none
## 6 male : single none
## present.residence.since
## 1 4
## 2 2
## 3 3
## 4 4
## 5 4
## 6 4
## property age.in.years
## 1 real estate 67
## 2 real estate 22
## 3 real estate 49
## 4 building society savings agreement/ life insurance 45
## 5 unknown / no property 53
## 6 unknown / no property 35
## other.installment.plans housing number.of.existing.credits.at.this.bank
## 1 none own 2
## 2 none own 1
## 3 none own 1
## 4 none for free 1
## 5 none for free 2
## 6 none for free 1
## job
## 1 skilled employee / official
## 2 skilled employee / official
## 3 unskilled - resident
## 4 skilled employee / official
## 5 skilled employee / official
## 6 unskilled - resident
## number.of.people.being.liable.to.provide.maintenance.for
## 1 1
## 2 1
## 3 2
## 4 2
## 5 2
## 6 2
## telephone foreign.worker creditability
## 1 yes, registered under the customers name yes good
## 2 none yes bad
## 3 none yes good
## 4 none yes good
## 5 none yes bad
## 6 yes, registered under the customers name yes good
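There are 20 features. The last column, creditability, is the sample label: bad marks a bad customer and good marks a good customer. Because the data is already prepared, we can go straight to feature selection, keeping variables with an IV of at least 0.1: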
dt_f = var_filter(germancredit, y="creditability",iv_limit = 0.1)
## [INFO] filtering variables ...
names(dt_f)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "purpose"
## [5] "savings.account.and.bonds"
## [6] "property"
## [7] "age.in.years"
## [8] "creditability"
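This keeps 8 columns: 7 features plus the label creditability. Next, split the data into training and test sets: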
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)
head(dt_list)
## $train
## status.of.existing.checking.account duration.in.month
## 1: ... < 0 DM 6
## 2: 0 <= ... < 200 DM 48
## 3: no checking account 12
## 4: ... < 0 DM 42
## 5: no checking account 36
## ---
## 616: no checking account 12
## 617: ... < 0 DM 30
## 618: no checking account 12
## 619: ... < 0 DM 45
## 620: 0 <= ... < 200 DM 45
## credit.history
## 1: critical account/ other credits existing (not at this bank)
## 2: existing credits paid back duly till now
## 3: critical account/ other credits existing (not at this bank)
## 4: existing credits paid back duly till now
## 5: existing credits paid back duly till now
## ---
## 616: existing credits paid back duly till now
## 617: existing credits paid back duly till now
## 618: existing credits paid back duly till now
## 619: existing credits paid back duly till now
## 620: critical account/ other credits existing (not at this bank)
## purpose savings.account.and.bonds
## 1: radio/television unknown/ no savings account
## 2: radio/television ... < 100 DM
## 3: education ... < 100 DM
## 4: furniture/equipment ... < 100 DM
## 5: education unknown/ no savings account
## ---
## 616: furniture/equipment ... < 100 DM
## 617: car (used) ... < 100 DM
## 618: radio/television ... < 100 DM
## 619: radio/television ... < 100 DM
## 620: car (used) 100 <= ... < 500 DM
## property age.in.years
## 1: real estate 67
## 2: real estate 22
## 3: real estate 49
## 4: building society savings agreement/ life insurance 45
## 5: unknown / no property 35
## ---
## 616: real estate 31
## 617: building society savings agreement/ life insurance 40
## 618: car or other, not in attribute Savings account/bonds 38
## 619: unknown / no property 23
## 620: car or other, not in attribute Savings account/bonds 27
## creditability
## 1: 0
## 2: 1
## 3: 0
## 4: 0
## 5: 0
## ---
## 616: 0
## 617: 0
## 618: 0
## 619: 1
## 620: 0
##
## $test
## status.of.existing.checking.account duration.in.month
## 1: ... < 0 DM 24
## 2: no checking account 12
## 3: 0 <= ... < 200 DM 12
## 4: ... < 0 DM 24
## 5: ... < 0 DM 15
## ---
## 376: ... < 0 DM 36
## 377: ... < 0 DM 15
## 378: no checking account 15
## 379: ... < 0 DM 18
## 380: no checking account 12
## credit.history
## 1: delay in paying off in the past
## 2: existing credits paid back duly till now
## 3: existing credits paid back duly till now
## 4: critical account/ other credits existing (not at this bank)
## 5: existing credits paid back duly till now
## ---
## 376: existing credits paid back duly till now
## 377: critical account/ other credits existing (not at this bank)
## 378: all credits at this bank paid back duly
## 379: existing credits paid back duly till now
## 380: existing credits paid back duly till now
## purpose savings.account.and.bonds
## 1: car (new) ... < 100 DM
## 2: radio/television ... >= 1000 DM
## 3: car (new) ... < 100 DM
## 4: car (new) ... < 100 DM
## 5: car (new) ... < 100 DM
## ---
## 376: car (used) ... < 100 DM
## 377: furniture/equipment ... < 100 DM
## 378: radio/television 100 <= ... < 500 DM
## 379: radio/television unknown/ no savings account
## 380: car (new) unknown/ no savings account
## property age.in.years
## 1: unknown / no property 53
## 2: real estate 61
## 3: car or other, not in attribute Savings account/bonds 25
## 4: car or other, not in attribute Savings account/bonds 60
## 5: car or other, not in attribute Savings account/bonds 28
## ---
## 376: building society savings agreement/ life insurance 26
## 377: building society savings agreement/ life insurance 25
## 378: car or other, not in attribute Savings account/bonds 34
## 379: car or other, not in attribute Savings account/bonds 23
## 380: car or other, not in attribute Savings account/bonds 50
## creditability
## 1: 1
## 2: 0
## 3: 1
## 4: 1
## 5: 0
## ---
## 376: 1
## 377: 0
## 378: 0
## 379: 0
## 380: 0
bins = woebin(dt_f, y="creditability")
## [INFO] creating woe binning ...
bins$status.of.existing.checking.account
## variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
## bin count
## 1: ... < 0 DM%,%0 <= ... < 200 DM 543
## 2: ... >= 200 DM / salary assignments for at least 1 year 63
## 3: no checking account 394
## count_distr good bad badprob woe bin_iv total_iv
## 1: 0.543 303 240 0.4419890 0.6142040 0.225500603 0.639372
## 2: 0.063 49 14 0.2222222 -0.4054651 0.009460853 0.639372
## 3: 0.394 348 46 0.1167513 -1.1762632 0.404410499 0.639372
## breaks
## 1: ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3: no checking account
## is_special_values
## 1: FALSE
## 2: FALSE
## 3: FALSE
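The woe, bin_iv and total_iv columns above can be reproduced by hand from the good and bad counts. The following is a minimal sketch in base R, assuming the convention that matches the printed values: WOE is the log of a bin's share of all bad customers over its share of all good customers, and a bin's IV is the difference of those two shares times the WOE. The variable names are illustrative and not part of the scorecard package.

```r
# Good/bad counts per bin, copied from the output above
good <- c(303, 49, 348)
bad  <- c(240, 14, 46)

# Share of all good (bad) customers that falls into each bin
good_distr <- good / sum(good)
bad_distr  <- bad / sum(bad)

# WOE: positive when a bin holds relatively more bad customers
woe <- log(bad_distr / good_distr)

# IV per bin, and for the whole variable
bin_iv   <- (bad_distr - good_distr) * woe
total_iv <- sum(bin_iv)

print(round(woe, 7))
print(round(bin_iv, 6))
print(round(total_iv, 6))
```

The three WOE values and the total IV agree with the woe and total_iv columns printed above, which is a quick way to check which sign convention a given tool uses.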
bins$duration.in.month
## variable bin count count_distr good bad badprob
## 1: duration.in.month [-Inf,8) 87 0.087 78 9 0.1034483
## 2: duration.in.month [8,16) 344 0.344 264 80 0.2325581
## 3: duration.in.month [16,34) 399 0.399 270 129 0.3233083
## 4: duration.in.month [34,44) 100 0.100 58 42 0.4200000
## 5: duration.in.month [44, Inf) 70 0.070 30 40 0.5714286
## woe bin_iv total_iv breaks is_special_values
## 1: -1.3121864 0.106849463 0.2826181 8 FALSE
## 2: -0.3466246 0.038293766 0.2826181 16 FALSE
## 3: 0.1086883 0.004813339 0.2826181 34 FALSE
## 4: 0.5245245 0.029972827 0.2826181 44 FALSE
## 5: 1.1349799 0.102688661 0.2826181 Inf FALSE
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
## [INFO] converting into woe values ...
## [INFO] converting into woe values ...
head(dt_woe_list)
## $train
## creditability status.of.existing.checking.account_woe
## 1: 0 0.614204
## 2: 1 0.614204
## 3: 0 -1.176263
## 4: 0 0.614204
## 5: 0 -1.176263
## ---
## 616: 0 -1.176263
## 617: 0 0.614204
## 618: 0 -1.176263
## 619: 1 0.614204
## 620: 0 0.614204
## duration.in.month_woe credit.history_woe purpose_woe
## 1: -1.3121864 -0.73374058 -0.4100628
## 2: 1.1349799 0.08831862 -0.4100628
## 3: -0.3466246 -0.73374058 0.2799201
## 4: 0.5245245 0.08831862 0.2799201
## 5: 0.5245245 0.08831862 0.2799201
## ---
## 616: -0.3466246 0.08831862 0.2799201
## 617: 0.1086883 0.08831862 -0.8056252
## 618: -0.3466246 0.08831862 -0.4100628
## 619: 1.1349799 0.08831862 -0.4100628
## 620: 1.1349799 -0.73374058 -0.8056252
## savings.account.and.bonds_woe property_woe age.in.years_woe
## 1: -0.7621401 -0.46103496 -0.2123715
## 2: 0.2713578 -0.46103496 0.5288441
## 3: 0.2713578 -0.46103496 -0.2123715
## 4: 0.2713578 0.02857337 -0.2123715
## 5: -0.7621401 0.58608236 -0.8724881
## ---
## 616: 0.2713578 -0.46103496 0.1424546
## 617: 0.2713578 0.02857337 -0.2123715
## 618: 0.2713578 0.03419136 -0.2123715
## 619: 0.2713578 0.58608236 0.5288441
## 620: 0.1395519 0.03419136 -0.1609304
##
## $test
## creditability status.of.existing.checking.account_woe
## 1: 1 0.614204
## 2: 0 -1.176263
## 3: 1 0.614204
## 4: 1 0.614204
## 5: 0 0.614204
## ---
## 376: 1 0.614204
## 377: 0 0.614204
## 378: 0 -1.176263
## 379: 0 0.614204
## 380: 0 -1.176263
## duration.in.month_woe credit.history_woe purpose_woe
## 1: 0.1086883 0.08515781 0.2799201
## 2: -0.3466246 0.08831862 -0.4100628
## 3: -0.3466246 0.08831862 0.2799201
## 4: 0.1086883 -0.73374058 0.2799201
## 5: -0.3466246 0.08831862 0.2799201
## ---
## 376: 0.5245245 0.08831862 -0.8056252
## 377: -0.3466246 -0.73374058 0.2799201
## 378: -0.3466246 1.23407084 -0.4100628
## 379: 0.1086883 0.08831862 -0.4100628
## 380: -0.3466246 0.08831862 0.2799201
## savings.account.and.bonds_woe property_woe age.in.years_woe
## 1: 0.2713578 0.58608236 -0.2123715
## 2: -0.7621401 -0.46103496 -0.2123715
## 3: 0.2713578 0.03419136 0.5288441
## 4: 0.2713578 0.03419136 -0.2123715
## 5: 0.2713578 0.03419136 0.1424546
## ---
## 376: 0.2713578 0.02857337 -0.1609304
## 377: 0.2713578 0.02857337 0.5288441
## 378: 0.1395519 0.03419136 0.1424546
## 379: -0.7621401 0.03419136 0.5288441
## 380: -0.7621401 0.03419136 -0.2123715
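What woebin_ply does for a numeric variable can be illustrated with a plain lookup: find the bin a value falls into and return that bin's WOE. The sketch below is base R only and is not the package's implementation; the break points and WOE values are copied from the duration.in.month bins shown earlier, and duration_to_woe is a hypothetical helper name.

```r
# Bin edges and WOE values for duration.in.month, copied from the bins output
breaks <- c(8, 16, 34, 44)
woe    <- c(-1.3121864, -0.3466246, 0.1086883, 0.5245245, 1.1349799)

# findInterval() returns 0 for values below the first edge and treats
# intervals as left-closed, so adding 1 gives the index of the bin [lo, hi)
duration_to_woe <- function(x) woe[findInterval(x, breaks) + 1]

# Durations of the first four training rows: 6, 48, 12, 42
print(duration_to_woe(c(6, 48, 12, 42)))
```

The result matches the duration.in.month_woe values of the first four training rows in the output above.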
m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
summary(m_step)
##
## Call:
## glm(formula = creditability ~ status.of.existing.checking.account_woe +
## duration.in.month_woe + credit.history_woe + purpose_woe +
## savings.account.and.bonds_woe + property_woe + age.in.years_woe,
## family = binomial(), data = dt_woe_list$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8673 -0.7379 -0.4205 0.7899 2.5882
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.9302 0.1060 -8.773
## status.of.existing.checking.account_woe 0.7343 0.1334 5.502
## duration.in.month_woe 0.9731 0.2173 4.479
## credit.history_woe 0.8721 0.1960 4.451
## purpose_woe 0.8589 0.2653 3.238
## savings.account.and.bonds_woe 0.7163 0.2522 2.840
## property_woe 0.5991 0.3229 1.855
## age.in.years_woe 0.9456 0.2918 3.240
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## status.of.existing.checking.account_woe 3.74e-08 ***
## duration.in.month_woe 7.50e-06 ***
## credit.history_woe 8.56e-06 ***
## purpose_woe 0.00120 **
## savings.account.and.bonds_woe 0.00450 **
## property_woe 0.06355 .
## age.in.years_woe 0.00119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 747.03 on 619 degrees of freedom
## Residual deviance: 583.26 on 612 degrees of freedom
## AIC: 599.26
##
## Number of Fisher Scoring iterations: 5
At this point the model is trained. Stepwise selection kept all seven WOE variables.
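To see what the fitted model predicts, the probability for one customer can be computed by hand. This is a minimal sketch, assuming the coefficients as printed in the summary (rounded to four decimals) and the WOE values of the first training row; the short vector names are illustrative.

```r
# Coefficients copied from the summary output (rounded to 4 decimals)
intercept <- -0.9302
coefs <- c(status = 0.7343, duration = 0.9731, history = 0.8721,
           purpose = 0.8589, savings = 0.7163, property = 0.5991,
           age = 0.9456)

# WOE values of the first training row, copied from dt_woe_list$train
woe_row1 <- c(status = 0.614204, duration = -1.3121864, history = -0.73374058,
              purpose = -0.4100628, savings = -0.7621401,
              property = -0.46103496, age = -0.2123715)

# Linear predictor (log-odds of being a bad customer), then the logistic function
log_odds <- intercept + sum(coefs * woe_row1)
p_bad    <- plogis(log_odds)

print(round(log_odds, 3))
print(round(p_bad, 4))
```

Because the coefficients are rounded, the result can differ slightly from what predict(m2, ..., type = 'response') returns for the same row.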
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## performance
perf = perf_eva(pred = pred_list, label = label_list,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))
## [INFO] The threshold of confusion matrix is 0.2622.
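Among the metrics plotted by perf_eva, KS is the easiest to compute by hand: rank customers by predicted bad probability and take the largest gap between the cumulative share of bad customers and the cumulative share of good customers. The sketch below uses made-up predictions and labels purely for illustration; it is not the package's implementation.

```r
# Toy predictions and labels (1 = bad, 0 = good); hypothetical values
pred  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1, 1, 0, 1, 0, 0, 1, 0)

# Sort customers from highest to lowest predicted bad probability
ord          <- order(pred, decreasing = TRUE)
label_sorted <- label[ord]

# Cumulative share of bad and of good customers captured so far
cum_bad  <- cumsum(label_sorted) / sum(label_sorted)
cum_good <- cumsum(1 - label_sorted) / sum(1 - label_sorted)

# KS is the largest gap between the two cumulative curves
ks <- max(abs(cum_bad - cum_good))
print(ks)
```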
card = scorecard(bins, m2)
score_list = lapply(dt_list, function(x) scorecard_ply(x, card))
head(score_list)
## $train
## score
## 1: 658
## 2: 331
## 3: 590
## 4: 361
## 5: 531
## ---
## 616: 514
## 617: 457
## 618: 559
## 619: 286
## 620: 441
##
## $test
## score
## 1: 367
## 2: 633
## 3: 372
## 4: 442
## 5: 398
## ---
## 376: 425
## 377: 424
## 378: 470
## 379: 435
## 380: 570
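The score of 658 for the first training row can be reproduced from the model coefficients. The sketch below assumes the usual scorecard scaling with points0 = 600, odds0 = 1/19 and pdo = 50 (believed to be the defaults of scorecard(), stated here as an assumption), base points of a - b * intercept, and per-bin points of -b * coefficient * WOE, each rounded separately. Coefficients are the rounded values from the summary output.

```r
# Scaling parameters; assumed to be the scorecard() defaults
points0 <- 600
odds0   <- 1 / 19
pdo     <- 50
b <- pdo / log(2)              # points needed to double the odds
a <- points0 + b * log(odds0)  # offset

# Coefficients from the model summary (rounded to 4 decimals)
intercept <- -0.9302
coefs <- c(status = 0.7343, duration = 0.9731, history = 0.8721,
           purpose = 0.8589, savings = 0.7163, property = 0.5991,
           age = 0.9456)

# WOE values of the first training row
woe_row1 <- c(status = 0.614204, duration = -1.3121864, history = -0.73374058,
              purpose = -0.4100628, savings = -0.7621401,
              property = -0.46103496, age = -0.2123715)

# Base points plus one rounded contribution per variable
base_points <- round(a - b * intercept)
bin_points  <- round(-b * coefs * woe_row1)
total       <- base_points + sum(bin_points)

print(base_points)
print(bin_points)
print(total)
```

Under these assumptions the total agrees with the first score in the $train output above. Since each bin's points are rounded on their own, the total can differ by a point or two from a - b * log_odds computed without rounding.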
perf_psi(score = score_list, label = label_list)
## $pic
## $pic$score
##
##
## $psi
## variable dataset psi
## 1: score train_test 0.02102106
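PSI compares how scores are distributed across bands in two samples, here train and test. A minimal sketch of the usual formula, the sum over bands of (actual - expected) * ln(actual / expected), is shown below with made-up band shares; the numbers are hypothetical and unrelated to the result above.

```r
# Hypothetical share of customers per score band
expected <- c(0.20, 0.30, 0.50)  # e.g. the training set
actual   <- c(0.25, 0.25, 0.50)  # e.g. the test set

# Each band contributes (actual - expected) * log(actual / expected);
# a band with identical shares contributes nothing
psi <- sum((actual - expected) * log(actual / expected))
print(round(psi, 4))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as a stable population, so the value of about 0.021 reported above would suggest that train and test scores are distributed similarly.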