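Contents

  1. Scorecard development process
  2. Data acquisition and integration
  3. Exploratory data analysis
  4. Feature selection
  5. Coarse classing and WOE transformation
  6. Model evaluation
  7. Scorecard development
  8. Model monitoring
  9. The scorecard credit scoring package
  10. Case study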

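1. Scorecard development process

1.1.1 Standard scorecards

Credit scorecards fall into two main types:

  1. Application scorecards
  2. Behavioral scorecards

Both are developed with the same methodology, but they are applied in different scenarios:

  • An application scorecard produces a one-off credit score for a new loan application and is used to decide whether to lend, how much to lend, and how to price the loan
  • A behavioral scorecard scores customers who have passed approval and are already active, that is, customers with some transaction history; the result is used to set collection strategies

1.1.2 Normal and default

There is usually no single standard for "normal" and "default"; the definition depends on the institution. Most scorecards, however, are built on a threshold of 60, 90, or 180 days past due. For example, the rule may be that a customer whose loan is more than 60 days past due is defined as a bad customer.

Once normal and default have been defined, the data must be labeled, usually with 1 for default and 0 for normal.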
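The labeling step can be sketched in base R as follows. This is only an illustration: the data frame, the column name `days_past_due`, and the 60-day threshold are assumptions made for the example, not part of any real dataset.

```r
# Hypothetical loan records; `days_past_due` is an assumed column name.
loans <- data.frame(
  customer_id   = c("C001", "C002", "C003", "C004"),
  days_past_due = c(0, 35, 61, 120)
)

# Assumed default definition: more than 60 days past due.
default_threshold <- 60

# 1 = default (bad customer), 0 = normal (good customer)
loans$default_flag <- as.integer(loans$days_past_due > default_threshold)

print(loans)
```

Changing `default_threshold` to 90 or 180 gives the other common definitions mentioned above.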

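1.1.3 Format of a standard scorecard

Suppose the scorecard uses three variables:

  1. Age: age
  2. TmAtAddress: years at the current address
  3. EmpStatus: employment status

A scorecard built from a credit scoring model assigns points to each range or category of these variables. Suppose an applicant has the following attributes:

  • Age = 37
  • TmAtAddress = 3.5
  • EmpStatus = 'Full-time'

His score is then 485 + 39 + 36 + 38 = 598, which is this applicant's credit score.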

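1.1.4 Advantages of credit scorecards

  1. Easy to understand
  2. The total score is the sum of the points for each variable, so it is easy to explain
  3. Simple and easy to implement
  4. Customers can see clearly how to improve their own score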

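1.2 Scorecard development process

The scorecard development process consists of the steps described in the following subsections (1.3 to 1.9); in fact, any data mining project follows a similar sequence of steps.

1.3 Data preparation

In practice the data may be scattered across many places, so all usable data must first be consolidated. This step is not easy: finding out which data is available, which is appropriate, and which is actually useful may take many attempts.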

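1.4 Exploratory analysis

Exploratory analysis is the process of checking and understanding the data. It usually covers:

  1. Descriptive statistics and value ranges of the features
  2. The distribution of the default rate across each feature (continuous variables must be binned first)
  3. Relationships between variables, assessed with chi-square tests and correlation measures

1.5 Feature selection

With hundreds or thousands of candidate features, the task is to keep the variables that have strong predictive power and are also easy to interpret. There are many feature selection methods. The most common approach for scorecards is to screen by information value (IV), then fit a logistic regression model and refine the selection with stepwise regression. Machine learning methods such as random forest and Boruta can also be used.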

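1.6 Model development

Fit a logistic regression model on the selected features.

1.7 Model validation

Model validation generally has to confirm four basic requirements:

  1. Good accuracy
  2. The model is robust
  3. The model is simple
  4. Good interpretability

1.8 Scorecard development

Once the logistic regression model has been fitted, its output must be converted into scorecard form. The method is explained later.

1.9 Deployment and monitoring

Once the scorecard is built, it has to be converted into deployable code, and score cutoffs must be set that map to the required business actions.

After deployment the scorecard must be monitored, because the environment in which it operates keeps changing. This means tracking how the scorecard actually performs, how the characteristics of the scored population shift, and so on.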

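2. Data acquisition and integration

2.1 Data sources for credit scorecards

Scorecard data generally falls into the following groups:

  1. Demographic characteristics: the customer's basic information, such as household income, gender, and age
  2. Credit bureau data, for example the People's Bank of China credit report
  3. Transaction data, which is extensive: shopping records, financial transactions, and so on
  4. Ownership and usage records for other products, since the customer may also hold products at other financial institutions

Ownership and status variables are usually binary (0, 1). Transactions provide two types of data: frequencies and aggregates. A frequency records how often a particular event occurs, for example how many times a customer used Taobao in a given period. An aggregate is a summary statistic computed from account balances or transaction values, for example the customer's average daily transaction amount.

Aggregates come in several kinds:

  1. Count: number of past loans, number of purchase records
  2. Sum: total amount spent
  3. Ratio: loan amount relative to annual income
  4. Time difference: time elapsed since the first account was opened
  5. Volatility: standard deviation of the tenure of each job held in the past three years

A customer sometimes has multiple records, so aggregates are needed to collapse them into a single record per customer.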
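As a rough sketch of this collapsing step, the base R snippet below turns transaction-level rows into one row per customer with a count, a sum, and a mean. The table `tx` and its column names are hypothetical and chosen only for illustration.

```r
# Hypothetical transaction-level data: several rows per customer.
tx <- data.frame(
  customer_id = c("A", "A", "A", "B", "B"),
  amount      = c(100, 250, 50, 400, 600)
)

# One aggregate per call; aggregate() returns the groups in sorted order,
# so the three results line up row by row.
n_tx    <- aggregate(amount ~ customer_id, data = tx, FUN = length)
sum_tx  <- aggregate(amount ~ customer_id, data = tx, FUN = sum)
mean_tx <- aggregate(amount ~ customer_id, data = tx, FUN = mean)

# Combine into a single record per customer.
summary_tx <- data.frame(
  customer_id = n_tx$customer_id,
  n_tx        = n_tx$amount,     # count of transactions
  total_amt   = sum_tx$amount,   # sum of amounts
  avg_amt     = mean_tx$amount   # mean amount
)

print(summary_tx)
```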

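2.2 Data integration

Data from different sources is usually integrated with one of two operations: merging and concatenation.

Merging combines data from different sources on a common key variable, such as the customer ID.

Concatenation stacks different records that share the same fields.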
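A minimal base R sketch of the two operations, using small made-up tables (all table and column names here are assumptions for the example): `merge()` matches rows on a key, while `rbind()` stacks rows that have the same columns.

```r
# Two hypothetical sources describing the same customers with different fields.
demo   <- data.frame(customer_id = c(1, 2, 3), age     = c(25, 40, 33))
bureau <- data.frame(customer_id = c(1, 2, 3), n_loans = c(0, 2, 1))

# Merging: match on the key variable, columns are added side by side.
merged <- merge(demo, bureau, by = "customer_id")

# Two hypothetical batches with the same fields but different customers.
batch_1 <- data.frame(customer_id = c(1, 2), age = c(25, 40))
batch_2 <- data.frame(customer_id = c(3, 4), age = c(33, 51))

# Concatenation: rows are stacked on top of each other.
stacked <- rbind(batch_1, batch_2)

print(merged)
print(stacked)
```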

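2.3 Integrity checks

After the data has been acquired and integrated, a series of integrity checks is needed, including:

  1. Row uniqueness: each ID may have only one record
  2. Ranges and values: every feature needs a clearly defined range of valid values
  3. Missing values
  4. Whether the sample is representative of the population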
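The first three checks can be sketched in base R as below. The data frame, the column names, and the valid age range of 18 to 100 are assumptions made purely for illustration.

```r
# Hypothetical integrated dataset with a duplicate ID, an impossible age,
# and some missing incomes.
dt <- data.frame(
  customer_id = c(1, 2, 2, 4),
  age         = c(34, 27, 27, 130),
  income      = c(5200, NA, NA, 8100)
)

# 1. Row uniqueness: IDs that appear more than once
dup_ids <- unique(dt$customer_id[duplicated(dt$customer_id)])

# 2. Range check: ages outside the assumed valid range of 18 to 100
bad_age_ids <- dt$customer_id[dt$age < 18 | dt$age > 100]

# 3. Missing values: share of NA per column
missing_rate <- colMeans(is.na(dt))

print(dup_ids)
print(bad_age_ids)
print(missing_rate)
```

Representativeness (the fourth check) depends on how the sample was drawn and cannot be verified from the sample alone.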

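3. Exploratory analysis

Exploratory analysis involves the following tasks:

  1. Descriptive statistics and distributions of the features
  2. The relationship between each feature and the target variable, that is, whether the feature has predictive value
  3. Handling of missing values and extreme values
  4. The distribution of good and bad samples within each feature

A credit scoring dataset shipped with the scorecard package is used as the example here.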

library(scorecard)
data("germancredit")
names(germancredit)
##  [1] "status.of.existing.checking.account"                     
##  [2] "duration.in.month"                                       
##  [3] "credit.history"                                          
##  [4] "purpose"                                                 
##  [5] "credit.amount"                                           
##  [6] "savings.account.and.bonds"                               
##  [7] "present.employment.since"                                
##  [8] "installment.rate.in.percentage.of.disposable.income"     
##  [9] "personal.status.and.sex"                                 
## [10] "other.debtors.or.guarantors"                             
## [11] "present.residence.since"                                 
## [12] "property"                                                
## [13] "age.in.years"                                            
## [14] "other.installment.plans"                                 
## [15] "housing"                                                 
## [16] "number.of.existing.credits.at.this.bank"                 
## [17] "job"                                                     
## [18] "number.of.people.being.liable.to.provide.maintenance.for"
## [19] "telephone"                                               
## [20] "foreign.worker"                                          
## [21] "creditability"

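3.1 Univariate statistics

The statistics generally fall into the following groups:

  1. Measures of location and spread, including the mean, mode, and standard deviation
  2. Quantiles
  3. Extreme values

In R, summary() is usually enough to obtain the univariate statistics of a dataset.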

summary(germancredit)
##                                      status.of.existing.checking.account
##  ... < 0 DM                                            :274             
##  0 <= ... < 200 DM                                     :269             
##  ... >= 200 DM / salary assignments for at least 1 year: 63             
##  no checking account                                   :394             
##                                                                         
##                                                                         
##  duration.in.month
##  Min.   : 4.0     
##  1st Qu.:12.0     
##  Median :18.0     
##  Mean   :20.9     
##  3rd Qu.:24.0     
##  Max.   :72.0     
##                                                      credit.history
##  no credits taken/ all credits paid back duly               : 40   
##  all credits at this bank paid back duly                    : 49   
##  existing credits paid back duly till now                   :530   
##  delay in paying off in the past                            : 88   
##  critical account/ other credits existing (not at this bank):293   
##                                                                    
##    purpose          credit.amount  
##  Length:1000        Min.   :  250  
##  Class :character   1st Qu.: 1366  
##  Mode  :character   Median : 2320  
##                     Mean   : 3271  
##                     3rd Qu.: 3972  
##                     Max.   :18424  
##                savings.account.and.bonds       present.employment.since
##  ... < 100 DM               :603         unemployed        : 62        
##  100 <= ... < 500 DM        :103         ... < 1 year      :172        
##  500 <= ... < 1000 DM       : 63         1 <= ... < 4 years:339        
##  ... >= 1000 DM             : 48         4 <= ... < 7 years:174        
##  unknown/ no savings account:183         ... >= 7 years    :253        
##                                                                        
##  installment.rate.in.percentage.of.disposable.income
##  Min.   :1.000                                      
##  1st Qu.:2.000                                      
##  Median :3.000                                      
##  Mean   :2.973                                      
##  3rd Qu.:4.000                                      
##  Max.   :4.000                                      
##                         personal.status.and.sex
##  male : divorced/separated          :  0       
##  female : divorced/separated/married:360       
##  male : single                      :548       
##  male : married/widowed             : 92       
##  female : single                    :  0       
##                                                
##  other.debtors.or.guarantors present.residence.since
##  none        :907            Min.   :1.000          
##  co-applicant: 41            1st Qu.:2.000          
##  guarantor   : 52            Median :3.000          
##                              Mean   :2.845          
##                              3rd Qu.:4.000          
##                              Max.   :4.000          
##                                                  property  
##  real estate                                         :282  
##  building society savings agreement/ life insurance  :232  
##  car or other, not in attribute Savings account/bonds:332  
##  unknown / no property                               :154  
##                                                            
##                                                            
##   age.in.years   other.installment.plans     housing   
##  Min.   :19.00   bank  :139              rent    :179  
##  1st Qu.:27.00   stores: 47              own     :713  
##  Median :33.00   none  :814              for free:108  
##  Mean   :35.55                                         
##  3rd Qu.:42.00                                         
##  Max.   :75.00                                         
##  number.of.existing.credits.at.this.bank
##  Min.   :1.000                          
##  1st Qu.:1.000                          
##  Median :1.000                          
##  Mean   :1.407                          
##  3rd Qu.:2.000                          
##  Max.   :4.000                          
##                                                             job     
##  unemployed/ unskilled - non-resident                         : 22  
##  unskilled - resident                                         :200  
##  skilled employee / official                                  :630  
##  management/ self-employed/ highly qualified employee/ officer:148  
##                                                                     
##                                                                     
##  number.of.people.being.liable.to.provide.maintenance.for
##  Min.   :1.000                                           
##  1st Qu.:1.000                                           
##  Median :1.000                                           
##  Mean   :1.155                                           
##  3rd Qu.:1.000                                           
##  Max.   :2.000                                           
##                                     telephone   foreign.worker
##  none                                    :596   yes:963       
##  yes, registered under the customers name:404   no : 37       
##                                                               
##                                                               
##                                                               
##                                                               
##  creditability
##  bad :300     
##  good:700     
##               
##               
##               
## 

3.2 Distribution of variables

A histogram of a continuous variable shows how the data is distributed, whether it is skewed, and whether there is any trend:

require(ggplot2)
## Loading required package: ggplot2
qplot(germancredit$credit.amount,binwidth = 300)+xlab("credit amount")

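The distribution of credit amount is right-skewed, with a long tail of large loans (the mean of 3271 exceeds the median of 2320). Because the data is already labeled, we can also check whether the loan amount is distributed differently for good and bad customers.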

require(ggplot2)
qplot(germancredit$credit.amount,fill = germancredit$creditability,binwidth = 300)+xlab("credit amount")

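The plot shows that the credit amount distributions of good and bad customers do not differ much.

3.3 Contingency table analysis

For continuous variables we look at the distribution; to get an overall picture of a categorical variable, we look at its contingency table.

For example, to examine housing status against good and bad customers: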

table(germancredit$housing,germancredit$creditability)
##           
##            bad good
##   rent      70  109
##   own      186  527
##   for free  44   64

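Bad customers make up about 0.39 of the rent group, 0.26 of the own group, and 0.41 of the for free group. A customer who owns their home is therefore more likely to be a good customer.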
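The bad rates quoted here can be reproduced from the counts in the contingency table. The sketch below re-enters those counts by hand so that it runs without the scorecard package; with the dataset loaded, `prop.table(table(germancredit$housing, germancredit$creditability), margin = 1)` should give the same proportions.

```r
# Counts copied from the table(housing, creditability) output above.
tab <- matrix(
  c(70, 186, 44,     # bad
    109, 527, 64),   # good
  ncol = 2,
  dimnames = list(housing = c("rent", "own", "for free"),
                  creditability = c("bad", "good"))
)

# Row-wise proportions: share of bad customers within each housing type.
bad_rate <- round(prop.table(tab, margin = 1)[, "bad"], 2)
print(bad_rate)
```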

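3.4 Identifying extreme values

Credit scoring model development rests on two implicit assumptions:

  1. Default status is a function of the predictor variables
  2. The feature data is generated by a single distribution

In practice these assumptions may not hold, so it is hard to say which observations are extreme. Extreme values are identified by how much they differ from the rest of the data.

There are four common ways to identify extreme values:

  1. Set a valid range, for example with the three-standard-deviation rule
  2. Fit a model and treat observations that deviate strongly from it as extreme, for example observations far from a fitted linear model
  3. Clustering: group the data into small subsets; a subset with very few observations can be treated as outliers
  4. Decision trees: use a tree to find nodes that contain very few observations, which may be outliers

As for treatment: if extreme values make up more than 10% of the data, the data may come from several distributions, and separate scorecards may be needed for different segments. If there are only a few extreme values, they can simply be deleted.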
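The three-standard-deviation rule (the first method in the list) can be sketched in a few lines of base R. The data here is simulated, with one extreme value injected on purpose; nothing in it comes from the germancredit dataset.

```r
set.seed(42)

# Simulated feature: 200 ordinary values plus one injected extreme value.
x <- c(rnorm(200, mean = 50, sd = 5), 120)

# Standardize and flag observations more than 3 standard deviations from the mean.
z <- (x - mean(x)) / sd(x)
is_extreme <- abs(z) > 3

# Share of flagged observations, to compare against the 10% guideline above.
extreme_share <- mean(is_extreme)

print(which(is_extreme))
print(extreme_share)
```

Note that the extreme value itself inflates the mean and standard deviation, so this rule works best when extreme values are rare.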

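4 Feature selection

Features should ideally not be correlated with each other: correlation between variables means redundant information, which can be handled with principal component analysis.

Relationships between continuous variables are usually checked with a correlation test, and relationships between categorical variables with a chi-square test.

We apply both to the data used above:

  1. Visualize and test the correlation between continuous features:
# Extract the continuous variables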
require(tidyverse)
library(corrplot) # load the package first
## corrplot 0.84 loaded
tmp1 <-  germancredit %>% select(duration.in.month,credit.amount,installment.rate.in.percentage.of.disposable.income,present.residence.since,age.in.years,number.of.existing.credits.at.this.bank,number.of.people.being.liable.to.provide.maintenance.for)

# Rename the columns to shorter labels:
names(tmp1) <- c('duration','credit amount','installment rate','present residence','age','number of credit','number of liable')
corrplot(cor(tmp1))

Run the correlation test:

library(PerformanceAnalytics)
chart.Correlation(tmp1, histogram=TRUE, pch=19)

  2. Run a chi-square test on the categorical features:
chisq.test(germancredit$status.of.existing.checking.account,germancredit$creditability)
## 
##  Pearson's Chi-squared test
## 
## data:  germancredit$status.of.existing.checking.account and germancredit$creditability
## X-squared = 123.72, df = 3, p-value < 2.2e-16

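The null hypothesis of independence is rejected, so we conclude that good and bad customers differ in their checking account status.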
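The same test can be run on any contingency table. As an illustrative sketch, the snippet below applies base R's `chisq.test()` to the housing counts from section 3.3, entered by hand so that the example runs without the dataset.

```r
# Housing by creditability counts from section 3.3 (rows: rent, own, for free).
tab <- matrix(
  c(70, 186, 44,     # bad
    109, 527, 64),   # good
  ncol = 2,
  dimnames = list(c("rent", "own", "for free"), c("bad", "good"))
)

# Pearson chi-square test of independence between housing and creditability.
res <- chisq.test(tab)

print(res$statistic)   # test statistic
print(res$parameter)   # degrees of freedom: (3 - 1) * (2 - 1) = 2
print(res$p.value)
```

A small p-value here points the same way as the bad rates in section 3.3: housing status and creditability are not independent.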

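These two tests are the main ways to check relationships between variables; other tests exist as well.

4.1 Feature selection with IV

Traditional credit scoring uses the information value (IV) for feature selection. IV measures the strength of association between two categorical variables, one of which is binary, so it can be used for feature selection in binary classification problems. It is defined as:

IV = sum((bad_i/bad_total - good_i/good_total) * ln((bad_i/bad_total) / (good_i/good_total)))

where bad_i and good_i are the numbers of bad and good samples in category i, and bad_total and good_total are the totals over all categories.

The iv function in the scorecard package computes the information value: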

info_value = iv(germancredit, y = "creditability")

info_value
##                                                     variable   info_value
##  1:                      status.of.existing.checking.account 6.660115e-01
##  2:                                        duration.in.month 3.345035e-01
##  3:                                           credit.history 2.932335e-01
##  4:                                             age.in.years 2.596514e-01
##  5:                                savings.account.and.bonds 1.960096e-01
##  6:                                                  purpose 1.691951e-01
##  7:                                                 property 1.126383e-01
##  8:                                 present.employment.since 8.643363e-02
##  9:                                                  housing 8.329343e-02
## 10:                                  other.installment.plans 5.761454e-02
## 11:                                           foreign.worker 4.387741e-02
## 12:                                  personal.status.and.sex 4.268938e-02
## 13:                                            credit.amount 3.895727e-02
## 14:                              other.debtors.or.guarantors 3.201932e-02
## 15:      installment.rate.in.percentage.of.disposable.income 2.632209e-02
## 16:                  number.of.existing.credits.at.this.bank 1.326652e-02
## 17:                                                      job 8.762766e-03
## 18:                                                telephone 6.377605e-03
## 19:                                  present.residence.since 3.588773e-03
## 20: number.of.people.being.liable.to.provide.maintenance.for 4.339223e-05

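As a rule of thumb, the larger the IV, the stronger the predictive power of the variable, so we can keep the variables with relatively large IV values: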

dt_f = var_filter(germancredit, y="creditability",iv_limit = 0.1)
## [INFO] filtering variables ...
names(dt_f)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"                  
## [3] "credit.history"                     
## [4] "purpose"                            
## [5] "savings.account.and.bonds"          
## [6] "property"                           
## [7] "age.in.years"                       
## [8] "creditability"

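This keeps the 7 features whose IV is greater than 0.1, plus the target column creditability.

4.2 Feature selection with random forest

Many machine learning models can be used for feature selection, random forest among them. The principle is simple: measure how much each feature contributes in every tree of the forest, average the contributions, and compare the averages across features.

The contribution is usually measured by the decrease in Gini impurity or by the out-of-bag (OOB) error rate.

Judging feature importance with a random forest model: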

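require(randomForest)
tmp <- germancredit
# apply() coerces the data frame to a character matrix, so every column,
# including the numeric ones, is recoded as integer factor-level codes
tmp <- apply(tmp,MARGIN = 2,function(x){as.numeric(as.factor(x))}) %>% data.frame()
tmp$creditability <- as.factor(tmp$creditability) # convert the target to a factor

ran <- randomForest(creditability~.,data = tmp) # fit the model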
ran
## 
## Call:
##  randomForest(formula = creditability ~ ., data = tmp) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 24.6%
## Confusion matrix:
##     1   2 class.error
## 1 118 182  0.60666667
## 2  64 636  0.09142857
importance(ran) # get the variable importance
##                                                          MeanDecreaseGini
## status.of.existing.checking.account                             45.170465
## duration.in.month                                               43.095049
## credit.history                                                  23.160291
## purpose                                                         24.732708
## credit.amount                                                   57.584559
## savings.account.and.bonds                                       19.098886
## present.employment.since                                        21.546569
## installment.rate.in.percentage.of.disposable.income             17.753737
## personal.status.and.sex                                         12.532627
## other.debtors.or.guarantors                                      7.540228
## present.residence.since                                         17.225932
## property                                                        18.519705
## age.in.years                                                    44.539842
## other.installment.plans                                         12.415455
## housing                                                         10.686786
## number.of.existing.credits.at.this.bank                          9.462637
## job                                                             13.516909
## number.of.people.being.liable.to.provide.maintenance.for         5.596389
## telephone                                                        8.058962
## foreign.worker                                                   1.717192
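varImpPlot(ran) # plot the variable importance

The random forest thus gives an importance for every variable, and the important ones can be kept for the next step of the analysis.

4.3 Feature selection with Boruta

Boruta is a feature selection method built on the random forest model. It works as follows:

  1. Shuffle the values of every feature in the data, which breaks their order and yields randomized copies called shadow features.
  2. Train a random forest classifier and score the importance of every feature. Check whether a feature scores higher in the original data than in its shuffled form, and record a hit if it does.
  3. Repeat the shuffling and scoring for a preset number of iterations n. Compute a p-value for each feature from its record over the n runs; a Bonferroni correction is usually applied to adjust for multiple testing. A feature with P < 0.01 is treated as confirmed important.
  4. The algorithm stops when every feature has been confirmed or rejected, or when the preset limit on random forest runs is reached.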
require(Boruta)
Bo <- Boruta(creditability~.,data = tmp,pValue = 0.01,doTrace=2,maxRuns = 20)
Bo
## Boruta performed 19 iterations in 16.67126 secs.
##  9 attributes confirmed important: age.in.years, credit.amount,
## credit.history, duration.in.month, other.installment.plans and 4
## more;
##  2 attributes confirmed unimportant:
## number.of.people.being.liable.to.provide.maintenance.for,
## present.residence.since;
##  9 tentative attributes left: foreign.worker, housing,
## installment.rate.in.percentage.of.disposable.income, job,
## number.of.existing.credits.at.this.bank and 4 more;
plot(Bo)

getSelectedAttributes(Bo)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"                  
## [3] "credit.history"                     
## [4] "credit.amount"                      
## [5] "savings.account.and.bonds"          
## [6] "present.employment.since"           
## [7] "property"                           
## [8] "age.in.years"                       
## [9] "other.installment.plans"

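5 Coarse classing and WOE transformation

Weight of evidence (WOE) is the transformation that lets a logistic regression model be converted into the standard scorecard format.

WOE is defined as:

WOE_i = ln((bad_i/bad_total) / (good_i/good_total))

The numerator is the share of all bad samples that fall into category i, and the denominator is the share of all good samples that fall into it. If the ratio inside the parentheses is less than 1, the category holds a smaller share of the bad samples than of the good samples and the WOE is negative; if the ratio is greater than 1, the WOE is positive.

Note that a continuous variable must be binned before its WOE can be computed. There are many binning methods, such as equal-width and equal-frequency binning; another option is to bin with a decision tree.

Apply the WOE transformation: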

bins = woebin(germancredit, y="creditability",method = 'tree')
## [INFO] creating woe binning ...
bins$age.in.years
##        variable       bin count count_distr good bad   badprob        woe
## 1: age.in.years [-Inf,26)   190       0.190  110  80 0.4210526  0.5288441
## 2: age.in.years   [26,28)   101       0.101   74  27 0.2673267 -0.1609304
## 3: age.in.years   [28,35)   257       0.257  172  85 0.3307393  0.1424546
## 4: age.in.years   [35,37)    79       0.079   67  12 0.1518987 -0.8724881
## 5: age.in.years [37, Inf)   373       0.373  277  96 0.2573727 -0.2123715
##         bin_iv  total_iv breaks is_special_values
## 1: 0.057921024 0.1304985     26             FALSE
## 2: 0.002528906 0.1304985     28             FALSE
## 3: 0.005359008 0.1304985     35             FALSE
## 4: 0.048610052 0.1304985     37             FALSE
## 5: 0.016079553 0.1304985    Inf             FALSE

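For the first bin of age.in.years the WOE is 0.5288441. Here is how it is computed:

  1. Share of all bad samples that fall into this bin: 80/(80+27+85+12+96) = 0.2666667
  2. Share of all good samples that fall into this bin: 110/(110+74+172+67+277) = 0.1571429
  3. Apply the formula: log(0.2666667/0.1571429) = 0.5288441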
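The same calculation can be repeated for every bin at once. The sketch below re-enters the good and bad counts from the bins$age.in.years output by hand and recomputes the woe, bin_iv, and total_iv columns in base R, using the bad-over-good convention shown in the steps above.

```r
# Counts copied from the bins$age.in.years output above.
bin  <- c("[-Inf,26)", "[26,28)", "[28,35)", "[35,37)", "[37, Inf)")
good <- c(110, 74, 172, 67, 277)
bad  <- c(80, 27, 85, 12, 96)

# Share of all bad (good) samples that fall into each bin.
bad_dist  <- bad / sum(bad)
good_dist <- good / sum(good)

# WOE per bin, IV contribution per bin, and total IV of the variable.
woe      <- log(bad_dist / good_dist)
bin_iv   <- (bad_dist - good_dist) * woe
total_iv <- sum(bin_iv)

print(data.frame(bin, woe = round(woe, 7), bin_iv = round(bin_iv, 7)))
print(total_iv)
```

The recomputed values agree with the woe, bin_iv, and total_iv columns reported by woebin.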

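Advantages of the WOE transformation

  1. It can improve the predictive performance of the model
  2. It offers good business interpretability

6 Model evaluation

With feature selection and the WOE transformation done, the model can be fitted. Once it is fitted it has to be evaluated. A logistic regression model is generally expected to meet three criteria:

  1. Accuracy
  2. Robustness
  3. Meaningfulness

6.1 Confusion matrix

  • TN: True Negative, a negative sample classified correctly
  • TP: True Positive, a positive sample classified correctly
  • FN: False Negative, a positive sample wrongly classified as negative
  • FP: False Positive, a negative sample wrongly classified as positive

The confusion matrix shows clearly how the samples were classified.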
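As a small illustration, a confusion matrix can be built in base R with `table()`. The label and prediction vectors below are made up for the example, with 1 standing for the positive class (default).

```r
# Hypothetical true labels and predicted labels (1 = positive, 0 = negative).
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# Rows are the actual classes, columns the predicted classes.
cm <- table(actual = actual, predicted = predicted)

TP <- cm["1", "1"]   # positives classified correctly
TN <- cm["0", "0"]   # negatives classified correctly
FN <- cm["1", "0"]   # positives wrongly classified as negative
FP <- cm["0", "1"]   # negatives wrongly classified as positive

accuracy <- (TP + TN) / sum(cm)

print(cm)
print(accuracy)
```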

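6.2 KS curve

To build the KS curve, sort the population by predicted default probability in descending order, split it into 10 equal parts, and compute the cumulative percentage of defaulted and normal customers up to each part. The KS statistic is the largest gap between the two cumulative curves.

As a rule of thumb, a KS of 0.2 means the model is usable, and a KS above 0.3 indicates a fairly good model.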
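A minimal base R sketch of the KS statistic follows. For simplicity it evaluates the gap at every observation instead of at the 10 decile boundaries, and it does not treat tied probabilities specially; the function name `ks_stat` and the toy data are assumptions for the example.

```r
# KS statistic: largest gap between the cumulative bad and good percentages
# when the population is sorted from highest to lowest predicted probability.
ks_stat <- function(prob, label) {
  ord      <- order(prob, decreasing = TRUE)
  label    <- label[ord]
  cum_bad  <- cumsum(label == 1) / sum(label == 1)
  cum_good <- cumsum(label == 0) / sum(label == 0)
  max(abs(cum_bad - cum_good))
}

# Toy data: predicted default probabilities and true labels (1 = default).
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   0,   1,   0)

ks <- ks_stat(prob, label)
print(ks)
```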

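6.3 ROC curve

The ROC curve is drawn in a similar way to the KS curve, but the axes have a different meaning: the vertical axis is sensitivity and the horizontal axis is 1 - specificity.

Sensitivity is the proportion of actual positives that are correctly predicted as positive.

Specificity is the proportion of actual negatives that are correctly predicted as negative.

The area under the ROC curve is called the AUC statistic. The larger it is, the better the model; as a rule of thumb, an AUC above 0.75 indicates a reliable model.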
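The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one. The base R sketch below uses this rank-based form; the function name `auc_stat` and the toy data are assumptions for the example.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
auc_stat <- function(prob, label) {
  r     <- rank(prob)            # average ranks, so ties count as one half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: predicted default probabilities and true labels (1 = default).
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   0,   1,   0)

auc <- auc_stat(prob, label)
print(auc)
```

In this toy data 12 of the 16 positive-negative pairs are ordered correctly, which is where the value 0.75 comes from.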

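6.4 PSI stability test

The population stability index (PSI) is defined as:

psi = sum((actual% - expected%) * ln(actual% / expected%))

As an example, suppose a logistic regression model has been trained; at prediction time it outputs a class probability p.

Call the output on the test dataset p1. Sort it in ascending order and split the dataset into 10 equal parts (each part holds the same number of samples, so this is equal-frequency binning), then record the minimum and maximum predicted probability of each part.

Now use the model to score new samples and call the result p2. Using the lower and upper bounds of the 10 parts obtained on the test dataset, assign the new samples to 10 groups by p2; these groups will generally not be equal in size. The actual share is the proportion of new samples whose p2 falls within each interval defined from p1, and the expected share is the proportion of test-set samples in each part.

The idea is that if the model is stable, the probabilities predicted on new data should follow the same distribution as at modeling time, so the share of samples in each interval defined on the modeling data should match the share at modeling time. If it does not, something has changed, usually the structure of the predictor variables.

PSI is typically used for monitoring model performance. A PSI below 0.1 is generally taken to mean high stability; 0.1 to 0.2 is moderate and warrants further investigation; above 0.2 means poor stability and the model should be fixed.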
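The procedure described above can be sketched in base R as follows. The function name `psi_stat`, the small floor `eps` used to avoid taking the log of zero, and the simulated probabilities are all assumptions for this example; this is not the implementation used by the scorecard package.

```r
psi_stat <- function(expected, actual, n_bins = 10) {
  # Equal-frequency bin edges taken from the baseline sample.
  breaks <- unname(quantile(expected, probs = seq(0, 1, length.out = n_bins + 1)))
  breaks[1] <- -Inf                 # open the outer bins so new data always fits
  breaks[length(breaks)] <- Inf
  breaks <- unique(breaks)

  # Share of samples per bin in the baseline and in the new data.
  e_pct <- as.numeric(table(cut(expected, breaks))) / length(expected)
  a_pct <- as.numeric(table(cut(actual, breaks))) / length(actual)

  # Assumed guard: floor empty bins to avoid log(0) and division by zero.
  eps   <- 1e-6
  e_pct <- pmax(e_pct, eps)
  a_pct <- pmax(a_pct, eps)

  sum((a_pct - e_pct) * log(a_pct / e_pct))
}

set.seed(1)
p1         <- runif(1000)          # baseline predicted probabilities (simulated)
p2_same    <- runif(1000)          # new data from the same distribution
p2_shifted <- rbeta(1000, 2, 1)    # new data skewed toward high probabilities

psi_same    <- psi_stat(p1, p2_same)
psi_shifted <- psi_stat(p1, p2_shifted)
print(c(same = psi_same, shifted = psi_shifted))
```

The unshifted sample should land in the stable range and the skewed sample well above the 0.2 threshold.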

6.5 Implementation

The perf_eva function in the scorecard package makes model evaluation convenient and can report many more metrics:

data("germancredit")


# var_filter selects features against the given criteria. By default it keeps
# features with IV above 0.02, missing rate below 0.95,
# and identical value rate below 0.95

dt_f = var_filter(germancredit, y="creditability")
## [INFO] filtering variables ...
# Split the dataset into train and test

dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)

# Extract the labels of each subset
label_list = lapply(dt_list, function(x) x$creditability)

# WOE binning; the default method is tree

bins = woebin(dt_f, y="creditability") # returns the bins together with the details of the WOE transformation
## [INFO] creating woe binning ...
# The breaks can also be specified manually

breaks_adj = list(
  age.in.years=c(26, 35, 40),
  other.debtors.or.guarantors=c("none", "co-applicant%,%guarantor"))

bins_adj = woebin(dt_f, y="creditability", breaks_list=breaks_adj)
## [INFO] creating woe binning ...
## Warning in check_breaks_list(breaks_list, xs): There are 12 x variables
## that donot specified in breaks_list are using optimal binning.
# Convert the data into WOE values
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins_adj))
## [INFO] converting into woe values ... 
## [INFO] converting into woe values ...
# Fit the logistic regression model

m1 = glm( creditability ~ ., family = binomial(), data = dt_woe_list$train)

# Stepwise regression
m_step = step(m1, direction="both", trace = FALSE)

m2 = eval(m_step$call)

# Predict on each subset
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))

# Model evaluation results
perf = perf_eva(pred = pred_list$train,label = label_list$train,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))
## [INFO] The threshold of confusion matrix is 0.3262.

perf
## $binomial_metric
## $binomial_metric$dat
##          MSE      RMSE   LogLoss        R2        KS       AUC      Gini
## 1: 0.1482852 0.3850781 0.4536391 0.2802927 0.5598485 0.8248359 0.6496717
## 
## 
## $confusion_matrix
## $confusion_matrix$dat
##    label pred_0 pred_1     error
## 1:     0    350     90 0.2045455
## 2:     1     43    137 0.2388889
## 3: total    393    227 0.2145161
## 
## 
## $pic
## TableGrob (3 x 3) "arrange": 8 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (2-2,1-1) arrange gtable[layout]
## 5 5 (2-2,2-2) arrange gtable[layout]
## 6 6 (2-2,3-3) arrange gtable[layout]
## 7 7 (3-3,1-1) arrange gtable[layout]
## 8 8 (3-3,2-2) arrange gtable[layout]
card = scorecard(bins_adj, m2)
score_list = lapply(dt_list, function(x) scorecard_ply(x, card))
perf_psi(score = score_list, label = label_list)
## $pic
## $pic$score

## 
## 
## $psi
##    variable    dataset       psi
## 1:    score train_test 0.0109612

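The details are covered in the sections that follow.

7 Scorecard development

Next we explain how to build a scorecard once the model has been fitted and the customers' default probabilities are available.

Let p be the estimated probability of default, so the estimated probability of being normal is 1-p. Then:

odds = p / (1 - p)

or equivalently

p = odds / (1 + odds)

The scorecard scale is expressed as:

Score = A - B*log(odds)

where A and B are constants, and the minus sign ensures that a lower default probability gives a higher score.

The logistic regression model gives the odds as:

log(odds) = β0 + β1*x1 + ... + βn*xn

In other words, the logistic regression model gives us p or odds (in fact any model that outputs p would do), so once A and B are fixed we can compute the score. Next we show how to obtain A and B.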

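7.1 Computing the constants A and B

To compute A and B, two known scores are substituted into the formula. This usually requires two assumptions:

  1. The expected score at a specific odds value
  2. The number of points that corresponds to a doubling of the odds (pdo)

Suppose the score is p1 when the odds equal odds. Because the odds here are the odds of default, doubling them to 2*odds lowers the score by pdo, to p1-pdo.

Substituting into the formula:

p1 = A - B*log(odds)
p1-pdo = A - B*log(2*odds)

Solving the two equations gives:

B = pdo/log(2)
A = p1 + B*log(odds)

Every customer's default probability can now be converted into a score, once the two parameters above (the reference score at the chosen odds, and pdo) have been specified.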
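A short numeric sketch of this calculation in base R. The reference values (600 points at default odds of 1:50, and 20 points to double the odds) are assumptions picked for the example. With Score = A - B*log(odds) and odds defined as p/(1-p), doubling the default odds lowers the score by pdo.

```r
# Assumed scaling parameters, chosen only for illustration.
points0 <- 600      # score at the reference odds
odds0   <- 1 / 50   # reference odds of default (bad : good)
pdo     <- 20       # points that correspond to a doubling of the odds

# Constants of the scale Score = A - B * log(odds)
B <- pdo / log(2)
A <- points0 + B * log(odds0)

# Convert a default probability into a score.
score <- function(p) A - B * log(p / (1 - p))

# Check the two anchor points: reference odds and doubled odds.
score_at_ref    <- score(odds0 / (1 + odds0))
score_at_double <- score(2 * odds0 / (1 + 2 * odds0))

print(c(A = A, B = B))
print(c(ref = score_at_ref, doubled = score_at_double))
```

The two printed scores differ by exactly pdo, which is a quick way to confirm that A and B were derived consistently.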

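7.2 Points allocation

We also need to know how each variable contributes to the total score. The points for each value of each variable are computed as follows.

Suppose variable x has k possible values (bins). The points for its i-th value are:

- B*(model coefficient of x)*(WOE of the i-th value of x)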
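The sketch below applies this formula in base R to a made-up model with two WOE variables. All coefficients and WOE values are hypothetical. Substituting the regression equation into Score = A - B*log(odds) shows that the intercept contributes a constant A - B*β0, called base_points here; that name is an assumption of this example.

```r
# Assumed scaling: 600 points at default odds of 1:50, 20 points to double the odds.
pdo <- 20
B   <- pdo / log(2)
A   <- 600 + B * log(1 / 50)

# Hypothetical fitted model: intercept and coefficients of the WOE variables.
beta0 <- -1.2
beta  <- c(age = 0.8, housing = 0.6)

# Hypothetical WOE value of every bin of every variable.
woe <- list(
  age     = c("[-Inf,26)" = 0.53, "[26,35)" = 0.10, "[35, Inf)" = -0.30),
  housing = c(rent = 0.40, own = -0.19)
)

# Points per bin: -B * coefficient * WOE
bin_points <- lapply(names(beta), function(v) -B * beta[[v]] * woe[[v]])
names(bin_points) <- names(beta)

# The intercept is absorbed into a constant base score.
base_points <- A - B * beta0

# Applicant aged 35 or more who owns a home: add up the scorecard points.
total_points <- base_points +
  bin_points$age[["[35, Inf)"]] + bin_points$housing[["own"]]

# The same score computed directly from the model's log-odds.
log_odds     <- beta0 + beta[["age"]] * woe$age[["[35, Inf)"]] +
  beta[["housing"]] * woe$housing[["own"]]
direct_score <- A - B * log_odds

print(bin_points)
print(c(total_points = total_points, direct_score = direct_score))
```

Both routes give the same number, which is why the scorecard can be applied as a simple lookup table.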

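8 Model monitoring

Once the model is live it has to be monitored to check whether it is still running well and whether it needs to be updated. This chapter introduces some model monitoring reports.

8.1 Stability report

The purpose of this report is to produce an index that captures how the distribution of scores in the population changes over time. The PSI stability index described above can be used here.

8.2 Scorecard characteristic analysis

The stability report monitors whether the overall score distribution has changed, whereas characteristic analysis monitors changes in the distribution of the individual independent variables. Age is used as the example below:

The report measures the difference between the actual (A%) and expected (E%) distributions across the attributes of the variable. The index is computed as: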

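9 The scorecard credit scoring package

The scorecard package provides a complete solution for developing credit scoring models in R. This section walks through it in detail.

First, install and load the package:

install.packages("scorecard")
require(scorecard)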

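9.1 split_df: splitting the dataset

This function splits a dataset. Its usage is:

split_df(dt, y = NULL, ratio = 0.7, seed = 618)
  • y is the name of the label column, that is, the dependent variable
  • ratio is the proportion of rows assigned to the training set; the remainder forms the test set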
data(germancredit)

# Example I
dt_list = split_df(germancredit, y="creditability")
train = dt_list[[1]]
test = dt_list[[2]]


dim(germancredit)
## [1] 1000   21
dim(train)
## [1] 681  21
dim(test)
## [1] 319  21

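9.2 iv: computing the information value

This function computes the IV of the features as a reference for feature selection. Its usage is:

iv(dt, y, x = NULL, positive = "bad|1", order = TRUE)
  • x gives the names of the independent variables; the default NULL computes the IV of all of them
  • order specifies whether to sort the result by IV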
data(germancredit)

# information values
info_value = iv(germancredit, y = "creditability")
info_value
##                                                     variable   info_value
##  1:                      status.of.existing.checking.account 6.660115e-01
##  2:                                        duration.in.month 3.345035e-01
##  3:                                           credit.history 2.932335e-01
##  4:                                             age.in.years 2.596514e-01
##  5:                                savings.account.and.bonds 1.960096e-01
##  6:                                                  purpose 1.691951e-01
##  7:                                                 property 1.126383e-01
##  8:                                 present.employment.since 8.643363e-02
##  9:                                                  housing 8.329343e-02
## 10:                                  other.installment.plans 5.761454e-02
## 11:                                           foreign.worker 4.387741e-02
## 12:                                  personal.status.and.sex 4.268938e-02
## 13:                                            credit.amount 3.895727e-02
## 14:                              other.debtors.or.guarantors 3.201932e-02
## 15:      installment.rate.in.percentage.of.disposable.income 2.632209e-02
## 16:                  number.of.existing.credits.at.this.bank 1.326652e-02
## 17:                                                      job 8.762766e-03
## 18:                                                telephone 6.377605e-03
## 19:                                  present.residence.since 3.588773e-03
## 20: number.of.people.being.liable.to.provide.maintenance.for 4.339223e-05

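9.3 var_filter: filtering variables

This function filters features against set criteria such as information value, missing rate, and identical value rate. Its usage is: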

var_filter(dt, y, x = NULL, iv_limit = 0.02, missing_limit = 0.95,
  identical_limit = 0.95, var_rm = NULL, var_kp = NULL,
  return_rm_reason = FALSE, positive = "bad|1")
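  • iv_limit: a feature is kept only if its information value reaches this limit; the default is 0.02
  • missing_limit: features whose missing rate does not exceed this limit are kept; the default is 0.95
  • identical_limit: features whose identical value rate (the share of rows that take the same value) does not exceed this limit are kept; the default is 0.95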
data(germancredit)

# variable filter
dt_sel = var_filter(germancredit, y = "creditability")
## [INFO] filtering variables ...
names(dt_sel)
##  [1] "status.of.existing.checking.account"                
##  [2] "duration.in.month"                                  
##  [3] "credit.history"                                     
##  [4] "purpose"                                            
##  [5] "credit.amount"                                      
##  [6] "savings.account.and.bonds"                          
##  [7] "present.employment.since"                           
##  [8] "installment.rate.in.percentage.of.disposable.income"
##  [9] "personal.status.and.sex"                            
## [10] "other.debtors.or.guarantors"                        
## [11] "property"                                           
## [12] "age.in.years"                                       
## [13] "other.installment.plans"                            
## [14] "housing"                                            
## [15] "creditability"

9.4 woebin: WOE binning

This function generates the WOE bins for the variables. Its usage is:

woebin(dt, y, x = NULL, var_skip = NULL, breaks_list = NULL,
  special_values = NULL, stop_limit = 0.1, count_distr_limit = 0.05,
  bin_num_limit = 8, positive = "bad|1", no_cores = NULL,
  print_step = 0L, method = "tree", save_breaks_list = NULL,
  ignore_const_cols = TRUE, ignore_datetime_cols = TRUE,
  check_cate_num = TRUE, replace_blank_na = TRUE, ...)
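  • method is the binning method; the default is tree
  • breaks_list lets you specify your own binning rules
  • stop_limit: with the tree method, binning stops when the information value gain ratio falls below stop_limit; with the chimerge method, merging stops when the minimum chi-square exceeds qchisq(1-stop_limit, 1). The accepted range is 0 to 0.5 and the default is 0.1.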
bins2_tree = woebin(germancredit, y="creditability", method="tree")
## [INFO] creating woe binning ...
bins2_tree$status.of.existing.checking.account
##                               variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
##                                                       bin count
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM   543
## 2: ... >= 200 DM / salary assignments for at least 1 year    63
## 3:                                    no checking account   394
##    count_distr good bad   badprob        woe      bin_iv total_iv
## 1:       0.543  303 240 0.4419890  0.6142040 0.225500603 0.639372
## 2:       0.063   49  14 0.2222222 -0.4054651 0.009460853 0.639372
## 3:       0.394  348  46 0.1167513 -1.1762632 0.404410499 0.639372
##                                                    breaks
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3:                                    no checking account
##    is_special_values
## 1:             FALSE
## 2:             FALSE
## 3:             FALSE

9.5 woebin_ply: converting raw data to WOE values

Once the WOE binning rules are fixed, woebin_ply converts the raw data into WOE values. Its usage is:

woebin_ply(dt, bins, no_cores = NULL, print_step = 0L,
  replace_blank_na = TRUE, ...)
  • dt is the raw data
  • bins is the object returned by woebin
dt_woe = woebin_ply(germancredit, bins=bins2_tree)
## [INFO] converting into woe values ...
head(dt_woe)
##    creditability status.of.existing.checking.account_woe
## 1:          good                                0.614204
## 2:           bad                                0.614204
## 3:          good                               -1.176263
## 4:          good                                0.614204
## 5:           bad                                0.614204
## 6:          good                               -1.176263
##    duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe
## 1:            -1.3121864        -0.73374058  -0.4100628        0.03366128
## 2:             1.1349799         0.08831862  -0.4100628        0.39053946
## 3:            -0.3466246        -0.73374058   0.2799201       -0.25830746
## 4:             0.5245245         0.08831862   0.2799201        0.39053946
## 5:             0.1086883         0.08515781   0.2799201        0.39053946
## 6:             0.5245245         0.08831862   0.2799201        0.39053946
##    savings.account.and.bonds_woe present.employment.since_woe
## 1:                    -0.7621401                  -0.23556607
## 2:                     0.2713578                   0.03210325
## 3:                     0.2713578                  -0.39441527
## 4:                     0.2713578                  -0.39441527
## 5:                     0.2713578                   0.03210325
## 6:                    -0.7621401                   0.03210325
##    installment.rate.in.percentage.of.disposable.income_woe
## 1:                                               0.1039609
## 2:                                              -0.1554665
## 3:                                              -0.1554665
## 4:                                              -0.1554665
## 5:                                               0.1039609
## 6:                                              -0.1554665
##    personal.status.and.sex_woe other.debtors.or.guarantors_woe
## 1:                  -0.1655476                      0.02797385
## 2:                   0.2646926                      0.02797385
## 3:                  -0.1655476                      0.02797385
## 4:                  -0.1655476                     -0.58778666
## 5:                  -0.1655476                      0.02797385
## 6:                  -0.1655476                      0.02797385
##    present.residence.since_woe property_woe age.in.years_woe
## 1:                 -0.01359409  -0.46103496       -0.2123715
## 2:                  0.07015071  -0.46103496        0.5288441
## 3:                 -0.01359409  -0.46103496       -0.2123715
## 4:                 -0.01359409   0.02857337       -0.2123715
## 5:                 -0.01359409   0.58608236       -0.2123715
## 6:                 -0.01359409   0.58608236       -0.8724881
##    other.installment.plans_woe housing_woe
## 1:                  -0.1211786  -0.1941560
## 2:                  -0.1211786  -0.1941560
## 3:                  -0.1211786  -0.1941560
## 4:                  -0.1211786   0.4726044
## 5:                  -0.1211786   0.4726044
## 6:                  -0.1211786   0.4726044
##    number.of.existing.credits.at.this.bank_woe     job_woe
## 1:                                  -0.1347806 -0.02278003
## 2:                                   0.0748775 -0.02278003
## 3:                                   0.0748775 -0.07847162
## 4:                                   0.0748775 -0.02278003
## 5:                                  -0.1347806 -0.02278003
## 6:                                   0.0748775 -0.07847162
##    number.of.people.being.liable.to.provide.maintenance.for_woe
## 1:                                                            0
## 2:                                                            0
## 3:                                                            0
## 4:                                                            0
## 5:                                                            0
## 6:                                                            0
##    telephone_woe foreign.worker_woe
## 1:   -0.09863759                  0
## 2:    0.06469132                  0
## 3:    0.06469132                  0
## 4:    0.06469132                  0
## 5:    0.06469132                  0
## 6:   -0.09863759                  0
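
The replacement that woebin_ply performs can be sketched in base R for a single numeric variable. This is only an illustration of the idea, not the package's implementation: the helper to_woe is hypothetical, and the bin edges and WOE values are copied from the woebin output for duration.in.month shown in this document.

```r
# Bin edges and WOE values copied from the woebin table for duration.in.month
breaks <- c(-Inf, 8, 16, 34, 44, Inf)
woe    <- c(-1.3121864, -0.3466246, 0.1086883, 0.5245245, 1.1349799)

# Hypothetical helper (not part of the scorecard package):
# find the left-closed bin [a, b) of each value and return that bin's WOE
to_woe <- function(x, breaks, woe) {
  idx <- cut(x, breaks = breaks, right = FALSE, labels = FALSE)
  woe[idx]
}

# duration.in.month of the first six rows of germancredit
duration <- c(6, 48, 12, 42, 24, 36)
duration_woe <- to_woe(duration, breaks, woe)
print(duration_woe)
```

The result can be compared against the duration.in.month_woe column printed by head(dt_woe) above.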

9.6 scorecard: building the scorecard

Use scorecard to build the scorecard rules from a fitted model and the woebin result. Usage:

scorecard(bins, model, points0 = 600, odds0 = 1/19, pdo = 50,
  basepoints_eq0 = FALSE)
  • bins: the result returned by woebin
  • model: the fitted logistic regression model
  • points0, odds0: see Chapter 7
  • pdo: see Chapter 7
# Recode the label so that 1 = bad (default) and 0 = good
dt_woe$creditability <- as.factor(ifelse(dt_woe$creditability == "bad", 1, 0))
# Fit a logistic regression on the WOE-transformed features
l <- glm(creditability ~ ., data = dt_woe, family = binomial())
# Stepwise selection by AIC
l <- step(l)
## Start:  AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe + 
##     credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe + 
##     present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe + 
##     personal.status.and.sex_woe + other.debtors.or.guarantors_woe + 
##     present.residence.since_woe + property_woe + age.in.years_woe + 
##     other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe + 
##     job_woe + number.of.people.being.liable.to.provide.maintenance.for_woe + 
##     telephone_woe + foreign.worker_woe
## 
## 
## Step:  AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe + 
##     credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe + 
##     present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe + 
##     personal.status.and.sex_woe + other.debtors.or.guarantors_woe + 
##     present.residence.since_woe + property_woe + age.in.years_woe + 
##     other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe + 
##     job_woe + number.of.people.being.liable.to.provide.maintenance.for_woe + 
##     telephone_woe
## 
## 
## Step:  AIC=939.13
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe + 
##     credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe + 
##     present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe + 
##     personal.status.and.sex_woe + other.debtors.or.guarantors_woe + 
##     present.residence.since_woe + property_woe + age.in.years_woe + 
##     other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe + 
##     job_woe + telephone_woe
## 
##                                                           Df Deviance
## - job_woe                                                  1   901.21
## - number.of.existing.credits.at.this.bank_woe              1   901.86
## <none>                                                         901.13
## - property_woe                                             1   903.33
## - housing_woe                                              1   903.37
## - telephone_woe                                            1   903.84
## - present.employment.since_woe                             1   905.78
## - other.installment.plans_woe                              1   906.97
## - personal.status.and.sex_woe                              1   907.44
## - other.debtors.or.guarantors_woe                          1   907.52
## - present.residence.since_woe                              1   907.56
## - age.in.years_woe                                         1   908.87
## - installment.rate.in.percentage.of.disposable.income_woe  1   916.85
## - duration.in.month_woe                                    1   917.07
## - savings.account.and.bonds_woe                            1   919.14
## - credit.amount_woe                                        1   920.39
## - purpose_woe                                              1   921.29
## - credit.history_woe                                       1   921.81
## - status.of.existing.checking.account_woe                  1   960.39
##                                                              AIC
## - job_woe                                                 937.21
## - number.of.existing.credits.at.this.bank_woe             937.86
## <none>                                                    939.13
## - property_woe                                            939.33
## - housing_woe                                             939.37
## - telephone_woe                                           939.84
## - present.employment.since_woe                            941.78
## - other.installment.plans_woe                             942.97
## - personal.status.and.sex_woe                             943.44
## - other.debtors.or.guarantors_woe                         943.52
## - present.residence.since_woe                             943.56
## - age.in.years_woe                                        944.87
## - installment.rate.in.percentage.of.disposable.income_woe 952.85
## - duration.in.month_woe                                   953.07
## - savings.account.and.bonds_woe                           955.14
## - credit.amount_woe                                       956.39
## - purpose_woe                                             957.29
## - credit.history_woe                                      957.81
## - status.of.existing.checking.account_woe                 996.39
## 
## Step:  AIC=937.21
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe + 
##     credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe + 
##     present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe + 
##     personal.status.and.sex_woe + other.debtors.or.guarantors_woe + 
##     present.residence.since_woe + property_woe + age.in.years_woe + 
##     other.installment.plans_woe + housing_woe + number.of.existing.credits.at.this.bank_woe + 
##     telephone_woe
## 
##                                                           Df Deviance
## - number.of.existing.credits.at.this.bank_woe              1   901.95
## <none>                                                         901.21
## - property_woe                                             1   903.33
## - housing_woe                                              1   903.43
## - telephone_woe                                            1   904.73
## - present.employment.since_woe                             1   905.81
## - other.installment.plans_woe                              1   907.01
## - personal.status.and.sex_woe                              1   907.49
## - other.debtors.or.guarantors_woe                          1   907.54
## - present.residence.since_woe                              1   907.58
## - age.in.years_woe                                         1   909.11
## - installment.rate.in.percentage.of.disposable.income_woe  1   916.85
## - duration.in.month_woe                                    1   917.16
## - savings.account.and.bonds_woe                            1   919.14
## - credit.amount_woe                                        1   920.47
## - purpose_woe                                              1   921.67
## - credit.history_woe                                       1   921.94
## - status.of.existing.checking.account_woe                  1   960.43
##                                                              AIC
## - number.of.existing.credits.at.this.bank_woe             935.95
## <none>                                                    937.21
## - property_woe                                            937.33
## - housing_woe                                             937.43
## - telephone_woe                                           938.73
## - present.employment.since_woe                            939.81
## - other.installment.plans_woe                             941.01
## - personal.status.and.sex_woe                             941.49
## - other.debtors.or.guarantors_woe                         941.54
## - present.residence.since_woe                             941.58
## - age.in.years_woe                                        943.11
## - installment.rate.in.percentage.of.disposable.income_woe 950.85
## - duration.in.month_woe                                   951.16
## - savings.account.and.bonds_woe                           953.14
## - credit.amount_woe                                       954.47
## - purpose_woe                                             955.67
## - credit.history_woe                                      955.94
## - status.of.existing.checking.account_woe                 994.43
## 
## Step:  AIC=935.95
## creditability ~ status.of.existing.checking.account_woe + duration.in.month_woe + 
##     credit.history_woe + purpose_woe + credit.amount_woe + savings.account.and.bonds_woe + 
##     present.employment.since_woe + installment.rate.in.percentage.of.disposable.income_woe + 
##     personal.status.and.sex_woe + other.debtors.or.guarantors_woe + 
##     present.residence.since_woe + property_woe + age.in.years_woe + 
##     other.installment.plans_woe + housing_woe + telephone_woe
## 
##                                                           Df Deviance
## <none>                                                         901.95
## - property_woe                                             1   904.03
## - housing_woe                                              1   904.14
## - telephone_woe                                            1   905.43
## - present.employment.since_woe                             1   906.30
## - personal.status.and.sex_woe                              1   908.09
## - other.installment.plans_woe                              1   908.26
## - other.debtors.or.guarantors_woe                          1   908.33
## - present.residence.since_woe                              1   908.51
## - age.in.years_woe                                         1   909.56
## - installment.rate.in.percentage.of.disposable.income_woe  1   917.50
## - duration.in.month_woe                                    1   918.09
## - savings.account.and.bonds_woe                            1   920.43
## - credit.amount_woe                                        1   921.73
## - credit.history_woe                                       1   922.18
## - purpose_woe                                              1   923.04
## - status.of.existing.checking.account_woe                  1   960.57
##                                                              AIC
## <none>                                                    935.95
## - property_woe                                            936.03
## - housing_woe                                             936.14
## - telephone_woe                                           937.43
## - present.employment.since_woe                            938.30
## - personal.status.and.sex_woe                             940.09
## - other.installment.plans_woe                             940.26
## - other.debtors.or.guarantors_woe                         940.33
## - present.residence.since_woe                             940.51
## - age.in.years_woe                                        941.56
## - installment.rate.in.percentage.of.disposable.income_woe 949.50
## - duration.in.month_woe                                   950.09
## - savings.account.and.bonds_woe                           952.43
## - credit.amount_woe                                       953.73
## - credit.history_woe                                      954.18
## - purpose_woe                                             955.04
## - status.of.existing.checking.account_woe                 992.57
score <- scorecard(bins = bins2_tree,model = l)

score$status.of.existing.checking.account
##                               variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
##                                                       bin count
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM   543
## 2: ... >= 200 DM / salary assignments for at least 1 year    63
## 3:                                    no checking account   394
##    count_distr good bad   badprob        woe      bin_iv total_iv
## 1:       0.543  303 240 0.4419890  0.6142040 0.225500603 0.639372
## 2:       0.063   49  14 0.2222222 -0.4054651 0.009460853 0.639372
## 3:       0.394  348  46 0.1167513 -1.1762632 0.404410499 0.639372
##                                                    breaks
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3:                                    no checking account
##    is_special_values points
## 1:             FALSE    -36
## 2:             FALSE     24
## 3:             FALSE     68
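
How the points column relates to woe can be sketched with the usual scorecard scaling. The sketch rests on assumptions: the scaling formulas below are the common convention (score = A - B * ln(odds of bad)), whether the package uses exactly this form internally is not confirmed here, and the coefficient beta is a hypothetical value chosen for illustration, not the fitted coefficient of the model above.

```r
# Scaling parameters, same defaults as in the scorecard() call above
points0 <- 600
odds0   <- 1 / 19
pdo     <- 50

# Common scaling convention (assumption): score = A - B * log(odds of bad)
B <- pdo / log(2)
A <- points0 + B * log(odds0)

# WOE values of the three bins of status.of.existing.checking.account
woe <- c(0.6142040, -0.4054651, -1.1762632)

# Hypothetical regression coefficient for this variable (illustration only)
beta <- 0.8

# Points per bin: a higher WOE (more bad customers) gives fewer points
points <- round(-B * beta * woe)
print(points)
```

With the default settings, doubling the odds of being bad lowers the score by pdo = 50 points.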

9.7 scorecard_ply

To compute a customer's score from raw data, use scorecard_ply. Usage:

scorecard_ply(dt, card, only_total_score = TRUE, print_step = 0L,
  replace_blank_na = TRUE, var_kp = NULL)
  • dt: the raw data to score; this example reuses the data set used to train the model
  • card: the scorecard rules built with scorecard
result <- scorecard_ply(dt = germancredit, card = score)
result
##       score
##    1:   648
##    2:   314
##    3:   638
##    4:   439
##    5:   310
##   ---      
##  996:   535
##  997:   462
##  998:   559
##  999:   342
## 1000:   402
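
Conceptually, a total score is the base points plus the points of the bin each variable falls into. The base-R sketch below illustrates that sum for two variables. The points for status.of.existing.checking.account are taken from the score table above; the base points and the duration.in.month points are hypothetical values, and this is not the package's implementation.

```r
# Points for status.of.existing.checking.account, taken from the score table above;
# the first bin merges two raw categories
status_points <- c(
  "... < 0 DM" = -36,
  "0 <= ... < 200 DM" = -36,
  "... >= 200 DM / salary assignments for at least 1 year" = 24,
  "no checking account" = 68
)

# Hypothetical base points and hypothetical points for the duration.in.month bins
basepoints      <- 450
duration_breaks <- c(-Inf, 8, 16, 34, 44, Inf)
duration_points <- c(60, 16, -5, -24, -52)

# Two example applicants
applicants <- data.frame(
  status   = c("... < 0 DM", "no checking account"),
  duration = c(6, 48),
  stringsAsFactors = FALSE
)

# Look up the points of each variable and add them to the base points
bin_idx <- cut(applicants$duration, breaks = duration_breaks,
               right = FALSE, labels = FALSE)
total_score <- basepoints +
  unname(status_points[applicants$status]) +
  duration_points[bin_idx]
print(total_score)
```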

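This gives a score for every customer. The next chapter walks through the whole process with an example.

10. Case study

The data set comes from a German bank and ships with the scorecard package; load it with data(germancredit):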

data(germancredit)
head(germancredit)
##   status.of.existing.checking.account duration.in.month
## 1                          ... < 0 DM                 6
## 2                   0 <= ... < 200 DM                48
## 3                 no checking account                12
## 4                          ... < 0 DM                42
## 5                          ... < 0 DM                24
## 6                 no checking account                36
##                                                credit.history
## 1 critical account/ other credits existing (not at this bank)
## 2                    existing credits paid back duly till now
## 3 critical account/ other credits existing (not at this bank)
## 4                    existing credits paid back duly till now
## 5                             delay in paying off in the past
## 6                    existing credits paid back duly till now
##               purpose credit.amount   savings.account.and.bonds
## 1    radio/television          1169 unknown/ no savings account
## 2    radio/television          5951                ... < 100 DM
## 3           education          2096                ... < 100 DM
## 4 furniture/equipment          7882                ... < 100 DM
## 5           car (new)          4870                ... < 100 DM
## 6           education          9055 unknown/ no savings account
##   present.employment.since
## 1           ... >= 7 years
## 2       1 <= ... < 4 years
## 3       4 <= ... < 7 years
## 4       4 <= ... < 7 years
## 5       1 <= ... < 4 years
## 6       1 <= ... < 4 years
##   installment.rate.in.percentage.of.disposable.income
## 1                                                   4
## 2                                                   2
## 3                                                   2
## 4                                                   2
## 5                                                   3
## 6                                                   2
##               personal.status.and.sex other.debtors.or.guarantors
## 1                       male : single                        none
## 2 female : divorced/separated/married                        none
## 3                       male : single                        none
## 4                       male : single                   guarantor
## 5                       male : single                        none
## 6                       male : single                        none
##   present.residence.since
## 1                       4
## 2                       2
## 3                       3
## 4                       4
## 5                       4
## 6                       4
##                                             property age.in.years
## 1                                        real estate           67
## 2                                        real estate           22
## 3                                        real estate           49
## 4 building society savings agreement/ life insurance           45
## 5                              unknown / no property           53
## 6                              unknown / no property           35
##   other.installment.plans  housing number.of.existing.credits.at.this.bank
## 1                    none      own                                       2
## 2                    none      own                                       1
## 3                    none      own                                       1
## 4                    none for free                                       1
## 5                    none for free                                       2
## 6                    none for free                                       1
##                           job
## 1 skilled employee / official
## 2 skilled employee / official
## 3        unskilled - resident
## 4 skilled employee / official
## 5 skilled employee / official
## 6        unskilled - resident
##   number.of.people.being.liable.to.provide.maintenance.for
## 1                                                        1
## 2                                                        1
## 3                                                        2
## 4                                                        2
## 5                                                        2
## 6                                                        2
##                                  telephone foreign.worker creditability
## 1 yes, registered under the customers name            yes          good
## 2                                     none            yes           bad
## 3                                     none            yes          good
## 4                                     none            yes          good
## 5                                     none            yes           bad
## 6 yes, registered under the customers name            yes          good

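There are 20 features in total, and the last column is the sample label: bad marks a bad customer and good marks a good one. Since the data is already prepared, we can go straight to feature selection:

  1. Feature selection: keep features with IV greater than 0.1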
dt_f = var_filter(germancredit, y="creditability",iv_limit = 0.1)
## [INFO] filtering variables ...
names(dt_f)
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"                  
## [3] "credit.history"                     
## [4] "purpose"                            
## [5] "savings.account.and.bonds"          
## [6] "property"                           
## [7] "age.in.years"                       
## [8] "creditability"

This leaves 8 columns: 7 features plus the label creditability.
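
The IV used for filtering can be sketched from good/bad counts per bin. The helper iv_from_counts is hypothetical, and the counts are copied from the woebin tables for two variables shown in this document. var_filter may compute IV on a different grouping of the raw values, so treat this as an illustration of the formula rather than a reproduction of the call above.

```r
# Hypothetical helper: information value from per-bin good/bad counts
iv_from_counts <- function(good, bad) {
  good_dist <- good / sum(good)          # share of all good customers in each bin
  bad_dist  <- bad / sum(bad)            # share of all bad customers in each bin
  woe <- log(bad_dist / good_dist)       # weight of evidence per bin
  sum((bad_dist - good_dist) * woe)      # IV is the sum of the bin contributions
}

# Counts copied from the woebin output in this document
counts <- list(
  status.of.existing.checking.account =
    list(good = c(303, 49, 348), bad = c(240, 14, 46)),
  duration.in.month =
    list(good = c(78, 264, 270, 58, 30), bad = c(9, 80, 129, 42, 40))
)

iv <- sapply(counts, function(v) iv_from_counts(v$good, v$bad))
print(round(iv, 6))

# Keep the variables whose IV exceeds the threshold
keep <- names(iv)[iv > 0.1]
print(keep)
```

The two values agree with the total_iv column of the corresponding woebin tables.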

  2. Split into training and test sets
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)
head(dt_list)
## $train
##      status.of.existing.checking.account duration.in.month
##   1:                          ... < 0 DM                 6
##   2:                   0 <= ... < 200 DM                48
##   3:                 no checking account                12
##   4:                          ... < 0 DM                42
##   5:                 no checking account                36
##  ---                                                      
## 616:                 no checking account                12
## 617:                          ... < 0 DM                30
## 618:                 no checking account                12
## 619:                          ... < 0 DM                45
## 620:                   0 <= ... < 200 DM                45
##                                                   credit.history
##   1: critical account/ other credits existing (not at this bank)
##   2:                    existing credits paid back duly till now
##   3: critical account/ other credits existing (not at this bank)
##   4:                    existing credits paid back duly till now
##   5:                    existing credits paid back duly till now
##  ---                                                            
## 616:                    existing credits paid back duly till now
## 617:                    existing credits paid back duly till now
## 618:                    existing credits paid back duly till now
## 619:                    existing credits paid back duly till now
## 620: critical account/ other credits existing (not at this bank)
##                  purpose   savings.account.and.bonds
##   1:    radio/television unknown/ no savings account
##   2:    radio/television                ... < 100 DM
##   3:           education                ... < 100 DM
##   4: furniture/equipment                ... < 100 DM
##   5:           education unknown/ no savings account
##  ---                                                
## 616: furniture/equipment                ... < 100 DM
## 617:          car (used)                ... < 100 DM
## 618:    radio/television                ... < 100 DM
## 619:    radio/television                ... < 100 DM
## 620:          car (used)         100 <= ... < 500 DM
##                                                  property age.in.years
##   1:                                          real estate           67
##   2:                                          real estate           22
##   3:                                          real estate           49
##   4:   building society savings agreement/ life insurance           45
##   5:                                unknown / no property           35
##  ---                                                                  
## 616:                                          real estate           31
## 617:   building society savings agreement/ life insurance           40
## 618: car or other, not in attribute Savings account/bonds           38
## 619:                                unknown / no property           23
## 620: car or other, not in attribute Savings account/bonds           27
##      creditability
##   1:             0
##   2:             1
##   3:             0
##   4:             0
##   5:             0
##  ---              
## 616:             0
## 617:             0
## 618:             0
## 619:             1
## 620:             0
## 
## $test
##      status.of.existing.checking.account duration.in.month
##   1:                          ... < 0 DM                24
##   2:                 no checking account                12
##   3:                   0 <= ... < 200 DM                12
##   4:                          ... < 0 DM                24
##   5:                          ... < 0 DM                15
##  ---                                                      
## 376:                          ... < 0 DM                36
## 377:                          ... < 0 DM                15
## 378:                 no checking account                15
## 379:                          ... < 0 DM                18
## 380:                 no checking account                12
##                                                   credit.history
##   1:                             delay in paying off in the past
##   2:                    existing credits paid back duly till now
##   3:                    existing credits paid back duly till now
##   4: critical account/ other credits existing (not at this bank)
##   5:                    existing credits paid back duly till now
##  ---                                                            
## 376:                    existing credits paid back duly till now
## 377: critical account/ other credits existing (not at this bank)
## 378:                     all credits at this bank paid back duly
## 379:                    existing credits paid back duly till now
## 380:                    existing credits paid back duly till now
##                  purpose   savings.account.and.bonds
##   1:           car (new)                ... < 100 DM
##   2:    radio/television              ... >= 1000 DM
##   3:           car (new)                ... < 100 DM
##   4:           car (new)                ... < 100 DM
##   5:           car (new)                ... < 100 DM
##  ---                                                
## 376:          car (used)                ... < 100 DM
## 377: furniture/equipment                ... < 100 DM
## 378:    radio/television         100 <= ... < 500 DM
## 379:    radio/television unknown/ no savings account
## 380:           car (new) unknown/ no savings account
##                                                  property age.in.years
##   1:                                unknown / no property           53
##   2:                                          real estate           61
##   3: car or other, not in attribute Savings account/bonds           25
##   4: car or other, not in attribute Savings account/bonds           60
##   5: car or other, not in attribute Savings account/bonds           28
##  ---                                                                  
## 376:   building society savings agreement/ life insurance           26
## 377:   building society savings agreement/ life insurance           25
## 378: car or other, not in attribute Savings account/bonds           34
## 379: car or other, not in attribute Savings account/bonds           23
## 380: car or other, not in attribute Savings account/bonds           50
##      creditability
##   1:             1
##   2:             0
##   3:             1
##   4:             1
##   5:             0
##  ---              
## 376:             1
## 377:             0
## 378:             0
## 379:             0
## 380:             0
  3. WOE binning
bins = woebin(dt_f, y="creditability")
## [INFO] creating woe binning ...
bins$status.of.existing.checking.account
##                               variable
## 1: status.of.existing.checking.account
## 2: status.of.existing.checking.account
## 3: status.of.existing.checking.account
##                                                       bin count
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM   543
## 2: ... >= 200 DM / salary assignments for at least 1 year    63
## 3:                                    no checking account   394
##    count_distr good bad   badprob        woe      bin_iv total_iv
## 1:       0.543  303 240 0.4419890  0.6142040 0.225500603 0.639372
## 2:       0.063   49  14 0.2222222 -0.4054651 0.009460853 0.639372
## 3:       0.394  348  46 0.1167513 -1.1762632 0.404410499 0.639372
##                                                    breaks
## 1:                         ... < 0 DM%,%0 <= ... < 200 DM
## 2: ... >= 200 DM / salary assignments for at least 1 year
## 3:                                    no checking account
##    is_special_values
## 1:             FALSE
## 2:             FALSE
## 3:             FALSE
bins$duration.in.month
##             variable       bin count count_distr good bad   badprob
## 1: duration.in.month  [-Inf,8)    87       0.087   78   9 0.1034483
## 2: duration.in.month    [8,16)   344       0.344  264  80 0.2325581
## 3: duration.in.month   [16,34)   399       0.399  270 129 0.3233083
## 4: duration.in.month   [34,44)   100       0.100   58  42 0.4200000
## 5: duration.in.month [44, Inf)    70       0.070   30  40 0.5714286
##           woe      bin_iv  total_iv breaks is_special_values
## 1: -1.3121864 0.106849463 0.2826181      8             FALSE
## 2: -0.3466246 0.038293766 0.2826181     16             FALSE
## 3:  0.1086883 0.004813339 0.2826181     34             FALSE
## 4:  0.5245245 0.029972827 0.2826181     44             FALSE
## 5:  1.1349799 0.102688661 0.2826181    Inf             FALSE
  4. Convert the data to WOE values
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
## [INFO] converting into woe values ... 
## [INFO] converting into woe values ...
head(dt_woe_list)
## $train
##      creditability status.of.existing.checking.account_woe
##   1:             0                                0.614204
##   2:             1                                0.614204
##   3:             0                               -1.176263
##   4:             0                                0.614204
##   5:             0                               -1.176263
##  ---                                                      
## 616:             0                               -1.176263
## 617:             0                                0.614204
## 618:             0                               -1.176263
## 619:             1                                0.614204
## 620:             0                                0.614204
##      duration.in.month_woe credit.history_woe purpose_woe
##   1:            -1.3121864        -0.73374058  -0.4100628
##   2:             1.1349799         0.08831862  -0.4100628
##   3:            -0.3466246        -0.73374058   0.2799201
##   4:             0.5245245         0.08831862   0.2799201
##   5:             0.5245245         0.08831862   0.2799201
##  ---                                                     
## 616:            -0.3466246         0.08831862   0.2799201
## 617:             0.1086883         0.08831862  -0.8056252
## 618:            -0.3466246         0.08831862  -0.4100628
## 619:             1.1349799         0.08831862  -0.4100628
## 620:             1.1349799        -0.73374058  -0.8056252
##      savings.account.and.bonds_woe property_woe age.in.years_woe
##   1:                    -0.7621401  -0.46103496       -0.2123715
##   2:                     0.2713578  -0.46103496        0.5288441
##   3:                     0.2713578  -0.46103496       -0.2123715
##   4:                     0.2713578   0.02857337       -0.2123715
##   5:                    -0.7621401   0.58608236       -0.8724881
##  ---                                                            
## 616:                     0.2713578  -0.46103496        0.1424546
## 617:                     0.2713578   0.02857337       -0.2123715
## 618:                     0.2713578   0.03419136       -0.2123715
## 619:                     0.2713578   0.58608236        0.5288441
## 620:                     0.1395519   0.03419136       -0.1609304
## 
## $test
##      creditability status.of.existing.checking.account_woe
##   1:             1                                0.614204
##   2:             0                               -1.176263
##   3:             1                                0.614204
##   4:             1                                0.614204
##   5:             0                                0.614204
##  ---                                                      
## 376:             1                                0.614204
## 377:             0                                0.614204
## 378:             0                               -1.176263
## 379:             0                                0.614204
## 380:             0                               -1.176263
##      duration.in.month_woe credit.history_woe purpose_woe
##   1:             0.1086883         0.08515781   0.2799201
##   2:            -0.3466246         0.08831862  -0.4100628
##   3:            -0.3466246         0.08831862   0.2799201
##   4:             0.1086883        -0.73374058   0.2799201
##   5:            -0.3466246         0.08831862   0.2799201
##  ---                                                     
## 376:             0.5245245         0.08831862  -0.8056252
## 377:            -0.3466246        -0.73374058   0.2799201
## 378:            -0.3466246         1.23407084  -0.4100628
## 379:             0.1086883         0.08831862  -0.4100628
## 380:            -0.3466246         0.08831862   0.2799201
##      savings.account.and.bonds_woe property_woe age.in.years_woe
##   1:                     0.2713578   0.58608236       -0.2123715
##   2:                    -0.7621401  -0.46103496       -0.2123715
##   3:                     0.2713578   0.03419136        0.5288441
##   4:                     0.2713578   0.03419136       -0.2123715
##   5:                     0.2713578   0.03419136        0.1424546
##  ---                                                            
## 376:                     0.2713578   0.02857337       -0.1609304
## 377:                     0.2713578   0.02857337        0.5288441
## 378:                     0.1395519   0.03419136        0.1424546
## 379:                    -0.7621401   0.03419136        0.5288441
## 380:                    -0.7621401   0.03419136       -0.2123715
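To make the transformation concrete, here is a rough base-R sketch of what the WOE conversion amounts to for a single numeric variable: each raw value is replaced by the WOE of the bin it falls into. The helper `to_woe` and the sample durations are hypothetical; it does not handle categorical variables, missing values, or special values.

```r
# Cut points and WOE values copied from the duration.in.month bin table
breaks <- c(8, 16, 34, 44)
woe    <- c(-1.3121864, -0.3466246, 0.1086883, 0.5245245, 1.1349799)

# Hypothetical helper: bins are left-closed, e.g. [8,16), so 8 falls into bin 2
to_woe <- function(x, breaks, woe) {
  woe[findInterval(x, breaks) + 1]
}

duration <- c(6, 8, 24, 36, 48)  # illustrative loan durations in months
print(to_woe(duration, breaks, woe))
```

Each of the five sample durations lands in a different bin, so the result is the five WOE values in order.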
  1. Train the model
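# Fit a logistic regression on all WOE-transformed features
m1 = glm(creditability ~ ., family = binomial(), data = dt_woe_list$train)
# Stepwise selection in both directions (step() uses AIC by default)
m_step = step(m1, direction = "both", trace = FALSE)
# Re-evaluate the selected model's call to get the final fitted model
m2 = eval(m_step$call)
summary(m_step)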
## 
## Call:
## glm(formula = creditability ~ status.of.existing.checking.account_woe + 
##     duration.in.month_woe + credit.history_woe + purpose_woe + 
##     savings.account.and.bonds_woe + property_woe + age.in.years_woe, 
##     family = binomial(), data = dt_woe_list$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8673  -0.7379  -0.4205   0.7899   2.5882  
## 
## Coefficients:
##                                         Estimate Std. Error z value
## (Intercept)                              -0.9302     0.1060  -8.773
## status.of.existing.checking.account_woe   0.7343     0.1334   5.502
## duration.in.month_woe                     0.9731     0.2173   4.479
## credit.history_woe                        0.8721     0.1960   4.451
## purpose_woe                               0.8589     0.2653   3.238
## savings.account.and.bonds_woe             0.7163     0.2522   2.840
## property_woe                              0.5991     0.3229   1.855
## age.in.years_woe                          0.9456     0.2918   3.240
##                                         Pr(>|z|)    
## (Intercept)                              < 2e-16 ***
## status.of.existing.checking.account_woe 3.74e-08 ***
## duration.in.month_woe                   7.50e-06 ***
## credit.history_woe                      8.56e-06 ***
## purpose_woe                              0.00120 ** 
## savings.account.and.bonds_woe            0.00450 ** 
## property_woe                             0.06355 .  
## age.in.years_woe                         0.00119 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747.03  on 619  degrees of freedom
## Residual deviance: 583.26  on 612  degrees of freedom
## AIC: 599.26
## 
## Number of Fisher Scoring iterations: 5

At this point, the model has been trained.
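As an illustration of how the fitted model turns a WOE row into a default probability, the sketch below plugs the second training row into the coefficients printed above. The coefficients are the rounded values from the summary, so the result is approximate; the short vector names are illustrative.

```r
# Coefficients copied (rounded) from the model summary above
intercept <- -0.9302
coefs <- c(status = 0.7343, duration = 0.9731, credit_history = 0.8721,
           purpose = 0.8589, savings = 0.7163, property = 0.5991, age = 0.9456)

# WOE values of training row 2 from the head(dt_woe_list) output
woe_row2 <- c(status = 0.614204, duration = 1.1349799, credit_history = 0.08831862,
              purpose = -0.4100628, savings = 0.2713578, property = -0.46103496,
              age = 0.5288441)

log_odds <- intercept + sum(coefs * woe_row2)  # linear predictor
p_bad    <- plogis(log_odds)                   # inverse logit: 1 / (1 + exp(-log_odds))

print(round(c(log_odds = log_odds, p_bad = p_bad), 3))
```

This row is labelled 1 (default) in the training data, and the model assigns it a probability of default of roughly 0.68.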

  1. Evaluate the model
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## performance

perf = perf_eva(pred = pred_list, label = label_list, show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'))
## [INFO] The threshold of confusion matrix is 0.2622.
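Two of the metrics reported by the evaluation step, KS and AUC, are simple enough to compute by hand. The following base-R sketch uses made-up predictions and labels; `ks_auc` is a hypothetical helper, and it assumes label 1 = bad, 0 = good, with no tied predictions.

```r
# Hypothetical helper: KS and AUC from predicted probabilities and 0/1 labels
ks_auc <- function(pred, label) {
  # KS: largest gap between the cumulative good and bad distributions
  lab      <- label[order(pred)]
  cum_bad  <- cumsum(lab == 1) / sum(lab == 1)
  cum_good <- cumsum(lab == 0) / sum(lab == 0)
  ks <- max(abs(cum_good - cum_bad))

  # AUC via the rank-sum (Mann-Whitney) identity
  n_bad  <- sum(label == 1)
  n_good <- sum(label == 0)
  auc <- (sum(rank(pred)[label == 1]) - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)

  c(ks = ks, auc = auc)
}

# Toy data, not taken from the German credit example
pred  <- c(0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9)
label <- c(0,   0,   0,   1,    0,   1,   1,   1)
res <- ks_auc(pred, label)
print(res)
```

On this toy data, 15 of the 16 bad/good pairs are ranked correctly, so the AUC is 0.9375, and the KS is 0.75.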

  1. Generate the scorecard
card = scorecard(bins, m2)
  1. Compute each user's score
score_list = lapply(dt_list, function(x) scorecard_ply(x, card))

head(score_list)
## $train
##      score
##   1:   658
##   2:   331
##   3:   590
##   4:   361
##   5:   531
##  ---      
## 616:   514
## 617:   457
## 618:   559
## 619:   286
## 620:   441
## 
## $test
##      score
##   1:   367
##   2:   633
##   3:   372
##   4:   442
##   5:   398
##  ---      
## 376:   425
## 377:   424
## 378:   470
## 379:   435
## 380:   570
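The link between the logistic regression and these scores can be sketched by hand. The example below assumes the scaling parameters `points0 = 600`, `odds0 = 1/19`, and `pdo = 50` (none of them appear in the code above, so treat them as assumptions) and rounds each variable's points separately. Under those assumptions, the rounded coefficients and the WOE values of the first training row give the same 658 shown in the output.

```r
# Assumed scaling parameters (not stated in the text above)
points0 <- 600    # score at the reference odds
odds0   <- 1 / 19 # reference odds of bad to good
pdo     <- 50     # points needed to double the odds

b <- pdo / log(2)
a <- points0 + b * log(odds0)

# Rounded coefficients from the model summary, in the order of the WOE columns
intercept <- -0.9302
coefs <- c(0.7343, 0.9731, 0.8721, 0.8589, 0.7163, 0.5991, 0.9456)

# WOE values of training row 1 from the head(dt_woe_list) output
woe_row1 <- c(0.614204, -1.3121864, -0.73374058, -0.4100628,
              -0.7621401, -0.46103496, -0.2123715)

base_points <- round(a - b * intercept)      # points every applicant starts with
var_points  <- round(-b * coefs * woe_row1)  # points contributed by each variable
total       <- base_points + sum(var_points)

print(c(base = base_points, var_points, total = total))
```

A higher WOE (more risk) lowers the score, which is why the sign in front of `b` is negative.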
  1. Evaluate model stability (PSI)
perf_psi(score = score_list, label = label_list)
## $pic
## $pic$score

## 
## 
## $psi
##    variable    dataset        psi
## 1:    score train_test 0.02102106
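The PSI compares how scores are distributed across bands in two samples: PSI = sum of (actual share - expected share) x ln(actual share / expected share). Below is a minimal base-R sketch on made-up scores; the helper `psi`, the band boundaries, and the data are all hypothetical. A common rule of thumb treats values below 0.1 as stable, which the 0.021 above comfortably satisfies.

```r
# Hypothetical helper: PSI between two score samples over fixed score bands
psi <- function(expected, actual, breaks) {
  e <- as.numeric(prop.table(table(cut(expected, breaks, right = FALSE))))
  a <- as.numeric(prop.table(table(cut(actual,   breaks, right = FALSE))))
  # Assumes every band is non-empty in both samples; otherwise log() is not finite
  sum((a - e) * log(a / e))
}

breaks <- c(-Inf, 400, 500, Inf)
train_scores <- c(rep(300, 2), rep(450, 3), rep(600, 5))   # band shares 0.20, 0.30, 0.50
test_scores  <- c(rep(350, 5), rep(420, 5), rep(550, 10))  # band shares 0.25, 0.25, 0.50

psi_value <- psi(train_scores, test_scores, breaks)
print(round(psi_value, 4))
```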
