R Markdown

货到付款是一种是一种非常受用户青睐的支付方式,对于客户而言,货到付款更加安全,特别是对于一些电商不发达的的确,货到付款能够有效的打消用户对于网购的不信任。

对于商家而言,货到付款这种支付方式不利于现金的流动,并且,有一部分人会在货到了之后不付款,也就是拒收。拒收的原因很多,很简单的就是不想要了。

一般而言,货到付款的拒收率可以高达20%,这将造成很大的运营成本。因此,本文利用机器学习的方法,对用户是否回拒收进行预测。

1 介绍

1.1 初步了解使用到的数据

kable(as.data.frame(colnames(Model_data)))
colnames(Model_data)
发货方式
原始来单金额
修改后金额
发货件数
原始来单件数
cod运费
用户性别
用户设备
app1
用户类型
地址种类
label
下单小时
付款小时
下单与付款时间间隔
金额差异
件数差异
确认小时
付款到派送

1.1.2 查看数据的结构

str(Model_data)
## Classes 'data.table' and 'data.frame':   322715 obs. of  20 variables:
##  $ 发货方式          : chr  "Delhivery" "Delhivery" "Ecom" "Ecom" ...
##  $ 州                : chr  "Telangana" "Telangana" "Maharashtra" "Maharashtra" ...
##  $ 原始来单金额      : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 修改后金额        : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 发货件数          : num  1 1 1 1 1 6 1 3 1 1 ...
##  $ 原始来单件数      : num  1 1 1 1 1 6 1 3 1 1 ...
##  $ cod运费           : num  1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
##  $ 用户性别          : chr  "women" "women" "men" "women" ...
##  $ 用户设备          : chr  "ios" "android" "android" "android" ...
##  $ app1              : chr  "iOS_4.1.0" "android_4.1.1" "android_4.2.2" "android_4.0.3" ...
##  $ 用户类型          : chr  "new_cod" "new_cod" "new_cod" "old_cod" ...
##  $ 地址种类          : chr  "Valid Address" "Valid Address" "Missing Rooftop" "Valid Address" ...
##  $ label             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ 下单小时          : num  15 9 10 15 16 16 16 6 12 17 ...
##  $ 付款小时          : num  14 11 15 16 16 17 11 6 6 9 ...
##  $ 下单与付款时间间隔: num  19.5 16.9 17.4 16.9 19.6 ...
##  $ 金额差异          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 件数差异          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 确认小时          : num  4 3 3 13 4 4 3 9 3 4 ...
##  $ 付款到派送        : num  2.71 -0.477 -0.151 -0.127 -0.17 ...
##  - attr(*, ".internal.selfref")=<externalptr>

1.1.3 查看数据的基本信息

summary(Model_data)
##    发货方式              州             原始来单金额     修改后金额    
##  Length:322715      Length:322715      Min.   : 0.72   Min.   : 0.390  
##  Class :character   Class :character   1st Qu.: 3.22   1st Qu.: 3.210  
##  Mode  :character   Mode  :character   Median : 7.29   Median : 7.190  
##                                        Mean   :10.12   Mean   : 9.784  
##                                        3rd Qu.:13.54   3rd Qu.:13.210  
##                                        Max.   :51.55   Max.   :51.550  
##                                                                        
##     发货件数       原始来单件数     cod运费        用户性别        
##  Min.   : 0.000   Min.   : 1.0   Min.   :0.000   Length:322715     
##  1st Qu.: 1.000   1st Qu.: 1.0   1st Qu.:0.770   Class :character  
##  Median : 1.000   Median : 1.0   Median :1.550   Mode  :character  
##  Mean   : 1.743   Mean   : 1.8   Mean   :1.197                     
##  3rd Qu.: 1.000   3rd Qu.: 2.0   3rd Qu.:1.550                     
##  Max.   :48.000   Max.   :50.0   Max.   :1.550                     
##                                                                    
##    用户设备             app1             用户类型        
##  Length:322715      Length:322715      Length:322715     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    地址种类         label         下单小时        付款小时    
##  Length:322715      0:285297   Min.   : 0.00   Min.   : 0.00  
##  Class :character   1: 37418   1st Qu.: 7.00   1st Qu.: 7.00  
##  Mode  :character              Median :11.00   Median :11.00  
##                                Mean   :10.85   Mean   :10.87  
##                                3rd Qu.:15.00   3rd Qu.:15.00  
##                                Max.   :23.00   Max.   :23.00  
##                                                               
##  下单与付款时间间隔    金额差异           件数差异            确认小时    
##  Min.   :-0.08387   Min.   :-45.6000   Min.   :-31.00000   Min.   : 0.00  
##  1st Qu.:-0.08260   1st Qu.:  0.0000   1st Qu.:  0.00000   1st Qu.: 7.00  
##  Median :-0.08166   Median :  0.0000   Median :  0.00000   Median : 9.00  
##  Mean   : 0.00000   Mean   : -0.3403   Mean   : -0.05702   Mean   : 9.37  
##  3rd Qu.:-0.07987   3rd Qu.:  0.0000   3rd Qu.:  0.00000   3rd Qu.:12.00  
##  Max.   :19.57188   Max.   :  5.1600   Max.   :  1.00000   Max.   :23.00  
##                                                                           
##    付款到派送    
##  Min.   :-4.011  
##  1st Qu.:-0.669  
##  Median :-0.063  
##  Mean   : 0.000  
##  3rd Qu.: 0.611  
##  Max.   :10.935  
##  NA's   :3441

2 初步的数据分析

2.1 修改数据的类型

2.2 查看标签的比例

pct(Model_data$label)
Count Percentage
0 285297 88.41
1 37418 11.59

3 单变量分析

WOE(Weight of Evidence):WOE显示了自变量对因变量的预测能力

WOE=ln(Distribution of Non-Events(Good)Distribution of Events(Bad))

其通过更基本的比率计算而来:

(Distribution of Good Credit Outcomes) / (Distribution of Bad Credit Outcomes)

Information Value(IV):

信息值有利于通过变量的重要性进行筛选变量

IV=∑(%Non-Events - %Events)∗WOE

Efficiency:

Efficiency=Abs(%Non-Events - %Events)/2

3.1 发货方式

A1 <- gbpct(Model_data$发货方式)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$发货方式), Model_data$label, 
     ylab="Good-Bad", xlab="发货方式", 
     main="发货方式对标签的影响")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="发货方式的WOE",
        xlab="发货方式",
        ylab="WOE"
)

3.2 州

A1 <- gbpct(Model_data$州)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$州), Model_data$label, 
     ylab="Good-Bad", xlab="州", 
     main="州对于标签的关系 ")

3.3 用户性别

A1 <- gbpct(Model_data$用户性别)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$用户性别), Model_data$label, 
     ylab="Good-Bad", xlab="用户性别", 
     main="用户性别对于标签的关系 ")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="用户性别的WOE",
        xlab="用户性别",
        ylab="WOE"
)

### 3.4 用户设备

A1 <- gbpct(Model_data$用户设备)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$用户设备), Model_data$label, 
     ylab="Good-Bad", xlab="用户设备", 
     main="用户设别与标签的关系")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="用户设备的WOE",
        xlab="用户设备",
        ylab="WOE"
)

3.5 操作系统版本

A1 <- gbpct(Model_data$app1)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$app1), Model_data$label, 
     ylab="Good-Bad", xlab="操作系统", 
     main="操作系统版本与标签的关系 ")

# barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
#         main="Score:Checking Shipping method Status",
#         xlab="Category",
#         ylab="WOE"
# )

3.6 用户类别

A1 <- gbpct(Model_data$用户类型)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$用户类型), Model_data$label, 
     ylab="Good-Bad", xlab="用户类型", 
     main="用户类型与标签的关系")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="用户类型的WOE",
        xlab="用户类型",
        ylab="WOE"
)

3.7 地址类型

A1 <- gbpct(Model_data$地址种类)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$地址种类), Model_data$label, 
     ylab="Good-Bad", xlab="地址类型", 
     main="地址类型与标签的关系")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="地址类型WOE",
        xlab="地址类型",
        ylab="WOE"
)

3.8 下单时间(小时)

A1 <- gbpct(Model_data$下单小时)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$下单小时), Model_data$label, 
     ylab="Good-Bad", xlab="下单时间(小时)", 
     main="下单时间(小时)与标签的关系 ")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="下单时间(小时)WOE",
        xlab="下单时间(小时)",
        ylab="WOE"
)

3.9 付款时间(小时)

A1 <- gbpct(Model_data$付款小时)

op1<-par(mfrow=c(1,2), new=TRUE)
## Warning in par(mfrow = c(1, 2), new = TRUE): 不绘图就不能调用par(new=TRUE)
par(family='STKaiti')
plot(as.factor(Model_data$付款小时), Model_data$label, 
     ylab="Good-Bad", xlab="付款时间(小时)", 
     main="付款时间(小时)与标签的关系")

barplot(A1$WOE, col="brown", names.arg=c(A1$Levels), 
        main="付款时间(小时)WOE",
        xlab="Category",
        ylab="WOE"
)

计算信息值(Information Value) 和 WOE (Weight of Evidence)

kable(iv)
variable info_value
地址种类 0.4482661
app1 0.3126790
下单与付款时间间隔 0.2858385
cod运费 0.2818102
修改后金额 0.1986989
原始来单金额 0.1946768
金额差异 0.1632335
付款到派送 0.1379788
发货方式 0.1256872
用户性别 0.1238769
0.1158185
发货件数 0.0954921
原始来单件数 0.0929052
用户类型 0.0274259
确认小时 0.0205682
用户设备 0.0140496
付款小时 0.0119562
下单小时 0.0118502
件数差异 0.0073371
bins
## $发货方式
##    variable                    bin  count count_distr   good   bad
## 1: 发货方式 XpressBees%,%Delhivery 172229   0.5336876 156606 15623
## 2: 发货方式                   Ecom 150486   0.4663124 128691 21795
##       badprob        woe     bin_iv   total_iv                 breaks
## 1: 0.09071062 -0.2736100 0.03595137 0.06954223 XpressBees%,%Delhivery
## 2: 0.14483075  0.2556453 0.03359086 0.06954223                   Ecom
##    is_special_values
## 1:             FALSE
## 2:             FALSE
## 
## $州
##    variable
## 1:       州
## 2:       州
## 3:       州
## 4:       州
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  bin
## 1: West bengal%,%UTTAR PRADESH%,%madhya pradesh%,%west bengal%,%Uttar pradesh%,%new delhi%,%New Delhi%,%andhra pradesh%,%maharashtra%,%WEST BENGAL%,%uttar pardesh%,%MADHYA PRADESH%,%palakkad%,%Kheda%,%haryana%,%Andhra pradesh%,%Maharashtara%,%Pondicherry%,%RAJSTHAN%,%Tamil nadu%,%Tamilnadu%,%Jammu & Kashmir%,%J&K%,%maharasta%,%Hyderabad%,%daman%,%GUJARAT%,%Haryana,%,%Jharkhan%,%Chattisgarh%,%karnataka%,%kerala%,%West Bangal%,%Meghalaya%,%Mizoram%,%Nagaland%,%Goa%,%Arunachal Pradesh%,%Assam%,%Daman and Diu%,%Puducherry%,%Kerala
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                     West Bengal%,%Tamil Nadu%,%Chandigarh%,%Karnataka%,%Sikkim%,%Chhattisgarh%,%Himachal Pradesh%,%Andhra Pradesh
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Telangana%,%Manipur%,%Odisha%,%Tripura%,%Gujarat%,%Uttarakhand
## 4:                                                                                                                                                                                                                                                                                                                                                        Jammu and Kashmir%,%Haryana%,%Madhya Pradesh%,%Uttar Pradesh%,%Punjab%,%Rajasthan%,%Maharashtra%,%Jharkhand%,%Delhi%,%Bihar%,%punjab%,%Andaman and Nicobar Islands%,%tamil nadu%,%Hariyana
##     count count_distr   good   bad    badprob         woe       bin_iv
## 1:  21778  0.06748369  20513  1265 0.05808614 -0.75460784 0.0287454667
## 2:  87476  0.27106270  80294  7182 0.08210252 -0.38273813 0.0342551910
## 3:  62240  0.19286367  55233  7007 0.11258033 -0.03327208 0.0002107931
## 4: 151221  0.46858993 129257 21964 0.14524438  0.25898095 0.0346850655
##      total_iv
## 1: 0.09789652
## 2: 0.09789652
## 3: 0.09789652
## 4: 0.09789652
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               breaks
## 1: West bengal%,%UTTAR PRADESH%,%madhya pradesh%,%west bengal%,%Uttar pradesh%,%new delhi%,%New Delhi%,%andhra pradesh%,%maharashtra%,%WEST BENGAL%,%uttar pardesh%,%MADHYA PRADESH%,%palakkad%,%Kheda%,%haryana%,%Andhra pradesh%,%Maharashtara%,%Pondicherry%,%RAJSTHAN%,%Tamil nadu%,%Tamilnadu%,%Jammu & Kashmir%,%J&K%,%maharasta%,%Hyderabad%,%daman%,%GUJARAT%,%Haryana,%,%Jharkhan%,%Chattisgarh%,%karnataka%,%kerala%,%West Bangal%,%Meghalaya%,%Mizoram%,%Nagaland%,%Goa%,%Arunachal Pradesh%,%Assam%,%Daman and Diu%,%Puducherry%,%Kerala
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                     West Bengal%,%Tamil Nadu%,%Chandigarh%,%Karnataka%,%Sikkim%,%Chhattisgarh%,%Himachal Pradesh%,%Andhra Pradesh
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Telangana%,%Manipur%,%Odisha%,%Tripura%,%Gujarat%,%Uttarakhand
## 4:                                                                                                                                                                                                                                                                                                                                                        Jammu and Kashmir%,%Haryana%,%Madhya Pradesh%,%Uttar Pradesh%,%Punjab%,%Rajasthan%,%Maharashtra%,%Jharkhand%,%Delhi%,%Bihar%,%punjab%,%Andaman and Nicobar Islands%,%tamil nadu%,%Hariyana
##    is_special_values
## 1:             FALSE
## 2:             FALSE
## 3:             FALSE
## 4:             FALSE
## 
## $原始来单金额
##        variable       bin count count_distr  good   bad    badprob
## 1: 原始来单金额  [-Inf,2) 20298  0.06289760 18324  1974 0.09725096
## 2: 原始来单金额     [2,4) 82577  0.25588213 73354  9223 0.11168970
## 3: 原始来单金额     [4,6) 43843  0.13585672 39918  3925 0.08952398
## 4: 原始来单金额    [6,10) 55017  0.17048169 48312  6705 0.12187142
## 5: 原始来单金额   [10,18) 72926  0.22597648 62305 10621 0.14564079
## 6: 原始来单金额   [18,28) 29221  0.09054739 25891  3330 0.11395914
## 7: 原始来单金额 [28, Inf) 18833  0.05835799 17193  1640 0.08708119
##            woe       bin_iv   total_iv breaks is_special_values
## 1: -0.19677086 0.0022574432 0.03578174      2             FALSE
## 2: -0.04221780 0.0004487274 0.03578174      4             FALSE
## 3: -0.28808213 0.0100890135 0.03578174      6             FALSE
## 4:  0.05655241 0.0005571833 0.03578174     10             FALSE
## 5:  0.26217036 0.0171619061 0.03578174     18             FALSE
## 6: -0.01954424 0.0000343283 0.03578174     28             FALSE
## 7: -0.31842721 0.0052331416 0.03578174    Inf             FALSE
## 
## $修改后金额
##      variable       bin count count_distr  good   bad    badprob
## 1: 修改后金额  [-Inf,2) 20384  0.06316409 18401  1983 0.09728218
## 2: 修改后金额     [2,4) 83151  0.25766078 73874  9277 0.11156811
## 3: 修改后金额     [4,6) 44200  0.13696295 40226  3974 0.08990950
## 4: 修改后金额    [6,10) 56409  0.17479510 49630  6779 0.12017586
## 5: 修改后金额   [10,18) 73848  0.22883349 63115 10733 0.14533907
## 6: 修改后金额   [18,25) 23496  0.07280728 20784  2712 0.11542390
## 7: 修改后金额 [25, Inf) 21227  0.06577630 19267  1960 0.09233523
##             woe       bin_iv   total_iv breaks is_special_values
## 1: -0.196415291 2.259132e-03 0.03378337      2             FALSE
## 2: -0.043443850 4.782461e-04 0.03378337      4             FALSE
## 3: -0.283361538 9.858532e-03 0.03378337      6             FALSE
## 4:  0.040612980 2.928368e-04 0.03378337     10             FALSE
## 5:  0.259743519 1.704306e-02 0.03378337     18             FALSE
## 6: -0.005118219 1.903526e-06 0.03378337     25             FALSE
## 7: -0.254070444 3.849656e-03 0.03378337    Inf             FALSE
## 
## $发货件数
##    variable      bin  count count_distr   good   bad    badprob        woe
## 1: 发货件数 [-Inf,2) 242394  0.75110856 210330 32064 0.13228050  0.1504351
## 2: 发货件数    [2,3)  30017  0.09301396  27705  2312 0.07702302 -0.4521211
## 3: 发货件数 [3, Inf)  50304  0.15587748  47262  3042 0.06047233 -0.7118125
##        bin_iv   total_iv breaks is_special_values
## 1: 0.01800438 0.09402303      2             FALSE
## 2: 0.01596932 0.09402303      3             FALSE
## 3: 0.06004934 0.09402303    Inf             FALSE
## 
## $原始来单件数
##        variable      bin  count count_distr   good   bad    badprob
## 1: 原始来单件数 [-Inf,2) 239614  0.74249415 207870 31744 0.13247974
## 2: 原始来单件数    [2,3)  29493  0.09139024  27176  2317 0.07856101
## 3: 原始来单件数 [3, Inf)  53608  0.16611561  50251  3357 0.06262125
##           woe     bin_iv   total_iv breaks is_special_values
## 1:  0.1521697 0.01822272 0.09087764      2             FALSE
## 2: -0.4306821 0.01435595 0.09087764      3             FALSE
## 3: -0.6746039 0.05829897 0.09087764    Inf             FALSE
## 
## $cod运费
##    variable        bin  count count_distr   good   bad    badprob
## 1:  cod运费 [-Inf,1.5) 143652   0.4451358 129995 13657 0.09507003
## 2:  cod运费 [1.5, Inf) 179063   0.5548642 155302 23761 0.13269631
##           woe     bin_iv   total_iv breaks is_special_values
## 1: -0.2218649 0.02011498 0.03408191    1.5             FALSE
## 2:  0.1540528 0.01396692 0.03408191    Inf             FALSE
## 
## $用户性别
##    variable             bin  count count_distr   good   bad    badprob
## 1: 用户性别         missing   1972 0.006110655   1694   278 0.14097363
## 2: 用户性别 not set%,%women 228872 0.709207815 207855 21017 0.09182862
## 3: 用户性别             men  91871 0.284681530  75748 16123 0.17549608
##           woe       bin_iv  total_iv          breaks is_special_values
## 1:  0.2241521 0.0003344142 0.1238244         missing              TRUE
## 2: -0.2601302 0.0434092333 0.1238244 not set%,%women             FALSE
## 3:  0.4842137 0.0800807579 0.1238244             men             FALSE
## 
## $用户设备
##    variable               bin  count count_distr   good   bad    badprob
## 1: 用户设备           missing   2467 0.007644516   2109   358 0.14511552
## 2: 用户设备 pc%,%mobile%,%ios  35046 0.108597369  32068  2978 0.08497403
## 3: 用户设备           android 285202 0.883758115 251120 34082 0.11950127
##            woe       bin_iv   total_iv            breaks is_special_values
## 1:  0.25794268 0.0005611005 0.01293809           missing              TRUE
## 2: -0.34522784 0.0113285823 0.01293809 pc%,%mobile%,%ios             FALSE
## 3:  0.03421734 0.0010484026 0.01293809           android             FALSE
## 
## $app1
##    variable
## 1:     app1
## 2:     app1
## 3:     app1
## 4:     app1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  bin
## 1:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           missing
## 2:                                                                                                                                                                                                                                                                                                               android_2.45%,%android_3.7.1%,%android_2.49%,%android_3.3.3%,%android_4.2.0%,%android_2.34%,%android_2.48%,%android_2.33%,%android_4.0.1%,%iOS_1.6.1%,%iOS_1.5.9%,%android_3.2.0%,%iOS_1.5.8%,%android_2.38%,%android_3.3.0%,%android_3.4.0%,%android_null%,%android_4.3.4%,%pc%,%android_4.3.5%,%iOS_1.6.2%,%android_4.0.2%,%android_4.3.3%,%iOS_1.9.1%,%iOS_2.0.0
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             iOS_1.8.0%,%iOS_4.2.0
## 4: iOS_2.0.1%,%android_4.3.0%,%iOS_1.7.0%,%iOS_2.1.0%,%iOS_1.9.0%,%android_3.6.2%,%android_3.7.3%,%android_4.2.1%,%iOS_4.0.0%,%android_4.3.2%,%android_3.8.0%,%android_3.4.3%,%android_3.2.1%,%android_3.9.1%,%android_3.3.1%,%android_3.8.1%,%android_4.0.3%,%android_3.1.1%,%android_3.5.5%,%iOS_4.1.0%,%android_3.4.2%,%android_3.4.1%,%android_4.1.1%,%android_4.2.3%,%android_4.1.0%,%android_3.9.0%,%mobile-pwa%,%iOS_1.6.0%,%mobile%,%android_3.0.2%,%android_3.5.2%,%android_2.42%,%android_3.6.1%,%android_3.0.1%,%android_4.2.2%,%android_3.7.0%,%android_4.0.0%,%android_3.5.1%,%android_4.3.1%,%android_2.44%,%android_2.50%,%android_2.40%,%android_2.46%,%android_2.47
##     count count_distr   good   bad    badprob        woe       bin_iv
## 1:   2467 0.007644516   2109   358 0.14511552  0.2579427 0.0005611005
## 2:  95468 0.295827588  90854  4614 0.04833033 -0.9487798 0.1851491250
## 3:  29467 0.091309670  27136  2331 0.07910544 -0.4231850 0.0138883801
## 4: 195313 0.605218227 165198 30115 0.15418841  0.3292575 0.0743423496
##    total_iv
## 1: 0.273941
## 2: 0.273941
## 3: 0.273941
## 4: 0.273941
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               breaks
## 1:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           missing
## 2:                                                                                                                                                                                                                                                                                                               android_2.45%,%android_3.7.1%,%android_2.49%,%android_3.3.3%,%android_4.2.0%,%android_2.34%,%android_2.48%,%android_2.33%,%android_4.0.1%,%iOS_1.6.1%,%iOS_1.5.9%,%android_3.2.0%,%iOS_1.5.8%,%android_2.38%,%android_3.3.0%,%android_3.4.0%,%android_null%,%android_4.3.4%,%pc%,%android_4.3.5%,%iOS_1.6.2%,%android_4.0.2%,%android_4.3.3%,%iOS_1.9.1%,%iOS_2.0.0
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             iOS_1.8.0%,%iOS_4.2.0
## 4: iOS_2.0.1%,%android_4.3.0%,%iOS_1.7.0%,%iOS_2.1.0%,%iOS_1.9.0%,%android_3.6.2%,%android_3.7.3%,%android_4.2.1%,%iOS_4.0.0%,%android_4.3.2%,%android_3.8.0%,%android_3.4.3%,%android_3.2.1%,%android_3.9.1%,%android_3.3.1%,%android_3.8.1%,%android_4.0.3%,%android_3.1.1%,%android_3.5.5%,%iOS_4.1.0%,%android_3.4.2%,%android_3.4.1%,%android_4.1.1%,%android_4.2.3%,%android_4.1.0%,%android_3.9.0%,%mobile-pwa%,%iOS_1.6.0%,%mobile%,%android_3.0.2%,%android_3.5.2%,%android_2.42%,%android_3.6.1%,%android_3.0.1%,%android_4.2.2%,%android_3.7.0%,%android_4.0.0%,%android_3.5.1%,%android_4.3.1%,%android_2.44%,%android_2.50%,%android_2.40%,%android_2.46%,%android_2.47
##    is_special_values
## 1:              TRUE
## 2:             FALSE
## 3:             FALSE
## 4:             FALSE
## 
## $用户类型
##    variable
## 1: 用户类型
## 2: 用户类型
## 3: 用户类型
##                                                                bin  count
## 1: old_prepaid_old_cod%,%old_prepaid_new_cod%,%new_prepaid_old_cod  31057
## 2:                                                         old_cod  47207
## 3:                                   new_cod%,%new_prepaid_new_cod 244451
##    count_distr   good   bad   badprob         woe      bin_iv   total_iv
## 1:  0.09623662  28600  2457 0.0791126 -0.42308675 0.014631535 0.02626036
## 2:  0.14628077  42704  4503 0.0953884 -0.21816988 0.006400987 0.02626036
## 3:  0.75748261 213993 30458 0.1245976  0.08178425 0.005227836 0.02626036
##                                                             breaks
## 1: old_prepaid_old_cod%,%old_prepaid_new_cod%,%new_prepaid_old_cod
## 2:                                                         old_cod
## 3:                                   new_cod%,%new_prepaid_new_cod
##    is_special_values
## 1:             FALSE
## 2:             FALSE
## 3:             FALSE
## 
## $地址种类
##    variable                                                 bin  count
## 1: 地址种类                                             missing  32963
## 2: 地址种类                                       Valid Address 211036
## 3: 地址种类                            Missing Rooftop with POI  27599
## 4: 地址种类 Missing Rooftop%,%Inappropriate%,%Incomplete%,%Junk  51117
##    count_distr   good   bad     badprob         woe       bin_iv  total_iv
## 1:  0.10214276  32836   127 0.003852805 -3.52371480 3.935990e-01 0.4450414
## 2:  0.65393923 186341 24695 0.117017950  0.01040134 7.103135e-05 0.4450414
## 3:  0.08552128  24204  3395 0.123011703  0.06716472 3.958573e-04 0.4450414
## 4:  0.15839673  41916  9201 0.179998826  0.51502343 5.097554e-02 0.4450414
##                                                 breaks is_special_values
## 1:                                             missing              TRUE
## 2:                                       Valid Address             FALSE
## 3:                            Missing Rooftop with POI             FALSE
## 4: Missing Rooftop%,%Inappropriate%,%Incomplete%,%Junk             FALSE
## 
## $下单小时
##    variable       bin  count count_distr   good   bad   badprob
## 1: 下单小时  [-Inf,5)  37569  0.11641541  32658  4911 0.1307195
## 2: 下单小时    [5,17) 231021  0.71586694 205342 25679 0.1111544
## 3: 下单小时   [17,19)  34948  0.10829370  30943  4005 0.1145988
## 4: 下单小时 [19, Inf)  19177  0.05942395  16354  2823 0.1472076
##            woe       bin_iv    total_iv breaks is_special_values
## 1:  0.13676661 0.0022945074 0.008885842      5             FALSE
## 2: -0.04762447 0.0015941915 0.008885842     17             FALSE
## 3: -0.01322435 0.0000188428 0.008885842     19             FALSE
## 4:  0.27470650 0.0049783008 0.008885842    Inf             FALSE
## 
## $付款小时
##    variable       bin  count count_distr   good   bad   badprob
## 1: 付款小时  [-Inf,5)  37095  0.11494662  32224  4871 0.1313115
## 2: 付款小时    [5,17) 230946  0.71563454 205269 25677 0.1111818
## 3: 付款小时   [17,19)  35187  0.10903429  31172  4015 0.1141046
## 4: 付款小时 [19, Inf)  19487  0.06038455  16632  2855 0.1465079
##            woe       bin_iv    total_iv breaks is_special_values
## 1:  0.14196661 2.445947e-03 0.008901763      5             FALSE
## 2: -0.04734679 1.575312e-03 0.008901763     17             FALSE
## 3: -0.01810404 3.548894e-05 0.008901763     19             FALSE
## 4:  0.26912216 4.845015e-03 0.008901763    Inf             FALSE
## 
## $下单与付款时间间隔
##              variable               bin  count count_distr   good   bad
## 1: 下单与付款时间间隔     [-Inf,-0.083)  38240   0.1184946  33125  5115
## 2: 下单与付款时间间隔  [-0.083,-0.0814) 137899   0.4273089 120792 17107
## 3: 下单与付款时间间隔 [-0.0814,-0.0774) 106184   0.3290334  94815 11369
## 4: 下单与付款时间间隔    [-0.0774, Inf)  40392   0.1251631  36565  3827
##       badprob         woe      bin_iv   total_iv  breaks is_special_values
## 1: 0.13376046  0.16326799 0.003361988 0.01435376  -0.083             FALSE
## 2: 0.12405456  0.07679655 0.002595418 0.01435376 -0.0814             FALSE
## 3: 0.10706886 -0.08965840 0.002555278 0.01435376 -0.0774             FALSE
## 4: 0.09474648 -0.22563142 0.005841080 0.01435376     Inf             FALSE
## 
## $金额差异
##    variable         bin  count count_distr   good   bad   badprob woe
## 1: 金额差异 [-Inf, Inf) 322715           1 285297 37418 0.1159475   0
##    bin_iv total_iv breaks is_special_values
## 1:      0        0    Inf             FALSE
## 
## $件数差异
##    variable         bin  count count_distr   good   bad   badprob woe
## 1: 件数差异 [-Inf, Inf) 322715           1 285297 37418 0.1159475   0
##    bin_iv total_iv breaks is_special_values
## 1:      0        0    Inf             FALSE
## 
## $确认小时
##    variable       bin  count count_distr   good   bad    badprob
## 1: 确认小时  [-Inf,5)  20463  0.06340889  18528  1935 0.09456091
## 2: 确认小时    [5,12) 220152  0.68218707 193376 26776 0.12162506
## 3: 确认小时   [12,13)  29449  0.09125389  26112  3337 0.11331454
## 4: 确认小时 [13, Inf)  52651  0.16315015  47281  5370 0.10199236
##            woe       bin_iv   total_iv breaks is_special_values
## 1: -0.22779690 3.013701e-03 0.00832062      5             FALSE
## 2:  0.05424835 2.049801e-03 0.00832062     12             FALSE
## 3: -0.02594391 6.081242e-05 0.00832062     13             FALSE
## 4: -0.14390174 3.196306e-03 0.00832062    Inf             FALSE
## 
## $付款到派送
##      variable        bin  count count_distr   good   bad    badprob
## 1: 付款到派送    missing   3441  0.01066266   1327  2114 0.61435629
## 2: 付款到派送 [-Inf,0.2) 190833  0.59133601 168293 22540 0.11811374
## 3: 付款到派送    [0.2,1)  81786  0.25343105  73034  8752 0.10701098
## 4: 付款到派送    [1,1.4)  20867  0.06466077  18904  1963 0.09407198
## 5: 付款到派送 [1.4, Inf)  25788  0.07990952  23739  2049 0.07945556
##            woe       bin_iv  total_iv  breaks is_special_values
## 1:  2.49704000 0.1294604871 0.1468417 missing              TRUE
## 2:  0.02096387 0.0002619823 0.1468417     0.2             FALSE
## 3: -0.09026397 0.0019943602 0.1468417       1             FALSE
## 4: -0.23352075 0.0032224443 0.1468417     1.4             FALSE
## 5: -0.41838853 0.0119024356 0.1468417     Inf             FALSE
  1. 下面这些变量是没有预测能力或者预测能力非常弱的一些变量 (IV< 2%), 因此可以直接将这些变量筛选掉
library(tidyverse)
kable(iv %>% filter(info_value<0.02))
## Warning: package 'bindrcpp' was built under R version 3.4.4
variable info_value
用户设备 0.0140496
付款小时 0.0119562
下单小时 0.0118502
件数差异 0.0073371
  1. 下面这一部分变量只是有非常弱的预测变量 (2%<=IV< 10%), 因此可以考虑加上这一部分变量,也可以不加上这些变量
library(tidyverse)
kable(iv %>% filter(info_value>=0.02,info_value<0.1))
variable info_value
发货件数 0.0954921
原始来单件数 0.0929052
用户类型 0.0274259
确认小时 0.0205682
  1. 这些变量有一定的预测能力 (10%<=IV< 30%), 可以考虑选取其中一些变量加入到模型里面去
library(tidyverse)
kable(iv %>% filter(info_value>=0.1,info_value<0.3))
variable info_value
下单与付款时间间隔 0.2858385
cod运费 0.2818102
修改后金额 0.1986989
原始来单金额 0.1946768
金额差异 0.1632335
付款到派送 0.1379788
发货方式 0.1256872
用户性别 0.1238769
0.1158185
  1. 这些变量有比较强的预测能力 (IV 30% to 50%),模型选取这一部分变量进行建模
library(tidyverse)
kable(iv %>% filter(info_value>=0.3,info_value<0.5))
variable info_value
地址种类 0.4482661
app1 0.3126790

4 选取进行建模的变量

var_list_1 <- iv %>% filter(info_value>0.1) %>% select(variable) # 15 variables
Model_data1 <- Model_data %>% select(var_list_1$variable,label) #12 variables
head(Model_data1)
##           地址种类          app1 下单与付款时间间隔 cod运费 修改后金额
## 1:   Valid Address     iOS_4.1.0           19.45732    1.55       5.60
## 2:   Valid Address android_4.1.1           16.93115    1.55       6.92
## 3: Missing Rooftop android_4.2.2           17.41311    1.55      10.32
## 4:   Valid Address android_4.0.3           16.85653    1.55       4.67
## 5: Missing Rooftop android_4.1.1           19.56840    1.55      10.26
## 6:   Valid Address     iOS_4.1.0           16.91516    1.55      16.02
##    原始来单金额 金额差异 付款到派送  发货方式 用户性别          州 label
## 1:         5.60        0  2.7096488 Delhivery    women   Telangana     0
## 2:         6.92        0 -0.4770722 Delhivery    women   Telangana     0
## 3:        10.32        0 -0.1513002      Ecom      men Maharashtra     0
## 4:         4.67        0 -0.1274765      Ecom    women Maharashtra     0
## 5:        10.26        0 -0.1704649 Delhivery      men   Karnataka     0
## 6:        16.02        0  0.2219836 Delhivery    women   Karnataka     0

5 多元数据分析 - 聚类,降维

对变量的聚类可以讲含有相同信息的变量聚为同一个族类

当我们有大量的变量的时候,这种方法可以很好的用于进行降维。同样可以用于降维的方法还有主成分分析和因子分析。

Model_data1$app1 <- as.factor(Model_data1$app1)
Model_data1$label <- as.factor(Model_data1$label)
Model_data1$地址种类 <- as.factor(Model_data1$地址种类)
Model_data1$发货方式 <- as.factor(Model_data1$发货方式)
Model_data1$用户性别 <- as.factor(Model_data1$用户性别)
Model_data1$州 <- as.factor(Model_data1$州)
factors <- sapply(Model_data1, is.factor)
#subset Qualitative variables 
vars_quali <- Model_data1 %>% select(names(Model_data1)[factors])
#vars_quali$good_bad_21<-vars_quali$good_bad_21[drop=TRUE] # remove empty factors
str(vars_quali)
## Classes 'data.table' and 'data.frame':   322715 obs. of  6 variables:
##  $ 地址种类: Factor w/ 6 levels "Inappropriate",..: 6 6 4 6 4 6 6 6 4 6 ...
##  $ app1    : Factor w/ 71 levels "android_2.33",..: 67 42 45 40 42 67 42 32 42 29 ...
##  $ 发货方式: Factor w/ 3 levels "Delhivery","Ecom",..: 1 1 2 2 1 1 1 2 2 1 ...
##  $ 用户性别: Factor w/ 3 levels "men","not set",..: 3 3 1 3 1 3 3 3 3 1 ...
##  $ 州      : Factor w/ 70 levels "Andaman and Nicobar Islands",..: 59 59 38 38 29 29 29 14 29 29 ...
##  $ label   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#subset Quantitative variables 
vars_quanti <- Model_data1 %>% select(names(Model_data1)[!factors])
str(vars_quanti)
## Classes 'data.table' and 'data.frame':   322715 obs. of  6 variables:
##  $ 下单与付款时间间隔: num  19.5 16.9 17.4 16.9 19.6 ...
##  $ cod运费           : num  1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
##  $ 修改后金额        : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 原始来单金额      : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 金额差异          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 付款到派送        : num  2.71 -0.477 -0.151 -0.127 -0.17 ...
##  - attr(*, ".internal.selfref")=<externalptr>

## 6 变量的层次聚类

tree <- hclustvar(X.quanti=vars_quanti,X.quali=vars_quali)
par(family='STKaiti')
plot(tree, main="variable clustering")
rect.hclust(tree, k=8,  border = 1:8)

summary(tree)
##          Length Class      Mode     
## call       3    -none-     call     
## rec       16    -none-     list     
## init      12    -none-     numeric  
## merge     22    -none-     numeric  
## height    11    -none-     numeric  
## order     12    -none-     numeric  
## labels    12    -none-     character
## clusmat  144    -none-     numeric  
## X.quanti   6    data.table list     
## X.quali    6    data.table list
# Phylogenetic trees
# require library("ape")
par(family='STKaiti')
plot(as.phylo(tree), type = "fan",
     tip.color = hsv(runif(15, 0.65,  0.95), 1, 1, 0.7),
     edge.color = hsv(runif(10, 0.65, 0.75), 1, 1, 0.7), 
     edge.width = runif(20,  0.5, 3), use.edge.length = TRUE, col = "gray80")

summary.phylo(as.phylo(tree))
## 
## Phylogenetic tree: as.phylo(tree) 
## 
##   Number of tips: 12 
##   Number of nodes: 11 
##   Branch lengths:
##     mean: 0.2498154 
##     variance: 0.02762882 
##     distribution summary:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## 0.01203149 0.11483605 0.24931255 0.40189405 0.49995107 
##   No root edge.
##   First ten tip labels: 下单与付款时间间隔 
##                         cod运费
##                         修改后金额
##                         原始来单金额
##                         金额差异
##                         付款到派送
##                         地址种类
##                         app1
##                         发货方式
##                         用户性别
##   No node labels.
part<-cutreevar(tree,8)
print(part)
## 
## Call:
## cutreevar(obj = tree, k = 8)
## 
## 
## 
##  name      
##  "$var"    
##  "$sim"    
##  "$cluster"
##  "$wss"    
##  "$E"      
##  "$size"   
##  "$scores" 
##  "$coef"   
##  description                                                                    
##  "list of variables in each cluster"                                            
##  "similarity matrix in each cluster"                                            
##  "cluster memberships"                                                          
##  "within-cluster sum of squares"                                                
##  "gain in cohesion (in %)"                                                      
##  "size of each cluster"                                                         
##  "synthetic score of each cluster"                                              
##  "coef of the linear combinations defining the synthetic scores of each cluster"
summary(part)
## 
## Call:
## cutreevar(obj = tree, k = 8)
## 
## 
## 
## Data: 
##    number of observations:  322715
##    number of  variables:  12
##         number of numerical variables:  6
##         number of categorical variables:  6
##    number of clusters:  8
## 
## Cluster  1 : 
## squared loading     correlation 
##               1               1 
## 
## 
## Cluster  2 : 
##              squared loading correlation
## 修改后金额              0.93       -0.96
## 原始来单金额            0.92       -0.96
## cod运费                 0.65       -0.81
## 
## 
## Cluster  3 : 
## squared loading     correlation 
##               1               1 
## 
## 
## Cluster  4 : 
##            squared loading correlation
## 州                    0.68          NA
## 付款到派送            0.56       -0.75
## 发货方式              0.44          NA
## 
## 
## Cluster  5 : 
## squared loading     correlation 
##               1              NA 
## 
## 
## Cluster  6 : 
## squared loading     correlation 
##               1              NA 
## 
## 
## Cluster  7 : 
## squared loading     correlation 
##               1              NA 
## 
## 
## Cluster  8 : 
## squared loading     correlation 
##               1              NA 
## 
## 
## Gain in cohesion (in %):  80.38

7 通过聚类选取部分变量

# cod运费 
# 付款到派送  
# keep<- c(1,2,3,4,7,8,10,12)
cdata_reduced_2 <- Model_data1 # %>% select(keep)
str(cdata_reduced_2)
## Classes 'data.table' and 'data.frame':   322715 obs. of  12 variables:
##  $ 地址种类          : Factor w/ 6 levels "Inappropriate",..: 6 6 4 6 4 6 6 6 4 6 ...
##  $ app1              : Factor w/ 71 levels "android_2.33",..: 67 42 45 40 42 67 42 32 42 29 ...
##  $ 下单与付款时间间隔: num  19.5 16.9 17.4 16.9 19.6 ...
##  $ cod运费           : num  1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
##  $ 修改后金额        : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 原始来单金额      : num  5.6 6.92 10.32 4.67 10.26 ...
##  $ 金额差异          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 付款到派送        : num  2.71 -0.477 -0.151 -0.127 -0.17 ...
##  $ 发货方式          : Factor w/ 3 levels "Delhivery","Ecom",..: 1 1 2 2 1 1 1 2 2 1 ...
##  $ 用户性别          : Factor w/ 3 levels "men","not set",..: 3 3 1 3 1 3 3 3 3 1 ...
##  $ 州                : Factor w/ 70 levels "Andaman and Nicobar Islands",..: 59 59 38 38 29 29 29 14 29 29 ...
##  $ label             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

8 划分选来集合测试集合

bins <-  scorecard::woebin(cdata_reduced_2,y = 'label')
## Binning on 322715 rows and 12 columns in 00:00:12
dt_woe <- scorecard::woebin_ply(cdata_reduced_2,bins)
## Woe transformating on 322715 rows and 11 columns in 00:00:14
dt_woe$label <- as.factor(dt_woe$label)

div_part_1 <- createDataPartition(y = dt_woe$label, p = 0.7, list = F)

# Training Sample
train_1 <- dt_woe[div_part_1,] # 70% here
pct(train_1$label)
Count Percentage
0 199708 88.41
1 26193 11.59
# Test Sample
test_1 <- dt_woe[-div_part_1,] # rest of the 30% data goes here
pct(test_1$label)
Count Percentage
0 85589 88.41
1 11225 11.59

9 训练模型以及模型选择

9.1 逻辑回归以及逐步回归

m1 <- glm(label~.,data=train_1,family=binomial())
m1 <- step(m1)
## Start:  AIC=141024.5
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe + 
##     cod运费_woe + 修改后金额_woe + 原始来单金额_woe + 
##     金额差异_woe + 付款到派送_woe + 发货方式_woe + 
##     用户性别_woe + 州_woe
## 
## 
## Step:  AIC=141024.5
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe + 
##     cod运费_woe + 修改后金额_woe + 原始来单金额_woe + 
##     付款到派送_woe + 发货方式_woe + 用户性别_woe + 
##     州_woe
## 
##                          Df Deviance    AIC
## - 修改后金额_woe          1   141003 141023
## <none>                        141003 141025
## - 原始来单金额_woe        1   141009 141029
## - cod运费_woe             1   141135 141155
## - 下单与付款时间间隔_woe  1   141299 141319
## - 发货方式_woe            1   141412 141432
## - 州_woe                  1   141435 141455
## - 用户性别_woe            1   142696 142716
## - 付款到派送_woe          1   145948 145968
## - app1_woe                1   146169 146189
## - 地址种类_woe            1   146308 146328
## 
## Step:  AIC=141022.6
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe + 
##     cod运费_woe + 原始来单金额_woe + 付款到派送_woe + 
##     发货方式_woe + 用户性别_woe + 州_woe
## 
##                          Df Deviance    AIC
## <none>                        141003 141023
## - 原始来单金额_woe        1   141065 141083
## - cod运费_woe             1   141139 141157
## - 下单与付款时间间隔_woe  1   141299 141317
## - 发货方式_woe            1   141412 141430
## - 州_woe                  1   141435 141453
## - 用户性别_woe            1   142696 142714
## - 付款到派送_woe          1   145949 145967
## - app1_woe                1   146169 146187
## - 地址种类_woe            1   146308 146326
summary(m1)
## 
## Call:
## glm(formula = label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe + 
##     cod运费_woe + 原始来单金额_woe + 付款到派送_woe + 
##     发货方式_woe + 用户性别_woe + 州_woe, family = binomial(), 
##     data = train_1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6032  -0.5345  -0.3971  -0.2335   3.8450  
## 
## Coefficients:
##                         Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)            -2.032089   0.007589 -267.759  < 2e-16 ***
## 地址种类_woe            0.989096   0.020734   47.704  < 2e-16 ***
## app1_woe                1.005629   0.015343   65.544  < 2e-16 ***
## 下单与付款时间间隔_woe  1.005120   0.058818   17.089  < 2e-16 ***
## cod运费_woe             0.541428   0.046333   11.686  < 2e-16 ***
## 原始来单金额_woe        0.349058   0.044593    7.828 4.97e-15 ***
## 付款到派送_woe          1.408441   0.021893   64.332  < 2e-16 ***
## 发货方式_woe            0.750572   0.037196   20.179  < 2e-16 ***
## 用户性别_woe            0.825245   0.019827   41.622  < 2e-16 ***
## 州_woe                  0.624494   0.030162   20.705  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 162095  on 225900  degrees of freedom
## Residual deviance: 141003  on 225891  degrees of freedom
## AIC: 141023
## 
## Number of Fisher Scoring iterations: 7
significant.variables <- summary(m1)$coeff[-1,4] < 0.01
names(significant.variables)[significant.variables == TRUE]
## [1] "地址种类_woe"           "app1_woe"              
## [3] "下单与付款时间间隔_woe" "cod运费_woe"           
## [5] "原始来单金额_woe"       "付款到派送_woe"        
## [7] "发货方式_woe"           "用户性别_woe"          
## [9] "州_woe"
dt_pred = predict(m1, type='response', test_1)
perf_eva(test_1$label, dt_pred, type = c("ks","lift","roc","pr"))
## Warning: Removed 1 rows containing missing values (geom_path).

## $KS
## [1] 0.3402
## 
## $AUC
## [1] 0.7434
## 
## $Gini
## [1] 0.4868
## 
## $pic
## TableGrob (2 x 2) "arrange": 4 grobs
##       z     cells    name           grob
## pks   1 (1-1,1-1) arrange gtable[layout]
## plift 2 (1-1,2-2) arrange gtable[layout]
## proc  3 (2-2,1-1) arrange gtable[layout]
## ppr   4 (2-2,2-2) arrange gtable[layout]

9.2 随即森林

m3 <- randomForest(label ~ ., data = train_1)
par(family='STKaiti')
varImpPlot(m3, main="Random Forest: Variable Importance")

dt_pred = predict(m3, type='prob', test_1)[,1]
perf_eva(test_1$label, dt_pred, type = c("ks","lift","roc","pr"))
## Warning: Removed 1 rows containing missing values (geom_path).

## $KS
## [1] 0.1661
## 
## $AUC
## [1] 0.4041
## 
## $Gini
## [1] -0.1919
## 
## $pic
## TableGrob (2 x 2) "arrange": 4 grobs
##       z     cells    name           grob
## pks   1 (1-1,1-1) arrange gtable[layout]
## plift 2 (1-1,2-2) arrange gtable[layout]
## proc  3 (2-2,1-1) arrange gtable[layout]
## ppr   4 (2-2,2-2) arrange gtable[layout]

不平衡的数据会造成非常低AUC,需要尝试解决样本不平衡的问题

欠抽样

load('/Users/milin/COD\ 建模/model_rf_under.RData')
load('/Users/milin/COD\ 建模/dt_woe.RData')
require(scorecard)
dt_pred = predict(model_rf_under, type = 'prob', dt_woe)


perf_eva(dt_woe$label, dt_pred$`1`)

## $KS
## [1] 0.3986
## 
## $AUC
## [1] 0.7641
## 
## $Gini
## [1] 0.5281
## 
## $pic
## TableGrob (1 x 2) "arrange": 2 grobs
##      z     cells    name           grob
## pks  1 (1-1,1-1) arrange gtable[layout]
## proc 2 (1-1,2-2) arrange gtable[layout]

重抽样

load('/Users/milin/COD\ 建模/model_rf_under1.RData')
dt_pred = predict(model_rf_under, type = 'prob', dt_woe)


perf_eva(dt_woe$label, dt_pred$`1`)

## $KS
## [1] 0.3986
## 
## $AUC
## [1] 0.7641
## 
## $Gini
## [1] 0.5281
## 
## $pic
## TableGrob (1 x 2) "arrange": 2 grobs
##      z     cells    name           grob
## pks  1 (1-1,1-1) arrange gtable[layout]
## proc 2 (1-1,2-2) arrange gtable[layout]

附录 A: 使用到的包

A.1 library(lattice)

The lattice add-on of Trellis graphics for R. Ref:https://cran.r-project.org/web/packages/lattice/lattice.pdf

A.2 library(knitr)

For Dynamic Report Generation in R. Ref: https://cran.r-project.org/web/packages/knitr/knitr.pdf

A.3 library(gplots)

Various R Programming Tools for Plotting Data. Ref:https://cran.r-project.org/web/packages/gplots/gplots.pdf

A.4 library(ggplot2)

An Implementation of the Grammar of Graphics. Ref: https://cran.rstudio.com/web/packages/ggplot2/ggplot2.pdf

A.5 library(ClustOfVar)

Clustering of variables. Ref: https://cran.r-project.org/web/packages/ClustOfVar/ClustOfVar.pdf

A.6 library(ape)

Analyses of Phylogenetics and Evolution (as.phylo). Ref: https://cran.r-project.org/web/packages/ape/ape.pdf

A.7 library(Information)

Data Exploration with Information Theory (Weight-of-Evidence and Information Value). Ref: https://cran.r-project.org/web/packages/Information/Information.pdf

A.8 library(ROCR)

Visualizing the Performance of Scoring Classifiers. Ref: https://cran.r-project.org/web/packages/ROCR/ROCR.pdf

A.9 library(caret)

Classification and Regression Training - for any machine learning algorithms. Ref: ftp://cran.r-project.org/pub/R/web/packages/caret/caret.pdf

A.10 library(rpart)

Recursive partitioning for classification, regression and survival trees. Ref: https://cran.r-project.org/web/packages/rpart/rpart.pdf

A.10.1 library(rpart.utils)

Tools for parsing and manipulating rpart objects, including generating machine readable rules. Ref: https://cran.r-project.org/web/packages/rpart.utils/rpart.utils.pdf

A.10.2 library(rpart.plot)

Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. Ref: https://cran.r-project.org/web/packages/knitr/knitr.pdf

A.11 library(randomForest)

Leo Breiman and Cutler’s Random Forests for Classification and Regression. Ref: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

A.12 library(party)

A computational toolbox for recursive partitioning - Conditional inference Trees. Ref: https://cran.r-project.org/web/packages/party/party.pdf

A.13 library(bnlearn)

Bayesian Network Structure Learning, Parameter Learning and Inference. Ref: https://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf

A.14 library(DAAG)

Data Analysis and Graphics Data and Functions. Ref: https://cran.r-project.org/web/packages/DAAG/DAAG.pdf

A.15 library(vcd)

Visualizing Categorical Data. Ref: https://cran.r-project.org/web/packages/vcd/vcd.pdf

A.16 library(neuralnet)

Neural Network implementation. Ref: https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf

A.17 library(kernlab)

Kernel-Based Machine Learning Lab. Ref: https://cran.r-project.org/web/packages/kernlab/kernlab.pdf

A.18 library(glmnet)

Lasso and Elastic-Net Regularized Generalized Linear Models. Ref: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf

A.19 library(lars)

Least Angle Regression, Lasso and Forward Stagewise. Ref: ftp://cran.r-project.org/pub/R/web/packages/lars/lars.pdf