The basic difference between traditional modeling and machine learning is that “in traditional modeling we intend to set up a modeling framework and try to establish relationships while in machine learning we allow the model to learn from the data by understanding the hidden patterns”. The former therefore requires the analyst to have a solid understanding of statistical techniques and business knowledge, while the latter is more complex and computationally intensive, so it requires more computing power and a tech-savvy analyst.
Note that while traditional techniques perform well on small to large amounts of data, machine learning generally learns better on high-dimensional and complex data such as a Big Data setup.
kable(as.data.frame(colnames(Model_data)))
colnames(Model_data) |
---|
发货方式 |
州 |
原始来单金额 |
修改后金额 |
发货件数 |
原始来单件数 |
cod运费 |
用户性别 |
用户设备 |
app1 |
用户类型 |
地址种类 |
label |
下单小时 |
付款小时 |
下单与付款时间间隔 |
金额差异 |
件数差异 |
确认小时 |
付款到派送 |
The column names are in Chinese. In English, in the same order, they are: shipping method, state, original order amount, amended amount, items shipped, items originally ordered, COD shipping fee, user gender, user device, app1 (app platform and version), user type, address type, label, order hour, payment hour, interval between order and payment, amount difference, item-count difference, confirmation hour, and time from payment to delivery.
str(Model_data)
## Classes 'data.table' and 'data.frame': 322715 obs. of 20 variables:
## $ 发货方式 : chr "Delhivery" "Delhivery" "Ecom" "Ecom" ...
## $ 州 : chr "Telangana" "Telangana" "Maharashtra" "Maharashtra" ...
## $ 原始来单金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 修改后金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 发货件数 : num 1 1 1 1 1 6 1 3 1 1 ...
## $ 原始来单件数 : num 1 1 1 1 1 6 1 3 1 1 ...
## $ cod运费 : num 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
## $ 用户性别 : chr "women" "women" "men" "women" ...
## $ 用户设备 : chr "ios" "android" "android" "android" ...
## $ app1 : chr "iOS_4.1.0" "android_4.1.1" "android_4.2.2" "android_4.0.3" ...
## $ 用户类型 : chr "new_cod" "new_cod" "new_cod" "old_cod" ...
## $ 地址种类 : chr "Valid Address" "Valid Address" "Missing Rooftop" "Valid Address" ...
## $ label : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ 下单小时 : num 15 9 10 15 16 16 16 6 12 17 ...
## $ 付款小时 : num 14 11 15 16 16 17 11 6 6 9 ...
## $ 下单与付款时间间隔: num 19.5 16.9 17.4 16.9 19.6 ...
## $ 金额差异 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 件数差异 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 确认小时 : num 4 3 3 13 4 4 3 9 3 4 ...
## $ 付款到派送 : num 2.71 -0.477 -0.151 -0.127 -0.17 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(Model_data)
## 发货方式 州 原始来单金额 修改后金额
## Length:322715 Length:322715 Min. : 0.72 Min. : 0.390
## Class :character Class :character 1st Qu.: 3.22 1st Qu.: 3.210
## Mode :character Mode :character Median : 7.29 Median : 7.190
## Mean :10.12 Mean : 9.784
## 3rd Qu.:13.54 3rd Qu.:13.210
## Max. :51.55 Max. :51.550
##
## 发货件数 原始来单件数 cod运费 用户性别
## Min. : 0.000 Min. : 1.0 Min. :0.000 Length:322715
## 1st Qu.: 1.000 1st Qu.: 1.0 1st Qu.:0.770 Class :character
## Median : 1.000 Median : 1.0 Median :1.550 Mode :character
## Mean : 1.743 Mean : 1.8 Mean :1.197
## 3rd Qu.: 1.000 3rd Qu.: 2.0 3rd Qu.:1.550
## Max. :48.000 Max. :50.0 Max. :1.550
##
## 用户设备 app1 用户类型
## Length:322715 Length:322715 Length:322715
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## 地址种类 label 下单小时 付款小时
## Length:322715 0:285297 Min. : 0.00 Min. : 0.00
## Class :character 1: 37418 1st Qu.: 7.00 1st Qu.: 7.00
## Mode :character Median :11.00 Median :11.00
## Mean :10.85 Mean :10.87
## 3rd Qu.:15.00 3rd Qu.:15.00
## Max. :23.00 Max. :23.00
##
## 下单与付款时间间隔 金额差异 件数差异 确认小时
## Min. :-0.08387 Min. :-45.6000 Min. :-31.00000 Min. : 0.00
## 1st Qu.:-0.08260 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 7.00
## Median :-0.08166 Median : 0.0000 Median : 0.00000 Median : 9.00
## Mean : 0.00000 Mean : -0.3403 Mean : -0.05702 Mean : 9.37
## 3rd Qu.:-0.07987 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.:12.00
## Max. :19.57188 Max. : 5.1600 Max. : 1.00000 Max. :23.00
##
## 付款到派送
## Min. :-4.011
## 1st Qu.:-0.669
## Median :-0.063
## Mean : 0.000
## 3rd Qu.: 0.611
## Max. :10.935
## NA's :3441
pct(Model_data$label)
label | Count | Percentage |
---|---|---|
0 | 285297 | 88.41 |
1 | 37418 | 11.59 |
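`pct()` is not defined in this section. The following is a minimal sketch of what such a helper might look like, assuming it simply tabulates a factor into counts and percentages. The name `pct_sketch` is hypothetical and this is not necessarily the author's implementation; the counts are the ones from the table above.

```r
# Hypothetical stand-in for the author's pct() helper (assumption):
# counts and percentages per level of a factor.
pct_sketch <- function(x) {
  counts <- table(x)
  data.frame(
    Count = as.vector(counts),
    Percentage = round(100 * as.vector(counts) / sum(counts), 2),
    row.names = names(counts)
  )
}

# Rebuild a label vector with the class counts reported above
label_demo <- factor(rep(c("0", "1"), times = c(285297, 37418)))
print(pct_sketch(label_demo))
```

With roughly 88% of orders in class 0, the target is clearly imbalanced, which is worth keeping in mind when the data is split and when models are compared.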
Weight of Evidence (WOE): WOE shows the predictive power of an independent variable in relation to the dependent variable. It evolved from credit scoring as a way to magnify the separation between good and bad customers, so it is a measure of separation between two classes (good/bad, yes/no, 0/1, A/B, response/no-response). It is defined as:
$$\mathrm{WOE} = \ln\left(\frac{\text{Distribution of Non-Events (Good)}}{\text{Distribution of Events (Bad)}}\right)$$
It is computed from the basic odds ratio:
(Distribution of Good Credit Outcomes) / (Distribution of Bad Credit Outcomes)
Information Value (IV):
IV helps to select variables by ranking them by information value after grouping.
$$\mathrm{IV} = \sum \left(\%\text{Non-Events} - \%\text{Events}\right) \times \mathrm{WOE}$$
Efficiency:
$$\text{Efficiency} = \frac{\left|\%\text{Non-Events} - \%\text{Events}\right|}{2}$$
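As a worked example of these three formulas, the sketch below uses the good/bad counts of the two 发货方式 (shipping method) bins, copied from the binning output printed further down in this document. The variable names are illustrative only.

```r
# Good/bad counts for the two 发货方式 bins
# (XpressBees + Delhivery, Ecom), copied from the binning output.
good <- c(156606, 128691)
bad  <- c(15623, 21795)

# Share of all goods / all bads that falls into each bin
dist_good <- good / sum(good)
dist_bad  <- bad / sum(bad)

# WOE as defined in the text: ln(%good / %bad)
woe <- log(dist_good / dist_bad)

# IV: sum over bins of (%good - %bad) * WOE
iv <- sum((dist_good - dist_bad) * woe)

# Efficiency per bin: |%good - %bad| / 2
efficiency <- abs(dist_good - dist_bad) / 2

print(round(woe, 4))  # 0.2736 -0.2556
print(round(iv, 4))   # 0.0695
print(round(efficiency, 4))
```

Note that the `woe` column in the `scorecard` binning output has the opposite sign for the same bins (-0.2736 and 0.2556), which is consistent with that output using ln(%bad / %good). The IV is unaffected by the sign convention, and the value here matches the `total_iv` of 0.0695 reported for 发货方式.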
# Variable: 发货方式 (shipping method, categorical)
#-----------------------------------------------------------
# gbpct() is the author's helper; its result is used here for
# the per-level WOE (A1$WOE) and the level names (A1$Levels)
A1 <- gbpct(Model_data$发货方式)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$发货方式), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="Shipping method ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: Shipping method",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
# Variable: 州 (state, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$州)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$州), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="State ~ Good-Bad")
# barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
#   main="Score: State",
#   xlab="Category",
#   ylab="WOE"
# )
par(op1)  # restore the previous plotting parameters
# Variable: 用户性别 (user gender, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$用户性别)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$用户性别), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="User gender ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: User gender",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
### 3.4 User device (用户设备)
# Variable: 用户设备 (user device, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$用户设备)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$用户设备), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="User device ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: User device",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
# Variable: app1 (app platform and version, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$app1)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$app1), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="App version ~ Good-Bad")
# barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
#   main="Score: App version",
#   xlab="Category",
#   ylab="WOE"
# )
par(op1)  # restore the previous plotting parameters
# Variable: 用户类型 (user type, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$用户类型)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$用户类型), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="User type ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: User type",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
# Variable: 地址种类 (address type, categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$地址种类)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$地址种类), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="Address type ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: Address type",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
# Variable: 下单小时 (order hour, 0-23, treated as categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$下单小时)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$下单小时), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="Order hour ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: Order hour",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
# Variable: 付款小时 (payment hour, 0-23, treated as categorical)
#-----------------------------------------------------------
A1 <- gbpct(Model_data$付款小时)
op1 <- par(mfrow=c(1,2))  # two plots side by side
plot(as.factor(Model_data$付款小时), Model_data$label,
  ylab="Good-Bad", xlab="category",
  main="Payment hour ~ Good-Bad")
barplot(A1$WOE, col="brown", names.arg=c(A1$Levels),
  main="Score: Payment hour",
  xlab="Category",
  ylab="WOE"
)
par(op1)  # restore the previous plotting parameters
kable(iv)
variable | info_value |
---|---|
地址种类 | 0.4482661 |
app1 | 0.3126790 |
下单与付款时间间隔 | 0.2858385 |
cod运费 | 0.2818102 |
修改后金额 | 0.1986989 |
原始来单金额 | 0.1946768 |
金额差异 | 0.1632335 |
付款到派送 | 0.1379788 |
发货方式 | 0.1256872 |
用户性别 | 0.1238769 |
州 | 0.1158185 |
发货件数 | 0.0954921 |
原始来单件数 | 0.0929052 |
用户类型 | 0.0274259 |
确认小时 | 0.0205682 |
用户设备 | 0.0140496 |
付款小时 | 0.0119562 |
下单小时 | 0.0118502 |
件数差异 | 0.0073371 |
bins
## $发货方式
## variable bin count count_distr good bad
## 1: 发货方式 XpressBees%,%Delhivery 172229 0.5336876 156606 15623
## 2: 发货方式 Ecom 150486 0.4663124 128691 21795
## badprob woe bin_iv total_iv breaks
## 1: 0.09071062 -0.2736100 0.03595137 0.06954223 XpressBees%,%Delhivery
## 2: 0.14483075 0.2556453 0.03359086 0.06954223 Ecom
## is_special_values
## 1: FALSE
## 2: FALSE
##
## $州
## variable
## 1: 州
## 2: 州
## 3: 州
## 4: 州
## bin
## 1: West bengal%,%UTTAR PRADESH%,%madhya pradesh%,%west bengal%,%Uttar pradesh%,%new delhi%,%New Delhi%,%andhra pradesh%,%maharashtra%,%WEST BENGAL%,%uttar pardesh%,%MADHYA PRADESH%,%palakkad%,%Kheda%,%haryana%,%Andhra pradesh%,%Maharashtara%,%Pondicherry%,%RAJSTHAN%,%Tamil nadu%,%Tamilnadu%,%Jammu & Kashmir%,%J&K%,%maharasta%,%Hyderabad%,%daman%,%GUJARAT%,%Haryana,%,%Jharkhan%,%Chattisgarh%,%karnataka%,%kerala%,%West Bangal%,%Meghalaya%,%Mizoram%,%Nagaland%,%Goa%,%Arunachal Pradesh%,%Assam%,%Daman and Diu%,%Puducherry%,%Kerala
## 2: West Bengal%,%Tamil Nadu%,%Chandigarh%,%Karnataka%,%Sikkim%,%Chhattisgarh%,%Himachal Pradesh%,%Andhra Pradesh
## 3: Telangana%,%Manipur%,%Odisha%,%Tripura%,%Gujarat%,%Uttarakhand
## 4: Jammu and Kashmir%,%Haryana%,%Madhya Pradesh%,%Uttar Pradesh%,%Punjab%,%Rajasthan%,%Maharashtra%,%Jharkhand%,%Delhi%,%Bihar%,%punjab%,%Andaman and Nicobar Islands%,%tamil nadu%,%Hariyana
## count count_distr good bad badprob woe bin_iv
## 1: 21778 0.06748369 20513 1265 0.05808614 -0.75460784 0.0287454667
## 2: 87476 0.27106270 80294 7182 0.08210252 -0.38273813 0.0342551910
## 3: 62240 0.19286367 55233 7007 0.11258033 -0.03327208 0.0002107931
## 4: 151221 0.46858993 129257 21964 0.14524438 0.25898095 0.0346850655
## total_iv
## 1: 0.09789652
## 2: 0.09789652
## 3: 0.09789652
## 4: 0.09789652
## breaks
## 1: West bengal%,%UTTAR PRADESH%,%madhya pradesh%,%west bengal%,%Uttar pradesh%,%new delhi%,%New Delhi%,%andhra pradesh%,%maharashtra%,%WEST BENGAL%,%uttar pardesh%,%MADHYA PRADESH%,%palakkad%,%Kheda%,%haryana%,%Andhra pradesh%,%Maharashtara%,%Pondicherry%,%RAJSTHAN%,%Tamil nadu%,%Tamilnadu%,%Jammu & Kashmir%,%J&K%,%maharasta%,%Hyderabad%,%daman%,%GUJARAT%,%Haryana,%,%Jharkhan%,%Chattisgarh%,%karnataka%,%kerala%,%West Bangal%,%Meghalaya%,%Mizoram%,%Nagaland%,%Goa%,%Arunachal Pradesh%,%Assam%,%Daman and Diu%,%Puducherry%,%Kerala
## 2: West Bengal%,%Tamil Nadu%,%Chandigarh%,%Karnataka%,%Sikkim%,%Chhattisgarh%,%Himachal Pradesh%,%Andhra Pradesh
## 3: Telangana%,%Manipur%,%Odisha%,%Tripura%,%Gujarat%,%Uttarakhand
## 4: Jammu and Kashmir%,%Haryana%,%Madhya Pradesh%,%Uttar Pradesh%,%Punjab%,%Rajasthan%,%Maharashtra%,%Jharkhand%,%Delhi%,%Bihar%,%punjab%,%Andaman and Nicobar Islands%,%tamil nadu%,%Hariyana
## is_special_values
## 1: FALSE
## 2: FALSE
## 3: FALSE
## 4: FALSE
##
## $原始来单金额
## variable bin count count_distr good bad badprob
## 1: 原始来单金额 [-Inf,2) 20298 0.06289760 18324 1974 0.09725096
## 2: 原始来单金额 [2,4) 82577 0.25588213 73354 9223 0.11168970
## 3: 原始来单金额 [4,6) 43843 0.13585672 39918 3925 0.08952398
## 4: 原始来单金额 [6,10) 55017 0.17048169 48312 6705 0.12187142
## 5: 原始来单金额 [10,18) 72926 0.22597648 62305 10621 0.14564079
## 6: 原始来单金额 [18,28) 29221 0.09054739 25891 3330 0.11395914
## 7: 原始来单金额 [28, Inf) 18833 0.05835799 17193 1640 0.08708119
## woe bin_iv total_iv breaks is_special_values
## 1: -0.19677086 0.0022574432 0.03578174 2 FALSE
## 2: -0.04221780 0.0004487274 0.03578174 4 FALSE
## 3: -0.28808213 0.0100890135 0.03578174 6 FALSE
## 4: 0.05655241 0.0005571833 0.03578174 10 FALSE
## 5: 0.26217036 0.0171619061 0.03578174 18 FALSE
## 6: -0.01954424 0.0000343283 0.03578174 28 FALSE
## 7: -0.31842721 0.0052331416 0.03578174 Inf FALSE
##
## $修改后金额
## variable bin count count_distr good bad badprob
## 1: 修改后金额 [-Inf,2) 20384 0.06316409 18401 1983 0.09728218
## 2: 修改后金额 [2,4) 83151 0.25766078 73874 9277 0.11156811
## 3: 修改后金额 [4,6) 44200 0.13696295 40226 3974 0.08990950
## 4: 修改后金额 [6,10) 56409 0.17479510 49630 6779 0.12017586
## 5: 修改后金额 [10,18) 73848 0.22883349 63115 10733 0.14533907
## 6: 修改后金额 [18,25) 23496 0.07280728 20784 2712 0.11542390
## 7: 修改后金额 [25, Inf) 21227 0.06577630 19267 1960 0.09233523
## woe bin_iv total_iv breaks is_special_values
## 1: -0.196415291 2.259132e-03 0.03378337 2 FALSE
## 2: -0.043443850 4.782461e-04 0.03378337 4 FALSE
## 3: -0.283361538 9.858532e-03 0.03378337 6 FALSE
## 4: 0.040612980 2.928368e-04 0.03378337 10 FALSE
## 5: 0.259743519 1.704306e-02 0.03378337 18 FALSE
## 6: -0.005118219 1.903526e-06 0.03378337 25 FALSE
## 7: -0.254070444 3.849656e-03 0.03378337 Inf FALSE
##
## $发货件数
## variable bin count count_distr good bad badprob woe
## 1: 发货件数 [-Inf,2) 242394 0.75110856 210330 32064 0.13228050 0.1504351
## 2: 发货件数 [2,3) 30017 0.09301396 27705 2312 0.07702302 -0.4521211
## 3: 发货件数 [3, Inf) 50304 0.15587748 47262 3042 0.06047233 -0.7118125
## bin_iv total_iv breaks is_special_values
## 1: 0.01800438 0.09402303 2 FALSE
## 2: 0.01596932 0.09402303 3 FALSE
## 3: 0.06004934 0.09402303 Inf FALSE
##
## $原始来单件数
## variable bin count count_distr good bad badprob
## 1: 原始来单件数 [-Inf,2) 239614 0.74249415 207870 31744 0.13247974
## 2: 原始来单件数 [2,3) 29493 0.09139024 27176 2317 0.07856101
## 3: 原始来单件数 [3, Inf) 53608 0.16611561 50251 3357 0.06262125
## woe bin_iv total_iv breaks is_special_values
## 1: 0.1521697 0.01822272 0.09087764 2 FALSE
## 2: -0.4306821 0.01435595 0.09087764 3 FALSE
## 3: -0.6746039 0.05829897 0.09087764 Inf FALSE
##
## $cod运费
## variable bin count count_distr good bad badprob
## 1: cod运费 [-Inf,1.5) 143652 0.4451358 129995 13657 0.09507003
## 2: cod运费 [1.5, Inf) 179063 0.5548642 155302 23761 0.13269631
## woe bin_iv total_iv breaks is_special_values
## 1: -0.2218649 0.02011498 0.03408191 1.5 FALSE
## 2: 0.1540528 0.01396692 0.03408191 Inf FALSE
##
## $用户性别
## variable bin count count_distr good bad badprob
## 1: 用户性别 missing 1972 0.006110655 1694 278 0.14097363
## 2: 用户性别 not set%,%women 228872 0.709207815 207855 21017 0.09182862
## 3: 用户性别 men 91871 0.284681530 75748 16123 0.17549608
## woe bin_iv total_iv breaks is_special_values
## 1: 0.2241521 0.0003344142 0.1238244 missing TRUE
## 2: -0.2601302 0.0434092333 0.1238244 not set%,%women FALSE
## 3: 0.4842137 0.0800807579 0.1238244 men FALSE
##
## $用户设备
## variable bin count count_distr good bad badprob
## 1: 用户设备 missing 2467 0.007644516 2109 358 0.14511552
## 2: 用户设备 pc%,%mobile%,%ios 35046 0.108597369 32068 2978 0.08497403
## 3: 用户设备 android 285202 0.883758115 251120 34082 0.11950127
## woe bin_iv total_iv breaks is_special_values
## 1: 0.25794268 0.0005611005 0.01293809 missing TRUE
## 2: -0.34522784 0.0113285823 0.01293809 pc%,%mobile%,%ios FALSE
## 3: 0.03421734 0.0010484026 0.01293809 android FALSE
##
## $app1
## variable
## 1: app1
## 2: app1
## 3: app1
## 4: app1
## bin
## 1: missing
## 2: android_2.45%,%android_3.7.1%,%android_2.49%,%android_3.3.3%,%android_4.2.0%,%android_2.34%,%android_2.48%,%android_2.33%,%android_4.0.1%,%iOS_1.6.1%,%iOS_1.5.9%,%android_3.2.0%,%iOS_1.5.8%,%android_2.38%,%android_3.3.0%,%android_3.4.0%,%android_null%,%android_4.3.4%,%pc%,%android_4.3.5%,%iOS_1.6.2%,%android_4.0.2%,%android_4.3.3%,%iOS_1.9.1%,%iOS_2.0.0
## 3: iOS_1.8.0%,%iOS_4.2.0
## 4: iOS_2.0.1%,%android_4.3.0%,%iOS_1.7.0%,%iOS_2.1.0%,%iOS_1.9.0%,%android_3.6.2%,%android_3.7.3%,%android_4.2.1%,%iOS_4.0.0%,%android_4.3.2%,%android_3.8.0%,%android_3.4.3%,%android_3.2.1%,%android_3.9.1%,%android_3.3.1%,%android_3.8.1%,%android_4.0.3%,%android_3.1.1%,%android_3.5.5%,%iOS_4.1.0%,%android_3.4.2%,%android_3.4.1%,%android_4.1.1%,%android_4.2.3%,%android_4.1.0%,%android_3.9.0%,%mobile-pwa%,%iOS_1.6.0%,%mobile%,%android_3.0.2%,%android_3.5.2%,%android_2.42%,%android_3.6.1%,%android_3.0.1%,%android_4.2.2%,%android_3.7.0%,%android_4.0.0%,%android_3.5.1%,%android_4.3.1%,%android_2.44%,%android_2.50%,%android_2.40%,%android_2.46%,%android_2.47
## count count_distr good bad badprob woe bin_iv
## 1: 2467 0.007644516 2109 358 0.14511552 0.2579427 0.0005611005
## 2: 95468 0.295827588 90854 4614 0.04833033 -0.9487798 0.1851491250
## 3: 29467 0.091309670 27136 2331 0.07910544 -0.4231850 0.0138883801
## 4: 195313 0.605218227 165198 30115 0.15418841 0.3292575 0.0743423496
## total_iv
## 1: 0.273941
## 2: 0.273941
## 3: 0.273941
## 4: 0.273941
## breaks
## 1: missing
## 2: android_2.45%,%android_3.7.1%,%android_2.49%,%android_3.3.3%,%android_4.2.0%,%android_2.34%,%android_2.48%,%android_2.33%,%android_4.0.1%,%iOS_1.6.1%,%iOS_1.5.9%,%android_3.2.0%,%iOS_1.5.8%,%android_2.38%,%android_3.3.0%,%android_3.4.0%,%android_null%,%android_4.3.4%,%pc%,%android_4.3.5%,%iOS_1.6.2%,%android_4.0.2%,%android_4.3.3%,%iOS_1.9.1%,%iOS_2.0.0
## 3: iOS_1.8.0%,%iOS_4.2.0
## 4: iOS_2.0.1%,%android_4.3.0%,%iOS_1.7.0%,%iOS_2.1.0%,%iOS_1.9.0%,%android_3.6.2%,%android_3.7.3%,%android_4.2.1%,%iOS_4.0.0%,%android_4.3.2%,%android_3.8.0%,%android_3.4.3%,%android_3.2.1%,%android_3.9.1%,%android_3.3.1%,%android_3.8.1%,%android_4.0.3%,%android_3.1.1%,%android_3.5.5%,%iOS_4.1.0%,%android_3.4.2%,%android_3.4.1%,%android_4.1.1%,%android_4.2.3%,%android_4.1.0%,%android_3.9.0%,%mobile-pwa%,%iOS_1.6.0%,%mobile%,%android_3.0.2%,%android_3.5.2%,%android_2.42%,%android_3.6.1%,%android_3.0.1%,%android_4.2.2%,%android_3.7.0%,%android_4.0.0%,%android_3.5.1%,%android_4.3.1%,%android_2.44%,%android_2.50%,%android_2.40%,%android_2.46%,%android_2.47
## is_special_values
## 1: TRUE
## 2: FALSE
## 3: FALSE
## 4: FALSE
##
## $用户类型
## variable
## 1: 用户类型
## 2: 用户类型
## 3: 用户类型
## bin count
## 1: old_prepaid_old_cod%,%old_prepaid_new_cod%,%new_prepaid_old_cod 31057
## 2: old_cod 47207
## 3: new_cod%,%new_prepaid_new_cod 244451
## count_distr good bad badprob woe bin_iv total_iv
## 1: 0.09623662 28600 2457 0.0791126 -0.42308675 0.014631535 0.02626036
## 2: 0.14628077 42704 4503 0.0953884 -0.21816988 0.006400987 0.02626036
## 3: 0.75748261 213993 30458 0.1245976 0.08178425 0.005227836 0.02626036
## breaks
## 1: old_prepaid_old_cod%,%old_prepaid_new_cod%,%new_prepaid_old_cod
## 2: old_cod
## 3: new_cod%,%new_prepaid_new_cod
## is_special_values
## 1: FALSE
## 2: FALSE
## 3: FALSE
##
## $地址种类
## variable bin count
## 1: 地址种类 missing 32963
## 2: 地址种类 Valid Address 211036
## 3: 地址种类 Missing Rooftop with POI 27599
## 4: 地址种类 Missing Rooftop%,%Inappropriate%,%Incomplete%,%Junk 51117
## count_distr good bad badprob woe bin_iv total_iv
## 1: 0.10214276 32836 127 0.003852805 -3.52371480 3.935990e-01 0.4450414
## 2: 0.65393923 186341 24695 0.117017950 0.01040134 7.103135e-05 0.4450414
## 3: 0.08552128 24204 3395 0.123011703 0.06716472 3.958573e-04 0.4450414
## 4: 0.15839673 41916 9201 0.179998826 0.51502343 5.097554e-02 0.4450414
## breaks is_special_values
## 1: missing TRUE
## 2: Valid Address FALSE
## 3: Missing Rooftop with POI FALSE
## 4: Missing Rooftop%,%Inappropriate%,%Incomplete%,%Junk FALSE
##
## $下单小时
## variable bin count count_distr good bad badprob
## 1: 下单小时 [-Inf,5) 37569 0.11641541 32658 4911 0.1307195
## 2: 下单小时 [5,17) 231021 0.71586694 205342 25679 0.1111544
## 3: 下单小时 [17,19) 34948 0.10829370 30943 4005 0.1145988
## 4: 下单小时 [19, Inf) 19177 0.05942395 16354 2823 0.1472076
## woe bin_iv total_iv breaks is_special_values
## 1: 0.13676661 0.0022945074 0.008885842 5 FALSE
## 2: -0.04762447 0.0015941915 0.008885842 17 FALSE
## 3: -0.01322435 0.0000188428 0.008885842 19 FALSE
## 4: 0.27470650 0.0049783008 0.008885842 Inf FALSE
##
## $付款小时
## variable bin count count_distr good bad badprob
## 1: 付款小时 [-Inf,5) 37095 0.11494662 32224 4871 0.1313115
## 2: 付款小时 [5,17) 230946 0.71563454 205269 25677 0.1111818
## 3: 付款小时 [17,19) 35187 0.10903429 31172 4015 0.1141046
## 4: 付款小时 [19, Inf) 19487 0.06038455 16632 2855 0.1465079
## woe bin_iv total_iv breaks is_special_values
## 1: 0.14196661 2.445947e-03 0.008901763 5 FALSE
## 2: -0.04734679 1.575312e-03 0.008901763 17 FALSE
## 3: -0.01810404 3.548894e-05 0.008901763 19 FALSE
## 4: 0.26912216 4.845015e-03 0.008901763 Inf FALSE
##
## $下单与付款时间间隔
## variable bin count count_distr good bad
## 1: 下单与付款时间间隔 [-Inf,-0.083) 38240 0.1184946 33125 5115
## 2: 下单与付款时间间隔 [-0.083,-0.0814) 137899 0.4273089 120792 17107
## 3: 下单与付款时间间隔 [-0.0814,-0.0774) 106184 0.3290334 94815 11369
## 4: 下单与付款时间间隔 [-0.0774, Inf) 40392 0.1251631 36565 3827
## badprob woe bin_iv total_iv breaks is_special_values
## 1: 0.13376046 0.16326799 0.003361988 0.01435376 -0.083 FALSE
## 2: 0.12405456 0.07679655 0.002595418 0.01435376 -0.0814 FALSE
## 3: 0.10706886 -0.08965840 0.002555278 0.01435376 -0.0774 FALSE
## 4: 0.09474648 -0.22563142 0.005841080 0.01435376 Inf FALSE
##
## $金额差异
## variable bin count count_distr good bad badprob woe
## 1: 金额差异 [-Inf, Inf) 322715 1 285297 37418 0.1159475 0
## bin_iv total_iv breaks is_special_values
## 1: 0 0 Inf FALSE
##
## $件数差异
## variable bin count count_distr good bad badprob woe
## 1: 件数差异 [-Inf, Inf) 322715 1 285297 37418 0.1159475 0
## bin_iv total_iv breaks is_special_values
## 1: 0 0 Inf FALSE
##
## $确认小时
## variable bin count count_distr good bad badprob
## 1: 确认小时 [-Inf,5) 20463 0.06340889 18528 1935 0.09456091
## 2: 确认小时 [5,12) 220152 0.68218707 193376 26776 0.12162506
## 3: 确认小时 [12,13) 29449 0.09125389 26112 3337 0.11331454
## 4: 确认小时 [13, Inf) 52651 0.16315015 47281 5370 0.10199236
## woe bin_iv total_iv breaks is_special_values
## 1: -0.22779690 3.013701e-03 0.00832062 5 FALSE
## 2: 0.05424835 2.049801e-03 0.00832062 12 FALSE
## 3: -0.02594391 6.081242e-05 0.00832062 13 FALSE
## 4: -0.14390174 3.196306e-03 0.00832062 Inf FALSE
##
## $付款到派送
## variable bin count count_distr good bad badprob
## 1: 付款到派送 missing 3441 0.01066266 1327 2114 0.61435629
## 2: 付款到派送 [-Inf,0.2) 190833 0.59133601 168293 22540 0.11811374
## 3: 付款到派送 [0.2,1) 81786 0.25343105 73034 8752 0.10701098
## 4: 付款到派送 [1,1.4) 20867 0.06466077 18904 1963 0.09407198
## 5: 付款到派送 [1.4, Inf) 25788 0.07990952 23739 2049 0.07945556
## woe bin_iv total_iv breaks is_special_values
## 1: 2.49704000 0.1294604871 0.1468417 missing TRUE
## 2: 0.02096387 0.0002619823 0.1468417 0.2 FALSE
## 3: -0.09026397 0.0019943602 0.1468417 1 FALSE
## 4: -0.23352075 0.0032224443 0.1468417 1.4 FALSE
## 5: -0.41838853 0.0119024356 0.1468417 Inf FALSE
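The bin listing above is what a WOE transformation relies on: each raw value is replaced by the `woe` of the bin it falls into. The sketch below illustrates that lookup for 发货方式 in base R, using the two WOE values printed above. It is a simplified illustration of the idea, not the `scorecard` implementation, and `woe_map` is a name introduced here.

```r
# WOE per shipping-method level, copied from the $发货方式 bins above.
# XpressBees and Delhivery share one bin, so they share one WOE value.
woe_map <- c(XpressBees = -0.2736100,
             Delhivery  = -0.2736100,
             Ecom       =  0.2556453)

# First four shipping methods as shown in str(Model_data)
shipping <- c("Delhivery", "Delhivery", "Ecom", "Ecom")

# Replace each level by the WOE of its bin
shipping_woe <- unname(woe_map[shipping])
print(shipping_woe)
```

After this substitution every predictor is numeric and on a comparable scale, which is why the logistic regression later in the document is fitted on columns with a `_woe` suffix.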
library(tidyverse)
kable(iv %>% filter(info_value<0.02))
## Warning: package 'bindrcpp' was built under R version 3.4.4
variable | info_value |
---|---|
用户设备 | 0.0140496 |
付款小时 | 0.0119562 |
下单小时 | 0.0118502 |
件数差异 | 0.0073371 |
library(tidyverse)
kable(iv %>% filter(info_value>=0.02,info_value<0.1))
variable | info_value |
---|---|
发货件数 | 0.0954921 |
原始来单件数 | 0.0929052 |
用户类型 | 0.0274259 |
确认小时 | 0.0205682 |
library(tidyverse)
kable(iv %>% filter(info_value>=0.1,info_value<0.3))
variable | info_value |
---|---|
下单与付款时间间隔 | 0.2858385 |
cod运费 | 0.2818102 |
修改后金额 | 0.1986989 |
原始来单金额 | 0.1946768 |
金额差异 | 0.1632335 |
付款到派送 | 0.1379788 |
发货方式 | 0.1256872 |
用户性别 | 0.1238769 |
州 | 0.1158185 |
library(tidyverse)
kable(iv %>% filter(info_value>=0.3,info_value<0.5))
variable | info_value |
---|---|
地址种类 | 0.4482661 |
app1 | 0.3126790 |
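The four filters above split the variables at IV cut points of 0.02, 0.1, 0.3 and 0.5. The same banding can be done in one step with `cut()`. In the sketch below the cut points come from the filters, while the band labels are an assumption based on a commonly quoted rule of thumb; the document itself does not name the bands. `iv_demo` holds four rows copied from the IV table.

```r
# Four rows copied from the IV table in this document
iv_demo <- data.frame(
  variable   = c("地址种类", "cod运费", "用户类型", "件数差异"),
  info_value = c(0.4482661, 0.2818102, 0.0274259, 0.0073371)
)

# Band labels are an assumption (rule-of-thumb names, not from the source).
# right = FALSE gives intervals [a, b), matching filter(>= a, < b).
iv_demo$strength <- cut(
  iv_demo$info_value,
  breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
  labels = c("not useful", "weak", "medium", "strong", "suspiciously high"),
  right  = FALSE
)
print(iv_demo)
```

A single banded column makes it easy to keep or drop whole groups of variables, for example everything labelled "medium" or better.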
var_list_1 <- iv %>% filter(info_value>0.1) %>% select(variable) # 11 variables with IV above 0.1
Model_data1 <- Model_data %>% select(var_list_1$variable,label) # 11 predictors + label = 12 columns
head(Model_data1)
## 地址种类 app1 下单与付款时间间隔 cod运费 修改后金额
## 1: Valid Address iOS_4.1.0 19.45732 1.55 5.60
## 2: Valid Address android_4.1.1 16.93115 1.55 6.92
## 3: Missing Rooftop android_4.2.2 17.41311 1.55 10.32
## 4: Valid Address android_4.0.3 16.85653 1.55 4.67
## 5: Missing Rooftop android_4.1.1 19.56840 1.55 10.26
## 6: Valid Address iOS_4.1.0 16.91516 1.55 16.02
## 原始来单金额 金额差异 付款到派送 发货方式 用户性别 州 label
## 1: 5.60 0 2.7096488 Delhivery women Telangana 0
## 2: 6.92 0 -0.4770722 Delhivery women Telangana 0
## 3: 10.32 0 -0.1513002 Ecom men Maharashtra 0
## 4: 4.67 0 -0.1274765 Ecom women Maharashtra 0
## 5: 10.26 0 -0.1704649 Delhivery men Karnataka 0
## 6: 16.02 0 0.2219836 Delhivery women Karnataka 0
Clustering of variables is a way to arrange variables into homogeneous clusters, i.e., groups of variables that are strongly related to each other and thus carry the same information.
When there are a large number of variables, this should be done well before univariate analysis. Principal Component Analysis (PCA), Multiple Correspondence Analysis (MCA), or Factor Analysis can be used for the same purpose.
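To make the idea concrete, here is a simplified stand-in in base R: numeric variables are clustered with `hclust()` on the distance 1 minus squared correlation. This is not the ClustOfVar algorithm used in this document (which also handles categorical variables); the toy data and all variable names are made up for illustration.

```r
set.seed(1)
n <- 200
amount <- rnorm(n)

# Toy data (assumption): two nearly identical amount columns
# plus two unrelated columns.
toy <- data.frame(
  amount_original = amount,
  amount_revised  = amount + rnorm(n, sd = 0.1),
  delivery_time   = rnorm(n),
  order_gap       = rnorm(n)
)

# Strongly correlated variables get a distance close to 0
d <- as.dist(1 - cor(toy)^2)
tree_toy <- hclust(d, method = "average")

# Cut the tree into 3 clusters of variables
groups <- cutree(tree_toy, k = 3)
print(groups)
```

The two amount columns end up in the same cluster because they carry almost the same information, so only one of them would normally need to be kept for modeling.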
Model_data1$app1 <- as.factor(Model_data1$app1)
Model_data1$label <- as.factor(Model_data1$label)
Model_data1$地址种类 <- as.factor(Model_data1$地址种类)
Model_data1$发货方式 <- as.factor(Model_data1$发货方式)
Model_data1$用户性别 <- as.factor(Model_data1$用户性别)
Model_data1$州 <- as.factor(Model_data1$州)
factors <- sapply(Model_data1, is.factor)
#subset Qualitative variables
vars_quali <- Model_data1 %>% select(names(Model_data1)[factors])
#vars_quali$good_bad_21<-vars_quali$good_bad_21[drop=TRUE] # remove empty factors
str(vars_quali)
## Classes 'data.table' and 'data.frame': 322715 obs. of 6 variables:
## $ 地址种类: Factor w/ 6 levels "Inappropriate",..: 6 6 4 6 4 6 6 6 4 6 ...
## $ app1 : Factor w/ 71 levels "android_2.33",..: 67 42 45 40 42 67 42 32 42 29 ...
## $ 发货方式: Factor w/ 3 levels "Delhivery","Ecom",..: 1 1 2 2 1 1 1 2 2 1 ...
## $ 用户性别: Factor w/ 3 levels "men","not set",..: 3 3 1 3 1 3 3 3 3 1 ...
## $ 州 : Factor w/ 70 levels "Andaman and Nicobar Islands",..: 59 59 38 38 29 29 29 14 29 29 ...
## $ label : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
#subset Quantitative variables
vars_quanti <- Model_data1 %>% select(names(Model_data1)[!factors])
str(vars_quanti)
## Classes 'data.table' and 'data.frame': 322715 obs. of 6 variables:
## $ 下单与付款时间间隔: num 19.5 16.9 17.4 16.9 19.6 ...
## $ cod运费 : num 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
## $ 修改后金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 原始来单金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 金额差异 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 付款到派送 : num 2.71 -0.477 -0.151 -0.127 -0.17 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Step 2: Hierarchical clustering of variables
library(ClustOfVar)  # provides hclustvar() and cutreevar()
# For help, type ?hclustvar in the R console
tree <- hclustvar(X.quanti=vars_quanti,X.quali=vars_quali)
par(family='STKaiti')
plot(tree, main="variable clustering")
rect.hclust(tree, k=8, border = 1:8)
summary(tree)
## Length Class Mode
## call 3 -none- call
## rec 16 -none- list
## init 12 -none- numeric
## merge 22 -none- numeric
## height 11 -none- numeric
## order 12 -none- numeric
## labels 12 -none- character
## clusmat 144 -none- numeric
## X.quanti 6 data.table list
## X.quali 6 data.table list
# Phylogenetic-style fan plot of the variable tree
library(ape)  # provides as.phylo()
par(family='STKaiti')
plot(as.phylo(tree), type = "fan",
tip.color = hsv(runif(15, 0.65, 0.95), 1, 1, 0.7),
edge.color = hsv(runif(10, 0.65, 0.75), 1, 1, 0.7),
edge.width = runif(20, 0.5, 3), use.edge.length = TRUE, col = "gray80")
summary.phylo(as.phylo(tree))
##
## Phylogenetic tree: as.phylo(tree)
##
## Number of tips: 12
## Number of nodes: 11
## Branch lengths:
## mean: 0.2498154
## variance: 0.02762882
## distribution summary:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0.01203149 0.11483605 0.24931255 0.40189405 0.49995107
## No root edge.
## First ten tip labels: 下单与付款时间间隔
## cod运费
## 修改后金额
## 原始来单金额
## 金额差异
## 付款到派送
## 地址种类
## app1
## 发货方式
## 用户性别
## No node labels.
part<-cutreevar(tree,8)
print(part)
##
## Call:
## cutreevar(obj = tree, k = 8)
##
##
##
## name
## "$var"
## "$sim"
## "$cluster"
## "$wss"
## "$E"
## "$size"
## "$scores"
## "$coef"
## description
## "list of variables in each cluster"
## "similarity matrix in each cluster"
## "cluster memberships"
## "within-cluster sum of squares"
## "gain in cohesion (in %)"
## "size of each cluster"
## "synthetic score of each cluster"
## "coef of the linear combinations defining the synthetic scores of each cluster"
summary(part)
##
## Call:
## cutreevar(obj = tree, k = 8)
##
##
##
## Data:
## number of observations: 322715
## number of variables: 12
## number of numerical variables: 6
## number of categorical variables: 6
## number of clusters: 8
##
## Cluster 1 :
## squared loading correlation
## 1 1
##
##
## Cluster 2 :
## squared loading correlation
## 修改后金额 0.93 -0.96
## 原始来单金额 0.92 -0.96
## cod运费 0.65 -0.81
##
##
## Cluster 3 :
## squared loading correlation
## 1 1
##
##
## Cluster 4 :
## squared loading correlation
## 州 0.68 NA
## 付款到派送 0.56 -0.75
## 发货方式 0.44 NA
##
##
## Cluster 5 :
## squared loading correlation
## 1 NA
##
##
## Cluster 6 :
## squared loading correlation
## 1 NA
##
##
## Cluster 7 :
## squared loading correlation
## 1 NA
##
##
## Cluster 8 :
## squared loading correlation
## 1 NA
##
##
## Gain in cohesion (in %): 80.38
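Keep only the important variables identified by the variable cluster analysis. In the code below the selection step is commented out, so all 12 columns are carried forward.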
# cod运费
# 付款到派送
# keep<- c(1,2,3,4,7,8,10,12)
cdata_reduced_2 <- Model_data1 # %>% select(keep)
str(cdata_reduced_2)
## Classes 'data.table' and 'data.frame': 322715 obs. of 12 variables:
## $ 地址种类 : Factor w/ 6 levels "Inappropriate",..: 6 6 4 6 4 6 6 6 4 6 ...
## $ app1 : Factor w/ 71 levels "android_2.33",..: 67 42 45 40 42 67 42 32 42 29 ...
## $ 下单与付款时间间隔: num 19.5 16.9 17.4 16.9 19.6 ...
## $ cod运费 : num 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 1.55 ...
## $ 修改后金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 原始来单金额 : num 5.6 6.92 10.32 4.67 10.26 ...
## $ 金额差异 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 付款到派送 : num 2.71 -0.477 -0.151 -0.127 -0.17 ...
## $ 发货方式 : Factor w/ 3 levels "Delhivery","Ecom",..: 1 1 2 2 1 1 1 2 2 1 ...
## $ 用户性别 : Factor w/ 3 levels "men","not set",..: 3 3 1 3 1 3 3 3 3 1 ...
## $ 州 : Factor w/ 70 levels "Andaman and Nicobar Islands",..: 59 59 38 38 29 29 29 14 29 29 ...
## $ label : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
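One common way to act on a variable clustering is to keep a single representative per cluster, for example the variable with the highest squared loading. The sketch below illustrates that rule on the squared loadings printed by `summary(part)` for clusters 2 and 4. The helper `pick_representatives` is hypothetical, and this is not necessarily the rule used here, since the `select(keep)` step above is commented out and all 12 columns are retained.

```r
# Squared loadings copied from summary(part) for the two multi-variable clusters.
loadings <- data.frame(
  cluster    = c(2, 2, 2, 4, 4, 4),
  variable   = c("修改后金额", "原始来单金额", "cod运费", "州", "付款到派送", "发货方式"),
  sq_loading = c(0.93, 0.92, 0.65, 0.68, 0.56, 0.44),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the variable with the highest squared loading in each cluster.
pick_representatives <- function(df) {
  by_cluster <- split(df, df$cluster)
  picked <- lapply(by_cluster, function(d) d[which.max(d$sq_loading), ])
  do.call(rbind, picked)
}

reps <- pick_representatives(loadings)
print(reps)
```

Single-variable clusters need no decision: their only member is kept. Business knowledge can still override the loading rule when two variables in a cluster are nearly tied, as 修改后金额 and 原始来单金额 are here.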
bins <- scorecard::woebin(cdata_reduced_2,y = 'label')
dt_woe <- scorecard::woebin_ply(cdata_reduced_2,bins)
## Woe transformating on 322715 rows and 11 columns in 00:00:15
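To make the WOE transformation less of a black box, here is a minimal base R sketch of what happens for one categorical variable once the bins are fixed. It assumes the convention WOE = ln(share of label 1 in the bin / share of label 0 in the bin), which is consistent with the positive coefficients on the `_woe` variables in the logistic model. The toy data and the helper `woe_table` are hypothetical, and `woebin()` additionally searches for the bins themselves, which this sketch does not attempt.

```r
# Toy data (hypothetical): one categorical predictor and a binary label.
toy <- data.frame(
  device = c("ios", "ios", "ios", "ios",
             "android", "android", "android", "android", "android", "android"),
  label  = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# Hypothetical helper: WOE per level and the information value (IV) of the variable.
woe_table <- function(x, y) {
  pos <- tapply(y == 1, x, sum)     # number of label-1 rows per level
  neg <- tapply(y == 0, x, sum)     # number of label-0 rows per level
  pos_share <- pos / sum(pos)       # distribution of label 1 across levels
  neg_share <- neg / sum(neg)       # distribution of label 0 across levels
  woe <- log(pos_share / neg_share)
  iv  <- sum((pos_share - neg_share) * woe)
  list(woe = woe, iv = iv)
}

res <- woe_table(toy$device, toy$label)
print(round(res$woe, 4))
print(round(res$iv, 4))
```

The sketch does not guard against levels with zero positives or zero negatives, where the logarithm is undefined; real binning routines merge or adjust such bins.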
dt_woe$label <- as.factor(dt_woe$label)
# Requires library(caret); createDataPartition() samples within each level of label
div_part_1 <- createDataPartition(y = dt_woe$label, p = 0.7, list = FALSE)
# Training Sample
train_1 <- dt_woe[div_part_1,] # 70% here
pct(train_1$label)
label | Count | Percentage |
---|---|---|
0 | 199708 | 88.41 |
1 | 26193 | 11.59 |
# Test Sample
test_1 <- dt_woe[-div_part_1,] # rest of the 30% data goes here
pct(test_1$label)
label | Count | Percentage |
---|---|---|
0 | 85589 | 88.41 |
1 | 11225 | 11.59 |
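Both samples show the same 88.41% / 11.59% split because the partition is stratified on `label`. A minimal base R sketch of the same idea follows; the helper `stratified_index`, the toy label vector and the seed are hypothetical and only illustrate the technique, not caret's internals.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Toy label vector with an imbalance similar to the data (12% ones).
label <- factor(c(rep("0", 880), rep("1", 120)))

# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(label, p = 0.7)
train_y <- label[train_idx]    # 70% of each class
test_y  <- label[-train_idx]   # remaining 30% of each class

print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

Sampling within each class keeps the rare class at the same rate in both samples, which matters here because only about one row in nine has `label` equal to 1.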
The most important step in developing a model is selecting the right modeling algorithm(s). Here I discuss several machine learning techniques. You may use one of them, or a combination of a few, to get the best result.
# library(stats)
# Model: Stepwise Logistic Regression Model
m1 <- glm(label~.,data=train_1,family=binomial())
m1 <- step(m1)
## Start: AIC=141047.6
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe +
## cod运费_woe + 修改后金额_woe + 原始来单金额_woe +
## 金额差异_woe + 付款到派送_woe + 发货方式_woe +
## 用户性别_woe + 州_woe
##
##
## Step: AIC=141047.6
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe +
## cod运费_woe + 修改后金额_woe + 原始来单金额_woe +
## 付款到派送_woe + 发货方式_woe + 用户性别_woe +
## 州_woe
##
## Df Deviance AIC
## - 修改后金额_woe 1 141026 141046
## <none> 141026 141048
## - 原始来单金额_woe 1 141035 141055
## - cod运费_woe 1 141135 141155
## - 下单与付款时间间隔_woe 1 141304 141324
## - 州_woe 1 141356 141376
## - 发货方式_woe 1 141451 141471
## - 用户性别_woe 1 142586 142606
## - 付款到派送_woe 1 146103 146123
## - app1_woe 1 146296 146316
## - 地址种类_woe 1 146414 146434
##
## Step: AIC=141045.7
## label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe +
## cod运费_woe + 原始来单金额_woe + 付款到派送_woe +
## 发货方式_woe + 用户性别_woe + 州_woe
##
## Df Deviance AIC
## <none> 141026 141046
## - 原始来单金额_woe 1 141110 141128
## - cod运费_woe 1 141138 141156
## - 下单与付款时间间隔_woe 1 141304 141322
## - 州_woe 1 141356 141374
## - 发货方式_woe 1 141451 141469
## - 用户性别_woe 1 142586 142604
## - 付款到派送_woe 1 146103 146121
## - app1_woe 1 146296 146314
## - 地址种类_woe 1 146414 146432
summary(m1)
##
## Call:
## glm(formula = label ~ 地址种类_woe + app1_woe + 下单与付款时间间隔_woe +
## cod运费_woe + 原始来单金额_woe + 付款到派送_woe +
## 发货方式_woe + 用户性别_woe + 州_woe, family = binomial(),
## data = train_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6056 -0.5354 -0.3983 -0.2332 3.8536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.033101 0.007605 -267.351 <2e-16 ***
## 地址种类_woe 1.007327 0.021000 47.967 <2e-16 ***
## app1_woe 1.019543 0.015437 66.045 <2e-16 ***
## 下单与付款时间间隔_woe 0.972107 0.058647 16.575 <2e-16 ***
## cod运费_woe 0.492360 0.046392 10.613 <2e-16 ***
## 原始来单金额_woe 0.407480 0.044695 9.117 <2e-16 ***
## 付款到派送_woe 1.422488 0.021926 64.877 <2e-16 ***
## 发货方式_woe 0.766012 0.037228 20.576 <2e-16 ***
## 用户性别_woe 0.793704 0.019867 39.952 <2e-16 ***
## 州_woe 0.544986 0.030095 18.109 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 162095 on 225900 degrees of freedom
## Residual deviance: 141026 on 225891 degrees of freedom
## AIC: 141046
##
## Number of Fisher Scoring iterations: 7
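A few of the numbers in this summary can be cross-checked by hand, which is a useful sanity test on the fit. The sketch below uses only values copied from the output above; McFadden's pseudo R-squared is an extra diagnostic that I add for illustration and is not part of the original output.

```r
# Numbers copied from the summary(m1) output above.
intercept      <- -2.033101
null_dev       <- 162095
resid_dev      <- 141026
n_coefficients <- 10   # intercept + 9 WOE predictors

# Predicted probability when every WOE predictor equals 0 (inverse logit of the intercept).
p_baseline <- plogis(intercept)

# For a binomial glm on 0/1 responses: AIC = residual deviance + 2 * number of coefficients.
aic <- resid_dev + 2 * n_coefficients

# McFadden's pseudo R-squared (illustrative addition, not reported by summary()).
pseudo_r2 <- 1 - resid_dev / null_dev

print(round(p_baseline, 4))
print(aic)
print(round(pseudo_r2, 4))
```

The baseline probability comes out at about 0.116, close to the 11.59% share of `label` 1 in the training sample, and the recomputed AIC matches the 141046 reported by `summary(m1)`.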
# List of significant variables and features with p-value <0.01
significant.variables <- summary(m1)$coeff[-1,4] < 0.01
names(significant.variables)[significant.variables == TRUE]
## [1] "地址种类_woe" "app1_woe"
## [3] "下单与付款时间间隔_woe" "cod运费_woe"
## [5] "原始来单金额_woe" "付款到派送_woe"
## [7] "发货方式_woe" "用户性别_woe"
## [9] "州_woe"
dt_pred = predict(m1, type='response', test_1)
perf_eva(test_1$label, dt_pred, type = c("ks","lift","roc","pr"))
## Warning: Removed 1 rows containing missing values (geom_path).
## $KS
## [1] 0.3423
##
## $AUC
## [1] 0.7447
##
## $Gini
## [1] 0.4894
##
## $pic
## TableGrob (2 x 2) "arrange": 4 grobs
## z cells name grob
## pks 1 (1-1,1-1) arrange gtable[layout]
## plift 2 (1-1,2-2) arrange gtable[layout]
## proc 3 (2-2,1-1) arrange gtable[layout]
## ppr 4 (2-2,2-2) arrange gtable[layout]
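The three numbers reported by `perf_eva()` are closely related: Gini is 2 * AUC - 1 (here 2 * 0.7447 - 1 = 0.4894), and KS is the largest gap between the cumulative score distributions of the two classes. The base R sketch below computes all three on toy data. The helper `rank_metrics` and the toy vectors are hypothetical, and the code is an independent illustration, not the implementation used by `scorecard`.

```r
# Hypothetical helper: KS, AUC and Gini for a binary label (1 = positive) and a score.
rank_metrics <- function(label, score) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  # AUC via the Mann-Whitney rank-sum identity (ties receive average ranks).
  r <- rank(c(pos, neg))
  auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
  # KS: largest gap between the two empirical cumulative distributions of the score.
  grid <- sort(unique(score))
  ks <- max(abs(ecdf(pos)(grid) - ecdf(neg)(grid)))
  list(KS = ks, AUC = auc, Gini = 2 * auc - 1)
}

# Toy data (hypothetical): higher scores should indicate label 1.
label <- c(0, 0, 0, 0, 1, 0, 1, 1)
score <- c(0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9)

m <- rank_metrics(label, score)
print(m)
```

An AUC below 0.5, and therefore a negative Gini, means the score ranks the classes in reverse, which usually signals that the probability of the wrong class was passed in.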
# Requires library(randomForest)
m3 <- randomForest(label ~ ., data = train_1)
par(family='STKaiti')
varImpPlot(m3, main="Random Forest: Variable Importance")
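# predict(type = 'prob') returns one column per class level ("0", "1");
# perf_eva() needs the probability of the positive class, i.e. column "1".
# The metrics printed below were produced with column 1, P(label = 0), hence AUC < 0.5.
dt_pred = predict(m3, type='prob', test_1)[,"1"]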
perf_eva(test_1$label, dt_pred, type = c("ks","lift","roc","pr"))
## Warning: Removed 1 rows containing missing values (geom_path).
## $KS
## [1] 0.1619
##
## $AUC
## [1] 0.4052
##
## $Gini
## [1] -0.1895
##
## $pic
## TableGrob (2 x 2) "arrange": 4 grobs
## z cells name grob
## pks 1 (1-1,1-1) arrange gtable[layout]
## plift 2 (1-1,2-2) arrange gtable[layout]
## proc 3 (2-2,1-1) arrange gtable[layout]
## ppr 4 (2-2,2-2) arrange gtable[layout]
The random forest's AUC is very low, and the unbalanced sample (about 88% of the labels are 0) is a likely contributor. Use a resampling method to balance the sample before fitting.
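The object name `model_rf_under` suggests that the balanced model was fitted on an undersampled training set, but the fitting code is not shown, so the following is only a sketch of random undersampling under that assumption. The helper `undersample`, the toy data and the seed are hypothetical.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Toy data (hypothetical) with 12% positives, similar to the training sample.
toy <- data.frame(
  x     = rnorm(1000),
  label = factor(c(rep("0", 880), rep("1", 120)))
)

# Hypothetical helper: keep every minority row and an equal-sized
# random subset of the rows of each larger class.
undersample <- function(df, target) {
  n_min <- min(table(df[[target]]))
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  keep <- lapply(idx_by_class, function(idx) idx[sample.int(length(idx), n_min)])
  df[sort(unlist(keep, use.names = FALSE)), ]
}

balanced <- undersample(toy, "label")
print(table(balanced$label))
```

Resampling should be applied to the training sample only. Performance is best judged on data that kept the original class proportions and was not used for fitting.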
# A space inside an R string needs no backslash; '\ ' is not a valid escape
load('/Users/milin/COD 建模/model_rf_under.RData')
load('/Users/milin/COD 建模/dt_woe.RData')
require(scorecard)
dt_pred = predict(model_rf_under, type = 'prob', dt_woe)
perf_eva(dt_woe$label, dt_pred$`1`)
## $KS
## [1] 0.3986
##
## $AUC
## [1] 0.7641
##
## $Gini
## [1] 0.5281
##
## $pic
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## pks 1 (1-1,1-1) arrange gtable[layout]
## proc 2 (1-1,2-2) arrange gtable[layout]
load('/Users/milin/COD 建模/model_rf_under1.RData')
dt_pred = predict(model_rf_under, type = 'prob', dt_woe)
perf_eva(dt_woe$label, dt_pred$`1`)
## $KS
## [1] 0.3986
##
## $AUC
## [1] 0.7641
##
## $Gini
## [1] 0.5281
##
## $pic
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## pks 1 (1-1,1-1) arrange gtable[layout]
## proc 2 (1-1,2-2) arrange gtable[layout]