In the given problem a telecom company has reached to the point where capturing new customer for it is quite expensive and has low ROI, therefore, in order to increase revenue the company has decided to focus on the existing customer base. Our task here is to develop a propoensity model for finding out the customers whom they should target in this campaign.
setwd("C:\\Users\\om\\Desktop")
cust_data = read.csv("Rawdatafile.csv")head(cust_data)
#> Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
#> 1 1000 Yes North Medium Private 30 Male 0 200 7
#> 2 1001 Yes Central Low Private 30 Female 3 200 15
#> 3 1002 Yes Central High Private 50 Male 0 600 7
#> 4 1003 Yes South High Private 20 Male 3 600 7
#> 5 1004 Yes North High Private 20 Female 3 600 15
#> 6 1005 Yes North Low Private 30 Female 0 200 15
#> Var9 Var10 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19
#> 1 Cash Postpaid 1 1 300 300 300 300 10 50 500
#> 2 Cash Postpaid 1 1 300 300 600 300 10 50 1500
#> 3 Card Postpaid 2 1 600 300 600 300 10 50 1500
#> 4 Card Postpaid 2 1 600 300 300 100 10 200 1500
#> 5 Card Postpaid 2 1 600 600 300 300 10 50 500
#> 6 Card Postpaid 2 1 600 600 300 100 10 200 1500
#> Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27
#> 1 2 1 300 300 100 50 20 Good
#> 2 1 1 300 300 20 500 20 Poor
#> 3 2 1 600 300 100 500 20 Poor
#> 4 1 1 300 600 100 500 20 Poor
#> 5 2 2 600 600 100 500 20 Poor
#> 6 1 1 600 300 20 50 20 Poortail(cust_data)
#> Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5 Var6 Var7
#> 4995 5994 No Medium Private 30 Female 3 1200
#> 4996 5995 No High Private 99 Female 3 1200
#> 4997 5996 No Medium Govt 50 Male 0 200
#> 4998 5997 No Low Private 24 Female 0 200
#> 4999 5998 No High Private 120 Female 0 1200
#> 5000 5999 No High Unemployed 50 Male 0 600
#> Var8 Var9 Var10 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18
#> 4995 15 Cash Postpaid 2 2 300 600 300 300 300 50
#> 4996 7 Cash Postpaid 1 2 600 300 300 300 300 200
#> 4997 15 Cash Postpaid 2 1 600 300 300 100 10 200
#> 4998 15 Cash Postpaid 1 2 600 300 600 100 10 200
#> 4999 15 Cash Postpaid 1 2 600 600 300 300 10 200
#> 5000 15 Card Postpaid 2 2 300 300 300 300 10 50
#> Var19 Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27
#> 4995 1500 1 2 600 600 100 500 50 <NA>
#> 4996 500 1 1 300 300 20 50 50 <NA>
#> 4997 1500 2 1 600 300 20 50 50 <NA>
#> 4998 500 1 2 600 600 20 500 20 <NA>
#> 4999 500 2 2 300 300 20 500 50 <NA>
#> 5000 1500 1 1 300 300 20 50 20 <NA>We noticed that the current dataset has 29 columns and 5000 rows.
Our aim here is to find the right audince for upselling we can use flag for Plan change variable to indentify the right customers who went for upselling in past. We will build our model on the characteristics of these customers based on other independent variables.
summary(cust_data)
#> Cust_id Plan_Chg_Flag Var1 Var2
#> Min. :1000 No :4500 : 25 High :1745
#> 1st Qu.:2250 Yes: 500 Central:1599 Low :1681
#> Median :3500 North :1736 Medium:1574
#> Mean :3500 South :1640
#> 3rd Qu.:4749
#> Max. :5999
#> Var3 Var4 Var5 Var6
#> Govt :1497 Min. : 20.00 Female:2538 Min. :0.000
#> Private :1985 1st Qu.: 20.00 Male :2462 1st Qu.:0.000
#> Unemployed:1518 Median : 30.00 Median :3.000
#> Mean : 33.14 Mean :1.502
#> 3rd Qu.: 50.00 3rd Qu.:3.000
#> Max. :120.00 Max. :3.000
#> Var7 Var8 Var9 Var10
#> Min. : 200.0 Min. : 7.00 Card:2620 Postpaid:5000
#> 1st Qu.: 200.0 1st Qu.: 7.00 Cash:2380
#> Median : 600.0 Median :15.00
#> Mean : 647.3 Mean :11.13
#> 3rd Qu.:1200.0 3rd Qu.:15.00
#> Max. :1200.0 Max. :15.00
#> Var11 Var12 Var13 Var14
#> Min. :1.000 Min. :1.000 Min. :300.0 Min. :300.0
#> 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:300.0 1st Qu.:300.0
#> Median :1.000 Median :1.000 Median :600.0 Median :600.0
#> Mean :1.486 Mean :1.487 Mean :462.4 Mean :455.7
#> 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:600.0 3rd Qu.:600.0
#> Max. :2.000 Max. :2.000 Max. :600.0 Max. :600.0
#> Var15 Var16 Var17 Var18
#> Min. :300.0 Min. :100.0 Min. : 10.0 Min. : 50.0
#> 1st Qu.:300.0 1st Qu.:100.0 1st Qu.: 10.0 1st Qu.: 50.0
#> Median :600.0 Median :300.0 Median :300.0 Median :200.0
#> Mean :455.5 Mean :206.7 Mean :167.2 Mean :128.9
#> 3rd Qu.:600.0 3rd Qu.:300.0 3rd Qu.:300.0 3rd Qu.:200.0
#> Max. :600.0 Max. :300.0 Max. :300.0 Max. :200.0
#> Var19 Var20 Var21 Var22
#> Min. : 500 Min. :1.000 Min. :1.000 Min. :300.0
#> 1st Qu.: 500 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:300.0
#> Median :1500 Median :2.000 Median :2.000 Median :600.0
#> Mean :1021 Mean :1.518 Mean :1.527 Mean :457.1
#> 3rd Qu.:1500 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:600.0
#> Max. :1500 Max. :2.000 Max. :2.000 Max. :600.0
#> Var23 Var24 Var25 Var26 Var27
#> Min. :300.0 Min. : 20.0 Min. : 50 Min. :20.00 Good: 811
#> 1st Qu.:300.0 1st Qu.: 20.0 1st Qu.: 50 1st Qu.:20.00 Poor: 17
#> Median :600.0 Median :100.0 Median :500 Median :20.00 NA's:4172
#> Mean :460.7 Mean : 62.1 Mean :286 Mean :34.36
#> 3rd Qu.:600.0 3rd Qu.:100.0 3rd Qu.:500 3rd Qu.:50.00
#> Max. :600.0 Max. :100.0 Max. :500 Max. :50.00Var1 and Var27 have missing values, Var4 has extreme value and Var10 has very high or low granularity value.
If we notice closely the % of missing values in Var27 is much higher than % of missing values in Var 1
% Missing values in Var27 = 4175/5000 = 83.44%
% Missing values in Var1 = 25/5000 = 0.5%
Because we have categories here and we can impute the missing values by using the appropriate average value of the variable.
cust_dat1 <- cust_data[,-c(12,29)]
head(cust_dat1)
#> Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
#> 1 1000 Yes North Medium Private 30 Male 0 200 7
#> 2 1001 Yes Central Low Private 30 Female 3 200 15
#> 3 1002 Yes Central High Private 50 Male 0 600 7
#> 4 1003 Yes South High Private 20 Male 3 600 7
#> 5 1004 Yes North High Private 20 Female 3 600 15
#> 6 1005 Yes North Low Private 30 Female 0 200 15
#> Var9 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19 Var20 Var21
#> 1 Cash 1 1 300 300 300 300 10 50 500 2 1
#> 2 Cash 1 1 300 300 600 300 10 50 1500 1 1
#> 3 Card 2 1 600 300 600 300 10 50 1500 2 1
#> 4 Card 2 1 600 300 300 100 10 200 1500 1 1
#> 5 Card 2 1 600 600 300 300 10 50 500 2 2
#> 6 Card 2 1 600 600 300 100 10 200 1500 1 1
#> Var22 Var23 Var24 Var25 Var26
#> 1 300 300 100 50 20
#> 2 300 300 20 500 20
#> 3 600 300 100 500 20
#> 4 300 600 100 500 20
#> 5 600 600 100 500 20
#> 6 600 300 20 50 20cust_dat1$Var1 = ifelse(cust_dat1$Var1 == "","North",
ifelse(cust_dat1$Var1 == "South",
"South",ifelse(cust_dat1$Var1 == "Central", "Central",
"North")))cust_dat1$Var4 = ifelse(cust_dat1$Var4 >50, 50,
ifelse(cust_dat1$Var4 <25, 20, 30))summary(cust_dat1)
#> Cust_id Plan_Chg_Flag Var1 Var2
#> Min. :1000 No :4500 Length:5000 High :1745
#> 1st Qu.:2250 Yes: 500 Class :character Low :1681
#> Median :3500 Mode :character Medium:1574
#> Mean :3500
#> 3rd Qu.:4749
#> Max. :5999
#> Var3 Var4 Var5 Var6
#> Govt :1497 Min. :20.00 Female:2538 Min. :0.000
#> Private :1985 1st Qu.:20.00 Male :2462 1st Qu.:0.000
#> Unemployed:1518 Median :30.00 Median :3.000
#> Mean :26.78 Mean :1.502
#> 3rd Qu.:30.00 3rd Qu.:3.000
#> Max. :50.00 Max. :3.000
#> Var7 Var8 Var9 Var11
#> Min. : 200.0 Min. : 7.00 Card:2620 Min. :1.000
#> 1st Qu.: 200.0 1st Qu.: 7.00 Cash:2380 1st Qu.:1.000
#> Median : 600.0 Median :15.00 Median :1.000
#> Mean : 647.3 Mean :11.13 Mean :1.486
#> 3rd Qu.:1200.0 3rd Qu.:15.00 3rd Qu.:2.000
#> Max. :1200.0 Max. :15.00 Max. :2.000
#> Var12 Var13 Var14 Var15
#> Min. :1.000 Min. :300.0 Min. :300.0 Min. :300.0
#> 1st Qu.:1.000 1st Qu.:300.0 1st Qu.:300.0 1st Qu.:300.0
#> Median :1.000 Median :600.0 Median :600.0 Median :600.0
#> Mean :1.487 Mean :462.4 Mean :455.7 Mean :455.5
#> 3rd Qu.:2.000 3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:600.0
#> Max. :2.000 Max. :600.0 Max. :600.0 Max. :600.0
#> Var16 Var17 Var18 Var19
#> Min. :100.0 Min. : 10.0 Min. : 50.0 Min. : 500
#> 1st Qu.:100.0 1st Qu.: 10.0 1st Qu.: 50.0 1st Qu.: 500
#> Median :300.0 Median :300.0 Median :200.0 Median :1500
#> Mean :206.7 Mean :167.2 Mean :128.9 Mean :1021
#> 3rd Qu.:300.0 3rd Qu.:300.0 3rd Qu.:200.0 3rd Qu.:1500
#> Max. :300.0 Max. :300.0 Max. :200.0 Max. :1500
#> Var20 Var21 Var22 Var23
#> Min. :1.000 Min. :1.000 Min. :300.0 Min. :300.0
#> 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:300.0 1st Qu.:300.0
#> Median :2.000 Median :2.000 Median :600.0 Median :600.0
#> Mean :1.518 Mean :1.527 Mean :457.1 Mean :460.7
#> 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:600.0 3rd Qu.:600.0
#> Max. :2.000 Max. :2.000 Max. :600.0 Max. :600.0
#> Var24 Var25 Var26
#> Min. : 20.0 Min. : 50 Min. :20.00
#> 1st Qu.: 20.0 1st Qu.: 50 1st Qu.:20.00
#> Median :100.0 Median :500 Median :20.00
#> Mean : 62.1 Mean :286 Mean :34.36
#> 3rd Qu.:100.0 3rd Qu.:500 3rd Qu.:50.00
#> Max. :100.0 Max. :500 Max. :50.00Corr_data<- cust_dat1[,-c(1,2,3,4,5,7,11)]
Corr_matrix = cor(Corr_data, Corr_data)
Corr_matrix
#> Var4 Var6 Var7 Var8 Var11
#> Var4 1.0000000000 0.0084869625 -0.016426170 -0.011887708 -0.021433414
#> Var6 0.0084869625 1.0000000000 -0.016601486 -0.001639742 -0.018773241
#> Var7 -0.0164261702 -0.0166014856 1.000000000 -0.008263687 0.013658869
#> Var8 -0.0118877076 -0.0016397421 -0.008263687 1.000000000 -0.025891208
#> Var11 -0.0214334135 -0.0187732408 0.013658869 -0.025891208 1.000000000
#> Var12 -0.0291889736 -0.0183756088 -0.004237924 -0.001584855 0.093342590
#> Var13 -0.0074780901 -0.0181606513 0.015166188 0.062776569 -0.008057612
#> Var14 0.0081412574 0.0079601550 0.005933986 0.088479125 -0.014522108
#> Var15 -0.0082962241 -0.0080490199 0.004717581 0.043672495 0.002250745
#> Var16 -0.0247304882 -0.0181216165 -0.007386208 0.076836657 0.001138404
#> Var17 0.0058614881 0.0191671229 -0.016896847 0.065533550 -0.018041108
#> Var18 0.0163149718 0.0031414380 0.021915040 0.068031372 0.002713973
#> Var19 0.0128926422 0.0183661676 -0.009913839 0.052337013 -0.003217505
#> Var20 -0.0005958734 -0.0252579108 0.017713928 0.094567744 -0.014200927
#> Var21 0.0091587072 -0.0028695312 -0.005859775 0.059556852 -0.024880428
#> Var22 0.0079045003 0.0111555331 -0.035192482 0.049740227 -0.024667301
#> Var23 0.0087476315 -0.0061009282 0.014271514 0.091974712 -0.016397841
#> Var24 0.0214321635 0.0247711690 -0.010445500 0.075245064 0.005919705
#> Var25 0.0112857813 -0.0084686559 0.003649281 0.052508976 -0.011412886
#> Var26 -0.0233673560 -0.0007493272 0.049883692 0.009400265 0.079272854
#> Var12 Var13 Var14 Var15 Var16
#> Var4 -0.029188974 -0.007478090 0.008141257 -0.008296224 -0.024730488
#> Var6 -0.018375609 -0.018160651 0.007960155 -0.008049020 -0.018121616
#> Var7 -0.004237924 0.015166188 0.005933986 0.004717581 -0.007386208
#> Var8 -0.001584855 0.062776569 0.088479125 0.043672495 0.076836657
#> Var11 0.093342590 -0.008057612 -0.014522108 0.002250745 0.001138404
#> Var12 1.000000000 -0.016785916 -0.015858887 -0.019101441 -0.014744562
#> Var13 -0.016785916 1.000000000 0.055095821 0.071290157 0.066840600
#> Var14 -0.015858887 0.055095821 1.000000000 0.033062575 0.074869594
#> Var15 -0.019101441 0.071290157 0.033062575 1.000000000 0.071763590
#> Var16 -0.014744562 0.066840600 0.074869594 0.071763590 1.000000000
#> Var17 -0.023564657 0.070737531 0.061858030 0.078861296 0.064704932
#> Var18 -0.033135551 0.045901180 0.043700356 0.065425822 0.081976392
#> Var19 -0.018983902 0.074923308 0.084166603 0.069805812 0.066638448
#> Var20 -0.006719257 0.038855652 0.076765383 0.073613092 0.067428673
#> Var21 -0.009446103 0.065436535 0.083316260 0.061752144 0.071812749
#> Var22 0.017224508 0.086469495 0.073527591 0.045549142 0.058198657
#> Var23 -0.001810930 0.072161033 0.060290850 0.060401542 0.056285116
#> Var24 -0.027526065 0.042685844 0.053320511 0.082260090 0.033801460
#> Var25 -0.016795578 0.069899582 0.069879267 0.077166696 0.056114190
#> Var26 0.091834729 -0.012929054 0.004033006 -0.003247205 -0.014369459
#> Var17 Var18 Var19 Var20 Var21
#> Var4 0.005861488 0.016314972 0.012892642 -0.0005958734 0.009158707
#> Var6 0.019167123 0.003141438 0.018366168 -0.0252579108 -0.002869531
#> Var7 -0.016896847 0.021915040 -0.009913839 0.0177139278 -0.005859775
#> Var8 0.065533550 0.068031372 0.052337013 0.0945677437 0.059556852
#> Var11 -0.018041108 0.002713973 -0.003217505 -0.0142009275 -0.024880428
#> Var12 -0.023564657 -0.033135551 -0.018983902 -0.0067192565 -0.009446103
#> Var13 0.070737531 0.045901180 0.074923308 0.0388556524 0.065436535
#> Var14 0.061858030 0.043700356 0.084166603 0.0767653827 0.083316260
#> Var15 0.078861296 0.065425822 0.069805812 0.0736130921 0.061752144
#> Var16 0.064704932 0.081976392 0.066638448 0.0674286727 0.071812749
#> Var17 1.000000000 0.067108754 0.046327213 0.0649007652 0.060111816
#> Var18 0.067108754 1.000000000 0.069996634 0.0530612543 0.072957556
#> Var19 0.046327213 0.069996634 1.000000000 0.0670482046 0.063105668
#> Var20 0.064900765 0.053061254 0.067048205 1.0000000000 0.066224162
#> Var21 0.060111816 0.072957556 0.063105668 0.0662241623 1.000000000
#> Var22 0.044190144 0.040808016 0.083804951 0.0724514535 0.069191331
#> Var23 0.049894926 0.052875773 0.083348684 0.0552681908 0.056352975
#> Var24 0.067912723 0.060620690 0.053159226 0.0642836170 0.075364422
#> Var25 0.052531077 0.062803885 0.060112038 0.0688068081 0.086375942
#> Var26 -0.008425435 -0.002563147 -0.019871706 -0.0045003502 -0.006086261
#> Var22 Var23 Var24 Var25 Var26
#> Var4 0.00790450 0.008747632 0.0214321635 0.011285781 -0.0233673560
#> Var6 0.01115553 -0.006100928 0.0247711690 -0.008468656 -0.0007493272
#> Var7 -0.03519248 0.014271514 -0.0104454997 0.003649281 0.0498836920
#> Var8 0.04974023 0.091974712 0.0752450642 0.052508976 0.0094002646
#> Var11 -0.02466730 -0.016397841 0.0059197048 -0.011412886 0.0792728536
#> Var12 0.01722451 -0.001810930 -0.0275260654 -0.016795578 0.0918347291
#> Var13 0.08646950 0.072161033 0.0426858435 0.069899582 -0.0129290545
#> Var14 0.07352759 0.060290850 0.0533205114 0.069879267 0.0040330058
#> Var15 0.04554914 0.060401542 0.0822600897 0.077166696 -0.0032472055
#> Var16 0.05819866 0.056285116 0.0338014596 0.056114190 -0.0143694593
#> Var17 0.04419014 0.049894926 0.0679127229 0.052531077 -0.0084254354
#> Var18 0.04080802 0.052875773 0.0606206899 0.062803885 -0.0025631468
#> Var19 0.08380495 0.083348684 0.0531592259 0.060112038 -0.0198717065
#> Var20 0.07245145 0.055268191 0.0642836170 0.068806808 -0.0045003502
#> Var21 0.06919133 0.056352975 0.0753644220 0.086375942 -0.0060862612
#> Var22 1.00000000 0.079703278 0.0809084996 0.051797477 -0.0155946707
#> Var23 0.07970328 1.000000000 0.0633165788 0.060751822 -0.0093850367
#> Var24 0.08090850 0.063316579 1.0000000000 0.092479963 -0.0001576408
#> Var25 0.05179748 0.060751822 0.0924799632 1.000000000 -0.0103331286
#> Var26 -0.01559467 -0.009385037 -0.0001576408 -0.010333129 1.0000000000library (car)
cust_dat1$Responder<-ifelse(cust_dat1$Plan_Chg_Flag == "Yes", 1, 0)
vif_Cust_data <- vif(lm(Responder ~ Var4+ Var6+ Var7+ Var8+ Var11+ Var12+ Var13+
Var14+ Var15+ Var16+ Var17+ Var18+ Var19+ Var20+ Var21+
Var22+ Var23+ Var24+ Var25+ Var26, data=cust_dat1))
vif_Cust_data
#> Var4 Var6 Var7 Var8 Var11 Var12 Var13 Var14
#> 1.004064 1.004115 1.006625 1.040132 1.017280 1.020826 1.034020 1.037052
#> Var15 Var16 Var17 Var18 Var19 Var20 Var21 Var22
#> 1.034849 1.035711 1.030768 1.031110 1.037650 1.036979 1.037472 1.037842
#> Var23 Var24 Var25 Var26
#> 1.034547 1.037831 1.035905 1.017624Because VIF for all the factors is less than 2 and correlation matrix has lower corr values, we will move forward as their is no evidence of collinearity.
Our aim here is to perform the frequency analysis of each category against the response variable
#Let us remove the 2nd column which we used to create the response variable
dat = cust_dat1[,c(-20)]
#Let us see our new dataset dat
head(dat)
#> Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
#> 1 1000 Yes North Medium Private 30 Male 0 200 7
#> 2 1001 Yes Central Low Private 30 Female 3 200 15
#> 3 1002 Yes Central High Private 30 Male 0 600 7
#> 4 1003 Yes South High Private 20 Male 3 600 7
#> 5 1004 Yes North High Private 20 Female 3 600 15
#> 6 1005 Yes North Low Private 30 Female 0 200 15
#> Var9 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var20 Var21 Var22
#> 1 Cash 1 1 300 300 300 300 10 50 2 1 300
#> 2 Cash 1 1 300 300 600 300 10 50 1 1 300
#> 3 Card 2 1 600 300 600 300 10 50 2 1 600
#> 4 Card 2 1 600 300 300 100 10 200 1 1 300
#> 5 Card 2 1 600 600 300 300 10 50 2 2 600
#> 6 Card 2 1 600 600 300 100 10 200 1 1 600
#> Var23 Var24 Var25 Var26 Responder
#> 1 300 100 50 20 1
#> 2 300 20 500 20 1
#> 3 300 100 500 20 1
#> 4 600 100 500 20 1
#> 5 600 100 500 20 1
#> 6 300 20 50 20 1bi_var<-apply(dat,2,typeof)
# add variable name for the type and added the flag ..default set to 1 for all var
bi_var<-data.frame(colnames(dat),bi_var,flag=1)
# set the row names as numbers
row.names(bi_var)<-1:nrow(bi_var)
# set the column names
colnames(bi_var)<-c("variable","var_type","flag")
bi_var
#> variable var_type flag
#> 1 Cust_id character 1
#> 2 Plan_Chg_Flag character 1
#> 3 Var1 character 1
#> 4 Var2 character 1
#> 5 Var3 character 1
#> 6 Var4 character 1
#> 7 Var5 character 1
#> 8 Var6 character 1
#> 9 Var7 character 1
#> 10 Var8 character 1
#> 11 Var9 character 1
#> 12 Var11 character 1
#> 13 Var12 character 1
#> 14 Var13 character 1
#> 15 Var14 character 1
#> 16 Var15 character 1
#> 17 Var16 character 1
#> 18 Var17 character 1
#> 19 Var18 character 1
#> 20 Var20 character 1
#> 21 Var21 character 1
#> 22 Var22 character 1
#> 23 Var23 character 1
#> 24 Var24 character 1
#> 25 Var25 character 1
#> 26 Var26 character 1
#> 27 Responder character 1
#Let us get the position for variables to set the flag as 0
bi_var$flag[which( bi_var$variable %in% c("Cust_id","Responder"))]= 0
## remove those with flag as 0
bi_var<-bi_var[bi_var$flag==1,]
bi_var
#> variable var_type flag
#> 2 Plan_Chg_Flag character 1
#> 3 Var1 character 1
#> 4 Var2 character 1
#> 5 Var3 character 1
#> 6 Var4 character 1
#> 7 Var5 character 1
#> 8 Var6 character 1
#> 9 Var7 character 1
#> 10 Var8 character 1
#> 11 Var9 character 1
#> 12 Var11 character 1
#> 13 Var12 character 1
#> 14 Var13 character 1
#> 15 Var14 character 1
#> 16 Var15 character 1
#> 17 Var16 character 1
#> 18 Var17 character 1
#> 19 Var18 character 1
#> 20 Var20 character 1
#> 21 Var21 character 1
#> 22 Var22 character 1
#> 23 Var23 character 1
#> 24 Var24 character 1
#> 25 Var25 character 1
#> 26 Var26 character 1event_rate = NULL
#loop in till all the var in the table
for ( i in 1:nrow(bi_var))
{
# Get the freq table for each var..ensure that the deleted var and numbers are in sequence..eg responder should be at 2nd position
aa<-as.matrix(table(dat[,i+2],dat[,27]))
cc<-aa
#Append var name and the categories in that variable
bb<-cbind(rep(as.character(bi_var$variable[i]),nrow(aa)),row.names(aa))
#Merge the name, cat and freq table
aa<-data.frame(cbind(bb,aa))
#Calc for ER, NER, WOE, IV and cum IV
aa[,5]<-as.numeric(cc[,1])/sum(as.numeric(cc[,1]))
aa[,6]<-as.numeric(cc[,2])/sum(as.numeric(cc[,2]))
aa[,7]<-log(aa[,5]/aa[,6])
aa[,8]<-(aa[,5]-aa[,6])*aa[,7]
aa[,9]<-sum(aa[,8])
#Append everything in new dataset
event_rate<-rbind(event_rate,aa)
}
#Give the column names for data created above ..after the for loop
colnames(event_rate)<-c("variable","Factor","Res","Non-Res","ER","NER","WOE","IV","Cum_IV")
#Read the eventrate file and safe for analysis
head(event_rate)
#> variable Factor Res Non-Res ER NER WOE
#> Central Plan_Chg_Flag Central 1436 163 0.3191111 0.326 -0.021358029
#> North Plan_Chg_Flag North 1579 182 0.3508889 0.364 -0.036684250
#> South Plan_Chg_Flag South 1485 155 0.3300000 0.310 0.062520357
#> High Var1 High 1574 171 0.3497778 0.342 0.022487295
#> Low Var1 Low 1513 168 0.3362222 0.336 0.000661157
#> Medium Var1 Medium 1413 161 0.3140000 0.322 -0.025158560
#> IV Cum_IV
#> Central 1.471331e-04 0.0018785115
#> North 4.809713e-04 0.0018785115
#> South 1.250407e-03 0.0018785115
#> High 1.749012e-04 0.0003763166
#> Low 1.469238e-07 0.0003763166
#> Medium 2.012685e-04 0.0003763166cust_dat1$GRPVar1<-ifelse(cust_dat1$Var1=="North",1,
ifelse(cust_dat1$Var1=="South",2,3))
cust_dat1$GRPVar2<-ifelse(cust_dat1$Var2=="Low",1,
ifelse(cust_dat1$Var2=="Medium",2,3))
cust_dat1$GRPVar3<-ifelse(cust_dat1$Var3=="Unemployed",1,
ifelse(cust_dat1$Var3=="Govt",2,3))
cust_dat1$GRPVar4<-ifelse(cust_dat1$Var4 < 25,1,
ifelse(cust_dat1$Var4< 40,2,3))
cust_dat1$GRPVar5<-ifelse(cust_dat1$Var5=="Male",1,2)
cust_dat1$GRPVar6<-ifelse(cust_dat1$Var6 < 2,1,2)
cust_dat1$GRPVar7<-ifelse(cust_dat1$Var7 < 500,1,
ifelse(cust_dat1$Var7< 1000,2,3))
cust_dat1$GRPVar8<-ifelse(cust_dat1$Var8 < 13,1,2)
cust_dat1$GRPVar9<-ifelse(cust_dat1$Var9=="Cash",1,2)
cust_dat1$GRPVar11<-ifelse(cust_dat1$Var11 < 2,1,2)
cust_dat1$GRPVar12<-ifelse(cust_dat1$Var12 < 2,1,2)
cust_dat1$GRPVar13<-ifelse(cust_dat1$Var13 < 500,1,2)
cust_dat1$GRPVar14<-ifelse(cust_dat1$Var14 < 500,1,2)
cust_dat1$GRPVar15<-ifelse(cust_dat1$Var15 < 500,1,2)
cust_dat1$GRPVar16<-ifelse(cust_dat1$Var16 < 200,1,2)
cust_dat1$GRPVar17<-ifelse(cust_dat1$Var17 < 20,1,2)
cust_dat1$GRPVar18<-ifelse(cust_dat1$Var18 < 100,1,2)
cust_dat1$GRPVar19<-ifelse(cust_dat1$Var19 < 1000,1,2)
cust_dat1$GRPVar20<-ifelse(cust_dat1$Var20 < 2,1,2)
cust_dat1$GRPVar21<-ifelse(cust_dat1$Var21 < 2,1,2)
cust_dat1$GRPVar22<-ifelse(cust_dat1$Var22 < 500,1,2)
cust_dat1$GRPVar23<-ifelse(cust_dat1$Var23 < 500,1,2)
cust_dat1$GRPVar24<-ifelse(cust_dat1$Var24 < 50,1,2)
cust_dat1$GRPVar25<-ifelse(cust_dat1$Var25 < 100,1,2)
cust_dat1$GRPVar26<-ifelse(cust_dat1$Var26 < 30,1,2)
cust_modeldata<-cust_dat1[,c(28,29,30,31,32,33,34,35,36,37,38,39,
40,41,42,43,44,45,46,47,48,49,50,51,52,53)]
str(cust_modeldata)
#> 'data.frame': 5000 obs. of 26 variables:
#> $ Responder: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ GRPVar1 : num 1 3 3 2 1 1 1 1 3 1 ...
#> $ GRPVar2 : num 2 1 3 3 3 1 3 2 1 3 ...
#> $ GRPVar3 : num 3 3 3 3 3 3 3 3 3 3 ...
#> $ GRPVar4 : num 2 2 2 1 1 2 2 2 2 1 ...
#> $ GRPVar5 : num 1 2 1 1 2 2 1 2 1 1 ...
#> $ GRPVar6 : num 1 2 1 2 2 1 1 2 2 2 ...
#> $ GRPVar7 : num 1 1 2 2 2 1 2 1 1 2 ...
#> $ GRPVar8 : num 1 2 1 1 2 2 1 2 1 1 ...
#> $ GRPVar9 : num 1 1 2 2 2 2 2 1 1 2 ...
#> $ GRPVar11 : num 1 1 2 2 2 2 2 2 2 2 ...
#> $ GRPVar12 : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ GRPVar13 : num 1 1 2 2 2 2 2 1 1 2 ...
#> $ GRPVar14 : num 1 1 1 1 2 2 1 2 2 1 ...
#> $ GRPVar15 : num 1 2 2 1 1 1 2 1 1 2 ...
#> $ GRPVar16 : num 2 2 2 1 2 1 1 1 1 2 ...
#> $ GRPVar17 : num 1 1 1 1 1 1 2 1 1 2 ...
#> $ GRPVar18 : num 1 1 1 2 1 2 1 2 1 2 ...
#> $ GRPVar19 : num 1 2 2 2 1 2 2 1 2 1 ...
#> $ GRPVar20 : num 2 1 2 1 2 1 2 2 2 2 ...
#> $ GRPVar21 : num 1 1 1 1 2 1 2 2 2 1 ...
#> $ GRPVar22 : num 1 1 2 1 2 2 1 1 2 1 ...
#> $ GRPVar23 : num 1 1 1 2 2 1 1 2 1 2 ...
#> $ GRPVar24 : num 2 1 2 2 2 1 2 1 1 2 ...
#> $ GRPVar25 : num 1 2 2 2 2 1 2 2 2 2 ...
#> $ GRPVar26 : num 1 1 1 1 1 1 1 1 1 1 ...#installing library sampling to get strata function
library(sampling)
#> Warning: package 'sampling' was built under R version 3.4.4#Sorting the data
cust_modeldata = cust_modeldata[order(cust_modeldata$Responder,decreasing = TRUE),]
training_data = strata(cust_modeldata,c("Responder"),
size=c(350,3150), method="srswor")
training_data<-getdata(cust_modeldata,training_data)
str(training_data)
#> 'data.frame': 3500 obs. of 29 variables:
#> $ GRPVar1 : num 1 3 2 1 1 1 3 3 2 3 ...
#> $ GRPVar2 : num 2 1 3 1 3 2 3 1 3 3 ...
#> $ GRPVar3 : num 3 3 3 3 3 3 3 3 2 1 ...
#> $ GRPVar4 : num 2 2 1 2 2 2 2 2 2 2 ...
#> $ GRPVar5 : num 1 2 1 2 1 2 2 1 1 2 ...
#> $ GRPVar6 : num 1 2 2 1 1 2 1 1 2 1 ...
#> $ GRPVar7 : num 1 1 2 1 2 1 2 1 1 1 ...
#> $ GRPVar8 : num 1 2 1 2 1 2 2 1 1 2 ...
#> $ GRPVar9 : num 1 1 2 2 2 1 2 2 2 1 ...
#> $ GRPVar11 : num 1 1 2 2 2 2 2 2 2 2 ...
#> $ GRPVar12 : num 1 1 1 1 1 1 2 2 2 2 ...
#> $ GRPVar13 : num 1 1 2 2 2 1 2 2 2 2 ...
#> $ GRPVar14 : num 1 1 1 2 1 2 1 1 2 2 ...
#> $ GRPVar15 : num 1 2 1 1 2 1 2 1 1 2 ...
#> $ GRPVar16 : num 2 2 1 1 1 1 1 2 2 1 ...
#> $ GRPVar17 : num 1 1 1 1 2 1 2 1 1 2 ...
#> $ GRPVar18 : num 1 1 2 2 1 2 2 1 2 2 ...
#> $ GRPVar19 : num 1 2 2 2 2 1 2 1 1 1 ...
#> $ GRPVar20 : num 2 1 1 1 2 2 2 1 1 1 ...
#> $ GRPVar21 : num 1 1 1 1 2 2 1 2 2 1 ...
#> $ GRPVar22 : num 1 1 1 2 1 1 2 1 2 1 ...
#> $ GRPVar23 : num 1 1 2 1 1 2 2 1 2 2 ...
#> $ GRPVar24 : num 2 1 2 1 2 1 1 1 1 1 ...
#> $ GRPVar25 : num 1 2 2 1 2 2 2 1 1 1 ...
#> $ GRPVar26 : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ Responder: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID_unit : int 1 2 4 6 7 8 11 12 14 17 ...
#> $ Prob : num 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 ...
#> $ Stratum : int 1 1 1 1 1 1 1 1 1 1 ...fit <- glm(Responder ~as.factor(GRPVar2)+as.factor(GRPVar7)+as.factor(GRPVar8)
+as.factor(GRPVar9)+as.factor(GRPVar11)+as.factor(GRPVar12)
+as.factor(GRPVar1)+as.factor(GRPVar14)+as.factor(GRPVar15)
+as.factor(GRPVar16)+as.factor(GRPVar17)+as.factor(GRPVar18)
+as.factor(GRPVar19)+as.factor(GRPVar20)+as.factor(GRPVar21)
+as.factor(GRPVar22)+as.factor(GRPVar23)+as.factor(GRPVar24)
+as.factor(GRPVar25), family = binomial("logit"),data=training_data )
#Let us see the results now
summary(fit)
#>
#> Call:
#> glm(formula = Responder ~ as.factor(GRPVar2) + as.factor(GRPVar7) +
#> as.factor(GRPVar8) + as.factor(GRPVar9) + as.factor(GRPVar11) +
#> as.factor(GRPVar12) + as.factor(GRPVar1) + as.factor(GRPVar14) +
#> as.factor(GRPVar15) + as.factor(GRPVar16) + as.factor(GRPVar17) +
#> as.factor(GRPVar18) + as.factor(GRPVar19) + as.factor(GRPVar20) +
#> as.factor(GRPVar21) + as.factor(GRPVar22) + as.factor(GRPVar23) +
#> as.factor(GRPVar24) + as.factor(GRPVar25), family = binomial("logit"),
#> data = training_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5085 -0.4209 -0.2364 -0.0001 3.5840
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.87085 0.27422 -17.762 < 2e-16 ***
#> as.factor(GRPVar2)2 0.09527 0.16780 0.568 0.570184
#> as.factor(GRPVar2)3 0.12870 0.16309 0.789 0.430009
#> as.factor(GRPVar7)2 0.28179 0.14565 1.935 0.053030 .
#> as.factor(GRPVar7)3 -17.12944 309.79923 -0.055 0.955906
#> as.factor(GRPVar8)2 0.29179 0.14747 1.979 0.047852 *
#> as.factor(GRPVar9)2 0.26274 0.14258 1.843 0.065367 .
#> as.factor(GRPVar11)2 -0.98343 0.14590 -6.741 1.58e-11 ***
#> as.factor(GRPVar12)2 -0.48727 0.14029 -3.473 0.000514 ***
#> as.factor(GRPVar1)2 -0.07923 0.16441 -0.482 0.629899
#> as.factor(GRPVar1)3 -0.06717 0.16305 -0.412 0.680385
#> as.factor(GRPVar14)2 0.56021 0.14864 3.769 0.000164 ***
#> as.factor(GRPVar15)2 0.39504 0.14792 2.671 0.007569 **
#> as.factor(GRPVar16)2 0.49988 0.14839 3.369 0.000755 ***
#> as.factor(GRPVar17)2 0.22597 0.14827 1.524 0.127486
#> as.factor(GRPVar18)2 0.44926 0.14675 3.061 0.002203 **
#> as.factor(GRPVar19)2 0.33036 0.14724 2.244 0.024858 *
#> as.factor(GRPVar20)2 0.55688 0.14871 3.745 0.000181 ***
#> as.factor(GRPVar21)2 0.46755 0.14773 3.165 0.001551 **
#> as.factor(GRPVar22)2 0.37475 0.14750 2.541 0.011062 *
#> as.factor(GRPVar23)2 0.46218 0.14608 3.164 0.001557 **
#> as.factor(GRPVar24)2 0.42752 0.15113 2.829 0.004672 **
#> as.factor(GRPVar25)2 0.42829 0.14843 2.886 0.003907 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 2275.6 on 3499 degrees of freedom
#> Residual deviance: 1508.8 on 3477 degrees of freedom
#> AIC: 1554.8
#>
#> Number of Fisher Scoring iterations: 18summary_pred_model<-predict(fit, type="response")
str(summary_pred_model)
#> Named num [1:3500] 0.0359 0.0478 0.0406 0.027 0.0871 ...
#> - attr(*, "names")= chr [1:3500] "1" "2" "4" "6" ...#Hosmer lemeshow goodness of fit test
hosmerlem <- function (y, yhat, g = 12)
{
cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0,1, 1/g)),
include.lowest = T)
obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq <- sum((obs - expect)^2/expect)
P <- 1 - pchisq(chisq, g - 2)
c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}
#Run the above function after setting the values
R_hat<-as.vector(fitted(fit))
yhat<-R_hat
y<-training_data$Responder
hosmerlem(y, yhat)
#> X^2 Df P(>Chi)
#> 1674.226 10.000 0.000library(ROCR)
#> Warning: package 'ROCR' was built under R version 3.4.4
#> Loading required package: gplots
#> Warning: package 'gplots' was built under R version 3.4.4
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
gain.chart <- function(y_hat,y) {
plot(performance(prediction(y_hat,y), "tpr", "rpp"),lwd = 7, main = "Lift Chart")
lines(ecdf((rank(-y_hat)[y == T]) / length(y)),verticals = T, do.points = F, col = "red", lwd = 3)
}
gain.chart(R_hat,training_data$Responder)