Up-sell

Atul Kumar

2018-07-17

In the given problem a telecom company has reached to the point where capturing new customer for it is quite expensive and has low ROI, therefore, in order to increase revenue the company has decided to focus on the existing customer base. Our task here is to develop a propoensity model for finding out the customers whom they should target in this campaign.

Importing data set as R

setwd("C:\\Users\\om\\Desktop")
cust_data = read.csv("Rawdatafile.csv")

Looking at the first 6 rows of the dataset

head(cust_data)
#>   Cust_id Plan_Chg_Flag    Var1   Var2    Var3 Var4   Var5 Var6 Var7 Var8
#> 1    1000           Yes   North Medium Private   30   Male    0  200    7
#> 2    1001           Yes Central    Low Private   30 Female    3  200   15
#> 3    1002           Yes Central   High Private   50   Male    0  600    7
#> 4    1003           Yes   South   High Private   20   Male    3  600    7
#> 5    1004           Yes   North   High Private   20 Female    3  600   15
#> 6    1005           Yes   North    Low Private   30 Female    0  200   15
#>   Var9    Var10 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19
#> 1 Cash Postpaid     1     1   300   300   300   300    10    50   500
#> 2 Cash Postpaid     1     1   300   300   600   300    10    50  1500
#> 3 Card Postpaid     2     1   600   300   600   300    10    50  1500
#> 4 Card Postpaid     2     1   600   300   300   100    10   200  1500
#> 5 Card Postpaid     2     1   600   600   300   300    10    50   500
#> 6 Card Postpaid     2     1   600   600   300   100    10   200  1500
#>   Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27
#> 1     2     1   300   300   100    50    20  Good
#> 2     1     1   300   300    20   500    20  Poor
#> 3     2     1   600   300   100   500    20  Poor
#> 4     1     1   300   600   100   500    20  Poor
#> 5     2     2   600   600   100   500    20  Poor
#> 6     1     1   600   300    20    50    20  Poor

Looking at the last 6 rows of the dataset

tail(cust_data)
#>      Cust_id Plan_Chg_Flag Var1   Var2       Var3 Var4   Var5 Var6 Var7
#> 4995    5994            No      Medium    Private   30 Female    3 1200
#> 4996    5995            No        High    Private   99 Female    3 1200
#> 4997    5996            No      Medium       Govt   50   Male    0  200
#> 4998    5997            No         Low    Private   24 Female    0  200
#> 4999    5998            No        High    Private  120 Female    0 1200
#> 5000    5999            No        High Unemployed   50   Male    0  600
#>      Var8 Var9    Var10 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18
#> 4995   15 Cash Postpaid     2     2   300   600   300   300   300    50
#> 4996    7 Cash Postpaid     1     2   600   300   300   300   300   200
#> 4997   15 Cash Postpaid     2     1   600   300   300   100    10   200
#> 4998   15 Cash Postpaid     1     2   600   300   600   100    10   200
#> 4999   15 Cash Postpaid     1     2   600   600   300   300    10   200
#> 5000   15 Card Postpaid     2     2   300   300   300   300    10    50
#>      Var19 Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27
#> 4995  1500     1     2   600   600   100   500    50  <NA>
#> 4996   500     1     1   300   300    20    50    50  <NA>
#> 4997  1500     2     1   600   300    20    50    50  <NA>
#> 4998   500     1     2   600   600    20   500    20  <NA>
#> 4999   500     2     2   300   300    20   500    50  <NA>
#> 5000  1500     1     1   300   300    20    50    20  <NA>

We noticed that the current dataset has 29 columns and 5000 rows.

Our aim here is to find the right audince for upselling we can use flag for Plan change variable to indentify the right customers who went for upselling in past. We will build our model on the characteristics of these customers based on other independent variables.

Summarizing the given data set to check for missing values, extreme values and granularity

summary(cust_data)
#>     Cust_id     Plan_Chg_Flag      Var1          Var2     
#>  Min.   :1000   No :4500             :  25   High  :1745  
#>  1st Qu.:2250   Yes: 500      Central:1599   Low   :1681  
#>  Median :3500                 North  :1736   Medium:1574  
#>  Mean   :3500                 South  :1640                
#>  3rd Qu.:4749                                             
#>  Max.   :5999                                             
#>          Var3           Var4            Var5           Var6      
#>  Govt      :1497   Min.   : 20.00   Female:2538   Min.   :0.000  
#>  Private   :1985   1st Qu.: 20.00   Male  :2462   1st Qu.:0.000  
#>  Unemployed:1518   Median : 30.00                 Median :3.000  
#>                    Mean   : 33.14                 Mean   :1.502  
#>                    3rd Qu.: 50.00                 3rd Qu.:3.000  
#>                    Max.   :120.00                 Max.   :3.000  
#>       Var7             Var8         Var9           Var10     
#>  Min.   : 200.0   Min.   : 7.00   Card:2620   Postpaid:5000  
#>  1st Qu.: 200.0   1st Qu.: 7.00   Cash:2380                  
#>  Median : 600.0   Median :15.00                              
#>  Mean   : 647.3   Mean   :11.13                              
#>  3rd Qu.:1200.0   3rd Qu.:15.00                              
#>  Max.   :1200.0   Max.   :15.00                              
#>      Var11           Var12           Var13           Var14      
#>  Min.   :1.000   Min.   :1.000   Min.   :300.0   Min.   :300.0  
#>  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:300.0   1st Qu.:300.0  
#>  Median :1.000   Median :1.000   Median :600.0   Median :600.0  
#>  Mean   :1.486   Mean   :1.487   Mean   :462.4   Mean   :455.7  
#>  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:600.0   3rd Qu.:600.0  
#>  Max.   :2.000   Max.   :2.000   Max.   :600.0   Max.   :600.0  
#>      Var15           Var16           Var17           Var18      
#>  Min.   :300.0   Min.   :100.0   Min.   : 10.0   Min.   : 50.0  
#>  1st Qu.:300.0   1st Qu.:100.0   1st Qu.: 10.0   1st Qu.: 50.0  
#>  Median :600.0   Median :300.0   Median :300.0   Median :200.0  
#>  Mean   :455.5   Mean   :206.7   Mean   :167.2   Mean   :128.9  
#>  3rd Qu.:600.0   3rd Qu.:300.0   3rd Qu.:300.0   3rd Qu.:200.0  
#>  Max.   :600.0   Max.   :300.0   Max.   :300.0   Max.   :200.0  
#>      Var19          Var20           Var21           Var22      
#>  Min.   : 500   Min.   :1.000   Min.   :1.000   Min.   :300.0  
#>  1st Qu.: 500   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:300.0  
#>  Median :1500   Median :2.000   Median :2.000   Median :600.0  
#>  Mean   :1021   Mean   :1.518   Mean   :1.527   Mean   :457.1  
#>  3rd Qu.:1500   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:600.0  
#>  Max.   :1500   Max.   :2.000   Max.   :2.000   Max.   :600.0  
#>      Var23           Var24           Var25         Var26        Var27     
#>  Min.   :300.0   Min.   : 20.0   Min.   : 50   Min.   :20.00   Good: 811  
#>  1st Qu.:300.0   1st Qu.: 20.0   1st Qu.: 50   1st Qu.:20.00   Poor:  17  
#>  Median :600.0   Median :100.0   Median :500   Median :20.00   NA's:4172  
#>  Mean   :460.7   Mean   : 62.1   Mean   :286   Mean   :34.36              
#>  3rd Qu.:600.0   3rd Qu.:100.0   3rd Qu.:500   3rd Qu.:50.00              
#>  Max.   :600.0   Max.   :100.0   Max.   :500   Max.   :50.00

Var1 and Var27 have missing values, Var4 has extreme value and Var10 has very high or low granularity value.

If we notice closely the % of missing values in Var27 is much higher than % of missing values in Var 1

% Missing values in Var27 = 4175/5000 = 83.44%

% Missing values in Var1 = 25/5000 = 0.5%

Because we have categories here and we can impute the missing values by using the appropriate average value of the variable.

Next we will drop Var1 and Var27 as one has high grangularity and other has high missing missing value

cust_dat1 <- cust_data[,-c(12,29)] 
head(cust_dat1)
#>   Cust_id Plan_Chg_Flag    Var1   Var2    Var3 Var4   Var5 Var6 Var7 Var8
#> 1    1000           Yes   North Medium Private   30   Male    0  200    7
#> 2    1001           Yes Central    Low Private   30 Female    3  200   15
#> 3    1002           Yes Central   High Private   50   Male    0  600    7
#> 4    1003           Yes   South   High Private   20   Male    3  600    7
#> 5    1004           Yes   North   High Private   20 Female    3  600   15
#> 6    1005           Yes   North    Low Private   30 Female    0  200   15
#>   Var9 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19 Var20 Var21
#> 1 Cash     1     1   300   300   300   300    10    50   500     2     1
#> 2 Cash     1     1   300   300   600   300    10    50  1500     1     1
#> 3 Card     2     1   600   300   600   300    10    50  1500     2     1
#> 4 Card     2     1   600   300   300   100    10   200  1500     1     1
#> 5 Card     2     1   600   600   300   300    10    50   500     2     2
#> 6 Card     2     1   600   600   300   100    10   200  1500     1     1
#>   Var22 Var23 Var24 Var25 Var26
#> 1   300   300   100    50    20
#> 2   300   300    20   500    20
#> 3   600   300   100   500    20
#> 4   300   600   100   500    20
#> 5   600   600   100   500    20
#> 6   600   300    20    50    20

Imputing missing values

cust_dat1$Var1 = ifelse(cust_dat1$Var1 == "","North",
                        ifelse(cust_dat1$Var1 == "South", 
                        "South",ifelse(cust_dat1$Var1 == "Central", "Central", 
                                       "North")))

Treating the extreme values

cust_dat1$Var4 = ifelse(cust_dat1$Var4 >50, 50,
                        ifelse(cust_dat1$Var4 <25, 20, 30))

Visiting summary again

summary(cust_dat1)
#>     Cust_id     Plan_Chg_Flag     Var1               Var2     
#>  Min.   :1000   No :4500      Length:5000        High  :1745  
#>  1st Qu.:2250   Yes: 500      Class :character   Low   :1681  
#>  Median :3500                 Mode  :character   Medium:1574  
#>  Mean   :3500                                                 
#>  3rd Qu.:4749                                                 
#>  Max.   :5999                                                 
#>          Var3           Var4           Var5           Var6      
#>  Govt      :1497   Min.   :20.00   Female:2538   Min.   :0.000  
#>  Private   :1985   1st Qu.:20.00   Male  :2462   1st Qu.:0.000  
#>  Unemployed:1518   Median :30.00                 Median :3.000  
#>                    Mean   :26.78                 Mean   :1.502  
#>                    3rd Qu.:30.00                 3rd Qu.:3.000  
#>                    Max.   :50.00                 Max.   :3.000  
#>       Var7             Var8         Var9          Var11      
#>  Min.   : 200.0   Min.   : 7.00   Card:2620   Min.   :1.000  
#>  1st Qu.: 200.0   1st Qu.: 7.00   Cash:2380   1st Qu.:1.000  
#>  Median : 600.0   Median :15.00               Median :1.000  
#>  Mean   : 647.3   Mean   :11.13               Mean   :1.486  
#>  3rd Qu.:1200.0   3rd Qu.:15.00               3rd Qu.:2.000  
#>  Max.   :1200.0   Max.   :15.00               Max.   :2.000  
#>      Var12           Var13           Var14           Var15      
#>  Min.   :1.000   Min.   :300.0   Min.   :300.0   Min.   :300.0  
#>  1st Qu.:1.000   1st Qu.:300.0   1st Qu.:300.0   1st Qu.:300.0  
#>  Median :1.000   Median :600.0   Median :600.0   Median :600.0  
#>  Mean   :1.487   Mean   :462.4   Mean   :455.7   Mean   :455.5  
#>  3rd Qu.:2.000   3rd Qu.:600.0   3rd Qu.:600.0   3rd Qu.:600.0  
#>  Max.   :2.000   Max.   :600.0   Max.   :600.0   Max.   :600.0  
#>      Var16           Var17           Var18           Var19     
#>  Min.   :100.0   Min.   : 10.0   Min.   : 50.0   Min.   : 500  
#>  1st Qu.:100.0   1st Qu.: 10.0   1st Qu.: 50.0   1st Qu.: 500  
#>  Median :300.0   Median :300.0   Median :200.0   Median :1500  
#>  Mean   :206.7   Mean   :167.2   Mean   :128.9   Mean   :1021  
#>  3rd Qu.:300.0   3rd Qu.:300.0   3rd Qu.:200.0   3rd Qu.:1500  
#>  Max.   :300.0   Max.   :300.0   Max.   :200.0   Max.   :1500  
#>      Var20           Var21           Var22           Var23      
#>  Min.   :1.000   Min.   :1.000   Min.   :300.0   Min.   :300.0  
#>  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:300.0   1st Qu.:300.0  
#>  Median :2.000   Median :2.000   Median :600.0   Median :600.0  
#>  Mean   :1.518   Mean   :1.527   Mean   :457.1   Mean   :460.7  
#>  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:600.0   3rd Qu.:600.0  
#>  Max.   :2.000   Max.   :2.000   Max.   :600.0   Max.   :600.0  
#>      Var24           Var25         Var26      
#>  Min.   : 20.0   Min.   : 50   Min.   :20.00  
#>  1st Qu.: 20.0   1st Qu.: 50   1st Qu.:20.00  
#>  Median :100.0   Median :500   Median :20.00  
#>  Mean   : 62.1   Mean   :286   Mean   :34.36  
#>  3rd Qu.:100.0   3rd Qu.:500   3rd Qu.:50.00  
#>  Max.   :100.0   Max.   :500   Max.   :50.00

Generating Correlation matrix

Corr_data<- cust_dat1[,-c(1,2,3,4,5,7,11)]
Corr_matrix = cor(Corr_data, Corr_data)
Corr_matrix
#>                Var4          Var6         Var7         Var8        Var11
#> Var4   1.0000000000  0.0084869625 -0.016426170 -0.011887708 -0.021433414
#> Var6   0.0084869625  1.0000000000 -0.016601486 -0.001639742 -0.018773241
#> Var7  -0.0164261702 -0.0166014856  1.000000000 -0.008263687  0.013658869
#> Var8  -0.0118877076 -0.0016397421 -0.008263687  1.000000000 -0.025891208
#> Var11 -0.0214334135 -0.0187732408  0.013658869 -0.025891208  1.000000000
#> Var12 -0.0291889736 -0.0183756088 -0.004237924 -0.001584855  0.093342590
#> Var13 -0.0074780901 -0.0181606513  0.015166188  0.062776569 -0.008057612
#> Var14  0.0081412574  0.0079601550  0.005933986  0.088479125 -0.014522108
#> Var15 -0.0082962241 -0.0080490199  0.004717581  0.043672495  0.002250745
#> Var16 -0.0247304882 -0.0181216165 -0.007386208  0.076836657  0.001138404
#> Var17  0.0058614881  0.0191671229 -0.016896847  0.065533550 -0.018041108
#> Var18  0.0163149718  0.0031414380  0.021915040  0.068031372  0.002713973
#> Var19  0.0128926422  0.0183661676 -0.009913839  0.052337013 -0.003217505
#> Var20 -0.0005958734 -0.0252579108  0.017713928  0.094567744 -0.014200927
#> Var21  0.0091587072 -0.0028695312 -0.005859775  0.059556852 -0.024880428
#> Var22  0.0079045003  0.0111555331 -0.035192482  0.049740227 -0.024667301
#> Var23  0.0087476315 -0.0061009282  0.014271514  0.091974712 -0.016397841
#> Var24  0.0214321635  0.0247711690 -0.010445500  0.075245064  0.005919705
#> Var25  0.0112857813 -0.0084686559  0.003649281  0.052508976 -0.011412886
#> Var26 -0.0233673560 -0.0007493272  0.049883692  0.009400265  0.079272854
#>              Var12        Var13        Var14        Var15        Var16
#> Var4  -0.029188974 -0.007478090  0.008141257 -0.008296224 -0.024730488
#> Var6  -0.018375609 -0.018160651  0.007960155 -0.008049020 -0.018121616
#> Var7  -0.004237924  0.015166188  0.005933986  0.004717581 -0.007386208
#> Var8  -0.001584855  0.062776569  0.088479125  0.043672495  0.076836657
#> Var11  0.093342590 -0.008057612 -0.014522108  0.002250745  0.001138404
#> Var12  1.000000000 -0.016785916 -0.015858887 -0.019101441 -0.014744562
#> Var13 -0.016785916  1.000000000  0.055095821  0.071290157  0.066840600
#> Var14 -0.015858887  0.055095821  1.000000000  0.033062575  0.074869594
#> Var15 -0.019101441  0.071290157  0.033062575  1.000000000  0.071763590
#> Var16 -0.014744562  0.066840600  0.074869594  0.071763590  1.000000000
#> Var17 -0.023564657  0.070737531  0.061858030  0.078861296  0.064704932
#> Var18 -0.033135551  0.045901180  0.043700356  0.065425822  0.081976392
#> Var19 -0.018983902  0.074923308  0.084166603  0.069805812  0.066638448
#> Var20 -0.006719257  0.038855652  0.076765383  0.073613092  0.067428673
#> Var21 -0.009446103  0.065436535  0.083316260  0.061752144  0.071812749
#> Var22  0.017224508  0.086469495  0.073527591  0.045549142  0.058198657
#> Var23 -0.001810930  0.072161033  0.060290850  0.060401542  0.056285116
#> Var24 -0.027526065  0.042685844  0.053320511  0.082260090  0.033801460
#> Var25 -0.016795578  0.069899582  0.069879267  0.077166696  0.056114190
#> Var26  0.091834729 -0.012929054  0.004033006 -0.003247205 -0.014369459
#>              Var17        Var18        Var19         Var20        Var21
#> Var4   0.005861488  0.016314972  0.012892642 -0.0005958734  0.009158707
#> Var6   0.019167123  0.003141438  0.018366168 -0.0252579108 -0.002869531
#> Var7  -0.016896847  0.021915040 -0.009913839  0.0177139278 -0.005859775
#> Var8   0.065533550  0.068031372  0.052337013  0.0945677437  0.059556852
#> Var11 -0.018041108  0.002713973 -0.003217505 -0.0142009275 -0.024880428
#> Var12 -0.023564657 -0.033135551 -0.018983902 -0.0067192565 -0.009446103
#> Var13  0.070737531  0.045901180  0.074923308  0.0388556524  0.065436535
#> Var14  0.061858030  0.043700356  0.084166603  0.0767653827  0.083316260
#> Var15  0.078861296  0.065425822  0.069805812  0.0736130921  0.061752144
#> Var16  0.064704932  0.081976392  0.066638448  0.0674286727  0.071812749
#> Var17  1.000000000  0.067108754  0.046327213  0.0649007652  0.060111816
#> Var18  0.067108754  1.000000000  0.069996634  0.0530612543  0.072957556
#> Var19  0.046327213  0.069996634  1.000000000  0.0670482046  0.063105668
#> Var20  0.064900765  0.053061254  0.067048205  1.0000000000  0.066224162
#> Var21  0.060111816  0.072957556  0.063105668  0.0662241623  1.000000000
#> Var22  0.044190144  0.040808016  0.083804951  0.0724514535  0.069191331
#> Var23  0.049894926  0.052875773  0.083348684  0.0552681908  0.056352975
#> Var24  0.067912723  0.060620690  0.053159226  0.0642836170  0.075364422
#> Var25  0.052531077  0.062803885  0.060112038  0.0688068081  0.086375942
#> Var26 -0.008425435 -0.002563147 -0.019871706 -0.0045003502 -0.006086261
#>             Var22        Var23         Var24        Var25         Var26
#> Var4   0.00790450  0.008747632  0.0214321635  0.011285781 -0.0233673560
#> Var6   0.01115553 -0.006100928  0.0247711690 -0.008468656 -0.0007493272
#> Var7  -0.03519248  0.014271514 -0.0104454997  0.003649281  0.0498836920
#> Var8   0.04974023  0.091974712  0.0752450642  0.052508976  0.0094002646
#> Var11 -0.02466730 -0.016397841  0.0059197048 -0.011412886  0.0792728536
#> Var12  0.01722451 -0.001810930 -0.0275260654 -0.016795578  0.0918347291
#> Var13  0.08646950  0.072161033  0.0426858435  0.069899582 -0.0129290545
#> Var14  0.07352759  0.060290850  0.0533205114  0.069879267  0.0040330058
#> Var15  0.04554914  0.060401542  0.0822600897  0.077166696 -0.0032472055
#> Var16  0.05819866  0.056285116  0.0338014596  0.056114190 -0.0143694593
#> Var17  0.04419014  0.049894926  0.0679127229  0.052531077 -0.0084254354
#> Var18  0.04080802  0.052875773  0.0606206899  0.062803885 -0.0025631468
#> Var19  0.08380495  0.083348684  0.0531592259  0.060112038 -0.0198717065
#> Var20  0.07245145  0.055268191  0.0642836170  0.068806808 -0.0045003502
#> Var21  0.06919133  0.056352975  0.0753644220  0.086375942 -0.0060862612
#> Var22  1.00000000  0.079703278  0.0809084996  0.051797477 -0.0155946707
#> Var23  0.07970328  1.000000000  0.0633165788  0.060751822 -0.0093850367
#> Var24  0.08090850  0.063316579  1.0000000000  0.092479963 -0.0001576408
#> Var25  0.05179748  0.060751822  0.0924799632  1.000000000 -0.0103331286
#> Var26 -0.01559467 -0.009385037 -0.0001576408 -0.010333129  1.0000000000

Creating VIF

library (car)
cust_dat1$Responder<-ifelse(cust_dat1$Plan_Chg_Flag == "Yes", 1, 0)
vif_Cust_data <- vif(lm(Responder ~ Var4+ Var6+ Var7+ Var8+ Var11+ Var12+ Var13+
                          Var14+ Var15+ Var16+ Var17+ Var18+ Var19+ Var20+ Var21+
                          Var22+ Var23+ Var24+ Var25+ Var26, data=cust_dat1))
vif_Cust_data
#>     Var4     Var6     Var7     Var8    Var11    Var12    Var13    Var14 
#> 1.004064 1.004115 1.006625 1.040132 1.017280 1.020826 1.034020 1.037052 
#>    Var15    Var16    Var17    Var18    Var19    Var20    Var21    Var22 
#> 1.034849 1.035711 1.030768 1.031110 1.037650 1.036979 1.037472 1.037842 
#>    Var23    Var24    Var25    Var26 
#> 1.034547 1.037831 1.035905 1.017624

Because VIF for all the factors is less than 2 and correlation matrix has lower corr values, we will move forward as their is no evidence of collinearity.

Let us make the develop IV report using Bi-variate and variable reduction technique

Our aim here is to perform the frequency analysis of each category against the response variable

#Let us remove the 2nd column which we used to create the response variable
dat = cust_dat1[,c(-20)]

#Let us see our new dataset dat
head(dat)
#>   Cust_id Plan_Chg_Flag    Var1   Var2    Var3 Var4   Var5 Var6 Var7 Var8
#> 1    1000           Yes   North Medium Private   30   Male    0  200    7
#> 2    1001           Yes Central    Low Private   30 Female    3  200   15
#> 3    1002           Yes Central   High Private   30   Male    0  600    7
#> 4    1003           Yes   South   High Private   20   Male    3  600    7
#> 5    1004           Yes   North   High Private   20 Female    3  600   15
#> 6    1005           Yes   North    Low Private   30 Female    0  200   15
#>   Var9 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var20 Var21 Var22
#> 1 Cash     1     1   300   300   300   300    10    50     2     1   300
#> 2 Cash     1     1   300   300   600   300    10    50     1     1   300
#> 3 Card     2     1   600   300   600   300    10    50     2     1   600
#> 4 Card     2     1   600   300   300   100    10   200     1     1   300
#> 5 Card     2     1   600   600   300   300    10    50     2     2   600
#> 6 Card     2     1   600   600   300   100    10   200     1     1   600
#>   Var23 Var24 Var25 Var26 Responder
#> 1   300   100    50    20         1
#> 2   300    20   500    20         1
#> 3   300   100   500    20         1
#> 4   600   100   500    20         1
#> 5   600   100   500    20         1
#> 6   300    20    50    20         1

Let’s make the table to identify variable type

bi_var<-apply(dat,2,typeof)

# add variable name for the type and added the flag ..default set to 1 for all var
bi_var<-data.frame(colnames(dat),bi_var,flag=1)
# set the row names as numbers
row.names(bi_var)<-1:nrow(bi_var)
# set the column names
colnames(bi_var)<-c("variable","var_type","flag")
bi_var
#>         variable  var_type flag
#> 1        Cust_id character    1
#> 2  Plan_Chg_Flag character    1
#> 3           Var1 character    1
#> 4           Var2 character    1
#> 5           Var3 character    1
#> 6           Var4 character    1
#> 7           Var5 character    1
#> 8           Var6 character    1
#> 9           Var7 character    1
#> 10          Var8 character    1
#> 11          Var9 character    1
#> 12         Var11 character    1
#> 13         Var12 character    1
#> 14         Var13 character    1
#> 15         Var14 character    1
#> 16         Var15 character    1
#> 17         Var16 character    1
#> 18         Var17 character    1
#> 19         Var18 character    1
#> 20         Var20 character    1
#> 21         Var21 character    1
#> 22         Var22 character    1
#> 23         Var23 character    1
#> 24         Var24 character    1
#> 25         Var25 character    1
#> 26         Var26 character    1
#> 27     Responder character    1

#Let us get the position for variables to set the flag as 0
bi_var$flag[which( bi_var$variable %in% c("Cust_id","Responder"))]= 0
## remove those with flag as 0
bi_var<-bi_var[bi_var$flag==1,]
bi_var
#>         variable  var_type flag
#> 2  Plan_Chg_Flag character    1
#> 3           Var1 character    1
#> 4           Var2 character    1
#> 5           Var3 character    1
#> 6           Var4 character    1
#> 7           Var5 character    1
#> 8           Var6 character    1
#> 9           Var7 character    1
#> 10          Var8 character    1
#> 11          Var9 character    1
#> 12         Var11 character    1
#> 13         Var12 character    1
#> 14         Var13 character    1
#> 15         Var14 character    1
#> 16         Var15 character    1
#> 17         Var16 character    1
#> 18         Var17 character    1
#> 19         Var18 character    1
#> 20         Var20 character    1
#> 21         Var21 character    1
#> 22         Var22 character    1
#> 23         Var23 character    1
#> 24         Var24 character    1
#> 25         Var25 character    1
#> 26         Var26 character    1

Setting up bi-var analysis

event_rate = NULL
#loop in till all the var in the table
for ( i in 1:nrow(bi_var))
{
# Get the freq table for each var..ensure that the deleted var and numbers are in sequence..eg responder should be at 2nd position
aa<-as.matrix(table(dat[,i+2],dat[,27]))
cc<-aa
#Append var name and the categories in that variable
bb<-cbind(rep(as.character(bi_var$variable[i]),nrow(aa)),row.names(aa))
#Merge the name, cat and freq table
aa<-data.frame(cbind(bb,aa))
#Calc for ER, NER, WOE, IV and cum IV
aa[,5]<-as.numeric(cc[,1])/sum(as.numeric(cc[,1]))
aa[,6]<-as.numeric(cc[,2])/sum(as.numeric(cc[,2]))
aa[,7]<-log(aa[,5]/aa[,6])
aa[,8]<-(aa[,5]-aa[,6])*aa[,7]
aa[,9]<-sum(aa[,8])
#Append everything in new dataset
event_rate<-rbind(event_rate,aa)
}
#Give the column names for data created above ..after the for loop
colnames(event_rate)<-c("variable","Factor","Res","Non-Res","ER","NER","WOE","IV","Cum_IV")
#Read the eventrate file and safe for analysis
head(event_rate) 
#>              variable  Factor  Res Non-Res        ER   NER          WOE
#> Central Plan_Chg_Flag Central 1436     163 0.3191111 0.326 -0.021358029
#> North   Plan_Chg_Flag   North 1579     182 0.3508889 0.364 -0.036684250
#> South   Plan_Chg_Flag   South 1485     155 0.3300000 0.310  0.062520357
#> High             Var1    High 1574     171 0.3497778 0.342  0.022487295
#> Low              Var1     Low 1513     168 0.3362222 0.336  0.000661157
#> Medium           Var1  Medium 1413     161 0.3140000 0.322 -0.025158560
#>                   IV       Cum_IV
#> Central 1.471331e-04 0.0018785115
#> North   4.809713e-04 0.0018785115
#> South   1.250407e-03 0.0018785115
#> High    1.749012e-04 0.0003763166
#> Low     1.469238e-07 0.0003763166
#> Medium  2.012685e-04 0.0003763166

Binning the complete dataset to create model

cust_dat1$GRPVar1<-ifelse(cust_dat1$Var1=="North",1,
                          ifelse(cust_dat1$Var1=="South",2,3))
cust_dat1$GRPVar2<-ifelse(cust_dat1$Var2=="Low",1,
                          ifelse(cust_dat1$Var2=="Medium",2,3))
cust_dat1$GRPVar3<-ifelse(cust_dat1$Var3=="Unemployed",1,
                          ifelse(cust_dat1$Var3=="Govt",2,3))
cust_dat1$GRPVar4<-ifelse(cust_dat1$Var4 < 25,1,
                          ifelse(cust_dat1$Var4< 40,2,3))
cust_dat1$GRPVar5<-ifelse(cust_dat1$Var5=="Male",1,2)
cust_dat1$GRPVar6<-ifelse(cust_dat1$Var6 < 2,1,2)
cust_dat1$GRPVar7<-ifelse(cust_dat1$Var7 < 500,1,
                          ifelse(cust_dat1$Var7< 1000,2,3))
cust_dat1$GRPVar8<-ifelse(cust_dat1$Var8 < 13,1,2)
cust_dat1$GRPVar9<-ifelse(cust_dat1$Var9=="Cash",1,2)
cust_dat1$GRPVar11<-ifelse(cust_dat1$Var11 < 2,1,2)
cust_dat1$GRPVar12<-ifelse(cust_dat1$Var12 < 2,1,2)
cust_dat1$GRPVar13<-ifelse(cust_dat1$Var13 < 500,1,2)
cust_dat1$GRPVar14<-ifelse(cust_dat1$Var14 < 500,1,2)
cust_dat1$GRPVar15<-ifelse(cust_dat1$Var15 < 500,1,2)
cust_dat1$GRPVar16<-ifelse(cust_dat1$Var16 < 200,1,2)
cust_dat1$GRPVar17<-ifelse(cust_dat1$Var17 < 20,1,2)
cust_dat1$GRPVar18<-ifelse(cust_dat1$Var18 < 100,1,2)
cust_dat1$GRPVar19<-ifelse(cust_dat1$Var19 < 1000,1,2)
cust_dat1$GRPVar20<-ifelse(cust_dat1$Var20 < 2,1,2)
cust_dat1$GRPVar21<-ifelse(cust_dat1$Var21 < 2,1,2)
cust_dat1$GRPVar22<-ifelse(cust_dat1$Var22 < 500,1,2)
cust_dat1$GRPVar23<-ifelse(cust_dat1$Var23 < 500,1,2)
cust_dat1$GRPVar24<-ifelse(cust_dat1$Var24 < 50,1,2)
cust_dat1$GRPVar25<-ifelse(cust_dat1$Var25 < 100,1,2)
cust_dat1$GRPVar26<-ifelse(cust_dat1$Var26 < 30,1,2)
cust_modeldata<-cust_dat1[,c(28,29,30,31,32,33,34,35,36,37,38,39,
                             40,41,42,43,44,45,46,47,48,49,50,51,52,53)]
str(cust_modeldata)
#> 'data.frame':    5000 obs. of  26 variables:
#>  $ Responder: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ GRPVar1  : num  1 3 3 2 1 1 1 1 3 1 ...
#>  $ GRPVar2  : num  2 1 3 3 3 1 3 2 1 3 ...
#>  $ GRPVar3  : num  3 3 3 3 3 3 3 3 3 3 ...
#>  $ GRPVar4  : num  2 2 2 1 1 2 2 2 2 1 ...
#>  $ GRPVar5  : num  1 2 1 1 2 2 1 2 1 1 ...
#>  $ GRPVar6  : num  1 2 1 2 2 1 1 2 2 2 ...
#>  $ GRPVar7  : num  1 1 2 2 2 1 2 1 1 2 ...
#>  $ GRPVar8  : num  1 2 1 1 2 2 1 2 1 1 ...
#>  $ GRPVar9  : num  1 1 2 2 2 2 2 1 1 2 ...
#>  $ GRPVar11 : num  1 1 2 2 2 2 2 2 2 2 ...
#>  $ GRPVar12 : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ GRPVar13 : num  1 1 2 2 2 2 2 1 1 2 ...
#>  $ GRPVar14 : num  1 1 1 1 2 2 1 2 2 1 ...
#>  $ GRPVar15 : num  1 2 2 1 1 1 2 1 1 2 ...
#>  $ GRPVar16 : num  2 2 2 1 2 1 1 1 1 2 ...
#>  $ GRPVar17 : num  1 1 1 1 1 1 2 1 1 2 ...
#>  $ GRPVar18 : num  1 1 1 2 1 2 1 2 1 2 ...
#>  $ GRPVar19 : num  1 2 2 2 1 2 2 1 2 1 ...
#>  $ GRPVar20 : num  2 1 2 1 2 1 2 2 2 2 ...
#>  $ GRPVar21 : num  1 1 1 1 2 1 2 2 2 1 ...
#>  $ GRPVar22 : num  1 1 2 1 2 2 1 1 2 1 ...
#>  $ GRPVar23 : num  1 1 1 2 2 1 1 2 1 2 ...
#>  $ GRPVar24 : num  2 1 2 2 2 1 2 1 1 2 ...
#>  $ GRPVar25 : num  1 2 2 2 2 1 2 2 2 2 ...
#>  $ GRPVar26 : num  1 1 1 1 1 1 1 1 1 1 ...

Let us prepare training and validation dataset

#installing library sampling to get strata function
library(sampling)
#> Warning: package 'sampling' was built under R version 3.4.4
#Sorting the data
cust_modeldata = cust_modeldata[order(cust_modeldata$Responder,decreasing = TRUE),]
training_data = strata(cust_modeldata,c("Responder"),
                       size=c(350,3150), method="srswor")
training_data<-getdata(cust_modeldata,training_data)
str(training_data)
#> 'data.frame':    3500 obs. of  29 variables:
#>  $ GRPVar1  : num  1 3 2 1 1 1 3 3 2 3 ...
#>  $ GRPVar2  : num  2 1 3 1 3 2 3 1 3 3 ...
#>  $ GRPVar3  : num  3 3 3 3 3 3 3 3 2 1 ...
#>  $ GRPVar4  : num  2 2 1 2 2 2 2 2 2 2 ...
#>  $ GRPVar5  : num  1 2 1 2 1 2 2 1 1 2 ...
#>  $ GRPVar6  : num  1 2 2 1 1 2 1 1 2 1 ...
#>  $ GRPVar7  : num  1 1 2 1 2 1 2 1 1 1 ...
#>  $ GRPVar8  : num  1 2 1 2 1 2 2 1 1 2 ...
#>  $ GRPVar9  : num  1 1 2 2 2 1 2 2 2 1 ...
#>  $ GRPVar11 : num  1 1 2 2 2 2 2 2 2 2 ...
#>  $ GRPVar12 : num  1 1 1 1 1 1 2 2 2 2 ...
#>  $ GRPVar13 : num  1 1 2 2 2 1 2 2 2 2 ...
#>  $ GRPVar14 : num  1 1 1 2 1 2 1 1 2 2 ...
#>  $ GRPVar15 : num  1 2 1 1 2 1 2 1 1 2 ...
#>  $ GRPVar16 : num  2 2 1 1 1 1 1 2 2 1 ...
#>  $ GRPVar17 : num  1 1 1 1 2 1 2 1 1 2 ...
#>  $ GRPVar18 : num  1 1 2 2 1 2 2 1 2 2 ...
#>  $ GRPVar19 : num  1 2 2 2 2 1 2 1 1 1 ...
#>  $ GRPVar20 : num  2 1 1 1 2 2 2 1 1 1 ...
#>  $ GRPVar21 : num  1 1 1 1 2 2 1 2 2 1 ...
#>  $ GRPVar22 : num  1 1 1 2 1 1 2 1 2 1 ...
#>  $ GRPVar23 : num  1 1 2 1 1 2 2 1 2 2 ...
#>  $ GRPVar24 : num  2 1 2 1 2 1 1 1 1 1 ...
#>  $ GRPVar25 : num  1 2 2 1 2 2 2 1 1 1 ...
#>  $ GRPVar26 : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Responder: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID_unit  : int  1 2 4 6 7 8 11 12 14 17 ...
#>  $ Prob     : num  0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 ...
#>  $ Stratum  : int  1 1 1 1 1 1 1 1 1 1 ...

Let us run the model

fit <- glm(Responder ~as.factor(GRPVar2)+as.factor(GRPVar7)+as.factor(GRPVar8)
           +as.factor(GRPVar9)+as.factor(GRPVar11)+as.factor(GRPVar12)
           +as.factor(GRPVar1)+as.factor(GRPVar14)+as.factor(GRPVar15)
           +as.factor(GRPVar16)+as.factor(GRPVar17)+as.factor(GRPVar18)
           +as.factor(GRPVar19)+as.factor(GRPVar20)+as.factor(GRPVar21)
           +as.factor(GRPVar22)+as.factor(GRPVar23)+as.factor(GRPVar24)
           +as.factor(GRPVar25), family = binomial("logit"),data=training_data )
#Let us see the results now
summary(fit)
#> 
#> Call:
#> glm(formula = Responder ~ as.factor(GRPVar2) + as.factor(GRPVar7) + 
#>     as.factor(GRPVar8) + as.factor(GRPVar9) + as.factor(GRPVar11) + 
#>     as.factor(GRPVar12) + as.factor(GRPVar1) + as.factor(GRPVar14) + 
#>     as.factor(GRPVar15) + as.factor(GRPVar16) + as.factor(GRPVar17) + 
#>     as.factor(GRPVar18) + as.factor(GRPVar19) + as.factor(GRPVar20) + 
#>     as.factor(GRPVar21) + as.factor(GRPVar22) + as.factor(GRPVar23) + 
#>     as.factor(GRPVar24) + as.factor(GRPVar25), family = binomial("logit"), 
#>     data = training_data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5085  -0.4209  -0.2364  -0.0001   3.5840  
#> 
#> Coefficients:
#>                       Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)           -4.87085    0.27422 -17.762  < 2e-16 ***
#> as.factor(GRPVar2)2    0.09527    0.16780   0.568 0.570184    
#> as.factor(GRPVar2)3    0.12870    0.16309   0.789 0.430009    
#> as.factor(GRPVar7)2    0.28179    0.14565   1.935 0.053030 .  
#> as.factor(GRPVar7)3  -17.12944  309.79923  -0.055 0.955906    
#> as.factor(GRPVar8)2    0.29179    0.14747   1.979 0.047852 *  
#> as.factor(GRPVar9)2    0.26274    0.14258   1.843 0.065367 .  
#> as.factor(GRPVar11)2  -0.98343    0.14590  -6.741 1.58e-11 ***
#> as.factor(GRPVar12)2  -0.48727    0.14029  -3.473 0.000514 ***
#> as.factor(GRPVar1)2   -0.07923    0.16441  -0.482 0.629899    
#> as.factor(GRPVar1)3   -0.06717    0.16305  -0.412 0.680385    
#> as.factor(GRPVar14)2   0.56021    0.14864   3.769 0.000164 ***
#> as.factor(GRPVar15)2   0.39504    0.14792   2.671 0.007569 ** 
#> as.factor(GRPVar16)2   0.49988    0.14839   3.369 0.000755 ***
#> as.factor(GRPVar17)2   0.22597    0.14827   1.524 0.127486    
#> as.factor(GRPVar18)2   0.44926    0.14675   3.061 0.002203 ** 
#> as.factor(GRPVar19)2   0.33036    0.14724   2.244 0.024858 *  
#> as.factor(GRPVar20)2   0.55688    0.14871   3.745 0.000181 ***
#> as.factor(GRPVar21)2   0.46755    0.14773   3.165 0.001551 ** 
#> as.factor(GRPVar22)2   0.37475    0.14750   2.541 0.011062 *  
#> as.factor(GRPVar23)2   0.46218    0.14608   3.164 0.001557 ** 
#> as.factor(GRPVar24)2   0.42752    0.15113   2.829 0.004672 ** 
#> as.factor(GRPVar25)2   0.42829    0.14843   2.886 0.003907 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2275.6  on 3499  degrees of freedom
#> Residual deviance: 1508.8  on 3477  degrees of freedom
#> AIC: 1554.8
#> 
#> Number of Fisher Scoring iterations: 18

Prediction

summary_pred_model<-predict(fit, type="response")
str(summary_pred_model)
#>  Named num [1:3500] 0.0359 0.0478 0.0406 0.027 0.0871 ...
#>  - attr(*, "names")= chr [1:3500] "1" "2" "4" "6" ...

Let us check the model fit criteria now

#Hosmer lemeshow goodness of fit test
hosmerlem <- function (y, yhat, g = 12)
{
cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0,1, 1/g)), 
               include.lowest = T)
obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq <- sum((obs - expect)^2/expect)
P <- 1 - pchisq(chisq, g - 2)
c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}

#Run the above function after setting the values
R_hat<-as.vector(fitted(fit))
yhat<-R_hat
y<-training_data$Responder
hosmerlem(y, yhat)
#>      X^2       Df  P(>Chi) 
#> 1674.226   10.000    0.000

Let us make lift chart for model

library(ROCR)
#> Warning: package 'ROCR' was built under R version 3.4.4
#> Loading required package: gplots
#> Warning: package 'gplots' was built under R version 3.4.4
#> 
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#> 
#>     lowess
gain.chart <- function(y_hat,y) {
plot(performance(prediction(y_hat,y), "tpr", "rpp"),lwd = 7, main = "Lift Chart")
lines(ecdf((rank(-y_hat)[y == T]) / length(y)),verticals = T, do.points = F, col = "red", lwd = 3)
}
gain.chart(R_hat,training_data$Responder)