Political persuasion is less about changing people's minds and more about getting people who already agree with you to go out and vote. Predictive analytics now plays a large role in this effort.
Business Problem: Identify which voters can be persuaded by a flyer, which voters the flyer will affect negatively, which voters it will not affect at all, and which voters will vote your way no matter what happens.
Analytics Problem: Develop an uplift model that predicts uplift for each voter.
The original data set has 79 variables. All of the variable names are listed below, followed by the first 6 records for the first 10 variables.
## [1] "VOTER_ID" "SET_NO" "OPP_SEX" "AGE"
## [5] "HH_ND" "HH_NR" "HH_NI" "MED_AGE"
## [9] "NH_WHITE" "NH_AA" "NH_ASIAN" "NH_MULT"
## [13] "HISP" "COMM_LT10" "COMM_609P" "MED_HH_INC"
## [17] "COMM_CAR" "COMM_CP" "COMM_PT" "COMM_WALK"
## [21] "KIDS" "M_MAR" "F_MAR" "ED_4COL"
## [25] "GENDER_F" "GENDER_M" "H_AFDLN3P" "H_F1"
## [29] "H_M1" "H_MFDLN3P" "PARTY_D" "PARTY_I"
## [33] "PARTY_R" "VPP_08" "VPP_12" "VPR_08"
## [37] "VPR_10" "VPR_12" "VG_04" "VG_06"
## [41] "VG_08" "VG_10" "VG_12" "PP_PELIG"
## [45] "PR_PELIG" "AP_PELIG" "G_PELIG" "E_PELIG"
## [49] "NL5G" "NL3PR" "NL5AP" "NL2PP"
## [53] "REG_DAYS" "UPSCALEBUY" "UPSCALEMAL" "UPSCALEFEM"
## [57] "BOOKBUYERI" "FAMILYMAGA" "FEMALEORIE" "RELIGIOUSM"
## [61] "GARDENINGM" "CULINARYIN" "HEALTHFITN" "DOITYOURSE"
## [65] "FINANCIALM" "RELIGIOUSC" "POLITICALC" "MEDIANEDUC"
## [69] "CAND1S" "CAND2S" "MESSAGE_A" "MESSAGE_A_REV"
## [73] "I3" "CAND1_UND" "CAND2_UND" "MOVED_AD"
## [77] "MOVED_A" "opposite" "Partition"
## # A tibble: 6 × 10
## VOTER_ID SET_NO OPP_SEX AGE HH_ND HH_NR HH_NI MED_AGE NH_WHITE NH_AA
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 193801 2 0 28 1 1 1 37 61 34
## 2 627701 1 0 53 2 0 0 46 87 8
## 3 306924 2 0 68 2 1 0 41 23 64
## 4 547609 1 0 66 0 2 0 35 53 29
## 5 141105 3 0 23 0 3 1 42 74 18
## 6 334787 1 0 49 2 0 0 32 64 30
## # A tibble: 4 × 4
## # Groups: MESSAGE_A [2]
## MESSAGE_A MOVED_AD n prop
## <dbl> <chr> <int> <dbl>
## 1 0 N 3278 0.656
## 2 0 Y 1722 0.344
## 3 1 N 2988 0.598
## 4 1 Y 2012 0.402
## MESSAGE_A MOVED_A
## 1 0 0.3444
## 2 1 0.4024
Of the 5,000 people in the treatment group (those who received a flyer supporting Smith), 40.2% moved in a Democratic direction, versus 34.4% of the 5,000 people who did not get a message. The overall lift is therefore 40.2% - 34.4% = 5.8 percentage points.
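The lift arithmetic can be checked directly from the counts in the table above. The following is a small Python sketch (the report itself was produced in R); the variable and function names are illustrative and not taken from the original analysis.

```python
# Counts copied from the MESSAGE_A x MOVED_AD table above.
moved = {0: 1722, 1: 2012}      # voters with MOVED_AD == "Y"
not_moved = {0: 3278, 1: 2988}  # voters with MOVED_AD == "N"


def move_rate(group):
    """Share of a group (0 = control, 1 = treatment) that moved."""
    return moved[group] / (moved[group] + not_moved[group])


control_rate = move_rate(0)
treatment_rate = move_rate(1)
overall_lift = treatment_rate - control_rate

print(f"control rate:   {control_rate:.4f}")
print(f"treatment rate: {treatment_rate:.4f}")
print(f"overall lift:   {overall_lift:.4f}")
```

The lift is a difference of two proportions, so it is expressed in percentage points rather than as a relative change.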
Based on the missingness map, no values are missing, so we do not need to impute any data.
##
## Call:
## glm(formula = MOVED_A ~ ., family = binomial, data = VD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1617 -0.4589 -0.2291 0.5825 3.3129
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.468e+00 3.086e+00 1.124 0.26115
## OPP_SEX 9.307e-03 5.472e-03 1.701 0.08898 .
## AGE -4.338e-03 1.982e-03 -2.188 0.02866 *
## HH_ND 1.714e-01 3.285e-02 5.217 1.82e-07 ***
## HH_NR 2.036e-01 4.761e-02 4.276 1.90e-05 ***
## HH_NI 2.088e-01 4.519e-02 4.620 3.84e-06 ***
## MED_AGE 3.718e-02 8.103e-03 4.588 4.47e-06 ***
## NH_WHITE -4.107e-02 2.795e-02 -1.470 0.14166
## NH_AA -3.433e-02 2.850e-02 -1.205 0.22839
## NH_ASIAN 1.376e-03 3.075e-02 0.045 0.96432
## NH_MULT 2.951e-02 2.848e-02 1.036 0.30011
## HISP -3.152e-02 3.000e-02 -1.051 0.29340
## COMM_LT10 1.330e-03 6.860e-03 0.194 0.84628
## COMM_609P -1.354e-02 8.587e-03 -1.577 0.11481
## MED_HH_INC -1.116e-06 2.688e-06 -0.415 0.67798
## COMM_CAR -1.104e-02 1.357e-02 -0.814 0.41588
## COMM_CP 7.804e-03 1.423e-02 0.548 0.58337
## COMM_PT 3.408e-02 1.964e-02 1.735 0.08269 .
## COMM_WALK -5.703e-03 1.722e-02 -0.331 0.74050
## KIDS 5.907e-03 1.044e-02 0.566 0.57167
## M_MAR -1.359e-03 5.342e-03 -0.254 0.79914
## F_MAR -1.777e-03 4.356e-03 -0.408 0.68337
## ED_4COL 1.365e-02 6.614e-03 2.064 0.03899 *
## GENDER_F 1.042e+00 6.916e-02 15.064 < 2e-16 ***
## GENDER_M NA NA NA NA
## H_AFDLN3P -2.252e-01 3.562e-01 -0.632 0.52724
## H_F1 9.401e-01 1.106e-01 8.497 < 2e-16 ***
## H_M1 6.395e-01 1.314e-01 4.866 1.14e-06 ***
## H_MFDLN3P 1.885e-01 1.537e-01 1.226 0.22004
## PARTY_D -5.304e-01 2.621e-01 -2.024 0.04302 *
## PARTY_I -1.347e-01 2.530e-01 -0.533 0.59437
## PARTY_R -7.832e-01 2.815e-01 -2.782 0.00540 **
## VPP_08 1.088e+01 2.041e+02 0.053 0.95747
## VPP_12 8.276e+00 2.041e+02 0.041 0.96765
## VPR_08 -2.959e-01 4.904e-01 -0.603 0.54631
## VPR_10 -4.291e-01 4.815e-01 -0.891 0.37279
## VPR_12 -2.105e-01 5.065e-01 -0.415 0.67778
## VG_04 -1.882e-01 8.782e-02 -2.143 0.03209 *
## VG_06 -2.052e-01 9.666e-02 -2.123 0.03373 *
## VG_08 -2.287e-01 9.201e-02 -2.486 0.01292 *
## VG_10 -2.247e-02 9.442e-02 -0.238 0.81190
## VG_12 -1.639e-01 1.255e-01 -1.307 0.19138
## PP_PELIG -2.091e-01 4.081e+00 -0.051 0.95914
## PR_PELIG 1.560e-02 2.572e-02 0.606 0.54431
## AP_PELIG -1.896e-02 1.980e-02 -0.957 0.33848
## G_PELIG 5.857e-03 3.578e-03 1.637 0.10167
## E_PELIG 1.317e-02 4.694e-03 2.806 0.00502 **
## NL5G NA NA NA NA
## NL3PR NA NA NA NA
## NL5AP NA NA NA NA
## NL2PP NA NA NA NA
## REG_DAYS 5.780e-07 8.368e-06 0.069 0.94493
## UPSCALEBUY 3.954e-01 2.822e-01 1.401 0.16127
## UPSCALEMAL -2.210e-01 3.002e-01 -0.736 0.46158
## UPSCALEFEM 1.833e-01 7.690e-02 2.384 0.01712 *
## BOOKBUYERI 1.492e-01 7.024e-02 2.124 0.03371 *
## FAMILYMAGA -1.815e-01 8.147e-02 -2.228 0.02588 *
## FEMALEORIE -2.363e-01 2.704e-01 -0.874 0.38216
## RELIGIOUSM -1.326e+01 8.350e+02 -0.016 0.98733
## GARDENINGM 8.297e-02 1.652e-01 0.502 0.61550
## CULINARYIN -6.596e-03 2.447e-01 -0.027 0.97849
## HEALTHFITN -1.017e-01 7.001e-02 -1.453 0.14632
## DOITYOURSE -1.171e-01 1.555e-01 -0.753 0.45150
## FINANCIALM 8.463e-02 9.294e-02 0.911 0.36249
## RELIGIOUSC -1.754e-02 1.286e-01 -0.136 0.89152
## POLITICALC -6.592e-02 1.303e-01 -0.506 0.61299
## MEDIANEDUC -2.508e-02 3.987e-02 -0.629 0.52927
## CAND1SS -3.684e+00 9.587e-02 -38.423 < 2e-16 ***
## CAND1SU -2.162e-01 7.650e-02 -2.826 0.00472 **
## CAND2SS -4.970e-01 8.193e-02 -6.066 1.31e-09 ***
## CAND2SU -1.615e+00 1.168e-01 -13.821 < 2e-16 ***
## MESSAGE_A 4.776e-01 6.017e-02 7.938 2.06e-15 ***
## MESSAGE_A_REV NA NA NA NA
## I3Y NA NA NA NA
## CAND1_UNDY NA NA NA NA
## CAND2_UNDY NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13214.8 on 9999 degrees of freedom
## Residual deviance: 7294.9 on 9933 degrees of freedom
## AIC: 7428.9
##
## Number of Fisher Scoring iterations: 14
Based on the summary, OPP_SEX, AGE, HH_ND, HH_NR, MED_AGE, COMM_PT, ED_4COL, GENDER_F, H_F1, H_M1, PARTY_D, PARTY_R, VG_04, VG_06, VG_08, E_PELIG, UPSCALEFEM, BOOKBUYERI, FAMILYMAGA, CAND1SS, CAND1SU, CAND2SS, CAND2SU, and MESSAGE_A appear to be significant in the logistic regression at the 0.10 level or better. HH_NI is also significant in this summary, but it is not carried into the models that follow.
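To make a coefficient from this summary easier to read, it can be converted to an odds ratio. The Python sketch below (the report itself was produced in R) does this for the MESSAGE_A estimate and standard error printed above; the approximate 95% Wald interval is an added illustration, not something the original analysis reports.

```python
import math

# MESSAGE_A row of the full-model summary above.
beta = 0.4776        # estimate on the log-odds scale
std_error = 0.06017  # standard error of the estimate

# Exponentiating a logistic coefficient gives an odds ratio.
odds_ratio = math.exp(beta)

# Approximate 95% Wald interval, computed on the log-odds scale
# and then exponentiated (illustrative addition).
ci_low = math.exp(beta - 1.96 * std_error)
ci_high = math.exp(beta + 1.96 * std_error)

print(f"odds ratio: {odds_ratio:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

Holding the other predictors fixed, receiving the flyer multiplies the odds of moving by roughly 1.6 under this model.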
Next we use stepwise logistic regression on these predictors to look for the model with the lowest AIC.
##
## Call:
## glm(formula = MOVED_A ~ ., family = binomial, data = SDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8266 -0.4559 -0.2524 0.6045 3.3522
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.291343 0.227942 -1.278 0.20120
## OPP_SEX 0.008754 0.005192 1.686 0.09176 .
## AGE -0.005163 0.001886 -2.737 0.00620 **
## HH_ND 0.171625 0.030633 5.603 2.11e-08 ***
## HH_NR 0.215463 0.043256 4.981 6.32e-07 ***
## MED_AGE 0.014642 0.004459 3.284 0.00102 **
## COMM_PT 0.062766 0.006146 10.212 < 2e-16 ***
## ED_4COL 0.012462 0.002647 4.708 2.50e-06 ***
## GENDER_F 1.042478 0.067914 15.350 < 2e-16 ***
## H_F1 0.844275 0.105017 8.039 9.03e-16 ***
## H_M1 0.525995 0.127082 4.139 3.49e-05 ***
## PARTY_D -0.654827 0.089541 -7.313 2.61e-13 ***
## PARTY_R -1.192930 0.136282 -8.753 < 2e-16 ***
## VG_04 -0.157126 0.083688 -1.878 0.06045 .
## VG_06 -0.159145 0.087574 -1.817 0.06918 .
## VG_08 -0.139105 0.081385 -1.709 0.08741 .
## E_PELIG 0.011830 0.001426 8.296 < 2e-16 ***
## UPSCALEFEM 0.177946 0.065299 2.725 0.00643 **
## BOOKBUYERI 0.048672 0.040257 1.209 0.22665
## FAMILYMAGA -0.165175 0.064536 -2.559 0.01048 *
## CAND1SS -3.566725 0.091849 -38.832 < 2e-16 ***
## CAND1SU -0.210141 0.075041 -2.800 0.00510 **
## CAND2SS -0.517949 0.081900 -6.324 2.55e-10 ***
## CAND2SU -1.622831 0.116117 -13.976 < 2e-16 ***
## MESSAGE_A 0.477210 0.059244 8.055 7.95e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13214.8 on 9999 degrees of freedom
## Residual deviance: 7489.5 on 9975 degrees of freedom
## AIC: 7539.5
##
## Number of Fisher Scoring iterations: 5
## Start: AIC=7539.49
## MOVED_A ~ OPP_SEX + AGE + HH_ND + HH_NR + MED_AGE + COMM_PT +
## ED_4COL + GENDER_F + H_F1 + H_M1 + PARTY_D + PARTY_R + VG_04 +
## VG_06 + VG_08 + E_PELIG + UPSCALEFEM + BOOKBUYERI + FAMILYMAGA +
## CAND1S + CAND2S + MESSAGE_A
##
## Df Deviance AIC
## - BOOKBUYERI 1 7490.9 7538.9
## <none> 7489.5 7539.5
## - OPP_SEX 1 7492.2 7540.2
## - VG_08 1 7492.4 7540.4
## - VG_06 1 7492.8 7540.8
## - VG_04 1 7493.0 7541.0
## - FAMILYMAGA 1 7496.0 7544.0
## - UPSCALEFEM 1 7496.9 7544.9
## - AGE 1 7497.0 7545.0
## - MED_AGE 1 7500.3 7548.3
## - H_M1 1 7506.6 7554.6
## - ED_4COL 1 7511.8 7559.8
## - HH_NR 1 7514.6 7562.6
## - HH_ND 1 7521.3 7569.3
## - PARTY_D 1 7544.3 7592.3
## - MESSAGE_A 1 7555.2 7603.2
## - H_F1 1 7555.3 7603.3
## - E_PELIG 1 7559.0 7607.0
## - PARTY_R 1 7569.0 7617.0
## - COMM_PT 1 7597.3 7645.3
## - CAND2S 2 7692.5 7738.5
## - GENDER_F 1 7736.1 7784.1
## - CAND1S 2 9955.3 10001.3
##
## Step: AIC=7538.95
## MOVED_A ~ OPP_SEX + AGE + HH_ND + HH_NR + MED_AGE + COMM_PT +
## ED_4COL + GENDER_F + H_F1 + H_M1 + PARTY_D + PARTY_R + VG_04 +
## VG_06 + VG_08 + E_PELIG + UPSCALEFEM + FAMILYMAGA + CAND1S +
## CAND2S + MESSAGE_A
##
## Df Deviance AIC
## <none> 7490.9 7538.9
## + BOOKBUYERI 1 7489.5 7539.5
## - OPP_SEX 1 7493.7 7539.7
## - VG_08 1 7493.9 7539.9
## - VG_06 1 7494.2 7540.2
## - VG_04 1 7494.4 7540.4
## - FAMILYMAGA 1 7496.0 7542.0
## - AGE 1 7498.3 7544.3
## - MED_AGE 1 7500.8 7546.8
## - UPSCALEFEM 1 7501.0 7547.0
## - H_M1 1 7508.0 7554.0
## - ED_4COL 1 7514.5 7560.5
## - HH_NR 1 7516.3 7562.3
## - HH_ND 1 7522.8 7568.8
## - PARTY_D 1 7546.0 7592.0
## - H_F1 1 7556.4 7602.4
## - MESSAGE_A 1 7557.1 7603.1
## - E_PELIG 1 7560.4 7606.4
## - PARTY_R 1 7570.6 7616.6
## - COMM_PT 1 7597.3 7643.3
## - CAND2S 2 7693.5 7737.5
## - GENDER_F 1 7737.1 7783.1
## - CAND1S 2 9956.6 10000.6
##
## Call:
## glm(formula = MOVED_A ~ OPP_SEX + AGE + HH_ND + HH_NR + MED_AGE +
## COMM_PT + ED_4COL + GENDER_F + H_F1 + H_M1 + PARTY_D + PARTY_R +
## VG_04 + VG_06 + VG_08 + E_PELIG + UPSCALEFEM + FAMILYMAGA +
## CAND1S + CAND2S + MESSAGE_A, family = binomial, data = SDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8201 -0.4556 -0.2531 0.6037 3.3560
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.226694 0.221544 -1.023 0.30619
## OPP_SEX 0.008691 0.005197 1.673 0.09442 .
## AGE -0.005099 0.001885 -2.705 0.00684 **
## HH_ND 0.171508 0.030622 5.601 2.13e-08 ***
## HH_NR 0.216340 0.043236 5.004 5.62e-07 ***
## MED_AGE 0.013781 0.004401 3.131 0.00174 **
## COMM_PT 0.062070 0.006116 10.148 < 2e-16 ***
## ED_4COL 0.012755 0.002637 4.837 1.32e-06 ***
## GENDER_F 1.041427 0.067897 15.338 < 2e-16 ***
## H_F1 0.842347 0.105030 8.020 1.06e-15 ***
## H_M1 0.524572 0.127028 4.130 3.63e-05 ***
## PARTY_D -0.655901 0.089525 -7.326 2.36e-13 ***
## PARTY_R -1.193597 0.136252 -8.760 < 2e-16 ***
## VG_04 -0.156327 0.083673 -1.868 0.06172 .
## VG_06 -0.157776 0.087540 -1.802 0.07150 .
## VG_08 -0.139387 0.081380 -1.713 0.08675 .
## E_PELIG 0.011830 0.001426 8.296 < 2e-16 ***
## UPSCALEFEM 0.199549 0.062865 3.174 0.00150 **
## FAMILYMAGA -0.128644 0.056982 -2.258 0.02397 *
## CAND1SS -3.565162 0.091812 -38.831 < 2e-16 ***
## CAND1SU -0.208600 0.075024 -2.780 0.00543 **
## CAND2SS -0.514182 0.081829 -6.284 3.31e-10 ***
## CAND2SU -1.620811 0.116114 -13.959 < 2e-16 ***
## MESSAGE_A 0.478540 0.059232 8.079 6.53e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13214.8 on 9999 degrees of freedom
## Residual deviance: 7490.9 on 9976 degrees of freedom
## AIC: 7538.9
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = MOVED_A ~ OPP_SEX + AGE + HH_ND + HH_NR + MED_AGE +
## COMM_PT + ED_4COL + GENDER_F + H_F1 + H_M1 + PARTY_D + PARTY_R +
## VG_04 + VG_06 + VG_08 + E_PELIG + UPSCALEFEM + FAMILYMAGA +
## CAND1S + CAND2S + MESSAGE_A, family = binomial, data = SDF)
##
## Coefficients:
## (Intercept) OPP_SEX AGE HH_ND HH_NR MED_AGE
## -0.226694 0.008691 -0.005099 0.171508 0.216340 0.013781
## COMM_PT ED_4COL GENDER_F H_F1 H_M1 PARTY_D
## 0.062070 0.012755 1.041427 0.842347 0.524572 -0.655901
## PARTY_R VG_04 VG_06 VG_08 E_PELIG UPSCALEFEM
## -1.193597 -0.156327 -0.157776 -0.139387 0.011830 0.199549
## FAMILYMAGA CAND1SS CAND1SU CAND2SS CAND2SU MESSAGE_A
## -0.128644 -3.565162 -0.208600 -0.514182 -1.620811 0.478540
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9976 Residual
## Null Deviance: 13210
## Residual Deviance: 7491 AIC: 7539
The end result is that stepwise selection drops only BOOKBUYERI, which lowers the AIC from 7539.5 to 7538.9. The retained predictors are OPP_SEX + AGE + HH_ND + HH_NR + MED_AGE + COMM_PT + ED_4COL + GENDER_F + H_F1 + H_M1 + PARTY_D + PARTY_R + VG_04 + VG_06 + VG_08 + E_PELIG + UPSCALEFEM + FAMILYMAGA + CAND1S + CAND2S + MESSAGE_A. Most of them are significant at the 0.05 level; OPP_SEX, VG_04, VG_06, and VG_08 are significant only at the 0.10 level.
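The AIC values in the stepwise output can be reproduced from the residual deviance and degrees of freedom printed in the summaries. The Python sketch below (the report itself was produced in R) relies on the fact that, for a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The function name is illustrative.

```python
def aic_from_deviance(residual_deviance, n_obs, residual_df):
    """AIC for a logistic model fitted to 0/1 responses.

    The number of estimated coefficients is n_obs - residual_df.
    """
    n_params = n_obs - residual_df
    return residual_deviance + 2 * n_params


# Values copied from the two summaries above (10,000 observations).
aic_before = aic_from_deviance(7489.5, n_obs=10000, residual_df=9975)
aic_after = aic_from_deviance(7490.9, n_obs=10000, residual_df=9976)

print(f"AIC with BOOKBUYERI:    {aic_before:.1f}")
print(f"AIC without BOOKBUYERI: {aic_after:.1f}")
```

Dropping BOOKBUYERI raises the deviance by about 1.4 but saves one parameter (a penalty of 2), which is why the AIC falls slightly.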
##
## Call:
## glm(formula = MOVED_A ~ ., family = binomial, data = TDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3502 -0.4724 -0.3337 0.7584 2.8608
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.58359 0.09565 16.556 < 2e-16 ***
## HH_ND 0.07893 0.02638 2.992 0.002772 **
## HH_NR 0.09060 0.04028 2.249 0.024512 *
## PARTY_D -0.32224 0.08066 -3.995 6.47e-05 ***
## PARTY_R -0.91110 0.12620 -7.219 5.23e-13 ***
## CAND1SS -3.35649 0.08536 -39.321 < 2e-16 ***
## CAND1SU -0.24124 0.06837 -3.528 0.000418 ***
## CAND2SS -0.35719 0.07970 -4.482 7.41e-06 ***
## CAND2SU -1.48172 0.11100 -13.349 < 2e-16 ***
## CAND1_UNDY NA NA NA NA
## CAND2_UNDY NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13214.8 on 9999 degrees of freedom
## Residual deviance: 8200.3 on 9991 degrees of freedom
## AIC: 8218.3
##
## Number of Fisher Scoring iterations: 5
## Start: AIC=8218.34
## MOVED_A ~ HH_ND + HH_NR + PARTY_D + PARTY_R + CAND1S + CAND2S +
## CAND1_UND + CAND2_UND
##
##
## Step: AIC=8218.34
## MOVED_A ~ HH_ND + HH_NR + PARTY_D + PARTY_R + CAND1S + CAND2S +
## CAND1_UND
##
##
## Step: AIC=8218.34
## MOVED_A ~ HH_ND + HH_NR + PARTY_D + PARTY_R + CAND1S + CAND2S
##
## Df Deviance AIC
## <none> 8200.3 8218.3
## - HH_NR 1 8205.4 8221.4
## - HH_ND 1 8209.3 8225.3
## - PARTY_D 1 8216.6 8232.6
## - PARTY_R 1 8254.2 8270.2
## - CAND2S 2 8391.4 8405.4
## - CAND1S 2 10729.6 10743.6
##
## Call:
## glm(formula = MOVED_A ~ HH_ND + HH_NR + PARTY_D + PARTY_R + CAND1S +
## CAND2S, family = binomial, data = TDF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3502 -0.4724 -0.3337 0.7584 2.8608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.58359 0.09565 16.556 < 2e-16 ***
## HH_ND 0.07893 0.02638 2.992 0.002772 **
## HH_NR 0.09060 0.04028 2.249 0.024512 *
## PARTY_D -0.32224 0.08066 -3.995 6.47e-05 ***
## PARTY_R -0.91110 0.12620 -7.219 5.23e-13 ***
## CAND1SS -3.35649 0.08536 -39.321 < 2e-16 ***
## CAND1SU -0.24124 0.06837 -3.528 0.000418 ***
## CAND2SS -0.35719 0.07970 -4.482 7.41e-06 ***
## CAND2SU -1.48172 0.11100 -13.349 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13214.8 on 9999 degrees of freedom
## Residual deviance: 8200.3 on 9991 degrees of freedom
## AIC: 8218.3
##
## Number of Fisher Scoring iterations: 5
Notice that the AIC (8218.3) is higher than that of the first model (7538.9), which uses a much larger number of predictors. Using fewer predictors may still be beneficial, because a smaller model is less prone to over-fitting.
The following model was created using upliftRF with MOVED_A as the response variable and OPP_SEX, AGE, HH_ND, HH_NR, MED_AGE, COMM_PT, ED_4COL, GENDER_F, H_F1, H_M1, PARTY_D, PARTY_R, VG_04, VG_06, VG_08, E_PELIG, UPSCALEFEM, BOOKBUYERI, FAMILYMAGA, CAND1SS, CAND1SU, CAND2SS, CAND2SU, and MESSAGE_A as the predictors. These were chosen because the logistic model found them to be the most significant predictors.
Using our prediction model, we can see the observed uplift in each of 20 groups.
## group n.ct1 n.ct0 n.y1_ct1 n.y1_ct0 r.y1_ct1 r.y1_ct0 uplift
## [1,] 1 93 110 64 62 0.688172 0.563636 0.124536
## [2,] 2 101 102 58 55 0.574257 0.539216 0.035042
## [3,] 3 88 114 44 54 0.500000 0.473684 0.026316
## [4,] 4 96 107 40 55 0.416667 0.514019 -0.097352
## [5,] 5 92 110 46 41 0.500000 0.372727 0.127273
## [6,] 6 116 87 44 33 0.379310 0.379310 0.000000
## [7,] 7 113 89 44 25 0.389381 0.280899 0.108482
## [8,] 8 95 108 32 21 0.336842 0.194444 0.142398
## [9,] 9 99 103 31 30 0.313131 0.291262 0.021869
## [10,] 10 89 114 25 38 0.280899 0.333333 -0.052434
## [11,] 11 103 100 29 19 0.281553 0.190000 0.091553
## [12,] 12 106 96 28 23 0.264151 0.239583 0.024568
## [13,] 13 103 100 26 25 0.252427 0.250000 0.002427
## [14,] 14 98 104 27 25 0.275510 0.240385 0.035126
## [15,] 15 89 114 20 24 0.224719 0.210526 0.014193
## [16,] 16 97 105 32 27 0.329897 0.257143 0.072754
## [17,] 17 104 99 42 34 0.403846 0.343434 0.060412
## [18,] 18 108 94 46 31 0.425926 0.329787 0.096139
## [19,] 19 98 105 39 44 0.397959 0.419048 -0.021088
## [20,] 20 110 93 56 42 0.509091 0.451613 0.057478
## attr(,"class")
## [1] "performance"
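Each `uplift` value in the table is the response rate among treated voters in the group minus the response rate among control voters in the same group. The Python sketch below (the report itself was produced in R) recomputes it for three rows copied from the table; the function name is illustrative.

```python
def group_uplift(n_ct1, n_ct0, n_y1_ct1, n_y1_ct0):
    """Treated response rate minus control response rate for one group."""
    return n_y1_ct1 / n_ct1 - n_y1_ct0 / n_ct0


# (n.ct1, n.ct0, n.y1_ct1, n.y1_ct0) for groups 1, 4 and 6 of the table above.
rows = {
    1: (93, 110, 64, 62),
    4: (96, 107, 40, 55),
    6: (116, 87, 44, 33),
}

uplift = {group: group_uplift(*counts) for group, counts in rows.items()}

for group, value in uplift.items():
    print(f"group {group}: uplift = {value:.6f}")
```

A negative value, as in group 4, means that treated voters in that group moved less often than comparable control voters.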
Displayed is the Qini plot, which shows the cumulative incremental gains against the proportion of the population targeted.
## $Qini
## [1] -0.005720911
##
## $inc.gains
## [1] 0.0018470272 0.0040990357 -0.0001691078 -0.0069261082 -0.0038641368
## [6] 0.0020916730 0.0119423221 0.0177343848 0.0186442528 0.0126562785
## [11] 0.0179205496 0.0207369005 0.0215785406 0.0229206811 0.0212461731
## [16] 0.0241171064 0.0285850602 0.0365155808 0.0346134840 0.0421936055
##
## $random.inc.gains
## [1] 0.002109680 0.004219361 0.006329041 0.008438721 0.010548401 0.012658082
## [7] 0.014767762 0.016877442 0.018987122 0.021096803 0.023206483 0.025316163
## [13] 0.027425844 0.029535524 0.031645204 0.033754884 0.035864565 0.037974245
## [19] 0.040083925 0.042193605
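The incremental gains and the Qini value printed above can be reproduced from the counts in the 20-group table. The Python sketch below (the report itself was produced in R) uses arithmetic inferred from the printed numbers: cumulative treated responders minus control responders rescaled to the treated group size, divided by the treated total, with the Qini value taken as the trapezoidal area under that curve minus the area under the random-targeting line. That this matches the internals of the package used in the report is an assumption.

```python
# Count columns copied from the 20-group performance table above.
n_ct1 = [93, 101, 88, 96, 92, 116, 113, 95, 99, 89,
         103, 106, 103, 98, 89, 97, 104, 108, 98, 110]
n_ct0 = [110, 102, 114, 107, 110, 87, 89, 108, 103, 114,
         100, 96, 100, 104, 114, 105, 99, 94, 105, 93]
y1_ct1 = [64, 58, 44, 40, 46, 44, 44, 32, 31, 25,
          29, 28, 26, 27, 20, 32, 42, 46, 39, 56]
y1_ct0 = [62, 55, 54, 55, 41, 33, 25, 21, 30, 38,
          19, 23, 25, 25, 24, 27, 34, 31, 44, 42]

total_ct1 = sum(n_ct1)
total_ct0 = sum(n_ct0)
scale = total_ct1 / total_ct0  # rescales control counts to treated size

# Cumulative incremental gains, one value per group.
inc_gains = []
running = 0.0
for treated_yes, control_yes in zip(y1_ct1, y1_ct0):
    running += treated_yes - control_yes * scale
    inc_gains.append(running / total_ct1)

# Random targeting spreads the overall gain evenly across the groups.
groups = len(inc_gains)
overall_gain = inc_gains[-1]
random_inc_gains = [overall_gain * (k + 1) / groups for k in range(groups)]


def trapezoid_area(values):
    """Trapezoidal area over equally spaced points, from the first group on."""
    width = 1 / len(values)
    return sum(0.5 * width * (values[i] + values[i - 1])
               for i in range(1, len(values)))


qini = trapezoid_area(inc_gains) - trapezoid_area(random_inc_gains)

print(f"first incremental gain: {inc_gains[0]:.7f}")
print(f"overall gain:           {overall_gain:.7f}")
print(f"Qini:                   {qini:.7f}")
```

A negative Qini value means the model's ranking of voters performs worse than random targeting over most of the curve.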
The next model uses the predictors suggested by the tree classification model. These predictors are: CAND1S + PARTY_D + CAND1_UND + CAND2S + PARTY_R + HH_NR + HH_ND + CAND2_UND
Displayed below are the uplift and the corresponding Qini plot.
## group n.ct1 n.ct0 n.y1_ct1 n.y1_ct0 r.y1_ct1 r.y1_ct0 uplift
## [1,] 1 107 125 67 73 0.626168 0.584000 0.042168
## [2,] 2 81 103 49 67 0.604938 0.650485 -0.045547
## [3,] 3 101 93 79 58 0.782178 0.623656 0.158522
## [4,] 4 115 93 95 76 0.826087 0.817204 0.008883
## [5,] 5 108 119 58 43 0.537037 0.361345 0.175692
## [6,] 6 86 120 68 91 0.790698 0.758333 0.032364
## [7,] 7 91 86 13 6 0.142857 0.069767 0.073090
## [8,] 8 87 109 7 9 0.080460 0.082569 -0.002109
## [9,] 9 142 151 37 32 0.260563 0.211921 0.048643
## [10,] 10 63 62 21 12 0.333333 0.193548 0.139785
## [11,] 11 128 131 3 5 0.023438 0.038168 -0.014730
## [12,] 12 80 92 0 0 0.000000 0.000000 0.000000
## [13,] 13 104 89 99 82 0.951923 0.921348 0.030575
## [14,] 14 99 88 76 66 0.767677 0.750000 0.017677
## [15,] 15 236 218 20 24 0.084746 0.110092 -0.025346
## [16,] 16 269 250 34 13 0.126394 0.052000 0.074394
## [17,] 17 19 24 8 6 0.421053 0.250000 0.171053
## [18,] 18 82 101 39 45 0.475610 0.445545 0.030065
## attr(,"class")
## [1] "performance"
## $Qini
## [1] -0.002434018
##
## $inc.gains
## [1] -0.002006875 -0.010101630 0.001200324 0.011746898 0.019841165
## [6] 0.009571402 0.013156779 0.012278588 0.015217749 0.019886001
## [11] 0.018953228 0.018953228 0.028580674 0.034486288 0.032811780
## [16] 0.043499683 0.044582557 0.042193605
##
## $random.inc.gains
## [1] 0.002344089 0.004688178 0.007032268 0.009376357 0.011720446 0.014064535
## [7] 0.016408624 0.018752714 0.021096803 0.023440892 0.025784981 0.028129070
## [13] 0.030473160 0.032817249 0.035161338 0.037505427 0.039849516 0.042193605
## 0 1
## [1,] 0.3 0.3
## [2,] 0.3 0.8
## [3,] 0.2 0.5
## [4,] 0.4 0.2
## [5,] 0.3 0.5
## [6,] 0.0 0.5
## [7,] 0.0 0.5
## [8,] 0.4 0.3
## [9,] 0.5 0.5
## [10,] 0.3 0.7
## $Qini
## [1] 0.001892401
##
## $inc.gains
## [1] 0.003985485 0.007957324 0.011983746 0.021447445 0.027275572 0.038245158
## [7] 0.035619358 0.038940596 0.042193605
##
## $random.inc.gains
## [1] 0.004688178 0.009376357 0.014064535 0.018752714 0.023440892 0.028129070
## [7] 0.032817249 0.037505427 0.042193605
## voter_ID MOVED_AD X0 X1 uplift
## 1 617877 Y 0.1 0.8 0.7
## 2 16214 N 0.2 0.9 0.7
## 3 17505 N 0.3 1.0 0.7
## 4 17659 N 0.2 0.9 0.7
## 5 617261 N 0.2 0.8 0.6
## 6 178744 Y 0.2 0.8 0.6
## 7 521346 N 0.2 0.8 0.6
## 8 635520 N 0.2 0.8 0.6
## 9 209816 Y 0.3 0.9 0.6
## 10 383895 N 0.2 0.8 0.6
## 11 383578 Y 0.2 0.8 0.6
## 12 389108 N 0.2 0.8 0.6
## 13 208067 Y 0.2 0.8 0.6
## 14 629902 N 0.2 0.8 0.6
## 15 629895 Y 0.2 0.8 0.6
## 16 511485 N 0.2 0.8 0.6
## 17 636093 N 0.2 0.8 0.6
## 18 618967 N 0.2 0.8 0.6
## 19 513060 N 0.2 0.8 0.6
## 20 623531 N 0.2 0.8 0.6
## $Qini
## [1] 0.01435247
##
## $inc.gains
## [1] 0.01245306 0.01536005 0.02516489 0.03626655 0.04675854 0.05462522
## [7] 0.05037072 0.04961973 0.04689402 0.04219361
##
## $random.inc.gains
## [1] 0.004219361 0.008438721 0.012658082 0.016877442 0.021096803 0.025316163
## [7] 0.029535524 0.033754884 0.037974245 0.042193605
## voter_ID MOVED_AD predictedProbCT1 predictedProbCT0 uplift
## 1728 612333 N 0.7188169 0.23197289 0.4868440
## 1735 186634 Y 0.4864987 0.08526888 0.4012298
## 1076 375577 N 0.6281903 0.23263052 0.3955598
## 3927 65098 Y 0.7687683 0.39306685 0.3757014
## 2020 252392 Y 0.7905927 0.42037768 0.3702150
## 342 360920 N 0.7335690 0.37528261 0.3582863
## 1181 305949 Y 0.7108177 0.36213657 0.3486811
## 227 265160 Y 0.6715799 0.32811318 0.3434667
## 361 529707 N 0.7789310 0.44010841 0.3388226
## 2285 628973 Y 0.8478816 0.52009681 0.3277848
## 606 481831 Y 0.8215391 0.49489985 0.3266392
## 233 406851 Y 0.6653134 0.35422402 0.3110894
## 462 521439 Y 0.4332769 0.12413293 0.3091440
## 3027 223981 N 0.7793615 0.47275537 0.3066061
## 2440 193345 Y 0.8238595 0.52017683 0.3036827
## 716 300644 N 0.6180464 0.31738435 0.3006620
## 3298 167486 N 0.7074841 0.40775398 0.2997301
## 1912 234209 N 0.6510892 0.35170830 0.2993809
## 1775 545437 N 0.7274893 0.43095526 0.2965341
## 2060 130960 N 0.7328025 0.44047841 0.2923241
## $Qini
## [1] 0.005681677
##
## $inc.gains
## [1] 0.004663378 0.013485248 0.025351057 0.026123982 0.032239152 0.035209991
## [7] 0.030131885 0.040055148 0.039650166 0.042193605
##
## $random.inc.gains
## [1] 0.004219361 0.008438721 0.012658082 0.016877442 0.021096803 0.025316163
## [7] 0.029535524 0.033754884 0.037974245 0.042193605
## voter_ID MOVED_AD predictedProbCT1 predictedProbCT0 uplift
## 467 341224 N 0.6366274 0.4171255 0.2195019
## 3073 409759 N 0.6611796 0.4475062 0.2136734
## 771 71688 N 0.5769315 0.3815667 0.1953648
## 1064 228799 N 0.5769315 0.3815667 0.1953648
## 1099 239998 N 0.5769315 0.3815667 0.1953648
## 1745 394179 N 0.5769315 0.3815667 0.1953648
## 2049 197848 N 0.5769315 0.3815667 0.1953648
## 2361 248420 N 0.5769315 0.3815667 0.1953648
## 2466 279275 N 0.5769315 0.3815667 0.1953648
## 2565 208406 N 0.5769315 0.3815667 0.1953648
## 3110 481523 N 0.5769315 0.3815667 0.1953648
## 3786 586155 N 0.5769315 0.3815667 0.1953648
## 3731 161165 N 0.6030020 0.4111846 0.1918174
## 3678 506252 N 0.6285004 0.4414603 0.1870401
## 3686 301309 N 0.6285004 0.4414603 0.1870401
## 3869 373307 N 0.6285004 0.4414603 0.1870401
## 1927 491618 N 0.3631765 0.1772689 0.1859076
## 3029 323221 N 0.3631765 0.1772689 0.1859076
## 3959 120804 N 0.3631765 0.1772689 0.1859076
## 141 455292 N 0.5460954 0.3642273 0.1818681
## ME RMSE MAE MPE MAPE
## Test set 0.0260139 0.3818698 0.3354097 -Inf Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2380 655
## 1 191 826
##
## Accuracy : 0.7912
## 95% CI : (0.7784, 0.8036)
## No Information Rate : 0.6345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5178
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5577
## Specificity : 0.9257
## Pos Pred Value : 0.8122
## Neg Pred Value : 0.7842
## Prevalence : 0.3655
## Detection Rate : 0.2038
## Detection Prevalence : 0.2510
## Balanced Accuracy : 0.7417
##
## 'Positive' Class : 1
##
## ME RMSE MAE MPE MAPE
## Test set 0.02398083 0.3409753 0.2882331 -Inf Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2362 247
## 1 209 1234
##
## Accuracy : 0.8875
## 95% CI : (0.8773, 0.897)
## No Information Rate : 0.6345
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.756
##
## Mcnemar's Test P-Value : 0.08315
##
## Sensitivity : 0.8332
## Specificity : 0.9187
## Pos Pred Value : 0.8552
## Neg Pred Value : 0.9053
## Prevalence : 0.3655
## Detection Rate : 0.3045
## Detection Prevalence : 0.3561
## Balanced Accuracy : 0.8760
##
## 'Positive' Class : 1
##
## ME RMSE MAE MPE MAPE
## Test set -0.05347976 0.5113591 0.4773692 -Inf Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2016 1164
## 1 555 317
##
## Accuracy : 0.5758
## 95% CI : (0.5604, 0.591)
## No Information Rate : 0.6345
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.002
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.21404
## Specificity : 0.78413
## Pos Pred Value : 0.36353
## Neg Pred Value : 0.63396
## Prevalence : 0.36550
## Detection Rate : 0.07823
## Detection Prevalence : 0.21520
## Balanced Accuracy : 0.49909
##
## 'Positive' Class : 1
##
## ME RMSE MAE MPE MAPE
## Test set -0.03662056 0.3382929 0.2340912 -Inf Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2143 194
## 1 428 1287
##
## Accuracy : 0.8465
## 95% CI : (0.835, 0.8575)
## No Information Rate : 0.6345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6798
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8690
## Specificity : 0.8335
## Pos Pred Value : 0.7504
## Neg Pred Value : 0.9170
## Prevalence : 0.3655
## Detection Rate : 0.3176
## Detection Prevalence : 0.4232
## Balanced Accuracy : 0.8513
##
## 'Positive' Class : 1
##
## ME RMSE MAE MPE MAPE
## Test set -0.03592259 0.3547424 0.2562522 -Inf Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2032 155
## 1 539 1326
##
## Accuracy : 0.8287
## 95% CI : (0.8168, 0.8402)
## No Information Rate : 0.6345
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.65
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8953
## Specificity : 0.7904
## Pos Pred Value : 0.7110
## Neg Pred Value : 0.9291
## Prevalence : 0.3655
## Detection Rate : 0.3272
## Detection Prevalence : 0.4603
## Balanced Accuracy : 0.8428
##
## 'Positive' Class : 1
##
Based on the RMSE values, confusion matrices, and lift charts above, the best model is Model 2, the random forest model with the predictors CAND1S + PARTY_D + CAND1_UND + CAND2S + PARTY_R + HH_NR + HH_ND + CAND2_UND + trt(MESSAGE_A).
Notice that its RMSE is slightly above that of Model 4 (0.341 versus 0.338), but its confusion matrix accuracy is higher (0.8875 versus 0.8465). Also note that this model uses far fewer predictors.
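The accuracy, sensitivity, and specificity figures can be recomputed from the four cells of each confusion matrix. The Python sketch below (the report itself was produced in R) does so for Models 2 and 4, reading the rows as predictions and the columns as reference values, with 1 as the positive class. The function name is illustrative.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix (positive class = 1)."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),  # share of actual 1s predicted as 1
        "specificity": tn / (tn + fp),  # share of actual 0s predicted as 0
    }


# Cells copied from the second and fourth confusion matrices above.
model_2 = confusion_metrics(tn=2362, fn=247, fp=209, tp=1234)
model_4 = confusion_metrics(tn=2143, fn=194, fp=428, tp=1287)

for name, metrics in (("Model 2", model_2), ("Model 4", model_4)):
    summary = ", ".join(f"{key} = {value:.4f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Model 4 is somewhat more sensitive, but Model 2 makes far fewer false-positive predictions, which accounts for its higher overall accuracy.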
The propensities of the first three records of the validation set are:
## [1] 0.245190 0.460108 0.175117
We will reverse the variable MESSAGE_A and call the result MESSAGE_A_REV. Using the best model, Random Forest 2, we then re-score the validation data with MESSAGE_A_REV as a predictor in place of MESSAGE_A.
## pr.y1_ct1 pr.y1_ct0
## [1,] 0.272836 0.372520
## [2,] 0.448127 0.439858
## [3,] 0.191454 0.267593
## [4,] 0.645310 0.709631
## [5,] 0.308509 0.400447
## [6,] 0.645310 0.709631
## [7,] 0.692539 0.808808
## [8,] 0.069686 0.176359
## [9,] 0.048987 0.078987
## [10,] 0.052399 0.117127
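The re-scoring step can be illustrated with a short Python sketch (the report itself was produced in R). It rests on two assumptions that the report does not state: that MESSAGE_A_REV is simply 1 - MESSAGE_A, and that `pr.y1_ct1` and `pr.y1_ct0` are the predicted probabilities of moving with and without the message. The treatment flags below are made-up values for illustration; the probabilities are the first three rows printed above.

```python
# Hypothetical 0/1 treatment flags (illustrative values only).
message_a = [1, 0, 0, 1, 1]

# Assumed definition of the reversed flag: swap 0 and 1.
message_a_rev = [1 - flag for flag in message_a]

# First three rows of the predicted probabilities printed above:
# (pr.y1_ct1, pr.y1_ct0)
predictions = [
    (0.272836, 0.372520),
    (0.448127, 0.439858),
    (0.191454, 0.267593),
]

# Predicted uplift per voter under the assumed reading of the columns.
predicted_uplift = [p_ct1 - p_ct0 for p_ct1, p_ct0 in predictions]

print("reversed flags:", message_a_rev)
for row, value in enumerate(predicted_uplift, start=1):
    print(f"row {row}: predicted uplift = {value:.6f}")
```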
## group n.ct1 n.ct0 n.y1_ct1 n.y1_ct0 r.y1_ct1 r.y1_ct0 uplift
## [1,] 1 113 92 69 59 0.610619 0.641304 -0.030685
## [2,] 2 182 165 127 129 0.697802 0.781818 -0.084016
## [3,] 3 89 90 59 70 0.662921 0.777778 -0.114856
## [4,] 4 39 47 18 26 0.461538 0.553191 -0.091653
## [5,] 5 100 99 50 53 0.500000 0.535354 -0.035354
## [6,] 6 140 133 107 107 0.764286 0.804511 -0.040226
## [7,] 7 100 102 0 5 0.000000 0.049020 -0.049020
## [8,] 8 68 64 17 25 0.250000 0.390625 -0.140625
## [9,] 9 146 150 64 72 0.438356 0.480000 -0.041644
## [10,] 10 57 52 0 1 0.000000 0.019231 -0.019231
## [11,] 11 169 175 63 76 0.372781 0.434286 -0.061505
## [12,] 12 53 39 5 3 0.094340 0.076923 0.017417
## [13,] 13 91 79 32 36 0.351648 0.455696 -0.104048
## [14,] 14 108 101 14 12 0.129630 0.118812 0.010818
## [15,] 15 196 223 1 7 0.005102 0.031390 -0.026288
## [16,] 16 262 262 18 28 0.068702 0.106870 -0.038168
## [17,] 17 36 35 16 17 0.444444 0.485714 -0.041270
## [18,] 18 105 90 48 47 0.457143 0.522222 -0.065079
## attr(,"class")
## [1] "performance"
## $Qini
## [1] -0.002434018
##
## $inc.gains
## [1] -0.002006875 -0.010101630 0.001200324 0.011746898 0.019841165
## [6] 0.009571402 0.013156779 0.012278588 0.015217749 0.019886001
## [11] 0.018953228 0.018953228 0.028580674 0.034486288 0.032811780
## [16] 0.043499683 0.044582557 0.042193605
##
## $random.inc.gains
## [1] 0.002344089 0.004688178 0.007032268 0.009376357 0.011720446 0.014064535
## [7] 0.016408624 0.018752714 0.021096803 0.023440892 0.025784981 0.028129070
## [13] 0.030473160 0.032817249 0.035161338 0.037505427 0.039849516 0.042193605
After sorting the uplift scores from greatest to least, we choose the cutoff by applying the ceiling function to the length of the list multiplied by 0.10. This gives the position of the cutoff element, which we then look up to find the uplift cutoff.
Below is the uplift cutoff if we wanted to target only the top 10%.
## [1] 0.113584
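The cutoff procedure described above can be sketched in Python as follows (the report itself was produced in R). The function name and the example scores are hypothetical and serve only to show the sort, ceiling, and look-up steps.

```python
import math


def uplift_cutoff(scores, top_fraction=0.10):
    """Return the uplift score at the boundary of the top fraction."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)              # greatest to least
    position = math.ceil(len(ranked) * top_fraction)   # 1-based position
    return ranked[position - 1]


# Hypothetical uplift scores for eleven voters.
example_scores = [0.05, 0.30, -0.02, 0.12, 0.20, 0.01,
                  0.08, -0.10, 0.15, 0.25, 0.03]

cutoff = uplift_cutoff(example_scores)
print(f"cutoff for the top 10%: {cutoff}")
```

With eleven scores, the ceiling of 1.1 is 2, so the cutoff is the second-largest score. For the 4,052 validation records counted in the confusion matrices, the same rule selects the 406th-ranked score.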