The goal of this analysis is to improve the effectiveness of the American Legion’s direct marketing fundraising campaign. Based on information provided by the American Legion, the response rate for their direct marketing campaign is 5.1%, with an average donation of $13.00. Each letter costs $0.68 to produce and send. As such, sending letters to 3,000 random individuals would result in an estimated loss of $51: \((13 \times 3000 \times 0.051) - (3000 \times 0.68) = -51\).
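As a quick check, the expected result of an untargeted mailing can be computed directly; the sketch below simply restates that arithmetic in R.

```r
# Expected outcome of mailing 3,000 randomly selected individuals
response_rate   <- 0.051  # historical response rate
avg_donation    <- 13.00  # average donation, in dollars
cost_per_letter <- 0.68   # cost to produce and send one letter
n_letters       <- 3000

expected_revenue <- n_letters * response_rate * avg_donation  # $1,989
mailing_cost     <- n_letters * cost_per_letter               # $2,040
expected_revenue - mailing_cost                               # -$51
```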
In order to meet this goal of improving marketing effectiveness and increasing fundraising revenue, a predictive model will be developed to help determine whether an individual is likely to be a donor. With this model, better direct marketing strategies can be employed, costs can be minimized, and more money can be raised.
The source data was provided by the American Legion. The dataset consists of 3,000 total records, with approximately 50% of the records being donors and 50% being non-donors. Since the response rate is so low (only 5.1% of people who receive a fundraising campaign letter donate), using a true random sample of the population would result in very few donors in the sample data. To avoid this issue, weighted sampling was used to include equal numbers of donors and non-donors. This allows for a better-fitting model, since there will be plenty of both donors and non-donors to use in model training.
Exploratory data analysis was conducted to gain an understanding of the data in the sample dataset.
The names and descriptions of the variables in the fundraising dataset can be seen below. For modeling, the data were split into a training set of 2,400 records (whose structure is shown here) and a test set of 600 records used to evaluate each model.
## Classes 'tbl_df', 'tbl' and 'data.frame': 2400 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 2 2 2 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 2 2 1 1 1 1 1 2 ...
## $ num_child : num 2 1 1 1 1 1 1 1 2 1 ...
## $ income : num 4 4 2 1 3 3 4 7 2 3 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 1 2 ...
## $ wealth : num 3 3 4 4 8 8 7 8 8 4 ...
## $ home_value : num 541 1229 444 442 2702 ...
## $ med_fam_inc : num 335 359 196 315 637 273 437 463 374 295 ...
## $ avg_fam_inc : num 367 490 263 343 695 331 454 597 434 319 ...
## $ pct_lt15k : num 13 10 38 24 2 21 9 13 3 19 ...
## $ num_prom : num 63 39 36 52 16 54 71 30 22 82 ...
## $ lifetime_gifts : num 91 35 178 134 20 110 118 57 29 242 ...
## $ largest_gift : num 10 15 20 20 20 9 10 13 15 12 ...
## $ last_gift : num 10 15 20 20 20 5 10 11 15 9 ...
## $ months_since_donate: num 37 34 37 30 37 33 30 35 30 32 ...
## $ time_lag : num 4 13 0 10 5 4 3 2 6 7 ...
## $ avg_gift : num 6.5 11.67 9.89 16.75 20 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 2 2 2 1 2 2 1 1 1 1 ...
Variables measuring similar quantities were checked for correlation so that multicollinearity could be avoided.
The average and median family income, the percentage of people in a neighborhood earning less than $15,000, and the average home value are all correlated to some degree. However, the variables measuring donation history, such as average gift, largest gift, and time since last gift, were much less correlated with each other.
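One way to perform this check is with a pairwise correlation matrix; a minimal sketch is below, assuming the training data are loaded in a data frame named `train.fundraising` (the name that appears in the model output later in this report).

```r
# Pairwise correlations among the neighborhood wealth/income variables
cor(train.fundraising[, c("home_value", "med_fam_inc", "avg_fam_inc", "pct_lt15k")])

# Pairwise correlations among the giving-history variables
cor(train.fundraising[, c("avg_gift", "largest_gift", "last_gift", "months_since_donate")])
```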
Additionally, comparisons were made between the ‘target’ variable and the other variables. ‘Target’ is the binary donor/no-donor variable that the models outlined in this report will predict.
The amount of the last donation, number of children, months since last donation, average donation, and the total number of fundraising campaign letters received all appear to be significantly associated with whether an individual donates.
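Each of the summaries that follow comes from a single-predictor logistic regression of `target` on the variable of interest; a representative fit might look like the sketch below.

```r
# Single-predictor logistic regression: donor status vs. amount of the last gift
fit.last_gift <- glm(target ~ last_gift, family = "binomial",
                     data = train.fundraising)
summary(fit.last_gift)
```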
##
## Call:
## glm(formula = target ~ last_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.967 -1.141 -1.081 1.196 1.304
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.292482 0.074489 -3.927 8.62e-05 ***
## last_gift 0.020708 0.004691 4.414 1.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3305.6 on 2398 degrees of freedom
## AIC: 3309.6
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ num_child, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.562 -1.162 -1.162 1.193 1.193
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3380 0.1380 -2.449 0.0143 *
## num_child 0.3022 0.1237 2.443 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3320.7 on 2398 degrees of freedom
## AIC: 3324.7
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ months_since_donate, family = "binomial",
## data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3331 -1.1659 -0.8163 1.1890 1.5881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.02144 0.32862 -6.151 7.68e-10 ***
## months_since_donate 0.06433 0.01044 6.163 7.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3287.3 on 2398 degrees of freedom
## AIC: 3291.3
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ avg_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.968 -1.149 -1.085 1.192 1.279
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.290559 0.078749 -3.690 0.000225 ***
## avg_gift 0.025898 0.006381 4.059 4.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3308.9 on 2398 degrees of freedom
## AIC: 3312.9
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ num_prom, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2637 -1.1688 -0.9783 1.1737 1.3996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.270351 0.097316 2.778 0.00547 **
## num_prom -0.005812 0.001801 -3.228 0.00125 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3316.5 on 2398 degrees of freedom
## AIC: 3320.5
##
## Number of Fisher Scoring iterations: 3
On the other hand, family income, the percentage of low-income households in a neighborhood, home value, total lifetime donations, the largest donation, and the time lag between donations all appear to have little impact on whether someone will respond to a fundraising letter with a donation.
##
## Call:
## glm(formula = target ~ avg_fam_inc, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.185 -1.171 -1.166 1.184 1.191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.091e-02 1.124e-01 -0.275 0.783
## avg_fam_inc 3.684e-05 2.424e-04 0.152 0.879
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.9 on 2398 degrees of freedom
## AIC: 3330.9
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ med_fam_inc, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.175 -1.171 -1.163 1.183 1.196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.161e-03 1.005e-01 -0.051 0.959
## med_fam_inc -2.539e-05 2.370e-04 -0.107 0.915
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327 on 2399 degrees of freedom
## Residual deviance: 3327 on 2398 degrees of freedom
## AIC: 3331
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ pct_lt15k, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.181 -1.173 -1.146 1.181 1.231
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.007958 0.064337 0.124 0.902
## pct_lt15k -0.001561 0.003381 -0.462 0.644
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.8 on 2398 degrees of freedom
## AIC: 3330.8
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ home_value, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.189 -1.176 -1.103 1.179 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.690e-02 6.395e-02 0.421 0.674
## home_value -3.659e-05 4.300e-05 -0.851 0.395
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.2 on 2398 degrees of freedom
## AIC: 3330.2
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ largest_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.899 -1.168 -1.163 1.185 1.192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.042804 0.053171 -0.805 0.421
## largest_gift 0.001667 0.002057 0.810 0.418
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.2 on 2398 degrees of freedom
## AIC: 3330.2
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ lifetime_gifts, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.177 -1.173 -1.142 1.182 1.523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0002954 0.0503723 0.006 0.995
## lifetime_gifts -0.0001380 0.0002668 -0.517 0.605
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.7 on 2398 degrees of freedom
## AIC: 3330.7
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ time_lag, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.225 -1.169 -1.164 1.186 1.191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.030799 0.065703 -0.469 0.639
## time_lag 0.002285 0.007444 0.307 0.759
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.9 on 2398 degrees of freedom
## AIC: 3330.9
##
## Number of Fisher Scoring iterations: 3
Several types of models were fit and examined. First, a logistic regression model using all available predictors was fit to provide a baseline against which the other models could be compared.
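A sketch of how such a baseline model can be fit and evaluated is shown below; `test.fundraising` and `target.test` are assumed to hold the 600 held-out records and their true labels. Because the second level of `target` is "No Donor", the fitted probabilities refer to that class.

```r
# Baseline: logistic regression on all predictors
glm.fundraising.full <- glm(target ~ ., family = "binomial",
                            data = train.fundraising)

# Predicted probability of the second factor level ("No Donor") on the test set
prob.no.donor <- predict(glm.fundraising.full, newdata = test.fundraising,
                         type = "response")

# Classify with a 0.5 cutoff and compare against the true test labels
glm.pred.fundraising.full <- ifelse(prob.no.donor > 0.5, "No Donor", "Donor")
table(glm.pred.fundraising.full, target.test)

# Test-set misclassification rate
mean(glm.pred.fundraising.full != target.test)
```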
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8281 -1.1407 -0.7284 1.1700 1.6933
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.836e+00 5.132e-01 -3.577 0.000347 ***
## zipconvert2Yes -1.260e+01 2.289e+02 -0.055 0.956114
## zipconvert3No 1.249e+01 2.289e+02 0.055 0.956490
## zipconvert4Yes -1.256e+01 2.289e+02 -0.055 0.956255
## zipconvert5Yes -1.252e+01 2.289e+02 -0.055 0.956365
## homeownerNo 1.461e-01 1.059e-01 1.380 0.167626
## num_child 3.336e-01 1.279e-01 2.609 0.009092 **
## income -5.378e-02 2.876e-02 -1.870 0.061547 .
## femaleNo 2.516e-02 8.592e-02 0.293 0.769667
## wealth -1.934e-02 2.001e-02 -0.967 0.333665
## home_value -9.847e-05 7.963e-05 -1.237 0.216244
## med_fam_inc -1.222e-03 1.063e-03 -1.149 0.250371
## avg_fam_inc 1.679e-03 1.136e-03 1.478 0.139308
## pct_lt15k -3.451e-03 4.942e-03 -0.698 0.485073
## num_prom -3.980e-03 2.570e-03 -1.548 0.121526
## lifetime_gifts 2.889e-04 4.031e-04 0.717 0.473484
## largest_gift -2.204e-03 3.388e-03 -0.651 0.515324
## last_gift 1.374e-02 8.664e-03 1.585 0.112860
## months_since_donate 5.381e-02 1.126e-02 4.777 1.78e-06 ***
## time_lag -1.430e-03 7.746e-03 -0.185 0.853491
## avg_gift 5.864e-03 1.237e-02 0.474 0.635556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3250.2 on 2379 degrees of freedom
## AIC: 3292.2
##
## Number of Fisher Scoring iterations: 11
Confusion Matrix - Full Logistic Regression:
## target.test
## glm.pred.fundraising.full Donor No Donor
## Donor 115 151
## No Donor 175 159
This logistic regression has a misclassification rate of 54.33%.
## [1] 0.5433333
Using the assumptions above (a $13.00 average donation and a $0.68 cost per letter), mailing only to the individuals this model predicts to be donors would raise an estimated $1,314.12.
## [1] 1314.12
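This figure follows directly from the confusion matrix: 266 letters go to everyone the model labels a donor, and 115 of those recipients actually donate. A sketch of the calculation:

```r
# Estimated revenue when letters go only to predicted donors
letters_sent  <- 115 + 151  # everyone the model labels "Donor"
actual_donors <- 115        # predicted donors who actually donate
actual_donors * 13 - letters_sent * 0.68  # 1495 - 180.88 = 1314.12
```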
A second logistic regression was fit using stepwise variable selection. This resulted in the model below.
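Stepwise selection can be carried out with R's `step()` function starting from the full model; a minimal sketch is below (AIC-based selection in both directions is assumed).

```r
# AIC-based stepwise selection starting from the full logistic regression
glm.fundraising.step <- step(glm.fundraising.full, direction = "both",
                             trace = FALSE)
summary(glm.fundraising.step)
```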
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + homeowner + num_child + income + home_value +
## avg_fam_inc + last_gift + months_since_donate, family = "binomial",
## data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7887 -1.1411 -0.7736 1.1703 1.6751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.335e+00 3.900e-01 -5.987 2.14e-09 ***
## zipconvert2Yes -1.266e+01 2.271e+02 -0.056 0.95553
## zipconvert3No 1.257e+01 2.271e+02 0.055 0.95586
## zipconvert4Yes -1.264e+01 2.271e+02 -0.056 0.95563
## zipconvert5Yes -1.258e+01 2.271e+02 -0.055 0.95582
## homeownerNo 1.505e-01 1.052e-01 1.430 0.15272
## num_child 3.424e-01 1.271e-01 2.694 0.00706 **
## income -5.308e-02 2.858e-02 -1.857 0.06330 .
## home_value -1.091e-04 7.675e-05 -1.422 0.15508
## avg_fam_inc 5.741e-04 4.042e-04 1.420 0.15550
## last_gift 1.610e-02 4.798e-03 3.356 0.00079 ***
## months_since_donate 5.852e-02 1.076e-02 5.438 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3255.7 on 2388 degrees of freedom
## AIC: 3279.7
##
## Number of Fisher Scoring iterations: 11
This model has a misclassification rate of 53.17%, and an estimated fundraising amount of $1353.80. This is a slight improvement over the full model. The confusion matrix, misclassification rate, and estimated amount of money raised are below.
Confusion Matrix - Stepwise Logistic Regression:
## target.test
## glm.pred.fundraising.step Donor No Donor
## Donor 118 147
## No Donor 172 163
## [1] 0.5316667
## [1] 1353.8
The benefit of logistic regression is its ease of interpretation. It is easy to explain and understand how the model predicts the likelihood that someone will donate. However, these models do not perform particularly well at predicting whether an individual will donate after receiving a fundraising campaign letter: both misclassification rates exceed 50% on the test set.
A random forest model was also fit to the fundraising data. This resulted in a better-fitting model with more predictive power. While it is much more difficult to interpret how the model decides between donor and no donor, the random forest achieved a significantly lower test misclassification rate than the logistic regression models, and when calculating expected fundraising revenue, it significantly outperforms them.
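The call in the output below shows the selected tuning value of `mtry = 2` with the default 500 trees; a simplified sketch of fitting and evaluating an equivalent model directly with the `randomForest` package, reusing the object names assumed earlier, is shown here.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, for reproducibility
fundraising.rf <- randomForest(target ~ ., data = train.fundraising,
                               mtry = 2, ntree = 500, importance = TRUE)

# Test-set predictions and confusion matrix
yhatfundraising.rf <- predict(fundraising.rf, newdata = test.fundraising)
table(yhatfundraising.rf, target.test)

# Variable importance ranking
importance(fundraising.rf)
varImpPlot(fundraising.rf)
```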
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 47.67%
## Confusion matrix:
## Donor No Donor class.error
## Donor 677 532 0.4400331
## No Donor 612 579 0.5138539
As can be seen in the variable importance ranking below, the most important variables for predicting whether someone will donate are the largest previous donation someone has made, the number of months since their last donation, and the number of children they have.
## rf variable importance
##
## Importance
## largest_gift 100.000
## months_since_donate 89.843
## num_child 89.365
## last_gift 58.882
## pct_lt15k 57.571
## med_fam_inc 53.405
## avg_fam_inc 49.701
## avg_gift 47.756
## income 30.960
## zipconvert3No 28.026
## time_lag 26.587
## home_value 24.348
## homeownerNo 22.262
## zipconvert4Yes 14.620
## zipconvert2Yes 12.251
## wealth 9.076
## num_prom 6.466
## zipconvert5Yes 6.155
## femaleNo 2.611
## lifetime_gifts 0.000
Confusion Matrix - Random Forest:
## target.test
## yhatfundraising.rf Donor No Donor
## Donor 183 151
## No Donor 107 159
The random forest model has a misclassification rate of 43.00%, which is significantly lower than that of the logistic regression models. The estimated fundraising amount under the same assumptions is $2,151.88, which is also greater than that of the logistic regression models.
## [1] 0.43
## [1] 2151.88
Comparing the misclassification rates of the fitted models, it is clear that the random forest model is better at predicting whether someone will donate in response to a fundraising campaign letter. As a result, there is a significant increase in the estimated funds raised when the random forest model is used to determine who should be sent a letter.
It is recommended that a random forest model be used to predict whether someone will be a donor, and that fundraising letters be sent to all individuals the model identifies as likely donors. Based on validation against the test dataset, this approach yields the lowest misclassification rate and, consequently, the highest expected fundraising revenue.
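In practice, scoring a new prospect list with the fitted model and mailing only the predicted donors might look like the sketch below; `new_prospects` is a hypothetical data frame containing the same predictor columns as the training data.

```r
# Score a hypothetical list of new prospects (same predictor columns as the training data)
predicted.class <- predict(fundraising.rf, newdata = new_prospects)

# Mail only to the individuals the model predicts to be donors
mailing_list <- new_prospects[predicted.class == "Donor", ]
```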