The goal of this analysis is to improve the effectiveness of the American Legion’s direct marketing fundraising campaign. Based on information provided by the American Legion, the response rate for their direct marketing campaign is 5.1%, with an average donation of $13.00. Each letter costs $0.68 to produce and send. As such, sending letters to 3,000 random individuals would result in an estimated loss of $51: \((13 \times 3000 \times 0.051) - (3000 \times 0.68) = -51\).
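As a quick check, the expected result of an untargeted mailing can be computed directly; the sketch below simply restates that arithmetic in R.

```r
# Expected outcome of mailing 3,000 randomly selected individuals
response_rate   <- 0.051  # historical response rate
avg_donation    <- 13.00  # average donation, in dollars
cost_per_letter <- 0.68   # cost to produce and send one letter
n_letters       <- 3000

expected_revenue <- n_letters * response_rate * avg_donation  # $1,989
mailing_cost     <- n_letters * cost_per_letter               # $2,040
expected_revenue - mailing_cost                               # -$51
```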
In order to meet this goal of improving marketing effectiveness and increasing fundraising revenue, a predictive model will be developed to help determine whether an individual is likely to be a donor. With this model, better direct marketing strategies can be employed, costs can be minimized, and more money can be raised.
The source data was provided by the American Legion. The dataset consists of 3,000 total records, with approximately 50% of the records being donors and 50% being non-donors. Since the response rate is so low (only 5.1% of people who receive a fundraising campaign letter donate), using a true random sample of the population would result in very few donors in the sample data. To avoid this issue, weighted sampling was used to include equal numbers of donors and non-donors. This allows for a better-fitting model, since there will be plenty of both donors and non-donors to use in model training.
Exploratory data analysis was conducted to gain an understanding of the data in the sample dataset.
The names and descriptions of the variables in the fundraising dataset can be seen below. For modeling, the data were split into a training set of 2,400 records (whose structure is shown here) and a test set of 600 records used to evaluate each model.
## Classes 'tbl_df', 'tbl' and 'data.frame': 2400 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 2 2 2 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 2 2 1 1 1 1 1 2 ...
## $ num_child : num 2 1 1 1 1 1 1 1 2 1 ...
## $ income : num 4 4 2 1 3 3 4 7 2 3 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 1 2 ...
## $ wealth : num 3 3 4 4 8 8 7 8 8 4 ...
## $ home_value : num 541 1229 444 442 2702 ...
## $ med_fam_inc : num 335 359 196 315 637 273 437 463 374 295 ...
## $ avg_fam_inc : num 367 490 263 343 695 331 454 597 434 319 ...
## $ pct_lt15k : num 13 10 38 24 2 21 9 13 3 19 ...
## $ num_prom : num 63 39 36 52 16 54 71 30 22 82 ...
## $ lifetime_gifts : num 91 35 178 134 20 110 118 57 29 242 ...
## $ largest_gift : num 10 15 20 20 20 9 10 13 15 12 ...
## $ last_gift : num 10 15 20 20 20 5 10 11 15 9 ...
## $ months_since_donate: num 37 34 37 30 37 33 30 35 30 32 ...
## $ time_lag : num 4 13 0 10 5 4 3 2 6 7 ...
## $ avg_gift : num 6.5 11.67 9.89 16.75 20 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 2 2 2 1 2 2 1 1 1 1 ...
Variables measuring similar quantities were checked for correlation so that multicollinearity could be avoided.
The average and median family income, the percentage of people in a neighborhood earning less than $15,000, and the average home value are all correlated to some degree. However, the variables measuring donation history, such as average gift, largest gift, and time since last gift, were much less correlated with each other.
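One way to perform this check is with a pairwise correlation matrix; a minimal sketch is below, assuming the training data are loaded in a data frame named `train.fundraising` (the name that appears in the model output later in this report).

```r
# Pairwise correlations among the neighborhood wealth/income variables
cor(train.fundraising[, c("home_value", "med_fam_inc", "avg_fam_inc", "pct_lt15k")])

# Pairwise correlations among the giving-history variables
cor(train.fundraising[, c("avg_gift", "largest_gift", "last_gift", "months_since_donate")])
```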
Additionally, comparisons were made between the ‘target’ variable and the other variables. ‘Target’ is the binary donor/no-donor variable that the models outlined in this report will predict.
The amount of the last donation, number of children, months since last donation, average donation, and the total number of fundraising campaign letters received all appear to be significantly associated with whether an individual donates.
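Each of the summaries that follow comes from a single-predictor logistic regression of `target` on the variable of interest; a representative fit might look like the sketch below.

```r
# Single-predictor logistic regression: donor status vs. amount of the last gift
fit.last_gift <- glm(target ~ last_gift, family = "binomial",
                     data = train.fundraising)
summary(fit.last_gift)
```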
##
## Call:
## glm(formula = target ~ last_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.967 -1.141 -1.081 1.196 1.304
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.292482 0.074489 -3.927 8.62e-05 ***
## last_gift 0.020708 0.004691 4.414 1.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3305.6 on 2398 degrees of freedom
## AIC: 3309.6
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ num_child, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.562 -1.162 -1.162 1.193 1.193
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3380 0.1380 -2.449 0.0143 *
## num_child 0.3022 0.1237 2.443 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3320.7 on 2398 degrees of freedom
## AIC: 3324.7
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ months_since_donate, family = "binomial",
## data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3331 -1.1659 -0.8163 1.1890 1.5881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.02144 0.32862 -6.151 7.68e-10 ***
## months_since_donate 0.06433 0.01044 6.163 7.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3287.3 on 2398 degrees of freedom
## AIC: 3291.3
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ avg_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.968 -1.149 -1.085 1.192 1.279
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.290559 0.078749 -3.690 0.000225 ***
## avg_gift 0.025898 0.006381 4.059 4.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3308.9 on 2398 degrees of freedom
## AIC: 3312.9
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = target ~ num_prom, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2637 -1.1688 -0.9783 1.1737 1.3996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.270351 0.097316 2.778 0.00547 **
## num_prom -0.005812 0.001801 -3.228 0.00125 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3316.5 on 2398 degrees of freedom
## AIC: 3320.5
##
## Number of Fisher Scoring iterations: 3
On the other hand, family income, the percentage of low-income households in a neighborhood, home value, total lifetime donations, the largest donation, and the time lag between donations all appear to have little impact on whether someone will respond to a fundraising letter with a donation.
##
## Call:
## glm(formula = target ~ avg_fam_inc, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.185 -1.171 -1.166 1.184 1.191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.091e-02 1.124e-01 -0.275 0.783
## avg_fam_inc 3.684e-05 2.424e-04 0.152 0.879
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.9 on 2398 degrees of freedom
## AIC: 3330.9
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ med_fam_inc, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.175 -1.171 -1.163 1.183 1.196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.161e-03 1.005e-01 -0.051 0.959
## med_fam_inc -2.539e-05 2.370e-04 -0.107 0.915
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327 on 2399 degrees of freedom
## Residual deviance: 3327 on 2398 degrees of freedom
## AIC: 3331
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ pct_lt15k, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.181 -1.173 -1.146 1.181 1.231
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.007958 0.064337 0.124 0.902
## pct_lt15k -0.001561 0.003381 -0.462 0.644
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.8 on 2398 degrees of freedom
## AIC: 3330.8
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ home_value, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.189 -1.176 -1.103 1.179 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.690e-02 6.395e-02 0.421 0.674
## home_value -3.659e-05 4.300e-05 -0.851 0.395
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.2 on 2398 degrees of freedom
## AIC: 3330.2
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ largest_gift, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.899 -1.168 -1.163 1.185 1.192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.042804 0.053171 -0.805 0.421
## largest_gift 0.001667 0.002057 0.810 0.418
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.2 on 2398 degrees of freedom
## AIC: 3330.2
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ lifetime_gifts, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.177 -1.173 -1.142 1.182 1.523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0002954 0.0503723 0.006 0.995
## lifetime_gifts -0.0001380 0.0002668 -0.517 0.605
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.7 on 2398 degrees of freedom
## AIC: 3330.7
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = target ~ time_lag, family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.225 -1.169 -1.164 1.186 1.191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.030799 0.065703 -0.469 0.639
## time_lag 0.002285 0.007444 0.307 0.759
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3326.9 on 2398 degrees of freedom
## AIC: 3330.9
##
## Number of Fisher Scoring iterations: 3
Several types of models were fit and examined. First, a logistic regression model using all available predictors was fit to provide a baseline against which the other models could be compared.
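A sketch of how such a baseline model can be fit and evaluated is shown below; `test.fundraising` and `target.test` are assumed to hold the 600 held-out records and their true labels. Because the second level of `target` is "No Donor", the fitted probabilities refer to that class.

```r
# Baseline: logistic regression on all predictors
glm.fundraising.full <- glm(target ~ ., family = "binomial",
                            data = train.fundraising)

# Predicted probability of the second factor level ("No Donor") on the test set
prob.no.donor <- predict(glm.fundraising.full, newdata = test.fundraising,
                         type = "response")

# Classify with a 0.5 cutoff and compare against the true test labels
glm.pred.fundraising.full <- ifelse(prob.no.donor > 0.5, "No Donor", "Donor")
table(glm.pred.fundraising.full, target.test)

# Test-set misclassification rate
mean(glm.pred.fundraising.full != target.test)
```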
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8281 -1.1407 -0.7284 1.1700 1.6933
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.836e+00 5.132e-01 -3.577 0.000347 ***
## zipconvert2Yes -1.260e+01 2.289e+02 -0.055 0.956114
## zipconvert3No 1.249e+01 2.289e+02 0.055 0.956490
## zipconvert4Yes -1.256e+01 2.289e+02 -0.055 0.956255
## zipconvert5Yes -1.252e+01 2.289e+02 -0.055 0.956365
## homeownerNo 1.461e-01 1.059e-01 1.380 0.167626
## num_child 3.336e-01 1.279e-01 2.609 0.009092 **
## income -5.378e-02 2.876e-02 -1.870 0.061547 .
## femaleNo 2.516e-02 8.592e-02 0.293 0.769667
## wealth -1.934e-02 2.001e-02 -0.967 0.333665
## home_value -9.847e-05 7.963e-05 -1.237 0.216244
## med_fam_inc -1.222e-03 1.063e-03 -1.149 0.250371
## avg_fam_inc 1.679e-03 1.136e-03 1.478 0.139308
## pct_lt15k -3.451e-03 4.942e-03 -0.698 0.485073
## num_prom -3.980e-03 2.570e-03 -1.548 0.121526
## lifetime_gifts 2.889e-04 4.031e-04 0.717 0.473484
## largest_gift -2.204e-03 3.388e-03 -0.651 0.515324
## last_gift 1.374e-02 8.664e-03 1.585 0.112860
## months_since_donate 5.381e-02 1.126e-02 4.777 1.78e-06 ***
## time_lag -1.430e-03 7.746e-03 -0.185 0.853491
## avg_gift 5.864e-03 1.237e-02 0.474 0.635556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3250.2 on 2379 degrees of freedom
## AIC: 3292.2
##
## Number of Fisher Scoring iterations: 11
Confusion Matrix - Full Logistic Regression:
## target.test
## glm.pred.fundraising.full Donor No Donor
## Donor 115 151
## No Donor 175 159
This logistic regression has a misclassification rate of 54.33%.
## [1] 0.5433333
Using the assumptions above (a $13.00 average donation and a $0.68 cost per letter), mailing only to the individuals this model predicts to be donors would raise an estimated $1,314.12.
## [1] 1314.12
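This figure follows directly from the confusion matrix: 266 letters go to everyone the model labels a donor, and 115 of those recipients actually donate. A sketch of the calculation:

```r
# Estimated revenue when letters go only to predicted donors
letters_sent  <- 115 + 151  # everyone the model labels "Donor"
actual_donors <- 115        # predicted donors who actually donate
actual_donors * 13 - letters_sent * 0.68  # 1495 - 180.88 = 1314.12
```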
A second logistic regression was fit using stepwise variable selection. This resulted in the model below.
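Stepwise selection can be carried out with R's `step()` function starting from the full model; a minimal sketch is below (AIC-based selection in both directions is assumed).

```r
# AIC-based stepwise selection starting from the full logistic regression
glm.fundraising.step <- step(glm.fundraising.full, direction = "both",
                             trace = FALSE)
summary(glm.fundraising.step)
```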
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + homeowner + num_child + income + home_value +
## avg_fam_inc + last_gift + months_since_donate, family = "binomial",
## data = train.fundraising)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7887 -1.1411 -0.7736 1.1703 1.6751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.335e+00 3.900e-01 -5.987 2.14e-09 ***
## zipconvert2Yes -1.266e+01 2.271e+02 -0.056 0.95553
## zipconvert3No 1.257e+01 2.271e+02 0.055 0.95586
## zipconvert4Yes -1.264e+01 2.271e+02 -0.056 0.95563
## zipconvert5Yes -1.258e+01 2.271e+02 -0.055 0.95582
## homeownerNo 1.505e-01 1.052e-01 1.430 0.15272
## num_child 3.424e-01 1.271e-01 2.694 0.00706 **
## income -5.308e-02 2.858e-02 -1.857 0.06330 .
## home_value -1.091e-04 7.675e-05 -1.422 0.15508
## avg_fam_inc 5.741e-04 4.042e-04 1.420 0.15550
## last_gift 1.610e-02 4.798e-03 3.356 0.00079 ***
## months_since_donate 5.852e-02 1.076e-02 5.438 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3255.7 on 2388 degrees of freedom
## AIC: 3279.7
##
## Number of Fisher Scoring iterations: 11
This model has a misclassification rate of 53.17%, and an estimated fundraising amount of $1353.80. This is a slight improvement over the full model. The confusion matrix, misclassification rate, and estimated amount of money raised are below.
Confusion Matrix - Stepwise Logistic Regression:
## target.test
## glm.pred.fundraising.step Donor No Donor
## Donor 118 147
## No Donor 172 163
## [1] 0.5316667
## [1] 1353.8
The benefit of logistic regression is its ease of interpretation. It is easy to explain and understand how the model predicts the likelihood that someone will donate. However, these models do not perform particularly well at predicting whether an individual will donate after receiving a fundraising campaign letter: both misclassification rates exceed 50% on the test set.
A random forest model was also fit to the fundraising data. This resulted in a better-fitting model with more predictive power. While it is much more difficult to interpret how the model decides between donor and no donor, the random forest achieved a significantly lower test misclassification rate than the logistic regression models, and when calculating expected fundraising revenue, it significantly outperforms them.
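The call in the output below shows the selected tuning value of `mtry = 2` with the default 500 trees; a simplified sketch of fitting and evaluating an equivalent model directly with the `randomForest` package, reusing the object names assumed earlier, is shown here.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, for reproducibility
fundraising.rf <- randomForest(target ~ ., data = train.fundraising,
                               mtry = 2, ntree = 500, importance = TRUE)

# Test-set predictions and confusion matrix
yhatfundraising.rf <- predict(fundraising.rf, newdata = test.fundraising)
table(yhatfundraising.rf, target.test)

# Variable importance ranking
importance(fundraising.rf)
varImpPlot(fundraising.rf)
```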
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 47.67%
## Confusion matrix:
## Donor No Donor class.error
## Donor 677 532 0.4400331
## No Donor 612 579 0.5138539
As can be seen in the variable importance ranking below, the most important variables for predicting whether someone will donate are the largest previous donation someone has made, the number of months since their last donation, and the number of children they have.
## rf variable importance
##
## Importance
## largest_gift 100.000
## months_since_donate 89.843
## num_child 89.365
## last_gift 58.882
## pct_lt15k 57.571
## med_fam_inc 53.405
## avg_fam_inc 49.701
## avg_gift 47.756
## income 30.960
## zipconvert3No 28.026
## time_lag 26.587
## home_value 24.348
## homeownerNo 22.262
## zipconvert4Yes 14.620
## zipconvert2Yes 12.251
## wealth 9.076
## num_prom 6.466
## zipconvert5Yes 6.155
## femaleNo 2.611
## lifetime_gifts 0.000
Confusion Matrix - Random Forest:
## target.test
## yhatfundraising.rf Donor No Donor
## Donor 183 151
## No Donor 107 159
The random forest model has a misclassification rate of 43.00%, which is significantly lower than that of the logistic regression models. The estimated fundraising amount under the same assumptions is $2,151.88, which is also greater than that of the logistic regression models.
## [1] 0.43
## [1] 2151.88
Comparing the misclassification rates of the fitted models, it is clear that the random forest model is better at predicting whether someone will donate in response to a fundraising campaign letter. As a result, there is a significant increase in the estimated funds raised when the random forest model is used to determine who should be sent a letter.
It is recommended that a random forest model be used to predict whether someone will be a donor, and that fundraising letters be sent to all individuals the model identifies as likely donors. Based on validation against the test dataset, this approach yields the lowest misclassification rate and, consequently, the highest expected fundraising revenue.
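In practice, scoring a new prospect list with the fitted model and mailing only the predicted donors might look like the sketch below; `new_prospects` is a hypothetical data frame containing the same predictor columns as the training data.

```r
# Score a hypothetical list of new prospects (same predictor columns as the training data)
predicted.class <- predict(fundraising.rf, newdata = new_prospects)

# Mail only to the individuals the model predicts to be donors
mailing_list <- new_prospects[predicted.class == "Donor", ]
```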