Background
- A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with an in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%, and among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively identify donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
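A quick back-of-the-envelope calculation from these figures shows why targeted mailing matters (a sketch using only the numbers quoted above):
# Expected net profit per piece if everyone in the database is mailed
response_rate = 0.051
avg_donation  = 13.00
mail_cost     = 0.68
response_rate * avg_donation - mail_cost  # ~ -$0.017 per piece: blanket mailing loses money
# Break-even: a candidate is worth mailing only if their response probability exceeds
mail_cost / avg_donation                  # ~ 0.0523, i.e. about 5.2%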
Business Objectives and Goals
Objective
- Develop a classification model to help maximize the expected net profit by predicting who will be more likely to donate during the direct-mail fundraiser.
Goal
- Improve the cost-effectiveness of the national veterans’ organization’s direct marketing campaign by using data analytics.
Data Sources and Data Used
- The fund-raising file used contains 3,000 records. The records have approximately 50% donors and 50% non-donors.
- This weighted sampling (50/50 split) is essential for building an effective classifier. If our training data were a simple random sample, only about 5.1% of observations would be donors (roughly 150 donors in a 3,000-record sample), and it would be difficult to create a predictive model for future donors because the training data would contain so few of them.
Libraries
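The analysis below relies on functions from the following packages (inferred from the calls in this report; ResourceSelection is an assumption for the Hosmer-Lemeshow output shown later):
library(dplyr)              # glimpse()
library(caret)              # trainControl(), train(), confusionMatrix()
library(randomForest)       # backend for caret's method = 'rf'
library(ResourceSelection)  # hoslem.test(); assumed source of the Hosmer-Lemeshow test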
Read and View Data
fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")
glimpse(fundraise_data)
## Rows: 3,000
## Columns: 21
## $ zipconvert2 <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ zipconvert3 <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No,…
## $ zipconvert4 <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No,…
## $ zipconvert5 <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, N…
## $ homeowner <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ num_child <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2…
## $ female <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes…
## $ wealth <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7…
## $ home_value <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 4…
## $ med_fam_inc <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203,…
## $ avg_fam_inc <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271,…
## $ pct_lt15k <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13…
## $ num_prom <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 2…
## $ lifetime_gifts <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 4…
## $ largest_gift <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20,…
## $ last_gift <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7,…
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 3…
## $ time_lag <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3,…
## $ avg_gift <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000,…
## $ target <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, D…
Step 1: Partitioning. You might think about how to estimate the out-of-sample error. Either partition the dataset into 80% training and 20% validation or use cross-validation (set the seed to 12345).
set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train = fundraise_data[train_index, ]
test = fundraise_data[-train_index, ]
Step 2: Model Building. Follow these steps to build, evaluate, and choose a model.
- 1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
temp = fundraise_data[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(temp)
round(correlation, 5)
## num_child income wealth home_value med_fam_inc
## num_child 1.00000 0.09189 0.06018 -0.01196 0.04696
## income 0.09189 1.00000 0.20899 0.29197 0.36751
## wealth 0.06018 0.20899 1.00000 0.26116 0.37776
## home_value -0.01196 0.29197 0.26116 1.00000 0.73815
## med_fam_inc 0.04696 0.36751 0.37776 0.73815 1.00000
## avg_fam_inc 0.04726 0.37859 0.38589 0.75257 0.97227
## pct_lt15k -0.03172 -0.28319 -0.37515 -0.39909 -0.66536
## num_prom -0.08643 -0.06901 -0.41212 -0.06451 -0.05078
## lifetime_gifts -0.05095 -0.01957 -0.22547 -0.02407 -0.03525
## largest_gift -0.01755 0.03318 -0.02528 0.05649 0.04703
## last_gift -0.01295 0.10959 0.05259 0.15886 0.13598
## months_since_donate -0.00556 0.07724 0.03371 0.02343 0.03234
## time_lag -0.00607 -0.00155 -0.06642 0.00068 0.01520
## avg_gift -0.01969 0.12406 0.09108 0.16877 0.13716
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child 0.04726 -0.03172 -0.08643 -0.05095 -0.01755
## income 0.37859 -0.28319 -0.06901 -0.01957 0.03318
## wealth 0.38589 -0.37515 -0.41212 -0.22547 -0.02528
## home_value 0.75257 -0.39909 -0.06451 -0.02407 0.05649
## med_fam_inc 0.97227 -0.66536 -0.05078 -0.03525 0.04703
## avg_fam_inc 1.00000 -0.68028 -0.05731 -0.04033 0.04310
## pct_lt15k -0.68028 1.00000 0.03778 0.05962 -0.00788
## num_prom -0.05731 0.03778 1.00000 0.53862 0.11381
## lifetime_gifts -0.04033 0.05962 0.53862 1.00000 0.50726
## largest_gift 0.04310 -0.00788 0.11381 0.50726 1.00000
## last_gift 0.13138 -0.06175 -0.05587 0.20206 0.44724
## months_since_donate 0.03127 -0.00901 -0.28232 -0.14462 0.01979
## time_lag 0.02434 -0.01991 0.11962 0.03855 0.03998
## avg_gift 0.13176 -0.06248 -0.14725 0.18232 0.47483
## last_gift months_since_donate time_lag avg_gift
## num_child -0.01295 -0.00556 -0.00607 -0.01969
## income 0.10959 0.07724 -0.00155 0.12406
## wealth 0.05259 0.03371 -0.06642 0.09108
## home_value 0.15886 0.02343 0.00068 0.16877
## med_fam_inc 0.13598 0.03234 0.01520 0.13716
## avg_fam_inc 0.13138 0.03127 0.02434 0.13176
## pct_lt15k -0.06175 -0.00901 -0.01991 -0.06248
## num_prom -0.05587 -0.28232 0.11962 -0.14725
## lifetime_gifts 0.20206 -0.14462 0.03855 0.18232
## largest_gift 0.44724 0.01979 0.03998 0.47483
## last_gift 1.00000 0.18672 0.07511 0.86640
## months_since_donate 0.18672 1.00000 0.01553 0.18911
## time_lag 0.07511 0.01553 1.00000 0.07008
## avg_gift 0.86640 0.18911 0.07008 1.00000
- The following predictor pairs appear to be correlated per the above correlation matrix:
- home_value and med_fam_inc: 0.73815 (positive)
- avg_fam_inc and home_value: 0.75257 (positive)
- avg_fam_inc and med_fam_inc: 0.97227 (positive)
- pct_lt15k and med_fam_inc: -0.66536 (negative)
- avg_fam_inc and pct_lt15k: -0.68028 (negative)
- num_prom and lifetime_gifts: 0.53862 (positive)
- lifetime_gifts and largest_gift: 0.50726 (positive)
- last_gift and avg_gift: 0.86640 (positive)
- Our numerical variables are: num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift.
- Our categorical variables are: zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, female, and target (the response). A sketch for checking their association with the response follows.
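The correlation matrix above covers only the numeric predictors. For the categorical predictors, one hedged way to evaluate association with the response is a cross-tabulation plus a chi-square test (a sketch, not part of the original analysis):
# Cross-tabulate each categorical predictor against the response; a small
# chi-square p-value flags the predictor as a candidate.
cat_vars = c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5",
             "homeowner", "female")
for (v in cat_vars) {
  tab = table(fundraise_data[[v]], fundraise_data$target)
  cat(v, "p-value:", chisq.test(tab)$p.value, "\n")
  print(prop.table(tab, 1))  # row proportions: donor rate within each level
}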
- 2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
Logistic Regression
- Process: We begin by fitting the model with all the predictors, then perform step-wise selection to determine the best predictors to include in the final model.
- Significant Predictors: Based on the step-wise selection results below, num_child, last_gift, and months_since_donate are significant predictors, each with a p-value below the 0.05 significance level. Restricting the final model to these ensures it does not rely on insignificant variables.
glm.fund = glm(target ~., data = train, family = 'binomial')
glm.step = step(glm.fund, scope = list(upper = glm.fund),
direction = "both", test = "Chisq", trace = F)
summary(glm.step)
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + homeowner + num_child + income + home_value +
## avg_fam_inc + last_gift + months_since_donate, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7887 -1.1411 -0.7736 1.1703 1.6751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.335e+00 3.900e-01 -5.987 2.14e-09 ***
## zipconvert2Yes -1.266e+01 2.271e+02 -0.056 0.95553
## zipconvert3No 1.257e+01 2.271e+02 0.055 0.95586
## zipconvert4Yes -1.264e+01 2.271e+02 -0.056 0.95563
## zipconvert5Yes -1.258e+01 2.271e+02 -0.055 0.95582
## homeownerNo 1.505e-01 1.052e-01 1.430 0.15272
## num_child 3.424e-01 1.271e-01 2.694 0.00706 **
## income -5.308e-02 2.858e-02 -1.857 0.06330 .
## home_value -1.091e-04 7.675e-05 -1.422 0.15508
## avg_fam_inc 5.741e-04 4.042e-04 1.420 0.15550
## last_gift 1.610e-02 4.798e-03 3.356 0.00079 ***
## months_since_donate 5.852e-02 1.076e-02 5.438 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3255.7 on 2388 degrees of freedom
## AIC: 3279.7
##
## Number of Fisher Scoring iterations: 11
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.step$y, fitted(glm.step)
## X-squared = 1.7672, df = 8, p-value = 0.9873
- Goodness of Fit: The Hosmer-Lemeshow test yielded a p-value of 0.9873, well above the 0.05 significance level. We fail to reject the null hypothesis of adequate fit; there is no evidence of lack of fit.
Based on the above analysis, we fit the final model with the significant predictors only: num_child, last_gift, and months_since_donate. (The zipconvert dummies, with their enormous standard errors, signal near-perfect collinearity among the zip indicators and are excluded.)
glm.fund_final = glm(target ~ num_child + last_gift + months_since_donate,
data = train, family = 'binomial')
pred.prob = predict.glm(glm.fund_final, newdata = test, type = 'response')
pred = ifelse(pred.prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 98 131
## No Donor 192 179
##
## Accuracy : 0.4617
## 95% CI : (0.4212, 0.5025)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.9968933
##
## Kappa : -0.0852
##
## Mcnemar's Test P-Value : 0.0008424
##
## Sensitivity : 0.3379
## Specificity : 0.5774
## Pos Pred Value : 0.4279
## Neg Pred Value : 0.4825
## Prevalence : 0.4833
## Detection Rate : 0.1633
## Detection Prevalence : 0.3817
## Balanced Accuracy : 0.4577
##
## 'Positive' Class : Donor
##
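One caution: the 46.2% accuracy above is below the 51.7% no-information rate, which suggests the class labels may be flipped. For a binomial glm in R, predict(type = 'response') returns the probability of the second factor level; if the levels of target are ordered ("Donor", "No Donor"), then pred.prob is P(No Donor) and the ifelse() mapping above is inverted. A quick check (a sketch, not from the original report):
levels(train$target)  # if this prints "Donor" "No Donor", pred.prob is P(No Donor)
# In that case the label mapping should be reversed:
pred = ifelse(pred.prob > 0.5, 'No Donor', 'Donor')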
K Nearest Neighbors
- Process: We begin by creating a train_ctrl variable for cross-validation, then fit the K Nearest Neighbors model with the significant variables identified during the step-wise selection above.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)
knn.fit = train(target~ months_since_donate + num_child + last_gift,
data=train,
method='knn',
trControl = train_ctrl,
tuneLength=20)
pred.prod3 = predict(knn.fit, test)
confusionMatrix(as.factor(pred.prod3), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 158 156
## No Donor 132 154
##
## Accuracy : 0.52
## 95% CI : (0.4792, 0.5606)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.4514
##
## Kappa : 0.0415
##
## Mcnemar's Test P-Value : 0.1753
##
## Sensitivity : 0.5448
## Specificity : 0.4968
## Pos Pred Value : 0.5032
## Neg Pred Value : 0.5385
## Prevalence : 0.4833
## Detection Rate : 0.2633
## Detection Prevalence : 0.5233
## Balanced Accuracy : 0.5208
##
## 'Positive' Class : Donor
##
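The report does not show which k the repeated cross-validation selected from the 20 candidates. With caret this can be inspected directly (a sketch):
knn.fit$bestTune  # the value of k chosen by repeated 10-fold CV
plot(knn.fit)     # CV accuracy across the candidate values of k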
Random Forest
- Process: We begin by creating a train_control variable for cross-validation, then fit the Random Forest model with all predictors. With the model fitted, we rank the variables by importance in the table and chart below; this guides which predictors to include in the final Random Forest model.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit = train(target~.,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
rf.fit$besttune  # returns NULL: the component name is case-sensitive; use rf.fit$bestTune
## NULL
## rf variable importance
##
## Importance
## months_since_donate 100.000
## largest_gift 78.852
## num_child 70.241
## last_gift 62.017
## pct_lt15k 58.123
## avg_fam_inc 48.639
## home_value 46.974
## income 43.231
## med_fam_inc 40.671
## avg_gift 34.473
## zipconvert3No 29.082
## femaleNo 21.990
## homeownerNo 20.963
## num_prom 19.328
## wealth 18.354
## time_lag 16.654
## lifetime_gifts 11.840
## zipconvert2Yes 10.684
## zipconvert5Yes 2.436
## zipconvert4Yes 0.000
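The importance table above (and the chart referenced in the process description) can be reproduced from the fitted caret model; varImp() is assumed to be the call that generated the printed table:
varImp(rf.fit)        # prints the "rf variable importance" table above
plot(varImp(rf.fit))  # dot plot of the same importance scores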
We remove avg_gift as it is collinear with last_gift (r = 0.87), and we remove med_fam_inc as it is collinear with pct_lt15k (r = -0.67); avg_fam_inc and home_value are dropped on the same grounds.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit_refitted = train(target~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_refitted = predict(rf.fit_refitted,test)
confusionMatrix(pred.rf_refitted, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 159 147
## No Donor 131 163
##
## Accuracy : 0.5367
## 95% CI : (0.4958, 0.5771)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.1738
##
## Kappa : 0.074
##
## Mcnemar's Test P-Value : 0.3683
##
## Sensitivity : 0.5483
## Specificity : 0.5258
## Pos Pred Value : 0.5196
## Neg Pred Value : 0.5544
## Prevalence : 0.4833
## Detection Rate : 0.2650
## Detection Prevalence : 0.5100
## Balanced Accuracy : 0.5370
##
## 'Positive' Class : Donor
##
- 3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?
- A weighted sample with equal numbers of donors and non-donors adjusts for the severe class imbalance in the original data, where only about 5.1% of mailings draw a response. Left unbalanced, the model would be biased toward the dominant class (non-donors) and would identify donors poorly on test data. A simple random sample does not compensate for this imbalance; it preserves it, leaving too few donor records to learn from.
- 4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
models = c('Logistic Regression', 'K Nearest Neighbors', 'Random Forest')
acc= c(46.17, 52.00, 53.17)
acc.summary= as.data.frame(acc, row.names = models)
acc.summary
## acc
## Logistic Regression 46.17
## K Nearest Neighbors 52.00
## Random Forest 53.17
# Plot the bar chart
barplot(acc,names.arg = models, ylab="Accuracy Score", col="navyblue",
main="Model Results", border="orange")- The Random Forest model appears to be the model that dominates.
- 5. Select the best model. From your answer in (4), what do you think is the “best” model?
- Model Selected: Random Forest is the selected model due to its slightly higher accuracy.
Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.
- 6. Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit_final = train(target~months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = fundraise_data,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_final = predict(rf.fit_final, future_fundraise_data)
Shiny App Upload Results
- Found 62 records... Imported 62 records. Simplifying into dataframe...
- Found 120 records... Imported 120 records. Simplifying into dataframe...
- Complete! Processed total of 1 rows.
- Leaderboard accuracy: [1] 0.5083333
- The uploaded predictions scored 50.8% accuracy on the leaderboard. Based on our results, Random Forest is the strongest of the classification methods we tried for predicting whether a person will be a donor or a non-donor.
- Important factors to consider when targeting potential donors are the number of months from the last donation to July 2018 (months_since_donate), the dollar amount of the largest gift to date (largest_gift), the number of children (num_child), the dollar amount of the most recent gift (last_gift), the percentage of households earning less than $15K in the potential donor’s neighborhood (pct_lt15k), household income (income), and wealth rating (wealth).
- 7. Submission File. For each row in the test set, you must predict whether the candidate is a donor. The .csv file should contain a header.
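A minimal sketch for producing the submission file (the column header "value" is an assumption; match it to whatever header the leaderboard expects):
# One prediction per row of future_fundraising.rds; write.csv includes a header by default
submission = data.frame(value = pred.rf_final)  # NOTE: column name is a guess
write.csv(submission, "submission.csv", row.names = FALSE)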