A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
The business objective here is to explore and develop different kind of models and find the best classification model that will help to maximize the expected net profit by predicting who will more likely donate during the direct-mail fundraising campaign.
The goal here is to improve the organization by using statistics to help achieve cost-effectiveness during their campaign.
The fund-raising file used contains 3,000 records. The records have approximately 50% donors and 50% non-donors.
This weighted sampling (50/50 split) is essential to produce repeatable results. If our training data included data where only 10% of the observations were donors (random sample), then it would have been difficult to create a predictive model for future donors because our training data would consist of such few and randomly selected donors.
The data was provided as 2 .rds files:
Fundraising_data: 3,000 observations and 20 predictor variables (6 factor and 14 numerical) and 1 response variable (Factor)
Future_fundraising: 120 observations and 20 predictor variables, no response variable, since it is unknown.
fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")
fundraise_data = as.data.frame(fundraise_data)
str(fundraise_data)
## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num 698 828 1471 547 482 ...
## $ med_fam_inc : num 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
summary(fundraise_data)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Comments: From the summary above we can see that our target variable is balanced. We can also get an idea that variables such as lifetime_gifts, largest_gift, last_gift and avg_gift are skewed and might be significant. Next, we will perform some exploratory analysis on them and check if they are significant by running different models.
tbl_donor = table(fundraise_data$target)
cbind(tbl_donor, round(prop.table(tbl_donor)*100,2))
## tbl_donor
## Donor 1499 49.97
## No Donor 1501 50.03
Comments: As we can see, the data set is balanced, with an almost 50/50 split, this will definitely be an advantage for training purposes of our models.
cormat = round(cor(fundraise_data[,-c(1:5,8,21)]),2)
melted_cormat = melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Comments: As we can notice here the variable pct_lt15k stands out, which is the one with the lowest correlation with multiple variables. Also, we can see the variables that are strongly correlated between each other are: avg_fam_inc and med_fam_inc (for obvious reasons, since one is the calculation of the other), however, they are also correlated with home_value, income and wealth. When it comes to the gifts section, the ones correlated are last_gift with avg_gift, and also life_time with num_prom and largest_gift.
After the Heat map (where we got some interesting insights), some of the variables should be looked into in more detail. We will now use the pairwise plots matrix, to analyze them. The variables that we will look at, are the ones that showed the highest correlation in the previous heat map.
ggpairs(fundraise_data[, c(10, 11, 12, 15, 17, 20)], ggplot2::aes(colour = fundraise_data$target))
Comments: We can confirm from the pairs plots what we saw in the heat map analysis, where the variables lifetime_gifts, last_gift and avg_gift are skewed. We can also confirm the Correlation between avg_fam_inc and med_fam_inc, but also with the target (response variable).
The variable that is worth taking a look is Gender.
ggplot(fundraise_data, aes(x = ifelse(female=="Yes", "Female", "Male"), fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1)) +
labs(x = "Donors by Gender")
Comments: From the above plot we can see that women are more likely to donate than males. And we see also notice that there are more number of females than males in the data set.
Just to see how the Donors and No Donor are distributed throughout the zip codes.
ggplot(fundraise_data, aes(x = zipconvert2, fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))
ggplot(fundraise_data, aes(x = zipconvert3, fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))
ggplot(fundraise_data, aes(x = zipconvert4, fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))
ggplot(fundraise_data, aes(x = zipconvert5, fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))
Comments: From the bar plots we can see that the Donors and Non-donors are equally distributed. And the only zipcode that has more donors than non-donors is ZipConvert5.
Logistic Regression - Significance and VIF
fundraise_glm = glm(formula = target ~ ., data = fundraise_data, family = binomial)
summary(fundraise_glm)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = fundraise_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90432 -1.15349 0.00153 1.15919 1.79778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.885e+00 4.595e-01 -4.102 4.10e-05 ***
## zipconvert2Yes -1.365e+01 2.670e+02 -0.051 0.95924
## zipconvert3No 1.361e+01 2.670e+02 0.051 0.95934
## zipconvert4Yes -1.365e+01 2.670e+02 -0.051 0.95922
## zipconvert5Yes -1.365e+01 2.670e+02 -0.051 0.95922
## homeownerNo 4.957e-02 9.412e-02 0.527 0.59847
## num_child 2.752e-01 1.137e-01 2.422 0.01544 *
## income -6.952e-02 2.595e-02 -2.679 0.00738 **
## femaleNo 5.995e-02 7.673e-02 0.781 0.43463
## wealth -1.907e-02 1.800e-02 -1.059 0.28940
## home_value -1.074e-04 7.141e-05 -1.503 0.13272
## med_fam_inc -1.200e-03 9.303e-04 -1.289 0.19725
## avg_fam_inc 1.756e-03 1.010e-03 1.738 0.08226 .
## pct_lt15k -9.519e-04 4.440e-03 -0.214 0.83024
## num_prom -3.682e-03 2.317e-03 -1.589 0.11204
## lifetime_gifts 1.599e-04 3.721e-04 0.430 0.66743
## largest_gift -1.773e-03 3.091e-03 -0.574 0.56629
## last_gift 9.923e-03 7.562e-03 1.312 0.18945
## months_since_donate 5.922e-02 1.003e-02 5.906 3.51e-09 ***
## time_lag -6.174e-03 6.789e-03 -0.909 0.36311
## avg_gift 7.539e-03 1.106e-02 0.682 0.49526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4062.0 on 2979 degrees of freedom
## AIC: 4104
##
## Number of Fisher Scoring iterations: 12
Comments: From the full model we can see that since only 3 variables were significant: months_since_donate, num_child, and income, this might indicate that we have some colinearity that we should get rid of first.
glm.fit.full = glm(target ~ . -zipconvert3, data = fundraise_data, family = binomial)
vif.glm.full = vif(glm.fit.full)
vif.glm.full
## zipconvert2 zipconvert4 zipconvert5 homeowner
## 1.754935 1.726092 2.386175 1.133302
## num_child income female wealth
## 1.026263 1.312074 1.016380 1.521926
## home_value med_fam_inc avg_fam_inc pct_lt15k
## 3.283724 18.773974 21.009871 2.100204
## num_prom lifetime_gifts largest_gift last_gift
## 1.962842 2.137244 2.104440 3.950453
## months_since_donate time_lag avg_gift
## 1.132120 1.037998 4.235878
Comments: We can see that med_fam_inc and avg_fam_inc like we saw in the previous exploratory analysis are highly correlated and one of them should go. Before deciding which one of them should. I will run a random forest model (again, full model) and do the Variable Importance to see which one is more important.
I will use the full data set for this process.
set.seed(12345)
forest1 = train(target ~ .,
data = fundraise_data,
method = "rf",
importance = TRUE,
trControl = trainControl(method = "cv", number = 10))
forest1
## Random Forest
##
## 3000 samples
## 20 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5363245 0.07269785
## 11 0.5279945 0.05599943
## 20 0.5323245 0.06466458
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Comments: Here we can see the accuracy of the model going down from 2 predictors, to 13, and then with 24 predictors. When running the Random Forest with no tuning or without any cross-validation method, the highest accuracy rate we get is: 54.13% which is really low, but we will run a preliminary variable analysis.
plot(varImp(forest1))
Comment: From the variable importance plot, we can see that the variables that we should for sure keep are: months_since_donate and largest_gift. Also from the VIF test that we got colinearity between med_fam_inc and avg_fam_inc, according to this plot, we should keep med_fam_inc.
Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train.df = fundraise_data[train_index, ]
test.df = fundraise_data[-train_index, ]
Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.
First I will be creating a full model just like in the feature selection.
glm.fit.null = glm(target ~ 1, data = fundraise_data, family = binomial)
glm.fit.full = glm(target ~ ., data = fundraise_data, family = binomial)
Stepwise AIC Selection
step.AIC = step(glm.fit.null, scope = list(upper=glm.fit.full),
direction ="both",test ="Chisq", trace = F)
summary(step.AIC)
##
## Call:
## glm(formula = target ~ months_since_donate + last_gift + income +
## num_child + num_prom, family = binomial, data = fundraise_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9735 -1.1544 0.6177 1.1603 1.7317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.908023 0.368624 -5.176 2.27e-07 ***
## months_since_donate 0.059997 0.009927 6.044 1.51e-09 ***
## last_gift 0.012379 0.003914 3.163 0.00156 **
## income -0.073489 0.022964 -3.200 0.00137 **
## num_child 0.283879 0.112839 2.516 0.01188 *
## num_prom -0.002907 0.001711 -1.699 0.08937 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4076.8 on 2994 degrees of freedom
## AIC: 4088.8
##
## Number of Fisher Scoring iterations: 4
Train Logistic Regression Model
We will train and test our Logistic Regression based on the AIC selected model.
glm.fit = glm(target ~ months_since_donate + last_gift + income +
num_child + num_prom, family = binomial, data = train.df)
Predictions on the Test data set
glm.probs = predict.glm(step.AIC, newdata = test.df, type="response")
glm.preds = as.factor(ifelse(glm.probs >= 0.5, "Donor", "No Donor"))
confusionMatrix(test.df$target, glm.preds)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 118 172
## No Donor 164 146
##
## Accuracy : 0.44
## 95% CI : (0.3998, 0.4808)
## No Information Rate : 0.53
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : -0.1222
##
## Mcnemar's Test P-Value : 0.7025
##
## Sensitivity : 0.4184
## Specificity : 0.4591
## Pos Pred Value : 0.4069
## Neg Pred Value : 0.4710
## Prevalence : 0.4700
## Detection Rate : 0.1967
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.4388
##
## 'Positive' Class : Donor
##
Comments: Our test accuracy for the variables selected by logistic regression using AIC method is 44.00%.
For the random forest we will be using the “repeatedcv” approach.
set.seed(12345)
train_control = trainControl(method="repeatedcv", number=10,repeats=3)
best.rf = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
data = train.df,
method = 'rf',
trControl = train_control,
importance = TRUE)
pred.best.rf = predict(best.rf, test.df)
confusionMatrix(pred.best.rf, test.df$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 167 145
## No Donor 123 165
##
## Accuracy : 0.5533
## 95% CI : (0.5125, 0.5936)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.03938
##
## Kappa : 0.1079
##
## Mcnemar's Test P-Value : 0.19957
##
## Sensitivity : 0.5759
## Specificity : 0.5323
## Pos Pred Value : 0.5353
## Neg Pred Value : 0.5729
## Prevalence : 0.4833
## Detection Rate : 0.2783
## Detection Prevalence : 0.5200
## Balanced Accuracy : 0.5541
##
## 'Positive' Class : Donor
##
Comments: For the Random Forest model the accuracy was 55.33% which was better than logistic regression. Lets try K-nearest neighbor for our next model.
For this model we will create a train_ctrl variable, for the cross-validation.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)
knn.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
data=train.df,
method='knn',
trControl = train_ctrl,
tuneLength=20)
pred.probs.knn = predict(knn.fit, test.df)
confusionMatrix(as.factor(pred.probs.knn), test.df$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 179 152
## No Donor 111 158
##
## Accuracy : 0.5617
## 95% CI : (0.5209, 0.6018)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.01509
##
## Kappa : 0.1263
##
## Mcnemar's Test P-Value : 0.01364
##
## Sensitivity : 0.6172
## Specificity : 0.5097
## Pos Pred Value : 0.5408
## Neg Pred Value : 0.5874
## Prevalence : 0.4833
## Detection Rate : 0.2983
## Detection Prevalence : 0.5517
## Balanced Accuracy : 0.5635
##
## 'Positive' Class : Donor
##
Comment: For K-nearest neighbor the accuracy was 56.17% which was better than logistic regression and random forest.
We will be performing Naive Bayes next as it is said to give the best accuracy
set.seed(12345)
nb.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
data = train.df, method = "naive_bayes",
trControl = train_ctrl)
nb.fit
## Naive Bayes
##
## 2400 samples
## 5 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.5491628 0.09476239
## TRUE 0.5394503 0.07420510
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
## and adjust = 1.
pred.probs.nb = predict(nb.fit, test.df)
confusionMatrix(as.factor(pred.probs.nb), test.df$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 236 227
## No Donor 54 83
##
## Accuracy : 0.5317
## 95% CI : (0.4908, 0.5722)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.2438
##
## Kappa : 0.08
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8138
## Specificity : 0.2677
## Pos Pred Value : 0.5097
## Neg Pred Value : 0.6058
## Prevalence : 0.4833
## Detection Rate : 0.3933
## Detection Prevalence : 0.7717
## Balanced Accuracy : 0.5408
##
## 'Positive' Class : Donor
##
Comments: For Naive Bayes the accuracy was 53.17%, which is lower than the accuracy we got for K-nearest neighbor.
dt.fit = rpart(target~ months_since_donate + last_gift + income + num_child + num_prom,
data = train.df)
rattle::fancyRpartPlot(dt.fit, sub = "")
pred.probs.dt = predict(dt.fit, test.df, type = "class")
confusionMatrix(as.factor(pred.probs.dt), test.df$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 193 191
## No Donor 97 119
##
## Accuracy : 0.52
## 95% CI : (0.4792, 0.5606)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.4514
##
## Kappa : 0.0489
##
## Mcnemar's Test P-Value : 4.251e-08
##
## Sensitivity : 0.6655
## Specificity : 0.3839
## Pos Pred Value : 0.5026
## Neg Pred Value : 0.5509
## Prevalence : 0.4833
## Detection Rate : 0.3217
## Detection Prevalence : 0.6400
## Balanced Accuracy : 0.5247
##
## 'Positive' Class : Donor
##
Comments: For Naive Bayes the accuracy went down to 52%, which is lower than the accuracy we got for K-nearest neighbor and Naive Bayes.
The best model out of the ones we performed was K-nearest neighbor with a k-fold of 10. The accuracy on the test data was 56.17%. We also found that the most significant variables in predicting potential donors from the logistic regression model using AIC selection criteria include months_since_donate, last_gift, income, num_child and num_prom. In the next mail campaign, I would recommend targeting those who have donated more recently, the amount of their most recent gift, their household income, number of children they have and lifetime number of promotions they received till date.
Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)
knn.fit_final = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
data = fundraise_data,
method ='knn',
trControl = train_ctrl,
tuneLength = 20)
pred.knn_final = predict(knn.fit_final,future_fundraise_data)
write.table(pred.knn_final, file = "predictions_final.csv", col.names = c("value"), row.names = FALSE)