Background
- A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with an in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%, and among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively identify donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
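A quick back-of-the-envelope calculation from these figures shows why targeted mailing matters (a sketch using only the numbers quoted above):
# Expected net profit per piece if everyone in the database is mailed
response_rate = 0.051
avg_donation  = 13.00
mail_cost     = 0.68
response_rate * avg_donation - mail_cost  # ~ -$0.017 per piece: blanket mailing loses money
# Break-even: a candidate is worth mailing only if their response probability exceeds
mail_cost / avg_donation                  # ~ 0.0523, i.e. about 5.2%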
Business Objectives and Goals
Objective
- Develop a classification model to help maximize the expected net profit by predicting who will be more likely to donate during the direct-mail fundraiser.
Goal
- Improve the cost-effectiveness of the national veterans’ organization’s direct marketing campaign by using data analytics.
Data Sources and Data Used
- The fund-raising file used contains 3,000 records. The records have approximately 50% donors and 50% non-donors.
- This weighted sampling (50/50 split) is essential for building an effective classifier. If our training data were a simple random sample, only about 5.1% of observations would be donors (roughly 150 donors in a 3,000-record sample), and it would be difficult to create a predictive model for future donors because the training data would contain so few of them.
Libraries
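The analysis below relies on functions from the following packages (inferred from the calls in this report; ResourceSelection is an assumption for the Hosmer-Lemeshow output shown later):
library(dplyr)              # glimpse()
library(caret)              # trainControl(), train(), confusionMatrix()
library(randomForest)       # backend for caret's method = 'rf'
library(ResourceSelection)  # hoslem.test(); assumed source of the Hosmer-Lemeshow test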
Read and View Data
fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")
glimpse(fundraise_data)
## Rows: 3,000
## Columns: 21
## $ zipconvert2 <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ zipconvert3 <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No,…
## $ zipconvert4 <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No,…
## $ zipconvert5 <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, N…
## $ homeowner <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ num_child <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2…
## $ female <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes…
## $ wealth <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7…
## $ home_value <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 4…
## $ med_fam_inc <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203,…
## $ avg_fam_inc <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271,…
## $ pct_lt15k <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13…
## $ num_prom <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 2…
## $ lifetime_gifts <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 4…
## $ largest_gift <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20,…
## $ last_gift <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7,…
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 3…
## $ time_lag <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3,…
## $ avg_gift <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000,…
## $ target <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, D…
Step 1: Partitioning. You might think about how to estimate the out-of-sample error. Either partition the dataset into 80% training and 20% validation or use cross-validation (set the seed to 12345).
set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train = fundraise_data[train_index, ]
test = fundraise_data[-train_index, ]
Step 2: Model Building. Follow these steps to build, evaluate, and choose a model.
- 1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
temp = fundraise_data[, c(6,7,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(temp)
round(correlation, 5)
## num_child income wealth home_value med_fam_inc
## num_child 1.00000 0.09189 0.06018 -0.01196 0.04696
## income 0.09189 1.00000 0.20899 0.29197 0.36751
## wealth 0.06018 0.20899 1.00000 0.26116 0.37776
## home_value -0.01196 0.29197 0.26116 1.00000 0.73815
## med_fam_inc 0.04696 0.36751 0.37776 0.73815 1.00000
## avg_fam_inc 0.04726 0.37859 0.38589 0.75257 0.97227
## pct_lt15k -0.03172 -0.28319 -0.37515 -0.39909 -0.66536
## num_prom -0.08643 -0.06901 -0.41212 -0.06451 -0.05078
## lifetime_gifts -0.05095 -0.01957 -0.22547 -0.02407 -0.03525
## largest_gift -0.01755 0.03318 -0.02528 0.05649 0.04703
## last_gift -0.01295 0.10959 0.05259 0.15886 0.13598
## months_since_donate -0.00556 0.07724 0.03371 0.02343 0.03234
## time_lag -0.00607 -0.00155 -0.06642 0.00068 0.01520
## avg_gift -0.01969 0.12406 0.09108 0.16877 0.13716
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child 0.04726 -0.03172 -0.08643 -0.05095 -0.01755
## income 0.37859 -0.28319 -0.06901 -0.01957 0.03318
## wealth 0.38589 -0.37515 -0.41212 -0.22547 -0.02528
## home_value 0.75257 -0.39909 -0.06451 -0.02407 0.05649
## med_fam_inc 0.97227 -0.66536 -0.05078 -0.03525 0.04703
## avg_fam_inc 1.00000 -0.68028 -0.05731 -0.04033 0.04310
## pct_lt15k -0.68028 1.00000 0.03778 0.05962 -0.00788
## num_prom -0.05731 0.03778 1.00000 0.53862 0.11381
## lifetime_gifts -0.04033 0.05962 0.53862 1.00000 0.50726
## largest_gift 0.04310 -0.00788 0.11381 0.50726 1.00000
## last_gift 0.13138 -0.06175 -0.05587 0.20206 0.44724
## months_since_donate 0.03127 -0.00901 -0.28232 -0.14462 0.01979
## time_lag 0.02434 -0.01991 0.11962 0.03855 0.03998
## avg_gift 0.13176 -0.06248 -0.14725 0.18232 0.47483
## last_gift months_since_donate time_lag avg_gift
## num_child -0.01295 -0.00556 -0.00607 -0.01969
## income 0.10959 0.07724 -0.00155 0.12406
## wealth 0.05259 0.03371 -0.06642 0.09108
## home_value 0.15886 0.02343 0.00068 0.16877
## med_fam_inc 0.13598 0.03234 0.01520 0.13716
## avg_fam_inc 0.13138 0.03127 0.02434 0.13176
## pct_lt15k -0.06175 -0.00901 -0.01991 -0.06248
## num_prom -0.05587 -0.28232 0.11962 -0.14725
## lifetime_gifts 0.20206 -0.14462 0.03855 0.18232
## largest_gift 0.44724 0.01979 0.03998 0.47483
## last_gift 1.00000 0.18672 0.07511 0.86640
## months_since_donate 0.18672 1.00000 0.01553 0.18911
## time_lag 0.07511 0.01553 1.00000 0.07008
## avg_gift 0.86640 0.18911 0.07008 1.00000
- The following predictor pairs appear to be correlated per the above correlation matrix:
- home_value and med_fam_inc: 0.73815 (positive)
- avg_fam_inc and home_value: 0.75257 (positive)
- avg_fam_inc and med_fam_inc: 0.97227 (positive)
- pct_lt15k and med_fam_inc: -0.66536 (negative)
- avg_fam_inc and pct_lt15k: -0.68028 (negative)
- num_prom and lifetime_gifts: 0.53862 (positive)
- lifetime_gifts and largest_gift: 0.50726 (positive)
- last_gift and avg_gift: 0.86640 (positive)
- Our numerical variables are: num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift.
- Our categorical variables are: zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, female, and target (the response). A sketch for checking their association with the response follows.
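The correlation matrix above covers only the numeric predictors. For the categorical predictors, one hedged way to evaluate association with the response is a cross-tabulation plus a chi-square test (a sketch, not part of the original analysis):
# Cross-tabulate each categorical predictor against the response; a small
# chi-square p-value flags the predictor as a candidate.
cat_vars = c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5",
             "homeowner", "female")
for (v in cat_vars) {
  tab = table(fundraise_data[[v]], fundraise_data$target)
  cat(v, "p-value:", chisq.test(tab)$p.value, "\n")
  print(prop.table(tab, 1))  # row proportions: donor rate within each level
}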
- 2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
Logistic Regression
- Process: We begin by fitting the model with all the predictors, then perform step-wise selection to determine the best predictors to include in the final model.
- Significant Predictors: Based on the step-wise selection results below, num_child, last_gift, and months_since_donate are significant predictors, each with a p-value below the 0.05 significance level. Restricting the final model to these ensures it does not rely on insignificant variables.
glm.fund = glm(target ~., data = train, family = 'binomial')
glm.step = step(glm.fund, scope = list(upper = glm.fund),
direction = "both", test = "Chisq", trace = F)
summary(glm.step)
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + homeowner + num_child + income + home_value +
## avg_fam_inc + last_gift + months_since_donate, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7887 -1.1411 -0.7736 1.1703 1.6751
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.335e+00 3.900e-01 -5.987 2.14e-09 ***
## zipconvert2Yes -1.266e+01 2.271e+02 -0.056 0.95553
## zipconvert3No 1.257e+01 2.271e+02 0.055 0.95586
## zipconvert4Yes -1.264e+01 2.271e+02 -0.056 0.95563
## zipconvert5Yes -1.258e+01 2.271e+02 -0.055 0.95582
## homeownerNo 1.505e-01 1.052e-01 1.430 0.15272
## num_child 3.424e-01 1.271e-01 2.694 0.00706 **
## income -5.308e-02 2.858e-02 -1.857 0.06330 .
## home_value -1.091e-04 7.675e-05 -1.422 0.15508
## avg_fam_inc 5.741e-04 4.042e-04 1.420 0.15550
## last_gift 1.610e-02 4.798e-03 3.356 0.00079 ***
## months_since_donate 5.852e-02 1.076e-02 5.438 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3255.7 on 2388 degrees of freedom
## AIC: 3279.7
##
## Number of Fisher Scoring iterations: 11
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.step$y, fitted(glm.step)
## X-squared = 1.7672, df = 8, p-value = 0.9873
- Goodness of Fit: The Hosmer-Lemeshow test yielded a p-value of 0.9873, well above the 0.05 significance level. We fail to reject the null hypothesis of adequate fit; there is no evidence of lack of fit.
Based on the above analysis, we fit the final model with the significant predictors only: num_child, last_gift, and months_since_donate. (The zipconvert dummies, with their enormous standard errors, signal near-perfect collinearity among the zip indicators and are excluded.)
glm.fund_final = glm(target ~ num_child + last_gift + months_since_donate,
data = train, family = 'binomial')
pred.prob = predict.glm(glm.fund_final, newdata = test, type = 'response')
pred = ifelse(pred.prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 98 131
## No Donor 192 179
##
## Accuracy : 0.4617
## 95% CI : (0.4212, 0.5025)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.9968933
##
## Kappa : -0.0852
##
## Mcnemar's Test P-Value : 0.0008424
##
## Sensitivity : 0.3379
## Specificity : 0.5774
## Pos Pred Value : 0.4279
## Neg Pred Value : 0.4825
## Prevalence : 0.4833
## Detection Rate : 0.1633
## Detection Prevalence : 0.3817
## Balanced Accuracy : 0.4577
##
## 'Positive' Class : Donor
##
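One caution: the 46.2% accuracy above is below the 51.7% no-information rate, which suggests the class labels may be flipped. For a binomial glm in R, predict(type = 'response') returns the probability of the second factor level; if the levels of target are ordered ("Donor", "No Donor"), then pred.prob is P(No Donor) and the ifelse() mapping above is inverted. A quick check (a sketch, not from the original report):
levels(train$target)  # if this prints "Donor" "No Donor", pred.prob is P(No Donor)
# In that case the label mapping should be reversed:
pred = ifelse(pred.prob > 0.5, 'No Donor', 'Donor')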
K Nearest Neighbors
- Process: We begin by creating a train_ctrl variable for cross-validation, then fit the K Nearest Neighbors model with the significant variables identified during the step-wise selection above.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)
knn.fit = train(target~ months_since_donate + num_child + last_gift,
data=train,
method='knn',
trControl = train_ctrl,
tuneLength=20)
pred.prod3 = predict(knn.fit, test)
confusionMatrix(as.factor(pred.prod3), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 158 156
## No Donor 132 154
##
## Accuracy : 0.52
## 95% CI : (0.4792, 0.5606)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.4514
##
## Kappa : 0.0415
##
## Mcnemar's Test P-Value : 0.1753
##
## Sensitivity : 0.5448
## Specificity : 0.4968
## Pos Pred Value : 0.5032
## Neg Pred Value : 0.5385
## Prevalence : 0.4833
## Detection Rate : 0.2633
## Detection Prevalence : 0.5233
## Balanced Accuracy : 0.5208
##
## 'Positive' Class : Donor
##
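The report does not show which k the repeated cross-validation selected from the 20 candidates. With caret this can be inspected directly (a sketch):
knn.fit$bestTune  # the value of k chosen by repeated 10-fold CV
plot(knn.fit)     # CV accuracy across the candidate values of k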
Random Forest
- Process: We begin by creating a train_control variable for cross-validation, then fit the Random Forest model with all predictors. With the model fitted, we rank the variables by importance in the table and chart below; this guides which predictors to include in the final Random Forest model.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit = train(target~.,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
rf.fit$besttune  # returns NULL: the component name is case-sensitive; use rf.fit$bestTune
## NULL
## rf variable importance
##
## Importance
## months_since_donate 100.000
## largest_gift 78.852
## num_child 70.241
## last_gift 62.017
## pct_lt15k 58.123
## avg_fam_inc 48.639
## home_value 46.974
## income 43.231
## med_fam_inc 40.671
## avg_gift 34.473
## zipconvert3No 29.082
## femaleNo 21.990
## homeownerNo 20.963
## num_prom 19.328
## wealth 18.354
## time_lag 16.654
## lifetime_gifts 11.840
## zipconvert2Yes 10.684
## zipconvert5Yes 2.436
## zipconvert4Yes 0.000
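The importance table above (and the chart referenced in the process description) can be reproduced from the fitted caret model; varImp() is assumed to be the call that generated the printed table:
varImp(rf.fit)        # prints the "rf variable importance" table above
plot(varImp(rf.fit))  # dot plot of the same importance scores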
We remove avg_gift as it is collinear with last_gift (r = 0.87), and we remove med_fam_inc as it is collinear with pct_lt15k (r = -0.67); avg_fam_inc and home_value are dropped on the same grounds.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit_refitted = train(target~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = train,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_refitted = predict(rf.fit_refitted,test)
confusionMatrix(pred.rf_refitted, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 159 147
## No Donor 131 163
##
## Accuracy : 0.5367
## 95% CI : (0.4958, 0.5771)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.1738
##
## Kappa : 0.074
##
## Mcnemar's Test P-Value : 0.3683
##
## Sensitivity : 0.5483
## Specificity : 0.5258
## Pos Pred Value : 0.5196
## Neg Pred Value : 0.5544
## Prevalence : 0.4833
## Detection Rate : 0.2650
## Detection Prevalence : 0.5100
## Balanced Accuracy : 0.5370
##
## 'Positive' Class : Donor
##
- 3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?
- A weighted sample with equal numbers of donors and non-donors adjusts for the severe class imbalance in the original data, where only about 5.1% of mailings draw a response. Left unbalanced, the model would be biased toward the dominant class (non-donors) and would identify donors poorly on test data. A simple random sample does not compensate for this imbalance; it preserves it, leaving too few donor records to learn from.
- 4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
models = c('Logistic Regression', 'K Nearest Neighbors', 'Random Forest')
acc= c(46.17, 52.00, 53.17)
acc.summary= as.data.frame(acc, row.names = models)
acc.summary
## acc
## Logistic Regression 46.17
## K Nearest Neighbors 52.00
## Random Forest 53.17
# Plot the bar chart
barplot(acc,names.arg = models, ylab="Accuracy Score", col="navyblue",
main="Model Results", border="orange")- The Random Forest model appears to be the model that dominates.
- 5. Select the best model. From your answer in (4), what do you think is the “best” model?
- Model Selected: Random Forest is the selected model due to its slightly higher accuracy.
Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.
- 6. Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
rf.fit_final = train(target~months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
data = fundraise_data,
method ='rf',
trControl = train_control,
importance = TRUE)
pred.rf_final = predict(rf.fit_final, future_fundraise_data)
Shiny App Upload Results
- Found 62 records... Imported 62 records. Simplifying into dataframe...
- Found 120 records... Imported 120 records. Simplifying into dataframe...
- Complete! Processed total of 1 rows.
- Leaderboard accuracy: [1] 0.5083333
- The uploaded predictions scored 50.8% accuracy on the leaderboard. Based on our results, Random Forest is the strongest of the classification methods we tried for predicting whether a person will be a donor or a non-donor.
- Important factors to consider when targeting potential donors are the number of months from the last donation to July 2018 (months_since_donate), the dollar amount of the largest gift to date (largest_gift), the number of children (num_child), the dollar amount of the most recent gift (last_gift), the percentage of households earning less than $15K in the potential donor’s neighborhood (pct_lt15k), household income (income), and wealth rating (wealth).
- 7. Submission File. For each row in the test set, you must predict whether the candidate is a donor. The .csv file should contain a header.
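A minimal sketch for producing the submission file (the column header "value" is an assumption; match it to whatever header the leaderboard expects):
# One prediction per row of future_fundraising.rds; write.csv includes a header by default
submission = data.frame(value = pred.rf_final)  # NOTE: column name is a guess
write.csv(submission, "submission.csv", row.names = FALSE)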