A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
The objective of this project is to develop a classification model that can improve the national veteran organization marketing campign cost effectiveness. We want to effectively capture donors so the expected net profit is maxmized.
The data file, Fundraising.rds, contains 3,000 records and 22 variables, the descriptions are down below. The data given has approximately 50% donors and 50% non-donors.
In addition to this dataset, we also have our test dataset that is future_fundraising.rds.
knitr::include_graphics("C:\\Users\\sanja\\Downloads\\Datamining project variables.png")
#structure of fundraise and future
str(fundraise)
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
str(future)
## tibble [120 × 20] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 1 2 1 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_child : num [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num [1:120] 5 1 4 4 2 4 2 3 4 2 ...
## $ female : Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 2 2 2 ...
## $ wealth : num [1:120] 9 7 1 8 7 8 1 8 3 5 ...
## $ home_value : num [1:120] 1399 1355 835 1019 992 ...
## $ med_fam_inc : num [1:120] 637 411 310 389 524 371 209 253 302 335 ...
## $ avg_fam_inc : num [1:120] 703 497 364 473 563 408 259 285 324 348 ...
## $ pct_lt15k : num [1:120] 1 9 22 15 6 10 36 25 19 14 ...
## $ num_prom : num [1:120] 74 77 70 21 63 35 72 68 55 59 ...
## $ lifetime_gifts : num [1:120] 102 249 126 26 100 92 146 98 66 276 ...
## $ largest_gift : num [1:120] 6 15 6 16 20 37 12 5 7 15 ...
## $ last_gift : num [1:120] 5 7 6 16 3 37 11 3 5 13 ...
## $ months_since_donate: num [1:120] 29 35 34 37 21 37 36 32 30 33 ...
## $ time_lag : num [1:120] 3 3 8 5 6 5 5 9 9 10 ...
## $ avg_gift : num [1:120] 4.86 9.58 4.34 13 7.69 ...
An important note is that our target has two factors, Donor and non-donor. Since we wish to create a classification model a weighted sample is used to insure there is fair distribution between the two factors of target.
For the exploratory data anaylsis the following questions needed to be asked:
How are the predictors distrbuted? Is there any signifcant outliers?
Do the predictors correlate with the response variable? What predictors heavily influence the response variable.
Is there collinearity amongst the variable?
These questions will be answered in their respective order.
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Based off the summary, we can see that the following variables home_value, med_fam_inc, avg_fam_inc, pcy_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift all seem to contain outliers. num_child, income, and wealth were all labeled as numeric data however they appear to be categorical variables.
#Checking for correlation
corfund <- (fundraise[,c(6:7, 9:21)])
corfund$target <- as.numeric(corfund$target)
cor(corfund)
## num_child income wealth home_value
## num_child 1.000000000 0.091893089 0.060175537 -0.0119642286
## income 0.091893089 1.000000000 0.208993101 0.2919734944
## wealth 0.060175537 0.208993101 1.000000000 0.2611611450
## home_value -0.011964229 0.291973494 0.261161145 1.0000000000
## med_fam_inc 0.046961647 0.367505334 0.377763371 0.7381530742
## avg_fam_inc 0.047261395 0.378585352 0.385892299 0.7525690021
## pct_lt15k -0.031717891 -0.283191234 -0.375145585 -0.3990861577
## num_prom -0.086432604 -0.069008634 -0.412117770 -0.0645138583
## lifetime_gifts -0.050954766 -0.019565470 -0.225473319 -0.0240737013
## largest_gift -0.017554416 0.033180760 -0.025276518 0.0564942757
## last_gift -0.012948678 0.109592754 0.052591311 0.1588576542
## months_since_donate -0.005563603 0.077238810 0.033713981 0.0234285142
## time_lag -0.006069356 -0.001545727 -0.066421329 0.0006789113
## avg_gift -0.019688680 0.124055750 0.091078754 0.1687736865
## target 0.042348253 -0.035953287 -0.003114465 -0.0215691141
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.046961647 0.047261395 -0.0317178911 -0.08643260
## income 0.367505334 0.378585352 -0.2831912335 -0.06900863
## wealth 0.377763371 0.385892299 -0.3751455847 -0.41211777
## home_value 0.738153074 0.752569002 -0.3990861577 -0.06451386
## med_fam_inc 1.000000000 0.972271285 -0.6653626748 -0.05078270
## avg_fam_inc 0.972271285 1.000000000 -0.6802847967 -0.05731139
## pct_lt15k -0.665362675 -0.680284797 1.0000000000 0.03777518
## num_prom -0.050782705 -0.057311385 0.0377751828 1.00000000
## lifetime_gifts -0.035245827 -0.040327155 0.0596188059 0.53861957
## largest_gift 0.047032066 0.043103937 -0.0078829361 0.11381034
## last_gift 0.135976003 0.131378624 -0.0617521213 -0.05586809
## months_since_donate 0.032336691 0.031268594 -0.0090145584 -0.28232212
## time_lag 0.015202043 0.024340381 -0.0199114896 0.11962322
## avg_gift 0.137162758 0.131758434 -0.0624808920 -0.14725094
## target -0.008036116 -0.003177139 0.0007592833 -0.06836599
## lifetime_gifts largest_gift last_gift months_since_donate
## num_child -0.05095477 -0.017554416 -0.01294868 -0.005563603
## income -0.01956547 0.033180760 0.10959275 0.077238810
## wealth -0.22547332 -0.025276518 0.05259131 0.033713981
## home_value -0.02407370 0.056494276 0.15885765 0.023428514
## med_fam_inc -0.03524583 0.047032066 0.13597600 0.032336691
## avg_fam_inc -0.04032716 0.043103937 0.13137862 0.031268594
## pct_lt15k 0.05961881 -0.007882936 -0.06175212 -0.009014558
## num_prom 0.53861957 0.113810342 -0.05586809 -0.282322122
## lifetime_gifts 1.00000000 0.507262313 0.20205827 -0.144621862
## largest_gift 0.50726231 1.000000000 0.44723693 0.019789633
## last_gift 0.20205827 0.447236933 1.00000000 0.186715010
## months_since_donate -0.14462186 0.019789633 0.18671501 1.000000000
## time_lag 0.03854575 0.039977035 0.07511121 0.015528499
## avg_gift 0.18232435 0.474830096 0.86639998 0.189110799
## target -0.01962693 0.017783355 0.07772082 0.133813301
## time_lag avg_gift target
## num_child -0.0060693555 -0.01968868 0.0423482529
## income -0.0015457272 0.12405575 -0.0359532869
## wealth -0.0664213294 0.09107875 -0.0031144649
## home_value 0.0006789113 0.16877369 -0.0215691141
## med_fam_inc 0.0152020426 0.13716276 -0.0080361157
## avg_fam_inc 0.0243403812 0.13175843 -0.0031771394
## pct_lt15k -0.0199114896 -0.06248089 0.0007592833
## num_prom 0.1196232155 -0.14725094 -0.0683659889
## lifetime_gifts 0.0385457538 0.18232435 -0.0196269259
## largest_gift 0.0399770354 0.47483010 0.0177833547
## last_gift 0.0751112090 0.86639998 0.0777208200
## months_since_donate 0.0155284995 0.18911080 0.1338133012
## time_lag 1.0000000000 0.07008164 -0.0097457015
## avg_gift 0.0700816428 1.00000000 0.0756630051
## target -0.0097457015 0.07566301 1.0000000000
The output above shows us that the variables num_child, income, num_prom, last_gift, months_since_donate, and avg_gift are correlated with the response variable, target. In addition we also see that med_fam_inc and avg_fam_inc are highly correlated with income.
# Check collinearity
dataf <- (fundraise[,c(6:7, 9:20)])
vif(lm(data=dataf))
## income wealth home_value med_fam_inc
## 1.186680 1.508818 2.478427 18.420766
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 20.683498 2.039677 1.951447 1.994198
## largest_gift last_gift months_since_donate time_lag
## 1.715154 4.150953 1.144156 1.032417
## avg_gift
## 4.463250
Above we can see that med_fam_inc and ang_fam_inc are collinear as well as last_gift and avg_gift.
Previously we saw the variables home_value, med_fam_inc, avg_farm_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift all have large outliers. We can take the square root to the values of 0 and then apply a log to the non-zero values.
#splitting into 80% training and 20% testing
set.seed(12345)
trainI <- sample(1:nrow(fundraise), round(nrow(fundraise) * 0.80))
train <- fundraise[trainI, ]
test <- fundraise[-trainI, ]
nrow(train)
## [1] 2400
#Setting up the Cross Validation
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
To find the variables that signifcant, variables that have a p-value less than 0.05, a logistic regression is used.
fundglm_fit <- glm(target~., data = fundraise, family = "binomial")
summary(fundglm_fit)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = fundraise)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90432 -1.15349 0.00153 1.15919 1.79778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.885e+00 4.595e-01 -4.102 4.10e-05 ***
## zipconvert2Yes -1.365e+01 2.670e+02 -0.051 0.95924
## zipconvert3No 1.361e+01 2.670e+02 0.051 0.95934
## zipconvert4Yes -1.365e+01 2.670e+02 -0.051 0.95922
## zipconvert5Yes -1.365e+01 2.670e+02 -0.051 0.95922
## homeownerNo 4.957e-02 9.412e-02 0.527 0.59847
## num_child 2.752e-01 1.137e-01 2.422 0.01544 *
## income -6.952e-02 2.595e-02 -2.679 0.00738 **
## femaleNo 5.995e-02 7.673e-02 0.781 0.43463
## wealth -1.907e-02 1.800e-02 -1.059 0.28940
## home_value -1.074e-04 7.141e-05 -1.503 0.13272
## med_fam_inc -1.200e-03 9.303e-04 -1.289 0.19725
## avg_fam_inc 1.756e-03 1.010e-03 1.738 0.08226 .
## pct_lt15k -9.519e-04 4.440e-03 -0.214 0.83024
## num_prom -3.682e-03 2.317e-03 -1.589 0.11204
## lifetime_gifts 1.599e-04 3.721e-04 0.430 0.66743
## largest_gift -1.773e-03 3.091e-03 -0.574 0.56629
## last_gift 9.923e-03 7.562e-03 1.312 0.18945
## months_since_donate 5.922e-02 1.003e-02 5.906 3.51e-09 ***
## time_lag -6.174e-03 6.789e-03 -0.909 0.36311
## avg_gift 7.539e-03 1.106e-02 0.682 0.49526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4062.0 on 2979 degrees of freedom
## AIC: 4104
##
## Number of Fisher Scoring iterations: 12
The output from the logistic regression shows that the intercept and the variables num_child, income, and months_since_donate are significant to predicting the target. Using these variables, I will perform a linear regresssion next.
fund_lm_fit <- glm(target ~ num_child + income + months_since_donate, data = train, family = 'binomial')
pred_prob <- predict.glm(fund_lm_fit, newdata = test, type = 'response')
pred <- ifelse(pred_prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 105 147
## No Donor 185 163
##
## Accuracy : 0.4467
## 95% CI : (0.4064, 0.4875)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.99974
##
## Kappa : -0.1126
##
## Mcnemar's Test P-Value : 0.04229
##
## Sensitivity : 0.3621
## Specificity : 0.5258
## Pos Pred Value : 0.4167
## Neg Pred Value : 0.4684
## Prevalence : 0.4833
## Detection Rate : 0.1750
## Detection Prevalence : 0.4200
## Balanced Accuracy : 0.4439
##
## 'Positive' Class : Donor
##
This model has a test accuracy of 44.67%. This is explains that our model is not great predictor. A randomForest model could be used next to see if it’ll produce a better test accuracy using the predictor variables we found to significant.
set.seed(12345)
fund_tree_fit <- randomForest(target ~ num_child + income + months_since_donate, data =train, mtry = 3, importance = TRUE)
fund_tree_fit
##
## Call:
## randomForest(formula = target ~ num_child + income + months_since_donate, data = train, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 50.5%
## Confusion matrix:
## Donor No Donor class.error
## Donor 721 488 0.4036394
## No Donor 724 467 0.6078925
rando_fit <- train(target~ num_child + income + months_since_donate, data = train, method ='rf', trControl = train_control, importance = TRUE)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
pred_rf <- predict(rando_fit,test)
confusionMatrix(pred_rf,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 197 180
## No Donor 93 130
##
## Accuracy : 0.545
## 95% CI : (0.5042, 0.5854)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.08875
##
## Kappa : 0.0977
##
## Mcnemar's Test P-Value : 1.94e-07
##
## Sensitivity : 0.6793
## Specificity : 0.4194
## Pos Pred Value : 0.5225
## Neg Pred Value : 0.5830
## Prevalence : 0.4833
## Detection Rate : 0.3283
## Detection Prevalence : 0.6283
## Balanced Accuracy : 0.5493
##
## 'Positive' Class : Donor
##
Here we see that we had a improvement in test accuracy. The randomForest model yields a test accuracy of 54.5%. For the last model fit I will try a KNN model with the predictor variables that are signifcant.
fund_knn_fit <- train(target~num_child + income + months_since_donate, data=train, method='knn',trControl = train_control, tuneLength=30)
pred_knn <- predict(fund_knn_fit,test)
confusionMatrix(pred_knn,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 179 163
## No Donor 111 147
##
## Accuracy : 0.5433
## 95% CI : (0.5025, 0.5837)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.102639
##
## Kappa : 0.0909
##
## Mcnemar's Test P-Value : 0.002063
##
## Sensitivity : 0.6172
## Specificity : 0.4742
## Pos Pred Value : 0.5234
## Neg Pred Value : 0.5698
## Prevalence : 0.4833
## Detection Rate : 0.2983
## Detection Prevalence : 0.5700
## Balanced Accuracy : 0.5457
##
## 'Positive' Class : Donor
##
The KNN model yields a test accuracy of 55.33% an improvement from the previous two models.
Out of the three models tested the KNN model produced the best test accuracy of 55.33%.
Using the future_fundraising test data given, I can apply the knn model
knn <- train(target~num_child + income + months_since_donate, data=train, method='knn',trControl = train_control, tuneLength=30)
pred_knn <- predict(knn,future)
Value <- c("value",as.character(pred_knn))
write.csv(Value,file="knn_model_value.csv")
The test accuracies were not the best however there is a couple of ways to improve them. One way is to simply have collected more observations. Another way was to change the scales of the models. And lastly apply a better transformation to the variables to treat outliers.