Background

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and evelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Data Summary

Import

funds <- readRDS("/Users/moniquevillarreal/Desktop/fundraising.rds")
ffunds <- readRDS("/Users/moniquevillarreal/Desktop/future_fundraising.rds")
any(is.na(funds))
## [1] FALSE
any(is.na(ffunds))
## [1] FALSE

Structure

str(funds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3000 obs. of  21 variables:
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num  1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num  1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num  7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num  698 828 1471 547 482 ...
##  $ med_fam_inc        : num  422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num  463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num  4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num  46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num  94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num  12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num  12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num  34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num  6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num  9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
Numerical values Categorical Values
num_child zipconvert2
income zipconvert3
wealth zipconvert4
home_value zipconvert5
med_fam_inc homeowner
avg_fam_inc target
pct_lt15k female
num_prom
lifetime_gifts
largest_gift
months_since_donate
time_lag
avg_gift

Organization

We would need to convert the categorical data into numerical. Wealth and Income variables, although numerical, represent ranges and thus are not representative of true numerical values.

funds$income = as.factor(funds$income)
funds$wealth = as.factor(funds$wealth)
# for the future fundraising
ffunds$income = as.factor(ffunds$income)
ffunds$wealth = as.factor(ffunds$wealth)
summary(funds)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##                                                                            
##  income   female         wealth       home_value      med_fam_inc    
##  1: 272   Yes:1831   8      :1630   Min.   :   0.0   Min.   :   0.0  
##  2: 448   No :1169   9      : 184   1st Qu.: 554.8   1st Qu.: 278.0  
##  3: 281              5      : 181   Median : 816.5   Median : 355.0  
##  4:1015              7      : 172   Mean   :1143.3   Mean   : 388.4  
##  5: 512              6      : 155   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  6: 238              3      : 154   Max.   :5945.0   Max.   :1500.0  
##  7: 234              (Other): 524                                    
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##                                                                    
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##                                                                         
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167                  
## 

Step 1. Partitioning

Think about estimating the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 1234).

set.seed(1234)
train.index <- sample(1:nrow(funds), round(nrow(funds)*0.8))
f.train <- funds[train.index,]
f.test <- funds[-train.index,]
test <- (-train.index)

Step 2. Model Building

Build, evaluate, and choose a model.

Models

Exploring the data

Converting the predictors into numerical values to review correlation.

fundsnum <- as.data.frame(sapply(funds[, c(1:21)], as.numeric))
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(fundsnum))

The strongest correlations appear within the following predictors: income, med_fam_inc, home_value, avg_fam_inc, avg_gift, lifetime_gifts, largest_gift, last_gift.

pairs(~income + med_fam_inc + home_value + avg_fam_inc + avg_gift + lifetime_gifts + largest_gift + last_gift, data = fundsnum)

sub_funds<- fundsnum[c(7, 10, 11, 12,15, 16, 19)]
corrplot(cor(sub_funds), method = "square", type = "upper", tl.col = "black", tl.cex = 2, col = colorRampPalette(c("orange", "blue"))(100))

Random Forest

This will help determine the most important variables.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
set.seed(1234)
rf.funds <-train(target~., data = fundsnum)
plot(varImp(rf.funds), top = 15)

Logistic Regression

Logistic regression

The smallest p-value is associated with months_since_donate

glm.fundfit <- glm(target~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag,  data = f.train, family = binomial)
summary(glm.fundfit)
## 
## Call:
## glm(formula = target ~ avg_gift + home_value + lifetime_gifts + 
##     med_fam_inc + avg_fam_inc + num_prom + pct_lt15k + months_since_donate + 
##     time_lag, family = binomial, data = f.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6537  -1.1640   0.8116   1.1582   1.6788  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.6769101  0.4370229  -3.837 0.000124 ***
## avg_gift             0.0138409  0.0065222   2.122 0.033827 *  
## home_value          -0.0001699  0.0000693  -2.451 0.014232 *  
## lifetime_gifts       0.0001501  0.0003447   0.436 0.663156    
## med_fam_inc         -0.0008947  0.0010777  -0.830 0.406436    
## avg_fam_inc          0.0012922  0.0011555   1.118 0.263462    
## num_prom            -0.0036540  0.0023567  -1.550 0.121036    
## pct_lt15k            0.0004583  0.0047939   0.096 0.923842    
## months_since_donate  0.0564748  0.0110991   5.088 3.61e-07 ***
## time_lag            -0.0122793  0.0075186  -1.633 0.102428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.0  on 2399  degrees of freedom
## Residual deviance: 3266.8  on 2390  degrees of freedom
## AIC: 3286.8
## 
## Number of Fisher Scoring iterations: 4

The accuracy is 48%.

library(caret)
pred.glm <-predict(glm.fundfit, f.test)
classify <- ifelse(pred.glm>=0.5, "Donor", "No Donor")
confusionMatrix(as.factor(classify), f.test$target, positive = "Donor")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       11       13
##   No Donor   294      282
##                                           
##                Accuracy : 0.4883          
##                  95% CI : (0.4476, 0.5291)
##     No Information Rate : 0.5083          
##     P-Value [Acc > NIR] : 0.8463          
##                                           
##                   Kappa : -0.0079         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.03607         
##             Specificity : 0.95593         
##          Pos Pred Value : 0.45833         
##          Neg Pred Value : 0.48958         
##              Prevalence : 0.50833         
##          Detection Rate : 0.01833         
##    Detection Prevalence : 0.04000         
##       Balanced Accuracy : 0.49600         
##                                           
##        'Positive' Class : Donor           
## 

QDA

QDA has a 52% accuracy rate.

set.seed(1234)
qda.fit <- train (target ~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, method = "qda")
qda.fit
## Quadratic Discriminant Analysis 
## 
## 2400 samples
##    9 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5101303  0.02286336

Predicting on the test set

qda.probs <- predict(qda.fit, f.test)
confusionMatrix(qda.probs, f.test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      259      241
##   No Donor    46       54
##                                           
##                Accuracy : 0.5217          
##                  95% CI : (0.4808, 0.5623)
##     No Information Rate : 0.5083          
##     P-Value [Acc > NIR] : 0.2702          
##                                           
##                   Kappa : 0.0326          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8492          
##             Specificity : 0.1831          
##          Pos Pred Value : 0.5180          
##          Neg Pred Value : 0.5400          
##              Prevalence : 0.5083          
##          Detection Rate : 0.4317          
##    Detection Prevalence : 0.8333          
##       Balanced Accuracy : 0.5161          
##                                           
##        'Positive' Class : Donor           
## 

K-Nearest Neighbors

Model has a 50% accuracy rate.

library(class)
knn.funds <- train(target ~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, method = "knn", tuneGrid = expand.grid( k = seq(2, 20, 1)))
knn.funds
## k-Nearest Neighbors 
## 
## 2400 samples
##    9 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       
##    2  0.4954042  -0.008992414
##    3  0.4961007  -0.007597672
##    4  0.4974568  -0.004522338
##    5  0.5003883   0.001359051
##    6  0.5091052   0.018750103
##    7  0.5093120   0.019221955
##    8  0.5068948   0.014447467
##    9  0.5056592   0.012059762
##   10  0.5069890   0.014541234
##   11  0.5034854   0.007618484
##   12  0.5077663   0.016140638
##   13  0.5055148   0.011638444
##   14  0.5086728   0.018010590
##   15  0.5105708   0.021793628
##   16  0.5081105   0.016858445
##   17  0.5104061   0.021545813
##   18  0.5081252   0.016928683
##   19  0.5075864   0.015989487
##   20  0.5075619   0.016013925
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.

Predicting on the test set

knn.probs <- predict(knn.funds, f.test)
confusionMatrix(knn.probs, f.test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      153      147
##   No Donor   152      148
##                                           
##                Accuracy : 0.5017          
##                  95% CI : (0.4609, 0.5424)
##     No Information Rate : 0.5083          
##     P-Value [Acc > NIR] : 0.6434          
##                                           
##                   Kappa : 0.0033          
##                                           
##  Mcnemar's Test P-Value : 0.8171          
##                                           
##             Sensitivity : 0.5016          
##             Specificity : 0.5017          
##          Pos Pred Value : 0.5100          
##          Neg Pred Value : 0.4933          
##              Prevalence : 0.5083          
##          Detection Rate : 0.2550          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.5017          
##                                           
##        'Positive' Class : Donor           
## 

Future Fundraising

Comparing accuracy rates

Model Accuracy Rate
LR 48%
QDA 52%
KNN 50%

Using model on test set

The model with the highest accuracy rate is QDA.

qdaff.prob <- predict(qda.fit, newdata = ffunds)
qdaff.pred <- data.frame(value = qdaff.prob)

Producing .csv

write.csv(qdaff.pred, file = "target.csv", col.names = c("value"), row.names = FALSE)