Direct-Mail Fundraising

Business Problem

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Business Objectives and Goals

The business objective here is to explore and develop different kind of models and find the best classification model that will help to maximize the expected net profit by predicting who will more likely donate during the direct-mail fundraising campaign.

The goal here is to improve the organization by using statistics to help achieve cost-effectiveness during their campaign.

Data Source

The fund-raising file used contains 3,000 records. The records have approximately 50% donors and 50% non-donors.

This weighted sampling (50/50 split) is essential to produce repeatable results. If our training data included data where only 10% of the observations were donors (random sample), then it would have been difficult to create a predictive model for future donors because our training data would consist of such few and randomly selected donors.

The data was provided as 2 .rds files:

Fundraising_data: 3,000 observations and 20 predictor variables (6 factor and 14 numerical) and 1 response variable (Factor)

zip convert 1: 00000–19999
zip convert 2: 20000–39999
zip convert 3: 40000–59999
zip convert 4: 60000–79999
zip convert 5: 80000–99999
homeowner: Yes = homeowner, No = not a homeowner
num_child: Number of children income Household income
female: No = male, Yes = female
wealth: Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state
home_value: Average home value in potential donor’s neighborhood in hundreds of dollars
med_fam_inc Median family income in potential donor’s neighborhood in hundreds of dollars
avg_fam_inc: Average family income in potential donor’s neighborhood in hundreds
pct_lt15k: Percent earning less than $15K in potential donor’s neighborhood
num_prom: Lifetime number of promotions received to date
lifetime_gifts Dollar amount of lifetime gifts to date
largest_gift: Dollar amount of largest gift to date
last_gift Dollar amount of most recent gift
months_since_donate: Number of months from last donation to July 2018
time_lag Number of months between first and second gift
avg_gift: Average dollar amount of gifts to date
target Outcome variable: binary indicator for response Yes = donor, No = non-donor

Future_fundraising: 120 observations and 20 predictor variables, no response variable, since it is unknown.

Loading the data

fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")

fundraise_data = as.data.frame(fundraise_data)

str(fundraise_data)

## 'data.frame':    3000 obs. of  21 variables:
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num  1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num  1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num  7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num  698 828 1471 547 482 ...
##  $ med_fam_inc        : num  422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num  463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num  4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num  46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num  94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num  12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num  12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num  34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num  6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num  9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

Preview of the Data

summary(fundraise_data)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

Comments: From the summary above we can see that our target variable is balanced. We can also get an idea that variables such as lifetime_gifts, largest_gift, last_gift and avg_gift are skewed and might be significant. Next, we will perform some exploratory analysis on them and check if they are significant by running different models.

tbl_donor = table(fundraise_data$target)
cbind(tbl_donor, round(prop.table(tbl_donor)*100,2))

##          tbl_donor      
## Donor         1499 49.97
## No Donor      1501 50.03

Comments: As we can see, the data set is balanced, with an almost 50/50 split, this will definitely be an advantage for training purposes of our models.

Exploratory Analysis

Heatmap - correlation

cormat = round(cor(fundraise_data[,-c(1:5,8,21)]),2)
melted_cormat = melt(cormat)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Comments: As we can notice here the variable pct_lt15k stands out, which is the one with the lowest correlation with multiple variables. Also, we can see the variables that are strongly correlated between each other are: avg_fam_inc and med_fam_inc (for obvious reasons, since one is the calculation of the other), however, they are also correlated with home_value, income and wealth. When it comes to the gifts section, the ones correlated are last_gift with avg_gift, and also life_time with num_prom and largest_gift.

Plots (Pairwise)

After the Heat map (where we got some interesting insights), some of the variables should be looked into in more detail. We will now use the pairwise plots matrix, to analyze them. The variables that we will look at, are the ones that showed the highest correlation in the previous heat map.

ggpairs(fundraise_data[, c(10, 11, 12, 15, 17, 20)], ggplot2::aes(colour = fundraise_data$target))

Comments: We can confirm from the pairs plots what we saw in the heat map analysis, where the variables lifetime_gifts, last_gift and avg_gift are skewed. We can also confirm the Correlation between avg_fam_inc and med_fam_inc, but also with the target (response variable).

Plots (Bar plots/Box plots)

1. Gender

The variable that is worth taking a look is Gender.

ggplot(fundraise_data, aes(x = ifelse(female=="Yes", "Female", "Male"), fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1)) +
  labs(x = "Donors by Gender")

Comments: From the above plot we can see that women are more likely to donate than males. And we see also notice that there are more number of females than males in the data set.

2. Zip Code

Just to see how the Donors and No Donor are distributed throughout the zip codes.

ggplot(fundraise_data, aes(x = zipconvert2, fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))

ggplot(fundraise_data, aes(x = zipconvert3, fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))

ggplot(fundraise_data, aes(x = zipconvert4, fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))

ggplot(fundraise_data, aes(x = zipconvert5, fill = target)) + 
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))

Comments: From the bar plots we can see that the Donors and Non-donors are equally distributed. And the only zipcode that has more donors than non-donors is ZipConvert5.

Colinearity and Significance

Logistic Regression - Significance and VIF

fundraise_glm = glm(formula = target ~ .,  data = fundraise_data, family = binomial)
summary(fundraise_glm)

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = fundraise_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90432  -1.15349   0.00153   1.15919   1.79778  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.885e+00  4.595e-01  -4.102 4.10e-05 ***
## zipconvert2Yes      -1.365e+01  2.670e+02  -0.051  0.95924    
## zipconvert3No        1.361e+01  2.670e+02   0.051  0.95934    
## zipconvert4Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## zipconvert5Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## homeownerNo          4.957e-02  9.412e-02   0.527  0.59847    
## num_child            2.752e-01  1.137e-01   2.422  0.01544 *  
## income              -6.952e-02  2.595e-02  -2.679  0.00738 ** 
## femaleNo             5.995e-02  7.673e-02   0.781  0.43463    
## wealth              -1.907e-02  1.800e-02  -1.059  0.28940    
## home_value          -1.074e-04  7.141e-05  -1.503  0.13272    
## med_fam_inc         -1.200e-03  9.303e-04  -1.289  0.19725    
## avg_fam_inc          1.756e-03  1.010e-03   1.738  0.08226 .  
## pct_lt15k           -9.519e-04  4.440e-03  -0.214  0.83024    
## num_prom            -3.682e-03  2.317e-03  -1.589  0.11204    
## lifetime_gifts       1.599e-04  3.721e-04   0.430  0.66743    
## largest_gift        -1.773e-03  3.091e-03  -0.574  0.56629    
## last_gift            9.923e-03  7.562e-03   1.312  0.18945    
## months_since_donate  5.922e-02  1.003e-02   5.906 3.51e-09 ***
## time_lag            -6.174e-03  6.789e-03  -0.909  0.36311    
## avg_gift             7.539e-03  1.106e-02   0.682  0.49526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4062.0  on 2979  degrees of freedom
## AIC: 4104
## 
## Number of Fisher Scoring iterations: 12

Comments: From the full model we can see that since only 3 variables were significant: months_since_donate, num_child, and income, this might indicate that we have some colinearity that we should get rid of first.

glm.fit.full = glm(target ~ . -zipconvert3, data = fundraise_data, family = binomial)

vif.glm.full = vif(glm.fit.full)

vif.glm.full

##         zipconvert2         zipconvert4         zipconvert5           homeowner 
##            1.754935            1.726092            2.386175            1.133302 
##           num_child              income              female              wealth 
##            1.026263            1.312074            1.016380            1.521926 
##          home_value         med_fam_inc         avg_fam_inc           pct_lt15k 
##            3.283724           18.773974           21.009871            2.100204 
##            num_prom      lifetime_gifts        largest_gift           last_gift 
##            1.962842            2.137244            2.104440            3.950453 
## months_since_donate            time_lag            avg_gift 
##            1.132120            1.037998            4.235878

Comments: We can see that med_fam_inc and avg_fam_inc like we saw in the previous exploratory analysis are highly correlated and one of them should go. Before deciding which one of them should. I will run a random forest model (again, full model) and do the Variable Importance to see which one is more important.

Variable Importance

I will use the full data set for this process.

set.seed(12345)
forest1 = train(target ~ ., 
               data = fundraise_data, 
               method = "rf", 
               importance = TRUE, 
               trControl = trainControl(method = "cv", number = 10))

forest1

## Random Forest 
## 
## 3000 samples
##   20 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.5363245  0.07269785
##   11    0.5279945  0.05599943
##   20    0.5323245  0.06466458
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Comments: Here we can see the accuracy of the model going down from 2 predictors, to 13, and then with 24 predictors. When running the Random Forest with no tuning or without any cross-validation method, the highest accuracy rate we get is: 54.13% which is really low, but we will run a preliminary variable analysis.

Variable Importance Plot

plot(varImp(forest1))

Comment: From the variable importance plot, we can see that the variables that we should for sure keep are: months_since_donate and largest_gift. Also from the VIF test that we got colinearity between med_fam_inc and avg_fam_inc, according to this plot, we should keep med_fam_inc.

Methodology

Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).

set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train.df = fundraise_data[train_index, ]
test.df = fundraise_data[-train_index, ]

Models

Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.

Logistic Regression

First I will be creating a full model just like in the feature selection.

glm.fit.null = glm(target ~ 1, data = fundraise_data, family = binomial)
glm.fit.full = glm(target ~ ., data = fundraise_data, family = binomial)

Stepwise AIC Selection

step.AIC = step(glm.fit.null, scope = list(upper=glm.fit.full),
     direction ="both",test ="Chisq", trace = F)

summary(step.AIC)

## 
## Call:
## glm(formula = target ~ months_since_donate + last_gift + income + 
##     num_child + num_prom, family = binomial, data = fundraise_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9735  -1.1544   0.6177   1.1603   1.7317  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.908023   0.368624  -5.176 2.27e-07 ***
## months_since_donate  0.059997   0.009927   6.044 1.51e-09 ***
## last_gift            0.012379   0.003914   3.163  0.00156 ** 
## income              -0.073489   0.022964  -3.200  0.00137 ** 
## num_child            0.283879   0.112839   2.516  0.01188 *  
## num_prom            -0.002907   0.001711  -1.699  0.08937 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4076.8  on 2994  degrees of freedom
## AIC: 4088.8
## 
## Number of Fisher Scoring iterations: 4

Train Logistic Regression Model

We will train and test our Logistic Regression based on the AIC selected model.

glm.fit = glm(target ~ months_since_donate + last_gift + income + 
    num_child + num_prom, family = binomial, data = train.df)

Predictions on the Test data set

glm.probs = predict.glm(step.AIC, newdata = test.df, type="response")
glm.preds = as.factor(ifelse(glm.probs >= 0.5, "Donor", "No Donor"))

confusionMatrix(test.df$target, glm.preds)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      118      172
##   No Donor   164      146
##                                           
##                Accuracy : 0.44            
##                  95% CI : (0.3998, 0.4808)
##     No Information Rate : 0.53            
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : -0.1222         
##                                           
##  Mcnemar's Test P-Value : 0.7025          
##                                           
##             Sensitivity : 0.4184          
##             Specificity : 0.4591          
##          Pos Pred Value : 0.4069          
##          Neg Pred Value : 0.4710          
##              Prevalence : 0.4700          
##          Detection Rate : 0.1967          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.4388          
##                                           
##        'Positive' Class : Donor           
##

Comments: Our test accuracy for the variables selected by logistic regression using AIC method is 44.00%.

Random Forest

For the random forest we will be using the “repeatedcv” approach.

set.seed(12345)
train_control = trainControl(method="repeatedcv", number=10,repeats=3)

best.rf = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
                data = train.df,
                method = 'rf',
                trControl = train_control,
                importance = TRUE)

pred.best.rf = predict(best.rf, test.df)
confusionMatrix(pred.best.rf, test.df$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      167      145
##   No Donor   123      165
##                                           
##                Accuracy : 0.5533          
##                  95% CI : (0.5125, 0.5936)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.03938         
##                                           
##                   Kappa : 0.1079          
##                                           
##  Mcnemar's Test P-Value : 0.19957         
##                                           
##             Sensitivity : 0.5759          
##             Specificity : 0.5323          
##          Pos Pred Value : 0.5353          
##          Neg Pred Value : 0.5729          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2783          
##    Detection Prevalence : 0.5200          
##       Balanced Accuracy : 0.5541          
##                                           
##        'Positive' Class : Donor           
##

Comments: For the Random Forest model the accuracy was 55.33% which was better than logistic regression. Lets try K-nearest neighbor for our next model.

KNN (K-Nearest Neighbors)

For this model we will create a train_ctrl variable, for the cross-validation.

set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)

knn.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
                data=train.df,
                method='knn',
                trControl = train_ctrl,
                tuneLength=20)

pred.probs.knn = predict(knn.fit, test.df)

confusionMatrix(as.factor(pred.probs.knn), test.df$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      179      152
##   No Donor   111      158
##                                           
##                Accuracy : 0.5617          
##                  95% CI : (0.5209, 0.6018)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.01509         
##                                           
##                   Kappa : 0.1263          
##                                           
##  Mcnemar's Test P-Value : 0.01364         
##                                           
##             Sensitivity : 0.6172          
##             Specificity : 0.5097          
##          Pos Pred Value : 0.5408          
##          Neg Pred Value : 0.5874          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2983          
##    Detection Prevalence : 0.5517          
##       Balanced Accuracy : 0.5635          
##                                           
##        'Positive' Class : Donor           
##

Comment: For K-nearest neighbor the accuracy was 56.17% which was better than logistic regression and random forest.

Naive Bayes

We will be performing Naive Bayes next as it is said to give the best accuracy

set.seed(12345)
nb.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom, 
                data = train.df, method = "naive_bayes", 
                trControl = train_ctrl)
nb.fit

## Naive Bayes 
## 
## 2400 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.5491628  0.09476239
##    TRUE      0.5394503  0.07420510
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.

pred.probs.nb = predict(nb.fit, test.df)

confusionMatrix(as.factor(pred.probs.nb), test.df$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      236      227
##   No Donor    54       83
##                                           
##                Accuracy : 0.5317          
##                  95% CI : (0.4908, 0.5722)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.2438          
##                                           
##                   Kappa : 0.08            
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8138          
##             Specificity : 0.2677          
##          Pos Pred Value : 0.5097          
##          Neg Pred Value : 0.6058          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3933          
##    Detection Prevalence : 0.7717          
##       Balanced Accuracy : 0.5408          
##                                           
##        'Positive' Class : Donor           
##

Comments: For Naive Bayes the accuracy was 53.17%, which is lower than the accuracy we got for K-nearest neighbor.

Decision Trees

dt.fit = rpart(target~ months_since_donate + last_gift + income + num_child + num_prom, 
                data = train.df) 

rattle::fancyRpartPlot(dt.fit, sub = "")

pred.probs.dt = predict(dt.fit, test.df, type = "class")

confusionMatrix(as.factor(pred.probs.dt), test.df$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      193      191
##   No Donor    97      119
##                                           
##                Accuracy : 0.52            
##                  95% CI : (0.4792, 0.5606)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.4514          
##                                           
##                   Kappa : 0.0489          
##                                           
##  Mcnemar's Test P-Value : 4.251e-08       
##                                           
##             Sensitivity : 0.6655          
##             Specificity : 0.3839          
##          Pos Pred Value : 0.5026          
##          Neg Pred Value : 0.5509          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3217          
##    Detection Prevalence : 0.6400          
##       Balanced Accuracy : 0.5247          
##                                           
##        'Positive' Class : Donor           
##

Comments: For Naive Bayes the accuracy went down to 52%, which is lower than the accuracy we got for K-nearest neighbor and Naive Bayes.

Conclusion

The best model out of the ones we performed was K-nearest neighbor with a k-fold of 10. The accuracy on the test data was 56.17%. We also found that the most significant variables in predicting potential donors from the logistic regression model using AIC selection criteria include months_since_donate, last_gift, income, num_child and num_prom. In the next mail campaign, I would recommend targeting those who have donated more recently, the amount of their most recent gift, their household income, number of children they have and lifetime number of promotions they received till date.

Testing the future fundraising

Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.

set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)


knn.fit_final = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
               data = fundraise_data,
               method ='knn',
               trControl = train_ctrl,
               tuneLength = 20)

pred.knn_final = predict(knn.fit_final,future_fundraise_data)

write.table(pred.knn_final, file = "predictions_final.csv", col.names = c("value"), row.names = FALSE)