library(readr)
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(rpart) # DT
library(tidyverse)
library(tree)
library(Metrics)
library(caret)
library(car)
library(kernlab)
library(MASS)
library(ROCR)
library(performance)
library(randomForest)
library(corrplot)
library(ggcorrplot) # corr plot
library(reshape2) # heat map
library(GGally) # Pais ggplot2
library(kableExtra) # Kabble tables
library(e1071) # Tune SVM
library(broom) # Tidy function
library("viridis") # heat map
library(naivebayes)A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
Explore different models and find the best classification model that will lead to maximize the net profit. The objective is to develop a predictive model for national veterans’ organization. The model is to improve the cost effectiveness of their direct marketing campaign. This will predict which of their donors should receive a mail campaign letter.
The data was provided as 2 .rds files:
Fundraise_data: 3,000 observations and 20 predictor variables (6 factor and 14 numerical) and 1 response variable (Factor)
Future_fundraiser: 120 observations and 20 predictor variables, no response variable, since it is unknown.
fundraise_data = readRDS("fundraising.rds")
future_fundraise_data = readRDS("future_fundraising.rds")fundraise_data = as.data.frame(fundraise_data)str(fundraise_data)## 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num 698 828 1471 547 482 ...
## $ med_fam_inc : num 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
summary(fundraise_data)## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Comments: From the summary above we can do some pre-analysis, firstly by seeing that our target variable is balanced. We can get a glance that there the variables such as: lifetime_gifts, largest_gift, last_gift and avg_gift are skewed and might be significant, it will be important to do some exploratory analysis on them, and later to see if they are significant by running the models.
tbl_donor = table(fundraise_data$target)
cbind(tbl_donor, round(prop.table(tbl_donor)*100,2))## tbl_donor
## Donor 1499 49.97
## No Donor 1501 50.03
Comments: The data is balanced, with an almost 50/50 split as we can see from the last table, which will definitely be an advantage for training purposes of our models.
fundraise_data$zipconvert2 = as.factor(fundraise_data$zipconvert2)We will start our analysis with a heatmap to help us find any possible correlation between our numeric variables.
cormat = round(cor(fundraise_data[,-c(1:5,8,21)]),2)
melted_cormat = melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_fill_viridis(option = "A") Comments: The first thing that stands out is the pct_lt15k variable, which is the one with the lowest correlation with multiple variables. The variables that we can see are strongly correlated between each other are: avg_fam_inc and med_fam_inc (for obvious reasons, since one is the calculation of the other), however, they are also correlated with home_value, income and wealth. When it comes to the gifts section, the ones correlated are last_gift with avg_gift, and also life_time with num_prom and largest_gift.
From the Heat map we got some interesting insights that deserve to be looked at with more detail. We will use the pairwise plots matrix, to analyze them. The variables that we will look at, are the ones that showed the highest correlation in the previous heat map.
ggpairs(fundraise_data[, c(10, 11, 12, 15, 17, 20)], ggplot2::aes(colour = fundraise_data$target)) Comments: We can confirm from the pairs plots what we saw in the pre-analysis, where the variables lifetime_gifts, last_gift and avg_gift are skewed. We can also confirm the Correlation between avg_fam_inc and med_fam_inc , but also with the target (response variable).
Now, for us to check the categorical varaibles, we will use some box plots and bar plots. For that I need to revert the one-hot-encoding process done with the zipconvert column, also I found that we have 4 observations that are not part of any zipconvert category.
x = ifelse(fundraise_data$zipconvert2 =="No",
ifelse(fundraise_data$zipconvert3 =="No",
ifelse(fundraise_data$zipconvert4 == "No",
ifelse(fundraise_data$zipconvert5 == "No", 0, "zipconvert5"), "zipconvert4"), "zipconvert3"), "zipconvert2")
x = as.factor(x)
x = droplevels(x)
table(x) # to verify if my reverse one-hot-encoding matches the summary() results## x
## 0 zipconvert2 zipconvert3 zipconvert4 zipconvert5
## 4 648 551 643 1154
# mutate the column just created with my data frame
fundraise_data = fundraise_data %>%
mutate(zipconvertAll = ifelse(fundraise_data$zipconvert2 =="No",
ifelse(fundraise_data$zipconvert3 =="No",
ifelse(fundraise_data$zipconvert4 == "No",
ifelse(fundraise_data$zipconvert5 == "No", 0, "zipconvert5"), "zipconvert4"),
"zipconvert3"), "zipconvert2"))
fundraise_data$zipconvertAll = as.factor(fundraise_data$zipconvertAll)Bar Plot
Just to see how the Donors and No Donor are distribuited throughout the zip codes.
ggplot(fundraise_data, aes(x = zipconvertAll, fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1))Comments: From the bar plot we can see that the Donors and Non-donors are equally distributed. And the only zipcode that has more donors than non-donors is ZipConvert5.
The last variable that is worth taking a look is Gender (Female).
ggplot(fundraise_data, aes(x = ifelse(female=="Yes", "Female", "Male"), fill = target)) +
geom_bar(position = position_dodge()) +
geom_text(stat = "count", aes(label = ..count..), size = 3, position = position_dodge(width = 1)) +
labs(x = "Donors by Gender")Comments: From here we can see that women are more likely to donate than males. And also we see that our database of “members”, has more females than males.
Logistic Regression - Significance and VIF
Usually for my projects I like to run the basic models (either Linear Regression for continuous responses or Logistic Regression for Binomial responses). These models give a very good first glance at which variables are significant, and can help us when we compare that information VS the Random Forest Variable Importance.
For reporting purposes, I won’t show my full model, since only 3 variables were significant: months_since_donate, num_child, and income, this might indicate that we have some colinearity that we should get rid of first.
After running the full model for the first time, I deleted both zipconvertAll and zipconvert3, because adding all the zipconvert columns would create a very high VIF coefficient.
glm.fit.full = glm(target ~ . -zipconvertAll -zipconvert3, data = fundraise_data, family = binomial)
vif.glm.full = vif(glm.fit.full)
vif.glm.full %>%
kable(col.names = c("VIF coefficient")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| VIF coefficient | |
|---|---|
| zipconvert2 | 1.754935 |
| zipconvert4 | 1.726092 |
| zipconvert5 | 2.386175 |
| homeowner | 1.133303 |
| num_child | 1.026263 |
| income | 1.312074 |
| female | 1.016380 |
| wealth | 1.521926 |
| home_value | 3.283724 |
| med_fam_inc | 18.773974 |
| avg_fam_inc | 21.009871 |
| pct_lt15k | 2.100204 |
| num_prom | 1.962842 |
| lifetime_gifts | 2.137244 |
| largest_gift | 2.104440 |
| last_gift | 3.950453 |
| months_since_donate | 1.132120 |
| time_lag | 1.037998 |
| avg_gift | 4.235878 |
Comments: We can see that med_fam_inc and avg_fam_inc like we saw in the previous exploratory analysis are highly correlated and one of them should go. Before deciding which one of them should. I will run a random forest model (again, full model) and do the Variable Importance to see which one is more important.
I will use the full data set for this process.
set.seed(12345)
forest1 = train(target ~ . -zipconvertAll -time_lag,
data = fundraise_data,
method = "rf",
importance = TRUE,
trControl = trainControl(method = "cv", number = 10))
forest1## Random Forest
##
## 3000 samples
## 21 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5320012 0.06404249
## 10 0.5313334 0.06268251
## 19 0.5243356 0.04868397
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Comments: Here we can see the accuracy of the model going down from 2 predictors, to 13, and then with 24 predictors. When running the Random Forest with no tuning or without any cross-validation method, the highest accuracy rate we get is: 54.13%. Really low, but we will run a preliminary Variable Importance analysis.
plot(forest1, m.target = "NULL",
plots.one.page = TRUE, sorted = TRUE, verbose = F)Variable Importance Plot
From the variable importance plot, we can see that the variables that we should for sure keep are: months_since_donate and largest_gift. Also from the VIF test that we got colinearity between med_fam_inc and avg_fam_inc, according to this plot, we should keep med_fam_inc.
plot(varImp(forest1))Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
set.seed(12345)
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train.df = fundraise_data[train_index, ]
test.df = fundraise_data[-train_index, ]Step 2: Model Building. Follow the following steps to build, evaluate, and choose a model.
I will start with the full model like in the Feature selection, as well I am excluding the new variable I created (-zipconvertAll).
glm.fit.null = glm(target ~ 1, data = fundraise_data, family = binomial)
glm.fit.full = glm(target ~ . -zipconvertAll, data = fundraise_data, family = binomial)step.AIC = step(glm.fit.null, scope = list(upper=glm.fit.full),
direction ="both",test ="Chisq", trace = F)
summary(step.AIC)##
## Call:
## glm(formula = target ~ months_since_donate + last_gift + income +
## num_child + num_prom, family = binomial, data = fundraise_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9735 -1.1544 0.6177 1.1603 1.7317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.908023 0.368624 -5.176 2.27e-07 ***
## months_since_donate 0.059997 0.009927 6.044 1.51e-09 ***
## last_gift 0.012379 0.003914 3.163 0.00156 **
## income -0.073489 0.022964 -3.200 0.00137 **
## num_child 0.283879 0.112839 2.516 0.01188 *
## num_prom -0.002907 0.001711 -1.699 0.08937 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4076.8 on 2994 degrees of freedom
## AIC: 4088.8
##
## Number of Fisher Scoring iterations: 4
Train Logistic Regression Model
We will train and test our Logistic Regression based on the AIC selected model.
glm.fit = glm(target ~ months_since_donate + last_gift + income +
num_child + num_prom, family = binomial, data = train.df)Predictions on the Test data set
glm.probs = predict.glm(step.AIC, newdata = test.df, type="response")
glm.preds = as.factor(ifelse(glm.probs >= 0.5, "Donor", "No Donor"))confusionMatrix(test.df$target, glm.preds)## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 118 172
## No Donor 164 146
##
## Accuracy : 0.44
## 95% CI : (0.3998, 0.4808)
## No Information Rate : 0.53
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : -0.1222
##
## Mcnemar's Test P-Value : 0.7025
##
## Sensitivity : 0.4184
## Specificity : 0.4591
## Pos Pred Value : 0.4069
## Neg Pred Value : 0.4710
## Prevalence : 0.4700
## Detection Rate : 0.1967
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.4388
##
## 'Positive' Class : Donor
##
Comments: Our test accuracy for the variables selected by the AIC method is 44.00%.
Repeated CV Random Forest
set.seed(12345)
train_control = trainControl(method="repeatedcv", number=10,repeats=3)
best.rf = train(target ~ months_since_donate + last_gift + income + num_child + num_prom,
data = train.df,
method = 'rf',
trControl = train_control,
importance = TRUE)pred.best.rf = predict(best.rf, test.df)
confusionMatrix(pred.best.rf, test.df$target)## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 167 145
## No Donor 123 165
##
## Accuracy : 0.5533
## 95% CI : (0.5125, 0.5936)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.03938
##
## Kappa : 0.1079
##
## Mcnemar's Test P-Value : 0.19957
##
## Sensitivity : 0.5759
## Specificity : 0.5323
## Pos Pred Value : 0.5353
## Neg Pred Value : 0.5729
## Prevalence : 0.4833
## Detection Rate : 0.2783
## Detection Prevalence : 0.5200
## Balanced Accuracy : 0.5541
##
## 'Positive' Class : Donor
##
For this model we will create a train_ctrl variable, for the cross-validation.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)knn.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
data=train.df,
method='knn',
trControl = train_ctrl,
tuneLength=20)
pred.probs.knn = predict(knn.fit, test.df)
confusionMatrix(as.factor(pred.probs.knn), test.df$target, positive = 'Donor')## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 179 152
## No Donor 111 158
##
## Accuracy : 0.5617
## 95% CI : (0.5209, 0.6018)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.01509
##
## Kappa : 0.1263
##
## Mcnemar's Test P-Value : 0.01364
##
## Sensitivity : 0.6172
## Specificity : 0.5097
## Pos Pred Value : 0.5408
## Neg Pred Value : 0.5874
## Prevalence : 0.4833
## Detection Rate : 0.2983
## Detection Prevalence : 0.5517
## Balanced Accuracy : 0.5635
##
## 'Positive' Class : Donor
##
Comments: Accuracy for K-Nearest Neighbor is the best one so far.
The last model I will perform is the Naive Bayes, since it has been proven that it has a great performance.
set.seed(12345)
nb.fit = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
data = train.df, method = "naive_bayes",
trControl = train_ctrl)
nb.fit## Naive Bayes
##
## 2400 samples
## 5 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.5491628 0.09476239
## TRUE 0.5394503 0.07420510
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
## and adjust = 1.
pred.probs.nb = predict(nb.fit, test.df)
confusionMatrix(as.factor(pred.probs.nb), test.df$target, positive = 'Donor')## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 236 227
## No Donor 54 83
##
## Accuracy : 0.5317
## 95% CI : (0.4908, 0.5722)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.2438
##
## Kappa : 0.08
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8138
## Specificity : 0.2677
## Pos Pred Value : 0.5097
## Neg Pred Value : 0.6058
## Prevalence : 0.4833
## Detection Rate : 0.3933
## Detection Prevalence : 0.7717
## Balanced Accuracy : 0.5408
##
## 'Positive' Class : Donor
##
Comments: Accuracy for Naive Bayes is lower than K-Nearest Neighbor, here is 53.17%.
accuracy.all = data.frame(0.44,0.5533,0.5617,0.5317)
colnames(accuracy.all) <- c("Logistic Regression","Random Forest","K-Nearest Neighbor","Naive Bayes")
accuracy.all %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover")) | Logistic Regression | Random Forest | K-Nearest Neighbor | Naive Bayes |
|---|---|---|---|
| 0.44 | 0.5533 | 0.5617 | 0.5317 |
The best model we could find was the K-Nearest Neighbor with a k-fold of 10. The accuracy was 56.17% in the Test data.
Important factors to consider when targeting people who may be a donor are the number of months from the last donation to July 2018, the dollar amount of the largest gift to date, number of children, dollar amount of most recent gift, percent earnings that are less than $15k in potential donor’s neighborhood, household income, and wealth ratings.
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)knn.fit.final = train(target~ months_since_donate + last_gift + income + num_child + num_prom,
data=fundraise_data,
method='knn',
trControl = train_ctrl,
tuneLength=20)
pred.probs.knn = predict(knn.fit.final, future_fundraise_data)write.table(pred.probs.knn, file = "predictions_knn_final.csv", col.names = c("value"), row.names = FALSE)