Future Fundraising Predictions

Background

A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we will use a sample of this data set to develop a classification model that can effectively capture donors so that the expected net profit is maximized.

Data Overview

Our data set was kindly provided by The American Legion. It contains 3,000 records with 20 attributes consisting of donor’s demographic data and information related to their previous donations. Our goal is to predict which of these donors should receive a mail campaign letter using a second subset of the data that contains the attributes of 120 future mailing candidates.

Below is a brief summary of the numeric and categorical variables in our sample. There were three numeric variables that consisted of a discrete set of ordinal values so these were recoded as categorical. Before we begin our analysis, let’s take a closer look at our target variable.

library(readr)
library(tidyverse)
library(caret)
library(randomForest)
library(reshape2)
library(formattable)
library(DT)
library(skimr)
library(plotROC)
library(ggstatsplot)
library(cowplot)
library(kableExtra)
library(gridExtra)
library(scales)
theme_set(theme_minimal())

fund <- read_rds("/Users/Kelli/Documents/Hw/STA6543/fundraising.rds")
fund.test <- read_rds("/Users/Kelli/Documents/Hw/STA6543/future_fundraising.rds")

fund$num_child <- as.factor(fund$num_child)
fund$income <- as.factor(fund$income)
fund$wealth <- as.factor(fund$wealth)
fund.test$num_child <- as.factor(fund.test$num_child)
fund.test$income <- as.factor(fund.test$income)
fund.test$wealth <- as.factor(fund.test$wealth)

skim_df <- skim_with(
  numeric = sfl(Mean = mean, Median = median, Min = min, Max = max, 
                "1st Quartile" = ~ quantile(., probs = .25), IQR = IQR,
                "3rd Quartile" = ~ quantile(., probs = .75)), append = F)

skim_df(fund) %>% 
  select(-complete_rate) %>% 
  yank("numeric") %>% 
  datatable(colnames = c("Variable" = 2, "Missing Observations" = 3),
    caption = htmltools::tags$caption(style = "text-align: center;",
                                      htmltools::tags$b("Summary of Numeric Variables")),
    options = list(columnDefs = list(list(className = "dt-center", targets = 2:9)), pageLength = 10)) %>% 
  formatRound(columns = c(3:9), 0)

skim_without_charts(fund) %>% 
  select(-complete_rate) %>% 
  yank("factor") %>% 
  datatable(colnames = c("Variable", "Missing Observations", "Ordered", "Unique Levels", "Top Counts"),
    caption = htmltools::tags$caption(style = "text-align: center;",
                                      htmltools::tags$b("Summary of Categorical Variables")),
    options = list(columnDefs = list(list(className = "dt-center", targets = 2:4)), pageLength = 10))

Target Variable

In the original data set, only about 5% were donors. Since class imbalance can negatively affect the precision and recall accuracy of statistical models, weighted sampling was used to under-represent the non-responders so that the data has an equal number of donors and non-donors. The overall statistics of our target variable show that donors have an average gift of $10.10, slightly less than those who did not respond in the last campaign, but have donated a higher amount in total. Both groups have a similar average family income, while donors have a higher home value on average.

fund %>% 
  group_by(target) %>% 
  rename("Target" = target) %>% 
  summarise("Number of Observations" = n(),
            "Average Gift" = round(mean(avg_gift), 2),
            "Lifetime Gift" = round(mean(lifetime_gifts), 2),
            "Months Since Donation" = round(mean(months_since_donate), 0),
            "Average Family Income" = round(mean(avg_fam_inc)*100, 2), 
            "Home Value" = round(mean(home_value)*100, 2)) %>% 
  formattable(align = c("l","c","c","c","c","c","c"), 
              list(`Number of Observations` = comma_format(accuracy = 1),
                   area(col = c(3,4,6,7)) ~ function(x) currency(x)))

Target	Number of Observations	Average Gift	Lifetime Gift	Months Since Donation	Average Family Income	Home Value
Donor	1,499	$10.10	$113.67	31	$43,284.72	$116,381.52
No Donor	1,501	$11.23	$107.81	32	$43,177.42	$112,274.82

Exploratory Analysis

Numeric Variable Distributions

fund2 <- melt(fund[,c(10:21)], id.var = "target")
myplots <- list() 
for(i in unique(fund2$variable)){
  
  fund.subset <- subset(fund2, variable == i)
  
  g1 <- ggplot(fund.subset, aes(x = value)) +
    geom_histogram(aes(y = ..density..), position = "identity", bins = 15, color = "grey30", fill = "white") + 
    geom_density(color = "grey30", aes(fill = target), alpha = 0.6, show.legend = F) +
    labs(x = i, y = "Density", fill = "") 
  
  g2 <- ggplot(fund.subset, aes(y = value)) +
    geom_boxplot(aes(x = target, fill = target), color = "grey30", alpha = 0.6, show.legend = F) +
    labs(x = "", y = i)
  
  myplots[[i]] <- arrangeGrob(g1, g2, ncol = 2, top = paste("\n Histogram and Box Plot of", i))
}
plot_grid(plotlist = myplots, ncol = 2)

The majority of the numeric variables have right-skewed distributions, except for months_since_donate, which has a bimodal distribution that is skewed to the left. The distributions between the groups appear to be very similar to each other in the box plots. Given the overall skewness and outliers in the data, transforming some of our features may provide a better fit.

Categorical Variable Distributions

Based on the categorical distributions, just over half of women are donors while slightly less than half of men are donors. For homeowners, slightly more than half are donors. Overall, there are similar proportions of donors in the categorical variables.

melt(fund[,c(8,5,7,6,9,21)], id.var = "target") %>% 
  group_by(variable, value, target) %>% 
  tally() %>% 
  mutate(percent = n/sum(n)) %>% 
  ggplot(aes(y = value, x = percent, fill = fct_rev(target))) +
  geom_bar(stat = "identity", color = "grey20", alpha = 0.6) +
  geom_text(aes(label = percent(percent, accuracy = 0.1)), position = position_stack(vjust = 0.5)) +
  scale_x_continuous(labels = percent) +
  facet_wrap(~variable, scales = "free", ncol = 2) +
  scale_fill_discrete(direction = -1, breaks = c("Donor", "No Donor")) +
  labs(title = "Distributions by Category", x = "\n Percentage of Donors", y = "Values", fill="") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "top", legend.justification = "left")

What is the average donation and gender composition of donors?

g1 <- ggplot(fund, aes(x = ifelse(female=="Yes", "Female", "Male"), y = largest_gift, fill = female)) +
  geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = FALSE) +
  stat_summary(aes(label = dollar(..y..)), fun = mean, geom = "text", size = 3.5, vjust = -0.5) +
  scale_y_continuous(labels = dollar_format()) +
  coord_cartesian(ylim = c(0, 20)) +
  labs(subtitle = "Largest Donation by Gender", x = "", y = "") +
  theme(plot.subtitle = element_text(hjust = 0.5))
  
g2 <- ggplot(fund, aes(x = ifelse(female=="Yes", "Female", "Male"), y = lifetime_gifts, fill = female)) +
  geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = FALSE) +
  stat_summary(aes(label = dollar(..y..)), fun = mean, geom = "text", size = 3.5, vjust = -0.5) +
  scale_y_continuous(labels = dollar_format()) +
  coord_cartesian(ylim = c(0, 125)) +
  labs(subtitle = "Total Donations by Gender", x = "", y = "") +
  theme(plot.subtitle = element_text(hjust = 0.5))
  
g3 <- ggplot(fund, aes(x = ifelse(female=="Yes", "Female", "Male"),  group = target)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
           alpha = 0.6, color = "grey20", show.legend = F) +
  geom_text(aes(label = percent(..prop..), y = ..prop.. ), size = 3.5, stat = "count", 
            position = position_stack(vjust = 0.5)) +
  facet_wrap(~target, ncol = 1) +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "", y = "\nPercentage of Donors \n")

grid.arrange(g1, g2, g3, ncol = 2, top = "Gender Composition of Donors \n", bottom = "Gender",  
             left = "Donation", layout_matrix = cbind(c(1,2), c(3,3)))

Based on the gender composition of donors, the largest donation for both men and women is a little under $17. On average, women have donated a higher total amount to date, and out of all donors, 62% are women while only 38% of men are.

Feature Selection

To determine which features might be good candidate predictors, I first checked for variables with near zero variance and used the chi-square test of independence to evaluate the relationships between our target variable and the categorical features. I then checked the correlations between the numeric variables and used a random forest model to identify the most important variables in predicting potential donors.

Near Zero Variance

Near zero variance occurs when a variable has a small number of unique categories with very few instances in them. There was only one variable in our data with near zero variance, num_child. Since predictors with near zero variance can cause a model to become unstable, I removed num_child from the data.

nearZeroVar(fund, saveMetrics = FALSE, names = TRUE)

## [1] "num_child"

fund$num_child <- NULL

Chi-Square Tests of Independence

The chi-square test of independence evaluates the association between two discrete variables, in this case between our target variable and the categorical features. The null hypothesis is that there is no relationship between the variables and the alternative hypothesis is that there is a relationship. The results show that the chi-square statistic is relatively small for each of the variables and the p-values are greater than the significance level of 0.05, indicating that there is not a statistically significant relationship between the categorical attributes and our target variable.

chi.square <- c()
p.value <- c()
cat.var <- fund[,sapply(fund, is.factor)] %>% 
  select(-target) 

for(i in unique(names(cat.var))){
  p.value[i] <- chisq.test(fund$target, unlist(cat.var[i]))$p.value
  chi.square[i] <- chisq.test(fund$target, unlist(cat.var[i]))$statistic
}

data.frame(chi.square, p.value) %>%
  rownames_to_column(var = "Variable") %>% 
  arrange(p.value) %>% 
  rename("Chi-Square Statistic" = chi.square, "P-Value" = p.value) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))

Variable	Chi-Square Statistic	P-Value
homeowner	2.00	0.16
female	1.74	0.19
zipconvert5	1.25	0.26
income	5.51	0.48
wealth	7.94	0.54
zipconvert3	0.30	0.58
zipconvert2	0.08	0.77
zipconvert4	0.06	0.80

Correlations

For the numeric variables, there are several that are strongly correlated with each other, med_fam_inc and avg_fam_inc have a correlation of 0.97, and avg_gift and last_gift have a correlation of about 0.9. To avoid any collinearity issues and to select which of these features to include in the model, I then checked the most important variables in predicting potential donors using the random forest algorithm.

ggcorrmat(fund, hc.order = T, title = "Correlations in the Fundraising Data Set",
          colors = rev(colorspace::diverge_hsv(n=3)))

Variable Importance

The random forest classifier uses two measures in calculating variable importance, mean decrease in Gini and mean decrease in accuracy. The mean decrease in Gini is the total decrease in node impurities from splitting on the variable measured by the Gini index, averaged over all trees in the model. The mean decrease in accuracy is determined by calculating the prediction error rate on the out-of-bag portion of the data for each tree. Then the same is done after permuting each predictor. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.

set.seed(12345)
rf.fit <- train(target ~ ., data = fund, method = "rf", importance = TRUE, 
                trControl = trainControl(method = "cv", number = 10))

df <- data.frame(importance(rf.fit$finalModel)) %>% 
  rownames_to_column(var = "var")

g1 <- df %>% 
  slice_max(MeanDecreaseGini, n = 10) %>% 
  ggplot(aes(x = MeanDecreaseGini, y = reorder(var, MeanDecreaseGini))) + 
  geom_point(color = "#1C9099", size = 1.5) +
  geom_segment(aes(x = 0, xend = MeanDecreaseGini, y = var, yend = var), color = "#1C9099") +
  labs(x = "\n Average Decrease in Gini", y = "Variable") +
  theme(panel.grid.minor.x = element_blank())

g2 <- df %>% 
  slice_max(MeanDecreaseAccuracy, n = 10) %>% 
  ggplot(aes(x = MeanDecreaseAccuracy, y = reorder(var, MeanDecreaseAccuracy))) + 
  geom_point(color = "#1C9099", size = 1.5) +
  geom_segment(aes(x = 0, xend = MeanDecreaseAccuracy, y = var, yend = var), color = "#1C9099") +
  labs(x = "\n Average Decrease in Accuracy", y = "") +
  theme(panel.grid.minor.x = element_blank())

grid.arrange(g1, g2, nrow = 1, top = "Most Important Variables in Predicting Potential Donors")

Out of the top ten features, both measures include avg_gift, home_value, months_since_donate, largest_gift, lifetime_gifts, and num_prom as the most important predictors. Based on the results from the random forest model, the chi-square tests, and the collinearity between the variables, I selected a small subset of features to include in the models, avg_gift, home_value, months_since_donate, and largest_gift. I also incorporated avg_fam_inc since it has a moderately strong positive correlation with home_value of 0.75.

Variable Transformations

Based on the distributions of the numeric variables, many of them are right-skewed. Since this can impact the effectiveness of learning algorithms, I decided to transform several of the features before fitting the models. When a variable has a distribution that is skewed to the right, it often indicates that it has a log-normal distribution, meaning that the log-transformed variable follows a normal distribution. Below we can see the log transformation of avg_gift and avg_fam_inc. Since this reduced the skewness in the data, I replaced these features with their log-transformed variable.

fund.transform <- fund %>% 
  mutate(avg_gift.log = log(abs(avg_gift)),
         avg_fam_inc.log = log(abs(avg_fam_inc))) %>% 
  mutate_if(is.numeric, ~ifelse(abs(.) == Inf, 0, .))

fund.test <- fund.test %>% 
  mutate(avg_gift.log = log(abs(avg_gift)),
         avg_fam_inc.log = log(abs(avg_fam_inc))) %>% 
  mutate_if(is.numeric, ~ifelse(abs(.) == Inf, 0, .))

g1 <- ggplot(fund.transform, aes(x = avg_gift)) +
    geom_histogram(aes(y = ..density..), position = "identity", bins = 15, color = "grey20", fill = "white") + 
    geom_density(color = "grey20", fill = "lightblue", alpha = 0.5) +
    labs(x = "Average Gift \n", y = "Density \n")

g2 <- ggplot(fund.transform, aes(x = avg_gift.log)) +
    geom_histogram(aes(y = ..density..), position = "identity", bins = 15, color = "grey20", fill = "white") + 
    geom_density(color = "grey20", fill = "lightblue", alpha = 0.5) +
    labs(x = "Log Transformation of Average Gift \n", y = "")

g3 <- ggplot(fund.transform, aes(x = avg_fam_inc)) +
    geom_histogram(aes(y = ..density..), position = "identity", bins = 15, color = "grey20", fill = "white") + 
    geom_density(color = "grey20", fill = "#72BCB8", alpha = 0.3) +
    labs(x = "Average Family Income", y = "Density \n")

g4 <- ggplot(fund.transform, aes(x = log(avg_fam_inc))) +
    geom_histogram(aes(y = ..density..), position = "identity", bins = 15, color = "grey20", fill = "white") + 
    geom_density(color = "grey20", fill = "#72BCB8", alpha = 0.3) +
    labs(x = "Log Transformation of Average Family Income", y = "")

grid.arrange(g1, g2, g3, g4, nrow = 2, 
             top = "Variable Transformation of Average Gift and Average Family Income \n")

Methodology

To predict potential donors, I applied several different classification models, including logistic regression, discriminant analysis, naive bayes, k-nearest neighbors, and stochastic gradient boosting. To estimate the out of sample error of each method, I used 10-fold cross-validation repeated 3 times and assessed the models based on accuracy and Area Under the Curve (AUC). The Area Under the ROC Curve measures the performance of a classifier across all possible decision thresholds. A higher AUC indicates that the model is better able to distinguish between the classes.

set.seed(12345)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

Logistic Regression

Logistic regression is a parametric classification technique that estimates the probability of an event occurring, for instance, whether or not a mailing candidate will be a donor. The logistic model is also able to provide insights into which features are the most influential based on the size of the coefficients and the significance of the predictors.

set.seed(12345)
glm.fit <- train(target ~ avg_gift.log + home_value + months_since_donate + largest_gift + avg_fam_inc.log,
                 data = fund.transform, method = "glm",
                 trControl = ctrl)
glm.fit

## Generalized Linear Model 
## 
## 3000 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5588827  0.1178156

The accuracy of the logistic model on the out of sample data is about 55.9%. To get a better idea of the prediction accuracy, I then plotted the ROC curve with the corresponding Area Under the Curve. The AUC is .589, which represents the probability that our model will rate or rank a randomly chosen observation from the positive class (a donor) as more likely to be a donor than a randomly chosen non-donor (Hanley & McNeil, 1982).

glm <- predict(glm.fit, fund.transform, type = "prob")[,1]

rocplot <- data.frame(fund.transform[,20], glm) %>% 
  ggplot(aes(d = ifelse(target == "Donor", 1, 0), m = glm)) +
  geom_roc(n.cuts = 0, color = "#6BB8B4") + 
  style_roc(xlab = "\n 1 - Specificity", ylab = "Sensitivity \n") +
  labs(title = "ROC Curve of Logistic Regression") +
  theme(plot.title = element_text(hjust = 0.5))

rocplot +
  geom_abline(size = 0.5, color = "grey30") +
  annotate("text", x = 0.7, y = 0.3,
           label = paste("Area Under the Curve =", round(calc_auc(rocplot)$AUC, 3)))

QDA

The second classification method I used is Quadratic Discriminant Analysis, which is a compromise between logistic regression and nonparametric methods. The QDA model allows for quadratic decision boundaries and can produce better results when the data is moderately non-linear.

set.seed(12345)
qda.fit <- train(target ~ avg_gift.log + home_value + months_since_donate + largest_gift + avg_fam_inc.log, 
                 data = fund.transform, method = "qda",
                 trControl = ctrl)
qda.fit

## Quadratic Discriminant Analysis 
## 
## 3000 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5191141  0.03794539

The cross-validation accuracy of the QDA model is 51.9% and the Area Under the ROC curve is 0.573, slightly lower than our logistic model.

qda <- predict(qda.fit, fund.transform, type = "prob")[,1]

rocplot <- data.frame(fund.transform[,20], qda) %>% 
  ggplot(aes(d = ifelse(target == "Donor", 1, 0), m = qda)) +
  geom_roc(n.cuts = 0, color = "#72BCB8") + 
  style_roc(xlab = "\n 1 - Specificity", ylab = "Sensitivity \n") +
  labs(title = "Quadratic Discriminant Analysis ROC Curve") +
  theme(plot.title = element_text(hjust = 0.5))

rocplot +
  geom_abline(size = 0.5, color = "grey30") +
  annotate("text", x = 0.7, y = 0.3,
           label = paste("Area Under the Curve =", round(calc_auc(rocplot)$AUC, 3)))

Naive Bayes

To improve the prediction accuracy, I then used a Naive Bayes classifier. The Naive Bayes model makes the assumption that for each class, the features are independent of each other, which allows the model to estimate individual class-conditional marginal densities (Hastie et al., 2009, p. 211). Our Naive Bayes classifier achieves the highest cross-validation accuracy using the Kernel Density Estimation function, a nonparametric technique in estimating probabilities.

set.seed(12345)
nb.fit <- train(target ~ avg_gift.log + home_value + months_since_donate + largest_gift + avg_fam_inc.log, 
                data = fund.transform, method = "naive_bayes", 
                trControl = ctrl)
nb.fit

## Naive Bayes 
## 
## 3000 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.5265611  0.05281751
##    TRUE      0.5597812  0.11954646
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

The Area Under the ROC Curve is .618 and the cross-validation accuracy is about 56%, an improvement from the previous two models.

nb <- predict(nb.fit, fund.transform, type = "prob")[,1]

rocplot <- data.frame(fund.transform[,20], nb) %>% 
  ggplot(aes(d = ifelse(target == "Donor", 1, 0), m = nb)) +
  geom_roc(n.cuts = 0, color = "#72BCB8") + 
  style_roc(xlab = "\n 1 - Specificity", ylab = "Sensitivity \n") +
  labs(title = "Naive Bayes ROC Curve") +
  theme(plot.title = element_text(hjust = 0.5))

rocplot +
  geom_abline(size = 0.5, color = "grey30") +
  annotate("text", x = 0.7, y = 0.3,
           label = paste("Area Under the Curve =", round(calc_auc(rocplot)$AUC, 3)))

K-Nearest Neighbors

K-Nearest Neighbors is a nonparametric technique, which means that it is more flexible than linear approaches. When our target variable is categorical, the KNN classifier identifies the $k$ nearest points in the training data and classifies the test observation to the most common class. In fitting the model, I tried different values of $k$ ranging from 2 to 8. Our KNN model has the highest cross-validation accuracy when $k = 5$ and an AUC of .763, higher than the previous three methods.

set.seed(12345)
knn.fit <- train(target ~ avg_gift.log + home_value + months_since_donate + largest_gift + avg_fam_inc.log, 
                 data = fund.transform, method = "knn",  
                 tuneGrid = expand.grid(k = seq(2, 8, 1)), trControl = ctrl)
knn.fit

## k-Nearest Neighbors 
## 
## 3000 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa     
##   2  0.5062263  0.01242937
##   3  0.5183463  0.03670301
##   4  0.5164537  0.03291692
##   5  0.5284500  0.05689872
##   6  0.5279986  0.05598705
##   7  0.5277808  0.05555414
##   8  0.5252282  0.05043942
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

knn <- predict(knn.fit, fund.transform, type = "prob")[,1]

rocplot <- data.frame(fund.transform[,20], knn) %>% 
  ggplot(aes(d = ifelse(target == "Donor", 1, 0), m = knn)) +
  geom_roc(n.cuts = 0, color = "#72BCB8") + 
  style_roc(xlab = "\n 1 - Specificity", ylab = "Sensitivity \n") +
  labs(title = "K-Nearest Neighbors ROC Curve") +
  theme(plot.title = element_text(hjust = 0.5))

rocplot +
  geom_abline(size = 0.5, color = "grey30") +
  annotate("text", x = 0.7, y = 0.3,
           label = paste("Area Under the Curve =", round(calc_auc(rocplot)$AUC, 3)))

value <- predict(knn.fit, fund.test)
write_csv(as.data.frame(value), file = "knn.csv", col_names = TRUE)

Stochastic Gradient Boosting

Since the non-parametric methods seemed to have a better prediction performance on our sample, I then tried a gradient boosting model. Gradient boosting is an additive ensemble technique similar to random forests except that the trees are grown sequentially. This allows the model to use the results from previous trees in growing new trees, and can result in a greater prediction accuracy. The GBM model produced the highest cross-validation accuracy out of all the methods at just over 56%, and an Area Under the ROC Curve of 0.61.

set.seed(12345)
gbm.fit <- train(target ~ avg_gift.log + home_value + months_since_donate + largest_gift + avg_fam_inc.log, 
                 data = fund.transform, method = "gbm", 
                 trControl = ctrl, verbose = FALSE)
gbm.fit

## Stochastic Gradient Boosting 
## 
## 3000 samples
##    5 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa     
##   1                   50      0.5601105  0.12017548
##   1                  100      0.5551112  0.11017880
##   1                  150      0.5526638  0.10526773
##   2                   50      0.5541079  0.10822108
##   2                  100      0.5520005  0.10400049
##   2                  150      0.5466664  0.09333109
##   3                   50      0.5453316  0.09062434
##   3                  100      0.5439979  0.08797845
##   3                  150      0.5386641  0.07732640
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 10.

gbm <- predict(gbm.fit, fund.transform, type = "prob")[,1]

rocplot <- data.frame(fund.transform[,20], gbm) %>% 
  ggplot(aes(d = ifelse(target == "Donor", 1, 0), m = gbm)) +
  geom_roc(n.cuts = 0, color = "#72BCB8") + 
  style_roc(xlab = "\n 1 - Specificity", ylab = "Sensitivity \n") +
  labs(title = "Stochastic Gradient Boosting ROC Curve") +
  theme(plot.title = element_text(hjust = 0.5))

rocplot +
  geom_abline(size = 0.5, color = "grey30") +
  annotate("text", x = 0.7, y = 0.3,
           label = paste("Area Under the Curve =", round(calc_auc(rocplot)$AUC, 3)))

value <- predict(gbm.fit, fund.test)
write_csv(as.data.frame(value), file = "gbm.csv", col_names = TRUE)

Key Findings

We know from the random forest model that the most important predictors of potential donors include avg_gift, home_value, months_since_donate, largest_gift, and avg_fam_inc. We can use our logistic regression model to quantify these relationships. For better interpretability, we will use the original variables in the model instead of the log-transformed features.

fund <- fund %>% 
  mutate(target = as.factor(ifelse(target == "Donor", 1, 0)))
glm.fit <- train(target ~ avg_gift + home_value + months_since_donate + largest_gift + avg_fam_inc,
                 data = fund, method = "glm", trControl = ctrl)
summary(glm.fit)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6616  -1.1762  -0.5693   1.1525   1.7495  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          2.092e+00  3.097e-01   6.754 1.44e-11 ***
## avg_gift            -1.870e-02  6.439e-03  -2.905  0.00367 ** 
## home_value           1.161e-04  5.958e-05   1.949  0.05129 .  
## months_since_donate -6.157e-02  9.535e-03  -6.457 1.07e-10 ***
## largest_gift         1.441e-03  2.290e-03   0.629  0.52921    
## avg_fam_inc         -3.079e-04  3.328e-04  -0.925  0.35475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4091.3  on 2994  degrees of freedom
## AIC: 4103.3
## 
## Number of Fisher Scoring iterations: 4

In the logistic model, three of the features are significant in predicting potential donors. A summary of the relationships of each, when all other variables are held constant, is listed below.

OR <- coef(glm.fit$finalModel) %>% exp() %>% round(digits = 4) %>% as.data.frame() %>% slice(-c(1,5,6))
data.frame(Predictor = c("Average Gift", "Home Value", "Months Since Donate"), 
           OddsRatio = OR[,1],
           Interpretation = c("For every $1 decrease in average donations, we expect to see about a 2% 
                              increase in the likelihood of donating.", 
                              "Higher home values increase the odds of being a donor by 1% for every 
                              $100 increase in home value, though this represents a weak association 
                              based on the odds ratio being very close to 1.", 
                              "For every one month since their last donation, 
                              the likelihood of being a donor decreases by about 6%.")) %>% 
  arrange(OddsRatio) %>% 
  rename("Odds Ratio" = OddsRatio) %>% 
  formattable(align = c("l", "c", "l"))

Predictor	Odds Ratio	Interpretation
Months Since Donate	0.9403	For every one month since their last donation, the likelihood of being a donor decreases by about 6%.
Average Gift	0.9815	For every $1 decrease in average donations, we expect to see about a 2% increase in the likelihood of donating.
Home Value	1.0001	Higher home values increase the odds of being a donor by 1% for every $100 increase in home value, though this represents a weak association based on the odds ratio being very close to 1.

Conclusion

Overall, the non-parametric methods had a better performance in predicting potential donors for the next mail campaign. The two models that stand out are the K-Nearest Neighbors classifier with the highest Area Under the Curve and the gradient boosting model with the highest cross-validation accuracy. On the future fundraising data set, the KNN model had the highest overall accuracy of 63.3% in predicting future donors. We also found that the most significant variables in predicting potential donors from the logistic regression model include months_since_donate, avg_gift, and home_value. In the next mail campaign, I would recommend targeting those who have donated more recently, have higher home values, and have slightly lower donations on average.

References

Hanley, J. A., & Mcneil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology, 143(1), 29-36. doi:10.1148/radiology.143.1.7063747

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer.