Fundraising campaigns often involve casting a very wide net with a very low gift rate. They frequently reach out to a pool of tens of thousands of people, and a gift rate of 3-4% can be considered a good haul. For non-profits with limited resources, being able to better target donors can have significant benefits, not just by bringing in more donations, but also by applying those limited resources more efficiently and not spending money contacting non-donors.
We will be using the donors dataset from DataCamp, which is available here.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
donors <- read_csv("https://assets.datacamp.com/production/repositories/718/datasets/9055dac929e4515286728a2a5dae9f25f0e4eff6/donors.csv") %>%
mutate(donated = as.factor(donated))
## Rows: 93462 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): recency, frequency, money
## dbl (10): donated, veteran, bad_address, age, has_children, wealth_rating, i...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we get started, I want to split the data into a training set and a test set, to imitate a scenario where we train a model on the data we have but test it on unseen data. We do this below.
set.seed(100)
# Get the number of observations
n_obs <- nrow(donors)
# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)
# Randomly order data: donors_shuffled
donors_shuffled <- donors[permuted_rows,]
# Identify row to split on: split
split <- round(n_obs * 0.6)
# Create train
donors_train <- donors_shuffled[1:split,]
# Create test
donors_test <- donors_shuffled[(split + 1):nrow(donors_shuffled),]
# Check the donation rate in each set
mean(as.numeric(donors_train$donated))
## [1] 1.050716
mean(as.numeric(donors_test$donated))
## [1] 1.04994
When converted to a numeric variable, the donated variable becomes a vector of 1s and 2s, with 1 indicating no donation and 2 indicating a donation. A mean of 1.050716 therefore means a gift rate of about 5.07%, and a mean of 1.04994 means a gift rate of about 4.99%. We can see that both our train and test data sets have about the same ratio of donations to non-donations. Rare events are often difficult to model, and this may prove to be a challenge. Now that we have our data split into a train and test set, we will observe the train data as though it were the only data we had access to.
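The exploratory output below is consistent with a quick look at the dimensions, a base summary, and a skim of the training data. The original code chunk is not echoed, so take this as a sketch; the use of skimr is an assumption based on the format of the tables further down.
## Dimensions, base summary, and a skimr overview of the training data
dim(donors_train)
summary(donors_train)
skimr::skim(donors_train)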
## [1] 56077 13
## donated veteran bad_address age has_children
## 0:53233 Min. :0.000000 Min. :0.00000 Min. : 1.00 Min. :0.000
## 1: 2844 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:48.00 1st Qu.:0.000
## Median :0.000000 Median :0.00000 Median :62.00 Median :0.000
## Mean :0.001123 Mean :0.01473 Mean :61.61 Mean :0.133
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:75.00 3rd Qu.:0.000
## Max. :1.000000 Max. :1.00000 Max. :98.00 Max. :1.000
## NA's :13599
## wealth_rating interest_veterans interest_religion pet_owner
## Min. :0.000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :1.143 Mean :0.1102 Mean :0.09348 Mean :0.1524
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :3.000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## catalog_shopper recency frequency money
## Min. :0.00000 Length:56077 Length:56077 Length:56077
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.08265
## 3rd Qu.:0.00000
## Max. :1.00000
##
| Name | donors_train |
| Number of rows | 56077 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| recency | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
| frequency | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
| money | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| donated | 0 | 1 | FALSE | 2 | 0: 53233, 1: 2844 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| veteran | 0 | 1.00 | 0.00 | 0.03 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| bad_address | 0 | 1.00 | 0.01 | 0.12 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| age | 13599 | 0.76 | 61.61 | 16.70 | 1 | 48 | 62 | 75 | 98 | ▁▂▇▇▃ |
| has_children | 0 | 1.00 | 0.13 | 0.34 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| wealth_rating | 0 | 1.00 | 1.14 | 1.22 | 0 | 0 | 1 | 2 | 3 | ▇▂▁▃▃ |
| interest_veterans | 0 | 1.00 | 0.11 | 0.31 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| interest_religion | 0 | 1.00 | 0.09 | 0.29 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| pet_owner | 0 | 1.00 | 0.15 | 0.36 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| catalog_shopper | 0 | 1.00 | 0.08 | 0.28 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
We have just over 56,000 observations of 13 variables. Three of those variables are character variables with two levels each. We have one numeric variable, age, which also has missing values. We have another numeric variable, wealth_rating, that seems to be more of a factor variable. It is not clear what its values mean; presumably higher values indicate higher levels of wealth, but we do not know whether 0 indicates a low wealth rating or no information on that person. The rest of the variables appear to be indicator variables, with a 1 indicating the presence of a characteristic and a 0 indicating its absence. Finally, our variable of interest is the donated variable, another indicator variable that was converted to a factor when the data was read in. A further challenge in modeling this data is that there does not seem to be much variation in many of the predictor variables either: very few people are flagged as veterans, as having children, as catalog shoppers, and so on. Unless these variables are very closely correlated with our variable of interest, this will likely prove to be a challenge during our modeling.
One of the things we noticed in the data is that age is missing for a significant number of records. We will impute the median age into those missing values and create a new indicator variable that tells us whether the age was missing. That way, if the data is missing not at random, the model has a way to catch that and use the information.
The second thing we will do is re-scale the age variable. With most of the other variables on a 0-1 scale, the much greater variation in age would give it a much larger impact on our model, not necessarily because it is a better predictor, but simply because the magnitude of its variability is so much greater than that of our other variables.
Finally, we will convert our character variables into dummy variables, turning each level into an indicator variable like our other variables.
##
## Pre process and prepare for modeling
##
median_age <- median(donors_train$age, na.rm = TRUE)
mean_age <- mean(donors_train$age, na.rm = TRUE)
sd_age <- sd(donors_train$age, na.rm = TRUE)
donors_proc <- donors_train %>%
mutate(
## create a new column to indicate if age was missing
missing_age = if_else(is.na(age), 1, 0),
## replace missing age values with the median age
age = if_else(is.na(age), median_age, age)
) %>%
## I want to change our categorical variables to dummy variables ahead of time
## The modeling functions would do this anyway, but I just want to do it ahead of time
fastDummies::dummy_cols(select_columns = c("recency", "frequency", "money", "wealth_rating")) %>%
## We remove the original variables now that we don't need them
select(-c("recency", "frequency", "money", "wealth_rating")) %>%
## rescale and center the age column
mutate(age = (age-mean_age)/sd_age) %>%
filter(complete.cases(.))
The first step in our modeling journey is going to be cluster analysis. Due to limited computational power we will use k-means clustering only, rather than also comparing it with hierarchical clustering. This should still provide us with some good clusters which we can add to our data to hopefully improve our model.
To start, we will fit several different models, changing the centers setting (the number of clusters) and then comparing the results. We will create a scree plot of the model results and look for a good elbow to determine how many clusters we should have.
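The clustering chunk is not echoed, so here is a minimal sketch of how this step could look. The names cluster_data, model_list, and tot_withinss are our own, and the choice of 1 through 10 centers with nstart = 20 is an assumption.
## Cluster on everything except the outcome we are trying to predict
cluster_data <- select(donors_proc, -donated)
## Fit k-means for 1 through 10 centers and record the total within-cluster sum of squares
model_list <- map(1:10, ~ kmeans(cluster_data, centers = .x, nstart = 20))
tot_withinss <- map_dbl(model_list, "tot.withinss")
## Scree plot: look for an elbow where adding more clusters stops helping much
tibble(k = 1:10, tot_withinss = tot_withinss) %>%
  ggplot(aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10)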
Looking at the scree plot, we see a significant elbow at 5 centers and a slightly less obvious one at 8 centers. We want to create clusters that will help us better predict whether or not someone donated, so we will try both models and see which one creates clusters that are more strongly associated with our variable of interest.
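The per-cluster donation rates shown below can be computed with prop.table(). This is a sketch that assumes model_5 and model_8 are the 5- and 8-cluster fits pulled from model_list above; clust_8 is the name the next chunk expects for the 8-cluster assignments.
## Pull out the two candidate models and keep the 8-cluster assignments for later
model_5 <- model_list[[5]]
model_8 <- model_list[[8]]
clust_8 <- model_8$cluster
## Overall donation rate (on the 1/2 numeric scale), then the rate within each cluster
mean(as.numeric(donors_proc$donated))
prop.table(table(as.numeric(donors_proc$donated), model_8$cluster), margin = 2)
prop.table(table(as.numeric(donors_proc$donated), model_5$cluster), margin = 2)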
## [1] 1.050716
##
## 1 2 3 4 5 6
## 1 0.93348657 0.93576771 0.93946964 0.96131659 0.94358013 0.96679816
## 2 0.06651343 0.06423229 0.06053036 0.03868341 0.05641987 0.03320184
##
## 7 8
## 1 0.95181976 0.95524578
## 2 0.04818024 0.04475422
##
## 1 2 3 4 5
## 1 0.93453051 0.95345940 0.96362293 0.95776177 0.93954116
## 2 0.06546949 0.04654060 0.03637707 0.04223823 0.06045884
The clusters that are most negatively and most positively associated with our variable of interest (whether or not someone donated) are both in the model with 8 clusters. Several of the other clusters in this model also seem to be somewhat correlated with our variable of interest, so this may be our most helpful model. We will add the clusters from this model to our data to help us build a more predictive model, and we will make them dummy/indicator variables like our other variables.
##
## Add in the cluster we want to our data
##
donors_pro <- donors_proc %>%
mutate(clust = clust_8) %>%
fastDummies::dummy_cols(select_columns = "clust") %>%
select(-clust)
Now we come to the point where we build our models and observe their performance. We will quickly build 3 different kinds of models and compare their performance. The first will be a simple glm model for predicting a binomial outcome. The other two will be random forest and neural net models. We will not be manually tuning the parameters of the random forest or neural net models. Instead, we will train them with caret and use the tuneLength argument to select the best parameters out of those tried.
We will be measuring our models' performance on the test data we separated out earlier. To get predictions on the test data, though, we will need to pre-process it and run it through our cluster model so that it is in the same format as the data we will train our models on.
library(clue)
donors_test <- donors_test %>%
mutate(
## create a new column to indicate if age was missing
missing_age = if_else(is.na(age), 1, 0),
## replace missing age values with the median age
age = if_else(is.na(age), median_age, age)
) %>%
## I want to change our categorical variables to dummy variables ahead of time
## The modeling functions would do this anyway, but I just want to do it ahead of time
fastDummies::dummy_cols(select_columns = c("recency", "frequency", "money", "wealth_rating")) %>%
## We remove the original variables now that we don't need them
select(-c("recency", "frequency", "money", "wealth_rating")) %>%
## rescale and center the age column
mutate(age = (age-mean_age)/sd_age) %>%
filter(complete.cases(.))
test_clust <- cl_predict(model_8, select(donors_test, -donated))
donors_test <- donors_test %>%
mutate(clust = test_clust) %>%
fastDummies::dummy_cols(select_columns = "clust") %>%
select(-clust)
mean(as.numeric(donors_test$donated))
## [1] 1.04994
Now that we have the test data in the proper format, let's begin by building a glm model and measuring its performance.
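The glm chunk is not echoed, so here is a minimal sketch. The formula donated ~ ., the 0.05 classification threshold, and the use of caret::confusionMatrix() and pROC for the AUC are all assumptions rather than the post's exact code.
## Fit a logistic regression on the processed training data
glm_model <- glm(donated ~ ., data = donors_pro, family = "binomial")
## Predicted probability of donating on the processed test data
glm_prob <- predict(glm_model, newdata = donors_test, type = "response")
## Classify with an assumed threshold and compare against the true outcome
glm_pred <- factor(if_else(glm_prob > 0.05, 1, 0), levels = c(0, 1))
caret::confusionMatrix(glm_pred, donors_test$donated)
## AUC; the original output may have been produced with a different package
pROC::auc(pROC::roc(donors_test$donated, glm_prob))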
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26930 1202
## 1 8588 665
##
## Accuracy : 0.7381
## 95% CI : (0.7336, 0.7426)
## No Information Rate : 0.9501
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0398
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75821
## Specificity : 0.35619
## Pos Pred Value : 0.95727
## Neg Pred Value : 0.07187
## Prevalence : 0.95006
## Detection Rate : 0.72034
## Detection Prevalence : 0.75249
## Balanced Accuracy : 0.55720
##
## 'Positive' Class : 0
##
## 0 1
## 0 vs. 1 0.5906154 0.5906154
The AUC for the glm model is 0.59. That is not a great AUC, but it is reasonable considering the challenges of predicting a rare outcome with data that doesn't have a lot of variation. It does look like if we were to use a threshold of 0.0635 we would have a group of observations where about 7.7% are donors, higher than the approximately 4.9% in the test data. We will experiment more with the thresholds of our models later.
Next, we will train a random forest model and see how it performs compared to the glm model.
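A sketch of the random forest step, assuming caret with 5-fold cross-validation; method = "rf" and the trControl settings are assumptions, while tuneLength comes from the approach described earlier.
library(caret)
## Train a random forest, letting caret pick mtry from tuneLength candidate values
rf_model <- train(
  donated ~ .,
  data = donors_pro,
  method = "rf",
  tuneLength = 3,
  trControl = trainControl(method = "cv", number = 5)
)
## Class predictions and predicted probability of donating on the test data
rf_pred <- predict(rf_model, newdata = donors_test)
rf_prob <- predict(rf_model, newdata = donors_test, type = "prob")[, 2]
confusionMatrix(rf_pred, donors_test$donated)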
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35180 1856
## 1 338 11
##
## Accuracy : 0.9413
## 95% CI : (0.9389, 0.9437)
## No Information Rate : 0.9501
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0059
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.990484
## Specificity : 0.005892
## Pos Pred Value : 0.949887
## Neg Pred Value : 0.031519
## Prevalence : 0.950060
## Detection Rate : 0.941019
## Detection Prevalence : 0.990665
## Balanced Accuracy : 0.498188
##
## 'Positive' Class : 0
##
## 0 1
## 0 vs. 1 0.5015712 0.5015712
The AUC for the random forest model is 0.50, significantly lower than the glm model and hardly any better than a random guess. If we were to use a threshold of 0.004 to select a universe we would have a gift rate of about 3.3%, even lower than that of the entire test data set. It doesn't look like the random forest does very well, but we will see if it improves when we play with the threshold again. Next, let's see how a neural net model compares.
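As with the random forest, here is a sketch of how the neural net could be trained; method = "nnet", the cross-validation settings, and trace = FALSE are assumptions.
## Train a single-hidden-layer neural net, letting caret pick size and decay
nnet_model <- train(
  donated ~ .,
  data = donors_pro,
  method = "nnet",
  tuneLength = 3,
  trControl = trainControl(method = "cv", number = 5),
  trace = FALSE ## suppress nnet's iteration log
)
## Class predictions and predicted probability of donating on the test data
nnet_pred <- predict(nnet_model, newdata = donors_test)
nnet_prob <- predict(nnet_model, newdata = donors_test, type = "prob")[, 2]
confusionMatrix(nnet_pred, donors_test$donated)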
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26871 1204
## 1 8647 663
##
## Accuracy : 0.7365
## 95% CI : (0.732, 0.741)
## No Information Rate : 0.9501
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0387
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75655
## Specificity : 0.35512
## Pos Pred Value : 0.95711
## Neg Pred Value : 0.07121
## Prevalence : 0.95006
## Detection Rate : 0.71876
## Detection Prevalence : 0.75097
## Balanced Accuracy : 0.55583
##
## 'Positive' Class : 0
##
## 0 1
## 0 vs. 1 0.5909916 0.5909916
The neural net model's performance is about equal to the glm model's. The AUC for the neural net model is 0.59, and with a threshold of 0.0639 we would have a universe with a gift rate of 7.7% (the AUC is essentially identical to the glm model's, and the gift rate is slightly lower). This model seems much more promising than the random forest model.
As a business problem, though, we would be trying to improve a fundraising campaign with these models, and there are multiple ways we can do that. The approach we will focus on right now is using the predicted probability of donating to create a universe that is sufficiently large and improves on the expected gift rate of the entire data set. There are other ways to use these models to improve fundraising performance, and those will be mentioned in the conclusion, but this is the approach we will take for this example. We will define a sufficiently large universe as 5,000 people; this means that, using our model, we want a threshold that gives us at least 5,000 people. Most campaigns will likely want larger universes than this, but for the purpose of this example we will use 5,000.
We will begin with the least promising model, the random forest model. We will experiment with different thresholds on the predicted probability of donating to get a universe of at least 5,000 and try to maximize the gift rate we would get from that universe.
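The threshold tables that follow can be generated with a small helper like the one sketched here; the function name, the grid of 20 thresholds spanning the range of predicted probabilities, and the rf_prob / nnet_prob / glm_prob vectors from the sketches above are our own rather than the post's exact code.
## For each threshold, keep everyone whose predicted probability of donating exceeds it,
## then report the gift rate and the size of that universe
sweep_thresholds <- function(prob, actual, n = 20) {
  thresholds <- seq(min(prob), max(prob), length.out = n)
  map_dfr(thresholds, function(t) {
    selected <- prob > t
    tibble(
      threshold = t,
      gift_rate = mean(actual[selected] == "1"),
      universe_size = sum(selected)
    )
  })
}
sweep_thresholds(rf_prob, donors_test$donated)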
## threshold gift_rate universe_size
## 1 0.001 0.04858593 1446
## 2 0.008 0.03370787 184
## 3 0.015 0.03846154 108
## 4 0.022 0.03921569 53
## 5 0.029 0.05882353 36
## 6 0.036 0.05555556 19
## 7 0.043 0.00000000 15
## 8 0.050 0.00000000 7
## 9 0.057 0.00000000 4
## 10 0.064 0.00000000 4
## 11 0.071 0.00000000 2
## 12 0.078 0.00000000 1
## 13 0.085 0.00000000 1
## 14 0.092 0.00000000 1
## 15 0.099 0.00000000 1
## 16 0.106 0.00000000 1
## 17 0.113 0.00000000 1
## 18 0.120 0.00000000 1
## 19 0.127 0.00000000 1
## 20 0.134 NaN 0
As suspected, the random forest model never really meets our criteria. Even casting a broad net, using a very low threshold for the predicted likelihood of donating, we don't get a universe large enough. Our largest universe is 1,446, and the only universes with higher gift rates than the whole test data contain only 36 and 19 people. We would likely pass on this model.
Now let’s take a look at the neural net model and its results.
## threshold gift_rate universe_size
## 1 0.04580000 0.06743760 18646
## 2 0.04822632 0.06938470 17154
## 3 0.05065263 0.07059916 16135
## 4 0.05307895 0.07147843 15350
## 5 0.05550526 0.07262731 14828
## 6 0.05793158 0.07339175 14333
## 7 0.06035789 0.07402971 13449
## 8 0.06278421 0.07421166 12434
## 9 0.06521053 0.07704807 7954
## 10 0.06763684 0.07768855 6173
## 11 0.07006316 0.08028595 3929
## 12 0.07248947 0.08867610 2615
## 13 0.07491579 0.08408617 1560
## 14 0.07734211 0.09631728 774
## 15 0.07976842 0.10236220 420
## 16 0.08219474 0.10439560 201
## 17 0.08462105 0.08955224 73
## 18 0.08704737 0.04347826 24
## 19 0.08947368 0.00000000 5
## 20 0.09190000 NaN 0
This looks much more promising than our random forest model. Even our widest net, with a universe of 18,646, has a gift rate of 6.7%, higher than the entire test dataset's 4.9%. Our highest gift rate with a universe of at least 5,000 people is 7.8%, an increase of 2.9 percentage points, or about a 59% increase. If we wanted to maximize the gift rate with this model while keeping a decent-sized universe, we might set the threshold at 0.0676. If we knew we had a much larger dataset to pull from, where we would not need to worry as much about a minimum universe size because we could pull from millions of records as opposed to the approximately 37,000 in our test dataset, we might push the gift rate even higher and use a threshold of 0.0822, which gave us a gift rate of 10.4% with our test dataset. We would have multiple ways to significantly improve fundraising performance with this model.
Finally, let’s see how our glm model performs.
## threshold gift_rate universe_size
## 1 0.04720000 0.06721921 18671
## 2 0.04956316 0.06900396 17258
## 3 0.05192632 0.07043181 16262
## 4 0.05428947 0.07133394 15364
## 5 0.05665263 0.07236652 14863
## 6 0.05901579 0.07365153 13834
## 7 0.06137895 0.07364656 13208
## 8 0.06374211 0.07766990 8991
## 9 0.06610526 0.07670283 7145
## 10 0.06846842 0.07799564 4948
## 11 0.07083158 0.08729282 2952
## 12 0.07319474 0.08695652 2000
## 13 0.07555789 0.08261803 1009
## 14 0.07792105 0.10105263 523
## 15 0.08028421 0.10288066 268
## 16 0.08264737 0.07865169 96
## 17 0.08501053 0.08333333 39
## 18 0.08737368 0.00000000 15
## 19 0.08973684 0.00000000 3
## 20 0.09210000 NaN 0
The results for the glm model are pretty similar to the neural net model's, just very slightly less helpful. Our largest universe still gives us a gift rate of 6.7%. The highest gift rate we get with a universe still over 5,000 people is 7.8%, and the highest gift rate we get regardless of universe size is 10.3%.
Based on the criteria we set before (maximizing the expected gift rate while keeping a universe of at least 5,000 people), we would choose the neural net model with a threshold of 0.0676.
There may be other ways we could use these models. Each model also provides an expected likelihood that someone will not donate. We may decide that, instead of maximizing the gift rate for our campaign, we want to make sure we include everyone who will donate in our universe; rather than targeting those modeled most likely to give, we could instead remove those modeled most likely not to give. Either way, a model like this has the potential to bring significant benefits to the non-profit using it. Increasing the gift rate by 59% would mean a significant improvement in the efficiency of the campaign. With this kind of improvement, a non-profit would see more donations come in while spending less money contacting fewer people. This kind of increased efficiency would be incredibly valuable for organizations that often have limited resources and rely on the generosity of others.