The Problem

Fundraising campaigns often involve casting a very wide net with a very low gift rate. They frequently mean reaching out to a pool of tens of thousands of people, where a gift rate of 3-4% can be considered a good haul. For nonprofits with limited resources, being able to better target donors can have significant benefits: not just bringing in more donations, but also applying those limited resources more effectively by not spending money contacting likely non-donors.

The Data

We will be using the donors dataset from Datacamp, which is available here.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
donors <- read_csv("https://assets.datacamp.com/production/repositories/718/datasets/9055dac929e4515286728a2a5dae9f25f0e4eff6/donors.csv") %>%
  mutate(donated = as.factor(donated))
## Rows: 93462 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (3): recency, frequency, money
## dbl (10): donated, veteran, bad_address, age, has_children, wealth_rating, i...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before we get started, I want to split the data into a train set and a test set, to imitate a scenario where we train a model on the data we have but test it on unseen data. We do this below.

set.seed(100)
# Get the number of observations
n_obs <- nrow(donors)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data
donors_shuffled <- donors[permuted_rows,]

# Identify row to split on: split
split <- round(n_obs * 0.6)

# Create train
donors_train <- donors_shuffled[1:split,]

# Create test
donors_test <- donors_shuffled[(split + 1):nrow(donors_shuffled),]

# Check the share of donated in each set
mean(as.numeric(donors_train$donated))
## [1] 1.050716
mean(as.numeric(donors_test$donated))
## [1] 1.04994

When converted to a numeric variable, the donated factor becomes a vector of 1s and 2s, with 1 indicating no donation and 2 indicating a donation. A mean of 1.050716 therefore corresponds to a gift rate of about 5.07%, and a mean of 1.04994 to a gift rate of about 4.99%. We can see that both our train and test data sets have about the same ratio of donations to non-donations. Rare events are often difficult to model, and this may prove to be a challenge. Now that we have our data split into a train and test set, we will treat the train data as though it were the only data we had access to.
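For reference, the overview of the training data shown below was presumably produced along these lines (the original chunk is not shown; skimr is assumed to be installed for the skim summary).

# Dimensions, a base summary, and a skimr summary of the training data
dim(donors_train)
summary(donors_train)
skimr::skim(donors_train)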

## [1] 56077    13
##  donated      veteran          bad_address           age         has_children  
##  0:53233   Min.   :0.000000   Min.   :0.00000   Min.   : 1.00   Min.   :0.000  
##  1: 2844   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:48.00   1st Qu.:0.000  
##            Median :0.000000   Median :0.00000   Median :62.00   Median :0.000  
##            Mean   :0.001123   Mean   :0.01473   Mean   :61.61   Mean   :0.133  
##            3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:75.00   3rd Qu.:0.000  
##            Max.   :1.000000   Max.   :1.00000   Max.   :98.00   Max.   :1.000  
##                                                 NA's   :13599                  
##  wealth_rating   interest_veterans interest_religion   pet_owner     
##  Min.   :0.000   Min.   :0.0000    Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.000   Median :0.0000    Median :0.00000   Median :0.0000  
##  Mean   :1.143   Mean   :0.1102    Mean   :0.09348   Mean   :0.1524  
##  3rd Qu.:2.000   3rd Qu.:0.0000    3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :3.000   Max.   :1.0000    Max.   :1.00000   Max.   :1.0000  
##                                                                      
##  catalog_shopper     recency           frequency            money          
##  Min.   :0.00000   Length:56077       Length:56077       Length:56077      
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.08265                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
## 
Data summary
Name donors_train
Number of rows 56077
Number of columns 13
_______________________
Column type frequency:
character 3
factor 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
recency 0 1 6 7 0 2 0
frequency 0 1 8 10 0 2 0
money 0 1 4 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
donated 0 1 FALSE 2 0: 53233, 1: 2844

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
veteran 0 1.00 0.00 0.03 0 0 0 0 1 ▇▁▁▁▁
bad_address 0 1.00 0.01 0.12 0 0 0 0 1 ▇▁▁▁▁
age 13599 0.76 61.61 16.70 1 48 62 75 98 ▁▂▇▇▃
has_children 0 1.00 0.13 0.34 0 0 0 0 1 ▇▁▁▁▁
wealth_rating 0 1.00 1.14 1.22 0 0 1 2 3 ▇▂▁▃▃
interest_veterans 0 1.00 0.11 0.31 0 0 0 0 1 ▇▁▁▁▁
interest_religion 0 1.00 0.09 0.29 0 0 0 0 1 ▇▁▁▁▁
pet_owner 0 1.00 0.15 0.36 0 0 0 0 1 ▇▁▁▁▂
catalog_shopper 0 1.00 0.08 0.28 0 0 0 0 1 ▇▁▁▁▁

We have just over 56,000 observations of 13 variables. Three of those variables are character variables with two levels each. We have one continuous numeric variable, age, which also has missing values. We have another numeric variable, wealth_rating, that seems to be more of a factor variable. It is not clear what its values mean; presumably higher values indicate higher levels of wealth, but we do not know whether 0 indicates a low wealth rating or no information on that person. The rest of the variables appear to be indicator variables, with 1 indicating the presence of a characteristic and 0 its absence. Finally, our variable of interest is donated, another indicator variable that was converted to a factor when the data was read in. A further challenge in modeling this data is that there does not seem to be much variation in many of the predictor variables: very few people are flagged as veterans, as having children, as catalog shoppers, and so on. Unless these variables are very closely correlated with our variable of interest, this will likely prove to be a challenge during our modeling.

Pre-processing the data

One of the things we noticed in the data is that age is missing for a significant number of records. We will impute the median age into those missing values and create a new indicator variable telling us whether age was missing, so that if the data is missing not at random, the model has a way to catch and use that information.

The second thing we will do is re-scale the age variable. With most of the other variables on a 0-1 scale, the much greater variation in age would cause it to have a much larger impact on our model, not necessarily because it is a better predictor, but because the magnitude of its variability is so much greater than that of our other variables.

Finally, we will convert our character variables into dummy variables, turning them into indicator variables like the rest of our predictors.

##
## Pre process and prepare for modeling
##
median_age <- median(donors_train$age, na.rm = TRUE)
mean_age <- mean(donors_train$age, na.rm = TRUE)
sd_age <- sd(donors_train$age, na.rm = TRUE)

donors_proc <- donors_train %>%
  mutate(
    ## create a new column to indicate if age was missing
    missing_age = if_else(is.na(age), 1, 0),
    ## replace missing age values with the median age
    age = if_else(is.na(age), median_age, age)
    ) %>%
  ## I want to change our categorical variables to dummy variables ahead of time
  ## The modeling functions would do this anyway, but I just want to do it ahead of time
  fastDummies::dummy_cols(select_columns = c("recency", "frequency", "money", "wealth_rating")) %>%
  ## We remove the original variables now that we don't need them
  select(-c("recency", "frequency", "money", "wealth_rating")) %>%
  ## rescale and center the age column
  mutate(age = (age-mean_age)/sd_age) %>%
  filter(complete.cases(.))

Cluster analysis

The first step in our modeling journey is going to be cluster analysis. Due to limited computational power we will use k-means clustering only, rather than also comparing it to hierarchical clustering. This should still provide us with some useful clusters, which we can add to our data to hopefully improve our model.

Our first step will be to fit several different models, changing the centers setting (the number of clusters) and then comparing the results. We will create a scree plot of the model results and look for a clear elbow to determine how many clusters we should use, along the lines of the sketch below.
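The chunk that produced the scree plot is not shown, so here is a minimal sketch of how it might look, assuming we cluster on the processed predictors in donors_proc (everything except donated):

set.seed(100)
features <- select(donors_proc, -donated)

# Total within-cluster sum of squares for k = 1 through 10
tot_withinss <- map_dbl(1:10, function(k) {
  kmeans(features, centers = k, nstart = 5)$tot.withinss
})

# Scree plot: look for the "elbow" where the curve stops dropping sharply
plot(1:10, tot_withinss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Total within-cluster SS")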

Looking at the scree plot we see a significant elbow at 5 centers and a slightly less obvious one at 8 centers. We want clusters that help us better predict whether or not someone donated, so we will try both models and see which one creates clusters that are more strongly associated with our variable of interest.
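The comparison below (the overall train gift rate, followed by the share of non-donors and donors within each cluster of the 8-cluster and 5-cluster models) was presumably produced with something like the following. model_8 is the object used later for prediction; clust_8 is assumed to be its cluster assignments.

set.seed(100)
model_5 <- kmeans(features, centers = 5, nstart = 5)
model_8 <- kmeans(features, centers = 8, nstart = 5)
clust_8 <- model_8$cluster

# Overall train gift rate, for comparison (1 = non-donor, 2 = donor)
mean(as.numeric(donors_proc$donated))

# Share of non-donors (row 1) and donors (row 2) within each cluster
prop.table(table(as.numeric(donors_proc$donated), model_8$cluster), margin = 2)
prop.table(table(as.numeric(donors_proc$donated), model_5$cluster), margin = 2)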

## [1] 1.050716
##    
##              1          2          3          4          5          6
##   1 0.93348657 0.93576771 0.93946964 0.96131659 0.94358013 0.96679816
##   2 0.06651343 0.06423229 0.06053036 0.03868341 0.05641987 0.03320184
##    
##              7          8
##   1 0.95181976 0.95524578
##   2 0.04818024 0.04475422
##    
##              1          2          3          4          5
##   1 0.93453051 0.95345940 0.96362293 0.95776177 0.93954116
##   2 0.06546949 0.04654060 0.03637707 0.04223823 0.06045884

The clusters that are most negatively and most positively correlated with our variable of interest (whether or not someone donated) are both in the model with 8 clusters. Several of the other clusters in this model also seem to be somewhat correlated with our variable of interest, so this may be our more helpful model. We will add the clusters from this model to our data to help us build a more predictive model, and we will convert them into dummy/indicator variables like our other variables.

##
## Add in the cluster we want to our data
##

donors_pro <- donors_proc %>%
  mutate(clust = clust_8) %>%
  fastDummies::dummy_cols(select_columns = "clust") %>%
  select(-clust)

Building and comparing models

Now we come to the point where we build our models and observe their performance. We will quickly build three different kinds of models and compare their performance. The first will be a simple glm model for predicting a binomial outcome (logistic regression). The other two will be a random forest model and a neural net model. We will not manually tune the parameters of the random forest or neural net; instead, we will train these models with caret and use the tuneLength argument to select the best parameters out of those tried. A hedged sketch of what those training calls might look like follows.
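The training chunk itself is not shown, so the sketch below is one plausible version under a few assumptions: the outcome levels are relabeled to valid R names (caret requires this when returning class probabilities), the pROC, randomForest, and nnet packages are installed, and the names model_glm, model_rf, and model_nn are placeholders reused in the later sketches.

library(caret)

# Relabel the outcome so caret can return class probabilities
train_data <- donors_pro %>%
  mutate(donated = factor(donated, levels = c(0, 1), labels = c("no", "yes")))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Logistic regression (glm with a binomial outcome)
model_glm <- train(donated ~ ., data = train_data,
                   method = "glm", family = "binomial",
                   trControl = ctrl, metric = "ROC")

# Random forest, letting caret pick mtry via tuneLength
model_rf <- train(donated ~ ., data = train_data,
                  method = "rf", trControl = ctrl,
                  metric = "ROC", tuneLength = 3)

# Single-hidden-layer neural net, tuning size and decay via tuneLength
model_nn <- train(donated ~ ., data = train_data,
                  method = "nnet", trControl = ctrl,
                  metric = "ROC", tuneLength = 3, trace = FALSE)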

We will measure our models' performance on the test data we separated out earlier. To get predictions on the test data, though, we need to pre-process it and run it through our cluster model so that it is in the same format as the data the models were trained on.

library(clue)

donors_test <- donors_test %>%
  mutate(
    ## create a new column to indicate if age was missing
    missing_age = if_else(is.na(age), 1, 0),
    ## replace missing age values with the median age
    age = if_else(is.na(age), median_age, age)
  ) %>%
  ## I want to change our categorical variables to dummy variables ahead of time
  ## The modeling functions would do this anyway, but I just want to do it ahead of time
  fastDummies::dummy_cols(select_columns = c("recency", "frequency", "money", "wealth_rating")) %>%
  ## We remove the original variables now that we don't need them
  select(-c("recency", "frequency", "money", "wealth_rating")) %>%
  ## rescale and center the age column
  mutate(age = (age-mean_age)/sd_age) %>%
  filter(complete.cases(.))

test_clust <- cl_predict(model_8, select(donors_test, -donated))

donors_test <- donors_test %>%
  mutate(clust = test_clust) %>%
  fastDummies::dummy_cols(select_columns = "clust") %>%
  select(-clust)

mean(as.numeric(donors_test$donated))
## [1] 1.04994

Now that we have the test data in the proper format, let's begin by building a glm model and measuring its performance; a sketch of the evaluation code is below.
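The evaluation chunk is not shown either. A minimal sketch, assuming the placeholder model objects from the training sketch above and an illustrative cutoff on the predicted donation probability, might look like this (the AUC line mimics the caTools::colAUC style of output seen below):

library(caret)
library(caTools)

# Predicted probability of donating on the processed test data
glm_probs <- predict(model_glm, newdata = donors_test, type = "prob")$yes

# Class predictions at an illustrative cutoff, then a confusion matrix
glm_class <- factor(if_else(glm_probs > 0.0635, 1, 0), levels = c(0, 1))
confusionMatrix(glm_class, donors_test$donated)

# Area under the ROC curve
colAUC(glm_probs, donors_test$donated)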

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 26930  1202
##          1  8588   665
##                                           
##                Accuracy : 0.7381          
##                  95% CI : (0.7336, 0.7426)
##     No Information Rate : 0.9501          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0398          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.75821         
##             Specificity : 0.35619         
##          Pos Pred Value : 0.95727         
##          Neg Pred Value : 0.07187         
##              Prevalence : 0.95006         
##          Detection Rate : 0.72034         
##    Detection Prevalence : 0.75249         
##       Balanced Accuracy : 0.55720         
##                                           
##        'Positive' Class : 0               
## 

##                 0         1
## 0 vs. 1 0.5906154 0.5906154

Our AUC for the glm model is 0.59. That is not a great AUC, but it is understandable considering the challenges of predicting a rare outcome with data that doesn't have a lot of variation. It does look like if we were to use a threshold of 0.0635 we would have a group of observations where about 7.7% are donors, higher than the approximately 4.9% in the test data. We will experiment more with our models' thresholds later.

Next, we will train a random forest model and see how it performs compared to the glm model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 35180  1856
##          1   338    11
##                                           
##                Accuracy : 0.9413          
##                  95% CI : (0.9389, 0.9437)
##     No Information Rate : 0.9501          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0059         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.990484        
##             Specificity : 0.005892        
##          Pos Pred Value : 0.949887        
##          Neg Pred Value : 0.031519        
##              Prevalence : 0.950060        
##          Detection Rate : 0.941019        
##    Detection Prevalence : 0.990665        
##       Balanced Accuracy : 0.498188        
##                                           
##        'Positive' Class : 0               
## 

##                 0         1
## 0 vs. 1 0.5015712 0.5015712

The AUC for the random forest model is 0.50, significantly lower than the glm model and hardly better than a random guess. If we were to use a threshold of 0.004 to select a universe, we would have a gift rate of about 3.3%, even lower than the entire test data set. It doesn't look like the random forest does very well, but we will see if it improves when we experiment with the threshold later.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 26871  1204
##          1  8647   663
##                                         
##                Accuracy : 0.7365        
##                  95% CI : (0.732, 0.741)
##     No Information Rate : 0.9501        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.0387        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.75655       
##             Specificity : 0.35512       
##          Pos Pred Value : 0.95711       
##          Neg Pred Value : 0.07121       
##              Prevalence : 0.95006       
##          Detection Rate : 0.71876       
##    Detection Prevalence : 0.75097       
##       Balanced Accuracy : 0.55583       
##                                         
##        'Positive' Class : 0             
## 

##                 0         1
## 0 vs. 1 0.5909916 0.5909916

The neural net model's performance is about equal to the glm model's: its AUC is also 0.59, and with a threshold of 0.0639 we would have a universe with a gift rate of 7.7%, essentially matching the glm model. This model seems much more promising than the random forest model.

As a business problem, though, we are trying to improve a fundraising campaign with these models, and there are multiple ways we could do that. The approach we will focus on here is using the predicted probability of donating to create a universe that is sufficiently large and improves on the expected gift rate of the entire data set. There are other ways to use these models to improve fundraising performance, and those will be mentioned in the conclusion, but this is the approach we will use for this example. We will define a sufficiently large universe as at least 5,000 people; in other words, we want a threshold that gives us at least 5,000 people. Most campaigns will likely want larger universes than this, but for the purpose of this example we will use 5,000.

We will begin with the least promising model, the random forest. We will experiment with different thresholds on the predicted probability of donating to get a universe of at least 5,000 people and try to maximize the gift rate we would get from that universe. A sketch of the threshold sweep is shown below.
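The tables that follow were presumably generated with something along these lines; the helper function and the threshold grid are illustrative rather than the exact original code.

# For a grid of cutoffs, record the size of the selected universe and the
# share of actual donors within it (the "gift rate")
sweep_thresholds <- function(probs, actual, thresholds) {
  data.frame(
    threshold = thresholds,
    gift_rate = sapply(thresholds, function(t) mean(actual[probs >= t] == "1")),
    universe_size = sapply(thresholds, function(t) sum(probs >= t))
  )
}

rf_probs <- predict(model_rf, newdata = donors_test, type = "prob")$yes
sweep_thresholds(rf_probs, donors_test$donated,
                 seq(0.001, 0.134, length.out = 20))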

##    threshold  gift_rate universe_size
## 1      0.001 0.04858593          1446
## 2      0.008 0.03370787           184
## 3      0.015 0.03846154           108
## 4      0.022 0.03921569            53
## 5      0.029 0.05882353            36
## 6      0.036 0.05555556            19
## 7      0.043 0.00000000            15
## 8      0.050 0.00000000             7
## 9      0.057 0.00000000             4
## 10     0.064 0.00000000             4
## 11     0.071 0.00000000             2
## 12     0.078 0.00000000             1
## 13     0.085 0.00000000             1
## 14     0.092 0.00000000             1
## 15     0.099 0.00000000             1
## 16     0.106 0.00000000             1
## 17     0.113 0.00000000             1
## 18     0.120 0.00000000             1
## 19     0.127 0.00000000             1
## 20     0.134        NaN             0

As suspected, the random forest model never really meets our criteria. Even casting a broad net with a very low threshold for the predicted likelihood of donating, we don't get a universe large enough to meet our criteria. Our largest universe is 1,446 people, and the only universes with higher gift rates than the whole test data contain only 36 and 19 people. We would likely pass on this model.

Now let’s take a look at the neural net model and its results.

##     threshold  gift_rate universe_size
## 1  0.04580000 0.06743760         18646
## 2  0.04822632 0.06938470         17154
## 3  0.05065263 0.07059916         16135
## 4  0.05307895 0.07147843         15350
## 5  0.05550526 0.07262731         14828
## 6  0.05793158 0.07339175         14333
## 7  0.06035789 0.07402971         13449
## 8  0.06278421 0.07421166         12434
## 9  0.06521053 0.07704807          7954
## 10 0.06763684 0.07768855          6173
## 11 0.07006316 0.08028595          3929
## 12 0.07248947 0.08867610          2615
## 13 0.07491579 0.08408617          1560
## 14 0.07734211 0.09631728           774
## 15 0.07976842 0.10236220           420
## 16 0.08219474 0.10439560           201
## 17 0.08462105 0.08955224            73
## 18 0.08704737 0.04347826            24
## 19 0.08947368 0.00000000             5
## 20 0.09190000        NaN             0

This looks much more promising than our random forest model. Even our widest net, with a universe of 18,646, has a gift rate of 6.7%, higher than the entire test dataset's 4.9%. Our highest gift rate with a universe of at least 5,000 people is 7.8%, an increase of 2.9 percentage points, or about a 59% increase. If we wanted to maximize the gift rate with this model while keeping a decent-sized universe, we might set the threshold at 0.0676. If we had a much larger dataset to pull from, where we would not need to worry about a minimum universe size because we could draw from millions of records rather than the approximately 37,000 in our test dataset, we might push the gift rate even higher and use a threshold of 0.0822, which gave us a gift rate of 10.4% on our test dataset. We would have multiple ways to significantly improve fundraising performance with this model.

Finally, let’s see how our glm model performs.

##     threshold  gift_rate universe_size
## 1  0.04720000 0.06721921         18671
## 2  0.04956316 0.06900396         17258
## 3  0.05192632 0.07043181         16262
## 4  0.05428947 0.07133394         15364
## 5  0.05665263 0.07236652         14863
## 6  0.05901579 0.07365153         13834
## 7  0.06137895 0.07364656         13208
## 8  0.06374211 0.07766990          8991
## 9  0.06610526 0.07670283          7145
## 10 0.06846842 0.07799564          4948
## 11 0.07083158 0.08729282          2952
## 12 0.07319474 0.08695652          2000
## 13 0.07555789 0.08261803          1009
## 14 0.07792105 0.10105263           523
## 15 0.08028421 0.10288066           268
## 16 0.08264737 0.07865169            96
## 17 0.08501053 0.08333333            39
## 18 0.08737368 0.00000000            15
## 19 0.08973684 0.00000000             3
## 20 0.09210000        NaN             0

The results for the glm model are fairly similar to our neural net model's, just very slightly less helpful. Our largest universe still gives us a gift rate of 6.7%. The highest gift rate we get with a universe still over 5,000 people is 7.8%, and the highest gift rate we get regardless of universe size is 10.3%.

Conclusion

Based on the criteria we set earlier (maximizing the expected gift rate while keeping a universe of at least 5,000 people) and on how our models performed, we would choose the neural net model with a threshold of 0.0676.

There may be other ways we could use these models. Each model also provides an expected likelihood that someone does not donate. We might decide that instead of maximizing the gift rate for our campaign, we want to make sure we include everyone who will donate in our universe; rather than targeting those modeled as most likely to give, we could instead remove those modeled as most likely not to give. A sketch of that alternative is below. Either way, these models have the potential to bring significant benefits to the nonprofit using them. Increasing the gift rate by 59% would mean a significant improvement in the efficiency of the campaign. With this kind of improvement, a nonprofit would see more donations come in while spending less money on contacting fewer people. That kind of increased efficiency would be incredibly valuable for organizations that often have limited resources and rely on the generosity of others.
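As a rough illustration of that alternative, again assuming the placeholder model_nn from the earlier sketches, we could drop the records the model is most confident will not donate rather than keeping only those most likely to donate; the 20% cut here is purely illustrative.

# Exclude the 20% of test records with the highest predicted probability
# of NOT donating, then check the gift rate in what remains
nn_prob_no <- predict(model_nn, newdata = donors_test, type = "prob")$no
trimmed_universe <- donors_test[nn_prob_no < quantile(nn_prob_no, 0.80), ]
mean(trimmed_universe$donated == "1")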