Final Project

Business Objectives and Goals
The veterans’ organization has a database of 13 million donors. The organization direct-mails fundraising and sees a 5.1% response rate with an average donation of $13.00. As part of the fundraising campaign, the organization sends personalized address labels and stationery to the possible donors which costs $0.68 per mailing. The business goal is to maximize cost-effectiveness of the veterans’ organization’s direct marketing campaigns The business objective is to create a classification model that can predict which donors will respond with a donation and should be marketed toward.

Data Sources and Data Used
A data sample is provided and has been generated with weighted sampling to fairly equally represent donors and non-donors. A breakdown of this donor or non-donor weight is described in the next section. Weighted sampling is important to guard against class imbalance as represented by the classification model. An imbalance can negatively impact or bias a model fitting if a particular class is more heavily represented. It would not be wise to use a simple random sample from the original dataset, because the organization’s total donor database is essentially 94.9% non-donor and 5.1% donor. There is a larger weight of non-donors overall and using weighted sampling while modeling prevents the model from returning unreliable results due to the skew.

Type of Analysis Performed: What, Why, Findings
A review of the summary data helps to see how the predictor data is distributed. There are 3,000 observations and 17 predictors in this set. The target variable is a binary indicator represented by two responses, “Donor” or “No Donor.” In total, the observations are fairly equally weighted in classification with 1,499 observations or 49.97% being “Donor” and 1,501 observations or 50.03% being “No Donor.” All predictors are quantitative except for homeowner, female, and target. Each predictor seems to be somewhat skewed except for target and has outliers.

A correlation matrix helps to see how the predictors relate to the response variable and to each other, and which predictors most heavily influence the model fit. Of the quantitative predictors, med_fam¬_inc and avg_fam_inc are most heavily correlated, which makes sense given their representation of “median family income” and “average family income.” The next closest correlations are these two predictors with home_value and pct_lt15k, or the percent earning less than $15k in the donor’s neighborhood.

A collinearity review helps to see which predictors if any are collinear. There doesn’t seem to be any clearly collinear predictors.

library(readr)
set.seed(12345)
fundraising <-readRDS("C:/Users/betha/Desktop/fundraising.rds")
future_fundraising <-readRDS("C:/Users/betha/Desktop/future_fundraising.rds")
library(caret)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)

fundraising = fundraising[,5:21]
future_fundraising = future_fundraising[,5:20]

Exploring the data

summary(fundraising)
##  homeowner    num_child         income      female         wealth     
##  Yes:2312   Min.   :1.000   Min.   :1.000   Yes:1831   Min.   :0.000  
##  No : 688   1st Qu.:1.000   1st Qu.:3.000   No :1169   1st Qu.:5.000  
##             Median :1.000   Median :4.000              Median :8.000  
##             Mean   :1.069   Mean   :3.899              Mean   :6.396  
##             3rd Qu.:1.000   3rd Qu.:5.000              3rd Qu.:8.000  
##             Max.   :5.000   Max.   :7.000              Max.   :9.000  
##    home_value      med_fam_inc      avg_fam_inc       pct_lt15k    
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   : 0.00  
##  1st Qu.: 554.8   1st Qu.: 278.0   1st Qu.: 318.0   1st Qu.: 5.00  
##  Median : 816.5   Median : 355.0   Median : 396.0   Median :12.00  
##  Mean   :1143.3   Mean   : 388.4   Mean   : 432.3   Mean   :14.71  
##  3rd Qu.:1341.2   3rd Qu.: 465.0   3rd Qu.: 516.0   3rd Qu.:21.00  
##  Max.   :5945.0   Max.   :1500.0   Max.   :1331.0   Max.   :90.00  
##     num_prom      lifetime_gifts    largest_gift       last_gift     
##  Min.   : 11.00   Min.   :  15.0   Min.   :   5.00   Min.   :  0.00  
##  1st Qu.: 29.00   1st Qu.:  45.0   1st Qu.:  10.00   1st Qu.:  7.00  
##  Median : 48.00   Median :  81.0   Median :  15.00   Median : 10.00  
##  Mean   : 49.14   Mean   : 110.7   Mean   :  16.65   Mean   : 13.48  
##  3rd Qu.: 65.00   3rd Qu.: 135.0   3rd Qu.:  20.00   3rd Qu.: 16.00  
##  Max.   :157.00   Max.   :5674.9   Max.   :1000.00   Max.   :219.00  
##  months_since_donate    time_lag         avg_gift            target    
##  Min.   :17.00       Min.   : 0.000   Min.   :  2.139   Donor   :1499  
##  1st Qu.:29.00       1st Qu.: 3.000   1st Qu.:  6.333   No Donor:1501  
##  Median :31.00       Median : 5.000   Median :  9.000                  
##  Mean   :31.13       Mean   : 6.876   Mean   : 10.669                  
##  3rd Qu.:34.00       3rd Qu.: 9.000   3rd Qu.: 12.800                  
##  Max.   :37.00       Max.   :77.000   Max.   :122.167
pairs(fundraising)

str(fundraising)
## tibble [3,000 × 17] (S3: tbl_df/tbl/data.frame)
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

Checking for correlation and collinearity

library(dplyr)
fundraising.vars = select(fundraising, 2:3, 5:16)
fundraising.cor = cor(fundraising.vars)
fundraising.cor
##                        num_child       income      wealth    home_value
## num_child            1.000000000  0.091893089  0.06017554 -0.0119642286
## income               0.091893089  1.000000000  0.20899310  0.2919734944
## wealth               0.060175537  0.208993101  1.00000000  0.2611611450
## home_value          -0.011964229  0.291973494  0.26116115  1.0000000000
## med_fam_inc          0.046961647  0.367505334  0.37776337  0.7381530742
## avg_fam_inc          0.047261395  0.378585352  0.38589230  0.7525690021
## pct_lt15k           -0.031717891 -0.283191234 -0.37514558 -0.3990861577
## num_prom            -0.086432604 -0.069008634 -0.41211777 -0.0645138583
## lifetime_gifts      -0.050954766 -0.019565470 -0.22547332 -0.0240737013
## largest_gift        -0.017554416  0.033180760 -0.02527652  0.0564942757
## last_gift           -0.012948678  0.109592754  0.05259131  0.1588576542
## months_since_donate -0.005563603  0.077238810  0.03371398  0.0234285142
## time_lag            -0.006069356 -0.001545727 -0.06642133  0.0006789113
## avg_gift            -0.019688680  0.124055750  0.09107875  0.1687736865
##                     med_fam_inc avg_fam_inc    pct_lt15k    num_prom
## num_child            0.04696165  0.04726139 -0.031717891 -0.08643260
## income               0.36750533  0.37858535 -0.283191234 -0.06900863
## wealth               0.37776337  0.38589230 -0.375145585 -0.41211777
## home_value           0.73815307  0.75256900 -0.399086158 -0.06451386
## med_fam_inc          1.00000000  0.97227129 -0.665362675 -0.05078270
## avg_fam_inc          0.97227129  1.00000000 -0.680284797 -0.05731139
## pct_lt15k           -0.66536267 -0.68028480  1.000000000  0.03777518
## num_prom            -0.05078270 -0.05731139  0.037775183  1.00000000
## lifetime_gifts      -0.03524583 -0.04032716  0.059618806  0.53861957
## largest_gift         0.04703207  0.04310394 -0.007882936  0.11381034
## last_gift            0.13597600  0.13137862 -0.061752121 -0.05586809
## months_since_donate  0.03233669  0.03126859 -0.009014558 -0.28232212
## time_lag             0.01520204  0.02434038 -0.019911490  0.11962322
## avg_gift             0.13716276  0.13175843 -0.062480892 -0.14725094
##                     lifetime_gifts largest_gift   last_gift months_since_donate
## num_child              -0.05095477 -0.017554416 -0.01294868        -0.005563603
## income                 -0.01956547  0.033180760  0.10959275         0.077238810
## wealth                 -0.22547332 -0.025276518  0.05259131         0.033713981
## home_value             -0.02407370  0.056494276  0.15885765         0.023428514
## med_fam_inc            -0.03524583  0.047032066  0.13597600         0.032336691
## avg_fam_inc            -0.04032716  0.043103937  0.13137862         0.031268594
## pct_lt15k               0.05961881 -0.007882936 -0.06175212        -0.009014558
## num_prom                0.53861957  0.113810342 -0.05586809        -0.282322122
## lifetime_gifts          1.00000000  0.507262313  0.20205827        -0.144621862
## largest_gift            0.50726231  1.000000000  0.44723693         0.019789633
## last_gift               0.20205827  0.447236933  1.00000000         0.186715010
## months_since_donate    -0.14462186  0.019789633  0.18671501         1.000000000
## time_lag                0.03854575  0.039977035  0.07511121         0.015528499
## avg_gift                0.18232435  0.474830096  0.86639998         0.189110799
##                          time_lag    avg_gift
## num_child           -0.0060693555 -0.01968868
## income              -0.0015457272  0.12405575
## wealth              -0.0664213294  0.09107875
## home_value           0.0006789113  0.16877369
## med_fam_inc          0.0152020426  0.13716276
## avg_fam_inc          0.0243403812  0.13175843
## pct_lt15k           -0.0199114896 -0.06248089
## num_prom             0.1196232155 -0.14725094
## lifetime_gifts       0.0385457538  0.18232435
## largest_gift         0.0399770354  0.47483010
## last_gift            0.0751112090  0.86639998
## months_since_donate  0.0155284995  0.18911080
## time_lag             1.0000000000  0.07008164
## avg_gift             0.0700816428  1.00000000
corrplot::corrplot(fundraising.cor)

par(mfrow=c(1,3))
plot(fundraising$homeowner, main= 'Homeowner')
plot(fundraising$female, main= 'Female')
plot(fundraising$target, main= 'Target')

plot(fundraising$target,fundraising$home_value, main= 'Home_value')

plot(fundraising$target,fundraising$med_fam_inc, main= 'Med_fam_inc')

plot(fundraising$target,fundraising$avg_fam_inc, main= 'Avg_fam_inc')

plot(fundraising$target,fundraising$pct_lt15k, main= 'Pct_lt15k')

plot(fundraising$target,fundraising$num_prom, main= 'Num_prom')

plot(fundraising$target,fundraising$lifetime_gifts, main= 'Lifetime_gifts')

plot(fundraising$target,fundraising$largest_gift, main= 'Largest_gift')

plot(fundraising$target,fundraising$last_gift, main= 'Last_gift')

plot(fundraising$target,fundraising$months_since_donate, main= 'Months_since_donate')

plot(fundraising$target,fundraising$avg_gift, main='Avg_gift')

Exclusions
There was no obvious reason for any exclusions.

Variable Transformations
Because several of the predictors are skewed, a log transformation is applied to remove skew and create a better fit for data distribution.

fundraising = mutate(fundraising, 
              log_months_since_donate = log(months_since_donate),
              log_avg_gift = log(avg_gift),
              log_num_child = log(num_child+1),
              log_home_value = log(home_value+0.1),
              log_med_fam_inc = log(med_fam_inc+0.1),
              log_avg_fam_inc = log(avg_fam_inc+0.1),
              log_pct_lt15k = log(pct_lt15k+0.1),
              log_num_prom = log(num_prom+1),
              log_lifetime_gifts = log(lifetime_gifts+0.1),
              log_largest_gift = log(largest_gift+0.1),
              log_last_gift = log(last_gift+0.1))

future_fundraising = mutate(future_fundraising, 
                     log_months_since_donate = log(months_since_donate),
                     log_months_since_donate = log(months_since_donate),
                     log_avg_gift = log(avg_gift),
                     log_num_child = log(num_child),
                     log_home_value = log(home_value),
                     log_med_fam_inc = log(med_fam_inc),
                     log_avg_fam_inc = log(avg_fam_inc),
                     log_pct_lt15k = log(pct_lt15k),
                     log_num_prom = log(num_prom),
                     log_lifetime_gifts = log(lifetime_gifts),
                     log_largest_gift = log(largest_gift),
                     log_last_gift = log(last_gift))
fundraising = mutate(fundraising, 
              num_child_income_int = num_child*income,
              num_child_log_months_since_donate_int= num_child*log_months_since_donate,
              num_child_log_largest_gift_int= num_child*log_largest_gift,
              num_child_log_life_gifts_int= num_child*log_lifetime_gifts,
              income_log_months_since_donate_int = income*log_months_since_donate,
              income_log_largest_gift_int = income*log_largest_gift,
              income_log_lifetime_gifts_int = income*log_lifetime_gifts,
              log_m_since_donate_log_largest_gift_int= log_months_since_donate*log_largest_gift,
              log_m_since_donate_log_life_gifts_int= log_months_since_donate*log_lifetime_gifts,
              log_largest_gift_log_lifetime_gifts_int= log_largest_gift*log_lifetime_gifts)


future_fundraising = mutate(future_fundraising,
              num_child_income_int = num_child*income,
              num_child_log_months_since_donate_int= num_child*log_months_since_donate,
              num_child_log_largest_gift_int= num_child*log_largest_gift,
              num_child_log_life_gifts_int= num_child*log_lifetime_gifts,
              income_log_months_since_donate_int = income*log_months_since_donate,
              income_log_largest_gift_int = income*log_largest_gift,
              income_log_lifetime_gifts_int = income*log_lifetime_gifts,
              log_m_since_donate_log_largest_gift_int= log_months_since_donate*log_largest_gift,
              log_m_since_donate_log_life_gifts_int= log_months_since_donate*log_lifetime_gifts,
              log_largest_gift_log_lifetime_gifts_int= log_largest_gift*log_lifetime_gifts)

Business Inputs
Resources required for this testing varies and is dependent on the organization’s current environment. If the donor or non-donor demographic information is not clean, complete, or accurate, it will be very difficult to bring the dataset to a usable state for the most accurate predictive modeling results. This would require human resources and time, and in some cases, a data collection system cutover may be the best option for obtaining the desired output through future data analysis.

Methodology Used, Background, Benefits I have chosen to use support vector machine (SVM) models because they are intended for binary classification with two classes. I first use SVM with linear kernel, then further fit an SVM with radial kernel and SVM with polynomial kernel.

Benefits to using SVM are that SVM supports complex data regardless of how well its structured. It scales well, and there is a lower risk of overfitting.

SVM with Linear Kernel:

set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)

svm.linear <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
                       income_log_months_since_donate_int + num_child_log_months_since_donate_int, 
              data = fundraising, 
              method = "svmLinear", 
              trControl = train_control,
              tuneGrid = expand.grid(C = seq(0.002, 0.003, 0.0001)),
              preProcess = c("center","scale"))
plot(svm.linear)

SVM with Radial Kernel:

set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)
svm.radial <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
                       income_log_months_since_donate_int + num_child_log_months_since_donate_int, 
              data = fundraising, 
              method = "svmRadial", 
              trControl = train_control,
              tuneLength = 10,
              tuneGrid = expand.grid(C = seq(0.25, 0.5, 0.01),
                                     sigma= 0.5612128),
              preProcess = c("center","scale"))
plot(svm.radial)

SVM with Polynomial Kernel:

set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)

svm.poly <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
                       income_log_months_since_donate_int + num_child_log_months_since_donate_int, 
              data = fundraising, 
              method = "svmPoly", 
              trControl = train_control,
              tuneLength = 4,
              preProcess = c("center","scale"))
plot(svm.poly)

models = c('SVM-linear', 'SVM-radial', 'SVM-poly')
acc= c(56.49, 56.59, 56.40)

acc.summary= as.data.frame(acc, row.names = models)
acc.summary
##              acc
## SVM-linear 56.49
## SVM-radial 56.59
## SVM-poly   56.40

Model Performance and Validation Results
SVM with linear kernel using optimal cost of 0.0022 produces an accuracy of 56.49%.
SVM with radial kernel using optimal cost of 0.46 and sigma of 0.5612128 produces an accuracy of 56.59%.
SVM with polynomial kernel using optimal degree of 3, scale of 0.01, and cost of 1 produces an accuracy of 56.40%.
The SVM model with radial kernel produces the highest accuracy rate.

Cut-Off Analysis
While default threshold is normally 0.5, no cutoff analysis or adjustment using ROC curve is necessary because the sampling is weighted.

Recommendations
One recommendation for a more accurate model would be to increase the sample size used. While using all of the organization’s 13 million observations would definitely be excessive, 5.1% of 13 million is 663,000 donor observations, which is still excessive but much more than 1,499. I would recommend performing the analysis again with a weighted sample set of at least 10,000 total observations.

Related to marketing costs, the organization could perform regression analysis to predict how much the “Donor” observations could be expected to donate. If the predicted donation exceeds a certain threshold, say, $5.00, the organization could prioritize mailing labels and stationery to this group first and then consider mailing labels and stationery to the observations predicted to donate less than $5.00.

Pseudo Codes for Implementation
My project submission includes my leaderboard submission formulated with scripting below and this writeup with code.

set.seed(12345)
preds = as.data.frame(predict(svm.radial, future_fundraising))
write_csv(preds, 'prediction.csv')