Business Objectives and Goals
The veterans’ organization has a database of 13 million donors. The
organization direct-mails fundraising and sees a 5.1% response rate with
an average donation of $13.00. As part of the fundraising campaign, the
organization sends personalized address labels and stationery to the
possible donors which costs $0.68 per mailing. The business goal is to
maximize cost-effectiveness of the veterans’ organization’s direct
marketing campaigns The business objective is to create a classification
model that can predict which donors will respond with a donation and
should be marketed toward.
Data Sources and Data Used
A data sample is provided and has been generated with weighted sampling
to fairly equally represent donors and non-donors. A breakdown of this
donor or non-donor weight is described in the next section. Weighted
sampling is important to guard against class imbalance as represented by
the classification model. An imbalance can negatively impact or bias a
model fitting if a particular class is more heavily represented. It
would not be wise to use a simple random sample from the original
dataset, because the organization’s total donor database is essentially
94.9% non-donor and 5.1% donor. There is a larger weight of non-donors
overall and using weighted sampling while modeling prevents the model
from returning unreliable results due to the skew.
Type of Analysis Performed: What, Why,
Findings
A review of the summary data helps to see how the predictor data is
distributed. There are 3,000 observations and 17 predictors in this set.
The target variable is a binary indicator represented by two responses,
“Donor” or “No Donor.” In total, the observations are fairly equally
weighted in classification with 1,499 observations or 49.97% being
“Donor” and 1,501 observations or 50.03% being “No Donor.” All
predictors are quantitative except for homeowner, female, and target.
Each predictor seems to be somewhat skewed except for target and has
outliers.
A correlation matrix helps to see how the predictors relate to the response variable and to each other, and which predictors most heavily influence the model fit. Of the quantitative predictors, med_fam¬_inc and avg_fam_inc are most heavily correlated, which makes sense given their representation of “median family income” and “average family income.” The next closest correlations are these two predictors with home_value and pct_lt15k, or the percent earning less than $15k in the donor’s neighborhood.
A collinearity review helps to see which predictors if any are collinear. There doesn’t seem to be any clearly collinear predictors.
library(readr)
set.seed(12345)
fundraising <-readRDS("C:/Users/betha/Desktop/fundraising.rds")
future_fundraising <-readRDS("C:/Users/betha/Desktop/future_fundraising.rds")
library(caret)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)
fundraising = fundraising[,5:21]
future_fundraising = future_fundraising[,5:20]
Exploring the data
summary(fundraising)
## homeowner num_child income female wealth
## Yes:2312 Min. :1.000 Min. :1.000 Yes:1831 Min. :0.000
## No : 688 1st Qu.:1.000 1st Qu.:3.000 No :1169 1st Qu.:5.000
## Median :1.000 Median :4.000 Median :8.000
## Mean :1.069 Mean :3.899 Mean :6.396
## 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:8.000
## Max. :5.000 Max. :7.000 Max. :9.000
## home_value med_fam_inc avg_fam_inc pct_lt15k
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 554.8 1st Qu.: 278.0 1st Qu.: 318.0 1st Qu.: 5.00
## Median : 816.5 Median : 355.0 Median : 396.0 Median :12.00
## Mean :1143.3 Mean : 388.4 Mean : 432.3 Mean :14.71
## 3rd Qu.:1341.2 3rd Qu.: 465.0 3rd Qu.: 516.0 3rd Qu.:21.00
## Max. :5945.0 Max. :1500.0 Max. :1331.0 Max. :90.00
## num_prom lifetime_gifts largest_gift last_gift
## Min. : 11.00 Min. : 15.0 Min. : 5.00 Min. : 0.00
## 1st Qu.: 29.00 1st Qu.: 45.0 1st Qu.: 10.00 1st Qu.: 7.00
## Median : 48.00 Median : 81.0 Median : 15.00 Median : 10.00
## Mean : 49.14 Mean : 110.7 Mean : 16.65 Mean : 13.48
## 3rd Qu.: 65.00 3rd Qu.: 135.0 3rd Qu.: 20.00 3rd Qu.: 16.00
## Max. :157.00 Max. :5674.9 Max. :1000.00 Max. :219.00
## months_since_donate time_lag avg_gift target
## Min. :17.00 Min. : 0.000 Min. : 2.139 Donor :1499
## 1st Qu.:29.00 1st Qu.: 3.000 1st Qu.: 6.333 No Donor:1501
## Median :31.00 Median : 5.000 Median : 9.000
## Mean :31.13 Mean : 6.876 Mean : 10.669
## 3rd Qu.:34.00 3rd Qu.: 9.000 3rd Qu.: 12.800
## Max. :37.00 Max. :77.000 Max. :122.167
pairs(fundraising)
str(fundraising)
## tibble [3,000 × 17] (S3: tbl_df/tbl/data.frame)
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
Checking for correlation and collinearity
library(dplyr)
fundraising.vars = select(fundraising, 2:3, 5:16)
fundraising.cor = cor(fundraising.vars)
fundraising.cor
## num_child income wealth home_value
## num_child 1.000000000 0.091893089 0.06017554 -0.0119642286
## income 0.091893089 1.000000000 0.20899310 0.2919734944
## wealth 0.060175537 0.208993101 1.00000000 0.2611611450
## home_value -0.011964229 0.291973494 0.26116115 1.0000000000
## med_fam_inc 0.046961647 0.367505334 0.37776337 0.7381530742
## avg_fam_inc 0.047261395 0.378585352 0.38589230 0.7525690021
## pct_lt15k -0.031717891 -0.283191234 -0.37514558 -0.3990861577
## num_prom -0.086432604 -0.069008634 -0.41211777 -0.0645138583
## lifetime_gifts -0.050954766 -0.019565470 -0.22547332 -0.0240737013
## largest_gift -0.017554416 0.033180760 -0.02527652 0.0564942757
## last_gift -0.012948678 0.109592754 0.05259131 0.1588576542
## months_since_donate -0.005563603 0.077238810 0.03371398 0.0234285142
## time_lag -0.006069356 -0.001545727 -0.06642133 0.0006789113
## avg_gift -0.019688680 0.124055750 0.09107875 0.1687736865
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.04696165 0.04726139 -0.031717891 -0.08643260
## income 0.36750533 0.37858535 -0.283191234 -0.06900863
## wealth 0.37776337 0.38589230 -0.375145585 -0.41211777
## home_value 0.73815307 0.75256900 -0.399086158 -0.06451386
## med_fam_inc 1.00000000 0.97227129 -0.665362675 -0.05078270
## avg_fam_inc 0.97227129 1.00000000 -0.680284797 -0.05731139
## pct_lt15k -0.66536267 -0.68028480 1.000000000 0.03777518
## num_prom -0.05078270 -0.05731139 0.037775183 1.00000000
## lifetime_gifts -0.03524583 -0.04032716 0.059618806 0.53861957
## largest_gift 0.04703207 0.04310394 -0.007882936 0.11381034
## last_gift 0.13597600 0.13137862 -0.061752121 -0.05586809
## months_since_donate 0.03233669 0.03126859 -0.009014558 -0.28232212
## time_lag 0.01520204 0.02434038 -0.019911490 0.11962322
## avg_gift 0.13716276 0.13175843 -0.062480892 -0.14725094
## lifetime_gifts largest_gift last_gift months_since_donate
## num_child -0.05095477 -0.017554416 -0.01294868 -0.005563603
## income -0.01956547 0.033180760 0.10959275 0.077238810
## wealth -0.22547332 -0.025276518 0.05259131 0.033713981
## home_value -0.02407370 0.056494276 0.15885765 0.023428514
## med_fam_inc -0.03524583 0.047032066 0.13597600 0.032336691
## avg_fam_inc -0.04032716 0.043103937 0.13137862 0.031268594
## pct_lt15k 0.05961881 -0.007882936 -0.06175212 -0.009014558
## num_prom 0.53861957 0.113810342 -0.05586809 -0.282322122
## lifetime_gifts 1.00000000 0.507262313 0.20205827 -0.144621862
## largest_gift 0.50726231 1.000000000 0.44723693 0.019789633
## last_gift 0.20205827 0.447236933 1.00000000 0.186715010
## months_since_donate -0.14462186 0.019789633 0.18671501 1.000000000
## time_lag 0.03854575 0.039977035 0.07511121 0.015528499
## avg_gift 0.18232435 0.474830096 0.86639998 0.189110799
## time_lag avg_gift
## num_child -0.0060693555 -0.01968868
## income -0.0015457272 0.12405575
## wealth -0.0664213294 0.09107875
## home_value 0.0006789113 0.16877369
## med_fam_inc 0.0152020426 0.13716276
## avg_fam_inc 0.0243403812 0.13175843
## pct_lt15k -0.0199114896 -0.06248089
## num_prom 0.1196232155 -0.14725094
## lifetime_gifts 0.0385457538 0.18232435
## largest_gift 0.0399770354 0.47483010
## last_gift 0.0751112090 0.86639998
## months_since_donate 0.0155284995 0.18911080
## time_lag 1.0000000000 0.07008164
## avg_gift 0.0700816428 1.00000000
corrplot::corrplot(fundraising.cor)
par(mfrow=c(1,3))
plot(fundraising$homeowner, main= 'Homeowner')
plot(fundraising$female, main= 'Female')
plot(fundraising$target, main= 'Target')
plot(fundraising$target,fundraising$home_value, main= 'Home_value')
plot(fundraising$target,fundraising$med_fam_inc, main= 'Med_fam_inc')
plot(fundraising$target,fundraising$avg_fam_inc, main= 'Avg_fam_inc')
plot(fundraising$target,fundraising$pct_lt15k, main= 'Pct_lt15k')
plot(fundraising$target,fundraising$num_prom, main= 'Num_prom')
plot(fundraising$target,fundraising$lifetime_gifts, main= 'Lifetime_gifts')
plot(fundraising$target,fundraising$largest_gift, main= 'Largest_gift')
plot(fundraising$target,fundraising$last_gift, main= 'Last_gift')
plot(fundraising$target,fundraising$months_since_donate, main= 'Months_since_donate')
plot(fundraising$target,fundraising$avg_gift, main='Avg_gift')
Exclusions
There was no obvious reason for any exclusions.
Variable Transformations
Because several of the predictors are skewed, a log transformation is
applied to remove skew and create a better fit for data
distribution.
fundraising = mutate(fundraising,
log_months_since_donate = log(months_since_donate),
log_avg_gift = log(avg_gift),
log_num_child = log(num_child+1),
log_home_value = log(home_value+0.1),
log_med_fam_inc = log(med_fam_inc+0.1),
log_avg_fam_inc = log(avg_fam_inc+0.1),
log_pct_lt15k = log(pct_lt15k+0.1),
log_num_prom = log(num_prom+1),
log_lifetime_gifts = log(lifetime_gifts+0.1),
log_largest_gift = log(largest_gift+0.1),
log_last_gift = log(last_gift+0.1))
future_fundraising = mutate(future_fundraising,
log_months_since_donate = log(months_since_donate),
log_months_since_donate = log(months_since_donate),
log_avg_gift = log(avg_gift),
log_num_child = log(num_child),
log_home_value = log(home_value),
log_med_fam_inc = log(med_fam_inc),
log_avg_fam_inc = log(avg_fam_inc),
log_pct_lt15k = log(pct_lt15k),
log_num_prom = log(num_prom),
log_lifetime_gifts = log(lifetime_gifts),
log_largest_gift = log(largest_gift),
log_last_gift = log(last_gift))
fundraising = mutate(fundraising,
num_child_income_int = num_child*income,
num_child_log_months_since_donate_int= num_child*log_months_since_donate,
num_child_log_largest_gift_int= num_child*log_largest_gift,
num_child_log_life_gifts_int= num_child*log_lifetime_gifts,
income_log_months_since_donate_int = income*log_months_since_donate,
income_log_largest_gift_int = income*log_largest_gift,
income_log_lifetime_gifts_int = income*log_lifetime_gifts,
log_m_since_donate_log_largest_gift_int= log_months_since_donate*log_largest_gift,
log_m_since_donate_log_life_gifts_int= log_months_since_donate*log_lifetime_gifts,
log_largest_gift_log_lifetime_gifts_int= log_largest_gift*log_lifetime_gifts)
future_fundraising = mutate(future_fundraising,
num_child_income_int = num_child*income,
num_child_log_months_since_donate_int= num_child*log_months_since_donate,
num_child_log_largest_gift_int= num_child*log_largest_gift,
num_child_log_life_gifts_int= num_child*log_lifetime_gifts,
income_log_months_since_donate_int = income*log_months_since_donate,
income_log_largest_gift_int = income*log_largest_gift,
income_log_lifetime_gifts_int = income*log_lifetime_gifts,
log_m_since_donate_log_largest_gift_int= log_months_since_donate*log_largest_gift,
log_m_since_donate_log_life_gifts_int= log_months_since_donate*log_lifetime_gifts,
log_largest_gift_log_lifetime_gifts_int= log_largest_gift*log_lifetime_gifts)
Business Inputs
Resources required for this testing varies and is dependent on the
organization’s current environment. If the donor or non-donor
demographic information is not clean, complete, or accurate, it will be
very difficult to bring the dataset to a usable state for the most
accurate predictive modeling results. This would require human resources
and time, and in some cases, a data collection system cutover may be the
best option for obtaining the desired output through future data
analysis.
Methodology Used, Background, Benefits I have chosen to use support vector machine (SVM) models because they are intended for binary classification with two classes. I first use SVM with linear kernel, then further fit an SVM with radial kernel and SVM with polynomial kernel.
Benefits to using SVM are that SVM supports complex data regardless of how well its structured. It scales well, and there is a lower risk of overfitting.
SVM with Linear Kernel:
set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)
svm.linear <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
income_log_months_since_donate_int + num_child_log_months_since_donate_int,
data = fundraising,
method = "svmLinear",
trControl = train_control,
tuneGrid = expand.grid(C = seq(0.002, 0.003, 0.0001)),
preProcess = c("center","scale"))
plot(svm.linear)
SVM with Radial Kernel:
set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)
svm.radial <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
income_log_months_since_donate_int + num_child_log_months_since_donate_int,
data = fundraising,
method = "svmRadial",
trControl = train_control,
tuneLength = 10,
tuneGrid = expand.grid(C = seq(0.25, 0.5, 0.01),
sigma= 0.5612128),
preProcess = c("center","scale"))
plot(svm.radial)
SVM with Polynomial Kernel:
set.seed(12345)
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)
svm.poly <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
income_log_months_since_donate_int + num_child_log_months_since_donate_int,
data = fundraising,
method = "svmPoly",
trControl = train_control,
tuneLength = 4,
preProcess = c("center","scale"))
plot(svm.poly)
models = c('SVM-linear', 'SVM-radial', 'SVM-poly')
acc= c(56.49, 56.59, 56.40)
acc.summary= as.data.frame(acc, row.names = models)
acc.summary
## acc
## SVM-linear 56.49
## SVM-radial 56.59
## SVM-poly 56.40
Model Performance and Validation Results
SVM with linear kernel using optimal cost of 0.0022 produces an accuracy
of 56.49%.
SVM with radial kernel using optimal cost of 0.46 and sigma of 0.5612128
produces an accuracy of 56.59%.
SVM with polynomial kernel using optimal degree of 3, scale of 0.01, and
cost of 1 produces an accuracy of 56.40%.
The SVM model with radial kernel produces the highest accuracy rate.
Cut-Off Analysis
While default threshold is normally 0.5, no cutoff analysis or
adjustment using ROC curve is necessary because the sampling is
weighted.
Recommendations
One recommendation for a more accurate model would be to increase the
sample size used. While using all of the organization’s 13 million
observations would definitely be excessive, 5.1% of 13 million is
663,000 donor observations, which is still excessive but much more than
1,499. I would recommend performing the analysis again with a weighted
sample set of at least 10,000 total observations.
Related to marketing costs, the organization could perform regression analysis to predict how much the “Donor” observations could be expected to donate. If the predicted donation exceeds a certain threshold, say, $5.00, the organization could prioritize mailing labels and stationery to this group first and then consider mailing labels and stationery to the observations predicted to donate less than $5.00.
Pseudo Codes for Implementation
My project submission includes my leaderboard submission formulated with
scripting below and this writeup with code.
set.seed(12345)
preds = as.data.frame(predict(svm.radial, future_fundraising))
write_csv(preds, 'prediction.csv')