A national Veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. Their organization has a database with over 13 million donors that is one of the largest direct-mail fundraisers in the United States. Their records indicate there recent overall response rate is 5.1% and the average donation was $13.00 USD. The overall production and distribution cost of their gift-packets, consists of personalized address labels and assortments of cards and envelopes, is $0.68 USD per packet. Mailing gift-packets to each of the 13 million perspective donors would cost the organization over $8.8 million USD, while their expected Return on Investment (ROI) would be just under $8.2 million USD.
Average number of respondents: 6.6310^{5}
Cost per gift-packet: 12.32
Average number of respondents: 6.6310^{5}
Cost per gift-packet: 12.32
ROI for average number of respondents: 8.1681610^{6}
Total gift-packet cost for all donors: 8.8410^{6}
Expected Net: -6.718410^{5}
To assist this organization our objectives are to develop a classification model that will identify potential donors and predict maximum ROI if the this model were deploy though a targeted direct-mail campaign.
The end-state goal is to maximize profits on the organization’s next donation campaign.
The organization has provided sample data,fundraising.rds, from their most recent effort. This sample data contains \(21\) variables and \(3,000\) records that consist of weighted samples under-representing non-donors. This under-representation facilitates our discovery by presenting us with an equal amount of donors and non-donors. Our Y variable, called target within the dataset, is a categorical variable indicating Donor and No Donor. The remaining variables load in as either categorical or numerical. * Categorical zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, and female * Numerical num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift
A separate test dataset was also provided. This dataset exclude the response variable target but included the other \(20\) variables. Additionally, the sample size was also reduced to \(120\) respondents.
The following code chuck includes all libraries and seed used throughout the analysis and modeling phases.
All analysis was conducted with R version 4.0.2 (2020-06-22) on a x86_64-w64-mingw32 platform.
# install.packages("pacman")
pacman::p_load('ISLR', 'corrgram', 'glmnet', 'pls', 'tidyverse', 'ggthemes', 'ggthemr', 'caret', 'modelr', 'leaps', 'psych', 'pastecs', 'e1071', 'randomForest', 'gbm', 'ROCR', 'recipes', 'broom', 'scales', 'outliers')
set.seed(12345)
Loading data.
dp <- read_rds("fundraising.rds")
FF <- read_rds("future_fundraising.rds")
There are no missing values within the provided datasets.
any(is.na(dp))
## [1] FALSE
any(is.na(FF))
## [1] FALSE
summary(dp)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0 Min. : 5.00
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0 1st Qu.: 10.00
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0 Median : 15.00
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7 Mean : 16.65
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0 3rd Qu.: 20.00
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9 Max. :1000.00
## last_gift months_since_donate time_lag avg_gift target
## Min. : 0.00 Min. :17.00 Min. : 0.000 Min. : 2.139 Donor :1499
## 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000 1st Qu.: 6.333 No Donor:1501
## Median : 10.00 Median :31.00 Median : 5.000 Median : 9.000
## Mean : 13.48 Mean :31.13 Mean : 6.876 Mean : 10.669
## 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000 3rd Qu.: 12.800
## Max. :219.00 Max. :37.00 Max. :77.000 Max. :122.167
str(dp)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
It looks as though zipconvert3 was captured in reverse order, this could cause some confusion when interpreting our results. We will flip the order from Yes/No to No/Yes to match the other zipconverts.
dp$zipconvert3 <- fct_rev(dp$zipconvert3)
Looking at numerical variables to identify skewness. If skewness is discover we will transform with a logarithm within variable selection stage.
stat.desc(numvars, basic = FALSE, norm = TRUE)
## num_child income wealth home_value med_fam_inc
## median 1.000000e+00 4.000000e+00 8.000000e+00 8.165000e+02 3.550000e+02
## mean 1.069333e+00 3.899000e+00 6.395667e+00 1.143268e+03 3.883563e+02
## SE.mean 6.304301e-03 2.993156e-02 4.649711e-02 1.738372e+01 3.171913e+00
## CI.mean.0.95 1.236119e-02 5.868847e-02 9.116945e-02 3.408523e+01 6.219346e+00
## var 1.192326e-01 2.687695e+00 6.485943e+00 9.065815e+05 3.018310e+04
## std.dev 3.453008e-01 1.639419e+00 2.546752e+00 9.521457e+02 1.737328e+02
## coef.var 3.229122e-01 4.204717e-01 3.981995e-01 8.328281e-01 4.473542e-01
## skewness 5.980112e+00 -2.239905e-03 -1.200292e+00 2.232637e+00 1.892435e+00
## skew.2SE 6.689310e+01 -2.505542e-02 -1.342638e+01 2.497411e+01 2.116864e+01
## kurtosis 4.034828e+01 -5.889847e-01 1.003609e-01 5.718708e+00 7.430215e+00
## kurt.2SE 2.257415e+02 -3.295265e+00 5.615015e-01 3.199516e+01 4.157074e+01
## normtest.W 2.033681e-01 9.357304e-01 7.524213e-01 7.555717e-01 8.769336e-01
## normtest.p 9.534441e-78 3.575055e-34 3.007178e-55 5.045540e-55 1.134314e-43
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## median 3.960000e+02 1.200000e+01 4.800000e+01 8.100000e+01 1.500000e+01
## mean 4.323103e+02 1.471033e+01 4.913900e+01 1.107405e+02 1.664586e+01
## SE.mean 3.083741e+00 2.210907e-01 4.158974e-01 2.727296e+00 4.111144e-01
## CI.mean.0.95 6.046461e+00 4.335048e-01 8.154731e-01 5.347560e+00 8.060948e-01
## var 2.852837e+04 1.466433e+02 5.189120e+02 2.231443e+04 5.070452e+02
## std.dev 1.689034e+02 1.210964e+01 2.277964e+01 1.493801e+02 2.251766e+01
## coef.var 3.906995e-01 8.232062e-01 4.635756e-01 1.348920e+00 1.352748e+00
## skewness 1.210715e+00 1.312895e+00 6.900928e-01 1.944130e+01 3.000434e+01
## skew.2SE 1.354297e+01 1.468595e+01 7.719328e+00 2.174690e+02 3.356264e+02
## kurtosis 2.530427e+00 2.507536e+00 5.150029e-01 6.592335e+02 1.236900e+03
## kurt.2SE 1.415729e+01 1.402922e+01 2.881350e+00 3.688295e+03 6.920237e+03
## normtest.W 9.236199e-01 9.010550e-01 9.549399e-01 3.910505e-01 2.302320e-01
## normtest.p 1.404418e-36 2.398013e-40 1.708061e-29 3.161737e-72 4.963520e-77
## last_gift months_since_donate time_lag avg_gift
## median 1.000000e+01 3.100000e+01 5.000000e+00 9.000000e+00
## mean 1.348300e+01 3.113267e+01 6.876000e+00 1.066894e+01
## SE.mean 1.912515e-01 7.476925e-02 1.022737e-01 1.360298e-01
## CI.mean.0.95 3.749975e-01 1.466042e-01 2.005338e-01 2.667211e-01
## var 1.097315e+02 1.677132e+01 3.137975e+01 5.551230e+01
## std.dev 1.047528e+01 4.095281e+00 5.601763e+00 7.450658e+00
## coef.var 7.769249e-01 1.315429e-01 8.146834e-01 6.983505e-01
## skewness 5.563103e+00 -1.005637e+00 2.851905e+00 4.803668e+00
## skew.2SE 6.222846e+01 -1.124898e+01 3.190120e+01 5.373348e+01
## kurtosis 6.888341e+01 2.327419e+00 1.677123e+01 4.482937e+01
## kurt.2SE 3.853905e+02 1.302150e+01 9.383206e+01 2.508124e+02
## normtest.W 6.647350e-01 8.800797e-01 7.805419e-01 6.812339e-01
## normtest.p 1.045420e-60 2.861571e-43 3.761256e-53 8.835644e-60
dp[,-1] %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
All numeric variables within the dataset are skewed to some level as both skewness and skew.2SE (2 standard errors) are both above 1, witch is the marker for non-normalcy.
Also observed is a distribution discrepancy in income and wealth There are visual indications these should be categorical n-level factors. Information from the data dictionary confirms the suspicion about wealth:
“Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state.”
Their 0-9 scale makes it a Qualitative - Ordinal factor verses a quantitative number.
The data dictionary states income relates to “Household income”; the summary suggests this should be a 7-level factor as it appears to represent a range of 1-7, where \(1\) is lowand \(7\) is high. Looking at the plot above and given this evidence from the data dictionary we will keep is as number but we will also create an ordinal factor.
Note: There is evidence that suggests avg_fam_inc and med_fam_inc are colinear. We will keep this in mind when we check for colinearity later.
Converting income and wealth to n-level factors with as_factor() function.
dp$wealth <- as_factor(dp$wealth)
inc <- as_factor(dp$income)
d.pool<- cbind(dp, inc)
FF$wealth <- as_factor(FF$wealth)
inc <- as_factor(FF$income)
FFDP<- cbind(FF, inc)
We want to ensure our variables are on the same scale. Large variance would cause variables on a larger scale to outweigh smaller scaled variables, this would introduce bias during our analysis.
d.pool %>% summarise_if(is.numeric, var)
## num_child income home_value med_fam_inc avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 1 0.1192326 2.687695 906581.5 30183.1 28528.37 146.6433 518.912 22314.43
## largest_gift last_gift months_since_donate time_lag avg_gift
## 1 507.0452 109.7315 16.77132 31.37975 55.5123
As shown in the code chunk above, none of our variables are on the same scale. We will use the scale() function to accomplish this adjustment.
scale.features <- scale(d.pool[c(6:7,10:20)])
d.pools <- cbind(d.pool[c(0:5,8:9)],scale.features,d.pool[21:22])
d.pools %>% summarise_if(is.numeric, var)
## num_child income home_value med_fam_inc avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 1 1 1 1 1 1 1 1 1
## largest_gift last_gift months_since_donate time_lag avg_gift
## 1 1 1 1 1 1
Let us look at the numerical variable on a log scale.
# range(d.pools$months_since_donate)
# range(d.pool$months_since_donate)
# d.pools$MSDgrp = cut(d.pool$months_since_donate,c(16, 21 , 24 , 28, 32, 38))
# levels(d.pools$MSDgrp) <- c("1","2","3","4","5","6")
#
# catvars <- d.pools[c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5", "homeowner", "female", "wealth", "inc", "MSDgrp")]
#
# FFDPs$MSDgrp = cut(FFDP$months_since_donate,c(16, 21 , 24 , 28, 32, 38))
# levels(FFDPs$MSDgrp) <- c("1","2","3","4","5","6")
Looking at categorical variables we see the distribution of wealth is skewed but we do not need to adjust categorical variable distributional, there is no assumption of normalcy for them.
catvars %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free") +
geom_bar()
Outliers are another part of the numerical variables. With the outlier() function and box plots we will check to see if there are some way off the scope.
outlier(numvars)
## num_child income home_value med_fam_inc
## 11.383312 1.891524 5.043064 6.398581
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 5.320730 6.217335 4.734974 37.248321
## largest_gift last_gift months_since_donate time_lag
## 43.670346 19.619240 -3.450964 12.518201
## avg_gift
## 14.964817
par(mfrow=c(2,2))
for (i in 1:length(numvars)) {
boxplot(numvars[,i],
main=names(numvars[i]), type="l")
}
Most have outliers that may effect predictive power. I am do not like taking out data unless I have to, so we will retain these outlines unless our predictions are very low.
We have completed our analysis phase, it is time to move to the predictive phase! I’m setting up a common variable I used during my exploratory data analysis final.data
final.data <- d.pools
final.FFDP <- FFDPs
Splitting data into train (80%) and test (20%) sets with the CARET package.
csam<- createDataPartition(y=final.data$target, p= .8, list = FALSE)
train <- final.data[csam,]
test <- final.data[-csam,]
dim(d.pool)
## [1] 3000 22
dim(train)
## [1] 2401 22
dim(test)
## [1] 599 22
Setting train control up for Repeated Cross Validation with \(10\) folds, \(3\) repeats.
train_ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
We have already seen evidence that suggests med_fam_inc and avg_fam_incare correlated. It is now time to investigate the other numeric variables.
corrgram(train, upper.panel=panel.cor, main="Doner Correlation Matrix")
As expected, within the transformed dataset there is high correlation (above .6) in
med_fam_inc and avg_fam_inc, home_value is also highly correlated with these two variables. We see a negative correlation between pct_lt5k (Percent earning less than $15K in potential donor’s neighborhood) and med_fam_inc, avg_fam_inc, and, to a lesser degree, home_value. The remaining highly correlated variables are last_gift and avg_gift, num_prom and lifetime_gifts, lifetime_gifts and largest_gift, largest_gift and last_gift, and largest_gift and avg_gift have moderate correlation.
More correlations are seen in the transformed dummyVars() dataset. However, rendering that chart is not a great idea became it has 39 variables and it looks like pixel art. I discovered this function on Towards Data Science and modified it for this analysis. https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57
corr_simple <- function(data=df,sig=0.5){
#convert data to numeric in order to run correlations
#convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
df_cor <- data %>% mutate_if(is.character, as.factor)
df_cor <- df_cor %>% mutate_if(is.factor, as.numeric) #run a correlation and drop the insignificant ones
corr <- cor(df_cor)
#prepare to drop duplicates and correlations of 1
corr[lower.tri(corr,diag=TRUE)] <- NA
#drop perfect correlations
corr[corr == 1] <- NA #turn into a 3-column table
corr <- as.data.frame(as.table(corr))
#remove the NA values from above
corr <- na.omit(corr) #select significant values
corr <- subset(corr, abs(Freq) > sig)
#sort by highest correlation
corr <- corr[order(-abs(corr$Freq)),] #print table
print(corr) #turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
}
corr_simple(train)
## Var1 Var2 Freq
## 253 med_fam_inc avg_fam_inc 0.9710119
## 435 last_gift avg_gift 0.8598187
## 368 largest_gift last_gift 0.7492002
## 252 home_value avg_fam_inc 0.7477607
## 230 home_value med_fam_inc 0.7325554
## 434 largest_gift avg_gift 0.7300351
## 276 avg_fam_inc pct_lt15k -0.6851133
## 275 med_fam_inc pct_lt15k -0.6659114
## 345 lifetime_gifts largest_gift 0.6368191
## 322 num_prom lifetime_gifts 0.5174737
As expected, income and inc are 100% correlated, med_fam_inc and agv_fam_inc are 97% correlated. We will keep the remaining variables in mind when selecting our model variables.
Before selecting our variables we will look at variable importance. To start we will fit a random forest model to calculate variable importance factor (VIF). We need to Generate a small sample to investigate which variables are closely associated with target. Thin train a random forest to obtain the varImp for the dataset.
k.1=sample(nrow(train),nrow(train)*0.1)
rfdp=train(target~.,data=train[k.1,])
rfImp<-varImp(rfdp)
plot(rfImp,top=15)
# fit.lm1<-lm(target~.,data=train2)
# varImp(fit.lm1, useModel = TRUE, scale = TRUE)
Our second second objective is a byproduct of our first objective and will deal with information from previous campaigns where the average donation was \($13.00\) and packet cost of \($0.65\). Our maximum ROI margin for the test set would be ($13.00 * 299) – ($0.65 * 300) = \($3,692\). Knowing our model will not be perfect, we will attempt to get as close to this gold standard as possible. As such, our predictions should be better than a random guess to ensure we reach a higher percentage of potential donors.
| Optimal | Donor | No Donor |
|---|---|---|
| Donor | 299 | 0 |
| No Donor | 0 | 300 |
TP <- 299
TN <- 300
(13.00 * TP)-(0.65 * TN)
## [1] 3692
##LDA LDA uses the Bayes classifier and a threshold of the posterior probability when assigning observations to a class. This class selection is done by estimating several parameters, including the weighted average of the sample variances and the number of observations.
The first LDA model uses 20 variables. lda.b uses a three-variable subset of the most important variables from the first LDA.
##Train Model
lda.fit.a=train(target~., data=train, method='lda',trControl = train_ctrl)
##Calculate Predictions
pred.lda.a<-predict(lda.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.lda.a,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 172 142
## No Donor 127 158
##
## Accuracy : 0.5509
## 95% CI : (0.5101, 0.5912)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.007924
##
## Kappa : 0.1019
##
## Mcnemar's Test P-Value : 0.393329
##
## Sensitivity : 0.5753
## Specificity : 0.5267
## Pos Pred Value : 0.5478
## Neg Pred Value : 0.5544
## Prevalence : 0.4992
## Detection Rate : 0.2871
## Detection Prevalence : 0.5242
## Balanced Accuracy : 0.5510
##
## 'Positive' Class : Donor
##
varImp(lda.fit.a, useModel = TRUE, scale = TRUE)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 21)
##
## Importance
## months_since_donate 100.0000
## last_gift 91.8258
## avg_gift 83.5865
## largest_gift 80.7298
## num_prom 53.0404
## income 38.3934
## inc 38.3934
## lifetime_gifts 34.6910
## home_value 28.7559
## zipconvert5 22.5035
## homeowner 17.5162
## time_lag 12.2503
## zipconvert2 10.0020
## female 9.3212
## avg_fam_inc 8.2899
## num_child 7.9949
## med_fam_inc 5.8305
## zipconvert3 5.3870
## wealth 0.3288
## zipconvert4 0.1601
LDA 1 expects a return of \($2,389.40\)
| Optimal | Donor | No Donor |
|---|---|---|
| Donor | 192 | 136 |
| No Donor | 108 | 167 |
TP <- 192
TN <- 164
(13.00 * TP)-(0.65 * TN)
## [1] 2389.4
##Train Model
lda.fit.b=train(target~ avg_gift + lifetime_gifts + med_fam_inc + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, method='lda',trControl = train_ctrl)
##Calculate Predictions
pred.lda.b<-predict(lda.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.lda.b,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 171 158
## No Donor 128 142
##
## Accuracy : 0.5225
## 95% CI : (0.4817, 0.5632)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.15351
##
## Kappa : 0.0452
##
## Mcnemar's Test P-Value : 0.08638
##
## Sensitivity : 0.5719
## Specificity : 0.4733
## Pos Pred Value : 0.5198
## Neg Pred Value : 0.5259
## Prevalence : 0.4992
## Detection Rate : 0.2855
## Detection Prevalence : 0.5492
## Balanced Accuracy : 0.5226
##
## 'Positive' Class : Donor
##
LDA 2 expects a return of \($2,605.85\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 207 | 169 |
| No Donor | 92 | 131 |
TP <- 207
TN <- 131
(13.00 * TP)-(0.65 * TN)
## [1] 2605.85
##Quadratic Discriminant Analysis (QDA) Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations from each class are drawn from a normal distribution. It creates a prediction based on parameter estimates put into Bayes’ theorem. QDA differs in that it uses a quadratic function and creates its own covariance matrix, unlike LDA, making QDA more flexible with a lower variance dataset.
##Train Model
qda.fit.a=train(target~ homeowner + num_child + income + female + home_value + med_fam_inc + avg_fam_inc + pct_lt15k + num_prom + lifetime_gifts + largest_gift + last_gift + months_since_donate + time_lag + avg_gift, data=train, method='qda',trControl = train_ctrl)
##Calculate Predictions
pred.qda.a<-predict(qda.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.qda.a,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 227 214
## No Donor 72 86
##
## Accuracy : 0.5225
## 95% CI : (0.4817, 0.5632)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.1535
##
## Kappa : 0.0458
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7592
## Specificity : 0.2867
## Pos Pred Value : 0.5147
## Neg Pred Value : 0.5443
## Prevalence : 0.4992
## Detection Rate : 0.3790
## Detection Prevalence : 0.7362
## Balanced Accuracy : 0.5229
##
## 'Positive' Class : Donor
##
names(t)
## NULL
QDA1 expects a return of \($2,895.10\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 227 | 214 |
| No Donor | 72 | 86 |
TP <- 227
TN <- 86
(13.00 * TP)-(0.65 * TN)
## [1] 2895.1
##Train Model
qda.fit.b=train(target~ num_prom + avg_gift + home_value + pct_lt15k + time_lag + med_fam_inc + avg_fam_inc + last_gift + largest_gift, data=train, method='qda',trControl = train_ctrl)
##Calculate Predictions
pred.qda.b<-predict(qda.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.qda.b,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 213 188
## No Donor 86 112
##
## Accuracy : 0.5426
## 95% CI : (0.5017, 0.583)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.02259
##
## Kappa : 0.0857
##
## Mcnemar's Test P-Value : 1.05e-09
##
## Sensitivity : 0.7124
## Specificity : 0.3733
## Pos Pred Value : 0.5312
## Neg Pred Value : 0.5657
## Prevalence : 0.4992
## Detection Rate : 0.3556
## Detection Prevalence : 0.6694
## Balanced Accuracy : 0.5429
##
## 'Positive' Class : Donor
##
QDA2 expects a return of \($2,696.20\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 213 | 188 |
| No Donor | 86 | 112 |
TP <- 213
TN <- 112
(13.00 * TP)-(0.65 * TN)
## [1] 2696.2
Logistic regression models the probability of a binary response based on one or more predictor variables. In our case, whether a direct-mail campaign will trigger a donation or will there be no ROI.
Log models were created based on earlier models. The first model includes 20 variables and the second reflecting 10 variables that were selected based upon significance.
##Train Model
glm.fit.a=train(target~., data=train, trControl = train_ctrl, method = "glm", family = "binomial")
##Calculate Predictions
pred.glm.a<-predict(glm.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.glm.a, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 174 143
## No Donor 125 157
##
## Accuracy : 0.5526
## 95% CI : (0.5118, 0.5929)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.00631
##
## Kappa : 0.1053
##
## Mcnemar's Test P-Value : 0.29907
##
## Sensitivity : 0.5819
## Specificity : 0.5233
## Pos Pred Value : 0.5489
## Neg Pred Value : 0.5567
## Prevalence : 0.4992
## Detection Rate : 0.2905
## Detection Prevalence : 0.5292
## Balanced Accuracy : 0.5526
##
## 'Positive' Class : Donor
##
varImp(glm.fit.a, useModel = TRUE, scale = TRUE)
## glm variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## months_since_donate 100.000
## income 36.877
## home_value 36.158
## num_child 35.341
## avg_fam_inc 25.954
## last_gift 18.067
## wealth7 16.870
## med_fam_inc 16.740
## inc3 15.042
## num_prom 14.353
## homeownerNo 12.985
## inc5 12.835
## wealth9 10.081
## wealth6 9.805
## wealth5 9.755
## wealth4 9.597
## wealth1 9.084
## largest_gift 6.952
## wealth3 4.761
## pct_lt15k 4.539
GLM 1 expects a return of \($2,159.95\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 174 | 143 |
| No Donor | 125 | 157 |
TP <- 174
TN <- 157
(13.00 * TP)-(0.65 * TN)
## [1] 2159.95
##Train Model
glm.fit.b=train(target~ avg_gift + lifetime_gifts + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, trControl = train_ctrl, method = "glm", family = binomial)
##Calculate Predictions
pred.glm.b<-predict(glm.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.glm.b, test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 176 161
## No Donor 123 139
##
## Accuracy : 0.5259
## 95% CI : (0.485, 0.5665)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.11801
##
## Kappa : 0.052
##
## Mcnemar's Test P-Value : 0.02812
##
## Sensitivity : 0.5886
## Specificity : 0.4633
## Pos Pred Value : 0.5223
## Neg Pred Value : 0.5305
## Prevalence : 0.4992
## Detection Rate : 0.2938
## Detection Prevalence : 0.5626
## Balanced Accuracy : 0.5260
##
## 'Positive' Class : Donor
##
GLM 2 expects a return of \($2,468.70\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 198 | 138 |
| No Donor | 102 | 162 |
TP <- 198
TN <- 162
(13.00 * TP)-(0.65 * TN)
## [1] 2468.7
KNN assigns a weights so the nearer ‘neighbors’ contribute more to an average than observations further away. Smaller K value will provide the most flexible fit and have a low bias but high variance due to the prediction relying on one observation and vice versa, a larger K value will provide a smoother and less variable fit but may cause bias by masking some of the structure.
##Train Model
knn.fit.a=train(target~., data=train, method='knn',trControl = train_ctrl, tuneLength=30)
##Calculate Predictions
pred.knn.a<-predict(knn.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.knn.a,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 158 154
## No Donor 141 146
##
## Accuracy : 0.5075
## 95% CI : (0.4667, 0.5483)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.3875
##
## Kappa : 0.0151
##
## Mcnemar's Test P-Value : 0.4848
##
## Sensitivity : 0.5284
## Specificity : 0.4867
## Pos Pred Value : 0.5064
## Neg Pred Value : 0.5087
## Prevalence : 0.4992
## Detection Rate : 0.2638
## Detection Prevalence : 0.5209
## Balanced Accuracy : 0.5075
##
## 'Positive' Class : Donor
##
varImp(knn.fit.a, useModel = TRUE, scale = TRUE)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 21)
##
## Importance
## months_since_donate 100.0000
## last_gift 91.8258
## avg_gift 83.5865
## largest_gift 80.7298
## num_prom 53.0404
## income 38.3934
## inc 38.3934
## lifetime_gifts 34.6910
## home_value 28.7559
## zipconvert5 22.5035
## homeowner 17.5162
## time_lag 12.2503
## zipconvert2 10.0020
## female 9.3212
## avg_fam_inc 8.2899
## num_child 7.9949
## med_fam_inc 5.8305
## zipconvert3 5.3870
## wealth 0.3288
## zipconvert4 0.1601
KNN1 expects a return of \($2,468.70\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 190 | 168 |
| No Donor | 110 | 132 |
TP <- 190
TN <- 132
(13.00 * TP)-(0.65 * TN)
## [1] 2384.2
##Train Model
knn.fit.b=train(target~ avg_gift + lifetime_gifts + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, method='knn',trControl = train_ctrl, tuneLength=30)
##Calculate Predictions
pred.knn.b<-predict(knn.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.knn.b,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 172 164
## No Donor 127 136
##
## Accuracy : 0.5142
## 95% CI : (0.4733, 0.5549)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.27000
##
## Kappa : 0.0286
##
## Mcnemar's Test P-Value : 0.03483
##
## Sensitivity : 0.5753
## Specificity : 0.4533
## Pos Pred Value : 0.5119
## Neg Pred Value : 0.5171
## Prevalence : 0.4992
## Detection Rate : 0.2871
## Detection Prevalence : 0.5609
## Balanced Accuracy : 0.5143
##
## 'Positive' Class : Donor
##
KNN2 expects a return of \($2,216.50\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 178 | 150 |
| No Donor | 122 | 150 |
TP <- 178
TN <- 150
(13.00 * TP)-(0.65 * TN)
## [1] 2216.5
Support Vector Machine
Support vector machines are intended for binary classifications such as donor and when there are non-linear class boundaries which is addresses by enlarging the feature space using polynomial functions as predictors using kernels. A kernel is a function that quantifies the similarity of two observations. A radial kernel is effected by nearby training observations but there are also other types including linear, polynomial and sigmoid. In our dataset the radial kernel is performing better.
#Tunning the SVC
poly.tune.out <- tune(svm, target ~ ., data = train, kernel = "polynomial", ranges = list(cost = c(0.1,1, 5, 10, 20), degree = c(2, 3, 4, 5)))
summary(poly.tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 1 2
##
## - best performance: 0.4531362
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.1 2 0.4839678 0.02369064
## 2 1.0 2 0.4531362 0.02487941
## 3 5.0 2 0.4602230 0.02897757
## 4 10.0 2 0.4635615 0.02629239
## 5 20.0 2 0.4668966 0.02948750
## 6 0.1 3 0.4968724 0.02529140
## 7 1.0 3 0.4677282 0.02565760
## 8 5.0 3 0.4618793 0.03475451
## 9 10.0 3 0.4618811 0.03059307
## 10 20.0 3 0.4627092 0.03654283
## 11 0.1 4 0.5027075 0.01583733
## 12 1.0 4 0.4877144 0.01438082
## 13 5.0 4 0.4743897 0.01264273
## 14 10.0 4 0.4764575 0.02142671
## 15 20.0 4 0.4602196 0.03688775
## 16 0.1 5 0.5052075 0.01587398
## 17 1.0 5 0.4964609 0.02088036
## 18 5.0 5 0.4927178 0.01731685
## 19 10.0 5 0.4839713 0.01987610
## 20 20.0 5 0.4806328 0.02299150
rad.tune.out <- tune(svm, target ~ ., data = train, kernel = "radial", ranges = list(cost = c(0.1, 1, 5, 10, 20), gamma = c(0.01, 0.1, 1, 5, 10)))
summary(rad.tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.01
##
## - best performance: 0.442742
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.1 0.01 0.4452351 0.02659737
## 2 1.0 0.01 0.4427420 0.02645096
## 3 5.0 0.01 0.4556535 0.02535353
## 4 10.0 0.01 0.4623185 0.03146988
## 5 20.0 0.01 0.4685615 0.02264571
## 6 0.1 0.10 0.4493966 0.02026974
## 7 1.0 0.10 0.4652092 0.02012967
## 8 5.0 0.10 0.4960443 0.03176939
## 9 10.0 0.10 0.5077075 0.03203582
## 10 20.0 0.10 0.5114454 0.03481461
## 11 0.1 1.00 0.5285356 0.02178929
## 12 1.0 1.00 0.5127144 0.02303358
## 13 5.0 1.00 0.5056380 0.02813914
## 14 10.0 1.00 0.5056380 0.02813914
## 15 20.0 1.00 0.5056380 0.02813914
## 16 0.1 5.00 0.5285356 0.02178929
## 17 1.0 5.00 0.5226988 0.02531526
## 18 5.0 5.00 0.5239471 0.03000081
## 19 10.0 5.00 0.5239471 0.03000081
## 20 20.0 5.00 0.5239471 0.03000081
## 21 0.1 10.00 0.5285356 0.02178929
## 22 1.0 10.00 0.5247873 0.02386694
## 23 5.0 10.00 0.5247804 0.02353730
## 24 10.0 10.00 0.5247804 0.02353730
## 25 20.0 10.00 0.5247804 0.02353730
svm.poly <- svm(target ~ ., data = train, kernel = "polynomial", cost = poly.tune.out$best.parameters$cost)
train.pred <- predict(svm.poly, test)
table(test$target, train.pred)
## train.pred
## Donor No Donor
## Donor 41 258
## No Donor 46 254
confusionMatrix(test$target, train.pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 41 258
## No Donor 46 254
##
## Accuracy : 0.4925
## 95% CI : (0.4517, 0.5333)
## No Information Rate : 0.8548
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0162
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.47126
## Specificity : 0.49609
## Pos Pred Value : 0.13712
## Neg Pred Value : 0.84667
## Prevalence : 0.14524
## Detection Rate : 0.06845
## Detection Prevalence : 0.49917
## Balanced Accuracy : 0.48368
##
## 'Positive' Class : Donor
##
SVM Poly expects a return of \($2,576.60\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 204 | 86 |
| No Donor | 184 | 116 |
TP <- 204
TN <- 116
(13.00 * TP)-(0.65 * TN)
## [1] 2576.6
svm.radial <- svm(target ~ ., data = train, kernel = "radial", cost = rad.tune.out$best.parameters$cost)
train.pred <- predict(svm.radial, test)
table(test$target, train.pred)
## train.pred
## Donor No Donor
## Donor 186 113
## No Donor 160 140
confusionMatrix(test$target, train.pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 186 113
## No Donor 160 140
##
## Accuracy : 0.5442
## 95% CI : (0.5034, 0.5847)
## No Information Rate : 0.5776
## P-Value [Acc > NIR] : 0.954700
##
## Kappa : 0.0887
##
## Mcnemar's Test P-Value : 0.005369
##
## Sensitivity : 0.5376
## Specificity : 0.5534
## Pos Pred Value : 0.6221
## Neg Pred Value : 0.4667
## Prevalence : 0.5776
## Detection Rate : 0.3105
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.5455
##
## 'Positive' Class : Donor
##
SVM Radial expects a return of \($2,385.50\)
| Prediction | Donor | No Donor |
|---|---|---|
| Donor | 175 | 125 |
| No Donor | 144 | 156 |
TP <- 191
TN <- 150
FP <- 150
FN <- 109
(13.00 * TP)-(0.65 * TN)
## [1] 2385.5
We investigated five classification models; Logistic Regression, LDA, QDA, KNN, and Support Vector Machine). As anticipated results varied from training and test / validation data and from each model we were shown different results on how to maximize the gross margin. We will assess this in two ways, first by accuracy and then by potential gross margin.
Based upon the accuracy, the top three models are:
Based on Gross Margin, the top three are:
There are many other types of classification that were not addressed in this assessment. Even with the models observed today there are may transformations and hyperperamiters we can tweak to improve the accuracy and by effect, increase ROI for direct-mail campaign. There are also multi-class classifiers if in the future an on-watch type of list is desired. There is also room to add other variables such as advertisement expenditure and advertisement media the may help pinpoint a willing donor population.
For the scope of this study, we recommend the use of the QDA model as it is currently projecting the largest ROI and it achieved the highest accuracy rate \(52.3%\) of the top three ROI projectors. We also recommend that further analysis be conducted on on Random Forests and Generalized Additive Models as they show promise in classification problems. Finally, there is room for improvement with the models presented, we suggest these findings be expanded upon in the upcoming quarter to allow room for ROI growth potential.