Modeling Competition

Problem Statement

A national Veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. Their organization has a database with over 13 million donors that is one of the largest direct-mail fundraisers in the United States. Their records indicate there recent overall response rate is 5.1% and the average donation was $13.00 USD. The overall production and distribution cost of their gift-packets, consists of personalized address labels and assortments of cards and envelopes, is $0.68 USD per packet. Mailing gift-packets to each of the 13 million perspective donors would cost the organization over $8.8 million USD, while their expected Return on Investment (ROI) would be just under $8.2 million USD.

Average number of respondents: 6.6310^{5}

Cost per gift-packet: 12.32

Average number of respondents: 6.6310^{5}

Cost per gift-packet: 12.32

ROI for average number of respondents: 8.1681610^{6}

Total gift-packet cost for all donors: 8.8410^{6}

Expected Net: -6.718410^{5}

Goal

To assist this organization our objectives are to develop a classification model that will identify potential donors and predict maximum ROI if the this model were deploy though a targeted direct-mail campaign.

The end-state goal is to maximize profits on the organization’s next donation campaign.

Data

The organization has provided sample data,fundraising.rds, from their most recent effort. This sample data contains $21$ variables and $3,000$ records that consist of weighted samples under-representing non-donors. This under-representation facilitates our discovery by presenting us with an equal amount of donors and non-donors. Our Y variable, called target within the dataset, is a categorical variable indicating Donor and No Donor. The remaining variables load in as either categorical or numerical. * Categorical zipconvert2, zipconvert3, zipconvert4, zipconvert5, homeowner, and female * Numerical num_child, income, wealth, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, months_since_donate, time_lag, and avg_gift

A separate test dataset was also provided. This dataset exclude the response variable target but included the other $20$ variables. Additionally, the sample size was also reduced to $120$ respondents.

The following code chuck includes all libraries and seed used throughout the analysis and modeling phases.
All analysis was conducted with R version 4.0.2 (2020-06-22) on a x86_64-w64-mingw32 platform.

# install.packages("pacman")
pacman::p_load('ISLR', 'corrgram', 'glmnet', 'pls', 'tidyverse', 'ggthemes', 'ggthemr', 'caret', 'modelr', 'leaps', 'psych', 'pastecs', 'e1071', 'randomForest', 'gbm', 'ROCR', 'recipes', 'broom', 'scales', 'outliers')

set.seed(12345)

Exploratory Anlysis and Data Transformation

Loading data.

dp <- read_rds("fundraising.rds")
FF <- read_rds("future_fundraising.rds")

There are no missing values within the provided datasets.

any(is.na(dp))

## [1] FALSE

any(is.na(FF))

## [1] FALSE

summary(dp)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts    largest_gift    
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0   Min.   :   5.00  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0   1st Qu.:  10.00  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0   Median :  15.00  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7   Mean   :  16.65  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0   3rd Qu.:  20.00  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9   Max.   :1000.00  
##    last_gift      months_since_donate    time_lag         avg_gift            target    
##  Min.   :  0.00   Min.   :17.00       Min.   : 0.000   Min.   :  2.139   Donor   :1499  
##  1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000   1st Qu.:  6.333   No Donor:1501  
##  Median : 10.00   Median :31.00       Median : 5.000   Median :  9.000                  
##  Mean   : 13.48   Mean   :31.13       Mean   : 6.876   Mean   : 10.669                  
##  3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000   3rd Qu.: 12.800                  
##  Max.   :219.00   Max.   :37.00       Max.   :77.000   Max.   :122.167

str(dp)

## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

It looks as though zipconvert3 was captured in reverse order, this could cause some confusion when interpreting our results. We will flip the order from Yes/No to No/Yes to match the other zipconverts.

dp$zipconvert3 <- fct_rev(dp$zipconvert3)

Looking at numerical variables to identify skewness. If skewness is discover we will transform with a logarithm within variable selection stage.

stat.desc(numvars, basic = FALSE, norm = TRUE)

##                 num_child        income        wealth   home_value  med_fam_inc
## median       1.000000e+00  4.000000e+00  8.000000e+00 8.165000e+02 3.550000e+02
## mean         1.069333e+00  3.899000e+00  6.395667e+00 1.143268e+03 3.883563e+02
## SE.mean      6.304301e-03  2.993156e-02  4.649711e-02 1.738372e+01 3.171913e+00
## CI.mean.0.95 1.236119e-02  5.868847e-02  9.116945e-02 3.408523e+01 6.219346e+00
## var          1.192326e-01  2.687695e+00  6.485943e+00 9.065815e+05 3.018310e+04
## std.dev      3.453008e-01  1.639419e+00  2.546752e+00 9.521457e+02 1.737328e+02
## coef.var     3.229122e-01  4.204717e-01  3.981995e-01 8.328281e-01 4.473542e-01
## skewness     5.980112e+00 -2.239905e-03 -1.200292e+00 2.232637e+00 1.892435e+00
## skew.2SE     6.689310e+01 -2.505542e-02 -1.342638e+01 2.497411e+01 2.116864e+01
## kurtosis     4.034828e+01 -5.889847e-01  1.003609e-01 5.718708e+00 7.430215e+00
## kurt.2SE     2.257415e+02 -3.295265e+00  5.615015e-01 3.199516e+01 4.157074e+01
## normtest.W   2.033681e-01  9.357304e-01  7.524213e-01 7.555717e-01 8.769336e-01
## normtest.p   9.534441e-78  3.575055e-34  3.007178e-55 5.045540e-55 1.134314e-43
##               avg_fam_inc    pct_lt15k     num_prom lifetime_gifts largest_gift
## median       3.960000e+02 1.200000e+01 4.800000e+01   8.100000e+01 1.500000e+01
## mean         4.323103e+02 1.471033e+01 4.913900e+01   1.107405e+02 1.664586e+01
## SE.mean      3.083741e+00 2.210907e-01 4.158974e-01   2.727296e+00 4.111144e-01
## CI.mean.0.95 6.046461e+00 4.335048e-01 8.154731e-01   5.347560e+00 8.060948e-01
## var          2.852837e+04 1.466433e+02 5.189120e+02   2.231443e+04 5.070452e+02
## std.dev      1.689034e+02 1.210964e+01 2.277964e+01   1.493801e+02 2.251766e+01
## coef.var     3.906995e-01 8.232062e-01 4.635756e-01   1.348920e+00 1.352748e+00
## skewness     1.210715e+00 1.312895e+00 6.900928e-01   1.944130e+01 3.000434e+01
## skew.2SE     1.354297e+01 1.468595e+01 7.719328e+00   2.174690e+02 3.356264e+02
## kurtosis     2.530427e+00 2.507536e+00 5.150029e-01   6.592335e+02 1.236900e+03
## kurt.2SE     1.415729e+01 1.402922e+01 2.881350e+00   3.688295e+03 6.920237e+03
## normtest.W   9.236199e-01 9.010550e-01 9.549399e-01   3.910505e-01 2.302320e-01
## normtest.p   1.404418e-36 2.398013e-40 1.708061e-29   3.161737e-72 4.963520e-77
##                 last_gift months_since_donate     time_lag     avg_gift
## median       1.000000e+01        3.100000e+01 5.000000e+00 9.000000e+00
## mean         1.348300e+01        3.113267e+01 6.876000e+00 1.066894e+01
## SE.mean      1.912515e-01        7.476925e-02 1.022737e-01 1.360298e-01
## CI.mean.0.95 3.749975e-01        1.466042e-01 2.005338e-01 2.667211e-01
## var          1.097315e+02        1.677132e+01 3.137975e+01 5.551230e+01
## std.dev      1.047528e+01        4.095281e+00 5.601763e+00 7.450658e+00
## coef.var     7.769249e-01        1.315429e-01 8.146834e-01 6.983505e-01
## skewness     5.563103e+00       -1.005637e+00 2.851905e+00 4.803668e+00
## skew.2SE     6.222846e+01       -1.124898e+01 3.190120e+01 5.373348e+01
## kurtosis     6.888341e+01        2.327419e+00 1.677123e+01 4.482937e+01
## kurt.2SE     3.853905e+02        1.302150e+01 9.383206e+01 2.508124e+02
## normtest.W   6.647350e-01        8.800797e-01 7.805419e-01 6.812339e-01
## normtest.p   1.045420e-60        2.861571e-43 3.761256e-53 8.835644e-60

dp[,-1] %>%
  keep(is.numeric) %>%
  gather() %>%
ggplot(aes(value)) + 
    facet_wrap(~ key, scales = "free") + 
    geom_density()

All numeric variables within the dataset are skewed to some level as both skewness and skew.2SE (2 standard errors) are both above 1, witch is the marker for non-normalcy.

Also observed is a distribution discrepancy in income and wealth There are visual indications these should be categorical n-level factors. Information from the data dictionary confirms the suspicion about wealth:

“Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state.”

Their 0-9 scale makes it a Qualitative - Ordinal factor verses a quantitative number.

The data dictionary states income relates to “Household income”; the summary suggests this should be a 7-level factor as it appears to represent a range of 1-7, where $1$ is lowand $7$ is high. Looking at the plot above and given this evidence from the data dictionary we will keep is as number but we will also create an ordinal factor.

Note: There is evidence that suggests avg_fam_inc and med_fam_inc are colinear. We will keep this in mind when we check for colinearity later.

Converting income and wealth to n-level factors with as_factor() function.

dp$wealth <- as_factor(dp$wealth)
inc <- as_factor(dp$income)
d.pool<- cbind(dp, inc)

FF$wealth <- as_factor(FF$wealth)
inc <- as_factor(FF$income)
FFDP<- cbind(FF, inc)

Standardization

We want to ensure our variables are on the same scale. Large variance would cause variables on a larger scale to outweigh smaller scaled variables, this would introduce bias during our analysis.

d.pool %>% summarise_if(is.numeric, var)

##   num_child   income home_value med_fam_inc avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 1 0.1192326 2.687695   906581.5     30183.1    28528.37  146.6433  518.912       22314.43
##   largest_gift last_gift months_since_donate time_lag avg_gift
## 1     507.0452  109.7315            16.77132 31.37975  55.5123

As shown in the code chunk above, none of our variables are on the same scale. We will use the scale() function to accomplish this adjustment.

scale.features <- scale(d.pool[c(6:7,10:20)])
d.pools <- cbind(d.pool[c(0:5,8:9)],scale.features,d.pool[21:22])

d.pools %>% summarise_if(is.numeric, var)

##   num_child income home_value med_fam_inc avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 1         1      1          1           1           1         1        1              1
##   largest_gift last_gift months_since_donate time_lag avg_gift
## 1            1         1                   1        1        1

Let us look at the numerical variable on a log scale.

# range(d.pools$months_since_donate)
# range(d.pool$months_since_donate)

# d.pools$MSDgrp = cut(d.pool$months_since_donate,c(16, 21 , 24 , 28, 32, 38))
# levels(d.pools$MSDgrp) <- c("1","2","3","4","5","6")
# 
# catvars <- d.pools[c("zipconvert2", "zipconvert3", "zipconvert4", "zipconvert5", "homeowner", "female", "wealth", "inc", "MSDgrp")]
# 
# FFDPs$MSDgrp = cut(FFDP$months_since_donate,c(16, 21 , 24 , 28, 32, 38))
# levels(FFDPs$MSDgrp) <- c("1","2","3","4","5","6")

Looking at categorical variables we see the distribution of wealth is skewed but we do not need to adjust categorical variable distributional, there is no assumption of normalcy for them.

catvars %>%
    gather() %>%
ggplot(aes(value)) + 
    facet_wrap(~key, scales = "free") + 
    geom_bar()

Outliers are another part of the numerical variables. With the outlier() function and box plots we will check to see if there are some way off the scope.

outlier(numvars)

##           num_child              income          home_value         med_fam_inc 
##           11.383312            1.891524            5.043064            6.398581 
##         avg_fam_inc           pct_lt15k            num_prom      lifetime_gifts 
##            5.320730            6.217335            4.734974           37.248321 
##        largest_gift           last_gift months_since_donate            time_lag 
##           43.670346           19.619240           -3.450964           12.518201 
##            avg_gift 
##           14.964817

par(mfrow=c(2,2))
for (i in 1:length(numvars)) {
        boxplot(numvars[,i], 
                main=names(numvars[i]), type="l")
}

Most have outliers that may effect predictive power. I am do not like taking out data unless I have to, so we will retain these outlines unless our predictions are very low.

We have completed our analysis phase, it is time to move to the predictive phase! I’m setting up a common variable I used during my exploratory data analysis final.data

final.data <- d.pools
final.FFDP <-  FFDPs

Train/Test Split and Train Control

Splitting data into train (80%) and test (20%) sets with the CARET package.

csam<- createDataPartition(y=final.data$target, p= .8, list = FALSE)
train <- final.data[csam,]
test <- final.data[-csam,]
dim(d.pool)

## [1] 3000   22

dim(train)

## [1] 2401   22

dim(test)

## [1] 599  22

Setting train control up for Repeated Cross Validation with $10$ folds, $3$ repeats.

train_ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)

Variable Selection

We have already seen evidence that suggests med_fam_inc and avg_fam_incare correlated. It is now time to investigate the other numeric variables.

corrgram(train, upper.panel=panel.cor, main="Doner Correlation Matrix")

As expected, within the transformed dataset there is high correlation (above .6) in med_fam_inc and avg_fam_inc, home_value is also highly correlated with these two variables. We see a negative correlation between pct_lt5k (Percent earning less than $15K in potential donor’s neighborhood) and med_fam_inc, avg_fam_inc, and, to a lesser degree, home_value. The remaining highly correlated variables are last_gift and avg_gift, num_prom and lifetime_gifts, lifetime_gifts and largest_gift, largest_gift and last_gift, and largest_gift and avg_gift have moderate correlation.

More correlations are seen in the transformed dummyVars() dataset. However, rendering that chart is not a great idea became it has 39 variables and it looks like pixel art. I discovered this function on Towards Data Science and modified it for this analysis. https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57

corr_simple <- function(data=df,sig=0.5){
  #convert data to numeric in order to run correlations
  #convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)  #run a correlation and drop the insignificant ones
  corr <- cor(df_cor)
  #prepare to drop duplicates and correlations of 1     
  corr[lower.tri(corr,diag=TRUE)] <- NA 
  #drop perfect correlations
  corr[corr == 1] <- NA   #turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  #remove the NA values from above 
  corr <- na.omit(corr)   #select significant values  
  corr <- subset(corr, abs(Freq) > sig) 
  #sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)),]   #print table
  print(corr)  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
}

corr_simple(train)

##               Var1           Var2       Freq
## 253    med_fam_inc    avg_fam_inc  0.9710119
## 435      last_gift       avg_gift  0.8598187
## 368   largest_gift      last_gift  0.7492002
## 252     home_value    avg_fam_inc  0.7477607
## 230     home_value    med_fam_inc  0.7325554
## 434   largest_gift       avg_gift  0.7300351
## 276    avg_fam_inc      pct_lt15k -0.6851133
## 275    med_fam_inc      pct_lt15k -0.6659114
## 345 lifetime_gifts   largest_gift  0.6368191
## 322       num_prom lifetime_gifts  0.5174737

As expected, income and inc are 100% correlated, med_fam_inc and agv_fam_inc are 97% correlated. We will keep the remaining variables in mind when selecting our model variables.

Before selecting our variables we will look at variable importance. To start we will fit a random forest model to calculate variable importance factor (VIF). We need to Generate a small sample to investigate which variables are closely associated with target. Thin train a random forest to obtain the varImp for the dataset.

k.1=sample(nrow(train),nrow(train)*0.1)
rfdp=train(target~.,data=train[k.1,])
rfImp<-varImp(rfdp)
plot(rfImp,top=15)

# fit.lm1<-lm(target~.,data=train2)
# varImp(fit.lm1, useModel = TRUE, scale = TRUE)

Objective 2

Our second second objective is a byproduct of our first objective and will deal with information from previous campaigns where the average donation was $$13.00$ and packet cost of $$0.65$. Our maximum ROI margin for the test set would be ($13.00 * 299) – ($0.65 * 300) = $$3,692$. Knowing our model will not be perfect, we will attempt to get as close to this gold standard as possible. As such, our predictions should be better than a random guess to ensure we reach a higher percentage of potential donors.

Optimal	Donor	No Donor
Donor	299	0
No Donor	0	300

TP <- 299
TN <- 300
(13.00 * TP)-(0.65 * TN)

## [1] 3692

##LDA LDA uses the Bayes classifier and a threshold of the posterior probability when assigning observations to a class. This class selection is done by estimating several parameters, including the weighted average of the sample variances and the number of observations.

The first LDA model uses 20 variables. lda.b uses a three-variable subset of the most important variables from the first LDA.

##Train Model
lda.fit.a=train(target~., data=train, method='lda',trControl = train_ctrl)
##Calculate Predictions
pred.lda.a<-predict(lda.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.lda.a,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      172      142
##   No Donor   127      158
##                                           
##                Accuracy : 0.5509          
##                  95% CI : (0.5101, 0.5912)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.007924        
##                                           
##                   Kappa : 0.1019          
##                                           
##  Mcnemar's Test P-Value : 0.393329        
##                                           
##             Sensitivity : 0.5753          
##             Specificity : 0.5267          
##          Pos Pred Value : 0.5478          
##          Neg Pred Value : 0.5544          
##              Prevalence : 0.4992          
##          Detection Rate : 0.2871          
##    Detection Prevalence : 0.5242          
##       Balanced Accuracy : 0.5510          
##                                           
##        'Positive' Class : Donor           
##

varImp(lda.fit.a, useModel = TRUE, scale = TRUE)

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                     Importance
## months_since_donate   100.0000
## last_gift              91.8258
## avg_gift               83.5865
## largest_gift           80.7298
## num_prom               53.0404
## income                 38.3934
## inc                    38.3934
## lifetime_gifts         34.6910
## home_value             28.7559
## zipconvert5            22.5035
## homeowner              17.5162
## time_lag               12.2503
## zipconvert2            10.0020
## female                  9.3212
## avg_fam_inc             8.2899
## num_child               7.9949
## med_fam_inc             5.8305
## zipconvert3             5.3870
## wealth                  0.3288
## zipconvert4             0.1601

LDA 1 expects a return of $$2,389.40$

Optimal	Donor	No Donor
Donor	192	136
No Donor	108	167

TP <- 192
TN <- 164
(13.00 * TP)-(0.65 * TN)

## [1] 2389.4

##Train Model
lda.fit.b=train(target~ avg_gift + lifetime_gifts + med_fam_inc + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, method='lda',trControl = train_ctrl)
##Calculate Predictions
pred.lda.b<-predict(lda.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.lda.b,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      171      158
##   No Donor   128      142
##                                           
##                Accuracy : 0.5225          
##                  95% CI : (0.4817, 0.5632)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.15351         
##                                           
##                   Kappa : 0.0452          
##                                           
##  Mcnemar's Test P-Value : 0.08638         
##                                           
##             Sensitivity : 0.5719          
##             Specificity : 0.4733          
##          Pos Pred Value : 0.5198          
##          Neg Pred Value : 0.5259          
##              Prevalence : 0.4992          
##          Detection Rate : 0.2855          
##    Detection Prevalence : 0.5492          
##       Balanced Accuracy : 0.5226          
##                                           
##        'Positive' Class : Donor           
##

LDA 2 expects a return of $$2,605.85$

Prediction	Donor	No Donor
Donor	207	169
No Donor	92	131

TP <- 207
TN <- 131
(13.00 * TP)-(0.65 * TN)

## [1] 2605.85

##Quadratic Discriminant Analysis (QDA) Quadratic Discriminant Analysis is similar to LDA in that it assumes the observations from each class are drawn from a normal distribution. It creates a prediction based on parameter estimates put into Bayes’ theorem. QDA differs in that it uses a quadratic function and creates its own covariance matrix, unlike LDA, making QDA more flexible with a lower variance dataset.

##Train Model
qda.fit.a=train(target~ homeowner +  num_child + income + female + home_value + med_fam_inc + avg_fam_inc + pct_lt15k + num_prom + lifetime_gifts + largest_gift + last_gift + months_since_donate + time_lag + avg_gift, data=train, method='qda',trControl = train_ctrl)
##Calculate Predictions
pred.qda.a<-predict(qda.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.qda.a,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      227      214
##   No Donor    72       86
##                                           
##                Accuracy : 0.5225          
##                  95% CI : (0.4817, 0.5632)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.1535          
##                                           
##                   Kappa : 0.0458          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7592          
##             Specificity : 0.2867          
##          Pos Pred Value : 0.5147          
##          Neg Pred Value : 0.5443          
##              Prevalence : 0.4992          
##          Detection Rate : 0.3790          
##    Detection Prevalence : 0.7362          
##       Balanced Accuracy : 0.5229          
##                                           
##        'Positive' Class : Donor           
##

names(t)

## NULL

QDA1 expects a return of $$2,895.10$

Prediction	Donor	No Donor
Donor	227	214
No Donor	72	86

TP <- 227
TN <- 86
(13.00 * TP)-(0.65 * TN)

## [1] 2895.1

##Train Model
qda.fit.b=train(target~ num_prom + avg_gift + home_value + pct_lt15k + time_lag + med_fam_inc + avg_fam_inc + last_gift + largest_gift, data=train, method='qda',trControl = train_ctrl)
##Calculate Predictions
pred.qda.b<-predict(qda.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.qda.b,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      213      188
##   No Donor    86      112
##                                          
##                Accuracy : 0.5426         
##                  95% CI : (0.5017, 0.583)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 0.02259        
##                                          
##                   Kappa : 0.0857         
##                                          
##  Mcnemar's Test P-Value : 1.05e-09       
##                                          
##             Sensitivity : 0.7124         
##             Specificity : 0.3733         
##          Pos Pred Value : 0.5312         
##          Neg Pred Value : 0.5657         
##              Prevalence : 0.4992         
##          Detection Rate : 0.3556         
##    Detection Prevalence : 0.6694         
##       Balanced Accuracy : 0.5429         
##                                          
##        'Positive' Class : Donor          
##

QDA2 expects a return of $$2,696.20$

Prediction	Donor	No Donor
Donor	213	188
No Donor	86	112

TP <- 213
TN <- 112
(13.00 * TP)-(0.65 * TN)

## [1] 2696.2

Logistic Regression

Logistic regression models the probability of a binary response based on one or more predictor variables. In our case, whether a direct-mail campaign will trigger a donation or will there be no ROI.

Log models were created based on earlier models. The first model includes 20 variables and the second reflecting 10 variables that were selected based upon significance.

##Train Model
glm.fit.a=train(target~., data=train, trControl = train_ctrl,  method = "glm",  family = "binomial")
##Calculate Predictions
pred.glm.a<-predict(glm.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.glm.a, test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      174      143
##   No Donor   125      157
##                                           
##                Accuracy : 0.5526          
##                  95% CI : (0.5118, 0.5929)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.00631         
##                                           
##                   Kappa : 0.1053          
##                                           
##  Mcnemar's Test P-Value : 0.29907         
##                                           
##             Sensitivity : 0.5819          
##             Specificity : 0.5233          
##          Pos Pred Value : 0.5489          
##          Neg Pred Value : 0.5567          
##              Prevalence : 0.4992          
##          Detection Rate : 0.2905          
##    Detection Prevalence : 0.5292          
##       Balanced Accuracy : 0.5526          
##                                           
##        'Positive' Class : Donor           
##

varImp(glm.fit.a, useModel = TRUE, scale = TRUE)

## glm variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##                     Overall
## months_since_donate 100.000
## income               36.877
## home_value           36.158
## num_child            35.341
## avg_fam_inc          25.954
## last_gift            18.067
## wealth7              16.870
## med_fam_inc          16.740
## inc3                 15.042
## num_prom             14.353
## homeownerNo          12.985
## inc5                 12.835
## wealth9              10.081
## wealth6               9.805
## wealth5               9.755
## wealth4               9.597
## wealth1               9.084
## largest_gift          6.952
## wealth3               4.761
## pct_lt15k             4.539

GLM 1 expects a return of $$2,159.95$

Prediction	Donor	No Donor
Donor	174	143
No Donor	125	157

TP <- 174
TN <- 157
(13.00 * TP)-(0.65 * TN)

## [1] 2159.95

##Train Model
glm.fit.b=train(target~ avg_gift + lifetime_gifts + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, trControl = train_ctrl,  method = "glm",  family = binomial)

##Calculate Predictions
pred.glm.b<-predict(glm.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.glm.b, test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      176      161
##   No Donor   123      139
##                                          
##                Accuracy : 0.5259         
##                  95% CI : (0.485, 0.5665)
##     No Information Rate : 0.5008         
##     P-Value [Acc > NIR] : 0.11801        
##                                          
##                   Kappa : 0.052          
##                                          
##  Mcnemar's Test P-Value : 0.02812        
##                                          
##             Sensitivity : 0.5886         
##             Specificity : 0.4633         
##          Pos Pred Value : 0.5223         
##          Neg Pred Value : 0.5305         
##              Prevalence : 0.4992         
##          Detection Rate : 0.2938         
##    Detection Prevalence : 0.5626         
##       Balanced Accuracy : 0.5260         
##                                          
##        'Positive' Class : Donor          
##

GLM 2 expects a return of $$2,468.70$

Prediction	Donor	No Donor
Donor	198	138
No Donor	102	162

TP <- 198
TN <- 162
(13.00 * TP)-(0.65 * TN)

## [1] 2468.7

K=Nearest Neighbors (KNN)

KNN assigns a weights so the nearer ‘neighbors’ contribute more to an average than observations further away. Smaller K value will provide the most flexible fit and have a low bias but high variance due to the prediction relying on one observation and vice versa, a larger K value will provide a smoother and less variable fit but may cause bias by masking some of the structure.

##Train Model
knn.fit.a=train(target~., data=train, method='knn',trControl = train_ctrl, tuneLength=30)
##Calculate Predictions
pred.knn.a<-predict(knn.fit.a,test)
##Estimate Accuracy
confusionMatrix(pred.knn.a,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      158      154
##   No Donor   141      146
##                                           
##                Accuracy : 0.5075          
##                  95% CI : (0.4667, 0.5483)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.3875          
##                                           
##                   Kappa : 0.0151          
##                                           
##  Mcnemar's Test P-Value : 0.4848          
##                                           
##             Sensitivity : 0.5284          
##             Specificity : 0.4867          
##          Pos Pred Value : 0.5064          
##          Neg Pred Value : 0.5087          
##              Prevalence : 0.4992          
##          Detection Rate : 0.2638          
##    Detection Prevalence : 0.5209          
##       Balanced Accuracy : 0.5075          
##                                           
##        'Positive' Class : Donor           
##

varImp(knn.fit.a, useModel = TRUE, scale = TRUE)

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                     Importance
## months_since_donate   100.0000
## last_gift              91.8258
## avg_gift               83.5865
## largest_gift           80.7298
## num_prom               53.0404
## income                 38.3934
## inc                    38.3934
## lifetime_gifts         34.6910
## home_value             28.7559
## zipconvert5            22.5035
## homeowner              17.5162
## time_lag               12.2503
## zipconvert2            10.0020
## female                  9.3212
## avg_fam_inc             8.2899
## num_child               7.9949
## med_fam_inc             5.8305
## zipconvert3             5.3870
## wealth                  0.3288
## zipconvert4             0.1601

KNN1 expects a return of $$2,468.70$

Prediction	Donor	No Donor
Donor	190	168
No Donor	110	132

TP <- 190
TN <- 132
(13.00 * TP)-(0.65 * TN)

## [1] 2384.2

##Train Model
knn.fit.b=train(target~ avg_gift + lifetime_gifts + avg_fam_inc + home_value + num_prom + pct_lt15k + months_since_donate + time_lag + last_gift, data=train, method='knn',trControl = train_ctrl, tuneLength=30)
##Calculate Predictions
pred.knn.b<-predict(knn.fit.b,test)
##Estimate Accuracy
confusionMatrix(pred.knn.b,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      172      164
##   No Donor   127      136
##                                           
##                Accuracy : 0.5142          
##                  95% CI : (0.4733, 0.5549)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : 0.27000         
##                                           
##                   Kappa : 0.0286          
##                                           
##  Mcnemar's Test P-Value : 0.03483         
##                                           
##             Sensitivity : 0.5753          
##             Specificity : 0.4533          
##          Pos Pred Value : 0.5119          
##          Neg Pred Value : 0.5171          
##              Prevalence : 0.4992          
##          Detection Rate : 0.2871          
##    Detection Prevalence : 0.5609          
##       Balanced Accuracy : 0.5143          
##                                           
##        'Positive' Class : Donor           
##

KNN2 expects a return of $$2,216.50$

Prediction	Donor	No Donor
Donor	178	150
No Donor	122	150

TP <- 178
TN <- 150
(13.00 * TP)-(0.65 * TN)

## [1] 2216.5

Support Vector Machine

Support vector machines are intended for binary classifications such as donor and when there are non-linear class boundaries which is addresses by enlarging the feature space using polynomial functions as predictors using kernels. A kernel is a function that quantifies the similarity of two observations. A radial kernel is effected by nearby training observations but there are also other types including linear, polynomial and sigmoid. In our dataset the radial kernel is performing better.

#Tunning the SVC
poly.tune.out <- tune(svm, target ~ ., data = train, kernel = "polynomial", ranges = list(cost = c(0.1,1, 5, 10, 20), degree = c(2, 3, 4, 5)))
summary(poly.tune.out)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##     1      2
## 
## - best performance: 0.4531362 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1   0.1      2 0.4839678 0.02369064
## 2   1.0      2 0.4531362 0.02487941
## 3   5.0      2 0.4602230 0.02897757
## 4  10.0      2 0.4635615 0.02629239
## 5  20.0      2 0.4668966 0.02948750
## 6   0.1      3 0.4968724 0.02529140
## 7   1.0      3 0.4677282 0.02565760
## 8   5.0      3 0.4618793 0.03475451
## 9  10.0      3 0.4618811 0.03059307
## 10 20.0      3 0.4627092 0.03654283
## 11  0.1      4 0.5027075 0.01583733
## 12  1.0      4 0.4877144 0.01438082
## 13  5.0      4 0.4743897 0.01264273
## 14 10.0      4 0.4764575 0.02142671
## 15 20.0      4 0.4602196 0.03688775
## 16  0.1      5 0.5052075 0.01587398
## 17  1.0      5 0.4964609 0.02088036
## 18  5.0      5 0.4927178 0.01731685
## 19 10.0      5 0.4839713 0.01987610
## 20 20.0      5 0.4806328 0.02299150

rad.tune.out <- tune(svm, target ~ ., data = train, kernel = "radial", ranges = list(cost = c(0.1, 1, 5, 10, 20), gamma = c(0.01, 0.1, 1, 5, 10)))
summary(rad.tune.out)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1  0.01
## 
## - best performance: 0.442742 
## 
## - Detailed performance results:
##    cost gamma     error dispersion
## 1   0.1  0.01 0.4452351 0.02659737
## 2   1.0  0.01 0.4427420 0.02645096
## 3   5.0  0.01 0.4556535 0.02535353
## 4  10.0  0.01 0.4623185 0.03146988
## 5  20.0  0.01 0.4685615 0.02264571
## 6   0.1  0.10 0.4493966 0.02026974
## 7   1.0  0.10 0.4652092 0.02012967
## 8   5.0  0.10 0.4960443 0.03176939
## 9  10.0  0.10 0.5077075 0.03203582
## 10 20.0  0.10 0.5114454 0.03481461
## 11  0.1  1.00 0.5285356 0.02178929
## 12  1.0  1.00 0.5127144 0.02303358
## 13  5.0  1.00 0.5056380 0.02813914
## 14 10.0  1.00 0.5056380 0.02813914
## 15 20.0  1.00 0.5056380 0.02813914
## 16  0.1  5.00 0.5285356 0.02178929
## 17  1.0  5.00 0.5226988 0.02531526
## 18  5.0  5.00 0.5239471 0.03000081
## 19 10.0  5.00 0.5239471 0.03000081
## 20 20.0  5.00 0.5239471 0.03000081
## 21  0.1 10.00 0.5285356 0.02178929
## 22  1.0 10.00 0.5247873 0.02386694
## 23  5.0 10.00 0.5247804 0.02353730
## 24 10.0 10.00 0.5247804 0.02353730
## 25 20.0 10.00 0.5247804 0.02353730

svm.poly <- svm(target ~ ., data = train, kernel = "polynomial", cost = poly.tune.out$best.parameters$cost)
train.pred <- predict(svm.poly, test)
table(test$target, train.pred)

##           train.pred
##            Donor No Donor
##   Donor       41      258
##   No Donor    46      254

confusionMatrix(test$target, train.pred)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       41      258
##   No Donor    46      254
##                                           
##                Accuracy : 0.4925          
##                  95% CI : (0.4517, 0.5333)
##     No Information Rate : 0.8548          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0162         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.47126         
##             Specificity : 0.49609         
##          Pos Pred Value : 0.13712         
##          Neg Pred Value : 0.84667         
##              Prevalence : 0.14524         
##          Detection Rate : 0.06845         
##    Detection Prevalence : 0.49917         
##       Balanced Accuracy : 0.48368         
##                                           
##        'Positive' Class : Donor           
##

SVM Poly expects a return of $$2,576.60$

Prediction	Donor	No Donor
Donor	204	86
No Donor	184	116

TP <- 204
TN <- 116
(13.00 * TP)-(0.65 * TN)

## [1] 2576.6

svm.radial <- svm(target ~ ., data = train, kernel = "radial", cost = rad.tune.out$best.parameters$cost)
train.pred <- predict(svm.radial, test)
table(test$target, train.pred)

##           train.pred
##            Donor No Donor
##   Donor      186      113
##   No Donor   160      140

confusionMatrix(test$target, train.pred)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      186      113
##   No Donor   160      140
##                                           
##                Accuracy : 0.5442          
##                  95% CI : (0.5034, 0.5847)
##     No Information Rate : 0.5776          
##     P-Value [Acc > NIR] : 0.954700        
##                                           
##                   Kappa : 0.0887          
##                                           
##  Mcnemar's Test P-Value : 0.005369        
##                                           
##             Sensitivity : 0.5376          
##             Specificity : 0.5534          
##          Pos Pred Value : 0.6221          
##          Neg Pred Value : 0.4667          
##              Prevalence : 0.5776          
##          Detection Rate : 0.3105          
##    Detection Prevalence : 0.4992          
##       Balanced Accuracy : 0.5455          
##                                           
##        'Positive' Class : Donor           
##

SVM Radial expects a return of $$2,385.50$

Prediction	Donor	No Donor
Donor	175	125
No Donor	144	156

TP <- 191
TN <- 150
FP <- 150
FN <- 109
(13.00 * TP)-(0.65 * TN)

## [1] 2385.5

Findings and Recommendations

We investigated five classification models; Logistic Regression, LDA, QDA, KNN, and Support Vector Machine). As anticipated results varied from training and test / validation data and from each model we were shown different results on how to maximize the gross margin. We will assess this in two ways, first by accuracy and then by potential gross margin.

Based upon the accuracy, the top three models are:

GLM 2 with $10$ predictors achieved a $60%$ accuracy rate
LDA 2 with $3$ predictors achieved a $59.3%$ accuracy rate
GLM 1 with all $20$ predictors achieved a $55.3%$ accuracy rate

Based on Gross Margin, the top three are:

QDA 1 with $15$ predictors projects an ROI of $$2,895.10$ for the test dataset
QDA 2 with $9$ predictors projects an ROI of $$2,696.20$ for the test dataset
SVM Poly with all $20$ predictors projects an ROI of $$2,576.60$ for the test dataset

There are many other types of classification that were not addressed in this assessment. Even with the models observed today there are may transformations and hyperperamiters we can tweak to improve the accuracy and by effect, increase ROI for direct-mail campaign. There are also multi-class classifiers if in the future an on-watch type of list is desired. There is also room to add other variables such as advertisement expenditure and advertisement media the may help pinpoint a willing donor population.

For the scope of this study, we recommend the use of the QDA model as it is currently projecting the largest ROI and it achieved the highest accuracy rate $52.3%$ of the top three ROI projectors. We also recommend that further analysis be conducted on on Random Forests and Generalized Additive Models as they show promise in classification problems. Finally, there is room for improvement with the models presented, we suggest these findings be expanded upon in the upcoming quarter to allow room for ROI growth potential.