Load Libraries

library(readr)         #read_rds()
library(corrplot)      #correlation matrix plots
library(tree)
library(tidyverse)     #dplyr, ggplot2, forcats (fct_rev()), purrr (keep())
library(caret)         #train(), varImp(), confusionMatrix()
library(leaps)         #regsubsets()
library(e1071)         #svm(), tune()
library(randomForest)


Background


A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
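As a quick back-of-the-envelope check on these economics (a sketch using only the figures above), the expected net profit per piece under a blanket mailing is the response rate times the average donation, minus the cost per piece; the break-even response rate is the cost divided by the average donation:

cost_per_mailing <- 0.68   #Cost to produce and send each piece
avg_donation <- 13.00      #Average gift among responders
response_rate <- 0.051     #Overall response rate

#Expected net profit per piece for a blanket mailing
response_rate * avg_donation - cost_per_mailing  #-0.017, a small expected loss

#Break-even response rate
cost_per_mailing / avg_donation  #~0.0523, just above the observed 5.1%

This is why targeting matters: a blanket mailing loses money on average, so even a modest lift in the response rate among those actually mailed turns the campaign profitable.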



Objectives & Goals


The objective of this business case analysis is to use predictive modeling to predict, as accurately as possible, which of the veteran organization's members are likely to respond to direct mail solicitations for donations. The overall goal is to use this information to maximize return on investment (ROI) by limiting direct mail overhead, that is, by reducing the total number of mailings. Mailing only to likely donors improves the cost-per-donation ratio and therefore the ROI.




Data


The file fundraising.rds contains 3,000 records with approximately 50% donors (target = Donor) and 50% non-donors (target = No Donor). The data set used a weighted sample rather than a random sample to avoid bias in the classification model: because a significant majority of direct mail recipients did not donate, a simple random sample could have produced a model skewed toward predicting No Donor. The dataset contains 21 variables (although the course material states 22, zipconvert1 is actually missing from the datasets); the descriptions of the variables are listed below.

Zip code group: Broken into five groups; Yes = the potential donor belongs to this zip group.
00000–19999 ⇒ zipconvert1 Not included in dataset
20000–39999 ⇒ zipconvert2
40000–59999 ⇒ zipconvert3
60000–79999 ⇒ zipconvert4
80000–99999 ⇒ zipconvert5
homeowner: Yes = homeowner, No = not a homeowner
num_child: Number of children
income: Household income
female: No = male, Yes = female
wealth: Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and 0 the lowest. Each rating has a different meaning within each state.
home_value: Average home value in potential donor’s neighborhood in hundreds of dollars
med_fam_inc: Median family income in potential donor’s neighborhood in hundreds of dollars
avg_fam_inc: Average family income in potential donor’s neighborhood in hundreds of dollars
pct_lt15k: Percent earning less than $15K in potential donor’s neighborhood
num_prom: Lifetime number of promotions received to date
lifetime_gifts: Dollar amount of lifetime gifts to date
largest_gift: Dollar amount of largest gift to date
last_gift: Dollar amount of most recent gift
months_since_donate: Number of months from last donation to July 2018
time_lag: Number of months between first and second gift
avg_gift: Average dollar amount of gifts to date
target (Outcome variable): binary indicator for response; Donor = donor, No Donor = non-donor

During model development and testing, the fundraising.rds dataset will be partitioned into train and test datasets, using a 75%/25% split. Once the best model is selected, it will be used to predict outcomes on an additional dataset, future_fundraising.rds; this dataset contains all of the same variables as fundraising.rds, with the exception of target (note: This dataset is also missing zipconvert1).


Load & Partition Data (Step 1)

fundraising <- read_rds('fundraising.rds')
future_fundraising <- read_rds('future_fundraising.rds')





Model Building


Exploratory Data Analysis


Exploring predictors

#Look for missing values
any(is.na(fundraising))
## [1] FALSE
#Review a summary of the data
summary(fundraising)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167
#Review the structure of the data
str(fundraising)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...


When observing the structure of the data, it appears that the zipconvert3 factor was created with its levels ordered “Yes”/“No” rather than “No”/“Yes”. This should not affect our modeling, but for the sake of consistency, and to reduce any potential confusion when looking at results, I am electing to flip the factor levels for zipconvert3. For the same reason, I elected to flip the levels for homeowner and female as well.


#Flip Factors
fundraising$zipconvert3 <- fct_rev(fundraising$zipconvert3)
fundraising$homeowner <- fct_rev(fundraising$homeowner)
fundraising$female <- fct_rev(fundraising$female)

#Review Structure again
str(fundraising)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 2 2 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
#Create a dataframe with only numeric predictors for examination
fundraising_numeric <- fundraising[,-1] %>% keep(is.numeric) %>% gather()

#Plot numeric predictors
ggplot(fundraising_numeric, aes(value)) + 
    facet_wrap(~ key, scales = 'free') + 
    geom_density(colour = 'blue') + 
    theme(axis.text = element_text(color = 'blue'))


There appears to be collinearity (as one might expect) between avg_fam_inc, med_fam_inc, and home_value. To better understand these relationships and check for collinearity, I will use a correlation matrix.


#Plot correlation matrix of numerical factors
fundraising_numeric2 <- fundraising[,-1] %>% keep(is.numeric)
corrplot(cor(fundraising_numeric2), method = 'number', type = 'lower')


The correlation matrix confirms the strong correlation between avg_fam_inc and med_fam_inc, as expected. It also shows a fairly strong correlation between those two variables and home_value, as well as a notable correlation between last_gift and avg_gift.


Drop highly correlated predictors

#Drop avg_fam_inc (column 12) and last_gift (column 17), keeping med_fam_inc and avg_gift
fundraising <- subset(fundraising, select = c(1:11,13:16,18:21))
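As a more programmatic alternative (a sketch, not what was run here), caret's findCorrelation() can flag the offending columns by name, which is less fragile than dropping by column index; the 0.8 cutoff is an illustrative choice:

#Hypothetical alternative: flag highly correlated numeric columns by name
num_cols <- fundraising %>% keep(is.numeric)
high_corr <- findCorrelation(cor(num_cols), cutoff = 0.8, names = TRUE)
high_corr  #candidates to drop
#fundraising <- fundraising %>% select(-all_of(high_corr))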


Predictor Selection


Split dataset into test and train

set.seed(12345)
in_train <- sample(nrow(fundraising), 2250)  #75% of the 3,000 rows
train <- fundraising[in_train,]
test <- fundraising[-in_train,]
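A quick check, not in the original output, that the random split keeps the two classes roughly balanced in both partitions:

#Class proportions in each partition
prop.table(table(train$target))
prop.table(table(test$target))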


Use Random Forest to help identify the best predictors

set.seed(12345)
#Fit on a 10% sample for speed; caret's train() defaults to a random forest (method = 'rf')
samp <- sample(nrow(fundraising), nrow(fundraising)*0.1)
rf_fundraising <- train(target~., data = fundraising[samp,])
rf_fundraising_varImp <- varImp(rf_fundraising)
plot(rf_fundraising_varImp)


Use Subset Selection to evaluate predictors

set.seed(12345)



fwd_step <-  regsubsets(target ~ ., data = fundraising, method = 'forward')

fwd_step_summary <- summary(fwd_step)
fwd_step_summary
## Subset selection object
## Call: regsubsets.formula(target ~ ., data = fundraising, method = "forward")
## 18 Variables  (and intercept)
##                     Forced in Forced out
## zipconvert2Yes          FALSE      FALSE
## zipconvert3Yes          FALSE      FALSE
## zipconvert4Yes          FALSE      FALSE
## zipconvert5Yes          FALSE      FALSE
## homeownerYes            FALSE      FALSE
## num_child               FALSE      FALSE
## income                  FALSE      FALSE
## femaleYes               FALSE      FALSE
## wealth                  FALSE      FALSE
## home_value              FALSE      FALSE
## med_fam_inc             FALSE      FALSE
## pct_lt15k               FALSE      FALSE
## num_prom                FALSE      FALSE
## lifetime_gifts          FALSE      FALSE
## largest_gift            FALSE      FALSE
## months_since_donate     FALSE      FALSE
## time_lag                FALSE      FALSE
## avg_gift                FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          zipconvert2Yes zipconvert3Yes zipconvert4Yes zipconvert5Yes
## 1  ( 1 ) " "            " "            " "            " "           
## 2  ( 1 ) " "            " "            " "            " "           
## 3  ( 1 ) " "            " "            " "            " "           
## 4  ( 1 ) " "            " "            " "            " "           
## 5  ( 1 ) " "            " "            " "            " "           
## 6  ( 1 ) " "            " "            " "            " "           
## 7  ( 1 ) " "            " "            " "            " "           
## 8  ( 1 ) " "            " "            " "            " "           
##          homeownerYes num_child income femaleYes wealth home_value med_fam_inc
## 1  ( 1 ) " "          " "       " "    " "       " "    " "        " "        
## 2  ( 1 ) " "          " "       " "    " "       " "    " "        " "        
## 3  ( 1 ) " "          " "       "*"    " "       " "    " "        " "        
## 4  ( 1 ) " "          "*"       "*"    " "       " "    " "        " "        
## 5  ( 1 ) " "          "*"       "*"    " "       " "    " "        " "        
## 6  ( 1 ) " "          "*"       "*"    " "       " "    "*"        " "        
## 7  ( 1 ) " "          "*"       "*"    " "       " "    "*"        " "        
## 8  ( 1 ) " "          "*"       "*"    " "       "*"    "*"        " "        
##          pct_lt15k num_prom lifetime_gifts largest_gift months_since_donate
## 1  ( 1 ) " "       " "      " "            " "          "*"                
## 2  ( 1 ) " "       " "      " "            " "          "*"                
## 3  ( 1 ) " "       " "      " "            " "          "*"                
## 4  ( 1 ) " "       " "      " "            " "          "*"                
## 5  ( 1 ) " "       "*"      " "            " "          "*"                
## 6  ( 1 ) " "       "*"      " "            " "          "*"                
## 7  ( 1 ) "*"       "*"      " "            " "          "*"                
## 8  ( 1 ) "*"       "*"      " "            " "          "*"                
##          time_lag avg_gift
## 1  ( 1 ) " "      " "     
## 2  ( 1 ) " "      "*"     
## 3  ( 1 ) " "      "*"     
## 4  ( 1 ) " "      "*"     
## 5  ( 1 ) " "      "*"     
## 6  ( 1 ) " "      "*"     
## 7  ( 1 ) " "      "*"     
## 8  ( 1 ) " "      "*"
par(mfrow = c(2,2))
#Plot 1
plot(fwd_step_summary$cp, xlab = '# of Predictors', ylab = 'Cp')
min_cp <-  min(fwd_step_summary$cp)
std_cp <-  sd(fwd_step_summary$cp)


#Plot 2
plot(fwd_step_summary$bic,xlab = '# of Predictors', ylab = 'BIC')
min_bic <-  min(fwd_step_summary$bic)
std_bic <-  sd(fwd_step_summary$bic)


#Plot 3
plot(fwd_step_summary$adjr2, xlab = '# of Predictors', ylab = 'Adjusted R2')
max_adjr2 <-  max(fwd_step_summary$adjr2)
std_adjr2 <-  sd(fwd_step_summary$adjr2)


predictors <- (names(coef(fwd_step, 5)))[2:6]
predictors
## [1] "num_child"           "income"              "num_prom"           
## [4] "months_since_donate" "avg_gift"


Looking at the forward step charts and the Random Forest importance chart, it appears that four or five predictors would work best. As five predictors provided a slightly better Adjusted \(R^{2}\), I elected to use five predictors in my analysis. However, the Random Forest importance chart and the forward step function found different predictors to be important. As a result, I have elected to build models using the top 5 predictors from each tool.
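For reference, the subset size favored by each criterion can be read off the summary object directly rather than eyeballed from the charts:

#Best subset size under each criterion
which.min(fwd_step_summary$cp)     #Mallows' Cp: smaller is better
which.min(fwd_step_summary$bic)    #BIC: smaller is better
which.max(fwd_step_summary$adjr2)  #Adjusted R2: larger is better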


Saving the top 5 predictors from each tool as a separate formula

#Top 5 predictors from forward stepwise selection
fwd_step_formula <- as.formula(target ~ num_child + income + num_prom + months_since_donate + avg_gift)
#Top 5 predictors from the random forest importance ranking
rf_formula <- as.formula(target ~ avg_gift + lifetime_gifts + med_fam_inc + home_value + num_prom)



Training & Validating Models


SVM Models

Fitting a Linear SVM

set.seed(12345)

tune_svm1 = tune(svm, fwd_step_formula, data = train, kernel = 'linear', ranges = list(cost=c(0.01, .1, .5, 1, 10)))

summary(tune_svm1) 
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.01
## 
## - best performance: 0.4391111 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1  0.01 0.4391111 0.03422190
## 2  0.10 0.4444444 0.02763668
## 3  0.50 0.4408889 0.02494878
## 4  1.00 0.4413333 0.02479435
## 5 10.00 0.4400000 0.02531557
tune_svm2 = tune(svm, rf_formula, data = train, kernel = 'linear', ranges = list(cost=c(0.01, .1, .5, 1, 10)))

summary(tune_svm2) 
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.01
## 
## - best performance: 0.4591111 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1  0.01 0.4591111 0.02591958
## 2  0.10 0.4600000 0.02209842
## 3  0.50 0.4608889 0.02333510
## 4  1.00 0.4613333 0.02340553
## 5 10.00 0.4608889 0.02370833


Fitting a polynomial SVM with various degrees

set.seed(12345)

tune_svm3 <- tune(svm, fwd_step_formula, data = train, kernel = 'polynomial', ranges = list(cost = c(.001, 0.1, 
    1), degree = c(2, 3)))

summary(tune_svm3)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##     1      3
## 
## - best performance: 0.4648889 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1 0.001      2 0.5275556 0.02779900
## 2 0.100      2 0.5213333 0.02305120
## 3 1.000      2 0.5102222 0.03466793
## 4 0.001      3 0.5266667 0.02693283
## 5 0.100      3 0.4800000 0.04354648
## 6 1.000      3 0.4648889 0.03636124
tune_svm4 <- tune(svm, rf_formula, data = train, kernel = 'polynomial', ranges = list(cost = c(.001, 0.1, 
    1), degree = c(2, 3, 4)))

summary(tune_svm4)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##     1      4
## 
## - best performance: 0.504 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1 0.001      2 0.5204444 0.02694912
## 2 0.100      2 0.5240000 0.02201882
## 3 1.000      2 0.5297778 0.01493286
## 4 0.001      3 0.5217778 0.02725282
## 5 0.100      3 0.5266667 0.01455325
## 6 1.000      3 0.5182222 0.00843274
## 7 0.001      4 0.5244444 0.02052801
## 8 0.100      4 0.5182222 0.02011762
## 9 1.000      4 0.5040000 0.02480763


Fitting a radial SVM with various gammas

set.seed(12345)

tune_svm5 <- tune(svm, fwd_step_formula, data = train, kernel = 'radial', ranges = list(cost = c(.001, 0.1, .5, 1), gamma = c(1, 2)))

summary(tune_svm5)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   0.5     1
## 
## - best performance: 0.4408889 
## 
## - Detailed performance results:
##    cost gamma     error dispersion
## 1 0.001     1 0.5297778 0.02824933
## 2 0.100     1 0.4466667 0.04284787
## 3 0.500     1 0.4408889 0.04646281
## 4 1.000     1 0.4417778 0.05375145
## 5 0.001     2 0.5297778 0.02824933
## 6 0.100     2 0.4613333 0.04890236
## 7 0.500     2 0.4497778 0.05199135
## 8 1.000     2 0.4560000 0.05599138
tune_svm6 <- tune(svm, rf_formula, data = train, kernel = 'radial', ranges = list(cost = c(.001, 0.1, .5, 1), gamma = c(1, 2)))

summary(tune_svm6)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   0.5     2
## 
## - best performance: 0.4808889 
## 
## - Detailed performance results:
##    cost gamma     error dispersion
## 1 0.001     1 0.5208889 0.02730110
## 2 0.100     1 0.4817778 0.03992861
## 3 0.500     1 0.4853333 0.03473118
## 4 1.000     1 0.4813333 0.03359078
## 5 0.001     2 0.5208889 0.02730110
## 6 0.100     2 0.4924444 0.02983632
## 7 0.500     2 0.4808889 0.03434993
## 8 1.000     2 0.4826667 0.02874227


For the linear models, the best cost was 0.01 for both the forward step predictors and the random forest predictors.
For the polynomial models, the best cost was 1 for both formulas, with a best degree of 3 for the forward step predictors and 4 for the random forest predictors.
For the radial models, the best cost was 0.5 for both formulas, with a best gamma of 1 for the forward step predictors and 2 for the random forest predictors.
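These values can also be pulled straight from the tune objects rather than transcribed by hand; e1071 stores the winning settings, the associated cross-validation error, and a refitted model on each result:

#Best hyperparameters and CV error for one tuning run
tune_svm1$best.parameters
tune_svm1$best.performance

#tune() also keeps a model refitted at the best settings,
#e.g. tune_svm5$best.model is the radial SVM at cost = 0.5, gamma = 1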

Train the SVM models using their respective best parameters

svm_linear_fs <- svm(fwd_step_formula, data = train, kernel = 'linear', cost = 0.01)
svm_linear_rf <- svm(rf_formula, data = train, kernel = 'linear', cost = 0.01)
svm_poly_fs <- svm(fwd_step_formula, data = train, kernel = 'polynomial', cost = 1, degree = 3)
svm_poly_rf <- svm(rf_formula, data = train, kernel = 'polynomial', cost = 1, degree = 4)
svm_radial_fs <- svm(fwd_step_formula, data = train, kernel = 'radial', cost = 0.5, gamma = 1)
svm_radial_rf <- svm(rf_formula, data = train, kernel = 'radial', cost = 0.5, gamma = 2)


Build Confusion Matrices to find the train error rates

svm_pred_train <- predict(svm_linear_fs, train)
CM_svm_linear_fs <- confusionMatrix(table(train$target, svm_pred_train))

svm_pred_train <- predict(svm_linear_rf, train)
CM_svm_linear_rf <- confusionMatrix(table(train$target, svm_pred_train))

svm_pred_train <- predict(svm_poly_fs, train)
CM_svm_poly_fs <- confusionMatrix(table(train$target, svm_pred_train))

svm_pred_train <- predict(svm_poly_rf, train)
CM_svm_poly_rf <- confusionMatrix(table(train$target, svm_pred_train))

svm_pred_train <- predict(svm_radial_fs, train)
CM_svm_radial_fs <- confusionMatrix(table(train$target, svm_pred_train))

svm_pred_train <- predict(svm_radial_rf, train)
CM_svm_radial_rf <- confusionMatrix(table(train$target, svm_pred_train))


Confusion Matrix Linear SVM for Forward Step Predictors - Train

CM_svm_linear_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      694      434
##   No Donor   526      596
##                                           
##                Accuracy : 0.5733          
##                  95% CI : (0.5526, 0.5939)
##     No Information Rate : 0.5422          
##     P-Value [Acc > NIR] : 0.001608        
##                                           
##                   Kappa : 0.1465          
##                                           
##  Mcnemar's Test P-Value : 0.003314        
##                                           
##             Sensitivity : 0.5689          
##             Specificity : 0.5786          
##          Pos Pred Value : 0.6152          
##          Neg Pred Value : 0.5312          
##              Prevalence : 0.5422          
##          Detection Rate : 0.3084          
##    Detection Prevalence : 0.5013          
##       Balanced Accuracy : 0.5737          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix Linear SVM for Random Forest Predictors - Train

CM_svm_linear_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      691      437
##   No Donor   525      597
##                                          
##                Accuracy : 0.5724         
##                  95% CI : (0.5517, 0.593)
##     No Information Rate : 0.5404         
##     P-Value [Acc > NIR] : 0.001222       
##                                          
##                   Kappa : 0.1447         
##                                          
##  Mcnemar's Test P-Value : 0.005032       
##                                          
##             Sensitivity : 0.5683         
##             Specificity : 0.5774         
##          Pos Pred Value : 0.6126         
##          Neg Pred Value : 0.5321         
##              Prevalence : 0.5404         
##          Detection Rate : 0.3071         
##    Detection Prevalence : 0.5013         
##       Balanced Accuracy : 0.5728         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Polynomial SVM for Forward Step Predictors - Train

CM_svm_poly_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      896      232
##   No Donor   736      386
##                                          
##                Accuracy : 0.5698         
##                  95% CI : (0.549, 0.5904)
##     No Information Rate : 0.7253         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1385         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.5490         
##             Specificity : 0.6246         
##          Pos Pred Value : 0.7943         
##          Neg Pred Value : 0.3440         
##              Prevalence : 0.7253         
##          Detection Rate : 0.3982         
##    Detection Prevalence : 0.5013         
##       Balanced Accuracy : 0.5868         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Polynomial SVM for Random Forest Predictors - Train

CM_svm_poly_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      896      232
##   No Donor   736      386
##                                          
##                Accuracy : 0.5698         
##                  95% CI : (0.549, 0.5904)
##     No Information Rate : 0.7253         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1385         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.5490         
##             Specificity : 0.6246         
##          Pos Pred Value : 0.7943         
##          Neg Pred Value : 0.3440         
##              Prevalence : 0.7253         
##          Detection Rate : 0.3982         
##    Detection Prevalence : 0.5013         
##       Balanced Accuracy : 0.5868         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Radial SVM for Forward Step Predictors - Train

CM_svm_radial_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      755      373
##   No Donor   452      670
##                                          
##                Accuracy : 0.6333         
##                  95% CI : (0.613, 0.6533)
##     No Information Rate : 0.5364         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2665         
##                                          
##  Mcnemar's Test P-Value : 0.006615       
##                                          
##             Sensitivity : 0.6255         
##             Specificity : 0.6424         
##          Pos Pred Value : 0.6693         
##          Neg Pred Value : 0.5971         
##              Prevalence : 0.5364         
##          Detection Rate : 0.3356         
##    Detection Prevalence : 0.5013         
##       Balanced Accuracy : 0.6339         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Radial SVM for Random Forest Predictors - Train

CM_svm_radial_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_train
##            Donor No Donor
##   Donor      787      341
##   No Donor   452      670
##                                           
##                Accuracy : 0.6476          
##                  95% CI : (0.6274, 0.6673)
##     No Information Rate : 0.5507          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2949          
##                                           
##  Mcnemar's Test P-Value : 9.375e-05       
##                                           
##             Sensitivity : 0.6352          
##             Specificity : 0.6627          
##          Pos Pred Value : 0.6977          
##          Neg Pred Value : 0.5971          
##              Prevalence : 0.5507          
##          Detection Rate : 0.3498          
##    Detection Prevalence : 0.5013          
##       Balanced Accuracy : 0.6489          
##                                           
##        'Positive' Class : Donor           
## 


Build Confusion Matrices to find the test error rates

svm_pred_test <- predict(svm_linear_fs, test)
CM_svm_linear_fs <- confusionMatrix(table(test$target, svm_pred_test))

svm_pred_test <- predict(svm_linear_rf, test)
CM_svm_linear_rf <- confusionMatrix(table(test$target, svm_pred_test))

svm_pred_test <- predict(svm_poly_fs, test)
CM_svm_poly_fs <- confusionMatrix(table(test$target, svm_pred_test))

svm_pred_test <- predict(svm_poly_rf, test)
CM_svm_poly_rf <- confusionMatrix(table(test$target, svm_pred_test))

svm_pred_test <- predict(svm_radial_fs, test)
CM_svm_radial_fs <- confusionMatrix(table(test$target, svm_pred_test))

svm_pred_test <- predict(svm_radial_rf, test)
CM_svm_radial_rf <- confusionMatrix(table(test$target, svm_pred_test))


Confusion Matrix Linear SVM for Forward Step Predictors - Test

CM_svm_linear_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      224      147
##   No Donor   189      190
##                                          
##                Accuracy : 0.552          
##                  95% CI : (0.5156, 0.588)
##     No Information Rate : 0.5507         
##     P-Value [Acc > NIR] : 0.4859         
##                                          
##                   Kappa : 0.105          
##                                          
##  Mcnemar's Test P-Value : 0.0253         
##                                          
##             Sensitivity : 0.5424         
##             Specificity : 0.5638         
##          Pos Pred Value : 0.6038         
##          Neg Pred Value : 0.5013         
##              Prevalence : 0.5507         
##          Detection Rate : 0.2987         
##    Detection Prevalence : 0.4947         
##       Balanced Accuracy : 0.5531         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Linear SVM for Random Forest Predictors - Test

CM_svm_linear_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      225      146
##   No Donor   188      191
##                                           
##                Accuracy : 0.5547          
##                  95% CI : (0.5183, 0.5906)
##     No Information Rate : 0.5507          
##     P-Value [Acc > NIR] : 0.42767         
##                                           
##                   Kappa : 0.1103          
##                                           
##  Mcnemar's Test P-Value : 0.02487         
##                                           
##             Sensitivity : 0.5448          
##             Specificity : 0.5668          
##          Pos Pred Value : 0.6065          
##          Neg Pred Value : 0.5040          
##              Prevalence : 0.5507          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.4947          
##       Balanced Accuracy : 0.5558          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix Polynomial SVM for Forward Step Predictors - Test

CM_svm_poly_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      288       83
##   No Donor   267      112
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.4969, 0.5695)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0714          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5189          
##             Specificity : 0.5744          
##          Pos Pred Value : 0.7763          
##          Neg Pred Value : 0.2955          
##              Prevalence : 0.7400          
##          Detection Rate : 0.3840          
##    Detection Prevalence : 0.4947          
##       Balanced Accuracy : 0.5466          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix Polynomial SVM for Random Forest Predictors - Test

CM_svm_poly_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      288       83
##   No Donor   267      112
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.4969, 0.5695)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0714          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5189          
##             Specificity : 0.5744          
##          Pos Pred Value : 0.7763          
##          Neg Pred Value : 0.2955          
##              Prevalence : 0.7400          
##          Detection Rate : 0.3840          
##    Detection Prevalence : 0.4947          
##       Balanced Accuracy : 0.5466          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix Radial SVM for Forward Step Predictors - Test

CM_svm_radial_fs
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      218      153
##   No Donor   179      200
##                                          
##                Accuracy : 0.5573         
##                  95% CI : (0.521, 0.5933)
##     No Information Rate : 0.5293         
##     P-Value [Acc > NIR] : 0.0667         
##                                          
##                   Kappa : 0.1152         
##                                          
##  Mcnemar's Test P-Value : 0.1700         
##                                          
##             Sensitivity : 0.5491         
##             Specificity : 0.5666         
##          Pos Pred Value : 0.5876         
##          Neg Pred Value : 0.5277         
##              Prevalence : 0.5293         
##          Detection Rate : 0.2907         
##    Detection Prevalence : 0.4947         
##       Balanced Accuracy : 0.5578         
##                                          
##        'Positive' Class : Donor          
## 


Confusion Matrix Radial SVM for Random Forest Predictors - Test

CM_svm_radial_rf
## Confusion Matrix and Statistics
## 
##           svm_pred_test
##            Donor No Donor
##   Donor      221      150
##   No Donor   188      191
##                                           
##                Accuracy : 0.5493          
##                  95% CI : (0.5129, 0.5854)
##     No Information Rate : 0.5453          
##     P-Value [Acc > NIR] : 0.42770         
##                                           
##                   Kappa : 0.0995          
##                                           
##  Mcnemar's Test P-Value : 0.04416         
##                                           
##             Sensitivity : 0.5403          
##             Specificity : 0.5601          
##          Pos Pred Value : 0.5957          
##          Neg Pred Value : 0.5040          
##              Prevalence : 0.5453          
##          Detection Rate : 0.2947          
##    Detection Prevalence : 0.4947          
##       Balanced Accuracy : 0.5502          
##                                           
##        'Positive' Class : Donor           
## 


In the train validation, the Radial SVM using the predictors selected by the Random Forest performed the best, with an accuracy rate of 64.8%; however, it did not hold up against the test dataset, where its accuracy was only 54.9%. Against the test dataset, the Radial SVM using the Forward Step predictors did the best, with a 55.7% accuracy rate.
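To make this comparison easier to scan, the test accuracies can be collected from the caret confusionMatrix objects directly (a small convenience step, assuming the six CM_* objects currently hold the test results):

#Collect test accuracy from each confusion matrix object
sapply(list(linear_fs = CM_svm_linear_fs, linear_rf = CM_svm_linear_rf,
            poly_fs = CM_svm_poly_fs, poly_rf = CM_svm_poly_rf,
            radial_fs = CM_svm_radial_fs, radial_rf = CM_svm_radial_rf),
       function(cm) cm$overall['Accuracy'])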


Random Forest Models

Train Random Forest models

set.seed(12345)

#ntree = 10000 is deliberately generous; OOB error typically stabilizes far earlier
random_forest_fs <- randomForest(fwd_step_formula, data = train, ntree = 10000)
random_forest_rf <- randomForest(rf_formula, data = train, ntree = 10000)
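One way to sanity-check the ntree choice (not part of the original output) is to plot the out-of-bag error as trees accumulate; if the curve flattens early, a much smaller forest would suffice:

#OOB and per-class error rates versus number of trees
plot(random_forest_fs, main = 'OOB error vs. number of trees')
legend('topright', colnames(random_forest_fs$err.rate), lty = 1:3, col = 1:3)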


Find the train error rates

random_forest_pred_fs <- predict(random_forest_fs, train, type = 'class')
CM_random_forest_fs <- confusionMatrix(table(train$target, random_forest_pred_fs))

random_forest_pred_rf <- predict(random_forest_rf, train, type = 'class')
CM_random_forest_rf <- confusionMatrix(table(train$target, random_forest_pred_rf))


Confusion Matrix for Random Forest using Forward Step Predictors - Train

CM_random_forest_fs
## Confusion Matrix and Statistics
## 
##           random_forest_pred_fs
##            Donor No Donor
##   Donor     1119       12
##   No Donor     9     1110
##                                           
##                Accuracy : 0.9907          
##                  95% CI : (0.9858, 0.9942)
##     No Information Rate : 0.5013          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9813          
##                                           
##  Mcnemar's Test P-Value : 0.6625          
##                                           
##             Sensitivity : 0.9920          
##             Specificity : 0.9893          
##          Pos Pred Value : 0.9894          
##          Neg Pred Value : 0.9920          
##              Prevalence : 0.5013          
##          Detection Rate : 0.4973          
##    Detection Prevalence : 0.5027          
##       Balanced Accuracy : 0.9907          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix for Random Forest using Random Forest Predictors - Train

CM_random_forest_rf
## Confusion Matrix and Statistics
## 
##           random_forest_pred_rf
##            Donor No Donor
##   Donor     1131        0
##   No Donor     0     1119
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9984, 1)
##     No Information Rate : 0.5027     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5027     
##          Detection Rate : 0.5027     
##    Detection Prevalence : 0.5027     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Donor      
## 


Find the test error rates

random_forest_pred_fs <- predict(random_forest_fs, test, type = 'class')
CM_random_forest_fs <- confusionMatrix(table(test$target, random_forest_pred_fs))

random_forest_pred_rf <- predict(random_forest_rf, test, type = 'class')
CM_random_forest_rf <- confusionMatrix(table(test$target, random_forest_pred_rf))


Confusion Matrix for Random Forest using Forward Step Predictors - Test

CM_random_forest_fs
## Confusion Matrix and Statistics
## 
##           random_forest_pred_fs
##            Donor No Donor
##   Donor      210      158
##   No Donor   185      197
##                                           
##                Accuracy : 0.5427          
##                  95% CI : (0.5062, 0.5788)
##     No Information Rate : 0.5267          
##     P-Value [Acc > NIR] : 0.2002          
##                                           
##                   Kappa : 0.0862          
##                                           
##  Mcnemar's Test P-Value : 0.1604          
##                                           
##             Sensitivity : 0.5316          
##             Specificity : 0.5549          
##          Pos Pred Value : 0.5707          
##          Neg Pred Value : 0.5157          
##              Prevalence : 0.5267          
##          Detection Rate : 0.2800          
##    Detection Prevalence : 0.4907          
##       Balanced Accuracy : 0.5433          
##                                           
##        'Positive' Class : Donor           
## 


Confusion Matrix for Random Forest using Random Forest Predictors - Test

CM_random_forest_rf
## Confusion Matrix and Statistics
## 
##           random_forest_pred_rf
##            Donor No Donor
##   Donor      205      163
##   No Donor   209      173
##                                           
##                Accuracy : 0.504           
##                  95% CI : (0.4676, 0.5404)
##     No Information Rate : 0.552           
##     P-Value [Acc > NIR] : 0.99625         
##                                           
##                   Kappa : 0.0099          
##                                           
##  Mcnemar's Test P-Value : 0.01964         
##                                           
##             Sensitivity : 0.4952          
##             Specificity : 0.5149          
##          Pos Pred Value : 0.5571          
##          Neg Pred Value : 0.4529          
##              Prevalence : 0.5520          
##          Detection Rate : 0.2733          
##    Detection Prevalence : 0.4907          
##       Balanced Accuracy : 0.5050          
##                                           
##        'Positive' Class : Donor           
## 


The Random Forest models were likely overfit, as the train error rates were zero or nearly zero. Against the test dataset, the model using the forward step selected predictors performed slightly better, with a 54.3% accuracy rate.
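Near-perfect accuracy on the training rows is expected when a random forest is evaluated on its own training data; the out-of-bag (OOB) error stored on the fitted object is the more honest internal estimate (a quick check, not in the original output):

#Final OOB error estimate (last row of the error rate matrix)
tail(random_forest_fs$err.rate, 1)
tail(random_forest_rf$err.rate, 1)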






Selecting Best Model

Selecting the best model involves more than simply picking the highest accuracy rate. In this situation, failing to identify a non-donor (a specificity miss) wastes only $0.68 in mailing costs, whereas failing to identify a donor (a sensitivity miss) forfeits a donation averaging $13.00. Therefore, I chose to select the model with the highest sensitivity rate, which in this case is the Radial SVM using the Forward Step selected predictors, with a sensitivity of 54.9%.
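This trade-off can be priced out explicitly from a confusion matrix: mail only to predicted donors, pay $0.68 per piece, and collect an average of $13.00 from each actual donor reached. The sketch below uses the raw test counts and ignores any correction for the weighted sampling (under which non-donors are far more common in reality), so treat the dollar figure as illustrative only:

#Rough net profit if we mail only to predicted donors
#cm$table has the actual target in rows and the predicted class in columns
net_profit <- function(cm, gift = 13.00, cost = 0.68) {
  mailed <- sum(cm$table[, 'Donor'])    #everyone predicted to donate
  caught <- cm$table['Donor', 'Donor']  #predicted donors who actually donate
  caught * gift - mailed * cost
}
net_profit(CM_svm_radial_fs)  #218*13.00 - 397*0.68 = 2564.04 on the test counts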

set.seed(12345)

svm_pred_future <- predict(svm_radial_fs, future_fundraising)
write.table(svm_pred_future, file = "svm pred future.csv", col.names = c("value"), row.names = FALSE)
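As a final sanity check (not part of the original output), the distribution of predicted classes for the future mailing list can be inspected before the file is submitted:

#How many future records does the model flag as donors?
table(svm_pred_future)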