Load Libraries
library(readr)
library(corrplot)
library(tree)
library(tidyverse)
library(randomForest)
A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
The objective of this business case analysis is to use predictive modeling to predict, as accurately as possible, which of the veteran organization members are likley to respond to direct mail solicitations for donations. The overall goal would be to use this information to maximize return on investment (ROI), by limiting direct mail overhead by reducing the total number of mailings. This in turn would result in a higher ROI as your cost per donation ratio would be improved.
The file fundraising.rds contains 3,000 records with approximately 50% donors (target = Donor) and 50% non-donors (target = No Donor). The data set used a weighted sample rather than a random sample to avoid bias in the classification model. As a significant majority of the direct mail recipients did not donate, using a simple random sample could have resulted in a bias that would skew towards predicting No Donors. The dataset contains 21 variables (although the course material states 22, zipconvert1 is actually missing from the datasets)); the descriptions for the variables are listed below.
Zip code group: Broke into five groups; Yes = the potential donor belongs to this zip group.
00000–19999 ⇒ zipconvert1 Not included in dataset
20000–39999 ⇒ zipconvert2
40000–59999 ⇒ zipconvert3
60000–79999 ⇒ zipconvert4
80000–99999 ⇒ zipconvert5
homeowner: Yes = homeowner, No = not a homeowner
num_child: Number of children
income: Household income
female: No = male, Yes = female
wealth: Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state
home_value: Average home value in potential donor’s neighborhood in hundreds of dollars
med_fam_inc: Median family income in potential donor’s neighborhood in hundreds of dollars
avg_fam_inc: Average family income in potential donor’s neighborhood in hundreds
pct_lt15k: Percent earning less than $15K in potential donor’s neighborhood
num_prom: Lifetime number of promotions received to date
lifetime_gifts: Dollar amount of lifetime gifts to date
largest_gift: Dollar amount of largest gift to date
last_gift: Dollar amount of most recent gift
months_since_donate: Number of months from last donation to July 2018
time_lag: Number of months between first and second gift
avg_gift: Average dollar amount of gifts to date
target (Outcome variable): binary indicator for response Yes = donor, No = non-donor
During model development and testing, the fundraising.rds dataset will be partitioned into train and test datasets, using a 75%/25% split. Once the best model is selected, it will be used to predict outcomes on an additional dataset, future_fundraising.rds; this dataset contains all of the same variables as fundraising.rds, with the exception of target (note: This dataset is also missing zipconvert1).
Load & Partition Data (Step 1)
fundraising <- read_rds('fundraising.rds')
future_fundraising <- read_rds('future_fundraising.rds')
Exploring predictors
#Look for missing values
any(is.na(fundraising))
## [1] FALSE
#Review a summary of the data
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
#Review the structure of the data
str(fundraising)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
When observing the structure of the data, it appears that zipconvert3 factor was created with “Yes”/“No” versus “No”/“Yes”. It should not effect our modeling, but just for the sake of consistency and to reduce any potental confusion when looking at results, I am electing to flip the factors for zipconvert3. In addition, just for the sake of consistency, I elected to flip the factors for homeowner and female as well.
#Flip Factors
fundraising$zipconvert3 <- fct_rev(fundraising$zipconvert3)
fundraising$homeowner <- fct_rev(fundraising$homeowner)
fundraising$female <- fct_rev(fundraising$female)
#Review Structure again
str(fundraising)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 2 2 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
#Create a dataframe with only numeric predictors for examination
fundraising_numeric <- fundraising[,-1] %>% keep(is.numeric) %>% gather()
#Plot numeric predictors
ggplot(fundraising_numeric, aes(value)) +
facet_wrap(~ key, scales = 'free') +
geom_density(colour = 'blue') +
theme(axis.text = element_text(color = 'blue'))
There appears to be collinearity (as you might expect) between avg_fam_inc, med_fam_inc, and home_value. In order to better understand the relationship and check for collinearity, I will use a correlation matrix.
#Plot correlation matrix of numerical factors
fundraising_numeric2 <- fundraising[,-1] %>% keep(is.numeric)
corrplot(cor(fundraising_numeric2), method = 'number', type = 'lower')
The correlation matrix does confirm the strong correlation between avg_fam_inc and med_fam_inc, as expected. It also shows a fairly significant correlation between those two variables and home_value. The matrix also indicated a correlation between last gift and avg_gift.
Drop highly correlated predictors
fundraising <- subset(fundraising, select = c(1:11,13:16,18:21))
Split dataset into test and train
set.seed(12345)
in_train <- sample(nrow(fundraising), 2250)
train <- fundraising[in_train,]
test <- fundraising[-in_train,]
Use Random Forest to help identify the best predictors
set.seed(12345)
samp <- sample(nrow(fundraising), nrow(fundraising)*0.1)
rf_fundraising <- train(target~., data = fundraising[samp,])
rf_fundraising_varImp <- varImp(rf_fundraising)
plot(rf_fundraising_varImp)
Use Subset Selection to evaluate predictors
set.seed(12345)
fwd_step <- regsubsets(target ~ ., data = fundraising, method = 'forward')
fwd_step_summary <- summary(fwd_step)
fwd_step_summary
## Subset selection object
## Call: regsubsets.formula(target ~ ., data = fundraising, method = "forward")
## 18 Variables (and intercept)
## Forced in Forced out
## zipconvert2Yes FALSE FALSE
## zipconvert3Yes FALSE FALSE
## zipconvert4Yes FALSE FALSE
## zipconvert5Yes FALSE FALSE
## homeownerYes FALSE FALSE
## num_child FALSE FALSE
## income FALSE FALSE
## femaleYes FALSE FALSE
## wealth FALSE FALSE
## home_value FALSE FALSE
## med_fam_inc FALSE FALSE
## pct_lt15k FALSE FALSE
## num_prom FALSE FALSE
## lifetime_gifts FALSE FALSE
## largest_gift FALSE FALSE
## months_since_donate FALSE FALSE
## time_lag FALSE FALSE
## avg_gift FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## zipconvert2Yes zipconvert3Yes zipconvert4Yes zipconvert5Yes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## homeownerYes num_child income femaleYes wealth home_value med_fam_inc
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " "
## 4 ( 1 ) " " "*" "*" " " " " " " " "
## 5 ( 1 ) " " "*" "*" " " " " " " " "
## 6 ( 1 ) " " "*" "*" " " " " "*" " "
## 7 ( 1 ) " " "*" "*" " " " " "*" " "
## 8 ( 1 ) " " "*" "*" " " "*" "*" " "
## pct_lt15k num_prom lifetime_gifts largest_gift months_since_donate
## 1 ( 1 ) " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " " " "*"
## 7 ( 1 ) "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
## time_lag avg_gift
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
par(mfrow = c(2,2))
#Plot 1
plot(fwd_step_summary$cp, xlab = '# of Predictors', ylab = 'Cp')
min_cp <- min(fwd_step_summary$cp)
std_cp <- sd(fwd_step_summary$cp)
#Plot 2
plot(fwd_step_summary$bic,xlab = '# of Predictors', ylab = 'BIC')
min_bic <- min(fwd_step_summary$bic)
std_bic <- sd(fwd_step_summary$bic)
#Plot 3
plot(fwd_step_summary$adjr2, xlab = '# of Predictors', ylab = 'Adjusted R2')
max_adjr2 <- max(fwd_step_summary$adjr2)
std_adjr2 <- sd(fwd_step_summary$adjr2)
predictors <- (names(coef(fwd_step, 5)))[2:6]
predictors
## [1] "num_child" "income" "num_prom"
## [4] "months_since_donate" "avg_gift"
Looking at the foward step charts and the Random Forest Importance chart, it appears that four or five predictors would work the best. As five predictors provided a slightly better Adjusted \(R^{2}\), I elected to use five predictors in my analysis. However, the Random Forest Importance chart and the Forward Step function found different predictors to be important. As a result, I have elected to build models using the top 5 predictors from each tool.
Saving the top 5 predictors from each tool as a separate formula
fwd_step_formula <- as.formula(target ~ num_child + income + num_prom + months_since_donate + avg_gift)
rf_formula <- as.formula(target ~ avg_gift + lifetime_gifts + med_fam_inc + home_value + num_prom)
Fitting a Linear SVM
set.seed(12345)
tune_svm1 = tune(svm, fwd_step_formula, data = train, kernel = 'linear', ranges = list(cost=c(0.01, .1, .5, 1, 10)))
summary(tune_svm1)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.4391111
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4391111 0.03422190
## 2 0.10 0.4444444 0.02763668
## 3 0.50 0.4408889 0.02494878
## 4 1.00 0.4413333 0.02479435
## 5 10.00 0.4400000 0.02531557
tune_svm2 = tune(svm, rf_formula, data = train, kernel = 'linear', ranges = list(cost=c(0.01, .1, .5, 1, 10)))
summary(tune_svm2)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.4591111
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4591111 0.02591958
## 2 0.10 0.4600000 0.02209842
## 3 0.50 0.4608889 0.02333510
## 4 1.00 0.4613333 0.02340553
## 5 10.00 0.4608889 0.02370833
Fitting a polynomial SVM with various degrees
set.seed(12345)
tune_svm3 <- tune(svm, fwd_step_formula, data = train, kernel = 'polynomial', ranges = list(cost = c(.001, 0.1,
1), degree = c(2, 3)))
summary(tune_svm3)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 1 3
##
## - best performance: 0.4648889
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.001 2 0.5275556 0.02779900
## 2 0.100 2 0.5213333 0.02305120
## 3 1.000 2 0.5102222 0.03466793
## 4 0.001 3 0.5266667 0.02693283
## 5 0.100 3 0.4800000 0.04354648
## 6 1.000 3 0.4648889 0.03636124
tune_svm4 <- tune(svm, rf_formula, data = train, kernel = 'polynomial', ranges = list(cost = c(.001, 0.1,
1), degree = c(2, 3, 4)))
summary(tune_svm4)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 1 4
##
## - best performance: 0.504
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.001 2 0.5204444 0.02694912
## 2 0.100 2 0.5240000 0.02201882
## 3 1.000 2 0.5297778 0.01493286
## 4 0.001 3 0.5217778 0.02725282
## 5 0.100 3 0.5266667 0.01455325
## 6 1.000 3 0.5182222 0.00843274
## 7 0.001 4 0.5244444 0.02052801
## 8 0.100 4 0.5182222 0.02011762
## 9 1.000 4 0.5040000 0.02480763
Fitting a radial SVM with various gammas
set.seed(12345)
tune_svm5 <- tune(svm, fwd_step_formula, data = train, kernel = 'radial', ranges = list(cost = c(.001, 0.1, .5, 1), gamma = c(1, 2)))
summary(tune_svm5)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 0.5 1
##
## - best performance: 0.4408889
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.001 1 0.5297778 0.02824933
## 2 0.100 1 0.4466667 0.04284787
## 3 0.500 1 0.4408889 0.04646281
## 4 1.000 1 0.4417778 0.05375145
## 5 0.001 2 0.5297778 0.02824933
## 6 0.100 2 0.4613333 0.04890236
## 7 0.500 2 0.4497778 0.05199135
## 8 1.000 2 0.4560000 0.05599138
tune_svm6 <- tune(svm, rf_formula, data = train, kernel = 'radial', ranges = list(cost = c(.001, 0.1, .5, 1), gamma = c(1, 2)))
summary(tune_svm6)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 0.5 2
##
## - best performance: 0.4808889
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.001 1 0.5208889 0.02730110
## 2 0.100 1 0.4817778 0.03992861
## 3 0.500 1 0.4853333 0.03473118
## 4 1.000 1 0.4813333 0.03359078
## 5 0.001 2 0.5208889 0.02730110
## 6 0.100 2 0.4924444 0.02983632
## 7 0.500 2 0.4808889 0.03434993
## 8 1.000 2 0.4826667 0.02874227
For the linear models, the best cost for the fwd_step predictors was .01 while the random forest predictors was .1
For the polynomial models, the best cost and degrees were 1 and 3 for both formulas.
For the radial models, the best cost for the fwd_step predictors was .5 and a gamma of 1 while the random forest predictors the best cost was 1 and a gamma of 1.
Train the linear SVM models using their respective best parameters
svm_linear_fs <- svm(fwd_step_formula, data = train, kernel = 'linear', cost = 0.01)
svm_linear_rf <- svm(fwd_step_formula, data = train, kernel = 'linear', cost = 0.1)
svm_poly_fs <- svm(fwd_step_formula, data = train, kernel = 'polynomial', cost = 1, degree = 3)
svm_poly_rf <- svm(fwd_step_formula, data = train, kernel = 'polynomial', cost = 1, degree = 3)
svm_radial_fs <- svm(fwd_step_formula, data = train, kernel = 'radial', cost = .5, gamma = 1)
svm_radial_rf <- svm(fwd_step_formula, data = train, kernel = 'radial', cost = 1, gamma = 1)
Build Confusion Matrices to find the train error rates
svm_pred_train <- predict(svm_linear_fs, train)
CM_svm_linear_fs <- confusionMatrix(table(train$target, svm_pred_train))
svm_pred_train <- predict(svm_linear_rf, train)
CM_svm_linear_rf <- confusionMatrix(table(train$target, svm_pred_train))
svm_pred_train <- predict(svm_poly_fs, train)
CM_svm_poly_fs <- confusionMatrix(table(train$target, svm_pred_train))
svm_pred_train <- predict(svm_poly_rf, train)
CM_svm_poly_rf <- confusionMatrix(table(train$target, svm_pred_train))
svm_pred_train <- predict(svm_radial_fs, train)
CM_svm_radial_fs <- confusionMatrix(table(train$target, svm_pred_train))
svm_pred_train <- predict(svm_radial_rf, train)
CM_svm_radial_rf <- confusionMatrix(table(train$target, svm_pred_train))
Confusion Matrix Linear SVM for Forward Step Predictors
CM_svm_linear_fs
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 694 434
## No Donor 526 596
##
## Accuracy : 0.5733
## 95% CI : (0.5526, 0.5939)
## No Information Rate : 0.5422
## P-Value [Acc > NIR] : 0.001608
##
## Kappa : 0.1465
##
## Mcnemar's Test P-Value : 0.003314
##
## Sensitivity : 0.5689
## Specificity : 0.5786
## Pos Pred Value : 0.6152
## Neg Pred Value : 0.5312
## Prevalence : 0.5422
## Detection Rate : 0.3084
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.5737
##
## 'Positive' Class : Donor
##
Confusion Matrix Linear SVM for Random Forest Predictors - Train
CM_svm_linear_rf
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 691 437
## No Donor 525 597
##
## Accuracy : 0.5724
## 95% CI : (0.5517, 0.593)
## No Information Rate : 0.5404
## P-Value [Acc > NIR] : 0.001222
##
## Kappa : 0.1447
##
## Mcnemar's Test P-Value : 0.005032
##
## Sensitivity : 0.5683
## Specificity : 0.5774
## Pos Pred Value : 0.6126
## Neg Pred Value : 0.5321
## Prevalence : 0.5404
## Detection Rate : 0.3071
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.5728
##
## 'Positive' Class : Donor
##
Confusion Matrix Polynomial SVM for Forward Step Predictors - Train
CM_svm_poly_fs
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 896 232
## No Donor 736 386
##
## Accuracy : 0.5698
## 95% CI : (0.549, 0.5904)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1385
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5490
## Specificity : 0.6246
## Pos Pred Value : 0.7943
## Neg Pred Value : 0.3440
## Prevalence : 0.7253
## Detection Rate : 0.3982
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.5868
##
## 'Positive' Class : Donor
##
Confusion Matrix Polynomial SVM for Random Forest Predictors - Train
CM_svm_poly_fs
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 896 232
## No Donor 736 386
##
## Accuracy : 0.5698
## 95% CI : (0.549, 0.5904)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1385
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5490
## Specificity : 0.6246
## Pos Pred Value : 0.7943
## Neg Pred Value : 0.3440
## Prevalence : 0.7253
## Detection Rate : 0.3982
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.5868
##
## 'Positive' Class : Donor
##
Confusion Matrix Radial SVM for Forward Step Predictors - Train
CM_svm_radial_fs
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 755 373
## No Donor 452 670
##
## Accuracy : 0.6333
## 95% CI : (0.613, 0.6533)
## No Information Rate : 0.5364
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2665
##
## Mcnemar's Test P-Value : 0.006615
##
## Sensitivity : 0.6255
## Specificity : 0.6424
## Pos Pred Value : 0.6693
## Neg Pred Value : 0.5971
## Prevalence : 0.5364
## Detection Rate : 0.3356
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.6339
##
## 'Positive' Class : Donor
##
Confusion Matrix Radial SVM for Random Forest Predictors - Train
CM_svm_radial_rf
## Confusion Matrix and Statistics
##
## svm_pred_train
## Donor No Donor
## Donor 787 341
## No Donor 452 670
##
## Accuracy : 0.6476
## 95% CI : (0.6274, 0.6673)
## No Information Rate : 0.5507
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2949
##
## Mcnemar's Test P-Value : 9.375e-05
##
## Sensitivity : 0.6352
## Specificity : 0.6627
## Pos Pred Value : 0.6977
## Neg Pred Value : 0.5971
## Prevalence : 0.5507
## Detection Rate : 0.3498
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.6489
##
## 'Positive' Class : Donor
##
Build Confusion Matrices to find the test error rates
svm_pred_test <- predict(svm_linear_fs, test)
CM_svm_linear_fs <- confusionMatrix(table(test$target, svm_pred_test))
svm_pred_test <- predict(svm_linear_rf, test)
CM_svm_linear_rf <- confusionMatrix(table(test$target, svm_pred_test))
svm_pred_test <- predict(svm_poly_fs, test)
CM_svm_poly_fs <- confusionMatrix(table(test$target, svm_pred_test))
svm_pred_test <- predict(svm_poly_rf, test)
CM_svm_poly_rf <- confusionMatrix(table(test$target, svm_pred_test))
svm_pred_test <- predict(svm_radial_fs, test)
CM_svm_radial_fs <- confusionMatrix(table(test$target, svm_pred_test))
svm_pred_test <- predict(svm_radial_rf, test)
CM_svm_radial_rf <- confusionMatrix(table(test$target, svm_pred_test))
Confusion Matrix Linear SVM for Forward Step Predictors - Test
CM_svm_linear_fs
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 224 147
## No Donor 189 190
##
## Accuracy : 0.552
## 95% CI : (0.5156, 0.588)
## No Information Rate : 0.5507
## P-Value [Acc > NIR] : 0.4859
##
## Kappa : 0.105
##
## Mcnemar's Test P-Value : 0.0253
##
## Sensitivity : 0.5424
## Specificity : 0.5638
## Pos Pred Value : 0.6038
## Neg Pred Value : 0.5013
## Prevalence : 0.5507
## Detection Rate : 0.2987
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5531
##
## 'Positive' Class : Donor
##
Confusion Matrix Linear SVM for Random Forest Predictors - Test
CM_svm_linear_rf
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 225 146
## No Donor 188 191
##
## Accuracy : 0.5547
## 95% CI : (0.5183, 0.5906)
## No Information Rate : 0.5507
## P-Value [Acc > NIR] : 0.42767
##
## Kappa : 0.1103
##
## Mcnemar's Test P-Value : 0.02487
##
## Sensitivity : 0.5448
## Specificity : 0.5668
## Pos Pred Value : 0.6065
## Neg Pred Value : 0.5040
## Prevalence : 0.5507
## Detection Rate : 0.3000
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5558
##
## 'Positive' Class : Donor
##
Confusion Matrix Polynomial SVM for Forward Step Predictors - Test
CM_svm_poly_fs
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 288 83
## No Donor 267 112
##
## Accuracy : 0.5333
## 95% CI : (0.4969, 0.5695)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0714
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5189
## Specificity : 0.5744
## Pos Pred Value : 0.7763
## Neg Pred Value : 0.2955
## Prevalence : 0.7400
## Detection Rate : 0.3840
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5466
##
## 'Positive' Class : Donor
##
Confusion Matrix Polynomial SVM for Random Forest Predictors - Test
CM_svm_poly_fs
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 288 83
## No Donor 267 112
##
## Accuracy : 0.5333
## 95% CI : (0.4969, 0.5695)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0714
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5189
## Specificity : 0.5744
## Pos Pred Value : 0.7763
## Neg Pred Value : 0.2955
## Prevalence : 0.7400
## Detection Rate : 0.3840
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5466
##
## 'Positive' Class : Donor
##
Confusion Matrix Radial SVM for Forward Step Predictors - Test
CM_svm_radial_fs
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 218 153
## No Donor 179 200
##
## Accuracy : 0.5573
## 95% CI : (0.521, 0.5933)
## No Information Rate : 0.5293
## P-Value [Acc > NIR] : 0.0667
##
## Kappa : 0.1152
##
## Mcnemar's Test P-Value : 0.1700
##
## Sensitivity : 0.5491
## Specificity : 0.5666
## Pos Pred Value : 0.5876
## Neg Pred Value : 0.5277
## Prevalence : 0.5293
## Detection Rate : 0.2907
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5578
##
## 'Positive' Class : Donor
##
Confusion Matrix Radial SVM for Random Forest Predictors - Test
CM_svm_radial_rf
## Confusion Matrix and Statistics
##
## svm_pred_test
## Donor No Donor
## Donor 221 150
## No Donor 188 191
##
## Accuracy : 0.5493
## 95% CI : (0.5129, 0.5854)
## No Information Rate : 0.5453
## P-Value [Acc > NIR] : 0.42770
##
## Kappa : 0.0995
##
## Mcnemar's Test P-Value : 0.04416
##
## Sensitivity : 0.5403
## Specificity : 0.5601
## Pos Pred Value : 0.5957
## Neg Pred Value : 0.5040
## Prevalence : 0.5453
## Detection Rate : 0.2947
## Detection Prevalence : 0.4947
## Balanced Accuracy : 0.5502
##
## 'Positive' Class : Donor
##
In the train validation, the Radial SVM using the predictors selected by the Random Forest performed the best, with an accuracy rate of 64.8%; however it did not do as well against the test dataset with only a 55% accuracy rate. Against the test dataset, the Radial SVM using the Forward Step predictors did the best, with a 55.7% accuracy rate.
Train Random Forest models
set.seed(12345)
random_forest_fs <- randomForest(fwd_step_formula, data = train, ntree = 10000)
random_forest_rf <- randomForest(rf_formula, data = train, ntree = 10000)
Find the train error rates
random_forest_pred_fs <- predict(random_forest_fs, train, type = 'class')
CM_random_forest_fs <- confusionMatrix(table(train$target, random_forest_pred_fs))
random_forest_pred_rf <- predict(random_forest_rf, train, type = 'class')
CM_random_forest_rf <- confusionMatrix(table(train$target, random_forest_pred_rf))
Confusion Matrix for Random Forest using Forward Step Predictors - Train
CM_random_forest_fs
## Confusion Matrix and Statistics
##
## random_forest_pred_fs
## Donor No Donor
## Donor 1119 12
## No Donor 9 1110
##
## Accuracy : 0.9907
## 95% CI : (0.9858, 0.9942)
## No Information Rate : 0.5013
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9813
##
## Mcnemar's Test P-Value : 0.6625
##
## Sensitivity : 0.9920
## Specificity : 0.9893
## Pos Pred Value : 0.9894
## Neg Pred Value : 0.9920
## Prevalence : 0.5013
## Detection Rate : 0.4973
## Detection Prevalence : 0.5027
## Balanced Accuracy : 0.9907
##
## 'Positive' Class : Donor
##
Confusion Matrix for Random Forest using Random Forest Predictors - Train
CM_random_forest_rf
## Confusion Matrix and Statistics
##
## random_forest_pred_rf
## Donor No Donor
## Donor 1131 0
## No Donor 0 1119
##
## Accuracy : 1
## 95% CI : (0.9984, 1)
## No Information Rate : 0.5027
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5027
## Detection Rate : 0.5027
## Detection Prevalence : 0.5027
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Donor
##
Find the test error rates
random_forest_pred_fs <- predict(random_forest_fs, test, type = 'class')
CM_random_forest_fs <- confusionMatrix(table(test$target, random_forest_pred_fs))
random_forest_pred_rf <- predict(random_forest_rf, test, type = 'class')
CM_random_forest_rf <- confusionMatrix(table(test$target, random_forest_pred_rf))
Confusion Matrix for Random Forest using Forward Step Predictors - Test
CM_random_forest_fs
## Confusion Matrix and Statistics
##
## random_forest_pred_fs
## Donor No Donor
## Donor 210 158
## No Donor 185 197
##
## Accuracy : 0.5427
## 95% CI : (0.5062, 0.5788)
## No Information Rate : 0.5267
## P-Value [Acc > NIR] : 0.2002
##
## Kappa : 0.0862
##
## Mcnemar's Test P-Value : 0.1604
##
## Sensitivity : 0.5316
## Specificity : 0.5549
## Pos Pred Value : 0.5707
## Neg Pred Value : 0.5157
## Prevalence : 0.5267
## Detection Rate : 0.2800
## Detection Prevalence : 0.4907
## Balanced Accuracy : 0.5433
##
## 'Positive' Class : Donor
##
Confusion Matrix for Random Forest using Random Forest Predictors - Test
CM_random_forest_rf
## Confusion Matrix and Statistics
##
## random_forest_pred_rf
## Donor No Donor
## Donor 205 163
## No Donor 209 173
##
## Accuracy : 0.504
## 95% CI : (0.4676, 0.5404)
## No Information Rate : 0.552
## P-Value [Acc > NIR] : 0.99625
##
## Kappa : 0.0099
##
## Mcnemar's Test P-Value : 0.01964
##
## Sensitivity : 0.4952
## Specificity : 0.5149
## Pos Pred Value : 0.5571
## Neg Pred Value : 0.4529
## Prevalence : 0.5520
## Detection Rate : 0.2733
## Detection Prevalence : 0.4907
## Balanced Accuracy : 0.5050
##
## 'Positive' Class : Donor
##
The Random Forest models were likely over fit as the train error rates were zero or almost zero. Against the Train dataset, the model using the forward step selected predictors performed slightly better with a 54% accuracy rate.
Selecting the best model is more than just picking the highest accuracy rate. In this situation, failing to identify a non-donor (specificity) results in a wasted $0.65; however, failing to identify a donor results in a lost donation averaging $13.00. Therefore, I chose to select the model which returns the highest sensitivity rate, which in this case is the Radial SVM, using the Forward Step selected predictors, with a Sensitivity rate of 54.9%.
set.seed(12345)
svm_pred_future <- predict(svm_radial_fs, future_fundraising)
write.table(svm_pred_future, file = "svm pred future.csv", col.names = c("value"), row.names = FALSE)