Modeling Competition

Background:

A national veterans organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Business Objectives and Goals:

The objective of this analysis is to create a predictive classification model that will identify direct-mail recipients that will make a donation in order to maximize donations.

The goal of this analysis is to maximize the organization’s net profit through accurate recipient targeting.

Data Sources and Data Used

The data used will be the fundraising.rds dataset has 3,000 observations with 50% donors and 50% non-donors that was provided. The data set will be divided up in order to train a model and test the model.

As mentioned in the background the data set is weighted, under-representing the non-responders so that the sample has equal numbers of donors and non-donors. Balancing out the data set is important for classification models to enable even distribution and reduce bias.

An additional data set, future_fundraising.rds will be used to make predictions.

Installing packages for project

library(readr)
library(caret)
library(car)
library(MASS)
library(dplyr)
library(class)
library(e1071)

DF = read_rds("~/fundraising.rds")

The data does not have any null values, attaching the data for ease of use in future steps.

sum(is.na(DF))

## [1] 0

Step 1: Partitioning

The sample is partitioned into 80% training and 20% validation.

set.seed(12345)
train = sample(nrow(DF), 0.8*(nrow(DF)))
train.set.full = DF[train,]
test.set.full = DF[-train,]

Step 2: Model Building

2-1. Exploratory Data Analysis

Conducting a few analysis and visualizations to better understand the data.

summary(DF)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

str(DF)

## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

par(mfrow=c(2,2))
hist(DF$num_child)
hist(DF$wealth)
hist(DF$home_value)
hist(DF$largest_gift)

dim(DF)

## [1] 3000   21

numeric = (DF[,c(6:7,9:21)])
numeric$target = as.numeric(numeric$target)

par(mfrow=c(1,3))
boxplot(numeric[1:5], las=2,  cex.axis = .8)
boxplot(numeric[6:10], las=2, cex.axis = .8)
boxplot(numeric[11:14], las=2, cex.axis = .8)

Utilizing summary to examine the predictors there are 3000 observations, 20 predictor variables and one target variables that has two levels, “Donor” and “No Donor”. The split ratio from the provided data is Donor:1499 and No Donor:1501.

categorical variables include zipconvert2-5, homeowner, female, and potentially num_child, income and wealth based on their values.

A few variables are skewed and have outliers such as lifetime_gift, last_gift, and home_value.

2-1. Evaluating the association between the predictor variables and response variables

cor(numeric)

##                        num_child       income       wealth    home_value
## num_child            1.000000000  0.091893089  0.060175537 -0.0119642286
## income               0.091893089  1.000000000  0.208993101  0.2919734944
## wealth               0.060175537  0.208993101  1.000000000  0.2611611450
## home_value          -0.011964229  0.291973494  0.261161145  1.0000000000
## med_fam_inc          0.046961647  0.367505334  0.377763371  0.7381530742
## avg_fam_inc          0.047261395  0.378585352  0.385892299  0.7525690021
## pct_lt15k           -0.031717891 -0.283191234 -0.375145585 -0.3990861577
## num_prom            -0.086432604 -0.069008634 -0.412117770 -0.0645138583
## lifetime_gifts      -0.050954766 -0.019565470 -0.225473319 -0.0240737013
## largest_gift        -0.017554416  0.033180760 -0.025276518  0.0564942757
## last_gift           -0.012948678  0.109592754  0.052591311  0.1588576542
## months_since_donate -0.005563603  0.077238810  0.033713981  0.0234285142
## time_lag            -0.006069356 -0.001545727 -0.066421329  0.0006789113
## avg_gift            -0.019688680  0.124055750  0.091078754  0.1687736865
## target               0.042348253 -0.035953287 -0.003114465 -0.0215691141
##                      med_fam_inc  avg_fam_inc     pct_lt15k    num_prom
## num_child            0.046961647  0.047261395 -0.0317178911 -0.08643260
## income               0.367505334  0.378585352 -0.2831912335 -0.06900863
## wealth               0.377763371  0.385892299 -0.3751455847 -0.41211777
## home_value           0.738153074  0.752569002 -0.3990861577 -0.06451386
## med_fam_inc          1.000000000  0.972271285 -0.6653626748 -0.05078270
## avg_fam_inc          0.972271285  1.000000000 -0.6802847967 -0.05731139
## pct_lt15k           -0.665362675 -0.680284797  1.0000000000  0.03777518
## num_prom            -0.050782705 -0.057311385  0.0377751828  1.00000000
## lifetime_gifts      -0.035245827 -0.040327155  0.0596188059  0.53861957
## largest_gift         0.047032066  0.043103937 -0.0078829361  0.11381034
## last_gift            0.135976003  0.131378624 -0.0617521213 -0.05586809
## months_since_donate  0.032336691  0.031268594 -0.0090145584 -0.28232212
## time_lag             0.015202043  0.024340381 -0.0199114896  0.11962322
## avg_gift             0.137162758  0.131758434 -0.0624808920 -0.14725094
## target              -0.008036116 -0.003177139  0.0007592833 -0.06836599
##                     lifetime_gifts largest_gift   last_gift months_since_donate
## num_child              -0.05095477 -0.017554416 -0.01294868        -0.005563603
## income                 -0.01956547  0.033180760  0.10959275         0.077238810
## wealth                 -0.22547332 -0.025276518  0.05259131         0.033713981
## home_value             -0.02407370  0.056494276  0.15885765         0.023428514
## med_fam_inc            -0.03524583  0.047032066  0.13597600         0.032336691
## avg_fam_inc            -0.04032716  0.043103937  0.13137862         0.031268594
## pct_lt15k               0.05961881 -0.007882936 -0.06175212        -0.009014558
## num_prom                0.53861957  0.113810342 -0.05586809        -0.282322122
## lifetime_gifts          1.00000000  0.507262313  0.20205827        -0.144621862
## largest_gift            0.50726231  1.000000000  0.44723693         0.019789633
## last_gift               0.20205827  0.447236933  1.00000000         0.186715010
## months_since_donate    -0.14462186  0.019789633  0.18671501         1.000000000
## time_lag                0.03854575  0.039977035  0.07511121         0.015528499
## avg_gift                0.18232435  0.474830096  0.86639998         0.189110799
## target                 -0.01962693  0.017783355  0.07772082         0.133813301
##                          time_lag    avg_gift        target
## num_child           -0.0060693555 -0.01968868  0.0423482529
## income              -0.0015457272  0.12405575 -0.0359532869
## wealth              -0.0664213294  0.09107875 -0.0031144649
## home_value           0.0006789113  0.16877369 -0.0215691141
## med_fam_inc          0.0152020426  0.13716276 -0.0080361157
## avg_fam_inc          0.0243403812  0.13175843 -0.0031771394
## pct_lt15k           -0.0199114896 -0.06248089  0.0007592833
## num_prom             0.1196232155 -0.14725094 -0.0683659889
## lifetime_gifts       0.0385457538  0.18232435 -0.0196269259
## largest_gift         0.0399770354  0.47483010  0.0177833547
## last_gift            0.0751112090  0.86639998  0.0777208200
## months_since_donate  0.0155284995  0.18911080  0.1338133012
## time_lag             1.0000000000  0.07008164 -0.0097457015
## avg_gift             0.0700816428  1.00000000  0.0756630051
## target              -0.0097457015  0.07566301  1.0000000000

library(corrplot)
par(mfrow=c(1,1))
corrplot(cor(numeric), method = "number", type='lower')

plot(numeric,col=numeric$target)

Correlation When running a correlation matrix and chart it reflects that med_fam_inc and avg_fam_inc are higlhy correlated at 0.972271285 value. home_value is highly correlated with both med_fam_inc and avg_fam_inc, both at levels above .73. Additionally, last_gift and avg_gift are correlated with a value of 0.86639998.

Modeling the data using `glm() for significance/good predictors.

lm = glm(target~., data = DF, family="binomial")
summary(lm)

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = DF)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90432  -1.15349   0.00153   1.15919   1.79778  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.885e+00  4.595e-01  -4.102 4.10e-05 ***
## zipconvert2Yes      -1.365e+01  2.670e+02  -0.051  0.95924    
## zipconvert3No        1.361e+01  2.670e+02   0.051  0.95934    
## zipconvert4Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## zipconvert5Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## homeownerNo          4.957e-02  9.412e-02   0.527  0.59847    
## num_child            2.752e-01  1.137e-01   2.422  0.01544 *  
## income              -6.952e-02  2.595e-02  -2.679  0.00738 ** 
## femaleNo             5.995e-02  7.673e-02   0.781  0.43463    
## wealth              -1.907e-02  1.800e-02  -1.059  0.28940    
## home_value          -1.074e-04  7.141e-05  -1.503  0.13272    
## med_fam_inc         -1.200e-03  9.303e-04  -1.289  0.19725    
## avg_fam_inc          1.756e-03  1.010e-03   1.738  0.08226 .  
## pct_lt15k           -9.519e-04  4.440e-03  -0.214  0.83024    
## num_prom            -3.682e-03  2.317e-03  -1.589  0.11204    
## lifetime_gifts       1.599e-04  3.721e-04   0.430  0.66743    
## largest_gift        -1.773e-03  3.091e-03  -0.574  0.56629    
## last_gift            9.923e-03  7.562e-03   1.312  0.18945    
## months_since_donate  5.922e-02  1.003e-02   5.906 3.51e-09 ***
## time_lag            -6.174e-03  6.789e-03  -0.909  0.36311    
## avg_gift             7.539e-03  1.106e-02   0.682  0.49526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4062.0  on 2979  degrees of freedom
## AIC: 4104
## 
## Number of Fisher Scoring iterations: 12

par(mfrow=c(2,2))
plot(lm, col="dark blue")

months_since_donate, income, and num_child all have significant p values making them good candidate predictors.

library(car)
vif(lm)

##         zipconvert2         zipconvert3         zipconvert4         zipconvert5 
##        8.790721e+06        7.784501e+06        8.750031e+06        1.225688e+07 
##           homeowner           num_child              income              female 
##        1.132353e+00        1.026222e+00        1.311414e+00        1.016256e+00 
##              wealth          home_value         med_fam_inc         avg_fam_inc 
##        1.523860e+00        3.308054e+00        1.877177e+01        2.100427e+01 
##           pct_lt15k            num_prom      lifetime_gifts        largest_gift 
##        2.102697e+00        1.964905e+00        2.135341e+00        2.098192e+00 
##           last_gift months_since_donate            time_lag            avg_gift 
##        3.945654e+00        1.132696e+00        1.038106e+00        4.232227e+00

From the values provided with the variance inflation function, zipconvert2 has the highest vif() value above 5. Due to this, we will remove this variable from our logistic model and retest for high collinear values.

lm2=glm(target~. -zipconvert2,family = "binomial", data = DF)
vif(lm2)

##         zipconvert3         zipconvert4         zipconvert5           homeowner 
##            1.555523            1.591039            1.983755            1.133003 
##           num_child              income              female              wealth 
##            1.026233            1.311900            1.016280            1.520999 
##          home_value         med_fam_inc         avg_fam_inc           pct_lt15k 
##            3.289635           18.773445           21.005426            2.099468 
##            num_prom      lifetime_gifts        largest_gift           last_gift 
##            1.962553            2.137299            2.105473            3.952911 
## months_since_donate            time_lag            avg_gift 
##            1.131956            1.037745            4.238023

Repeating the previous step and removing the next variables with the highest collinearity (avg_fam_inc) above 5.

lm3=glm(target~. -zipconvert2-avg_fam_inc,family = "binomial", data = DF)
vif(lm3)

##         zipconvert3         zipconvert4         zipconvert5           homeowner 
##            1.552217            1.590326            1.973780            1.129662 
##           num_child              income              female              wealth 
##            1.026185            1.300439            1.016322            1.518717 
##          home_value         med_fam_inc           pct_lt15k            num_prom 
##            3.069859            3.912247            1.984977            1.960837 
##      lifetime_gifts        largest_gift           last_gift months_since_donate 
##            2.131605            2.090869            3.965786            1.131913 
##            time_lag            avg_gift 
##            1.033901            4.245008

No additional variables remain with a vif() value above 5, meaning we no longer need to remove any more predictors due to collinearity. Collinearity reduces the accuracy of the estimates of the regression coefficients.

Candidate Predictors
As mentioned, the following variables are good candidate predictors: months_since_donate, income, and num_child due to their significance level.

Collinearity
As mentioned, both zipconvert2 and avg_fam_inc had high collinearity with vif() values above 5 in relation to the response variable target. Removing them one by one resulted in no more high collinearity.

Removing the two variables with high collinearity and establishing training/testing sets using the split from step 1:

DF = subset(DF, select=c(-zipconvert2,-avg_fam_inc))
head(DF)

## # A tibble: 6 × 19
##   zipconv…¹ zipco…² zipco…³ homeo…⁴ num_c…⁵ income female wealth home_…⁶ med_f…⁷
##   <fct>     <fct>   <fct>   <fct>     <dbl>  <dbl> <fct>   <dbl>   <dbl>   <dbl>
## 1 No        No      No      Yes           1      1 No          7     698     422
## 2 No        No      Yes     No            2      5 Yes         8     828     358
## 3 No        No      Yes     Yes           1      3 No          4    1471     484
## 4 Yes       No      No      Yes           1      4 No          8     547     386
## 5 Yes       No      No      Yes           1      4 Yes         8     482     242
## 6 No        No      Yes     Yes           1      4 Yes         8     857     450
## # … with 9 more variables: pct_lt15k <dbl>, num_prom <dbl>,
## #   lifetime_gifts <dbl>, largest_gift <dbl>, last_gift <dbl>,
## #   months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>, target <fct>,
## #   and abbreviated variable names ¹zipconvert3, ²zipconvert4, ³zipconvert5,
## #   ⁴homeowner, ⁵num_child, ⁶home_value, ⁷med_fam_inc

train.set = DF[train,]
test.set = DF[-train,]

2. Select classification tool and parameters

Using Best Subset Selection to determine variables to keep in the model.

set.seed(12345)
library(leaps)
regfit.full = regsubsets(target ~ ., data = DF, nvmax = 19)
regfull_summary = summary(regfit.full)
names(regfull_summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

regfull_summary$rsq

##  [1] 0.01790600 0.02078753 0.02347999 0.02586673 0.02677499 0.02715753
##  [7] 0.02743641 0.02774131 0.02799367 0.02817210 0.02833369 0.02844443
## [13] 0.02851728 0.02858751 0.02864296 0.02868859 0.02871641 0.02871914

par(mfrow = c(2,2))
plot(regfull_summary$cp) 
plot(regfull_summary$bic) 
plot(regfull_summary$adjr2)


cp_min = which.min(summary(regfit.full)$cp)
cp_min

## [1] 5

bic_min = which.min(summary(regfit.full)$bic)
bic_min

## [1] 3

adj_r2_max = which.max(summary(regfit.full)$adjr2)
adj_r2_max

## [1] 6

coef(regfit.full,6)

##         (Intercept)           num_child              income          home_value 
##        1.052062e+00        6.599813e-02       -1.598352e-02       -1.086337e-05 
##            num_prom           last_gift months_since_donate 
##       -7.167662e-04        3.034987e-03        1.434946e-02

2. Select classification tool and parameters
All of the models used the trimmed data set that removed zipconvert2 and avg_fam_inc due to their collinearity. Additionally, all models were created using the train.set data set. Additionally, using Best Subset Selection I decided to use 6 predictor variables in my model, those are: num_child, income, home_value, num_prom, last_gift, and months_since_donate.

Only keeping the 6 previously mentioned variables

DF = subset(DF, select= - c((zipconvert3:homeowner), female, wealth, med_fam_inc:pct_lt15k, lifetime_gifts, largest_gift, time_lag, avg_gift))

summary(DF)

##    num_child         income        home_value        num_prom     
##  Min.   :1.000   Min.   :1.000   Min.   :   0.0   Min.   : 11.00  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 554.8   1st Qu.: 29.00  
##  Median :1.000   Median :4.000   Median : 816.5   Median : 48.00  
##  Mean   :1.069   Mean   :3.899   Mean   :1143.3   Mean   : 49.14  
##  3rd Qu.:1.000   3rd Qu.:5.000   3rd Qu.:1341.2   3rd Qu.: 65.00  
##  Max.   :5.000   Max.   :7.000   Max.   :5945.0   Max.   :157.00  
##    last_gift      months_since_donate      target    
##  Min.   :  0.00   Min.   :17.00       Donor   :1499  
##  1st Qu.:  7.00   1st Qu.:29.00       No Donor:1501  
##  Median : 10.00   Median :31.00                      
##  Mean   : 13.48   Mean   :31.13                      
##  3rd Qu.: 16.00   3rd Qu.:34.00                      
##  Max.   :219.00   Max.   :37.00

DF=as.data.frame(DF)
head(DF)

##   num_child income home_value num_prom last_gift months_since_donate   target
## 1         1      1        698       46        12                  34    Donor
## 2         2      5        828       32         5                  29    Donor
## 3         1      3       1471       94         8                  30 No Donor
## 4         1      4        547       20        11                  30 No Donor
## 5         1      4        482       38        10                  31    Donor
## 6         1      4        857       47        20                  37    Donor

train.set = DF[train,]
test.set = DF[-train,]

2. classification models and details
All Models All three models attempted (SVM with Linear Kernel, Tree, KNN) used num_child, income, home_value, num_prom, last_gift, and months_since_donate as predictor variables.

The first classification model used the Support Vector Machine with a Linear Kernel method using cross validation to choose the best cost (.01). When running this model set probabilities to TRUE to allow for probability predictions. This model resulted in a train error of 0.4229167 and a test error of 0.4516667.

The second classification model used the Tree method with, unfortunately the tree method resulted in only two nodes splitting on the largest_gift predictor. The test error for this model is 0.5166667. Since this model did not seem too robust I conducted a third model.

The third classification model used the K-nearest neighbors method, cross validation was used to select the best value for K (K=5) and the model used the same 6 variables produced by the best subset selection. The train error is 0.3016667 and the test error is 0.4666667

The best model based on test error rates is the Support Vector with a Linear Kernel.

#finding best tune for cost
set.seed(12345)
tuneTarget = tune(svm, target~., data=train.set, kernel="linear", ranges = list(cost=c(cost=seq(0.01, 10, length.out=20))))

summary(tuneTarget) #best tune is cost=0.01

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.01
## 
## - best performance: 0.43 
## 
## - Detailed performance results:
##          cost     error dispersion
## 1   0.0100000 0.4300000 0.03937200
## 2   0.5357895 0.4329167 0.03410281
## 3   1.0615789 0.4333333 0.03327541
## 4   1.5873684 0.4333333 0.03327541
## 5   2.1131579 0.4337500 0.03324351
## 6   2.6389474 0.4337500 0.03324351
## 7   3.1647368 0.4337500 0.03324351
## 8   3.6905263 0.4337500 0.03324351
## 9   4.2163158 0.4333333 0.03327541
## 10  4.7421053 0.4337500 0.03324351
## 11  5.2678947 0.4333333 0.03327541
## 12  5.7936842 0.4337500 0.03324351
## 13  6.3194737 0.4337500 0.03324351
## 14  6.8452632 0.4333333 0.03327541
## 15  7.3710526 0.4337500 0.03324351
## 16  7.8968421 0.4337500 0.03324351
## 17  8.4226316 0.4337500 0.03324351
## 18  8.9484211 0.4333333 0.03327541
## 19  9.4742105 0.4333333 0.03327541
## 20 10.0000000 0.4337500 0.03324351

#using best tune for cost
set.seed(12345)
Target.fitR = svm(as.factor(target) ~ ., data = train.set, kernel ="linear", cost = 0.01, probability=TRUE)
summary(Target.fitR)

## 
## Call:
## svm(formula = as.factor(target) ~ ., data = train.set, kernel = "linear", 
##     cost = 0.01, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  2260
## 
##  ( 1130 1130 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Donor No Donor

Target.fitR

## 
## Call:
## svm(formula = as.factor(target) ~ ., data = train.set, kernel = "linear", 
##     cost = 0.01, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  2260

svc.train.pred = predict(Target.fitR, train.set, probability=TRUE)
table(svc.train.pred, train.set$target)

##               
## svc.train.pred Donor No Donor
##       Donor      771      577
##       No Donor   438      614

SV.Train.Error = mean(train.set$target != svc.train.pred) 
SV.Train.Error # 0.4229167 train error, 0.4516667 test error f/linear w/cost=0.01

## [1] 0.4229167

svc.test.pred=predict(Target.fitR, test.set, probability = TRUE)
table(svc.test.pred, test.set$target)

##              
## svc.test.pred Donor No Donor
##      Donor      185      166
##      No Donor   105      144

SV.Test.Error = mean(test.set$target != svc.test.pred)
SV.Test.Error

## [1] 0.4516667

Target.fitR$epsilon

## [1] 0.1

library(tree)
Target_tree=tree(target ~ ., data = train.set)
summary(Target_tree)

## 
## Classification tree:
## tree(formula = target ~ ., data = train.set)
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes:  1 
## Residual mean deviance:  1.387 = 3327 / 2399 
## Misclassification error rate: 0.4962 = 1191 / 2400

set.seed(1234)
Target.pred = predict(Target_tree, train.set, type = "class")
table(train.set$target, Target.pred)

##           Target.pred
##            Donor No Donor
##   Donor     1209        0
##   No Donor  1191        0

Tree.Train.Error = mean(Target.pred!=train.set$target)
Tree.Train.Error

## [1] 0.49625

2 terminal nodes The test error rate is 0.5166667

set.seed(1234)
Target.pred = predict(Target_tree, test.set, type = "class")
table(test.set$target, Target.pred)

##           Target.pred
##            Donor No Donor
##   Donor      290        0
##   No Donor   310        0

Tree.Test.Error = mean(Target.pred!=test.set$target)
Tree.Test.Error

## [1] 0.5166667

train_control = trainControl(method="repeatedcv",number=10,repeats=3)
knn.fit = train(target~.,
                 data=train.set,
                 method='knn',
                 trControl = train_control,
                 tuneLength=20)

target.pred=predict(knn.fit, train.set)
table(train.set$target, target.pred)

##           target.pred
##            Donor No Donor
##   Donor      856      353
##   No Donor   371      820

KNN.Train.Error=mean(target.pred!=train.set$target) #0.3016667 train error
KNN.Train.Error

## [1] 0.3016667

target.pred.test=predict(knn.fit, test.set)
table(test.set$target, target.pred.test)

##           target.pred.test
##            Donor No Donor
##   Donor      160      130
##   No Donor   150      160

KNN.Test.Error= mean(target.pred.test!=test.set$target) #0.4666667% test error
KNN.Test.Error

## [1] 0.4666667

knn.fit

## k-Nearest Neighbors 
## 
## 2400 samples
##    6 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2160, 2161, 2160, 2160, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa        
##    5  0.5380661   7.602386e-02
##    7  0.5366737   7.319577e-02
##    9  0.5306946   6.134150e-02
##   11  0.5208305   4.133428e-02
##   13  0.5079190   1.558916e-02
##   15  0.4969485  -6.457724e-03
##   17  0.4991724  -1.949733e-03
##   19  0.5001400  -2.570029e-05
##   21  0.5023675   4.357638e-03
##   23  0.4994502  -1.457676e-03
##   25  0.4987708  -2.731648e-03
##   27  0.5009826   1.623356e-03
##   29  0.4951504  -1.001811e-02
##   31  0.4930688  -1.413945e-02
##   33  0.4922326  -1.584254e-02
##   35  0.4848709  -3.052539e-02
##   37  0.4859878  -2.830650e-02
##   39  0.4815323  -3.720643e-02
##   41  0.4825080  -3.524605e-02
##   43  0.4862702  -2.762027e-02
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

3. Classification under asymmetric response and cost

Using the weighted sampling to produce a training set with equal numbers of donors and non-donors is more efficient that using a simple random sample from the original method. As mentioned earlier, balancing out the data set is important for classification models to enable even distribution and reduce bias. Having an unequal distribution in class among the target variable can cause models. The random sample would still have the unequal distribution and can be bias towards any particular class that may be more frequent.

4. Evaluate the fit The tables below compare the Training and Test Error rates for each model.

Comparison = data.frame(Type = c('LinearSVM', 'DecisionTree', 'KNN'),
                        Test_Error = c(SV.Test.Error, Tree.Test.Error, KNN.Test.Error),
                        Train_Error = c(SV.Train.Error, Tree.Train.Error, KNN.Train.Error))

Comparison

##           Type Test_Error Train_Error
## 1    LinearSVM  0.4516667   0.4229167
## 2 DecisionTree  0.5166667   0.4962500
## 3          KNN  0.4666667   0.3016667

par(mfrow=c(1,2))

plot(Comparison$Train_Error, xaxt="n", ylab='Error Rate', xlab='')
axis(1, at=1:3, labels=Comparison$Type, cex.axis=1)
title("Comparison of Train Errors")

plot(Comparison$Test_Error, xaxt="n", ylab='Error Rate', xlab='')
points(min(Comparison$Test_Error), pch = 19, type = "b", col='red')
axis(1, at=1:3, labels=Comparison$Type, cex.axis=1)
title("Comparison of Test Errors")

5. Select best model

Utilizing the Test Error Rate, the best model is the Linear Support Vector model.

Step 3: Testing

To test the model I must upload the future_fundraising.rds data set provided resulting in an accuracy rate of 58.3%.

Future.DF = read_rds("~/future_fundraising.rds")
Future.DF = subset(Future.DF, select=c(num_child, num_prom, income, home_value, last_gift, months_since_donate))
svc.Future.pred=predict(Target.fitR, Future.DF)
write.table(svc.Future.pred, file = "SV.predictions.csv", col.names = c("value"), row.names = FALSE)