A national veterans organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
The objective of this analysis is to create a predictive classification model that will identify direct-mail recipients that will make a donation in order to maximize donations.
The goal of this analysis is to maximize the organization’s net profit through accurate recipient targeting.
The data used will be the fundraising.rds dataset has
3,000 observations with 50% donors and 50% non-donors that was provided.
The data set will be divided up in order to train a model and test the
model.
As mentioned in the background the data set is weighted, under-representing the non-responders so that the sample has equal numbers of donors and non-donors. Balancing out the data set is important for classification models to enable even distribution and reduce bias.
An additional data set, future_fundraising.rds will be
used to make predictions.
Installing packages for project
library(readr)
library(caret)
library(car)
library(MASS)
library(dplyr)
library(class)
library(e1071)
DF = read_rds("~/fundraising.rds")
The data does not have any null values, attaching the data for ease of use in future steps.
sum(is.na(DF))
## [1] 0
The sample is partitioned into 80% training and 20% validation.
set.seed(12345)
train = sample(nrow(DF), 0.8*(nrow(DF)))
train.set.full = DF[train,]
test.set.full = DF[-train,]
Conducting a few analysis and visualizations to better understand the data.
summary(DF)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
str(DF)
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
par(mfrow=c(2,2))
hist(DF$num_child)
hist(DF$wealth)
hist(DF$home_value)
hist(DF$largest_gift)
dim(DF)
## [1] 3000 21
numeric = (DF[,c(6:7,9:21)])
numeric$target = as.numeric(numeric$target)
par(mfrow=c(1,3))
boxplot(numeric[1:5], las=2, cex.axis = .8)
boxplot(numeric[6:10], las=2, cex.axis = .8)
boxplot(numeric[11:14], las=2, cex.axis = .8)
Utilizing summary to examine the predictors there are
3000 observations, 20 predictor variables and one target variables that
has two levels, “Donor” and “No Donor”. The split ratio from the
provided data is Donor:1499 and No Donor:1501.
categorical variables include zipconvert2-5,
homeowner, female, and potentially
num_child, income and wealth
based on their values.
A few variables are skewed and have outliers such as
lifetime_gift, last_gift, and
home_value.
2-1. Evaluating the association between the predictor variables and response variables
cor(numeric)
## num_child income wealth home_value
## num_child 1.000000000 0.091893089 0.060175537 -0.0119642286
## income 0.091893089 1.000000000 0.208993101 0.2919734944
## wealth 0.060175537 0.208993101 1.000000000 0.2611611450
## home_value -0.011964229 0.291973494 0.261161145 1.0000000000
## med_fam_inc 0.046961647 0.367505334 0.377763371 0.7381530742
## avg_fam_inc 0.047261395 0.378585352 0.385892299 0.7525690021
## pct_lt15k -0.031717891 -0.283191234 -0.375145585 -0.3990861577
## num_prom -0.086432604 -0.069008634 -0.412117770 -0.0645138583
## lifetime_gifts -0.050954766 -0.019565470 -0.225473319 -0.0240737013
## largest_gift -0.017554416 0.033180760 -0.025276518 0.0564942757
## last_gift -0.012948678 0.109592754 0.052591311 0.1588576542
## months_since_donate -0.005563603 0.077238810 0.033713981 0.0234285142
## time_lag -0.006069356 -0.001545727 -0.066421329 0.0006789113
## avg_gift -0.019688680 0.124055750 0.091078754 0.1687736865
## target 0.042348253 -0.035953287 -0.003114465 -0.0215691141
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.046961647 0.047261395 -0.0317178911 -0.08643260
## income 0.367505334 0.378585352 -0.2831912335 -0.06900863
## wealth 0.377763371 0.385892299 -0.3751455847 -0.41211777
## home_value 0.738153074 0.752569002 -0.3990861577 -0.06451386
## med_fam_inc 1.000000000 0.972271285 -0.6653626748 -0.05078270
## avg_fam_inc 0.972271285 1.000000000 -0.6802847967 -0.05731139
## pct_lt15k -0.665362675 -0.680284797 1.0000000000 0.03777518
## num_prom -0.050782705 -0.057311385 0.0377751828 1.00000000
## lifetime_gifts -0.035245827 -0.040327155 0.0596188059 0.53861957
## largest_gift 0.047032066 0.043103937 -0.0078829361 0.11381034
## last_gift 0.135976003 0.131378624 -0.0617521213 -0.05586809
## months_since_donate 0.032336691 0.031268594 -0.0090145584 -0.28232212
## time_lag 0.015202043 0.024340381 -0.0199114896 0.11962322
## avg_gift 0.137162758 0.131758434 -0.0624808920 -0.14725094
## target -0.008036116 -0.003177139 0.0007592833 -0.06836599
## lifetime_gifts largest_gift last_gift months_since_donate
## num_child -0.05095477 -0.017554416 -0.01294868 -0.005563603
## income -0.01956547 0.033180760 0.10959275 0.077238810
## wealth -0.22547332 -0.025276518 0.05259131 0.033713981
## home_value -0.02407370 0.056494276 0.15885765 0.023428514
## med_fam_inc -0.03524583 0.047032066 0.13597600 0.032336691
## avg_fam_inc -0.04032716 0.043103937 0.13137862 0.031268594
## pct_lt15k 0.05961881 -0.007882936 -0.06175212 -0.009014558
## num_prom 0.53861957 0.113810342 -0.05586809 -0.282322122
## lifetime_gifts 1.00000000 0.507262313 0.20205827 -0.144621862
## largest_gift 0.50726231 1.000000000 0.44723693 0.019789633
## last_gift 0.20205827 0.447236933 1.00000000 0.186715010
## months_since_donate -0.14462186 0.019789633 0.18671501 1.000000000
## time_lag 0.03854575 0.039977035 0.07511121 0.015528499
## avg_gift 0.18232435 0.474830096 0.86639998 0.189110799
## target -0.01962693 0.017783355 0.07772082 0.133813301
## time_lag avg_gift target
## num_child -0.0060693555 -0.01968868 0.0423482529
## income -0.0015457272 0.12405575 -0.0359532869
## wealth -0.0664213294 0.09107875 -0.0031144649
## home_value 0.0006789113 0.16877369 -0.0215691141
## med_fam_inc 0.0152020426 0.13716276 -0.0080361157
## avg_fam_inc 0.0243403812 0.13175843 -0.0031771394
## pct_lt15k -0.0199114896 -0.06248089 0.0007592833
## num_prom 0.1196232155 -0.14725094 -0.0683659889
## lifetime_gifts 0.0385457538 0.18232435 -0.0196269259
## largest_gift 0.0399770354 0.47483010 0.0177833547
## last_gift 0.0751112090 0.86639998 0.0777208200
## months_since_donate 0.0155284995 0.18911080 0.1338133012
## time_lag 1.0000000000 0.07008164 -0.0097457015
## avg_gift 0.0700816428 1.00000000 0.0756630051
## target -0.0097457015 0.07566301 1.0000000000
library(corrplot)
par(mfrow=c(1,1))
corrplot(cor(numeric), method = "number", type='lower')
plot(numeric,col=numeric$target)
Correlation When running a correlation matrix and chart
it reflects that
med_fam_inc and avg_fam_inc
are higlhy correlated at 0.972271285 value. home_value is
highly correlated with both med_fam_inc and
avg_fam_inc, both at levels above .73. Additionally,
last_gift and avg_gift are correlated with a
value of 0.86639998.
Modeling the data using `glm() for significance/good predictors.
lm = glm(target~., data = DF, family="binomial")
summary(lm)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90432 -1.15349 0.00153 1.15919 1.79778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.885e+00 4.595e-01 -4.102 4.10e-05 ***
## zipconvert2Yes -1.365e+01 2.670e+02 -0.051 0.95924
## zipconvert3No 1.361e+01 2.670e+02 0.051 0.95934
## zipconvert4Yes -1.365e+01 2.670e+02 -0.051 0.95922
## zipconvert5Yes -1.365e+01 2.670e+02 -0.051 0.95922
## homeownerNo 4.957e-02 9.412e-02 0.527 0.59847
## num_child 2.752e-01 1.137e-01 2.422 0.01544 *
## income -6.952e-02 2.595e-02 -2.679 0.00738 **
## femaleNo 5.995e-02 7.673e-02 0.781 0.43463
## wealth -1.907e-02 1.800e-02 -1.059 0.28940
## home_value -1.074e-04 7.141e-05 -1.503 0.13272
## med_fam_inc -1.200e-03 9.303e-04 -1.289 0.19725
## avg_fam_inc 1.756e-03 1.010e-03 1.738 0.08226 .
## pct_lt15k -9.519e-04 4.440e-03 -0.214 0.83024
## num_prom -3.682e-03 2.317e-03 -1.589 0.11204
## lifetime_gifts 1.599e-04 3.721e-04 0.430 0.66743
## largest_gift -1.773e-03 3.091e-03 -0.574 0.56629
## last_gift 9.923e-03 7.562e-03 1.312 0.18945
## months_since_donate 5.922e-02 1.003e-02 5.906 3.51e-09 ***
## time_lag -6.174e-03 6.789e-03 -0.909 0.36311
## avg_gift 7.539e-03 1.106e-02 0.682 0.49526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4062.0 on 2979 degrees of freedom
## AIC: 4104
##
## Number of Fisher Scoring iterations: 12
par(mfrow=c(2,2))
plot(lm, col="dark blue")
months_since_donate, income, and
num_child all have significant p values making them good
candidate predictors.
library(car)
vif(lm)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5
## 8.790721e+06 7.784501e+06 8.750031e+06 1.225688e+07
## homeowner num_child income female
## 1.132353e+00 1.026222e+00 1.311414e+00 1.016256e+00
## wealth home_value med_fam_inc avg_fam_inc
## 1.523860e+00 3.308054e+00 1.877177e+01 2.100427e+01
## pct_lt15k num_prom lifetime_gifts largest_gift
## 2.102697e+00 1.964905e+00 2.135341e+00 2.098192e+00
## last_gift months_since_donate time_lag avg_gift
## 3.945654e+00 1.132696e+00 1.038106e+00 4.232227e+00
From the values provided with the variance inflation function,
zipconvert2 has the highest vif() value above
5. Due to this, we will remove this variable from our logistic model and
retest for high collinear values.
lm2=glm(target~. -zipconvert2,family = "binomial", data = DF)
vif(lm2)
## zipconvert3 zipconvert4 zipconvert5 homeowner
## 1.555523 1.591039 1.983755 1.133003
## num_child income female wealth
## 1.026233 1.311900 1.016280 1.520999
## home_value med_fam_inc avg_fam_inc pct_lt15k
## 3.289635 18.773445 21.005426 2.099468
## num_prom lifetime_gifts largest_gift last_gift
## 1.962553 2.137299 2.105473 3.952911
## months_since_donate time_lag avg_gift
## 1.131956 1.037745 4.238023
Repeating the previous step and removing the next variables with the
highest collinearity (avg_fam_inc) above 5.
lm3=glm(target~. -zipconvert2-avg_fam_inc,family = "binomial", data = DF)
vif(lm3)
## zipconvert3 zipconvert4 zipconvert5 homeowner
## 1.552217 1.590326 1.973780 1.129662
## num_child income female wealth
## 1.026185 1.300439 1.016322 1.518717
## home_value med_fam_inc pct_lt15k num_prom
## 3.069859 3.912247 1.984977 1.960837
## lifetime_gifts largest_gift last_gift months_since_donate
## 2.131605 2.090869 3.965786 1.131913
## time_lag avg_gift
## 1.033901 4.245008
No additional variables remain with a vif() value above 5, meaning we no longer need to remove any more predictors due to collinearity. Collinearity reduces the accuracy of the estimates of the regression coefficients.
Candidate Predictors
As mentioned, the following variables are good candidate predictors:
months_since_donate, income, and
num_child due to their significance level.
Collinearity
As mentioned, both zipconvert2 and avg_fam_inc
had high collinearity with vif() values above 5 in relation
to the response variable target. Removing them one by one
resulted in no more high collinearity.
Removing the two variables with high collinearity and establishing training/testing sets using the split from step 1:
DF = subset(DF, select=c(-zipconvert2,-avg_fam_inc))
head(DF)
## # A tibble: 6 × 19
## zipconv…¹ zipco…² zipco…³ homeo…⁴ num_c…⁵ income female wealth home_…⁶ med_f…⁷
## <fct> <fct> <fct> <fct> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 No No No Yes 1 1 No 7 698 422
## 2 No No Yes No 2 5 Yes 8 828 358
## 3 No No Yes Yes 1 3 No 4 1471 484
## 4 Yes No No Yes 1 4 No 8 547 386
## 5 Yes No No Yes 1 4 Yes 8 482 242
## 6 No No Yes Yes 1 4 Yes 8 857 450
## # … with 9 more variables: pct_lt15k <dbl>, num_prom <dbl>,
## # lifetime_gifts <dbl>, largest_gift <dbl>, last_gift <dbl>,
## # months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>, target <fct>,
## # and abbreviated variable names ¹zipconvert3, ²zipconvert4, ³zipconvert5,
## # ⁴homeowner, ⁵num_child, ⁶home_value, ⁷med_fam_inc
train.set = DF[train,]
test.set = DF[-train,]
Using Best Subset Selection to determine variables to keep in the model.
set.seed(12345)
library(leaps)
regfit.full = regsubsets(target ~ ., data = DF, nvmax = 19)
regfull_summary = summary(regfit.full)
names(regfull_summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
regfull_summary$rsq
## [1] 0.01790600 0.02078753 0.02347999 0.02586673 0.02677499 0.02715753
## [7] 0.02743641 0.02774131 0.02799367 0.02817210 0.02833369 0.02844443
## [13] 0.02851728 0.02858751 0.02864296 0.02868859 0.02871641 0.02871914
par(mfrow = c(2,2))
plot(regfull_summary$cp)
plot(regfull_summary$bic)
plot(regfull_summary$adjr2)
cp_min = which.min(summary(regfit.full)$cp)
cp_min
## [1] 5
bic_min = which.min(summary(regfit.full)$bic)
bic_min
## [1] 3
adj_r2_max = which.max(summary(regfit.full)$adjr2)
adj_r2_max
## [1] 6
coef(regfit.full,6)
## (Intercept) num_child income home_value
## 1.052062e+00 6.599813e-02 -1.598352e-02 -1.086337e-05
## num_prom last_gift months_since_donate
## -7.167662e-04 3.034987e-03 1.434946e-02
2. Select classification tool and parameters
All of the models used the trimmed data set that removed
zipconvert2 and avg_fam_inc due to their
collinearity. Additionally, all models were created using the train.set
data set. Additionally, using Best Subset Selection I decided to use 6
predictor variables in my model, those are: num_child,
income, home_value, num_prom,
last_gift, and months_since_donate.
Only keeping the 6 previously mentioned variables
DF = subset(DF, select= - c((zipconvert3:homeowner), female, wealth, med_fam_inc:pct_lt15k, lifetime_gifts, largest_gift, time_lag, avg_gift))
summary(DF)
## num_child income home_value num_prom
## Min. :1.000 Min. :1.000 Min. : 0.0 Min. : 11.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 554.8 1st Qu.: 29.00
## Median :1.000 Median :4.000 Median : 816.5 Median : 48.00
## Mean :1.069 Mean :3.899 Mean :1143.3 Mean : 49.14
## 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:1341.2 3rd Qu.: 65.00
## Max. :5.000 Max. :7.000 Max. :5945.0 Max. :157.00
## last_gift months_since_donate target
## Min. : 0.00 Min. :17.00 Donor :1499
## 1st Qu.: 7.00 1st Qu.:29.00 No Donor:1501
## Median : 10.00 Median :31.00
## Mean : 13.48 Mean :31.13
## 3rd Qu.: 16.00 3rd Qu.:34.00
## Max. :219.00 Max. :37.00
DF=as.data.frame(DF)
head(DF)
## num_child income home_value num_prom last_gift months_since_donate target
## 1 1 1 698 46 12 34 Donor
## 2 2 5 828 32 5 29 Donor
## 3 1 3 1471 94 8 30 No Donor
## 4 1 4 547 20 11 30 No Donor
## 5 1 4 482 38 10 31 Donor
## 6 1 4 857 47 20 37 Donor
train.set = DF[train,]
test.set = DF[-train,]
2. classification models and details
All Models All three models attempted (SVM with Linear
Kernel, Tree, KNN) used num_child, income,
home_value, num_prom, last_gift,
and months_since_donate as predictor variables.
The first classification model used the Support Vector Machine with a Linear Kernel method using cross validation to choose the best cost (.01). When running this model set probabilities to TRUE to allow for probability predictions. This model resulted in a train error of 0.4229167 and a test error of 0.4516667.
The second classification model used the Tree method with,
unfortunately the tree method resulted in only two nodes splitting on
the largest_gift predictor. The test error for this model
is 0.5166667. Since this model did not seem too robust I conducted a
third model.
The third classification model used the K-nearest neighbors method, cross validation was used to select the best value for K (K=5) and the model used the same 6 variables produced by the best subset selection. The train error is 0.3016667 and the test error is 0.4666667
The best model based on test error rates is the Support Vector with a Linear Kernel.
#finding best tune for cost
set.seed(12345)
tuneTarget = tune(svm, target~., data=train.set, kernel="linear", ranges = list(cost=c(cost=seq(0.01, 10, length.out=20))))
summary(tuneTarget) #best tune is cost=0.01
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.43
##
## - Detailed performance results:
## cost error dispersion
## 1 0.0100000 0.4300000 0.03937200
## 2 0.5357895 0.4329167 0.03410281
## 3 1.0615789 0.4333333 0.03327541
## 4 1.5873684 0.4333333 0.03327541
## 5 2.1131579 0.4337500 0.03324351
## 6 2.6389474 0.4337500 0.03324351
## 7 3.1647368 0.4337500 0.03324351
## 8 3.6905263 0.4337500 0.03324351
## 9 4.2163158 0.4333333 0.03327541
## 10 4.7421053 0.4337500 0.03324351
## 11 5.2678947 0.4333333 0.03327541
## 12 5.7936842 0.4337500 0.03324351
## 13 6.3194737 0.4337500 0.03324351
## 14 6.8452632 0.4333333 0.03327541
## 15 7.3710526 0.4337500 0.03324351
## 16 7.8968421 0.4337500 0.03324351
## 17 8.4226316 0.4337500 0.03324351
## 18 8.9484211 0.4333333 0.03327541
## 19 9.4742105 0.4333333 0.03327541
## 20 10.0000000 0.4337500 0.03324351
#using best tune for cost
set.seed(12345)
Target.fitR = svm(as.factor(target) ~ ., data = train.set, kernel ="linear", cost = 0.01, probability=TRUE)
summary(Target.fitR)
##
## Call:
## svm(formula = as.factor(target) ~ ., data = train.set, kernel = "linear",
## cost = 0.01, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 2260
##
## ( 1130 1130 )
##
##
## Number of Classes: 2
##
## Levels:
## Donor No Donor
Target.fitR
##
## Call:
## svm(formula = as.factor(target) ~ ., data = train.set, kernel = "linear",
## cost = 0.01, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 2260
svc.train.pred = predict(Target.fitR, train.set, probability=TRUE)
table(svc.train.pred, train.set$target)
##
## svc.train.pred Donor No Donor
## Donor 771 577
## No Donor 438 614
SV.Train.Error = mean(train.set$target != svc.train.pred)
SV.Train.Error # 0.4229167 train error, 0.4516667 test error f/linear w/cost=0.01
## [1] 0.4229167
svc.test.pred=predict(Target.fitR, test.set, probability = TRUE)
table(svc.test.pred, test.set$target)
##
## svc.test.pred Donor No Donor
## Donor 185 166
## No Donor 105 144
SV.Test.Error = mean(test.set$target != svc.test.pred)
SV.Test.Error
## [1] 0.4516667
Target.fitR$epsilon
## [1] 0.1
library(tree)
Target_tree=tree(target ~ ., data = train.set)
summary(Target_tree)
##
## Classification tree:
## tree(formula = target ~ ., data = train.set)
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes: 1
## Residual mean deviance: 1.387 = 3327 / 2399
## Misclassification error rate: 0.4962 = 1191 / 2400
set.seed(1234)
Target.pred = predict(Target_tree, train.set, type = "class")
table(train.set$target, Target.pred)
## Target.pred
## Donor No Donor
## Donor 1209 0
## No Donor 1191 0
Tree.Train.Error = mean(Target.pred!=train.set$target)
Tree.Train.Error
## [1] 0.49625
2 terminal nodes The test error rate is 0.5166667
set.seed(1234)
Target.pred = predict(Target_tree, test.set, type = "class")
table(test.set$target, Target.pred)
## Target.pred
## Donor No Donor
## Donor 290 0
## No Donor 310 0
Tree.Test.Error = mean(Target.pred!=test.set$target)
Tree.Test.Error
## [1] 0.5166667
train_control = trainControl(method="repeatedcv",number=10,repeats=3)
knn.fit = train(target~.,
data=train.set,
method='knn',
trControl = train_control,
tuneLength=20)
target.pred=predict(knn.fit, train.set)
table(train.set$target, target.pred)
## target.pred
## Donor No Donor
## Donor 856 353
## No Donor 371 820
KNN.Train.Error=mean(target.pred!=train.set$target) #0.3016667 train error
KNN.Train.Error
## [1] 0.3016667
target.pred.test=predict(knn.fit, test.set)
table(test.set$target, target.pred.test)
## target.pred.test
## Donor No Donor
## Donor 160 130
## No Donor 150 160
KNN.Test.Error= mean(target.pred.test!=test.set$target) #0.4666667% test error
KNN.Test.Error
## [1] 0.4666667
knn.fit
## k-Nearest Neighbors
##
## 2400 samples
## 6 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2160, 2161, 2160, 2160, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5380661 7.602386e-02
## 7 0.5366737 7.319577e-02
## 9 0.5306946 6.134150e-02
## 11 0.5208305 4.133428e-02
## 13 0.5079190 1.558916e-02
## 15 0.4969485 -6.457724e-03
## 17 0.4991724 -1.949733e-03
## 19 0.5001400 -2.570029e-05
## 21 0.5023675 4.357638e-03
## 23 0.4994502 -1.457676e-03
## 25 0.4987708 -2.731648e-03
## 27 0.5009826 1.623356e-03
## 29 0.4951504 -1.001811e-02
## 31 0.4930688 -1.413945e-02
## 33 0.4922326 -1.584254e-02
## 35 0.4848709 -3.052539e-02
## 37 0.4859878 -2.830650e-02
## 39 0.4815323 -3.720643e-02
## 41 0.4825080 -3.524605e-02
## 43 0.4862702 -2.762027e-02
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
Using the weighted sampling to produce a training set with equal numbers of donors and non-donors is more efficient that using a simple random sample from the original method. As mentioned earlier, balancing out the data set is important for classification models to enable even distribution and reduce bias. Having an unequal distribution in class among the target variable can cause models. The random sample would still have the unequal distribution and can be bias towards any particular class that may be more frequent.
Comparison = data.frame(Type = c('LinearSVM', 'DecisionTree', 'KNN'),
Test_Error = c(SV.Test.Error, Tree.Test.Error, KNN.Test.Error),
Train_Error = c(SV.Train.Error, Tree.Train.Error, KNN.Train.Error))
Comparison
## Type Test_Error Train_Error
## 1 LinearSVM 0.4516667 0.4229167
## 2 DecisionTree 0.5166667 0.4962500
## 3 KNN 0.4666667 0.3016667
par(mfrow=c(1,2))
plot(Comparison$Train_Error, xaxt="n", ylab='Error Rate', xlab='')
axis(1, at=1:3, labels=Comparison$Type, cex.axis=1)
title("Comparison of Train Errors")
plot(Comparison$Test_Error, xaxt="n", ylab='Error Rate', xlab='')
points(min(Comparison$Test_Error), pch = 19, type = "b", col='red')
axis(1, at=1:3, labels=Comparison$Type, cex.axis=1)
title("Comparison of Test Errors")
Utilizing the Test Error Rate, the best model is the Linear Support Vector model.
To test the model I must upload the future_fundraising.rds data set provided resulting in an accuracy rate of 58.3%.
Future.DF = read_rds("~/future_fundraising.rds")
Future.DF = subset(Future.DF, select=c(num_child, num_prom, income, home_value, last_gift, months_since_donate))
svc.Future.pred=predict(Target.fitR, Future.DF)
write.table(svc.Future.pred, file = "SV.predictions.csv", col.names = c("value"), row.names = FALSE)