A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-respondents so that the sample has equal numbers of donors and non-donors.
The objective of this project is to develop a classification model that can effectively capture the donors who are likely to donate in future campaigns so that the expected net profit is maximized. Since direct mailing campaigns carry a cost, it is more profitable to target the donors who are likely to make donations than to send mailing gifts to all 13 million donors in the database. Such a model reduces mailing costs by identifying the donors who are likely to make a donation.
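The figures quoted in the problem statement already show why targeting matters. The following is a back-of-the-envelope sketch in R that uses only the stated response rate, average donation, and mailing cost; the variable names are illustrative and not part of the original analysis.

```r
# Figures taken from the problem statement
response_rate <- 0.051   # overall response rate
avg_donation  <- 13.00   # average donation among responders, in dollars
mailing_cost  <- 0.68    # cost per mailing, in dollars

# Expected net profit per piece when mailing the whole list
profit_per_mailing <- response_rate * avg_donation - mailing_cost

# Response rate a targeted list must exceed to break even
break_even_rate <- mailing_cost / avg_donation

print(round(profit_per_mailing, 3))
print(round(break_even_rate, 4))
```

Under these figures, mailing everyone loses about $0.017 per piece, and a targeted list only pays off if its response rate exceeds roughly 5.23%.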
The goal of this project is to improve the national veterans’ organization’s cost-effectiveness in its direct mailing campaigns using data analysis and predictive models.
This data set was kindly provided by The American Legion. It contains 3,000 records with 21 variables: 20 predictors covering each donor’s demographic data and previous donation history, plus the response variable target. Our goal is to predict which donors should receive a campaign mailing, and to apply the chosen model to a second data set containing the attributes of 120 future mailing candidates. Mailing only to the candidates predicted to donate, rather than to the entire donor list, reduces mailing costs.
Below is the detailed analysis.
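# Packages used in this analysis; vif() is assumed to come from usdm,
# which accepts a data frame of predictors
library(usdm)
library(dplyr)
library(caret)
library(ggplot2)

fundraising = readRDS("./fundraising.rds")
future_fundraising = readRDS("./future_fundraising.rds")
str(fundraising)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)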
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
Collinearity
fund.num = fundraising %>% select_if(is.numeric)
fund.num.corr = cor(fund.num)
fund.num.corr
## num_child income wealth home_value
## num_child 1.000000000 0.091893089 0.06017554 -0.0119642286
## income 0.091893089 1.000000000 0.20899310 0.2919734944
## wealth 0.060175537 0.208993101 1.00000000 0.2611611450
## home_value -0.011964229 0.291973494 0.26116115 1.0000000000
## med_fam_inc 0.046961647 0.367505334 0.37776337 0.7381530742
## avg_fam_inc 0.047261395 0.378585352 0.38589230 0.7525690021
## pct_lt15k -0.031717891 -0.283191234 -0.37514558 -0.3990861577
## num_prom -0.086432604 -0.069008634 -0.41211777 -0.0645138583
## lifetime_gifts -0.050954766 -0.019565470 -0.22547332 -0.0240737013
## largest_gift -0.017554416 0.033180760 -0.02527652 0.0564942757
## last_gift -0.012948678 0.109592754 0.05259131 0.1588576542
## months_since_donate -0.005563603 0.077238810 0.03371398 0.0234285142
## time_lag -0.006069356 -0.001545727 -0.06642133 0.0006789113
## avg_gift -0.019688680 0.124055750 0.09107875 0.1687736865
## med_fam_inc avg_fam_inc pct_lt15k num_prom
## num_child 0.04696165 0.04726139 -0.031717891 -0.08643260
## income 0.36750533 0.37858535 -0.283191234 -0.06900863
## wealth 0.37776337 0.38589230 -0.375145585 -0.41211777
## home_value 0.73815307 0.75256900 -0.399086158 -0.06451386
## med_fam_inc 1.00000000 0.97227129 -0.665362675 -0.05078270
## avg_fam_inc 0.97227129 1.00000000 -0.680284797 -0.05731139
## pct_lt15k -0.66536267 -0.68028480 1.000000000 0.03777518
## num_prom -0.05078270 -0.05731139 0.037775183 1.00000000
## lifetime_gifts -0.03524583 -0.04032716 0.059618806 0.53861957
## largest_gift 0.04703207 0.04310394 -0.007882936 0.11381034
## last_gift 0.13597600 0.13137862 -0.061752121 -0.05586809
## months_since_donate 0.03233669 0.03126859 -0.009014558 -0.28232212
## time_lag 0.01520204 0.02434038 -0.019911490 0.11962322
## avg_gift 0.13716276 0.13175843 -0.062480892 -0.14725094
## lifetime_gifts largest_gift last_gift months_since_donate
## num_child -0.05095477 -0.017554416 -0.01294868 -0.005563603
## income -0.01956547 0.033180760 0.10959275 0.077238810
## wealth -0.22547332 -0.025276518 0.05259131 0.033713981
## home_value -0.02407370 0.056494276 0.15885765 0.023428514
## med_fam_inc -0.03524583 0.047032066 0.13597600 0.032336691
## avg_fam_inc -0.04032716 0.043103937 0.13137862 0.031268594
## pct_lt15k 0.05961881 -0.007882936 -0.06175212 -0.009014558
## num_prom 0.53861957 0.113810342 -0.05586809 -0.282322122
## lifetime_gifts 1.00000000 0.507262313 0.20205827 -0.144621862
## largest_gift 0.50726231 1.000000000 0.44723693 0.019789633
## last_gift 0.20205827 0.447236933 1.00000000 0.186715010
## months_since_donate -0.14462186 0.019789633 0.18671501 1.000000000
## time_lag 0.03854575 0.039977035 0.07511121 0.015528499
## avg_gift 0.18232435 0.474830096 0.86639998 0.189110799
## time_lag avg_gift
## num_child -0.0060693555 -0.01968868
## income -0.0015457272 0.12405575
## wealth -0.0664213294 0.09107875
## home_value 0.0006789113 0.16877369
## med_fam_inc 0.0152020426 0.13716276
## avg_fam_inc 0.0243403812 0.13175843
## pct_lt15k -0.0199114896 -0.06248089
## num_prom 0.1196232155 -0.14725094
## lifetime_gifts 0.0385457538 0.18232435
## largest_gift 0.0399770354 0.47483010
## last_gift 0.0751112090 0.86639998
## months_since_donate 0.0155284995 0.18911080
## time_lag 1.0000000000 0.07008164
## avg_gift 0.0700816428 1.00000000
corrplot::corrplot(fund.num.corr, type = "lower")
From the correlation plot above, we can see that home_value, avg_fam_inc, and med_fam_inc are highly correlated with each other; avg_fam_inc and med_fam_inc in particular are nearly redundant (r = 0.97). Of these two I will keep only med_fam_inc for my further models. A similarly high correlation is present between avg_gift and last_gift (r = 0.87), and I will select avg_gift for my further analysis. High correlations among predictors can lead to multicollinearity when developing models. Since multicollinearity inflates the variance of the coefficient estimates, it is important to remove such predictors while building models.
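Reading a large correlation matrix by eye is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below uses synthetic stand-in data rather than the fundraising data; the 0.7 cutoff and all names are assumptions chosen for illustration.

```r
# Synthetic stand-in data; column names are illustrative only
set.seed(1)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  a = x1,
  b = x1 + rnorm(n, sd = 0.1),  # nearly a copy of a
  c = rnorm(n)                  # unrelated to a and b
)

corr <- cor(toy)
# Keep each pair once by blanking the diagonal and upper triangle
corr[upper.tri(corr, diag = TRUE)] <- NA
high <- which(abs(corr) > 0.7, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)
```

Applied to fund.num.corr, the same steps would flag the family-income and gift pairs discussed above, assuming the same cutoff.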
Variance Inflation Factor
vif(as.data.frame(fundraising[, c(6,7, 9:20)]))
## Variables VIF
## 1 num_child 1.025019
## 2 income 1.194773
## 3 wealth 1.508819
## 4 home_value 2.493018
## 5 med_fam_inc 18.423616
## 6 avg_fam_inc 20.688945
## 7 pct_lt15k 2.040761
## 8 num_prom 1.962585
## 9 lifetime_gifts 1.994202
## 10 largest_gift 1.715238
## 11 last_gift 4.153071
## 12 months_since_donate 1.145515
## 13 time_lag 1.032467
## 14 avg_gift 4.469569
vif(as.data.frame(fundraising[, c(6,7,9:11, 13:20)]))
## Variables VIF
## 1 num_child 1.024749
## 2 income 1.187566
## 3 wealth 1.505671
## 4 home_value 2.316403
## 5 med_fam_inc 3.603571
## 6 pct_lt15k 1.937771
## 7 num_prom 1.962376
## 8 lifetime_gifts 1.994140
## 9 largest_gift 1.715228
## 10 last_gift 4.152560
## 11 months_since_donate 1.145514
## 12 time_lag 1.029603
## 13 avg_gift 4.465531
After removing avg_fam_inc, the VIF of med_fam_inc drops from 18.4 to 3.6, and every remaining variable has a VIF below 5.
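For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on synthetic data; the data and names are illustrative assumptions, not the fundraising variables.

```r
set.seed(1)
n  <- 300
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.2),  # strongly related to x1
  x3 = rnorm(n)                  # independent of the others
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest
manual_vif <- sapply(names(toy), function(v) {
  fit <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
print(round(manual_vif, 2))
```

The two related columns receive large VIFs while the independent column stays near 1, which is the same pattern seen for med_fam_inc and avg_fam_inc before one of them is dropped.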
Random Forest Model for variable importance
set.seed(12345)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(12345)
rf.model0 = train(target ~ . , data = fundraising, method = "rf", importance = TRUE, trControl = ctrl)
plot(varImp(rf.model0))
varImp(rf.model0)
## rf variable importance
##
## Importance
## months_since_donate 100.000
## largest_gift 79.248
## avg_gift 51.712
## last_gift 44.548
## med_fam_inc 38.017
## income 35.558
## pct_lt15k 32.202
## num_child 31.868
## avg_fam_inc 31.862
## home_value 30.846
## homeownerNo 29.560
## lifetime_gifts 27.131
## zipconvert4Yes 14.667
## zipconvert2Yes 12.113
## zipconvert3No 10.718
## zipconvert5Yes 9.949
## wealth 7.350
## num_prom 7.128
## time_lag 4.128
## femaleNo 0.000
For the multicollinearity reasons mentioned above, and after comparing the variables against the importance plot from the random forest model, I will use months_since_donate, largest_gift, avg_gift (not last_gift, due to multicollinearity), med_fam_inc, num_child, income, and home_value as my final set of predictors.
A class or an observation is excluded from prediction if its precision or coverage statistics do not meet the threshold of usefulness. In this process none of the classes were excluded from the analysis, but predictors with high VIF were excluded from the models. Predictors that were unimportant or did not improve model accuracy were excluded as well.
Variable transformation is a way to make the data work better in the model. Variables can be numeric or categorical, and each type calls for a different transformation approach. No variable transformations were made for this analysis. In further analysis, studying interactions between predictors and including them in the final model could improve model performance.
Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
The original data is partitioned into training and testing data sets. The training set contains 80% of the original records (2,400 observations) and the testing set contains the remaining 20% (600 observations).
set.seed(12345)
train_index = sample(nrow(fundraising), round(nrow(fundraising)*0.8))
fund.train = fundraising[train_index,]
fund.test = fundraising[-train_index,]
2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
Method: Logistic regression is a parametric classification technique that estimates the probability of an event occurring, for instance, whether or not a person will be a donor. This model is simple to run and easy to interpret based on the significance of the predictors and the signs of the coefficients. The variables selected were months_since_donate, largest_gift, num_child, avg_gift, med_fam_inc, income, and home_value. The accuracy on the test set is 49.33%.
set.seed(12345)
glm.fit = glm(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
family = "binomial"
)
summary(glm.fit)
##
## Call:
## glm(formula = target ~ months_since_donate + largest_gift + num_child +
## avg_gift + med_fam_inc + income + home_value, family = "binomial",
## data = fund.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8392 -1.1448 -0.7965 1.1718 1.6874
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.204e+00 3.701e-01 -5.954 2.62e-09 ***
## months_since_donate 5.823e-02 1.074e-02 5.423 5.86e-08 ***
## largest_gift -1.234e-03 2.157e-03 -0.572 0.56709
## num_child 3.384e-01 1.266e-01 2.673 0.00752 **
## avg_gift 2.266e-02 7.342e-03 3.087 0.00202 **
## med_fam_inc 3.318e-04 3.703e-04 0.896 0.37029
## income -6.487e-02 2.709e-02 -2.394 0.01665 *
## home_value -7.344e-05 6.534e-05 -1.124 0.26106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3263.8 on 2392 degrees of freedom
## AIC: 3279.8
##
## Number of Fisher Scoring iterations: 4
Predictions on test set
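# Note: predict() on a glm object returns the linear predictor (log-odds) by
# default, so the 0.5 cutoff below is applied on the link scale
pred.glm = predict(glm.fit, fund.test)
class.glm = ifelse(pred.glm >=0.5, "Donor", "No Donor")
confusionMatrix(as.factor(class.glm), fund.test$target, positive = "Donor")
## Confusion Matrix and Statistics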
##
## Reference
## Prediction Donor No Donor
## Donor 6 20
## No Donor 284 290
##
## Accuracy : 0.4933
## 95% CI : (0.4526, 0.5341)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.8819
##
## Kappa : -0.0452
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.02069
## Specificity : 0.93548
## Pos Pred Value : 0.23077
## Neg Pred Value : 0.50523
## Prevalence : 0.48333
## Detection Rate : 0.01000
## Detection Prevalence : 0.04333
## Balanced Accuracy : 0.47809
##
## 'Positive' Class : Donor
##
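Two details of logistic regression in R affect how predictions like the ones above should be read. First, predict() on a glm object returns log-odds unless type = "response" is given. Second, with a factor response, glm() treats the first level as failure, so with levels "Donor" and "No Donor" the fitted probability is that of "No Donor". The sketch below illustrates both points on synthetic data; it does not re-run the fundraising model, and all names are illustrative.

```r
set.seed(12345)
n <- 500
x <- rnorm(n)
# Higher x makes "Donor" more likely in this toy data
y <- factor(ifelse(runif(n) < plogis(1.5 * x), "Donor", "No Donor"),
            levels = c("Donor", "No Donor"))
toy <- data.frame(x = x, target = y)

fit <- glm(target ~ x, data = toy, family = "binomial")

# type = "response" returns probabilities; the default is the log-odds scale
p_second <- predict(fit, toy, type = "response")

# glm() treats the first factor level as failure, so p_second is
# P(target == "No Donor"); take the complement to get P(Donor)
p_donor <- 1 - p_second
pred <- factor(ifelse(p_donor >= 0.5, "Donor", "No Donor"),
               levels = levels(toy$target))

print(coef(fit)["x"])            # sign is relative to P("No Donor")
print(mean(pred == toy$target))  # in-sample accuracy on the toy data
```

The same level ordering applies when reading the signs of the coefficients in a model summary: a positive coefficient raises the probability of the second level.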
Method: The second classification method I used is Quadratic Discriminant Analysis (QDA), which is a compromise between logistic regression and nonparametric methods. The QDA model allows for quadratic decision boundaries and can produce better results when the data is moderately non-linear. QDA is a generative model that assumes each class follows a Gaussian distribution. The class-specific prior is simply the proportion of data points that belong to the class. The class-specific mean vector is the average of the input variables over the observations that belong to the class. The accuracy on the test set is 54.33%.
set.seed(12345)
qda.fit = train(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
method = "qda",
trControl = ctrl
)
qda.fit
## Quadratic Discriminant Analysis
##
## 2400 samples
## 7 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5091679 0.02316112
Predictions on the test set
qda.preds = predict(qda.fit, fund.test)
confusionMatrix(qda.preds, fund.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 39 23
## No Donor 251 287
##
## Accuracy : 0.5433
## 95% CI : (0.5025, 0.5837)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.1026
##
## Kappa : 0.0619
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.1345
## Specificity : 0.9258
## Pos Pred Value : 0.6290
## Neg Pred Value : 0.5335
## Prevalence : 0.4833
## Detection Rate : 0.0650
## Detection Prevalence : 0.1033
## Balanced Accuracy : 0.5301
##
## 'Positive' Class : Donor
##
Method: The Naive Bayes model assumes that, within each class, the features are independent of each other, which allows the model to estimate individual class-conditional marginal densities. Our Naive Bayes classifier achieves its highest cross-validation accuracy using kernel density estimation, a nonparametric technique for estimating densities. This classifier has a cross-validation accuracy of 55.38% and an accuracy of 53.33% on the unseen test set.
set.seed(12345)
nb.fit = train(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
method = "naive_bayes",
trControl = ctrl
)
nb.fit
## Naive Bayes
##
## 2400 samples
## 7 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.5144514 0.03300049
## TRUE 0.5537577 0.10530248
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
Predictions on the test set
nb.pred = predict(nb.fit, fund.test)
confusionMatrix(nb.pred, fund.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 220 210
## No Donor 70 100
##
## Accuracy : 0.5333
## 95% CI : (0.4925, 0.5738)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.2189
##
## Kappa : 0.08
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7586
## Specificity : 0.3226
## Pos Pred Value : 0.5116
## Neg Pred Value : 0.5882
## Prevalence : 0.4833
## Detection Rate : 0.3667
## Detection Prevalence : 0.7167
## Balanced Accuracy : 0.5406
##
## 'Positive' Class : Donor
##
Method: The k-nearest neighbors (KNN) algorithm is a simple machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data grows. KNN works by computing the distances between a query and all the examples in the data, selecting the specified number of examples (k) closest to the query, and then voting for the most frequent label (in this case Donor or No Donor). In fitting the KNN model, I tried values of k ranging from 2 to 20. Accuracy was used to select the optimal model using the largest value. The final value used for the model was k = 8.
set.seed(12345)
knn.fit = train(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
method = "knn",
tuneGrid = expand.grid(k = seq(2,20, 1)),
trControl = ctrl
)
knn.fit
## k-Nearest Neighbors
##
## 2400 samples
## 7 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.4744432 -0.051153689
## 3 0.4848640 -0.030330368
## 4 0.4895851 -0.020869799
## 5 0.4888982 -0.022112742
## 6 0.4938936 -0.012195740
## 7 0.4915261 -0.016973993
## 8 0.4986170 -0.002852668
## 9 0.4863831 -0.027225835
## 10 0.4888930 -0.022277509
## 11 0.4902772 -0.019468277
## 12 0.4850000 -0.029981766
## 13 0.4890301 -0.021946311
## 14 0.4812610 -0.037519462
## 15 0.4816643 -0.036699684
## 16 0.4827679 -0.034492836
## 17 0.4884658 -0.023231773
## 18 0.4851360 -0.029897042
## 19 0.4866684 -0.026858066
## 20 0.4895810 -0.021051406
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 8.
plot(knn.fit)
Predictions on test set
knn.preds = predict(knn.fit, fund.test)
confusionMatrix(knn.preds, fund.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 144 161
## No Donor 146 149
##
## Accuracy : 0.4883
## 95% CI : (0.4476, 0.5291)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.9236
##
## Kappa : -0.0228
##
## Mcnemar's Test P-Value : 0.4243
##
## Sensitivity : 0.4966
## Specificity : 0.4806
## Pos Pred Value : 0.4721
## Neg Pred Value : 0.5051
## Prevalence : 0.4833
## Detection Rate : 0.2400
## Detection Prevalence : 0.5083
## Balanced Accuracy : 0.4886
##
## 'Positive' Class : Donor
##
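The KNN voting procedure described earlier (compute distances, take the k closest examples, vote) can be written in a few lines of base R. This is a minimal sketch on two synthetic clusters, not the caret implementation used above; the function knn_predict and the toy data are hypothetical. Because the vote depends on raw Euclidean distances, predictors measured on large scales dominate unless the features are standardized first.

```r
set.seed(1)
# Two toy clusters; labels are illustrative only
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 3), ncol = 2))
train_y <- rep(c("No Donor", "Donor"), each = 20)

knn_predict <- function(query, x, y, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(x, 2, query)^2))
  # Labels of the k closest rows, then a majority vote
  votes <- table(y[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

print(knn_predict(c(3, 3), train_x, train_y, k = 5))
print(knn_predict(c(0, 0), train_x, train_y, k = 5))
```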
Method: A support vector machine (SVM) works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. The data are transformed in such a way that the separator between the categories can be drawn as a hyperplane. SVM tries to find the hyperplane that maximizes the margin, the distance between the hyperplane and the nearest data points (the support vectors). The kernel trick lets SVM handle non-linear data by implicitly projecting it into a higher-dimensional space where it can be linearly divided by a plane.
For the linear kernel, the optimal value of cost is 0.006 and the test accuracy is 55.67%; for the radial kernel, the optimal value of cost is 0.4 with a test accuracy of 54.17%.
set.seed(12345)
svm.linear.fit = train(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
method = "svmLinear",
tuneGrid = expand.grid(C = seq(0.001, 0.01, 0.001)),
preProcess = c("center", "scale"),
trControl = ctrl
)
svm.linear.fit
## Support Vector Machines with Linear Kernel
##
## 2400 samples
## 7 predictor
## 2 classes: 'Donor', 'No Donor'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.001 0.5370863 0.06953489
## 0.002 0.5640129 0.12675406
## 0.003 0.5656819 0.13068186
## 0.004 0.5658254 0.13070330
## 0.005 0.5659638 0.13084481
## 0.006 0.5677641 0.13437644
## 0.007 0.5662328 0.13127029
## 0.008 0.5670673 0.13294896
## 0.009 0.5656790 0.13015584
## 0.010 0.5658156 0.13044546
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.006.
svm.linear.fit$results$Accuracy[which.max(svm.linear.fit$results$Accuracy)]
## [1] 0.5677641
plot(svm.linear.fit)
Predictions on test set
svm.linear.preds = predict(svm.linear.fit, fund.test)
confusionMatrix(svm.linear.preds, fund.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 192 168
## No Donor 98 142
##
## Accuracy : 0.5567
## 95% CI : (0.5159, 0.5969)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.02732
##
## Kappa : 0.1192
##
## Mcnemar's Test P-Value : 2.33e-05
##
## Sensitivity : 0.6621
## Specificity : 0.4581
## Pos Pred Value : 0.5333
## Neg Pred Value : 0.5917
## Prevalence : 0.4833
## Detection Rate : 0.3200
## Detection Prevalence : 0.6000
## Balanced Accuracy : 0.5601
##
## 'Positive' Class : Donor
##
set.seed(12345)
svm.radial.fit = train(target ~ months_since_donate + largest_gift + num_child +
avg_gift + med_fam_inc + income + home_value,
data = fund.train,
method = "svmRadial",
tuneGrid = expand.grid(C = seq(0.2, 1.0, 0.1),
sigma= 0.1),
preProcess = c("center", "scale"),
trControl = ctrl
)
svm.radial.fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 2400 samples
## 7 predictor
## 2 classes: 'Donor', 'No Donor'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.2 0.5623451 0.1241293
## 0.3 0.5623445 0.1241119
## 0.4 0.5623474 0.1240785
## 0.5 0.5595720 0.1184643
## 0.6 0.5588781 0.1170325
## 0.7 0.5587386 0.1167039
## 0.8 0.5567976 0.1127712
## 0.9 0.5579082 0.1149932
## 1.0 0.5566593 0.1124661
##
## Tuning parameter 'sigma' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 0.4.
plot(svm.radial.fit)
Predictions on test set
svm.radial.preds = predict(svm.radial.fit, fund.test)
confusionMatrix(svm.radial.preds, fund.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 178 163
## No Donor 112 147
##
## Accuracy : 0.5417
## 95% CI : (0.5008, 0.5821)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.118043
##
## Kappa : 0.0875
##
## Mcnemar's Test P-Value : 0.002569
##
## Sensitivity : 0.6138
## Specificity : 0.4742
## Pos Pred Value : 0.5220
## Neg Pred Value : 0.5676
## Prevalence : 0.4833
## Detection Rate : 0.2967
## Detection Prevalence : 0.5683
## Balanced Accuracy : 0.5440
##
## 'Positive' Class : Donor
##
3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?
A weighted sample is used to produce a training set with equal numbers of donors and non-donors in order to adjust for the imbalance in the data. If the response is not balanced, the model may be biased toward the dominant class, which can cause poor test performance on the minority class. A simple random sample does not compensate for this imbalance; it preserves it. With a 5.1% response rate, a simple random sample of 3,000 records would contain only about 153 donors.
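One consequence of weighted sampling is that probabilities estimated on the balanced sample overstate the donation probability in the full mailing list. The sketch below shows a standard prior-correction step followed by a profit-based mailing rule. It is an illustration under stated assumptions, not part of the original analysis: the function adjust_prob and the example probabilities are hypothetical, and every donor is assumed to give the average $13.00.

```r
# Figures from the problem statement
true_rate    <- 0.051  # donor share in the full mailing list
sample_rate  <- 0.5    # donor share in the balanced sample
avg_donation <- 13.00
mailing_cost <- 0.68

# Rescale a probability estimated on the balanced sample to the original
# population by reweighting each class with its true and sampled share
adjust_prob <- function(p) {
  num <- p * true_rate / sample_rate
  den <- num + (1 - p) * (1 - true_rate) / (1 - sample_rate)
  num / den
}

p_model    <- c(0.30, 0.50, 0.70)   # hypothetical model outputs
p_adjusted <- adjust_prob(p_model)

# Mail only when the expected donation exceeds the cost of the mailing
mail <- p_adjusted * avg_donation > mailing_cost

print(data.frame(p_model, p_adjusted = round(p_adjusted, 4), mail))
```

A score of 0.5 on the balanced sample maps back to the 5.1% base rate, which falls just short of the break-even point, so a profit-driven cutoff sits above 0.5 on the balanced scale.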
Different types of models were tried on this dataset and their accuracies compared. The final predictors were selected by combining the results from VIF (removing variables that are multicollinear) and variable importance from the random forest. The final predictors are months_since_donate, largest_gift, num_child, avg_gift, med_fam_inc, income, and home_value. Of the models tried (logistic regression, KNN, Naive Bayes, QDA, SVM with linear kernel, and SVM with radial kernel), SVM with linear kernel performs best, with 55.67% accuracy on the test split of the data.
4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
models = c("GLM", "QDA", "NB", "KNN", "SVM-Linear", "SVM-Radial")
# Test-set accuracy (%) stored as numbers so ggplot treats the y-axis as continuous
acc = c(49.33, 54.33, 53.33, 48.83, 55.67, 54.17)
acc.summary = data.frame(acc = acc, row.names = models)
acc.summary
## acc
## GLM 49.33
## QDA 54.33
## NB 53.33
## KNN 48.83
## SVM-Linear 55.67
## SVM-Radial 54.17
ggplot(data = acc.summary) +geom_bar(aes(x = row.names(acc.summary), y = acc), stat = "identity") + xlab("models") + ylab("accuracy")
The above graph shows the accuracy of the different models on the 20% test split. The SVM model with linear kernel outperforms the rest of the models with the highest accuracy of 55.67%. I will pick this model to make predictions on the future fundraising dataset.
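The accuracy figures can be cross-checked directly from the test-set confusion matrices reported earlier, which also makes the sensitivity and specificity trade-off visible in one table. The counts below are copied from those matrices; the column names tp, fp, fn, and tn are illustrative.

```r
# Counts copied from the test-set confusion matrices reported above
cm <- data.frame(
  model = c("GLM", "QDA", "NB", "KNN", "SVM-Linear", "SVM-Radial"),
  tp = c(6, 39, 220, 144, 192, 178),    # predicted Donor, actual Donor
  fp = c(20, 23, 210, 161, 168, 163),   # predicted Donor, actual No Donor
  fn = c(284, 251, 70, 146, 98, 112),   # predicted No Donor, actual Donor
  tn = c(290, 287, 100, 149, 142, 147)  # predicted No Donor, actual No Donor
)

cm$accuracy    <- with(cm, (tp + tn) / (tp + fp + fn + tn))
cm$sensitivity <- with(cm, tp / (tp + fn))
cm$specificity <- with(cm, tn / (tn + fp))

# Sort by accuracy, best model first
print(cm[order(-cm$accuracy), c("model", "accuracy", "sensitivity", "specificity")],
      digits = 4, row.names = FALSE)
```

Models with similar accuracy can differ sharply in sensitivity: QDA labels very few records as donors, while Naive Bayes and the SVMs capture most of them.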
5. Select best model. From your answer in (4), what do you think is the “best” model?
As mentioned above, the SVM model with linear kernel is the best model, with an optimal cost of 0.006. The accuracy of this model is 55.67%, which is higher than that of the remaining models. This model resulted in 65.83% accuracy on the future fundraising dataset.
future_value = predict(svm.linear.fit, future_fundraising)
Submission File. For each row in the test set, you must predict whether or not the candidate is a donor. The .csv file should contain a header.
write.table(future_value, file="predictions.csv", col.names = c("value"),sep = ",", row.names = F, quote = F)
Of all the statistical techniques used, SVM with linear kernel performed best, with an accuracy of 55.67% on the testing set and 65.83% on the future fundraising dataset. Important factors to consider when targeting people are the number of months since the last donation, the dollar amount of the largest gift to date, the number of children, median family income, home value, and income. Primarily focus on people who have children, who donated recently, who have higher median family income, and who made slightly lower donations on average.