Fundraising project

Background

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with an in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and an assortment of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-respondents so that the sample has approximately equal numbers of donors and non-donors.

Business Objectives and Goals

Objectives:

The objective of this project is to develop a classification model that can effectively identify the donors who are likely to donate in future campaigns, so that the expected net profit is maximized. Because each mailing has a cost, it is more profitable to target the donors who are likely to give than to send mailing gifts to all 13 million donors in the database. Such a model reduces mailing costs by concentrating mailings on likely donors.
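The case for targeting can be checked with a little arithmetic. The sketch below uses only the response rate, average donation and mailing cost quoted in the Background section; the variable names are illustrative and not part of the original analysis.

```r
# Figures quoted in the Background section
response_rate <- 0.051   # overall response rate
avg_donation  <- 13.00   # average gift among responders, in dollars
mailing_cost  <- 0.68    # cost to produce and send one mailing, in dollars

# Expected net profit per piece when mailing everyone
profit_mail_all <- response_rate * avg_donation - mailing_cost

# Minimum response rate at which a mailing breaks even
break_even_rate <- mailing_cost / avg_donation

round(profit_mail_all, 3)   # -0.017
round(break_even_rate, 4)   # 0.0523
```

Under these figures, mailing the whole database loses about 1.7 cents per piece, so a model is only useful if the group it selects responds at a rate above roughly 5.2%.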

Goals

The goal of this project is to improve the national veterans’ organization’s cost-effectiveness in its direct mailing campaigns using data analysis and predictive models.

Data and Data Sources used

This data set was kindly provided by The American Legion. It contains 3,000 records with 20 attributes, consisting of donors’ demographic data and information about their previous donations, plus the response variable target. Our goal is to predict which donors should receive a mail campaign letter, using a second subset of the data that contains the attributes of 120 future mailing candidates. Accurately predicting who will actually make a donation, and mailing only to those candidates rather than to the entire donor list, reduces mailing costs.

Below is the detailed analysis.

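library(dplyr)   # %>% and select_if()
library(caret)   # train(), trainControl(), varImp(), confusionMatrix()

fundraising = readRDS("./fundraising.rds")
future_fundraising = readRDS("./future_fundraising.rds")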

str(fundraising)

## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

summary(fundraising)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

Exploratory Data Analysis

1. Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?

Collinearity

fund.num = fundraising %>%  select_if(is.numeric)
fund.num.corr = cor(fund.num)
fund.num.corr

##                        num_child       income      wealth    home_value
## num_child            1.000000000  0.091893089  0.06017554 -0.0119642286
## income               0.091893089  1.000000000  0.20899310  0.2919734944
## wealth               0.060175537  0.208993101  1.00000000  0.2611611450
## home_value          -0.011964229  0.291973494  0.26116115  1.0000000000
## med_fam_inc          0.046961647  0.367505334  0.37776337  0.7381530742
## avg_fam_inc          0.047261395  0.378585352  0.38589230  0.7525690021
## pct_lt15k           -0.031717891 -0.283191234 -0.37514558 -0.3990861577
## num_prom            -0.086432604 -0.069008634 -0.41211777 -0.0645138583
## lifetime_gifts      -0.050954766 -0.019565470 -0.22547332 -0.0240737013
## largest_gift        -0.017554416  0.033180760 -0.02527652  0.0564942757
## last_gift           -0.012948678  0.109592754  0.05259131  0.1588576542
## months_since_donate -0.005563603  0.077238810  0.03371398  0.0234285142
## time_lag            -0.006069356 -0.001545727 -0.06642133  0.0006789113
## avg_gift            -0.019688680  0.124055750  0.09107875  0.1687736865
##                     med_fam_inc avg_fam_inc    pct_lt15k    num_prom
## num_child            0.04696165  0.04726139 -0.031717891 -0.08643260
## income               0.36750533  0.37858535 -0.283191234 -0.06900863
## wealth               0.37776337  0.38589230 -0.375145585 -0.41211777
## home_value           0.73815307  0.75256900 -0.399086158 -0.06451386
## med_fam_inc          1.00000000  0.97227129 -0.665362675 -0.05078270
## avg_fam_inc          0.97227129  1.00000000 -0.680284797 -0.05731139
## pct_lt15k           -0.66536267 -0.68028480  1.000000000  0.03777518
## num_prom            -0.05078270 -0.05731139  0.037775183  1.00000000
## lifetime_gifts      -0.03524583 -0.04032716  0.059618806  0.53861957
## largest_gift         0.04703207  0.04310394 -0.007882936  0.11381034
## last_gift            0.13597600  0.13137862 -0.061752121 -0.05586809
## months_since_donate  0.03233669  0.03126859 -0.009014558 -0.28232212
## time_lag             0.01520204  0.02434038 -0.019911490  0.11962322
## avg_gift             0.13716276  0.13175843 -0.062480892 -0.14725094
##                     lifetime_gifts largest_gift   last_gift months_since_donate
## num_child              -0.05095477 -0.017554416 -0.01294868        -0.005563603
## income                 -0.01956547  0.033180760  0.10959275         0.077238810
## wealth                 -0.22547332 -0.025276518  0.05259131         0.033713981
## home_value             -0.02407370  0.056494276  0.15885765         0.023428514
## med_fam_inc            -0.03524583  0.047032066  0.13597600         0.032336691
## avg_fam_inc            -0.04032716  0.043103937  0.13137862         0.031268594
## pct_lt15k               0.05961881 -0.007882936 -0.06175212        -0.009014558
## num_prom                0.53861957  0.113810342 -0.05586809        -0.282322122
## lifetime_gifts          1.00000000  0.507262313  0.20205827        -0.144621862
## largest_gift            0.50726231  1.000000000  0.44723693         0.019789633
## last_gift               0.20205827  0.447236933  1.00000000         0.186715010
## months_since_donate    -0.14462186  0.019789633  0.18671501         1.000000000
## time_lag                0.03854575  0.039977035  0.07511121         0.015528499
## avg_gift                0.18232435  0.474830096  0.86639998         0.189110799
##                          time_lag    avg_gift
## num_child           -0.0060693555 -0.01968868
## income              -0.0015457272  0.12405575
## wealth              -0.0664213294  0.09107875
## home_value           0.0006789113  0.16877369
## med_fam_inc          0.0152020426  0.13716276
## avg_fam_inc          0.0243403812  0.13175843
## pct_lt15k           -0.0199114896 -0.06248089
## num_prom             0.1196232155 -0.14725094
## lifetime_gifts       0.0385457538  0.18232435
## largest_gift         0.0399770354  0.47483010
## last_gift            0.0751112090  0.86639998
## months_since_donate  0.0155284995  0.18911080
## time_lag             1.0000000000  0.07008164
## avg_gift             0.0700816428  1.00000000

corrplot::corrplot(fund.num.corr, type = "lower")

From the correlation matrix and plot above, we can see that home_value, avg_fam_inc and med_fam_inc are highly correlated with each other: med_fam_inc and avg_fam_inc have a correlation of 0.97, and home_value correlates at about 0.74 to 0.75 with each of them. Of the two income measures I keep only med_fam_inc; home_value, whose correlations are more moderate, is retained in the models below. A similarly high correlation (0.87) is present between avg_gift and last_gift, and I select avg_gift for further analysis. High correlations among predictors can lead to multicollinearity when developing models. Since multicollinearity inflates the variance of the coefficient estimates, it is important to remove such predictors while building models.
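Reading a 14-by-14 matrix by eye is error-prone, so the highly correlated pairs can also be listed programmatically. The following is a minimal sketch on simulated stand-in data (the data frame `demo`, the 0.7 cutoff and all names are assumptions made for illustration); with the real data, the same steps would start from fund.num.corr.

```r
# Simulated stand-in data; with the real data use fund.num.corr from above
set.seed(12345)
n <- 500
x1 <- rnorm(n)
demo <- data.frame(
  a = x1,
  b = x1 + rnorm(n, sd = 0.2),   # built to be strongly correlated with a
  c = rnorm(n)                   # independent of a and b
)
demo.corr <- cor(demo)

# Keep each pair once by blanking the diagonal and upper triangle
demo.corr[upper.tri(demo.corr, diag = TRUE)] <- NA
high <- which(abs(demo.corr) > 0.7, arr.ind = TRUE)

# One row per pair whose absolute correlation exceeds the cutoff
high.pairs <- data.frame(
  var1 = rownames(demo.corr)[high[, "row"]],
  var2 = colnames(demo.corr)[high[, "col"]],
  r    = demo.corr[high]
)
high.pairs
```

The cutoff is a judgment call; a lower value flags more pairs for review.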

Variance Inflation Factor

# The output format below matches usdm::vif(), which takes a data frame of numeric predictors
vif(as.data.frame(fundraising[, c(6,7, 9:20)]))

##              Variables       VIF
## 1            num_child  1.025019
## 2               income  1.194773
## 3               wealth  1.508819
## 4           home_value  2.493018
## 5          med_fam_inc 18.423616
## 6          avg_fam_inc 20.688945
## 7            pct_lt15k  2.040761
## 8             num_prom  1.962585
## 9       lifetime_gifts  1.994202
## 10        largest_gift  1.715238
## 11           last_gift  4.153071
## 12 months_since_donate  1.145515
## 13            time_lag  1.032467
## 14            avg_gift  4.469569

vif(as.data.frame(fundraising[, c(6,7,9:11, 13:20)]))

##              Variables      VIF
## 1            num_child 1.024749
## 2               income 1.187566
## 3               wealth 1.505671
## 4           home_value 2.316403
## 5          med_fam_inc 3.603571
## 6            pct_lt15k 1.937771
## 7             num_prom 1.962376
## 8       lifetime_gifts 1.994140
## 9         largest_gift 1.715228
## 10           last_gift 4.152560
## 11 months_since_donate 1.145514
## 12            time_lag 1.029603
## 13            avg_gift 4.465531

We can observe that after removing avg_fam_inc (column 12), the VIF of every remaining variable is below 5.
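For intuition about what these numbers measure, the VIF of a predictor can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below does this in base R on simulated data; `demo` and `vif_manual` are illustrative names, not objects from the analysis above.

```r
set.seed(12345)
n <- 500
z <- rnorm(n)
demo <- data.frame(
  x1 = z + rnorm(n, sd = 0.2),  # x1 and x2 share the signal z, so they are collinear
  x2 = z + rnorm(n, sd = 0.2),
  x3 = rnorm(n)                 # unrelated predictor
)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other predictors
vif_manual <- sapply(names(demo), function(v) {
  fit <- lm(reformulate(setdiff(names(demo), v), response = v), data = demo)
  1 / (1 - summary(fit)$r.squared)
})
round(vif_manual, 2)
```

Here x1 and x2 receive large VIFs because each is almost fully explained by the other, while x3 stays close to 1. This mirrors the drop in the VIF of med_fam_inc once avg_fam_inc is removed.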

Random Forest Model for variable importance

set.seed(12345)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(12345)
rf.model0 = train(target ~ . , data = fundraising, method = "rf", importance = TRUE, trControl = ctrl)
plot(varImp(rf.model0))

varImp(rf.model0)

## rf variable importance
## 
##                     Importance
## months_since_donate    100.000
## largest_gift            79.248
## avg_gift                51.712
## last_gift               44.548
## med_fam_inc             38.017
## income                  35.558
## pct_lt15k               32.202
## num_child               31.868
## avg_fam_inc             31.862
## home_value              30.846
## homeownerNo             29.560
## lifetime_gifts          27.131
## zipconvert4Yes          14.667
## zipconvert2Yes          12.113
## zipconvert3No           10.718
## zipconvert5Yes           9.949
## wealth                   7.350
## num_prom                 7.128
## time_lag                 4.128
## femaleNo                 0.000

Given the multicollinearity findings above and the importance ranking from the random forest model, I will use months_since_donate, largest_gift, avg_gift (not last_gift, because of its high correlation with avg_gift), med_fam_inc, num_child, income and home_value as my final set of predictors.

Exclusions:

Excluding a class from prediction or an observation is done if its precision or coverage statistics don’t meet the threshold of usefulness. In this analysis no classes were excluded, but predictors with high VIF were excluded from the models. Predictors that were unimportant or did not improve model accuracy were also excluded.

Variable Transformations

Variable transformation is a way to make the data work better in the model. Variables can be numeric or categorical, and each type calls for a different transformation approach. No variable transformations were made for this analysis. In further analysis, studying interactions between predictors and including them in the final model may improve model performance.

Data Partitioning

Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).

The original data is partitioned into training and testing data sets. The training set contains 80% of the original records (2,400 observations) and the testing set contains the remaining 20% (600 observations).

set.seed(12345)
train_index = sample(nrow(fundraising), round(nrow(fundraising)*0.8))
fund.train = fundraising[train_index,]
fund.test = fundraising[-train_index,]

Methodology and Model Building

2. Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.

Logistic Regression

Method: Logistic regression is a parametric classification technique that estimates the probability of an event occurring, for instance, whether or not a person will be a donor. This model is simple to run and easy to interpret from the significance of the predictors and the signs of the coefficients. The variables selected were months_since_donate, largest_gift, num_child, avg_gift, med_fam_inc, income and home_value. The test-set accuracy reported below is 49.33%.

set.seed(12345)
glm.fit = glm(target ~ months_since_donate + largest_gift + num_child + 
                avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   family = "binomial"
                   )
summary(glm.fit)

## 
## Call:
## glm(formula = target ~ months_since_donate + largest_gift + num_child + 
##     avg_gift + med_fam_inc + income + home_value, family = "binomial", 
##     data = fund.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8392  -1.1448  -0.7965   1.1718   1.6874  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.204e+00  3.701e-01  -5.954 2.62e-09 ***
## months_since_donate  5.823e-02  1.074e-02   5.423 5.86e-08 ***
## largest_gift        -1.234e-03  2.157e-03  -0.572  0.56709    
## num_child            3.384e-01  1.266e-01   2.673  0.00752 ** 
## avg_gift             2.266e-02  7.342e-03   3.087  0.00202 ** 
## med_fam_inc          3.318e-04  3.703e-04   0.896  0.37029    
## income              -6.487e-02  2.709e-02  -2.394  0.01665 *  
## home_value          -7.344e-05  6.534e-05  -1.124  0.26106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.0  on 2399  degrees of freedom
## Residual deviance: 3263.8  on 2392  degrees of freedom
## AIC: 3279.8
## 
## Number of Fisher Scoring iterations: 4

Predictions on test set

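# predict() defaults to the link (log-odds) scale; request probabilities instead.
# With a factor response, glm() models the second level, i.e. P("No Donor").
pred.glm = predict(glm.fit, fund.test, type = "response")
class.glm = ifelse(pred.glm >= 0.5, "No Donor", "Donor")
confusionMatrix(factor(class.glm, levels = levels(fund.test$target)), fund.test$target, positive = "Donor")
# The output shown below came from thresholding log-odds at 0.5 with the labels reversed; rerun to refresh it.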

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor        6       20
##   No Donor   284      290
##                                           
##                Accuracy : 0.4933          
##                  95% CI : (0.4526, 0.5341)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.8819          
##                                           
##                   Kappa : -0.0452         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.02069         
##             Specificity : 0.93548         
##          Pos Pred Value : 0.23077         
##          Neg Pred Value : 0.50523         
##              Prevalence : 0.48333         
##          Detection Rate : 0.01000         
##    Detection Prevalence : 0.04333         
##       Balanced Accuracy : 0.47809         
##                                           
##        'Positive' Class : Donor           
##
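A detail worth checking in logistic regression predictions is the scale that predict() returns for a glm object, and which factor level the fitted probability refers to. The sketch below illustrates both points on simulated data; `demo`, `fit` and the other names are hypothetical and are not the fundraising objects.

```r
set.seed(12345)
n <- 400
x <- rnorm(n)
# Factor with the same level order as `target`: "Donor" first, "No Donor" second
y <- factor(ifelse(runif(n) < plogis(1.5 * x), "No Donor", "Donor"),
            levels = c("Donor", "No Donor"))
demo <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = demo, family = "binomial")

link.scale <- predict(fit, demo)                     # default: log-odds
prob.scale <- predict(fit, demo, type = "response")  # P(y == "No Donor")

# The two scales are related by the logistic function
all.equal(unname(plogis(link.scale)), unname(prob.scale))

# Probabilities refer to the second factor level, so high values mean "No Donor"
pred.class <- factor(ifelse(prob.scale >= 0.5, "No Donor", "Donor"),
                     levels = levels(demo$y))
mean(pred.class == demo$y)
```

Cutting log-odds at 0.5 corresponds to a probability cutoff of about 0.62 rather than 0.5, and labelling high values as the first level reverses the predicted classes. Either slip can push accuracy below the no-information rate.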

QDA

Method: The second classification method I used is Quadratic Discriminant Analysis (QDA), which is a compromise between logistic regression and nonparametric methods. QDA allows for quadratic decision boundaries and can produce better results when the decision boundary is moderately non-linear. QDA is a generative model: it assumes that the predictors within each class follow a Gaussian distribution with a class-specific mean vector and covariance matrix. The class-specific prior is simply the proportion of data points that belong to the class, and the class-specific mean vector is the average of the input variables over the observations in that class. The cross-validation accuracy is 50.92% and the accuracy on the test set is 54.33%.

set.seed(12345)
qda.fit = train(target ~ months_since_donate + largest_gift + num_child +
                  avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   method = "qda",
                   trControl = ctrl
                   )
qda.fit

## Quadratic Discriminant Analysis 
## 
## 2400 samples
##    7 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5091679  0.02316112

Predictions on test set

qda.preds = predict(qda.fit, fund.test)
confusionMatrix(qda.preds, fund.test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       39       23
##   No Donor   251      287
##                                           
##                Accuracy : 0.5433          
##                  95% CI : (0.5025, 0.5837)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.1026          
##                                           
##                   Kappa : 0.0619          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.1345          
##             Specificity : 0.9258          
##          Pos Pred Value : 0.6290          
##          Neg Pred Value : 0.5335          
##              Prevalence : 0.4833          
##          Detection Rate : 0.0650          
##    Detection Prevalence : 0.1033          
##       Balanced Accuracy : 0.5301          
##                                           
##        'Positive' Class : Donor           
##

Naive Bayes

Method: The Naive Bayes model assumes that, within each class, the features are independent of each other, which allows the model to estimate each class-conditional marginal density separately. Our Naive Bayes classifier achieves its highest cross-validation accuracy with kernel density estimation (usekernel = TRUE), a nonparametric technique for estimating densities. This classifier has a cross-validation accuracy of 55.38%, and its accuracy on the unseen test set is 53.33%.

set.seed(12345)
nb.fit = train(target ~  months_since_donate + largest_gift + num_child + 
                 avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   method = "naive_bayes",
                   trControl = ctrl
                   )
nb.fit

## Naive Bayes 
## 
## 2400 samples
##    7 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.5144514  0.03300049
##    TRUE      0.5537577  0.10530248
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

Predictions on test set

nb.pred = predict(nb.fit, fund.test)
confusionMatrix(nb.pred, fund.test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      220      210
##   No Donor    70      100
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.4925, 0.5738)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.2189          
##                                           
##                   Kappa : 0.08            
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7586          
##             Specificity : 0.3226          
##          Pos Pred Value : 0.5116          
##          Neg Pred Value : 0.5882          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.7167          
##       Balanced Accuracy : 0.5406          
##                                           
##        'Positive' Class : Donor           
##

K-Nearest Neighbors

Method: The k-nearest neighbors (KNN) algorithm is a simple machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data grows. KNN works by computing the distances between a query and all the examples in the data, selecting the specified number of examples (k) closest to the query, and then voting for the most frequent label (in this case Donor or No Donor). In fitting the KNN model, I tried values of k ranging from 2 to 20. Accuracy was used to select the optimal model using the largest value. The final value used for the model was k = 8, with a cross-validation accuracy of 49.86%.
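The distance-then-vote procedure described above can be sketched in a few lines of base R. This is an illustration on made-up toy data, not the implementation caret uses; `knn_predict_one` is a hypothetical helper.

```r
# Classify one query point by majority vote among its k nearest training points
knn_predict_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]   # ties go to the first level
}

# Toy data (made up for illustration): two well-separated groups
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(8, 8), c(8, 9), c(9, 8))
train_y <- factor(c("Donor", "Donor", "Donor",
                    "No Donor", "No Donor", "No Donor"))

knn_predict_one(train_x, train_y, query = c(2, 2), k = 3)   # "Donor"
knn_predict_one(train_x, train_y, query = c(7, 9), k = 3)   # "No Donor"
```

Because the vote depends on raw distances, predictors with large ranges (home_value, for example) can dominate those with small ranges (num_child), which is why centering and scaling are commonly applied before KNN.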

set.seed(12345)
knn.fit = train(target ~  months_since_donate + largest_gift + num_child + 
                  avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   method = "knn",
                   tuneGrid = expand.grid(k = seq(2,20, 1)),
                   trControl = ctrl
                   )
knn.fit

## k-Nearest Neighbors 
## 
## 2400 samples
##    7 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       
##    2  0.4744432  -0.051153689
##    3  0.4848640  -0.030330368
##    4  0.4895851  -0.020869799
##    5  0.4888982  -0.022112742
##    6  0.4938936  -0.012195740
##    7  0.4915261  -0.016973993
##    8  0.4986170  -0.002852668
##    9  0.4863831  -0.027225835
##   10  0.4888930  -0.022277509
##   11  0.4902772  -0.019468277
##   12  0.4850000  -0.029981766
##   13  0.4890301  -0.021946311
##   14  0.4812610  -0.037519462
##   15  0.4816643  -0.036699684
##   16  0.4827679  -0.034492836
##   17  0.4884658  -0.023231773
##   18  0.4851360  -0.029897042
##   19  0.4866684  -0.026858066
##   20  0.4895810  -0.021051406
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 8.

plot(knn.fit)

Predictions on test set

knn.preds = predict(knn.fit, fund.test)
confusionMatrix(knn.preds, fund.test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      144      161
##   No Donor   146      149
##                                           
##                Accuracy : 0.4883          
##                  95% CI : (0.4476, 0.5291)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.9236          
##                                           
##                   Kappa : -0.0228         
##                                           
##  Mcnemar's Test P-Value : 0.4243          
##                                           
##             Sensitivity : 0.4966          
##             Specificity : 0.4806          
##          Pos Pred Value : 0.4721          
##          Neg Pred Value : 0.5051          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2400          
##    Detection Prevalence : 0.5083          
##       Balanced Accuracy : 0.4886          
##                                           
##        'Positive' Class : Donor           
##

SVM with Linear Kernel

Method: SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. It then finds the separating hyperplane that maximizes the margin, that is, the distance between the hyperplane and the nearest data points (the support vectors). The kernel trick lets the SVM project non-linear data into a higher-dimensional space, where the classes can more easily be divided by a plane.

For the linear kernel, the optimal value of cost is 0.006 and the test accuracy is 55.67%. For the radial kernel, the optimum value of cost is 0.4 (with sigma held at 0.1) and the test accuracy is 54.17%.

set.seed(12345)
svm.linear.fit = train(target ~ months_since_donate + largest_gift + num_child + 
                         avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   method = "svmLinear",
                   tuneGrid = expand.grid(C = seq(0.001, 0.01, 0.001)),
                   preProcess = c("center", "scale"),
                   trControl = ctrl
                   )
svm.linear.fit

## Support Vector Machines with Linear Kernel 
## 
## 2400 samples
##    7 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa     
##   0.001  0.5370863  0.06953489
##   0.002  0.5640129  0.12675406
##   0.003  0.5656819  0.13068186
##   0.004  0.5658254  0.13070330
##   0.005  0.5659638  0.13084481
##   0.006  0.5677641  0.13437644
##   0.007  0.5662328  0.13127029
##   0.008  0.5670673  0.13294896
##   0.009  0.5656790  0.13015584
##   0.010  0.5658156  0.13044546
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.006.

max(svm.linear.fit$results$Accuracy)

## [1] 0.5677641

plot(svm.linear.fit)

Predictions on test set

svm.linear.preds = predict(svm.linear.fit, fund.test)
confusionMatrix(svm.linear.preds, fund.test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      192      168
##   No Donor    98      142
##                                           
##                Accuracy : 0.5567          
##                  95% CI : (0.5159, 0.5969)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.02732         
##                                           
##                   Kappa : 0.1192          
##                                           
##  Mcnemar's Test P-Value : 2.33e-05        
##                                           
##             Sensitivity : 0.6621          
##             Specificity : 0.4581          
##          Pos Pred Value : 0.5333          
##          Neg Pred Value : 0.5917          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3200          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.5601          
##                                           
##        'Positive' Class : Donor           
##

SVM with radial kernel

set.seed(12345)
svm.radial.fit = train(target ~ months_since_donate + largest_gift + num_child + 
                         avg_gift + med_fam_inc + income + home_value, 
                   data = fund.train, 
                   method = "svmRadial",
                   tuneGrid = expand.grid(C = seq(0.2, 1.0, 0.1),
                                          sigma= 0.1),
                   preProcess = c("center", "scale"),
                   trControl = ctrl
                   )
svm.radial.fit

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2400 samples
##    7 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2159, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   C    Accuracy   Kappa    
##   0.2  0.5623451  0.1241293
##   0.3  0.5623445  0.1241119
##   0.4  0.5623474  0.1240785
##   0.5  0.5595720  0.1184643
##   0.6  0.5588781  0.1170325
##   0.7  0.5587386  0.1167039
##   0.8  0.5567976  0.1127712
##   0.9  0.5579082  0.1149932
##   1.0  0.5566593  0.1124661
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 0.4.
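Because sigma was held constant, the search above explores only one slice of the radial kernel's parameter space. A possible extension, sketched below in base R, is a grid that varies both parameters; the extra sigma values (0.01 and 0.05) are illustrative assumptions and were not tried in this analysis.

```r
# Hypothetical wider grid: the same C values as above, crossed with
# three candidate sigma values (0.01 and 0.05 are illustrative choices).
wider.grid <- expand.grid(C = seq(0.2, 1.0, 0.1),
                          sigma = c(0.01, 0.05, 0.1))

nrow(wider.grid)   # 9 values of C x 3 values of sigma
head(wider.grid)
```

A data frame of this shape could be passed as tuneGrid in the train call above, at the cost of three times as many model fits per resample.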

plot(svm.radial.fit)

Predictions on test set

svm.radial.preds = predict(svm.radial.fit, fund.test)
confusionMatrix(svm.radial.preds, fund.test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      178      163
##   No Donor   112      147
##                                           
##                Accuracy : 0.5417          
##                  95% CI : (0.5008, 0.5821)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.118043        
##                                           
##                   Kappa : 0.0875          
##                                           
##  Mcnemar's Test P-Value : 0.002569        
##                                           
##             Sensitivity : 0.6138          
##             Specificity : 0.4742          
##          Pos Pred Value : 0.5220          
##          Neg Pred Value : 0.5676          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2967          
##    Detection Prevalence : 0.5683          
##       Balanced Accuracy : 0.5440          
##                                           
##        'Positive' Class : Donor           
##

3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?

Weighted sampling is used to build a training set with equal numbers of donors and non-donors because the original data is heavily imbalanced: only about 5.1% of mailings receive a response. Trained on such data, a classifier can reach roughly 95% accuracy by predicting "No Donor" for everyone, so it tends to be biased toward the dominant class and performs poorly at identifying donors. A simple random sample does not correct this; it preserves the imbalance of the original dataset.
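The idea can be illustrated with a minimal sketch in base R. The data below is synthetic (a hypothetical population generated with a response rate near 5.1%), and the object names are assumptions made for this example; it is not the procedure used to produce the 3,000-record sample.

```r
# Synthetic population with roughly a 5.1% response rate (illustration only).
set.seed(1)
n <- 10000
population <- data.frame(
  id = seq_len(n),
  target = factor(ifelse(runif(n) < 0.051, "Donor", "No Donor"),
                  levels = c("Donor", "No Donor"))
)

# Keep every donor and draw an equally sized random subset of non-donors.
donor.rows    <- which(population$target == "Donor")
nondonor.rows <- which(population$target == "No Donor")
keep.nondonor <- sample(nondonor.rows, size = length(donor.rows))
balanced      <- population[c(donor.rows, keep.nondonor), ]

table(population$target)   # heavily skewed toward "No Donor"
table(balanced$target)     # equal counts in both classes
```

A simple random sample of the same size would instead reproduce the skew of the first table.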

Model Performance and Validation results.

Several types of models were fitted to this dataset and their accuracies compared. The final predictors were selected by combining the results of VIF (removing multicollinear variables) with variable importance from a random forest. They are months_since_donate, largest_gift, num_child, avg_gift, med_fam_inc, income, and home_value. Of the models tried (logistic regression, KNN, Naive Bayes, QDA, SVM with linear kernel, and SVM with radial kernel), the SVM with linear kernel performs best, with 55.67% accuracy on the test split of the data.

4. Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?

models = c("GLM", "QDA", "NB", "KNN", "SVM-Linear", "SVM-Radial")
# Store accuracies as numbers (not strings) so they plot on a continuous axis
acc = c(49.33, 54.33, 53.33, 48.83, 55.67, 54.17)

acc.summary = as.data.frame(acc, row.names = models)
acc.summary

##              acc
## GLM        49.33
## QDA        54.33
## NB         53.33
## KNN        48.83
## SVM-Linear 55.67
## SVM-Radial 54.17

ggplot(data = acc.summary) +
  geom_bar(aes(x = row.names(acc.summary), y = acc), stat = "identity") +
  xlab("models") +
  ylab("accuracy")

The graph above shows the accuracy of each model on the test set (20% of the data). The SVM with linear kernel has the highest accuracy, 55.67%, although its margin over QDA (54.33%) and the radial SVM (54.17%) is small. I pick this model to make predictions on the future fundraising dataset.
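To judge whether one model dominates, it helps to attach an uncertainty interval to each accuracy. The sketch below, in base R, assumes every model was scored on the same 600 test rows (the total of the confusion matrices above) and reconstructs the number of correct predictions from the reported percentages; both points are assumptions of this example.

```r
# Test-set size taken from the confusion matrices; correct counts are
# reconstructed from the reported accuracies (an assumption).
n.test  <- 600
acc.pct <- c(GLM = 49.33, QDA = 54.33, NB = 53.33, KNN = 48.83,
             "SVM-Linear" = 55.67, "SVM-Radial" = 54.17)
correct <- round(acc.pct / 100 * n.test)

# Exact (Clopper-Pearson) 95% interval for each accuracy.
ci <- t(sapply(correct, function(k) binom.test(k, n.test)$conf.int))
colnames(ci) <- c("lower", "upper")

round(cbind(accuracy = correct / n.test, ci), 4)
```

If the interval for the linear SVM contains the accuracies of several other models, the ranking should be read as suggestive rather than decisive. The intervals also ignore that all models were tested on the same rows, so they are only a rough guide.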

5. Select best model. From your answer in (4), what do you think is the “best” model?

As mentioned above, the SVM with linear kernel is the best model, with an optimal cost of 0.06. Its test accuracy of 55.67% is higher than that of the other models. This model achieved 65.83% accuracy on the future fundraising dataset.
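Since the stated objective is expected net profit rather than accuracy, a rough profit estimate is a useful companion to the accuracy figure. The sketch below uses the response rate, average donation and mailing cost from the Background section. It assumes that the linear SVM's test-set rates carry over unchanged to the full, unbalanced donor population and that every responder gives the average amount; both are simplifications, and the variable names are introduced here for illustration.

```r
# Figures from the Background section.
response.rate <- 0.051   # overall response rate
avg.donation  <- 13.00   # average gift among responders
mail.cost     <- 0.68    # cost per mailing

# Mailing is worthwhile only when P(donate) * avg.donation > mail.cost.
break.even <- mail.cost / avg.donation

# Expected net profit per person when mailing everyone.
profit.mail.all <- response.rate * avg.donation - mail.cost

# Assumption: test-set rates of the linear SVM hold in the population.
sens <- 192 / (192 + 98)    # share of donors the model would mail
fpr  <- 168 / (168 + 142)   # share of non-donors the model would mail

# Donors who are mailed return a gift minus the cost; mailed non-donors
# only cost money; people who are not mailed contribute nothing.
profit.model <- response.rate * sens * (avg.donation - mail.cost) -
  (1 - response.rate) * fpr * mail.cost

c(break.even = break.even, mail.all = profit.mail.all, model = profit.model)
```

Under these assumptions the break-even response probability is slightly above the overall 5.1% response rate, so mailing everyone loses a little money per piece, while mailing only the predicted donors yields a small positive expected profit per person.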

future_value = predict(svm.linear.fit, future_fundraising)

Submission File. For each row in the test set, you must predict whether or not the candidate is a donor. The .csv file should contain a header.

write.table(future_value, file = "predictions.csv", col.names = c("value"),
            sep = ",", row.names = FALSE, quote = FALSE)
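Before submitting, it can be worth reading the file back to confirm the header and row count. The sketch below uses a small hypothetical prediction vector and a temporary file in place of future_value and predictions.csv, so it runs on its own.

```r
# Hypothetical predictions standing in for future_value.
preds <- factor(c("Donor", "No Donor", "Donor"),
                levels = c("Donor", "No Donor"))

# Write a one-column CSV with a "value" header, then read it back.
out.file <- tempfile(fileext = ".csv")
write.csv(data.frame(value = preds), file = out.file,
          row.names = FALSE, quote = FALSE)

check <- read.csv(out.file)
names(check)   # column header
nrow(check)    # one row per prediction
```

For the real submission, the row count should equal the number of future mailing candidates (120).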

Conclusions

Of all the statistical techniques used, the SVM with linear kernel performed best, with an accuracy of 55.67% on the test set and 65.83% on the future fundraising dataset. Important factors to consider when targeting people are the number of months since the last donation, the dollar amount of the largest gift to date, the number of children, median family income, home value, and income. The campaign should focus primarily on people who have children, who donated recently, who have a higher median family income, and who made slightly lower donations on average.

Fundraising project

Ujwala Sirigineedi

5/2/2022
