Data Mining Final Project

Background

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Objective

The objective of this project is to develop a classification model that can improve the national veteran organization marketing campign cost effectiveness. We want to effectively capture donors so the expected net profit is maxmized.

The Data

The data file, Fundraising.rds, contains 3,000 records and 22 variables, the descriptions are down below. The data given has approximately 50% donors and 50% non-donors.

In addition to this dataset, we also have our test dataset that is future_fundraising.rds.

knitr::include_graphics("C:\\Users\\sanja\\Downloads\\Datamining project variables.png")

#structure of fundraise and future

str(fundraise)

## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

str(future)

## tibble [120 × 20] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 1 2 1 1 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num_child          : num [1:120] 1 1 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:120] 5 1 4 4 2 4 2 3 4 2 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 2 2 2 ...
##  $ wealth             : num [1:120] 9 7 1 8 7 8 1 8 3 5 ...
##  $ home_value         : num [1:120] 1399 1355 835 1019 992 ...
##  $ med_fam_inc        : num [1:120] 637 411 310 389 524 371 209 253 302 335 ...
##  $ avg_fam_inc        : num [1:120] 703 497 364 473 563 408 259 285 324 348 ...
##  $ pct_lt15k          : num [1:120] 1 9 22 15 6 10 36 25 19 14 ...
##  $ num_prom           : num [1:120] 74 77 70 21 63 35 72 68 55 59 ...
##  $ lifetime_gifts     : num [1:120] 102 249 126 26 100 92 146 98 66 276 ...
##  $ largest_gift       : num [1:120] 6 15 6 16 20 37 12 5 7 15 ...
##  $ last_gift          : num [1:120] 5 7 6 16 3 37 11 3 5 13 ...
##  $ months_since_donate: num [1:120] 29 35 34 37 21 37 36 32 30 33 ...
##  $ time_lag           : num [1:120] 3 3 8 5 6 5 5 9 9 10 ...
##  $ avg_gift           : num [1:120] 4.86 9.58 4.34 13 7.69 ...

An important note is that our target has two factors, Donor and non-donor. Since we wish to create a classification model a weighted sample is used to insure there is fair distribution between the two factors of target.

Exploratory Data Anaylsis

For the exploratory data anaylsis the following questions needed to be asked:

How are the predictors distrbuted? Is there any signifcant outliers?
Do the predictors correlate with the response variable? What predictors heavily influence the response variable.
Is there collinearity amongst the variable?

These questions will be answered in their respective order.

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

Based off the summary, we can see that the following variables home_value, med_fam_inc, avg_fam_inc, pcy_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift all seem to contain outliers. num_child, income, and wealth were all labeled as numeric data however they appear to be categorical variables.

#Checking for correlation
corfund <- (fundraise[,c(6:7, 9:21)])
corfund$target <- as.numeric(corfund$target)

cor(corfund)

##                        num_child       income       wealth    home_value
## num_child            1.000000000  0.091893089  0.060175537 -0.0119642286
## income               0.091893089  1.000000000  0.208993101  0.2919734944
## wealth               0.060175537  0.208993101  1.000000000  0.2611611450
## home_value          -0.011964229  0.291973494  0.261161145  1.0000000000
## med_fam_inc          0.046961647  0.367505334  0.377763371  0.7381530742
## avg_fam_inc          0.047261395  0.378585352  0.385892299  0.7525690021
## pct_lt15k           -0.031717891 -0.283191234 -0.375145585 -0.3990861577
## num_prom            -0.086432604 -0.069008634 -0.412117770 -0.0645138583
## lifetime_gifts      -0.050954766 -0.019565470 -0.225473319 -0.0240737013
## largest_gift        -0.017554416  0.033180760 -0.025276518  0.0564942757
## last_gift           -0.012948678  0.109592754  0.052591311  0.1588576542
## months_since_donate -0.005563603  0.077238810  0.033713981  0.0234285142
## time_lag            -0.006069356 -0.001545727 -0.066421329  0.0006789113
## avg_gift            -0.019688680  0.124055750  0.091078754  0.1687736865
## target               0.042348253 -0.035953287 -0.003114465 -0.0215691141
##                      med_fam_inc  avg_fam_inc     pct_lt15k    num_prom
## num_child            0.046961647  0.047261395 -0.0317178911 -0.08643260
## income               0.367505334  0.378585352 -0.2831912335 -0.06900863
## wealth               0.377763371  0.385892299 -0.3751455847 -0.41211777
## home_value           0.738153074  0.752569002 -0.3990861577 -0.06451386
## med_fam_inc          1.000000000  0.972271285 -0.6653626748 -0.05078270
## avg_fam_inc          0.972271285  1.000000000 -0.6802847967 -0.05731139
## pct_lt15k           -0.665362675 -0.680284797  1.0000000000  0.03777518
## num_prom            -0.050782705 -0.057311385  0.0377751828  1.00000000
## lifetime_gifts      -0.035245827 -0.040327155  0.0596188059  0.53861957
## largest_gift         0.047032066  0.043103937 -0.0078829361  0.11381034
## last_gift            0.135976003  0.131378624 -0.0617521213 -0.05586809
## months_since_donate  0.032336691  0.031268594 -0.0090145584 -0.28232212
## time_lag             0.015202043  0.024340381 -0.0199114896  0.11962322
## avg_gift             0.137162758  0.131758434 -0.0624808920 -0.14725094
## target              -0.008036116 -0.003177139  0.0007592833 -0.06836599
##                     lifetime_gifts largest_gift   last_gift months_since_donate
## num_child              -0.05095477 -0.017554416 -0.01294868        -0.005563603
## income                 -0.01956547  0.033180760  0.10959275         0.077238810
## wealth                 -0.22547332 -0.025276518  0.05259131         0.033713981
## home_value             -0.02407370  0.056494276  0.15885765         0.023428514
## med_fam_inc            -0.03524583  0.047032066  0.13597600         0.032336691
## avg_fam_inc            -0.04032716  0.043103937  0.13137862         0.031268594
## pct_lt15k               0.05961881 -0.007882936 -0.06175212        -0.009014558
## num_prom                0.53861957  0.113810342 -0.05586809        -0.282322122
## lifetime_gifts          1.00000000  0.507262313  0.20205827        -0.144621862
## largest_gift            0.50726231  1.000000000  0.44723693         0.019789633
## last_gift               0.20205827  0.447236933  1.00000000         0.186715010
## months_since_donate    -0.14462186  0.019789633  0.18671501         1.000000000
## time_lag                0.03854575  0.039977035  0.07511121         0.015528499
## avg_gift                0.18232435  0.474830096  0.86639998         0.189110799
## target                 -0.01962693  0.017783355  0.07772082         0.133813301
##                          time_lag    avg_gift        target
## num_child           -0.0060693555 -0.01968868  0.0423482529
## income              -0.0015457272  0.12405575 -0.0359532869
## wealth              -0.0664213294  0.09107875 -0.0031144649
## home_value           0.0006789113  0.16877369 -0.0215691141
## med_fam_inc          0.0152020426  0.13716276 -0.0080361157
## avg_fam_inc          0.0243403812  0.13175843 -0.0031771394
## pct_lt15k           -0.0199114896 -0.06248089  0.0007592833
## num_prom             0.1196232155 -0.14725094 -0.0683659889
## lifetime_gifts       0.0385457538  0.18232435 -0.0196269259
## largest_gift         0.0399770354  0.47483010  0.0177833547
## last_gift            0.0751112090  0.86639998  0.0777208200
## months_since_donate  0.0155284995  0.18911080  0.1338133012
## time_lag             1.0000000000  0.07008164 -0.0097457015
## avg_gift             0.0700816428  1.00000000  0.0756630051
## target              -0.0097457015  0.07566301  1.0000000000

The output above shows us that the variables num_child, income, num_prom, last_gift, months_since_donate, and avg_gift are correlated with the response variable, target. In addition we also see that med_fam_inc and avg_fam_inc are highly correlated with income.

# Check collinearity
dataf <- (fundraise[,c(6:7, 9:20)])

vif(lm(data=dataf))

##              income              wealth          home_value         med_fam_inc 
##            1.186680            1.508818            2.478427           18.420766 
##         avg_fam_inc           pct_lt15k            num_prom      lifetime_gifts 
##           20.683498            2.039677            1.951447            1.994198 
##        largest_gift           last_gift months_since_donate            time_lag 
##            1.715154            4.150953            1.144156            1.032417 
##            avg_gift 
##            4.463250

Above we can see that med_fam_inc and ang_fam_inc are collinear as well as last_gift and avg_gift.

Methodology

Variable Transformation

Previously we saw the variables home_value, med_fam_inc, avg_farm_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift all have large outliers. We can take the square root to the values of 0 and then apply a log to the non-zero values.

#splitting into 80% training and 20% testing
set.seed(12345)
trainI <- sample(1:nrow(fundraise), round(nrow(fundraise) * 0.80))
train <- fundraise[trainI, ]
test <- fundraise[-trainI, ]
nrow(train)

## [1] 2400

#Setting up the Cross Validation
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

Model Fitting Approach

To find the variables that signifcant, variables that have a p-value less than 0.05, a logistic regression is used.

fundglm_fit <- glm(target~., data = fundraise, family = "binomial")
summary(fundglm_fit)

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = fundraise)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90432  -1.15349   0.00153   1.15919   1.79778  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.885e+00  4.595e-01  -4.102 4.10e-05 ***
## zipconvert2Yes      -1.365e+01  2.670e+02  -0.051  0.95924    
## zipconvert3No        1.361e+01  2.670e+02   0.051  0.95934    
## zipconvert4Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## zipconvert5Yes      -1.365e+01  2.670e+02  -0.051  0.95922    
## homeownerNo          4.957e-02  9.412e-02   0.527  0.59847    
## num_child            2.752e-01  1.137e-01   2.422  0.01544 *  
## income              -6.952e-02  2.595e-02  -2.679  0.00738 ** 
## femaleNo             5.995e-02  7.673e-02   0.781  0.43463    
## wealth              -1.907e-02  1.800e-02  -1.059  0.28940    
## home_value          -1.074e-04  7.141e-05  -1.503  0.13272    
## med_fam_inc         -1.200e-03  9.303e-04  -1.289  0.19725    
## avg_fam_inc          1.756e-03  1.010e-03   1.738  0.08226 .  
## pct_lt15k           -9.519e-04  4.440e-03  -0.214  0.83024    
## num_prom            -3.682e-03  2.317e-03  -1.589  0.11204    
## lifetime_gifts       1.599e-04  3.721e-04   0.430  0.66743    
## largest_gift        -1.773e-03  3.091e-03  -0.574  0.56629    
## last_gift            9.923e-03  7.562e-03   1.312  0.18945    
## months_since_donate  5.922e-02  1.003e-02   5.906 3.51e-09 ***
## time_lag            -6.174e-03  6.789e-03  -0.909  0.36311    
## avg_gift             7.539e-03  1.106e-02   0.682  0.49526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.9  on 2999  degrees of freedom
## Residual deviance: 4062.0  on 2979  degrees of freedom
## AIC: 4104
## 
## Number of Fisher Scoring iterations: 12

The output from the logistic regression shows that the intercept and the variables num_child, income, and months_since_donate are significant to predicting the target. Using these variables, I will perform a linear regresssion next.

fund_lm_fit <- glm(target ~ num_child + income + months_since_donate, data = train, family = 'binomial')
pred_prob <- predict.glm(fund_lm_fit, newdata = test, type = 'response')
pred <- ifelse(pred_prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      105      147
##   No Donor   185      163
##                                           
##                Accuracy : 0.4467          
##                  95% CI : (0.4064, 0.4875)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.99974         
##                                           
##                   Kappa : -0.1126         
##                                           
##  Mcnemar's Test P-Value : 0.04229         
##                                           
##             Sensitivity : 0.3621          
##             Specificity : 0.5258          
##          Pos Pred Value : 0.4167          
##          Neg Pred Value : 0.4684          
##              Prevalence : 0.4833          
##          Detection Rate : 0.1750          
##    Detection Prevalence : 0.4200          
##       Balanced Accuracy : 0.4439          
##                                           
##        'Positive' Class : Donor           
##

This model has a test accuracy of 44.67%. This is explains that our model is not great predictor. A randomForest model could be used next to see if it’ll produce a better test accuracy using the predictor variables we found to significant.

set.seed(12345)
fund_tree_fit <- randomForest(target ~ num_child + income + months_since_donate, data =train, mtry = 3, importance = TRUE)
fund_tree_fit

## 
## Call:
##  randomForest(formula = target ~ num_child + income + months_since_donate,      data = train, mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 50.5%
## Confusion matrix:
##          Donor No Donor class.error
## Donor      721      488   0.4036394
## No Donor   724      467   0.6078925

rando_fit <- train(target~ num_child + income + months_since_donate, data = train, method ='rf', trControl = train_control, importance = TRUE)

## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .

pred_rf <- predict(rando_fit,test)
confusionMatrix(pred_rf,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      197      180
##   No Donor    93      130
##                                           
##                Accuracy : 0.545           
##                  95% CI : (0.5042, 0.5854)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.08875         
##                                           
##                   Kappa : 0.0977          
##                                           
##  Mcnemar's Test P-Value : 1.94e-07        
##                                           
##             Sensitivity : 0.6793          
##             Specificity : 0.4194          
##          Pos Pred Value : 0.5225          
##          Neg Pred Value : 0.5830          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3283          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.5493          
##                                           
##        'Positive' Class : Donor           
##

Here we see that we had a improvement in test accuracy. The randomForest model yields a test accuracy of 54.5%. For the last model fit I will try a KNN model with the predictor variables that are signifcant.

fund_knn_fit <- train(target~num_child + income + months_since_donate, data=train, method='knn',trControl = train_control, tuneLength=30)
pred_knn <- predict(fund_knn_fit,test)
confusionMatrix(pred_knn,test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      179      163
##   No Donor   111      147
##                                           
##                Accuracy : 0.5433          
##                  95% CI : (0.5025, 0.5837)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.102639        
##                                           
##                   Kappa : 0.0909          
##                                           
##  Mcnemar's Test P-Value : 0.002063        
##                                           
##             Sensitivity : 0.6172          
##             Specificity : 0.4742          
##          Pos Pred Value : 0.5234          
##          Neg Pred Value : 0.5698          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2983          
##    Detection Prevalence : 0.5700          
##       Balanced Accuracy : 0.5457          
##                                           
##        'Positive' Class : Donor           
##

The KNN model yields a test accuracy of 55.33% an improvement from the previous two models.

Final Model

Out of the three models tested the KNN model produced the best test accuracy of 55.33%.

Testing the Model

Using the future_fundraising test data given, I can apply the knn model

knn <- train(target~num_child + income + months_since_donate, data=train, method='knn',trControl = train_control, tuneLength=30)
pred_knn <- predict(knn,future)
Value <- c("value",as.character(pred_knn))
write.csv(Value,file="knn_model_value.csv")

Reccomendations

The test accuracies were not the best however there is a couple of ways to improve them. One way is to simply have collected more observations. Another way was to change the scales of the models. And lastly apply a better transformation to the variables to treat outliers.