Objective

The objective is to develop a predictive model for national veterans’ organization. The model is to improve the cost effectiveness of their direct marketing campaign. This will predict which of their donors should receive a mail campaign letter.

Data Source

The data is provided by the national veterans’ organization. It contains 3,000 records with approximately 50% donors and 50% non-donors. Different distributions performed on records can create imbalance that can affect the outcome. To avoid this weighted sampling was used in our data. This method improves accuracy compared to choosing a random sample.

Content Description

fund_desc = data.frame(cbind(Variable, Description))
kableExtra::kable(fund_desc)
Variable Description
zip Zip code group (zip codes were grouped into five groups
homeowner Yes = homeowner, No = not a homeowner
num_child Number of children
income Household income
female No = male, Yes = female
wealth Wealth rating uses median family income and population statistics from each area toindex relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest
home_value Average home value in potential donor’s neighborhood in hundreds of dollars
med_fam_inc Median family income in potential donor’s neighborhood in hundreds of dollars
avg_fam_inc Average family income in potential donor’s neighborhood in hundreds
pct_lt15k Percent earning less than $15K in potential donor’s neighborhood
num_prom Lifetime number of promotions received to date
lifetime_gifts Dollar amount of lifetime gifts to date
largest_gift Dollar amount of largest gift to date
last_gift Dollar amount of most recent gift
months_since_donate N-umber of months from last donation to +-+July 2018
time_lag Number of months between first and second gift
avg_gift Average dollar amount of gifts to date
target Outcome variable: binary indicator for response Yes = donor, No = non-donor
# loading data sets
fundraising = read_rds('~/Data Mining/Final Project/fundraising(Train).rds')
future_fundraising = read_rds('~/Data Mining/Final Project/future_fundraising(Test).rds')

Variable Transformations

fund_num = fundraising
fund_num = as.data.frame(sapply(fund_num[, c(1:21)], as.numeric))

All predictors have been converted into numerical values to calculate correlation and variance values.

Exploratory Data Analysis

Correlation Matrix

round(cor(fund_num),4)
##                     zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner
## zipconvert2              1.0000      0.2490     -0.2742     -0.4150   -0.0224
## zipconvert3              0.2490      1.0000      0.2477      0.3750   -0.0812
## zipconvert4             -0.2742      0.2477      1.0000     -0.4130    0.0416
## zipconvert5             -0.4150      0.3750     -0.4130      1.0000   -0.0825
## homeowner               -0.0224     -0.0812      0.0416     -0.0825    1.0000
## num_child                0.0307      0.0080      0.0269     -0.0417   -0.0452
## income                  -0.0022      0.0942     -0.0203      0.0960   -0.3205
## female                   0.0108     -0.0235     -0.0226     -0.0108    0.0015
## wealth                   0.0425      0.0825     -0.0142      0.0434   -0.0658
## home_value              -0.1339      0.2377     -0.1831      0.4546   -0.1187
## med_fam_inc             -0.0894      0.0966     -0.0370      0.1835   -0.1383
## avg_fam_inc             -0.0792      0.1118     -0.0464      0.1950   -0.1335
## pct_lt15k                0.0495     -0.0320      0.0648     -0.1209    0.1375
## num_prom                -0.0337     -0.0440      0.0293     -0.0303   -0.0033
## lifetime_gifts          -0.0313      0.0104      0.0312      0.0091    0.0282
## largest_gift            -0.0184     -0.0018     -0.0158      0.0281    0.0333
## last_gift               -0.0087      0.0507     -0.0309      0.0745    0.0009
## months_since_donate      0.0158      0.0198      0.0253     -0.0179   -0.0214
## time_lag                -0.0185     -0.0439     -0.0148     -0.0084   -0.0263
## avg_gift                -0.0087      0.0622     -0.0277      0.0811    0.0083
## target                   0.0061     -0.0109      0.0053     -0.0211    0.0266
##                     num_child  income  female  wealth home_value med_fam_inc
## zipconvert2            0.0307 -0.0022  0.0108  0.0425    -0.1339     -0.0894
## zipconvert3            0.0080  0.0942 -0.0235  0.0825     0.2377      0.0966
## zipconvert4            0.0269 -0.0203 -0.0226 -0.0142    -0.1831     -0.0370
## zipconvert5           -0.0417  0.0960 -0.0108  0.0434     0.4546      0.1835
## homeowner             -0.0452 -0.3205  0.0015 -0.0658    -0.1187     -0.1383
## num_child              1.0000  0.0919  0.0296  0.0602    -0.0120      0.0470
## income                 0.0919  1.0000  0.0438  0.2090     0.2920      0.3675
## female                 0.0296  0.0438  1.0000  0.0294     0.0210      0.0235
## wealth                 0.0602  0.2090  0.0294  1.0000     0.2612      0.3778
## home_value            -0.0120  0.2920  0.0210  0.2612     1.0000      0.7382
## med_fam_inc            0.0470  0.3675  0.0235  0.3778     0.7382      1.0000
## avg_fam_inc            0.0473  0.3786  0.0252  0.3859     0.7526      0.9723
## pct_lt15k             -0.0317 -0.2832 -0.0552 -0.3751    -0.3991     -0.6654
## num_prom              -0.0864 -0.0690 -0.0382 -0.4121    -0.0645     -0.0508
## lifetime_gifts        -0.0510 -0.0196 -0.0371 -0.2255    -0.0241     -0.0352
## largest_gift          -0.0176  0.0332 -0.0014 -0.0253     0.0565      0.0470
## last_gift             -0.0129  0.1096  0.0464  0.0526     0.1589      0.1360
## months_since_donate   -0.0056  0.0772  0.0451  0.0337     0.0234      0.0323
## time_lag              -0.0061 -0.0015  0.0080 -0.0664     0.0007      0.0152
## avg_gift              -0.0197  0.1241  0.0745  0.0911     0.1688      0.1372
## target                 0.0423 -0.0360  0.0248 -0.0031    -0.0216     -0.0080
##                     avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## zipconvert2             -0.0792    0.0495  -0.0337        -0.0313      -0.0184
## zipconvert3              0.1118   -0.0320  -0.0440         0.0104      -0.0018
## zipconvert4             -0.0464    0.0648   0.0293         0.0312      -0.0158
## zipconvert5              0.1950   -0.1209  -0.0303         0.0091       0.0281
## homeowner               -0.1335    0.1375  -0.0033         0.0282       0.0333
## num_child                0.0473   -0.0317  -0.0864        -0.0510      -0.0176
## income                   0.3786   -0.2832  -0.0690        -0.0196       0.0332
## female                   0.0252   -0.0552  -0.0382        -0.0371      -0.0014
## wealth                   0.3859   -0.3751  -0.4121        -0.2255      -0.0253
## home_value               0.7526   -0.3991  -0.0645        -0.0241       0.0565
## med_fam_inc              0.9723   -0.6654  -0.0508        -0.0352       0.0470
## avg_fam_inc              1.0000   -0.6803  -0.0573        -0.0403       0.0431
## pct_lt15k               -0.6803    1.0000   0.0378         0.0596      -0.0079
## num_prom                -0.0573    0.0378   1.0000         0.5386       0.1138
## lifetime_gifts          -0.0403    0.0596   0.5386         1.0000       0.5073
## largest_gift             0.0431   -0.0079   0.1138         0.5073       1.0000
## last_gift                0.1314   -0.0618  -0.0559         0.2021       0.4472
## months_since_donate      0.0313   -0.0090  -0.2823        -0.1446       0.0198
## time_lag                 0.0243   -0.0199   0.1196         0.0385       0.0400
## avg_gift                 0.1318   -0.0625  -0.1473         0.1823       0.4748
## target                  -0.0032    0.0008  -0.0684        -0.0196       0.0178
##                     last_gift months_since_donate time_lag avg_gift  target
## zipconvert2           -0.0087              0.0158  -0.0185  -0.0087  0.0061
## zipconvert3            0.0507              0.0198  -0.0439   0.0622 -0.0109
## zipconvert4           -0.0309              0.0253  -0.0148  -0.0277  0.0053
## zipconvert5            0.0745             -0.0179  -0.0084   0.0811 -0.0211
## homeowner              0.0009             -0.0214  -0.0263   0.0083  0.0266
## num_child             -0.0129             -0.0056  -0.0061  -0.0197  0.0423
## income                 0.1096              0.0772  -0.0015   0.1241 -0.0360
## female                 0.0464              0.0451   0.0080   0.0745  0.0248
## wealth                 0.0526              0.0337  -0.0664   0.0911 -0.0031
## home_value             0.1589              0.0234   0.0007   0.1688 -0.0216
## med_fam_inc            0.1360              0.0323   0.0152   0.1372 -0.0080
## avg_fam_inc            0.1314              0.0313   0.0243   0.1318 -0.0032
## pct_lt15k             -0.0618             -0.0090  -0.0199  -0.0625  0.0008
## num_prom              -0.0559             -0.2823   0.1196  -0.1473 -0.0684
## lifetime_gifts         0.2021             -0.1446   0.0385   0.1823 -0.0196
## largest_gift           0.4472              0.0198   0.0400   0.4748  0.0178
## last_gift              1.0000              0.1867   0.0751   0.8664  0.0777
## months_since_donate    0.1867              1.0000   0.0155   0.1891  0.1338
## time_lag               0.0751              0.0155   1.0000   0.0701 -0.0097
## avg_gift               0.8664              0.1891   0.0701   1.0000  0.0757
## target                 0.0777              0.1338  -0.0097   0.0757  1.0000

Pairwise Plots

ggpairs(fundraising[, c(10, 11, 12, 17, 20)], ggplot2::aes(colour = fundraising$target))

The correlation values and the pairwise plots show that home_value, med_fam_inc, avg_fam_inc, last_gift, and avg_gift are highly correlated.

Variance

library(colorspace)

# all variables
var = apply(fund_num, 2, var)
round(var, 4)
##         zipconvert2         zipconvert3         zipconvert4         zipconvert5 
##              0.1694              0.1500              0.1685              0.2368 
##           homeowner           num_child              income              female 
##              0.1768              0.1192              2.6877              0.2379 
##              wealth          home_value         med_fam_inc         avg_fam_inc 
##              6.4859         906581.4713          30183.1011          28528.3742 
##           pct_lt15k            num_prom      lifetime_gifts        largest_gift 
##            146.6433            518.9120          22314.4274            507.0452 
##           last_gift months_since_donate            time_lag            avg_gift 
##            109.7315             16.7713             31.3798             55.5123 
##              target 
##              0.2501
op = par(no.readonly = TRUE)

plot(var, type = "b", col = rainbow_hcl(10), xlim = c(1, 21), pch = 16, xlab = NA, xaxt='n',ylab="Variance")
axis(side = 1, at=1:21, las = 2, labels = names(fund_num))
abline(h=50, col = "darkgray")

par(op)

Home_value, med_fam_inc, avg_fam_inc, and lifetime_gifts have a high variance of over 50 out of all other predictors.These indicates that the data points of these predictors are very spread out.

Checking Collinearity

library(usdm)
fund_nocat = as.data.frame(fundraising[, c(6,7,9:20)])

vif =  as.data.frame(usdm:: vif(fund_nocat))
vif
##              Variables       VIF
## 1            num_child  1.025019
## 2               income  1.194773
## 3               wealth  1.508819
## 4           home_value  2.493018
## 5          med_fam_inc 18.423616
## 6          avg_fam_inc 20.688945
## 7            pct_lt15k  2.040761
## 8             num_prom  1.962585
## 9       lifetime_gifts  1.994202
## 10        largest_gift  1.715238
## 11           last_gift  4.153071
## 12 months_since_donate  1.145515
## 13            time_lag  1.032467
## 14            avg_gift  4.469569
op = par(no.readonly = TRUE)
plot(vif$VIF, type = "b", col = rainbow_hcl(21), xlim = c(1,14), pch = 16, ylab = "Vif",  xlab = NA, xaxt='n')
axis(side = 1, at=1:14, las = 2, labels = names(fund_nocat))
abline(h=5, col = "darkgray")
abline(h=1, col = "lightgreen")

par(op)

The med_fam-inc and avg_fam-inc have a variance inflation factor of over 10. So collinearity exists in this predictors and these are highly unlikely to be significant.

Analyzing all the graphical and numerical statistics, the predictors that have significant relationships are: med_fam_inc, avg_fam_inc, last_gift, and avg_gift. Significant predictors tend to have low or no collinearity. Possible significant predictors with vif close to 1 is: num_child, income, wealth, largest gift, month_since_donate, and time_lag.

Partitioning Data Set

The original data is partitioned into a training and testing data set. The training data includes 80% and the testing data includes 20% of the original data records.

set.seed(12345)

# partition data
inTrain = caret::createDataPartition(fundraising$target, p=0.80, list = FALSE)
inTrain = sample(3000, 2400)
train = fundraising[inTrain,]
test = fundraising[-inTrain,]

Classification Models

Logistic Regression

glm.fit = glm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data =train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = target ~ homeowner + num_child + female + income + 
##     wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + 
##     num_prom + largest_gift + last_gift + avg_gift + months_since_donate + 
##     time_lag + home_value, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7783  -1.1505   0.4798   1.1614   1.7596  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.613e+00  5.066e-01  -3.185  0.00145 ** 
## homeownerNo          8.207e-04  1.054e-01   0.008  0.99379    
## num_child            2.499e-01  1.290e-01   1.937  0.05275 .  
## femaleNo             2.126e-02  8.586e-02   0.248  0.80444    
## income              -6.796e-02  2.915e-02  -2.331  0.01973 *  
## wealth              -2.365e-02  1.995e-02  -1.185  0.23588    
## med_fam_inc         -1.143e-03  1.027e-03  -1.112  0.26601    
## avg_fam_inc          1.257e-03  1.126e-03   1.116  0.26440    
## pct_lt15k           -1.532e-03  4.876e-03  -0.314  0.75340    
## lifetime_gifts       1.283e-04  3.808e-04   0.337  0.73612    
## num_prom            -3.802e-03  2.564e-03  -1.483  0.13807    
## largest_gift        -1.618e-03  2.991e-03  -0.541  0.58854    
## last_gift            1.302e-02  8.651e-03   1.504  0.13246    
## avg_gift             1.322e-02  1.282e-02   1.031  0.30243    
## months_since_donate  5.508e-02  1.136e-02   4.851 1.23e-06 ***
## time_lag            -3.917e-03  7.499e-03  -0.522  0.60143    
## home_value          -1.120e-04  7.003e-05  -1.599  0.10991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.1  on 2399  degrees of freedom
## Residual deviance: 3247.8  on 2383  degrees of freedom
## AIC: 3281.8
## 
## Number of Fisher Scoring iterations: 4
glm.probs = predict(glm.fit, test, type = "response")
glm.preds = as.factor(ifelse(glm.probs > 0.5, "Donor", "No Donor"))

car:: vif(glm.fit)
##           homeowner           num_child              female              income 
##            1.134218            1.030344            1.017859            1.317792 
##              wealth         med_fam_inc         avg_fam_inc           pct_lt15k 
##            1.506162           18.084577           20.461406            2.020462 
##      lifetime_gifts            num_prom        largest_gift           last_gift 
##            1.995230            1.872537            1.842279            3.607303 
##            avg_gift months_since_donate            time_lag          home_value 
##            3.918775            1.123660            1.031667            2.495922
#accuracy
mean(glm.preds == test$target)
## [1] 0.43

The model accurately predicts 43.33% of the data.

Next we fit a model with the significant predictors check the accuracy. ### Final Model

glm.fit=glm(target ~ num_child + income + months_since_donate, data =train, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = target ~ num_child + income + months_since_donate, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6180  -1.1617   0.8121   1.1633   1.7263  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.18516    0.36870  -5.927 3.09e-09 ***
## num_child            0.25028    0.12754   1.962  0.04973 *  
## income              -0.07084    0.02553  -2.775  0.00552 ** 
## months_since_donate  0.07036    0.01076   6.539 6.20e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.1  on 2399  degrees of freedom
## Residual deviance: 3274.6  on 2396  degrees of freedom
## AIC: 3282.6
## 
## Number of Fisher Scoring iterations: 4
glm.probs = predict(glm.fit, test, type = "response")
glm.preds = as.factor(ifelse(glm.probs > 0.5, "Donor", "No Donor"))

car:: vif(glm.fit)
##           num_child              income months_since_donate 
##            1.014607            1.023382            1.009532
mean(glm.preds == test$target) 
## [1] 0.4366667

The final model has the predictors num_child, income, and months_since_donate. The model accurately predicts 43.67% of the data which is better than the previous logistic regression model. The vifs are close to one which proves our assumption that significant predictors usually don’t have high collinearity.

K-Nearest Neighbors

library(class)

# training matrix
train.x = as.matrix(train[, c(6, 7, 9:20)])
# testing matrix
test.x = as.matrix(test[, c(6, 7, 9:20)])

set.seed(12345)

knn.preds = knn(train.x, test.x, train$target , k = 10)

mean(knn.preds == test$target)
## [1] 0.5033333

The KNN model with all numeric predictors has an accuracy of 50.33%. Using the significant variables from logistic regression, we will fit a new KNN model. ### Final Model

# training matrix
train.x = as.matrix(train[, c(6, 7, 18)])
# testing matrix
test.x = as.matrix(test[, c(6, 7, 18)])

set.seed(12345)

knn.preds = knn(train.x, test.x, train$target , k = 10)

mean(knn.preds == test$target)
## [1] 0.5366667

The final model accurately predicts 53.67% of the data.

Support Vector Machine with Radial as the Kernel

library(e1071)

# cross validation
tune.rad = tune(svm, target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data=train, kernel ="radial", ranges =list(cost=seq(0.01, 5, 0.333)))
op = tune.rad$best.parameters$cost

# svm radial
svm.rad = svm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, kernel = "radial", data = train, cost = op)
summary(svm.rad)
## 
## Call:
## svm(formula = target ~ homeowner + num_child + female + income + 
##     wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + 
##     num_prom + largest_gift + last_gift + avg_gift + months_since_donate + 
##     time_lag + home_value, data = train, kernel = "radial", cost = op)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1.342 
## 
## Number of Support Vectors:  2197
## 
##  ( 1100 1097 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Donor No Donor
svm.preds=predict(svm.rad, test)

mean(svm.preds == test$target)
## [1] 0.56

The model accurately predicts 55.67% of the data. Now we will check the accuracy using the significant predictors. ### Final Model

# cross validation
tune.rad = tune(svm, target ~ num_child + income + months_since_donate, data=train, kernel ="radial", ranges =list(cost=seq(0.01, 5, 0.333)))
op = tune.rad$best.parameters$cost

# svm radial
svm.rad = svm(target ~ num_child + income + months_since_donate, kernel = "radial", data = train, cost = op)
summary(svm.rad)
## 
## Call:
## svm(formula = target ~ num_child + income + months_since_donate, 
##     data = train, kernel = "radial", cost = op)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.676 
## 
## Number of Support Vectors:  2205
## 
##  ( 1105 1100 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Donor No Donor
svm.preds=predict(svm.rad, test)

mean(svm.preds == test$target)
## [1] 0.57

This correctly predicts 57.33% of the data.

Types of Analysis Performed

Different types of classification models have been tried on the data and their accuracy have been checked. Logistic regression, LDA, QDA, KNN, SVM with linear kernel, and SVM with radial kernel have been performed. Out of all these, logistic regression, KNN, and SVM radial models are shown in the report.

Exclusion

Predictors with high VIF are excluded from the models. Insignificant predictors or predictors that made no difference in accuracy are also excluded from the final models.

Model Performance

Logistic regression performed the worst out of all the analysis with a prediction accuracy of 43.67%. However, this analysis indicated possible significant predictors which were later used in other models. KNN model had a prediction accuracy of 53.67% which is much better than the logistic regression model. SVM with a radial kernel gave the best prediction accuracy of 57.33%.

Best Model

After performing all the classification analysis, the best classification model was the SVM model with a radial kernel. tune() function was used to perform cross-validation. The cost parameter that gave the lowest cross validation error rate was 0.343 which has been used as the cost parameter. This model correctly classified 57.33% of the test observations.

Testing

future_value = predict(svm.rad, future_fundraising)

CSV File

write.table(future_value, file="Predictions.csv", col.names = c("value"),sep = ",", row.names = F, quote = F)