The objective is to develop a predictive model for a national veterans’ organization to improve the cost effectiveness of its direct-mail fundraising campaign. The model will predict which of the organization’s donors should receive a campaign letter.
The data is provided by the national veterans’ organization. It contains 3,000 records with approximately 50% donors and 50% non-donors. In the underlying mailing data the two classes are not balanced, and such imbalance can bias a model toward the majority class. To avoid this, weighted sampling was used to construct the data set, which improves model accuracy compared to a simple random sample.
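One simple form of such weighting is stratified (balanced) sampling. Below is a minimal sketch, assuming a hypothetical raw mailing file raw_data with the same target coding; the actual sampling was performed by the data provider.

# hypothetical illustration: `raw_data` and the 1,500-per-class size
# are assumptions, not part of the provided data pipeline
set.seed(1)
donors    = raw_data[raw_data$target == "Donor", ]
nondonors = raw_data[raw_data$target == "No Donor", ]
# 1,500 of each class yields a 3,000-record balanced file
balanced  = rbind(donors[sample(nrow(donors), 1500), ],
                  nondonors[sample(nrow(nondonors), 1500), ])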
# Variable and Description are character vectors holding the two columns
# of the data dictionary rendered below
fund_desc = data.frame(Variable, Description)
kableExtra::kable(fund_desc)
| Variable | Description |
|---|---|
| zip | Zip code group (zip codes were grouped into five groups) |
| homeowner | Yes = homeowner, No = not a homeowner |
| num_child | Number of children |
| income | Household income |
| female | No = male, Yes = female |
| wealth | Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest |
| home_value | Average home value in potential donor’s neighborhood in hundreds of dollars |
| med_fam_inc | Median family income in potential donor’s neighborhood in hundreds of dollars |
| avg_fam_inc | Average family income in potential donor’s neighborhood in hundreds |
| pct_lt15k | Percent earning less than $15K in potential donor’s neighborhood |
| num_prom | Lifetime number of promotions received to date |
| lifetime_gifts | Dollar amount of lifetime gifts to date |
| largest_gift | Dollar amount of largest gift to date |
| last_gift | Dollar amount of most recent gift |
| months_since_donate | Number of months from last donation to July 2018 |
| time_lag | Number of months between first and second gift |
| avg_gift | Average dollar amount of gifts to date |
| target | Outcome variable: binary indicator for response Yes = donor, No = non-donor |
# loading data sets (read_rds comes from the readr package)
library(readr)
fundraising = read_rds('~/Data Mining/Final Project/fundraising(Train).rds')
future_fundraising = read_rds('~/Data Mining/Final Project/future_fundraising(Test).rds')
# coerce all 21 columns, factors included, to numeric codes for the
# correlation and variance calculations below
fund_num = fundraising
fund_num = as.data.frame(sapply(fund_num[, 1:21], as.numeric))
All predictors, including the categorical ones, have been converted to numeric values so that correlations and variances can be computed.
round(cor(fund_num),4)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner
## zipconvert2 1.0000 0.2490 -0.2742 -0.4150 -0.0224
## zipconvert3 0.2490 1.0000 0.2477 0.3750 -0.0812
## zipconvert4 -0.2742 0.2477 1.0000 -0.4130 0.0416
## zipconvert5 -0.4150 0.3750 -0.4130 1.0000 -0.0825
## homeowner -0.0224 -0.0812 0.0416 -0.0825 1.0000
## num_child 0.0307 0.0080 0.0269 -0.0417 -0.0452
## income -0.0022 0.0942 -0.0203 0.0960 -0.3205
## female 0.0108 -0.0235 -0.0226 -0.0108 0.0015
## wealth 0.0425 0.0825 -0.0142 0.0434 -0.0658
## home_value -0.1339 0.2377 -0.1831 0.4546 -0.1187
## med_fam_inc -0.0894 0.0966 -0.0370 0.1835 -0.1383
## avg_fam_inc -0.0792 0.1118 -0.0464 0.1950 -0.1335
## pct_lt15k 0.0495 -0.0320 0.0648 -0.1209 0.1375
## num_prom -0.0337 -0.0440 0.0293 -0.0303 -0.0033
## lifetime_gifts -0.0313 0.0104 0.0312 0.0091 0.0282
## largest_gift -0.0184 -0.0018 -0.0158 0.0281 0.0333
## last_gift -0.0087 0.0507 -0.0309 0.0745 0.0009
## months_since_donate 0.0158 0.0198 0.0253 -0.0179 -0.0214
## time_lag -0.0185 -0.0439 -0.0148 -0.0084 -0.0263
## avg_gift -0.0087 0.0622 -0.0277 0.0811 0.0083
## target 0.0061 -0.0109 0.0053 -0.0211 0.0266
## num_child income female wealth home_value med_fam_inc
## zipconvert2 0.0307 -0.0022 0.0108 0.0425 -0.1339 -0.0894
## zipconvert3 0.0080 0.0942 -0.0235 0.0825 0.2377 0.0966
## zipconvert4 0.0269 -0.0203 -0.0226 -0.0142 -0.1831 -0.0370
## zipconvert5 -0.0417 0.0960 -0.0108 0.0434 0.4546 0.1835
## homeowner -0.0452 -0.3205 0.0015 -0.0658 -0.1187 -0.1383
## num_child 1.0000 0.0919 0.0296 0.0602 -0.0120 0.0470
## income 0.0919 1.0000 0.0438 0.2090 0.2920 0.3675
## female 0.0296 0.0438 1.0000 0.0294 0.0210 0.0235
## wealth 0.0602 0.2090 0.0294 1.0000 0.2612 0.3778
## home_value -0.0120 0.2920 0.0210 0.2612 1.0000 0.7382
## med_fam_inc 0.0470 0.3675 0.0235 0.3778 0.7382 1.0000
## avg_fam_inc 0.0473 0.3786 0.0252 0.3859 0.7526 0.9723
## pct_lt15k -0.0317 -0.2832 -0.0552 -0.3751 -0.3991 -0.6654
## num_prom -0.0864 -0.0690 -0.0382 -0.4121 -0.0645 -0.0508
## lifetime_gifts -0.0510 -0.0196 -0.0371 -0.2255 -0.0241 -0.0352
## largest_gift -0.0176 0.0332 -0.0014 -0.0253 0.0565 0.0470
## last_gift -0.0129 0.1096 0.0464 0.0526 0.1589 0.1360
## months_since_donate -0.0056 0.0772 0.0451 0.0337 0.0234 0.0323
## time_lag -0.0061 -0.0015 0.0080 -0.0664 0.0007 0.0152
## avg_gift -0.0197 0.1241 0.0745 0.0911 0.1688 0.1372
## target 0.0423 -0.0360 0.0248 -0.0031 -0.0216 -0.0080
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## zipconvert2 -0.0792 0.0495 -0.0337 -0.0313 -0.0184
## zipconvert3 0.1118 -0.0320 -0.0440 0.0104 -0.0018
## zipconvert4 -0.0464 0.0648 0.0293 0.0312 -0.0158
## zipconvert5 0.1950 -0.1209 -0.0303 0.0091 0.0281
## homeowner -0.1335 0.1375 -0.0033 0.0282 0.0333
## num_child 0.0473 -0.0317 -0.0864 -0.0510 -0.0176
## income 0.3786 -0.2832 -0.0690 -0.0196 0.0332
## female 0.0252 -0.0552 -0.0382 -0.0371 -0.0014
## wealth 0.3859 -0.3751 -0.4121 -0.2255 -0.0253
## home_value 0.7526 -0.3991 -0.0645 -0.0241 0.0565
## med_fam_inc 0.9723 -0.6654 -0.0508 -0.0352 0.0470
## avg_fam_inc 1.0000 -0.6803 -0.0573 -0.0403 0.0431
## pct_lt15k -0.6803 1.0000 0.0378 0.0596 -0.0079
## num_prom -0.0573 0.0378 1.0000 0.5386 0.1138
## lifetime_gifts -0.0403 0.0596 0.5386 1.0000 0.5073
## largest_gift 0.0431 -0.0079 0.1138 0.5073 1.0000
## last_gift 0.1314 -0.0618 -0.0559 0.2021 0.4472
## months_since_donate 0.0313 -0.0090 -0.2823 -0.1446 0.0198
## time_lag 0.0243 -0.0199 0.1196 0.0385 0.0400
## avg_gift 0.1318 -0.0625 -0.1473 0.1823 0.4748
## target -0.0032 0.0008 -0.0684 -0.0196 0.0178
## last_gift months_since_donate time_lag avg_gift target
## zipconvert2 -0.0087 0.0158 -0.0185 -0.0087 0.0061
## zipconvert3 0.0507 0.0198 -0.0439 0.0622 -0.0109
## zipconvert4 -0.0309 0.0253 -0.0148 -0.0277 0.0053
## zipconvert5 0.0745 -0.0179 -0.0084 0.0811 -0.0211
## homeowner 0.0009 -0.0214 -0.0263 0.0083 0.0266
## num_child -0.0129 -0.0056 -0.0061 -0.0197 0.0423
## income 0.1096 0.0772 -0.0015 0.1241 -0.0360
## female 0.0464 0.0451 0.0080 0.0745 0.0248
## wealth 0.0526 0.0337 -0.0664 0.0911 -0.0031
## home_value 0.1589 0.0234 0.0007 0.1688 -0.0216
## med_fam_inc 0.1360 0.0323 0.0152 0.1372 -0.0080
## avg_fam_inc 0.1314 0.0313 0.0243 0.1318 -0.0032
## pct_lt15k -0.0618 -0.0090 -0.0199 -0.0625 0.0008
## num_prom -0.0559 -0.2823 0.1196 -0.1473 -0.0684
## lifetime_gifts 0.2021 -0.1446 0.0385 0.1823 -0.0196
## largest_gift 0.4472 0.0198 0.0400 0.4748 0.0178
## last_gift 1.0000 0.1867 0.0751 0.8664 0.0777
## months_since_donate 0.1867 1.0000 0.0155 0.1891 0.1338
## time_lag 0.0751 0.0155 1.0000 0.0701 -0.0097
## avg_gift 0.8664 0.1891 0.0701 1.0000 0.0757
## target 0.0777 0.1338 -0.0097 0.0757 1.0000
# pairwise plots of the five highly correlated predictors
# (home_value, med_fam_inc, avg_fam_inc, last_gift, avg_gift)
library(GGally)
ggpairs(fundraising[, c(10, 11, 12, 17, 20)], ggplot2::aes(colour = fundraising$target))
The correlation values and the pairwise plots show that home_value, med_fam_inc, avg_fam_inc, last_gift, and avg_gift are highly correlated: med_fam_inc and avg_fam_inc correlate at 0.97, last_gift and avg_gift at 0.87, and home_value correlates with both family-income measures at roughly 0.74 to 0.75.
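One way to act on this finding (not part of the original analysis) is caret’s findCorrelation(), which suggests columns to drop when pairwise correlations exceed a cutoff:

library(caret)
# indices of predictors to consider dropping when |correlation| > 0.75
high_cor = findCorrelation(cor(fund_num[, -21]), cutoff = 0.75)
names(fund_num[, -21])[high_cor]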
library(colorspace)
# variance of each column (note: this `var` masks base::var)
var = apply(fund_num, 2, var)
round(var, 4)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5
## 0.1694 0.1500 0.1685 0.2368
## homeowner num_child income female
## 0.1768 0.1192 2.6877 0.2379
## wealth home_value med_fam_inc avg_fam_inc
## 6.4859 906581.4713 30183.1011 28528.3742
## pct_lt15k num_prom lifetime_gifts largest_gift
## 146.6433 518.9120 22314.4274 507.0452
## last_gift months_since_donate time_lag avg_gift
## 109.7315 16.7713 31.3798 55.5123
## target
## 0.2501
op = par(no.readonly = TRUE)
plot(var, type = "b", col = rainbow_hcl(10), xlim = c(1, 21), pch = 16, xlab = NA, xaxt='n',ylab="Variance")
axis(side = 1, at=1:21, las = 2, labels = names(fund_num))
abline(h=50, col = "darkgray")
par(op)
Several predictors exceed the reference line at 50, with home_value, med_fam_inc, avg_fam_inc, and lifetime_gifts showing by far the largest variances. This indicates that the data points of these predictors are very spread out.
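Variance is scale-dependent (home_value, for example, is recorded in hundreds of dollars), so these large values partly reflect units. A quick sketch of standardizing, which puts all predictors on a common scale:

# z-score the predictors so each has mean 0 and variance 1
fund_scaled = as.data.frame(scale(fund_num[, -21]))
round(apply(fund_scaled, 2, var), 4)  # every variance is now 1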
library(usdm)
# numeric predictors only (zip dummies, homeowner, female, and target dropped)
fund_nocat = as.data.frame(fundraising[, c(6, 7, 9:20)])
vif = as.data.frame(usdm::vif(fund_nocat))
vif
## Variables VIF
## 1 num_child 1.025019
## 2 income 1.194773
## 3 wealth 1.508819
## 4 home_value 2.493018
## 5 med_fam_inc 18.423616
## 6 avg_fam_inc 20.688945
## 7 pct_lt15k 2.040761
## 8 num_prom 1.962585
## 9 lifetime_gifts 1.994202
## 10 largest_gift 1.715238
## 11 last_gift 4.153071
## 12 months_since_donate 1.145515
## 13 time_lag 1.032467
## 14 avg_gift 4.469569
op = par(no.readonly = TRUE)
plot(vif$VIF, type = "b", col = rainbow_hcl(21), xlim = c(1,14), pch = 16, ylab = "Vif", xlab = NA, xaxt='n')
axis(side = 1, at=1:14, las = 2, labels = names(fund_nocat))
abline(h=5, col = "darkgray")
abline(h=1, col = "lightgreen")
par(op)
med_fam_inc and avg_fam_inc have variance inflation factors above 10, so strong collinearity exists between these predictors and their coefficients are unlikely to appear significant.
Analyzing all the graphical and numerical statistics, the predictors with strong mutual relationships are med_fam_inc, avg_fam_inc, last_gift, and avg_gift. Significant predictors tend to have low or no collinearity, so the candidate significant predictors with VIF close to 1 are: num_child, income, wealth, largest_gift, months_since_donate, and time_lag.
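As a quick check (not shown in the original analysis), dropping one of the collinear pair should bring the remaining VIF back toward 1:

# drop avg_fam_inc, the higher-VIF member of the pair, and recompute;
# med_fam_inc's VIF should fall sharply
usdm::vif(fund_nocat[, names(fund_nocat) != "avg_fam_inc"])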
The original data is partitioned into training and testing sets: the training set holds 80% (2,400 records) and the testing set 20% (600 records) of the original records.
set.seed(12345)
# partition data: 80% train (2,400 rows), 20% test (600 rows)
# note: the stratified split from createDataPartition is immediately
# overwritten by a simple random sample, so the split actually used
# below is not stratified by target
inTrain = caret::createDataPartition(fundraising$target, p = 0.80, list = FALSE)
inTrain = sample(3000, 2400)
train = fundraising[inTrain, ]
test = fundraising[-inTrain, ]
glm.fit = glm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data =train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = target ~ homeowner + num_child + female + income +
## wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts +
## num_prom + largest_gift + last_gift + avg_gift + months_since_donate +
## time_lag + home_value, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7783 -1.1505 0.4798 1.1614 1.7596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.613e+00 5.066e-01 -3.185 0.00145 **
## homeownerNo 8.207e-04 1.054e-01 0.008 0.99379
## num_child 2.499e-01 1.290e-01 1.937 0.05275 .
## femaleNo 2.126e-02 8.586e-02 0.248 0.80444
## income -6.796e-02 2.915e-02 -2.331 0.01973 *
## wealth -2.365e-02 1.995e-02 -1.185 0.23588
## med_fam_inc -1.143e-03 1.027e-03 -1.112 0.26601
## avg_fam_inc 1.257e-03 1.126e-03 1.116 0.26440
## pct_lt15k -1.532e-03 4.876e-03 -0.314 0.75340
## lifetime_gifts 1.283e-04 3.808e-04 0.337 0.73612
## num_prom -3.802e-03 2.564e-03 -1.483 0.13807
## largest_gift -1.618e-03 2.991e-03 -0.541 0.58854
## last_gift 1.302e-02 8.651e-03 1.504 0.13246
## avg_gift 1.322e-02 1.282e-02 1.031 0.30243
## months_since_donate 5.508e-02 1.136e-02 4.851 1.23e-06 ***
## time_lag -3.917e-03 7.499e-03 -0.522 0.60143
## home_value -1.120e-04 7.003e-05 -1.599 0.10991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.1 on 2399 degrees of freedom
## Residual deviance: 3247.8 on 2383 degrees of freedom
## AIC: 3281.8
##
## Number of Fisher Scoring iterations: 4
glm.probs = predict(glm.fit, test, type = "response")
# note: with factor levels c("Donor", "No Donor"), glm models
# P(target == "No Donor"), so assigning "Donor" when the probability
# exceeds 0.5 likely inverts the labels; this would explain the
# below-chance accuracy reported below
glm.preds = as.factor(ifelse(glm.probs > 0.5, "Donor", "No Donor"))
car::vif(glm.fit)
## homeowner num_child female income
## 1.134218 1.030344 1.017859 1.317792
## wealth med_fam_inc avg_fam_inc pct_lt15k
## 1.506162 18.084577 20.461406 2.020462
## lifetime_gifts num_prom largest_gift last_gift
## 1.995230 1.872537 1.842279 3.607303
## avg_gift months_since_donate time_lag home_value
## 3.918775 1.123660 1.031667 2.495922
#accuracy
mean(glm.preds == test$target)
## [1] 0.43
The model correctly predicts only 43% of the test observations, below chance and consistent with the possible label inversion noted in the code above.
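Overall accuracy hides the direction of the errors. A confusion matrix (a quick sketch, not in the original output) would make any label inversion visible:

# rows = predicted class, columns = actual class
table(predicted = glm.preds, actual = test$target)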
Next we fit a model with only the significant predictors and check the accuracy.

### Final Model
glm.fit=glm(target ~ num_child + income + months_since_donate, data =train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = target ~ num_child + income + months_since_donate,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6180 -1.1617 0.8121 1.1633 1.7263
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.18516 0.36870 -5.927 3.09e-09 ***
## num_child 0.25028 0.12754 1.962 0.04973 *
## income -0.07084 0.02553 -2.775 0.00552 **
## months_since_donate 0.07036 0.01076 6.539 6.20e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.1 on 2399 degrees of freedom
## Residual deviance: 3274.6 on 2396 degrees of freedom
## AIC: 3282.6
##
## Number of Fisher Scoring iterations: 4
glm.probs = predict(glm.fit, test, type = "response")
glm.preds = as.factor(ifelse(glm.probs > 0.5, "Donor", "No Donor"))
car::vif(glm.fit)
## num_child income months_since_donate
## 1.014607 1.023382 1.009532
mean(glm.preds == test$target)
## [1] 0.4366667
The final model uses the predictors num_child, income, and months_since_donate. It correctly predicts 43.67% of the test data, slightly better than the full logistic regression model. The VIFs are all close to one, supporting our assumption that significant predictors usually show little collinearity.
library(class)
# training matrix
train.x = as.matrix(train[, c(6, 7, 9:20)])
# testing matrix
test.x = as.matrix(test[, c(6, 7, 9:20)])
set.seed(12345)
knn.preds = knn(train.x, test.x, train$target , k = 10)
mean(knn.preds == test$target)
## [1] 0.5033333
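As a side check (not part of the original analysis), one could scan a range of k values; note that choosing k on the test set is optimistic, so in practice k would be tuned by cross-validation on the training data:

# test accuracy for k = 1..25 (illustrative only)
acc = sapply(1:25, function(k) {
  set.seed(12345)
  mean(knn(train.x, test.x, train$target, k = k) == test$target)
})
which.max(acc)  # k with the highest test accuracy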
The KNN model with all numeric predictors has an accuracy of 50.33%. Using the significant variables from the logistic regression, we fit a new KNN model.

### Final Model
# training matrix
train.x = as.matrix(train[, c(6, 7, 18)])
# testing matrix
test.x = as.matrix(test[, c(6, 7, 18)])
set.seed(12345)
knn.preds = knn(train.x, test.x, train$target , k = 10)
mean(knn.preds == test$target)
## [1] 0.5366667
The final KNN model correctly predicts 53.67% of the test data.
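KNN is distance-based, so predictors measured on large scales dominate the Euclidean distance. A variant worth trying (an assumption on our part, not one of the original models) standardizes both matrices using the training means and standard deviations:

# scale the test matrix with the *training* statistics to avoid leakage
ctr = colMeans(train.x)
sds = apply(train.x, 2, sd)
train.s = scale(train.x, center = ctr, scale = sds)
test.s  = scale(test.x,  center = ctr, scale = sds)
set.seed(12345)
knn.scaled = knn(train.s, test.s, train$target, k = 10)
mean(knn.scaled == test$target)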
library(e1071)
# cross validation
tune.rad = tune(svm, target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data=train, kernel ="radial", ranges =list(cost=seq(0.01, 5, 0.333)))
op = tune.rad$best.parameters$cost
# svm radial
svm.rad = svm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, kernel = "radial", data = train, cost = op)
summary(svm.rad)
##
## Call:
## svm(formula = target ~ homeowner + num_child + female + income +
## wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts +
## num_prom + largest_gift + last_gift + avg_gift + months_since_donate +
## time_lag + home_value, data = train, kernel = "radial", cost = op)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.342
##
## Number of Support Vectors: 2197
##
## ( 1100 1097 )
##
##
## Number of Classes: 2
##
## Levels:
## Donor No Donor
svm.preds=predict(svm.rad, test)
mean(svm.preds == test$target)
## [1] 0.56
The model correctly predicts 56% of the test data. Now we check the accuracy using only the significant predictors.

### Final Model
# cross validation
tune.rad = tune(svm, target ~ num_child + income + months_since_donate, data=train, kernel ="radial", ranges =list(cost=seq(0.01, 5, 0.333)))
op = tune.rad$best.parameters$cost
# svm radial
svm.rad = svm(target ~ num_child + income + months_since_donate, kernel = "radial", data = train, cost = op)
summary(svm.rad)
##
## Call:
## svm(formula = target ~ num_child + income + months_since_donate,
## data = train, kernel = "radial", cost = op)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.676
##
## Number of Support Vectors: 2205
##
## ( 1105 1100 )
##
##
## Number of Classes: 2
##
## Levels:
## Donor No Donor
svm.preds=predict(svm.rad, test)
mean(svm.preds == test$target)
## [1] 0.57
This correctly predicts 57% of the test data.
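The radial kernel also has a gamma parameter, which tune() left at its default here. A wider grid search (a sketch, not part of the original analysis) would tune cost and gamma jointly:

# joint grid over cost and gamma; the candidate values are illustrative
tune.grid = tune(svm, target ~ num_child + income + months_since_donate,
                 data = train, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10),
                               gamma = c(0.01, 0.1, 1)))
tune.grid$best.parameters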
Several types of classification models were tried on the data and their accuracies checked: logistic regression, LDA, QDA, KNN, SVM with a linear kernel, and SVM with a radial kernel. Of these, the logistic regression, KNN, and radial SVM models are shown in this report.
Predictors with high VIF were excluded from the models, as were insignificant predictors and predictors that made no difference in accuracy.
Logistic regression performed the worst of all the models, with a prediction accuracy of 43.67%. However, it pointed to possible significant predictors that were then used in the other models. The KNN model had a prediction accuracy of 53.67%, much better than logistic regression, and the SVM with a radial kernel gave the best prediction accuracy at 57%.
After performing all the classification analyses, the best model was the SVM with a radial kernel. The tune() function was used to perform cross-validation, and the cost value with the lowest cross-validation error rate, 0.676, was used to fit the final model. This model correctly classified 57% of the test observations.
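The selected parameter and its cross-validation error can be read directly from the tune object:

tune.rad$best.parameters   # cost value with the lowest CV error
tune.rad$best.performance  # the corresponding cross-validation error rate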
# score the future mailing list with the final radial SVM and export
future_value = predict(svm.rad, future_fundraising)
write.table(future_value, file = "Predictions.csv", col.names = c("value"), sep = ",", row.names = F, quote = F)
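A quick sanity check on the exported predictions (a sketch, not part of the original script):

table(future_value)                # class balance of the predicted labels
head(read.csv("Predictions.csv"))  # confirm the exported file reads back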