We are developing a predictive model for the American Legion to improve cost efficiency in their direct-mail marketing campaigns. The ultimate goal is to predict which of the donors should be sent a campaign letter in the mail.
The American Legion provided the data consisting of 3,000 records. About 50% of those are donors and the other 50% are non-donors. Weighted sampling was used to avoid imbalance created by different distributions.
The variables in the provided data are listed below with their descriptions. (This information was provided and pulled from the assignment guidelines.)
The provided data sets were loaded into the model using the following code:
fundraising <- read_rds("C:/Users/lacyb/Documents/2022 Spring/STA 6543 - Algorithms II/Final Project/fundraising.rds")
future.fundraising <- read_rds("C:/Users/lacyb/Documents/2022 Spring/STA 6543 - Algorithms II/Final Project/future_fundraising.rds")
The variables in the data needed to be converted to numeric values. All predictors were converted using the following code:
fund.num <- fundraising
fund.num <- as.data.frame(sapply(fund.num[, c(1:21)], as.numeric))
Several methods of exploring the data have been used. First, we create a correlation matrix. Then, we create pairwise plots. Finally we check variance and collinearity.
The following was created as a correlation matrix:
round(cor(fund.num), 4)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner
## zipconvert2 1.0000 0.2490 -0.2742 -0.4150 -0.0224
## zipconvert3 0.2490 1.0000 0.2477 0.3750 -0.0812
## zipconvert4 -0.2742 0.2477 1.0000 -0.4130 0.0416
## zipconvert5 -0.4150 0.3750 -0.4130 1.0000 -0.0825
## homeowner -0.0224 -0.0812 0.0416 -0.0825 1.0000
## num_child 0.0307 0.0080 0.0269 -0.0417 -0.0452
## income -0.0022 0.0942 -0.0203 0.0960 -0.3205
## female 0.0108 -0.0235 -0.0226 -0.0108 0.0015
## wealth 0.0425 0.0825 -0.0142 0.0434 -0.0658
## home_value -0.1339 0.2377 -0.1831 0.4546 -0.1187
## med_fam_inc -0.0894 0.0966 -0.0370 0.1835 -0.1383
## avg_fam_inc -0.0792 0.1118 -0.0464 0.1950 -0.1335
## pct_lt15k 0.0495 -0.0320 0.0648 -0.1209 0.1375
## num_prom -0.0337 -0.0440 0.0293 -0.0303 -0.0033
## lifetime_gifts -0.0313 0.0104 0.0312 0.0091 0.0282
## largest_gift -0.0184 -0.0018 -0.0158 0.0281 0.0333
## last_gift -0.0087 0.0507 -0.0309 0.0745 0.0009
## months_since_donate 0.0158 0.0198 0.0253 -0.0179 -0.0214
## time_lag -0.0185 -0.0439 -0.0148 -0.0084 -0.0263
## avg_gift -0.0087 0.0622 -0.0277 0.0811 0.0083
## target 0.0061 -0.0109 0.0053 -0.0211 0.0266
## num_child income female wealth home_value med_fam_inc
## zipconvert2 0.0307 -0.0022 0.0108 0.0425 -0.1339 -0.0894
## zipconvert3 0.0080 0.0942 -0.0235 0.0825 0.2377 0.0966
## zipconvert4 0.0269 -0.0203 -0.0226 -0.0142 -0.1831 -0.0370
## zipconvert5 -0.0417 0.0960 -0.0108 0.0434 0.4546 0.1835
## homeowner -0.0452 -0.3205 0.0015 -0.0658 -0.1187 -0.1383
## num_child 1.0000 0.0919 0.0296 0.0602 -0.0120 0.0470
## income 0.0919 1.0000 0.0438 0.2090 0.2920 0.3675
## female 0.0296 0.0438 1.0000 0.0294 0.0210 0.0235
## wealth 0.0602 0.2090 0.0294 1.0000 0.2612 0.3778
## home_value -0.0120 0.2920 0.0210 0.2612 1.0000 0.7382
## med_fam_inc 0.0470 0.3675 0.0235 0.3778 0.7382 1.0000
## avg_fam_inc 0.0473 0.3786 0.0252 0.3859 0.7526 0.9723
## pct_lt15k -0.0317 -0.2832 -0.0552 -0.3751 -0.3991 -0.6654
## num_prom -0.0864 -0.0690 -0.0382 -0.4121 -0.0645 -0.0508
## lifetime_gifts -0.0510 -0.0196 -0.0371 -0.2255 -0.0241 -0.0352
## largest_gift -0.0176 0.0332 -0.0014 -0.0253 0.0565 0.0470
## last_gift -0.0129 0.1096 0.0464 0.0526 0.1589 0.1360
## months_since_donate -0.0056 0.0772 0.0451 0.0337 0.0234 0.0323
## time_lag -0.0061 -0.0015 0.0080 -0.0664 0.0007 0.0152
## avg_gift -0.0197 0.1241 0.0745 0.0911 0.1688 0.1372
## target 0.0423 -0.0360 0.0248 -0.0031 -0.0216 -0.0080
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## zipconvert2 -0.0792 0.0495 -0.0337 -0.0313 -0.0184
## zipconvert3 0.1118 -0.0320 -0.0440 0.0104 -0.0018
## zipconvert4 -0.0464 0.0648 0.0293 0.0312 -0.0158
## zipconvert5 0.1950 -0.1209 -0.0303 0.0091 0.0281
## homeowner -0.1335 0.1375 -0.0033 0.0282 0.0333
## num_child 0.0473 -0.0317 -0.0864 -0.0510 -0.0176
## income 0.3786 -0.2832 -0.0690 -0.0196 0.0332
## female 0.0252 -0.0552 -0.0382 -0.0371 -0.0014
## wealth 0.3859 -0.3751 -0.4121 -0.2255 -0.0253
## home_value 0.7526 -0.3991 -0.0645 -0.0241 0.0565
## med_fam_inc 0.9723 -0.6654 -0.0508 -0.0352 0.0470
## avg_fam_inc 1.0000 -0.6803 -0.0573 -0.0403 0.0431
## pct_lt15k -0.6803 1.0000 0.0378 0.0596 -0.0079
## num_prom -0.0573 0.0378 1.0000 0.5386 0.1138
## lifetime_gifts -0.0403 0.0596 0.5386 1.0000 0.5073
## largest_gift 0.0431 -0.0079 0.1138 0.5073 1.0000
## last_gift 0.1314 -0.0618 -0.0559 0.2021 0.4472
## months_since_donate 0.0313 -0.0090 -0.2823 -0.1446 0.0198
## time_lag 0.0243 -0.0199 0.1196 0.0385 0.0400
## avg_gift 0.1318 -0.0625 -0.1473 0.1823 0.4748
## target -0.0032 0.0008 -0.0684 -0.0196 0.0178
## last_gift months_since_donate time_lag avg_gift target
## zipconvert2 -0.0087 0.0158 -0.0185 -0.0087 0.0061
## zipconvert3 0.0507 0.0198 -0.0439 0.0622 -0.0109
## zipconvert4 -0.0309 0.0253 -0.0148 -0.0277 0.0053
## zipconvert5 0.0745 -0.0179 -0.0084 0.0811 -0.0211
## homeowner 0.0009 -0.0214 -0.0263 0.0083 0.0266
## num_child -0.0129 -0.0056 -0.0061 -0.0197 0.0423
## income 0.1096 0.0772 -0.0015 0.1241 -0.0360
## female 0.0464 0.0451 0.0080 0.0745 0.0248
## wealth 0.0526 0.0337 -0.0664 0.0911 -0.0031
## home_value 0.1589 0.0234 0.0007 0.1688 -0.0216
## med_fam_inc 0.1360 0.0323 0.0152 0.1372 -0.0080
## avg_fam_inc 0.1314 0.0313 0.0243 0.1318 -0.0032
## pct_lt15k -0.0618 -0.0090 -0.0199 -0.0625 0.0008
## num_prom -0.0559 -0.2823 0.1196 -0.1473 -0.0684
## lifetime_gifts 0.2021 -0.1446 0.0385 0.1823 -0.0196
## largest_gift 0.4472 0.0198 0.0400 0.4748 0.0178
## last_gift 1.0000 0.1867 0.0751 0.8664 0.0777
## months_since_donate 0.1867 1.0000 0.0155 0.1891 0.1338
## time_lag 0.0751 0.0155 1.0000 0.0701 -0.0097
## avg_gift 0.8664 0.1891 0.0701 1.0000 0.0757
## target 0.0777 0.1338 -0.0097 0.0757 1.0000
The following plot was created:
ggpairs(fundraising[, c(10, 11, 12, 17, 20)], ggplot2::aes(colour = fundraising$target))
From the Correlation Matrix and Pairwise Plots, we can see that home_value, med_fam_inc, avg_fam_inc, last_gift, and avg_gift are all very highly correlated.
We used the following code to calculate variance:
var <- apply(fund.num, 2, var)
round(var, 4)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5
## 0.1694 0.1500 0.1685 0.2368
## homeowner num_child income female
## 0.1768 0.1192 2.6877 0.2379
## wealth home_value med_fam_inc avg_fam_inc
## 6.4859 906581.4713 30183.1011 28528.3742
## pct_lt15k num_prom lifetime_gifts largest_gift
## 146.6433 518.9120 22314.4274 507.0452
## last_gift months_since_donate time_lag avg_gift
## 109.7315 16.7713 31.3798 55.5123
## target
## 0.2501
We then plotted the variance to produce the following graph:
op <- par(no.readonly = TRUE)
plot(var, type = "b", col = rainbow_hcl(10), xlim = c(1, 21), pch = 16, xlab = NA, xaxt='n',ylab="Variance")
axis(side = 1, at=1:21, las = 2, labels = names(fund.num))
abline(h=50, col = "darkgray")
We used the following code to check collinearity:
fund.nocat <- as.data.frame(fundraising[, c(6,7,9:20)])
vif <- as.data.frame(usdm:: vif(fund.nocat))
vif
## Variables VIF
## 1 num_child 1.025019
## 2 income 1.194773
## 3 wealth 1.508819
## 4 home_value 2.493018
## 5 med_fam_inc 18.423616
## 6 avg_fam_inc 20.688945
## 7 pct_lt15k 2.040761
## 8 num_prom 1.962585
## 9 lifetime_gifts 1.994202
## 10 largest_gift 1.715238
## 11 last_gift 4.153071
## 12 months_since_donate 1.145515
## 13 time_lag 1.032467
## 14 avg_gift 4.469569
We then plotted the collinearity to produce the following graph:
op <- par(no.readonly = TRUE)
plot(vif$VIF, type = "b", col = rainbow_hcl(21), xlim = c(1,14), pch = 16, ylab = "Vif", xlab = NA, xaxt='n')
axis(side = 1, at=1:14, las = 2, labels = names(fund.nocat))
abline(h=5, col = "blue")
abline(h=1, col = "red")
The variables med_fam_inc and avg_fam_inc have variance inflation of over 10. This means colinearity exists and that these predictors are highly unlikely to be significant. Based on collinearity values that are low, the following predictors are significant: * med_fam_inc * avg_fam_inc * last_gift * avg_gift Based on VIF values close to 1, the following predictors are possibly significant: * num_child * income * wealth * largest_gift * month_since_donate * time_lag
We partition the original data into two sets. The training set has 80% of the records while the testing set has the remaining 20%. The following code was used to partition the data:
set.seed(12345)
inTrain <- caret::createDataPartition(fundraising$target, p = 0.8, list = FALSE)
inTrain <- sample(3000, 2400)
train <- fundraising[inTrain, ]
test <- fundraising[-inTrain, ]
The classification models used in this analysis are Logistic Regression, K-Nearest Neighbors, and Support Vector Machine with Radial as the Kernel.
Using the following chunks of code, we fit a logistic regression model to the data.
glm.fit = glm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = target ~ homeowner + num_child + female + income +
## wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts +
## num_prom + largest_gift + last_gift + avg_gift + months_since_donate +
## time_lag + home_value, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7783 -1.1505 0.4798 1.1614 1.7596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.613e+00 5.066e-01 -3.185 0.00145 **
## homeownerNo 8.207e-04 1.054e-01 0.008 0.99379
## num_child 2.499e-01 1.290e-01 1.937 0.05275 .
## femaleNo 2.126e-02 8.586e-02 0.248 0.80444
## income -6.796e-02 2.915e-02 -2.331 0.01973 *
## wealth -2.365e-02 1.995e-02 -1.185 0.23588
## med_fam_inc -1.143e-03 1.027e-03 -1.112 0.26601
## avg_fam_inc 1.257e-03 1.126e-03 1.116 0.26440
## pct_lt15k -1.532e-03 4.876e-03 -0.314 0.75340
## lifetime_gifts 1.283e-04 3.808e-04 0.337 0.73612
## num_prom -3.802e-03 2.564e-03 -1.483 0.13807
## largest_gift -1.618e-03 2.991e-03 -0.541 0.58854
## last_gift 1.302e-02 8.651e-03 1.504 0.13246
## avg_gift 1.322e-02 1.282e-02 1.031 0.30243
## months_since_donate 5.508e-02 1.136e-02 4.851 1.23e-06 ***
## time_lag -3.917e-03 7.499e-03 -0.522 0.60143
## home_value -1.120e-04 7.003e-05 -1.599 0.10991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.1 on 2399 degrees of freedom
## Residual deviance: 3247.8 on 2383 degrees of freedom
## AIC: 3281.8
##
## Number of Fisher Scoring iterations: 4
glm.prob <- predict(glm.fit, test, type = "response")
glm.pred <- as.factor(ifelse(glm.prob > 0.5, "Donor", "No Donor"))
car::vif(glm.fit)
## homeowner num_child female income
## 1.134218 1.030344 1.017859 1.317792
## wealth med_fam_inc avg_fam_inc pct_lt15k
## 1.506162 18.084577 20.461406 2.020462
## lifetime_gifts num_prom largest_gift last_gift
## 1.995230 1.872537 1.842279 3.607303
## avg_gift months_since_donate time_lag home_value
## 3.918775 1.123660 1.031667 2.495922
To check the accuracy of the model, we used the following code:
mean(glm.pred == test$target)
## [1] 0.43
This model can accurately predict only 43% of the data.
Next, we fit a model with the significant predictors to improve the accuracy. This becomes our Final Model. The following code was used:
glm.fi <- glm(target ~ num_child + income + months_since_donate, data =train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = target ~ homeowner + num_child + female + income +
## wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts +
## num_prom + largest_gift + last_gift + avg_gift + months_since_donate +
## time_lag + home_value, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7783 -1.1505 0.4798 1.1614 1.7596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.613e+00 5.066e-01 -3.185 0.00145 **
## homeownerNo 8.207e-04 1.054e-01 0.008 0.99379
## num_child 2.499e-01 1.290e-01 1.937 0.05275 .
## femaleNo 2.126e-02 8.586e-02 0.248 0.80444
## income -6.796e-02 2.915e-02 -2.331 0.01973 *
## wealth -2.365e-02 1.995e-02 -1.185 0.23588
## med_fam_inc -1.143e-03 1.027e-03 -1.112 0.26601
## avg_fam_inc 1.257e-03 1.126e-03 1.116 0.26440
## pct_lt15k -1.532e-03 4.876e-03 -0.314 0.75340
## lifetime_gifts 1.283e-04 3.808e-04 0.337 0.73612
## num_prom -3.802e-03 2.564e-03 -1.483 0.13807
## largest_gift -1.618e-03 2.991e-03 -0.541 0.58854
## last_gift 1.302e-02 8.651e-03 1.504 0.13246
## avg_gift 1.322e-02 1.282e-02 1.031 0.30243
## months_since_donate 5.508e-02 1.136e-02 4.851 1.23e-06 ***
## time_lag -3.917e-03 7.499e-03 -0.522 0.60143
## home_value -1.120e-04 7.003e-05 -1.599 0.10991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.1 on 2399 degrees of freedom
## Residual deviance: 3247.8 on 2383 degrees of freedom
## AIC: 3281.8
##
## Number of Fisher Scoring iterations: 4
glm.prob <- predict(glm.fit, test, type = "response")
glm.pred <- as.factor(ifelse(glm.prob > 0.5, "Donor", "No Donor"))
car::vif(glm.fit)
## homeowner num_child female income
## 1.134218 1.030344 1.017859 1.317792
## wealth med_fam_inc avg_fam_inc pct_lt15k
## 1.506162 18.084577 20.461406 2.020462
## lifetime_gifts num_prom largest_gift last_gift
## 1.995230 1.872537 1.842279 3.607303
## avg_gift months_since_donate time_lag home_value
## 3.918775 1.123660 1.031667 2.495922
To check the accuracy of the new model, we used the following code:
mean(glm.pred == test$target)
## [1] 0.43
This new model can accurately predict only 43.67% of the data. This is slightly better than the original model we created.
Using the following chunks of code, we fit a KNN model to the data.
# training matrix
train.x <- as.matrix(train[, c(6, 7, 9:20)])
# testing matrix
test.x <- as.matrix(test[, c(6, 7, 9:20)])
set.seed(12345)
knn.pred <- knn(train.x, test.x, train$target , k = 10)
# check accuracy
mean(knn.pred == test$target)
## [1] 0.5033333
This produced an accuracy rate of 50.33%. Now we will fit a new model with the significant predictors to become our Final Model. The following code was used:
# training matrix
train.x <- as.matrix(train[, c(6, 7, 18)])
# testing matrix
test.x <- as.matrix(test[, c(6, 7, 18)])
set.seed(12345)
knn.pred <- knn(train.x, test.x, train$target , k = 10)
mean(knn.pred == test$target)
## [1] 0.5366667
This produced an accuracy rate of 53.66% which is higher than the original KNN model.
Using the following chunks of code, we fit a SVM model to the data using Radial as the Kernel.
# cross validation
tune.rad <- tune(svm, target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, data = train, kernel = "radial", ranges = list(cost=seq(0.01, 5, 0.333)))
op <- tune.rad$best.parameters$cost
# svm radial
svm.rad <- svm(target ~ homeowner + num_child + female + income + wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts + num_prom + largest_gift + last_gift + avg_gift + months_since_donate + time_lag + home_value, kernel = "radial", data = train, cost = op)
summary(svm.rad)
##
## Call:
## svm(formula = target ~ homeowner + num_child + female + income +
## wealth + med_fam_inc + avg_fam_inc + pct_lt15k + lifetime_gifts +
## num_prom + largest_gift + last_gift + avg_gift + months_since_donate +
## time_lag + home_value, data = train, kernel = "radial", cost = op)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.342
##
## Number of Support Vectors: 2197
##
## ( 1100 1097 )
##
##
## Number of Classes: 2
##
## Levels:
## Donor No Donor
svm.pred <- predict(svm.rad, test)
mean(svm.pred == test$target)
## [1] 0.56
This model produced a 56% accuracy rate. We will now fit a new model with the significant predictors that will become our Final Model. The following code was used:
# cross validation
tune.rad <- tune(svm, target ~ num_child + income + months_since_donate, data = train, kernel = "radial", ranges = list(cost = seq(0.01, 5, 0.333)))
op = tune.rad$best.parameters$cost
# svm radial
svm.rad = svm(target ~ num_child + income + months_since_donate, kernel = "radial", data = train, cost = op)
summary(svm.rad)
##
## Call:
## svm(formula = target ~ num_child + income + months_since_donate,
## data = train, kernel = "radial", cost = op)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.676
##
## Number of Support Vectors: 2205
##
## ( 1105 1100 )
##
##
## Number of Classes: 2
##
## Levels:
## Donor No Donor
svm.pred <- predict(svm.rad, test)
mean(svm.pred == test$target)
## [1] 0.57
This new model produced a 57% accuracy rate; 1% better than the first.
We have performed Logistic Regression, K-Nearest Neighbors, and Support Vector Machine (with Radial as the Kernel) models on this data. These were the three most suitable for the data.
Through analysis, we identified variables that were insignificant (they made no difference in accuracy) and variables that had high VIF values. These two types of variables were excluded from the models for accuracy.
SVM with Radial Kernel had the best performance of the three models demonstrated. It produced a 57% accuracy rate. KNN had the second best accuracy rate at 53%. Logistic Regression performed the worst with only a 43% accuracy rate.
Our best model was SVM where we used the .tune() function for cross-validation. The lowest cost parameter determined was 0.34.
The following code has been used to test the model:
future.value = predict(svm.rad, future.fundraising)
To create a final CSV file, the following code was used:
write.table(future.value, file="Future Value Predictions.csv", col.names = c("value"),sep = ",", row.names = F, quote = F)