A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that effectively identifies donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
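Before modeling, it helps to make the economics explicit. A quick sketch of the break-even response probability implied by the figures above (the $13.00 average donation and $0.68 mailing cost come from the mailing records; the cutoff itself is just arithmetic):
avg_donation <- 13.00   # average donation among responders
mail_cost    <- 0.68    # cost to produce and send one mailing
# expected net profit per mailing at response probability p: p * avg_donation - mail_cost
break_even <- mail_cost / avg_donation   # mail only when P(donate) exceeds this cutoff
break_even                               # approximately 0.0523, i.e. about 5.2%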
funds <- readRDS("/Users/moniquevillarreal/Desktop/fundraising.rds")
ffunds <- readRDS("/Users/moniquevillarreal/Desktop/future_fundraising.rds")
any(is.na(funds))
## [1] FALSE
any(is.na(ffunds))
## [1] FALSE
str(funds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3000 obs. of 21 variables:
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num 698 828 1471 547 482 ...
## $ med_fam_inc : num 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
| Numerical Values | Categorical Values |
|---|---|
| num_child | zipconvert2 |
| income | zipconvert3 |
| wealth | zipconvert4 |
| home_value | zipconvert5 |
| med_fam_inc | homeowner |
| avg_fam_inc | female |
| pct_lt15k | target |
| num_prom | |
| lifetime_gifts | |
| largest_gift | |
| last_gift | |
| months_since_donate | |
| time_lag | |
| avg_gift | |
The categorical variables will later need to be converted to numerical codes for the correlation review. The income and wealth variables, although stored as numeric, represent coded ranges rather than true numerical values, so we treat them as factors in both the current and future fundraising datasets.
funds$income = as.factor(funds$income)
funds$wealth = as.factor(funds$wealth)
# for the future fundraising
ffunds$income = as.factor(ffunds$income)
ffunds$wealth = as.factor(ffunds$wealth)
summary(funds)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
##
## income female wealth home_value med_fam_inc
## 1: 272 Yes:1831 8 :1630 Min. : 0.0 Min. : 0.0
## 2: 448 No :1169 9 : 184 1st Qu.: 554.8 1st Qu.: 278.0
## 3: 281 5 : 181 Median : 816.5 Median : 355.0
## 4:1015 7 : 172 Mean :1143.3 Mean : 388.4
## 5: 512 6 : 155 3rd Qu.:1341.2 3rd Qu.: 465.0
## 6: 238 3 : 154 Max. :5945.0 Max. :1500.0
## 7: 234 (Other): 524
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
##
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
##
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
##
To estimate the out-of-sample error, we can either partition the dataset into 80% training and 20% validation sets or use cross-validation; here we partition, setting the seed to 1234.
set.seed(1234)
# 80/20 split: sample row indices for training, hold out the rest for validation
train.index <- sample(1:nrow(funds), round(nrow(funds) * 0.8))
f.train <- funds[train.index, ]
f.test <- funds[-train.index, ]
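Alternatively, the out-of-sample error could be estimated with k-fold cross-validation; a minimal sketch using caret's trainControl (10 folds assumed), which could be supplied to the train() calls below:
library(caret)
set.seed(1234)
# 10-fold cross-validation control object; pass it via trControl = cv.ctrl in train()
cv.ctrl <- trainControl(method = "cv", number = 10)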
Next, we build and evaluate several models and choose the best one. We first convert the predictors into numerical values to review their correlations.
# coerce all 21 columns to numeric (factor levels become integer codes) for the correlation plot
fundsnum <- as.data.frame(sapply(funds[, c(1:21)], as.numeric))
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(fundsnum))
The strongest correlations appear among the following predictors: income, med_fam_inc, home_value, avg_fam_inc, avg_gift, lifetime_gifts, largest_gift, and last_gift.
pairs(~income + med_fam_inc + home_value + avg_fam_inc + avg_gift + lifetime_gifts + largest_gift + last_gift, data = fundsnum)
# subset of correlated predictors, selected by column index, for a closer look
sub_funds <- fundsnum[, c(7, 10, 11, 12, 15, 16, 19)]
corrplot(cor(sub_funds), method = "square", type = "upper", tl.col = "black", tl.cex = 2, col = colorRampPalette(c("orange", "blue"))(100))
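As a complementary check, caret's findCorrelation() can flag predictors whose pairwise correlations exceed a chosen cutoff (the 0.75 cutoff here is an assumption, not taken from the analysis above):
library(caret)
# names of predictors that could be dropped to reduce collinearity
high.cor <- findCorrelation(cor(sub_funds), cutoff = 0.75, names = TRUE)
high.cor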
Next, a random forest is fit to help determine the most important variables.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
set.seed(1234)
# random forest (caret's default method) on the numeric-coded data, used here to rank variable importance
rf.funds <- train(target ~ ., data = fundsnum)
plot(varImp(rf.funds), top = 15)
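The same importance scores can also be inspected as a table rather than a plot; a small sketch, assuming the single Overall column that caret reports for random forest importance:
imp <- varImp(rf.funds)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 15)   # top 15 predictors by importance score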
In the logistic regression fit below, the smallest p-value is associated with months_since_donate.
glm.fundfit <- glm(target~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, family = binomial)
summary(glm.fundfit)
##
## Call:
## glm(formula = target ~ avg_gift + home_value + lifetime_gifts +
## med_fam_inc + avg_fam_inc + num_prom + pct_lt15k + months_since_donate +
## time_lag, family = binomial, data = f.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6537 -1.1640 0.8116 1.1582 1.6788
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6769101 0.4370229 -3.837 0.000124 ***
## avg_gift 0.0138409 0.0065222 2.122 0.033827 *
## home_value -0.0001699 0.0000693 -2.451 0.014232 *
## lifetime_gifts 0.0001501 0.0003447 0.436 0.663156
## med_fam_inc -0.0008947 0.0010777 -0.830 0.406436
## avg_fam_inc 0.0012922 0.0011555 1.118 0.263462
## num_prom -0.0036540 0.0023567 -1.550 0.121036
## pct_lt15k 0.0004583 0.0047939 0.096 0.923842
## months_since_donate 0.0564748 0.0110991 5.088 3.61e-07 ***
## time_lag -0.0122793 0.0075186 -1.633 0.102428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3266.8 on 2390 degrees of freedom
## AIC: 3286.8
##
## Number of Fisher Scoring iterations: 4
The logistic regression model achieves 48% accuracy on the test set.
library(caret)
pred.glm <- predict(glm.fundfit, f.test)   # returns the linear predictor (log-odds scale) by default
classify <- ifelse(pred.glm >= 0.5, "Donor", "No Donor")
confusionMatrix(as.factor(classify), f.test$target, positive = "Donor")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 11 13
## No Donor 294 282
##
## Accuracy : 0.4883
## 95% CI : (0.4476, 0.5291)
## No Information Rate : 0.5083
## P-Value [Acc > NIR] : 0.8463
##
## Kappa : -0.0079
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.03607
## Specificity : 0.95593
## Pos Pred Value : 0.45833
## Neg Pred Value : 0.48958
## Prevalence : 0.50833
## Detection Rate : 0.01833
## Detection Prevalence : 0.04000
## Balanced Accuracy : 0.49600
##
## 'Positive' Class : Donor
##
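Two caveats about the prediction step above: predict() on a glm object returns the linear predictor unless type = "response" is specified, and because "No Donor" is the second level of the target factor, the model estimates P(No Donor) rather than P(Donor). A sketch of the prediction step with both taken into account (the resulting confusion matrix would differ from the one shown above):
prob.nodonor <- predict(glm.fundfit, f.test, type = "response")    # P(target = "No Donor")
classify2 <- ifelse(1 - prob.nodonor >= 0.5, "Donor", "No Donor")  # classify on P(Donor)
confusionMatrix(factor(classify2, levels = levels(f.test$target)), f.test$target, positive = "Donor")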
The QDA model achieves a 52% accuracy rate on the test set.
set.seed(1234)
qda.fit <- train (target ~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, method = "qda")
qda.fit
## Quadratic Discriminant Analysis
##
## 2400 samples
## 9 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5101303 0.02286336
Predicting on the test set
qda.probs <- predict(qda.fit, f.test)
confusionMatrix(qda.probs, f.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 259 241
## No Donor 46 54
##
## Accuracy : 0.5217
## 95% CI : (0.4808, 0.5623)
## No Information Rate : 0.5083
## P-Value [Acc > NIR] : 0.2702
##
## Kappa : 0.0326
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8492
## Specificity : 0.1831
## Pos Pred Value : 0.5180
## Neg Pred Value : 0.5400
## Prevalence : 0.5083
## Detection Rate : 0.4317
## Detection Prevalence : 0.8333
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : Donor
##
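Since the stated goal is to maximize expected net profit rather than raw accuracy, the class probabilities from the QDA model can be combined with the $13.00 average donation and $0.68 mailing cost. A rough sketch (this ignores the correction for the weighted, balanced sampling, which would shrink the donor probabilities back toward the true 5.1% response rate):
qda.prob <- predict(qda.fit, f.test, type = "prob")   # posterior probabilities for "Donor" and "No Donor"
exp.profit <- qda.prob$Donor * 13.00 - 0.68           # expected net profit per mailing
mail <- exp.profit > 0                                 # mail only where expected profit is positive
summary(exp.profit)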
The KNN model achieves a 50% accuracy rate on the test set.
library(class)
knn.funds <- train(target ~ avg_gift + home_value + lifetime_gifts + med_fam_inc +avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, method = "knn", tuneGrid = expand.grid( k = seq(2, 20, 1)))
knn.funds
## k-Nearest Neighbors
##
## 2400 samples
## 9 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.4954042 -0.008992414
## 3 0.4961007 -0.007597672
## 4 0.4974568 -0.004522338
## 5 0.5003883 0.001359051
## 6 0.5091052 0.018750103
## 7 0.5093120 0.019221955
## 8 0.5068948 0.014447467
## 9 0.5056592 0.012059762
## 10 0.5069890 0.014541234
## 11 0.5034854 0.007618484
## 12 0.5077663 0.016140638
## 13 0.5055148 0.011638444
## 14 0.5086728 0.018010590
## 15 0.5105708 0.021793628
## 16 0.5081105 0.016858445
## 17 0.5104061 0.021545813
## 18 0.5081252 0.016928683
## 19 0.5075864 0.015989487
## 20 0.5075619 0.016013925
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
Predicting on the test set
knn.probs <- predict(knn.funds, f.test)
confusionMatrix(knn.probs, f.test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 153 147
## No Donor 152 148
##
## Accuracy : 0.5017
## 95% CI : (0.4609, 0.5424)
## No Information Rate : 0.5083
## P-Value [Acc > NIR] : 0.6434
##
## Kappa : 0.0033
##
## Mcnemar's Test P-Value : 0.8171
##
## Sensitivity : 0.5016
## Specificity : 0.5017
## Pos Pred Value : 0.5100
## Neg Pred Value : 0.4933
## Prevalence : 0.5083
## Detection Rate : 0.2550
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.5017
##
## 'Positive' Class : Donor
##
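One caveat for the KNN fit above: KNN is distance-based and the predictors sit on very different scales (home_value is in the thousands while time_lag is in single digits), so centering and scaling could change both the chosen k and the test accuracy. A sketch of the same tuning with preprocessing added:
set.seed(1234)
knn.scaled <- train(target ~ avg_gift + home_value + lifetime_gifts + med_fam_inc + avg_fam_inc + num_prom + pct_lt15k + months_since_donate + time_lag, data = f.train, method = "knn", preProcess = c("center", "scale"), tuneGrid = expand.grid(k = seq(2, 20, 1)))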
| Model | Test Accuracy |
|---|---|
| Logistic Regression | 48% |
| QDA | 52% |
| KNN | 50% |
The model with the highest test-set accuracy is QDA, so we use it to score the future fundraising data and export the predictions.
qdaff.prob <- predict(qda.fit, newdata = ffunds)
qdaff.pred <- data.frame(value = qdaff.prob)   # single column already named "value"
write.csv(qdaff.pred, file = "target.csv", row.names = FALSE)   # write.csv ignores col.names, so it is omitted
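A quick sanity check on the scored output, e.g. the distribution of predicted classes for the future mailing list:
table(qdaff.pred$value)   # counts of predicted Donor vs. No Donor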