A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
The goal is to improve the cost-effectiveness of the veterans’ organization direct marketing campaign. The objective of this effort is to develop a classification model that can effectively capture donors so that the expected net profit is maximized.
For this project, we were given with a data sample. The dataset was already generated with weighted sampling as the original dataset/population has heavy non-responders. The sample given has almost equal number of donors and non-donors.
fund <- readRDS("fundraising.rds")
futu_fund <- readRDS("future_fundraising.rds")
set.seed(12345)
str(fund)
## tibble [3,000 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
str(futu_fund)
## tibble [120 x 20] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 1 2 1 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_child : num [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num [1:120] 5 1 4 4 2 4 2 3 4 2 ...
## $ female : Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 2 2 2 ...
## $ wealth : num [1:120] 9 7 1 8 7 8 1 8 3 5 ...
## $ home_value : num [1:120] 1399 1355 835 1019 992 ...
## $ med_fam_inc : num [1:120] 637 411 310 389 524 371 209 253 302 335 ...
## $ avg_fam_inc : num [1:120] 703 497 364 473 563 408 259 285 324 348 ...
## $ pct_lt15k : num [1:120] 1 9 22 15 6 10 36 25 19 14 ...
## $ num_prom : num [1:120] 74 77 70 21 63 35 72 68 55 59 ...
## $ lifetime_gifts : num [1:120] 102 249 126 26 100 92 146 98 66 276 ...
## $ largest_gift : num [1:120] 6 15 6 16 20 37 12 5 7 15 ...
## $ last_gift : num [1:120] 5 7 6 16 3 37 11 3 5 13 ...
## $ months_since_donate: num [1:120] 29 35 34 37 21 37 36 32 30 33 ...
## $ time_lag : num [1:120] 3 3 8 5 6 5 5 9 9 10 ...
## $ avg_gift : num [1:120] 4.86 9.58 4.34 13 7.69 ...
table(fund$target)
##
## Donor No Donor
## 1499 1501
(100*1499)/3000
## [1] 49.96667
The training dataset has 3000 observations and the test dataset has 120 observations. The sample used is weighted 50:50, with 1499 Donors and 1501 Non- Donors, which is gives tells us that 49.97% of the observation are doners.
summary(fund)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Due to some very big and very small values for min and max, it is highly likely that there are outliers in coloumns home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag and avg_gift.
corr <- (fund[,c(6:7,9:21)])
corr$target <- as.numeric(corr$target)
chart.Correlation(corr, histogram=FALSE, pch=19)
vif(as.data.frame(corr))
## Variables VIF
## 1 num_child 1.027086
## 2 income 1.198361
## 3 wealth 1.509487
## 4 home_value 2.495523
## 5 med_fam_inc 18.433712
## 6 avg_fam_inc 20.709328
## 7 pct_lt15k 2.040823
## 8 num_prom 1.964372
## 9 lifetime_gifts 1.994304
## 10 largest_gift 1.715450
## 11 last_gift 4.155421
## 12 months_since_donate 1.159323
## 13 time_lag 1.032709
## 14 avg_gift 4.470251
## 15 target 1.030083
From the correlation chart, we can see that there variables that highly correlate with other variable. Both med_fam_inc and avg_med_fam are correlated to income. We also have two sets of variables that are colinear, med_fam_inc with avg_fam_inc and last_gift with avg_gift.
There are no exclusions.
A transformation that would work here would be square root and then to use logistic transformation on the non-zero values. This would make them not zero.
trainIndex <- createDataPartition(fund$target,p=.8,list=FALSE)
data_train <- fund[trainIndex,]
data_test <- fund[-trainIndex,]
nrow(data_train)
## [1] 2401
train_control <- trainControl(method="repeatedcv",number=10,repeats=3)
We did two different methods and it seems that doing cross validation was the best option because since the output sample is small.
glm_fit <- glm(target~., data = fund, family = "binomial")
summary(glm_fit)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = fund)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90432 -1.15349 0.00153 1.15919 1.79778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.885e+00 4.595e-01 -4.102 4.10e-05 ***
## zipconvert2Yes -1.365e+01 2.670e+02 -0.051 0.95924
## zipconvert3No 1.361e+01 2.670e+02 0.051 0.95934
## zipconvert4Yes -1.365e+01 2.670e+02 -0.051 0.95922
## zipconvert5Yes -1.365e+01 2.670e+02 -0.051 0.95922
## homeownerNo 4.957e-02 9.412e-02 0.527 0.59847
## num_child 2.752e-01 1.137e-01 2.422 0.01544 *
## income -6.952e-02 2.595e-02 -2.679 0.00738 **
## femaleNo 5.995e-02 7.673e-02 0.781 0.43463
## wealth -1.907e-02 1.800e-02 -1.059 0.28940
## home_value -1.074e-04 7.141e-05 -1.503 0.13272
## med_fam_inc -1.200e-03 9.303e-04 -1.289 0.19725
## avg_fam_inc 1.756e-03 1.010e-03 1.738 0.08226 .
## pct_lt15k -9.519e-04 4.440e-03 -0.214 0.83024
## num_prom -3.682e-03 2.317e-03 -1.589 0.11204
## lifetime_gifts 1.599e-04 3.721e-04 0.430 0.66743
## largest_gift -1.773e-03 3.091e-03 -0.574 0.56629
## last_gift 9.923e-03 7.562e-03 1.312 0.18945
## months_since_donate 5.922e-02 1.003e-02 5.906 3.51e-09 ***
## time_lag -6.174e-03 6.789e-03 -0.909 0.36311
## avg_gift 7.539e-03 1.106e-02 0.682 0.49526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.9 on 2999 degrees of freedom
## Residual deviance: 4062.0 on 2979 degrees of freedom
## AIC: 4104
##
## Number of Fisher Scoring iterations: 12
coef(glm_fit)
## (Intercept) zipconvert2Yes zipconvert3No zipconvert4Yes
## -1.884773e+00 -1.364510e+01 1.361190e+01 -1.365500e+01
## zipconvert5Yes homeownerNo num_child income
## -1.365258e+01 4.956599e-02 2.752422e-01 -6.952306e-02
## femaleNo wealth home_value med_fam_inc
## 5.994674e-02 -1.907373e-02 -1.073676e-04 -1.199496e-03
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## 1.755584e-03 -9.519183e-04 -3.681972e-03 1.598657e-04
## largest_gift last_gift months_since_donate time_lag
## -1.772667e-03 9.923322e-03 5.921548e-02 -6.174095e-03
## avg_gift
## 7.539422e-03
summary(glm_fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.884773e+00 4.595006e-01 -4.10178679 4.099720e-05
## zipconvert2Yes -1.364510e+01 2.670209e+02 -0.05110124 9.592448e-01
## zipconvert3No 1.361190e+01 2.670209e+02 0.05097692 9.593439e-01
## zipconvert4Yes -1.365500e+01 2.670209e+02 -0.05113831 9.592153e-01
## zipconvert5Yes -1.365258e+01 2.670209e+02 -0.05112926 9.592225e-01
## homeownerNo 4.956599e-02 9.412356e-02 0.52660550 5.984676e-01
## num_child 2.752422e-01 1.136519e-01 2.42180072 1.544382e-02
## income -6.952306e-02 2.595057e-02 -2.67905680 7.382987e-03
## femaleNo 5.994674e-02 7.672748e-02 0.78129420 4.346295e-01
## wealth -1.907373e-02 1.800369e-02 -1.05943418 2.894021e-01
## home_value -1.073676e-04 7.141352e-05 -1.50346346 1.327196e-01
## med_fam_inc -1.199496e-03 9.302622e-04 -1.28941665 1.972533e-01
## avg_fam_inc 1.755584e-03 1.010277e-03 1.73772557 8.225918e-02
## pct_lt15k -9.519183e-04 4.440127e-03 -0.21438986 8.302430e-01
## num_prom -3.681972e-03 2.317003e-03 -1.58911007 1.120355e-01
## lifetime_gifts 1.598657e-04 3.720630e-04 0.42967362 6.674331e-01
## largest_gift -1.772667e-03 3.090851e-03 -0.57352069 5.662922e-01
## last_gift 9.923322e-03 7.562225e-03 1.31222251 1.894451e-01
## months_since_donate 5.921548e-02 1.002632e-02 5.90600490 3.505036e-09
## time_lag -6.174095e-03 6.788749e-03 -0.90945987 3.631074e-01
## avg_gift 7.539422e-03 1.105534e-02 0.68197131 4.952571e-01
glm_resp <- predict(glm_fit, type = "response")
contrasts(fund$target)
## No Donor
## Donor 0
## No Donor 1
glm_dona <- rep("Donor", 3000)
glm_dona[glm_resp>0.5] = "No Donor"
table(glm_dona, fund$target)
##
## glm_dona Donor No Donor
## Donor 870 672
## No Donor 629 829
mean(glm_dona==fund$target)
## [1] 0.5663333
A logistic regression was used to find if any significant variables were present where p < .05.
For the following models we will be using the variables: ‘avg_gift,’ ‘lifetime_gifts,’ ‘num_child,’ ‘income,’ ‘home_value,’ ‘months_since_donate,’ and ‘time_lag.’ There were 3 variables excluded because of their collinearity to variables already going to be used. Those 3 are med_fam_inc and pct_lt15k which are both colinear with income. The variable last_gift was also excluded for being colinear with avg_gift.
knn.fit <- train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
data=fund,
method='knn',
trControl = train_control,
tuneLength=20)
futu_fund.value=predict(knn.fit, futu_fund)
Value=c("value",as.character(futu_fund.value))
write.csv(Value,file="value_knn1.csv")
knn.fit
## k-Nearest Neighbors
##
## 3000 samples
## 6 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2700, 2700, 2700, 2701, 2700, 2700, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5328960 0.06576573
## 7 0.5257904 0.05157043
## 9 0.5226712 0.04533218
## 11 0.5252215 0.05041233
## 13 0.5282194 0.05639490
## 15 0.5293268 0.05862558
## 17 0.5253297 0.05062834
## 19 0.5259890 0.05194578
## 21 0.5247682 0.04950358
## 23 0.5241119 0.04819404
## 25 0.5257775 0.05151753
## 27 0.5282212 0.05639668
## 29 0.5269993 0.05395816
## 31 0.5272256 0.05441351
## 33 0.5262282 0.05243475
## 35 0.5258971 0.05177788
## 37 0.5277849 0.05555772
## 39 0.5262275 0.05244058
## 41 0.5258893 0.05176058
## 43 0.5233267 0.04664669
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
The best value for the KNN model was k=5 which gives us an accuracy of 53.19%.
glm.fit.full <- train(target~.,data=data_train,method='glm',trControl = train_control)
summary(glm.fit.full)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.89857 -1.15445 0.00135 1.15097 1.84663
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.112e+00 5.170e-01 -4.086 4.39e-05 ***
## zipconvert2Yes -1.362e+01 3.054e+02 -0.045 0.96444
## zipconvert3No 1.367e+01 3.054e+02 0.045 0.96431
## zipconvert4Yes -1.371e+01 3.054e+02 -0.045 0.96419
## zipconvert5Yes -1.371e+01 3.054e+02 -0.045 0.96420
## homeownerNo 6.876e-02 1.056e-01 0.651 0.51506
## num_child 2.681e-01 1.262e-01 2.124 0.03369 *
## income -8.308e-02 2.928e-02 -2.838 0.00454 **
## femaleNo 2.470e-02 8.626e-02 0.286 0.77463
## wealth -1.669e-02 2.025e-02 -0.824 0.40982
## home_value -1.657e-04 7.967e-05 -2.080 0.03750 *
## med_fam_inc -1.060e-03 1.019e-03 -1.040 0.29824
## avg_fam_inc 1.683e-03 1.120e-03 1.502 0.13303
## pct_lt15k -1.977e-03 4.978e-03 -0.397 0.69128
## num_prom -3.489e-03 2.658e-03 -1.313 0.18933
## lifetime_gifts -9.855e-05 5.266e-04 -0.187 0.85153
## largest_gift 3.182e-03 7.741e-03 0.411 0.68100
## last_gift 9.720e-03 8.964e-03 1.084 0.27819
## months_since_donate 6.886e-02 1.138e-02 6.053 1.42e-09 ***
## time_lag -1.584e-03 7.767e-03 -0.204 0.83840
## avg_gift 5.146e-04 1.240e-02 0.042 0.96690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3328.5 on 2400 degrees of freedom
## Residual deviance: 3233.3 on 2380 degrees of freedom
## AIC: 3275.3
##
## Number of Fisher Scoring iterations: 12
pred.glm.full<-predict(glm.fit.full,data_test)
confusionMatrix(pred.glm.full,data_test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 174 147
## No Donor 125 153
##
## Accuracy : 0.5459
## 95% CI : (0.5051, 0.5863)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.01513
##
## Kappa : 0.0919
##
## Mcnemar's Test P-Value : 0.20291
##
## Sensitivity : 0.5819
## Specificity : 0.5100
## Pos Pred Value : 0.5421
## Neg Pred Value : 0.5504
## Prevalence : 0.4992
## Detection Rate : 0.2905
## Detection Prevalence : 0.5359
## Balanced Accuracy : 0.5460
##
## 'Positive' Class : Donor
##
futu_fund.value=predict(glm.fit.full, futu_fund)
Value=c("value",as.character(futu_fund.value))
write.csv(Value,file="value_glm_full.csv")
This model gives us a prediction accuracy of 54.26%
glm.fit.imp <- train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
data=data_train,
method='glm',
trControl = train_control)
summary(glm.fit.imp)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7095 -1.1534 0.5867 1.1576 1.8228
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.397e+00 3.719e-01 -6.446 1.15e-10 ***
## months_since_donate 7.397e-02 1.094e-02 6.762 1.36e-11 ***
## largest_gift 1.958e-03 4.699e-03 0.417 0.67697
## avg_gift 1.385e-02 8.992e-03 1.540 0.12365
## num_child 2.988e-01 1.248e-01 2.395 0.01663 *
## income -7.892e-02 2.687e-02 -2.937 0.00331 **
## home_value -8.916e-05 4.616e-05 -1.931 0.05344 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3328.5 on 2400 degrees of freedom
## Residual deviance: 3248.0 on 2394 degrees of freedom
## AIC: 3262
##
## Number of Fisher Scoring iterations: 4
pred.glm.imp<-predict(glm.fit.imp,data_test)
confusionMatrix(pred.glm.imp,data_test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 173 145
## No Donor 126 155
##
## Accuracy : 0.5476
## 95% CI : (0.5067, 0.588)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : 0.01227
##
## Kappa : 0.0953
##
## Mcnemar's Test P-Value : 0.27421
##
## Sensitivity : 0.5786
## Specificity : 0.5167
## Pos Pred Value : 0.5440
## Neg Pred Value : 0.5516
## Prevalence : 0.4992
## Detection Rate : 0.2888
## Detection Prevalence : 0.5309
## Balanced Accuracy : 0.5476
##
## 'Positive' Class : Donor
##
futu_fund.value=predict(glm.fit.imp, futu_fund)
Value=c("value",as.character(futu_fund.value))
write.csv(Value,file="value_glm_imp.csv")
This model gives us a prediction accuracy of 57.76%
The model that gave us the best prediction accuracy was the GLM fit-train with an accuracy of 57.76.
I used a cut-off value of .5. This is because the data was creaed using weighted sampling.
For us to have gotten a better accuracy, we could have used a bigger sample size. Also, with the variables num_child and income, could have been altered since they were more of categorical values rather than continuous