Background Information: A national veterans’ organization wishes to develop a predictive model to improve the costeffectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Data Information: The file Fundraising.csv contains 3,000 records with approximately 50% donors (target = Donor) and 50% non-donors (target = No Donor). T

Step 1: Partitioning You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(12345)
fund <- readRDS("~/Downloads/fundraising.rds")
future.fund <- readRDS("~/Downloads/future_fundraising.rds")

train.control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

fund <- fund[, 5:21]
future.fund <- future.fund[, 5:20]

Step 2: Model Building Follow the following steps to build, evaluate, and choose a model.

1: Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fund.vars <- dplyr::select(fund, 2:3, 5:16)
fund.cor <- cor(fund.vars)
fund.cor
##                        num_child       income      wealth    home_value
## num_child            1.000000000  0.091893089  0.06017554 -0.0119642286
## income               0.091893089  1.000000000  0.20899310  0.2919734944
## wealth               0.060175537  0.208993101  1.00000000  0.2611611450
## home_value          -0.011964229  0.291973494  0.26116115  1.0000000000
## med_fam_inc          0.046961647  0.367505334  0.37776337  0.7381530742
## avg_fam_inc          0.047261395  0.378585352  0.38589230  0.7525690021
## pct_lt15k           -0.031717891 -0.283191234 -0.37514558 -0.3990861577
## num_prom            -0.086432604 -0.069008634 -0.41211777 -0.0645138583
## lifetime_gifts      -0.050954766 -0.019565470 -0.22547332 -0.0240737013
## largest_gift        -0.017554416  0.033180760 -0.02527652  0.0564942757
## last_gift           -0.012948678  0.109592754  0.05259131  0.1588576542
## months_since_donate -0.005563603  0.077238810  0.03371398  0.0234285142
## time_lag            -0.006069356 -0.001545727 -0.06642133  0.0006789113
## avg_gift            -0.019688680  0.124055750  0.09107875  0.1687736865
##                     med_fam_inc avg_fam_inc    pct_lt15k    num_prom
## num_child            0.04696165  0.04726139 -0.031717891 -0.08643260
## income               0.36750533  0.37858535 -0.283191234 -0.06900863
## wealth               0.37776337  0.38589230 -0.375145585 -0.41211777
## home_value           0.73815307  0.75256900 -0.399086158 -0.06451386
## med_fam_inc          1.00000000  0.97227129 -0.665362675 -0.05078270
## avg_fam_inc          0.97227129  1.00000000 -0.680284797 -0.05731139
## pct_lt15k           -0.66536267 -0.68028480  1.000000000  0.03777518
## num_prom            -0.05078270 -0.05731139  0.037775183  1.00000000
## lifetime_gifts      -0.03524583 -0.04032716  0.059618806  0.53861957
## largest_gift         0.04703207  0.04310394 -0.007882936  0.11381034
## last_gift            0.13597600  0.13137862 -0.061752121 -0.05586809
## months_since_donate  0.03233669  0.03126859 -0.009014558 -0.28232212
## time_lag             0.01520204  0.02434038 -0.019911490  0.11962322
## avg_gift             0.13716276  0.13175843 -0.062480892 -0.14725094
##                     lifetime_gifts largest_gift   last_gift months_since_donate
## num_child              -0.05095477 -0.017554416 -0.01294868        -0.005563603
## income                 -0.01956547  0.033180760  0.10959275         0.077238810
## wealth                 -0.22547332 -0.025276518  0.05259131         0.033713981
## home_value             -0.02407370  0.056494276  0.15885765         0.023428514
## med_fam_inc            -0.03524583  0.047032066  0.13597600         0.032336691
## avg_fam_inc            -0.04032716  0.043103937  0.13137862         0.031268594
## pct_lt15k               0.05961881 -0.007882936 -0.06175212        -0.009014558
## num_prom                0.53861957  0.113810342 -0.05586809        -0.282322122
## lifetime_gifts          1.00000000  0.507262313  0.20205827        -0.144621862
## largest_gift            0.50726231  1.000000000  0.44723693         0.019789633
## last_gift               0.20205827  0.447236933  1.00000000         0.186715010
## months_since_donate    -0.14462186  0.019789633  0.18671501         1.000000000
## time_lag                0.03854575  0.039977035  0.07511121         0.015528499
## avg_gift                0.18232435  0.474830096  0.86639998         0.189110799
##                          time_lag    avg_gift
## num_child           -0.0060693555 -0.01968868
## income              -0.0015457272  0.12405575
## wealth              -0.0664213294  0.09107875
## home_value           0.0006789113  0.16877369
## med_fam_inc          0.0152020426  0.13716276
## avg_fam_inc          0.0243403812  0.13175843
## pct_lt15k           -0.0199114896 -0.06248089
## num_prom             0.1196232155 -0.14725094
## lifetime_gifts       0.0385457538  0.18232435
## largest_gift         0.0399770354  0.47483010
## last_gift            0.0751112090  0.86639998
## months_since_donate  0.0155284995  0.18911080
## time_lag             1.0000000000  0.07008164
## avg_gift             0.0700816428  1.00000000
library(corrplot)
## corrplot 0.92 loaded
corrplot::corrplot(fund.cor)

var <- apply(fund, 2, var)
## Warning in FUN(newX[, i], ...): NAs introduced by coercion

## Warning in FUN(newX[, i], ...): NAs introduced by coercion

## Warning in FUN(newX[, i], ...): NAs introduced by coercion
round(var, 4)
##           homeowner           num_child              income              female 
##                  NA              0.1192              2.6877                  NA 
##              wealth          home_value         med_fam_inc         avg_fam_inc 
##              6.4859         906581.4713          30183.1011          28528.3742 
##           pct_lt15k            num_prom      lifetime_gifts        largest_gift 
##            146.6433            518.9120          22314.4274            507.0452 
##           last_gift months_since_donate            time_lag            avg_gift 
##            109.7315             16.7713             31.3798             55.5123 
##              target 
##                  NA

With this model we can identify that ther are high variances in Lifetime_gifts, Home_value, Avg_fam_inc, and Med_fam_inc.

par(mfrow = c(1, 3))
plot(fund$homeowner, main = "Homeowner")
plot(fund$female, main = "Female")
plot(fund$target, main = "Target")

par(mfrow = c(1, 4))
plot(fund$target,fund$num_child, main = "Num_Child")
plot(fund$target,fund$income, main = "Income")
plot(fund$target,fund$wealth, main = "Wealth")
plot(fund$target,fund$home_value, main = "Home_Value")

plot(fund$target,fund$med_fam_inc, main = "Med_Fam_Inc")

plot(fund$target,fund$avg_fam_inc, main = "Avg_Fam_Inc")

plot(fund$target,fund$pct_lt15k, main = "Pct_Lt15k")

plot(fund$target,fund$num_prom, main = "Num_Prom")

plot(fund$target,fund$lifetime_gifts, main = "Lifetime_Gifts")

plot(fund$target,fund$largest_gift, main = "Largest_Gift")

plot(fund$target,fund$last_gift, main = "Last_Gift")

plot(fund$target,fund$months_since_donate, main = "Months_Since_Donate")

plot(fund$target,fund$avg_gift, main = "Avg_Gift")

With all the models ran, we can conclude there are no huge difference between donors and non-donors. With that being said with there are still good predictors to take from the models. Those predictors would be Lifetime_gifts and Months_since_donated and income.

Lets looks at some histograms

par(mfrow = c(1, 4))
hist(fund$num_child)
hist(fund$income)
hist(fund$wealth)
hist(fund$home_value)

hist(fund$med_fam_inc)

hist(fund$avg_fam_inc)

hist(fund$pct_lt15k)

hist(fund$num_prom)

hist(fund$lifetime_gifts)

hist(fund$largest_gift)

hist(fund$last_gift)

hist(fund$months_since_donate)

hist(fund$avg_gift)

With looking at the various histograms we can see a trend of most varibles seems to be heavily skewed in each visual.

fund <- mutate(fund,
               log_months_since_donate = log(months_since_donate),
               log_avg_gift = log(avg_gift),
               log_num_child = log(num_child + 1),
               log_home_value = log(home_value + 0.1),
               log_med_fam_inc = log(med_fam_inc + 0.1),
               log_avg_fam_inc = log(avg_fam_inc + 0.1),
               log_pct_lt15k = log(pct_lt15k + 0.1),
               log_num_prom = log(num_prom + 1),
               log_lifetime_gifts = log(lifetime_gifts + 0.1),
               log_largest_gift = log(largest_gift + 0.1),
               log_last_gift = log(last_gift + 0.1))

future.fund <- mutate(future.fund,
                      log_months_since_donate = log(months_since_donate),
                      log_months_since_donate = log(months_since_donate),
                      log_avg_gift = log(avg_gift),
                      log_num_child = log(num_child),
                      log_home_value = log(home_value),
                      log_med_fam_inc = log(med_fam_inc),
                      log_avg_fam_inc = log(avg_fam_inc),
                      log_pct_lt15k = log(pct_lt15k),
                      log_num_prom = log(num_prom),
                      log_lifetime_gifts = log(lifetime_gifts),
                      log_largest_gift = log(largest_gift),
                      log_last_gift = log(last_gift))
fund <- mutate(fund, 
               num_child_income_int = num_child * income,
               num_child_log_months_since_donate_int = num_child * log_months_since_donate,
               num_child_log_largest_gift_int = num_child * log_largest_gift,
               num_child_log_life_gifts_int = num_child * log_lifetime_gifts,
               income_log_months_since_donate_int = income * log_months_since_donate,
               income_log_largest_gift_int = income * log_largest_gift,
               income_log_lifetime_gifts_int = income * log_lifetime_gifts,
               log_m_since_donate_log_largest_gift_int = log_months_since_donate * log_largest_gift,
               log_m_since_donate_log_life_gifts_int = log_months_since_donate * log_lifetime_gifts,
               log_largest_gift_log_lifetime_gifts_int = log_largest_gift * log_lifetime_gifts)

future.fund <- mutate(future.fund,
                      num_child_income_int = num_child * income,
                      num_child_log_months_since_donate_int = num_child * log_months_since_donate,
                      num_child_log_largest_gift_int = num_child * log_largest_gift,
                      num_child_log_life_gifts_int = num_child * log_lifetime_gifts,
                      income_log_months_since_donate_int = income * log_months_since_donate,
                      income_log_largest_gift_int = income * log_largest_gift,
                      income_log_lifetime_gifts_int = income * log_lifetime_gifts,
                      log_m_since_donate_log_largest_gift_int = log_months_since_donate * log_largest_gift,
                      log_m_since_donate_log_life_gifts_int = log_months_since_donate * log_lifetime_gifts,
                      log_largest_gift_log_lifetime_gifts_int = log_largest_gift * log_lifetime_gifts)

2: Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.

Lets run random forest model

set.seed(12345)

rf1 <- caret::train(target ~ num_child + log_lifetime_gifts + log_largest_gift + num_child_log_months_since_donate_int +
                      income_log_months_since_donate_int + log_largest_gift_log_lifetime_gifts_int, 
              data = fund,
              method = 'rf', 
              trControl = train.control,
              tuneGrid = expand.grid(mtry = seq(1, 5, 1)),
              preProcess = c("center","scale"))

rf1
## Random Forest 
## 
## 3000 samples
##    6 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2399, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##   1     0.5576648  0.11538115
##   2     0.5365496  0.07311536
##   3     0.5337726  0.06756465
##   4     0.5359967  0.07200605
##   5     0.5352165  0.07044486
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.

We will now run a SMV with Linear Kernel Model

set.seed(12345)

svm.linear <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
                       income_log_months_since_donate_int + num_child_log_months_since_donate_int, 
              data = fund, 
              method = "svmLinear", 
              trControl = train.control,
              tuneGrid = expand.grid(C = seq(0.002, 0.003, 0.0001)),
              preProcess = c("center","scale"))

plot(svm.linear)

Last lets run a SVM with Polynomial Kernel Model

set.seed(12345)

svm.poly <- caret::train(target ~ num_child + log_months_since_donate + log_largest_gift +
                       income_log_months_since_donate_int + num_child_log_months_since_donate_int, 
              data = fund, 
              method = "svmPoly", 
              trControl = train.control,
              tuneLength = 4,
              preProcess = c("center","scale"))

plot(svm.poly)

3: Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?

The performance of the model on real-world data can be impacted by utilizing unbalanced training data if the model is not exposed to an equal amount of classifiers being predicted. It is possible to randomly choose an equal number of classifiers by using weighted random sampling. A simple random sample, however, wouldn’t produce an equal number of donors and non-donors, and the samples would probably differ greatly, making them unrepresentative of the data.

Model results Based upon all the models, the random forest model predicited the worst with an accuracy of 55.77%. SVM with linear kernel preformed the best with 56.49 and the model in between the two was SVM with polynomial kernel which predicted at 56.40%. Te best model moving forward would be the SVM with linear kernel.

4: Evaluate the fit. Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?

models <- c("RF", "SVM-Linear", "SVM-Poly")
acc <- c(55.77, 56.49, 56.40)

acc.summary <- as.data.frame(acc, row.names = models)
acc.summary
##              acc
## RF         55.77
## SVM-Linear 56.49
## SVM-Poly   56.40

SVM-Linear is better results.

5: Select best model. From your answer in (4), what do you think is the “best” model?

SVM_Linear

Step 3: Testing The file FutureFundraising.csv contains the attributes for future mailing candidates

6: Using your “best” model from Step 2 (number 4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.

library(readr)
set.seed(12345)

Final.Predictions <- as.data.frame(predict(svm.linear, future.fund))
write_csv(Final.Predictions, "future.predictions.csv")