#Business Objectives and Goals
To develop a predictive model to improve the cost effectiveness a national veterans’ organization direct marketing campaign.
#Data Sources and Data used
The fundraising.rds and future_fundraising.rds secondary source datasets were provided by the professor for this specific task. The datasets have been previously preprocessed. According to the task reading the response rate is 13%. If the response rate is 13%, and a data set is created from this activity then the created dataset will show more target == ‘No Donor’ than target == ‘Donor’. A random sample from this dataset would result in a subset that is highly skewed towards target == ‘No Donor’. If the model is fit on this unbalanced data the solutions will shift towards target == ‘No Donor’ which will muddy our attempts to find target == ‘Donor’ on new data whose behavior is varied. Using weighted sampling to produce a training set with equal numbers of donors and non-donors can prevent overfilling on an unbalanced training data set.
Variable Description • zip: Zip code group (zip codes were grouped into five groups; Yes = the potential donor belongs to this zip group.) 00000–19999 ⇒ zipconvert1 20000–39999 ⇒ zipconvert2 40000–59999 ⇒ zipconvert3 60000–79999 ⇒ zipconvert4 80000–99999 ⇒ zipconvert5 • homeowner Yes = homeowner, No = not a homeowner • num_child Number of children • income Household income • female No = male, Yes = female • wealth: wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. Each rating has a different meaning within each state • home_value Average home value in potential donor’s neighborhood in hundreds of dollars • med_fam_inc Median family income in potential donor’s neighborhood in hundreds of dollars • avg_fam_inc Average family income in potential donor’s neighborhood in hundreds • pct_lt15k Percent earning less than $15K in potential donor’s neighborhood • num_prom Lifetime number of promotions received to date • lifetime_gifts Dollar amount of lifetime gifts to date • largest_gift Dollar amount of largest gift to date • last_gift Dollar amount of most recent gift • months_since_donate Number of months from last donation to July 2018 • time_lag Number of months between first and second gift • avg_gift Average dollar amount of gifts to date • target Outcome variable: binary indicator for response Yes = donor, No = non-donor
A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is 13 dollars. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
#Type of Analysis performed
The type of analysis performed was classification because the response variable had two categorical levels. One level being ‘Donor’ and the other ‘No Donor’. I utilized logistic regression, support vector machine, gradient boosting, and k-nearest neighbor classification models in the search for a good predictive model. Comparing the accuracies, I found that knn obtained an accuracy above the rest.
#STEP 1: Partitioning
You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
set.seed(12345)
train_index= sample(1:nrow(fundraising), nrow(fundraising)*.8)
train = fundraising[train_index, ]
test= fundraising[-train_index, ]; str(train)
## tibble [2,400 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 2 2 2 1 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 2 2 1 1 1 1 1 2 ...
## $ num_child : num [1:2400] 2 1 1 1 1 1 1 1 2 1 ...
## $ income : num [1:2400] 4 4 2 1 3 3 4 7 2 3 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 1 2 ...
## $ wealth : num [1:2400] 3 3 4 4 8 8 7 8 8 4 ...
## $ home_value : num [1:2400] 541 1229 444 442 2702 ...
## $ med_fam_inc : num [1:2400] 335 359 196 315 637 273 437 463 374 295 ...
## $ avg_fam_inc : num [1:2400] 367 490 263 343 695 331 454 597 434 319 ...
## $ pct_lt15k : num [1:2400] 13 10 38 24 2 21 9 13 3 19 ...
## $ num_prom : num [1:2400] 63 39 36 52 16 54 71 30 22 82 ...
## $ lifetime_gifts : num [1:2400] 91 35 178 134 20 110 118 57 29 242 ...
## $ largest_gift : num [1:2400] 10 15 20 20 20 9 10 13 15 12 ...
## $ last_gift : num [1:2400] 10 15 20 20 20 5 10 11 15 9 ...
## $ months_since_donate: num [1:2400] 37 34 37 30 37 33 30 35 30 32 ...
## $ time_lag : num [1:2400] 4 13 0 10 5 4 3 2 6 7 ...
## $ avg_gift : num [1:2400] 6.5 11.67 9.89 16.75 20 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 2 2 2 1 2 2 1 1 1 1 ...
#STEP 2: Model Building
(2.1) Exploratory data analysis. Examine the predictors and evaluate their association with the response variable. Which might be good candidate predictors? Are any collinear with each other?
#check for correlation
correlations=cor(train[,c(6,7, 10:20)]); corrplot(correlations, method="circle")
corr_results_train <- cor(train[,c(6,7, 10:20)], use = "complete.obs"); kable(round(corr_results_train, 2),)
num_child | income | home_value | med_fam_inc | avg_fam_inc | pct_lt15k | num_prom | lifetime_gifts | largest_gift | last_gift | months_since_donate | time_lag | avg_gift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
num_child | 1.00 | 0.08 | -0.02 | 0.04 | 0.04 | -0.03 | -0.09 | -0.05 | -0.01 | 0.00 | -0.01 | 0.01 | -0.01 |
income | 0.08 | 1.00 | 0.28 | 0.37 | 0.38 | -0.29 | -0.06 | -0.02 | 0.02 | 0.10 | 0.07 | 0.00 | 0.12 |
home_value | -0.02 | 0.28 | 1.00 | 0.74 | 0.76 | -0.40 | -0.07 | -0.03 | 0.04 | 0.14 | 0.01 | -0.01 | 0.15 |
med_fam_inc | 0.04 | 0.37 | 0.74 | 1.00 | 0.97 | -0.67 | -0.06 | -0.05 | 0.03 | 0.12 | 0.02 | 0.01 | 0.13 |
avg_fam_inc | 0.04 | 0.38 | 0.76 | 0.97 | 1.00 | -0.67 | -0.07 | -0.05 | 0.03 | 0.11 | 0.02 | 0.02 | 0.12 |
pct_lt15k | -0.03 | -0.29 | -0.40 | -0.67 | -0.67 | 1.00 | 0.04 | 0.06 | 0.00 | -0.06 | 0.00 | -0.01 | -0.07 |
num_prom | -0.09 | -0.06 | -0.07 | -0.06 | -0.07 | 0.04 | 1.00 | 0.52 | 0.13 | -0.06 | -0.29 | 0.12 | -0.16 |
lifetime_gifts | -0.05 | -0.02 | -0.03 | -0.05 | -0.05 | 0.06 | 0.52 | 1.00 | 0.51 | 0.18 | -0.14 | 0.04 | 0.16 |
largest_gift | -0.01 | 0.02 | 0.04 | 0.03 | 0.03 | 0.00 | 0.13 | 0.51 | 1.00 | 0.37 | 0.02 | 0.05 | 0.41 |
last_gift | 0.00 | 0.10 | 0.14 | 0.12 | 0.11 | -0.06 | -0.06 | 0.18 | 0.37 | 1.00 | 0.23 | 0.10 | 0.85 |
months_since_donate | -0.01 | 0.07 | 0.01 | 0.02 | 0.02 | 0.00 | -0.29 | -0.14 | 0.02 | 0.23 | 1.00 | 0.02 | 0.22 |
time_lag | 0.01 | 0.00 | -0.01 | 0.01 | 0.02 | -0.01 | 0.12 | 0.04 | 0.05 | 0.10 | 0.02 | 1.00 | 0.09 |
avg_gift | -0.01 | 0.12 | 0.15 | 0.13 | 0.12 | -0.07 | -0.16 | 0.16 | 0.41 | 0.85 | 0.22 | 0.09 | 1.00 |
corr_results_test <- cor(test[,c(6,7, 10:20)], use = "complete.obs"); kable(round(corr_results_test, 2),)
num_child | income | home_value | med_fam_inc | avg_fam_inc | pct_lt15k | num_prom | lifetime_gifts | largest_gift | last_gift | months_since_donate | time_lag | avg_gift | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
num_child | 1.00 | 0.13 | 0.00 | 0.07 | 0.08 | -0.04 | -0.06 | -0.08 | -0.06 | -0.04 | 0.02 | -0.06 | -0.05 |
income | 0.13 | 1.00 | 0.34 | 0.37 | 0.39 | -0.27 | -0.11 | -0.03 | 0.15 | 0.15 | 0.12 | -0.01 | 0.14 |
home_value | 0.00 | 0.34 | 1.00 | 0.74 | 0.74 | -0.39 | -0.06 | 0.03 | 0.18 | 0.23 | 0.10 | 0.02 | 0.22 |
med_fam_inc | 0.07 | 0.37 | 0.74 | 1.00 | 0.97 | -0.67 | -0.02 | 0.03 | 0.16 | 0.20 | 0.06 | 0.05 | 0.17 |
avg_fam_inc | 0.08 | 0.39 | 0.74 | 0.97 | 1.00 | -0.71 | -0.02 | 0.02 | 0.15 | 0.18 | 0.06 | 0.05 | 0.16 |
pct_lt15k | -0.04 | -0.27 | -0.39 | -0.67 | -0.71 | 1.00 | 0.02 | 0.04 | -0.04 | -0.06 | -0.04 | -0.06 | -0.04 |
num_prom | -0.06 | -0.11 | -0.06 | -0.02 | -0.02 | 0.02 | 1.00 | 0.71 | 0.04 | -0.04 | -0.24 | 0.11 | -0.10 |
lifetime_gifts | -0.08 | -0.03 | 0.03 | 0.03 | 0.02 | 0.04 | 0.71 | 1.00 | 0.46 | 0.35 | -0.20 | 0.01 | 0.35 |
largest_gift | -0.06 | 0.15 | 0.18 | 0.16 | 0.15 | -0.04 | 0.04 | 0.46 | 1.00 | 0.94 | 0.03 | 0.00 | 0.91 |
last_gift | -0.04 | 0.15 | 0.23 | 0.20 | 0.18 | -0.06 | -0.04 | 0.35 | 0.94 | 1.00 | 0.08 | 0.00 | 0.92 |
months_since_donate | 0.02 | 0.12 | 0.10 | 0.06 | 0.06 | -0.04 | -0.24 | -0.20 | 0.03 | 0.08 | 1.00 | -0.01 | 0.10 |
time_lag | -0.06 | -0.01 | 0.02 | 0.05 | 0.05 | -0.06 | 0.11 | 0.01 | 0.00 | 0.00 | -0.01 | 1.00 | 0.01 |
avg_gift | -0.05 | 0.14 | 0.22 | 0.17 | 0.16 | -0.04 | -0.10 | 0.35 | 0.91 | 0.92 | 0.10 | 0.01 | 1.00 |
train$wealth=as.ordered(train$wealth)
test$wealth=as.ordered(test$wealth)
par(mfrow=c(2,2))
#looking at balance independent categorical variable
slices <- c(nrow(train[train$zipconvert2=='No'&
train$zipconvert3=='No'&
train$zipconvert4=='No'&
train$zipconvert5=='No',]),
nrow(train[train$zipconvert2=='Yes',]),
nrow(train[train$zipconvert3=='Yes',]),
nrow(train[train$zipconvert4=='Yes',]),
nrow(train[train$zipconvert5=='Yes',])); lbls <- c("00000–19999", "20000–39999", "40000–59999", "60000–79999", "80000–99999"); pie(slices, labels = lbls, col=rainbow(length(lbls)), main="Zip Code Grouping")
barplot(prop.table(table(train$homeowner)), main = "Homeowner",names.arg = levels(train$homeowner), col = rainbow(length(levels(train$homeowner))))
barplot(prop.table(table(train$wealth)), main = "Wealth Rating",names.arg = levels(train$wealth), col = rainbow(length(levels(train$wealth))))
barplot(prop.table(table(train$female)), main = "Female",names.arg = levels(train$female), col = rainbow(length(levels(train$female))))
par(mfrow=c(1,1))
#a look at our categorical variable of interest
sum.choice.donor = sum(train$target == 'Donor')
sum.choice.nodonor = sum(train$target == 'No Donor')
choice.counts.vec = c(sum.choice.donor, sum.choice.nodonor)
bplot=barplot(prop.table(table(train$target)), main = "train$target", names.arg = levels(train$target), col=c('green', 'dark red'))
sum.choice.donor.test = sum(test$target == 'Donor')
sum.choice.nodonor.test = sum(test$target == 'No Donor')
choice.counts.vec.test = c(sum.choice.donor.test, sum.choice.nodonor.test)
bplot=barplot(prop.table(table(test$target)), main = "test$target", names.arg = levels(test$target), col=c('green', 'dark red'))
par(mfrow=c(2,7))
#looking at the independent numerical variables
boxplot(train$num_child, xlab = "num_child",col=1, notch = TRUE)
boxplot(train$income, xlab = "income",col=2, notch = TRUE)
boxplot(train$home_value, xlab = "home_value",col=3, notch = TRUE)
boxplot(train$med_fam_inc, xlab = "med_fam_inc",col=4, notch = TRUE)
boxplot(train$avg_fam_inc, xlab = "avg_fam_inc",col=5, notch = TRUE)
boxplot(train$pct_lt15k, xlab = "pct_lt15k",col=6, notch = TRUE)
boxplot(train$num_prom, xlab = "num_prom",col=7, notch = TRUE)
boxplot(train$lifetime_gifts, xlab = "lifetime_gifts",col=8, notch = TRUE)
boxplot(train$largest_gift, xlab = "largest_gift",col=9, notch = TRUE)
boxplot(train$last_gift, xlab = "last_gift",col=10, notch = TRUE)
boxplot(train$months_since_donate, xlab = "months_since_donate",col=11, notch = TRUE)
boxplot(train$time_lag, xlab = "time_lag",col=12, notch = TRUE)
boxplot(train$avg_gift, xlab = "avg_gift",col=13, notch = TRUE)
(2.1) Looking at the predictors it looks like many of them are skewed. There exist a 0.97 correlation between med_fam_inc avg_fam_inc. largest_gift is multi-collinear with last_gift and avg_gift. At this data discovery point I say utilizing a predictor set that does not have correlation issues would be best to start modeling with.
#Variable transformations
Many of the numerical independent variable were skewed but I chose not to transform them. I did choose to change the wealth variable from a number to an ordinal because the data description mentioned that the wealth segments are denoted 0 to 9, with 9 being the highest-wealth group and zero the lowest. All other variables were left as is.
#Business inputs
This task required no business inputs.
(2.2) Select classification tool and parameters. Run at least two classification models of your choosing. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be reproduced.
full_glm_model = glm(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
#female + #dropped my step selection
#wealth + #insig
home_value +
#med_fam_inc + #removed due to high correlation with avg_fam_inc
avg_fam_inc +
#avg_gift + #removed multi collinear with last_gift and largest_gift
#pct_lt15k + #insig
#num_prom + #insig
#lifetime_gifts + #insig
#largest_gift + #removed multi collinear with last_gift and avg_gift
last_gift + #x
#time_lag +
months_since_donate +
zipconvert2:income +
income:home_value +
income:avg_fam_inc,data=train, family = binomial)
#stepwise model selection
glm_step_wAIC = step(full_glm_model, direction="both",
test="Chisq",
trace = F) #uses AIC criteria
glm_step_wBIC = step(full_glm_model, direction="both",
test="Chisq",
trace = F,
k=log(nrow(train))) #uses BIC criteria
summary(glm_step_wAIC); summary(glm_step_wBIC)
##
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 +
## zipconvert5 + num_child + income + home_value + avg_fam_inc +
## last_gift + months_since_donate + zipconvert2:income + income:home_value +
## income:avg_fam_inc, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.807 -1.136 -0.748 1.170 1.683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.999e+00 4.905e-01 -4.075 4.61e-05 ***
## zipconvert2Yes -1.208e+01 2.294e+02 -0.053 0.958007
## zipconvert3No 1.266e+01 2.294e+02 0.055 0.955979
## zipconvert4Yes -1.275e+01 2.294e+02 -0.056 0.955668
## zipconvert5Yes -1.271e+01 2.294e+02 -0.055 0.955829
## num_child 3.324e-01 1.276e-01 2.606 0.009164 **
## income -1.098e-01 7.739e-02 -1.419 0.156030
## home_value 3.430e-04 2.005e-04 1.710 0.087253 .
## avg_fam_inc -1.460e-03 1.072e-03 -1.362 0.173171
## last_gift 1.690e-02 4.795e-03 3.524 0.000426 ***
## months_since_donate 5.800e-02 1.079e-02 5.376 7.61e-08 ***
## zipconvert2Yes:income -1.821e-01 6.394e-02 -2.848 0.004404 **
## income:home_value -1.034e-04 4.165e-05 -2.482 0.013060 *
## income:avg_fam_inc 4.615e-04 2.281e-04 2.023 0.043093 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3244.2 on 2386 degrees of freedom
## AIC: 3272.2
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = target ~ last_gift + months_since_donate, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7569 -1.1453 -0.8249 1.1858 1.6273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.973988 0.327408 -6.029 1.65e-09 ***
## last_gift 0.014862 0.004706 3.158 0.00159 **
## months_since_donate 0.056430 0.010667 5.290 1.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3327.0 on 2399 degrees of freedom
## Residual deviance: 3276.7 on 2397 degrees of freedom
## AIC: 3282.7
##
## Number of Fisher Scoring iterations: 4
#glm_step_wAIC$call; glm_step_wBIC$call
#predict
pred.train_AIC = predict.glm(glm_step_wAIC, data = train, type = "response")
pred.train_BIC = predict.glm(glm_step_wBIC, data = train, type = "response")
#Exclusions
I did exclude a few predictors from the logistic regression model. I removed removed med_fam_inc due to 0.97 correlation with avg_fam_inc. There was also a multicollinearity between largest_gift, last_gift, avg_gift that I had consider in the model selection process. Through the logistic regression model selection step with AIC criteria a the following variables were dropped: wealth, pct_lt15k, num_prom, lifetime_gifts, and time_lag.
# tuning
set.seed(12345)
#tune =tune.svm(target ~ ., data=train, gamma = seq(0.01, 0.10, by = 0.01),cost = seq(0.10, 1.00, by = 0.10))
cost = 0.3#tune$best.parameters$cost
gamma = 0.01 #tune$best.parameters$gamma
svm.model.linear = svm(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
female +
wealth +
home_value +
med_fam_inc +
avg_fam_inc + #removed due to high correlation with med_fam_inc
avg_gift + #multi collinear with last_gift and largest_gift
pct_lt15k +
num_prom +
lifetime_gifts +
largest_gift + #multi collinear with last_gift and avg_gift
last_gift +
months_since_donate +
time_lag, data=train,
gamma = gamma,
cost = cost,
kernel='linear')
svm.model.radial = svm(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
female +
wealth +
home_value +
med_fam_inc +
avg_fam_inc + #removed due to high correlation with med_fam_inc
avg_gift + #multi collinear with last_gift and largest_gift
pct_lt15k +
num_prom +
lifetime_gifts +
largest_gift + #multi collinear with last_gift and avg_gift
last_gift +
months_since_donate +
time_lag,
data=train,
gamma = gamma,
cost = cost,
kernel='radial')
svm.model.polynomial = svm(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
female +
wealth +
home_value +
med_fam_inc +
avg_fam_inc + #removed due to high correlation with med_fam_inc
avg_gift + #multi collinear with last_gift and largest_gift
pct_lt15k +
num_prom +
lifetime_gifts +
largest_gift + #multi collinear with last_gift and avg_gift
last_gift +
months_since_donate +
time_lag, data=train,
gamma = gamma,
cost = cost,
kernel='polynomial')
svm.model.sigmoid = svm(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
female +
wealth +
home_value +
med_fam_inc +
avg_fam_inc + #removed due to high correlation with med_fam_inc
avg_gift + #multi collinear with last_gift and largest_gift
pct_lt15k +
num_prom +
lifetime_gifts +
largest_gift + #multi collinear with last_gift and avg_gift
last_gift +
months_since_donate +
time_lag,data=train,
gamma = gamma,
cost = cost,
kernel='sigmoid')
set.seed(12345)
gb.model = train(target ~ zipconvert2 +
zipconvert3 +
zipconvert4 +
zipconvert5 +
homeowner +
num_child +
income +
female +
wealth +
home_value +
med_fam_inc +
avg_fam_inc +
avg_gift +
pct_lt15k +
num_prom +
lifetime_gifts +
largest_gift +
last_gift +
months_since_donate +
time_lag+
largest_gift:months_since_donate + #influential interaction
last_gift:months_since_donate, data = train,
distribution='bernoulli',
method='gbm',
trControl=trainControl(method="cv",
number=10),
verbose=FALSE,
bag.fraction=0.75); summary(gb.model)
plot(gb.model) #visual of how many predictors are best to try
gb.model$finalModel#final model
## A gradient boosted model with bernoulli loss function.
## 50 iterations were performed.
## There were 30 predictors of which 17 had non-zero influence.
# Single Variable Partials
grid.arrange(
plot(gb.model$finalModel,i="months_since_donate"),
plot(gb.model$finalModel,i="home_value"),
plot(gb.model$finalModel,i="med_fam_inc"),
plot(gb.model$finalModel,i="num_prom"),
plot(gb.model$finalModel,i="lifetime_gifts"),
plot(gb.model$finalModel,i="avg_gift"))
set.seed(12345)
knn.train=data.frame(cbind(train[,6],train[,7], train[,9:21]))
knn.test=data.frame(cbind(train[,6],train[,7], train[,9:21]))
knn.preds <- knn(train = knn.train[,1:14], test = knn.test[,1:14], cl = knn.train[,15], k=3)
round(caret::confusionMatrix(table(knn.test[,15], knn.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.742 0.483 0.724 0.759 0.515
## AccuracyPValue McnemarPValue
## 0.000 0.315
(2.2.1) The first modeling method that I chose was logistic classification models using glm(), one of the models utilized step-wise model selection with AIC criteria, direction=“both”, test=“Chisq”, trace = F. On the the second model I used step-wise model selection with BIC criteria, direction=“both”, test=“Chisq”, trace = F, k=log(nrow(train). The only parameter needed for Logistic Regression in R with glm() is family = binomial. The variables were chosen through the AIC criteria were: ~ zipconvert2 + zipconvert3 + zipconvert4 + zipconvert5 + num_child + income + home_value + avg_fam_inc + last_gift + months_since_donate + zipconvert2:income + income:home_value + income:avg_fam_inc. The variables chosen with BIC stepwise selection were: ~ last_gift + months_since_donate
(2.2.2) The second modeling method I chose was k-Nearest Neighbors (knn) Classification using class::knn(). I used data.frame(cbind(train[,6],train[,7], train[,9:21])) for the data frame of training set cases argument, data.frame(cbind(train[,6],train[,7], train[,9:21])) for my data frame of test set cases argument, cl = train[,21] for the factor of true classifications of training set, and the number of neighbors considered was set at k=10
(2.3) Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?
(2.3) Comment: If the response rate is 13%, and a data set is created from this activity then the created dataset will show more target == ‘No Donor’ than target == ‘Donor’. A random sample from this dataset would result in a subset that is highly skewed towards target == ‘No Donor’. If the model is fit on this unbalanced data it will lack the training to find our target == ‘Donor’ on new data whose behavior is varied. Using weighted sampling to produce a training set with equal numbers of donors and non-donors can prevent prevent overfilling on an unbalanced training data set. Although we were told the response rate is 13% in my data discovery found that the given data set has a balanced response.
Examine the out of sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
#evaluate on train
par(mfrow=c(1,2))
Roc.train.AIC = roc(train$target ~ pred.train_AIC, plot = T, print.auc = T)
Roc.train.BIC = roc(train$target ~ pred.train_BIC, plot = T, print.auc = T)
#pred.train_AIC = predict.glm(glm_step_wAIC, data = train, type = "response")
#pred.train_BIC = predict.glm(glm_step_wBIC, data = train, type = "response")
#yhat.train_AIC= as.factor(ifelse(pred.train_AIC >= .4, "No Donor", "Donor"))
#yhat.train_BIC = as.factor(ifelse(pred.train_AIC >= .5, "No Donor", "Donor"))
#round(caret::confusionMatrix(table(train$target, yhat.train_AIC))$overall, digits = 3)
#round(caret::confusionMatrix(table(train$target, yhat.train_BIC))$overall, digits = 3)
# Train Results:
# glm_step_wAIC: Train AUC= 0.605, AIC: 3272.2
# glm_step_wBIC: Train AUC= 0.585, AIC: 3282.7
#evaluate on test
pred.test_AIC = predict.glm(glm_step_wAIC, newdata = test, type = "response")
pred.test_BIC = predict.glm(glm_step_wBIC, newdata = test, type = "response")
yhat.test_AIC= as.factor(ifelse(pred.test_AIC >= .4, "No Donor", "Donor"))
yhat.test_BIC = as.factor(ifelse(pred.test_AIC >= .4, "No Donor", "Donor"))
round(caret::confusionMatrix(table(test$target, yhat.test_BIC))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.557 0.089 0.516 0.597 0.893
## AccuracyPValue McnemarPValue
## 1.000 0.000
# Test Results:
# glm_step_wAIC: Test Accuracy: 0.557
# glm_step_wBIC: Test Accuracy: 0.548
#Best glm is: glm_step_wAIC
#GLM Model performance and Validation Results. On the glm() model after performing a stepwise selection with AIC & BIC criteria, I plotted ROC for the AIC and BIC glm() versions. With the ROC’s I was hoping to get an idea of the performance of the classification model while still working on the fit on test. The resulting AUC were Results: 0.605 for the AIC glm() and 0.585 for the BIC glm(). I decided to move forward with the AIC version and next attempted to add interactions to improve the fit. I did find that adding the following interactions improved the model slightly: months_since_donate + zipconvert2:income + income:home_value + income:avg_fam_inc. I reran a model selection to see if any other predictors could be dropped but all were kept. My validation results on the glm() was a test accuracy of 0.548.
#Cut-Off Analysis The only other adjustment I could perform on my glm() was to try various classification cut offs to see if the accuracy improved. I manually plugged and played with cut offs between 0-1. I found the lowest error rate and minimum cost occurred at a cut off of 0.40.
svm_yhat.linear = predict(svm.model.linear, newdata=test, type ="response")
svm_yhat.radial = predict(svm.model.radial, newdata=test, type ="response")
svm_yhat.polynomial = predict(svm.model.polynomial, newdata=test, type ="response")
svm_yhat.sigmoid = predict(svm.model.sigmoid, newdata=test, type ="response")
round(confusionMatrix(reference=as.factor(test$target), data=as.factor(svm_yhat.linear), positive = 'Donor')$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.535 0.075 0.494 0.575 0.517
## AccuracyPValue McnemarPValue
## 0.196 0.000
round(confusionMatrix(reference=as.factor(test$target), data=as.factor(svm_yhat.radial), positive = 'Donor')$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.550 0.109 0.509 0.590 0.517
## AccuracyPValue McnemarPValue
## 0.055 0.000
round(confusionMatrix(reference=as.factor(test$target), data=as.factor(svm_yhat.polynomial), positive = 'Donor')$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.487 0.006 0.446 0.527 0.517
## AccuracyPValue McnemarPValue
## 0.935 0.000
round(confusionMatrix(reference=as.factor(test$target), data=as.factor(svm_yhat.sigmoid), positive = 'Donor')$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.548 0.105 0.508 0.589 0.517
## AccuracyPValue McnemarPValue
## 0.065 0.000
#svm with radial kernel is the best svm model
gb.train.preds <- predict(gb.model, type = 'raw', data=train, n.trees=50)
gb.preds <- predict(gb.model, type = 'raw', newdata=test, n.trees=50)
round(caret::confusionMatrix(table(train$target, gb.train.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.513 0.019 0.493 0.533 0.988
## AccuracyPValue McnemarPValue
## 1.000 0.000
round(caret::confusionMatrix(table(test$target, gb.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.560 0.121 0.519 0.600 0.513
## AccuracyPValue McnemarPValue
## 0.012 0.295
knn.preds <- knn(train = knn.train[,1:14], test = knn.test[,1:14], cl = knn.train[,15], k=10)
#results
round(caret::confusionMatrix(table(knn.test[,15], knn.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.619 0.238 0.599 0.639 0.517
## AccuracyPValue McnemarPValue
## 0.000 0.305
#KNN Model performance and Validation Results. The knn model did not require much adjustment other making sure the train and test subsets only included numerical predictors and I also tried various k’s. I started with k=10 and took steps down and kept track of the test accuracy. I found a maximum validation accuracy at k=3 and test accuracy at 0.742.
#2.5 Select best model
From your answer in (4), what do you think is the “best” model?
print("GLM:")
## [1] "GLM:"
round(caret::confusionMatrix(table(test$target, yhat.test_AIC))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.557 0.089 0.516 0.597 0.893
## AccuracyPValue McnemarPValue
## 1.000 0.000
print("SVM:")
## [1] "SVM:"
round(confusionMatrix(reference=as.factor(test$target), data=as.factor(svm_yhat.radial), positive = 'Donor')$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.550 0.109 0.509 0.590 0.517
## AccuracyPValue McnemarPValue
## 0.055 0.000
print("GBM:")
## [1] "GBM:"
round(caret::confusionMatrix(table(test$target, gb.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.560 0.121 0.519 0.600 0.513
## AccuracyPValue McnemarPValue
## 0.012 0.295
print("KNN:")
## [1] "KNN:"
round(caret::confusionMatrix(table(knn.test[,15], knn.preds))$overall, digits = 3)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.619 0.238 0.599 0.639 0.517
## AccuracyPValue McnemarPValue
## 0.000 0.305
(2.5) Answer: From my answer in (2.4), my “best” model is my knn model with a test accuracy of 0.742.
table(knn.test[,15], knn.preds)
## knn.preds
## Donor No Donor
## Donor 768 441
## No Donor 473 718
#Recommendations
The goal of this task was to develop a predictive model to improve the cost effectiveness a national veterans’ organization direct marketing campaign. According to the detail provided on the current way the organization run’s it’s marketing campaign 5.1% of respondents choose to donate. Assuming they send a mailer to everyone on the mailing list this would cost about 8.8M dollars, and only provide a donation revenue of 8.6M dollars, creating a loss of 0.2M dollars. I found that using my glm() model to target mailings the organization would only have to spend 0.9M dollars sending out 90% less mailings and earning and estimated 11.5M dollars in donations. This solution would work well if the campaign budgeting were limited. If the campaign had a bigger budget the organization could earn more on there investment by utilizing my KNN model where they would still send out about half of the mailings, they normally would but earning 59.7M dollars in donation profits.
#STEP 3: Testing
(3.6)The file FutureFundraising.csv contains the attributes for future mailing candidates. Using your “best” model from (2.4), which of these candidates do you predict as donors and non-donors? Use your best model and predict whether the candidate will be a donor or not. Upload your prediction to the leaderboard and comment on the result.
set.seed(12345)
knn.train=data.frame(cbind(train[,6],train[,7], train[,9:21]))
knn.test=data.frame(cbind(future_fundraising[,6],future_fundraising[,7], future_fundraising[,9:20]))
knn.preds <- knn(train = knn.train[,1:14], test = knn.test[,1:14], cl = knn.train[,15], k=3)
knn.preds
## [1] No Donor Donor No Donor No Donor No Donor No Donor Donor No Donor
## [9] Donor Donor Donor Donor No Donor Donor No Donor No Donor
## [17] Donor No Donor Donor No Donor Donor No Donor Donor No Donor
## [25] No Donor Donor No Donor Donor No Donor Donor No Donor Donor
## [33] No Donor Donor Donor No Donor No Donor Donor Donor Donor
## [41] No Donor Donor No Donor No Donor Donor Donor No Donor No Donor
## [49] Donor Donor No Donor No Donor No Donor No Donor No Donor No Donor
## [57] Donor No Donor Donor Donor Donor No Donor Donor No Donor
## [65] Donor Donor No Donor Donor No Donor Donor No Donor No Donor
## [73] Donor No Donor Donor No Donor No Donor No Donor Donor No Donor
## [81] No Donor Donor Donor No Donor No Donor Donor No Donor Donor
## [89] Donor Donor No Donor No Donor No Donor No Donor No Donor Donor
## [97] Donor No Donor No Donor No Donor Donor Donor Donor Donor
## [105] No Donor Donor Donor No Donor Donor No Donor No Donor Donor
## [113] No Donor No Donor Donor Donor Donor Donor Donor No Donor
## Levels: Donor No Donor
(3.7)Submission File. For each row in the test set, you must predict whether or not the candidate is a donor or not. The .csv file should contain a header and have the following format: value Donor Donor No Donor Donor No Donor No Donor . . . etc.
write.csv(knn.preds, file = "~/STA6543 Algorithms II/fundraising2.csv")
(3.7)File Submitted