The main objective of this analysis is to reduce the overall cost of advertising and maximize profits from donors for the national veterans organization.
Following are the goals to achieve:
- Identify people who are likely to be donors
- Lower the number of gifts/mailing labels sent
- Increase the overall response rate
Fundraising.rds and future_fundraising.rds are the datasets used for this case study. The fundraising dataset has 3,000 observations and 21 variables. Per the study, the latest response rate is 5.1%; the supplied sample contains 50% donors and 50% non-donors. The output variable is "target", a binary factor with levels Donor and No Donor.
We split the dataset into a train and test set using an 80/20 split.
#Load library
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
## The following object is masked from 'package:dplyr':
##
## select
library(ROCR)
Import dataset
set.seed(12345)
data <- readRDS("fundraising.rds")
Explore dataset
head(data)
## # A tibble: 6 x 21
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child income
## <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 Yes No No No Yes 1 1
## 2 No No No Yes No 2 5
## 3 No No No Yes Yes 1 3
## 4 No Yes No No Yes 1 4
## 5 No Yes No No Yes 1 4
## 6 No No No Yes Yes 1 4
## # … with 14 more variables: female <fct>, wealth <dbl>, home_value <dbl>,
## # med_fam_inc <dbl>, avg_fam_inc <dbl>, pct_lt15k <dbl>, num_prom <dbl>,
## # lifetime_gifts <dbl>, largest_gift <dbl>, last_gift <dbl>,
## # months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>, target <fct>
summary(data)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
str(data)
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
Check missing values
sum(is.na(data))
## [1] 0
Check Cook's distance diagnostics to find outliers
cooks_distance <- cooks.distance(glm(target ~ ., family = "binomial", data = data))
plot(cooks_distance,
pch="*",
cex=2,
main="Influencial Observations using Cooks distance")
abline(h = 4*mean(cooks_distance, na.rm=T), col="red")
Find the outlier rows
outliers <- rownames(data[cooks_distance > 4*mean(cooks_distance, na.rm=T), ])
print(outliers)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
Remove influential points
influential <- as.numeric(names(cooks_distance)[(cooks_distance > (4/nrow(data)))])
data.no_outliers <- data[-influential, ]
cooks_distance.no_outliers <- cooks.distance(glm(target ~ ., family = "binomial",
data =data.no_outliers))
plot(cooks_distance.no_outliers,
pch="*",
cex=2,
main="Influential Observations using Cooks distance")
abline(h = 4*mean(cooks_distance.no_outliers, na.rm=T), col="red")
We observe from the plots above that, after removing outliers, the influential observations are minimized and all have a Cook's distance below 0.01. The dataset data.no_outliers no longer contains these influential points.
Split the dataset into train and test sets using an 80/20 split
smp_size <- floor(0.80 * nrow(data.no_outliers))
train_ind <- sample(seq_len(nrow(data.no_outliers)), size = smp_size)
train <- data.no_outliers[train_ind, ]
test <- data.no_outliers[-train_ind, ]
We use classification models in this case study because they can analyze the predictor variables in order to predict the output variable. The output variable is "target", a binary factor with levels Donor and No Donor. Below are the classification methods used:
Logistic Regression is chosen since it is a predictive model that uses a logistic function to predict a binary target. It can explain the relationship between the predictor variables in the dataset and the target variable.
Linear Discriminant Analysis is chosen since it is a predictive model that reduces dimensionality, projecting a high-dimensional dataset onto a lower-dimensional space. The goal is to reduce resources and improve separation between classes. Linear discriminant analysis can only learn linear boundaries.
Quadratic Discriminant Analysis is similar to linear discriminant analysis; the difference is that QDA can learn quadratic decision boundaries.
K Nearest Neighbors is a non-parametric technique that identifies which data points are most similar. KNN classifies each point based on the classes of its nearest neighbors.
Random Forest is an ensemble of decision trees. Each tree is evaluated for a row of data (a set of predictor values), and the trees' votes are combined to predict the target.
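Each of these methods is fit later in this report with caret. As a quick orientation, here is a minimal sketch of the common train/predict pattern; the method string is the only part that changes between models, and this is a simplified preview rather than the exact calls used below.
## Minimal sketch of the caret workflow reused for each model below.
## Swap the method string ("glm", "lda", "qda", "knn", "rf") to change models.
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
fit  <- train(target ~ ., data = train, method = "glm", trControl = ctrl)
pred <- predict(fit, newdata = test)               # class predictions on the test set
confusionMatrix(pred, test$target)                 # accuracy, kappa, sensitivity, ...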
We exclude some variables and rows from the dataset in order to improve accuracy and reduce the processing needed.
We apply a logistic regression model to determine the significant variables. This model finds that num_child and months_since_donate are significant.
set.seed(12345)
## Define the control
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
## Train the model
model <- train(target ~ ., data = train, method = "glm",
trControl = train.control, family = binomial())
## Summarize the results
print(model)
## Generalized Linear Model
##
## 2371 samples
## 20 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5518111 0.101479
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7786 -1.1333 -0.8234 1.1794 1.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.841e+00 5.162e-01 -3.566 0.000363 ***
## zipconvert2Yes -1.289e+01 2.296e+02 -0.056 0.955220
## zipconvert3No 1.279e+01 2.296e+02 0.056 0.955565
## zipconvert4Yes -1.286e+01 2.296e+02 -0.056 0.955335
## zipconvert5Yes -1.281e+01 2.296e+02 -0.056 0.955510
## homeownerNo 8.828e-02 1.058e-01 0.835 0.403982
## num_child 2.497e-01 1.212e-01 2.060 0.039363 *
## income -5.073e-02 2.919e-02 -1.738 0.082193 .
## femaleNo 5.922e-03 8.654e-02 0.068 0.945445
## wealth -1.107e-02 2.023e-02 -0.547 0.584267
## home_value -1.339e-04 8.102e-05 -1.652 0.098459 .
## med_fam_inc -8.152e-04 1.092e-03 -0.747 0.455164
## avg_fam_inc 1.330e-03 1.170e-03 1.137 0.255596
## pct_lt15k -3.318e-04 4.930e-03 -0.067 0.946344
## num_prom -5.221e-04 3.080e-03 -0.170 0.865397
## lifetime_gifts -8.155e-04 6.998e-04 -1.165 0.243904
## largest_gift 8.936e-03 9.746e-03 0.917 0.359178
## last_gift 5.157e-03 1.025e-02 0.503 0.614924
## months_since_donate 5.071e-02 1.135e-02 4.467 7.94e-06 ***
## time_lag -1.003e-02 7.640e-03 -1.312 0.189422
## avg_gift 1.090e-02 1.371e-02 0.795 0.426736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3286.3 on 2370 degrees of freedom
## Residual deviance: 3216.0 on 2350 degrees of freedom
## AIC: 3258
##
## Number of Fisher Scoring iterations: 11
We check collinearity in the training dataset using the ols_vif_tol function on a linear model to evaluate the VIF of each variable. We see that the zipconvert variables are collinear with each other, and that med_fam_inc and avg_fam_inc are collinear.
lm.fit <- lm(target ~ ., data = train)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
ols_vif_tol(lm.fit)
## Variables Tolerance VIF
## 1 zipconvert2Yes 0.004958388 201.678462
## 2 zipconvert3No 0.005450314 183.475656
## 3 zipconvert4Yes 0.004935004 202.634097
## 4 zipconvert5Yes 0.003533760 282.984674
## 5 homeownerNo 0.884834290 1.130155
## 6 num_child 0.973707965 1.027002
## 7 income 0.762674021 1.311176
## 8 femaleNo 0.978004638 1.022490
## 9 wealth 0.650947928 1.536221
## 10 home_value 0.306707373 3.260437
## 11 med_fam_inc 0.050407582 19.838286
## 12 avg_fam_inc 0.045626689 21.916997
## 13 pct_lt15k 0.483124359 2.069860
## 14 num_prom 0.396036787 2.525018
## 15 lifetime_gifts 0.386765349 2.585547
## 16 largest_gift 0.180423638 5.542511
## 17 last_gift 0.191421345 5.224078
## 18 months_since_donate 0.852099106 1.173572
## 19 time_lag 0.963489097 1.037894
## 20 avg_gift 0.228258158 4.381004
Because the zipconvert variables show collinearity and are insignificant, we remove them from the training dataset. We also see that avg_fam_inc has a higher VIF than med_fam_inc, so we remove it as well.
train.sub = subset(train, select = -c(1:4, 12))
test.sub = subset(test, select = -c(1:4, 12))
lm.fit.new <- lm(target ~ ., data = train.sub)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
ols_vif_tol(lm.fit.new)
## Variables Tolerance VIF
## 1 homeownerNo 0.8896507 1.124037
## 2 num_child 0.9748986 1.025748
## 3 income 0.7670144 1.303756
## 4 femaleNo 0.9813177 1.019038
## 5 wealth 0.6599391 1.515291
## 6 home_value 0.4377664 2.284324
## 7 med_fam_inc 0.2783527 3.592564
## 8 pct_lt15k 0.5120119 1.953080
## 9 num_prom 0.3984120 2.509965
## 10 lifetime_gifts 0.3881044 2.576626
## 11 largest_gift 0.1806960 5.534155
## 12 last_gift 0.1915438 5.220739
## 13 months_since_donate 0.8540023 1.170957
## 14 time_lag 0.9665327 1.034626
## 15 avg_gift 0.2286243 4.373987
A Random Forest model is used to determine the importance of the variables in the dataset. To minimize processing time we use the test dataset. We also plot a graph ranking the variables by importance.
The top 3 most important variables are: home_value, med_fam_inc, lifetime_gifts.
The 3 least important variables are: num_child, homeowner, female.
var <- train(target~., data = test.sub)
varImp <- varImp(var)
plot(varImp)
Based on p-values, collinearity, and variable importance, we select the following variables for our models:
home_value, avg_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag, months_since_donate, largest_gift, income
### 2. Apply cross-validated logistic regression model
set.seed(12345)
glm.fit <- train(target ~
home_value + avg_gift + lifetime_gifts + med_fam_inc +
num_prom + pct_lt15k + time_lag + months_since_donate +
largest_gift + income,
data = train.sub, method = "glm", family = "binomial",
trControl = trainControl(method = "cv"))
glm.fit
## Generalized Linear Model
##
## 2371 samples
## 10 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5520585 0.1017809
Prediction and Accuracy
pred.glm <- predict(glm.fit,test.sub)
cm.glm <- confusionMatrix(pred.glm,test.sub$target)
cm.glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 188 166
## No Donor 106 169
##
## Accuracy : 0.5676
## 95% CI : (0.5278, 0.6067)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : 0.042686
##
## Kappa : 0.1422
##
## Mcnemar's Test P-Value : 0.000347
##
## Sensitivity : 0.6395
## Specificity : 0.5045
## Pos Pred Value : 0.5311
## Neg Pred Value : 0.6145
## Prevalence : 0.4674
## Detection Rate : 0.2989
## Detection Prevalence : 0.5628
## Balanced Accuracy : 0.5720
##
## 'Positive' Class : Donor
##
overall.glm <- cm.glm$overall
glm.acc <- overall.glm[['Accuracy']]
glm.err <- 1-glm.acc
The glm function may also be used to train the model and predict donors. It estimates the probability that a person is a donor or not, and an optimal cutoff value can then be chosen to increase accuracy and minimize costs.
glm.fit1 <- glm(target~
home_value + avg_gift + lifetime_gifts + med_fam_inc +
num_prom + pct_lt15k + time_lag + months_since_donate +
largest_gift + income,
data = train.sub, family=binomial)
glm.fit1
##
## Call: glm(formula = target ~ home_value + avg_gift + lifetime_gifts +
## med_fam_inc + num_prom + pct_lt15k + time_lag + months_since_donate +
## largest_gift + income, family = binomial, data = train.sub)
##
## Coefficients:
## (Intercept) home_value avg_gift
## -1.5200667 -0.0001107 0.0127765
## lifetime_gifts med_fam_inc num_prom
## -0.0008350 0.0003004 -0.0005177
## pct_lt15k time_lag months_since_donate
## -0.0008063 -0.0092565 0.0504394
## largest_gift income
## 0.0118111 -0.0533687
##
## Degrees of Freedom: 2370 Total (i.e. Null); 2360 Residual
## Null Deviance: 3286
## Residual Deviance: 3227 AIC: 3249
Determine the predicted probabilities from the logistic regression model. Note that, given the factor contrasts shown below, these are probabilities of being a No Donor.
glm.probs = predict(glm.fit1, type = "response")
## Number of fitted probabilities (one per training observation)
glm.n <- length(glm.probs)
contrasts(train.sub$target)
## No Donor
## Donor 0
## No Donor 1
Label every observation as Donor, then relabel those with a predicted probability above the 54% cutoff as No Donor.
## Convert predictions to "Donor"
glm.pred=rep("Donor", glm.n)
## Use a cutoff value of 54%. If above 54% then label as "No Donor"
glm.pred[glm.probs > .54]= "No Donor"
Create confusion matrix and calculate accuracy
d <- table(as.factor(glm.pred), train.sub$target)
glm.acc.2 <- sum(diag(d))/sum(d)
glm.acc.2
## [1] 0.5894448
The loop below steps through cutoff values from 0.01 to 1 in increments of 0.01 and records the accuracy at each cutoff in order to find the optimal one. As seen in the plot, a cutoff value of 54% is optimal.
a <- numeric(100)
for (i in 1:100) {
  glm.pred <- rep("Donor", glm.n)
  glm.pred[glm.probs > i/100] <- "No Donor"
  d <- table(glm.pred, train.sub$target)
  a[i] <- sum(diag(d))/sum(d)   # accuracy at this cutoff
}
print(a)
## [1] 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756
## [8] 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756
## [15] 0.4917756 0.4921974 0.4921974 0.4921974 0.4921974 0.4926191 0.4926191
## [22] 0.4928270 0.4928209 0.4928209 0.4930233 0.4921776 0.4928088 0.4928027
## [29] 0.4940628 0.4963876 0.4976566 0.4987179 0.4995716 0.5015028 0.5030146
## [36] 0.5000000 0.5056375 0.5076186 0.5102845 0.5105820 0.5126949 0.5196744
## [43] 0.5225806 0.5312500 0.5412621 0.5466936 0.5482850 0.5678060 0.5699422
## [50] 0.5730607 0.5861423 0.5865633 0.5864362 0.5894448 0.5849190 0.5821868
## [57] 0.5772660 0.5683891 0.5654531 0.5573123 0.5525261 0.5444984 0.5418719
## [64] 0.5376968 0.5349418 0.5313283 0.5291139 0.5270728 0.5279661 0.5284139
## [71] 0.5288625 0.5288625 0.5305603 0.5310110 0.5306122 0.5297619 0.5311168
## [78] 0.5298635 0.5299145 0.5303678 0.5299658 0.5299658 0.5299658 0.5299658
## [85] 0.5299658 0.5299658 0.5295630 0.5295630 0.5295630 0.5295630 0.5295630
## [92] 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405
## [99] 0.4708405 0.4708405
plot(a, type = "b")
points(x = which.max(a),
y = max(a),
col = "red",
pch = "X"
)
text(x = which.max(a)+5,
y = max(a),
which.max(a)/100,
col = "blue",
pch = "X"
)
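Beyond raw accuracy, a cutoff could also be judged in expected-profit terms. Below is a minimal sketch, assuming a hypothetical mailing cost per solicitation and using the training set's mean avg_gift as a stand-in for the value of a donation; neither figure comes from the case study brief.
## Sketch only: expected profit of mailing everyone predicted to be a donor at a given cutoff.
## mail_cost is a hypothetical placeholder; avg_donation is a rough proxy, not a supplied figure.
mail_cost    <- 0.68
avg_donation <- mean(train.sub$avg_gift)
cutoff <- 0.54
mailed       <- glm.probs <= cutoff              # glm.probs is P(No Donor), so low values mean likely donors
actual_donor <- train.sub$target == "Donor"
profit <- sum(mailed & actual_donor) * avg_donation - sum(mailed) * mail_cost
profit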
We observe that the cross-validated logistic regression model fit with train() shows an accuracy of 56.76% on the test set. The logistic regression model fit with the glm function and a cutoff of 54% returns an accuracy of 58.94% on the training set.
set.seed(12345)
lda.fit <- train(target ~
home_value + avg_gift + lifetime_gifts + med_fam_inc +
num_prom + pct_lt15k + time_lag + months_since_donate +
largest_gift + income,
data = train.sub, method = "lda",
trControl = trainControl(method = "cv"))
lda.fit
## Linear Discriminant Analysis
##
## 2371 samples
## 10 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5491049 0.09586161
Get prediction and accuracy
pred.lda<-predict(lda.fit,test.sub)
cm.lda <- confusionMatrix(pred.lda,test.sub$target)
cm.lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 187 166
## No Donor 107 169
##
## Accuracy : 0.566
## 95% CI : (0.5262, 0.6051)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : 0.0504923
##
## Kappa : 0.1388
##
## Mcnemar's Test P-Value : 0.0004476
##
## Sensitivity : 0.6361
## Specificity : 0.5045
## Pos Pred Value : 0.5297
## Neg Pred Value : 0.6123
## Prevalence : 0.4674
## Detection Rate : 0.2973
## Detection Prevalence : 0.5612
## Balanced Accuracy : 0.5703
##
## 'Positive' Class : Donor
##
overall.lda <- cm.lda$overall
lda.acc <- overall.lda[['Accuracy']]
lda.err <- 1-lda.acc
LDA gives an accuracy of 56.6%
set.seed(12345)
qda.fit <- train(target ~
home_value + avg_gift + lifetime_gifts + med_fam_inc +
num_prom + pct_lt15k + time_lag + months_since_donate +
largest_gift + income,
data = train.sub, method = "qda",
trControl = trainControl(method = "cv"))
qda.fit
## Quadratic Discriminant Analysis
##
## 2371 samples
## 10 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5098749 0.02744165
Get prediction and accuracy
pred.qda<-predict(qda.fit,test.sub)
cm.qda <- confusionMatrix(pred.qda,test.sub$target)
cm.qda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 91 75
## No Donor 203 260
##
## Accuracy : 0.558
## 95% CI : (0.5182, 0.5973)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : 0.1076
##
## Kappa : 0.088
##
## Mcnemar's Test P-Value : 2.597e-14
##
## Sensitivity : 0.3095
## Specificity : 0.7761
## Pos Pred Value : 0.5482
## Neg Pred Value : 0.5616
## Prevalence : 0.4674
## Detection Rate : 0.1447
## Detection Prevalence : 0.2639
## Balanced Accuracy : 0.5428
##
## 'Positive' Class : Donor
##
overall.qda <- cm.qda$overall
qda.acc <- overall.qda[['Accuracy']]
qda.err <- 1-qda.acc
QDA gives an accuracy of 55.8%
set.seed(12345)
knn.fit <- train(target ~
home_value + avg_gift + lifetime_gifts + med_fam_inc +
num_prom + pct_lt15k + time_lag + months_since_donate +
largest_gift + income,
data = train.sub, method = "knn",
trControl = trainControl(method = "cv"),
tuneLength=20)
knn.fit
## k-Nearest Neighbors
##
## 2371 samples
## 10 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.4879480 -0.024389201
## 7 0.4955269 -0.009627793
## 9 0.4985019 -0.003455210
## 11 0.5014590 0.002724883
## 13 0.4955572 -0.009367955
## 15 0.4951566 -0.010144926
## 17 0.4905098 -0.019599854
## 19 0.4892511 -0.022413272
## 21 0.4922190 -0.016297474
## 23 0.4922136 -0.016398632
## 25 0.4896944 -0.021599947
## 27 0.4922100 -0.016573939
## 29 0.4812538 -0.039045399
## 31 0.4926356 -0.015978378
## 33 0.4829273 -0.035397914
## 35 0.4862940 -0.028682641
## 37 0.4867248 -0.028326526
## 39 0.4808337 -0.040210875
## 41 0.4846347 -0.032556763
## 43 0.4728185 -0.056362993
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
Get prediction and accuracy
pred.knn<-predict(knn.fit,test.sub)
cm.knn <- confusionMatrix(pred.knn,test.sub$target)
cm.knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 156 172
## No Donor 138 163
##
## Accuracy : 0.5072
## 95% CI : (0.4673, 0.5469)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : 0.90626
##
## Kappa : 0.0171
##
## Mcnemar's Test P-Value : 0.06089
##
## Sensitivity : 0.5306
## Specificity : 0.4866
## Pos Pred Value : 0.4756
## Neg Pred Value : 0.5415
## Prevalence : 0.4674
## Detection Rate : 0.2480
## Detection Prevalence : 0.5215
## Balanced Accuracy : 0.5086
##
## 'Positive' Class : Donor
##
overall.knn <- cm.knn$overall
knn.acc <- overall.knn[['Accuracy']]
knn.err <- 1-knn.acc
The KNN model gives an accuracy of 50.72%.
library(rpart)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(MASS)
library(rpart.plot)
set.seed(12345)
tree.fund = rpart(target~., data = train, method = "class", control=rpart.control(minsplit=15, cp=0.01))
fancyRpartPlot(tree.fund)
printcp(tree.fund)
##
## Classification tree:
## rpart(formula = target ~ ., data = train, method = "class", control = rpart.control(minsplit = 15,
## cp = 0.01))
##
## Variables actually used in tree construction:
## [1] largest_gift
##
## Root node error: 1166/2371 = 0.49178
##
## n= 2371
##
## CP nsplit rel error xerror xstd
## 1 0.10463 0 1.00000 1.00000 0.020877
## 2 0.01000 1 0.89537 0.93225 0.020808
Fit a random forest with 168 trees and mtry = 1.
rf1 = randomForest(target~., data = train, ntree = 168, mtry = 1, importance=TRUE)
rf1
##
## Call:
## randomForest(formula = target ~ ., data = train, ntree = 168, mtry = 1, importance = TRUE)
## Type of random forest: classification
## Number of trees: 168
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 44.5%
## Confusion matrix:
## Donor No Donor class.error
## Donor 789 416 0.3452282
## No Donor 639 527 0.5480274
Find the number of trees that minimizes the OOB error rate.
rf1.err <- rf1$err.rate[,1]
rf1.acc <- 1 - rf1.err
which.min(rf1.err)
## [1] 168
plot(rf1)
print(paste("Accuracy:", max(rf1.acc)))
## [1] "Accuracy: 0.555040067482075"
Random Forest gives an accuracy of 55.50%. The rattle plot also shows the most important indicator to be largest_gift.
As noted earlier, the fundraising dataset has 3,000 observations and 21 variables, with 50% donors and 50% non-donors. Because the underlying response rate is only 5.1%, the raw mailing data suffer from class bias: the proportion of donors is much smaller than that of non-donors. To combat this class bias, the observations were sampled in approximately equal proportions, which yields a better model.
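For reference, here is a minimal sketch of how such a balanced sample could be drawn from a raw mailing file. The raw_file data frame and its target column are hypothetical; fundraising.rds is already supplied in balanced form, so this step is not rerun here.
## Sketch only: draw an approximately 50/50 sample from a hypothetical raw mailing file
## in which donors make up roughly 5.1% of rows.
set.seed(12345)
donors      <- raw_file[raw_file$target == "Donor", ]
non_donors  <- raw_file[raw_file$target == "No Donor", ]
n_per_class <- min(nrow(donors), nrow(non_donors))
balanced <- rbind(donors[sample(nrow(donors), n_per_class), ],
                  non_donors[sample(nrow(non_donors), n_per_class), ])
table(balanced$target)   # roughly equal counts of Donor and No Donor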
Comparison table of error rates for all models used
err.table <- matrix(c(glm.err, lda.err, qda.err, knn.err, min(rf1.err)), ncol=1, byrow = TRUE)
colnames(err.table) <- c("Sample Error")
rownames(err.table) <- c("Logistic Regression", "LDA", "QDA", "KNN", "Random Forest")
err.table <- as.table(err.table)
err.table
## Sample Error
## Logistic Regression 0.4324324
## LDA 0.4340223
## QDA 0.4419714
## KNN 0.4928458
## Random Forest 0.4449599
mycols = c("grey", "blue", "black", "red", "green", "brown")
barplot(err.table, beside=T,main='Out of Sample Error of Models', horiz = FALSE, legend = T, ylim = c(0, 1.0),
args.legend = list(bty = "n", x = "top", ncol = 2), col = mycols)
Looking at the bar chart above and the table of error rates for each model, we find that logistic regression appears to be the best model.
Import the future_fundraising dataset and score it with the logistic regression model.
testing <- readRDS("future_fundraising.rds")
pred.testing <- predict(glm.fit, testing)
df <- data.frame(value=pred.testing)
df
## value
## 1 Donor
## 2 No Donor
## 3 Donor
## 4 No Donor
## 5 Donor
## 6 No Donor
## 7 No Donor
## 8 Donor
## 9 Donor
## 10 No Donor
## 11 No Donor
## 12 Donor
## 13 Donor
## 14 Donor
## 15 Donor
## 16 Donor
## 17 No Donor
## 18 Donor
## 19 Donor
## 20 No Donor
## 21 No Donor
## 22 Donor
## 23 No Donor
## 24 No Donor
## 25 Donor
## 26 Donor
## 27 Donor
## 28 No Donor
## 29 No Donor
## 30 Donor
## 31 Donor
## 32 No Donor
## 33 No Donor
## 34 Donor
## 35 No Donor
## 36 No Donor
## 37 Donor
## 38 No Donor
## 39 Donor
## 40 Donor
## 41 Donor
## 42 No Donor
## 43 Donor
## 44 No Donor
## 45 No Donor
## 46 Donor
## 47 Donor
## 48 No Donor
## 49 No Donor
## 50 No Donor
## 51 No Donor
## 52 No Donor
## 53 Donor
## 54 Donor
## 55 No Donor
## 56 No Donor
## 57 Donor
## 58 No Donor
## 59 Donor
## 60 Donor
## 61 Donor
## 62 Donor
## 63 No Donor
## 64 No Donor
## 65 No Donor
## 66 Donor
## 67 Donor
## 68 No Donor
## 69 Donor
## 70 Donor
## 71 No Donor
## 72 No Donor
## 73 No Donor
## 74 No Donor
## 75 Donor
## 76 Donor
## 77 No Donor
## 78 No Donor
## 79 No Donor
## 80 Donor
## 81 No Donor
## 82 Donor
## 83 No Donor
## 84 Donor
## 85 No Donor
## 86 Donor
## 87 No Donor
## 88 Donor
## 89 No Donor
## 90 Donor
## 91 No Donor
## 92 No Donor
## 93 Donor
## 94 No Donor
## 95 No Donor
## 96 Donor
## 97 Donor
## 98 Donor
## 99 Donor
## 100 No Donor
## 101 No Donor
## 102 Donor
## 103 No Donor
## 104 No Donor
## 105 Donor
## 106 No Donor
## 107 Donor
## 108 Donor
## 109 Donor
## 110 Donor
## 111 Donor
## 112 Donor
## 113 No Donor
## 114 No Donor
## 115 Donor
## 116 No Donor
## 117 Donor
## 118 Donor
## 119 Donor
## 120 Donor
Write the output predictions to a CSV file using the write.csv function.
write.csv(df, "C:\\Users\\shamstabrez\\Documents\\Algo\\Homework1\\predict_fundraising.csv", row.names = FALSE)
We find that logistic regression is the best classification method for accurately predicting who will be a donor. The most important factors for predicting a donor are a high average gift amount, a high home value, and a high income. Accordingly, targeted advertising to prospects with a high average gift amount is recommended, followed by those with high home values.
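If the organization wants to mail only a limited number of prospects, the same caret fit can return class probabilities rather than hard labels. Below is a minimal sketch; the columns shown in the final line are illustrative choices, not prescribed output fields.
## Sketch: rank future prospects by predicted probability of being a donor.
## predict(..., type = "prob") returns one column per class level of the caret fit.
probs  <- predict(glm.fit, testing, type = "prob")
ranked <- testing[order(probs$Donor, decreasing = TRUE), ]
head(ranked[, c("avg_gift", "home_value", "income")])   # illustrative columns for the top prospects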