Business Objectives and Goals

The main objective of this analysis is to reduce the overall cost of advertising and maximize profits from donors for the national veterans organization.

The goals are:

- Find people who are likely to be donors
- Lower the number of gifts/labels sent
- Increase the overall response rate

Data Sources and Data Used

Fundraising.rds and future_fundraising.rds are the datasets used for this case study. The fundraising dataset has 3,000 observations and 21 variables. Per the study, the overall response rate is 5.1%, but the supplied sample is balanced, with 50% donors and 50% non-donors. The output variable is "target", a binary factor with values Donor and No Donor.

Step 1: Partitioning

We split the original dataset into a train set and a test set using an 80/20 split.

#Load library
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
## The following object is masked from 'package:dplyr':
## 
##     select
library(ROCR)

Import dataset

set.seed(12345)

data <- readRDS("fundraising.rds")

Explore dataset

head(data)
## # A tibble: 6 x 21
##   zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child income
##   <fct>       <fct>       <fct>       <fct>       <fct>         <dbl>  <dbl>
## 1 Yes         No          No          No          Yes               1      1
## 2 No          No          No          Yes         No                2      5
## 3 No          No          No          Yes         Yes               1      3
## 4 No          Yes         No          No          Yes               1      4
## 5 No          Yes         No          No          Yes               1      4
## 6 No          No          No          Yes         Yes               1      4
## # … with 14 more variables: female <fct>, wealth <dbl>, home_value <dbl>,
## #   med_fam_inc <dbl>, avg_fam_inc <dbl>, pct_lt15k <dbl>, num_prom <dbl>,
## #   lifetime_gifts <dbl>, largest_gift <dbl>, last_gift <dbl>,
## #   months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>, target <fct>
summary(data)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167
str(data)
## tibble [3,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...

Check missing values

sum(is.na(data))
## [1] 0

Check Cook's distance diagnostics to find outliers

cooks_distance <- cooks.distance(glm(target ~ ., family = "binomial", data = data))

plot(cooks_distance, 
     pch="*", 
     cex=2, 
     main="Influencial Observations using Cooks distance")  
abline(h = 4*mean(cooks_distance, na.rm=T), col="red")

Find the outlier rows

outliers <- rownames(data)[cooks_distance > 4*mean(cooks_distance, na.rm=T)]
print(outliers)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"

Remove influential points

## Flag points with Cook's distance above the common 4/n cutoff
influential <- as.numeric(names(cooks_distance)[(cooks_distance > (4/nrow(data)))])

data.no_outliers <- data[-influential, ]

cooks_distance.no_outliers <- cooks.distance(glm(target ~ ., family = "binomial", 
                                         data =data.no_outliers))

plot(cooks_distance.no_outliers, 
     pch="*", 
     cex=2, 
     main="Influential Observations using Cooks distance")  
abline(h = 4*mean(cooks_distance.no_outliers, na.rm=T), col="red")

We observe from the plots above that, after removing outliers, the influential observations are minimized and all have Cook's distance less than 0.01. The dataset data.no_outliers no longer contains these influential points.

Split the dataset into train and test sets using an 80/20 split

smp_size <- floor(0.80 * nrow(data.no_outliers))

train_ind <- sample(seq_len(nrow(data.no_outliers)), size = smp_size)

train <- data.no_outliers[train_ind, ]
test <- data.no_outliers[-train_ind, ]
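
As an alternative, caret's createDataPartition performs a stratified split; a minimal sketch, assuming the data.no_outliers tibble from above (not used for the results below), that keeps the Donor/No Donor proportions equal in both partitions:

## Stratified 80/20 split (sketch only): sampling within each level of target
## keeps the Donor / No Donor mix the same in the train and test partitions.
part_idx <- createDataPartition(data.no_outliers$target, p = 0.80, list = FALSE)
train_strat <- data.no_outliers[part_idx, ]
test_strat  <- data.no_outliers[-part_idx, ]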

Type of Analysis performed:

We use classification models in this case study because they can analyze the predictor variables in order to predict the output variable. The output variable is "target", a binary factor with values Donor and No Donor. Below are the classification methods used:

Logistic Regression

Logistic Regression is chosen because it is a predictive model that uses the logistic function to estimate the probability of a binary target. Logistic regression can also explain the relationship between the predictor variables in the dataset and the target variable.
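
For illustration, the logistic (sigmoid) function below is the link that maps a linear combination of predictors to a probability between 0 and 1; this small sketch is not part of the modeling pipeline, it only shows the function that glm with family = binomial fits.

## The logistic function squashes any real-valued linear predictor into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))
logistic(c(-2, 0, 2))   ## roughly 0.12, 0.50, 0.88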

Linear Discriminant Analysis

Linear Discriminant Analysis is chosen because it is a predictive model that also reduces dimensionality: it projects a high-dimensional dataset onto a lower-dimensional space that best separates the classes, which reduces the resources needed. Linear discriminant analysis can only learn linear decision boundaries.
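
As a small illustration (a sketch using only two predictors, not the final model), MASS::lda exposes the coefficients of the linear discriminant it learns, i.e. the direction along which the two classes are best separated.

## Illustrative only: fit LDA on two predictors and inspect the linear
## discriminant coefficients that define the linear boundary.
lda.demo <- lda(target ~ last_gift + months_since_donate, data = data)
lda.demo$scaling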

Quadratic Discriminant Analysis

Quadratic Discriminant Analysis is similar to linear discriminant analysis. The difference is that QDA estimates a separate covariance structure for each class, so it can learn quadratic decision boundaries.

K Nearest Neighbors

K Nearest Neighbors is a non-parametric technique that finds the data points most similar to a new observation and classifies that observation based on the classes of its nearest neighbors.
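
Because KNN relies on distances, predictors measured on larger scales can dominate the neighbor search. A hedged sketch of how centering and scaling could be added through caret's preProcess option is shown below; this variant is not used for the results reported later.

## Sketch only: KNN with predictors centered and scaled so no single variable
## dominates the distance calculation.
knn.scaled <- train(target ~ ., data = train, method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "cv"))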

Random Forest

Random Forest is an ensemble of decision trees. Each tree in the collection makes a prediction for a row of data (a set of predictor values), and the forest combines these predictions by majority vote to predict the target.

Exclusions

We exclude some variables and data rows from the dataset in order to improve accuracy and reduce the processing needed.

We applied a logistic regression model to determine the significant variables. This model found that num_child and months_since_donate are significant.

set.seed(12345)

## Define the control
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
## Train the model
model <- train(target ~ ., data = train, method = "glm",
               trControl = train.control, family = binomial())

## Summarize the results
print(model)
## Generalized Linear Model 
## 
## 2371 samples
##   20 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ... 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.5518111  0.101479
summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7786  -1.1333  -0.8234   1.1794   1.7589  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.841e+00  5.162e-01  -3.566 0.000363 ***
## zipconvert2Yes      -1.289e+01  2.296e+02  -0.056 0.955220    
## zipconvert3No        1.279e+01  2.296e+02   0.056 0.955565    
## zipconvert4Yes      -1.286e+01  2.296e+02  -0.056 0.955335    
## zipconvert5Yes      -1.281e+01  2.296e+02  -0.056 0.955510    
## homeownerNo          8.828e-02  1.058e-01   0.835 0.403982    
## num_child            2.497e-01  1.212e-01   2.060 0.039363 *  
## income              -5.073e-02  2.919e-02  -1.738 0.082193 .  
## femaleNo             5.922e-03  8.654e-02   0.068 0.945445    
## wealth              -1.107e-02  2.023e-02  -0.547 0.584267    
## home_value          -1.339e-04  8.102e-05  -1.652 0.098459 .  
## med_fam_inc         -8.152e-04  1.092e-03  -0.747 0.455164    
## avg_fam_inc          1.330e-03  1.170e-03   1.137 0.255596    
## pct_lt15k           -3.318e-04  4.930e-03  -0.067 0.946344    
## num_prom            -5.221e-04  3.080e-03  -0.170 0.865397    
## lifetime_gifts      -8.155e-04  6.998e-04  -1.165 0.243904    
## largest_gift         8.936e-03  9.746e-03   0.917 0.359178    
## last_gift            5.157e-03  1.025e-02   0.503 0.614924    
## months_since_donate  5.071e-02  1.135e-02   4.467 7.94e-06 ***
## time_lag            -1.003e-02  7.640e-03  -1.312 0.189422    
## avg_gift             1.090e-02  1.371e-02   0.795 0.426736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3286.3  on 2370  degrees of freedom
## Residual deviance: 3216.0  on 2350  degrees of freedom
## AIC: 3258
## 
## Number of Fisher Scoring iterations: 11

Step 2: Model Building

We check for collinearity in the training dataset by using the ols_vif_tol function on a linear model to evaluate the VIF of each variable. We see that the zipconvert variables are collinear with one another, and that med_fam_inc and avg_fam_inc are collinear.

lm.fit <- lm(target ~ ., data = train)  # the factor-response warnings below can be ignored; VIF depends only on the predictors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
ols_vif_tol(lm.fit)
##              Variables   Tolerance        VIF
## 1       zipconvert2Yes 0.004958388 201.678462
## 2        zipconvert3No 0.005450314 183.475656
## 3       zipconvert4Yes 0.004935004 202.634097
## 4       zipconvert5Yes 0.003533760 282.984674
## 5          homeownerNo 0.884834290   1.130155
## 6            num_child 0.973707965   1.027002
## 7               income 0.762674021   1.311176
## 8             femaleNo 0.978004638   1.022490
## 9               wealth 0.650947928   1.536221
## 10          home_value 0.306707373   3.260437
## 11         med_fam_inc 0.050407582  19.838286
## 12         avg_fam_inc 0.045626689  21.916997
## 13           pct_lt15k 0.483124359   2.069860
## 14            num_prom 0.396036787   2.525018
## 15      lifetime_gifts 0.386765349   2.585547
## 16        largest_gift 0.180423638   5.542511
## 17           last_gift 0.191421345   5.224078
## 18 months_since_donate 0.852099106   1.173572
## 19            time_lag 0.963489097   1.037894
## 20            avg_gift 0.228258158   4.381004

Because the zipconvert variables show collinearity and are insignificant, we remove them from the training dataset. Since avg_fam_inc has a higher VIF than med_fam_inc, we remove avg_fam_inc as well.

train.sub = subset(train, select = -c(1:4, 12))

test.sub = subset(test, select = -c(1:4, 12))

lm.fit.new <- lm(target ~ ., data = train.sub)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
ols_vif_tol(lm.fit.new)
##              Variables Tolerance      VIF
## 1          homeownerNo 0.8896507 1.124037
## 2            num_child 0.9748986 1.025748
## 3               income 0.7670144 1.303756
## 4             femaleNo 0.9813177 1.019038
## 5               wealth 0.6599391 1.515291
## 6           home_value 0.4377664 2.284324
## 7          med_fam_inc 0.2783527 3.592564
## 8            pct_lt15k 0.5120119 1.953080
## 9             num_prom 0.3984120 2.509965
## 10      lifetime_gifts 0.3881044 2.576626
## 11        largest_gift 0.1806960 5.534155
## 12           last_gift 0.1915438 5.220739
## 13 months_since_donate 0.8540023 1.170957
## 14            time_lag 0.9665327 1.034626
## 15            avg_gift 0.2286243 4.373987
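
As a side note, the same subsetting can be written by variable name instead of column position; a sketch with dplyr (already loaded), spelling out dplyr::select because MASS masks select:

## Equivalent name-based selection (sketch): drop the four zipconvert dummies
## and avg_fam_inc without relying on column positions.
train.sub <- dplyr::select(train, -starts_with("zipconvert"), -avg_fam_inc)
test.sub  <- dplyr::select(test, -starts_with("zipconvert"), -avg_fam_inc)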

A Random Forest model is used to determine the importance of the variables in the dataset. To minimize processing time we use the test dataset. We also plot a graph ranking the variables by importance.

Below are the 3 most important variables:

home_value, med_fam_inc, lifetime_gifts

Below are the 3 least important variables:

num_child, homeowner, female

## Fit a random forest (caret's default method) and rank the variables by importance
rf.imp.fit <- train(target ~ ., data = test.sub, method = "rf")
rf.imp <- varImp(rf.imp.fit)
plot(rf.imp)

Based on p-values, collinearity, and variable importance, we select the following variables for our models:

home_value, avg_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag, months_since_donate, largest_gift, and income

Apply Cross-Validated Logistic Regression Model

set.seed(12345)
glm.fit <- train(target ~
                   home_value + avg_gift + lifetime_gifts + med_fam_inc +
                   num_prom + pct_lt15k + time_lag + months_since_donate + 
                   largest_gift + income, 
                 data = train.sub, method = "glm", family = "binomial", 
                 trControl = trainControl(method = "cv"))

glm.fit
## Generalized Linear Model 
## 
## 2371 samples
##   10 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5520585  0.1017809

Prediction and Accuracy

pred.glm <- predict(glm.fit,test.sub)

cm.glm <- confusionMatrix(pred.glm,test.sub$target)

cm.glm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      188      166
##   No Donor   106      169
##                                           
##                Accuracy : 0.5676          
##                  95% CI : (0.5278, 0.6067)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : 0.042686        
##                                           
##                   Kappa : 0.1422          
##                                           
##  Mcnemar's Test P-Value : 0.000347        
##                                           
##             Sensitivity : 0.6395          
##             Specificity : 0.5045          
##          Pos Pred Value : 0.5311          
##          Neg Pred Value : 0.6145          
##              Prevalence : 0.4674          
##          Detection Rate : 0.2989          
##    Detection Prevalence : 0.5628          
##       Balanced Accuracy : 0.5720          
##                                           
##        'Positive' Class : Donor           
## 
overall.glm <- cm.glm$overall

glm.acc <- overall.glm[['Accuracy']]
glm.err <- 1-glm.acc

Cutoff Analysis

The glm function can also be used to train the model and predict donors. It produces, for each person, the probability of being a donor or not. An optimal cutoff value can then be chosen to increase accuracy and minimize costs.

glm.fit1 <- glm(target~
                  home_value + avg_gift + lifetime_gifts + med_fam_inc +
                   num_prom + pct_lt15k + time_lag + months_since_donate + 
                   largest_gift + income,
                data = train.sub, family=binomial)

glm.fit1
## 
## Call:  glm(formula = target ~ home_value + avg_gift + lifetime_gifts + 
##     med_fam_inc + num_prom + pct_lt15k + time_lag + months_since_donate + 
##     largest_gift + income, family = binomial, data = train.sub)
## 
## Coefficients:
##         (Intercept)           home_value             avg_gift  
##          -1.5200667           -0.0001107            0.0127765  
##      lifetime_gifts          med_fam_inc             num_prom  
##          -0.0008350            0.0003004           -0.0005177  
##           pct_lt15k             time_lag  months_since_donate  
##          -0.0008063           -0.0092565            0.0504394  
##        largest_gift               income  
##           0.0118111           -0.0533687  
## 
## Degrees of Freedom: 2370 Total (i.e. Null);  2360 Residual
## Null Deviance:       3286 
## Residual Deviance: 3227  AIC: 3249

Determine the predicted probability from the logistic regression for each training observation. Note from the contrasts below that No Donor is coded as 1, so these are probabilities of being a No Donor.

glm.probs=predict(glm.fit1, type="response")

## Number of predictions (one per training observation)
glm.n <- length(glm.probs)

contrasts(train.sub$target)
##          No Donor
## Donor           0
## No Donor        1

Label all observations as Donor, then apply a 54% cutoff: observations with predicted probability above 0.54 are relabeled as No Donor

## Convert predictions to "Donor"
glm.pred=rep("Donor", glm.n)

## Use a cutoff value of 54%. If above 54% then label as "No Donor"
glm.pred[glm.probs > .54]= "No Donor"

Create confusion matrix and calculate accuracy

d <- table(as.factor(glm.pred), train.sub$target)

glm.acc.2 <- sum(diag(d))/sum(d)

glm.acc.2
## [1] 0.5894448

The loop below runs through cutoff values from 0.01 to 1 in increments of 0.01 and records the accuracy at each cutoff, so the optimal cutoff is the one with the highest accuracy. As seen in the plot, a cutoff value of 54% is optimal.

a <- numeric(100)

for (i in seq(1, 100)) {
    ## Start with every observation labelled "Donor", then relabel those above the cutoff
    glm.pred <- rep("Donor", length(glm.probs))
    glm.pred[glm.probs > i/100] <- "No Donor"

    d <- table(factor(glm.pred, levels = levels(train.sub$target)), train.sub$target)

    ## Accuracy at the cutoff i/100
    a[i] <- sum(diag(d))/sum(d)
}

print(a)
##   [1] 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756
##   [8] 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756 0.4917756
##  [15] 0.4917756 0.4921974 0.4921974 0.4921974 0.4921974 0.4926191 0.4926191
##  [22] 0.4928270 0.4928209 0.4928209 0.4930233 0.4921776 0.4928088 0.4928027
##  [29] 0.4940628 0.4963876 0.4976566 0.4987179 0.4995716 0.5015028 0.5030146
##  [36] 0.5000000 0.5056375 0.5076186 0.5102845 0.5105820 0.5126949 0.5196744
##  [43] 0.5225806 0.5312500 0.5412621 0.5466936 0.5482850 0.5678060 0.5699422
##  [50] 0.5730607 0.5861423 0.5865633 0.5864362 0.5894448 0.5849190 0.5821868
##  [57] 0.5772660 0.5683891 0.5654531 0.5573123 0.5525261 0.5444984 0.5418719
##  [64] 0.5376968 0.5349418 0.5313283 0.5291139 0.5270728 0.5279661 0.5284139
##  [71] 0.5288625 0.5288625 0.5305603 0.5310110 0.5306122 0.5297619 0.5311168
##  [78] 0.5298635 0.5299145 0.5303678 0.5299658 0.5299658 0.5299658 0.5299658
##  [85] 0.5299658 0.5299658 0.5295630 0.5295630 0.5295630 0.5295630 0.5295630
##  [92] 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405 0.4708405
##  [99] 0.4708405 0.4708405
plot(a, type = "b")

points(x = which.max(a),
       y = max(a),
       col = "red",
       pch = "X"
       )

text(x = which.max(a)+5,
       y = max(a),
     which.max(a)/100,
       col = "blue",
       pch = "X"
       )

We observe that the cross-validated logistic regression model fit with train gives an accuracy of 56.76% on the test set. The logistic regression model fit with the glm function and a cutoff of 54% returns a training accuracy of 58.94%.
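
As a cross-check on the cutoff search, the ROCR package loaded earlier can sweep all cutoffs at once; a minimal sketch, assuming the glm.probs and train.sub objects from above:

## ROCR sketch: accuracy as a function of the cutoff. The fitted probabilities
## are P(No Donor), so labels are recoded to TRUE for "No Donor".
rocr.pred <- prediction(glm.probs, train.sub$target == "No Donor")
rocr.acc  <- performance(rocr.pred, measure = "acc")
plot(rocr.acc)

## Cutoff with the highest accuracy
best <- which.max(rocr.acc@y.values[[1]])
rocr.acc@x.values[[1]][best]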

Apply Linear Discriminant Analysis

set.seed(12345)
lda.fit <- train(target ~
                   home_value + avg_gift + lifetime_gifts + med_fam_inc +
                   num_prom + pct_lt15k + time_lag + months_since_donate + 
                   largest_gift + income, 
                 data = train.sub, method = "lda", 
                 trControl = trainControl(method = "cv"))

lda.fit
## Linear Discriminant Analysis 
## 
## 2371 samples
##   10 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5491049  0.09586161

Get prediction and Accuracy

pred.lda<-predict(lda.fit,test.sub)

cm.lda <- confusionMatrix(pred.lda,test.sub$target)

cm.lda
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      187      166
##   No Donor   107      169
##                                           
##                Accuracy : 0.566           
##                  95% CI : (0.5262, 0.6051)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : 0.0504923       
##                                           
##                   Kappa : 0.1388          
##                                           
##  Mcnemar's Test P-Value : 0.0004476       
##                                           
##             Sensitivity : 0.6361          
##             Specificity : 0.5045          
##          Pos Pred Value : 0.5297          
##          Neg Pred Value : 0.6123          
##              Prevalence : 0.4674          
##          Detection Rate : 0.2973          
##    Detection Prevalence : 0.5612          
##       Balanced Accuracy : 0.5703          
##                                           
##        'Positive' Class : Donor           
## 
overall.lda <- cm.lda$overall

lda.acc <- overall.lda[['Accuracy']]
lda.err <- 1-lda.acc

LDA gives an accuracy of 56.6%

Apply Quadratic Discriminant Analysis (QDA)

set.seed(12345)
qda.fit <- train(target ~
                   home_value + avg_gift + lifetime_gifts + med_fam_inc +
                   num_prom + pct_lt15k + time_lag + months_since_donate + 
                   largest_gift + income,
                 data = train.sub, method = "qda", 
                 trControl = trainControl(method = "cv"))

qda.fit
## Quadratic Discriminant Analysis 
## 
## 2371 samples
##   10 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ... 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5098749  0.02744165

Get Prediction and Accuracy

pred.qda<-predict(qda.fit,test.sub)

cm.qda <- confusionMatrix(pred.qda,test.sub$target)

cm.qda
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       91       75
##   No Donor   203      260
##                                           
##                Accuracy : 0.558           
##                  95% CI : (0.5182, 0.5973)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : 0.1076          
##                                           
##                   Kappa : 0.088           
##                                           
##  Mcnemar's Test P-Value : 2.597e-14       
##                                           
##             Sensitivity : 0.3095          
##             Specificity : 0.7761          
##          Pos Pred Value : 0.5482          
##          Neg Pred Value : 0.5616          
##              Prevalence : 0.4674          
##          Detection Rate : 0.1447          
##    Detection Prevalence : 0.2639          
##       Balanced Accuracy : 0.5428          
##                                           
##        'Positive' Class : Donor           
## 
overall.qda <- cm.qda$overall

qda.acc <- overall.qda[['Accuracy']]
qda.err <- 1-qda.acc

QDA gives an accuracy of 55.8%

Apply KNN Model

set.seed(12345)
knn.fit <- train(target ~
                   home_value + avg_gift + lifetime_gifts + med_fam_inc +
                   num_prom + pct_lt15k + time_lag + months_since_donate + 
                   largest_gift + income,
                 data = train.sub, method = "knn", 
                 trControl = trainControl(method = "cv"),
                 tuneLength=20)

knn.fit
## k-Nearest Neighbors 
## 
## 2371 samples
##   10 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2134, 2134, 2134, 2134, 2133, 2133, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       
##    5  0.4879480  -0.024389201
##    7  0.4955269  -0.009627793
##    9  0.4985019  -0.003455210
##   11  0.5014590   0.002724883
##   13  0.4955572  -0.009367955
##   15  0.4951566  -0.010144926
##   17  0.4905098  -0.019599854
##   19  0.4892511  -0.022413272
##   21  0.4922190  -0.016297474
##   23  0.4922136  -0.016398632
##   25  0.4896944  -0.021599947
##   27  0.4922100  -0.016573939
##   29  0.4812538  -0.039045399
##   31  0.4926356  -0.015978378
##   33  0.4829273  -0.035397914
##   35  0.4862940  -0.028682641
##   37  0.4867248  -0.028326526
##   39  0.4808337  -0.040210875
##   41  0.4846347  -0.032556763
##   43  0.4728185  -0.056362993
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.

Get prediction and accuracy

pred.knn<-predict(knn.fit,test.sub)

cm.knn <- confusionMatrix(pred.knn,test.sub$target)

cm.knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      156      172
##   No Donor   138      163
##                                           
##                Accuracy : 0.5072          
##                  95% CI : (0.4673, 0.5469)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : 0.90626         
##                                           
##                   Kappa : 0.0171          
##                                           
##  Mcnemar's Test P-Value : 0.06089         
##                                           
##             Sensitivity : 0.5306          
##             Specificity : 0.4866          
##          Pos Pred Value : 0.4756          
##          Neg Pred Value : 0.5415          
##              Prevalence : 0.4674          
##          Detection Rate : 0.2480          
##    Detection Prevalence : 0.5215          
##       Balanced Accuracy : 0.5086          
##                                           
##        'Positive' Class : Donor           
## 
overall.knn <- cm.knn$overall

knn.acc <- overall.knn[['Accuracy']]
knn.err <- 1-knn.acc

The KNN model gives an accuracy of 50.72%

Apply Random Forest Classification Model

library(rpart)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(MASS)
library(rpart.plot)

set.seed(12345)

tree.fund = rpart(target~., data = train, method = "class", control=rpart.control(minsplit=15, cp=0.01))

fancyRpartPlot(tree.fund)

printcp(tree.fund)
## 
## Classification tree:
## rpart(formula = target ~ ., data = train, method = "class", control = rpart.control(minsplit = 15, 
##     cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] largest_gift
## 
## Root node error: 1166/2371 = 0.49178
## 
## n= 2371 
## 
##        CP nsplit rel error  xerror     xstd
## 1 0.10463      0   1.00000 1.00000 0.020877
## 2 0.01000      1   0.89537 0.93225 0.020808

Use 168 trees and mtry = 1

rf1 = randomForest(target~., data = train, ntree = 168, mtry = 1, importance=TRUE)

rf1
## 
## Call:
##  randomForest(formula = target ~ ., data = train, ntree = 168,      mtry = 1, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 168
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 44.5%
## Confusion matrix:
##          Donor No Donor class.error
## Donor      789      416   0.3452282
## No Donor   639      527   0.5480274

Optimal number of trees to minimize error

rf1.err <- rf1$err.rate[,1]
rf1.acc <- 1 - rf1.err

which.min(rf1.err)
## [1] 168
plot(rf1)

print(paste("Accuracy:", max(rf1.acc)))
## [1] "Accuracy: 0.555040067482075"

Random Forest gives an accuracy of 55.50%. The rattle tree plot (fancyRpartPlot) also shows the most important split variable to be largest_gift.

Fundraising.rds and future_fundraising.rds are the datasets used for this case study. The fundraising dataset has 3,000 observations and 21 variables, with 50% donors and 50% non-donors. Because the overall response rate is only 5.1%, the raw mailing data are heavily class-imbalanced: the proportion of donors is much smaller than that of non-donors. To combat this class imbalance, the observations were sampled in approximately equal proportions, which produces a better model.
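
For reference, this kind of balancing could be reproduced with caret's downSample function on the raw mailing data. The sketch below is illustrative only and assumes a hypothetical file raw_mailing.rds with the original 5.1% response rate, which is not provided with this case study.

## Hypothetical sketch: keep all donors and randomly sample an equal number of non-donors
raw <- readRDS("raw_mailing.rds")   # hypothetical raw file, ~5.1% donors
balanced <- downSample(x = raw[, setdiff(names(raw), "target")],
                       y = raw$target, yname = "target")
table(balanced$target)   # roughly 50/50 Donor vs No Donor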

Comparison table of error rates for all models used

err.table <- matrix(c(glm.err, lda.err, qda.err, knn.err, min(rf1.err)), ncol=1, byrow = TRUE)
colnames(err.table) <- c("Sample Error")
rownames(err.table) <- c("Logistic Regression", "LDA", "QDA", "KNN", "Random Forest")
err.table <- as.table(err.table)
err.table
##                     Sample Error
## Logistic Regression    0.4324324
## LDA                    0.4340223
## QDA                    0.4419714
## KNN                    0.4928458
## Random Forest          0.4449599
mycols = c("grey", "blue", "black", "red", "green", "brown")

barplot(err.table, beside=T,main='Out of Sample Error of Models', horiz = FALSE, legend = T, ylim = c(0, 1.0),
        args.legend = list(bty = "n", x = "top", ncol = 2), col = mycols)

Looking at the bar chart above and the table of error rates for each model, we find that logistic regression appears to be the best model.

Step 3: Prediction

Import the future_fundraising dataset and generate predictions with the logistic regression model.

testing <- readRDS("future_fundraising.rds")

pred.testing <- predict(glm.fit, testing)

df <- data.frame(value=pred.testing)

df
##        value
## 1      Donor
## 2   No Donor
## 3      Donor
## 4   No Donor
## 5      Donor
## 6   No Donor
## 7   No Donor
## 8      Donor
## 9      Donor
## 10  No Donor
## 11  No Donor
## 12     Donor
## 13     Donor
## 14     Donor
## 15     Donor
## 16     Donor
## 17  No Donor
## 18     Donor
## 19     Donor
## 20  No Donor
## 21  No Donor
## 22     Donor
## 23  No Donor
## 24  No Donor
## 25     Donor
## 26     Donor
## 27     Donor
## 28  No Donor
## 29  No Donor
## 30     Donor
## 31     Donor
## 32  No Donor
## 33  No Donor
## 34     Donor
## 35  No Donor
## 36  No Donor
## 37     Donor
## 38  No Donor
## 39     Donor
## 40     Donor
## 41     Donor
## 42  No Donor
## 43     Donor
## 44  No Donor
## 45  No Donor
## 46     Donor
## 47     Donor
## 48  No Donor
## 49  No Donor
## 50  No Donor
## 51  No Donor
## 52  No Donor
## 53     Donor
## 54     Donor
## 55  No Donor
## 56  No Donor
## 57     Donor
## 58  No Donor
## 59     Donor
## 60     Donor
## 61     Donor
## 62     Donor
## 63  No Donor
## 64  No Donor
## 65  No Donor
## 66     Donor
## 67     Donor
## 68  No Donor
## 69     Donor
## 70     Donor
## 71  No Donor
## 72  No Donor
## 73  No Donor
## 74  No Donor
## 75     Donor
## 76     Donor
## 77  No Donor
## 78  No Donor
## 79  No Donor
## 80     Donor
## 81  No Donor
## 82     Donor
## 83  No Donor
## 84     Donor
## 85  No Donor
## 86     Donor
## 87  No Donor
## 88     Donor
## 89  No Donor
## 90     Donor
## 91  No Donor
## 92  No Donor
## 93     Donor
## 94  No Donor
## 95  No Donor
## 96     Donor
## 97     Donor
## 98     Donor
## 99     Donor
## 100 No Donor
## 101 No Donor
## 102    Donor
## 103 No Donor
## 104 No Donor
## 105    Donor
## 106 No Donor
## 107    Donor
## 108    Donor
## 109    Donor
## 110    Donor
## 111    Donor
## 112    Donor
## 113 No Donor
## 114 No Donor
## 115    Donor
## 116 No Donor
## 117    Donor
## 118    Donor
## 119    Donor
## 120    Donor

Write the output predictions to a CSV file using the write.csv function

write.csv(df, "C:\\Users\\shamstabrez\\Documents\\Algo\\Homework1\\predict_fundraising.csv", row.names = FALSE)

Recommendation

We find that logistic regression is the best classification method for accurately predicting who will be a donor. The most important factors for predicting donors are a high average gift amount, a high home value, and a high income. We therefore recommend targeting advertisement at donors who donate a high amount on average, followed by those with high home values.