Background:

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
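
A quick back-of-the-envelope check using these figures shows why targeted selection matters (a minimal sketch; all values come from the background above):

# Expected net profit per mailing if everyone is mailed (no targeting)
response_rate <- 0.051  # overall response rate from recent mailing records
avg_donation  <- 13.00  # average donation among responders
mailing_cost  <- 0.68   # cost to produce and send one mailing
response_rate * avg_donation - mailing_cost  # ~ -$0.017 per piece

Mailing everyone loses about 1.7 cents per piece on average, so a model that targets likely donors is essential for the campaign to be profitable.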

Business Objectives and Goals:

The goal is to improve the cost-effectiveness of the veterans' organization's direct marketing campaign through data analysis.

The objective of this effort is to develop a classification model that can effectively capture donors so that the expected net profit is maximized.

Data Sources and Data used:

For this project, we were given a data sample. The dataset was generated with weighted sampling because non-responders heavily outnumber responders in the original population. The given sample has almost equal numbers of donors and non-donors.

# Load the libraries necessary for this program
library(ISLR)
library(dplyr)
library(caret)
library(usdm)
library(modelr)
library(PerformanceAnalytics)
# Read the sample dataset (fundraising.rds) and the production dataset (future_fundraising.rds)
future_fundraising <- readRDS("future_fundraising.rds")
f1 <- readRDS("fundraising.rds")
# Set the seed to 12345 for reproducibility
set.seed(12345)
# Check the structure of the production dataset
str(future_fundraising)
# Check the structure of the sample dataset
str(f1)

Quick read:
n = 3000
p = 20
y = target; this is a factor variable with two levels - Donor and No Donor

# Let's check the class balance of the response variable, target
table(f1$target)
# Proportion of donors in the sample
1499/3000

The sample dataset has 3000 observations and the production dataset has 120 observations.

As mentioned in the background, weighted sampling was used to build the given sample. The Donor class makes up 49.97% of observations, i.e., an almost 50-50 weighted sample.

Classification under asymmetric response and cost: what is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?

Weighted sampling is key for classification problems with heavily imbalanced classes; otherwise, the disparity in observed class frequencies can have a significant negative impact on model fitting. A simple random sample would mirror the population's imbalance. In our case, with only about 5.1% responders, a model could score roughly 95% accuracy by predicting "No Donor" for every record while capturing no donors at all, yielding untrustworthy results.
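
For intuition, a sketch of what a simple random sample of the same size would look like, given the 5.1% population response rate:

# Expected class counts in a simple random sample of n = 3000
n <- 3000
round(n * 0.051)       # ~153 expected donors
n - round(n * 0.051)   # ~2847 expected non-donors
# A classifier that always predicts "No Donor" would score ~94.9%
# accuracy on such a sample while capturing zero donors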

Type of Analysis performed: what, why, findings:

The following exploratory data analysis was performed:

# Get the summary of the training sample
summary(f1)

The summary of the sample indicates that target is almost equally split, with "No Donor" having just two more observations than "Donor" (1,501 vs. 1,499).

home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift are heavily skewed and contain many outliers.

num_child, income, and wealth appear to be categorical in nature. Let's find how the data is distributed for these fields.
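
A quick way to inspect how these fields are distributed (a sketch using the column names noted above):

# Frequency tables for the fields that look categorical
table(f1$num_child)
table(f1$income)
table(f1$wealth)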

# Let's get the correlation matrix.

# To compute correlations, gather all numeric variables into a separate dataset
# and convert the target variable to numeric
mydata <- f1[, c(6:7, 9:21)]
mydata$target <- as.numeric(mydata$target)

# Get the correlation matrix
chart.Correlation(mydata, histogram=FALSE, pch=19)

From the chart above, it is clear that a few variables are heavily correlated with each other, and only a few of the variables, such as num_child, income, num_prom, last_gift, months_since_donate, and avg_gift, are correlated with the response variable.
Notably, med_fam_inc and avg_fam_inc are heavily correlated, and both are in turn correlated with income. Also, avg_gift and last_gift are heavily correlated.

# Let's check for collinearity with variance inflation factors (VIF)
vif(as.data.frame(mydata))

As expected from the correlation matrix, med_fam_inc and avg_fam_inc are collinear, as are last_gift and avg_gift. All other variables fall within the usual VIF threshold (commonly 10).

Exclusions:

There was no clear reason to exclude any particular observations or variables, so no exclusions were made.

Variable transformations:

As mentioned before, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag, and avg_gift are heavily skewed and have many outliers.

Since some of these fields have minimum values of 0, a sqrt transformation would suit them, while a log transformation suits the ones without any 0s. Applying transformations seems beneficial, especially for the predictors that go into the final model; ideally, some imputation would also have been done for these predictors. However, I did not apply any transformation in my model (a sketch of the idea follows below).
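
For illustration, a sketch of the transformations described above; which predictors contain zeros is assumed here for the example, and none of this was applied in the final model:

# Illustration only - these transformations were not used in the final model
f1_trans <- f1
f1_trans$time_lag       <- sqrt(f1$time_lag)        # sqrt: assumed to contain 0s
f1_trans$lifetime_gifts <- sqrt(f1$lifetime_gifts)  # sqrt: assumed to contain 0s
f1_trans$avg_gift       <- log(f1$avg_gift)         # log: assumed strictly positive
f1_trans$largest_gift   <- log(f1$largest_gift)     # log: assumed strictly positive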

Methodology used, background, benefits:

I applied different approaches for partitioning and model-fit evaluation.

Partitioning:

After trying the two approaches below, it seemed logical to proceed with cross-validation alone, as the sample is small and models could overfit if no validation approach were used.

    1. An 80-20 split into train and test sets, with the production set used for evaluation:
# Create a data partition (~80-20 split)
split = .7995
trainIndex <- createDataPartition(f1$target, p=split, list=FALSE)
data_train <- f1[trainIndex,]
data_test <- f1[-trainIndex,]
nrow(data_train)
    2. Cross-validation:
# Train control with 10-fold cross-validation repeated 3 times
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

Testing:

When the train/test methodology was tried, results on the test set were not reflective of performance on the held-out production dataset. Possibly the production dataset was drawn from the same full sample rather than being a clean held-out set, or this reflects the nature of the survey and the high share of non-responders.

Hence, after a few trials I abandoned the train-and-test approach and instead evaluated on the final output set to track the improvement in accuracy.

Model Fit Approach:

Variable Importance:

Random Forest was used to identify the important variables for the final model:

# Fit a random forest with repeated cross-validation to rank variable importance
rf.fit = train(target~.,
               data=data_train,
               method='rf',
               trControl=train_control,
               importance=TRUE)
rf.fit$bestTune
varImp(rf.fit)
plot(varImp(rf.fit))
# Evaluate on the held-out test set
pred.rf <- predict(rf.fit, data_test)
confusionMatrix(pred.rf, data_test$target)
Predictors for the final model:
  1. Took the top 10 predictors from the Random Forest variable importance.
  2. Excluded the collinear predictors from the final model:
  • Excluded last_gift as it is collinear with avg_gift
  • Excluded med_fam_inc and pct_lt15k as they are collinear with income
  3. Predictors that made it to the final model (captured as a reusable formula below):
  • months_since_donate
  • largest_gift
  • avg_gift
  • num_child
  • income
  • home_value
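
For convenience, the selected predictors can be captured once in a reusable formula (a sketch; the model fits below spell the formula out explicitly):

# Reusable formula with the six selected predictors
final_formula <- target ~ months_since_donate + largest_gift + avg_gift +
  num_child + income + home_value
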
Classification Models/Tools

Models Tried:
Tried several classification models: Random Forest, logistic regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbours (KNN), and Support Vector Machines (SVM).

Models chosen and Presented in this analysis documentation:
Though several models were applied, only logistic regression, KNN, and Random Forest are included in this documentation.

Model performance and Validation Results:

The best final accuracy observed on the production dataset is 0.625 (62.5%), achieved with KNN.

After several trials, it is clear that cross-validation is the better approach, and for this classification problem KNN is the best fit: KNN is non-parametric, and the distribution of the data does not meet the assumptions of a GLM well.

I did not choose Random Forest because the number of observations is not large enough for it to deliver better accuracy.

LDA and QDA are also not preferred, for the same reasons as logistic regression. QDA might have performed well, but the distribution of the predictors is not approximately normal.

Hence KNN is the best approach in this case and the model performance results proved the same.
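
As a quick visual check of the normality assumption behind LDA/QDA, one could draw Q-Q plots for a couple of key predictors (a sketch):

# Q-Q plots: points far from the reference line indicate departure from normality
par(mfrow = c(1, 2))
qqnorm(f1$avg_gift, main = "avg_gift"); qqline(f1$avg_gift)
qqnorm(f1$largest_gift, main = "largest_gift"); qqline(f1$largest_gift)
par(mfrow = c(1, 1))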

Performance data and the code are documented below:

GLM
# GLM fit - all predictors - 10-fold, 3-repeat repeated CV on the train dataset
glm.fit.full <- train(target~., data=data_train, method='glm', trControl=train_control)
summary(glm.fit.full)
# Get the confusion matrix on the test dataset
pred.glm.full <- predict(glm.fit.full, data_test)
confusionMatrix(pred.glm.full, data_test$target)
# Write predictions for the production dataset to a CSV file
future_fundraising.value <- predict(glm.fit.full, future_fundraising)
Value <- c("value", as.character(future_fundraising.value))
write.csv(Value, file="value_glm_full.csv")

Prediction accuracy with the production dataset is only 47.5%.

# GLM fit - train dataset - important predictors only - 10-fold, 3-repeat repeated CV
glm.fit.imp <- 
  train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
        data=data_train,
        method='glm',
        trControl = train_control)
# Write predictions for the production dataset to a CSV file
future_fundraising.value <- predict(glm.fit.imp, future_fundraising)
Value <- c("value", as.character(future_fundraising.value))
write.csv(Value, file="value_glm_imp.csv")

Prediction accuracy with the production dataset is the same 47.5%.

KNN
# KNN fit - full dataset - important predictors only - 10-fold, 3-repeat repeated CV
knn.fit <- 
  train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
        data=f1,
        method='knn',
        trControl = train_control,
        tuneLength=20)
# Write predictions for the production dataset to a CSV file
future_fundraising.value <- predict(knn.fit, future_fundraising)
Value <- c("value", as.character(future_fundraising.value))
write.csv(Value, file="value_knn.csv")

Prediction accuracy with the production dataset is 62.5%.
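
To see which value of k the repeated cross-validation selected, the standard caret accessors can be used:

# Best k chosen by cross-validation, and the accuracy-vs-k profile
knn.fit$bestTune
plot(knn.fit)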

Cut-Off Analysis:

For this analysis, I used the default threshold (cutoff) of 0.5. Because the dataset was created with weighted sampling, I did not pursue threshold adjustment using an ROC curve and AUC.
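
For completeness, a sketch of what a profit-based cutoff would imply, using the background figures; the mapping between the population and weighted-sample scales is my own illustration:

# Break-even: mail a person only if p * $13.00 > $0.68
p_break_even <- 0.68 / 13.00   # ~0.052 on the original population scale
# Map that cutoff onto the 50-50 weighted-sample scale
# (donors oversampled by 0.5/0.051, non-donors downsampled by 0.5/0.949)
p <- p_break_even
(p * 0.5/0.051) / (p * 0.5/0.051 + (1 - p) * 0.5/0.949)  # ~0.51

The break-even cutoff on the weighted-sample scale works out to roughly 0.5, which supports keeping the default threshold here.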

Recommendations:

The following could be done to achieve better accuracy:

  1. Sample size: a considerably larger sample could better reflect the true outcomes.
  2. Transformation: applying the sqrt/log transformations discussed earlier to the skewed predictors, possibly with imputation, might improve the model fit.

Pseudo-code for implementation:

The code is embedded along with the descriptions above, and the Rmd file is attached with the submission.