# Load the libraries necessary for this program
library(ISLR)
library(dplyr)
library(caret)
library(usdm)
library(modelr)
library(PerformanceAnalytics)
# Read the sample dataset - fundraising.rds - and the production dataset - future_fundraising.rds
future_fundraising <- readRDS("future_fundraising.rds")
f1 <- readRDS("fundraising.rds")
# set the seed to 12345
set.seed(12345)
# Check the structure of the production dataset
str(future_fundraising)
# Check the structure of the sample dataset
str(f1)
Quick read:
n = 3000 observations
p = 20 predictors
y = target, a factor variable with two levels: Donor and No Donor
# Let's check the sampling of the response variable - target
table(f1$target)
# Proportion of the Donor class
1499/3000
The sample dataset has 3000 observations and the production dataset has 120 observations.
As mentioned in the background, weighted sampling was used to construct the given sample. The Donor class makes up 49.97% of the observations, which is almost a 50-50 weighted sample.
Weighted sampling is key for classification problems. Otherwise, the disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. A simple random sample can lean heavily toward a particular class depending on how the data is biased. In our case, the non-responders are the majority, and without weighted sampling the results could be unreliable.
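For illustration only, a minimal sketch of how a roughly 50-50 weighted (stratified) sample could be drawn with dplyr; the data frame pop standing in for the full donor population is a hypothetical name and is not part of this analysis:
# Hypothetical sketch: stratified draw of ~1500 records per class from a
# full population data frame pop that has the same target column
balanced_sample <- pop %>%
  group_by(target) %>%
  slice_sample(n = 1500) %>%
  ungroup()
table(balanced_sample$target)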
The following exploratory data analysis was performed:
# Get the summary of the training sample
summary(f1)
The summary of the sample indicates that target is almost equally split, with No Donor having only slightly more observations than Donor.
home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag and avg_gift are mostly skewed and have heavy outliers.
num_child, income and wealth appear to be more categorical in nature. Let's find how the data is distributed for these fields.
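A minimal sketch of how the distributions of these fields can be checked with simple frequency tables (this step is illustrative and was not part of the original code):
# Frequency tables for the fields that look categorical
table(f1$num_child)
table(f1$income)
table(f1$wealth)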
# Let's get the correlation matrix.
# To get correlations, let's put all numeric variables into a separate dataset
# Let's convert the target variable to numeric
mydata <- f1[, c(6:7, 9:21)]
mydata$target <- as.numeric(mydata$target)
# Get the correlation matrix
chart.Correlation(mydata, histogram=FALSE, pch=19)
From the above chart, it seems clear that a few variables are heavily correlated with each other, and only a few of the variables, such as num_child, income, num_prom, last_gift, months_since_donate, and avg_gift, are correlated with the response variable.
We can notice that med_fam_inc and avg_fam_inc are heavily correlated; they are in turn correlated with income. Also, avg_gift and last_gift are heavily correlated.
# Let's check for collinearity using variance inflation factors
vif(as.data.frame(mydata))
As expected from the correlation matrix, med_fam_inc and avg_fam_inc are collinear, and last_gift and avg_gift are collinear. All the others are within the threshold.
For this purpose, there did not seem to be a clear reason to exclude any particular variable.
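Had the collinearity needed to be addressed, one variable from each collinear pair could have been dropped and the VIFs re-checked. A minimal sketch (which variable of each pair to drop is an assumption, and this step was not taken here):
# Illustrative only: drop one variable from each collinear pair and re-check VIF
mydata_reduced <- mydata %>% select(-avg_fam_inc, -avg_gift)
vif(as.data.frame(mydata_reduced))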
As mentioned before, home_value, med_fam_inc, avg_fam_inc, pct_lt15k, num_prom, lifetime_gifts, largest_gift, last_gift, time_lag and avg_gift are mostly skewed and have heavy outliers.
Based on the data, since some of the fields have minimum values of 0, it would be good to apply a square-root transformation to them and a log transformation to the ones without any zeros. It seems beneficial to apply transformations, especially on the predictors that go into the final model. Ideally, some imputation would also have been applied to these predictors. However, I did not apply any transformation in my model.
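For illustration, a minimal sketch of such transformations; which fields actually contain zeros is an assumption here, and the transformed data frame is not used anywhere below:
# Illustrative only: sqrt for fields that may contain zeros, log for strictly positive ones
f1_trans <- f1 %>%
  mutate(lifetime_gifts = sqrt(lifetime_gifts),
         time_lag = sqrt(time_lag),
         avg_gift = log(avg_gift),
         last_gift = log(last_gift))
summary(f1_trans[, c("lifetime_gifts", "time_lag", "avg_gift", "last_gift")])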
I applied different approaches for partitioning and model-fit evaluation.
After trying the two approaches below, it seemed logical to go with cross-validation only, as the output sample is small and overfitting could occur if no validation approach were used.
# Create a data partition
split = .7995
trainIndex <- createDataPartition(f1$target,p=split,list=FALSE)
data_train <- f1[trainIndex,]
data_test <- f1[-trainIndex,]
nrow(data_train)
# Train control with repeatedCV with 10-fold cross validation and 3 repeats
train_control <- trainControl(method="repeatedcv",number=10,repeats=3)
When tried with the train-and-test methodology, the results were not reflective of performance on the held-out production dataset. This could be because the production dataset was drawn from the full given sample rather than being a truly separate held-out set, or because of the sheer nature of the survey and the high number of non-responders.
Hence, I did not use the train-and-test approach after a few trials and instead evaluated against the final output set to see the improvement in accuracy.
A random forest was used to identify the important variables for the final model:
rf.fit = train(target~.,
data=data_train,
method='rf',
trControl=train_control,
importance=TRUE)
rf.fit$bestTune
varImp(rf.fit)
plot(varImp(rf.fit))
pred.rf<-predict(rf.fit,data_test)
confusionMatrix(pred.rf,data_test$target)
Models tried:
Several classification models were tried: random forest, logistic regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM).
Models chosen and presented in this analysis documentation:
Although several models were applied, only logistic regression, KNN, and random forest are included in this documentation.
After several trials, it is clear that cross-validation is the better approach, and for this classification problem KNN is the best fit: KNN is non-parametric, and the distribution of the data is not well suited to a GLM.
I did not choose the random forest because the number of observations is not large enough to achieve better accuracy.
LDA and QDA are also not preferred, for the same reasons as logistic regression. QDA could have performed well, but the distributions of the predictors are not approximately normal.
Hence, KNN is the best approach in this case, and the model performance results confirmed this.
The performance data and the code are documented below:
# GLM fit-All parameters-trainControl with 10-fold 3-repeats repeatedCV on train dataset
glm.fit.full <- train(target~.,data=data_train,method='glm',trControl = train_control)
summary(glm.fit.full)
# Get the confusion matrix with test data set
pred.glm.full<-predict(glm.fit.full,data_test)
confusionMatrix(pred.glm.full,data_test$target)
# Write to CSV file
future_fundraising.value=predict(glm.fit.full, future_fundraising)
Value=c("value",as.character(future_fundraising.value))
write.csv(Value,file="value_glm_full.csv")
Prediction accuracy with the production dataset is only 47.5%.
# GLM fit-train dataset-important parameters-trainControl with CV 10-fold 3-repeats
glm.fit.imp <-
train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
data=data_train,
method='glm',
trControl = train_control)
future_fundraising.value=predict(glm.fit.imp, future_fundraising)
Value=c("value",as.character(future_fundraising.value))
write.csv(Value,file="value_glm_imp.csv")
Prediction accuracy with the production dataset is the same 47.5%.
# KNN fit - full dataset - important parameters - trainControl with CV 10-fold 3-repeats
knn.fit <-
train(target~months_since_donate+largest_gift+avg_gift+num_child+income+home_value,
data=f1,
method='knn',
trControl = train_control,
tuneLength=20)
future_fundraising.value=predict(knn.fit, future_fundraising)
Value=c("value",as.character(future_fundraising.value))
write.csv(Value,file="value_knn.csv")
Prediction accuracy with the production dataset is 62.5%.
For this analysis, I used the default threshold (cutoff) of 0.5. As the dataset was created using weighted sampling, I did not attempt threshold adjustment using the ROC curve and AUC.
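For reference, a minimal sketch of how such a threshold adjustment could have been approached; the pROC package is an assumption (it is not loaded above), and the "Donor" probability column name follows the factor levels reported earlier:
library(pROC)
# Class probabilities from the fitted KNN model on the test partition
knn.prob <- predict(knn.fit, data_test, type = "prob")
roc.obj <- roc(response = data_test$target, predictor = knn.prob[, "Donor"])
auc(roc.obj)
# Candidate cutoff balancing sensitivity and specificity, instead of the default 0.5
coords(roc.obj, "best", ret = "threshold")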
The following could have been done to gain much better accuracy:
The code is embedded along with the description, and the Rmd file is attached to the submission.