The data analytics team was tasked with breaking down the data associated with sending donation mailers and increasing our donation likeliness per mailer to help drive down costs associated with sending mailers to those who are unlikely to donate. In order to do this, data analytics obtained a weighted sample of data to ensure we had a 50/50 split of donor and non-donor information. This ensured we could appropriately train a model to identify conditions in which someone is likely and unlikely to donate. If we were to utilize a random split of data, our models may not be able to capture enough data on donors or non-donors to effectively evaluate the information. After reading through the data dictionary and gaining an understanding of it, we were able to split data into training and test sets of information. From there, we were able to summarize the data and plot to try and find correlations between target donors and the predictors to help in sending out mailers to more targeted community members to help drive down our cost per donation. We determined that the predictor most commonly associated was the last gift, unfortunately, that didn’t help narrow down potential future donators that haven’t gifted before, so we did look at corresponding items to the last gift. Once we determined the critical classifiers, we trained a KNN model, LDA Model, and Tree model and applied them to the testing segment of data. The LDA was the most accurate of the models. After several configurations, of the LDA model with different variables, it seemed most accurate to utilize all variables to train the model, which would make sense as numerous items go into this analysis. We were able to achieve a ~56% accuracy rating in classifying a donor/non-donor correctly. As the testing data is an even split of donor vs non-donor, this is still better than the 50% ratio otherwise achieved by sending mailers out haphazardly.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.2.2
library(MASS)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.2.2
library(class)
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
library(tree)
## Warning: package 'tree' was built under R version 4.2.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
fundraising <- readRDS("~/1 Data/Fundraising.rds",refhook = NULL)
future_fundraising <- readRDS("~/1 Data/future_fundraising.rds", refhook = NULL)
set.seed(12345)
sample<- createDataPartition(y=fundraising$target, p= .80, list = FALSE)
train <- fundraising[sample,]
test <- fundraising[-sample,]
train_control <- trainControl(method="repeatedcv",number=10,repeats=3)
attach(fundraising)
pairs(fundraising)
frtestcor <- as.data.frame(sapply(test[, c(1:21)], as.numeric))
corrplot(cor(frtestcor))
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
par(mfrow = c(2,2))
plot(fundraising$target~female)
plot(fundraising$target~homeowner)
plot(fundraising$target~wealth)
plot(fundraising$target~num_child)
plot(fundraising$target~home_value)
plot(fundraising$target~avg_fam_inc)
plot(fundraising$target~income)
plot(fundraising$home_value,avg_gift)
plot(fundraising$income,avg_gift)
plot(fundraising$avg_fam_inc,avg_gift)
ldafund = lda(target~., data= train)
summary(ldafund)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 40 -none- numeric
## scaling 20 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 6 -none- list
lda.pred = predict(ldafund, newdata=test, type="response")
lda.class = lda.pred$class
table(lda.class, test$target)
##
## lda.class Donor No Donor
## Donor 174 147
## No Donor 125 153
55.6%
#knn.fit <- train(target~ homeowner + num_child,
# data = train,
# method = 'knn',
# trControl = train_control,
# tuneLength=20)
#knn.fr = predict(knn.fit, test)
#confusionMatrix(as.factor(knn.fr), test$target, positive = 'Donor')
This code chunk will not run in markdown. THis is the result from console: Reference Prediction Donor No Donor Donor 168 143 No Donor 131 157
treefr <- tree(target ~ ., data = test)
summary(treefr)
##
## Classification tree:
## tree(formula = target ~ ., data = test)
## Variables actually used in tree construction:
## [1] "avg_gift" "time_lag" "months_since_donate"
## [4] "wealth"
## Number of terminal nodes: 5
## Residual mean deviance: 1.327 = 788.3 / 594
## Misclassification error rate: 0.379 = 227 / 599
A weighted sample gives equal treatment to both sides. If utilizing true random, it may skew results to the side with more samples (if we used 98% donors and 2% non-donors, the models would have a hard time figuring out what makes a non-donor.)
The best model was the LDA model.
lda.pred.final = predict(ldafund, newdata=future_fundraising, type="response")
future_fundraising$predicted_donor <- lda.pred.final[[1]]
View(future_fundraising)