Untitled

The data analytics team was tasked with breaking down the data associated with sending donation mailers and increasing our donation likeliness per mailer to help drive down costs associated with sending mailers to those who are unlikely to donate. In order to do this, data analytics obtained a weighted sample of data to ensure we had a 50/50 split of donor and non-donor information. This ensured we could appropriately train a model to identify conditions in which someone is likely and unlikely to donate. If we were to utilize a random split of data, our models may not be able to capture enough data on donors or non-donors to effectively evaluate the information. After reading through the data dictionary and gaining an understanding of it, we were able to split data into training and test sets of information. From there, we were able to summarize the data and plot to try and find correlations between target donors and the predictors to help in sending out mailers to more targeted community members to help drive down our cost per donation. We determined that the predictor most commonly associated was the last gift, unfortunately, that didn’t help narrow down potential future donators that haven’t gifted before, so we did look at corresponding items to the last gift. Once we determined the critical classifiers, we trained a KNN model, LDA Model, and Tree model and applied them to the testing segment of data. The LDA was the most accurate of the models. After several configurations, of the LDA model with different variables, it seemed most accurate to utilize all variables to train the model, which would make sense as numerous items go into this analysis. We were able to achieve a ~56% accuracy rating in classifying a donor/non-donor correctly. As the testing data is an even split of donor vs non-donor, this is still better than the 50% ratio otherwise achieved by sending mailers out haphazardly.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.2.2

library(MASS)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.2.2

## corrplot 0.92 loaded

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## Warning: package 'readr' was built under R version 4.2.2

library(class)
library(ggplot2)
library(caret)

## Warning: package 'caret' was built under R version 4.2.2

## Loading required package: lattice

library(tree)

## Warning: package 'tree' was built under R version 4.2.2

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.2.2

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

fundraising <- readRDS("~/1 Data/Fundraising.rds",refhook = NULL)
future_fundraising <- readRDS("~/1 Data/future_fundraising.rds", refhook = NULL)
set.seed(12345)
sample<- createDataPartition(y=fundraising$target, p= .80, list = FALSE)
train <- fundraising[sample,]
test <- fundraising[-sample,]
train_control <- trainControl(method="repeatedcv",number=10,repeats=3)
attach(fundraising)
pairs(fundraising)

frtestcor <- as.data.frame(sapply(test[, c(1:21)], as.numeric))
corrplot(cor(frtestcor))

summary(fundraising)

##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

par(mfrow = c(2,2))
plot(fundraising$target~female)
plot(fundraising$target~homeowner)
plot(fundraising$target~wealth)
plot(fundraising$target~num_child)

plot(fundraising$target~home_value)
plot(fundraising$target~avg_fam_inc)
plot(fundraising$target~income)
plot(fundraising$home_value,avg_gift)

plot(fundraising$income,avg_gift)
plot(fundraising$avg_fam_inc,avg_gift)

ldafund = lda(target~., data= train)
summary(ldafund)

##         Length Class  Mode     
## prior    2     -none- numeric  
## counts   2     -none- numeric  
## means   40     -none- numeric  
## scaling 20     -none- numeric  
## lev      2     -none- character
## svd      1     -none- numeric  
## N        1     -none- numeric  
## call     3     -none- call     
## terms    3     terms  call     
## xlevels  6     -none- list

lda.pred = predict(ldafund, newdata=test, type="response")
lda.class = lda.pred$class
table(lda.class, test$target)

##           
## lda.class  Donor No Donor
##   Donor      174      147
##   No Donor   125      153

55.6%

#knn.fit <- train(target~ homeowner + num_child,
#                 data = train,
#                 method = 'knn',
#                 trControl = train_control,
#                tuneLength=20)
#knn.fr = predict(knn.fit, test)
#confusionMatrix(as.factor(knn.fr), test$target, positive = 'Donor')

This code chunk will not run in markdown. THis is the result from console: Reference Prediction Donor No Donor Donor 168 143 No Donor 131 157

treefr <- tree(target ~ ., data = test)
summary(treefr)

## 
## Classification tree:
## tree(formula = target ~ ., data = test)
## Variables actually used in tree construction:
## [1] "avg_gift"            "time_lag"            "months_since_donate"
## [4] "wealth"             
## Number of terminal nodes:  5 
## Residual mean deviance:  1.327 = 788.3 / 594 
## Misclassification error rate: 0.379 = 227 / 599

A weighted sample gives equal treatment to both sides. If utilizing true random, it may skew results to the side with more samples (if we used 98% donors and 2% non-donors, the models would have a hard time figuring out what makes a non-donor.)

The best model was the LDA model.

lda.pred.final = predict(ldafund, newdata=future_fundraising, type="response")
future_fundraising$predicted_donor <- lda.pred.final[[1]]
View(future_fundraising)

Untitled

Charles Ponthieux

2022-12-09