Background

Business Objectives and Goals

Objective

  • Develop a classification model that maximizes expected net profit by predicting which individuals are most likely to donate during the direct-mail fundraiser.

Goal

  • Improve the national veterans’ organization’s direct marketing campaign by using data analytics to achieve cost-effectiveness.

Data Sources and Data Used

  • The fundraising file contains 3,000 records, approximately 50% donors and 50% non-donors.
  • This weighted sampling (50/50 split) is essential for building a reliable model. If the training data were a simple random sample in which only about 10% of observations were donors, it would be difficult to build a predictive model for future donors because the training data would contain so few donors.

Libraries

library(ISLR)              # datasets from Introduction to Statistical Learning
library(tidyverse)         # data wrangling helpers such as glimpse()
library(MASS)              # stepwise selection utilities
library(DescTools)         # descriptive statistics helpers
library(ResourceSelection) # hoslem.test() goodness-of-fit test
library(caret)             # train(), trainControl(), confusionMatrix()

Read and View Data

fundraise_data = readRDS("fundraising.rds")               # labeled 3,000-record sample
future_fundraise_data = readRDS("future_fundraising.rds") # unlabeled mailing candidates
glimpse(fundraise_data)
## Rows: 3,000
## Columns: 21
## $ zipconvert2         <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ zipconvert3         <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No,…
## $ zipconvert4         <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No,…
## $ zipconvert5         <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, N…
## $ homeowner           <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ num_child           <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income              <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2…
## $ female              <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes…
## $ wealth              <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7…
## $ home_value          <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 4…
## $ med_fam_inc         <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203,…
## $ avg_fam_inc         <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271,…
## $ pct_lt15k           <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13…
## $ num_prom            <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 2…
## $ lifetime_gifts      <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 4…
## $ largest_gift        <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20,…
## $ last_gift           <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7,…
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 3…
## $ time_lag            <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3,…
## $ avg_gift            <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000,…
## $ target              <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, D…
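As a quick check on the claimed 50/50 split, the class proportions can be inspected directly (a one-liner; output not shown here):

prop.table(table(fundraise_data$target))  # expected: roughly 0.5 Donor, 0.5 No Donor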

Step 1: Partitioning. You might think about how to estimate the out-of-sample error. Either partition the dataset into 80% training and 20% validation or use cross-validation (set the seed to 12345).

set.seed(12345)
# 80/20 random split by row index
train_index = sample(1:nrow(fundraise_data), round(nrow(fundraise_data) * 0.80))
train = fundraise_data[train_index, ]
test = fundraise_data[-train_index, ]
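Since sample() ignores the class labels, the split above is only approximately balanced. An alternative worth noting (a sketch, not used below) is caret’s createDataPartition, which stratifies on the target so both partitions keep the donor/non-donor ratio:

# stratified 80/20 split (hypothetical alternative to the index sampling above)
strat_index = createDataPartition(fundraise_data$target, p = 0.80, list = FALSE)
train_strat = fundraise_data[strat_index, ]
test_strat  = fundraise_data[-strat_index, ]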


Step 2: Model Building. Follow these steps to build, evaluate, and choose a model.

# pairwise correlations among the numeric predictors (factor columns excluded)
temp = fundraise_data[, c(6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
correlation = cor(temp)
round(correlation, 5)
##                     num_child   income   wealth home_value med_fam_inc
## num_child             1.00000  0.09189  0.06018   -0.01196     0.04696
## income                0.09189  1.00000  0.20899    0.29197     0.36751
## wealth                0.06018  0.20899  1.00000    0.26116     0.37776
## home_value           -0.01196  0.29197  0.26116    1.00000     0.73815
## med_fam_inc           0.04696  0.36751  0.37776    0.73815     1.00000
## avg_fam_inc           0.04726  0.37859  0.38589    0.75257     0.97227
## pct_lt15k            -0.03172 -0.28319 -0.37515   -0.39909    -0.66536
## num_prom             -0.08643 -0.06901 -0.41212   -0.06451    -0.05078
## lifetime_gifts       -0.05095 -0.01957 -0.22547   -0.02407    -0.03525
## largest_gift         -0.01755  0.03318 -0.02528    0.05649     0.04703
## last_gift            -0.01295  0.10959  0.05259    0.15886     0.13598
## months_since_donate  -0.00556  0.07724  0.03371    0.02343     0.03234
## time_lag             -0.00607 -0.00155 -0.06642    0.00068     0.01520
## avg_gift             -0.01969  0.12406  0.09108    0.16877     0.13716
##                     avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child               0.04726  -0.03172 -0.08643       -0.05095     -0.01755
## income                  0.37859  -0.28319 -0.06901       -0.01957      0.03318
## wealth                  0.38589  -0.37515 -0.41212       -0.22547     -0.02528
## home_value              0.75257  -0.39909 -0.06451       -0.02407      0.05649
## med_fam_inc             0.97227  -0.66536 -0.05078       -0.03525      0.04703
## avg_fam_inc             1.00000  -0.68028 -0.05731       -0.04033      0.04310
## pct_lt15k              -0.68028   1.00000  0.03778        0.05962     -0.00788
## num_prom               -0.05731   0.03778  1.00000        0.53862      0.11381
## lifetime_gifts         -0.04033   0.05962  0.53862        1.00000      0.50726
## largest_gift            0.04310  -0.00788  0.11381        0.50726      1.00000
## last_gift               0.13138  -0.06175 -0.05587        0.20206      0.44724
## months_since_donate     0.03127  -0.00901 -0.28232       -0.14462      0.01979
## time_lag                0.02434  -0.01991  0.11962        0.03855      0.03998
## avg_gift                0.13176  -0.06248 -0.14725        0.18232      0.47483
##                     last_gift months_since_donate time_lag avg_gift
## num_child            -0.01295            -0.00556 -0.00607 -0.01969
## income                0.10959             0.07724 -0.00155  0.12406
## wealth                0.05259             0.03371 -0.06642  0.09108
## home_value            0.15886             0.02343  0.00068  0.16877
## med_fam_inc           0.13598             0.03234  0.01520  0.13716
## avg_fam_inc           0.13138             0.03127  0.02434  0.13176
## pct_lt15k            -0.06175            -0.00901 -0.01991 -0.06248
## num_prom             -0.05587            -0.28232  0.11962 -0.14725
## lifetime_gifts        0.20206            -0.14462  0.03855  0.18232
## largest_gift          0.44724             0.01979  0.03998  0.47483
## last_gift             1.00000             0.18672  0.07511  0.86640
## months_since_donate   0.18672             1.00000  0.01553  0.18911
## time_lag              0.07511             0.01553  1.00000  0.07008
## avg_gift              0.86640             0.18911  0.07008  1.00000
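Rather than scanning this matrix by eye, caret’s findCorrelation can flag highly correlated columns programmatically (a sketch; the 0.70 cutoff is our choice):

# names of columns recommended for removal due to pairwise correlation above the cutoff
high_corr = findCorrelation(correlation, cutoff = 0.70, names = TRUE)
high_corr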



Logistic Regression

  • Process: We begin by fitting a model with all the predictors, then perform stepwise selection to determine the best predictors to include in the final model.
  • Significant Predictors: Based on the stepwise-selection results, num_child, last_gift, and months_since_donate are significant predictors, as each has a p-value below the 0.05 significance level. This ensures the model does not rely on insignificant variables: we train the final model only on predictors that are significant.
glm.fund = glm(target ~ ., data = train, family = 'binomial')  # full model, all predictors

# stepwise selection in both directions, comparing nested models with chi-squared tests
glm.step = step(glm.fund, scope = list(upper = glm.fund),
                direction = "both", test = "Chisq", trace = F)

summary(glm.step)
## 
## Call:
## glm(formula = target ~ zipconvert2 + zipconvert3 + zipconvert4 + 
##     zipconvert5 + homeowner + num_child + income + home_value + 
##     avg_fam_inc + last_gift + months_since_donate, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7887  -1.1411  -0.7736   1.1703   1.6751  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.335e+00  3.900e-01  -5.987 2.14e-09 ***
## zipconvert2Yes      -1.266e+01  2.271e+02  -0.056  0.95553    
## zipconvert3No        1.257e+01  2.271e+02   0.055  0.95586    
## zipconvert4Yes      -1.264e+01  2.271e+02  -0.056  0.95563    
## zipconvert5Yes      -1.258e+01  2.271e+02  -0.055  0.95582    
## homeownerNo          1.505e-01  1.052e-01   1.430  0.15272    
## num_child            3.424e-01  1.271e-01   2.694  0.00706 ** 
## income              -5.308e-02  2.858e-02  -1.857  0.06330 .  
## home_value          -1.091e-04  7.675e-05  -1.422  0.15508    
## avg_fam_inc          5.741e-04  4.042e-04   1.420  0.15550    
## last_gift            1.610e-02  4.798e-03   3.356  0.00079 ***
## months_since_donate  5.852e-02  1.076e-02   5.438 5.39e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3327.0  on 2399  degrees of freedom
## Residual deviance: 3255.7  on 2388  degrees of freedom
## AIC: 3279.7
## 
## Number of Fisher Scoring iterations: 11
hoslem.test(glm.step$y, fitted(glm.step), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.step$y, fitted(glm.step)
## X-squared = 1.7672, df = 8, p-value = 0.9873
  • Goodness of Fit: The Hosmer-Lemeshow test yielded a p-value of 0.9873, which is above the 0.05 significance level. We fail to reject the null hypothesis of adequate fit; therefore, the model is adequate.


Based on the above analysis, we fit the final model with the following predictors: num_child, last_gift, and months_since_donate.
glm.fund_final = glm(target ~ num_child + last_gift + months_since_donate, data = train, family = 'binomial')
pred.prob = predict.glm(glm.fund_final, newdata = test, type = 'response')  # probability of the second factor level of target
pred = ifelse(pred.prob > .5, 'Donor', 'No Donor')
confusionMatrix(as.factor(pred), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor       98      131
##   No Donor   192      179
##                                           
##                Accuracy : 0.4617          
##                  95% CI : (0.4212, 0.5025)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.9968933       
##                                           
##                   Kappa : -0.0852         
##                                           
##  Mcnemar's Test P-Value : 0.0008424       
##                                           
##             Sensitivity : 0.3379          
##             Specificity : 0.5774          
##          Pos Pred Value : 0.4279          
##          Neg Pred Value : 0.4825          
##              Prevalence : 0.4833          
##          Detection Rate : 0.1633          
##    Detection Prevalence : 0.3817          
##       Balanced Accuracy : 0.4577          
##                                           
##        'Positive' Class : Donor           
## 
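One caveat worth verifying before trusting this matrix: for a binomial glm(), predict(type = 'response') returns the probability of the second level of the response factor. If levels(train$target) is c('Donor', 'No Donor'), then pred.prob is P(No Donor), the Donor/No Donor mapping in the ifelse() above is inverted, and that would account for the below-chance 46.17% accuracy. A quick check:

levels(train$target)  # glm() models P(target == second level)
# if the second level is "No Donor", flip the labels:
# pred = ifelse(pred.prob > .5, 'No Donor', 'Donor')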


K Nearest Neighbors

  • Process: We begin by creating a train_ctrl variable for cross-validation, then fit the K-Nearest Neighbors model with the significant variables identified during the initial stepwise selection (a standardized-feature variant is sketched after the results below).
set.seed(12345)
train_ctrl = trainControl(method="repeatedcv", number=10,repeats=3)

knn.fit = train(target~ months_since_donate + num_child + last_gift,
                data=train,
                method='knn',
                trControl = train_ctrl,
                tuneLength=20)

pred.knn = predict(knn.fit, test)

confusionMatrix(as.factor(pred.knn), test$target, positive = 'Donor')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      158      156
##   No Donor   132      154
##                                           
##                Accuracy : 0.52            
##                  95% CI : (0.4792, 0.5606)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.4514          
##                                           
##                   Kappa : 0.0415          
##                                           
##  Mcnemar's Test P-Value : 0.1753          
##                                           
##             Sensitivity : 0.5448          
##             Specificity : 0.4968          
##          Pos Pred Value : 0.5032          
##          Neg Pred Value : 0.5385          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2633          
##    Detection Prevalence : 0.5233          
##       Balanced Accuracy : 0.5208          
##                                           
##        'Positive' Class : Donor           
## 
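KNN is distance-based, so predictors measured on larger scales (such as last_gift, in dollars) can dominate the distance computation. The standardized variant mentioned above would look like this (a sketch, not run here):

knn.fit_scaled = train(target ~ months_since_donate + num_child + last_gift,
                       data = train,
                       method = 'knn',
                       preProcess = c("center", "scale"),  # z-score each predictor
                       trControl = train_ctrl,
                       tuneLength = 20)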


Random Forest

  • Process: We begin by creating a train_control variable for cross-validation, then fit the Random Forest model with all predictors. With the model fitted, we rank the variables by importance in the table and chart below; this ranking guides the predictors we include in our final Random Forest model.
train_control = trainControl(method="repeatedcv",number=10,repeats=3)


rf.fit = train(target~.,
               data = train,
               method ='rf',
               trControl = train_control,
               importance = TRUE)

rf.fit$besttune  # NULL because list elements are case-sensitive; the correct name is rf.fit$bestTune (see the tuning sketch below)
## NULL
varImp(rf.fit)
## rf variable importance
## 
##                     Importance
## months_since_donate    100.000
## largest_gift            78.852
## num_child               70.241
## last_gift               62.017
## pct_lt15k               58.123
## avg_fam_inc             48.639
## home_value              46.974
## income                  43.231
## med_fam_inc             40.671
## avg_gift                34.473
## zipconvert3No           29.082
## femaleNo                21.990
## homeownerNo             20.963
## num_prom                19.328
## wealth                  18.354
## time_lag                16.654
## lifetime_gifts          11.840
## zipconvert2Yes          10.684
## zipconvert5Yes           2.436
## zipconvert4Yes           0.000
plot(varImp(rf.fit))
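caret tunes mtry (the number of predictors sampled at each split) automatically, storing the winner in rf.fit$bestTune; an explicit grid can also be supplied (a sketch, not run here):

rf_grid = expand.grid(mtry = c(2, 4, 6, 8))  # candidate values for mtry
rf.fit_tuned = train(target ~ ., data = train,
                     method = 'rf',
                     trControl = train_control,
                     tuneGrid = rf_grid,
                     importance = TRUE)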

We remove avg_gift as it is collinear with last_gift (r = 0.87), and we remove med_fam_inc as it is collinear with pct_lt15k (r = -0.67).
train_control = trainControl(method="repeatedcv",number=10,repeats=3)


rf.fit_refitted = train(target~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
               data = train,
               method ='rf',
               trControl = train_control,
               importance = TRUE)

pred.rf_refitted = predict(rf.fit_refitted,test)
confusionMatrix(pred.rf_refitted,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      159      147
##   No Donor   131      163
##                                           
##                Accuracy : 0.5367          
##                  95% CI : (0.4958, 0.5771)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.1738          
##                                           
##                   Kappa : 0.074           
##                                           
##  Mcnemar's Test P-Value : 0.3683          
##                                           
##             Sensitivity : 0.5483          
##             Specificity : 0.5258          
##          Pos Pred Value : 0.5196          
##          Neg Pred Value : 0.5544          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2650          
##    Detection Prevalence : 0.5100          
##       Balanced Accuracy : 0.5370          
##                                           
##        'Positive' Class : Donor           
## 


  • 3. Classification under asymmetric response and cost. Comment on the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors. Why not use a simple random sample from the original dataset?
    • A weighted sample is used to produce a training set with equal numbers of donors and non-donors in order to adjust for the class imbalance in the underlying population. If the response is imbalanced, the model may be biased toward the dominant class, which can cause poor test performance. A simple random sample would not compensate for this imbalance; it would preserve it. A caret-based rebalancing sketch follows.
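caret ships a helper for exactly this rebalancing. On a hypothetical simple random sample raw_sample with roughly 10% donors (raw_sample is illustrative, not part of this project), the balanced training set could be produced like this (a sketch):

# downsample the majority class so both classes have equal counts;
# downSample() returns the predictors plus a "Class" column
balanced = downSample(x = raw_sample[, setdiff(names(raw_sample), "target")],
                      y = raw_sample$target)
table(balanced$Class)  # equal counts of Donor and No Donor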


  • 4. Evaluate the fit. Examine the out-of-sample error for your models. Use tables or graphs to display your results. Is there a model that dominates?
models = c('Logistic Regression', 'K Nearest Neighbors', 'Random Forest')
acc = c(46.17, 52.00, 53.67)  # test-set accuracies (%) from the confusion matrices above

acc.summary= as.data.frame(acc, row.names = models)
acc.summary
##                       acc
## Logistic Regression 46.17
## K Nearest Neighbors 52.00
## Random Forest       53.67
# Plot the bar chart of test-set accuracies
barplot(acc, names.arg = models, ylab = "Accuracy Score", col = "navyblue",
        main = "Model Results", border = "orange")

  • The Random Forest model dominates, though only slightly: all three models sit near the no-information rate of 51.67%.


  • 5. Select the best model. From your answer in (4), what do you think is the “best” model?
    • Model Selected: Random Forest, due to its slightly higher out-of-sample accuracy.

Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.

train_control = trainControl(method="repeatedcv",number=10,repeats=3)


rf.fit_final = train(target ~ months_since_donate + largest_gift + num_child + last_gift + pct_lt15k + income + wealth,
               data = fundraise_data,   # refit on all 3,000 labeled records
               method = 'rf',
               trControl = train_control,
               importance = TRUE)

pred.rf_final = predict(rf.fit_final, future_fundraise_data)  # score the mailing candidates

Shiny App Upload Results

- The Shiny app imported the 120 prediction records into a dataframe and returned a score of 0.5083333.
  • The model performed as expected, correctly predicting 50.8% of the responses. Based on our results, Random Forest is the best-performing of the three classification methods we evaluated for predicting whether a person will be a donor or non-donor.
  • Important factors to consider when targeting potential donors are the number of months from the last donation to July 2018, the dollar amount of the largest gift to date, the number of children, the dollar amount of the most recent gift, the percentage of households earning less than $15K in the potential donor’s neighborhood, household income, and wealth rating.


  • 7. Submission File. For each row in the test set, you must predict whether or not the candidate is a donor. The .csv file should contain a header.
# one-column CSV with header "value", as required for submission
write.table(pred.rf_final, file = "predictions_randomforest_final.csv",
            col.names = c("value"), row.names = FALSE)