1 Introduction

Giving blood is a great thing to do, and anyone collecting blood donations needs to know who to send donation reminders to. The collected data provides an opportunity to build an algorithm that can help identify who is likely to donate again.

The plan is to:

  • Read in the data and look through it for anything that stands out
  • Do some data visualization to get a picture of the data
  • Look at correlations
  • Build machine learning models
    • Random Forest, Decision Tree, Naive Bayes, KNN & Logistic Regression.
  • Compare the algorithms and pick what works best

The data set comes from a mobile blood donation vehicle in Taiwan, courtesy of Yeh, I-Cheng (see the Acknowledgement section for the reference). The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

2 Preparations

2.1 Load Libraries

I am loading a few key libraries for data manipulation, visualization, and model building.

library("readr") 
library("ggplot2") 
library("dplyr")
library("caret")
library("car")

2.2 Load Data

Load the data and take a look at a few observations.
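The read-in chunk is not echoed, so here is a minimal sketch of what was likely run; the training file name data/blood-train.csv is an assumption, based on the test file path used later on.

# Read the training data (file name assumed from the later
# "data/blood-test.csv" path)
data = read.csv("data/blood-train.csv")

head(data)      # a few observations
glimpse(data)   # column types and a preview of the values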

##     X Months.since.Last.Donation Number.of.Donations
## 1 619                          2                  50
## 2 664                          0                  13
## 3 441                          1                  16
## 4 160                          2                  20
## 5 358                          1                  24
## 6 335                          4                   4
##   Total.Volume.Donated..c.c.. Months.since.First.Donation
## 1                       12500                          98
## 2                        3250                          28
## 3                        4000                          35
## 4                        5000                          45
## 5                        6000                          77
## 6                        1000                           4
##   Made.Donation.in.March.2007
## 1                           1
## 2                           1
## 3                           1
## 4                           1
## 5                           0
## 6                           0
## Rows: 576
## Columns: 6
## $ X                           <int> 619, 664, 441, 160, 358, 335, 47, 164, 736…
## $ Months.since.Last.Donation  <int> 2, 0, 1, 2, 1, 4, 2, 1, 5, 0, 2, 1, 2, 2, …
## $ Number.of.Donations         <int> 50, 13, 16, 20, 24, 4, 7, 12, 46, 3, 10, 1…
## $ Total.Volume.Donated..c.c.. <int> 12500, 3250, 4000, 5000, 6000, 1000, 1750,…
## $ Months.since.First.Donation <int> 98, 28, 35, 45, 77, 4, 14, 35, 98, 4, 28, …
## $ Made.Donation.in.March.2007 <int> 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, …

It looks like the data was read in well, and we can see that all the variables are integers. I will convert the Made Donation in March 2007 variable into a factor and use it as the response variable for the algorithms.
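The conversion chunk is hidden as well; a sketch, assuming the factor is stored as resp, the name the plots and models below rely on:

# Five-number summaries for every column
summary(data)

# Turn the 0/1 outcome into a factor to serve as the response
data$resp = as.factor(data$Made.Donation.in.March.2007)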

##        X         Months.since.Last.Donation Number.of.Donations
##  Min.   :  0.0   Min.   : 0.000             Min.   : 1.000     
##  1st Qu.:183.8   1st Qu.: 2.000             1st Qu.: 2.000     
##  Median :375.5   Median : 7.000             Median : 4.000     
##  Mean   :374.0   Mean   : 9.439             Mean   : 5.427     
##  3rd Qu.:562.5   3rd Qu.:14.000             3rd Qu.: 7.000     
##  Max.   :747.0   Max.   :74.000             Max.   :50.000     
##  Total.Volume.Donated..c.c.. Months.since.First.Donation
##  Min.   :  250               Min.   : 2.00              
##  1st Qu.:  500               1st Qu.:16.00              
##  Median : 1000               Median :28.00              
##  Mean   : 1357               Mean   :34.05              
##  3rd Qu.: 1750               3rd Qu.:49.25              
##  Max.   :12500               Max.   :98.00              
##  Made.Donation.in.March.2007
##  Min.   :0.0000             
##  1st Qu.:0.0000             
##  Median :0.0000             
##  Mean   :0.2396             
##  3rd Qu.:0.0000             
##  Max.   :1.0000

That is a rough summary of the variables in the dataset. The new factor variable created above will be used as the response.

3 Basic Visualizations

Let’s start by doing a few basic visualizations of the data.

3.1 Months since Last Donation

ggplot(data, aes(x = "", y = Months.since.Last.Donation)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Months since Last Donation") +
  ggtitle("Boxplot of the Months since Last Donation") 

ggplot(data, aes(x=Months.since.Last.Donation)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Months since Last Donation") + 
  ggtitle("Months since Last Donation Density") + 
  geom_vline(aes(xintercept=mean(Months.since.Last.Donation)), color="blue", linetype="dashed", size=1)

ggplot(data, aes(x = resp, y = Months.since.Last.Donation)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Months since Last Donation") +
  ggtitle("Boxplot of the Months since Last Donation across the response") 

summary(data$Months.since.Last.Donation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   7.000   9.439  14.000  74.000

The mean number of months since the last donation is 9.439, the median is 7 months, and the maximum is 74 months. The data is skewed towards shorter gaps between donations, which is good news. Across the response, as expected, those who donated in March 2007 have a smaller mean number of months since the last donation, and this group also has a smaller range.

3.2 Number of Donations

ggplot(data, aes(x = "", y = Number.of.Donations)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Number of Donations") +
  ggtitle("Boxplot of the Number of Donations") 

ggplot(data, aes(x=Number.of.Donations)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Number of Donations") + 
  ggtitle("Number of Donations Density") + 
  geom_vline(aes(xintercept=mean(Number.of.Donations)), color="blue", linetype="dashed", size=1)

ggplot(data, aes(x = resp, y = Number.of.Donations)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Number of Donations") +
  ggtitle("Boxplot of the Number of Donations across the response") 

summary(data$Number.of.Donations)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.000   5.427   7.000  50.000

The mean number of donations is 5.427; I am sure everyone would want this to be a bit higher. The median is 4, and the maximum is 50. The data is skewed in much the same way as the months since the last donation, although for this variable a more bell-shaped distribution would be welcome. Those who donated blood in March 2007 have a higher mean number of donations than their counterparts.

3.3 Total Volume Donated in cc

ggplot(data, aes(x = "", y = Total.Volume.Donated..c.c..)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Total Volume of Blood Donated") +
  ggtitle("Boxplot of the Total Volume of Blood Donated") 

ggplot(data, aes(x=Total.Volume.Donated..c.c..)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Total Volume of Blood Donated") + 
  ggtitle("Total Volume of Blood Donated Density") + 
  geom_vline(aes(xintercept=mean(Total.Volume.Donated..c.c..)), color="blue", linetype="dashed", size=1)

ggplot(data, aes(x = resp, y = Total.Volume.Donated..c.c..)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Total Volume of Blood Donated") +
  ggtitle("Boxplot of the Total Volume of Blood Donated across the response") 

summary(data$Total.Volume.Donated..c.c..)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250     500    1000    1357    1750   12500

The mean total volume of donated blood is 1357 cc, with the median being 1000 cc. The most anyone has donated is 12500 cc and the minimum is 250 cc. This is our third straight right-skewed variable. The mean total volume is higher for those who made a donation in March 2007.

3.4 Months since First Donation

ggplot(data, aes(x = "", y = Months.since.First.Donation)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Months since First Donation") +
  ggtitle("Boxplot of the Months since First Donation") 

ggplot(data, aes(x=Months.since.First.Donation)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Months since First Donation") + 
  ggtitle("Months since First Donation Density") + 
  geom_vline(aes(xintercept=mean(Months.since.First.Donation)), color="blue", linetype="dashed", size=1)

ggplot(data, aes(x = resp, y = Months.since.First.Donation)) +
  geom_boxplot(color = "blue", fill = "red") +
  ylab("Months since First Donation") +
  ggtitle("Boxplot of the Months since First Donation across the response") 

summary(data$Months.since.First.Donation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   16.00   28.00   34.05   49.25   98.00

The mean number of months since the first donation is 34.05, with the median being 28 months and the minimum 2 months. This variable is not as badly skewed as the others we have looked at so far.

3.5 Response

ggplot(data, aes(resp)) +
  geom_bar(stat = "count", aes(fill = resp)) + 
  ggtitle("Distribution of Response variable") + 
  theme(legend.position="none")

There are more records of those who did not donate blood in March 2007; donors make up only about 24% of the data.

4 Correlations

Run a few things to visualize and calculate the correlations between the variables.
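A plain correlation matrix makes a quick first check; a sketch, assuming columns 2 through 6 hold the four predictors plus the 0/1 outcome:

# Pairwise Pearson correlations, rounded for readability
round(cor(data[, 2:6]), 2)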

library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:lattice':
## 
##     panel.fill
corrgram(data[-1], order=NULL, lower.panel=panel.shade, upper.panel=NULL, text.panel=panel.txt,
         main="Corrgram of the data")

panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  # correlation coefficient
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste("r= ", txt, sep = "")
  text(0.5, 0.6, txt)
  
  # p-value calculation
  p <- cor.test(x, y)$p.value
  txt2 <- format(c(p, 0.123456789), digits = digits)[1]
  txt2 <- paste("p= ", txt2, sep = "")
  if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
  text(0.5, 0.4, txt2)
}
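The panel.cor helper is written for base-graphics scatterplot matrices; the call that uses it is not shown, but a usage sketch would look like this:

# Scatterplot matrix with r and its p-value printed in the upper panels
pairs(data[, 2:5], lower.panel = panel.smooth, upper.panel = panel.cor,
      main = "Scatterplot matrix of the predictors")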

As expected, the number of donations is highly correlated with the total volume donated; since every donation is 250 cc, the two are in fact perfectly correlated. A few variables are negatively correlated with each other.

5 Models

5.1 Data Split

Drop the two variables we will not need to build the algorithms, split the data into training and testing sets, and set the seed so the results are reproducible.

# Split the data into a training and testing set

data = subset(data, select = -c(X, Made.Donation.in.March.2007))

set.seed(1729) 
intrain = createDataPartition(y = data$resp, p = .75, list = FALSE) 
training = data[intrain,]
testing = data[-intrain,]
dim(training) 
## [1] 433   5
dim(testing)
## [1] 143   5

5.2 Random Forest

Fit the algorithm, look at the important variables it uses, and test the model to see how good it is.

set.seed(1812) 
modFit_RF = train(resp ~ ., method = "rf", data = training, prox = TRUE, na.action=na.omit)
modFit_RF
## Random Forest 
## 
## 433 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7550380  0.2704417
##   3     0.7466205  0.2585785
##   4     0.7452845  0.2583526
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(modFit_RF)

rf_imp = varImp(modFit_RF, scale = FALSE)
rf_imp
## rf variable importance
## 
##                             Overall
## Months.since.First.Donation   47.95
## Months.since.Last.Donation    31.21
## Total.Volume.Donated..c.c..   21.00
## Number.of.Donations           19.77
plot(rf_imp)

pred_rf = predict(modFit_RF, testing)
confusionMatrix(pred_rf, testing$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 96 22
##          1 13 12
##                                           
##                Accuracy : 0.7552          
##                  95% CI : (0.6764, 0.8232)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : 0.6218          
##                                           
##                   Kappa : 0.2571          
##                                           
##  Mcnemar's Test P-Value : 0.1763          
##                                           
##             Sensitivity : 0.8807          
##             Specificity : 0.3529          
##          Pos Pred Value : 0.8136          
##          Neg Pred Value : 0.4800          
##              Prevalence : 0.7622          
##          Detection Rate : 0.6713          
##    Detection Prevalence : 0.8252          
##       Balanced Accuracy : 0.6168          
##                                           
##        'Positive' Class : 0               
## 

The most important variable in the random forest algorithm is months since first donation. This algorithm has an accuracy of 75.52% on the testing set, which is not too bad. Now we need to apply it to the test sample.

test_sample = read.csv("data/blood-test.csv")

test_sample$pred_rf = predict(modFit_RF, test_sample, type = "prob")[,2]

ggplot(test_sample, aes(x=pred_rf)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Random Forest Probability") + 
  ggtitle("Probability that they donated in March") + 
  geom_vline(aes(xintercept=mean(pred_rf)), color="blue", linetype="dashed", size=1)

summary(test_sample$pred_rf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.1140  0.1986  0.3115  0.9720

The mean probability that someone made a donation is 0.1986, with a median of 0.114 and a range of 0 to 0.972.

5.3 Classification Tree

Fit the algorithm, look at the important variables it uses, and test the model to see how good it is.

set.seed(1812) 
modFit_CT = train(resp ~ ., method = "rpart", data = training)
print(modFit_CT$finalModel)
## n= 433 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 433 104 0 (0.7598152 0.2401848)  
##    2) Months.since.Last.Donation>=7.5 215  25 0 (0.8837209 0.1162791) *
##    3) Months.since.Last.Donation< 7.5 218  79 0 (0.6376147 0.3623853)  
##      6) Number.of.Donations< 4.5 108  24 0 (0.7777778 0.2222222) *
##      7) Number.of.Donations>=4.5 110  55 0 (0.5000000 0.5000000)  
##       14) Months.since.First.Donation>=49 42  12 0 (0.7142857 0.2857143)  
##         28) Number.of.Donations< 18 35   6 0 (0.8285714 0.1714286) *
##         29) Number.of.Donations>=18 7   1 1 (0.1428571 0.8571429) *
##       15) Months.since.First.Donation< 49 68  25 1 (0.3676471 0.6323529) *
plot(modFit_CT$finalModel, uniform = TRUE, main = "Classification Tree")
text(modFit_CT$finalModel, use.n = TRUE, all = TRUE, cex = 0.8)

ct_imp = varImp(modFit_CT, scale = FALSE)
ct_imp
## rpart variable importance
## 
##                             Overall
## Total.Volume.Donated..c.c..   24.69
## Number.of.Donations           24.69
## Months.since.Last.Donation    14.56
## Months.since.First.Donation   14.25
plot(ct_imp)

pred_ct = predict(modFit_CT, testing)
confusionMatrix(pred_ct, testing$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 95 17
##          1 14 17
##                                           
##                Accuracy : 0.7832          
##                  95% CI : (0.7066, 0.8477)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : 0.3164          
##                                           
##                   Kappa : 0.3832          
##                                           
##  Mcnemar's Test P-Value : 0.7194          
##                                           
##             Sensitivity : 0.8716          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8482          
##          Neg Pred Value : 0.5484          
##              Prevalence : 0.7622          
##          Detection Rate : 0.6643          
##    Detection Prevalence : 0.7832          
##       Balanced Accuracy : 0.6858          
##                                           
##        'Positive' Class : 0               
## 

The tree looks reasonable, and the algorithm has an accuracy of 78.32%, which is not too bad. The two most important variables for this algorithm are the total volume and the number of donations, which, as we saw earlier, are perfectly correlated. Might have to put a pin in this model.

test_sample$pred_ct = predict(modFit_CT, test_sample, type = "prob")[,2]

ggplot(test_sample, aes(x=pred_ct)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Classification Tree Probability") + 
  ggtitle("Probability that they donated in March") + 
  geom_vline(aes(xintercept=mean(pred_ct)), color="blue", linetype="dashed", size=1)

summary(test_sample$pred_ct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1163  0.1163  0.1714  0.2345  0.2222  0.8571

The mean probability that someone made a donation is 0.2345, with a median of 0.1714 and a range of 0.1163 to 0.8571.

5.4 Logistic Regression

Fit the algorithm, look at the important variables it uses, and test the model to see how good it is.

set.seed(1812) 
modFit_lr = train(resp ~ ., data = training, method = "glm", family = "binomial", na.action=na.omit)
modFit_lr
## Generalized Linear Model 
## 
## 433 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7564893  0.1422483
lr_imp = varImp(modFit_lr, scale = FALSE)
lr_imp
## glm variable importance
## 
##                             Overall
## Number.of.Donations           4.217
## Months.since.Last.Donation    4.182
## Months.since.First.Donation   2.871
plot(lr_imp)

pred_lr = predict(modFit_lr, testing)
confusionMatrix(pred_lr, testing$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 105  27
##          1   4   7
##                                           
##                Accuracy : 0.7832          
##                  95% CI : (0.7066, 0.8477)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : 0.3164          
##                                           
##                   Kappa : 0.2205          
##                                           
##  Mcnemar's Test P-Value : 7.772e-05       
##                                           
##             Sensitivity : 0.9633          
##             Specificity : 0.2059          
##          Pos Pred Value : 0.7955          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7622          
##          Detection Rate : 0.7343          
##    Detection Prevalence : 0.9231          
##       Balanced Accuracy : 0.5846          
##                                           
##        'Positive' Class : 0               
## 

Only three variables are used in this algorithm: the months since the first and last donations and the number of donations in between. I think these three make the most sense to look at. The algorithm has an accuracy of 78.32%, not too bad.
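The dropped fourth variable is no accident: each donation is 250 cc, so the total volume is perfectly aliased with the number of donations. Since car is already loaded, here is a sketch of how the collinearity could be confirmed (this check is mine, not part of the original analysis):

# Refit as a plain glm, dropping the aliased total-volume column
# (vif() errors when a model contains aliased coefficients)
glm_fit = glm(resp ~ . - Total.Volume.Donated..c.c..,
              data = training, family = binomial)
vif(glm_fit)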

test_sample$pred_lr = predict(modFit_lr, test_sample, type = "prob")[,2]

ggplot(test_sample, aes(x=pred_lr)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Logistic Regression Probability") + 
  ggtitle("Probability that they donated in March") + 
  geom_vline(aes(xintercept=mean(pred_lr)), color="blue", linetype="dashed", size=1)

summary(test_sample$pred_lr)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007122 0.120022 0.239185 0.249747 0.333716 0.957031

The mean probability that someone made a donation is 0.2497, with a median of 0.2392 and a range of 0.0071 to 0.9570. The density plot of the predicted probabilities is bimodal, like a two-hump camel.

5.5 k Nearest Neighbor

Fit the algorithm, look at the important variables it uses, and test the model to see how good it is.

set.seed(1812) 
modFit_knn = train(resp ~ ., data = training, method = "knn", na.action=na.omit, preProcess = c("center","scale"), tuneLength = 20)
modFit_knn
## k-Nearest Neighbors 
## 
## 433 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7450857  0.2444210
##    7  0.7573733  0.2602258
##    9  0.7671026  0.2824970
##   11  0.7662924  0.2751241
##   13  0.7690874  0.2771518
##   15  0.7741798  0.2911583
##   17  0.7782582  0.3021319
##   19  0.7794413  0.3012047
##   21  0.7785192  0.2954618
##   23  0.7808082  0.2964227
##   25  0.7792511  0.2849155
##   27  0.7786865  0.2827195
##   29  0.7802025  0.2844877
##   31  0.7758164  0.2634299
##   33  0.7747296  0.2583282
##   35  0.7749671  0.2561261
##   37  0.7724841  0.2409391
##   39  0.7711998  0.2343050
##   41  0.7683544  0.2248466
##   43  0.7698422  0.2221161
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
plot(modFit_knn)

knn_imp = varImp(modFit_knn, scale = FALSE)
knn_imp
## ROC curve variable importance
## 
##                             Importance
## Months.since.Last.Donation      0.6856
## Number.of.Donations             0.6605
## Total.Volume.Donated..c.c..     0.6605
## Months.since.First.Donation     0.5140
plot(knn_imp)

pred_knn = predict(modFit_knn, testing)
confusionMatrix(pred_knn, testing$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 101  24
##          1   8  10
##                                          
##                Accuracy : 0.7762         
##                  95% CI : (0.699, 0.8416)
##     No Information Rate : 0.7622         
##     P-Value [Acc > NIR] : 0.39025        
##                                          
##                   Kappa : 0.2634         
##                                          
##  Mcnemar's Test P-Value : 0.00801        
##                                          
##             Sensitivity : 0.9266         
##             Specificity : 0.2941         
##          Pos Pred Value : 0.8080         
##          Neg Pred Value : 0.5556         
##              Prevalence : 0.7622         
##          Detection Rate : 0.7063         
##    Detection Prevalence : 0.8741         
##       Balanced Accuracy : 0.6104         
##                                          
##        'Positive' Class : 0              
## 

All four variables were used to build the algorithm, and accuracy peaked at k = 23 neighbors. The algorithm is 77.62% accurate, which is decent.

test_sample$pred_knn = predict(modFit_knn, test_sample, type = "prob")[,2]

ggplot(test_sample, aes(x=pred_knn)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("k-Nearest Neighbor Probability") + 
  ggtitle("Probability that they donated in March") + 
  geom_vline(aes(xintercept=mean(pred_knn)), color="blue", linetype="dashed", size=1)

summary(test_sample$pred_knn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1280  0.1739  0.2318  0.3478  0.7391

The mean probability that someone made a donation is 0.2318, with a median of 0.1739 and a range of 0 to 0.7391.

5.6 Naive Bayes

Fit the algorithm, look at the important variables it uses, and test the model to see how good it is.

set.seed(1812) 
modFit_NB = train(resp ~ ., method = "nb", data = training, na.action = na.omit)
modFit_NB
## Naive Bayes 
## 
## 433 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.7316093  0.1264004
##    TRUE      0.7319720  0.2799411
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
plot(modFit_NB)

nb_imp = varImp(modFit_NB, scale = FALSE)
nb_imp
## ROC curve variable importance
## 
##                             Importance
## Months.since.Last.Donation      0.6856
## Total.Volume.Donated..c.c..     0.6605
## Number.of.Donations             0.6605
## Months.since.First.Donation     0.5140
plot(nb_imp)

pred_nb = predict(modFit_NB, testing)
confusionMatrix(pred_nb, testing$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 18
##          1 21 16
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.6465, 0.7983)
##     No Information Rate : 0.7622          
##     P-Value [Acc > NIR] : 0.8595          
##                                           
##                   Kappa : 0.2697          
##                                           
##  Mcnemar's Test P-Value : 0.7488          
##                                           
##             Sensitivity : 0.8073          
##             Specificity : 0.4706          
##          Pos Pred Value : 0.8302          
##          Neg Pred Value : 0.4324          
##              Prevalence : 0.7622          
##          Detection Rate : 0.6154          
##    Detection Prevalence : 0.7413          
##       Balanced Accuracy : 0.6390          
##                                           
##        'Positive' Class : 0               
## 

All four predictors are used, with months since the last donation being the most important variable. This algorithm has an accuracy of 72.73%, which is not too bad. Now we need to apply this to the test sample.

test_sample$pred_nb = predict(modFit_NB, test_sample, type = "prob")[,2]

ggplot(test_sample, aes(x=pred_nb)) + 
  geom_density() + 
  theme(legend.position="none") + 
  xlab("Naive Bayes Probability") + 
  ggtitle("Probability that they donated in March") + 
  geom_vline(aes(xintercept=mean(pred_nb)), color="blue", linetype="dashed", size=1)

summary(test_sample$pred_nb)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000016 0.0703603 0.1585494 0.2630476 0.4593079 1.0000000

The mean probability that someone made a donation is 0.2630, with a median of 0.1585 and a range of 0.0000016 to 1.0.

6 Conclusions

Looking at the various algorithms (Random Forest, Decision Tree, Naive Bayes, KNN, and Logistic Regression), it appears that the k-Nearest Neighbor algorithm is the best choice. I would advise trying some data transformations and possibly adding interaction terms to the model to see how that affects accuracy. I could not pick the classification tree algorithm because the variables used in building it were highly correlated, as was shown in the correlation work.
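Since all five models are caret train objects fitted with the same seed and resampling scheme, their bootstrap results can be pooled for a side-by-side look; a sketch of one way to do that:

# Compare resampled accuracy and kappa across the five models
resamps = resamples(list(RF  = modFit_RF,  CT  = modFit_CT,
                         LR  = modFit_lr,  KNN = modFit_knn,
                         NB  = modFit_NB))
summary(resamps)
bwplot(resamps)   # boxplots of the resampled metrics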

7 Acknowledgement

Data is courtesy of Yeh, I-Cheng via the UCI Machine Learning repository:

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” Expert Systems with Applications, 2008, doi:10.1016/j.eswa.2008.07.018.