Giving blood is a great thing to do, and anyone collecting blood donations needs to know whom to send donation reminders. The collected data provides an opportunity to build an algorithm that can help identify who is likely to donate again.
The plan is to explore the data with summaries and visualizations, check the correlations between the variables, build five classification algorithms (random forest, classification tree, logistic regression, k-nearest neighbors, and naive Bayes), evaluate each on a held-out testing set, and apply the fitted models to the test sample.
The data set is from a mobile blood donation vehicle in Taiwan, courtesy of Yeh, I-Cheng (see the acknowledgement at the end). The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.
I am loading a few key libraries for data manipulation, visualization, and model building.
library("readr")
library("ggplot2")
library("dplyr")
library("caret")## Warning: package 'caret' was built under R version 4.2.2
library("car")Load the data and take a look at a few observations
## X Months.since.Last.Donation Number.of.Donations
## 1 619 2 50
## 2 664 0 13
## 3 441 1 16
## 4 160 2 20
## 5 358 1 24
## 6 335 4 4
## Total.Volume.Donated..c.c.. Months.since.First.Donation
## 1 12500 98
## 2 3250 28
## 3 4000 35
## 4 5000 45
## 5 6000 77
## 6 1000 4
## Made.Donation.in.March.2007
## 1 1
## 2 1
## 3 1
## 4 1
## 5 0
## 6 0
## Rows: 576
## Columns: 6
## $ X <int> 619, 664, 441, 160, 358, 335, 47, 164, 736…
## $ Months.since.Last.Donation <int> 2, 0, 1, 2, 1, 4, 2, 1, 5, 0, 2, 1, 2, 2, …
## $ Number.of.Donations <int> 50, 13, 16, 20, 24, 4, 7, 12, 46, 3, 10, 1…
## $ Total.Volume.Donated..c.c.. <int> 12500, 3250, 4000, 5000, 6000, 1000, 1750,…
## $ Months.since.First.Donation <int> 98, 28, 35, 45, 77, 4, 14, 35, 98, 4, 28, …
## $ Made.Donation.in.March.2007 <int> 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, …
Looks like the data was read in well, and all of the variables are integers. I will convert the Made.Donation.in.March.2007 variable into a factor and use it as the response variable for the algorithms.
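The call behind the output below was not echoed; presumably it was just:
summary(data)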
## X Months.since.Last.Donation Number.of.Donations
## Min. : 0.0 Min. : 0.000 Min. : 1.000
## 1st Qu.:183.8 1st Qu.: 2.000 1st Qu.: 2.000
## Median :375.5 Median : 7.000 Median : 4.000
## Mean :374.0 Mean : 9.439 Mean : 5.427
## 3rd Qu.:562.5 3rd Qu.:14.000 3rd Qu.: 7.000
## Max. :747.0 Max. :74.000 Max. :50.000
## Total.Volume.Donated..c.c.. Months.since.First.Donation
## Min. : 250 Min. : 2.00
## 1st Qu.: 500 1st Qu.:16.00
## Median : 1000 Median :28.00
## Mean : 1357 Mean :34.05
## 3rd Qu.: 1750 3rd Qu.:49.25
## Max. :12500 Max. :98.00
## Made.Donation.in.March.2007
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2396
## 3rd Qu.:0.0000
## Max. :1.0000
Just wanted to get a rough summary of the variables in the dataset. I also created a new variable, a factor version of the response.
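The factor creation was not echoed either; a minimal sketch that matches the resp name and the '0'/'1' levels reported by the models later on:
# Factor version of the response used by caret's train()
data$resp = factor(data$Made.Donation.in.March.2007, levels = c(0, 1))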
Let’s start by doing a few basic visualizations of the data.
ggplot(data, aes(x = "", y = Months.since.Last.Donation)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Months since Last Donation") +
ggtitle("Boxplot of the Months since Last Donation") ggplot(data, aes(x=Months.since.Last.Donation)) +
geom_density() +
theme(legend.position="none") +
xlab("Months since Last Donation") +
ggtitle("Months since Last Donation Density") +
geom_vline(aes(xintercept=mean(Months.since.Last.Donation)), color="blue", linetype="dashed", size=1)ggplot(data, aes(x = resp, y = Months.since.Last.Donation)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Months since Last Donation") +
ggtitle("Boxplot of the Months since Last Donation across the response") summary(data$Months.since.Last.Donation)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 7.000 9.439 14.000 74.000
The mean number of months since the last donation is 9.439, the median is 7, and the maximum is 74. The distribution is right skewed, with most donors having a short gap since their last donation, which is good news. Across the response, as expected, the mean number of months since the last donation is smaller for those who donated in March 2007, and that group also has a smaller range.
ggplot(data, aes(x = "", y = Number.of.Donations)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Number of Donations") +
ggtitle("Boxplot of the Number of Donations") ggplot(data, aes(x=Number.of.Donations)) +
geom_density() +
theme(legend.position="none") +
xlab("Number of Donations") +
ggtitle("Number of Donations Density") +
geom_vline(aes(xintercept=mean(Number.of.Donations)), color="blue", linetype="dashed", size=1)ggplot(data, aes(x = resp, y = Number.of.Donations)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Number of Donations") +
ggtitle("Boxplot of the Number of Donations across the response") summary(data$Number.of.Donations)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 4.000 5.427 7.000 50.000
The mean number of donations is 5.427; I am sure everyone would want this to be a bit higher. The median is 4 and the maximum is 50. The distribution is right skewed, similar to what we saw with months since the last donation, though for this variable a more bell-shaped curve would be more welcome. Those who donated blood in March 2007 have a higher mean number of donations than their counterparts.
ggplot(data, aes(x = "", y = Total.Volume.Donated..c.c..)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Total Volume of Blood Donated") +
ggtitle("Boxplot of the Total Volume of Blood Donated") ggplot(data, aes(x=Total.Volume.Donated..c.c..)) +
geom_density() +
theme(legend.position="none") +
xlab("Total Volume of Blood Donated") +
ggtitle("Total Volume of Blood Donated Density") +
geom_vline(aes(xintercept=mean(Total.Volume.Donated..c.c..)), color="blue", linetype="dashed", size=1)ggplot(data, aes(x = resp, y = Total.Volume.Donated..c.c..)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Total Volume of Blood Donated") +
ggtitle("Boxplot of the Total Volume of Blood Donated across the response") summary(data$Total.Volume.Donated..c.c..)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 500 1000 1357 1750 12500
The mean total volume of donated blood is 1357 cc, with the median at 1000 cc. The maximum donated is 12500 cc and the minimum is 250 cc. This is our third straight variable that is right skewed. The mean total volume is higher for those who made a donation in March 2007.
ggplot(data, aes(x = "", y = Months.since.First.Donation)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Months since First Donation") +
ggtitle("Boxplot of the Months since First Donation") ggplot(data, aes(x=Months.since.First.Donation)) +
geom_density() +
theme(legend.position="none") +
xlab("Months since First Donation") +
ggtitle("Months since First Donation Density") +
geom_vline(aes(xintercept=mean(Months.since.First.Donation)), color="blue", linetype="dashed", size=1)ggplot(data, aes(x = resp, y = Months.since.First.Donation)) +
geom_boxplot(color = "blue", fill = "red") +
ylab("Months since First Donation") +
ggtitle("Boxplot of the Months since First Donation across the response") summary(data$Months.since.First.Donation)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 16.00 28.00 34.05 49.25 98.00
The mean number of months since the first donation is 34.05, with the median at 28. The minimum is 2 months. This variable is not as heavily skewed as the others we have looked at so far.
ggplot(data, aes(resp)) +
  geom_bar(stat = "count", aes(fill = resp)) +
  ggtitle("Distribution of Response variable") +
  theme(legend.position = "none")

There are far more records of people who did not donate blood in March 2007 than of people who did.
Next, visualize and calculate the correlations between the variables.
library(corrgram)
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:lattice':
##
## panel.fill
corrgram(data[-1], order = NULL, lower.panel = panel.shade, upper.panel = NULL,
         text.panel = panel.txt, main = "Corrgram of the data")

panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  # correlation coefficient
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste("r= ", txt, sep = "")
  text(0.5, 0.6, txt)
  # p-value calculation
  p <- cor.test(x, y)$p.value
  txt2 <- format(c(p, 0.123456789), digits = digits)[1]
  txt2 <- paste("p= ", txt2, sep = "")
  if(p < 0.01) txt2 <- paste("p= ", "<0.01", sep = "")
  text(0.5, 0.4, txt2)
}

As expected, the number of donations is highly correlated with the total volume of donations. A few variables are negatively correlated with each other.
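panel.cor is defined above but never appears in an echoed call; presumably it was meant for a second corrgram with the coefficients and p-values printed in the upper panels. A hypothetical sketch of that usage:
# Hypothetical: corrgram with r and p-values from panel.cor in the upper panels
corrgram(data[-1], order = NULL, lower.panel = panel.shade, upper.panel = panel.cor,
         text.panel = panel.txt, main = "Corrgram of the data with r and p-values")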
Drop the two variables we will not need to build the algorithms, split the data into training and testing sets, and set the seed so the results are reproducible.
# Split the data into a training and testing set
data = subset(data, select = -c(X, Made.Donation.in.March.2007))
set.seed(1729)
intrain = createDataPartition(y = data$resp, p = .75, list = FALSE)
training = data[intrain,]
testing = data[-intrain,]
dim(training)
## [1] 433 5
dim(testing)
## [1] 143 5
Fit the random forest algorithm, look at the variables that were important in building it, then test the model and see how good it is.
set.seed(1812)
modFit_RF = train(resp ~ ., method = "rf", data = training, prox = TRUE, na.action=na.omit)
modFit_RF
## Random Forest
##
## 433 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7550380 0.2704417
## 3 0.7466205 0.2585785
## 4 0.7452845 0.2583526
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(modFit_RF)

rf_imp = varImp(modFit_RF, scale = FALSE)
rf_imp
## rf variable importance
##
## Overall
## Months.since.First.Donation 47.95
## Months.since.Last.Donation 31.21
## Total.Volume.Donated..c.c.. 21.00
## Number.of.Donations 19.77
plot(rf_imp)

pred_rf = predict(modFit_RF, testing)
confusionMatrix(pred_rf, testing$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 96 22
## 1 13 12
##
## Accuracy : 0.7552
## 95% CI : (0.6764, 0.8232)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : 0.6218
##
## Kappa : 0.2571
##
## Mcnemar's Test P-Value : 0.1763
##
## Sensitivity : 0.8807
## Specificity : 0.3529
## Pos Pred Value : 0.8136
## Neg Pred Value : 0.4800
## Prevalence : 0.7622
## Detection Rate : 0.6713
## Detection Prevalence : 0.8252
## Balanced Accuracy : 0.6168
##
## 'Positive' Class : 0
##
The most important variable in the random forest algorithm is months since the first donation. The algorithm has an accuracy of 75.52% on the testing set, which is not too bad. Now we need to apply it to the test sample.
test_sample = read.csv("data/blood-test.csv")
test_sample$pred_rf = predict(modFit_RF, test_sample, type = "prob")[,2]
ggplot(test_sample, aes(x = pred_rf)) +
  geom_density() +
  theme(legend.position = "none") +
  xlab("Random Forest Probability") +
  ggtitle("Probability that they donated in March") +
  geom_vline(aes(xintercept = mean(pred_rf)), color = "blue", linetype = "dashed", size = 1)

summary(test_sample$pred_rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.1140 0.1986 0.3115 0.9720
The mean probability that someone made a donation is 0.1986, with a median of 0.1140 and a range of 0 to 0.9720.
Fit the classification tree algorithm, look at the variables that were important in building it, then test the model and see how good it is.
set.seed(1812)
modFit_CT = train(resp ~ ., method = "rpart", data = training)
print(modFit_CT$finalModel)
## n= 433
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 433 104 0 (0.7598152 0.2401848)
## 2) Months.since.Last.Donation>=7.5 215 25 0 (0.8837209 0.1162791) *
## 3) Months.since.Last.Donation< 7.5 218 79 0 (0.6376147 0.3623853)
## 6) Number.of.Donations< 4.5 108 24 0 (0.7777778 0.2222222) *
## 7) Number.of.Donations>=4.5 110 55 0 (0.5000000 0.5000000)
## 14) Months.since.First.Donation>=49 42 12 0 (0.7142857 0.2857143)
## 28) Number.of.Donations< 18 35 6 0 (0.8285714 0.1714286) *
## 29) Number.of.Donations>=18 7 1 1 (0.1428571 0.8571429) *
## 15) Months.since.First.Donation< 49 68 25 1 (0.3676471 0.6323529) *
plot(modFit_CT$finalModel, uniform = TRUE, main = "Classification Tree")
text(modFit_CT$finalModel, use.n = TRUE, all = TRUE, cex = 0.8)

ct_imp = varImp(modFit_CT, scale = FALSE)
ct_imp
## rpart variable importance
##
## Overall
## Total.Volume.Donated..c.c.. 24.69
## Number.of.Donations 24.69
## Months.since.Last.Donation 14.56
## Months.since.First.Donation 14.25
plot(ct_imp)

pred_ct = predict(modFit_CT, testing)
confusionMatrix(pred_ct, testing$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 95 17
## 1 14 17
##
## Accuracy : 0.7832
## 95% CI : (0.7066, 0.8477)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : 0.3164
##
## Kappa : 0.3832
##
## Mcnemar's Test P-Value : 0.7194
##
## Sensitivity : 0.8716
## Specificity : 0.5000
## Pos Pred Value : 0.8482
## Neg Pred Value : 0.5484
## Prevalence : 0.7622
## Detection Rate : 0.6643
## Detection Prevalence : 0.7832
## Balanced Accuracy : 0.6858
##
## 'Positive' Class : 0
##
The tree looks reasonable. The algorithm has an accuracy of 78.32% on the testing set, which is not too bad. The two most important variables are total volume and number of donations, which, as we saw earlier, are highly correlated with each other. Might have to put a pin in this model.
test_sample$pred_ct = predict(modFit_CT, test_sample, type = "prob")[,2]
ggplot(test_sample, aes(x = pred_ct)) +
  geom_density() +
  theme(legend.position = "none") +
  xlab("Classification Tree Probability") +
  ggtitle("Probability that they donated in March") +
  geom_vline(aes(xintercept = mean(pred_ct)), color = "blue", linetype = "dashed", size = 1)

summary(test_sample$pred_ct)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1163 0.1163 0.1714 0.2345 0.2222 0.8571
The mean probability that someone made a donation is 0.2345, with a median of 0.1714 and a range of 0.1163 to 0.8571.
Fit the logistic regression algorithm, look at the variables that were important in building it, then test the model and see how good it is.
set.seed(1812)
modFit_lr = train(resp ~ ., data = training, method = "glm", family = "binomial", na.action=na.omit)
modFit_lr
## Generalized Linear Model
##
## 433 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7564893 0.1422483
lr_imp = varImp(modFit_lr, scale = FALSE)
lr_imp
## glm variable importance
##
## Overall
## Number.of.Donations 4.217
## Months.since.Last.Donation 4.182
## Months.since.First.Donation 2.871
plot(lr_imp)

pred_lr = predict(modFit_lr, testing)
confusionMatrix(pred_lr, testing$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 105 27
## 1 4 7
##
## Accuracy : 0.7832
## 95% CI : (0.7066, 0.8477)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : 0.3164
##
## Kappa : 0.2205
##
## Mcnemar's Test P-Value : 7.772e-05
##
## Sensitivity : 0.9633
## Specificity : 0.2059
## Pos Pred Value : 0.7955
## Neg Pred Value : 0.6364
## Prevalence : 0.7622
## Detection Rate : 0.7343
## Detection Prevalence : 0.9231
## Balanced Accuracy : 0.5846
##
## 'Positive' Class : 0
##
Only three variables show up in the importance table: months since the first donation, months since the last donation, and the number of donations in between. Total volume drops out, most likely because it is perfectly collinear with the number of donations (every donation appears to be exactly 250 cc), so the GLM cannot estimate a separate coefficient for it. I think these three variables make the most sense to look at. The algorithm has an accuracy of 78.32% on the testing set, not too bad.
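The collinearity suspicion is easy to check directly; a quick sketch (its result is not part of the original output):
# Should return TRUE if every donation is exactly 250 cc,
# making total volume redundant with the donation count
all(training$Total.Volume.Donated..c.c.. == 250 * training$Number.of.Donations)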
test_sample$pred_lr = predict(modFit_lr, test_sample, type = "prob")[,2]
ggplot(test_sample, aes(x = pred_lr)) +
  geom_density() +
  theme(legend.position = "none") +
  xlab("Logistic Regression Probability") +
  ggtitle("Probability that they donated in March") +
  geom_vline(aes(xintercept = mean(pred_lr)), color = "blue", linetype = "dashed", size = 1)

summary(test_sample$pred_lr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007122 0.120022 0.239185 0.249747 0.333716 0.957031
The mean probability that someone made a donation is 0.2497, with a median of 0.2392 and a range of 0.0071 to 0.9570. The density plot of the predicted probabilities looks like a two-humped camel.
Fit the k-nearest neighbors algorithm, look at the variables that were important in building it, then test the model and see how good it is.
set.seed(1812)
modFit_knn = train(resp ~ ., data = training, method = "knn", na.action=na.omit, preProcess = c("center","scale"), tuneLength = 20)
modFit_knn
## k-Nearest Neighbors
##
## 433 samples
## 4 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7450857 0.2444210
## 7 0.7573733 0.2602258
## 9 0.7671026 0.2824970
## 11 0.7662924 0.2751241
## 13 0.7690874 0.2771518
## 15 0.7741798 0.2911583
## 17 0.7782582 0.3021319
## 19 0.7794413 0.3012047
## 21 0.7785192 0.2954618
## 23 0.7808082 0.2964227
## 25 0.7792511 0.2849155
## 27 0.7786865 0.2827195
## 29 0.7802025 0.2844877
## 31 0.7758164 0.2634299
## 33 0.7747296 0.2583282
## 35 0.7749671 0.2561261
## 37 0.7724841 0.2409391
## 39 0.7711998 0.2343050
## 41 0.7683544 0.2248466
## 43 0.7698422 0.2221161
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
plot(modFit_knn)

knn_imp = varImp(modFit_knn, scale = FALSE)
knn_imp
## ROC curve variable importance
##
## Importance
## Months.since.Last.Donation 0.6856
## Number.of.Donations 0.6605
## Total.Volume.Donated..c.c.. 0.6605
## Months.since.First.Donation 0.5140
plot(knn_imp)

pred_knn = predict(modFit_knn, testing)
confusionMatrix(pred_knn, testing$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 24
## 1 8 10
##
## Accuracy : 0.7762
## 95% CI : (0.699, 0.8416)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : 0.39025
##
## Kappa : 0.2634
##
## Mcnemar's Test P-Value : 0.00801
##
## Sensitivity : 0.9266
## Specificity : 0.2941
## Pos Pred Value : 0.8080
## Neg Pred Value : 0.5556
## Prevalence : 0.7622
## Detection Rate : 0.7063
## Detection Prevalence : 0.8741
## Balanced Accuracy : 0.6104
##
## 'Positive' Class : 0
##
All four variables were used to build the algorithm. Resampling accuracy peaked at k = 23 neighbors. The algorithm is 77.62% accurate on the testing set, which is decent.
test_sample$pred_knn = predict(modFit_knn, test_sample, type = "prob")[,2]
ggplot(test_sample, aes(x = pred_knn)) +
  geom_density() +
  theme(legend.position = "none") +
  xlab("k-Nearest Neighbor Probability") +
  ggtitle("Probability that they donated in March") +
  geom_vline(aes(xintercept = mean(pred_knn)), color = "blue", linetype = "dashed", size = 1)

summary(test_sample$pred_knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1280 0.1739 0.2318 0.3478 0.7391
The mean probability that someone made a donation is 0.2318, with a median of 0.1739 and a range of 0 to 0.7391.
Fit the naive Bayes algorithm, look at the variables that were important in building it, then test the model and see how good it is.
set.seed(1812)
# prox is a randomForest argument with no effect here, so it is dropped
modFit_NB = train(resp ~ ., method = "nb", data = training, na.action = na.omit)
modFit_NB
## Naive Bayes
##
## 433 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 433, 433, 433, 433, 433, 433, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.7316093 0.1264004
## TRUE 0.7319720 0.2799411
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
plot(modFit_NB)

nb_imp = varImp(modFit_NB, scale = FALSE)
nb_imp
## ROC curve variable importance
##
## Importance
## Months.since.Last.Donation 0.6856
## Total.Volume.Donated..c.c.. 0.6605
## Number.of.Donations 0.6605
## Months.since.First.Donation 0.5140
plot(nb_imp)

pred_nb = predict(modFit_NB, testing)
confusionMatrix(pred_nb, testing$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 18
## 1 21 16
##
## Accuracy : 0.7273
## 95% CI : (0.6465, 0.7983)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : 0.8595
##
## Kappa : 0.2697
##
## Mcnemar's Test P-Value : 0.7488
##
## Sensitivity : 0.8073
## Specificity : 0.4706
## Pos Pred Value : 0.8302
## Neg Pred Value : 0.4324
## Prevalence : 0.7622
## Detection Rate : 0.6154
## Detection Prevalence : 0.7413
## Balanced Accuracy : 0.6390
##
## 'Positive' Class : 0
##
All four predictors are used, with months since the last donation being the most important variable. This algorithm has an accuracy of 72.73% on the testing set, the weakest of the group so far. Now we need to apply it to the test sample.
test_sample$pred_nb = predict(modFit_NB, test_sample, type = "prob")[,2]
ggplot(test_sample, aes(x = pred_nb)) +
  geom_density() +
  theme(legend.position = "none") +
  xlab("Naive Bayes Probability") +
  ggtitle("Probability that they donated in March") +
  geom_vline(aes(xintercept = mean(pred_nb)), color = "blue", linetype = "dashed", size = 1)

summary(test_sample$pred_nb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000016 0.0703603 0.1585494 0.2630476 0.4593079 1.0000000
The mean probability that someone made a donation is 0.2630, with a median of 0.1585 and a range of 0.0000016 to 1.0.
Looking at the various algorithms (random forest, classification tree, naive Bayes, k-nearest neighbors, and logistic regression), the k-nearest neighbors algorithm appears to be the best choice. I could not pick the classification tree, even though it posted the highest test-set accuracy, because the variables driving it were highly correlated, as the correlation work showed; and logistic regression buys its accuracy with a specificity of only 0.2059. I would advise trying some data transformations and possibly adding interaction terms to the model to see how that affects accuracy.
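As a follow-up, caret's resamples() can line up the bootstrapped accuracy of all five fits side by side; a minimal sketch using the model objects above:
# Compare the 25-rep bootstrap results across the five fitted models
results = resamples(list(RF = modFit_RF, CT = modFit_CT, LR = modFit_lr,
                         KNN = modFit_knn, NB = modFit_NB))
summary(results)
bwplot(results, metric = "Accuracy")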
Data is courtesy of Yeh, I-Cheng via the UCI Machine Learning Repository:
Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008. doi:10.1016/j.eswa.2008.07.018.