In 1994, about 50,000 new titles were published in the US each year, supporting a roughly $20B book publishing industry. About 10% of these books were sold through mail order. More recently, online superstores such as Amazon have emerged, carrying 1-2.5M titles and further intensifying the pressure on book clubs and mail-order firms. In response to these pressures, book clubs are starting to look at alternative business models that will make them more responsive to their customers' preferences.
The Bookbinders Book Club (BBBC) was established in 1986 for the purpose of selling specialty books through direct marketing. BBBC is strictly a distributor and does not publish any of the books it sells. In anticipation of using database marketing, BBBC made a strategic decision from the start to build and maintain a detailed database about its members containing all relevant information about them. Readers fill out an insert and return it to BBBC, which then enters the data into the database. The company currently has a database of 500,000 readers and sends out a mailing about once a month.
BBBC is exploring whether to use predictive modeling approaches to improve the efficacy of its direct mail program. For this analysis, we will use a subset of the database available to BBBC. The dependent variable (i.e., response group) for the analysis is Choice – purchase or no purchase of the book. BBBC also selected several independent variables that it thought might explain the observed choice behavior. The variables are:
| Variable | Description |
|---|---|
| Choice | Whether the customer purchased The Art History of Florence. 1 = purchase and 0 = non-purchase |
| Gender | 0 = Female and 1 = Male |
| Amount_purchased | Total money spent on BBBC books |
| Frequency | Total number of purchases in the chosen period |
| Last_purchase | Months since last purchase |
| First_purchase | Months since first purchase |
| P_Child | Number of children’s books purchased |
| P_Youth | Number of youth books purchased |
| P_Cook | Number of cookbooks purchased |
| P_DIY | Number of do-it-yourself books purchased |
| P_Art | Number of art books purchased |
We will use multiple models to classify customer purchasing behavior at BBBC. Our goal is to better predict the likelihood of success for the book company's direct marketing department as it moves away from mass marketing campaigns and toward better allocation of its resources. We will use logistic regression, random forest, and SVM modeling approaches to predict whether a customer will purchase a book, The Art History of Florence, after receiving an informative brochure about it.
This study looks to answer questions such as:
* Which customers should BBBC target?
* What is the most cost-effective approach for directly marketing to the company's customer base?
* What variables most affect a customer's purchase of this art book?
The data is provided in two Excel files already split into train and test datasets. The train data contains 1,600 rows, while the test data contains 2,300 rows. Here we will read in the data and inspect some of our training data.
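Throughout the analysis we lean on a handful of packages; assuming they are installed, the following loads everything used below:

library(readxl)        # read_excel()
library(dplyr)         # data wrangling verbs and %>%
library(magrittr)      # %<>% assignment pipe
library(caret)         # trainControl(), confusionMatrix()
library(caretEnsemble) # caretList(), caretModelSpec()
library(corrplot)      # correlation plot
library(pROC)          # roc() for AUC
library(ggplot2)       # ROC curve plot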
train = read_excel("BBBC-Train.xlsx")
test = read_excel("BBBC-Test.xlsx")
glimpse(train)
## Rows: 1,600
## Columns: 12
## $ Observation <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Gender <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0…
## $ Amount_purchased <dbl> 113, 418, 336, 180, 320, 268, 198, 280, 393, 138, 14…
## $ Frequency <dbl> 8, 6, 18, 16, 2, 4, 2, 6, 12, 10, 14, 10, 14, 6, 10,…
## $ Last_purchase <dbl> 1, 11, 6, 5, 3, 1, 12, 2, 11, 7, 2, 2, 2, 1, 2, 1, 9…
## $ First_purchase <dbl> 8, 66, 32, 42, 18, 4, 62, 12, 50, 38, 20, 18, 18, 6,…
## $ P_Child <dbl> 0, 0, 2, 2, 0, 0, 2, 0, 3, 2, 0, 0, 1, 0, 1, 1, 0, 0…
## $ P_Youth <dbl> 1, 2, 0, 0, 0, 0, 3, 2, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1…
## $ P_Cook <dbl> 0, 3, 1, 0, 0, 0, 2, 0, 3, 0, 0, 2, 0, 0, 0, 0, 2, 0…
## $ P_DIY <dbl> 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ P_Art <dbl> 0, 3, 2, 1, 2, 0, 2, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0…
The Observation variable must be removed. It is simply a row identifier, so it has no predictive value for our response variable, and leaving it in can cause errors during modeling.
train = train %>%
  select(-Observation)
test = test %>%
  select(-Observation)
glimpse(train)
## Rows: 1,600
## Columns: 11
## $ Choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Gender <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0…
## $ Amount_purchased <dbl> 113, 418, 336, 180, 320, 268, 198, 280, 393, 138, 14…
## $ Frequency <dbl> 8, 6, 18, 16, 2, 4, 2, 6, 12, 10, 14, 10, 14, 6, 10,…
## $ Last_purchase <dbl> 1, 11, 6, 5, 3, 1, 12, 2, 11, 7, 2, 2, 2, 1, 2, 1, 9…
## $ First_purchase <dbl> 8, 66, 32, 42, 18, 4, 62, 12, 50, 38, 20, 18, 18, 6,…
## $ P_Child <dbl> 0, 0, 2, 2, 0, 0, 2, 0, 3, 2, 0, 0, 1, 0, 1, 1, 0, 0…
## $ P_Youth <dbl> 1, 2, 0, 0, 0, 0, 3, 2, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1…
## $ P_Cook <dbl> 0, 3, 1, 0, 0, 0, 2, 0, 3, 0, 0, 2, 0, 0, 0, 0, 2, 0…
## $ P_DIY <dbl> 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ P_Art <dbl> 0, 3, 2, 1, 2, 0, 2, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0…
These raw outputs are an eyesore and don't reveal any obvious patterns. Let's start by checking whether our datasets contain any missing values.
sapply(train, function(x) sum(is.na(x)))
## Choice Gender Amount_purchased Frequency
## 0 0 0 0
## Last_purchase First_purchase P_Child P_Youth
## 0 0 0 0
## P_Cook P_DIY P_Art
## 0 0 0
sapply(test, function(x) sum(is.na(x)))
## Choice Gender Amount_purchased Frequency
## 0 0 0 0
## Last_purchase First_purchase P_Child P_Youth
## 0 0 0 0
## P_Cook P_DIY P_Art
## 0 0 0
Oh lucky day! Our datasets have no missing values! Multicollinearity is another issue we should check for. Keeping Garbage-In-Garbage-Out in mind, we don't want to include predictors that are largely determined by other predictors: redundant variables inflate the variance of the coefficient estimates and misrepresent each variable's contribution. We will use a correlation plot to visually represent the relationships between our variables.
corrplot(cor(train), method = 'number') # requires the corrplot package
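Rather than reading the relationship off the plot alone, we can also check the key pairwise correlation numerically; for example:

cor(train$First_purchase, train$Last_purchase) # correlation between the two purchase-history variables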
We find a high correlation between the variables First_purchase and Last_purchase. We will hold off on removing one of these variables from our model for now. This data is relatively clean, but we do need to ensure that variables with binary outcomes (i.e., Choice and Gender) are converted to factors prior to building our models.
train$Gender = as.factor(train$Gender)
test$Gender = as.factor(test$Gender)
# recode Choice with descriptive labels first, then convert to a factor
# (recoding after as.factor() would silently turn Choice back into a character vector)
train$Choice = as.factor(ifelse(train$Choice == 0, "Non.Purchase", "Purchase"))
test$Choice = as.factor(ifelse(test$Choice == 0, "Non.Purchase", "Purchase"))
Before training, let's set our trainControl conditions using the caret package. We will compare the results after k-fold resampling and apply weights to our model to deal with the data imbalance. Data imbalance occurs when one value is heavily over-represented in a variable; in this case, our binary response Choice has far more Non.Purchase than Purchase rows.
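To see the imbalance directly before correcting for it, we can tabulate the response:

table(train$Choice)             # raw counts of each class
prop.table(table(train$Choice)) # class proportions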
set.seed(123)
folds = 10
repeats = 3
# create model weights for training with imbalanced data
model_weights = ifelse(train$Choice == 'Non.Purchase',
                       (1/table(train$Choice)[1]) * 0.5,
                       (1/table(train$Choice)[2]) * 0.5)
# set trainControl parameters
control <- trainControl(method = 'repeatedcv',
                        number = folds,
                        repeats = repeats,
                        classProbs = TRUE,
                        savePredictions = 'final')
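As a sanity check on the weighting scheme: each observation's weight is half the reciprocal of its class count, so the weights within each class sum to 0.5 and the two classes contribute equally during training:

sum(model_weights[train$Choice == 'Non.Purchase']) # 0.5
sum(model_weights[train$Choice == 'Purchase'])     # 0.5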
We will use the caretList function to run all of our models at once: a logistic regression, three variations of support vector machine, and a random forest. Since this is a classification problem, we will evaluate the models based on AUROC and accuracy. After initially running a logistic regression, observing the correlation plot, and manually completing a stepwise selection based on p-value significance, I determined that the variables First_purchase and Amount_purchased had no significant effect on the response Choice and should not be included in the final model.
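That screening step can be reproduced with a plain, unweighted glm outside of caret; a minimal sketch (screen_fit is a hypothetical name, and its p-values will differ slightly from the weighted caret fit):

# fit a logistic regression on all predictors and inspect the p-values
screen_fit = glm(Choice ~ ., data = train, family = binomial)
summary(screen_fit) # look for weak predictors such as First_purchase and Amount_purchased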
my_models = c('svmLinear', 'svmPoly', 'svmRadial', 'glm', 'rf')
set.seed(123)
all_models = caretList(Choice ~ . - First_purchase - Amount_purchased,
                       data = train,
                       trControl = control,
                       metric = "Accuracy",
                       methodList = my_models,
                       weights = model_weights,
                       # quasibinomial family because the case weights are non-integer
                       tuneList = list(glm = caretModelSpec(method = 'glm', family = quasibinomial)))
## Warning in trControlCheck(x = trControl, y = target): indexes not defined in
## trControl. Attempting to set them ourselves, so each model in the ensemble will
## have the same resampling indexes.
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
(This warning was emitted once per resample and is truncated here. It is expected: our case weights are non-integer, which is exactly why the glm is given the quasibinomial family.)
Now that we have all of our model results, we want to compare them. We will use the dplyr package to reshape the resampling results into a single data frame containing each model's name alongside the accuracy it obtained on each resample.
# extract the per-resample results for each model
list_of_results <- lapply(my_models, function(x) all_models[[x]]$resample)
# combine into a single data frame
total_df <- do.call("bind_rows", list_of_results)
# add a column of model names next to their respective accuracies
total_df %<>% mutate(Model = rep(my_models, each = folds * repeats))
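A quick peek confirms the shape: one row per resample per model (the exact values depend on the resampling).

head(total_df, 3) # columns: Accuracy, Kappa, Resample, Model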
Now we will write a function that computes predicted probabilities for each of our models.
# This function creates a data frame of the predicted class probabilities for each test row
predict_prob = function(model_select) {
  predict(model_select, test, type = "prob") %>% # make predictions with the model
    as.data.frame() # convert to data frame
}
Now we make our predictions for each model. (I attempted to do this in a for loop, but that was not returning the desired results.)
pred_logit = predict_prob(all_models$glm)
pred_svm.lin = predict_prob(all_models$svmLinear)
pred_svm.rbf = predict_prob(all_models$svmRadial)
pred_svm.poly = predict_prob(all_models$svmPoly)
pred_rf = predict_prob(all_models$rf)
As mentioned earlier, we will compare each of our models using AUROC. To compute the AUC, let's write a function.
# This function uses `roc()` to calculate the AUC from our predicted probs
calc_auc = function(prob) {
  roc(test$Choice, prob)
}
Let's now use the function on each of our 5 models.
auc_logit = calc_auc(pred_logit[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.lin = calc_auc(pred_svm.lin[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.rbf = calc_auc(pred_svm.rbf[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.poly = calc_auc(pred_svm.poly[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_rf = calc_auc(pred_rf[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_logit$auc[1] # index the auc element so only the numeric value is printed, not pROC's text
## [1] 0.7981204
Let's visualize the ROC curves and AUC results of our models using ggplot().
# create a data frame of ROC/AUC results for each model
df_auc = bind_rows(tibble(TPR = auc_logit$sensitivities,
                          FPR = 1 - auc_logit$specificities,
                          AUC = auc_logit$auc[1],
                          Model = "Logistic"),
                   tibble(TPR = auc_svm.lin$sensitivities,
                          FPR = 1 - auc_svm.lin$specificities,
                          AUC = auc_svm.lin$auc[1],
                          Model = "Linear SVM"),
                   tibble(TPR = auc_svm.rbf$sensitivities,
                          FPR = 1 - auc_svm.rbf$specificities,
                          AUC = auc_svm.rbf$auc[1],
                          Model = "RBF SVM"),
                   tibble(TPR = auc_svm.poly$sensitivities,
                          FPR = 1 - auc_svm.poly$specificities,
                          AUC = auc_svm.poly$auc[1],
                          Model = "Poly SVM"),
                   tibble(TPR = auc_rf$sensitivities,
                          FPR = 1 - auc_rf$specificities,
                          AUC = auc_rf$auc[1],
                          Model = "Random Forest"))
unique(df_auc$Model) # ensure each of our models is included
## [1] "Logistic" "Linear SVM" "RBF SVM" "Poly SVM"
## [5] "Random Forest"
# Plot ROC curves to compare each model
# The closer a curve bows toward the top-left corner, the better
df_auc %>%
  ggplot(aes(FPR, TPR, color = Model)) +
  geom_line(size = .5) +
  theme_bw() +
  coord_equal() +
  # the dashed diagonal marks a random (no-skill) classifier
  geom_abline(intercept = 0, slope = 1, color = 'gray37', size = 1, linetype = "dashed") +
  labs(x = "FPR (1 - Specificity)",
       y = "TPR (Sensitivity)",
       title = "ROC Curve and AUC: Model Comparison")
We must keep our business needs in mind. It is more detrimental to our use case to predict that a customer will purchase when they actually do not: these false positives represent wasted mailing expenditures, which we want to minimize. For this use case we therefore want both high sensitivity and high specificity; low specificity would indicate a model with a high false positive rate.
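As a refresher, both measures come straight from confusion-matrix counts; here is a small worked example with purely hypothetical counts:

# hypothetical counts for illustration only
TP = 140; FN = 60    # buyers we caught vs. buyers we missed
TN = 1600; FP = 500  # non-buyers correctly skipped vs. needlessly mailed
c(sensitivity = TP / (TP + FN), # 0.70
  specificity = TN / (TN + FP)) # ~0.76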
# Show which models had the highest resample accuracy
total_df %>%
  group_by(Model) %>%
  summarize(Best_Acc = max(Accuracy)) %>%
  ungroup() %>%
  arrange(-Best_Acc) %>%
  mutate_if(is.numeric, function(x) {round(x, 3)}) %>%
  knitr::kable(caption = "Table 1: Model Performance in decreasing order of Accuracy",
               col.names = c("Model", "Accuracy"))
| Model | Accuracy |
|---|---|
| svmLinear | 0.869 |
| svmPoly | 0.856 |
| rf | 0.838 |
| svmRadial | 0.825 |
| glm | 0.794 |
# Show which models had the best AUC
df_auc %>%
  group_by(Model) %>%
  summarize(AUC = max(AUC)) %>%
  ungroup() %>%
  arrange(-AUC) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 2: Model Performance in decreasing order of AUC",
               col.names = c("Model", "AUC"))
| Model | AUC |
|---|---|
| Logistic | 0.7981 |
| Linear SVM | 0.7933 |
| Poly SVM | 0.7912 |
| RBF SVM | 0.7650 |
| Random Forest | 0.7550 |
# Show which models had the best Sensitivity (aka TPR), averaged across ROC thresholds
df_auc %>%
  group_by(Model) %>%
  summarize(TPR = mean(TPR)) %>%
  ungroup() %>%
  arrange(-TPR) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 3: Model Performance in decreasing order of Sensitivity",
               col.names = c("Model", "TPR"))
| Model | TPR |
|---|---|
| Logistic | 0.7486 |
| Linear SVM | 0.7384 |
| Poly SVM | 0.7377 |
| RBF SVM | 0.7034 |
| Random Forest | 0.4541 |
# Compare the models' mean FPR (1 - Specificity) across ROC thresholds
df_auc %>%
  group_by(Model) %>%
  summarize(FPR = mean(FPR)) %>%
  ungroup() %>%
  arrange(-FPR) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 4: Model mean FPR (1 - Specificity) in decreasing order",
               col.names = c("Model", "FPR"))
| Model | FPR |
|---|---|
| Logistic | 0.4746 |
| Poly SVM | 0.4700 |
| Linear SVM | 0.4676 |
| RBF SVM | 0.4550 |
| Random Forest | 0.1743 |
Our logistic regression model seems to be performing the best overall, with the highest AUC and the highest mean sensitivity (note from Table 4 that this comes with the highest mean false positive rate, the cost of catching more buyers). Moving forward we will calculate how much our profits could increase when using this model in a targeted campaign.
Logistic regression models are also easier to interpret. Let's identify any potential business insights from the regression output.
summary(all_models$glm) # Logit model output
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.058859 -0.021428 -0.014849 -0.001057 0.084423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.125582 0.151226 7.443 1.60e-13 ***
## Gender1 -0.909757 0.123131 -7.389 2.39e-13 ***
## Frequency -0.088559 0.008997 -9.843 < 2e-16 ***
## Last_purchase 0.588613 0.074347 7.917 4.52e-15 ***
## P_Child -0.823686 0.106555 -7.730 1.89e-14 ***
## P_Youth -0.686482 0.131155 -5.234 1.88e-07 ***
## P_Cook -0.942850 0.109937 -8.576 < 2e-16 ***
## P_DIY -0.939051 0.129572 -7.247 6.60e-13 ***
## P_Art 0.657891 0.120530 5.458 5.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.0006307152)
##
## Null deviance: 1.3863 on 1599 degrees of freedom
## Residual deviance: 1.0690 on 1591 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
exp(all_models$glm$finalModel$coefficients) # calculate odds ratios to interpret the model in a business sense
## (Intercept) Gender1 Frequency Last_purchase P_Child
## 3.0820086 0.4026222 0.9152495 1.8014873 0.4388112
## P_Youth P_Cook P_DIY P_Art
## 0.5033437 0.3895162 0.3909985 1.9307154
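An odds ratio translates into a percentage change in the odds by subtracting 1 and multiplying by 100; a quick transformation of the values above:

# percent change in the odds per one-unit increase in each predictor
round((exp(all_models$glm$finalModel$coefficients) - 1) * 100, 1)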
At a high level, the odds ratios can be interpreted as suggesting that:
* Last_purchase: each additional month since a customer's last purchase multiplies the odds of purchase by about 1.8 (roughly an 80% increase)
* Frequency: each additional purchase in the period lowers the odds of buying the art book by roughly 8.5%

These are significant findings to consider. This model indicates the profile for an ideal Art History of Florence buyer would be:
* a woman
* who has purchased 1 or more art books in the past
* who has not purchased in a significant amount of time
* and who does not purchase very many books when they do buy
log_preds = predict(all_models$glm, test, type="raw")
confusionMatrix(as.factor(log_preds), as.factor(test$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Non.Purchase Purchase
## Non.Purchase 1589 67
## Purchase 507 137
##
## Accuracy : 0.7504
## 95% CI : (0.7322, 0.768)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2177
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7581
## Specificity : 0.6716
## Pos Pred Value : 0.9595
## Neg Pred Value : 0.2127
## Prevalence : 0.9113
## Detection Rate : 0.6909
## Detection Prevalence : 0.7200
## Balanced Accuracy : 0.7148
##
## 'Positive' Class : Non.Purchase
##
We calculated profit for the company under two scenarios: a mass mailing to the entire list, and a targeted mailing guided by our model.
As mentioned, false positives cost BBBC money. We applied the domain knowledge given about the company's expenses in our calculations. Using our confusion matrix, we find that mailing to the entire list actually loses the company money, as seen in the equation below:
\[Profit_{Mass} = a \cdot b - c \cdot d - (e + e \cdot f) \cdot b = 6517.80 - 2535.00 - 4437.00 = \color{red}{-454.20}\]
where,
a = selling price of the book ($31.95)
b = # of people who bought the book (127+77 = 204)
c = cost of each mailing ($0.65)
d = # of people in the entire mailing list (2300 + 1600 = 3900)
e = cost to purchase and mail the book ($15)
f = percentage allocated as overhead for each book (0.45)
A mass-mailing campaign therefore loses money, leaving the company's profit at -$454.20. A more targeted mailing campaign built using our logistic regression model will be more beneficial to BBBC, as shown in the equation below. Details of this calculation come from the model's confusion matrix:
\[Profit_{Targeted} = a \cdot b - c \cdot d - (e + e \cdot f) \cdot b = 6517.80 - 78.65 - 4437.00 = \color{green}{2002.15}\]
where,
a = selling price of the book ($31.95)
b = # of people who bought the book (127+77 = 204)
c = cost of each mailing ($0.65)
d = # of people predicted to buy in our model (121)
e = cost to purchase and mail the book ($15)
f = percentage allocated as overhead for each book (0.45)
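For reproducibility, here is the same profit arithmetic in R, using the values defined above (the mailing cost is named c_mail to avoid masking base R's c() function):

a = 31.95      # selling price of the book
b = 204        # number of buyers
c_mail = 0.65  # cost of each mailing
e = 15         # cost to purchase and mail the book
f = 0.45       # overhead rate per book
a * b - c_mail * 3900 - (e + e * f) * b # mass mailing: -454.20
a * b - c_mail * 121 - (e + e * f) * b  # targeted mailing: 2002.15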
Using our logistic regression model, BBBC stands to make a $2,002.15 profit, as opposed to losing money with its mass marketing campaign. I would suggest implementing the model for two quarters and comparing the results against the previous two. During this analysis we were able to uncover significant variables and interpret them in a business sense, find an optimal model that fit our requirements, visualize our results, and compute a final profit figure for the company to compare with its current mass marketing campaign.