In 1994, about 50,000 new titles were published in the US each year, supporting a roughly $20B book publishing industry. About 10% of these books were sold through mail order. More recently, online superstores such as Amazon have emerged, carrying 1-2.5M titles and further intensifying the pressure on book clubs and mail-order firms. In response to these pressures, book clubs are starting to look at alternative business models that will make them more responsive to their customers' preferences.
The Bookbinders Book Club (BBBC) was established in 1986 for the purpose of selling specialty books through direct marketing. BBBC is strictly a distributor and does not publish any of the books it sells. In anticipation of using database marketing, BBBC made a strategic decision from the start to build and maintain a detailed database about its members containing all relevant information about them. Readers fill out an insert and return it to BBBC, which then enters the data into the database. The company currently has a database of 500,000 readers and sends out a mailing about once a month.
BBBC is exploring whether to use predictive modeling approaches to improve the efficacy of its direct mail program. For this analysis, we will use a subset of the database available to BBBC. The dependent variable (i.e., response group) for the analysis is Choice – purchase or no purchase of the book. BBBC also selected several independent variables that it thought might explain the observed choice behavior. The variables are:
| Variable | Description |
|---|---|
| Choice | Whether the customer purchased The Art History of Florence. 1 = purchase and 0 = non-purchase |
| Gender | 0 = Female and 1 = Male |
| Amount_purchased | Total money spent on BBBC books |
| Frequency | Total number of purchases in the chosen period |
| Last_purchase | Months since last purchase |
| First_purchase | Months since first purchase |
| P_Child | Number of children’s books purchased |
| P_Youth | Number of youth books purchased |
| P_Cook | Number of cookbooks purchased |
| P_DIY | Number of do-it-yourself books purchased |
| P_Art | Number of art books purchased |
We will use multiple models to classify customer purchasing behavior at BBBC. Our goal is to better predict the likelihood of success for the book company's direct marketing department as it moves away from mass marketing campaigns and toward better allocation of its resources. We will use logistic regression, random forest, and SVM modeling approaches to predict whether a customer will purchase a book, The Art History of Florence, after receiving an informative brochure about it.
This study looks to answer questions such as:
* Which customers should BBBC target?
* What is the most cost-effective approach for directly marketing to the company's customer base?
* What variables most affect a customer's purchase of this art book?
The data is provided in two Excel files already split into train and test datasets. The train data contains 1,600 rows, while the test data contains 2,300 rows. Here we will read in the data and inspect some of our training data.
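Throughout the analysis we lean on a handful of packages; assuming they are installed, the following loads everything used below:

library(readxl)        # read_excel()
library(dplyr)         # data wrangling verbs and %>%
library(magrittr)      # %<>% assignment pipe
library(caret)         # trainControl(), confusionMatrix()
library(caretEnsemble) # caretList(), caretModelSpec()
library(corrplot)      # correlation plot
library(pROC)          # roc() for AUC
library(ggplot2)       # ROC curve plot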
train = read_excel("BBBC-Train.xlsx")
test = read_excel("BBBC-Test.xlsx")
glimpse(train)
## Rows: 1,600
## Columns: 12
## $ Observation <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Gender <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0…
## $ Amount_purchased <dbl> 113, 418, 336, 180, 320, 268, 198, 280, 393, 138, 14…
## $ Frequency <dbl> 8, 6, 18, 16, 2, 4, 2, 6, 12, 10, 14, 10, 14, 6, 10,…
## $ Last_purchase <dbl> 1, 11, 6, 5, 3, 1, 12, 2, 11, 7, 2, 2, 2, 1, 2, 1, 9…
## $ First_purchase <dbl> 8, 66, 32, 42, 18, 4, 62, 12, 50, 38, 20, 18, 18, 6,…
## $ P_Child <dbl> 0, 0, 2, 2, 0, 0, 2, 0, 3, 2, 0, 0, 1, 0, 1, 1, 0, 0…
## $ P_Youth <dbl> 1, 2, 0, 0, 0, 0, 3, 2, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1…
## $ P_Cook <dbl> 0, 3, 1, 0, 0, 0, 2, 0, 3, 0, 0, 2, 0, 0, 0, 0, 2, 0…
## $ P_DIY <dbl> 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ P_Art <dbl> 0, 3, 2, 1, 2, 0, 2, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0…
The Observation variable must be removed. It is simply a row identifier, so it has no predictive value for our response variable, and leaving it in can cause errors during modeling.
train = train %>%
  select(-Observation)
test = test %>%
  select(-Observation)
glimpse(train)
## Rows: 1,600
## Columns: 11
## $ Choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Gender <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0…
## $ Amount_purchased <dbl> 113, 418, 336, 180, 320, 268, 198, 280, 393, 138, 14…
## $ Frequency <dbl> 8, 6, 18, 16, 2, 4, 2, 6, 12, 10, 14, 10, 14, 6, 10,…
## $ Last_purchase <dbl> 1, 11, 6, 5, 3, 1, 12, 2, 11, 7, 2, 2, 2, 1, 2, 1, 9…
## $ First_purchase <dbl> 8, 66, 32, 42, 18, 4, 62, 12, 50, 38, 20, 18, 18, 6,…
## $ P_Child <dbl> 0, 0, 2, 2, 0, 0, 2, 0, 3, 2, 0, 0, 1, 0, 1, 1, 0, 0…
## $ P_Youth <dbl> 1, 2, 0, 0, 0, 0, 3, 2, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1…
## $ P_Cook <dbl> 0, 3, 1, 0, 0, 0, 2, 0, 3, 0, 0, 2, 0, 0, 0, 0, 2, 0…
## $ P_DIY <dbl> 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ P_Art <dbl> 0, 3, 2, 1, 2, 0, 2, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0…
These raw outputs are an eyesore and don't reveal any obvious patterns. Let's start by checking whether our datasets contain any missing values.
sapply(train, function(x) sum(is.na(x)))
## Choice Gender Amount_purchased Frequency
## 0 0 0 0
## Last_purchase First_purchase P_Child P_Youth
## 0 0 0 0
## P_Cook P_DIY P_Art
## 0 0 0
sapply(test, function(x) sum(is.na(x)))
## Choice Gender Amount_purchased Frequency
## 0 0 0 0
## Last_purchase First_purchase P_Child P_Youth
## 0 0 0 0
## P_Cook P_DIY P_Art
## 0 0 0
Oh lucky day! Our datasets have no missing values! Multicollinearity is another issue we should check for. Keeping Garbage-In-Garbage-Out in mind, we don't want to include predictors that are largely determined by other predictors: redundant variables inflate the variance of the coefficient estimates and misrepresent each variable's contribution. We will use a correlation plot to visually represent the relationships between our variables.
corrplot(cor(train), method = 'number') # requires the corrplot package
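Rather than reading the relationship off the plot alone, we can also check the key pairwise correlation numerically; for example:

cor(train$First_purchase, train$Last_purchase) # correlation between the two purchase-history variables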
We find a high correlation between the variables First_purchase and Last_purchase. We will hold off on removing one of these variables from our model for now. This data is relatively clean, but we do need to ensure that variables with binary outcomes (i.e., Choice and Gender) are converted to factors prior to building our models.
train$Gender = as.factor(train$Gender)
test$Gender = as.factor(test$Gender)
# recode Choice with descriptive labels first, then convert to a factor
# (recoding after as.factor() would silently turn Choice back into a character vector)
train$Choice = as.factor(ifelse(train$Choice == 0, "Non.Purchase", "Purchase"))
test$Choice = as.factor(ifelse(test$Choice == 0, "Non.Purchase", "Purchase"))
Before training, let's set our trainControl conditions using the caret package. We will compare the results after k-fold resampling and apply weights to our model to deal with the data imbalance. Data imbalance occurs when one value is heavily over-represented in a variable; in this case, our binary response Choice has far more Non.Purchase than Purchase rows.
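To see the imbalance directly before correcting for it, we can tabulate the response:

table(train$Choice)             # raw counts of each class
prop.table(table(train$Choice)) # class proportions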
set.seed(123)
folds = 10
repeats = 3
# create model weights for training with imbalanced data
model_weights = ifelse(train$Choice == 'Non.Purchase',
                       (1/table(train$Choice)[1]) * 0.5,
                       (1/table(train$Choice)[2]) * 0.5)
# set trainControl parameters
control <- trainControl(method = 'repeatedcv',
                        number = folds,
                        repeats = repeats,
                        classProbs = TRUE,
                        savePredictions = 'final')
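As a sanity check on the weighting scheme: each observation's weight is half the reciprocal of its class count, so the weights within each class sum to 0.5 and the two classes contribute equally during training:

sum(model_weights[train$Choice == 'Non.Purchase']) # 0.5
sum(model_weights[train$Choice == 'Purchase'])     # 0.5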
We will use the caretList function to run all of our models at once: a logistic regression, three variations of support vector machine, and a random forest. Since this is a classification problem, we will evaluate the models based on AUROC and accuracy. After initially running a logistic regression, observing the correlation plot, and manually completing a stepwise selection based on p-value significance, I determined that the variables First_purchase and Amount_purchased had no significant effect on the response Choice and should not be included in the final model.
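That screening step can be reproduced with a plain, unweighted glm outside of caret; a minimal sketch (screen_fit is a hypothetical name, and its p-values will differ slightly from the weighted caret fit):

# fit a logistic regression on all predictors and inspect the p-values
screen_fit = glm(Choice ~ ., data = train, family = binomial)
summary(screen_fit) # look for weak predictors such as First_purchase and Amount_purchased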
my_models = c('svmLinear', 'svmPoly', 'svmRadial', 'glm', 'rf')
set.seed(123)
all_models = caretList(Choice ~ . - First_purchase - Amount_purchased,
                       data = train,
                       trControl = control,
                       metric = "Accuracy",
                       methodList = my_models,
                       weights = model_weights,
                       # quasibinomial family because the case weights are non-integer
                       tuneList = list(glm = caretModelSpec(method = 'glm', family = quasibinomial)))
## Warning in trControlCheck(x = trControl, y = target): indexes not defined in
## trControl. Attempting to set them ourselves, so each model in the ensemble will
## have the same resampling indexes.
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
(This warning was emitted once per resample and is truncated here. It is expected: our case weights are non-integer, which is exactly why the glm is given the quasibinomial family.)
Now that we have all of our model results, we want to compare them. We will use the dplyr package to reshape the resampling results into a single data frame containing each model's name alongside the accuracy it obtained on each resample.
# extract the per-resample results for each model
list_of_results <- lapply(my_models, function(x) all_models[[x]]$resample)
# combine into a single data frame
total_df <- do.call("bind_rows", list_of_results)
# add a column of model names next to their respective accuracies
total_df %<>% mutate(Model = rep(my_models, each = folds * repeats))
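A quick peek confirms the shape: one row per resample per model (the exact values depend on the resampling).

head(total_df, 3) # columns: Accuracy, Kappa, Resample, Model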
Now we will write a function that computes predicted probabilities for each of our models.
# This function creates a data frame of the predicted class probabilities for each test row
predict_prob = function(model_select) {
  predict(model_select, test, type = "prob") %>% # make predictions with the model
    as.data.frame() # convert to data frame
}
Now we make our predictions for each model. (I attempted to do this in a for loop, but that was not returning the desired results.)
pred_logit = predict_prob(all_models$glm)
pred_svm.lin = predict_prob(all_models$svmLinear)
pred_svm.rbf = predict_prob(all_models$svmRadial)
pred_svm.poly = predict_prob(all_models$svmPoly)
pred_rf = predict_prob(all_models$rf)
As mentioned earlier, we will compare each of our models using AUROC. To compute the AUC, let's write a function.
# This function uses `roc()` to calculate the AUC from our predicted probs
calc_auc = function(prob) {
  roc(test$Choice, prob)
}
Let's now use the function on each of our 5 models.
auc_logit = calc_auc(pred_logit[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.lin = calc_auc(pred_svm.lin[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.rbf = calc_auc(pred_svm.rbf[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_svm.poly = calc_auc(pred_svm.poly[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_rf = calc_auc(pred_rf[,2])
## Setting levels: control = Non.Purchase, case = Purchase
## Setting direction: controls < cases
auc_logit$auc[1] # index the auc element so only the numeric value is printed, not pROC's text
## [1] 0.7981204
Let's visualize the ROC curves and AUC results of our models using ggplot().
# create a data frame of ROC/AUC results for each model
df_auc = bind_rows(tibble(TPR = auc_logit$sensitivities,
                          FPR = 1 - auc_logit$specificities,
                          AUC = auc_logit$auc[1],
                          Model = "Logistic"),
                   tibble(TPR = auc_svm.lin$sensitivities,
                          FPR = 1 - auc_svm.lin$specificities,
                          AUC = auc_svm.lin$auc[1],
                          Model = "Linear SVM"),
                   tibble(TPR = auc_svm.rbf$sensitivities,
                          FPR = 1 - auc_svm.rbf$specificities,
                          AUC = auc_svm.rbf$auc[1],
                          Model = "RBF SVM"),
                   tibble(TPR = auc_svm.poly$sensitivities,
                          FPR = 1 - auc_svm.poly$specificities,
                          AUC = auc_svm.poly$auc[1],
                          Model = "Poly SVM"),
                   tibble(TPR = auc_rf$sensitivities,
                          FPR = 1 - auc_rf$specificities,
                          AUC = auc_rf$auc[1],
                          Model = "Random Forest"))
unique(df_auc$Model) # ensure each of our models is included
## [1] "Logistic" "Linear SVM" "RBF SVM" "Poly SVM"
## [5] "Random Forest"
# Plot ROC curves to compare each model
# The closer a curve bows toward the top-left corner, the better
df_auc %>%
  ggplot(aes(FPR, TPR, color = Model)) +
  geom_line(size = .5) +
  theme_bw() +
  coord_equal() +
  # the dashed diagonal marks a random (no-skill) classifier
  geom_abline(intercept = 0, slope = 1, color = 'gray37', size = 1, linetype = "dashed") +
  labs(x = "FPR (1 - Specificity)",
       y = "TPR (Sensitivity)",
       title = "ROC Curve and AUC: Model Comparison")
We must keep our business needs in mind. It is more detrimental to our use case to predict that a customer will purchase when they actually do not: these false positives represent wasted mailing expenditures, which we want to minimize. For this use case we therefore want both high sensitivity and high specificity; low specificity would indicate a model with a high false positive rate.
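As a refresher, both measures come straight from confusion-matrix counts; here is a small worked example with purely hypothetical counts:

# hypothetical counts for illustration only
TP = 140; FN = 60    # buyers we caught vs. buyers we missed
TN = 1600; FP = 500  # non-buyers correctly skipped vs. needlessly mailed
c(sensitivity = TP / (TP + FN), # 0.70
  specificity = TN / (TN + FP)) # ~0.76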
# Show which models had the highest resample accuracy
total_df %>%
  group_by(Model) %>%
  summarize(Best_Acc = max(Accuracy)) %>%
  ungroup() %>%
  arrange(-Best_Acc) %>%
  mutate_if(is.numeric, function(x) {round(x, 3)}) %>%
  knitr::kable(caption = "Table 1: Model Performance in decreasing order of Accuracy",
               col.names = c("Model", "Accuracy"))
| Model | Accuracy |
|---|---|
| svmLinear | 0.869 |
| svmPoly | 0.856 |
| rf | 0.838 |
| svmRadial | 0.825 |
| glm | 0.794 |
# Show which models had the best AUC
df_auc %>%
  group_by(Model) %>%
  summarize(AUC = max(AUC)) %>%
  ungroup() %>%
  arrange(-AUC) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 2: Model Performance in decreasing order of AUC",
               col.names = c("Model", "AUC"))
| Model | AUC |
|---|---|
| Logistic | 0.7981 |
| Linear SVM | 0.7933 |
| Poly SVM | 0.7912 |
| RBF SVM | 0.7650 |
| Random Forest | 0.7550 |
# Show which models had the best Sensitivity (aka TPR), averaged across ROC thresholds
df_auc %>%
  group_by(Model) %>%
  summarize(TPR = mean(TPR)) %>%
  ungroup() %>%
  arrange(-TPR) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 3: Model Performance in decreasing order of Sensitivity",
               col.names = c("Model", "TPR"))
| Model | TPR |
|---|---|
| Logistic | 0.7486 |
| Linear SVM | 0.7384 |
| Poly SVM | 0.7377 |
| RBF SVM | 0.7034 |
| Random Forest | 0.4541 |
# Compare the models' mean FPR (1 - Specificity) across ROC thresholds
df_auc %>%
  group_by(Model) %>%
  summarize(FPR = mean(FPR)) %>%
  ungroup() %>%
  arrange(-FPR) %>%
  mutate_if(is.numeric, function(x) {round(x, 4)}) %>%
  knitr::kable(caption = "Table 4: Model mean FPR (1 - Specificity) in decreasing order",
               col.names = c("Model", "FPR"))
| Model | FPR |
|---|---|
| Logistic | 0.4746 |
| Poly SVM | 0.4700 |
| Linear SVM | 0.4676 |
| RBF SVM | 0.4550 |
| Random Forest | 0.1743 |
Our logistic regression model seems to be performing the best overall, with the highest AUC and the highest mean sensitivity (note from Table 4 that this comes with the highest mean false positive rate, the cost of catching more buyers). Moving forward we will calculate how much our profits could increase when using this model in a targeted campaign.
Logistic regression models are also easier to interpret. Let's identify any potential business insights from the regression output.
summary(all_models$glm) # Logit model output
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.058859 -0.021428 -0.014849 -0.001057 0.084423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.125582 0.151226 7.443 1.60e-13 ***
## Gender1 -0.909757 0.123131 -7.389 2.39e-13 ***
## Frequency -0.088559 0.008997 -9.843 < 2e-16 ***
## Last_purchase 0.588613 0.074347 7.917 4.52e-15 ***
## P_Child -0.823686 0.106555 -7.730 1.89e-14 ***
## P_Youth -0.686482 0.131155 -5.234 1.88e-07 ***
## P_Cook -0.942850 0.109937 -8.576 < 2e-16 ***
## P_DIY -0.939051 0.129572 -7.247 6.60e-13 ***
## P_Art 0.657891 0.120530 5.458 5.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.0006307152)
##
## Null deviance: 1.3863 on 1599 degrees of freedom
## Residual deviance: 1.0690 on 1591 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
exp(all_models$glm$finalModel$coefficients) # calculate odds ratios to interpret the model in a business sense
## (Intercept) Gender1 Frequency Last_purchase P_Child
## 3.0820086 0.4026222 0.9152495 1.8014873 0.4388112
## P_Youth P_Cook P_DIY P_Art
## 0.5033437 0.3895162 0.3909985 1.9307154
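An odds ratio translates into a percentage change in the odds by subtracting 1 and multiplying by 100; a quick transformation of the values above:

# percent change in the odds per one-unit increase in each predictor
round((exp(all_models$glm$finalModel$coefficients) - 1) * 100, 1)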
At a high level, the odds ratios can be interpreted as suggesting that:
* Last_purchase: each additional month since a customer's last purchase multiplies the odds of purchase by about 1.8 (roughly an 80% increase)
* Frequency: each additional purchase in the period lowers the odds of buying the art book by roughly 8.5%

These are significant findings to consider. This model indicates the profile for an ideal Art History of Florence buyer would be:
* a woman
* who has purchased 1 or more art books in the past
* who has not purchased in a significant amount of time
* and who does not purchase very many books when they do buy
log_preds = predict(all_models$glm, test, type="raw")
confusionMatrix(as.factor(log_preds), as.factor(test$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Non.Purchase Purchase
## Non.Purchase 1589 67
## Purchase 507 137
##
## Accuracy : 0.7504
## 95% CI : (0.7322, 0.768)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2177
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7581
## Specificity : 0.6716
## Pos Pred Value : 0.9595
## Neg Pred Value : 0.2127
## Prevalence : 0.9113
## Detection Rate : 0.6909
## Detection Prevalence : 0.7200
## Balanced Accuracy : 0.7148
##
## 'Positive' Class : Non.Purchase
##
We calculated profit for the company under two scenarios: a mass mailing to the entire list, and a targeted mailing guided by our model.
As mentioned, false positives cost BBBC money. We applied the domain knowledge given about the company's expenses in our calculations. Using our confusion matrix, we find that mailing to the entire list actually loses the company money, as seen in the equation below:
\[Profit_{Mass} = a \cdot b - c \cdot d - (e + e \cdot f) \cdot b = 6517.80 - 2535.00 - 4437.00 = \color{red}{-454.20}\]
where,
a = selling price of the book ($31.95)
b = # of people who bought the book (127+77 = 204)
c = cost of each mailing ($0.65)
d = # of people in the entire mailing list (2300 + 1600 = 3900)
e = cost to purchase and mail the book ($15)
f = percentage allocated as overhead for each book (0.45)
A mass-mailing campaign therefore loses money, leaving the company's profit at -$454.20. A more targeted mailing campaign built using our logistic regression model will be more beneficial to BBBC, as shown in the equation below. Details of this calculation come from the model's confusion matrix:
\[Profit_{Targeted} = a \cdot b - c \cdot d - (e + e \cdot f) \cdot b = 6517.80 - 78.65 - 4437.00 = \color{green}{2002.15}\]
where,
a = selling price of the book ($31.95)
b = # of people who bought the book (127+77 = 204)
c = cost of each mailing ($0.65)
d = # of people predicted to buy in our model (121)
e = cost to purchase and mail the book ($15)
f = percentage allocated as overhead for each book (0.45)
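For reproducibility, here is the same profit arithmetic in R, using the values defined above (the mailing cost is named c_mail to avoid masking base R's c() function):

a = 31.95      # selling price of the book
b = 204        # number of buyers
c_mail = 0.65  # cost of each mailing
e = 15         # cost to purchase and mail the book
f = 0.45       # overhead rate per book
a * b - c_mail * 3900 - (e + e * f) * b # mass mailing: -454.20
a * b - c_mail * 121 - (e + e * f) * b  # targeted mailing: 2002.15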
Using our logistic regression model, BBBC stands to make a $2,002.15 profit, as opposed to losing money with its mass marketing campaign. I would suggest implementing the model for two quarters and comparing the results against the previous two. During this analysis we were able to uncover significant variables and interpret them in a business sense, find an optimal model that fit our requirements, visualize our results, and compute a final profit figure for the company to compare with its current mass marketing campaign.