Team Members: Sreeja Yalamaddi (yxm265)
This case study evaluates the effectiveness of four predictive modeling techniques—Linear Regression, Logistic Regression, SVM Linear, and Random Forest—in optimizing Bookbinders Book Club’s (BBBC) direct mail marketing campaigns. The goal was to predict customer purchase behavior and improve targeting strategies to maximize profitability. The analysis found that Logistic Regression with class weights and SVM Linear provided the best performance, with the weighted Logistic Regression achieving 74.2% accuracy and 95.91% precision. These models identified Frequency, Gender, and P_Art as the predictors most influential in determining the likelihood of a customer purchasing an art book.
The profit analysis indicated that using Logistic Regression for targeted mailing could generate an additional $75,929.31 in profit over a generic mass mailing campaign, highlighting the potential for more efficient and profitable marketing strategies. This case study recommends leveraging Logistic Regression with class weights for BBBC’s similar campaign in the Midwest to optimize resource allocation and drive higher returns.
Bookbinders Book Club (BBBC), a well-established book distributor, faces challenges in optimizing its direct mail marketing campaigns. With a large customer base, BBBC struggles to identify which customers are most likely to respond to offers, particularly for art books. The goal of this study is to apply predictive modeling techniques to improve the targeting of these campaigns, enhancing both marketing efficiency and profitability.
This report explores the use of four models—Linear Regression, Logistic Regression, Support Vector Machines (SVM), and Random Forest—to predict customer purchase behavior. By identifying key predictors of purchase decisions, the study aims to provide actionable insights that can increase response rates and drive greater profits from BBBC’s marketing efforts.
Key questions addressed include:
Which model best predicts customer purchase behavior?
What factors most influence a customer’s likelihood to purchase?
How can BBBC optimize its direct mail campaigns for maximum profitability?
This analysis will guide the BBBC in refining its marketing strategy for more targeted and profitable campaigns.
Predictive modeling in marketing has commonly used models like Logistic Regression, Support Vector Machines (SVM), and Random Forests. Logistic Regression is widely applied for binary classification tasks due to its simplicity and interpretability (Hosmer & Lemeshow, 2000). SVM has gained popularity for its robustness in handling complex, high-dimensional datasets (Cortes & Vapnik, 1995), and Random Forests, an ensemble method, is often favored for its ability to model complex, non-linear relationships (Breiman, 2001). Additionally, techniques like class weighting and resampling have been proposed to address class imbalances commonly found in marketing data (Chawla et al., 2002).
While Logistic Regression and Random Forests have been extensively explored, there is limited research on SVM in marketing prediction tasks, especially with class weighting to address imbalanced data. Furthermore, there is a lack of comparative studies that analyze feature importance across different models, especially in niche applications like book distribution. Additionally, many studies focus on model accuracy without considering real-world business implications, such as profitability or cost-benefit analysis in marketing campaigns.
This study uses Logistic Regression, SVM, and Random Forests because these methods are well established in the literature and address different strengths in predictive modeling. The addition of class weights to SVM improves performance on imbalanced datasets, which is crucial in marketing. Random Forests provide insights into feature importance, allowing a better understanding of customer behavior. By focusing on business outcomes like profitability, this study bridges the gap between model accuracy and practical application in marketing campaigns.
The dataset was provided by Bookbinders Book Club (BBBC), a book distributor that manages customer records for direct mail marketing. The dataset was compiled from internal transactions and demographic data of customers.
The dataset contains 1,600 training samples and 2,300 test samples, for a total of 3,900 entries. It includes multiple variables covering customer demographics and behavioral features related to book purchase history.
The dataset includes various features that help predict customer purchase behavior.
Independent variables:
Amount_purchase (numeric): total money spent on BBBC books
Frequency (numeric): total number of purchases in the chosen period
First_purchase (numeric): months since first purchase
Last_purchase (numeric): months since last purchase
P_Child (numeric): number of children’s books purchased
P_Youth (numeric): number of youth books purchased
P_Cook (numeric): number of cookbooks purchased
P_DIY (numeric): number of do-it-yourself books purchased
P_Art (numeric): number of art books purchased
Gender (numeric): coded 0 and 1; converted to a categorical factor during preprocessing
Dependent variable:
Choice (binary): 1 if the customer purchased the art book, 0 otherwise
The target variable shows a strong class imbalance, with only 25% of customers purchasing art books. Visualizing this distribution highlights the need to account for class imbalance during model evaluation.
Bar chart showing the count of 0s (non-purchasers) vs. 1s (purchasers).
Univariate Analysis: Continuous variables such as Amount_Purchased and Frequency were analyzed through boxplots and histograms, revealing right-skewed distributions for several variables and significant outliers for some, such as First_purchase. Visualizing these distributions helps determine whether variables need normalization or transformation.
Correlation Analysis: We visualized the relationships between continuous features (e.g., Amount_Purchased, Frequency) using a correlation matrix and heatmap, which clarify how features relate to one another. First_purchase and Last_purchase were strongly correlated; after evaluating the Variance Inflation Factor (VIF) on the numeric variables, we dropped those with high VIF values, such as Last_purchase.
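A minimal sketch of the VIF screen, mirroring the appendix code (Choice is temporarily treated as numeric so an lm() can be fit; variable names are those of the dataset):
# Fit a linear model on the numeric predictors and inspect VIF;
# values well above 10 flag strong multicollinearity.
library(car)
train_data$Choice <- as.numeric(as.character(train_data$Choice))
vif_model <- lm(Choice ~ ., data = train_data %>% select(-Gender))
vif(vif_model) # Last_purchase (19.1) and First_purchase (12.6) stand out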
There are no missing values in the dataset, meaning there’s no need for imputation.
Outliers in continuous variables such as Amount_Purchased and Frequency were inspected visually using box plots and histograms. Heavily skewed variables such as First_purchase were log-transformed to stabilize variance, which noticeably improved model performance.
Standardization (z-score) was applied to continuous features (e.g., Amount_Purchased, Frequency) to bring them to a similar scale. This step is crucial for machine learning models like SVM and Logistic Regression, which are sensitive to feature scales.
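A compact sketch of these preprocessing steps as implemented in the appendix (X_train and X_test are the predictor frames defined there; caret supplies preProcess):
# log(1 + x) tames the right skew of First_purchase and handles zeros
train_data$First_purchase_log <- log1p(train_data$First_purchase)
test_data$First_purchase_log <- log1p(test_data$First_purchase)
# z-score parameters are fit on the training predictors only (Gender excluded),
# then applied to both sets so no test information leaks into the scaling
library(caret)
preProcValues <- preProcess(X_train %>% select(-Gender), method = c("center", "scale"))
X_train_scaled <- predict(preProcValues, X_train %>% select(-Gender))
X_test_scaled <- predict(preProcValues, X_test %>% select(-Gender))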
Due to the significant imbalance in the target class (only 25% of customers purchased an art book), several strategies were implemented:
Class Weighting: For models like Logistic Regression, SVM, and Random Forest, class weights were adjusted to give more importance to the minority class (purchasers); a minimal sketch follows this list.
Evaluation Metrics: To evaluate model performance more effectively, metrics like Precision, Recall, F1-Score, and AUC (Area Under the Curve) were prioritized over accuracy due to the class imbalance.
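A minimal sketch of the inverse-frequency weighting used for the weighted models, mirroring the appendix:
# Inverse class frequencies: the minority class (buyers) gets the larger weight
class_freq <- table(y_train) # 1,200 non-buyers vs. 400 buyers in training
class_weights <- 1 / class_freq
# Weighted logistic regression: each observation carries its class's weight
logit_model_balanced <- glm(y_train ~ ., data = X_train_scaled, family = binomial(),
weights = ifelse(y_train == 1, class_weights[2], class_weights[1]))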
Linear Regression: Chosen as a baseline model for comparison. While it’s generally used for continuous outcomes, we applied it to binary classification with a threshold. It’s simple but may not capture the complexity of customer behavior and gives unbounded predictions.
Logistic Regression: Ideal for binary classification, providing probabilistic outputs. It’s interpretable but assumes linear relationships between predictors and log-odds, which might not capture non-linearities.
Support Vector Machines (SVM): Suitable for handling high-dimensional data and complex decision boundaries. It can be computationally expensive and harder to interpret but works well for non-linear classification tasks.
Random Forest Classifier: An ensemble method that handles non-linear relationships and interactions between features. It’s highly accurate but less interpretable and can overfit if not properly tuned.
Linear Regression: Assumes linearity between features and target, and homoscedasticity. Not ideal for binary classification, as its continuous output can lead to unrealistic predictions.
Logistic Regression: Assumes linearity in log-odds, independence of observations, and no multicollinearity. May struggle with non-linear relationships or interactions between features.
SVM: Assumes linear separability in high-dimensional space and independence of observations. Computationally expensive and sensitive to hyperparameter tuning.
Random Forest: Makes no strict assumptions about the data distribution but assumes some independence between trees. Prone to overfitting and lacks interpretability.
Linear Regression: Trained as a baseline model. Despite being designed for continuous outcomes, we adapted it for binary classification by applying a 0.5 threshold to its continuous predictions.
Logistic Regression: Aimed at binary classification, this model estimates the probability of a customer purchasing an art book. We used class weights to handle the class imbalance, enhancing the precision for the minority class (art book purchasers).
Support Vector Machines (SVM): A linear SVM was trained to classify customers, with class weights applied to balance the dataset and focus on predicting the minority class (purchases).
Random Forest: Trained as an ensemble of decision trees. Class weights were also applied to improve predictions for the minority class. Hyperparameter tuning (e.g., max depth, n_estimators) was performed using cross-validation to avoid overfitting and improve model performance.
The dataset had a significant class imbalance, with only 25% of customers making a purchase. To address this, we adjusted the class weights for the Logistic Regression, SVM, and Random Forest models, which penalized misclassifications of the minority class (purchasers).
Impact on Model Performance:
Logistic Regression: With class weights, precision improved to 95.91%, while recall decreased slightly, reflecting a trade-off; the weighted model is better suited for targeting potential buyers with high confidence.
SVM: Similar improvements were observed with SVM using class weights, increasing precision and recall for the minority class.
Random Forest: The Random Forest model benefited from class weights, achieving higher recall and precision for the minority class, although the accuracy was slightly lower compared to the unweighted model.
The class weight adjustments allowed the models to focus more on predicting art book purchasers, significantly improving their performance despite the class imbalance.
The performance of each model was evaluated using the following metrics: accuracy, precision, recall, F1-score, and AUC. Here’s a summary of how each model performed:
Logistic Regression and SVM Linear performed best in terms of precision, recall, and F1-score, with Logistic Regression (weights) having the highest precision (95.91%).
Linear Regression was the least effective, showing a low accuracy and F1-score, highlighting its inadequacy for binary classification tasks.
Logistic Regression and SVM Linear models performed similarly well, with AUC scores of 0.79, indicating strong discriminative ability.
Confusion matrices revealed that Logistic Regression (weights), despite its lower accuracy, significantly reduced false positives, making it ideal for a business context where identifying potential buyers is key.
Logistic Regression is the most interpretable model, with the ability to directly understand the relationship between features and the target variable (purchasing behavior). The coefficients provide clear insight into how features impact the likelihood of a purchase.
SVM and Random Forest, while powerful, are less interpretable. Random Forest relies on decision trees, and although feature importance can be derived, the complexity of multiple trees makes it harder to understand individual predictions.
P_Art (Art-related purchases) was the most influential feature across all models, confirming that previous art book purchases are strong predictors of future purchases.
Other important features included Frequency (purchase frequency) and Gender, which highlighted customer engagement and demographic factors.
Linear Regression:
Advantages: Simple, fast, and easy to interpret.
Disadvantages: Not suitable for binary classification tasks; accuracy and F1-score were poor due to continuous output.
Logistic Regression:
Advantages: Highly interpretable, suitable for binary classification, and performs well with class weights in imbalanced datasets.
Disadvantages: May underperform if the relationships between variables are not linear.
SVM Linear:
Advantages: Strong performance on high-dimensional data, effective with imbalanced datasets when using class weights.
Disadvantages: Less interpretable compared to logistic regression, and may require more computational power for large datasets.
Random Forest:
Advantages: Handles complex, non-linear relationships and interactions between features well. Class weights improve performance on imbalanced datasets.
Disadvantages: Less interpretable due to its ensemble nature, and prone to overfitting if not tuned properly.
This section demonstrates how the model’s findings can be applied to a larger customer base, particularly for the Midwest region with 50,000 customers. By applying the logistic regression model, we estimate the potential profitability of the mailing campaign.
Evaluating the business implications of the mailing campaign includes considering the associated costs and potential benefits:
Costs: These include the book cost ($15), overhead allocation (45% of the book cost, or $6.75), and mailing costs ($0.65 per customer). These operational costs are deducted from the revenue generated from sales to calculate profitability.
Benefits: The revenue from the selling price of $31.95 per book, and the resulting profits from the estimated number of successful sales, form the core of the benefits calculation.
In a scenario where direct mail is sent to all 50,000 customers without any targeting, the expected profit is calculated by multiplying the expected number of buyers (based on the 9.03% response rate observed in a previous mailing campaign) by the profit per sale.
Expected buyers = 50,000 × 0.0903 = 4,515 buyers
Revenue: 4,515 buyers * $31.95 selling price = $144,254.25
Mailing Costs: 50,000 customers * $0.65 mailing cost = $32,500
Profit: Revenue - Book Costs - Overhead Costs - Mailing Costs = $144,254.25 - (4,515 * $15 book cost) - (4,515 * $6.75 overhead cost) - $32,500 mailing cost = $13,553 in profit.
The expected profit for the full mailing approach, considering 50,000 customers, is $13,553.
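The same arithmetic in R, using the cost parameters above:
book_cost <- 15
overhead_cost <- 0.45 * book_cost # $6.75 per book
selling_price <- 31.95
mailing_cost <- 0.65
profit_per_sale <- selling_price - book_cost - overhead_cost # $10.20
expected_buyers <- 50000 * 0.0903 # 4,515 buyers
profit_full <- expected_buyers * profit_per_sale - 50000 * mailing_cost # $13,553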
Using the logistic regression model to target the most likely buyers can improve the profitability of the campaign by focusing resources on customers with the highest probability of conversion.
Targeted Mailing using Logistic Model: The logistic model predicts a probability of purchase for each customer. By selecting customers with a purchase probability greater than 0.5, we target a subset of customers. This targeted approach allows for more efficient use of marketing resources.
The expected profit for each targeted customer is calculated as:
Profit per Targeted Customer = (Profit per Sale × Probability of Purchase) − Mailing Cost
Using this method, the average expected profit per targeted customer is $6.21.
Targeted Group Size: The size of the targeted group is estimated by applying the probability threshold to the customer base. For the 50,000 customers in the Midwest, the targeted group size is estimated at 14,413 customers.
Total Expected Profit for Targeted Mailing: The total expected profit for the targeted mailing campaign is calculated by multiplying the average expected profit per targeted customer by the size of the targeted group:
Total Expected Profit for Targeted Mailing = $6.21 × 14,413 ≈ $89,482.31
Full Mailing Campaign: Expected profit = $13,553
Targeted Mailing using Logistic Model: Expected profit = $89,482.31
The additional profit from the targeted mailing approach over the full mailing strategy using the logistic model is:
Additional Profit = $89,482.31 − $13,553 = $75,929.31
Thus, using the targeted mailing approach significantly boosts the profitability of the campaign by $75,929.31.
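A sketch of the targeted-mailing arithmetic, mirroring the appendix (logit_preds_prob_bl holds the weighted logistic model's purchase probabilities on the test set; the targeted fraction is extrapolated to the 50,000 Midwest customers):
targeted <- which(logit_preds_prob_bl >= 0.5) # customers predicted likely to buy
avg_profit <- mean(profit_per_sale * logit_preds_prob_bl[targeted] - mailing_cost) # ~$6.21
group_size <- length(targeted) / length(y_test) * 50000 # ~14,413 customers
total_profit_targeted <- avg_profit * group_size # ~$89,482.31
additional_profit <- total_profit_targeted - profit_full # ~$75,929.31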
Similarly, applying the Support Vector Machine (SVM) model to target likely buyers yields an expected profit of $22,245.26, an additional $8,692.26 over the mass mailing campaign.
While the SVM model also provides a valuable increase in profit, the logistic model outperforms it in terms of profitability: the targeted mailing approach using the logistic model provides a $75,929.31 increase in profit, compared to the $8,692.26 increase with the SVM model.
Feature Engineering:
Although we have identified key variables (such as Frequency, Gender, and P_Art) as significant predictors of customer purchases, latent features may exist that could improve the model’s performance. Advanced feature engineering, such as creating interaction terms or exploring non-linear relationships between features, could lead to more robust models. Exploring temporal features, such as customer purchase cycles or promotional impacts, might also enhance the model’s predictive capabilities.
Handling Class Imbalance:
Although we used class weights to handle the class imbalance, alternative techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or other resampling methods could further help balance the dataset and improve recall for the minority class (art book purchasers), ensuring that the model does not become biased toward the majority class.
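As an illustration, a sketch using the smotefamily package (an assumption; this package is not used in the present analysis). SMOTE expects all-numeric predictors, so Gender is re-encoded first:
library(smotefamily)
X_num <- X_train_scaled
X_num$Gender <- as.numeric(as.character(X_num$Gender)) # factor back to 0/1
# Synthesize minority-class rows from each point's 5 nearest neighbours
smote_out <- SMOTE(X = X_num, target = y_train, K = 5)
train_bal <- smote_out$data # original plus synthetic rows; target column is named "class"
table(train_bal$class) # classes should now be roughly balanced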
Hyperparameter Tuning:
Optimizing the hyperparameters of models such as Random Forest and SVM through grid search or random search could yield better performance by finding the optimal combination of parameters. Fine-tuning could improve the models’ generalization and reduce overfitting.
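For example, a cross-validated grid search over the random forest's mtry via caret (a sketch; the No/Yes relabeling is required by twoClassSummary):
library(caret)
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
summaryFunction = twoClassSummary)
rf_tuned <- train(x = X_train_scaled, y = factor(y_train, labels = c("No", "Yes")),
method = "rf", metric = "ROC", trControl = ctrl,
tuneGrid = expand.grid(mtry = c(2, 3, 4, 6)), ntree = 500)
rf_tuned$bestTune # mtry with the best cross-validated AUC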
Support Vector Machines with Non-linear Kernels:
While we used a linear kernel for the SVM, experimenting with non-linear kernels (such as RBF or polynomial) could potentially improve performance, especially if the relationship between features and the target variable is not strictly linear.
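A sketch of tuning an RBF-kernel SVM with e1071's built-in grid search (the response is bound into the data frame so the cross-validation splits stay aligned):
library(e1071)
train_df <- cbind(Choice = y_train, X_train_scaled)
svm_tune <- tune.svm(Choice ~ ., data = train_df, kernel = "radial",
cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1),
class.weights = class_weights)
summary(svm_tune) # cross-validated error for each cost/gamma pair
best_svm <- svm_tune$best.model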
Gradient Boosting Machines (GBM):
Models like XGBoost or LightGBM could be explored as alternatives to Random Forest. These gradient boosting models often provide superior performance on structured data like this by iteratively improving weak learners (trees). They can also handle both numerical and categorical variables efficiently and might outperform Random Forest in prediction accuracy and precision.
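A sketch with the xgboost package (an assumption; it is not used in this report). data.matrix() converts the factor Gender to numeric codes, and a scale_pos_weight of roughly 3 offsets the 1,200:400 class ratio:
library(xgboost)
dtrain <- xgb.DMatrix(data.matrix(X_train_scaled),
label = as.numeric(as.character(y_train)))
params <- list(objective = "binary:logistic", max_depth = 4, eta = 0.1,
scale_pos_weight = 3)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 200)
xgb_probs <- predict(xgb_fit, data.matrix(X_test_scaled)) # purchase probabilities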
Deep Learning Models:
Deep learning models, such as neural networks, could also be explored for this task. Given the relatively small dataset, a deep learning model may not outperform traditional machine learning algorithms like Random Forest or SVM; however, with larger datasets or more complex feature sets, deep learning could uncover non-linear patterns and interactions that simpler models may miss.
Each of these advanced methods offers potential benefits but also introduces complexity. The choice of technique should depend on the trade-offs between model interpretability, computational resources, and expected improvement in predictive accuracy.
Logistic Regression with Class Weights is the most effective model for BBBC’s direct mail campaigns, achieving the highest precision (95.91%) with a solid F1-score (84.11%). It effectively addresses class imbalance, making it ideal for targeting potential art book buyers.
BBBC should prioritize Logistic Regression for future campaigns due to its strong performance, simplicity, and easy integration. Automating this model for real-time predictions will improve scalability and efficiency. As BBBC grows, this model can handle larger datasets while remaining resource-efficient.
An automated pipeline for continuous model retraining and performance monitoring is essential to adapt to changing customer behaviors. While Logistic Regression is currently the best option, exploring more complex models like Random Forest and SVM could further refine targeting for more complex behaviors.
Building in-house expertise in predictive modeling will enable BBBC to continuously optimize campaigns, adapt to emerging trends, and improve marketing efficiency. Integrating predictive models, automation, and internal expertise will significantly boost targeting accuracy and profitability.
#############################################
# BBBC Direct Mail Analysis #
# Data Analysis, Modeling & Profitability #
#############################################
# Install necessary packages
#install.packages(c("readxl", "dplyr", "ggplot2", "corrplot", "caTools", "e1071", "caret"))
# Load libraries
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(PRROC)
## Warning: package 'PRROC' was built under R version 4.4.2
library(vip)
## Warning: package 'vip' was built under R version 4.4.2
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(RColorBrewer)
train_data <- read_xlsx("./BBBC-Train.xlsx")
test_data <- read_xlsx("./BBBC-Test.xlsx")
rownames(train_data) <- train_data$Observation # Make "Observation" the row names
## Warning: Setting row names on a tibble is deprecated.
train_data <- train_data %>% select(-Observation) # Remove the "Observation" column
rownames(test_data) <- test_data$Observation # Make "Observation" the row names
## Warning: Setting row names on a tibble is deprecated.
test_data <- test_data %>% select(-Observation) # Remove the "Observation" column
# View structure of data
str(train_data)
## tibble [1,600 × 11] (S3: tbl_df/tbl/data.frame)
## $ Choice : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
## $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
## $ Frequency : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
## $ Last_purchase : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
## $ First_purchase : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
## $ P_Child : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
## $ P_Youth : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
## $ P_Cook : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
## $ P_DIY : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
## $ P_Art : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
# Check for missing values
sum(is.na(train_data))
## [1] 0
# Summary statistics
summary(train_data)
## Choice Gender Amount_purchased Frequency
## Min. :0.00 Min. :0.0000 Min. : 15.0 Min. : 2.00
## 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:126.8 1st Qu.: 6.00
## Median :0.00 Median :1.0000 Median :203.0 Median :12.00
## Mean :0.25 Mean :0.6587 Mean :200.9 Mean :12.31
## 3rd Qu.:0.25 3rd Qu.:1.0000 3rd Qu.:273.0 3rd Qu.:16.00
## Max. :1.00 Max. :1.0000 Max. :474.0 Max. :36.00
## Last_purchase First_purchase P_Child P_Youth
## Min. : 1.000 Min. : 2.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 1.000 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 2.000 Median :18.00 Median :0.0000 Median :0.0000
## Mean : 3.199 Mean :22.58 Mean :0.7394 Mean :0.3375
## 3rd Qu.: 4.000 3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :12.000 Max. :96.00 Max. :8.0000 Max. :4.0000
## P_Cook P_DIY P_Art
## Min. :0.00 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00 Median :0.0000 Median :0.000
## Mean :0.76 Mean :0.3912 Mean :0.425
## 3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :6.00 Max. :4.0000 Max. :5.000
# Factor conversion of Gender and target variables
train_data$Gender <- as.factor(train_data$Gender)
test_data$Gender <- as.factor(test_data$Gender)
train_data$Choice <- as.factor(train_data$Choice)
test_data$Choice <- as.factor(test_data$Choice)
# Bar plot for categorical target distribution
ggplot(train_data, aes(x = as.factor(Choice), fill = as.factor(Choice))) +
geom_bar() +
labs(title = "Distribution of Target Variable", x = "Choice (0 = No, 1 = Yes)", y = "Count") +
scale_fill_manual(values = c("#56B4E9", "#E69F00")) + # CUD colors
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) # Center title
ggsave("target_dist.png", width = 8, height = 6, dpi = 300)
table(train_data$Choice)
##
## 0 1
## 1200 400
table(test_data$Choice)
##
## 0 1
## 2096 204
numeric_vars <- train_data %>% select(where(is.numeric))
# Convert data to long format for faceting
numeric_long <- numeric_vars %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Generate 10 distinct color-blind friendly colors from RColorBrewer
color_palette <- RColorBrewer::brewer.pal(10, "Set3")
# Create facet grid of histograms
ggplot(numeric_long, aes(x = Value, fill = Variable)) +
geom_histogram(bins = 10, color = "black", alpha = 0.6) + # black outlines for readability
facet_wrap(~ Variable, scales = "free", ncol = 5, nrow = 2) + # one panel per variable
labs(title = "Histograms of All Numeric Variables", x = "Value", y = "Count") +
scale_fill_manual(values = color_palette) + # Use automatically generated color palette
theme_minimal() +
theme(axis.text.x = element_blank(), legend.position = "none", plot.title = element_text(hjust = 0.5))
ggsave("hist_plot.png", width = 8, height = 6, dpi = 300)
# Create facet grid of boxplots
ggplot(numeric_long, aes(x = Variable, y = Value, fill = Variable)) +
geom_boxplot(outlier.colour = "#D55E00", outlier.shape = 16, outlier.size = 2) + # CUD color for outliers
facet_wrap(~ Variable, scales = "free", ncol = 5, nrow = 2) + # Create facet grid
labs(title = "Boxplots of All Numeric Variables", x = "Variables", y = "Values") +
scale_fill_manual(values = color_palette) + # Use automatically generated color palette
theme_minimal() +
theme(axis.text.x = element_blank(), legend.position = "none", plot.title = element_text(hjust = 0.5))
ggsave("box_plot.png", width = 8, height = 6, dpi = 300)
# handling outliers / non normal distribution using logarithm function
train_data$First_purchase_log <- log1p(train_data$First_purchase) # log(1 + x) handles zero values
test_data$First_purchase_log <- log1p(test_data$First_purchase) # log(1 + x) handles zero values
# correlation analysis
corr_matrix <- cor(train_data %>% select(-Choice, -Gender), use = "complete.obs")
corrplot(corr_matrix, method="circle", type="upper", tl.cex=0.8)
# check for multicollinearity
# Convert Choice to numeric if it's a factor
train_data$Choice <- as.numeric(as.character(train_data$Choice))
# Calculate VIF for each predictor variable
vif_model <- lm(Choice ~ ., data = train_data %>% select(-Gender))
vif(vif_model)
## Amount_purchased Frequency Last_purchase First_purchase
## 1.250025 3.789095 19.100132 12.601333
## P_Child P_Youth P_Cook P_DIY
## 3.353938 1.774316 3.322654 2.014644
## P_Art First_purchase_log
## 2.273068 7.216250
# Convert the target variable 'Choice' to a factor
train_data$Choice <- as.factor(train_data$Choice)
test_data$Choice <- as.factor(test_data$Choice)
# Define predictor variables and target
X_train <- train_data %>% select(-Choice,-First_purchase, -Last_purchase)
y_train <- factor(train_data$Choice)
X_test <- test_data %>% select(-Choice,-First_purchase, -Last_purchase)
y_test <- factor(test_data$Choice)
# Standardize predictor variables
preProcValues <- preProcess(X_train %>% select(-Gender), method = c("center", "scale"))
X_train_scaled <- predict(preProcValues, X_train %>% select(-Gender))
X_test_scaled <- predict(preProcValues, X_test %>% select(-Gender))
X_train_scaled['Gender'] = X_train['Gender']
X_test_scaled['Gender'] = X_test['Gender']
linear_model <- lm(as.numeric(as.character(y_train)) ~ ., data = X_train_scaled)
summary(linear_model)
##
## Call:
## lm(formula = as.numeric(as.character(y_train)) ~ ., data = X_train_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9612 -0.2465 -0.1312 0.1690 1.1209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.333426 0.016509 20.197 < 2e-16 ***
## Amount_purchased 0.035066 0.010698 3.278 0.001068 **
## Frequency -0.110665 0.015453 -7.161 1.22e-12 ***
## P_Child -0.040049 0.012083 -3.315 0.000938 ***
## P_Youth -0.007494 0.010624 -0.705 0.480672
## P_Cook -0.055425 0.012100 -4.580 5.00e-06 ***
## P_DIY -0.033158 0.011053 -3.000 0.002741 **
## P_Art 0.151524 0.011322 13.383 < 2e-16 ***
## First_purchase_log 0.039154 0.020907 1.873 0.061284 .
## Gender1 -0.126643 0.020353 -6.222 6.25e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3853 on 1590 degrees of freedom
## Multiple R-squared: 0.2132, Adjusted R-squared: 0.2087
## F-statistic: 47.86 on 9 and 1590 DF, p-value: < 2.2e-16
# Make predictions
lm_predictions <- predict(linear_model, test_data)
# Evaluate predictions
predicted_values <- ifelse(lm_predictions > 0.5, 1, 0) # Classify as 1 or 0 based on 0.5 threshold
actual_values <- y_test
# Compare predicted vs actual values
lm_cm <- confusionMatrix(as.factor(predicted_values), as.factor(actual_values))
print(lm_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 155 6
## 1 1941 198
##
## Accuracy : 0.1535
## 95% CI : (0.139, 0.1689)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0084
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07395
## Specificity : 0.97059
## Pos Pred Value : 0.96273
## Neg Pred Value : 0.09257
## Prevalence : 0.91130
## Detection Rate : 0.06739
## Detection Prevalence : 0.07000
## Balanced Accuracy : 0.52227
##
## 'Positive' Class : 0
##
# Performance metrics
lm_accuracy <- lm_cm$overall['Accuracy']
lm_precision <- lm_cm$byClass['Pos Pred Value']
lm_recall <- lm_cm$byClass['Sensitivity']
lm_f1 <- 2 * ((lm_precision * lm_recall) / (lm_precision + lm_recall))
# Print the metrics
cat("Accuracy: ", lm_accuracy, "\n")
## Accuracy: 0.1534783
cat("Precision: ", lm_precision, "\n")
## Precision: 0.9627329
cat("Recall: ", lm_recall, "\n")
## Recall: 0.07395038
cat("F1 Score: ", lm_f1, "\n")
## F1 Score: 0.1373505
# Check the min and max values of the predictions
min_pred <- min(lm_predictions)
max_pred <- max(lm_predictions)
# Print the min and max predictions
cat("Min Prediction: ", min_pred, "\n")
## Min Prediction: -2.755344
cat("Max Prediction: ", max_pred, "\n")
## Max Prediction: 15.91
# Plot actual vs predicted values
ggplot(data.frame(Actual = actual_values, Predicted = lm_predictions), aes(x = Actual, y = Predicted)) +
geom_jitter() +
labs(title = "Linear Regression: Actual vs Predicted", x = "Actual", y = "Predicted")
# ------------------------------------
# 2. Train Base Models
# ------------------------------------
#### 2. Logistic Regression ####
logit_model <- glm(y_train ~ ., data = X_train_scaled, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = y_train ~ ., family = binomial, data = X_train_scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.86153 0.10503 -8.203 2.35e-16 ***
## Amount_purchased 0.22551 0.07372 3.059 0.002221 **
## Frequency -0.93893 0.12741 -7.369 1.72e-13 ***
## P_Child -0.31167 0.08734 -3.569 0.000359 ***
## P_Youth -0.09128 0.07516 -1.215 0.224538
## P_Cook -0.41300 0.08754 -4.718 2.38e-06 ***
## P_DIY -0.26067 0.07888 -3.304 0.000952 ***
## P_Art 0.84021 0.07817 10.749 < 2e-16 ***
## First_purchase_log 0.35538 0.13802 2.575 0.010026 *
## Gender1 -0.81461 0.13483 -6.042 1.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1438.2 on 1590 degrees of freedom
## AIC: 1458.2
##
## Number of Fisher Scoring iterations: 5
logit_preds_prob <- predict(logit_model, newdata = X_test_scaled, type = "response")
logit_preds_class <- factor(ifelse(logit_preds_prob >= 0.5, 1, 0), levels = levels(y_test))
logit_cm <- confusionMatrix(logit_preds_class, y_test)
logit_roc <- roc(as.numeric(y_test), logit_preds_prob)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
cat("=== Logistic Regression Metrics ===\n")
## === Logistic Regression Metrics ===
print(logit_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1990 135
## 1 106 69
##
## Accuracy : 0.8952
## 95% CI : (0.882, 0.9074)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.99642
##
## Kappa : 0.3074
##
## Mcnemar's Test P-Value : 0.07129
##
## Sensitivity : 0.9494
## Specificity : 0.3382
## Pos Pred Value : 0.9365
## Neg Pred Value : 0.3943
## Prevalence : 0.9113
## Detection Rate : 0.8652
## Detection Prevalence : 0.9239
## Balanced Accuracy : 0.6438
##
## 'Positive' Class : 0
##
cat("AUC:", auc(logit_roc), "\n\n")
## AUC: 0.7947362
#### 3. SVM with Linear Kernel ####
svm_linear_model <- svm(y_train ~ ., data = X_train_scaled, kernel = "linear", probability = TRUE)
summary(svm_linear_model)
##
## Call:
## svm(formula = y_train ~ ., data = X_train_scaled, kernel = "linear",
## probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 746
##
## ( 372 374 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_linear_pred_obj <- predict(svm_linear_model, newdata = X_test_scaled, probability = TRUE)
svm_linear_probs <- attr(svm_linear_pred_obj, "probabilities")[, which(colnames(attr(svm_linear_pred_obj, "probabilities")) == "1")]
svm_linear_preds_class <- factor(ifelse(svm_linear_probs >= 0.5, 1, 0), levels = levels(y_test))
svm_linear_cm <- confusionMatrix(svm_linear_preds_class, y_test)
svm_linear_roc <- roc(as.numeric(as.character(y_test)), svm_linear_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("=== SVM (Linear Kernel) Metrics ===\n")
## === SVM (Linear Kernel) Metrics ===
print(svm_linear_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1971 132
## 1 125 72
##
## Accuracy : 0.8883
## 95% CI : (0.8747, 0.9009)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.2979
##
## Mcnemar's Test P-Value : 0.7082
##
## Sensitivity : 0.9404
## Specificity : 0.3529
## Pos Pred Value : 0.9372
## Neg Pred Value : 0.3655
## Prevalence : 0.9113
## Detection Rate : 0.8570
## Detection Prevalence : 0.9143
## Balanced Accuracy : 0.6467
##
## 'Positive' Class : 0
##
cat("AUC:", auc(svm_linear_roc), "\n\n")
## AUC: 0.7869132
### 4. Random Forest ###
rf_model <- randomForest(y_train ~ ., data = X_train_scaled)
rf_pred_obj <- predict(rf_model, newdata = X_test_scaled, type = "prob")
rf_probs <- rf_pred_obj[,2]
rf_preds_class <- factor(ifelse(rf_probs >= 0.5, 1, 0), levels = levels(y_test))
rf_cm <- confusionMatrix(rf_preds_class, y_test)
rf_roc <- roc(as.numeric(as.character(y_test)), rf_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("=== Random Forest Metrics ===\n")
## === Random Forest Metrics ===
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1950 143
## 1 146 61
##
## Accuracy : 0.8743
## 95% CI : (0.8601, 0.8876)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.2279
##
## Mcnemar's Test P-Value : 0.9063
##
## Sensitivity : 0.9303
## Specificity : 0.2990
## Pos Pred Value : 0.9317
## Neg Pred Value : 0.2947
## Prevalence : 0.9113
## Detection Rate : 0.8478
## Detection Prevalence : 0.9100
## Balanced Accuracy : 0.6147
##
## 'Positive' Class : 0
##
cat("AUC:", auc(rf_roc), "\n")
## AUC: 0.7546856
#### 1. Logistic Regression ####
class_freq <- table(y_train)
class_weights <- 1 / class_freq
logit_model_balanced <- glm(y_train ~ ., data = X_train_scaled, family = binomial(),
weights = ifelse(y_train == 1, class_weights[2], class_weights[1]))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(logit_model_balanced)
##
## Call:
## glm(formula = y_train ~ ., family = binomial(), data = X_train_scaled,
## weights = ifelse(y_train == 1, class_weights[2], class_weights[1]))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27915 2.66229 0.105 0.916
## Amount_purchased 0.23553 1.82867 0.129 0.898
## Frequency -0.92529 3.08718 -0.300 0.764
## P_Child -0.31756 2.15631 -0.147 0.883
## P_Youth -0.08718 1.85831 -0.047 0.963
## P_Cook -0.40502 2.19000 -0.185 0.853
## P_DIY -0.27974 1.96970 -0.142 0.887
## P_Art 0.83770 2.09191 0.400 0.689
## First_purchase_log 0.36280 3.43776 0.106 0.916
## Gender1 -0.87833 3.40814 -0.258 0.797
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.7726 on 1599 degrees of freedom
## Residual deviance: 2.2004 on 1590 degrees of freedom
## AIC: 20
##
## Number of Fisher Scoring iterations: 5
logit_preds_prob_bl <- predict(logit_model_balanced, newdata = X_test_scaled, type = "response")
logit_preds_class_bl <- factor(ifelse(logit_preds_prob_bl >= 0.5, 1, 0), levels = levels(y_test))
logit_cm_bl <- confusionMatrix(logit_preds_class_bl, y_test)
logit_roc_bl <- roc(as.numeric(y_test), logit_preds_prob_bl)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
cat("=== Logistic Regression Metrics ===\n")
## === Logistic Regression Metrics ===
print(logit_cm_bl)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1570 67
## 1 526 137
##
## Accuracy : 0.7422
## 95% CI : (0.7238, 0.76)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2087
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7490
## Specificity : 0.6716
## Pos Pred Value : 0.9591
## Neg Pred Value : 0.2066
## Prevalence : 0.9113
## Detection Rate : 0.6826
## Detection Prevalence : 0.7117
## Balanced Accuracy : 0.7103
##
## 'Positive' Class : 0
##
cat("AUC:", auc(logit_roc_bl), "\n\n")
## AUC: 0.7948274
#### 2. SVM with Linear Kernel ####
svm_linear_model_bl <- svm(y_train ~ ., data = X_train_scaled, kernel = "linear", probability = TRUE,class.weights = class_weights)
summary(svm_linear_model_bl)
##
## Call:
## svm(formula = y_train ~ ., data = X_train_scaled, kernel = "linear",
## probability = TRUE, class.weights = class_weights)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 1438
##
## ( 360 1078 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_linear_pred_obj_bl <- predict(svm_linear_model_bl, newdata = X_test_scaled, probability = TRUE)
svm_linear_probs_bl <- attr(svm_linear_pred_obj_bl, "probabilities")[, which(colnames(attr(svm_linear_pred_obj_bl, "probabilities")) == "1")]
svm_linear_preds_class_bl <- factor(ifelse(svm_linear_probs_bl >= 0.5, 1, 0), levels = levels(y_test))
svm_linear_cm_bl <- confusionMatrix(svm_linear_preds_class_bl, y_test)
svm_linear_roc_bl <- roc(as.numeric(as.character(y_test)), svm_linear_probs_bl)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("=== SVM (Linear Kernel) Metrics ===\n")
## === SVM (Linear Kernel) Metrics ===
print(svm_linear_cm_bl)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1988 138
## 1 108 66
##
## Accuracy : 0.893
## 95% CI : (0.8797, 0.9054)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.99880
##
## Kappa : 0.2913
##
## Mcnemar's Test P-Value : 0.06446
##
## Sensitivity : 0.9485
## Specificity : 0.3235
## Pos Pred Value : 0.9351
## Neg Pred Value : 0.3793
## Prevalence : 0.9113
## Detection Rate : 0.8643
## Detection Prevalence : 0.9243
## Balanced Accuracy : 0.6360
##
## 'Positive' Class : 0
##
cat("AUC:", auc(svm_linear_roc_bl), "\n\n")
## AUC: 0.7864922
### 3. Random Forest ###
rf_model_bl <- randomForest(y_train ~ ., data = X_train_scaled,classwt = class_weights)
summary(rf_model_bl)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1600 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 3200 matrix numeric
## oob.times 1600 -none- numeric
## classes 2 -none- character
## importance 9 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 1600 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
rf_pred_obj_bl <- predict(rf_model_bl, newdata = X_test_scaled, type = "prob")
rf_probs_bl <- rf_pred_obj_bl[,2]
rf_preds_class_bl <- factor(ifelse(rf_probs_bl >= 0.5, 1, 0), levels = levels(y_test))
rf_cm_bl <- confusionMatrix(rf_preds_class_bl, y_test)
rf_roc_bl <- roc(as.numeric(as.character(y_test)), rf_probs_bl)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("=== Random Forest Metrics ===\n")
## === Random Forest Metrics ===
print(rf_cm_bl)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1821 127
## 1 275 77
##
## Accuracy : 0.8252
## 95% CI : (0.8091, 0.8405)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1855
##
## Mcnemar's Test P-Value : 2.273e-13
##
## Sensitivity : 0.8688
## Specificity : 0.3775
## Pos Pred Value : 0.9348
## Neg Pred Value : 0.2188
## Prevalence : 0.9113
## Detection Rate : 0.7917
## Detection Prevalence : 0.8470
## Balanced Accuracy : 0.6231
##
## 'Positive' Class : 0
##
cat("AUC:", auc(rf_roc_bl), "\n")
## AUC: 0.7455541
######################## |Interpretation of models | ########################
lm_importance <- abs(coef(linear_model))[-1] # Remove intercept
lm_df <- data.frame(Feature = names(lm_importance), Importance = lm_importance)
ggplot(lm_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "maroon") +
coord_flip() +
ggtitle("Feature Importance - Linear Regression") +
xlab("Feature") + ylab("Importance (|Coefficient|)")
logit_importance <- abs(coef(logit_model_balanced))[-1] # Remove intercept
logit_df <- data.frame(Feature = names(logit_importance), Importance = logit_importance)
ggplot(logit_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "#1f77b4") +
coord_flip() +
ggtitle("Feature Importance - Logistic Regression") +
xlab("Feature") + ylab("Importance (|Coefficient|)")
# Extract the weight vector (support vector coefficients * support vectors)
svm_linear_weights <- t(svm_linear_model_bl$coefs) %*% svm_linear_model_bl$SV
svm_linear_importance <- abs(svm_linear_weights)
svm_linear_importance_df <- as.data.frame(svm_linear_importance)
features <- colnames(svm_linear_importance_df)
svm_linear_importance_df <- data.frame(
Feature = features,
Importance = as.numeric(svm_linear_importance_df[1, ]) # Take the first row for feature importance
)
# Sort the features by their importance
svm_linear_importance_df <- svm_linear_importance_df[order(svm_linear_importance_df$Importance, decreasing = TRUE), ]
# Plot the feature importance using ggplot2
ggplot(svm_linear_importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "#E69F00") +
coord_flip() +
labs(title = "Feature Importance for SVM Linear", x = "Feature", y = "Importance") +
theme_minimal()
# Extract feature importance
rf_importance <- importance(rf_model_bl)[, 1] # Mean Decrease in Gini
rf_df <- data.frame(Feature = rownames(importance(rf_model_bl)), Importance = rf_importance)
# Plot
ggplot(rf_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "#2ca02c") +
coord_flip() +
ggtitle("Feature Importance - Random Forest") +
xlab("Feature") + ylab("Importance (Mean Decrease in Gini)")
lm_df
## Feature Importance
## Amount_purchased Amount_purchased 0.035066237
## Frequency Frequency 0.110664955
## P_Child P_Child 0.040049318
## P_Youth P_Youth 0.007494294
## P_Cook P_Cook 0.055424585
## P_DIY P_DIY 0.033158219
## P_Art P_Art 0.151524125
## First_purchase_log First_purchase_log 0.039153989
## Gender1 Gender1 0.126642629
logit_df
## Feature Importance
## Amount_purchased Amount_purchased 0.23552862
## Frequency Frequency 0.92528727
## P_Child P_Child 0.31756266
## P_Youth P_Youth 0.08717607
## P_Cook P_Cook 0.40501939
## P_DIY P_DIY 0.27973746
## P_Art P_Art 0.83770273
## First_purchase_log First_purchase_log 0.36279608
## Gender1 Gender1 0.87832655
svm_linear_importance_df
## Feature Importance
## 7 P_Art 0.43055918
## 2 Frequency 0.25495775
## 1 Amount_purchased 0.14024735
## 10 Gender1 0.12976127
## 9 Gender0 0.12976127
## 5 P_Cook 0.08836134
## 8 First_purchase_log 0.07478641
## 6 P_DIY 0.04928144
## 3 P_Child 0.03786663
## 4 P_Youth 0.01831745
rf_df
## Feature Importance
## Amount_purchased Amount_purchased 153.65013
## Frequency Frequency 92.86314
## P_Child P_Child 36.76524
## P_Youth P_Youth 24.02233
## P_Cook P_Cook 37.36559
## P_DIY P_DIY 25.29101
## P_Art P_Art 55.41600
## First_purchase_log First_purchase_log 84.15271
## Gender Gender 24.20464
# ---- Save ROC Curves as PNG ----
png("roc_curves.png", width = 800, height = 800)
# ---- Plot ROC Curves ----
plot(logit_roc_bl, col = "#1f77b4", lwd = 2, main = "ROC Curves for Models", cex.main = 1.5)
plot(svm_linear_roc_bl, col = "#E69F00", add = TRUE, lwd = 2)
plot(rf_roc_bl, col = "#2ca02c", add = TRUE, lwd = 2)
# ---- Add Legend ----
legend("bottomright",
legend = c("Logistic Regression", "SVM (Linear)", "Random Forest"),
col = c("#1f77b4", "#E69F00", "#2ca02c"), lwd = 2)
# ---- Close the Device ----
dev.off()
## png
## 2
# ---- Precision-Recall Curve ----
logreg_pr <- pr.curve(scores.class0 = logit_preds_prob_bl, weights.class0 = as.numeric(as.character(y_test)), curve = TRUE)
svm_linear_pr <- pr.curve(scores.class0 = svm_linear_probs_bl, weights.class0 = as.numeric(as.character(y_test)), curve = TRUE)
rf_pr <- pr.curve(scores.class0 = rf_probs_bl, weights.class0 = as.numeric(as.character(y_test)), curve = TRUE)
# Plot Precision-Recall Curves
plot(logreg_pr$curve[, 1], logreg_pr$curve[, 2], type = "l", col = "#1f77b4", lwd = 2, main = " Precision-Recall Curves", xlab = "Recall", ylab = "Precision", xlim = c(0, 1), ylim = c(0, 1),cex.main = 1.5) # Ensuring same limits
lines(svm_linear_pr$curve[, 1], svm_linear_pr$curve[, 2], col = "#E69F00", lwd = 2)
lines(rf_pr$curve[, 1], rf_pr$curve[, 2], col = "#2ca02c", lwd = 2)
# Add Legend to Precision-Recall Plot
legend("topright", legend = c("Logistic Regression", "SVM Linear", "Random Forest"),
col = c("#1f77b4", "#E69F00","#2ca02c"), lwd = 1)
# Reset the plotting layout to default
par(mfrow = c(1, 1), oma = c(0, 0, 0, 0)) # Reset the plotting layout and outer margin
# --- Step 3: Extract Metrics ---
# Accuracy
logit_accuracy <- logit_cm$overall["Accuracy"]
svm_linear_accuracy <- svm_linear_cm$overall["Accuracy"]
rf_accuracy <- rf_cm$overall["Accuracy"]
logit_accuracy_bl <- logit_cm_bl$overall["Accuracy"]
svm_linear_accuracy_bl <- svm_linear_cm_bl$overall["Accuracy"]
rf_accuracy_bl <- rf_cm_bl$overall["Accuracy"]
# AUC (Area Under ROC Curve)
logit_auc <- logit_roc$auc
svm_linear_auc <- svm_linear_roc$auc
rf_auc <- rf_roc$auc
logit_auc_bl <- logit_roc_bl$auc
svm_linear_auc_bl <- svm_linear_roc_bl$auc
rf_auc_bl <- rf_roc_bl$auc
# Precision, Recall, F1-Score
logit_precision <- logit_cm$byClass["Precision"]
svm_linear_precision <- svm_linear_cm$byClass["Precision"]
rf_precision <- rf_cm$byClass["Precision"]
logit_precision_bl <- logit_cm_bl$byClass["Precision"]
svm_linear_precision_bl <- svm_linear_cm_bl$byClass["Precision"]
rf_precision_bl <- rf_cm_bl$byClass["Precision"]
logit_recall <- logit_cm$byClass["Recall"]
svm_linear_recall <- svm_linear_cm$byClass["Recall"]
rf_recall <- rf_cm$byClass["Recall"]
logit_recall_bl <- logit_cm_bl$byClass["Recall"]
svm_linear_recall_bl <- svm_linear_cm_bl$byClass["Recall"]
rf_recall_bl <- rf_cm_bl$byClass["Recall"]
logit_f1 <- logit_cm$byClass["F1"]
svm_linear_f1 <- svm_linear_cm$byClass["F1"]
rf_f1 <- rf_cm$byClass["F1"]
logit_f1_bl <- logit_cm_bl$byClass["F1"]
svm_linear_f1_bl <- svm_linear_cm_bl$byClass["F1"]
rf_f1_bl <- rf_cm_bl$byClass["F1"]
# --- Step 4: Create a Summary Table ---
metrics_df <- data.frame(
Model = c("Linear Regression","Logistic Regression", "SVM Linear", "Random Forest","Logistic Regression Bl", "SVM Linear Bl", "Random Forest Bl"),
Accuracy = c(lm_accuracy,logit_accuracy, svm_linear_accuracy, rf_accuracy,logit_accuracy_bl, svm_linear_accuracy_bl, rf_accuracy_bl),
AUC = c("NA",logit_auc, svm_linear_auc, rf_auc,logit_auc_bl, svm_linear_auc_bl, rf_auc_bl),
Precision = c(lm_precision,logit_precision, svm_linear_precision, rf_precision,logit_precision_bl, svm_linear_precision_bl, rf_precision_bl),
Recall = c(lm_recall,logit_recall, svm_linear_recall, rf_recall,logit_recall_bl, svm_linear_recall_bl, rf_recall_bl),
F1 = c(lm_f1,logit_f1, svm_linear_f1, rf_f1,logit_f1_bl, svm_linear_f1_bl, rf_f1_bl)
)
# Print the table of metrics
print(metrics_df)
## Model Accuracy AUC Precision Recall
## 1 Linear Regression 0.1534783 NA 0.9627329 0.07395038
## 2 Logistic Regression 0.8952174 0.794736238961233 0.9364706 0.94942748
## 3 SVM Linear 0.8882609 0.786913214713366 0.9372325 0.94036260
## 4 Random Forest 0.8743478 0.754685629022601 0.9316770 0.93034351
## 5 Logistic Regression Bl 0.7421739 0.794827449109415 0.9590715 0.74904580
## 6 SVM Linear Bl 0.8930435 0.786492244798683 0.9350894 0.94847328
## 7 Random Forest Bl 0.8252174 0.745554089956593 0.9348049 0.86879771
## F1
## 1 0.1373505
## 2 0.9429045
## 3 0.9387950
## 4 0.9310098
## 5 0.8411465
## 6 0.9417338
## 7 0.9005935
# Given cost and pricing details
book_cost <- 15 # Cost per book
overhead_percentage <- 0.45 # Overhead allocation
selling_price <- 31.95 # Selling price of the book
mailing_cost <- 0.65 # Mailing cost per customer
overhead_cost <- book_cost * overhead_percentage # Overhead cost per book
# Compute profit per successful sale
profit_per_sale <- selling_price - book_cost - overhead_cost
# Midwest customer count
midwest_customers <- 50000
# Observed response rate from past campaign
observed_response_rate <- 0.0903 # 9.03%
# **Scenario 1: Sending mail to all 50,000 customers**
expected_buyers_full <- midwest_customers * observed_response_rate
expected_profit_full <- expected_buyers_full * profit_per_sale - (midwest_customers * mailing_cost)
# ----- Targeted Mailing Strategy using logistic model -----
# Identify customers in the test data that meet the targeting threshold:
targeted_indices <- which(logit_preds_prob_bl >= 0.5)
targeted_fraction <- length(targeted_indices) / length(y_test) # Proportion of test customers targeted
# For these targeted customers, compute net expected profit:
expected_profit_targeted_per_customer <- profit_per_sale * logit_preds_prob_bl[targeted_indices] - mailing_cost
avg_profit_targeted <- mean(expected_profit_targeted_per_customer)
print("avg_profit_targeted using logit")
## [1] "avg_profit_targeted using logit"
avg_profit_targeted
## [1] 6.208426
# Estimate targeted group size for 50k customers:
targeted_group_size_50k <- targeted_fraction * midwest_customers
print("targeted_group_size_50k logit")
## [1] "targeted_group_size_50k logit"
targeted_group_size_50k
## [1] 14413.04
# Total expected profit for targeted mailing:
total_expected_profit_targeted_logit <- avg_profit_targeted * targeted_group_size_50k
# ----- Relative Profit Gain -----
additional_profit_logit <- total_expected_profit_targeted_logit - expected_profit_full
# ----- Targeted Mailing Strategy using svm linear model -----
# Identify customers in the test data that meet the targeting threshold:
targeted_indices <- which(svm_linear_probs_bl >= 0.5)
targeted_fraction <- length(targeted_indices) / length(y_test) # Proportion of test customers targeted
# For these targeted customers, compute net expected profit:
expected_profit_targeted_per_customer <- profit_per_sale * svm_linear_probs_bl[targeted_indices] - mailing_cost
avg_profit_targeted <- mean(expected_profit_targeted_per_customer)
avg_profit_targeted
## [1] 5.880931
# Estimate targeted group size for 50k customers:
targeted_group_size_50k <- targeted_fraction * midwest_customers
targeted_group_size_50k
## [1] 3782.609
# Total expected profit for targeted mailing:
total_expected_profit_targeted_svm <- avg_profit_targeted * targeted_group_size_50k
# ----- Relative Profit Gain -----
additional_profit_svm <- total_expected_profit_targeted_svm - expected_profit_full
# ----- Output Results -----
cat("Extrapolated expected profit (full mailing, 50k customers):", expected_profit_full, "\n")
## Extrapolated expected profit (full mailing, 50k customers): 13553
cat("Extrapolated expected profit (targeted mailing using logit model, 50k customers):", total_expected_profit_targeted_logit, "\n")
## Extrapolated expected profit (targeted mailing using logit model, 50k customers): 89482.31
cat("Additional profit from targeted mailing over full mailing using logistic model:", additional_profit_logit, "\n")
## Additional profit from targeted mailing over full mailing using logistic model: 75929.31
cat("Extrapolated expected profit (targeted mailing using SVM model, 50k customers):", total_expected_profit_targeted_svm, "\n")
## Extrapolated expected profit (targeted mailing using SVM model, 50k customers): 22245.26
cat("Additional profit from targeted mailing over full mailing using SVM model:", additional_profit_svm, "\n")
## Additional profit from targeted mailing over full mailing using SVM model: 8692.263
From these results, it is clear that targeting customers based on predictive models (whether logistic regression or SVM) leads to significantly higher profits than sending the offer to all customers. The logistic regression model is the most effective in this case, generating the highest additional profit, making it the best option for BBBC’s Midwest mailing campaign.
# Create a comparison plot
# Data for the plot
campaign_types <- c("Full Mailing", "Targeted (Logit Model)", "Targeted (SVM Model)")
profits <- c(expected_profit_full, total_expected_profit_targeted_logit, total_expected_profit_targeted_svm) # Extrapolated expected profits
# Create a data frame
profit_data <- data.frame(campaign_types, profits)
# Generate the bar plot
ggplot(profit_data, aes(x = campaign_types, y = profits, fill = campaign_types)) +
geom_bar(stat = "identity", width = 0.6) +
labs(title = "Profit Comparison of Mailing Strategies",
x = "Campaign Type",
y = "Expected Profit ($)",
fill = "Campaign Type") +
theme_minimal() +
scale_fill_manual(values = c("skyblue", "salmon", "lightgreen")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(profits, 2)), vjust = -0.5) # Add values above bars
ggsave("profit_comparison_plot.png", width = 8, height = 6, dpi = 300)