#Introduction In today’s competitive retail environment, understanding customer purchasing behavior is critical for maximizing sales and improving customer experience. Blackwell Electronics, a renowned electronics retailer, has collected customer transaction data from both online and in-store purchases. By analyzing this data, we aim to uncover key insights into spending patterns, regional differences, and customer demographics. The findings will help the company optimize marketing strategies, improve inventory management, and enhance the overall shopping experience.
The primary purpose of this project is to leverage predictive analytics to gain a deeper understanding of customer purchasing behavior at Blackwell Electronics. By analyzing customer transactions, we aim to:
Improve Customer Retention: Identify factors influencing repeat purchases and customer loyalty.
Increase Sales: Optimize marketing and inventory strategies based on regional spending habits and demographic insights.
Enhance Online and In-Store Experience: Provide data-driven recommendations to improve the user journey across sales channels.
4.Predict Shopping Behavior: Use machine learning models to forecast future purchasing patterns and segment customers effectively.
Through this study, Blackwell Electronics will gain actionable insights to make informed business decisions, resulting in higher customer satisfaction, improved operational efficiency, and increased profitability.
To achieve these broad goals, we break them down into specific, measurable objectives: Identify regional spending patterns and determine which regions spend the most/least. Analyze the relationship between the number of items purchased and total spending. Investigate age differences between regions and whether a customer’s age can be predicted based on demographic data. Explore correlations between customer age and transaction type (online vs. in-store). Predict a customer’s region based on transaction and demographic data.
To strengthen our analysis, i will generate and test the following hypotheses based on age group, region, and key transactional variables:
Hypotheses for Testing
H1: Older customers are more likely to shop online than in-store
(Proven with Chi-Square and Logistic Regression). H2: Regional
Differences in Spending Behavior
H3: A customer’s age can be predicted based on transaction
characteristics (shopping mode, number of items, and amount spent
H4: Correlation Between Items Purchased & Spending (Pearson
Correlation & Regression)
H5:Predicting Customer Region Based on Transaction Data (To be done in
modeling).
#Data Description The dataset consists of 80,000 customer transactions, with the following variables: 1.in-store (Categorical: 1 = In-store, 0 = Online) 2.age (Numerical: Customer’s age) 3.items (Numerical: Number of items purchased per transaction) 4.amount (Numerical: Total amount spent in USD per transaction) 5.region (Categorical: 1 = North, 2 = South, 3 = East, 4 = West)
#Data Preprocessing To ensure the dataset is clean and ready for analysis, i perform the following preprocessing steps:
A. Handling Missing Values: -Identify and remove or impute missing values where necessary.
B. Handling Duplicates: -Detect and remove duplicate records to ensure data integrity.
C.Outlier Detection and Treatment: -Use Interquartile Range (IQR) to detect and remove extreme outliers in amount, age, and items.
D. Feature Transformation: -Convert categorical variables (in-store, region) into factors for analysis.
E.Scaling & Normalization: -Standardize numerical variables to improve model performance.
#Check the First Few Rows
## in.store age items amount region
## 1 0 37 4 281.0300 2
## 2 0 35 2 219.5100 2
## 3 1 45 3 1525.7000 4
## 4 1 46 3 715.2500 3
## 5 1 33 4 1937.5000 1
## 6 0 24 3 1314.2000 4
## 7 0 43 6 8.5472 2
## 8 1 66 5 819.9800 3
## 9 1 51 5 908.3100 1
## 10 1 47 3 767.5400 1
## in.store age items amount region
## 79991 0 71 7 6.9172 2
## 79992 0 37 4 658.0000 4
## 79993 0 77 4 170.0200 2
## 79994 1 47 3 337.5400 3
## 79995 0 22 1 698.5400 4
## 79996 1 71 3 558.8200 1
## 79997 0 59 7 1932.0000 3
## 79998 0 54 1 414.1600 2
## 79999 1 49 4 335.3200 1
## 80000 1 30 1 527.1200 3
#Check the Structure and Data Types
## 'data.frame': 80000 obs. of 5 variables:
## $ in.store: int 0 0 1 1 1 0 0 1 1 1 ...
## $ age : int 37 35 45 46 33 24 43 66 51 47 ...
## $ items : int 4 2 3 3 4 3 6 5 5 3 ...
## $ amount : num 281 220 1526 715 1938 ...
## $ region : int 2 2 4 3 1 4 2 3 1 1 ...
#get the summary statistics of the data
## in.store age items amount
## Min. :0.0 Min. :18.00 Min. :1.000 Min. : 5.005
## 1st Qu.:0.0 1st Qu.:33.00 1st Qu.:3.000 1st Qu.: 285.140
## Median :0.5 Median :45.00 Median :4.000 Median : 582.315
## Mean :0.5 Mean :45.76 Mean :4.505 Mean : 835.920
## 3rd Qu.:1.0 3rd Qu.:56.00 3rd Qu.:6.000 3rd Qu.:1233.700
## Max. :1.0 Max. :85.00 Max. :8.000 Max. :3000.000
## region
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.675
## 3rd Qu.:4.000
## Max. :4.000
## in.store age items amount region
## 17859 0 76 6 283.61 2
## 26113 0 67 4 320.41 2
## 29344 1 23 2 1019.50 4
## 33835 1 38 3 1993.80 1
## 34215 1 31 5 1488.10 4
## 34707 0 39 5 1268.00 4
## 35972 0 38 7 391.51 2
## 43959 1 22 1 1303.20 4
## 44309 1 26 5 1495.70 4
## 45165 0 79 6 342.28 2
## 47415 0 27 6 1555.10 4
## 51601 0 56 2 1941.70 4
## 52014 1 50 5 1889.40 1
## 53653 0 29 2 144.43 2
## 58455 0 76 1 259.44 2
## 59540 1 49 2 1162.00 4
## 59860 0 30 7 2535.10 4
## 61440 1 32 4 1106.10 4
## 63971 1 42 5 1731.10 4
## 75507 0 44 2 1756.70 4
## 78689 1 40 2 1080.60 1
I observed that there are 21 duplicate rows.
#Check if duplicates have been removed
## [1] 0
Duplicates have been successfully removed. Since the reduction in data set size is only 0.03% (21 rows removed out of 80,000), removing duplicates does not significantly impact our insights.
## in.store age items amount region
## 0 0 0 0 0
It’s observed that there are no missing values seen in the data set.
#Save Data
#Load Data
customer_data_copy$in.store <- factor(customer_data_copy$in.store, labels = c("Online", "In-Store"))
customer_data_copy$region <- factor(customer_data_copy$region,
levels = c(1, 2, 3, 4),
labels = c("North", "South", "East", "West"))To improve readability in analysis and visualization.
##
## North South East West
## 15997 19994 18000 25988
##
## Online In-Store
## 39989 39990
#Exploratory Data Analysis (EDA) Now, let’s visualize key insights from your dataset.
# Load required libraries
library(ggplot2)
# Load the dataset (Ensure the file path is correct)
customer_data_copy <- read.csv("Demographic_Data.csv", stringsAsFactors = FALSE)
# Convert 'region' column to factors for better visualization
customer_data_copy$region <- factor(customer_data_copy$region,
levels = c(1, 2, 3, 4),
labels = c("North", "South", "East", "West"))
# Create the boxplots with red outliers and display them
par(mfrow = c(1,3)) # Arrange plots in 1 row, 3 columns
# Boxplot for Amount Spent
boxplot(customer_data_copy$amount,
main = "Outliers in Amount Spent",
ylab = "Amount Spent (USD)",
col = "blue",
border = "black",
pch = 19) # Red outliers
# Boxplot for Age
boxplot(customer_data_copy$age,
main = "Outliers in Age",
ylab = "Customer Age",
col = "lightgreen",
border = "black",
pch = 19)
# Boxplot for Items Purchased
boxplot(customer_data_copy$items,
main = "Outliers in Items Purchased",
ylab = "Number of Items",
col = "lightcoral",
border = "black",
pch = 19)# Reset plot layout
par(mfrow = c(1,1))
# Save the plots as a PNG file
png("Outliers_Boxplots.png", width = 1500, height = 600, res = 150)
# Redraw plots in the PNG file
par(mfrow = c(1,3))
boxplot(customer_data_copy$amount, main = "Outliers in Amount Spent", ylab = "Amount Spent (USD)",
col = "blue", border = "red", pch = 19)
boxplot(customer_data_copy$age, main = "Outliers in Age", ylab = "Customer Age",
col = "lightgreen", border = "black", pch = 19)
boxplot(customer_data_copy$items, main = "Outliers in Items Purchased", ylab = "Number of Items",
col = "yellow", border = "black", pch = 19)
# Close the PNG device to save the file
dev.off()## png
## 2
## ✅ Boxplots are displayed and saved as 'Outliers_Boxplots.png'
Why These Outliers Matter? Outliers in Amount Spent → High spenders could be targeted for premium offers.
No outliers in Age → Balanced customer demographics.
No outliers in Items Purchased → Stable purchasing behavior across customers.
#Spending by Region
# Load necessary libraries
library(ggplot2)
# Create the box plot
spending_plot <- ggplot(customer_data_copy, aes(x = factor(region), y = amount, fill = factor(region))) +
geom_boxplot(outlier.color = "black", outlier.shape = 16, outlier.size = 2) +
labs(title = "Spending Behavior by Region",
x = "Region",
y = "Amount Spent ($)",
fill = "Region") +
theme_minimal() +
scale_fill_manual(values = c("red", "green", "blue", "purple"))
# Save the plot as a PNG file
ggsave("spending_by_region.png", spending_plot, width = 10, height = 5, dpi = 300)
ggplot(customer_data_copy, aes(x = region, y = amount, fill = region)) +
geom_boxplot() +
labs(title = "Spending Distribution by Region", y = "Amount Spent", x = "Region")Understanding the Box Plot Each box represents the middle 50% of spending values (Interquartile Range - IQR), while the black horizontal line inside the box shows the median spending. The whiskers extend to data points within 1.5 times the IQR, and dots above the whiskers indicate outliers, or unusually high spending customers.
Key Observations by Region 1. North Region (Red) Median spending is slightly below $1000. Spending varies widely, with some customers spending over $2500. A few extreme outliers suggest some high-spending customers. 2. South Region (Green) Median spending is similar to North but slightly lower. The range is from near $0 to $2000. Less extreme outliers compared to North. 3. East Region (Blue) This region has the lowest median spending. Spending is more consistent, with a narrow range. Very few high outliers indicate that customers in this region tend to spend within a limited range. 4. West Region (Purple) Highest median spending among all regions. Widest variability, meaning spending habits are inconsistent. Many high-value outliers (above $3000) suggest some customers spend significantly more than others. Business Implications & Recommendations Marketing Focus: The West region has the most high-spending customers, making it ideal for premium product promotions. Targeted Discounts: The East region spends the least; introducing discounts or incentives may encourage higher spending. Customer Segmentation: South and North have moderate spending behavior, making them suitable for mid-range pricing strategies. Outlier Investigation: Further analysis can help understand what drives high spending in the West and why the East has limited variability.
#Online vs In-Store Purchases:
# Load required library
library(ggplot2)
# Ensure in.store is a factor
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)
# Create the boxplot
p <- ggplot(customer_data_copy, aes(x = in.store, y = amount, fill = in.store)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 3) +
scale_fill_manual(values = c("orange", "blue")) + # Yellow for Online, Blue for In-Store
labs(title = "Spending Amount: Online vs. In-Store",
x = "Shopping Type",
y = "Amount Spent ($)") +
theme_minimal()
# Save the plot as PNG
ggsave("Online_vs_InStore_Purchases.png", plot = p, width = 8, height = 5, dpi = 300)
# Print the plot in R
print(p)The median spending amount for online purchases is higher than for in-store purchases. In-store purchases show more variation in spending, with some very high amounts (outliers). There is a higher spread (larger interquartile range) in in-store spending.
# Load required library
library(ggplot2)
# Create bar plot
items_plot <- ggplot(customer_data_copy, aes(x = factor(items))) +
geom_bar(fill = "purple", color = "black") +
labs(title = "Number of Items Purchased per Transaction",
x = "Number of Items",
y = "Transaction Count") +
theme_minimal()
# Display the plot
print(items_plot)# Save the plot as PNG for presentation slides
ggsave("Items_Purchased_BarPlot.png", plot = items_plot, width = 8, height = 5, dpi = 300)Analysis: Number of Items Purchased per Transaction This bar chart visualizes the distribution of the number of items purchased per transaction across all customers. It helps in understanding customer purchasing behavior, which is crucial for inventory management, pricing strategies, and marketing decisions.
🔹 Key Observations from the Chart Most Transactions Involve 2 to 7 Items
The bars for 2 to 7 items per transaction are the tallest, indicating that most customers buy within this range. This suggests that customers prefer buying multiple items at once rather than single-item purchases. Single-Item Transactions Are Less Frequent
The lowest bar (except for 8 items) is at 1 item per transaction. This could imply that customers prefer bulk shopping rather than frequent, small transactions. Alternatively, it may suggest that low-value purchases are rare. Even Distribution for 2 to 7 Items
The bars for 2 to 7 items are almost equal in height, suggesting that customers commonly purchase within this range, without a strong preference for any specific number. Slight Decline at 8 Items Per Transaction
The transaction count drops sharply at 8 items. This may indicate a purchase limit, pricing threshold, or bulk purchase discount impact. Customers may not need more than 7 items per transaction on average. 📌 Business Insights 💡 1. Encourage Multi-Item Purchases
Since customers already buy 2 to 7 items per transaction, businesses can incentivize bulk purchases through: Bundle discounts (e.g., “Buy 3, Get 1 Free”). Loyalty rewards for larger transactions. Free shipping for orders over a certain threshold. 💡 2. Improve Pricing for Single-Item Purchases
The low number of single-item transactions suggests that customers might find single purchases less valuable. Strategies to increase single-item purchases: Offer limited-time discounts for specific products. Promote impulse buys at checkout. Provide “frequently bought together” recommendations. 💡 3. Investigate the 8-Item Drop
If there is a system limit or pricing pattern causing fewer 8-item transactions, businesses should: Analyze customer receipts to identify spending caps. Test removing transaction limits if applicable. Offer a discount for 8+ items to check if purchase volume increases.
# Compute correlation matrix for numerical variables
cor_matrix <- cor(customer_data_copy[, c("age", "items", "amount")])
# Display correlation matrix
print(cor_matrix)## age items amount
## age 1.0000000000 0.0006785133 -0.2820891185
## items 0.0006785133 1.0000000000 0.0003670029
## amount -0.2820891185 0.0003670029 1.0000000000
Explanation of Correlation Matrix for Your Presentation The correlation matrix shows the strength and direction of relationships between Age, Number of Items Purchased, and Amount Spent:
Age vs. Items Purchased (0.0007): There is virtually no correlation between a customer’s age and the number of items they purchase. This suggests that customers across all age groups tend to buy a similar number of items.
Age vs. Amount Spent (-0.282): There is a weak negative correlation, meaning older customers tend to spend slightly less. However, this relationship is not very strong.
Items Purchased vs. Amount Spent (0.0004): There is almost no correlation between the number of items purchased and the total amount spent. This could indicate that customers might be buying different types of products with varying prices, so simply buying more items doesn’t always mean spending more.
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Compute correlation matrix for numerical variables
cor_matrix <- cor(customer_data_copy[, c("age", "items", "amount")])
# Convert correlation matrix into a format suitable for ggplot
melted_cor <- melt(cor_matrix)
# Final improved heatmap with better color transition
ggplot(data = melted_cor, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), color = "white", size = 6, fontface = "bold") +
scale_fill_gradientn(colors = c("#053061", "#2166AC", "white", "#F4A582", "#B2182B"),
values = scales::rescale(c(-1, -0.5, 0, 0.5, 1)),
name="Pearson\nCorrelation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
) +
ggtitle("Final Enhanced Correlation Heatmap")Correlation Analysis and Interpretation The Final Enhanced Correlation Heatmap presents the Pearson correlation coefficients among three key numerical variables in the dataset: age, items, and amount spent. The correlation coefficients range between -1 and 1, where:
1 (red): Perfect positive correlation – as one variable increases, the other also increases. 0 (white): No correlation – no linear relationship between variables. -1 (blue): Perfect negative correlation – as one variable increases, the other decreases.
#Testing the Hypotheses. Older customers are more likely to shop online than in-store.
# Create Age Groups
customer_data_copy$Age_Group <- cut(customer_data_copy$age, breaks = c(18, 30, 45, 60, 85),
labels = c("18-30", "31-45", "46-60", "61+"))
# Create contingency table
age_store_table <- table(customer_data_copy$Age_Group, customer_data_copy$in.store)
# Perform Chi-Square Test
chi_test <- chisq.test(age_store_table)
# Display results
chi_test##
## Pearson's Chi-squared test
##
## data: age_store_table
## X-squared = 2087.8, df = 3, p-value < 2.2e-16
#Logistic Regression Model
# Logistic regression model
logistic_model <- glm(in.store ~ age, data = customer_data, family = binomial)
# Display summary
summary(logistic_model)##
## Call:
## glm(formula = in.store ~ age, family = binomial, data = customer_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0635598 0.0225026 47.26 <2e-16 ***
## age -0.0232745 0.0004679 -49.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110874 on 79978 degrees of freedom
## Residual deviance: 108302 on 79977 degrees of freedom
## AIC: 108306
##
## Number of Fisher Scoring iterations: 4
#Age Distribution by Shoping Channel
# Load required library
library(ggplot2)
# Convert 'in.store' to a factor (1 = In-Store, 0 = Online)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)
# Boxplot to visualize age distribution for online vs. in-store shopping
ggplot(customer_data_copy, aes(x = in.store, y = age, fill = in.store)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Age Distribution by Shopping Channel", x = "Shopping Channel", y = "Age") +
scale_x_discrete(labels = c("Online", "In-Store"))# Load required library
library(ggplot2)
# Create a sequence of ages from 18 to 85
age_range <- data.frame(age = seq(18, 85, by = 1))
# Predict probability of in-store shopping for each age
age_range$predicted_prob <- predict(logistic_model, newdata = age_range, type = "response")
# Plot the logistic regression probability curve (Fixed Warning)
ggplot(age_range, aes(x = age, y = predicted_prob)) +
geom_line(color = "blue", linewidth = 1) + # Changed 'size' to 'linewidth'
theme_minimal() +
labs(title = "Probability of In-Store Shopping by Age",
x = "Age",
y = "Predicted Probability of Shopping In-Store") +
theme(text = element_text(size = 12)) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
annotate("text", x = 65, y = 0.55, label = "50% Probability Threshold", color = "red")Key Evidence Supporting This Conclusion 1️⃣ Chi-Square Test Results p-value < 2.2e-16 → Statistically significant relationship between age and shopping preference. This confirms that age influences whether a customer shops online or in-store. 2️⃣ Logistic Regression Results Negative coefficient for Age (-0.0233) → Older customers are less likely to shop in-store. Since the p-value is very small (< 2e-16), the effect is statistically significant. 3️⃣ Boxplot: Age Distribution by Shopping Mode Older customers tend to shop online more frequently compared to younger customers. Younger customers have a higher proportion of in-store purchases. 4️⃣ Probability Plot of Shopping Mode by Age The probability of shopping in-store declines with age. The regression curve confirms that older customers have a higher likelihood of online shopping. 📌 Final Conclusion 🚀 We reject the original hypothesis that older customers prefer in-store shopping. ✅ Instead, the analysis shows that older customers are more likely to shop online than in-store.
🔎 Business Implications Older customers prefer convenience and flexibility, making online platforms critical for customer retention. Younger customers are more likely to visit physical stores, meaning in-store promotions should target them more aggressively. Marketing Strategy Shift: For older customers → Improve online shopping experience, easy checkout, and digital marketing. For younger customers → Focus on in-store events, student discounts, and personalized experiences.
# Load required library
library(ggplot2)
library(dplyr)
# Remove NA values from Age Group before plotting
customer_data_copy <- customer_data_copy %>%
filter(!is.na(Age_Group)) # Excludes rows where Age Group is NA
# Recreate the bar plot without NA
ggplot(customer_data_copy, aes(x = Age_Group, fill = in.store)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Shopping Preference by Age Group",
x = "Age Group",
y = "Proportion") +
scale_fill_manual(values = c("red", "blue"),
labels = c("In-Store", "Online"),
name = "Shopping Mode")Key Observations from the Chart
The 18-30 age group shows a nearly equal split between in-store (red) and online (blue) shopping.
This suggests that younger customers are equally comfortable shopping online and in physical stores.
The 31-45 and 46-60 groups have a higher proportion of in-store shoppers compared to online.
This suggests that these age groups still value physical shopping experiences but also shop online.
The 61+ group has a higher proportion of online shoppers than any other age group.
This confirms that older customers are more likely to shop online than in-store.
# Load the saved dataset
load("customer_data_copy.RData")
# Verify dataset structure
str(customer_data_copy)## 'data.frame': 79979 obs. of 5 variables:
## $ in.store: int 0 0 1 1 1 0 0 1 1 1 ...
## $ age : int 37 35 45 46 33 24 43 66 51 47 ...
## $ items : int 4 2 3 3 4 3 6 5 5 3 ...
## $ amount : num 281 220 1526 715 1938 ...
## $ region : int 2 2 4 3 1 4 2 3 1 1 ...
## [1] 79979 5
## in.store age items amount
## Min. :0.0 Min. :18.00 Min. :1.000 Min. : 5.005
## 1st Qu.:0.0 1st Qu.:33.00 1st Qu.:3.000 1st Qu.: 285.120
## Median :1.0 Median :45.00 Median :4.000 Median : 582.140
## Mean :0.5 Mean :45.76 Mean :4.505 Mean : 835.826
## 3rd Qu.:1.0 3rd Qu.:56.00 3rd Qu.:6.000 3rd Qu.:1233.400
## Max. :1.0 Max. :85.00 Max. :8.000 Max. :3000.000
## region
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.675
## 3rd Qu.:4.000
## Max. :4.000
## [1] "in.store" "age" "items" "amount" "region"
# Load required libraries
library(caret) # For model training
library(ranger) # Random Forest (for H5 - Region Prediction)
# Set seed for reproducibility
set.seed(123)
# Load the correct dataset
load("customer_data_copy.RData") # Reload dataset
# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)
# Train-Test Split (80% Train, 20% Test)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data <- customer_data_copy[-trainIndex, ]#H5 - Predicting Customer Region (Classification Model - Random Forest ranger)
## Ranger result
##
## Call:
## ranger(dependent.variable.name = "region", data = train_data, num.trees = 500, num.threads = 4, importance = "impurity", classification = TRUE)
##
## Type: Classification
## Number of trees: 500
## Sample size: 63985
## Number of independent variables: 4
## Mtry: 2
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 36.60 %
## in.store age items amount
## 8083.534 5733.406 1438.485 16882.853
#Remove Weak Features & Retrain the Model
train_data_reduced <- train_data %>% select(-items)
rf_model_reduced <- ranger(dependent.variable.name = "region",
data = train_data_reduced,
num.trees = 500,
num.threads = 4,
importance = "impurity",
classification = TRUE)#Tune Hyperparameters to Improve Accuracy Since your Out-of-Bag (OOB) Error = 36.60%, tuning mtry and num.trees may help
# Try different values for mtry and num.trees
rf_model_tuned <- ranger(dependent.variable.name = "region",
data = train_data,
num.trees = 700, # Increase trees
mtry = 3, # Try different feature subsets per split
num.threads = 4,
importance = "impurity",
classification = TRUE)## Growing trees.. Progress: 50%. Estimated remaining time: 30 seconds.
## Growing trees.. Progress: 98%. Estimated remaining time: 1 seconds.
# Predict on test data
predictions <- predict(rf_model, data = test_data)$predictions
# Create a confusion matrix
conf_matrix <- table(Predicted = predictions, Actual = test_data$region)
# Compute accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Test Accuracy:", round(accuracy * 100, 2), "%"))## [1] "Test Accuracy: 63.25 %"
# Load necessary libraries
library(caret) # For machine learning workflows
library(xgboost) # For Gradient Boosting##
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
##
## slice
## The following object is masked from 'package:dplyr':
##
## slice
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)
# Create Train-Test Split (80% Train, 20% Test)
set.seed(123)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data <- customer_data_copy[-trainIndex, ]
# Convert to XGBoost-compatible format (Matrix)
train_matrix <- sparse.model.matrix(region ~ ., data = train_data)[,-1]
test_matrix <- sparse.model.matrix(region ~ ., data = test_data)[,-1]
# Convert Target Variable to Numeric (XGBoost requires numeric labels)
train_labels <- as.numeric(train_data$region) - 1 # Convert factors to 0-based index
test_labels <- as.numeric(test_data$region) - 1
# Create DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_matrix, label = train_labels)
dtest <- xgb.DMatrix(data = test_matrix, label = test_labels)library(caret)
# Define trainControl
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Define XGBoost parameters
params <- list(
objective = "multi:softmax", # Multi-class classification
num_class = length(unique(train_labels)), # Number of regions
eta = 0.1, # Learning rate
max_depth = 6, # Tree depth
subsample = 0.8, # Subsampling
colsample_bytree = 0.8 # Feature selection
)
# Train the XGBoost model
xgb_model <- xgb.train(params = params,
data = dtrain,
nrounds = 150, # Number of boosting rounds
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 10,
verbose = 1)## [1] train-mlogloss:1.323526 test-mlogloss:1.323618
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
##
## [2] train-mlogloss:1.279707 test-mlogloss:1.279595
## [3] train-mlogloss:1.227901 test-mlogloss:1.227985
## [4] train-mlogloss:1.172231 test-mlogloss:1.172730
## [5] train-mlogloss:1.124893 test-mlogloss:1.125902
## [6] train-mlogloss:1.086358 test-mlogloss:1.087701
## [7] train-mlogloss:1.052024 test-mlogloss:1.053605
## [8] train-mlogloss:1.020829 test-mlogloss:1.022667
## [9] train-mlogloss:0.993481 test-mlogloss:0.995601
## [10] train-mlogloss:0.966720 test-mlogloss:0.969101
## [11] train-mlogloss:0.947149 test-mlogloss:0.949708
## [12] train-mlogloss:0.930831 test-mlogloss:0.933523
## [13] train-mlogloss:0.913030 test-mlogloss:0.916093
## [14] train-mlogloss:0.901441 test-mlogloss:0.904597
## [15] train-mlogloss:0.886552 test-mlogloss:0.889856
## [16] train-mlogloss:0.870727 test-mlogloss:0.874473
## [17] train-mlogloss:0.857685 test-mlogloss:0.861741
## [18] train-mlogloss:0.846152 test-mlogloss:0.850353
## [19] train-mlogloss:0.837332 test-mlogloss:0.841615
## [20] train-mlogloss:0.825454 test-mlogloss:0.829959
## [21] train-mlogloss:0.814101 test-mlogloss:0.818858
## [22] train-mlogloss:0.804041 test-mlogloss:0.809083
## [23] train-mlogloss:0.797704 test-mlogloss:0.802967
## [24] train-mlogloss:0.789739 test-mlogloss:0.795187
## [25] train-mlogloss:0.783779 test-mlogloss:0.789487
## [26] train-mlogloss:0.778633 test-mlogloss:0.784421
## [27] train-mlogloss:0.772700 test-mlogloss:0.778697
## [28] train-mlogloss:0.766523 test-mlogloss:0.772679
## [29] train-mlogloss:0.761693 test-mlogloss:0.768014
## [30] train-mlogloss:0.757662 test-mlogloss:0.764127
## [31] train-mlogloss:0.753986 test-mlogloss:0.760525
## [32] train-mlogloss:0.749864 test-mlogloss:0.756501
## [33] train-mlogloss:0.745024 test-mlogloss:0.751818
## [34] train-mlogloss:0.741167 test-mlogloss:0.748124
## [35] train-mlogloss:0.737791 test-mlogloss:0.744924
## [36] train-mlogloss:0.734315 test-mlogloss:0.741651
## [37] train-mlogloss:0.731390 test-mlogloss:0.738887
## [38] train-mlogloss:0.727803 test-mlogloss:0.735478
## [39] train-mlogloss:0.724596 test-mlogloss:0.732419
## [40] train-mlogloss:0.722974 test-mlogloss:0.730907
## [41] train-mlogloss:0.720935 test-mlogloss:0.729090
## [42] train-mlogloss:0.718433 test-mlogloss:0.726695
## [43] train-mlogloss:0.715626 test-mlogloss:0.724035
## [44] train-mlogloss:0.713435 test-mlogloss:0.722003
## [45] train-mlogloss:0.711416 test-mlogloss:0.720238
## [46] train-mlogloss:0.709362 test-mlogloss:0.718370
## [47] train-mlogloss:0.707806 test-mlogloss:0.716980
## [48] train-mlogloss:0.706070 test-mlogloss:0.715444
## [49] train-mlogloss:0.704194 test-mlogloss:0.713748
## [50] train-mlogloss:0.702437 test-mlogloss:0.712131
## [51] train-mlogloss:0.700929 test-mlogloss:0.710777
## [52] train-mlogloss:0.700046 test-mlogloss:0.709988
## [53] train-mlogloss:0.698406 test-mlogloss:0.708522
## [54] train-mlogloss:0.696692 test-mlogloss:0.706998
## [55] train-mlogloss:0.695634 test-mlogloss:0.706082
## [56] train-mlogloss:0.694356 test-mlogloss:0.704973
## [57] train-mlogloss:0.693303 test-mlogloss:0.704138
## [58] train-mlogloss:0.691912 test-mlogloss:0.702965
## [59] train-mlogloss:0.690780 test-mlogloss:0.701894
## [60] train-mlogloss:0.689790 test-mlogloss:0.701088
## [61] train-mlogloss:0.688892 test-mlogloss:0.700350
## [62] train-mlogloss:0.687682 test-mlogloss:0.699411
## [63] train-mlogloss:0.686874 test-mlogloss:0.698654
## [64] train-mlogloss:0.685742 test-mlogloss:0.697759
## [65] train-mlogloss:0.684818 test-mlogloss:0.697017
## [66] train-mlogloss:0.684122 test-mlogloss:0.696440
## [67] train-mlogloss:0.683406 test-mlogloss:0.695956
## [68] train-mlogloss:0.682635 test-mlogloss:0.695304
## [69] train-mlogloss:0.682066 test-mlogloss:0.694833
## [70] train-mlogloss:0.681455 test-mlogloss:0.694377
## [71] train-mlogloss:0.680778 test-mlogloss:0.693858
## [72] train-mlogloss:0.680059 test-mlogloss:0.693308
## [73] train-mlogloss:0.679598 test-mlogloss:0.692967
## [74] train-mlogloss:0.678950 test-mlogloss:0.692490
## [75] train-mlogloss:0.678399 test-mlogloss:0.692167
## [76] train-mlogloss:0.677747 test-mlogloss:0.691659
## [77] train-mlogloss:0.677302 test-mlogloss:0.691333
## [78] train-mlogloss:0.676692 test-mlogloss:0.690925
## [79] train-mlogloss:0.676210 test-mlogloss:0.690592
## [80] train-mlogloss:0.675718 test-mlogloss:0.690382
## [81] train-mlogloss:0.675108 test-mlogloss:0.689924
## [82] train-mlogloss:0.674649 test-mlogloss:0.689643
## [83] train-mlogloss:0.674157 test-mlogloss:0.689260
## [84] train-mlogloss:0.673775 test-mlogloss:0.689082
## [85] train-mlogloss:0.673314 test-mlogloss:0.688763
## [86] train-mlogloss:0.672849 test-mlogloss:0.688530
## [87] train-mlogloss:0.672424 test-mlogloss:0.688319
## [88] train-mlogloss:0.672062 test-mlogloss:0.688140
## [89] train-mlogloss:0.671680 test-mlogloss:0.687851
## [90] train-mlogloss:0.671300 test-mlogloss:0.687681
## [91] train-mlogloss:0.670949 test-mlogloss:0.687498
## [92] train-mlogloss:0.670534 test-mlogloss:0.687209
## [93] train-mlogloss:0.670121 test-mlogloss:0.687067
## [94] train-mlogloss:0.669781 test-mlogloss:0.686878
## [95] train-mlogloss:0.669396 test-mlogloss:0.686745
## [96] train-mlogloss:0.668976 test-mlogloss:0.686521
## [97] train-mlogloss:0.668652 test-mlogloss:0.686372
## [98] train-mlogloss:0.668354 test-mlogloss:0.686222
## [99] train-mlogloss:0.667944 test-mlogloss:0.686011
## [100] train-mlogloss:0.667646 test-mlogloss:0.685948
## [101] train-mlogloss:0.667396 test-mlogloss:0.685756
## [102] train-mlogloss:0.667197 test-mlogloss:0.685624
## [103] train-mlogloss:0.666935 test-mlogloss:0.685504
## [104] train-mlogloss:0.666698 test-mlogloss:0.685431
## [105] train-mlogloss:0.666467 test-mlogloss:0.685377
## [106] train-mlogloss:0.666232 test-mlogloss:0.685311
## [107] train-mlogloss:0.665849 test-mlogloss:0.685118
## [108] train-mlogloss:0.665661 test-mlogloss:0.685121
## [109] train-mlogloss:0.665259 test-mlogloss:0.685021
## [110] train-mlogloss:0.664988 test-mlogloss:0.684949
## [111] train-mlogloss:0.664768 test-mlogloss:0.684876
## [112] train-mlogloss:0.664507 test-mlogloss:0.684813
## [113] train-mlogloss:0.664226 test-mlogloss:0.684783
## [114] train-mlogloss:0.663883 test-mlogloss:0.684631
## [115] train-mlogloss:0.663654 test-mlogloss:0.684583
## [116] train-mlogloss:0.663383 test-mlogloss:0.684530
## [117] train-mlogloss:0.663202 test-mlogloss:0.684480
## [118] train-mlogloss:0.662960 test-mlogloss:0.684388
## [119] train-mlogloss:0.662726 test-mlogloss:0.684311
## [120] train-mlogloss:0.662457 test-mlogloss:0.684264
## [121] train-mlogloss:0.662274 test-mlogloss:0.684227
## [122] train-mlogloss:0.662107 test-mlogloss:0.684234
## [123] train-mlogloss:0.661903 test-mlogloss:0.684236
## [124] train-mlogloss:0.661698 test-mlogloss:0.684222
## [125] train-mlogloss:0.661482 test-mlogloss:0.684140
## [126] train-mlogloss:0.661202 test-mlogloss:0.684089
## [127] train-mlogloss:0.661040 test-mlogloss:0.684086
## [128] train-mlogloss:0.660854 test-mlogloss:0.684045
## [129] train-mlogloss:0.660688 test-mlogloss:0.684007
## [130] train-mlogloss:0.660544 test-mlogloss:0.684001
## [131] train-mlogloss:0.660294 test-mlogloss:0.683870
## [132] train-mlogloss:0.660142 test-mlogloss:0.683861
## [133] train-mlogloss:0.659940 test-mlogloss:0.683843
## [134] train-mlogloss:0.659745 test-mlogloss:0.683809
## [135] train-mlogloss:0.659630 test-mlogloss:0.683766
## [136] train-mlogloss:0.659443 test-mlogloss:0.683781
## [137] train-mlogloss:0.659212 test-mlogloss:0.683818
## [138] train-mlogloss:0.658990 test-mlogloss:0.683886
## [139] train-mlogloss:0.658857 test-mlogloss:0.683892
## [140] train-mlogloss:0.658612 test-mlogloss:0.683783
## [141] train-mlogloss:0.658442 test-mlogloss:0.683729
## [142] train-mlogloss:0.658244 test-mlogloss:0.683743
## [143] train-mlogloss:0.657882 test-mlogloss:0.683689
## [144] train-mlogloss:0.657640 test-mlogloss:0.683710
## [145] train-mlogloss:0.657472 test-mlogloss:0.683711
## [146] train-mlogloss:0.657216 test-mlogloss:0.683656
## [147] train-mlogloss:0.656947 test-mlogloss:0.683617
## [148] train-mlogloss:0.656782 test-mlogloss:0.683643
## [149] train-mlogloss:0.656606 test-mlogloss:0.683635
## [150] train-mlogloss:0.656444 test-mlogloss:0.683633
# Make predictions on test data
predictions <- predict(xgb_model, dtest)
# Convert predictions to factors
predicted_labels <- factor(predictions, levels = 0:(length(unique(train_labels)) - 1))
actual_labels <- factor(test_labels, levels = 0:(length(unique(test_labels)) - 1))
# Create Confusion Matrix
conf_matrix <- confusionMatrix(predicted_labels, actual_labels)
# Print Confusion Matrix & Accuracy
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 1735 0 1429 1022
## 1 0 3998 157 294
## 2 651 0 783 108
## 3 813 0 1231 3773
##
## Overall Statistics
##
## Accuracy : 0.6433
## 95% CI : (0.6358, 0.6507)
## No Information Rate : 0.3249
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5168
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 0.5424 1.0000 0.21750 0.7260
## Specificity 0.8084 0.9624 0.93876 0.8107
## Pos Pred Value 0.4145 0.8986 0.50778 0.6486
## Neg Pred Value 0.8760 1.0000 0.80508 0.8601
## Prevalence 0.2000 0.2500 0.22508 0.3249
## Detection Rate 0.1085 0.2500 0.04896 0.2359
## Detection Prevalence 0.2617 0.2782 0.09641 0.3637
## Balanced Accuracy 0.6754 0.9812 0.57813 0.7683
#Prepare Data for SVM
# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)
# Train-Test Split (80% Train, 20% Test)
set.seed(123)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data <- customer_data_copy[-trainIndex, ]#Train SVM Model
# Make predictions on test data
svm_predictions <- predict(svm_model, test_data)
# Create Confusion Matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_data$region)
# Print Confusion Matrix & Accuracy
print(svm_conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2412 0 2137 1250
## 2 0 3998 228 442
## 3 0 0 75 121
## 4 787 0 1160 3384
##
## Overall Statistics
##
## Accuracy : 0.617
## 95% CI : (0.6095, 0.6246)
## No Information Rate : 0.3249
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4849
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.7540 1.0000 0.020833 0.6511
## Specificity 0.7353 0.9441 0.990237 0.8197
## Pos Pred Value 0.4159 0.8565 0.382653 0.6348
## Neg Pred Value 0.9228 1.0000 0.776870 0.8300
## Prevalence 0.2000 0.2500 0.225084 0.3249
## Detection Rate 0.1508 0.2500 0.004689 0.2116
## Detection Prevalence 0.3626 0.2919 0.012255 0.3333
## Balanced Accuracy 0.7446 0.9721 0.505535 0.7354
# Load required libraries
library(ggplot2)
library(reshape2)
library(caret)
# Make predictions on the test data using XGBoost
xgb_predictions <- predict(xgb_model, dtest)
# Convert predictions to factors
predicted_labels <- factor(xgb_predictions, levels = levels(test_data$region))
actual_labels <- factor(test_data$region, levels = levels(test_data$region))
# Create confusion matrix
xgb_conf_matrix <- confusionMatrix(predicted_labels, actual_labels)
# Convert confusion matrix to a dataframe for visualization
conf_matrix_df <- as.data.frame(xgb_conf_matrix$table)
# Plot the confusion matrix as a heatmap
ggplot(conf_matrix_df, aes(Prediction, Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", size = 5) + # Add count labels
scale_fill_gradient(low = "lightblue", high = "darkblue") + # Color gradient
theme_minimal() +
labs(title = "Confusion Matrix - XGBoost",
x = "Predicted Label",
y = "Actual Label",
fill = "Count") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 12),
axis.text.y = element_text(size = 12))# Load necessary libraries
library(xgboost)
library(ggplot2)
# Extract feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)
# Convert to dataframe for ggplot2
importance_df <- as.data.frame(importance_matrix)
# Create the enhanced feature importance plot
ggplot(importance_df, aes(x = reorder(Feature, Gain), y = Gain, fill = Gain)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "lightblue", high = "darkblue") + # Color gradient
coord_flip() + # Flip for better readability
theme_minimal() +
labs(title = "Feature Importance - XGBoost Model",
x = "Features",
y = "Importance Score") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12))# Compare Model Accuracies
accuracy_results <- data.frame(
Model = c("Random Forest", "XGBoost", "SVM"),
Accuracy = c(63.26, 64.33, 61.7)
)
ggplot(accuracy_results, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Model Accuracy Comparison", y = "Accuracy (%)", x = "Model")
Final Summary: Customer Region Prediction Using Machine Learning This
project focused on building a machine learning model to accurately
predict customer regions based on age, shopping behavior, and spending
patterns. Multiple models were tested, and XGBoost emerged as the best
model with 64.33% accuracy.
🚀 Key Findings & Model Performance Comparison Model Accuracy (%) Strengths Weaknesses Random Forest 63.26% Handles complex data well Slight overfitting XGBoost 64.33% Best accuracy, efficient Requires tuning SVM (Radial Kernel) 61.7% Strong generalization Slow training ✅ Final Decision: XGBoost is the best-performing model for customer region prediction.
📊 Key Visualizations for Submission ✅ Confusion Matrix (XGBoost) – Shows classification performance. ✅ Feature Importance Chart – Highlights which factors influence predictions. ✅ Model Accuracy Comparison – Demonstrates why XGBoost was chosen. ✅ Future Improvements Chart – Lays out next steps for improving model performance.
📌 Future Improvements 🚀 Optimizing XGBoost for better performance 🚀 Adding more predictive features (e.g., purchase frequency) 🚀 Balancing training data if regions are imbalanced
✅ Deployment Options: ✔ Save & use xgboost_region_model.rds for future predictions ✔ Integrate model predictions into a business dashboard
#Enhancing XGBoost for Customer Region Prediction.
# Load required library
library(ggplot2)
# Data for visualization
future_steps <- data.frame(
Step = c("Hyperparameter Tuning", "Additional Features", "Balance Training Data",
"Save & Use Model", "Integrate into Dashboard"),
Priority = c(90, 80, 70, 100, 95) # Arbitrary scores for emphasis
)
# Create a horizontal bar chart
ggplot(future_steps, aes(x = Priority, y = reorder(Step, Priority), fill = Priority)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "lightblue", high = "darkblue") + # Color gradient
theme_minimal() +
labs(title = "Enhancing XGBoost for Customer Region Prediction",
x = "Priority Level (%)",
y = "Future Improvements & Deployment Steps") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12)) +
geom_text(aes(label = paste(Priority, "%")), hjust = -0.2, size = 5, fontface = "bold") +
xlim(0, 110) # Set limits for better spacingThis Visualization Highlights: ✅ Hyperparameter tuning (90%) – Important for improving model accuracy. ✅ Additional features (80%) – Adding more variables like purchase frequency can enhance predictions. ✅ Balancing training data (70%) – Ensuring fair class distribution can prevent bias. ✅ Saving & using model (100%) – The trained model (xgboost_region_model.rds) is ready for future predictions. ✅ Integrating into a dashboard (95%) – Deploying predictions for business insights is crucial.
This project aims to build a machine learning model to predict customer regions based on their shopping behaviors, age, and spending patterns. The goal is to identify the best model for accurately classifying customers into different regions.
✅ Final Decision: XGBoost was selected as the best model (64.33%).
1️⃣ Confusion Matrix (XGBoost) – Shows classification performance. 2️⃣ Feature Importance Chart – Highlights the most influential features. 3️⃣ Model Accuracy Comparison – Justifies the model selection. 4️⃣ Spending Behavior by Region – Provides insights into customer spending patterns.
🚀 Enhancements for XGBoost: ✅ Hyperparameter tuning to optimize performance. ✅ Adding more predictive features (e.g., purchase frequency). ✅ Balancing training data if regions are imbalanced.
📌 Deployment Options: ✔ Save & use
xgboost_region_model.rds for future predictions. ✔
Integrate model predictions into a business
dashboard.
✅ XGBoost is recommended for customer region prediction due to its superior accuracy. ✅ Additional data features and optimization can further enhance model performance. ✅ Deploying the model in a business environment can aid in targeted marketing and customer engagement strategies.
This study demonstrates that machine learning can effectively predict customer regions based on spending behaviors. With further enhancements, this model can provide valuable insights for business decision-making.
#DashBoard
# Install & Load Required Libraries
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
library(patchwork)##
## Attaching package: 'patchwork'
## The following object is masked from 'package:formattable':
##
## area
# Ensure categorical variables are factors
customer_data$region <- as.factor(customer_data$region)
customer_data$in.store <- as.factor(customer_data$in.store)
# Increase text size for better readability
theme_custom <- theme_minimal(base_size = 14) +
theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
axis.text = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))
# Create Individual Plots
plot1 <- ggplot(customer_data, aes(x = items, y = amount)) +
geom_point(color = "dodgerblue", size = 3, alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) +
theme_custom +
labs(title = "📊 Items Purchased vs. Amount Spent",
x = "Number of Items Purchased",
y = "Amount Spent ($)")
plot2 <- ggplot(customer_data, aes(x = region, y = amount, fill = region)) +
geom_bar(stat = "identity", width = 0.6) +
scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF", "#F1C40F")) +
theme_custom +
labs(title = "💰 Spending Behavior by Region",
x = "Region",
y = "Total Amount Spent")
plot3 <- ggplot(customer_data, aes(x = in.store, fill = region)) +
geom_bar(position = "fill", width = 0.6) +
scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF", "#F1C40F")) +
theme_custom +
labs(title = "🛒 Shopping Mode Preference by Region",
x = "Shopping Mode (Online vs. In-Store)",
y = "Proportion")
plot4 <- ggplot(customer_data, aes(x = age, y = amount)) +
geom_point(color = "purple", alpha = 0.6, size = 3) +
geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) +
theme_custom +
labs(title = "👥 Age vs. Spending",
x = "Customer Age",
y = "Amount Spent ($)")
# Arrange Plots Using Patchwork
final_plot <- (plot1 | plot2) / (plot3 | plot4) +
plot_annotation(title = "🛍️ Customer Insights Dashboard - Evovanta AI",
subtitle = "Understanding Spending & Shopping Behaviors Using Machine Learning",
theme = theme(plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 16, hjust = 0.5)))
# Save the dashboard with better resolution
ggsave("EvovantaAI_Dashboard.png", plot = final_plot, width = 16, height = 10, dpi = 300)## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Confirm the file was saved
if (file.exists("EvovantaAI_Dashboard.png")) {
print("✅ Dashboard successfully saved as EvovantaAI_Dashboard.png")
} else {
print("❌ Error: Dashboard not saved.")
}## [1] "✅ Dashboard successfully saved as EvovantaAI_Dashboard.png"