#Introduction In today’s competitive retail environment, understanding customer purchasing behavior is critical for maximizing sales and improving customer experience. Blackwell Electronics, a renowned electronics retailer, has collected customer transaction data from both online and in-store purchases. By analyzing this data, we aim to uncover key insights into spending patterns, regional differences, and customer demographics. The findings will help the company optimize marketing strategies, improve inventory management, and enhance the overall shopping experience.

1 Goal Overall Purpose of the Project (Broad Vision)

The primary purpose of this project is to leverage predictive analytics to gain a deeper understanding of customer purchasing behavior at Blackwell Electronics. By analyzing customer transactions, we aim to:

  1. Improve Customer Retention: Identify factors influencing repeat purchases and customer loyalty.

  2. Increase Sales: Optimize marketing and inventory strategies based on regional spending habits and demographic insights.

  3. Enhance Online and In-Store Experience: Provide data-driven recommendations to improve the user journey across sales channels.

4.Predict Shopping Behavior: Use machine learning models to forecast future purchasing patterns and segment customers effectively.

Through this study, Blackwell Electronics will gain actionable insights to make informed business decisions, resulting in higher customer satisfaction, improved operational efficiency, and increased profitability.

2 Ojectives (Specific Actions and Roadmap)

To achieve these broad goals, we break them down into specific, measurable objectives: Identify regional spending patterns and determine which regions spend the most/least. Analyze the relationship between the number of items purchased and total spending. Investigate age differences between regions and whether a customer’s age can be predicted based on demographic data. Explore correlations between customer age and transaction type (online vs. in-store). Predict a customer’s region based on transaction and demographic data.

2.1 Generating and Testing Hypotheses

To strengthen our analysis, i will generate and test the following hypotheses based on age group, region, and key transactional variables:

Hypotheses for Testing

H1: Older customers are more likely to shop online than in-store (Proven with Chi-Square and Logistic Regression). H2: Regional Differences in Spending Behavior
H3: A customer’s age can be predicted based on transaction characteristics (shopping mode, number of items, and amount spent
H4: Correlation Between Items Purchased & Spending (Pearson Correlation & Regression)
H5:Predicting Customer Region Based on Transaction Data (To be done in modeling).

#Data Description The dataset consists of 80,000 customer transactions, with the following variables: 1.in-store (Categorical: 1 = In-store, 0 = Online) 2.age (Numerical: Customer’s age) 3.items (Numerical: Number of items purchased per transaction) 4.amount (Numerical: Total amount spent in USD per transaction) 5.region (Categorical: 1 = North, 2 = South, 3 = East, 4 = West)

#Data Preprocessing To ensure the dataset is clean and ready for analysis, i perform the following preprocessing steps:

A. Handling Missing Values: -Identify and remove or impute missing values where necessary.

B. Handling Duplicates: -Detect and remove duplicate records to ensure data integrity.

C.Outlier Detection and Treatment: -Use Interquartile Range (IQR) to detect and remove extreme outliers in amount, age, and items.

D. Feature Transformation: -Convert categorical variables (in-store, region) into factors for analysis.

E.Scaling & Normalization: -Standardize numerical variables to improve model performance.

customer_data <- read.csv("Demographic_Data.csv")

#Check the First Few Rows

head(customer_data, 10)  # View first 10 rows
##    in.store age items    amount region
## 1         0  37     4  281.0300      2
## 2         0  35     2  219.5100      2
## 3         1  45     3 1525.7000      4
## 4         1  46     3  715.2500      3
## 5         1  33     4 1937.5000      1
## 6         0  24     3 1314.2000      4
## 7         0  43     6    8.5472      2
## 8         1  66     5  819.9800      3
## 9         1  51     5  908.3100      1
## 10        1  47     3  767.5400      1
tail(customer_data, 10)  # View last 10 rows
##       in.store age items    amount region
## 79991        0  71     7    6.9172      2
## 79992        0  37     4  658.0000      4
## 79993        0  77     4  170.0200      2
## 79994        1  47     3  337.5400      3
## 79995        0  22     1  698.5400      4
## 79996        1  71     3  558.8200      1
## 79997        0  59     7 1932.0000      3
## 79998        0  54     1  414.1600      2
## 79999        1  49     4  335.3200      1
## 80000        1  30     1  527.1200      3

#Check the Structure and Data Types

str(customer_data)
## 'data.frame':    80000 obs. of  5 variables:
##  $ in.store: int  0 0 1 1 1 0 0 1 1 1 ...
##  $ age     : int  37 35 45 46 33 24 43 66 51 47 ...
##  $ items   : int  4 2 3 3 4 3 6 5 5 3 ...
##  $ amount  : num  281 220 1526 715 1938 ...
##  $ region  : int  2 2 4 3 1 4 2 3 1 1 ...

#get the summary statistics of the data

summary(customer_data)
##     in.store        age            items           amount        
##  Min.   :0.0   Min.   :18.00   Min.   :1.000   Min.   :   5.005  
##  1st Qu.:0.0   1st Qu.:33.00   1st Qu.:3.000   1st Qu.: 285.140  
##  Median :0.5   Median :45.00   Median :4.000   Median : 582.315  
##  Mean   :0.5   Mean   :45.76   Mean   :4.505   Mean   : 835.920  
##  3rd Qu.:1.0   3rd Qu.:56.00   3rd Qu.:6.000   3rd Qu.:1233.700  
##  Max.   :1.0   Max.   :85.00   Max.   :8.000   Max.   :3000.000  
##      region     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.675  
##  3rd Qu.:4.000  
##  Max.   :4.000

3 Data Preprocessing

3.1 Checking for duplicates

#checking duplicates
duplicates <- customer_data [duplicated(customer_data),]
print(duplicates)
##       in.store age items  amount region
## 17859        0  76     6  283.61      2
## 26113        0  67     4  320.41      2
## 29344        1  23     2 1019.50      4
## 33835        1  38     3 1993.80      1
## 34215        1  31     5 1488.10      4
## 34707        0  39     5 1268.00      4
## 35972        0  38     7  391.51      2
## 43959        1  22     1 1303.20      4
## 44309        1  26     5 1495.70      4
## 45165        0  79     6  342.28      2
## 47415        0  27     6 1555.10      4
## 51601        0  56     2 1941.70      4
## 52014        1  50     5 1889.40      1
## 53653        0  29     2  144.43      2
## 58455        0  76     1  259.44      2
## 59540        1  49     2 1162.00      4
## 59860        0  30     7 2535.10      4
## 61440        1  32     4 1106.10      4
## 63971        1  42     5 1731.10      4
## 75507        0  44     2 1756.70      4
## 78689        1  40     2 1080.60      1

I observed that there are 21 duplicate rows.

4 Remove duplicates

customer_data <- customer_data[!duplicated(customer_data), ]

#Check if duplicates have been removed

sum(duplicated(customer_data))  # Should return 0
## [1] 0

Duplicates have been successfully removed. Since the reduction in data set size is only 0.03% (21 rows removed out of 80,000), removing duplicates does not significantly impact our insights.

5 Checking For missing values

#checking missing values
missing_values <- colSums(is.na(customer_data))
print (missing_values)
## in.store      age    items   amount   region 
##        0        0        0        0        0

It’s observed that there are no missing values seen in the data set.

customer_data_copy <- customer_data

#Save Data

# Save the data frame to an RData file
save(customer_data_copy, file = "customer_data_copy.RData")

#Load Data

# Load the data frame from the RData file
load("customer_data_copy.RData")

6 Convert region and in.store to Factors

customer_data_copy$in.store <- factor(customer_data_copy$in.store, labels = c("Online", "In-Store"))

customer_data_copy$region <- factor(customer_data_copy$region, 
                               levels = c(1, 2, 3, 4), 
                               labels = c("North", "South", "East", "West"))

To improve readability in analysis and visualization.

table(customer_data_copy$region)  # Check correct region mapping
## 
## North South  East  West 
## 15997 19994 18000 25988
table(customer_data_copy$in.store)  # Check online vs in-store count
## 
##   Online In-Store 
##    39989    39990

#Exploratory Data Analysis (EDA) Now, let’s visualize key insights from your dataset.

7 Boxplots for Outliers:

# Load required libraries
library(ggplot2)

# Load the dataset (Ensure the file path is correct)
customer_data_copy <- read.csv("Demographic_Data.csv", stringsAsFactors = FALSE)

# Convert 'region' column to factors for better visualization
customer_data_copy$region <- factor(customer_data_copy$region, 
                               levels = c(1, 2, 3, 4), 
                               labels = c("North", "South", "East", "West"))

# Create the boxplots with red outliers and display them
par(mfrow = c(1,3))  # Arrange plots in 1 row, 3 columns

# Boxplot for Amount Spent
boxplot(customer_data_copy$amount, 
        main = "Outliers in Amount Spent", 
        ylab = "Amount Spent (USD)", 
        col = "blue", 
        border = "black", 
        pch = 19)  # Red outliers

# Boxplot for Age
boxplot(customer_data_copy$age, 
        main = "Outliers in Age", 
        ylab = "Customer Age", 
        col = "lightgreen", 
        border = "black", 
        pch = 19)

# Boxplot for Items Purchased
boxplot(customer_data_copy$items, 
        main = "Outliers in Items Purchased", 
        ylab = "Number of Items", 
        col = "lightcoral", 
        border = "black", 
        pch = 19)

# Reset plot layout
par(mfrow = c(1,1))

# Save the plots as a PNG file
png("Outliers_Boxplots.png", width = 1500, height = 600, res = 150)

# Redraw plots in the PNG file
par(mfrow = c(1,3))
boxplot(customer_data_copy$amount, main = "Outliers in Amount Spent", ylab = "Amount Spent (USD)", 
        col = "blue", border = "red", pch = 19)

boxplot(customer_data_copy$age, main = "Outliers in Age", ylab = "Customer Age", 
        col = "lightgreen", border = "black", pch = 19)

boxplot(customer_data_copy$items, main = "Outliers in Items Purchased", ylab = "Number of Items", 
        col = "yellow", border = "black", pch = 19)

# Close the PNG device to save the file
dev.off()
## png 
##   2
# Print success message
cat("✅ Boxplots are displayed and saved as 'Outliers_Boxplots.png'\n")
## ✅ Boxplots are displayed and saved as 'Outliers_Boxplots.png'
# To view the saved file in RStudio (optional)
browseURL("Outliers_Boxplots.png")

Why These Outliers Matter? Outliers in Amount Spent → High spenders could be targeted for premium offers.

No outliers in Age → Balanced customer demographics.

No outliers in Items Purchased → Stable purchasing behavior across customers.

#Spending by Region

# Load necessary libraries
library(ggplot2)

# Create the box plot
spending_plot <- ggplot(customer_data_copy, aes(x = factor(region), y = amount, fill = factor(region))) +
  geom_boxplot(outlier.color = "black", outlier.shape = 16, outlier.size = 2) +
  labs(title = "Spending Behavior by Region",
       x = "Region",
       y = "Amount Spent ($)",
       fill = "Region") +
  theme_minimal() +
  scale_fill_manual(values = c("red", "green", "blue", "purple"))

# Save the plot as a PNG file
ggsave("spending_by_region.png", spending_plot, width = 10, height = 5, dpi = 300)


ggplot(customer_data_copy, aes(x = region, y = amount, fill = region)) +
  geom_boxplot() +
  labs(title = "Spending Distribution by Region", y = "Amount Spent", x = "Region")

Understanding the Box Plot Each box represents the middle 50% of spending values (Interquartile Range - IQR), while the black horizontal line inside the box shows the median spending. The whiskers extend to data points within 1.5 times the IQR, and dots above the whiskers indicate outliers, or unusually high spending customers.

Key Observations by Region 1. North Region (Red) Median spending is slightly below $1000. Spending varies widely, with some customers spending over $2500. A few extreme outliers suggest some high-spending customers. 2. South Region (Green) Median spending is similar to North but slightly lower. The range is from near $0 to $2000. Less extreme outliers compared to North. 3. East Region (Blue) This region has the lowest median spending. Spending is more consistent, with a narrow range. Very few high outliers indicate that customers in this region tend to spend within a limited range. 4. West Region (Purple) Highest median spending among all regions. Widest variability, meaning spending habits are inconsistent. Many high-value outliers (above $3000) suggest some customers spend significantly more than others. Business Implications & Recommendations Marketing Focus: The West region has the most high-spending customers, making it ideal for premium product promotions. Targeted Discounts: The East region spends the least; introducing discounts or incentives may encourage higher spending. Customer Segmentation: South and North have moderate spending behavior, making them suitable for mid-range pricing strategies. Outlier Investigation: Further analysis can help understand what drives high spending in the West and why the East has limited variability.

#Online vs In-Store Purchases:

# Load required library
library(ggplot2)

# Ensure in.store is a factor
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

# Create the boxplot
p <- ggplot(customer_data_copy, aes(x = in.store, y = amount, fill = in.store)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 3) +
  scale_fill_manual(values = c("orange", "blue")) +  # Yellow for Online, Blue for In-Store
  labs(title = "Spending Amount: Online vs. In-Store", 
       x = "Shopping Type", 
       y = "Amount Spent ($)") +
  theme_minimal()

# Save the plot as PNG
ggsave("Online_vs_InStore_Purchases.png", plot = p, width = 8, height = 5, dpi = 300)

# Print the plot in R
print(p)

The median spending amount for online purchases is higher than for in-store purchases. In-store purchases show more variation in spending, with some very high amounts (outliers). There is a higher spread (larger interquartile range) in in-store spending.

7.1 Customer Age Distribution

# Load required library
library(ggplot2)

# Create bar plot
items_plot <- ggplot(customer_data_copy, aes(x = factor(items))) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Number of Items Purchased per Transaction", 
       x = "Number of Items", 
       y = "Transaction Count") +
  theme_minimal()

# Display the plot
print(items_plot)

# Save the plot as PNG for presentation slides
ggsave("Items_Purchased_BarPlot.png", plot = items_plot, width = 8, height = 5, dpi = 300)

Analysis: Number of Items Purchased per Transaction This bar chart visualizes the distribution of the number of items purchased per transaction across all customers. It helps in understanding customer purchasing behavior, which is crucial for inventory management, pricing strategies, and marketing decisions.

🔹 Key Observations from the Chart Most Transactions Involve 2 to 7 Items

The bars for 2 to 7 items per transaction are the tallest, indicating that most customers buy within this range. This suggests that customers prefer buying multiple items at once rather than single-item purchases. Single-Item Transactions Are Less Frequent

The lowest bar (except for 8 items) is at 1 item per transaction. This could imply that customers prefer bulk shopping rather than frequent, small transactions. Alternatively, it may suggest that low-value purchases are rare. Even Distribution for 2 to 7 Items

The bars for 2 to 7 items are almost equal in height, suggesting that customers commonly purchase within this range, without a strong preference for any specific number. Slight Decline at 8 Items Per Transaction

The transaction count drops sharply at 8 items. This may indicate a purchase limit, pricing threshold, or bulk purchase discount impact. Customers may not need more than 7 items per transaction on average. 📌 Business Insights 💡 1. Encourage Multi-Item Purchases

Since customers already buy 2 to 7 items per transaction, businesses can incentivize bulk purchases through: Bundle discounts (e.g., “Buy 3, Get 1 Free”). Loyalty rewards for larger transactions. Free shipping for orders over a certain threshold. 💡 2. Improve Pricing for Single-Item Purchases

The low number of single-item transactions suggests that customers might find single purchases less valuable. Strategies to increase single-item purchases: Offer limited-time discounts for specific products. Promote impulse buys at checkout. Provide “frequently bought together” recommendations. 💡 3. Investigate the 8-Item Drop

If there is a system limit or pricing pattern causing fewer 8-item transactions, businesses should: Analyze customer receipts to identify spending caps. Test removing transaction limits if applicable. Offer a discount for 8+ items to check if purchase volume increases.

7.2 Correlation Analysis

# Compute correlation matrix for numerical variables
cor_matrix <- cor(customer_data_copy[, c("age", "items", "amount")])

# Display correlation matrix
print(cor_matrix)
##                  age        items        amount
## age     1.0000000000 0.0006785133 -0.2820891185
## items   0.0006785133 1.0000000000  0.0003670029
## amount -0.2820891185 0.0003670029  1.0000000000

Explanation of Correlation Matrix for Your Presentation The correlation matrix shows the strength and direction of relationships between Age, Number of Items Purchased, and Amount Spent:

Age vs. Items Purchased (0.0007): There is virtually no correlation between a customer’s age and the number of items they purchase. This suggests that customers across all age groups tend to buy a similar number of items.

Age vs. Amount Spent (-0.282): There is a weak negative correlation, meaning older customers tend to spend slightly less. However, this relationship is not very strong.

Items Purchased vs. Amount Spent (0.0004): There is almost no correlation between the number of items purchased and the total amount spent. This could indicate that customers might be buying different types of products with varying prices, so simply buying more items doesn’t always mean spending more.

cor_matrix <- cor(customer_data_copy[, c("age", "items", "amount")])

8 Convert matrix to data frame and add row names as a column

library(dplyr)

# Convert matrix to data frame and add row names as a column
cor_df <- as.data.frame(cor_matrix) %>%
  mutate(Var1 = rownames(cor_matrix))

# Reshape the data frame to long format
melted_cor_matrix <- cor_df %>%
  pivot_longer(cols = -Var1, names_to = "Var2", values_to = "value")

9 Compute correlation matrix for numerical variables

# Load required libraries
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# Compute correlation matrix for numerical variables
cor_matrix <- cor(customer_data_copy[, c("age", "items", "amount")])

# Convert correlation matrix into a format suitable for ggplot
melted_cor <- melt(cor_matrix)

# Final improved heatmap with better color transition
ggplot(data = melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "white", size = 6, fontface = "bold") +
  scale_fill_gradientn(colors = c("#053061", "#2166AC", "white", "#F4A582", "#B2182B"), 
                       values = scales::rescale(c(-1, -0.5, 0, 0.5, 1)),
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ggtitle("Final Enhanced Correlation Heatmap")

Correlation Analysis and Interpretation The Final Enhanced Correlation Heatmap presents the Pearson correlation coefficients among three key numerical variables in the dataset: age, items, and amount spent. The correlation coefficients range between -1 and 1, where:

1 (red): Perfect positive correlation – as one variable increases, the other also increases. 0 (white): No correlation – no linear relationship between variables. -1 (blue): Perfect negative correlation – as one variable increases, the other decreases.

#Testing the Hypotheses. Older customers are more likely to shop online than in-store.

# Create Age Groups
customer_data_copy$Age_Group <- cut(customer_data_copy$age, breaks = c(18, 30, 45, 60, 85), 
                               labels = c("18-30", "31-45", "46-60", "61+"))

# Create contingency table
age_store_table <- table(customer_data_copy$Age_Group, customer_data_copy$in.store)

# Perform Chi-Square Test
chi_test <- chisq.test(age_store_table)

# Display results
chi_test
## 
##  Pearson's Chi-squared test
## 
## data:  age_store_table
## X-squared = 2087.8, df = 3, p-value < 2.2e-16

#Logistic Regression Model

# Logistic regression model
logistic_model <- glm(in.store ~ age, data = customer_data, family = binomial)

# Display summary
summary(logistic_model)
## 
## Call:
## glm(formula = in.store ~ age, family = binomial, data = customer_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.0635598  0.0225026   47.26   <2e-16 ***
## age         -0.0232745  0.0004679  -49.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110874  on 79978  degrees of freedom
## Residual deviance: 108302  on 79977  degrees of freedom
## AIC: 108306
## 
## Number of Fisher Scoring iterations: 4

#Age Distribution by Shoping Channel

# Load required library
library(ggplot2)

# Convert 'in.store' to a factor (1 = In-Store, 0 = Online)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

# Boxplot to visualize age distribution for online vs. in-store shopping
ggplot(customer_data_copy, aes(x = in.store, y = age, fill = in.store)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Age Distribution by Shopping Channel", x = "Shopping Channel", y = "Age") +
  scale_x_discrete(labels = c("Online", "In-Store"))

10 Probability Plot of Shopping Mode by Age

# Load required library
library(ggplot2)

# Create a sequence of ages from 18 to 85
age_range <- data.frame(age = seq(18, 85, by = 1))

# Predict probability of in-store shopping for each age
age_range$predicted_prob <- predict(logistic_model, newdata = age_range, type = "response")

# Plot the logistic regression probability curve (Fixed Warning)
ggplot(age_range, aes(x = age, y = predicted_prob)) +
  geom_line(color = "blue", linewidth = 1) +  # Changed 'size' to 'linewidth'
  theme_minimal() +
  labs(title = "Probability of In-Store Shopping by Age",
       x = "Age",
       y = "Predicted Probability of Shopping In-Store") +
  theme(text = element_text(size = 12)) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  annotate("text", x = 65, y = 0.55, label = "50% Probability Threshold", color = "red")

Key Evidence Supporting This Conclusion 1️⃣ Chi-Square Test Results p-value < 2.2e-16 → Statistically significant relationship between age and shopping preference. This confirms that age influences whether a customer shops online or in-store. 2️⃣ Logistic Regression Results Negative coefficient for Age (-0.0233) → Older customers are less likely to shop in-store. Since the p-value is very small (< 2e-16), the effect is statistically significant. 3️⃣ Boxplot: Age Distribution by Shopping Mode Older customers tend to shop online more frequently compared to younger customers. Younger customers have a higher proportion of in-store purchases. 4️⃣ Probability Plot of Shopping Mode by Age The probability of shopping in-store declines with age. The regression curve confirms that older customers have a higher likelihood of online shopping. 📌 Final Conclusion 🚀 We reject the original hypothesis that older customers prefer in-store shopping. ✅ Instead, the analysis shows that older customers are more likely to shop online than in-store.

🔎 Business Implications Older customers prefer convenience and flexibility, making online platforms critical for customer retention. Younger customers are more likely to visit physical stores, meaning in-store promotions should target them more aggressively. Marketing Strategy Shift: For older customers → Improve online shopping experience, easy checkout, and digital marketing. For younger customers → Focus on in-store events, student discounts, and personalized experiences.

11 Bar Plot to visualize the hypotheses

# Load required library
library(ggplot2)
library(dplyr)

# Remove NA values from Age Group before plotting
customer_data_copy <- customer_data_copy %>%
  filter(!is.na(Age_Group))  # Excludes rows where Age Group is NA

# Recreate the bar plot without NA
ggplot(customer_data_copy, aes(x = Age_Group, fill = in.store)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Shopping Preference by Age Group",
       x = "Age Group",
       y = "Proportion") +
  scale_fill_manual(values = c("red", "blue"), 
                    labels = c("In-Store", "Online"),
                    name = "Shopping Mode")

Key Observations from the Chart

  1. Younger Customers (18-30) Have a Balanced Preference

The 18-30 age group shows a nearly equal split between in-store (red) and online (blue) shopping.

This suggests that younger customers are equally comfortable shopping online and in physical stores.

  1. Middle-Aged Customers (31-45 & 46-60) Show Slightly Higher In-Store Preference

The 31-45 and 46-60 groups have a higher proportion of in-store shoppers compared to online.

This suggests that these age groups still value physical shopping experiences but also shop online.

  1. Older Customers (61+) Prefer Online Shopping

The 61+ group has a higher proportion of online shoppers than any other age group.

This confirms that older customers are more likely to shop online than in-store.

12 Load the saved dataset

# Load the saved dataset
load("customer_data_copy.RData")

# Verify dataset structure
str(customer_data_copy)
## 'data.frame':    79979 obs. of  5 variables:
##  $ in.store: int  0 0 1 1 1 0 0 1 1 1 ...
##  $ age     : int  37 35 45 46 33 24 43 66 51 47 ...
##  $ items   : int  4 2 3 3 4 3 6 5 5 3 ...
##  $ amount  : num  281 220 1526 715 1938 ...
##  $ region  : int  2 2 4 3 1 4 2 3 1 1 ...
dim(customer_data_copy)
## [1] 79979     5
summary(customer_data_copy)
##     in.store        age            items           amount        
##  Min.   :0.0   Min.   :18.00   Min.   :1.000   Min.   :   5.005  
##  1st Qu.:0.0   1st Qu.:33.00   1st Qu.:3.000   1st Qu.: 285.120  
##  Median :1.0   Median :45.00   Median :4.000   Median : 582.140  
##  Mean   :0.5   Mean   :45.76   Mean   :4.505   Mean   : 835.826  
##  3rd Qu.:1.0   3rd Qu.:56.00   3rd Qu.:6.000   3rd Qu.:1233.400  
##  Max.   :1.0   Max.   :85.00   Max.   :8.000   Max.   :3000.000  
##      region     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.675  
##  3rd Qu.:4.000  
##  Max.   :4.000

13 Display column names

colnames(customer_data_copy)  # Display column names
## [1] "in.store" "age"      "items"    "amount"   "region"

14 Convert categorical variables to factors using correct column name

# Convert categorical variables to factors using correct column name
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

15 Set seed for reproducibility.

# Load required libraries
library(caret)       # For model training
library(ranger)      # Random Forest (for H5 - Region Prediction)

# Set seed for reproducibility
set.seed(123)

# Load the correct dataset
load("customer_data_copy.RData")  # Reload dataset

# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

# Train-Test Split (80% Train, 20% Test)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data  <- customer_data_copy[-trainIndex, ]

#H5 - Predicting Customer Region (Classification Model - Random Forest ranger)

# Train the Random Forest model
rf_model <- ranger(dependent.variable.name = "region",  # Target variable
                   data = train_data, 
                   num.trees = 500,  # Number of trees
                   num.threads = 4,  # Use 4 CPU threads
                   importance = "impurity",  # Feature importance
                   classification = TRUE)  # Ensure classification mode

16 View model summary

print(rf_model)  
## Ranger result
## 
## Call:
##  ranger(dependent.variable.name = "region", data = train_data,      num.trees = 500, num.threads = 4, importance = "impurity",      classification = TRUE) 
## 
## Type:                             Classification 
## Number of trees:                  500 
## Sample size:                      63985 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             36.60 %

17 Extract variable importance

importance(rf_model)
##  in.store       age     items    amount 
##  8083.534  5733.406  1438.485 16882.853

#Remove Weak Features & Retrain the Model

train_data_reduced <- train_data %>% select(-items)

rf_model_reduced <- ranger(dependent.variable.name = "region",
                           data = train_data_reduced,
                           num.trees = 500,
                           num.threads = 4,
                           importance = "impurity",
                           classification = TRUE)

#Tune Hyperparameters to Improve Accuracy Since your Out-of-Bag (OOB) Error = 36.60%, tuning mtry and num.trees may help

# Try different values for mtry and num.trees
rf_model_tuned <- ranger(dependent.variable.name = "region",
                         data = train_data,
                         num.trees = 700,  # Increase trees
                         mtry = 3,  # Try different feature subsets per split
                         num.threads = 4,
                         importance = "impurity",
                         classification = TRUE)
## Growing trees.. Progress: 50%. Estimated remaining time: 30 seconds.
## Growing trees.. Progress: 98%. Estimated remaining time: 1 seconds.

18 Predict on test data

# Predict on test data
predictions <- predict(rf_model, data = test_data)$predictions

# Create a confusion matrix
conf_matrix <- table(Predicted = predictions, Actual = test_data$region)

# Compute accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Test Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Test Accuracy: 63.25 %"

19 Save the trained Random Forest model to a file

# Save the trained Random Forest model to a file
saveRDS(rf_model, file = "random_forest_region_model.rds")

20 Load necessary libraries

# Load necessary libraries
library(caret)       # For machine learning workflows
library(xgboost)     # For Gradient Boosting
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
library(Matrix)      # For matrix conversion
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

21 Create DMatrix for Traing XGBoost

# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

# Create Train-Test Split (80% Train, 20% Test)
set.seed(123)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data  <- customer_data_copy[-trainIndex, ]

# Convert to XGBoost-compatible format (Matrix)
train_matrix <- sparse.model.matrix(region ~ ., data = train_data)[,-1]
test_matrix  <- sparse.model.matrix(region ~ ., data = test_data)[,-1]

# Convert Target Variable to Numeric (XGBoost requires numeric labels)
train_labels <- as.numeric(train_data$region) - 1  # Convert factors to 0-based index
test_labels  <- as.numeric(test_data$region) - 1

# Create DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_matrix, label = train_labels)
dtest  <- xgb.DMatrix(data = test_matrix, label = test_labels)

22 Define XGBoost parameters

library(caret)

# Define trainControl
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation

# Define XGBoost parameters
params <- list(
  objective = "multi:softmax",  # Multi-class classification
  num_class = length(unique(train_labels)),  # Number of regions
  eta = 0.1,  # Learning rate
  max_depth = 6,  # Tree depth
  subsample = 0.8,  # Subsampling
  colsample_bytree = 0.8  # Feature selection
)

# Train the XGBoost model
xgb_model <- xgb.train(params = params,
                        data = dtrain,
                        nrounds = 150,  # Number of boosting rounds
                        watchlist = list(train = dtrain, test = dtest),
                        early_stopping_rounds = 10,
                        verbose = 1)
## [1]  train-mlogloss:1.323526 test-mlogloss:1.323618 
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
## 
## [2]  train-mlogloss:1.279707 test-mlogloss:1.279595 
## [3]  train-mlogloss:1.227901 test-mlogloss:1.227985 
## [4]  train-mlogloss:1.172231 test-mlogloss:1.172730 
## [5]  train-mlogloss:1.124893 test-mlogloss:1.125902 
## [6]  train-mlogloss:1.086358 test-mlogloss:1.087701 
## [7]  train-mlogloss:1.052024 test-mlogloss:1.053605 
## [8]  train-mlogloss:1.020829 test-mlogloss:1.022667 
## [9]  train-mlogloss:0.993481 test-mlogloss:0.995601 
## [10] train-mlogloss:0.966720 test-mlogloss:0.969101 
## [11] train-mlogloss:0.947149 test-mlogloss:0.949708 
## [12] train-mlogloss:0.930831 test-mlogloss:0.933523 
## [13] train-mlogloss:0.913030 test-mlogloss:0.916093 
## [14] train-mlogloss:0.901441 test-mlogloss:0.904597 
## [15] train-mlogloss:0.886552 test-mlogloss:0.889856 
## [16] train-mlogloss:0.870727 test-mlogloss:0.874473 
## [17] train-mlogloss:0.857685 test-mlogloss:0.861741 
## [18] train-mlogloss:0.846152 test-mlogloss:0.850353 
## [19] train-mlogloss:0.837332 test-mlogloss:0.841615 
## [20] train-mlogloss:0.825454 test-mlogloss:0.829959 
## [21] train-mlogloss:0.814101 test-mlogloss:0.818858 
## [22] train-mlogloss:0.804041 test-mlogloss:0.809083 
## [23] train-mlogloss:0.797704 test-mlogloss:0.802967 
## [24] train-mlogloss:0.789739 test-mlogloss:0.795187 
## [25] train-mlogloss:0.783779 test-mlogloss:0.789487 
## [26] train-mlogloss:0.778633 test-mlogloss:0.784421 
## [27] train-mlogloss:0.772700 test-mlogloss:0.778697 
## [28] train-mlogloss:0.766523 test-mlogloss:0.772679 
## [29] train-mlogloss:0.761693 test-mlogloss:0.768014 
## [30] train-mlogloss:0.757662 test-mlogloss:0.764127 
## [31] train-mlogloss:0.753986 test-mlogloss:0.760525 
## [32] train-mlogloss:0.749864 test-mlogloss:0.756501 
## [33] train-mlogloss:0.745024 test-mlogloss:0.751818 
## [34] train-mlogloss:0.741167 test-mlogloss:0.748124 
## [35] train-mlogloss:0.737791 test-mlogloss:0.744924 
## [36] train-mlogloss:0.734315 test-mlogloss:0.741651 
## [37] train-mlogloss:0.731390 test-mlogloss:0.738887 
## [38] train-mlogloss:0.727803 test-mlogloss:0.735478 
## [39] train-mlogloss:0.724596 test-mlogloss:0.732419 
## [40] train-mlogloss:0.722974 test-mlogloss:0.730907 
## [41] train-mlogloss:0.720935 test-mlogloss:0.729090 
## [42] train-mlogloss:0.718433 test-mlogloss:0.726695 
## [43] train-mlogloss:0.715626 test-mlogloss:0.724035 
## [44] train-mlogloss:0.713435 test-mlogloss:0.722003 
## [45] train-mlogloss:0.711416 test-mlogloss:0.720238 
## [46] train-mlogloss:0.709362 test-mlogloss:0.718370 
## [47] train-mlogloss:0.707806 test-mlogloss:0.716980 
## [48] train-mlogloss:0.706070 test-mlogloss:0.715444 
## [49] train-mlogloss:0.704194 test-mlogloss:0.713748 
## [50] train-mlogloss:0.702437 test-mlogloss:0.712131 
## [51] train-mlogloss:0.700929 test-mlogloss:0.710777 
## [52] train-mlogloss:0.700046 test-mlogloss:0.709988 
## [53] train-mlogloss:0.698406 test-mlogloss:0.708522 
## [54] train-mlogloss:0.696692 test-mlogloss:0.706998 
## [55] train-mlogloss:0.695634 test-mlogloss:0.706082 
## [56] train-mlogloss:0.694356 test-mlogloss:0.704973 
## [57] train-mlogloss:0.693303 test-mlogloss:0.704138 
## [58] train-mlogloss:0.691912 test-mlogloss:0.702965 
## [59] train-mlogloss:0.690780 test-mlogloss:0.701894 
## [60] train-mlogloss:0.689790 test-mlogloss:0.701088 
## [61] train-mlogloss:0.688892 test-mlogloss:0.700350 
## [62] train-mlogloss:0.687682 test-mlogloss:0.699411 
## [63] train-mlogloss:0.686874 test-mlogloss:0.698654 
## [64] train-mlogloss:0.685742 test-mlogloss:0.697759 
## [65] train-mlogloss:0.684818 test-mlogloss:0.697017 
## [66] train-mlogloss:0.684122 test-mlogloss:0.696440 
## [67] train-mlogloss:0.683406 test-mlogloss:0.695956 
## [68] train-mlogloss:0.682635 test-mlogloss:0.695304 
## [69] train-mlogloss:0.682066 test-mlogloss:0.694833 
## [70] train-mlogloss:0.681455 test-mlogloss:0.694377 
## [71] train-mlogloss:0.680778 test-mlogloss:0.693858 
## [72] train-mlogloss:0.680059 test-mlogloss:0.693308 
## [73] train-mlogloss:0.679598 test-mlogloss:0.692967 
## [74] train-mlogloss:0.678950 test-mlogloss:0.692490 
## [75] train-mlogloss:0.678399 test-mlogloss:0.692167 
## [76] train-mlogloss:0.677747 test-mlogloss:0.691659 
## [77] train-mlogloss:0.677302 test-mlogloss:0.691333 
## [78] train-mlogloss:0.676692 test-mlogloss:0.690925 
## [79] train-mlogloss:0.676210 test-mlogloss:0.690592 
## [80] train-mlogloss:0.675718 test-mlogloss:0.690382 
## [81] train-mlogloss:0.675108 test-mlogloss:0.689924 
## [82] train-mlogloss:0.674649 test-mlogloss:0.689643 
## [83] train-mlogloss:0.674157 test-mlogloss:0.689260 
## [84] train-mlogloss:0.673775 test-mlogloss:0.689082 
## [85] train-mlogloss:0.673314 test-mlogloss:0.688763 
## [86] train-mlogloss:0.672849 test-mlogloss:0.688530 
## [87] train-mlogloss:0.672424 test-mlogloss:0.688319 
## [88] train-mlogloss:0.672062 test-mlogloss:0.688140 
## [89] train-mlogloss:0.671680 test-mlogloss:0.687851 
## [90] train-mlogloss:0.671300 test-mlogloss:0.687681 
## [91] train-mlogloss:0.670949 test-mlogloss:0.687498 
## [92] train-mlogloss:0.670534 test-mlogloss:0.687209 
## [93] train-mlogloss:0.670121 test-mlogloss:0.687067 
## [94] train-mlogloss:0.669781 test-mlogloss:0.686878 
## [95] train-mlogloss:0.669396 test-mlogloss:0.686745 
## [96] train-mlogloss:0.668976 test-mlogloss:0.686521 
## [97] train-mlogloss:0.668652 test-mlogloss:0.686372 
## [98] train-mlogloss:0.668354 test-mlogloss:0.686222 
## [99] train-mlogloss:0.667944 test-mlogloss:0.686011 
## [100]    train-mlogloss:0.667646 test-mlogloss:0.685948 
## [101]    train-mlogloss:0.667396 test-mlogloss:0.685756 
## [102]    train-mlogloss:0.667197 test-mlogloss:0.685624 
## [103]    train-mlogloss:0.666935 test-mlogloss:0.685504 
## [104]    train-mlogloss:0.666698 test-mlogloss:0.685431 
## [105]    train-mlogloss:0.666467 test-mlogloss:0.685377 
## [106]    train-mlogloss:0.666232 test-mlogloss:0.685311 
## [107]    train-mlogloss:0.665849 test-mlogloss:0.685118 
## [108]    train-mlogloss:0.665661 test-mlogloss:0.685121 
## [109]    train-mlogloss:0.665259 test-mlogloss:0.685021 
## [110]    train-mlogloss:0.664988 test-mlogloss:0.684949 
## [111]    train-mlogloss:0.664768 test-mlogloss:0.684876 
## [112]    train-mlogloss:0.664507 test-mlogloss:0.684813 
## [113]    train-mlogloss:0.664226 test-mlogloss:0.684783 
## [114]    train-mlogloss:0.663883 test-mlogloss:0.684631 
## [115]    train-mlogloss:0.663654 test-mlogloss:0.684583 
## [116]    train-mlogloss:0.663383 test-mlogloss:0.684530 
## [117]    train-mlogloss:0.663202 test-mlogloss:0.684480 
## [118]    train-mlogloss:0.662960 test-mlogloss:0.684388 
## [119]    train-mlogloss:0.662726 test-mlogloss:0.684311 
## [120]    train-mlogloss:0.662457 test-mlogloss:0.684264 
## [121]    train-mlogloss:0.662274 test-mlogloss:0.684227 
## [122]    train-mlogloss:0.662107 test-mlogloss:0.684234 
## [123]    train-mlogloss:0.661903 test-mlogloss:0.684236 
## [124]    train-mlogloss:0.661698 test-mlogloss:0.684222 
## [125]    train-mlogloss:0.661482 test-mlogloss:0.684140 
## [126]    train-mlogloss:0.661202 test-mlogloss:0.684089 
## [127]    train-mlogloss:0.661040 test-mlogloss:0.684086 
## [128]    train-mlogloss:0.660854 test-mlogloss:0.684045 
## [129]    train-mlogloss:0.660688 test-mlogloss:0.684007 
## [130]    train-mlogloss:0.660544 test-mlogloss:0.684001 
## [131]    train-mlogloss:0.660294 test-mlogloss:0.683870 
## [132]    train-mlogloss:0.660142 test-mlogloss:0.683861 
## [133]    train-mlogloss:0.659940 test-mlogloss:0.683843 
## [134]    train-mlogloss:0.659745 test-mlogloss:0.683809 
## [135]    train-mlogloss:0.659630 test-mlogloss:0.683766 
## [136]    train-mlogloss:0.659443 test-mlogloss:0.683781 
## [137]    train-mlogloss:0.659212 test-mlogloss:0.683818 
## [138]    train-mlogloss:0.658990 test-mlogloss:0.683886 
## [139]    train-mlogloss:0.658857 test-mlogloss:0.683892 
## [140]    train-mlogloss:0.658612 test-mlogloss:0.683783 
## [141]    train-mlogloss:0.658442 test-mlogloss:0.683729 
## [142]    train-mlogloss:0.658244 test-mlogloss:0.683743 
## [143]    train-mlogloss:0.657882 test-mlogloss:0.683689 
## [144]    train-mlogloss:0.657640 test-mlogloss:0.683710 
## [145]    train-mlogloss:0.657472 test-mlogloss:0.683711 
## [146]    train-mlogloss:0.657216 test-mlogloss:0.683656 
## [147]    train-mlogloss:0.656947 test-mlogloss:0.683617 
## [148]    train-mlogloss:0.656782 test-mlogloss:0.683643 
## [149]    train-mlogloss:0.656606 test-mlogloss:0.683635 
## [150]    train-mlogloss:0.656444 test-mlogloss:0.683633
# Save the trained model
saveRDS(xgb_model, file = "xgboost_region_model.rds")
library(caret)

# Define trainControl globally
train_control <- trainControl(method = "cv", number = 5)

23 Create Confusion Matrix

# Make predictions on test data
predictions <- predict(xgb_model, dtest)

# Convert predictions to factors
predicted_labels <- factor(predictions, levels = 0:(length(unique(train_labels)) - 1))
actual_labels <- factor(test_labels, levels = 0:(length(unique(test_labels)) - 1))

# Create Confusion Matrix
conf_matrix <- confusionMatrix(predicted_labels, actual_labels)

# Print Confusion Matrix & Accuracy
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3
##          0 1735    0 1429 1022
##          1    0 3998  157  294
##          2  651    0  783  108
##          3  813    0 1231 3773
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6433          
##                  95% CI : (0.6358, 0.6507)
##     No Information Rate : 0.3249          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5168          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.5424   1.0000  0.21750   0.7260
## Specificity            0.8084   0.9624  0.93876   0.8107
## Pos Pred Value         0.4145   0.8986  0.50778   0.6486
## Neg Pred Value         0.8760   1.0000  0.80508   0.8601
## Prevalence             0.2000   0.2500  0.22508   0.3249
## Detection Rate         0.1085   0.2500  0.04896   0.2359
## Detection Prevalence   0.2617   0.2782  0.09641   0.3637
## Balanced Accuracy      0.6754   0.9812  0.57813   0.7683
library(caret)
library(e1071)

#Prepare Data for SVM

# Convert categorical variables to factors
customer_data_copy$region <- as.factor(customer_data_copy$region)
customer_data_copy$in.store <- as.factor(customer_data_copy$in.store)

# Train-Test Split (80% Train, 20% Test)
set.seed(123)
trainIndex <- createDataPartition(customer_data_copy$region, p = 0.8, list = FALSE)
train_data <- customer_data_copy[trainIndex, ]
test_data  <- customer_data_copy[-trainIndex, ]

#Train SVM Model

svm_model <- train(region ~ ., data = train_data, 
                   method = "svmLinear",
                   trControl = train_control,
                   preProcess = c("center", "scale"))

24 Make predictions on test data

# Make predictions on test data
svm_predictions <- predict(svm_model, test_data)

# Create Confusion Matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_data$region)

# Print Confusion Matrix & Accuracy
print(svm_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1 2412    0 2137 1250
##          2    0 3998  228  442
##          3    0    0   75  121
##          4  787    0 1160 3384
## 
## Overall Statistics
##                                           
##                Accuracy : 0.617           
##                  95% CI : (0.6095, 0.6246)
##     No Information Rate : 0.3249          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4849          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.7540   1.0000 0.020833   0.6511
## Specificity            0.7353   0.9441 0.990237   0.8197
## Pos Pred Value         0.4159   0.8565 0.382653   0.6348
## Neg Pred Value         0.9228   1.0000 0.776870   0.8300
## Prevalence             0.2000   0.2500 0.225084   0.3249
## Detection Rate         0.1508   0.2500 0.004689   0.2116
## Detection Prevalence   0.3626   0.2919 0.012255   0.3333
## Balanced Accuracy      0.7446   0.9721 0.505535   0.7354

25 Cinfusion Matrix XGBoost

# Load required libraries
library(ggplot2)
library(reshape2)
library(caret)

# Make predictions on the test data using XGBoost
xgb_predictions <- predict(xgb_model, dtest)

# Convert predictions to factors
predicted_labels <- factor(xgb_predictions, levels = levels(test_data$region))
actual_labels <- factor(test_data$region, levels = levels(test_data$region))

# Create confusion matrix
xgb_conf_matrix <- confusionMatrix(predicted_labels, actual_labels)

# Convert confusion matrix to a dataframe for visualization
conf_matrix_df <- as.data.frame(xgb_conf_matrix$table)

# Plot the confusion matrix as a heatmap
ggplot(conf_matrix_df, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +  # Add count labels
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Color gradient
  theme_minimal() +
  labs(title = "Confusion Matrix - XGBoost",
       x = "Predicted Label",
       y = "Actual Label",
       fill = "Count") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 12),
        axis.text.y = element_text(size = 12))

26 Create the enhanced feature importance plot

# Load necessary libraries
library(xgboost)
library(ggplot2)

# Extract feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_model)

# Convert to dataframe for ggplot2
importance_df <- as.data.frame(importance_matrix)

# Create the enhanced feature importance plot
ggplot(importance_df, aes(x = reorder(Feature, Gain), y = Gain, fill = Gain)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Color gradient
  coord_flip() +  # Flip for better readability
  theme_minimal() +
  labs(title = "Feature Importance - XGBoost Model",
       x = "Features",
       y = "Importance Score") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12))

27 Compare Model Accuracies

# Compare Model Accuracies
accuracy_results <- data.frame(
  Model = c("Random Forest", "XGBoost", "SVM"),
  Accuracy = c(63.26, 64.33, 61.7)
)

ggplot(accuracy_results, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Model Accuracy Comparison", y = "Accuracy (%)", x = "Model")

Final Summary: Customer Region Prediction Using Machine Learning This project focused on building a machine learning model to accurately predict customer regions based on age, shopping behavior, and spending patterns. Multiple models were tested, and XGBoost emerged as the best model with 64.33% accuracy.

🚀 Key Findings & Model Performance Comparison Model Accuracy (%) Strengths Weaknesses Random Forest 63.26% Handles complex data well Slight overfitting XGBoost 64.33% Best accuracy, efficient Requires tuning SVM (Radial Kernel) 61.7% Strong generalization Slow training ✅ Final Decision: XGBoost is the best-performing model for customer region prediction.

📊 Key Visualizations for Submission ✅ Confusion Matrix (XGBoost) – Shows classification performance. ✅ Feature Importance Chart – Highlights which factors influence predictions. ✅ Model Accuracy Comparison – Demonstrates why XGBoost was chosen. ✅ Future Improvements Chart – Lays out next steps for improving model performance.

📌 Future Improvements 🚀 Optimizing XGBoost for better performance 🚀 Adding more predictive features (e.g., purchase frequency) 🚀 Balancing training data if regions are imbalanced

✅ Deployment Options: ✔ Save & use xgboost_region_model.rds for future predictions ✔ Integrate model predictions into a business dashboard

#Enhancing XGBoost for Customer Region Prediction.

# Load required library
library(ggplot2)

# Data for visualization
future_steps <- data.frame(
  Step = c("Hyperparameter Tuning", "Additional Features", "Balance Training Data", 
           "Save & Use Model", "Integrate into Dashboard"),
  Priority = c(90, 80, 70, 100, 95)  # Arbitrary scores for emphasis
)

# Create a horizontal bar chart
ggplot(future_steps, aes(x = Priority, y = reorder(Step, Priority), fill = Priority)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Color gradient
  theme_minimal() +
  labs(title = "Enhancing XGBoost for Customer Region Prediction",
       x = "Priority Level (%)",
       y = "Future Improvements & Deployment Steps") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  geom_text(aes(label = paste(Priority, "%")), hjust = -0.2, size = 5, fontface = "bold") +
  xlim(0, 110)  # Set limits for better spacing

This Visualization Highlights: ✅ Hyperparameter tuning (90%) – Important for improving model accuracy. ✅ Additional features (80%) – Adding more variables like purchase frequency can enhance predictions. ✅ Balancing training data (70%) – Ensuring fair class distribution can prevent bias. ✅ Saving & using model (100%) – The trained model (xgboost_region_model.rds) is ready for future predictions. ✅ Integrating into a dashboard (95%) – Deploying predictions for business insights is crucial.

28 Customer Region Prediction Report

28.1 1. Introduction

This project aims to build a machine learning model to predict customer regions based on their shopping behaviors, age, and spending patterns. The goal is to identify the best model for accurately classifying customers into different regions.

28.2 2. Data Preprocessing

  • The dataset contains age, number of items purchased, amount spent, and shopping mode (in-store or online).
  • Missing values were handled, and categorical variables were converted into factors.
  • The data was split into 80% training and 20% testing sets.

28.3 3. Model Selection & Performance Comparison

Model | Accuracy (%) | Strengths | Weaknesses
Random Forest | 63.26% | Handles complex data well | Slight overfitting |
XGBoost | 64.33% | Best accuracy, efficient | Requires tuning |
SVM (Radial Kernel) | 61.7% | Strong generalization | Slow training |

Final Decision: XGBoost was selected as the best model (64.33%).

28.4 4. Key Visualizations

1️⃣ Confusion Matrix (XGBoost) – Shows classification performance. 2️⃣ Feature Importance Chart – Highlights the most influential features. 3️⃣ Model Accuracy Comparison – Justifies the model selection. 4️⃣ Spending Behavior by Region – Provides insights into customer spending patterns.

28.5 5. Spending Behavior by Region

  • West & East regions have the highest spending, while South has the lowest.
  • Businesses can adjust marketing strategies based on these insights.

28.6 6. Future Improvements & Deployment Steps

🚀 Enhancements for XGBoost:Hyperparameter tuning to optimize performance. ✅ Adding more predictive features (e.g., purchase frequency). ✅ Balancing training data if regions are imbalanced.

📌 Deployment Options:Save & use xgboost_region_model.rds for future predictions.Integrate model predictions into a business dashboard.

28.7 7. Final Recommendations

✅ XGBoost is recommended for customer region prediction due to its superior accuracy. ✅ Additional data features and optimization can further enhance model performance. ✅ Deploying the model in a business environment can aid in targeted marketing and customer engagement strategies.

28.8 8. Conclusion

This study demonstrates that machine learning can effectively predict customer regions based on spending behaviors. With further enhancements, this model can provide valuable insights for business decision-making.

#DashBoard

# Install & Load Required Libraries
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(ggplot2)
library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:formattable':
## 
##     area
# Ensure categorical variables are factors
customer_data$region <- as.factor(customer_data$region)
customer_data$in.store <- as.factor(customer_data$in.store)

# Increase text size for better readability
theme_custom <- theme_minimal(base_size = 14) +  
  theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        axis.text = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14))

# Create Individual Plots
plot1 <- ggplot(customer_data, aes(x = items, y = amount)) +
  geom_point(color = "dodgerblue", size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) +
  theme_custom +
  labs(title = "📊 Items Purchased vs. Amount Spent",
       x = "Number of Items Purchased",
       y = "Amount Spent ($)")

plot2 <- ggplot(customer_data, aes(x = region, y = amount, fill = region)) +
  geom_bar(stat = "identity", width = 0.6) +  
  scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF", "#F1C40F")) +
  theme_custom +
  labs(title = "💰 Spending Behavior by Region",
       x = "Region",
       y = "Total Amount Spent")

plot3 <- ggplot(customer_data, aes(x = in.store, fill = region)) +
  geom_bar(position = "fill", width = 0.6) +
  scale_fill_manual(values = c("#FF5733", "#33FF57", "#3357FF", "#F1C40F")) +
  theme_custom +
  labs(title = "🛒 Shopping Mode Preference by Region",
       x = "Shopping Mode (Online vs. In-Store)",
       y = "Proportion")

plot4 <- ggplot(customer_data, aes(x = age, y = amount)) +
  geom_point(color = "purple", alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 1.5) +
  theme_custom +
  labs(title = "👥 Age vs. Spending",
       x = "Customer Age",
       y = "Amount Spent ($)")

# Arrange Plots Using Patchwork
final_plot <- (plot1 | plot2) / (plot3 | plot4) +  
  plot_annotation(title = "🛍️ Customer Insights Dashboard - Evovanta AI",
                  subtitle = "Understanding Spending & Shopping Behaviors Using Machine Learning",
                  theme = theme(plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
                                plot.subtitle = element_text(size = 16, hjust = 0.5)))

# Save the dashboard with better resolution
ggsave("EvovantaAI_Dashboard.png", plot = final_plot, width = 16, height = 10, dpi = 300)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Confirm the file was saved
if (file.exists("EvovantaAI_Dashboard.png")) {
  print("✅ Dashboard successfully saved as EvovantaAI_Dashboard.png")
} else {
  print("❌ Error: Dashboard not saved.")
}
## [1] "✅ Dashboard successfully saved as EvovantaAI_Dashboard.png"
knitr::include_graphics("EvovantaAI_Dashboard.png")