Picture yourself managing a shopping mall or running a marketing campaign. You have hundreds or thousands of customers — each with different incomes, ages, preferences, and habits. Some buy regularly, others occasionally, and many don’t buy at all.
How can we make sense of this complexity?
How do we group similar customers together and predict who might respond to our next campaign?
This is where machine learning becomes our compass.
In this workshop, we’ll take a hands-on, business-oriented approach to two fundamental machine learning methods:
| Technique | Purpose | Type | Analogy |
|---|---|---|---|
| K-Means Clustering | Customer Segmentation | Unsupervised | Grouping shoppers into “tribes” based on behavior |
| Random Forests | Response Prediction | Supervised | Asking hundreds of “mini decision trees” to vote on who will respond |
Our tools: the R programming language and real marketing datasets.
Imagine you’re observing shoppers from a bird’s-eye view in a mall. Some hang around luxury stores, others spend time near discount outlets. Without knowing their names or preferences, you can still see groups forming naturally.
That’s what K-Means does — it automatically identifies clusters of similar customers based on numeric characteristics (e.g., age, income, spending).
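To see what “identifying clusters” means mechanically, here is a minimal sketch of the two steps K-Means repeats: assign every point to its nearest center, then move each center to the mean of its points. The data are synthetic, and the names `x`, `centers`, and `cluster` are illustrative assumptions rather than part of the workshop dataset. In practice we’d call R’s built-in `kmeans()` instead of writing this loop.

```r
set.seed(1)

# Hypothetical, already-scaled data: two loose groups of 20 shoppers each
x <- rbind(
  matrix(rnorm(40, mean = -2), ncol = 2),
  matrix(rnorm(40, mean = 2), ncol = 2)
)

k <- 2
# Start from two observed points (the first and the last row)
centers <- x[c(1, nrow(x)), , drop = FALSE]

for (iter in 1:10) {
  # Assignment step: squared distance from every point to every center
  d2 <- sapply(seq_len(k), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  cluster <- max.col(-d2, ties.method = "first")  # index of the nearest center

  # Update step: move each center to the mean of its assigned points
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))

  # Stop once the centers no longer move
  if (max(abs(new_centers - centers)) < 1e-8) break
  centers <- new_centers
}

print(table(cluster))
print(round(centers, 2))
```

The loop stops as soon as an update leaves the centers unchanged, which is the point at which the segments are stable.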
We’ll use the classic Mall Customers Dataset, which contains demographic and behavioral information about mall visitors.
Each record represents a customer with:
- Gender
- Age
- Annual Income
- Spending Score (based on spending behavior)
library(tidyverse)
library(factoextra)
library(cluster)
mall <- read.csv("/Users/aakashshah/Documents/mall_customers.csv")
glimpse(mall)
## Rows: 10
## Columns: 5
## $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ Gender <chr> "Male", "Male", "Female", "Female", "Female", "Female",…
## $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30
## $ Annual_Income <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19
## $ Spending_Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72
🧠 Reflection: What kinds of patterns do you expect — younger high-spenders vs. older savers?
We’ll select only numeric variables for clustering and scale them. Scaling ensures that variables measured in different units (e.g., income vs. age) contribute equally.
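# Drop rows with missing values, then keep only the numeric clustering variables
mall <- na.omit(mall)
mall_data <- mall %>% select(Age, Annual_Income, Spending_Score)
# Standardize each column to mean 0 and standard deviation 1
mall_scaled <- scale(mall_data)
🧠 Think about: Why might an unscaled variable like “Annual Income” dominate clustering results?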
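One way to answer that question is to compare distances before and after scaling. The sketch below uses four hypothetical customers (A to D) with income expressed in dollars rather than thousands, an assumption made only to exaggerate the effect. It checks which customer is “nearest” to customer A in each case.

```r
# Hypothetical customers; income is in dollars here (an assumption for illustration)
customers <- data.frame(
  Age = c(25, 60, 27, 58),
  Annual_Income = c(30000, 30200, 31000, 45000)
)
rownames(customers) <- c("A", "B", "C", "D")

# Euclidean distances on the raw values and on the standardized values
raw_dist <- as.matrix(dist(customers))
scaled_dist <- as.matrix(dist(scale(customers)))

# Helper (illustrative): name of the customer closest to A
nearest_to_A <- function(d) {
  others <- d["A", colnames(d) != "A"]
  names(which.min(others))
}

print(nearest_to_A(raw_dist))     # income differences swamp the 35-year age gap
print(nearest_to_A(scaled_dist))  # age and income now count equally
```

On the raw values, A looks closest to B, who is 35 years older but earns almost the same. After scaling, A is closest to C, who is similar on both variables.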
K-Means requires us to specify how many clusters we want. To choose a reasonable number, we’ll use the Elbow Method, which plots how the total within-cluster sum of squares (WSS) falls as K increases.
max_k <- min(10, nrow(mall_scaled) - 1)
fviz_nbclust(mall_scaled, kmeans, method = "wss", k.max = max_k) +
  labs(title = "Elbow Method to Determine Optimal K")
🧩 Interpretation: Look for a “bend” or “elbow”: the point where adding more clusters no longer reduces the within-cluster variation by much. That is a reasonable choice of K, although the elbow is a visual heuristic rather than a strict rule.
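The numbers behind an elbow plot can also be computed directly. This is a minimal sketch on synthetic data with three well-separated groups (the data and the names `toy_scaled`, `wss`, and `wss_drop` are assumptions for illustration). It relies only on base R’s `kmeans()` and its `tot.withinss` component.

```r
set.seed(123)

# Hypothetical data: three well-separated groups of 30 points in two dimensions
group_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, group_centers[i, 1]), rnorm(30, group_centers[i, 2]))
}))
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for K = 1 to 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# How much WSS each additional cluster removes; the elbow is where this flattens
wss_drop <- -diff(wss)

print(round(wss, 1))
print(round(wss_drop, 1))
```

Because the toy data were built with three groups, most of the reduction comes from the first two increases in K, and the gains shrink after K = 3.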
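set.seed(123)  # K-Means starts from random centers, so fix the seed for reproducibility
# nstart = 25 runs the algorithm from 25 random starts and keeps the best solution
mall_km <- kmeans(mall_scaled, centers = 5, nstart = 25)
mall$Cluster <- as.factor(mall_km$cluster)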
fviz_cluster(mall_km, data = mall_scaled,
             geom = "point",
             ellipse.type = "norm",
             main = "Customer Segments Identified via K-Means")
🧠 Interpretation: Each color represents a segment. Notice how distinct the clusters are: these are customer groups that behave similarly.
Let’s summarize each segment’s average characteristics.
aggregate(mall[, c("Age", "Annual_Income", "Spending_Score")],
          by = list(Cluster = mall$Cluster), mean)
##   Cluster  Age Annual_Income Spending_Score
## 1       1 64.0          19.0            3.0
## 2       2 22.0          16.0           78.0
## 3       3 33.0          17.5           23.0
## 4       4 26.5          18.5           83.0
## 5       5 19.5          15.5           22.5
📊 Interpretation:
- A segment with high income and high spending → premium buyers.
- A group with low income and low spending → budget-conscious.
- Middle-aged moderate spenders → steady, loyal customers.
💬 Marketing Application: Use this to tailor promotions, e.g., offer premium cards to high spenders, discounts to low spenders, and loyalty rewards to mid-segment customers.
Think of a Random Forest as a committee of decision trees. Each tree is trained on slightly different data, and when asked to predict, the trees vote. The majority wins.
Averaging many trees mainly reduces variance (the instability of any single tree), giving us stronger, more stable predictions even in noisy marketing data.
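The committee idea can be illustrated with a small simulation. In this sketch each “tree” is just a hypothetical voter that is right 60% of the time, and the voters are assumed to be independent. That is a simplifying assumption: real trees are correlated, which is why a random forest trains each one on a different bootstrap sample and a random subset of variables. No actual trees are fitted here.

```r
set.seed(123)

n_customers <- 2000
n_trees <- 101      # odd, so a vote can never tie
p_correct <- 0.6    # hypothetical accuracy of each individual "tree"

# True outcome for each customer (1 = responded, 0 = did not)
truth <- rbinom(n_customers, 1, 0.5)

# One row per customer, one column per tree: 1 where that tree is right
correct <- matrix(rbinom(n_customers * n_trees, 1, p_correct),
                  nrow = n_customers)

# A tree that is right predicts the truth; otherwise it predicts the opposite
votes <- ifelse(correct == 1, truth, 1 - truth)

# Majority vote across the trees for each customer
forest_pred <- as.integer(rowMeans(votes) > 0.5)

single_acc <- mean(votes[, 1] == truth)
forest_acc <- mean(forest_pred == truth)

print(c(single_tree = single_acc, majority_vote = forest_acc))
```

Under these assumptions the majority vote is right far more often than any single voter, which is the intuition behind letting the trees vote.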
The Customer Personality Dataset includes demographics, lifestyle, spending behavior, and campaign responses.
Goal: Predict whether a customer responded (1) or did not respond (0) to a marketing campaign.
library(randomForest)
library(caret)
cust <- read.csv("/Users/aakashshah/Documents/customer_personality.csv")
glimpse(cust)
## Rows: 5
## Columns: 17
## $ ID <int> 1, 2, 3, 4, 5
## $ Year_Birth <int> 1957, 1954, 1965, 1984, 1981
## $ Education <chr> "Graduation", "Graduation", "PhD", "Graduation", "…
## $ Marital_Status <chr> "Single", "Single", "Together", "Single", "Married"
## $ Income <int> 58138, 46344, 71613, 26646, 58293
## $ Recency <int> 58, 38, 26, 26, 94
## $ MntWines <int> 635, 11, 426, 11, 173
## $ MntFruits <int> 88, 1, 49, 4, 43
## $ MntMeatProducts <int> 546, 6, 127, 20, 118
## $ MntFishProducts <int> 172, 2, 111, 10, 46
## $ MntSweetProducts <int> 88, 1, 21, 3, 27
## $ MntGoldProds <int> 88, 6, 42, 5, 15
## $ NumDealsPurchases <int> 3, 2, 1, 2, 5
## $ NumWebPurchases <int> 8, 1, 8, 2, 5
## $ NumCatalogPurchases <int> 10, 1, 2, 0, 3
## $ NumStorePurchases <int> 4, 2, 10, 4, 6
## $ Response <int> 1, 0, 1, 0, 1
🧠 Reflection: Which variables might influence marketing response — income, online purchases, or product mix?
We’ll clean the dataset and convert categorical variables to factors.
cust <- na.omit(cust)
cust <- cust %>% select(Income, Year_Birth, Education, Marital_Status,
                        MntWines, MntMeatProducts, NumWebPurchases,
                        NumStorePurchases, Response)
cust$Education <- as.factor(cust$Education)
cust$Marital_Status <- as.factor(cust$Marital_Status)
cust$Response <- as.factor(cust$Response)
set.seed(123)
trainIndex <- createDataPartition(cust$Response, p = 0.8, list = FALSE)
train <- cust[trainIndex, ]
test <- cust[-trainIndex, ]
💬 Note: The Response variable is our target, that is, what we’re trying to predict.
rf_model <- randomForest(Response ~ ., data = train, importance = TRUE, ntree = 500)
print(rf_model)
##
## Call:
## randomForest(formula = Response ~ ., data = train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## 0 1 class.error
## 0 2 0 0
## 1 0 3 0
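🧠 Interpretation: Each tree makes its own prediction, and the forest aggregates them by majority vote. Adding more trees makes the vote more stable, though the benefit levels off once the forest is large enough. With only five rows in this sample, the 0% OOB error above is not a meaningful performance estimate.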
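We’ll use the confusion matrix to assess performance: it tells us how many customers were correctly or incorrectly predicted as responders. The output below comes from the five-row sample shown earlier. All five rows ended up in the training set, so the test set is empty and every statistic is NaN or NA. Rerun on the full dataset to get meaningful values.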
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
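🧩 Interpretation:
- Accuracy: How often the model is right overall.
- Sensitivity: The share of actual responders the model identifies.
- Specificity: The share of actual non-responders the model correctly leaves out.
Note that caret treats the first factor level as the positive class by default (the output above reports 'Positive' Class : 0), so these definitions describe responders only if you pass positive = "1" to confusionMatrix().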
💬 Marketing Insight: If sensitivity is high, your campaign targeting is strong — you’re catching most of the real responders.
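To make these three measures concrete, here is a small worked example in base R. The ten `actual` and `predicted` labels are hypothetical values invented for illustration, with 1 (responded) treated as the positive class.

```r
# Hypothetical outcomes for 10 customers (1 = responded, 0 = did not respond)
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))

# Rows are predictions, columns are the true outcomes (same layout as caret)
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

tp <- cm["1", "1"]  # responders correctly flagged: 3
fn <- cm["0", "1"]  # responders the model missed: 1
tn <- cm["0", "0"]  # non-responders correctly left out: 4
fp <- cm["1", "0"]  # non-responders wrongly flagged: 2

accuracy    <- (tp + tn) / sum(cm)  # 7 / 10 = 0.70
sensitivity <- tp / (tp + fn)       # 3 / 4  = 0.75
specificity <- tn / (tn + fp)       # 4 / 6, about 0.67

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 2))
```

With caret loaded, `confusionMatrix(predicted, actual, positive = "1")` should report the same three statistics for these labels.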
Which variables most influence the prediction? Let’s find out.
📊 Interpretation:
- MntWines: Wine buyers respond more; they may be high-value customers.
- NumWebPurchases: Online activity signals digital engagement.
- Income: Spending power affects responsiveness.
💬 Marketing Implication: Focus digital campaigns on online buyers and high wine spenders — they’re more likely to respond positively.
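The code that produced the importance ranking is not shown in this handout. As a hedged sketch of how such a ranking can be obtained, the example below fits a forest on synthetic data in which only one predictor (`signal`) actually drives the response, then reads the scores with `importance()` and `varImpPlot()` from the randomForest package. The data frame `toy` and its columns are assumptions for illustration.

```r
library(randomForest)

set.seed(123)
n <- 300

# Hypothetical data: `signal` drives the response, `noise` is unrelated to it
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$Response <- factor(ifelse(toy$signal + rnorm(n, sd = 0.5) > 0, 1, 0))

toy_rf <- randomForest(Response ~ ., data = toy,
                       importance = TRUE, ntree = 200)

# One row per predictor; larger values mean the forest relies on it more
imp <- importance(toy_rf)
print(imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")])

# Dot chart of the same scores
varImpPlot(toy_rf, main = "Variable importance (toy data)")
```

With the workshop’s model, the equivalent calls would be `importance(rf_model)` and `varImpPlot(rf_model)`.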
| Insight | Business Interpretation | Recommended Action |
|---|---|---|
| Segment 1: High-income, high spenders | Premium customers | Loyalty & luxury programs |
| Segment 2: High-income, low spenders | Untapped potential | Targeted upselling |
| Wine spending predicts response | Affluent, lifestyle-driven | Personalized wine promotions |
| Online purchase frequency | Digitally engaged | Leverage email & social media |
🧠 Reflection Prompt: Which of these findings could be most valuable to your organization’s campaign strategy?
Add Gender to K-Means: does it affect clustering patterns?
© 2025 La Trobe University | MKT5006 Workshop 12
Prepared by Aakash Mayur Shah