1 🧭 Introduction: Data-Driven Marketing Decisions
2 💡 Part 1: Understanding Customer Segmentation (K-Means Clustering)
3 🌲 Part 2: Predicting Campaign Response (Random Forests)
4 📈 Integrating Insights
5 🧠 Recap
6 🧩 Practice Exercises
7 📘 References

1 🧭 Introduction: Data-Driven Marketing Decisions

Picture yourself managing a shopping mall or running a marketing campaign. You have hundreds or thousands of customers — each with different incomes, ages, preferences, and habits. Some buy regularly, others occasionally, and many don’t buy at all.

How can we make sense of this complexity?
How do we group similar customers together and predict who might respond to our next campaign?

This is where machine learning becomes our compass.

In this workshop, we’ll take a hands-on, business-oriented approach to two fundamental machine learning methods:

Technique	Purpose	Type	Analogy
K-Means Clustering	Customer Segmentation	Unsupervised	Grouping shoppers into “tribes” based on behavior
Random Forests	Response Prediction	Supervised	Asking hundreds of “mini decision trees” to vote on who will respond

Our tools: the R programming language and real marketing datasets.

2 💡 Part 1: Understanding Customer Segmentation (K-Means Clustering)

2.1 The Intuition Behind Clustering

Imagine you’re observing shoppers from a bird’s-eye view in a mall. Some hang around luxury stores, others spend time near discount outlets. Without knowing their names or preferences, you can still see groups forming naturally.

That’s what K-Means does — it automatically identifies clusters of similar customers based on numeric characteristics (e.g., age, income, spending).
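To make the mechanics concrete, here is a minimal base-R sketch of the two steps K-Means repeats: assign each point to its nearest centroid, then move each centroid to the mean of its points. The six toy shoppers and the fixed starting centroids are assumptions for illustration; `kmeans()` picks random starts and handles edge cases this sketch ignores.

```r
# Toy data: two numeric features for six hypothetical shoppers.
x <- rbind(c(1, 1), c(1.5, 2), c(1, 1.5),
           c(8, 8), c(9, 8.5), c(8.5, 9))

k <- 2
centers <- x[c(1, 4), , drop = FALSE]  # assumed starting centroids (normally random)

for (iter in 1:10) {
  # Assignment step: squared distance from every point to every centroid.
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  assignment <- max.col(-d, ties.method = "first")  # index of the nearest centroid

  # Update step: move each centroid to the mean of its assigned points.
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[assignment == j, , drop = FALSE])))

  if (all(abs(new_centers - centers) < 1e-9)) break  # centroids stopped moving
  centers <- new_centers
}

assignment  # cluster label for each shopper
centers     # final centroid coordinates
```

The loop stops as soon as the centroids stop moving, which is the convergence idea behind the real algorithm.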

2.2 Dataset: Mall Customers

We’ll use the classic Mall Customers Dataset, which contains demographic and behavioral information about mall visitors.

Each record represents a customer with:
- Gender
- Age
- Annual Income
- Spending Score (based on spending behavior)

library(tidyverse)
library(factoextra)
library(cluster)

mall <- read.csv("/Users/aakashshah/Documents/mall_customers.csv")
glimpse(mall)

## Rows: 10
## Columns: 5
## $ CustomerID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ Gender         <chr> "Male", "Male", "Female", "Female", "Female", "Female",…
## $ Age            <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30
## $ Annual_Income  <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19
## $ Spending_Score <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72

🧠 Reflection: What kinds of patterns do you expect — younger high-spenders vs. older savers?

2.3 Data Preparation

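We’ll select only numeric variables for clustering and scale them. Scaling standardizes each variable to mean 0 and standard deviation 1, so variables measured in different units (e.g., income vs. age) contribute equally to the distance calculations that K-Means relies on.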

mall <- na.omit(mall)
mall_data <- mall %>% select(Age, Annual_Income, Spending_Score)
mall_scaled <- scale(mall_data)

🧠 Think about: Why might an unscaled variable like “Annual Income” dominate clustering results?
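One way to explore this question is to compare distances before and after scaling. The sketch below uses three made-up customers (the values and the dollar units are assumptions, not rows from the mall data) and checks which customer is "nearest" to the first one in each case.

```r
# Customer 2 is 35 years older than customer 1 but earns almost the same;
# customer 3 is the same age as customer 1 with a somewhat higher income.
toy <- data.frame(Age = c(25, 60, 25),
                  Income = c(50000, 50500, 50800))  # hypothetical raw dollars

d_raw <- as.matrix(dist(toy))         # Euclidean distances on raw units
d_std <- as.matrix(dist(scale(toy)))  # same, after standardizing each column

d_raw[1, c(2, 3)]  # raw: income differences swamp the age gap
d_std[1, c(2, 3)]  # scaled: age carries comparable weight
```

On the raw data, customer 1 looks closer to the much older customer 2 simply because income is measured in large numbers; after scaling, the same-age customer 3 becomes the nearer neighbor.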

2.4 Choosing the Number of Clusters (K)

K-Means requires us to specify how many clusters we want. To choose a sensible number, we’ll use the Elbow Method, which plots the total within-cluster sum of squares (WSS) against K and shows how quickly it falls as K increases.

max_k <- min(10, nrow(mall_scaled) - 1)
fviz_nbclust(mall_scaled, kmeans, method = "wss", k.max = max_k) +
  labs(title = "Elbow Method to Determine Optimal K")

🧩 Interpretation: Look for a “bend” or “elbow”: the point beyond which adding more clusters barely reduces the within-cluster sum of squares. That point is a reasonable choice for K.
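The quantity behind the elbow plot can also be computed by hand, which helps when you want the numbers rather than the picture. This is a rough sketch on simulated data with three groups (a stand-in for `mall_scaled`, not the workshop data); it reads `tot.withinss` from each `kmeans()` fit.

```r
set.seed(123)
# Hypothetical stand-in data: three well-separated groups in two features.
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
))

ks <- 1:6
# Total within-cluster sum of squares for each candidate K.
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

round(wss, 1)
plot(ks, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

With three simulated groups, most of the drop in WSS happens by K = 3 and the curve flattens afterwards, which is the elbow shape to look for.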

2.5 Running K-Means Clustering

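set.seed(123)  # make the random starting centroids reproducible
# Fit 5 clusters; nstart = 25 tries 25 random starts and keeps the best fit
mall_km <- kmeans(mall_scaled, centers = 5, nstart = 25)

# Attach each customer's cluster label to the original data
mall$Cluster <- as.factor(mall_km$cluster)
fviz_cluster(mall_km, data = mall_scaled,
             geom = "point",
             ellipse.type = "norm",
             main = "Customer Segments Identified via K-Means")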

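🧠 Interpretation: Each color represents a segment. Because there are three variables, the plot shows the clusters projected onto the first two principal components. Well-separated groups of points indicate customers who behave similarly within a segment and differently across segments.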
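Visual separation can be backed up with a number taken from the fitted object. The sketch below, on simulated stand-in data rather than the mall customers, computes the share of total variation explained by the clustering (`betweenss / totss`) and prints the cluster sizes.

```r
set.seed(123)
# Hypothetical stand-in data: two tight groups of customers in two scaled features.
toy <- scale(rbind(matrix(rnorm(30, mean = 0), ncol = 2),
                   matrix(rnorm(30, mean = 5), ncol = 2)))

km <- kmeans(toy, centers = 2, nstart = 25)

# Share of total variation explained by the cluster assignment (between 0 and 1).
separation <- km$betweenss / km$totss
separation

# Cluster sizes: very small clusters deserve a second look before acting on them.
km$size
```

A value near 1 suggests compact, well-separated segments; values closer to 0 suggest the segments overlap heavily. This ratio always rises as K grows, so it is best used to compare solutions with the same K.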

2.6 Profiling the Segments

Let’s summarize each segment’s average characteristics.

aggregate(mall[, c("Age", "Annual_Income", "Spending_Score")],
          by = list(Cluster = mall$Cluster), mean)

##   Cluster  Age Annual_Income Spending_Score
## 1       1 64.0          19.0            3.0
## 2       2 22.0          16.0           78.0
## 3       3 33.0          17.5           23.0
## 4       4 26.5          18.5           83.0
## 5       5 19.5          15.5           22.5

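📊 Interpretation: The table above is based on only 10 customers whose incomes barely differ (15 to 19), so these five segments differ mainly in age and spending score. With more data, look for patterns such as:
- A segment with high income and high spending → premium buyers.
- A group with low income and low spending → budget-conscious.
- Middle-aged moderate spenders → steady, loyal customers.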

💬 Marketing Application:
Use this to tailor promotions — e.g., offer premium cards to high spenders, discounts to low-spenders, and loyalty rewards to mid-segment customers.
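The same segment profile can be recovered from the fitted K-Means object itself. Its `centers` are expressed in scaled units, so the sketch below converts them back to the original units using the attributes that `scale()` stores. The six toy rows are invented for illustration; only the column names follow the workshop data.

```r
# Hypothetical stand-in for mall_data.
toy <- data.frame(Age = c(20, 22, 24, 60, 62, 64),
                  Annual_Income = c(15, 16, 17, 70, 72, 74),
                  Spending_Score = c(80, 85, 90, 10, 15, 20))
toy_scaled <- scale(toy)

set.seed(123)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# km$centers is in z-score units; undo the scaling column by column.
mu    <- attr(toy_scaled, "scaled:center")
sigma <- attr(toy_scaled, "scaled:scale")
centers_original <- sweep(sweep(km$centers, 2, sigma, "*"), 2, mu, "+")
centers_original
```

Each row of `centers_original` is the average customer of one segment in original units, matching what `aggregate()` would report for the same clustering.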

3 🌲 Part 2: Predicting Campaign Response (Random Forests)

3.1 The Story Behind Random Forests

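Think of a Random Forest as a committee of decision trees. Each tree is trained on a bootstrap sample of the rows and considers only a random subset of the variables at each split. When asked to predict, the trees vote, and the majority wins.

Averaging many such trees mainly reduces variance (the instability of a single deep tree) without adding much bias, which gives stable predictions even in noisy marketing data.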
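A small calculation shows why voting helps. It rests on a strong simplifying assumption: 101 voters that make independent errors, each correct 60% of the time. Trees in a real forest are correlated, so the actual gain is smaller than this idealized figure.

```r
# Assumption: trees err independently, each with 60% accuracy (idealized).
n_trees  <- 101
p_single <- 0.60

# The majority is right when at least 51 of the 101 votes are right.
p_majority <- 1 - pbinom(floor(n_trees / 2), size = n_trees, prob = p_single)

c(single_tree = p_single, majority_vote = p_majority)
```

Under these assumptions the committee is right far more often than any single member, which is the intuition behind growing many decorrelated trees instead of one.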

3.2 Dataset: Customer Personality Analysis

The Customer Personality Dataset includes demographics, lifestyle, spending behavior, and campaign responses.

Goal: Predict whether a customer responded (1) or did not respond (0) to a marketing campaign.

library(randomForest)
library(caret)

cust <- read.csv("/Users/aakashshah/Documents/customer_personality.csv")
glimpse(cust)

## Rows: 5
## Columns: 17
## $ ID                  <int> 1, 2, 3, 4, 5
## $ Year_Birth          <int> 1957, 1954, 1965, 1984, 1981
## $ Education           <chr> "Graduation", "Graduation", "PhD", "Graduation", "…
## $ Marital_Status      <chr> "Single", "Single", "Together", "Single", "Married"
## $ Income              <int> 58138, 46344, 71613, 26646, 58293
## $ Recency             <int> 58, 38, 26, 26, 94
## $ MntWines            <int> 635, 11, 426, 11, 173
## $ MntFruits           <int> 88, 1, 49, 4, 43
## $ MntMeatProducts     <int> 546, 6, 127, 20, 118
## $ MntFishProducts     <int> 172, 2, 111, 10, 46
## $ MntSweetProducts    <int> 88, 1, 21, 3, 27
## $ MntGoldProds        <int> 88, 6, 42, 5, 15
## $ NumDealsPurchases   <int> 3, 2, 1, 2, 5
## $ NumWebPurchases     <int> 8, 1, 8, 2, 5
## $ NumCatalogPurchases <int> 10, 1, 2, 0, 3
## $ NumStorePurchases   <int> 4, 2, 10, 4, 6
## $ Response            <int> 1, 0, 1, 0, 1

🧠 Reflection: Which variables might influence marketing response — income, online purchases, or product mix?

3.3 Preparing the Data

We’ll clean the dataset, convert categorical variables to factors, and hold out 20% of the rows as a test set.

# Drop rows with missing values
cust <- na.omit(cust)

# Keep a small set of predictors plus the target
cust <- cust %>% select(Income, Year_Birth, Education, Marital_Status,
                        MntWines, MntMeatProducts, NumWebPurchases,
                        NumStorePurchases, Response)

# Treat categorical columns and the target as factors
cust$Education <- as.factor(cust$Education)
cust$Marital_Status <- as.factor(cust$Marital_Status)
cust$Response <- as.factor(cust$Response)

# Stratified 80/20 train/test split on the target
set.seed(123)
trainIndex <- createDataPartition(cust$Response, p = 0.8, list = FALSE)
train <- cust[trainIndex, ]
test <- cust[-trainIndex, ]

💬 Note: The Response variable is our target — what we’re trying to predict.
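`createDataPartition()` samples within each class so that train and test keep similar class proportions. The base-R sketch below imitates that idea on made-up data (100 customers, 15% responders, both assumptions) and adds two sanity checks worth running before modelling. It is an illustration of stratified splitting, not caret's exact procedure.

```r
set.seed(123)
# Hypothetical data: 100 customers, 15 of them responders.
toy <- data.frame(Income = round(rnorm(100, 50000, 15000)),
                  Response = factor(rep(c(0, 1), times = c(85, 15))))

# Sample 80% of the row indices within each class, then combine.
idx_by_class <- split(seq_len(nrow(toy)), toy$Response)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Sanity checks before modelling.
nrow(test_toy) > 0                      # the test set must not be empty
prop.table(table(train_toy$Response))   # class shares in train ...
prop.table(table(test_toy$Response))    # ... should be close to those in test
```

If the dataset is very small, an 80/20 stratified split can leave the test set empty, and every evaluation metric then comes out undefined.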

3.4 Building a Random Forest Model

rf_model <- randomForest(Response ~ ., data = train, importance = TRUE, ntree = 500)
print(rf_model)

## 
## Call:
##  randomForest(formula = Response ~ ., data = train, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##   0 1 class.error
## 0 2 0           0
## 1 0 3           0

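🧠 Interpretation: Each tree makes a decision, and the forest aggregates them. Adding trees stabilizes the predictions, typically with diminishing returns after a few hundred; it does not compensate for a weak or tiny dataset. With only five training rows here, the 0% OOB error above is not a meaningful estimate of performance.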
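To see how the number of trees affects the out-of-bag (OOB) error, you can read the `err.rate` component of a fitted classification forest. The sketch below uses the built-in `iris` data as a stand-in, since it is large enough to give a meaningful curve; it is not the workshop's customer data.

```r
library(randomForest)

set.seed(123)
# iris is a built-in stand-in here, not the customer dataset.
rf_toy <- randomForest(Species ~ ., data = iris, ntree = 300)

# err.rate has one row per tree; the "OOB" column is the cumulative OOB error.
oob <- rf_toy$err.rate[, "OOB"]
oob[c(10, 50, 100, 300)]

plot(oob, type = "l", xlab = "Number of trees", ylab = "OOB error rate")
```

The curve usually drops quickly and then flattens; once it is flat, extra trees cost computation without improving accuracy.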

3.5 Model Evaluation

We’ll use the confusion matrix to assess performance — it tells us how many customers were correctly or incorrectly predicted as responders.

pred <- predict(rf_model, test)
confusionMatrix(pred, test$Response)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 0 0
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
##                                   
##             Sensitivity :  NA     
##             Specificity :  NA     
##          Pos Pred Value :  NA     
##          Neg Pred Value :  NA     
##              Prevalence : NaN     
##          Detection Rate : NaN     
##    Detection Prevalence : NaN     
##       Balanced Accuracy :  NA     
##                                   
##        'Positive' Class : 0       
##

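🧩 Interpretation: Every statistic above is NaN or NA because the 5-row sample left no rows in the test set after the 80/20 split. With a larger dataset, the same call fills in the table.
- Accuracy: How often the model is right overall.
- Sensitivity: How well it identifies the positive class. By default caret treats the first factor level ("0", non-responders) as positive, as the last line of the output shows; pass positive = "1" to confusionMatrix() to measure responders.
- Specificity: How well it identifies the negative class, i.e. how well it avoids false positives.

💬 Marketing Insight: With responders set as the positive class, high sensitivity means your campaign targeting is catching most of the real responders.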
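These metrics are simple ratios of confusion-matrix cells, so they can be checked by hand. The sketch below uses ten made-up labels (not model output from this workshop) and treats "1", the responders, as the positive class.

```r
# Hypothetical labels for 10 customers (1 = responded, 0 = did not).
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))

cm <- table(Prediction = predicted, Reference = actual)
cm

# Treat "1" (responder) as the positive class.
tp <- cm["1", "1"]; fn <- cm["0", "1"]
tn <- cm["0", "0"]; fp <- cm["1", "0"]

accuracy    <- (tp + tn) / sum(cm)   # 7 / 10
sensitivity <- tp / (tp + fn)        # 3 / 4: share of real responders caught
specificity <- tn / (tn + fp)        # 4 / 6: share of non-responders left alone
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Here the model catches three of four responders but also targets two of six non-responders, a trade-off that accuracy alone would hide.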

3.6 Feature Importance

Which variables most influence the prediction? Let’s find out.

varImpPlot(rf_model, main = "Variable Importance (Random Forest)")

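📊 Interpretation: The plot ranks variables by how much they contribute to the model's predictions; it does not show the direction of an effect. If these variables rank near the top, a plausible reading is:
- MntWines: Wine spending helps separate responders from non-responders; heavy wine buyers may be high-value customers.
- NumWebPurchases: Online activity signals digital engagement.
- Income: Spending power affects responsiveness.

💬 Marketing Implication: Consider focusing digital campaigns on online buyers and high wine spenders, after confirming on the full dataset that they do respond more often.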
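The ranking behind the plot is also available as a table through `importance()`, which is easier to sort and report. The sketch below again uses the built-in `iris` data as a stand-in, so the variable names differ from the customer dataset.

```r
library(randomForest)

set.seed(123)
# iris is a built-in stand-in here, not the customer dataset.
rf_toy <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 300)

# type = 1: mean decrease in accuracy when a variable's values are permuted.
imp <- importance(rf_toy, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

A high value means predictions get noticeably worse when that variable is shuffled. It says nothing about whether higher values of the variable make a response more or less likely.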

4 📈 Integrating Insights

Insight	Business Interpretation	Recommended Action
Segment 1: High-income high spenders	Premium customers	Loyalty & luxury programs
Segment 2: Low-spending, high income	Untapped potential	Targeted upselling
Wine spending predicts response	Affluent, lifestyle-driven	Personalized wine promotions
Online purchase frequency	Digitally engaged	Leverage email & social media

🧠 Reflection Prompt: Which of these findings could be most valuable to your organization’s campaign strategy?

5 🧠 Recap

K-Means helps uncover natural customer groups — who your customers are.
Random Forests predict behavior — what they’re likely to do next.
Together, they power segmentation, targeting, and prediction in modern marketing.

6 🧩 Practice Exercises

Change the number of clusters to 4 and observe how the segment profiles change.
Add Gender to K-Means — does it affect clustering patterns?
Reduce Random Forest trees to 100 and compare accuracy.
Explore feature importance: which variables dominate?
Write a 150-word marketing summary translating model insights into actionable strategies.

7 📘 References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer.
UCLA IDRE: Introduction to R and Statistics
Kaggle Datasets: Mall Customers, Customer Personality Analysis

From Segments to Predictions: A Guided Introduction to K-Means and Random Forests in Marketing Analytics (R)

Aakash Mayur Shah

October 16, 2025