Introduction

This tutorial illustrates customer segmentation techniques through a Customer Personality Analysis dataset from Kaggle (website link under referneces). The dataset has comprehensive information regarding customers demographics, purchasing behavior, responses to campaigns, and their interests in product categories. I used the first 100 observations for this analysis to show how marketers can segment customers into clusters through K-means clustering. These segments,and marketing professionals can develop more targeted campaigns that address the specific needs of different customer segments, ultimately improving campaign effectiveness and customer satisfaction. ## Data Preparation and Exploration

# Load the customer personality analysis dataset
customers <- read.csv("marketing_campaign.csv", sep="\t")

# Select only the first 100 observations
customers <- customers[1:100,]

# Display the first few rows
head(customers)
##     ID Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524       1957 Graduation         Single  58138       0        0  04-09-2012
## 2 2174       1954 Graduation         Single  46344       1        1  08-03-2014
## 3 4141       1965 Graduation       Together  71613       0        0  21-08-2013
## 4 6182       1984 Graduation       Together  26646       1        0  10-02-2014
## 5 5324       1981        PhD        Married  58293       1        0  19-01-2014
## 6 7446       1967     Master       Together  62513       0        1  09-09-2013
##   Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      58      635        88             546             172               88
## 2      38       11         1               6               2                1
## 3      26      426        49             127             111               21
## 4      26       11         4              20              10                3
## 5      94      173        43             118              46               27
## 6      16      520        42              98               0               42
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
## 3           42                 1               8                   2
## 4            5                 2               2                   0
## 5           15                 5               5                   3
## 6           14                 2               6                   4
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
## 3                10                 4            0            0            0
## 4                 4                 6            0            0            0
## 5                 6                 5            0            0            0
## 6                10                 6            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1            0            0        0             3        11        1
## 2            0            0        0             3        11        0
## 3            0            0        0             3        11        0
## 4            0            0        0             3        11        0
## 5            0            0        0             3        11        0
## 6            0            0        0             3        11        0
# Check the structure of the data
str(customers)
## 'data.frame':    100 obs. of  29 variables:
##  $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
##  $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
##  $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
##  $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
##  $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
##  $ Dt_Customer        : chr  "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
##  $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...
##  $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
##  $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
##  $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
##  $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
##  $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
##  $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
##  $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
##  $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
##  $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
##  $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
##  $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
##  $ AcceptedCmp3       : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_CostContact      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Z_Revenue          : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...
# Summary statistics
summary(customers)
##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :  273   Min.   :1943   Length:100         Length:100        
##  1st Qu.: 2253   1st Qu.:1957   Class :character   Class :character  
##  Median : 5450   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5407   Mean   :1968                                        
##  3rd Qu.: 8094   3rd Qu.:1977                                        
##  Max.   :11178   Max.   :1996                                        
##                                                                      
##      Income         Kidhome        Teenhome    Dt_Customer       
##  Min.   : 2447   Min.   :0.00   Min.   :0.00   Length:100        
##  1st Qu.:31412   1st Qu.:0.00   1st Qu.:0.00   Class :character  
##  Median :50150   Median :0.00   Median :0.00   Mode  :character  
##  Mean   :49500   Mean   :0.45   Mean   :0.42                     
##  3rd Qu.:67516   3rd Qu.:1.00   3rd Qu.:1.00                     
##  Max.   :88194   Max.   :2.00   Max.   :2.00                     
##  NA's   :9                                                       
##     Recency         MntWines        MntFruits      MntMeatProducts  
##  Min.   : 0.00   Min.   :   1.0   Min.   :  0.00   Min.   :   1.00  
##  1st Qu.:19.75   1st Qu.:  13.0   1st Qu.:  2.00   1st Qu.:  14.75  
##  Median :39.50   Median : 126.5   Median :  6.50   Median :  48.00  
##  Mean   :44.73   Mean   : 253.7   Mean   : 22.41   Mean   : 174.21  
##  3rd Qu.:69.00   3rd Qu.: 445.5   3rd Qu.: 35.25   3rd Qu.: 221.00  
##  Max.   :99.00   Max.   :1032.0   Max.   :164.00   Max.   :1725.00  
##                                                                     
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.00    
##  1st Qu.:  2.00   1st Qu.:  1.00   1st Qu.:  6.00   1st Qu.: 1.00    
##  Median : 10.50   Median :  9.00   Median : 18.00   Median : 2.00    
##  Mean   : 39.49   Mean   : 28.72   Mean   : 41.98   Mean   : 2.44    
##  3rd Qu.: 52.00   3rd Qu.: 37.00   3rd Qu.: 45.75   3rd Qu.: 3.00    
##  Max.   :227.00   Max.   :263.00   Max.   :362.00   Max.   :15.00    
##                                                                      
##  NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.00   Min.   : 0.00       Min.   : 0.00     Min.   : 1.00    
##  1st Qu.: 2.00   1st Qu.: 0.00       1st Qu.: 3.00     1st Qu.: 3.00    
##  Median : 3.00   Median : 1.50       Median : 4.50     Median : 6.00    
##  Mean   : 4.08   Mean   : 2.84       Mean   : 5.70     Mean   : 5.39    
##  3rd Qu.: 6.00   3rd Qu.: 4.00       3rd Qu.: 8.25     3rd Qu.: 7.00    
##  Max.   :27.00   Max.   :28.00       Max.   :13.00     Max.   :20.00    
##                                                                         
##   AcceptedCmp3   AcceptedCmp4   AcceptedCmp5   AcceptedCmp1   AcceptedCmp2
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0    
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0    
##  Median :0.00   Median :0.00   Median :0.00   Median :0.00   Median :0    
##  Mean   :0.06   Mean   :0.07   Mean   :0.05   Mean   :0.06   Mean   :0    
##  3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:0    
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :0    
##                                                                           
##     Complain    Z_CostContact   Z_Revenue     Response   
##  Min.   :0.00   Min.   :3     Min.   :11   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:3     1st Qu.:11   1st Qu.:0.00  
##  Median :0.00   Median :3     Median :11   Median :0.00  
##  Mean   :0.01   Mean   :3     Mean   :11   Mean   :0.16  
##  3rd Qu.:0.00   3rd Qu.:3     3rd Qu.:11   3rd Qu.:0.00  
##  Max.   :1.00   Max.   :3     Max.   :11   Max.   :1.00  
## 
# Handle missing values
customers <- customers %>%
  drop_na()

# Calculate customer age from birth year
customers$Age <- 2023 - customers$Year_Birth

# Create total spending variable
customers$TotalSpending <- customers$MntWines + customers$MntFruits + 
                         customers$MntMeatProducts + customers$MntFishProducts + 
                         customers$MntSweetProducts + customers$MntGoldProds

# Calculate purchase frequency
customers$PurchaseFrequency <- customers$NumWebPurchases + customers$NumCatalogPurchases + 
                              customers$NumStorePurchases

# Select relevant variables for clustering
customer_data <- customers %>%
  select(Age, Income, TotalSpending, PurchaseFrequency)

# Scale the data
customer_scaled <- scale(customer_data)

Exploratory Data Analysis

# Visualize the distribution of key variables
p1 <- ggplot(customers, aes(x=Age)) + 
  geom_histogram(fill="steelblue", bins=20) + 
  labs(title="Age Distribution")

p2 <- ggplot(customers, aes(x=Income)) + 
  geom_histogram(fill="darkgreen", bins=20) + 
  labs(title="Income Distribution")

p3 <- ggplot(customers, aes(x=TotalSpending)) + 
  geom_histogram(fill="darkred", bins=20) + 
  labs(title="Total Spending Distribution")

p4 <- ggplot(customers, aes(x=PurchaseFrequency)) + 
  geom_histogram(fill="purple", bins=20) + 
  labs(title="Purchase Frequency Distribution")

grid.arrange(p1, p2, p3, p4, ncol=2)

# Create correlation matrix
cor_matrix <- cor(customer_data, use="complete.obs")
corrplot(cor_matrix, method="circle", type="upper", tl.col="black", tl.srt=45)

Cluster with K-means

# Determine optimal number of clusters using elbow method
fviz_nbclust(customer_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal K")

# Silhouette method for optimal clusters
fviz_nbclust(customer_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal K")

# Perform k-means clustering with 3 clusters
set.seed(123)
km_result <- kmeans(customer_scaled, centers = 3, nstart = 25)

# Add cluster assignment to original data
customers$Cluster <- as.factor(km_result$cluster)

# Visualize the clusters
fviz_cluster(list(data = customer_scaled, cluster = km_result$cluster),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
             geom = "point", ellipse.type = "convex",
             ggtheme = theme_minimal())

Cluster Profiling

# Calculate mean values for key variables by cluster
cluster_profiles <- customers %>%
  group_by(Cluster) %>%
  summarize(
    Count = n(),
    Avg_Age = mean(Age, na.rm = TRUE),
    Avg_Income = mean(Income, na.rm = TRUE),
    Avg_Spending = mean(TotalSpending, na.rm = TRUE),
    Avg_PurchaseFreq = mean(PurchaseFrequency, na.rm = TRUE),
    Avg_NumDeals = mean(NumDealsPurchases, na.rm = TRUE),
    Avg_WebPurchases = mean(NumWebPurchases, na.rm = TRUE),
    Avg_CatalogPurchases = mean(NumCatalogPurchases, na.rm = TRUE),
    Avg_StorePurchases = mean(NumStorePurchases, na.rm = TRUE)
  )

# Display cluster profiles
knitr::kable(cluster_profiles, digits = 2)
Cluster Count Avg_Age Avg_Income Avg_Spending Avg_PurchaseFreq Avg_NumDeals Avg_WebPurchases Avg_CatalogPurchases Avg_StorePurchases
1 26 67.81 61141.04 762.04 17.38 2.88 6.00 3.23 8.15
2 44 49.14 31964.50 105.89 6.14 1.91 2.16 0.75 3.23
3 21 51.33 71829.10 1373.76 21.43 2.90 5.29 7.48 8.67
# Create visual comparisons of clusters
p1 <- ggplot(customers, aes(x = Age, y = Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  labs(title = "Age vs Income by Cluster")

p2 <- ggplot(customers, aes(x = TotalSpending, y = PurchaseFrequency, color = Cluster)) +
  geom_point(alpha = 0.7) +
  labs(title = "Spending vs Purchase Frequency by Cluster")

grid.arrange(p1, p2, ncol = 1)

# Analyze preferred marketing channels by cluster
channel_data <- customers %>%
  group_by(Cluster) %>%
  summarize(
    Web = mean(NumWebPurchases, na.rm = TRUE),
    Catalog = mean(NumCatalogPurchases, na.rm = TRUE),
    Store = mean(NumStorePurchases, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(Web, Catalog, Store), 
               names_to = "Channel", values_to = "Average")

ggplot(channel_data, aes(x = Channel, y = Average, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Preferred Purchase Channels by Cluster",
       y = "Average Number of Purchases") +
  theme_minimal()

# Analyze campaign response rates by cluster
response_data <- customers %>%
  group_by(Cluster) %>%
  summarize(
    Campaign1 = mean(AcceptedCmp1, na.rm = TRUE) * 100,
    Campaign2 = mean(AcceptedCmp2, na.rm = TRUE) * 100,
    Campaign3 = mean(AcceptedCmp3, na.rm = TRUE) * 100,
    Campaign4 = mean(AcceptedCmp4, na.rm = TRUE) * 100,
    Campaign5 = mean(AcceptedCmp5, na.rm = TRUE) * 100
  ) %>%
  pivot_longer(cols = starts_with("Campaign"), 
               names_to = "Campaign", values_to = "ResponseRate")

ggplot(response_data, aes(x = Campaign, y = ResponseRate, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Campaign Response Rates by Cluster",
       y = "Response Rate (%)") +
  theme_minimal()

Recommendations Based on Clusters

# Based on the profiles, we can name our clusters
cluster_names <- data.frame(
  Cluster = c(1, 2, 3),
  Name = c("Budget Conscious", "Affluent Shoppers", "Average Consumers"),
  Description = c(
    "Lower income, moderate spending, price-sensitive",
    "High income, high spending, premium product focus",
    "Middle income, regular purchasing patterns"
  )
)

knitr::kable(cluster_names)
Cluster Name Description
1 Budget Conscious Lower income, moderate spending, price-sensitive
2 Affluent Shoppers High income, high spending, premium product focus
3 Average Consumers Middle income, regular purchasing patterns

Conclusion

With the use of K-means clustering, we segmented customers into three meaningful groups with distinct spending habits, income, and purchasing behavior. Budget Conscious, Affluent Shoppers, and Average Consumers represent the three clusters, proving the strength of data to uncover rich customer behavior patterns. From these findings, marketers can then design more specific and effective campaigns that cater to the needs and interests of each group directly. Coding assistance for this analysis was provided by Claude AI.

References