This tutorial illustrates customer segmentation techniques through a Customer Personality Analysis dataset from Kaggle (website link under referneces). The dataset has comprehensive information regarding customers demographics, purchasing behavior, responses to campaigns, and their interests in product categories. I used the first 100 observations for this analysis to show how marketers can segment customers into clusters through K-means clustering. These segments,and marketing professionals can develop more targeted campaigns that address the specific needs of different customer segments, ultimately improving campaign effectiveness and customer satisfaction. ## Data Preparation and Exploration
# Load the customer personality analysis dataset
customers <- read.csv("marketing_campaign.csv", sep="\t")
# Select only the first 100 observations
customers <- customers[1:100,]
# Display the first few rows
head(customers)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014
## 6 7446 1967 Master Together 62513 0 1 09-09-2013
## Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 58 635 88 546 172 88
## 2 38 11 1 6 2 1
## 3 26 426 49 127 111 21
## 4 26 11 4 20 10 3
## 5 94 173 43 118 46 27
## 6 16 520 42 98 0 42
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## 3 42 1 8 2
## 4 5 2 2 0
## 5 15 5 5 3
## 6 14 2 6 4
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## 3 10 4 0 0 0
## 4 4 6 0 0 0
## 5 6 5 0 0 0
## 6 10 6 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1 0 0 0 3 11 1
## 2 0 0 0 3 11 0
## 3 0 0 0 3 11 0
## 4 0 0 0 3 11 0
## 5 0 0 0 3 11 0
## 6 0 0 0 3 11 0
# Check the structure of the data
str(customers)
## 'data.frame': 100 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
# Summary statistics
summary(customers)
## ID Year_Birth Education Marital_Status
## Min. : 273 Min. :1943 Length:100 Length:100
## 1st Qu.: 2253 1st Qu.:1957 Class :character Class :character
## Median : 5450 Median :1970 Mode :character Mode :character
## Mean : 5407 Mean :1968
## 3rd Qu.: 8094 3rd Qu.:1977
## Max. :11178 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 2447 Min. :0.00 Min. :0.00 Length:100
## 1st Qu.:31412 1st Qu.:0.00 1st Qu.:0.00 Class :character
## Median :50150 Median :0.00 Median :0.00 Mode :character
## Mean :49500 Mean :0.45 Mean :0.42
## 3rd Qu.:67516 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :88194 Max. :2.00 Max. :2.00
## NA's :9
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 1.0 Min. : 0.00 Min. : 1.00
## 1st Qu.:19.75 1st Qu.: 13.0 1st Qu.: 2.00 1st Qu.: 14.75
## Median :39.50 Median : 126.5 Median : 6.50 Median : 48.00
## Mean :44.73 Mean : 253.7 Mean : 22.41 Mean : 174.21
## 3rd Qu.:69.00 3rd Qu.: 445.5 3rd Qu.: 35.25 3rd Qu.: 221.00
## Max. :99.00 Max. :1032.0 Max. :164.00 Max. :1725.00
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 6.00 1st Qu.: 1.00
## Median : 10.50 Median : 9.00 Median : 18.00 Median : 2.00
## Mean : 39.49 Mean : 28.72 Mean : 41.98 Mean : 2.44
## 3rd Qu.: 52.00 3rd Qu.: 37.00 3rd Qu.: 45.75 3rd Qu.: 3.00
## Max. :227.00 Max. :263.00 Max. :362.00 Max. :15.00
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 1.00
## 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 3.00 1st Qu.: 3.00
## Median : 3.00 Median : 1.50 Median : 4.50 Median : 6.00
## Mean : 4.08 Mean : 2.84 Mean : 5.70 Mean : 5.39
## 3rd Qu.: 6.00 3rd Qu.: 4.00 3rd Qu.: 8.25 3rd Qu.: 7.00
## Max. :27.00 Max. :28.00 Max. :13.00 Max. :20.00
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00 Min. :0
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0
## Median :0.00 Median :0.00 Median :0.00 Median :0.00 Median :0
## Mean :0.06 Mean :0.07 Mean :0.05 Mean :0.06 Mean :0
## 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00 Max. :0
##
## Complain Z_CostContact Z_Revenue Response
## Min. :0.00 Min. :3 Min. :11 Min. :0.00
## 1st Qu.:0.00 1st Qu.:3 1st Qu.:11 1st Qu.:0.00
## Median :0.00 Median :3 Median :11 Median :0.00
## Mean :0.01 Mean :3 Mean :11 Mean :0.16
## 3rd Qu.:0.00 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0.00
## Max. :1.00 Max. :3 Max. :11 Max. :1.00
##
# Handle missing values
customers <- customers %>%
drop_na()
# Calculate customer age from birth year
customers$Age <- 2023 - customers$Year_Birth
# Create total spending variable
customers$TotalSpending <- customers$MntWines + customers$MntFruits +
customers$MntMeatProducts + customers$MntFishProducts +
customers$MntSweetProducts + customers$MntGoldProds
# Calculate purchase frequency
customers$PurchaseFrequency <- customers$NumWebPurchases + customers$NumCatalogPurchases +
customers$NumStorePurchases
# Select relevant variables for clustering
customer_data <- customers %>%
select(Age, Income, TotalSpending, PurchaseFrequency)
# Scale the data
customer_scaled <- scale(customer_data)
# Visualize the distribution of key variables
p1 <- ggplot(customers, aes(x=Age)) +
geom_histogram(fill="steelblue", bins=20) +
labs(title="Age Distribution")
p2 <- ggplot(customers, aes(x=Income)) +
geom_histogram(fill="darkgreen", bins=20) +
labs(title="Income Distribution")
p3 <- ggplot(customers, aes(x=TotalSpending)) +
geom_histogram(fill="darkred", bins=20) +
labs(title="Total Spending Distribution")
p4 <- ggplot(customers, aes(x=PurchaseFrequency)) +
geom_histogram(fill="purple", bins=20) +
labs(title="Purchase Frequency Distribution")
grid.arrange(p1, p2, p3, p4, ncol=2)
# Create correlation matrix
cor_matrix <- cor(customer_data, use="complete.obs")
corrplot(cor_matrix, method="circle", type="upper", tl.col="black", tl.srt=45)
# Determine optimal number of clusters using elbow method
fviz_nbclust(customer_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal K")
# Silhouette method for optimal clusters
fviz_nbclust(customer_scaled, kmeans, method = "silhouette") +
labs(title = "Silhouette Method for Optimal K")
# Perform k-means clustering with 3 clusters
set.seed(123)
km_result <- kmeans(customer_scaled, centers = 3, nstart = 25)
# Add cluster assignment to original data
customers$Cluster <- as.factor(km_result$cluster)
# Visualize the clusters
fviz_cluster(list(data = customer_scaled, cluster = km_result$cluster),
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point", ellipse.type = "convex",
ggtheme = theme_minimal())
# Calculate mean values for key variables by cluster
cluster_profiles <- customers %>%
group_by(Cluster) %>%
summarize(
Count = n(),
Avg_Age = mean(Age, na.rm = TRUE),
Avg_Income = mean(Income, na.rm = TRUE),
Avg_Spending = mean(TotalSpending, na.rm = TRUE),
Avg_PurchaseFreq = mean(PurchaseFrequency, na.rm = TRUE),
Avg_NumDeals = mean(NumDealsPurchases, na.rm = TRUE),
Avg_WebPurchases = mean(NumWebPurchases, na.rm = TRUE),
Avg_CatalogPurchases = mean(NumCatalogPurchases, na.rm = TRUE),
Avg_StorePurchases = mean(NumStorePurchases, na.rm = TRUE)
)
# Display cluster profiles
knitr::kable(cluster_profiles, digits = 2)
Cluster | Count | Avg_Age | Avg_Income | Avg_Spending | Avg_PurchaseFreq | Avg_NumDeals | Avg_WebPurchases | Avg_CatalogPurchases | Avg_StorePurchases |
---|---|---|---|---|---|---|---|---|---|
1 | 26 | 67.81 | 61141.04 | 762.04 | 17.38 | 2.88 | 6.00 | 3.23 | 8.15 |
2 | 44 | 49.14 | 31964.50 | 105.89 | 6.14 | 1.91 | 2.16 | 0.75 | 3.23 |
3 | 21 | 51.33 | 71829.10 | 1373.76 | 21.43 | 2.90 | 5.29 | 7.48 | 8.67 |
# Create visual comparisons of clusters
p1 <- ggplot(customers, aes(x = Age, y = Income, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = "Age vs Income by Cluster")
p2 <- ggplot(customers, aes(x = TotalSpending, y = PurchaseFrequency, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = "Spending vs Purchase Frequency by Cluster")
grid.arrange(p1, p2, ncol = 1)
# Analyze preferred marketing channels by cluster
channel_data <- customers %>%
group_by(Cluster) %>%
summarize(
Web = mean(NumWebPurchases, na.rm = TRUE),
Catalog = mean(NumCatalogPurchases, na.rm = TRUE),
Store = mean(NumStorePurchases, na.rm = TRUE)
) %>%
pivot_longer(cols = c(Web, Catalog, Store),
names_to = "Channel", values_to = "Average")
ggplot(channel_data, aes(x = Channel, y = Average, fill = Cluster)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Preferred Purchase Channels by Cluster",
y = "Average Number of Purchases") +
theme_minimal()
# Analyze campaign response rates by cluster
response_data <- customers %>%
group_by(Cluster) %>%
summarize(
Campaign1 = mean(AcceptedCmp1, na.rm = TRUE) * 100,
Campaign2 = mean(AcceptedCmp2, na.rm = TRUE) * 100,
Campaign3 = mean(AcceptedCmp3, na.rm = TRUE) * 100,
Campaign4 = mean(AcceptedCmp4, na.rm = TRUE) * 100,
Campaign5 = mean(AcceptedCmp5, na.rm = TRUE) * 100
) %>%
pivot_longer(cols = starts_with("Campaign"),
names_to = "Campaign", values_to = "ResponseRate")
ggplot(response_data, aes(x = Campaign, y = ResponseRate, fill = Cluster)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Campaign Response Rates by Cluster",
y = "Response Rate (%)") +
theme_minimal()
# Based on the profiles, we can name our clusters
cluster_names <- data.frame(
Cluster = c(1, 2, 3),
Name = c("Budget Conscious", "Affluent Shoppers", "Average Consumers"),
Description = c(
"Lower income, moderate spending, price-sensitive",
"High income, high spending, premium product focus",
"Middle income, regular purchasing patterns"
)
)
knitr::kable(cluster_names)
Cluster | Name | Description |
---|---|---|
1 | Budget Conscious | Lower income, moderate spending, price-sensitive |
2 | Affluent Shoppers | High income, high spending, premium product focus |
3 | Average Consumers | Middle income, regular purchasing patterns |
With the use of K-means clustering, we segmented customers into three meaningful groups with distinct spending habits, income, and purchasing behavior. Budget Conscious, Affluent Shoppers, and Average Consumers represent the three clusters, proving the strength of data to uncover rich customer behavior patterns. From these findings, marketers can then design more specific and effective campaigns that cater to the needs and interests of each group directly. Coding assistance for this analysis was provided by Claude AI.