library(cluster)
library(tidyverse)R: Hierarchical + K-means clustering
Part A
1)
smc <- read_csv("supermarket_customers.csv")2)
smc2 <- select(smc, annual_income:spending_score)
smc2_scale <- scale(smc2)
d1 <- dist(smc2_scale)3) + 4)
h1 <- hclust(d1)
plot(h1, hang = -1)a)
clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 79 39 38
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1401 0.4993 0.5965 0.5531 0.6610 0.7645
Reasonable clustering structure has been found to be number 5 with silhouette median score being 0.55. Clusters 2 and 3 are the strongest, 1 and 4 are weaker and cluster 5 is the weakest.
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)Yes, the heatmap is light, with light “blocks” along the diagonal representing small Euclidean distances between them and therefore showcasing clustering structure.
5)
a)
ggplot(smc, aes(x = annual_income,
y = spending_score,
colour = factor(clusters1))) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "Customer Segments Based on Income and Spending",
x = "Annual Income (€000s)",
y = "Spending Score",
colour = "Cluster")This graph shows us, that the cluster segmentation was successful and identified segments with enough distinction.
Cluster one includes low spenders with low incomes, 2 includes high spenders with low incomes, 3 includes middle spenders with middle incomes, 4 includes high spenders with high incomes and 5 includes low spenders with high incomes.
This output shows us that spending x income represents distinct and actionable customer segments.
b)
smc_clus <- smc %>%
mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5'))smc_clus_means <- smc_clus %>%
group_by(cluster) %>%
summarise(num_custs = n(),
avg_income = mean(annual_income),
avg_spending = mean(spending_score),
avg_age = mean(age))
smc_clus_means# A tibble: 5 × 5
cluster num_custs avg_income avg_spending avg_age
<chr> <int> <dbl> <dbl> <dbl>
1 C1 23 26.3 20.9 45.2
2 C2 21 25.1 80.0 25.3
3 C3 79 54.4 50.2 42.9
4 C4 39 86.5 82.1 32.7
5 C5 38 87 18.6 40.4
c)
smc$gender <- factor(smc$gender, levels = c("Female", "Male"))
ggplot(smc, aes(x = factor(clusters1), fill = gender)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of Customers") +
xlab("Cluster") +
ggtitle("Gender Split by Cluster") +
scale_fill_manual(values = c("Female" = "pink", "Male" = "skyblue"))7)
Segment 1 (low income, low spending) contains 23 customers with average income of ~26k, average spending of ~21, and average age of ~45 with ~36% being females and ~64% males.
Segment 2 (low income, high spending) contains 21 customers with average income of ~25k, average spending of ~80, and average age of ~25 with ~55% being females and ~45% males.
Segment 3 (mid income, mid spending) contains 79 customers with average income of ~54k, average spending of ~50, and average age of ~43 with ~58% being females and ~42% males.
Segment 4 (high income, high spending) contains 39 customers with average income of ~86k, average spending of ~50, and average age of ~43 with ~52% being females and ~48% males.
Segment 5 (high income, low spending) contains 38 customers with average income of ~87k, average spending of ~19, and average age of ~40 with ~48% being females and ~52% males.
8)
1 Cash conscious
- Attract them by price promotions - discounts, special deals and offers
- Loyalty program/app with point collecting
2 Kids with adult money
- Student loyalty program
- Cross-selling of complementary products
- Subscription based discounts
- Target impulse buying
- Use of social media and influencer marketing
3 Middle class
- Family bundles
- Loyalty card
- Cross-category promotions (dinner + wine + dessert/kid snack + cheese for the parents)
- Target small luxuries
4 VIPs
Premium and luxury segment - recommend those products
Premium product bundling
Exclusive offers
5 Potential VIPs
- Free samples -Trial discounts -Loyalty programs -Highlight convenience and rationality
Part B
1)
rec <- read_csv("recycling.csv")2)
rec2 <- select(rec, pos_impact, environ, money, bins, local, avoid_waste)3)
no, because all variables are on the same 1-4 scale.
4)
set.seed(101)
kmeans1 <- kmeans(rec2, centers = 3)5)
d2 <- dist(rec2)
sil_kmeans <- silhouette(kmeans1$cluster, d2)
summary(sil_kmeans)Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
Cluster sizes and average silhouette widths:
169 162 35
0.2682631 0.2054575 0.3271905
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08256 0.14303 0.23963 0.24610 0.35704 0.52470
Overall silhouette mean score is 0.25 which means that no substantial structure has been found. Cluster 2 has individual median score under 0.25 and 1 and 3 have 0.26 and 0.33 scores, which means they are weak and potentially artificial.
6)
a)
rec_clus <- rec %>%
mutate(clusters = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters == 1 ~ 'C1',
clusters == 2 ~ 'C2',
clusters == 3 ~ 'C3'))rec_clus_means <- rec_clus %>%
group_by(cluster) %>%
summarise(num_custs = n(),
avg_pos_impact = mean(pos_impact),
avg_environ = mean(environ),
avg_money = mean(money),
avg_bins = mean(bins),
avg_local = mean(local),
avg_waste = mean(avoid_waste))
rec_clus_means# A tibble: 3 × 8
cluster num_custs avg_pos_impact avg_environ avg_money avg_bins avg_local
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 169 3.85 3.67 3.93 3.12 2.98
2 C2 162 3.59 3.48 3.83 2.62 2.97
3 C3 35 1.8 1.23 1.66 1.11 2.29
# ℹ 1 more variable: avg_waste <dbl>
b)
rec_clus <- rec_clus %>%
mutate(age_group = case_when(
age < 25 ~ "Under 25",
age < 41 ~ "26-40",
age < 56 ~ "41-55",
age < 66 ~ "56-65",
TRUE ~ "65+"
))
ggplot(rec_clus, aes(x = cluster, fill = age_group)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Age Distribution by Cluster",
x = "Segment",
y = "Percentage of Customers")7)
Segment 1 includes 169 respondents with ~5% being under 25, ~33% being 26-40, ~55% being 41-55 and ~7% being 56-65. These people overall agreed with recycling having positive impact, recycling being vital to the environment, recycling saving money and also that they get confused about what goes into which bin and didn’t know where local recycling center is and didn’t agree with the statement that avoiding waste at home is difficult.
Segment 2 includes 162 respondents with ~9% being under 25, ~43% being 26-40, ~45% being 41-55 and ~3% being 56-65.These people overall agreed with recycling having positive impact, recycling being vital to the environment and recycling saving money. They aren’t confused about the bins, they also don’t know where their local recycling center is and they do find it difficult to avoid waste from packaging at home.
Segment 3 includes 35 respondents with ~25% being under 25, ~55% being 26-40, ~20% being 41-55. These people overall disagreed with everything, strongly so about positive impact of recycling, recycling being vital to the environment, recycling saving money and all while they claim to not be confused about what belongs to which bin, did’t knwo where local recycling center is and do not find it difficult to avoid waste from packaging at home.
8)
Segment 1 - Education on recycling practices - what belongs where, information about local recycling and environmental organisations.
Segment 2 - Education on how to avoid waste coming to the home from packaging. Also mention information about local recycling and environmental organisations.
Segment 3 - Education on why is recycling important, use of data of its environmental impacts to make them believe that it is worth it, because they claim to know how to do it, just believe it is not worth the effort. It is the youngest and smallest segment.