library(cluster)
library(tidyverse)
super <- read_csv("supermarket_customers.csv")
View(super)
Segmentation report
1 Part A - Segmenting Supermarket Customers using Hierarchical Clustering
#1
#2
super_2 <- select(super, annual_income, spending_score)
super_2_scale <- scale(super_2)
d2 <- dist(super_2_scale)
#2a.
Yes, the data need to be scaled. Annual income and spending score are measured in different units rather than on a common Likert scale, and their ranges are much wider than a Likert scale's. We therefore scale both variables before computing the distance matrix so that neither dominates the distances.
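The effect of scaling on a distance matrix can be illustrated with a small sketch. The three customers below are made-up values, not rows from supermarket_customers.csv, and income is written in euros (rather than thousands) purely to exaggerate the difference in range.

```r
# Hypothetical toy data (assumption: not taken from the supermarket file)
toy <- data.frame(
  annual_income  = c(20000, 21000, 60000),
  spending_score = c(10, 90, 15)
)

# Unscaled Euclidean distances: the income column dominates,
# so customers 1 and 2 look close despite very different spending scores
m_raw <- as.matrix(dist(toy))

# scale() centres each column to mean 0 and rescales it to sd 1,
# so both variables contribute comparably to the distances
m_scaled <- as.matrix(dist(scale(toy)))

print(round(m_raw, 1))
print(round(m_scaled, 2))
```

Comparing the two matrices shows how the relative distances between the same customers change once both variables are on a common footing.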
#3.
h2 <- hclust(d2, method = 'ward.D')
plot(h2, hang = -1)
#4.
heatmap(as.matrix(d2), Rowv = as.dendrogram(h2), Colv = 'Rowv')
#4a.
Yes, the heatmap provides evidence of clustering: the lightly coloured blocks along the diagonal indicate groups of customers that are similar to each other.
#5.
clusters1 <- cutree(h2, k = 5)
sil1 <- silhouette(clusters1, d2)
summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d2) :
Cluster sizes and average silhouette widths:
23 21 85 39 32
0.5161937 0.6353734 0.5625990 0.5138767 0.5526370
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08419 0.49074 0.59411 0.55381 0.66283 0.74542
The mean silhouette score for the overall clustering is 0.55381, which means a reasonable clustering structure has been found. All five clusters have average silhouette widths between 0.51 and 0.64, which is below 0.7, so none of them is strongly structured, but each is reasonable.
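The interpretation above can be expressed as a small lookup. The sketch below assumes the commonly quoted rule-of-thumb bands for average silhouette width (above 0.70 strong, 0.51 to 0.70 reasonable, 0.26 to 0.50 weak, 0.25 or below no substantial structure). `interpret_silhouette()` is a hypothetical helper name, and the widths are copied from the summary output.

```r
# Hypothetical helper: map average silhouette widths to rule-of-thumb labels.
# The cut-offs are an assumption about the rule of thumb being applied.
interpret_silhouette <- function(avg_width) {
  bands <- cut(avg_width,
               breaks = c(-1, 0.25, 0.50, 0.70, 1),
               labels = c("no substantial structure",
                          "weak, could be artificial",
                          "reasonable",
                          "strong"),
               include.lowest = TRUE)
  # Return plain character labels, keeping any cluster names
  setNames(as.character(bands), names(avg_width))
}

# Average silhouette widths per cluster, as reported by summary(sil1)
cluster_widths <- c(C1 = 0.5161937, C2 = 0.6353734, C3 = 0.5625990,
                    C4 = 0.5138767, C5 = 0.5526370)

part_a_bands <- interpret_silhouette(cluster_widths)
print(part_a_bands)
```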
#6.
super_clus <- super %>%
mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5'))
super_clus_means <- super_clus %>%
group_by(cluster) %>%
summarise(id = n(),
#gender_m = mean(gender),
age_m = mean(age),
annual_income_m = mean(annual_income),
spending_score_m = mean(spending_score))
#6a.
# Colour the points by cluster label from super_clus, matching the "Cluster" legend
ggplot(super_clus, aes(x = annual_income,
y = spending_score,
colour = cluster)) +
geom_point(size = 3) +
labs(title = "Customer Segments Based on Income and Spending Score",
x = "Annual Income (€000s)",
y = "Spending Score",
colour = "Cluster") +
theme_minimal()
The scatterplot shows that customers with a spending score of around 50 tend to have an annual income of roughly €45-50k.
#6b.
knitr::kable(select(super_clus_means, cluster, id, annual_income_m, spending_score_m, age_m),
digits = 0,
col.names = c("Cluster", "# Customers", "Avg. Annual Income (€000s)", "Avg. Spending Score", "Avg. Age"),
caption = "Number of Customers and Average Annual Income, Spending Score & Age by Cluster")
| Cluster | # Customers | Avg. Annual Income (€000s) | Avg. Spending Score | Avg. Age |
|---|---|---|---|---|
| C1 | 23 | 26 | 21 | 45 |
| C2 | 21 | 25 | 80 | 25 |
| C3 | 85 | 56 | 49 | 42 |
| C4 | 39 | 87 | 82 | 33 |
| C5 | 32 | 89 | 16 | 41 |
#6c.
ggplot(super_clus, aes(x = gender, group = cluster)) +
geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count") +
facet_grid(~ cluster)
#7.
Cluster 1 contains 23 customers with an average annual income of €26k, an average spending score of 21 and an average age of 45. There are more females than males in cluster 1.
Cluster 2 contains 21 customers with an average annual income of €25k, an average spending score of 80 and an average age of 25. There are more females than males in cluster 2.
Cluster 3 contains 85 customers with an average annual income of €56k, an average spending score of 49 and an average age of 42. There are more females than males in cluster 3.
Cluster 4 contains 39 customers with an average annual income of €87k, an average spending score of 82 and an average age of 33. There are more females than males in cluster 4.
Cluster 5 contains 32 customers with an average annual income of €89k, an average spending score of 16 and an average age of 41. There are more males than females in cluster 5.
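The gender split read off the bar chart can also be checked numerically. A minimal sketch follows, using a made-up stand-in for `super_clus`; the gender labels "Female" and "Male" are an assumption about how the column is coded.

```r
# Hypothetical stand-in for super_clus (assumption: gender is stored as text labels)
toy_clus <- data.frame(
  cluster = c("C1", "C1", "C1", "C2", "C2", "C2", "C2"),
  gender  = c("Female", "Female", "Male", "Male", "Male", "Female", "Male")
)

# Cross-tabulate cluster by gender, then convert each row to proportions
# (margin = 1 makes every cluster's row sum to 1)
gender_share <- prop.table(table(toy_clus$cluster, toy_clus$gender), margin = 1)
print(round(gender_share, 2))
```

Applied to the real data, the same call on `super_clus$cluster` and `super_clus$gender` would give the within-cluster proportions that the faceted bar chart displays.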
#8.
The supermarket should focus on value and affordability to encourage more spending. It should offer discount bundles on everyday grocery items and provide loyalty rewards or points for frequent purchases. It should also encourage continued engagement and loyalty, which helps retain active shoppers and maintain their high spending levels.
The segmentation analysis reveals five distinct customer groups with different income levels and spending behaviours. By tailoring marketing strategies to each segment, the supermarket can better target promotions, increase customer engagement, and ultimately improve overall sales.
I would call cluster 1 the Traditional low spenders, cluster 2 the Discount hunters, cluster 3 the Bulk buyers, cluster 4 the High-end, luxury shoppers, and cluster 5 the Intentional buyers.
2 Part B - Segmenting Consumers’ Recycling Perceptions & Habits using K-means
#1
recycle <- read_csv("recycling.csv")
Rows: 366 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): age
dbl (7): id, pos_impact, environ, money, bins, local, avoid_waste
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(recycle)
#2.
recycle_2 <- select(recycle, pos_impact, environ, money, bins, local, avoid_waste)
#3.
No, the data do not need to be scaled, as all six variables are measured on the same 10-point Likert scale.
#4
set.seed(101)
kmeans1 <- kmeans(recycle_2, centers = 3)
kmeans1$cluster #The assignment of each observation (i.e. customer) to each cluster.
[1] 1 2 1 1 1 1 1 2 1 2 2 1 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1 2 1 1 2
[38] 1 2 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 2 1 2 1 1 2 1 2 2 2 2 2 1 1 2 1 1
[75] 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 2 1 1 2 1 2 2
[112] 2 1 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 2 3 3 3 3
[149] 2 1 1 1 2 1 1 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 3 3 3 3 3 1 2 2 2 2 1
[186] 2 1 1 2 1 1 1 3 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 3 1 2 2 1 2 2
[223] 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 1 1 2 1 1 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 1
[260] 2 1 3 1 1 2 2 1 2 1 2 2 3 2 2 1 1 3 1 2 2 1 1 1 2 2 2 3 3 3 3 3 3 1 1 2 1
[297] 2 1 2 2 1 1 2 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2
[334] 2 1 1 1 1 1 2 1 3 3 3 3 3 3 3 3 1 1 2 1 1 1 2 1 3 3 3 3 3 3 3 1 1
kmeans1$centers #The centroids for each cluster.
pos_impact environ money bins local avoid_waste
1 3.846154 3.674556 3.928994 3.118343 2.976331 1.337278
2 3.586420 3.481481 3.827160 2.617284 2.969136 3.104938
3 1.800000 1.228571 1.657143 1.114286 2.285714 2.657143
kmeans1$size #The number of observations in each cluster.
[1] 169 162 35
kmeans1$iter #The number of iterations it took to converge to the final clustering solution.
[1] 3
#5.
d1 <- dist(recycle_2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d1)
summary(sil_kmeans1)
Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :
Cluster sizes and average silhouette widths:
169 162 35
0.2682631 0.2054575 0.3271905
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08256 0.14303 0.23963 0.24610 0.35704 0.52470
The overall clustering has a mean silhouette score of 0.24610, which means that no substantial structure has been found. Clusters 1 and 3 (0.27 and 0.33) show only a weak structure that could be artificial, and Cluster 2 (0.21) is weaker still.
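Because the silhouette here is low, one way to probe the result is to compare the average silhouette width across several values of k, using multiple random starts for each. The sketch below uses simulated data rather than recycling.csv; `nstart = 25` and the range of k from 2 to 6 are assumptions chosen for illustration.

```r
library(cluster)  # silhouette()

set.seed(101)

# Simulated stand-in data (assumption): three well-separated groups, two variables
toy <- rbind(
  matrix(rnorm(60, mean = 2, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.4), ncol = 2)
)
d_toy <- dist(toy)

# For each candidate k, run k-means from 25 random starts and
# record the mean silhouette width of the resulting clustering
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d_toy)[, "sil_width"])
})
names(avg_sil) <- k_values

print(round(avg_sil, 3))

# The k with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
print(best_k)
```

On the real survey data, the same loop over `recycle_2` would show whether a different number of clusters gives a better-separated solution than k = 3.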
#6
recycle_clus <- recycle %>%
mutate(clusters1 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))
recycle_clus_means <- recycle_clus %>%
group_by(cluster) %>%
summarise(num_custs = n(),
pos_impact_m = mean(pos_impact),
environ_m = mean(environ),
money_m = mean(money),
bins_m = mean(bins),
local_m = mean(local),
avoid_waste_m = mean(avoid_waste))
recycle_clus_means
# A tibble: 3 × 8
cluster num_custs pos_impact_m environ_m money_m bins_m local_m avoid_waste_m
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 169 3.85 3.67 3.93 3.12 2.98 1.34
2 C2 162 3.59 3.48 3.83 2.62 2.97 3.10
3 C3 35 1.8 1.23 1.66 1.11 2.29 2.66
# digits is given per column: a shorter vector such as c(0,0,2) is recycled and rounds the means inconsistently
knitr::kable(select(recycle_clus_means, cluster, num_custs, pos_impact_m, environ_m, money_m, bins_m, local_m, avoid_waste_m),
digits = c(0, 0, 2, 2, 2, 2, 2, 2),
col.names = c("Cluster", "# Customers", "Positive Impact", "Environment", "Money", "Bins", "Local", "Avoid Waste"),
caption = "Number of Customers and Average Responses by Cluster")
| Cluster | # Customers | Positive Impact | Environment | Money | Bins | Local | Avoid Waste |
|---|---|---|---|---|---|---|---|
| C1 | 169 | 3.85 | 3.67 | 3.93 | 3.12 | 2.98 | 1.34 |
| C2 | 162 | 3.59 | 3.48 | 3.83 | 2.62 | 2.97 | 3.10 |
| C3 | 35 | 1.80 | 1.23 | 1.66 | 1.11 | 2.29 | 2.66 |
ggplot(recycle_clus, aes(x = age, group = cluster)) +
geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count") +
facet_grid(~ cluster)
#7
Cluster 1 has 169 customers. They believe their recycling efforts make a positive impact (3.85), strongly agree that their recycling is improving the environment (3.67) and that it saves money (3.93), are confused about what goes into the bins (3.12), know where the local waste centre is (2.98), and do not find it very difficult to avoid the packaging waste that comes home (1.34).
Cluster 2 has 162 customers. They believe their recycling efforts make a positive impact (3.59), strongly agree that their recycling is improving the environment (3.48) and that it saves money (3.83), do not seem too confused about what goes into the bins (2.62), know where the local waste centre is (2.97), and find it difficult to avoid the packaging waste that comes home (3.10).
Cluster 3 has 35 customers. They do not believe their recycling efforts make a positive impact (1.80), strongly disagree that their recycling improves the environment (1.23), disagree that they are saving money (1.66), are very confused about what goes into the bins (1.11), do not know where the local waste centre is (2.29), and find it difficult to avoid the packaging waste that comes home (2.66).
#8
For cluster 1, I would call them the Environmentally Aware Heroes. Carlow County Council could focus on reinforcing and supporting their behaviour. It could offer community recycling initiatives or challenges to maintain engagement.
For cluster 2, I would call them the Complacent Recyclers. Carlow County Council could help make recycling easier and more convenient for them. It could educate residents on how to reduce household packaging waste and provide simple recycling guides and reminders through social media or local campaigns.
For cluster 3, I would call them the Confused Recyclers. Carlow County Council should focus on education and greater environmental awareness for this cluster. It could run community workshops or awareness campaigns to build understanding, and display simple visual guides showing what goes into each bin.
Overall, segmentation analysis helps us profile the clusters, identify where habits can be improved and design targeted campaigns for each group.