Part A - Hierarchical clustering
A2 Sub-dataset + Euclidean distance matrix
Yes, the data need to be scaled. annual_income and spending_score are measured on different ranges, so without scaling the income variable would dominate the Euclidean distances and the clusters would be driven mostly by income.
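The effect of scaling on Euclidean distance can be illustrated with a small Python sketch. The four customers below are hypothetical, and income is written in raw euros purely to make the scale difference visible; they are not rows from the dataset. Standardisation uses the sample standard deviation, which is an assumption about how the scaling was done.

```python
import math
import statistics

# Hypothetical customers: (annual income in euros, spending score).
customers = [(15000, 39), (15000, 81), (70000, 39), (70000, 81)]

def zscore_columns(rows):
    """Standardise each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

scaled = zscore_columns(customers)

# Customers 0 and 1 differ only in spending; customers 0 and 2 differ only in income.
raw_ratio = math.dist(customers[0], customers[2]) / math.dist(customers[0], customers[1])
scaled_ratio = math.dist(scaled[0], scaled[2]) / math.dist(scaled[0], scaled[1])

print(raw_ratio)     # far above 1: the income difference swamps the spending difference
print(scaled_ratio)  # close to 1: both variables contribute comparably
```

Before scaling, the income gap is over a thousand times larger than the spending gap; after scaling the two gaps are of equal size, so neither variable dominates the distance matrix.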
A3 Dendrogram and Heatmap
A4
Yes, the dendrogram provides evidence of clustering. The branches split the data into clearly separated groups, and the last merges happen at much greater heights than the earlier ones. The biggest jump in height occurs just before the final merge.
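The "biggest jump in height" reading of a dendrogram can be sketched as a simple heuristic in Python. The merge heights below are made up for illustration and the function name is hypothetical; the actual heights would come from the fitted tree.

```python
def clusters_from_largest_gap(merge_heights, n_points):
    """Suggest a cluster count by cutting the tree inside the largest height gap."""
    heights = sorted(merge_heights)
    # Gap between each merge height and the next one up the tree.
    gaps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    widest = max(range(len(gaps)), key=lambda j: gaps[j])
    # Cutting inside gap j keeps the first j + 1 merges; each merge removes one cluster.
    merges_done = widest + 1
    return n_points - merges_done

# Toy tree on 6 points: 5 merges, with a clear jump between heights 2.0 and 6.0.
toy_heights = [1.0, 1.5, 2.0, 6.0, 7.0]
suggested_k = clusters_from_largest_gap(toy_heights, n_points=6)
print(suggested_k)  # 3
```

This heuristic is only a guide. A finer cut can still be chosen when smaller segments are more useful for the analysis.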
A5 5-Cluster solution
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 79 39 38
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1401 0.4993 0.5965 0.5531 0.6610 0.7645
The overall mean silhouette width is 0.55, which suggests reasonably clear separation: not perfect, but strong enough to justify 5 segments.
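As a quick consistency check, the overall mean silhouette width should equal the size-weighted average of the per-cluster averages. The Python sketch below uses only the numbers printed in the output above.

```python
# Cluster sizes and average silhouette widths copied from the output above.
sizes = [23, 21, 79, 39, 38]
avg_widths = [0.5065849, 0.6199558, 0.6140651, 0.5094809, 0.4623933]

total = sum(sizes)
overall = sum(n * w for n, w in zip(sizes, avg_widths)) / total

# Weakest cluster, as a 1-based label matching the order of the output.
weakest = min(range(len(sizes)), key=lambda i: avg_widths[i]) + 1

print(total)              # 200
print(round(overall, 4))  # 0.5531
print(weakest)            # 5
```

The weighted average reproduces the reported mean of 0.5531, and the fifth cluster in the output has the lowest average width.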
A6 Scatterplot
This shows the 5 distinct customer segments based on income and spending behaviour. The clear separation supports the hierarchical clustering results and demonstrates true segmentation within the dataset.
Table
| Cluster | # Customer | Avg Income (€000s) | Avg Spending Score | Avg Age |
|---|---|---|---|---|
| C1 | 23 | 26.30 | 20.91 | 45.2 |
| C2 | 21 | 25.10 | 80.05 | 25.3 |
| C3 | 79 | 54.42 | 50.22 | 42.9 |
| C4 | 39 | 86.54 | 82.13 | 32.7 |
| C5 | 38 | 87.00 | 18.63 | 40.4 |
C1: 23 customers, avg income = 26.3, avg spend = 20.9, avg age = 45.2
C2: 21 customers, avg income = 25.1, avg spend = 80, avg age = 25.3
C3: 79 customers, avg income = 54.4, avg spend = 50.2, avg age = 42.9
C4: 39 customers, avg income = 86.5, avg spend = 82.1, avg age = 32.1
C5: 38 customers, avg income = 87, avg spend = 18.6, avg age = 40.4
Barchart % Males vs Female per Cluster
Warning: The following aesthetics were dropped during statistical transformation: fill.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
The chart above shows a fairly balanced gender split across clusters, with females forming the larger share in every cluster.
A8 Summary
- C1: “Low-Budget Occasional Shoppers” Low income and low spending, with the oldest average age. They are probably price-sensitive and unengaged.
Action: “€10 dinner deals,” loyalty points on staples, weekly value bundles of essentials, and SMS offers timed around paydays.
- C2: “Young Big Spenders” The youngest cluster, with a high spending score despite a low income (they spend a lot relative to their income).
Action: meal-deal subscriptions, app-only rewards, impulse-friendly promotions (snacks, drinks, and prepared meals), student/young adult benefits, and “spend €X get €Y off next visit.”
- C3: “Mainstream Regulars” The largest group, with mid-range income and spending and the second-oldest average age.
Action: maintain their loyalty with customised coupons, family-sized packages, “buy more save more,” seasonal promotions, and enhanced convenience (click-and-collect deals).
- C4: “Premium Power Shoppers” High income and the highest spending score; the second-youngest cluster.
Action: premium experiences, including wine and cheese deals, upscale product lines, VIP tiers, early access promotions, premium delivery times, and tailored suggestions.
- C5: “High Income, Low Engagement” High income but the lowest spending score, so they could spend more than they do.
Action: win-back + convenience: highlight quality/freshness, curated baskets, free delivery threshold promotions, targeted “try us” offers, and premium convenience positioning.
Part B - K-means clustering
No, scaling is not needed. All 6 variables are measured on the same 1-4 agreement scale, so no single variable will overpower the others.
4 Run 3-cluster k-means
[1] 1 2 1 1 1 1 1 2 1 2 2 1 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1 2 1 1 2
[38] 1 2 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 2 1 2 1 1 2 1 2 2 2 2 2 1 1 2 1 1
[75] 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 2 1 1 2 1 2 2
[112] 2 1 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 2 3 3 3 3
[149] 2 1 1 1 2 1 1 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 3 3 3 3 3 1 2 2 2 2 1
[186] 2 1 1 2 1 1 1 3 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 3 1 2 2 1 2 2
[223] 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 1 1 2 1 1 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 1
[260] 2 1 3 1 1 2 2 1 2 1 2 2 3 2 2 1 1 3 1 2 2 1 1 1 2 2 2 3 3 3 3 3 3 1 1 2 1
[297] 2 1 2 2 1 1 2 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2
[334] 2 1 1 1 1 1 2 1 3 3 3 3 3 3 3 3 1 1 2 1 1 1 2 1 3 3 3 3 3 3 3 1 1
pos_impact environ money bins local avoid_waste
1 3.846154 3.674556 3.928994 3.118343 2.976331 1.337278
2 3.586420 3.481481 3.827160 2.617284 2.969136 3.104938
3 1.800000 1.228571 1.657143 1.114286 2.285714 2.657143
[1] 169 162 35
[1] 3
5 Assess quality
Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :
Cluster sizes and average silhouette widths:
169 162 35
0.2682631 0.2054575 0.3271905
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08256 0.14303 0.23963 0.24610 0.35704 0.52470
With an average silhouette width of roughly 0.25, this 3-cluster solution is weak to moderate. The clusters exist, but they overlap, which is quite typical for attitude surveys.
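To make the overlap interpretation concrete, here is a minimal Python sketch of how a single silhouette width is computed, using made-up one-dimensional points rather than the survey data. The function name and its arguments are hypothetical, and the conventional special case for single-member clusters is not handled.

```python
def silhouette_width(point, own_others, other_clusters):
    """s = (b - a) / max(a, b) for 1-D data.

    a: mean distance from the point to the other members of its own cluster.
    b: smallest mean distance from the point to the members of any other cluster.
    """
    a = sum(abs(point - x) for x in own_others) / len(own_others)
    b = min(sum(abs(point - x) for x in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

cluster_b = [6.0, 7.0, 8.0]

# A point deep inside its own cluster: far from cluster_b, so s is well above 0.
core = silhouette_width(1.0, own_others=[2.0, 5.0], other_clusters=[cluster_b])

# A borderline point: closer on average to cluster_b than to its own cluster, so s < 0.
edge = silhouette_width(5.0, own_others=[1.0, 2.0], other_clusters=[cluster_b])

print(round(core, 3))  # 0.583
print(round(edge, 3))  # -0.429
```

Widths near 1 indicate a point well inside its cluster, widths near 0 indicate a point between clusters, and negative widths (such as the minimum of -0.08 above) indicate a point that sits closer to a neighbouring cluster.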
6 Profile clusters
Barchart showing %
| Cluster | # Consumers | Pos Impact | Environment | Saves Money | Confused by Bins | Know Local Centre | Avoiding Packaging is Difficult |
|---|---|---|---|---|---|---|---|
| C1 | 169 | 3.85 | 3.67 | 3.93 | 3.12 | 2.98 | 1.34 |
| C2 | 162 | 3.59 | 3.48 | 3.83 | 2.62 | 2.97 | 3.10 |
| C3 | 35 | 1.80 | 1.23 | 1.66 | 1.11 | 2.29 | 2.66 |
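When naming the clusters, it can help to check programmatically which cluster is highest or lowest on each statement. The Python sketch below uses the rounded centres from the table above; the helper name `extreme` is hypothetical.

```python
# Cluster centres copied from the profile table (rounded to 2 decimals).
centres = {
    "C1": {"pos_impact": 3.85, "environ": 3.67, "money": 3.93,
           "bins": 3.12, "local": 2.98, "avoid_waste": 1.34},
    "C2": {"pos_impact": 3.59, "environ": 3.48, "money": 3.83,
           "bins": 2.62, "local": 2.97, "avoid_waste": 3.10},
    "C3": {"pos_impact": 1.80, "environ": 1.23, "money": 1.66,
           "bins": 1.11, "local": 2.29, "avoid_waste": 2.66},
}

def extreme(variable, pick=max):
    """Return the cluster label with the highest (or lowest) centre on a variable."""
    return pick(centres, key=lambda label: centres[label][variable])

print(extreme("bins"))             # C1: most confused by bins
print(extreme("pos_impact", min))  # C3: least convinced recycling has a positive impact
print(extreme("avoid_waste"))      # C2: finds avoiding packaging most difficult
```

On these centres, C1 is the most pro-recycling and the most confused by bins, C3 scores lowest on all three attitude statements, and C2 reports the most difficulty avoiding packaging.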
7 Summary
C1: “Eco Believers, Need Guidance” Extremely pro-recycling, yet the most confused by bin rules (highest bins score, 3.12). Strategy: more advanced clarity aids, such as fridge magnets, QR codes on bins, searchable “where does this go?” guides, and targeted communications about the most frequently mis-sorted items.
C2: “Willing but Unsure” They support recycling but are not fully clear about which bin to use, and they report the most difficulty avoiding packaging (3.10). Strategy: a clarity campaign with bin labels, short videos, “this goes here” graphics, school/community seminars, and uniform labelling across bins.
C3: “Recycling Sceptics” They do not really believe recycling is beneficial (low pos_impact, environ and money scores). Strategy: simple “why it matters locally” messaging, local proof (before/after, community impact), and easy entry actions (one or two essential items to recycle first), delivered through trusted local voices.