Assignment 2

Author

Kayla Nolan

Introduction

This research aims to help a supermarket gain a better understanding of its customers so that it can redesign its existing offerings and marketing communications to improve sales.

1

library(cluster)
library(tidyverse)
library(kableExtra)
super <- read_csv("supermarket_customers.csv")

2

super_2 <- select(super, annual_income, spending_score)
super_2_scale <- scale(super_2)
# Compute distances on the scaled data so that both variables contribute equally
d1 <- dist(super_2_scale)

2a

The data does need to be scaled because annual_income and spending_score are measured on different scales. spending_score is a score calculated by the supermarket that reflects how much shopping the customer has done, with a higher score meaning higher spend, whereas annual_income is measured in € on a much wider scale. If the data were not scaled, the income variable would have a much bigger influence on the distance matrix, which would affect the clustering results. Scaling makes sure both variables contribute equally when calculating distances.
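The effect described above can be illustrated with a minimal sketch. The four customers below are made-up values chosen only to exaggerate the difference in scale; they are not taken from supermarket_customers.csv, and the names toy, d_raw and d_scaled are hypothetical.

```r
# Hypothetical toy customers (made-up values, not from supermarket_customers.csv)
toy <- data.frame(
  annual_income  = c(15000, 16000, 90000, 91000),
  spending_score = c(10, 90, 12, 88)
)

d_raw    <- as.matrix(dist(toy))         # Euclidean distances on the raw values
d_scaled <- as.matrix(dist(scale(toy)))  # each column centred, then divided by its sd

# Customers 1 and 2: similar income, very different spending score
# Customers 1 and 3: very different income, similar spending score
ratio_raw    <- d_raw[1, 2] / d_raw[1, 3]
ratio_scaled <- d_scaled[1, 2] / d_scaled[1, 3]

ratio_raw     # far below 1: the spending gap is almost invisible next to the income gap
ratio_scaled  # close to 1: both gaps now carry comparable weight
```

On the raw values the distance is driven almost entirely by income, so customers 1 and 2 look nearly identical even though their spending scores are far apart. After scaling, a large gap in either variable produces a comparable distance.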

3/4

h1 <- hclust(d1)
plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = "Rowv", labRow = FALSE, labCol = FALSE)

(4a)

The heat map does provide evidence of clustering structure in the data set. There are visible blocks of similar colours grouped together, which suggests that certain observations are more similar to each other. The dendrogram also shows branches where groups of observations merge at lower heights, indicating smaller distances within those groups. This suggests that the data contains distinct clusters rather than being randomly distributed.

5

clusters2 <- cutree(h1, k = 5)

sil2 <- silhouette(clusters2, d1)
summary(sil2)

Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters2, dist = d1) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08527  0.49132  0.59236  0.55299  0.66247  0.74442

The overall cluster analysis has a mean silhouette score of 0.55, which suggests a reasonably good clustering structure. The minimum individual width of -0.085 shows that at least one observation sits marginally closer to a neighbouring cluster than to its own.

Cluster 2 has the highest score (0.636), meaning this cluster is the strongest and most clearly defined.
Clusters 3, 4 and 5 have scores of 0.561, 0.512 and 0.551 respectively, which suggests that these clusters are reasonably well formed but not as strong as Cluster 2.
Cluster 1 has the second-lowest score (0.518), just above Cluster 4 (0.512); both remain above 0.5 and are therefore considered acceptable.
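The figures quoted above can also be pulled out of the silhouette object directly rather than read from the printed summary. The following is a minimal sketch on synthetic data (two made-up groups of points, not the supermarket file); the object names toy, hc, cl and sil are hypothetical.

```r
library(cluster)

set.seed(1)
# Hypothetical synthetic data: two groups of 20 points in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

d   <- dist(toy)
hc  <- hclust(d)          # complete linkage is the default
cl  <- cutree(hc, k = 2)
sil <- silhouette(cl, d)

# Average silhouette width per cluster and for the whole solution
cluster_avgs <- summary(sil)$clus.avg.widths
overall_avg  <- summary(sil)$avg.width

# Observations with a negative width sit closer to a neighbouring cluster
n_negative <- sum(sil[, "sil_width"] < 0)

cluster_avgs
overall_avg
n_negative
```

Counting the negative widths is a quick way to see how many observations are likely to be misassigned, which the mean alone does not reveal.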

6

super_clus <- super %>%
    mutate(clusters2 = clusters2) %>%
    mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
                                clusters2 == 2 ~ 'C2',
                                clusters2 == 3 ~ 'C3',
                                clusters2 == 4 ~ 'C4',
                                clusters2 == 5 ~ 'C5'))

(6a)

ggplot(super_clus, mapping = aes( x = annual_income, y = spending_score , colour = cluster)) +
  geom_point(size = 2) + 
  xlab("Annual Income") + 
  ylab("Spending Score") +
  ggtitle("Annual Income and Spend Score by Clusters")

The scatter plot shows that the clusters are reasonably well separated based on annual income and spending score. Cluster C4 represents customers with high income and high spending scores, while Cluster C5 represents high income but low spending scores. Cluster C1 appears to contain customers with low income and low spending scores whereas Cluster C2 includes lower income customers with high spending scores. Cluster C3 mainly represents customers with medium income and medium spending levels.

The clusters are fairly distinct although there is some slight overlap in the middle income range, particularly around Cluster C3. Overall, the scatter plot supports the hierarchical clustering results as the clusters form clear customer segments based on income and spending behaviour.

(6b)

super_clus_mean <- super_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            annual_income = mean(annual_income),
            spending_score = mean(spending_score),
            age = mean(age))
super_clus_mean

# A tibble: 5 × 5
  cluster num_custs annual_income spending_score   age
  <chr>       <int>         <dbl>          <dbl> <dbl>
1 C1             23          26.3           20.9  45.2
2 C2             21          25.1           80.0  25.3
3 C3             85          55.8           49.1  42.5
4 C4             39          86.5           82.1  32.7
5 C5             32          89.4           15.6  41

knitr::kable(select(super_clus_mean, cluster, num_custs, annual_income, spending_score,age), 
             digits = c(0,0,2,0,0),
             col.names = c("Cluster", "# Customers", "Avg. Annual Income (€)", "Avg. Spending Score", "Average Age"), 
             caption = "Number of Customers and Average Income, Spending Score and Age by Cluster") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "blue", color = "white")

Number of Customers and Average Income, Spending Score and Age by Cluster
Cluster	# Customers	Avg. Annual Income (€)	Avg. Spending Score	Average Age
C1	23	26.30	21	45
C2	21	25.10	80	25
C3	85	55.81	49	42
C4	39	86.54	82	33
C5	32	89.41	16	41

(6c)

ggplot(super_clus, aes(x = gender, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count", show.legend = FALSE)+
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of Customers") + 
  xlab("Gender") +
  ggtitle("Gender Distribution by Cluster")

7

From the tables and graphs above we can gather the following:

Cluster C1 consists of 23 customers with low average income (€26.30) and low spending scores (21). The average age is 45, making this one of the older groups. The cluster has a higher proportion of females (around 60%). Overall, this segment represents older, low-income and low-spending customers.
Cluster C2 contains 21 customers with low income (€25.10) but high spending scores (80). This group is much younger, with an average age of 25. Females slightly outnumber males. This segment represents younger customers who spend heavily despite having lower incomes.
Cluster C3 is the largest cluster with 85 customers. It has medium income (€55.81) and medium spending scores (49), with an average age of 42. The gender distribution is again slightly more female. This appears to be a middle-income, moderate-spending, middle-aged customer segment.
Cluster C4 includes 39 customers with high income (€86.54) and high spending scores (82). The average age is 33, making this a relatively young high-income group. The gender split is fairly balanced. This segment represents high-value customers who earn and spend a lot.
Cluster C5 consists of 32 customers with high income (€89.41) but very low spending scores (16). The average age is 41 and this cluster has a higher proportion of males. This group represents high-income customers who spend relatively little.
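The gender splits quoted above are read from the bar chart. If exact percentages are wanted, a table of row proportions is one way to get them. The sketch below uses a small made-up data frame as a stand-in for super_clus; the rows and the names toy, counts and props are hypothetical.

```r
# Hypothetical stand-in for super_clus (made-up rows)
toy <- data.frame(
  cluster = c("C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"),
  gender  = c("Female", "Female", "Female", "Male", "Male",
              "Female", "Male", "Male", "Male")
)

counts <- table(toy$cluster, toy$gender)   # customers per cluster and gender
props  <- prop.table(counts, margin = 1)   # proportions within each cluster (rows sum to 1)

round(100 * props, 1)                      # percentages by cluster
```

Using margin = 1 matters here: it gives the share of each gender within a cluster, which is the same quantity the faceted bar chart displays.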

8

Here are some marketing actions the supermarket could take to increase the spend from each cluster.

Cluster 1

This cluster could be named “Traditional Budgeters”. These are older customers with lower income who spend selectively. For marketing actions, the supermarket could:

Introduce loyalty discounts on essential items such as bread, milk or household staples.
Offer senior discount days or targeted coupons.
Promote value bundles and own-brand products.

This group is likely price-sensitive. Discounts on essentials and value-based promotions would encourage them to increase spend without feeling financially pressured.

Cluster 2

This cluster could be named “Young Big Spenders”. These are younger customers who spend heavily despite lower incomes. To market to this group, the supermarket could:

Promote student discounts or app rewards.
Use digital marketing and personalised app offers.
Offer limited time trendy product promotions.

This group is already willing to spend. Digital engagement, gamified rewards and trendy product promotions would increase loyalty and encourage repeat purchases.

Cluster 3

This cluster could be named “Steady Shoppers”. This is the largest group, with balanced income and spending. To market to this group, the supermarket could:

Introduce family meal deals and bulk buy discounts.
Offer loyalty points multipliers for weekly shops.
Send personalised offers based on shopping history.

This segment likely represents regular household shoppers. Encouraging larger baskets and repeat visits could significantly increase overall revenue due to the size of this cluster.

Cluster 4

This cluster could be named “Luxury Loyalists”. These are the high-income, high-value customers. To market to this group, the supermarket could:

Offer exclusive premium product lines.
Create VIP loyalty tiers with special rewards.
Promote organic or luxury products.

This group has both the ability and willingness to spend. Premium offerings and exclusivity would increase average transaction value and strengthen brand loyalty.

Cluster 5

This cluster could be named “Untapped High Earners”. These are the high-earning but low-spending customers. To target this group, the supermarket could:

Provide personalised high-quality product recommendations.
Offer targeted promotions on premium convenience products.
Introduce time-saving services (e.g., click & collect priority).

This group has strong purchasing power but low engagement. Encouraging convenience, quality, and tailored offers could encourage higher spending from this untapped segment.

Part B

In Part B, the research aims to identify customer segments with different recycling perceptions and habits, to help understand how best to design and target recycling campaigns.

1

recycling <- read_csv("recycling.csv")

2

recycling_2 <- select(recycling, pos_impact, environ, money, bins, local, avoid_waste)

3

The data does not need to be scaled before applying the K-means clustering algorithm because all of these variables are measured on the same 4-point scale. For each one, the consumer was given a statement and asked to indicate their level of agreement on a scale from 1 to 4, where 1 = Strongly Disagree and 4 = Strongly Agree.

4

set.seed(101)
kmeans1 <- kmeans(recycling_2, centers = 3)

5

d2 <- dist(recycling_2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d2)
summary(sil_kmeans1)

Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
 Cluster sizes and average silhouette widths:
      169       162        35 
0.2682631 0.2054575 0.3271905 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08256  0.14303  0.23963  0.24610  0.35704  0.52470

The overall cluster analysis has a mean silhouette score of 0.246, which indicates that the cluster structure is weak and may be artificial.

Cluster 3 has the highest score (0.327), meaning this cluster is the most clearly defined of the three clusters.
Cluster 1 has a score of 0.268, suggesting that its structure is not particularly strong and that some observations could almost belong to another cluster.
Cluster 2 has the lowest score (0.205), meaning this cluster is the weakest and least clearly defined, with a greater possibility of overlap with other clusters.

Overall, the relatively low silhouette scores suggest that the 3-cluster solution does not produce strongly separated groups.
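One way to check whether a different number of clusters would separate the groups better is to compare the average silhouette width across several values of k. The sketch below is illustrative only: it uses randomly generated 1-4 responses rather than recycling.csv, and the range of k and the nstart = 25 setting are assumptions, not part of the analysis above.

```r
library(cluster)

set.seed(101)
# Hypothetical survey-style data: 120 respondents, 6 statements on a 1-4 scale
toy <- matrix(as.numeric(sample(1:4, 120 * 6, replace = TRUE)), ncol = 6)
d   <- dist(toy)

ks <- 2:6
avg_width <- sapply(ks, function(k) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best fit
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_width) <- ks

avg_width  # one average silhouette width per candidate k
```

If every value of k gives a similarly low average width, that points to the data having little natural cluster structure rather than to a poor choice of k.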

6

recycling_clus <- recycling %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

recycling_clus_means <- recycling_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            pos_impact_m = mean(pos_impact),
            environ_m = mean(environ),
            money_m = mean(money),
            bins_m = mean(bins),
            local_m = mean(local),
            avoid_waste_m = mean(avoid_waste))


knitr::kable(select(recycling_clus_means, cluster, num_custs, pos_impact_m, environ_m,
money_m, bins_m, local_m, avoid_waste_m), 
             digits = c(0,0,0,0,0,0,0,0),
             col.names = c("Cluster", "# Customers", "Positive Impact","Environment", "Money", "Bins", "Local", "Avoids Waste"), 
             caption = "Customers' Average Responses to Survey Statements")  %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "blue", color = "white")

Customers' Average Responses to Survey Statements
Cluster	# Customers	Positive Impact	Environment	Money	Bins	Local	Avoids Waste
C1	169	4	4	4	3	3	1
C2	162	4	3	4	3	3	3
C3	35	2	1	2	1	2	3

recycling_clus$age <- factor(recycling_clus$age,
                             levels = c("18-24","25-34","35-44","45-54","55-64","65+"))

ggplot(recycling_clus, aes(x = age, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of Customers") + 
  xlab("Age Bracket") +
  ggtitle("Age Distribution by Cluster")

7

From the tables and graph above we can gather the following:

Cluster 1: This cluster has 169 customers, making it the largest segment. Customers in this cluster show strong agreement that recycling has a positive impact, is important for the environment, and saves money, with average scores of 4 for these statements. They show medium agreement regarding confusion about bins and knowing their local recycling centre, with both averaging 3, and strongly disagree that avoiding packaging waste is difficult, with a score of 1. The age distribution suggests this cluster is mainly middle-aged, particularly 45–54 and 35–44, indicating a group that is highly positive about recycling and relatively confident in their knowledge.
Cluster 2: This cluster has 162 customers, making it slightly smaller than Cluster 1. This group also shows strong agreement that recycling has a positive impact and saves money, with scores of 4, but slightly lower agreement that recycling improves the environment, with a score of 3. They report moderate confusion about recycling bins and moderate knowledge of local recycling centres, both with average scores of 3. Unlike Cluster 1, they moderately agree that avoiding packaging waste is difficult, with a score of 3. The age distribution is again concentrated in the 35–54 age groups, suggesting a generally positive but slightly less confident group regarding recycling practices.
Cluster 3: This cluster is the smallest segment with 35 customers and shows much lower agreement across most recycling attitudes. Customers in this cluster report lower belief that their recycling makes a positive impact, with a score of 2, and very low agreement that recycling improves the environment, with a score of 1. They also report high confusion about recycling bins, with a score of 1, and only moderate knowledge of local recycling centres, with a score of 2, but they do agree that avoiding packaging waste is difficult, with a score of 3. The age distribution shows a higher proportion of younger customers, particularly 18–24 and 35–44, suggesting this segment may be less engaged with recycling and less confident about recycling practices.

8

Here are some strategies that Carlow County Council could take to support and encourage each segment to recycle.

Cluster 1

This cluster already shows strong positive attitudes towards recycling and believes it benefits the environment and saves money. The Council should focus on maintaining and reinforcing these behaviours. Strategies could include providing regular updates on the environmental impact of recycling in the county, recognising community recycling achievements and offering incentive programmes or rewards for consistent recycling behaviour.

Cluster 2

This cluster also values recycling but shows more uncertainty about waste avoidance and recycling processes. The Council could support this group by providing clearer guidance on recycling rules, such as simplified bin labels, visual guides, and easy online resources explaining what can go into each bin. Awareness campaigns could also highlight practical ways to reduce packaging waste at home, helping this group turn positive attitudes into stronger recycling habits.

Cluster 3

This cluster shows lower belief in the benefits of recycling and higher confusion about recycling practices. For this group, the Council should prioritise education and awareness campaigns that explain the environmental importance of recycling and how their actions make a difference. Providing simple instructions on waste separation and targeted social media campaigns may help improve knowledge and engagement. Since this group also finds avoiding packaging waste difficult, the Council could promote practical tips for reducing waste and highlight local recycling facilities and services. As this cluster has the highest proportion of younger customers, targeting it with educational initiatives could be beneficial, as increasing their awareness now may encourage long-term recycling behaviours as they move through adulthood.