Segmentation report

Author

Rosemary Francis

1 Part A - Segmenting Supermarket Customers using Hierarchical Clustering

library(cluster)
library(tidyverse)
super <- read_csv("supermarket_customers.csv")
View(super)

super_2 <- select(super, annual_income, spending_score)
super_2_scale <- scale(super_2)
d2 <- dist(super_2_scale)

#2a.

Yes, the data need to be scaled. Annual income and spending score are not measured on a common Likert scale; they are in different units with different ranges, so the variable with the wider spread would otherwise dominate the distance calculation. We therefore scale the data before computing the distance matrix.
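To illustrate the effect of scaling, the sketch below uses three invented customers (the `toy` data frame and its values are assumptions for illustration, not rows from supermarket_customers.csv) and compares the Euclidean distances before and after `scale()`.

```r
# Hypothetical toy data: three invented customers, not from the real dataset
toy <- data.frame(annual_income  = c(15, 60, 135),  # assumed units: €000s
                  spending_score = c(40, 60, 50))

# Unscaled distances: the variable with the larger spread dominates
d_raw <- dist(toy)

# scale() centres each column to mean 0 and rescales it to standard deviation 1,
# so both variables contribute to the distance on a comparable footing
toy_scaled <- scale(toy)
d_scaled <- dist(toy_scaled)

colMeans(toy_scaled)        # approximately 0 for both columns
apply(toy_scaled, 2, sd)    # 1 for both columns
as.matrix(d_raw)
as.matrix(d_scaled)
```

Comparing the two distance matrices shows how the relative closeness of the customers can change once both variables are put on the same footing.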

#3.

h2 <- hclust(d2, method = 'ward.D')
plot(h2, hang = -1)

#4.

heatmap(as.matrix(d2), Rowv = as.dendrogram(h2), Colv = 'Rowv')

#4a.

Yes, the heatmap provides evidence of clustering: the lightly coloured blocks along the diagonal indicate groups of customers that are similar to each other.

#5.

clusters1 <- cutree(h2, k = 5)
sil1 <- silhouette(clusters1, d2)
summary(sil1)

Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d2) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5161937 0.6353734 0.5625990 0.5138767 0.5526370 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08419  0.49074  0.59411  0.55381  0.66283  0.74542

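The mean silhouette width for the overall clustering is 0.55381, which means a reasonable clustering structure has been found. All five clusters have average silhouette widths between 0.51 and 0.64, i.e. below 0.7, so each cluster is reasonable rather than strong. The minimum individual width of -0.08419 shows that at least one customer sits closer to a neighbouring cluster than to its own.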
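The reading of the silhouette widths can be made explicit with a small lookup. The sketch below assumes the commonly cited Kaufman and Rousseeuw bands (up to 0.25 no substantial structure, 0.26 to 0.50 weak, 0.51 to 0.70 reasonable, above 0.70 strong); `interpret_silhouette()` is a hypothetical helper written for this report, not a function from the cluster package.

```r
# Hypothetical helper (not part of the cluster package): maps an average
# silhouette width to the assumed Kaufman & Rousseeuw interpretation bands
interpret_silhouette <- function(width) {
  cut(width,
      breaks = c(-1, 0.25, 0.50, 0.70, 1),
      labels = c("no substantial structure",
                 "weak, could be artificial",
                 "reasonable structure",
                 "strong structure"),
      include.lowest = TRUE)
}

# Average widths copied from the summary(sil1) output above
cluster_widths <- c(C1 = 0.5161937, C2 = 0.6353734, C3 = 0.5625990,
                    C4 = 0.5138767, C5 = 0.5526370)

data.frame(width   = round(cluster_widths, 2),
           reading = interpret_silhouette(cluster_widths))

interpret_silhouette(0.55381)   # overall mean width
```

Under these assumed bands, every cluster and the overall mean fall into the "reasonable structure" band.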

#6.

super_clus <- super %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5'))

super_clus_means <- super_clus %>%
  group_by(cluster) %>%
  summarise(id = n(),
            #gender_m = mean(gender),
            age_m = mean(age),
            annual_income_m = mean(annual_income),
            spending_score_m = mean(spending_score))

#6a.

ggplot(super_clus, aes(x = annual_income,
                       y = spending_score,
                       colour = cluster)) +   # colour by cluster label, not customer id
  geom_point(size = 3) +
  labs(title = "Customer Segments Based on Income and Spending Score",
       x = "Annual Income (€000s)",
       y = "Spending Score",
       colour = "Cluster") +
  theme_minimal()

The scatterplot shows that customers with a spending score of around 50 tend to have an annual income of roughly €45-50k.

#6b.

knitr::kable(select(super_clus_means, cluster, id, annual_income_m, spending_score_m, age_m),
             digits = 0,
             col.names = c("Cluster", "# Customers", "Avg. Annual Income (€000s)", "Avg. Spending Score", "Avg. Age"),
             caption = "Number of Customers and Average Annual Income, Spending Score & Age by Cluster")

Number of Customers and Average Annual Income, Spending Score & Age by Cluster
Cluster	# Customers	Avg. Annual Income (€000s)	Avg. Spending Score	Avg. Age
C1	23	26	21	45
C2	21	25	80	25
C3	85	56	49	42
C4	39	87	82	33
C5	32	89	16	41

#6c.

ggplot(super_clus, aes(x = gender, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count") +
  facet_grid(~ cluster)

#7.

Cluster 1 contains 23 customers with an average annual income of €26k, an average spending score of 21 and an average age of 45. There are more females than males in cluster 1.

Cluster 2 contains 21 customers with an average annual income of €25k, an average spending score of 80 and an average age of 25. There are more females than males in cluster 2.

Cluster 3 contains 85 customers with an average annual income of €56k, an average spending score of 49 and an average age of 42. There are more females than males in cluster 3.

Cluster 4 contains 39 customers with an average annual income of €87k, an average spending score of 82 and an average age of 33. There are more females than males in cluster 4.

Cluster 5 contains 32 customers with an average annual income of €89k, an average spending score of 16 and an average age of 41. There are more males than females in cluster 5.
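The gender statements above are read off the bar chart; they could also be checked numerically. The following is a minimal sketch using an invented stand-in for `super_clus` (the `toy_clus` data frame and the "Female"/"Male" coding are assumptions, since the actual coding of `gender` is not shown in this report).

```r
# Hypothetical stand-in for super_clus: six invented customers
toy_clus <- data.frame(
  cluster = c("C1", "C1", "C1", "C2", "C2", "C2"),
  gender  = c("Female", "Female", "Male", "Male", "Male", "Female")
)

# Cross-tabulate cluster by gender, then convert the counts to row proportions
# (margin = 1 makes each cluster's row sum to 1)
gender_counts <- table(toy_clus$cluster, toy_clus$gender)
gender_props  <- prop.table(gender_counts, margin = 1)

round(gender_props, 2)
```

With the real data, the same two calls would be applied to `super_clus$cluster` and `super_clus$gender`, giving the share of each gender within each cluster.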

#8.

The supermarket should focus on value and affordability to encourage more spending. It should offer discount bundles on everyday grocery items and provide loyalty rewards or points for frequent purchases. It should also encourage continued engagement and loyalty, which helps retain active shoppers and maintain their high spending levels.

The segmentation analysis reveals five distinct customer groups with different income levels and spending behaviours. By tailoring marketing strategies to each segment, the supermarket can better target promotions, increase customer engagement, and ultimately improve overall sales.

I would call cluster 1 the Traditional low spenders, cluster 2 the Discount hunters, cluster 3 the Bulk buyers, cluster 4 the High-end luxury shoppers and cluster 5 the Intentional buyers.

2 Part B - Segmenting Consumers’ Recycling Perceptions & Habits using K-means

recycle <- read_csv("recycling.csv")

Rows: 366 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): age
dbl (7): id, pos_impact, environ, money, bins, local, avoid_waste

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(recycle)

#2.

recycle_2 <- select(recycle, pos_impact, environ, money, bins, local, avoid_waste)

#3.

No, the data do not need to be scaled, as all six items are measured on the same 10-point Likert scale.

set.seed(101)
kmeans1 <- kmeans(recycle_2, centers = 3)
kmeans1$cluster  #The assignment of each observation (i.e. customer) to each cluster.

  [1] 1 2 1 1 1 1 1 2 1 2 2 1 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1 2 1 1 2
 [38] 1 2 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 2 2 2 1 2 1 1 2 1 2 2 2 2 2 1 1 2 1 1
 [75] 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 2 1 1 2 1 2 2
[112] 2 1 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 2 3 3 3 3
[149] 2 1 1 1 2 1 1 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 3 3 3 3 3 1 2 2 2 2 1
[186] 2 1 1 2 1 1 1 3 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 3 1 2 2 1 2 2
[223] 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 1 1 2 1 1 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 1
[260] 2 1 3 1 1 2 2 1 2 1 2 2 3 2 2 1 1 3 1 2 2 1 1 1 2 2 2 3 3 3 3 3 3 1 1 2 1
[297] 2 1 2 2 1 1 2 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2
[334] 2 1 1 1 1 1 2 1 3 3 3 3 3 3 3 3 1 1 2 1 1 1 2 1 3 3 3 3 3 3 3 1 1

kmeans1$centers #The centroids for each cluster.

  pos_impact  environ    money     bins    local avoid_waste
1   3.846154 3.674556 3.928994 3.118343 2.976331    1.337278
2   3.586420 3.481481 3.827160 2.617284 2.969136    3.104938
3   1.800000 1.228571 1.657143 1.114286 2.285714    2.657143

kmeans1$size    #The number of observations in each cluster.

[1] 169 162  35

kmeans1$iter #The number of iterations it took to converge to the final clustering solution.

[1] 3

#5.

d1 <- dist(recycle_2)

sil_kmeans1 <- silhouette(kmeans1$cluster, d1)
summary(sil_kmeans1)

Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :
 Cluster sizes and average silhouette widths:
      169       162        35 
0.2682631 0.2054575 0.3271905 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08256  0.14303  0.23963  0.24610  0.35704  0.52470

The overall cluster analysis has a mean silhouette width of 0.24610, which means that no substantial structure has been found. Clusters 1 (0.27) and 3 (0.33) fall in the range where the structure is weak and could be artificial, while Cluster 2 (0.21) shows no substantial structure.
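Given the weak structure at three clusters, one possible follow-up is to compare the mean silhouette width across several values of k. The sketch below does this on simulated data (`toy_items` is an invented stand-in for `recycle_2`, and `mean_sil()` is a hypothetical helper); it is not a re-analysis of the survey, and the choice of `nstart = 25` is an assumption.

```r
library(cluster)  # for silhouette()

# Simulated stand-in for recycle_2: 60 respondents, 6 items
# (invented data with two well-separated groups, not the survey responses)
set.seed(101)
toy_items <- rbind(matrix(rnorm(30 * 6, mean = 2, sd = 0.3), ncol = 6),
                   matrix(rnorm(30 * 6, mean = 4, sd = 0.3), ncol = 6))
toy_dist <- dist(toy_items)

# Hypothetical helper: mean silhouette width of a k-means solution with k clusters
mean_sil <- function(k) {
  km <- kmeans(toy_items, centers = k, nstart = 25)
  mean(silhouette(km$cluster, toy_dist)[, "sil_width"])
}

ks <- 2:5
sil_by_k <- sapply(ks, mean_sil)
names(sil_by_k) <- ks

round(sil_by_k, 3)
ks[which.max(sil_by_k)]   # k with the highest mean silhouette width
```

On the real data, `toy_items` and `toy_dist` would be replaced by `recycle_2` and `d1`; a noticeably higher mean width at another k would suggest the three-cluster solution is worth revisiting.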

recycle_clus <- recycle %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

recycle_clus_means <- recycle_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            pos_impact_m = mean(pos_impact),
            environ_m = mean(environ),
            money_m = mean(money),
            bins_m = mean(bins),
            local_m = mean(local),
            avoid_waste_m = mean(avoid_waste))

recycle_clus_means

# A tibble: 3 × 8
  cluster num_custs pos_impact_m environ_m money_m bins_m local_m avoid_waste_m
  <chr>       <int>        <dbl>     <dbl>   <dbl>  <dbl>   <dbl>         <dbl>
1 C1            169         3.85      3.67    3.93   3.12    2.98          1.34
2 C2            162         3.59      3.48    3.83   2.62    2.97          3.10
3 C3             35         1.8       1.23    1.66   1.11    2.29          2.66

knitr::kable(select(recycle_clus_means, cluster, num_custs, pos_impact_m, environ_m, money_m, bins_m, local_m, avoid_waste_m),
             digits = 2,  # a single value applies to every column; c(0,0,2) was recycled and rounded the columns unevenly
             col.names = c("Cluster", "# Customers", "Positive Impact", "Environment", "Money", "Bins", "Local", "Avoid Waste"),
             caption = "Number of Customers and Average Responses by Cluster")

Number of Customers and Average Responses by Cluster
Cluster	# Customers	Positive Impact	Environment	Money	Bins	Local	Avoid Waste
C1	169	3.85	3.67	3.93	3.12	2.98	1.34
C2	162	3.59	3.48	3.83	2.62	2.97	3.10
C3	35	1.80	1.23	1.66	1.11	2.29	2.66

ggplot(recycle_clus, aes(x = age, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count") +
  facet_grid(~ cluster)

Cluster 1 has 169 customers. They believe their recycling efforts make a positive impact (3.85), strongly agree that their recycling is improving the environment (3.67) and strongly agree that it saves money (3.93). They are confused about what goes into the bins (3.12), they know where the waste centre is (2.98), and avoiding the waste that comes home from packaging is not that difficult for them (1.34).

Cluster 2 has 162 customers. They believe their recycling efforts make a positive impact (3.59), strongly agree that their recycling is improving the environment (3.48) and strongly agree that it saves money (3.83). They don’t seem to be too confused about what goes into the bins (2.62), they know where the waste centre is (2.97), and avoiding the waste that comes home from packaging is difficult for them (3.10).

Cluster 3 has 35 customers. They don’t believe their recycling efforts make a positive impact (1.80), strongly disagree that their recycling efforts improve the environment (1.23) and disagree that they are saving money (1.66). They are very confused about what goes into the bins (1.11), they don’t know where the local waste centre is (2.29), and avoiding the waste that comes home from packaging is difficult for them (2.66).

For cluster 1, I would call them the Environmentally Aware Heroes. Carlow County Council could focus on reinforcing and supporting their behaviour, for example by offering community recycling initiatives or challenges to maintain engagement.

For cluster 2, I would call them the Complacent Recyclers. Carlow County Council could help make recycling easier and more convenient for them. It could educate residents on how to reduce household packaging waste and provide simple recycling guides and reminders through social media or local campaigns.

For cluster 3, I would call them the Confused Recyclers. Carlow County Council should focus on education and greater environmental awareness for this cluster. It could run community workshops or awareness campaigns to build understanding, and display simple visual guides showing what goes into each bin.

Overall, segmentation analysis helps us profile each cluster and identify how recycling habits can be improved through targeted campaigns.