Clustering

Part A – Segmenting Supermarket Customers using Hierarchical Clustering

1

library(cluster)
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(knitr)

cus <- read_csv("supermarket_customers.csv")

Rows: 200 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (4): id, age, annual_income, spending_score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2

cus_2 <- select(cus, annual_income, spending_score)
scaled_data <- scale(cus_2)
d1 <- dist(scaled_data)

Scaling is necessary to ensure both variables contribute equally to the Euclidean distance calculation, leading to more meaningful customer segmentation.

3

h1 <- hclust (d1)

4

plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

The brighter coloured squares is an example of strong clustering

5

clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)

Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
       23        21        79        39        38 
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1401  0.4993  0.5965  0.5531  0.6610  0.7645

The overall score is 0.55 which gives us a reasonable clustering structure. Cluster 2 and 3 are the strongest with 0.61. Cluster 5 has the lowest score with 0.46

6

cus_clus <- cus %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5'))

A

ggplot(cus_clus, aes(x = annual_income, y = spending_score, color = cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Customer Clusters based on Annual Income and Spending Score",
       x = "Annual Income",
       y = "Spending Score") +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green", "purple", "orange"))

cluster 2 and 4 have the highest spending score. cluster 2 have a high spending score but low income. They should be targeted with special offers to get them to spend more despite a low income. Cluster 4 have a high spending score and high income so they should be targeted with highend products

B

cus_clus_means <- cus_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            annual_income = mean(annual_income),
            spending_score = mean(spending_score),
            age = mean(age))

cus_clus_means

# A tibble: 5 × 5
  cluster num_custs annual_income spending_score   age
  <chr>       <int>         <dbl>          <dbl> <dbl>
1 C1             23          26.3           20.9  45.2
2 C2             21          25.1           80.0  25.3
3 C3             79          54.4           50.2  42.9
4 C4             39          86.5           82.1  32.7
5 C5             38          87             18.6  40.4

kable(cus_clus_means)

cluster	num_custs	annual_income	spending_score	age
C1	23	26.30435	20.91304	45.21739
C2	21	25.09524	80.04762	25.33333
C3	79	54.41772	50.21519	42.88608
C4	39	86.53846	82.12821	32.69231
C5	38	87.00000	18.63158	40.39474

Having a high income does not mean a cluster will have a high spending score. This shows targeting high income customers is not the best for business

C

gender_cluster_summary <- cus_clus %>%
  group_by(cluster, gender) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(percentage = Count / sum(Count) * 100)


ggplot(gender_cluster_summary, aes(x = cluster, y = percentage, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender Distribution in Each Cluster",
       x = "Cluster",
       y = "Percentage of Customers",
       fill = "gender") +
  theme_minimal() +
  scale_fill_manual(values = c("Male" = "blue", "Female" = "pink"))

Women make up the higher percentage of supermarket shoppers.

7

Cluster 1 is people that have a low income and low spending score

Cluster 2 is people that have a low income and high spending score

Cluster 3 is people that have a medium income and medium spending score

Cluster 4 is people that have a high income and high spending score

Cluster 5 is people that have a high income and low spending score

8

• Special offers for low income customers

• Push premium products to high income

LowLow
HighLow
Mid
HighHigh
LowHigh

Part B – Segmenting Bank Customers using K-means

1

bank <- read_csv("bank_personal_loan.csv")

Rows: 5000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, age, experience, income, cc_avg, personal_loan

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2

bank_2 <- select(bank, age, experience, income, cc_avg)
scaled_bank2 <- scale(bank_2)

3

Scaling is necessary to ensure both variables contribute equally to the Euclidean distance calculation, leading to more meaningful customer segmentation.

4

set.seed(101)
kmeans1 <- kmeans(scaled_bank2, centers = 3)

5

d1 <- dist(scaled_bank2)

sil_kmeans1 <- silhouette(kmeans1$cluster, d1)
summary(sil_kmeans1)

Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.4344352 0.4359975 0.2505801 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1365  0.2776  0.4538  0.4056  0.5614  0.6311

The overall score is 0.40 which gives us a weak clustering structure. Cluster 1 and 2 are the strongest with 0.42 Cluster 3 has the lowest score with 0.25

6

  bank_clus <- bank_2 %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

bank_clus_means <- bank_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            avg_age = mean(age),
            avg_experience = mean(experience),
            avg_income = mean(income),
            avg_cc = mean(cc_avg))

bank_clus_means

# A tibble: 3 × 6
  cluster num_custs avg_age avg_experience avg_income avg_cc
  <chr>       <int>   <dbl>          <dbl>      <dbl>  <dbl>
1 C1           2029    35.1           9.87       60.3   1.37
2 C2           2168    55.5          30.3        59.1   1.36
3 C3            803    43.6          18.6       148.    4.96

kable(bank_clus_means)

cluster	num_custs	avg_age	avg_experience	avg_income	avg_cc
C1	2029	35.12173	9.87038	60.28339	1.365155
C2	2168	55.54566	30.25646	59.06550	1.356504
C3	803	43.59527	18.55542	147.57410	4.955031

7

Cluster 1 is made up of young people which tend to have little savings Cluster 2 have a high income which leads to high credit card

8

Cluster 1 would be most likely to take out a loan as they are young which may mean they have ess savings. Young people are also more likely to need money quick for unexpected expenses