library(tidyverse)
library(cluster)
library(knitr)
library(kableExtra)
Segmentation Analysis
Part A - Hierarchical Clustering
1
supermarket <- read_csv("supermarket_customers.csv")
2
supermarket_2 <- select(supermarket, annual_income, spending_score)
supermarket_2_scale <- scale(supermarket_2)
d1 <- dist(supermarket_2_scale)
2a
Although the two variables cover similar ranges (annual_income runs from 15 to 137 and spending_score from 1 to 99), they are not on an identical scale. Scaling the data stops the variable with the wider range from having more influence on the distance calculation and so biasing the clustering results. For the purposes of this assignment, I have therefore chosen to scale the data.
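As a rough illustration of why scaling matters for distance-based clustering, the sketch below uses three made-up customers (not rows from the supermarket data) and compares Euclidean distances before and after `scale()`. The column names mirror the assignment; the values are assumptions chosen only to span the ranges quoted above.

```r
# Three hypothetical customers; values are illustrative, not from the data set.
toy <- cbind(annual_income  = c(15, 60, 137),
             spending_score = c(1, 50, 99))

# scale() centres each column on 0 and divides by its standard deviation.
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1,
# so neither variable dominates the Euclidean distance.
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))

# Compare the pairwise distances on the raw and scaled values.
print(dist(toy))
print(dist(toy_scaled))
```

On the raw values the income differences contribute more to each distance simply because income spans a wider range; after scaling both variables contribute on the same footing.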
3
h1 <- hclust(d1)
4
plot(h1, hang = -1)
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')
4a
The light yellow blocks in the heat map indicate groups of customers who are very similar to each other, or strong clusters.
From the above heat map we can see that there are three obvious blocks of light yellow in the bottom right corner, along the diagonal. Therefore we can conclude that there is a good clustering structure within the data.
5
clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 79 39 38
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1401 0.4993 0.5965 0.5531 0.6610 0.7645
The overall analysis has a mean silhouette width of 0.5531, which indicates a reasonable clustering structure. The average silhouette widths for the individual clusters vary: clusters 2 and 3 show the most convincing structure (about 0.62 and 0.61), clusters 1 and 4 sit only just above 0.5, and cluster 5 is the weakest at 0.46. The weaker clusters could be artificial, and therefore acting on them may be less effective. The minimum individual width of -0.1401 also shows that at least one customer sits closer to a neighbouring cluster than to its own.
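The silhouette summary above is for a single choice of k = 5. A minimal sketch of how the average silhouette width could be compared across several values of k, and how negative widths could be counted, is given below. It runs on simulated data (an assumption, since the supermarket file is not reproduced here), and the names `toy`, `d_toy`, `h_toy` and `sil_toy` are hypothetical.

```r
library(cluster)

set.seed(42)  # simulated data, so the numbers are illustrative only
# Two well-separated hypothetical groups in two dimensions (30 points each).
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))
d_toy <- dist(scale(toy))
h_toy <- hclust(d_toy)

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  mean(silhouette(cutree(h_toy, k = k), d_toy)[, "sil_width"])
})
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))

# For one solution: per-cluster averages and the count of negative widths,
# which flag observations that sit closer to a neighbouring cluster.
sil_toy <- silhouette(cutree(h_toy, k = 2), d_toy)
print(tapply(sil_toy[, "sil_width"], sil_toy[, "cluster"], mean))
print(sum(sil_toy[, "sil_width"] < 0))
```

Applied to the assignment data, the same loop over `cutree(h1, k = k)` and `d1` would show whether k = 5 gives a higher average width than nearby alternatives.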
6a
supermarket_clus <- supermarket %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5'))
ggplot(data = supermarket_clus) +
  geom_point(mapping = aes(x = annual_income, y = spending_score, colour = factor(clusters1))) +
  labs(title = "Spending Score And Annual Income", x = "Annual Income (€000s)", y = "Spending Score", colour = "Cluster")
The clusters separate clearly on spending score and annual income.
Cluster 1 has a low spending score and a low annual income.
Cluster 2 has a high spending score but a low annual income.
Cluster 3 has a mid-range spending score and a mid-range annual income.
Cluster 4 has a high spending score and a high annual income.
Cluster 5 has a low spending score but a high annual income.
6b
supermarket_clus_means <- supermarket_clus %>%
group_by(cluster) %>%
summarise(num_custs = n(),
annual_m = mean(annual_income),
spending_m = mean(spending_score),
age_m = mean(age))
knitr::kable(select(supermarket_clus_means, cluster, num_custs, annual_m, spending_m, age_m),
digits = c(0,0,0,2,0),
col.names = c("Cluster", "Number of Customers", "Annual Income (€000s)", "Spending Score", "Age"),
caption = "Cluster Details") %>%
kable_styling(full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "black", underline = TRUE)%>%
column_spec(1, bold = TRUE)

| Cluster | Number of Customers | Annual Income (€000s) | Spending Score | Age |
|---|---|---|---|---|
| C1 | 23 | 26 | 20.91 | 45 |
| C2 | 21 | 25 | 80.05 | 25 |
| C3 | 79 | 54 | 50.22 | 43 |
| C4 | 39 | 87 | 82.13 | 33 |
| C5 | 38 | 87 | 18.63 | 40 |
6c
# Plot the gender split directly from the clustered data. Mapping x to the
# `cluster` column (rather than the global `clusters1` vector) keeps each
# customer's gender matched to their own cluster.
ggplot(data = supermarket_clus) +
  geom_bar(mapping = aes(x = cluster, fill = gender), position = "fill") +
  labs(title = "Gender Breakdown in Clusters",
       x = "Cluster",
       y = "Proportion of Customers")
7
Cluster 1: This cluster is generally older female customers. They have lower incomes and lower spending tendencies. This cluster is the second smallest with 23 customers.
Cluster 2: This cluster is the youngest, with a relatively even gender breakdown. These customers have low incomes but high spending scores. This cluster is the smallest with 21 customers.
Cluster 3: This cluster is made up of older customers, slightly skewed female. These customers have mid-range incomes and mid-range spending scores. This cluster is the largest with 79 customers.
Cluster 4: This cluster is mainly middle-aged males. These customers have high incomes and the highest spending scores. This cluster is the second largest with 39 customers.
Cluster 5: This cluster is older with a mixed gender spread. These customers have high incomes and low spending scores. This cluster is middle-sized with 38 customers.
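The gender descriptions above are read from a stacked bar chart. As a hedged sketch, the same split can be tabulated as within-cluster shares, which makes statements such as "slightly skewed female" easier to check. The data below is made up for illustration; only the column names `cluster` and `gender` come from the assignment.

```r
library(dplyr)

# Hypothetical customers; cluster labels and genders are made up.
toy <- tibble(
  cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  gender  = c("Female", "Female", "Female", "Male", "Female", "Male")
)

gender_share <- toy %>%
  count(cluster, gender) %>%        # customers per cluster and gender
  group_by(cluster) %>%
  mutate(share = n / sum(n)) %>%    # share within each cluster
  ungroup()

print(gender_share)
```

Replacing `toy` with `supermarket_clus` would give the shares behind each bar in the chart.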
8
Cluster 1: Low Ladies
Cluster 2: Poor Spenders
Cluster 3: Mid Majority
Cluster 4: Big Spenders
Cluster 5: Rich Savers
Firstly, the supermarket should look to focus the majority of their marketing efforts on cluster 3. This cluster has medium spending habits making it easier to increase their spending. It is also one of the clusters with the strongest cluster structure.
The supermarket should use their email marketing as well as a loyalty card and app to offer deals and promote targeted products based on cluster segments. They should use hyperpersonalisation within these to offer products within the correct price range for each cluster.
For Cluster 1, the “Low Ladies”, the supermarket should provide loyalty card rewards or in-store promotions that focus on products aimed at women. Promoting slightly more luxurious versions of everyday products, such as slightly more expensive shampoo, conditioner and face creams, could increase typical spend. These older women could also be mothers, so promoting family-based meals and family products could also increase spending.
Cluster 2, “Poor Spenders”, is the poorer, younger segment. It is likely made up mainly of students and people starting their careers. This cluster has a low income but a high spending score, indicating that the supermarket should promote its more expensive products. Fresh fruit, slightly more expensive cuts of meat and semi-luxury products would work well in this segment as younger customers try to “treat themselves”. These tactics should not be overused, as these lower-income customers may switch supermarket if they are targeted with overly expensive products. This cluster could also benefit from the promotion of ready-made meals, branded snacks and alcohol.
Cluster 3, “Mid Majority”, is the largest portion of customers and the segment that the supermarket should focus on. These customers are mixed gender with medium income and spending habits. Offering rewards that are just above their typical spend could increase their shopping (10% off if they spend 20% more). The supermarket should also look at promoting the benefits of a loyalty card or scheme, encouraging customers to “save while they shop”, by collecting points when they spend more money.
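To make the "10% off if they spend 20% more" offer concrete, the short calculation below works through the revenue effect for one customer. The typical basket value of €100 is a hypothetical figure chosen for easy arithmetic, not a number from the data.

```r
# Hypothetical typical basket value in euro; an assumption for illustration.
typical_spend <- 100

# Offer from the text: 10% off if the customer spends 20% more.
basket_before_discount <- typical_spend * 1.20
amount_paid            <- basket_before_discount * (1 - 0.10)

# Net change in revenue per customer relative to the typical spend.
net_uplift <- amount_paid / typical_spend - 1

print(amount_paid)  # 108
print(net_uplift)   # 0.08, i.e. an 8% increase
```

Under these assumptions the customer pays €108 instead of €100, so the offer still raises revenue by 8% per qualifying shop, before considering margins.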
Cluster 4, “Big Spenders”, is the highest-spending segment and has a high income. Targeting these customers with high-ticket items would work best. We know they can afford more expensive produce, so the supermarket should promote fresh, premium cuts of meat and fish and branded products.
Cluster 5, “Rich Savers”, is the high-income, low-spend segment. These customers should be targeted with branded products that could subtly increase their spending without over-promoting to them. Promoting deals like Buy One Get One Half Price could also increase their overall consumption of products over time, leading to increased spending in the future.
Part B
1
bank <- read_csv("bank_personal_loan.csv")
Rows: 5000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, age, experience, income, cc_avg, personal_loan
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2
bank_2 <- select(bank, age, experience, income, cc_avg)
3
bank_2_scale <- scale(bank_2)
Scaling is necessary for this data set because the ranges differ among the variables: age runs from 23 to 67, experience starts at -3, and income goes into the hundreds.
4
set.seed(101)
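# Note: k-means and the distances below use the unscaled bank_2, so income (the
# widest-ranging variable) dominates and the printed output reflects that.
# Using bank_2_scale instead would apply the scaling above but change the results.
kmeans1 <- kmeans(bank_2, centers = 3)
5
d1 <- dist(bank_2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d1)
summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :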
Cluster sizes and average silhouette widths:
2189 1889 922
0.4733037 0.3846869 0.4700824
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.002796 0.331929 0.496459 0.439230 0.576024 0.654877
The overall silhouette score is 0.439, which indicates a weak structure that could be artificial.
The individual clusters also all score below 0.5 (0.47, 0.38 and 0.47), so each is weak. The clusters may therefore not reflect genuinely distinct customer groups and may not be reliable for marketing purposes.
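The k-means call above is run on `bank_2` rather than on `bank_2_scale`. As a hedged sketch of how the two choices could be compared, the block below simulates a small data set whose column names mirror the bank data and computes the average silhouette width for both versions. The values are randomly generated (an assumption, not the bank file), and `nstart = 25` is an added choice that is not in the original call.

```r
library(cluster)

set.seed(101)
n <- 300
# Simulated customers; distributions are arbitrary and purely illustrative.
sim <- data.frame(
  age    = round(runif(n, min = 23, max = 67)),
  income = round(rexp(n, rate = 1 / 70) + 8),
  cc_avg = round(rexp(n, rate = 1 / 2), 1)
)

# k-means on the raw values: income has the widest range and drives the result.
fit_raw <- kmeans(sim, centers = 3, nstart = 25)
sil_raw <- mean(silhouette(fit_raw$cluster, dist(sim))[, "sil_width"])

# k-means on standardised values: each variable contributes on the same footing.
sim_scale  <- scale(sim)
fit_scaled <- kmeans(sim_scale, centers = 3, nstart = 25)
sil_scaled <- mean(silhouette(fit_scaled$cluster, dist(sim_scale))[, "sil_width"])

print(c(raw = sil_raw, scaled = sil_scaled))
```

The two silhouette values are not directly comparable as a measure of which version is "better", because they are computed on different distance matrices; the point is that the cluster assignments, and so the interpretation, can change once the variables are scaled.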
6
bank_clus <- bank %>%
mutate(clusters1 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))
bank_clus_means <- bank_clus %>%
group_by(cluster) %>%
summarise(num_custs = n(),
age_m = mean(age),
experience_m = mean(experience),
income_m = mean(income),
spend_m = mean(cc_avg))
knitr::kable(select(bank_clus_means, cluster, num_custs, age_m, experience_m, income_m, spend_m),
digits = c(0, 0, 2, 0, 0, 2),  # one entry per column, written out rather than recycled
col.names = c("Cluster", "Number of Customers", "Average Age", "Average Experience", "Average Income", "Average Credit Card Spend (€)"),
caption = "Cluster Details") %>%
kable_styling(full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "black", underline = TRUE) %>%
column_spec(1, bold = TRUE)

| Cluster | Number of Customers | Average Age | Average Experience | Average Income | Average Credit Card Spend (€) |
|---|---|---|---|---|---|
| C1 | 2189 | 45.98 | 21 | 34 | 1.04 |
| C2 | 1889 | 45.04 | 20 | 81 | 1.95 |
| C3 | 922 | 44.43 | 19 | 152 | 4.04 |
7
Cluster 1: This is the largest cluster with 2,189 customers. It is marginally the oldest, with the most work experience, and has the lowest income and the lowest average credit card spend.
Cluster 2: This cluster is mid-sized with 1,889 customers. It sits in the middle on every measure.
Cluster 3: This cluster is the smallest with 922 customers. It is marginally the youngest segment, with the least work experience, and has the highest income and the highest average credit card spend.
8
bank_clus_loan <- bank_clus %>%
group_by(cluster) %>%
summarise(past_loans = mean(personal_loan)) %>%
mutate(past_loans = past_loans*100)
knitr::kable(select(bank_clus_loan, cluster, past_loans),
digits = c(0,0),
col.names = c("Cluster", "Past Loans (%)"),
caption = "Cluster Details") %>%
kable_styling(full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "black", underline = TRUE) %>%
column_spec(1, bold = TRUE)

| Cluster | Past Loans (%) |
|---|---|
| C1 | 0 |
| C2 | 5 |
| C3 | 41 |
From the above table we can see that Cluster 3 has by far the highest percentage of customers with a previous personal loan (41%, compared with 5% for Cluster 2 and about 0% for Cluster 1). This suggests that customers in Cluster 3 are the most likely to take out a loan in the future.
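The percentages in the table rely on the mean of a 0/1 indicator being the proportion of ones. A minimal sketch of that calculation is shown below on made-up rows, assuming `personal_loan` is coded 1 for customers who took a loan and 0 otherwise (the file only confirms that the column is numeric). The `n_loans` column is an illustrative addition.

```r
library(dplyr)

# Hypothetical customers; personal_loan is assumed to be 1 = took a loan, 0 = did not.
toy <- tibble(
  cluster       = c("C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"),
  personal_loan = c(0, 0, 0, 1, 1, 1, 0, 1)
)

loan_rate <- toy %>%
  group_by(cluster) %>%
  summarise(n_loans        = sum(personal_loan),         # count of past loans
            past_loans_pct = mean(personal_loan) * 100)  # share of customers, in %

print(loan_rate)
```

Reporting the count alongside the percentage helps when cluster sizes differ widely, as they do here.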