library(cluster)
library(tidyverse)Assignment 2 Version 1
Part A
Load tidyverse
Q1
Import the file into R
supermarket_customers <- read_csv("supermarket_customers.csv")
View(supermarket_customers)Q2
supermarket_customers_2 <- select(supermarket_customers, annual_income, spending_score)
supermarket_customers_2_scale <- scale(supermarket_customers_2)
d1 <- dist(supermarket_customers_2)Q2 (a)
The data requires to be scaled due to annual_income could range from €0 to €1000, while spending_score as we can see only ranges from 0-100 therefore, in order to get unbias results the data should be scaled.
Q3
h1 <- hclust(d1)Q4
plot(h1, hang = -1)heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)Q4 (a)
The heatmap above gives evidence of custering structure as it contains many light coloured boxes going digonally in the heatmap.
Q5
clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 85 39 32
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08527 0.49132 0.59236 0.55299 0.66247 0.74442
- The overall cluster analysis has a mean Silhouette score of: 0.7243 which shows a strong clustering structure.
- Cluster 1: has a score of 0.52, which shows a strong clustering structure.
- Cluster 2: has a score of 0.64, which shows a strong clustering structure.
- Cluster 3: has a score of 0.56, which shows a strong clustering structure.
- Cluster 4: has a score of 0.51, which shows a reasonable clustering structure.
- Cluster 5: has a score of 0.55 which shoes a reasonable clustering structure.
Q6
supermarket_cluster <- supermarket_customers %>%
mutate(cluster = case_when(
clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5'
))Q6(a)
ggplot(supermarket_cluster, aes(x = annual_income, y = spending_score, color = cluster)) +
geom_point(size = 4) +
labs(title = "Scatterplot of Annual Spend vs. Spending Score",
x = "Annual Spend",
y = "Spending Score",
color = "Cluster") +
theme_minimal() - Cluster 1 (red): Contains customers with low annual spend and low spending scores.
- Cluster 2(yellow): Contains customers with high annual spend and high spending scores.
- Cluster 3(green): Contains customers with moderate spending and average spending scores.
- Cluster 4(blue):Contains customers with moderate to high spending with a high spending score.
- Cluster 5(pink): Contains customers with a moderate to high spending with a low spending score.
Q6 (b)
cluster_summary<- supermarket_cluster %>%
group_by(cluster) %>%
summarise(
number_of_customers = n(),
avg_annual_income = mean(annual_income, na.rm = TRUE),
avg_spending_score = mean(spending_score, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE))
knitr::kable(cluster_summary,
digits = c(0,0,2,2,2),
col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Age"),
caption = "Segment Summary")| Cluster | # Customers | Avg. Annual Income | Avg. Spending Score | Age |
|---|---|---|---|---|
| C1 | 23 | 26.30 | 20.91 | 45.22 |
| C2 | 21 | 25.10 | 80.05 | 25.33 |
| C3 | 85 | 55.81 | 49.13 | 42.48 |
| C4 | 39 | 86.54 | 82.13 | 32.69 |
| C5 | 32 | 89.41 | 15.59 | 41.00 |
Q6 (c)
gender_percentage <- supermarket_cluster %>%
group_by(cluster, gender) %>%
summarise(count = n()) %>%
group_by(cluster) %>%
mutate(percentage = count / sum(count) * 100)
ggplot(gender_percentage, aes(x = factor(cluster), y = percentage, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Percentage of Males and Females in Each Cluster",
x = "Cluster",
y = "Percentage",
fill = "Gender"
) + theme_minimal()Q7
- Cluster 1: Consists of 23 customers with a mean age of 45 and are mostly women with an average income of €26,300 and an average spending score of 20.92. This cluster contains customers who do not frequently make large purchases.
- Cluster 2: Consists of 21 customers with a mean age of 25 and are mostly female with an average income of €25,100 and an average spending score of 80.05. This cluster contains customer who spend regularly.
- Cluster 3: Consists of 85 customers with a mean age of 42 and are mostly female with an average income of €55,810 and an average spending score of 49.13. This cluster contains customers who have average income and spending behaviours.
- Cluster 4: Consists of 39 customers with a mean age of 32 and are mostly female with an average income of €86,540 and an average spending score of 82.13. This cluster contains customers with more disposable income.
- Cluster 5: Consists of 32 customers with a mean age of 41 and are mostly male with an average income of €89,410 and an average spending score of 15.59. This cluster contains customers who have a high income with a low spending score showing they do not spend on discretionary items and probably only pay for certain services and goods.
Q8
Marketing Actions and Catchy Cluster Names:
- Cluster 1: “Value seekers” Profile: Customers with lower annual incomes and low spending scores. Marketing Actions:
Discounts & Promotions: Offer regular promotions and discounts, this could encourage more frequent purchases.
- Cluster 2: “High life hustlers” Profile: low-income individuals with high spending scores. Marketing Actions:
Post a favorite product of the week: promote on socials, posters in store, and on app the favourite product of the week to entice customers in this cluster to purchase.
- Cluster 3: “The practical class” Profile: Customers with average incomes and moderate spending scores. Marketing Actions:
Value-Added Offers: Offer discounts on frequently purchased items or on different brand of the product and offer discounts such as “Buy 2 Get 1 Free”.
- Cluster 4: “The high rollers” Profile: High-income individuals with high spending scores. Marketing Actions:
Exclusive & Premium Products: Promote high-end or luxury products such as organic or specialty foods, gourmet items, or high-quality wines, aligning with their purchasing behavior.
- Cluster 5: “Wealth watchers” Profile: individuals with higher income but lower spending scores compared to income. Marketing Actions:
Value-Driven Campaigns: Promote the idea of “getting more for less,” emphasizing the value of products and offering coupons or bundle deals.
#Part B
bank_personal_loan <- read_csv("bank_personal_loan.csv")
View(bank_personal_loan)Q2
bank_personal_loan2 <- select(bank_personal_loan, age, income, experience, cc_avg)Q3
bank_personal_loan2_scale <- scale(bank_personal_loan2)
d2 <- dist(bank_personal_loan2)Yes, scaling the data is necessary due to age is measured 0-100 along with experience, while income is measure from €0-€1000.
Q4
set.seed(101)
kmeans1 <- kmeans(bank_personal_loan2_scale, centers = 3)Q5
d2 <- dist(bank_personal_loan2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d2)
summary(sil_kmeans1)Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
Cluster sizes and average silhouette widths:
2029 2168 803
0.1899016 0.1531329 0.4718464
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6774 0.1229 0.2487 0.2192 0.3453 0.6855
- Overall silhouette score is: 0.2487 meaning there is no real structure found.
- Cluster 1: has a score of 0.19 showing the no structure is found.
- Cluster 2: has a score of 0.15 showing no structure is found.
- Cluster 3: has a score of 0.47 showing the structure is weak.
Q6
bank_personal_cluster <- bank_personal_loan2 %>%
mutate(clusters2 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
clusters2 == 2 ~ 'C2',
clusters2 == 3 ~ 'C3'))
cluster_summary <- bank_personal_cluster %>%
group_by(cluster) %>%
summarise(
num_customers = n(),
avg_age = mean(age, na.rm = TRUE),
avg_experience = mean(experience, na.rm = TRUE),
avg_income = mean(income, na.rm = TRUE),
avg_cc_spend = mean(cc_avg, na.rm = TRUE)
)
knitr::kable(cluster_summary,
digits = c(0, 0, 2, 2, 2, 2),
col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. CC Spend"),
caption = "Customer Segments Summary for K-Means Clustering")| Cluster | # Customers | Avg. Age | Avg. Experience | Avg. Income | Avg. CC Spend |
|---|---|---|---|---|---|
| C1 | 2029 | 35.12 | 9.87 | 60.28 | 1.37 |
| C2 | 2168 | 55.55 | 30.26 | 59.07 | 1.36 |
| C3 | 803 | 43.60 | 18.56 | 147.57 | 4.96 |
Q7
- Cluster 1: Contains the second most number of customers in this group of clusters, but has the youngest age group. With little experience and little spending and income.
- Cluster 2: Contains the most customers with moderate income and spending and the age group being the oldest and the most experienced group.
- Cluster 3: Contains the lowest number of customers being the middle for age and experience yet being the highest group of income and spending.
##Q8
bank_personal_loan_cluster <- bank_personal_loan %>%
mutate(clusters2 = kmeans1$cluster) %>%
mutate(cluster = case_when(
clusters2 == 1 ~ 'C1',
clusters2 == 2 ~ 'C2',
clusters2 == 3 ~ 'C3'
))
loan_profile <- bank_personal_loan_cluster %>%
group_by(cluster) %>%
summarise(
number_of_customers = n(),
num_loan_taken = sum(personal_loan, na.rm = TRUE),
loan_percentage = (sum(personal_loan, na.rm = TRUE) / n()) * 100
)
knitr::kable(loan_profile,
digits = 2,
col.names = c("Cluster", "# Customers", "# Loans Taken", "Percentage with Loan"),
caption = "Personal Loan Profile by Cluster")| Cluster | # Customers | # Loans Taken | Percentage with Loan |
|---|---|---|---|
| C1 | 2029 | 69 | 3.40 |
| C2 | 2168 | 85 | 3.92 |
| C3 | 803 | 326 | 40.60 |
ggplot(loan_profile, aes(x = cluster, y = loan_percentage, fill = cluster)) +
geom_bar(stat = "identity") +
labs(title = "Personal Loans Taken by Cluster",
x = "Customer Segment",
y = "Percentage of Customers with a Loan") +
theme_minimal()- Cluster 1: is least likely to take out a loan but could take out a loan in the future.
- Cluster 2: Most customers in this cluster are in a stable income and spending phase therefore, may or may not take out a loan in the future.
- Cluster 3: are most likely to take out a loan due to already taken out loans, making them more easy to apply for and more likely to qualify for loans in the future.