library(cluster)
library(tidyverse)
supermarket_cust <- read_csv("supermarket_customers.csv")
bank_loans <- read_csv("bank_personal_loan.csv")Assignment 2
Part A Q 1
Q 2
customer_cluster <- select(supermarket_cust, annual_income, spending_score)
dist1 <- dist(customer_cluster)Q 3
h1 <- hclust(dist1)Q 4
plot(h1, hang = -1)heatmap(as.matrix(dist1), Rowv = as.dendrogram(h1), Colv = 'Rowv')Q 5
clusters1 <- cutree(h1, k = 5)
sill <- silhouette(clusters1, dist1)
summary(sill)Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = dist1) :
Cluster sizes and average silhouette widths:
23 21 85 39 32
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08527 0.49132 0.59236 0.55299 0.66247 0.74442
I believe there is evidence of Clustering, as the silhouette widths for each cluster is over 0.51, meaning a reasonable clustering structure has been found
Q 6
clusters_customer <- supermarket_cust %>%
mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5',))Q 6a
ggplot(clusters_customer, aes(x = annual_income, y = spending_score, color = cluster)) +
geom_point(size = 4) +
labs(title = "Scatterplot of Annual Spend vs. Spending Score",
x = "Annual Spend",
y = "Spending Score",
color = "Cluster") +
theme_minimal() Q 6b
cluster_summary<- clusters_customer %>%
group_by(cluster) %>%
summarise(
number_of_customers = n(),
avg_annual_income = mean(annual_income, na.rm = TRUE),
avg_spending_score = mean(spending_score, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE))
knitr::kable(cluster_summary,
digits = c(0,0,2,2,2),
col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Age"),
caption = "Segment Summary")| Cluster | # Customers | Avg. Annual Income | Avg. Spending Score | Age |
|---|---|---|---|---|
| C1 | 23 | 26.30 | 20.91 | 45.22 |
| C2 | 21 | 25.10 | 80.05 | 25.33 |
| C3 | 85 | 55.81 | 49.13 | 42.48 |
| C4 | 39 | 86.54 | 82.13 | 32.69 |
| C5 | 32 | 89.41 | 15.59 | 41.00 |
Q 6c
percent_gender <- clusters_customer %>%
group_by(cluster, gender) %>%
summarise(count = n()) %>%
group_by(cluster) %>%
mutate(percentage = count / sum(count) * 100)
ggplot(percent_gender, aes(x = factor(cluster), y = percentage, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Percentage of Males and Females in Each Cluster",
x = "Cluster",
y = "Percentage",
fill = "Gender"
) + theme_minimal()Q 7 Cluster 1: This cluster has the highest average age, with an average Annual income of 26.30, and the second lowest spending score of 20.91. This suggests that they don’t like to spend what little money they have.
Cluster 2: The youngest average cluster, and the lowest average Annual income, but a very high spending score of 80.05, which suggests that despite not earning a lot, the younger customers still like to spend their money.
Cluster 3: This cluster has the most people, annual income of 55.81, spending score of 49.13, meaning that this cluster earns decent money, and spend most but not all of it in store
Cluster 4: This cluster has the second highest average annual income of 86.54, and a spending score of 82.13 meaning the people in this cluster earn a lot of money, and like to spend a lot of it too.
Cluster 5: This cluster has the highest average annual income but the lowest average spending score, which means that the people in this cluster earn a lot of money, and like to spend very little of it. The rich stay rich
Q 8
Cluster 1: “The Old-Timers” - This cluster could be given discounts to encourage them to spend more of what little money they have.
Cluster 2: “The ‘No Impulse-Control’ Gang” - The supermarket could sell bundled items, including items the customers from this cluster wouldn’t normally buy, but they will anyway if it is part of a bundle with at least some items they would like to purchase.
Cluster 3: “The Middle of the Roads” - The supermarket could introduce a loyalty program which could include exclusive offers for loyal customers, possibly encouraging them to spend money on things they don’t need, just because they have an offer for it.
Cluster 4: “The Big Spenders” - This cluster likes to spend their money, so the supermarket could target them for higher price items.
Cluster 5: “The Rich Gang” - This cluster has the highest income, so they could also be targeted for the higher price items as they don’t spend a lot of their money, meaning they can afford the more expensive things.
Question B
bank_loans2 <- select(bank_loans, age, income, experience, cc_avg)bank_loans2_scaled <- scale(bank_loans2)
dist2 <- dist(bank_loans2)Yes, the data needs to be scaled first, as each variable is measured on a different scale which can go beyond 100, which would mess up the process of clustering.
set.seed(101)
k_means1 <- kmeans(bank_loans2_scaled, centers = 3)dist3 <- dist(bank_loans2)
sill_k_means1 <- silhouette(k_means1$cluster, dist3)
summary(sill_k_means1)Silhouette of 5000 units in 3 clusters from silhouette.default(x = k_means1$cluster, dist = dist3) :
Cluster sizes and average silhouette widths:
2029 2168 803
0.1899016 0.1531329 0.4718464
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6774 0.1229 0.2487 0.2192 0.3453 0.6855
bank_loans_cluster <-bank_loans2 %>%
mutate(clusters2 = k_means1$cluster) %>%
mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
clusters2 == 2 ~ 'C2',
clusters2 == 3 ~ 'C3',))
bank_loans_summary <- bank_loans_cluster %>%
group_by(cluster) %>%
summarise(
number_of_customers = n(),
average_age = mean(age, na.rm = TRUE),
average_experience = mean(experience, na.rm = TRUE),
average_income = mean(income, na.rm = TRUE),
average_credit_card_spend = mean(cc_avg, na.rm = TRUE))
knitr::kable(bank_loans_summary,
digits = c(0,0,2,2,2,2),
col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. Credit Card Spend"),
caption = "Segment Summary")| Cluster | # Customers | Avg. Age | Avg. Experience | Avg. Income | Avg. Credit Card Spend |
|---|---|---|---|---|---|
| C1 | 2029 | 35.12 | 9.87 | 60.28 | 1.37 |
| C2 | 2168 | 55.55 | 30.26 | 59.07 | 1.36 |
| C3 | 803 | 43.60 | 18.56 | 147.57 | 4.96 |
I think Cluster 3 would be the most likely to have a loan or to take out a loan based on their average credit card spend - 4.96, which is significantly higher than cluster 1 or 2