<- select(supermarket, annual_income, spending_score)
supermarket_2 <- dist(supermarket_2) d1
Assignment 2 - Adv Data Analysis 2
Part A
Question 2
Does the data need to be scaled before computing the distance matrix? Explain your answer.
Question 3
<- hclust(d1) h1
Question 4
plot(h1, hang = -1)
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')
Does the heatmap provide evidence of any clustering structure within the dataset? Explain your answer.
The heatmap above provides evidence of clustering within the dataset. This is due to a lot of the squares being brightly coloured.
Question 5
<- cutree(h1, k = 5)
clusters1 <- silhouette(clusters1, d1)
sil1 summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 85 39 32
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08527 0.49132 0.59236 0.55299 0.66247 0.74442
Assess the quality of this solution. Discuss your findings.
The overall cluster analysis has a mean Silhouette Score of 0.5530, which means that the analysis has a reasonable clustering structure. Cluster 1 has a score of 0.52 (reasonable clustering structure), Cluster 2 has a score of 0.64 (reasonable clustering structure), Cluster 3 has a score of 0.56 (reasonable clustering structure), Cluster 4 has a score of 0.51 (reasonable clustering structure), Cluster 5 has a score of 0.55 (reasonable clustering structure).
Question 6
<- supermarket %>%
super_clus mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5'))
clusters1
<- super_clus %>%
super_clus_means group_by(cluster) %>%
summarise(age_m = mean(age),
income_m = mean(annual_income),
spending_m = mean(spending_score),
num_custs = n())
super_clus_means
# A tibble: 5 × 5
cluster age_m income_m spending_m num_custs
<chr> <dbl> <dbl> <dbl> <int>
1 C1 45.2 26.3 20.9 23
2 C2 25.3 25.1 80.0 21
3 C3 42.5 55.8 49.1 85
4 C4 32.7 86.5 82.1 39
5 C5 41 89.4 15.6 32
ggplot(super_clus_means, aes(x = income_m, y = spending_m, color = as.factor(cluster))) +
geom_point(size = 3) +
labs(title = "Scatterplot of Annual Income vs Spending Score by Cluster",
x = "Annual Income (€000s)",
y = "Spending Score",
color = "Cluster")
Cluster 2 has the highest spending score, while also having the lowest average annual income. Cluster 1 has a slightly higher annual income, although have the lowest spending score of all clusters. Similarly, we can see that Cluster 4 has a very similar spending score to Cluster 2, and Cluster 5 has a very similar spending score to Cluster 1. These two clusters (4 & 5) have a much higher average income.
::kable(select(super_clus_means, cluster, num_custs, age_m, income_m, spending_m),
knitrdigits = c(0,0,2,2),
col.names = c("Cluster", "# Customers", "Avg Age", "Avg Annual Income", "Spending Score"),
caption = "Number of Customers and Average Age, Income & Spending Score")
Cluster | # Customers | Avg Age | Avg Annual Income | Spending Score |
---|---|---|---|---|
C1 | 23 | 45.22 | 26.30 | 21 |
C2 | 21 | 25.33 | 25.10 | 80 |
C3 | 85 | 42.48 | 55.81 | 49 |
C4 | 39 | 32.69 | 86.54 | 82 |
C5 | 32 | 41.00 | 89.41 | 16 |
<- super_clus %>%
gender_percent group_by(cluster, gender) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100)
ggplot(gender_percent, aes(x = as.factor(cluster), y = percentage, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Gender Distribution by Cluster",
x = "Cluster",
y = "Percentage (%)",
fill = "Gender")
Question 7
Cluster 1: This cluster is primarily female. It has the highest average age (45 years old), second lowest average income (€26,300), and second lowest spending score (21).
Cluster 2: This cluster is primarily female. It has the lowest average age (25 years old), lowest average income (€25,100), and second highest spending score (80).
Cluster 3: This cluster is primarily female. It has the second highest average age (42 years old), third highest average income (€55,810), and third highest spending score (49).
Cluster 4: This cluster is primarily female. It has the second lowest average age (33 years old), second highest average income (€86,540), and highest spending score (82).
Cluster 5: This cluster is primarily male. It has the third highest average age (41 years old), highest average income (€89,410), and lowest spending score (16).
Question 8
Cluster 1: “The Savvy Seniors” - The supermarket could offer postal offers through coupons & vouchers to these customers. This could be paired with money-off vouchers on receipts to encourage repeat purchases.
Cluster 2: “The Young Trendsetters” - The supermarket could follow trends online and stock trending products, such as Dubai Chocolate which is currently trending within Ireland. This would encourage younger people to visit the shop.
Cluster 3: “The Steady Spenders” - To encourage the Steady Spenders, the supermarket could offer bundles of items. This would encourage more spending, especially due to their higher spending power with a higher income.
Cluster 4: “The Big Spenders” - As they are already high spenders, the supermarket could offer a loyalty program which rewards consistent spending with points and discounts relative to how much the customers spend.
Cluster 5: “The High-Flyers” - As this cluster mainly consists of males, the supermarket could look to offer items such as aftershaves etc. These customers have the highest average income so could be targeted through higher ticket goods.
Part B
Question 2
<- select(bank, age, experience, income, cc_avg) bank_2
Question 3
Yes, the data does need to be scaled before applying the K-means clustering algorithm. This is due to the fact that it is not measured on a fixed scale and there is a much wider variance in data.
<- scale(bank_2) bank_2_scale
Question 4
set.seed(101)
<- kmeans(bank_2_scale, centers = 3)
kmeans1
$cluster kmeans1
$centers kmeans1
age experience income cc_avg
1 -0.8912604 -0.8924191 -0.2930636 -0.3277429
2 0.8904403 0.8852371 -0.3195201 -0.3326932
3 -0.1520638 -0.1350880 1.6031701 1.7263626
$size kmeans1
[1] 2029 2168 803
$iter kmeans1
[1] 3
Question 5
<- dist(bank_2_scale)
bd1
<- silhouette(kmeans1$cluster, bd1)
sil_kmeans1 summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = bd1) :
Cluster sizes and average silhouette widths:
2029 2168 803
0.4344352 0.4359975 0.2505801
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1365 0.2776 0.4538 0.4056 0.5614 0.6311
Assess the quality of the cluster solution and discuss your findings.
The overall cluster analysis has a mean Silhouette Score of 0.4056, which means that the analysis has a weak clustering structure. Cluster 1 has a score of 0.43 (weak clustering structure), Cluster 2 has a score of 0.44 (weak clustering structure), and Cluster 3 has a score of 0.25 (weak clustering structure).
Question 6
<- bank %>%
bank_clus mutate(clusters1 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))
clusters1
<- bank_clus %>%
bank_clus_means group_by(cluster) %>%
summarise(num_custs = n(),
age_m = mean(age),
experience_m = mean(experience),
income_m = mean(income),
cc_avg_m = mean(cc_avg))
bank_clus_means
# A tibble: 3 × 6
cluster num_custs age_m experience_m income_m cc_avg_m
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 C1 2029 35.1 9.87 60.3 1.37
2 C2 2168 55.5 30.3 59.1 1.36
3 C3 803 43.6 18.6 148. 4.96
Question 7
Cluster 1: This cluster has the second most customers (2029), with the youngest average age (35 years old), least experience (10 years), second highest average income (€60,283), and second highest credit card spend (€1,365).
Cluster 2: This cluster has the most customers (2168), with the oldest average age (56 years old), most experience (30 years), lowest average income (€59,066), and lowest credit card spend (€1,357).
Cluster 3: This cluster has the least customers (803), with the second oldest average age (44 years old), second most experience (19 years), highest average income (€147,574), and highest credit card spend (€4,955).
Question 8
Which segment(s) is(are) most likely to take out a personal loan in the future? Hint: profile the clusters/segments using the personal_loan variable.
<- bank %>%
bank_clus mutate(clusters1 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))
clusters1
<- bank_clus %>%
bank_clus_means group_by(cluster) %>%
summarise(num_custs = n(),
age_m = mean(age),
experience_m = mean(experience),
income_m = mean(income),
cc_avg_m = mean(cc_avg),
loan_rate = mean(personal_loan))
bank_clus_means
# A tibble: 3 × 7
cluster num_custs age_m experience_m income_m cc_avg_m loan_rate
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 2029 35.1 9.87 60.3 1.37 0.0340
2 C2 2168 55.5 30.3 59.1 1.36 0.0392
3 C3 803 43.6 18.6 148. 4.96 0.406
ggplot(bank_clus_means, aes(x = factor(cluster), y = loan_rate)) +
geom_bar(stat = "identity", fill = "seagreen3") +
labs(title = "Proportion of Personal Loan Holders by Cluster",
x = "Cluster",
y = "Proportion of Personal Loan Holders")
As we can see from the above, Cluster 3 is the most likely to take out a personal loan in the future. This heavily outweighs Clusters 1 & 2 which are very similar to each other.