Assignment 2 - Adv Data Analysis 2

Part A

Question 2

supermarket_2 <- select(supermarket, annual_income, spending_score)
d1 <- dist(supermarket_2)

Does the data need to be scaled before computing the distance matrix? Explain your answer.

Question 3

h1 <- hclust(d1)

Question 4

plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

Does the heatmap provide evidence of any clustering structure within the dataset? Explain your answer.

The heatmap above provides evidence of clustering within the dataset. This is due to a lot of the squares being brightly coloured.

Question 5

clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08527  0.49132  0.59236  0.55299  0.66247  0.74442 

Assess the quality of this solution. Discuss your findings.

The overall cluster analysis has a mean Silhouette Score of 0.5530, which means that the analysis has a reasonable clustering structure. Cluster 1 has a score of 0.52 (reasonable clustering structure), Cluster 2 has a score of 0.64 (reasonable clustering structure), Cluster 3 has a score of 0.56 (reasonable clustering structure), Cluster 4 has a score of 0.51 (reasonable clustering structure), Cluster 5 has a score of 0.55 (reasonable clustering structure).

Question 6

super_clus <- supermarket %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5'))

super_clus_means <- super_clus %>%
  group_by(cluster) %>%
  summarise(age_m = mean(age),
            income_m = mean(annual_income),
            spending_m = mean(spending_score),
            num_custs = n())

super_clus_means
# A tibble: 5 × 5
  cluster age_m income_m spending_m num_custs
  <chr>   <dbl>    <dbl>      <dbl>     <int>
1 C1       45.2     26.3       20.9        23
2 C2       25.3     25.1       80.0        21
3 C3       42.5     55.8       49.1        85
4 C4       32.7     86.5       82.1        39
5 C5       41       89.4       15.6        32
ggplot(super_clus_means, aes(x = income_m, y = spending_m, color = as.factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "Scatterplot of Annual Income vs Spending Score by Cluster",
       x = "Annual Income (€000s)",
       y = "Spending Score",
       color = "Cluster") 

Cluster 2 has the highest spending score, while also having the lowest average annual income. Cluster 1 has a slightly higher annual income, although have the lowest spending score of all clusters. Similarly, we can see that Cluster 4 has a very similar spending score to Cluster 2, and Cluster 5 has a very similar spending score to Cluster 1. These two clusters (4 & 5) have a much higher average income.

knitr::kable(select(super_clus_means, cluster, num_custs, age_m, income_m, spending_m), 
             digits = c(0,0,2,2),
             col.names = c("Cluster", "# Customers", "Avg Age", "Avg Annual Income", "Spending Score"), 
             caption = "Number of Customers and Average Age, Income & Spending Score")
Number of Customers and Average Age, Income & Spending Score
Cluster # Customers Avg Age Avg Annual Income Spending Score
C1 23 45.22 26.30 21
C2 21 25.33 25.10 80
C3 85 42.48 55.81 49
C4 39 32.69 86.54 82
C5 32 41.00 89.41 16
gender_percent <- super_clus %>%
  group_by(cluster, gender) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

ggplot(gender_percent, aes(x = as.factor(cluster), y = percentage, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender Distribution by Cluster",
       x = "Cluster",
       y = "Percentage (%)",
       fill = "Gender")

Question 7

Cluster 1: This cluster is primarily female. It has the highest average age (45 years old), second lowest average income (€26,300), and second lowest spending score (21).

Cluster 2: This cluster is primarily female. It has the lowest average age (25 years old), lowest average income (€25,100), and second highest spending score (80).

Cluster 3: This cluster is primarily female. It has the second highest average age (42 years old), third highest average income (€55,810), and third highest spending score (49).

Cluster 4: This cluster is primarily female. It has the second lowest average age (33 years old), second highest average income (€86,540), and highest spending score (82).

Cluster 5: This cluster is primarily male. It has the third highest average age (41 years old), highest average income (€89,410), and lowest spending score (16).

Question 8

Cluster 1: “The Savvy Seniors” - The supermarket could offer postal offers through coupons & vouchers to these customers. This could be paired with money-off vouchers on receipts to encourage repeat purchases.

Cluster 2: “The Young Trendsetters” - The supermarket could follow trends online and stock trending products, such as Dubai Chocolate which is currently trending within Ireland. This would encourage younger people to visit the shop.

Cluster 3: “The Steady Spenders” - To encourage the Steady Spenders, the supermarket could offer bundles of items. This would encourage more spending, especially due to their higher spending power with a higher income.

Cluster 4: “The Big Spenders” - As they are already high spenders, the supermarket could offer a loyalty program which rewards consistent spending with points and discounts relative to how much the customers spend.

Cluster 5: “The High-Flyers” - As this cluster mainly consists of males, the supermarket could look to offer items such as aftershaves etc. These customers have the highest average income so could be targeted through higher ticket goods.

Part B

Question 2

bank_2 <- select(bank, age, experience, income, cc_avg)

Question 3

Yes, the data does need to be scaled before applying the K-means clustering algorithm. This is due to the fact that it is not measured on a fixed scale and there is a much wider variance in data.

bank_2_scale <- scale(bank_2)

Question 4

set.seed(101)
kmeans1 <- kmeans(bank_2_scale, centers = 3)

kmeans1$cluster
kmeans1$centers
         age experience     income     cc_avg
1 -0.8912604 -0.8924191 -0.2930636 -0.3277429
2  0.8904403  0.8852371 -0.3195201 -0.3326932
3 -0.1520638 -0.1350880  1.6031701  1.7263626
kmeans1$size
[1] 2029 2168  803
kmeans1$iter
[1] 3

Question 5

bd1 <- dist(bank_2_scale)

sil_kmeans1 <- silhouette(kmeans1$cluster, bd1)
summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = bd1) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.4344352 0.4359975 0.2505801 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1365  0.2776  0.4538  0.4056  0.5614  0.6311 

Assess the quality of the cluster solution and discuss your findings.

The overall cluster analysis has a mean Silhouette Score of 0.4056, which means that the analysis has a weak clustering structure. Cluster 1 has a score of 0.43 (weak clustering structure), Cluster 2 has a score of 0.44 (weak clustering structure), and Cluster 3 has a score of 0.25 (weak clustering structure).

Question 6

bank_clus <- bank %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

bank_clus_means <- bank_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            age_m = mean(age),
            experience_m = mean(experience),
            income_m = mean(income),
            cc_avg_m = mean(cc_avg))

bank_clus_means
# A tibble: 3 × 6
  cluster num_custs age_m experience_m income_m cc_avg_m
  <chr>       <int> <dbl>        <dbl>    <dbl>    <dbl>
1 C1           2029  35.1         9.87     60.3     1.37
2 C2           2168  55.5        30.3      59.1     1.36
3 C3            803  43.6        18.6     148.      4.96

Question 7

Cluster 1: This cluster has the second most customers (2029), with the youngest average age (35 years old), least experience (10 years), second highest average income (€60,283), and second highest credit card spend (€1,365).

Cluster 2: This cluster has the most customers (2168), with the oldest average age (56 years old), most experience (30 years), lowest average income (€59,066), and lowest credit card spend (€1,357).

Cluster 3: This cluster has the least customers (803), with the second oldest average age (44 years old), second most experience (19 years), highest average income (€147,574), and highest credit card spend (€4,955).

Question 8

Which segment(s) is(are) most likely to take out a personal loan in the future? Hint: profile the clusters/segments using the personal_loan variable.

bank_clus <- bank %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

bank_clus_means <- bank_clus %>%
  group_by(cluster) %>%
  summarise(num_custs = n(),
            age_m = mean(age),
            experience_m = mean(experience),
            income_m = mean(income),
            cc_avg_m = mean(cc_avg),
            loan_rate = mean(personal_loan))

bank_clus_means
# A tibble: 3 × 7
  cluster num_custs age_m experience_m income_m cc_avg_m loan_rate
  <chr>       <int> <dbl>        <dbl>    <dbl>    <dbl>     <dbl>
1 C1           2029  35.1         9.87     60.3     1.37    0.0340
2 C2           2168  55.5        30.3      59.1     1.36    0.0392
3 C3            803  43.6        18.6     148.      4.96    0.406 
ggplot(bank_clus_means, aes(x = factor(cluster), y = loan_rate)) +
  geom_bar(stat = "identity", fill = "seagreen3") +
  labs(title = "Proportion of Personal Loan Holders by Cluster",
       x = "Cluster",
       y = "Proportion of Personal Loan Holders")

As we can see from the above, Cluster 3 is the most likely to take out a personal loan in the future. This heavily outweighs Clusters 1 & 2 which are very similar to each other.