Assignment 2

Part A Q 1

library(cluster)
library(tidyverse)

supermarket_cust <- read_csv("supermarket_customers.csv")
bank_loans <- read_csv("bank_personal_loan.csv")

Q 2

customer_cluster <- select(supermarket_cust, annual_income, spending_score)
dist1 <- dist(customer_cluster)

Q 3

h1 <- hclust(dist1)

Q 4

plot(h1, hang = -1)

heatmap(as.matrix(dist1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

Q 5

clusters1 <- cutree(h1, k = 5)

sill <- silhouette(clusters1, dist1)
summary(sill)

Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = dist1) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08527  0.49132  0.59236  0.55299  0.66247  0.74442

I believe there is evidence of Clustering, as the silhouette widths for each cluster is over 0.51, meaning a reasonable clustering structure has been found

Q 6

clusters_customer <- supermarket_cust %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5',))

Q 6a

ggplot(clusters_customer, aes(x = annual_income, y = spending_score, color = cluster)) +
  geom_point(size = 4) +  
  labs(title = "Scatterplot of Annual Spend vs. Spending Score",
       x = "Annual Spend",
       y = "Spending Score",
       color = "Cluster") +
  theme_minimal()

Q 6b

cluster_summary<- clusters_customer %>%
  group_by(cluster) %>% 
  summarise(
    number_of_customers = n(),
    avg_annual_income = mean(annual_income, na.rm = TRUE),
    avg_spending_score = mean(spending_score, na.rm = TRUE),
    avg_age = mean(age, na.rm = TRUE))

knitr::kable(cluster_summary, 
             digits = c(0,0,2,2,2),
             col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Age"), 
             caption = "Segment Summary")

Segment Summary
Cluster	# Customers	Avg. Annual Income	Avg. Spending Score	Age
C1	23	26.30	20.91	45.22
C2	21	25.10	80.05	25.33
C3	85	55.81	49.13	42.48
C4	39	86.54	82.13	32.69
C5	32	89.41	15.59	41.00

Q 6c

percent_gender <- clusters_customer %>%
  group_by(cluster, gender) %>%
  summarise(count = n()) %>%
  group_by(cluster) %>%
  mutate(percentage = count / sum(count) * 100)


ggplot(percent_gender, aes(x = factor(cluster), y = percentage, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Percentage of Males and Females in Each Cluster",
    x = "Cluster",
    y = "Percentage",
    fill = "Gender"
  ) + theme_minimal()

Q 7 Cluster 1: This cluster has the highest average age, with an average Annual income of 26.30, and the second lowest spending score of 20.91. This suggests that they don’t like to spend what little money they have.

Cluster 2: The youngest average cluster, and the lowest average Annual income, but a very high spending score of 80.05, which suggests that despite not earning a lot, the younger customers still like to spend their money.

Cluster 3: This cluster has the most people, annual income of 55.81, spending score of 49.13, meaning that this cluster earns decent money, and spend most but not all of it in store

Cluster 4: This cluster has the second highest average annual income of 86.54, and a spending score of 82.13 meaning the people in this cluster earn a lot of money, and like to spend a lot of it too.

Cluster 5: This cluster has the highest average annual income but the lowest average spending score, which means that the people in this cluster earn a lot of money, and like to spend very little of it. The rich stay rich

Q 8

Cluster 1: “The Old-Timers” - This cluster could be given discounts to encourage them to spend more of what little money they have.

Cluster 2: “The ‘No Impulse-Control’ Gang” - The supermarket could sell bundled items, including items the customers from this cluster wouldn’t normally buy, but they will anyway if it is part of a bundle with at least some items they would like to purchase.

Cluster 3: “The Middle of the Roads” - The supermarket could introduce a loyalty program which could include exclusive offers for loyal customers, possibly encouraging them to spend money on things they don’t need, just because they have an offer for it.

Cluster 4: “The Big Spenders” - This cluster likes to spend their money, so the supermarket could target them for higher price items.

Cluster 5: “The Rich Gang” - This cluster has the highest income, so they could also be targeted for the higher price items as they don’t spend a lot of their money, meaning they can afford the more expensive things.

Question B

bank_loans2 <- select(bank_loans, age, income, experience, cc_avg)

bank_loans2_scaled <- scale(bank_loans2)
dist2 <- dist(bank_loans2)

Yes, the data needs to be scaled first, as each variable is measured on a different scale which can go beyond 100, which would mess up the process of clustering.

set.seed(101)
k_means1 <- kmeans(bank_loans2_scaled, centers = 3)

dist3 <- dist(bank_loans2)
sill_k_means1 <- silhouette(k_means1$cluster, dist3)
summary(sill_k_means1)

Silhouette of 5000 units in 3 clusters from silhouette.default(x = k_means1$cluster, dist = dist3) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.1899016 0.1531329 0.4718464 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.6774  0.1229  0.2487  0.2192  0.3453  0.6855

bank_loans_cluster <-bank_loans2 %>%
  mutate(clusters2 = k_means1$cluster) %>%
  mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
                             clusters2 == 2 ~ 'C2',
                             clusters2 == 3 ~ 'C3',))

bank_loans_summary <- bank_loans_cluster %>%
  group_by(cluster) %>%  
  summarise(
    number_of_customers = n(),  
    average_age = mean(age, na.rm = TRUE), 
    average_experience = mean(experience, na.rm = TRUE),  
    average_income = mean(income, na.rm = TRUE),
    average_credit_card_spend = mean(cc_avg, na.rm = TRUE))

knitr::kable(bank_loans_summary, 
             digits = c(0,0,2,2,2,2),
             col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. Credit Card Spend"), 
             caption = "Segment Summary")

Segment Summary
Cluster	# Customers	Avg. Age	Avg. Experience	Avg. Income	Avg. Credit Card Spend
C1	2029	35.12	9.87	60.28	1.37
C2	2168	55.55	30.26	59.07	1.36
C3	803	43.60	18.56	147.57	4.96

I think Cluster 3 would be the most likely to have a loan or to take out a loan based on their average credit card spend - 4.96, which is significantly higher than cluster 1 or 2