Assignment 2 Version 1

Author

Chloe O Donovan

Part A

Load tidyverse

library(cluster)
library(tidyverse)

Q1

Import the file into R

supermarket_customers <- read_csv("supermarket_customers.csv")
View(supermarket_customers)

Q2

supermarket_customers_2 <- select(supermarket_customers, annual_income, spending_score)


supermarket_customers_2_scale <- scale(supermarket_customers_2)
d1 <- dist(supermarket_customers_2)

Q2 (a)

The data requires to be scaled due to annual_income could range from €0 to €1000, while spending_score as we can see only ranges from 0-100 therefore, in order to get unbias results the data should be scaled.

Q3

h1 <- hclust(d1)

Q4

plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

Q4 (a)

The heatmap above gives evidence of custering structure as it contains many light coloured boxes going digonally in the heatmap.

Q5

clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08527  0.49132  0.59236  0.55299  0.66247  0.74442 
  • The overall cluster analysis has a mean Silhouette score of: 0.7243 which shows a strong clustering structure.
  • Cluster 1: has a score of 0.52, which shows a strong clustering structure.
  • Cluster 2: has a score of 0.64, which shows a strong clustering structure.
  • Cluster 3: has a score of 0.56, which shows a strong clustering structure.
  • Cluster 4: has a score of 0.51, which shows a reasonable clustering structure.
  • Cluster 5: has a score of 0.55 which shoes a reasonable clustering structure.

Q6

supermarket_cluster <- supermarket_customers %>%
  mutate(cluster = case_when(
    clusters1 == 1 ~ 'C1',
    clusters1 == 2 ~ 'C2',
    clusters1 == 3 ~ 'C3',
    clusters1 == 4 ~ 'C4',
    clusters1 == 5 ~ 'C5'
  ))

Q6(a)

ggplot(supermarket_cluster, aes(x = annual_income, y = spending_score, color = cluster)) +
  geom_point(size = 4) +  
  labs(title = "Scatterplot of Annual Spend vs. Spending Score",
       x = "Annual Spend",
       y = "Spending Score",
       color = "Cluster") +
  theme_minimal() 

  • Cluster 1 (red): Contains customers with low annual spend and low spending scores.
  • Cluster 2(yellow): Contains customers with high annual spend and high spending scores.
  • Cluster 3(green): Contains customers with moderate spending and average spending scores.
  • Cluster 4(blue):Contains customers with moderate to high spending with a high spending score.
  • Cluster 5(pink): Contains customers with a moderate to high spending with a low spending score.

Q6 (b)

cluster_summary<- supermarket_cluster %>%
  group_by(cluster) %>% 
  summarise(
    number_of_customers = n(),
    avg_annual_income = mean(annual_income, na.rm = TRUE),
    avg_spending_score = mean(spending_score, na.rm = TRUE),
    avg_age = mean(age, na.rm = TRUE))

knitr::kable(cluster_summary, 
             digits = c(0,0,2,2,2),
             col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Age"), 
             caption = "Segment Summary")
Segment Summary
Cluster # Customers Avg. Annual Income Avg. Spending Score Age
C1 23 26.30 20.91 45.22
C2 21 25.10 80.05 25.33
C3 85 55.81 49.13 42.48
C4 39 86.54 82.13 32.69
C5 32 89.41 15.59 41.00

Q6 (c)

gender_percentage <- supermarket_cluster %>%
  group_by(cluster, gender) %>%
  summarise(count = n()) %>%
  group_by(cluster) %>%
  mutate(percentage = count / sum(count) * 100)


ggplot(gender_percentage, aes(x = factor(cluster), y = percentage, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Percentage of Males and Females in Each Cluster",
    x = "Cluster",
    y = "Percentage",
    fill = "Gender"
  ) + theme_minimal()

Q7

  • Cluster 1: Consists of 23 customers with a mean age of 45 and are mostly women with an average income of €26,300 and an average spending score of 20.92. This cluster contains customers who do not frequently make large purchases.
  • Cluster 2: Consists of 21 customers with a mean age of 25 and are mostly female with an average income of €25,100 and an average spending score of 80.05. This cluster contains customer who spend regularly.
  • Cluster 3: Consists of 85 customers with a mean age of 42 and are mostly female with an average income of €55,810 and an average spending score of 49.13. This cluster contains customers who have average income and spending behaviours.
  • Cluster 4: Consists of 39 customers with a mean age of 32 and are mostly female with an average income of €86,540 and an average spending score of 82.13. This cluster contains customers with more disposable income.
  • Cluster 5: Consists of 32 customers with a mean age of 41 and are mostly male with an average income of €89,410 and an average spending score of 15.59. This cluster contains customers who have a high income with a low spending score showing they do not spend on discretionary items and probably only pay for certain services and goods.

Q8

Marketing Actions and Catchy Cluster Names:

  • Cluster 1: “Value seekers” Profile: Customers with lower annual incomes and low spending scores. Marketing Actions:

Discounts & Promotions: Offer regular promotions and discounts, this could encourage more frequent purchases.

  • Cluster 2: “High life hustlers” Profile: low-income individuals with high spending scores. Marketing Actions:

Post a favorite product of the week: promote on socials, posters in store, and on app the favourite product of the week to entice customers in this cluster to purchase.

  • Cluster 3: “The practical class” Profile: Customers with average incomes and moderate spending scores. Marketing Actions:

Value-Added Offers: Offer discounts on frequently purchased items or on different brand of the product and offer discounts such as “Buy 2 Get 1 Free”.

  • Cluster 4: “The high rollers” Profile: High-income individuals with high spending scores. Marketing Actions:

Exclusive & Premium Products: Promote high-end or luxury products such as organic or specialty foods, gourmet items, or high-quality wines, aligning with their purchasing behavior.

  • Cluster 5: “Wealth watchers” Profile: individuals with higher income but lower spending scores compared to income. Marketing Actions:

Value-Driven Campaigns: Promote the idea of “getting more for less,” emphasizing the value of products and offering coupons or bundle deals.

#Part B

bank_personal_loan <- read_csv("bank_personal_loan.csv")
View(bank_personal_loan)

Q2

bank_personal_loan2 <- select(bank_personal_loan, age, income, experience, cc_avg)

Q3

bank_personal_loan2_scale <- scale(bank_personal_loan2)
d2 <- dist(bank_personal_loan2)

Yes, scaling the data is necessary due to age is measured 0-100 along with experience, while income is measure from €0-€1000.

Q4

set.seed(101)
kmeans1 <- kmeans(bank_personal_loan2_scale, centers = 3)

Q5

d2 <- dist(bank_personal_loan2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d2)
summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.1899016 0.1531329 0.4718464 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.6774  0.1229  0.2487  0.2192  0.3453  0.6855 
  • Overall silhouette score is: 0.2487 meaning there is no real structure found.
  • Cluster 1: has a score of 0.19 showing the no structure is found.
  • Cluster 2: has a score of 0.15 showing no structure is found.
  • Cluster 3: has a score of 0.47 showing the structure is weak.

Q6

bank_personal_cluster <- bank_personal_loan2 %>%
  mutate(clusters2 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
                             clusters2 == 2 ~ 'C2',
                             clusters2 == 3 ~ 'C3'))

cluster_summary <- bank_personal_cluster %>%
  group_by(cluster) %>%
  summarise(
    num_customers = n(),
    avg_age = mean(age, na.rm = TRUE),
    avg_experience = mean(experience, na.rm = TRUE),
    avg_income = mean(income, na.rm = TRUE),
    avg_cc_spend = mean(cc_avg, na.rm = TRUE)
  )


knitr::kable(cluster_summary, 
             digits = c(0, 0, 2, 2, 2, 2),
             col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. CC Spend"), 
             caption = "Customer Segments Summary for K-Means Clustering")
Customer Segments Summary for K-Means Clustering
Cluster # Customers Avg. Age Avg. Experience Avg. Income Avg. CC Spend
C1 2029 35.12 9.87 60.28 1.37
C2 2168 55.55 30.26 59.07 1.36
C3 803 43.60 18.56 147.57 4.96

Q7

  • Cluster 1: Contains the second most number of customers in this group of clusters, but has the youngest age group. With little experience and little spending and income.
  • Cluster 2: Contains the most customers with moderate income and spending and the age group being the oldest and the most experienced group.
  • Cluster 3: Contains the lowest number of customers being the middle for age and experience yet being the highest group of income and spending.

##Q8

bank_personal_loan_cluster <- bank_personal_loan %>%
  mutate(clusters2 = kmeans1$cluster) %>%
  mutate(cluster = case_when(
    clusters2 == 1 ~ 'C1',
    clusters2 == 2 ~ 'C2',
    clusters2 == 3 ~ 'C3'
  ))


loan_profile <- bank_personal_loan_cluster %>%
  group_by(cluster) %>%
  summarise(
    number_of_customers = n(),
    num_loan_taken = sum(personal_loan, na.rm = TRUE),  
    loan_percentage = (sum(personal_loan, na.rm = TRUE) / n()) * 100 
  )

knitr::kable(loan_profile, 
             digits = 2,
             col.names = c("Cluster", "# Customers", "# Loans Taken", "Percentage with Loan"), 
             caption = "Personal Loan Profile by Cluster")
Personal Loan Profile by Cluster
Cluster # Customers # Loans Taken Percentage with Loan
C1 2029 69 3.40
C2 2168 85 3.92
C3 803 326 40.60
ggplot(loan_profile, aes(x = cluster, y = loan_percentage, fill = cluster)) +
  geom_bar(stat = "identity") +
  labs(title = "Personal Loans Taken by Cluster",
       x = "Customer Segment",
       y = "Percentage of Customers with a Loan") +
  theme_minimal()

  • Cluster 1: is least likely to take out a loan but could take out a loan in the future.
  • Cluster 2: Most customers in this cluster are in a stable income and spending phase therefore, may or may not take out a loan in the future.
  • Cluster 3: are most likely to take out a loan due to already taken out loans, making them more easy to apply for and more likely to qualify for loans in the future.