Data Assignment 2

Author

Zoe Keating

Data Assignment 2 Semester 2 Segmentation

library(cluster)
library(tidyverse)

Q1

sm_cust <- read_csv("supermarket_customers.csv")
View(sm_cust)

Q2

sm_cust2 <- select(sm_cust, annual_income, spending_score)
d1 <- dist(sm_cust2)

sm_cust2_scale <- scale(sm_cust2)
d2 <- dist(sm_cust2_scale)

Yes we must scale, because annual income is measured in € thousands and spending score is a score between 1 - 100 meaning if we did not not scale annual income would dominate the results as the numbers are much larger.

Q3

h1 <- hclust(d2)

Q4

plot(h1, hang = -1)

heatmap(as.matrix(d2), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

There is some evidence of lightly coloured blocks around the diagonal that suggest groups of customers who are similar to each other.

Q5

clusters1 <- cutree(h1, k = 5)

sil1 <- silhouette(clusters1, d2)
summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d2) :
 Cluster sizes and average silhouette widths:
       23        21        79        39        38 
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1401  0.4993  0.5965  0.5531  0.6610  0.7645 

The mean Silhouette Score for the overall clustering results is 0.5965, which means that areasonable clustering structure has been found. We can also look at the average Silhouette Scores for each individual cluster:

  • Clusters 5 have score of 0.46 meaning The structure is weak and could be artificial.

  • Cluster 1, 2, 4 and 5 have scores of 0.51, 0.62, 0.61 and 0.51 meaning a reasonable clustering structure has been found.

Q6

sm_cust_clus <- sm_cust %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5'))

Q6(A)

ggplot(sm_cust_clus, aes(x = annual_income, y = spending_score, color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "Customer Segments Based on Annual Income & Spending Score",
       x = "Annual Income",
       y = "Spending Score",
       color = "Cluster") +
  theme_minimal()

This scatterplot shows us the customers in cluster 3 are very similar as they are tightly fitted.

C1 - Has low income and also spends the least

C2 - Has low annual income but are high spenders

C3- Medium annual income, medium annual spend

C4 - High annual income, high spending score

C5- High annual income, low spending score

Q6(B)

cluster_summary <- sm_cust_clus %>%
  group_by(cluster) %>%
  summarise(
    num_customers = n(),
    avg_annual_income = mean(annual_income, na.rm = TRUE),
    avg_spending_score = mean(spending_score, na.rm = TRUE),
    avg_age = mean(age, na.rm = TRUE)
  )


knitr::kable(cluster_summary,
             digits = c(0, 0, 2, 2, 2),
             col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Avg. Age"), 
             caption = "Customer Segments Summary")
Customer Segments Summary
Cluster # Customers Avg. Annual Income Avg. Spending Score Avg. Age
C1 23 26.30 20.91 45.22
C2 21 25.10 80.05 25.33
C3 79 54.42 50.22 42.89
C4 39 86.54 82.13 32.69
C5 38 87.00 18.63 40.39

Q6(C)

gender_distribution <- sm_cust_clus %>%
  group_by(cluster, gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(cluster) %>%
  mutate(percent = (count / sum(count)) * 100)

ggplot(gender_distribution, aes(x = cluster, y = percent, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender Distribution by Cluster",
       x = "Cluster",
       y = "Percentage (%)",
       fill = "Gender") +
  theme_minimal()

Q7

From these tables/graphs we can deduce the following:

  • Cluster 1 - Low income, low spending score. Consists of 23 customers with an average annual income of €26.30K and a low spending score of 20.91. With an average age of 45 and consists of over 60% females compared to less then 40% males.

  • Cluster 2 - Low income, high spending score. Consists of 21 customers with an average annual income of €25.10K and a high spending score of 80.05. With an average age of 25.33 and consists of over 50% females compared to less than 50% male.

  • Cluster 3 - Medium income, moderate spending score. Consists of 79 customers with an average income of €54.42K and a moderate spending score of 50.22. With an average age of 42.89 and consists of nearly 60% females compared to just above 40% male. This is our largest cluster.

  • Cluster 4 - High income, high spending score. Consists of 39 customers with an average income of €86.54K and a high spending score of 82.13. With an average age of 32.69 and consists of over 50% females and less than 50% male.

  • Cluster 5 - High income, low spending score. Consists of 38 customers with an average annual income of €87K and a low spending score of 18.63. With an aveage age of 40.39 and consists of over 50% male and less then 50% female.

Q8

  • Cluster 1 - The Cautious Cart Savers

    -Saver Deals

  • Cluster 2 - The Trendy Spenders

    -Student discounts

    -Trendy Friday Bundles

  • Cluster 3 -The Balanced Basket Buyers

    -Family shopper discounts

    -Memebrship deals

  • Cluster 4 - The Premium Purchasers

    -VIP loyalty programme

    -Premium products

  • Cluster 5 - The Reserved Rich

    -A shopping list with their usuals on it and include some add on suggestions to encourage them to purchase more then their usual.

Part B

Q1

bpl <- read_csv("bank_personal_loan.csv")

Q2

bpl2 <- select(bpl, age, experience, income, cc_avg)

Q3

bpl2_scale <- scale(bpl2)

Yes, we must scale the data as the variables are all measured differently. Age is measured from 23-67, experience is measured in years from minus 3 to 43, income is measured in €000 from 8 to 224 and cc_avg is measured in €000 from 0 to 10.0. For accurate results we must scale the data.

Q4

set.seed(101)
kmeans1 <- kmeans(bpl2_scale, centers = 3)

Q5

d2 <- dist(bpl2)

sil_kmeans1 <- silhouette(kmeans1$cluster, d2)
summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.1899016 0.1531329 0.4718464 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.6774  0.1229  0.2487  0.2192  0.3453  0.6855 

The mean Silhouette Score for the overall clustering results is 0.216178, which means that no substantial structure has been found. We can also look at the average Silhouette Scores for each individual cluster:

  • Clusters 2, 3 and 4 have scores of 0.04, 0.24 and 0,07 meaning that there is no substantial structure been found.

  • Cluster 1 has a score of 0.37 and meaning the structure is weak and could be artificial.

Q6

bpl_clus <- bpl %>%
  mutate(clusters2 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
                             clusters2 == 2 ~ 'C2',
                             clusters2 == 3 ~ 'C3'))
cluster_summary <- bpl_clus %>%
  group_by(cluster) %>%
  summarise(
    num_customers = n(),
    avg_age = mean(age, na.rm = TRUE),
    avg_experience = mean(experience, na.rm = TRUE),
    avg_income = mean(income, na.rm = TRUE),
    avg_cc_spend = mean(cc_avg, na.rm = TRUE)
  )

knitr::kable(cluster_summary, 
             digits = c(0, 0, 2),
             col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. CC Spend"), 
             caption = "Customer Segments Summary for K-Means Clustering")
Customer Segments Summary for K-Means Clustering
Cluster # Customers Avg. Age Avg. Experience Avg. Income Avg. CC Spend
C1 2029 35.12 10 60 1.37
C2 2168 55.55 30 59 1.36
C3 803 43.60 19 148 4.96

Cluster 1- Second largest cluster with 2029 customers, youngest average age being 35.12, lowest average work experience with it being 10 years and their average annual income is €60K and their average spening on credit cards per month being €1.37K.

Cluster 2 - Largest cluster with 2168 customers, oldest average age with it being 55.55, highest average work experience being 30 years and their average annual income is €59k. The average spend on credit cards per month being €1.36K which is the lowest.

Cluster 3 - Smallest cluster with just 803 customers. Average age is 43.60 and their average work experience is 19 years. Their average annual income is the highest being €148K and the highest average spend on credit cards per month with it being €4.96k

bpl_pl <- bpl_clus %>%
  mutate(clusters3 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters3 == 1 ~ 'C1',
                             clusters3 == 2 ~ 'C2',
                             clusters3 == 3 ~ 'C3'))
loan_summary <- bpl_clus %>%
  group_by(cluster) %>%
  summarise(
    num_customers = n(),
    num_loan_taken = sum(personal_loan, na.rm = TRUE),  
    loan_percentage = (sum(personal_loan, na.rm = TRUE) / n()) * 100 )


knitr::kable(loan_summary, 
             digits = 2,
             col.names = c("Cluster", "# Customers", "# Loans Taken", "Loan %"),
             caption = "Personal Loans Taken per Cluster")
Personal Loans Taken per Cluster
Cluster # Customers # Loans Taken Loan %
C1 2029 69 3.40
C2 2168 85 3.92
C3 803 326 40.60
ggplot(loan_summary, aes(x = cluster, y = loan_percentage, fill = cluster)) +
  geom_bar(stat = "identity") +
  labs(title = "Personal Loans Taken by Cluster",
       x = "Customer Segment",
       y = "Percentage of Customers with a Loan") +
  theme_minimal()

Cluster 3 is most likely to take out a loan in the future as they have the highest percentage (40.6%) of loans taken out in the past. They have taken out 326 loans in the past compared to 69 and 85 in the the others.