library(cluster)
library(tidyverse)
Data Assignment 2
Data Assignment 2 Semester 2 Segmentation
Q1
<- read_csv("supermarket_customers.csv")
sm_cust View(sm_cust)
Q2
<- select(sm_cust, annual_income, spending_score)
sm_cust2 <- dist(sm_cust2)
d1
<- scale(sm_cust2)
sm_cust2_scale <- dist(sm_cust2_scale) d2
Yes we must scale, because annual income is measured in € thousands and spending score is a score between 1 - 100 meaning if we did not not scale annual income would dominate the results as the numbers are much larger.
Q3
<- hclust(d2) h1
Q4
plot(h1, hang = -1)
heatmap(as.matrix(d2), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)
There is some evidence of lightly coloured blocks around the diagonal that suggest groups of customers who are similar to each other.
Q5
<- cutree(h1, k = 5)
clusters1
<- silhouette(clusters1, d2)
sil1 summary(sil1)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d2) :
Cluster sizes and average silhouette widths:
23 21 79 39 38
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1401 0.4993 0.5965 0.5531 0.6610 0.7645
The mean Silhouette Score for the overall clustering results is 0.5965, which means that areasonable clustering structure has been found. We can also look at the average Silhouette Scores for each individual cluster:
Clusters 5 have score of 0.46 meaning The structure is weak and could be artificial.
Cluster 1, 2, 4 and 5 have scores of 0.51, 0.62, 0.61 and 0.51 meaning a reasonable clustering structure has been found.
Q6
<- sm_cust %>%
sm_cust_clus mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5')) clusters1
Q6(A)
ggplot(sm_cust_clus, aes(x = annual_income, y = spending_score, color = factor(cluster))) +
geom_point(size = 3) +
labs(title = "Customer Segments Based on Annual Income & Spending Score",
x = "Annual Income",
y = "Spending Score",
color = "Cluster") +
theme_minimal()
This scatterplot shows us the customers in cluster 3 are very similar as they are tightly fitted.
C1 - Has low income and also spends the least
C2 - Has low annual income but are high spenders
C3- Medium annual income, medium annual spend
C4 - High annual income, high spending score
C5- High annual income, low spending score
Q6(B)
<- sm_cust_clus %>%
cluster_summary group_by(cluster) %>%
summarise(
num_customers = n(),
avg_annual_income = mean(annual_income, na.rm = TRUE),
avg_spending_score = mean(spending_score, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE)
)
::kable(cluster_summary,
knitrdigits = c(0, 0, 2, 2, 2),
col.names = c("Cluster", "# Customers", "Avg. Annual Income", "Avg. Spending Score", "Avg. Age"),
caption = "Customer Segments Summary")
Cluster | # Customers | Avg. Annual Income | Avg. Spending Score | Avg. Age |
---|---|---|---|---|
C1 | 23 | 26.30 | 20.91 | 45.22 |
C2 | 21 | 25.10 | 80.05 | 25.33 |
C3 | 79 | 54.42 | 50.22 | 42.89 |
C4 | 39 | 86.54 | 82.13 | 32.69 |
C5 | 38 | 87.00 | 18.63 | 40.39 |
Q6(C)
<- sm_cust_clus %>%
gender_distribution group_by(cluster, gender) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(cluster) %>%
mutate(percent = (count / sum(count)) * 100)
ggplot(gender_distribution, aes(x = cluster, y = percent, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Gender Distribution by Cluster",
x = "Cluster",
y = "Percentage (%)",
fill = "Gender") +
theme_minimal()
Q7
From these tables/graphs we can deduce the following:
Cluster 1 - Low income, low spending score. Consists of 23 customers with an average annual income of €26.30K and a low spending score of 20.91. With an average age of 45 and consists of over 60% females compared to less then 40% males.
Cluster 2 - Low income, high spending score. Consists of 21 customers with an average annual income of €25.10K and a high spending score of 80.05. With an average age of 25.33 and consists of over 50% females compared to less than 50% male.
Cluster 3 - Medium income, moderate spending score. Consists of 79 customers with an average income of €54.42K and a moderate spending score of 50.22. With an average age of 42.89 and consists of nearly 60% females compared to just above 40% male. This is our largest cluster.
Cluster 4 - High income, high spending score. Consists of 39 customers with an average income of €86.54K and a high spending score of 82.13. With an average age of 32.69 and consists of over 50% females and less than 50% male.
Cluster 5 - High income, low spending score. Consists of 38 customers with an average annual income of €87K and a low spending score of 18.63. With an aveage age of 40.39 and consists of over 50% male and less then 50% female.
Q8
Cluster 1 - The Cautious Cart Savers
-Saver Deals
Cluster 2 - The Trendy Spenders
-Student discounts
-Trendy Friday Bundles
Cluster 3 -The Balanced Basket Buyers
-Family shopper discounts
-Memebrship deals
Cluster 4 - The Premium Purchasers
-VIP loyalty programme
-Premium products
Cluster 5 - The Reserved Rich
-A shopping list with their usuals on it and include some add on suggestions to encourage them to purchase more then their usual.
Part B
Q1
<- read_csv("bank_personal_loan.csv") bpl
Q2
<- select(bpl, age, experience, income, cc_avg) bpl2
Q3
<- scale(bpl2) bpl2_scale
Yes, we must scale the data as the variables are all measured differently. Age is measured from 23-67, experience is measured in years from minus 3 to 43, income is measured in €000 from 8 to 224 and cc_avg is measured in €000 from 0 to 10.0. For accurate results we must scale the data.
Q4
set.seed(101)
<- kmeans(bpl2_scale, centers = 3) kmeans1
Q5
<- dist(bpl2)
d2
<- silhouette(kmeans1$cluster, d2)
sil_kmeans1 summary(sil_kmeans1)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d2) :
Cluster sizes and average silhouette widths:
2029 2168 803
0.1899016 0.1531329 0.4718464
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6774 0.1229 0.2487 0.2192 0.3453 0.6855
The mean Silhouette Score for the overall clustering results is 0.216178, which means that no substantial structure has been found. We can also look at the average Silhouette Scores for each individual cluster:
Clusters 2, 3 and 4 have scores of 0.04, 0.24 and 0,07 meaning that there is no substantial structure been found.
Cluster 1 has a score of 0.37 and meaning the structure is weak and could be artificial.
Q6
<- bpl %>%
bpl_clus mutate(clusters2 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters2 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters2 == 3 ~ 'C3')) clusters2
<- bpl_clus %>%
cluster_summary group_by(cluster) %>%
summarise(
num_customers = n(),
avg_age = mean(age, na.rm = TRUE),
avg_experience = mean(experience, na.rm = TRUE),
avg_income = mean(income, na.rm = TRUE),
avg_cc_spend = mean(cc_avg, na.rm = TRUE)
)
::kable(cluster_summary,
knitrdigits = c(0, 0, 2),
col.names = c("Cluster", "# Customers", "Avg. Age", "Avg. Experience", "Avg. Income", "Avg. CC Spend"),
caption = "Customer Segments Summary for K-Means Clustering")
Cluster | # Customers | Avg. Age | Avg. Experience | Avg. Income | Avg. CC Spend |
---|---|---|---|---|---|
C1 | 2029 | 35.12 | 10 | 60 | 1.37 |
C2 | 2168 | 55.55 | 30 | 59 | 1.36 |
C3 | 803 | 43.60 | 19 | 148 | 4.96 |
Cluster 1- Second largest cluster with 2029 customers, youngest average age being 35.12, lowest average work experience with it being 10 years and their average annual income is €60K and their average spening on credit cards per month being €1.37K.
Cluster 2 - Largest cluster with 2168 customers, oldest average age with it being 55.55, highest average work experience being 30 years and their average annual income is €59k. The average spend on credit cards per month being €1.36K which is the lowest.
Cluster 3 - Smallest cluster with just 803 customers. Average age is 43.60 and their average work experience is 19 years. Their average annual income is the highest being €148K and the highest average spend on credit cards per month with it being €4.96k
<- bpl_clus %>%
bpl_pl mutate(clusters3 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters3 == 1 ~ 'C1',
== 2 ~ 'C2',
clusters3 == 3 ~ 'C3'))
clusters3 <- bpl_clus %>%
loan_summary group_by(cluster) %>%
summarise(
num_customers = n(),
num_loan_taken = sum(personal_loan, na.rm = TRUE),
loan_percentage = (sum(personal_loan, na.rm = TRUE) / n()) * 100 )
::kable(loan_summary,
knitrdigits = 2,
col.names = c("Cluster", "# Customers", "# Loans Taken", "Loan %"),
caption = "Personal Loans Taken per Cluster")
Cluster | # Customers | # Loans Taken | Loan % |
---|---|---|---|
C1 | 2029 | 69 | 3.40 |
C2 | 2168 | 85 | 3.92 |
C3 | 803 | 326 | 40.60 |
ggplot(loan_summary, aes(x = cluster, y = loan_percentage, fill = cluster)) +
geom_bar(stat = "identity") +
labs(title = "Personal Loans Taken by Cluster",
x = "Customer Segment",
y = "Percentage of Customers with a Loan") +
theme_minimal()
Cluster 3 is most likely to take out a loan in the future as they have the highest percentage (40.6%) of loans taken out in the past. They have taken out 326 loans in the past compared to 69 and 85 in the the others.