library(cluster)
library(tidyverse)
library(ggplot2)
library(knitr)Segmentation
title: “Assignment 2- Segmentation” format: html editor: visual —
Task 1
Question 1
supermarket <- read_csv("supermarket_customers.csv")Question 2
comm_2 <- select(supermarket, annual_income , spending_score)
d1<- dist(comm_2)(A)
Yes the data needs to be scaled as the two variables given are varied, and unlimited as to how much money is earned by the population within the sample, and how much money is spent in the supermarket.
Question 3
h1 <- hclust(d1)Question 4
plot(h1, hang = -1)heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')(A)
Yes this heat map does suggest evidence of clustering. The reasoning for this is the lightly coloured blocks that are positioned around the diagonal of the heat map indicate clustering.
Question 5
clusters1 <- cutree(h1, k = 5)
sil1 <- silhouette(clusters1, d1)
summary(sil1)Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
23 21 85 39 32
0.5180227 0.6361876 0.5613615 0.5123015 0.5509055
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08527 0.49132 0.59236 0.55299 0.66247 0.74442
(A)
The cluster analysis has a mean Silhouette Score of 0.55299 which suggests a reasonable clustering structure.
Question 6
comm_clus <- supermarket %>%
mutate(clusters1 = clusters1) %>%
mutate(cluster = case_when (clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3',
clusters1 == 4 ~ 'C4',
clusters1 == 5 ~ 'C5'))
comm_clus_means <- comm_clus %>%
group_by(cluster) %>%
summarise (num_custs = n(),
annual_income = mean(annual_income),
spending_score = mean(spending_score),
age = mean(age))
comm_clus_means# A tibble: 5 × 5
cluster num_custs annual_income spending_score age
<chr> <int> <dbl> <dbl> <dbl>
1 C1 23 26.3 20.9 45.2
2 C2 21 25.1 80.0 25.3
3 C3 85 55.8 49.1 42.5
4 C4 39 86.5 82.1 32.7
5 C5 32 89.4 15.6 41
(A)
ggplot(comm_clus, aes(x = annual_income, y = spending_score, color = as.factor(cluster))) +
geom_point() +
labs(title = "Scatterplot of Annual Income vs Spending Score",
x = "Annual income",
y = "Spending Score",
color = "Cluster") +
theme_minimal()(B)
knitr::kable(select(comm_clus_means, cluster, num_custs, age, annual_income, spending_score), digits = c(0, 0, 0, 2, 0),
col.names = c("Cluster", "Number of Customers", "Age", "Annual Income", "Spending Score"),
caption = "Average customer information (by cluster)")| Cluster | Number of Customers | Age | Annual Income | Spending Score |
|---|---|---|---|---|
| C1 | 23 | 45 | 26.30 | 21 |
| C2 | 21 | 25 | 25.10 | 80 |
| C3 | 85 | 42 | 55.81 | 49 |
| C4 | 39 | 33 | 86.54 | 82 |
| C5 | 32 | 41 | 89.41 | 16 |
(C)
comm_clus$age <- factor(comm_clus$age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))ggplot(comm_clus, aes(x = gender, group = cluster)) +
geom_bar(aes(y = after_stat(prop), fill = factor(after_stat(x))), stat = "count", show.legend = FALSE) +
scale_fill_manual(values = c(rep("darkgrey",6))) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of Customers") +
xlab("Cluster") +
ggtitle("Distribution of gender by Cluster")Question 7
(C1) This cluster is made up of low spenders and low earners. (Age- 45, majority female)
(C2) This cluster is made up of low income but high spenders (Age- 25, majority female)
(C3) This cluster is made up of average earners and average everyday spenders. (Middle ground) (Age-42, majority female)
(C4) This cluster is made up of high earners and high spenders. (Age- 33, majority female)
(C5) This cluster is made up of high earners and low spenders. (Age- 41, majority male)
Question 8
Frugal 45ers- Encouraging the low spenders to enter the store more often by offering consistent loyalty discounts that are limited time.
Spendy Twenties- Target the low income- high spenders with values on the supermarkets names brand products.
Balanced Budgeters- Offer items in bundles rather than on theirtm own, this can entice the average spender to spend more time and money in the supermarket.
High Rollers- Target with promotions on premium products.
Savvy Spenders- Entice with premium products that are sold for a limited time.
Task 2
Question 1
recycling <- read_csv("recycling.csv")Question 2
recycling_2 <- select(recycling, pos_impact , environ, money, bins, local, avoid_waste)
d1<- dist(recycling_2)Question 3
No the data does not need to be scaled, as all the variables included are on a scale of 1-4.
Question 4
set.seed(101)
kmeans1 <- kmeans(recycling_2, centers = 3)Question 5
d3 <- dist(recycling_2)
sil_kmeans1 <- silhouette(kmeans1$cluster, d3)
summary(sil_kmeans1)Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d3) :
Cluster sizes and average silhouette widths:
169 162 35
0.2682631 0.2054575 0.3271905
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.08256 0.14303 0.23963 0.24610 0.35704 0.52470
The average Silhouette Score is 0.24610, this is considered a weak quality of clustering.
Question 6
recycling_clusters <- recycling %>%
mutate(clusters1 = kmeans1$cluster) %>%
mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))
rec_cluster_mean <- recycling_clusters %>%
group_by(cluster) %>%
summarise (num_custs = n(),
pos_impact_mean = mean(pos_impact),
environ_mean = mean(environ),
money_mean = mean(money),
bins_mean = mean(bins),
local_mean = mean(local),
avoid_waste_mean = mean(avoid_waste))
rec_cluster_mean(A)
knitr::kable(select(rec_cluster_mean, cluster, num_custs, pos_impact_mean, environ_mean, money_mean, bins_mean, local_mean, avoid_waste_mean),
digits = c(0, 0, 0, 0, 0, 0, 0),
col.names = c("Cluster", "Number of Customers", "Pos impact", "Environment", "Money", "Bins", "Local", "Avoid Waste"),
caption = "Average Responses")| Cluster | Number of Customers | Pos impact | Environment | Money | Bins | Local | Avoid Waste |
|---|---|---|---|---|---|---|---|
| C1 | 169 | 4 | 4 | 4 | 3 | 3 | 1 |
| C2 | 162 | 4 | 3 | 4 | 3 | 3 | 3 |
| C3 | 35 | 2 | 1 | 2 | 1 | 2 | 3 |
(B)
ggplot(data = recycling_clusters) +
geom_bar(mapping = aes(x = cluster, fill = age), position = "dodge") +
scale_fill_manual(values = c("#0000FF", "#7FFFD4", "#EE7AE9", "#FFEA56", "#708090", "#D02090")) +
ggtitle("Age by Cluster") +
xlab("Cluster") +
ylab("Percentages")Question 7
(C1) - This cluster is made up of an older generation, but they believe that the environment is quite important and should be taken seriously. They do their best to adhere to best practices surrounding the environment but often lack education and understanding in comparison to the other clusters.
(C2) - This is the second oldest cluster, but they’re generally closely related to the opinions of their elders. They struggle with packaging and rubbish coming into the house, this could be related to an increase in consumerism in relation to their elders.
(C2)- This is the youngest demographic and are quite unconcerned with their environmental impact, even though they’re very educated and understand it well.
Question 8
Carlow county council could educate the older clusters about the environment, so they can understand the reasoning behind recycling and how to better deal with rubbish in their homes.
The younger generation needs to be convinced that what they do everyday matters. They seem to be aware and educated about recycling but often cant or wont apply it to their own life or daily habits.