Data Analysis YR.4, SEM2 CA2

Setting Up

library(cluster)
library(tidyverse)

Part A

Q1

Importing dataset

supermarket <- read_csv("supermarket_customers.csv")

Q2

Creating a sub data set containing only clustering variables (annual_income and spending_score)

supermarket_2 <- select(supermarket, annual_income, spending_score)
d1 <- dist(supermarket_2)

Q2 (a)

The data should be scaled as both variables are not even; annual_income can be as high as needed, whereas spending score appears to be on a scale from 1 to 99.

#scaling the data#
supermarket_2_scale <- scale(supermarket_2)
d2 <- dist(supermarket_2_scale)

Q3

Carrying out hierarchical clustering on the sub data set.

hc_1 <- hclust(d2)

Q4

Plotting a dendogram of the hierarchical clustering.

plot(hc_1, hang = -1)

Creating a heatmap of the hierarchical clustering.

heatmap(as.matrix(d2), Rowv = as.dendrogram(hc_1), Colv = 'Rowv')

Q4 (a)

The heatmap does provide evidence of clustering as there is multiple light coloured blocks situated around the diagonal of the heatmap, this suggests a convincing clustering structure.

Q5

Creating a 5 cluster solution of the hierarchical clustering.

clusters1 <- cutree(hc_1, k = 5)

Q6

Assessing the quality of the 5 cluster solution using the silhouette score.

sil1 <- silhouette(clusters1, d2)
summary(sil1)

Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters1, dist = d2) :
 Cluster sizes and average silhouette widths:
       23        21        79        39        38 
0.5065849 0.6199558 0.6140651 0.5094809 0.4623933 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1401  0.4993  0.5965  0.5531  0.6610  0.7645

The mean of the silhouette score is 0.5531, this symbolizes a reasonable clustering structure.

Q6 (a)

Adding the clusters as a new variable in the original data set. This will allow for easier graph creation.

supermarket_clusters <- supermarket %>%
  mutate(clusters1 = clusters1) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3',
                             clusters1 == 4 ~ 'C4',
                             clusters1 == 5 ~ 'C5',))

Creating a scatterplot of the relationship between customers annual income and their spending score

Spending score is measured on a scale of 1-99

ggplot(data = supermarket_clusters) + 
  geom_point(mapping = aes(x = annual_income, y = spending_score, colour = cluster)) +
  ggtitle("Customer Annual Income and Spending Score per Cluster") +
  scale_color_manual(values = c("#7BC1FA", "#7ABD7E", "#FAD66D", "#FFB54C", "#FF6961")) +
  xlab("Customer Annual Income (in '000)") +
  ylab("Customer Spending Score") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "#666666", face = "bold"),
        axis.title.y = element_text(colour = "#666666", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "#666666", face = "bold"),
        plot.title = element_text(face = "bold"))

As can be seen in the above scatter plot, the customer clusters can be profiled as such;

C1 (blue) is comprised of customers who have a low annual income and a low spending score.
C2 (green) contains customers who have a low annual income, but a high spending score.
C3 (yellow) comprises of customers who both have a medium annual income, as well as a medium spending score.
C4 (orange) is made up of customers who have a higher annual income, and a high spending score.
C5 (red) contains customers who have a higher annual income, but a low spending score.

Q6 (b)

Table of number of customers, average annual income, average spending cost and average age based on each cluster.

table_1.1 <- supermarket_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(annual_income), 2)) 

table_1.2 <- supermarket_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(spending_score), 2))

table_1.3 <- supermarket_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(age), 0))

knitr::kable(table_1.1, "pipe", col.names = c("Cluster", "Mean Annual Income"), align = c("c", "c"), caption = "Avg. Annual Income by Cluster")

Avg. Annual Income by Cluster
Cluster	Mean Annual Income
C1	26.30
C2	25.10
C3	54.42
C4	86.54
C5	87.00

knitr::kable(table_1.2, "pipe", col.names = c("Cluster", "Mean Spending Score"), align = c("c", "c"), caption = "Avg. Spending Score by Cluster")

Avg. Spending Score by Cluster
Cluster	Mean Spending Score
C1	20.91
C2	80.05
C3	50.22
C4	82.13
C5	18.63

knitr::kable(table_1.3, "pipe", col.names = c("Cluster", "Mean Age"), align = c("c", "c"), caption = "Avg. Age by Cluster")

Avg. Age by Cluster
Cluster	Mean Age
C1	45
C2	25
C3	43
C4	33
C5	40

Q6 (c)

Graph of count of gender in each cluster.

ggplot(data = supermarket_clusters) +
  geom_bar(mapping = aes(x = cluster, fill = gender),position = "dodge") +
  scale_fill_manual(values = c("#FFB7CE", "#89CFF0")) +
  ggtitle("Gender by Cluster") +
  xlab("Cluster") +
  ylab("# of people") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "#666666", face = "bold"),
        axis.title.y = element_text(colour = "#666666", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "#666666", face = "bold"),
        plot.title = element_text(face = "bold"))

Q7

From all the above information, the segments can be profiled as such;

C1 comprises mostly of females with a mean age of 45. The customers in this cluster have an average income of €26,300 annually with a 20.91/ 99 spending score. This group is the oldest in comparison to other clusters.
C2 contains mostly females, but less females in comparison to C1 (amount of males stays the same) with an average age of 25 years old. The average annual income in this cluster in €25’100 with a spending score of 80.05/99. This segments contains the lowest average annual income group, whilst also being the youngest segment.
C3 is the largest cluster out of all five segments. It contains more females than males, and the average age is 43. The average annual income in this segments is €54’420 with a spending score of 50.22/99.
C4 contains more females than males, and the average age of this cluster is 33. The average income amongst these customers is €86’540 annually, and they have a mean spending score of 82.13/99. This segment contains the highest mean spending score in comparison to other clusters.
C5 is the only segments which contains more males than females, with an average age of 40 years old. The average income in this segment is €87’000 with a mean spending score of 18.63/99. This segments contains the highest average annual income group, however also the group with the lowest mean spending score.

Q8

Increasing spending score from all segments;

C1 (low annual income, low spending score); the supermarket could run sales promotions to appeal to this segment, promotions such as the distribution of coupons to lower grocery shopping prices, focusing on highlighting own value produce as a way to save money for price sensitive shoppers, as well as positioning the store in the ‘best value for money’ segment.
Appealing to C2 (low annual income, high spending score) could be done by increasing repeat purchases (buy 1 get one free, buy 3 get the cheapest for free etc.), and by offering discounts if customers pass a specified grocery total (ex. spend €50 and receive €5 off on next grocery shop). Promoting club cards for collected supermarket points or lowered prices could also be beneficial. As this segment is the youngest, student discounts could also be considered.
As C3 is the biggest segment, with a medium annual income and a medium spending score the supermarket could promote sales promotions and strategies used for all other segments (C1 - C5) and position them as better than can be found at competitor stores. This could encourage more customers to shop at the store as well as aid in increasing spending scores.
As C4 is a high earner and a high spending score bracket, promoting luxury products such as highest quality foods, alcohols, hygiene/ beauty products and home decor could appeal to this segment. This could increase higher price impulse buys with a mindset of ‘higher quality for higher price’, or simply for ‘treating yourself’.
For C5, which is a high earner but low spending score segment, offering products which cannot be bought at competitors stores could be employed to gain more foot traction from this segment. Appealing to this segment could also be done by offering luxury foreign/ exotic product choices to drive interest in shopping at the supermarket.

Part B

Q1

Importing the data set

recycling <- read_csv("recycling.csv")

Q2

Creating a sub data set including all variable necessary to carry out the k-means analysis

recycling_2 <- select(recycling, pos_impact:avoid_waste)

Q3

Does the data need to be scaled? No, as all variables that the k-means clustering will be run on are on a 1-4 likert scale.

Q4

Using the k-means clustering algorithm to produce a 3 cluster solution

set.seed(101)
kmeans1 <- kmeans(recycling_2, centers = 3)

Q5

Assessing the quality of the k-means solution

d3 <- dist(recycling_2)

sil_kmeans1 <- silhouette(kmeans1$cluster, d3)
summary(sil_kmeans1)

Silhouette of 366 units in 3 clusters from silhouette.default(x = kmeans1$cluster, dist = d3) :
 Cluster sizes and average silhouette widths:
      169       162        35 
0.2682631 0.2054575 0.3271905 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08256  0.14303  0.23963  0.24610  0.35704  0.52470

Based on the above silhouette scores, the quality of the clustering is as below

Clustering for C1 is weak, with a score of 0.268.
C2 shows no substantial structure with a silhouette score of 0.205.
C3 has the highest silhouette score, however the clustering of this segment is still considered weak with a score of 0.327.
Overall, the clustering solution shows an insubstantial structure due to low mean score of 0.246.

Q6

Profiling the clusters

#adding cluster variable columns to original data set#
recycling_clusters <- recycling %>%
  mutate(clusters1 = kmeans1$cluster) %>%
  mutate(cluster = case_when(clusters1 == 1 ~ 'C1',
                             clusters1 == 2 ~ 'C2',
                             clusters1 == 3 ~ 'C3'))

Q6 (a)

table_2.1 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(pos_impact), 0))

table_2.2 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(environ), 0))

table_2.3 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(money), 0))

table_2.4 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(bins), 0))

table_2.5 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(local), 0))

table_2.6 <- recycling_clusters %>%
  group_by(cluster) %>%
  summarise(round(mean(avoid_waste), 0))

knitr::kable(table_2.1, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Positive Impact'")

Avg. Response to ‘Positive Impact’
Cluster	Average Response
C1	4
C2	4
C3	2

knitr::kable(table_2.2, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Environment'")

Avg. Response to ‘Environment’
Cluster	Average Response
C1	4
C2	3
C3	1

knitr::kable(table_2.3, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Money'")

Avg. Response to ‘Money’
Cluster	Average Response
C1	4
C2	4
C3	2

knitr::kable(table_2.4, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Bins'")

Avg. Response to ‘Bins’
Cluster	Average Response
C1	3
C2	3
C3	1

knitr::kable(table_2.5, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Local'")

Avg. Response to ‘Local’
Cluster	Average Response
C1	3
C2	3
C3	2

knitr::kable(table_2.6, "pipe", col.names = c("Cluster", "Average Response"), align = c("c", "c"), caption = "Avg. Response to 'Avoid Waste'")

Avg. Response to ‘Avoid Waste’
Cluster	Average Response
C1	1
C2	3
C3	3

Q6 (b)

ggplot(data = recycling_clusters) + 
  geom_bar(mapping = aes(x = cluster, fill = age), position = "dodge") +
  scale_fill_manual(values = c("#A587CA", "#36CEDC", "#8FE968", "#FFEA56", "#FFB750", "#FE797B")) +
  ggtitle("Age by Cluster") +
  xlab("Cluster") +
  ylab("# of People in each Age bracket") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid.major = element_line(color = "white"),
        axis.title.x = element_text(colour = "#666666", face = "bold"),
        axis.title.y = element_text(colour = "#666666", face = "bold"),
        axis.text = element_text(colour = "#666666"),
        axis.ticks = element_line(colour = "#666666"),
        axis.line = element_line(colour = "#666666"),
        title = element_text(colour = "#666666", face = "bold"),
        plot.title = element_text(face = "bold"))

Q7

The clusters can be profiled as such;

C1 is contains most people from ages 35-54. This cluster agreed the most to the following statements; ‘positive impact’, ‘environment’and ’money’. This cluster only scored a 1 to the statement regarding ‘avoid waste’.
C2 contains most individuals from the ages 25-54. This cluster agreed the most to the following statements; ‘positive impact’ and ‘money’. It is also important to notice that this cluster did not score the importance of any statements under a 3. This suggests that this segment is very involved and interested in recycling.
C3 contains most individuals from the ages 18-24 and 35-44. This cluster agreed the most to the following statements; ‘avoid waste’. The other statements were all answered with either a 1 or a 2, this suggests that this segment is not very interested in recycling.

Q8

Some strategies the council could employ to encourage recycling in each segment could be;

For citizens found in C1, the council could focus on a recycling campaign which emphasizes the importance of recycling for the environment and the positive impacts it makes to reducing further damage to the Earth. As ‘avoiding waste’ is a low priority of this segment, the campaign could focus on recycling’s positive impacts of protecting Irelands flora and fauna.
For citizens located in C2, the council could organise litter picking initiatives around the town and encourage participation. An informational class in increasing recycling efforts through sorting rubbish into the correct bins could also be held, this would increase citizen awareness in which rubbish goes into what bin, and how the rubbish should be prepared before recycling for best results.
For citizens in C3, a social media campaign could be organised on the importance and the impact of correct recycling. As the citizens found in this cluster are younger, a social media campaign could be a good way to reach out to internet savvy members of the town. As this segment doesn’t seem to be interested in recycling, the key objectives of this campaign would be increasing recycling knowledge and its benefits, as well as promoting interest in the topic.