library(cluster)
library(tidyverse)
#Step 1: Import the data
energy_drink <- read_csv("energy_drinks.csv")
View(energy_drink)
#Step 2: Compute distances between each pair of consumers
energy_drink_2 <- select(energy_drink, D1:D5)
d1 <- dist(energy_drink_2)
Energy Drink Analysis
Segmenting Consumers Based on Energy Drink Preference
1. Import the energy_drinks.csv file into R.
2. Create a distance matrix containing the Euclidean distance between all pairs of consumers.
a. Does the data need to be scaled before computing the distance matrix? Explain your answer.
No. The five variables entered into the clustering algorithm (D1 to D5) are all measured on the same 10-point Likert scale, so no single variable can dominate the distance calculation. We can therefore compute the Euclidean distance between each pair of customers directly from the raw ratings, as done with dist() in Step 2 of the R code above.
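As a small illustration of the point about a shared scale, the sketch below uses a made-up ratings table (the values in `toy` are hypothetical and are not taken from energy_drinks.csv) and checks that dist() returns the ordinary Euclidean distance on the raw ratings.

```r
# Hypothetical ratings for three consumers on two drinks (same rating scale).
toy <- data.frame(D1 = c(1, 4, 9), D2 = c(5, 9, 5))

# dist() computes Euclidean distance by default.
toy_dist <- as.matrix(dist(toy))

# Manual check for consumers 1 and 2: sqrt((1 - 4)^2 + (5 - 9)^2)
manual_12 <- sqrt(sum((unlist(toy[1, ]) - unlist(toy[2, ]))^2))

toy_dist[1, 2]
manual_12
```

Both values equal 5. Because the two columns share one scale, each contributes to the distance in proportion to the actual difference in ratings; scale() would only be needed if the variables were measured in different units.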
3. Carry out a hierarchical clustering using the hclust function. Use method = "average".
h1 <- hclust(d1, method = "average")
4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap. Note that the heatmap may take several seconds to appear because of the large number of customers in the dataset.
plot(h1, hang = -1)
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)
a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.
Yes. Three distinct blocks of light yellow are visible along the diagonal of the heatmap, which is what we would expect if the data contain clusters. Each light yellow block marks a group of customers who are very similar to one another. The heatmap also shows other square-like blocks, which indicate that certain groups of customers are more similar to each other than they are to customers in other groups.
5. Create a 3-cluster solution using the cutree function and assess the quality of this solution.
#Step 4: Decide on number of clusters
clusters1 <- cutree(h1, k = 3)
#Step 5: Assess the quality of the segmentation
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
417 235 188
0.2249249 0.1987262 0.3918562
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3120 0.1599 0.2916 0.2550 0.3716 0.5502
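One way to put an average silhouette width in context is to compare it across several values of k. The sketch below is only an illustration under stated assumptions: it uses synthetic data (`toy`, generated with rnorm) rather than the energy drinks dataset, and the range k = 2 to 6 is an arbitrary choice.

```r
library(cluster)

set.seed(1)
# Synthetic data: three groups of 20 points in two dimensions (an assumption
# made for illustration; this is not the energy drinks data).
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2),
             matrix(rnorm(40, mean = 12), ncol = 2))

toy_dist <- dist(toy)
toy_hc <- hclust(toy_dist, method = "average")

# Average silhouette width for each candidate number of clusters.
avg_width <- sapply(2:6, function(k) {
  sil <- silhouette(cutree(toy_hc, k = k), toy_dist)
  mean(sil[, "sil_width"])
})
names(avg_width) <- paste0("k=", 2:6)
avg_width
```

The value of k with the largest average width is the one this criterion favours. The same loop could be run with h1 and d1 in place of toy_hc and toy_dist.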
6. Profile the clusters, making sure to include answers to the questions below. Include any graphs/tables necessary to support your profiling.
#Step 6: Profile the clusters.
energy_drink_clus <- cbind(energy_drink, clusters1)
energy_drink_clus <- mutate(energy_drink_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                                   clusters1 == 2 ~ 'C2',
                                                                   clusters1 == 3 ~ 'C3'))
a. How do the clusters differ on their average rating of each version of the energy drinks?
library(tidyverse)
library(kableExtra)
Rating_Avg <- energy_drink_clus %>%
  group_by(cluster) %>%
  summarise(num_customers = n(),
            avg_D1 = mean(D1),
            avg_D2 = mean(D2),
            avg_D3 = mean(D3),
            avg_D4 = mean(D4),
            avg_D5 = mean(D5))
#Convert the dataset to be in "tidy" format to allow for creation of line graph.
Rating_Avg_tidy <- Rating_Avg %>%
  pivot_longer(cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5),
               names_to = "Energy_Drink", values_to = "Average")
Rating_Avg_tidy
# A tibble: 15 × 4
cluster num_customers Energy_Drink Average
<chr> <int> <chr> <dbl>
1 C1 417 avg_D1 3
2 C1 417 avg_D2 4.71
3 C1 417 avg_D3 6.20
4 C1 417 avg_D4 6.70
5 C1 417 avg_D5 6.76
6 C2 235 avg_D1 2.96
7 C2 235 avg_D2 4.67
8 C2 235 avg_D3 6.95
9 C2 235 avg_D4 5.01
10 C2 235 avg_D5 2.97
11 C3 188 avg_D1 6.98
12 C3 188 avg_D2 5.37
13 C3 188 avg_D3 3.03
14 C3 188 avg_D4 2.88
15 C3 188 avg_D5 2.84
arrange(Rating_Avg)
# A tibble: 3 × 7
cluster num_customers avg_D1 avg_D2 avg_D3 avg_D4 avg_D5
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 417 3 4.71 6.20 6.70 6.76
2 C2 235 2.96 4.67 6.95 5.01 2.97
3 C3 188 6.98 5.37 3.03 2.88 2.84
knitr::kable(Rating_Avg,
             format = "html",
             col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2",
                           "Avg D3", "Avg D4", "Avg D5"),
             caption = "<b>Number of Customers and Average Ratings</b>",
             align = "ccccccc",
             table.attr = 'data-quarto-disable-processing = "true"',
             digits = c(0,3,2,2,2,2,2)) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 14) %>%
  column_spec(1, color = "black", background = "Pink")
| Cluster | Number of Customers | Avg D1 | Avg D2 | Avg D3 | Avg D4 | Avg D5 |
|---|---|---|---|---|---|---|
| C1 | 417 | 3.00 | 4.71 | 6.20 | 6.70 | 6.76 |
| C2 | 235 | 2.96 | 4.67 | 6.95 | 5.01 | 2.97 |
| C3 | 188 | 6.98 | 5.37 | 3.03 | 2.88 | 2.84 |
ggplot(Rating_Avg_tidy, mapping = aes(x = Energy_Drink, y = Average, group = cluster, colour = cluster)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
ylab("Ratings") +
xlab("Energy Drink") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Average Rating of Energy Drinks")
b. How do the clusters differ on age and gender?
energy_drink_clus$Age <- factor(energy_drink_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))
library(ggplot2)
library(scales)
# Gender
energy_drink_clus$Gender <- factor(energy_drink_clus$Gender, levels = c("Male", "Female"))
ggplot(energy_drink_clus, aes(x = Gender, group = cluster)) +
  # after_stat(prop) replaces the deprecated ..prop.. notation (ggplot2 >= 3.4.0)
  geom_bar(aes(y = after_stat(prop), fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent_format()) +
ylab("Percentage of People") +
xlab("Gender") +
ggtitle("Gender by Cluster") +
coord_flip()
# Age
# Order the age groups from youngest to oldest (assign to Age, not Gender)
energy_drink_clus$Age <- factor(energy_drink_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))
ggplot(energy_drink_clus, aes(x = Age, group = cluster)) +
  # after_stat(prop) replaces the deprecated ..prop.. notation (ggplot2 >= 3.4.0)
  geom_bar(aes(y = after_stat(prop), fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent_format()) +
ylab("Percentage of People") +
xlab("Age Group") +
ggtitle("Age by Cluster") +
coord_flip()
7. Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.
The company should advertise D1 to Cluster 3, D3 to Cluster 2, and D5 to Cluster 1. Each of these clusters gave the corresponding version its highest average rating (6.98, 6.95 and 6.76 respectively), so these customers are the most likely to respond positively to it.
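The matching of drinks to clusters can be checked mechanically. The sketch below copies the average ratings for D1, D3 and D5 from the cluster profile table; the object names `ratings` and `best_cluster` are introduced here for illustration only.

```r
# Average ratings copied from the cluster profile table in this document.
ratings <- data.frame(
  cluster = c("C1", "C2", "C3"),
  D1 = c(3.00, 2.96, 6.98),
  D3 = c(6.20, 6.95, 3.03),
  D5 = c(6.76, 2.97, 2.84)
)

# For each drink, pick the cluster with the highest average rating.
best_cluster <- sapply(ratings[, c("D1", "D3", "D5")],
                       function(x) ratings$cluster[which.max(x)])
best_cluster
```

This returns C3 for D1, C2 for D3 and C1 for D5.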
8. If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?
We recommend that the company continue producing D3. It is rated highly in both Cluster 1 (average 6.20) and Cluster 2 (average 6.95), although within Cluster 1 it ranks behind D4 and D5. Together these two clusters contain 652 of the 840 customers (about 78%), so D3 appeals to the largest customer base.
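A rough check of this recommendation is to weight each cluster's average rating by the cluster size. The sketch below uses the rounded values from the profile table, so the results are approximate; `sizes`, `avg_ratings` and `overall` are names introduced here for illustration.

```r
# Cluster sizes and average ratings copied from the profile table above.
sizes <- c(C1 = 417, C2 = 235, C3 = 188)
avg_ratings <- rbind(
  D1 = c(3.00, 2.96, 6.98),
  D2 = c(4.71, 4.67, 5.37),
  D3 = c(6.20, 6.95, 3.03),
  D4 = c(6.70, 5.01, 2.88),
  D5 = c(6.76, 2.97, 2.84)
)
colnames(avg_ratings) <- names(sizes)

# Size-weighted mean rating of each drink across all 840 customers.
overall <- drop(avg_ratings %*% sizes) / sum(sizes)
round(overall, 2)

# Drink with the highest overall average rating.
names(overall)[which.max(overall)]
```

On these figures D3 has the highest overall average, at roughly 5.70.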
******************************END OF THE DOCUMENT****************************************