Segmenting Consumers Based on Energy Drink Preference

Author

Chaitanya Kumar

Introduction

In today’s competitive beverage market, understanding customer preferences is crucial for successful product positioning and marketing. This report focuses on segmenting customers based on their ratings of five energy drink versions (D1, D2, D3, D4, and D5), which vary in the concentration of a flavoring ingredient. Customer demographic information such as age and gender is also analyzed to provide deeper insights. Through clustering analysis, we aim to identify distinct customer groups and recommend targeted marketing strategies to optimize product appeal and profitability.

### 1. Import the Dataset

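# Load the packages used throughout the report
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(dendextend)  # color_branches()
library(pheatmap)    # pheatmap()
library(cluster)     # silhouette()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(reshape2)    # melt()

# Load the dataset
energy_data <- read_csv("energy_drinks.csv")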

# Display the structure of the data
str(energy_data)
spc_tbl_ [840 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : chr [1:840] "ID_1" "ID_2" "ID_3" "ID_4" ...
 $ D1    : num [1:840] 2 4 2 1 1 2 1 1 2 5 ...
 $ D2    : num [1:840] 3 4 3 6 3 3 5 3 3 5 ...
 $ D3    : num [1:840] 7 5 8 5 7 8 6 7 6 6 ...
 $ D4    : num [1:840] 7 6 8 8 7 7 5 9 7 7 ...
 $ D5    : num [1:840] 7 9 5 6 7 5 5 7 5 7 ...
 $ Gender: chr [1:840] "Male" "Male" "Female" "Female" ...
 $ Age   : chr [1:840] "Under_25" "Under_25" "Under_25" "Under_25" ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_character(),
  ..   D1 = col_double(),
  ..   D2 = col_double(),
  ..   D3 = col_double(),
  ..   D4 = col_double(),
  ..   D5 = col_double(),
  ..   Gender = col_character(),
  ..   Age = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(energy_data)
# A tibble: 6 × 8
  ID       D1    D2    D3    D4    D5 Gender Age     
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <chr>   
1 ID_1      2     3     7     7     7 Male   Under_25
2 ID_2      4     4     5     6     9 Male   Under_25
3 ID_3      2     3     8     8     5 Female Under_25
4 ID_4      1     6     5     8     6 Female Under_25
5 ID_5      1     3     7     7     7 Male   Under_25
6 ID_6      2     3     8     7     5 Male   Under_25

### 2. Create a Distance Matrix

# Select only numerical columns for clustering
numerical_data <- energy_data %>% select(D1:D5)

# Scale the data
scaled_data <- scale(numerical_data)

# Create a Euclidean distance matrix
distance_matrix <- dist(scaled_data, method = "euclidean")

a. Does the data need to be scaled before computing the distance matrix?

Scaling ensures that every variable contributes equally to the distance computation. Without it, variables with larger ranges or variances dominate the distances. Here all five versions are rated on the same scale, so scaling is less critical than it would be for variables measured in different units. Standardizing still equalizes any differences in spread between the versions, so that no single version carries extra weight in the Euclidean distance.
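
As a rough illustration of what `scale()` does before `dist()` is called, the sketch below standardizes a small made-up ratings table and checks that every column ends up with mean 0 and standard deviation 1. The `toy_ratings` values are hypothetical and are not taken from `energy_drinks.csv`.

```r
# Hypothetical ratings for three drink versions (not the real dataset)
toy_ratings <- data.frame(
  D1 = c(2, 4, 2, 1, 5),
  D2 = c(3, 4, 3, 6, 5),
  D3 = c(7, 5, 8, 5, 6)
)

# Spread of each column before scaling
raw_sd <- sapply(toy_ratings, sd)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy_ratings)
scaled_mean <- colMeans(toy_scaled)
scaled_sd <- apply(toy_scaled, 2, sd)

print(round(raw_sd, 2))
print(round(scaled_mean, 2))
print(round(scaled_sd, 2))

# Euclidean distances on the raw and on the standardized ratings
raw_dist <- dist(toy_ratings, method = "euclidean")
scaled_dist <- dist(toy_scaled, method = "euclidean")
```

Comparing `raw_dist` with `scaled_dist` shows how much the standardization changes the pairwise distances when the columns differ in spread.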


### 3. Perform Hierarchical Clustering

```{.r .cell-code}
# Perform hierarchical clustering using the "average" method
hclust_result <- hclust(distance_matrix, method = "average")

# Visualize the dendrogram
plot(hclust_result, main = "Dendrogram of Hierarchical Clustering",
     xlab = "", ylab = "Height")
```

Explanation: The dendrogram provides a visual representation of the hierarchical clustering process. The height at which two branches merge indicates the dissimilarity between the clusters being joined, which helps us decide on a suitable number of clusters for the dataset.
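
The report uses average linkage. One common way to check how faithfully a dendrogram preserves the original distances is the cophenetic correlation, which can be compared across linkage methods. The sketch below is an illustration on simulated data (the `toy` matrix and the list of linkage methods are assumptions, not part of the original analysis).

```r
set.seed(42)

# Hypothetical data: 60 respondents, 5 ratings, three loosely separated groups
toy <- rbind(
  matrix(rnorm(100, mean = 2), ncol = 5),
  matrix(rnorm(100, mean = 5), ncol = 5),
  matrix(rnorm(100, mean = 8), ncol = 5)
)
toy_dist <- dist(scale(toy), method = "euclidean")

# Correlate the original distances with the dendrogram (cophenetic) distances
linkages <- c("average", "complete", "single", "ward.D2")
coph_cor <- sapply(linkages, function(m) {
  fit <- hclust(toy_dist, method = m)
  cor(as.vector(toy_dist), as.vector(cophenetic(fit)))
})

print(round(coph_cor, 3))
```

A value closer to 1 means the tree distorts the pairwise distances less. Applying the same loop to `distance_matrix` would show whether another linkage represents the ratings data better than `"average"`.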

### 4. Visualize the Clustering Results

  1. Dendrogram

# Create a colored dendrogram
dendrogram <- as.dendrogram(hclust_result)
colored_dend <- color_branches(dendrogram, k = 3)
plot(colored_dend, main = "Colored Dendrogram for 3 Clusters")

  2. Heatmap

# Visualize data with a heatmap
pheatmap(scaled_data, cluster_rows = TRUE, cluster_cols = FALSE,
         show_rownames = FALSE, main = "Heatmap of Energy Drink Ratings")

Does the heatmap provide evidence of any clustering structure within the energy drinks dataset?

Answer: The heatmap shows distinct blocks of similarity, indicating that certain groups of participants share similar preferences for the energy drinks. This validates the presence of clustering patterns and supports the segmentation approach.

### 5. Create a 3-Cluster Solution

# Cut the dendrogram into 3 clusters
clusters <- cutree(hclust_result, k = 3)

# Add cluster labels to the original data
energy_data$Cluster <- as.factor(clusters)

# Assess the quality of clustering using silhouette scores
silhouette_scores <- silhouette(clusters, distance_matrix)
mean_silhouette <- mean(silhouette_scores[, "sil_width"])
mean_silhouette
[1] 0.215618

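Explanation: The silhouette width measures how well each observation fits its assigned cluster relative to the nearest neighboring cluster, on a scale from -1 to 1. A higher mean silhouette width signifies more cohesive, better-separated clusters. The mean of about 0.22 obtained here is positive but low, which suggests that the three clusters overlap to some degree and that the segmentation should be interpreted with caution.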
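
The mean silhouette width can also be used to compare candidate numbers of clusters. The following sketch does this for k = 2 to 6 on simulated data. The `toy` matrix and the range of `candidate_k` are assumptions made for the example, so the numbers it prints say nothing about the energy drink ratings.

```r
library(cluster)  # silhouette()

set.seed(123)

# Hypothetical data: three loosely separated groups of 20 respondents each
toy <- rbind(
  matrix(rnorm(100, mean = 2), ncol = 5),
  matrix(rnorm(100, mean = 5), ncol = 5),
  matrix(rnorm(100, mean = 8), ncol = 5)
)
toy_dist <- dist(scale(toy), method = "euclidean")
toy_hc <- hclust(toy_dist, method = "average")

# Mean silhouette width for each candidate number of clusters
candidate_k <- 2:6
mean_sil <- sapply(candidate_k, function(k) {
  labels <- cutree(toy_hc, k = k)
  mean(silhouette(labels, toy_dist)[, "sil_width"])
})
names(mean_sil) <- candidate_k

print(round(mean_sil, 3))

# Candidate with the highest mean silhouette width
best_k <- candidate_k[which.max(mean_sil)]
print(best_k)
```

Replacing `toy_hc` and `toy_dist` with `hclust_result` and `distance_matrix` would show whether k = 3 also gives the highest mean silhouette width for the real ratings.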

### 6. Profile the Clusters

#### a. How do the clusters differ on their average rating of each version of the energy drinks?

# Compute average ratings by cluster
cluster_means <- energy_data %>%
  group_by(Cluster) %>%
  summarise(across(D1:D5, mean))

# Format the table
kable(cluster_means, col.names = c("Cluster", "D1", "D2", "D3", "D4", "D5")) %>%
  kable_styling(full_width = FALSE)

| Cluster | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| 1 | 2.945578 | 4.811791 | 6.274376 | 6.646259 | 6.603175 |
| 2 | 2.508982 | 4.610778 | 7.323353 | 5.089820 | 2.718563 |
| 3 | 6.642241 | 5.081897 | 3.439655 | 3.159483 | 2.956897 |

# Melt data for plotting
melted_cluster_means <- melt(cluster_means, id.vars = "Cluster")

# Line graph for cluster ratings
ggplot(melted_cluster_means, aes(x = variable, y = value, color = Cluster, group = Cluster)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Average Ratings of Energy Drink Versions by Cluster", 
       x = "Energy Drink Version", 
       y = "Average Rating") +
  theme_minimal()

Explanation: The graph and table highlight significant variations in ratings across clusters for the five energy drink versions. These insights allow us to identify the preferred versions for each segment, guiding targeted marketing efforts.
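
To read the preferred version for each segment directly off the table, the short sketch below re-enters the cluster means reported above and picks the highest- and lowest-rated version per cluster. The object names are chosen for the example only.

```r
# Cluster means copied from the table above
cluster_means_tbl <- data.frame(
  Cluster = c(1, 2, 3),
  D1 = c(2.945578, 2.508982, 6.642241),
  D2 = c(4.811791, 4.610778, 5.081897),
  D3 = c(6.274376, 7.323353, 3.439655),
  D4 = c(6.646259, 5.089820, 3.159483),
  D5 = c(6.603175, 2.718563, 2.956897)
)

drink_cols <- c("D1", "D2", "D3", "D4", "D5")

# Highest- and lowest-rated version within each cluster
top_version <- drink_cols[apply(cluster_means_tbl[, drink_cols], 1, which.max)]
low_version <- drink_cols[apply(cluster_means_tbl[, drink_cols], 1, which.min)]

print(data.frame(Cluster = cluster_means_tbl$Cluster, top_version, low_version))
```

With these means, Cluster 1 rates D4 highest and D1 lowest, Cluster 2 rates D3 highest and D1 lowest, and Cluster 3 rates D1 highest and D5 lowest.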

#### b. How do the clusters differ on age and gender?

# Re-attach the 3-cluster labels so this section can run on its own
clusters <- cutree(hclust_result, k = 3)
energy_data <- energy_data %>% mutate(Cluster = as.factor(clusters))

# Age distribution by cluster
age_distribution <- energy_data %>%
  group_by(Cluster, Age) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Cluster) %>%  # Calculate proportions within each cluster
  mutate(Proportion = Count / sum(Count))

# Line graph for age distribution
ggplot(age_distribution, aes(x = Age, y = Proportion, color = Cluster, group = Cluster)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(
    title = "Age Distribution by Cluster", 
    x = "Age Group", 
    y = "Proportion"
  ) +
  theme_minimal()

# Gender distribution by cluster
gender_distribution <- energy_data %>%
  group_by(Cluster, Gender) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Cluster) %>%  # Calculate proportions within each cluster
  mutate(Proportion = Count / sum(Count))

# Line graph for gender distribution
ggplot(gender_distribution, aes(x = Gender, y = Proportion, color = Cluster, group = Cluster)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(
    title = "Gender Distribution by Cluster", 
    x = "Gender", 
    y = "Proportion"
  ) +
  theme_minimal()
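
Differences in proportions between clusters can look larger on a plot than they really are. One possible check, sketched below, is a chi-squared test of independence between cluster membership and a demographic variable. The `toy` data frame is simulated for illustration and does not reproduce the survey data.

```r
set.seed(7)

# Hypothetical cluster labels and genders (not the real survey responses)
toy <- data.frame(
  Cluster = factor(sample(1:3, 300, replace = TRUE)),
  Gender  = sample(c("Female", "Male"), 300, replace = TRUE)
)

# Cross-tabulate and compute within-cluster proportions
gender_table <- table(toy$Cluster, toy$Gender)
row_props <- prop.table(gender_table, margin = 1)

# Chi-squared test of independence between cluster and gender
gender_test <- chisq.test(gender_table)

print(round(row_props, 2))
print(gender_test)
```

The same calls on `table(energy_data$Cluster, energy_data$Gender)` or `table(energy_data$Cluster, energy_data$Age)` would indicate whether the demographic differences between the actual clusters are larger than chance alone would produce.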

Heatmap

Explanation of Heatmap:

Distinct Patterns: The heatmap visually demonstrates the clustering structure in the dataset. Rows (participants) are ordered by hierarchical clustering, so participants with similar ratings for the five energy drink versions (D1–D5) sit next to each other.

Blocks of Similarity: The heatmap shows distinct blocks where ratings are similar within each group of rows, supporting the segmentation approach.

Intensity of Color: With the default pheatmap palette, warmer colors (red) indicate higher scaled ratings, while cooler colors (blue) represent lower ones. Groups of rows with consistently warm blocks for specific drinks highlight a shared preference.

Insights: Cluster 1 rates D3, D4, and D5 highly (means between 6.27 and 6.65) and D1 poorly (2.95), indicating an inclination toward bold flavors. Cluster 2 peaks sharply at D3 (7.32), gives moderate ratings to D2 and D4, and rates D1 and D5 lowest. Cluster 3 prefers D1 (6.64) and rates D3, D4, and D5 lowest (2.96 to 3.44).

Graphs

Age Distribution by Cluster:

Cluster 1: Higher proportion of older participants (e.g., 36–45 age group). According to the cluster means, this group prefers the bolder versions (D4 and D5).

Cluster 2: A balanced distribution of participants across all age groups, consistent with their preference for the middle version, D3.

Cluster 3: Predominantly younger participants (e.g., the Under_25 group). According to the cluster means, this group prefers the subtle flavor of D1.

Gender Distribution by Cluster:

Cluster 1: Gender proportions are nearly equal, showing that the bolder versions (D4 and D5) appeal across genders.

Cluster 2: A slight female dominance is observed, suggesting that marketing for D3 can be adjusted to appeal more to women.

Cluster 3: Predominantly male. This is the segment that favors D1.

Clusters

Cluster 1: Preferences: Strong preference for D3, D4, and D5, with low ratings for D1. Demographics: Likely older participants, split evenly between genders. Flavor Preference: Bold and intense. Behavior: Adventurous, drawn to strong and impactful flavors.

Cluster 2: Preferences: Strong preference for D3, with moderate ratings for D2 and D4 and low ratings for D1 and D5. Demographics: Mixed age and gender group, leaning slightly female. Flavor Preference: Balanced and versatile. Behavior: Participants who enjoy moderate intensity.

Cluster 3: Preferences: Strong preference for D1, with low ratings for D3, D4, and D5. Demographics: Predominantly younger and male. Flavor Preference: Subtle and mild. Behavior: Traditional, seeking a less intense flavor experience.

Recommendations

Advertising Strategies:

D1 (Cluster 3): Highlight its mild and subtle flavor and position it as a “classic choice” for balanced refreshment. Because this cluster is predominantly younger and male, engage it via Instagram, TikTok, and visual storytelling.

D3 (Cluster 2): Emphasize its versatility. Leverage digital platforms such as social media and influencer marketing to reach diverse demographics. Highlight testimonials showcasing its balanced flavor profile.

D4 and D5 (Cluster 1): Focus on their bold, intense flavor. Use dynamic campaigns like sports sponsorships or collaborations with high-energy events. Because this cluster skews older, with an even gender split, include traditional advertising channels like TV and newspapers.

Product Strategy:

Primary Product: D3, which receives the highest rating in Cluster 2 and a high rating in Cluster 1.

Supporting Products: D1 for Cluster 3, and D4 or D5 for Cluster 1.

Product Bundles: Combine D1 and D3 or D3 and D5 in promotional bundles to encourage cross-cluster trials.

These insights help align marketing and product strategies with each cluster’s preferences and demographics, with the aim of maximizing customer satisfaction and profitability.