Tan Bao Assignment

library(tidyverse)

library(rattle)

library(rpart)

Testing <- read_csv("sub_testing.csv")

Training <- read_csv("sub_training.csv")

Part A

ggplot(data = Training) + 
  geom_point(mapping = aes(x = renewed, y = spend))     

Evaluation

The scatter plot (geom_point) shows the distribution of spend for customers who renewed and those who did not renew. The median spend for both groups appears very similar, probably falling between 240 and 350. For the non-renewed group, the majority of data points are clustered around the middle of the spend range, roughly from 150 to 450. In contrast, the renewed group is tighter and slightly shifted. Its points also concentrate in the range of 150 to 450, but with fewer data points.

ggplot(data = Training) +
  geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge") +
  scale_fill_manual(values = c( "cyan4", "darkseagreen"))

Evaluation

The bar chart shows the count of customers who renewed or did not renew, split by gender. In the female group, the non-renewed count is approximately 160 customers, whereas the renewed count is approximately 120 customers. This means that more females did not renew than did renew. In the male group, about 270 customers did not renew, while about 310 customers renewed. This means that more males renewed than did not renew. Male customers therefore have a higher renewal rate than female customers.

ggplot(data = Training) +
  geom_boxplot(mapping = aes(x = renewed, y = num_contacts))

Evaluation

The box plot shows the distribution of the number of contacts for customers who renewed and those who did not renew. For both groups, the median and IQR are concentrated very close to zero. In the non-renewed group, the maximum number of contacts is around 25, whereas in the renewed group it is significantly higher, reaching up to 40. There are several outliers in both groups, particularly in the renewed group. These outliers represent customers who had an unusually high number of contacts.

ggplot(data = Training) +
  geom_boxplot(mapping = aes(x = renewed, y = lor))

Evaluation

The box plot compares the length of relationship, expressed in days, for customers who renewed and those who did not. For the non-renewed group, the median is around 100, and the box extends from approximately 50 to 180. The median of the renewed group is noticeably higher, around 150-160. The maximum value for the non-renewed group is around 500, with several clear outliers clustered between 400 and 500. The maximum for the renewed group is also around 480 to 500. In conclusion, while the central tendency is higher for renewers, the range of the data and the highest recorded values are very similar for both groups.

table1 <- Training %>%
            group_by(renewed) %>%
            summarise(
              avg_complaints = mean(num_complaints, na.rm = TRUE), 
              count = n())


table1
# A tibble: 2 × 3
  renewed avg_complaints count
  <chr>            <dbl> <int>
1 No               0.455   426
2 Yes              0.524   424

Evaluation

The table shows the average number of complaints and the number of customers for each renewal status. The count of customers who did not renew is 426, while the count of customers who renewed is 424. The two classes are therefore well balanced. The average number of complaints is slightly higher for customers who renewed (0.524) than for those who did not (0.455).

Training$age_group <- cut(Training$age, 
                          breaks = c(18, 25, 35, 45, 55, 65, 100),
                          labels = c( "18-25", "26-35", "36-45", "46-55", "56-65", "65+"),
                          right = FALSE)
age_renewal_table <- table(Training$age_group, Training$renewed)
view(age_renewal_table)
ggplot(data = Training) +
  geom_boxplot(mapping = aes(x = renewed, y = contact_recency))

Evaluation

The box plots indicate the relationship between various customer attributes and their renewal status. It is clear that customers who renewed have a higher median length of relationship than non-renewers. The number of customers who renew drops sharply after even a single complaint. This indicates that minimizing complaints is crucial. The median number of contacts is zero for both groups. Only customers with an extremely high number of contacts are found to be retained. There is no observable difference in the median or spread of contact recency between renewers and non-renewers. This feature is likely not useful for the predictive model.

ggplot(Training, aes(x = age_group, fill = renewed)) +
  geom_bar(position = "dodge") +
  ggtitle("Comparison of Renewal Status by Age Group") + 
  xlab("Age Group") +
  ylab("Number of Customers") +
  scale_fill_manual(values = c("blue", "orange"), name = "Renewed", labels = c("No", "Yes")) +
  theme_minimal()

Evaluation

The bar chart shows the distribution of renewal status across age groups. The 46-55 age group represents the largest segment of the customer base, while the 65+ age group exhibits the highest rate of renewal. Conversely, the largest volume of churn occurs in the 46-55 group. Meanwhile, the 18-25 age group is the smallest segment, registering only about 15 total customers.
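The comparison of renewal rates above can also be read from a table of row proportions instead of bar heights. The following is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not the Training data). It reuses the same `cut()` breaks and shows that, with `right = FALSE`, an age of exactly 25 falls into the bin labelled "26-35".

```r
# Toy data standing in for Training (values are made up for illustration)
toy <- data.frame(
  age     = c(19, 24, 25, 33, 47, 52, 54, 66, 70, 81),
  renewed = c("No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No")
)

# Same breaks and labels as in the report; right = FALSE makes each bin
# [lower, upper), so the lower break belongs to the bin and the upper does not
toy$age_group <- cut(toy$age,
                     breaks = c(18, 25, 35, 45, 55, 65, 100),
                     labels = c("18-25", "26-35", "36-45", "46-55", "56-65", "65+"),
                     right = FALSE)

# Counts per age group and renewal status
counts <- table(toy$age_group, toy$renewed)

# Row proportions: the share of "No" and "Yes" within each age group
# (age groups with no customers show NaN)
rates <- prop.table(counts, margin = 1)
rates
```

Comparing rates rather than counts avoids confusing a large age group with a loyal one.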

Part B

  1. Create and visualize a classification tree model
renewed_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age,
                    data = Training)
fancyRpartPlot(renewed_tree)

2. Interpret the classification tree:

a. To determine whether a customer will re-subscribe, we need to follow the decision rules in the tree. The split values and node probabilities are taken from the summary(renewed_tree) output.

Rule for renewal:

1. If the customer’s length of relationship (lor) is 140 or more, the tree predicts renewal: 60.7% Yes and 39.3% No (384 customers).

2. If the customer’s lor is less than 140:

2.1 If their spend is less than 182, the tree predicts no renewal: 75.6% No and 24.4% Yes (82 customers).

2.2 If their spend is greater than or equal to 182:

2.2.1 If their number of contacts is 8 or more, the tree predicts renewal: 67.9% Yes and 32.1% No (28 customers).

2.2.2 If their number of contacts is less than 8 and their age is 61 or more, the tree predicts renewal: 66.7% Yes and 33.3% No (27 customers).

2.2.3 If their number of contacts is less than 8 and their age is less than 61, the tree predicts no renewal: 59.3% No and 40.7% Yes (329 customers).

When it comes to purity, none of the leaf nodes is completely pure. The nodes that predict renewal are between 60.7% and 67.9% pure, so they contain a mix of customers who renewed and customers who did not.

b. The rule for predicting if a customer will churn

Rule for churn: the tree predicts churn along two paths, both starting with a lor of less than 140. If the customer’s spend is less than 182, the customer is predicted not to renew (75.6% No). If the customer’s spend is greater than or equal to 182, the number of contacts is less than 8 and the age is less than 61, the customer is also predicted not to renew (59.3% No).

When it comes to purity of the nodes, neither churn node is pure. The node with lor < 140 and spend < 182 is the stronger churn rule at 75.6% No. The second churn node is only 59.3% No, so its predictions are much less certain.
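Leaf purity can be read directly from a fitted rpart object rather than from the plot. The sketch below uses the `kyphosis` data that ships with rpart as a stand-in for the Training data (an assumption made so the example is self-contained). For a classification tree, the `dev` column of `fit$frame` holds the number of misclassified observations in each node, so `1 - dev / n` is the share of the majority class.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the Training data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# fit$frame has one row per node; terminal nodes are marked "<leaf>"
leaves <- fit$frame[fit$frame$var == "<leaf>", c("n", "dev", "yval")]

# dev counts the misclassified observations in the node, so
# 1 - dev / n is the proportion of the predicted (majority) class
leaves$purity <- 1 - leaves$dev / leaves$n

# Translate the numeric class code into the class label
leaves$predicted <- attr(fit, "ylevels")[leaves$yval]
leaves

# Print the sequence of splits that leads to each leaf
path.rpart(fit, nodes = as.numeric(rownames(leaves)))
```

A leaf with purity 1 contains only one class. Values close to 0.5 mean the rule barely separates the two outcomes.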

  1. Important variables for predicting renewal or churn:

c.1 Lor is the most important predictor in this tree. It is the first variable used to split the data and has the highest variable importance (21.5). Customers with a lor of 140 or more are predicted to renew (60.7% Yes), while customers with a shorter lor are split further.

c.2 Spend is the second most important variable (importance 10.4). It is used for the second split: among customers with a lor less than 140, those with a spend below 182 are predicted to not renew (75.6% No).

c.3 Age has an importance (10.0) close to that of spend. It is used for the last split, where customers aged 61 or over are predicted to renew. It is also the best surrogate for lor at the first split.

c.4 The number of contacts variable is used in a subsequent split, but it is not as influential as the other variables in this tree (importance 3.7). It separates a small group of 28 customers with 8 or more contacts who are predicted to renew.

renewed_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
summary(renewed_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = Training)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.1132075 0.03416973
2 0.01179245      1 0.8066038 0.8466981 0.03396352
3 0.01000000      4 0.7617925 0.8325472 0.03388365

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 

Variables that appeared important in visual exploration but not in the classification tree:

Gender - there were slight differences in the renewal rate based on gender. The patterns from part A indicate that gender might have an impact on whether customers renewed their subscription. However, it did not appear as an important variable in the tree.

Contact recency showed some correlation with renewal in part A. While contact recency was explored visually, it did not emerge as a key predictor in the classification tree.

Variables that appeared unimportant in visual exploration but were important in the classification tree:

Spend did show some differences between customers who renewed and those who didn’t, but its importance wasn’t as clear from part A. Nevertheless, spend was the second most important predictor in the classification tree. Among customers with a short relationship (lor < 140), the tree showed that those with a spend below 182 were more likely to churn (75.6% No).

Length of relationship (lor) showed some indication that longer relationships correlated with higher renewal rates, but it wasn’t as emphasized in part A. Nonetheless, lor was the most important predictor in the tree. Customers with a shorter relationship (lor < 140) were more likely to churn, while those with longer relationships had a higher chance of renewal.

train_probs <- predict(renewed_tree, Training, type = 'prob')
train_preds <- predict(renewed_tree, Training, type = 'class')

Training_updated <- cbind(Training, train_probs, train_preds)
head(Training_updated)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
  age_group        No       Yes train_preds
1     46-55 0.3932292 0.6067708         Yes
2     56-65 0.5927052 0.4072948          No
3     46-55 0.7560976 0.2439024          No
4     36-45 0.7560976 0.2439024          No
5       65+ 0.3932292 0.6067708         Yes
6       65+ 0.3932292 0.6067708         Yes
train_con_mat <- table(Training_updated$renewed, 
                        Training_updated$train_preds,
                        dnn=c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

From the confusion matrix, we know the following:

The overall model accuracy is 0.62 or 62%.

Of all customers the model predicted to churn, it got 0.6253 or 62.53% correct.

Of all customers the model predicted not to churn, it got 0.6150 or 61.50% correct.

Of all customers who did churn, the model correctly identified 0.6033 or 60.33%.

Of all customers who did not churn, the model correctly identified 0.6368 or 63.68%.
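The figures above can be reproduced from the confusion matrix itself. This is a minimal sketch that rebuilds the training matrix by hand from the counts printed earlier and treats "No" (did not renew) as churn. The object names `cm`, `accuracy`, `precision` and `recall` are chosen for illustration.

```r
# Training confusion matrix, entered column by column from the printed output
cm <- matrix(c(257, 154, 169, 270), nrow = 2,
             dimnames = list(Actual = c("No", "Yes"),
                             Predicted = c("No", "Yes")))

# Share of all customers classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Of the customers predicted in each class, the share that was correct
# (column-wise: "No" = predicted to churn, "Yes" = predicted to renew)
precision <- diag(cm) / colSums(cm)

# Of the customers actually in each class, the share the model found
# (row-wise: "No" = did churn, "Yes" = did renew)
recall <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(rbind(precision, recall), 4)
```

Keeping the row and column labels attached to the matrix makes it harder to swap the churn and renewal figures when reading them off.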

test_probs2 <- predict(renewed_tree, newdata = Testing, type = 'prob')
test_preds2 <- predict(renewed_tree, newdata = Testing, type = 'class')
renewed_test_updated2 <- cbind(Testing, test_probs2, test_preds2)

test_con_mat2 <- table(renewed_test_updated2$renewed, renewed_test_updated2$test_preds2, dnn = c('Actual', 'Predicted'))
test_con_mat2
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41
sum(diag(test_con_mat2))/sum(test_con_mat2)
[1] 0.5133333

From the testing confusion matrix, we know the following:

The overall model accuracy is 0.5133 or 51.33%.

Of all customers the model predicted to churn, it got 0.5070 or 50.70% correct.

Of all customers the model predicted not to churn, it got 0.5190 or 51.90% correct.

Of all customers who did churn, the model correctly identified 0.4865 or 48.65%.

Of all customers who did not churn, the model correctly identified 0.5395 or 53.95%.

The training accuracy is 62%, while the testing accuracy is 51.33%. This difference of about 11 percentage points is a clear sign that the model has overfit the training data. The model is more accurate in predicting the training dataset than it is on the test dataset.

sub_tree2 <- rpart(renewed ~ num_contacts + contact_recency+ num_complaints + spend + lor + gender + age, data = Training, maxdepth = 3)
fancyRpartPlot(sub_tree2)

train_probs2 <- predict(sub_tree2, newdata = Training, type = 'prob')
train_preds2 <- predict(sub_tree2, newdata = Training, type = 'class')
Training_updated2 <- cbind(Training, train_probs2, train_preds2)

train_con_mat2 <- table(Training_updated2$renewed, Training_updated2$train_preds2, dnn = c('Actual', 'Predicted'))
train_con_mat2
      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252
sum(diag(train_con_mat2))/sum(train_con_mat2)
[1] 0.6094118
test_probs2 <- predict(sub_tree2, newdata = Testing, type = 'prob')
test_preds2 <- predict(sub_tree2, newdata = Testing, type = 'class')
Testing_updated2 <- cbind(Testing, test_probs2, test_preds2)

test_con_mat2 <- table(Testing_updated2$renewed, Testing_updated2$test_preds2, dnn = c('Actual', 'Predicted'))
test_con_mat2
      Predicted
Actual No Yes
   No  41  33
   Yes 38  38
sum(diag(test_con_mat2))/sum(test_con_mat2)
[1] 0.5266667

Pruning the tree (limiting it to maxdepth = 3) has slightly changed the percentages, giving an overall accuracy of 60.9% for the training dataset. However, the overall accuracy for the testing dataset has slightly increased from 51.3% to 52.7%, which indicates that the pruned tree is somewhat better at generalizing to unseen data, meaning it will likely predict new data more accurately and does not suffer from as much overfitting as the original tree.
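Limiting the depth is one way to simplify a tree. Another is cost-complexity pruning with `prune()`, using the cross-validated error in the CP table. The sketch below is a swapped-in alternative, not the method used above, and it uses the `kyphosis` data from rpart as a stand-in so that it runs on its own. The `cp = 0.001` and `minsplit = 5` settings are assumptions chosen only to grow a deliberately large tree.

```r
library(rpart)

set.seed(1)  # xerror comes from cross-validation, which assigns folds at random

# Grow a deliberately large tree on the stand-in data
full <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              control = rpart.control(cp = 0.001, minsplit = 5))

# The CP table lists, for each tree size, the cross-validated error (xerror)
full$cptable

# Pick the complexity parameter with the lowest cross-validated error
best <- which.min(full$cptable[, "xerror"])
best_cp <- full$cptable[best, "CP"]

# Cut the tree back to that size
pruned <- prune(full, cp = best_cp)

# Number of nodes before and after pruning
c(full = nrow(full$frame), pruned = nrow(pruned$frame))
```

The pruned tree can then be passed to `predict()` in the same way as the depth-limited tree, so the training and testing accuracy can be compared on equal terms.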

Part C

  1. Completed
library(cluster)
energy <- read_csv("energy_drinks.csv")
Rows: 840 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Gender, Age
dbl (5): D1, D2, D3, D4, D5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
energy_2 <- select( energy,D1, D2, D3, D4, D5)
energy_2_scale <- scale(energy[, c("D1","D2","D3","D4","D5")])
a1 <- dist(energy_2)
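Note that `a1` is computed from the unscaled `energy_2`, while `energy_2_scale` holds the standardised copy. The following minimal sketch, on made-up data (the `toy` data frame and its two columns are assumptions for illustration), shows how standardising changes Euclidean distances when variables are measured on very different scales.

```r
# Two variables on very different scales (values are made up)
toy <- data.frame(rating = c(1, 5, 9),
                  income = c(20000, 21000, 20500))

# Unscaled: the distance is almost entirely determined by income
d_raw <- dist(toy)
d_raw

# scale() centres each column and divides it by its standard deviation,
# so both variables contribute on a comparable footing
toy_scaled <- scale(toy)
d_scaled <- dist(toy_scaled)
d_scaled
```

When all variables already share one rating scale, scaling changes the distances far less than in this toy case.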
  1. Yes, the data should be scaled before computing the distance matrix. If the variables are measured on different scales, the Euclidean distance can be biased towards variables with larger ranges or higher variance. Scaling ensures that all the features contribute equally to the distance calculation by standardizing them.

  3.
h1 <- hclust(a1, method = 'ward.D')
plot(h1, hang = -1)

heatmap(as.matrix(a1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

4a. Yes. The heatmap displays the distance matrix, so the colour intensity represents how far apart two participants’ ratings are. There is some evidence of lightly coloured blocks around the diagonal, which suggest groups of customers who are similar to each other. This indicates that the clustering has identified meaningful preferences in the data. Thus, the heatmap shows how participants are grouped based on their ratings of the energy drinks.

clusters1 <- cutree(h1, k = 3)

sil1 <- silhouette(clusters1, a1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters1, dist = a1) :
 Cluster sizes and average silhouette widths:
      435       183       222 
0.1820086 0.2752174 0.3189396 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.4444  0.1327  0.2840  0.2385  0.3649  0.5366 
energy_clus <- cbind(energy, clusters1)
energy_clus <- mutate(energy_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                   clusters1 == 2 ~ 'C2',
                                                   clusters1 == 3 ~ 'C3'))
                                                  
size_rev <- energy_clus %>%
  group_by(cluster) %>%
  summarise(ID = n())

Evaluation

The overall clustering structure is weak to moderate, with a mean silhouette width of 0.2385, indicating that some clusters are reasonably well defined while others have weaker cohesion.

  • Cluster 1 has a low silhouette width of 0.1820, suggesting that it is weakly defined and potentially artificial.

  • Cluster 2 has a moderate silhouette width of 0.2752, showing that it is better defined than Cluster 1, but still less cohesive than Cluster 3.

  • Cluster 3, with the highest silhouette width of 0.3189, is the best-defined and most cohesive cluster of participants, making it the most reliable cluster.
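Average silhouette widths can also be compared across several values of k to check whether three clusters is a reasonable choice. The sketch below uses the built-in `iris` measurements as a stand-in for the rating columns (an assumption so that the example runs on its own), with the same Ward linkage as above.

```r
library(cluster)

# Stand-in data: four numeric columns from iris, standardised
x <- scale(iris[, 1:4])
d <- dist(x)
h <- hclust(d, method = "ward.D")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_width <- sapply(ks, function(k) {
  sil <- silhouette(cutree(h, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_width) <- ks

# Higher values indicate more cohesive, better separated clusters
round(avg_width, 3)
```

If the average width for another k were clearly higher than for k = 3, that would be a reason to revisit the number of clusters.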

  1. 6a
energy_clus_means <- energy_clus %>%
  group_by(cluster) %>%
  summarise(D1 = mean(D1),
            D2 = mean(D2),
            D3 = mean(D3),
            D4 = mean(D4),
            D5 = mean (D5))

energy_clus_means
# A tibble: 3 × 6
  cluster    D1    D2    D3    D4    D5
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1       3.18  4.53  6.10  6.43  6.69
2 C2       2.33  5.32  7.34  5.79  2.98
3 C3       6.52  5.09  3.57  2.96  2.69

Evaluation

Cluster C1 has a consistent preference for higher concentrations of the flavoring ingredient. They seem to favor stronger flavors, especially D4 and D5. This group likely prefers more intense flavor experiences. Cluster C2 has a high preference for D3, with a moderate preference for D2 and D4. However, they have a low preference for D1 and D5, suggesting they prefer balanced, mid-range flavors and avoid the extremes. Cluster C3 has a very high preference for D1. They also have a moderate preference for D2, but they show a strong dislike for D3, D4, and D5. This group seems to prefer milder, less concentrated flavors.

6b

energy_clus$Age <- factor(energy_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))


ggplot(energy_clus, aes(x = Age, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop)), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of People") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Cluster") +
  coord_flip() 

Evaluation Age Breakdown by Cluster

  • Cluster 1 (C1) has significant participation across various age groups, with higher proportions in the 35-49 and 50-64 age groups.

  • Cluster 2 (C2) has a relatively balanced participation across age groups, with notable proportions in the 25-34 and 35-49 ranges.

  • Cluster 3 (C3) has a strong representation from the 35-49 and 50-64 age groups. This group also seems to have a notable proportion in the Over 65 category.

energy_clus_tidy <- energy_clus_means %>%
  pivot_longer(cols = c(D1,D2, D3, D4, D5), names_to = "Contact_Method", values_to = "Average_Value")

energy_clus_tidy$Contact_Method <- factor(energy_clus_tidy$Contact_Method, levels = c("D1", "D2", "D3", "D4", "D5"))

energy_clus_tidy
# A tibble: 15 × 3
   cluster Contact_Method Average_Value
   <chr>   <fct>                  <dbl>
 1 C1      D1                      3.18
 2 C1      D2                      4.53
 3 C1      D3                      6.10
 4 C1      D4                      6.43
 5 C1      D5                      6.69
 6 C2      D1                      2.33
 7 C2      D2                      5.32
 8 C2      D3                      7.34
 9 C2      D4                      5.79
10 C2      D5                      2.98
11 C3      D1                      6.52
12 C3      D2                      5.09
13 C3      D3                      3.57
14 C3      D4                      2.96
15 C3      D5                      2.69
ggplot(energy_clus_tidy, mapping = aes(x = Contact_Method, y = Average_Value, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("#A752A0", "#FCCA3A", "#378B84")) +
  ylab("Mean Satisfaction Score") + 
  xlab("Energy Drink Version") +
  ggtitle("Mean Satisfaction Score for each Energy Drink Version by Cluster")

Evaluation

Satisfaction Scores by Cluster

The chart illustrates the satisfaction scores for the energy drink versions across three different clusters, C1, C2, and C3.

  • Cluster 1 (C1) has a high satisfaction score for D5 (the highest concentration) and a low satisfaction score for D1. This indicates that Cluster 1 prefers stronger flavors.

  • Cluster 2 (C2) has its highest satisfaction score for D3 (7.34) and low scores for D1 and D5, suggesting that this cluster prefers balanced or medium-intensity flavors.

  • Cluster 3 (C3) has the highest satisfaction with D1 and the lowest satisfaction with D5, indicating a preference for milder flavors and avoidance of stronger flavors.

  1. 7.1 Targeting Energy Drink Version D1. Cluster 3 is clearly the most suitable cluster for D1, because:
  • Cluster 3 shows the highest satisfaction with D1 and the lowest satisfaction with D4 and D5, indicating a preference for milder flavors. In addition, the participants in this cluster likely prefer smooth, gentle, and less intense flavors. D1 should highlight the milder energy boost that aligns with their taste profile, appealing to consumers who enjoy less intense energy drinks.

    7.2 Targeting Energy Drink Version D3. Cluster 2 is the most suitable cluster for D3, because:

  • Cluster 2 gives D3 its highest rating (7.34) and rates the extremes D1 and D5 low. This suggests that D3 aligns well with their balanced preferences. Participants in this cluster seem to appreciate balanced, moderate flavor profiles rather than extremes. They enjoy a mid-range experience, where the flavor isn’t too strong or too weak. This version of the drink could be positioned as a satisfying option for consumers who enjoy a middle-ground energy drink that isn’t too overpowering but still provides a strong flavor.

    7.3 Targeting Energy Drink Version D5. Cluster 1 is the most suitable cluster for D5, because:

  • Cluster 1 shows the highest satisfaction with D5, suggesting that these consumers are particularly fond of stronger, bolder flavors. Participants in this cluster have a preference for more intense and high-concentration flavors, making them ideal for D5, which represents the highest concentration of flavor. D5 should highlight the flavor strength and powerful experience that appeals to consumers looking for a highly concentrated energy drink.

8. If the company has to choose just one version of the energy drink to continue producing, the best option would be D3, because:

  • Cluster 2, which represents moderate preferences, has a notable preference for D3. These consumers are looking for a balanced flavor, not too intense or too mild, which aligns perfectly with D3.

  • Mid-range flavors such as D3 are often more versatile in appeal, especially when targeting a wide market. Younger consumers, who are more likely to seek variety, will likely appreciate the moderate flavor. Health-conscious individuals or those who are newer to energy drinks may prefer a balanced flavor over something too intense. D3 can be positioned as an all-around energy booster that is neither too strong nor too mild, making it a great choice for everyday consumption.

  • D3 appeals to a broader demographic compared to the extreme versions.

  • D3 offers a safer bet for the company, as it doesn’t rely on extreme preferences. It provides a steady and predictable market that can serve multiple segments. By choosing D3, the company can cater to a large audience without taking the risk of alienating any specific group, which could happen if they choose to focus on just D1 or D5.
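One way to check the choice of D3 against the data is to weight each cluster's mean rating by the cluster size. The sketch below re-enters the cluster sizes from the silhouette summary and the rounded means from the 6a table by hand, so the results are approximate. The names `sizes`, `means` and `overall` are chosen for illustration.

```r
# Cluster sizes from the silhouette summary (clusters 1, 2, 3)
sizes <- c(C1 = 435, C2 = 183, C3 = 222)

# Rounded mean ratings per cluster, copied from the 6a table
means <- rbind(C1 = c(3.18, 4.53, 6.10, 6.43, 6.69),
               C2 = c(2.33, 5.32, 7.34, 5.79, 2.98),
               C3 = c(6.52, 5.09, 3.57, 2.96, 2.69))
colnames(means) <- c("D1", "D2", "D3", "D4", "D5")

# Multiply each row by its cluster size, then divide the column totals
# by the number of participants to get a size-weighted mean per drink
overall <- colSums(means * sizes) / sum(sizes)
round(overall, 2)

# Drink version with the highest size-weighted mean rating
names(which.max(overall))
```

With these rounded inputs, D3 has the highest size-weighted mean rating across all participants.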