Part A – Exploratory Analysis of the Music Subscription Dataset
Q1. Import Data
First we will use the code in the following chunk to load the tidyverse package and then import our dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sub_train <-read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_test <-read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q2. Data Visualization & Interpretation
Bar Chart: Number of Contacts vs Subscription Renewal
ggplot(data = sub_train) +geom_bar(mapping =aes(x = num_contacts,fill = renewed),binwidth =1,position ="dodge") +labs(title ="Distribution of Contacts by Renewal Status",x ="Number of Contacts", y ="Count")
Warning in geom_bar(mapping = aes(x = num_contacts, fill = renewed), binwidth =
1, : Ignoring unknown parameters: `binwidth`
Interpretation
The customers who did not renew service had more contact with the company. The bar chart displays how many contacts customers received, split by those that renewed and those that didn’t. The majority of clients had very small number of contacts, falling into the range between 0 and 5.
According to the chart the non-renewals group had few contacts. On the contrary, the renewing customers group had a slightly higher range of contacts which means they had been applying a little bit more of activity within the company compared to did not renew people.
In general, the vast majority of these customers had low engagement, which means they hadn’t had very many interactions leading up to the decision point. This suggests that customers in either instance had little contact, but more interactions appear to be associated with a greater likelihood for renewal.
Boxplot: Contact Recency vs Subscription Renewal
ggplot(data = sub_train) +geom_boxplot(mapping =aes(x = renewed, y = contact_recency, fill = renewed)) +labs(title ="Contact Recency by Renewal Status",x ="Renewed", y ="Contact Recency")
Interpretation
This boxplot shows the distribution of contact recency between customers who renewed and those who did not. Customers who did not renew tend to have a larger spread in contact recency, while customers who renewedshow a more concentrated spread of recent contacts.
Here in boxplot view of Contact Recency with respect to Renewal Status we see that non-renewing customers the red box tend to have a slightly larger recency and not as closely spaced This would indicate that they were reached out to sooner or went through more contacts but ultimately not renewed.
On the other hand, renewed customers highlighted by the blue box tend to have lower and less variable contact recency, which means they are in average contacted not that long ago, and also contact time is less spread.
It seems that customers who renewed were contacted more recently, whereas customers that didn’t renew lost contact for various periods of time, potentially longer periods.
Scatterplot: Number of Complaints vs Subscription Renewal
ggplot(data = sub_train) +geom_point(mapping =aes(x = renewed, y = num_complaints, color = renewed)) +labs(title ="Number of Complaints by Renewal Status", x ="Renewed", y ="Number of Complaints") +scale_color_manual(values =c("Yes"="green", "No"="red"))
Interpretation
The chart displays the number of complaints by customers, as a function of whether the customer renewed. This suggests that dissatisfaction, as measured by complaints, may contribute to non-renewal.
From the chart, we can observe that non-renew customers cover a wide range of complaints and some have complained much more than others. A small number of customers had 10 or more complaints. On the other hand, renewing customers have very few complaints.
This means that those that do not renew complain more, whereas renewed customers tend to have less or no complaints, which might mean there are the diasatisfaction lower.
Histogram for Spend
ggplot(data = sub_train) +geom_histogram(mapping =aes(x = spend, fill = renewed), binwidth =90) +labs(title ="Distribution of Spend by Renewal Status",x ="Amount Spent", y ="Count")
Interpretation
The histogram illustrates how customer spending was distributed according to their renewal status if they renewed the service or not.
From the plots of spend amount tells that those customers who renewed their contract have a higher spend amount approximately 400 whereas most of the non-renewed customer is having less spend. By this statistic, we can tell that there’s a positive correlation between spending and renewal customer since those who spent more are how likely renew their contracts.
Scatterplot : Length of Relationship vs Age by RenewalStatus
ggplot(data = sub_train) +geom_point(mapping =aes(x = lor, y = age, color = renewed)) +labs(title ="Length of Relationship vs Age by Renewal Status",x ="Length of Relationship (Days)", y ="Age")
Interpretation
The scatter plot shows the length of relationship with customer age by renewal status. Customers that renewed their subscription blue points generally are customers who have a long relation with the company. Although age isn’t likely to drive the renewal decision, it is clear that longer relationships are more conducive to a renewal. This indicates that relationship length with the customer is important in renewal, not sex.
Scatterplot: Age by Contact Recency and Renewal Status.
ggplot(data = sub_train) +geom_point(mapping =aes(x = lor, y = age, color = renewed)) +labs(title ="Age by Contact Recency and Renewal Status",x ="Contact Recency", y ="Age")
Interpretation
The Scatter plot indicates that there is more spread in the age and contact recency of customers who didn’t renew marked by red dots comparing them with those who renewed marked by blue dots. This may imply a pattern that older customers especially middle-aged will more likely to renew the contracts and keep less mobile-contact history.
Barplot: Gender vs Subscription Renewal
ggplot(data = sub_train) +geom_bar(mapping =aes(x = gender, fill = renewed), position ="dodge") +labs(title ="Gender Distribution by Renewal Status",x ="Gender", y ="Count")
Interpretation
From the chart, we can see that male customers who renewed (blue bars) are more numerous than those who did not renew (red bars). On the other hand, female customers who renewed are fewer compared to female customers who did not renew. This indicates that more males tend to renew their subscriptions, while fewer females do so. The chart highlights that the renewal rate is higher among males than females in this dataset.
Part B – Predicting Customers Who Will Renew Their Music Subscription
Import Data
First we will use the code in the following chunk to load the tidyverse package and then import our dataset
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_test <-read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q1. Classification tree model
renew_tree <-rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)fancyRpartPlot(renew_tree)
renew_tree$variable.importance
lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
summary(renew_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + age + gender, data = sub_train)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.0966981 0.03422800
2 0.01179245 1 0.8066038 0.9150943 0.03425008
3 0.01000000 4 0.7617925 0.8655660 0.03405895
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)
a. One rule for predicting that customer will re-subscribe is from Node 11. This node is relevant in a case where the customer has length of relationship (lor) of 140 or higher and number contacts of 7.5 or more. For node 11, 68% of customers renew which indicates that the node is quite pure when it comes to predicting re-subscription. though it represent only 3% of total dataset.
b. One rule for knowing that a customer churns is when customer spend < 182 then the model predicts customer will not re-subscribe. Node 4, the one with highest level of purity is yet consistent with this rule. Here in this node 76 % people are No case and 24% are Yes case, node-pure is pretty high and predicts with confidence about the churning of those customers. Though it represent only 10% of total dataset.
c. According to classification tree findings, customer lenght of relationship (lor) is the most important factor for predicting re-subscription. This is the most important variable, with feature importance of 47 and it’s a top primary split lor < 139.5 days . The next most relevant predictors are spend and age, both with an importance of 22. These terms also come at the top of the tree: for instance if spend < 182, then such a customer is much less likely to renew, and there are clearly different renewal behaviors associated with those who aged 61 or older. Num_contacts has intermediate importance rate 8 which splits the data at num_contacts < 7.5 as customers with more contacts are likely to churn. Lastly, contact recency has a score of 1, indicating that it adds very little to predictive accuracy. In sum, the data support that renewal can best be predicted by lor, spend and age with other variables having a much lesser impact to the model.
Q3.Compare the results of your visual exploration in Part A to your findings in Part B
After comparing the results of the visual exploration in Part A with the findings from the decision tree, we can analyze if there are any discrepancies. For example: Variables that were important in the visual exploration but not in the classification tree, If contact_recency or num_complaints appeared significant visually but were less important in the tree, it suggests that other factors like LOR, Age and Spend are more predictive. Variables that were unimportant visually but important in the classification tree: If spend or age were not highlighted in visual exploration but became key predictors in the classification tree, this suggests that these variables have more complex relationships with churn than initially apparent.
Q4. Assess the accuracy of the classification tree
Fully assess the accuracy of the classification tree using both the training and the testing datasets.
Measure accuracy on Training data
# Measure accuracy on Training datatrain_probs <-predict(renew_tree, newdata = sub_train, type ='prob')train_preds <-predict(renew_tree, newdata = sub_train, type ='class')renew_tree_updated <-cbind(sub_train, train_probs, train_preds)head(renew_tree_updated)
id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187 No 0 28 0 213 248 Male 45
2 269 No 1 12 2 425 82 Male 60
3 376 No 0 28 2 0 15 Female 53
4 400 No 1 11 1 0 12 Male 44
5 679 Yes 0 28 0 216 300 Male 68
6 565 Yes 0 28 0 425 349 Female 68
No Yes train_preds
1 0.3932292 0.6067708 Yes
2 0.5927052 0.4072948 No
3 0.7560976 0.2439024 No
4 0.7560976 0.2439024 No
5 0.3932292 0.6067708 Yes
6 0.3932292 0.6067708 Yes
From the above confusion matrix, we know the following:
The overall model accuracy is (36 + 41)/150 = 0.51 or 51%.
Of all customers the model predicted to not renew (churn), they got 41/79 = 0.52 or 52%.
Of all customers the model predicted to renew, they got 36/71 = 0.51 or 51% correct.
Of all customers who did not renew (churn), the model correctly identified 41/71 = 0.58 or 58%.
Of all customers who did renew, the model correctly identified 36/79 = 0.46 or 46%.
Q4a. Explanation of Overfitting in Classification Tree:
The significant drop in accuracy from 62% on the training data to 51% on the testing data indicates that the classification tree has overfitted the training data. In overfitting, the model becomes too complex and tailored to the training set, capturing noise and small fluctuations that don’t generalize well to new data.
Q5. Prune the Classification Tree
a. Create the Pruned Classification Tree
renew_tree_pruned <-rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train, maxdepth =3)fancyRpartPlot(renew_tree_pruned)
b. Assess the Accuracy of the Pruned Tree on Training and Testing Datasets
#Measure accuracy on Training data Prunedtrain_probs_pruned <-predict(renew_tree_pruned, newdata = sub_train, type ='prob')train_preds_pruned <-predict(renew_tree_pruned, newdata = sub_train, type ='class')renew_tree_updated_pruned <-cbind(sub_train, train_probs_pruned, train_preds_pruned)head(renew_tree_updated_pruned)
id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187 No 0 28 0 213 248 Male 45
2 269 No 1 12 2 425 82 Male 60
3 376 No 0 28 2 0 15 Female 53
4 400 No 1 11 1 0 12 Male 44
5 679 Yes 0 28 0 216 300 Male 68
6 565 Yes 0 28 0 425 349 Female 68
No Yes train_preds_pruned
1 0.3932292 0.6067708 Yes
2 0.5730337 0.4269663 No
3 0.7560976 0.2439024 No
4 0.7560976 0.2439024 No
5 0.3932292 0.6067708 Yes
6 0.3932292 0.6067708 Yes
#Measure accuracy on Testing data Prunedtest_probs_pruned <-predict(renew_tree_pruned, newdata = sub_test, type ='prob')test_preds_pruned <-predict(renew_tree_pruned, newdata = sub_test, type ='class')renew_tree_updated_pruned_test <-cbind(sub_test, test_probs_pruned, test_preds_pruned)test_con_mat_pruned <-table(renew_tree_updated_pruned_test$renewed, renew_tree_updated_pruned_test$test_preds_pruned, dnn =c('Actual', 'Predicted'))test_con_mat_pruned
Predicted
Actual No Yes
No 41 33
Yes 38 38
Training Data:
From the above confusion matrix, we know the following:
The overall model accuracy is (266 + 252)/850 = 0.61 or 61%.
Of all customers the model predicted to not renew, they got 252/422 = 0.60 or 60%.
Of all customers the model predicted to renew, they got 266/428 = 0.62 or 62%.
Of all customers who did not renew,, the model correctly identified 252/424 = 0.59 or 59%.
Of all customers who did not renew,, the model correctly identified 266/426 = 0.62 or 62%.
Testing Data:
From the above confusion matrix, we know the following:
The overall model accuracy is (36 + 41)/150 = 0.51 or 51%.
Of all customers the model predicted to not renew (churn), they got 41/79 = 0.52 or 52%.
Of all customers the model predicted to renew, they got 36/71 = 0.51 or 51%.
Of all customers who did not renew (churn), the model correctly identified 41/71 = 0.58 or 58%.
Of all customers who did renew, the model correctly identified 36/79 = 0.46 or 46%.
Q5c. Pruning the classification tree overfitting
Overall the model achieves an accuracy of 61% on training data, correctly identifying 60% of non-renewers and 62% of renewers.
But in the testing data, the performance of model is very bad. The final accuracy over the test set is merely 51%. It accurately identified 52% of non-renewers predicted customers and 51% for those predicted to renew.
This significant drop-off from train to test performance indicates that the model is overfitting
Q6. Actions for Improving Renewal Rates
Based on the analysis, the company may implement following strategies to improve its renewal rate and reduce churn.
The company should focus retention efforts on customers with short LOR and low spend, as they represent high-risk churn groups.
Target high-spending, long-term customers with personalized offers to keep them loyal.
Use the classification tree model to identify at-risk customers and proactively engage them through marketing campaigns, loyalty programs, or customer support interventions.
Part C – Segmenting Consumers Based on Energy Drink Preference
Q1. Import Data
First we will use the code in the following chunk to load the tidyverse package and then import our dataset.
The distance matrix is computed using the Euclidean distance metric, which helps measure the dissimilarity between participants based on their ratings.
Q2a . Does the data need to be scaled before computing the distance matrix?
Yes, scaling is important before calculating Euclidean distance, especially for different variables such as the ratings for each drink. By scaling, we standardize each feature and ensuring that each variable contributes equally to the distance calculation
Q3. Carry out a hierarchical clustering using the hclust function. Use method = “average”.
h1 <-hclust(d1, method ="average")
Commentary:
We applied hierarchic clustering to the subjects using their ratings of the five energy drink types. The dendrogram reflects with which ratings how groups of participants can be formed.
Visualise the results of the hierarchical clustering using a dendrogram and a heatmap.
Q4a: Does the heatmap provide evidence of any clustering structure within the energy drinks dataset?
Yes, there is clustering in the heatmap. Blocks of colors showing that clusters of people have rated the energy drinks similarly. The visible clusters can be compared with the outcome of the hierarchical clustering dendrogram and indicate that there is structure in data, and participants can be grouped into different segments according to their preferences for energy drink versions.
Q5. Create a 3-cluster solution using the cutree function and assess the quality of this solution.
clusters <-cutree(h1, k =3)sil2 <-silhouette(clusters, d1)summary(sil2)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters, dist = d1) :
Cluster sizes and average silhouette widths:
417 235 188
0.2249249 0.1987262 0.3918562
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3120 0.1599 0.2916 0.2550 0.3716 0.5502
Commentary:
A 3-cluster solution was generated with the cutree function. The silhouette score also measures the performance of the clustering, but where higher is better. An ideal clustering solution would create clusters with high silhouette scores.
The silhouette analysis suggests a moderate extent of data clustering meaning it neither too close nor too separate. The clustering solution performs well under average silhouette width for Cluster 3 but Clusters 1 and 2 have no cohesion and may overlap. The negative silhouette scores such as -0.3120 indicate some missclassification in the clusters, where some of the participants may be misclassified and assigned to a different cluster.
Q6 Profile the clusters
Q6a. Profile the clusters and calculate average ratings for each version
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(starts_with("D"), mean, na.rm = TRUE)`.
ℹ In group 1: `cluster = "C1"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
Cluster C1 prefers the higher concentration versions, especially D4 and D5 with average scores of 6.70 and 6.76 respectively. However, this cluster rates D1 the lowest 3.00, as participants in this group detest most the less concentrated form. Cluster C2, however, has a strong preference for D3 (6.95) and highly dislike D5 (2.97), indicating that users in this group prefer moderate concentrations but not the highest concentration one. Finally the Cluster C3 is characterized for having high values in the lowest concentration drink D1 (high value: 6.98) and low values in intermediate drinksD4 (low value: 2.87), and D5 (low value: 2.84), showing a lack of interest towards higher concentrates drinks. These different ratings of the clusters illuminate differences in preference based on flavoring ingredient concentration, and we observe distinct patterns of liking for each version within each of the clusters.
Q6b: How do the clusters differ on age and gender?
Age and Gender distribution by cluster
energy_drinks_clus$Age <-factor(energy_drinks_clus$Age, levels =c("Under_25", "25_34", "35_49", "50_64", "Over_65"))# Age distribution by clusterggplot(energy_drinks_clus, aes(x = Age, group = cluster)) +geom_bar(aes(y = ..prop..), stat ="count", show.legend =FALSE) +facet_grid(~ cluster) +scale_y_continuous(labels = scales::percent) +ylab("Percentage of People") +xlab("Age Group") +ggtitle("Age Breakdown by Cluster") +coord_flip()
Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.
# Gender distributionggplot(energy_drinks_clus, aes(x = Gender, fill = (cluster))) +geom_bar(position ="fill") +ylab("Percentage") +xlab("Gender") +ggtitle("Gender Distribution by Cluster")
Commentary:
There are a few noteworthy differences when analyse age and sex within the three clusters. Cluster C1 is pretty mixed by age, including persons from all ages, particularly the 25-34 and 35-49 brackets. This cluster is also slightly female-biased. Cluster C2 is comprised of younger people, with the majority aged Under 25 and relatively equal gender distribution between males versus females. In contrast, older and less young people prefer C3 and there’s more in the 35-49 and 50-64 age groups. This family also has more males to females.
Q7. Advise the Company on the Suitable Segment/Cluster at Which to Advertise Energy Drink Versions D1, D3, and D5
D3 is to be recommended for production because it holds the highest mean rating overall (5.7) among all versions. This means that D3 is the most favored type regardless of all participants.