Assignement on Data Analytics & Consumer Insights

Author: “Md Saydur Rahman Sayd” format: html editor: visual

Part A – Exploratory Analysis of the Music Subscription Data set

Question 1.

At First we will load the tidy verse package and then use it to import our data set into R.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

sub_train <- read_csv("sub_training.csv")

Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sub_test <- read_csv("sub_testing.csv")

Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 2. Understanding and Interpreting the Data Visualisations

1. Bar Chart: Number of Contacts vs Subscription Renewal

ggplot(data = sub_train) +
  geom_bar(mapping = aes(x = num_contacts,fill = renewed),binwidth = 1,position = "dodge") +
  labs(title = "Distribution of Contacts by Renewal Status",
       x = "Number of Contacts", y = "Count")

Warning in geom_bar(mapping = aes(x = num_contacts, fill = renewed), binwidth =
1, : Ignoring unknown parameters: `binwidth`

Interpretation:

The bar chart shows the number of contacts each customer had with the company, broken down into those who renewed and those who did not. Most customers in both groups had only a small number of contacts, on average between 0 and 5. From the chart, we can see that nonrenewing customers generally had fewer contacts, while renewing customers show a slightly higher range of contacts, indicating a bit more interaction with the company. Overall, customer engagement levels are low for most customers, but having more contacts appears to be linked to a higher chance of renewal.

2. Scatter plot: Number of Complaints vs Subscription Renewal

 ggplot(data = sub_train) + 
  geom_point(mapping = aes(x = renewed, y = num_complaints, color = renewed)) + labs(title = "Number of Complaints by Renewal Status", 
                                                                                     x = "Renewed", y = "Number of Complaints") + 
  scale_color_manual(values = c("Yes" = "yellow", "No" = "red"))

Interpretation

The Scatterplot shows the number of complaints made by customers, separated by whether they renewed or not. It suggests that dissatisfaction, measured through complaints, is linked to on‑renewal. On‑renewing customers span a wide range of complaint counts, and a few have made 10 or more complaints. In contrast, renewing customers usually have very few complaints. Overall, customers who do not renew tend to complain more, while those who renew generally have fewer or no complaints, indicating lower dissatisfaction in the renewing group.

3. Boxplot: Contact Recency vs Subscription Renewal

 ggplot(data = sub_train) +
  geom_boxplot(mapping = aes(x = renewed, y = contact_recency, fill = renewed)) +
  labs(title = "Contact Recency by Renewal Status",
       x = "Renewed", y = "Contact Recency")

Interpretation

Here we see in the boxplot, the red box (non‑renewals) covers slightly higher and more varied recency values, suggesting these customers went longer without contact or experienced more irregular contact before not renewing. The blue box (renewals) shows lower and less variable recency, meaning these customers were contacted more recently and with more consistent timing. Overall, the pattern indicates that customers who renewed were generally contacted more recently, whereas customers who did not renew experienced longer or more varied gaps in contact.

4. Histogram for Spend:

  ggplot(data = sub_train) +
  geom_histogram(mapping = aes(x = spend, fill = renewed), binwidth = 90) +  
  labs(title = "Distribution of Spend by Renewal Status",
       x = "Amount Spent", y = "Count")

Interpretation

The histogram shows how customer spending is distributed for those who renewed and those who did not. It separates spend amounts by renewal status. From the plot, customers who renewed tend to have higher spending, with many around 400, while most non‑renewing customers have lower spend levels. This pattern suggests a positive relationship between spending and renewal, where customers who spend more are more likely to renew their contracts.

5. Scatterplot : Length of Relationship vs Age by Renewal Status.

 ggplot(data = sub_train) +
  geom_point(mapping = aes(x = lor, y = age, color = renewed)) +
  labs(title = "Length of Relationship vs Age by Renewal Status",
       x = "Length of Relationship (Days)", y = "Age")

Interpretation

Here we see that the scatter plot, customers who renew generally show higher spending, with many spending around 400, while most customers who do not renew have lower spending levels. This pattern indicates a positive link between spending and renewal, meaning customers who spend more are more likely to renew their contracts.

6. Scatterplot: Age by Contact Recency and Renewal Status.

ggplot(data = sub_train) +
  geom_point(mapping = aes(x = lor, y = age, color = renewed)) +
  labs(title = "Age by Contact Recency and Renewal Status",
       x = "Contact Recency", y = "Age")

Interpretation

Here we see that, the scatter plot shows a wider spread in age and contact recency for customers who did not renew, represented by the red dots, compared with those who renewed, shown as blue dots. Moreover, this pattern suggests that non‑renewing customers vary more in both age and how recently they were contacted, while renewing customers are more concentrated, particularly among middle‑aged groups who appear more likely to renew and have shorter, less varied contact histories.

7. Barplot: Gender vs Subscription Renewal

ggplot(data = sub_train) +
  geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge") +
  labs(title = "Gender Distribution by Renewal Status",
       x = "Gender", y = "Count")

Interpretation

Here we see from the graph representation that, there are more male renewals blue bars than non-renewals red bars. In contrast, the female renewing customers are less than non renewing. This means that males are more prone to subscribe for a second period, and that the renewal rate is higher among males in this dataset.

Part B –Predicting customers who will Renew their Music subscription

Import the Data

  library(rattle)

Loading required package: bitops

Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.

library(rpart)
library(tidyverse)

sub_train <- read_csv("sub_training.csv")

Rows: 850 Columns: 9

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sub_test <- read_csv("sub_testing.csv")

Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Classification tree model

renew_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)
fancyRpartPlot(renew_tree)

renew_tree$variable.importance

            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338

summary(renew_tree)

Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + age + gender, data = sub_train)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.0636792 0.03431622
2 0.01179245      1 0.8066038 0.8466981 0.03396352
3 0.01000000      4 0.7617925 0.8325472 0.03388365

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667

rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)

n= 850 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 850 424 No (0.5011765 0.4988235)  
   2) lor< 139.5 466 191 No (0.5901288 0.4098712)  
     4) spend< 182 82  20 No (0.7560976 0.2439024) *
     5) spend>=182 384 171 No (0.5546875 0.4453125)  
      10) num_contacts< 7.5 356 152 No (0.5730337 0.4269663)  
        20) age< 61 329 134 No (0.5927052 0.4072948) *
        21) age>=61 27   9 Yes (0.3333333 0.6666667) *
      11) num_contacts>=7.5 28   9 Yes (0.3214286 0.6785714) *
   3) lor>=139.5 384 151 Yes (0.3932292 0.6067708) *

n= 850

2. Interpretation of the tree model

a. One rule for predicting that a customer will re‑subscribe comes from the part of the tree where the length of relationship (lor) is at least about 140 days. In this region, the model predicts renewed = Yes, with roughly 60% of customers renewing, so the node is only moderately pure for renewal. A more specific rule comes from Node 11 that customers with lor of 140 days or more and at least 7.5 contacts, where about 68% renew. This node is quite pure for re‑subscribers but represents only about 3% of all customers.

b. One rule for predicting that a customer will churn is when the length of relationship (lor) is below about 140 days and spending is less than 182. In this case, the tree predicts renewed = No, and in that node around 76% of customers do not renew, so it is quite pure for churn. In Node 4, 76% of cases are non renewals and 24% are renewals, meaning the model can predict churn with reasonably high confidence, even though this node represents only about 10% of the full dataset.

c. The classification tree results show that the length of relationship (lor) is the strongest predictor of re‑subscription. It has the highest importance score (47) and appears in the first main split at about 139.5 days, meaning it reduces misclassification more than any other variable.The next most important predictors are spend and age, each with an importance of 22, and they also appear near the top of the tree (for example, spend < 182 is linked to a lower chance of renewal, and customers aged 61 or more show distinct renewal behaviour). num_contacts have moderate importance (8), splitting at about 7.5 contacts, while contact_recency has a very small importance value (1), indicating it adds little to prediction. num_complaints and gender play only minor roles and do not meaningfully improve the model compared with lor, spend, and age.

3. Compare the results of your visual exploration in Part A to your findings in Part B

Comparing the visual exploration in Part A with the decision tree shows how the importance of variables that we find. If variables like contact_recency or num_complaints looked important in the graphs but have low importance in the tree, this indicates that other factors such as length of relationship (LOR), age, and spend are actually more predictive for renewal. Conversely, if spend or age did not stand out visually but emerge as key predictors in the tree, this means they have more complex effects on churn and renewal than was obvious from the simple plots.

4. Fully assess the accuracy of the classification tree using both the training and the testing datasets.

Measure accuracy on Training data

train_probs <- predict(renew_tree, newdata = sub_train, type = 'prob')
train_preds <- predict(renew_tree, newdata = sub_train, type = 'class')

renew_tree_updated <- cbind(sub_train, train_probs, train_preds)
head(renew_tree_updated)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds
1 0.3932292 0.6067708         Yes
2 0.5927052 0.4072948          No
3 0.7560976 0.2439024          No
4 0.7560976 0.2439024          No
5 0.3932292 0.6067708         Yes
6 0.3932292 0.6067708         Yes

train_con_mat <- table(renew_tree_updated$renewed, renew_tree_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat

      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

Interpretation

The confusion matrix gives these results for the model’s performance. The overall accuracy of the model is (257 + 270) / 850 = 0.62, or 62%. Of all customers predicted to churn, 270 /439= (61%) which is correcet. Of all customers predicted not to churn is257 / 411=(63%). Among customers who actually churned, the model correctly flags 270/ 424= (64%). Among customers who actually did not churn, it correctly identifies 257 /426=(60%).

Measure accuracy on Testing data

test_probs <- predict(renew_tree, newdata = sub_test, type = 'prob')
test_preds <- predict(renew_tree, newdata = sub_test, type = 'class')

renew_tree_updated_test <- cbind(sub_test, test_probs, test_preds)
head(renew_tree_updated_test)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 942     Yes            0              28              0   213  81 Female  45
2 299      No            0              28              0   425 120   Male  51
3  44      No            0              28              3   477  89 Female  69
4 706     Yes            1              14              0   425 361 Female  68
5 965     Yes            0              28              0   235  36 Female  51
6 354      No            1              11              0     0  12   Male  59
         No       Yes test_preds
1 0.5927052 0.4072948         No
2 0.5927052 0.4072948         No
3 0.3333333 0.6666667        Yes
4 0.3932292 0.6067708        Yes
5 0.5927052 0.4072948         No
6 0.7560976 0.2439024         No

test_con_mat <- table(renew_tree_updated_test$renewed, renew_tree_updated_test$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat

      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

Interpretation

The confusion matrix for the test set gives the following results. The model’s overall accuracy is (36 + 41) / 150 = 0.51, or 51%. All customers of the model predicted would not renew (churn), it was correct for 41/ 79= 52%. Of all customers the model predicted would renew, it was correct for 36/ 71=(51%). Among customers who actually did not renew, the model correctly identified 41/ 71=(58%), while among customers who did renew, it correctly identified 36/ 79= (46%).

4a. Explanation of Overfitting in Classification Tree

Here we see that clear drop in accuracy from 62% on the training data to 51% on the testing data shows that the classification tree is overfitting the training data. Overfitting happens when the model is too complex and too closely fitted to the training set, so it learns noise and minor variations that do not generalise well to new data.

5. Prune the Classification Tree

5a. Create the Pruned Classification Tree

renew_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train, maxdepth = 3)
fancyRpartPlot(renew_tree_pruned)

5b. Assess the Accuracy of the Pruned Tree on Training and Testing Datasets

Measure accuracy on Training data Pruned

train_probs_pruned <- predict(renew_tree_pruned, newdata = sub_train, type = 'prob')
train_preds_pruned <- predict(renew_tree_pruned, newdata = sub_train, type = 'class')

renew_tree_updated_pruned <- cbind(sub_train, train_probs_pruned, train_preds_pruned)

head(renew_tree_updated_pruned)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds_pruned
1 0.3932292 0.6067708                Yes
2 0.5730337 0.4269663                 No
3 0.7560976 0.2439024                 No
4 0.7560976 0.2439024                 No
5 0.3932292 0.6067708                Yes
6 0.3932292 0.6067708                Yes

train_con_mat_pruned <- table(renew_tree_updated_pruned$renewed, renew_tree_updated_pruned$train_preds_pruned, dnn = c('Actual', 'Predicted'))
train_con_mat_pruned

      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252

Interpretation

The confusion matrix for the training data gives these results. #The model’s overall accuracy is (266 + 252) / 850 = 0.61, or 61%. Of all customers the model predicted would not renew, it was correct for 252 / 422= (60%). Of all customers the model predicted would renew, it was correct for 266 / 428 = (62%). Among customers who actually did not renew, the model correctly identified 252/424= (59%), and among customers who actually renewed, it correctly identified 266/ 426= (62%).

Measure accuracy on Testing data Pruned

test_probs_pruned <- predict(renew_tree_pruned, newdata = sub_test, type = 'prob')
test_preds_pruned <- predict(renew_tree_pruned, newdata = sub_test, type = 'class')

renew_tree_updated_pruned_test <- cbind(sub_test, test_probs_pruned, test_preds_pruned)

test_con_mat_pruned <- table(renew_tree_updated_pruned_test$renewed, renew_tree_updated_pruned_test$test_preds_pruned, dnn = c('Actual', 'Predicted'))
test_con_mat_pruned

      Predicted
Actual No Yes
   No  41  33
   Yes 38  38

Interpretation:

For the testing data, the confusion matrix shows these results. #The overall model accuracy is (36 + 41) / 150 = 0.51, or 51%. All customers of the model predicted would not renew (churn), it was correct for 41/ 79= (52%). Of all customers the model predicted would renew, it was correct for 36/ 71= (51%). Among customers who actually did not renew, the model correctly identified 41 / 71= (58%), and among those who did renew, it correctly identified 36/ 79= (46%).

5c. Pruning the classification tree overfitting

Generally, the model reaches 61% accuracy on the training data, correctly classifying 60% of non‑renewers and 62% of renewers. However, on the testing data the performance is much weaker, with an overall accuracy of only 51%. It correctly identifies just 52% of customers who do not renew and 51% of those who renew. This large drop in performance from training to testing shows that the model is overfitting.

6. Actions for Improving Renewal Rates

The company has to target the customers which are having a small LOR and low spend as they are the customer segment with maximum churn risk. It can give personalized offers to its customer who are loyal and spend high on their products. With the classification tree model, the company can identify at-risk customers early and contact them with marketing campaigns, loyalty programs or an improved customer support system. These are the moves that will drive up renewals and drop churn.

Focus retention on customers with short LOR and low spend, using early reminders, on boarding support, and introductory offers to reduce churn in this high‑risk group.

Protect long‑term, high‑spend customers with loyalty rewards, exclusive content, and stable pricing so they remain satisfied and continue renewing.

Tailor communication by age and interaction history, for example, age‑appropriate content bundles and service‑recovery offers for customers with many contacts or complaints.

6a. Using the tree for marketing

Use the classification tree to score each customer’s churn risk and segment them into high-medium and low‑risk groups for renewal. Build targeted campaigns around the tree rules (e.g., “low LOR + low spend = high risk”), then run email, in‑app, or offer‑based campaigns for these segments and track how customers move into lower‑risk nodes over time.

Part C – Segmenting Consumers Based on Energy Drink Preference

Import the data

At first we will write the code in the following chunk to run the tidyverse package and then import the given data set.

library(tidyverse)
library(cluster)

energy_drinks <- read_csv("energy_drinks.csv")

Rows: 840 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Gender, Age
dbl (5): D1, D2, D3, D4, D5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(energy_drinks)

2. Create a distance matrix containing the Euclidean distance between all pairs of consumers.

Compute distances between each pair of players

energy_drinks_2 <- select(energy_drinks, D1, D2, D3, D4, D5)
d1 <- dist(energy_drinks_2)

Interpretation

The distance matrix is calculated using the Euclidean distance metric, which measures how different participants are from each other based on their ratings.

2a. Does the data need to be scaled before computing the distance matrix?

Scaling is needed before computing Euclidean distance, especially when variables like drink ratings are on different scales. By scaling, each feature is standardised so that all variables contribute equally to the distance calculation.

Q3. Carry out a hierarchical clustering using the hclust function. Use method = “average”.

h1 <- hclust(d1, method = "average")

Interpretation

Participants were cluster analysed based on their scores of the five energy drink variation. The dendrogram explained how clusters of participants could be built based on similarity in these ratings.

Q4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap

par(mar = c(5, 5, 2, 2))
plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

4a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset?

We see the heatmap shows clear evidence of clustering. Blocks of similar colours indicate groups of participants who rated the energy drinks in a similar way across the five versions. These visible clusters are consistent with the hierarchical clustering dendrogram and confirm that there is meaningful structure in the data, meaning participants can be grouped into distinct segments based on their preferences for the different energy drink versions.

Q-5: Create a 3-cluster solution using the cutree function and assess the quality of this solution.

Create a 3-cluster solution using cutree

clusters <- cutree(h1, k = 3)

Assess the quality of the segmentation

sil2 <- silhouette(clusters, d1)
summary(sil2)

Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters, dist = d1) :
 Cluster sizes and average silhouette widths:
      417       235       188 
0.2249249 0.1987262 0.3918562 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3120  0.1599  0.2916  0.2550  0.3716  0.5502

Interpretation

A 3‑cluster solution was created using the cutree function. The silhouette score is used to evaluate this clustering, where higher values indicate better‑defined clusters. The silhouette results point to a moderate level of clustering: the data are neither very compact nor very separated. Cluster 3 has a reasonably good average silhouette width, but Clusters 1 and 2 show weak cohesion and possible overlap. Negative silhouette values, such as −0.3120, suggest some participants are likely misclassified and might belong more naturally to a different cluster.

6a. How do the clusters differ on their average rating of each version of the energy drinks?

Profile the clusters

energy_drinks_clus <- cbind(energy_drinks, clusters)
energy_drinks_clus <- mutate(energy_drinks, cluster = case_when(clusters == 1 ~ 'C1',
                                                                clusters == 2 ~ 'C2',
                                                                clusters == 3 ~ 'C3'))

cluster_ratings <- energy_drinks_clus %>%
  group_by(cluster) %>%
  summarise(across(starts_with('D'), mean, na.rm = TRUE))

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(starts_with("D"), mean, na.rm = TRUE)`.
ℹ In group 1: `cluster = "C1"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

cluster_ratings

# A tibble: 3 × 6
  cluster    D1    D2    D3    D4    D5
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1       3     4.71  6.20  6.70  6.76
2 C2       2.96  4.67  6.95  5.01  2.97
3 C3       6.98  5.37  3.03  2.88  2.84

Interpretation

Cluster C1 shows a clear liking for the strongest drinks, rating D4 and D5 highest at 6.70 and 6.76, while giving D1 a low score of 3.00, which suggests they dislike the weakest concentration. #Cluster C2 prefers the mid‑strength drink D3 (6.95) and strongly dislikes D5 (2.97), indicating a taste for moderate but not very strong concentrations. Cluster C3, in contrast, rates the weakest drink D1 very highly (6.98) and gives low scores to D4 (2.87) and D5 (2.84), showing little interest in higher‑concentration options. These patterns reveal clear differences in flavour concentration preferences between the three clusters.

6b. How do the clusters differ on age and gender?

6b: Age and Gender distribution by cluster

energy_drinks_clus$Age <- factor(energy_drinks_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))

Age distribution by cluster

ggplot(energy_drinks_clus, aes(x = Age, group = cluster)) + 
  geom_bar(aes(y = ..prop..), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of People") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Cluster") +
  coord_flip()

Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.

Gender distribution

ggplot(energy_drinks_clus, aes(x = Gender, fill = (cluster))) + 
  geom_bar(position = "fill") +
  ylab("Percentage") + 
  xlab("Gender") +
  ggtitle("Gender Distribution by Cluster")

Interpretation

Here we see that cluster C1 shows a broad age mix, with many participants in the 25–34 and 35–49 groups, and it is slightly more female than male.Cluster C2 mainly contains younger participants, mostly under 25, and has a roughly equal split between males and females. Cluster C3, in contrast, includes older participants, especially those aged 35–49 and 50–64, and has a higher proportion of males than females.

7. Advise the Company on the Suitable Segment/Cluster at Which to Advertise Energy Drink Versions D1, D3, and D5?

energy_drinks_mean <- energy_drinks_clus %>%
  group_by(cluster) %>%
  summarise(D1 = mean(D1),
            D3 = mean(D3),
            D5 = mean(D5))

energy_drinks_mean

# A tibble: 3 × 4
  cluster    D1    D3    D5
  <chr>   <dbl> <dbl> <dbl>
1 C1       3     6.20  6.76
2 C2       2.96  6.95  2.97
3 C3       6.98  3.03  2.84

Suggestion

For D1, Cluster 3 gives the highest rating (6.98), so this cluster is the best target for D1 advertising. For D3, Cluster 2 rates it highest (6.95), so D3 promotions should focus on Cluster 2. For D5, Cluster 1 has the top rating (6.76), so Cluster 1 should be the main audience for D5 advertising.Calculate overall average ratings across all clusters for each version of the energy drink

8. Calculate overall average ratings across all clusters for each version of the energy drink

overall_ratings <- energy_drinks_clus %>%
  summarise(across(starts_with('D'), mean, na.rm = TRUE))
overall_ratings

# A tibble: 1 × 5
     D1    D2    D3    D4    D5
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  3.88  4.85   5.7  5.37  4.82

Suggestion

D3 should be recommended for continued production because it has the highest overall mean rating (5.7) among all versions. This indicates that D3 is the most preferred option across participants, making it the strongest all‑round choice.