Data Analytics & Consumer Insight- Assignment

Author

Mamunur Rahman

Part A – Exploratory Analysis of the Music Subscription Dataset

Q1. Import Data

First we will use the code in the following chunk to load the tidyverse package and then import our datasets.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sub_train  <- read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_test   <- read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q2. Data Visualization & Interpretation

Bar Chart: Number of Contacts vs Subscription Renewal

ggplot(data = sub_train) +
  geom_bar(mapping = aes(x = num_contacts, fill = renewed), position = "dodge") +
  labs(title = "Distribution of Contacts by Renewal Status",
       x = "Number of Contacts", y = "Count")

Interpretation

  • The bar chart displays how many contacts customers received, split by those who renewed and those who did not. The majority of customers had a very small number of contacts, falling in the range of 0 to 5.

  • According to the chart, the non-renewing group is concentrated at the lowest contact counts, whereas renewing customers span a slightly wider range of contacts, suggesting somewhat more interaction with the company than among those who did not renew.

  • In general, the vast majority of customers had low engagement, meaning they had few interactions leading up to the decision point. Contact levels were low in both groups, but more interactions appear to be associated with a greater likelihood of renewal.

Boxplot: Contact Recency vs Subscription Renewal

ggplot(data = sub_train) +
  geom_boxplot(mapping = aes(x = renewed, y = contact_recency, fill = renewed)) +
  labs(title = "Contact Recency by Renewal Status",
       x = "Renewed", y = "Contact Recency")

Interpretation

This boxplot shows the distribution of contact recency between customers who renewed and those who did not. Customers who did not renew tend to have a larger spread in contact recency, while customers who renewed show a more concentrated spread of recent contacts.

In this boxplot of contact recency by renewal status, the non-renewing customers (red box) tend to have slightly higher and more widely spread recency values, indicating that more time had typically passed since their last contact before the renewal decision.

On the other hand, renewed customers (blue box) tend to have lower and less variable contact recency, meaning they were, on average, contacted more recently and the timing of contact varies less.

Overall, it seems that customers who renewed were contacted more recently, whereas customers who did not renew had gone without contact for varying, and often longer, periods of time.

Scatterplot: Number of Complaints vs Subscription Renewal

ggplot(data = sub_train) +
  geom_point(mapping = aes(x = renewed, y = num_complaints, color = renewed)) +
  labs(title = "Number of Complaints by Renewal Status",
       x = "Renewed", y = "Number of Complaints") +
  scale_color_manual(values = c("Yes" = "green", "No" = "red"))

Interpretation

The chart displays the number of complaints made by each customer, split by whether the customer renewed. This suggests that dissatisfaction, as measured by complaints, may contribute to non-renewal.

From the chart, we can observe that customers who did not renew cover a wide range of complaint counts, and some complained much more than others; a small number had 10 or more complaints. In contrast, renewing customers have very few complaints.

This means that those who did not renew complained more, whereas renewed customers tend to have few or no complaints, suggesting lower dissatisfaction among renewers.

Histogram for Spend

ggplot(data = sub_train) +
  geom_histogram(mapping = aes(x = spend, fill = renewed), binwidth = 90) +  
  labs(title = "Distribution of Spend by Renewal Status",
       x = "Amount Spent", y = "Count") 

Interpretation

The histogram illustrates how customer spending was distributed according to renewal status.

The plot suggests that customers who renewed tend to have higher spend, with many around the 400 mark, whereas most non-renewing customers spent less. This points to a positive association between spending and renewal, since those who spent more appear more likely to renew their contracts.
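
A quick numerical check of this reading would be to compare spend across the two groups directly, using the training data already loaded; a minimal sketch (the summary column names are illustrative):

# Compare average and median spend for renewers vs non-renewers
sub_train %>%
  group_by(renewed) %>%
  summarise(mean_spend   = mean(spend),
            median_spend = median(spend),
            customers    = n())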

Scatterplot: Length of Relationship vs Age by Renewal Status

ggplot(data = sub_train) +
  geom_point(mapping = aes(x = lor, y = age, color = renewed)) +
  labs(title = "Length of Relationship vs Age by Renewal Status",
       x = "Length of Relationship (Days)", y = "Age")

Interpretation

The scatterplot shows length of relationship against customer age, coloured by renewal status. Customers who renewed (blue points) are generally those with a longer relationship with the company. Although age does not appear to drive the renewal decision, it is clear that longer relationships are associated with renewal. This indicates that relationship length is what matters for renewal, not age.

Scatterplot: Age by Contact Recency and Renewal Status.

ggplot(data = sub_train) +
  geom_point(mapping = aes(x = contact_recency, y = age, color = renewed)) +
  labs(title = "Age by Contact Recency and Renewal Status",
       x = "Contact Recency", y = "Age")

Interpretation

The scatterplot indicates more spread in both age and contact recency among customers who did not renew (red points) than among those who renewed (blue points). This may suggest a pattern in which older customers, especially those in middle age and above, are more likely to renew their contracts and tend to have more recent contact with the company.

Barplot: Gender vs Subscription Renewal

ggplot(data = sub_train) +
  geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge") +
  labs(title = "Gender Distribution by Renewal Status",
       x = "Gender", y = "Count")

Interpretation

From the chart, we can see that male customers who renewed (blue bars) are more numerous than those who did not renew (red bars). On the other hand, female customers who renewed are fewer compared to female customers who did not renew. This indicates that more males tend to renew their subscriptions, while fewer females do so. The chart highlights that the renewal rate is higher among males than females in this dataset.
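
This reading could be confirmed by computing the renewal rate within each gender rather than comparing raw counts; a minimal sketch using the training data (the column name renewal_rate is illustrative):

# Renewal rate by gender: proportion of customers with renewed == "Yes"
sub_train %>%
  group_by(gender) %>%
  summarise(customers    = n(),
            renewal_rate = mean(renewed == "Yes"))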

Part B – Predicting Customers Who Will Renew Their Music Subscription

Import Data

First we will use the code in the following chunk to load the rattle, rpart and tidyverse packages and then import our datasets.

library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tidyverse)

sub_train <- read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_test <- read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q1. Classification tree model

renew_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)
fancyRpartPlot(renew_tree)

renew_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
summary(renew_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + age + gender, data = sub_train)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.0966981 0.03422800
2 0.01179245      1 0.8066038 0.9150943 0.03425008
3 0.01000000      4 0.7617925 0.8655660 0.03405895

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train)
n= 850 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 850 424 No (0.5011765 0.4988235)  
   2) lor< 139.5 466 191 No (0.5901288 0.4098712)  
     4) spend< 182 82  20 No (0.7560976 0.2439024) *
     5) spend>=182 384 171 No (0.5546875 0.4453125)  
      10) num_contacts< 7.5 356 152 No (0.5730337 0.4269663)  
        20) age< 61 329 134 No (0.5927052 0.4072948) *
        21) age>=61 27   9 Yes (0.3333333 0.6666667) *
      11) num_contacts>=7.5 28   9 Yes (0.3214286 0.6785714) *
   3) lor>=139.5 384 151 Yes (0.3932292 0.6067708) *

Q2. Interpret the tree model

a. One rule for predicting that a customer will re-subscribe comes from Node 11. This node applies when the customer has a length of relationship (lor) below 139.5, spend of at least 182, and 8 or more contacts (num_contacts >= 7.5). In Node 11, 68% of customers renew, which indicates that the node is reasonably pure when it comes to predicting re-subscription, though it represents only about 3% of the dataset (28 of 850 customers).

b. One rule for predicting that a customer will churn comes from Node 4: when lor is below 139.5 and spend is below 182, the model predicts the customer will not re-subscribe. Node 4 is the purest node in the tree and is consistent with this rule: 76% of its customers are No cases and 24% are Yes cases, so it predicts churn with reasonable confidence, though it represents only about 10% of the dataset (82 of 850 customers).

c. According to the classification tree, length of relationship (lor) is the most important factor for predicting re-subscription. It has the highest variable importance (47) and forms the top primary split (lor < 139.5 days). The next most relevant predictors are spend and age, both with an importance of 22; these also appear near the top of the tree: customers with spend below 182 are much less likely to renew, and customers aged 61 or older are noticeably more likely to renew. num_contacts has intermediate importance (8) and splits the data at num_contacts < 7.5, with customers who made more contacts being somewhat more likely to renew. Finally, contact_recency has a score of 1, indicating that it adds very little predictive value, and num_complaints does not appear in the importance table at all. In sum, the tree suggests that renewal is best predicted by lor, spend and age, with the other variables contributing much less. (The sketch below shows one way to print the split paths behind these rules.)
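
These rules can be cross-checked by printing the sequence of splits leading to specific nodes with rpart's path.rpart() function; the node numbers are taken from the printed tree above (3 and 11 for renewal, 4 for churn). A brief sketch:

# Print the splits leading to nodes 3, 4 and 11 of the fitted tree
path.rpart(renew_tree, nodes = c(3, 4, 11))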

Q3. Compare the results of your visual exploration in Part A to your findings in Part B

After comparing the visual exploration in Part A with the findings from the decision tree in Part B, we can identify some discrepancies. Variables that seemed important visually but matter less in the tree: num_complaints and contact_recency looked informative in the plots (non-renewing customers tended to have more complaints and less recent contact), yet num_complaints does not appear in the tree's variable importance at all and contact_recency scores only 1, suggesting that lor, spend and age carry most of the predictive signal. Variables that were less obvious visually but important in the tree: age appeared only weakly related to renewal in the scatterplots, yet it becomes a key split in the tree (age >= 61), suggesting a more complex relationship with renewal than was initially apparent. Finally, lor and spend stood out both in the visual exploration and in the tree, so the two analyses broadly agree on the main drivers of renewal.

Q4. Assess the accuracy of the classification tree

Fully assess the accuracy of the classification tree using both the training and the testing datasets.

Measure accuracy on Training data

# Measure accuracy on Training data

train_probs <- predict(renew_tree, newdata = sub_train, type = 'prob')
train_preds <- predict(renew_tree, newdata = sub_train, type = 'class')

renew_tree_updated <- cbind(sub_train, train_probs, train_preds)
head(renew_tree_updated)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds
1 0.3932292 0.6067708         Yes
2 0.5927052 0.4072948          No
3 0.7560976 0.2439024          No
4 0.7560976 0.2439024          No
5 0.3932292 0.6067708         Yes
6 0.3932292 0.6067708         Yes
train_con_mat <- table(renew_tree_updated$renewed, renew_tree_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

From the above confusion matrix, we know the following (a short sketch after this list shows how these figures can also be computed directly in R):

  • The overall model accuracy is (257 + 270)/850 = 0.62 or 62%.

  • Of all customers the model predicted would renew, it got 270/439 = 0.61 or 61% correct.

  • Of all customers the model predicted would not renew (churn), it got 257/411 = 0.63 or 63% correct.

  • Of all customers who did renew, the model correctly identified 270/424 = 0.64 or 64%.

  • Of all customers who did not renew (churned), the model correctly identified 257/426 = 0.60 or 60%.
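
Rather than computing these ratios by hand, they can be derived from the confusion matrix object itself. A minimal sketch (the metric names are just labels chosen for this report):

# Accuracy, precision and recall derived from the training confusion matrix.
# Rows of train_con_mat are actual classes, columns are predicted classes.
accuracy      <- sum(diag(train_con_mat)) / sum(train_con_mat)
precision_yes <- train_con_mat["Yes", "Yes"] / sum(train_con_mat[, "Yes"])
precision_no  <- train_con_mat["No",  "No"]  / sum(train_con_mat[, "No"])
recall_yes    <- train_con_mat["Yes", "Yes"] / sum(train_con_mat["Yes", ])
recall_no     <- train_con_mat["No",  "No"]  / sum(train_con_mat["No",  ])

round(c(accuracy = accuracy, precision_yes = precision_yes,
        precision_no = precision_no, recall_yes = recall_yes,
        recall_no = recall_no), 2)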

Measure accuracy on Testing data

# Measure accuracy on Testing data
test_probs <- predict(renew_tree, newdata = sub_test, type = 'prob')
test_preds <- predict(renew_tree, newdata = sub_test, type = 'class')

renew_tree_updated_test <- cbind(sub_test, test_probs, test_preds)
head(renew_tree_updated_test)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 942     Yes            0              28              0   213  81 Female  45
2 299      No            0              28              0   425 120   Male  51
3  44      No            0              28              3   477  89 Female  69
4 706     Yes            1              14              0   425 361 Female  68
5 965     Yes            0              28              0   235  36 Female  51
6 354      No            1              11              0     0  12   Male  59
         No       Yes test_preds
1 0.5927052 0.4072948         No
2 0.5927052 0.4072948         No
3 0.3333333 0.6666667        Yes
4 0.3932292 0.6067708        Yes
5 0.5927052 0.4072948         No
6 0.7560976 0.2439024         No
test_con_mat <- table(renew_tree_updated_test$renewed, renew_tree_updated_test$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

From the above confusion matrix, we know the following:

  • The overall model accuracy is (36 + 41)/150 = 0.51 or 51%.

  • Of all customers the model predicted would renew, it got 41/79 = 0.52 or 52% correct.

  • Of all customers the model predicted would not renew (churn), it got 36/71 = 0.51 or 51% correct.

  • Of all customers who did renew, the model correctly identified 41/76 = 0.54 or 54%.

  • Of all customers who did not renew (churned), the model correctly identified 36/74 = 0.49 or 49%.

Q4a. Explanation of Overfitting in Classification Tree:

The significant drop in accuracy from 62% on the training data to 51% on the testing data indicates that the classification tree has overfitted the training data. In overfitting, the model becomes too complex and tailored to the training set, capturing noise and small fluctuations that don’t generalize well to new data.
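
One way to probe this further, assuming the cross-validation built into rpart is representative, is to compare the training (relative) error with the cross-validated error in the CP table reproduced in the summary above; plotcp() shows the same information graphically. A brief sketch:

# Inspect the complexity parameter table: rel error is the training error,
# xerror is the cross-validated error; a widening gap suggests overfitting.
printcp(renew_tree)
plotcp(renew_tree)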

Q5. Prune the Classification Tree

a. Create the Pruned Classification Tree

renew_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, data = sub_train, maxdepth = 3)
fancyRpartPlot(renew_tree_pruned)

b. Assess the Accuracy of the Pruned Tree on Training and Testing Datasets

#Measure accuracy on Training data Pruned
train_probs_pruned <- predict(renew_tree_pruned, newdata = sub_train, type = 'prob')
train_preds_pruned <- predict(renew_tree_pruned, newdata = sub_train, type = 'class')

renew_tree_updated_pruned <- cbind(sub_train, train_probs_pruned, train_preds_pruned)

head(renew_tree_updated_pruned)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds_pruned
1 0.3932292 0.6067708                Yes
2 0.5730337 0.4269663                 No
3 0.7560976 0.2439024                 No
4 0.7560976 0.2439024                 No
5 0.3932292 0.6067708                Yes
6 0.3932292 0.6067708                Yes
train_con_mat_pruned <- table(renew_tree_updated_pruned$renewed, renew_tree_updated_pruned$train_preds_pruned, dnn = c('Actual', 'Predicted'))
train_con_mat_pruned
      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252
#Measure accuracy on Testing data Pruned
test_probs_pruned <- predict(renew_tree_pruned, newdata = sub_test, type = 'prob')
test_preds_pruned <- predict(renew_tree_pruned, newdata = sub_test, type = 'class')

renew_tree_updated_pruned_test <- cbind(sub_test, test_probs_pruned, test_preds_pruned)

test_con_mat_pruned <- table(renew_tree_updated_pruned_test$renewed, renew_tree_updated_pruned_test$test_preds_pruned, dnn = c('Actual', 'Predicted'))
test_con_mat_pruned
      Predicted
Actual No Yes
   No  41  33
   Yes 38  38

Training Data:

From the above confusion matrix, we know the following:

  • The overall model accuracy is (266 + 252)/850 = 0.61 or 61%.

  • Of all customers the model predicted would renew, it got 252/412 = 0.61 or 61% correct.

  • Of all customers the model predicted would not renew (churn), it got 266/438 = 0.61 or 61% correct.

  • Of all customers who did renew, the model correctly identified 252/424 = 0.59 or 59%.

  • Of all customers who did not renew (churned), the model correctly identified 266/426 = 0.62 or 62%.

Testing Data:

From the above confusion matrix, we know the following:

  • The overall model accuracy is (41 + 38)/150 = 0.53 or 53%.

  • Of all customers the model predicted would renew, it got 38/71 = 0.54 or 54% correct.

  • Of all customers the model predicted would not renew (churn), it got 41/79 = 0.52 or 52% correct.

  • Of all customers who did renew, the model correctly identified 38/76 = 0.50 or 50%.

  • Of all customers who did not renew (churned), the model correctly identified 41/74 = 0.55 or 55%.

Q5c. Does pruning the classification tree address overfitting?

Overall, the pruned model achieves an accuracy of 61% on the training data, correctly identifying 62% of actual non-renewers and 59% of actual renewers.

On the testing data its performance is weaker: accuracy falls to about 53%, with 52% of customers predicted not to renew and 54% of those predicted to renew being classified correctly.

This drop from training to testing performance indicates that some overfitting remains, although the gap is slightly smaller than for the unpruned tree (62% training versus 51% testing). The sketch below collects these accuracies in one place.
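
To make the train/test comparison for the full and pruned trees explicit, the accuracies could be collected in one small table. A minimal sketch (the helper function tree_accuracy is hypothetical, not part of the assignment code):

# Hypothetical helper: proportion of correct class predictions for a dataset
tree_accuracy <- function(model, data) {
  mean(predict(model, newdata = data, type = 'class') == data$renewed)
}

tibble(model = c('full tree', 'pruned tree'),
       train = c(tree_accuracy(renew_tree, sub_train),
                 tree_accuracy(renew_tree_pruned, sub_train)),
       test  = c(tree_accuracy(renew_tree, sub_test),
                 tree_accuracy(renew_tree_pruned, sub_test)))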

Q6. Actions for Improving Renewal Rates

Based on the analysis, the company may implement the following strategies to improve its renewal rate and reduce churn.

  • The company should focus retention efforts on customers with short LOR and low spend, as they represent high-risk churn groups.

  • Target high-spending, long-term customers with personalized offers to keep them loyal.

  • Use the classification tree model to identify at-risk customers and proactively engage them through marketing campaigns, loyalty programs, or customer support interventions.

Part C – Segmenting Consumers Based on Energy Drink Preference

Q1. Import Data

First we will use the code in the following chunk to load the tidyverse and cluster packages and then import our dataset.

library(tidyverse)
library(cluster)

energy_drinks <- read_csv("energy_drinks.csv")
View(energy_drinks)

Q2 Create a distance matrix containing the Euclidean distance between all pairs of consumers.

energy_drinks_2 <- select(energy_drinks, D1, D2, D3, D4, D5)
d1 <- dist(energy_drinks_2)

Commentary:

  • The distance matrix is computed using the Euclidean distance metric, which helps measure the dissimilarity between participants based on their ratings.

Q2a . Does the data need to be scaled before computing the distance matrix?

  • Yes, scaling is important before calculating Euclidean distance whenever variables are measured on different scales. By scaling we standardise each feature, ensuring that each variable contributes equally to the distance calculation. A brief sketch of how this could be done follows.
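
If scaling were applied here, a minimal sketch would look like the following; the scaled objects are kept separate from d1 above so the rest of the analysis is unchanged, and the object names are illustrative:

# Standardise each rating column (mean 0, sd 1) before computing distances
energy_drinks_scaled <- scale(energy_drinks_2)
d1_scaled <- dist(energy_drinks_scaled)   # Euclidean distance by default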

Q3. Carry out a hierarchical clustering using the hclust function. Use method = “average”.

h1 <- hclust(d1, method = "average")

Commentary:

  • We applied hierarchical clustering to the participants using their ratings of the five energy drink versions. The dendrogram shows how participants with similar ratings are progressively grouped together.

Visualise the results of the hierarchical clustering using a dendrogram and a heatmap.

par(mar = c(5, 5, 2, 2))
plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

Q4a: Does the heatmap provide evidence of any clustering structure within the energy drinks dataset?

  • Yes, the heatmap shows evidence of clustering. Blocks of colour indicate groups of people who rated the energy drinks similarly. These visible blocks correspond to the hierarchical clustering dendrogram and indicate that there is structure in the data, so participants can be grouped into distinct segments according to their preferences for the energy drink versions.

Q5. Create a 3-cluster solution using the cutree function and assess the quality of this solution.

clusters <- cutree(h1, k = 3)

sil2 <- silhouette(clusters, d1)
summary(sil2)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters, dist = d1) :
 Cluster sizes and average silhouette widths:
      417       235       188 
0.2249249 0.1987262 0.3918562 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3120  0.1599  0.2916  0.2550  0.3716  0.5502 

Commentary:

  • A 3-cluster solution was generated with the cutree function. The silhouette width measures the quality of the clustering, with higher values indicating more cohesive, better-separated clusters; the sketch after this commentary compares different numbers of clusters.

  • The silhouette analysis suggests only a moderate degree of clustering structure. Cluster 3 performs reasonably well (average silhouette width 0.39), but Clusters 1 and 2 have weaker cohesion (0.22 and 0.20) and may overlap. The negative individual silhouette widths (minimum -0.3120) indicate that some participants may be assigned to the wrong cluster.
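
To check whether three clusters is a sensible choice, the average silhouette width can be compared across several values of k, reusing the distance matrix d1 and the dendrogram h1. A brief sketch:

# Average silhouette width for 2 to 6 clusters cut from the same dendrogram
sapply(2:6, function(k) {
  sil <- silhouette(cutree(h1, k = k), d1)
  mean(sil[, 'sil_width'])
})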

Q6 Profile the clusters

Q6a. Profile the clusters and calculate average ratings for each version

energy_drinks_clus <- energy_drinks %>%
  mutate(cluster = case_when(clusters == 1 ~ 'C1',
                             clusters == 2 ~ 'C2',
                             clusters == 3 ~ 'C3'))

cluster_ratings <- energy_drinks_clus %>%
  group_by(cluster) %>%
  summarise(across(starts_with('D'), \(x) mean(x, na.rm = TRUE)))
cluster_ratings
# A tibble: 3 × 6
  cluster    D1    D2    D3    D4    D5
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1       3     4.71  6.20  6.70  6.76
2 C2       2.96  4.67  6.95  5.01  2.97
3 C3       6.98  5.37  3.03  2.88  2.84

Commentary:

  • Cluster C1 prefers the higher-concentration versions, especially D4 and D5, with average scores of 6.70 and 6.76 respectively, while rating D1 the lowest (3.00), suggesting participants in this group like the least concentrated form the least. Cluster C2 has a strong preference for D3 (6.95) but dislikes D5 (2.97), indicating that this group prefers moderate concentrations rather than the strongest version. Finally, Cluster C3 is characterised by high ratings for the lowest-concentration drink D1 (6.98) and low ratings for the stronger drinks D4 (2.88) and D5 (2.84), showing little interest in the more concentrated versions. These contrasting ratings show that preferences differ across clusters according to the concentration of the flavouring ingredient, with a distinct pattern of liking within each cluster.

Q6b: How do the clusters differ on age and gender?

Age and Gender distribution by cluster

energy_drinks_clus$Age <- factor(energy_drinks_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))

# Age distribution by cluster
ggplot(energy_drinks_clus, aes(x = Age, group = cluster)) + 
  geom_bar(aes(y = after_stat(prop)), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of People") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Cluster") +
  coord_flip()

# Gender distribution
ggplot(energy_drinks_clus, aes(x = Gender, fill = cluster)) + 
  geom_bar(position = "fill") +
  ylab("Percentage") + 
  xlab("Gender") +
  ggtitle("Gender Distribution by Cluster")

Commentary:

  • There are a few noteworthy differences when analysing age and gender across the three clusters. Cluster C1 is fairly mixed by age, including people from all age groups but particularly the 25-34 and 35-49 brackets, and is slightly female-biased. Cluster C2 is made up of younger people, with the majority aged under 25 and a relatively even gender split. In contrast, Cluster C3 skews older, with more people in the 35-49 and 50-64 age groups, and contains more males than females.

Q7. Advise the Company on the Suitable Segment/Cluster at Which to Advertise Energy Drink Versions D1, D3, and D5

Average ratings for each version across clusters

energy_drinks_mean <- energy_drinks_clus %>%
  group_by(cluster) %>%
  summarise(D1 = mean(D1),
            D3 = mean(D3),
            D5 = mean(D5))

energy_drinks_mean
# A tibble: 3 × 4
  cluster    D1    D3    D5
  <chr>   <dbl> <dbl> <dbl>
1 C1       3     6.20  6.76
2 C2       2.96  6.95  2.97
3 C3       6.98  3.03  2.84

Recommendation

  • For D1: Cluster 3 has the highest rating for D1 with a value of 6.98. Therefore, Cluster 3 should be targeted for D1 advertising.

  • For D3: Cluster 2 has the highest rating for D3 with a value of 6.95. Therefore, Cluster 2 should be targeted for D3 advertising.

  • For D5: Cluster 1 has the highest rating for D5 with a value of 6.76. Therefore, Cluster 1 should be targeted for D5 advertising.

Q8. Overall average ratings across all clusters

Overall Average Ratings Across All Clusters for Each Version

overall_ratings <- energy_drinks_clus %>%
  summarise(across(starts_with('D'), \(x) mean(x, na.rm = TRUE)))
overall_ratings
# A tibble: 1 × 5
     D1    D2    D3    D4    D5
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  3.88  4.85   5.7  5.37  4.82

Recommendation

  • D3 is recommended for production because it holds the highest mean rating overall (5.7) among all versions, meaning that D3 is the most favoured version across all participants.