Music Subscription Analysis

Author

Cristina Marinescu

Part A - Exploratory Analysis of the Music Subscription Dataset

Part A explores the relationship between whether a customer renews their subscription and each of the predictor variables. A visual exploration using relevant graphs was carried out.

Interpretation

This bar chart shows the relationship between customer gender and whether or not the subscription was renewed.

The data contains more male than female customers, so the two genders are not equally represented. As a result, males account for more renewals and more non-renewals in absolute terms. Comparing renewal rates would require the proportion renewing within each gender rather than the raw counts.

Renewal vs Customer Spending Behaviour:

Interpretation

The boxplot shows that spending levels are similar for customers who renewed and those who didn’t.

Customers who renewed have a slightly higher median spend than those who didn’t. However, the upper whisker for non-renewers extends higher than the one for renewers, which means that some customers who spent a lot of money still chose not to renew.

This suggests that spending alone does not explain why customers chose not to renew.

Renewed vs Age:

Interpretation

The boxplot shows that the median age of customers who renewed and those who didn’t is nearly identical, which tells us that age doesn’t differ strongly between the groups.

Both groups display a wide range of ages, with some younger and some older outliers. Although the median ages are very similar, the renewed customers have the larger interquartile range, which indicates greater age variation in this group.

Renewed vs Number of Complaints:

Interpretation

The boxplot shows that the number of complaints is zero for the majority of customers in both the renewed and non-renewed groups, which is why the boxes collapse into a single line at 0.

The outliers represent customers who made one or more complaints, and both groups contain a number of customers with several complaints each. The group that didn’t renew has an extreme outlier at around 20 complaints on the Y axis.

This suggests that complaints are generally rare and similarly distributed across the two groups, although very high complaint levels could be associated with a higher likelihood of non-renewal.

Renewed vs Contact Recency:

Interpretation

The boxplot shows that contact recency is generally similar for customers who renewed and those who didn’t.

The lower quartile for non-renewers is slightly higher, which suggests that they were contacted slightly less recently. The medians and overall distributions are close, so contact recency does not appear to strongly influence whether a customer renews their subscription.

Renewed vs Number of Contacts:

Interpretation

The boxplot shows that the number of contacts is very low for most customers in both the renewed and non-renewed groups, which makes the boxes appear small and close to zero.

The high outliers represent customers who were contacted unusually frequently, with both groups showing a few customers receiving over 10 contacts and some reaching over 30.

The medians and interquartile ranges are nearly identical, which tells us that the typical number of contacts does not differ meaningfully between those who did and did not renew.

The number of contacts seems to have a weak relationship with renewal behaviour.

Renewed vs Length of Relationship:

Interpretation

This boxplot shows a clear difference in length of relationship (LOR) between customers who renewed and those who did not. Renewers have a noticeably higher median LOR and a wider spread in the upper quartile, which tells us that they tend to have longer relationships with the company.

Although some non-renewers appear as high outliers with very long relationships, they represent a small number of cases and do not reflect the typical pattern.

Overall, this suggests that customers with longer relationships are more likely to renew, making LOR one of the strongest predictors.

Part B - Predicting Customers Who Will Renew Their Music Subscription

  1. A classification tree model is fitted to the data to predict whether a customer will re-subscribe.

Interpretation

RULE: If a customer has LOR ≥ 140 then the model predicts that the customer will re-subscribe.

The node representing this rule has a predicted renewal probability of 0.61, meaning that 61% of customers in this group renewed. The node is only moderately pure: the majority of observations fall into the “Yes” class, but 39% of customers in it did not renew.

RULE: If a customer has LOR < 140 and Spend < 182, the model predicts that the customer will not re-subscribe.

This node has a renewal probability of 0.24, meaning that 24% of customers in this group renewed and the remaining 76% churned. The node is therefore fairly pure, with the majority of observations belonging to the “No” class.
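The two rules above can be written as a small function. This is a sketch in Python rather than the fitted tree itself: the function name is hypothetical, and the branch with LOR < 140 and Spend ≥ 182 is left unresolved because its deeper splits are not described in this section.

```python
def predict_renewal(lor, spend):
    """Apply the two rules quoted above.

    Returns (predicted class, renewal probability of the node), or None
    for the branch whose deeper splits are not described in the report.
    """
    if lor >= 140:
        return ("Yes", 0.61)   # first rule: LOR >= 140
    if spend < 182:
        return ("No", 0.24)    # second rule: LOR < 140 and Spend < 182
    return None                # LOR < 140 and Spend >= 182: not covered here


print(predict_renewal(lor=200, spend=50))
print(predict_renewal(lor=100, spend=150))
print(predict_renewal(lor=100, spend=300))
```

Returning None for the undescribed branch keeps the sketch from guessing at splits on number of contacts and age that are only visible in the tree plot.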

Important Variables for Predicting Re-Subscription.

The position of a variable in the classification tree indicates its importance. LOR is at the top, forming the first split, so it is the most important predictor. Spend is the next most important variable, forming the second split within the largest branch. Number of Contacts and Age also contribute to prediction, appearing in deeper splits that refine the model for smaller subgroups.

Gender, number of complaints and contact recency do not appear in the tree because the algorithm did not find any split on them that reduced impurity enough to be useful.

Comparison Between Visual Exploration and B.2.c Findings.

Part A and Part B were mostly consistent. LOR showed the clearest separation visually and also forms the first split in the tree. Complaints, gender and contact recency looked unimportant visually and were not selected by the tree. However, spend, age and number of contacts showed little separation between the groups in the boxplots but were still used in the classification tree, showing that some variables only reveal their predictive value when considered in combination with others.

Assessing Accuracy of the Classification Tree

[1] 0.62
[1] 0.5133333

Training Accuracy: 0.62

The unpruned classification tree correctly predicts renewal/churn for 62% of customers in the training set.

Testing Accuracy: 0.5133333

The model correctly predicts renewal/churn for only 51.3% of customers in the test set.

Conclusion: Strong evidence of overfitting.

The classification tree achieves an accuracy of 62% on the training dataset but only 51.3% on the testing dataset. The model performs substantially better on the training set than on unseen testing data, which indicates that the tree has fit the training data too closely and does not generalise well.

Training and Testing Confusion Matrices.

          
train_pred  No Yes
       No  257 154
       Yes 169 270
         
test_pred No Yes
      No  36  35
      Yes 38  41

For the training data:

True Negatives (correctly predicted No): 257

True Positives (correctly predicted Yes): 270

False Negatives (predicted No but actually Yes): 154

False Positives (predicted Yes but actually No): 169

The model performs only moderately well on the training data and misclassifies a large number of observations in both directions.

The confusion matrices and accuracy values provide evidence that the classification tree is overfitting the training data. The training accuracy is 62%, but the testing accuracy falls to only 51.3%, which is close to random guessing. This difference shows that the model has learned patterns specific to the training dataset that do not generalise well to new observations. The confusion matrix for the testing data shows considerable misclassification in both classes (35 false negatives and 38 false positives out of 150 observations), further demonstrating weak out-of-sample performance.
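The accuracy values can be reproduced directly from the two confusion matrices. The sketch below is in Python with hypothetical function and variable names; the counts are copied from the matrices above, with rows as predictions and columns as actual outcomes. Sensitivity and specificity are added for illustration and are not reported in the original analysis.

```python
def summarise(matrix):
    """Summarise a 2x2 confusion matrix stored as matrix[predicted][actual]."""
    tn = matrix["No"]["No"]    # predicted No, actually No
    fn = matrix["No"]["Yes"]   # predicted No, actually Yes
    fp = matrix["Yes"]["No"]   # predicted Yes, actually No
    tp = matrix["Yes"]["Yes"]  # predicted Yes, actually Yes
    total = tn + fn + fp + tp
    return {
        "n": total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual renewers found
        "specificity": tn / (tn + fp),  # share of actual churners found
    }


train = {"No": {"No": 257, "Yes": 154}, "Yes": {"No": 169, "Yes": 270}}
test = {"No": {"No": 36, "Yes": 35}, "Yes": {"No": 38, "Yes": 41}}

for name, matrix in (("train", train), ("test", test)):
    print(name, summarise(matrix))
```

The training matrix sums to 850 observations and the testing matrix to 150, and the resulting accuracies match the 0.62 and 0.5133333 reported earlier.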

[1] 0.6094118
[1] 0.5266667

Accuracy for the Pruned Classification Tree

The pruned classification tree achieved a training accuracy of 60.9% and a testing accuracy of 52.7%. The training accuracy is slightly lower than the original unpruned tree (62%), which is expected because pruning simplifies the model and prevents it from fitting the training data too closely. The testing accuracy increased slightly from 51.3% to 52.7%, indicating a small improvement in generalisation performance.

Did pruning reduce overfitting?

Yes, pruning has reduced overfitting, although only modestly. Evidence for this comes from the smaller gap between training and testing accuracy. In the original tree, the accuracy dropped from 62% (training) to 51.3% (testing), a gap of about 10.7 percentage points. After pruning, the gap narrows to roughly 8.2 percentage points (60.9% vs 52.7%). This indicates that the pruned tree fits the training data less tightly and generalises slightly better to unseen testing data. Testing accuracy nevertheless remains close to 50%, so pruning simplifies the model and reduces the depth of the tree without making it a strong predictor.
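The size of the improvement can be checked with a few lines of arithmetic. The sketch below is in Python with hypothetical names; the accuracies are the values printed above, and the set sizes (850 training and 150 testing observations) are the totals of the two confusion matrices.

```python
def gap_points(train_acc, test_acc):
    """Training minus testing accuracy, in percentage points."""
    return (train_acc - test_acc) * 100


N_TEST = 150  # total of the testing confusion matrix

models = {
    "unpruned": (0.62, 0.5133333),
    "pruned": (0.6094118, 0.5266667),
}

for name, (train_acc, test_acc) in models.items():
    correct_on_test = round(test_acc * N_TEST)
    print(name, round(gap_points(train_acc, test_acc), 1), correct_on_test)
```

On these figures the pruned tree classifies 79 of the 150 test customers correctly, against 77 for the unpruned tree, so the gain in testing accuracy amounts to two observations. With the unrounded accuracies the pruned gap comes out at about 8.3 points; the 8.2 quoted above is the difference of the rounded percentages.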

Recommendations for improvement.

Based on the analysis, several variables are associated with whether customers renew their subscription, particularly length of relationship (LOR), spend, number of contacts, and age. These patterns suggest actionable strategies to improve renewal rates that are supported by the data.

  1. Target customers with short relationships (LOR < 140). Develop onboarding campaigns that engage new customers early, and offer early loyalty incentives such as discounted upgrades or bonus features before customers reach 140 days. Automated check-ins should also be implemented.

  2. Increase spend among low-spending customers. Create upsell campaigns, introduce bundles, limited-time offers and personalised recommendations, and provide free trials of premium features to increase perceived value.

  3. Increase the number of customer contacts. Add marketing touchpoints such as emails, app notifications, playlists and usage tips, and trigger engagement campaigns when contact frequency declines.

Segmenting Consumers Based on Energy Drink Preference.

Rows: 840
Columns: 8
$ ID     <chr> "ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_6", "ID_7", "ID_8",…
$ D1     <dbl> 2, 4, 2, 1, 1, 2, 1, 1, 2, 5, 4, 2, 3, 1, 4, 4, 3, 1, 1, 4, 5, …
$ D2     <dbl> 3, 4, 3, 6, 3, 3, 5, 3, 3, 5, 6, 3, 7, 6, 4, 7, 5, 3, 3, 3, 5, …
$ D3     <dbl> 7, 5, 8, 5, 7, 8, 6, 7, 6, 6, 6, 9, 7, 6, 7, 9, 7, 7, 6, 9, 6, …
$ D4     <dbl> 7, 6, 8, 8, 7, 7, 5, 9, 7, 7, 6, 7, 7, 7, 6, 7, 5, 7, 6, 5, 7, …
$ D5     <dbl> 7, 9, 5, 6, 7, 5, 5, 7, 5, 7, 6, 8, 8, 5, 8, 7, 6, 6, 6, 9, 8, …
$ Gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Female", "…
$ Age    <chr> "Under_25", "Under_25", "Under_25", "Under_25", "Under_25", "Un…

Part C – Segmenting Consumers Based on Energy Drink Preference

Before calculating Euclidean distances, the ratings were scaled. All drinks are rated on the same 1–9 scale, but scaling ensures that each drink contributes equally to the distance calculation and that a drink with more variable ratings does not dominate the clustering.
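As a rough illustration of this step, the sketch below standardises the first five rows shown in the data preview and computes the Euclidean distance between two respondents. It is written in Python with hypothetical function names, and it assumes z-score scaling (subtract the column mean, divide by the column standard deviation), which the report does not state explicitly. Because it scales over only these five rows, the numbers will differ from those obtained on the full 840 rows.

```python
import math
import statistics

# First five respondents from the data preview (D1..D5 ratings).
ratings = {
    "ID_1": [2, 3, 7, 7, 7],
    "ID_2": [4, 4, 5, 6, 9],
    "ID_3": [2, 3, 8, 8, 5],
    "ID_4": [1, 6, 5, 8, 6],
    "ID_5": [1, 3, 7, 7, 7],
}


def standardise(rows):
    """Z-score each column: (value - column mean) / column sample sd."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


scaled = dict(zip(ratings, standardise(list(ratings.values()))))

# ID_1 and ID_5 differ only on D1, ID_1 and ID_2 differ on every drink.
print(euclidean(scaled["ID_1"], scaled["ID_5"]))
print(euclidean(scaled["ID_1"], scaled["ID_2"]))
```

After scaling, every drink column has mean 0 and standard deviation 1, so a one-point difference on a tightly rated drink counts for more than a one-point difference on a widely spread one.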

The heat map suggests that the data set has a clustering structure. Across participants, there are distinct blocks of similar colour patterns, which indicates that there are groups of consumers who rate the energy drink formulations in a similar way. This visual structure supports using cluster analysis to group people based on their energy drink preferences.


  1   2   3 
441 167 232 
# A tibble: 3 × 6
  cluster    D1    D2    D3    D4    D5
  <fct>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 1        2.95  4.81  6.27  6.65  6.60
2 2        2.51  4.61  7.32  5.09  2.72
3 3        6.64  5.08  3.44  3.16  2.96

Visual Plot

The clusters show different patterns of preference across the five energy drinks. In Cluster 1, ratings rise with flavour concentration and level off at D4 and D5 (means of 6.65 and 6.60), indicating a stronger liking for higher-intensity drinks. Cluster 2 has a clear peak at the mid-concentration drink D3 (mean 7.32), which suggests that these consumers prefer a moderate amount of flavour. Cluster 3, in contrast, shows declining ratings as concentration goes up (from 6.64 for D1 to 2.96 for D5), which means that these consumers prefer lower concentrations.

Cluster Differences by Gender

The gender mix differs between clusters. More men than women are in Cluster 1, while Cluster 2 has a more even mix of men and women. There are also more men than women in Cluster 3, but the difference is less clear. Overall, the gender differences between clusters are smaller than the differences in energy drink preferences.

Cluster Differences by Age

The age mix is different in each cluster. Cluster 1 has more younger people in it, especially people under 35. Cluster 2 has a lot of people between the ages of 25 and 49, while Cluster 3 has a lot of people who are older. These patterns suggest that age is associated with differences in energy drink flavour preferences.

Marketing Implications and Recommendations

Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.

Energy drink D1 (lowest flavour concentration) should be targeted at Cluster 3, which shows higher ratings for lower-concentration drinks and contains a relatively older consumer base.

Energy drink D3 (medium concentration) is best targeted at Cluster 2, which shows a clear preference peak for D3 (mean rating 7.32, the highest of any cluster) and rates both the lowest and the highest concentrations poorly.

Energy drink D5 (highest concentration) should be targeted at Cluster 1, which shows increasing preference as flavour concentration increases and contains a higher proportion of younger consumers.
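The targeting above follows from asking, for each drink, which cluster gives it the highest mean rating. A minimal sketch in Python, using the cluster means from the table earlier in this part (the variable and function names are hypothetical):

```python
# Mean rating per drink for each cluster, copied from the cluster summary table.
cluster_means = {
    1: {"D1": 2.95, "D2": 4.81, "D3": 6.27, "D4": 6.65, "D5": 6.60},
    2: {"D1": 2.51, "D2": 4.61, "D3": 7.32, "D4": 5.09, "D5": 2.72},
    3: {"D1": 6.64, "D2": 5.08, "D3": 3.44, "D4": 3.16, "D5": 2.96},
}


def best_cluster(drink):
    """Return the cluster whose mean rating for this drink is highest."""
    return max(cluster_means, key=lambda c: cluster_means[c][drink])


for drink in ("D1", "D3", "D5"):
    cluster = best_cluster(drink)
    print(drink, "-> Cluster", cluster, cluster_means[cluster][drink])
```

This only compares mean ratings; it does not take cluster size or demographics into account, which are discussed separately.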

These findings suggest that targeted advertising based on consumer segments could improve marketing effectiveness and product satisfaction.

If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?

If the company were required to continue producing only one energy drink version, D3 is recommended.

D3 receives the highest mean rating in Cluster 2 (7.32) and a strong rating in Cluster 1 (6.27). Together these two clusters contain 608 of the 840 respondents. The trade-off is that Cluster 3 (232 respondents) rates D3 relatively low (3.44). Even so, D1 and D5 each appeal strongly to only one cluster, so D3 reaches the largest share of consumers and represents the most commercially viable option when only one product can be retained.
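One way to check this recommendation is to weight each cluster's mean rating by the cluster's size, which gives an overall mean rating per drink. The sketch below is in Python with hypothetical names; the cluster sizes and means are taken from the output earlier in this part, and the weighted mean is an illustrative criterion rather than one used in the original analysis.

```python
# Cluster sizes and mean ratings, copied from the clustering output.
cluster_sizes = {1: 441, 2: 167, 3: 232}
cluster_means = {
    1: {"D1": 2.95, "D2": 4.81, "D3": 6.27, "D4": 6.65, "D5": 6.60},
    2: {"D1": 2.51, "D2": 4.61, "D3": 7.32, "D4": 5.09, "D5": 2.72},
    3: {"D1": 6.64, "D2": 5.08, "D3": 3.44, "D4": 3.16, "D5": 2.96},
}


def weighted_mean(drink):
    """Overall mean rating for a drink, weighting clusters by their size."""
    total = sum(cluster_sizes.values())
    weighted_sum = sum(
        cluster_sizes[c] * cluster_means[c][drink] for c in cluster_sizes
    )
    return weighted_sum / total


overall = {drink: weighted_mean(drink) for drink in ("D1", "D2", "D3", "D4", "D5")}
for drink, value in sorted(overall.items(), key=lambda kv: kv[1], reverse=True):
    print(drink, round(value, 2))
```

On these figures D3 has the highest size-weighted mean rating (about 5.7), followed by D4, which is consistent with recommending D3 as the single retained product.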