Applied Data Analytics Using R

Author

Rosie O’Grady

Introduction

This report uses a range of key descriptive and inferential analysis techniques to analyse digital marketing data. In Part A, a visual exploratory analysis of the music subscription dataset is carried out to identify which factors are associated with whether customers renew their subscription. This involves examining the relationships between renewal behaviour and several potential predictor variables, including the number of times the customer was in contact with the company, the number of complaints made by the customer, their spending trends, and demographics.

A classification tree model is created in Part B to forecast a customer’s likelihood to renew their music subscription. Using both training and testing datasets, the model is thoroughly evaluated by assessing prediction rules, node purity, variable importance, and overall model accuracy. To provide practical suggestions for enhancing client retention, insights from the model are outlined and explained.

In Part C, the energy drink preference dataset is analysed using hierarchical clustering to identify meaningful customer segments based on ratings of five different drink versions. A dendrogram and heatmap are used to visualise the clustering results. To profile consumer groups according to product preferences and demographics, a three-cluster solution is investigated. Recommendations are then given about which customer groups the company should focus on and which single product would be best suited for ongoing manufacturing.

Part A - Exploratory Analysis of the Music Subscription Dataset

The relationship between whether a customer renews their subscription and the number of times a customer was in contact with the music downloading service

It is evident from the boxplot that the median number of contacts is low both for customers who renewed and for those who did not. The interquartile range of each group is also quite narrow, which suggests that most customers only contacted the company a few times.

There are a large number of outliers in both groups, which shows that some customers in each group contacted the service unusually often.

The distributions and general shapes are almost the same and customers with both low and high contact frequency can be found in both categories, suggesting that the number of times a customer contacted the service is not a reliable indicator of renewal.
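The numbers that a boxplot draws (median, quartiles and whisker ends) can also be compared directly. The following is a minimal sketch in R using simulated data, not the report's dataset: the data frame `music_sim` and the column `renewed` are hypothetical names, and `num_contacts` is drawn from the same distribution in both groups purely for illustration.

```r
# Simulated stand-in for the music subscription data (NOT the real dataset).
set.seed(1)
music_sim <- data.frame(
  renewed      = factor(rep(c("No", "Yes"), each = 100)),
  num_contacts = rpois(200, lambda = 2)  # same distribution in both groups
)

# Median and interquartile range of contacts for each renewal group.
group_summary <- aggregate(
  num_contacts ~ renewed,
  data = music_sim,
  FUN  = function(x) c(median = median(x), iqr = IQR(x))
)
print(group_summary)

# boxplot.stats() returns the five values a boxplot draws for each group:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
five_numbers <- lapply(
  split(music_sim$num_contacts, music_sim$renewed),
  function(x) boxplot.stats(x)$stats
)
print(five_numbers)
```

If the two groups' summaries are close, as the boxplot suggests here, the variable is unlikely to separate renewers from non-renewers on its own.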

The relationship between whether a customer renews their subscription and the elapsed time since their last contact

The boxplot shows that there is not a significant difference in the recency of contact between renewed and non-renewed clients. The medians for the two groups are nearly identical and the whiskers for both groups reach the same minimum and maximum values, showing that the overall distribution of the data is similar as well. This suggests that both groups contain customers who have more recently contacted the business as well as those who haven’t in a long time.

There is no visual evidence that contact recency is linked to renewal behaviour because the centre, spread, and whisker range are so comparable across the two groups. A customer’s decision to renew their subscription does not seem to be influenced by how recently they contacted the company.

The relationship between whether a customer renews their subscription and the number of complaints they have made

The patterns shown by both groups in the boxplot are nearly identical. The median number for both customers who renewed and did not renew is zero, and the interquartile range is incredibly small as the majority of consumers made no complaints to the company at all during the last 36 months. As a result, the box collapses into a single line at zero.

The outliers in the boxplot follow a very similar pattern in both groups, representing customers who made one or more complaints. Looking at these outliers, it is evident that both groups contain customers who complained more frequently. The group of customers who did not renew their subscription contains one outlier, a customer who made approximately 20 complaints. However, this represents only one customer, and the rest of the distribution closely mirrors that of the renewed group.

Since the overall distribution, skewness, and the outliers are almost identical across the two groups, the number of complaints does not appear to give a meaningful insight into whether a customer renews their subscription.

The relationship between whether a customer renews their subscription and the amount of money they have spent during the last 36 months with the company

In regard to spending, the median spend of customers who renewed is visibly greater than that of those who did not renew, which indicates that on average renewing customers tend to spend somewhat more. The slightly higher whisker in the non-renewed group indicates that some consumers spent relatively large amounts and still decided not to renew.

However, the boxes for both groups sit at similar heights showing that the central range of spending values is comparable. Neither category has any evident outliers which indicates that most customers have similar spending habits.

Even though there is evidence of higher spending among customers who renew, the similarity in the distributional features indicates that spending is not a strong differentiator between customers who renewed and who did not renew.

The relationship between whether a customer renews their subscription and their length of relationship with the company expressed in days

The boxplot shows that the group of customers who did renew generally have a longer relationship with the company as the group has a higher median value in comparison to the group of customers who did not renew. The interquartile range for the renewed group is also bigger, showing greater variance in relationship length among those who renewed.

For clients who did not renew, the majority of relationship lengths are concentrated at lower values, with only a few customers with longer relationships appearing as upper outliers. Overall, the plot indicates that consumers having a longer history with the company are more likely to renew, while shorter relationships appear to be more often connected with non-renewal.

The relationship between whether a customer renews their subscription and their gender

It is evident from the bar chart that male customers take up a larger proportion of the dataset. Within the male customer group, the number of renewals is visibly higher than the number of customers who did not renew.

This contrasts with the female customers, among whom the number who did not renew their subscription is visibly higher than the number who did renew.

The chart implies that gender may be connected with renewal behaviour, since males appear to renew their subscription more often than females. However, the difference is not extreme, indicating that gender alone does not entirely explain renewal outcomes.

The relationship between whether a customer renews their subscription and their age

It is evident from the boxplot that customers who renewed tend to be slightly older on average, as they have a higher median age than the group of customers who did not renew. The interquartile range for the renewed group is also bigger, indicating that there is a greater diversity in age among consumers that renewed.

Both groups show a concentration of ages in the 45 to 65 year old age range. Younger customers, roughly below 25 years of age, can be seen as outliers in both groups, indicating that they make up a small part of the sample.

Overall, while the age distributions for the two groups are similar, the boxplot implies that older customers are slightly more likely to renew their subscription than younger customers.

Part B - Predicting Customers Who Will Renew Their Music Subscription

Classification Tree for Predicting Renewal Behaviour

Interpretation of the classification tree

A. One rule for predicting whether a customer will re-subscribe

From interpreting the classification tree, the following rule can be identified for predicting customer re-subscription: “If a customer’s length of relationship (lor) is 140 days or more, then they are predicted to re-subscribe.”

This aligns with the right-hand branch of the tree. Customers with a length of relationship (lor) of 140 days or more follow this path, reaching the node that states “yes”. According to this node, 61% of customers in this group resubscribed, whereas 39% did not. This suggests a reasonably pure node, as the majority of outcomes are renewals, although a sizeable minority (39%) still churned.

B. One rule for predicting if a customer will churn

A rule for predicting customer churn can also be identified from the classification tree: “If a customer has a length of relationship (lor) of less than 140 days and a total spend of less than 182, then they are predicted to not renew their subscription.”

This churn rule applies to customers who take the left-hand path of the classification tree, with a length of relationship of less than 140 days and a spend of less than 182, resulting in a terminal node predicting “No”. The numbers in this node suggest that around 76% of clients did not re-subscribe, whereas 24% did. This node is also mostly pure as non-renewal is the most common result.
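Node purity can be put on a numeric scale. The sketch below uses Gini impurity, which is one common purity measure; the report does not state which measure the tree used, so this choice is an assumption. The function name `gini_impurity` is hypothetical, and the proportions are the node percentages quoted above.

```r
# Gini impurity of a node from its class proportions: 1 - sum(p^2).
# 0 means a perfectly pure node; 0.5 is the maximum for two classes.
gini_impurity <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)  # proportions must sum to 1
  1 - sum(p^2)
}

# Node proportions quoted in the text.
renew_node <- gini_impurity(c(yes = 0.61, no = 0.39))  # lor >= 140
churn_node <- gini_impurity(c(yes = 0.24, no = 0.76))  # lor < 140 and spend < 182

round(c(renew_node = renew_node, churn_node = churn_node), 3)
# renew_node 0.476, churn_node 0.365
```

On this measure the churn node is the purer of the two, since its majority class accounts for a larger share of the customers in it.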

C. The most important variables for predicting whether a customer will re-subscribe

Length of relationship (lor), spend, number of contacts (num_contacts), and age are the most important variables for predicting whether a customer will re-subscribe or not. The length of relationship (lor) is the most important variable, which is evident as it determines the first split at the top of the classification tree. Spend is also an important variable because it appears next after lor and separates customers with shorter relationships into different renewal outcomes.

The number of contacts and age appear lower in the tree, which shows that they are not as important variables but they still provide predictive potential for specific consumer groups. Other factors in the model are not shown in the tree, indicating that they did not significantly enhance prediction and are less important.

Comparison of Visual Exploration and Classification Tree Results

In the visual exploration in part A, the variables length of relationship (lor) and spend were important, as the boxplots identified that customers in the renewed group tended to have longer relationships with the company and also had higher spending. This coincides with part B as the classification tree also highlights that they are important variables, as they appear high in the classification tree, with lor as the first split showing that it plays an important role in prediction.

Variables that were evidently less important in the visual exploration such as the number of times a customer was in contact with the music downloading service and also age were still included in the classification tree. This shows that when they are combined with other variables they still help to make predictions.

The classification tree did not show contact recency, number of complaints and gender. While these variables were investigated in Part A, they were excluded from the classification tree in part B as they did not significantly improve the model’s ability to predict re-subscription.

Assessment of Classification Tree Accuracy

Table 1: Accuracy of the Classification Tree (Training Dataset)

      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270
[1] 0.62

Table 2: Accuracy of the Classification Tree (Testing Dataset)

      Predicted
Actual No Yes
   No  36  38
   Yes 35  41
[1] 0.5133333

The training and testing datasets were used to assess the accuracy of the classification tree, as shown above. On the training dataset (Table 1), the classification tree achieved an accuracy of 62% (0.62), meaning that the model correctly predicted the outcome for 62% of customers in the training dataset. On the testing dataset (Table 2), the classification tree achieved an accuracy of 51.3% (0.513), which is lower than on the training dataset and only marginally better than guessing, given that the testing dataset is almost evenly split between customers who renewed (76) and those who did not (74).

The classification tree is overfitting the training dataset, which is evident from the difference in accuracy scores. When a model learns patterns that are overly particular to the training data it is said to be overfitting and as a result, it does not generalise well to new data. In this instance, the model’s performance on the training dataset (62%) is significantly higher than that on the testing dataset (51.3%). Overfitting is evident from this decline in performance on the testing dataset.
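The accuracy figures can be reproduced from the confusion matrices by dividing the correctly classified cases (the diagonal) by the total. The sketch below re-enters the counts from Tables 1 and 2 by hand; the object and function names (`train_cm`, `test_cm`, `accuracy`) are illustrative rather than taken from the original analysis.

```r
# Confusion matrices copied from Tables 1 and 2
# (rows = actual class, columns = predicted class).
labels <- list(Actual = c("No", "Yes"), Predicted = c("No", "Yes"))
train_cm <- matrix(c(257, 169,
                     154, 270), nrow = 2, byrow = TRUE, dimnames = labels)
test_cm  <- matrix(c(36, 38,
                     35, 41), nrow = 2, byrow = TRUE, dimnames = labels)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)  # 527 / 850
test_acc  <- accuracy(test_cm)   # 77 / 150

round(c(train = train_acc, test = test_acc, gap = train_acc - test_acc), 3)
# train 0.620, test 0.513, gap 0.107
```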

Pruned Classification Tree

Table 3: Accuracy of the Pruned Classification Tree (Training Dataset)

      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252
[1] 0.6094118

Table 4: Accuracy of the Pruned Classification Tree (Testing Dataset)

      Predicted
Actual No Yes
   No  41  33
   Yes 38  38
[1] 0.5266667

After pruning, the accuracy gap between the training dataset (60.9%) and the testing dataset (52.7%) is smaller: about 8.2 percentage points, compared with about 10.7 percentage points for the unpruned tree.

This indicates that pruning reduced overfitting, although only modestly, and accuracy on the testing dataset remains close to 50%.
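The size of the train-test gap before and after pruning can be checked with a few lines of arithmetic. This sketch re-enters the pruned-tree confusion matrices shown above and uses the unpruned accuracies reported earlier (0.62 and 0.5133333); the names `pruned_train`, `pruned_test` and `gaps` are illustrative.

```r
# Accuracy = diagonal of the confusion matrix / total number of cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Pruned-tree confusion matrices (rows = actual, columns = predicted).
pruned_train <- matrix(c(266, 160,
                         172, 252), nrow = 2, byrow = TRUE)
pruned_test  <- matrix(c(41, 33,
                         38, 38), nrow = 2, byrow = TRUE)

# Train-minus-test accuracy gap for each tree.
gaps <- c(
  unpruned = 0.62 - 0.5133333,  # accuracies reported for the unpruned tree
  pruned   = accuracy(pruned_train) - accuracy(pruned_test)
)
round(gaps, 3)
# unpruned 0.107, pruned 0.083
```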

Marketing Recommendations Based on the Classification Tree

The company can use the classification tree for marketing purposes by identifying which customers are most and least likely to re-subscribe, and then targeting those who are unlikely to re-subscribe with actions that focus on the relevant predictor variables. Some actions the company could take to improve its renewal rate are as follows:

1. Focus on targeting customers with a short length of relationship with the company:

Length of relationship was identified as an important variable in both parts of the analysis. In Part B it appears at the top of the classification tree, and in Part A the boxplot shows that customers in the renewed group tend to have longer relationships with the company. The company should therefore focus on length of relationship in its marketing efforts.

The classification tree shows that customers who have shorter relationships with the company i.e. less than 140 days are more likely to not re-subscribe. The company can focus on these customers by providing them with welcome offers when they first subscribe to the music streaming service such as a free trial period or a discounted subscription price at first to create a positive first impression. The company could then offer timed discounts on subscription renewal when they are approaching the 140 day mark to encourage these customers to remain subscribed.

2. Target customers that have low spend amounts with the service:

In the classification tree, after the first split on length of relationship (lor < 140), the next split is on spend (spend < 182). This indicates that customers who have spent less with the company are less likely to re-subscribe. Customers in this group may not be spending much with the service due to low usage. The company could encourage customers to use the service more by incorporating features such as daily personalised playlists to increase engagement. If new features make the service more appealing to use, customers may spend more and also be more likely to renew their subscription.

If the music streaming service currently operates a premium subscription option that includes features such as ad-free listening or unlimited skips, customers on the basic subscription could be given limited daily access to these features. This may entice them to upgrade to the premium subscription, raising the amount they spend with the company and potentially motivating them to continue subscribing to the service.

Part C - Segmenting Consumers Based on Energy Drink Preference

Distance Measure and Clustering Method

A Euclidean distance matrix was created using the scaled energy drink ratings. The data was scaled beforehand because Euclidean distance is sensitive to the scale of the variables. If the data had not been scaled, variables with larger ranges would have had more influence on the calculated distances. By scaling the variables, each of the energy drink ratings contributes equally to the distance matrix.
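The effect of scaling on Euclidean distance can be seen on a toy example. The sketch below uses made-up values for two hypothetical rating columns measured on different scales (`rating_1_to_9` and `rating_0_to_100`); neither the values nor the names come from the energy drink dataset.

```r
# Toy data: two rating variables on very different scales (hypothetical values).
toy <- data.frame(
  rating_1_to_9   = c(1, 9, 5),
  rating_0_to_100 = c(10, 12, 90),
  row.names = c("a", "b", "c")
)

# Unscaled: the wide-range column dominates, so a and b look close
# even though they sit at opposite ends of the 1-9 rating.
d_raw <- dist(toy, method = "euclidean")

# scale() centres each column to mean 0 and rescales it to standard deviation 1,
# so both columns contribute comparably to the distance.
toy_scaled <- scale(toy)
d_scaled   <- dist(toy_scaled, method = "euclidean")

print(round(d_raw, 2))
print(round(d_scaled, 2))
```

Comparing the two printed matrices shows how much the ordering of distances depends on whether the variables were standardised first.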

Visualisation of Hierarchical Clustering

Evidence of Clustering Structure

The heatmap uses colour intensity to show customer ratings, with higher ratings represented by darker orange and red colours and lower ratings by lighter yellow and pastel colours. Certain consumer groups consistently display lighter yellow colours for the same drinks, indicating lower ratings, while other groups consistently display darker orange or red shades for specific drink types, suggesting higher ratings.

These patterns show up as long blocks of similar colours that are arranged in a way that follows the dendrogram. This means that customers with similar rating patterns are placed next to each other. This indicates that clients with similar preferences have been put together by hierarchical clustering. The presence of these structured colour blocks provides visual evidence of clustering structure in the energy drinks dataset.

Selection and Assessment of the 3-Cluster Solution

energy_clusters
  1   2   3 
441 167 232 

Assessment of the quality of this solution

The quality of the 3-cluster solution can be assessed by examining the number of customers in each cluster. As seen above, cluster 1 contains 441 participants, cluster 2 contains 167 participants and cluster 3 contains 232 participants. All three clusters are quite populated and none are empty or relatively small. Even though Cluster 1 is larger than the other two clusters, each cluster is still large enough to compare characteristics making the solution suitable for profiling and interpretation.
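The workflow behind a three-cluster solution and its size check can be sketched as follows. The ratings here are simulated, not the report's data, and the Ward linkage (`"ward.D2"`) is an assumption, since the report does not state which linkage method was used. Only the final lines use figures from the report (the cluster sizes 441, 167 and 232).

```r
set.seed(42)

# Simulated ratings for 60 respondents on five drinks (NOT the report's data).
sim_ratings <- matrix(sample(1:9, 60 * 5, replace = TRUE), ncol = 5,
                      dimnames = list(NULL, paste0("D", 1:5)))

# Scale, compute Euclidean distances, then cluster hierarchically.
d  <- dist(scale(sim_ratings), method = "euclidean")
hc <- hclust(d, method = "ward.D2")  # linkage method is an assumption

# Cut the dendrogram into three clusters and count the members of each.
sim_clusters <- cutree(hc, k = 3)
print(table(sim_clusters))

# plot(hc) would draw the dendrogram and heatmap(scale(sim_ratings))
# a clustered heatmap; both are omitted here to keep the sketch text-only.

# Share of respondents in each cluster, using the sizes reported above.
reported_sizes <- c(`1` = 441, `2` = 167, `3` = 232)
round(reported_sizes / sum(reported_sizes), 3)
# 1: 0.525, 2: 0.199, 3: 0.276
```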

Profiling the Clusters

A. How do the clusters differ on their average rating of each version of the energy drinks?

  Cluster       D1       D2       D3       D4       D5
1       1 2.945578 4.811791 6.274376 6.646259 6.603175
2       2 2.508982 4.610778 7.323353 5.089820 2.718563
3       3 6.642241 5.081897 3.439655 3.159483 2.956897

It is evident from the table above that cluster 1 prefers versions D3, D4 and D5 and gives a low rating to D1.

Cluster 2 likes version D3 the most and rates D1 and D5 poorly, implying a more distinctive preference.

Cluster 3 also has a distinctive preference, in this case for D1, and gives poor ratings to D3, D4 and D5.

Overall the data in the table above shows that there are differences in the preference profiles of the clusters.
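One way to summarise these preference profiles is to pick out each cluster's highest- and lowest-rated version and the spread between them. The sketch below re-enters the mean ratings from the table above; the object names `cluster_means` and `profile` are illustrative.

```r
# Mean ratings per cluster, copied from the table above.
cluster_means <- matrix(
  c(2.945578, 4.811791, 6.274376, 6.646259, 6.603175,
    2.508982, 4.610778, 7.323353, 5.089820, 2.718563,
    6.642241, 5.081897, 3.439655, 3.159483, 2.956897),
  nrow = 3, byrow = TRUE,
  dimnames = list(Cluster = c("1", "2", "3"), Version = paste0("D", 1:5))
)

versions <- colnames(cluster_means)

# For each cluster (row): best-rated version, worst-rated version,
# and the gap between the highest and lowest mean rating.
profile <- data.frame(
  cluster = rownames(cluster_means),
  highest = versions[apply(cluster_means, 1, which.max)],
  lowest  = versions[apply(cluster_means, 1, which.min)],
  spread  = round(apply(cluster_means, 1, function(r) max(r) - min(r)), 2)
)
print(profile)
```

A larger spread indicates a cluster whose ratings separate the drink versions more sharply.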

B. How do the clusters differ on age and gender?

          
             1   2   3
  25_34    192  58  88
  35_49     90  56  65
  50_64     52  18  26
  Over_65   20   7  12
  Under_25  87  28  41

Cluster 1 is the biggest cluster and contains all age groups, with the 25-34 group making up the largest share (192 of 441 respondents, about 44%). Cluster 2 is the smallest cluster and contains the fewest respondents in the older age groups (18 aged 50-64 and 7 aged over 65), although as a share of the cluster this is similar to the other two clusters. Cluster 3 is relatively balanced but contains a smaller proportion of respondents under 35 than cluster 1 (about 56% compared with 63%).

Overall, there are modest differences in the distribution of ages across the three clusters. The clearest is the concentration of 25-34 year olds in cluster 1, while the small size of cluster 2 indicates that it may be a niche group.
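Because the clusters differ in size, raw counts are easier to compare once they are converted to within-cluster percentages. The sketch below re-enters the age table above and uses `prop.table()` with `margin = 2` to compute column-wise shares; the object names `age_counts` and `age_share` are illustrative.

```r
# Age-by-cluster counts, copied from the table above.
age_counts <- matrix(
  c(192, 58, 88,
     90, 56, 65,
     52, 18, 26,
     20,  7, 12,
     87, 28, 41),
  nrow = 5, byrow = TRUE,
  dimnames = list(Age = c("25_34", "35_49", "50_64", "Over_65", "Under_25"),
                  Cluster = c("1", "2", "3"))
)

# Share of each age group within each cluster (every column sums to 1).
age_share <- prop.table(age_counts, margin = 2)
print(round(100 * age_share, 1))
```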

        
           1   2   3
  Female 162  81  96
  Male   279  86 136

The gender distribution is broadly similar across the clusters, with more males than females in all three.

The difference is most evident in cluster 1, where there are 117 more males than females (about 63% male). Clusters 2 and 3 have a closer ratio of male to female respondents (about 51% and 59% male respectively).

Overall, gender does not appear to be a major distinguishing factor in the clusters' drink preferences.

Marketing Recommendations by Product

Advertising D1

Version D1 should be specifically advertised to Cluster 3. This is because the mean value of the rating for D1 of cluster 3 is 6.64, which is greater than the mean ratings of the same energy drink version in clusters 1 and 2 which are both less than 3.

This shows that the D1 version is much more favoured amongst the cluster 3 customers and that they have a distinct preference for it.

Advertising D3

Version D3 should be advertised to Cluster 2. Cluster 2 has an evident preference for the energy drink version D3 in comparison to the other clusters. The mean value of the rating for D3 in this cluster is approximately 7.32.

This contrasts most clearly with cluster 3, which has a mean rating of only 3.44 for D3. Cluster 1 also rates D3 fairly well, with a mean rating of 6.27, but cluster 2 rates it highest.

Advertising D5

Version D5 should be advertised to Cluster 1. D5 is preferred by Cluster 1 as the mean value of the rating is 6.60. The mean rating for the same energy drink version in Cluster 2 is 2.72, and 2.96 in Cluster 3.

Therefore, since cluster 1 evidently prefers this energy drink type in comparison to the other clusters and it is also the biggest cluster, it is the most viable option for the company to advertise D5 to.

Recommendation of a Single Energy Drink Version

If the company had to choose just one version of the energy drink to continue producing, I would recommend that it continue producing version D3. D3 has the best overall performance across the clusters. It receives the highest average rating in cluster 2, with a mean of approximately 7.32, while also receiving a positive average rating in cluster 1, with a mean of approximately 6.27. Even though it has a low average rating in cluster 3, its positive ratings in clusters 1 and 2, which together contain 608 of the 840 respondents, indicate that a significant number of customers prefer this version.

The other energy drink versions are not as widely preferred: D1 is only strongly preferred in cluster 3, and D5 is disliked by cluster 2 even though it is liked in cluster 1. This shows that D1 and D5 have more of a specific following, whereas D3 strikes a balance of preferences and has broad acceptability across multiple clusters. Overall, D3 seems like the most viable option if the company had to choose just one version of the energy drink to keep producing.
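One way to check the "best overall" claim is to weight each cluster's mean rating by the number of respondents in that cluster, which gives the overall mean rating of each version. This is a sketch of that check rather than a step taken from the original analysis; it re-enters the cluster sizes and mean ratings reported earlier, and the object names are illustrative.

```r
# Cluster sizes and mean ratings, copied from the earlier output.
cluster_sizes <- c(441, 167, 232)
cluster_means <- matrix(
  c(2.945578, 4.811791, 6.274376, 6.646259, 6.603175,
    2.508982, 4.610778, 7.323353, 5.089820, 2.718563,
    6.642241, 5.081897, 3.439655, 3.159483, 2.956897),
  nrow = 3, byrow = TRUE,
  dimnames = list(Cluster = c("1", "2", "3"), Version = paste0("D", 1:5))
)

# Multiplying the matrix by the size vector scales row i by cluster_sizes[i];
# summing each column and dividing by the total gives the size-weighted mean.
overall_mean <- colSums(cluster_means * cluster_sizes) / sum(cluster_sizes)

round(overall_mean, 2)
# D1 3.88, D2 4.85, D3 5.70, D4 5.37, D5 4.82
```

On this measure D3 has the highest overall mean rating, which is consistent with the recommendation above.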