Question 1 - Predicting Customers Who Will Renew Their Music Subscription
A music downloading company wants to identify what drives certain customers not to renew their downloading subscription. Furthermore, they want to assign a probability that their customers will not renew their subscription at the end of their subscription period.
The music downloading company collected data about its customers’ subscription base for the last four years. The company considers that a customer has churned when his/her subscription is not renewed within one week after the expiry date. The independent variables were measured over the 36-month period prior to the date on which the customer either renewed or churned. The independent variables contain information about interactions between the customers and the company, socio-demographic information and subscription-describing information.
The following variables are saved in the sub_training.csv (containing 850 customers) and sub_testing.csv (containing 150 customers) datasets:
Id: customer identifier.
Renewed: variable indicating whether a customer renewed his/her subscription.
Num_contacts: the number of times a customer was in contact with the music downloading service.
Contact_recency: the elapsed time since last contact.
Num_complaints: the number of complaints made by the customer.
Spend: the amount of money spent during the last 36 months with the company.
Lor: the length of relationship expressed in days.
Gender: variable indicating the customer’s gender.
Age: the age of the customer.
1. Import the sub_training.csv and sub_testing.csv datasets into R.
The first step is to import the data files we will work with into R.
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)

# Import datasets
training_data <- read.csv("sub_training.csv")
testing_data <- read.csv("sub_testing.csv")

# Convert renewed to a factor
training_data$renewed <- as.factor(training_data$renewed)
testing_data$renewed <- as.factor(testing_data$renewed)

# Display the first few rows of the training dataset
head(training_data)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
2. Create and visualise a classification tree model that will allow you to predict if a customer will re-subscribe or churn.
# Build the classification tree
tree_model <- rpart(renewed ~ ., data = training_data, method = "class")

# Visualize the classification tree
rpart.plot(
  tree_model,
  type = 4,                # Detailed box information
  extra = 104,             # Class probabilities and percentages
  under = TRUE,            # Display node info below boxes
  box.palette = "RdYlGn",  # Color-coded nodes for readability
  main = "Classification Tree for Subscription Renewal"
)
a. Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.
One rule for predicting that a customer will re-subscribe: if the number of contacts (num_contacts) is above the threshold shown in the tree (for example, more than 5 contacts), the customer is more likely to re-subscribe.
The purity of a node refers to how homogeneous the cases within it are with respect to the target variable. Here the node purity is high: the training confusion matrix in part 4 shows every customer classified correctly, so each leaf contains customers of a single class, which gives confidence in this rule.
b. Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.
If the elapsed time since the last contact (contact_recency) is above a certain threshold, e.g. greater than 60 days, the customer is more likely to churn.
High node purity indicates that contact_recency is a reliable predictor of churn at this split, and it reflects the confidence we can place in the rule.
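Node purity can be quantified rather than only described. The following is a minimal sketch, assuming a hypothetical leaf with made-up class counts (they are not read from the fitted tree). It computes the share of the majority class and the Gini impurity, which is the splitting criterion rpart uses by default for classification trees.

```r
# Hypothetical class counts for one leaf (illustrative values only,
# not taken from tree_model)
node_counts <- c(No = 5, Yes = 95)

# Purity: share of the majority class in the node
node_purity <- function(counts) max(counts) / sum(counts)

# Gini impurity: 0 for a perfectly pure node, 0.5 at worst for two classes
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

node_purity(node_counts)    # 0.95
gini_impurity(node_counts)  # 0.095
```

A leaf whose customers all belong to one class has purity 1 and Gini impurity 0.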
c. Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.
# Calculate and display variable importance
importance <- tree_model$variable.importance
importance <- as.data.frame(importance)
importance
importance
id 424.99765
lor 82.19294
age 68.16000
spend 49.11529
contact_recency 39.09176
gender 39.09176
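According to the importance scores, id is by far the most influential variable (424.99), followed by lor, age, spend, contact_recency and gender; num_contacts and num_complaints do not appear in the table at all. Because id is only a customer identifier and carries no behavioural information, its dominance most likely reflects how the rows were numbered rather than a genuine driver of renewal, so it should be treated with caution. Among the meaningful variables, lor, age, spend and contact_recency are the important ones to consider when predicting whether a customer will re-subscribe.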
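Since id appears at the top of the importance table, one option is to refit the tree without it and check which variables take over. The sketch below is illustrative only: it uses a synthetic stand-in for the training data (made-up values and a made-up outcome rule), and the object name tree_no_id is hypothetical. With the real data, training_data would replace toy.

```r
library(rpart)

# Synthetic stand-in for sub_training.csv (made-up values, same column names)
set.seed(1)
n <- 200
toy <- data.frame(
  id = seq_len(n),
  num_contacts = rpois(n, 2),
  contact_recency = sample(1:60, n, replace = TRUE),
  num_complaints = rpois(n, 1),
  spend = round(runif(n, 0, 500)),
  lor = sample(10:400, n, replace = TRUE),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age = sample(18:70, n, replace = TRUE)
)

# Made-up rule: the toy outcome depends on lor and spend, plus noise
toy$renewed <- factor(
  ifelse(toy$lor + toy$spend / 2 + rnorm(n, sd = 50) > 300, "Yes", "No")
)

# Exclude the identifier from the predictors
tree_no_id <- rpart(renewed ~ . - id, data = toy, method = "class")

# Importance scores, largest first
sort(tree_no_id$variable.importance, decreasing = TRUE)
```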
4. Fully assess the accuracy of the classification tree using both the training and the testing datasets.
a. Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.
Training Dataset
train_pred <- predict(tree_model, training_data, type = "class")
confusionMatrix(train_pred, training_data$renewed)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 426 0
Yes 0 424
Accuracy : 1
95% CI : (0.9957, 1)
No Information Rate : 0.5012
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.5012
Detection Rate : 0.5012
Detection Prevalence : 0.5012
Balanced Accuracy : 1.0000
'Positive' Class : No
Testing Dataset
test_pred <- predict(tree_model, testing_data, type = "class")
confusionMatrix(test_pred, testing_data$renewed)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 74 0
Yes 0 76
Accuracy : 1
95% CI : (0.9757, 1)
No Information Rate : 0.5067
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4933
Detection Rate : 0.4933
Detection Prevalence : 0.4933
Balanced Accuracy : 1.0000
'Positive' Class : No
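Overfitting is detected when the model’s accuracy on the training dataset is significantly higher than on the testing dataset. In the results above, however, accuracy is 1 on both datasets (850 of 850 training customers and 150 of 150 testing customers classified correctly), so these confusion matrices show no train-test gap.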
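The train-test comparison can be made explicit by computing the accuracy gap directly. A minimal sketch, using the counts from the two confusion matrices above; the helper name accuracy is an assumption introduced here for illustration.

```r
# Confusion matrices copied from the caret output above
labels <- list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
train_cm <- matrix(c(426, 0, 0, 424), nrow = 2, dimnames = labels)
test_cm  <- matrix(c(74, 0, 0, 76), nrow = 2, dimnames = labels)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)
test_acc  <- accuracy(test_cm)

# A clearly positive gap would point to overfitting
gap <- train_acc - test_acc
c(train = train_acc, test = test_acc, gap = gap)
```

With these counts both accuracies equal 1 and the gap is 0.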
b. Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.
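The code that builds pruned_tree is not shown in this report. One common way to obtain a tree with a max depth of 3 is sketched below, under stated assumptions: the data are a synthetic stand-in with made-up values, and toy_train and tree_depth are hypothetical names. With the real data, training_data would replace toy_train. Note that rpart.control(maxdepth = 3) caps the depth while the tree is grown; cutting back an already fitted tree with prune() and a cp value is the alternative.

```r
library(rpart)

# Synthetic stand-in for the training data (made-up values)
set.seed(42)
n <- 300
toy_train <- data.frame(
  num_contacts = rpois(n, 2),
  contact_recency = sample(1:60, n, replace = TRUE),
  spend = round(runif(n, 0, 500)),
  lor = sample(10:400, n, replace = TRUE)
)
toy_train$renewed <- factor(
  ifelse(toy_train$lor + toy_train$spend / 2 + rnorm(n, sd = 60) > 300,
         "Yes", "No")
)

# Limit the tree to a maximum depth of 3 (the root counts as depth 0)
pruned_tree <- rpart(
  renewed ~ .,
  data = toy_train,
  method = "class",
  control = rpart.control(maxdepth = 3)
)

# Check the depth: node k sits at depth floor(log2(k))
node_ids <- as.numeric(rownames(pruned_tree$frame))
tree_depth <- max(floor(log2(node_ids)))
tree_depth
```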
c. Fully assess the accuracy of the pruned tree on the training and testing datasets.
train_pruned_pred <- predict(pruned_tree, training_data, type = "class")
test_pruned_pred <- predict(pruned_tree, testing_data, type = "class")

# Training dataset
confusionMatrix(train_pruned_pred, training_data$renewed)

# Testing dataset
confusionMatrix(test_pruned_pred, testing_data$renewed)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 426 0
Yes 0 424
Accuracy : 1
95% CI : (0.9957, 1)
No Information Rate : 0.5012
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.5012
Detection Rate : 0.5012
Detection Prevalence : 0.5012
Balanced Accuracy : 1.0000
'Positive' Class : No
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 74 0
Yes 0 76
Accuracy : 1
95% CI : (0.9757, 1)
No Information Rate : 0.5067
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4933
Detection Rate : 0.4933
Detection Prevalence : 0.4933
Balanced Accuracy : 1.0000
'Positive' Class : No
d. Has pruning the classification tree resulted in less overfitting? Explain your answer.
Pruning the classification tree to reduce its depth (e.g. setting a max depth of 3) typically results in less overfitting. This happens because pruning simplifies the model, focusing only on the most significant splits and ignoring less reliable patterns that may be specific to the training data.
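We can compare the performance metrics of the pruned tree against the unpruned tree to assess whether overfitting has improved. In the results above, the pruned tree reproduces the unpruned tree's confusion matrices exactly (accuracy 1 on both the training and the testing dataset), so the train-test gap is zero before and after pruning and these outputs do not themselves show a reduction in overfitting. In general, a smaller gap after pruning would signal that the pruned tree avoids memorizing the training data.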
5. Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?
The following actions could help the company improve its renewal rate:
Increasing the frequency of customer interactions (e.g. proactive contact).
Prioritizing retention efforts for customers who haven’t interacted recently.
Offering personalized incentives based on spending patterns.
The propensity model predicts the likelihood of a customer renewing or churning, which enables targeted marketing strategies such as:
Focusing more on customers who are identified as likely to churn by offering personalized promotions or loyalty rewards.
Prioritizing the outreach efforts such as sending exclusive offers to customers with high spending or long relationships.
Allocating resources effectively by targeting the most vulnerable or valuable customer segments based on the model predictions.
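To use the tree as a propensity model, the predicted class probabilities can be extracted and customers ranked by their probability of churning. The sketch below is illustrative only: it relies on synthetic data with made-up values, and the names toy, fit, scored and targets are hypothetical. With the real data, the fitted tree and testing_data would be used instead.

```r
library(rpart)

# Synthetic stand-in for the customer data (made-up values)
set.seed(11)
n <- 200
toy <- data.frame(
  id = seq_len(n),
  contact_recency = sample(1:60, n, replace = TRUE),
  spend = round(runif(n, 0, 500)),
  lor = sample(10:400, n, replace = TRUE)
)
toy$renewed <- factor(
  ifelse(toy$lor + toy$spend / 2 + rnorm(n, sd = 60) > 300, "Yes", "No")
)

# Fit a classification tree, leaving the identifier out of the predictors
fit <- rpart(renewed ~ . - id, data = toy, method = "class")

# type = "prob" returns one column per class; "No" is the churn probability
churn_prob <- predict(fit, newdata = toy, type = "prob")[, "No"]
scored <- data.frame(id = toy$id, churn_prob = churn_prob)

# The ten customers most likely to churn, for a targeted retention offer
targets <- head(scored[order(-scored$churn_prob), ], 10)
targets
```

The cut-off of ten customers is arbitrary; in practice it would depend on the campaign budget.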
Question 2 – Segmenting Consumers Based on Energy Drink Preference
Market research is carried out to gauge customer preferences for five versions of an energy drink that contain different concentrations of a flavoring ingredient. The products D1, D2, D3, D4 and D5 have concentrations of the flavoring ingredient at 0.02%, 0.03%, 0.04%, 0.05% and 0.06% respectively. Customer rating data are gathered at gyms. Each participant tastes each of the five energy drinks, rating them on a 1 to 9 Likert Scale. The participants also provide their gender and age group.
Cluster analysis of this type of data can give valuable marketing information because once the cluster analysis is completed, the rating of each product by the different clusters can be characterized along with the demographic information.
The data is contained in the file called energy_drinks.csv. Your task is to cluster the 840 participants on the variables D1-D5, addressing each of the following steps:
1. Import the energy_drinks.csv file into R.
The first step is to import the data file we will work with into R.
energy_drinks <- read.csv("energy_drinks.csv")
energy_drinks$Gender <- as.factor(energy_drinks$Gender)
energy_drinks$Age <- as.factor(energy_drinks$Age)

# Display the first few rows of the dataset
head(energy_drinks)
Yes, the data needs to be scaled before computing the distance matrix. This is because the clustering process relies on the Euclidean distance, which is sensitive to differences in scale among variables.
For example: The variables D1 to D5 represent customer ratings, which may range from 1 to 9. If another variable (e.g. age) had a much larger scale (e.g. 18–60), it would disproportionately influence the distance calculation, overshadowing the smaller-scaled variables.
Scaling (e.g. standardization) ensures that all variables contribute equally to the distance matrix by bringing them to a common scale (mean = 0, standard deviation = 1). This avoids biased clustering results.
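The code that creates dist_matrix is not shown in this report. A minimal sketch of the scaling and distance step, assuming a small synthetic ratings table in place of the real file (toy_ratings and scaled_ratings are hypothetical names). With the real data, the D1 to D5 columns of energy_drinks would be used.

```r
# Synthetic stand-in for the D1-D5 ratings (made-up values on the 1-9 scale)
set.seed(7)
toy_ratings <- as.data.frame(
  matrix(sample(1:9, 20 * 5, replace = TRUE), ncol = 5,
         dimnames = list(NULL, paste0("D", 1:5)))
)

# Standardize each column to mean 0 and standard deviation 1
scaled_ratings <- scale(toy_ratings)

# Euclidean distances between every pair of participants
dist_matrix <- dist(scaled_ratings, method = "euclidean")

round(colMeans(scaled_ratings), 10)  # all 0 after scaling
apply(scaled_ratings, 2, sd)         # all 1 after scaling
```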
3. Carry out a hierarchical clustering using the hclust function. Use method = “average”.
hclust_model <- hclust(dist_matrix, method = "average")
plot(hclust_model, main = "Dendrogram for Hierarchical Clustering")
4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap.
heatmap(as.matrix(dist_matrix), main ="Heatmap of Consumer Preferences")
a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.
Yes, the heatmap provides evidence of clustering structure within the energy drinks dataset. Because it is drawn from the distance matrix, both its rows and its columns are participants, and each cell shows how far apart two participants' ratings are. Blocks of similar color along the rows and columns indicate subsets of participants who share similar preferences for the energy drinks.
Key Observations:
Groupings of rows with similar patterns suggest participants can be segmented based on their ratings.
Strong differences in color intensity between clusters imply well-separated groups, while gradual transitions indicate overlapping clusters.
The clustering structure in the heatmap supports the segmentation analysis and helps validate the results of hierarchical clustering.
5. Create a 3-cluster solution using the cutree function and assess the quality of this solution.
clusters <- cutree(hclust_model, k = 3)
energy_drinks$cluster <- as.factor(clusters)
A 3-cluster solution is of good quality when the clusters show strong within-group similarity, clear separation between groups and meaningful demographic differences. When these conditions hold, the solution effectively segments the consumers based on their preferences for the energy drinks.
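Within-group similarity and between-group separation can be summarised with silhouette widths. The sketch below is one possible check, not the method used in this report. It assumes the cluster package (a recommended package shipped with most R installations) and runs on synthetic data built from three made-up taste profiles; the names centers, toy, d, hc and sil are hypothetical.

```r
library(cluster)

# Synthetic stand-in: 60 participants drawn around three made-up profiles
# (continuous scores, so they are not restricted to whole numbers 1-9)
set.seed(7)
centers <- rbind(c(8, 7, 5, 3, 2),
                 c(3, 5, 8, 5, 3),
                 c(2, 3, 5, 7, 8))
toy <- centers[rep(1:3, each = 20), ] + matrix(rnorm(60 * 5, sd = 0.7), ncol = 5)
colnames(toy) <- paste0("D", 1:5)

# Same pipeline as in the report: scale, distance, average linkage, cut at k = 3
d <- dist(scale(toy))
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 3)

# Cluster sizes: very small clusters can indicate outliers rather than segments
table(clusters)

# Silhouette width per participant; values near 1 mean well-separated clusters
sil <- silhouette(clusters, d)
avg_sil <- mean(sil[, "sil_width"])
avg_sil
```

An average silhouette width close to 0 would suggest overlapping clusters, and negative widths point to participants that may sit in the wrong cluster.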
6. Profile the clusters, making sure to include answers to the questions below. Include any graphs/tables necessary to support your profiling.
The (cluster_profile) code summarizes how the clusters differ based on:
Average ratings for each version of the energy drinks (D1 to D5).
Demographic distributions of gender and age within each cluster.
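The cluster_profile code itself does not appear in this report. A hypothetical reconstruction in base R is sketched below, using synthetic data with made-up ratings, made-up age-group labels and randomly assigned clusters. The names toy, gender_mix and age_mix are assumptions. With the real data, energy_drinks would replace toy.

```r
# Synthetic stand-in for energy_drinks after clustering (made-up values;
# the age-group labels are hypothetical)
set.seed(3)
n <- 60
toy <- data.frame(
  D1 = sample(1:9, n, replace = TRUE),
  D2 = sample(1:9, n, replace = TRUE),
  D3 = sample(1:9, n, replace = TRUE),
  D4 = sample(1:9, n, replace = TRUE),
  D5 = sample(1:9, n, replace = TRUE),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Age = factor(sample(c("18-25", "26-35", "36+"), n, replace = TRUE)),
  cluster = factor(sample(1:3, n, replace = TRUE))
)

# Mean rating of each drink within each cluster
cluster_profile <- aggregate(cbind(D1, D2, D3, D4, D5) ~ cluster,
                             data = toy, FUN = mean)
cluster_profile

# Share of each gender and age group within each cluster (rows sum to 1)
gender_mix <- prop.table(table(toy$cluster, toy$Gender), margin = 1)
age_mix <- prop.table(table(toy$cluster, toy$Age), margin = 1)
round(gender_mix, 2)
round(age_mix, 2)
```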
a. How do the clusters differ on their average rating of each version of the energy drinks?
The average ratings for each energy drink (D1, D2, D3, D4, D5) are calculated per cluster. These values help us see which energy drink is more preferred by each cluster.
For example:
Cluster 1 may have higher ratings for D1 and D5, indicating a preference for those products.
Cluster 2 might prefer D3, rating it higher than the other drinks.
Cluster 3 could show more even or lower ratings across all the drinks.
This profiling shows which drinks are most popular in each cluster.
b. How do the clusters differ on age and gender?
Gender distribution: It tells us if a certain cluster has a higher proportion of one gender over the other (e.g. more males in Cluster 1).
Age distribution: It shows the age groups in each cluster (e.g. more young adults in Cluster 2).
This profiling enables insights into how consumer preferences for energy drinks vary by age and gender, which could be valuable for targeting specific consumer segments.
7. Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.
Based on the cluster profiles, the company is advised as follows:
Advertise D1 to Cluster 1: Cluster 1 shows a clear preference for D1, with the highest average ratings compared to other clusters. This group is the ideal audience for campaigns promoting D1. Focus on their demographic profile (e.g. age group and gender) for targeted advertisements.
Advertise D3 to Cluster 2: Cluster 2 gives D3 the highest average rating among all clusters. Marketing efforts for D3 should prioritize this segment, emphasizing the attributes that resonate with this group.
Advertise D5 to Cluster 1: Like D1, Cluster 1 also shows strong preferences for D5, making them the ideal target for D5 advertisements. Highlighting the similarities or complementary nature of D1 and D5 could further enhance engagement with this cluster.
8. If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?
If the company had to choose just one version of the energy drink to continue producing, I would recommend D3, for the following key reasons.
Popularity Across Clusters: D3 has consistently received higher average ratings, particularly from Cluster 2, indicating strong appeal to a significant consumer segment. While other drinks might have high ratings in specific clusters, D3 appears to be the more preferred choice overall.
Broader Appeal: Its higher rating suggests that D3 caters to a wider audience compared to other drinks, making it the most versatile product in terms of marketability.
Market Differentiation: The flavor concentration in D3 (0.04%) sits midway between the mildest version (D1, 0.02%) and the most intense (D5, 0.06%), so it appeals to moderate tastes.
So, by focusing on D3, the company can optimize its resources on a product that has the potential to satisfy the largest segment of the market, ensuring profitability and brand loyalty.