Data Analytics Assessment

Author

Bazil Zafar

Question 1 - Predicting Customers Who Will Renew Their Music Subscription

A music downloading company wants to identify what drives certain customers not to renew their downloading subscription. Furthermore, they want to assign a probability that their customers will not renew their subscription at the end of their subscription period.

The music downloading company collected data about its customers’ subscription base for the last four years. The company considers that a customer has churned when his/her subscription is not renewed within one week after the expiry date. The independent variables were measured over the 36-month period prior to the date on which the customer either renewed or churned. The independent variables contain information about interactions between the customers and the company, socio-demographic information and subscription-describing information.

The following variables are saved in the sub_training.csv (containing 850 customers) and sub_testing.csv (containing 150 customers) datasets:
Id: customer identifier.
Renewed: variable indicating whether a customer renewed his/her subscription.
Num_contacts: the number of times a customer was in contact with the music downloading service.
Contact_recency: the elapsed time since last contact.
Num_complaints: the number of complaints made by the customer.
Spend: the amount of money spent during the last 36 months with the company.
Lor: the length of relationship expressed in days.
Gender: variable indicating the customer’s gender.
Age: the age of the customer.

1. Import the sub_training.csv and sub_testing.csv datasets into R.

The first step is to import the data files we are going to work on into R.

library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)

# Import datasets
training_data <- read.csv("sub_training.csv")
testing_data <- read.csv("sub_testing.csv")

# Convert Renewed to a factor
training_data$renewed <- as.factor(training_data$renewed)
testing_data$renewed <- as.factor(testing_data$renewed)

# Display the first few rows of the training dataset
head(training_data)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68

2. Create and visualise a classification tree model that will allow you to predict if a customer will re-subscribe or churn.

# Build the classification tree
tree_model <- rpart(renewed ~ ., data = training_data, method = "class")

# Visualize the classification tree
rpart.plot(
  tree_model,
  type = 4,                 # Detailed box information
  extra = 104,              # Class probabilities and percentages
  under = TRUE,             # Display node info below boxes
  box.palette = "RdYlGn",   # Color-coded nodes for readability
  main = "Classification Tree for Subscription Renewal"
)

3. Interpret the classification tree:

tree_model$frame

     var   n  wt dev yval complexity ncompete nsurrogate    yval2.V1
1     id 850 850 424    1       1.00        4          5   1.0000000
2 <leaf> 426 426   0    1       0.01        0          0   1.0000000
3 <leaf> 424 424   0    2       0.01        0          0   2.0000000
     yval2.V2    yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
1 426.0000000 424.0000000   0.5011765   0.4988235      1.0000000
2 426.0000000   0.0000000   1.0000000   0.0000000      0.5011765
3   0.0000000 424.0000000   0.0000000   1.0000000      0.4988235

a. Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.

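The fitted tree contains a single split, and that split is on id (see the var column of tree_model$frame above). The rule for re-subscription is therefore: a customer whose id falls on the "Yes" side of the id threshold shown in the plotted tree (node 3, 424 training customers) is predicted to renew.

The purity of a node refers to how homogeneous the data within the node is in terms of the target variable. Node 3 is perfectly pure: all 424 customers in it renewed (dev = 0, class probability 1.0). However, id is an arbitrary identifier, so this rule most likely reflects how identifiers were assigned rather than customer behaviour. A rule based on variables such as num_contacts can only be stated after refitting the tree without id.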

b. Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.

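The rule for churn is the other branch of the same split: a customer whose id falls on the "No" side of the id threshold (node 2, 426 training customers) is predicted to churn.

This node is also perfectly pure: all 426 customers in it did not renew (dev = 0, class probability 1.0). As with the renewal rule, the purity comes from the identifier rather than from a behavioural variable such as contact_recency, so it should not be read as evidence about what drives churn.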
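The purity of each leaf can be read directly from the frame of a fitted rpart object. The sketch below uses a small simulated data set (the column values and the relationship between contact_recency and renewed are assumptions made for illustration, not the assessment data) and relies on the fact that, for an unweighted classification tree, dev is the number of misclassified cases in a node.

```r
library(rpart)

# Hypothetical stand-in data; the real sub_training.csv is not reproduced here
set.seed(1)
n <- 200
toy <- data.frame(
  contact_recency = sample(1:60, n, replace = TRUE),
  spend = round(runif(n, 0, 500))
)
# Assumed relationship, for illustration only
toy$renewed <- factor(ifelse(toy$contact_recency < 30, "Yes", "No"))

fit <- rpart(renewed ~ ., data = toy, method = "class")

# Keep only the terminal nodes (leaves) of the fitted tree
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

# dev counts the misclassified cases in a node, so
# purity = share of cases that belong to the predicted class
leaf_purity <- data.frame(
  node = rownames(leaves),
  n = leaves$n,
  predicted = levels(toy$renewed)[leaves$yval],
  purity = 1 - leaves$dev / leaves$n
)
print(leaf_purity)

# Print the sequence of split conditions that leads to each leaf
path.rpart(fit, nodes = as.numeric(rownames(leaves)))
```

A purity of 1 means every training case in the leaf has the predicted class; values closer to 0.5 indicate a mixed node in a two-class problem.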

c. Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.

# Calculate and display variable importance
importance <- tree_model$variable.importance
importance <- as.data.frame(importance)
importance

                importance
id               424.99765
lor               82.19294
age               68.16000
spend             49.11529
contact_recency   39.09176
gender            39.09176

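The importance table is dominated by id (424.998), far ahead of lor (82.19), age (68.16), spend (49.12), contact_recency (39.09) and gender (39.09). num_contacts and num_complaints do not appear at all. Because id is the only variable the tree actually splits on, the scores of the other variables come from surrogate splits at the root (nsurrogate = 5) rather than from primary splits. An identifier cannot be a genuine driver of renewal, so its importance points to an artefact in how ids were assigned. Among the remaining variables, lor, age and spend rank highest, but this ranking should be confirmed by refitting the tree without id.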
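Since id appears at the top of the importance table, one option is to refit the tree with the identifier excluded from the formula. The following is a minimal sketch on simulated data; the data frame toy, its values and the assumed link between lor and renewed are hypothetical and only illustrate the formula syntax.

```r
library(rpart)

# Hypothetical stand-in data with the same kind of columns as the assessment files
set.seed(42)
n <- 300
toy <- data.frame(
  id = seq_len(n),
  num_contacts = rpois(n, 1),
  contact_recency = sample(1:60, n, replace = TRUE),
  spend = round(runif(n, 0, 500)),
  lor = sample(10:400, n, replace = TRUE)
)
# Assumed relationship, for illustration only: longer relationships renew more often
toy$renewed <- factor(ifelse(toy$lor + rnorm(n, sd = 40) > 200, "Yes", "No"))

# "- id" removes the identifier from the set of predictors
fit_no_id <- rpart(renewed ~ . - id, data = toy, method = "class")

# The identifier should no longer appear among the splits or the importances
print(unique(as.character(fit_no_id$frame$var)))
print(fit_no_id$variable.importance)
```

Applied to the real training file, the same formula would force the tree to split on behavioural and demographic variables only.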

4. Fully assess the accuracy of the classification tree using both the training and the testing datasets.

a. Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.

Training Dataset

train_pred <- predict(tree_model, training_data, type = "class")
confusionMatrix(train_pred, training_data$renewed)

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  426   0
       Yes   0 424
                                     
               Accuracy : 1          
                 95% CI : (0.9957, 1)
    No Information Rate : 0.5012     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.5012     
         Detection Rate : 0.5012     
   Detection Prevalence : 0.5012     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : No

Testing Dataset

test_pred <- predict(tree_model, testing_data, type = "class")
confusionMatrix(test_pred, testing_data$renewed)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  74   0
       Yes  0  76
                                     
               Accuracy : 1          
                 95% CI : (0.9757, 1)
    No Information Rate : 0.5067     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.4933     
         Detection Rate : 0.4933     
   Detection Prevalence : 0.4933     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : No

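Overfitting is detected when accuracy on the training dataset is substantially higher than on the testing dataset. That gap does not appear here: accuracy is 100% on both the training set (850 of 850 correct) and the testing set (150 of 150 correct). The perfect scores come from the single split on id, which separates the two classes exactly in both files. The result therefore reflects the identifier leaking the outcome rather than a model that has learned renewal behaviour, and the expected overfitting pattern would only become visible after refitting without id.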
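The train/test comparison can be reduced to a single number, the accuracy gap. The sketch below is an illustration on simulated noisy data (make_data, the variables x1 to x3 and the control settings are assumptions, not part of the assessment) in which a deliberately unrestricted tree memorises the training rows.

```r
library(rpart)

# Hypothetical noisy data: only x1 carries signal, x2 and x3 are pure noise
set.seed(123)
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  d$renewed <- factor(ifelse(d$x1 + rnorm(n) > 0, "Yes", "No"),
                      levels = c("No", "Yes"))
  d
}
train <- make_data(400)
test <- make_data(200)

# Deliberately unrestricted tree so that it can memorise the training rows
deep_tree <- rpart(
  renewed ~ ., data = train, method = "class",
  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1, xval = 0)
)

# Share of correct class predictions on a given data set
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$renewed)
}

train_acc <- accuracy(deep_tree, train)
test_acc <- accuracy(deep_tree, test)
cat("Training accuracy:", round(train_acc, 3), "\n")
cat("Testing accuracy: ", round(test_acc, 3), "\n")
cat("Gap:              ", round(train_acc - test_acc, 3), "\n")
```

A large positive gap is the usual signature of overfitting; a gap near zero, as in the confusion matrices above, gives no such signal.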

b. Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.

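# Limit the tree to a maximum depth of 3
# (prune() with cp = 0.01 only repeats the default cp and does not cap the depth)
pruned_tree <- rpart(renewed ~ ., data = training_data, method = "class",
                     control = rpart.control(maxdepth = 3))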
rpart.plot(
  pruned_tree,
  type = 4,
  extra = 104,
  under = TRUE,
  box.palette = "RdYlGn",
  main = "Pruned Classification Tree"
)
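In rpart, a depth cap is set through rpart.control(maxdepth = ...), and the depth actually reached can be checked from the node numbers stored in the frame. The sketch below uses simulated data (toy and its columns are hypothetical) and relies on rpart numbering nodes like a binary heap, with the root as node 1 at depth 0.

```r
library(rpart)

# Hypothetical data with two informative predictors and one noise predictor
set.seed(7)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$renewed <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "Yes", "No"))

# Grow a tree that is not allowed to go deeper than 3 levels below the root
shallow_tree <- rpart(
  renewed ~ ., data = toy, method = "class",
  control = rpart.control(maxdepth = 3)
)

# Children of node k are numbered 2k and 2k + 1,
# so the depth of a node is floor(log2(node number))
node_ids <- as.numeric(rownames(shallow_tree$frame))
tree_depth <- max(floor(log2(node_ids)))
cat("Depth of fitted tree:", tree_depth, "\n")
```

The same depth check can be applied to any fitted rpart object to confirm that the requested limit was respected.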

c. Fully assess the accuracy of the pruned tree on the training and testing datasets.

train_pruned_pred <- predict(pruned_tree, training_data, type = "class")
test_pruned_pred <- predict(pruned_tree, testing_data, type = "class")

confusionMatrix(train_pruned_pred, training_data$renewed)

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  426   0
       Yes   0 424
                                     
               Accuracy : 1          
                 95% CI : (0.9957, 1)
    No Information Rate : 0.5012     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.5012     
         Detection Rate : 0.5012     
   Detection Prevalence : 0.5012     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : No

confusionMatrix(test_pruned_pred, testing_data$renewed)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  74   0
       Yes  0  76
                                     
               Accuracy : 1          
                 95% CI : (0.9757, 1)
    No Information Rate : 0.5067     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.4933     
         Detection Rate : 0.4933     
   Detection Prevalence : 0.4933     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : No

d. Has pruning the classification tree resulted in less overfitting? Explain your answer.

In general, pruning a classification tree (e.g. setting a max depth of 3) results in less overfitting, because it simplifies the model, keeps only the most significant splits and ignores less reliable patterns that may be specific to the training data.

In this analysis, however, pruning made no difference. The original tree already consists of a single split on id (depth 1), so the pruned tree is identical to it, and the confusion matrices are the same: 100% accuracy on both the training and the testing datasets before and after pruning. There was no gap between training and testing accuracy to reduce, so pruning cannot be said to have reduced overfitting here. A meaningful comparison would require refitting both trees without id.

5. Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?

The following suggestions could help the company improve its renewal rate. They draw on the available customer variables and should be re-checked against a tree refitted without id.

Increasing the frequency of customer interactions (e.g. proactive contact).
Prioritizing retention efforts for customers who haven’t interacted recently.
Offering personalized incentives based on spending patterns.

The propensity model predicts the likelihood of a customer renewing or churning, which enables targeted marketing strategies such as:

Focusing more on customers who are identified as likely to churn by offering personalized promotions or loyalty rewards.
Prioritizing outreach efforts, such as sending exclusive offers, towards customers with high spending or long relationships.
Allocating resources effectively by targeting the most vulnerable or valuable customer segments based on the model predictions.
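To use a tree as a propensity model, class probabilities rather than class labels are needed. The sketch below shows one way to obtain and rank them with predict(..., type = "prob"); the customers data frame, its values and the assumed churn mechanism are hypothetical.

```r
library(rpart)

# Hypothetical customer data
set.seed(2024)
n <- 300
customers <- data.frame(
  id = seq_len(n),
  contact_recency = sample(1:60, n, replace = TRUE),
  spend = round(runif(n, 0, 500))
)
# Assumed for illustration: long gaps since the last contact make churn more likely
p_churn <- plogis((customers$contact_recency - 30) / 10)
customers$renewed <- factor(ifelse(runif(n) < p_churn, "No", "Yes"),
                            levels = c("No", "Yes"))

# The identifier is kept for reporting but left out of the predictors
fit <- rpart(renewed ~ contact_recency + spend, data = customers, method = "class")

# type = "prob" returns one column of class probabilities per level of renewed
probs <- predict(fit, customers, type = "prob")
customers$churn_prob <- probs[, "No"]

# Rank customers from most to least likely to churn and keep the top 10
ranked <- customers[order(-customers$churn_prob), c("id", "churn_prob")]
top_targets <- head(ranked, 10)
print(top_targets)
```

In practice the scores would be computed for customers whose subscriptions are about to expire, and the ranked list would decide who receives a retention offer.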

Question 2 – Segmenting Consumers Based on Energy Drink Preference

Market research is carried out to gauge customer preferences for five versions of an energy drink that contain different concentrations of a flavoring ingredient. The products D1, D2, D3, D4 and D5 have concentrations of the flavoring ingredient at 0.02%, 0.03%, 0.04%, 0.05% and 0.06% respectively. Customer rating data are gathered at gyms. Each participant tastes each of the five energy drinks, rating them on a 1 to 9 Likert Scale. The participants also provide their gender and age group.

Cluster analysis of this type of data can give valuable marketing information because once the cluster analysis is completed, the rating of each product by the different clusters can be characterized along with the demographic information.

The data is contained in the file called energy_drinks.csv. Your task is to cluster the 840 participants on the variables D1-D5, addressing each of the following steps:

1. Import the energy_drinks.csv file into R.

The first step is to import the data file we are going to work on into R.

energy_drinks <- read.csv("energy_drinks.csv")
energy_drinks$Gender <- as.factor(energy_drinks$Gender)
energy_drinks$Age <- as.factor(energy_drinks$Age)

# Display the first few rows of the dataset
head(energy_drinks)

    ID D1 D2 D3 D4 D5 Gender      Age
1 ID_1  2  3  7  7  7   Male Under_25
2 ID_2  4  4  5  6  9   Male Under_25
3 ID_3  2  3  8  8  5 Female Under_25
4 ID_4  1  6  5  8  6 Female Under_25
5 ID_5  1  3  7  7  7   Male Under_25
6 ID_6  2  3  8  7  5   Male Under_25

2. Create a distance matrix containing the Euclidean distance between all pairs of consumers.

a. Does the data need to be scaled before computing the distance matrix? Explain your answer.

scaled_data <- scale(energy_drinks[, 2:6])  # Scale D1-D5
dist_matrix <- dist(scaled_data, method = "euclidean")

Scaling is needed whenever the clustering variables are measured on different scales, because the Euclidean distance is sensitive to differences in scale among variables.

For example, if a variable with a much larger range (e.g. age in years, 18 to 60) were included alongside the ratings, it would disproportionately influence the distance calculation and overshadow the smaller-scaled variables. Here only D1 to D5 are used, and all five are ratings on the same 1 to 9 scale, so scaling is a precaution rather than a strict requirement.

The data were standardized anyway. Standardization brings every variable to a common scale (mean = 0, standard deviation = 1), so each drink contributes equally to the distance matrix even if some drinks have more spread in their ratings than others.
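To see what standardization does to the ratings and to the resulting distances, the following sketch compares the raw and scaled versions of a small, hypothetical set of ratings (the four rows are made up for illustration).

```r
# Hypothetical ratings for four participants on D1-D5 (1 to 9 scale)
ratings <- data.frame(
  D1 = c(2, 4, 2, 7),
  D2 = c(3, 4, 3, 5),
  D3 = c(7, 5, 8, 3),
  D4 = c(7, 6, 8, 3),
  D5 = c(7, 9, 5, 2)
)

# Spread of each variable before scaling
print(sapply(ratings, sd))

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(ratings)
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# Compare pairwise Euclidean distances before and after scaling
print(round(dist(ratings), 2))
print(round(dist(scaled), 2))
```

When the column standard deviations are already similar, the two distance matrices rank the pairs of participants in much the same way, which is why scaling matters less for variables that share a rating scale.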

3. Carry out a hierarchical clustering using the hclust function. Use method = “average”.

hclust_model <- hclust(dist_matrix, method = "average")
plot(hclust_model, main = "Dendrogram for Hierarchical Clustering")
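One way to judge how faithfully a dendrogram represents the original distances is the cophenetic correlation. The sketch below computes it for average linkage on simulated ratings; the two groups and their mean ratings are assumptions made for illustration.

```r
# Hypothetical ratings: two groups of 10 participants with opposite preferences
set.seed(10)
group_a <- matrix(rnorm(50, mean = rep(c(2, 3, 5, 7, 8), each = 10), sd = 0.7),
                  ncol = 5)
group_b <- matrix(rnorm(50, mean = rep(c(8, 7, 5, 3, 2), each = 10), sd = 0.7),
                  ncol = 5)
ratings <- rbind(group_a, group_b)
colnames(ratings) <- paste0("D", 1:5)

# Same pipeline as above: scale, Euclidean distance, average linkage
d <- dist(scale(ratings), method = "euclidean")
hc <- hclust(d, method = "average")

# Cophenetic distance: the height at which two observations are first merged
coph <- cophenetic(hc)

# A correlation close to 1 means the dendrogram preserves the original distances well
coph_cor <- cor(d, coph)
print(coph_cor)
```

The same two lines, cophenetic() followed by cor(), could be applied to hclust_model and dist_matrix to compare linkage methods on the real data.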

4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap.

heatmap(as.matrix(dist_matrix), main = "Heatmap of Consumer Preferences")

a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.

Yes, the heatmap provides evidence of clustering structure within the energy drinks dataset. Because it is drawn from the distance matrix, both its rows and its columns are participants, and each cell shows the distance between a pair of participants. It reveals blocks of similar colors, which indicates that subsets of participants are close to one another, i.e. they share similar preferences for the energy drinks.

Key Observations:

Groupings of rows with similar patterns suggest participants can be segmented based on their ratings.
Strong differences in color intensity between clusters imply well-separated groups, while gradual transitions indicate overlapping clusters.

The clustering structure in the heatmap supports the segmentation analysis and helps validate the results of hierarchical clustering.

5. Create a 3-cluster solution using the cutree function and assess the quality of this solution.

clusters <- cutree(hclust_model, k = 3)
energy_drinks$cluster <- as.factor(clusters)

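A cluster solution is of good quality when the clusters show strong within-group similarity and clear separation between groups. The 3-cluster solution contains 441, 167 and 232 participants, so no cluster is trivially small, and the clusters have clearly different mean rating profiles (see the profile table in step 6). On that basis the solution segments the consumers in a useful way, although a formal measure such as the average silhouette width was not computed here.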
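A common way to quantify cluster quality is the average silhouette width, available through the silhouette function in the cluster package. The sketch below applies it to simulated ratings; the three group centres and the noise level are assumptions made for illustration.

```r
library(cluster)

# Hypothetical ratings: three groups of 40 participants with different preferences
set.seed(99)
centres <- rbind(
  c(3, 5, 6, 7, 7),
  c(2, 5, 7, 5, 3),
  c(7, 5, 3, 3, 3)
)
ratings <- do.call(rbind, lapply(1:3, function(k) {
  matrix(rnorm(40 * 5, mean = rep(centres[k, ], each = 40), sd = 0.8), ncol = 5)
}))

# Same pipeline as above: scale, Euclidean distance, average linkage, cut at k = 3
d <- dist(scale(ratings))
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 3)

# Silhouette width per participant: near 1 = well placed, near 0 = between clusters,
# negative = probably assigned to the wrong cluster
sil <- silhouette(clusters, d)
avg_width <- mean(sil[, "sil_width"])

print(table(clusters))
print(round(avg_width, 3))
```

Repeating the calculation for several values of k gives a simple check on whether three clusters is a reasonable choice.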

6. Profile the clusters, making sure to include answers to the questions below. Include any graphs/tables necessary to support your profiling.

cluster_profile <- energy_drinks %>%
  group_by(cluster) %>%
  summarise(
    avg_D1 = mean(D1),
    avg_D2 = mean(D2),
    avg_D3 = mean(D3),
    avg_D4 = mean(D4),
    avg_D5 = mean(D5),
    gender_dist = paste(table(Gender), collapse = ", "),
    age_dist = paste(table(Age), collapse = ", ")
  )
print(cluster_profile)

# A tibble: 3 × 8
  cluster avg_D1 avg_D2 avg_D3 avg_D4 avg_D5 gender_dist age_dist           
  <fct>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>       <chr>              
1 1         2.95   4.81   6.27   6.65   6.60 162, 279    192, 90, 52, 20, 87
2 2         2.51   4.61   7.32   5.09   2.72 81, 86      58, 56, 18, 7, 28  
3 3         6.64   5.08   3.44   3.16   2.96 96, 136     88, 65, 26, 12, 41

The cluster_profile table summarizes how the clusters differ based on:

Average ratings for each version of the energy drinks (D1 to D5).
Demographic distributions of gender and age within each cluster.

a. How do the clusters differ on their average rating of each version of the energy drinks?

The average ratings for each energy drink (D1, D2, D3, D4, D5) are calculated per cluster. These values help us see which energy drink is more preferred by each cluster.

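From the profile table:

Cluster 1 rates D1 low (2.95) and the higher-concentration drinks D3, D4 and D5 high (6.27, 6.65 and 6.60).
Cluster 2 rates D3 highest (7.32), D4 moderately (5.09), and both D1 (2.51) and D5 (2.72) low.
Cluster 3 rates D1 highest (6.64), with ratings falling steadily as the concentration rises (2.96 for D5).

D2 is rated similarly by all three clusters (4.61 to 5.08), so it does not discriminate between them. This profiling shows which drinks are most popular in each cluster.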

b. How do the clusters differ on age and gender?

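Gender distribution: assuming the counts in gender_dist follow the alphabetical factor order (Female, Male), males are the majority in every cluster, most clearly in Cluster 1 (279 of 441, about 63%) and Cluster 3 (136 of 232, about 59%). Cluster 2 is almost evenly split (86 of 167, about 51% male).
Age distribution: age_dist lists the counts for the five age groups in factor-level order. The first group is the largest in every cluster, but it dominates much less in Cluster 2 (58 against 56 for the second group) than in Cluster 1 (192 against 90).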

This profiling enables insights into how consumer preferences for energy drinks vary by age and gender, which could be valuable for targeting specific consumer segments.
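Because the clusters differ in size, demographic counts are easier to compare as within-cluster proportions. The sketch below shows this with prop.table on a small hypothetical data set (the 12 rows are made up for illustration); the same call works for Age.

```r
# Hypothetical cluster assignments and genders for 12 participants
profile <- data.frame(
  cluster = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)),
  Gender = factor(c("Male", "Male", "Female", "Male", "Female",
                    "Female", "Male", "Female",
                    "Male", "Male", "Male", "Female"))
)

# Counts per cluster (rows) and gender (columns)
gender_counts <- table(profile$cluster, profile$Gender)

# margin = 1 divides each row by its total, giving within-cluster proportions
gender_share <- prop.table(gender_counts, margin = 1)

print(gender_counts)
print(round(gender_share, 2))
```

Each row of gender_share sums to 1, so a cluster with 60% males can be compared directly with one that has 50%, regardless of how many participants each contains.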

7. Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.

Based on the cluster profiles, the company is advised as follows:

Advertise D1 to Cluster 3: Cluster 3 gives D1 by far the highest average rating (6.64, against 2.95 and 2.51 in Clusters 1 and 2). This group is the ideal audience for campaigns promoting D1. Focus on their demographic profile (e.g. age group and gender) for targeted advertisements.
Advertise D3 to Cluster 2: Cluster 2 gives D3 the highest average rating among all clusters (7.32). Marketing efforts for D3 should prioritize this segment, emphasizing the attributes that resonate with this group.
Advertise D5 to Cluster 1: Cluster 1 is the only cluster that rates D5 highly (6.60, against 2.72 and 2.96 in Clusters 2 and 3), making it the ideal target for D5 advertisements. Because Cluster 1 also rates D4 highly (6.65), D4 and D5 could be promoted together to this cluster.
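As a check on the matching of drinks to clusters, the cluster means reported in the profile table can be compared directly. The sketch below copies those means into a matrix and picks, for each drink, the cluster with the highest average rating.

```r
# Cluster means copied from the cluster_profile output (rows = clusters 1 to 3)
cluster_means <- rbind(
  c(2.95, 4.81, 6.27, 6.65, 6.60),
  c(2.51, 4.61, 7.32, 5.09, 2.72),
  c(6.64, 5.08, 3.44, 3.16, 2.96)
)
dimnames(cluster_means) <- list(cluster = 1:3, drink = paste0("D", 1:5))

# For each drink (column), find the cluster (row) with the highest mean rating
best_cluster <- apply(cluster_means, 2, which.max)
print(best_cluster)
```

For the three drinks in question this gives Cluster 3 for D1, Cluster 2 for D3 and Cluster 1 for D5.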