Assignment 4- Data Analytics and Consumer Insights

Author

Akhil Hobby

Introduction

This assessment covers customer segmentation and predictive analytics. Both are implemented in RStudio to answer the questions and provide meaningful, actionable insights for marketers.

Question 1- Predicting Customers Who Will Renew Their Music Subscription

1) Load Necessary Libraries and Import the data set

# Loading libraries

library(rattle)
library(rpart)
library(tidyverse)

sub1_train <- read_csv("sub_training.csv")
sub1_test <- read_csv("sub_testing.csv")

2) Create and visualise the classification tree model

renew_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub1_train)
fancyRpartPlot(renew_tree)

3) Interpret the tree model

#Extract the variable importance from the rpart object we have called renew_tree
renew_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 

(a) Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.

  • Path 1 (going far down left): If the “lor” is less than 140, the number of times a customer was in contact with the music downloading service (num_contacts) is less than 7.5, and the spend is less than €182, then they are predicted not to re-subscribe.

  • Path 2 (going far down right): If the “lor” is not less than 140 (i.e., 140 or more) and the number of times a customer was in contact with the music downloading service (num_contacts) is greater than 7.5, then they are predicted to re-subscribe.

  • Consider the blue leaf going down far right, the end of path 2 (path 2 is to the right of the tree and path 1 is to the left of the tree). The label “Yes” describes the prediction of re-subscription if they satisfy the rules on this path.

  • To determine how pure the node is, study the leaf labels. In this example, the blue leaf at the far right displays 0.39 and 0.61. This means that 39% of the customers in this leaf did not re-subscribe and 61% did. Thus, a customer who follows path 2 and is placed in this leaf is predicted to re-subscribe with a probability of 61% or 0.61. The node is therefore only moderately pure, since a perfectly pure node would contain customers of a single class.

  • Consider the leaf labels 0.32 and 0.68. This indicates that 32% of the customers did not re-subscribe and 68% of the customers did re-subscribe. Therefore, the re-subscription prediction probability is 68% or 0.68.
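
Node purity can also be expressed as a single number. The following is a minimal sketch, assuming the Gini index (the default splitting criterion for rpart classification trees) as the purity measure; the helper gini_impurity is a hypothetical name introduced here for illustration, and the proportions are the ones read from the two leaves above.

```r
# Gini impurity of a node from its class proportions.
# 0 means perfectly pure; 0.5 is the maximum for two classes.
# gini_impurity is a hypothetical helper, not part of rpart.
gini_impurity <- function(props) {
  stopifnot(abs(sum(props) - 1) < 1e-8)  # proportions must sum to 1
  1 - sum(props^2)
}

leaf_a <- c(No = 0.39, Yes = 0.61)  # far-right "Yes" leaf
leaf_b <- c(No = 0.32, Yes = 0.68)  # second "Yes" leaf discussed above

gini_a <- gini_impurity(leaf_a)
gini_b <- gini_impurity(leaf_b)

print(round(c(leaf_a = gini_a, leaf_b = gini_b), 4))
```

The lower value belongs to the purer leaf, so the 0.32/0.68 leaf is slightly purer than the 0.39/0.61 leaf, although both are far from the ideal value of 0.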

(b) Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.

  • Path 1 (going far down left): If the “lor” is less than 140, the number of times a customer was in contact with the music downloading service (num_contacts) is less than 7.5, and the spend is less than €182, then they are predicted to churn (i.e., not to re-subscribe).

  • Path 2 (going far down right): If the “lor” is not less than 140 (i.e., 140 or more) and the number of times a customer was in contact with the music downloading service (num_contacts) is greater than 7.5, then they are predicted not to churn.

  • Consider the second green leaf going left. The label “No” means that customers who satisfy the rules on this path are predicted not to renew, i.e., to churn.

  • To determine how pure the node is, study the leaf labels. In this example, the second green leaf to the left displays 0.59 and 0.41. This means that 59% of the customers in this leaf did not renew (churned) and 41% renewed. Thus, a customer who ends up in this leaf is predicted to churn with a probability of 59% or 0.59. Similarly, the leaf labels 0.76 and 0.24 indicate that 76% of the customers churned and 24% renewed, so these customers are predicted to churn with a probability of 76%.

(c) Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.

The “variable importance” extracted from the rpart object renew_tree identifies the variables that matter most for predicting a customer’s re-subscription. In order of importance, they are as follows:

  1. The length of relationship expressed in days (lor).
  2. The amount of money spent during the last 36 months with the company (spend).
  3. The age of the customer (age).
  4. The number of times a customer was in contact with the music downloading service (num_contacts).
  5. The elapsed time since last contact (contact_recency).

The top 3 important predictors of re-subscription are “lor”, “spend” and “age”.
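
To make the ranking easier to compare, the importance scores can be rescaled to percentages of the total. This is a small sketch that simply re-enters the values printed by renew_tree$variable.importance above; with the fitted model available, the vector could be taken from the model object directly.

```r
# Importance scores copied from the output of renew_tree$variable.importance
importance <- c(lor = 21.5220728, spend = 10.3707565, age = 9.9887537,
                num_contacts = 3.7116877, contact_recency = 0.5951338)

# Share of the total importance carried by each variable, in percent
importance_pct <- round(100 * importance / sum(importance), 1)
print(importance_pct)

# Names of the three most important predictors
top3 <- names(sort(importance, decreasing = TRUE))[1:3]
print(top3)
```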

4) Check the model accuracy

#Measure accuracy on Training data

renew_probs <- predict(renew_tree, newdata = sub1_train, type = 'prob')
renew_preds <- predict(renew_tree, newdata = sub1_train, type = 'class')

renew_train_updated <- cbind(sub1_train, renew_probs, renew_preds)


train_con_mat <- table(renew_train_updated$renewed, renew_train_updated$renew_preds, dnn=c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270
sum(diag(train_con_mat))/sum(train_con_mat)
[1] 0.62
#Measure accuracy on Testing data

renew_probs <- predict(renew_tree, newdata = sub1_test, type = 'prob')
renew_preds <- predict(renew_tree, newdata = sub1_test, type = 'class')
renew_test_updated <- cbind(sub1_test, renew_probs, renew_preds)

test_con_mat <- table(renew_test_updated$renewed, renew_test_updated$renew_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41
sum(diag(test_con_mat))/sum(test_con_mat)
[1] 0.5133333

(a) Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.

Accuracy of Training data

  1. The overall model accuracy is (527/850) = 0.62 or 62%

  2. Of all the customers the model predicted to renew, (270/439) = 0.62 or 62% actually renewed

  3. Of all the customers the model predicted not to renew, (257/411) = 0.63 or 63% actually did not renew

  4. Of all the customers who did renew, the model correctly identified (270/424) = 0.64 or 64%

  5. Of all the customers who did not renew, the model correctly identified (257/426) = 0.60 or 60%

Accuracy on the testing data must also be assessed, because training accuracy alone cannot show whether the model overfits the training data.

Accuracy of Testing data

  1. The overall model accuracy is (77/150) = 0.51 or 51%

  2. Of all the customers the model predicted to renew, (41/79) = 0.52 or 52% actually renewed

  3. Of all the customers the model predicted not to renew, (36/71) = 0.51 or 51% actually did not renew

  4. Of all the customers who did renew, the model correctly identified (41/76) = 0.54 or 54%

  5. Of all the customers who did not renew, the model correctly identified (36/74) = 0.49 or 49%

There is a substantial difference between the overall accuracy on the training data (62%) and on the testing data (51%), a gap of about 11 percentage points. A testing accuracy of 51% is also close to what guessing would achieve on two nearly balanced classes (74 “No” versus 76 “Yes” in the testing data). It can therefore be concluded that the classification tree overfits the training data and will not generalise well when asked to predict new, unseen data. In this situation, it is suggested to prune the tree.
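
The five accuracy measures and the train/test gap can be reproduced directly from the confusion matrices. The sketch below re-enters the counts printed above as plain matrices; confusion_metrics is a hypothetical helper introduced for illustration, and it assumes the rows are the actual classes and the columns the predicted classes.

```r
# Accuracy measures from a 2x2 confusion matrix
# (rows = actual, columns = predicted, levels "No" and "Yes").
# confusion_metrics is a hypothetical helper written for this sketch.
confusion_metrics <- function(cm) {
  tn <- cm["No", "No"];  fp <- cm["No", "Yes"]
  fn <- cm["Yes", "No"]; tp <- cm["Yes", "Yes"]
  c(accuracy      = (tp + tn) / sum(cm),
    precision_yes = tp / (tp + fp),  # predicted to renew and did renew
    precision_no  = tn / (tn + fn),  # predicted not to renew and did not
    recall_yes    = tp / (tp + fn),  # actual renewers identified
    recall_no     = tn / (tn + fp))  # actual non-renewers identified
}

labels <- list(Actual = c("No", "Yes"), Predicted = c("No", "Yes"))
train_cm <- matrix(c(257, 169, 154, 270), nrow = 2, byrow = TRUE, dimnames = labels)
test_cm  <- matrix(c(36, 38, 35, 41), nrow = 2, byrow = TRUE, dimnames = labels)

train_metrics <- confusion_metrics(train_cm)
test_metrics  <- confusion_metrics(test_cm)
print(round(rbind(train = train_metrics, test = test_metrics), 2))

# A large positive gap between training and testing accuracy signals overfitting
gap <- train_metrics["accuracy"] - test_metrics["accuracy"]
print(round(gap, 3))
```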

(b) Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.

#Create and visualise the pruned classification tree

renew_tree2 <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub1_train, maxdepth = 3)
fancyRpartPlot(renew_tree2)

(c) Fully assess the accuracy of the pruned tree on the training and testing datasets.

#Measure accuracy on Training data

renew_probs2 <- predict(renew_tree2, newdata = sub1_train, type = 'prob')
renew_preds2 <- predict(renew_tree2, newdata = sub1_train, type = 'class')
renew_train_updated2 <- cbind(sub1_train, renew_probs2, renew_preds2)

train_con_mat2 <- table(renew_train_updated2$renewed, renew_train_updated2$renew_preds2, dnn = c('Actual', 'Predicted'))
train_con_mat2
      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252
sum(diag(train_con_mat2))/sum(train_con_mat2)
[1] 0.6094118
#Measure accuracy on Testing data

renew_probs2 <- predict(renew_tree2, newdata = sub1_test, type = 'prob')
renew_preds2 <- predict(renew_tree2, newdata = sub1_test, type = 'class')
renew_test_updated2 <- cbind(sub1_test, renew_probs2, renew_preds2)

test_con_mat2 <- table(renew_test_updated2$renewed, renew_test_updated2$renew_preds2, dnn = c('Actual', 'Predicted'))
test_con_mat2
      Predicted
Actual No Yes
   No  41  33
   Yes 38  38
sum(diag(test_con_mat2))/sum(test_con_mat2)
[1] 0.5266667

Accuracy of the pruned model- Training data

  1. The overall model accuracy is (518/850) = 0.61 or 61%

  2. Of all the customers the model predicted to renew, (252/412) = 0.61 or 61% actually renewed

  3. Of all the customers the model predicted not to renew, (266/438) = 0.61 or 61% actually did not renew

  4. Of all the customers who did renew, the model correctly identified (252/424) = 0.59 or 59%

  5. Of all the customers who did not renew, the model correctly identified (266/426) = 0.62 or 62%

Accuracy of the pruned model- Testing data

  1. The overall model accuracy is (79/150) = 0.53 or 53%

  2. Of all the customers the model predicted to renew, (38/71) = 0.54 or 54% actually renewed

  3. Of all the customers the model predicted not to renew, (41/79) = 0.52 or 52% actually did not renew

  4. Of all the customers who did renew, the model correctly identified (38/76) = 0.50 or 50%

  5. Of all the customers who did not renew, the model correctly identified (41/74) = 0.55 or 55%

(d) Has pruning the classification tree resulted in less overfitting? Explain your answer.

  • Yes, pruning the classification tree has resulted in slightly less overfitting, though the improvement is small. After pruning, the overall accuracy on the training data fell from 62.0% to 60.9%, while the overall accuracy on the testing data rose from 51.3% to 52.7%. The gap between training and testing accuracy therefore narrowed from about 10.7 to about 8.2 percentage points.

  • On the testing data, the share of correct predictions among customers predicted not to renew changed only marginally (36/71, about 51%, before pruning and 41/79, about 52%, after). The share of actual non-renewers correctly identified rose from 36/74 (about 49%) to 41/74 (about 55%).

  • The pruned model thus suffers less from overfitting than the original tree. However, a testing accuracy of roughly 53% is still only slightly better than chance, so its predictions should be used with caution.

  • It is therefore recommended to tune the tree further (for example, by trying other depths or complexity settings) to check whether a better model can be generated that supports actionable predictions.
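
One common way to prune further is cost-complexity pruning, where the complexity parameter (cp) with the lowest cross-validated error is chosen and passed to rpart's prune function. The sketch below is an illustration only: it uses the kyphosis data set that ships with rpart as a stand-in, because the subscription data is not reproduced here, and the seed and control settings are assumptions chosen to grow a deliberately large tree.

```r
library(rpart)

set.seed(42)  # cross-validation inside rpart is random; fix the seed

# Grow a deliberately large tree on the stand-in data set (cp = 0 disables early stopping)
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# cptable holds one row per candidate tree size, with its cross-validated error
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned_tree <- prune(full_tree, cp = best_cp)

# Count terminal nodes to see how much the tree shrank
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat("Leaves before pruning:", n_leaves(full_tree), "\n")
cat("Leaves after pruning: ", n_leaves(pruned_tree), "\n")
```

The same steps could be applied to renew_tree, after which the training and testing accuracy would need to be re-assessed as in part (c).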

5) Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?

  • Customers with a contact frequency greater than 7.5 should be kept engaged through proactive communication. They should be given timely updates on exclusive benefits and service improvements. It is also suggested to collect customer feedback and adjust marketing strategies accordingly.

  • Offer loyalty rewards for the customers with high length of relationship to convert them into long-term customers. Furthermore, provide premium services to maintain their satisfaction.

  • For customers with a lower likelihood of renewal, offer limited-time offers and discounts to encourage re-subscription. It is important to highlight cost-effective benefits in communications.

  • Increase communication via reminders to decrease churn for high-risk customers. However, over-communication is not recommended.

  • The propensity model identifies which customers are likely to renew and which are likely to churn. High-propensity customers can be targeted with upselling campaigns, while low-propensity customers can be offered discounts. Using the predicted probabilities in this way lets marketing spend be targeted more precisely and cost-effectively.
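
The targeting idea can be sketched in a few lines. Everything in this example is hypothetical: the customer IDs, the probabilities (which stand in for the “Yes” column returned by predict with type = 'prob'), the cut-offs of 0.4 and 0.6, and the campaign names. In practice, the cut-offs would need to be chosen from the business context.

```r
# Hypothetical scored customers; prob_renew stands in for the predicted
# probability of renewal produced by the propensity model
scored <- data.frame(customer_id = 1:6,
                     prob_renew  = c(0.76, 0.61, 0.68, 0.41, 0.24, 0.39))

# Assumed cut-offs: up to 0.4 = low, above 0.4 to 0.6 = medium, above 0.6 = high
scored$segment <- cut(scored$prob_renew,
                      breaks = c(0, 0.4, 0.6, 1),
                      labels = c("low", "medium", "high"),
                      include.lowest = TRUE)

# Map each propensity segment to an illustrative campaign
campaign <- c(low = "discount offer", medium = "renewal reminder", high = "upsell")
scored$campaign <- unname(campaign[as.character(scored$segment)])

# Show the customers from highest to lowest renewal propensity
print(scored[order(-scored$prob_renew), ])
```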

Question 2- Segmenting Consumers Based on Energy Drink Preference

1) Load the Necessary Libraries and Import the dataset

library(cluster)


drinks <- read_csv("energy_drinks.csv")
View(drinks)

2) Create a distance matrix containing the Euclidean distance between all pairs of consumers.

# Compute distances between each pair of consumers

drinks_2 <- select(drinks, D1, D2, D3, D4, D5)
d1 <- dist(drinks_2)

(a) Does the data need to be scaled before computing the distance matrix? Explain your answer.

  • No, the data need not be scaled in this question.

  • The different versions of energy drinks (D1-D5) are rated on a 1-9 Likert Scale.

  • Scaling is needed only when the variables are measured on different scales. Since all five ratings are measured on the same 1-9 Likert scale, no single variable will dominate the Euclidean distance, and scaling can be omitted.
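
The effect of mixing scales can be illustrated with a toy example. The data below is made up for this sketch: one data frame holds two ratings on the same scale, the other combines a rating with a hypothetical spend variable measured in much larger units.

```r
# Two ratings on the same 1-9 scale: raw Euclidean distances are meaningful
same_scale <- data.frame(D1 = c(2, 8, 5), D2 = c(3, 7, 5))
print(dist(same_scale))

# A rating combined with a hypothetical spend variable in much larger units
mixed <- data.frame(rating = c(2, 8, 5), spend = c(100, 110, 900))

d_raw    <- as.matrix(dist(mixed))         # dominated almost entirely by spend
d_scaled <- as.matrix(dist(scale(mixed)))  # each variable standardised first

print(round(d_raw, 2))
print(round(d_scaled, 2))
```

In the raw distances, customers 1 and 2 look almost identical despite opposite ratings, because spend swamps the rating. After standardising with scale, both variables contribute comparably. The energy drinks ratings avoid this problem because all five share one scale.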

3) Carry out a hierarchical clustering using the hclust function. Use method = “average”.

#Carry out the hierarchical clustering. Using Average because it reveals more interesting clusters in this example.

h2 <- hclust(d1, method = "average")

4) Visualise the results of the hierarchical clustering using a dendrogram and a heatmap.

plot(h2, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h2), Colv = 'Rowv', labRow = F, labCol = F)

(a) Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.

The heatmap above shows some lightly coloured blocks along the diagonal, which suggests that there are groups of customers who are similar to each other in the energy drinks data. However, the blocks are not sharply defined, so the evidence of a cluster structure is only moderate.

5) Create a 3-cluster solution using the cutree function and assess the quality of this solution.

# Decide on number of clusters

clusters2 <- cutree(h2, k = 3)

# Assess the quality of the segmentation

sil1 <- silhouette(clusters2, d1)
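
The silhouette object still has to be summarised before the quality of the solution can be judged. The following sketch shows one way to do this on simulated data (three well separated groups generated here purely for illustration), following the same dist, hclust, cutree and silhouette steps as above. Average silhouette widths close to 1 indicate well separated clusters, while values near 0 indicate overlapping ones.

```r
library(cluster)

set.seed(1)
# Simulated stand-in data: three groups of 20 points each, far apart
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2),
             matrix(rnorm(40, mean = 20), ncol = 2))

toy_dist     <- dist(toy)
toy_hc       <- hclust(toy_dist, method = "average")
toy_clusters <- cutree(toy_hc, k = 3)
toy_sil      <- silhouette(toy_clusters, toy_dist)

# Overall average silhouette width and the average width per cluster
avg_width      <- mean(toy_sil[, "sil_width"])
cluster_widths <- tapply(toy_sil[, "sil_width"], toy_sil[, "cluster"], mean)

print(round(avg_width, 2))
print(round(cluster_widths, 2))
```

For the energy drinks solution, the same two summaries could be computed from sil1, and plot(sil1) would show the silhouette width of every customer.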

6) Profile the clusters, making sure to include answers to the questions below. Include any graphs/tables necessary to support your profiling.

# Profile the clusters

drinks_clus <- cbind(drinks, clusters2)
drinks_clus <- mutate(drinks_clus, cluster = case_when(clusters2 == 1 ~ 'C1',
                                                   clusters2 == 2 ~ 'C2',
                                                   clusters2 == 3 ~ 'C3'))

### Create a table with the number of customers and average rating.

rating_avg <- drinks_clus %>%
  group_by(cluster) %>%
  summarise(num_customers = n(),
            rating_avg1 = mean(D1),
            rating_avg2 = mean(D2),
            rating_avg3 = mean(D3),
            rating_avg4 = mean(D4),
            rating_avg5 = mean(D5))





# Format the table

library(kableExtra)
library(knitr)

rating_avg %>%
  knitr::kable(
    format = "html",
    col.names = c("Cluster", "Customers", "D1", "D2", "D3", "D4", "D5"),
    caption = "<b>Clusters & Energy Drink Ratings</b>",
    table.attr = 'data-quarto-disable-processing = "true"',
    align = "lrrrrrr",
    digits = c(0, 3, 2, 2, 2, 2, 2)
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  column_spec(1, color = "black")
Clusters & Energy Drink Ratings

Cluster  Customers    D1    D2    D3    D4    D5
C1             417  3.00  4.71  6.20  6.70  6.76
C2             235  2.96  4.67  6.95  5.01  2.97
C3             188  6.98  5.37  3.03  2.88  2.84

(a) How do the clusters differ on their average rating of each version of the energy drinks?

drinks_clus_means <- drinks_clus %>%
  group_by(cluster) %>%
  summarise(Average_D1 = mean(D1),
            Average_D2 = mean(D2),
            Average_D3 = mean(D3),
            Average_D4= mean(D4),
            Average_D5= mean(D5))



#Convert the dataset to be in "tidy" format to allow for creation of line graph. 

drinks_clus_tidy <- drinks_clus_means %>%
  pivot_longer(cols = c(Average_D1, Average_D2, Average_D3, Average_D4, Average_D5), names_to = "Drink_Versions", values_to = "Average_Rating")




#Visualise the average rating of each version of the energy drinks

ggplot(drinks_clus_tidy, mapping = aes(x = Drink_Versions, y = Average_Rating, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  ylab("Average") + 
  xlab("Energy Drink") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) + #angle = 0 keeps the labels horizontal; hjust = 1 right-justifies each label against its tick
  ggtitle("Average Rating of Popular Energy Drinks by Cluster")

(b) How do the clusters differ on age and gender?

# AGE

# Reorder the age brackets from youngest to oldest.

drinks_clus$Age <- factor(drinks_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))

## Create a bar chart to visualise the distribution of age within each cluster.

library (ggplot2)
library (scales)

ggplot(drinks_clus, aes(x = Age, group = cluster, fill = cluster)) + 
  geom_bar(aes(y = after_stat(prop)), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of People") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Cluster") +
  coord_flip() #This swaps the x and y axes, making the age brackets easier to read.

# GENDER

drinks_clus$Gender <- factor(drinks_clus$Gender, levels = c("Male", "Female"))

# Bar Chart to visualise the distribution of gender within each cluster.

ggplot(drinks_clus, aes(x = Gender, group = cluster, fill = cluster)) + 
  geom_bar(aes(y = after_stat(prop)), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of People") + 
  xlab("Gender") +
  ggtitle("Gender by Cluster") +
  coord_flip()

Across all three clusters, the 25 to 34 age group stands out with a high percentage of respondents. In addition, women make up the smaller share of each of the three clusters.
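
A table of within-cluster proportions can support the reading of the bar charts. The sketch below uses a tiny made-up data frame in place of drinks_clus, so the numbers it prints are illustrative only; the same prop.table call could be applied to the real cluster and Gender (or Age) columns.

```r
# Made-up stand-in for drinks_clus, with only the two columns needed here
toy_clus <- data.frame(
  cluster = c("C1", "C1", "C1", "C2", "C2", "C3", "C3", "C3"),
  Gender  = c("Male", "Male", "Female", "Male", "Female", "Male", "Male", "Female")
)

# margin = 1 makes each row (cluster) sum to 1, i.e. shares within a cluster
gender_share <- prop.table(table(toy_clus$cluster, toy_clus$Gender), margin = 1)
print(round(gender_share, 2))
```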

7) Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.

The observations from the table and the “Average Rating of Popular Energy Drinks by Cluster” line graph are given below:

  • Cluster 1 has the highest average rating of 6.76 for the energy drink version D5.

  • Cluster 2 has the highest average rating of 6.95 for the energy drink version D3.

  • Cluster 3 has the highest average rating of 6.98 for the energy drink version D1.

Accordingly, D1 should be advertised to cluster 3, D3 to cluster 2 and D5 to cluster 1.
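
The match between drink versions and clusters can be checked programmatically. This sketch re-enters the average ratings from the “Clusters & Energy Drink Ratings” table as a matrix and, for each drink, picks the cluster with the highest average rating.

```r
# Average ratings copied from the cluster profile table (rows = clusters)
ratings <- matrix(c(3.00, 4.71, 6.20, 6.70, 6.76,
                    2.96, 4.67, 6.95, 5.01, 2.97,
                    6.98, 5.37, 3.03, 2.88, 2.84),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("C1", "C2", "C3"),
                                  c("D1", "D2", "D3", "D4", "D5")))

# For each drink (column), find the cluster (row) with the highest average rating
best_cluster <- rownames(ratings)[apply(ratings, 2, which.max)]
names(best_cluster) <- colnames(ratings)

# Clusters to target for the three versions asked about
print(best_cluster[c("D1", "D3", "D5")])
```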

8) If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?

From the “Clusters & Energy Drink Ratings” table, clusters 1, 2 and 3 contain 417, 235 and 188 customers, respectively. To give the recommendation the broadest reach, it makes sense to focus on the cluster with the largest number of customers. Therefore, the energy drink version “D5” is recommended: it receives the highest average rating (6.76) in cluster 1, which is the largest cluster with 417 customers.