Assessment 4 - Data Analytics

Question 1 – Predicting Customers Who Will Renew Their Music Subscription

1.Import the sub_training.csv and sub_testing.csv datasets into R.

2.Create and visualise a classification tree model that will allow you to predict if a customer will re-subscribe or churn.

#Load required packages

library(rattle)
library(rpart)
library(tidyverse)

#Step 1: Import the subscription_training.csv and subscription_training.csv datasets into R

subscription_training <- read_csv("sub_training.csv")
subscription_testing <- read_csv("sub_testing.csv")


#Step 2: Create and visualise a classification tree model

subscription_tree <- rpart( renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, subscription_training )
fancyRpartPlot(subscription_tree)

3.Interpret the classification tree:

a. Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.

Path 1: If Lor is not less than 140 and Num_contacts is not less than 7.5, the customer is likely to re-subscribe.

As a result, this rule generates a terminal node that predicts “Yes”(re-subscribe) with a purity of 67%, which is simply 67% of the customers from this leaf re-subscribed, and 33% did not.

Path 2: If Lor is less than 140, and Spend is less than 182, then the customer is not likely to re-subscribe.

At this point, we end up with a terminal node which predicts “No” (not re-subscribe) and has a purity of 42%, i.e., here 42% customers in this leaf re-subscribed and 58% did not.

b.Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.

The “No” label in Leaf 4 indicates customers fulfilling the criteria for this path are unlikely to churn. For example, when Lor >= 140, the customer belongs to the leaf. These probabilities reflects the purity of this leaf: 76% of customers in this leaf did not churn and 24% did churn.

This detail reveals some about how pure the leaf is and how likely it is to come to outcome. Although the majority prediction says “No” (no churn), the probability of a customer churning in this leaf is 24%. As a result, customers on this path are less prone to churn yet moderate propensity to churn.

# c. Important variables:

subscription_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
# Check node purity using the summary.

summary(subscription_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = subscription_training)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.0683962 0.03430592
2 0.01179245      1 0.8066038 0.8608491 0.03403627
3 0.01000000      4 0.7617925 0.8632075 0.03404771

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 

C. Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.

The summary() function gives you the variable importances — that printout shows you how much the variables contribute to the improvements within the model. These are respectively expressed as a share of the total improvements.

From the output:

Lor accounts for 47% of all improvements and is the most important variable for predicting re-subscription.

Spend and Age both contribute 22%, suggesting they are also major players in the model.

num_contacts was contributing 8% which means it has moderate effect.

This is just a couple of changes made in the initial preprocessing step contact_recency, which only accounts for 1%, indicating that it has little influence on the predictions of the model.

Lor is singled out as the most important variable according to these percentages, because it is the variable with the most contribution to improving the model’s accuracy.

4.Fully assess the accuracy of the classification tree using both the training and the testing datasets.

a.Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.

# Measure accuracy on the training dataset

train_probs <- predict(subscription_tree, newdata = subscription_training, type = 'prob')
train_preds <- predict(subscription_tree, newdata = subscription_training, type = 'class')

#Append the probablities and prediction to the main dataset

subscription_training_updated <- cbind(subscription_training, train_probs, train_preds)

#create a confution matrix(cross tabulation) between predicted vs actual

train_con_mat <- table(subscription_training_updated$renewed, subscription_training_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270
sum(diag(train_con_mat))/sum(train_con_mat)
[1] 0.62

From the above confusion matrix of training dataset, we know the following:

• The overall model accuracy is (257 + 270)/850 = 0.62 or 62%.

• Of all customers the model predicted to resubscribe, they got 270/439 = 0.615 or 61% correct.

• Of all customers the model predicted not to resubscribe, they got 257/411 = 0.62 or 62% correct.

• Of all customers who did resubscribe, the model correctly identified 270/424 = 0.63 or 63%.

• Of all customers who did not resubscribe, the model correctly identified 257/426 = 0.60 or 60%.

# Measure accuracy on the testing dataset

test_probs <- predict(subscription_tree, newdata = subscription_testing, type = 'prob')
test_preds <- predict(subscription_tree, newdata = subscription_testing, type = 'class')
subscription_testing_updated <- cbind(subscription_testing, test_probs, test_preds)

test_con_mat <- table(subscription_testing_updated$renewed, subscription_testing_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41
sum(diag(test_con_mat)) /sum(test_con_mat)
[1] 0.5133333

From the above confusion matrix of the testing data set, we know the following:

• The overall model accuracy is (36+41)/150 = 0.51 or 51%. • Of all customers the model predicted to resubscribe, they got 41/79 = 0.518 or 51% correct. • Of all customers the model predicted not to resubscribe, they got 36/71 = 0.50 or 50% correct. • Of all customers who did resubscribe, the model correctly identified 41/76 = 0.539 or 54%. • Of all customers who did not resubscribe, the model correctly identified 36/74 = 0.48 or 48%.

The overall accuracy of the model was 62% for the training dataset, however, it has dropped to 51% for the testing dataset. This suggests that the classification tree model is overfitting itself to the training dataset and will not generalize well when asked to predict new unseen data in the future. So, we should prune the tree.

b. Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.

#5: Pruned Classification Tree


subscription_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, subscription_training, maxdepth =5 )


fancyRpartPlot(subscription_tree_pruned)

c.Fully assess the accuracy of the pruned tree on the training and testing datasets.

# Accuracy on training data

train_probs_pruned <- predict(subscription_tree_pruned, newdata = subscription_training, type = 'prob')
train_preds_pruned <- predict(subscription_tree_pruned, newdata = subscription_training, type = 'class')
subscription_training_updated_pruned <- cbind(subscription_training, train_probs_pruned, train_preds_pruned)

train_con_mat_pruned <- table(subscription_training_updated_pruned$renewed, subscription_training_updated_pruned$train_preds, dnn = c('Actual', 'Predicted'))

train_accuracy_pruned <- sum(diag(train_con_mat_pruned)) / sum(train_con_mat_pruned)
# Accuracy on testing data

test_probs_pruned <- predict(subscription_tree_pruned, newdata = subscription_testing, type = 'prob')
test_preds_pruned <- predict(subscription_tree_pruned, newdata = subscription_testing, type = 'class')
subscription_testing_updated_pruned <- cbind(subscription_testing, test_probs_pruned, test_preds_pruned)

test_con_mat_pruned <- table(subscription_testing_updated_pruned$renewed, subscription_testing_updated_pruned$test_preds, dnn = c('Actual', 'Predicted'))

test_accuracy_pruned <- sum(diag(test_con_mat_pruned)) / sum(test_con_mat_pruned)

d.Has pruning the classification tree resulted in less overfitting? Explain your answer.

The overall accuracy for the training dataset dropped from 62% to 60%, whilst the accuracy in the testing dataset improved from 51% to 53%.

The classification tree has been pruned to reduce overfitting. Reducing the models complexity to obtain better performance due to pruning on the testing dataset, even if that meant a negligible drop of the overall accuracy on the training dataset. This trade-off illustrates the advantage of pruning in building a more generalizable model.

4.Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?

  • Consider sending these customers personalized emails, reminders, or even exclusive offers to re-engage and incentivize renewal.

  • Enhanced Customer Support: Try to decrease the number of complaints helped by improving your customers support responsiveness and issue solving.

  • Recommendations: Implement a proactive approach to addressing customer concerns before they escalate.

  • Loyalty and Rewards Programs: Whether it’s to refer a friend or share a social post, or just being a loyal customer, offer all long-term customers, or customers above a certain spend amount, discounts or exclusive benefits.

  • Retention Campaigns for Churned Customers: Utilize the model to detect customers who are most likely to churn (Lor >= 140 (or low propensity to spend). Create specific strategies for retention, such as unique discounts, extended trial periods, or added value services to retain these customers.

  • Strategies in Data-Driven Marketing: Align the prop model with the company’s CRM to automate segmentation and targeting. Take support on the insights from the model to personalize marketing campaigns and channel spending towards customers that are more likely to renew.

  • Customer Feedback Mechanism: Iterative feedback — Get feedback from your customers regularly about their experience with the service.

  • Improve customer support to reduce the number of complaints.

Question 2 – Segmenting Consumers Based on Energy Drink Preference

1.Import the energy_drinks.csv file into R.

2.Create a distance matrix containing the Euclidean distance between all pairs of consumers.

library(cluster)
library(tidyverse)

#1: Import the data
energy_drink <- read_csv("energy_drinks.csv")
View(energy_drink)

#2: Compute distances between each pair of players
energy_drink_2 <- select(energy_drink, D1:D5)
d1 <- dist(energy_drink_2)

a.Does the data need to be scaled before computing the distance matrix? Explain your answer.

No, because the four variables that will be entered into the clustering algorithm are all measured on the same 10-point Likert scale. Therefore, we can proceed straight to calculating the Euclidean distance between each pair of customers using the following R code.

3.Carry out a hierarchical clustering using the hclust function. Use method = “average”.

h1 <- hclust(d1, method = "average")

4.Visualise the results of the hierarchical clustering using a dendrogram and a heatmap. Note that the heatmap may take several seconds to appear because of the large number of customers in the dataset.

plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

a.Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.

The heatmap shows clustering structure in the energy drinks dataset Then, looking at the diagonal of the heatmap really gives very distinct groups of light yellow blocks, meaning its the groups of customers matched very closely in their preferences, i.e., smaller Euclidean distance. Finally, one more thing is that those square-like patches in our heatmap indicate that some clusters of our variables or customers are more similar to each other than others. The dendrogram represents associations between the clusters, and hints at the hierarchical nature of this relationship between smaller, more tightly-knit groups merging into larger clusters, confirming that there are multiple groups that are distinct in the data.

5.Create a 3-cluster solution using the cutree function and assess the quality of this solution.

#Step 4: Decide on number of clusters
clusters1 <- cutree(h1, k = 3)


#Step 5: Assess the quality of the segmentation
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
      417       235       188 
0.2249249 0.1987262 0.3918562 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3120  0.1599  0.2916  0.2550  0.3716  0.5502 

6.Profile the clusters, making sure to include answers to the questions below. Include any graphs/tables necessary to support your profiling.

#Step 6: Profile the clusters. 
energy_drink_clus <- cbind(energy_drink, clusters1)
energy_drink_clus <- mutate(energy_drink_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                                   clusters1 == 2 ~ 'C2',
                                                                   clusters1 == 3 ~ 'C3'))

a.How do the clusters differ on their average rating of each version of the energy drinks?

library(tidyverse)
library(kableExtra)
library(tidyverse)

Rating_Avg <- energy_drink_clus %>% group_by(cluster) %>% 
summarise(num_customers = n(), 
avg_D1 = mean(D1), 
avg_D2 = mean(D2), 
avg_D3 = mean(D3), 
avg_D4 = mean(D4), 
avg_D5 = mean(D5)) 

#Convert the dataset to be in "tidy" format to allow for creation of line graph.

Rating_Avg_tidy <- Rating_Avg %>% 
pivot_longer(cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5), 
             names_to = "Energy_Drink", values_to = "Average") 

Rating_Avg_tidy 
# A tibble: 15 × 4
   cluster num_customers Energy_Drink Average
   <chr>           <int> <chr>          <dbl>
 1 C1                417 avg_D1          3   
 2 C1                417 avg_D2          4.71
 3 C1                417 avg_D3          6.20
 4 C1                417 avg_D4          6.70
 5 C1                417 avg_D5          6.76
 6 C2                235 avg_D1          2.96
 7 C2                235 avg_D2          4.67
 8 C2                235 avg_D3          6.95
 9 C2                235 avg_D4          5.01
10 C2                235 avg_D5          2.97
11 C3                188 avg_D1          6.98
12 C3                188 avg_D2          5.37
13 C3                188 avg_D3          3.03
14 C3                188 avg_D4          2.88
15 C3                188 avg_D5          2.84
arrange(Rating_Avg) 
# A tibble: 3 × 7
  cluster num_customers avg_D1 avg_D2 avg_D3 avg_D4 avg_D5
  <chr>           <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 C1                417   3      4.71   6.20   6.70   6.76
2 C2                235   2.96   4.67   6.95   5.01   2.97
3 C3                188   6.98   5.37   3.03   2.88   2.84
knitr::kable( Rating_Avg, format = "html", 
              col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2", "Avg D3", "Avg D4", "Avg D5"), 
              caption = "<b>Number of Customers and Average Ratings</b>",
              align = "lrrrrrr", 
              table.attr = 'data-quarto-disable-processing = "true"', 
              digits = c(0,3,2,2,2,2,2) ) %>% 
  
  
  kableExtra::kable_styling( 
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE, 
    position = "center", 
    font_size = 14 ) %>% 
  
  column_spec(1, color = "black", background = "#ebe7fa")
Number of Customers and Average Ratings
Cluster Number of Customers Avg D1 Avg D2 Avg D3 Avg D4 Avg D5
C1 417 3.00 4.71 6.20 6.70 6.76
C2 235 2.96 4.67 6.95 5.01 2.97
C3 188 6.98 5.37 3.03 2.88 2.84
#Visualise the mean satisfaction score for each contact method by cluster. 

ggplot(Rating_Avg_tidy, mapping = aes(x = Energy_Drink, y = Average , group = cluster, colour = cluster)) + 
  geom_line(linewidth = 1) + 
  geom_point(size = 2) + 
  ylab("Average") + 
  xlab("Energy Drink") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  
  ggtitle("Average Rating of Energy Drinks")

b.How do the clusters differ on age and gender?

energy_drink_clus$Age <- factor(energy_drink_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65")) 

library(ggplot2) 
library(scales) 

# Gender 

energy_drink_clus$Gender <- factor(energy_drink_clus$Gender, levels = c("Male", "Female")) 

ggplot(energy_drink_clus, aes(x = Gender, group = cluster)) + geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) + 
  
facet_grid(~ cluster) + scale_y_continuous(labels = scales::percent_format()) +
  ylab("Percentage of People") + 
  xlab("Gender") + 
  ggtitle("Gender by Cluster") + 
  coord_flip() 

# Age 


energy_drink_clus$Gender <- factor(energy_drink_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65")) 

ggplot(energy_drink_clus, aes(x = Age, group = cluster)) + 
  
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
  
  facet_grid(~ cluster) + 
  scale_y_continuous(labels = scales::percent_format()) + 
  ylab("People Percentage") + 
  xlab("Age buckets") + 
  ggtitle("Age by Cluster") + 
  coord_flip()

7.Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.

  • Energy Drink D1: Target Cluster: C3

Highest average rating of D1 is seen in Cluster C3 which implies that customers in this cluster highly prefer this version D1 should drive advertising efforts toward this clusters in order to maximize engagement and sales.

  • Energy Drink D3: Target Cluster: C2

Cluster C2 has the highest average value of D3, which indicates that this version resonates the most with customers in this cohort. Marketing campaigns of D3 should target this segment where they feel good.

  • Energy Drink D5: Target Cluster: C1

D5 receives the best average rating in the cluster of C1, thus representing the most potential segment to a model of this type. C1 customers should be targeted for D5 promotions and offers to boost sales and brand loyalty.

By adjusting the marketing campaign based on preferences of certain clusters in company can optimize its ad spend, and enhance customer engagement for these energy drink variants.

8.If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?

My recommendation would be to keep producing Energy Drink D5 for the company, as it is the one with the highest overall average ratings across the four different clusters, with particularly high ratings in Cluster C1, meaning it has a good target market fit that is likely to be very satisfied with the product. Therefore, it follows from this that the D5 is absolutely the most appreciated and most favoured version so the best potential choice for maximizing customer retention and sales.

****************************************** End of the Document ******************************************