#Load required packages
library(rattle)
library(rpart)
library(tidyverse)
Predictive Analytics and Customer Segmentation
Introduction
This assignment has two questions on predictive analytics and customer segmentation, which will be answered using decision trees and cluster analysis in R.
Question 1 – Predicting Customers Who Will Renew Their Music Subscription
1. Import the data
<- read_csv("sub_training.csv")
sub_train <- read_csv("sub_testing.csv") sub_test
2. Create and visualise a classification tree model
# Step 2: Create the classification tree model
sub_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_train)
fancyRpartPlot(sub_tree)
3. Interpret the classification tree
- The rules for predicting if a customer will re-subscribe.
Path 1 (going down the far left): if the length of relationship (lor) is less than 140 days and the amount spent with the company during the last 36 months is less than €182, then the customer is predicted not to renew their downloading subscription.
Path 2 (going down the right): if the length of relationship (lor) is 140 days or more, then the customer is predicted to renew their subscription.
- The probability of the prediction being correct. Consider the blue leaf at the end of Path 2 going down right:
The label “Yes” tells you the renewal prediction for a customer who satisfies the rules on this path.
The labels 0.39 and 0.61 tell you that, of all customers in the training dataset who fall into this leaf, 39% did not renew their subscription and 61% did. These figures give an idea of how pure the leaf is, and they also provide the probability of the prediction being correct. In this case, if a customer follows Path 2 and ends up in this leaf, they are predicted to renew with a probability of 0.61 or 61%, which is only a moderately strong prediction.
- The rules for predicting if a customer will churn.
Path 1 (going down the far left): if the length of relationship (lor) is less than 140 days and the amount spent with the company during the last 36 months is less than €182, then the customer is predicted to churn (i.e. not renew).
Path 2 (going down the right): if the length of relationship (lor) is 140 days or more, then the customer is predicted not to churn.
- The probability of the prediction being correct. Consider the green leaf at the end of Path 1 going down far left.
The label “No” tells you the prediction (no renewal, i.e. churn) for a customer who satisfies the rules on Path 1.
The labels 0.76 and 0.24 tell you that, of all customers in the training dataset who fall into this leaf, 76% did not renew (churned) and 24% did renew. These figures give an idea of how pure the leaf is, and they also provide the probability of the prediction being correct. In this case, if a customer follows Path 1 and ends up in this leaf, they are predicted to churn with a probability of 0.76 or 76%.
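These leaf counts and probabilities can also be read directly from the fitted tree rather than off the plot. A minimal sketch, assuming the sub_tree object from Step 2 (yval2 is an internal rpart matrix whose columns hold each node's fitted class, class counts, class probabilities and share of the data):
# Extract the counts and probabilities for each leaf from the rpart frame
leaves <- sub_tree$frame[sub_tree$frame$var == "<leaf>", ]
leaves$n     # number of training customers falling into each leaf
leaves$yval2 # fitted class, class counts, class probabilities, node share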
- Important variables in predicting whether a customer will re-subscribe or not.
The length of relationship (lor), the amount spent with the company during the last 36 months (spend), and age are the most important variables for predicting whether or not a customer will re-subscribe.
#Extract the variable importance from the rpart object we have called sub_tree
sub_tree$variable.importance
lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
#Extract the variable importance as a percentage of all improvements to the model
summary(sub_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, data = sub_train)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.1108491 0.03417864
2 0.01179245 1 0.8066038 0.8514151 0.03398856
3 0.01000000 4 0.7617925 0.8514151 0.03398856
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
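As a quick cross-check, the Variable importance percentages reported by summary() can be reproduced by normalising the raw importance vector extracted earlier (a minimal sketch, assuming sub_tree from Step 2):
# Express each variable's improvement as a percentage of the total
round(100 * sub_tree$variable.importance / sum(sub_tree$variable.importance))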
4. Assess the accuracy of the classification tree using both the training and the testing datasets.
# Step 4: Measure accuracy on Training data
sub_probs <- predict(sub_tree, newdata = sub_train, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_train, type = 'class')

sub_train_updated <- cbind(sub_train, sub_probs, sub_preds)

train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat
Predicted
Actual No Yes
No 257 169
Yes 154 270
sum(diag(train_con_mat))/sum(train_con_mat)
[1] 0.62
From the above confusion matrix, we know the following (the sketch after this list shows how these rates can be computed in R):
The overall model accuracy is (257+270)/850 = 0.62 or 62%.
Of all customers the model predicted to renew, it was correct for 270/439 = 0.61 or 61%.
Of all customers the model predicted not to renew, it was correct for 257/411 = 0.62 or 62%.
Of all customers who did renew, the model correctly identified 270/424 = 0.63 or 63%.
Of all customers who did not renew, the model correctly identified 257/426 = 0.60 or 60%.
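Here is the sketch referred to above (assuming train_con_mat from the previous chunk), deriving the same rates programmatically instead of by hand:
# Overall accuracy, precision and recall for the 'Yes' (renewed) class
accuracy <- sum(diag(train_con_mat)) / sum(train_con_mat)
precision_yes <- train_con_mat["Yes", "Yes"] / sum(train_con_mat[, "Yes"])
recall_yes <- train_con_mat["Yes", "Yes"] / sum(train_con_mat["Yes", ])
c(accuracy = accuracy, precision_yes = precision_yes, recall_yes = recall_yes)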
At this stage the model seems only moderately useful: it is right 62% of the time, which is not great, and it misses quite a lot of customers who did not renew. More importantly, we need to assess the model’s accuracy on the testing dataset, because a classification tree often overfits the training data, so we cannot rely on the training accuracy alone.
#Measure accuracy on Testing data
sub_probs <- predict(sub_tree, newdata = sub_test, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_test, type = 'class')

sub_test_updated <- cbind(sub_test, sub_probs, sub_preds)

test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
Predicted
Actual No Yes
No 36 38
Yes 35 41
sum(diag(test_con_mat))/sum(test_con_mat)
[1] 0.5133333
From the above confusion matrix, we know the following:
The overall model accuracy is (36+41)/150 = 0.51 or 51%.
Of all customers the model predicted to renew, it was correct for 41/79 = 0.51 or 51%.
Of all customers the model predicted not to renew, it was correct for 36/71 = 0.50 or 50%.
Of all customers who did renew, the model correctly identified 41/76 = 0.53 or 53%.
Of all customers who did not renew, the model correctly identified 36/74 = 0.48 or 48%.
- Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.
From the findings, the overall accuracy of the model was 62% on the training dataset but dropped to 51% on the testing dataset (a drop of 11 percentage points). This suggests that the classification tree is overfitting the training dataset and will not generalise well when asked to predict new, unseen data in the future. In this situation it is common to prune the tree.
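The refit with maxdepth = 3 below is a simple form of pre-pruning. As an alternative, rpart supports cost-complexity pruning based on the CP table shown in the summary() output above; a sketch, assuming sub_tree from Step 2:
# Inspect the cross-validated error (xerror) for each complexity parameter value
printcp(sub_tree)
# Prune back to the CP value with the lowest cross-validated error
best_cp <- sub_tree$cptable[which.min(sub_tree$cptable[, "xerror"]), "CP"]
sub_tree_pruned <- prune(sub_tree, cp = best_cp)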
- Prune the tree model to improve predictive power
sub_tree2 <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_train, maxdepth = 3)
fancyRpartPlot(sub_tree2)
- Assess the accuracy of the pruned tree on the training and testing datasets.
# Measure the accuracy of the pruned tree model for training data
sub_probs2 <- predict(sub_tree2, newdata = sub_train, type = 'prob')
sub_preds2 <- predict(sub_tree2, newdata = sub_train, type = 'class')

sub_train_updated2 <- cbind(sub_train, sub_probs2, sub_preds2)

train_con_mat2 <- table(sub_train_updated2$renewed, sub_train_updated2$sub_preds2, dnn=c('Actual', 'Predicted'))
train_con_mat2
Predicted
Actual No Yes
No 266 160
Yes 172 252
sum(diag(train_con_mat2))/sum(train_con_mat2)
[1] 0.6094118
From the above confusion matrix, we know the following:
The overall model accuracy is (266+252)/850 = 0.61 or 61%.
Of all customers the model predicted to renew, it was correct for 252/412 = 0.61 or 61%.
Of all customers the model predicted not to renew, it was correct for 266/438 = 0.61 or 61%.
Of all customers who did renew, the model correctly identified 252/424 = 0.59 or 59%.
Of all customers who did not renew, the model correctly identified 266/426 = 0.62 or 62%.
# Measure the accuracy of the pruned tree model for testing data
sub_probs2 <- predict(sub_tree2, newdata = sub_test, type = 'prob')
sub_preds2 <- predict(sub_tree2, newdata = sub_test, type = 'class')

sub_test_updated2 <- cbind(sub_test, sub_probs2, sub_preds2)

test_con_mat2 <- table(sub_test_updated2$renewed, sub_test_updated2$sub_preds2, dnn = c('Actual', 'Predicted'))
test_con_mat2
Predicted
Actual No Yes
No 41 33
Yes 38 38
sum(diag(test_con_mat2))/sum(test_con_mat2)
[1] 0.5266667
From the above confusion matrix, we know the following:
The overall model accuracy is (41+38)/150 = 0.52 or 52%.
Of all customers the model predicted to renew, it was correct for 38/71 = 0.53 or 53%.
Of all customers the model predicted not to renew, it was correct for 41/79 = 0.51 or 51%.
Of all customers who did renew, the model correctly identified 38/76 = 0.50 or 50%.
Of all customers who did not renew, the model correctly identified 41/74 = 0.55 or 55%.
- Has pruning the classification tree resulted in less overfitting? Explain.
Yes, pruning the classification tree has resulted in less overfitting, but the difference is very small.
Pruning the tree decreased the overall training accuracy only slightly, from 62% to 61%. However, the overall accuracy on the testing dataset increased from 51% to 52%, which indicates that the pruned tree does not suffer from as much overfitting as the original tree and should predict new data more accurately.
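The train/test gap can be summarised for both trees in one place (a sketch, assuming the four confusion matrices computed above), since this gap is the quantity we use to judge overfitting:
# Compare the train/test accuracy gap of the original and pruned trees
acc <- function(cm) sum(diag(cm)) / sum(cm)
data.frame(model = c("original", "pruned"),
           train = c(acc(train_con_mat), acc(train_con_mat2)),
           test = c(acc(test_con_mat), acc(test_con_mat2)),
           gap = c(acc(train_con_mat) - acc(test_con_mat),
                   acc(train_con_mat2) - acc(test_con_mat2)))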
5. Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?
To improve renewal rates, the company should use targeted campaigns based on customer segmentation. High-risk customers should be offered discounts, trial renewals, and personalised content to drive engagement. Low-risk, long-term customers should be rewarded with loyalty programmes and exclusive offers to increase commitment. Moderate-risk customers should receive targeted discounts, satisfaction surveys and follow-up campaigns for more engagement.
Personalised marketing based on the propensity model can send tailored recommendations, upsell services, and cross-sell relevant features.
Using A/B testing, the company can refine its campaigns by identifying the most effective incentives.
The propensity model enables personalised marketing, dynamic pricing strategies, and churn prevention. It helps in segmenting customers into high-, moderate- and low-risk groups and in predicting future outcomes.
It also supports optimising marketing efforts, allocating resources efficiently, and improving customer retention and acquisition strategies. Overall, it helps the company make data-driven decisions that improve customer engagement and renewal rates.
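To illustrate how the propensity model could drive this targeting, the sketch below scores customers by their predicted renewal probability and selects an uncertain middle band for retention offers. It reuses sub_test as a stand-in for a batch of new customers, and the 0.4–0.6 band is an assumed threshold rather than one derived from the analysis:
# Score customers by their predicted probability of renewing
renew_prob <- predict(sub_tree2, newdata = sub_test, type = 'prob')[, "Yes"]
# Assumed 'moderate-risk' band: customers most likely to be swayed by an offer
target_customers <- sub_test[renew_prob >= 0.4 & renew_prob <= 0.6, ]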
Question 2 – Segmenting Consumers Based on Energy Drink Preference
#Load required packages
library(cluster)
library(ggplot2)
library(kableExtra)
library(knitr)
library(scales)
1. Import the data
<- read_csv("energy_drinks.csv") energy_drinks
2. Create a distance matrix containing the Euclidean distance between all pairs of consumers
# Compute distances between each pair of consumers
energy_ratings <- select(energy_drinks, D1:D5)
d1 <- dist(energy_ratings)
- Does the data need to be scaled before computing the distance matrix? Explain.
No, because the five variables (the ratings of the energy drink versions) that will be entered into the clustering algorithm are all measured on the same 9-point Likert scale. There is therefore no need for scaling, and we can proceed directly to calculating the Euclidean distances between the customers' ratings.
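Had the variables been on different scales (for example, Likert ratings mixed with spend in euros), we would standardise them before computing the distance matrix; a minimal sketch of that alternative:
# Standardise each column to mean 0 and sd 1, then compute distances
energy_scaled <- scale(energy_ratings)
d1_scaled <- dist(energy_scaled)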
3. Carry out a hierarchical clustering using the hclust function. Use method = “average”
# Carry out the hierarchical clustering
h1 <- hclust(d1, method = "average")
4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap
plot(h1, hang = -1)
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)
- Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain.
Yes, the heatmap provides evidence of clustering within the energy drinks dataset.
The hierarchical dendrograms along the rows and columns suggest that the consumers are grouped into clusters based on their similarity, which indicates relationships within the dataset.
The heatmap displays block-like structures along the diagonal, and the varying intensity of colours within the blocks reveals differences in the similarity of the data points.
5. Create a 3-cluster solution using the cutree function and assess the quality of this solution
# 3-cluster solution
clusters1 <- cutree(h1, k = 3)

# Assess the quality of the segmentation
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
417 235 188
0.2249249 0.1987262 0.3918562
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3120 0.1599 0.2916 0.2550 0.3716 0.5502
The overall cluster analysis has a mean silhouette score of 0.255, which means that the analysis has uncovered a weak clustering structure that could be artificial. The average silhouette scores for the individual clusters are analysed below, followed by a sketch that checks other cluster counts:
Clusters 1 and 2 have scores of 0.22 and 0.19 respectively, meaning that no substantial structure has been found in them.
Cluster 3 has a score of 0.39, which means that this cluster is weak and could be artificial.
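Here is the sketch mentioned above (assuming h1 and d1 from the earlier steps), comparing the mean silhouette width for several values of k to check whether a different number of clusters yields a stronger structure:
# Mean silhouette width for k = 2 to 5 clusters
sapply(2:5, function(k) {
  cl <- cutree(h1, k = k)
  mean(silhouette(cl, d1)[, "sil_width"])
})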
6. Profile the clusters. Include any graphs/tables necessary to support your profiling
# Profile the clusters.
energy_clus <- cbind(energy_drinks, clusters1)
energy_clus <- mutate(energy_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                       clusters1 == 2 ~ 'C2',
                                                       clusters1 == 3 ~ 'C3'))
- How do the clusters differ on their average rating of each version of the energy drinks?
# Create a graph showing the average ratings of energy drinks by cluster
size_rating <- energy_clus %>%
  group_by(cluster) %>%
  summarise(num_customers = n(),
            avg_D1 = mean(D1),
            avg_D2 = mean(D2),
            avg_D3 = mean(D3),
            avg_D4 = mean(D4),
            avg_D5 = mean(D5))

knitr::kable(size_rating,
             format = "html",
             digits = c(0,3,2,2,2,2,2),
             align = "lrrrrrr",
             col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2", "Avg D3", "Avg D4", "Avg D5"),
             caption = "<div style='text-align: center;'>Average Ratings of Energy drinks by Cluster</div>",
             table.attr = 'data-quarto-disable-processing="true"') %>%
  kableExtra::kable_styling(
    full_width = F,
    bootstrap_options = c("striped", "hover", "condensed"),
    position = "center", font_size = 12) %>%
  kableExtra::row_spec(0, bold = TRUE, color = "white", background = "gray")
| Cluster | Number of Customers | Avg D1 | Avg D2 | Avg D3 | Avg D4 | Avg D5 |
|---------|---------------------|--------|--------|--------|--------|--------|
| C1 | 417 | 3.00 | 4.71 | 6.20 | 6.70 | 6.76 |
| C2 | 235 | 2.96 | 4.67 | 6.95 | 5.01 | 2.97 |
| C3 | 188 | 6.98 | 5.37 | 3.03 | 2.88 | 2.84 |
# Convert the dataset to be in "tidy" format
energy_rating_tidy <- size_rating %>%
  pivot_longer(cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5),
               names_to = "Energy_drinks", values_to = "Average_Ratings")

# Visualise the average ratings of energy drinks by cluster
ggplot(energy_rating_tidy, mapping = aes(x = Energy_drinks, y = Average_Ratings, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("#A752A0", "#FCCA3A", "green4")) +
  ylab("Average Ratings") +
  xlab("Energy drinks") +
  theme(axis.text.x = element_text(hjust = 1)) +
  ggtitle("Average Ratings of Energy drinks by Cluster")
The table and graph above allow us to profile each of the clusters according to the different variables included in the original dataset. From these, the following are observed:
Cluster 1 is the largest segment with 417 customers and has the highest ratings for D4 (6.70) and D5 (6.76).
Cluster 2 is the second-largest segment with 235 customers and has the highest rating for D3 (6.95) but strongly dislikes D5 (2.97).
Cluster 3 is the smallest segment with 188 customers and strongly favours the earliest version, D1 (6.98), with ratings declining for later versions.
- How do the clusters differ on age and gender?
# Age: Use barcharts and faceting to visualise the distribution of age within each cluster.
#First reorder the age brackets in chronological order.
energy_clus$Age <- factor(energy_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))
ggplot(energy_clus, aes(x = Age, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of People") +
xlab("Age Group") +
ggtitle("Age Breakdown by Cluster") +
coord_flip()
# Gender
energy_clus$Gender <- factor(energy_clus$Gender, levels = c("Male", "Female"))
ggplot(energy_clus, aes(x = Gender, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of People") +
xlab("Gender") +
ggtitle("Gender Breakdown by Cluster") +
coord_flip()
In all three clusters, individuals aged 25-34 form the largest group. In Cluster C1, the proportion of older individuals is smaller. Cluster C2 shows a significant proportion of individuals aged 35-49, with very few aged over 65. Cluster C3 has a larger share of individuals aged over 65.
The gender distribution also differs across clusters. In Cluster C1, females make up slightly more than half of the cluster; Cluster C2 has a balanced gender representation; and Cluster C3 is predominantly male.
7. Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.
The analysis of cluster preferences for energy drink versions D1, D3 and D5 reveals distinct targeting opportunities. For D1, Cluster C3 shows the highest preference with an average rating of 6.98, making it the most suitable segment for advertising D1. For D3, Cluster C2 emerges as the ideal target with the highest rating of 6.95. Lastly, advertising for D5 should focus on Cluster C1, which gives it the highest rating of 6.76. These recommendations align each version with the preferences of the cluster where it will have maximum impact.
8. If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?
Energy drink version D3 is recommended. It receives the highest rating in Cluster C2 (6.95) and is also rated well by the largest segment, Cluster C1 (6.20), so it appeals to 652 of the 840 consumers across the two biggest segments. Cluster C2 also skews towards a younger demographic, which is more likely to keep engaging with and purchasing energy drinks, offering better long-term brand loyalty and growth.