#Load required packages
library(rattle)
library(rpart)
library(tidyverse)
Predictive Analytics and Customer Segmentation
Introduction
This assignment has two questions on predictive analytics and customer segmentation, which will be answered using decision trees and cluster analysis in R.
Question 1 – Predicting Customers Who Will Renew Their Music Subscription
1. Import the data
<- read_csv("sub_training.csv")
sub_train <- read_csv("sub_testing.csv") sub_test
2. Create and visualise a classification tree model
# Step 2: Create the classification tree model
sub_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_train)
fancyRpartPlot(sub_tree)
3. Interpret the classification tree
- The rules for predicting if a customer will re-subscribe.
Path 1 (going down the far left): if the length of relationship (lor) is less than 140 days and the amount spent with the company during the last 36 months is less than €182, then the customer is predicted not to renew their downloading subscription.
Path 2 (going down the right): if the length of relationship (lor) is 140 days or more, then the customer is predicted to renew their subscription.
- The probability of the prediction being correct. Consider the blue leaf at the end of Path 2 going down right:
The label “Yes” tells you the renewal prediction for a customer who satisfies the rules on this path.
The labels 0.39 and 0.61 tell you that, of all customers in the training dataset who fall into this leaf, 39% did not renew their subscription and 61% did. These figures give an idea of how pure the leaf is, and they also provide the probability of the prediction being correct. In this case, if a customer follows Path 2 and ends up in this leaf, they are predicted to renew with a probability of 0.61 or 61%, which is only a moderately strong prediction.
- The rules for predicting if a customer will churn.
Path 1 (going down the far left): if the length of relationship (lor) is less than 140 days and the amount spent with the company during the last 36 months is less than €182, then the customer is predicted to churn (i.e. not renew).
Path 2 (going down the right): if the length of relationship (lor) is 140 days or more, then the customer is predicted not to churn.
- The probability of the prediction being correct. Consider the green leaf at the end of Path 1 going down far left.
The label “No” tells you the prediction (no renewal, i.e. churn) for a customer who satisfies the rules on Path 1.
The labels 0.76 and 0.24 tell you that, of all customers in the training dataset who fall into this leaf, 76% did not renew (churned) and 24% did renew. These figures give an idea of how pure the leaf is, and they also provide the probability of the prediction being correct. In this case, if a customer follows Path 1 and ends up in this leaf, they are predicted to churn with a probability of 0.76 or 76%.
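These leaf counts and probabilities can also be read directly from the fitted tree rather than off the plot. A minimal sketch, assuming the sub_tree object from Step 2 (yval2 is an internal rpart matrix whose columns hold each node's fitted class, class counts, class probabilities and share of the data):
# Extract the counts and probabilities for each leaf from the rpart frame
leaves <- sub_tree$frame[sub_tree$frame$var == "<leaf>", ]
leaves$n     # number of training customers falling into each leaf
leaves$yval2 # fitted class, class counts, class probabilities, node share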
- Important variables in predicting whether a customer will re-subscribe or not.
The length of relationship (lor), the amount spent with the company during the last 36 months (spend), and age are the most important variables for predicting whether or not a customer will re-subscribe.
#Extract the variable importance from the rpart object we have called sub_tree
sub_tree$variable.importance
lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
#Extract the variable importance as a percentage of all improvements to the model
summary(sub_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, data = sub_train)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.1108491 0.03417864
2 0.01179245 1 0.8066038 0.8514151 0.03398856
3 0.01000000 4 0.7617925 0.8514151 0.03398856
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
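As a quick cross-check, the Variable importance percentages reported by summary() can be reproduced by normalising the raw importance vector extracted earlier (a minimal sketch, assuming sub_tree from Step 2):
# Express each variable's improvement as a percentage of the total
round(100 * sub_tree$variable.importance / sum(sub_tree$variable.importance))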
4. Assess the accuracy of the classification tree using both the training and the testing datasets.
# Step 4: Measure accuracy on Training data
sub_probs <- predict(sub_tree, newdata = sub_train, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_train, type = 'class')

sub_train_updated <- cbind(sub_train, sub_probs, sub_preds)

train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat
Predicted
Actual No Yes
No 257 169
Yes 154 270
sum(diag(train_con_mat))/sum(train_con_mat)
[1] 0.62
From the above confusion matrix, we know the following (the sketch after this list shows how these rates can be computed in R):
The overall model accuracy is (257+270)/850 = 0.62 or 62%.
Of all customers the model predicted to renew, it was correct for 270/439 = 0.61 or 61%.
Of all customers the model predicted not to renew, it was correct for 257/411 = 0.62 or 62%.
Of all customers who did renew, the model correctly identified 270/424 = 0.63 or 63%.
Of all customers who did not renew, the model correctly identified 257/426 = 0.60 or 60%.
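Here is the sketch referred to above (assuming train_con_mat from the previous chunk), deriving the same rates programmatically instead of by hand:
# Overall accuracy, precision and recall for the 'Yes' (renewed) class
accuracy <- sum(diag(train_con_mat)) / sum(train_con_mat)
precision_yes <- train_con_mat["Yes", "Yes"] / sum(train_con_mat[, "Yes"])
recall_yes <- train_con_mat["Yes", "Yes"] / sum(train_con_mat["Yes", ])
c(accuracy = accuracy, precision_yes = precision_yes, recall_yes = recall_yes)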
At this stage the model seems only moderately useful: it is right 62% of the time, which is not great, and it misses quite a lot of customers who did not renew. More importantly, we need to assess the model’s accuracy on the testing dataset, because a classification tree often overfits the training data, so we cannot rely on the training accuracy alone.
#Measure accuracy on Testing data
sub_probs <- predict(sub_tree, newdata = sub_test, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_test, type = 'class')

sub_test_updated <- cbind(sub_test, sub_probs, sub_preds)

test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
Predicted
Actual No Yes
No 36 38
Yes 35 41
sum(diag(test_con_mat))/sum(test_con_mat)
[1] 0.5133333
From the above confusion matrix, we know the following:
The overall model accuracy is (36+41)/150 = 0.51 or 51%.
Of all customers the model predicted to renew, it was correct for 41/79 = 0.51 or 51%.
Of all customers the model predicted not to renew, it was correct for 36/71 = 0.50 or 50%.
Of all customers who did renew, the model correctly identified 41/76 = 0.53 or 53%.
Of all customers who did not renew, the model correctly identified 36/74 = 0.48 or 48%.
- Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.
From the findings, the overall accuracy of the model was 62% on the training dataset but dropped to 51% on the testing dataset (a drop of 11 percentage points). This suggests that the classification tree is overfitting the training dataset and will not generalise well when asked to predict new, unseen data in the future. In this situation it is common to prune the tree.
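The refit with maxdepth = 3 below is a simple form of pre-pruning. As an alternative, rpart supports cost-complexity pruning based on the CP table shown in the summary() output above; a sketch, assuming sub_tree from Step 2:
# Inspect the cross-validated error (xerror) for each complexity parameter value
printcp(sub_tree)
# Prune back to the CP value with the lowest cross-validated error
best_cp <- sub_tree$cptable[which.min(sub_tree$cptable[, "xerror"]), "CP"]
sub_tree_pruned <- prune(sub_tree, cp = best_cp)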
- Prune the tree model to improve predictive power
sub_tree2 <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_train, maxdepth = 3)
fancyRpartPlot(sub_tree2)
- Assess the accuracy of the pruned tree on the training and testing datasets.
# Measure the accuracy of the pruned tree model for training data
sub_probs2 <- predict(sub_tree2, newdata = sub_train, type = 'prob')
sub_preds2 <- predict(sub_tree2, newdata = sub_train, type = 'class')

sub_train_updated2 <- cbind(sub_train, sub_probs2, sub_preds2)

train_con_mat2 <- table(sub_train_updated2$renewed, sub_train_updated2$sub_preds2, dnn=c('Actual', 'Predicted'))
train_con_mat2
Predicted
Actual No Yes
No 266 160
Yes 172 252
sum(diag(train_con_mat2))/sum(train_con_mat2)
[1] 0.6094118
From the above confusion matrix, we know the following:
The overall model accuracy is (266+252)/850 = 0.61 or 61%.
Of all customers the model predicted to renew, it was correct for 252/412 = 0.61 or 61%.
Of all customers the model predicted not to renew, it was correct for 266/438 = 0.61 or 61%.
Of all customers who did renew, the model correctly identified 252/424 = 0.59 or 59%.
Of all customers who did not renew, the model correctly identified 266/426 = 0.62 or 62%.
# Measure the accuracy of the pruned tree model for testing data
sub_probs2 <- predict(sub_tree2, newdata = sub_test, type = 'prob')
sub_preds2 <- predict(sub_tree2, newdata = sub_test, type = 'class')

sub_test_updated2 <- cbind(sub_test, sub_probs2, sub_preds2)

test_con_mat2 <- table(sub_test_updated2$renewed, sub_test_updated2$sub_preds2, dnn = c('Actual', 'Predicted'))
test_con_mat2
Predicted
Actual No Yes
No 41 33
Yes 38 38
sum(diag(test_con_mat2))/sum(test_con_mat2)
[1] 0.5266667
From the above confusion matrix, we know the following:
The overall model accuracy is (41+38)/150 = 0.52 or 52%.
Of all customers the model predicted to renew, it was correct for 38/71 = 0.53 or 53%.
Of all customers the model predicted not to renew, it was correct for 41/79 = 0.51 or 51%.
Of all customers who did renew, the model correctly identified 38/76 = 0.50 or 50%.
Of all customers who did not renew, the model correctly identified 41/74 = 0.55 or 55%.
- Has pruning the classification tree resulted in less overfitting? Explain.
Yes, pruning the classification tree has resulted in less overfitting, but the difference is very small.
Pruning the tree decreased the overall training accuracy only slightly, from 62% to 61%. However, the overall accuracy on the testing dataset increased from 51% to 52%, which indicates that the pruned tree does not suffer from as much overfitting as the original tree and should predict new data more accurately.
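The train/test gap can be summarised for both trees in one place (a sketch, assuming the four confusion matrices computed above), since this gap is the quantity we use to judge overfitting:
# Compare the train/test accuracy gap of the original and pruned trees
acc <- function(cm) sum(diag(cm)) / sum(cm)
data.frame(model = c("original", "pruned"),
           train = c(acc(train_con_mat), acc(train_con_mat2)),
           test = c(acc(test_con_mat), acc(test_con_mat2)),
           gap = c(acc(train_con_mat) - acc(test_con_mat),
                   acc(train_con_mat2) - acc(test_con_mat2)))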
5. Based on your analysis, suggest some actions the company could take to improve their renewal rate. How could your propensity model be used for marketing purposes?
To improve renewal rates, the company should use targeted campaigns based on customer segmentation. High-risk customers should be offered discounts, trial renewals, and personalised content to drive engagement. Low-risk, long-term customers should be rewarded with loyalty programmes and exclusive offers to increase commitment. Moderate-risk customers should receive targeted discounts, satisfaction surveys and follow-up campaigns for more engagement.
Personalised marketing based on the propensity model can send tailored recommendations, upsell services, and cross-sell relevant features.
Using A/B testing, the company can refine its campaigns by identifying the most effective incentives.
The propensity model enables personalised marketing, dynamic pricing strategies, and churn prevention. It helps in segmenting customers into high-, moderate- and low-risk groups and in predicting future outcomes.
It also supports optimising marketing efforts, allocating resources efficiently, and improving customer retention and acquisition strategies. Overall, it helps the company make data-driven decisions that improve customer engagement and renewal rates.
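To illustrate how the propensity model could drive this targeting, the sketch below scores customers by their predicted renewal probability and selects an uncertain middle band for retention offers. It reuses sub_test as a stand-in for a batch of new customers, and the 0.4–0.6 band is an assumed threshold rather than one derived from the analysis:
# Score customers by their predicted probability of renewing
renew_prob <- predict(sub_tree2, newdata = sub_test, type = 'prob')[, "Yes"]
# Assumed 'moderate-risk' band: customers most likely to be swayed by an offer
target_customers <- sub_test[renew_prob >= 0.4 & renew_prob <= 0.6, ]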
Question 2 – Segmenting Consumers Based on Energy Drink Preference
#Load required packages
library(cluster)
library(ggplot2)
library(kableExtra)
library(knitr)
library(scales)
1. Import the data
<- read_csv("energy_drinks.csv") energy_drinks
2. Create a distance matrix containing the Euclidean distance between all pairs of consumers
# Compute distances between each pair of consumers
energy_ratings <- select(energy_drinks, D1:D5)
d1 <- dist(energy_ratings)
- Does the data need to be scaled before computing the distance matrix? Explain.
No, because the five variables (the ratings of the energy drink versions) that will be entered into the clustering algorithm are all measured on the same 9-point Likert scale. There is therefore no need for scaling, and we can proceed directly to calculating the Euclidean distances between the customers' ratings.
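Had the variables been on different scales (for example, Likert ratings mixed with spend in euros), we would standardise them before computing the distance matrix; a minimal sketch of that alternative:
# Standardise each column to mean 0 and sd 1, then compute distances
energy_scaled <- scale(energy_ratings)
d1_scaled <- dist(energy_scaled)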
3. Carry out a hierarchical clustering using the hclust function. Use method = “average”
# Carry out the hierarchical clustering
h1 <- hclust(d1, method = "average")
4. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap
plot(h1, hang = -1)
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)
- Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain.
Yes, the heatmap provides evidence of clustering within the energy drinks dataset.
The hierarchical dendrograms along the rows and columns suggest that the consumers are grouped into clusters based on their similarity, which indicates relationships within the dataset.
The heatmap displays block-like structures along the diagonal, and the varying intensity of colours within the blocks reveals differences in the similarity of the data points.
5. Create a 3-cluster solution using the cutree function and assess the quality of this solution
# 3-cluster solution
clusters1 <- cutree(h1, k = 3)

# Assess the quality of the segmentation
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
417 235 188
0.2249249 0.1987262 0.3918562
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3120 0.1599 0.2916 0.2550 0.3716 0.5502
The overall cluster analysis has a mean silhouette score of 0.255, which means that the analysis has uncovered a weak clustering structure that could be artificial. The average silhouette scores for the individual clusters are analysed below, followed by a sketch that checks other cluster counts:
Clusters 1 and 2 have scores of 0.22 and 0.19 respectively, meaning that no substantial structure has been found in them.
Cluster 3 has a score of 0.39, which means that this cluster is weak and could be artificial.
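Here is the sketch mentioned above (assuming h1 and d1 from the earlier steps), comparing the mean silhouette width for several values of k to check whether a different number of clusters yields a stronger structure:
# Mean silhouette width for k = 2 to 5 clusters
sapply(2:5, function(k) {
  cl <- cutree(h1, k = k)
  mean(silhouette(cl, d1)[, "sil_width"])
})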
6. Profile the clusters. Include any graphs/tables necessary to support your profiling
# Profile the clusters.
energy_clus <- cbind(energy_drinks, clusters1)
energy_clus <- mutate(energy_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                       clusters1 == 2 ~ 'C2',
                                                       clusters1 == 3 ~ 'C3'))
- How do the clusters differ on their average rating of each version of the energy drinks?
# Create a graph showing the average ratings of energy drinks by cluster
size_rating <- energy_clus %>%
  group_by(cluster) %>%
  summarise(num_customers = n(),
            avg_D1 = mean(D1),
            avg_D2 = mean(D2),
            avg_D3 = mean(D3),
            avg_D4 = mean(D4),
            avg_D5 = mean(D5))

knitr::kable(size_rating,
             format = "html",
             digits = c(0,3,2,2,2,2,2),
             align = "lrrrrrr",
             col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2", "Avg D3", "Avg D4", "Avg D5"),
             caption = "<div style='text-align: center;'>Average Ratings of Energy drinks by Cluster</div>",
             table.attr = 'data-quarto-disable-processing="true"') %>%
  kableExtra::kable_styling(
    full_width = F,
    bootstrap_options = c("striped", "hover", "condensed"),
    position = "center", font_size = 12) %>%
  kableExtra::row_spec(0, bold = TRUE, color = "white", background = "gray")
| Cluster | Number of Customers | Avg D1 | Avg D2 | Avg D3 | Avg D4 | Avg D5 |
|---------|---------------------|--------|--------|--------|--------|--------|
| C1 | 417 | 3.00 | 4.71 | 6.20 | 6.70 | 6.76 |
| C2 | 235 | 2.96 | 4.67 | 6.95 | 5.01 | 2.97 |
| C3 | 188 | 6.98 | 5.37 | 3.03 | 2.88 | 2.84 |
# Convert the dataset to be in "tidy" format
energy_rating_tidy <- size_rating %>%
  pivot_longer(cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5),
               names_to = "Energy_drinks", values_to = "Average_Ratings")

# Visualise the average ratings of energy drinks by cluster
ggplot(energy_rating_tidy, mapping = aes(x = Energy_drinks, y = Average_Ratings, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("#A752A0", "#FCCA3A", "green4")) +
  ylab("Average Ratings") +
  xlab("Energy drinks") +
  theme(axis.text.x = element_text(hjust = 1)) +
  ggtitle("Average Ratings of Energy drinks by Cluster")
The table and graph above allow us to profile each of the clusters according to the different variables included in the original dataset. From these, the following are observed:
Cluster 1 is the largest segment with 417 customers and has the highest ratings for D4 (6.70) and D5 (6.76).
Cluster 2 is the second-largest segment with 235 customers and has the highest rating for D3 (6.95) but strongly dislikes D5 (2.97).
Cluster 3 is the smallest segment with 188 customers and strongly favours the earliest version, D1 (6.98), with ratings declining for later versions.
- How do the clusters differ on age and gender?
# Age: Use barcharts and faceting to visualise the distribution of age within each cluster.
#First reorder the age brackets in chronological order.
energy_clus$Age <- factor(energy_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))
ggplot(energy_clus, aes(x = Age, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of People") +
xlab("Age Group") +
ggtitle("Age Breakdown by Cluster") +
coord_flip()
# Gender
energy_clus$Gender <- factor(energy_clus$Gender, levels = c("Male", "Female"))
ggplot(energy_clus, aes(x = Gender, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent) +
ylab("Percentage of People") +
xlab("Gender") +
ggtitle("Gender Breakdown by Cluster") +
coord_flip()
In all three clusters, individuals aged 25-34 form the largest group. In Cluster C1, the proportion of older individuals is smaller. Cluster C2 shows a significant proportion of individuals aged 35-49, with very few aged over 65. Cluster C3 has a larger share of individuals aged over 65.
The gender distribution also differs across clusters. In Cluster C1, females make up slightly more than half of the cluster; Cluster C2 has a balanced gender representation; and Cluster C3 is predominantly male.
7. Advise the company on the suitable segment/cluster at which to advertise energy drink versions D1, D3 and D5.
The analysis of cluster preferences for energy drink versions D1, D3 and D5 reveals distinct targeting opportunities. For D1, Cluster C3 shows the highest preference with an average rating of 6.98, making it the most suitable segment for advertising D1. For D3, Cluster C2 emerges as the ideal target with the highest rating of 6.95. Lastly, advertising for D5 should focus on Cluster C1, which gives it the highest rating of 6.76. These recommendations align each version with the preferences of the cluster where it will have maximum impact.
8. If the company had to choose just one version of the energy drink to continue producing, then which one do you recommend and why?
Energy drink version D3 is recommended. It receives the highest rating in Cluster C2 (6.95) and is also rated well by the largest segment, Cluster C1 (6.20), so it appeals to 652 of the 840 consumers across the two biggest segments. Cluster C2 also skews towards a younger demographic, which is more likely to keep engaging with and purchasing energy drinks, offering better long-term brand loyalty and growth.