library(rattle)
library(rpart)
library(tidyverse)Data Analytics Assignment 4- Predictive Analytics and Customer Segmentation
Introduction
This assignment consists of two sections: predictive analysis and customer segmentation, and the author has employed RStudio for the coding process.
Question 1: Predicting Customers Who Will Renew Their Music Subscription
Load required packages
1. Import the data set
setwd("C:\\Users\\HP\\Desktop\\Assignment 4-Gopika Manoj-DA")
sub_train <- read_csv("sub_training.csv")
sub_test <- read_csv("sub_testing.csv")2. Create the classification tree model
sub_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, sub_train)
fancyRpartPlot(sub_tree)3. Interpret the tree model
#Extract the variable importance from the rpart object we have called ko_model_tree
sub_tree$variable.importance lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
a. Rule for Predicting if a Customer Will Re-Subscribe
Path 1 (going down left): If the length of the relationship (lor) is less than 140, the amount of money spent during the last 36 months with the company is less than €182 and the number of times a customer was in contact with the music downloading service is less than 7.5 then they are predicted to not renew their subscription.
Path 2 (going down far right): If the length of the relationship (lor) is greater than 140, the number of times a customer was in contact with the music downloading service is more than 7.5, then they are predicted to renew their music subscription.
Considering the blue leaf at the end of path 2 going down the far right, the following observations can be made:-
The label “Yes” explains a customer’s renewal prediction if they satisfy the rules on this particular path.
The labels 0.39 and 0.61 explain that out of all the customers in the training dataset, categorised in this leaf, 39% of them did not renew their subscription and 61% of them did renew. This indicates the purity of the specific leaf. In this example, if a customer follows path 2 (towards the right) and gets placed in this leaf, they are predicted to renew their subscription with a probability of 0.61 or 60%. Similarly, the labels 0.32 and 0.68 explain that 32% of them did not renew their subscription and 68% did renew. Hence, they are predicted to renew with a probability of 0.68 or 68%.
b. Rule for Predicting if a Customer Will Churn
Path 1 (going down left): If the length of the relationship (lor) is less than 140, the amount of money spent during the last 36 months with the company is less than €182 and the number of times a customer was in contact with the music downloading service is less than 7.5 then they will not churn.
Path 2 (going down right): If the length of the relationship (lor) is greater than 140, the number of times a customer was in contact with the music downloading service is more than 7.5, then they will churn.
The label “No” explains the customer’s churn prediction if they satisfy the rules on this particular path
The labels 0.59 and 0.41 explain that out of all the customers in the training data set, categorised in this leaf, 59% of them did not churn and 41% of them did churn. This indicates the purity of the specific leaf. Additionally, the figure also provides the probability of customer churning. In this example, if a customer follows path 1 (towards the left) and gets placed in this leaf, they are predicted to churn with a probability of 0.41 or 41%.
Similarly, the labels 0.76 and 0.24 explain that 76% of customers did not churn and 24% did churn. This indicates how pure the node is. Thus, they are predicted to churn with a probability of 24%.
Considering the labels 0.57 and 0.43, 57% of the customers did not churn and 43% of them did churn. Hence, they are predicted to churn with a probability of 43%.
c. Key Factors in Predicting Customer Re-Subscription
We use the code “sub_tree$variable.importance” to filter the important predictors of subscription renewal.
The output displayed while running code determines the variables that are considered essential for predicting if a customer will re-subscribe or not.
Hence, ” the length of relationship expressed in days-lor”, “ the amount of money spend during the last 36 months with the company-spend”, “ the age of the customers-age” are the most important variables when compared to “the number of times a customer was in contact with the music downloading service-num_contacts” and “ the elapsed time since last contact-contact_recency” can be considered the major factors.
4. Evaluating Classification Tree Accuracy
#Measure accuracy on Training data
sub_probs <- predict(sub_tree, newdata = sub_train, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_train, type = 'class')
sub_train_updated <- cbind(sub_train, sub_probs, sub_preds)
train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat Predicted
Actual No Yes
No 257 169
Yes 154 270
sum(diag(train_con_mat))/sum(train_con_mat)[1] 0.62
#Measure accuracy on Testing data
sub_probs <- predict(sub_tree, newdata = sub_test, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_test, type = 'class')
sub_test_updated <- cbind(sub_test, sub_probs, sub_preds)
test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat Predicted
Actual No Yes
No 36 38
Yes 35 41
sum(diag(test_con_mat))/sum(test_con_mat)[1] 0.5133333
a. Analysing Overfitting
Measure the accuracy of Training data
The overall model accuracy is (527/850) = 0.62 or 62%
Of all the customers the model predicted to renew, they got (270/439) = 0.61 or 61%
Of all the customers the model predicted not to renew, they got (257/411) = 0.62 or 62%
Of all the customers who did renew, the model correctly identified (270/424) = 0.63 or 63%
Of all the customers who did not renew, the model correctly identified (257/426) = 0.60 or 60%
Since we may not be 100% satisfied with the model accuracy of the training data, it is essential to check the model quality of the testing data too, to assess the model accuracy.
Measure the accuracy of Testing data
The overall model accuracy is (77/150) = 0.51 or 51%
Of all the customers the model predicted to renew, they got (41/79) = 0.51 or 51%
Of all the customers the model predicted not to renew, they got (36/71) = 0.50 or 50%
Of all the customers who did renew, the model correctly identified (41/76) = 0.53 or 53%
Of all the customers who did not renew, the model correctly identified (36/74) = 0.48 or 48%
Conclusion: By measuring the model accuracy of the testing data and training data, it is observed that the model accuracy was 62% for the training data, however, it has dropped to 51% for the testing data. This 11% difference between the training and testing data proves that the classification tree model is overfitting itself to the training data and will not accurately generalise when asked to predict new data in the future. Hence we must prune the tree, in this example.
b. Creating a Pruned Classification Tree
#Prune the tree model to improve predictive power
sub_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, sub_train, maxdepth=3)
fancyRpartPlot(sub_tree_pruned)c. Evaluating the Accuracy of the Pruned Classification Tree Accuracy
# Measure the accuracy of the pruned tree model for training data
sub_probs2 <- predict(sub_tree_pruned, newdata = sub_train, type = 'prob')
sub_preds2 <- predict(sub_tree_pruned, newdata = sub_train, type = 'class')
sub_train_updated2 <- cbind(sub_train, sub_probs2, sub_preds2)
train_con_mat2 <- table(sub_train_updated2$renewed, sub_train_updated2$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat2 Predicted
Actual No Yes
No 266 160
Yes 172 252
sum(diag(train_con_mat2))/sum(train_con_mat2)[1] 0.6094118
# Measure the accuracy of the pruned tree model for testing data
sub_probs2 <- predict(sub_tree_pruned, newdata = sub_test, type = 'prob')
sub_preds2 <- predict(sub_tree_pruned, newdata = sub_test, type = 'class')
sub_test_updated2 <- cbind(sub_test, sub_probs2, sub_preds2)
test_con_mat2 <- table(sub_test_updated2$renewed, sub_test_updated2$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat2 Predicted
Actual No Yes
No 41 33
Yes 38 38
sum(diag(test_con_mat2))/sum(test_con_mat2)[1] 0.5266667
Measure the accuracy of the pruned model of Training data
The overall model accuracy is (518/846) = 0.61 or 61%
Of all the customers the model predicted to renew, they got (252/412) = 0.61 or 61%
Of all the customers the model predicted not to renew, they got (266/436) = 0.61 or 61%
Of all the customers who did renew, the model correctly identified (252/422) = 0.59% or 59%
Of all the customers who did not renew, the model correctly identified (266/426) = 0.62 or 62%
Measure the accuracy of the pruned model of Testing data
The overall model accuracy is (79/150) = 0.52 or 52%
Of all the customers the model predicted to renew, they got (38/71) = 0.53 or 53%
Of all the customers the model predicted not to renew, they got (41/79) = 0.62 or 62%
Of all the customers who did renew, the model correctly identified (41/76) = 0.50 or 50%
Of all the customers who did not renew, the model correctly identified (41/74) = 0.55 or 55%
d. Has pruning the classification tree resulted in less overfitting?
Yes, pruning the tree has resulted in comparatively less overfitting.
The overall model accuracy of the training data decreased from 62% to 61%, which is a minimal difference.
The overall accuracy of the testing data has increased from 51% to 52%. However while conducting other analyses, it was found that the number of customers the model predicted not to renew has increased from 50% to 62%, in the testing data. This shows that though there is not much difference in the overall accuracy of the pruned models for both the testing and training data, the prediction of the customers who chose not to renew showed a significant increase.
Conclusion: Hence, the model suffers less overfitting than the original classification tree and may predict new data effectively. However, since the difference is minimal, it is highly recommended to prune the tree again for producing efficient and insightful data.
5. Marketing Actions
Since customers with a short length of relationship are more likely not to renew their music subscription, it would be ideal to create targeted campaigns for that specific audience. Consider implementing offers, discounts and effective customised recommendations to motivate them to renew their subscription. Additionally, offer trial renewals to improve customer engagement.
Regular follow-ups and support campaigns offering any assistance can retain customers at risk; i.e. customers with fewer contacts, with the music downloading service. Customer ratings can also help alter marketing actions.
The propensity model can be used to provide personalised recommendations to fine-tune current marketing practices to improve renewal rates. The model also reveals customers who are likely to re-subscribe and churn.
Since the model identifies that customers with a “high length of relationship” are more likely to renew the subscription, the company could create campaigns with messages that may seem appealing to this particular demographic. High-risk customers with a “length of the relationship” of less than 140 should be contacted at the earliest or offered incentives so that they can renew their subscriptions.
Question 2: Segmenting Consumers Based on Energy Drink Preference
Load required packages
library(cluster)1. Import the data
setwd("C:/Users/HP/Desktop/Assignments/Data Analytics/Assignment 4/Assessment Files")
energy_drinks <- read_csv("energy_drinks.csv")2. Create a distance matrix containing the Euclidean distance between all pairs of consumers
# Compute distances between participants
# Select D1 to D5
energy_ratings <- select(energy_drinks, D1, D2, D3, D4, D5)
d1 <- dist(energy_ratings)a. Does the data need to be scaled before computing the distance matrix? Explain your answer.
No, scaling is not necessary in this case. This is because the variables D1-D5 represent customer ratings of the energy drinks. Additionally, each participant who tastes each of the five energy drinks rates them on a 1 to 9 Likert Scale. Since they are on the same scale no variables will dominate the Euclidean distance, thus preventing biased clustering results.
3. Hierarchical clustering using the “hclust” function
h1 <- hclust(d1, method= "average")4. Visualisaton of the results using a dendogram and a heatmap
# Heatmap
plot(h1, hang = -1, main= "Energy Drink Preferences")heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.
On analysing the heatmap, it is clear that there is some evidence of light-coloured blocks along the diagonal. Though less convincing, this proves the presence of a clustering structure within the energy drinks dataset.
5. 3-Cluster solution using the cutree function & Quality Assessment
clusters1 <- cutree(h1, k = 3)
# Assess the quality of the segmentation
sil1 <- silhouette(clusters1, d1)6. Profile the clusters
energy_clus <- cbind(energy_drinks, clusters1)
energy_clus <- mutate(energy_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
clusters1 == 2 ~ 'C2',
clusters1 == 3 ~ 'C3'))a. Analysing the size of each cluster and the average ratings for each energy drink
avg_ratings <- energy_clus %>%
group_by(cluster) %>%
summarise(num_customers = n(),
avg_D1= mean(D1),
avg_D2= mean(D2),
avg_D3= mean(D3),
avg_D4= mean(D4),
avg_D5= mean(D5))
# Convert the dataset to a "tidy" format
avg_ratings_tidy <- avg_ratings %>%
pivot_longer(
cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5),
names_to = "Energy_Drink",
values_to = "Average_Rating"
)
# Visualise the average ratings of energy drinks by cluster
ggplot(avg_ratings_tidy, mapping = aes(x = Energy_Drink, y = Average_Rating, group = cluster, colour = cluster)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
ylab("Average Rating") +
xlab("Energy Drink") +
theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
ggtitle("Average Rating of Energy Drinks by Cluster")# Formatted Table
# Load required libraries
library(knitr)
library(kableExtra)
# Format the summarized table using kableExtra
avg_ratings %>%
knitr::kable(
format = "html",
col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2",
"Avg D3", "Avg D4", "Avg D5"),
caption = "<b>Number of Customers and Average Ratings</b>",
align = "lrrrrrr",
table.attr = 'data-quarto-disable-processing = "true"',
digits = c(0,3,2,2,2,2,2)
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center",
font_size = 14
) %>%
column_spec(1, color = "black", background = "#ebe7fa") | Cluster | Number of Customers | Avg D1 | Avg D2 | Avg D3 | Avg D4 | Avg D5 |
|---|---|---|---|---|---|---|
| C1 | 417 | 3.00 | 4.71 | 6.20 | 6.70 | 6.76 |
| C2 | 235 | 2.96 | 4.67 | 6.95 | 5.01 | 2.97 |
| C3 | 188 | 6.98 | 5.37 | 3.03 | 2.88 | 2.84 |
b. Analyse cluster differences based on age and gender
# Age
energy_clus$Age <- factor(energy_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))
# Plot age breakdown by cluster
library(ggplot2)
library(scales)
ggplot(energy_clus, aes(x = Age, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = percent_format()) +
ylab("Percentage of People") +
xlab("Age Group") +
ggtitle("Age Breakdown by Cluster") +
coord_flip() # Gender
energy_clus$Gender <- factor(energy_clus$Gender, levels = c("Male", "Female"))
ggplot(energy_clus, aes(x = Gender, group = cluster)) +
geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
facet_grid(~ cluster) +
scale_y_continuous(labels = scales::percent_format()) +
ylab("Percentage of People") +
xlab("Gender") +
ggtitle("Gender Breakdown by Cluster") +
coord_flip() Conclusion: The young customers aged between 25 and 34, dominates the other age groups. The percentage of men is higher when compared to women.
7. Analyse the suitable cluster at which the company can advertise energy drink versions D1, D3 and D5.
# Analysis can be simplified with a filtered table that contains D1, D3 and D5.
avg_ratings %>%
select(cluster, num_customers, avg_D1, avg_D3, avg_D5) %>% # Select only relevant columns
knitr::kable(
format = "html",
col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D3", "Avg D5"),
caption = "<b>Number of Customers and Average Ratings (D1, D3, D5)</b>",
align = "lrrrr",
table.attr = 'data-quarto-disable-processing = "true"',
digits = c(0, 3, 2, 2, 2)
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center",
font_size = 14
) %>%
column_spec(1, color = "black", background = "#ebe7fa") | Cluster | Number of Customers | Avg D1 | Avg D3 | Avg D5 |
|---|---|---|---|---|
| C1 | 417 | 3.00 | 6.20 | 6.76 |
| C2 | 235 | 2.96 | 6.95 | 2.97 |
| C3 | 188 | 6.98 | 3.03 | 2.84 |
The following recommendations are based on the above table and line graph:
- Cluster 1: Advertise D5
Cluster 1 has the highest average rating for D5 (6.76), making it the popular preference in this segment.
Cluster 2: Advertise D3
Cluster 2 has the highest average rating for D3 (6.95), making it more demanding in this segment.Cluster 3: Advertise D1
Cluster 3 has the highest average rating for D1 (6.98), indicating a stronger choice in this segment.
8. Marketing Actions
Cluster 1 shows a stronger preference for version “D5”, with an average rating of 6.76, almost close to 7.
This rating is more significant than the others since cluster 1 has a higher market share with 417 customers, which is indeed a strong audience. Other versions D1, D2, D3, and D4 have higher ratings, but their audience size is small, so it might not be a reliable observation. Hence, the recommended version is “D5”.