Data Analytics Assignment 4- Predictive Analytics and Customer Segmentation

Author

Gopika Manoj

Introduction

This assignment consists of two sections: predictive analysis and customer segmentation, and the author has employed RStudio for the coding process.

Question 1: Predicting Customers Who Will Renew Their Music Subscription

Load required packages

library(rattle)
library(rpart)
library(tidyverse)

1. Import the data set

setwd("C:\\Users\\HP\\Desktop\\Assignment 4-Gopika Manoj-DA")

sub_train <- read_csv("sub_training.csv")
sub_test <- read_csv("sub_testing.csv")

2. Create the classification tree model

sub_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, sub_train)
fancyRpartPlot(sub_tree)

3. Interpret the tree model

#Extract the variable importance from the rpart object we have called ko_model_tree

sub_tree$variable.importance

            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338

b. Rule for Predicting if a Customer Will Churn

Path 1 (going down left): If the length of the relationship (lor) is less than 140, the amount of money spent during the last 36 months with the company is less than €182 and the number of times a customer was in contact with the music downloading service is less than 7.5 then they will not churn.
Path 2 (going down right): If the length of the relationship (lor) is greater than 140, the number of times a customer was in contact with the music downloading service is more than 7.5, then they will churn.
The label “No” explains the customer’s churn prediction if they satisfy the rules on this particular path
The labels 0.59 and 0.41 explain that out of all the customers in the training data set, categorised in this leaf, 59% of them did not churn and 41% of them did churn. This indicates the purity of the specific leaf. Additionally, the figure also provides the probability of customer churning. In this example, if a customer follows path 1 (towards the left) and gets placed in this leaf, they are predicted to churn with a probability of 0.41 or 41%.
Similarly, the labels 0.76 and 0.24 explain that 76% of customers did not churn and 24% did churn. This indicates how pure the node is. Thus, they are predicted to churn with a probability of 24%.
Considering the labels 0.57 and 0.43, 57% of the customers did not churn and 43% of them did churn. Hence, they are predicted to churn with a probability of 43%.

c. Key Factors in Predicting Customer Re-Subscription

We use the code “sub_tree$variable.importance” to filter the important predictors of subscription renewal.
The output displayed while running code determines the variables that are considered essential for predicting if a customer will re-subscribe or not.
Hence, ” the length of relationship expressed in days-lor”, “ the amount of money spend during the last 36 months with the company-spend”, “ the age of the customers-age” are the most important variables when compared to “the number of times a customer was in contact with the music downloading service-num_contacts” and “ the elapsed time since last contact-contact_recency” can be considered the major factors.

4. Evaluating Classification Tree Accuracy

#Measure accuracy on Training data

sub_probs <- predict(sub_tree, newdata = sub_train, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_train, type = 'class')

sub_train_updated <- cbind(sub_train, sub_probs, sub_preds)


train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat

      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

sum(diag(train_con_mat))/sum(train_con_mat)

[1] 0.62

#Measure accuracy on Testing data

sub_probs <- predict(sub_tree, newdata = sub_test, type = 'prob')
sub_preds <- predict(sub_tree, newdata = sub_test, type = 'class')

sub_test_updated <- cbind(sub_test, sub_probs, sub_preds)


test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat

      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

sum(diag(test_con_mat))/sum(test_con_mat)

[1] 0.5133333

a. Analysing Overfitting

Measure the accuracy of Training data

The overall model accuracy is (527/850) = 0.62 or 62%
Of all the customers the model predicted to renew, they got (270/439) = 0.61 or 61%
Of all the customers the model predicted not to renew, they got (257/411) = 0.62 or 62%
Of all the customers who did renew, the model correctly identified (270/424) = 0.63 or 63%
Of all the customers who did not renew, the model correctly identified (257/426) = 0.60 or 60%
Since we may not be 100% satisfied with the model accuracy of the training data, it is essential to check the model quality of the testing data too, to assess the model accuracy.

Measure the accuracy of Testing data

The overall model accuracy is (77/150) = 0.51 or 51%
Of all the customers the model predicted to renew, they got (41/79) = 0.51 or 51%
Of all the customers the model predicted not to renew, they got (36/71) = 0.50 or 50%
Of all the customers who did renew, the model correctly identified (41/76) = 0.53 or 53%
Of all the customers who did not renew, the model correctly identified (36/74) = 0.48 or 48%

Conclusion: By measuring the model accuracy of the testing data and training data, it is observed that the model accuracy was 62% for the training data, however, it has dropped to 51% for the testing data. This 11% difference between the training and testing data proves that the classification tree model is overfitting itself to the training data and will not accurately generalise when asked to predict new data in the future. Hence we must prune the tree, in this example.

b. Creating a Pruned Classification Tree

#Prune the tree model to improve predictive power

sub_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + age + gender, sub_train, maxdepth=3)
fancyRpartPlot(sub_tree_pruned)

c. Evaluating the Accuracy of the Pruned Classification Tree Accuracy

# Measure the accuracy of the pruned tree model for training data

sub_probs2 <- predict(sub_tree_pruned, newdata = sub_train, type = 'prob')
sub_preds2 <- predict(sub_tree_pruned, newdata = sub_train, type = 'class')

sub_train_updated2 <- cbind(sub_train, sub_probs2, sub_preds2)


train_con_mat2 <- table(sub_train_updated2$renewed, sub_train_updated2$sub_preds, dnn=c('Actual', 'Predicted'))
train_con_mat2

      Predicted
Actual  No Yes
   No  266 160
   Yes 172 252

sum(diag(train_con_mat2))/sum(train_con_mat2)

[1] 0.6094118

# Measure the accuracy of the pruned tree model for testing data

sub_probs2 <- predict(sub_tree_pruned, newdata = sub_test, type = 'prob')
sub_preds2 <- predict(sub_tree_pruned, newdata = sub_test, type = 'class')

sub_test_updated2 <- cbind(sub_test, sub_probs2, sub_preds2)


test_con_mat2 <- table(sub_test_updated2$renewed, sub_test_updated2$sub_preds, dnn = c('Actual', 'Predicted'))
test_con_mat2

      Predicted
Actual No Yes
   No  41  33
   Yes 38  38

sum(diag(test_con_mat2))/sum(test_con_mat2)

[1] 0.5266667

Measure the accuracy of the pruned model of Training data

The overall model accuracy is (518/846) = 0.61 or 61%
Of all the customers the model predicted to renew, they got (252/412) = 0.61 or 61%
Of all the customers the model predicted not to renew, they got (266/436) = 0.61 or 61%
Of all the customers who did renew, the model correctly identified (252/422) = 0.59% or 59%
Of all the customers who did not renew, the model correctly identified (266/426) = 0.62 or 62%

Measure the accuracy of the pruned model of Testing data

The overall model accuracy is (79/150) = 0.52 or 52%
Of all the customers the model predicted to renew, they got (38/71) = 0.53 or 53%
Of all the customers the model predicted not to renew, they got (41/79) = 0.62 or 62%
Of all the customers who did renew, the model correctly identified (41/76) = 0.50 or 50%
Of all the customers who did not renew, the model correctly identified (41/74) = 0.55 or 55%

d. Has pruning the classification tree resulted in less overfitting?

Yes, pruning the tree has resulted in comparatively less overfitting.
The overall model accuracy of the training data decreased from 62% to 61%, which is a minimal difference.
The overall accuracy of the testing data has increased from 51% to 52%. However while conducting other analyses, it was found that the number of customers the model predicted not to renew has increased from 50% to 62%, in the testing data. This shows that though there is not much difference in the overall accuracy of the pruned models for both the testing and training data, the prediction of the customers who chose not to renew showed a significant increase.

Conclusion: Hence, the model suffers less overfitting than the original classification tree and may predict new data effectively. However, since the difference is minimal, it is highly recommended to prune the tree again for producing efficient and insightful data.

5. Marketing Actions

Since customers with a short length of relationship are more likely not to renew their music subscription, it would be ideal to create targeted campaigns for that specific audience. Consider implementing offers, discounts and effective customised recommendations to motivate them to renew their subscription. Additionally, offer trial renewals to improve customer engagement.
Regular follow-ups and support campaigns offering any assistance can retain customers at risk; i.e. customers with fewer contacts, with the music downloading service. Customer ratings can also help alter marketing actions.
The propensity model can be used to provide personalised recommendations to fine-tune current marketing practices to improve renewal rates. The model also reveals customers who are likely to re-subscribe and churn.
Since the model identifies that customers with a “high length of relationship” are more likely to renew the subscription, the company could create campaigns with messages that may seem appealing to this particular demographic. High-risk customers with a “length of the relationship” of less than 140 should be contacted at the earliest or offered incentives so that they can renew their subscriptions.

Question 2: Segmenting Consumers Based on Energy Drink Preference

Load required packages

library(cluster)

1. Import the data

setwd("C:/Users/HP/Desktop/Assignments/Data Analytics/Assignment 4/Assessment Files")

energy_drinks <- read_csv("energy_drinks.csv")

2. Create a distance matrix containing the Euclidean distance between all pairs of consumers

# Compute distances between participants

# Select D1 to D5

energy_ratings <- select(energy_drinks, D1, D2, D3, D4, D5)
d1 <- dist(energy_ratings)

a. Does the data need to be scaled before computing the distance matrix? Explain your answer.

No, scaling is not necessary in this case. This is because the variables D1-D5 represent customer ratings of the energy drinks. Additionally, each participant who tastes each of the five energy drinks rates them on a 1 to 9 Likert Scale. Since they are on the same scale no variables will dominate the Euclidean distance, thus preventing biased clustering results.

3. Hierarchical clustering using the “hclust” function

h1 <- hclust(d1, method= "average")

4. Visualisaton of the results using a dendogram and a heatmap

# Heatmap

plot(h1, hang = -1, main= "Energy Drink Preferences")

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.

On analysing the heatmap, it is clear that there is some evidence of light-coloured blocks along the diagonal. Though less convincing, this proves the presence of a clustering structure within the energy drinks dataset.

5. 3-Cluster solution using the cutree function & Quality Assessment

clusters1 <- cutree(h1, k = 3)

# Assess the quality of the segmentation

sil1 <- silhouette(clusters1, d1)

6. Profile the clusters

energy_clus <- cbind(energy_drinks, clusters1)
energy_clus <- mutate(energy_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                   clusters1 == 2 ~ 'C2',
                                                   clusters1 == 3 ~ 'C3'))

a. Analysing the size of each cluster and the average ratings for each energy drink

avg_ratings <- energy_clus %>%
  group_by(cluster) %>%
  summarise(num_customers = n(),
           avg_D1= mean(D1),
           avg_D2= mean(D2),
           avg_D3= mean(D3),
           avg_D4= mean(D4),
           avg_D5= mean(D5))



# Convert the dataset to a "tidy" format 

avg_ratings_tidy <- avg_ratings %>%
  pivot_longer(
    cols = c(avg_D1, avg_D2, avg_D3, avg_D4, avg_D5), 
    names_to = "Energy_Drink", 
    values_to = "Average_Rating"  
  )


# Visualise the average ratings of energy drinks by cluster

ggplot(avg_ratings_tidy, mapping = aes(x = Energy_Drink, y = Average_Rating, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  ylab("Average Rating") + 
  xlab("Energy Drink") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) + 
  ggtitle("Average Rating of Energy Drinks by Cluster")

# Formatted Table

# Load required libraries

library(knitr)
library(kableExtra)


# Format the summarized table using kableExtra

avg_ratings %>%

  knitr::kable(
    format = "html",
    col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D2", 
                  "Avg D3", "Avg D4", "Avg D5"),  
    caption = "<b>Number of Customers and Average Ratings</b>",  
    align = "lrrrrrr",  
    table.attr = 'data-quarto-disable-processing = "true"',
    digits = c(0,3,2,2,2,2,2)
  ) %>%
  
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
    full_width = FALSE,  
    position = "center",  
    font_size = 14  
  ) %>%
  
  column_spec(1, color = "black", background = "#ebe7fa")

**Number of Customers and Average Ratings**
Cluster	Number of Customers	Avg D1	Avg D2	Avg D3	Avg D4	Avg D5
C1	417	3.00	4.71	6.20	6.70	6.76
C2	235	2.96	4.67	6.95	5.01	2.97
C3	188	6.98	5.37	3.03	2.88	2.84

b. Analyse cluster differences based on age and gender

# Age

energy_clus$Age <- factor(energy_clus$Age, levels = c("Under_25", "25_34", "35_49", "50_64", "Over_65"))

# Plot age breakdown by cluster

library(ggplot2)
library(scales)

ggplot(energy_clus, aes(x = Age, group = cluster)) + 
  geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +  
  scale_y_continuous(labels = percent_format()) +  
  ylab("Percentage of People") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Cluster") +
  coord_flip()

# Gender

energy_clus$Gender <- factor(energy_clus$Gender, levels = c("Male", "Female"))

ggplot(energy_clus, aes(x = Gender, group = cluster)) + 
  geom_bar(aes(y = ..prop.., fill = cluster), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) + 
  scale_y_continuous(labels = scales::percent_format()) +  
  ylab("Percentage of People") + 
  xlab("Gender") +
  ggtitle("Gender Breakdown by Cluster") +
  coord_flip()

Conclusion: The young customers aged between 25 and 34, dominates the other age groups. The percentage of men is higher when compared to women.

7. Analyse the suitable cluster at which the company can advertise energy drink versions D1, D3 and D5.

# Analysis can be simplified with a filtered table that contains D1, D3 and D5.

avg_ratings %>%
  
  select(cluster, num_customers, avg_D1, avg_D3, avg_D5) %>%  # Select only relevant columns
  knitr::kable(
    format = "html",
    col.names = c("Cluster", "Number of Customers", "Avg D1", "Avg D3", "Avg D5"),  
    caption = "<b>Number of Customers and Average Ratings (D1, D3, D5)</b>", 
    align = "lrrrr",  
    table.attr = 'data-quarto-disable-processing = "true"',
    digits = c(0, 3, 2, 2, 2) 
  ) %>%
  
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
    full_width = FALSE,  
    position = "center",  
    font_size = 14  
  ) %>%
  
  column_spec(1, color = "black", background = "#ebe7fa")

**Number of Customers and Average Ratings (D1, D3, D5)**
Cluster	Number of Customers	Avg D1	Avg D3	Avg D5
C1	417	3.00	6.20	6.76
C2	235	2.96	6.95	2.97
C3	188	6.98	3.03	2.84

The following recommendations are based on the above table and line graph:

Cluster 1: Advertise D5
Cluster 1 has the highest average rating for D5 (6.76), making it the popular preference in this segment.

Cluster 2: Advertise D3
Cluster 2 has the highest average rating for D3 (6.95), making it more demanding in this segment.
Cluster 3: Advertise D1
Cluster 3 has the highest average rating for D1 (6.98), indicating a stronger choice in this segment.

8. Marketing Actions

Cluster 1 shows a stronger preference for version “D5”, with an average rating of 6.76, almost close to 7.
This rating is more significant than the others since cluster 1 has a higher market share with 417 customers, which is indeed a strong audience. Other versions D1, D2, D3, and D4 have higher ratings, but their audience size is small, so it might not be a reliable observation. Hence, the recommended version is “D5”.

Introduction

Question 1: Predicting Customers Who Will Renew Their Music Subscription

Load required packages

1. Import the data set

2. Create the classification tree model

3. Interpret the tree model

a. Rule for Predicting if a Customer Will Re-Subscribe

b. Rule for Predicting if a Customer Will Churn

c. Key Factors in Predicting Customer Re-Subscription

4. Evaluating Classification Tree Accuracy

a. Analysing Overfitting

b. Creating a Pruned Classification Tree

c. Evaluating the Accuracy of the Pruned Classification Tree Accuracy

d. Has pruning the classification tree resulted in less overfitting?

5. Marketing Actions

Question 2: Segmenting Consumers Based on Energy Drink Preference

Load required packages

1. Import the data

2. Create a distance matrix containing the Euclidean distance between all pairs of consumers

a. Does the data need to be scaled before computing the distance matrix? Explain your answer.

3. Hierarchical clustering using the “hclust” function

4. Visualisaton of the results using a dendogram and a heatmap

a. Does the heatmap provide evidence of any clustering structure within the energy drinks dataset? Explain your answer.

5. 3-Cluster solution using the cutree function & Quality Assessment

6. Profile the clusters

a. Analysing the size of each cluster and the average ratings for each energy drink

b. Analyse cluster differences based on age and gender

7. Analyse the suitable cluster at which the company can advertise energy drink versions D1, D3 and D5.

8. Marketing Actions