#Load required packages
library(rattle)
library(rpart)
library(tidyverse)
#Step 1: Import the sub_training.csv and sub_testing.csv datasets into R
<- read_csv("sub_training.csv")
sub_training <- read_csv("sub_testing.csv")
sub_testing
#Step 2: Create and visualise a classification tree model
<- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_training )
sub_tree fancyRpartPlot(sub_tree)
Music Subscription Analysis
Predicting Customers Who Will Renew Their Music Subscription
1. Import the sub_training.csv and sub_testing.csv data sets into R.
2. Create and visualize a classification tree model that will allow you to predict if a customer will re-subscribe or churn.
3. Interpret the classification tree
a. Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.
Path 1: If Lor is not less than 140 and Num_contacts is not less than 7.5, the customer is likely to re-subscribe.
The label “Yes” in Leaf 3 tells that the customer is likely to re-subscribe if they satisfy the rules on Path. The labels 0.39 and 0.61 tell that of all customers in the training data set that fall into this leaf, 39% of them are not likely to re-subscribe and 61% of them are likely to re-subscribe.
The value of 45% tells us that 45% of all customers in the training dataset fall into the leaf.
Path 2: If Lor is less than 140, and Spend is less than 182, then the customer is not likely to re-subscribe.
The label “No” in Leaf 4 tells that the customer is not likely to re-subscribe if they satisfy the rules on Path. The labels 0.76 and 0.24 tell that of all customers in the training data set that fall into this leaf, 76% of them are not likely to re-subscribe and 24% of them are likely to re-subscribe.
The value of 10% tells us that 10% of all customers in the training data set fall into the leaf.
b. Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.
The label “No” in Leaf 4 tells the churn prediction of a customer if they satisfy the rules on Path. The labels 0.76 and 0.24 tell that of all customers in the training data set that fall into this leaf, 76% of them did not churn and 24% did churn. This above figure gives us an idea of how pure the leaf is, and it also provides the probability of a customer churning.
If a customer follows Path 1 and ends up in this leaf, they are predicted to churn with a probability of 0.24 or 24%.
# Important variables:
$variable.importance sub_tree
lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
summary(sub_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, data = sub_training)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.1037736 0.03420419
2 0.01179245 1 0.8066038 0.9221698 0.03427042
3 0.01000000 4 0.7617925 0.8773585 0.03411223
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
c. Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.
In variable importance printed above by summary() function, the values are reported as a percentage of total improvements to the model.
In the output above, Lor accounts for 47% of all improvements made to the model, Spend accounts for 22%, and age accounts for 22%. num_contacts accounts for 8% and contact_recency accounts for 1%
So, Lor is the variable that contributes more to improving the model.
4. Fully assess the accuracy of the classification tree using both the training and the testing datasets.
a. Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.
#Check model accuracy of the classification tree
# Measure accuracy on the training dataset
<- predict(sub_tree, newdata = sub_training, type = 'prob')
train_probs <- predict(sub_tree, newdata = sub_training, type = 'class')
train_preds
<- cbind(sub_training, train_probs, train_preds)
sub_training_updated
#create a confution matrix(cross tabulation) between predicted vs actual
<- table(sub_training_updated$renewed, sub_training_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat train_con_mat
Predicted
Actual No Yes
No 257 169
Yes 154 270
sum(diag(train_con_mat)) / sum(train_con_mat)
[1] 0.62
From the above confusion matrix of the training data set, we know the following:
The overall model accuracy is (257 + 270)/850 = 0.62 or 62%.
Of all customers the model predicted to resubscribe, they got 270/439 = 0.615 or 61% correct.
Of all customers the model predicted not to resubscribe, they got 257/411 = 0.62 or 62% correct.
Of all customers who did resubscribe, the model correctly identified 270/424 = 0.63 or 63%.
Of all customers who did not resubscribe, the model correctly identified 257/426 = 0.60 or 60%.
# Measure accuracy on the testing dataset
<- predict(sub_tree, newdata = sub_testing, type = 'prob')
test_probs <- predict(sub_tree, newdata = sub_testing, type = 'class')
test_preds <- cbind(sub_testing, test_probs, test_preds)
sub_testing_updated
<- table(sub_testing_updated$renewed, sub_testing_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat test_con_mat
Predicted
Actual No Yes
No 36 38
Yes 35 41
sum(diag(test_con_mat)) / sum(test_con_mat)
[1] 0.5133333
From the above confusion matrix of the testing data set, we know the following:
The overall model accuracy is (36+41)/150 = 0.51 or 51%.
Of all customers the model predicted to resubscribe, they got 41/79 = 0.518 or 51% correct.
Of all customers the model predicted not to resubscribe, they got 36/71 = 0.50 or 50% correct.
Of all customers who did resubscribe, the model correctly identified 41/76 = 0.539 or 54%.
Of all customers who did not resubscribe, the model correctly identified 36/74 = 0.48 or 48%.
The overall accuracy of the model was 62% for the training dataset, however, it has dropped to 51% for the testing dataset. This suggests that the classification tree model is overfitting itself to the training dataset and will not generalize well when asked to predict new unseen data in the future. So, we should prune the tree.
b. Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.
# Pruned Classification Tree
<- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_training, maxdepth = 3 )
sub_tree_pruned
fancyRpartPlot(sub_tree_pruned)
c. Fully assess the accuracy of the pruned tree on the training and testing datasets.
#Assess the accuracy of the pruned classification tree - training data
<- predict(sub_tree_pruned, newdata = sub_training, type = 'prob')
train_probs_pruned <- predict(sub_tree_pruned, newdata = sub_training, type = 'class')
train_preds_pruned <- cbind(sub_training, train_probs_pruned, train_preds_pruned)
sub_training_updated_pruned
<- table(sub_training_updated_pruned$renewed, sub_training_updated_pruned$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat_pruned
sum(diag(train_con_mat_pruned)) / sum(train_con_mat_pruned)
[1] 0.6094118
#Assess the accuracy of the pruned classification tree - testing data
<- predict(sub_tree_pruned, newdata = sub_testing, type = 'prob')
test_probs_pruned <- predict(sub_tree_pruned, newdata = sub_testing, type = 'class')
test_preds_pruned <- cbind(sub_testing, test_probs_pruned, test_preds_pruned)
sub_testing_updated_pruned
<- table(sub_testing_updated_pruned$renewed, sub_testing_updated_pruned$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat_pruned
sum(diag(test_con_mat_pruned)) / sum(test_con_mat_pruned)
[1] 0.5266667
d. Has pruning the classification tree resulted in less overfitting? Explain your answer.
The overall accuracy for the training data set has decreased from 62% to 60% and the testing dataset has decreased from 51% to 53%
The pruning of the classification tree has resulted in less overfitting. Pruning has effectively reduced overfitting by creating a simpler model that performs better on the testing dataset, even though it slightly reduced the training accuracy.
5. Based on your analysis, suggest some actions the company could take to improve its renewal rate. How could your propensity model be used for marketing purposes?
Send emails and offers to target customers to reduce the time since the last contact and increase engagement.
Improve customer support to reduce the number of complaints.
Offer loyalty programs to customers who have spent a lot of money or been with the company for a long time.
Use the model to find customers likely to churn and create special retention campaigns to keep them.
Also by integrating this propensity model into their CRM system, the company can more effectively target customers with marketing strategies that are most likely to prevent churn and increase renewal rates.
*****************************************END OF THE DOCUMENT******************************************