Music Subscription Analysis

Author

Gopika Kurunthala

Predicting Customers Who Will Renew Their Music Subscription

1. Import the sub_training.csv and sub_testing.csv data sets into R.

2. Create and visualize a classification tree model that will allow you to predict if a customer will re-subscribe or churn.

#Load required packages

library(rattle)
library(rpart)
library(tidyverse)

#Step 1: Import the sub_training.csv and sub_testing.csv datasets into R

sub_training <- read_csv("sub_training.csv")
sub_testing <- read_csv("sub_testing.csv")


#Step 2: Create and visualise a classification tree model

sub_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub_training)
fancyRpartPlot(sub_tree)

3. Interpret the classification tree

a. Clearly state one rule for predicting if a customer will re-subscribe. Your answer should also address how pure the node is.

Path 1: If lor is not less than 140, the customer is likely to re-subscribe.

The label “Yes” in Leaf 3 tells us that a customer who satisfies the rule on Path 1 is predicted to re-subscribe. The values 0.39 and 0.61 tell us that, of all customers in the training data set who fall into this leaf, 39% are not likely to re-subscribe and 61% are likely to re-subscribe, so the leaf is only moderately pure.

The value of 45% tells us that 45% of all customers in the training data set fall into this leaf.

Path 2: If lor is less than 140 and spend is less than 182, the customer is not likely to re-subscribe.

The label “No” in Leaf 4 tells us that a customer who satisfies the rules on Path 2 is predicted not to re-subscribe. The values 0.76 and 0.24 tell us that, of all customers in the training data set who fall into this leaf, 76% are not likely to re-subscribe and 24% are likely to re-subscribe, so this leaf is reasonably pure.

The value of 10% tells us that 10% of all customers in the training data set fall into this leaf.
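
As an optional cross-check (not part of the original brief), the same paths can be printed as text rules with the rpart.plot package, which fancyRpartPlot() builds on; setting cover = TRUE adds the percentage of training customers that fall under each rule.

# Optional: print each leaf's rule and class probability as text
rpart.plot::rpart.rules(sub_tree, cover = TRUE)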

b. Clearly state one rule for predicting if a customer will churn. Your answer should also address how pure the node is.

The label “No” in Leaf 4 means that a customer who satisfies the rules on Path 2 (lor less than 140 and spend less than 182) is predicted to churn. The values 0.76 and 0.24 tell us that, of all customers in the training data set who fall into this leaf, 76% churned and 24% re-subscribed. These figures show how pure the leaf is, and they also give the probability of a customer churning.

If a customer follows Path 2 and ends up in this leaf, they are predicted to churn with a probability of 0.76 or 76%.
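
To read such a churn probability straight from the model, one can score a customer with predict(..., type = 'prob'). Below is a minimal sketch; the customer values are invented for illustration, and the gender coding is assumed to match the levels in sub_training.

# Hypothetical customer: all values below are made up for illustration,
# and gender = "F" assumes that coding exists in the training data
new_customer <- data.frame(
  num_contacts = 3, contact_recency = 10, num_complaints = 1,
  spend = 150, lor = 60, gender = "F", age = 45
)

# The "No" column of the result is the predicted churn probability; with
# lor = 60 and spend = 150 this customer lands in Leaf 4, so it should be 0.76
predict(sub_tree, newdata = new_customer, type = "prob")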

# Important variables:

sub_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
summary(sub_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = sub_training)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.1037736 0.03420419
2 0.01179245      1 0.8066038 0.9221698 0.03427042
3 0.01000000      4 0.7617925 0.8773585 0.03411223

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 

c. Which variables are considered important for predicting if a customer will re-subscribe or not? Explain your answer.

In the variable importance table printed by the summary() function above, the values are reported as a percentage of the total improvement to the model.

In the output above, lor accounts for 47% of all improvements made to the model, spend accounts for 22%, age accounts for 22%, num_contacts accounts for 8%, and contact_recency accounts for 1%.

So, lor is the variable that contributes the most to improving the model, followed by spend and age.
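
The percentages quoted above can be reproduced directly from the raw importance scores, as a quick sketch of what summary() is reporting:

# Express each variable's importance as a share of the total improvement
round(100 * sub_tree$variable.importance / sum(sub_tree$variable.importance))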

4. Fully assess the accuracy of the classification tree using both the training and the testing datasets.

a. Based on your findings, you should see evidence of the classification tree overfitting the training dataset. Explain how this overfitting is detected.

#Check model accuracy of the classification tree

# Measure accuracy on the training dataset

train_probs <- predict(sub_tree, newdata = sub_training, type = 'prob')
train_preds <- predict(sub_tree, newdata = sub_training, type = 'class')

sub_training_updated <- cbind(sub_training, train_probs, train_preds)

# Create a confusion matrix (cross tabulation) of predicted vs actual

train_con_mat <- table(sub_training_updated$renewed, sub_training_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270
sum(diag(train_con_mat)) / sum(train_con_mat)
[1] 0.62

From the above confusion matrix for the training data set, we know the following (a sketch that computes these rates programmatically follows the list):

  • The overall model accuracy is (257 + 270)/850 = 0.62 or 62%.

  • Of all customers the model predicted would re-subscribe, it got 270/439 = 0.62 or 62% correct.

  • Of all customers the model predicted would not re-subscribe, it got 257/411 = 0.63 or 63% correct.

  • Of all customers who did re-subscribe, the model correctly identified 270/424 = 0.64 or 64%.

  • Of all customers who did not re-subscribe, the model correctly identified 257/426 = 0.60 or 60%.
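
As mentioned, here is a sketch (not part of the original analysis) that computes these rates straight from a confusion matrix; it assumes rows = Actual and columns = Predicted, labelled "No"/"Yes", as produced by table() above.

# Accuracy, precision and recall for both classes from a confusion matrix
class_rates <- function(cm) {
  c(
    accuracy      = sum(diag(cm)) / sum(cm),
    precision_yes = cm["Yes", "Yes"] / sum(cm[, "Yes"]),  # predicted Yes that were correct
    precision_no  = cm["No", "No"]   / sum(cm[, "No"]),   # predicted No that were correct
    recall_yes    = cm["Yes", "Yes"] / sum(cm["Yes", ]),  # actual Yes correctly identified
    recall_no     = cm["No", "No"]   / sum(cm["No", ])    # actual No correctly identified
  )
}

round(class_rates(train_con_mat), 3)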

# Measure accuracy on the testing dataset

test_probs <- predict(sub_tree, newdata = sub_testing, type = 'prob')
test_preds <- predict(sub_tree, newdata = sub_testing, type = 'class')
sub_testing_updated <- cbind(sub_testing, test_probs, test_preds)

test_con_mat <- table(sub_testing_updated$renewed, sub_testing_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41
sum(diag(test_con_mat)) / sum(test_con_mat)
[1] 0.5133333

From the above confusion matrix for the testing data set, we know the following (the same rates are computed programmatically in the sketch after the list):

  • The overall model accuracy is (36 + 41)/150 = 0.51 or 51%.

  • Of all customers the model predicted would re-subscribe, it got 41/79 = 0.52 or 52% correct.

  • Of all customers the model predicted would not re-subscribe, it got 36/71 = 0.51 or 51% correct.

  • Of all customers who did re-subscribe, the model correctly identified 41/76 = 0.54 or 54%.

  • Of all customers who did not re-subscribe, the model correctly identified 36/74 = 0.49 or 49%.
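
Using the class_rates() helper sketched earlier, the testing rates follow in one line:

round(class_rates(test_con_mat), 3)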

The overall accuracy of the model was 62% on the training dataset; however, it drops to 51% on the testing dataset. This gap suggests that the classification tree is overfitting the training dataset and will not generalize well when asked to predict new, unseen data in the future. So, we should prune the tree.
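
As an additional, optional signal of overfitting, rpart's built-in cross-validation results can be inspected: when the cross-validated error (xerror) stops improving while the training error (rel error) keeps falling, the deeper splits are likely fitting noise rather than signal.

# Cross-validated error by tree size; compare rel error against xerror
printcp(sub_tree)
plotcp(sub_tree)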

b. Create a second classification tree that is a pruned version of the classification tree created in part 2. This pruned classification tree should have a max depth of 3.

# Pruned classification tree

sub_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub_training, maxdepth = 3)

fancyRpartPlot(sub_tree_pruned)
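
Note that maxdepth = 3 is passed through to rpart.control() and limits the tree while it is grown. An alternative (not required by the brief) is cost-complexity pruning of the original tree with rpart's prune(), using the cp value that has the lowest cross-validated error:

# Pick the cp with the smallest cross-validated error and prune to it
best_cp <- sub_tree$cptable[which.min(sub_tree$cptable[, "xerror"]), "CP"]
sub_tree_pruned_cp <- prune(sub_tree, cp = best_cp)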

c. Fully assess the accuracy of the pruned tree on the training and testing datasets.

#Assess the accuracy of the pruned classification tree - training data

train_probs_pruned <- predict(sub_tree_pruned, newdata = sub_training, type = 'prob')
train_preds_pruned <- predict(sub_tree_pruned, newdata = sub_training, type = 'class')
sub_training_updated_pruned <- cbind(sub_training, train_probs_pruned, train_preds_pruned)

train_con_mat_pruned <- table(sub_training_updated_pruned$renewed, sub_training_updated_pruned$train_preds_pruned, dnn = c('Actual', 'Predicted'))

sum(diag(train_con_mat_pruned)) / sum(train_con_mat_pruned)
[1] 0.6094118
#Assess the accuracy of the pruned classification tree - testing data

test_probs_pruned <- predict(sub_tree_pruned, newdata = sub_testing, type = 'prob')
test_preds_pruned <- predict(sub_tree_pruned, newdata = sub_testing, type = 'class')
sub_testing_updated_pruned <- cbind(sub_testing, test_probs_pruned, test_preds_pruned)

test_con_mat_pruned <- table(sub_testing_updated_pruned$renewed, sub_testing_updated_pruned$test_preds_pruned, dnn = c('Actual', 'Predicted'))

sum(diag(test_con_mat_pruned)) / sum(test_con_mat_pruned)
[1] 0.5266667

d. Has pruning the classification tree resulted in less overfitting? Explain your answer.

The overall accuracy for the training data set has decreased slightly from 62% to 61%, while the accuracy for the testing data set has increased from 51% to 53%.

Pruning the classification tree has therefore resulted in less overfitting: the gap between training and testing accuracy has narrowed from roughly 11 percentage points to roughly 8. Pruning produced a simpler model that performs better on the testing dataset, even though it slightly reduced the training accuracy.
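
A quick sketch quantifying that reduction, using the confusion matrices computed above:

# Train-test accuracy gap before and after pruning (~0.11 vs ~0.08)
gap_original <- sum(diag(train_con_mat)) / sum(train_con_mat) -
  sum(diag(test_con_mat)) / sum(test_con_mat)
gap_pruned <- sum(diag(train_con_mat_pruned)) / sum(train_con_mat_pruned) -
  sum(diag(test_con_mat_pruned)) / sum(test_con_mat_pruned)
c(original = gap_original, pruned = gap_pruned)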

5. Based on your analysis, suggest some actions the company could take to improve its renewal rate. How could your propensity model be used for marketing purposes?

  • Send emails and offers to target customers to reduce the time since the last contact and increase engagement.

  • Improve customer support to reduce the number of complaints.

  • Offer loyalty programs to customers who have spent a lot of money or been with the company for a long time.

  • Use the model to find customers likely to churn and create special retention campaigns to keep them.

  • By integrating this propensity model into its CRM system, the company can more effectively target customers with the marketing strategies most likely to prevent churn and increase renewal rates; a sketch of such targeting follows the list.
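
As a sketch of the last two points (illustrative only, and scoring the test set here; a real campaign would score the full customer base), the pruned model can rank customers by predicted churn probability so that marketing targets the most at-risk segment first:

# Score customers: the "No" column is the predicted churn probability
churn_prob <- predict(sub_tree_pruned, newdata = sub_testing, type = "prob")[, "No"]

# Rank customers from most to least likely to churn and take the top 20
target_list <- sub_testing[order(churn_prob, decreasing = TRUE), ]
head(target_list, 20)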

*****************************************END OF THE DOCUMENT******************************************