Predictive Analysis of a music streaming service

Intro

This is a predictive analysis of a music streaming service.

Setup

Before we can get into it, we first have to install packages, and import the relevant files:

library(tidyverse)
library(rpart)
library(dplyr)
library(ggplot2)
library(bitops)
library(rattle)

sub_testing <- read_csv("sub_testing.csv")
sub_training <- read_csv("sub_training.csv")

Question 1

Below is a visual exploration of the variables in relation to the users renewal status

Question 2

Part 3

Part 4

  1. If the length of the relationship between the streaming service and the customer was more than 140 days, they will most likely renew their subscription once it elapses

  2. If the length of the relationship was shorter than 140 days, and they’ve spent less than €182, they are likely to churn

renewal_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
summary(renewal_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = sub_training)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.0731132 0.03429485
2 0.01179245      1 0.8066038 0.8915094 0.03416973
3 0.01000000      4 0.7617925 0.8584906 0.03402464

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 

Variable Importance:

Length of Relationship: 47% Spend: 22% Age: 22% Number of contacts: 8% Contact Recency: 1%

Part 5

Training dataset

      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

The overall accuracy of the training dataset is 62%

Of all customers the model predicted would renew, it was 62% accurate

Of all customers the model predicted would not renew, it was 63% accurate

Of all customers who did renew, the model correctly identified 64%

Of all customers who did not renew, the model correctly identified 60%

Testing dataset

      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

The overall accuracy of the testing dataset is 51%

Of all customers the model predicted would renew, it was 51% accurate

Of all customers the model predicted would not renew, it was 50% accurate

Of all customers who did renew, the model correctly identified 54%

Of all customers who did not renew, the model correctly identified 49%

Based on this analysis, I believe the classification tree is overfitting the dataset as the accuracy between the training and testing datasets has an 11% discrepancy

Question 3

Part 6

sub_testing$renewed <- factor(sub_testing$renewed, levels = c("Yes", "No"))
sub_training$renewed <- factor(sub_training$renewed, levels = c("Yes", "No"))

levels(sub_testing$renewed)
[1] "Yes" "No" 
levels(sub_training$renewed)
[1] "Yes" "No" 

Part 7

renewed_lr <- glm(renewed ~ num_complaints + gender + contact_recency + lor + spend + num_contacts + age,
                  data = sub_training, family = binomial(link = 'logit'))
summary(renewed_lr)

Call:
glm(formula = renewed ~ num_complaints + gender + contact_recency + 
    lor + spend + num_contacts + age, family = binomial(link = "logit"), 
    data = sub_training)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.3077726  0.3889894   3.362 0.000774 ***
num_complaints  -0.0463282  0.0575359  -0.805 0.420701    
genderMale      -0.4179377  0.1498906  -2.788 0.005299 ** 
contact_recency  0.0061870  0.0085374   0.725 0.468641    
lor             -0.0026467  0.0007908  -3.347 0.000818 ***
spend           -0.0004400  0.0005652  -0.779 0.436256    
num_contacts    -0.0400276  0.0250176  -1.600 0.109604    
age             -0.0098297  0.0064033  -1.535 0.124762    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1178.3  on 849  degrees of freedom
Residual deviance: 1125.4  on 842  degrees of freedom
AIC: 1141.4

Number of Fisher Scoring iterations: 4
  1. genderFemale was ommitted

  2. Regression Equation:

1.31 - 0.0463(num_complaints) - 0.4179(genderMale) + 0.0061(contact_recency) - 0.0026(lor) - 0.0004(spend) - 0.04(num_contacts) - 0.0098(age)

  1. The variables genderMale, and lor are significant predictor variables as their p values are all below 0.05

  2. The variables num_complaints, genderMale, lor, spend, num_contacts, and age had a negative impact on the likelihood of a subscription renewal

contact_recency is the only predictor variable to have a positive impact on the likelihood of a subscription renewal ### Part 8

Training Dataset

      Predicted
Actual Yes  No
   Yes 248 176
   No  147 279

The overall accuracy of the training dataset is 62%

Of all customers the model predicted would renew, it was 62% accurate

Of all customers the model predicted would not renew, it was 61% accurate

Of all customers who did renew, the model correctly identified 58%

Of all customers who did not renew, the model correctly identified 65%

Testing Dataset

      Predicted
Actual Yes No
   Yes  35 41
   No   36 38

The overall accuracy of the testing dataset is 49%

Of all customers the model predicted would renew, it was 49% accurate

Of all customers the model predicted would not renew, it was 48% accurate

Of all customers who did renew, the model correctly identified 46%

Of all customers who did not renew, the model correctly identified 51%

Question 4

Part 9

The testing dataset for the logistic regression model is 49% accurate, compared to the testing dataset for the classification tree, which is slightly more accurate, at 51% accuracy

Part 10

I think the company should use the classification tree for their predictive analysis as the overall accuracy is higher than the logistic regression model

Part 11

  1. How often the company was in contact with the user seemed to have a big impact on whether or not they renewed their subscription, with most people who received less than 7.5 communications from the company choosing not to renew their subscription
  2. In order to grow the number of resubscribing, I suggest increasing the rate of marketing communications to the customer, as the users who received more than 7.5 were likely to renew their subscription. This means that if the other segment of customers were to receive more communications from the company, they would be more likely to renew their subscription.