library(tidyverse)
library(rpart)
library(dplyr)
library(ggplot2)
library(bitops)
library(rattle)
sub_testing <- read_csv("sub_testing.csv")
sub_training <- read_csv("sub_training.csv")Predictive Analysis of a music streaming service
Intro
This is a predictive analysis of a music streaming service.
Setup
Before we can get into it, we first have to install packages, and import the relevant files:
Question 1
Below is a visual exploration of the variables in relation to the users renewal status
Question 2
Part 3
Part 4
If the length of the relationship between the streaming service and the customer was more than 140 days, they will most likely renew their subscription once it elapses
If the length of the relationship was shorter than 140 days, and they’ve spent less than €182, they are likely to churn
renewal_tree$variable.importance lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
summary(renewal_tree)Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, data = sub_training)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.0731132 0.03429485
2 0.01179245 1 0.8066038 0.8915094 0.03416973
3 0.01000000 4 0.7617925 0.8584906 0.03402464
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
Variable Importance:
Length of Relationship: 47% Spend: 22% Age: 22% Number of contacts: 8% Contact Recency: 1%
Part 5
Training dataset
Predicted
Actual No Yes
No 257 169
Yes 154 270
The overall accuracy of the training dataset is 62%
Of all customers the model predicted would renew, it was 62% accurate
Of all customers the model predicted would not renew, it was 63% accurate
Of all customers who did renew, the model correctly identified 64%
Of all customers who did not renew, the model correctly identified 60%
Testing dataset
Predicted
Actual No Yes
No 36 38
Yes 35 41
The overall accuracy of the testing dataset is 51%
Of all customers the model predicted would renew, it was 51% accurate
Of all customers the model predicted would not renew, it was 50% accurate
Of all customers who did renew, the model correctly identified 54%
Of all customers who did not renew, the model correctly identified 49%
Based on this analysis, I believe the classification tree is overfitting the dataset as the accuracy between the training and testing datasets has an 11% discrepancy
Question 3
Part 6
sub_testing$renewed <- factor(sub_testing$renewed, levels = c("Yes", "No"))
sub_training$renewed <- factor(sub_training$renewed, levels = c("Yes", "No"))
levels(sub_testing$renewed)[1] "Yes" "No"
levels(sub_training$renewed)[1] "Yes" "No"
Part 7
renewed_lr <- glm(renewed ~ num_complaints + gender + contact_recency + lor + spend + num_contacts + age,
data = sub_training, family = binomial(link = 'logit'))
summary(renewed_lr)
Call:
glm(formula = renewed ~ num_complaints + gender + contact_recency +
lor + spend + num_contacts + age, family = binomial(link = "logit"),
data = sub_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3077726 0.3889894 3.362 0.000774 ***
num_complaints -0.0463282 0.0575359 -0.805 0.420701
genderMale -0.4179377 0.1498906 -2.788 0.005299 **
contact_recency 0.0061870 0.0085374 0.725 0.468641
lor -0.0026467 0.0007908 -3.347 0.000818 ***
spend -0.0004400 0.0005652 -0.779 0.436256
num_contacts -0.0400276 0.0250176 -1.600 0.109604
age -0.0098297 0.0064033 -1.535 0.124762
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1178.3 on 849 degrees of freedom
Residual deviance: 1125.4 on 842 degrees of freedom
AIC: 1141.4
Number of Fisher Scoring iterations: 4
genderFemale was ommitted
Regression Equation:
1.31 - 0.0463(num_complaints) - 0.4179(genderMale) + 0.0061(contact_recency) - 0.0026(lor) - 0.0004(spend) - 0.04(num_contacts) - 0.0098(age)
The variables genderMale, and lor are significant predictor variables as their p values are all below 0.05
The variables num_complaints, genderMale, lor, spend, num_contacts, and age had a negative impact on the likelihood of a subscription renewal
contact_recency is the only predictor variable to have a positive impact on the likelihood of a subscription renewal ### Part 8
Training Dataset
Predicted
Actual Yes No
Yes 248 176
No 147 279
The overall accuracy of the training dataset is 62%
Of all customers the model predicted would renew, it was 62% accurate
Of all customers the model predicted would not renew, it was 61% accurate
Of all customers who did renew, the model correctly identified 58%
Of all customers who did not renew, the model correctly identified 65%
Testing Dataset
Predicted
Actual Yes No
Yes 35 41
No 36 38
The overall accuracy of the testing dataset is 49%
Of all customers the model predicted would renew, it was 49% accurate
Of all customers the model predicted would not renew, it was 48% accurate
Of all customers who did renew, the model correctly identified 46%
Of all customers who did not renew, the model correctly identified 51%
Question 4
Part 9
The testing dataset for the logistic regression model is 49% accurate, compared to the testing dataset for the classification tree, which is slightly more accurate, at 51% accuracy
Part 10
I think the company should use the classification tree for their predictive analysis as the overall accuracy is higher than the logistic regression model
Part 11
- How often the company was in contact with the user seemed to have a big impact on whether or not they renewed their subscription, with most people who received less than 7.5 communications from the company choosing not to renew their subscription
- In order to grow the number of resubscribing, I suggest increasing the rate of marketing communications to the customer, as the users who received more than 7.5 were likely to renew their subscription. This means that if the other segment of customers were to receive more communications from the company, they would be more likely to renew their subscription.