library(tidyverse)
library(kableExtra)
library(rattle)
library(rpart)
sub_testing <- read_csv("sub_testing.csv")
sub_training <- read_csv("sub_training.csv")Music Churn Churn
Visual Exploration
Setup
First we will set up our quarto file by loading all the needed libraries and importing our testing and training datasets.
Renewal Percentage
Our first action will be making a table that will show us the percentages of customers that renewed and customers that did not. It is important that we find out how many customers we are losing.
renew_table <- sub_training %>%
group_by(renewed) %>%
summarise(count = n()) %>%
mutate(renew_percentage = round(count/sum(count)*100, 1)) %>%
arrange(desc(count))
knitr::kable(renew_table,
align = "lrr",
col.names = c("Renewed", "Number of Customers", "Percentage"),
caption = "Churn Rate",
table.attr = 'data-quarto-disable-processing = "true"') %>%
kable_styling(full_width = F) %>%
row_spec(2, color = "white", background = "green") %>%
row_spec(1, color = "white", background = "red")| Renewed | Number of Customers | Percentage |
|---|---|---|
| No | 426 | 50.1 |
| Yes | 424 | 49.9 |
Findings
The split between renewed subscription and canceled one is nearly 50/50 this is a very high churn rate and it is integral that we find what is causing this, what customers are likely to churn and keep them from doing so
Overlapping Histogram of Contact Frequency and Renewal
ggplot(data = sub_training) +
geom_histogram(mapping = aes(x = num_contacts, fill = renewed, ), position = "identity", binwidth = 1, alpha = 0.5) +
scale_x_continuous(limits = c(0, 40)) +
scale_y_continuous(limits = c(0, 40)) +
labs(title = "Overlapping Histogram of Contact Frequency and Renewal",
x = "Number of Contact",
y = element_blank())Findings
We can see that most of our customers have little contact with our service (<10), these customers are also much more likely to churn from our service, since any customer who was in contact with our service more than 30 times had ended up renewing their subscription. Meaning Number of Contact could be a useful variable for our predictive algorithms.
Percentage of Renewed Based on Contact Recency
For the variable contact recency we will be looking to see the percentage of renewal at each day since last contact to see if there is a drop off in subscription renewals based on the time spend away from the service.
contact_recency_renewed <- sub_training %>%
count(contact_recency, renewed) %>%
group_by(contact_recency) %>%
mutate(percentage = n / sum(n) * 100) %>%
ungroup()
ggplot(contact_recency_renewed, aes(x = contact_recency, y = percentage, fill = renewed)) +
geom_bar(stat = "identity") +
ylab("Renewed (%)") +
xlab("Time Since Last Contact") +
ggtitle("Percentage of Renewed Based on Contact Recency")Findings
The highest rate of re-subscription is at 21 and 26 days since last use and the lowest is at 25 days, there seems to be little to no pattern emerging and it is likely that this variable will not be very useful for our algorithm.
Renewal Rate by Number of Complaints
It is expected that complaints would have a strong correlation with churn whether it is the ability or inability of the customer service team to solve them, it is crucial to our business to find the outcome that multiple customer complaints have on our renewal rate. We will cap the number of complaints at 7 since there is only 1 customer per number of complaints higher that 7 and we are trying to find an emerging pattern, that singular responses would discredit.
complaint_renewal_rate <- sub_training %>%
group_by(num_complaints) %>%
summarise(
n = n(),
pct_renewed = mean(renewed == "Yes") * 100
)
library(ggplot2)
ggplot(complaint_renewal_rate,
aes(x = num_complaints, y = pct_renewed)) +
geom_line(color = "red") +
geom_point(size = 2, color = "red") +
labs(
title = "Renewal Rate by Number of Complaints",
x = "Number of Complaints",
y = "Percentage Renewed"
) +
scale_y_continuous(limits = c(25, 75)) +
scale_x_continuous(limits = c(0, 7)) Findings
Once again there is not a clear pattern emerging, all we can say is that there is a drop in renewal at 3 complaints, a surge of renewal at 4 and another drop at 7 complaints. The lack of clear trajectory will likely make it difficult for the algorithm to find use for this variable
Renewal Rate by Customer Spend
We are going to make a boxplot showing the distribution of customer spend based on whether the customer renewed or not, this will help us determine whether there is correlation between spend and renewal.
ggplot(data = sub_training) +
geom_boxplot(mapping = aes(x = renewed, y = spend)) +
labs(
title = "Renewal Rate by Customer Spend",
x = "Renewal Status",
y = "Customer Spend")Findings
Customers who renewed have median spend coinciding with the 75th percentile at around €420, indicating a strong concentration of higher spenders. Customers who did not renew show a lower median spend and bigger dispersion.At the same time customers that churned show a few high spenders, meaning it does not necessarily mean that spending more means renewal is promised.Either way there seems to be little correlation between spend and renewal.
Reneval Rate Based on Length of Relationship
Next we are going to look to figure out the connection between length or relationship to see whether staying longer with the service has an impact on renewal.
ggplot(data = sub_training) +
geom_boxplot(mapping = aes(x = renewed, y = lor),
outlier.shape = 17, outlier.colour = "red") +
labs(
title = "Renewal Rate by Length of Relationship",
x = "Renewal Status",
y = "Length of Relationship (Days)")Findings
Finally we find a variable that clearly shows correlation with renewal. Here we can see that the longer a customer is with the service the longer more likely they are to renew their subscription. There are a few outliers on the churn side signified with the red arrow, but it is clear from the graph that new customers churn more often.
Renewal Based on Gender
The gender distribution will help us see our target audience as well as how is more likely to churn men or women.
ggplot(data = sub_training) +
geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge") +
scale_fill_manual(values = c("forestgreen", "red")) +
labs(
title = "Renewal Based on Gender")Findings
Men are the larger demographic for our service with a lower rate of churning, while Women are more likely to churn, this brings forth the question whether we should target men in our campaigns since they are our target market or if we should try to pursue women to stay with our service to lower our churn rate.
Age Spread Based on Renewal Status
The age spread will once again point us towards our main customer segments as well as showing correlation of age with renewal.
ggplot(data = sub_training) +
geom_histogram(mapping = aes(x = age, fill = renewed ), position = "dodge") +
labs(title = "Age Spread Based on Renewal Status",
x = "Age",
y = element_blank())Findings
We can see a few main customer segments form (45-55, 60-70), it also seem that that our younger customers are more likely to churn, while customers over the age of 60 are more likely to stay. # Classification Trees ## renewal Prediction Tree
renewal_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_training)
fancyRpartPlot(renewal_tree)renewal_tree$variable.importance lor spend age num_contacts contact_recency
21.5220728 10.3707565 9.9887537 3.7116877 0.5951338
summary(renewal_tree)Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, data = sub_training)
n= 850
CP nsplit rel error xerror xstd
1 0.19339623 0 1.0000000 1.0683962 0.03430592
2 0.01179245 1 0.8066038 0.8584906 0.03402464
3 0.01000000 4 0.7617925 0.8584906 0.03402464
Variable importance
lor spend age num_contacts contact_recency
47 22 22 8 1
Node number 1: 850 observations, complexity param=0.1933962
predicted class=No expected loss=0.4988235 P(node) =1
class counts: 426 424
probabilities: 0.501 0.499
left son=2 (466 obs) right son=3 (384 obs)
Primary splits:
lor < 139.5 to the left, improve=16.323670, (0 missing)
age < 60.5 to the left, improve=13.784490, (0 missing)
spend < 182 to the left, improve=12.232030, (0 missing)
contact_recency < 7.5 to the right, improve= 5.498858, (0 missing)
num_contacts < 3.5 to the left, improve= 5.307063, (0 missing)
Surrogate splits:
age < 60.5 to the left, agree=0.732, adj=0.406, (0 split)
spend < 422 to the left, agree=0.684, adj=0.299, (0 split)
contact_recency < 8.5 to the right, agree=0.565, adj=0.036, (0 split)
num_contacts < 2.5 to the left, agree=0.560, adj=0.026, (0 split)
Node number 2: 466 observations, complexity param=0.01179245
predicted class=No expected loss=0.4098712 P(node) =0.5482353
class counts: 275 191
probabilities: 0.590 0.410
left son=4 (82 obs) right son=5 (384 obs)
Primary splits:
spend < 182 to the left, improve=5.482157, (0 missing)
lor < 42.5 to the left, improve=5.434363, (0 missing)
age < 61.5 to the left, improve=4.067404, (0 missing)
num_contacts < 7.5 to the left, improve=3.721617, (0 missing)
contact_recency < 7.5 to the right, improve=2.971256, (0 missing)
Surrogate splits:
lor < 21 to the left, agree=0.987, adj=0.927, (0 split)
Node number 3: 384 observations
predicted class=Yes expected loss=0.3932292 P(node) =0.4517647
class counts: 151 233
probabilities: 0.393 0.607
Node number 4: 82 observations
predicted class=No expected loss=0.2439024 P(node) =0.09647059
class counts: 62 20
probabilities: 0.756 0.244
Node number 5: 384 observations, complexity param=0.01179245
predicted class=No expected loss=0.4453125 P(node) =0.4517647
class counts: 213 171
probabilities: 0.555 0.445
left son=10 (356 obs) right son=11 (28 obs)
Primary splits:
num_contacts < 7.5 to the left, improve=3.286592, (0 missing)
age < 61 to the left, improve=3.189001, (0 missing)
gender splits as LR, improve=2.683644, (0 missing)
num_complaints < 1.5 to the left, improve=2.393375, (0 missing)
lor < 45.5 to the left, improve=1.671696, (0 missing)
Surrogate splits:
lor < 137.5 to the left, agree=0.93, adj=0.036, (0 split)
Node number 10: 356 observations, complexity param=0.01179245
predicted class=No expected loss=0.4269663 P(node) =0.4188235
class counts: 204 152
probabilities: 0.573 0.427
left son=20 (329 obs) right son=21 (27 obs)
Primary splits:
age < 61 to the left, improve=3.357262, (0 missing)
gender splits as LR, improve=2.949544, (0 missing)
num_complaints < 1.5 to the left, improve=1.278902, (0 missing)
lor < 42.5 to the left, improve=1.232817, (0 missing)
spend < 403 to the right, improve=1.196010, (0 missing)
Node number 11: 28 observations
predicted class=Yes expected loss=0.3214286 P(node) =0.03294118
class counts: 9 19
probabilities: 0.321 0.679
Node number 20: 329 observations
predicted class=No expected loss=0.4072948 P(node) =0.3870588
class counts: 195 134
probabilities: 0.593 0.407
Node number 21: 27 observations
predicted class=Yes expected loss=0.3333333 P(node) =0.03176471
class counts: 9 18
probabilities: 0.333 0.667
A.
If length of relationship is higher or equal to 140 days, then predict the customer will renew subscription. This node is labeled Yes (0.61 / 0.39), that means 61% of customers in this node re-subscribed, while 39% churned. This node also contains 45% of all observations, so it’s both fairly pure and very influential.
B.
If length of relationship is lower than 140 days and spend is higher or equal to €182, then predict the customer will churn. This node is labeled No (0.24 / 0.76), that means 76% of customers in this node churn, making it a highly pure churn node. It contains 10% of the data, so while smaller, it’s a strong and confident churn signal.
C.
Variable importance percentage lor spend age num_contacts contact_recency 47% 22% 22% 8% 1%
Model Accuracy
train_probs <- predict(renewal_tree, newdata = sub_training, type = 'prob')
train_preds <- predict(renewal_tree, newdata = sub_training, type = 'class')
sub_training_updated <- cbind(sub_training, train_probs, train_preds)
head(sub_training_updated) id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187 No 0 28 0 213 248 Male 45
2 269 No 1 12 2 425 82 Male 60
3 376 No 0 28 2 0 15 Female 53
4 400 No 1 11 1 0 12 Male 44
5 679 Yes 0 28 0 216 300 Male 68
6 565 Yes 0 28 0 425 349 Female 68
No Yes train_preds
1 0.3932292 0.6067708 Yes
2 0.5927052 0.4072948 No
3 0.7560976 0.2439024 No
4 0.7560976 0.2439024 No
5 0.3932292 0.6067708 Yes
6 0.3932292 0.6067708 Yes
train_con_mat <- table(sub_training_updated$renewed, sub_training_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat Predicted
Actual No Yes
No 257 169
Yes 154 270
Training model accuracy
(270+257)/850 = 0.62 = 62% The training data shows 62% accuracy in prediction, this is fairly accurate.
test_probs <- predict(renewal_tree, newdata = sub_testing, type = 'prob')
test_preds <- predict(renewal_tree, newdata = sub_testing, type = 'class')
sub_testing_updated <- cbind(sub_testing, test_probs, test_preds)
head(sub_testing_updated) id renewed num_contacts contact_recency num_complaints spend lor gender age
1 942 Yes 0 28 0 213 81 Female 45
2 299 No 0 28 0 425 120 Male 51
3 44 No 0 28 3 477 89 Female 69
4 706 Yes 1 14 0 425 361 Female 68
5 965 Yes 0 28 0 235 36 Female 51
6 354 No 1 11 0 0 12 Male 59
No Yes test_preds
1 0.5927052 0.4072948 No
2 0.5927052 0.4072948 No
3 0.3333333 0.6666667 Yes
4 0.3932292 0.6067708 Yes
5 0.5927052 0.4072948 No
6 0.7560976 0.2439024 No
test_con_mat <- table(sub_testing_updated$renewed, sub_testing_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat Predicted
Actual No Yes
No 36 38
Yes 35 41
Testing data accuracy
(36+41)/150 = 0.51333 = 51.3% this is much less accurate and it seems as though the model is over fitted as it demonstrated higher accuracy on the training data and will need to be pruned before it can be accurate, because with nearly 50% accuracy there is the same chance that a prediction will be correct as there is that it will not be correct.
Binary Logistic Regression
sub_training$renewed <- factor(sub_training$renewed, levels = c("Yes", "No"))
sub_testing$renewed <- factor(sub_testing$renewed, levels = c("Yes", "No"))
levels(sub_training$renewed)[1] "Yes" "No"
levels(sub_testing$renewed)[1] "Yes" "No"
renewal_lr <- glm(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub_training, family = binomial(link = 'logit'))
summary(renewal_lr)
Call:
glm(formula = renewed ~ num_contacts + contact_recency + num_complaints +
spend + lor + gender + age, family = binomial(link = "logit"),
data = sub_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3077726 0.3889894 3.362 0.000774 ***
num_contacts -0.0400276 0.0250176 -1.600 0.109604
contact_recency 0.0061870 0.0085374 0.725 0.468641
num_complaints -0.0463282 0.0575359 -0.805 0.420701
spend -0.0004400 0.0005652 -0.779 0.436256
lor -0.0026467 0.0007908 -3.347 0.000818 ***
genderMale -0.4179377 0.1498906 -2.788 0.005299 **
age -0.0098297 0.0064033 -1.535 0.124762
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1178.3 on 849 degrees of freedom
Residual deviance: 1125.4 on 842 degrees of freedom
AIC: 1141.4
Number of Fisher Scoring iterations: 4
train_pi <- predict(renewal_lr, newdata = sub_training, type = 'response')
train_pi 1 2 3 4 5 6 7 8
0.4677010 0.4594854 0.6958636 0.6004329 0.3788554 0.4260171 0.5423641 0.5912532
9 10 11 12 13 14 15 16
0.5361077 0.5234023 0.4143391 0.6431296 0.4904683 0.6431296 0.7349879 0.3821701
17 18 19 20 21 22 23 24
0.6497835 0.5693861 0.5829538 0.3574688 0.5133261 0.5067947 0.5221561 0.4771866
25 26 27 28 29 30 31 32
0.4579919 0.6330254 0.5410108 0.5093449 0.5403136 0.6343816 0.6180604 0.5716042
33 34 35 36 37 38 39 40
0.5719954 0.5363469 0.5746450 0.5421427 0.3517599 0.4942457 0.2845781 0.5433066
41 42 43 44 45 46 47 48
0.4502692 0.4907864 0.5644121 0.6852244 0.6108140 0.4579919 0.5759761 0.5687692
49 50 51 52 53 54 55 56
0.4491684 0.1827203 0.3635581 0.5399489 0.6548843 0.4601126 0.4087059 0.4461582
57 58 59 60 61 62 63 64
0.6983748 0.6294821 0.6415901 0.5856446 0.4822225 0.4481498 0.4466630 0.6005874
65 66 67 68 69 70 71 72
0.2593840 0.6083302 0.2961096 0.3407975 0.6317718 0.2945270 0.5417920 0.5059073
73 74 75 76 77 78 79 80
0.4066799 0.5403860 0.7265964 0.4474685 0.3472018 0.5635760 0.7324152 0.6703948
81 82 83 84 85 86 87 88
0.3640080 0.5466660 0.5039131 0.4302498 0.5451436 0.6157469 0.3780301 0.5370083
89 90 91 92 93 94 95 96
0.4699923 0.2004668 0.6361487 0.4935883 0.4862908 0.6135804 0.4549286 0.5145120
97 98 99 100 101 102 103 104
0.5691104 0.5244290 0.4745758 0.5910268 0.3123436 0.3339000 0.6461611 0.3877489
105 106 107 108 109 110 111 112
0.2814171 0.6615571 0.6318434 0.3229300 0.6294821 0.1652461 0.5064308 0.5487011
113 114 115 116 117 118 119 120
0.5600478 0.3965076 0.5007403 0.6703374 0.4944206 0.3243956 0.5504015 0.6740563
121 122 123 124 125 126 127 128
0.2999757 0.5717047 0.4072116 0.5959668 0.3064827 0.6815100 0.6770166 0.3367275
129 130 131 132 133 134 135 136
0.5424397 0.4558933 0.4284594 0.4946643 0.5185954 0.4776289 0.3063902 0.3346316
137 138 139 140 141 142 143 144
0.6132449 0.5370371 0.4797394 0.6232065 0.5146475 0.5571657 0.6299639 0.6006706
145 146 147 148 149 150 151 152
0.4723487 0.4590611 0.3202268 0.4028984 0.5645614 0.6431296 0.3822239 0.6720500
153 154 155 156 157 158 159 160
0.5128036 0.6961400 0.6144683 0.3791406 0.5744130 0.5635029 0.2445590 0.3733007
161 162 163 164 165 166 167 168
0.4460121 0.5616239 0.4657428 0.6354583 0.2915086 0.4969974 0.6714709 0.5345446
169 170 171 172 173 174 175 176
0.4016256 0.1307567 0.4855279 0.4608509 0.4788049 0.5599091 0.5871828 0.5256568
177 178 179 180 181 182 183 184
0.5122814 0.6477010 0.4677010 0.6130911 0.3818419 0.4842949 0.3198534 0.4811274
185 186 187 188 189 190 191 192
0.7375704 0.6560758 0.4756482 0.3101851 0.4871934 0.5563336 0.2628863 0.6018565
193 194 195 196 197 198 199 200
0.2585636 0.3055310 0.2643184 0.4729522 0.5582464 0.4782471 0.3433170 0.6363460
201 202 203 204 205 206 207 208
0.6008290 0.5920195 0.6368669 0.5836863 0.4748238 0.6197136 0.5693574 0.6671508
209 210 211 212 213 214 215 216
0.4858167 0.6043645 0.6521502 0.3891307 0.4581427 0.4430736 0.3495022 0.6626188
217 218 219 220 221 222 223 224
0.5632597 0.4986951 0.3028746 0.6289835 0.4071339 0.4961413 0.6299040 0.4733891
225 226 227 228 229 230 231 232
0.5431864 0.6224711 0.5796719 0.3244778 0.4149298 0.5146642 0.4984074 0.4321647
233 234 235 236 237 238 239 240
0.4835751 0.4842361 0.5004421 0.6442159 0.4527406 0.5441835 0.4637985 0.2705899
241 242 243 244 245 246 247 248
0.5810841 0.6037989 0.6340733 0.4141599 0.5891449 0.4106257 0.7017629 0.3943916
249 250 251 252 253 254 255 256
0.3858911 0.5380443 0.4132452 0.6718076 0.4600869 0.5125219 0.5800796 0.3452770
257 258 259 260 261 262 263 264
0.4983821 0.5498690 0.5272622 0.6132959 0.2585839 0.3507245 0.5481082 0.2738812
265 266 267 268 269 270 271 272
0.4698438 0.4878984 0.3536026 0.4542031 0.5382878 0.7057366 0.4542192 0.5416679
273 274 275 276 277 278 279 280
0.5434041 0.4782110 0.5515067 0.5478252 0.5623528 0.3238049 0.5972389 0.4274620
281 282 283 284 285 286 287 288
0.4448375 0.6007578 0.5636195 0.3568611 0.5537802 0.6638012 0.7324152 0.4445463
289 290 291 292 293 294 295 296
0.5244547 0.4113369 0.4657573 0.5399404 0.4456630 0.5580479 0.4118136 0.6428064
297 298 299 300 301 302 303 304
0.4099737 0.5198662 0.4812821 0.4072116 0.5665124 0.2580510 0.6228761 0.6431296
305 306 307 308 309 310 311 312
0.5478580 0.3759624 0.7375704 0.6868787 0.5377178 0.4025506 0.6304643 0.5910050
313 314 315 316 317 318 319 320
0.5162816 0.4858411 0.5542381 0.3895684 0.4060438 0.5186139 0.4626069 0.4814627
321 322 323 324 325 326 327 328
0.5346576 0.5564200 0.6181405 0.2716641 0.6083535 0.5338057 0.2828522 0.5047102
329 330 331 332 333 334 335 336
0.4497246 0.3314465 0.4007238 0.6085163 0.6864346 0.6090518 0.5243224 0.5877959
337 338 339 340 341 342 343 344
0.3280127 0.3416667 0.4431587 0.3871274 0.7297165 0.6644173 0.5635541 0.4281227
345 346 347 348 349 350 351 352
0.5537802 0.3558110 0.5195787 0.2370498 0.5684102 0.5001113 0.2612965 0.5035397
353 354 355 356 357 358 359 360
0.6168630 0.2833346 0.4381982 0.5658410 0.6424302 0.6507548 0.5614949 0.3633937
361 362 363 364 365 366 367 368
0.1751609 0.5577009 0.6431296 0.3713276 0.2819890 0.6280196 0.5992573 0.3306004
369 370 371 372 373 374 375 376
0.6439356 0.2288594 0.3441367 0.4917389 0.4778502 0.4900010 0.6122740 0.7139726
377 378 379 380 381 382 383 384
0.2668355 0.6707502 0.4205910 0.5202498 0.5432344 0.4779103 0.4791531 0.6487980
385 386 387 388 389 390 391 392
0.4669336 0.3507245 0.5235776 0.6920445 0.6372977 0.3287148 0.5018475 0.6548797
393 394 395 396 397 398 399 400
0.6971938 0.6497835 0.2060701 0.3379128 0.5626084 0.6043285 0.3942669 0.4268394
401 402 403 404 405 406 407 408
0.4622082 0.3263420 0.5380507 0.6399216 0.6431296 0.4578496 0.3711395 0.7365445
409 410 411 412 413 414 415 416
0.4482157 0.7365445 0.2487620 0.4377381 0.5598326 0.5096548 0.5637405 0.5738941
417 418 419 420 421 422 423 424
0.6637708 0.5626084 0.3148528 0.7365445 0.5842475 0.5482312 0.3241894 0.7016749
425 426 427 428 429 430 431 432
0.3746643 0.4596435 0.5001113 0.5087909 0.5459190 0.2898164 0.7125890 0.6375720
433 434 435 436 437 438 439 440
0.6001948 0.5153320 0.3308627 0.6697986 0.5794348 0.6606534 0.6799673 0.6927005
441 442 443 444 445 446 447 448
0.7025491 0.3828420 0.3768328 0.5780473 0.4967702 0.6493660 0.4421074 0.4264252
449 450 451 452 453 454 455 456
0.5766562 0.6854089 0.4829141 0.2909187 0.4565174 0.6254189 0.7324152 0.5560027
457 458 459 460 461 462 463 464
0.4323424 0.2715702 0.5596895 0.5895547 0.5675548 0.4563643 0.3397858 0.3496052
465 466 467 468 469 470 471 472
0.5616145 0.5988666 0.4769686 0.4694810 0.3501220 0.4483479 0.4345273 0.2798223
473 474 475 476 477 478 479 480
0.4008217 0.4326561 0.2628863 0.6185790 0.5728295 0.2747910 0.4180493 0.4291296
481 482 483 484 485 486 487 488
0.3886783 0.2628863 0.5338683 0.4266645 0.7232385 0.5441111 0.5315907 0.6213967
489 490 491 492 493 494 495 496
0.6024906 0.4056083 0.4534274 0.5408374 0.4049922 0.6675445 0.6465678 0.4715120
497 498 499 500 501 502 503 504
0.5401464 0.6643848 0.5083176 0.5777467 0.6608414 0.6270461 0.6042485 0.4782566
505 506 507 508 509 510 511 512
0.4855582 0.5080181 0.5708812 0.3264045 0.2401885 0.4996238 0.4393190 0.5288130
513 514 515 516 517 518 519 520
0.6307966 0.5221136 0.4902619 0.3124724 0.4883323 0.6055624 0.6676960 0.3153218
521 522 523 524 525 526 527 528
0.5706548 0.7265228 0.4688353 0.7535911 0.2864384 0.4255396 0.5168890 0.5416064
529 530 531 532 533 534 535 536
0.3253314 0.4258075 0.4513014 0.4898105 0.4509244 0.5186557 0.5572429 0.5717047
537 538 539 540 541 542 543 544
0.5199186 0.6549577 0.5648237 0.3323600 0.3112793 0.3323193 0.4892064 0.5304498
545 546 547 548 549 550 551 552
0.3793679 0.6681316 0.2834788 0.5160642 0.6494678 0.5879858 0.7030169 0.5391498
553 554 555 556 557 558 559 560
0.4729563 0.4338012 0.2700500 0.5766030 0.4355931 0.3940764 0.4735329 0.3522311
561 562 563 564 565 566 567 568
0.5473638 0.5218875 0.4539263 0.5429818 0.2882340 0.4779624 0.4499494 0.2210045
569 570 571 572 573 574 575 576
0.4930442 0.4514294 0.4518130 0.5034755 0.2057048 0.4198981 0.5636499 0.5707684
577 578 579 580 581 582 583 584
0.5173999 0.6500831 0.6571617 0.2834081 0.5559738 0.6912157 0.3613703 0.3271833
585 586 587 588 589 590 591 592
0.6318121 0.5285002 0.3576424 0.6408956 0.4697698 0.6035906 0.5537802 0.5241981
593 594 595 596 597 598 599 600
0.2681833 0.5594425 0.4505771 0.3507218 0.4829141 0.3513563 0.3871463 0.3866829
601 602 603 604 605 606 607 608
0.3774529 0.4977457 0.6102550 0.6150915 0.3015230 0.3822162 0.3476853 0.5785309
609 610 611 612 613 614 615 616
0.3288456 0.5199186 0.4507741 0.7517613 0.4835424 0.4357300 0.5912290 0.1975542
617 618 619 620 621 622 623 624
0.6324276 0.4926433 0.6428064 0.6044956 0.4241395 0.5183473 0.4422708 0.5325480
625 626 627 628 629 630 631 632
0.3888191 0.3797409 0.7065705 0.6294821 0.5365952 0.4343361 0.6710594 0.3574688
633 634 635 636 637 638 639 640
0.5834074 0.5684612 0.6643916 0.5705592 0.5987904 0.5195486 0.4927892 0.6667567
641 642 643 644 645 646 647 648
0.3534797 0.4714450 0.6140187 0.5371778 0.6679232 0.4730720 0.4632482 0.3711580
649 650 651 652 653 654 655 656
0.6913967 0.5321504 0.3714314 0.5232536 0.5672066 0.5166851 0.2858771 0.7206994
657 658 659 660 661 662 663 664
0.6583575 0.5521613 0.5910268 0.2940428 0.5175582 0.5826563 0.5129061 0.5724603
665 666 667 668 669 670 671 672
0.3505798 0.5475466 0.6232859 0.5165022 0.4579919 0.3438392 0.5352240 0.3755925
673 674 675 676 677 678 679 680
0.4853854 0.5729956 0.4338928 0.3671419 0.5423250 0.4035507 0.5136462 0.5286153
681 682 683 684 685 686 687 688
0.4867812 0.6036274 0.3574688 0.5804260 0.6431296 0.5682041 0.2620701 0.5126479
689 690 691 692 693 694 695 696
0.5547625 0.3383848 0.6195539 0.5739689 0.4818225 0.5245362 0.5220194 0.5934658
697 698 699 700 701 702 703 704
0.6498768 0.2861209 0.4353508 0.4055371 0.4464738 0.6491810 0.4984476 0.5380349
705 706 707 708 709 710 711 712
0.7206994 0.5986811 0.6530875 0.6045247 0.6062876 0.3883441 0.7324152 0.5694182
713 714 715 716 717 718 719 720
0.5636975 0.3232366 0.5511343 0.6665418 0.4217280 0.6096908 0.3874434 0.4921717
721 722 723 724 725 726 727 728
0.5622395 0.5917393 0.7743421 0.5334991 0.4364504 0.6620268 0.4957752 0.4374565
729 730 731 732 733 734 735 736
0.1710442 0.5357480 0.4367843 0.3700905 0.5684612 0.2191114 0.6699031 0.5857104
737 738 739 740 741 742 743 744
0.5366599 0.6440388 0.7232385 0.5800796 0.4823823 0.6086456 0.5186366 0.5604012
745 746 747 748 749 750 751 752
0.4501189 0.5780393 0.5063995 0.3154241 0.5529742 0.2500644 0.2811618 0.5434635
753 754 755 756 757 758 759 760
0.7206994 0.3258376 0.4553548 0.4173357 0.6959033 0.3234113 0.6026897 0.6509001
761 762 763 764 765 766 767 768
0.3923168 0.4520849 0.6483107 0.5462639 0.4198524 0.3174141 0.5520007 0.4605614
769 770 771 772 773 774 775 776
0.3861013 0.6957385 0.5629022 0.6431296 0.3288456 0.4809317 0.5127728 0.2112260
777 778 779 780 781 782 783 784
0.5897596 0.4663009 0.5214140 0.4991150 0.4950384 0.3520163 0.2096184 0.5785555
785 786 787 788 789 790 791 792
0.3574688 0.4054741 0.5906305 0.5191112 0.4254227 0.5730003 0.4777945 0.5694182
793 794 795 796 797 798 799 800
0.3776194 0.4375462 0.4061695 0.4416180 0.4878668 0.6183054 0.6625182 0.5749324
801 802 803 804 805 806 807 808
0.3496052 0.4474796 0.4573349 0.3738671 0.4962705 0.6056556 0.5397286 0.5330852
809 810 811 812 813 814 815 816
0.6404623 0.4627604 0.5409817 0.5758109 0.6287659 0.7355160 0.2883089 0.2814629
817 818 819 820 821 822 823 824
0.6335536 0.3520163 0.6475683 0.6675584 0.5612868 0.5425750 0.4003103 0.5888777
825 826 827 828 829 830 831 832
0.4710296 0.5851241 0.4431081 0.2525076 0.5423250 0.3764682 0.3502073 0.4506896
833 834 835 836 837 838 839 840
0.4870202 0.6178767 0.3767529 0.5599727 0.6218147 0.5311305 0.5641583 0.4440120
841 842 843 844 845 846 847 848
0.7265359 0.3660982 0.3399501 0.5455034 0.5349303 0.4770592 0.6476369 0.6913989
849 850
0.5082580 0.3496052
sub_train_updated <- sub_training %>%
mutate(pi = train_pi) %>%
mutate(prediction = case_when(pi > 0.5 ~ 'No',
pi <= 0.5 ~ 'Yes'))
sub_train_updated# A tibble: 850 × 11
id renewed num_contacts contact_recency num_complaints spend lor gender
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 187 No 0 28 0 213 248 Male
2 269 No 1 12 2 425 82 Male
3 376 No 0 28 2 0 15 Female
4 400 No 1 11 1 0 12 Male
5 679 Yes 0 28 0 216 300 Male
6 565 Yes 0 28 0 425 349 Female
7 956 Yes 0 28 0 235 109 Male
8 916 Yes 1 7 0 197 156 Female
9 373 No 0 28 1 425 36 Male
10 152 No 1 10 0 425 49 Male
# ℹ 840 more rows
# ℹ 3 more variables: age <dbl>, pi <dbl>, prediction <chr>
sub_train_updated$prediction <- factor(sub_train_updated$prediction, levels = c("Yes", "No"))
train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$prediction, dnn = c('Actual', 'Predicted'))
train_con_mat Predicted
Actual Yes No
Yes 248 176
No 147 279
A.
The categorical predictor variable used was gender the dummy variables were genderMale and genderFemale, the genderFemale variable was omitted.
B.
Regression equation = 1.308 - 0.400(num_contacts) + 0.006(contact_recency) - 0.046(num_complaints) - 0.0004(spend) - 0.003(spend) - 0.418(genderMale) - 0.0098(age)
C.
The important variables are lor and genderMale, because their p-value is lower than 0.05.
D.
How does each variable effect the likely-hood of re-subscription
num_contacts 0.0400276 times more likely per contact
contact_recency 0.0061870 times less likely per day since last contact
num_complaints -0.0463282 times more likely per complaint
spend -0.0004400 times more likely per every € spent
lor -0.0026467 times more likely per each day spent with the service
genderMale -0.4179377 times more likely if customer is male
age -0.0098297 times more likely with each year of rising age
Training Data Accuracy
(248+279)/850 = 0.62 = 62% same accuracy as our tree on the training data
test_pi <- predict(renewal_lr, newdata = sub_testing, type = 'response')
test_pi 1 2 3 4 5 6 7 8
0.6749299 0.5142839 0.5543692 0.3878103 0.6858986 0.5759458 0.4746917 0.5945342
9 10 11 12 13 14 15 16
0.1878282 0.4448073 0.3568611 0.6636202 0.4591778 0.3916342 0.4832867 0.4180025
17 18 19 20 21 22 23 24
0.5925629 0.5776361 0.4090364 0.2882175 0.6340636 0.5296683 0.6064848 0.3907746
25 26 27 28 29 30 31 32
0.5500747 0.5448264 0.4574808 0.5498260 0.5205010 0.3122499 0.3099375 0.6140153
33 34 35 36 37 38 39 40
0.5965908 0.6373731 0.4107937 0.5689260 0.5205792 0.4663834 0.5406037 0.4252380
41 42 43 44 45 46 47 48
0.4899970 0.6528472 0.6879073 0.4512635 0.4571741 0.5225372 0.4761310 0.6971559
49 50 51 52 53 54 55 56
0.5960685 0.4710296 0.6836818 0.7324152 0.4904323 0.5798912 0.5637999 0.5338123
57 58 59 60 61 62 63 64
0.2953440 0.3777626 0.5974466 0.6772484 0.5685965 0.5542381 0.5220884 0.3523619
65 66 67 68 69 70 71 72
0.3797260 0.6150382 0.2605113 0.4375462 0.2528963 0.4898069 0.5336169 0.3544350
73 74 75 76 77 78 79 80
0.5190080 0.3783558 0.3180604 0.4469370 0.4029038 0.5397286 0.5270522 0.4578775
81 82 83 84 85 86 87 88
0.5469198 0.4230744 0.5951762 0.6431296 0.5176362 0.1526149 0.4322731 0.4926881
89 90 91 92 93 94 95 96
0.3178437 0.4185311 0.6970719 0.4818667 0.4937076 0.5910268 0.5524537 0.5896009
97 98 99 100 101 102 103 104
0.4210315 0.4974685 0.5760784 0.5498527 0.2983391 0.4516625 0.5910268 0.5594253
105 106 107 108 109 110 111 112
0.5148644 0.3837301 0.3392984 0.4782893 0.4458620 0.5799467 0.4955099 0.6560958
113 114 115 116 117 118 119 120
0.5447166 0.6703427 0.4705087 0.4424130 0.4317793 0.3701891 0.4967702 0.5929447
121 122 123 124 125 126 127 128
0.6576330 0.3520163 0.4677336 0.6446453 0.5062266 0.6621405 0.4085815 0.5771271
129 130 131 132 133 134 135 136
0.5656715 0.3881990 0.3822840 0.5215530 0.7324152 0.6062876 0.6074287 0.6137915
137 138 139 140 141 142 143 144
0.6613775 0.4749881 0.3568611 0.4011934 0.5791383 0.4401087 0.6462766 0.6318130
145 146 147 148 149 150
0.5800670 0.5055751 0.7324152 0.3449358 0.5972892 0.4055021
sub_test_updated <- sub_testing %>%
mutate(pi = test_pi) %>%
mutate(prediction = case_when(pi > 0.5 ~ 'No',
pi <= 0.5 ~ 'Yes'))
sub_test_updated# A tibble: 150 × 11
id renewed num_contacts contact_recency num_complaints spend lor gender
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 942 Yes 0 28 0 213 81 Female
2 299 No 0 28 0 425 120 Male
3 44 No 0 28 3 477 89 Female
4 706 Yes 1 14 0 425 361 Female
5 965 Yes 0 28 0 235 36 Female
6 354 No 1 11 0 0 12 Male
7 156 No 0 28 0 477 266 Female
8 107 No 1 29 7 242 50 Female
9 525 Yes 18 3 0 425 301 Male
10 174 No 2 26 0 425 157 Male
# ℹ 140 more rows
# ℹ 3 more variables: age <dbl>, pi <dbl>, prediction <chr>
sub_test_updated$prediction <- factor(sub_test_updated$prediction, levels = c("Yes", "No"))
test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$prediction, dnn = c('Actual', 'Predicted'))
test_con_mat Predicted
Actual Yes No
Yes 35 41
No 36 38
Training Data Accuracy
(35+38)/150 = 0.486 = 48.6% lower accuracy than even our tree, meaning this model is not very accurate at all
Model Comparison & Marketing Actions
The Classification tree was a bit more accurate, even though it wasn’t by very much, the inaccuracy of both models is likely due to the lack of important variables, making it difficult for the algorithm to find patterns, our company likely needs to collect data with higher importance in regards to renewal of our services. If we had to choose we would go with the classification tree as it turned out to be more accurate and after pruning could turn out to be much more accurate, and because it’s simple design will make it easier to explain the outcome to other departments.
Some of the main drivers of retention are length of relationship and gender both of which we identified as such in the initial analysis. Men who have been with the service for a longer amount of time are more likely to stay. Interestingly enough when we did our initial analysis of spend we found that it seems to have a pattern, which is hard to decipher, it seemed that people who spend more on average stay with the company but a lot of higher spenders churn, which left us wondering whether or not it is an important variable and interestingly enough the classification tree ranked it as important as gender, while the linear regression ranked it as not important.
To Lower churn rates we advise that a campaign aimed towards women should take place. Could be a campaign featuring most prominent female artists. We also advise working on nurturing longer relationships, through loyalty programes and rewards for long time customers. Maybe we could introduce discounts at different levels of bought songs (1% off for every 100 songs bought capping at 20%) this would also encourage spend which is another variable potentially lowering our churn rate.