Music Churn Churn

Visual Exploration

Setup

First we will set up our quarto file by loading all the needed libraries and importing our testing and training datasets.

library(tidyverse)
library(kableExtra)
library(rattle)
library(rpart)
sub_testing <- read_csv("sub_testing.csv")
sub_training <- read_csv("sub_training.csv")

Renewal Percentage

Our first action will be making a table that will show us the percentages of customers that renewed and customers that did not. It is important that we find out how many customers we are losing.

renew_table <- sub_training %>%
  group_by(renewed) %>%
  summarise(count = n()) %>%
  mutate(renew_percentage = round(count/sum(count)*100, 1)) %>%
  arrange(desc(count))

knitr::kable(renew_table,
             align = "lrr", 
             col.names = c("Renewed", "Number of Customers", "Percentage"),
             caption = "Churn Rate",
            table.attr = 'data-quarto-disable-processing = "true"') %>% 
  kable_styling(full_width = F) %>%
    row_spec(2, color = "white", background = "green") %>%
  row_spec(1, color = "white", background = "red")

Churn Rate
Renewed	Number of Customers	Percentage
No	426	50.1
Yes	424	49.9

Findings

The split between renewed subscription and canceled one is nearly 50/50 this is a very high churn rate and it is integral that we find what is causing this, what customers are likely to churn and keep them from doing so

Overlapping Histogram of Contact Frequency and Renewal

ggplot(data = sub_training) +
  geom_histogram(mapping = aes(x = num_contacts, fill = renewed, ), position = "identity", binwidth = 1, alpha = 0.5) +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(0, 40)) +
  labs(title = "Overlapping Histogram of Contact Frequency and Renewal",
    x = "Number of Contact",
    y = element_blank())

Findings

We can see that most of our customers have little contact with our service (<10), these customers are also much more likely to churn from our service, since any customer who was in contact with our service more than 30 times had ended up renewing their subscription. Meaning Number of Contact could be a useful variable for our predictive algorithms.

Percentage of Renewed Based on Contact Recency

For the variable contact recency we will be looking to see the percentage of renewal at each day since last contact to see if there is a drop off in subscription renewals based on the time spend away from the service.

contact_recency_renewed <- sub_training %>%
  count(contact_recency, renewed) %>%
  group_by(contact_recency) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  ungroup()
  
ggplot(contact_recency_renewed, aes(x = contact_recency, y = percentage, fill = renewed)) +
  geom_bar(stat = "identity") +
  ylab("Renewed (%)") +
  xlab("Time Since Last Contact") +
  ggtitle("Percentage of Renewed Based on Contact Recency")

Findings

The highest rate of re-subscription is at 21 and 26 days since last use and the lowest is at 25 days, there seems to be little to no pattern emerging and it is likely that this variable will not be very useful for our algorithm.

Renewal Rate by Number of Complaints

It is expected that complaints would have a strong correlation with churn whether it is the ability or inability of the customer service team to solve them, it is crucial to our business to find the outcome that multiple customer complaints have on our renewal rate. We will cap the number of complaints at 7 since there is only 1 customer per number of complaints higher that 7 and we are trying to find an emerging pattern, that singular responses would discredit.

complaint_renewal_rate <- sub_training %>%
  group_by(num_complaints) %>%
  summarise(
    n = n(),
    pct_renewed = mean(renewed == "Yes") * 100
  )
library(ggplot2)

ggplot(complaint_renewal_rate,
       aes(x = num_complaints, y = pct_renewed)) +
  geom_line(color = "red") +
  geom_point(size = 2, color = "red") +
  labs(
    title = "Renewal Rate by Number of Complaints",
    x = "Number of Complaints",
    y = "Percentage Renewed"
  ) +
  scale_y_continuous(limits = c(25, 75)) +
  scale_x_continuous(limits = c(0, 7))

Findings

Once again there is not a clear pattern emerging, all we can say is that there is a drop in renewal at 3 complaints, a surge of renewal at 4 and another drop at 7 complaints. The lack of clear trajectory will likely make it difficult for the algorithm to find use for this variable

Renewal Rate by Customer Spend

We are going to make a boxplot showing the distribution of customer spend based on whether the customer renewed or not, this will help us determine whether there is correlation between spend and renewal.

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = spend)) +
  labs(
    title = "Renewal Rate by Customer Spend",
    x = "Renewal Status",
    y = "Customer Spend")

Findings

Customers who renewed have median spend coinciding with the 75th percentile at around €420, indicating a strong concentration of higher spenders. Customers who did not renew show a lower median spend and bigger dispersion.At the same time customers that churned show a few high spenders, meaning it does not necessarily mean that spending more means renewal is promised.Either way there seems to be little correlation between spend and renewal.

Reneval Rate Based on Length of Relationship

Next we are going to look to figure out the connection between length or relationship to see whether staying longer with the service has an impact on renewal.

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = lor), 
    outlier.shape = 17, outlier.colour = "red") +
  labs(
    title = "Renewal Rate by Length of Relationship",
    x = "Renewal Status",
    y = "Length of Relationship (Days)")

Findings

Finally we find a variable that clearly shows correlation with renewal. Here we can see that the longer a customer is with the service the longer more likely they are to renew their subscription. There are a few outliers on the churn side signified with the red arrow, but it is clear from the graph that new customers churn more often.

Renewal Based on Gender

The gender distribution will help us see our target audience as well as how is more likely to churn men or women.

ggplot(data = sub_training) +
  geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge") +
  scale_fill_manual(values = c("forestgreen", "red")) +
labs(
    title = "Renewal Based on Gender")

Findings

Men are the larger demographic for our service with a lower rate of churning, while Women are more likely to churn, this brings forth the question whether we should target men in our campaigns since they are our target market or if we should try to pursue women to stay with our service to lower our churn rate.

Age Spread Based on Renewal Status

The age spread will once again point us towards our main customer segments as well as showing correlation of age with renewal.

ggplot(data = sub_training) +
  geom_histogram(mapping = aes(x = age, fill = renewed ), position = "dodge") +
  labs(title = "Age Spread Based on Renewal Status",
    x = "Age",
    y = element_blank())

Findings

We can see a few main customer segments form (45-55, 60-70), it also seem that that our younger customers are more likely to churn, while customers over the age of 60 are more likely to stay. # Classification Trees ## renewal Prediction Tree

renewal_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_training)
fancyRpartPlot(renewal_tree)

renewal_tree$variable.importance

            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338

summary(renewal_tree)

Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = sub_training)
  n= 850 

          CP nsplit rel error    xerror       xstd
1 0.19339623      0 1.0000000 1.0683962 0.03430592
2 0.01179245      1 0.8066038 0.8584906 0.03402464
3 0.01000000      4 0.7617925 0.8584906 0.03402464

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667

A.

If length of relationship is higher or equal to 140 days, then predict the customer will renew subscription. This node is labeled Yes (0.61 / 0.39), that means 61% of customers in this node re-subscribed, while 39% churned. This node also contains 45% of all observations, so it’s both fairly pure and very influential.

B.

If length of relationship is lower than 140 days and spend is higher or equal to €182, then predict the customer will churn. This node is labeled No (0.24 / 0.76), that means 76% of customers in this node churn, making it a highly pure churn node. It contains 10% of the data, so while smaller, it’s a strong and confident churn signal.

C.

Variable importance percentage lor spend age num_contacts contact_recency 47% 22% 22% 8% 1%

Model Accuracy

train_probs <- predict(renewal_tree, newdata = sub_training, type = 'prob')
train_preds <- predict(renewal_tree, newdata = sub_training, type = 'class')
sub_training_updated <- cbind(sub_training, train_probs, train_preds)
head(sub_training_updated)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds
1 0.3932292 0.6067708         Yes
2 0.5927052 0.4072948          No
3 0.7560976 0.2439024          No
4 0.7560976 0.2439024          No
5 0.3932292 0.6067708         Yes
6 0.3932292 0.6067708         Yes

train_con_mat <- table(sub_training_updated$renewed, sub_training_updated$train_preds, dnn = c('Actual', 'Predicted'))
train_con_mat

      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270

Training model accuracy

(270+257)/850 = 0.62 = 62% The training data shows 62% accuracy in prediction, this is fairly accurate.

test_probs <- predict(renewal_tree, newdata = sub_testing, type = 'prob')
test_preds <- predict(renewal_tree, newdata = sub_testing, type = 'class')
sub_testing_updated <- cbind(sub_testing, test_probs, test_preds)
head(sub_testing_updated)

   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 942     Yes            0              28              0   213  81 Female  45
2 299      No            0              28              0   425 120   Male  51
3  44      No            0              28              3   477  89 Female  69
4 706     Yes            1              14              0   425 361 Female  68
5 965     Yes            0              28              0   235  36 Female  51
6 354      No            1              11              0     0  12   Male  59
         No       Yes test_preds
1 0.5927052 0.4072948         No
2 0.5927052 0.4072948         No
3 0.3333333 0.6666667        Yes
4 0.3932292 0.6067708        Yes
5 0.5927052 0.4072948         No
6 0.7560976 0.2439024         No

test_con_mat <- table(sub_testing_updated$renewed, sub_testing_updated$test_preds, dnn = c('Actual', 'Predicted'))
test_con_mat

      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

Testing data accuracy

(36+41)/150 = 0.51333 = 51.3% this is much less accurate and it seems as though the model is over fitted as it demonstrated higher accuracy on the training data and will need to be pruned before it can be accurate, because with nearly 50% accuracy there is the same chance that a prediction will be correct as there is that it will not be correct.

Binary Logistic Regression

sub_training$renewed <- factor(sub_training$renewed, levels = c("Yes", "No"))
sub_testing$renewed <- factor(sub_testing$renewed, levels = c("Yes", "No"))

levels(sub_training$renewed)

[1] "Yes" "No"

levels(sub_testing$renewed)

[1] "Yes" "No"

renewal_lr <- glm(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub_training, family = binomial(link = 'logit'))
summary(renewal_lr)


Call:
glm(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, family = binomial(link = "logit"), 
    data = sub_training)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.3077726  0.3889894   3.362 0.000774 ***
num_contacts    -0.0400276  0.0250176  -1.600 0.109604    
contact_recency  0.0061870  0.0085374   0.725 0.468641    
num_complaints  -0.0463282  0.0575359  -0.805 0.420701    
spend           -0.0004400  0.0005652  -0.779 0.436256    
lor             -0.0026467  0.0007908  -3.347 0.000818 ***
genderMale      -0.4179377  0.1498906  -2.788 0.005299 ** 
age             -0.0098297  0.0064033  -1.535 0.124762    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1178.3  on 849  degrees of freedom
Residual deviance: 1125.4  on 842  degrees of freedom
AIC: 1141.4

Number of Fisher Scoring iterations: 4

train_pi <- predict(renewal_lr, newdata = sub_training, type = 'response')
train_pi

        1         2         3         4         5         6         7         8 
0.4677010 0.4594854 0.6958636 0.6004329 0.3788554 0.4260171 0.5423641 0.5912532 
        9        10        11        12        13        14        15        16 
0.5361077 0.5234023 0.4143391 0.6431296 0.4904683 0.6431296 0.7349879 0.3821701 
       17        18        19        20        21        22        23        24 
0.6497835 0.5693861 0.5829538 0.3574688 0.5133261 0.5067947 0.5221561 0.4771866 
       25        26        27        28        29        30        31        32 
0.4579919 0.6330254 0.5410108 0.5093449 0.5403136 0.6343816 0.6180604 0.5716042 
       33        34        35        36        37        38        39        40 
0.5719954 0.5363469 0.5746450 0.5421427 0.3517599 0.4942457 0.2845781 0.5433066 
       41        42        43        44        45        46        47        48 
0.4502692 0.4907864 0.5644121 0.6852244 0.6108140 0.4579919 0.5759761 0.5687692 
       49        50        51        52        53        54        55        56 
0.4491684 0.1827203 0.3635581 0.5399489 0.6548843 0.4601126 0.4087059 0.4461582 
       57        58        59        60        61        62        63        64 
0.6983748 0.6294821 0.6415901 0.5856446 0.4822225 0.4481498 0.4466630 0.6005874 
       65        66        67        68        69        70        71        72 
0.2593840 0.6083302 0.2961096 0.3407975 0.6317718 0.2945270 0.5417920 0.5059073 
       73        74        75        76        77        78        79        80 
0.4066799 0.5403860 0.7265964 0.4474685 0.3472018 0.5635760 0.7324152 0.6703948 
       81        82        83        84        85        86        87        88 
0.3640080 0.5466660 0.5039131 0.4302498 0.5451436 0.6157469 0.3780301 0.5370083 
       89        90        91        92        93        94        95        96 
0.4699923 0.2004668 0.6361487 0.4935883 0.4862908 0.6135804 0.4549286 0.5145120 
       97        98        99       100       101       102       103       104 
0.5691104 0.5244290 0.4745758 0.5910268 0.3123436 0.3339000 0.6461611 0.3877489 
      105       106       107       108       109       110       111       112 
0.2814171 0.6615571 0.6318434 0.3229300 0.6294821 0.1652461 0.5064308 0.5487011 
      113       114       115       116       117       118       119       120 
0.5600478 0.3965076 0.5007403 0.6703374 0.4944206 0.3243956 0.5504015 0.6740563 
      121       122       123       124       125       126       127       128 
0.2999757 0.5717047 0.4072116 0.5959668 0.3064827 0.6815100 0.6770166 0.3367275 
      129       130       131       132       133       134       135       136 
0.5424397 0.4558933 0.4284594 0.4946643 0.5185954 0.4776289 0.3063902 0.3346316 
      137       138       139       140       141       142       143       144 
0.6132449 0.5370371 0.4797394 0.6232065 0.5146475 0.5571657 0.6299639 0.6006706 
      145       146       147       148       149       150       151       152 
0.4723487 0.4590611 0.3202268 0.4028984 0.5645614 0.6431296 0.3822239 0.6720500 
      153       154       155       156       157       158       159       160 
0.5128036 0.6961400 0.6144683 0.3791406 0.5744130 0.5635029 0.2445590 0.3733007 
      161       162       163       164       165       166       167       168 
0.4460121 0.5616239 0.4657428 0.6354583 0.2915086 0.4969974 0.6714709 0.5345446 
      169       170       171       172       173       174       175       176 
0.4016256 0.1307567 0.4855279 0.4608509 0.4788049 0.5599091 0.5871828 0.5256568 
      177       178       179       180       181       182       183       184 
0.5122814 0.6477010 0.4677010 0.6130911 0.3818419 0.4842949 0.3198534 0.4811274 
      185       186       187       188       189       190       191       192 
0.7375704 0.6560758 0.4756482 0.3101851 0.4871934 0.5563336 0.2628863 0.6018565 
      193       194       195       196       197       198       199       200 
0.2585636 0.3055310 0.2643184 0.4729522 0.5582464 0.4782471 0.3433170 0.6363460 
      201       202       203       204       205       206       207       208 
0.6008290 0.5920195 0.6368669 0.5836863 0.4748238 0.6197136 0.5693574 0.6671508 
      209       210       211       212       213       214       215       216 
0.4858167 0.6043645 0.6521502 0.3891307 0.4581427 0.4430736 0.3495022 0.6626188 
      217       218       219       220       221       222       223       224 
0.5632597 0.4986951 0.3028746 0.6289835 0.4071339 0.4961413 0.6299040 0.4733891 
      225       226       227       228       229       230       231       232 
0.5431864 0.6224711 0.5796719 0.3244778 0.4149298 0.5146642 0.4984074 0.4321647 
      233       234       235       236       237       238       239       240 
0.4835751 0.4842361 0.5004421 0.6442159 0.4527406 0.5441835 0.4637985 0.2705899 
      241       242       243       244       245       246       247       248 
0.5810841 0.6037989 0.6340733 0.4141599 0.5891449 0.4106257 0.7017629 0.3943916 
      249       250       251       252       253       254       255       256 
0.3858911 0.5380443 0.4132452 0.6718076 0.4600869 0.5125219 0.5800796 0.3452770 
      257       258       259       260       261       262       263       264 
0.4983821 0.5498690 0.5272622 0.6132959 0.2585839 0.3507245 0.5481082 0.2738812 
      265       266       267       268       269       270       271       272 
0.4698438 0.4878984 0.3536026 0.4542031 0.5382878 0.7057366 0.4542192 0.5416679 
      273       274       275       276       277       278       279       280 
0.5434041 0.4782110 0.5515067 0.5478252 0.5623528 0.3238049 0.5972389 0.4274620 
      281       282       283       284       285       286       287       288 
0.4448375 0.6007578 0.5636195 0.3568611 0.5537802 0.6638012 0.7324152 0.4445463 
      289       290       291       292       293       294       295       296 
0.5244547 0.4113369 0.4657573 0.5399404 0.4456630 0.5580479 0.4118136 0.6428064 
      297       298       299       300       301       302       303       304 
0.4099737 0.5198662 0.4812821 0.4072116 0.5665124 0.2580510 0.6228761 0.6431296 
      305       306       307       308       309       310       311       312 
0.5478580 0.3759624 0.7375704 0.6868787 0.5377178 0.4025506 0.6304643 0.5910050 
      313       314       315       316       317       318       319       320 
0.5162816 0.4858411 0.5542381 0.3895684 0.4060438 0.5186139 0.4626069 0.4814627 
      321       322       323       324       325       326       327       328 
0.5346576 0.5564200 0.6181405 0.2716641 0.6083535 0.5338057 0.2828522 0.5047102 
      329       330       331       332       333       334       335       336 
0.4497246 0.3314465 0.4007238 0.6085163 0.6864346 0.6090518 0.5243224 0.5877959 
      337       338       339       340       341       342       343       344 
0.3280127 0.3416667 0.4431587 0.3871274 0.7297165 0.6644173 0.5635541 0.4281227 
      345       346       347       348       349       350       351       352 
0.5537802 0.3558110 0.5195787 0.2370498 0.5684102 0.5001113 0.2612965 0.5035397 
      353       354       355       356       357       358       359       360 
0.6168630 0.2833346 0.4381982 0.5658410 0.6424302 0.6507548 0.5614949 0.3633937 
      361       362       363       364       365       366       367       368 
0.1751609 0.5577009 0.6431296 0.3713276 0.2819890 0.6280196 0.5992573 0.3306004 
      369       370       371       372       373       374       375       376 
0.6439356 0.2288594 0.3441367 0.4917389 0.4778502 0.4900010 0.6122740 0.7139726 
      377       378       379       380       381       382       383       384 
0.2668355 0.6707502 0.4205910 0.5202498 0.5432344 0.4779103 0.4791531 0.6487980 
      385       386       387       388       389       390       391       392 
0.4669336 0.3507245 0.5235776 0.6920445 0.6372977 0.3287148 0.5018475 0.6548797 
      393       394       395       396       397       398       399       400 
0.6971938 0.6497835 0.2060701 0.3379128 0.5626084 0.6043285 0.3942669 0.4268394 
      401       402       403       404       405       406       407       408 
0.4622082 0.3263420 0.5380507 0.6399216 0.6431296 0.4578496 0.3711395 0.7365445 
      409       410       411       412       413       414       415       416 
0.4482157 0.7365445 0.2487620 0.4377381 0.5598326 0.5096548 0.5637405 0.5738941 
      417       418       419       420       421       422       423       424 
0.6637708 0.5626084 0.3148528 0.7365445 0.5842475 0.5482312 0.3241894 0.7016749 
      425       426       427       428       429       430       431       432 
0.3746643 0.4596435 0.5001113 0.5087909 0.5459190 0.2898164 0.7125890 0.6375720 
      433       434       435       436       437       438       439       440 
0.6001948 0.5153320 0.3308627 0.6697986 0.5794348 0.6606534 0.6799673 0.6927005 
      441       442       443       444       445       446       447       448 
0.7025491 0.3828420 0.3768328 0.5780473 0.4967702 0.6493660 0.4421074 0.4264252 
      449       450       451       452       453       454       455       456 
0.5766562 0.6854089 0.4829141 0.2909187 0.4565174 0.6254189 0.7324152 0.5560027 
      457       458       459       460       461       462       463       464 
0.4323424 0.2715702 0.5596895 0.5895547 0.5675548 0.4563643 0.3397858 0.3496052 
      465       466       467       468       469       470       471       472 
0.5616145 0.5988666 0.4769686 0.4694810 0.3501220 0.4483479 0.4345273 0.2798223 
      473       474       475       476       477       478       479       480 
0.4008217 0.4326561 0.2628863 0.6185790 0.5728295 0.2747910 0.4180493 0.4291296 
      481       482       483       484       485       486       487       488 
0.3886783 0.2628863 0.5338683 0.4266645 0.7232385 0.5441111 0.5315907 0.6213967 
      489       490       491       492       493       494       495       496 
0.6024906 0.4056083 0.4534274 0.5408374 0.4049922 0.6675445 0.6465678 0.4715120 
      497       498       499       500       501       502       503       504 
0.5401464 0.6643848 0.5083176 0.5777467 0.6608414 0.6270461 0.6042485 0.4782566 
      505       506       507       508       509       510       511       512 
0.4855582 0.5080181 0.5708812 0.3264045 0.2401885 0.4996238 0.4393190 0.5288130 
      513       514       515       516       517       518       519       520 
0.6307966 0.5221136 0.4902619 0.3124724 0.4883323 0.6055624 0.6676960 0.3153218 
      521       522       523       524       525       526       527       528 
0.5706548 0.7265228 0.4688353 0.7535911 0.2864384 0.4255396 0.5168890 0.5416064 
      529       530       531       532       533       534       535       536 
0.3253314 0.4258075 0.4513014 0.4898105 0.4509244 0.5186557 0.5572429 0.5717047 
      537       538       539       540       541       542       543       544 
0.5199186 0.6549577 0.5648237 0.3323600 0.3112793 0.3323193 0.4892064 0.5304498 
      545       546       547       548       549       550       551       552 
0.3793679 0.6681316 0.2834788 0.5160642 0.6494678 0.5879858 0.7030169 0.5391498 
      553       554       555       556       557       558       559       560 
0.4729563 0.4338012 0.2700500 0.5766030 0.4355931 0.3940764 0.4735329 0.3522311 
      561       562       563       564       565       566       567       568 
0.5473638 0.5218875 0.4539263 0.5429818 0.2882340 0.4779624 0.4499494 0.2210045 
      569       570       571       572       573       574       575       576 
0.4930442 0.4514294 0.4518130 0.5034755 0.2057048 0.4198981 0.5636499 0.5707684 
      577       578       579       580       581       582       583       584 
0.5173999 0.6500831 0.6571617 0.2834081 0.5559738 0.6912157 0.3613703 0.3271833 
      585       586       587       588       589       590       591       592 
0.6318121 0.5285002 0.3576424 0.6408956 0.4697698 0.6035906 0.5537802 0.5241981 
      593       594       595       596       597       598       599       600 
0.2681833 0.5594425 0.4505771 0.3507218 0.4829141 0.3513563 0.3871463 0.3866829 
      601       602       603       604       605       606       607       608 
0.3774529 0.4977457 0.6102550 0.6150915 0.3015230 0.3822162 0.3476853 0.5785309 
      609       610       611       612       613       614       615       616 
0.3288456 0.5199186 0.4507741 0.7517613 0.4835424 0.4357300 0.5912290 0.1975542 
      617       618       619       620       621       622       623       624 
0.6324276 0.4926433 0.6428064 0.6044956 0.4241395 0.5183473 0.4422708 0.5325480 
      625       626       627       628       629       630       631       632 
0.3888191 0.3797409 0.7065705 0.6294821 0.5365952 0.4343361 0.6710594 0.3574688 
      633       634       635       636       637       638       639       640 
0.5834074 0.5684612 0.6643916 0.5705592 0.5987904 0.5195486 0.4927892 0.6667567 
      641       642       643       644       645       646       647       648 
0.3534797 0.4714450 0.6140187 0.5371778 0.6679232 0.4730720 0.4632482 0.3711580 
      649       650       651       652       653       654       655       656 
0.6913967 0.5321504 0.3714314 0.5232536 0.5672066 0.5166851 0.2858771 0.7206994 
      657       658       659       660       661       662       663       664 
0.6583575 0.5521613 0.5910268 0.2940428 0.5175582 0.5826563 0.5129061 0.5724603 
      665       666       667       668       669       670       671       672 
0.3505798 0.5475466 0.6232859 0.5165022 0.4579919 0.3438392 0.5352240 0.3755925 
      673       674       675       676       677       678       679       680 
0.4853854 0.5729956 0.4338928 0.3671419 0.5423250 0.4035507 0.5136462 0.5286153 
      681       682       683       684       685       686       687       688 
0.4867812 0.6036274 0.3574688 0.5804260 0.6431296 0.5682041 0.2620701 0.5126479 
      689       690       691       692       693       694       695       696 
0.5547625 0.3383848 0.6195539 0.5739689 0.4818225 0.5245362 0.5220194 0.5934658 
      697       698       699       700       701       702       703       704 
0.6498768 0.2861209 0.4353508 0.4055371 0.4464738 0.6491810 0.4984476 0.5380349 
      705       706       707       708       709       710       711       712 
0.7206994 0.5986811 0.6530875 0.6045247 0.6062876 0.3883441 0.7324152 0.5694182 
      713       714       715       716       717       718       719       720 
0.5636975 0.3232366 0.5511343 0.6665418 0.4217280 0.6096908 0.3874434 0.4921717 
      721       722       723       724       725       726       727       728 
0.5622395 0.5917393 0.7743421 0.5334991 0.4364504 0.6620268 0.4957752 0.4374565 
      729       730       731       732       733       734       735       736 
0.1710442 0.5357480 0.4367843 0.3700905 0.5684612 0.2191114 0.6699031 0.5857104 
      737       738       739       740       741       742       743       744 
0.5366599 0.6440388 0.7232385 0.5800796 0.4823823 0.6086456 0.5186366 0.5604012 
      745       746       747       748       749       750       751       752 
0.4501189 0.5780393 0.5063995 0.3154241 0.5529742 0.2500644 0.2811618 0.5434635 
      753       754       755       756       757       758       759       760 
0.7206994 0.3258376 0.4553548 0.4173357 0.6959033 0.3234113 0.6026897 0.6509001 
      761       762       763       764       765       766       767       768 
0.3923168 0.4520849 0.6483107 0.5462639 0.4198524 0.3174141 0.5520007 0.4605614 
      769       770       771       772       773       774       775       776 
0.3861013 0.6957385 0.5629022 0.6431296 0.3288456 0.4809317 0.5127728 0.2112260 
      777       778       779       780       781       782       783       784 
0.5897596 0.4663009 0.5214140 0.4991150 0.4950384 0.3520163 0.2096184 0.5785555 
      785       786       787       788       789       790       791       792 
0.3574688 0.4054741 0.5906305 0.5191112 0.4254227 0.5730003 0.4777945 0.5694182 
      793       794       795       796       797       798       799       800 
0.3776194 0.4375462 0.4061695 0.4416180 0.4878668 0.6183054 0.6625182 0.5749324 
      801       802       803       804       805       806       807       808 
0.3496052 0.4474796 0.4573349 0.3738671 0.4962705 0.6056556 0.5397286 0.5330852 
      809       810       811       812       813       814       815       816 
0.6404623 0.4627604 0.5409817 0.5758109 0.6287659 0.7355160 0.2883089 0.2814629 
      817       818       819       820       821       822       823       824 
0.6335536 0.3520163 0.6475683 0.6675584 0.5612868 0.5425750 0.4003103 0.5888777 
      825       826       827       828       829       830       831       832 
0.4710296 0.5851241 0.4431081 0.2525076 0.5423250 0.3764682 0.3502073 0.4506896 
      833       834       835       836       837       838       839       840 
0.4870202 0.6178767 0.3767529 0.5599727 0.6218147 0.5311305 0.5641583 0.4440120 
      841       842       843       844       845       846       847       848 
0.7265359 0.3660982 0.3399501 0.5455034 0.5349303 0.4770592 0.6476369 0.6913989 
      849       850 
0.5082580 0.3496052

sub_train_updated <- sub_training %>%
                            mutate(pi = train_pi) %>%
                            mutate(prediction = case_when(pi > 0.5 ~ 'No', 
                                                          pi <= 0.5 ~ 'Yes'))
sub_train_updated

# A tibble: 850 × 11
      id renewed num_contacts contact_recency num_complaints spend   lor gender
   <dbl> <fct>          <dbl>           <dbl>          <dbl> <dbl> <dbl> <chr> 
 1   187 No                 0              28              0   213   248 Male  
 2   269 No                 1              12              2   425    82 Male  
 3   376 No                 0              28              2     0    15 Female
 4   400 No                 1              11              1     0    12 Male  
 5   679 Yes                0              28              0   216   300 Male  
 6   565 Yes                0              28              0   425   349 Female
 7   956 Yes                0              28              0   235   109 Male  
 8   916 Yes                1               7              0   197   156 Female
 9   373 No                 0              28              1   425    36 Male  
10   152 No                 1              10              0   425    49 Male  
# ℹ 840 more rows
# ℹ 3 more variables: age <dbl>, pi <dbl>, prediction <chr>

sub_train_updated$prediction <- factor(sub_train_updated$prediction, levels = c("Yes", "No"))

train_con_mat <- table(sub_train_updated$renewed, sub_train_updated$prediction, dnn = c('Actual', 'Predicted'))
train_con_mat

      Predicted
Actual Yes  No
   Yes 248 176
   No  147 279

A.

The categorical predictor variable used was gender the dummy variables were genderMale and genderFemale, the genderFemale variable was omitted.

B.

Regression equation = 1.308 - 0.400(num_contacts) + 0.006(contact_recency) - 0.046(num_complaints) - 0.0004(spend) - 0.003(spend) - 0.418(genderMale) - 0.0098(age)

C.

The important variables are lor and genderMale, because their p-value is lower than 0.05.

D.

How does each variable effect the likely-hood of re-subscription
num_contacts 0.0400276 times more likely per contact
contact_recency 0.0061870 times less likely per day since last contact
num_complaints -0.0463282 times more likely per complaint
spend -0.0004400 times more likely per every € spent
lor -0.0026467 times more likely per each day spent with the service
genderMale -0.4179377 times more likely if customer is male
age -0.0098297 times more likely with each year of rising age

Training Data Accuracy

(248+279)/850 = 0.62 = 62% same accuracy as our tree on the training data

test_pi <- predict(renewal_lr, newdata = sub_testing, type = 'response')
test_pi

        1         2         3         4         5         6         7         8 
0.6749299 0.5142839 0.5543692 0.3878103 0.6858986 0.5759458 0.4746917 0.5945342 
        9        10        11        12        13        14        15        16 
0.1878282 0.4448073 0.3568611 0.6636202 0.4591778 0.3916342 0.4832867 0.4180025 
       17        18        19        20        21        22        23        24 
0.5925629 0.5776361 0.4090364 0.2882175 0.6340636 0.5296683 0.6064848 0.3907746 
       25        26        27        28        29        30        31        32 
0.5500747 0.5448264 0.4574808 0.5498260 0.5205010 0.3122499 0.3099375 0.6140153 
       33        34        35        36        37        38        39        40 
0.5965908 0.6373731 0.4107937 0.5689260 0.5205792 0.4663834 0.5406037 0.4252380 
       41        42        43        44        45        46        47        48 
0.4899970 0.6528472 0.6879073 0.4512635 0.4571741 0.5225372 0.4761310 0.6971559 
       49        50        51        52        53        54        55        56 
0.5960685 0.4710296 0.6836818 0.7324152 0.4904323 0.5798912 0.5637999 0.5338123 
       57        58        59        60        61        62        63        64 
0.2953440 0.3777626 0.5974466 0.6772484 0.5685965 0.5542381 0.5220884 0.3523619 
       65        66        67        68        69        70        71        72 
0.3797260 0.6150382 0.2605113 0.4375462 0.2528963 0.4898069 0.5336169 0.3544350 
       73        74        75        76        77        78        79        80 
0.5190080 0.3783558 0.3180604 0.4469370 0.4029038 0.5397286 0.5270522 0.4578775 
       81        82        83        84        85        86        87        88 
0.5469198 0.4230744 0.5951762 0.6431296 0.5176362 0.1526149 0.4322731 0.4926881 
       89        90        91        92        93        94        95        96 
0.3178437 0.4185311 0.6970719 0.4818667 0.4937076 0.5910268 0.5524537 0.5896009 
       97        98        99       100       101       102       103       104 
0.4210315 0.4974685 0.5760784 0.5498527 0.2983391 0.4516625 0.5910268 0.5594253 
      105       106       107       108       109       110       111       112 
0.5148644 0.3837301 0.3392984 0.4782893 0.4458620 0.5799467 0.4955099 0.6560958 
      113       114       115       116       117       118       119       120 
0.5447166 0.6703427 0.4705087 0.4424130 0.4317793 0.3701891 0.4967702 0.5929447 
      121       122       123       124       125       126       127       128 
0.6576330 0.3520163 0.4677336 0.6446453 0.5062266 0.6621405 0.4085815 0.5771271 
      129       130       131       132       133       134       135       136 
0.5656715 0.3881990 0.3822840 0.5215530 0.7324152 0.6062876 0.6074287 0.6137915 
      137       138       139       140       141       142       143       144 
0.6613775 0.4749881 0.3568611 0.4011934 0.5791383 0.4401087 0.6462766 0.6318130 
      145       146       147       148       149       150 
0.5800670 0.5055751 0.7324152 0.3449358 0.5972892 0.4055021

sub_test_updated <- sub_testing %>%
  mutate(pi = test_pi) %>%
  mutate(prediction = case_when(pi > 0.5 ~ 'No', 
                                pi <= 0.5 ~ 'Yes'))

sub_test_updated

# A tibble: 150 × 11
      id renewed num_contacts contact_recency num_complaints spend   lor gender
   <dbl> <fct>          <dbl>           <dbl>          <dbl> <dbl> <dbl> <chr> 
 1   942 Yes                0              28              0   213    81 Female
 2   299 No                 0              28              0   425   120 Male  
 3    44 No                 0              28              3   477    89 Female
 4   706 Yes                1              14              0   425   361 Female
 5   965 Yes                0              28              0   235    36 Female
 6   354 No                 1              11              0     0    12 Male  
 7   156 No                 0              28              0   477   266 Female
 8   107 No                 1              29              7   242    50 Female
 9   525 Yes               18               3              0   425   301 Male  
10   174 No                 2              26              0   425   157 Male  
# ℹ 140 more rows
# ℹ 3 more variables: age <dbl>, pi <dbl>, prediction <chr>

sub_test_updated$prediction <- factor(sub_test_updated$prediction, levels = c("Yes", "No"))

test_con_mat <- table(sub_test_updated$renewed, sub_test_updated$prediction, dnn = c('Actual', 'Predicted'))
test_con_mat

      Predicted
Actual Yes No
   Yes  35 41
   No   36 38

Training Data Accuracy

(35+38)/150 = 0.486 = 48.6% lower accuracy than even our tree, meaning this model is not very accurate at all

Model Comparison & Marketing Actions

The Classification tree was a bit more accurate, even though it wasn’t by very much, the inaccuracy of both models is likely due to the lack of important variables, making it difficult for the algorithm to find patterns, our company likely needs to collect data with higher importance in regards to renewal of our services. If we had to choose we would go with the classification tree as it turned out to be more accurate and after pruning could turn out to be much more accurate, and because it’s simple design will make it easier to explain the outcome to other departments.

Some of the main drivers of retention are length of relationship and gender both of which we identified as such in the initial analysis. Men who have been with the service for a longer amount of time are more likely to stay. Interestingly enough when we did our initial analysis of spend we found that it seems to have a pattern, which is hard to decipher, it seemed that people who spend more on average stay with the company but a lot of higher spenders churn, which left us wondering whether or not it is an important variable and interestingly enough the classification tree ranked it as important as gender, while the linear regression ranked it as not important.

To Lower churn rates we advise that a campaign aimed towards women should take place. Could be a campaign featuring most prominent female artists. We also advise working on nurturing longer relationships, through loyalty programes and rewards for long time customers. Maybe we could introduce discounts at different levels of bought songs (1% off for every 100 songs bought capping at 20%) this would also encourage spend which is another variable potentially lowering our churn rate.