Music_Assignment

Author

Conor

An Exploratory Analysis of a Music Subscription Data Set

Part A - This report gives an in depth analysis into the exploration of a music companies subscription data set. This report is being done to aid the music company with their own business and to help them gain insight into their own clientele so that improvements can be made to their subscription model.

In order for this to be done, it is important than all data is visualized and then subsequently analysed. See figures below.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sub_testing <- read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_training <- read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Figure 1: Customer Renewal and Gender

ggplot(data = sub_training) + 
  geom_bar(mapping = aes(x = gender, fill = renewed), position = "dodge")

The above graph depicts the amount of customers who churned vs customers who renewed their subscription for each gender. This graph suggests that males make more renewals than non-renewals. It also suggests the opposite for females, stating that more females churned than renewed.

Figure 2: Customer Renewal and Age

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = age, fill = renewed))

This box plot show the relationship between the age of customers and whether or not they renewed their subscription. From this box plot we can see that the median age for those customers who did renew is ever so slightly higher, suggesting that there is a minor relationship between older customers and renewal compared to younger customers, however, this is a very small difference.

Figure 3: Amount Spent and Renewal

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = spend, fill = renewed))

This box plot shows the relationship between the amount a customer has spent and whether or not they renewed. The box plot shows that median spend is higher with customers who had spent more money, this suggests that customers who spend more money and seemingly more willing to renew their subscription.

Figure 4: Number of Contacts and Renewal

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = num_contacts, fill = renewed))

This next box plot shows the relationship between the customers who called the music company, those who didn’t and whether or not they renewed their subscription. The graph suggests that customers who did call the company had a higher rate of renewal. This shows that customers who made contact with the company were more willing to renew, probably due to helpful information provided by the company. There are also a few outliers for both those who renewed and didn’t.

Figure 5: Contact Recency and Renewal

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = contact_recency, fill = renewed))

This box plot graph shows the relationship between contact recency and whether or not a customer renewed their subscription. The graph suggests that more customers who renewed their subscription were in recent contact with the music company

Figure 6: Number of Complaints and Renewal

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = num_complaints, fill = renewed))

This next box plot graph shows the relationship between the number of complaints that the music company received and whether or not a customer renewed their subscription. The graph suggests that there isn’t much difference between the two variables except for a few outliers.

Figure 7: Length of Relationship (lor) and Renewal

ggplot(data = sub_training) +
  geom_boxplot(mapping = aes(x = renewed, y = lor, fill = renewed))

This box plot graph shows the relationship between the length of the relationship (lor) and whether or not a customer renewed or not. The graph suggests that customers who did renew typically have a longer length of a relationship than those who did not.

Part B - Predicting Customers Who Will Renew Their Music Subscription

library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tidyverse)

1. Classification Tree Model

In order to predict which customers are going to renew their subscription a classification tree model can be created. Using all possible predictor variables, the classification tree model sorts the data provided and creates a visual structure to simplify our data.

music_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, data = sub_training,)

fancyRpartPlot(music_tree)

This classification tree model presents the multiple variables which affect whether or not a customer chooses to renew their subscription or not.

2. Classification Tree Interpretation

2a. Interpretation of the classification tree is the next step in predicting whether or not a customer will renew their subscription.

music_tree$variable.importance
            lor           spend             age    num_contacts contact_recency 
     21.5220728      10.3707565       9.9887537       3.7116877       0.5951338 
summary(music_tree)
Call:
rpart(formula = renewed ~ num_contacts + contact_recency + num_complaints + 
    spend + lor + gender + age, data = sub_training)
  n= 850 

          CP nsplit rel error   xerror       xstd
1 0.19339623      0 1.0000000 1.061321 0.03432108
2 0.01179245      1 0.8066038 0.870283 0.03408085
3 0.01000000      4 0.7617925 0.875000 0.03410196

Variable importance
            lor           spend             age    num_contacts contact_recency 
             47              22              22               8               1 

Node number 1: 850 observations,    complexity param=0.1933962
  predicted class=No   expected loss=0.4988235  P(node) =1
    class counts:   426   424
   probabilities: 0.501 0.499 
  left son=2 (466 obs) right son=3 (384 obs)
  Primary splits:
      lor             < 139.5 to the left,  improve=16.323670, (0 missing)
      age             < 60.5  to the left,  improve=13.784490, (0 missing)
      spend           < 182   to the left,  improve=12.232030, (0 missing)
      contact_recency < 7.5   to the right, improve= 5.498858, (0 missing)
      num_contacts    < 3.5   to the left,  improve= 5.307063, (0 missing)
  Surrogate splits:
      age             < 60.5  to the left,  agree=0.732, adj=0.406, (0 split)
      spend           < 422   to the left,  agree=0.684, adj=0.299, (0 split)
      contact_recency < 8.5   to the right, agree=0.565, adj=0.036, (0 split)
      num_contacts    < 2.5   to the left,  agree=0.560, adj=0.026, (0 split)

Node number 2: 466 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4098712  P(node) =0.5482353
    class counts:   275   191
   probabilities: 0.590 0.410 
  left son=4 (82 obs) right son=5 (384 obs)
  Primary splits:
      spend           < 182   to the left,  improve=5.482157, (0 missing)
      lor             < 42.5  to the left,  improve=5.434363, (0 missing)
      age             < 61.5  to the left,  improve=4.067404, (0 missing)
      num_contacts    < 7.5   to the left,  improve=3.721617, (0 missing)
      contact_recency < 7.5   to the right, improve=2.971256, (0 missing)
  Surrogate splits:
      lor < 21    to the left,  agree=0.987, adj=0.927, (0 split)

Node number 3: 384 observations
  predicted class=Yes  expected loss=0.3932292  P(node) =0.4517647
    class counts:   151   233
   probabilities: 0.393 0.607 

Node number 4: 82 observations
  predicted class=No   expected loss=0.2439024  P(node) =0.09647059
    class counts:    62    20
   probabilities: 0.756 0.244 

Node number 5: 384 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4453125  P(node) =0.4517647
    class counts:   213   171
   probabilities: 0.555 0.445 
  left son=10 (356 obs) right son=11 (28 obs)
  Primary splits:
      num_contacts   < 7.5   to the left,  improve=3.286592, (0 missing)
      age            < 61    to the left,  improve=3.189001, (0 missing)
      gender         splits as  LR,        improve=2.683644, (0 missing)
      num_complaints < 1.5   to the left,  improve=2.393375, (0 missing)
      lor            < 45.5  to the left,  improve=1.671696, (0 missing)
  Surrogate splits:
      lor < 137.5 to the left,  agree=0.93, adj=0.036, (0 split)

Node number 10: 356 observations,    complexity param=0.01179245
  predicted class=No   expected loss=0.4269663  P(node) =0.4188235
    class counts:   204   152
   probabilities: 0.573 0.427 
  left son=20 (329 obs) right son=21 (27 obs)
  Primary splits:
      age            < 61    to the left,  improve=3.357262, (0 missing)
      gender         splits as  LR,        improve=2.949544, (0 missing)
      num_complaints < 1.5   to the left,  improve=1.278902, (0 missing)
      lor            < 42.5  to the left,  improve=1.232817, (0 missing)
      spend          < 403   to the right, improve=1.196010, (0 missing)

Node number 11: 28 observations
  predicted class=Yes  expected loss=0.3214286  P(node) =0.03294118
    class counts:     9    19
   probabilities: 0.321 0.679 

Node number 20: 329 observations
  predicted class=No   expected loss=0.4072948  P(node) =0.3870588
    class counts:   195   134
   probabilities: 0.593 0.407 

Node number 21: 27 observations
  predicted class=Yes  expected loss=0.3333333  P(node) =0.03176471
    class counts:     9    18
   probabilities: 0.333 0.667 

By summarising the classification tree and measuring the variables by importance, the following information becomes available. The terminal node which is predicting renewal is Node 3. Node 3 details the following:

  • Class counts: 151 No vs 233 Yes
  • Probability: 0.607 yes and 0.393 no. Therefore the purity is 60.7%.
  • This node is for customers with longer relationships as lor = 139.5.

Based on this information Rule 1 can be set: If a customer has been with the company for at least 139.5 days then the model predicts that they will renew their subscription. Furthermore, 60.7% of customers in this node are renewers which means that the node is moderately accurate or pure.

2b. Node 4 predicts churn rates. It details:

  • Class counts: 62 No vs 20 Yes
  • Probability: 75.6% No
  • This node comes from lor < 139.5 and spend < 182.

From this information Rule 2 can be set: If a customer has a short relationship and spending is low (lor < 139.5 days and spend < 182) then the model predicts that these customers will churn or not renew. 75.6% of customers will not renew which means the node is highly accurate or pure.

2c. Based on the variable importance output it becomes clear that the most important predictor of renewel is the length of relationship or lor. This appears as the highest in importance score at 47. The next most important variables are spend and age both having 22 as their importance scores. Number of contacts and contact recency are the least important variables due to their low importance scores.

  1. Through the visual exploration conducted in Part A it suggests that length of relationship (lor), spending and age are all related to renewal behaviour. The contact related variables did not suggest much relevancy to renewal bahviour. The classification tree supported these suggestions as it identified lor as the most important predictor of renewal, followed by spend and age. The rest of the variables are unimportant to renewal.

  2. Both the training and testing datasets were used in order to further evaluate the classification tree. Both the predicted proabilities and classes were generated. Confusion matrices were then generated by comparing the renewal rate and the predicted rate.

train_probs <- predict(music_tree, newdata = sub_training, type = "prob")
train_preds <- predict(music_tree, newdata = sub_training, type = "class")

sub_training_updated <- cbind(sub_training, train_probs, train_preds) 
head(sub_training_updated)
   id renewed num_contacts contact_recency num_complaints spend lor gender age
1 187      No            0              28              0   213 248   Male  45
2 269      No            1              12              2   425  82   Male  60
3 376      No            0              28              2     0  15 Female  53
4 400      No            1              11              1     0  12   Male  44
5 679     Yes            0              28              0   216 300   Male  68
6 565     Yes            0              28              0   425 349 Female  68
         No       Yes train_preds
1 0.3932292 0.6067708         Yes
2 0.5927052 0.4072948          No
3 0.7560976 0.2439024          No
4 0.7560976 0.2439024          No
5 0.3932292 0.6067708         Yes
6 0.3932292 0.6067708         Yes
train_con_mat <- table(sub_training_updated$renewed, sub_training_updated$train_preds, dnn = c("Actual", "Predicted"))

train_con_mat
      Predicted
Actual  No Yes
   No  257 169
   Yes 154 270
test_probs <- predict(music_tree, newdata = sub_testing, type = "prob")
test_preds <- predict(music_tree, newdata = sub_testing, type = "class")

sub_testing_updated <- cbind(sub_testing, test_probs, test_preds)

test_con_mat <- table(sub_testing_updated$renewed, sub_testing_updated$test_preds, dnn = c("Actual", "Predicted"))

test_con_mat
      Predicted
Actual No Yes
   No  36  38
   Yes 35  41

4a. The confusion matrix for the training data set shows a high level of classification. The testing data set shows a slighly lower level of correct classification. This information indicates that the classification tree model performs better when evaluating data it was trained on rather than the data sets it has not been trained on.

  1. Pruned Music Classification Tree Model
music_tree_pruned <- rpart(renewed ~ num_contacts + contact_recency + num_complaints + spend + lor + gender + age, sub_training, control = rpart.control(maxdepth = 3))

fancyRpartPlot(music_tree_pruned)

5b. In comparison to the full classification tree, the pruned model tree shows less accuracy on the training data set and this can be for the reason that there is less complexity within this tree. As this is the case it is still important to point out that testing data set is more stable.

5c. Yes. Pruning has reduced overfitting as it is a much simpler model. The reduction provides evidence that pruning was able to imporve the model’s ability to generalise the data sets.

  1. Based on this analysis, it was established that the most important variables are lor, spending and age. Some suggested actions that the company could take are as follows:
  • Target new customers early on.
  • Incetivise low-spending customers using targeted discounts etc.
  • Monitor high contact customers.

Another suggestion would also to be to make use of the classification trees which have been provided. These trees can be used as a tool to identify different types of customers before they decide to churn or their subscription expires. Marketing teams can also make use of the rules which have been provided.

Part C - Segmenting Consumers Based on Energy Drink Preference

This section of this analysis paper turns away from the music subscription section and takes a look into the world of energy drink preference. The following data and information is based on market research carried out by a company that is looking to guage customer preference. Five versions of an energy drink are rated.

energy_drinks <- read_csv("energy_drinks.csv")
Rows: 840 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Gender, Age
dbl (5): D1, D2, D3, D4, D5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(cluster)
view(energy_drinks)
energy_drinks_2 <- energy_drinks |> select(D1, D2, D3, D4, D5)
energy_dist <- dist(energy_drinks_2, method = "euclidean")
h1 <- hclust (energy_dist, method = "average")

plot(h1, hang = -1)

2a. The data from the energy drink company does not need to be scaled because these variables are measured using the same likert scale from 1-9.

heatmap(as.matrix(energy_drinks_2), Rowv = as.dendrogram(h1), Colv = NA, scale = "none", labRow = NA)

4a. Yes, this heatmap does provide evidence of clustering structure. It is clear that there are bands of similar colours across rows and what this indicates that the consumers of the energy drinks that are within the same branches of the dendrgoram are showing similar rating profiles.

  1. 3-cluster solution.
clusters1 <- cutree(h1, k = 3)
  1. Profiling the clusters using necessary tables and graphs.
energy_drinks_clus <- cbind(energy_drinks, clusters1)

energy_drinks_clus$cluster <- case_when(
  energy_drinks_clus$clusters1 == 1 ~ "C1",
  energy_drinks_clus$clusters1 == 2 ~ "C2",
  energy_drinks_clus$clusters1 == 3 ~ "C3"
)

Table showing the mean score for each drink

cluster_means <- energy_drinks_clus |>
  group_by(cluster) |>
  summarise(
    mean_D1 = mean(D1, na.rm = TRUE),
    mean_D2 = mean(D2, na.rm = TRUE),
    mean_D3 = mean(D3, na.rm = TRUE),
    mean_D4 = mean(D4, na.rm = TRUE),
    mean_D5 = mean(D5, na.rm = TRUE)
    )
cluster_means
# A tibble: 3 × 6
  cluster mean_D1 mean_D2 mean_D3 mean_D4 mean_D5
  <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 C1         3       4.71    6.20    6.70    6.76
2 C2         2.96    4.67    6.95    5.01    2.97
3 C3         6.98    5.37    3.03    2.88    2.84

Graph showing the mean score for each drink

ggplot(data = energy_long) +
  geom_line(mapping = aes(x = Drink, y = Rating, group = cluster, colour = cluster), stat = "summary", fun = mean) +
  geom_point(mapping = aes(x = Drink, y = Rating, colour = cluster), stat = "summary", fun = mean)

6a. The clusters show very different averages for each energy drink based on the table and graphs seen above. Cluster 1 shows that an overall higher rating for most drinks except D1 and arguably D2. Cluster 2 shows big favour towards D3 and cluster 3 shows a favour towards D1 and arguably D2. These differences in ratings suggest that each cluster is representative of different consumer preferences.

6b.

table(energy_drinks_clus$cluster, energy_drinks_clus$Gender)
    
     Female Male
  C1    149  268
  C2    110  125
  C3     80  108
table(energy_drinks_clus$cluster, energy_drinks_clus$Age)
    
     25_34 35_49 50_64 Over_65 Under_25
  C1   177    90    51      21       78
  C2    90    70    25       7       43
  C3    71    51    20      11       35

All the clusters show that there is more males than females although this difference is not extreme. Age also differs quite a bit among the different clusters with individuals between 25 and 49 being the most frequent. Overall, the clusters do show differences in age and gender and these do play a role in consumer preference however it is evident that consumer preference is the primary ratings driver.

  1. For advertising different energy drinks, it would be advisable to use separate clusters due to differences in ratings. For D1 the recommended cluster to advertise would be cluster 3 as the ratings are the highest for this drink. For D3 the recommended cluster would be clusters 1 and 2 due to the high ratings within these clusters. Advertising for D5 would be best done using cluster 1, like D1, the ratings for this cluster are high.

  2. If the company was to pick just one drink to produce the best pick would be D3. The reason for this is that D3 produced the best ratings within the most clusters. Both cluster 1 and 2 expressed very high ratings for this drink.