Assignments

Author

Praveen

Introduction

For subscription-based organizations, customer churn poses a significant problem because keeping current clients is more economical than finding new ones. For music streaming services, understanding the factors that influence whether a customer renews their subscription is critical for improving customer retention strategies and long-term profitability. The analysis focuses on the training dataset and uses visualisation techniques to compare customers who renewed their subscription with those who churned.

Part A – Exploratory Analysis of the Music Subscription Dataset

The analysis uses two datasets: a training dataset containing 850 customers and a testing dataset containing 150 customers. Using the sub_training.csv dataset, a visual exploration of the data is carried out to understand the relationship between whether a customer renews their subscription (the variable called "renewed") and each of the other potential predictor variables, such as age, spend, gender, lor, contact_recency and num_complaints.

Question 1: Data Import

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sub_training <- read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_testing <- read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The datasets used in this analysis consist of a training dataset and a testing dataset provided by the music subscription company. The training dataset contains 850 customer observations and the testing dataset contains 150 customer observations. Both datasets are imported into R using the read_csv() function, which is loaded with the tidyverse.
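
The column-specification messages printed during import can be avoided by declaring the column types up front. The following is a minimal sketch rather than the method used in this report: it writes a tiny made-up CSV with the same nine column names to a temporary file (the values, the column order and the "Yes"/"No" coding of renewed are assumptions for illustration) and reads it back so that renewed and gender arrive as factors.

```r
library(readr)

# Hypothetical stand-in for sub_training.csv: same column names, made-up values.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,renewed,num_contacts,contact_recency,num_complaints,spend,lor,age,gender",
  "1,Yes,2,10,0,250,300,34,Male",
  "2,No,5,3,2,120,90,41,Female",
  "3,Yes,1,20,0,310,410,29,Female"
), toy_path)

# Declare the two text columns as factors and every other column as double.
toy <- read_csv(
  toy_path,
  col_types = cols(
    renewed  = col_factor(),
    gender   = col_factor(),
    .default = col_double()
  )
)

str(toy)
```

Storing renewed as a factor makes it explicit that it is a categorical outcome, which is what a classification model expects.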

Question 2: Exploratory Data Analysis

This section explores the relationship between subscription renewal status and the potential predictor variables. It is mainly used to identify what distinguishes customers who renewed their subscription from those who churned.

Customer Spend by Renewal Status

ggplot(data = sub_training) + geom_boxplot(mapping = aes(x = renewed, y = spend)) + labs(title = "Customer Spend by Renewal Status")

This boxplot illustrates how customer spending and subscription renewal status are related. Customers who renewed their subscriptions have a higher median spend than those who did not. This suggests that lower-spending customers are more vulnerable to churn, while higher-spending customers are more likely to stick with their subscription.
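
A boxplot comparison of this kind can be backed up with the numbers it draws. The sketch below uses a small made-up data frame (the values and the "Yes"/"No" coding of renewed are assumptions, not taken from sub_training) to compute the median and interquartile range of spend within each renewal group.

```r
library(dplyr)

# Made-up toy data; in the report this role is played by sub_training.
toy <- data.frame(
  renewed = c("Yes", "Yes", "Yes", "No", "No", "No"),
  spend   = c(250, 310, 280, 120, 150, 200)
)

# Median and interquartile range of spend within each renewal group,
# i.e. the values a boxplot shows as the centre line and the box height.
spend_summary <- toy |>
  group_by(renewed) |>
  summarise(
    n            = n(),
    median_spend = median(spend),
    iqr_spend    = IQR(spend)
  )

print(spend_summary)
```

Swapping spend for lor, age, num_complaints or num_contacts gives the matching summary for the other boxplots.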

Length of Relationship by Renewal Status

ggplot(data = sub_training) + geom_boxplot(mapping = aes(x = renewed, y = lor)) + labs(title = "Length of Relationship by Renewal Status")

In this boxplot, customers who renewed generally have a longer relationship with the company than customers who churned. This suggests that customers who have been with the service for a longer time are more likely to renew their subscription.

Customer Age by Renewal Status

ggplot(data = sub_training) + geom_boxplot(mapping = aes(x = renewed, y = age)) + labs(title = "Customer Age by Renewal Status")

This boxplot compares the ages of customers who renewed their subscription with those who did not. The age distributions of both groups look similar. This suggests that age does not have a strong influence on whether a customer renews their subscription.

Number of Complaints by Renewal Status

ggplot(data = sub_training) + geom_boxplot(mapping = aes(x = renewed, y = num_complaints)) + labs(title = "Number of complaints by Renewal Status")

This boxplot shows the number of complaints made by customers based on whether they renewed their subscription. The higher the number of complaints, the higher the chance that the customer churns.

Number of Customer Contacts by Renewal Status

ggplot(data = sub_training) + geom_boxplot(mapping = aes(x = renewed, y = num_contacts)) + labs(title = "Number of Customer Contacts by Renewal Status")

This boxplot compares the number of times customers contacted the service based on whether they renewed their subscription. Customers who did not renew generally have a higher number of contacts than those who renewed. This suggests that continuous contact with the company, possibly due to unresolved issues, is associated with a higher chance of churn.
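
The separate boxplots for the numeric predictors could also be drawn in a single faceted figure. The sketch below is one possible way to do this, shown on a made-up data frame that borrows the report's column names (all values are invented for illustration).

```r
library(tidyr)
library(ggplot2)

# Made-up toy data with the report's column names; replace with sub_training.
toy <- data.frame(
  renewed        = rep(c("Yes", "No"), each = 4),
  spend          = c(250, 310, 280, 265, 120, 150, 200, 170),
  lor            = c(300, 410, 350, 380, 90, 120, 60, 140),
  age            = c(34, 29, 45, 38, 41, 33, 27, 50),
  num_complaints = c(0, 0, 1, 0, 2, 1, 3, 2),
  num_contacts   = c(2, 1, 3, 2, 5, 4, 6, 5)
)

# Stack the numeric predictors into one column so one plot can facet on them.
toy_long <- pivot_longer(
  toy,
  cols      = c(spend, lor, age, num_complaints, num_contacts),
  names_to  = "predictor",
  values_to = "value"
)

# One boxplot panel per predictor; free y scales because the units differ.
p <- ggplot(toy_long, aes(x = renewed, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free_y") +
  labs(title = "Numeric Predictors by Renewal Status")

print(p)
```

Placing the panels side by side makes it easier to see at a glance which predictors separate the two groups and which do not.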

Renewal Status by Gender

ggplot(sub_training, aes(x = gender, fill = renewed)) +
geom_bar(position = "fill") + labs(title = "Renewal Status by gender")

This bar chart shows the proportion of customers who renewed and did not renew their subscription within each gender (position = "fill" scales every bar to 1). The renewal proportions look similar for each gender, with no large differences between them. This suggests that gender does not have a strong impact on whether a customer renews their subscription.
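
The proportions behind a filled bar chart can be tabulated directly. The following sketch uses made-up data (the gender labels and the "Yes"/"No" coding are assumptions) to compute the share of renewals within each gender.

```r
library(dplyr)

# Made-up toy data; in the report this role is played by sub_training.
toy <- data.frame(
  gender  = c("Male", "Male", "Male", "Male",
              "Female", "Female", "Female", "Female"),
  renewed = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No")
)

# Count each gender/renewal combination, then convert the counts to
# within-gender proportions, which is what position = "fill" displays.
renewal_by_gender <- toy |>
  count(gender, renewed) |>
  group_by(gender) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

print(renewal_by_gender)
```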

Part B

Question 1:

a) Create and visualise a classification tree model

In this section, a classification tree is built to predict whether a customer will renew their subscription.

Classification tree setup:

First, the required packages are loaded: rpart to fit the tree, rattle to plot it, and tidyverse to import the data.

library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tidyverse)

Two datasets are used in this analysis: a training dataset and a testing dataset. Both datasets are imported into R to support model development and evaluation.

sub_training <- read_csv("sub_training.csv")
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_testing <- read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A tree model is created to predict whether a customer renews their subscription based on all available predictor variables.

renew_tree <- rpart(renewed ~ num_contacts + contact_recency + num_complaints +
    spend + lor + gender + age, data = sub_training)

The tree diagram below shows how customers are classified as renewed or churned based on their characteristics.

fancyRpartPlot(renew_tree)
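
A fitted tree is usually checked against data it has not seen. The sketch below shows the general pattern and is not a result from this report: it uses the kyphosis data that ships with rpart as a stand-in for the subscription data, holds out 20 rows (an arbitrary choice), and builds a confusion matrix from the predicted classes.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the subscription data here.
set.seed(42)
test_rows <- sample(nrow(kyphosis), size = 20)
train <- kyphosis[-test_rows, ]
test  <- kyphosis[test_rows, ]

# Fit a classification tree on the training rows only.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")

# Predict a class label for each held-out row.
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix: rows are the actual classes, columns the predicted ones.
conf_mat <- table(actual = test$Kyphosis, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

With the objects in this report, the analogous step would be to call predict() on renew_tree with newdata = sub_testing and type = "class", then tabulate the result against sub_testing$renewed.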

Question 2

a) If a customer has a length of relationship (lor) < 140 days and a spend below 182, then the model predicts that the customer will not renew their subscription.

b) If a customer has a length of relationship (lor) < 140 days and a spend below 182, then the model predicts that the customer will not renew their subscription.

c) Length of relationship (lor), spend and age are considered important because they appear near the top of the classification tree, indicating that they are the strongest predictors of renewal behaviour.
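
Reading importance from the position of a variable in the tree can be complemented by the importance scores that rpart stores on the fitted object. The sketch below again uses the kyphosis data from rpart as a stand-in, so the variable names are not those of the subscription data.

```r
library(rpart)

# Stand-in model on data that ships with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector of importance scores, sorted from most to least important.
importance <- sort(fit$variable.importance, decreasing = TRUE)

# Rescale so the scores sum to 100, which makes them easier to compare.
importance_pct <- round(100 * importance / sum(importance), 1)

print(importance_pct)
```

These scores also credit variables that act as surrogate splits, so the ranking can differ slightly from what the tree diagram alone suggests. For the model in this report the same component would be renew_tree$variable.importance.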

Question 3

Part A visually suggested that several variables, namely length of relationship, number of contacts, number of complaints and spend, appeared important, because clear differences were observed between renewing and non-renewing customers in the boxplots. In Part B, the classification tree confirmed that length of relationship and spend are important predictors, as these variables appeared at the top of the tree and were used to create the main decision rules. This supports the findings from the exploratory analysis in Part A.

Part C: Segmenting Consumers Based on Energy Drink Preference

Question 1:

We used R with the tidyverse and cluster packages to analyse the data.

library(cluster)
library(tidyverse)

The energy_drinks.csv file is imported into R.

energy <- read.csv("energy_drinks.csv")
View(energy)

The View() function is used to confirm that the data is loaded correctly.

Question 2

For Euclidean distance, it is very important that the variables entered into the clustering algorithm are on a comparable scale.

en <- select(energy, D1:D5)
e1 <- dist(en)

The energy drink rating variables were selected and used to calculate the Euclidean distance between consumers. Here, all five variables used to compute the distance are measured on the same 10-point Likert scale, so no rescaling is needed.

Question 3
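
As a worked illustration of the distance being computed, the sketch below takes two hypothetical consumers with made-up ratings for D1 to D5 and calculates the Euclidean distance both by hand and with dist().

```r
# Two hypothetical consumers rating D1..D5 on a 10-point scale (made-up values).
ratings <- rbind(
  consumer_a = c(D1 = 2, D2 = 4, D3 = 6, D4 = 8, D5 = 9),
  consumer_b = c(D1 = 8, D2 = 7, D3 = 6, D4 = 3, D5 = 2)
)

# Euclidean distance by hand: square root of the summed squared differences.
# The differences are -6, -3, 0, 5, 7, so the sum of squares is 119.
by_hand <- sqrt(sum((ratings["consumer_a", ] - ratings["consumer_b", ])^2))

# dist() uses the Euclidean metric by default and gives the same number.
by_dist <- as.numeric(dist(ratings))

print(c(by_hand = by_hand, by_dist = by_dist))
```

If the variables were measured in different units, one common option would be to standardise them with scale() before calling dist(), so that no single variable dominates the distance.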

h1 <- hclust(e1, method = "average")

Hierarchical clustering was applied using average linkage. In the plot call, hang = -1 forces the leaves of the dendrogram to align along the baseline at height 0.

plot(h1, hang = -1)

The height of each vertical line on the dendrogram represents the distance at which the clusters below it were merged.

Question 4
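
The link between merge height and distance is easiest to see on a tiny example. The sketch below clusters five made-up points on a line with average linkage and prints the heights at which the merges happen.

```r
# Five made-up points on a line, so the distances are easy to check by eye.
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)
d <- dist(x)

# Average linkage: the distance between two clusters is the mean of all
# pairwise distances between their members.
h <- hclust(d, method = "average")

# Merge heights: a-b and c-d join at 1, {a,b} and {c,d} join at
# (5 + 6 + 4 + 5) / 4 = 5, and e joins last at (14 + 13 + 9 + 8) / 4 = 11.
print(h$height)

# Cutting the tree into three groups separates {a, b}, {c, d} and {e}.
groups <- cutree(h, k = 3)
print(groups)
```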

heatmap(as.matrix(e1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

A heatmap takes a matrix of numbers and replaces each number with a colour. Here the matrix holds the pairwise distances between customers: white represents small values (pairs of customers who are similar to each other), yellow represents pairs of customers who are somewhat similar to each other, and red represents pairs of customers who are not similar to each other.

Question 5

clustere1 <- cutree(h1, k = 3)

sil1 <- silhouette(clustere1, e1)
summary(sil1)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clustere1, dist = e1) :
 Cluster sizes and average silhouette widths:
      417       235       188 
0.2249249 0.1987262 0.3918562 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3120  0.1599  0.2916  0.2550  0.3716  0.5502 
en_clus <- cbind(energy, clustere1)
en_clus <- mutate(en_clus, cluster = case_when(clustere1 == 1 ~ 'C1',
                                               clustere1 == 2 ~ 'C2',
                                               clustere1 == 3 ~ 'C3'))

The hierarchical clustering solution was cut into three clusters. The quality of this three-cluster solution was then assessed using silhouette analysis.

Cluster sizes and average silhouette widths:

Cluster   Size   Average silhouette width   Interpretation
C1        417    0.2249249                  No substantial structure
C2        235    0.1987262                  No substantial structure
C3        188    0.3918562                  Weak

Individual silhouette widths:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-0.3120  0.1599  0.2916  0.2550  0.3716  0.5502

The mean of the individual silhouette widths gives the overall silhouette score for the clustering solution.

The cluster analysis has an overall mean silhouette width of 0.255, which indicates that the structure is weak and could be artificial.
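
One way to put a silhouette score in context is to compare it across several choices of k. The sketch below is illustrative only: it uses the built-in USArrests data (scaled, because its columns use different units) as a stand-in for the ratings, and the range 2 to 6 is an arbitrary choice.

```r
library(cluster)

# Built-in data as a stand-in for the ratings.
d <- dist(scale(USArrests))
h <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_width <- sapply(ks, function(k) {
  sil <- silhouette(cutree(h, k = k), d)
  mean(sil[, "sil_width"])
})

print(data.frame(k = ks, avg_width = round(avg_width, 3)))
```

Applied to the report's objects, the same loop would use h1 and e1 in place of h and d. A value of k with a clearly higher average width would suggest a better-separated solution.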

Question 6

Profiling

a) Cluster C1 shows a strong preference for higher-concentration energy drinks, with the highest ratings given to D4 and D5. This cluster clearly dislikes the lowest-concentration drink (D1). Cluster C2 strongly prefers the mid-concentration drinks, and its ratings drop noticeably for both the lowest (D1) and highest (D5) concentration versions. Cluster C3 strongly favours the lowest-concentration drink (D1) and shows a clear dislike for the higher-concentration drinks (D4 and D5). These results indicate three distinct taste preference segments.
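
A profile of this kind is typically read from the mean rating of each drink within each cluster. The sketch below shows that calculation on six made-up consumers (the ratings are invented and are not the values from energy_drinks.csv).

```r
library(dplyr)

# Made-up ratings for six consumers already labelled with a cluster.
toy <- data.frame(
  cluster = c("C1", "C1", "C2", "C2", "C3", "C3"),
  D1 = c(2, 3, 4, 5, 9, 8),
  D2 = c(4, 4, 7, 6, 7, 7),
  D3 = c(6, 5, 9, 8, 5, 6),
  D4 = c(8, 9, 6, 7, 3, 2),
  D5 = c(9, 9, 4, 3, 2, 1)
)

# Mean rating of each drink within each cluster, plus the cluster size.
profile <- toy |>
  group_by(cluster) |>
  summarise(across(D1:D5, mean), n = n())

print(profile)
```

With the report's data, the same summarise() call on en_clus grouped by cluster would give the table of cluster means.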

b) For gender, there are 501 males and 339 females, so the sample is male-dominated. For age, the largest group is 25–34 years, with 338 consumers. Overall, the profile indicates that the market is dominated by male consumers and younger age groups, mainly 25–34.