Analysis for a music downloading company collected data about its customers’ subscription base for the last four years. The company considers that a customer has churned when his/her subscription is not renewed within one week after the expiry date. The independent variables were measured over the 36-month period prior to the date on which the customer either renewed or churned. The independent variables contain information about interactions between the customers and the company, socio-demographic information and subscription-describing information.
Loading Packages for the visualization
library(cluster)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
Importing datasets for visualisation
subtest <-read_csv("sub_testing.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 850 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): renewed, gender
dbl (7): id, num_contacts, contact_recency, num_complaints, spend, lor, age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(subtrain)
The echo: false option disables the printing of code (only output is displayed).
subtest_2 <-select(subtest, renewed, contact_recency, lor, spend, gender, age)d1 <-dist(subtest_2)
Warning in dist(subtest_2): NAs introduced by coercion
subtrain_2 <-select(subtrain, renewed, contact_recency, lor, spend, gender, age)d2 <-dist(subtrain_2)
Warning in dist(subtrain_2): NAs introduced by coercion
subtest_2_clean <- subtest %>%select(renewed, contact_recency, lor, spend, gender, age) %>%mutate(gender =as.factor(gender), # Convert 'gender' to factorage =as.factor(age) # Convert 'age' to factor ) %>%drop_na() # Remove rows with missing valuessubtrain_2_clean <- subtrain %>%select(renewed, contact_recency, lor, spend, gender, age) %>%mutate(gender =as.factor(gender), # Convert 'gender' to factorage =as.factor(age) # Convert 'age' to factor ) %>%drop_na() # Remove rows with missing values# Convert factors into numeric representations (if necessary)subtest_2_clean <- subtest_2_clean %>%mutate(gender =as.numeric(gender), # Convert gender to numeric (1 = Female, 2 = Male)age =as.numeric(age) # Convert age to numeric (e.g., 1 = Under_25, 2 = 25_34, etc.) )subtrain_2_clean <- subtrain_2_clean %>%mutate(gender =as.numeric(gender),age =as.numeric(age) )# Compute Euclidean distance matricesd1 <-dist(subtest_2_clean, method ="euclidean")
Warning in dist(subtest_2_clean, method = "euclidean"): NAs introduced by
coercion
d2 <-dist(subtrain_2_clean, method ="euclidean")
Warning in dist(subtrain_2_clean, method = "euclidean"): NAs introduced by
coercion
#summary of distances (optional)summary(d1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 129.1 240.4 237.6 308.7 732.7
summary(d2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 135.3 243.3 247.8 316.7 817.9
According to this dendogram, the further you go down, the similar the nodes are.
clusters1 <-cutree(h1, k =3)clusters2 <-cutree(h2, k =3)
Assess the quality of the segmentation
sil1 <-silhouette(clusters1, d1)summary(sil1)
Silhouette of 150 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
Cluster sizes and average silhouette widths:
71 64 15
0.5627435 0.4217442 0.3025624
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.002092 0.354104 0.526981 0.476566 0.620396 0.698237
sil2 <-silhouette(clusters2, d2)summary(sil2)
Silhouette of 850 units in 3 clusters from silhouette.default(x = clusters2, dist = d2) :
Cluster sizes and average silhouette widths:
387 276 187
0.4591246 0.5782505 0.4968392
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.4597 0.4736 0.5515 0.5061 0.6217 0.7286
#####Step 6.1: Create a table showing the size of each segment (i.e the number of customers in the cluster)# and the average revenue generated in the last 6 months per customer.size_rev <- test_clus %>%group_by(clusters1) %>%summarise(id =n(),avg_rev =mean(spend))size_rev2 <- train_clus %>%group_by(clusters2) %>%summarise(id =n(),avg_rev =mean(spend))size_rev
# A tibble: 12 × 3
cluster Contact_Method Average_Value
<chr> <fct> <dbl>
1 C1 Spend 182.
2 C1 Lor 84.3
3 C1 Contact 20.1
4 C1 Age 47.2
5 C2 Spend 440.
6 C2 Lor 123.
7 C2 Contact 21.0
8 C2 Age 53.9
9 C3 Spend 403.
10 C3 Lor 325.
11 C3 Contact 19.4
12 C3 Age 66.4
#Visualise the mean satisfaction score for each contact method by cluster.ggplot(test_clus_tidy, mapping =aes(x = Contact_Method, y = Average_Value, group = cluster, colour = cluster)) +geom_line(linewidth =1) +geom_point(size =2) +scale_colour_manual(values =c("#0e0f0f", "#8AB9F1", "#FFC40A")) +ylab("Mean Satisfaction Score") +xlab("Contact Method") +ggtitle("Mean Satisfaction Score for each Contact Method by Cluster")
Question 2 Inmporting the energy drinks dataset
#Step 1: Import the dataEnergy <-read_csv('energy_drinks.csv')
Rows: 840 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Gender, Age
dbl (5): D1, D2, D3, D4, D5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
4a.) The dendogram and block layout in the heatmap shows the datasets have different clusters. However the hierarchical cluster and heatmeap show that consumers can be grouped according to their similiarity in certain groupings.
Creating a 3-cluster solution
clusters2 <-cutree(h2, k =3)sil2 <-silhouette(clusters2, d2)summary(sil2)
Silhouette of 840 units in 3 clusters from silhouette.default(x = clusters2, dist = d2) :
Cluster sizes and average silhouette widths:
441 167 232
0.1589812 0.2412339 0.3048378
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.2863 0.1262 0.2338 0.2156 0.3280 0.5065
# Profiling the clusters by Age distributionage_profile <- Energy %>%group_by(Cluster, Age) %>%summarise(count =n()) %>%group_by(Cluster) %>%mutate(percentage = count /sum(count) *100)
`summarise()` has grouped output by 'Cluster'. You can override using the
`.groups` argument.
# View the age profile by clusterkable(age_profile)
Cluster
Age
count
percentage
1
Under_25
87
19.72789
1
25_34
192
43.53742
1
NA
162
36.73469
2
Under_25
28
16.76647
2
25_34
58
34.73054
2
NA
81
48.50299
3
Under_25
41
17.67241
3
25_34
88
37.93103
3
NA
103
44.39655
Plotting age distribution by cluster
ggplot(age_profile, aes(x =factor(Cluster), y = percentage, fill = Age, width =0.5)) +geom_bar(stat ="identity", position ="stack") +labs(title ="Age Distribution by Cluster", x ="Cluster", y ="Percentage") +coord_flip()
`summarise()` has grouped output by 'Cluster'. You can override using the
`.groups` argument.
# View the gender profile by clusterkable(gender_profile)
Cluster
Gender
count
percentage
1
Female
279
63.26531
1
Male
162
36.73469
2
Female
86
51.49701
2
Male
81
48.50299
3
Female
136
58.62069
3
Male
96
41.37931
In the cluster represented above, we see that cluster 1 has a higher percentage of females streaming than male, in cluster 2, there is a balance in the gender per percentage points and in cluster 3, there is not a wide gap in the gender because even though the females have a higher streaming percentage, there is not much difference with the males.
ggplot(gender_profile, aes(x =factor(Cluster), y = percentage, fill = Gender, width =0.5)) +geom_bar(stat ="identity", position ="stack") +labs(title ="Gender Distribution by Cluster", x ="Cluster", y ="Percentage") +coord_flip()
Recommendations I would recommend that for the cluster 1 and 2, targeted promotions are run as there seems to be an existing balance between the gender percentage in cluster 3.
The recommendation from this visualisation is that the company works on the version D1, D4, and D5 because the ratings show that the customers are not exactly satisfied.
cluster_profiles_long <- cluster_profiles %>%pivot_longer(cols =starts_with("avg_rating"),names_to ="Version", values_to ="Avg_Rating")# Plot the bar chart with best practicesggplot(cluster_profiles_long, aes(x =factor(Cluster), y = Avg_Rating, fill = Version)) +geom_bar(stat ="identity", position ="dodge", width =0.7, alpha =0.8) +# Adjust position and bar widthscale_fill_manual(values =c("grey", "brown", "red", "black", "yellow")) +# Custom colorslabs(title ="Average Ratings by Cluster and Energy Drink Version",x ="Cluster",y ="Average Rating") +scale_y_continuous(limits =c(0, 10), expand =c(0, 0)) +# Adjust y-axis limits and remove paddingtheme_minimal() +theme(legend.title =element_blank(), # Remove legend titlelegend.position ="top", # Place the legend at the topaxis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis labels for readabilityplot.title =element_text(hjust =0.5, size =16, face ="bold"), # Center and style titleaxis.title =element_text(size =12), # Style axis titlespanel.grid.major =element_line(color ="gray80", size =0.5), # Lighter grid lines for better readabilitypanel.grid.minor =element_blank() # Remove minor grid lines ) +facet_wrap(~ Version, scales ="free_y") # Separate plots for each version with free y scales
My recommendation is that discounts and deals could be offered to D3, D4, and D5. Also, I advise that that marketing strategy should be different for the different groups and the targeted messages should reflect their dissatisfaction. Diversity in the products offering could also boost brand image.