for preparing and importing dataset or the csv
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
customer_subscription_churn_usage_patterns <- read_csv("customer_subscription_churn_usage_patterns.csv",
col_types = cols(signup_date = col_datetime(format = "%Y-%m-%d")))| user_id | signup_date | plan_type | monthly_fee | avg_weekly_usage_hours | support_tickets | payment_failures | tenure_months | last_login_days_ago | churn |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2023-04-15 | Premium | 699 | 1.1 | 4 | 1 | 8 | 14 | Yes |
| 2 | 2023-08-27 | Premium | 699 | 2.6 | 6 | 0 | 35 | 1 | Yes |
| 3 | 2023-10-12 | Premium | 699 | 14.3 | 8 | 3 | 2 | 14 | Yes |
| 4 | 2023-12-11 | Basic | 199 | 17.6 | 5 | 2 | 11 | 9 | Yes |
| 5 | 2023-02-14 | Basic | 199 | 9.8 | 5 | 2 | 6 | 38 | Yes |
| 6 | 2024-10-05 | Premium | 699 | 13.6 | 6 | 0 | 30 | 35 | Yes |
| 7 | 2023-10-24 | Premium | 699 | 14.6 | 1 | 0 | 24 | 42 | Yes |
| 8 | 2024-01-09 | Basic | 199 | 21.7 | 6 | 2 | 15 | 29 | Yes |
| 9 | 2023-03-15 | Basic | 199 | 9.2 | 4 | 5 | 24 | 59 | Yes |
| 10 | 2024-10-17 | Premium | 699 | 13.6 | 3 | 1 | 11 | 29 | No |
## Rows: 2,800
## Columns: 10
## $ user_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ signup_date <dttm> 2023-04-15, 2023-08-27, 2023-10-12, 2023-12-11…
## $ plan_type <chr> "Premium", "Premium", "Premium", "Basic", "Basi…
## $ monthly_fee <dbl> 699, 699, 699, 199, 199, 699, 699, 199, 199, 69…
## $ avg_weekly_usage_hours <dbl> 1.1, 2.6, 14.3, 17.6, 9.8, 13.6, 14.6, 21.7, 9.…
## $ support_tickets <dbl> 4, 6, 8, 5, 5, 6, 1, 6, 4, 3, 5, 5, 3, 2, 6, 8,…
## $ payment_failures <dbl> 1, 0, 3, 2, 2, 0, 0, 2, 5, 1, 0, 1, 5, 4, 4, 0,…
## $ tenure_months <dbl> 8, 35, 2, 11, 6, 30, 24, 15, 24, 11, 36, 18, 10…
## $ last_login_days_ago <dbl> 14, 1, 14, 9, 38, 35, 42, 29, 59, 29, 14, 41, 3…
## $ churn <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
## user_id signup_date plan_type
## Min. : 1.0 Min. :2023-01-01 00:00:00 Length:2800
## 1st Qu.: 700.8 1st Qu.:2023-07-09 18:00:00 Class :character
## Median :1400.5 Median :2024-01-14 00:00:00 Mode :character
## Mean :1400.5 Mean :2024-01-09 16:06:20
## 3rd Qu.:2100.2 3rd Qu.:2024-07-12 06:00:00
## Max. :2800.0 Max. :2024-12-31 00:00:00
## monthly_fee avg_weekly_usage_hours support_tickets payment_failures
## Min. :199.0 Min. : 0.50 Min. :0.000 Min. :0.000
## 1st Qu.:199.0 1st Qu.: 6.70 1st Qu.:2.000 1st Qu.:1.000
## Median :399.0 Median :12.80 Median :4.000 Median :2.000
## Mean :434.2 Mean :12.89 Mean :3.888 Mean :2.492
## 3rd Qu.:699.0 3rd Qu.:19.20 3rd Qu.:6.000 3rd Qu.:4.000
## Max. :699.0 Max. :25.00 Max. :8.000 Max. :5.000
## tenure_months last_login_days_ago churn
## Min. : 1.00 Min. : 0 Length:2800
## 1st Qu.:10.00 1st Qu.:14 Class :character
## Median :18.00 Median :30 Mode :character
## Mean :18.61 Mean :30
## 3rd Qu.:27.00 3rd Qu.:46
## Max. :36.00 Max. :60
the data haves 2800 rows and summary function summarize each columns to see medians max and statistical summary for example the max support tickets is 8.
After looking at the table and understand columns and what they mean its time to check for the credibility of data.
## user_id signup_date plan_type
## 0 0 0
## monthly_fee avg_weekly_usage_hours support_tickets
## 0 0 0
## payment_failures tenure_months last_login_days_ago
## 0 0 0
## churn
## 0
there is no nulls in the dataset
## Warning: package 'janitor' was built under R version 4.5.2
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## No variable names specified - using all columns.
## No duplicate combinations found of: user_id, signup_date, plan_type, monthly_fee, avg_weekly_usage_hours, support_tickets, payment_failures, tenure_months, last_login_days_ago, churn
means there are no dupes in the dataset
dataset is clean.
Before diving into detailed analysis, we performed a preliminary exploration to understand the dataset and the general behavior of customers. This step helps frame the business context and highlights initial patterns worth investigating further.
churn_prc <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=n(),prc=100*(total/2800))
knitr::kable(head(churn_prc))| churn | total | prc |
|---|---|---|
| No | 1195 | 42.67857 |
| Yes | 1605 | 57.32143 |
Examined the proportion of churned vs retained customers, identifying that 57% of users had churned.
plan_prc <- customer_subscription_churn_usage_patterns %>% group_by(plan_type) %>% summarise(total=n(),prc=100*(total/2800))
knitr::kable(head(plan_prc))| plan_type | total | prc |
|---|---|---|
| Basic | 923 | 32.96429 |
| Premium | 944 | 33.71429 |
| Standard | 933 | 33.32143 |
Reviewed subscription plans (Basic, Standard, Premium) to ensure balanced representation and identify potential plan-specific trends. And as we see the distr of clients in plans are same.
plan_churn_prc <- customer_subscription_churn_usage_patterns %>% group_by(churn, plan_type) %>% summarise(total = n(), .groups = "drop") %>% group_by(churn) %>% mutate(prc = 100 * total / sum(total)) %>% ungroup()
knitr::kable(head(plan_churn_prc))| churn | plan_type | total | prc |
|---|---|---|---|
| No | Basic | 389 | 32.55230 |
| No | Premium | 396 | 33.13808 |
| No | Standard | 410 | 34.30962 |
| Yes | Basic | 534 | 33.27103 |
| Yes | Premium | 548 | 34.14330 |
| Yes | Standard | 523 | 32.58567 |
Checked how churn was distributed within each plan to get an early sense of whether subscription type influenced attrition.
to look more into other variables, i did need to see average support tickets and payment failures by churn state and see total numbers and comapare if it respects the ration of 57% and 43%.
tickets_churn_pattern <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(tickets_num=sum(support_tickets),avg_tickets=mean(support_tickets),payments_fail_num=sum(payment_failures),avg_pay_fail=mean(payment_failures))
knitr::kable(head(tickets_churn_pattern))| churn | tickets_num | avg_tickets | payments_fail_num | avg_pay_fail |
|---|---|---|---|---|
| No | 4106 | 3.435983 | 2477 | 2.072803 |
| Yes | 6780 | 4.224299 | 4500 | 2.803738 |
difference between avg support tickets is 4.22 for yes and 3.44 for no, meaning that clients who leaves haves more support tickets. for payment failures also 2.8 for leaving and 2 for staying meaning they tend to have more payment failures. and for total of these variables they show higher sum more than the ratio meaning it not because of churn yes clients are more than the others.
tickets_range <- customer_subscription_churn_usage_patterns %>% group_by(churn,support_tickets) %>% summarise(total=n(),avg_last_login=mean(last_login_days_ago))## `summarise()` has grouped output by 'churn'. You can override using the
## `.groups` argument.
| churn | support_tickets | total | avg_last_login |
|---|---|---|---|
| No | 0 | 187 | 29.35294 |
| No | 1 | 165 | 25.69697 |
| No | 2 | 152 | 27.27632 |
| No | 3 | 151 | 28.39073 |
| No | 4 | 125 | 24.36000 |
| No | 5 | 94 | 26.01064 |
This aggregation summarizes customer behavior by churn status and number of support tickets. This helps identify patterns between support interactions and churn, highlighting whether higher support ticket volumes correlate with increased attrition.
PF_range <- customer_subscription_churn_usage_patterns %>% group_by(churn,payment_failures) %>% summarise(total=n(),avg_tickets=mean(support_tickets))## `summarise()` has grouped output by 'churn'. You can override using the
## `.groups` argument.
| churn | payment_failures | total | avg_tickets |
|---|---|---|---|
| No | 0 | 267 | 3.674157 |
| No | 1 | 297 | 3.673401 |
| No | 2 | 171 | 3.397661 |
| No | 3 | 155 | 3.296774 |
| No | 4 | 152 | 2.842105 |
| No | 5 | 153 | 3.333333 |
Purpose: This aggregation analyzes the relationship between payment failures (PF) and churn:
Total customers are counted for each combination of churn status (Yes/No) and number of payment failures.
Average support tickets per group are calculated to see how support interactions vary with payment failures.
This helps identify whether customers with more payment failures are more likely to churn and whether higher PFs coincide with increased support needs.
churn_rate_for_MONTHS <- customer_subscription_churn_usage_patterns %>% mutate(month=make_date(year(signup_date),month(signup_date),1)) %>% group_by(month) %>% summarise(total_cust=n(),churn_count=sum(churn=="Yes"),churn_rate=mean(churn=="Yes"),.groups = "drop")
knitr::kable(head(churn_rate_for_MONTHS))| month | total_cust | churn_count | churn_rate |
|---|---|---|---|
| 2023-01-01 | 88 | 54 | 0.6136364 |
| 2023-02-01 | 112 | 70 | 0.6250000 |
| 2023-03-01 | 99 | 51 | 0.5151515 |
| 2023-04-01 | 118 | 62 | 0.5254237 |
| 2023-05-01 | 125 | 66 | 0.5280000 |
| 2023-06-01 | 121 | 64 | 0.5289256 |
Purpose: This aggregation calculates monthly churn trends to understand how churn evolves over time:
The signup_date is converted to a month-level date to group customers by the month they signed up.
For each month, it computes:
total_cust: total number of customers who signed up that month
churn_count: number of customers who churned
churn_rate: proportion of churned customers for that month
This allows us to observe patterns, such as whether certain months have higher churn rates or if churn correlates with customer acquisition volume.
churn_rate_for_PF <- customer_subscription_churn_usage_patterns %>% group_by(payment_failures) %>% summarise(total_cust=n(),churn_count=sum(churn=="Yes"),churn_rate=mean(churn=="Yes"),.groups = "drop")
knitr::kable(head(churn_rate_for_PF))| payment_failures | total_cust | churn_count | churn_rate |
|---|---|---|---|
| 0 | 442 | 175 | 0.3959276 |
| 1 | 496 | 199 | 0.4012097 |
| 2 | 478 | 307 | 0.6422594 |
| 3 | 471 | 316 | 0.6709130 |
| 4 | 453 | 301 | 0.6644592 |
| 5 | 460 | 307 | 0.6673913 |
bin_pf <- churn_rate_for_PF %>% group_by(pf_bin=case_when(payment_failures == 0 ~ "0", payment_failures == 1 ~ "1",payment_failures == 2 ~ "2", payment_failures >= 3 ~ "3+")) %>% summarise(total_cus=sum(total_cust),churn_count=sum(churn_count),rate=sum(churn_rate*total_cust)/sum(total_cust))
knitr::kable(head(bin_pf))| pf_bin | total_cus | churn_count | rate |
|---|---|---|---|
| 0 | 442 | 175 | 0.3959276 |
| 1 | 496 | 199 | 0.4012097 |
| 2 | 478 | 307 | 0.6422594 |
| 3+ | 1384 | 924 | 0.6676301 |
Purpose: This code investigates the relationship between payment failures (PF) and churn risk, identifying critical thresholds.
churn_rate_for_PF
Groups customers by payment_failures.
Calculates for each PF count:
total_cust: total customers with that number of failed payments
churn_count: number of churned customers
churn_rate: proportion of churned customers
This provides a raw view of how churn changes as payment failures increase.
bin_pf
Groups the PF counts into binned categories: 0, 1, 2, 3+
Aggregates totals and computes a weighted churn rate for each bin.
This makes it easier to visualize churn patterns and highlights the “two-payment-fail tipping point”, where churn jumps significantly.
plan_total_pf <- customer_subscription_churn_usage_patterns %>% group_by(plan_type,churn) %>% summarise(total=n(),PF=sum(payment_failures),avg_pf=mean(payment_failures))## `summarise()` has grouped output by 'plan_type'. You can override using the
## `.groups` argument.
| plan_type | churn | total | PF | avg_pf |
|---|---|---|---|---|
| Basic | No | 389 | 839 | 2.156812 |
| Basic | Yes | 534 | 1482 | 2.775281 |
| Premium | No | 396 | 796 | 2.010101 |
| Premium | Yes | 548 | 1512 | 2.759124 |
| Standard | No | 410 | 842 | 2.053659 |
| Standard | Yes | 523 | 1506 | 2.879541 |
Purpose: This code explores payment failures across different subscription plans and how they relate to churn.
Groups data by plan_type (Basic, Standard, Premium) and churn status (Yes/No).
For each combination, calculates:
total: number of customers
PF: total number of payment failures
avg_pf: average number of payment failures
Insight: This helps identify whether certain plans are more prone to payment issues and churn, providing a plan-specific view of risk.
tenure_patterns <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=sum(tenure_months),avr_tenure=mean(tenure_months))
knitr::kable(head(tenure_patterns))| churn | total | avr_tenure |
|---|---|---|
| No | 22213 | 18.58828 |
| Yes | 29903 | 18.63115 |
Purpose: This code examines customer tenure in relation to churn.
Groups the dataset by churn status (Yes/No).
Calculates:
total: total months of tenure for each churn group
avr_tenure: average tenure (in months) per group
Insight: This gives a general understanding of how long churned vs. retained customers have been with the service, helping to see whether longer-tenured customers are more or less likely to leave.
avg_hours_pattern <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=sum(avg_weekly_usage_hours),avr_hours=mean(avg_weekly_usage_hours))
knitr::kable(head(avg_hours_pattern))| churn | total | avr_hours |
|---|---|---|
| No | 16427 | 13.74644 |
| Yes | 19669 | 12.25483 |
avg_hours_pattern Purpose: Analyzes average weekly app usage for churned vs. retained customers.
Groups by churn status.
Calculates:
total: total weekly usage hours for each group
avr_hours: average weekly usage per customer
Insight: Helps understand whether more engaged users (higher weekly usage) are less likely to churn. the avg hours are same which means that for churn variable it shows that they both use it for same period 13 yes 12 no.
mean_last_login <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(avg=mean(last_login_days_ago))
knitr::kable(head(mean_last_login))| churn | avg |
|---|---|
| No | 26.14979 |
| Yes | 32.87539 |
mean_last_login Purpose: Measures recency of activity among customers.
Groups by churn.
Calculates the average number of days since the last login for each group.
Insight: Provides a sense of whether recently active customers are more likely to stay, which can help in identifying dormant users at risk of churning. it shows that churned tend to have higher avg last login days.
the patterns are organized by the importance.
the bin_pf dataframe shows that 2 payment failures is where churn no clients number start to decrease and churn yes increase meaning that 2 payment failures is dead point. also the number of payment failures for churned people is higher meaning they get more payment failures. and the churn rate from 1 to 2 jumps from 0.4 to 0.64.
support tickets shows that churned people have done more support tickets than staying people with an average of 4.22 compared to 3.44 making a difference of approximatly 1 support tickets.
plans doesnt show difference in term of churn state they are same arround 33.3% for each and also same in payment failures.
for avg weekly hour usage it shows a difference of 1h from 13 for no and 12 for yes meaning the difference is little.
for tenure months they are the same arround 18.5.
last login in days shows that churned people have from 30-36 days without using the subscription while people who stayed 22-29 days.
note: in plotting phase i used AI by giving him first manual code for plot using ggplot and asking him to improve it and make it more beautiful. which cuts time.