1 Preparing the data

for preparing and importing dataset or the csv

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
customer_subscription_churn_usage_patterns <- read_csv("customer_subscription_churn_usage_patterns.csv", 
                                                       col_types = cols(signup_date = col_datetime(format = "%Y-%m-%d")))
knitr::kable(head(customer_subscription_churn_usage_patterns,10))
user_id signup_date plan_type monthly_fee avg_weekly_usage_hours support_tickets payment_failures tenure_months last_login_days_ago churn
1 2023-04-15 Premium 699 1.1 4 1 8 14 Yes
2 2023-08-27 Premium 699 2.6 6 0 35 1 Yes
3 2023-10-12 Premium 699 14.3 8 3 2 14 Yes
4 2023-12-11 Basic 199 17.6 5 2 11 9 Yes
5 2023-02-14 Basic 199 9.8 5 2 6 38 Yes
6 2024-10-05 Premium 699 13.6 6 0 30 35 Yes
7 2023-10-24 Premium 699 14.6 1 0 24 42 Yes
8 2024-01-09 Basic 199 21.7 6 2 15 29 Yes
9 2023-03-15 Basic 199 9.2 4 5 24 59 Yes
10 2024-10-17 Premium 699 13.6 3 1 11 29 No

2 first look at data

glimpse(customer_subscription_churn_usage_patterns)
## Rows: 2,800
## Columns: 10
## $ user_id                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ signup_date            <dttm> 2023-04-15, 2023-08-27, 2023-10-12, 2023-12-11…
## $ plan_type              <chr> "Premium", "Premium", "Premium", "Basic", "Basi…
## $ monthly_fee            <dbl> 699, 699, 699, 199, 199, 699, 699, 199, 199, 69…
## $ avg_weekly_usage_hours <dbl> 1.1, 2.6, 14.3, 17.6, 9.8, 13.6, 14.6, 21.7, 9.…
## $ support_tickets        <dbl> 4, 6, 8, 5, 5, 6, 1, 6, 4, 3, 5, 5, 3, 2, 6, 8,…
## $ payment_failures       <dbl> 1, 0, 3, 2, 2, 0, 0, 2, 5, 1, 0, 1, 5, 4, 4, 0,…
## $ tenure_months          <dbl> 8, 35, 2, 11, 6, 30, 24, 15, 24, 11, 36, 18, 10…
## $ last_login_days_ago    <dbl> 14, 1, 14, 9, 38, 35, 42, 29, 59, 29, 14, 41, 3…
## $ churn                  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
summary(customer_subscription_churn_usage_patterns)
##     user_id        signup_date                   plan_type        
##  Min.   :   1.0   Min.   :2023-01-01 00:00:00   Length:2800       
##  1st Qu.: 700.8   1st Qu.:2023-07-09 18:00:00   Class :character  
##  Median :1400.5   Median :2024-01-14 00:00:00   Mode  :character  
##  Mean   :1400.5   Mean   :2024-01-09 16:06:20                     
##  3rd Qu.:2100.2   3rd Qu.:2024-07-12 06:00:00                     
##  Max.   :2800.0   Max.   :2024-12-31 00:00:00                     
##   monthly_fee    avg_weekly_usage_hours support_tickets payment_failures
##  Min.   :199.0   Min.   : 0.50          Min.   :0.000   Min.   :0.000   
##  1st Qu.:199.0   1st Qu.: 6.70          1st Qu.:2.000   1st Qu.:1.000   
##  Median :399.0   Median :12.80          Median :4.000   Median :2.000   
##  Mean   :434.2   Mean   :12.89          Mean   :3.888   Mean   :2.492   
##  3rd Qu.:699.0   3rd Qu.:19.20          3rd Qu.:6.000   3rd Qu.:4.000   
##  Max.   :699.0   Max.   :25.00          Max.   :8.000   Max.   :5.000   
##  tenure_months   last_login_days_ago    churn          
##  Min.   : 1.00   Min.   : 0          Length:2800       
##  1st Qu.:10.00   1st Qu.:14          Class :character  
##  Median :18.00   Median :30          Mode  :character  
##  Mean   :18.61   Mean   :30                            
##  3rd Qu.:27.00   3rd Qu.:46                            
##  Max.   :36.00   Max.   :60

the data haves 2800 rows and summary function summarize each columns to see medians max and statistical summary for example the max support tickets is 8.

After looking at the table and understand columns and what they mean its time to check for the credibility of data.

3 cleaning

3.1 checking for nulls

colSums(is.na(customer_subscription_churn_usage_patterns))
##                user_id            signup_date              plan_type 
##                      0                      0                      0 
##            monthly_fee avg_weekly_usage_hours        support_tickets 
##                      0                      0                      0 
##       payment_failures          tenure_months    last_login_days_ago 
##                      0                      0                      0 
##                  churn 
##                      0
customer_subscription_churn_usage_patterns %>% summarise(across(everything(), ~sum(is.na(.))))

there is no nulls in the dataset

3.2 checking for dupes

library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
get_dupes(customer_subscription_churn_usage_patterns)
## No variable names specified - using all columns.
## No duplicate combinations found of: user_id, signup_date, plan_type, monthly_fee, avg_weekly_usage_hours, support_tickets, payment_failures, tenure_months, last_login_days_ago, churn

means there are no dupes in the dataset

dataset is clean.

4 exploring data

Before diving into detailed analysis, we performed a preliminary exploration to understand the dataset and the general behavior of customers. This step helps frame the business context and highlights initial patterns worth investigating further.

churn_prc <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=n(),prc=100*(total/2800))
knitr::kable(head(churn_prc))
churn total prc
No 1195 42.67857
Yes 1605 57.32143

Examined the proportion of churned vs retained customers, identifying that 57% of users had churned.

plan_prc <- customer_subscription_churn_usage_patterns %>% group_by(plan_type) %>% summarise(total=n(),prc=100*(total/2800))
knitr::kable(head(plan_prc))
plan_type total prc
Basic 923 32.96429
Premium 944 33.71429
Standard 933 33.32143

Reviewed subscription plans (Basic, Standard, Premium) to ensure balanced representation and identify potential plan-specific trends. And as we see the distr of clients in plans are same.

plan_churn_prc <- customer_subscription_churn_usage_patterns %>% group_by(churn, plan_type) %>% summarise(total = n(), .groups = "drop") %>%  group_by(churn) %>%  mutate(prc = 100 * total / sum(total)) %>%  ungroup()
knitr::kable(head(plan_churn_prc))
churn plan_type total prc
No Basic 389 32.55230
No Premium 396 33.13808
No Standard 410 34.30962
Yes Basic 534 33.27103
Yes Premium 548 34.14330
Yes Standard 523 32.58567

Checked how churn was distributed within each plan to get an early sense of whether subscription type influenced attrition.

5 aggregating data

5.1 support tickets and payment failures

5.1.1 general look

to look more into other variables, i did need to see average support tickets and payment failures by churn state and see total numbers and comapare if it respects the ration of 57% and 43%.

tickets_churn_pattern <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(tickets_num=sum(support_tickets),avg_tickets=mean(support_tickets),payments_fail_num=sum(payment_failures),avg_pay_fail=mean(payment_failures))
knitr::kable(head(tickets_churn_pattern))
churn tickets_num avg_tickets payments_fail_num avg_pay_fail
No 4106 3.435983 2477 2.072803
Yes 6780 4.224299 4500 2.803738

difference between avg support tickets is 4.22 for yes and 3.44 for no, meaning that clients who leaves haves more support tickets. for payment failures also 2.8 for leaving and 2 for staying meaning they tend to have more payment failures. and for total of these variables they show higher sum more than the ratio meaning it not because of churn yes clients are more than the others.

5.1.2 tickets and last login

tickets_range <- customer_subscription_churn_usage_patterns %>% group_by(churn,support_tickets) %>% summarise(total=n(),avg_last_login=mean(last_login_days_ago))
## `summarise()` has grouped output by 'churn'. You can override using the
## `.groups` argument.
knitr::kable(head(tickets_range))
churn support_tickets total avg_last_login
No 0 187 29.35294
No 1 165 25.69697
No 2 152 27.27632
No 3 151 28.39073
No 4 125 24.36000
No 5 94 26.01064

This aggregation summarizes customer behavior by churn status and number of support tickets. This helps identify patterns between support interactions and churn, highlighting whether higher support ticket volumes correlate with increased attrition.

5.1.3 payment failure

PF_range <- customer_subscription_churn_usage_patterns %>% group_by(churn,payment_failures) %>% summarise(total=n(),avg_tickets=mean(support_tickets))
## `summarise()` has grouped output by 'churn'. You can override using the
## `.groups` argument.
knitr::kable(head(PF_range))
churn payment_failures total avg_tickets
No 0 267 3.674157
No 1 297 3.673401
No 2 171 3.397661
No 3 155 3.296774
No 4 152 2.842105
No 5 153 3.333333

Purpose: This aggregation analyzes the relationship between payment failures (PF) and churn:

Total customers are counted for each combination of churn status (Yes/No) and number of payment failures.

Average support tickets per group are calculated to see how support interactions vary with payment failures.

This helps identify whether customers with more payment failures are more likely to churn and whether higher PFs coincide with increased support needs.

5.1.4 churn by month

churn_rate_for_MONTHS <- customer_subscription_churn_usage_patterns %>% mutate(month=make_date(year(signup_date),month(signup_date),1)) %>% group_by(month) %>% summarise(total_cust=n(),churn_count=sum(churn=="Yes"),churn_rate=mean(churn=="Yes"),.groups = "drop")
knitr::kable(head(churn_rate_for_MONTHS))
month total_cust churn_count churn_rate
2023-01-01 88 54 0.6136364
2023-02-01 112 70 0.6250000
2023-03-01 99 51 0.5151515
2023-04-01 118 62 0.5254237
2023-05-01 125 66 0.5280000
2023-06-01 121 64 0.5289256

Purpose: This aggregation calculates monthly churn trends to understand how churn evolves over time:

The signup_date is converted to a month-level date to group customers by the month they signed up.

For each month, it computes:

total_cust: total number of customers who signed up that month

churn_count: number of customers who churned

churn_rate: proportion of churned customers for that month

This allows us to observe patterns, such as whether certain months have higher churn rates or if churn correlates with customer acquisition volume.

5.1.5 churn by payment failures

churn_rate_for_PF <- customer_subscription_churn_usage_patterns  %>% group_by(payment_failures) %>% summarise(total_cust=n(),churn_count=sum(churn=="Yes"),churn_rate=mean(churn=="Yes"),.groups = "drop")

knitr::kable(head(churn_rate_for_PF))
payment_failures total_cust churn_count churn_rate
0 442 175 0.3959276
1 496 199 0.4012097
2 478 307 0.6422594
3 471 316 0.6709130
4 453 301 0.6644592
5 460 307 0.6673913
bin_pf <- churn_rate_for_PF %>% group_by(pf_bin=case_when(payment_failures == 0 ~ "0", payment_failures == 1 ~ "1",payment_failures == 2 ~ "2", payment_failures >= 3 ~ "3+")) %>% summarise(total_cus=sum(total_cust),churn_count=sum(churn_count),rate=sum(churn_rate*total_cust)/sum(total_cust))

knitr::kable(head(bin_pf))
pf_bin total_cus churn_count rate
0 442 175 0.3959276
1 496 199 0.4012097
2 478 307 0.6422594
3+ 1384 924 0.6676301

Purpose: This code investigates the relationship between payment failures (PF) and churn risk, identifying critical thresholds.

churn_rate_for_PF

Groups customers by payment_failures.

Calculates for each PF count:

total_cust: total customers with that number of failed payments

churn_count: number of churned customers

churn_rate: proportion of churned customers

This provides a raw view of how churn changes as payment failures increase.

bin_pf

Groups the PF counts into binned categories: 0, 1, 2, 3+

Aggregates totals and computes a weighted churn rate for each bin.

This makes it easier to visualize churn patterns and highlights the “two-payment-fail tipping point”, where churn jumps significantly.

5.1.6 plans and pf

plan_total_pf <- customer_subscription_churn_usage_patterns %>% group_by(plan_type,churn) %>% summarise(total=n(),PF=sum(payment_failures),avg_pf=mean(payment_failures))
## `summarise()` has grouped output by 'plan_type'. You can override using the
## `.groups` argument.
knitr::kable(head(plan_total_pf))
plan_type churn total PF avg_pf
Basic No 389 839 2.156812
Basic Yes 534 1482 2.775281
Premium No 396 796 2.010101
Premium Yes 548 1512 2.759124
Standard No 410 842 2.053659
Standard Yes 523 1506 2.879541

Purpose: This code explores payment failures across different subscription plans and how they relate to churn.

Groups data by plan_type (Basic, Standard, Premium) and churn status (Yes/No).

For each combination, calculates:

total: number of customers

PF: total number of payment failures

avg_pf: average number of payment failures

Insight: This helps identify whether certain plans are more prone to payment issues and churn, providing a plan-specific view of risk.

5.2 tenure

5.2.1 tenure by churn

tenure_patterns <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=sum(tenure_months),avr_tenure=mean(tenure_months))
knitr::kable(head(tenure_patterns))
churn total avr_tenure
No 22213 18.58828
Yes 29903 18.63115

Purpose: This code examines customer tenure in relation to churn.

Groups the dataset by churn status (Yes/No).

Calculates:

total: total months of tenure for each churn group

avr_tenure: average tenure (in months) per group

Insight: This gives a general understanding of how long churned vs. retained customers have been with the service, helping to see whether longer-tenured customers are more or less likely to leave.

5.3 hours usage and last login

avg_hours_pattern <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(total=sum(avg_weekly_usage_hours),avr_hours=mean(avg_weekly_usage_hours))
knitr::kable(head(avg_hours_pattern))
churn total avr_hours
No 16427 13.74644
Yes 19669 12.25483

avg_hours_pattern Purpose: Analyzes average weekly app usage for churned vs. retained customers.

Groups by churn status.

Calculates:

total: total weekly usage hours for each group

avr_hours: average weekly usage per customer

Insight: Helps understand whether more engaged users (higher weekly usage) are less likely to churn. the avg hours are same which means that for churn variable it shows that they both use it for same period 13 yes 12 no.

mean_last_login <- customer_subscription_churn_usage_patterns %>% group_by(churn) %>% summarise(avg=mean(last_login_days_ago))
knitr::kable(head(mean_last_login))
churn avg
No 26.14979
Yes 32.87539

mean_last_login Purpose: Measures recency of activity among customers.

Groups by churn.

Calculates the average number of days since the last login for each group.

Insight: Provides a sense of whether recently active customers are more likely to stay, which can help in identifying dormant users at risk of churning. it shows that churned tend to have higher avg last login days.

6 First spotted patterns,trends or difference

the patterns are organized by the importance.

6.1 payment failures

the bin_pf dataframe shows that 2 payment failures is where churn no clients number start to decrease and churn yes increase meaning that 2 payment failures is dead point. also the number of payment failures for churned people is higher meaning they get more payment failures. and the churn rate from 1 to 2 jumps from 0.4 to 0.64.

6.2 support tickets

support tickets shows that churned people have done more support tickets than staying people with an average of 4.22 compared to 3.44 making a difference of approximatly 1 support tickets.

6.3 plans

plans doesnt show difference in term of churn state they are same arround 33.3% for each and also same in payment failures.

6.4 others

for avg weekly hour usage it shows a difference of 1h from 13 for no and 12 for yes meaning the difference is little.

for tenure months they are the same arround 18.5.

last login in days shows that churned people have from 30-36 days without using the subscription while people who stayed 22-29 days.

7 plotting

note: in plotting phase i used AI by giving him first manual code for plot using ggplot and asking him to improve it and make it more beautiful. which cuts time.