data_621_blog

Survival model

This is a simple tutorial of building a survival model for a subscription business. The use case is that a media company offers various subscription plan to its customers. Each plan is associated with different price and billing period, e.g. Annual, Month, Semi-Annual, Two-Year, etc. We need to infer from billing period associated with churn rate to estimate the average number of extended payments that a subscription is likely to have left in order to estimate LTV (life-time value) and project revenue. For example, someone already paid his/her fifteenth payment is more likely to pay additional two, three or more payments than someone just finished paying his/her second payment. Generally speaking, the likelihood of paying the twenty-first payment is higher for this cohort than new subscribers. This type of analysis is very commnn in medical or actuarial science (such as life insurance). For each subscription, the goal is to predict how many more payments (on average) that the subscription is likely to have left based on billing period and existing number of payments. For simplicity, we only rely on “billing period” to build this survivalship model, and we also only estimate “periods_remaining” (aka lifetime) for new subscriptions.

# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}

## Loading required package: pacman

## Warning: package 'pacman' was built under R version 3.6.2

packages <- c('tidyverse', 'sqldf', 'broom', 'survival', 'survminer', 'kableExtra')
pacman::p_load(char = packages)

Read data

Each row represents a subscription. “sub_id” represents unique id for each subscription; “billing_period” is referred to the frequency of payment; “lifetime” is referred to the number of payment already paid by the subscription; and lastly, “status” is referred to the current subscription status. Disclaimer: this is a made-up dataset for building this tutorial.

# head, dim
df <- read.csv("df.csv")
head(df)
##        sub_id billing_period lifetime    status
## 1  D78251x_id         Annual        2    Active
## 2  O10741e_id          Month        0    Active
## 3 B619365g_id          Month        3 Cancelled
## 4 I794344u_id          Month        3 Cancelled
## 5  B35391k_id          Month        1    Active
## 6 M719337i_id          Month        2 Cancelled
dim(df)
## [1] 197346      4

# check missing
colSums(is.na(df))
##         sub_id billing_period       lifetime         status 
##              0              0              0              0

# billing periods
unique(df$billing_period)
## [1] Annual      Month       Two_Years   Semi_Annual
## Levels: Annual Month Semi_Annual Two_Years

# billing periods x status
ftable(df$billing_period, df$status)
##              Active Cancelled
##                              
## Annual        15920      1320
## Month        101139     77708
## Semi_Annual     275       106
## Two_Years       847        31

# average number of payment received by billing period
aggregate(lifetime ~ billing_period, data = df, FUN = mean)
##   billing_period  lifetime
## 1         Annual 1.0861949
## 2          Month 4.7443290
## 3    Semi_Annual 1.6456693
## 4      Two_Years 0.9897494

Nest data

Nesting data for each billing period. Each data set contains only status and lifetime. An important step is to apply ‘case_when’ statement to convert status into value of 1 or 2, i.e. ‘Active’ == 1, ‘Cancelled’ == 2. Subsequently, we can apply survival model on each subset (by billing period).

# nest data, i.e. data = c(status, lifetime)
dfNest <- df %>%
        dplyr::select(billing_period, status, lifetime) %>%
        dplyr::mutate(status = dplyr::case_when(status == 'Active' ~ 1, TRUE ~ 2)) %>%
        nest(., data = c(status, lifetime)) %>%
        arrange(billing_period)

dfNest
## # A tibble: 4 x 2
##   billing_period data                  
##   <fct>          <list>                
## 1 Annual         <tibble [17,240 x 2]> 
## 2 Month          <tibble [178,847 x 2]>
## 3 Semi_Annual    <tibble [381 x 2]>    
## 4 Two_Years      <tibble [878 x 2]>

Model

We build a survival model for each billing period and then combine the output together into one data frame. All we care is the “estimate” column, which tells us the chance of surviving the corresponding lifetime (given the subscribers have survived the previous stages). For example, for ‘Annual’ billing, the chance of reaching the third payment is roughly 71% (given they successfully made the previous payments).

# build model for overall survival - (estimate lifetime for new subscriptions)
overall_survival <- lapply(dfNest$data, function(x) survfit(Surv(lifetime, status) ~ 1, data = x))

names(overall_survival) <- dfNest$billing_period

# tidy result output
overall_survival_result <- lapply(1:length(overall_survival), 
                                  function(x) tidy(overall_survival[[x]]) %>% 
                                          dplyr::mutate(billing_period = names(overall_survival[x])))

# combine results (from list) into one dataframe
overall_survival_result <- overall_survival_result %>% 
        dplyr::bind_rows(.) %>%
        dplyr::select(billing_period, lifetime = time, everything())

overall_survival_result %>% kable()

billing_period	lifetime	n.risk	n.event	n.censor	estimate	std.error	conf.high	conf.low
Annual	0	17240	192	52	0.9888631	0.0008082	0.9904308	0.9872979
Annual	1	16996	981	14294	0.9317865	0.0020633	0.9355623	0.9280259
Annual	2	1721	146	1569	0.8527389	0.0076237	0.8655763	0.8400919
Annual	3	6	1	3	0.7106158	0.1827333	1.0000000	0.4966975
Annual	4	2	0	1	0.7106158	0.1827333	1.0000000	0.4966975
Annual	5	1	0	1	0.7106158	0.1827333	1.0000000	0.4966975
Month	0	178847	3382	1865	0.9810900	0.0003283	0.9817214	0.9804589
Month	1	173600	13188	31244	0.9065588	0.0007625	0.9079146	0.9052050
Month	2	129168	9216	21446	0.8418768	0.0010845	0.8436682	0.8400892
Month	3	98506	16987	8634	0.6966982	0.0018143	0.6991800	0.6942252
Month	4	72885	11641	4874	0.5854234	0.0024289	0.5882170	0.5826431
Month	5	56370	5361	3019	0.5297474	0.0027864	0.5326484	0.5268623
Month	6	47990	4479	3596	0.4803051	0.0031478	0.4832776	0.4773509
Month	7	39915	3535	3465	0.4377677	0.0035133	0.4407926	0.4347636
Month	8	32915	2159	3405	0.4090531	0.0038047	0.4121149	0.4060141
Month	9	27351	1531	2453	0.3861560	0.0040797	0.3892561	0.3830806
Month	10	23367	1151	1683	0.3671349	0.0043429	0.3702733	0.3640231
Month	11	20533	912	1517	0.3508281	0.0045962	0.3540028	0.3476819
Month	12	18104	825	1208	0.3348409	0.0048746	0.3380553	0.3316570
Month	13	16071	661	1055	0.3210689	0.0051411	0.3243204	0.3178499
Month	14	14355	582	779	0.3080517	0.0054199	0.3113414	0.3047966
Month	15	12994	454	835	0.2972886	0.0056711	0.3006114	0.2940025
Month	16	11705	408	848	0.2869260	0.0059369	0.2902842	0.2836067
Month	17	10449	340	852	0.2775897	0.0062021	0.2809846	0.2742358
Month	18	9257	245	820	0.2702429	0.0064345	0.2736726	0.2668562
Month	19	8192	213	793	0.2632163	0.0066829	0.2666867	0.2597912
Month	20	7186	174	859	0.2568429	0.0069364	0.2603585	0.2533747
Month	21	6153	116	889	0.2520007	0.0071580	0.2555611	0.2484900
Month	22	5148	106	1006	0.2468119	0.0074378	0.2504362	0.2432400
Month	23	4036	37	1740	0.2445493	0.0075903	0.2482146	0.2409381
Month	24	2259	5	2254	0.2440080	0.0076547	0.2476964	0.2403744
Semi_Annual	0	381	2	0	0.9947507	0.0037216	1.0000000	0.9875211
Semi_Annual	1	379	83	48	0.7769029	0.0274537	0.8198518	0.7362039
Semi_Annual	2	248	21	227	0.7111168	0.0335669	0.7594743	0.6658383
Two_Years	0	878	9	0	0.9897494	0.0034345	0.9964344	0.9831093
Two_Years	1	869	22	847	0.9646925	0.0064564	0.9769776	0.9525618

Visualization

It’s easier to visualize the outcome by plotting a survival curve. Once subscribers make certain number of payments, their chance of continuously paying (recurring payment) will become stablized. The goal of business becomes clearer, which is either to find a better way to get people to reach certain number of payments (let’s say 12 for ‘Month’), or bring the “finish line” closer to the subscribers, e.g. find a way to shorten the number of payments in order to flatten, stablize the survival curve. In the long run, keeping customers to stay longer with us by reducing churn or increasing retention rate is the key to grow profit for the business.

ggsurvplot(
        fit = survfit(Surv(lifetime, status) ~ billing_period, data = unnest(dfNest)), 
        xlab = "Payments", 
        ylab = "Overall survival probability")
## Warning: `cols` is now required.
## Please use `cols = c(data)`

## Warning: `cols` is now required.
## Please use `cols = c(data)`

How about expecting life expectancy?

To answer this question, we can simply sum up the area under the curve to find the expected value. However, we do not collect future data and we can only expect (or wish) them to pay us indefinitely. For example, subscribers signed up for our monthly plan, since we roughly launched our business for two years, the maximum number of payments that they can pay us is 24 and we can see that there is a long tail trailing to the right. We still have to come up with a reasonable “stop” or “end point” to the x-axis.

Therefore, there are two things we can do. First option, we can sum up the esimate to get the expected value (lifetime expectancy) for each billing period. That’s the lifetime expectancy for now (our second year), but as we continue to grow our business (into our third, fourth, fifth year and so forth), the expectancy will change and prolong. Although the outcome is flexible (because it will change and shift once we get more data), we may want to have something more concrete so that we can use it in another process (such as calculating lifetime value based on lifetime expectancy). Second option, we can set up a hard “stop” for each payment cohort. This is the part completely based on business acumen or experts’ opinion. For example, we only expect people to pay us for the maximum of next four years regardless of how many payments that they have made or whatever subscription plan that they have signed up for. For example, for people who already made their first monthly payment, we expect the maximum number of payments that they would have pay us is 48. Similarly, for people already paid their twenty-fourth monthly payment, the maximum number of extended payments (which is how many more that they will pay us in the future) is also 48.

# option 1
option_1 <- overall_survival_result %>% group_by(billing_period) %>% summarise(lifetime = sum(estimate))

option_1 <- option_1 %>% dplyr::mutate(lifetime = floor(lifetime))

option_1
## # A tibble: 4 x 2
##   billing_period lifetime
##   <chr>             <dbl>
## 1 Annual                4
## 2 Month                10
## 3 Semi_Annual           2
## 4 Two_Years             1

# option 2
# create 'expectancy' table - a fixed number of outcomes (possible number of payments) for each payment cohort
billing_periods_list <- list("Annual", "Month", "Semi_Annual", "Two_Years")

# set the maximum number of payment equal to 4 years
lifetime_list <- list(Annual = c(0:4), Month = c(0:48), Semi_Annual = c(0:8), Two_Years = c(0:2))  

expectancy <- purrr::map2(billing_periods_list, lifetime_list, function(x, y) expand.grid(x, y)) %>%
        dplyr::bind_rows(.) %>%
        dplyr::select(billing_period = Var1,
                      lifetime = Var2)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

# left join with the above model result output
modelOutput <- expectancy %>%
        dplyr::left_join(., overall_survival_result, by = c("billing_period" = "billing_period", "lifetime", "lifetime")) %>%
        dplyr::select(billing_period, lifetime, estimate) %>%
        arrange(billing_period, lifetime)

modelOutput %>% kable()

billing_period	lifetime	estimate
Annual	0	0.9888631
Annual	1	0.9317865
Annual	2	0.8527389
Annual	3	0.7106158
Annual	4	0.7106158
Month	0	0.9810900
Month	1	0.9065588
Month	2	0.8418768
Month	3	0.6966982
Month	4	0.5854234
Month	5	0.5297474
Month	6	0.4803051
Month	7	0.4377677
Month	8	0.4090531
Month	9	0.3861560
Month	10	0.3671349
Month	11	0.3508281
Month	12	0.3348409
Month	13	0.3210689
Month	14	0.3080517
Month	15	0.2972886
Month	16	0.2869260
Month	17	0.2775897
Month	18	0.2702429
Month	19	0.2632163
Month	20	0.2568429
Month	21	0.2520007
Month	22	0.2468119
Month	23	0.2445493
Month	24	0.2440080
Month	25	NA
Month	26	NA
Month	27	NA
Month	28	NA
Month	29	NA
Month	30	NA
Month	31	NA
Month	32	NA
Month	33	NA
Month	34	NA
Month	35	NA
Month	36	NA
Month	37	NA
Month	38	NA
Month	39	NA
Month	40	NA
Month	41	NA
Month	42	NA
Month	43	NA
Month	44	NA
Month	45	NA
Month	46	NA
Month	47	NA
Month	48	NA
Semi_Annual	0	0.9947507
Semi_Annual	1	0.7769029
Semi_Annual	2	0.7111168
Semi_Annual	3	NA
Semi_Annual	4	NA
Semi_Annual	5	NA
Semi_Annual	6	NA
Semi_Annual	7	NA
Semi_Annual	8	NA
Two_Years	0	0.9897494
Two_Years	1	0.9646925
Two_Years	2	NA

Deacy function

For option 2, once we do a left join with the model output using the expectancy table that we created, we need the fill the NA by fitting a decay function. That’s just the last value multiplies the change in previous two values. Once we fill up the gaps (filling all the NA with the decay function), we can sum up the estimate to return us the expected value.

# decay function
for(i in 1:nrow(modelOutput)){
        
        if(!is.na(modelOutput$estimate[i])) {
                next
        } else {
                modelOutput$estimate[i] <- dplyr::lag(modelOutput$estimate, 1)[i] / dplyr::lag(modelOutput$estimate, 2)[i] * dplyr::lag(modelOutput$estimate, 1)[i]
        }        
        
}

modelOutput %>% kable()

billing_period	lifetime	estimate
Annual	0	0.9888631
Annual	1	0.9317865
Annual	2	0.8527389
Annual	3	0.7106158
Annual	4	0.7106158
Month	0	0.9810900
Month	1	0.9065588
Month	2	0.8418768
Month	3	0.6966982
Month	4	0.5854234
Month	5	0.5297474
Month	6	0.4803051
Month	7	0.4377677
Month	8	0.4090531
Month	9	0.3861560
Month	10	0.3671349
Month	11	0.3508281
Month	12	0.3348409
Month	13	0.3210689
Month	14	0.3080517
Month	15	0.2972886
Month	16	0.2869260
Month	17	0.2775897
Month	18	0.2702429
Month	19	0.2632163
Month	20	0.2568429
Month	21	0.2520007
Month	22	0.2468119
Month	23	0.2445493
Month	24	0.2440080
Month	25	0.2434679
Month	26	0.2429290
Month	27	0.2423913
Month	28	0.2418548
Month	29	0.2413195
Month	30	0.2407854
Month	31	0.2402524
Month	32	0.2397207
Month	33	0.2391901
Month	34	0.2386607
Month	35	0.2381324
Month	36	0.2376053
Month	37	0.2370794
Month	38	0.2365547
Month	39	0.2360311
Month	40	0.2355087
Month	41	0.2349874
Month	42	0.2344673
Month	43	0.2339483
Month	44	0.2334305
Month	45	0.2329139
Month	46	0.2323983
Month	47	0.2318839
Month	48	0.2313707
Semi_Annual	0	0.9947507
Semi_Annual	1	0.7769029
Semi_Annual	2	0.7111168
Semi_Annual	3	0.6509012
Semi_Annual	4	0.5957846
Semi_Annual	5	0.5453351
Semi_Annual	6	0.4991575
Semi_Annual	7	0.4568901
Semi_Annual	8	0.4182019
Two_Years	0	0.9897494
Two_Years	1	0.9646925
Two_Years	2	0.9402699


# sum up the estimate to get E(x)
option_2 <- modelOutput %>%
        dplyr::mutate(estimate = dplyr::case_when(is.nan(estimate) ~ 0,
                                                  TRUE ~ estimate)) %>%
        group_by(billing_period) %>%
        summarise(periods_remaining = sum(estimate)) %>%
        dplyr::mutate(periods_remaining = floor(periods_remaining))

# comparison between option 1 and 2
option_1
## # A tibble: 4 x 2
##   billing_period lifetime
##   <chr>             <dbl>
## 1 Annual                4
## 2 Month                10
## 3 Semi_Annual           2
## 4 Two_Years             1
option_2
## # A tibble: 4 x 2
##   billing_period periods_remaining
##   <chr>                      <dbl>
## 1 Annual                         4
## 2 Month                         16
## 3 Semi_Annual                    5
## 4 Two_Years                      2

Thoughts

This exercise probably is a mix of math and art. For example, we apply floor() in both option_1 and option_2 to round down the numbers, so that our numbers would become more conservative (as we don’t want to overestimate lifetime value of our customers). We rather want to forecast less revenue than grossly overestimate how much we can get from our subscription business. At the end, it’s also based on business acumen and expert judgement.

data_621_blog_2

Jimmy Ng

4/20/2020