Survival model

This is a simple tutorial of building a survival model for a subscription business. The use case is that a media company offers various subscription plan to its customers. Each plan is associated with different price and billing period, e.g. Annual, Month, Semi-Annual, Two-Year, etc. We need to infer from billing period associated with churn rate to estimate the average number of extended payments that a subscription is likely to have left in order to estimate LTV (life-time value) and project revenue. For example, someone already paid his/her fifteenth payment is more likely to pay additional two, three or more payments than someone just finished paying his/her second payment. Generally speaking, the likelihood of paying the twenty-first payment is higher for this cohort than new subscribers. This type of analysis is very commnn in medical or actuarial science (such as life insurance). For each subscription, the goal is to predict how many more payments (on average) that the subscription is likely to have left based on billing period and existing number of payments. For simplicity, we only rely on “billing period” to build this survivalship model, and we also only estimate “periods_remaining” (aka lifetime) for new subscriptions.

# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('tidyverse', 'sqldf', 'broom', 'survival', 'survminer', 'kableExtra')
pacman::p_load(char = packages)

Read data

Each row represents a subscription. “sub_id” represents unique id for each subscription; “billing_period” is referred to the frequency of payment; “lifetime” is referred to the number of payment already paid by the subscription; and lastly, “status” is referred to the current subscription status. Disclaimer: this is a made-up dataset for building this tutorial.

# head, dim
df <- read.csv("df.csv")
head(df)
##        sub_id billing_period lifetime    status
## 1  D78251x_id         Annual        2    Active
## 2  O10741e_id          Month        0    Active
## 3 B619365g_id          Month        3 Cancelled
## 4 I794344u_id          Month        3 Cancelled
## 5  B35391k_id          Month        1    Active
## 6 M719337i_id          Month        2 Cancelled
dim(df)
## [1] 197346      4
# check missing
colSums(is.na(df))
##         sub_id billing_period       lifetime         status 
##              0              0              0              0

# billing periods
unique(df$billing_period)
## [1] Annual      Month       Two_Years   Semi_Annual
## Levels: Annual Month Semi_Annual Two_Years

# billing periods x status
ftable(df$billing_period, df$status)
##              Active Cancelled
##                              
## Annual        15920      1320
## Month        101139     77708
## Semi_Annual     275       106
## Two_Years       847        31

# average number of payment received by billing period
aggregate(lifetime ~ billing_period, data = df, FUN = mean)
##   billing_period  lifetime
## 1         Annual 1.0861949
## 2          Month 4.7443290
## 3    Semi_Annual 1.6456693
## 4      Two_Years 0.9897494

Nest data

Nesting data for each billing period. Each data set contains only status and lifetime. An important step is to apply ‘case_when’ statement to convert status into value of 1 or 2, i.e. ‘Active’ == 1, ‘Cancelled’ == 2. Subsequently, we can apply survival model on each subset (by billing period).

# nest data, i.e. data = c(status, lifetime)
dfNest <- df %>%
        dplyr::select(billing_period, status, lifetime) %>%
        dplyr::mutate(status = dplyr::case_when(status == 'Active' ~ 1, TRUE ~ 2)) %>%
        nest(., data = c(status, lifetime)) %>%
        arrange(billing_period)

dfNest
## # A tibble: 4 x 2
##   billing_period data                  
##   <fct>          <list>                
## 1 Annual         <tibble [17,240 x 2]> 
## 2 Month          <tibble [178,847 x 2]>
## 3 Semi_Annual    <tibble [381 x 2]>    
## 4 Two_Years      <tibble [878 x 2]>

Model

We build a survival model for each billing period and then combine the output together into one data frame. All we care is the “estimate” column, which tells us the chance of surviving the corresponding lifetime (given the subscribers have survived the previous stages). For example, for ‘Annual’ billing, the chance of reaching the third payment is roughly 71% (given they successfully made the previous payments).

# build model for overall survival - (estimate lifetime for new subscriptions)
overall_survival <- lapply(dfNest$data, function(x) survfit(Surv(lifetime, status) ~ 1, data = x))

names(overall_survival) <- dfNest$billing_period

# tidy result output
overall_survival_result <- lapply(1:length(overall_survival), 
                                  function(x) tidy(overall_survival[[x]]) %>% 
                                          dplyr::mutate(billing_period = names(overall_survival[x])))

# combine results (from list) into one dataframe
overall_survival_result <- overall_survival_result %>% 
        dplyr::bind_rows(.) %>%
        dplyr::select(billing_period, lifetime = time, everything())

overall_survival_result %>% kable()
billing_period lifetime n.risk n.event n.censor estimate std.error conf.high conf.low
Annual 0 17240 192 52 0.9888631 0.0008082 0.9904308 0.9872979
Annual 1 16996 981 14294 0.9317865 0.0020633 0.9355623 0.9280259
Annual 2 1721 146 1569 0.8527389 0.0076237 0.8655763 0.8400919
Annual 3 6 1 3 0.7106158 0.1827333 1.0000000 0.4966975
Annual 4 2 0 1 0.7106158 0.1827333 1.0000000 0.4966975
Annual 5 1 0 1 0.7106158 0.1827333 1.0000000 0.4966975
Month 0 178847 3382 1865 0.9810900 0.0003283 0.9817214 0.9804589
Month 1 173600 13188 31244 0.9065588 0.0007625 0.9079146 0.9052050
Month 2 129168 9216 21446 0.8418768 0.0010845 0.8436682 0.8400892
Month 3 98506 16987 8634 0.6966982 0.0018143 0.6991800 0.6942252
Month 4 72885 11641 4874 0.5854234 0.0024289 0.5882170 0.5826431
Month 5 56370 5361 3019 0.5297474 0.0027864 0.5326484 0.5268623
Month 6 47990 4479 3596 0.4803051 0.0031478 0.4832776 0.4773509
Month 7 39915 3535 3465 0.4377677 0.0035133 0.4407926 0.4347636
Month 8 32915 2159 3405 0.4090531 0.0038047 0.4121149 0.4060141
Month 9 27351 1531 2453 0.3861560 0.0040797 0.3892561 0.3830806
Month 10 23367 1151 1683 0.3671349 0.0043429 0.3702733 0.3640231
Month 11 20533 912 1517 0.3508281 0.0045962 0.3540028 0.3476819
Month 12 18104 825 1208 0.3348409 0.0048746 0.3380553 0.3316570
Month 13 16071 661 1055 0.3210689 0.0051411 0.3243204 0.3178499
Month 14 14355 582 779 0.3080517 0.0054199 0.3113414 0.3047966
Month 15 12994 454 835 0.2972886 0.0056711 0.3006114 0.2940025
Month 16 11705 408 848 0.2869260 0.0059369 0.2902842 0.2836067
Month 17 10449 340 852 0.2775897 0.0062021 0.2809846 0.2742358
Month 18 9257 245 820 0.2702429 0.0064345 0.2736726 0.2668562
Month 19 8192 213 793 0.2632163 0.0066829 0.2666867 0.2597912
Month 20 7186 174 859 0.2568429 0.0069364 0.2603585 0.2533747
Month 21 6153 116 889 0.2520007 0.0071580 0.2555611 0.2484900
Month 22 5148 106 1006 0.2468119 0.0074378 0.2504362 0.2432400
Month 23 4036 37 1740 0.2445493 0.0075903 0.2482146 0.2409381
Month 24 2259 5 2254 0.2440080 0.0076547 0.2476964 0.2403744
Semi_Annual 0 381 2 0 0.9947507 0.0037216 1.0000000 0.9875211
Semi_Annual 1 379 83 48 0.7769029 0.0274537 0.8198518 0.7362039
Semi_Annual 2 248 21 227 0.7111168 0.0335669 0.7594743 0.6658383
Two_Years 0 878 9 0 0.9897494 0.0034345 0.9964344 0.9831093
Two_Years 1 869 22 847 0.9646925 0.0064564 0.9769776 0.9525618

Visualization

It’s easier to visualize the outcome by plotting a survival curve. Once subscribers make certain number of payments, their chance of continuously paying (recurring payment) will become stablized. The goal of business becomes clearer, which is either to find a better way to get people to reach certain number of payments (let’s say 12 for ‘Month’), or bring the “finish line” closer to the subscribers, e.g. find a way to shorten the number of payments in order to flatten, stablize the survival curve. In the long run, keeping customers to stay longer with us by reducing churn or increasing retention rate is the key to grow profit for the business.

ggsurvplot(
        fit = survfit(Surv(lifetime, status) ~ billing_period, data = unnest(dfNest)), 
        xlab = "Payments", 
        ylab = "Overall survival probability")
## Warning: `cols` is now required.
## Please use `cols = c(data)`

## Warning: `cols` is now required.
## Please use `cols = c(data)`

How about expecting life expectancy?

To answer this question, we can simply sum up the area under the curve to find the expected value. However, we do not collect future data and we can only expect (or wish) them to pay us indefinitely. For example, subscribers signed up for our monthly plan, since we roughly launched our business for two years, the maximum number of payments that they can pay us is 24 and we can see that there is a long tail trailing to the right. We still have to come up with a reasonable “stop” or “end point” to the x-axis.

Therefore, there are two things we can do. First option, we can sum up the esimate to get the expected value (lifetime expectancy) for each billing period. That’s the lifetime expectancy for now (our second year), but as we continue to grow our business (into our third, fourth, fifth year and so forth), the expectancy will change and prolong. Although the outcome is flexible (because it will change and shift once we get more data), we may want to have something more concrete so that we can use it in another process (such as calculating lifetime value based on lifetime expectancy). Second option, we can set up a hard “stop” for each payment cohort. This is the part completely based on business acumen or experts’ opinion. For example, we only expect people to pay us for the maximum of next four years regardless of how many payments that they have made or whatever subscription plan that they have signed up for. For example, for people who already made their first monthly payment, we expect the maximum number of payments that they would have pay us is 48. Similarly, for people already paid their twenty-fourth monthly payment, the maximum number of extended payments (which is how many more that they will pay us in the future) is also 48.

# option 1
option_1 <- overall_survival_result %>% group_by(billing_period) %>% summarise(lifetime = sum(estimate))

option_1 <- option_1 %>% dplyr::mutate(lifetime = floor(lifetime))

option_1
## # A tibble: 4 x 2
##   billing_period lifetime
##   <chr>             <dbl>
## 1 Annual                4
## 2 Month                10
## 3 Semi_Annual           2
## 4 Two_Years             1

# option 2
# create 'expectancy' table - a fixed number of outcomes (possible number of payments) for each payment cohort
billing_periods_list <- list("Annual", "Month", "Semi_Annual", "Two_Years")

# set the maximum number of payment equal to 4 years
lifetime_list <- list(Annual = c(0:4), Month = c(0:48), Semi_Annual = c(0:8), Two_Years = c(0:2))  

expectancy <- purrr::map2(billing_periods_list, lifetime_list, function(x, y) expand.grid(x, y)) %>%
        dplyr::bind_rows(.) %>%
        dplyr::select(billing_period = Var1,
                      lifetime = Var2)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector

# left join with the above model result output
modelOutput <- expectancy %>%
        dplyr::left_join(., overall_survival_result, by = c("billing_period" = "billing_period", "lifetime", "lifetime")) %>%
        dplyr::select(billing_period, lifetime, estimate) %>%
        arrange(billing_period, lifetime)

modelOutput %>% kable()
billing_period lifetime estimate
Annual 0 0.9888631
Annual 1 0.9317865
Annual 2 0.8527389
Annual 3 0.7106158
Annual 4 0.7106158
Month 0 0.9810900
Month 1 0.9065588
Month 2 0.8418768
Month 3 0.6966982
Month 4 0.5854234
Month 5 0.5297474
Month 6 0.4803051
Month 7 0.4377677
Month 8 0.4090531
Month 9 0.3861560
Month 10 0.3671349
Month 11 0.3508281
Month 12 0.3348409
Month 13 0.3210689
Month 14 0.3080517
Month 15 0.2972886
Month 16 0.2869260
Month 17 0.2775897
Month 18 0.2702429
Month 19 0.2632163
Month 20 0.2568429
Month 21 0.2520007
Month 22 0.2468119
Month 23 0.2445493
Month 24 0.2440080
Month 25 NA
Month 26 NA
Month 27 NA
Month 28 NA
Month 29 NA
Month 30 NA
Month 31 NA
Month 32 NA
Month 33 NA
Month 34 NA
Month 35 NA
Month 36 NA
Month 37 NA
Month 38 NA
Month 39 NA
Month 40 NA
Month 41 NA
Month 42 NA
Month 43 NA
Month 44 NA
Month 45 NA
Month 46 NA
Month 47 NA
Month 48 NA
Semi_Annual 0 0.9947507
Semi_Annual 1 0.7769029
Semi_Annual 2 0.7111168
Semi_Annual 3 NA
Semi_Annual 4 NA
Semi_Annual 5 NA
Semi_Annual 6 NA
Semi_Annual 7 NA
Semi_Annual 8 NA
Two_Years 0 0.9897494
Two_Years 1 0.9646925
Two_Years 2 NA

Deacy function

For option 2, once we do a left join with the model output using the expectancy table that we created, we need the fill the NA by fitting a decay function. That’s just the last value multiplies the change in previous two values. Once we fill up the gaps (filling all the NA with the decay function), we can sum up the estimate to return us the expected value.

# decay function
for(i in 1:nrow(modelOutput)){
        
        if(!is.na(modelOutput$estimate[i])) {
                next
        } else {
                modelOutput$estimate[i] <- dplyr::lag(modelOutput$estimate, 1)[i] / dplyr::lag(modelOutput$estimate, 2)[i] * dplyr::lag(modelOutput$estimate, 1)[i]
        }        
        
}

modelOutput %>% kable()
billing_period lifetime estimate
Annual 0 0.9888631
Annual 1 0.9317865
Annual 2 0.8527389
Annual 3 0.7106158
Annual 4 0.7106158
Month 0 0.9810900
Month 1 0.9065588
Month 2 0.8418768
Month 3 0.6966982
Month 4 0.5854234
Month 5 0.5297474
Month 6 0.4803051
Month 7 0.4377677
Month 8 0.4090531
Month 9 0.3861560
Month 10 0.3671349
Month 11 0.3508281
Month 12 0.3348409
Month 13 0.3210689
Month 14 0.3080517
Month 15 0.2972886
Month 16 0.2869260
Month 17 0.2775897
Month 18 0.2702429
Month 19 0.2632163
Month 20 0.2568429
Month 21 0.2520007
Month 22 0.2468119
Month 23 0.2445493
Month 24 0.2440080
Month 25 0.2434679
Month 26 0.2429290
Month 27 0.2423913
Month 28 0.2418548
Month 29 0.2413195
Month 30 0.2407854
Month 31 0.2402524
Month 32 0.2397207
Month 33 0.2391901
Month 34 0.2386607
Month 35 0.2381324
Month 36 0.2376053
Month 37 0.2370794
Month 38 0.2365547
Month 39 0.2360311
Month 40 0.2355087
Month 41 0.2349874
Month 42 0.2344673
Month 43 0.2339483
Month 44 0.2334305
Month 45 0.2329139
Month 46 0.2323983
Month 47 0.2318839
Month 48 0.2313707
Semi_Annual 0 0.9947507
Semi_Annual 1 0.7769029
Semi_Annual 2 0.7111168
Semi_Annual 3 0.6509012
Semi_Annual 4 0.5957846
Semi_Annual 5 0.5453351
Semi_Annual 6 0.4991575
Semi_Annual 7 0.4568901
Semi_Annual 8 0.4182019
Two_Years 0 0.9897494
Two_Years 1 0.9646925
Two_Years 2 0.9402699

# sum up the estimate to get E(x)
option_2 <- modelOutput %>%
        dplyr::mutate(estimate = dplyr::case_when(is.nan(estimate) ~ 0,
                                                  TRUE ~ estimate)) %>%
        group_by(billing_period) %>%
        summarise(periods_remaining = sum(estimate)) %>%
        dplyr::mutate(periods_remaining = floor(periods_remaining))

# comparison between option 1 and 2
option_1
## # A tibble: 4 x 2
##   billing_period lifetime
##   <chr>             <dbl>
## 1 Annual                4
## 2 Month                10
## 3 Semi_Annual           2
## 4 Two_Years             1
option_2
## # A tibble: 4 x 2
##   billing_period periods_remaining
##   <chr>                      <dbl>
## 1 Annual                         4
## 2 Month                         16
## 3 Semi_Annual                    5
## 4 Two_Years                      2

Thoughts

This exercise probably is a mix of math and art. For example, we apply floor() in both option_1 and option_2 to round down the numbers, so that our numbers would become more conservative (as we don’t want to overestimate lifetime value of our customers). We rather want to forecast less revenue than grossly overestimate how much we can get from our subscription business. At the end, it’s also based on business acumen and expert judgement.