This is a simple tutorial of building a survival model for a subscription business. The use case is that a media company offers various subscription plan to its customers. Each plan is associated with different price and billing period, e.g. Annual, Month, Semi-Annual, Two-Year, etc. We need to infer from billing period associated with churn rate to estimate the average number of extended payments that a subscription is likely to have left in order to estimate LTV (life-time value) and project revenue. For example, someone already paid his/her fifteenth payment is more likely to pay additional two, three or more payments than someone just finished paying his/her second payment. Generally speaking, the likelihood of paying the twenty-first payment is higher for this cohort than new subscribers. This type of analysis is very commnn in medical or actuarial science (such as life insurance). For each subscription, the goal is to predict how many more payments (on average) that the subscription is likely to have left based on billing period and existing number of payments. For simplicity, we only rely on “billing period” to build this survivalship model, and we also only estimate “periods_remaining” (aka lifetime) for new subscriptions.
# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('tidyverse', 'sqldf', 'broom', 'survival', 'survminer', 'kableExtra')
pacman::p_load(char = packages)
Each row represents a subscription. “sub_id” represents unique id for each subscription; “billing_period” is referred to the frequency of payment; “lifetime” is referred to the number of payment already paid by the subscription; and lastly, “status” is referred to the current subscription status. Disclaimer: this is a made-up dataset for building this tutorial.
# head, dim
df <- read.csv("df.csv")
head(df)
## sub_id billing_period lifetime status
## 1 D78251x_id Annual 2 Active
## 2 O10741e_id Month 0 Active
## 3 B619365g_id Month 3 Cancelled
## 4 I794344u_id Month 3 Cancelled
## 5 B35391k_id Month 1 Active
## 6 M719337i_id Month 2 Cancelled
dim(df)
## [1] 197346 4
# check missing
colSums(is.na(df))
## sub_id billing_period lifetime status
## 0 0 0 0
# billing periods
unique(df$billing_period)
## [1] Annual Month Two_Years Semi_Annual
## Levels: Annual Month Semi_Annual Two_Years
# billing periods x status
ftable(df$billing_period, df$status)
## Active Cancelled
##
## Annual 15920 1320
## Month 101139 77708
## Semi_Annual 275 106
## Two_Years 847 31
# average number of payment received by billing period
aggregate(lifetime ~ billing_period, data = df, FUN = mean)
## billing_period lifetime
## 1 Annual 1.0861949
## 2 Month 4.7443290
## 3 Semi_Annual 1.6456693
## 4 Two_Years 0.9897494
Nesting data for each billing period. Each data set contains only status and lifetime. An important step is to apply ‘case_when’ statement to convert status into value of 1 or 2, i.e. ‘Active’ == 1, ‘Cancelled’ == 2. Subsequently, we can apply survival model on each subset (by billing period).
# nest data, i.e. data = c(status, lifetime)
dfNest <- df %>%
dplyr::select(billing_period, status, lifetime) %>%
dplyr::mutate(status = dplyr::case_when(status == 'Active' ~ 1, TRUE ~ 2)) %>%
nest(., data = c(status, lifetime)) %>%
arrange(billing_period)
dfNest
## # A tibble: 4 x 2
## billing_period data
## <fct> <list>
## 1 Annual <tibble [17,240 x 2]>
## 2 Month <tibble [178,847 x 2]>
## 3 Semi_Annual <tibble [381 x 2]>
## 4 Two_Years <tibble [878 x 2]>
We build a survival model for each billing period and then combine the output together into one data frame. All we care is the “estimate” column, which tells us the chance of surviving the corresponding lifetime (given the subscribers have survived the previous stages). For example, for ‘Annual’ billing, the chance of reaching the third payment is roughly 71% (given they successfully made the previous payments).
# build model for overall survival - (estimate lifetime for new subscriptions)
overall_survival <- lapply(dfNest$data, function(x) survfit(Surv(lifetime, status) ~ 1, data = x))
names(overall_survival) <- dfNest$billing_period
# tidy result output
overall_survival_result <- lapply(1:length(overall_survival),
function(x) tidy(overall_survival[[x]]) %>%
dplyr::mutate(billing_period = names(overall_survival[x])))
# combine results (from list) into one dataframe
overall_survival_result <- overall_survival_result %>%
dplyr::bind_rows(.) %>%
dplyr::select(billing_period, lifetime = time, everything())
overall_survival_result %>% kable()
| billing_period | lifetime | n.risk | n.event | n.censor | estimate | std.error | conf.high | conf.low |
|---|---|---|---|---|---|---|---|---|
| Annual | 0 | 17240 | 192 | 52 | 0.9888631 | 0.0008082 | 0.9904308 | 0.9872979 |
| Annual | 1 | 16996 | 981 | 14294 | 0.9317865 | 0.0020633 | 0.9355623 | 0.9280259 |
| Annual | 2 | 1721 | 146 | 1569 | 0.8527389 | 0.0076237 | 0.8655763 | 0.8400919 |
| Annual | 3 | 6 | 1 | 3 | 0.7106158 | 0.1827333 | 1.0000000 | 0.4966975 |
| Annual | 4 | 2 | 0 | 1 | 0.7106158 | 0.1827333 | 1.0000000 | 0.4966975 |
| Annual | 5 | 1 | 0 | 1 | 0.7106158 | 0.1827333 | 1.0000000 | 0.4966975 |
| Month | 0 | 178847 | 3382 | 1865 | 0.9810900 | 0.0003283 | 0.9817214 | 0.9804589 |
| Month | 1 | 173600 | 13188 | 31244 | 0.9065588 | 0.0007625 | 0.9079146 | 0.9052050 |
| Month | 2 | 129168 | 9216 | 21446 | 0.8418768 | 0.0010845 | 0.8436682 | 0.8400892 |
| Month | 3 | 98506 | 16987 | 8634 | 0.6966982 | 0.0018143 | 0.6991800 | 0.6942252 |
| Month | 4 | 72885 | 11641 | 4874 | 0.5854234 | 0.0024289 | 0.5882170 | 0.5826431 |
| Month | 5 | 56370 | 5361 | 3019 | 0.5297474 | 0.0027864 | 0.5326484 | 0.5268623 |
| Month | 6 | 47990 | 4479 | 3596 | 0.4803051 | 0.0031478 | 0.4832776 | 0.4773509 |
| Month | 7 | 39915 | 3535 | 3465 | 0.4377677 | 0.0035133 | 0.4407926 | 0.4347636 |
| Month | 8 | 32915 | 2159 | 3405 | 0.4090531 | 0.0038047 | 0.4121149 | 0.4060141 |
| Month | 9 | 27351 | 1531 | 2453 | 0.3861560 | 0.0040797 | 0.3892561 | 0.3830806 |
| Month | 10 | 23367 | 1151 | 1683 | 0.3671349 | 0.0043429 | 0.3702733 | 0.3640231 |
| Month | 11 | 20533 | 912 | 1517 | 0.3508281 | 0.0045962 | 0.3540028 | 0.3476819 |
| Month | 12 | 18104 | 825 | 1208 | 0.3348409 | 0.0048746 | 0.3380553 | 0.3316570 |
| Month | 13 | 16071 | 661 | 1055 | 0.3210689 | 0.0051411 | 0.3243204 | 0.3178499 |
| Month | 14 | 14355 | 582 | 779 | 0.3080517 | 0.0054199 | 0.3113414 | 0.3047966 |
| Month | 15 | 12994 | 454 | 835 | 0.2972886 | 0.0056711 | 0.3006114 | 0.2940025 |
| Month | 16 | 11705 | 408 | 848 | 0.2869260 | 0.0059369 | 0.2902842 | 0.2836067 |
| Month | 17 | 10449 | 340 | 852 | 0.2775897 | 0.0062021 | 0.2809846 | 0.2742358 |
| Month | 18 | 9257 | 245 | 820 | 0.2702429 | 0.0064345 | 0.2736726 | 0.2668562 |
| Month | 19 | 8192 | 213 | 793 | 0.2632163 | 0.0066829 | 0.2666867 | 0.2597912 |
| Month | 20 | 7186 | 174 | 859 | 0.2568429 | 0.0069364 | 0.2603585 | 0.2533747 |
| Month | 21 | 6153 | 116 | 889 | 0.2520007 | 0.0071580 | 0.2555611 | 0.2484900 |
| Month | 22 | 5148 | 106 | 1006 | 0.2468119 | 0.0074378 | 0.2504362 | 0.2432400 |
| Month | 23 | 4036 | 37 | 1740 | 0.2445493 | 0.0075903 | 0.2482146 | 0.2409381 |
| Month | 24 | 2259 | 5 | 2254 | 0.2440080 | 0.0076547 | 0.2476964 | 0.2403744 |
| Semi_Annual | 0 | 381 | 2 | 0 | 0.9947507 | 0.0037216 | 1.0000000 | 0.9875211 |
| Semi_Annual | 1 | 379 | 83 | 48 | 0.7769029 | 0.0274537 | 0.8198518 | 0.7362039 |
| Semi_Annual | 2 | 248 | 21 | 227 | 0.7111168 | 0.0335669 | 0.7594743 | 0.6658383 |
| Two_Years | 0 | 878 | 9 | 0 | 0.9897494 | 0.0034345 | 0.9964344 | 0.9831093 |
| Two_Years | 1 | 869 | 22 | 847 | 0.9646925 | 0.0064564 | 0.9769776 | 0.9525618 |
It’s easier to visualize the outcome by plotting a survival curve. Once subscribers make certain number of payments, their chance of continuously paying (recurring payment) will become stablized. The goal of business becomes clearer, which is either to find a better way to get people to reach certain number of payments (let’s say 12 for ‘Month’), or bring the “finish line” closer to the subscribers, e.g. find a way to shorten the number of payments in order to flatten, stablize the survival curve. In the long run, keeping customers to stay longer with us by reducing churn or increasing retention rate is the key to grow profit for the business.
ggsurvplot(
fit = survfit(Surv(lifetime, status) ~ billing_period, data = unnest(dfNest)),
xlab = "Payments",
ylab = "Overall survival probability")
## Warning: `cols` is now required.
## Please use `cols = c(data)`
## Warning: `cols` is now required.
## Please use `cols = c(data)`
To answer this question, we can simply sum up the area under the curve to find the expected value. However, we do not collect future data and we can only expect (or wish) them to pay us indefinitely. For example, subscribers signed up for our monthly plan, since we roughly launched our business for two years, the maximum number of payments that they can pay us is 24 and we can see that there is a long tail trailing to the right. We still have to come up with a reasonable “stop” or “end point” to the x-axis.
Therefore, there are two things we can do. First option, we can sum up the esimate to get the expected value (lifetime expectancy) for each billing period. That’s the lifetime expectancy for now (our second year), but as we continue to grow our business (into our third, fourth, fifth year and so forth), the expectancy will change and prolong. Although the outcome is flexible (because it will change and shift once we get more data), we may want to have something more concrete so that we can use it in another process (such as calculating lifetime value based on lifetime expectancy). Second option, we can set up a hard “stop” for each payment cohort. This is the part completely based on business acumen or experts’ opinion. For example, we only expect people to pay us for the maximum of next four years regardless of how many payments that they have made or whatever subscription plan that they have signed up for. For example, for people who already made their first monthly payment, we expect the maximum number of payments that they would have pay us is 48. Similarly, for people already paid their twenty-fourth monthly payment, the maximum number of extended payments (which is how many more that they will pay us in the future) is also 48.
# option 1
option_1 <- overall_survival_result %>% group_by(billing_period) %>% summarise(lifetime = sum(estimate))
option_1 <- option_1 %>% dplyr::mutate(lifetime = floor(lifetime))
option_1
## # A tibble: 4 x 2
## billing_period lifetime
## <chr> <dbl>
## 1 Annual 4
## 2 Month 10
## 3 Semi_Annual 2
## 4 Two_Years 1
# option 2
# create 'expectancy' table - a fixed number of outcomes (possible number of payments) for each payment cohort
billing_periods_list <- list("Annual", "Month", "Semi_Annual", "Two_Years")
# set the maximum number of payment equal to 4 years
lifetime_list <- list(Annual = c(0:4), Month = c(0:48), Semi_Annual = c(0:8), Two_Years = c(0:2))
expectancy <- purrr::map2(billing_periods_list, lifetime_list, function(x, y) expand.grid(x, y)) %>%
dplyr::bind_rows(.) %>%
dplyr::select(billing_period = Var1,
lifetime = Var2)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
# left join with the above model result output
modelOutput <- expectancy %>%
dplyr::left_join(., overall_survival_result, by = c("billing_period" = "billing_period", "lifetime", "lifetime")) %>%
dplyr::select(billing_period, lifetime, estimate) %>%
arrange(billing_period, lifetime)
modelOutput %>% kable()
| billing_period | lifetime | estimate |
|---|---|---|
| Annual | 0 | 0.9888631 |
| Annual | 1 | 0.9317865 |
| Annual | 2 | 0.8527389 |
| Annual | 3 | 0.7106158 |
| Annual | 4 | 0.7106158 |
| Month | 0 | 0.9810900 |
| Month | 1 | 0.9065588 |
| Month | 2 | 0.8418768 |
| Month | 3 | 0.6966982 |
| Month | 4 | 0.5854234 |
| Month | 5 | 0.5297474 |
| Month | 6 | 0.4803051 |
| Month | 7 | 0.4377677 |
| Month | 8 | 0.4090531 |
| Month | 9 | 0.3861560 |
| Month | 10 | 0.3671349 |
| Month | 11 | 0.3508281 |
| Month | 12 | 0.3348409 |
| Month | 13 | 0.3210689 |
| Month | 14 | 0.3080517 |
| Month | 15 | 0.2972886 |
| Month | 16 | 0.2869260 |
| Month | 17 | 0.2775897 |
| Month | 18 | 0.2702429 |
| Month | 19 | 0.2632163 |
| Month | 20 | 0.2568429 |
| Month | 21 | 0.2520007 |
| Month | 22 | 0.2468119 |
| Month | 23 | 0.2445493 |
| Month | 24 | 0.2440080 |
| Month | 25 | NA |
| Month | 26 | NA |
| Month | 27 | NA |
| Month | 28 | NA |
| Month | 29 | NA |
| Month | 30 | NA |
| Month | 31 | NA |
| Month | 32 | NA |
| Month | 33 | NA |
| Month | 34 | NA |
| Month | 35 | NA |
| Month | 36 | NA |
| Month | 37 | NA |
| Month | 38 | NA |
| Month | 39 | NA |
| Month | 40 | NA |
| Month | 41 | NA |
| Month | 42 | NA |
| Month | 43 | NA |
| Month | 44 | NA |
| Month | 45 | NA |
| Month | 46 | NA |
| Month | 47 | NA |
| Month | 48 | NA |
| Semi_Annual | 0 | 0.9947507 |
| Semi_Annual | 1 | 0.7769029 |
| Semi_Annual | 2 | 0.7111168 |
| Semi_Annual | 3 | NA |
| Semi_Annual | 4 | NA |
| Semi_Annual | 5 | NA |
| Semi_Annual | 6 | NA |
| Semi_Annual | 7 | NA |
| Semi_Annual | 8 | NA |
| Two_Years | 0 | 0.9897494 |
| Two_Years | 1 | 0.9646925 |
| Two_Years | 2 | NA |
For option 2, once we do a left join with the model output using the expectancy table that we created, we need the fill the NA by fitting a decay function. That’s just the last value multiplies the change in previous two values. Once we fill up the gaps (filling all the NA with the decay function), we can sum up the estimate to return us the expected value.
# decay function
for(i in 1:nrow(modelOutput)){
if(!is.na(modelOutput$estimate[i])) {
next
} else {
modelOutput$estimate[i] <- dplyr::lag(modelOutput$estimate, 1)[i] / dplyr::lag(modelOutput$estimate, 2)[i] * dplyr::lag(modelOutput$estimate, 1)[i]
}
}
modelOutput %>% kable()
| billing_period | lifetime | estimate |
|---|---|---|
| Annual | 0 | 0.9888631 |
| Annual | 1 | 0.9317865 |
| Annual | 2 | 0.8527389 |
| Annual | 3 | 0.7106158 |
| Annual | 4 | 0.7106158 |
| Month | 0 | 0.9810900 |
| Month | 1 | 0.9065588 |
| Month | 2 | 0.8418768 |
| Month | 3 | 0.6966982 |
| Month | 4 | 0.5854234 |
| Month | 5 | 0.5297474 |
| Month | 6 | 0.4803051 |
| Month | 7 | 0.4377677 |
| Month | 8 | 0.4090531 |
| Month | 9 | 0.3861560 |
| Month | 10 | 0.3671349 |
| Month | 11 | 0.3508281 |
| Month | 12 | 0.3348409 |
| Month | 13 | 0.3210689 |
| Month | 14 | 0.3080517 |
| Month | 15 | 0.2972886 |
| Month | 16 | 0.2869260 |
| Month | 17 | 0.2775897 |
| Month | 18 | 0.2702429 |
| Month | 19 | 0.2632163 |
| Month | 20 | 0.2568429 |
| Month | 21 | 0.2520007 |
| Month | 22 | 0.2468119 |
| Month | 23 | 0.2445493 |
| Month | 24 | 0.2440080 |
| Month | 25 | 0.2434679 |
| Month | 26 | 0.2429290 |
| Month | 27 | 0.2423913 |
| Month | 28 | 0.2418548 |
| Month | 29 | 0.2413195 |
| Month | 30 | 0.2407854 |
| Month | 31 | 0.2402524 |
| Month | 32 | 0.2397207 |
| Month | 33 | 0.2391901 |
| Month | 34 | 0.2386607 |
| Month | 35 | 0.2381324 |
| Month | 36 | 0.2376053 |
| Month | 37 | 0.2370794 |
| Month | 38 | 0.2365547 |
| Month | 39 | 0.2360311 |
| Month | 40 | 0.2355087 |
| Month | 41 | 0.2349874 |
| Month | 42 | 0.2344673 |
| Month | 43 | 0.2339483 |
| Month | 44 | 0.2334305 |
| Month | 45 | 0.2329139 |
| Month | 46 | 0.2323983 |
| Month | 47 | 0.2318839 |
| Month | 48 | 0.2313707 |
| Semi_Annual | 0 | 0.9947507 |
| Semi_Annual | 1 | 0.7769029 |
| Semi_Annual | 2 | 0.7111168 |
| Semi_Annual | 3 | 0.6509012 |
| Semi_Annual | 4 | 0.5957846 |
| Semi_Annual | 5 | 0.5453351 |
| Semi_Annual | 6 | 0.4991575 |
| Semi_Annual | 7 | 0.4568901 |
| Semi_Annual | 8 | 0.4182019 |
| Two_Years | 0 | 0.9897494 |
| Two_Years | 1 | 0.9646925 |
| Two_Years | 2 | 0.9402699 |
# sum up the estimate to get E(x)
option_2 <- modelOutput %>%
dplyr::mutate(estimate = dplyr::case_when(is.nan(estimate) ~ 0,
TRUE ~ estimate)) %>%
group_by(billing_period) %>%
summarise(periods_remaining = sum(estimate)) %>%
dplyr::mutate(periods_remaining = floor(periods_remaining))
# comparison between option 1 and 2
option_1
## # A tibble: 4 x 2
## billing_period lifetime
## <chr> <dbl>
## 1 Annual 4
## 2 Month 10
## 3 Semi_Annual 2
## 4 Two_Years 1
option_2
## # A tibble: 4 x 2
## billing_period periods_remaining
## <chr> <dbl>
## 1 Annual 4
## 2 Month 16
## 3 Semi_Annual 5
## 4 Two_Years 2
This exercise probably is a mix of math and art. For example, we apply floor() in both option_1 and option_2 to round down the numbers, so that our numbers would become more conservative (as we don’t want to overestimate lifetime value of our customers). We rather want to forecast less revenue than grossly overestimate how much we can get from our subscription business. At the end, it’s also based on business acumen and expert judgement.