1. Introduction

This is the last of a series of 4 dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents aim to support the argument for significant investments in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.

The key insights and arguments are presented in a separate powerpoint deck prepared for non-technical executives.

The papers implement advanced analytics methods to:

  1. Identify managerially relevant subgroups in a customer database (segmentation).

  2. Build a model that predicts customer churn.

  3. Build a model that predicts the spending level of customers predicted to remain loyal.

4. Estimate the lifetime value of customer and hence the present value of the customer database.


2. Method

This document focuses on the fourth task - estimating customer lifetime value.

There are numerous methods available extimating lifetime value. Many are simple to the point of having limited business value whilst some are highly complex and possibly do not justify the effort. The method used here does not fall into the fiendishly advanced category.

Neverthless, the approach is far from trivial and in my opinion produces offers significant business value.

  • We begin by implementing a 4 segment solution on the raw data. The idea is to subsequently compute the lifetime value of each customer but also to gain insights into the revenue generated per segment.

The first document in this series - Customer Analytics I - Statistical Segmentation - discusses the choice of the ‘correct’ number of subgroups in a segmentation scheme.

The solution implemented in this document is a managerial segmentation solution commonly applied in practice. The document referenced above proposes a more statistically rigorous solution based on algorithmically identified natural clusters.

The segmentation scheme is applied to the latest version of the database (2015) and separately, retrospectively to the 2014 version of the database.

  • This split approach allows us to analyse the movement of customers between segments from 1 period to the next. Assuming transition behaviour is maintained into the future, we contruct a transition matrix that is used to model the evolution of the customer database going forward. This model can of course be updated to reflect changes in transition behaviour as a result of marketing actions for example.

  • We then compute historical revenue generated by individual customers and by segment. Next, we project forward 10 periods (years) by combining revenue with what we know about customers’ transitions between segments going forward. 10 years is the forecast horizon. The choice is based on judgement.

  • The estimated lifetime value is discounted to obtain the present value of the database.


3. Set up

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

3.1 Load Packages

library(ggplot2)
library(dplyr)
library(sqldf)

3.2 Load Data

customer_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', 
                            dec = '.')

4. Data

4.1 Description

The dataset is a proprietary customer database consisting of 51,243 observations across 3 features:

  • customer i.d
  • purchase amount in USD
  • transaction date

The data covers the period January 2nd 2005 to December 31st 2015.

The dataset is tidy and ready for analysis.

4.2 Data Preparation

## Name columns
customer_data <- rename(customer_data, customer_id = V1, purchase_amount = V2, 
                        date_of_purchase = V3)

## Convert purchase date to R date object
customer_data$date_of_purchase <- as.Date(customer_data$date_of_purchase, "%Y-%m-%d")

Compute additional variables (key marketing indicators):

recency - time since last purchase (days)

frequency - number of purchase transactions

year_of_purchase

avg_amount - average amount spent

## Create `recency` variable
customer_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
                                                time2 = customer_data$date_of_purchase,
                                                units = "days"))  

customer_data$year_of_purchase <- as.numeric(format(customer_data$date_of_purchase, 
                                                    "%Y"))

## Customer database as at end 2015 grouped by customer ID with remaining 
## marketing indicators computed
customers_2015 <- customer_data %>%
                       group_by(customer_id) %>%
                       summarise(recency = min(days_since), 
                                 frequency = n(), 
                                 first_purchase = max(days_since),
                                 avg_amount = mean(purchase_amount))

5. Segmention

Implement 4 segment solution of the current database

customers_2015$segment <- "NA"

Customers who have not made a purchase in more than 3 years are labelled “inactive”

customers_2015$segment[which(customers_2015$recency > 365*3)] <- "inactive"

Customers who have not made a purchase in more 2 years (but less than 3) are labelled “cold”

customers_2015$segment[which(customers_2015$recency <= 365*3 & 
                               customers_2015$recency > 365*2)] <- "cold"

Customers who have not made a purchase during the last year but did so the year before are labelled “warm”

customers_2015$segment[which(customers_2015$recency <= 365*2 & 
                               customers_2015$recency > 365*1)] <- "warm"

Customers who made a purchase within the last year are labelled “active”

customers_2015$segment[which(customers_2015$recency <= 365)] <- "active"

Segment counts

ggplot(data = customers_2015, aes(x = segment)) +
                  geom_bar(fill = "steelblue") +
                  theme_classic() +
                  labs(x = "Segment", y = "",
                       title = "Segment Counts - 2015 Database")

  • Barely 30% of the database is “active”, meaning they made a purchase during the last year.

  • Half of all customers in the database are “inactive” - they have made no purchases in more than 3 years.

prop.table(table(customers_2015$segment))
## 
##    active      cold  inactive      warm 
## 0.2930988 0.1033284 0.4972580 0.1063148

Further subdivide the “active” and “warm” categories

“active”

customers_2015$segment[which(customers_2015$segment == "active" & 
                               customers_2015$first_purchase <= 365)] <- "new active"

customers_2015$segment[which(customers_2015$segment == "active" & 
                               customers_2015$avg_amount < 100)] = "active low value"

customers_2015$segment[which(customers_2015$segment == "active" & 
                               customers_2015$avg_amount >= 100)] = "active high value"

“warm”

customers_2015$segment[which(customers_2015$segment == "warm" & 
                               customers_2015$first_purchase <= 365*2)] <- "new warm"

customers_2015$segment[which(customers_2015$segment == "warm" & 
                               customers_2015$avg_amount < 100)] <- "warm low value"

customers_2015$segment[which(customers_2015$segment == "warm" & 
                               customers_2015$avg_amount >= 100)] <- "warm high value"

Re-order segment

customers_2015$segment <- factor(customers_2015$segment, 
                                levels = c("inactive", 
                                           "cold",
                                           "warm high value", 
                                           "warm low value", 
                                           "new warm",
                                           "active high value", 
                                           "active low value", 
                                           "new active"))

Expanded segments

ggplot(data = customers_2015, aes(x = segment)) +
               geom_bar(fill = "steelblue") +
               theme_classic() +
               labs(x = "Segment", y = "",
                    title = 
                     "Segment Counts (2015)\n'active' & 'warm' segments subdivided") +
               coord_flip()


Create customer database as it was 1 year before (at end 2014)

The object of this exercise is to compare the 2014 database to the 2015 version in order to develop a sense of how customers move between segments over time.

We shall construct a transition matrix that models this behaviour.

2014 database

customers_2014 <- customer_data %>%
                       filter(days_since > 365) %>%
                       group_by(customer_id) %>%
                       summarise(recency = min(days_since) -365, 
                                 frequency = n(), 
                                 first_purchase = max(days_since) - 365,
                                 avg_amount = mean(purchase_amount),
                                 revenue = sum(purchase_amount)) 

Implement 4 segment solution of the database

customers_2014$segment = "NA"

Customers who have not made a purchase in more than 3 years are labelled “inactive”

customers_2014$segment[which(customers_2014$recency > 365*3)] <- "inactive"

Customers who have not made a purchase in more 2 years (but less than 3) are labelled “cold”

customers_2014$segment[which(customers_2014$recency <= 365*3 & 
                               customers_2014$recency > 365*2)] <- "cold"

Customers who have not made a purchase during the last year but did so the year before are labelled “warm”

customers_2014$segment[which(customers_2014$recency <= 365*2 & 
                               customers_2014$recency > 365*1)] <- "warm"

Customers who made a purchase within the last year are labelled “active”

customers_2014$segment[which(customers_2014$recency <= 365)] <- "active"

Further refine the “warm” and “active” categories

“warm”

customers_2014$segment[which(customers_2014$segment == "warm" & 
                               customers_2014$first_purchase <= 365*2)] <- "new warm"

customers_2014$segment[which(customers_2014$segment == "warm" & 
                               customers_2014$avg_amount < 100)] <- "warm low value"

customers_2014$segment[which(customers_2014$segment == "warm" & 
                               customers_2014$avg_amount >= 100)] <- "warm high value"

“active”

customers_2014$segment[which(customers_2014$segment == "active" & 
                               customers_2014$first_purchase <= 365)] <- "new active"

customers_2014$segment[which(customers_2014$segment == "active" & 
                               customers_2014$avg_amount < 100)] <- "active low value"

customers_2014$segment[which(customers_2014$segment == "active" & 
                               customers_2014$avg_amount >= 100)] <- "active high value"

Convert segment to factor and re-order

customers_2014$segment = factor(customers_2014$segment, 
                                levels = c("inactive", 
                                            "cold",
                                            "warm high value", 
                                            "warm low value",
                                            "new warm",
                                            "active high value", 
                                            "active low value", 
                                            "new active"))

Plot 2014 segments

ggplot(data = customers_2014, aes(x = segment)) +
               geom_bar(fill = "steelblue") +
               theme_classic() +
               labs(x = "Segment", y = "",
                    title = 
                     "Segment Counts (2014)\n'active' & 'warm' segments subdivided") +
               coord_flip()

2014 segment counts look very similar to 2015

6. Database Evolution

6.1 Construct transition matrix.

The likelihood of customers moving from a given segment to another.

transition_data <- merge(x = customers_2014, y = customers_2015, 
                         by = "customer_id", all.x = TRUE)

transition <- table(transition_data$segment.x, transition_data$segment.y)

transition <- prop.table(transition,1)
print(round(transition,3))
##                    
##                     inactive  cold warm high value warm low value new warm
##   inactive             0.962 0.000           0.000          0.000    0.000
##   cold                 0.897 0.000           0.000          0.000    0.000
##   warm high value      0.000 0.676           0.000          0.000    0.000
##   warm low value       0.000 0.721           0.000          0.000    0.000
##   new warm             0.000 0.911           0.000          0.000    0.000
##   active high value    0.000 0.000           0.251          0.000    0.000
##   active low value     0.000 0.000           0.000          0.299    0.000
##   new active           0.000 0.000           0.000          0.000    0.653
##                    
##                     active high value active low value new active
##   inactive                      0.005            0.033      0.000
##   cold                          0.010            0.093      0.000
##   warm high value               0.315            0.009      0.000
##   warm low value                0.001            0.278      0.000
##   new warm                      0.012            0.077      0.000
##   active high value             0.745            0.004      0.000
##   active low value              0.007            0.693      0.000
##   new active                    0.062            0.285      0.000
  • 96% of customers inactive in 2014 remained so in 2015.

  • 91% of customers classes as “new warm” in 2014 were allowed to go “cold” in 2015.

  • Of greatest concern, 65 of customers acquired in 2014 purchased nothing in 2015.

The previous 3 documents in this series propose machine learning solutions that could inform marketing efforts to staunch the high attrition rates observed in the transition matrix.

6.2 Predict customer transitions

We shall use the transition matrix to predict customer transitions over the next 10 years.

Initialize a matrix with the number of customers in each segment today; project forward 10 additional periods

segment_counts <- matrix(nrow = 8, ncol = 11)

colnames(segment_counts) <- 2015:2025

Populate the matrix

segment_counts[, 1] <- table(customers_2015$segment)

row.names(segment_counts) <- levels(customers_2015$segment)
## Compute for 10 future periods

for (i in 2:11) {
   segment_counts[, i] = segment_counts[, i-1] %*% transition
}

The table shows customers’ evolution through the segments across 10 periods

print(round(segment_counts))
##                   2015  2016  2017  2018  2019  2020  2021  2022  2023
## inactive          9158 10517 11539 12636 12940 13186 13386 13542 13664
## cold              1903  1584  1711   874   821   782   740   709   684
## warm high value    119   144   165   160   156   152   149   146   143
## warm low value     901   991  1058   989   938   884   844   813   789
## new warm           938   987     0     0     0     0     0     0     0
## active high value  573   657   639   624   607   593   581   571   562
## active low value  3313  3537  3305  3134  2954  2820  2717  2637  2575
## new active        1512     0     0     0     0     0     0     0     0
##                    2024  2025
## inactive          13759 13834
## cold                665   650
## warm high value     141   139
## warm low value      771   756
## new warm              0     0
## active high value   554   547
## active low value   2527  2490
## new active            0     0
  • Numbers of inactive customers are predicted to grow throughout the period although the trend slows down significantly over time.

  • Numbers of active high value customers fall throughout the period. A sharp decline is noticeable after 2017. The trend levels off beyond that point.

barplot(segment_counts[1, ], xlab = "Year", main = "Inactive Customers", 
        col = "steelblue")
barplot(segment_counts[2, ], xlab = "Year", main = "Active high value customers", 
        col = "steelblue")

7. Lifetime Value

7.1 Revenue per Segment

Compute revenue generated by each customer in 2015

revenue_2015 <- customer_data %>%
                  filter(year_of_purchase == 2015) %>%
                  group_by(customer_id) %>%
                  summarise(revenue_2015 = sum(purchase_amount))

Add 2015 revenue to 2015 customer data

merged_2015 <- left_join(customers_2015, revenue_2015, by = "customer_id")

## Convert missing values to 0
merged_2015$revenue_2015[is.na(merged_2015$revenue_2015)] <- 0

Average revenue by segment in 2015

revenue2015_segments <- merged_2015 %>%
                         group_by(segment) %>%
                         summarise("average_revenue" = mean(revenue_2015))

print(revenue2015_segments)
## Source: local data frame [8 x 2]
## 
##             segment average_revenue
##              (fctr)           (dbl)
## 1          inactive         0.00000
## 2              cold         0.00000
## 3   warm high value         0.00000
## 4    warm low value         0.00000
## 5          new warm         0.00000
## 6 active high value       323.56894
## 7  active low value        52.30604
## 8        new active        79.16614

10 year projected annual revenue per segment.

revenue_segments <- revenue2015_segments$average_revenue * segment_counts

print(revenue_segments)
##                       2015     2016     2017     2018     2019     2020
## inactive               0.0      0.0      0.0      0.0      0.0      0.0
## cold                   0.0      0.0      0.0      0.0      0.0      0.0
## warm high value        0.0      0.0      0.0      0.0      0.0      0.0
## warm low value         0.0      0.0      0.0      0.0      0.0      0.0
## new warm               0.0      0.0      0.0      0.0      0.0      0.0
## active high value 185405.0 212495.1 206634.1 202008.5 196555.0 191967.1
## active low value  173289.9 184985.5 172891.5 163912.8 154516.6 147520.4
## new active        119699.2      0.0      0.0      0.0      0.0      0.0
##                       2021     2022     2023     2024     2025
## inactive               0.0      0.0      0.0      0.0      0.0
## cold                   0.0      0.0      0.0      0.0      0.0
## warm high value        0.0      0.0      0.0      0.0      0.0
## warm low value         0.0      0.0      0.0      0.0      0.0
## new warm               0.0      0.0      0.0      0.0      0.0
## active high value 188022.1 184626.9 181716.8 179228.5 177105.5
## active low value  142116.9 137930.2 134692.5 132189.2 130254.5
## new active             0.0      0.0      0.0      0.0      0.0

10 year projected annual revenue.

revenue_projected <- colSums(revenue_segments)

print(round(revenue_projected))
##   2015   2016   2017   2018   2019   2020   2021   2022   2023   2024 
## 478394 397481 379526 365921 351072 339488 330139 322557 316409 311418 
##   2025 
## 307360

Revenues are projected to fall over time. Unsurprising, as active customer numbers are predicted to decline over the same period.

barplot(revenue_projected, xlab = "Year", ylab = "Revenue", 
        main = "Revenue Predictions", col = "steelblue")

Cummulated revenue

revenue_cumulated <- cumsum(revenue_projected)

print(round(revenue_cumulated))
##    2015    2016    2017    2018    2019    2020    2021    2022    2023 
##  478394  875875 1255400 1621322 1972393 2311881 2642020 2964577 3280986 
##    2024    2025 
## 3592404 3899764

7.2 Discounted Revenue

Create a discount factor

In order obtain the present value of a stream of future cashflows, we must discount by an appropriate discount factor to account for the time value of money and other factors. This is a finance concept best discussed elsewhere.

The 10% discount rate is subjective.

discount_rate <- 0.10
discount = 1 / ((1 + discount_rate) ^ ((1:11) - 1))
print(round(discount,2))
##  [1] 1.00 0.91 0.83 0.75 0.68 0.62 0.56 0.51 0.47 0.42 0.39

Compute discounted annual revenue

disc_annual_revenue <- revenue_projected * discount
print(round(disc_annual_revenue))
##   2015   2016   2017   2018   2019   2020   2021   2022   2023   2024 
## 478394 361346 313658 274922 239787 210795 186355 165523 147607 132071 
##   2025 
## 118501

Plot discounted revenue

barplot(revenue_projected, xlab = "Year", ylab = "Revenue", 
        main = "Predicted Revenue", col = "steelblue")

barplot(disc_annual_revenue, xlab = "Year", ylab = "Revenue", 
        main = "Predicted Revenue\n (Discounted)", col = "steelblue")

Compute discounted cumulative revenue

disc_cumulative_revenue <- cumsum(disc_annual_revenue)
print(round(disc_cumulative_revenue))
##    2015    2016    2017    2018    2019    2020    2021    2022    2023 
##  478394  839740 1153398 1428320 1668106 1878901 2065256 2230779 2378386 
##    2024    2025 
## 2510458 2628958
barplot(disc_cumulative_revenue, xlab = "Year", ylab = "$", 
        main = "Discounted Cumulative Revenue", col = "steelblue")

7.3 Database Lifetime Value

Present value of the customer database

print(disc_cumulative_revenue[11] - revenue_projected[1])
##    2025 
## 2150564

2,150,564

The code can easily be modified to compute the lifetime value of individual customers.

Note: We computed the lifetime value of all customers currently in the database. No account is taken of future customer acquisitions.