
1 Introduction

A couple of months back, I reached out to a few ecommerce companies to understand how they use customer lifetime value (CLV) today. Here is what I noted.

1.1 Problem

B2C businesses struggle to capture and project customer purchase patterns in a typical non-contractual setting. As a result, they find it hard to understand the actual value of their customers and are left grappling with questions like the following:

  • How much should they spend in acquiring a new customer for their product?

  • How do different segments of customers interact with a product?

  • How well does their product actually do in the market?

  • How can they drive the demand for their products higher and maximize returns from each customer?

There are tools that help drive customer engagement, but understanding a customer is still a challenge.

1.2 Solution

Correctly estimating the lifetime value of each customer holds the key.

  • Knowing the right value of a new customer helps determine how much you should be spending on customer acquisition

  • Targeting customers of different value helps allocate the marketing budget smartly to increase customer engagement and stickiness

  • Figuring out how frequently a particular customer transacts, and when they will transact next, helps predict demand for different products among different customer segments

1.3 What is CLV?

Customer Lifetime Value is the present value of the future cash flows attributed to a customer over a given time period. CLV quantifies the revenue each customer generates in a fixed interval of time. There are a few advanced statistical methods to describe and predict a customer's purchase behavior in non-contractual settings. By fitting probabilistic models to historical transaction records, one can compute customer-centric metrics of managerial interest.
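As a minimal illustration of the present-value idea, here is a small R helper that discounts a vector of expected per-period cash flows (the cash flows and discount rate below are made up purely for illustration):

# Hypothetical example: CLV as the discounted sum of expected cash flows
clv <- function(cashflows, rate) {
  sum(cashflows / (1 + rate)^(seq_along(cashflows)))
}
clv(cashflows = c(100, 80, 60, 40), rate = 0.1)  # ~229.4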

2 CLV Analysis on UCI online retail store data

The UCI online-retail dataset is a transnational data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. I wanted to analyze the lifetime value of each customer to predict when and how much a customer would buy next.

library(dplyr)
library(BTYD)
library(ggplot2)

2.1 Attribute information

#Read the data 
ucl_data = read.csv("~/Downloads/Online_Retail.csv",stringsAsFactors = F)
#Fix the date 
ucl_data = ucl_data %>% mutate(InvoiceDate = as.Date(as.POSIXct(InvoiceDate,format = "%m/%d/%Y %H:%M")))
str(ucl_data)
## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Date, format: "10-12-01" "10-12-01" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.

2.2 Data Cleaning and feature extraction

  • A few products have negative quantities associated with them; we exclude these from our analysis.
  • Get Recency, First purchase and last purchase made by customers and the total value of each transaction (QTY*PRICE)
  • Identify repeat customers
#1. Get Recency, First purchase and last purchase made by customers and the total value of each transaction (QTY*PRICE)
max_InvoiceDate = max(ucl_data$InvoiceDate)
# today = max_InvoiceDate+2
ucl_data = ucl_data %>%
  group_by(CustomerID) %>%
  mutate(Recency = max_InvoiceDate - max(InvoiceDate),
         First_purchase = max_InvoiceDate - min(InvoiceDate),
         Value = Quantity * UnitPrice,
         Frequency = n_distinct(InvoiceNo)) %>%
  arrange(InvoiceDate)

# 2. Filter out Qty < 0 
neg = ucl_data %>% filter(Quantity<0)
ucl_data = ucl_data %>% filter(Quantity>=0)

# 3. Get customers with repeat transactions
ucl_data = ucl_data %>%
  group_by(CustomerID) %>%
  mutate(repeat_customers = ifelse(n_distinct(InvoiceNo) > 1, n_distinct(InvoiceNo), 0))

3 Data Analysis

3.1 Summarizing the data

ucl_summary = ucl_data %>%
  group_by(CustomerID) %>%
  summarize(Last_purchase = max(InvoiceDate),
            First_purchase = min(InvoiceDate),
            Frequency = n_distinct(InvoiceNo),
            Sales = mean(Value),
            repeat_customers = ifelse(mean(repeat_customers) > 0, 1, 0))


ucl_summary = ucl_summary %>% mutate(time_between = Last_purchase-First_purchase, recency = abs(max_InvoiceDate - Last_purchase))
head(ucl_summary)
## # A tibble: 6 x 8
##   CustomerID Last_purchase First_purchase Frequency  Sales repeat_customers
##        <int> <date>        <date>             <int>  <dbl>            <dbl>
## 1      12346 11-01-18      11-01-18               1 7.72e4                0
## 2      12347 11-12-06      10-12-06               7 2.37e1                1
## 3      12348 11-09-24      10-12-16               4 5.80e1                1
## 4      12349 11-11-21      11-11-21               1 2.41e1                0
## 5      12350 11-02-01      11-02-01               1 1.97e1                0
## 6      12352 11-11-02      11-02-16               8 2.95e1                1
## # … with 2 more variables: time_between <drtn>, recency <drtn>
print(paste("Total number of customers transacted in the given period = ", nrow(ucl_summary)))
## [1] "Total number of customers transacted in the given period =  4340"

3.2 Proportion of repeat customers

recency_prop = 100* table(as.factor(ucl_summary$repeat_customers))/sum(table(as.factor(ucl_summary$repeat_customers)))
print(recency_prop)
## 
##        0        1 
## 34.42396 65.57604

~66% of customers have transacted more than once

3.3 Sales plot

# make log plot and plot sales

ggplot(ucl_data, aes(x=InvoiceDate,y=Value,group=CustomerID))+
    geom_line(alpha=0.1)+
    scale_x_date()+
    scale_y_log10()+
    ggtitle("Sales for individual customers")+
    ylab("Sales ($, US)")+xlab("")+
    theme_minimal()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis

This plot shows sales per customer on a log scale. The median amount spent by a customer is $9.90 and the mean is $220.
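These summary figures can be checked directly from the per-line-item Value column (a quick supplementary check, not part of the original code):

# Quick check of the quoted median and mean spend
median(ucl_data$Value)
mean(ucl_data$Value)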

3.4 Days between orders

# look at days between orders
# model describes rates via a gamma distribution across customers
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
purchaseFreq <- ddply(ucl_data %>% filter(Frequency>1), .(CustomerID), summarize, 
     daysBetween = as.numeric(diff(InvoiceDate)))
detach(package:plyr)

library(grDevices)
library(dplyr)
ggplot(purchaseFreq %>% filter( daysBetween>0),aes(x=daysBetween))+
    geom_histogram(fill="orange")+
    xlab("Time between purchases (days)")+
    theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that most repeat purchases happen within a short gap; as the time between purchases grows, the number of repeat transactions falls off.
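The code comment above notes that the BTYD models describe purchase rates via a gamma distribution across customers. As a purely illustrative check (not part of the original analysis), one could fit a gamma distribution directly to the positive inter-purchase times:

# Illustrative only: fit a gamma distribution to inter-purchase times
# (note that MASS masks dplyr::select, so detach it afterwards if needed)
library(MASS)
fitdistr(purchaseFreq$daysBetween[purchaseFreq$daysBetween > 0], "gamma")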

3.5 Time gap vs Frequency

colnames(ucl_summary)
## [1] "CustomerID"       "Last_purchase"    "First_purchase"  
## [4] "Frequency"        "Sales"            "repeat_customers"
## [7] "time_between"     "recency"
ggplot(ucl_summary %>% filter(Frequency>1), aes(x = time_between, y = Frequency, color = as.factor(repeat_customers))) + geom_point() + ylim(0,50)
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 16 rows containing missing values (geom_point).

3.5.1 Observation

Customers with a large time gap between their first and last transactions also tend to have a high mean purchase frequency.
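As a quick supplementary check of this claim (again, not part of the original analysis), one could compute the correlation between tenure and frequency for repeat customers:

# Correlation between customer tenure and purchase frequency
with(ucl_summary %>% filter(Frequency > 1),
     cor(as.numeric(time_between), Frequency))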

4 PnBD and BGnBD Models

## Convert data to an RFM matrix for the PnBD and BGnBD models and split it into training (calibration) and test (holdout) sets
source("model_data_prep.R")
## Parameter Estimates: PnBD model: 
params = pnbd.EstimateParameters(cal.cbs)
params
LL = pnbd.cbs.LL(params,cal.cbs)
LL
#Run multiple iterations to get optimal parameters 
p.matrix = c(params,LL)
for(i in 1:10){
  params = pnbd.EstimateParameters(cal.cbs,params)
  LL = pnbd.cbs.LL(params,cal.cbs)
  p.matrix.row = c(params,LL)
  p.matrix = rbind(p.matrix,p.matrix.row)
}
colnames(p.matrix) <- c("r","alpha","s","beta","LL")
rownames(p.matrix) <- 1:nrow(p.matrix)
p.matrix
# params = p.matrix[3,]
## Estimating BGnBD model parameters: estimated by maximising the log-likelihood
# starting params = (1,1,1,2)
params_bgnbd = bgnbd.EstimateParameters(cal.cbs,par.start = c(1,1,1,2))
params_bgnbd
LL = bgnbd.cbs.LL(params_bgnbd,cal.cbs)
LL
#Run multiple iterations to get optimal parameters 
p.matrix_bgnd = c(params_bgnbd,LL)
for(i in 1:10){
  params_bgnbd = bgnbd.EstimateParameters(cal.cbs,params_bgnbd)
  LL = bgnbd.cbs.LL(params_bgnbd,cal.cbs)
  p.matrix_bgnd.row = c(params_bgnbd,LL)
  p.matrix_bgnd = rbind(p.matrix_bgnd,p.matrix_bgnd.row)
}
colnames(p.matrix_bgnd) <- c("r","alpha","a","beta","LL")
rownames(p.matrix_bgnd) <- 1:nrow(p.matrix_bgnd)
p.matrix_bgnd
# params_bgnbd = p.matrix_bgnd[8,]
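model_data_prep.R, sourced at the top of this section, is not shown here. Below is a minimal sketch of what it likely does, assuming it uses BTYD's event-log helpers to build the calibration customer-by-sufficient-statistic (CBS) matrix; the column mapping and cutoff date are hypothetical, not the actual values:

# Hypothetical sketch of model_data_prep.R (illustrative assumptions only)
elog = ucl_data %>%
  ungroup() %>%
  filter(!is.na(CustomerID)) %>%
  transmute(cust = CustomerID, date = InvoiceDate, sales = Value)
elog = dc.MergeTransactionsOnSameDate(elog)  # one row per customer-day
threshold_date = as.Date("2011-06-01")       # hypothetical calibration cutoff
split_data = dc.ElogToCbsCbt(elog, per = "day",
                             T.cal = threshold_date,
                             T.tot = max(elog$date))
cal.cbs = split_data[["cal"]][["cbs"]]       # columns: x, t.x, T.cal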

4.1 PnBD

The Pareto/Negative Binomial Distribution (Pareto/NBD) model is perhaps the best-known and most frequently applied probabilistic model in the non-contractual context. It focuses only on the purchase count and the lifetime of a customer.
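For reference, the standard Pareto/NBD assumptions, which the four estimated parameters (r, alpha, s, beta) correspond to, are:

  • While active, a customer makes purchases according to a Poisson process with rate λ, and λ ~ Gamma(r, α) across customers.
  • Each customer remains active for an exponentially distributed lifetime with dropout rate μ, and μ ~ Gamma(s, β) across customers.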

pnbd.PlotTransactionRateHeterogeneity(params) ## gamma distribution

The plot above shows that customers are more likely to have low values of the individual Poisson transaction-rate parameter; only a few customers have a high transaction rate. This confirms the heterogeneity across customer transactions: repeat customers buy around their own mean purchase interval, and the time gap between purchases differs from customer to customer.

pnbd.PlotDropoutRateHeterogeneity(params,lim = 0.5) ## gamma mixing distribution of Pareto dropout process

The above plot shows the distribution of customers’ propensities to drop out. Only a few customers have high dropout rates.

4.2 BGnBD

Why the BGnBD model? The Beta-Geometric/Negative Binomial Distribution model can be used to determine the expected repeat visits of customers, and hence a customer's lifetime value. It can also be used to determine whether a customer has churned or is likely to churn soon.
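For reference, the standard BG/NBD assumptions behind the four estimated parameters are:

  • While active, a customer makes purchases according to a Poisson process with rate λ, and λ ~ Gamma(r, α) across customers.
  • After each purchase, a customer becomes inactive with probability p, and p ~ Beta(a, b) across customers.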

bgnbd.PlotTransactionRateHeterogeneity(params_bgnbd) ## gamma distribution


As with the Pareto/NBD model, the plot above shows that customers are more likely to have low values of the individual Poisson transaction-rate parameter, and only a few customers have a high transaction rate. This again shows the heterogeneity across customer transactions: repeat customers buy around their own mean purchase interval, and the time gap between purchases differs from customer to customer. No customer keeps buying forever, and the dropout time varies across customers.

bgnbd.PlotDropoutRateHeterogeneity(params_bgnbd) ## gamma mixing distribution of BGnBD dropout process

## (printed return value omitted: a 2 x 100 matrix of x-axis ticks and heterogeneity density values)

The above plot shows the distribution of customers’ propensities to drop out. Only a few customers have high dropout rates, and there is heterogeneity in dropout as well: no customer keeps buying forever, and the dropout time varies from customer to customer. This is to be expected in the transactional setting of an online retail store.

4.3 Individual Level Estimates

  1. Number of repeat transactions a new customer would make in a given time period t:
for(t in seq(40,100,by=10)){
print(paste("Expected number of repeat transaction by a new customer in",t,"days =" ,pnbd.Expectation(params,t = t)))
}
## [1] "Expected number of repeat transaction by a new customer in 40 days = 0.481301525540638"
## [1] "Expected number of repeat transaction by a new customer in 50 days = 0.601626885840017"
## [1] "Expected number of repeat transaction by a new customer in 60 days = 0.721952240729078"
## [1] "Expected number of repeat transaction by a new customer in 70 days = 0.842277590938877"
## [1] "Expected number of repeat transaction by a new customer in 80 days = 0.962602937026013"
## [1] "Expected number of repeat transaction by a new customer in 90 days = 1.08292827942852"
## [1] "Expected number of repeat transaction by a new customer in 100 days = 1.20325361850016"

This shows that a new customer is expected to make about one repeat transaction within roughly 90 days (3 months).

4.3.1 For a specific customer: 12662

cust.12662 <- cal.cbs["12662",]
x = cust.12662["x"]
t.x = cust.12662["t.x"]
T.cal = cust.12662["T.cal"]

for(t in seq(40,100,by=10)){
  print(paste("Expected number of repeat transaction by customer 12662 in",t,"days =" ,pnbd.ConditionalExpectedTransactions(params,T.star = t,x,t.x,T.cal)))
}
## [1] "Expected number of repeat transaction by customer 12662 in 40 days = 0.745276650336834"
## [1] "Expected number of repeat transaction by customer 12662 in 50 days = 0.931595807244443"
## [1] "Expected number of repeat transaction by customer 12662 in 60 days = 1.11791496206169"
## [1] "Expected number of repeat transaction by customer 12662 in 70 days = 1.30423411486676"
## [1] "Expected number of repeat transaction by customer 12662 in 80 days = 1.4905532657322"
## [1] "Expected number of repeat transaction by customer 12662 in 90 days = 1.6768724147255"
## [1] "Expected number of repeat transaction by customer 12662 in 100 days = 1.86319156190963"

Customer 12662 is expected to transact at least once more within about 2 months.
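As a supplementary check (not in the original analysis), BTYD can also estimate the probability that this customer was still alive at the end of the calibration period:

# P(customer 12662 is still alive at the end of the calibration period)
pnbd.PAlive(params, x, t.x, T.cal)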

5 Model Performance

5.1 Model vs actual: Training Period

5.1.1 PnBD

pnbd.PlotFrequencyInCalibration(params,cal.cbs,3)

##                freq.0   freq.1   freq.2  freq.3+
## n.x.actual   1394.000 588.0000 313.0000 521.0000
## n.x.expected 1383.694 568.0825 309.4955 554.7278

This plot shows the actual and expected numbers of customers who made repeat transactions in the calibration period, binned by frequency.

5.1.2 BGnBD

bgnbd.PlotFrequencyInCalibration(params_bgnbd,cal.cbs,3)

##                freq.0   freq.1   freq.2  freq.3+
## n.x.actual   1394.000 588.0000 313.0000 521.0000
## n.x.expected 1383.724 568.0642 309.4836 549.7278

This plot shows the actual and expected numbers of customers who made repeat transactions in the calibration period, binned by frequency. The difference between the PnBD and BGnBD models is small in this case.

5.2 Model vs actual: Test (Holdout) Period

The frequency plots above show that the models perform reasonably well in the calibration period. But that alone does not mean the models are good. To test their performance, it is important to see how they do in the holdout/test period.

5.2.1 PnBD

source("data_prep_holdout.R")
T.star =  as.numeric(max(ucl_log_sum$date)-threshold_date+1)
censor = 11
x.star = cal.cbs2[,"x.star"]
comp = pnbd.PlotFreqVsConditionalExpectedFrequency(params,T.star = T.star,cal.cbs=cal.cbs2,x.star,censor)

The PnBD model does well in the holdout period; however, the deviations become larger towards the end.

5.2.2 BGnBD

T.star =  as.numeric(max(ucl_log_sum$date)-threshold_date+1)
censor = 11
x.star = cal.cbs2[,"x.star"]
comp = bgnbd.PlotFreqVsConditionalExpectedFrequency(params_bgnbd,T.star = T.star,cal.cbs=cal.cbs2,x.star,censor)

The BGnBD model also estimates the holdout frequencies well.

6 Comparing cumulative actual sales with predicted transactions

Next, to check whether the models capture the trend of transactions, we test how they perform against cumulative sales in the holdout period. This gives an idea of how well the models can predict customer behavior in the future.

#Get total daily sales 
d.track = rep(0,period)
origin = min(ucl_data$InvoiceDate)
for(i in colnames(tot.cbt)){
  date.index = (as.numeric(difftime(as.Date(i),origin))+1)
  d.track[date.index] = sum(tot.cbt[,i])
}

# Get mean weekly actual sales 
w.track = rep(0,round(period/7))
for (j in 1:length(w.track)) {
w.track[j] =  mean(d.track[(j*7-6):(j*7)]) 
}

T.cal =(cal.cbs2[,"T.cal"])
T.tot = as.numeric(round((range(ucl_data$InvoiceDate)[2]-range(ucl_data$InvoiceDate)[1])))

6.1 PnBD

## Cumulative daily tracking 
cum.track = cumsum(d.track)
cum.tracking = pnbd.PlotTrackingCum(params,T.cal,T.tot,cum.track, title = "Tracking daily cumulative transactions",xlab = "Days")

## Cumulative weekly tracking 
cum.track.week = cumsum(w.track)
cum.tracking.week = pnbd.PlotTrackingCum(params,T.cal,T.tot/7,cum.track.week)

PnBD captures the trend in cumulative sales reasonably well.

6.2 BGnBD

## Cumulative daily tracking 
cum.track = cumsum(d.track)
cum.tracking = bgnbd.PlotTrackingCum(params_bgnbd,T.cal,T.tot,cum.track, title = "Tracking daily cumulative transactions",xlab = "Days")

## Cumulative weekly tracking 
cum.track.week = cumsum(w.track)
cum.tracking.week = bgnbd.PlotTrackingCum(params_bgnbd,T.cal,T.tot/7,cum.track.week)

Again, the difference between the two models on this dataset is insignificant.

7 Weekly and Daily transactional forecasts

While it was useful to check the models against cumulative sales, we can also test the models' performance against daily and weekly transactions. To do so, we first extract the actual daily and weekly values in the holdout period and then compare them with the expected values.

7.1 Daily forecasts

Both the models capture the trend in daily sales.

7.1.1 PnBD

inc.tracking.daily = pnbd.PlotTrackingInc(params,T.cal=T.cal,T.tot,actual.inc.tracking.data = d.track,title = "Tracking daily transactions",xlab = "Days")

inc.tracking.daily[,20:30]

7.1.2 BGnBD

inc.tracking.daily = bgnbd.PlotTrackingInc(params_bgnbd,T.cal=T.cal,T.tot,actual.inc.tracking.data = d.track, title = "Tracking daily transactions",xlab = "Days")

inc.tracking.daily[,20:30]

7.2 Weekly forecasts

Both the PnBD and BGnBD models capture the trend of weekly transactions well. The difference between the two models is not significant here, as more than 60% of the customers are repeat customers, which is very high. While both models capture the trend and the average customer value well, they do not capture the effects of seasonality and marketing campaigns.

7.2.1 PnBD

inc.tracking.weekly = pnbd.PlotTrackingInc(params,T.cal=T.cal,T.tot/7,actual.inc.tracking.data = w.track)

inc.tracking.weekly[,1:10]
##              [,1]      [,2]      [,3]      [,4]      [,5]     [,6]
## actual   4.142857 17.714286 21.571429  3.428571  1.428571 19.71429
## expected 2.438375  5.579396  7.625722 11.047591 14.164283 15.51589
##              [,7]     [,8]     [,9]    [,10]
## actual   18.71429 15.57143 21.71429 18.28571
## expected 17.16052 19.30019 20.65141 21.08881

7.2.2 BGnBD

inc.tracking.weekly = bgnbd.PlotTrackingInc(params_bgnbd,T.cal=T.cal,T.tot/7,actual.inc.tracking.data = w.track)

inc.tracking.weekly[,1:10]

8 Deep Neural Network

To include demographics and seasonality, I train a deep neural network with 3 hidden layers (64, 32, 16) and test how the model performs in the holdout period. I have stored the results of the DNN on the test set in test_results.csv.
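The training code itself is not shown here (it lives in the GitHub repo linked in the conclusion). Below is a hypothetical sketch of the architecture described above, using the keras interface for R; the feature count n_features is a placeholder, and the actual model may differ:

# Hypothetical sketch of the 64-32-16 DNN (not the actual training code)
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(n_features)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1)  # predicted sale amount
model %>% compile(optimizer = "adam", loss = "mse")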

holdout_nn = read.csv("test_results.csv",stringsAsFactors = F)
holdout_nn = holdout_nn %>% mutate(InvoiceDate=as.Date(InvoiceDate))
ggplot(data = holdout_nn,aes(x=InvoiceDate,y=actual_sale))+geom_line()+geom_line(aes(y=pred_sale),color="red")+ylim(0,100)
## Warning: Removed 617 rows containing missing values (geom_path).

This looks OK, but the plot is not very clear because of the volume of data.

Next we look at the mean of the predicted sales to see if the trends are captured.

#detach(package:plyr)
library(ggplot2)
library(dplyr)
holdout_nn2 = holdout_nn %>% group_by(InvoiceDate) %>% summarize(pred_sale=mean(pred_sale),actual_sale=mean(actual_sale))
ggplot(data = holdout_nn2,aes(x=InvoiceDate,y=pred_sale))+geom_line(color="blue")+geom_line(aes(y=actual_sale))+ylim(15,26)
## Warning: Removed 1 rows containing missing values (geom_path).

# print(unique(holdout_nn$pred_sale))

The mean captures the ups and downs, but the model can be improved further by adding engagement metrics and tuning the network.

9 Next Steps

As future work, I would take up the following:

  • Fine-tune the DNN model
  • Include more engagement features for the DNN
  • Include attribution analysis
  • Perform per-product analysis to forecast customer demand for each product

10 Conclusion

We saw how to calculate CLV for an online retail store. CLV helps in understanding the value of each customer, and driving managerial decisions based on CLV is a step towards becoming more customer-oriented and growing faster.

You can find the consolidated code for DNN and BTYD models at https://github.com/ragav208/uci_clv
