Case II: CLV at Crutchfield

Please download and load data “CLV.Rdata” from Canvas.

load("CLV.RData")
head(CLV_data)

## # A tibble: 6 x 4
##      ID     x   t_x     T
##   <int> <dbl> <dbl> <dbl>
## 1     1     1  3.86  29  
## 2     2     2 13.3   27.1
## 3     3     5 29.1   31.3
## 4     4     0  0     38.4
## 5     5     0  0     32  
## 6     6     0  0     30.6

The three variables of the data are defined as following:
* \(x\): the frequency of purchases
* \(t_x\): the recency, i.e., the time the most recent purchase
* \(T\): the end time for the observations

Two functions: lik_bgnbd and pred_Y

The first function “lik_bgnbd” is the log-likelihood function. The A1-A4 terms defined in the function are the logged terms of A1-A4 in Fader et al. (2005), p. 279.
The function requires data “CLV_data”. To use this function, please make sure you have loaded “CLV_data”, a data.frame containing three variables: \(x\), \(t_x\), and \(T\) as defined above.
See the comments within the function for more details.

lik_bgnbd

## function(para) {
## # This is the likelihood function with CLV_data 
## # Change CLV_data to the names your datasets; the variables should be the same.
##     
##   # unpack 4 parameters: 
##   # r and alpha are the parameters of the gamma distribution of lambda 
##   # a and b are the parameters of the beta distribution ofr the "death" rate
##   r <- para[1]
##   alpha <- para[2]
##   a <- para[3]
##   b <- para[4]
##   
##   # unpack three variables: 
##   # x - # of transactions (frequency)
##   # t_x - the most recent time of transaction (recency)
##   # T - the end time for the Poisson process
##   x <- CLV_data$x
##   t_x <- CLV_data$t_x
##   T <- CLV_data$T
##   
##   # A1 - A4 corresponds to the log of 4 terms in the likelihood function
##   # See p.280 of Fader et al. (2005)
##   # They are the log of A1-A4 for the log-likelihood  
##   A1 <- lgamma(r+x) + r*log(alpha) - lgamma(r)
##   A2 <- lgamma(a+b) + lgamma(b+x) - lgamma(b) - lgamma(a+b+x)
##   A3 <- -(r+x)*log(alpha+T)
##   
##   A4 <- rep(0,length(x))
##   idx <- x>0
##   if (sum(idx)>0) {
##     A4[idx] <- log(a/(b+x[idx]-1)) - (r+x[idx])*log(alpha+t_x[idx])
##   }
##   
##   lh <- sum(A1+A2+log(exp(A3)+idx*exp(A4)))
##   
##   # To return the minus of likelihood for minimization
##   # maximize likelihood = minimize negative likelihood
##   return(-lh)
##   
## }
## <bytecode: 0x0000000014b8d8d0>

The second function “pred_Y” is the function of prediction future # of transactions based on the estimated parameters from “lik_bgnbd”.
This function follows Equation (10) of Fader et al. (2005), p.279.
The function requires a package “hypergeo” for the evaluation of the hypergeometeric function 2F1(.).
As evaluation of 2F1(.) is somewhat costly, this function uses input from only 1 customer. You can extend the function to calculate for many customers by iterations or use lapply(.).
Please see the comments within the function for more details.

pred_Y

## function(para,H,t) {
## # Given estimated parameters, for a customer with history {x,t_x,T}, to predict her # of transactions in t time.
##   # Please see Equation (10) of Fader et al. (2005), p.279;
##   # This function uses a package "hypergeo" for function evaluation;
##   # As evaluation of 2F1(.) is somewhat costly, this function is to accommodate only 1 customer;
##   # You can extend the function to calculate for many customers by iterations or use lapply(.);
## 
## # The function takes three input: 
##   # para - a vector of the 4 parameters; 
##   # H - a data.frame with only 1 row (for 1 customer) and three variables {x,t_x,T};
##   # t - how long into the future?
##   
##   # unpack 4 parameters: 
##   # r and alpha are the parameters of the gamma distribution of lambda 
##   # a and b are the parameters of the beta distribution of the "death" rate
##   r <- para[1]
##   alpha <- para[2]
##   a <- para[3]
##   b <- para[4]
##   
##   # unpack three variables: 
##   # x - # of transactions (frequency)
##   # t_x - the most recent time of transaction (recency)
##   # T - the end time for the Poisson process
##   x <- H$x
##   t_x <- H$t_x
##   T <- H$T
##   
##   # the term of the Gaussian hypergeometric function 2F1(.)
##   hg <- hypergeo::hypergeo(r+x,b+x,a+b+x-1,t/(alpha+T+t))
##   hg <- Re(hg) # hg is a real number in the complex format Re(.) makes it real. 
##   
##   # the numerator term, given hg 
##   Y1 <- (a+b+x-1)/(a-1)*(1-(((alpha+T)/(alpha+T+t))^(r+x))*hg)
##   Y2 <- 1 
##   if (x>0) {
##     Y2 <- Y2 + a/(b+x-1)*(((alpha+T)/(alpha+t_x))^(r+x))
##   }
##   
##   return(Y1/Y2)
## }

Estimating the Parameters of the BG/NBD model

For the estimation, we need to maximize the log likelihood (or minimize the negative log likelihood, as we have as the output of the “lik_bgnbd” function). To do so, we are using a non-linear solver built into R called “constrOptim(.)”. The function is in the R Base, and use “?constrOptim” to see more details.
One thing to take note is the all parameters \((r,\alpha)\) for the Gamma distribution and \((a,b)\) for the Beta distribution are positive (\(>0\)). We thus have to specify the constraints in “constrOptim” as “ui” and “ci” with: ui %*% parameters >= ci. For our purpose, we specify “ui” as a \(4\times4\) identity matrix, and “ci” as a length 4 all-zero vector.
\[ \left[ \begin{matrix} 1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ 0& 0& 0& 1\\ \end{matrix} \right] \times \left[ \begin{array}{c} r\\ \alpha\\ a\\ b\\ \end{array} \right] \geqslant \left[ \begin{array}{c} 0\\ 0\\ 0\\ 0\\ \end{array} \right] \]

# to specify the constraint matrix
ui <- diag(4)
ci <- rep(0,4)

# para: the starting values; a vector with 4 elements corresponding to (r,alpha,a,b). 
# you can change this to other values or test different values.
para <- rep(.01,4)

# to run the constrained optimization 
results <- constrOptim(para, lik_bgnbd, NULL, ui, ci)

# to get the estimated parameters in the order of (r,alpha,a,b)
para <- results$par
names(para) <- c("r","alpha","a","b")
para

##         r     alpha         a         b 
## 0.2171882 3.7234535 0.8349768 3.7701876

Predict the No. of Transactions

We use the function “pred_Y” to predict the no. of transactions for a single customer. You have to specify how long into the future you want to predict. Please see comments within the function “pred_Y” for more details.

# suppose we want to predict next year and set t = 52
t <- 52

# let's predict the first customer in the data
H <- CLV_data[1,]

# to obtain the prediction
Yhat <- pred_Y(para,H,t)
as.numeric(Yhat)

## [1] 0.6823722

LS0tDQp0aXRsZTogJ0Nhc2UgSUk6IENMViBhdCBDcnV0Y2hmaWVsZCcNCm91dHB1dDoNCiAgaHRtbF9kb2N1bWVudDoNCiAgICBkZl9wcmludDogdGliYmxlDQogICAgY29kZV9kb3dubG9hZDogeWVzDQogICAgdGhlbWU6IHNwYWNlbGFiDQogICAgdG9jOiB5ZXMNCiAgICB0b2NfZmxvYXQ6IHllcw0KICAgIG51bWJlcl9zZWN0aW9uczogbm8NCiAgcGRmX2RvY3VtZW50Og0KICAgIHRvYzogeWVzDQotLS0NCg0KIyMjICoqUGxlYXNlIGRvd25sb2FkIGFuZCBsb2FkIGRhdGEgIkNMVi5SZGF0YSIgZnJvbSBDYW52YXMuKiogDQoNCmBgYHtyfQ0KbG9hZCgiQ0xWLlJEYXRhIikNCmhlYWQoQ0xWX2RhdGEpDQpgYGANClwNClRoZSB0aHJlZSB2YXJpYWJsZXMgb2YgdGhlIGRhdGEgYXJlIGRlZmluZWQgYXMgZm9sbG93aW5nOg0KXA0KICogJHgkOiB0aGUgZnJlcXVlbmN5IG9mIHB1cmNoYXNlcw0KXA0KICogJHRfeCQ6IHRoZSByZWNlbmN5LCBpLmUuLCB0aGUgdGltZSB0aGUgbW9zdCByZWNlbnQgcHVyY2hhc2UNClwNCiAqICRUJDogdGhlIGVuZCB0aW1lIGZvciB0aGUgb2JzZXJ2YXRpb25zDQoNCg0KIyMjICoqVHdvIGZ1bmN0aW9uczogbGlrX2JnbmJkIGFuZCBwcmVkX1kqKg0KVGhlIGZpcnN0IGZ1bmN0aW9uICJsaWtfYmduYmQiIGlzIHRoZSBsb2ctbGlrZWxpaG9vZCBmdW5jdGlvbi4gVGhlIEExLUE0IHRlcm1zIGRlZmluZWQgaW4gdGhlIGZ1bmN0aW9uIGFyZSB0aGUgbG9nZ2VkIHRlcm1zIG9mIEExLUE0IGluIEZhZGVyIGV0IGFsLiAoMjAwNSksIHAuIDI3OS4gXA0KVGhlIGZ1bmN0aW9uIHJlcXVpcmVzIGRhdGEgIkNMVl9kYXRhIi4gVG8gdXNlIHRoaXMgZnVuY3Rpb24sIHBsZWFzZSBtYWtlIHN1cmUgeW91IGhhdmUgbG9hZGVkICJDTFZfZGF0YSIsIGEgZGF0YS5mcmFtZSBjb250YWluaW5nIHRocmVlIHZhcmlhYmxlczogJHgkLCAkdF94JCwgYW5kICRUJCBhcyBkZWZpbmVkIGFib3ZlLiBcDQpTZWUgdGhlIGNvbW1lbnRzIHdpdGhpbiB0aGUgZnVuY3Rpb24gZm9yIG1vcmUgZGV0YWlscy4gDQpgYGB7cn0NCmxpa19iZ25iZA0KYGBgDQpcDQpUaGUgc2Vjb25kIGZ1bmN0aW9uICJwcmVkX1kiIGlzIHRoZSBmdW5jdGlvbiBvZiBwcmVkaWN0aW9uIGZ1dHVyZSAjIG9mIHRyYW5zYWN0aW9ucyBiYXNlZCBvbiB0aGUgZXN0aW1hdGVkIHBhcmFtZXRlcnMgZnJvbSAibGlrX2JnbmJkIi4gXA0KVGhpcyBmdW5jdGlvbiBmb2xsb3dzIEVxdWF0aW9uICgxMCkgb2YgRmFkZXIgZXQgYWwuICgyMDA1KSwgcC4yNzkuIFwNClRoZSBmdW5jdGlvbiByZXF1aXJlcyBhIHBhY2thZ2UgImh5cGVyZ2VvIiBmb3IgdGhlIGV2YWx1YXRpb24gb2YgdGhlIGh5cGVyZ2VvbWV0ZXJpYyBmdW5jdGlvbiAyRjEoLikuXA0KQXMgZXZhbHVhdGlvbiBvZiAyRjEoLikgaXMgc29tZXdoYXQgY29zdGx5LCB0aGlzIGZ1bmN0aW9uIHVzZXMgaW5wdXQgZnJvbSBvbmx5IDEgY3VzdG9tZXIuIFlvdSBjYW4gZXh0ZW5kIHRoZSBmdW5jdGlvbiB0byBjYWxjdWxhdGUgZm9yIG1hbnkgY3VzdG9tZXJzIGJ5IGl0ZXJhdGlvbnMgb3IgdXNlIGxhcHBseSguKS5cDQpQbGVhc2Ugc2VlIHRoZSBjb21tZW50cyB3aXRoaW4gdGhlIGZ1bmN0aW9uIGZvciBtb3JlIGRldGFpbHMuDQpgYGB7cn0NCnByZWRfWQ0KYGBgDQoNCiMjIyAqKkVzdGltYXRpbmcgdGhlIFBhcmFtZXRlcnMgb2YgdGhlIEJHL05CRCBtb2RlbCoqDQpGb3IgdGhlIGVzdGltYXRpb24sIHdlIG5lZWQgdG8gbWF4aW1pemUgdGhlIGxvZyBsaWtlbGlob29kIChvciBtaW5pbWl6ZSB0aGUgbmVnYXRpdmUgbG9nIGxpa2VsaWhvb2QsIGFzIHdlIGhhdmUgYXMgdGhlIG91dHB1dCBvZiB0aGUgImxpa19iZ25iZCIgZnVuY3Rpb24pLiBUbyBkbyBzbywgd2UgYXJlIHVzaW5nIGEgbm9uLWxpbmVhciBzb2x2ZXIgYnVpbHQgaW50byBSIGNhbGxlZCAiY29uc3RyT3B0aW0oLikiLiBUaGUgZnVuY3Rpb24gaXMgaW4gdGhlIFIgQmFzZSwgYW5kIHVzZSAiP2NvbnN0ck9wdGltIiB0byBzZWUgbW9yZSBkZXRhaWxzLlwNCk9uZSB0aGluZyB0byB0YWtlIG5vdGUgaXMgdGhlIGFsbCBwYXJhbWV0ZXJzICQocixcYWxwaGEpJCBmb3IgdGhlIEdhbW1hIGRpc3RyaWJ1dGlvbiBhbmQgJChhLGIpJCBmb3IgdGhlIEJldGEgZGlzdHJpYnV0aW9uIGFyZSBwb3NpdGl2ZSAoJD4wJCkuIFdlIHRodXMgaGF2ZSB0byBzcGVjaWZ5IHRoZSBjb25zdHJhaW50cyBpbiAiY29uc3RyT3B0aW0iIGFzICJ1aSIgYW5kICJjaSIgd2l0aDogdWkgJSolIHBhcmFtZXRlcnMgPj0gY2kuIEZvciBvdXIgcHVycG9zZSwgd2Ugc3BlY2lmeSAidWkiIGFzIGEgJDRcdGltZXM0JCBpZGVudGl0eSBtYXRyaXgsIGFuZCAiY2kiIGFzIGEgbGVuZ3RoIDQgYWxsLXplcm8gdmVjdG9yLlwNCiQkDQpcbGVmdFsgXGJlZ2lue21hdHJpeH0NCgkxJgkJMCYJCTAmCQkwXFwNCgkwJgkJMSYJCTAmCQkwXFwNCgkwJgkJMCYJCTEmCQkwXFwNCgkwJgkJMCYJCTAmCQkxXFwNClxlbmR7bWF0cml4fSBccmlnaHRdIFx0aW1lcyBcbGVmdFsgXGJlZ2lue2FycmF5fXtjfQ0KCXJcXA0KCVxhbHBoYVxcDQoJYVxcDQoJYlxcDQpcZW5ke2FycmF5fSBccmlnaHRdIFxnZXFzbGFudCBcbGVmdFsgXGJlZ2lue2FycmF5fXtjfQ0KCTBcXA0KCTBcXA0KCTBcXA0KCTBcXA0KXGVuZHthcnJheX0gXHJpZ2h0XSANCiQkDQpgYGB7cn0NCiMgdG8gc3BlY2lmeSB0aGUgY29uc3RyYWludCBtYXRyaXgNCnVpIDwtIGRpYWcoNCkNCmNpIDwtIHJlcCgwLDQpDQoNCiMgcGFyYTogdGhlIHN0YXJ0aW5nIHZhbHVlczsgYSB2ZWN0b3Igd2l0aCA0IGVsZW1lbnRzIGNvcnJlc3BvbmRpbmcgdG8gKHIsYWxwaGEsYSxiKS4gDQojIHlvdSBjYW4gY2hhbmdlIHRoaXMgdG8gb3RoZXIgdmFsdWVzIG9yIHRlc3QgZGlmZmVyZW50IHZhbHVlcy4NCnBhcmEgPC0gcmVwKC4wMSw0KQ0KDQojIHRvIHJ1biB0aGUgY29uc3RyYWluZWQgb3B0aW1pemF0aW9uIA0KcmVzdWx0cyA8LSBjb25zdHJPcHRpbShwYXJhLCBsaWtfYmduYmQsIE5VTEwsIHVpLCBjaSkNCg0KIyB0byBnZXQgdGhlIGVzdGltYXRlZCBwYXJhbWV0ZXJzIGluIHRoZSBvcmRlciBvZiAocixhbHBoYSxhLGIpDQpwYXJhIDwtIHJlc3VsdHMkcGFyDQpuYW1lcyhwYXJhKSA8LSBjKCJyIiwiYWxwaGEiLCJhIiwiYiIpDQpwYXJhDQpgYGANCg0KIyMjICoqUHJlZGljdCB0aGUgTm8uIG9mIFRyYW5zYWN0aW9ucyoqDQpXZSB1c2UgdGhlIGZ1bmN0aW9uICJwcmVkX1kiIHRvIHByZWRpY3QgdGhlIG5vLiBvZiB0cmFuc2FjdGlvbnMgZm9yIGEgc2luZ2xlIGN1c3RvbWVyLiBZb3UgaGF2ZSB0byBzcGVjaWZ5IGhvdyBsb25nIGludG8gdGhlIGZ1dHVyZSB5b3Ugd2FudCB0byBwcmVkaWN0LiBQbGVhc2Ugc2VlIGNvbW1lbnRzIHdpdGhpbiB0aGUgZnVuY3Rpb24gInByZWRfWSIgZm9yIG1vcmUgZGV0YWlscy4gDQpgYGB7cn0NCiMgc3VwcG9zZSB3ZSB3YW50IHRvIHByZWRpY3QgbmV4dCB5ZWFyIGFuZCBzZXQgdCA9IDUyDQp0IDwtIDUyDQoNCiMgbGV0J3MgcHJlZGljdCB0aGUgZmlyc3QgY3VzdG9tZXIgaW4gdGhlIGRhdGENCkggPC0gQ0xWX2RhdGFbMSxdDQoNCiMgdG8gb2J0YWluIHRoZSBwcmVkaWN0aW9uDQpZaGF0IDwtIHByZWRfWShwYXJhLEgsdCkNCmFzLm51bWVyaWMoWWhhdCkNCmBgYA0KDQoNCg0K