Data Description

Please download and load data “CLV.Rdata” from Canvas. The data record three variables for the CLV analysis at Crutchfield. The three variables of the data are defined as following:

load("CLV.RData")
head(CLV_data)
##   ID x       t_x        T
## 1  1 1  3.857143 29.00000
## 2  2 2 13.285714 27.14286
## 3  3 5 29.142857 31.28571
## 4  4 0  0.000000 38.42857
## 5  5 0  0.000000 32.00000
## 6  6 0  0.000000 30.57143

Two Funcitions: lik_bgnbd and pred_Y

You may find two functions: lik_bgnbd and pred_Y along with the data. You can directly use these two functions for your CLV analysis and predictions. The details of these functions are explained in the comments within the functions.

The likelihood function lik_bgnbd

The lik_bgnbg is the log-likelihood function of the CLV model that we discussed in class. The function takes in one input a vector of parameters para in the order as this: \[para = (r,\alpha,a,b,)\] Where, \(r\) and \(\alpha\) are two parameters of the gamma distribution of the transaction rate \(\lambda\), and \(a\) and \(b\) are two parameters of the beta distribution of the “death” rate of consumers \(p\). The function output the log-likelihood of the CLV model.

The A1-A4 terms defined in the function are the logged terms of A1-A4 in Fader et al. (2005), p. 280. The function requires data “CLV_data.” To use this function, please make sure you have loaded “CLV_data,” a data.frame that contains at least three variables: \(x\), \(t_x\), and \(T\) as defined above. For the future use of this function, your data.frame must also contain the three variables \(x\), \(t_x\), and \(T\). Please see the comments within the function for more details.

lik_bgnbd
## function(para) {
## # This is the likelihood function with CLV_data 
## # Change CLV_data to the names your datasets; the variables should be the same.
##     
##   # unpack 4 parameters: 
##   # r and alpha are the parameters of the gamma distribution of lambda 
##   # a and b are the parameters of the beta distribution ofr the "death" rate
##   r <- para[1]
##   alpha <- para[2]
##   a <- para[3]
##   b <- para[4]
##   
##   # unpack three variables: 
##   # x - # of transactions (frequency)
##   # t_x - the most recent time of transaction (recency)
##   # T - the end time for the Poisson process
##   x <- CLV_data$x
##   t_x <- CLV_data$t_x
##   T <- CLV_data$T
##   
##   # A1 - A4 corresponds to the log of 4 terms in the likelihood function
##   # See p.280 of Fader et al. (2005)
##   # They are the log of A1-A4 for the log-likelihood  
##   A1 <- lgamma(r+x) + r*log(alpha) - lgamma(r)
##   A2 <- lgamma(a+b) + lgamma(b+x) - lgamma(b) - lgamma(a+b+x)
##   A3 <- -(r+x)*log(alpha+T)
##   
##   A4 <- rep(0,length(x))
##   idx <- x>0
##   if (sum(idx)>0) {
##     A4[idx] <- log(a/(b+x[idx]-1)) - (r+x[idx])*log(alpha+t_x[idx])
##   }
##   
##   lh <- sum(A1+A2+log(exp(A3)+idx*exp(A4)))
##   
##   # To return the minus of likelihood for minimization
##   # maximize likelihood = minimize negative likelihood
##   return(-lh)
##   
## }
## <bytecode: 0x000001c445ba53c8>

The prediction function pred_Y

The second function pred_Y is the function of prediction future # of transactions based on the estimated parameters from lik_bgnbd. This function follows Equation (10) of Fader et al. (2005), p.279. The function outputs the predicted no. of transactions during your specified time periods \(t\). The function takes in three inputs:

  • para: as defined in the previous subsection for lik_bgnbd
  • H: a data.frame with only 1 row (for 1 customer) and three variables \((x,t_x,T)\);
  • t - how long into the future you want to predict for the customer

Note that The function requires a package “hypergeo” for the evaluation of the hypergeometeric function 2F1(.). Please make sure you have installed the function in your R libraries. If not, please install the package beforehand with install.packages("hypergeo").

As evaluation of 2F1(.) is somewhat costly, this function uses input from only 1 customer. You can extend the function to calculate for many customers by iterations or use lapply(.) to apply the function to multiple customers. Please see the comments within the pred_Y function for more details.

pred_Y
## function(para,H,t) {
## # Given estimated parameters, for a customer with history {x,t_x,T}, to predict her # of transactions in t time.
##   # Please see Equation (10) of Fader et al. (2005), p.279;
##   # This function uses a package "hypergeo" for function evaluation;
##   # As evaluation of 2F1(.) is somewhat costly, this function is to accommodate only 1 customer;
##   # You can extend the function to calculate for many customers by iterations or use lapply(.);
## 
## # The function takes three input: 
##   # para - a vector of the 4 parameters; 
##   # H - a data.frame with only 1 row (for 1 customer) and three variables {x,t_x,T};
##   # t - how long into the future?
##   
##   # unpack 4 parameters: 
##   # r and alpha are the parameters of the gamma distribution of lambda 
##   # a and b are the parameters of the beta distribution of the "death" rate
##   r <- para[1]
##   alpha <- para[2]
##   a <- para[3]
##   b <- para[4]
##   
##   # unpack three variables: 
##   # x - # of transactions (frequency)
##   # t_x - the most recent time of transaction (recency)
##   # T - the end time for the Poisson process
##   x <- H$x
##   t_x <- H$t_x
##   T <- H$T
##   
##   # the term of the Gaussian hypergeometric function 2F1(.)
##   hg <- hypergeo::hypergeo(r+x,b+x,a+b+x-1,t/(alpha+T+t))
##   hg <- Re(hg) # hg is a real number in the complex format Re(.) makes it real. 
##   
##   # the numerator term, given hg 
##   Y1 <- (a+b+x-1)/(a-1)*(1-(((alpha+T)/(alpha+T+t))^(r+x))*hg)
##   Y2 <- 1 
##   if (x>0) {
##     Y2 <- Y2 + a/(b+x-1)*(((alpha+T)/(alpha+t_x))^(r+x))
##   }
##   
##   return(Y1/Y2)
## }

Estimating the Parameters of the BG/NBD model

For the estimation, we need to maximize the log likelihood (or minimize the negative log likelihood, as we have as the output of the lik_bgnbd function). To do so, we are using a non-linear solver built into R called constrOptim(.) (standing for constrained optimization). The function is in the R Base, and use ?constrOptim to see more details.

One thing to take note is the all parameters \((r,\alpha)\) for the Gamma distribution and \((a,b)\) for the Beta distribution are positive (\(>0\)). We thus have to specify the constraints in the constrOptim as ui and ci with: ui %*% parameters >= ci. For our model where all parameters are positive, we specify ui as a \(4\times4\) identity matrix, and ci as a length 4 all-zero vector.

\[ \left[ \begin{matrix} 1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ 0& 0& 0& 1\\ \end{matrix} \right] \times \left[ \begin{array}{c} r\\ \alpha\\ a\\ b\\ \end{array} \right] > \left[ \begin{array}{c} 0\\ 0\\ 0\\ 0\\ \end{array} \right] \]

Running the estimation procedure

The procedure is as below:

# to specify the constraint matrix with 4 parameters
ui <- diag(4) # an identify matrix 4*4 
ci <- rep(0,4) # all all-zero vector of length 4

# para: the starting values; a vector with 4 elements corresponding to (r,alpha,a,b). 
# it usually does not matter much with different starting values. 
# you can change this to other values or test different values.
para <- rep(.01,4)

# to run the constrained optimization 
results <- constrOptim(para, lik_bgnbd, NULL, ui, ci)

# to get the estimated parameters in the order of (r,alpha,a,b)
para <- results$par
names(para) <- c("r","alpha","a","b")
para
##         r     alpha         a         b 
## 0.2171882 3.7234535 0.8349768 3.7701876

Predicting the no. of transactions

We use the function pred_Y to predict the no. of transactions for a single customer. You have to specify how long into the future you want to predict. Please see comments within the function pred_Y for more details.

Before we do the prediction, we must install the package “hypergeo” to evaluate the hypergeometeric function 2F1(.). Please run install.packages("hypergeo") in your command line.

# suppose we want to predict next year or 52 weeks. 
# so we set t = 52. You can set t to other values of your choice. 
t <- 52

# let's predict the first customer in the data.
H <- CLV_data[1,]

# to obtain the prediction
Yhat <- pred_Y(para,H,t)
as.numeric(Yhat)
## [1] 0.6823722