Data Description
Please download and load data “CLV.Rdata” from Canvas. The data record three variables for the CLV analysis at Crutchfield. The three variables of the data are defined as following:
- \(x\): the frequency of purchases
- \(t_x\): the recency, i.e., the time the most recent purchase
- \(T\): the end time for the observations
## ID x t_x T
## 1 1 1 3.857143 29.00000
## 2 2 2 13.285714 27.14286
## 3 3 5 29.142857 31.28571
## 4 4 0 0.000000 38.42857
## 5 5 0 0.000000 32.00000
## 6 6 0 0.000000 30.57143
Two Funcitions: lik_bgnbd and pred_Y
You may find two functions: lik_bgnbd and
pred_Y along with the data. You can directly use these two
functions for your CLV analysis and predictions. The details of these
functions are explained in the comments within the functions.
The likelihood function lik_bgnbd
The lik_bgnbg is the log-likelihood function of the CLV
model that we discussed in class. The function takes in one input a
vector of parameters para in the order as this: \[para = (r,\alpha,a,b,)\] Where, \(r\) and \(\alpha\) are two parameters of the gamma
distribution of the transaction rate \(\lambda\), and \(a\) and \(b\) are two parameters of the beta
distribution of the “death” rate of consumers \(p\). The function output the log-likelihood
of the CLV model.
The A1-A4 terms defined in the function are
the logged terms of A1-A4 in Fader et al. (2005), p. 280. The function
requires data “CLV_data.” To use this function, please make sure you
have loaded “CLV_data,” a data.frame that contains at least three
variables: \(x\), \(t_x\), and \(T\) as defined above. For the future use of
this function, your data.frame must also contain the three
variables \(x\), \(t_x\), and \(T\). Please see the comments
within the function for more details.
## function(para) {
## # This is the likelihood function with CLV_data
## # Change CLV_data to the names your datasets; the variables should be the same.
##
## # unpack 4 parameters:
## # r and alpha are the parameters of the gamma distribution of lambda
## # a and b are the parameters of the beta distribution ofr the "death" rate
## r <- para[1]
## alpha <- para[2]
## a <- para[3]
## b <- para[4]
##
## # unpack three variables:
## # x - # of transactions (frequency)
## # t_x - the most recent time of transaction (recency)
## # T - the end time for the Poisson process
## x <- CLV_data$x
## t_x <- CLV_data$t_x
## T <- CLV_data$T
##
## # A1 - A4 corresponds to the log of 4 terms in the likelihood function
## # See p.280 of Fader et al. (2005)
## # They are the log of A1-A4 for the log-likelihood
## A1 <- lgamma(r+x) + r*log(alpha) - lgamma(r)
## A2 <- lgamma(a+b) + lgamma(b+x) - lgamma(b) - lgamma(a+b+x)
## A3 <- -(r+x)*log(alpha+T)
##
## A4 <- rep(0,length(x))
## idx <- x>0
## if (sum(idx)>0) {
## A4[idx] <- log(a/(b+x[idx]-1)) - (r+x[idx])*log(alpha+t_x[idx])
## }
##
## lh <- sum(A1+A2+log(exp(A3)+idx*exp(A4)))
##
## # To return the minus of likelihood for minimization
## # maximize likelihood = minimize negative likelihood
## return(-lh)
##
## }
## <bytecode: 0x000001c445ba53c8>
The prediction function pred_Y
The second function pred_Y is the function of prediction
future # of transactions based on the estimated parameters from
lik_bgnbd. This function follows Equation (10) of Fader et
al. (2005), p.279. The function outputs the predicted no. of
transactions during your specified time periods \(t\). The function takes in three
inputs:
para: as defined in the previous subsection forlik_bgnbdH: a data.frame with only 1 row (for 1 customer) and three variables \((x,t_x,T)\);t- how long into the future you want to predict for the customer
Note that The function requires a package “hypergeo”
for the evaluation of the hypergeometeric function 2F1(.).
Please make sure you have installed the function in your R libraries. If
not, please install the package beforehand with
install.packages("hypergeo").
As evaluation of 2F1(.) is somewhat costly, this
function uses input from only 1 customer. You can extend the function to
calculate for many customers by iterations or use lapply(.)
to apply the function to multiple customers. Please see the comments
within the pred_Y function for more details.
## function(para,H,t) {
## # Given estimated parameters, for a customer with history {x,t_x,T}, to predict her # of transactions in t time.
## # Please see Equation (10) of Fader et al. (2005), p.279;
## # This function uses a package "hypergeo" for function evaluation;
## # As evaluation of 2F1(.) is somewhat costly, this function is to accommodate only 1 customer;
## # You can extend the function to calculate for many customers by iterations or use lapply(.);
##
## # The function takes three input:
## # para - a vector of the 4 parameters;
## # H - a data.frame with only 1 row (for 1 customer) and three variables {x,t_x,T};
## # t - how long into the future?
##
## # unpack 4 parameters:
## # r and alpha are the parameters of the gamma distribution of lambda
## # a and b are the parameters of the beta distribution of the "death" rate
## r <- para[1]
## alpha <- para[2]
## a <- para[3]
## b <- para[4]
##
## # unpack three variables:
## # x - # of transactions (frequency)
## # t_x - the most recent time of transaction (recency)
## # T - the end time for the Poisson process
## x <- H$x
## t_x <- H$t_x
## T <- H$T
##
## # the term of the Gaussian hypergeometric function 2F1(.)
## hg <- hypergeo::hypergeo(r+x,b+x,a+b+x-1,t/(alpha+T+t))
## hg <- Re(hg) # hg is a real number in the complex format Re(.) makes it real.
##
## # the numerator term, given hg
## Y1 <- (a+b+x-1)/(a-1)*(1-(((alpha+T)/(alpha+T+t))^(r+x))*hg)
## Y2 <- 1
## if (x>0) {
## Y2 <- Y2 + a/(b+x-1)*(((alpha+T)/(alpha+t_x))^(r+x))
## }
##
## return(Y1/Y2)
## }
Estimating the Parameters of the BG/NBD model
For the estimation, we need to maximize the log likelihood (or
minimize the negative log likelihood, as we have as the output of the
lik_bgnbd function). To do so, we are using a non-linear
solver built into R called constrOptim(.) (standing for
constrained optimization). The function is in the R Base, and use
?constrOptim to see more details.
One thing to take note is the all parameters \((r,\alpha)\) for the Gamma distribution and
\((a,b)\) for the Beta distribution are
positive (\(>0\)). We thus have to
specify the constraints in the constrOptim as
ui and ci with:
ui %*% parameters >= ci. For our model where all
parameters are positive, we specify ui as a \(4\times4\) identity matrix, and
ci as a length 4 all-zero vector.
\[ \left[ \begin{matrix} 1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ 0& 0& 0& 1\\ \end{matrix} \right] \times \left[ \begin{array}{c} r\\ \alpha\\ a\\ b\\ \end{array} \right] > \left[ \begin{array}{c} 0\\ 0\\ 0\\ 0\\ \end{array} \right] \]
Running the estimation procedure
The procedure is as below:
# to specify the constraint matrix with 4 parameters
ui <- diag(4) # an identify matrix 4*4
ci <- rep(0,4) # all all-zero vector of length 4
# para: the starting values; a vector with 4 elements corresponding to (r,alpha,a,b).
# it usually does not matter much with different starting values.
# you can change this to other values or test different values.
para <- rep(.01,4)
# to run the constrained optimization
results <- constrOptim(para, lik_bgnbd, NULL, ui, ci)
# to get the estimated parameters in the order of (r,alpha,a,b)
para <- results$par
names(para) <- c("r","alpha","a","b")
para## r alpha a b
## 0.2171882 3.7234535 0.8349768 3.7701876
Predicting the no. of transactions
We use the function pred_Y to predict the no. of
transactions for a single customer. You have to specify how long into
the future you want to predict. Please see comments within the function
pred_Y for more details.
Before we do the prediction, we must install the package “hypergeo”
to evaluate the hypergeometeric function 2F1(.). Please run
install.packages("hypergeo") in your command line.
# suppose we want to predict next year or 52 weeks.
# so we set t = 52. You can set t to other values of your choice.
t <- 52
# let's predict the first customer in the data.
H <- CLV_data[1,]
# to obtain the prediction
Yhat <- pred_Y(para,H,t)
as.numeric(Yhat)## [1] 0.6823722