A couple of months back, I reached out to a few ecommerce companies to understand how they use CLV today. Here is what I noted:
B2C businesses struggle to capture and project customer purchase patterns in a typical non-contractual setting. As a result, they struggle to understand the actual value of their customers and end up grappling with the following questions:
How much should they spend in acquiring a new customer for their product?
How do different segments of customers interact with a product?
How well does their product actually do in the market?
How can they drive the demand for their products higher and maximize returns from each customer?
There are tools that help drive customer engagement higher, but understanding a customer is still a challenge.
Correctly estimating the lifetime value of each customer holds the key.
Knowing the right value of a new customer helps you understand how much you should be spending on acquiring customers.
Targeting customers by value helps allocate the marketing budget smartly to increase customer engagement and stickiness.
Figuring out how frequently a particular customer transacts, and when they will transact next, helps in predicting demand for different products across different customer segments.
Customer Lifetime Value (CLV) is the present value of the future cash flows attributed to a customer over a given time period. It quantifies the revenue each customer generates in a fixed interval of time. There are a few advanced statistical methods to describe and predict a customer's purchase behavior in non-contractual settings: by fitting probabilistic models to historic transaction records, one can compute customer-centric metrics of managerial interest.
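Before bringing in any probabilistic machinery, the definition itself can be illustrated with a back-of-the-envelope calculation that treats CLV as a discounted sum of expected future cash flows. The margin, retention rate, discount rate and horizon below are hypothetical numbers chosen purely for illustration, not estimates from the data.

# Minimal sketch: CLV as the discounted sum of expected future cash flows.
# All inputs are hypothetical and only illustrate the definition.
clv_simple = function(monthly_margin, retention = 0.8,
                      annual_discount = 0.10, horizon_months = 36) {
  t = 1:horizon_months
  # expected margin in month t (decayed by retention), discounted back to today
  sum(monthly_margin * retention^t / (1 + annual_discount / 12)^t)
}

clv_simple(monthly_margin = 25)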
The UCI online-retail dataset is a transactional data set containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. I wanted to analyze the lifetime value of each customer to predict when, and how much, a customer would buy next.
library(dplyr)
library(BTYD)
library(ggplot2)

# Read the data
ucl_data = read.csv("~/Downloads/Online_Retail.csv",stringsAsFactors = F)
#Fix the date
ucl_data = ucl_data %>% mutate(InvoiceDate = as.Date(as.POSIXct(InvoiceDate,format = "%m/%d/%Y %H:%M")))
str(ucl_data)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Date, format: "10-12-01" "10-12-01" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
#1. Get Recency, First purchase and last purchase made by customers and the total value of each transaction (QTY*PRICE)
max_InvoiceDate = max(ucl_data$InvoiceDate)
# today = max_InvoiceDate+2
ucl_data = ucl_data %>%
  group_by(CustomerID) %>%
  mutate(Recency = max_InvoiceDate - max(InvoiceDate),
         First_purchase = max_InvoiceDate - min(InvoiceDate),
         Value = Quantity * UnitPrice,
         Frequency = n_distinct(InvoiceNo)) %>%
  arrange(InvoiceDate)
# 2. Filter out Qty < 0
neg = ucl_data %>% filter(Quantity<0)
ucl_data = ucl_data %>% filter(Quantity>=0)
# 3. Get customers with repeat transactions
ucl_data = ucl_data %>%
  group_by(CustomerID) %>%
  mutate(repeat_customers = ifelse(n_distinct(InvoiceNo) > 1, n_distinct(InvoiceNo), 0))

ucl_summary = ucl_data %>%
  group_by(CustomerID) %>%
  summarize(Last_purchase = max(InvoiceDate),
            First_purchase = min(InvoiceDate),
            Frequency = n_distinct(InvoiceNo),
            Sales = mean(Value),
            repeat_customers = ifelse(mean(repeat_customers) > 0, 1, 0))
ucl_summary = ucl_summary %>% mutate(time_between = Last_purchase-First_purchase, recency = abs(max_InvoiceDate - Last_purchase))
head(ucl_summary)
## # A tibble: 6 x 8
## CustomerID Last_purchase First_purchase Frequency Sales repeat_customers
## <int> <date> <date> <int> <dbl> <dbl>
## 1 12346 11-01-18 11-01-18 1 7.72e4 0
## 2 12347 11-12-06 10-12-06 7 2.37e1 1
## 3 12348 11-09-24 10-12-16 4 5.80e1 1
## 4 12349 11-11-21 11-11-21 1 2.41e1 0
## 5 12350 11-02-01 11-02-01 1 1.97e1 0
## 6 12352 11-11-02 11-02-16 8 2.95e1 1
## # … with 2 more variables: time_between <drtn>, recency <drtn>
print(paste("Total number of customers transacted in the given period = ", nrow(ucl_summary)))## [1] "Total number of customers transacted in the given period = 4340"
recency_prop = 100* table(as.factor(ucl_summary$repeat_customers))/sum(table(as.factor(ucl_summary$repeat_customers)))
print(recency_prop)
##
## 0 1
## 34.42396 65.57604
Around 66% of the customers have transacted more than once.
# make log plot and plot sales
ggplot(ucl_data, aes(x=InvoiceDate,y=Value,group=CustomerID))+
geom_line(alpha=0.1)+
scale_x_date()+
scale_y_log10()+
ggtitle("Sales for individual customers")+
ylab("Sales ($, US)")+xlab("")+
theme_minimal()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
This is a log-scale plot of sales for individual customers. The median amount spent by a customer is $9.90, while the mean is $220.
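These figures can be checked directly from the data. The quick summary below looks at spend both per transaction line and per customer, since which of the two levels the quoted numbers refer to is my assumption rather than something stated above.

# Quick check of the spend figures quoted above
ucl_data %>% ungroup() %>% summarize(median_value = median(Value), mean_value = mean(Value))   # per transaction line
ucl_summary %>% summarize(median_sales = median(Sales), mean_sales = mean(Sales))              # per customer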
# look at days between orders
# model describes rates via a gamma distribution across customers
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
purchaseFreq <- ddply(ucl_data %>% filter(Frequency>1), .(CustomerID), summarize,
daysBetween = as.numeric(diff(InvoiceDate)))
detach(package:plyr)
library(grDevices)
library(dplyr)
ggplot(purchaseFreq %>% filter( daysBetween>0),aes(x=daysBetween))+
geom_histogram(fill="orange")+
xlab("Time between purchases (days)")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram shows that repeat transactions become less frequent as the time between purchases grows: most repeat purchases happen within a relatively short gap of the previous one.
colnames(ucl_summary)
## [1] "CustomerID" "Last_purchase" "First_purchase"
## [4] "Frequency" "Sales" "repeat_customers"
## [7] "time_between" "recency"
ggplot(ucl_summary %>% filter(Frequency>1), aes(x = time_between, y = Frequency, color = as.factor(repeat_customers))) + geom_point() + ylim(0,50)
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 16 rows containing missing values (geom_point).
### Problem

If the time gap between the first and last transaction is large, the mean purchase frequency also tends to be high.
## Convert data to an RFM matrix for the PnBD and BGnBD models and split it into training (calibration) and test (holdout) sets
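The conversion itself happens in model_data_prep.R, which is sourced below but not shown in this post. Assuming it follows the standard BTYD workflow, it would build an event log with cust, date and sales columns and convert it into a calibration CBS (customer-by-sufficient-statistic) matrix, roughly along the lines of the sketch below; the calibration cut-off date and variable names are assumptions on my part.

# Hedged sketch of what model_data_prep.R likely does (standard BTYD workflow).
# The calibration cut-off date and the names below are assumptions.
elog = ucl_data %>%
  ungroup() %>%
  transmute(cust = CustomerID, date = InvoiceDate, sales = Value) %>%
  as.data.frame()

threshold_date = as.Date("2011-06-30")   # assumed end of the calibration period

data_split = dc.ElogToCbsCbt(elog, per = "day",
                             T.cal = threshold_date,
                             T.tot = max(elog$date),
                             statistic = "freq")
cal.cbs = data_split$cal$cbs             # columns: x, t.x, T.cal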
source("model_data_prep.R")## Parameter Estimates: PnBD model:
params = pnbd.EstimateParameters(cal.cbs)
params
LL = pnbd.cbs.LL(params,cal.cbs)
LL
#Run multiple iterations to get optimal parameters
p.matrix = c(params,LL)
for(i in 1:10){
params = pnbd.EstimateParameters(cal.cbs,params)
LL = pnbd.cbs.LL(params,cal.cbs)
p.matrix.row = c(params,LL)
p.matrix = rbind(p.matrix,p.matrix.row)
}
colnames(p.matrix) <- c("r","alpha","s","beta","LL")
rownames(p.matrix) <- 1:nrow(p.matrix)
p.matrix
# params = p.matrix[3,]

## Estimating model parameters: BGnBD

Parameters are estimated by maximising the log-likelihood.
# starting params = (1,1,1,2)
params_bgnbd = bgnbd.EstimateParameters(cal.cbs,par.start = c(1,1,1,2))
params_bgnbd
LL = bgnbd.cbs.LL(params_bgnbd,cal.cbs)
LL
#Run multiple iterations to get optimal parameters
p.matrix_bgnd = c(params_bgnbd,LL)
for(i in 1:10){
params_bgnbd = bgnbd.EstimateParameters(cal.cbs,params_bgnbd)
LL = bgnbd.cbs.LL(params_bgnbd,cal.cbs)
p.matrix_bgnd.row = c(params_bgnbd,LL)
p.matrix_bgnd = rbind(p.matrix_bgnd,p.matrix_bgnd.row)
}
colnames(p.matrix_bgnd) <- c("r","alpha","a","beta","LL")
rownames(p.matrix_bgnd) <- 1:nrow(p.matrix_bgnd)
p.matrix_bgnd
# params_bgnbd = p.matrix_bgnd[8,]

The Pareto/Negative Binomial Distribution (Pareto/NBD) model is perhaps the most well-known and frequently applied probabilistic model in the non-contractual context. Pareto/NBD focuses only on the purchase count and lifetime of a customer.
pnbd.PlotTransactionRateHeterogeneity(params) ## gamma distribution

The plot above shows that customers are more likely to have low values of their individual Poisson transaction-process parameter: only a few customers have a high transaction rate. This validates the heterogeneity across customer transactions. Repeat customers buy around their own mean purchasing interval, so the time gap between purchases differs from customer to customer.
pnbd.PlotDropoutRateHeterogeneity(params, lim = 0.5) ## gamma mixing distribution of the Pareto dropout process

The above plot shows the distribution of customers' propensities to drop out. Only a small density of customers has a high drop-out rate.
Why the BGnBD model? The Beta Geometric/Negative Binomial Distribution (BG/NBD) model can be used to determine the expected repeat visits for customers, and hence a customer's lifetime value. It can also be used to assess whether a customer has churned or is likely to churn soon.
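As a concrete example of the churn angle, BTYD exposes P(alive) functions for both models. The sketch below computes the probability that a single customer (ID 12662, an example customer also used later in this post) is still active at the end of the calibration period.

# Probability that customer 12662 is still "alive" at the end of calibration,
# under the Pareto/NBD and BG/NBD models respectively.
cust = cal.cbs["12662",]
pnbd.PAlive(params, x = cust["x"], t.x = cust["t.x"], T.cal = cust["T.cal"])
bgnbd.PAlive(params_bgnbd, x = cust["x"], t.x = cust["t.x"], T.cal = cust["T.cal"])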
bgnbd.PlotTransactionRateHeterogeneity(params_bgnbd) ## gamma distribution

As with the Pareto/NBD model, the plot above shows that customers are more likely to have low values of their individual Poisson transaction-process parameter: only a few customers have a high transaction rate, which again reflects the heterogeneity across customer transactions. Repeat customers buy around their own mean purchasing interval, so the time gap between purchases differs across customers. No customer keeps buying forever, and the time to dropout also varies from customer to customer.
bgnbd.PlotDropoutRateHeterogeneity(params_bgnbd) ## gamma mixing distribution of the BGnBD dropout process
The above plot shows the distribution of customers' propensities to drop out. Only a small density of customers has a high drop-out rate, and there is again heterogeneity in dropout: no customer buys forever, and the time to dropout varies across customers. This is expected in the transactional setting of an online retail store.
for(t in seq(40,100,by=10)){
print(paste("Expected number of repeat transaction by a new customer in",t,"days =" ,pnbd.Expectation(params,t = t)))
}## [1] "Expected number of repeat transaction by a new customer in 40 days = 0.481301525540638"
## [1] "Expected number of repeat transaction by a new customer in 50 days = 0.601626885840017"
## [1] "Expected number of repeat transaction by a new customer in 60 days = 0.721952240729078"
## [1] "Expected number of repeat transaction by a new customer in 70 days = 0.842277590938877"
## [1] "Expected number of repeat transaction by a new customer in 80 days = 0.962602937026013"
## [1] "Expected number of repeat transaction by a new customer in 90 days = 1.08292827942852"
## [1] "Expected number of repeat transaction by a new customer in 100 days = 1.20325361850016"
This shows that a new customer is expected to make about one repeat transaction within roughly three months.
cust.12662 <- cal.cbs["12662",]
x = cust.12662["x"]
t.x = cust.12662["t.x"]
T.cal = cust.12662["T.cal"]
for(t in seq(40,100,by=10)){
print(paste("Expected number of repeat transaction by customer 12662 in",t,"days =" ,pnbd.ConditionalExpectedTransactions(params,T.star = t,x,t.x,T.cal)))
}## [1] "Expected number of repeat transaction by customer 12662 in 40 days = 0.745276650336834"
## [1] "Expected number of repeat transaction by customer 12662 in 50 days = 0.931595807244443"
## [1] "Expected number of repeat transaction by customer 12662 in 60 days = 1.11791496206169"
## [1] "Expected number of repeat transaction by customer 12662 in 70 days = 1.30423411486676"
## [1] "Expected number of repeat transaction by customer 12662 in 80 days = 1.4905532657322"
## [1] "Expected number of repeat transaction by customer 12662 in 90 days = 1.6768724147255"
## [1] "Expected number of repeat transaction by customer 12662 in 100 days = 1.86319156190963"
Customer 12662 is expected to transact at least once more within about two months.
pnbd.PlotFrequencyInCalibration(params,cal.cbs,3)
## freq.0 freq.1 freq.2 freq.3+
## n.x.actual 1394.000 588.0000 313.0000 521.0000
## n.x.expected 1383.694 568.0825 309.4955 554.7278
This plot shows the actual and expected number of customers who made repeat transactions in the calibration period, binned by frequency.
bgnbd.PlotFrequencyInCalibration(params_bgnbd,cal.cbs,3)
## freq.0 freq.1 freq.2 freq.3+
## n.x.actual 1394.000 588.0000 313.0000 521.0000
## n.x.expected 1383.724 568.0642 309.4836 549.7278
This plot shows the same comparison for the BGnBD model: the actual and expected number of customers who made repeat transactions in the calibration period, binned by frequency. The difference between the PnBD and BGnBD models is small in this case.
The frequency plots above show that the models perform reasonably well in the calibration period. But that alone does not mean the models are good; to test their performance, it is important to see how they do in the holdout (test) period.
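The holdout preparation is done by data_prep_holdout.R (sourced below), which is also not shown in this post. Assuming it follows the BTYD vignette and continuing from the calibration sketch earlier (which defined elog and threshold_date), it would attach to the calibration matrix a column x.star holding each customer's actual number of transactions in the holdout period, roughly as follows. The variable names mirror the code below, but the details are my assumptions.

# Hedged sketch of what data_prep_holdout.R likely does: count each customer's
# actual transactions after the calibration cut-off and attach them as x.star.
holdout_elog = elog %>% filter(date > threshold_date)
# one transaction per customer per day, matching merge.same.date = TRUE earlier
holdout_counts = holdout_elog %>% group_by(cust) %>% summarize(x.star = n_distinct(date))

cal.cbs2 = cal.cbs
x.star = rep(0, nrow(cal.cbs2))
names(x.star) = rownames(cal.cbs2)
idx = as.character(holdout_counts$cust) %in% names(x.star)   # ignore customers who first appear in the holdout
x.star[as.character(holdout_counts$cust[idx])] = holdout_counts$x.star[idx]
cal.cbs2 = cbind(cal.cbs2, x.star = x.star)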
source("data_prep_holdout.R")
T.star = as.numeric(max(ucl_log_sum$date)-threshold_date+1)
censor = 11
x.star = cal.cbs2[,"x.star"]
comp = pnbd.PlotFreqVsConditionalExpectedFrequency(params, T.star = T.star, cal.cbs = cal.cbs2, x.star, censor)

The PnBD model does well in the holdout period; however, the deviations become larger towards the end.
T.star = as.numeric(max(ucl_log_sum$date)-threshold_date+1)
censor = 11
x.star = cal.cbs2[,"x.star"]
comp = bgnbd.PlotFreqVsConditionalExpectedFrequency(params_bgnbd, T.star = T.star, cal.cbs = cal.cbs2, x.star, censor)

The BGnBD model also estimates the holdout frequencies well.
Next, to check whether the models capture the trend in transactions, we test how they perform against cumulative sales in the holdout period. This gives an idea of how well customer behavior can be predicted in the future.
#Get total daily sales
d.track = rep(0,period)
origin = min(ucl_data$InvoiceDate)
for(i in colnames(tot.cbt)){
date.index = (as.numeric(difftime(as.Date(i),origin))+1)
d.track[date.index] = sum(tot.cbt[,i])
}
# Get mean weekly actual sales
w.track = rep(0,round(period/7))
for (j in 1:length(w.track)) {
w.track[j] = mean(d.track[(j*7-6):(j*7)])
}
T.cal =(cal.cbs2[,"T.cal"])
T.tot = as.numeric(round((range(ucl_data$InvoiceDate)[2]-range(ucl_data$InvoiceDate)[1])))

## Cumulative daily tracking
cum.track = cumsum(d.track)
cum.tracking = pnbd.PlotTrackingCum(params, T.cal, T.tot, cum.track, title = "Tracking daily cumulative transactions", xlab = "Days")

## Cumulative weekly tracking
cum.track.week = cumsum(w.track)
cum.tracking.week = pnbd.PlotTrackingCum(params, T.cal, T.tot/7, cum.track.week)

PnBD captures the trend in cumulative sales reasonably well.
## Cumulative daily tracking
cum.track = cumsum(d.track)
cum.tracking = bgnbd.PlotTrackingCum(params_bgnbd, T.cal, T.tot, cum.track, title = "Tracking daily cumulative transactions", xlab = "Days")

## Cumulative weekly tracking
cum.track.week = cumsum(w.track)
cum.tracking.week = bgnbd.PlotTrackingCum(params_bgnbd, T.cal, T.tot/7, cum.track.week)

Again, the difference between the two models on this dataset is insignificant.
While it was useful to check the models against cumulative sales, we can also test their performance against daily and weekly transactions. To do so, we first extract the actual daily and weekly values in the holdout period and then compare them with the expected values.
Both the models capture the trend in daily sales.
inc.tracking.daily = pnbd.PlotTrackingInc(params, T.cal = T.cal, T.tot, actual.inc.tracking.data = d.track, title = "Tracking daily transactions", xlab = "Days")
inc.tracking.daily[,20:30]

inc.tracking.daily = bgnbd.PlotTrackingInc(params_bgnbd, T.cal = T.cal, T.tot, actual.inc.tracking.data = d.track, title = "Tracking daily transactions", xlab = "Days")
inc.tracking.daily[,20:30]

Both the PnBD and BGnBD models capture the trend in weekly transactions well. The difference between the two models is not significant, since more than 60% of the customers are repeat customers, which is quite high. While both models capture the trend and the average value of customers well, they do not capture the effects of seasonality and marketing campaigns.
inc.tracking.weekly = pnbd.PlotTrackingInc(params, T.cal = T.cal, T.tot/7, actual.inc.tracking.data = w.track)
inc.tracking.weekly[,1:10]
## [,1] [,2] [,3] [,4] [,5] [,6]
## actual 4.142857 17.714286 21.571429 3.428571 1.428571 19.71429
## expected 2.438375 5.579396 7.625722 11.047591 14.164283 15.51589
## [,7] [,8] [,9] [,10]
## actual 18.71429 15.57143 21.71429 18.28571
## expected 17.16052 19.30019 20.65141 21.08881
inc.tracking.weekly = bgnbd.PlotTrackingInc(params_bgnbd, T.cal = T.cal, T.tot/7, actual.inc.tracking.data = w.track)
inc.tracking.weekly[,1:10]

To include demographics and seasonality, I train a deep neural network with three hidden layers (64, 32, 16) and test how it performs in the holdout period. I have stored the results of the DNN on the test set in test_results.csv.
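The network itself is not shown in this post (the consolidated code is in the repository linked at the end). A minimal sketch of a 64-32-16 architecture using the keras R package, assuming pre-built numeric feature matrices x_train / x_test (e.g. RFM features, demographics and seasonality indicators) and a sales target y_train, might look like this:

# Hedged sketch of a DNN with 3 hidden layers (64, 32, 16) for predicting sales.
# x_train, y_train and x_test are assumed, pre-built numeric matrices (not shown here).
library(keras)

model = keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1)                      # regression output: predicted sale

model %>% compile(optimizer = optimizer_adam(), loss = "mse", metrics = "mae")

history = model %>% fit(x_train, y_train, epochs = 50, batch_size = 128,
                        validation_split = 0.2)

pred_sale = predict(model, x_test)            # predictions for the holdout period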
holdout_nn = read.csv("test_results.csv",stringsAsFactors = F)
holdout_nn = holdout_nn %>% mutate(InvoiceDate=as.Date(InvoiceDate))
ggplot(data = holdout_nn, aes(x = InvoiceDate, y = actual_sale)) + geom_line() + geom_line(aes(y = pred_sale), color = "red") + ylim(0, 100)
## Warning: Removed 617 rows containing missing values (geom_path).
This looks OK, but the plot is hard to read because of the sheer volume of data.
Next, we look at the mean of the predicted sales to see whether the trends are captured.
#detach(package:plyr)
library(ggplot2)
library(dplyr)
holdout_nn2 = holdout_nn %>% group_by(InvoiceDate) %>% summarize(pred_sale=mean(pred_sale),actual_sale=mean(actual_sale))
ggplot(data = holdout_nn2, aes(x = InvoiceDate, y = pred_sale)) + geom_line(color = "blue") + geom_line(aes(y = actual_sale)) + ylim(15, 26)
## Warning: Removed 1 rows containing missing values (geom_path).
# print(unique(holdout_nn$pred_sale))

The mean captures the ups and downs, but the model can be improved further by adding engagement metrics and tuning the network.
As future work, I would take up the following:

- Fine-tune the DNN model
- Include more engagement features for the DNN
- Include attribution analysis
- Perform per-product analysis to forecast customer demand for each product
We saw how to calculate CLV for an online retail store. CLV helps in understanding the value of each customer, and driving managerial decisions based on it is a step towards becoming more customer-oriented and growing faster.
You can find the consolidated code for DNN and BTYD models at https://github.com/ragav208/uci_clv