[Adapted for R, original text and idea from: https://towardsdatascience.com/predictive-customer-analytics-part-iii-aeb996beafba]

We are going to see how to predict the customer lifetime value (CLV) for new customers using models built on existing customer data.

We have the revenue generated by these customers in each of their first six months, month one to month six, together with each customer’s lifetime value. The lifetime value might be, for example, the overall revenue a customer generated over 3 years; the horizon is something we can decide based on how long our customers typically stay with our business. We are going to use this data to build a linear regression model that can then be used to predict the customer lifetime value.

We start out by importing the libraries.

library(corrplot)
library(dplyr)
library(readr)  # provides read_csv(), used below

We load up the history.csv file.

#df <- read_csv("history.csv", col_types = "nciiiiiii")
setwd("E:/3 course/anal")
getwd()
##  "E:/3 course/anal"
df <- read_csv("history.csv")
## Warning: Missing column names filled in: 'X1' 
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   CUST_ID = col_integer(),
##   MONTH_1 = col_integer(),
##   MONTH_2 = col_integer(),
##   MONTH_3 = col_integer(),
##   MONTH_4 = col_integer(),
##   MONTH_5 = col_integer(),
##   MONTH_6 = col_integer(),
##   CLV = col_integer()
## )
df <- df[,-1]

We then check that all the data elements have been loaded as integers, because the later steps expect them to be integers.

# All variables are integers
typeof(df$CUST_ID)
##  "integer"
typeof(df$MONTH_1)
##  "integer"
typeof(df$MONTH_2)
##  "integer"
typeof(df$MONTH_3)
##  "integer"
typeof(df$MONTH_4)
##  "integer"
typeof(df$MONTH_5)
##  "integer"
typeof(df$MONTH_6)
##  "integer"
typeof(df$CLV)
##  "integer"

We call head() on the data to look at the first few records and check that they have loaded properly.

# Preview of the data
head(df)
## # A tibble: 6 x 8
##   CUST_ID MONTH_1 MONTH_2 MONTH_3 MONTH_4 MONTH_5 MONTH_6   CLV
##     <int>   <int>   <int>   <int>   <int>   <int>   <int> <int>
## 1    1001     301     291     317     335     287     323 30361
## 2    1002     276     201     287     316     241     254 25909
## 3    1003     301     210     264     253     276     228 25162
## 4    1004     351     270     334     330     323     292 31333
## 5    1005     351     310     342     299     297     285 30306
## 6    1006     251     231     211     214     240     240 22440

We then move on to correlation analysis between the first six months of data and the CLV. We drop the customer ID column because it is not required for our model building purposes.
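
The correlation step described above could be sketched in base R roughly as follows. Since history.csv is not available here, the block builds a small synthetic data frame with the same column names; the values and the way CLV is generated are assumptions for illustration only. It then drops CUST_ID and correlates every monthly column with CLV.

```r
# Synthetic stand-in for history.csv (assumption: values are made up for illustration)
set.seed(42)
n <- 200
months <- replicate(6, as.integer(round(runif(n, 200, 350))))
colnames(months) <- paste0("MONTH_", 1:6)
df <- data.frame(CUST_ID = 1001:(1000 + n), months)
df$CLV <- as.integer(round(16 * rowSums(months) + rnorm(n, 0, 500)))

# Drop the customer ID: it is an identifier, not a predictor
features <- df[, names(df) != "CUST_ID"]

# Pearson correlation matrix of the monthly revenues and CLV
corr_matrix <- cor(features)

# Correlation of each column with CLV, rounded for readability
print(round(corr_matrix[, "CLV"], 2))

# Optional visual check, if the corrplot package is installed:
# corrplot::corrplot(corr_matrix, method = "number")
```

On the real data, the same calls would be applied to df after read_csv, with no change other than skipping the synthetic setup.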

The analysis shows strong correlation between the monthly revenues and CLV, so we can go forward and build a model.

We start off by splitting the data into training and testing sets in a ratio of roughly 90:10. Here each row is assigned to the training set with probability 0.9 using rbinom.

df$isTrain <- rbinom(nrow(df), 1, 0.900)
train <- subset(df, df$isTrain == 1)
test <- subset(df, df$isTrain == 0)

We print out the size to make sure that everything looks okay.

We then go on to build a model. We start off with a linear regression model. We fit the model and print out the coefficients and the intercept, which give us the actual linear regression equation.

modelTrain <- lm(MONTH_1 ~ MONTH_6, data = train)
summary(modelTrain)
## 
## Call:
## lm(formula = MONTH_1 ~ MONTH_6, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.896 -28.291  -0.399  26.956  57.381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 275.49841    8.74158  31.516  < 2e-16 ***
## MONTH_6       0.10237    0.03327   3.077  0.00215 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.34 on 890 degrees of freedom
## Multiple R-squared:  0.01053,  Adjusted R-squared:  0.009415 
## F-statistic: 9.468 on 1 and 890 DF,  p-value: 0.002155

Then we can test on the testing data that we created by generating the predictions. We also look at accuracy scores for the regression model, which tell us how good the model is.

test$predNew <- predict(modelTrain, newdata = test)

actuals_preds <- data.frame(cbind(actuals=test$CLV, predicteds=test$predNew))  # make actuals_predicteds dataframe.

correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.7462575
## predicteds 0.7462575  1.0000000
min_max_accuracy <- mean (apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  #Higher the better

mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  # mean absolute percentage deviation, Lower the better

And that turns out to have an accuracy of 98%, which is really good. This is an excellent model for predicting CLV.

Now how do I predict for new customers? Suppose there is a new customer who has been with us for 3 months. We take the first 3 months of revenue that they have given us and, based on that, build an array of all the values: the first 3 month values, with the next 3 months set to zero.

modelTrain2 <- lm(MONTH_3 ~ MONTH_6, data = train)
summary(modelTrain2)
## 
## Call:
## lm(formula = MONTH_3 ~ MONTH_6, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.175 -24.977  -0.378  25.223  74.418 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.39461    8.12013   20.74   <2e-16 ***
## MONTH_6       0.39994    0.03091   12.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.83 on 890 degrees of freedom
## Multiple R-squared:  0.1584,  Adjusted R-squared:  0.1574 
## F-statistic: 167.5 on 1 and 890 DF,  p-value: < 2.2e-16
AIC(modelTrain)  #Lower the better
##  8895.242
AIC(modelTrain2)
##  8763.68
test$predNew2 <- predict(modelTrain2, newdata = test)

actuals_preds <- data.frame(cbind(actuals=test$CLV, predicteds=test$predNew2))  # make actuals_predicteds dataframe.

correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.7462575
## predicteds 0.7462575  1.0000000
#Accuracy measure formulae from: http://rstatistics.net/linear-regression-with-r-a-numeric-example/
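
As a hedged sketch of how these accuracy measures could be computed for a model that predicts CLV directly, the block below regresses CLV on all six monthly columns. That formula is an assumption of this sketch and differs from the single-predictor models fitted above. The data is synthetic (same column names as history.csv, made-up values), the name model_clv is hypothetical, and set.seed is added so the random 90:10 split is reproducible.

```r
# Synthetic stand-in for history.csv (assumption: values are made up for illustration)
set.seed(42)
n <- 1000
months <- replicate(6, as.integer(round(runif(n, 200, 350))))
colnames(months) <- paste0("MONTH_", 1:6)
df <- data.frame(CUST_ID = 1001:(1000 + n), months)
df$CLV <- as.integer(round(16 * rowSums(months) + rnorm(n, 0, 500)))

# Random split: each row goes to the training set with probability 0.9
is_train <- rbinom(nrow(df), 1, 0.9) == 1
train <- df[is_train, ]
test <- df[!is_train, ]

# Assumed model: CLV as the response, all six months as predictors
model_clv <- lm(CLV ~ MONTH_1 + MONTH_2 + MONTH_3 + MONTH_4 + MONTH_5 + MONTH_6,
                data = train)

# Predictions on the held-out rows
actuals <- test$CLV
predicteds <- predict(model_clv, newdata = test)

# Correlation between actual and predicted values (higher is better)
correlation_accuracy <- cor(actuals, predicteds)

# Min-max accuracy: mean ratio of the smaller to the larger value (higher is better)
min_max_accuracy <- mean(pmin(actuals, predicteds) / pmax(actuals, predicteds))

# Mean absolute percentage error (lower is better)
mape <- mean(abs(predicteds - actuals) / actuals)

# R-squared on the test set
r_squared <- 1 - sum((actuals - predicteds)^2) / sum((actuals - mean(actuals))^2)

print(c(correlation = correlation_accuracy, min_max = min_max_accuracy,
        mape = mape, r_squared = r_squared))
```

Printing the measures makes it possible to check an accuracy figure against the fitted model instead of reading it off the model summary alone.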

And we use that to print the CLV.
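
A minimal sketch of this prediction step, assuming a model fitted with CLV as the response and all six months as predictors. The names model_clv and new_customer are hypothetical, the training data is synthetic, and the new customer's revenue figures are invented for illustration. The first three months are filled in and the remaining three are set to zero, as described in the text.

```r
# Synthetic stand-in for history.csv (assumption: values are made up for illustration)
set.seed(42)
n <- 1000
months <- replicate(6, as.integer(round(runif(n, 200, 350))))
colnames(months) <- paste0("MONTH_", 1:6)
df <- data.frame(CUST_ID = 1001:(1000 + n), months)
df$CLV <- as.integer(round(16 * rowSums(months) + rnorm(n, 0, 500)))

# Assumed model: CLV as the response, all six months as predictors
model_clv <- lm(CLV ~ MONTH_1 + MONTH_2 + MONTH_3 + MONTH_4 + MONTH_5 + MONTH_6,
                data = df)

# Hypothetical new customer: three observed months, the rest set to zero
new_customer <- data.frame(MONTH_1 = 300L, MONTH_2 = 280L, MONTH_3 = 310L,
                           MONTH_4 = 0L, MONTH_5 = 0L, MONTH_6 = 0L)

# Predicted CLV for the new customer
new_clv <- predict(model_clv, newdata = new_customer)
print(new_clv)
```

In this sketch the zeros lie well outside the range of monthly revenue seen in training, so the prediction is an extrapolation and is best read as a rough lower-end estimate.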

So this is how we can build a linear regression model for CLV and use it to predict the CLV of our new customers.