[Adapted for R, original text and idea from: https://towardsdatascience.com/predictive-customer-analytics-part-iii-aeb996beafba]

We are going to see how to predict customer lifetime value (CLV) for new customers, using models built on existing customer data.

For each existing customer we have the revenue generated in each of the first six months (month one to month six) and the customer's lifetime value. The CLV might be, for example, the total revenue over 3 years; the horizon is something we can decide based on how long customers typically stay with our business. This is the data we are going to use to build a linear regression model that can then predict the customer lifetime value.

We start out by importing the libraries.

```
library(corrplot)
library(readr)
library(dplyr)
```

We load up the history.csv file.

```
#df <- read_csv("history.csv", col_types = "nciiiiiii")
setwd("E:/3 course/anal")
getwd()
```

`## [1] "E:/3 course/anal"`

`df <- read_csv("history.csv")`

`## Warning: Missing column names filled in: 'X1' [1]`

```
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## CUST_ID = col_integer(),
## MONTH_1 = col_integer(),
## MONTH_2 = col_integer(),
## MONTH_3 = col_integer(),
## MONTH_4 = col_integer(),
## MONTH_5 = col_integer(),
## MONTH_6 = col_integer(),
## CLV = col_integer()
## )
```

`df <- df[,-1]`

We then check that every column has been loaded as an integer, because the later steps expect integer (numeric) columns.

```
# all variables are integers
typeof(df$CUST_ID)
```

`## [1] "integer"`

`typeof(df$MONTH_1) `

`## [1] "integer"`

`typeof(df$MONTH_2) `

`## [1] "integer"`

`typeof(df$MONTH_3) `

`## [1] "integer"`

`typeof(df$MONTH_4) `

`## [1] "integer"`

`typeof(df$MONTH_5) `

`## [1] "integer"`

`typeof(df$MONTH_6) `

`## [1] "integer"`

`typeof(df$CLV) `

`## [1] "integer"`
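The same check can be done in one call instead of eight. A minimal sketch, using a small stand-in data frame (`demo` and `col_types` are illustrative names, not part of the original script):

```r
# Stand-in for df: three of the columns, two rows, all stored as integers.
demo <- data.frame(
  CUST_ID = c(1001L, 1002L),
  MONTH_1 = c(301L, 276L),
  CLV     = c(30361L, 25909L)
)

# sapply() applies typeof() to every column and returns a named character vector.
col_types <- sapply(demo, typeof)
print(col_types)

# Stop early if any column is not an integer.
stopifnot(all(col_types == "integer"))
```

On the real data, `sapply(df, typeof)` would report all eight columns at once.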

We call `head()` to look at the first few records and confirm that they have loaded properly.

```
# small table of the data
head(df)
```

```
## # A tibble: 6 x 8
## CUST_ID MONTH_1 MONTH_2 MONTH_3 MONTH_4 MONTH_5 MONTH_6 CLV
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1001 301 291 317 335 287 323 30361
## 2 1002 276 201 287 316 241 254 25909
## 3 1003 301 210 264 253 276 228 25162
## 4 1004 351 270 334 330 323 292 31333
## 5 1005 351 310 342 299 297 285 30306
## 6 1006 251 231 211 214 240 240 22440
```

We then move on to correlation analysis between the first six months of data and the CLV. We drop the customer ID column because it is not required for our model building purposes.

The analysis shows good correlation between the individual months and CLV, so we can go ahead and build a model.
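The text does not show the correlation step itself. One way to do it is sketched below, using only the six rows printed by `head(df)` so that the example is self-contained; with so few rows the numbers are illustrative only. `demo`, `features` and `corr_matrix` are names chosen for this sketch.

```r
# The six rows printed by head(df) above, retyped so the example runs on its own.
demo <- data.frame(
  CUST_ID = c(1001, 1002, 1003, 1004, 1005, 1006),
  MONTH_1 = c(301, 276, 301, 351, 351, 251),
  MONTH_2 = c(291, 201, 210, 270, 310, 231),
  MONTH_3 = c(317, 287, 264, 334, 342, 211),
  MONTH_4 = c(335, 316, 253, 330, 299, 214),
  MONTH_5 = c(287, 241, 276, 323, 297, 240),
  MONTH_6 = c(323, 254, 228, 292, 285, 240),
  CLV     = c(30361, 25909, 25162, 31333, 30306, 22440)
)

# Drop the identifier, then compute pairwise Pearson correlations.
features <- demo[, names(demo) != "CUST_ID"]
corr_matrix <- cor(features)

# The CLV column shows how strongly each month tracks lifetime value.
print(round(corr_matrix[, "CLV"], 2))
```

With the corrplot package loaded earlier, `corrplot(corr_matrix, method = "number")` would draw the same matrix as a plot.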

We start off with the training and testing split, in a ratio of roughly 90:10. Each row is assigned to the training set with probability 0.9 by a random draw from `rbinom()`, so the exact sizes vary from run to run.

```
# Flag each row as training (1) with probability 0.9, otherwise testing (0)
df$isTrain <- rbinom(nrow(df), 1, 0.900)
train <- subset(df, isTrain == 1)
test <- subset(df, isTrain == 0)
```

We print out the size to make sure that everything looks okay.
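A sketch of that size check, on a hypothetical 1000-row stand-in for `df`. The seed value and the `sample()` alternative are assumptions added here for reproducibility and comparison; they are not part of the original script.

```r
set.seed(42)  # assumption: any fixed seed, so the split is reproducible

# Hypothetical stand-in for df with 1000 customers.
demo <- data.frame(CUST_ID = 1001:2000)

# Bernoulli draw per row: each row lands in the training set with probability 0.9.
demo$isTrain <- rbinom(nrow(demo), 1, 0.9)
train_demo <- subset(demo, isTrain == 1)
test_demo  <- subset(demo, isTrain == 0)

# The split is only approximately 90:10.
cat("train rows:", nrow(train_demo), "\n")
cat("test rows: ", nrow(test_demo), "\n")

# Alternative: sample row indices to get exactly 90% in training.
train_idx   <- sample(nrow(demo), size = floor(0.9 * nrow(demo)))
exact_train <- demo[train_idx, ]
exact_test  <- demo[-train_idx, ]
cat("exact split:", nrow(exact_train), "/", nrow(exact_test), "\n")
```

The `rbinom()` approach is simple but gives a different split size on every run; sampling indices fixes the sizes.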

We then go on to build a model, starting with linear regression. `lm()` fits the model and `summary()` prints the intercept and coefficients, which give us the actual linear regression equation. Note that the formula below regresses MONTH_1 on MONTH_6; CLV is not the response, so this is not yet a model of CLV.

```
modelTrain <- lm(MONTH_1 ~ MONTH_6, data = train)
summary(modelTrain)
```

```
##
## Call:
## lm(formula = MONTH_1 ~ MONTH_6, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.896 -28.291 -0.399 26.956 57.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 275.49841 8.74158 31.516 < 2e-16 ***
## MONTH_6 0.10237 0.03327 3.077 0.00215 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.34 on 890 degrees of freedom
## Multiple R-squared: 0.01053, Adjusted R-squared: 0.009415
## F-statistic: 9.468 on 1 and 890 DF, p-value: 0.002155
```
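The model above has MONTH_1 as its response. A regression that targets CLV directly, as the text describes, might look like the following sketch. It runs on synthetic data with invented coefficients and noise, so the printed numbers say nothing about history.csv; `demo` and `clv_model` are illustrative names.

```r
set.seed(1)  # assumption: fixed seed for a reproducible illustration

# Synthetic stand-in for the training data; all numbers are invented.
n <- 200
months <- matrix(round(runif(n * 6, 200, 400)), ncol = 6)
colnames(months) <- paste0("MONTH_", 1:6)
demo <- as.data.frame(months)
demo$CLV <- 1000 + 15 * rowSums(months) + rnorm(n, sd = 300)

# CLV is the response; the six monthly revenues are the predictors.
clv_model <- lm(
  CLV ~ MONTH_1 + MONTH_2 + MONTH_3 + MONTH_4 + MONTH_5 + MONTH_6,
  data = demo
)

# Intercept and slopes form the fitted regression equation.
print(coef(clv_model))
print(summary(clv_model)$r.squared)
```

On the real data, the same formula with `data = train` would fit CLV on the six monthly columns.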

Then we evaluate the model on the testing data by generating predictions. We also look at a score for the regression model that tells us how accurate it is; here that score is the correlation between the actual CLV values and the predictions.

```
test$predNew <- predict(modelTrain, newdata = test)
actuals_preds <- data.frame(cbind(actuals=test$CLV, predicteds=test$predNew)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
```

```
## actuals predicteds
## actuals 1.0000000 0.7462575
## predicteds 0.7462575 1.0000000
```

```
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max)) # higher is better
mape <- mean(abs(actuals_preds$predicteds - actuals_preds$actuals) / actuals_preds$actuals) # mean absolute percentage deviation, lower is better
```
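To make the two measures concrete, here is a small worked example on two invented actual/predicted pairs (the values and the name `pairs_demo` are assumptions chosen so the arithmetic is easy to follow):

```r
# Two invented pairs: predictions are off by 10% in each direction.
pairs_demo <- data.frame(
  actuals    = c(100, 200),
  predicteds = c(110, 180)
)

# Min-max accuracy: row-wise smaller value divided by larger value, averaged.
# Here: (100/110 + 180/200) / 2
min_max_accuracy <- mean(apply(pairs_demo, 1, min) / apply(pairs_demo, 1, max))

# MAPE: absolute error relative to the actual value, averaged.
# Here: (10/100 + 20/200) / 2 = 0.1
mape <- mean(abs(pairs_demo$predicteds - pairs_demo$actuals) / pairs_demo$actuals)

print(min_max_accuracy)
print(mape)
```

Both measures compare values on the same scale, so they are only meaningful when the predictions are predictions of the same quantity as the actuals.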

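The original text reports an accuracy of 98% at this point. The R output above does not reproduce that figure: the correlation between actual CLV and the predictions on the test set is about 0.75, and `min_max_accuracy` and `mape` are computed but never printed.

Now, how do we predict for new customers?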

Suppose there is a new customer who has been with us for 3 months. We take the revenue from those first 3 months and build an array of six monthly values: the first 3 are the observed values, and the remaining 3 are set to zero. The code below does not do this yet; it fits a second regression, MONTH_3 on MONTH_6, and compares it with the first model by AIC.

```
modelTrain2 <- lm(MONTH_3 ~ MONTH_6, data = train)
summary(modelTrain2)
```

```
##
## Call:
## lm(formula = MONTH_3 ~ MONTH_6, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.175 -24.977 -0.378 25.223 74.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.39461 8.12013 20.74 <2e-16 ***
## MONTH_6 0.39994 0.03091 12.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.83 on 890 degrees of freedom
## Multiple R-squared: 0.1584, Adjusted R-squared: 0.1574
## F-statistic: 167.5 on 1 and 890 DF, p-value: < 2.2e-16
```

`AIC(modelTrain) #Lower the better`

`## [1] 8895.242`

`AIC(modelTrain2)`

`## [1] 8763.68`

```
test$predNew2 <- predict(modelTrain2, newdata = test)
actuals_preds <- data.frame(cbind(actuals=test$CLV, predicteds=test$predNew2)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
```

```
## actuals predicteds
## actuals 1.0000000 0.7462575
## predicteds 0.7462575 1.0000000
```

`#Accuracy measure formulae from: http://rstatistics.net/linear-regression-with-r-a-numeric-example/`

We then use the model to predict and print the CLV for the new customer.
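A sketch of that prediction step follows. It assumes a model with CLV as the response, fitted here on synthetic data with invented coefficients; the new customer's revenue values are also hypothetical. `demo`, `clv_model` and `new_customer` are illustrative names.

```r
set.seed(1)  # assumption: fixed seed for a reproducible illustration

# Synthetic stand-in for the training data; all numbers are invented.
n <- 200
months <- matrix(round(runif(n * 6, 200, 400)), ncol = 6)
colnames(months) <- paste0("MONTH_", 1:6)
demo <- as.data.frame(months)
demo$CLV <- 1000 + 15 * rowSums(months) + rnorm(n, sd = 300)

clv_model <- lm(
  CLV ~ MONTH_1 + MONTH_2 + MONTH_3 + MONTH_4 + MONTH_5 + MONTH_6,
  data = demo
)

# Hypothetical new customer: three observed months, the rest set to zero.
new_customer <- data.frame(
  MONTH_1 = 100, MONTH_2 = 100, MONTH_3 = 100,
  MONTH_4 = 0,   MONTH_5 = 0,   MONTH_6 = 0
)

# predict() applies the fitted equation to the new row.
predicted_clv <- predict(clv_model, newdata = new_customer)
cat("Predicted CLV:", round(predicted_clv), "\n")
```

Zeros lie far outside the range of monthly revenue the model was trained on, so a prediction like this is an extrapolation and should be read with caution.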

So this is how we can build a linear regression model for CLV and use it to predict CLV for new customers.