Customer lifetime value (CLV):

1. Identification of profitable customers: finding the customers that are likely to generate a high net profit is one of the core aims of CLV analysis.
2. Minimization of acquisition costs: CLV analysis tells you how much you should pay for the acquisition of a new customer. If a particular customer is too expensive to acquire, you should focus on another, more promising customer.
3. Efficient organisation of CRM: once you know the CLV of your customers, you can adjust promotions, recommendations and customer service accordingly.

Note: the identification of special segments among your customers is the main goal of cluster analysis; it is not something CLV analysis can do.
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.4
## corrplot 0.84 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sales <- read.csv("salesData.csv")
colnames(sales)
## [1] "id" "nItems" "mostFreqStore"
## [4] "mostFreqCat" "nCats" "preferredBrand"
## [7] "nBrands" "nPurch" "salesLast3Mon"
## [10] "salesThisMon" "daysSinceLastPurch" "meanItemPrice"
## [13] "meanShoppingCartValue" "customerDuration"
str(sales,give.attr=FALSE)
## 'data.frame': 5122 obs. of 14 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ nItems : int 1469 1463 262 293 108 216 174 122 204 308 ...
## $ mostFreqStore : chr "Stockton" "Stockton" "Colorado Springs" "Colorado Springs" ...
## $ mostFreqCat : chr "Alcohol" "Alcohol" "Shoes" "Bakery" ...
## $ nCats : int 72 73 55 50 32 41 36 31 41 52 ...
## $ preferredBrand : chr "Veina" "Veina" "Bo" "Veina" ...
## $ nBrands : int 517 482 126 108 79 98 78 62 99 103 ...
## $ nPurch : int 82 88 56 43 18 35 34 12 26 33 ...
## $ salesLast3Mon : num 2742 2791 1530 1766 1180 ...
## $ salesThisMon : num 1284 1243 683 730 553 ...
## $ daysSinceLastPurch : int 1 1 1 1 12 2 2 4 14 1 ...
## $ meanItemPrice : num 1.87 1.91 5.84 6.03 10.93 ...
## $ meanShoppingCartValue: num 33.4 31.7 27.3 41.1 65.6 ...
## $ customerDuration : int 821 657 548 596 603 673 612 517 709 480 ...
head(sales %>% select_if(is.numeric))
## id nItems nCats nBrands nPurch salesLast3Mon salesThisMon daysSinceLastPurch
## 1 1 1469 72 517 82 2741.97 1283.87 1
## 2 2 1463 73 482 88 2790.58 1242.60 1
## 3 3 262 55 126 56 1529.55 682.57 1
## 4 4 293 50 108 43 1765.81 730.23 1
## 5 5 108 32 79 18 1180.00 552.54 12
## 6 6 216 41 98 35 1345.29 662.52 2
## meanItemPrice meanShoppingCartValue customerDuration
## 1 1.866555 33.43866 821
## 2 1.907437 31.71114 657
## 3 5.837977 27.31339 548
## 4 6.026655 41.06535 596
## 5 10.925926 65.55556 603
## 6 6.228194 38.43686 673
head(sales %>% select_if(is.numeric) %>% select(-id))
## nItems nCats nBrands nPurch salesLast3Mon salesThisMon daysSinceLastPurch
## 1 1469 72 517 82 2741.97 1283.87 1
## 2 1463 73 482 88 2790.58 1242.60 1
## 3 262 55 126 56 1529.55 682.57 1
## 4 293 50 108 43 1765.81 730.23 1
## 5 108 32 79 18 1180.00 552.54 12
## 6 216 41 98 35 1345.29 662.52 2
## meanItemPrice meanShoppingCartValue customerDuration
## 1 1.866555 33.43866 821
## 2 1.907437 31.71114 657
## 3 5.837977 27.31339 548
## 4 6.026655 41.06535 596
## 5 10.925926 65.55556 603
## 6 6.228194 38.43686 673
The sample correlation coefficient is calculated as

$$ r = \frac{s_{xy}}{s_x \, s_y} $$

where $s_x$ and $s_y$ are the sample standard deviations and $s_{xy}$ is the sample covariance.

The population correlation coefficient is calculated as

$$ \rho = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y} $$

where $\sigma_x$ and $\sigma_y$ are the population standard deviations and $\sigma_{xy}$ is the population covariance. Correlation values close to 1 in absolute value indicate a stronger correlation.
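As a quick sanity check of the formula (a minimal sketch we add here, using two of the sales columns loaded above; the name r_manual is ours):

# sample correlation from its definition: covariance over the product of SDs
x <- sales$salesLast3Mon
y <- sales$salesThisMon
r_manual <- cov(x, y) / (sd(x) * sd(y))
c(r_manual, cor(x, y))  # the two values should be identical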
# correlation
# Visualization of correlations
head(sales %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor())
## nItems nCats nBrands nPurch salesLast3Mon
## nItems 1.0000000 0.8488235 0.9416391 0.6703290 0.8935192
## nCats 0.8488235 1.0000000 0.9033997 0.5958623 0.8848045
## nBrands 0.9416391 0.9033997 1.0000000 0.6350477 0.8943407
## nPurch 0.6703290 0.5958623 0.6350477 1.0000000 0.6361522
## salesLast3Mon 0.8935192 0.8848045 0.8943407 0.6361522 1.0000000
## salesThisMon 0.7108909 0.6709461 0.6963680 0.4914052 0.7701776
## salesThisMon daysSinceLastPurch meanItemPrice
## nItems 0.7108909 -0.3124199 -0.4275983
## nCats 0.6709461 -0.3691781 -0.5811185
## nBrands 0.6963680 -0.3174433 -0.4659714
## nPurch 0.4914052 -0.3850449 -0.3704709
## salesLast3Mon 0.7701776 -0.3939762 -0.5571719
## salesThisMon 1.0000000 -0.2646649 -0.3833332
## meanShoppingCartValue customerDuration
## nItems -0.3267883 -0.0006594666
## nCats -0.3600195 0.0026478288
## nBrands -0.3266744 -0.0044130246
## nPurch -0.6471322 0.0159686784
## salesLast3Mon -0.3453611 -0.0039518655
## salesThisMon -0.2074043 0.4640231298
Values close to 1 indicate stronger correlation; we can plot the correlation matrix as shown below:
# plotting the correlation matrix
sales %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor() %>%
  corrplot()
# most frequent store vs. sales this month
ggplot(sales) +
  geom_boxplot(aes(x = mostFreqStore, y = salesThisMon))
ggplot(sales) +
geom_boxplot(aes(x = preferredBrand, y = salesThisMon))
We choose the margin in year 1 as the predictor, since the correlation between margin and future margin is the highest.
clvData1 <- read.csv("clvData1.csv")
head(clvData1)
## customerID nOrders nItems daysSinceLastOrder margin returnRatio shareOwnBrand
## 1 2 4 7 4 35.77 0.25 0.67
## 2 3 3 4 272 25.74 0.44 0.33
## 3 4 12 25 12 63.32 0.15 0.86
## 4 5 16 29 32 53.74 0.03 0.96
## 5 6 1 2 47 35.85 0.00 1.00
## 6 7 2 8 19 22.02 0.18 0.00
## shareVoucher shareSale gender age marginPerOrder marginPerItem itemsPerOrder
## 1 0.17 0.00 female 56 8.94 5.11 1.75
## 2 0.00 0.67 male 37 8.58 6.43 1.33
## 3 0.38 0.29 male 32 5.28 2.53 2.08
## 4 0.17 0.33 female 43 3.36 1.85 1.81
## 5 0.00 1.00 male 48 35.85 17.93 2.00
## 6 0.86 0.14 female 31 11.01 2.75 4.00
## futureMargin
## 1 57.62
## 2 29.69
## 3 56.26
## 4 58.84
## 5 29.31
## 6 35.72
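Before fitting, we can verify the claim above that margin has the strongest correlation with futureMargin by ranking the correlations (a quick check we add here):

# correlation of each numeric variable with futureMargin, strongest first
clvData1 %>%
  select_if(is.numeric) %>%
  select(-customerID) %>%
  cor() %>%
  .[, "futureMargin"] %>%
  sort(decreasing = TRUE)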
# we want to see the relation between futureMargin and margin
simpleLM <- lm(futureMargin ~ margin, data = clvData1)
summary(simpleLM)
##
## Call:
## lm(formula = futureMargin ~ margin, data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.055 -9.258 0.727 10.060 49.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.63068 0.49374 25.58 <2e-16 ***
## margin 0.64543 0.01467 43.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.24 on 4189 degrees of freedom
## Multiple R-squared: 0.3159, Adjusted R-squared: 0.3158
## F-statistic: 1935 on 1 and 4189 DF, p-value: < 2.2e-16
ggplot(clvData1, aes(margin, futureMargin)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  xlab("Margin year 1") +
  ylab("Margin year 2")
## `geom_smooth()` using formula 'y ~ x'
The estimated regression line is shown in the figure above.
Explained and Unexplained Variation
First we explain the multiple R-squared value in the above model.

### Multiple R-squared (goodness-of-fit measure)

A linear model comes with three measures of variation, the sums of squares (SS): SST (total sum of squares), SSE (explained sum of squares) and SSR (residual sum of squares), with SST = SSE + SSR.

R-squared measures the goodness of fit of the model: the proportion of the total variation that is explained by the regression,

$$ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} $$

i.e. we can either take the explained proportion directly, or subtract the unexplained (residual) proportion from 1.
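As a numerical check (our own addition, reusing the simpleLM object fitted above), the reported R-squared can be reproduced from the sums of squares:

# reproduce Multiple R-squared from the sums of squares
obs  <- clvData1$futureMargin
fit  <- fitted(simpleLM)
SST  <- sum((obs - mean(obs))^2)  # total variation
SSR  <- sum((obs - fit)^2)        # residual (unexplained) variation
SSE  <- SST - SSR                 # explained variation
c(SSE / SST, 1 - SSR / SST)       # both equal summary(simpleLM)$r.squared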
In the above model, the R-squared of 0.3159 is interpreted as: about 31.6% of the variation in futureMargin is explained by the regression, and the rest is due to error. An R-squared greater than 0.25 is often considered a good fit in this kind of application.

R-squared shows the joint impact of the features on the target within the sample; the F-statistic addresses the same joint impact, but for the whole population.

### F-test

The F-test compares your model with the zero-predictor (intercept-only) model and decides whether your added coefficients improve the model. The null hypothesis is that all of the regression coefficients are equal to zero. The F value is simply MSR/MSE (not its square root). Here the p-value of the F-statistic is < 2.2e-16, well below 0.05, so we can reject the null hypothesis: our model is significant.

### t-test (inference about the slope)

A slope shows the effect a one-unit change in that variable has on futureMargin (the target) if all other variables are held constant. E.g. the margin coefficient of 0.645 shows that a one-unit increase in margin increases the target by about 0.645 euros.

The Pr(>|t|) column in the model output is the probability of observing any value equal to or larger than |t| under the null hypothesis. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor (margin) and the response (futureMargin) purely by chance. Typically, a p-value of 5% or less is a good cut-off point: if the p-value in the last column is smaller than 0.05, we conclude that the coefficient is significantly different from 0 at the .05 significance level.
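The t value in the coefficient table is just the estimate divided by its standard error; we can reproduce it for margin (a small check we add here):

# t value = Estimate / Std. Error
cf <- summary(simpleLM)$coefficients
cf["margin", "Estimate"] / cf["margin", "Std. Error"]  # ~43.98, as reported above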
Refresher t-test: Mathematically, the t-test takes a sample from each of the two sets and establishes the problem statement by assuming a null hypothesis that the two means are equal. Based on the applicable formulas, certain values are calculated and compared against the standard values, and the assumed null hypothesis is accepted or rejected accordingly.
If the null hypothesis qualifies to be rejected, it indicates that the data readings are strong and probably not due to chance. The t-test is just one of many tests used for this purpose; statisticians use other tests to examine more variables or larger sample sizes. For a large sample size, statisticians use a z-test. Other testing options include the chi-square test and the F-test.
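To make the refresher concrete, here is a toy two-sample t-test on simulated data (illustration only; the numbers are made up):

# H0: the two group means are equal
set.seed(42)
groupA <- rnorm(100, mean = 50, sd = 10)
groupB <- rnorm(100, mean = 53, sd = 10)
t.test(groupA, groupB)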
salesThisMon is also strongly correlated with salesLast3Mon, so we fit a simple linear model:
salesSimpleModel <- lm(salesThisMon ~ salesLast3Mon,
data = sales)
# Looking at model summary
summary(salesSimpleModel)
##
## Call:
## lm(formula = salesThisMon ~ salesLast3Mon, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.18 -68.26 3.21 72.98 605.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.690501 6.083886 16.39 <2e-16 ***
## salesLast3Mon 0.382696 0.004429 86.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117.5 on 5120 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5931
## F-statistic: 7465 on 1 and 5120 DF, p-value: < 2.2e-16
The estimate for salesLast3Mon (0.382696) is positive, which shows that customers with more sales in the last three months also tend to show increased sales this month. The multiple R-squared shows that 59.3% of the variance in this month's sales can be explained by the sales of the last three months.
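The fitted model can be used for prediction; e.g. for a hypothetical customer with 1500 in sales over the last three months (the value 1500 is our own illustration):

# predicted sales this month: 99.69 + 0.3827 * 1500 ≈ 673.7
predict(salesSimpleModel, newdata = data.frame(salesLast3Mon = 1500))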
colnames(clvData1)
## [1] "customerID" "nOrders" "nItems"
## [4] "daysSinceLastOrder" "margin" "returnRatio"
## [7] "shareOwnBrand" "shareVoucher" "shareSale"
## [10] "gender" "age" "marginPerOrder"
## [13] "marginPerItem" "itemsPerOrder" "futureMargin"
multipleLM <- lm(futureMargin ~ . - customerID, data = clvData1)
summary(multipleLM)
##
## Call:
## lm(formula = futureMargin ~ . - customerID, data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.830 -8.926 0.557 9.473 49.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.528666 1.435062 15.699 < 2e-16 ***
## nOrders -0.031825 0.122980 -0.259 0.79581
## nItems 0.137517 0.070997 1.937 0.05282 .
## daysSinceLastOrder -0.016521 0.002683 -6.157 8.12e-10 ***
## margin 0.402783 0.027298 14.755 < 2e-16 ***
## returnRatio -1.944799 0.601547 -3.233 0.00123 **
## shareOwnBrand 7.654707 0.678893 11.275 < 2e-16 ***
## shareVoucher -1.830182 0.669253 -2.735 0.00627 **
## shareSale -2.964308 0.690573 -4.293 1.81e-05 ***
## gendermale 0.179593 0.429459 0.418 0.67583
## age -0.010303 0.017257 -0.597 0.55051
## marginPerOrder -0.202354 0.091411 -2.214 0.02691 *
## marginPerItem 0.021231 0.109703 0.194 0.84655
## itemsPerOrder 0.102576 0.540835 0.190 0.84958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.85 on 4177 degrees of freedom
## Multiple R-squared: 0.3547, Adjusted R-squared: 0.3527
## F-statistic: 176.6 on 13 and 4177 DF, p-value: < 2.2e-16
# plotting the correlation matrix
clvData1 %>% select_if(is.numeric) %>%
  select(-customerID) %>%
  cor() %>%
  corrplot()
There is high correlation between nOrders and nItems, and between marginPerOrder and marginPerItem; they are candidates for removal.

# Multicollinearity

Important points about multicollinearity:

1. Multicollinearity is a statistical concept where independent variables in a model are correlated.
2. Multicollinearity among independent variables results in less reliable statistical inferences.
3. It is better to use independent variables that are not correlated or repetitive when building multiple regression models that use two or more variables.
Variance inflation factors (VIF):

1. The VIF provides a measure of multicollinearity among the independent variables in a multiple regression model.
2. Detecting multicollinearity is important because, while multicollinearity does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
3. A large VIF on an independent variable indicates a highly collinear relationship with the other variables, which should be considered or adjusted for in the structure of the model and the selection of independent variables.

The VIF indicates the increase in the variance of an estimated coefficient due to multicollinearity. A VIF higher than 5 is problematic, and values above 10 indicate poor regression estimates. Let us check the VIFs of the above linear model, as shown below.
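Before doing so, here is a minimal by-hand sketch (our own addition) of where a single VIF value comes from; vif() from the rms package below does this for every predictor:

# VIF of nItems: regress it on all the *other* predictors,
# then VIF = 1 / (1 - R^2) of that auxiliary regression
aux <- lm(nItems ~ . - customerID - futureMargin, data = clvData1)
1 / (1 - summary(aux)$r.squared)  # should match the vif() value for nItems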
library(rms)
## Warning: package 'rms' was built under R version 4.0.4
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.4
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.0.4
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
vif(multipleLM)
## nOrders nItems daysSinceLastOrder margin
## 11.565731 13.141486 1.368208 3.658257
## returnRatio shareOwnBrand shareVoucher shareSale
## 1.311476 1.363515 1.181329 1.148697
## gendermale age marginPerOrder marginPerItem
## 1.003452 1.026513 8.977661 7.782651
## itemsPerOrder
## 6.657435
A VIF higher than 5 is problematic and values above 10 indicate poor regression estimates. Hence we remove 'nItems' and 'marginPerOrder'.
colnames(clvData1)
## [1] "customerID" "nOrders" "nItems"
## [4] "daysSinceLastOrder" "margin" "returnRatio"
## [7] "shareOwnBrand" "shareVoucher" "shareSale"
## [10] "gender" "age" "marginPerOrder"
## [13] "marginPerItem" "itemsPerOrder" "futureMargin"
multipleLM2 <- lm(futureMargin ~ . - customerID - nItems - marginPerOrder, data = clvData1)
summary(multipleLM2)
##
## Call:
## lm(formula = futureMargin ~ . - customerID - nItems - marginPerOrder,
## data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.659 -8.827 0.483 9.561 50.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.798064 1.287806 17.703 < 2e-16 ***
## nOrders 0.220255 0.061347 3.590 0.000334 ***
## daysSinceLastOrder -0.017180 0.002675 -6.422 1.49e-10 ***
## margin 0.404200 0.026983 14.980 < 2e-16 ***
## returnRatio -1.992829 0.601214 -3.315 0.000925 ***
## shareOwnBrand 7.568686 0.677572 11.170 < 2e-16 ***
## shareVoucher -1.750877 0.669017 -2.617 0.008900 **
## shareSale -2.942525 0.691108 -4.258 2.11e-05 ***
## gendermale 0.203813 0.430136 0.474 0.635643
## age -0.015158 0.017245 -0.879 0.379462
## marginPerItem -0.197277 0.051160 -3.856 0.000117 ***
## itemsPerOrder -0.270260 0.261458 -1.034 0.301354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.87 on 4179 degrees of freedom
## Multiple R-squared: 0.3522, Adjusted R-squared: 0.3504
## F-statistic: 206.5 on 11 and 4179 DF, p-value: < 2.2e-16
t-values: all variables are significant at the 95% confidence level except those without a significance star: gendermale, age and itemsPerOrder.
A t-test shows whether or not the respective coefficient is 0. The null hypothesis of the t-test is that the slope beta = 0, i.e. there is no linear relation between the feature and the target. The coefficient's t-value measures how many standard deviations the coefficient estimate is away from 0 (H0). We want it to be far from zero, as this would let us reject the null hypothesis and declare that a relationship between the feature and futureMargin exists. Rejecting H0 also indicates that the observed effect is strong and probably not due to chance.
vif(multipleLM2)
## nOrders daysSinceLastOrder margin returnRatio
## 2.868060 1.354986 3.561828 1.305490
## shareOwnBrand shareVoucher shareSale gendermale
## 1.353513 1.176411 1.146499 1.003132
## age marginPerItem itemsPerOrder
## 1.021518 1.686746 1.550524
VIF: none of the VIF values exceed 5, so no significant collinearity remains in this model.
R-squared: 0.3522, which shows that about 35% of the variance of the dependent variable (futureMargin) is explained jointly by the independent variables in the regression model. In other words, 35% of the fluctuation in futureMargin is accounted for by the predictors, and the remaining 65% is due to error.
The F-statistic shows the joint impact of the features on the target. Its p-value is < 2.2e-16, so the model as a whole is significant.
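Since multipleLM2 is nested in multipleLM (we only dropped nItems and marginPerOrder), a partial F-test can compare the two fits directly (our own check, using the model objects from above):

# H0: the coefficients of the dropped variables are zero
anova(multipleLM2, multipleLM)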