Customer lifetime value (CLV):

1. Identification of profitable customers: finding the customers that are likely to generate a high net profit is one of the core aims of CLV analysis.
2. Minimization of acquisition costs: CLV analysis tells you how much you should pay for the acquisition of a new customer. If a particular customer is too expensive to acquire, you should focus on another, more promising customer.
3. Efficient organisation of CRM: once you know the CLV of your customers, you can adjust promotions, recommendations and customer service accordingly.

Note: the identification of special segments among your customers is the main goal of cluster analysis; it is not something CLV analysis can do.
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.4
## corrplot 0.84 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sales <- read.csv("salesData.csv")
colnames(sales)
## [1] "id" "nItems" "mostFreqStore"
## [4] "mostFreqCat" "nCats" "preferredBrand"
## [7] "nBrands" "nPurch" "salesLast3Mon"
## [10] "salesThisMon" "daysSinceLastPurch" "meanItemPrice"
## [13] "meanShoppingCartValue" "customerDuration"
str(sales,give.attr=FALSE)
## 'data.frame': 5122 obs. of 14 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ nItems : int 1469 1463 262 293 108 216 174 122 204 308 ...
## $ mostFreqStore : chr "Stockton" "Stockton" "Colorado Springs" "Colorado Springs" ...
## $ mostFreqCat : chr "Alcohol" "Alcohol" "Shoes" "Bakery" ...
## $ nCats : int 72 73 55 50 32 41 36 31 41 52 ...
## $ preferredBrand : chr "Veina" "Veina" "Bo" "Veina" ...
## $ nBrands : int 517 482 126 108 79 98 78 62 99 103 ...
## $ nPurch : int 82 88 56 43 18 35 34 12 26 33 ...
## $ salesLast3Mon : num 2742 2791 1530 1766 1180 ...
## $ salesThisMon : num 1284 1243 683 730 553 ...
## $ daysSinceLastPurch : int 1 1 1 1 12 2 2 4 14 1 ...
## $ meanItemPrice : num 1.87 1.91 5.84 6.03 10.93 ...
## $ meanShoppingCartValue: num 33.4 31.7 27.3 41.1 65.6 ...
## $ customerDuration : int 821 657 548 596 603 673 612 517 709 480 ...
head(sales %>% select_if(is.numeric))
## id nItems nCats nBrands nPurch salesLast3Mon salesThisMon daysSinceLastPurch
## 1 1 1469 72 517 82 2741.97 1283.87 1
## 2 2 1463 73 482 88 2790.58 1242.60 1
## 3 3 262 55 126 56 1529.55 682.57 1
## 4 4 293 50 108 43 1765.81 730.23 1
## 5 5 108 32 79 18 1180.00 552.54 12
## 6 6 216 41 98 35 1345.29 662.52 2
## meanItemPrice meanShoppingCartValue customerDuration
## 1 1.866555 33.43866 821
## 2 1.907437 31.71114 657
## 3 5.837977 27.31339 548
## 4 6.026655 41.06535 596
## 5 10.925926 65.55556 603
## 6 6.228194 38.43686 673
head(sales %>% select_if(is.numeric) %>% select(-id))
## nItems nCats nBrands nPurch salesLast3Mon salesThisMon daysSinceLastPurch
## 1 1469 72 517 82 2741.97 1283.87 1
## 2 1463 73 482 88 2790.58 1242.60 1
## 3 262 55 126 56 1529.55 682.57 1
## 4 293 50 108 43 1765.81 730.23 1
## 5 108 32 79 18 1180.00 552.54 12
## 6 216 41 98 35 1345.29 662.52 2
## meanItemPrice meanShoppingCartValue customerDuration
## 1 1.866555 33.43866 821
## 2 1.907437 31.71114 657
## 3 5.837977 27.31339 548
## 4 6.026655 41.06535 596
## 5 10.925926 65.55556 603
## 6 6.228194 38.43686 673
The sample correlation coefficient is calculated as

$$ r = \frac{s_{xy}}{s_x \, s_y} $$

where $s_x$ and $s_y$ are the sample standard deviations and $s_{xy}$ is the sample covariance.

The population correlation coefficient is calculated as

$$ \rho = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y} $$

where $\sigma_x$ and $\sigma_y$ are the population standard deviations and $\sigma_{xy}$ is the population covariance. Correlation values close to 1 in absolute value indicate a stronger correlation.
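As a quick sanity check of the formula (a minimal sketch we add here, using two of the sales columns loaded above; the name r_manual is ours):

# sample correlation from its definition: covariance over the product of SDs
x <- sales$salesLast3Mon
y <- sales$salesThisMon
r_manual <- cov(x, y) / (sd(x) * sd(y))
c(r_manual, cor(x, y))  # the two values should be identical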
# correlation
# Visualization of correlations
head(sales %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor())
## nItems nCats nBrands nPurch salesLast3Mon
## nItems 1.0000000 0.8488235 0.9416391 0.6703290 0.8935192
## nCats 0.8488235 1.0000000 0.9033997 0.5958623 0.8848045
## nBrands 0.9416391 0.9033997 1.0000000 0.6350477 0.8943407
## nPurch 0.6703290 0.5958623 0.6350477 1.0000000 0.6361522
## salesLast3Mon 0.8935192 0.8848045 0.8943407 0.6361522 1.0000000
## salesThisMon 0.7108909 0.6709461 0.6963680 0.4914052 0.7701776
## salesThisMon daysSinceLastPurch meanItemPrice
## nItems 0.7108909 -0.3124199 -0.4275983
## nCats 0.6709461 -0.3691781 -0.5811185
## nBrands 0.6963680 -0.3174433 -0.4659714
## nPurch 0.4914052 -0.3850449 -0.3704709
## salesLast3Mon 0.7701776 -0.3939762 -0.5571719
## salesThisMon 1.0000000 -0.2646649 -0.3833332
## meanShoppingCartValue customerDuration
## nItems -0.3267883 -0.0006594666
## nCats -0.3600195 0.0026478288
## nBrands -0.3266744 -0.0044130246
## nPurch -0.6471322 0.0159686784
## salesLast3Mon -0.3453611 -0.0039518655
## salesThisMon -0.2074043 0.4640231298
Values close to 1 indicate stronger correlation; we can plot the correlation matrix as shown below:
# plotting the correlation matrix
sales %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor() %>%
  corrplot()
# most frequent store vs. sales this month
ggplot(sales) +
  geom_boxplot(aes(x = mostFreqStore, y = salesThisMon))
ggplot(sales) +
geom_boxplot(aes(x = preferredBrand, y = salesThisMon))
We choose the margin in year 1 as the predictor, since the correlation between margin and future margin is the highest.
clvData1 <- read.csv("clvData1.csv")
head(clvData1)
## customerID nOrders nItems daysSinceLastOrder margin returnRatio shareOwnBrand
## 1 2 4 7 4 35.77 0.25 0.67
## 2 3 3 4 272 25.74 0.44 0.33
## 3 4 12 25 12 63.32 0.15 0.86
## 4 5 16 29 32 53.74 0.03 0.96
## 5 6 1 2 47 35.85 0.00 1.00
## 6 7 2 8 19 22.02 0.18 0.00
## shareVoucher shareSale gender age marginPerOrder marginPerItem itemsPerOrder
## 1 0.17 0.00 female 56 8.94 5.11 1.75
## 2 0.00 0.67 male 37 8.58 6.43 1.33
## 3 0.38 0.29 male 32 5.28 2.53 2.08
## 4 0.17 0.33 female 43 3.36 1.85 1.81
## 5 0.00 1.00 male 48 35.85 17.93 2.00
## 6 0.86 0.14 female 31 11.01 2.75 4.00
## futureMargin
## 1 57.62
## 2 29.69
## 3 56.26
## 4 58.84
## 5 29.31
## 6 35.72
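Before fitting, we can verify the claim above that margin has the strongest correlation with futureMargin by ranking the correlations (a quick check we add here):

# correlation of each numeric variable with futureMargin, strongest first
clvData1 %>%
  select_if(is.numeric) %>%
  select(-customerID) %>%
  cor() %>%
  .[, "futureMargin"] %>%
  sort(decreasing = TRUE)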
# we want to see the relation between futureMargin and margin
simpleLM <- lm(futureMargin ~ margin, data = clvData1)
summary(simpleLM)
##
## Call:
## lm(formula = futureMargin ~ margin, data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.055 -9.258 0.727 10.060 49.869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.63068 0.49374 25.58 <2e-16 ***
## margin 0.64543 0.01467 43.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.24 on 4189 degrees of freedom
## Multiple R-squared: 0.3159, Adjusted R-squared: 0.3158
## F-statistic: 1935 on 1 and 4189 DF, p-value: < 2.2e-16
ggplot(clvData1, aes(margin, futureMargin)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  xlab("Margin year 1") +
  ylab("Margin year 2")
## `geom_smooth()` using formula 'y ~ x'
The estimated regression line is shown in the figure above.
Explained and Unexplained Variation
First we explain the multiple R-squared value in the above model.

### Multiple R-squared (goodness-of-fit measure)

A linear model comes with three measures of variation, the sums of squares (SS): SST (total sum of squares), SSE (explained sum of squares) and SSR (residual sum of squares), with SST = SSE + SSR.

R-squared measures the goodness of fit of the model: the proportion of the total variation that is explained by the regression,

$$ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} $$

i.e. we can either take the explained proportion directly, or subtract the unexplained (residual) proportion from 1.
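As a numerical check (our own addition, reusing the simpleLM object fitted above), the reported R-squared can be reproduced from the sums of squares:

# reproduce Multiple R-squared from the sums of squares
obs  <- clvData1$futureMargin
fit  <- fitted(simpleLM)
SST  <- sum((obs - mean(obs))^2)  # total variation
SSR  <- sum((obs - fit)^2)        # residual (unexplained) variation
SSE  <- SST - SSR                 # explained variation
c(SSE / SST, 1 - SSR / SST)       # both equal summary(simpleLM)$r.squared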
In the above model, the R-squared of 0.3159 is interpreted as: about 31.6% of the variation in futureMargin is explained by the regression, and the rest is due to error. An R-squared greater than 0.25 is often considered a good fit in this kind of application.

R-squared shows the joint impact of the features on the target within the sample; the F-statistic addresses the same joint impact, but for the whole population.

### F-test

The F-test compares your model with the zero-predictor (intercept-only) model and decides whether your added coefficients improve the model. The null hypothesis is that all of the regression coefficients are equal to zero. The F value is simply MSR/MSE (not its square root). Here the p-value of the F-statistic is < 2.2e-16, well below 0.05, so we can reject the null hypothesis: our model is significant.

### t-test (inference about the slope)

A slope shows the effect a one-unit change in that variable has on futureMargin (the target) if all other variables are held constant. E.g. the margin coefficient of 0.645 shows that a one-unit increase in margin increases the target by about 0.645 euros.

The Pr(>|t|) column in the model output is the probability of observing any value equal to or larger than |t| under the null hypothesis. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor (margin) and the response (futureMargin) purely by chance. Typically, a p-value of 5% or less is a good cut-off point: if the p-value in the last column is smaller than 0.05, we conclude that the coefficient is significantly different from 0 at the .05 significance level.
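The t value in the coefficient table is just the estimate divided by its standard error; we can reproduce it for margin (a small check we add here):

# t value = Estimate / Std. Error
cf <- summary(simpleLM)$coefficients
cf["margin", "Estimate"] / cf["margin", "Std. Error"]  # ~43.98, as reported above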
Refresher t-test: Mathematically, the t-test takes a sample from each of the two sets and establishes the problem statement by assuming a null hypothesis that the two means are equal. Based on the applicable formulas, certain values are calculated and compared against the standard values, and the assumed null hypothesis is accepted or rejected accordingly.
If the null hypothesis qualifies to be rejected, it indicates that the data readings are strong and probably not due to chance. The t-test is just one of many tests used for this purpose; statisticians use other tests to examine more variables or larger sample sizes. For a large sample size, statisticians use a z-test. Other testing options include the chi-square test and the F-test.
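To make the refresher concrete, here is a toy two-sample t-test on simulated data (illustration only; the numbers are made up):

# H0: the two group means are equal
set.seed(42)
groupA <- rnorm(100, mean = 50, sd = 10)
groupB <- rnorm(100, mean = 53, sd = 10)
t.test(groupA, groupB)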
salesThisMon is also strongly correlated with salesLast3Mon, so we fit a simple linear model:
salesSimpleModel <- lm(salesThisMon ~ salesLast3Mon,
data = sales)
# Looking at model summary
summary(salesSimpleModel)
##
## Call:
## lm(formula = salesThisMon ~ salesLast3Mon, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570.18 -68.26 3.21 72.98 605.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.690501 6.083886 16.39 <2e-16 ***
## salesLast3Mon 0.382696 0.004429 86.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117.5 on 5120 degrees of freedom
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5931
## F-statistic: 7465 on 1 and 5120 DF, p-value: < 2.2e-16
The estimate for salesLast3Mon (0.382696) is positive, which shows that customers with more sales in the last three months also tend to show increased sales this month. The multiple R-squared shows that 59.3% of the variance in this month's sales can be explained by the sales of the last three months.
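The fitted model can be used for prediction; e.g. for a hypothetical customer with 1500 in sales over the last three months (the value 1500 is our own illustration):

# predicted sales this month: 99.69 + 0.3827 * 1500 ≈ 673.7
predict(salesSimpleModel, newdata = data.frame(salesLast3Mon = 1500))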
colnames(clvData1)
## [1] "customerID" "nOrders" "nItems"
## [4] "daysSinceLastOrder" "margin" "returnRatio"
## [7] "shareOwnBrand" "shareVoucher" "shareSale"
## [10] "gender" "age" "marginPerOrder"
## [13] "marginPerItem" "itemsPerOrder" "futureMargin"
multipleLM <- lm(futureMargin ~ . - customerID, data = clvData1)
summary(multipleLM)
##
## Call:
## lm(formula = futureMargin ~ . - customerID, data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.830 -8.926 0.557 9.473 49.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.528666 1.435062 15.699 < 2e-16 ***
## nOrders -0.031825 0.122980 -0.259 0.79581
## nItems 0.137517 0.070997 1.937 0.05282 .
## daysSinceLastOrder -0.016521 0.002683 -6.157 8.12e-10 ***
## margin 0.402783 0.027298 14.755 < 2e-16 ***
## returnRatio -1.944799 0.601547 -3.233 0.00123 **
## shareOwnBrand 7.654707 0.678893 11.275 < 2e-16 ***
## shareVoucher -1.830182 0.669253 -2.735 0.00627 **
## shareSale -2.964308 0.690573 -4.293 1.81e-05 ***
## gendermale 0.179593 0.429459 0.418 0.67583
## age -0.010303 0.017257 -0.597 0.55051
## marginPerOrder -0.202354 0.091411 -2.214 0.02691 *
## marginPerItem 0.021231 0.109703 0.194 0.84655
## itemsPerOrder 0.102576 0.540835 0.190 0.84958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.85 on 4177 degrees of freedom
## Multiple R-squared: 0.3547, Adjusted R-squared: 0.3527
## F-statistic: 176.6 on 13 and 4177 DF, p-value: < 2.2e-16
# plotting the correlation matrix
clvData1 %>% select_if(is.numeric) %>%
  select(-customerID) %>%
  cor() %>%
  corrplot()
There is high correlation between nOrders and nItems, and between marginPerOrder and marginPerItem; they are candidates for removal.

# Multicollinearity

Important points about multicollinearity:

1. Multicollinearity is a statistical concept where independent variables in a model are correlated.
2. Multicollinearity among independent variables results in less reliable statistical inferences.
3. It is better to use independent variables that are not correlated or repetitive when building multiple regression models that use two or more variables.
Variance inflation factors (VIF):

1. The VIF provides a measure of multicollinearity among the independent variables in a multiple regression model.
2. Detecting multicollinearity is important because, while multicollinearity does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
3. A large VIF on an independent variable indicates a highly collinear relationship with the other variables, which should be considered or adjusted for in the structure of the model and the selection of independent variables.

The VIF indicates the increase in the variance of an estimated coefficient due to multicollinearity. A VIF higher than 5 is problematic, and values above 10 indicate poor regression estimates. Let us check the VIFs of the above linear model, as shown below.
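Before doing so, here is a minimal by-hand sketch (our own addition) of where a single VIF value comes from; vif() from the rms package below does this for every predictor:

# VIF of nItems: regress it on all the *other* predictors,
# then VIF = 1 / (1 - R^2) of that auxiliary regression
aux <- lm(nItems ~ . - customerID - futureMargin, data = clvData1)
1 / (1 - summary(aux)$r.squared)  # should match the vif() value for nItems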
library(rms)
## Warning: package 'rms' was built under R version 4.0.4
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.4
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.0.4
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
vif(multipleLM)
## nOrders nItems daysSinceLastOrder margin
## 11.565731 13.141486 1.368208 3.658257
## returnRatio shareOwnBrand shareVoucher shareSale
## 1.311476 1.363515 1.181329 1.148697
## gendermale age marginPerOrder marginPerItem
## 1.003452 1.026513 8.977661 7.782651
## itemsPerOrder
## 6.657435
A VIF higher than 5 is problematic and values above 10 indicate poor regression estimates. Hence we remove 'nItems' and 'marginPerOrder'.
colnames(clvData1)
## [1] "customerID" "nOrders" "nItems"
## [4] "daysSinceLastOrder" "margin" "returnRatio"
## [7] "shareOwnBrand" "shareVoucher" "shareSale"
## [10] "gender" "age" "marginPerOrder"
## [13] "marginPerItem" "itemsPerOrder" "futureMargin"
multipleLM2 <- lm(futureMargin ~ . - customerID - nItems - marginPerOrder, data = clvData1)
summary(multipleLM2)
##
## Call:
## lm(formula = futureMargin ~ . - customerID - nItems - marginPerOrder,
## data = clvData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.659 -8.827 0.483 9.561 50.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.798064 1.287806 17.703 < 2e-16 ***
## nOrders 0.220255 0.061347 3.590 0.000334 ***
## daysSinceLastOrder -0.017180 0.002675 -6.422 1.49e-10 ***
## margin 0.404200 0.026983 14.980 < 2e-16 ***
## returnRatio -1.992829 0.601214 -3.315 0.000925 ***
## shareOwnBrand 7.568686 0.677572 11.170 < 2e-16 ***
## shareVoucher -1.750877 0.669017 -2.617 0.008900 **
## shareSale -2.942525 0.691108 -4.258 2.11e-05 ***
## gendermale 0.203813 0.430136 0.474 0.635643
## age -0.015158 0.017245 -0.879 0.379462
## marginPerItem -0.197277 0.051160 -3.856 0.000117 ***
## itemsPerOrder -0.270260 0.261458 -1.034 0.301354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.87 on 4179 degrees of freedom
## Multiple R-squared: 0.3522, Adjusted R-squared: 0.3504
## F-statistic: 206.5 on 11 and 4179 DF, p-value: < 2.2e-16
t-values: all variables are significant at the 95% confidence level except those without a significance star: gendermale, age and itemsPerOrder.
A t-test shows whether or not the respective coefficient is 0. The null hypothesis of the t-test is that the slope beta = 0, i.e. there is no linear relation between the feature and the target. The coefficient's t-value measures how many standard deviations the coefficient estimate is away from 0 (H0). We want it to be far from zero, as this would let us reject the null hypothesis and declare that a relationship between the feature and futureMargin exists. Rejecting H0 also indicates that the observed effect is strong and probably not due to chance.
vif(multipleLM2)
## nOrders daysSinceLastOrder margin returnRatio
## 2.868060 1.354986 3.561828 1.305490
## shareOwnBrand shareVoucher shareSale gendermale
## 1.353513 1.176411 1.146499 1.003132
## age marginPerItem itemsPerOrder
## 1.021518 1.686746 1.550524
VIF: none of the VIF values exceed 5, so no significant collinearity remains in this model.
R-squared: 0.3522, which shows that about 35% of the variance of the dependent variable (futureMargin) is explained jointly by the independent variables in the regression model. In other words, 35% of the fluctuation in futureMargin is accounted for by the predictors, and the remaining 65% is due to error.
The F-statistic shows the joint impact of the features on the target. Its p-value is < 2.2e-16, so the model as a whole is significant.
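Since multipleLM2 is nested in multipleLM (we only dropped nItems and marginPerOrder), a partial F-test can compare the two fits directly (our own check, using the model objects from above):

# H0: the coefficients of the dropped variables are zero
anova(multipleLM2, multipleLM)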