Our analytics solution depends on our understanding of the business problem. The better we grasp how the business works and what challenges it faces, the better we can answer its needs. In this first part, we present our analytics solution:
Our business is a medium-sized store based in a large city. In its 20 years of existence, it has never done any kind of analytics. For the past few years, however, the shop has not been growing as hoped, so the new manager decided to study customer behaviour.
The manager is aware that his company is sitting on a treasure trove of data and simply lacks the skills and people to analyse and exploit it efficiently. He asked us to look into it.
Sales are not growing as planned. In recent years the store has introduced new offers and added a new parking garage, but sales have continued to grow too slowly for the store to achieve critical economies of scale.
On the analytics side, the store has nothing. They do not know who their customers are or where they should spend their marketing budget.
Digging insights out of the store's data will enable us to visualise customer behaviour in a new way and to create targeted marketing segmentations and campaigns based on a data-driven strategy.
After reviewing the available data, we propose three marketing analysis methods: segmentation, scoring models and customer lifetime value. Segmentation is about understanding your customers, scoring models are about targeting the right ones, and customer lifetime value is about anticipating their future value.
Our first goal is to understand the data that will be used to build each step of our analysis, and our second goal is to assess where the analysis might lack data or where data quality might suffer.
Our database presents the orders processed from 2006 to 2016. There are three variables: the customer ID, the order date and the order total.
## Customer_id Date Total
## 1 5540 2006-01-02 130
## 2 8150 2006-01-02 26
## 3 10210 2006-01-02 26
## 4 14140 2006-01-04 26
## 5 5640 2006-01-04 39
## 6 14150 2006-01-06 273
Here we describe the characteristics of our three variables. Over these ten years of orders, customer IDs range up to 264 200 and the average order is around $80. We can see two issues in the data that we have to handle: the 474 missing values and the outliers. As the outliers pull the mean up, the mean becomes an unreliable measure of central tendency; our dataset is too heavily skewed by a few extreme orders. Here, the median is a better way to understand the distribution.
## Customer_id Date Total
## Min. : 10 Min. :2006-01-02 Min. :-4500.00
## 1st Qu.: 57720 1st Qu.:2010-01-17 1st Qu.: 32.50
## Median :102440 Median :2012-11-22 Median : 39.00
## Mean :108935 Mean :2012-07-13 Mean : 82.37
## 3rd Qu.:160525 3rd Qu.:2014-12-29 3rd Qu.: 78.00
## Max. :264200 Max. :2016-12-30 Max. :24530.00
## NA's :474
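As a quick illustration of why the median is more robust here, consider a toy example (made-up numbers, not drawn from the database):
# 99 typical orders of $39 plus one extreme order of $24,530
x <- c(rep(39, 99), 24530)
mean(x)    # ~284: pulled far above the typical order by the single outlier
median(x)  # 39: unaffected by the outlier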
# Load the SQL library
library(sqldf)
# Select only the year as a numeric
Orders$YOrder = as.numeric(format(Orders$Date, "%Y"))
# Number of orders per year
Number_of_Orders = sqldf("SELECT YOrder, COUNT(YOrder) AS 'counter' FROM Orders GROUP BY 1 ORDER BY 1")
# Average "order total"" per year
AvgTotal = sqldf("SELECT YOrder, AVG(Total) AS 'Total' FROM Orders GROUP BY 1 ORDER BY 1")
par(mfrow=c(1,2))
barplot(Number_of_Orders$counter, names.arg = Number_of_Orders$YOrder, main="Number of Orders", col="#2C3E50")
barplot(AvgTotal$Total, names.arg = AvgTotal$YOrder, main = "Average per Orders ($)", col="#2C3E50")
Let's count the missing values per column and see how they are distributed within the dataset:
## Customer_id Date Total YOrder
## 0 0 474 0
## Customer_id Date YOrder Total
## 50769 1 1 1 1 0
## 474 1 1 1 0 1
## 0 0 0 474 474
# Load VIM for the aggr() missing-data plot
library(VIM)
aggr(Orders, prop = F, numbers = T)
Our missing values are found only in the Total column, not in the Date or Customer_id columns. After discussing with the manager, we learn that this is due to a change in their ERP system. To handle these missing values we decide to replace them with the median, as the mean is too heavily influenced by the outliers (a sketch of this replacement step is shown below). Now we have no more missing values:
## Customer_id Date Total YOrder
## 0 0 0 0
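The replacement itself is not shown above; a minimal sketch of how it could be done (this exact code is an assumption, not taken from the report):
# Replace missing order totals with the median of the observed totals
Orders$Total[is.na(Orders$Total)] <- median(Orders$Total, na.rm = TRUE)
# colSums(is.na(Orders)) would then reproduce the all-zero check shown above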
# Boxplot of the order totals; a clear outlier pattern is noticeable
boxplot(Total ~ Date, data=Orders, main="Outliers")
After presenting this to the manager, he told us we could remove the outliers and the negative values. Here is the data we will now work with:
## Customer_id Date Total YOrder
## Min. : 10 Min. :2006-01-02 Min. : 6.50 Min. :2006
## 1st Qu.: 57735 1st Qu.:2010-01-17 1st Qu.: 32.50 1st Qu.:2010
## Median :102440 Median :2012-11-22 Median : 39.00 Median :2012
## Mean :108939 Mean :2012-07-13 Mean : 80.83 Mean :2012
## 3rd Qu.:160525 3rd Qu.:2014-12-29 3rd Qu.: 71.50 3rd Qu.:2014
## Max. :264200 Max. :2016-12-30 Max. :5850.00 Max. :2016
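The filtering step itself is not shown; a minimal sketch of what it might look like, assuming a simple threshold rule (the exact cutoff used in the report is not stated, so the bounds below are only inferred from the summary above):
# Keep only strictly positive order totals and drop extreme values
Orders <- subset(Orders, Total > 0 & Total <= 5850)
summary(Orders)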
We can't treat all our customers in the same way, offer them the same products or charge them the same price, because this leaves too much value on the table. We need to build relevant segments of customers and use them to improve our relationships, offers and campaigns. Indeed, segmentation efficiently summarises mountains of data and makes them usable.
A good segmentation gathers similar entities together and separates different ones. It enables decision makers to treat customer segments differently without going down to the individual level, which would be too hard to manage.
Once a segment is defined, we need to describe it in simple terms. This enables managers to see the differences in customers' needs, desires and habits. Then they can customise offerings, adapt customer messages and optimise marketing campaigns.
Customers have to be similar, but similar on which variables? It depends on the business and the managerial question we are asking. Here our segmentation variables will be recency, frequency and monetary value, and our managerial goal is to understand customer behaviours and values. We are also limited by the amount of data available; for example, we don't have any information about each customer's age or sex.
These three variables are good predictors of future customer profitability and are easy to compute from the database presented above:
# Compute the number of days between each order date and 2017-01-01, and name this variable lastorders
Orders$lastorders= as.numeric(difftime(time1 = "2017-01-01", time2 = Orders$Date,units = "days"))
# Compute recency, frequency, and average order amount
rfmOrders = sqldf("SELECT Customer_id,
MIN(lastorders) AS 'recency',
COUNT(*) AS 'frequency',
AVG(Total) AS 'avgorder'
FROM Orders GROUP BY 1")
# Arrange by frequency (dplyr is used for the data manipulation below)
library(dplyr)
rfmOrders <- rfmOrders %>% arrange(desc(frequency))
# Display the first lines
head(rfmOrders)
## Customer_id recency frequency avgorder
## 1 9720 30.208333 45 57.34444
## 2 109370 2.208333 41 22.03659
## 3 119430 302.208333 38 19.18868
## 4 10720 35.208333 37 34.08108
## 5 1420 25.208333 34 65.57353
## 6 10640 37.208333 34 21.53412
Each of these variables answers a key question: recency tells us how recently a customer last purchased, frequency how often they purchase, and the average order amount how much they spend when they do.
Now that we have created the three variables, we can visualise the new distributions of the order totals and of the average order per customer:
par(mfrow=c(1,2))
# Plot the distribution of the frequency ~ Total
hist(Orders$Total, breaks = 80, main = "Orders", xlab = "Total", col="#2C3E50")
# Plot the distribution of the frequency ~ Avg order
hist(rfmOrders$avgorder, breaks = 80, main = "Average Orders (per customers)", xlab = "Average Orders (per customers)", col="#2C3E50")
As our data are not on the same scale, we risk finding irrelevant patterns and building a wrong segmentation. That is why we need to de-skew the distributions and make them more normal. Indeed, transforming the data is crucial, as it enables us to meet the assumptions our analysis requires:
# Use the logarithm for all the variables:
RerfmOrders <- rfmOrders
RerfmOrders$frequency <- log(rfmOrders$frequency)
RerfmOrders$recency <- log(rfmOrders$recency)
RerfmOrders$avgorder <- log(rfmOrders$avgorder)
# Print the first lines
head(RerfmOrders)
## Customer_id recency frequency avgorder
## 1 9720 3.4081178 3.806662 4.049076
## 2 109370 0.7922381 3.713572 3.092704
## 3 119430 5.7111166 3.637586 2.954321
## 4 10720 3.5612828 3.610918 3.528742
## 5 1420 3.2271746 3.526361 4.183172
## 6 10640 3.6165328 3.526361 3.069639
# Have a look on the summary
summary(RerfmOrders)
## Customer_id recency frequency avgorder
## Min. : 10 Min. :0.7922 Min. :0.0000 Min. :1.872
## 1st Qu.: 81990 1st Qu.:5.5021 1st Qu.:0.0000 1st Qu.:3.376
## Median :136430 Median :6.9765 Median :0.6931 Median :3.664
## Mean :137574 Mean :6.2743 Mean :0.6656 Mean :3.844
## 3rd Qu.:195100 3rd Qu.:7.6644 3rd Qu.:1.0986 3rd Qu.:4.174
## Max. :264200 Max. :8.2978 Max. :3.8067 Max. :8.674
# Plot row
par(mfrow=c(1,1))
# Plot the distribution of the log
hist(RerfmOrders$avgorder, breaks = 25, main = "Log (Average Orders (per customers))", xlab = "Log (Average Orders (per customers))", col="#2C3E50")
There are other common methods to transform data as well.
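One common alternative, shown here purely as an illustration (an assumption, not necessarily one of the methods the report had in mind), is z-score standardisation, which centres and rescales each variable:
# Alternative to the log-transform: centre and scale each variable (z-scores)
RerfmOrders_scaled <- rfmOrders
RerfmOrders_scaled[, c("recency", "frequency", "avgorder")] <-
  scale(rfmOrders[, c("recency", "frequency", "avgorder")])
summary(RerfmOrders_scaled)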
Now we can create five clusters of customers using our three transformed variables. It is important to note that without this transformation step we might find very different clusters from the ones displayed below.
# 1. Run k-means with 5 clusters (nstart = 20)
# Drop one-order customers (log-frequency of 0), then keep only the three RFM columns
RerfmOrders <- RerfmOrders %>% filter(frequency > 0)
RerfmOrders_ <- RerfmOrders %>% select(recency:avgorder)
RerfmOrders_km <- kmeans(RerfmOrders_, centers = 5, nstart = 20)
# Plot using plotly
library(plotly)
# 3D scatter plot of the clusters (commented out in the report)
#p <- plot_ly(RerfmOrders, x = RerfmOrders$recency, y = RerfmOrders$avgorder, z = RerfmOrders$frequency, type = "scatter3d", mode = "markers", color = RerfmOrders_km$cluster) %>% layout(showlegend = FALSE)
#print(p)
While there is good separation on the recency variable, the separation is more nuanced for the frequency and the average order variables.
RerfmOrders_km$centers
## recency frequency avgorder
## 1 1.331202 1.5977582 4.206890
## 2 6.925577 1.2462405 4.979758
## 3 3.669905 1.6521907 4.105000
## 4 5.691762 1.4611416 3.843151
## 5 7.427947 0.9981392 3.561406
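The choice of five clusters is not justified above; one common way to sanity-check it (a sketch added here, not part of the original analysis) is the elbow method, which looks at how the total within-cluster sum of squares drops as the number of clusters increases:
# Elbow method: total within-cluster sum of squares for k = 1..10
set.seed(42)  # arbitrary seed, for reproducibility of this sketch
wss <- sapply(1:10, function(k) kmeans(RerfmOrders_, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")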
Conclusion: this clustering method enables us to create targeted marketing segments. Now we need to think about how to exploit them.
From the segmentation approach above, we want to build segments of customers that are easy to manage. We work with the same database and add a variable capturing the first purchase of every customer to assess their loyalty:
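The query that builds this customer-level table is not reproduced in the report; a sketch of what it could look like (reusing the lastorders variable computed earlier; the exact code is an assumption):
customers_2016 = sqldf("SELECT Customer_id,
                               MIN(lastorders) AS 'recency',
                               MAX(lastorders) AS 'first_purchase',
                               COUNT(*) AS 'frequency',
                               AVG(Total) AS 'amount'
                        FROM Orders GROUP BY 1")
head(customers_2016)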
## Customer_id recency first_purchase frequency amount
## 1 10 3830.2083 3830.208 1 39.00000
## 2 80 344.2083 3752.208 7 92.85714
## 3 90 759.2083 3784.208 10 150.54000
## 4 120 1402.2083 1402.208 1 26.00000
## 5 130 2971.2083 3711.208 2 65.00000
## 6 160 2964.2083 3578.208 2 39.00000
To limit managerial complexity we restrict the number of segments created, but to preserve the value of our segmentation we create at least four segments that are usable by managers. Of course, as the optimal segmentation will vary over time, we will have to re-run our code to keep it up to date.
We define as active a customer who purchased something within the last 12 months; as warm a customer whose last purchase happened in the year before that, i.e. between 13 and 24 months ago; as cold a customer whose last purchase was between two and three years ago; and as inactive anyone who has not purchased anything for more than three years.
customers_2016$segment = "NA"
customers_2016$segment[which(customers_2016$recency > 365*3)] = "inactive"
customers_2016$segment[which(customers_2016$recency <= 365*3 & customers_2016$recency > 365*2)] = "cold"
customers_2016$segment[which(customers_2016$recency <= 365*2 & customers_2016$recency > 365*1)] = "warm"
customers_2016$segment[which(customers_2016$recency <= 365)] = "active"
Cust_2015 <-aggregate(x = customers_2016[, 2:5], by = list(customers_2016$segment), mean)
names(Cust_2015)[names(Cust_2015)=="Group.1"] <- "Segments"
head(Cust_2015)
## Segments recency first_purchase frequency amount
## 1 active 100.9490 1467.051 4.560393 93.60942
## 2 cold 859.0564 1433.468 2.303365 67.24480
## 3 inactive 2179.2029 2547.223 1.814390 62.33179
## 4 warm 491.1481 1320.799 2.871808 90.39982
Now we split segments into low-value and high-value groups, and we highlight our newest customers by calling them new warm or new active customers. Here are all our final segments and the number of customers within each of them:
# Complete segment solution using which, and exploiting previous test as input
customers_2016$segment = "NA"
customers_2016$segment[which(customers_2016$recency > 365*3)] = "inactive"
customers_2016$segment[which(customers_2016$recency <= 365*3 & customers_2016$recency > 365*2)] = "cold"
customers_2016$segment[which(customers_2016$recency <= 365*2 & customers_2016$recency > 365*1)] = "warm"
customers_2016$segment[which(customers_2016$recency <= 365)] = "active"
customers_2016$segment[which(customers_2016$segment == "warm" & customers_2016$first_purchase <= 365*2)] = "new warm"
customers_2016$segment[which(customers_2016$segment == "warm" & customers_2016$amount < 100)] = "warm low value"
customers_2016$segment[which(customers_2016$segment == "warm" & customers_2016$amount >= 100)] = "warm high value"
customers_2016$segment[which(customers_2016$segment == "active" & customers_2016$first_purchase <= 365)] = "new active"
customers_2016$segment[which(customers_2016$segment == "active" & customers_2016$amount < 100)] = "active low value"
customers_2016$segment[which(customers_2016$segment == "active" & customers_2016$amount >= 100)] = "active high value"
# Re-order factor in a way that makes sense
customers_2016$segment = factor(x = customers_2016$segment, levels = c("inactive", "cold",
"warm high value", "warm low value", "new warm",
"active high value", "active low value", "new active"))
table(customers_2016$segment)
##
## inactive cold warm high value warm low value
## 9159 1902 171 849
## new warm active high value active low value new active
## 938 795 3091 1512
pie(table(customers_2016$segment), col = topo.colors(24))
Our final segmentation is composed of eight segments. Such insights can improve managerial decisions at every level: we see how different customers are from one another and how differently they should be treated.
Cust_full_2015 <-aggregate(x = customers_2016[, 2:5], by = list(customers_2016$segment), mean)
names(Cust_full_2015)[names(Cust_full_2015)=="Group.1"] <- "Segments"
print(Cust_full_2015)
## Segments recency first_purchase frequency amount
## 1 inactive 2179.20287 2547.22340 1.814390 62.33179
## 2 cold 859.05639 1433.46806 2.303365 67.24480
## 3 warm high value 459.01535 2045.02705 4.959064 329.42552
## 4 warm low value 476.22482 2062.07170 4.469965 46.47849
## 5 new warm 510.51324 517.83093 1.044776 86.57873
## 6 active high value 91.91022 2051.92531 6.075472 255.67906
## 7 active low value 110.48883 1989.62567 5.890003 48.66532
## 8 new active 86.19907 91.22222 1.045635 100.27400
Here is the revenue from each segment over the last two years. We can repeat this process to see which customers change segment and when. For example, when a customer moves from "active high value" to "low value", we should see whether we can do anything about it.
par(mfrow=c(1,1))
since2years <- customers_2016 %>% filter(recency<720)
since2years <- since2years %>% group_by(segment) %>% summarise(sum=sum(amount,na.rm=TRUE))
barplot(since2years$sum, names.arg = since2years$segment, col="#2C3E50")
Conclusion: with these segmentation methods, we can treat different customers differently. For example, we can send special offers to recently acquired customers, or meet them in person to encourage loyalty. We also now see which customers are high value and which are not. That is crucial for developing valuable relationships.
We know how many customers there are within each segment and can visualise when a customer changes segment. We could even lay out a narrative for each segment, like: "I'm John, I'm 52 years old, I made my first purchase three months ago for a total of $20, and I wonder whether I'll make a new purchase in the future."
A great segmentation finds a good balance between usability and completeness: simplifying enough that it remains usable, but not so much that it loses value. As much as possible, segments have to be similar, measurable and accessible. Put another way, segments have to be both statistically relevant and managerially relevant.
Here we build a model to predict how much money our customers are going to spend over the next 12 months. We use the same database as above and compute three new variables: the maximum amount spent, the revenue generated in 2016, and a binary variable indicating whether a customer bought anything in 2016 (1) or nothing (0).
# Extract the predictors: (from 2015)
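# (days_since and year_of_purchase are assumed to be pre-computed columns on Orders,
#  equivalent to the lastorders and YOrder variables created earlier)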
customers_2015 = sqldf("SELECT customer_id,
MIN(days_since) - 365 AS 'recency',
MAX(days_since) - 365 AS 'first_purchase',
COUNT(*) AS 'frequency',
AVG(Total) AS 'avg_amount',
MAX(Total) AS 'max_amount'
FROM Orders
WHERE days_since > 365
GROUP BY 1")
# Compute revenues generated by customers in 2016
revenue_2016 = sqldf("SELECT customer_id, SUM(Total) AS 'revenue_2016'
FROM Orders
WHERE year_of_purchase = 2016
GROUP BY 1")
# Merge 2015 customers and 2016 revenue
in_sample = merge(customers_2015, revenue_2016, all.x = TRUE)
in_sample$revenue_2016[is.na(in_sample$revenue_2016)] = 0
in_sample$active_2015 = as.numeric(in_sample$revenue_2016 > 0)
# Display calibration (in-sample) data
head(in_sample)
## Customer_id recency first_purchase frequency avg_amount max_amount
## 1 10 3465.2083 3465.208 1 39.00 39.0
## 2 80 303.2083 3387.208 6 91.00 104.0
## 3 90 394.2083 3419.208 10 150.54 198.9
## 4 120 1037.2083 1037.208 1 26.00 26.0
## 5 130 2606.2083 3346.208 2 65.00 78.0
## 6 160 2599.2083 3213.208 2 39.00 39.0
## revenue_2016 active_2015
## 1 0 0
## 2 104 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
summary(in_sample)
## Customer_id recency first_purchase frequency
## Min. : 10 Min. : 2.208 Min. : 2.208 Min. : 1.000
## 1st Qu.: 77710 1st Qu.: 259.208 1st Qu.: 797.208 1st Qu.: 1.000
## Median :127140 Median : 891.208 Median :1892.208 Median : 2.000
## Mean :127315 Mean :1123.817 Mean :1789.618 Mean : 2.664
## 3rd Qu.:181800 3rd Qu.:1869.208 3rd Qu.:2696.208 3rd Qu.: 3.000
## Max. :245840 Max. :3650.208 Max. :3652.208 Max. :40.000
## avg_amount max_amount revenue_2016 active_2015
## Min. : 6.50 Min. : 6.50 Min. : 0.00 Min. :0.0000
## 1st Qu.: 28.17 1st Qu.: 32.50 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 39.00 Median : 39.00 Median : 0.00 Median :0.0000
## Mean : 72.03 Mean : 85.35 Mean : 27.58 Mean :0.2299
## 3rd Qu.: 65.00 3rd Qu.: 78.00 3rd Qu.: 0.00 3rd Qu.:0.0000
## Max. :5850.00 Max. :5850.00 Max. :5850.00 Max. :1.0000
We now have seven variables that individually do not say much, but combined together tell an interesting story. To grasp this story we are going to build a calibration model.
library(nnet)
prob.model = multinom(formula = active_2015 ~ recency + first_purchase + frequency + avg_amount + max_amount,
data = in_sample)
## # weights: 7 (6 variable)
## initial value 11717.653087
## iter 10 value 6232.735194
## final value 6184.162328
## converged
The model predicts each customer's probability of being active. The importance of each predictor is shown by its weight (coefficient) and its statistical significance. Weights that are large relative to their standard errors indicate good predictors; the others contribute very little to the predictions.
coef = summary(prob.model)$coefficients
std = summary(prob.model)$standard.errors
# Ratio
print(coef / std)
## (Intercept) recency first_purchase frequency avg_amount
## -12.0208933 -32.7428558 -0.3052296 14.8451174 1.1274078
## max_amount
## -0.5583795
Our results show to what extent each parameter is significant. We see that recency and frequency are the most meaningful predictors in our model.
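For readers who prefer formal p-values over eyeballing these ratios, a small addition (not in the original report) using the standard normal approximation for the z-values:
# Two-sided p-values for the coefficient / standard-error ratios (z-values)
z_values <- coef / std
p_values <- 2 * (1 - pnorm(abs(z_values)))
print(round(p_values, 4))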
Now, for our monetary model, we need to select only those customers who made a purchase. Here we try to predict how much active customers are going to spend next year. Note that we use a logarithmic transform to reduce the influence of the few outliers.
# Select only active customer:
z = which(in_sample$active_2015 == 1)
# Calibrate the monetary model, using a log-transform
amount.model = lm(formula = log(revenue_2016) ~ log(avg_amount) + log(max_amount), data = in_sample[z, ])
summary(amount.model)
##
## Call:
## lm(formula = log(revenue_2016) ~ log(avg_amount) + log(max_amount),
## data = in_sample[z, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7891 -0.1802 -0.0742 0.1851 3.5658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38312 0.04282 8.946 <2e-16 ***
## log(avg_amount) 0.54477 0.04161 13.091 <2e-16 ***
## log(max_amount) 0.39295 0.03787 10.375 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 3883 degrees of freedom
## Multiple R-squared: 0.6925, Adjusted R-squared: 0.6924
## F-statistic: 4373 on 2 and 3883 DF, p-value: < 2.2e-16
# Plot the results of this new monetary model
## The fitted values are the value predicted by the model
plot(x = log(in_sample[z, ]$revenue_2016), y = amount.model$fitted.values, col="#2C3E50", xlab = "revenue_2016", ylab = "fitted.values")
We are predicting two things: first, the probability that a customer will buy, and second, the amount they will spend if they do.
First we compute today's RFM variables (note that we now work from the full database, not just the calibration sample), then we predict the target variables based on today's data. Here are our predicted probabilities:
customers_2016 = sqldf("SELECT customer_id,
MIN(days_since) AS 'recency',
MAX(days_since) AS 'first_purchase',
COUNT(*) AS 'frequency',
AVG(Total) AS 'avg_amount',
MAX(Total) AS 'max_amount'
FROM Orders GROUP BY 1")
customers_2016$prob_predicted = predict(object = prob.model, newdata = customers_2016, type = "probs")
# To get back to the original scale from the logarithm, we apply the exponential:
customers_2016$revenue_predicted = exp(predict(object = amount.model, newdata = customers_2016))
customers_2016$score_predicted = customers_2016$prob_predicted * customers_2016$revenue_predicted
# Predicted probabilities:
summary(customers_2016$prob_predicted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002629 0.0126336 0.1061096 0.2249887 0.3978301 0.9999048
Now we look at the revenue predicted by our model. The median predicted amount per customer is about $46, while the mean is pulled up to about $85 by a few large predictions. We also have more information about the distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.485 38.138 45.536 85.305 73.664 4999.308
The third quantity we predict is the score, the product of the first two: the probability of buying times the predicted amount. It is the expected spending of each customer next year. The distribution of this expected spending runs from nearly 0 up to extreme values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004 0.591 5.919 24.508 23.341 3715.838
This last figure is important, as it tells us how many customers have an expected revenue of more than $50. We can even see who these customers are and build a special relationship with them. We find 2,006 customers with an expected revenue of more than $50, and we print the row indices of the first of them:
z = which(customers_2016$score_predicted > 50)
print(length(z))
## [1] 2006
## [1] 2 3 13 18 32 39
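The indices above are row positions rather than customer IDs; to list the actual IDs of these high-expected-revenue customers, one could add (not part of the original output):
# Customer IDs of the customers with a predicted score above $50
head(customers_2016$Customer_id[z])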
The point of a customer lifetime value model (or CLV) is to estimate, on average, the value of each customer from their first purchase to their last. The goal of such methods is to analyse what is happening today and what happened in the recent past in order to predict the revenue customers will generate in the future.
Customer lifetime value models have many other applications in practice. For instance, you could compare one acquisition campaign to another, or compare the full lifetime value of different customer segments.
This transition matrix shows how many customers switch from one segment to another. The rows correspond to 2015 and the columns to 2016. For instance, 48 customers who were inactive in 2015 became active high value in 2016. The next step is to understand why.
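# Note: customers_2015 is assumed to already carry a 'segment' column,
# built with the same segmentation rules applied to customers_2016 above.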
new_data = merge(x = customers_2015, y = customers_2016, by = "Customer_id", all.x = TRUE)
transition = table(new_data$segment.x, new_data$segment.y)
print(transition)
##
## inactive cold warm high value warm low value new warm
## inactive 7227 0 0 0 0
## cold 1932 0 0 0 0
## warm high value 0 108 0 0 0
## warm low value 0 655 0 0 0
## new warm 0 1139 0 0 0
## active high value 0 0 171 0 0
## active low value 0 0 0 849 0
## new active 0 0 0 0 938
##
## active high value active low value new active
## inactive 48 237 0
## cold 30 192 0
## warm high value 44 0 0
## warm low value 5 254 0
## new warm 21 90 0
## active high value 495 13 0
## active low value 49 1909 0
## new active 103 396 0
The last row displays the new active customers. It is interesting to see that after their first purchase most of them become new warm rather than active high value. This row is relevant if we want to assess the results of an acquisition campaign.
Now, to see the percentages behind this transition matrix, we divide each row by its sum and obtain the matrix below. We can say, for example, that an inactive customer in 2015 had a 96% chance of remaining inactive. This matrix will be useful for making predictions.
transition = transition / rowSums(transition)
print(transition)
##
## inactive cold warm high value warm low value
## inactive 0.962060703 0.000000000 0.000000000 0.000000000
## cold 0.896935933 0.000000000 0.000000000 0.000000000
## warm high value 0.000000000 0.710526316 0.000000000 0.000000000
## warm low value 0.000000000 0.716630197 0.000000000 0.000000000
## new warm 0.000000000 0.911200000 0.000000000 0.000000000
## active high value 0.000000000 0.000000000 0.251840943 0.000000000
## active low value 0.000000000 0.000000000 0.000000000 0.302458140
## new active 0.000000000 0.000000000 0.000000000 0.000000000
##
## new warm active high value active low value
## inactive 0.000000000 0.006389776 0.031549521
## cold 0.000000000 0.013927577 0.089136490
## warm high value 0.000000000 0.289473684 0.000000000
## warm low value 0.000000000 0.005470460 0.277899344
## new warm 0.000000000 0.016800000 0.072000000
## active high value 0.000000000 0.729013255 0.019145803
## active low value 0.000000000 0.017456359 0.680085501
## new active 0.652748782 0.071677105 0.275574113
##
## new active
## inactive 0.000000000
## cold 0.000000000
## warm high value 0.000000000
## warm low value 0.000000000
## new warm 0.000000000
## active high value 0.000000000
## active low value 0.000000000
## new active 0.000000000
We can now project how many customers will move from one segment to another over the coming years. Our model computes the next three years (note that we do not take into account new customers acquired in 2016 and 2017). Below we present how the segments evolve:
# Initialize a matrix with the number of customers in each segment today and after 3 periods
segments = matrix(nrow = 8, ncol = 4)
segments[, 1] = table(customers_2016$segment)
colnames(segments) = 2017:2020
row.names(segments) = levels(customers_2016$segment)
# Compute the segment sizes for each subsequent period
for (i in 2:4) {
segments[, i] = segments[, i-1] %*% transition
}
# Display how segments will evolve over time
print(round(segments))
## 2017 2018 2019 2020
## inactive 9159 10517 11540 12637
## cold 1902 1585 1712 875
## warm high value 171 200 226 222
## warm low value 849 935 997 926
## new warm 938 987 0 0
## active high value 795 897 880 864
## active low value 3091 3296 3063 2893
## new active 1512 0 0 0
We now compute the revenue per segment for the three coming years. To do that we use our transition matrix and the average yearly revenue per segment. (Note that we still do not take into account new customers who will start buying in 2017 or later.)
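# Average yearly revenue per customer assumed for each segment, in the same order
# as the segment factor levels (inactive, cold, warm high value, warm low value,
# new warm, active high value, active low value, new active)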
yearly_revenue = c(0, 2, 60, 3, 10, 370, 60, 90)
revenue_per_segment = yearly_revenue * segments
print(revenue_per_segment)
## 2017 2018 2019 2020
## inactive 0 0.000 0.000 0.000
## cold 3804 3169.249 3423.095 1749.774
## warm high value 10260 12012.813 13551.299 13301.026
## warm low value 2547 2804.694 2990.711 2778.992
## new warm 9380 9869.562 0.000 0.000
## active high value 294150 331821.905 325693.640 319526.868
## active low value 185460 197760.328 183760.424 173604.449
## new active 136080 0.000 0.000 0.000
This analysis is a first step toward a more customer-oriented business. The managers can now customise their offerings, adapt their messages and optimise their marketing campaigns much better than a few months ago. They can also measure their results more accurately.
So... what's next? More analytics will require more data about the products sold and about the customers. We could run an association-rule algorithm to see which products tend to be bought together, and reorganise the store accordingly. On the customer side, it would be useful to know customers' location, sex and age; for example, women and men might have very different buying behaviours, and that is important to capture.
Finally, we propose to the store a plan to gather more data about their customers, and we present the next possible steps toward a more data-driven business. We show how applying machine learning to bigger datasets will improve prediction and segmentation for the store.