Our business is a medium-sized store located in a large city. In its 20 years of existence it has never done any kind of analytics. But for the past few years the shop has not been growing as hoped, and the new manager decided to study our customers' behaviour.
The manager is aware that companies are sitting on a treasure trove of data but usually lack the skills and people to analyse and exploit it efficiently. So he asked us to look into ours.
A few months ago, with only a single database of three variables (Customer ID, Total spent and Date), we presented a first analysis to the managers. Now we need to push it further: we want to know who our customers are and where we should invest our marketing budget.
After building a segmentation, scoring models and a customer lifetime value estimate, we have now gathered more data about our customers to build a better predictive model.
As the manager was worried that our data collection campaign would scare some customers away, we asked only three questions relevant to the business: do you have children? do you have pets? do you live far from the store? As we couldn't get too intimate, customers answered the children and pets questions with a simple yes or no. The employee then guessed whether the person was young, middle-aged or elderly, and whether they were a man or a woman. We therefore gathered five new variables.
Here are the five new variables displayed per customer, first with summary statistics of the raw numeric codes, then after recoding them as labelled factors.
## Customer_id sum age sex children distance pets
## 1 183010 24026.0 2 1 1 1 1
## 2 144610 22100.0 2 1 1 1 1
## 3 164930 18200.0 2 1 1 1 1
## 4 141340 17356.3 1 1 1 1 1
## 5 13610 15860.0 1 1 1 1 0
## 6 98790 15600.0 2 1 1 1 1
## Customer_id sum age sex
## Min. : 80 Min. :-4084.0 Min. :1.000 Min. :0.000
## 1st Qu.:104602 1st Qu.: 39.0 1st Qu.:1.000 1st Qu.:0.000
## Median :156875 Median : 78.0 Median :2.000 Median :0.000
## Mean :151336 Mean : 218.2 Mean :1.924 Mean :0.455
## 3rd Qu.:204758 3rd Qu.: 195.0 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :264200 Max. :24026.0 Max. :3.000 Max. :1.000
## children distance pets
## Min. :0.0000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000
## Median :0.0000 Median :3.000 Median :0.0000
## Mean :0.3639 Mean :2.564 Mean :0.3692
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :3.000 Max. :1.0000
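For readability, the numeric codes were then relabelled as factors. A minimal sketch of that step, with the code-to-label mapping inferred from the two summaries (the Long_term_value columns shown below are built the same way as in the cross-validation step further down):
library(dplyr)
# relabel the numeric survey codes as factors (mapping inferred from the summaries)
Customer <- Customer %>%
  mutate(age      = factor(age, levels = 1:3, labels = c("young", "middle-aged", "elderly")),
         sex      = factor(sex, levels = 0:1, labels = c("man", "woman")),
         children = factor(children, levels = 0:1, labels = c("no", "yes")),
         distance = factor(distance, levels = 1:3, labels = c("near", "in-middle", "far")),
         pets     = factor(pets, levels = 0:1, labels = c("no", "yes")))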
## Customer_id sum age sex
## Min. : 80 Min. :-4084.0 young :5264 man :8318
## 1st Qu.:104602 1st Qu.: 39.0 middle-aged:5888 woman:6944
## Median :156875 Median : 78.0 elderly :4110
## Mean :151336 Mean : 218.2
## 3rd Qu.:204758 3rd Qu.: 195.0
## Max. :264200 Max. :24026.0
## children distance pets Long_term_value
## no :9708 near : 130 no :9628 low_value :13613
## yes:5554 in-middle:6393 yes:5634 high_value: 1644
## far :8739 NA's : 5
##
##
##
## Long_term_value_MLevels
## very_low_value:11720
## low_value : 1893
## medium_value : 662
## good_value : 506
## high_value : 476
## NA's : 5
boxplot(sum ~ children, data=Customer, main="Outliers")
A clear pattern is noticeable. After presenting this graph to the manager, he told us we could remove the outliers and the negative values. A minimal sketch of that cleaning step is shown below, followed by the updated summary of our database.
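library(dplyr)
# drop the refunds (negative sums) and the extreme outliers;
# the 9000 cutoff is our assumption, chosen to be consistent with the
# new maximum of 8710 in the summary that follows
Customer <- Customer %>% filter(sum > 0, sum < 9000)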
## Customer_id sum age sex
## Min. : 80 Min. : 6.5 young :5262 man :8312
## 1st Qu.:104660 1st Qu.: 39.0 middle-aged:5872 woman:6927
## Median :156950 Median : 78.0 elderly :4105
## Mean :151384 Mean : 201.9
## 3rd Qu.:204795 3rd Qu.: 195.0
## Max. :264200 Max. :8710.0
## children distance pets Long_term_value
## no :9703 near : 112 no :9622 low_value :13613
## yes:5536 in-middle:6393 yes:5617 high_value: 1626
## far :8734
##
##
##
## Long_term_value_MLevels
## very_low_value:11720
## low_value : 1893
## medium_value : 662
## good_value : 506
## high_value : 458
##
The graph below shows the correlation between each pair of variables. The strength of a correlation is represented by the bubble size and its direction (positive or negative) by the colour. We can see that women with children and pets tend to buy more, whereas people living far from the store buy less.
library(corrplot)
library(dplyr)
# cor() needs numeric input, so the factor columns are converted back to numeric codes
Customer_Cor <- Customer %>%
  select(sum:pets) %>%
  mutate(across(age:pets, as.numeric))
M <- cor(Customer_Cor)
corrplot(M, method="circle")
We want to predict which customers are potentially high value, which are medium value and which are low value. Our target variable can therefore take three values: low_value, medium_value or high_value.
After setting up cross-validation to train and test our models, we try three machine learning algorithms: a decision tree, a naive Bayes classifier and k-nearest neighbours (KNN).
# Cross-validation dataset
Customer_cv <- Customer
# Build the 3 levels from the total spent: (0,100] = low, (100,400] = medium, (400,40000] = high
Customer_cv$Long_term_value <- cut(Customer_cv$sum, c(0, 100, 400, 40000))
levels(Customer_cv$Long_term_value) <- c('low_value', 'medium_value', 'high_value')
# Set the target variable as a factor (cut() already returns one; kept for safety)
Customer_cv$Long_term_value <- as.factor(Customer_cv$Long_term_value)
# Keep the predictors and the target only; sum is dropped so the model cannot peek at the answer
Customer_cv <- Customer_cv %>% select(age:Long_term_value)
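Before training anything, it is worth checking how unbalanced these three classes are, since that will matter when we read the results (the No Information Rate of 0.5775 reported with each confusion matrix below is exactly the share of the largest class, low_value):
table(Customer_cv$Long_term_value)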
# cross-validation
library(caret)
# note: 'repeats' only takes effect with method="repeatedcv"; plain "cv" ignores it
train_control <- trainControl(method="cv", number=5, repeats=3)
head(train_control)
## $method
## [1] "cv"
##
## $number
## [1] 5
##
## $repeats
## [1] 3
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
library("rpart.plot")## Loading required package: rpart
# grow a classification tree on the five survey variables
fit <- rpart(Long_term_value ~ age + sex + distance + children + pets,
             method = "class",
             data = Customer_cv,
             control = rpart.control(minsplit = 20),
             parms = list(split = 'information'))
rpart.plot(fit, type = 2, extra = 1)
This tree follows the correlation graph seen above. It says: if you are a woman, if you live close to the store, if you have children and if you have pets, then there is a good chance that you are a high-value customer.
Below we display the confusion matrix, which is key to comparing the models with each other (train() uses the cross-validation defined above to tune each model; the confusion matrix itself is computed on the full dataset):
library("rpart")
library("rpart.plot")
# train the model (first dropping the few rows with missing values)
Customer_cv <- na.omit(Customer_cv)
rpartmodel <- train(Long_term_value ~ ., data = Customer_cv, trControl = train_control, method = "rpart")
# make predictions
predictions <- predict(rpartmodel, Customer_cv)
Customer_cv_tree <- cbind(Customer_cv, predictions)
# summarize results
confusionMatrix <- confusionMatrix(Customer_cv_tree$predictions, Customer_cv_tree$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction low_value medium_value high_value
## low_value 7970 3591 617
## medium_value 489 686 262
## high_value 342 535 747
##
## Overall Statistics
##
## Accuracy : 0.617
## 95% CI : (0.6093, 0.6248)
## No Information Rate : 0.5775
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.23
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low_value Class: medium_value
## Sensitivity 0.9056 0.14256
## Specificity 0.3464 0.92798
## Pos Pred Value 0.6545 0.47738
## Neg Pred Value 0.7285 0.70106
## Prevalence 0.5775 0.31577
## Detection Rate 0.5230 0.04502
## Detection Prevalence 0.7991 0.09430
## Balanced Accuracy 0.6260 0.53527
## Class: high_value
## Sensitivity 0.45941
## Specificity 0.93558
## Pos Pred Value 0.45998
## Neg Pred Value 0.93544
## Prevalence 0.10670
## Detection Rate 0.04902
## Detection Prevalence 0.10657
## Balanced Accuracy 0.69749
# train the model
e1071model2 <- train(Long_term_value ~ ., data = Customer_cv, trControl = train_control, method = "nb")
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# make predictions
predictions <- predict(e1071model2, Customer_cv)
e1071modelbinded <- cbind(Customer_cv, predictions)
# summarize results
confusionMatrix <- confusionMatrix(e1071modelbinded$predictions, e1071modelbinded$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction low_value medium_value high_value
## low_value 7363 3066 522
## medium_value 1243 1399 434
## high_value 195 347 670
##
## Overall Statistics
##
## Accuracy : 0.6189
## 95% CI : (0.6112, 0.6267)
## No Information Rate : 0.5775
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2568
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low_value Class: medium_value
## Sensitivity 0.8366 0.2907
## Specificity 0.4427 0.8392
## Pos Pred Value 0.6724 0.4548
## Neg Pred Value 0.6646 0.7194
## Prevalence 0.5775 0.3158
## Detection Rate 0.4832 0.0918
## Detection Prevalence 0.7186 0.2019
## Balanced Accuracy 0.6396 0.5649
## Class: high_value
## Sensitivity 0.41205
## Specificity 0.96019
## Pos Pred Value 0.55281
## Neg Pred Value 0.93185
## Prevalence 0.10670
## Detection Rate 0.04397
## Detection Prevalence 0.07953
## Balanced Accuracy 0.68612
library(class)
# train the model
knnFit <- train(Long_term_value ~ ., data = Customer_cv, method = "knn",
                trControl = train_control, preProcess = c("center", "scale"),
                tuneLength = 10)
# make predictions
predictions <- predict(knnFit, Customer_cv)
knnFit_bind <- cbind(Customer_cv, predictions)
# summarize results
confusionMatrix <- confusionMatrix(knnFit_bind$predictions, knnFit_bind$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction low_value medium_value high_value
## low_value 7007 2435 227
## medium_value 1794 2285 882
## high_value 0 92 517
##
## Overall Statistics
##
## Accuracy : 0.6437
## 95% CI : (0.636, 0.6513)
## No Information Rate : 0.5775
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3232
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low_value Class: medium_value
## Sensitivity 0.7962 0.4749
## Specificity 0.5865 0.7434
## Pos Pred Value 0.7247 0.4606
## Neg Pred Value 0.6779 0.7541
## Prevalence 0.5775 0.3158
## Detection Rate 0.4598 0.1499
## Detection Prevalence 0.6345 0.3255
## Balanced Accuracy 0.6913 0.6091
## Class: high_value
## Sensitivity 0.31796
## Specificity 0.99324
## Pos Pred Value 0.84893
## Neg Pred Value 0.92420
## Prevalence 0.10670
## Detection Rate 0.03393
## Detection Prevalence 0.03996
## Balanced Accuracy 0.65560
As our classes are unbalanced (we have many more low-value than high-value customers), the Kappa measure reported with each confusion matrix is helpful for understanding the results. Kappa (or Cohen's Kappa) compares the observed accuracy with the accuracy expected by chance: for instance, a model that always predicts low_value would already reach about 58% accuracy on our three-class target (the No Information Rate above) despite being useless. Kappa corrects for that.
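As a sanity check, Kappa can be recomputed by hand from the KNN confusion matrix above (the counts below are copied from that table):
# observed accuracy: share of predictions on the diagonal
# expected accuracy: what a model matching both marginals would score by chance
cm <- matrix(c(7007, 1794, 0,
               2435, 2285, 92,
               227,  882,  517), nrow = 3)  # rows = prediction, columns = reference
p_o <- sum(diag(cm)) / sum(cm)
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
(p_o - p_e) / (1 - p_e)  # ~0.32, matching the Kappa reported for KNN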
Even though the differences in accuracy are small, the KNN algorithm has by far the best Kappa (0.32) and much better precision on the high_value class (a Pos Pred Value of 0.85 versus 0.46 for the tree and 0.55 for naive Bayes). Given the business's need to find high-value customers, we advise using the KNN algorithm. To go even further, we can propose an expected value framework that gives a different weight to each kind of prediction; a minimal sketch follows.
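The payoff matrix below is a hypothetical placeholder that the business would have to define itself (for instance, how much a premium campaign aimed at a true high-value customer earns versus what a wasted mailing costs):
# hypothetical payoffs: rows = our prediction (the action we take),
# columns = the customer's true class
payoff <- matrix(c( 0,  0,  0,    # predicted low: no action, no cost
                   -2,  5,  5,    # predicted medium: small campaign
                   -5,  5, 50),   # predicted high: premium campaign
                 nrow = 3, byrow = TRUE)
cm <- as.matrix(confusionMatrix$table)  # counts from the KNN model above
sum(payoff * cm) / sum(cm)              # expected value per customer targeted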
We can now infer how likely a customer is to be high value just by asking three simple questions. Of course, gathering more data on our customers could improve the model dramatically, and could also answer very different questions.
Find me on Twitter: LudoBenistant