After analysing our customers' behaviour, building a segmentation, and computing scoring and lifetime value, we now use this customer information to build a better predictive model. See the first part of the analysis here.
Find me on twitter: LudoBenistant



1 Data science problem


1.1 Business understanding

Our business is a medium-sized store located in a large city. In its 20 years of existence, it has never done any kind of analytics. But for a few years now the shop has not been growing as hoped, and the new manager has decided to study the customers' behaviour.

The manager is aware that companies are sitting on a treasure trove of data but usually lack the skills and people to analyse and exploit it efficiently. So he asked us to look into it.




1.2 Problem

A few months ago, with only a single database of three variables (Customer ID, Total spent and Date), we proposed an analysis to the managers. Now we need to push it further: we want to know who our customers are and where we should invest our marketing budget.




1.3 Solution

After building the segmentation, the scoring models and the customer lifetime value, we have now gathered more data about our customers to build a better predictive model.

As the manager was worried that our data collection campaign might scare some customers away, we asked only three questions relevant to the business: do you have children? Do you have pets? Do you live far from the store? Since we could not get too intimate, we asked customers to answer only yes or no to the children and pets questions. The employee then guessed whether the person was young, middle-aged or elderly, and whether they were a man or a woman. In total we gathered five new variables.




2 Data exploration

2.1 The data quality report

Here are the 5 new variables displayed per customer, together with some summary statistics.

##   Customer_id     sum age sex children distance pets
## 1      183010 24026.0   2   1        1        1    1
## 2      144610 22100.0   2   1        1        1    1
## 3      164930 18200.0   2   1        1        1    1
## 4      141340 17356.3   1   1        1        1    1
## 5       13610 15860.0   1   1        1        1    0
## 6       98790 15600.0   2   1        1        1    1
##   Customer_id          sum               age             sex       
##  Min.   :    80   Min.   :-4084.0   Min.   :1.000   Min.   :0.000  
##  1st Qu.:104602   1st Qu.:   39.0   1st Qu.:1.000   1st Qu.:0.000  
##  Median :156875   Median :   78.0   Median :2.000   Median :0.000  
##  Mean   :151336   Mean   :  218.2   Mean   :1.924   Mean   :0.455  
##  3rd Qu.:204758   3rd Qu.:  195.0   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :264200   Max.   :24026.0   Max.   :3.000   Max.   :1.000  
##     children         distance          pets       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :0.0000   Median :3.000   Median :0.0000  
##  Mean   :0.3639   Mean   :2.564   Mean   :0.3692  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :3.000   Max.   :1.0000
##   Customer_id          sum                   age          sex      
##  Min.   :    80   Min.   :-4084.0   young      :5264   man  :8318  
##  1st Qu.:104602   1st Qu.:   39.0   middle-aged:5888   woman:6944  
##  Median :156875   Median :   78.0   elderly    :4110               
##  Mean   :151336   Mean   :  218.2                                  
##  3rd Qu.:204758   3rd Qu.:  195.0                                  
##  Max.   :264200   Max.   :24026.0                                  
##  children        distance     pets        Long_term_value 
##  no :9708   near     : 130   no :9628   low_value :13613  
##  yes:5554   in-middle:6393   yes:5634   high_value: 1644  
##             far      :8739              NA's      :    5  
##                                                           
##                                                           
##                                                           
##    Long_term_value_MLevels
##  very_low_value:11720     
##  low_value     : 1893     
##  medium_value  :  662     
##  good_value    :  506     
##  high_value    :  476     
##  NA's          :    5
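
The first summary above shows the raw numeric coding of the survey answers; the second shows the same variables once recoded as labelled factors. Below is a minimal sketch of that recoding, with the code-to-label mapping inferred from the two summaries (an assumption, not the original code):

library(dplyr)

# Recode the numeric survey codes into labelled factors
# (mapping inferred from the summaries above; note that the correlation plot in
#  section 2.3 needs numeric columns, so it should be run before this step or on
#  a numeric copy kept aside)
Customer <- Customer %>%
  mutate(age      = factor(age,      levels = 1:3, labels = c("young", "middle-aged", "elderly")),
         sex      = factor(sex,      levels = 0:1, labels = c("man", "woman")),
         children = factor(children, levels = 0:1, labels = c("no", "yes")),
         pets     = factor(pets,     levels = 0:1, labels = c("no", "yes")),
         distance = factor(distance, levels = 1:3, labels = c("near", "in-middle", "far")))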



2.2 Handling data issues


# Boxplot of total spend by children: a clear pattern (and a few extreme outliers) is noticeable
boxplot(sum ~ children, data = Customer, main = "Outliers")

After presenting this graph to the manager, we were told we could remove the outliers and the negative values.
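
A minimal sketch of that cleaning step (the upper cutoff on sum is an assumption for illustration, not the value actually used):

library(dplyr)

# Drop negative totals, extreme outliers and the few rows with a missing target
# (the 10,000 cutoff is an assumed value, chosen by eye from the boxplot)
Customer <- Customer %>%
  filter(sum > 0, sum < 10000, !is.na(Long_term_value))

Here is our database now: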

##   Customer_id          sum                  age          sex      
##  Min.   :    80   Min.   :   6.5   young      :5262   man  :8312  
##  1st Qu.:104660   1st Qu.:  39.0   middle-aged:5872   woman:6927  
##  Median :156950   Median :  78.0   elderly    :4105               
##  Mean   :151384   Mean   : 201.9                                  
##  3rd Qu.:204795   3rd Qu.: 195.0                                  
##  Max.   :264200   Max.   :8710.0                                  
##  children        distance     pets        Long_term_value 
##  no :9703   near     : 112   no :9622   low_value :13613  
##  yes:5536   in-middle:6393   yes:5617   high_value: 1626  
##             far      :8734                                
##                                                           
##                                                           
##                                                           
##    Long_term_value_MLevels
##  very_low_value:11720     
##  low_value     : 1893     
##  medium_value  :  662     
##  good_value    :  506     
##  high_value    :  458     
## 



2.3 First visualisations

This graph shows the correlation between each pair of variables. The strength of a correlation is represented by the bubble size and its direction (positive or negative) by the colour. We can see that women with children and pets tend to buy more, whereas people living far from the store buy less.

library(corrplot)
library(dplyr)

# cor() needs numeric columns, so this uses the numerically coded variables
Customer_Cor <- Customer %>% select(sum:pets)
M <- cor(Customer_Cor)
corrplot(M, method="circle")




3 Modeling

3.1 Predictive modeling

We want to predict which customers are potentially high value, which are medium value and which are low value. Our target variable can therefore take three values: low_value, medium_value or high_value.

After setting up cross-validation to train and test our models, we try different machine learning algorithms: the first is a decision tree, the second a naive Bayes classifier and the third a k-nearest neighbours (KNN) classifier.

3.1.1 Cross-Validation

# Cross-validation dataset
Customer_cv <- Customer

# Build the 3 levels
Customer_cv$Long_term_value<-cut(Customer_cv$sum, c(0,100, 400, 40000))
levels(Customer_cv$Long_term_value) <- c('low_value', 'medium_value', 'high_value')

# Set the target variable as a factor
Customer_cv$Long_term_value <- as.factor(Customer_cv$Long_term_value)
Customer_cv <- Customer_cv %>% select(age:Long_term_value)

# cross-validation (trainControl comes from caret)
library(caret)
train_control <- trainControl(method="cv", number=5, repeats=3)
head(train_control)
## $method
## [1] "cv"
## 
## $number
## [1] 5
## 
## $repeats
## [1] 3
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL
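
Note that in caret the repeats argument only takes effect with method = "repeatedcv"; with method = "cv" it is ignored. If we actually want three repeats of 5-fold cross-validation, the control object would be:

# Three repeats of 5-fold cross-validation
train_control <- trainControl(method="repeatedcv", number=5, repeats=3)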



3.1.2 Tree learning

library("rpart.plot")
## Loading required package: rpart
fit <- rpart(Long_term_value ~ age + sex + distance + children + pets,
             method = "class",
             data = Customer_cv,
             control = rpart.control(minsplit = 20),
             parms = list(split='information'))

rpart.plot(fit, type=2, extra = 1)

This tree is consistent with the correlation graph seen above. It says: if you are a woman, if you live close to the store, if you have children and if you have pets, then there is a good chance that you are a high-value customer.
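
To back up this reading, we can also look at which variables drive the splits (a quick sketch using the rpart fit above):

# Variable importance accumulated over the splits of the fitted tree
# (larger values mean the variable contributes more to the splits)
fit$variable.importance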

Below we display the confusion matrix, which is key to comparing the models with each other:

library("rpart")
library("rpart.plot")

# train the model 
Customer_cv<-na.omit(Customer_cv)

rpartmodel<- train(Long_term_value~., data=Customer_cv, trControl=train_control, method="rpart")

# make predictions
predictions<- predict(rpartmodel,Customer_cv)
Customer_cv_tree<- cbind(Customer_cv,predictions)

# summarize results
confusionMatrix<- confusionMatrix(Customer_cv_tree$predictions,Customer_cv_tree$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     low_value medium_value high_value
##   low_value         7970         3591        617
##   medium_value       489          686        262
##   high_value         342          535        747
## 
## Overall Statistics
##                                           
##                Accuracy : 0.617           
##                  95% CI : (0.6093, 0.6248)
##     No Information Rate : 0.5775          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.23            
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: low_value Class: medium_value
## Sensitivity                    0.9056             0.14256
## Specificity                    0.3464             0.92798
## Pos Pred Value                 0.6545             0.47738
## Neg Pred Value                 0.7285             0.70106
## Prevalence                     0.5775             0.31577
## Detection Rate                 0.5230             0.04502
## Detection Prevalence           0.7991             0.09430
## Balanced Accuracy              0.6260             0.53527
##                      Class: high_value
## Sensitivity                    0.45941
## Specificity                    0.93558
## Pos Pred Value                 0.45998
## Neg Pred Value                 0.93544
## Prevalence                     0.10670
## Detection Rate                 0.04902
## Detection Prevalence           0.10657
## Balanced Accuracy              0.69749



3.1.3 Naive Bayes

# train the model 
e1071model2 <- train(Long_term_value~., data=Customer_cv, trControl=train_control, method="nb")
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# make predictions
predictions <- predict(e1071model2,Customer_cv)
e1071modelbinded <- cbind(Customer_cv,predictions)
# summarize results
confusionMatrix<- confusionMatrix(e1071modelbinded$predictions,e1071modelbinded$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     low_value medium_value high_value
##   low_value         7363         3066        522
##   medium_value      1243         1399        434
##   high_value         195          347        670
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6189          
##                  95% CI : (0.6112, 0.6267)
##     No Information Rate : 0.5775          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2568          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: low_value Class: medium_value
## Sensitivity                    0.8366              0.2907
## Specificity                    0.4427              0.8392
## Pos Pred Value                 0.6724              0.4548
## Neg Pred Value                 0.6646              0.7194
## Prevalence                     0.5775              0.3158
## Detection Rate                 0.4832              0.0918
## Detection Prevalence           0.7186              0.2019
## Balanced Accuracy              0.6396              0.5649
##                      Class: high_value
## Sensitivity                    0.41205
## Specificity                    0.96019
## Pos Pred Value                 0.55281
## Neg Pred Value                 0.93185
## Prevalence                     0.10670
## Detection Rate                 0.04397
## Detection Prevalence           0.07953
## Balanced Accuracy              0.68612



3.1.4 KNN

library(class)
# train the model 
knnFit <- train(Long_term_value ~ ., data = Customer_cv, method = "knn", trControl = train_control, preProcess = c("center","scale"), tuneLength = 10)
# make predictions
predictions<- predict(knnFit,Customer_cv)
knnFit_bind <- cbind(Customer_cv,predictions)
# summarize results
confusionMatrix<- confusionMatrix(knnFit_bind$predictions,knnFit_bind$Long_term_value)
confusionMatrix
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     low_value medium_value high_value
##   low_value         7007         2435        227
##   medium_value      1794         2285        882
##   high_value           0           92        517
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6437         
##                  95% CI : (0.636, 0.6513)
##     No Information Rate : 0.5775         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3232         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: low_value Class: medium_value
## Sensitivity                    0.7962              0.4749
## Specificity                    0.5865              0.7434
## Pos Pred Value                 0.7247              0.4606
## Neg Pred Value                 0.6779              0.7541
## Prevalence                     0.5775              0.3158
## Detection Rate                 0.4598              0.1499
## Detection Prevalence           0.6345              0.3255
## Balanced Accuracy              0.6913              0.6091
##                      Class: high_value
## Sensitivity                    0.31796
## Specificity                    0.99324
## Pos Pred Value                 0.84893
## Neg Pred Value                 0.92420
## Prevalence                     0.10670
## Detection Rate                 0.03393
## Detection Prevalence           0.03996
## Balanced Accuracy              0.65560
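
caret tuned k over ten candidate values (tuneLength = 10); before comparing the models it is worth checking which k was retained (a quick look at the knnFit object above):

knnFit$bestTune   # the value of k selected by cross-validation
knnFit$results    # cross-validated Accuracy and Kappa for each k tried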



3.2 Interpretation

As our classes are unbalanced (we have many more low-value than high-value customers), the Kappa measure reported in the confusion matrix output is helpful for understanding the results. Kappa (or Cohen's Kappa) compares the observed accuracy with the expected accuracy, i.e. the accuracy a model would get by chance given the class proportions. For instance, a model that always predicts low_value can reach about 60% accuracy on our 3-class target simply because low_value is the majority class; the Kappa measure corrects for that.
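
A toy illustration of the Kappa computation (the proportions are made up for the example, not taken from our data):

# Cohen's Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
# Toy example: a model that always predicts low_value on data that is
# 60% low_value, 30% medium_value and 10% high_value
reference_props  <- c(low_value = 0.6, medium_value = 0.3, high_value = 0.1)
prediction_props <- c(low_value = 1.0, medium_value = 0.0, high_value = 0.0)

p_observed <- 0.6                                      # right only on the low_value rows
p_expected <- sum(reference_props * prediction_props)  # chance agreement, here also 0.6

(p_observed - p_expected) / (1 - p_expected)           # Kappa = 0: no better than chance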

The KNN algorithm obtains the best Kappa (and, here, also the best accuracy), so considering the business need we advise using KNN to better predict who might be high value. To go even further, we could propose an expected value framework that gives a different weight to each type of prediction.
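
A sketch of such an expected value comparison, where each cell of a confusion matrix is weighted by the value of acting on that prediction; the figures in the value matrix are illustrative assumptions, not business numbers:

# Value of acting on each (prediction, reference) combination, in arbitrary units
value_matrix <- matrix(c(  0,  -2, -10,    # predicted low_value
                          -1,   5,  -5,    # predicted medium_value
                          -3,   2,  30),   # predicted high_value
                       nrow = 3, byrow = TRUE,
                       dimnames = list(prediction = c("low_value", "medium_value", "high_value"),
                                       reference  = c("low_value", "medium_value", "high_value")))

# Expected value per customer for the KNN model, using its confusion matrix from above
cm_knn <- confusionMatrix(knnFit_bind$predictions, knnFit_bind$Long_term_value)$table
sum(cm_knn * value_matrix) / sum(cm_knn)

The same calculation can be run for the tree and naive Bayes models, and the model with the highest expected value per customer would be retained.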




4 Conclusion

We can now infer how likely a customer is to be high value just by asking three simple questions. Of course, getting more data on our customers could improve the model dramatically, but it could also help answer very different questions.

Find me on twitter: LudoBenistant