In this project, I’m going to use Machine Learning to Optimize the Product Backorders.

Introduction

Backorder is an order which has not been fulfilled yet by company. It indicates the interest of consumer in the product even though the product is in short amount. This is both and good for the company. Good because it shows customer is still interested in the product and demands for it. Bad because if not fulfilled in time the consumer may lose interest, look for alternative product which will result in the loss of company, losing customers and image of the company may be distorted.

Now, what company can do is built so many of the products that there won’t be shortage. But most of the companies can’t do it because of the high inventory cost. And if demand decreases, the will suffer quite a loss.

So, it is better to look at the past data and optimize the current backorder such that the inventory cost is low, product is delivered in time before the conumer loses the interest. This will good for both consumers who get the product they want with only little wait and for company which retains the customers and make profit.

There are a lot of challenges in building the predictive model for optimization of backorders. There are lot of factors which doesn’t depend on the company, product or business but external factors like holiday, season, special occassions etc. So let’s see what we are going to do.

The data we are using here is obtained from Kaggle. You can obtain the data from here.

Loading required libraries

library(data.table)
library(tidyquant)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
## Loading required package: PerformanceAnalytics
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x dplyr::between()         masks data.table::between()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x dplyr::first()           masks xts::first(), data.table::first()
## x lubridate::hour()        masks data.table::hour()
## x lubridate::intersect()   masks base::intersect()
## x lubridate::isoweek()     masks data.table::isoweek()
## x dplyr::lag()             masks stats::lag()
## x dplyr::last()            masks xts::last(), data.table::last()
## x lubridate::mday()        masks data.table::mday()
## x lubridate::minute()      masks data.table::minute()
## x lubridate::month()       masks data.table::month()
## x lubridate::quarter()     masks data.table::quarter()
## x lubridate::second()      masks data.table::second()
## x lubridate::setdiff()     masks base::setdiff()
## x purrr::transpose()       masks data.table::transpose()
## x lubridate::union()       masks base::union()
## x lubridate::wday()        masks data.table::wday()
## x lubridate::week()        masks data.table::week()
## x lubridate::yday()        masks data.table::yday()
## x lubridate::year()        masks data.table::year()
## 
## Attaching package: 'tidyquant'
## The following object is masked from 'package:dplyr':
## 
##     as_tibble
## The following object is masked from 'package:tibble':
## 
##     as_tibble
library(unbalanced)
## Loading required package: mlr
## Loading required package: ParamHelpers
## 
## Attaching package: 'ParamHelpers'
## The following object is masked from 'package:quantmod':
## 
##     getDefaults
## Warning: replacing previous import 'BBmisc::isFALSE' by
## 'backports::isFALSE' when loading 'mlr'
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:mlr':
## 
##     train
## The following object is masked from 'package:purrr':
## 
##     lift
library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:lubridate':
## 
##     day, hour, month, week, year
## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

Reading the data

train <- read.csv("train.csv", na.strings = "")
test <- read.csv("test.csv", na.strings = "")

Let’s have a look at the data

str(train)
## 'data.frame':    1687861 obs. of  23 variables:
##  $ sku              : Factor w/ 1687861 levels "(1687860 rows)",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ national_inv     : int  0 2 2 7 8 13 1095 6 140 4 ...
##  $ lead_time        : int  NA 9 NA 8 NA 8 NA 2 NA 8 ...
##  $ in_transit_qty   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ forecast_3_month : int  0 0 0 0 0 0 0 0 15 0 ...
##  $ forecast_6_month : int  0 0 0 0 0 0 0 0 114 0 ...
##  $ forecast_9_month : int  0 0 0 0 0 0 0 0 152 0 ...
##  $ sales_1_month    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_3_month    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_6_month    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_9_month    : int  0 0 0 0 4 0 0 0 0 0 ...
##  $ min_bank         : int  0 0 0 1 2 0 4 0 0 0 ...
##  $ potential_issue  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pieces_past_due  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ perf_6_month_avg : num  -99 0.99 -99 0.1 -99 0.82 -99 0 -99 0.82 ...
##  $ perf_12_month_avg: num  -99 0.99 -99 0.13 -99 0.87 -99 0 -99 0.87 ...
##  $ local_bo_qty     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deck_risk        : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 2 2 1 1 ...
##  $ oe_constraint    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ppap_risk        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ stop_auto_buy    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ rev_stop         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ went_on_backorder: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
str(test)
## 'data.frame':    242076 obs. of  23 variables:
##  $ sku              : Factor w/ 242076 levels "(242075 rows)",..: 167 213 440 599 690 1042 1155 1195 1288 1407 ...
##  $ national_inv     : int  62 9 17 9 2 15 0 28 2 2 ...
##  $ lead_time        : int  NA NA 8 2 8 2 NA NA NA NA ...
##  $ in_transit_qty   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ forecast_3_month : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ forecast_6_month : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ forecast_9_month : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_1_month    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_3_month    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales_6_month    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ sales_9_month    : int  0 0 0 2 0 2 0 0 0 0 ...
##  $ min_bank         : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ potential_issue  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pieces_past_due  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ perf_6_month_avg : num  -99 -99 0.92 0.78 0.54 0.37 -99 -99 -99 -99 ...
##  $ perf_12_month_avg: num  -99 -99 0.95 0.75 0.71 0.68 -99 -99 -99 -99 ...
##  $ local_bo_qty     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deck_risk        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 2 1 ...
##  $ oe_constraint    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ppap_risk        : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
##  $ stop_auto_buy    : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ rev_stop         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ went_on_backorder: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

Let’s have a look at the target variable

table(train$went_on_backorder)
## 
##      No     Yes 
## 1676567   11293

As we can see the data is highly unbalanced data. Since, we are focused on optimizing the backorder, we need more occurance of backorder in our train data. So, we need to balance the data.

Data Pre-Processing

summary(train)
##              sku           national_inv        lead_time     
##  (1687860 rows):      1   Min.   :  -27256   Min.   : 0.00   
##  1026827       :      1   1st Qu.:       4   1st Qu.: 4.00   
##  1043384       :      1   Median :      15   Median : 8.00   
##  1043696       :      1   Mean   :     496   Mean   : 7.87   
##  1043852       :      1   3rd Qu.:      80   3rd Qu.: 9.00   
##  1044048       :      1   Max.   :12334404   Max.   :52.00   
##  (Other)       :1687855   NA's   :1          NA's   :100894  
##  in_transit_qty     forecast_3_month    forecast_6_month 
##  Min.   :     0.0   Min.   :      0.0   Min.   :      0  
##  1st Qu.:     0.0   1st Qu.:      0.0   1st Qu.:      0  
##  Median :     0.0   Median :      0.0   Median :      0  
##  Mean   :    44.1   Mean   :    178.1   Mean   :    345  
##  3rd Qu.:     0.0   3rd Qu.:      4.0   3rd Qu.:     12  
##  Max.   :489408.0   Max.   :1427612.0   Max.   :2461360  
##  NA's   :1          NA's   :1           NA's   :1        
##  forecast_9_month  sales_1_month      sales_3_month    
##  Min.   :      0   Min.   :     0.0   Min.   :      0  
##  1st Qu.:      0   1st Qu.:     0.0   1st Qu.:      0  
##  Median :      0   Median :     0.0   Median :      1  
##  Mean   :    506   Mean   :    55.9   Mean   :    175  
##  3rd Qu.:     20   3rd Qu.:     4.0   3rd Qu.:     15  
##  Max.   :3777304   Max.   :741774.0   Max.   :1105478  
##  NA's   :1         NA's   :1          NA's   :1        
##  sales_6_month       sales_9_month        min_bank         potential_issue
##  Min.   :      0.0   Min.   :      0   Min.   :     0.00   No  :1686953   
##  1st Qu.:      0.0   1st Qu.:      0   1st Qu.:     0.00   Yes :    907   
##  Median :      2.0   Median :      4   Median :     0.00   NA's:      1   
##  Mean   :    341.7   Mean   :    525   Mean   :    52.77                  
##  3rd Qu.:     31.0   3rd Qu.:     47   3rd Qu.:     3.00                  
##  Max.   :2146625.0   Max.   :3205172   Max.   :313319.00                  
##  NA's   :1           NA's   :1         NA's   :1                          
##  pieces_past_due     perf_6_month_avg  perf_12_month_avg
##  Min.   :     0.00   Min.   :-99.000   Min.   :-99.000  
##  1st Qu.:     0.00   1st Qu.:  0.630   1st Qu.:  0.660  
##  Median :     0.00   Median :  0.820   Median :  0.810  
##  Mean   :     2.04   Mean   : -6.872   Mean   : -6.438  
##  3rd Qu.:     0.00   3rd Qu.:  0.970   3rd Qu.:  0.950  
##  Max.   :146496.00   Max.   :  1.000   Max.   :  1.000  
##  NA's   :1           NA's   :1         NA's   :1        
##   local_bo_qty       deck_risk      oe_constraint  ppap_risk     
##  Min.   :    0.000   No  :1300377   No  :1687615   No  :1484026  
##  1st Qu.:    0.000   Yes : 387483   Yes :    245   Yes : 203834  
##  Median :    0.000   NA's:      1   NA's:      1   NA's:      1  
##  Mean   :    0.626                                               
##  3rd Qu.:    0.000                                               
##  Max.   :12530.000                                               
##  NA's   :1                                                       
##  stop_auto_buy  rev_stop       went_on_backorder
##  No  :  61086   No  :1687129   No  :1676567     
##  Yes :1626774   Yes :    731   Yes :  11293     
##  NA's:      1   NA's:      1   NA's:      1     
##                                                 
##                                                 
##                                                 
## 
head(train, 10)
tail(train)
train <- train[-nrow(train),-1]
test <- test[-nrow(test),-1]
Dealing with NAs
train$lead_time <- ifelse(is.na(train$lead_time), -1, train$lead_time )
train$went_on_backorder <- ifelse(train$went_on_backorder=="Yes",1,0)
test$lead_time <- ifelse(is.na(test$lead_time), -1, test$lead_time )
test$went_on_backorder <- ifelse(test$went_on_backorder=="Yes",1,0)
train$went_on_backorder <- as.factor(train$went_on_backorder)
test$went_on_backorder <- as.factor(test$went_on_backorder)
table(is.na(train))
## 
##    FALSE 
## 37132920
table(is.na(test))
## 
##   FALSE 
## 5325650

Now, NA is gone, let’s balance data. There are many ways to deal with unbalanced data. Here, I’m using SMOTE.

train_bal <-  ubSMOTE(train[,-22], train[,22], perc.over = 200, perc.under = 200, k=5)
train_final <- cbind(train_bal$X, train_bal$Y)
names(train_final)[22] <- "went_on_backorder"
table(train_final$went_on_backorder)
## 
##     0     1 
## 45172 33879
rm(train,train_bal)

Now, we can see the data is quite balanced. Also, as a added benefit, the data size has been reduced hugely which will make our training faster.

Modeling

Now, let’s move to the modeling.

Random Forest model
fit.rf <- randomForest(went_on_backorder~., data = train_final)
pred.rf <- predict(fit.rf,test[,-22])
caret::confusionMatrix(test$went_on_backorder,pred.rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 230096   9291
##          1   1047   1641
##                                           
##                Accuracy : 0.9573          
##                  95% CI : (0.9565, 0.9581)
##     No Information Rate : 0.9548          
##     P-Value [Acc > NIR] : 2.331e-09       
##                                           
##                   Kappa : 0.2272          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9955          
##             Specificity : 0.1501          
##          Pos Pred Value : 0.9612          
##          Neg Pred Value : 0.6105          
##              Prevalence : 0.9548          
##          Detection Rate : 0.9505          
##    Detection Prevalence : 0.9889          
##       Balanced Accuracy : 0.5728          
##                                           
##        'Positive' Class : 0               
## 

We can see the accuracy is pretty good, but we need to see other metrics.

We will now train the model using H2O. It provides professional grade ML and scalibility. It also has a auto.ml function to automatically train model without providing a specific algorithm.

h2o.init()
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\Vivek\AppData\Local\Temp\Rtmp84Rjm2/h2o_Vivek_started_from_r.out
##     C:\Users\Vivek\AppData\Local\Temp\Rtmp84Rjm2/h2o_Vivek_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: ........................ Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         42 seconds 645 milliseconds 
##     H2O cluster version:        3.16.0.4 
##     H2O cluster version age:    3 months and 5 days  
##     H2O cluster name:           H2O_started_from_R_Vivek_ebx284 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.74 GB 
##     H2O cluster total cores:    0 
##     H2O cluster allowed cores:  0 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.3 (2017-11-30)

Let’s create a validation dataset. Since, H2O deals with H2OFrame, we need to convert our data to that format.

index <- createDataPartition(train_final$went_on_backorder, p=0.8, list = FALSE)
train <- train_final[index,]
valid <- train_final[-index,]

train_h2o <- as.h2o(train)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
valid_h2o <- as.h2o(valid)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test_h2o <- as.h2o(test)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Now, that we have transformed the data, we are going to use automl function from h2o package to train model.

y <- "went_on_backorder"
x <- setdiff(names(train_h2o), y)

models_h2o <- h2o.automl(x=x, y=y, training_frame = train_h2o, validation_frame = valid_h2o, leaderboard_frame = test_h2o, max_runtime_secs = 60)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |====                                                             |   5%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |====                                                             |   7%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Model has been trained. Let’s extract our model.

fit.h2o <- models_h2o@leader

Let’s predict using the model.

pred.h2o <- h2o.predict(fit.h2o,newdata = test_h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
as.data.frame(pred.h2o)

We have obtained the predictions. Now, H2O provides a function h2o.performance which can help to assess the perdormance. Let’t try.

performance.h2o <- h2o.performance(fit.h2o, newdata = test_h2o)
h2o.metric(performance.h2o)

Let’s see the AUC metric which is widely used in the business and challenges like Kaggle’s.

h2o.auc(performance.h2o)
## [1] 0.9123415

It is 91% which is very good considering the minimum effort put in.

So, we saw how h2o can help us get good performing model. It is also scalable so you can put the model in production. Also, we saw how to handle imbalanced dataset whenever required.