In this project, I’m going to use machine learning to optimize product backorders.
A backorder is an order that the company has not yet fulfilled. It indicates consumer interest in a product even though the product is in short supply. This is both good and bad for the company. Good, because it shows customers are still interested in the product and demand it. Bad, because if the order is not fulfilled in time, consumers may lose interest and look for an alternative product, which results in lost sales and lost customers, and the company’s image may suffer.
Now, what a company could do is build so much product that there is never a shortage. But most companies can’t do that because of the high inventory cost, and if demand decreases, they will suffer quite a loss.
So it is better to look at past data and optimize current backorders so that inventory cost stays low and the product is delivered before the consumer loses interest. This is good for both sides: consumers get the product they want after only a little wait, and the company retains its customers and makes a profit.
There are a lot of challenges in building a predictive model to optimize backorders. Many factors don’t depend on the company, product, or business at all, but are external: holidays, seasons, special occasions, and so on. So let’s see what we are going to do.
The data we are using is from Kaggle; you can obtain it from here.
library(data.table)
library(tidyquant)
library(unbalanced)   # ubSMOTE() for balancing the classes
library(randomForest)
library(caret)
library(h2o)   # scalable ML platform with AutoML
# Empty strings in the raw CSVs are missing values, so read them in as NA.
train <- read.csv("train.csv", na.strings = "")
test <- read.csv("test.csv", na.strings = "")
Let’s have a look at the data.
str(train)
## 'data.frame': 1687861 obs. of 23 variables:
## $ sku : Factor w/ 1687861 levels "(1687860 rows)",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ national_inv : int 0 2 2 7 8 13 1095 6 140 4 ...
## $ lead_time : int NA 9 NA 8 NA 8 NA 2 NA 8 ...
## $ in_transit_qty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ forecast_3_month : int 0 0 0 0 0 0 0 0 15 0 ...
## $ forecast_6_month : int 0 0 0 0 0 0 0 0 114 0 ...
## $ forecast_9_month : int 0 0 0 0 0 0 0 0 152 0 ...
## $ sales_1_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_3_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_6_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_9_month : int 0 0 0 0 4 0 0 0 0 0 ...
## $ min_bank : int 0 0 0 1 2 0 4 0 0 0 ...
## $ potential_issue : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ pieces_past_due : int 0 0 0 0 0 0 0 0 0 0 ...
## $ perf_6_month_avg : num -99 0.99 -99 0.1 -99 0.82 -99 0 -99 0.82 ...
## $ perf_12_month_avg: num -99 0.99 -99 0.13 -99 0.87 -99 0 -99 0.87 ...
## $ local_bo_qty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deck_risk : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 2 2 1 1 ...
## $ oe_constraint : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ ppap_risk : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ stop_auto_buy : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ rev_stop : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ went_on_backorder: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
str(test)
## 'data.frame': 242076 obs. of 23 variables:
## $ sku : Factor w/ 242076 levels "(242075 rows)",..: 167 213 440 599 690 1042 1155 1195 1288 1407 ...
## $ national_inv : int 62 9 17 9 2 15 0 28 2 2 ...
## $ lead_time : int NA NA 8 2 8 2 NA NA NA NA ...
## $ in_transit_qty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ forecast_3_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ forecast_6_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ forecast_9_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_1_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_3_month : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales_6_month : int 0 0 0 0 0 1 0 0 0 0 ...
## $ sales_9_month : int 0 0 0 2 0 2 0 0 0 0 ...
## $ min_bank : int 1 1 0 0 0 0 0 0 0 0 ...
## $ potential_issue : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ pieces_past_due : int 0 0 0 0 0 0 0 0 0 0 ...
## $ perf_6_month_avg : num -99 -99 0.92 0.78 0.54 0.37 -99 -99 -99 -99 ...
## $ perf_12_month_avg: num -99 -99 0.95 0.75 0.71 0.68 -99 -99 -99 -99 ...
## $ local_bo_qty : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deck_risk : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 2 1 ...
## $ oe_constraint : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ ppap_risk : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
## $ stop_auto_buy : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ rev_stop : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ went_on_backorder: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
Let’s have a look at the target variable.
table(train$went_on_backorder)
##
## No Yes
## 1676567 11293
As we can see, the data is highly imbalanced. Since we are focused on predicting backorders, the model needs to see more backorder cases in the training data, so we need to balance it.
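To put that imbalance in perspective, here is a quick check of the class proportions (plain base R, nothing the rest of the workflow depends on):
# Backorders make up well under 1% of rows; a model that always predicts "No"
# would already be ~99.3% accurate, so accuracy alone won't tell us much.
prop.table(table(train$went_on_backorder))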
summary(train)
## sku national_inv lead_time
## (1687860 rows): 1 Min. : -27256 Min. : 0.00
## 1026827 : 1 1st Qu.: 4 1st Qu.: 4.00
## 1043384 : 1 Median : 15 Median : 8.00
## 1043696 : 1 Mean : 496 Mean : 7.87
## 1043852 : 1 3rd Qu.: 80 3rd Qu.: 9.00
## 1044048 : 1 Max. :12334404 Max. :52.00
## (Other) :1687855 NA's :1 NA's :100894
## in_transit_qty forecast_3_month forecast_6_month
## Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0
## Median : 0.0 Median : 0.0 Median : 0
## Mean : 44.1 Mean : 178.1 Mean : 345
## 3rd Qu.: 0.0 3rd Qu.: 4.0 3rd Qu.: 12
## Max. :489408.0 Max. :1427612.0 Max. :2461360
## NA's :1 NA's :1 NA's :1
## forecast_9_month sales_1_month sales_3_month
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 0 Median : 0.0 Median : 1
## Mean : 506 Mean : 55.9 Mean : 175
## 3rd Qu.: 20 3rd Qu.: 4.0 3rd Qu.: 15
## Max. :3777304 Max. :741774.0 Max. :1105478
## NA's :1 NA's :1 NA's :1
## sales_6_month sales_9_month min_bank potential_issue
## Min. : 0.0 Min. : 0 Min. : 0.00 No :1686953
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.00 Yes : 907
## Median : 2.0 Median : 4 Median : 0.00 NA's: 1
## Mean : 341.7 Mean : 525 Mean : 52.77
## 3rd Qu.: 31.0 3rd Qu.: 47 3rd Qu.: 3.00
## Max. :2146625.0 Max. :3205172 Max. :313319.00
## NA's :1 NA's :1 NA's :1
## pieces_past_due perf_6_month_avg perf_12_month_avg
## Min. : 0.00 Min. :-99.000 Min. :-99.000
## 1st Qu.: 0.00 1st Qu.: 0.630 1st Qu.: 0.660
## Median : 0.00 Median : 0.820 Median : 0.810
## Mean : 2.04 Mean : -6.872 Mean : -6.438
## 3rd Qu.: 0.00 3rd Qu.: 0.970 3rd Qu.: 0.950
## Max. :146496.00 Max. : 1.000 Max. : 1.000
## NA's :1 NA's :1 NA's :1
## local_bo_qty deck_risk oe_constraint ppap_risk
## Min. : 0.000 No :1300377 No :1687615 No :1484026
## 1st Qu.: 0.000 Yes : 387483 Yes : 245 Yes : 203834
## Median : 0.000 NA's: 1 NA's: 1 NA's: 1
## Mean : 0.626
## 3rd Qu.: 0.000
## Max. :12530.000
## NA's :1
## stop_auto_buy rev_stop went_on_backorder
## No : 61086 No :1687129 No :1676567
## Yes :1626774 Yes : 731 Yes : 11293
## NA's: 1 NA's: 1 NA's: 1
##
##
##
##
head(train, 10)
tail(train)
# The last row of each file is a summary footer (the "(1687860 rows)" sku
# level), which explains the single NA in every column above -- drop it,
# along with the sku identifier column.
train <- train[-nrow(train), -1]
test <- test[-nrow(test), -1]
# lead_time is genuinely missing for ~100k SKUs (about 6%); encode missing as
# -1 so the models can treat "unknown lead time" as its own value.
train$lead_time <- ifelse(is.na(train$lead_time), -1, train$lead_time)
test$lead_time <- ifelse(is.na(test$lead_time), -1, test$lead_time)
# Recode the target to a 0/1 factor for classification.
train$went_on_backorder <- as.factor(ifelse(train$went_on_backorder == "Yes", 1, 0))
test$went_on_backorder <- as.factor(ifelse(test$went_on_backorder == "Yes", 1, 0))
table(is.na(train))
##
## FALSE
## 37132920
table(is.na(test))
##
## FALSE
## 5325650
Now that the NAs are gone, let’s balance the data. There are many ways to deal with unbalanced data; here I’m using SMOTE, which synthesizes new minority-class examples from their nearest neighbours while undersampling the majority class.
# SMOTE via the unbalanced package: perc.over = 200 synthesizes 2 new minority
# cases per original one from k = 5 nearest neighbours, and perc.under = 200
# keeps 2 majority cases per synthetic case created.
train_bal <- ubSMOTE(train[, -22], train[, 22], perc.over = 200, perc.under = 200, k = 5)
train_final <- cbind(train_bal$X, train_bal$Y)
names(train_final)[22] <- "went_on_backorder"
table(train_final$went_on_backorder)
##
## 0 1
## 45172 33879
rm(train,train_bal)
Now we can see the data is quite balanced. As an added benefit, the data size has been greatly reduced, which will make our training faster.
Now, let’s move on to the modeling.
fit.rf <- randomForest(went_on_backorder ~ ., data = train_final)
pred.rf <- predict(fit.rf, test[, -22])
# NB: caret expects confusionMatrix(data = predictions, reference = truth);
# the arguments here are swapped, so "Prediction" and "Reference" in the
# output below are reversed.
caret::confusionMatrix(test$went_on_backorder, pred.rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 230096 9291
## 1 1047 1641
##
## Accuracy : 0.9573
## 95% CI : (0.9565, 0.9581)
## No Information Rate : 0.9548
## P-Value [Acc > NIR] : 2.331e-09
##
## Kappa : 0.2272
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9955
## Specificity : 0.1501
## Pos Pred Value : 0.9612
## Neg Pred Value : 0.6105
## Prevalence : 0.9548
## Detection Rate : 0.9505
## Detection Prevalence : 0.9889
## Balanced Accuracy : 0.5728
##
## 'Positive' Class : 0
##
The headline accuracy looks good, but on data this imbalanced it means little. Note also that because the arguments to confusionMatrix() are swapped, the class-level metrics above are mislabeled: reading the table with that in mind, the model catches about 61% of actual backorders (1641 of 2688) but only about 15% of its backorder predictions are correct (1641 of 10932). We need to look at metrics beyond accuracy.
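As a minimal sketch (reusing the objects above; these lines are additional to the original run), the two rates for the backorder class can be computed directly:
# Build the confusion matrix with an explicit orientation, then pull recall
# and precision for the backorder ("1") class out of it.
cm <- table(Predicted = pred.rf, Actual = test$went_on_backorder)
recall_bo <- cm["1", "1"] / sum(cm[, "1"])      # share of actual backorders caught
precision_bo <- cm["1", "1"] / sum(cm["1", ])   # share of backorder flags that are real
c(recall = recall_bo, precision = precision_bo)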
We will now train the model using H2O. It provides professional-grade, scalable machine learning, and its automl function can train and tune models automatically without us specifying an algorithm.
h2o.init()
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\Vivek\AppData\Local\Temp\Rtmp84Rjm2/h2o_Vivek_started_from_r.out
## C:\Users\Vivek\AppData\Local\Temp\Rtmp84Rjm2/h2o_Vivek_started_from_r.err
##
##
## Starting H2O JVM and connecting: ........................ Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 42 seconds 645 milliseconds
## H2O cluster version: 3.16.0.4
## H2O cluster version age: 3 months and 5 days
## H2O cluster name: H2O_started_from_R_Vivek_ebx284
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.74 GB
## H2O cluster total cores: 0
## H2O cluster allowed cores: 0
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.3 (2017-11-30)
Let’s create a validation dataset. Since H2O works with H2OFrames, we also need to convert our data to that format.
# 80/20 split of the balanced data into training and validation sets, then
# move all three datasets into the H2O cluster as H2OFrames.
index <- createDataPartition(train_final$went_on_backorder, p = 0.8, list = FALSE)
train <- train_final[index, ]
valid <- train_final[-index, ]
train_h2o <- as.h2o(train)
valid_h2o <- as.h2o(valid)
test_h2o <- as.h2o(test)
Now that we have transformed the data, we are going to use the automl function from the h2o package to train the model.
y <- "went_on_backorder"          # target column
x <- setdiff(names(train_h2o), y) # every other column is a predictor
# Let AutoML try and tune multiple algorithms for up to 60 seconds, ranking
# them on the held-out leaderboard frame.
models_h2o <- h2o.automl(x = x, y = y, training_frame = train_h2o, validation_frame = valid_h2o, leaderboard_frame = test_h2o, max_runtime_secs = 60)
The models have been trained. Let’s extract the leader.
fit.h2o <- models_h2o@leader
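If you are curious which algorithms AutoML tried, the returned object also carries a leaderboard of every candidate model (the exact entries vary from run to run, so none are shown here):
# Candidate models ranked by the default sort metric (AUC for binomial problems).
print(models_h2o@leaderboard)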
Let’s predict using the model.
pred.h2o <- h2o.predict(fit.h2o,newdata = test_h2o)
as.data.frame(pred.h2o)
We have obtained the predictions. For a binomial model, the prediction frame has three columns: the predicted class plus the class probabilities p0 and p1. H2O also provides h2o.performance(), which helps assess model performance. Let’s try it.
performance.h2o <- h2o.performance(fit.h2o, newdata = test_h2o)
h2o.metric(performance.h2o)
Let’s look at the AUC metric, which is widely used in business and in competitions like Kaggle’s.
h2o.auc(performance.h2o)
## [1] 0.9123415
It is about 0.91, which is very good considering the minimal effort put in.
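As a possible follow-up, not part of the run above, the performance object also exposes metrics at every candidate probability threshold, so the operating cutoff can be tuned to the business trade-off instead of left at the default:
# F1 across thresholds, and the confusion matrix at the max-F1 threshold.
head(h2o.F1(performance.h2o))
h2o.confusionMatrix(performance.h2o)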
So, we saw how H2O can help us get a well-performing model. It is also scalable, so the model can be put into production. And we saw how to handle an imbalanced dataset when required.
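One last housekeeping step, if you are running this locally: the H2O cluster keeps a JVM alive in the background, and it can be shut down from R when you are done.
# Stop the local H2O instance and free its memory.
h2o.shutdown(prompt = FALSE)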