Prediction of Reordered Products & Associated Products

Introduction

Instacart is a San Francisco based company that operates a same-day grocery delivery service. Customers select groceries from various retailers through a web application, and a personal shopper delivers them. This study uses an open-sourced dataset, The Instacart Online Grocery Shopping Dataset 2017, released by Instacart. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Problem Statement

Instacart labels each order in the data as prior, train, or test. Prior orders describe the past behaviour of a user, while train and test orders concern the future behaviour that we need to predict. As a result, we want to predict which previously purchased products (prior orders) will be in a user’s next order (train and test orders).

The goal is to predict which of these products will be in a user’s future order. This is a classification problem because we need to predict whether each pair of user and product is a reorder or not. This is indicated by the value of the reordered variable, i.e. reordered=1 or reordered=0.

On these predicted products, we will then apply the Apriori algorithm to understand which associated products may additionally be predicted for the order_id.

Method

Our method includes the following steps:

  • Import and reshape data: This step includes loading the CSV files into R tables, transforming character variables to categorical variables, and creating a supportive table.
  • Calculate predictor variables: This step includes identifying and calculating predictor variables (aka features) from the initial datasets provided by Instacart.
  • Create the test and train datasets: In this step we create two distinct datasets that will be used in the creation and the use of the predictive model.
  • Create the predictive model: In this step we employ the XGBoost algorithm to create the predictive model using the train dataset.
  • Apply the model: This step includes applying the model to predict the ‘reordered’ variable for the test dataset.
  • Find the Associated Products: Using Association rules we will then identify products that are usually bought with the predicted products.

Data Preparation

Packages

  • pacman: To load packages and install missing ones
  • readr: Fast data loading
  • data.table: Fast data loading
  • tidyverse: Collection of R packages used for data manipulation
  • dplyr: Easy functions to perform data manipulation in R
  • stringr: String operations in R
  • DT: Package to render R data objects as HTML tables
  • ggplot2: Data visualisation in R
  • knitr: Dynamic report generation
  • magrittr: Pipe operators for R
  • treemap: Package for treemap visualisation
  • arules: Package for mining association rules and frequent itemsets
  • arulesViz: Package for visualising association rules and frequent itemsets
  • xgboost: Package for predicting reorders
  • Ckmeans.1d.dp: Used by xgb.ggplot.importance() to cluster the importance bars
if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("readr", 
               "data.table", 
               "tidyverse", 
               "dplyr", 
               "stringr", 
               "DT", 
               "ggplot2",
               "knitr",
               "magrittr",
               "treemap",
               "arules",
               "arulesViz",
               "xgboost",
               "Ckmeans.1d.dp")

Data Source

Instacart is an American company that operates a same-day grocery delivery service. The Instacart Online Grocery Shopping Dataset 2017 used in this study was accessed on 03/31.

The dataset is distributed as six tables containing the order, product, aisle, and department information described below.

  • orders: This table includes all orders, namely prior, train, and test. It has single primary key (order_id).
  • ordert: This table includes training orders. It has a composite primary key (order_id and product_id) and indicates whether a product in an order is a reorder or not (through the reordered variable).
  • orderp: This table includes prior orders. It has a composite primary key (order_id and product_id) and indicates whether a product in an order is a reorder or not (through the reordered variable).
  • products: This table includes all products. It has a single primary key (product_id).
  • aisles: This table includes all aisles. It has a single primary key (aisle_id).
  • departments: This table includes all departments. It has a single primary key (department_id).

Data Import

The six CSV files containing the data are loaded in this step.

path <- "/Users/keerthigopalakrishnan/Downloads/instacart_2017_05_01"

orderp <- fread(file.path(path, "order_products__prior.csv"))
ordert <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
products <- fread(file.path(path, "products.csv"))
aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))
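
As a quick optional sanity check, we can confirm that each table loaded and inspect its number of rows:

# Optional sanity check: number of rows per table after loading
sapply(list(orders = orders, orderp = orderp, ordert = ordert,
            products = products, aisles = aisles, departments = departments),
       nrow)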

Exploratory Data Analysis

Buying patterns of users are analysed by the day and hour at which they purchase and by the departments the purchased products belong to.

By Weekday and Time of Order

Weekends have more orders than weekdays.

 orders %>% 
  ggplot(aes(x= order_dow, fill = order_dow)) + 
  geom_histogram(stat="count")

From Monday to Thursday, the hourly distribution follows a double-hump shape, with a first peak from 9-11 hours and a second peak around 13-15 hours. This indicates that users order groceries either in the morning or around lunch time on working days. This pattern is slightly different on weekends.

orders %>% 
  ggplot(aes(x= order_hour_of_day, fill = order_dow)) + 
  geom_histogram(stat="count") +
  facet_wrap(~ order_dow, ncol = 2)

By Aisles and Departments

Though the produce department offers fewer unique products compared to departments like personal care, snacks, and pantry, the most ordered products are from the produce department, which contains vegetables and fruits.

products %>% 
  group_by(department_id, aisle_id) %>% summarize(count=n()) %>%
  left_join(departments,by="department_id") %>% 
  left_join(aisles,by="aisle_id") %>%
  treemap(index=c("department","aisle"),vSize="count",title="Tree map of Unique products offered in each Department/ aisle",palette="Set3",border.col="#FFFFFF")

ordert %>% 
  group_by(product_id) %>% 
  summarize(count=n()) %>% 
  left_join(products,by="product_id") %>% 
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  summarize(sumcount = sum(count)) %>% 
  left_join(departments,by="department_id") %>% 
  left_join(aisles,by="aisle_id") %>%
  mutate(onesize = 1) %>%
  treemap(index=c("department","aisle"),vSize="sumcount",title="Tree map of most ordered products in Department/Aisle",palette="Set3",border.col="#FFFFFF")

Item Frequency
# Split the "Product ID" values into groups based on "Order ID" variable
order_product <- orderp %>% 
  left_join(products, by = "product_id")


transactions <- as(split(order_product$product_name,order_product$order_id),"transactions")

hist(size(transactions), 
     breaks = 0:150, 
     xaxt = "n", 
     ylim = c(0,250000), 
     col = "blue",
     main = "Number of Products per Order", 
     xlab = "Order Size: Number of Products")
# axis() and mtext() are separate base-graphics calls; they are not chained with "+"
axis(1, at = seq(0, 160, by = 10))
mtext(paste("Total:", length(transactions), "Orders,", sum(size(transactions)), "Products"))

Item Frequency Plot

The top 10 most frequently bought items are shown below. Only items with a support of at least 0.01 (appearing in at least 1% of the transactions) are considered.

itemFrequencyPlot(transactions,
                  support = 0.01,
                  cex.names = 0.8,
                  horiz = TRUE, topN = 10)
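
For reference, the same top-10 list can also be printed as absolute counts rather than a plot (a quick sketch using itemFrequency() from arules):

# Absolute frequency of each item, sorted; the ten largest correspond to the plot above
head(sort(itemFrequency(transactions, type = "absolute"), decreasing = TRUE), 10)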

Reshape Data

We transform the data to facilitate further analysis. First, we convert character variables into factors so we can use them in the creation of the model. In R, a categorical variable is called a factor and has a fixed number of possible values.

# We convert character variables into factor. In R, a categorical variable is called factor and has a fixed number of different values
aisles$aisle <- as.factor(aisles$aisle)
departments$department <- as.factor(departments$department)
orders$eval_set <- as.factor(orders$eval_set)
products$product_name <- as.factor(products$product_name)

In the products table we replace aisle_id and department_id with the aisle and department names.

# In the products table we replace aisle_id and department_id with aisle name and department name
products <- products %>% 
  inner_join(aisles) %>% inner_join(departments) %>% 
  select(-aisle_id, -department_id)

We add the column “user_id” to the ordert table by matching the “order_id” of this table with the “order_id” of the orders table.

# We add the column "user_id" at the table ordert
ordert$user_id <- orders$user_id[match(ordert$order_id, orders$order_id)]

We create a new table, orders_products, which combines the orders and orderp tables. Towards this end, we use the inner_join() function, which returns records that have matching values in both tables.

# We create a new table "orders_products" which contains the tables "orders" and orderp
orders_products <- orders %>% inner_join(orderp, by = "order_id")

Sample Data

Following are the final datasets that will be used during this study:

1. Orders
datatable(head(orders, n = 10), class = 'cell-border stripe hover condensed responsive')
2. Departments
datatable(head(departments, n = 10), class = 'cell-border stripe hover condensed responsive')
3. Aisles
datatable(head(aisles, n = 10), class = 'cell-border stripe hover condensed responsive')
4. Order Products Prior
datatable(head(orderp, n = 10), class = 'cell-border stripe hover condensed responsive')
5. Order Products Train
datatable(head(ordert, n = 10), class = 'cell-border stripe hover condensed responsive')
6. Products
datatable(head(products, n = 10), class = 'cell-border stripe hover condensed responsive')
7. Order_Product Joined Table
datatable(head(orders_products, n = 10), class = 'cell-border stripe hover condensed responsive')

Create Predictor Variables

We are now ready to identify and calculate predictor variables based on the provided data. We can create various types of predictors, such as:

  • Product predictors describing characteristics of a product e.g. total number of times a product has been purchased.
  • User predictors describing the behavior of a user e.g. total number of orders of a user.
  • User & product predictors describing the behavior of a user towards a specific product e.g. the total times a user ordered a product.
  • Datetime predictors describing temporal characteristics of the orders.

Product Predictors

In this step, we create four new predictor variables that describe products and store them in a new table called prd. We have selected to include the following:

  • prod_orders: Total number of orders per product
  • prod_reorder_probability: Probability a product is reordered after the first order
  • prod_reorder_times: On average, how many times a product has been purchased by the users who purchased it at least once
  • prod_reorder_ratio: Reorders per total number of orders of the product

We first need to calculate the following supportive variables:

  • prod_reorders: Total number of reorders per product
  • prod_first_orders: Total number of first orders per product or total number of customers who bought a product
  • prod_second_orders: Total number of second orders per product, or the total number of customers who bought a product at least twice
# We create the prd table, starting from the data inside the orders_products table
prd <- orders_products %>%
# We arrange() by the three variables "user_id", "order_number", "product_id" to sort the rows in ascending order
arrange(user_id, order_number, product_id) %>%
# We group_by() to group together users and products
group_by(user_id, product_id) %>%
# mutate() creates the new variable "product_time" through row_number(), which returns a sequential number starting at 1
mutate(product_time = row_number()) %>%
# We have now identified how many times a user bought a product
ungroup()


prd <- prd %>%
# We group data per product_id
group_by(product_id) %>%
# We create the four new variables based on the groups we have created i.e. per product_id
summarise(
# n() counts the objects inside the different groups of product_id i.e. the total number of orders per product    
    prod_orders = n(),
# We sum the reordered variable per product_id (recall that reordered is 1 or 0), i.e. the total number of reorders per product
    prod_reorders = sum(reordered),    
    prod_first_orders = sum(product_time == 1),
    prod_second_orders = sum(product_time == 2)
  )

# we calculate the prod_reorder_probability variable
prd$prod_reorder_probability <- prd$prod_second_orders / prd$prod_first_orders
# we calculate the prod_reorder_times variable
prd$prod_reorder_times <- 1 + prd$prod_reorders / prd$prod_first_orders
# we calculate the prod_reorder_ratio variable
prd$prod_reorder_ratio <- prd$prod_reorders / prd$prod_orders

# we remove the prod_reorders, prod_first_orders, and prod_second_orders variables
prd <- prd %>% select(-prod_reorders, -prod_first_orders, -prod_second_orders)


# Let's see the final prd table
head(prd,20)
## # A tibble: 20 x 5
##    product_id prod_orders prod_reorder_pr… prod_reorder_ti…
##         <int>       <int>            <dbl>            <dbl>
##  1          1        1852           0.385              2.59
##  2          2          90           0.103              1.15
##  3          3         277           0.486              3.74
##  4          4         329           0.352              1.81
##  5          5          15           0.667              2.5 
##  6          6           8           0.4                1.6 
##  7          7          30           0.333              1.67
##  8          8         165           0.366              2.01
##  9          9         156           0.419              2.11
## 10         10        2572           0.315              2.03
## 11         11         104           0.333              2.31
## 12         12         246           0.317              2.05
## 13         13           9           0.167              1.5 
## 14         14          17           0.0667             1.13
## 15         15           4           0.333              1.33
## 16         16          19           0.118              1.12
## 17         17          18           0.125              1.12
## 18         18         137           0.418              2.49
## 19         19           4           0                  1   
## 20         20           6           0.2                1.2 
## # ... with 1 more variable: prod_reorder_ratio <dbl>

User Predictors

The next set of predictor variables that we calculate are related to users and we store them in the new users table. To calculate these variables we take into account only the prior orders.

We start by calculating three new variables:

  • user_orders: Total number of orders per user
  • user_period: The time period (in days) between the first and last order of a user
  • user_mean_days_since_prior: Mean time period (in days) between two consecutive orders of a user

Inline code comments explain how we calculate these variables. We start by creating a temporary version of the users table from the orders table.

users <- orders %>%
# We keep only the prior orders
  filter(eval_set == "prior") %>%
# We group orders by user_id
  group_by(user_id) %>%
# We calculate the variables based on different user_id
  summarise(
    # We calculate the total number of orders per user using the order_number variable.
    # What other variable could we have used to calculate the total number of orders?
    user_orders = max(order_number),
    # Using the na.rm = T we omit the missing values and calculate the sum and mean only for the values that we have
    user_period = sum(days_since_prior_order, na.rm = T),
    user_mean_days_since_prior = mean(days_since_prior_order, na.rm = T)
  )

# Let's see the temporary users table
head(users,10)
## # A tibble: 10 x 4
##    user_id user_orders user_period user_mean_days_since_prior
##      <int>       <dbl>       <dbl>                      <dbl>
##  1       1          10         176                       19.6
##  2       2          14         198                       15.2
##  3       3          12         133                       12.1
##  4       4           5          55                       13.8
##  5       5           4          40                       13.3
##  6       6           3          18                        9  
##  7       7          20         203                       10.7
##  8       8           3          60                       30  
##  9       9           3          36                       18  
## 10      10           5          79                       19.8

We also create a supportive table, namely us, in order to calculate three more new variables:

  • user_total_products: Total number of items purchased across a user’s orders
  • user_reorder_ratio: Reorder ratio (as defined above) per user
  • user_distinct_products: Total number of distinct products ordered by a user

Towards this end, we start from the orders_products table and group observations by the user_id variable.

us <- orders_products %>%
  group_by(user_id) %>%
  summarise(
    user_total_products = n(),
    user_reorder_ratio = sum(reordered == 1) / sum(order_number > 1),
    # The n_distinct() function counts the number of unique values in a set 
    user_distinct_products = n_distinct(product_id)
  )

# Let's see the us table
head(us,10)
## # A tibble: 10 x 4
##    user_id user_total_products user_reorder_ratio user_distinct_products
##      <int>               <int>              <dbl>                  <int>
##  1       1                  59             0.759                      18
##  2       2                 195             0.511                     102
##  3       3                  88             0.705                      33
##  4       4                  18             0.0714                     17
##  5       5                  37             0.538                      23
##  6       6                  14             0.2                        12
##  7       7                 206             0.711                      68
##  8       8                  49             0.464                      36
##  9       9                  76             0.391                      58
## 10      10                 143             0.355                      94

Then we combine the users and us tables using the inner_join() function and calculate the final variable:

  • user_average_basket: Average number of basket items per order per user
# We combine users and us tables and store the results into users table
users <- users %>% inner_join(us)

# We calculate the user_average_basket variable
users$user_average_basket <- users$user_total_products / users$user_orders

# let's see the users table
head(users,10)
## # A tibble: 10 x 8
##    user_id user_orders user_period user_mean_days_… user_total_prod…
##      <int>       <dbl>       <dbl>            <dbl>            <int>
##  1       1          10         176             19.6               59
##  2       2          14         198             15.2              195
##  3       3          12         133             12.1               88
##  4       4           5          55             13.8               18
##  5       5           4          40             13.3               37
##  6       6           3          18              9                 14
##  7       7          20         203             10.7              206
##  8       8           3          60             30                 49
##  9       9           3          36             18                 76
## 10      10           5          79             19.8              143
## # ... with 3 more variables: user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>

We now identify the future order of each user and add it to the users table. The future orders are indicated as train and test in the eval_set variable. As a result, we will know the order_id of the future order per user, whether this order belongs to the train or test set, and the time in days since the last order.

us <- orders %>%
  # we exclude prior orders and thus we keep only train and test orders
  filter(eval_set != "prior") %>%
  select(user_id, order_id, eval_set,
         time_since_last_order = days_since_prior_order)

# We combine users and us tables and store the results into the users table
users <- users %>% inner_join(us)

# We delete the us table
rm(us)
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   6262630  334.5   25032344 1336.9         NA  31290430 1671.1
## Vcells 450758019 3439.1  953631162 7275.7     102400 767328383 5854.3
# let's see the final users table
head(users,10)
## # A tibble: 10 x 11
##    user_id user_orders user_period user_mean_days_… user_total_prod…
##      <int>       <dbl>       <dbl>            <dbl>            <int>
##  1       1          10         176             19.6               59
##  2       2          14         198             15.2              195
##  3       3          12         133             12.1               88
##  4       4           5          55             13.8               18
##  5       5           4          40             13.3               37
##  6       6           3          18              9                 14
##  7       7          20         203             10.7              206
##  8       8           3          60             30                 49
##  9       9           3          36             18                 76
## 10      10           5          79             19.8              143
## # ... with 6 more variables: user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>,
## #   order_id <int>, eval_set <fct>, time_since_last_order <dbl>

User_Product Predictors

We now create predictors that indicate how a user behaves towards a specific product. We store these predictors in the data table, which is also the final table that we create. Towards this end, we use both prd and users tables. We create the following predictors:

  • up_orders: The total times a user ordered a product
  • up_first_order: The order number of the user’s first purchase of a product
  • up_last_order: The order number of the user’s last purchase of a product
  • up_average_cart_position: The average position in a user’s cart of a product
  • up_order_rate: Percentage of user’s orders that include a specific product
  • up_orders_since_last_order: Number of orders since user’s last order of a product
  • up_order_rate_since_first_order: Percentage of the orders since the first order of a product in which a user purchased this product

We create the data table starting from the orders_products table and summarising by both user_id and product_id. We then create the first four new variables. After running the following code we get a temporary version of the data table.

data <- orders_products %>%
  group_by(user_id, product_id) %>% 
  summarise(
    up_orders = n(),
    up_first_order = min(order_number),
    up_last_order = max(order_number),
    up_average_cart_position = mean(add_to_cart_order))

# We delete the tables "orders_products" and "orders"
rm(orders_products, orders)

We combine the data table with the prd and users tables and calculate the final three variables.

# We use inner_join() to combine the table "data" with the tables "prd" and "users"
data <- data %>% 
  inner_join(prd, by = "product_id") %>%
  inner_join(users, by = "user_id")

# We calculate the variables "up_order_rate", "up_orders_since_last_order", "up_order_rate_since_first_order"
data$up_order_rate <- data$up_orders / data$user_orders
data$up_orders_since_last_order <- data$user_orders - data$up_last_order
data$up_order_rate_since_first_order <- data$up_orders / (data$user_orders - data$up_first_order + 1)

We now combine the data table with the ordert table in order to see which products that a user has already bought (prior) have been reordered (train).

If a product from the ordert table has been reordered (reordered=1) by a user, then this combination of product and user will exist in the data table, and the reordered value will be added to the data table through left_join(). If not, then the specific combination of product and user will not appear in ordert, and the reordered value for this observation will remain empty (reordered=NA).
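
To make the left_join() semantics concrete, here is a toy illustration (made-up rows, not the real tables):

# Toy illustration of the join semantics
toy_data   <- tibble(user_id = 1L, product_id = c(10L, 20L))
toy_ordert <- tibble(user_id = 1L, product_id = 10L, reordered = 1L)
left_join(toy_data, toy_ordert, by = c("user_id", "product_id"))
# product 10 -> reordered = 1  (appears in the train order)
# product 20 -> reordered = NA (does not appear in the train order)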

The code below produces the final table:

data <- data %>% 
  left_join(ordert %>% select(user_id, product_id, reordered), 
            by = c("user_id", "product_id"))

# We delete the tables "ordert", "prd" and "users"
rm(ordert, prd, users)
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   6262580  334.5   20053024 1071.0         NA  31332852 1673.4
## Vcells 512774175 3912.2  953631162 7275.7     102400 953132310 7271.9
# Let's see the final data table
head(data, 10)
## # A tibble: 10 x 24
## # Groups:   user_id [1]
##    user_id product_id up_orders up_first_order up_last_order
##      <int>      <int>     <int>          <dbl>         <dbl>
##  1       1        196        10              1            10
##  2       1      10258         9              2            10
##  3       1      10326         1              5             5
##  4       1      12427        10              1            10
##  5       1      13032         3              2            10
##  6       1      13176         2              2             5
##  7       1      14084         1              1             1
##  8       1      17122         1              5             5
##  9       1      25133         8              3            10
## 10       1      26088         2              1             2
## # ... with 19 more variables: up_average_cart_position <dbl>,
## #   prod_orders <int>, prod_reorder_probability <dbl>,
## #   prod_reorder_times <dbl>, prod_reorder_ratio <dbl>, user_orders <dbl>,
## #   user_period <dbl>, user_mean_days_since_prior <dbl>,
## #   user_total_products <int>, user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>,
## #   order_id <int>, eval_set <fct>, time_since_last_order <dbl>,
## #   up_order_rate <dbl>, up_orders_since_last_order <dbl>,
## #   up_order_rate_since_first_order <dbl>, reordered <int>

Train and Test Data Sets

Before running the XGBoost algorithm, we need to create the final train and test tables that will feed the model and the evaluation respectively.

  • We split the data based on the eval_set variable into train and test.
  • We remove all the columns that are not predictor variables.
  • In the train set we transform the missing values (NA) of reordered to 0 in order to indicate that these products have not been reordered in the future order.
train <- as.data.frame(data[data$eval_set == "train",])
train$eval_set <- NULL
train$user_id <- NULL
#train$product_id <- NULL
#train$order_id <- NULL
# below we transform missing values of 'reordered' variable to 0
train$reordered[is.na(train$reordered)] <- 0

head(train,10)
##    product_id up_orders up_first_order up_last_order
## 1         196        10              1            10
## 2       10258         9              2            10
## 3       10326         1              5             5
## 4       12427        10              1            10
## 5       13032         3              2            10
## 6       13176         2              2             5
## 7       14084         1              1             1
## 8       17122         1              5             5
## 9       25133         8              3            10
## 10      26088         2              1             2
##    up_average_cart_position prod_orders prod_reorder_probability
## 1                  1.400000       35791                0.5825000
## 2                  3.333333        1946                0.5529623
## 3                  5.000000        5526                0.5215809
## 4                  3.300000        6476                0.5294818
## 5                  6.333333        3751                0.4797823
## 6                  6.000000      379450                0.7118844
## 7                  2.000000       15935                0.6291501
## 8                  6.000000       13880                0.5274262
## 9                  4.000000        6196                0.5807453
## 10                 4.500000        2523                0.4049871
##    prod_reorder_times prod_reorder_ratio user_orders user_period
## 1            4.473875          0.7764801          10         176
## 2            3.493716          0.7137718          10         176
## 3            2.873635          0.6520087          10         176
## 4            3.857058          0.7407350          10         176
## 5            2.916796          0.6571581          10         176
## 6            5.972111          0.8325550          10         176
## 7            5.290505          0.8109821          10         176
## 8            3.082390          0.6755764          10         176
## 9            3.848447          0.7401549          10         176
## 10           2.169390          0.5390408          10         176
##    user_mean_days_since_prior user_total_products user_reorder_ratio
## 1                    19.55556                  59          0.7592593
## 2                    19.55556                  59          0.7592593
## 3                    19.55556                  59          0.7592593
## 4                    19.55556                  59          0.7592593
## 5                    19.55556                  59          0.7592593
## 6                    19.55556                  59          0.7592593
## 7                    19.55556                  59          0.7592593
## 8                    19.55556                  59          0.7592593
## 9                    19.55556                  59          0.7592593
## 10                   19.55556                  59          0.7592593
##    user_distinct_products user_average_basket order_id
## 1                      18                 5.9  1187899
## 2                      18                 5.9  1187899
## 3                      18                 5.9  1187899
## 4                      18                 5.9  1187899
## 5                      18                 5.9  1187899
## 6                      18                 5.9  1187899
## 7                      18                 5.9  1187899
## 8                      18                 5.9  1187899
## 9                      18                 5.9  1187899
## 10                     18                 5.9  1187899
##    time_since_last_order up_order_rate up_orders_since_last_order
## 1                     14           1.0                          0
## 2                     14           0.9                          0
## 3                     14           0.1                          5
## 4                     14           1.0                          0
## 5                     14           0.3                          0
## 6                     14           0.2                          5
## 7                     14           0.1                          9
## 8                     14           0.1                          5
## 9                     14           0.8                          0
## 10                    14           0.2                          8
##    up_order_rate_since_first_order reordered
## 1                        1.0000000         1
## 2                        1.0000000         1
## 3                        0.1666667         0
## 4                        1.0000000         0
## 5                        0.3333333         1
## 6                        0.2222222         0
## 7                        0.1000000         0
## 8                        0.1666667         0
## 9                        1.0000000         1
## 10                       0.2000000         1
test <- as.data.frame(data[data$eval_set == "test",])
test$eval_set <- NULL
test$user_id <- NULL
test$reordered <- NULL

head(test,10)
##    product_id up_orders up_first_order up_last_order
## 1         248         1              2             2
## 2        1005         1             10            10
## 3        1819         3              4             7
## 4        7503         1              3             3
## 5        8021         1              2             2
## 6        9387         5              1             7
## 7       12845         1              4             4
## 8       14992         2              6             7
## 9       15143         1              1             1
## 10      16797         3              1             9
##    up_average_cart_position prod_orders prod_reorder_probability
## 1                  3.000000        6371                0.2795080
## 2                  5.000000         463                0.3127413
## 3                  2.666667        2424                0.3468725
## 4                  6.000000       12474                0.3837314
## 5                  5.000000       27864                0.4408357
## 6                  3.600000       36187                0.4691700
## 7                  2.000000       10027                0.2570445
## 8                  7.000000       29069                0.4270636
## 9                  3.000000        3447                0.3580811
## 10                 4.000000      142951                0.5666180
##    prod_reorder_times prod_reorder_ratio user_orders user_period
## 1            1.667365          0.4002511          12         133
## 2            1.787645          0.4406048          12         133
## 3            1.969131          0.4921617          12         133
## 4            2.239899          0.5535514          12         133
## 5            2.445927          0.5911570          12         133
## 6            2.860632          0.6504269          12         133
## 7            1.569662          0.3629201          12         133
## 8            2.397048          0.5828202          12         133
## 9            1.968589          0.4920220          12         133
## 10           3.312962          0.6981553          12         133
##    user_mean_days_since_prior user_total_products user_reorder_ratio
## 1                    12.09091                  88          0.7051282
## 2                    12.09091                  88          0.7051282
## 3                    12.09091                  88          0.7051282
## 4                    12.09091                  88          0.7051282
## 5                    12.09091                  88          0.7051282
## 6                    12.09091                  88          0.7051282
## 7                    12.09091                  88          0.7051282
## 8                    12.09091                  88          0.7051282
## 9                    12.09091                  88          0.7051282
## 10                   12.09091                  88          0.7051282
##    user_distinct_products user_average_basket order_id
## 1                      33            7.333333  2774568
## 2                      33            7.333333  2774568
## 3                      33            7.333333  2774568
## 4                      33            7.333333  2774568
## 5                      33            7.333333  2774568
## 6                      33            7.333333  2774568
## 7                      33            7.333333  2774568
## 8                      33            7.333333  2774568
## 9                      33            7.333333  2774568
## 10                     33            7.333333  2774568
##    time_since_last_order up_order_rate up_orders_since_last_order
## 1                     11    0.08333333                         10
## 2                     11    0.08333333                          2
## 3                     11    0.25000000                          5
## 4                     11    0.08333333                          9
## 5                     11    0.08333333                         10
## 6                     11    0.41666667                          5
## 7                     11    0.08333333                          8
## 8                     11    0.16666667                          5
## 9                     11    0.08333333                         11
## 10                    11    0.25000000                          3
##    up_order_rate_since_first_order
## 1                       0.09090909
## 2                       0.33333333
## 3                       0.33333333
## 4                       0.10000000
## 5                       0.09090909
## 6                       0.41666667
## 7                       0.11111111
## 8                       0.28571429
## 9                       0.08333333
## 10                      0.25000000
rm(data)
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   6263961  334.6   20053024 1071.0         NA  31332852 1673.4
## Vcells 501289579 3824.6  953631162 7275.7     102400 953132310 7271.9

Create the Model

We use the train data to create the model. The input to xgboost consists of two parts: the label (or outcome) and the data (the predictors). In our example the label we want to predict is the column “reordered”.

Because xgboost handles only numeric vectors, we have to transform any categorical data into numeric form before building the model. Towards this end we use xgb.DMatrix.

Finally, we can plot the importance of the predictors using xgb.ggplot.importance().

library(ggplot2)

params <- list(
  "objective"           = "reg:logistic",
  "eval_metric"         = "logloss", 
  "eta"                 = 0.1, 
  "max_depth"           = 6, 
  "min_child_weight"    = 10,  
  "gamma"               = 0.70,  
  "subsample"           = 0.76,
  "colsample_bytree"    = 0.95,  
  "alpha"               = 2e-05,  
  "lambda"              = 10 
)

# We get a sample containing 10% of the train table
subtrain <- train %>% sample_frac(0.1)
# We create an xgb.DMatrix that is named X with predictors from subtrain table and response the reordered variable
X <- xgb.DMatrix(as.matrix(subtrain %>% select(-reordered, -order_id, -product_id)), label = subtrain$reordered)
# We create the actual model
model <- xgboost(data = X, params = params, nrounds = 80)
## [1]  train-logloss:0.624639 
## [2]  train-logloss:0.571516 
## [3]  train-logloss:0.525171 
## [4]  train-logloss:0.485455 
## [5]  train-logloss:0.454769 
## [6]  train-logloss:0.426194 
## [7]  train-logloss:0.401619 
## [8]  train-logloss:0.381367 
## [9]  train-logloss:0.364305 
## [10] train-logloss:0.348925 
## [11] train-logloss:0.335851 
## [12] train-logloss:0.324318 
## [13] train-logloss:0.314377 
## [14] train-logloss:0.305517 
## [15] train-logloss:0.297942 
## [16] train-logloss:0.291039 
## [17] train-logloss:0.285190 
## [18] train-logloss:0.280525 
## [19] train-logloss:0.275822 
## [20] train-logloss:0.272233 
## [21] train-logloss:0.268503 
## [22] train-logloss:0.265747 
## [23] train-logloss:0.263313 
## [24] train-logloss:0.260833 
## [25] train-logloss:0.258859 
## [26] train-logloss:0.257246 
## [27] train-logloss:0.255782 
## [28] train-logloss:0.254252 
## [29] train-logloss:0.253062 
## [30] train-logloss:0.252092 
## [31] train-logloss:0.251246 
## [32] train-logloss:0.250460 
## [33] train-logloss:0.249842 
## [34] train-logloss:0.249046 
## [35] train-logloss:0.248511 
## [36] train-logloss:0.248028 
## [37] train-logloss:0.247614 
## [38] train-logloss:0.247265 
## [39] train-logloss:0.246923 
## [40] train-logloss:0.246660 
## [41] train-logloss:0.246397 
## [42] train-logloss:0.246177 
## [43] train-logloss:0.245940 
## [44] train-logloss:0.245752 
## [45] train-logloss:0.245588 
## [46] train-logloss:0.245439 
## [47] train-logloss:0.245313 
## [48] train-logloss:0.245211 
## [49] train-logloss:0.245116 
## [50] train-logloss:0.244940 
## [51] train-logloss:0.244843 
## [52] train-logloss:0.244733 
## [53] train-logloss:0.244648 
## [54] train-logloss:0.244575 
## [55] train-logloss:0.244511 
## [56] train-logloss:0.244434 
## [57] train-logloss:0.244336 
## [58] train-logloss:0.244264 
## [59] train-logloss:0.244187 
## [60] train-logloss:0.244099 
## [61] train-logloss:0.244033 
## [62] train-logloss:0.243989 
## [63] train-logloss:0.243933 
## [64] train-logloss:0.243888 
## [65] train-logloss:0.243832 
## [66] train-logloss:0.243803 
## [67] train-logloss:0.243779 
## [68] train-logloss:0.243715 
## [69] train-logloss:0.243665 
## [70] train-logloss:0.243614 
## [71] train-logloss:0.243565 
## [72] train-logloss:0.243523 
## [73] train-logloss:0.243488 
## [74] train-logloss:0.243445 
## [75] train-logloss:0.243419 
## [76] train-logloss:0.243379 
## [77] train-logloss:0.243323 
## [78] train-logloss:0.243281 
## [79] train-logloss:0.243250 
## [80] train-logloss:0.243218
# We estimate the importance of the predictors
importance <- xgb.importance(colnames(X), model = model)
# We plot the importance of the predictors
xgb.ggplot.importance(importance)

rm(X, importance, subtrain)
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   6339429  338.6   20053024 1071.0         NA  31332852 1673.4
## Vcells 486142528 3709.0  953631162 7275.7     102400 953132310 7271.9
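
The number of boosting rounds (80) and the other hyperparameters above were fixed by hand. As a sketch (not part of the original study), one could sanity-check nrounds with xgb.cv() and early stopping; note this would have to run before the rm(X, importance, subtrain) call above, while X is still in memory:

# Sketch: 5-fold cross-validation with early stopping on the logloss.
# Run before rm(X, importance, subtrain) so that X still exists.
cv <- xgb.cv(params = params,
             data = X,
             nrounds = 200,
             nfold = 5,
             early_stopping_rounds = 10,
             verbose = 0)
cv$best_iteration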

Apply the Model

Because Instacart did not reveal the outcome of the test data, we also apply the model to the train data and create a table that enables comparing the predicted to the actual results.

Apply to Test Data

Now we are ready to perform the prediction with the model we have built to classify test data.

# We use the xgb.DMatrix to group our test data into a matrix
X <- xgb.DMatrix(as.matrix(test %>% select(-order_id, -product_id)))
# We apply the model and we predict the reordered variable for the test set.
test$reordered <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
test$reordered <- (test$reordered > 0.21) * 1

# We create the final table with reordered products per order
submission <- test %>%
  filter(reordered == 1) %>%
  group_by(order_id) %>%
  summarise(
    products = paste(product_id, collapse = " ")
  )
# Let's see the submission table
head(submission,10)
## # A tibble: 10 x 2
##    order_id products                                                      
##       <int> <chr>                                                         
##  1       17 13107 21463                                                   
##  2       34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792    
##  3      137 2326 5134 23794 24852 25890 38689 41787                       
##  4      182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672     
##  5      257 13870 21137 22035 24838 24852 27104 27966 29837 30233 37646 3…
##  6      313 12779 13198 14077 21903 25890 28535 30391 45007 46906         
##  7      353 21137 35561 40688                                             
##  8      386 260 1158 2326 4920 6046 8174 10343 15872 17652 21479 21903 22…
##  9      414 4472 7539 14947 19006 20392 20564 21230 21376 21709 27845 312…
## 10      418 1503 13702 30489 38694 40268 41950 47766
# We create the table "missing" containing the orders for which no product is predicted to be reordered
missing <- data.frame(
  order_id = unique(test$order_id[!test$order_id %in% submission$order_id]),
  products = "None"
)

submission <- submission %>% bind_rows(missing) %>% arrange(order_id)
#Let's see the submission table
head(submission,10)
## # A tibble: 10 x 2
##    order_id products                                                      
##       <int> <chr>                                                         
##  1       17 13107 21463                                                   
##  2       34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792    
##  3      137 2326 5134 23794 24852 25890 38689 41787                       
##  4      182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672     
##  5      257 13870 21137 22035 24838 24852 27104 27966 29837 30233 37646 3…
##  6      313 12779 13198 14077 21903 25890 28535 30391 45007 46906         
##  7      353 21137 35561 40688                                             
##  8      386 260 1158 2326 4920 6046 8174 10343 15872 17652 21479 21903 22…
##  9      414 4472 7539 14947 19006 20392 20564 21230 21376 21709 27845 312…
## 10      418 1503 13702 30489 38694 40268 41950 47766
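
If the goal were a Kaggle-style submission, the table could now be written to disk (the file name below is just an example):

# Write the final predictions to a CSV file (example file name)
write.csv(submission, file = "submit.csv", row.names = FALSE)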

Apply to Training Set

In order to be able to compare our prediction to the actual results, we re-apply the model to the train data.

# We use the xgb.DMatrix to group our train data into a matrix
X <- xgb.DMatrix(as.matrix(train %>% select(-order_id, -product_id, -reordered)))
# We apply the model and we predict the reordered variable for the train set.
train$reordered_pred <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
train$reordered_pred <- (train$reordered_pred > 0.21) * 1

# We create the final table with reordered products per order
submission_train <- train %>%
  filter(reordered_pred == 1) %>%
  group_by(order_id) %>%
  summarise(
    products = paste(product_id, collapse = " ")
  )

real_reorders <- train %>%
 filter(reordered == 1) %>%
 group_by(order_id) %>%
 summarise(
   real_products = paste(product_id, collapse = " ")
 )

submission_train <- real_reorders %>% 
  inner_join(submission_train, by = "order_id") 

The Final Result for comparison is shown below:

datatable(head(submission_train, n = 10), class = 'cell-border stripe hover condensed responsive')

Model Evaluation

The model is now evaluated by calculating the number of correct matches vs the number of incorrect matches.

Evaluation_set <- submission_train

for (i in 1:nrow(Evaluation_set)){

  # Count how many predicted products also appear among the actual reordered products
  Evaluation_set$matches[i] <- length(Reduce('intersect', str_extract_all(c(Evaluation_set$products[i], Evaluation_set$real_products[i]), "\\w+")))

  # The remaining actual products were missed by the prediction
  Evaluation_set$no_matches[i] <- (str_count(Evaluation_set$real_products[i], " ") + 1) - Evaluation_set$matches[i]

}
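
The row-wise loop above is easy to read but slow on large tables; an equivalent vectorised sketch with mapply() produces the same two columns:

# Equivalent vectorised version of the loop above (same results, no explicit loop)
Evaluation_set$matches <- mapply(
  function(pred, real) length(intersect(str_extract_all(pred, "\\w+")[[1]],
                                        str_extract_all(real, "\\w+")[[1]])),
  Evaluation_set$products, Evaluation_set$real_products
)
Evaluation_set$no_matches <-
  str_count(Evaluation_set$real_products, " ") + 1 - Evaluation_set$matches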

The evaluation table is shown below:

datatable(head(Evaluation_set, n = 10), class = 'cell-border stripe hover condensed responsive')

The final sum of matches and no matches are compared:

sum(Evaluation_set$matches)
## [1] 423488
sum(Evaluation_set$no_matches)
## [1] 392930
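
The 0.21 cutoff used earlier was fixed by hand. As a sketch, one could instead choose it by maximising F1 on the training predictions; this assumes the raw probabilities are kept in a separate (hypothetical) column such as train$reordered_prob before thresholding:

# Sketch: scan candidate cutoffs and keep the F1-optimal one.
# probs = predicted probabilities, actual = true 0/1 labels
tune_threshold <- function(probs, actual, cutoffs = seq(0.05, 0.50, by = 0.01)) {
  f1 <- sapply(cutoffs, function(t) {
    pred      <- as.integer(probs > t)
    tp        <- sum(pred == 1 & actual == 1)
    precision <- tp / max(sum(pred == 1), 1)
    recall    <- tp / max(sum(actual == 1), 1)
    if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  })
  data.frame(cutoff = cutoffs, f1 = f1)[which.max(f1), ]
}
# Example call (hypothetical column holding the raw probabilities):
# tune_threshold(train$reordered_prob, train$reordered)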

Prediction of Associated Products

Apriori Algorithm and Association Rules

  • Support: This quantifies the popularity of an item and is measured as the proportion of transactions in which the item appears. Itemsets whose support exceeds a chosen threshold are considered significant itemsets. supp(X) = (Number of transactions in which X appears) / (Total number of transactions)

  • Confidence: This shows how likely item B is to be added to the cart when item A is purchased. It is measured as the proportion of transactions containing item A in which item B is also purchased. For example, if conf(A -> B) is 75%, then 75% of the transactions that contain A also contain B. If item B is very popular, the confidence will be high regardless of A, so confidence alone cannot establish an association. conf(A -> B) = supp(A U B) / supp(A)

  • Lift: This is measured as the ratio of the observed support to that expected if A and B were independent. A lift value of 1 indicates that the two items are independent of each other. A value greater than 1 indicates that B is likely to be bought along with A. lift(A -> B) = supp(A U B) / (supp(A) supp(B))

  • Conviction: conv(A -> B) = (1 - supp(B)) / (1 - conf(A -> B))

A conviction of 1.4 means that the rule A -> B would be incorrect 40% more often if the association between A and B were purely accidental.
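
A toy numeric illustration of these measures (made-up counts, not taken from the Instacart data):

# 100 transactions: A appears in 20, B in 25, {A,B} together in 15
supp_A  <- 20 / 100                       # 0.20
supp_B  <- 25 / 100                       # 0.25
supp_AB <- 15 / 100                       # 0.15
conf_AB <- supp_AB / supp_A               # 0.75: 75% of baskets with A also contain B
lift_AB <- supp_AB / (supp_A * supp_B)    # 3.0: A and B co-occur 3x more than chance
conv_AB <- (1 - supp_B) / (1 - conf_AB)   # 3.0: the rule would fail 3x more often if A and B were independent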

Association Rules and Algorithm

Using association rules and market basket analysis, a set of rules is created. A support value of 0.00001, a confidence of 0.4, and a maximum rule length of 3 items are used.

basket_rules<-apriori(transactions,parameter = list(sup=0.00001,conf=0.4,maxlen=3,target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   1e-05      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [20.90s].
## sorting and recoding items ... [30863 item(s)] done [0.85s].
## creating transaction tree ... done [5.67s].
## checking subsets of size 1 2 done [12.22s].
## writing ... [979 rule(s)] done [2.53s].
## creating S4 object  ... done [2.71s].
summary(basket_rules)
## set of 979 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2 
## 979 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support            confidence          lift              count       
##  Min.   :1.026e-05   Min.   :0.4000   Min.   :    2.72   Min.   :  33.0  
##  1st Qu.:1.338e-05   1st Qu.:0.4232   1st Qu.:   41.86   1st Qu.:  43.0  
##  Median :2.146e-05   Median :0.4569   Median :  959.12   Median :  69.0  
##  Mean   :7.943e-05   Mean   :0.4795   Mean   : 3418.94   Mean   : 255.4  
##  3rd Qu.:5.848e-05   3rd Qu.:0.5104   3rd Qu.: 4382.79   3rd Qu.: 188.0  
##  Max.   :2.902e-03   Max.   :1.0000   Max.   :76544.62   Max.   :9331.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions       3214874   1e-05        0.4
kable(inspect(head(sort(basket_rules, decreasing = TRUE, na.last = NA, by = "lift"), n = 10)))
##      lhs                                                                 rhs                                                                   support confidence     lift count
## [1]  {Moisturizing Facial Wash}                                       => {Moisturizing Non-Drying Facial Wash}                            1.306428e-05  1.0000000 76544.62    42
## [2]  {Moisturizing Non-Drying Facial Wash}                            => {Moisturizing Facial Wash}                                       1.306428e-05  1.0000000 76544.62    42
## [3]  {Prepared Meals Simmered Beef Entree Dog Food}                   => {Prepared Meals Beef & Chicken Medley Dog Food}                  1.275322e-05  0.6212121 32211.59    41
## [4]  {Prepared Meals Beef & Chicken Medley Dog Food}                  => {Prepared Meals Simmered Beef Entree Dog Food}                   1.275322e-05  0.6612903 32211.59    41
## [5]  {Ocean Whitefish}                                                => {Premium Classic Chicken Recipe Cat Food}                        1.026479e-05  0.7500000 32148.74    33
## [6]  {Premium Classic Chicken Recipe Cat Food}                        => {Ocean Whitefish}                                                1.026479e-05  0.4400000 32148.74    33
## [7]  {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} => {Ancient Grains Apricot Blended Low-Fat Greek Yogurt}            1.430849e-05  0.5609756 29088.16    46
## [8]  {Ancient Grains Apricot Blended Low-Fat Greek Yogurt}            => {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} 1.430849e-05  0.7419355 29088.16    46
## [9]  {Thousand Island Salad Snax}                                     => {Raspberry Vinaigrette Salad Snax}                               2.146274e-05  0.6160714 23030.14    69
## [10] {Raspberry Vinaigrette Salad Snax}                               => {Thousand Island Salad Snax}                                     2.146274e-05  0.8023256 23030.14    69
#Frequently ordered products

#We find 15 products when the support is set at 0.03. This means each of these products appears in at least 3% of the total transactions, i.e. in roughly 96,000 of the ~3.2 million orders

item_frequencies <- itemFrequency(transactions, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transactions)]

par(mar=c(2,10,2,2)); options(scipen=5)

barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transactions), col="red")

We want to find 2-product and 3-product combinations, and hence we choose a lower support of 0.0003, which means a product combination appears in about 0.03% of the ~3.2 million transactions, i.e. in at least roughly 960 of them.

#Frequent items bought together

basket_rules <- apriori(transactions,parameter = list(sup=0.0003, conf=0.5, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5  0.0003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.48s].
## sorting and recoding items ... [5213 item(s)] done [0.45s].
## creating transaction tree ... done [4.02s].
## checking subsets of size 1 2 3 4 5 done [6.75s].
## writing ... [101 rule(s)] done [0.06s].
## creating S4 object  ... done [1.25s].

This scatter plot illustrates the relationship between the different metrics. It has been shown that the optimal rules tend to lie on what’s known as the “support-confidence boundary”, i.e. on the right-hand border of the plot where support, confidence, or both are maximised.

plot(basket_rules)

The network graph below shows associations between selected items. Larger circles imply higher support, while darker red circles imply higher lift.

plot(head(sort(basket_rules,by="lift"), 5),method="graph")

Summary

Highlights:

  • Using the XGBoost algorithm we could predict the future demand for re-ordered products by order_id.
  • This is particularly beneficial when a popular item is an all-time favourite of many customers.
  • This greatly helps towards ensuring that the inventories are not out of stock.
  • By aggregating the number of occurrences of product_id in the ‘products’ column, we could estimate the approximate quantity to be restocked/displayed on the shelf to meet demand.
  • Finally, using association rules and the Apriori algorithm, we can identify the products that are closely associated with the product_ids predicted in the previous step. This gives us a complete picture of what an order_id may contain in the future.

Future Work:

  • While the algorithm matched 420,000+ products correctly, there is scope for improvement in prediction, as 390,000+ products were not matched.
  • The unmatched products could be new products, which would need a different line/algorithm of prediction.
  • Processing speed could be improved so that the algorithm can be run on an automated basis.