Reorder Data Analysis

Instacart Purchase Analysis

Introduction

Instacart is a San Fransisco based company that operates as a same-day grocery delivery service. Customers select groceries through a web application from various retailers and delivered by a personal shopper. This study makes use of open sourced data set, The Instacart Online Grocery Shopping Dataset 2017 released by Instacart. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Problem Statement:

Instacart indicates each order in the data as prior, train or test. Prior orders describe the past behaviour of a user while train and test orders regard the future behaviour that we need to predict. As a result, we want to predict which previously purchased products (prior orders) will be in a user’s next order (train and test orders).

The goal is to predict which of these products will be in a user’s future order. This is a classification problem because we need to predict whether each pair of user and product is a reorder or not. This is indicated by the value of the reordered variable, i.e. reordered=1 or reordered=0 (see figure below)

Method

Our method includes the following steps:

Import and reshape data: This step includes loading CSV files into R tables, tranform character variables to categorical variables, and create a supportive table.
Calculate predictor variables: This step includes identifying and calculating predictor variables (aka features) from the initial datasets provided by Instacart.
Create the test and train datasets: In this step we create two distinct datasets that will be used in the creation and the use of the predictive model.
Create the preditive model: In this step we employ XGBoost algorithm to create the predictive model through the train dataset.
Apply the model: This step includes applying the model to predict the ‘reordered’ variable for the test dataset.

Data Preparation

Packages

pacman: To load packages and install missing ones
readr: Fast data load
data.table: Fast data load.
tidyverse: Package of multiple R packages used for data manipulation
dplyr: Easy functions to perform data manipulation in R.
DT: Package to put data objects in R as HTML tables
xgboost : Package for predicting reorders

if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("data.table",
               "DT",
               "dplyr", 
               "tidyr",
               "tidyverse",
               "xgboost",
               "Ckmeans.1d.dp")

Data Source

Instacart is an American company that operates as a same-day grocery delivery service. The Instacart Online Grocery Shopping dataset 2017 used in this study is accessed on 03/31.

In the dataset you’ll find information about businesses across 11 metropolitan areas in four countries. There are 6 tables available that containes business related information

orders: This table includes all orders, namely prior, train, and test. It has single primary key (order_id).
ordert: This table includes training orders. It has a composite primary key (order_id and product_id) and indicates whether a product in an order is a reorder or not (through the reordered variable).
orderp: This table includes prior orders. It has a composite primary key (order_id and product_id) and indicates whether a product in an order is a reorder or not (through the reordered variable).
products: This table includes all products. It has a single primary key (product_id)
aisles: This table includes all aisles. It has a single primary key (aisle_id)
departments: This table includes all departments. It has a single primary key (department_id)

Data Import

The six files containing data is loaded in this step.

path <- "/Users/keerthigopalakrishnan/Downloads/instacart_2017_05_01"

orderp <- fread(file.path(path, "order_products__prior.csv"))
ordert <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
products <- fread(file.path(path, "products.csv"))
aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))

Reshape Data

We transform the data in order to facilitate their further analysis. First, we convert character variables into factors so we can use them in the creation of the model. In R, a categorical variable is called factor and has a fixed number of different values.

# We convert character variables into factor. In R, a categorical variable is called factor and has a fixed number of different values
aisles$aisle <- as.factor(aisles$aisle)
departments$department <- as.factor(departments$department)
orders$eval_set <- as.factor(orders$eval_set)
products$product_name <- as.factor(products$product_name)

In the products table we replace aisle_id and department_id with aisle name and department name

# In the products table we replace aisle_id and department_id with aisle name and department name
products <- products %>% 
  inner_join(aisles) %>% inner_join(departments) %>% 
  select(-aisle_id, -department_id)

# We delete the tables "aisles" and "departments" as we don't need them
rm(aisles, departments)

We add the column “user_id” at the table ordert after matching the “order_id” of this table with the “order_id” of the table orders

# We add the column "user_id" at the table ordert
ordert$user_id <- orders$user_id[match(ordert$order_id, orders$order_id)]

We create a new table orders_products which contains the tables orders and orderp. Towards this end, we use inner_join() function, which returns records that have matching values in both tables

# We create a new table "orders_products" which contains the tables "orders" and orderp
orders_products <- orders %>% inner_join(orderp, by = "order_id")

# We delete the table "orderp"
rm(orderp)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1904419  101.8    3585355  191.5         NA   2826053  151.0
## Vcells 209231783 1596.4  331033758 2525.6     102400 274140100 2091.6

Sample Data

Following are the final datasets that will be used during this study:

1. Orders

datatable(head(orders, n = 10), class = 'cell-border stripe hover condensed responsive')

2. Products

datatable(head(products, n = 10), class = 'cell-border stripe hover condensed responsive')

3. Order Products Train

datatable(head(ordert, n = 10), class = 'cell-border stripe hover condensed responsive')

4. Order_Product Joined Table

datatable(head(orders_products, n = 10), class = 'cell-border stripe hover condensed responsive')

Create Predictor Variables

We now ready to identify and calculate predictor variables based on the provided data. We can create various types of predictors such as:

Product predictors describing characteristics of a product e.g. total number of times a product has been purchased.
User predictors describing the behavior of a user e.g. total number of orders of a user.
User & product predictors describing the behavior of a user towards a specific product e.g. the total times a user ordered a product.
Datetime predictors describing temporal characteristics of the orders.

Product Predictors

In this step, we create four new predictor variables that define products and we store them in a new table that is called prd. We have selected to include the following

prod_orders: Total number of orders per product
prod_reorder_probability: Probability a product is reordered after the first order
prod_reorder_times: In average how many times a product has been purchased by the users who purchased it at least once
prod_reorder_ratio: Reorders per total number of orders of the product

We first need to calculate the following supportive variables:

prod_reorders: Total number of reorders per product
prod_first_orders: Total number of first orders per product or total number of customers who bought a product
prod_second_orders: Total number of second orders per product or total number of customers who bougth a product at least twice

# We create the prd and we start with the data inside the orders_products table
prd <- orders_products %>%
# We arrange() the three variables "user_id", "order_number", "product_id" to sort them in descending order
arrange(user_id, order_number, product_id) %>%
# We group_by() to group together users and products
group_by(user_id, product_id) %>%
# Mutate() creates the new variable "product time" through row_number() which returns a sequential number starting at 1
mutate(product_time = row_number()) %>%
# We have now identified how many times a user bought a product
ungroup()


prd <- prd %>%
# We group data per product_id
group_by(product_id) %>%
# We create the four new variables based on the groups we have created i.e. per product_id
summarise(
# n() counts the objects inside the different groups of product_id i.e. the total number of orders per product    
    prod_orders = n(),
# Summarise the reordered variable per product_id (recall that reordered is 1 or 0) i.e. the total number of reorders per product    
    prod_reorders = sum(reordered),    
    prod_first_orders = sum(product_time == 1),
    prod_second_orders = sum(product_time == 2)
  )

# we calculate the prod_reorder_probability variable
prd$prod_reorder_probability <- prd$prod_second_orders / prd$prod_first_orders
# we caclculate the prod_reorder_times variable
prd$prod_reorder_times <- 1 + prd$prod_reorders / prd$prod_first_orders
# we caclculate the prod_reorder_ratio variable
prd$prod_reorder_ratio <- prd$prod_reorders / prd$prod_orders

# we remove the prod_reorders, prod_first_orders, and prod_second_orders variables
prd <- prd %>% select(-prod_reorders, -prod_first_orders, -prod_second_orders)

# we delete the products table
rm(products)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1943910  103.9   26169075 1397.6         NA  32711344 1747.0
## Vcells 209173326 1595.9  592629427 4521.5     102400 507023843 3868.3

# Let's see the final prd table
head(prd,20)

## # A tibble: 20 x 5
##    product_id prod_orders prod_reorder_pr… prod_reorder_ti…
##         <int>       <int>            <dbl>            <dbl>
##  1          1        1852           0.385              2.59
##  2          2          90           0.103              1.15
##  3          3         277           0.486              3.74
##  4          4         329           0.352              1.81
##  5          5          15           0.667              2.5 
##  6          6           8           0.4                1.6 
##  7          7          30           0.333              1.67
##  8          8         165           0.366              2.01
##  9          9         156           0.419              2.11
## 10         10        2572           0.315              2.03
## 11         11         104           0.333              2.31
## 12         12         246           0.317              2.05
## 13         13           9           0.167              1.5 
## 14         14          17           0.0667             1.13
## 15         15           4           0.333              1.33
## 16         16          19           0.118              1.12
## 17         17          18           0.125              1.12
## 18         18         137           0.418              2.49
## 19         19           4           0                  1   
## 20         20           6           0.2                1.2 
## # ... with 1 more variable: prod_reorder_ratio <dbl>

User Predictors

The next set of predictor variables that we calculate are related to users and we store them in the new users table. To calculate these variables we take into account only the prior orders.

We start by calculating three new variables:

user_orders: Total number of orders per user
user_period: The time period (in days) between the first and last order of a user
user_mean_days_since_prior: Mean time period (in days) between two consequtive orders of a user

Inline code comments explain how we calculate these variables. We start by creating a temporary version of the users table from the orders table.

users <- orders %>%
# We keep only the prior orders
  filter(eval_set == "prior") %>%
# We group orders by user_id
  group_by(user_id) %>%
# We calculate the variables based on different user_id
  summarise(
    # We calculate the total number of orders per user using the order_number variable.
    # What other variable we could have used in order to calculate total number of orders?  
    user_orders = max(order_number),
    # Using the na.rm = T we omit the missing values and calculate the sum and mean only for the values that we have
    user_period = sum(days_since_prior_order, na.rm = T),
    user_mean_days_since_prior = mean(days_since_prior_order, na.rm = T)
  )

# Let's see the temporary users table
head(users,10)

## # A tibble: 10 x 4
##    user_id user_orders user_period user_mean_days_since_prior
##      <int>       <dbl>       <dbl>                      <dbl>
##  1       1          10         176                       19.6
##  2       2          14         198                       15.2
##  3       3          12         133                       12.1
##  4       4           5          55                       13.8
##  5       5           4          40                       13.3
##  6       6           3          18                        9  
##  7       7          20         203                       10.7
##  8       8           3          60                       30  
##  9       9           3          36                       18  
## 10      10           5          79                       19.8

We also create a supportive table, namely us in order to calculate three more new variables:

user_total_products: Total numbers of basket items included in user’s orders
user_reorder_ratio: Reorder ratio (as defined above) per user
user_distinct_products: Total number of distinct products ordered by a user

Towards this end, we start from the orders_products table and we group observations using user_id variable.

us <- orders_products %>%
  group_by(user_id) %>%
  summarise(
    user_total_products = n(),
    user_reorder_ratio = sum(reordered == 1) / sum(order_number > 1),
    # The n_distinct() function counts the number of unique values in a set 
    user_distinct_products = n_distinct(product_id)
  )

# Let's see the us table
head(us,10)

## # A tibble: 10 x 4
##    user_id user_total_products user_reorder_ratio user_distinct_products
##      <int>               <int>              <dbl>                  <int>
##  1       1                  59             0.759                      18
##  2       2                 195             0.511                     102
##  3       3                  88             0.705                      33
##  4       4                  18             0.0714                     17
##  5       5                  37             0.538                      23
##  6       6                  14             0.2                        12
##  7       7                 206             0.711                      68
##  8       8                  49             0.464                      36
##  9       9                  76             0.391                      58
## 10      10                 143             0.355                      94

Then we combine the users and us tables ussing inner_join() function and we calculate the final variable:

user_average_basket: Average number of basket items per order per user

# We combine users and us tables and store the results into users table
users <- users %>% inner_join(us)

# We calculate the user_average_basket variable
users$user_average_basket <- users$user_total_products / users$user_orders

# let's see the users table
head(users,10)

## # A tibble: 10 x 8
##    user_id user_orders user_period user_mean_days_… user_total_prod…
##      <int>       <dbl>       <dbl>            <dbl>            <int>
##  1       1          10         176             19.6               59
##  2       2          14         198             15.2              195
##  3       3          12         133             12.1               88
##  4       4           5          55             13.8               18
##  5       5           4          40             13.3               37
##  6       6           3          18              9                 14
##  7       7          20         203             10.7              206
##  8       8           3          60             30                 49
##  9       9           3          36             18                 76
## 10      10           5          79             19.8              143
## # ... with 3 more variables: user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>

We now identify the future order per user and add them in the users table. The future orders are indicated as train and test in the eval_set variable. As a result, we will know what is the order_id of the future order per user, whether this order belongs in the train or test set, and the time in days since the last order.

us <- orders %>%
  # we exclude prior orders and thus we keep only train and test orders
  filter(eval_set != "prior") %>%
  select(user_id, order_id, eval_set,
         time_since_last_order = days_since_prior_order)

# We combine users and us tables and store the results into the users table
users <- users %>% inner_join(us)

# We delete the us table
rm(us)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1967707  105.1   20935260 1118.1         NA  32711344 1747.0
## Vcells 210976359 1609.7  592629427 4521.5     102400 507023843 3868.3

# let's see the final users table
head(users,10)

## # A tibble: 10 x 11
##    user_id user_orders user_period user_mean_days_… user_total_prod…
##      <int>       <dbl>       <dbl>            <dbl>            <int>
##  1       1          10         176             19.6               59
##  2       2          14         198             15.2              195
##  3       3          12         133             12.1               88
##  4       4           5          55             13.8               18
##  5       5           4          40             13.3               37
##  6       6           3          18              9                 14
##  7       7          20         203             10.7              206
##  8       8           3          60             30                 49
##  9       9           3          36             18                 76
## 10      10           5          79             19.8              143
## # ... with 6 more variables: user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>,
## #   order_id <int>, eval_set <fct>, time_since_last_order <dbl>

User_Product Predictors

We now create predictors that indicate how a user behaves towards a specific product. We store these predictors in the data table, which is also the final table that we create. Towards this end, we use both prd and users tables. We create the following predictors:

up_orders: The total times a user ordered a product
up_first_order: What was the first time a user purchased a product
up_last_order: What was the last time a user purchased a product
up_average_cart_position: The average position in a user’s cart of a product
up_order_rate: Percentage of user’s orders that include a specific product
up_orders_since_last_order: Number of orders since user’s last order of a product
up_order_rate_since_first_order: Pecentage of orders since first order of a product in which a user purchased this product

We create the data table starting from the orders_products table and summarising using both user_id and product_id. We then create the four first new variables. After running the following code we get a temporary version of the data table.

data <- orders_products %>%
  group_by(user_id, product_id) %>% 
  summarise(
    up_orders = n(),
    up_first_order = min(order_number),
    up_last_order = max(order_number),
    up_average_cart_position = mean(add_to_cart_order))

# We delete the tables "orders_products" and "orders"
rm(orders_products, orders)

We compine the data table with the prd and users tables and we calculate the final three variables

# We use inner_join() to combine the table "data" with the tables "prd" and "users"
data <- data %>% 
  inner_join(prd, by = "product_id") %>%
  inner_join(users, by = "user_id")

# We calculate the variables "up_order_rate", "up_orders_since_last_order", "up_order_rate_since_first_order"
data$up_order_rate <- data$up_orders / data$user_orders
data$up_orders_since_last_order <- data$user_orders - data$up_last_order
data$up_order_rate_since_first_order <- data$up_orders / (data$user_orders - data$up_first_order + 1)

We now combine the data table with the ordert table in order to see which products that a user has already bought (prior) has been reordered (train).

If a product from ordert table has been reordered (reordered=1) by a user then this combination of product and user will exist in data table and thus the reordered value will be added in the data table through leftjoin(). If not then the specific combination of product and user will not exist in data table and thus the reordered value for this observation will remain empty (reordered=NA).

Below shows the final Table:

data <- data %>% 
  left_join(ordert %>% select(user_id, product_id, reordered), 
            by = c("user_id", "product_id"))

# We delete the tables "ordert", "prd" and "users"
rm(ordert, prd, users)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1968043  105.2   10718852  572.5         NA  32711344 1747.0
## Vcells 272993355 2082.8  711235312 5426.3     102400 591235612 4510.8

# Let's see the final data table
head(data, 10)

## # A tibble: 10 x 24
## # Groups:   user_id [1]
##    user_id product_id up_orders up_first_order up_last_order
##      <int>      <int>     <int>          <dbl>         <dbl>
##  1       1        196        10              1            10
##  2       1      10258         9              2            10
##  3       1      10326         1              5             5
##  4       1      12427        10              1            10
##  5       1      13032         3              2            10
##  6       1      13176         2              2             5
##  7       1      14084         1              1             1
##  8       1      17122         1              5             5
##  9       1      25133         8              3            10
## 10       1      26088         2              1             2
## # ... with 19 more variables: up_average_cart_position <dbl>,
## #   prod_orders <int>, prod_reorder_probability <dbl>,
## #   prod_reorder_times <dbl>, prod_reorder_ratio <dbl>, user_orders <dbl>,
## #   user_period <dbl>, user_mean_days_since_prior <dbl>,
## #   user_total_products <int>, user_reorder_ratio <dbl>,
## #   user_distinct_products <int>, user_average_basket <dbl>,
## #   order_id <int>, eval_set <fct>, time_since_last_order <dbl>,
## #   up_order_rate <dbl>, up_orders_since_last_order <dbl>,
## #   up_order_rate_since_first_order <dbl>, reordered <int>

Train and Test Data Sets

Before we are ready to run the XGBoost algorithm, it is necessary to create the final train and test tables that will feed the algorithm and the evaluation respectively.

We split the data based on the eval_set variable into train and test.
We remove all the columns that are not predictors variables.
In the train set we tranform the missing values (NA) of reordered to 0 in order to indicate that these products have not been reordered in the future order.

train <- as.data.frame(data[data$eval_set == "train",])
train$eval_set <- NULL
train$user_id <- NULL
#train$product_id <- NULL
#train$order_id <- NULL
# below we transform missing values of 'reordered' variable to 0
train$reordered[is.na(train$reordered)] <- 0

head(train,10)

##    product_id up_orders up_first_order up_last_order
## 1         196        10              1            10
## 2       10258         9              2            10
## 3       10326         1              5             5
## 4       12427        10              1            10
## 5       13032         3              2            10
## 6       13176         2              2             5
## 7       14084         1              1             1
## 8       17122         1              5             5
## 9       25133         8              3            10
## 10      26088         2              1             2
##    up_average_cart_position prod_orders prod_reorder_probability
## 1                  1.400000       35791                0.5825000
## 2                  3.333333        1946                0.5529623
## 3                  5.000000        5526                0.5215809
## 4                  3.300000        6476                0.5294818
## 5                  6.333333        3751                0.4797823
## 6                  6.000000      379450                0.7118844
## 7                  2.000000       15935                0.6291501
## 8                  6.000000       13880                0.5274262
## 9                  4.000000        6196                0.5807453
## 10                 4.500000        2523                0.4049871
##    prod_reorder_times prod_reorder_ratio user_orders user_period
## 1            4.473875          0.7764801          10         176
## 2            3.493716          0.7137718          10         176
## 3            2.873635          0.6520087          10         176
## 4            3.857058          0.7407350          10         176
## 5            2.916796          0.6571581          10         176
## 6            5.972111          0.8325550          10         176
## 7            5.290505          0.8109821          10         176
## 8            3.082390          0.6755764          10         176
## 9            3.848447          0.7401549          10         176
## 10           2.169390          0.5390408          10         176
##    user_mean_days_since_prior user_total_products user_reorder_ratio
## 1                    19.55556                  59          0.7592593
## 2                    19.55556                  59          0.7592593
## 3                    19.55556                  59          0.7592593
## 4                    19.55556                  59          0.7592593
## 5                    19.55556                  59          0.7592593
## 6                    19.55556                  59          0.7592593
## 7                    19.55556                  59          0.7592593
## 8                    19.55556                  59          0.7592593
## 9                    19.55556                  59          0.7592593
## 10                   19.55556                  59          0.7592593
##    user_distinct_products user_average_basket order_id
## 1                      18                 5.9  1187899
## 2                      18                 5.9  1187899
## 3                      18                 5.9  1187899
## 4                      18                 5.9  1187899
## 5                      18                 5.9  1187899
## 6                      18                 5.9  1187899
## 7                      18                 5.9  1187899
## 8                      18                 5.9  1187899
## 9                      18                 5.9  1187899
## 10                     18                 5.9  1187899
##    time_since_last_order up_order_rate up_orders_since_last_order
## 1                     14           1.0                          0
## 2                     14           0.9                          0
## 3                     14           0.1                          5
## 4                     14           1.0                          0
## 5                     14           0.3                          0
## 6                     14           0.2                          5
## 7                     14           0.1                          9
## 8                     14           0.1                          5
## 9                     14           0.8                          0
## 10                    14           0.2                          8
##    up_order_rate_since_first_order reordered
## 1                        1.0000000         1
## 2                        1.0000000         1
## 3                        0.1666667         0
## 4                        1.0000000         0
## 5                        0.3333333         1
## 6                        0.2222222         0
## 7                        0.1000000         0
## 8                        0.1666667         0
## 9                        1.0000000         1
## 10                       0.2000000         1

test <- as.data.frame(data[data$eval_set == "test",])
test$eval_set <- NULL
test$user_id <- NULL
test$reordered <- NULL

head(test,10)

##    product_id up_orders up_first_order up_last_order
## 1         248         1              2             2
## 2        1005         1             10            10
## 3        1819         3              4             7
## 4        7503         1              3             3
## 5        8021         1              2             2
## 6        9387         5              1             7
## 7       12845         1              4             4
## 8       14992         2              6             7
## 9       15143         1              1             1
## 10      16797         3              1             9
##    up_average_cart_position prod_orders prod_reorder_probability
## 1                  3.000000        6371                0.2795080
## 2                  5.000000         463                0.3127413
## 3                  2.666667        2424                0.3468725
## 4                  6.000000       12474                0.3837314
## 5                  5.000000       27864                0.4408357
## 6                  3.600000       36187                0.4691700
## 7                  2.000000       10027                0.2570445
## 8                  7.000000       29069                0.4270636
## 9                  3.000000        3447                0.3580811
## 10                 4.000000      142951                0.5666180
##    prod_reorder_times prod_reorder_ratio user_orders user_period
## 1            1.667365          0.4002511          12         133
## 2            1.787645          0.4406048          12         133
## 3            1.969131          0.4921617          12         133
## 4            2.239899          0.5535514          12         133
## 5            2.445927          0.5911570          12         133
## 6            2.860632          0.6504269          12         133
## 7            1.569662          0.3629201          12         133
## 8            2.397048          0.5828202          12         133
## 9            1.968589          0.4920220          12         133
## 10           3.312962          0.6981553          12         133
##    user_mean_days_since_prior user_total_products user_reorder_ratio
## 1                    12.09091                  88          0.7051282
## 2                    12.09091                  88          0.7051282
## 3                    12.09091                  88          0.7051282
## 4                    12.09091                  88          0.7051282
## 5                    12.09091                  88          0.7051282
## 6                    12.09091                  88          0.7051282
## 7                    12.09091                  88          0.7051282
## 8                    12.09091                  88          0.7051282
## 9                    12.09091                  88          0.7051282
## 10                   12.09091                  88          0.7051282
##    user_distinct_products user_average_basket order_id
## 1                      33            7.333333  2774568
## 2                      33            7.333333  2774568
## 3                      33            7.333333  2774568
## 4                      33            7.333333  2774568
## 5                      33            7.333333  2774568
## 6                      33            7.333333  2774568
## 7                      33            7.333333  2774568
## 8                      33            7.333333  2774568
## 9                      33            7.333333  2774568
## 10                     33            7.333333  2774568
##    time_since_last_order up_order_rate up_orders_since_last_order
## 1                     11    0.08333333                         10
## 2                     11    0.08333333                          2
## 3                     11    0.25000000                          5
## 4                     11    0.08333333                          9
## 5                     11    0.08333333                         10
## 6                     11    0.41666667                          5
## 7                     11    0.08333333                          8
## 8                     11    0.16666667                          5
## 9                     11    0.08333333                         11
## 10                    11    0.25000000                          3
##    up_order_rate_since_first_order
## 1                       0.09090909
## 2                       0.33333333
## 3                       0.33333333
## 4                       0.10000000
## 5                       0.09090909
## 6                       0.41666667
## 7                       0.11111111
## 8                       0.28571429
## 9                       0.08333333
## 10                      0.25000000

rm(data)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   1970612  105.3    8575081  458.0         NA  32711344 1747.0
## Vcells 261511645 1995.2  711235312 5426.3     102400 711235263 5426.3

Create the Model

We use the train data to create the model. Each variable is a list containing two things, label (or outcome) and data (predictors). In our example the label we want to predict is the column “reordered”.

Because xgboost manages only numeric vectors we have to transform the categorical data to dummy variables while building the model we will need. Towards this end we use xgb.DMatrix.

Finally we can plot the results from our model using xgb.plot.importance()

library(ggplot2)

params <- list(
  "objective"           = "reg:logistic",
  "eval_metric"         = "logloss", 
  "eta"                 = 0.1, 
  "max_depth"           = 6, 
  "min_child_weight"    = 10,  
  "gamma"               = 0.70,  
  "subsample"           = 0.76,
  "colsample_bytree"    = 0.95,  
  "alpha"               = 2e-05,  
  "lambda"              = 10 
)

# We get a sample containing 10% of the train table
subtrain <- train %>% sample_frac(0.1)
# We create an xgb.DMatrix that is named X with predictors from subtrain table and response the reordered variable
X <- xgb.DMatrix(as.matrix(subtrain %>% select(-reordered, -order_id, -product_id)), label = subtrain$reordered)
# We create the actual model
model <- xgboost(data = X, params = params, nrounds = 80)

## [1]  train-logloss:0.623296 
## [2]  train-logloss:0.571299 
## [3]  train-logloss:0.524926 
## [4]  train-logloss:0.485843 
## [5]  train-logloss:0.454547 
## [6]  train-logloss:0.426012 
## [7]  train-logloss:0.401700 
## [8]  train-logloss:0.381263 
## [9]  train-logloss:0.364102 
## [10] train-logloss:0.348727 
## [11] train-logloss:0.335426 
## [12] train-logloss:0.323883 
## [13] train-logloss:0.313988 
## [14] train-logloss:0.305120 
## [15] train-logloss:0.297464 
## [16] train-logloss:0.290523 
## [17] train-logloss:0.284649 
## [18] train-logloss:0.279868 
## [19] train-logloss:0.275282 
## [20] train-logloss:0.271585 
## [21] train-logloss:0.268023 
## [22] train-logloss:0.265157 
## [23] train-logloss:0.262673 
## [24] train-logloss:0.260274 
## [25] train-logloss:0.258414 
## [26] train-logloss:0.256825 
## [27] train-logloss:0.255284 
## [28] train-logloss:0.253753 
## [29] train-logloss:0.252495 
## [30] train-logloss:0.251595 
## [31] train-logloss:0.250821 
## [32] train-logloss:0.250048 
## [33] train-logloss:0.249379 
## [34] train-logloss:0.248643 
## [35] train-logloss:0.248030 
## [36] train-logloss:0.247615 
## [37] train-logloss:0.247155 
## [38] train-logloss:0.246788 
## [39] train-logloss:0.246505 
## [40] train-logloss:0.246253 
## [41] train-logloss:0.246013 
## [42] train-logloss:0.245834 
## [43] train-logloss:0.245646 
## [44] train-logloss:0.245423 
## [45] train-logloss:0.245262 
## [46] train-logloss:0.245106 
## [47] train-logloss:0.244976 
## [48] train-logloss:0.244870 
## [49] train-logloss:0.244761 
## [50] train-logloss:0.244657 
## [51] train-logloss:0.244542 
## [52] train-logloss:0.244433 
## [53] train-logloss:0.244355 
## [54] train-logloss:0.244269 
## [55] train-logloss:0.244163 
## [56] train-logloss:0.244095 
## [57] train-logloss:0.244030 
## [58] train-logloss:0.243962 
## [59] train-logloss:0.243899 
## [60] train-logloss:0.243824 
## [61] train-logloss:0.243765 
## [62] train-logloss:0.243708 
## [63] train-logloss:0.243658 
## [64] train-logloss:0.243609 
## [65] train-logloss:0.243530 
## [66] train-logloss:0.243484 
## [67] train-logloss:0.243403 
## [68] train-logloss:0.243371 
## [69] train-logloss:0.243331 
## [70] train-logloss:0.243264 
## [71] train-logloss:0.243214 
## [72] train-logloss:0.243166 
## [73] train-logloss:0.243098 
## [74] train-logloss:0.243065 
## [75] train-logloss:0.243022 
## [76] train-logloss:0.242973 
## [77] train-logloss:0.242920 
## [78] train-logloss:0.242891 
## [79] train-logloss:0.242847 
## [80] train-logloss:0.242827

# We estimate the importance of the predictors
importance <- xgb.importance(colnames(X), model = model)
# We plot the importance of the predictors
xgb.ggplot.importance(importance)

rm(X, importance, subtrain)
gc()

##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   2113604  112.9    6860064  366.4         NA  32711344 1747.0
## Vcells 261896928 1998.2  711235312 5426.3     102400 711235263 5426.3

Apply the Model

Because Instacart did not reveal the outcome of the test data we also apply the model in the train data and create a table that enables comparing the predicted to the actural results

Apply to Test Data

Now we are ready to perform the prediction with the model we have built to classify test data.

# We use the xgb.DMatrix to group our test data into a matrix
X <- xgb.DMatrix(as.matrix(test %>% select(-order_id, -product_id)))
# We apply the model and we predict the reordered variable for the test set.
test$reordered <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
test$reordered <- (test$reordered > 0.21) * 1

# We create the final table with reordered products per order
submission <- test %>%
  filter(reordered == 1) %>%
  group_by(order_id) %>%
  summarise(
    products = paste(product_id, collapse = " ")
  )
# Let's see the submission table
head(submission,10)

## # A tibble: 10 x 2
##    order_id products                                                      
##       <int> <chr>                                                         
##  1       17 13107 21463                                                   
##  2       34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792    
##  3      137 2326 5134 23794 24852 25890 38689 41787                       
##  4      182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672     
##  5      257 4605 13870 21137 22035 24838 24852 27104 27966 28476 29837 30…
##  6      313 12779 13198 14077 21903 25890 28535 30391 34969 45007 46906 4…
##  7      353 21137 33000 35561 40688                                       
##  8      386 260 2326 4920 6046 8174 15872 17652 21479 21903 22124 24852 2…
##  9      414 4472 14947 19006 20392 20564 21230 21376 21709 27845 31215 31…
## 10      418 1503 13702 30489 38694 40268 41950 47766

# We create the table "missing" where we have the orders in which none product will be ordered according to our prediction
missing <- data.frame(
  order_id = unique(test$order_id[!test$order_id %in% submission$order_id]),
  products = "None"
)

submission <- submission %>% bind_rows(missing) %>% arrange(order_id)
#Let's see the submission table
head(submission,10)

## # A tibble: 10 x 2
##    order_id products                                                      
##       <int> <chr>                                                         
##  1       17 13107 21463                                                   
##  2       34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792    
##  3      137 2326 5134 23794 24852 25890 38689 41787                       
##  4      182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672     
##  5      257 4605 13870 21137 22035 24838 24852 27104 27966 28476 29837 30…
##  6      313 12779 13198 14077 21903 25890 28535 30391 34969 45007 46906 4…
##  7      353 21137 33000 35561 40688                                       
##  8      386 260 2326 4920 6046 8174 15872 17652 21479 21903 22124 24852 2…
##  9      414 4472 14947 19006 20392 20564 21230 21376 21709 27845 31215 31…
## 10      418 1503 13702 30489 38694 40268 41950 47766

Apply to Training Set

In order to be able to compare our prediction to the actual result we re-apply the model in the train data

# We use the xgb.DMatrix to group our train data into a matrix
X <- xgb.DMatrix(as.matrix(train %>% select(-order_id, -product_id, -reordered)))
# We apply the model and we predict the reordered variable for the train set.
train$reordered_pred <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
train$reordered_pred <- (train$reordered_pred > 0.21) * 1

# We create the final table with reordered products per order
submission_train <- train %>%
  filter(reordered_pred == 1) %>%
  group_by(order_id) %>%
  summarise(
    products = paste(product_id, collapse = " ")
  )

real_reorders <- train %>%
 filter(reordered == 1) %>%
 group_by(order_id) %>%
 summarise(
   real_products = paste(product_id, collapse = " ")
 )

submission_train <- real_reorders %>% 
  inner_join(submission_train, by = "order_id")

The Final Result for comparison is shown below:

datatable(head(submission_train, n = 10), class = 'cell-border stripe hover condensed responsive')

Model Evaluation

The model is now evlauted by calculating the number of correct matches vs the number of incorrect matches.

Evalution_set <- submission_train

for (i in 1: nrow(Evalution_set)){

Evalution_set$matches[i] <- length(Reduce('intersect',str_extract_all(c(Evalution_set$products[i],Evalution_set$real_products[i]),"\\w+")))
  
Evalution_set$no_matches[i] <- (str_count(Evalution_set$real_products[i]," ")+1) - Evalution_set$matches[i]

}