Instacart is a San Francisco-based company that operates a same-day grocery delivery service: customers select groceries from various retailers through a web application, and a personal shopper picks and delivers the order. This study uses the open-sourced "The Instacart Online Grocery Shopping Dataset 2017" released by Instacart. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
Instacart labels each order in the data as prior, train or test. Prior orders describe the past behaviour of a user, while train and test orders regard the future behaviour that we need to predict. In other words, we want to predict which previously purchased products (prior orders) will be in a user's next order (train and test orders).
The goal is to predict which of these products will appear in a user's future order. This is a classification problem, because for each user-product pair we need to predict whether it is a reorder or not. This is indicated by the value of the reordered variable, i.e. reordered = 1 or reordered = 0 (see figure below).
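As a quick orientation, once the tables are loaded (see the data-loading step below), the prior/train/test split and the balance of the reordered flag can be inspected. A minimal sketch, assuming the tables are available as orders and ordert:
# Sketch: inspect the eval_set split and the reorder label balance
table(orders$eval_set)                 # number of prior / train / test orders
prop.table(table(ordert$reordered))    # share of reordered vs first-time products in train orders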
On these predicted products we will then apply the Apriori algorithm, to understand which associated products might additionally be predicted for the same order_id.
Our method includes the following steps:
if (!require("pacman")) install.packages("pacman")
# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("readr",
"data.table",
"tidyverse",
"dplyr",
"stringr",
"DT",
"ggplot2",
"knitr",
"magrittr",
"treemap",
"arules",
"arulesViz",
"xgboost",
"Ckmeans.1d.dp")
Instacart is an American company that operates a same-day grocery delivery service. The Instacart Online Grocery Shopping Dataset 2017 used in this study was accessed on 03/31.
The dataset consists of six tables (CSV files) containing the order-related information: the orders themselves, the products contained in prior and train orders, and the product catalogue (products, aisles and departments).
The six data files are loaded in this step.
path <- "/Users/keerthigopalakrishnan/Downloads/instacart_2017_05_01"
orderp <- fread(file.path(path, "order_products__prior.csv"))
ordert <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
products <- fread(file.path(path, "products.csv"))
aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))
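As an optional sanity check, the dimensions of the loaded tables can be inspected:
# Optional: number of rows and columns of each loaded table
sapply(list(orders = orders, orderp = orderp, ordert = ordert,
            products = products, aisles = aisles, departments = departments), dim)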
Buying patterns of users are analysed by the day and hour at which they purchase and by the department to which the ordered products belong.
Weekends have more orders than weekdays.
orders %>%
ggplot(aes(x= order_dow, fill = order_dow)) +
geom_bar()
Monday to Thursday follows a double-peak ("camel hump") shape, with a first peak around 9-11 and a second peak around 13-15. This indicates that on working days users order groceries either in the morning or around lunch time. This pattern is slightly different on weekends.
orders %>%
ggplot(aes(x= order_hour_of_day, fill = order_dow)) +
geom_bar() +
facet_wrap(~ order_dow, ncol = 2)
Although the produce department offers fewer unique products than departments such as personal care, snacks and pantry, most ordered products come from produce, which contains vegetables and fruits.
products %>%
group_by(department_id, aisle_id) %>% summarize(count=n()) %>%
left_join(departments,by="department_id") %>%
left_join(aisles,by="aisle_id") %>%
treemap(index=c("department","aisle"),vSize="count",title="Tree map of Unique products offered in each Department/ aisle",palette="Set3",border.col="#FFFFFF")
ordert %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(departments,by="department_id") %>%
left_join(aisles,by="aisle_id") %>%
mutate(onesize = 1) %>%
treemap(index=c("department","aisle"),vSize="sumcount",title="Tree map of most ordered products in Department/Aisle",palette="Set3",border.col="#FFFFFF")
# Split the "Product ID" values into groups based on "Order ID" variable
order_product <- orderp %>%
left_join(products, by = "product_id")
transactions <- as(split(order_product$product_name,order_product$order_id),"transactions")
hist(size(transactions),
breaks = 0:150,
xaxt = "n",
ylim = c(0, 250000),
col = "blue",
main = "Number of Products per Order",
xlab = "Order Size: Number of Products")
axis(1, at = seq(0, 160, by = 10))
mtext(paste("Total:", length(transactions), "Orders,", sum(size(transactions)), "Products"))
Item frequency plot: the top 10 most frequently bought items are shown below. Only items with a support of at least 0.01 (i.e. appearing in at least 1% of the transactions) are considered.
itemFrequencyPlot(transactions,
support = 0.01,
cex.names = 0.8,
horiz = TRUE, topN = 10)
We transform the data in order to facilitate their further analysis. First, we convert character variables into factors so we can use them in the creation of the model. In R, a categorical variable is called a factor and has a fixed number of different values.
# We convert character variables into factors. In R, a categorical variable is called a factor and has a fixed number of different values
aisles$aisle <- as.factor(aisles$aisle)
departments$department <- as.factor(departments$department)
orders$eval_set <- as.factor(orders$eval_set)
products$product_name <- as.factor(products$product_name)
In the products table we replace aisle_id and department_id with aisle name and department name
# In the products table we replace aisle_id and department_id with aisle name and department name
products <- products %>%
inner_join(aisles) %>% inner_join(departments) %>%
select(-aisle_id, -department_id)
We add the column "user_id" to the table ordert by matching the "order_id" of this table with the "order_id" of the table orders.
# We add the column "user_id" at the table ordert
ordert$user_id <- orders$user_id[match(ordert$order_id, orders$order_id)]
We create a new table orders_products which combines the tables orders and orderp. Towards this end, we use the inner_join() function, which returns records that have matching values in both tables.
# We create a new table "orders_products" which contains the tables "orders" and orderp
orders_products <- orders %>% inner_join(orderp, by = "order_id")
Following are the final datasets that will be used during this study:
datatable(head(orders, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(departments, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(aisles, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(orderp, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(ordert, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(products, n = 10), class = 'cell-border stripe hover condensed responsive')
datatable(head(orders_products, n = 10), class = 'cell-border stripe hover condensed responsive')
We are now ready to identify and calculate predictor variables based on the provided data. We can create various types of predictors, such as:
In this step, we create four new predictor variables that describe products and store them in a new table called prd. We have selected to include the following:
We first need to calculate the following supportive variables:
# We create the prd and we start with the data inside the orders_products table
prd <- orders_products %>%
# We arrange() by the three variables "user_id", "order_number", "product_id" to sort the rows (in ascending order by default)
arrange(user_id, order_number, product_id) %>%
# We group_by() to group together users and products
group_by(user_id, product_id) %>%
# mutate() creates the new variable "product_time" through row_number(), which returns a sequential number starting at 1
mutate(product_time = row_number()) %>%
# We have now identified how many times a user bought a product
ungroup()
prd <- prd %>%
# We group data per product_id
group_by(product_id) %>%
# We create the four new variables based on the groups we have created i.e. per product_id
summarise(
# n() counts the objects inside the different groups of product_id i.e. the total number of orders per product
prod_orders = n(),
# Summarise the reordered variable per product_id (recall that reordered is 1 or 0) i.e. the total number of reorders per product
prod_reorders = sum(reordered),
prod_first_orders = sum(product_time == 1),
prod_second_orders = sum(product_time == 2)
)
# we calculate the prod_reorder_probability variable
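# (i.e. the probability that a product, once bought for the first time, is bought a second time)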
prd$prod_reorder_probability <- prd$prod_second_orders / prd$prod_first_orders
# we calculate the prod_reorder_times variable
prd$prod_reorder_times <- 1 + prd$prod_reorders / prd$prod_first_orders
# we calculate the prod_reorder_ratio variable
prd$prod_reorder_ratio <- prd$prod_reorders / prd$prod_orders
# we remove the prod_reorders, prod_first_orders, and prod_second_orders variables
prd <- prd %>% select(-prod_reorders, -prod_first_orders, -prod_second_orders)
# Let's see the final prd table
head(prd,20)
## # A tibble: 20 x 5
## product_id prod_orders prod_reorder_pr… prod_reorder_ti…
## <int> <int> <dbl> <dbl>
## 1 1 1852 0.385 2.59
## 2 2 90 0.103 1.15
## 3 3 277 0.486 3.74
## 4 4 329 0.352 1.81
## 5 5 15 0.667 2.5
## 6 6 8 0.4 1.6
## 7 7 30 0.333 1.67
## 8 8 165 0.366 2.01
## 9 9 156 0.419 2.11
## 10 10 2572 0.315 2.03
## 11 11 104 0.333 2.31
## 12 12 246 0.317 2.05
## 13 13 9 0.167 1.5
## 14 14 17 0.0667 1.13
## 15 15 4 0.333 1.33
## 16 16 19 0.118 1.12
## 17 17 18 0.125 1.12
## 18 18 137 0.418 2.49
## 19 19 4 0 1
## 20 20 6 0.2 1.2
## # ... with 1 more variable: prod_reorder_ratio <dbl>
The next set of predictor variables that we calculate are related to users and we store them in the new users table. To calculate these variables we take into account only the prior orders.
We start by calculating three new variables:
Inline code comments explain how we calculate these variables. We start by creating a temporary version of the users table from the orders table.
users <- orders %>%
# We keep only the prior orders
filter(eval_set == "prior") %>%
# We group orders by user_id
group_by(user_id) %>%
# We calculate the variables based on different user_id
summarise(
# We calculate the total number of orders per user using the order_number variable.
# What other variable could we have used to calculate the total number of orders?
user_orders = max(order_number),
# Using the na.rm = T we omit the missing values and calculate the sum and mean only for the values that we have
user_period = sum(days_since_prior_order, na.rm = T),
user_mean_days_since_prior = mean(days_since_prior_order, na.rm = T)
)
# Let's see the temporary users table
head(users,10)
## # A tibble: 10 x 4
## user_id user_orders user_period user_mean_days_since_prior
## <int> <dbl> <dbl> <dbl>
## 1 1 10 176 19.6
## 2 2 14 198 15.2
## 3 3 12 133 12.1
## 4 4 5 55 13.8
## 5 5 4 40 13.3
## 6 6 3 18 9
## 7 7 20 203 10.7
## 8 8 3 60 30
## 9 9 3 36 18
## 10 10 5 79 19.8
We also create a supportive table, namely us in order to calculate three more new variables:
Towards this end, we start from the orders_products table and we group observations using user_id variable.
us <- orders_products %>%
group_by(user_id) %>%
summarise(
user_total_products = n(),
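# Share of a user's product purchases that are reorders; the denominator counts only
# purchases made after the first order, since a user's first order cannot contain reorders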
user_reorder_ratio = sum(reordered == 1) / sum(order_number > 1),
# The n_distinct() function counts the number of unique values in a set
user_distinct_products = n_distinct(product_id)
)
# Let's see the us table
head(us,10)
## # A tibble: 10 x 4
## user_id user_total_products user_reorder_ratio user_distinct_products
## <int> <int> <dbl> <int>
## 1 1 59 0.759 18
## 2 2 195 0.511 102
## 3 3 88 0.705 33
## 4 4 18 0.0714 17
## 5 5 37 0.538 23
## 6 6 14 0.2 12
## 7 7 206 0.711 68
## 8 8 49 0.464 36
## 9 9 76 0.391 58
## 10 10 143 0.355 94
Then we combine the users and us tables using the inner_join() function and we calculate the final variable:
# We combine users and us tables and store the results into users table
users <- users %>% inner_join(us)
# We calculate the user_average_basket variable
users$user_average_basket <- users$user_total_products / users$user_orders
# let's see the users table
head(users,10)
## # A tibble: 10 x 8
## user_id user_orders user_period user_mean_days_… user_total_prod…
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 10 176 19.6 59
## 2 2 14 198 15.2 195
## 3 3 12 133 12.1 88
## 4 4 5 55 13.8 18
## 5 5 4 40 13.3 37
## 6 6 3 18 9 14
## 7 7 20 203 10.7 206
## 8 8 3 60 30 49
## 9 9 3 36 18 76
## 10 10 5 79 19.8 143
## # ... with 3 more variables: user_reorder_ratio <dbl>,
## # user_distinct_products <int>, user_average_basket <dbl>
We now identify the future order per user and add this information to the users table. Future orders are indicated as train and test in the eval_set variable. As a result, we will know the order_id of the future order per user, whether this order belongs to the train or test set, and the time in days since the last order.
us <- orders %>%
# we exclude prior orders and thus we keep only train and test orders
filter(eval_set != "prior") %>%
select(user_id, order_id, eval_set,
time_since_last_order = days_since_prior_order)
# We combine users and us tables and store the results into the users table
users <- users %>% inner_join(us)
# We delete the us table
rm(us)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 6262630 334.5 25032344 1336.9 NA 31290430 1671.1
## Vcells 450758019 3439.1 953631162 7275.7 102400 767328383 5854.3
# let's see the final users table
head(users,10)
## # A tibble: 10 x 11
## user_id user_orders user_period user_mean_days_… user_total_prod…
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 10 176 19.6 59
## 2 2 14 198 15.2 195
## 3 3 12 133 12.1 88
## 4 4 5 55 13.8 18
## 5 5 4 40 13.3 37
## 6 6 3 18 9 14
## 7 7 20 203 10.7 206
## 8 8 3 60 30 49
## 9 9 3 36 18 76
## 10 10 5 79 19.8 143
## # ... with 6 more variables: user_reorder_ratio <dbl>,
## # user_distinct_products <int>, user_average_basket <dbl>,
## # order_id <int>, eval_set <fct>, time_since_last_order <dbl>
We now create predictors that indicate how a user behaves towards a specific product. We store these predictors in the data table, which is also the final table that we create. Towards this end, we use both prd and users tables. We create the following predictors:
We create the data table starting from the orders_products table and summarising using both user_id and product_id. We then create the first four new variables. After running the following code we get a temporary version of the data table.
data <- orders_products %>%
group_by(user_id, product_id) %>%
summarise(
up_orders = n(),
up_first_order = min(order_number),
up_last_order = max(order_number),
up_average_cart_position = mean(add_to_cart_order))
# We delete the tables "orders_products" and "orders"
rm(orders_products, orders)
We combine the data table with the prd and users tables and calculate the final three variables:
# We use inner_join() to combine the table "data" with the tables "prd" and "users"
data <- data %>%
inner_join(prd, by = "product_id") %>%
inner_join(users, by = "user_id")
# We calculate the variables "up_order_rate", "up_orders_since_last_order", "up_order_rate_since_first_order"
data$up_order_rate <- data$up_orders / data$user_orders
data$up_orders_since_last_order <- data$user_orders - data$up_last_order
data$up_order_rate_since_first_order <- data$up_orders / (data$user_orders - data$up_first_order + 1)
We now combine the data table with the ordert table in order to see which of the products that a user has already bought (prior) have been reordered (train).
If a user-product combination from the data table also appears in the ordert table (i.e. the user reordered that product, reordered = 1), the reordered value is added to the data table through left_join(). If not, the combination is absent from ordert and the reordered value for this observation remains empty (reordered = NA).
The final table is shown below:
data <- data %>%
left_join(ordert %>% select(user_id, product_id, reordered),
by = c("user_id", "product_id"))
# We delete the tables "ordert", "prd" and "users"
rm(ordert, prd, users)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 6262580 334.5 20053024 1071.0 NA 31332852 1673.4
## Vcells 512774175 3912.2 953631162 7275.7 102400 953132310 7271.9
# Let's see the final data table
head(data, 10)
## # A tibble: 10 x 24
## # Groups: user_id [1]
## user_id product_id up_orders up_first_order up_last_order
## <int> <int> <int> <dbl> <dbl>
## 1 1 196 10 1 10
## 2 1 10258 9 2 10
## 3 1 10326 1 5 5
## 4 1 12427 10 1 10
## 5 1 13032 3 2 10
## 6 1 13176 2 2 5
## 7 1 14084 1 1 1
## 8 1 17122 1 5 5
## 9 1 25133 8 3 10
## 10 1 26088 2 1 2
## # ... with 19 more variables: up_average_cart_position <dbl>,
## # prod_orders <int>, prod_reorder_probability <dbl>,
## # prod_reorder_times <dbl>, prod_reorder_ratio <dbl>, user_orders <dbl>,
## # user_period <dbl>, user_mean_days_since_prior <dbl>,
## # user_total_products <int>, user_reorder_ratio <dbl>,
## # user_distinct_products <int>, user_average_basket <dbl>,
## # order_id <int>, eval_set <fct>, time_since_last_order <dbl>,
## # up_order_rate <dbl>, up_orders_since_last_order <dbl>,
## # up_order_rate_since_first_order <dbl>, reordered <int>
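Before building the train and test sets, it may help to check how many prior user-product pairs were matched to a reorder in the train orders versus left unmatched (a minimal sketch using the data table created above):
# 1 = the pair was reordered in the user's train order, NA = it was not present in it
table(data$reordered, useNA = "ifany")
# Share of prior user-product pairs that were not reordered
mean(is.na(data$reordered))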
Before we are ready to run the XGBoost algorithm, it is necessary to create the final train and test tables that will feed the algorithm and the evaluation respectively.
train <- as.data.frame(data[data$eval_set == "train",])
train$eval_set <- NULL
train$user_id <- NULL
#train$product_id <- NULL
#train$order_id <- NULL
# below we transform missing values of 'reordered' variable to 0
train$reordered[is.na(train$reordered)] <- 0
head(train,10)
## product_id up_orders up_first_order up_last_order
## 1 196 10 1 10
## 2 10258 9 2 10
## 3 10326 1 5 5
## 4 12427 10 1 10
## 5 13032 3 2 10
## 6 13176 2 2 5
## 7 14084 1 1 1
## 8 17122 1 5 5
## 9 25133 8 3 10
## 10 26088 2 1 2
## up_average_cart_position prod_orders prod_reorder_probability
## 1 1.400000 35791 0.5825000
## 2 3.333333 1946 0.5529623
## 3 5.000000 5526 0.5215809
## 4 3.300000 6476 0.5294818
## 5 6.333333 3751 0.4797823
## 6 6.000000 379450 0.7118844
## 7 2.000000 15935 0.6291501
## 8 6.000000 13880 0.5274262
## 9 4.000000 6196 0.5807453
## 10 4.500000 2523 0.4049871
## prod_reorder_times prod_reorder_ratio user_orders user_period
## 1 4.473875 0.7764801 10 176
## 2 3.493716 0.7137718 10 176
## 3 2.873635 0.6520087 10 176
## 4 3.857058 0.7407350 10 176
## 5 2.916796 0.6571581 10 176
## 6 5.972111 0.8325550 10 176
## 7 5.290505 0.8109821 10 176
## 8 3.082390 0.6755764 10 176
## 9 3.848447 0.7401549 10 176
## 10 2.169390 0.5390408 10 176
## user_mean_days_since_prior user_total_products user_reorder_ratio
## 1 19.55556 59 0.7592593
## 2 19.55556 59 0.7592593
## 3 19.55556 59 0.7592593
## 4 19.55556 59 0.7592593
## 5 19.55556 59 0.7592593
## 6 19.55556 59 0.7592593
## 7 19.55556 59 0.7592593
## 8 19.55556 59 0.7592593
## 9 19.55556 59 0.7592593
## 10 19.55556 59 0.7592593
## user_distinct_products user_average_basket order_id
## 1 18 5.9 1187899
## 2 18 5.9 1187899
## 3 18 5.9 1187899
## 4 18 5.9 1187899
## 5 18 5.9 1187899
## 6 18 5.9 1187899
## 7 18 5.9 1187899
## 8 18 5.9 1187899
## 9 18 5.9 1187899
## 10 18 5.9 1187899
## time_since_last_order up_order_rate up_orders_since_last_order
## 1 14 1.0 0
## 2 14 0.9 0
## 3 14 0.1 5
## 4 14 1.0 0
## 5 14 0.3 0
## 6 14 0.2 5
## 7 14 0.1 9
## 8 14 0.1 5
## 9 14 0.8 0
## 10 14 0.2 8
## up_order_rate_since_first_order reordered
## 1 1.0000000 1
## 2 1.0000000 1
## 3 0.1666667 0
## 4 1.0000000 0
## 5 0.3333333 1
## 6 0.2222222 0
## 7 0.1000000 0
## 8 0.1666667 0
## 9 1.0000000 1
## 10 0.2000000 1
test <- as.data.frame(data[data$eval_set == "test",])
test$eval_set <- NULL
test$user_id <- NULL
test$reordered <- NULL
head(test,10)
## product_id up_orders up_first_order up_last_order
## 1 248 1 2 2
## 2 1005 1 10 10
## 3 1819 3 4 7
## 4 7503 1 3 3
## 5 8021 1 2 2
## 6 9387 5 1 7
## 7 12845 1 4 4
## 8 14992 2 6 7
## 9 15143 1 1 1
## 10 16797 3 1 9
## up_average_cart_position prod_orders prod_reorder_probability
## 1 3.000000 6371 0.2795080
## 2 5.000000 463 0.3127413
## 3 2.666667 2424 0.3468725
## 4 6.000000 12474 0.3837314
## 5 5.000000 27864 0.4408357
## 6 3.600000 36187 0.4691700
## 7 2.000000 10027 0.2570445
## 8 7.000000 29069 0.4270636
## 9 3.000000 3447 0.3580811
## 10 4.000000 142951 0.5666180
## prod_reorder_times prod_reorder_ratio user_orders user_period
## 1 1.667365 0.4002511 12 133
## 2 1.787645 0.4406048 12 133
## 3 1.969131 0.4921617 12 133
## 4 2.239899 0.5535514 12 133
## 5 2.445927 0.5911570 12 133
## 6 2.860632 0.6504269 12 133
## 7 1.569662 0.3629201 12 133
## 8 2.397048 0.5828202 12 133
## 9 1.968589 0.4920220 12 133
## 10 3.312962 0.6981553 12 133
## user_mean_days_since_prior user_total_products user_reorder_ratio
## 1 12.09091 88 0.7051282
## 2 12.09091 88 0.7051282
## 3 12.09091 88 0.7051282
## 4 12.09091 88 0.7051282
## 5 12.09091 88 0.7051282
## 6 12.09091 88 0.7051282
## 7 12.09091 88 0.7051282
## 8 12.09091 88 0.7051282
## 9 12.09091 88 0.7051282
## 10 12.09091 88 0.7051282
## user_distinct_products user_average_basket order_id
## 1 33 7.333333 2774568
## 2 33 7.333333 2774568
## 3 33 7.333333 2774568
## 4 33 7.333333 2774568
## 5 33 7.333333 2774568
## 6 33 7.333333 2774568
## 7 33 7.333333 2774568
## 8 33 7.333333 2774568
## 9 33 7.333333 2774568
## 10 33 7.333333 2774568
## time_since_last_order up_order_rate up_orders_since_last_order
## 1 11 0.08333333 10
## 2 11 0.08333333 2
## 3 11 0.25000000 5
## 4 11 0.08333333 9
## 5 11 0.08333333 10
## 6 11 0.41666667 5
## 7 11 0.08333333 8
## 8 11 0.16666667 5
## 9 11 0.08333333 11
## 10 11 0.25000000 3
## up_order_rate_since_first_order
## 1 0.09090909
## 2 0.33333333
## 3 0.33333333
## 4 0.10000000
## 5 0.09090909
## 6 0.41666667
## 7 0.11111111
## 8 0.28571429
## 9 0.08333333
## 10 0.25000000
rm(data)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 6263961 334.6 20053024 1071.0 NA 31332852 1673.4
## Vcells 501289579 3824.6 953631162 7275.7 102400 953132310 7271.9
We use the train data to create the model. The input for xgboost consists of two parts: the label (the outcome, in our case the column "reordered") and the data (the predictors).
Because xgboost handles only numeric vectors, any categorical predictors would have to be transformed into dummy variables before building the model; we then wrap the predictors and label in an xgb.DMatrix. A short sketch of such dummy coding is shown after this paragraph.
Finally, we can plot the importance of the predictors from our model using xgb.ggplot.importance().
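In our data table all predictors are already numeric, so no dummy coding is actually needed here. For illustration only, a minimal sketch of how a factor predictor could be converted to 0/1 indicator columns with model.matrix() before building an xgb.DMatrix (hypothetical toy data, not part of this pipeline):
# Hypothetical toy data with one factor predictor
df <- data.frame(y = c(1, 0, 1), dept = factor(c("produce", "snacks", "produce")))
# model.matrix() expands the factor into 0/1 indicator columns (no intercept)
X_dummy <- model.matrix(y ~ . - 1, data = df)
X_dummy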
library(ggplot2)
params <- list(
"objective" = "reg:logistic",
"eval_metric" = "logloss",
"eta" = 0.1,
"max_depth" = 6,
"min_child_weight" = 10,
"gamma" = 0.70,
"subsample" = 0.76,
"colsample_bytree" = 0.95,
"alpha" = 2e-05,
"lambda" = 10
)
# We get a sample containing 10% of the train table
subtrain <- train %>% sample_frac(0.1)
# We create an xgb.DMatrix that is named X with predictors from subtrain table and response the reordered variable
X <- xgb.DMatrix(as.matrix(subtrain %>% select(-reordered, -order_id, -product_id)), label = subtrain$reordered)
# We create the actual model
model <- xgboost(data = X, params = params, nrounds = 80)
## [1] train-logloss:0.624639
## [2] train-logloss:0.571516
## [3] train-logloss:0.525171
## [4] train-logloss:0.485455
## [5] train-logloss:0.454769
## [6] train-logloss:0.426194
## [7] train-logloss:0.401619
## [8] train-logloss:0.381367
## [9] train-logloss:0.364305
## [10] train-logloss:0.348925
## [11] train-logloss:0.335851
## [12] train-logloss:0.324318
## [13] train-logloss:0.314377
## [14] train-logloss:0.305517
## [15] train-logloss:0.297942
## [16] train-logloss:0.291039
## [17] train-logloss:0.285190
## [18] train-logloss:0.280525
## [19] train-logloss:0.275822
## [20] train-logloss:0.272233
## [21] train-logloss:0.268503
## [22] train-logloss:0.265747
## [23] train-logloss:0.263313
## [24] train-logloss:0.260833
## [25] train-logloss:0.258859
## [26] train-logloss:0.257246
## [27] train-logloss:0.255782
## [28] train-logloss:0.254252
## [29] train-logloss:0.253062
## [30] train-logloss:0.252092
## [31] train-logloss:0.251246
## [32] train-logloss:0.250460
## [33] train-logloss:0.249842
## [34] train-logloss:0.249046
## [35] train-logloss:0.248511
## [36] train-logloss:0.248028
## [37] train-logloss:0.247614
## [38] train-logloss:0.247265
## [39] train-logloss:0.246923
## [40] train-logloss:0.246660
## [41] train-logloss:0.246397
## [42] train-logloss:0.246177
## [43] train-logloss:0.245940
## [44] train-logloss:0.245752
## [45] train-logloss:0.245588
## [46] train-logloss:0.245439
## [47] train-logloss:0.245313
## [48] train-logloss:0.245211
## [49] train-logloss:0.245116
## [50] train-logloss:0.244940
## [51] train-logloss:0.244843
## [52] train-logloss:0.244733
## [53] train-logloss:0.244648
## [54] train-logloss:0.244575
## [55] train-logloss:0.244511
## [56] train-logloss:0.244434
## [57] train-logloss:0.244336
## [58] train-logloss:0.244264
## [59] train-logloss:0.244187
## [60] train-logloss:0.244099
## [61] train-logloss:0.244033
## [62] train-logloss:0.243989
## [63] train-logloss:0.243933
## [64] train-logloss:0.243888
## [65] train-logloss:0.243832
## [66] train-logloss:0.243803
## [67] train-logloss:0.243779
## [68] train-logloss:0.243715
## [69] train-logloss:0.243665
## [70] train-logloss:0.243614
## [71] train-logloss:0.243565
## [72] train-logloss:0.243523
## [73] train-logloss:0.243488
## [74] train-logloss:0.243445
## [75] train-logloss:0.243419
## [76] train-logloss:0.243379
## [77] train-logloss:0.243323
## [78] train-logloss:0.243281
## [79] train-logloss:0.243250
## [80] train-logloss:0.243218
# We estimate the importance of the predictors
importance <- xgb.importance(colnames(X), model = model)
# We plot the importance of the predictors
xgb.ggplot.importance(importance)
rm(X, importance, subtrain)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 6339429 338.6 20053024 1071.0 NA 31332852 1673.4
## Vcells 486142528 3709.0 953631162 7275.7 102400 953132310 7271.9
Because Instacart did not reveal the outcome of the test data, we will later also apply the model to the train data and create a table that allows comparing the predicted results to the actual ones.
First, we perform the prediction on the test data with the model we have built.
# We use the xgb.DMatrix to group our test data into a matrix
X <- xgb.DMatrix(as.matrix(test %>% select(-order_id, -product_id)))
# We apply the model and we predict the reordered variable for the test set.
test$reordered <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
test$reordered <- (test$reordered > 0.21) * 1
# We create the final table with reordered products per order
submission <- test %>%
filter(reordered == 1) %>%
group_by(order_id) %>%
summarise(
products = paste(product_id, collapse = " ")
)
# Let's see the submission table
head(submission,10)
## # A tibble: 10 x 2
## order_id products
## <int> <chr>
## 1 17 13107 21463
## 2 34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792
## 3 137 2326 5134 23794 24852 25890 38689 41787
## 4 182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672
## 5 257 13870 21137 22035 24838 24852 27104 27966 29837 30233 37646 3…
## 6 313 12779 13198 14077 21903 25890 28535 30391 45007 46906
## 7 353 21137 35561 40688
## 8 386 260 1158 2326 4920 6046 8174 10343 15872 17652 21479 21903 22…
## 9 414 4472 7539 14947 19006 20392 20564 21230 21376 21709 27845 312…
## 10 418 1503 13702 30489 38694 40268 41950 47766
# We create the table "missing", which contains the orders for which our prediction says no product will be reordered
missing <- data.frame(
order_id = unique(test$order_id[!test$order_id %in% submission$order_id]),
products = "None"
)
submission <- submission %>% bind_rows(missing) %>% arrange(order_id)
#Let's see the submission table
head(submission,10)
## # A tibble: 10 x 2
## order_id products
## <int> <chr>
## 1 17 13107 21463
## 2 34 2596 13176 16083 21137 39180 39475 43504 44663 47766 47792
## 3 137 2326 5134 23794 24852 25890 38689 41787
## 4 182 5479 9337 11520 13629 32109 33000 39275 41149 47209 47672
## 5 257 13870 21137 22035 24838 24852 27104 27966 29837 30233 37646 3…
## 6 313 12779 13198 14077 21903 25890 28535 30391 45007 46906
## 7 353 21137 35561 40688
## 8 386 260 1158 2326 4920 6046 8174 10343 15872 17652 21479 21903 22…
## 9 414 4472 7539 14947 19006 20392 20564 21230 21376 21709 27845 312…
## 10 418 1503 13702 30489 38694 40268 41950 47766
In order to be able to compare our prediction to the actual result, we now re-apply the model to the train data.
# We use the xgb.DMatrix to group our train data into a matrix
X <- xgb.DMatrix(as.matrix(train %>% select(-order_id, -product_id, -reordered)))
# We apply the model and we predict the reordered variable for the train set.
train$reordered_pred <- predict(model, X)
# The model estimates a probability. We apply a threshold so every prediction above 0.21 will be considered as a reorder (reordered=1)
train$reordered_pred <- (train$reordered_pred > 0.21) * 1
# We create the final table with reordered products per order
submission_train <- train %>%
filter(reordered_pred == 1) %>%
group_by(order_id) %>%
summarise(
products = paste(product_id, collapse = " ")
)
real_reorders <- train %>%
filter(reordered == 1) %>%
group_by(order_id) %>%
summarise(
real_products = paste(product_id, collapse = " ")
)
submission_train <- real_reorders %>%
inner_join(submission_train, by = "order_id")
The Final Result for comparison is shown below:
datatable(head(submission_train, n = 10), class = 'cell-border stripe hover condensed responsive')
The model is now evaluated by calculating the number of correct matches versus the number of incorrect matches (missed reorders).
Evalution_set <- submission_train
# For each order we count how many predicted products also appear among the actually reordered
# products (matches) and how many actually reordered products we missed (no_matches)
for (i in 1:nrow(Evalution_set)){
Evalution_set$matches[i] <- length(Reduce('intersect',str_extract_all(c(Evalution_set$products[i],Evalution_set$real_products[i]),"\\w+")))
Evalution_set$no_matches[i] <- (str_count(Evalution_set$real_products[i]," ")+1) - Evalution_set$matches[i]
}
The evaluation table with the match counts per order is shown below:
datatable(head(Evalution_set, n = 10), class = 'cell-border stripe hover condensed responsive')
Finally, the total numbers of matches and missed reorders are compared:
sum(Evalution_set$matches)
## [1] 423488
sum(Evalution_set$no_matches)
## [1] 392930
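A simple summary of these counts, treating matches as correctly predicted reorders and no_matches as reorders we missed, is the recall over all train orders (a sketch; roughly 0.52 given the sums above):
# Recall = correctly predicted reorders / all actual reorders
sum(Evalution_set$matches) / (sum(Evalution_set$matches) + sum(Evalution_set$no_matches))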
Support: this quantifies the popularity of an item and is measured as the proportion of transactions in which the item appears. Itemsets with support above a chosen threshold are considered significant. supp(X) = (number of transactions in which X appears) / (total number of transactions)
Confidence: this shows how likely item B is added to the cart when item A is purchased. It is measured as the proportion of transactions containing item A in which item B is also purchased. For example, if conf(A -> B) is 75%, the rule is correct for 75% of the transactions containing A. If item B is very popular, the confidence will be high regardless of A, so confidence alone cannot be used to judge an association. conf(A -> B) = supp(A U B) / supp(A)
Lift: this is measured as the ratio of the observed support to the support expected if A and B were independent. A lift value of 1 indicates that the two items are independent of each other; a value greater than 1 indicates that B is likely to be bought along with A. lift(A -> B) = supp(A U B) / (supp(A) * supp(B))
Conviction: conv(A -> B) = (1 - supp(B)) / (1 - conf(A -> B))
A conviction of 1.4 means that the rule A -> B would be incorrect 40% more often if the association between A and B were purely due to chance.
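To make these measures concrete, here is a tiny worked example in R with hypothetical counts (not taken from the Instacart data):
# Suppose 1,000 transactions: item A in 200 of them, item B in 100, A and B together in 80
n <- 1000; n_A <- 200; n_B <- 100; n_AB <- 80
supp_A  <- n_A / n                          # supp(A)  = 0.20
supp_B  <- n_B / n                          # supp(B)  = 0.10
supp_AB <- n_AB / n                         # supp(A U B) = 0.08
conf_AB <- supp_AB / supp_A                 # conf(A -> B) = 0.40
lift_AB <- supp_AB / (supp_A * supp_B)      # lift(A -> B) = 4 (> 1, so A and B co-occur more than expected)
conv_AB <- (1 - supp_B) / (1 - conf_AB)     # conv(A -> B) = 1.5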
Using association rules and market basket analysis, a set of rules is created. A support value of 0.00001, a confidence of 0.4 and a maximum rule length of 3 items are used.
basket_rules<-apriori(transactions,parameter = list(sup=0.00001,conf=0.4,maxlen=3,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 1e-05 1
## maxlen target ext
## 3 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 32
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [20.90s].
## sorting and recoding items ... [30863 item(s)] done [0.85s].
## creating transaction tree ... done [5.67s].
## checking subsets of size 1 2 done [12.22s].
## writing ... [979 rule(s)] done [2.53s].
## creating S4 object ... done [2.71s].
summary(basket_rules)
## set of 979 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 979
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift count
## Min. :1.026e-05 Min. :0.4000 Min. : 2.72 Min. : 33.0
## 1st Qu.:1.338e-05 1st Qu.:0.4232 1st Qu.: 41.86 1st Qu.: 43.0
## Median :2.146e-05 Median :0.4569 Median : 959.12 Median : 69.0
## Mean :7.943e-05 Mean :0.4795 Mean : 3418.94 Mean : 255.4
## 3rd Qu.:5.848e-05 3rd Qu.:0.5104 3rd Qu.: 4382.79 3rd Qu.: 188.0
## Max. :2.902e-03 Max. :1.0000 Max. :76544.62 Max. :9331.0
##
## mining info:
## data ntransactions support confidence
## transactions 3214874 1e-05 0.4
kable(inspect(head(sort(basket_rules, decreasing = TRUE, na.last = NA, by = "lift"), n = 10)))
## lhs rhs support confidence lift count
## [1] {Moisturizing Facial Wash} => {Moisturizing Non-Drying Facial Wash} 1.306428e-05 1.0000000 76544.62 42
## [2] {Moisturizing Non-Drying Facial Wash} => {Moisturizing Facial Wash} 1.306428e-05 1.0000000 76544.62 42
## [3] {Prepared Meals Simmered Beef Entree Dog Food} => {Prepared Meals Beef & Chicken Medley Dog Food} 1.275322e-05 0.6212121 32211.59 41
## [4] {Prepared Meals Beef & Chicken Medley Dog Food} => {Prepared Meals Simmered Beef Entree Dog Food} 1.275322e-05 0.6612903 32211.59 41
## [5] {Ocean Whitefish} => {Premium Classic Chicken Recipe Cat Food} 1.026479e-05 0.7500000 32148.74 33
## [6] {Premium Classic Chicken Recipe Cat Food} => {Ocean Whitefish} 1.026479e-05 0.4400000 32148.74 33
## [7] {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} => {Ancient Grains Apricot Blended Low-Fat Greek Yogurt} 1.430849e-05 0.5609756 29088.16 46
## [8] {Ancient Grains Apricot Blended Low-Fat Greek Yogurt} => {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} 1.430849e-05 0.7419355 29088.16 46
## [9] {Thousand Island Salad Snax} => {Raspberry Vinaigrette Salad Snax} 2.146274e-05 0.6160714 23030.14 69
## [10] {Raspberry Vinaigrette Salad Snax} => {Thousand Island Salad Snax} 2.146274e-05 0.8023256 23030.14 69
# Frequently ordered products
# We find 15 products when the support threshold is set at 0.03: these products appear in at least
# 3% of all transactions, i.e. in roughly 96,000 of the ~3.2 million orders
item_frequencies <- itemFrequency(transactions, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transactions)]
par(mar=c(2,10,2,2)); options(scipen=5)
barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transactions), col="red")
We want to find 2-product and 3-product combinations, and hence we choose a much lower support of 0.0003. This means an itemset must appear in about 0.03% of the ~3.2 million transactions, i.e. the products are bought together roughly 1,000 times.
#Frequent items bought together
basket_rules <- apriori(transactions,parameter = list(sup=0.0003, conf=0.5, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.0003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 964
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.48s].
## sorting and recoding items ... [5213 item(s)] done [0.45s].
## creating transaction tree ... done [4.02s].
## checking subsets of size 1 2 3 4 5 done [6.75s].
## writing ... [101 rule(s)] done [0.06s].
## creating S4 object ... done [1.25s].
This scatter plot illustrates the relationship between the different metrics. It has been shown that the optimal rules tend to lie on the so-called "support-confidence boundary", i.e. along the right-hand border of the plot where support, confidence or both are maximised.
plot(basket_rules)
The network graph below shows associations between selected items. Larger circles imply higher support, while darker red circles imply higher lift
plot(head(sort(basket_rules,by="lift"), 5),method="graph")
Highlights:
Future Work: