1. Introduction

Instacart is a San Fransisco based company that operates as a same-day grocery delivery service. Customers select groceries through a web application from various retailers and delivered by a personal shopper. This study makes use of open sourced data set,“The Instacart Online Grocery Shopping Dataset 2017” released by Instacart. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

As part of the study, item suggestions based on market basket analysis is performed.The focus is on understanding temporal behavior patterns, where user needs and preferences are often assumed to be relatively constant across short windows of time. Apriori algorithm and association rules are used as initial model. Complex feature engineering, gradient boosted tree models are used as modified model for more accuracy and personalisation to suit the user.

2. Packages

  • pacman: To load packages and install missing ones
  • readr: Fast data load.
  • data.table: Fast data load.
  • tidyverse: Package of multiple R packages used for data manipulation
  • dplyr: Easy functions to perform data manipulation in R.
  • stringr: String operations in R
  • DT: Package to put data objects in R as HTML tables
  • ggplot2: Data visualisation in R mining for word processing and sentiment analysis
  • treemap: Package for tree map visulaization
  • arules : Package for mining Association Rules and Frequent Itemsets
  • arulesviz : Package for visualizing Association Rules and Frequent Itemsets
if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("readr", 
               "data.table", 
               "tidyverse", 
               "dplyr", 
               "stringr", 
               "DT", 
               "ggplot2",
               "knitr",
               "magrittr",
               "treemap",
               "arules",
               "arulesViz")

3. Data Preperation

The data is loaded from the source and cleaned for further analysis.

3.1 Data source

Instacart is an American company that operates as a same-day grocery delivery service. The Instacart Online Grocery Shopping dataset 2017 used in this study is accessed on 03/31.

In the dataset you’ll find information about businesses across 11 metropolitan areas in four countries. There are 6 tables available that containes business related information

  • orders: Contains location and category information for each business
  • products: Contains attributes of the business like music, delivery, parking etc.
  • aisles : Contains working hours for each weekday for businesses
  • deptartments : Number of checkins for each day and hour for a business is stored in this dataset.
  • order_products__SET : Contains tips and comment written by a user on a business

where SET is one of the four following evaluation sets (eval_set in orders):

“prior”: orders prior to that users most recent order (~3.2m orders) “train”: training data supplied to participants (~131k orders) “test”: test data reserved for testing machine learning algorithms (~75k orders)

3.2 Data import

The six files containing data is loaded in this step.

orders <- fread("orders.csv")
products <- fread("products.csv")
aisles <- read_csv("aisles.csv")
departments <- fread("departments.csv")
order_products_train <- fread("order_products__train.csv")
order_products_prior <- read_csv("order_products__prior.csv")

3.3 Sample data

a. orders

# Displays 10 rows of checkin table

orders$order_dow = as.factor(orders$order_dow)
orders$order_hour_of_day = as.factor(orders$order_hour_of_day)

datatable(head(orders, n = 10), class = 'cell-border stripe hover condensed responsive')

b. products

# Displays 10 rows of checkin table

datatable(head(products, n = 10), class = 'cell-border stripe hover condensed responsive')

c. departments

# Displays 10 rows of checkin table

datatable(head(departments, n = 10), class = 'cell-border stripe hover condensed responsive')

d. aisles

# Displays 10 rows of checkin table

datatable(head(aisles, n = 10), class = 'cell-border stripe hover condensed responsive')

d. order_products_prior

# Displays 10 rows of checkin table

datatable(head(order_products_prior, n = 10), class = 'cell-border stripe hover condensed responsive')

d. order_products_train

# Displays 10 rows of checkin table

datatable(head(order_products_train, n = 10), class = 'cell-border stripe hover condensed responsive')

4. Exploratory Data Analysis

Buying patterns of users are analysed by the day at which they purchase and the department to which they belong.

4.1 By Weekday and time of order

Weekends have more orders than weekdays.

 orders %>% 
  ggplot(aes(x= order_dow, fill = order_dow)) + 
  geom_histogram(stat="count")

Monday to Thursday follows a double camel hump shape with first peak from 9-11 hours and second peak around 13-15 hours. This indicates that the users order groceries either in the morning or during lunch time on working days. But this pattern is slightly different for weekends

 orders %>% 
  ggplot(aes(x= order_hour_of_day, fill = order_dow)) + 
  geom_histogram(stat="count") +
  facet_wrap(~ order_dow, ncol = 2)

4.2 By Aisles and Departments.

Though produce offers lesser number of unique products compareed to departments like personal care, snacks and pantry. But, most ordered products are from the produce department which contains vegetables and fruits.

products %>% 
  group_by(department_id, aisle_id) %>% summarize(count=n()) %>%
  left_join(departments,by="department_id") %>% 
  left_join(aisles,by="aisle_id") %>%
  treemap(index=c("department","aisle"),vSize="count",title="Tree map of Unique products offered in each Department/ aisle",palette="Set3",border.col="#FFFFFF")

order_products_train %>% 
  group_by(product_id) %>% 
  summarize(count=n()) %>% 
  left_join(products,by="product_id") %>% 
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  summarize(sumcount = sum(count)) %>% 
  left_join(departments,by="department_id") %>% 
  left_join(aisles,by="aisle_id") %>%
  mutate(onesize = 1) %>%
  treemap(index=c("department","aisle"),vSize="sumcount",title="Tree map of most ordered products in Department/Aisle",palette="Set3",border.col="#FFFFFF")

4.3 Frequent item analysis

# Split the "Product ID" values into groups based on "Order ID" variable
order_product <- order_products_prior %>% 
  left_join(products, by = "product_id")


transactions <- as(split(order_product$product_name,order_product$order_id),"transactions")

hist(size(transactions), 
     breaks = 0:150, 
     xaxt="n", 
     ylim=c(0,250000), 
     col = "blue",
     main = "Number of Products per Order", 
     xlab = "Order Size:Number of Products")
+ axis(1, 
     at = seq(0,160,by=10)) + 
mtext(paste("Total:", length(transactions), "Orders,", sum(size(transactions)), "Products"))

## numeric(0)

ITEM FREQUENCY PLOT

Top 10 items most frequently bought are shown below. Only items with a support of atleast 0.01 (Appears at least in 1% of the transactions) are considered for this.

itemFrequencyPlot(transactions,
                  support = 0.01,
                  cex.names = 0.8,
                  horiz = TRUE, topN = 10)

5. Market Basket Analysis

Apriori algorithm and association rules

  • Support : This is a quantifies the popularity of an item and is measured as the proportion of transactions in which the item appears.We can identify itemsets that have support values beyond this threshold as significant itemsets.

supp(X) = (Number of transactions in which X appears)/(Total number of transactions)

  • Confidence : This shows how likely item B is added to cart when item A is purchased. It is measured as the proportion of transactions of item A in which item B is also purchased. for example If conf (A -> B) is 75%, it implies that, for 75% of transactions containing A & B, this rule is correct. If item B is very popular, then value of confidence will also be high. So confidence alone cannot be used for association.

conf(A -> B) = supp(A U B) / supp( A )

  • Lift : This is measured as the ratio of the observed support to that expected if A and B were independent. Lift value of 1 indicates that the two items are independent of each other. If value is greater than 1, then it indicates B is likely to be bought along with A.

lift (A -> B) = supp (A U B)/( supp(A) supp (B) )

  • Conviction :

conv (A->B) = (1-supp(A))/(1-conf(A-B))

If the conviction means 1.4, it means that the rule A -> B would be incorrect 40% more often if the association between A & B was an accidental chance.

Market basket rules

Using association rules and market basket analysis, set of rules are created. Suport value of 0.00001, confidence of 0.4 and maximum number of items of 3 are used

basket_rules<-apriori(transactions,parameter = list(sup=0.00001,conf=0.4,maxlen=3,target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   1e-05      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [5.93s].
## sorting and recoding items ... [30863 item(s)] done [0.45s].
## creating transaction tree ... done [3.75s].
## checking subsets of size 1 2 done [7.87s].
## writing ... [979 rule(s)] done [1.94s].
## creating S4 object  ... done [1.95s].
summary(basket_rules)
## set of 979 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2 
## 979 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support            confidence          lift              count       
##  Min.   :1.026e-05   Min.   :0.4000   Min.   :    2.72   Min.   :  33.0  
##  1st Qu.:1.338e-05   1st Qu.:0.4232   1st Qu.:   41.86   1st Qu.:  43.0  
##  Median :2.146e-05   Median :0.4569   Median :  959.12   Median :  69.0  
##  Mean   :7.943e-05   Mean   :0.4795   Mean   : 3418.94   Mean   : 255.4  
##  3rd Qu.:5.848e-05   3rd Qu.:0.5104   3rd Qu.: 4382.79   3rd Qu.: 188.0  
##  Max.   :2.902e-03   Max.   :1.0000   Max.   :76544.62   Max.   :9331.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions       3214874   1e-05        0.4
kable(inspect(head(sort(basket_rules, decreasing = TRUE, na.last = NA, by = "lift"), n = 10)))
##      lhs                                                                 rhs                                                                   support confidence     lift count
## [1]  {Moisturizing Facial Wash}                                       => {Moisturizing Non-Drying Facial Wash}                            1.306428e-05  1.0000000 76544.62    42
## [2]  {Moisturizing Non-Drying Facial Wash}                            => {Moisturizing Facial Wash}                                       1.306428e-05  1.0000000 76544.62    42
## [3]  {Prepared Meals Simmered Beef Entree Dog Food}                   => {Prepared Meals Beef & Chicken Medley Dog Food}                  1.275322e-05  0.6212121 32211.59    41
## [4]  {Prepared Meals Beef & Chicken Medley Dog Food}                  => {Prepared Meals Simmered Beef Entree Dog Food}                   1.275322e-05  0.6612903 32211.59    41
## [5]  {Ocean Whitefish}                                                => {Premium Classic Chicken Recipe Cat Food}                        1.026479e-05  0.7500000 32148.74    33
## [6]  {Premium Classic Chicken Recipe Cat Food}                        => {Ocean Whitefish}                                                1.026479e-05  0.4400000 32148.74    33
## [7]  {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} => {Ancient Grains Apricot Blended Low-Fat Greek Yogurt}            1.430849e-05  0.5609756 29088.16    46
## [8]  {Ancient Grains Apricot Blended Low-Fat Greek Yogurt}            => {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} 1.430849e-05  0.7419355 29088.16    46
## [9]  {Thousand Island Salad Snax}                                     => {Raspberry Vinaigrette Salad Snax}                               2.146274e-05  0.6160714 23030.14    69
## [10] {Raspberry Vinaigrette Salad Snax}                               => {Thousand Island Salad Snax}                                     2.146274e-05  0.8023256 23030.14    69
#Frequently ordered products

#We find 15 products to occur when the support is set at 0.03. This means these products are found in 3% of the total transactions which is approximately about 90,000

item_frequencies <- itemFrequency(transactions, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transactions)]

par(mar=c(2,10,2,2)); options(scipen=5)

barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transactions), col="red")

We desire to make 2 products and 3 product combinations and hence we choose a lower support = 0.003 which means the product is in around 0.3 % of 3 million transactions that is about 10,000 times the product is sold

#Frequent items bought together

basket_rules <- apriori(transactions,parameter = list(sup=0.0003, conf=0.5, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   3e-04      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [7.05s].
## sorting and recoding items ... [5213 item(s)] done [0.40s].
## creating transaction tree ... done [3.92s].
## checking subsets of size 1 2 3 4 5 done [6.08s].
## writing ... [101 rule(s)] done [0.05s].
## creating S4 object  ... done [1.21s].

This scatter plot illustrates the relationship between the different metrics. It has been shown that the optimal rules are those that lie on what’s known as the “support-confidence boundary”. Essentially, these are the rules that lie on the right hand border of the plot where either support, confidence or both are maximised.

plot(basket_rules)

The network graph below shows associations between selected items. Larger circles imply higher support, while darker red circles imply higher lift

plot(head(sort(basket_rules,by="lift"),10),method="graph")

To visualize the grouped matrix we use a balloon plot with antecedent groups as columns and consequents as rows.The color of each balloon represents the aggregated interest measure in the group and the size of the balloon shows the aggregated support.

plot(basket_rules,method="grouped")