Instacart is a San Francisco-based company that operates a same-day grocery delivery service. Customers select groceries from various retailers through a web application, and the order is delivered by a personal shopper. This study makes use of the open-sourced dataset “The Instacart Online Grocery Shopping Dataset 2017” released by Instacart. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
As part of the study, item suggestions based on market basket analysis are generated. The focus is on understanding temporal behavior patterns, where user needs and preferences are often assumed to be relatively constant across short windows of time. The Apriori algorithm and association rules are used as the initial model. Feature engineering and gradient boosted tree models are then used as a modified model for more accuracy and personalisation to suit each user.
if (!require("pacman")) install.packages("pacman")
# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("readr",
"data.table",
"tidyverse",
"dplyr",
"stringr",
"DT",
"ggplot2",
"knitr",
"magrittr",
"treemap",
"arules",
"arulesViz")
The data is loaded from the source and cleaned for further analysis.
The Instacart Online Grocery Shopping Dataset 2017 used in this study was accessed on 03/31.
The dataset consists of six tables: orders, products, aisles, departments, order_products__prior, and order_products__train. Each order belongs to one of the following evaluation sets (eval_set in orders):

- “prior”: orders prior to that user’s most recent order (~3.2m orders)
- “train”: training data supplied to participants (~131k orders)
- “test”: test data reserved for testing machine learning algorithms (~75k orders)
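These counts can be verified directly from orders.csv; the following is a minimal sketch, assuming the file sits in the working directory (it is loaded again in the next step for the main analysis).

# Tabulate orders by evaluation set; expected roughly 3.2m prior,
# 131k train, and 75k test orders
orders_check <- fread("orders.csv")
orders_check %>% count(eval_set)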
The six files containing the data are loaded in this step.
orders <- fread("orders.csv")
products <- fread("products.csv")
aisles <- fread("aisles.csv")
departments <- fread("departments.csv")
order_products_train <- fread("order_products__train.csv")
order_products_prior <- fread("order_products__prior.csv")
# Convert day of week and hour of day to factors, then display 10 rows of the orders table
orders$order_dow <- as.factor(orders$order_dow)
orders$order_hour_of_day <- as.factor(orders$order_hour_of_day)
datatable(head(orders, n = 10), class = 'cell-border stripe hover condensed responsive')
# Displays 10 rows of the products table
datatable(head(products, n = 10), class = 'cell-border stripe hover condensed responsive')
# Displays 10 rows of the departments table
datatable(head(departments, n = 10), class = 'cell-border stripe hover condensed responsive')
# Displays 10 rows of the aisles table
datatable(head(aisles, n = 10), class = 'cell-border stripe hover condensed responsive')
# Displays 10 rows of the order_products_prior table
datatable(head(order_products_prior, n = 10), class = 'cell-border stripe hover condensed responsive')
# Displays 10 rows of the order_products_train table
datatable(head(order_products_train, n = 10), class = 'cell-border stripe hover condensed responsive')
Buying patterns are analysed by the day and time at which users order and by the departments the purchased products belong to.
Weekends have more orders than weekdays.
orders %>%
  ggplot(aes(x = order_dow, fill = order_dow)) +
  geom_bar()
Monday to Thursday follows a bimodal (double camel hump) shape, with a first peak at 9-11 hours and a second around 13-15 hours. This indicates that users order groceries either in the morning or around lunch time on working days. The pattern is slightly different on weekends.
orders %>%
  ggplot(aes(x = order_hour_of_day, fill = order_dow)) +
  geom_bar() +
  facet_wrap(~ order_dow, ncol = 2)
Produce offers fewer unique products compared to departments like personal care, snacks, and pantry, yet the most ordered products are from the produce department, which contains vegetables and fruits.
products %>%
group_by(department_id, aisle_id) %>% summarize(count=n()) %>%
left_join(departments,by="department_id") %>%
left_join(aisles,by="aisle_id") %>%
treemap(index=c("department","aisle"),vSize="count",title="Tree map of Unique products offered in each Department/ aisle",palette="Set3",border.col="#FFFFFF")
order_products_train %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(departments,by="department_id") %>%
left_join(aisles,by="aisle_id") %>%
treemap(index=c("department","aisle"),vSize="sumcount",title="Tree map of most ordered products in Department/Aisle",palette="Set3",border.col="#FFFFFF")
# Split product names into groups by order ID and coerce the result into a transactions object
order_product <- order_products_prior %>%
left_join(products, by = "product_id")
transactions <- as(split(order_product$product_name,order_product$order_id),"transactions")
hist(size(transactions),
     breaks = 0:150,
     xaxt = "n",
     ylim = c(0, 250000),
     col = "blue",
     main = "Number of Products per Order",
     xlab = "Order Size: Number of Products")
# axis() and mtext() annotate the existing plot as separate calls;
# base graphics are not chained with "+"
axis(1, at = seq(0, 160, by = 10))
mtext(paste("Total:", length(transactions), "Orders,", sum(size(transactions)), "Products"))
The top 10 most frequently bought items are shown below. Only items with a support of at least 0.01 (i.e., appearing in at least 1% of all transactions) are considered.
itemFrequencyPlot(transactions,
support = 0.01,
cex.names = 0.8,
horiz = TRUE, topN = 10)
The interest measures used in this analysis are defined as follows:

$$\mathrm{supp}(X) = \frac{\text{number of transactions in which } X \text{ appears}}{\text{total number of transactions}}$$

$$\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}$$

$$\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}$$

$$\mathrm{conv}(A \Rightarrow B) = \frac{1 - \mathrm{supp}(B)}{1 - \mathrm{conf}(A \Rightarrow B)}$$

A conviction of 1.4 means that the rule A ⇒ B would be incorrect 40% more often if the association between A and B were purely due to chance.
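These measures can be computed for any mined rule set with arules. A minimal sketch, to be run after basket_rules has been created in the next step, uses interestMeasure() to add conviction (which apriori() does not report by default):

# Attach conviction to the mined rules; apriori() reports support,
# confidence, lift and count by default, but not conviction
quality(basket_rules)$conviction <- interestMeasure(basket_rules,
                                                    measure = "conviction",
                                                    transactions = transactions)
inspect(head(sort(basket_rules, by = "conviction"), 5))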
Using association rules and market basket analysis, a set of rules is created with a support of 0.00001, a confidence of 0.4, and a maximum rule length of 3.
basket_rules <- apriori(transactions, parameter = list(sup = 0.00001, conf = 0.4, maxlen = 3, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 1e-05 1
## maxlen target ext
## 3 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 32
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [5.93s].
## sorting and recoding items ... [30863 item(s)] done [0.45s].
## creating transaction tree ... done [3.75s].
## checking subsets of size 1 2 done [7.87s].
## writing ... [979 rule(s)] done [1.94s].
## creating S4 object ... done [1.95s].
summary(basket_rules)
## set of 979 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 979
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift count
## Min. :1.026e-05 Min. :0.4000 Min. : 2.72 Min. : 33.0
## 1st Qu.:1.338e-05 1st Qu.:0.4232 1st Qu.: 41.86 1st Qu.: 43.0
## Median :2.146e-05 Median :0.4569 Median : 959.12 Median : 69.0
## Mean :7.943e-05 Mean :0.4795 Mean : 3418.94 Mean : 255.4
## 3rd Qu.:5.848e-05 3rd Qu.:0.5104 3rd Qu.: 4382.79 3rd Qu.: 188.0
## Max. :2.902e-03 Max. :1.0000 Max. :76544.62 Max. :9331.0
##
## mining info:
## data ntransactions support confidence
## transactions 3214874 1e-05 0.4
kable(inspect(head(sort(basket_rules, decreasing = TRUE, na.last = NA, by = "lift"), n = 10)))
## lhs rhs support confidence lift count
## [1] {Moisturizing Facial Wash} => {Moisturizing Non-Drying Facial Wash} 1.306428e-05 1.0000000 76544.62 42
## [2] {Moisturizing Non-Drying Facial Wash} => {Moisturizing Facial Wash} 1.306428e-05 1.0000000 76544.62 42
## [3] {Prepared Meals Simmered Beef Entree Dog Food} => {Prepared Meals Beef & Chicken Medley Dog Food} 1.275322e-05 0.6212121 32211.59 41
## [4] {Prepared Meals Beef & Chicken Medley Dog Food} => {Prepared Meals Simmered Beef Entree Dog Food} 1.275322e-05 0.6612903 32211.59 41
## [5] {Ocean Whitefish} => {Premium Classic Chicken Recipe Cat Food} 1.026479e-05 0.7500000 32148.74 33
## [6] {Premium Classic Chicken Recipe Cat Food} => {Ocean Whitefish} 1.026479e-05 0.4400000 32148.74 33
## [7] {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} => {Ancient Grains Apricot Blended Low-Fat Greek Yogurt} 1.430849e-05 0.5609756 29088.16 46
## [8] {Ancient Grains Apricot Blended Low-Fat Greek Yogurt} => {Oats Ancient Grain Blend with Mixed Berry Low-Fat Greek Yogurt} 1.430849e-05 0.7419355 29088.16 46
## [9] {Thousand Island Salad Snax} => {Raspberry Vinaigrette Salad Snax} 2.146274e-05 0.6160714 23030.14 69
## [10] {Raspberry Vinaigrette Salad Snax} => {Thousand Island Salad Snax} 2.146274e-05 0.8023256 23030.14 69
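Note that the top rules arrive in symmetric pairs (A ⇒ B and B ⇒ A share the same support and lift but differ in confidence). When thinning a larger rule set, arules offers is.redundant(); a minimal sketch:

# Drop rules whose confidence does not improve on a more general rule
# (one with a smaller left-hand side); then re-inspect the top rules by lift
rules_pruned <- basket_rules[!is.redundant(basket_rules)]
inspect(head(sort(rules_pruned, by = "lift"), 10))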
# Frequently ordered products
# 15 products occur when the support is set at 0.03, meaning each is found in
# at least 3% of all transactions, roughly 96,000 orders
item_frequencies <- itemFrequency(transactions, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transactions)]
par(mar=c(2,10,2,2)); options(scipen=5)
barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transactions), col="red")
We want to mine 2-product and 3-product combinations, and hence choose a lower support of 0.0003, which means an itemset must appear in about 0.03% of the ~3.2 million transactions, i.e., in roughly 960 orders.
#Frequent items bought together
basket_rules <- apriori(transactions, parameter = list(sup = 0.0003, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 3e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 964
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [7.05s].
## sorting and recoding items ... [5213 item(s)] done [0.40s].
## creating transaction tree ... done [3.92s].
## checking subsets of size 1 2 3 4 5 done [6.08s].
## writing ... [101 rule(s)] done [0.05s].
## creating S4 object ... done [1.21s].
This scatter plot illustrates the relationship between the different metrics. It has been shown that the optimal rules are those that lie on the “support-confidence boundary”, i.e., the rules along the right-hand border of the plot, where support, confidence, or both are maximised.
plot(basket_rules)
The network graph below shows associations between selected items. Larger circles imply higher support, while darker red circles imply higher lift.
plot(head(sort(basket_rules,by="lift"),10),method="graph")
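Rules for a single product of interest can also be pulled out with subset(); a minimal sketch, where “Limes” is only an illustrative item, not drawn from the results above:

# Inspect rules that conclude with a given product (item name is hypothetical)
lime_rules <- subset(basket_rules, subset = rhs %in% "Limes")
inspect(head(sort(lime_rules, by = "confidence"), 5))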
To visualize the grouped matrix, we use a balloon plot with antecedent groups as columns and consequents as rows. The color of each balloon represents the aggregated interest measure in the group, and the size of the balloon shows the aggregated support.
plot(basket_rules,method="grouped")