Instacart is an on-demand grocery shopping app with same-day delivery. Instacart uses a crowdsourced marketplace model, akin to that of Uber or Lyft.
The Instacart shopping process is as follows. First, an app user places their grocery order through the app. Then, a locally crowdsourced “shopper” is notified of the order, goes to a nearby store, buys the groceries, and delivers them to the user.
There are three ways that Instacart generates revenue: delivery fees, membership fees, and mark-ups on in-store prices.
The main objective of the competition is to predict what the user will buy in their next order, given the data from all prior orders.
In 2017, Instacart released a public dataset, “The Instacart Online Grocery Shopping Dataset 2017”. The dataset contains over 3 million anonymized grocery orders from more than 200,000 Instacart users. This analysis makes use of this dataset.
The data can be downloaded here: https://www.kaggle.com/c/instacart-market-basket-analysis/data
Here are the R libraries used in this analysis.
library(knitr) # report generation, kable tables
library(tidyverse) # data manipulation
library(data.table) # fast file reading
library(caret) # rocr analysis
library(ROCR) # rocr analysis
library(kableExtra) # nice table html formatting
library(gridExtra) # arranging ggplot in grid
setwd('./datasets')
aisles = fread('aisles.csv', stringsAsFactors = TRUE)
departments = fread('departments.csv', stringsAsFactors = TRUE)
products = fread('products.csv', stringsAsFactors = TRUE)
orders = fread('orders.csv', stringsAsFactors = TRUE)
order_products_train = fread('order_products__train.csv')
order_products_prior = fread('order_products__prior.csv')
The dataset for this competition is a relational set of files describing customers’ orders over time. It is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, Instacart provides between 4 and 100 of their orders, with the sequence of products purchased in each order, the week and hour of day the order was placed, and a relative measure of time between orders.
In total, six datasets were imported. The following sections explore each dataset in further detail. These datasets were sourced from an existing Kaggle competition.
orders (3.4m rows, 206k users):
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed
- order_hour_of_day: the hour of the day the order was placed
- days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key

aisles (134 rows):
- aisle_id: aisle identifier
- aisle: the name of the aisle

departments (21 rows):
- department_id: department identifier
- department: the name of the department

order_products__SET (30m+ rows):
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the following evaluation sets (eval_set in orders):
- "prior": orders prior to that user's most recent order (~3.2m orders)
- "train": training data supplied to participants (~131k orders)
- "test": test data reserved for machine learning competitions (~75k orders)

There are 134 aisles in this dataset. Here are a few sample aisle names.
paste(sort(head(aisles$aisle)), collapse=', ')
[1] "energy granola bars, instant foods, marinades meat preparation, other, prepared soups salads, specialty cheeses"
There are 21 departments in this dataset. The names of all departments are listed below in alphabetical order.
paste(sort(departments$department), collapse = ', ')
[1] "alcohol, babies, bakery, beverages, breakfast, bulk, canned goods, dairy eggs, deli, dry goods pasta, frozen, household, international, meat seafood, missing, other, pantry, personal care, pets, produce, snacks"
There are 49,688 products in the catalogue within 134 aisles and 21 departments.
Sample products are as below.
products %>% head %>% kable %>%
kable_styling(bootstrap_options = c("striped", "hover"))
product_id | product_name | aisle_id | department_id |
---|---|---|---|
1 | Chocolate Sandwich Cookies | 61 | 19 |
2 | All-Seasons Salt | 104 | 13 |
3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
4 | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
5 | Green Chile Anytime Sauce | 5 | 13 |
6 | Dry Nose Oil | 11 | 11 |
The products dataframe is related to departments.
Below we show a sample of 3 products for a few departments.
left_join(departments, products) %>% select(department, product_name ) %>%
group_by(department) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' / ')) %>%
sample_n(5) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"))
department | three_examples_product |
---|---|
beverages | Peach Iced Tea Mix Packets / Cayenne Lemonade Made With Raw Tristate Honey / Mango Lemonade Sparkling Water |
alcohol | The Cutrer Chardonnay / Bud Light Beer Cans / Jamaican Style Lager |
meat seafood | Country Mild Premium Pork Sausage / Ground Beef 85% Lean / Grass Fed Ground Beef Patties |
personal care | Very Volumizing Pomegranate Shampoo / Radiant Super Unscented Tampons / Wild Cherry Throat Drops |
international | Organic Ramen Noodles / Borscht / Lo Mein Egg Noodles |
The products dataframe is also related to aisles; each aisle contains multiple products. By joining the aisles and products dataframes, we get an idea of what type of products each aisle holds.
The example below shows 3 sample products for a few aisles.
left_join(aisles, products) %>%
select(aisle, product_name ) %>% group_by(aisle) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' || ')) %>%
sample_n(5) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"))
aisle | three_examples_product |
---|---|
frozen dessert | Naked Coconut Organic Coconut Bliss || Smooth & Creamy Cherry Original Cream Cheesecake || Outshine Fruit & Veggie Bars Tangerine Medley |
nuts seeds dried fruit | Organic Dried Cranberries Apple Sweetened || Ginger Rescue Strong Chewable Ginger Tablets || Sesame Sticks |
beauty | Intense i-Color Eye liner - For Hazel Eyes, Black Pearl || Cinnamon Cassia Oil || Cushioned Nail Board |
ice cream toppings | The Ultimate Caramel Suace || Classic Waffle Cones || Gluten Free Sugar Cones |
food storage | Oven Bags || Linking Bag Clip || No Stick Heavy Duty Foil |
There are over 3 million observations in the orders dataset. Each row represents a unique order.
Let’s analyse the construct for one user. For example, user_id 1 made 10 prior orders (order_number 1 to 10), and the last order belongs to the train set (eval_set). Note that the first order (order_number 1) has no value for days_since_prior_order, as there is no earlier order.
This also means that <user_id, product_id> pairs make up the key for prediction.
orders %>% filter (user_id==1) %>%
kable %>%kable_styling(bootstrap_options = c("striped", "hover"))
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
2539329 | 1 | prior | 1 | 2 | 8 | NA |
2398795 | 1 | prior | 2 | 3 | 7 | 15 |
473747 | 1 | prior | 3 | 3 | 12 | 21 |
2254736 | 1 | prior | 4 | 4 | 7 | 29 |
431534 | 1 | prior | 5 | 4 | 15 | 28 |
3367565 | 1 | prior | 6 | 2 | 7 | 19 |
550135 | 1 | prior | 7 | 1 | 9 | 20 |
3108588 | 1 | prior | 8 | 1 | 14 | 14 |
2295261 | 1 | prior | 9 | 1 | 16 | 0 |
2550362 | 1 | prior | 10 | 4 | 8 | 30 |
1187899 | 1 | train | 11 | 4 | 8 | 14 |
Let’s analyse another construct of orders. User_id 3 made 12 orders before the final order, which is labeled ‘test’ (eval_set). From the data we can see that order_number restarts at 1 for each user.
Instacart did not provide the basket contents for test orders; these are in fact the targets for prediction.
orders %>% filter (user_id==3) %>%
kable %>%kable_styling(bootstrap_options = c("striped", "hover"))
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
1374495 | 3 | prior | 1 | 1 | 14 | NA |
444309 | 3 | prior | 2 | 3 | 19 | 9 |
3002854 | 3 | prior | 3 | 3 | 16 | 21 |
2037211 | 3 | prior | 4 | 2 | 18 | 20 |
2710558 | 3 | prior | 5 | 0 | 17 | 12 |
1972919 | 3 | prior | 6 | 0 | 16 | 7 |
1839752 | 3 | prior | 7 | 0 | 15 | 7 |
3225766 | 3 | prior | 8 | 0 | 17 | 7 |
3160850 | 3 | prior | 9 | 0 | 16 | 7 |
676467 | 3 | prior | 10 | 3 | 16 | 17 |
521107 | 3 | prior | 11 | 0 | 18 | 11 |
1402502 | 3 | prior | 12 | 1 | 15 | 15 |
2774568 | 3 | test | 13 | 5 | 15 | 11 |
Each order contains multiple products purchased by the user. Instacart has cleanly categorized the orders into ‘train’ and ‘prior’ within a SINGLE orders dataset.
However, the details of each order are split across two datasets:
- order_products_train: contains the product items of the last (train) order only
- order_products_prior: contains the product items of all prior orders
The order_products_train/prior dataframes tell us which products were purchased in each order, for both train and prior orders.
For example, we can see that user_id 1 purchased 11 unique products in the LAST ORDER (order_id 1187899) by querying order_products_train with the relevant order_id.
order_products_train %>% filter (order_id == 1187899) %>%
kable %>% kable_styling(bootstrap_options = c("striped", "hover"))
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
1187899 | 196 | 1 | 1 |
1187899 | 25133 | 2 | 1 |
1187899 | 38928 | 3 | 1 |
1187899 | 26405 | 4 | 1 |
1187899 | 39657 | 5 | 1 |
1187899 | 10258 | 6 | 1 |
1187899 | 13032 | 7 | 1 |
1187899 | 26088 | 8 | 1 |
1187899 | 27845 | 9 | 0 |
1187899 | 49235 | 10 | 1 |
1187899 | 46149 | 11 | 1 |
Similarly, the detail items for a PRIOR ORDER (for example order_id 2550362) can be retrieved by querying the other dataset, order_products_prior.
order_products_prior %>% filter (order_id == 2550362) %>%
kable %>% kable_styling(bootstrap_options = c("striped", "hover"))
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
2550362 | 196 | 1 | 1 |
2550362 | 46149 | 2 | 1 |
2550362 | 39657 | 3 | 0 |
2550362 | 38928 | 4 | 0 |
2550362 | 25133 | 5 | 1 |
2550362 | 10258 | 6 | 1 |
2550362 | 35951 | 7 | 0 |
2550362 | 13032 | 8 | 1 |
2550362 | 12427 | 9 | 1 |
There is no dedicated dataframe for users. However, we can derive the number of unique users from the orders dataframe. By grouping on user_id and eval_set, we find that there are 75,000 test users and 131,209 train users.
orders %>% filter(eval_set %in% c('train','test') ) %>%
count(eval_set) %>%
mutate(percentage=n/sum(n)) %>%
kable %>% kable_styling(bootstrap_options = c("striped", "hover"))
eval_set | n | percentage |
---|---|---|
test | 75000 | 0.3637087 |
train | 131209 | 0.6362913 |
In this section, we shall try to understand buying behaviour by asking some interesting questions.
To reduce our coding steps, we construct a reusable dataframe combining all details from orders and their products (a sketch of its assembly is shown below). This dataframe contains rows for prior orders and products only (excluding the last order, which is labeled ‘train’).
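The chunk that builds this reusable users_orders_products_ dataframe is not shown in the document. A minimal sketch of how it could be assembled, assuming standard key joins of the prior orders with their line items and the product, aisle and department lookup tables, is:
users_orders_products_ = orders %>%
  filter(eval_set == 'prior') %>%                       # keep prior orders only; the last (train/test) order is excluded
  left_join(order_products_prior, by = 'order_id') %>%  # one row per product line item in each prior order
  left_join(products, by = 'product_id') %>%            # add product_name, aisle_id, department_id
  left_join(aisles, by = 'aisle_id') %>%                # add aisle name
  left_join(departments, by = 'department_id')          # add department name
glimpse(users_orders_products_)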
Most users made only a few orders; the count of users decreases significantly as the number of orders increases. The maximum number of prior orders made by any user is 99.
tmp = users_orders_products_ %>% group_by(user_id) %>% summarize(n_orders = max(order_number))
tmp %>% ggplot(aes(x=as.factor(n_orders))) + geom_bar() +
ylab('Count of Users') +
xlab('Number of Orders Made By Users') +
theme(
axis.text.x = element_text (size = 6.0, angle = (90), hjust = 1, vjust = 0.5)
)
Most users place their orders weekly (every 7 days) or monthly (every 30 days); see the peaks at day 7 and day 30 in the chart below. Note that days_since_prior_order is capped at 30, so the day-30 bar also absorbs any longer gaps.
tmp = users_orders_products_ %>%
filter(order_number>1) %>% # days_since_prior is NA for first order, need to filter out
group_by(order_id) %>%
summarize(n_orders = max(days_since_prior_order))
tmp %>% ggplot(aes(x=as.factor(n_orders))) + geom_bar() + ylab('Count of Orders') + xlab('Days Since Prior For Each Order')
Bananas are the most popular product, and the number of orders varies greatly across products. The illustration below shows only the top 30 products; notice, however, that the variance is much less pronounced outside the top 10 products.
tmp = order_products_train %>%
left_join(products) %>%
group_by(product_name) %>%
summarize(count=n()) %>%
top_n(n=30, wt=count) %>% mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(product_name,count), y=percentage)) +
geom_col() + ggtitle('Products Top 30') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
p2 = ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Products Top 30') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Products')
grid.arrange(p1, p2, ncol = 2)
Certain departments are clearly more popular, such as produce and dairy eggs. These two departments combined contribute more than 40% of total orders.
tmp = users_orders_products_ %>%
group_by(department) %>%
summarize(count=n()) %>%
mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(department,count), y=percentage)) +
geom_col() + ggtitle('Departments') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
p2 = ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Departments') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Departments')
grid.arrange(p1, p2, ncol = 2)
Looking at the buying trend by aisle, we notice that certain aisles, such as fresh vegetables and fresh fruits, contribute almost 30% of total orders. The chart below shows the top 30 aisles.
tmp = users_orders_products_ %>%
group_by(aisle) %>%
summarize(count=n()) %>%
top_n(n=30, wt=count) %>% mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(aisle,count), y=percentage)) +
geom_col() + ggtitle('Aisles Top 30') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank()) + ylab('Percentage of Orders') + xlab('Aisles')
p2 = ggplot (tmp, aes( x= '', y=percentage )) +
ggtitle('Aisles Top 30') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Aisles')
grid.arrange(p1, p2, ncol = 2)
Both Day 0 and Day 1 stand out as the busiest shopping days for Instacart. This suggests that the day an order is placed may carry predictive signal.
users_orders_products_ %>%
group_by(order_dow) %>%
summarize(count = n()) %>%
mutate(percentage=count/sum(count)) %>%
ggplot (aes(x=as.factor(order_dow), y=percentage, fill=as.factor(order_dow))) +
geom_col()+ ylab('Percentage of Orders') + ggtitle('Daily Orders')
When we zoom into daily orders, we notice that the top ten products contribute between 7% and 8% of daily orders. It is interesting that Limes are part of the top ten for Day 0 and Day 6, but not for other days, whereas Organic Whole Milk does not make the top ten for Day 0 and Organic Raspberries does not make the top ten for Day 6. This means there is some predictability based on the day the order is made.
users_orders_products_%>%
group_by(order_dow, product_name) %>%
summarize(n=n()) %>%
mutate(percentage=n/sum(n)) %>%
top_n(10, wt=n) %>%
ggplot (aes(x=as.factor(order_dow), y=percentage, fill=product_name)) +
geom_col() + ylab('Proprtion of Orders In A Day') + ggtitle('Daily Top 10 Products Ordered') +
theme(legend.position="bottom",legend.direction="horizontal")
Morning to afternoon are the peak shopping hours for Instacart customers. The hour an order is placed may therefore also carry predictive signal.
users_orders_products_ %>%
group_by(order_hour_of_day) %>%
summarize(count = n()) %>%
mutate(percentage=count/sum(count)) %>%
ggplot (aes(x=as.factor(order_hour_of_day), y=percentage)) +
geom_col()+ ylab('Percentage of Orders') + ggtitle('Hourly Orders')
There are close to 50,000 products in the catalogue. When we zoom into hourly purchases, we notice that the top 10 products account for between 6% and 8% of hourly sales. Every hour has a slightly different combination of top 10 products (drawn from a pool of 12 products), which suggests that certain products are predictable regardless of the hour of the order.
It is interesting that, similar to the daily top 10 products, Organic Whole Milk and Limes are missing from the top 10 for some hours.
users_orders_products_ %>%
group_by(order_hour_of_day, product_name) %>%
summarize(n=n()) %>%
mutate(percentage=n/sum(n)) %>%
top_n(10, wt=n) %>%
ggplot (aes(x=as.factor(order_hour_of_day), y=percentage, fill=product_name)) +
geom_col() + ylab('Proprtion of Orders In A Hour') + ggtitle('Hourly Top 10 Products Ordered') +
theme(legend.position="bottom",legend.direction="horizontal")
The number of items per order ranges from 1 to 145. The distribution below is highly skewed towards small basket sizes; the most common basket size is 5 items.
tmp = users_orders_products_ %>%
group_by(order_id) %>%
summarize( basket_size=n(),
reordered_items = sum(reordered)) %>%
group_by(basket_size) %>%
summarize(n=n(), avg_reordered_items =mean(reordered_items)) %>%
arrange(basket_size)
tmp %>% ggplot(aes(x=as.factor(basket_size))) +
geom_col(aes(y=n)) +
ylab('Order Count') +
xlab('Number of Items in Basket') +
ggtitle('Basket Size Distribution') +
theme(
axis.text.x = element_text (size = 6.0, angle = (90), hjust = 1, vjust = 0.5)
)
Analyzing the re-ordered products is the most important part of the EDA, because insights from this analysis help develop intuition for the feature engineering that will make the prediction more meaningful.
tmp %>% ggplot (aes(x=as.factor(basket_size))) +
geom_point(aes(y=avg_reordered_items), color='red') +
ylab('Avg Number of Re-Ordered Items') +
xlab('Number of Items in Basket') +
ggtitle('Reorder Rate by Basket Size') +
theme(
axis.text.x = element_text (size = 6.0, angle = (90), hjust = 1, vjust = 0.5)
) +
geom_abline(intercept = 0, slope = 1, color='blue') # reference line where every item in the basket would be a reorder
One of the trickier things to predict in the Instacart dataset is the incidence of orders without any reordered products. Plotting the proportion of this incidence across the training sample (a snapshot of 131k+ users) provides some inspiration.
Among all product purchases, 41% of products are reorders. The reorder rate is particularly high for the top 10 products: as shown in the chart below, the ten most popular products have reorder rates of around 70% to 85%, higher than the overall ratio of 41%.
## overall all products reordered rate
tmp1 = users_orders_products_ %>%
filter(order_number >1) %>% # exclude first order, which will never have reordered
count(reordered) %>%
mutate(ratio=n/sum(n))
p1 = tmp1 %>%
ggplot (aes(x='',y=ratio, fill=as.factor(reordered))) + geom_col(width=1) + ylab('Product Reordered Ratio') +
coord_polar(theta='y', start=0) +
scale_fill_brewer(palette="Dark2") +
theme(axis.title.y = element_blank())
## top10 products and its reordered rate
tmp2 = users_orders_products_ %>%
count(product_id) %>% # filter only top 10 products for reorder analysis
top_n(n=10) %>%
left_join(users_orders_products_) %>% # now find out their reordered rate
group_by(product_id, product_name) %>%
summarize(
reordered_rate = sum(reordered,na.rm=T)/n()
) %>%
select(product_id, product_name, reordered_rate) %>%
arrange(desc(reordered_rate))
p2 = tmp2 %>% ggplot (aes(x=reorder(product_name, reordered_rate), y=reordered_rate)) +
ggtitle('Top 10 Products Sold and Their Reordering Rate') +
geom_col() + scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) + coord_flip()
grid.arrange(p1, p2, ncol = 2)
We saw from the earlier order analysis that most users place their orders every 7 or 30 days. However, from a reorder-ratio perspective, day 7 and day 30 contrast sharply: day-7 orders have a high reorder ratio, while day-30 orders have the lowest.
tmp = users_orders_products_ %>%
filter(order_number>1) %>%
group_by(days_since_prior_order, order_id) %>%
summarize(
contain_reordered = max(reordered)
) %>%
summarize(
reordered_orders = sum(contain_reordered),
n = n() # number of orders in this days_since_prior_order group
) %>%
mutate(non_reorder_ratio= 1-(reordered_orders/n))
tmp %>% ggplot (aes(x=days_since_prior_order, y=non_reorder_ratio)) + geom_point() + geom_line() +
ggtitle('Orders NOT Containing Reordered Products over Days since Prior Order')
tmp = users_orders_products_ %>%
filter(order_number>1) %>%
group_by(order_hour_of_day, order_id) %>%
summarize(
contain_reordered = max(reordered)
) %>%
summarize(
reordered_orders = sum(contain_reordered),
n = n()
) %>%
mutate(non_reorder_ratio= 1-(reordered_orders/n))
tmp %>% ggplot (aes(x=order_hour_of_day, y=non_reorder_ratio)) + geom_point() + geom_line() +
ggtitle('Non Reorder Ratio over Time of Order Placed')
tmp = users_orders_products_ %>%
filter(order_number>1) %>%
group_by(order_dow, order_id) %>%
summarize(
contain_reordered = max(reordered)
) %>%
summarize(
reordered_orders = sum(contain_reordered),
n = n() # number of orders in this order_dow group
) %>%
mutate(non_reorder_ratio= 1-(reordered_orders/n))
tmp %>% ggplot (aes(x=order_dow, y=non_reorder_ratio)) + geom_point() + geom_line() +
ggtitle('Non Reorder Ratio over Day of Purchase')
Intuitively, the more regular a buyer is, the more likely they are to repeat-order the same products.
tmp = users_orders_products_ %>%
filter(order_number>1) %>%
group_by(user_id, order_id) %>%
summarize(
contain_reordered = max(reordered)
) %>%
summarize(
reordered_orders = sum(contain_reordered),
total_orders_per_user=n()
) %>%
group_by(total_orders_per_user) %>%
summarize(
reorders = sum(reordered_orders),
total = sum(total_orders_per_user)) %>%
mutate(non_ratio= 1-(reorders/total))
tmp %>% ggplot (aes(x=total_orders_per_user, y=non_ratio)) + geom_point() + geom_line() +
ggtitle('Non Reorder Ratio over Total Orders per User')
The objective is to predict which products the customer will purchase in the next basket. This requires estimating, for each product that has been purchased before, the probability that it will be purchased again. It is a classification problem, framed here as a regression of the probability of repurchase.
For this analysis, we use two naive models (handcrafted baselines) and one machine learning model. Logistic regression is chosen as the machine learning approach for its speed and simplicity, to demonstrate that it is feasible to produce a better outcome than the baselines.
Instacart did not provide the test order details, so we use the train users for both training and testing. We achieve this by splitting the train users and their related orders and products into a training set and a testing set, with a 70%/30% split (by number of users). That means our train/test datasets contain approximately 91,846 and 39,363 users respectively.
For this analysis, we will not be submitting to Kaggle.
# update this variable for changing split ratio
train_proportion = 0.7
# build list of all users ID
tmp = orders %>% filter(eval_set=='train') %>% distinct(user_id)
# 70/30 split
set.seed(12345)
train.rows = sample( 1:nrow(tmp), train_proportion * nrow(tmp) )
train.users = tmp[train.rows,] # select training rows, list of train users
test.users = tmp[-train.rows,] # select testing rows, list of test users
cat("Total Rows in Training Users: ", length(train.users),
"\nTotal Rows in Testing Users: ", length(test.users),
"\nTrain/Test Split % : ", 100*length(train.users)/(length(test.users)+length(train.users)),
" / ", 100*length(test.users)/(length(test.users)+length(train.users)))
Total Rows in Training Users: 91846
Total Rows in Testing Users: 39363
Train/Test Split % : 69.99977 / 30.00023
The data frame used for training should contain the columns and features below:
- key: a user_id-product_id pair that includes all prior and test/train rows
- actual: the label, set to 1 if the product appears in the user's final order and 0 otherwise
- other features
From the exploratory findings, features that could contribute to the prediction will be populated into this construct; the feature engineering happens at a later stage.
Let’s proceed to create the basic training construct. It won’t be used for prediction until feature engineering is completed.
# list of products in the final order, this make up the label
construct1 = orders %>%
filter(user_id %in% train.users, eval_set=='train') %>%
left_join(order_products_train) %>%
distinct(user_id, product_id) %>%
mutate(actual=1) #training label
# list of products each users had bought before in prior orders
construct2 = orders %>%
filter(user_id %in% train.users, eval_set=='prior') %>%
left_join(order_products_prior) %>%
distinct(user_id,product_id)
# Training Construct
train.construct = left_join(construct2,construct1) %>%
mutate(key=paste(user_id,product_id,sep="-")) %>% # key
select(key, user_id, product_id, actual) %>%
arrange(user_id, product_id) %>%
replace_na(list(actual = 0)) # products not in the last order, but present in prior orders
# drop_na # would remove products not in the history but appearing in the last order
rm(list=c('construct1','construct2'))
head(train.construct,50)
Using an approach similar to the training data construct, we now frame the testing data that will be used to evaluate the model built on the training data.
# list of products in the final order, this make up the label
construct1 = orders %>%
filter(user_id %in% test.users, eval_set=='train') %>%
left_join(order_products_train) %>%
distinct(user_id, product_id) %>%
mutate(actual=1) #training label
# list of products each users had bought before in prior orders
construct2 = orders %>%
filter(user_id %in% test.users, eval_set=='prior') %>%
left_join(order_products_prior) %>%
distinct(user_id,product_id)
# Testing Construct
test.construct = left_join(construct2,construct1) %>%
mutate(key=paste(user_id,product_id,sep="-")) %>% # key
select(key, user_id, product_id, actual) %>%
arrange(user_id, product_id) %>%
replace_na(list(actual = 0)) # products not in the last order, but present in prior orders
# drop_na # would remove products not in the history but appearing in the last order
rm(list=c('construct1','construct2'))
head(test.construct,50)
Instacart has close to 50k products in its catalogue, and the maximum number of items ordered by any user is only a tiny fraction of those. This means that simply predicting nothing is purchased in the next basket would yield close to 100% accuracy.
Because the dataset is highly imbalanced, Instacart uses the F1 score as the competition metric instead of accuracy.
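To illustrate the imbalance with hypothetical numbers (not taken from the dataset): if a user has 10,000 candidate products and reorders only 8 of them, a model that predicts nothing is reordered scores over 99.9% accuracy while being useless, which is why F1 is the better yardstick here.
# hypothetical illustration of why accuracy is misleading on imbalanced data
actual = c(rep(1, 8), rep(0, 9992)) # 8 reordered products out of 10,000 candidates
predicted = rep(0, 10000)           # naive model: predict that nothing is reordered
mean(actual == predicted)           # accuracy = 0.9992, yet recall and F1 are 0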
To evaluate model performance, we create a custom function to build a confusion matrix and derive other binary classification metrics.
## Custom Function For Binary Class Performance Evaluation
binclass_eval = function (actual, predict) {
cm = table(as.integer(actual), as.integer(predict), dnn=c('Actual','Predicted'))
ac = (cm['1','1']+cm['0','0'])/(cm['0','1'] + cm['1','0'] + cm['1','1'] + cm['0','0'])
pr = cm['1','1']/(cm['0','1'] + cm['1','1'])
rc = cm['1','1']/(cm['1','0'] + cm['1','1'])
fs = 2* pr*rc/(pr+rc)
list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac)
}
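A quick toy run of this helper (with hypothetical vectors, just to show the interface and the metrics it returns):
# toy example: 2 true positives, 1 false positive, 1 false negative, 2 true negatives
ev = binclass_eval(actual = c(1, 1, 1, 0, 0, 0), predict = c(1, 1, 0, 1, 0, 0))
ev$cm        # 2x2 confusion matrix
ev$precision # 2/3
ev$recall    # 2/3
ev$fscore    # 2 * (2/3) * (2/3) / (4/3) = 0.6667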
When the prediction is a probability, we also build a function to discover the cutoff that optimizes various performance metrics.
### Cutoff Threshold Optimization
optimize_cutoff = function (actual, probability) {
rocr.pred = prediction(predictions = probability, labels = actual)
rocr.metrics = data.frame(
cutoff = rocr.pred@cutoffs[[1]],
accuracy = (rocr.pred@tp[[1]] + rocr.pred@tn[[1]]) /
(rocr.pred@tp[[1]] + rocr.pred@tn[[1]] + rocr.pred@fp[[1]] + rocr.pred@fn[[1]]),
tpr = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fn[[1]]),
fpr = rocr.pred@fp[[1]] / (rocr.pred@fp[[1]] + rocr.pred@tn[[1]]),
ppv = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fp[[1]])
)
rocr.metrics$fscore = 2 * (rocr.metrics$tpr * rocr.metrics$ppv) / (rocr.metrics$tpr + rocr.metrics$ppv)
rocr.metrics$tpr_fpr = rocr.metrics$tpr / rocr.metrics$fpr
## Discover the optimal threshold for each metric
rocr.best = rbind(
best.accuracy = c(max = max(rocr.metrics$accuracy, na.rm = TRUE), cutoff=rocr.metrics$cutoff[which.max(rocr.metrics$accuracy)]),
best.ppv = c(max = max(rocr.metrics$ppv, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$ppv)]),
best.recall = c(max = max(rocr.metrics$tpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr)]),
best.fscore = c(max = max(rocr.metrics$fscore, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$fscore)]),
best.tpr_fpr = c(max = max(rocr.metrics$tpr_fpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr_fpr)])
)
list(metrics = rocr.metrics, best = rocr.best)
}
As the first baseline model, we simply predict the basket based on the user’s last order.
m1.train.data = users_orders_products_ %>%
filter(user_id %in% train.users) %>%
group_by(user_id) %>%
top_n(n=1, wt=order_number) %>% #last order has the higher order_number
select(user_id, product_id) %>%
mutate (predicted=1) %>% #predict based on last ordered, therefore 1
full_join(train.construct) %>% # join with train construct for items not predicted but in final order
select(user_id, product_id, actual, predicted) %>%
replace_na(list(predicted = 0))
head(m1.train.data,25)
m1.eval = binclass_eval(m1.train.data$actual, m1.train.data$predicted)
m1.eval$cm
Predicted
Actual 0 1
0 4672615 688814
1 315266 265703
The result is an F1 score of only 0.3460833.
cat("Accuracy: ", m1.eval$accuracy,
"\nPrecision: ", m1.eval$precision,
"\nRecall: ", m1.eval$recall,
"\nFScore: ", m1.eval$fscore)
Accuracy: 0.8310312
Precision: 0.2783638
Recall: 0.4573445
FScore: 0.3460833
In this model, we predict products in the basket by estimating how frequently each product was repurchased, giving a ratio that indicates the probability of repurchase. We use the ROCR package to find the cutoff point (above which we predict a reorder) that gives the optimum F1 score.
## Build Model
m2.train.data = users_orders_products_ %>%
filter(user_id %in% train.users) %>%
group_by(user_id) %>%
mutate(total_orders = max(order_number)) %>% # total number of orders made previously
ungroup %>%
select(user_id, order_id, product_id, total_orders) %>%
group_by(user_id, product_id) %>%
summarize(predicted=n()/max(total_orders)) %>%
select(user_id, product_id, predicted) %>%
full_join(train.construct) %>% # join with train construct for items not predicted but in final order
select(user_id, product_id, actual, predicted) %>%
replace_na(list(predicted = 0))
head(m2.train.data,20)
We see below that, in order to maximize the F1 score, we need to set the cutoff threshold to about 0.3367.
### Threshold Optimization
m2.rocr = optimize_cutoff(actual = m2.train.data$actual, probability = m2.train.data$predicted)
kable(m2.rocr$best) %>% kable_styling(bootstrap_options = c("striped"))
max | cutoff | |
---|---|---|
best.accuracy | 0.9061271 | 0.6710526 |
best.ppv | 0.6850580 | 0.8600000 |
best.recall | 1.0000000 | 0.0101010 |
best.fscore | 0.3753548 | 0.3367347 |
best.tpr_fpr | 20.0735607 | 0.8600000 |
Let’s set the cutoff to 0.3367347 as discovered in previous step.
m2.eval = binclass_eval(m2.train.data$actual, m2.train.data$predicted>0.3367347)
m2.eval$cm
Predicted
Actual 0 1
0 5023825 337604
1 368744 212225
We get a slightly better F1 score (0.3753544) compared to the previous naive model. We shall use this as the BASELINE.
cat("Accuracy: ", m2.eval$accuracy,
"\nPrecision: ", m2.eval$precision,
"\nRecall: ", m2.eval$recall,
"\nFScore: ", m2.eval$fscore)
Accuracy: 0.8811342
Precision: 0.3859836
Recall: 0.3652949
FScore: 0.3753544
We take all the products that each user purchased in prior orders, then use machine learning classification to predict whether each product will be purchased again. We use logistic regression for this prediction.
These are the original features provided by Instacart. Although no additional features are engineered specifically to describe an order, these features are used to generate other features in the following sections.
orders
- order_dow
- order_hour_of_day
- days_since_prior_order
- reordered
We create five features that are unique to each individual user. These are the features that describe the user.
users
- u_n_orders: number of orders per user
- u_avg_priors: average waiting days between orders per user
- u_avg_hod: average order placing hour per user
- u_avg_dow: average order placing day per user
- u_avg_order_size: average basket size (items in order) per user
#### user features
users_ = users_orders_products_ %>%
group_by(user_id,order_id) %>%
mutate(u_o_size = ifelse(row_number()==1, max(add_to_cart_order),0) ) %>%
group_by(user_id) %>%
summarize(
u_n_orders = max(order_number),
u_avg_priors = mean(days_since_prior_order,na.rm=TRUE),
u_avg_hod = mean(order_hour_of_day),
u_avg_dow = mean(order_dow),
u_avg_order_size = sum(u_o_size)/max(order_number)
) %>%
arrange(user_id)
head(users_)
We create two product-specific features.
products
- p_avg_dow: average order_dow across all orders of the product
- p_avg_hod: average order_hour_of_day across all orders of the product
products_ = users_orders_products_ %>%
group_by(product_id) %>%
summarize(
p_avg_dow = mean(order_dow),
p_avg_hod = mean(order_hour_of_day)
) %>% arrange(product_id)
head(products_)
We shall introduce product-related features that are user-product specific:
- up_n_reordered: how many times a user reordered this product
- up_avg_priors: average number of days between orders before the user purchases this product
- up_avg_hod: average hour at which the user purchases this product
- up_avg_dow: average day of the week on which the user purchases this product
- up_avg_rank: average add-to-cart position at which the user selects this product
We shall combine the training construct table with the newly engineered features (see the sketch of user_products_ below) to form the training data. Categorical columns that are merely names or identifiers are removed, since they should not contribute to the prediction.
After this step, the training data is ready for the machine learning algorithm of choice.
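The chunk below joins in a user_products_ dataframe whose construction does not appear in this document. A minimal sketch of how these user-product features could be derived (an assumed reconstruction: aggregating the prior order rows per user-product pair and attaching the user-level features from users_, as the commented-out join in the next chunk implies) is:
#### user-product features (assumed reconstruction)
user_products_ = users_orders_products_ %>%
  group_by(user_id, product_id) %>%
  summarize(
    up_n_reordered = sum(reordered, na.rm = TRUE),               # times the user reordered this product
    up_avg_priors  = mean(days_since_prior_order, na.rm = TRUE), # average days since prior order when buying it
    up_avg_hod     = mean(order_hour_of_day),                    # average hour of day the user buys it
    up_avg_dow     = mean(order_dow),                            # average day of week the user buys it
    up_avg_rank    = mean(add_to_cart_order)                     # average add-to-cart position
  ) %>%
  left_join(users_, by = 'user_id')                              # attach the user-level features (u_*)
head(user_products_)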
m3.train.data = users_orders_products_ %>%
filter(user_id %in% train.users) %>%
left_join(user_products_) %>%
left_join(products_) %>%
#left_join(users_) #user_products_ already contain user specific features
full_join(train.construct, by=c('user_id','product_id')) %>%
arrange(user_id, product_id) %>%
select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order'))
glimpse(m3.train.data)
m3.test.data = users_orders_products_ %>%
filter(user_id %in% test.users) %>%
left_join(user_products_) %>%
left_join(products_) %>%
#left_join(users_) #user_products_ already contain user specific features
full_join(test.construct, by=c('user_id','product_id')) %>%
arrange(user_id, product_id) %>%
select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order'))
glimpse(m3.test.data)
m3.fit = glm(actual ~ ., family = binomial, data = m3.train.data)
m3.predict = predict(m3.fit, type = 'response', newdata = m3.train.data)
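The chunk that produces the cutoff table below does not appear above; presumably it mirrors the earlier threshold-optimization step. A sketch, assuming the same helper function, would be:
### Threshold Optimization
m3.rocr = optimize_cutoff(actual = m3.train.data$actual, probability = m3.predict)
kable(m3.rocr$best) %>% kable_styling(bootstrap_options = c("striped"))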
max | cutoff | |
---|---|---|
best.accuracy | 0.8221607 | 0.5246629 |
best.ppv | 1.0000000 | 0.9895952 |
best.recall | 1.0000000 | 0.0092867 |
best.fscore | 0.5388467 | 0.2233115 |
best.tpr_fpr | Inf | 0.9895952 |
m3.eval = binclass_eval(m3.train.data$actual, m3.predict>0.2233115)
m3.eval$cm
Predicted
Actual 0 1
0 9191429 2312354
1 1022350 1948260
Logistic regression produces an F1 score of 0.5388465 on the training data, a substantial improvement over Model 1 and Model 2. We shall proceed to test the model on unseen data, the test set.
cat("Accuracy: ", m3.eval$accuracy,
"\nPrecision: ", m3.eval$precision,
"\nRecall: ", m3.eval$recall,
"\nFScore: ", m3.eval$fscore)
Accuracy: 0.7696136
Precision: 0.4572721
Recall: 0.6558451
FScore: 0.5388465
rocr.pred = prediction(m3.predict, m3.train.data$actual)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)
plot(rocr.perf,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)
m3.predict.test = predict(m3.fit, type = 'response', newdata = m3.test.data)
### Threshold Optimization
m3.rocr.test = optimize_cutoff(actual = m3.test.data$actual, probability = m3.predict.test)
kable(m3.rocr.test$best) %>% kable_styling(bootstrap_options = c("striped"))
max | cutoff | |
---|---|---|
best.accuracy | 0.8221624 | 0.5217251 |
best.ppv | 1.0000000 | 0.9861198 |
best.recall | 1.0000000 | 0.0100454 |
best.fscore | 0.5405047 | 0.2266508 |
best.tpr_fpr | Inf | 0.9861198 |
m3.eval.test = binclass_eval(m3.test.data$actual, m3.predict.test>0.2233115)
m3.eval.test$cm
Predicted
Actual 0 1
0 3921528 978677
1 435702 831691
We achieve an F1 score of 0.540452 on the test set, slightly higher than on the training data.
cat("Accuracy: ", m3.eval.test$accuracy,
"\nPrecision: ", m3.eval.test$precision,
"\nRecall: ", m3.eval.test$recall,
"\nFScore: ", m3.eval.test$fscore)
Accuracy: 0.7706759
Precision: 0.4594044
Recall: 0.6562219
FScore: 0.540452
rm(list=c('m3.fit','m3.predict', 'm3.rocr'))
rocr.auc
[1] 0.8090974
plot(rocr.perf,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('auc : ', round(rocr.auc, 5)))
Error: memory exhausted (limit reached?)
abline(0, 1, col = "red", lty = 2)
Error: memory exhausted (limit reached?)
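The plot call above reuses the training-set ROC objects and runs out of memory. If the ROC on the test predictions were wanted instead, a sketch (memory permitting; not run in the original analysis) would be:
# hypothetical: ROC/AUC for the test-set predictions
rocr.pred.test = prediction(predictions = m3.predict.test, labels = m3.test.data$actual)
rocr.perf.test = performance(rocr.pred.test, measure = "tpr", x.measure = "fpr")
rocr.auc.test = as.numeric(performance(rocr.pred.test, "auc")@y.values)
plot(rocr.perf.test, lwd = 3, colorize = TRUE, main = 'ROC Curve (Test)')
mtext(paste('auc : ', round(rocr.auc.test, 5)))
abline(0, 1, col = "red", lty = 2)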
Technical Challenges