Copied and modified from https://rpubs.com/yongks/instacart8
Introduction
Instacart is an app for on-demand grocery shopping with same-day delivery service. Instacart uses a crowdsourced marketplace model, akin to that of Uber or Lyft.
The Instacart shopping process is as follows. First, an app user places their grocery order through the app. Then, a locally crowdsourced “shopper” is notified of the order, goes to a nearby store, buys the groceries, and delivers them to the user.
There are three ways that Instacart generates revenue: delivery fees, membership fees, and mark-ups on in-store prices.
Research Goal & Objective
The main objective of the competition is to predict what will the user will buy in the next order, given all data of prior orders.
Market Basket Analysis
Market Basket Analysis (MBA) is a process that looks for relationships among entities and objects that frequently appear together, such as the collection of items in a shopper’s cart. For the purposes of customer centricity, market basket analysis examines collections of items to identify affinities that are relevant within the different contexts of the customer touch points. Some examples include:
Product placement—Identifying products that may often be purchased together and arranging the placement of those close by to encourage the purchaser to buy both items. That placement can be physical, such as in the arrangement of products on shelves in a brick and mortar location, or virtual, such as in a print catalog or on an e-commerce site.
Point-of-Sale—Companies may use the affinity grouping of multiple products as an indication that customers may be predisposed to buying certain sets of products at the same time. This enables the presentation of items for cross-selling, or may suggest that customers may be willing to buy more items when certain products are bundled together.
Customer retention—When customers contact a business to sever a relationship, a company representative may use market basket analysis to determine the right incentives to offer in order to retain the customer’s business.
MBA is one of the key techniques used by the large retailers that uncovers associations between items by looking for combinations of items that occur together frequently in transactions. In other words, it allows the retailers to identify relationships between the items that people buy.
Association Rules is widely used to analyze retail basket or transaction data, is intended to identify strong rules discovered in transaction data using some measures of interestingness, based on the concept of strong rules.
Terminology of Association Rules
Assume there are 100 transactions on a retail shop. * 10 out of them bought milk, 8 bought butter and 6 bought both of them (milk & butter). * For example, our interesting is how it like if someone bought milk also bought butter: + bought milk => bought butter
First, we need to know the terminology that used in MBA.
- Itemset : set of items that customer bought in all transactions.
Support : proportion of transaction that contain an itemsets of interest. \[Supp(X \Rightarrow Y) = \frac{frq(| X \cup Y |)}{N}\] For example, our interest is product Milk & Butter \[
\begin{aligned}
Supp(Milk \Rightarrow Butter) &= P(Milk, Butter) \\
&= \frac{frq(Milk, Butter)}{N}\\
&= \frac{6}{100}\\
&= 0.06\\
\end{aligned}
\]
The minimum support of the rule is defined as the minimum number of transactions that include both the antecedent and consequent parts in order to qualify to be part of frequent item set. The default minimum support would be 10% of the total number of transactions taken for analysis.
Confidence : conditional probability that if a customer purchases X, they will purchases Y. It determines the operational usefulness of a rule. Transactions with confidence with more than 50% will be selected. Higher the confidence , stronger the rule is. \[Conf(X, Y) = \frac{Supp(X,Y)}{P(Y)}\] Confidence for Milk & Butter is, where \[P(Butter) = \frac{frq(Butter)}{N} = \frac{8}{100} = 0.08\] Then, \[
\begin{aligned}
Conf(Milk \Rightarrow Butter) &= \frac{Supp(Milk \Rightarrow Butter)}{P(Butter)}\\
&= \frac{0.06}{0.08}\\
&= 0.75
\end{aligned}
\]
The minimum confidence of the rule is defined as the minimum number of transaction that has consequent will also have antecedent. The default minimum confidence would be 50%.
Lift : ratio of support of X occuring together with Y divided by the probability that X and Y occur if they are independet. \[
\begin{aligned}
Lift(X,Y) &= \frac{Supp(X,Y)}{P(X,Y)},& P(X,Y) = P(X)P(Y)\\
&= \frac{Supp(X,Y)}{P(X)P(Y)}&\\
&= \frac{Supp(X,Y)}{P(Y)}\frac{1}{P(X)}\\
&= Conf(X,Y)\frac{1}{P(X)}\\
&= \frac{Conf(X,Y)}{P(X)}
\end{aligned}
\] Now \(Lift(Milk, Butter) = ?\) \[P(Milk) = \frac{frq(Milk)}{N} = \frac{10}{100} = 0.01\] So, \[
\begin{aligned}
Lift(Milk, Butter) &= \frac{Conf(Milk, BUtter)}{P(Milk)}\\
&= \frac{0.75}{0.10}\\
&= 7.5
\end{aligned}
\]
Note: this example is extremely small. In practice, a rule needs a support of hundreds transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Ok, enough for the theory, let’s get to the code.
Dataset Preparation
Data Source
Last year, Instacart released a public dataset, “The Instacart Online Grocery Shopping Dataset 2017”. The dataset contains over 3 million anonymized grocery orders from more than 200,000 Instacart users. This analysis will make use of this datasets.
Data source can be downloaded here: https://www.kaggle.com/c/instacart-market-basket-analysis/data
R Libraries Used
Here are the R libraries used in this analysis.
rr library(knitr) # web widget library(tidyverse) # data manipulation library(data.table) # fast file reading library(caret) # rocr analysis library(ROCR) # rocr analysis library(kableExtra) # nice table html formating library(gridExtra) # arranging ggplot in grid
Import Datasets
rr # setwd(‘./data’) aisles <- fread(‘data/aisles.csv’, stringsAsFactors = TRUE) departments <- fread(‘data/departments.csv’, stringsAsFactors = TRUE) products <- fread(‘data/products.csv’, stringsAsFactors = TRUE) orders <- fread(‘data/orders.csv’, stringsAsFactors = TRUE) order_products_train <- fread(’data/order_products__train.csv’) order_products_prior <- fread(’data/order_products__prior.csv’)
Data Dictionary
The dataset for this competition is a relational set of files describing customers’ orders over time. They are anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, Instacart provided between 4 and 100 of their orders, with the sequence of products purchased in each order, the week and hour of day the order was placed, and a relative measure of time between orders.
Total six datasets were imported. Follwing section will explore each datasets in further detail. These datasets were sourced from an existing Kaggle competition.
orders (3.4m rows, 206k users):
order_id: order identifier
user_id: customer identifier
eval_set: which evaluation set this order belongs in (see SET described below)
order_number: the order sequence number for this user (1 = first, n = nth)
order_dow: the day of the week the order was placed on
order_hour_of_day: the hour of the day the order was placed on
days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)
products (50k rows):
product_id: product identifier
product_name: name of the product
aisle_id: foreign key
department_id: foreign key
aisles (134 rows):
aisle_id: aisle identifier
aisle: the name of the aisle
deptartments (21 rows):
department_id: department identifier
department: the name of the department
order_products__SET (30m+ rows):
order_id: foreign key
product_id: foreign key
add_to_cart_order: order in which each product was added to cart
reordered: 1 if this product has been ordered by this user in the past, 0 otherwise
where SET is one of the four following evaluation sets (eval_set in orders):
"prior": orders prior to that users most recent order (~3.2m orders)
"train": training data supplied to participants (~131k orders)
"test": test data reserved for machine learning competitions (~75k orders)
Understanding Datasets
Aisles
There are 134 aisles in this dataset. Here are few sample names of the ailes.
rr paste(sort(head(aisles$aisle)), collapse=‘,’)
Departments
There are 21 departments in this dataset.Names of all deparments are listed below in aphabetically ordered.
rr paste(sort(departments$department), collapse = ‘,’)
Products
There are 49,688 products in the catalogue within 134 aisles and 21 departments.
Sample products are as below.
rr products %>% head %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Departments And Its Relevant Products
Products dataframe is related to Deparments.
We shall see sample of 3 products for few deparments.
rr departments %>% left_join(products) %>% select(department, product_name) %>% group_by(department) %>% sample_n(3) %>% summarise(three_examples_product = paste(product_name, collapse = ’ / ’)) %>% sample_n(5) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Aisles And Its Relevant Products
Products dataframe is also related to aisles. Each aisle relates to multiple products. By joining both aisles and products dataframe, we have an idea what type of prodcuts for each ailes.
Example below shows 3 samples products of for few aisles.
rr aisles %>% left_join(products) %>% select(aisle, product_name) %>%
group_by(aisle) %>% sample_n(3) %>% summarise(three_examples_product = paste(product_name, collapse = ’ || ’)) %>% sample_n(5) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Orders
There are over 3 millions observations in orders dataset. Each row represent an unique order.
Train Eval_Set
Let’s analyse the construct of one user. For example, user_id 1 had made 10 prior orders (order number from 1 to 10), last order is a train (eval_set). Note that the first order (order_number 1) does not have value for day_since_prior_order, as it is the first order without prior records.
This also means <user_id, product_id> made up the key for prediction.
rr orders %>% filter(user_id == 1) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Test Eval_Set
Let’s analyse another construct of orders. User_id 3 had made 12 orders before the final order labeled as test (eval_set) order. From the data we know that order_number is being recycled for each user.
Instacart did not provide us the basket content for test order. This is in fact the target for prediction.
rr orders %>% filter(user_id == 3) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Order_Product
Each order contain multiple products purchased by user. Instacart had cleanly categorized the orders into ‘train’ and ‘prior’ in SINGLE order dataset.
However, the detail of each orders are splitted into two datsets:
- order_product_train: contain only detail product items of last order
- order_product_prior: contain detail product items of all prior orders
Order_Product_Train
order_product_train/prior dataframe tells us which products were purchased at each order; for both train and prior order.
For example, we know user_id 1 in the LAST ORDER (order_id 1187899) purchased 10 unique products by quering order_product_train with the relevant order_id.
rr order_products_train %>% filter (order_id == 1187899) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Order_Product_Train_Prior
Similary, detail items for a PRIOR ORDER (example order_id: 2550362) can be retireved by quering different dataset order_product_prior.
rr order_products_prior %>% filter(order_id == 2550362) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Users
Ther is no dedicated dataframe for users. However, we can derive number of unique users from order dataframe. By grouping the user_id and eval_set column, we found that there are 75,000 test users, 131,209 train users.
rr orders %>% filter(eval_set %in% c(‘train’, ‘test’) ) %>% count(eval_set) %>% mutate(percentage = n/sum(n)) %>% kable() %>% kable_styling(bootstrap_options = c(, ))
Exploratory Data Analysis
In this section, we shall try to understand the buying behaviour by asking some interesting quesitons.
- What usually does people buy, and which one they usually reorder
- When do they buy (day and time)? Is there a buying trend and does it influence what they buy ?
To reduce our coding steps, we construct a reusable dataframe combining all details from orders and its products. This dataframe will contain rows for prior orders and products only (excluding last order which is labeled as train).
rr users_orders_products_ <- orders %>% inner_join(order_products_prior) %>% # inner_join with prior table will filter out train orders left_join(products) %>% left_join(aisles) %>% left_join(departments) %>% arrange(user_id, order_number) %>% select(user_id, order_id, order_number, order_dow, order_hour_of_day, days_since_prior_order, product_id, product_name, reordered, add_to_cart_order, department_id, aisle_id, department, aisle)
Orders
How Many Orders ?
Most users made few orders. The number of orders a users made decrease significally along the order numbers. Maximum orders any users had made is 99.
rr tmp <- users_orders_products_ %>% group_by(user_id) %>% summarize(n_orders = max(order_number))
tmp %>% ggplot(aes(x = as.factor(n_orders))) + geom_bar() + labs(y = ‘Count of Users’, x = ‘Number of Orders Made By Users’) + theme( axis.text.x = element_text (size = 6.0, angle = (90), hjust = 1, vjust = 0.5) )
How Soon Until Next Order ?
It is very obvious that most users made their orders weekly (every 7 days) and monthly (every 30 days). See the peak of day 7 and day 30 in the chart below.
rr tmp <- users_orders_products_ %>% filter(order_number > 1) %>% # days_since_prior is NA for first order, need to filter out group_by(order_id) %>% summarize(n_orders = max(days_since_prior_order))
tmp %>% ggplot(aes(x = as.factor(n_orders))) + geom_bar() + labs(y = ‘Count of Orders’, x = ‘Days Since Prior For Each Order’)
Orders_Products
Most Popular Products Sold
We know that banana are the most popular products. The number of orders varies greatly for different products. Illustration below uses shows sample of only 30 top products. Notice however the varience is not obvious outside top 10 products.
rr tmp <- order_products_train %>% left_join(products) %>% group_by(product_name) %>% summarize(count = n()) %>% top_n(n = 30, wt = count) %>%
mutate(percentage = count/sum(count))
p1 <- tmp %>% ggplot(aes(x = reorder(product_name, count), y = percentage)) +
geom_col() + labs(title = ‘Products Top 30’, y = ‘Percentage of Orders’) + theme ( axis.text.x=element_text(angle=90, hjust=1, vjust=0.5), axis.title.x = element_blank())
p2 <- tmp %>% ggplot(aes(x = ’‘, y = percentage)) + labs(title = ’Products Top 30’, y = ‘percentage.of.orders’, x = ‘Products’) + geom_boxplot()
grid.arrange(p1, p2, ncol = 2)
Most Popular Department Sold
Certain departmens are clearly more popular, like produce and dairy eggs. Both deparments combined contributed to more than 40% of total orders.
rr tmp <- users_orders_products_ %>% group_by(department) %>% summarize(count = n()) %>% mutate(percentage = count/sum(count))
p1 <- tmp %>% ggplot (aes(x = reorder(department, count), y = percentage)) +
geom_col() + labs(title = ‘Departments’, y = ‘Percentage of Orders’) + theme ( axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), axis.title.x = element_blank())
p2 <- tmp %>% ggplot(aes(x = ’‘, y = percentage )) + labs(title = ’Departments’, y = ‘percentage.of.orders’, x = ‘Departments’) + geom_boxplot()
grid.arrange(p1, p2, ncol = 2)
Most Popular Aisles Sold
We looked into the buying trend of product by aisles and notice that certain aisle like vegetables and fruits contributes to almost 30% of total orders. Chart below shows top 30 aisles.
rr tmp <- users_orders_products_ %>% group_by(aisle) %>% summarize(count = n()) %>% top_n(n = 30, wt = count) %>%
mutate(percentage = count/sum(count))
p1 <- tmp %>% ggplot(aes(x = reorder(aisle, count), y = percentage)) +
geom_col() + labs(title = ‘Aisles Top 30’, x = ‘Aisles’, y = ‘Percentage of Orders’) + theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5), axis.title.x = element_blank())
p2 <- tmp %>% ggplot(aes(x = ’‘, y = percentage )) + geom_boxplot() + labs(title = ’Aisles Top 30’, y = ‘percentage.of.orders’, x = ‘Aisles’)
grid.arrange(p1, p2, ncol = 2)
Products Ordered Day Pattern
We can see that both Day 0 and Day 1 stands out to be the most busy shopping day for instacart. This means that day of order made may influence the basket size.
rr users_orders_products_ %>% group_by(order_dow) %>% summarize(count = n()) %>% mutate(percentage = count/sum(count)) %>% ggplot(aes(x = as.factor(order_dow), y = percentage, fill = as.factor(order_dow))) + geom_col() + labs(title = ‘Daily Orders’, y = ‘Percentage of Orders’)
When we zoom into daily orders, we notice that top ten products contributes between 7% to 8% of daily orders. It is interesting to see that Limes are part of top ten for Day 0 and Day 6, but not other days. Whereas Organic Whole Milk doesn’t make it to top ten for Day 0. Organic Respberries does not make it to top 10 of Day 6. This means that there is a chance of predictability based on the day order is made.
rr users_orders_products_%>% group_by(order_dow, product_name) %>% summarize(n = n()) %>% mutate(percentage = n/sum(n)) %>% top_n(10, wt = n) %>% ggplot(aes(x = as.factor(order_dow), y = percentage, fill = product_name)) + geom_col() + labs(y = ‘Proprtion of Orders In A Day’, title = ‘Daily Top 10 Products Ordered’) + theme(legend.position = , legend.direction = )
Products Ordered Hour Pattern
Morning to afternoon are the peak shopping hours for instacart customers. The hour order made influences basket size.
rr users_orders_products_ %>% group_by(order_hour_of_day) %>% summarize(count = n()) %>% mutate(percentage = count/sum(count)) %>% ggplot(aes(x = as.factor(order_hour_of_day), y = percentage)) + geom_col() + labs(y = ‘Percentage of Orders’, title = ‘Hourly Orders’)
In the grocery, there are close to 50,000 products. When we zoom into hourly purchases, we noticed that top 10 products managed to score betwen 6% to 8% of hourly sales. Every hour has slightly diffrent combination of top 10 products (combination out of 12 products). That means certain products are predictable for ordering irregardless of the hour of order.
It is interesting to know that, similar to daily top 10 products, the Organic Wholemilk and Limes is missing as top 10 from some hours.
rr users_orders_products_ %>% group_by(order_hour_of_day, product_name) %>% summarize(n = n()) %>% mutate(percentage = n/sum(n)) %>% top_n(10, wt = n) %>% ggplot (aes(x = as.factor(order_hour_of_day), y = percentage, fill = product_name)) + geom_col() + labs(y = ‘Proprtion of Orders In A Hour’, title = ‘Hourly Top 10 Products Ordered’) + theme(legend.position = , legend.direction = )
Basket Analysis
Basket Size Distribution
Number of items in all orders range from 1 to 145. The histogram below is highly skewed towards small basket size. Majority of users purchased 5 items in their orders.
rr tmp <- users_orders_products_ %>% group_by(order_id) %>% summarize(basket_size = n(), reordered_items = sum(reordered)) %>% group_by(basket_size) %>% summarize(n = n(), avg_reordered_items = mean(reordered_items)) %>% arrange(basket_size)
tmp %>% ggplot(aes(x = as.factor(basket_size))) + geom_col(aes(y = n)) + labs(y = ‘Order Count’, x = ‘Number of Items in Basket’, title = ‘Basket Size Distribution’) + theme(axis.text.x = element_text(size = 6.0, angle = 90, hjust = 1, vjust = 0.5))
Re-Ordered Analysis
Analyzing the re-ordered products is the most important part of the EDA. This is becasue insights from this analysis can help to develop intuition for furhter feature engineering that will make the prediction more meaningful.
Average Re-ordered Items In Basket Distribution
rr tmp %>% ggplot(aes(x = as.factor(basket_size))) + geom_point(aes(y = avg_reordered_items), color = ‘red’) + labs(y = ‘Avg Number of Re-Ordered Items’, x = ‘Number of Items in Basket’, title = ‘Reorder Rate by Basket Size’) + theme(axis.text.x = element_text(size = 6.0, angle = 90, hjust = 1, vjust = 0.5)) + geom_abline(intercept = 0, slope = 1, color = ‘blue’)
Product Reorder Ratio
One of the tricker things to predict in the Instacart dataset is the incidence of orders without reordered products. Plotting the proportion of this incidence across the training sample (a snapshot of 131K+ users) provides some inspiration.
How Many Products Were Reordered
Among all product purchases, 41% of products are reordered. The reordered rate is particularly high on top 10 products. As shown in chart below, top ten popular products has reordered rate is around 70% to 85%; higher than the overall ratio of 41%.
rr ## overall all products reordered rate tmp1 <- users_orders_products_ %>% filter(order_number > 1) %>% # exclude first order, which will never have reordered count(reordered) %>% mutate(ratio = n/sum(n))
p1 <- tmp1 %>% ggplot(aes(x = ’‘, y = ratio, fill = as.factor(reordered))) + geom_col(width = 1) + labs(y = ’Product Reordered Ratio’) + coord_polar(theta = ‘y’, start = 0) + scale_fill_brewer(palette = 2) + theme(axis.title.y = element_blank())
top10 products and its reordered rate
tmp2 <- users_orders_products_ %>% count(product_id) %>% # filter only top 10 products for reorder analysis top_n(n = 10) %>% left_join(users_orders_products_) %>% # now find out their reordered rate group_by(product_id, product_name) %>% summarize(reordered_rate = sum(reordered, na.rm = TRUE)/n()) %>% select(product_id, product_name, reordered_rate) %>% arrange(desc(reordered_rate))
p2 <- tmp2 %>% ggplot(aes(x = reorder(product_name, reordered_rate), y = reordered_rate)) + labs(title = ‘Top 10 Products Sold and Their Reordering Rate’) + geom_col() + scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) + coord_flip()
grid.arrange(p1, p2, ncol = 2)
Reordering vs Days Since Prior Order
We understand from order analysis earlier that most users place their orders every 7 and 30 days. However, from reorder ration perspective, day 7 and day 30 has high contrast whereby day 7 orders has high reorder ratio and day 30 has lowest reordering ratio.
rr tmp <- users_orders_products_ %>% filter(order_number > 1) %>% group_by(days_since_prior_order, order_id) %>% summarize(contain_reordered = max(reordered)) %>% summarize(reordered_orders = sum(contain_reordered), n = n() + 1) %>% mutate(non_reorder_ratio = 1 - (reordered_orders/n))
tmp %>% ggplot(aes(x = days_since_prior_order, y = non_reorder_ratio)) + geom_point() + geom_line() + labs(title = ‘Orders NOT Containing Reordered Products over Days since Prior Order’)
Reordering vs Hour Of Order
rr tmp <- users_orders_products_ %>% filter(order_number > 1) %>% group_by(order_hour_of_day, order_id) %>% summarize(contain_reordered = max(reordered)) %>% summarize(reordered_orders = sum(contain_reordered), n = n()) %>% mutate(non_reorder_ratio = 1 - (reordered_orders/n))
tmp %>% ggplot(aes(x = order_hour_of_day, y = non_reorder_ratio)) + geom_point() + geom_line() + labs(title = ‘Non Reorder Ratio over Time of Order Placed’)
Reordering vs Day Of Order
rr tmp <- users_orders_products_ %>% filter(order_number > 1) %>% group_by(order_dow, order_id) %>% summarize(contain_reordered = max(reordered)) %>% summarize(reordered_orders = sum(contain_reordered), n = n() + 1) %>% mutate(non_reorder_ratio = 1 - (reordered_orders/n))
tmp %>% ggplot(aes(x = order_dow, y = non_reorder_ratio)) + geom_point() + geom_line() + labs(title = ‘Non Reorder Ratio over Day of Purchase’)
Reordering vs Day Of Order
Intuitively, we can think of the more regular a buyer is, the person tend to repeat ordering the same products.
rr tmp <- users_orders_products_ %>% filter(order_number > 1) %>% group_by(user_id, order_id) %>% summarize(contain_reordered = max(reordered)) %>% summarize(reordered_orders = sum(contain_reordered), total_orders_per_user = n()) %>% group_by(total_orders_per_user) %>% summarize(reorders = sum(reordered_orders), total = sum(total_orders_per_user)) %>% mutate(non_ratio = 1 - (reorders/total))
tmp %>% ggplot(aes(x = total_orders_per_user, y = non_ratio)) + geom_point() + geom_line() + labs(title = ‘Non Reorder Ratio over Day of Purchase’)
Predictive Analysis
Type of Prediction
The objective is to predict what product will the customer purchase in the next basket. It require probability estimation of each product that bad been purchased before, that to be purchased before.This is a classification problem, as well as a regression of probability of repurchases.
For this analysis, we shall use two Naive models (handcrafted baseline) and one Machine Learning Logistic regression will be used for Machine Learning approach for its speed and simplicity; to demonstrate the feasibility to producing a better outcome then baseline.
Train/Test Dataset Splitting
Instacart did not provide us test order detail, therefore we shall use the train users for both trainng and testing. We achieve this by splitting the train users and its related orders and products into train dataset and train dataset, at 70%/30% split (by number of users). That means our train/test dataset will contain approximately 91846 / 39,363 users.
For this analysis, we will not be submitting to Kaggle.
rr # update this variable for changing split ratio train_proportion <- 0.7
build list of all users ID
tmp <- orders %>% filter(eval_set == ‘train’) %>% distinct(user_id)
70/30 split
set.seed(12345) train.rows <- sample(1:nrow(tmp), train_proportion * nrow(tmp)) train.users <- tmp[train.rows, ] # select training rows, list of train users test.users <- tmp[-train.rows, ] # select testing rows, list of test users
cat(Rows in Training Users: , length(train.users), \nTotal Rows in Testing Users: , length(test.users), \nTrain/Test Split % : , 100length(train.users)/(length(test.users)+length(train.users)), / , 100length(test.users)/(length(test.users)+length(train.users)))
Training Data Construct
The data frame used for training should contain the below columns and features:
key
- This is unique pair of user_id and product_id from orders
- The keys should be constructed from all
user_id-product_id pair that includes all prior and test/train rows
`actual
- This is the response variable with value of 1 or 0 for each unique key
- The value is 1 when the product is purchased in the last order (train or test set of orders)
- The value is 0 when the product is not purchased in the train or test set, but was bought in prior set
other features
From exploratory discovery, features that could contribute to the prediction should be populated into the construct. Feature engineering will happen in the later stage.
Let’s proceed to create the basic training construct. This won’t be used for prediction until feature engineering is completed in later stage.
rr # list of products in the final order, this make up the label construct1 <- orders %>%
filter(user_id %in% train.users & eval_set == ‘train’) %>% left_join(order_products_train) %>% distinct(user_id, product_id) %>% mutate(actual = 1) #training label
list of products each users had bought before in prior orders
construct2 <- orders %>%
filter(user_id %in% train.users & eval_set == ‘prior’) %>% left_join(order_products_prior) %>% distinct(user_id, product_id)
Training Construct
train.construct <- left_join(construct2, construct1) %>% mutate(key = paste(user_id, product_id, sep = -)) %>% # key select(key, user_id, product_id, actual) %>% arrange(user_id, product_id) %>% replace_na(list(actual = 0)) # proudcts not in last order, but exist in prior order # drop_na # remove proudcts not in historical but appear in last order
rm(list = c(‘construct1’, ‘construct2’)) head(train.construct, 50)
Testing Data Construct
Similar approach to training data construct, here we frame the testing data for evaluate our model built with training data.
# list of products in the final order, this make up the label
construct1 <- orders %>%
filter(user_id %in% test.users & eval_set == 'train') %>%
left_join(order_products_train) %>%
distinct(user_id, product_id) %>%
mutate(actual = 1) #training label
# list of products each users had bought before in prior orders
construct2 <- orders %>%
filter(user_id %in% test.users & eval_set == 'prior') %>%
left_join(order_products_prior) %>%
distinct(user_id, product_id)
# Training Construct
test.construct <- construct2 %>%
left_join(construct1) %>%
mutate(key = paste(user_id, product_id, sep = "-")) %>% # key
select(key, user_id, product_id, actual) %>%
arrange(user_id, product_id) %>%
replace_na(list(actual = 0)) # proudcts not in last order, but exist in prior order
# drop_na # remove proudcts not in historical but appear in last order
rm(list = c('construct1', 'construct2'))
head(test.construct, 50)
Model Evaluation & Optimization
Instacart has close to 50k products in their catalogue. As the maximum number of items ordered by a user is just a fraction of the 50k available product. This means by simply predicting nothing is purchased in the next basket, we would yeild close to 100% accuracy.
Due to the highly imbalance dataset, Instacart require F1 Score as the competition scoring, instead of accuracy.
To evaluate the performance of the model, we had created a custom function to build a confusion matrix and derive other binary classification metrics.
rr ## Custom Function For Binary Class Performance Evaluation binclass_eval = function (actual, predict) { cm = table(as.integer(actual), as.integer(predict), dnn=c(‘Actual’,‘Predicted’)) ac = (cm[‘1’,‘1’]+cm[‘0’,‘0’])/(cm[‘0’,‘1’] + cm[‘1’,‘0’] + cm[‘1’,‘1’] + cm[‘0’,‘0’]) pr = cm[‘1’,‘1’]/(cm[‘0’,‘1’] + cm[‘1’,‘1’]) rc = cm[‘1’,‘1’]/(cm[‘1’,‘0’] + cm[‘1’,‘1’]) fs = 2* pr*rc/(pr+rc) list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac) }
If the prediction is based on probability, we shall build a function to discover cutoff that optimize various performance metrics.
rr ### Cutoff Threshold Optimization optimize_cutoff = function (actual, probability) { rocr.pred = prediction(predictions = probability, labels = actual) rocr.metrics = data.frame( cutoff = rocr.pred@cutoffs[[1]], accuracy = (rocr.pred@tp[[1]] + rocr.pred@tn[[1]]) / (rocr.pred@tp[[1]] + rocr.pred@tn[[1]] + rocr.pred@fp[[1]] + rocr.pred@fn[[1]]), tpr = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fn[[1]]), fpr = rocr.pred@fp[[1]] / (rocr.pred@fp[[1]] + rocr.pred@tn[[1]]), ppv = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fp[[1]]) ) rocr.metrics\(fscore = 2 * (rocr.metrics\)tpr * rocr.metrics\(ppv) / (rocr.metrics\)tpr + rocr.metrics\(ppv) rocr.metrics\)tpr_fpr = rocr.metrics\(tpr / rocr.metrics\)fpr
## Discovery the optimal threshold for various metrics rocr.best = rbind( best.accuracy = c(max = max(rocr.metrics\(accuracy, na.rm = TRUE), cutoff=rocr.metrics\)cutoff[which.max(rocr.metrics$accuracy)]), best.ppv = c(max = max(rocr.metrics\(ppv, na.rm = TRUE), cutoff = rocr.metrics\)cutoff[which.max(rocr.metrics$ppv)]), best.recall = c(max = max(rocr.metrics\(tpr, na.rm = TRUE), cutoff = rocr.metrics\)cutoff[which.max(rocr.metrics$tpr)]), best.fscore = c(max = max(rocr.metrics\(fscore, na.rm = TRUE), cutoff = rocr.metrics\)cutoff[which.max(rocr.metrics$fscore)]), best.tpr_fpr = c(max = max(rocr.metrics\(tpr_fpr, na.rm = TRUE), cutoff = rocr.metrics\)cutoff[which.max(rocr.metrics$tpr_fpr)]) )
list(metrics = rocr.metrics, best = rocr.best) }
Model 1 : Naive Prediction
Build The Model
With intension to make this a baseline model, We simply predict the basket based on user last order.
rr m1.train.data = users_orders_products_ %>% filter(user_id %in% train.users) %>% group_by(user_id) %>% top_n(n=1, wt=order_number) %>% #last order has the higher order_number select(user_id, product_id) %>% mutate (predicted=1) %>% #predict based on last ordered, therefore 1 full_join(train.construct) %>% # join with train construct for items not predicted but in final order select(user_id, product_id, actual, predicted) %>% replace_na(list(predicted = 0))
head(m1.train.data,25)
Confusion Matrix
rr m1.eval = binclass_eval(m1.train.data\(actual, m1.train.data\)predicted) m1.eval$cm
Model 2 : Smarter Naive Prediction (Baseline)
In this model, we predict products in the basket by estimating their frequency of repurchased. This way we get a ratio to indicate probability of re-purchases. We use ROCR package to estimate the best cutoff point (at which above this cutoff we shall predict for re-order) that give us the optimum F1 score.
Build The Model
rr ## Build Model m2.train.data = users_orders_products_ %>% filter(user_id %in% train.users) %>% group_by(user_id) %>% mutate(total_orders = max(order_number)) %>% # total number of orders made previously ungroup %>% select(user_id, order_id, product_id, total_orders) %>% group_by(user_id, product_id) %>% summarize(predicted=n()/max(total_orders)) %>% select(user_id, product_id, predicted) %>% full_join(train.construct) %>% # join with train construct for items not predicted but in final order select(user_id, product_id, actual, predicted) %>% replace_na(list(predicted = 0))
head(m2.train.data,20)
Optimize Cutoff
We see that in order to maximize F1 Score, we need to set the cutoff threshold to 0.3368, which is the next step.
rr ### Threshold Optimization m2.rocr = optimize_cutoff(actual = m2.train.data\(actual, probability = m2.train.data\)predicted) kable(m2.rocr$best) %>% kable_styling(bootstrap_options = c())
Confusion Matrix
Let’s set the cutoff to 0.3367347 as discovered in previous step.
rr m2.eval = binclass_eval(m2.train.data\(actual, m2.train.data\)predicted>0.3367347) m2.eval$cm
Machine Learning Framing
We construct all the products that users had purchased in the last 3 orders, then use machine learning classification to predict will each of the product be purchased again. We shall use decision tree and logistic regression for this prediction.
Feature Engineering
Order Features
These are original features provided by Instacart. Although there are no other features engineered specifically to describe Order, thse features are being used to generate other features in the following sections.
orders
- order_dow
- order_hour_of_day
- days_since_prior_order
- reordered
User Features
We create five features which is unique to each individual user. These are the features that desribe the user.
users
- u_n_orders: Number of Orders Per User
- u_avg_priors: Average waiting days between orders per User
- u_avg_hod: Average Order Placing Hour Per User
- u_avg_dow: Average Order Placing Day Per User
- u_avg_order_size: Average Size of Basket (items in order) Per User
#### user features
users_ = users_orders_products_ %>%
group_by(user_id,order_id) %>%
mutate(u_o_size = ifelse(row_number()==1, max(add_to_cart_order),0) ) %>%
group_by(user_id) %>%
summarize(
u_n_orders = max(order_number),
u_avg_priors = mean(days_since_prior_order,na.rm=TRUE),
u_avg_hod = mean(order_hour_of_day),
u_avg_dow = mean(order_dow),
u_avg_order_size = sum(u_o_size)/max(order_number)
) %>%
arrange(user_id)
head(users_)
Product Features
We create two product specific features.
products
avg_product_order_dow: Average of product order_dow
avg_product_order_hod: Average of product order_hour_of_day
products_ = users_orders_products_ %>%
group_by(product_id) %>%
summarize(
p_avg_dow = mean(order_dow),
p_avg_hod = mean(order_hour_of_day)
) %>% arrange(product_id)
head(products_)
User-Product Features
We shall introduce product related features that are user-product specifc
up_n_reordered : how many times a user reorderedthis product
up_avg_priors : Average number of days in between before a user purchase this product
up_avg_hod : Average hour a user purchase this product
up_avg_dow : Average day of week a user purchase this product
up_avg_rank : Average add to cart number a user select this product
rr ### user_products features user_products_ = users_orders_products_ %>% group_by(user_id, product_id) %>% summarize( up_n_reordered = n()-1, # minus off first order, which is not reorder up_avg_priors = mean(days_since_prior_order,na.rm=TRUE), up_avg_hod = mean(order_hour_of_day), up_avg_dow = mean(order_dow), up_avg_rank = mean(add_to_cart_order) ) %>% ungroup %>% left_join(users_) %>% # to retrieve u_n_orders mutate(up_reorder_rate = up_n_reordered/(u_n_orders-1)) %>% replace_na(list(up_avg_priors = 0)) %>% # fix up arrange(user_id,product_id)
head(user_products_,20)
Construct Training Data
We shall combined training construct table with the new engineered features to form the training data. Categorical data which are merely names or identification will be removed since they should not contribute to prediction.
After this step, the trianing data is ready for machine learning algorithm of choice.
m3.train.data = users_orders_products_ %>%
filter(user_id %in% train.users) %>%
left_join(user_products_) %>%
left_join(products_) %>%
#left_join(users_) #user_products_ already contain user specific features
full_join(train.construct, by=c('user_id','product_id')) %>%
arrange(user_id, product_id) %>%
select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order'))
glimpse(m3.train.data)
Construct Testing Data
m3.test.data = users_orders_products_ %>%
filter(user_id %in% test.users) %>%
left_join(user_products_) %>%
left_join(products_) %>%
#left_join(users_) #user_products_ already contain user specific features
full_join(test.construct, by=c('user_id','product_id')) %>%
arrange(user_id, product_id) %>%
select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order'))
glimpse(m3.test.data)
Model 3 : Logistic Regression
Model Trainng
rr m3.fit = glm(actual ~ ., family = binomial, data = m3.train.data)
Analysis & Recommendations
Technical Challenges
- Machine speed and memory. The GLM
---
title:  "Instacart Market Basket Analysis"
author: "By Aep Hidayatuloh"
date:   "2019 Jun 7"
output: 
  html_notebook:
    number_sections: yes
    theme: spacelab
    df_print: paged
    toc: yes
    toc_depth: 4
    toc_float: true
---

<style type="text/css">

body{ /* Normal  */
      font-size: 12px;
  }
td {  /* Table  */
  font-size: 12px;
}
h1.title {
  font-size: 38px;
  color: lightblue;
  font-weight: bold;
}
h1 { /* Header 1 */
  font-size: 24px;
  color: DarkBlue;
}
h2 { /* Header 2 */
  font-size: 20px;
  color: DarkBlue;
}
h3 { /* Header 3 */
  font-size: 16px;
#  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
}
h4 { /* Header 4 */
  font-size: 14px;
  color: DarkBlue;
}
code.r{ /* Code block */
    font-size: 12px;
}
pre { /* Code block - determines code spacing between lines */
    font-size: 12px;
}
</style>


```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(echo=TRUE, fig.height=3.5, fig.width=9.2, results='hold', warning=FALSE, fig.show='hold', message=FALSE) 
options(scipen = 99)
```

Copied and modified from <https://rpubs.com/yongks/instacart8>

# Introduction

Instacart is an app for on-demand grocery shopping with same-day delivery service.  Instacart uses a crowdsourced marketplace model, akin to that of Uber or Lyft. 
 
The Instacart shopping process is as follows.  First, an app user places their grocery order through the app.  Then, a locally crowdsourced "shopper" is notified of the order, goes to a nearby store, buys the groceries, and delivers them to the user.   
 
There are three ways that Instacart generates revenue: delivery fees, membership fees, and mark-ups on in-store prices.

## Research Goal & Objective

The main objective of the competition is to predict what will the user will buy in the next order, given all data of prior orders. 

# Market Basket Analysis

Market Basket Analysis (MBA) is a process that looks for relationships among entities and objects that frequently appear together, such as the collection of items in a shopper’s cart. For the purposes of customer centricity, market basket analysis examines collections of items to identify affinities that are relevant within the different contexts of the customer touch points. Some examples include:

* Product placement—Identifying products that may often be purchased together and arranging the placement of those close by to encourage the purchaser to buy both items. That placement can be physical, such as in the arrangement of products on shelves in a brick and mortar location, or virtual, such as in a print catalog or on an e-commerce site.

* Point-of-Sale—Companies may use the affinity grouping of multiple products as an indication that customers may be predisposed to buying certain sets of products at the same time. This enables the presentation of items for cross-selling, or may suggest that customers may be willing to buy more items when certain products are bundled together.

* Customer retention—When customers contact a business to sever a relationship, a company representative may use market basket analysis to determine the right incentives to offer in order to retain the customer’s business.

MBA is one of the key techniques used by the large retailers that uncovers associations between items by looking for combinations of items that occur together frequently in transactions. In other words, it allows the retailers to identify relationships between the items that people buy.

Association Rules is widely used to analyze retail basket or transaction data, is intended to identify strong rules discovered in transaction data using some measures of interestingness, based on the concept of strong rules.

# Terminology of Association Rules

Assume there are 100 transactions on a retail shop.
* 10 out of them bought milk, 8 bought butter and 6 bought both of them (milk & butter). 
* For example, our interesting is how it like if someone bought milk also bought butter:
    + bought milk => bought butter

First, we need to know the terminology that used in MBA.

  *  **Itemset** : set of items that customer bought in all transactions.
  *  **Support** : proportion of transaction that contain an itemsets of interest.
      $$Supp(X \Rightarrow Y) = \frac{frq(| X \cup Y |)}{N}$$
      For example, our interest is product Milk & Butter
      $$
      \begin{aligned}
      Supp(Milk \Rightarrow Butter) &= P(Milk, Butter) \\
      &= \frac{frq(Milk, Butter)}{N}\\
      &= \frac{6}{100}\\
      &= 0.06\\
      \end{aligned}
      $$
      
      The minimum support of the rule is defined as the minimum number of transactions that include both the antecedent and consequent parts in order to qualify to be part of frequent item set. The default minimum support would be 10% of the total number of transactions taken for analysis.
      
  * **Confidence** : conditional probability that if a customer purchases X, they will purchases Y. It determines the operational usefulness of a rule. Transactions with confidence with more than 50% will be selected. Higher the confidence , stronger the rule is.
    $$Conf(X, Y) = \frac{Supp(X,Y)}{P(Y)}$$
    Confidence for Milk & Butter is,
    where $$P(Butter) = \frac{frq(Butter)}{N} = \frac{8}{100} = 0.08$$ 
    Then,
    $$
    \begin{aligned}
    Conf(Milk \Rightarrow Butter) &= \frac{Supp(Milk \Rightarrow Butter)}{P(Butter)}\\
    &= \frac{0.06}{0.08}\\
    &= 0.75
    \end{aligned}
    $$
    
    The minimum confidence of the rule is defined as the minimum number of transaction that has consequent will also have antecedent. The default minimum confidence would be 50%.
  * **Lift** : ratio of support of X occuring together with Y divided by the probability that X and Y occur if they are independet.
  $$
  \begin{aligned}
  Lift(X,Y) &= \frac{Supp(X,Y)}{P(X,Y)},& P(X,Y) = P(X)P(Y)\\
  &= \frac{Supp(X,Y)}{P(X)P(Y)}&\\ 
  &= \frac{Supp(X,Y)}{P(Y)}\frac{1}{P(X)}\\
  &= Conf(X,Y)\frac{1}{P(X)}\\
  &= \frac{Conf(X,Y)}{P(X)}
  \end{aligned}
  $$
  Now $Lift(Milk, Butter) = ?$
  $$P(Milk) = \frac{frq(Milk)}{N} = \frac{10}{100} = 0.01$$
  So,
  $$
  \begin{aligned}
  Lift(Milk, Butter) &= \frac{Conf(Milk, BUtter)}{P(Milk)}\\
  &= \frac{0.75}{0.10}\\
  &= 7.5
  \end{aligned}
  $$

Note: this example is extremely small. In practice, a rule needs a support of hundreds transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Ok, enough for the theory, let's get to the code. 

# Dataset Preparation

## Data Source

Last year, Instacart released a public dataset, **"The Instacart Online Grocery Shopping Dataset 2017"**. The dataset contains over 3 million anonymized grocery orders from more than 200,000 Instacart users. This analysis will make use of this datasets.  

Data source can be downloaded here: https://www.kaggle.com/c/instacart-market-basket-analysis/data

## R Libraries Used

Here are the R libraries used in this analysis. 

```{r, message=FALSE}
library(knitr)      # web widget
library(tidyverse)  # data manipulation
library(data.table) # fast file reading
library(caret)      # rocr analysis
library(ROCR)       # rocr analysis
library(kableExtra) # nice table html formating 
library(gridExtra)  # arranging ggplot in grid
```

## Import Datasets

```{r, results='hide', message=FALSE, warning=FALSE}
# setwd('./data')
aisles      <-  fread('data/aisles.csv',      stringsAsFactors = TRUE)
departments <-  fread('data/departments.csv', stringsAsFactors = TRUE)
products    <-  fread('data/products.csv',    stringsAsFactors = TRUE)
orders      <-  fread('data/orders.csv',      stringsAsFactors = TRUE)
order_products_train  <-  fread('data/order_products__train.csv')
order_products_prior  <-  fread('data/order_products__prior.csv')
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
# load prepared csv for faster processing

# construct
train.construct = fread('datas/train.construct.csv')
test.construct = fread('data/test.construct.csv')

# features
users_orders_products_ = fread('./datasets/users_orders_products_.csv')
users_ = fread('./datasets/users_.csv')
products_= fread('./datasets/products.csv')
user_products_ = fread('./datasets/user_products_.csv')
m3.train.data = fread('./datasets/m3.train.data.csv')
m3.test.data = fread('./datasets/m3.test.data.csv')
```

## Data Dictionary

The dataset for this competition is a relational set of files describing customers' orders over time. They are anonymized and contains a sample of over **3 million grocery orders** from more than **200,000 Instacart users**. For each user, Instacart provided **between 4 and 100** of their orders, with the sequence of products purchased in each order, the **week and hour of day** the order was placed, and a **relative measure of time between orders**.

Total six datasets were imported. Follwing section will explore each datasets in further detail. These datasets were sourced from an existing Kaggle competition.

`orders` (3.4m rows, 206k users):  

* `order_id`: order identifier  
* `user_id`: customer identifier  
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)  
* `order_number`: the order sequence number for this user (1 = first, n = nth)  
* `order_dow`: the day of the week the order was placed on  
* `order_hour_of_day`: the hour of the day the order was placed on  
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)  

`products` (50k rows):  

* `product_id`: product identifier  
* `product_name`: name of the product  
* `aisle_id`: foreign key  
* `department_id`: foreign key  

`aisles` (134 rows):  

* `aisle_id`: aisle identifier  
* `aisle`: the name of the aisle  

`deptartments` (21 rows):  

* `department_id`: department identifier  
* `department`: the name of the department  

`order_products__SET` (30m+ rows):  

* `order_id`: foreign key  
* `product_id`: foreign key  
* `add_to_cart_order`: order in which each product was added to cart  
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise  

where `SET` is one of the four following evaluation sets (`eval_set` in `orders`):  

* `"prior"`: orders prior to that users most recent order (~3.2m orders)  
* `"train"`: training data supplied to participants (~131k orders)  
* `"test"`: test data reserved for machine learning competitions (~75k orders)  

## Understanding Datasets

### Aisles

There are **134 aisles** in this dataset. Here are few sample names of the ailes.  

```{r, results='hold'}
paste(sort(head(aisles$aisle)), collapse=', ')
```

### Departments

There are **21 departments** in this dataset.Names of all deparments are listed below in aphabetically ordered.

```{r, results='hold'}
paste(sort(departments$department), collapse = ', ')
```

### Products

There are **49,688 products** in the catalogue within **134 aisles and 21 departments**.  

Sample products are as below.

```{r, results='hold'}
products %>% 
  head %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Departments And Its Relevant Products

Products dataframe is related to Deparments.  

We shall see sample of 3 products for few deparments.

```{r, results='hold', message=FALSE}
departments %>% 
  left_join(products) %>% 
  select(department, product_name) %>%
  group_by(department) %>%
  sample_n(3) %>%
  summarise(three_examples_product = paste(product_name, collapse = ' / ')) %>% 
  sample_n(5) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Aisles And Its Relevant Products

Products dataframe is also related to aisles. Each aisle relates to multiple products. By joining both aisles and products dataframe, we have an idea what type of prodcuts for each ailes.  

Example below shows 3 samples products of for few aisles.

```{r, results='hold', message=FALSE}
aisles %>% 
  left_join(products) %>%
  select(aisle, product_name) %>%  
  group_by(aisle) %>%
  sample_n(3) %>% 
  summarise(three_examples_product = paste(product_name, collapse = ' || ')) %>% 
  sample_n(5) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Orders

There are over **3 millions** observations in orders dataset. Each row represent an **unique order**. 

#### Train Eval_Set

Let's analyse the construct of one user. For example, **user_id 1** had made **10 prior orders** (order number from 1 to 10), last order is a **train** (eval_set). Note that the first order (order_number 1) does not have value for day_since_prior_order, as it is the first order without prior records.  

This also means **`<user_id, product_id>`** made up the **key** for prediction.  

```{r}
orders %>% 
  filter(user_id == 1) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
#### Test Eval_Set

Let's analyse another construct of orders. **User_id 3** had made **12 orders** before the final order labeled as **test** (eval_set) order. From the data we know that order_number is being recycled for each user.  

**Instacart did not provide us the basket content for test order**. This is in fact the **target for prediction**. 

```{r}
orders %>% 
  filter(user_id == 3) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Order_Product

Each order contain multiple products purchased by user. Instacart had cleanly categorized the orders into 'train' and 'prior' in **SINGLE** order dataset.

However, the detail of each orders are splitted into two datsets:  
- **`order_product_train`**: contain only detail product items of last order  
- **`order_product_prior`**: contain detail product items of all prior orders  

### Order_Product_Train

**order_product_train/prior** dataframe tells us which products were purchased at each order; for both **train** and **prior** order.  

For example, we know **user_id 1** in the **LAST ORDER** (order_id 1187899) purchased **10 unique products** by **quering order_product_train** with the relevant order_id.  

```{r}
order_products_train %>% 
  filter (order_id == 1187899) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Order_Product_Train_Prior

Similary, detail items for a **PRIOR ORDER** (example order_id: 2550362) can be retireved by quering different dataset order_product_prior.

```{r}
order_products_prior %>% 
  filter(order_id == 2550362) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```

### Users

Ther is no dedicated dataframe for users. However, we can derive number of unique users from **order** dataframe. By grouping the user_id and eval_set column, we found that there are **75,000 test** users, **131,209 train** users.  

```{r}
orders %>% 
  filter(eval_set %in% c('train', 'test') ) %>%
  count(eval_set) %>%
  mutate(percentage = n/sum(n)) %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
```


# Exploratory Data Analysis

In this section, we shall try to understand the buying behaviour by asking some interesting quesitons.  

- What usually does people buy, and which one they usually reorder
- When do they buy (day and time)? Is there a buying trend and does it influence what they buy ?  

To reduce our coding steps, we construct a reusable dataframe combining all details from orders and its products. This dataframe will contain rows for **prior orders and products only** (excluding last order which is labeled as `train`). 

```{r}
users_orders_products_ <- orders %>%
  inner_join(order_products_prior) %>%  # inner_join with prior table will filter out train orders
  left_join(products) %>%
  left_join(aisles) %>%
  left_join(departments) %>%
  arrange(user_id, order_number) %>%
  select(user_id, order_id, order_number, order_dow, order_hour_of_day, days_since_prior_order, product_id, product_name, reordered, add_to_cart_order, department_id, aisle_id, department, aisle)
```


## Orders

### How Many Orders ?

Most users made few orders. The number of orders a users made decrease significally along the order numbers. Maximum orders any users had made is 99.

```{r fig.width=9.3, fig.height=2.5}
tmp <- users_orders_products_ %>% 
  group_by(user_id) %>% 
  summarize(n_orders = max(order_number))

tmp %>% 
  ggplot(aes(x = as.factor(n_orders))) + 
  geom_bar() +
  labs(y = 'Count of Users',
       x = 'Number of Orders Made By Users') +
  theme(
      axis.text.x  = element_text (size = 6.0, angle = (90), hjust = 1, vjust = 0.5)
    )
```
### How Soon Until Next Order ?

It is very obvious that most users made their orders **weekly (every 7 days) and monthly (every 30 days)**. See the peak of day 7 and day 30 in the chart below. 

```{r fig.width=9.3, fig.height=2.5}
tmp  <-  users_orders_products_ %>% 
  filter(order_number > 1) %>% # days_since_prior is NA for first order, need to filter out
  group_by(order_id) %>% 
    summarize(n_orders = max(days_since_prior_order))

tmp %>% 
  ggplot(aes(x = as.factor(n_orders))) + 
  geom_bar() + 
  labs(y = 'Count of Orders', 
       x = 'Days Since Prior For Each Order')
```

## Orders_Products

### Most Popular Products Sold

We know that **banana** are the most popular products. The number of orders varies greatly for different products. Illustration below uses shows sample of only **30 top products**.  Notice however the varience is not obvious outside top 10 products.

```{r, fig.width=9.5, fig.height=3.5}
tmp <- order_products_train %>%
  left_join(products) %>%
  group_by(product_name) %>%
  summarize(count = n()) %>%
  top_n(n = 30, wt = count) %>%  
  mutate(percentage = count/sum(count))

p1 <- tmp %>% 
  ggplot(aes(x = reorder(product_name, count), y = percentage)) +  
  geom_col() + 
  labs(title = 'Products Top 30',
       y = 'Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) 

p2 <- tmp %>% 
  ggplot(aes(x = '', y = percentage)) + 
  labs(title = 'Products Top 30',
       y = 'percentage.of.orders',
       x = 'Products') + 
  geom_boxplot()

grid.arrange(p1, p2, ncol = 2)
```
```{r echo=FALSE}
rm(list = c('tmp','p1','p2'))
```

### Most Popular Department Sold

Certain departmens are clearly more popular, like **produce and dairy eggs**. Both deparments combined contributed to **more than 40%** of total orders.

```{r, fig.width=9.5, fig.height=3.5}
tmp <- users_orders_products_ %>%
  group_by(department) %>%
  summarize(count = n()) %>%
  mutate(percentage = count/sum(count))

p1 <- tmp %>% 
  ggplot (aes(x = reorder(department, count), y = percentage)) +  
  geom_col() + 
  labs(title = 'Departments', y = 'Percentage of Orders') +
  theme (
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    axis.title.x = element_blank()) 

p2 <- tmp %>% 
  ggplot(aes(x = '', y = percentage )) + 
  labs(title = 'Departments',
       y = 'percentage.of.orders',
       x = 'Departments') + 
  geom_boxplot()

grid.arrange(p1, p2, ncol = 2)
```

```{r echo=FALSE}
rm(list = c('tmp', 'p1', 'p2'))
```

### Most Popular Aisles Sold

We looked into the buying trend of product by aisles and notice that certain aisle like **vegetables and fruits** contributes to **almost 30%** of total orders. Chart below shows top 30 aisles.

```{r, fig.width=9.5, fig.height=3.5}
tmp <- users_orders_products_ %>%
  group_by(aisle) %>%
  summarize(count = n()) %>%
  top_n(n = 30, wt = count) %>%  
  mutate(percentage = count/sum(count))

p1 <- tmp %>% 
  ggplot(aes(x = reorder(aisle, count), y = percentage)) +  
  geom_col() + 
  labs(title = 'Aisles Top 30',
       x = 'Aisles',
       y = 'Percentage of Orders') +
  theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.title.x = element_blank())

p2 <- tmp %>%
  ggplot(aes(x = '', y = percentage )) + 
  geom_boxplot() +
  labs(title = 'Aisles Top 30',
       y = 'percentage.of.orders',
       x = 'Aisles')

grid.arrange(p1, p2, ncol = 2)
```

```{r echo=FALSE}
rm(list = c('tmp', 'p1', 'p2'))
```

### Products Ordered Day Pattern

We can see that both **Day 0 and Day 1** stands out to be the most busy shopping day for instacart. This means that day of order made may influence the basket size.

```{r, fig.width=9.3, fig.height=3.0}
users_orders_products_ %>%
  group_by(order_dow) %>%
  summarize(count = n()) %>%
  mutate(percentage = count/sum(count)) %>%
  ggplot(aes(x = as.factor(order_dow), y = percentage, fill = as.factor(order_dow))) + 
  geom_col() + 
  labs(title = 'Daily Orders', y = 'Percentage of Orders') 
```

When we zoom into daily orders, we notice that top ten products contributes between 7% to 8% of daily orders. It is interesting to see that Limes are part of top ten for Day 0 and Day 6, but not other days. Whereas Organic Whole Milk doesn't make it to top ten for Day 0. Organic Respberries does not make it to top 10 of Day 6. This means that there is a chance of predictability based on the day order is made. 

```{r, fig.width=9.3, fig.height=4.5}
users_orders_products_%>%
  group_by(order_dow, product_name) %>%
  summarize(n = n()) %>%
  mutate(percentage = n/sum(n)) %>%
  top_n(10, wt = n) %>%
  ggplot(aes(x = as.factor(order_dow), y = percentage, fill = product_name)) + 
  geom_col() + 
  labs(y = 'Proprtion of Orders In A Day',
       title = 'Daily Top 10 Products Ordered') +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

### Products Ordered Hour Pattern

Morning to afternoon are the peak shopping hours for instacart customers. The hour order made influences basket size.

```{r, fig.width=9.3, fig.height=3.0}
users_orders_products_ %>%
  group_by(order_hour_of_day) %>%
  summarize(count = n()) %>%
  mutate(percentage = count/sum(count)) %>%
  ggplot(aes(x = as.factor(order_hour_of_day), y = percentage)) + 
  geom_col() + 
  labs(y = 'Percentage of Orders',
       title = 'Hourly Orders')
```

In the grocery, there are close to 50,000 products. When we zoom into hourly purchases, we noticed that top 10 products managed to score betwen 6% to 8% of hourly sales. Every hour has slightly diffrent combination of top 10 products (combination out of 12 products). That means certain products are predictable for ordering irregardless of the hour of order.  

It is interesting to know that, similar to daily top 10 products, the Organic Wholemilk and Limes is missing as top 10 from some hours.  

```{r, fig.width=9.3, fig.height=5}
users_orders_products_ %>%
  group_by(order_hour_of_day, product_name) %>%
  summarize(n = n()) %>%
  mutate(percentage = n/sum(n)) %>%
  top_n(10, wt = n) %>%
  ggplot (aes(x = as.factor(order_hour_of_day), y = percentage, fill = product_name)) + 
  geom_col() + 
  labs(y = 'Proprtion of Orders In A Hour',
       title = 'Hourly Top 10 Products Ordered') +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```


## Basket Analysis

### Basket Size Distribution

Number of items in all orders range from 1 to 145. The histogram below is highly **skewed towards small basket size**. Majority of users purchased 5 items in their orders.

```{r fig.width=9.3, fig.height=3}
tmp <- users_orders_products_ %>%
  group_by(order_id)  %>%
  summarize(basket_size = n(), 
            reordered_items = sum(reordered)) %>%
  group_by(basket_size) %>%
  summarize(n = n(), avg_reordered_items = mean(reordered_items)) %>%
  arrange(basket_size)
  
tmp %>% 
  ggplot(aes(x = as.factor(basket_size))) +
  geom_col(aes(y = n)) +
  labs(y = 'Order Count',
       x = 'Number of Items in Basket',
       title = 'Basket Size Distribution') +
  theme(axis.text.x = element_text(size = 6.0, angle = 90, hjust = 1, vjust = 0.5))
```


## Re-Ordered Analysis

Analyzing the re-ordered products is the most important part of the EDA. This is becasue insights from this analysis can help to develop intuition for furhter feature engineering that will make the prediction more meaningful.

### Average Re-ordered Items In Basket Distribution

```{r fig.width=9.3, fig.height=3}
tmp %>% 
  ggplot(aes(x = as.factor(basket_size))) + 
  geom_point(aes(y = avg_reordered_items), color = 'red') +
  labs(y = 'Avg Number of Re-Ordered Items',
       x = 'Number of Items in Basket',
       title = 'Reorder Rate by Basket Size') +
  theme(axis.text.x = element_text(size = 6.0, angle = 90, hjust = 1, vjust = 0.5)) +
  geom_abline(intercept = 0, slope = 1, color = 'blue')
```

### Product Reorder Ratio

One of the tricker things to predict in the Instacart dataset is the incidence of orders without reordered products. Plotting the proportion of this incidence across the training sample (a snapshot of 131K+ users) provides some inspiration.

### How Many Products Were Reordered

Among all product purchases, 41% of products are reordered.  The reordered rate is particularly high on top 10 products. As shown in chart below, top ten popular products has reordered rate is around 70% to 85%; higher than the overall ratio of 41%. 

```{r, fig.width=9.3, fig.height=3, message=FALSE}
## overall all products reordered rate
tmp1 <- users_orders_products_ %>%
  filter(order_number > 1) %>% # exclude first order, which will never have reordered
  count(reordered) %>%
  mutate(ratio = n/sum(n))
  
p1 <- tmp1 %>% 
  ggplot(aes(x = '', y = ratio, fill = as.factor(reordered))) + 
  geom_col(width = 1) + 
  labs(y = 'Product Reordered Ratio') +
  coord_polar(theta = 'y', start = 0) + 
  scale_fill_brewer(palette = "Dark2") +
  theme(axis.title.y = element_blank())

## top10 products and its reordered rate
tmp2 <- users_orders_products_ %>%
  count(product_id) %>% # filter only top 10 products for reorder analysis
  top_n(n = 10) %>%
  left_join(users_orders_products_) %>%  # now find out their reordered rate
  group_by(product_id, product_name) %>%
  summarize(reordered_rate = sum(reordered, na.rm = TRUE)/n()) %>% 
  select(product_id, product_name, reordered_rate) %>%
  arrange(desc(reordered_rate))

p2 <- tmp2 %>% 
  ggplot(aes(x = reorder(product_name, reordered_rate), y = reordered_rate)) + 
  labs(title = 'Top 10 Products Sold and Their Reordering Rate') +
  geom_col() + 
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) + 
  coord_flip()

grid.arrange(p1, p2, ncol = 2)
```

```{r echo=FALSE}
rm(list = c('tmp1', 'tmp2', 'p1', 'p2'))
```

### Reordering vs Days Since Prior Order

We understand from order analysis earlier that most users place their orders every 7 and 30 days. However, from reorder ration perspective, day 7 and day 30 has high contrast whereby day 7 orders has high reorder ratio and day 30 has lowest reordering ratio.

```{r fig.width=9.3, fig.height=2.5}
tmp <- users_orders_products_ %>%
  filter(order_number > 1) %>%
  group_by(days_since_prior_order, order_id) %>%
  summarize(contain_reordered = max(reordered)) %>%
  summarize(reordered_orders = sum(contain_reordered),
            n = n() + 1) %>%
  mutate(non_reorder_ratio = 1 - (reordered_orders/n))

tmp %>% 
  ggplot(aes(x = days_since_prior_order, y = non_reorder_ratio)) + 
  geom_point() + 
  geom_line() +
  labs(title = 'Orders NOT Containing Reordered Products over Days since Prior Order')
```

### Reordering vs Hour Of Order


```{r fig.width=9.3, fig.height=2.5}
tmp <- users_orders_products_ %>%
  filter(order_number > 1) %>%
  group_by(order_hour_of_day, order_id) %>%
  summarize(contain_reordered = max(reordered)) %>%
  summarize(reordered_orders = sum(contain_reordered),
            n = n()) %>%
  mutate(non_reorder_ratio = 1 - (reordered_orders/n))

tmp %>% 
  ggplot(aes(x = order_hour_of_day, y = non_reorder_ratio)) + 
  geom_point() + 
  geom_line() +
  labs(title = 'Non Reorder Ratio over Time of Order Placed')
```

### Reordering vs Day Of Order


```{r fig.width=9.3, fig.height=2.5}
tmp <- users_orders_products_ %>%
  filter(order_number > 1) %>%
  group_by(order_dow, order_id) %>%
  summarize(contain_reordered = max(reordered)) %>%
  summarize(reordered_orders = sum(contain_reordered),
            n = n() + 1) %>%
  mutate(non_reorder_ratio = 1 - (reordered_orders/n))

tmp %>% 
  ggplot(aes(x = order_dow, y = non_reorder_ratio)) + 
  geom_point() + 
  geom_line() +
  labs(title = 'Non Reorder Ratio over Day of Purchase')
```

### Reordering vs Day Of Order

Intuitively, we can think of the more regular a buyer is,  the person tend to repeat ordering the same products. 

```{r fig.width=9.3, fig.height=2.5}
tmp <- users_orders_products_ %>%
  filter(order_number > 1) %>%
  group_by(user_id, order_id) %>%
  summarize(contain_reordered = max(reordered)) %>%
  summarize(reordered_orders = sum(contain_reordered),
            total_orders_per_user = n()) %>%
  group_by(total_orders_per_user) %>%
  summarize(reorders = sum(reordered_orders),
            total = sum(total_orders_per_user)) %>%
  mutate(non_ratio = 1 - (reorders/total))
  
tmp %>% 
  ggplot(aes(x = total_orders_per_user, y = non_ratio)) + 
  geom_point() + 
  geom_line() +
  labs(title = 'Non Reorder Ratio over Day of Purchase')
```


# Predictive Analysis

## Type of Prediction

The objective is to predict what product will the customer purchase in the next basket. It require probability estimation of each product that bad been purchased before, that to be purchased before.**This is a classification problem**, as well as **a regression of probability** of repurchases. 

For this analysis, we shall use two Naive models (handcrafted baseline) and one Machine Learning Logistic regression will be used for Machine Learning approach for its speed and simplicity; to demonstrate the feasibility to producing a better outcome then baseline.

### Train/Test Dataset Splitting

Instacart did not provide us **test order detail**, therefore we shall use the **train** users for both trainng and testing. We achieve this by splitting the **train** users and its related orders and products into train dataset and train dataset, at 70%/30% split (by number of users). That means our train/test dataset will contain approximately 91846 / 39,363 users.

For this analysis, we will not be submitting to Kaggle. 

```{r results='hold'}
# update this variable for changing split ratio
train_proportion <- 0.7

# build list of all users ID
tmp <- orders %>% 
  filter(eval_set == 'train') %>% 
  distinct(user_id)

# 70/30 split
set.seed(12345)
train.rows <- sample(1:nrow(tmp), train_proportion * nrow(tmp))
train.users <- tmp[train.rows, ]  # select training rows, list of train users
test.users <- tmp[-train.rows, ] # select testing rows, list of test users

cat("Total Rows in Training Users: ", length(train.users),
    "\nTotal Rows in Testing Users: ", length(test.users),
    "\nTrain/Test Split % : ", 100*length(train.users)/(length(test.users)+length(train.users)),
    " / ", 100*length(test.users)/(length(test.users)+length(train.users)))
```

```{r echo=FALSE}
rm(list = c('tmp', 'train.rows')) # memory cleanup
```

### Training Data Construct

The data frame used for training should contain the below columns and features:  

**`key`**  

- This is unique pair of user_id and product_id from orders  
- The keys should be constructed from all `user_id-product_id` pair that includes all prior and test/train rows  

**`actual**  

- This is the response variable with value of 1 or 0 for each unique key  
- The value is 1 when the product is purchased in the last order (train or test set of orders)  
- The value is 0 when the product is not purchased in the train or test set, but was bought in prior set

**`other features`**  

From exploratory discovery, features that could contribute to the prediction should be populated into the construct. Feature engineering will happen in the later stage.

Let's proceed to create the basic **training construct**. This won't be used for prediction until feature engineering is completed in later stage.

```{r}
# list of products in the final order, this make up the label
construct1 <- orders %>%    
  filter(user_id %in% train.users & eval_set == 'train') %>% 
  left_join(order_products_train) %>%
  distinct(user_id, product_id) %>%
  mutate(actual = 1)  #training label

# list of products each users had bought before in prior orders
construct2 <- orders %>%   
  filter(user_id %in% train.users & eval_set == 'prior') %>% 
  left_join(order_products_prior) %>%
  distinct(user_id, product_id)

# Training Construct
train.construct <- left_join(construct2, construct1) %>%
  mutate(key = paste(user_id, product_id, sep = "-")) %>%  # key
  select(key, user_id, product_id, actual) %>%
  arrange(user_id, product_id) %>%
  replace_na(list(actual = 0)) # proudcts not in last order, but exist in prior order
#  drop_na # remove proudcts not in historical but appear in last order

rm(list = c('construct1', 'construct2'))
head(train.construct, 50)
```

### Testing Data Construct

Similar approach to training data construct, here we frame the testing data for evaluate our model built with training data.

```{r eval=FALSE}
# list of products in the final order, this make up the label
construct1 <- orders %>%    
  filter(user_id %in% test.users & eval_set == 'train') %>% 
  left_join(order_products_train) %>%
  distinct(user_id, product_id) %>%
  mutate(actual = 1)  #training label

# list of products each users had bought before in prior orders
construct2 <- orders %>%   
  filter(user_id %in% test.users & eval_set == 'prior') %>% 
  left_join(order_products_prior) %>%
  distinct(user_id, product_id)

# Training Construct
test.construct <- construct2 %>% 
  left_join(construct1) %>%
  mutate(key = paste(user_id, product_id, sep = "-")) %>%  # key
  select(key, user_id, product_id, actual) %>%
  arrange(user_id, product_id) %>%
  replace_na(list(actual = 0)) # proudcts not in last order, but exist in prior order
#  drop_na # remove proudcts not in historical but appear in last order

rm(list = c('construct1', 'construct2'))
head(test.construct, 50)
```

## Model Evaluation & Optimization

Instacart has close to 50k products in their catalogue. As the maximum number of items ordered by a user is just a fraction of the 50k available product. This means by simply predicting nothing is purchased in the next basket, we would yeild **close to 100% accuracy**.  

Due to the **highly imbalance** dataset, Instacart require **F1 Score** as the competition scoring, instead of accuracy.  

To evaluate the performance of the model, we had created a custom function to build a **confusion matrix and derive other binary classification metrics**. 

```{r}
## Custom Function For Binary Class Performance Evaluation
binclass_eval = function (actual, predict) {
  cm = table(as.integer(actual), as.integer(predict), dnn=c('Actual','Predicted'))
  ac = (cm['1','1']+cm['0','0'])/(cm['0','1'] + cm['1','0'] + cm['1','1'] + cm['0','0'])
  pr = cm['1','1']/(cm['0','1'] + cm['1','1'])
  rc = cm['1','1']/(cm['1','0'] + cm['1','1'])
  fs = 2* pr*rc/(pr+rc)
  list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac)
}
```

If the prediction is based on probability, we shall build a function to discover **cutoff that optimize various performance metrics**. 

```{r}
### Cutoff Threshold Optimization
optimize_cutoff = function (actual, probability) {
  rocr.pred = prediction(predictions = probability, labels = actual)
  rocr.metrics = data.frame(
      cutoff   = rocr.pred@cutoffs[[1]],
      accuracy = (rocr.pred@tp[[1]] + rocr.pred@tn[[1]]) / 
                   (rocr.pred@tp[[1]] + rocr.pred@tn[[1]] + rocr.pred@fp[[1]] + rocr.pred@fn[[1]]),
      tpr = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fn[[1]]),
      fpr = rocr.pred@fp[[1]] / (rocr.pred@fp[[1]] + rocr.pred@tn[[1]]),
      ppv = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fp[[1]])
  )
  rocr.metrics$fscore = 2 * (rocr.metrics$tpr * rocr.metrics$ppv) / (rocr.metrics$tpr + rocr.metrics$ppv)
  rocr.metrics$tpr_fpr = rocr.metrics$tpr / rocr.metrics$fpr
  
  ## Discovery the optimal threshold for various metrics
  rocr.best = rbind(
    best.accuracy = c(max = max(rocr.metrics$accuracy, na.rm = TRUE), cutoff=rocr.metrics$cutoff[which.max(rocr.metrics$accuracy)]),
    best.ppv = c(max = max(rocr.metrics$ppv, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$ppv)]),
    best.recall = c(max = max(rocr.metrics$tpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr)]),
    best.fscore = c(max = max(rocr.metrics$fscore, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$fscore)]),
    best.tpr_fpr = c(max = max(rocr.metrics$tpr_fpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr_fpr)])
  )
  
  list(metrics = rocr.metrics, best = rocr.best)
}
```


## Model 1 : Naive Prediction

### Build The Model

With intension to make this a baseline model, We simply predict the basket based on user last order. 

```{r}
m1.train.data = users_orders_products_ %>%
  filter(user_id %in% train.users) %>%
  group_by(user_id) %>%
  top_n(n=1, wt=order_number)  %>% #last order has the higher order_number
  select(user_id, product_id) %>% 
  mutate (predicted=1)  %>%        #predict based on last ordered, therefore 1
  full_join(train.construct) %>%  # join with train construct for items not predicted but in final order
  select(user_id, product_id, actual, predicted) %>%
  replace_na(list(predicted = 0))

head(m1.train.data,25)
```

### Confusion Matrix

```{r model1-eval}
m1.eval = binclass_eval(m1.train.data$actual, m1.train.data$predicted)
m1.eval$cm
```

### Model Performance

The result shows only **0.3460833 F1 Score**.

```{r}
cat("Accuracy:  ", m1.eval$accuracy,
    "\nPrecision: ", m1.eval$precision,
    "\nRecall:    ", m1.eval$recall,
    "\nFScore:    ", m1.eval$fscore)
```

```{r echo=FALSE}
rm(list=c('m1.train.data'))
```

## Model 2 : Smarter Naive Prediction (Baseline)

In this model, we predict products in the basket by estimating their frequency of repurchased. This way we get a ratio to indicate probability of re-purchases. We use ROCR package to estimate the best cutoff point (at which above this cutoff we shall predict for re-order) that give us the **optimum F1 score**. 

### Build The Model

```{r model2-building, message=FALSE}
## Build Model
m2.train.data = users_orders_products_ %>%
  filter(user_id %in% train.users) %>%
  group_by(user_id) %>%
    mutate(total_orders = max(order_number)) %>%  # total number of orders made previously
  ungroup %>% 
  select(user_id, order_id, product_id, total_orders) %>%
  group_by(user_id, product_id) %>%
    summarize(predicted=n()/max(total_orders)) %>%
  select(user_id, product_id, predicted) %>%
  full_join(train.construct) %>%  # join with train construct for items not predicted but in final order
  select(user_id, product_id, actual, predicted) %>%
  replace_na(list(predicted = 0))

head(m2.train.data,20)
```

### Optimize Cutoff

We see that in order to maximize **F1 Score**, we need to set the cutoff threshold to 0.3368, which is the next step.

```{r model2-optimize}
### Threshold Optimization
m2.rocr = optimize_cutoff(actual = m2.train.data$actual, probability = m2.train.data$predicted)
kable(m2.rocr$best) %>% kable_styling(bootstrap_options = c("striped"))
```

### Confusion Matrix

Let's set the **cutoff to 0.3367347** as discovered in previous step.

```{r model2-eval}
m2.eval = binclass_eval(m2.train.data$actual, m2.train.data$predicted>0.3367347)
m2.eval$cm
```

### Model Performance

We are getting slightly **better F1 Score (0.3753544)** compare to previous naive model. We shall use this as the **BASELINE**.

```{r}
cat("Accuracy:  ", m2.eval$accuracy,
    "\nPrecision: ", m2.eval$precision,
    "\nRecall:    ", m2.eval$recall,
    "\nFScore:    ", m2.eval$fscore)
```

```{r echo=FALSE}
rm(list=c('m2.train.data','m2.rocr'))
```

## Machine Learning Framing

We construct all the products that users had purchased in the last 3 orders, then use machine learning classification to predict will each of the product be purchased again. We shall use **decision tree and logistic regression** for this prediction.

### Feature Engineering

#### Order Features

These are original features provided by Instacart. Although there are no other features engineered specifically to describe Order, thse features are being used to generate other features in the following sections.

**`orders`**  
- `order_dow`  
- `order_hour_of_day`  
- `days_since_prior_order`  
- `reordered`  

#### User Features

We create five features which is unique to each individual user. These are the features that desribe the user.

**`users`**  
- `u_n_orders`: Number of Orders Per User  
- `u_avg_priors`: Average waiting days between orders per User  
- `u_avg_hod`: Average Order Placing Hour Per User  
- `u_avg_dow`: Average Order Placing Day Per User  
- `u_avg_order_size`: Average Size of Basket (items in order) Per User  

```{r eval=FALSE, message=FALSE}
#### user features
users_ = users_orders_products_ %>%
  group_by(user_id,order_id) %>%
    mutate(u_o_size = ifelse(row_number()==1, max(add_to_cart_order),0) ) %>%
  group_by(user_id) %>%
    summarize(
      u_n_orders = max(order_number),
      u_avg_priors = mean(days_since_prior_order,na.rm=TRUE),
      u_avg_hod = mean(order_hour_of_day),
      u_avg_dow = mean(order_dow),
      u_avg_order_size = sum(u_o_size)/max(order_number)
    ) %>% 
  arrange(user_id)

head(users_)
```

#### Product Features

We create two product specific features.

**`products`**  

- `avg_product_order_dow`: Average of product order_dow
- `avg_product_order_hod`: Average of product order_hour_of_day  

```{r eval=FALSE, message=FALSE}
products_ = users_orders_products_ %>%
  group_by(product_id) %>%
  summarize( 
    p_avg_dow = mean(order_dow),
    p_avg_hod = mean(order_hour_of_day)
  ) %>% arrange(product_id)

head(products_)
```

#### User-Product Features

We shall introduce product related features that are user-product specifc

- `up_n_reordered` : how many times a user reorderedthis product  
- `up_avg_priors` : Average number of days in between before a user purchase this product  
- `up_avg_hod` : Average hour a user purchase this product  
- `up_avg_dow` : Average day of week a user purchase this product  
- `up_avg_rank` : Average add to cart number a user select this product  

```{r}
### user_products features
user_products_ = users_orders_products_ %>%
  group_by(user_id, product_id) %>%
    summarize(
      up_n_reordered = n()-1, # minus off first order, which is not reorder
      up_avg_priors = mean(days_since_prior_order,na.rm=TRUE),
      up_avg_hod = mean(order_hour_of_day),
      up_avg_dow = mean(order_dow),
      up_avg_rank = mean(add_to_cart_order)
    ) %>%
  ungroup %>%
  left_join(users_) %>% # to retrieve u_n_orders
  mutate(up_reorder_rate = up_n_reordered/(u_n_orders-1)) %>%
  replace_na(list(up_avg_priors = 0)) %>% # fix up 
  arrange(user_id,product_id)

head(user_products_,20)
```

### Construct Training Data

We shall combined training construct table with the new engineered features to form the training data. Categorical data which are merely names or identification will be removed since they should not contribute to prediction.

After this step, the trianing data is ready for machine learning algorithm of choice.

```{r eval=FALSE}
m3.train.data = users_orders_products_ %>%
  filter(user_id %in% train.users) %>%
  left_join(user_products_) %>% 
  left_join(products_) %>%
  #left_join(users_)  #user_products_ already contain user specific features
  full_join(train.construct, by=c('user_id','product_id')) %>%
  arrange(user_id, product_id) %>%
  select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order')) 

glimpse(m3.train.data)
```

### Construct Testing Data

```{r eval=FALSE}
m3.test.data = users_orders_products_ %>%
  filter(user_id %in% test.users) %>%
  left_join(user_products_) %>% 
  left_join(products_) %>%
  #left_join(users_)  #user_products_ already contain user specific features
  full_join(test.construct, by=c('user_id','product_id')) %>%
  arrange(user_id, product_id) %>%
  select(-c('key','user_id','order_id', 'product_id', 'product_name', 'department_id', 'aisle_id', 'department','aisle', 'days_since_prior_order')) 

glimpse(m3.test.data)
```

```{r echo=FALSE}
# cleanup to free memory
rm(list=c('users_','products','products_','order_products_train','order_products_prior','user_products_','users_orders_products_'))
```


## Model 3 : Logistic Regression

### Model Trainng

```{r model3-building}
m3.fit = glm(actual ~ ., family = binomial, data = m3.train.data)
```

### Training Data Performance

#### Prediction

```{r}
m3.predict = predict(m3.fit, type = 'response', newdata = m3.train.data)
```

#### Optimize Cutoff

```{r model3-optimize}
### Threshold Optimization
m3.rocr = optimize_cutoff(actual = m3.train.data$actual, probability = m3.predict)
kable(m3.rocr$best) %>% kable_styling(bootstrap_options = c("striped"))
```

#### Confusion Matrix

```{r model3-eval}
m3.eval = binclass_eval(m3.train.data$actual, m3.predict>0.2233115)
m3.eval$cm
```

#### Model Evaluation

Logistic regression produce F1 Score of 0.5388937 with training data, a much better compared to Model 1 and Model 2. We shall proceed test the model on unknown data, the test data.

```{r}
cat("Accuracy:  ",   m3.eval$accuracy,
    "\nPrecision: ", m3.eval$precision,
    "\nRecall:    ", m3.eval$recall,
    "\nFScore:    ", m3.eval$fscore)
```

```{r, eval=FALSE}
rocr.pred = prediction(m3.predict, m3.train.data$actual)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)
plot(rocr.perf,
    lwd = 3, colorize = TRUE,
    print.cutoffs.at = seq(0, 1, by = 0.1),
    text.adj = c(-0.2, 1.7),
    main = 'ROC Curve')
mtext(paste('auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)
```

```{r model3-cleanup, echo=FALSE}
rm(list=c('m3.predict', 'm3.rocr'))
```

### Test Data Performance

#### Prediction

```{r}
m3.predict.test = predict(m3.fit, type = 'response', newdata = m3.test.data)
```

#### Optimize Cutoff

```{r model3-optimize2}
### Threshold Optimization
m3.rocr.test = optimize_cutoff(actual = m3.test.data$actual, probability = m3.predict.test)
kable(m3.rocr.test$best) %>% kable_styling(bootstrap_options = c("striped"))
```

#### Confusion Matrix

```{r model3-eval2}
m3.eval.test = binclass_eval(m3.test.data$actual, m3.predict.test>0.2233115)
m3.eval.test$cm
```

#### Model Evaluation

Logistic regression produce F1 Score of 0.5388937 with training data, a much better compared to Model 1 and Model 2. We shall proceed test the model on unknown data, the test data.

We acheived F1 Score of **0.5405588**, slightly higher than training data.

```{r}
cat("Accuracy:  ",   m3.eval.test$accuracy,
    "\nPrecision: ", m3.eval.test$precision,
    "\nRecall:    ", m3.eval.test$recall,
    "\nFScore:    ", m3.eval.test$fscore)
```

```{r}
rm(list=c('m3.fit','m3.predict', 'm3.rocr'))
```


#### ROC

```{r}
rocr.pred = prediction(predictions = m3.predict.test, labels = m3.test.data$actual)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)
rocr.auc
```

```{r}
plot(rocr.perf,
     lwd = 3, colorize = TRUE,
     print.cutoffs.at = seq(0, 1, by = 0.1),
     text.adj = c(-0.2, 1.7),
     main = 'ROC Curve')
mtext(paste('auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)
```

# Analysis & Recommendations

Technical Challenges

- Machine speed and memory. The GLM
