Here are the R libraries used in this analysis.
library(knitr) # kable tables and report generation
library(tidyverse) # data manipulation
library(data.table) # fast file reading
library(treemap) # treemap visualization
library(caret) # machine learning utilities
library(ROCR) # ROC / cutoff analysis
library(kableExtra) # nice HTML table formatting
library(gridExtra) # arranging ggplots in a grid
#library(corrgram) # correlation graphics
setwd('./datasets')
aisles = fread('aisles.csv', stringsAsFactors = TRUE)
departments = fread('departments.csv', stringsAsFactors = TRUE)
products = fread('products.csv', stringsAsFactors = TRUE)
orders = fread('orders.csv', stringsAsFactors = TRUE)
order_products_train = fread('order_products__train.csv')
order_products_prior = fread('order_products__prior.csv')
The dataset for this competition is a relational set of files describing customers’ orders over time. It is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, Instacart provided between 4 and 100 of their orders, with the sequence of products purchased in each order, the week and hour of day the order was placed, and a relative measure of time between orders.
In total, six datasets were imported. The following sections explore each dataset in further detail. These datasets were sourced from an existing Kaggle competition (https://www.kaggle.com/c/instacart-market-basket-analysis/data)
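Before exploring further, here is a quick sanity check on the imported tables (a small sketch added here; row counts should match the data description below).
# Row counts of the six imported tables.
data.frame(
  dataset = c('aisles', 'departments', 'products', 'orders',
              'order_products_train', 'order_products_prior'),
  rows = c(nrow(aisles), nrow(departments), nrow(products), nrow(orders),
           nrow(order_products_train), nrow(order_products_prior))
) %>% kable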
orders (3.4m rows, 206k users):
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior_order: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key

aisles (134 rows):
- aisle_id: aisle identifier
- aisle: the name of the aisle

departments (21 rows):
- department_id: department identifier
- department: the name of the department

order_products__SET (30m+ rows):
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):
- "prior": orders prior to that user's most recent order (~3.2m orders)
- "train": training data supplied to participants (~131k orders)
- "test": test data reserved for machine learning competitions (~75k orders)

There are 134 aisles in this dataset. Here are a few sample aisle names.
paste(sort(head(aisles$aisle)), collapse=', ')
[1] "energy granola bars, instant foods, marinades meat preparation, other, prepared soups salads, specialty cheeses"
There are 21 departments in this dataset. The names of all departments are listed below in alphabetical order.
paste(sort(departments$department), collapse = ', ')
[1] "alcohol, babies, bakery, beverages, breakfast, bulk, canned goods, dairy eggs, deli, dry goods pasta, frozen, household, international, meat seafood, missing, other, pantry, personal care, pets, produce, snacks"
There are 49,688 products in the catalogue, spread across 134 aisles and 21 departments. Sample products are shown below.
products %>% head %>% kable
product_id | product_name | aisle_id | department_id |
---|---|---|---|
1 | Chocolate Sandwich Cookies | 61 | 19 |
2 | All-Seasons Salt | 104 | 13 |
3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
4 | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
5 | Green Chile Anytime Sauce | 5 | 13 |
6 | Dry Nose Oil | 11 | 11 |
The products dataframe is related to departments. Below is a glimpse of three products from each department, giving an idea of how products are organized into departments. Only a sample of five departments is shown.
left_join(departments, products) %>% select(department, product_name ) %>%
group_by(department) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
department | three_examples_product |
---|---|
pets | Small Bone || Grain Free Naturals Dry Dog Food Chicken & Garden Pea Recipe || T-Bonz Porterhouse Flavor Steak-Shaped Dog Treats |
deli | All Natural Prosciutto Vacuum Pack || Butternut Squash Tamales With Cheese || Mediterranean Orzo Salad |
bakery | Naan || Gluten Free Blueberry Muffins || Artisan Thin Pizza Crust Rustic White Flatbread |
other | Natural Calm Anti-Stress Drink || Organic Vitamin D3 400 IU Drops For Babies & Infants Natural Berry Flavor || Yogurt Mint Sauce |
breakfast | Flavor Variety Instant Oatmeal || Cinnamon Chex || Light and Fluffy Blueberry Pancake Mix |
The products dataframe is also related to aisles; each aisle relates to multiple products. By joining the aisles and products dataframes, we get an idea of what type of products belong to each aisle. The example below shows a sample of aisles and a few of their related products.
left_join(aisles, products) %>%
select(aisle, product_name ) %>% group_by(aisle) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
aisle | three_examples_product |
---|---|
poultry counter | Whole Chicken Fryer || Uncured Boneless Ham || Fresh Chicken Wings |
other creams cheeses | X Sharp Cheddar Cheese || Cottage Cheese, Natural Small Curd, 4% Milkfat Min || Cottage Cheese Large Curd |
energy granola bars | Organic Berry Berry Granola Bars || Pomegranate Blueberry Pistachio Plus Antioxidants 1.4 oz Fruit & Nut Bars || Organic Z Fruit Grape Rope |
instant foods | Creamy Deluxe Gluten Free Rice Pasta & Extra Cheesy Cheddar Sauce || Slow Cookers Chili Seasoning Mix || Macaroni Shells & White Cheddar Cheese |
yogurt | YoKids Squeeze Organic Blueberry Blue Yogurt || Greek 100 Calories Mixed Berry Yogurt || Organic Strawberry Coconut Milk Yogurt |
There are over 3 million observations in the orders dataset. Each row represents a unique order, and each variable can potentially be used as a predictor. Let’s analyse the construct of one user. For example, user_id 1 made 10 prior orders (order_number 1 to 10); the last order belongs to the train eval_set.
orders %>% filter (user_id==1) %>% kable
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
2539329 | 1 | prior | 1 | 2 | 8 | NA |
2398795 | 1 | prior | 2 | 3 | 7 | 15 |
473747 | 1 | prior | 3 | 3 | 12 | 21 |
2254736 | 1 | prior | 4 | 4 | 7 | 29 |
431534 | 1 | prior | 5 | 4 | 15 | 28 |
3367565 | 1 | prior | 6 | 2 | 7 | 19 |
550135 | 1 | prior | 7 | 1 | 9 | 20 |
3108588 | 1 | prior | 8 | 1 | 14 | 14 |
2295261 | 1 | prior | 9 | 1 | 16 | 0 |
2550362 | 1 | prior | 10 | 4 | 8 | 30 |
1187899 | 1 | train | 11 | 4 | 8 | 14 |
Let’s analyse another construct of orders. user_id 3 made 12 orders before the final order, which is labeled as a test (eval_set) order. From the data we know that order_number restarts at 1 for each user. Instacart did not provide the basket contents for test orders; these are in fact the targets for prediction.
This also means <user_id, product_id> makes up the key for prediction.
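As a minimal sketch of that key (candidate.keys is a name introduced here), every <user_id, product_id> pair seen in a user's prior orders is a candidate for the next basket:
# Enumerate candidate <user_id, product_id> pairs from prior orders.
candidate.keys = orders %>%
  filter(eval_set == 'prior') %>%
  left_join(order_products_prior, by = 'order_id') %>%
  distinct(user_id, product_id)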
orders %>% filter (user_id==3) %>% kable
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
1374495 | 3 | prior | 1 | 1 | 14 | NA |
444309 | 3 | prior | 2 | 3 | 19 | 9 |
3002854 | 3 | prior | 3 | 3 | 16 | 21 |
2037211 | 3 | prior | 4 | 2 | 18 | 20 |
2710558 | 3 | prior | 5 | 0 | 17 | 12 |
1972919 | 3 | prior | 6 | 0 | 16 | 7 |
1839752 | 3 | prior | 7 | 0 | 15 | 7 |
3225766 | 3 | prior | 8 | 0 | 17 | 7 |
3160850 | 3 | prior | 9 | 0 | 16 | 7 |
676467 | 3 | prior | 10 | 3 | 16 | 17 |
521107 | 3 | prior | 11 | 0 | 18 | 11 |
1402502 | 3 | prior | 12 | 1 | 15 | 15 |
2774568 | 3 | test | 13 | 5 | 15 | 11 |
The order_products_train/prior dataframes tell us which products were purchased in each order, for both train and prior orders. For example, we can see that user_id 1 purchased 11 unique products in the last order (order_id 1187899) by querying order_products_train with the relevant order_id.
order_products_train %>% filter (order_id == 1187899) %>% kable
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
1187899 | 196 | 1 | 1 |
1187899 | 25133 | 2 | 1 |
1187899 | 38928 | 3 | 1 |
1187899 | 26405 | 4 | 1 |
1187899 | 39657 | 5 | 1 |
1187899 | 10258 | 6 | 1 |
1187899 | 13032 | 7 | 1 |
1187899 | 26088 | 8 | 1 |
1187899 | 27845 | 9 | 0 |
1187899 | 49235 | 10 | 1 |
1187899 | 46149 | 11 | 1 |
There is no dedicated dataframe for users. However, we can derive the number of unique users from the orders dataframe. By grouping on the user_id and eval_set columns, we find that there are 75,000 test users and 131,209 train users.
We shall use the train users for training, and the test users for submission to Kaggle. We want to keep this large number of train records in mind for trimming during quick initial model builds and verification.
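As a cross-check, the user counts can be derived directly (a small sketch, counting each user once per eval_set):
orders %>%
  distinct(user_id, eval_set) %>%
  count(eval_set) %>%
  kable(col.names = c('eval_set', 'Users'))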
table(orders$eval_set) %>% kable(col.names = c('eval_set','Frequency')) %>% kable_styling(bootstrap_options = c("striped", "hover"), position = "float_right")
eval_set | Frequency |
---|---|
prior | 3214874 |
test | 75000 |
train | 131209 |
In this section, we shall try to understand buying behaviour by asking some interesting questions.
We know that bananas are the most popular product, and the number of orders varies greatly across products. The illustration below shows only the top 30 products. Notice, however, that the variance is much less pronounced after the top 10 products.
tmp = order_products_train %>%
left_join(products) %>%
group_by(product_name) %>%
summarize(count=n()) %>%
top_n(n=30, wt=count) %>% mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(product_name,count), y=percentage)) +
geom_col() + ggtitle('Products Top 30') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
p2 = ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Products Top 30') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Products')
grid.arrange(p1, p2, ncol = 2)
Certain departments are clearly more popular, like produce and dairy eggs. Both departments combined contribute more than 40% of total orders.
tmp = order_products_train %>%
left_join(products) %>%
left_join(departments) %>%
group_by(department) %>%
summarize(count=n()) %>%
mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(department,count), y=percentage)) +
geom_col() + ggtitle('Departments') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
p2 = ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Departments') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Departments')
grid.arrange(p1, p2, ncol = 2)
We looked into the buying trend of products by aisle and notice that certain aisles, like vegetables and fruits, contribute almost 30% of total orders. The chart below shows the top 30 aisles.
tmp = order_products_train %>%
left_join(products) %>%
left_join(aisles) %>%
group_by(aisle) %>%
summarize(count=n()) %>%
top_n(n=30, wt=count) %>% mutate(percentage=count/sum(count))
p1 = ggplot (tmp, aes(x=reorder(aisle,count), y=percentage)) +
geom_col() + ggtitle('Aisles Top 30') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank()) + ylab('Percentage of Orders') + xlab('Aisles')
p2 = ggplot (tmp, aes( x= '', y=percentage )) +
ggtitle('Aisles Top 30') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Aisles')
grid.arrange(p1, p2, ncol = 2)
We can see that Day 0 and Day 1 stand out as the busiest shopping days for Instacart. This means the day an order is placed may influence ordering behaviour.
order_products_prior %>%
left_join(orders) %>%
group_by(order_dow) %>%
summarize(count = n()) %>%
mutate(percentage=count/sum(count)) %>%
ggplot (aes(x=as.factor(order_dow), y=percentage)) +
geom_col()+ ylab('Percentage of Orders') + ggtitle('Daily Orders')
The top ten products ordered each day contribute between 7% and 8% of daily orders. It is interesting to see that Limes are part of the top ten for Day 0 and Day 6, but not other days, whereas Organic Whole Milk doesn’t make the top ten for Day 0, and Organic Raspberries do not make the top 10 of Day 6. This means there is a chance of predictability based on the day an order is made.
order_products_prior %>%
left_join(orders) %>% left_join(products) %>%
group_by(order_dow, product_name) %>%
summarize(n=n()) %>%
mutate(percentage=n/sum(n)) %>%
top_n(10, wt=n) %>%
ggplot (aes(x=as.factor(order_dow), y=percentage, fill=product_name)) +
geom_col() + ylab('Proprtion of Orders') + ggtitle('Daily Top 10 Products Ordered') +
theme(legend.position="bottom",legend.direction="horizontal")
Morning to afternoon are the peak shopping hours for Instacart customers, so the hour an order is placed may also carry predictive signal.
order_products_prior %>%
left_join(orders) %>%
group_by(order_hour_of_day) %>%
summarize(count = n()) %>%
mutate(percentage=count/sum(count)) %>%
ggplot (aes(x=as.factor(order_hour_of_day), y=percentage)) +
geom_col()+ ylab('Percentage of Orders') + ggtitle('Hourly Orders')
There are close to 50,000 products in the catalogue. When we zoom into hourly purchases, we notice that the top 10 products account for between 6% and 8% of hourly sales. Every hour has a slightly different combination of top 10 products (drawn from a pool of only 12 products). That means certain products are ordered predictably regardless of the hour of the order.
order_products_prior %>%
left_join(orders) %>% left_join(products) %>%
group_by(order_hour_of_day, product_name) %>%
summarize(n=n()) %>%
mutate(percentage=n/sum(n)) %>%
top_n(10, wt=n) %>%
ggplot (aes(x=as.factor(order_hour_of_day), y=percentage, fill=product_name)) +
geom_col() + ylab('Proprtion of Orders In A Hour') + ggtitle('Hourly Top 10 Products Ordered') +
theme(legend.position="bottom",legend.direction="horizontal")
Joining, by = "order_id"
Joining, by = "product_id"
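As a quick check of the 12-product pool claimed above (a sketch, assuming the same joins as the chart):
# Count the distinct products that ever appear in an hourly top 10.
order_products_prior %>%
  left_join(orders) %>% left_join(products) %>%
  group_by(order_hour_of_day, product_name) %>%
  summarize(n = n()) %>%
  top_n(10, wt = n) %>%
  ungroup %>%
  distinct(product_name) %>%
  nrow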
Predicting which products the customer will purchase in the next basket means estimating, for each product purchased before, the probability that it will be purchased again. This is a classification problem, as well as a regression on the probability of repurchase. The output of the prediction is to be evaluated against the train eval_set, with the table columns below:

- <user_id, product_id>: the key
- label: 1 or 0, the target label

For submission to Kaggle, the output from the table above should be transformed into the form below:

user_id <list of product ids separated by spaces>
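A minimal sketch of that transform (make_submission and predictions are hypothetical names; predictions is assumed to hold user_id, product_id and a 0/1 predicted column):
# Collapse each user's predicted products into one space-separated string.
make_submission = function(predictions) {
  predictions %>%
    filter(predicted == 1) %>%
    group_by(user_id) %>%
    summarize(products = paste(product_id, collapse = ' '))
}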
First, let’s create a training label, which is a value of either 1 or 0 indicating the actual basket content. All predictions will be evaluated against this training label.
## Training Label
train.label = orders %>%
filter(eval_set=='train') %>%
left_join(order_products_train) %>%
left_join(products) %>%
mutate(actual=1) %>% #this is training label
select(user_id, order_id, product_id, product_name, actual)
Let’s see an example of the training label for user_id 1. As this is the train eval_set, the single order_id shows that it is the final order. We shall use this as the label (actual) for training.
train.label %>% filter(user_id==1) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
user_id | order_id | product_id | product_name | actual |
---|---|---|---|---|
1 | 1187899 | 196 | Soda | 1 |
1 | 1187899 | 25133 | Organic String Cheese | 1 |
1 | 1187899 | 38928 | 0% Greek Strained Yogurt | 1 |
1 | 1187899 | 26405 | XL Pick-A-Size Paper Towel Rolls | 1 |
1 | 1187899 | 39657 | Milk Chocolate Almonds | 1 |
1 | 1187899 | 10258 | Pistachios | 1 |
1 | 1187899 | 13032 | Cinnamon Toast Crunch | 1 |
1 | 1187899 | 26088 | Aged White Cheddar Popcorn | 1 |
1 | 1187899 | 27845 | Organic Whole Milk | 1 |
1 | 1187899 | 49235 | Organic Half & Half | 1 |
1 | 1187899 | 46149 | Zero Calorie Cola | 1 |
Instacart has close to 50k products in its catalogue, and the number of items any user orders is just a tiny fraction of those 50k. This means that by simply predicting that nothing is purchased in the next basket, we would yield close to 100% accuracy.
Due to this highly imbalanced dataset, Instacart requires the F1 score for competition scoring, instead of accuracy.
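A toy illustration of that accuracy trap (hypothetical numbers):
# With 1 positive among 1,000 rows, the all-negative prediction is
# 99.9% accurate, yet its recall is 0 and its F1 score is undefined.
actual = c(rep(0, 999), 1)
mean(actual == rep(0, 1000))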
To evaluate the performance of the models, we created a custom function to build a confusion matrix and derive other binary classification metrics.
## Custom Function For Binary Class Performance Evaluation
binclass_eval = function (actual, predict) {
  # 2x2 confusion matrix of actual vs predicted labels
  cm = table(as.integer(actual), as.integer(predict), dnn=c('Actual','Predicted'))
  ac = (cm['1','1']+cm['0','0'])/(cm['0','1'] + cm['1','0'] + cm['1','1'] + cm['0','0']) # accuracy
  pr = cm['1','1']/(cm['0','1'] + cm['1','1']) # precision: TP / (TP + FP)
  rc = cm['1','1']/(cm['1','0'] + cm['1','1']) # recall: TP / (TP + FN)
  fs = 2* pr*rc/(pr+rc) # F1: harmonic mean of precision and recall
  list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac)
}
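Example usage on toy vectors (hypothetical data):
ev = binclass_eval(actual = c(1, 0, 1, 1, 0), predict = c(1, 0, 0, 1, 1))
ev$cm
ev$fscore # 0.6667: harmonic mean of precision 2/3 and recall 2/3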
If the prediction is based on a probability, we build a function to discover the cutoff that optimizes various performance metrics.
### Cutoff Threshold Optimization
optimize_cutoff = function (actual, probability) {
rocr.pred = prediction(predictions = probability, labels = actual)
rocr.metrics = data.frame(
cutoff = rocr.pred@cutoffs[[1]],
accuracy = (rocr.pred@tp[[1]] + rocr.pred@tn[[1]]) /
(rocr.pred@tp[[1]] + rocr.pred@tn[[1]] + rocr.pred@fp[[1]] + rocr.pred@fn[[1]]),
tpr = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fn[[1]]),
fpr = rocr.pred@fp[[1]] / (rocr.pred@fp[[1]] + rocr.pred@tn[[1]]),
ppv = rocr.pred@tp[[1]] / (rocr.pred@tp[[1]] + rocr.pred@fp[[1]])
)
rocr.metrics$fscore = 2 * (rocr.metrics$tpr * rocr.metrics$ppv) / (rocr.metrics$tpr + rocr.metrics$ppv)
rocr.metrics$tpr_fpr = rocr.metrics$tpr / rocr.metrics$fpr
## Discover the optimal threshold for various metrics
rocr.best = rbind(
best.accuracy = c(max = max(rocr.metrics$accuracy, na.rm = TRUE), cutoff=rocr.metrics$cutoff[which.max(rocr.metrics$accuracy)]),
best.ppv = c(max = max(rocr.metrics$ppv, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$ppv)]),
best.recall = c(max = max(rocr.metrics$tpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr)]),
best.fscore = c(max = max(rocr.metrics$fscore, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$fscore)]),
best.tpr_fpr = c(max = max(rocr.metrics$tpr_fpr, na.rm = TRUE), cutoff = rocr.metrics$cutoff[which.max(rocr.metrics$tpr_fpr)])
)
list(metrics = rocr.metrics, best = rocr.best)
}
We can simply predict the basket based on the user's last order.
m1.predict = orders %>%
filter(eval_set=='prior') %>%
group_by(user_id) %>%
top_n(n=1, wt=order_number) %>% #last order has the higher order_number
left_join(order_products_prior) %>%
mutate (
predicted=1) %>% #predict based on last ordered, therefore 1
select(user_id, product_id, predicted) %>%
full_join(train.label) %>% # join with train.label for items not predicted but in train.label
select(user_id, product_id, actual, predicted) %>%
replace(., is.na(.), 0) %>% # treat not predicted as 0
arrange(user_id, product_id)
m1.predict %>% filter(user_id ==1) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
m1.eval = binclass_eval(m1.predict$actual, m1.predict$predicted)
m1.eval$cm
Predicted
Actual 0 1
0 0 1760406
1 1005235 379382
The result shows an F1 score of only 0.2153.
cat("Accuracy: ", m1.eval$accuracy,
"\nPrecision: ", m1.eval$precision,
"\nRecall: ", m1.eval$recall,
"\nFScore: ", m1.eval$fscore)
Accuracy: 0.1206293
Precision: 0.1772989
Recall: 0.2739978
FScore: 0.2152885
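These figures follow directly from the confusion matrix above; as a quick arithmetic check:
tp = 379382; fp = 1760406; fn = 1005235; tn = 0
pr = tp / (tp + fp) # precision ~ 0.1773
rc = tp / (tp + fn) # recall    ~ 0.2740
2 * pr * rc / (pr + rc) # F1 ~ 0.2153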
In this model, we predict products in the basket by estimating their frequency of repurchase. This gives us a ratio indicating the probability of repurchase. We use the ROCR package to discover the best cutoff point (above which we predict a reorder) that gives the optimum F1 score.
## Build Model
m2.predict = orders %>%
filter(eval_set=='prior') %>%
group_by(user_id) %>%
mutate(total_orders = max(order_number)) %>% # total number of orders made previously
ungroup %>% select(user_id, order_id, total_orders) %>%
left_join(order_products_prior) %>%
group_by(user_id, product_id) %>%
summarize(rate=n()/max(total_orders)) %>%
select(user_id, product_id, rate) %>%
full_join(train.label) %>% # join with train.label for items not predicted but in train.label
select(user_id, product_id, actual, rate) %>%
replace(., is.na(.), 0) %>% # treat not predicted as 0
arrange(user_id, product_id)
m2.predict %>% filter(user_id==1) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
user_id | product_id | actual | rate |
---|---|---|---|
1 | 196 | 1 | 1.0 |
1 | 10258 | 1 | 0.9 |
1 | 10326 | 0 | 0.1 |
1 | 12427 | 0 | 1.0 |
1 | 13032 | 1 | 0.3 |
1 | 13176 | 0 | 0.2 |
1 | 14084 | 0 | 0.1 |
1 | 17122 | 0 | 0.1 |
1 | 25133 | 1 | 0.8 |
1 | 26088 | 1 | 0.2 |
1 | 26405 | 1 | 0.2 |
1 | 27845 | 1 | 0.0 |
1 | 30450 | 0 | 0.1 |
1 | 35951 | 0 | 0.1 |
1 | 38928 | 1 | 0.1 |
1 | 39657 | 1 | 0.1 |
1 | 41787 | 0 | 0.1 |
1 | 46149 | 1 | 0.3 |
1 | 49235 | 1 | 0.2 |
The table below shows that in order to maximize the F1 score, we need to set the cutoff threshold to 0.3368; applying it is the next step.
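The chunk that produced the table is not shown in the source; presumably it was a call along these lines (m2.cutoff is a name introduced here):
m2.cutoff = optimize_cutoff(m2.predict$actual, m2.predict$rate)
m2.cutoff$best %>% kable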
max | cutoff | |
---|---|---|
best.accuracy | 0.9001268 | Inf |
best.ppv | 0.4345635 | 0.8588235 |
best.recall | 1.0000000 | 0.0000000 |
best.fscore | 0.2312383 | 0.3368421 |
best.tpr_fpr | 6.9266631 | 0.8588235 |
Let’s set the cutoff to 0.3368, as discovered in the previous step.
m2.eval = binclass_eval(m2.predict$actual, m2.predict$rate>0.3368)
m2.eval$cm
Predicted
Actual 0 1
0 11548605 930524
1 1081948 302669
We are getting a slightly better F1 score (0.2312) compared to the previous model. We shall use this as the baseline.
cat("Accuracy: ", m2.eval$accuracy,
"\nPrecision: ", m2.eval$precision,
"\nRecall: ", m2.eval$recall,
"\nFScore: ", m2.eval$fscore)
Accuracy: 0.8548392
Precision: 0.2454352
Recall: 0.218594
FScore: 0.2312383
We construct all the products that users purchased in their last 3 orders, then use machine learning classification to predict whether each of those products will be purchased again. We shall use a decision tree and logistic regression for this prediction.
#m3.predict = orders %>%
# filter(eval_set=='prior') %>%
# left_join(order_products_prior) %>%
# group_by(user_id, product_id) %>%
# summarize(rate=n()/max(total_orders))
We create three features applied to individual users.

users:
- avg_user_order_dow: average of order_dow
- avg_user_order_hod: average of order_hour_of_day
- avg_user_items: average number of items per order
users.train.avg_items = orders %>%
filter(eval_set=='train') %>%
select(user_id) %>%
left_join(orders) %>%
left_join(order_products_prior) %>%
count(user_id, order_id) %>%
group_by(user_id) %>%
summarize(avg_user_items = mean(n))
users.train.avg_dowhod = orders %>%
filter(eval_set=='train') %>%
select(user_id) %>%
left_join(orders) %>%
group_by(user_id) %>%
summarize(
avg_user_order_dow = mean(order_dow),
avg_user_order_hod = mean(order_hour_of_day))
users.train = users.train.avg_items %>% left_join(users.train.avg_dowhod)
users.test.avg_items = orders %>%
filter(eval_set=='test') %>%
select(user_id) %>%
left_join(orders) %>%
left_join(order_products_prior) %>%
count(user_id, order_id) %>%
group_by(user_id) %>%
summarize(avg_user_items = mean(n))
users.test.avg_dowhod = orders %>%
filter(eval_set=='test') %>%
select(user_id) %>%
left_join(orders) %>%
group_by(user_id) %>%
summarize(
avg_user_order_dow = mean(order_dow),
avg_user_order_hod = mean(order_hour_of_day))
users.test = users.test.avg_items %>% left_join(users.test.avg_dowhod)
rm(list=c('users.test.avg_items','users.test.avg_dowhod','users.train.avg_items','users.train.avg_dowhod'))
Training Users
head(users.train) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
user_id | avg_user_items | avg_user_order_dow | avg_user_order_hod |
---|---|---|---|
1 | 5.454546 | 2.636364 | 10.09091 |
2 | 13.066667 | 2.066667 | 10.60000 |
5 | 7.600000 | 1.400000 | 15.00000 |
7 | 9.857143 | 1.857143 | 13.47619 |
8 | 12.500000 | 3.250000 | 5.50000 |
9 | 19.250000 | 3.000000 | 13.25000 |
Test Users
head(users.test) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
user_id | avg_user_items | avg_user_order_dow | avg_user_order_hod |
---|---|---|---|
3 | 6.846154 | 1.384615 | 16.30769 |
4 | 3.166667 | 4.500000 | 12.50000 |
6 | 3.750000 | 3.500000 | 17.00000 |
11 | 11.875000 | 4.500000 | 10.87500 |
12 | 12.500000 | 2.666667 | 11.16667 |
15 | 3.173913 | 2.347826 | 11.26087 |
We create two features applied generally across all products.

products:
- avg_product_order_dow: average order_dow over all orders containing the product
- avg_product_order_hod: average order_hour_of_day over all orders containing the product

products_ = orders %>%
left_join(order_products_prior) %>%
group_by(product_id) %>%
summarize(
avg_product_order_dow = mean(order_dow),
avg_product_order_hod = mean(order_hour_of_day)
) %>%
left_join(products)
head(products_) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
product_id | avg_product_order_dow | avg_product_order_hod | product_name | aisle_id | department_id |
---|---|---|---|---|---|
1 | 2.776458 | 13.23812 | Chocolate Sandwich Cookies | 61 | 19 |
2 | 2.922222 | 13.27778 | All-Seasons Salt | 104 | 13 |
3 | 2.736462 | 12.10469 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
4 | 2.683891 | 13.71429 | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
5 | 2.733333 | 10.66667 | Green Chile Anytime Sauce | 5 | 13 |
6 | 4.250000 | 14.12500 | Dry Nose Oil | 11 | 11 |
m3.train.label = train.label %>%
left_join(users.train) %>% #combine with new user features
left_join(products_) #combine with new product features
Joining, by = "user_id"
Joining, by = c("product_id", "product_name")
head(m3.train.label) %>% kable %>% kable_styling(bootstrap_options = c("striped"))
user_id | order_id | product_id | product_name | actual | avg_user_items | avg_user_order_dow | avg_user_order_hod | avg_product_order_dow | avg_product_order_hod | aisle_id | department_id |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1187899 | 196 | Soda | 1 | 5.454546 | 2.636364 | 10.09091 | 2.898550 | 12.52396 | 77 | 7 |
1 | 1187899 | 25133 | Organic String Cheese | 1 | 5.454546 | 2.636364 | 10.09091 | 2.650581 | 12.97482 | 21 | 16 |
1 | 1187899 | 38928 | 0% Greek Strained Yogurt | 1 | 5.454546 | 2.636364 | 10.09091 | 2.728207 | 12.28471 | 120 | 16 |
1 | 1187899 | 26405 | XL Pick-A-Size Paper Towel Rolls | 1 | 5.454546 | 2.636364 | 10.09091 | 2.765239 | 12.48682 | 54 | 17 |
1 | 1187899 | 39657 | Milk Chocolate Almonds | 1 | 5.454546 | 2.636364 | 10.09091 | 2.707312 | 12.27296 | 45 | 19 |
1 | 1187899 | 10258 | Pistachios | 1 | 5.454546 | 2.636364 | 10.09091 | 2.719425 | 12.25591 | 117 | 19 |
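Note that m3.train.label above holds only positive examples (actual = 1). Before fitting a classifier, negative candidates (prior products that were not repurchased) still need to be added; below is a hedged sketch mirroring the label-joining idea from model 2 (m3.train is a name introduced here):
# Build the full m3 training set: all prior <user_id, product_id> candidates
# for train users, labeled 1 if repurchased in the final order, else 0.
m3.train = orders %>%
  filter(eval_set == 'prior') %>%
  left_join(order_products_prior, by = 'order_id') %>%
  distinct(user_id, product_id) %>%
  inner_join(users.train, by = 'user_id') %>% # keep train users only
  left_join(train.label %>% select(user_id, product_id, actual),
            by = c('user_id', 'product_id')) %>%
  mutate(actual = replace(actual, is.na(actual), 0)) %>% # not repurchased = 0
  left_join(products_ %>% select(product_id, avg_product_order_dow, avg_product_order_hod),
            by = 'product_id')
# A decision tree or logistic regression could then be fit, e.g.:
# glm(actual ~ avg_user_items + avg_user_order_dow + avg_user_order_hod +
#       avg_product_order_dow + avg_product_order_hod,
#     data = m3.train, family = binomial)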