The objective of this Kaggle competition is to use the anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.
The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, 4 to 100 of their prior orders are given, with the sequence of products purchased in each order.
The data has been provided in 6 files, which appear to be dumps of database tables: aisles.csv, departments.csv, order_products__prior.csv, order_products__train.csv, orders.csv and products.csv.
Let’s load the data and look at the first few rows of the files to understand the data better.
library(data.table)
path <- "./input"
aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))
orderp <- fread(file.path(path, "order_products__prior.csv"))
## Read 32434489 rows and 4 (of 4) columns from 0.538 GB file in 00:00:18
ordert <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
## Read 3421083 rows and 7 (of 7) columns from 0.101 GB file in 00:00:05
products <- fread(file.path(path, "products.csv"))
head(aisles)
## aisle_id aisle
## 1: 1 prepared soups salads
## 2: 2 specialty cheeses
## 3: 3 energy granola bars
## 4: 4 instant foods
## 5: 5 marinades meat preparation
## 6: 6 other
head(departments)
## department_id department
## 1: 1 frozen
## 2: 2 other
## 3: 3 bakery
## 4: 4 produce
## 5: 5 alcohol
## 6: 6 international
head(products)
## product_id
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## product_name
## 1: Chocolate Sandwich Cookies
## 2: All-Seasons Salt
## 3: Robust Golden Unsweetened Oolong Tea
## 4: Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce
## 5: Green Chile Anytime Sauce
## 6: Dry Nose Oil
## aisle_id department_id
## 1: 61 19
## 2: 104 13
## 3: 94 7
## 4: 38 1
## 5: 5 13
## 6: 11 11
head(orders)
## order_id user_id eval_set order_number order_dow order_hour_of_day
## 1: 2539329 1 prior 1 2 8
## 2: 2398795 1 prior 2 3 7
## 3: 473747 1 prior 3 3 12
## 4: 2254736 1 prior 4 4 7
## 5: 431534 1 prior 5 4 15
## 6: 3367565 1 prior 6 2 7
## days_since_prior_order
## 1: NA
## 2: 15
## 3: 21
## 4: 29
## 5: 28
## 6: 19
head(ordert)
## order_id product_id add_to_cart_order reordered
## 1: 1 49302 1 1
## 2: 1 11109 2 1
## 3: 1 10246 3 0
## 4: 1 49683 4 0
## 5: 1 43633 5 1
## 6: 1 13176 6 0
head(orderp)
## order_id product_id add_to_cart_order reordered
## 1: 2 33120 1 1
## 2: 2 28985 2 1
## 3: 2 9327 3 0
## 4: 2 45918 4 1
## 5: 2 30035 5 0
## 6: 2 17794 6 1
As we can see, orders.csv has all the information about a given order: the user who placed it, when it was purchased, days since the prior order and so on. The columns present in order_products__train and order_products__prior are the same. So what is the difference between these files?
In this dataset, 4 to 100 orders per customer are given (we will look at this later) and we need to predict the products that will be re-ordered. So the last order of each user has been held out and split between the train and test sets. All of a customer's prior order information is in the order_products__prior file. Note also that orders.csv has a column called eval_set which tells us which of the three sets (prior, train or test) a given order belongs to. The order_products__*.csv files give the detailed list of products bought in each order along with their reordered status; the products in the held-out (last) orders of the training users are provided in order_products__train.csv.
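To make the split concrete, here is a quick sanity check (a sketch using the tables loaded above; dplyr is loaded here and again below for the summaries): for every user, the order with the highest order_number should be the single one tagged train or test, and all earlier orders should be tagged prior.
library(dplyr)
# For each user, find the eval_set of their last order and count their prior orders
split_check <- orders %>%
  group_by(user_id) %>%
  summarise(last_order_set = eval_set[which.max(order_number)],
            prior_orders   = sum(eval_set == "prior"))
table(split_check$last_order_set)   # should contain only "train" and "test"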
Let us first get the number of users in each of the three sets.
library(dplyr)
orders %>%
group_by(eval_set) %>%
summarise(users=n_distinct(user_id))
## # A tibble: 3 × 2
## eval_set users
## <chr> <int>
## 1 prior 206209
## 2 test 75000
## 3 train 131209
So there are 206,209 customers in total. The last purchase of 131,209 of them is given as the train set, and we need to predict for the 75,000 customers in the test set.
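As a quick cross-check (a small sketch, not required for the analysis), the train and test users should not overlap and together should cover all 206,209 users:
train_users <- unique(orders$user_id[orders$eval_set == "train"])
test_users <- unique(orders$user_id[orders$eval_set == "test"])
length(intersect(train_users, test_users))   # expect 0
length(union(train_users, test_users))       # expect 206209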
Let’s validate that the number of orders provided per user is indeed between 4 and 100.
library(ggplot2)
grouped_df <- orders %>%
group_by(user_id) %>%
summarise(total_orders= max(order_number))
ggplot(grouped_df, aes(total_orders)) + geom_bar(fill="salmon") + ggtitle("Frequency of total orders") + theme(plot.title = element_text(hjust = 0.5))
The total number of orders per customer is indeed between 4 and 100, with a decreasing trend and a spike at 100.
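A quick numeric confirmation alongside the plot (using the same grouped_df):
range(grouped_df$total_orders)   # expect 4 and 100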
Let’s now look at the ordering pattern based on the day of the week and the hour of the day.
grouped_df <- orders %>%
group_by(order_dow, order_hour_of_day) %>%
summarise(total_orders=n())
ggplot(grouped_df, aes(order_hour_of_day,order_dow)) + geom_tile(aes(fill = total_orders), colour = "white") + scale_fill_gradient(low = "white",
high = "salmon") + ggtitle("Frequency of Day of week Vs Hour of day") + theme(plot.title = element_text(hjust = 0.5))
It seems Saturday evenings and Sunday mornings are the prime times for orders.
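To read the exact hot spots off the heatmap, we can look at the busiest day/hour combinations (a small sketch on the same grouped_df):
grouped_df %>%
  ungroup() %>%
  arrange(desc(total_orders)) %>%
  head(5)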
Now let us check the time interval between the orders.
grouped_df <- orders %>%
filter(!is.na(days_since_prior_order)) %>% # first orders have no prior order
group_by(days_since_prior_order) %>%
summarise(count = n())
ggplot(grouped_df, aes(days_since_prior_order, count)) + geom_bar(stat="identity",fill="salmon") +
ggtitle("Frequency distribution by days since prior order") + theme(plot.title = element_text(hjust = 0.5))
Looks like customers order once every week (note the peak at 7 days) or once a month (peak at 30 days). We can also see smaller peaks at 14, 21 and 28 days (weekly intervals).
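The peaks can also be read off directly (same grouped_df as the plot):
grouped_df %>%
  arrange(desc(count)) %>%
  head(5)   # the 30- and 7-day peaks should appear near the top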
Since our objective is to figure out the re-orders, let us check out the re-order percentage in prior set and train set.
sum(orderp$reordered)/nrow(orderp)
## [1] 0.5896975
sum(ordert$reordered)/nrow(ordert)
## [1] 0.5985944
On average, about 59% of the products in a prior order and about 60% of the products in a train order are re-ordered.
Let’s now find the percentage of orders with no reordered products.
grouped_df <- orderp %>%
group_by(order_id) %>%
summarise(reordered_pr = sum(reordered==1))
sum(grouped_df$reordered_pr==0)/nrow(grouped_df)
## [1] 0.1208486
grouped_df <- ordert %>%
group_by(order_id) %>%
summarise(reordered_pr = sum(reordered==1))
sum(grouped_df$reordered_pr==0)/nrow(grouped_df)
## [1] 0.06555953
About 12% of the orders in the prior set have no re-ordered items, while in the train set only about 6.5% of the orders have no re-ordered items.
Now let us see the number of products bought in each order (using the train set).
grouped_df <- ordert %>%
group_by(order_id) %>%
summarise(products_in_cart = max(add_to_cart_order))
ggplot(grouped_df, aes(products_in_cart)) + geom_bar(fill="salmon") + ggtitle("Frequency of total products in an order") + theme(plot.title = element_text(hjust = 0.5))
A right-tailed distribution with its peak at 5 products!
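To pin down that peak numerically (a quick tally on the same grouped_df):
grouped_df %>%
  count(products_in_cart, sort = TRUE) %>%
  head(3)   # the most common basket sizes; 5 should top the list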
Let’s now merge the product, aisle and department details with the order_products__prior details.
What are the top selling products?
orderp_product <- orderp %>%
inner_join(products) %>%
inner_join(aisles) %>%
inner_join(departments)
rev(sort(table(orderp_product$product_name)))[1:20]
##
## Banana Bag of Organic Bananas Organic Strawberries
## 472565 379450 264683
## Organic Baby Spinach Organic Hass Avocado Organic Avocado
## 241921 213584 176815
## Large Lemon Strawberries Limes
## 152657 142951 140627
## Organic Whole Milk Organic Raspberries Organic Yellow Onion
## 137905 137057 113426
## Organic Garlic Organic Zucchini Organic Blueberries
## 109778 104823 100060
## Cucumber Kirby Organic Fuji Apple Organic Lemon
## 97315 89632 87746
## Apple Honeycrisp Organic Organic Grape Tomatoes
## 85020 84255
Wow, most of them are organic products! Also, the majority of them are fruits.
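A rough way to quantify the "organic" observation (a sketch on the same top-20 table computed above):
top20 <- rev(sort(table(orderp_product$product_name)))[1:20]
sum(grepl("Organic", names(top20), fixed = TRUE))   # how many of the 20 names contain "Organic"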
Now let us look at the important aisles.
rev(sort(table(orderp_product$aisle)))[1:20]
##
## fresh fruits fresh vegetables
## 3642188 3418021
## packaged vegetables fruits yogurt
## 1765313 1452343
## packaged cheese milk
## 979763 891015
## water seltzer sparkling water chips pretzels
## 841533 722470
## soy lactosefree bread
## 638253 584834
## refrigerated frozen produce
## 575881 522654
## ice cream ice crackers
## 498425 458838
## energy granola bars eggs
## 456386 452134
## lunch meat frozen meals
## 395130 390299
## baby food formula fresh herbs
## 382456 377741
Let us now check the department wise distribution.
grouped_df <- orderp_product %>%
group_by(department) %>%
summarise(count_percentage = n()/nrow(orderp_product)*100)
grouped_df[rev(order(grouped_df$count_percentage)),]
## # A tibble: 21 × 2
## department count_percentage
## <chr> <dbl>
## 1 produce 29.225961
## 2 dairy eggs 16.692158
## 3 snacks 8.902715
## 4 beverages 8.294038
## 5 frozen 6.895228
## 6 pantry 5.782662
## 7 bakery 3.628197
## 8 canned goods 3.292970
## 9 deli 3.241146
## 10 dry goods pasta 2.671930
## # ... with 11 more rows
Produce is the largest department.
Now let us check the reordered percentage of each department.
grouped_df <- orderp_product %>%
group_by(department) %>%
summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df, aes(department, reordered_ratio, group=1)) + geom_line(linetype=1, color="salmon", size=2)+ geom_point(size=3) + ggtitle("Department wise reorder ratio") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Personal care has the lowest reorder ratio and dairy eggs has the highest reorder ratio.
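The exact extremes behind the plot (same grouped_df):
grouped_df %>%
  arrange(reordered_ratio) %>%
  slice(c(1, n()))   # lowest and highest department reorder ratios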
Aisle - Reorder ratio
grouped_df <- orderp_product %>%
group_by(aisle) %>%
summarise(reordered_ratio = sum(reordered)/n())
grouped_df[rev(order(grouped_df$reordered_ratio))[1:10],]
## # A tibble: 10 × 2
## aisle reordered_ratio
## <chr> <dbl>
## 1 milk 0.7814279
## 2 water seltzer sparkling water 0.7295935
## 3 fresh fruits 0.7181038
## 4 eggs 0.7053661
## 5 soy lactosefree 0.6925514
## 6 packaged produce 0.6907343
## 7 yogurt 0.6864893
## 8 cream 0.6850460
## 9 bread 0.6701679
## 10 refrigerated 0.6633020
The aisles for milk, water seltzer sparkling water, fresh fruits and eggs have the highest reorder ratio.
Add to Cart - Reorder ratio:
grouped_df <- orderp_product %>%
group_by(add_to_cart_order) %>%
summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df[1:70,], aes(add_to_cart_order, reordered_ratio, group=1)) + geom_line(linetype=1, color="salmon", size=2)+ geom_point(size=1) + ggtitle("Add to Cart reorder ratio") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Looks like the products that are added to the cart first are more likely to be reordered than the ones added later. Finally, let's merge in the order details (day of week and hour of day) and look at the reorder ratio across them.
orderp_product <- orderp_product %>%
inner_join(orders)
grouped_df <- orderp_product %>%
group_by(order_dow, order_hour_of_day) %>%
summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df, aes(order_hour_of_day,order_dow)) + geom_tile(aes(fill = reordered_ratio), colour = "white") + scale_fill_gradient(low = "white",
high = "salmon") + ggtitle("Reorder ratio of Day of week Vs Hour of day") + theme(plot.title = element_text(hjust = 0.5))
Looks like reorder ratios are quite high during the early mornings compared to the latter half of the day.