Instacart Market Basket Analysis

Which products will an Instacart consumer purchase again?

The objective of this Kaggle competition is to use the anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.

Exploratory Data Analysis

Data

The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, 4 to 100 of their prior orders are given, with the sequence of products purchased in each order.

The data has been provided in 6 files which seem to be dump of database tables. The data files are:

  • aisles.csv
  • departments.csv
  • order_products__prior.csv
  • order_products__train.csv
  • orders.csv
  • products.csv

Let’s load the data and look at the first few rows of the files to understand the data better.

library(data.table)
path <- "./input"

aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))
orderp <- fread(file.path(path, "order_products__prior.csv"))
## 
Read 0.0% of 32434489 rows
Read 10.8% of 32434489 rows
Read 22.1% of 32434489 rows
Read 32.8% of 32434489 rows
Read 44.1% of 32434489 rows
Read 54.8% of 32434489 rows
Read 66.0% of 32434489 rows
Read 76.7% of 32434489 rows
Read 88.1% of 32434489 rows
Read 98.7% of 32434489 rows
Read 32434489 rows and 4 (of 4) columns from 0.538 GB file in 00:00:18
ordert <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
## 
Read 10.5% of 3421083 rows
Read 44.1% of 3421083 rows
Read 71.0% of 3421083 rows
Read 3421083 rows and 7 (of 7) columns from 0.101 GB file in 00:00:05
products <- fread(file.path(path, "products.csv"))

head(aisles)
##    aisle_id                      aisle
## 1:        1      prepared soups salads
## 2:        2          specialty cheeses
## 3:        3        energy granola bars
## 4:        4              instant foods
## 5:        5 marinades meat preparation
## 6:        6                      other
head(departments)
##    department_id    department
## 1:             1        frozen
## 2:             2         other
## 3:             3        bakery
## 4:             4       produce
## 5:             5       alcohol
## 6:             6 international
head(products)
##    product_id
## 1:          1
## 2:          2
## 3:          3
## 4:          4
## 5:          5
## 6:          6
##                                                         product_name
## 1:                                        Chocolate Sandwich Cookies
## 2:                                                  All-Seasons Salt
## 3:                              Robust Golden Unsweetened Oolong Tea
## 4: Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce
## 5:                                         Green Chile Anytime Sauce
## 6:                                                      Dry Nose Oil
##    aisle_id department_id
## 1:       61            19
## 2:      104            13
## 3:       94             7
## 4:       38             1
## 5:        5            13
## 6:       11            11
head(orders)
##    order_id user_id eval_set order_number order_dow order_hour_of_day
## 1:  2539329       1    prior            1         2                 8
## 2:  2398795       1    prior            2         3                 7
## 3:   473747       1    prior            3         3                12
## 4:  2254736       1    prior            4         4                 7
## 5:   431534       1    prior            5         4                15
## 6:  3367565       1    prior            6         2                 7
##    days_since_prior_order
## 1:                     NA
## 2:                     15
## 3:                     21
## 4:                     29
## 5:                     28
## 6:                     19
head(ordert)
##    order_id product_id add_to_cart_order reordered
## 1:        1      49302                 1         1
## 2:        1      11109                 2         1
## 3:        1      10246                 3         0
## 4:        1      49683                 4         0
## 5:        1      43633                 5         1
## 6:        1      13176                 6         0
head(orderp)
##    order_id product_id add_to_cart_order reordered
## 1:        2      33120                 1         1
## 2:        2      28985                 2         1
## 3:        2       9327                 3         0
## 4:        2      45918                 4         1
## 5:        2      30035                 5         0
## 6:        2      17794                 6         1

As we could see, orders.csv has all the information about the given order, like the user who has purchased the order, when was it purchased, days since prior order and so on. The columns present in order_products_train and order_products_prior are same. Then what is the difference between these files?

In this dataset, 4 to 100 orders of a customer are given (we will look at this later) and we need to predict the products that will be re-ordered. So the last order of the user has been taken out and divided into train and test sets. All the prior order information of the customer is present in order_products_prior file. We can also note that there is a column in orders.csv file called eval_set which tells us as to which of the three datasets (prior, train or test) the given row goes to. Order_products*csv file has more detailed information about the products that been bought in the given order along with the re-ordered status. The products ordered in the last order of the training set has been provided in the Order_products_train.csv

Let us first get the count of rows in each of the three sets.

library(dplyr)
orders %>%
        group_by(eval_set) %>%
        summarise(users=n_distinct(user_id))
## # A tibble: 3 × 2
##   eval_set  users
##      <chr>  <int>
## 1    prior 206209
## 2     test  75000
## 3    train 131209

So there are 206,209 customers in total. Out of which, the last purchase of 131,209 customers is given as train set and we need to predict for 75,000 customers belonging to the test set.

Let’s validate that the number of orders provided are indeed btw 4 and 100.

library(ggplot2)
grouped_df <- orders %>%
                group_by(user_id) %>%
                summarise(total_orders= max(order_number))
ggplot(grouped_df, aes(total_orders)) + geom_bar(fill="salmon") + ggtitle("Frequecy of total orders") + theme(plot.title = element_text(hjust = 0.5))

The total number of orders are indeed between 4-100 per customer in a decreasing trend with a spike at 100.

Let’s now look at the ordering pattern based on the day of the week and the hour of the day.

grouped_df <- orders %>%
                        group_by(order_dow, order_hour_of_day) %>%
                        summarise(total_orders=n())
ggplot(grouped_df, aes(order_hour_of_day,order_dow)) + geom_tile(aes(fill = total_orders), colour = "white") + scale_fill_gradient(low = "white",
     high = "salmon") + ggtitle("Frequency of Day of week Vs Hour of day") + theme(plot.title = element_text(hjust = 0.5))

Seems Saturday evenings and Sunday mornings are the prime times for orders.

Now let us check the time interval between the orders.

grouped_df <- orders %>%
                         group_by(days_since_prior_order) %>%
                         summarise(count=n(), na.rm = T)
ggplot(grouped_df, aes(days_since_prior_order, count)) + geom_bar(stat="identity",fill="salmon") +
         ggtitle("Frequency distribution by days since prior order") + theme(plot.title = element_text(hjust = 0.5))

Looks like customers order once in every week (check the peak at 7 days) or once in a month (peak at 30 days). We could also see smaller peaks at 14, 21 and 28 days (weekly intervals).

Since our objective is to figure out the re-orders, let us check out the re-order percentage in prior set and train set.

sum(orderp$reordered)/nrow(orderp)
## [1] 0.5896975
sum(ordert$reordered)/nrow(ordert)
## [1] 0.5985944

On an average, about 59% of the products in an order are re-ordered products.

Let’s now find the percentage of orders with no reordered products.

grouped_df <- orderp %>%
                group_by(order_id) %>%
                summarise(reordered_pr = sum(reordered==1))
sum(grouped_df$reordered_pr==0)/nrow(grouped_df)
## [1] 0.1208486
grouped_df <- ordert %>%
                group_by(order_id) %>%
                summarise(reordered_pr = sum(reordered==1))
sum(grouped_df$reordered_pr==0)/nrow(grouped_df)
## [1] 0.06555953

About 12% of the orders in prior set have no re-ordered items while in the train set, 6.5% of the orders have no reordered items.

Now let us see the number of products bought in each order.

grouped_df <- ordert %>%
                group_by(order_id) %>%
                summarise(products_in_cart = max(add_to_cart_order))        
ggplot(grouped_df, aes(products_in_cart)) + geom_bar(fill="salmon") + ggtitle("Frequecy of total products in an order") + theme(plot.title = element_text(hjust = 0.5))

A right tailed distribution with the maximum value at 5!

Let’s now merge the product,aisles and department details with the order_prior details.

What are the top selling products?

orderp_product <- orderp %>%
                 inner_join(products) %>% 
                 inner_join(aisles) %>%
                 inner_join(departments)        

rev(sort(table(orderp_product$product_name)))[1:20]
## 
##                   Banana   Bag of Organic Bananas     Organic Strawberries 
##                   472565                   379450                   264683 
##     Organic Baby Spinach     Organic Hass Avocado          Organic Avocado 
##                   241921                   213584                   176815 
##              Large Lemon             Strawberries                    Limes 
##                   152657                   142951                   140627 
##       Organic Whole Milk      Organic Raspberries     Organic Yellow Onion 
##                   137905                   137057                   113426 
##           Organic Garlic         Organic Zucchini      Organic Blueberries 
##                   109778                   104823                   100060 
##           Cucumber Kirby       Organic Fuji Apple            Organic Lemon 
##                    97315                    89632                    87746 
## Apple Honeycrisp Organic   Organic Grape Tomatoes 
##                    85020                    84255

Wow. Most of them are organic products.! Also majority of them are fruits.

Now let us look at the important aisles.

rev(sort(table(orderp_product$aisle)))[1:20]
## 
##                  fresh fruits              fresh vegetables 
##                       3642188                       3418021 
##    packaged vegetables fruits                        yogurt 
##                       1765313                       1452343 
##               packaged cheese                          milk 
##                        979763                        891015 
## water seltzer sparkling water                chips pretzels 
##                        841533                        722470 
##               soy lactosefree                         bread 
##                        638253                        584834 
##                  refrigerated                frozen produce 
##                        575881                        522654 
##                 ice cream ice                      crackers 
##                        498425                        458838 
##           energy granola bars                          eggs 
##                        456386                        452134 
##                    lunch meat                  frozen meals 
##                        395130                        390299 
##             baby food formula                   fresh herbs 
##                        382456                        377741

Let us now check the department wise distribution.

grouped_df <- orderp_product %>%
                group_by(department) %>%
                summarise(count_percentage = n()/nrow(orderp_product)*100)
grouped_df[rev(order(grouped_df$count_percentage)),]
## # A tibble: 21 × 2
##         department count_percentage
##              <chr>            <dbl>
## 1          produce        29.225961
## 2       dairy eggs        16.692158
## 3           snacks         8.902715
## 4        beverages         8.294038
## 5           frozen         6.895228
## 6           pantry         5.782662
## 7           bakery         3.628197
## 8     canned goods         3.292970
## 9             deli         3.241146
## 10 dry goods pasta         2.671930
## # ... with 11 more rows

Produce is the largest department.

Now let us check the reordered percentage of each department.

grouped_df <- orderp_product %>%
                group_by(department) %>%
                summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df, aes(department, reordered_ratio, group=1)) + geom_line(linetype=1, color="salmon", size=2)+ geom_point(size=3) + ggtitle("Department wise reorder ratio") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Personal care has lowest reorder ratio and dairy eggs have highest reorder ratio.

Aisle - Reorder ratio

grouped_df <- orderp_product %>%
                group_by(aisle) %>%
                summarise(reordered_ratio = sum(reordered)/n())
grouped_df[rev(order(grouped_df$reordered_ratio))[1:10],]
## # A tibble: 10 × 2
##                            aisle reordered_ratio
##                            <chr>           <dbl>
## 1                           milk       0.7814279
## 2  water seltzer sparkling water       0.7295935
## 3                   fresh fruits       0.7181038
## 4                           eggs       0.7053661
## 5                soy lactosefree       0.6925514
## 6               packaged produce       0.6907343
## 7                         yogurt       0.6864893
## 8                          cream       0.6850460
## 9                          bread       0.6701679
## 10                  refrigerated       0.6633020

The aisles for milk, water seltzer sparkling water, fresh fruits and eggs have the highest reorder ratio.

Add to Cart - Reorder ratio:

grouped_df <- orderp_product %>%
                group_by(add_to_cart_order) %>%
                summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df[1:70,], aes(add_to_cart_order, reordered_ratio, group=1)) + geom_line(linetype=1, color="salmon", size=2)+ geom_point(size=1) + ggtitle("Add to Cart reorder ratio") + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Looks like the products that are added to the cart initially are more likely to be reordered again compared to the ones added later.

orderp_product <- orderp_product %>%
                        inner_join(orders)
grouped_df <- orderp_product %>%
                group_by(order_dow, order_hour_of_day) %>%
                summarise(reordered_ratio = sum(reordered)/n())
ggplot(grouped_df, aes(order_hour_of_day,order_dow)) + geom_tile(aes(fill = reordered_ratio), colour = "white") + scale_fill_gradient(low = "white",
     high = "salmon") + ggtitle("Reorder ratio of Day of week Vs Hour of day") + theme(plot.title = element_text(hjust = 0.5))

Looks like reorder ratios are quite high during the early mornings compared to later half of the day.