1 Introduction, Research Goal & Objective

3 Dataset Preparation

3.1 R Libraries Used

Here are the R libraries used in this analysis.

library(knitr)      # web widget
library(tidyverse)  # data manipulation
library(data.table) # fast file reading
library(treemap)    # tree visualization
library(corrgram)   # correlation graphics

3.2 Import Datasets

setwd('./datasets')
aisles      = fread('aisles.csv',      stringsAsFactors = TRUE)
departments = fread('departments.csv', stringsAsFactors = TRUE)
products    = fread('products.csv',    stringsAsFactors = TRUE)
orders      = fread('orders.csv',      stringsAsFactors = TRUE)
order_products_train = fread('order_products__train.csv')
order_products_prior = fread('order_products__prior.csv')

3.3 Understanding Datasets

These datasets were sourced from an existing Kaggle competiotion (https://www.kaggle.com/c/instacart-market-basket-analysis/data)

  • The dataset for this competition is a relational set of files describing customers’ orders over time
  • The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users
  • For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order
  • Provide the week and hour of day the order was placed, and a relative measure of time between orders

Total six datasets were imported. Follwing section will explore each datasets in further detail.

3.3.1 Aisles

There are 134 ailes in this dataset.

str(aisles)
## Classes 'data.table' and 'data.frame':   134 obs. of  2 variables:
##  $ aisle_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ aisle   : Factor w/ 134 levels "air fresheners candles",..: 107 121 43 75 82 92 95 6 101 77 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Here are the names of all ailes aphabetically ordered, seprated by comma.

paste(sort(aisles$aisle), collapse=', ')
## [1] "air fresheners candles, asian foods, baby accessories, baby bath body care, baby food formula, bakery desserts, baking ingredients, baking supplies decor, beauty, beers coolers, body lotions soap, bread, breakfast bakery, breakfast bars pastries, bulk dried fruits vegetables, bulk grains rice dried goods, buns rolls, butter, candy chocolate, canned fruit applesauce, canned jarred vegetables, canned meals beans, canned meat seafood, cat food care, cereal, chips pretzels, cleaning products, cocoa drink mixes, coffee, cold flu allergy, condiments, cookies cakes, crackers, cream, deodorants, diapers wipes, digestion, dish detergents, dog food care, doughs gelatins bake mixes, dry pasta, eggs, energy granola bars, energy sports drinks, eye ear care, facial care, feminine care, first aid, food storage, fresh dips tapenades, fresh fruits, fresh herbs, fresh pasta, fresh vegetables, frozen appetizers sides, frozen breads doughs, frozen breakfast, frozen dessert, frozen juice, frozen meals, frozen meat seafood, frozen pizza, frozen produce, frozen vegan vegetarian, fruit vegetable snacks, grains rice dried goods, granola, hair care, honeys syrups nectars, hot cereal pancake mixes, hot dogs bacon sausage, ice cream ice, ice cream toppings, indian foods, instant foods, juice nectars, kitchen supplies, kosher foods, latino foods, laundry, lunch meat, marinades meat preparation, meat counter, milk, mint gum, missing, more household, muscles joints pain relief, nuts seeds dried fruit, oils vinegars, oral hygiene, other, other creams cheeses, packaged cheese, packaged meat, packaged poultry, packaged produce, packaged seafood, packaged vegetables fruits, paper goods, pasta sauce, pickled goods olives, plates bowls cups flatware, popcorn jerky, poultry counter, prepared meals, prepared soups salads, preserved dips spreads, protein meal replacements, red wines, refrigerated, refrigerated pudding desserts, salad dressing toppings, seafood counter, shave needs, skin care, soap, soft drinks, soup broth bouillon, soy lactosefree, specialty cheeses, specialty wines champagnes, spices seasonings, spirits, spreads, tea, tofu meat alternatives, tortillas flat bread, trail mix snack mix, trash bags liners, vitamins supplements, water seltzer sparkling water, white wines, yogurt"

3.3.2 Departments

There are 21 departments in this dataset.

str(departments)
## Classes 'data.table' and 'data.frame':   21 obs. of  2 variables:
##  $ department_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ department   : Factor w/ 21 levels "alcohol","babies",..: 11 16 3 20 1 13 4 19 10 6 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Names of all deparments are listed below in aphabetically ordered.

sort(departments$department)
##  [1] alcohol         babies          bakery          beverages      
##  [5] breakfast       bulk            canned goods    dairy eggs     
##  [9] deli            dry goods pasta frozen          household      
## [13] international   meat seafood    missing         other          
## [17] pantry          personal care   pets            produce        
## [21] snacks         
## 21 Levels: alcohol babies bakery beverages breakfast bulk ... snacks

3.3.3 Products

There are 49,688 products in the catalogue within 134 aisles and 21 departments.

str(products)
length(unique(products$aisle_id))
length(unique(products$department_id))
## Classes 'data.table' and 'data.frame':   49688 obs. of  4 variables:
##  $ product_id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_name : Factor w/ 49688 levels "#2 Coffee Filters",..: 8642 2018 38573 40977 18584 13349 36616 11763 23264 41864 ...
##  $ aisle_id     : int  61 104 94 38 5 11 98 116 120 115 ...
##  $ department_id: int  19 13 7 1 13 11 7 1 16 7 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## [1] 134
## [1] 21

3.3.4 Departments And Its Relevant Products

Table below shows a glimpse of 3 products for each deparment, giving an idea the how the products being categorized. Only sample five departments are shown.

left_join(departments, products) %>% select(department, product_name ) %>%
  group_by(department) %>%
  sample_n(3) %>%
  summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
department three_examples_product
dry goods pasta Creamy Four Cheese Rice || Garden Vegetable Ready Rice || Roasted Garlic & Herb Brown Rice Medley
produce Orange Navel Heirloom. || Crenshaw Melon || Cocktail Tomatoes
pets Flaked Tuna Feast Cat Food || Rawhide Chip Rolls || Adult Complete Nutrition Dry Food for Dogs
bakery Yellow Corn Organic Tortillas || Tortillas, Flour || Small Soft and Fluffy Flour Tortillas
babies Fruit & Veggie Twist Apple Mango Pear & Kale || Organic Sunny Days Apple Snack Bars || Slip On Size 6 Diapers

3.3.5 Ailes And Its Relevant Products

Table below shows a glimpse of 3 products for each ailes, giving an idea the how the products being categorized. Only sample five ailes are shown

left_join(aisles, products) %>%
  select(aisle, product_name ) %>%  group_by(aisle) %>%
  sample_n(3) %>% 
  summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
aisle three_examples_product
cookies cakes Original Animal Crackers || Soft Baked Sanibel Snickerdoodle Cookies || Sandies Toffee Shortbread Cookies
fresh fruits Mandarin Box || Organic Honeydew || Red Globe Grapes
cat food care Sliced Salmon Canned Catfood Entree || Liver & Chicken Dinner Classic Pate Cat Food || Purina Mixed Grill Classic Pate Cat Food
food storage Take Alongs Round Small Bowls || Zipper Bags Sandwich - 50 CT || Takealongs Serving Bowl
packaged seafood Alaskan Sockeye Smoked Salmon || Wild Alaskan Salmon Sockeye Salmon Gravlox || Herring In Wine Sauce

3.3.6 Orders

There are over 3 millions observations in orders dataset. Each row represent an unique order. Each variable potentially can be used as predictors.

variable description
user_id unique identity of a customer
eval_set value “prior” identifies that it is previous order before leading to last order order value “train” is the last order with associated products that the user order value “test” is the last order, however the contest is to predict what will be the product the user will buy
order_number true
order_dow day of week the order is placed
order_hour_of_day which hour the order is placed
days_since_prior_order how many days lapsed from this order to the previous order
glimpse(orders)
## Observations: 3,421,083
## Variables: 7
## $ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set               <fct> prior, prior, prior, prior, prior, prio...
## $ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...

Let’s analyse the construct of one user. For user_id 1, there were 10 prior orders (order number from 1 to 10), last order is a train.

orders %>% filter (user_id==1) %>% kable
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 8 NA
2398795 1 prior 2 3 7 15
473747 1 prior 3 3 12 21
2254736 1 prior 4 4 7 29
431534 1 prior 5 4 15 28
3367565 1 prior 6 2 7 19
550135 1 prior 7 1 9 20
3108588 1 prior 8 1 14 14
2295261 1 prior 9 1 16 0
2550362 1 prior 10 4 8 30
1187899 1 train 11 4 8 14

Let’s analyse another construct of orders. User_id 3 made 12 orders before the final order labeled as test order:

  • From the data we know that order_number is being recycled for each user.
  • Instacart does not provide us the basket content for test order. This is in fact the objective for prediction.
orders %>% filter (user_id==3) %>% kable
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
1374495 3 prior 1 1 14 NA
444309 3 prior 2 3 19 9
3002854 3 prior 3 3 16 21
2037211 3 prior 4 2 18 20
2710558 3 prior 5 0 17 12
1972919 3 prior 6 0 16 7
1839752 3 prior 7 0 15 7
3225766 3 prior 8 0 17 7
3160850 3 prior 9 0 16 7
676467 3 prior 10 3 16 17
521107 3 prior 11 0 18 11
1402502 3 prior 12 1 15 15
2774568 3 test 13 5 15 11

3.3.7 Order And Its Basket Content

We know that in the last order (order_id 1187899), the user purchased 10 unique products by quering order_product_train with the order_id.

order_products_train %>% filter (order_id == 1187899) %>% kable
order_id product_id add_to_cart_order reordered
1187899 196 1 1
1187899 25133 2 1
1187899 38928 3 1
1187899 26405 4 1
1187899 39657 5 1
1187899 10258 6 1
1187899 13032 7 1
1187899 26088 8 1
1187899 27845 9 0
1187899 49235 10 1
1187899 46149 11 1

order_product_train/prior dataframe tells us which products were purchased in each order.

glimpse(order_products_train)
## Observations: 1,384,617
## Variables: 4
## $ order_id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 36, 36, 36, 36, 36, ...
## $ product_id        <int> 49302, 11109, 10246, 49683, 43633, 13176, 47...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7,...
## $ reordered         <int> 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,...

3.3.8 Users

We can derive number of unique users from order dataframe. There are 75000 test users for prediction, 131,209 training users for machine learning. We want to keep this large number of train records in mind for traiming during quick initial model build and verification.

table(orders$eval_set)
## 
##   prior    test   train 
## 3214874   75000  131209

4 Exploratory Data Analysis

In this section, we shall try to understand the buying behaviour by asking some interesting quesitons.

4.1 Orders

4.1.1 Orders By Products

We know that banana are the most popular products. The variance in the number of orders differs greatly for different products.

tmp = order_products_train %>%
  left_join(products) %>%
  group_by(product_name) %>%
  summarize(count=n()) %>%
  top_n(n=40, wt=count) %>%  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(product_name,count), y=percentage)) +  
  geom_col() + ggtitle('Product') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) 

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Product') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Products')

4.1.2 Orders by Departments

Certain departmens are clearly more popular, like personal care, snacks and pantray. They made up more than 30% of total product sold.

tmp = order_products_train %>%
  left_join(products) %>%
  left_join(departments) %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(department,count), y=percentage)) +  
  geom_col() + ggtitle('Departments') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) 

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Departments') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Departments')

4.1.3 Orders by Aisles

We looked into the buying trend of product by ailes and notice that certain aisle like candy chocolate and ice cream are much popular than others. The box plot illustrates the variance between the least popular and most popular.

par(mfrow=c(1,2))
tmp = order_products_train %>%
  left_join(products) %>%
  left_join(aisles) %>%
  group_by(aisle) %>%
  summarize(count=n()) %>%
  top_n(n=40, wt=count) %>%  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(aisle,count), y=percentage)) +  
  geom_col() + ggtitle('Aisles') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) +  ylab('Percentage of Orders') + xlab('Aisles')

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Aisles') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Aisles')

4.1.4 Buying Day Pattern

4.1.5 Buying Hour Pattern

4.1.6 Buying Behavior

5 Type of Prediction & Modelling Techniques

5.1 Selection of Model

5.2 Feature Engineering

5.3 Data Recoding

5.4 Prediction

6 Analysis & Recommendations

7 Citations

  1. The Instacart Online Grocery Shopping Dataset 2017“, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017
  2. Data Dictionary: https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b