1 Introduction, Research Goal & Objective

3 Dataset Preparation

3.1 R Libraries Used

Here are the R libraries used in this analysis.

library(knitr)      # web widget
library(tidyverse)  # data manipulation
library(data.table) # fast file reading
library(treemap)    # tree visualization
library(corrgram)   # correlation graphics

3.2 Import Datasets

setwd('./datasets')
aisles      = fread('aisles.csv',      stringsAsFactors = TRUE)
departments = fread('departments.csv', stringsAsFactors = TRUE)
products    = fread('products.csv',    stringsAsFactors = TRUE)
orders      = fread('orders.csv',      stringsAsFactors = TRUE)
order_products_train = fread('order_products__train.csv')
order_products_prior = fread('order_products__prior.csv')

3.3 Understanding Datasets

These datasets were sourced from an existing Kaggle competiotion (https://www.kaggle.com/c/instacart-market-basket-analysis/data)

The dataset for this competition is a relational set of files describing customers’ orders over time
The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users
For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order
Provide the week and hour of day the order was placed, and a relative measure of time between orders

Total six datasets were imported. Follwing section will explore each datasets in further detail.

3.3.1 Aisles

There are 134 ailes in this dataset.

str(aisles)

## Classes 'data.table' and 'data.frame':   134 obs. of  2 variables:
##  $ aisle_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ aisle   : Factor w/ 134 levels "air fresheners candles",..: 107 121 43 75 82 92 95 6 101 77 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Here are the names of all ailes aphabetically ordered, seprated by comma.

paste(sort(aisles$aisle), collapse=', ')

## [1] "air fresheners candles, asian foods, baby accessories, baby bath body care, baby food formula, bakery desserts, baking ingredients, baking supplies decor, beauty, beers coolers, body lotions soap, bread, breakfast bakery, breakfast bars pastries, bulk dried fruits vegetables, bulk grains rice dried goods, buns rolls, butter, candy chocolate, canned fruit applesauce, canned jarred vegetables, canned meals beans, canned meat seafood, cat food care, cereal, chips pretzels, cleaning products, cocoa drink mixes, coffee, cold flu allergy, condiments, cookies cakes, crackers, cream, deodorants, diapers wipes, digestion, dish detergents, dog food care, doughs gelatins bake mixes, dry pasta, eggs, energy granola bars, energy sports drinks, eye ear care, facial care, feminine care, first aid, food storage, fresh dips tapenades, fresh fruits, fresh herbs, fresh pasta, fresh vegetables, frozen appetizers sides, frozen breads doughs, frozen breakfast, frozen dessert, frozen juice, frozen meals, frozen meat seafood, frozen pizza, frozen produce, frozen vegan vegetarian, fruit vegetable snacks, grains rice dried goods, granola, hair care, honeys syrups nectars, hot cereal pancake mixes, hot dogs bacon sausage, ice cream ice, ice cream toppings, indian foods, instant foods, juice nectars, kitchen supplies, kosher foods, latino foods, laundry, lunch meat, marinades meat preparation, meat counter, milk, mint gum, missing, more household, muscles joints pain relief, nuts seeds dried fruit, oils vinegars, oral hygiene, other, other creams cheeses, packaged cheese, packaged meat, packaged poultry, packaged produce, packaged seafood, packaged vegetables fruits, paper goods, pasta sauce, pickled goods olives, plates bowls cups flatware, popcorn jerky, poultry counter, prepared meals, prepared soups salads, preserved dips spreads, protein meal replacements, red wines, refrigerated, refrigerated pudding desserts, salad dressing toppings, seafood counter, shave needs, skin care, soap, soft drinks, soup broth bouillon, soy lactosefree, specialty cheeses, specialty wines champagnes, spices seasonings, spirits, spreads, tea, tofu meat alternatives, tortillas flat bread, trail mix snack mix, trash bags liners, vitamins supplements, water seltzer sparkling water, white wines, yogurt"

3.3.2 Departments

There are 21 departments in this dataset.

str(departments)

## Classes 'data.table' and 'data.frame':   21 obs. of  2 variables:
##  $ department_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ department   : Factor w/ 21 levels "alcohol","babies",..: 11 16 3 20 1 13 4 19 10 6 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Names of all deparments are listed below in aphabetically ordered.

sort(departments$department)

##  [1] alcohol         babies          bakery          beverages      
##  [5] breakfast       bulk            canned goods    dairy eggs     
##  [9] deli            dry goods pasta frozen          household      
## [13] international   meat seafood    missing         other          
## [17] pantry          personal care   pets            produce        
## [21] snacks         
## 21 Levels: alcohol babies bakery beverages breakfast bulk ... snacks

3.3.3 Products

There are 49,688 products in the catalogue within 134 aisles and 21 departments.

str(products)
length(unique(products$aisle_id))
length(unique(products$department_id))

## Classes 'data.table' and 'data.frame':   49688 obs. of  4 variables:
##  $ product_id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_name : Factor w/ 49688 levels "#2 Coffee Filters",..: 8642 2018 38573 40977 18584 13349 36616 11763 23264 41864 ...
##  $ aisle_id     : int  61 104 94 38 5 11 98 116 120 115 ...
##  $ department_id: int  19 13 7 1 13 11 7 1 16 7 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
## [1] 134
## [1] 21

3.3.4 Departments And Its Relevant Products

Table below shows a glimpse of 3 products for each deparment, giving an idea the how the products being categorized. Only sample five departments are shown.

left_join(departments, products) %>% select(department, product_name ) %>%
  group_by(department) %>%
  sample_n(3) %>%
  summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable

department	three_examples_product
dry goods pasta	Creamy Four Cheese Rice \|\| Garden Vegetable Ready Rice \|\| Roasted Garlic & Herb Brown Rice Medley
produce	Orange Navel Heirloom. \|\| Crenshaw Melon \|\| Cocktail Tomatoes
pets	Flaked Tuna Feast Cat Food \|\| Rawhide Chip Rolls \|\| Adult Complete Nutrition Dry Food for Dogs
bakery	Yellow Corn Organic Tortillas \|\| Tortillas, Flour \|\| Small Soft and Fluffy Flour Tortillas
babies	Fruit & Veggie Twist Apple Mango Pear & Kale \|\| Organic Sunny Days Apple Snack Bars \|\| Slip On Size 6 Diapers

3.3.5 Ailes And Its Relevant Products

Table below shows a glimpse of 3 products for each ailes, giving an idea the how the products being categorized. Only sample five ailes are shown

left_join(aisles, products) %>%
  select(aisle, product_name ) %>%  group_by(aisle) %>%
  sample_n(3) %>% 
  summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable

aisle	three_examples_product
cookies cakes	Original Animal Crackers \|\| Soft Baked Sanibel Snickerdoodle Cookies \|\| Sandies Toffee Shortbread Cookies
fresh fruits	Mandarin Box \|\| Organic Honeydew \|\| Red Globe Grapes
cat food care	Sliced Salmon Canned Catfood Entree \|\| Liver & Chicken Dinner Classic Pate Cat Food \|\| Purina Mixed Grill Classic Pate Cat Food
food storage	Take Alongs Round Small Bowls \|\| Zipper Bags Sandwich - 50 CT \|\| Takealongs Serving Bowl
packaged seafood	Alaskan Sockeye Smoked Salmon \|\| Wild Alaskan Salmon Sockeye Salmon Gravlox \|\| Herring In Wine Sauce

3.3.6 Orders

There are over 3 millions observations in orders dataset. Each row represent an unique order. Each variable potentially can be used as predictors.

variable	description
user_id	unique identity of a customer
eval_set	value “prior” identifies that it is previous order before leading to last order order value “train” is the last order with associated products that the user order value “test” is the last order, however the contest is to predict what will be the product the user will buy
order_number	true
order_dow	day of week the order is placed
order_hour_of_day	which hour the order is placed
days_since_prior_order	how many days lapsed from this order to the previous order

glimpse(orders)

## Observations: 3,421,083
## Variables: 7
## $ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set               <fct> prior, prior, prior, prior, prior, prio...
## $ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...

Let’s analyse the construct of one user. For user_id 1, there were 10 prior orders (order number from 1 to 10), last order is a train.

orders %>% filter (user_id==1) %>% kable

order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
2539329	1	prior	1	2	8	NA
2398795	1	prior	2	3	7	15
473747	1	prior	3	3	12	21
2254736	1	prior	4	4	7	29
431534	1	prior	5	4	15	28
3367565	1	prior	6	2	7	19
550135	1	prior	7	1	9	20
3108588	1	prior	8	1	14	14
2295261	1	prior	9	1	16	0
2550362	1	prior	10	4	8	30
1187899	1	train	11	4	8	14

Let’s analyse another construct of orders. User_id 3 made 12 orders before the final order labeled as test order:

From the data we know that order_number is being recycled for each user.
Instacart does not provide us the basket content for test order. This is in fact the objective for prediction.

orders %>% filter (user_id==3) %>% kable

order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
1374495	3	prior	1	1	14	NA
444309	3	prior	2	3	19	9
3002854	3	prior	3	3	16	21
2037211	3	prior	4	2	18	20
2710558	3	prior	5	0	17	12
1972919	3	prior	6	0	16	7
1839752	3	prior	7	0	15	7
3225766	3	prior	8	0	17	7
3160850	3	prior	9	0	16	7
676467	3	prior	10	3	16	17
521107	3	prior	11	0	18	11
1402502	3	prior	12	1	15	15
2774568	3	test	13	5	15	11

3.3.7 Order And Its Basket Content

We know that in the last order (order_id 1187899), the user purchased 10 unique products by quering order_product_train with the order_id.

order_products_train %>% filter (order_id == 1187899) %>% kable

order_id	product_id	add_to_cart_order	reordered
1187899	196	1	1
1187899	25133	2	1
1187899	38928	3	1
1187899	26405	4	1
1187899	39657	5	1
1187899	10258	6	1
1187899	13032	7	1
1187899	26088	8	1
1187899	27845	9	0
1187899	49235	10	1
1187899	46149	11	1

order_product_train/prior dataframe tells us which products were purchased in each order.

glimpse(order_products_train)

## Observations: 1,384,617
## Variables: 4
## $ order_id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 36, 36, 36, 36, 36, ...
## $ product_id        <int> 49302, 11109, 10246, 49683, 43633, 13176, 47...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7,...
## $ reordered         <int> 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,...

3.3.8 Users

We can derive number of unique users from order dataframe. There are 75000 test users for prediction, 131,209 training users for machine learning. We want to keep this large number of train records in mind for traiming during quick initial model build and verification.

table(orders$eval_set)

## 
##   prior    test   train 
## 3214874   75000  131209

4 Exploratory Data Analysis

In this section, we shall try to understand the buying behaviour by asking some interesting quesitons.

What usually does people buy, and which one they usually reorder
When do they buy (day and time)? Is there a buying trend and does it influence what they buy ?

4.1 Orders

4.1.1 Orders By Products

We know that banana are the most popular products. The variance in the number of orders differs greatly for different products.

tmp = order_products_train %>%
  left_join(products) %>%
  group_by(product_name) %>%
  summarize(count=n()) %>%
  top_n(n=40, wt=count) %>%  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(product_name,count), y=percentage)) +  
  geom_col() + ggtitle('Product') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) 

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Product') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Products')

4.1.2 Orders by Departments

Certain departmens are clearly more popular, like personal care, snacks and pantray. They made up more than 30% of total product sold.

tmp = order_products_train %>%
  left_join(products) %>%
  left_join(departments) %>%
  group_by(department) %>%
  summarize(count=n()) %>%
  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(department,count), y=percentage)) +  
  geom_col() + ggtitle('Departments') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) 

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Departments') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Departments')

4.1.3 Orders by Aisles

We looked into the buying trend of product by ailes and notice that certain aisle like candy chocolate and ice cream are much popular than others. The box plot illustrates the variance between the least popular and most popular.

par(mfrow=c(1,2))
tmp = order_products_train %>%
  left_join(products) %>%
  left_join(aisles) %>%
  group_by(aisle) %>%
  summarize(count=n()) %>%
  top_n(n=40, wt=count) %>%  mutate(percentage=count/sum(count))

ggplot (tmp, aes(x=reorder(aisle,count), y=percentage)) +  
  geom_col() + ggtitle('Aisles') + ylab('Percentage of Orders') +
  theme (
    axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
    axis.title.x = element_blank()) +  ylab('Percentage of Orders') + xlab('Aisles')

ggplot (data = tmp, aes( x= '', y=percentage )) + 
  ggtitle('Aisles') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Aisles')

4.1.4 Buying Day Pattern

4.1.5 Buying Hour Pattern

4.1.6 Buying Behavior

5 Type of Prediction & Modelling Techniques

5.1 Selection of Model

5.2 Feature Engineering

5.3 Data Recoding

5.4 Prediction

6 Analysis & Recommendations

7 Citations

The Instacart Online Grocery Shopping Dataset 2017“, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017
Data Dictionary: https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b

Instacart Market Basket Analysis

Yong Keh Soon

18 November 2018

1 Introduction, Research Goal & Objective

3 Dataset Preparation

3.1 R Libraries Used

3.2 Import Datasets

3.3 Understanding Datasets

3.3.1 Aisles

3.3.2 Departments

3.3.3 Products

3.3.4 Departments And Its Relevant Products

3.3.5 Ailes And Its Relevant Products

3.3.6 Orders

3.3.7 Order And Its Basket Content

3.3.8 Users

4 Exploratory Data Analysis

4.1 Orders

4.1.1 Orders By Products

4.1.2 Orders by Departments

4.1.3 Orders by Aisles

4.1.4 Buying Day Pattern

4.1.5 Buying Hour Pattern

4.1.6 Buying Behavior

5 Type of Prediction & Modelling Techniques

5.1 Selection of Model

5.2 Feature Engineering

5.3 Data Recoding

5.4 Prediction

6 Analysis & Recommendations

7 Citations

Instacart Market Basket Analysis

Yong Keh Soon

18 November 2018

1 Introduction, Research Goal & Objective

2 Related works

3 Dataset Preparation

3.1 R Libraries Used

3.2 Import Datasets

3.3 Understanding Datasets

3.3.1 Aisles

3.3.2 Departments

3.3.3 Products

3.3.4 Departments And Its Relevant Products

3.3.5 Ailes And Its Relevant Products

3.3.6 Orders

3.3.7 Order And Its Basket Content

3.3.8 Users

4 Exploratory Data Analysis

4.1 Orders

4.1.1 Orders By Products

4.1.2 Orders by Departments

4.1.3 Orders by Aisles

4.1.4 Buying Day Pattern

4.1.5 Buying Hour Pattern

4.1.6 Buying Behavior

5 Type of Prediction & Modelling Techniques

5.1 Selection of Model

5.2 Feature Engineering

5.3 Data Recoding

5.4 Prediction

6 Analysis & Recommendations

7 Citations