One of the most challenging aspects of Data Science, as I have found over the last two terms in both Data Visualization and Statistical Learning, is preparing data for use with the packages available in R. To step beyond what we did in class, I wanted to apply some of the Statistical Learning techniques we learned to real-life datasets, and Kaggle is a good place to start. I found the Instacart challenge suitable both because of its real-life application and because it explores one of the topics covered in class: association rules and market basket analysis.
From the Kaggle competition site:
“Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.
Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.
In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.”
Because this is a large dataset, it was also an opportunity to explore some of the interesting data-munging libraries such as dplyr.
Start by loading the required libraries and datasets:
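The loading step itself is not shown in the chunks below, so here is a minimal sketch. It assumes the Kaggle CSV files keep their original names and sit in the working directory; fread() is used because str() below reports the data.table class.

library(data.table)   # fread() for fast CSV reading
library(dplyr)        # data manipulation and the %>% pipe
library(ggplot2)      # plots
library(knitr)        # kable() tables
library(treemap)      # treemap visualizations
library(arules)       # association rules / read.transactions()

orders         <- fread("orders.csv")
products       <- fread("products.csv")
order_products <- fread("order_products__prior.csv")
aisles         <- fread("aisles.csv")
departments    <- fread("departments.csv")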
# Exploratory Data Analysis
str(orders, max.level=1)
## Classes 'data.table' and 'data.frame': 1048575 obs. of 7 variables:
## $ order_id : int 2539329 2398795 473747 2254736 431534 3367565 550135 3108588 2295261 2550362 ...
## $ user_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ eval_set : chr "prior" "prior" "prior" "prior" ...
## $ order_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ order_dow : int 2 3 3 4 4 2 1 1 1 4 ...
## $ order_hour_of_day : int 8 7 12 7 15 7 9 14 16 8 ...
## $ days_since_prior_order: int NA 15 21 29 28 19 20 14 0 30 ...
## - attr(*, ".internal.selfref")=<externalptr>
kable(head(orders,12))
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
2539329 | 1 | prior | 1 | 2 | 8 | NA |
2398795 | 1 | prior | 2 | 3 | 7 | 15 |
473747 | 1 | prior | 3 | 3 | 12 | 21 |
2254736 | 1 | prior | 4 | 4 | 7 | 29 |
431534 | 1 | prior | 5 | 4 | 15 | 28 |
3367565 | 1 | prior | 6 | 2 | 7 | 19 |
550135 | 1 | prior | 7 | 1 | 9 | 20 |
3108588 | 1 | prior | 8 | 1 | 14 | 14 |
2295261 | 1 | prior | 9 | 1 | 16 | 0 |
2550362 | 1 | prior | 10 | 4 | 8 | 30 |
1187899 | 1 | train | 11 | 4 | 8 | 14 |
2168274 | 2 | prior | 1 | 2 | 11 | NA |
glimpse(orders)
## Observations: 1,048,575
## Variables: 7
## $ order_id <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set <chr> "prior", "prior", "prior", "prior", "pr...
## $ order_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <int> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
Above, we converted a few columns to more convenient types (numeric for the order hour and factors for the categorical variables). Throughout this report, we use the dplyr library and its pipe operator %>% to manipulate these large datasets. We start by looking at daily and weekly shopping habits.
Hour of day:
orders %>%
ggplot(aes(x=order_hour_of_day)) +
geom_histogram(stat="count",fill="skyblue")
Day of Week:
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="skyblue")
Although the data does not document which order_dow value corresponds to which day of the week, the hour-of-day plot clearly shows that a lot of shopping happens during working hours, which is a widely known pattern.
People tend to place a new order within 7 days of the last one, most often exactly 7 days after. The spike at 30 days appears to be a long-tail effect.
orders %>%
ggplot(aes(x=days_since_prior_order)) +
geom_histogram(stat="count",fill="skyblue")
orders %>%
  filter(eval_set=="prior") %>%
  count(order_number) %>%
  ggplot(aes(order_number,n)) +
  geom_line(color="skyblue", size=1) +
  geom_point(size=2, color="blue")
The number of prior orders per customer seems to follow a Poisson distribution; it would be worth investigating further whether this has been modeled previously.
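As a rough check of the Poisson idea (a sketch, not part of the original analysis; orders_per_user and lambda_hat are names introduced here), we could compare the observed number of prior orders per customer against a Poisson fit:

# Number of prior orders per customer
orders_per_user <- orders %>%
  filter(eval_set == "prior") %>%
  count(user_id, name = "n_orders")

# Maximum-likelihood estimate of the Poisson rate
lambda_hat <- mean(orders_per_user$n_orders)

# Observed vs. Poisson-expected frequencies for the first few counts
orders_per_user %>%
  count(n_orders, name = "observed") %>%
  mutate(expected = round(dpois(n_orders, lambda_hat) * sum(observed))) %>%
  head(10)

We then look at the distribution of basket sizes, i.e. the number of items per order.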
order_products %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="skyblue") +
geom_rug()+
coord_cartesian(xlim=c(0,80))
Next we look at the staples: the items with the greatest support \(P(X)\). Not surprisingly, these are items like fruit, vegetables, and milk.
tmp <- order_products %>%
group_by(product_id) %>%
summarize(count = n()) %>%
top_n(12, wt = count) %>%
left_join(select(products,product_id,product_name),by="product_id") %>%
arrange(desc(count))
kable(tmp)
product_id | count | product_name |
---|---|---|
24852 | 14136 | Banana |
13176 | 11639 | Bag of Organic Bananas |
21137 | 8233 | Organic Strawberries |
21903 | 7443 | Organic Baby Spinach |
47626 | 6148 | Large Lemon |
47766 | 5606 | Organic Avocado |
47209 | 5489 | Organic Hass Avocado |
16797 | 4920 | Strawberries |
26209 | 4609 | Limes |
27966 | 4200 | Organic Raspberries |
39275 | 3744 | Organic Blueberries |
27845 | 3719 | Organic Whole Milk |
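Note that the counts above are raw frequencies; support in the association-rules sense is the fraction of orders containing an item. A small sketch of that calculation (n_orders_total is a helper name introduced here):

# support(X) = (# orders containing X) / (total # orders)
n_orders_total <- n_distinct(order_products$order_id)

order_products %>%
  count(product_id, name = "count") %>%
  mutate(support = count / n_orders_total) %>%
  top_n(12, wt = count) %>%
  left_join(select(products, product_id, product_name), by = "product_id") %>%
  arrange(desc(support))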
The unfortunate “curse of dimensionality” certainly applies to retail. Here we see that roughly 40% of the items in these orders are being purchased by that customer for the first time (reordered = 0)!
tmp <- order_products %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(reordered = as.factor(reordered)) %>%
mutate(proportion = count/sum(count))
kable(tmp)
reordered | count | proportion |
---|---|---|
0 | 420967 | 0.4014658 |
1 | 627608 | 0.5985342 |
Although there is some overlap with the staples (notably bananas and milk), many of the high-frequency re-orders are more niche products, such as particular kinds of milk and bread.
tmp <-order_products %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
filter(n>40) %>%
top_n(10,wt=proportion_reordered) %>%
arrange(desc(proportion_reordered)) %>%
left_join(products,by="product_id")
kable(tmp)
product_id | proportion_reordered | n | product_name | aisle_id | department_id |
---|---|---|---|---|---|
1729 | 0.9200000 | 75 | 2% Lactose Free Milk | 84 | 16 |
36717 | 0.9122807 | 57 | Double Fiber Bread | 112 | 3 |
20940 | 0.9052632 | 285 | Organic Low Fat Milk | 84 | 16 |
11885 | 0.9019608 | 51 | Sparkling Water, Bottles | 115 | 7 |
21038 | 0.8906250 | 64 | Organic Spelt Tortillas | 128 | 3 |
39180 | 0.8847185 | 373 | Organic Lowfat 1% Milk | 84 | 16 |
24852 | 0.8815082 | 14136 | Banana | 24 | 4 |
24024 | 0.8809524 | 336 | 1% Lowfat Milk | 84 | 16 |
1402 | 0.8793103 | 58 | Plain Soymilk Creamer | 91 | 16 |
29447 | 0.8775982 | 433 | Milk, Organic, Vitamin D | 84 | 16 |
order_products %>%
left_join(orders,by="order_id") %>%
group_by(days_since_prior_order) %>%
summarize(mean_reorder = mean(reordered)) %>%
ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
geom_bar(stat="identity",fill="skyblue")
Looking at the reorder rate as a function of days since the prior order, we can see a “memory effect”: the sooner a customer places the next order, the more likely it is to contain products they have ordered before.
First we look at the number of SKUs offered per department and aisle. The area of each rectangle in the treemap is proportional to the number of SKUs in that category.
tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")
tmp2<-order_products %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(tmp, by = c("department_id", "aisle_id")) %>%
mutate(onesize = 1)
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")
Likewise, we have a similar visualization for what’s sold.
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")
Notice how produce and dairy account for a small number of SKUs yet comprise almost half of the units sold. Talk about a Pareto-style power distribution!
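As a quick sanity check of that claim (a sketch using the tmp2 summary built above; sku_share and unit_share are names introduced here), we can compare each department's share of SKUs with its share of units sold:

tmp2 %>%
  group_by(department) %>%
  summarize(skus = sum(n), units = sum(sumcount)) %>%
  mutate(sku_share = skus / sum(skus),        # share of distinct products offered
         unit_share = units / sum(units)) %>% # share of units actually sold
  arrange(desc(unit_share))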
In the next section, we will look to apply the association rules, or market basket analysis (MBA) we learned in class to this dataset.
We read the dataset into the transaction-class format required by the arules package.
tmp <- order_products %>%
  group_by(product_id) %>%
  left_join(products, by="product_id")
write.csv(tmp, file = "transactions.csv")
transactions<-read.transactions("transactions.csv", format = "single", sep = ",",cols = c(2,6))
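As an aside, one can likely skip the CSV round trip and build the transactions object directly in memory using arules' coercion from a list of item vectors (baskets and transactions_alt are names introduced here; this sketch should be equivalent up to transaction IDs):

baskets <- order_products %>%
  left_join(select(products, product_id, product_name), by = "product_id")

# Split product names by order to get one character vector per basket,
# then coerce the list to the arules "transactions" class
transactions_alt <- as(split(as.character(baskets$product_name), baskets$order_id),
                       "transactions")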
summary(transactions)
## transactions as itemMatrix in sparse format with
## 99575 rows (elements/itemsets/transactions) and
## 36865 columns (items) and a density of 0.0002856507
##
## most frequent items:
## Banana Bag of Organic Bananas Organic Strawberries
## 14136 11639 8233
## Organic Baby Spinach Large Lemon (Other)
## 7443 6148 1000977
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 5212 5591 6114 6249 6791 6632 6451 6074 5519 4959 4614 4062 3686 3302 2902
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2672 2321 2027 1881 1584 1378 1274 1137 976 850 764 654 501 499 401
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 332 296 252 236 209 155 145 132 103 72 69 67 58 57 53
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 35 35 27 18 24 19 19 14 12 8 7 5 3 3 6
## 61 62 63 64 65 66 67 68 70 72 74 75 76 77 80
## 2 1 2 3 3 2 1 2 3 2 2 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 9.00 10.53 14.00 80.00
##
## includes extended item information - examples:
## labels
## 1 \\""Constant Comment\\"" Black Tea
## 2 \\""Constant Comment\\"" Decaffeinated Black Tea Blend
## 3 \\""Darn Good\\"" Chili Mix
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 100000
## 3 1000008
inspect(transactions[1:3])
## items transactionID
## [1] {Bag of Organic Bananas,
## Bulgarian Yogurt,
## Cucumber Kirby,
## Lightly Smoked Sardines in Olive Oil,
## Organic 4% Milk Fat Whole Milk Cottage Cheese,
## Organic Celery Hearts,
## Organic Hass Avocado,
## Organic Whole String Cheese} 1
## [2] {Corn Tortillas,
## Extra Virgin Olive Oil,
## Gala Apples,
## Garnet Sweet Potato (Yam),
## Ground Cumin,
## I Heart Baby Kale,
## No Salt Added Black Beans,
## Organic Baby Carrots,
## Organic Baby Spinach,
## Organic Yellow Onion,
## Original Hummus,
## Snack Sticks Chicken & Rice Recipe Dog Treats,
## Total 2% All Natural Plain Greek Yogurt,
## Unscented Long Lasting Stick Deodorant,
## Wheat Sandwich Thins} 100000
## [3] {Daily Moisture Shampoo,
## DeTox Caffeine Free Organic Herbal Tea Bags,
## Ensure Plus Milk Chocolate Nutrition Shake,
## G Series Perform Frost Glacier Cherry Sports Drink,
## Original No Pulp 100% Florida Orange Juice,
## Triple Chocolate Ripple,
## ZzzQuil Liquid Warming Berry Flavor Sleep-Aid} 1000008
Next we apply the apriori algorithm from the arules package and inspect 10 of the 685 rules generated. We consider rules with a minimum support of 0.001 and a minimum confidence \(P(Y|X)\) of 0.25. The output consists of the association rules together with their support, confidence, and lift, where \(lift(X \rightarrow Y) = confidence(X \rightarrow Y)/support(Y)\).
groceryrules <- apriori(transactions, parameter = list(support = 0.001, confidence = 0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 99
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[36865 item(s), 99575 transaction(s)] done [0.22s].
## sorting and recoding items ... [1877 item(s)] done [0.02s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.08s].
## writing ... [685 rule(s)] done [0.01s].
## creating S4 object ... done [0.04s].
# Association rules
summary(groceryrules)
## set of 685 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 207 463 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 3.00 2.72 3.00 4.00
##
## summary of quality measures:
## support confidence lift
## Min. :0.001004 Min. :0.2500 Min. : 1.763
## 1st Qu.:0.001165 1st Qu.:0.2718 1st Qu.: 2.415
## Median :0.001386 Median :0.3012 Median : 3.369
## Mean :0.001950 Mean :0.3221 Mean : 5.789
## 3rd Qu.:0.001808 3rd Qu.:0.3540 3rd Qu.: 4.592
## Max. :0.023279 Max. :0.6250 Max. :86.850
##
## mining info:
## data ntransactions support confidence
## transactions 99575 0.001 0.25
inspect(groceryrules[1:10])
## lhs rhs support confidence lift
## [1] {Zero Calorie Cola} => {Soda} 0.001024354 0.3984375 35.141199
## [2] {Organic Fuji Apples} => {Bag of Organic Bananas} 0.001154908 0.3432836 2.936890
## [3] {Packaged Grape Tomatoes} => {Hass Avocados} 0.001154908 0.2584270 14.840176
## [4] {Packaged Grape Tomatoes} => {Strawberries} 0.001134823 0.2539326 5.139296
## [5] {Baby Cucumbers} => {Bag of Organic Bananas} 0.001205122 0.2515723 2.152274
## [6] {Lemon Sparkling Water} => {Grapefruit Sparkling Water} 0.001104695 0.3133903 65.284185
## [7] {Nonfat Icelandic Style Strawberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001134823 0.4109091 76.622234
## [8] {Sweet Potato Yam} => {Banana} 0.001174994 0.3861386 2.719988
## [9] {Organic Honey Sweet Whole Wheat Bread} => {Bag of Organic Bananas} 0.001064524 0.3642612 3.116359
## [10] {Organic Whole Grassmilk Milk} => {Bag of Organic Bananas} 0.001205122 0.3529412 3.019514
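With the rule set in hand, a natural next step (a sketch, not part of the original output) is to sort the rules by lift and inspect the strongest associations, or to restrict attention to rules that predict a particular product:

# Strongest associations by lift
inspect(head(sort(groceryrules, by = "lift"), 5))

# Rules whose right-hand side is a specific product (here, Banana)
banana_rules <- subset(groceryrules, rhs %in% "Banana")
inspect(head(sort(banana_rules, by = "confidence"), 5))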
These rules consider only the collective, cumulative set of transactions. Our submission to Kaggle, which is still a work in progress, involves two additional association dimensions: we would have to look at individual customers over time to discern their patterns and make inferences about their next purchases.