1. Introduction

One of the most challenging aspects of Data Science, I have found in the last two terms in both Data Visualization and Statistical Learning, is preparing the data for usage in the packages available in R. For this reason, in order to step beyond what we did in class, I wanted to apply some of the Statistical Learning techniques we learned to real-life datasets. So a good place to start is Kaggle. I found the Instacart Challenge to be suitable because of both the real-life application and because it explores one of the topics covered in class, association rules and market basket analysis.

From the Kaggle competition site:

“Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.

In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order. "

Because this is a large dataset, it was also an opportunity to explore some of the intersting data-munging libraries such as dplyr.

2. Exploratory Data Analysis using dplyr

Start by loading the required libraries and datasets:

# Exploratory Data Analysis

str(orders, max.level=1)
## Classes 'data.table' and 'data.frame':   1048575 obs. of  7 variables:
##  $ order_id              : int  2539329 2398795 473747 2254736 431534 3367565 550135 3108588 2295261 2550362 ...
##  $ user_id               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ eval_set              : chr  "prior" "prior" "prior" "prior" ...
##  $ order_number          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ order_dow             : int  2 3 3 4 4 2 1 1 1 4 ...
##  $ order_hour_of_day     : int  8 7 12 7 15 7 9 14 16 8 ...
##  $ days_since_prior_order: int  NA 15 21 29 28 19 20 14 0 30 ...
##  - attr(*, ".internal.selfref")=<externalptr>
kable(head(orders,12))
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 8 NA
2398795 1 prior 2 3 7 15
473747 1 prior 3 3 12 21
2254736 1 prior 4 4 7 29
431534 1 prior 5 4 15 28
3367565 1 prior 6 2 7 19
550135 1 prior 7 1 9 20
3108588 1 prior 8 1 14 14
2295261 1 prior 9 1 16 0
2550362 1 prior 10 4 8 30
1187899 1 train 11 4 8 14
2168274 2 prior 1 2 11 NA
glimpse(orders)
## Observations: 1,048,575
## Variables: 7
## $ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set               <chr> "prior", "prior", "prior", "prior", "pr...
## $ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <int> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

A) When do customers order?

Throughout our report, we use the dplyr library to manipulate the large datasets and the piping “%>%” structure. We start by looking at daily and weekly shopping habits.

Hour of day:

orders %>% 
  ggplot(aes(x=order_hour_of_day)) + 
  geom_histogram(stat="count",fill="skyblue")

Day of Week:

orders %>% 
  ggplot(aes(x=order_dow)) + 
  geom_histogram(stat="count",fill="skyblue")

Although it’s not clear which days of the week are which, clearly there is a lot of shopping going on during work hours, which is a widely known fact.

B) When do they purchase again?

People tend to place a new order mostly within 7 days of the last order, predominantly exactly 7 days after. The 30 day spike seems to be a long-tail effect.

orders %>% 
  ggplot(aes(x=days_since_prior_order)) + 
  geom_histogram(stat="count",fill="skyblue")

orders %>% filter(eval_set=="prior") %>% count(order_number) %>% ggplot(aes(order_number,n)) + geom_line(color="skyblue", size=1)+geom_point(size=2, color="blue")

C) How many items do people buy?

This seems to follow a Poisson Distribution. It would be worth investigating further whether it has been modeled previously.

order_products %>% 
  group_by(order_id) %>% 
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="skyblue") + 
  geom_rug()+
  coord_cartesian(xlim=c(0,80))

D) Bestsellers (Ranking by Support)

These are the staples, the items with the greatest Support \(P(X)\). Not surprisingly, items like fruit, veggies and milk.

tmp <- order_products %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(12, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 
kable(tmp)
product_id count product_name
24852 14136 Banana
13176 11639 Bag of Organic Bananas
21137 8233 Organic Strawberries
21903 7443 Organic Baby Spinach
47626 6148 Large Lemon
47766 5606 Organic Avocado
47209 5489 Organic Hass Avocado
16797 4920 Strawberries
26209 4609 Limes
27966 4200 Organic Raspberries
39275 3744 Organic Blueberries
27845 3719 Organic Whole Milk

E) What percentage of items bought are reordered?

The unfortunate “curse of dimensionality”" certainly applies to retail. Here we see that over 40% of items are ordered just once!

tmp <- order_products %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count/sum(count))
kable(tmp)
reordered count proportion
0 420967 0.4014658
1 627608 0.5985342

F) Top reorder proportion

Although there is some overlap (notably bananas and milk), a lot of the high frequency re-orders are more niche products like special kinds of milk and bread.

tmp <-order_products %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>% 
  filter(n>40) %>% 
  top_n(10,wt=proportion_reordered) %>% 
  arrange(desc(proportion_reordered)) %>% 
  left_join(products,by="product_id")

kable(tmp)
product_id proportion_reordered n product_name aisle_id department_id
1729 0.9200000 75 2% Lactose Free Milk 84 16
36717 0.9122807 57 Double Fiber Bread 112 3
20940 0.9052632 285 Organic Low Fat Milk 84 16
11885 0.9019608 51 Sparkling Water, Bottles 115 7
21038 0.8906250 64 Organic Spelt Tortillas 128 3
39180 0.8847185 373 Organic Lowfat 1% Milk 84 16
24852 0.8815082 14136 Banana 24 4
24024 0.8809524 336 1% Lowfat Milk 84 16
1402 0.8793103 58 Plain Soymilk Creamer 91 16
29447 0.8775982 433 Milk, Organic, Vitamin D 84 16
order_products %>% 
  left_join(orders,by="order_id") %>% 
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered)) %>%
  ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
  geom_bar(stat="identity",fill="skyblue")

As far as days since prior re-order, we can see there is a “memory effect”. That is, the closer to the first order, the greater that product gets recalled and re-ordered.

F) Analyzing SKU distribution with Treemaps

First we look at the number of SKUs offered per categories. The area of the rectangles are proportional to the number considered.

tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")

tmp2<-order_products %>% 
  group_by(product_id) %>% 
  summarize(count=n()) %>% 
  left_join(products,by="product_id") %>% 
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  summarize(sumcount = sum(count)) %>% 
  left_join(tmp, by = c("department_id", "aisle_id")) %>% 
  mutate(onesize = 1)

treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")

Likewise, we have a similar visualization for what’s sold.

treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")

Notice how produce and dairy are a small number of SKUs that yet comprise almost half of units sold. Talk about Pareto power distribution!

In the next section, we will look to apply the association rules, or market basket analysis (MBA) we learned in class to this dataset.

3. Performing Market Basket Analysis with arules apriori Rule

We read the dataset into the transaction type dataform required for ingestion into the arules algorithm.

tmp <-order_products %>% 
  group_by(product_id) %>% 
  left_join(products,by="product_id")

write.csv(tmp, file = "transactions.csv")
transactions<-read.transactions("transactions.csv", format = "single", sep = ",",cols = c(2,6))

summary(transactions)
## transactions as itemMatrix in sparse format with
##  99575 rows (elements/itemsets/transactions) and
##  36865 columns (items) and a density of 0.0002856507 
## 
## most frequent items:
##                 Banana Bag of Organic Bananas   Organic Strawberries 
##                  14136                  11639                   8233 
##   Organic Baby Spinach            Large Lemon                (Other) 
##                   7443                   6148                1000977 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 5212 5591 6114 6249 6791 6632 6451 6074 5519 4959 4614 4062 3686 3302 2902 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 2672 2321 2027 1881 1584 1378 1274 1137  976  850  764  654  501  499  401 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  332  296  252  236  209  155  145  132  103   72   69   67   58   57   53 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   35   35   27   18   24   19   19   14   12    8    7    5    3    3    6 
##   61   62   63   64   65   66   67   68   70   72   74   75   76   77   80 
##    2    1    2    3    3    2    1    2    3    2    2    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    9.00   10.53   14.00   80.00 
## 
## includes extended item information - examples:
##                                                   labels
## 1                     \\""Constant Comment\\"" Black Tea
## 2 \\""Constant Comment\\"" Decaffeinated Black Tea Blend
## 3                            \\""Darn Good\\"" Chili Mix
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2        100000
## 3       1000008
inspect(transactions[1:3])
##     items                                                transactionID
## [1] {Bag of Organic Bananas,                                          
##      Bulgarian Yogurt,                                                
##      Cucumber Kirby,                                                  
##      Lightly Smoked Sardines in Olive Oil,                            
##      Organic 4% Milk Fat Whole Milk Cottage Cheese,                   
##      Organic Celery Hearts,                                           
##      Organic Hass Avocado,                                            
##      Organic Whole String Cheese}                              1      
## [2] {Corn Tortillas,                                                  
##      Extra Virgin Olive Oil,                                          
##      Gala Apples,                                                     
##      Garnet Sweet Potato (Yam),                                       
##      Ground Cumin,                                                    
##      I Heart Baby Kale,                                               
##      No Salt Added Black Beans,                                       
##      Organic Baby Carrots,                                            
##      Organic Baby Spinach,                                            
##      Organic Yellow Onion,                                            
##      Original Hummus,                                                 
##      Snack Sticks Chicken & Rice Recipe Dog Treats,                   
##      Total 2% All Natural Plain Greek Yogurt,                         
##      Unscented Long Lasting Stick Deodorant,                          
##      Wheat Sandwich Thins}                                     100000 
## [3] {Daily Moisture Shampoo,                                          
##      DeTox Caffeine Free Organic Herbal Tea Bags,                     
##      Ensure Plus Milk Chocolate Nutrition Shake,                      
##      G Series Perform Frost Glacier Cherry Sports Drink,              
##      Original No Pulp 100% Florida Orange Juice,                      
##      Triple Chocolate Ripple,                                         
##      ZzzQuil Liquid Warming Berry Flavor Sleep-Aid}            1000008

And next we apply the arules apriori algorithm and inspect 10 of the 685 rules generated. We consider rules for items with minimum support of .001, generating a combo with confidence \(P(Y|X)\) of at least .25. The output of the algorithm are the association rules together with the confidence and lift \(confidence(X \rightarrow Y)/support(Y)\).

groceryrules <- apriori(transactions, parameter = list(support =
                                                      0.001, confidence = 0.25))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 99 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[36865 item(s), 99575 transaction(s)] done [0.22s].
## sorting and recoding items ... [1877 item(s)] done [0.02s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.08s].
## writing ... [685 rule(s)] done [0.01s].
## creating S4 object  ... done [0.04s].
# Association rules
summary(groceryrules)
## set of 685 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 207 463  15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    3.00    2.72    3.00    4.00 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.001004   Min.   :0.2500   Min.   : 1.763  
##  1st Qu.:0.001165   1st Qu.:0.2718   1st Qu.: 2.415  
##  Median :0.001386   Median :0.3012   Median : 3.369  
##  Mean   :0.001950   Mean   :0.3221   Mean   : 5.789  
##  3rd Qu.:0.001808   3rd Qu.:0.3540   3rd Qu.: 4.592  
##  Max.   :0.023279   Max.   :0.6250   Max.   :86.850  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         99575   0.001       0.25
inspect(groceryrules[1:10])
##      lhs                                           rhs                                                 support confidence      lift
## [1]  {Zero Calorie Cola}                        => {Soda}                                          0.001024354  0.3984375 35.141199
## [2]  {Organic Fuji Apples}                      => {Bag of Organic Bananas}                        0.001154908  0.3432836  2.936890
## [3]  {Packaged Grape Tomatoes}                  => {Hass Avocados}                                 0.001154908  0.2584270 14.840176
## [4]  {Packaged Grape Tomatoes}                  => {Strawberries}                                  0.001134823  0.2539326  5.139296
## [5]  {Baby Cucumbers}                           => {Bag of Organic Bananas}                        0.001205122  0.2515723  2.152274
## [6]  {Lemon Sparkling Water}                    => {Grapefruit Sparkling Water}                    0.001104695  0.3133903 65.284185
## [7]  {Nonfat Icelandic Style Strawberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001134823  0.4109091 76.622234
## [8]  {Sweet Potato Yam}                         => {Banana}                                        0.001174994  0.3861386  2.719988
## [9]  {Organic Honey Sweet Whole Wheat Bread}    => {Bag of Organic Bananas}                        0.001064524  0.3642612  3.116359
## [10] {Organic Whole Grassmilk Milk}             => {Bag of Organic Bananas}                        0.001205122  0.3529412  3.019514

4. Model Improvement, Conclusion and Next Steps

These rules consider only the collective and cumulative transactions. Our submission to Kaggle, which we are still working on, involves two additional association dimensions. We would have to look at individual customers over time to discern patterns and make inferences about their next purchases.