STAT3000 092524

Dataset Overview

The Instacart dataset includes the following CSV files:

products.csv: Contains information about individual products.
aisles.csv: Contains the names and IDs of aisles.
departments.csv: Contains the names and IDs of departments.
order_products_train.csv: Contains the product information for the training set orders.
order_products_prior.csv: Contains product information for all prior orders before the training set.

Read in the data

library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT) 
library(treemap)

orders <- fread('./orders.csv')
products <- fread('./products.csv')
order_products <- fread('./order_products__train.csv')
order_products_prior <- fread('./order_products__prior.csv')
aisles <- fread('./aisles.csv')
departments <- fread('./departments.csv')

Lets first have a look at these files:

orders

This file gives a list of all orders we have in the dataset. 1 row per order.

head(orders,12)

##     order_id user_id eval_set order_number order_dow order_hour_of_day
##        <int>   <int>   <char>        <int>     <int>             <int>
##  1:  2539329       1    prior            1         2                 8
##  2:  2398795       1    prior            2         3                 7
##  3:   473747       1    prior            3         3                12
##  4:  2254736       1    prior            4         4                 7
##  5:   431534       1    prior            5         4                15
##  6:  3367565       1    prior            6         2                 7
##  7:   550135       1    prior            7         1                 9
##  8:  3108588       1    prior            8         1                14
##  9:  2295261       1    prior            9         1                16
## 10:  2550362       1    prior           10         4                 8
## 11:  1187899       1    train           11         4                 8
## 12:  2168274       2    prior            1         2                11
##     days_since_prior_order
##                      <num>
##  1:                     NA
##  2:                     15
##  3:                     21
##  4:                     29
##  5:                     28
##  6:                     19
##  7:                     20
##  8:                     14
##  9:                      0
## 10:                     30
## 11:                     14
## 12:                     NA

order_products_train

This file gives us information about which products (product_id) were ordered. It also contains information of the order (add_to_cart_order) in which the products were put into the cart and information of whether this product is a re-order(1) or not(0).

For example, we see below that order_id 1 had 8 products, 4 of which are reorders.

Still we don’t know what these products are. This information is in the products.csv

(head(order_products,10))

##     order_id product_id add_to_cart_order reordered
##        <int>      <int>             <int>     <int>
##  1:        1      49302                 1         1
##  2:        1      11109                 2         1
##  3:        1      10246                 3         0
##  4:        1      49683                 4         0
##  5:        1      43633                 5         1
##  6:        1      13176                 6         0
##  7:        1      47209                 7         0
##  8:        1      22035                 8         1
##  9:       36      39612                 1         0
## 10:       36      19660                 2         1

str(order_products)

## Classes 'data.table' and 'data.frame':   1384617 obs. of  4 variables:
##  $ order_id         : int  1 1 1 1 1 1 1 1 36 36 ...
##  $ product_id       : int  49302 11109 10246 49683 43633 13176 47209 22035 39612 19660 ...
##  $ add_to_cart_order: int  1 2 3 4 5 6 7 8 1 2 ...
##  $ reordered        : int  1 1 0 0 1 0 0 1 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

products

This file contains the names of the products with their corresponding product_id. Furthermore the aisle and deparment are included.

(head(products,10))

##     product_id
##          <int>
##  1:          1
##  2:          2
##  3:          3
##  4:          4
##  5:          5
##  6:          6
##  7:          7
##  8:          8
##  9:          9
## 10:         10
##                                                          product_name aisle_id
##                                                                <char>    <int>
##  1:                                        Chocolate Sandwich Cookies       61
##  2:                                                  All-Seasons Salt      104
##  3:                              Robust Golden Unsweetened Oolong Tea       94
##  4: Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce       38
##  5:                                         Green Chile Anytime Sauce        5
##  6:                                                      Dry Nose Oil       11
##  7:                                    Pure Coconut Water With Orange       98
##  8:                                 Cut Russet Potatoes Steam N' Mash      116
##  9:                                 Light Strawberry Blueberry Yogurt      120
## 10:                    Sparkling Orange Juice & Prickly Pear Beverage      115
##     department_id
##             <int>
##  1:            19
##  2:            13
##  3:             7
##  4:             1
##  5:            13
##  6:            11
##  7:             7
##  8:             1
##  9:            16
## 10:             7

str(products)

## Classes 'data.table' and 'data.frame':   49688 obs. of  4 variables:
##  $ product_id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ product_name : chr  "Chocolate Sandwich Cookies" "All-Seasons Salt" "Robust Golden Unsweetened Oolong Tea" "Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce" ...
##  $ aisle_id     : int  61 104 94 38 5 11 98 116 120 115 ...
##  $ department_id: int  19 13 7 1 13 11 7 1 16 7 ...
##  - attr(*, ".internal.selfref")=<externalptr>

order_products_prior

This file is structurally the same as the other_products_train.csv.

(head(order_products_prior,10))

##     order_id product_id add_to_cart_order reordered
##        <int>      <int>             <int>     <int>
##  1:        2      33120                 1         1
##  2:        2      28985                 2         1
##  3:        2       9327                 3         0
##  4:        2      45918                 4         1
##  5:        2      30035                 5         0
##  6:        2      17794                 6         1
##  7:        2      40141                 7         1
##  8:        2       1819                 8         1
##  9:        2      43668                 9         0
## 10:        3      33754                 1         1

str(order_products_prior)

## Classes 'data.table' and 'data.frame':   32434489 obs. of  4 variables:
##  $ order_id         : int  2 2 2 2 2 2 2 2 2 3 ...
##  $ product_id       : int  33120 28985 9327 45918 30035 17794 40141 1819 43668 33754 ...
##  $ add_to_cart_order: int  1 2 3 4 5 6 7 8 9 1 ...
##  $ reordered        : int  1 1 0 1 0 1 1 1 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

aisles

This file contains the different aisles.

(head(aisles,10))

##     aisle_id                      aisle
##        <int>                     <char>
##  1:        1      prepared soups salads
##  2:        2          specialty cheeses
##  3:        3        energy granola bars
##  4:        4              instant foods
##  5:        5 marinades meat preparation
##  6:        6                      other
##  7:        7              packaged meat
##  8:        8            bakery desserts
##  9:        9                pasta sauce
## 10:       10           kitchen supplies

str(aisles)

## Classes 'data.table' and 'data.frame':   134 obs. of  2 variables:
##  $ aisle_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ aisle   : chr  "prepared soups salads" "specialty cheeses" "energy granola bars" "instant foods" ...
##  - attr(*, ".internal.selfref")=<externalptr>

departments

(head(departments,10))

##     department_id      department
##             <int>          <char>
##  1:             1          frozen
##  2:             2           other
##  3:             3          bakery
##  4:             4         produce
##  5:             5         alcohol
##  6:             6   international
##  7:             7       beverages
##  8:             8            pets
##  9:             9 dry goods pasta
## 10:            10            bulk

str(departments)

## Classes 'data.table' and 'data.frame':   21 obs. of  2 variables:
##  $ department_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ department   : chr  "frozen" "other" "bakery" "produce" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Recode variables

We should do some recoding and convert character variables to factors.

orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

When do people order?

Let’s have a look when people buy groceries online.

Hour of Day

orders %>% 
  ggplot(aes(x=order_hour_of_day)) + 
  geom_histogram(stat="count",fill="red")

There is a clear effect of hour of day on order volume. Most orders are between 8.00-18.00

Day of Week

orders %>% 
  ggplot(aes(x=order_dow)) + 
  geom_histogram(stat="count",fill="red")

There is a clear effect of day of the week. Most orders are on days 0 and 1. Unfortunately there is no info regarding which values represent which day, but one would assume that this is the weekend. ### How many items do people buy?

Train set

The add_to_cart_order column usually contains the sequence in which products were added to the cart in a given order. The “last” value of this column represents the number of items that were added to the cart because add_to_cart_order increments for each item added.

order_products %>% 
  group_by(order_id) %>% 
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") + 
  geom_rug()+
  coord_cartesian(xlim=c(0,80))

Let’s have a look how many items are in the orders. We can see that people most often order around 5 items. The distributions are comparable between the train and prior order set.

Bestsellers

Let’s have a look which products are sold most often (top10). And the clear winner is: - n() is a function from dplyr that counts the number of rows in each group. In other words, it returns the number of occurrences (or the frequency) of that particular product_id in the order_products dataset. - left_join: used to merge two data frames it keeps all the rows from the left data frame and merges matching rows from the right data frame.
- arrange similar to order but can be applied on dataframe

tmp <- order_products %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 
(tmp)

## # A tibble: 10 × 3
##    product_id count product_name          
##         <int> <int> <fct>                 
##  1      24852 18726 Banana                
##  2      13176 15480 Bag of Organic Bananas
##  3      21137 10894 Organic Strawberries  
##  4      21903  9784 Organic Baby Spinach  
##  5      47626  8135 Large Lemon           
##  6      47766  7409 Organic Avocado       
##  7      47209  7293 Organic Hass Avocado  
##  8      16797  6494 Strawberries          
##  9      26209  6033 Limes                 
## 10      27966  5546 Organic Raspberries

tmp %>% 
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())

Bananas

How often do people order the same items again?

tmp <- order_products %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count/sum(count))
str(tmp)

## tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
##  $ reordered : Factor w/ 2 levels "0","1": 1 2
##  $ count     : int [1:2] 555793 828824
##  $ proportion: num [1:2] 0.401 0.599

tmp %>% 
  ggplot(aes(x=reordered,y=count,fill=reordered))+
  geom_bar(stat="identity")

Organic vs Non-organic

What is the percentage of orders that are organic vs. not organic?

products <- products %>% 
    mutate(organic=ifelse(str_detect(tolower(products$product_name),'organic'),"organic","not organic"), 
           organic= as.factor(organic))
    
tmp <- order_products %>% 
  left_join(products, by="product_id") %>% 
  group_by(organic) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))
str(tmp)

## tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
##  $ organic   : Factor w/ 2 levels "not organic",..: 1 2
##  $ count     : int [1:2] 979000 405617
##  $ proportion: num [1:2] 0.707 0.293

tmp %>% 
  ggplot(aes(x=organic,y=count, fill=organic))+
  geom_bar(stat="identity")

Reordering Organic vs Non-Organic

tmp <- order_products %>% 
  left_join(products,by="product_id") %>% 
  group_by(organic) %>% 
  summarize(mean_reordered = mean(reordered))
str(tmp)

## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
##  $ organic       : Factor w/ 2 levels "not organic",..: 1 2
##  $ mean_reordered: num [1:2] 0.578 0.647

tmp %>% 
  ggplot(aes(x=organic,fill=organic,y=mean_reordered))+
  geom_bar(stat="identity")

People more often reorder organic products vs non-organic products.

Visualizing the Product Portfolio

Here is use to treemap package to visualize the structure of instacarts product portfolio. In total there are 21 departments containing 134 aisles. This map can be used as a layout of a physical store.

Step-by-Step Breakdown: We need to creat an intermediate dataframe for creating a map. Let;s call it tmp2

Aggregating Product Counts by Department and Aisle and

tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")

Aggregating Order Counts by Department and Aisle. For each product, it counts how many times that product was ordered.

tmp2<-order_products %>% 
  group_by(product_id) %>% 
  summarize(count=n()) %>% 
  left_join(products,by="product_id")

Calculates the total number of orders for all products in that department-aisle combination.

tmp2 <- tmp2 %>%
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  summarize(sumcount = sum(count)) %>% 
  left_join(tmp, by = c("department_id", "aisle_id")) %>% 
  mutate(onesize = 1)

How are aisles organized within departments?

treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",sortID="-sumcount", type="categorical", fontsize.legend = 0)

How often are products from the department/aisle sold?

The size of the boxes shows the number of sales.

treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")