The Instacart dataset consists of multiple CSV files that provide detailed information on customer orders, products, aisles, and departments. It is commonly used for analyzing e-commerce customer behaviors and developing recommendation systems. This report provides a brief overview of the data and sets up the analysis for further exploration.
The Instacart dataset includes the following CSV files:
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)
library(treemap)
orders <- fread('./orders.csv')
products <- fread('./products.csv')
order_products <- fread('./order_products__train.csv')
order_products_prior <- fread('./order_products__prior.csv')
aisles <- fread('./aisles.csv')
departments <- fread('./departments.csv')
Lets first have a look at these files:
This file gives a list of all orders we have in the dataset. 1 row per order.
head(orders,12)
## order_id user_id eval_set order_number order_dow order_hour_of_day
## <int> <int> <char> <int> <int> <int>
## 1: 2539329 1 prior 1 2 8
## 2: 2398795 1 prior 2 3 7
## 3: 473747 1 prior 3 3 12
## 4: 2254736 1 prior 4 4 7
## 5: 431534 1 prior 5 4 15
## 6: 3367565 1 prior 6 2 7
## 7: 550135 1 prior 7 1 9
## 8: 3108588 1 prior 8 1 14
## 9: 2295261 1 prior 9 1 16
## 10: 2550362 1 prior 10 4 8
## 11: 1187899 1 train 11 4 8
## 12: 2168274 2 prior 1 2 11
## days_since_prior_order
## <num>
## 1: NA
## 2: 15
## 3: 21
## 4: 29
## 5: 28
## 6: 19
## 7: 20
## 8: 14
## 9: 0
## 10: 30
## 11: 14
## 12: NA
This file gives us information about which products (product_id) were ordered. It also contains information of the order (add_to_cart_order) in which the products were put into the cart and information of whether this product is a re-order(1) or not(0).
For example, we see below that order_id 1 had 8 products, 4 of which are reorders.
Still we don’t know what these products are. This information is in the products.csv
(head(order_products,10))
## order_id product_id add_to_cart_order reordered
## <int> <int> <int> <int>
## 1: 1 49302 1 1
## 2: 1 11109 2 1
## 3: 1 10246 3 0
## 4: 1 49683 4 0
## 5: 1 43633 5 1
## 6: 1 13176 6 0
## 7: 1 47209 7 0
## 8: 1 22035 8 1
## 9: 36 39612 1 0
## 10: 36 19660 2 1
str(order_products)
## Classes 'data.table' and 'data.frame': 1384617 obs. of 4 variables:
## $ order_id : int 1 1 1 1 1 1 1 1 36 36 ...
## $ product_id : int 49302 11109 10246 49683 43633 13176 47209 22035 39612 19660 ...
## $ add_to_cart_order: int 1 2 3 4 5 6 7 8 1 2 ...
## $ reordered : int 1 1 0 0 1 0 0 1 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
This file contains the names of the products with their corresponding product_id. Furthermore the aisle and deparment are included.
(head(products,10))
## product_id
## <int>
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
## 10: 10
## product_name aisle_id
## <char> <int>
## 1: Chocolate Sandwich Cookies 61
## 2: All-Seasons Salt 104
## 3: Robust Golden Unsweetened Oolong Tea 94
## 4: Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38
## 5: Green Chile Anytime Sauce 5
## 6: Dry Nose Oil 11
## 7: Pure Coconut Water With Orange 98
## 8: Cut Russet Potatoes Steam N' Mash 116
## 9: Light Strawberry Blueberry Yogurt 120
## 10: Sparkling Orange Juice & Prickly Pear Beverage 115
## department_id
## <int>
## 1: 19
## 2: 13
## 3: 7
## 4: 1
## 5: 13
## 6: 11
## 7: 7
## 8: 1
## 9: 16
## 10: 7
str(products)
## Classes 'data.table' and 'data.frame': 49688 obs. of 4 variables:
## $ product_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ product_name : chr "Chocolate Sandwich Cookies" "All-Seasons Salt" "Robust Golden Unsweetened Oolong Tea" "Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce" ...
## $ aisle_id : int 61 104 94 38 5 11 98 116 120 115 ...
## $ department_id: int 19 13 7 1 13 11 7 1 16 7 ...
## - attr(*, ".internal.selfref")=<externalptr>
This file is structurally the same as the other_products_train.csv.
(head(order_products_prior,10))
## order_id product_id add_to_cart_order reordered
## <int> <int> <int> <int>
## 1: 2 33120 1 1
## 2: 2 28985 2 1
## 3: 2 9327 3 0
## 4: 2 45918 4 1
## 5: 2 30035 5 0
## 6: 2 17794 6 1
## 7: 2 40141 7 1
## 8: 2 1819 8 1
## 9: 2 43668 9 0
## 10: 3 33754 1 1
str(order_products_prior)
## Classes 'data.table' and 'data.frame': 32434489 obs. of 4 variables:
## $ order_id : int 2 2 2 2 2 2 2 2 2 3 ...
## $ product_id : int 33120 28985 9327 45918 30035 17794 40141 1819 43668 33754 ...
## $ add_to_cart_order: int 1 2 3 4 5 6 7 8 9 1 ...
## $ reordered : int 1 1 0 1 0 1 1 1 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
This file contains the different aisles.
(head(aisles,10))
## aisle_id aisle
## <int> <char>
## 1: 1 prepared soups salads
## 2: 2 specialty cheeses
## 3: 3 energy granola bars
## 4: 4 instant foods
## 5: 5 marinades meat preparation
## 6: 6 other
## 7: 7 packaged meat
## 8: 8 bakery desserts
## 9: 9 pasta sauce
## 10: 10 kitchen supplies
str(aisles)
## Classes 'data.table' and 'data.frame': 134 obs. of 2 variables:
## $ aisle_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ aisle : chr "prepared soups salads" "specialty cheeses" "energy granola bars" "instant foods" ...
## - attr(*, ".internal.selfref")=<externalptr>
(head(departments,10))
## department_id department
## <int> <char>
## 1: 1 frozen
## 2: 2 other
## 3: 3 bakery
## 4: 4 produce
## 5: 5 alcohol
## 6: 6 international
## 7: 7 beverages
## 8: 8 pets
## 9: 9 dry goods pasta
## 10: 10 bulk
str(departments)
## Classes 'data.table' and 'data.frame': 21 obs. of 2 variables:
## $ department_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ department : chr "frozen" "other" "bakery" "produce" ...
## - attr(*, ".internal.selfref")=<externalptr>
We should do some recoding and convert character variables to factors.
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
Let’s have a look when people buy groceries online.
orders %>%
ggplot(aes(x=order_hour_of_day)) +
geom_histogram(stat="count",fill="red")
There is a clear effect of hour of day on order volume. Most orders are
between 8.00-18.00
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="red")
There is a clear effect of day of the week. Most orders are on days 0
and 1. Unfortunately there is no info regarding which values represent
which day, but one would assume that this is the weekend. ### How many
items do people buy?
order_products %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="red") +
geom_rug()+
coord_cartesian(xlim=c(0,80))
Let’s have a look how many items are in the orders. We can see that
people most often order around 5 items. The distributions are comparable
between the train and prior order set.
Let’s have a look which products are sold most often (top10). And the
clear winner is: - n() is a function from dplyr that counts the number
of rows in each group. In other words, it returns the number of
occurrences (or the frequency) of that particular product_id in the
order_products dataset. - left_join: used to merge two data frames it
keeps all the rows from the left data frame and merges matching rows
from the right data frame.
- arrange similar to order but can be applied on dataframe
tmp <- order_products %>%
group_by(product_id) %>%
summarize(count = n()) %>%
top_n(10, wt = count) %>%
left_join(select(products,product_id,product_name),by="product_id") %>%
arrange(desc(count))
(tmp)
## # A tibble: 10 × 3
## product_id count product_name
## <int> <int> <fct>
## 1 24852 18726 Banana
## 2 13176 15480 Bag of Organic Bananas
## 3 21137 10894 Organic Strawberries
## 4 21903 9784 Organic Baby Spinach
## 5 47626 8135 Large Lemon
## 6 47766 7409 Organic Avocado
## 7 47209 7293 Organic Hass Avocado
## 8 16797 6494 Strawberries
## 9 26209 6033 Limes
## 10 27966 5546 Organic Raspberries
tmp %>%
ggplot(aes(x=reorder(product_name,-count), y=count))+
geom_bar(stat="identity",fill="red")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())
Bananas
tmp <- order_products %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(reordered = as.factor(reordered)) %>%
mutate(proportion = count/sum(count))
str(tmp)
## tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
## $ reordered : Factor w/ 2 levels "0","1": 1 2
## $ count : int [1:2] 555793 828824
## $ proportion: num [1:2] 0.401 0.599
tmp %>%
ggplot(aes(x=reordered,y=count,fill=reordered))+
geom_bar(stat="identity")
What is the percentage of orders that are organic vs. not organic?
products <- products %>%
mutate(organic=ifelse(str_detect(tolower(products$product_name),'organic'),"organic","not organic"),
organic= as.factor(organic))
tmp <- order_products %>%
left_join(products, by="product_id") %>%
group_by(organic) %>%
summarize(count = n()) %>%
mutate(proportion = count/sum(count))
str(tmp)
## tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
## $ organic : Factor w/ 2 levels "not organic",..: 1 2
## $ count : int [1:2] 979000 405617
## $ proportion: num [1:2] 0.707 0.293
tmp %>%
ggplot(aes(x=organic,y=count, fill=organic))+
geom_bar(stat="identity")
tmp <- order_products %>%
left_join(products,by="product_id") %>%
group_by(organic) %>%
summarize(mean_reordered = mean(reordered))
str(tmp)
## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
## $ organic : Factor w/ 2 levels "not organic",..: 1 2
## $ mean_reordered: num [1:2] 0.578 0.647
tmp %>%
ggplot(aes(x=organic,fill=organic,y=mean_reordered))+
geom_bar(stat="identity")
People more often reorder organic products vs non-organic
products.
Here is use to treemap package to visualize the structure of instacarts product portfolio. In total there are 21 departments containing 134 aisles. This map can be used as a layout of a physical store.
Step-by-Step Breakdown: We need to creat an intermediate dataframe for creating a map. Let;s call it tmp2
tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")
tmp2<-order_products %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id")
tmp2 <- tmp2 %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(tmp, by = c("department_id", "aisle_id")) %>%
mutate(onesize = 1)
treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",sortID="-sumcount", type="categorical", fontsize.legend = 0)
The size of the boxes shows the number of sales.
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")