---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

# Instacart - Stacked Bar charts

This notebook is part of the exploratory phase of the project. We look at how ordered items are distributed across departments, aisles and product names, using stacked bar charts that also show the percentage of reorders in each category. The data covers all orders (prior + latest) from the sampled train users (35000 users), and the exploration covers the following:

1) Distribution of items ordered by departments
2) Distribution of items ordered by top 20 aisles, by volume (no of orders) 
3) Distribution of items of top 3 departments, broken down by aisles
4) Distribution of items of top 20 product names
5) Distribution of items of top 3 aisles, broken down by product names


## Distribution of items ordered by departments

There are 21 departments in total. The chart below shows the volume of items ordered in each department.

The data has 5820204 rows of ordered items in total, so an item count converts to a percentage of all items when multiplied by 100/5820204 ≈ 0.0000171815. This factor scales the secondary axis of the chart.
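The conversion factor can also be derived from the data instead of being typed by hand. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` (hypothetical, standing in for the joined order table); it is an illustration, not the code used for the chart.

```r
# Hypothetical stand-in for the joined order table: one row per ordered item.
toy <- data.frame(
  department = c("produce", "produce", "produce", "dairy eggs", "snacks"),
  reordered  = c(1, 1, 0, 1, 0)
)

# Factor that turns an item count into a percentage of all rows.
pct_factor <- 100 / nrow(toy)

# Item counts per department, then the same counts as percentages.
dept_counts <- table(toy$department)
dept_pct <- as.numeric(dept_counts) * pct_factor
names(dept_pct) <- names(dept_counts)

print(pct_factor)
print(dept_pct)

# With the 5820204 rows reported in the text, the factor becomes:
full_factor <- 100 / 5820204
print(signif(full_factor, 6))
```

Deriving the factor from `nrow()` keeps the secondary axis correct if the sample of users changes.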

```{r echo = FALSE}

# Load libraries
library(dplyr)
library(data.table)
library(reshape2)
library(plotrix)
library(scales)
library(googleVis)
library(TraMineR)
library(ggplot2)

# setwd("~/Data Analytics/PCP project - Instacart/100_Stages/001_Join_tables/")
# Read the joined order table (the path is specific to the author's machine)
join_table <- fread("~/Data Analytics/PCP project - Instacart/100_Stages/001_Join_tables/join_table_train.csv", stringsAsFactors = TRUE)

data <- join_table %>% select(product_name, department, aisle, reordered, add_to_cart_order,
                              order_number, order_dow, order_hour_of_day)
# Count items per department and reorder flag, then compute the reorder
# share within each department and the department total used for sorting
data %>% group_by(department, reordered) %>% summarise(freq = n()) %>% 
  group_by(department) %>% mutate(percentage = percent(freq/sum(freq)), tot = sum(freq)) %>%
  arrange(desc(tot)) %>%
  ggplot(aes(reorder(department, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.0000171815, name = "% of department")) + 
  geom_col() + coord_flip() +
  ggtitle("Items ordered by Departments") + 
  labs(y = "Frequency of items", x = "Departments") +
  theme_bw() + theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5), 
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")
```

The top 3 departments by volume are produce, dairy eggs and snacks. Within each department, reorders also outnumber non-reorders, at around 60% of the items. One possible reason is that these departments cover daily necessities, which are bought repeatedly. Next we look at the distribution across aisles, and then at the aisles within individual departments.
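The reorder share per department can be checked numerically as well as read off the chart. Below is a minimal sketch in base R on made-up data (the data frame `toy` and its values are hypothetical); it relies on the fact that the mean of a 0/1 flag equals the share of rows where the flag is 1.

```r
# Hypothetical toy data: one row per ordered item, reordered is 0 or 1.
toy <- data.frame(
  department = c("produce", "produce", "produce", "produce",
                 "snacks", "snacks", "snacks", "dairy eggs", "dairy eggs"),
  reordered  = c(1, 1, 1, 0, 1, 0, 0, 1, 0)
)

# The mean of a 0/1 flag is the share of rows where the flag is 1.
reorder_share <- tapply(toy$reordered, toy$department, mean)

# Total items per department, used to order departments by volume.
items <- tapply(toy$reordered, toy$department, length)

summary_tbl <- data.frame(
  department    = names(items),
  items         = as.integer(items),
  reorder_share = as.numeric(reorder_share[names(items)])
)

# Largest department first, as in the chart.
summary_tbl <- summary_tbl[order(-summary_tbl$items), ]
print(summary_tbl)
```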


## Distribution of items ordered by top 20 aisles, by volume (no of orders)

There are 134 aisles in total, so we do not look at all of them. Instead we look at the distribution of items ordered for the top 20 aisles only.

The total is again 5820204 rows of ordered items, so the factor for converting a count to a percentage of all items is 100/5820204 ≈ 0.0000171815.
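The plotting code in this notebook keeps the first 40 rows after sorting, which corresponds to 20 aisles only when every aisle has both a reordered and a non-reordered row. An alternative is to pick the top aisles by name first and then keep all of their rows. The sketch below shows this in base R; the helper `top_n_aisles` and the data frame `toy` are hypothetical names introduced for illustration.

```r
# Hypothetical toy data: aisle "d" has only non-reordered rows.
toy <- data.frame(
  aisle     = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "d"),
  reordered = c(1, 0, 1, 1, 0, 1, 0, 0, 0, 0)
)

top_n_aisles <- function(df, n) {
  # Total items per aisle, largest first.
  totals <- sort(table(df$aisle), decreasing = TRUE)
  # Names of the n largest aisles (fewer if the data has fewer aisles).
  keep <- names(totals)[seq_len(min(n, length(totals)))]
  # Keep every row of those aisles, whatever their reordered value.
  df[df$aisle %in% keep, ]
}

top2 <- top_n_aisles(toy, 2)
print(table(as.character(top2$aisle), top2$reordered))
```

Selecting by name does not depend on how many rows each aisle contributes, so an aisle with no reorders cannot shift the cut-off.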

```{r, echo=FALSE}
data %>% group_by(aisle, reordered) %>% summarise(freq = n()) %>% 
  group_by(aisle) %>% mutate(percentage = percent(freq/sum(freq)), tot = sum(freq)) %>% 
  arrange(desc(tot)) %>% head(40) %>% # 2 rows (reordered 0/1) per aisle, so 40 rows cover the top 20 aisles
  ggplot(aes(reorder(aisle, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.0000171815, name = "% of aisles")) + 
  geom_col() + coord_flip() +
  ggtitle("Items ordered by top 20 Aisles") + labs(y = "Frequency of items", x = "Aisles") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")
```

Notice that the top 2 aisles (fresh fruits, fresh vegetables) are well ahead of the rest and account for more than 20% of all items ordered. The share of reorders is also noticeably higher in fresh fruits (72%) than in the other aisles.

What does the aisle distribution look like for the top 3 departments (produce, dairy eggs, snacks)?


## Distribution of items of top 3 departments, broken down by aisles

The top 3 departments (produce, dairy eggs and snacks) have 1702405, 967650 and 507960 rows of items respectively. The corresponding factors for converting a count to a percentage of the department are 0.0000587404289, 0.000103343 and 0.000196865.
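As a quick check, the three factors can be recomputed from the row counts given above. This is a minimal sketch in base R; the vector names are chosen for illustration only.

```r
# Row counts per department, as reported in the text.
dept_rows <- c(produce = 1702405, `dairy eggs` = 967650, snacks = 507960)

# Factor that converts an item count into a percentage of its department.
dept_factor <- 100 / dept_rows
print(signif(dept_factor, 6))

# Sanity check: a department's full count maps to 100 percent.
print(dept_rows * dept_factor)
```

The recomputed values agree closely with the hard-coded factors used on the secondary axes.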

```{r, echo = F}
# nrows in produce = 1702405. Conversion to percentage = 100/1702405
data %>% filter(department == "produce") %>% group_by(aisle, reordered) %>% summarise(freq = n()) %>% 
  group_by(aisle) %>% mutate(percentage = percent(freq/sum(freq)), tot = sum(freq)) %>% 
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(aisle, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.0000587404289, name = "% of produce dept")) +
  geom_col() + coord_flip() +
  ggtitle("Items of Produce Dept ordered by Aisles") + labs(y = "Frequency of items", x = "Aisles") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")

# nrows in dairy eggs = 967650. Conversion to percentage = 100/967650
data %>% filter(department == "dairy eggs") %>% group_by(aisle, reordered) %>% summarise(freq = n()) %>% 
  group_by(aisle) %>% mutate(percentage = percent(freq/sum(freq)), tot = sum(freq)) %>%  
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(aisle, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.000103343, name = "% of dairy eggs dept")) +
  geom_col() + coord_flip() +
  ggtitle("Items of Dairy Eggs Dept ordered by Aisles") + labs(y = "Frequency of items", x = "Aisles") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")

# nrows in snacks = 507960 Conversion to percentage = 100/507960
data %>% filter(department == "snacks") %>% group_by(aisle, reordered) %>% summarise(freq = n()) %>% 
  group_by(aisle) %>% mutate(percentage = percent(freq/sum(freq)), tot = sum(freq)) %>%  
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(aisle, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.000196865, name = "% of snacks dept")) +
  geom_col() + coord_flip() +
  ggtitle("Items of Snacks Dept ordered by Aisles") + labs(y = "Frequency of items", x = "Aisles") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")
```

Fresh fruits and fresh vegetables are the 2 largest aisles in the produce department and together make up around 75% of it. This helps to explain the weight of the produce department, which accounts for 29% of *ALL* items ordered.

Notice that the reorder ratio for milk is much higher than for the other aisles in the dairy eggs department. One possible reason is the very short shelf life of milk: within the one-year period of the recorded data, milk has to be bought again more often than products from other aisles.
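The shares quoted above follow from the row counts that appear in the text and in the code comments of this notebook. A minimal sketch of the arithmetic in base R (variable names are illustrative):

```r
# Row counts taken from the text and the code comments in this notebook.
all_rows         <- 5820204
produce_rows     <- 1702405
fresh_fruits     <- 650674
fresh_vegetables <- 616547

# Produce as a share of all ordered items.
produce_share <- 100 * produce_rows / all_rows

# The two largest aisles as a share of produce, and of all items.
top2_rows       <- fresh_fruits + fresh_vegetables
top2_of_produce <- 100 * top2_rows / produce_rows
top2_of_all     <- 100 * top2_rows / all_rows

print(c(produce_share, top2_of_produce, top2_of_all))
```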

We will next look at individual products and the most commonly ordered ones.

## Distribution of items of top 20 product names

We will now look at the top 20 most ordered products and their reorder ratio.

```{r, echo = F}
# nrows in products = 5820204. Conversion to percentage = 100/5820204
# Keep the first 40 rows after sorting: 2 rows (reordered 0/1) per product
tmp <- data %>% group_by(product_name, reordered) %>% summarise(freq = n()) %>%
  group_by(product_name) %>% mutate(percentage = percent(freq/sum(freq)),tot = sum(freq)) %>%
  arrange(desc(tot)) %>% head(40)
tmp %>% 
  ggplot(aes(reorder(product_name, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.0000171815, name = "% of products")) +
  geom_col() + coord_flip() +
  ggtitle("Items ordered by top 20 Products") + labs(y = "Frequency of items", x = "Products") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")
```

From the graph above, the most popular products are bananas and organic bananas, accounting for around 2.5% of all items ordered. Their reorder ratio (85%) is also significantly higher than that of the other products.

The commonly ordered products are mainly fruits, along with some cooking ingredients such as garlic, zucchini and cucumber.

Another interesting point is the large number of organic items among the top 20 products (14 out of 20 are organic). This suggests that a substantial segment of the customers is health conscious, judging by their choice of organic products.

Next we look at the distribution of products within the top 3 aisles (fresh fruits, fresh vegetables, packaged vegetables fruits).
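A count of organic items like the one above can be obtained by matching on the product name. The sketch below uses base R on a short made-up list (the names in `top_products` are hypothetical and are not the actual top 20); it assumes that organic products carry the word "organic" in their name.

```r
# Hypothetical product names standing in for a top-N list.
top_products <- c("Banana", "Bag of Organic Bananas", "Organic Strawberries",
                  "Large Lemon", "Organic Baby Spinach", "Limes")

# TRUE where the name contains "organic", ignoring case.
is_organic <- grepl("organic", top_products, ignore.case = TRUE)

# Number and percentage of organic items in the list.
organic_count <- sum(is_organic)
organic_share <- 100 * mean(is_organic)

print(organic_count)
print(organic_share)
```

Name matching only shows how products are labelled; it does not by itself say why customers choose them.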

## Distribution of items of top 3 aisles, broken down by product names

As indicated above, the top 3 aisles are fresh fruits, fresh vegetables and packaged vegetables fruits. 

```{r, echo = F}
# nrows in aisle fresh fruits = 650674. Conversion to percentage = 100/650674
data %>% filter(aisle == "fresh fruits") %>% group_by(product_name, reordered) %>% summarise(freq = n()) %>%
  group_by(product_name) %>% mutate(percentage = percent(freq/sum(freq)),tot = sum(freq)) %>%
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(product_name, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.000153686, name = "% of fresh fruit aisle")) +
  geom_col() + coord_flip() +
  ggtitle("Top 20 Products in Fresh Fruits Aisle") + labs(y = "Frequency of items", x = "Products") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")

# nrows in aisle fresh vegetables = 616547. Conversion to percentage = 100/616547
data %>% filter(aisle == "fresh vegetables") %>% group_by(product_name, reordered) %>% summarise(freq = n()) %>%
  group_by(product_name) %>% mutate(percentage = percent(freq/sum(freq)),tot = sum(freq)) %>%
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(product_name, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.00016219364, name = "% of fresh vegetable aisle")) +
  geom_col() + coord_flip() +
  ggtitle("Top 20 Products in Fresh Vegetables Aisle") + labs(y = "Frequency of items", x = "Products") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")

# nrows in aisle packaged vegetables fruits = 316885. Conversion to percentage = 100/316885
data %>% filter(aisle == "packaged vegetables fruits") %>% group_by(product_name, reordered) %>% summarise(freq = n()) %>%
  group_by(product_name) %>% mutate(percentage = percent(freq/sum(freq)),tot = sum(freq)) %>%
  arrange(desc(tot)) %>% head(40) %>%
  ggplot(aes(reorder(product_name, tot), freq, fill = factor(reordered))) + 
  scale_y_continuous(sec.axis = sec_axis(~.*0.0003155719, name = "% of packaged vegetables fruits aisle")) +
  geom_col() + coord_flip() +
  ggtitle("Top 20 Products in Packaged Vegetables Fruits Aisle") + labs(y = "Frequency of items", x = "Products") +
  theme_bw() + theme(plot.title = element_text(size = 25, face = "bold", hjust = 0.5),
                     axis.title = element_text(size = 15)) + 
  geom_text(aes(label = percentage), hjust = 2, size = 3, position ="stack")
```

Most of the products shown here already appear in the previous plot of the most commonly ordered products.

## Conclusion

This exploratory phase helps us understand which items are most commonly ordered and gives us insight into customers' buying habits. However, it does not show any relationship between the types of products ordered within a basket or between orders. We will explore such relationships further using Sankey diagrams and some sequence mining R packages to find the most common sequences of items.

