Here are the R libraries used in this analysis.
library(knitr) # web widget
library(tidyverse) # data manipulation
library(data.table) # fast file reading
library(treemap) # tree visualization
library(corrgram) # correlation graphics
setwd('./datasets')
aisles = fread('aisles.csv', stringsAsFactors = TRUE)
departments = fread('departments.csv', stringsAsFactors = TRUE)
products = fread('products.csv', stringsAsFactors = TRUE)
orders = fread('orders.csv', stringsAsFactors = TRUE)
order_products_train = fread('order_products__train.csv')
order_products_prior = fread('order_products__prior.csv')
These datasets were sourced from an existing Kaggle competiotion (https://www.kaggle.com/c/instacart-market-basket-analysis/data)
Total six datasets were imported. Follwing section will explore each datasets in further detail.
There are 134 ailes in this dataset.
str(aisles)
## Classes 'data.table' and 'data.frame': 134 obs. of 2 variables:
## $ aisle_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ aisle : Factor w/ 134 levels "air fresheners candles",..: 107 121 43 75 82 92 95 6 101 77 ...
## - attr(*, ".internal.selfref")=<externalptr>
Here are the names of all ailes aphabetically ordered, seprated by comma.
paste(sort(aisles$aisle), collapse=', ')
## [1] "air fresheners candles, asian foods, baby accessories, baby bath body care, baby food formula, bakery desserts, baking ingredients, baking supplies decor, beauty, beers coolers, body lotions soap, bread, breakfast bakery, breakfast bars pastries, bulk dried fruits vegetables, bulk grains rice dried goods, buns rolls, butter, candy chocolate, canned fruit applesauce, canned jarred vegetables, canned meals beans, canned meat seafood, cat food care, cereal, chips pretzels, cleaning products, cocoa drink mixes, coffee, cold flu allergy, condiments, cookies cakes, crackers, cream, deodorants, diapers wipes, digestion, dish detergents, dog food care, doughs gelatins bake mixes, dry pasta, eggs, energy granola bars, energy sports drinks, eye ear care, facial care, feminine care, first aid, food storage, fresh dips tapenades, fresh fruits, fresh herbs, fresh pasta, fresh vegetables, frozen appetizers sides, frozen breads doughs, frozen breakfast, frozen dessert, frozen juice, frozen meals, frozen meat seafood, frozen pizza, frozen produce, frozen vegan vegetarian, fruit vegetable snacks, grains rice dried goods, granola, hair care, honeys syrups nectars, hot cereal pancake mixes, hot dogs bacon sausage, ice cream ice, ice cream toppings, indian foods, instant foods, juice nectars, kitchen supplies, kosher foods, latino foods, laundry, lunch meat, marinades meat preparation, meat counter, milk, mint gum, missing, more household, muscles joints pain relief, nuts seeds dried fruit, oils vinegars, oral hygiene, other, other creams cheeses, packaged cheese, packaged meat, packaged poultry, packaged produce, packaged seafood, packaged vegetables fruits, paper goods, pasta sauce, pickled goods olives, plates bowls cups flatware, popcorn jerky, poultry counter, prepared meals, prepared soups salads, preserved dips spreads, protein meal replacements, red wines, refrigerated, refrigerated pudding desserts, salad dressing toppings, seafood counter, shave needs, skin care, soap, soft drinks, soup broth bouillon, soy lactosefree, specialty cheeses, specialty wines champagnes, spices seasonings, spirits, spreads, tea, tofu meat alternatives, tortillas flat bread, trail mix snack mix, trash bags liners, vitamins supplements, water seltzer sparkling water, white wines, yogurt"
There are 21 departments in this dataset.
str(departments)
## Classes 'data.table' and 'data.frame': 21 obs. of 2 variables:
## $ department_id: int 1 2 3 4 5 6 7 8 9 10 ...
## $ department : Factor w/ 21 levels "alcohol","babies",..: 11 16 3 20 1 13 4 19 10 6 ...
## - attr(*, ".internal.selfref")=<externalptr>
Names of all deparments are listed below in aphabetically ordered.
sort(departments$department)
## [1] alcohol babies bakery beverages
## [5] breakfast bulk canned goods dairy eggs
## [9] deli dry goods pasta frozen household
## [13] international meat seafood missing other
## [17] pantry personal care pets produce
## [21] snacks
## 21 Levels: alcohol babies bakery beverages breakfast bulk ... snacks
There are 49,688 products in the catalogue within 134 aisles and 21 departments.
str(products)
length(unique(products$aisle_id))
length(unique(products$department_id))
## Classes 'data.table' and 'data.frame': 49688 obs. of 4 variables:
## $ product_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ product_name : Factor w/ 49688 levels "#2 Coffee Filters",..: 8642 2018 38573 40977 18584 13349 36616 11763 23264 41864 ...
## $ aisle_id : int 61 104 94 38 5 11 98 116 120 115 ...
## $ department_id: int 19 13 7 1 13 11 7 1 16 7 ...
## - attr(*, ".internal.selfref")=<externalptr>
## [1] 134
## [1] 21
Table below shows a glimpse of 3 products for each deparment, giving an idea the how the products being categorized. Only sample five departments are shown.
left_join(departments, products) %>% select(department, product_name ) %>%
group_by(department) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
| department | three_examples_product |
|---|---|
| dry goods pasta | Creamy Four Cheese Rice || Garden Vegetable Ready Rice || Roasted Garlic & Herb Brown Rice Medley |
| produce | Orange Navel Heirloom. || Crenshaw Melon || Cocktail Tomatoes |
| pets | Flaked Tuna Feast Cat Food || Rawhide Chip Rolls || Adult Complete Nutrition Dry Food for Dogs |
| bakery | Yellow Corn Organic Tortillas || Tortillas, Flour || Small Soft and Fluffy Flour Tortillas |
| babies | Fruit & Veggie Twist Apple Mango Pear & Kale || Organic Sunny Days Apple Snack Bars || Slip On Size 6 Diapers |
Table below shows a glimpse of 3 products for each ailes, giving an idea the how the products being categorized. Only sample five ailes are shown
left_join(aisles, products) %>%
select(aisle, product_name ) %>% group_by(aisle) %>%
sample_n(3) %>%
summarise(three_examples_product=paste(product_name, collapse=' || ')) %>% sample_n(5) %>% kable
| aisle | three_examples_product |
|---|---|
| cookies cakes | Original Animal Crackers || Soft Baked Sanibel Snickerdoodle Cookies || Sandies Toffee Shortbread Cookies |
| fresh fruits | Mandarin Box || Organic Honeydew || Red Globe Grapes |
| cat food care | Sliced Salmon Canned Catfood Entree || Liver & Chicken Dinner Classic Pate Cat Food || Purina Mixed Grill Classic Pate Cat Food |
| food storage | Take Alongs Round Small Bowls || Zipper Bags Sandwich - 50 CT || Takealongs Serving Bowl |
| packaged seafood | Alaskan Sockeye Smoked Salmon || Wild Alaskan Salmon Sockeye Salmon Gravlox || Herring In Wine Sauce |
There are over 3 millions observations in orders dataset. Each row represent an unique order. Each variable potentially can be used as predictors.
| variable | description |
|---|---|
| user_id | unique identity of a customer |
| eval_set | value “prior” identifies that it is previous order before leading to last order order value “train” is the last order with associated products that the user order value “test” is the last order, however the contest is to predict what will be the product the user will buy |
| order_number | true |
| order_dow | day of week the order is placed |
| order_hour_of_day | which hour the order is placed |
| days_since_prior_order | how many days lapsed from this order to the previous order |
glimpse(orders)
## Observations: 3,421,083
## Variables: 7
## $ order_id <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set <fct> prior, prior, prior, prior, prior, prio...
## $ order_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
Let’s analyse the construct of one user. For user_id 1, there were 10 prior orders (order number from 1 to 10), last order is a train.
orders %>% filter (user_id==1) %>% kable
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 2539329 | 1 | prior | 1 | 2 | 8 | NA |
| 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
| 473747 | 1 | prior | 3 | 3 | 12 | 21 |
| 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
| 431534 | 1 | prior | 5 | 4 | 15 | 28 |
| 3367565 | 1 | prior | 6 | 2 | 7 | 19 |
| 550135 | 1 | prior | 7 | 1 | 9 | 20 |
| 3108588 | 1 | prior | 8 | 1 | 14 | 14 |
| 2295261 | 1 | prior | 9 | 1 | 16 | 0 |
| 2550362 | 1 | prior | 10 | 4 | 8 | 30 |
| 1187899 | 1 | train | 11 | 4 | 8 | 14 |
Let’s analyse another construct of orders. User_id 3 made 12 orders before the final order labeled as test order:
orders %>% filter (user_id==3) %>% kable
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 1374495 | 3 | prior | 1 | 1 | 14 | NA |
| 444309 | 3 | prior | 2 | 3 | 19 | 9 |
| 3002854 | 3 | prior | 3 | 3 | 16 | 21 |
| 2037211 | 3 | prior | 4 | 2 | 18 | 20 |
| 2710558 | 3 | prior | 5 | 0 | 17 | 12 |
| 1972919 | 3 | prior | 6 | 0 | 16 | 7 |
| 1839752 | 3 | prior | 7 | 0 | 15 | 7 |
| 3225766 | 3 | prior | 8 | 0 | 17 | 7 |
| 3160850 | 3 | prior | 9 | 0 | 16 | 7 |
| 676467 | 3 | prior | 10 | 3 | 16 | 17 |
| 521107 | 3 | prior | 11 | 0 | 18 | 11 |
| 1402502 | 3 | prior | 12 | 1 | 15 | 15 |
| 2774568 | 3 | test | 13 | 5 | 15 | 11 |
We know that in the last order (order_id 1187899), the user purchased 10 unique products by quering order_product_train with the order_id.
order_products_train %>% filter (order_id == 1187899) %>% kable
| order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|
| 1187899 | 196 | 1 | 1 |
| 1187899 | 25133 | 2 | 1 |
| 1187899 | 38928 | 3 | 1 |
| 1187899 | 26405 | 4 | 1 |
| 1187899 | 39657 | 5 | 1 |
| 1187899 | 10258 | 6 | 1 |
| 1187899 | 13032 | 7 | 1 |
| 1187899 | 26088 | 8 | 1 |
| 1187899 | 27845 | 9 | 0 |
| 1187899 | 49235 | 10 | 1 |
| 1187899 | 46149 | 11 | 1 |
order_product_train/prior dataframe tells us which products were purchased in each order.
glimpse(order_products_train)
## Observations: 1,384,617
## Variables: 4
## $ order_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 36, 36, 36, 36, 36, ...
## $ product_id <int> 49302, 11109, 10246, 49683, 43633, 13176, 47...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7,...
## $ reordered <int> 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,...
We can derive number of unique users from order dataframe. There are 75000 test users for prediction, 131,209 training users for machine learning. We want to keep this large number of train records in mind for traiming during quick initial model build and verification.
table(orders$eval_set)
##
## prior test train
## 3214874 75000 131209
In this section, we shall try to understand the buying behaviour by asking some interesting quesitons.
We know that banana are the most popular products. The variance in the number of orders differs greatly for different products.
tmp = order_products_train %>%
left_join(products) %>%
group_by(product_name) %>%
summarize(count=n()) %>%
top_n(n=40, wt=count) %>% mutate(percentage=count/sum(count))
ggplot (tmp, aes(x=reorder(product_name,count), y=percentage)) +
geom_col() + ggtitle('Product') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Product') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Products')
Certain departmens are clearly more popular, like personal care, snacks and pantray. They made up more than 30% of total product sold.
tmp = order_products_train %>%
left_join(products) %>%
left_join(departments) %>%
group_by(department) %>%
summarize(count=n()) %>%
mutate(percentage=count/sum(count))
ggplot (tmp, aes(x=reorder(department,count), y=percentage)) +
geom_col() + ggtitle('Departments') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank())
ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Departments') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Departments')
We looked into the buying trend of product by ailes and notice that certain aisle like candy chocolate and ice cream are much popular than others. The box plot illustrates the variance between the least popular and most popular.
par(mfrow=c(1,2))
tmp = order_products_train %>%
left_join(products) %>%
left_join(aisles) %>%
group_by(aisle) %>%
summarize(count=n()) %>%
top_n(n=40, wt=count) %>% mutate(percentage=count/sum(count))
ggplot (tmp, aes(x=reorder(aisle,count), y=percentage)) +
geom_col() + ggtitle('Aisles') + ylab('Percentage of Orders') +
theme (
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),
axis.title.x = element_blank()) + ylab('Percentage of Orders') + xlab('Aisles')
ggplot (data = tmp, aes( x= '', y=percentage )) +
ggtitle('Aisles') + ylab('percentage.of.orders') + geom_boxplot() + xlab('Aisles')