This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
Load libraries
library(ggplot2)
library(dplyr)
Read in datasets
setwd("~/R/Data Analytics/Kaggle - Instacart Market Basket Analysis")
orders <- read.csv("orders.csv")
products <- read.csv("products.csv")
order_products_prior <- read.csv("order_products__prior.csv")
order_products_train <- read.csv("order_products__train.csv")
departments <- read.csv("departments.csv")
aisles <- read.csv("aisles.csv")
Data Exploration
Refer to docs in https://pcpust.slack.com/messages/C5TEC17GS/files/
Frequency of orders in days, hour of days and no of days prior
ggplot(data=orders,aes(x=order_dow))+geom_histogram(stat="count",fill="red") + xlab('Days of Week') + ylab('No of orders') + ggtitle('No of orders vs Days of week')
Ignoring unknown parameters: binwidth, bins, pad

ggplot(data=orders,aes(x=order_hour_of_day))+geom_histogram(stat="count",fill="red") + xlab('Hour of Day') + ylab('No of orders') + ggtitle('No of orders vs Hour of Day')
Ignoring unknown parameters: binwidth, bins, pad

ggplot(data=orders,aes(x=days_since_prior_order))+geom_histogram(stat="count",fill="red") + xlab('Days since previous order') + ylab('No of orders') + ggtitle('No of orders vs Days since previous order')
Ignoring unknown parameters: binwidth, bins, pad

Based on the plots, these are the findings: 1) There are higher count of orders for DOW (day of week) 0 and 1 2) Higher count of orders during the day between 0800 to 1800 3) Count of orders increases to maximum for no of days prior at day 7. Then decreases with less obvious seasonal peaks of 14 and 21 days. Maximum no of days prior is capped at 30 days, therefore the peak at 30 days is irrelevant.
Distribution of total orders per customer
orders %>% select(user_id) %>% group_by(user_id) %>% count(user_id) %>% ggplot(aes(x=n))+geom_histogram(stat="count", fill="red") + xlab('Total orders per customer') + ylab('No of customers') + ggtitle('No of customers vs Total no of orders per customer')
Ignoring unknown parameters: binwidth, bins, pad

The range for number of orders per customer is between 4 and 100, with 4 orders being the most common.
Distibution of no of items per order
order_products_prior %>% select(order_id) %>% group_by(order_id) %>% count(order_id) %>% ggplot(aes(x=n))+geom_histogram(stat="count", fill="red") + xlim(0, 60) + xlab('Number of products ordered') + ylab('No of orders') + ggtitle('No of orders vs No of products ordered (prior datasets')
Ignoring unknown parameters: binwidth, bins, pad

Maximum no of items bought per order is between 5 and 6 items.
LS0tDQp0aXRsZTogIkluc3RhY2FydCBNYXJrZXQgQmFza2V0IEFuYWx5c2lzIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KVGhpcyBpcyBhbiBbUiBNYXJrZG93bl0oaHR0cDovL3JtYXJrZG93bi5yc3R1ZGlvLmNvbSkgTm90ZWJvb2suIFdoZW4geW91IGV4ZWN1dGUgY29kZSB3aXRoaW4gdGhlIG5vdGVib29rLCB0aGUgcmVzdWx0cyBhcHBlYXIgYmVuZWF0aCB0aGUgY29kZS4gDQoNClRyeSBleGVjdXRpbmcgdGhpcyBjaHVuayBieSBjbGlja2luZyB0aGUgKlJ1biogYnV0dG9uIHdpdGhpbiB0aGUgY2h1bmsgb3IgYnkgcGxhY2luZyB5b3VyIGN1cnNvciBpbnNpZGUgaXQgYW5kIHByZXNzaW5nICpDdHJsK1NoaWZ0K0VudGVyKi4gDQoNCg0KQWRkIGEgbmV3IGNodW5rIGJ5IGNsaWNraW5nIHRoZSAqSW5zZXJ0IENodW5rKiBidXR0b24gb24gdGhlIHRvb2xiYXIgb3IgYnkgcHJlc3NpbmcgKkN0cmwrQWx0K0kqLg0KDQpXaGVuIHlvdSBzYXZlIHRoZSBub3RlYm9vaywgYW4gSFRNTCBmaWxlIGNvbnRhaW5pbmcgdGhlIGNvZGUgYW5kIG91dHB1dCB3aWxsIGJlIHNhdmVkIGFsb25nc2lkZSBpdCAoY2xpY2sgdGhlICpQcmV2aWV3KiBidXR0b24gb3IgcHJlc3MgKkN0cmwrU2hpZnQrSyogdG8gcHJldmlldyB0aGUgSFRNTCBmaWxlKS4NCg0KDQojIExvYWQgbGlicmFyaWVzDQpgYGB7cn0NCmxpYnJhcnkoZ2dwbG90MikNCmxpYnJhcnkoZHBseXIpDQpgYGANCg0KDQojIFJlYWQgaW4gZGF0YXNldHMNCmBgYHtyfQ0Kc2V0d2QoIn4vUi9EYXRhIEFuYWx5dGljcy9LYWdnbGUgLSBJbnN0YWNhcnQgTWFya2V0IEJhc2tldCBBbmFseXNpcyIpDQpvcmRlcnMgPC0gcmVhZC5jc3YoIm9yZGVycy5jc3YiKQ0KcHJvZHVjdHMgPC0gcmVhZC5jc3YoInByb2R1Y3RzLmNzdiIpDQpvcmRlcl9wcm9kdWN0c19wcmlvciA8LSByZWFkLmNzdigib3JkZXJfcHJvZHVjdHNfX3ByaW9yLmNzdiIpDQpvcmRlcl9wcm9kdWN0c190cmFpbiA8LSByZWFkLmNzdigib3JkZXJfcHJvZHVjdHNfX3RyYWluLmNzdiIpDQpkZXBhcnRtZW50cyA8LSByZWFkLmNzdigiZGVwYXJ0bWVudHMuY3N2IikNCmFpc2xlcyA8LSByZWFkLmNzdigiYWlzbGVzLmNzdiIpDQpgYGANCg0KDQojIERhdGEgRXhwbG9yYXRpb24NClJlZmVyIHRvIGRvY3MgaW4gaHR0cHM6Ly9wY3B1c3Quc2xhY2suY29tL21lc3NhZ2VzL0M1VEVDMTdHUy9maWxlcy8NCg0KIyMgRnJlcXVlbmN5IG9mIG9yZGVycyBpbiBkYXlzLCBob3VyIG9mIGRheXMgYW5kIG5vIG9mIGRheXMgcHJpb3INCmBgYHtyLCBFQ0hPPSBGQUxTRX0NCmdncGxvdChkYXRhPW9yZGVycyxhZXMoeD1vcmRlcl9kb3cpKStnZW9tX2hpc3RvZ3JhbShzdGF0PSJjb3VudCIsZmlsbD0icmVkIikgKyB4bGFiKCdEYXlzIG9mIFdlZWsnKSArIHlsYWIoJ05vIG9mIG9yZGVycycpICsgZ2d0aXRsZSgnTm8gb2Ygb3JkZXJzIHZzIERheXMgb2Ygd2VlaycpDQpnZ3Bsb3QoZGF0YT1vcmRlcnMsYWVzKHg9b3JkZXJfaG91cl9vZl9kYXkpKStnZW9tX2hpc3RvZ3JhbShzdGF0PSJjb3VudCIsZmlsbD0icmVkIikgKyB4bGFiKCdIb3VyIG9mIERheScpICsgeWxhYignTm8gb2Ygb3JkZXJzJykgKyBnZ3RpdGxlKCdObyBvZiBvcmRlcnMgdnMgSG91ciBvZiBEYXknKQ0KZ2dwbG90KGRhdGE9b3JkZXJzLGFlcyh4PWRheXNfc2luY2VfcHJpb3Jfb3JkZXIpKStnZW9tX2hpc3RvZ3JhbShzdGF0PSJjb3VudCIsZmlsbD0icmVkIikgKyB4bGFiKCdEYXlzIHNpbmNlIHByZXZpb3VzIG9yZGVyJykgKyB5bGFiKCdObyBvZiBvcmRlcnMnKSArIGdndGl0bGUoJ05vIG9mIG9yZGVycyB2cyBEYXlzIHNpbmNlIHByZXZpb3VzIG9yZGVyJykNCmBgYA0KQmFzZWQgb24gdGhlIHBsb3RzLCB0aGVzZSBhcmUgdGhlIGZpbmRpbmdzOg0KMSkgVGhlcmUgYXJlIGhpZ2hlciBjb3VudCBvZiBvcmRlcnMgZm9yIERPVyAoZGF5IG9mIHdlZWspIDAgYW5kIDENCjIpIEhpZ2hlciBjb3VudCBvZiBvcmRlcnMgZHVyaW5nIHRoZSBkYXkgYmV0d2VlbiAwODAwIHRvIDE4MDANCjMpIENvdW50IG9mIG9yZGVycyBpbmNyZWFzZXMgdG8gbWF4aW11bSBmb3Igbm8gb2YgZGF5cyBwcmlvciBhdCBkYXkgNy4gVGhlbiBkZWNyZWFzZXMgd2l0aCBsZXNzIG9idmlvdXMgc2Vhc29uYWwgcGVha3Mgb2YgMTQgYW5kIDIxIGRheXMuIE1heGltdW0gbm8gb2YgZGF5cyBwcmlvciBpcyBjYXBwZWQgYXQgMzAgZGF5cywgdGhlcmVmb3JlIHRoZSBwZWFrIGF0IDMwIGRheXMgaXMgaXJyZWxldmFudC4NCg0KIyMgRGlzdHJpYnV0aW9uIG9mIHRvdGFsIG9yZGVycyBwZXIgY3VzdG9tZXINCmBgYHtyLCBFQ0hPPUZBTFNFfQ0Kb3JkZXJzICU+JSBzZWxlY3QodXNlcl9pZCkgJT4lIGdyb3VwX2J5KHVzZXJfaWQpICU+JSBjb3VudCh1c2VyX2lkKSAlPiUgZ2dwbG90KGFlcyh4PW4pKStnZW9tX2hpc3RvZ3JhbShzdGF0PSJjb3VudCIsIGZpbGw9InJlZCIpICsgeGxhYignVG90YWwgb3JkZXJzIHBlciBjdXN0b21lcicpICsgeWxhYignTm8gb2YgY3VzdG9tZXJzJykgKyBnZ3RpdGxlKCdObyBvZiBjdXN0b21lcnMgdnMgVG90YWwgbm8gb2Ygb3JkZXJzIHBlciBjdXN0b21lcicpDQpgYGANClRoZSByYW5nZSBmb3IgbnVtYmVyIG9mIG9yZGVycyBwZXIgY3VzdG9tZXIgaXMgYmV0d2VlbiA0IGFuZCAxMDAsIHdpdGggNCBvcmRlcnMgYmVpbmcgdGhlIG1vc3QgY29tbW9uLg0KDQojIyBEaXN0aWJ1dGlvbiBvZiBubyBvZiBpdGVtcyBwZXIgb3JkZXINCmBgYHtyLCBFQ0hPPUZBTFNFfQ0Kb3JkZXJfcHJvZHVjdHNfcHJpb3IgJT4lIHNlbGVjdChvcmRlcl9pZCkgJT4lIGdyb3VwX2J5KG9yZGVyX2lkKSAlPiUgY291bnQob3JkZXJfaWQpICU+JSBnZ3Bsb3QoYWVzKHg9bikpK2dlb21faGlzdG9ncmFtKHN0YXQ9ImNvdW50IiwgZmlsbD0icmVkIikgKyB4bGltKDAsIDYwKSArIHhsYWIoJ051bWJlciBvZiBwcm9kdWN0cyBvcmRlcmVkJykgKyB5bGFiKCdObyBvZiBvcmRlcnMnKSArIGdndGl0bGUoJ05vIG9mIG9yZGVycyB2cyBObyBvZiBwcm9kdWN0cyBvcmRlcmVkIChwcmlvciBkYXRhc2V0cycpDQpgYGANCk1heGltdW0gbm8gb2YgaXRlbXMgYm91Z2h0IHBlciBvcmRlciBpcyBiZXR3ZWVuIDUgYW5kIDYgaXRlbXMuDQo=