One of the most challenging aspects of Data Science, as I have found over the last two terms in both Data Visualization and Statistical Learning, is preparing data for use with the packages available in R. To step beyond what we did in class, I wanted to apply some of the Statistical Learning techniques we learned to real-life datasets, and Kaggle is a good place to start. I found the Instacart challenge suitable both because of its real-life application and because it explores one of the topics covered in class: association rules and market basket analysis.
From the Kaggle competition site:
“Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.
Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.
In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.”
Because this is a large dataset, it was also an opportunity to explore some of the interesting data-munging libraries such as dplyr.
Start by loading the required libraries and datasets:
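The loading step itself is not shown in the chunks below, so here is a minimal sketch. It assumes the Kaggle CSV files keep their original names and sit in the working directory; fread() is used because str() below reports the data.table class.

library(data.table)   # fread() for fast CSV reading
library(dplyr)        # data manipulation and the %>% pipe
library(ggplot2)      # plots
library(knitr)        # kable() tables
library(treemap)      # treemap visualizations
library(arules)       # association rules / read.transactions()

orders         <- fread("orders.csv")
products       <- fread("products.csv")
order_products <- fread("order_products__prior.csv")
aisles         <- fread("aisles.csv")
departments    <- fread("departments.csv")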
# Exploratory Data Analysis
str(orders, max.level=1)
## Classes 'data.table' and 'data.frame': 1048575 obs. of 7 variables:
## $ order_id : int 2539329 2398795 473747 2254736 431534 3367565 550135 3108588 2295261 2550362 ...
## $ user_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ eval_set : chr "prior" "prior" "prior" "prior" ...
## $ order_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ order_dow : int 2 3 3 4 4 2 1 1 1 4 ...
## $ order_hour_of_day : int 8 7 12 7 15 7 9 14 16 8 ...
## $ days_since_prior_order: int NA 15 21 29 28 19 20 14 0 30 ...
## - attr(*, ".internal.selfref")=<externalptr>
kable(head(orders,12))
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
2539329 | 1 | prior | 1 | 2 | 8 | NA |
2398795 | 1 | prior | 2 | 3 | 7 | 15 |
473747 | 1 | prior | 3 | 3 | 12 | 21 |
2254736 | 1 | prior | 4 | 4 | 7 | 29 |
431534 | 1 | prior | 5 | 4 | 15 | 28 |
3367565 | 1 | prior | 6 | 2 | 7 | 19 |
550135 | 1 | prior | 7 | 1 | 9 | 20 |
3108588 | 1 | prior | 8 | 1 | 14 | 14 |
2295261 | 1 | prior | 9 | 1 | 16 | 0 |
2550362 | 1 | prior | 10 | 4 | 8 | 30 |
1187899 | 1 | train | 11 | 4 | 8 | 14 |
2168274 | 2 | prior | 1 | 2 | 11 | NA |
glimpse(orders)
## Observations: 1,048,575
## Variables: 7
## $ order_id <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set <chr> "prior", "prior", "prior", "prior", "pr...
## $ order_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <int> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
Above, we converted a few columns to more convenient types (numeric for the order hour and factors for the categorical variables). Throughout this report, we use the dplyr library and its pipe operator %>% to manipulate these large datasets. We start by looking at daily and weekly shopping habits.
Hour of day:
orders %>%
ggplot(aes(x=order_hour_of_day)) +
geom_histogram(stat="count",fill="skyblue")
Day of Week:
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="skyblue")
Although the data does not document which order_dow value corresponds to which day of the week, the hour-of-day plot clearly shows that a lot of shopping happens during working hours, which is a widely known pattern.
People tend to place a new order within 7 days of the last one, most often exactly 7 days after. The spike at 30 days appears to be a long-tail effect.
orders %>%
ggplot(aes(x=days_since_prior_order)) +
geom_histogram(stat="count",fill="skyblue")
orders %>%
  filter(eval_set=="prior") %>%
  count(order_number) %>%
  ggplot(aes(order_number,n)) +
  geom_line(color="skyblue", size=1) +
  geom_point(size=2, color="blue")
The number of prior orders per customer seems to follow a Poisson distribution; it would be worth investigating further whether this has been modeled previously.
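As a rough check of the Poisson idea (a sketch, not part of the original analysis; orders_per_user and lambda_hat are names introduced here), we could compare the observed number of prior orders per customer against a Poisson fit:

# Number of prior orders per customer
orders_per_user <- orders %>%
  filter(eval_set == "prior") %>%
  count(user_id, name = "n_orders")

# Maximum-likelihood estimate of the Poisson rate
lambda_hat <- mean(orders_per_user$n_orders)

# Observed vs. Poisson-expected frequencies for the first few counts
orders_per_user %>%
  count(n_orders, name = "observed") %>%
  mutate(expected = round(dpois(n_orders, lambda_hat) * sum(observed))) %>%
  head(10)

We then look at the distribution of basket sizes, i.e. the number of items per order.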
order_products %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="skyblue") +
geom_rug()+
coord_cartesian(xlim=c(0,80))
Next we look at the staples: the items with the greatest support \(P(X)\). Not surprisingly, these are items like fruit, vegetables, and milk.
tmp <- order_products %>%
group_by(product_id) %>%
summarize(count = n()) %>%
top_n(12, wt = count) %>%
left_join(select(products,product_id,product_name),by="product_id") %>%
arrange(desc(count))
kable(tmp)
product_id | count | product_name |
---|---|---|
24852 | 14136 | Banana |
13176 | 11639 | Bag of Organic Bananas |
21137 | 8233 | Organic Strawberries |
21903 | 7443 | Organic Baby Spinach |
47626 | 6148 | Large Lemon |
47766 | 5606 | Organic Avocado |
47209 | 5489 | Organic Hass Avocado |
16797 | 4920 | Strawberries |
26209 | 4609 | Limes |
27966 | 4200 | Organic Raspberries |
39275 | 3744 | Organic Blueberries |
27845 | 3719 | Organic Whole Milk |
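Note that the counts above are raw frequencies; support in the association-rules sense is the fraction of orders containing an item. A small sketch of that calculation (n_orders_total is a helper name introduced here):

# support(X) = (# orders containing X) / (total # orders)
n_orders_total <- n_distinct(order_products$order_id)

order_products %>%
  count(product_id, name = "count") %>%
  mutate(support = count / n_orders_total) %>%
  top_n(12, wt = count) %>%
  left_join(select(products, product_id, product_name), by = "product_id") %>%
  arrange(desc(support))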
The unfortunate “curse of dimensionality” certainly applies to retail. Here we see that roughly 40% of the items in these orders are being purchased by that customer for the first time (reordered = 0)!
tmp <- order_products %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(reordered = as.factor(reordered)) %>%
mutate(proportion = count/sum(count))
kable(tmp)
reordered | count | proportion |
---|---|---|
0 | 420967 | 0.4014658 |
1 | 627608 | 0.5985342 |
Although there is some overlap with the staples (notably bananas and milk), many of the high-frequency re-orders are more niche products, such as particular kinds of milk and bread.
tmp <-order_products %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
filter(n>40) %>%
top_n(10,wt=proportion_reordered) %>%
arrange(desc(proportion_reordered)) %>%
left_join(products,by="product_id")
kable(tmp)
product_id | proportion_reordered | n | product_name | aisle_id | department_id |
---|---|---|---|---|---|
1729 | 0.9200000 | 75 | 2% Lactose Free Milk | 84 | 16 |
36717 | 0.9122807 | 57 | Double Fiber Bread | 112 | 3 |
20940 | 0.9052632 | 285 | Organic Low Fat Milk | 84 | 16 |
11885 | 0.9019608 | 51 | Sparkling Water, Bottles | 115 | 7 |
21038 | 0.8906250 | 64 | Organic Spelt Tortillas | 128 | 3 |
39180 | 0.8847185 | 373 | Organic Lowfat 1% Milk | 84 | 16 |
24852 | 0.8815082 | 14136 | Banana | 24 | 4 |
24024 | 0.8809524 | 336 | 1% Lowfat Milk | 84 | 16 |
1402 | 0.8793103 | 58 | Plain Soymilk Creamer | 91 | 16 |
29447 | 0.8775982 | 433 | Milk, Organic, Vitamin D | 84 | 16 |
order_products %>%
left_join(orders,by="order_id") %>%
group_by(days_since_prior_order) %>%
summarize(mean_reorder = mean(reordered)) %>%
ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
geom_bar(stat="identity",fill="skyblue")
Looking at the reorder rate as a function of days since the prior order, we can see a “memory effect”: the sooner a customer places the next order, the more likely it is to contain products they have ordered before.
First we look at the number of SKUs offered per department and aisle. The area of each rectangle in the treemap is proportional to the number of SKUs in that category.
tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")
tmp2<-order_products %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(tmp, by = c("department_id", "aisle_id")) %>%
mutate(onesize = 1)
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")
Likewise, we have a similar visualization for what’s sold.
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")
Notice how produce and dairy account for a small number of SKUs yet comprise almost half of the units sold. Talk about a Pareto-style power distribution!
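As a quick sanity check of that claim (a sketch using the tmp2 summary built above; sku_share and unit_share are names introduced here), we can compare each department's share of SKUs with its share of units sold:

tmp2 %>%
  group_by(department) %>%
  summarize(skus = sum(n), units = sum(sumcount)) %>%
  mutate(sku_share = skus / sum(skus),        # share of distinct products offered
         unit_share = units / sum(units)) %>% # share of units actually sold
  arrange(desc(unit_share))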
In the next section, we will look to apply the association rules, or market basket analysis (MBA) we learned in class to this dataset.
We read the dataset into the transaction-class format required by the arules package.
tmp <- order_products %>%
  group_by(product_id) %>%
  left_join(products, by="product_id")
write.csv(tmp, file = "transactions.csv")
transactions<-read.transactions("transactions.csv", format = "single", sep = ",",cols = c(2,6))
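As an aside, one can likely skip the CSV round trip and build the transactions object directly in memory using arules' coercion from a list of item vectors (baskets and transactions_alt are names introduced here; this sketch should be equivalent up to transaction IDs):

baskets <- order_products %>%
  left_join(select(products, product_id, product_name), by = "product_id")

# Split product names by order to get one character vector per basket,
# then coerce the list to the arules "transactions" class
transactions_alt <- as(split(as.character(baskets$product_name), baskets$order_id),
                       "transactions")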
summary(transactions)
## transactions as itemMatrix in sparse format with
## 99575 rows (elements/itemsets/transactions) and
## 36865 columns (items) and a density of 0.0002856507
##
## most frequent items:
## Banana Bag of Organic Bananas Organic Strawberries
## 14136 11639 8233
## Organic Baby Spinach Large Lemon (Other)
## 7443 6148 1000977
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 5212 5591 6114 6249 6791 6632 6451 6074 5519 4959 4614 4062 3686 3302 2902
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2672 2321 2027 1881 1584 1378 1274 1137 976 850 764 654 501 499 401
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 332 296 252 236 209 155 145 132 103 72 69 67 58 57 53
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 35 35 27 18 24 19 19 14 12 8 7 5 3 3 6
## 61 62 63 64 65 66 67 68 70 72 74 75 76 77 80
## 2 1 2 3 3 2 1 2 3 2 2 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 9.00 10.53 14.00 80.00
##
## includes extended item information - examples:
## labels
## 1 \\""Constant Comment\\"" Black Tea
## 2 \\""Constant Comment\\"" Decaffeinated Black Tea Blend
## 3 \\""Darn Good\\"" Chili Mix
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 100000
## 3 1000008
inspect(transactions[1:3])
## items transactionID
## [1] {Bag of Organic Bananas,
## Bulgarian Yogurt,
## Cucumber Kirby,
## Lightly Smoked Sardines in Olive Oil,
## Organic 4% Milk Fat Whole Milk Cottage Cheese,
## Organic Celery Hearts,
## Organic Hass Avocado,
## Organic Whole String Cheese} 1
## [2] {Corn Tortillas,
## Extra Virgin Olive Oil,
## Gala Apples,
## Garnet Sweet Potato (Yam),
## Ground Cumin,
## I Heart Baby Kale,
## No Salt Added Black Beans,
## Organic Baby Carrots,
## Organic Baby Spinach,
## Organic Yellow Onion,
## Original Hummus,
## Snack Sticks Chicken & Rice Recipe Dog Treats,
## Total 2% All Natural Plain Greek Yogurt,
## Unscented Long Lasting Stick Deodorant,
## Wheat Sandwich Thins} 100000
## [3] {Daily Moisture Shampoo,
## DeTox Caffeine Free Organic Herbal Tea Bags,
## Ensure Plus Milk Chocolate Nutrition Shake,
## G Series Perform Frost Glacier Cherry Sports Drink,
## Original No Pulp 100% Florida Orange Juice,
## Triple Chocolate Ripple,
## ZzzQuil Liquid Warming Berry Flavor Sleep-Aid} 1000008
Next we apply the apriori algorithm from the arules package and inspect 10 of the 685 rules generated. We consider rules with a minimum support of 0.001 and a minimum confidence \(P(Y|X)\) of 0.25. The output consists of the association rules together with their support, confidence, and lift, where \(lift(X \rightarrow Y) = confidence(X \rightarrow Y)/support(Y)\).
groceryrules <- apriori(transactions, parameter = list(support = 0.001, confidence = 0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 99
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[36865 item(s), 99575 transaction(s)] done [0.22s].
## sorting and recoding items ... [1877 item(s)] done [0.02s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.08s].
## writing ... [685 rule(s)] done [0.01s].
## creating S4 object ... done [0.04s].
# Association rules
summary(groceryrules)
## set of 685 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 207 463 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 3.00 2.72 3.00 4.00
##
## summary of quality measures:
## support confidence lift
## Min. :0.001004 Min. :0.2500 Min. : 1.763
## 1st Qu.:0.001165 1st Qu.:0.2718 1st Qu.: 2.415
## Median :0.001386 Median :0.3012 Median : 3.369
## Mean :0.001950 Mean :0.3221 Mean : 5.789
## 3rd Qu.:0.001808 3rd Qu.:0.3540 3rd Qu.: 4.592
## Max. :0.023279 Max. :0.6250 Max. :86.850
##
## mining info:
## data ntransactions support confidence
## transactions 99575 0.001 0.25
inspect(groceryrules[1:10])
## lhs rhs support confidence lift
## [1] {Zero Calorie Cola} => {Soda} 0.001024354 0.3984375 35.141199
## [2] {Organic Fuji Apples} => {Bag of Organic Bananas} 0.001154908 0.3432836 2.936890
## [3] {Packaged Grape Tomatoes} => {Hass Avocados} 0.001154908 0.2584270 14.840176
## [4] {Packaged Grape Tomatoes} => {Strawberries} 0.001134823 0.2539326 5.139296
## [5] {Baby Cucumbers} => {Bag of Organic Bananas} 0.001205122 0.2515723 2.152274
## [6] {Lemon Sparkling Water} => {Grapefruit Sparkling Water} 0.001104695 0.3133903 65.284185
## [7] {Nonfat Icelandic Style Strawberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001134823 0.4109091 76.622234
## [8] {Sweet Potato Yam} => {Banana} 0.001174994 0.3861386 2.719988
## [9] {Organic Honey Sweet Whole Wheat Bread} => {Bag of Organic Bananas} 0.001064524 0.3642612 3.116359
## [10] {Organic Whole Grassmilk Milk} => {Bag of Organic Bananas} 0.001205122 0.3529412 3.019514
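With the rule set in hand, a natural next step (a sketch, not part of the original output) is to sort the rules by lift and inspect the strongest associations, or to restrict attention to rules that predict a particular product:

# Strongest associations by lift
inspect(head(sort(groceryrules, by = "lift"), 5))

# Rules whose right-hand side is a specific product (here, Banana)
banana_rules <- subset(groceryrules, rhs %in% "Banana")
inspect(head(sort(banana_rules, by = "confidence"), 5))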
These rules consider only the collective, cumulative set of transactions. Our submission to Kaggle, which is still a work in progress, involves two additional association dimensions: we would have to look at individual customers over time to discern their patterns and make inferences about their next purchases.