Instacart is an internet - based grocery delivery service with a slogan of Groceries Delivered in an Hour. The purpose of this exercise is to analyze the trend in customer buying pattern on Instacart, suggest combination of products which can be sold together under various offers.
The data set is a relational set of files describing customers’ orders over time. The data set is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 of their order details is provided with the sequence of products purchased in each order, the week and hour of day the order was placed and a relative measure of time between orders.
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)
library(magrittr)
library(grid)
library(gridExtra)
library(ggthemes)
library(ggrepel)
library(tcltk)
library(gsubfn)
library(proto)
library(RSQLite)
library(sqldf)
library(Matrix)
library(arules)
library(tidyr)
library(arulesViz)
library(methods)
products<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/products.csv")
orders<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/orders.csv")
prior<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/order_products__prior.csv")
aisles<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/aisles.csv")
departments<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/departments.csv")
The data set contains list of unique order_id for corresponding orders made by users. Order_number gives the number of the order. Eval_set denotes if the order is a prior order, train, or test. All but the last order of every user is classified as prior. Last order of every user is either classified as train or test. The ones classified as test are the order_id for which we predict which products will be included in the next order. Order_dow gives the day of the week and order_hour_of_day denotes hour of the day. Days_since_prior_order gives the time difference between two orders and contains NULL value for the first order of every user. There are 3 million plus order_id for 200,000 plus different users.
kable(head(orders,10))
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 2539329 | 1 | prior | 1 | 2 | 8 | NA |
| 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
| 473747 | 1 | prior | 3 | 3 | 12 | 21 |
| 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
| 431534 | 1 | prior | 5 | 4 | 15 | 28 |
| 3367565 | 1 | prior | 6 | 2 | 7 | 19 |
| 550135 | 1 | prior | 7 | 1 | 9 | 20 |
| 3108588 | 1 | prior | 8 | 1 | 14 | 14 |
| 2295261 | 1 | prior | 9 | 1 | 16 | 0 |
| 2550362 | 1 | prior | 10 | 4 | 8 | 30 |
glimpse(orders)
## Observations: 3,421,083
## Variables: 7
## $ order_id <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set <fctr> prior, prior, prior, prior, prior, pri...
## $ order_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
dim(orders)
## [1] 3421083 7
kable(head(products,10))
| product_id | product_name | aisle_id | department_id |
|---|---|---|---|
| 1 | Abc Chocolate Sandwich Cookies | 61 | 19 |
| 2 | Abc All-Seasons Salt | 104 | 13 |
| 3 | Abc Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 4 | Abc Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
| 5 | Abc Green Chile Anytime Sauce | 5 | 13 |
| 6 | Abc Dry Nose Oil | 11 | 11 |
| 7 | Abc Pure Coconut Water With Orange | 98 | 7 |
| 8 | Abc Cut Russet Potatoes Steam N’ Mash | 116 | 1 |
| 9 | Abc Light Strawberry Blueberry Yogurt | 120 | 16 |
| 10 | Abc Sparkling Orange Juice & Prickly Pear Beverage | 115 | 7 |
glimpse(products)
## Observations: 49,688
## Variables: 4
## $ product_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ product_name <fctr> Abc Chocolate Sandwich Cookies, Abc All-Seasons...
## $ aisle_id <int> 61, 104, 94, 38, 5, 11, 98, 116, 120, 115, 31, 1...
## $ department_id <int> 19, 13, 7, 1, 13, 11, 7, 1, 16, 7, 7, 1, 11, 17,...
dim(products)
## [1] 49688 4
kable(head(aisles,10))
| aisle_id | aisle |
|---|---|
| 1 | prepared soups salads |
| 2 | specialty cheeses |
| 3 | energy granola bars |
| 4 | instant foods |
| 5 | marinades meat preparation |
| 6 | other |
| 7 | packaged meat |
| 8 | bakery desserts |
| 9 | pasta sauce |
| 10 | kitchen supplies |
glimpse(aisles)
## Observations: 134
## Variables: 2
## $ aisle_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ aisle <fctr> prepared soups salads, specialty cheeses, energy gra...
dim(aisles)
## [1] 134 2
kable(head(departments,10))
| department_id | department |
|---|---|
| 1 | frozen |
| 2 | other |
| 3 | bakery |
| 4 | produce |
| 5 | alcohol |
| 6 | international |
| 7 | beverages |
| 8 | pets |
| 9 | dry goods pasta |
| 10 | bulk |
glimpse(departments)
## Observations: 21
## Variables: 2
## $ department_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ department <fctr> frozen, other, bakery, produce, alcohol, intern...
dim(departments)
## [1] 21 2
Prior table contains product_id for every order_id. It thereby gives information about products included in every order. Add_to_cart_order gives the order for product_id by which it was added by customer to their shopping cart. Every product_id is classified and coded as 1 under reordered column if it was previously ordered by customer and 0 otherwise. It is the largest table with over 32 million rows of data.
kable(head(prior,10))
| order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|
| 2 | 33120 | 1 | 1 |
| 2 | 28985 | 2 | 1 |
| 2 | 9327 | 3 | 0 |
| 2 | 45918 | 4 | 1 |
| 2 | 30035 | 5 | 0 |
| 2 | 17794 | 6 | 1 |
| 2 | 40141 | 7 | 1 |
| 2 | 1819 | 8 | 1 |
| 2 | 43668 | 9 | 0 |
| 3 | 33754 | 1 | 1 |
glimpse(prior)
## Observations: 32,434,489
## Variables: 4
## $ order_id <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,...
## $ product_id <int> 33120, 28985, 9327, 45918, 30035, 17794, 401...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6,...
## $ reordered <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,...
dim(prior)
## [1] 32434489 4
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
Products_Aisles<-merge(products,aisles,by="aisle_id")
Products_Aisles_Departments<-merge(Products_Aisles,departments,"department_id")
kable(head(Products_Aisles_Departments,6))
| department_id | aisle_id | product_id | product_name | aisle | department |
|---|---|---|---|---|---|
| 1 | 37 | 32475 | Abc Meyer Lemon Sorbet | ice cream ice | frozen |
| 1 | 37 | 18020 | Abc Black Raspberry Chocolate Chip Ice Cream | ice cream ice | frozen |
| 1 | 37 | 20175 | Abc The Original Vanilla Ice Cream Sandwich | ice cream ice | frozen |
| 1 | 37 | 49459 | Abc Dark Chocolate Non Dairy Frozen Dessert Bar | ice cream ice | frozen |
| 1 | 37 | 8507 | Abc Fun Flavors Spumoni Ice Cream | ice cream ice | frozen |
| 1 | 37 | 30176 | Abc Fruit Bars Straw Tangerine | ice cream ice | frozen |
glimpse(Products_Aisles_Departments)
## Observations: 49,688
## Variables: 6
## $ department_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ aisle_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, ...
## $ product_id <int> 32475, 18020, 20175, 49459, 8507, 30176, 36242, ...
## $ product_name <fctr> Abc Meyer Lemon Sorbet, Abc Black Raspberry Cho...
## $ aisle <fctr> ice cream ice, ice cream ice, ice cream ice, ic...
## $ department <fctr> frozen, frozen, frozen, frozen, frozen, frozen,...
dim(Products_Aisles_Departments)
## [1] 49688 6
Missing, candy chocoloate and Ice cream ice aisle are the aisles with maximum variety of products while few variety is avilable relatively in trash bag liners, frozen dessert and Indian foods aisles
Number_of_Product_each_Aisle<-Products_Aisles_Departments%>%group_by(aisle)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))
#Top 20 Aisle by number of product offerings
Top_20<-head(Number_of_Product_each_Aisle,n=20)
#Plotting Number of Products in each aisle in decreasing order(Top 20)
ggplot(Top_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
geom_bar(stat = "identity")+coord_flip()+
labs(title="Top 20 Aisle by Variety of Product Offering",y="Number of Products",x="Aisle")+
geom_text(nudge_y = 35)
#Bottom 20 Aisle by number of product offerings
Bottom_20<-tail(Number_of_Product_each_Aisle,n=20)
#Plotting Number of Products in each aisle in decreasing order(Bottom 20)
ggplot(Bottom_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
geom_bar(stat = "identity")+coord_flip()+labs(title="Bottom 20 Aisle by Variety of Product offering",y="Number of Products",x="Aisle")+
geom_text(nudge_y = 3.5)
It can be inferred from the below bar chart that Instacart has maximum number of product offerings across personal care and edible item departments
#Number of Products in each department
Number_of_Product_each_department<-Products_Aisles_Departments%>%group_by(department)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))
#Vis--Bar chart for number of products in each department
ggplot(Number_of_Product_each_department, aes(x = reorder(department,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
geom_bar(stat = "identity")+coord_flip()+labs(title="Department by Variety of Product offering",y="Number of Products",x="Department")+
geom_text(nudge_y = 250)
From the figure it is inferred that most people order between 9:00 AM to 6:00 PM in the evening. Instacart can accordingly plan to hire persons for delivery during days shifts. The visualization is plotted in R with colored portion representing the relative percentage of total orders across the day with 10:00 AM being the busiest hour (100% in vis)
Orders_everyhour<-orders%>%group_by(order_hour_of_day)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))
#Visualization for number of orders at every hour of the day and every day of the week
# referenced from: http://zoonek2.free.fr/UNIX/48_R/03.html
x<-Orders_everyhour$Percentage_of_orders
clock.plot <- function (x, col = rainbow(n,s=1,v=1,start=0,end=max(1,n-1)/n,alpha=0.5), ...) {
if( min(x)<0 ) x <- x - min(x)
if( max(x)>1 ) x <- x/max(x)
n <- length(x)
if(is.null(names(x))) names(x) <- 0:(n-1)
m <- 1.05
plot(0,
type = 'n', # do not plot anything
xlim = c(-m,m), ylim = c(-m,m),
axes = F, xlab = '', ylab = '', ...)
a <- pi/2 - 2*pi/200*0:200
polygon( cos(a), sin(a) )
v <- .02
a <- pi/2 - 2*pi/n*0:n
segments( (1+v)*cos(a), (1+v)*sin(a),
(1-v)*cos(a), (1-v)*sin(a) )
segments( cos(a), sin(a),
0, 0,
col = 'light grey', lty = 3)
ca <- -2*pi/n*(0:50)/50
for (i in 1:n) {
a <- pi/2 - 2*pi/n*(i-1)
b <- pi/2 - 2*pi/n*i
polygon( c(0, x[i]*cos(a+ca), 0),
c(0, x[i]*sin(a+ca), 0),
col=col[i] )
v <- .1
text((1+v)*cos(a), (1+v)*sin(a), names(x)[i])
}
}
clock.plot(x,
main = "Peak Ordering Hours")
From the figure below it is deduced that Sunday and Monday are the days when people order most on Instacart
#Number of Orders every day of the week
Orders_everyday<-orders%>%group_by(order_dow)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))
#Visualizing Number of Orders by day of the week
ggplot(Orders_everyday,aes(x=order_dow,y=Percentage_of_orders,label=paste0(round(Percentage_of_orders,1))))+
geom_bar(stat = "identity")+labs(title="% of Orders by day of the Week",y="Percentage of Total Orders",x="Day of the Week : 0 denotes Sunday ")+
geom_text(nudge_y = .5)
It is found that 10:00 AM on Monday is the time of the day when most orders are placed. This can be thought of when people go to work they refill their groceries for the rest of the week. To visualize this percentage of orders made every hour for every day is plotted and compared.
Dow_hod_orders<-orders%>%group_by(order_dow,order_hour_of_day)%>%
summarise(Number_of_Orders=n())
Dow_hod_orders_combined<-merge(Dow_hod_orders,Orders_everyday,by="order_dow",all.x = TRUE)%>%
mutate(Percentage_by_doy=Number_of_Orders.x*100/Number_of_Orders.y)
#Visualizing orders by dow-->hod
ggplot(Dow_hod_orders_combined, aes(x = Dow_hod_orders_combined$order_hour_of_day, y = Dow_hod_orders_combined$Percentage_by_doy)) +
geom_bar(stat="identity") +
labs(title="Visualizing orders by hour of day for each day of week with 0 representing Sunday",x="0-24 represents hours of the day",y="Percentage of orders for the day")+
facet_grid(~ Dow_hod_orders_combined$order_dow)
We are given the gap between two orders for every user. When we plot it we find two categories of people! One that reorders monthly other who does weekly. This is based on the peaks formed at 30th day and 7th day.
library(plyr)
library(dplyr)
Reordering_Gap<-count(orders,'days_since_prior_order')%>%arrange(desc(freq))%>%mutate(Percent_orders=round(freq*100/nrow(orders)),2)
#Inference: 11 % of the time people reorder monthly(after 30 days), and 9 % of the time weekly. This shows there is a section of people who refill their groceries every month and other who refills every week. Frequency of NA represents total number of unique users and its their first order.
#Visualizing reordering Gap
Reordering_Gap_plot<-ggplot(orders,aes(x=days_since_prior_order))+
geom_histogram(aes(fill=..count..),binwidth=1)+
scale_x_continuous(name = "Days Since Prior Order",breaks = seq(0, 30, 1))+
scale_y_continuous(name = "Frequency of Orders",breaks=seq(0,1000000,100000))+
ggtitle("Gap between two orders?")+
labs(x="Days Since Prior Order")+
theme_update()
Reordering_Gap_plot
top25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%head(25)
colnames(top25_products)[1]<-'product_id'
Top25Products<-merge(top25_products,Products_Aisles_Departments,by='product_id')%>%arrange(desc(freq))
kable(head(Top25Products,25))
| product_id | freq | department_id | aisle_id | product_name | aisle | department |
|---|---|---|---|---|---|---|
| 24852 | 472565 | 4 | 24 | Abc Banana | fresh fruits | produce |
| 13176 | 379450 | 4 | 24 | Abc Bag of Organic Bananas | fresh fruits | produce |
| 21137 | 264683 | 4 | 24 | Abc Organic Strawberries | fresh fruits | produce |
| 21903 | 241921 | 4 | 123 | Abc Organic Baby Spinach | packaged vegetables fruits | produce |
| 47209 | 213584 | 4 | 24 | Abc Organic Hass Avocado | fresh fruits | produce |
| 47766 | 176815 | 4 | 24 | Abc Organic Avocado | fresh fruits | produce |
| 47626 | 152657 | 4 | 24 | Abc Large Lemon | fresh fruits | produce |
| 16797 | 142951 | 4 | 24 | Abc Strawberries | fresh fruits | produce |
| 26209 | 140627 | 4 | 24 | Abc Limes | fresh fruits | produce |
| 27845 | 137905 | 16 | 84 | Abc Organic Whole Milk | milk | dairy eggs |
| 27966 | 137057 | 4 | 123 | Abc Organic Raspberries | packaged vegetables fruits | produce |
| 22935 | 113426 | 4 | 83 | Abc Organic Yellow Onion | fresh vegetables | produce |
| 24964 | 109778 | 4 | 83 | Abc Organic Garlic | fresh vegetables | produce |
| 45007 | 104823 | 4 | 83 | Abc Organic Zucchini | fresh vegetables | produce |
| 39275 | 100060 | 4 | 123 | Abc Organic Blueberries | packaged vegetables fruits | produce |
| 49683 | 97315 | 4 | 83 | Abc Cucumber Kirby | fresh vegetables | produce |
| 28204 | 89632 | 4 | 24 | Abc Organic Fuji Apple | fresh fruits | produce |
| 5876 | 87746 | 4 | 24 | Abc Organic Lemon | fresh fruits | produce |
| 8277 | 85020 | 4 | 24 | Abc Apple Honeycrisp Organic | fresh fruits | produce |
| 40706 | 84255 | 4 | 123 | Abc Organic Grape Tomatoes | packaged vegetables fruits | produce |
| 4920 | 82689 | 4 | 123 | Abc Seedless Red Grapes | packaged vegetables fruits | produce |
| 30391 | 80392 | 4 | 83 | Abc Organic Cucumber | fresh vegetables | produce |
| 45066 | 79769 | 4 | 24 | Abc Honeycrisp Apple | fresh fruits | produce |
| 42265 | 76896 | 4 | 123 | Abc Organic Baby Carrots | packaged vegetables fruits | produce |
| 49235 | 76360 | 16 | 53 | Abc Organic Half & Half | cream | dairy eggs |
#Visualization of top 50 products
ggplot(Top25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
geom_bar(stat = "identity")+coord_flip()+labs(title="Most ordered Products: Top 25 ",y="Number of orders",x="product_name")+
geom_text(nudge_y = 20000)
#Bottom 25 (least ordered products)
bottom25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%tail(25)
colnames(bottom25_products)[1]<-'product_id'
bottom25Products<-merge(bottom25_products,Products_Aisles_Departments,by='product_id')%>%arrange(freq)
kable(head(bottom25Products,25))
| product_id | freq | department_id | aisle_id | product_name | aisle | department |
|---|---|---|---|---|---|---|
| 42206 | 1 | 11 | 80 | Abc Anarchy For Her Daily Fragrance | deodorants | personal care |
| 42235 | 1 | 13 | 5 | Abc Wicked Good No-Fry Wing Marinade | marinades meat preparation | pantry |
| 42464 | 1 | 13 | 104 | Abc Organic Aleppo Pepper | spices seasonings | pantry |
| 43098 | 1 | 11 | 70 | Abc Buddy Bear Probiotic Orange Flavor Chewable | digestion | personal care |
| 43144 | 1 | 13 | 19 | Abc Flax Oil Organic Omega-3 Original Formula | oils vinegars | pantry |
| 43778 | 1 | 19 | 3 | Abc Strawberry Energy Gel | energy granola bars | snacks |
| 44495 | 1 | 16 | 120 | Abc Lowfat Cherry Lime Supernova Kefir Cultured Milk Smoothie | yogurt | dairy eggs |
| 44748 | 1 | 13 | 72 | Abc Thick ’n Chunky Salsa Mild | condiments | pantry |
| 44982 | 1 | 1 | 37 | Abc Coconut Bliss Pineapple Coconut | ice cream ice | frozen |
| 44986 | 1 | 19 | 107 | Abc Mustard & Onion | chips pretzels | snacks |
| 45273 | 1 | 20 | 67 | Abc Flame Roasted Red Peppers Spreadable Cheese | fresh dips tapenades | deli |
| 45378 | 1 | 11 | 22 | Abc Miss Treated Conditioner | hair care | personal care |
| 45380 | 1 | 19 | 3 | Abc Original Salted Caramel Protein Energy Bar | energy granola bars | snacks |
| 45400 | 1 | 17 | 114 | Abc Multi-Surface Sunflower Scent Everyday Cleaner | cleaning products | household |
| 45893 | 1 | 9 | 9 | Abc Florentine Spinach & Cheese Pasta Sauce | pasta sauce | dry goods pasta |
| 46278 | 1 | 14 | 121 | Abc Berry Sprouted Blend Cereal | cereal | breakfast |
| 46379 | 1 | 11 | 73 | Abc Skin Firming Cellulite Gel Cream | facial care | personal care |
| 46741 | 1 | 15 | 59 | Abc Seasoned Southern Style Red Beans And Rice | canned meals beans | canned goods |
| 47408 | 1 | 21 | 100 | Abc Control GX Gray Reducing Shampoo | missing | missing |
| 47680 | 1 | 5 | 28 | Abc Indian Wells Merlot | red wines | alcohol |
| 48151 | 1 | 21 | 100 | Abc Grill Mates Brown Sugar Bourbon Seasoning | missing | missing |
| 48211 | 1 | 9 | 4 | Abc Cajun Sides Dirty Rice | instant foods | dry goods pasta |
| 48343 | 1 | 5 | 27 | Abc Hennepin Farmhouse Ale | beers coolers | alcohol |
| 48461 | 1 | 19 | 103 | Abc The Ultimate Caramel Suace | ice cream toppings | snacks |
| 49487 | 1 | 8 | 40 | Abc Vitality Chicken Breasts with Flaxseed & Vitamins Dog Treats | dog food care | pets |
#Visualization of bottom 25 products
ggplot(bottom25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
geom_bar(stat = "identity")+coord_flip()+labs(title="least ordered Products: Bottom 25 ",y="Number of orders",x="product_name")+
geom_text(nudge_y = 1)
Market Basket Analysis has wide applications including but not limited to cross selling, product placement, affinity promotion, fraud detection and customer behavior
Apriori Algorithm and Association Rules
Frequent Itemset Property: Any subset of a frequent itemset is frequent.
Contrapositive: If an itemset is not frequent, none of its supersets are frequent.
. A set of items is referred as an itemset. A itemset that contains k items is a k-itemset . In theory, we can consider all rules -exponentially many but it is not a practical solution. Hence, we consider only combinations that occur with high frequency and call such sets as frequent item sets . The idea of frequent item sets is used for computational efficiency. If the set {item A, Item B} is not frequent, then no set containing item A and item B are frequent, and therefore do not need to be considered
Support
. The support s of an itemset A is the percentage of transactions in the transaction database D that contains A . The support of the rule A???B in the transaction database D is the support of the items set (A and B) in D . Rules with low support may have happened by chance. Low support rules may be uninteresting from the business side with an exception if the consequent is very valuable and /or the confidence is very high
Confidence
. The confidence of the rule A???B in the transaction database D is the ratio of the number of transactions in D that contain (A and B) to the number of transactions that contain A in D . The confidence of A???B is a measure of the reliability of the rule . It is an estimate of P(B|A). That is, it tells us the conditional probability that the items in the consequent set are contained in a randomly selected transaction that includes the antecedent set
Lift Ratio
. The lift ratio allows us to judge the strength of an association rule compared to a benchmark value . The benchmark: If the antecedent set and consequent sets are independent we can write the confidence as: P (consequent | antecedent) = P (antecedent AND consequent) / P (antecedent) =P (antecedent) x P (consequent) / P (antecedent) =p (consequent) . P(consequent) is called benchmark confidence . Lift ratio is defined as Confidence / benchmark confidence . Lift Ratio = [{support (A and B)} / {Support (A) * Support (B)}] . The lift ratio can take value between 0 and infinity
Interpreting the Results
There is no rule of thumb for what is a “good rule”. We can consider following points while analyzing our dataset. . How impactful a rule is: This can be measured from the size of support . Efficiency of Rule: The lift tells us how efficient the rule is at finding the consequent set compared to a random selection . Operational usefulness: The confidence tells is how efficient the rule will be in practice
prior4mba<-split(prior$product_id,prior$order_id)
transaction_prior<-as(prior4mba,"transactions")
dim(transaction_prior)
## [1] 3214874 49677
#frequent product ids in the transactions
itemFrequencyPlot(transaction_prior,support=0.05,cex.names=0.8)
#Apriori algorithm
basket_rules<-apriori(transaction_prior,parameter = list(sup=0.00001,conf=0.6,maxlen=3,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 1e-05 1
## maxlen target ext
## 3 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 32
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.85s].
## sorting and recoding items ... [30863 item(s)] done [1.10s].
## creating transaction tree ... done [8.48s].
## checking subsets of size 1 2 done [18.65s].
## writing ... [80 rule(s)] done [3.83s].
## creating S4 object ... done [1.89s].
#Visualizing rules
#Number of Products per basket
hist(size(transaction_prior), breaks = 0:150, xaxt="n", ylim=c(0,250000), col = "grey",
main = "Number of Products per Order", xlab = "Order Size:Number of Products")
axis(1, at=seq(0,160,by=10), cex.axis=0.8)
mtext(paste("Total:", length(transaction_prior), "Orders,", sum(size(transaction_prior)), "Products"))
#Frequently ordered products
#We find 15 products to occur when the support is set at 0.03. This means these products are found in 3% of the total transactions which is approximately about 90,000
item_frequencies <- itemFrequency(transaction_prior, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transaction_prior)]
par(mar=c(2,10,2,2)); options(scipen=5)
barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transaction_prior), col="red")
#Frequent items bought together
#We desire to make 2 products and 3 product combinations and hence we choose a lower support = 0.003 which means the product is in around 0.3 % of 3 million transactions that is about 10,000 times the product is sold
basket_rules<-apriori(transaction_prior,parameter = list(sup=0.0003, conf=0.5, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.0003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 964
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.87s].
## sorting and recoding items ... [5213 item(s)] done [1.34s].
## creating transaction tree ... done [9.64s].
## checking subsets of size 1 2 3 done [6.20s].
## writing ... [60 rule(s)] done [0.08s].
## creating S4 object ... done [1.50s].
plot(basket_rules)
plot(head(sort(basket_rules,by="lift"),10),method="graph")
plot(basket_rules,method="grouped")
#Above figure visualizes all the three parameters: support, confidence, and lift. Confidence level is set at 50%. We get a set of 60 rules. We sort them by the value of lift which gives the efficiency of the rule and thereby make our product combinations