Exploratory Data Analysis & Market Basket Analysis

Instacart is an internet-based grocery delivery service with the slogan "Groceries Delivered in an Hour". The purpose of this exercise is to analyze trends in customer buying patterns on Instacart and to suggest combinations of products that can be sold together under various offers.

Loading dependencies and Data Files

The data set is a relational set of files describing customers’ orders over time. It is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 orders are provided, with the sequence of products purchased in each order, the day of the week and hour of day the order was placed, and a relative measure of time between orders.

library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)
library(magrittr)
library(grid)
library(gridExtra)
library(ggthemes)
library(ggrepel)
library(tcltk)
library(gsubfn)
library(proto)
library(RSQLite)
library(sqldf)
library(Matrix)
library(arules)
library(tidyr)
library(arulesViz)
library(methods)

products<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/products.csv")
orders<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/orders.csv")
prior<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/order_products__prior.csv")
aisles<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/aisles.csv")
departments<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/departments.csv")

Viewing the Data Sets

Orders

The orders table lists a unique order_id for each order placed by a user. order_number gives the sequence number of the order for that user. eval_set denotes whether the order is a prior, train, or test order: all but the last order of every user are classified as prior, and the last order of every user is classified as either train or test. The orders classified as test are those for which we predict which products will be included in the next order. order_dow gives the day of the week and order_hour_of_day gives the hour of the day. days_since_prior_order gives the number of days since the user's previous order and is NA for the first order of every user. There are over 3 million order_id values for more than 200,000 different users.

kable(head(orders,10))

order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
2539329	1	prior	1	2	8	NA
2398795	1	prior	2	3	7	15
473747	1	prior	3	3	12	21
2254736	1	prior	4	4	7	29
431534	1	prior	5	4	15	28
3367565	1	prior	6	2	7	19
550135	1	prior	7	1	9	20
3108588	1	prior	8	1	14	14
2295261	1	prior	9	1	16	0
2550362	1	prior	10	4	8	30

glimpse(orders)

## Observations: 3,421,083
## Variables: 7
## $ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set               <fctr> prior, prior, prior, prior, prior, pri...
## $ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...

dim(orders)

## [1] 3421083       7
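The structural rules described for the orders table (a missing gap on every first order, and only the last order of a user being train or test) can be checked directly. The following is a minimal sketch on a made-up five-row stand-in called toy_orders, not on the real file; the same expressions should carry over to orders, but that is an assumption.

```r
# Toy stand-in for the orders table (values are invented for illustration)
toy_orders <- data.frame(
  user_id = c(1, 1, 1, 2, 2),
  order_number = c(1, 2, 3, 1, 2),
  eval_set = c("prior", "prior", "train", "prior", "test"),
  days_since_prior_order = c(NA, 15, 21, NA, 7),
  stringsAsFactors = FALSE
)

# Rule 1: the gap is missing exactly on each user's first order
first_order <- toy_orders$order_number == 1
gaps_ok <- all(is.na(toy_orders$days_since_prior_order) == first_order)

# Rule 2: only the last order of each user is train or test
last_number <- ave(toy_orders$order_number, toy_orders$user_id, FUN = max)
last_order <- toy_orders$order_number == last_number
eval_ok <- all((toy_orders$eval_set != "prior") == last_order)

gaps_ok
eval_ok
```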

Products

kable(head(products,10))

product_id	product_name	aisle_id	department_id
1	Abc Chocolate Sandwich Cookies	61	19
2	Abc All-Seasons Salt	104	13
3	Abc Robust Golden Unsweetened Oolong Tea	94	7
4	Abc Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce	38	1
5	Abc Green Chile Anytime Sauce	5	13
6	Abc Dry Nose Oil	11	11
7	Abc Pure Coconut Water With Orange	98	7
8	Abc Cut Russet Potatoes Steam N’ Mash	116	1
9	Abc Light Strawberry Blueberry Yogurt	120	16
10	Abc Sparkling Orange Juice & Prickly Pear Beverage	115	7

glimpse(products)

## Observations: 49,688
## Variables: 4
## $ product_id    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ product_name  <fctr> Abc Chocolate Sandwich Cookies, Abc All-Seasons...
## $ aisle_id      <int> 61, 104, 94, 38, 5, 11, 98, 116, 120, 115, 31, 1...
## $ department_id <int> 19, 13, 7, 1, 13, 11, 7, 1, 16, 7, 7, 1, 11, 17,...

dim(products)

## [1] 49688     4

Aisles

kable(head(aisles,10))

aisle_id	aisle
1	prepared soups salads
2	specialty cheeses
3	energy granola bars
4	instant foods
5	marinades meat preparation
6	other
7	packaged meat
8	bakery desserts
9	pasta sauce
10	kitchen supplies

glimpse(aisles)

## Observations: 134
## Variables: 2
## $ aisle_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ aisle    <fctr> prepared soups salads, specialty cheeses, energy gra...

dim(aisles)

## [1] 134   2

Departments

kable(head(departments,10))

department_id	department
1	frozen
2	other
3	bakery
4	produce
5	alcohol
6	international
7	beverages
8	pets
9	dry goods pasta
10	bulk

glimpse(departments)

## Observations: 21
## Variables: 2
## $ department_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ department    <fctr> frozen, other, bakery, produce, alcohol, intern...

dim(departments)

## [1] 21  2

Prior

The prior table contains the product_id of every product in each order_id, and thereby gives information about the products included in every order. add_to_cart_order gives the position in which each product was added to the customer's shopping cart. The reordered column is coded 1 if the product was previously ordered by the customer and 0 otherwise. It is the largest table, with over 32 million rows of data.

kable(head(prior,10))

order_id	product_id	add_to_cart_order	reordered
2	33120	1	1
2	28985	2	1
2	9327	3	0
2	45918	4	1
2	30035	5	0
2	17794	6	1
2	40141	7	1
2	1819	8	1
2	43668	9	0
3	33754	1	1

glimpse(prior)

## Observations: 32,434,489
## Variables: 4
## $ order_id          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,...
## $ product_id        <int> 33120, 28985, 9327, 45918, 30035, 17794, 401...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6,...
## $ reordered         <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,...

dim(prior)

## [1] 32434489        4
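Because the prior table has one row per product per order, basket size and the share of reordered products both come out of simple grouped summaries. The sketch below uses a five-row toy table modeled on the head of prior shown above; product_id 24838 and the reordered flags for order 3 are invented for illustration.

```r
# Toy stand-in for the prior table
toy_prior <- data.frame(
  order_id = c(2, 2, 2, 3, 3),
  product_id = c(33120, 28985, 9327, 33754, 24838),
  add_to_cart_order = c(1, 2, 3, 1, 2),
  reordered = c(1, 1, 0, 1, 1)
)

# Number of products in each order
basket_size <- table(toy_prior$order_id)

# Share of products in each order that the customer had bought before
reorder_rate <- tapply(toy_prior$reordered, toy_prior$order_id, mean)

basket_size
reorder_rate
```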

Recoding the Character Variables to Factor

orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

Merging The Products_Aisles_Departments Data sets

Products_Aisles<-merge(products,aisles,by="aisle_id")
Products_Aisles_Departments<-merge(Products_Aisles,departments,by="department_id")
kable(head(Products_Aisles_Departments,6))

department_id	aisle_id	product_id	product_name	aisle	department
1	37	32475	Abc Meyer Lemon Sorbet	ice cream ice	frozen
1	37	18020	Abc Black Raspberry Chocolate Chip Ice Cream	ice cream ice	frozen
1	37	20175	Abc The Original Vanilla Ice Cream Sandwich	ice cream ice	frozen
1	37	49459	Abc Dark Chocolate Non Dairy Frozen Dessert Bar	ice cream ice	frozen
1	37	8507	Abc Fun Flavors Spumoni Ice Cream	ice cream ice	frozen
1	37	30176	Abc Fruit Bars Straw Tangerine	ice cream ice	frozen

glimpse(Products_Aisles_Departments)

## Observations: 49,688
## Variables: 6
## $ department_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ aisle_id      <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, ...
## $ product_id    <int> 32475, 18020, 20175, 49459, 8507, 30176, 36242, ...
## $ product_name  <fctr> Abc Meyer Lemon Sorbet, Abc Black Raspberry Cho...
## $ aisle         <fctr> ice cream ice, ice cream ice, ice cream ice, ic...
## $ department    <fctr> frozen, frozen, frozen, frozen, frozen, frozen,...

dim(Products_Aisles_Departments)

## [1] 49688     6

Instacart Product Offerings

Aisle

The missing, candy chocolate, and ice cream ice aisles offer the widest variety of products, while relatively few varieties are available in the trash bag liners, frozen dessert, and Indian foods aisles.

Number_of_Product_each_Aisle<-Products_Aisles_Departments%>%group_by(aisle)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))

#Top 20 Aisle by number of product offerings
Top_20<-head(Number_of_Product_each_Aisle,n=20)

#Plotting Number of Products in each aisle in decreasing order(Top 20)
ggplot(Top_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+
  labs(title="Top 20 Aisle by Variety of Product Offering",y="Number of Products",x="Aisle")+
  geom_text(nudge_y = 35)

#Bottom 20 Aisle by number of product offerings
Bottom_20<-tail(Number_of_Product_each_Aisle,n=20)

#Plotting Number of Products in each aisle in decreasing order(Bottom 20)
ggplot(Bottom_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Bottom 20 Aisle by Variety of Product offering",y="Number of Products",x="Aisle")+
  geom_text(nudge_y = 3.5)

Department

The bar chart below shows that Instacart offers the greatest number of products in the personal care department and in the edible item departments.

#Number of Products in each department

Number_of_Product_each_department<-Products_Aisles_Departments%>%group_by(department)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))


#Vis--Bar chart for number of products in each department

ggplot(Number_of_Product_each_department, aes(x = reorder(department,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Department by Variety of Product offering",y="Number of Products",x="Department")+
  geom_text(nudge_y = 250)

Orders EDA

Hour_of_Day

The figure shows that most people order between 9:00 AM and 6:00 PM. Instacart can plan accordingly and hire delivery staff mainly for day shifts. The visualization is plotted in R; the colored portion for each hour represents that hour's orders relative to the busiest hour, 10:00 AM (100% in the visualization).

Orders_everyhour<-orders%>%group_by(order_hour_of_day)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))

#Visualization of the share of orders at every hour of the day

# referenced from: http://zoonek2.free.fr/UNIX/48_R/03.html

x<-Orders_everyhour$Percentage_of_orders
clock.plot <- function (x, col = rainbow(n,s=1,v=1,start=0,end=max(1,n-1)/n,alpha=0.5), ...) {
  if( min(x)<0 ) x <- x - min(x)
  if( max(x)>1 ) x <- x/max(x)
  n <- length(x)
  if(is.null(names(x))) names(x) <- 0:(n-1)
  m <- 1.05
  plot(0, 
       type = 'n', # do not plot anything
       xlim = c(-m,m), ylim = c(-m,m), 
       axes = F, xlab = '', ylab = '', ...)
  a <- pi/2 - 2*pi/200*0:200
  polygon( cos(a), sin(a) )
  v <- .02
  a <- pi/2 - 2*pi/n*0:n
  segments( (1+v)*cos(a), (1+v)*sin(a), 
            (1-v)*cos(a), (1-v)*sin(a) )
  segments( cos(a), sin(a), 
            0, 0, 
            col = 'light grey', lty = 3) 
  ca <- -2*pi/n*(0:50)/50
  for (i in 1:n) {
    a <- pi/2 - 2*pi/n*(i-1)
    b <- pi/2 - 2*pi/n*i
    polygon( c(0, x[i]*cos(a+ca), 0),
             c(0, x[i]*sin(a+ca), 0),
             col=col[i] )
    v <- .1
    text((1+v)*cos(a), (1+v)*sin(a), names(x)[i])
  }
}
clock.plot(x, 
           main = "Peak Ordering Hours")
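The "100%" for the busiest hour comes from the rescaling step inside clock.plot, which divides every value by the maximum when the maximum exceeds 1. A minimal sketch of that step on four made-up hourly percentages (the numbers are illustrative, not taken from Orders_everyhour):

```r
# Invented percentages of total orders for hours 8 to 11
pct <- c("8" = 5.2, "9" = 7.5, "10" = 8.4, "11" = 8.3)

# Same rescaling clock.plot applies when max(x) > 1:
# the busiest hour becomes 1 and the others are fractions of it
scaled <- pct / max(pct)

busiest_hour <- names(which.max(scaled))
round(scaled, 2)
busiest_hour
```

Each wedge in the clock plot is therefore drawn relative to the peak hour rather than as a share of all orders.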

Day_of_Week

The figure below shows that Sunday and Monday are the days on which people order most on Instacart.

#Number of Orders every day of the week

Orders_everyday<-orders%>%group_by(order_dow)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))

#Visualizing Number of Orders by day of the week

ggplot(Orders_everyday,aes(x=order_dow,y=Percentage_of_orders,label=paste0(round(Percentage_of_orders,1))))+
  geom_bar(stat = "identity")+labs(title="% of Orders by day of the Week",y="Percentage of Total Orders",x="Day of the Week : 0 denotes Sunday ")+
  geom_text(nudge_y = .5)

Every Day_ Every Hour

10:00 AM on Monday is found to be the time at which most orders are placed. One possible explanation is that people restock their groceries for the rest of the week as they return to work. To visualize this, the percentage of each day's orders placed in every hour is plotted and compared across days.

Dow_hod_orders<-orders%>%group_by(order_dow,order_hour_of_day)%>%
  summarise(Number_of_Orders=n())

Dow_hod_orders_combined<-merge(Dow_hod_orders,Orders_everyday,by="order_dow",all.x = TRUE)%>%
  mutate(Percentage_by_doy=Number_of_Orders.x*100/Number_of_Orders.y)


#Visualizing orders by dow-->hod

# Refer to columns by bare name inside aes() and facet_grid();
# using data$column there bypasses ggplot's data handling and can misalign facets
ggplot(Dow_hod_orders_combined, aes(x = order_hour_of_day, y = Percentage_by_doy)) +
  geom_bar(stat="identity") +
  labs(title="Visualizing orders by hour of day for each day of week with 0 representing Sunday",x="0-23 represents hours of the day",y="Percentage of orders for the day")+
  facet_grid(~ order_dow)
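The within-day percentage computed above through a merge can also be obtained in one step with a grouped transformation. This is a sketch of an alternative using base R's ave() on a toy table with invented counts, not the approach used in the analysis itself.

```r
# Toy counts of orders per day of week and hour (values are invented)
toy_counts <- data.frame(
  order_dow = c(0, 0, 1, 1),
  order_hour_of_day = c(9, 10, 9, 10),
  Number_of_Orders = c(30, 70, 20, 60)
)

# Share of each day's orders that falls in each hour
toy_counts$Percentage_by_day <- ave(
  toy_counts$Number_of_Orders,
  toy_counts$order_dow,
  FUN = function(v) 100 * v / sum(v)
)

toy_counts
```

Within each day the percentages sum to 100, which is what makes the facets comparable across days with different order volumes.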

Days Since Prior Order Analysis

We are given the gap between two consecutive orders for every user. Plotting it reveals two groups of customers: one that reorders monthly and another that reorders weekly. This is based on the peaks at the 30th day and the 7th day.

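library(plyr)   # attached after dplyr, so plyr's count(), arrange() and mutate() are the ones used below
library(dplyr)
# round() takes the number of digits as its second argument
Reordering_Gap<-count(orders,'days_since_prior_order')%>%arrange(desc(freq))%>%mutate(Percent_orders=round(freq*100/nrow(orders),2))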

#Inference: 11% of the time people reorder after 30 days (monthly), and 9% of the time after 7 days (weekly). This shows that one segment of customers restocks groceries every month and another restocks every week. The frequency of NA equals the total number of unique users, since NA marks each user's first order.

#Visualizing reordering Gap

Reordering_Gap_plot<-ggplot(orders,aes(x=days_since_prior_order))+
  geom_histogram(aes(fill=..count..),binwidth=1)+
  scale_x_continuous(name = "Days Since Prior Order",breaks = seq(0, 30, 1))+
  scale_y_continuous(name = "Frequency of Orders",breaks=seq(0,1000000,100000))+
  ggtitle("Gap between two orders?")+
  labs(x="Days Since Prior Order")+
  theme_update()
Reordering_Gap_plot
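The percentage table behind this inference can be sketched on a handful of made-up gaps. The example below swaps in base R's table() for plyr's count(), so it is an illustration of the calculation rather than the exact code used above; the gaps vector is invented.

```r
# Invented gaps in days; NA marks a user's first order
gaps <- c(NA, 7, 7, 30, 30, 30, 14, 7, NA, 30)

# Frequency of each gap, keeping NA as its own category
gap_freq <- table(gaps, useNA = "ifany")

# Percentage of all orders per gap, rounded to 2 digits
gap_pct <- round(100 * gap_freq / length(gaps), 2)

sort(gap_pct, decreasing = TRUE)
```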

Prior Table Analysis

Top Ordered Products

top25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%head(25)

colnames(top25_products)[1]<-'product_id'

Top25Products<-merge(top25_products,Products_Aisles_Departments,by='product_id')%>%arrange(desc(freq))

kable(head(Top25Products,25))

product_id	freq	department_id	aisle_id	product_name	aisle	department
24852	472565	4	24	Abc Banana	fresh fruits	produce
13176	379450	4	24	Abc Bag of Organic Bananas	fresh fruits	produce
21137	264683	4	24	Abc Organic Strawberries	fresh fruits	produce
21903	241921	4	123	Abc Organic Baby Spinach	packaged vegetables fruits	produce
47209	213584	4	24	Abc Organic Hass Avocado	fresh fruits	produce
47766	176815	4	24	Abc Organic Avocado	fresh fruits	produce
47626	152657	4	24	Abc Large Lemon	fresh fruits	produce
16797	142951	4	24	Abc Strawberries	fresh fruits	produce
26209	140627	4	24	Abc Limes	fresh fruits	produce
27845	137905	16	84	Abc Organic Whole Milk	milk	dairy eggs
27966	137057	4	123	Abc Organic Raspberries	packaged vegetables fruits	produce
22935	113426	4	83	Abc Organic Yellow Onion	fresh vegetables	produce
24964	109778	4	83	Abc Organic Garlic	fresh vegetables	produce
45007	104823	4	83	Abc Organic Zucchini	fresh vegetables	produce
39275	100060	4	123	Abc Organic Blueberries	packaged vegetables fruits	produce
49683	97315	4	83	Abc Cucumber Kirby	fresh vegetables	produce
28204	89632	4	24	Abc Organic Fuji Apple	fresh fruits	produce
5876	87746	4	24	Abc Organic Lemon	fresh fruits	produce
8277	85020	4	24	Abc Apple Honeycrisp Organic	fresh fruits	produce
40706	84255	4	123	Abc Organic Grape Tomatoes	packaged vegetables fruits	produce
4920	82689	4	123	Abc Seedless Red Grapes	packaged vegetables fruits	produce
30391	80392	4	83	Abc Organic Cucumber	fresh vegetables	produce
45066	79769	4	24	Abc Honeycrisp Apple	fresh fruits	produce
42265	76896	4	123	Abc Organic Baby Carrots	packaged vegetables fruits	produce
49235	76360	16	53	Abc Organic Half & Half	cream	dairy eggs

#Visualization of top 25 products

ggplot(Top25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Most ordered Products: Top 25 ",y="Number of orders",x="product_name")+
  geom_text(nudge_y = 20000)

Least Ordered Products

#Bottom 25 (least ordered products)

bottom25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%tail(25)

colnames(bottom25_products)[1]<-'product_id'

bottom25Products<-merge(bottom25_products,Products_Aisles_Departments,by='product_id')%>%arrange(freq)

kable(head(bottom25Products,25))

product_id	freq	department_id	aisle_id	product_name	aisle	department
42206	1	11	80	Abc Anarchy For Her Daily Fragrance	deodorants	personal care
42235	1	13	5	Abc Wicked Good No-Fry Wing Marinade	marinades meat preparation	pantry
42464	1	13	104	Abc Organic Aleppo Pepper	spices seasonings	pantry
43098	1	11	70	Abc Buddy Bear Probiotic Orange Flavor Chewable	digestion	personal care
43144	1	13	19	Abc Flax Oil Organic Omega-3 Original Formula	oils vinegars	pantry
43778	1	19	3	Abc Strawberry Energy Gel	energy granola bars	snacks
44495	1	16	120	Abc Lowfat Cherry Lime Supernova Kefir Cultured Milk Smoothie	yogurt	dairy eggs
44748	1	13	72	Abc Thick ’n Chunky Salsa Mild	condiments	pantry
44982	1	1	37	Abc Coconut Bliss Pineapple Coconut	ice cream ice	frozen
44986	1	19	107	Abc Mustard & Onion	chips pretzels	snacks
45273	1	20	67	Abc Flame Roasted Red Peppers Spreadable Cheese	fresh dips tapenades	deli
45378	1	11	22	Abc Miss Treated Conditioner	hair care	personal care
45380	1	19	3	Abc Original Salted Caramel Protein Energy Bar	energy granola bars	snacks
45400	1	17	114	Abc Multi-Surface Sunflower Scent Everyday Cleaner	cleaning products	household
45893	1	9	9	Abc Florentine Spinach & Cheese Pasta Sauce	pasta sauce	dry goods pasta
46278	1	14	121	Abc Berry Sprouted Blend Cereal	cereal	breakfast
46379	1	11	73	Abc Skin Firming Cellulite Gel Cream	facial care	personal care
46741	1	15	59	Abc Seasoned Southern Style Red Beans And Rice	canned meals beans	canned goods
47408	1	21	100	Abc Control GX Gray Reducing Shampoo	missing	missing
47680	1	5	28	Abc Indian Wells Merlot	red wines	alcohol
48151	1	21	100	Abc Grill Mates Brown Sugar Bourbon Seasoning	missing	missing
48211	1	9	4	Abc Cajun Sides Dirty Rice	instant foods	dry goods pasta
48343	1	5	27	Abc Hennepin Farmhouse Ale	beers coolers	alcohol
48461	1	19	103	Abc The Ultimate Caramel Suace	ice cream toppings	snacks
49487	1	8	40	Abc Vitality Chicken Breasts with Flaxseed & Vitamins Dog Treats	dog food care	pets

#Visualization of bottom 25 products

ggplot(bottom25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="least ordered Products: Bottom 25 ",y="Number of orders",x="product_name")+
  geom_text(nudge_y = 1)

Market Basket Analysis

Market Basket Analysis has wide applications, including but not limited to cross-selling, product placement, affinity promotion, fraud detection, and customer behavior analysis.

Association Rules - Theory

Apriori Algorithm and Association Rules

Frequent Itemset Property: Any subset of a frequent itemset is frequent.

Contrapositive: If an itemset is not frequent, none of its supersets are frequent.

- A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
- In theory we could consider all rules, but there are exponentially many of them, so this is not a practical solution. Hence we consider only combinations that occur with high frequency and call such sets frequent itemsets.
- The idea of frequent itemsets is used for computational efficiency. If the set {item A, item B} is not frequent, then no set containing both item A and item B is frequent, so such sets do not need to be considered.
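The pruning idea can be illustrated on a few toy baskets. The sketch below is a hand-rolled two-pass version written only to show the principle; it is not how arules implements Apriori, and the item names, the helper support_of, and the 0.5 threshold are all invented for the example.

```r
# Toy baskets (item names are illustrative, not Instacart data)
baskets <- list(
  c("banana", "milk", "bread"),
  c("banana", "milk"),
  c("banana", "bread"),
  c("milk", "eggs"),
  c("banana", "milk", "eggs")
)
min_support <- 0.5

# Support of an itemset: share of baskets that contain every item in it
support_of <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Pass 1: keep only the frequent single items
items <- sort(unique(unlist(baskets)))
item_support <- sapply(items, support_of)
frequent_items <- items[item_support >= min_support]

# Pass 2: build candidate pairs only from frequent single items,
# so any pair containing an infrequent item is never counted
candidate_pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- sapply(candidate_pairs, support_of)
frequent_pairs <- candidate_pairs[pair_support >= min_support]

frequent_items
frequent_pairs
```

With four distinct items there are six possible pairs, but only one candidate pair is ever counted here because bread and eggs are dropped after the first pass.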

Support

- The support s of an itemset A is the percentage of transactions in the transaction database D that contain A.
- The support of the rule A → B in the transaction database D is the support of the itemset (A and B) in D.
- Rules with low support may have occurred by chance. Low-support rules may be uninteresting from a business standpoint, except when the consequent is very valuable and/or the confidence is very high.

Confidence

- The confidence of the rule A → B in the transaction database D is the ratio of the number of transactions in D that contain (A and B) to the number of transactions in D that contain A.
- The confidence of A → B is a measure of the reliability of the rule.
- It is an estimate of P(B|A). That is, it gives the conditional probability that the items in the consequent set are contained in a randomly selected transaction that includes the antecedent set.

Lift Ratio

- The lift ratio allows us to judge the strength of an association rule against a benchmark value.
- The benchmark: if the antecedent and consequent sets are independent, the confidence can be written as P(consequent | antecedent) = P(antecedent AND consequent) / P(antecedent) = P(antecedent) x P(consequent) / P(antecedent) = P(consequent).
- P(consequent) is called the benchmark confidence.
- The lift ratio is defined as confidence / benchmark confidence.
- Lift ratio = support(A and B) / (support(A) x support(B)).
- The lift ratio can take any value between 0 and infinity. A value above 1 means the antecedent and consequent occur together more often than they would if they were independent.
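A small worked example may help tie support, confidence, and lift together. The sketch below computes all three by hand for the rule banana → milk on five invented baskets; the helper contains and the basket contents are made up for illustration.

```r
# Toy baskets (invented for illustration)
baskets <- list(
  c("banana", "milk"),
  c("banana", "milk", "bread"),
  c("banana", "bread"),
  c("milk"),
  c("banana", "milk", "eggs")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) sapply(baskets, function(b) all(items %in% b))

support_A  <- mean(contains("banana"))             # 4 of 5 baskets
support_B  <- mean(contains("milk"))               # 4 of 5 baskets
support_AB <- mean(contains(c("banana", "milk")))  # 3 of 5 baskets

# Confidence of banana -> milk: estimate of P(milk | banana)
confidence <- support_AB / support_A

# Lift: confidence relative to the benchmark confidence P(milk)
lift <- confidence / support_B

c(support = support_AB, confidence = confidence, lift = lift)
```

Here the confidence is 0.75 but the lift is below 1, because milk already appears in 80% of baskets: a rule can look reliable on confidence alone and still be no better than the benchmark.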

Interpreting the Results

There is no rule of thumb for what makes a “good rule”. The following points can be considered when analyzing the dataset:
- How impactful a rule is: this can be measured by the size of the support.
- Efficiency of the rule: the lift tells us how efficient the rule is at finding the consequent set compared to a random selection.
- Operational usefulness: the confidence tells us how efficient the rule will be in practice.

Basket Size-Frequent Item Sets-Arules Vis

prior4mba<-split(prior$product_id,prior$order_id)

transaction_prior<-as(prior4mba,"transactions")

dim(transaction_prior)

## [1] 3214874   49677
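The split() call above turns the long prior table into one list element per order, which is the shape the "transactions" coercion expects. The sketch below shows that reshaping on toy data and, as a possible variation, splits on product names instead of ids so that any resulting rules would be readable. The lookup table toy_products and its "Item A" to "Item E" names are hypothetical.

```r
# Toy stand-in for prior (product_id 24838 is invented)
toy_prior <- data.frame(
  order_id   = c(2, 2, 2, 3, 3),
  product_id = c(33120, 28985, 9327, 33754, 24838)
)

# Hypothetical lookup table standing in for products
toy_products <- data.frame(
  product_id   = c(33120, 28985, 9327, 33754, 24838),
  product_name = c("Item A", "Item B", "Item C", "Item D", "Item E"),
  stringsAsFactors = FALSE
)

# One list element per order, holding that order's product ids
toy_baskets <- split(toy_prior$product_id, toy_prior$order_id)

# Same split on names looked up by product_id
toy_names <- toy_products$product_name[
  match(toy_prior$product_id, toy_products$product_id)
]
toy_named_baskets <- split(toy_names, toy_prior$order_id)

lengths(toy_baskets)
toy_named_baskets
```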

#frequent product ids in the transactions

itemFrequencyPlot(transaction_prior,support=0.05,cex.names=0.8)

#Apriori algorithm

basket_rules<-apriori(transaction_prior,parameter = list(sup=0.00001,conf=0.6,maxlen=3,target="rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   1e-05      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.85s].
## sorting and recoding items ... [30863 item(s)] done [1.10s].
## creating transaction tree ... done [8.48s].
## checking subsets of size 1 2 done [18.65s].
## writing ... [80 rule(s)] done [3.83s].
## creating S4 object  ... done [1.89s].

#Visualizing rules

#Number of Products per basket

hist(size(transaction_prior), breaks = 0:150, xaxt="n", ylim=c(0,250000), col = "grey",
     main = "Number of Products per Order", xlab = "Order Size:Number of Products")
axis(1, at=seq(0,160,by=10), cex.axis=0.8)
mtext(paste("Total:", length(transaction_prior), "Orders,", sum(size(transaction_prior)), "Products"))

#Frequently ordered products

#We find 15 products at a support of 0.03. This means each of these products appears in at least 3% of all transactions, i.e. in roughly 96,000 of the 3.2 million orders

item_frequencies <- itemFrequency(transaction_prior, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transaction_prior)]

par(mar=c(2,10,2,2)); options(scipen=5)

barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transaction_prior), col="red")

#Frequent items bought together

#We want 2-product and 3-product combinations, so we choose a lower support of 0.0003, which means an itemset must appear in about 0.03% of the roughly 3.2 million transactions, i.e. in at least 964 orders

basket_rules<-apriori(transaction_prior,parameter = list(sup=0.0003, conf=0.5, target="rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5  0.0003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.87s].
## sorting and recoding items ... [5213 item(s)] done [1.34s].
## creating transaction tree ... done [9.64s].
## checking subsets of size 1 2 3 done [6.20s].
## writing ... [60 rule(s)] done [0.08s].
## creating S4 object  ... done [1.50s].
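The relative support thresholds used in this section can be translated into absolute order counts with simple arithmetic. The sketch below uses the transaction count reported by dim(transaction_prior); treating the minimum count as the truncated product is an assumption that agrees with the "Absolute minimum support count" lines printed by apriori above.

```r
# Number of transactions, from dim(transaction_prior)
n_transactions <- 3214874

# Minimum counts implied by the two apriori() calls
count_first  <- floor(0.00001 * n_transactions)  # sup = 0.00001
count_second <- floor(0.0003 * n_transactions)   # sup = 0.0003

# Threshold used for the frequent-items bar plot
count_plot <- 0.03 * n_transactions

c(first = count_first, second = count_second, plot = round(count_plot))
```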

plot(basket_rules)

plot(head(sort(basket_rules,by="lift"),10),method="graph")

plot(basket_rules,method="grouped")

#The figures above visualize all three parameters: support, confidence, and lift. The minimum confidence is set at 50%, which yields a set of 60 rules. We sort them by lift, which gives the efficiency of the rule, and build our product combinations from the top rules

Instacart EDA & MBA

Swapnil Sharma

June 30, 2017