Exploratory Data Analysis & Market Basket Analysis

Instacart is an internet - based grocery delivery service with a slogan of Groceries Delivered in an Hour. The purpose of this exercise is to analyze the trend in customer buying pattern on Instacart, suggest combination of products which can be sold together under various offers.

Loading dependencies and Data Files

The data set is a relational set of files describing customers’ orders over time. The data set is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 of their order details is provided with the sequence of products purchased in each order, the week and hour of day the order was placed and a relative measure of time between orders.

library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)
library(magrittr)
library(grid)
library(gridExtra)
library(ggthemes)
library(ggrepel)
library(tcltk)
library(gsubfn)
library(proto)
library(RSQLite)
library(sqldf)
library(Matrix)
library(arules)
library(tidyr)
library(arulesViz)
library(methods)

products<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/products.csv")
orders<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/orders.csv")
prior<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/order_products__prior.csv")
aisles<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/aisles.csv")
departments<-read.csv("file:///C:/Users/swapn/Downloads/Analytics/Kaggle/InstakartMBA/InstakartMBA/departments.csv")

Viewing the Data Sets

Orders

The data set contains list of unique order_id for corresponding orders made by users. Order_number gives the number of the order. Eval_set denotes if the order is a prior order, train, or test. All but the last order of every user is classified as prior. Last order of every user is either classified as train or test. The ones classified as test are the order_id for which we predict which products will be included in the next order. Order_dow gives the day of the week and order_hour_of_day denotes hour of the day. Days_since_prior_order gives the time difference between two orders and contains NULL value for the first order of every user. There are 3 million plus order_id for 200,000 plus different users.

kable(head(orders,10))
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 8 NA
2398795 1 prior 2 3 7 15
473747 1 prior 3 3 12 21
2254736 1 prior 4 4 7 29
431534 1 prior 5 4 15 28
3367565 1 prior 6 2 7 19
550135 1 prior 7 1 9 20
3108588 1 prior 8 1 14 14
2295261 1 prior 9 1 16 0
2550362 1 prior 10 4 8 30
glimpse(orders)
## Observations: 3,421,083
## Variables: 7
## $ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
## $ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ eval_set               <fctr> prior, prior, prior, prior, prior, pri...
## $ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
## $ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
## $ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
## $ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...
dim(orders)
## [1] 3421083       7

Products

kable(head(products,10))
product_id product_name aisle_id department_id
1 Abc Chocolate Sandwich Cookies 61 19
2 Abc All-Seasons Salt 104 13
3 Abc Robust Golden Unsweetened Oolong Tea 94 7
4 Abc Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Abc Green Chile Anytime Sauce 5 13
6 Abc Dry Nose Oil 11 11
7 Abc Pure Coconut Water With Orange 98 7
8 Abc Cut Russet Potatoes Steam N’ Mash 116 1
9 Abc Light Strawberry Blueberry Yogurt 120 16
10 Abc Sparkling Orange Juice & Prickly Pear Beverage 115 7
glimpse(products)
## Observations: 49,688
## Variables: 4
## $ product_id    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ product_name  <fctr> Abc Chocolate Sandwich Cookies, Abc All-Seasons...
## $ aisle_id      <int> 61, 104, 94, 38, 5, 11, 98, 116, 120, 115, 31, 1...
## $ department_id <int> 19, 13, 7, 1, 13, 11, 7, 1, 16, 7, 7, 1, 11, 17,...
dim(products)
## [1] 49688     4

Aisles

kable(head(aisles,10))
aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
glimpse(aisles)
## Observations: 134
## Variables: 2
## $ aisle_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ aisle    <fctr> prepared soups salads, specialty cheeses, energy gra...
dim(aisles)
## [1] 134   2

Departments

kable(head(departments,10))
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
glimpse(departments)
## Observations: 21
## Variables: 2
## $ department_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ department    <fctr> frozen, other, bakery, produce, alcohol, intern...
dim(departments)
## [1] 21  2

Prior

Prior table contains product_id for every order_id. It thereby gives information about products included in every order. Add_to_cart_order gives the order for product_id by which it was added by customer to their shopping cart. Every product_id is classified and coded as 1 under reordered column if it was previously ordered by customer and 0 otherwise. It is the largest table with over 32 million rows of data.

kable(head(prior,10))
order_id product_id add_to_cart_order reordered
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0
2 17794 6 1
2 40141 7 1
2 1819 8 1
2 43668 9 0
3 33754 1 1
glimpse(prior)
## Observations: 32,434,489
## Variables: 4
## $ order_id          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,...
## $ product_id        <int> 33120, 28985, 9327, 45918, 30035, 17794, 401...
## $ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6,...
## $ reordered         <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,...
dim(prior)
## [1] 32434489        4

Recoding the Character Variables to Factor

orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

Merging The Products_Aisles_Departments Data sets

Products_Aisles<-merge(products,aisles,by="aisle_id")
Products_Aisles_Departments<-merge(Products_Aisles,departments,"department_id")
kable(head(Products_Aisles_Departments,6))
department_id aisle_id product_id product_name aisle department
1 37 32475 Abc Meyer Lemon Sorbet ice cream ice frozen
1 37 18020 Abc Black Raspberry Chocolate Chip Ice Cream ice cream ice frozen
1 37 20175 Abc The Original Vanilla Ice Cream Sandwich ice cream ice frozen
1 37 49459 Abc Dark Chocolate Non Dairy Frozen Dessert Bar ice cream ice frozen
1 37 8507 Abc Fun Flavors Spumoni Ice Cream ice cream ice frozen
1 37 30176 Abc Fruit Bars Straw Tangerine ice cream ice frozen
glimpse(Products_Aisles_Departments)
## Observations: 49,688
## Variables: 6
## $ department_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ aisle_id      <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, ...
## $ product_id    <int> 32475, 18020, 20175, 49459, 8507, 30176, 36242, ...
## $ product_name  <fctr> Abc Meyer Lemon Sorbet, Abc Black Raspberry Cho...
## $ aisle         <fctr> ice cream ice, ice cream ice, ice cream ice, ic...
## $ department    <fctr> frozen, frozen, frozen, frozen, frozen, frozen,...
dim(Products_Aisles_Departments)
## [1] 49688     6

Instacart Product Offerings

Aisle

Missing, candy chocoloate and Ice cream ice aisle are the aisles with maximum variety of products while few variety is avilable relatively in trash bag liners, frozen dessert and Indian foods aisles

Number_of_Product_each_Aisle<-Products_Aisles_Departments%>%group_by(aisle)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))

#Top 20 Aisle by number of product offerings
Top_20<-head(Number_of_Product_each_Aisle,n=20)

#Plotting Number of Products in each aisle in decreasing order(Top 20)
ggplot(Top_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+
  labs(title="Top 20 Aisle by Variety of Product Offering",y="Number of Products",x="Aisle")+
  geom_text(nudge_y = 35)

#Bottom 20 Aisle by number of product offerings
Bottom_20<-tail(Number_of_Product_each_Aisle,n=20)

#Plotting Number of Products in each aisle in decreasing order(Bottom 20)
ggplot(Bottom_20, aes(x = reorder(aisle,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Bottom 20 Aisle by Variety of Product offering",y="Number of Products",x="Aisle")+
  geom_text(nudge_y = 3.5)

Department

It can be inferred from the below bar chart that Instacart has maximum number of product offerings across personal care and edible item departments

#Number of Products in each department

Number_of_Product_each_department<-Products_Aisles_Departments%>%group_by(department)%>%summarise(Number_of_Products=n())%>%arrange(desc(Number_of_Products))


#Vis--Bar chart for number of products in each department

ggplot(Number_of_Product_each_department, aes(x = reorder(department,Number_of_Products), y = Number_of_Products,label=paste0(round(Number_of_Products,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Department by Variety of Product offering",y="Number of Products",x="Department")+
  geom_text(nudge_y = 250)

Orders EDA

Hour_of_Day

From the figure it is inferred that most people order between 9:00 AM to 6:00 PM in the evening. Instacart can accordingly plan to hire persons for delivery during days shifts. The visualization is plotted in R with colored portion representing the relative percentage of total orders across the day with 10:00 AM being the busiest hour (100% in vis)

Orders_everyhour<-orders%>%group_by(order_hour_of_day)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))

#Visualization for number of orders at every hour of the day and every day of the week

# referenced from: http://zoonek2.free.fr/UNIX/48_R/03.html

x<-Orders_everyhour$Percentage_of_orders
clock.plot <- function (x, col = rainbow(n,s=1,v=1,start=0,end=max(1,n-1)/n,alpha=0.5), ...) {
  if( min(x)<0 ) x <- x - min(x)
  if( max(x)>1 ) x <- x/max(x)
  n <- length(x)
  if(is.null(names(x))) names(x) <- 0:(n-1)
  m <- 1.05
  plot(0, 
       type = 'n', # do not plot anything
       xlim = c(-m,m), ylim = c(-m,m), 
       axes = F, xlab = '', ylab = '', ...)
  a <- pi/2 - 2*pi/200*0:200
  polygon( cos(a), sin(a) )
  v <- .02
  a <- pi/2 - 2*pi/n*0:n
  segments( (1+v)*cos(a), (1+v)*sin(a), 
            (1-v)*cos(a), (1-v)*sin(a) )
  segments( cos(a), sin(a), 
            0, 0, 
            col = 'light grey', lty = 3) 
  ca <- -2*pi/n*(0:50)/50
  for (i in 1:n) {
    a <- pi/2 - 2*pi/n*(i-1)
    b <- pi/2 - 2*pi/n*i
    polygon( c(0, x[i]*cos(a+ca), 0),
             c(0, x[i]*sin(a+ca), 0),
             col=col[i] )
    v <- .1
    text((1+v)*cos(a), (1+v)*sin(a), names(x)[i])
  }
}
clock.plot(x, 
           main = "Peak Ordering Hours")

Day_of_Week

From the figure below it is deduced that Sunday and Monday are the days when people order most on Instacart

#Number of Orders every day of the week

Orders_everyday<-orders%>%group_by(order_dow)%>%summarise(Number_of_Orders=n())%>%mutate(Percentage_of_orders=(Number_of_Orders*100/nrow(orders)))

#Visualizing Number of Orders by day of the week

ggplot(Orders_everyday,aes(x=order_dow,y=Percentage_of_orders,label=paste0(round(Percentage_of_orders,1))))+
  geom_bar(stat = "identity")+labs(title="% of Orders by day of the Week",y="Percentage of Total Orders",x="Day of the Week : 0 denotes Sunday ")+
  geom_text(nudge_y = .5)

Every Day_ Every Hour

It is found that 10:00 AM on Monday is the time of the day when most orders are placed. This can be thought of when people go to work they refill their groceries for the rest of the week. To visualize this percentage of orders made every hour for every day is plotted and compared.

Dow_hod_orders<-orders%>%group_by(order_dow,order_hour_of_day)%>%
  summarise(Number_of_Orders=n())

Dow_hod_orders_combined<-merge(Dow_hod_orders,Orders_everyday,by="order_dow",all.x = TRUE)%>%
  mutate(Percentage_by_doy=Number_of_Orders.x*100/Number_of_Orders.y)


#Visualizing orders by dow-->hod

ggplot(Dow_hod_orders_combined, aes(x = Dow_hod_orders_combined$order_hour_of_day, y = Dow_hod_orders_combined$Percentage_by_doy)) +
  geom_bar(stat="identity") +
  labs(title="Visualizing orders by hour of day for each day of week with 0 representing Sunday",x="0-24 represents hours of the day",y="Percentage of orders for the day")+
  facet_grid(~ Dow_hod_orders_combined$order_dow)

Days Since Prior Order Analysis

We are given the gap between two orders for every user. When we plot it we find two categories of people! One that reorders monthly other who does weekly. This is based on the peaks formed at 30th day and 7th day.

library(plyr)
library(dplyr)
Reordering_Gap<-count(orders,'days_since_prior_order')%>%arrange(desc(freq))%>%mutate(Percent_orders=round(freq*100/nrow(orders)),2)

#Inference: 11 % of the time people reorder monthly(after 30 days), and 9 % of the time weekly. This shows there is a section of people who refill their groceries every month and other who refills every week. Frequency of NA represents total number of unique users and its their first order.

#Visualizing reordering Gap

Reordering_Gap_plot<-ggplot(orders,aes(x=days_since_prior_order))+
  geom_histogram(aes(fill=..count..),binwidth=1)+
  scale_x_continuous(name = "Days Since Prior Order",breaks = seq(0, 30, 1))+
  scale_y_continuous(name = "Frequency of Orders",breaks=seq(0,1000000,100000))+
  ggtitle("Gap between two orders?")+
  labs(x="Days Since Prior Order")+
  theme_update()
Reordering_Gap_plot

Prior Table Analysis

Top Ordered Products

top25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%head(25)

colnames(top25_products)[1]<-'product_id'

Top25Products<-merge(top25_products,Products_Aisles_Departments,by='product_id')%>%arrange(desc(freq))

kable(head(Top25Products,25))
product_id freq department_id aisle_id product_name aisle department
24852 472565 4 24 Abc Banana fresh fruits produce
13176 379450 4 24 Abc Bag of Organic Bananas fresh fruits produce
21137 264683 4 24 Abc Organic Strawberries fresh fruits produce
21903 241921 4 123 Abc Organic Baby Spinach packaged vegetables fruits produce
47209 213584 4 24 Abc Organic Hass Avocado fresh fruits produce
47766 176815 4 24 Abc Organic Avocado fresh fruits produce
47626 152657 4 24 Abc Large Lemon fresh fruits produce
16797 142951 4 24 Abc Strawberries fresh fruits produce
26209 140627 4 24 Abc Limes fresh fruits produce
27845 137905 16 84 Abc Organic Whole Milk milk dairy eggs
27966 137057 4 123 Abc Organic Raspberries packaged vegetables fruits produce
22935 113426 4 83 Abc Organic Yellow Onion fresh vegetables produce
24964 109778 4 83 Abc Organic Garlic fresh vegetables produce
45007 104823 4 83 Abc Organic Zucchini fresh vegetables produce
39275 100060 4 123 Abc Organic Blueberries packaged vegetables fruits produce
49683 97315 4 83 Abc Cucumber Kirby fresh vegetables produce
28204 89632 4 24 Abc Organic Fuji Apple fresh fruits produce
5876 87746 4 24 Abc Organic Lemon fresh fruits produce
8277 85020 4 24 Abc Apple Honeycrisp Organic fresh fruits produce
40706 84255 4 123 Abc Organic Grape Tomatoes packaged vegetables fruits produce
4920 82689 4 123 Abc Seedless Red Grapes packaged vegetables fruits produce
30391 80392 4 83 Abc Organic Cucumber fresh vegetables produce
45066 79769 4 24 Abc Honeycrisp Apple fresh fruits produce
42265 76896 4 123 Abc Organic Baby Carrots packaged vegetables fruits produce
49235 76360 16 53 Abc Organic Half & Half cream dairy eggs
#Visualization of top 50 products

ggplot(Top25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="Most ordered Products: Top 25 ",y="Number of orders",x="product_name")+
  geom_text(nudge_y = 20000)

Least Ordered Products

#Bottom 25 (least ordered products)

bottom25_products<-count(prior$product_id)%>%arrange(desc(freq))%>%tail(25)

colnames(bottom25_products)[1]<-'product_id'

bottom25Products<-merge(bottom25_products,Products_Aisles_Departments,by='product_id')%>%arrange(freq)

kable(head(bottom25Products,25))
product_id freq department_id aisle_id product_name aisle department
42206 1 11 80 Abc Anarchy For Her Daily Fragrance deodorants personal care
42235 1 13 5 Abc Wicked Good No-Fry Wing Marinade marinades meat preparation pantry
42464 1 13 104 Abc Organic Aleppo Pepper spices seasonings pantry
43098 1 11 70 Abc Buddy Bear Probiotic Orange Flavor Chewable digestion personal care
43144 1 13 19 Abc Flax Oil Organic Omega-3 Original Formula oils vinegars pantry
43778 1 19 3 Abc Strawberry Energy Gel energy granola bars snacks
44495 1 16 120 Abc Lowfat Cherry Lime Supernova Kefir Cultured Milk Smoothie yogurt dairy eggs
44748 1 13 72 Abc Thick ’n Chunky Salsa Mild condiments pantry
44982 1 1 37 Abc Coconut Bliss Pineapple Coconut ice cream ice frozen
44986 1 19 107 Abc Mustard & Onion chips pretzels snacks
45273 1 20 67 Abc Flame Roasted Red Peppers Spreadable Cheese fresh dips tapenades deli
45378 1 11 22 Abc Miss Treated Conditioner hair care personal care
45380 1 19 3 Abc Original Salted Caramel Protein Energy Bar energy granola bars snacks
45400 1 17 114 Abc Multi-Surface Sunflower Scent Everyday Cleaner cleaning products household
45893 1 9 9 Abc Florentine Spinach & Cheese Pasta Sauce pasta sauce dry goods pasta
46278 1 14 121 Abc Berry Sprouted Blend Cereal cereal breakfast
46379 1 11 73 Abc Skin Firming Cellulite Gel Cream facial care personal care
46741 1 15 59 Abc Seasoned Southern Style Red Beans And Rice canned meals beans canned goods
47408 1 21 100 Abc Control GX Gray Reducing Shampoo missing missing
47680 1 5 28 Abc Indian Wells Merlot red wines alcohol
48151 1 21 100 Abc Grill Mates Brown Sugar Bourbon Seasoning missing missing
48211 1 9 4 Abc Cajun Sides Dirty Rice instant foods dry goods pasta
48343 1 5 27 Abc Hennepin Farmhouse Ale beers coolers alcohol
48461 1 19 103 Abc The Ultimate Caramel Suace ice cream toppings snacks
49487 1 8 40 Abc Vitality Chicken Breasts with Flaxseed & Vitamins Dog Treats dog food care pets
#Visualization of bottom 25 products

ggplot(bottom25Products, aes(x = reorder(product_name,freq), y = freq,label=paste0(round(freq,0)))) +
  geom_bar(stat = "identity")+coord_flip()+labs(title="least ordered Products: Bottom 25 ",y="Number of orders",x="product_name")+
  geom_text(nudge_y = 1)

Market Basket Analysis

Market Basket Analysis has wide applications including but not limited to cross selling, product placement, affinity promotion, fraud detection and customer behavior

Association Rules - Theory

Apriori Algorithm and Association Rules

Frequent Itemset Property: Any subset of a frequent itemset is frequent.

Contrapositive: If an itemset is not frequent, none of its supersets are frequent.

. A set of items is referred as an itemset. A itemset that contains k items is a k-itemset . In theory, we can consider all rules -exponentially many but it is not a practical solution. Hence, we consider only combinations that occur with high frequency and call such sets as frequent item sets . The idea of frequent item sets is used for computational efficiency. If the set {item A, Item B} is not frequent, then no set containing item A and item B are frequent, and therefore do not need to be considered

Support

. The support s of an itemset A is the percentage of transactions in the transaction database D that contains A . The support of the rule A???B in the transaction database D is the support of the items set (A and B) in D . Rules with low support may have happened by chance. Low support rules may be uninteresting from the business side with an exception if the consequent is very valuable and /or the confidence is very high

Confidence

. The confidence of the rule A???B in the transaction database D is the ratio of the number of transactions in D that contain (A and B) to the number of transactions that contain A in D . The confidence of A???B is a measure of the reliability of the rule . It is an estimate of P(B|A). That is, it tells us the conditional probability that the items in the consequent set are contained in a randomly selected transaction that includes the antecedent set

Lift Ratio

. The lift ratio allows us to judge the strength of an association rule compared to a benchmark value . The benchmark: If the antecedent set and consequent sets are independent we can write the confidence as: P (consequent | antecedent) = P (antecedent AND consequent) / P (antecedent) =P (antecedent) x P (consequent) / P (antecedent) =p (consequent) . P(consequent) is called benchmark confidence . Lift ratio is defined as Confidence / benchmark confidence . Lift Ratio = [{support (A and B)} / {Support (A) * Support (B)}] . The lift ratio can take value between 0 and infinity

Interpreting the Results

There is no rule of thumb for what is a “good rule”. We can consider following points while analyzing our dataset. . How impactful a rule is: This can be measured from the size of support . Efficiency of Rule: The lift tells us how efficient the rule is at finding the consequent set compared to a random selection . Operational usefulness: The confidence tells is how efficient the rule will be in practice

Basket Size-Frequent Item Sets-Arules Vis

prior4mba<-split(prior$product_id,prior$order_id)

transaction_prior<-as(prior4mba,"transactions")

dim(transaction_prior)
## [1] 3214874   49677
#frequent product ids in the transactions

itemFrequencyPlot(transaction_prior,support=0.05,cex.names=0.8)

#Apriori algorithm

basket_rules<-apriori(transaction_prior,parameter = list(sup=0.00001,conf=0.6,maxlen=3,target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   1e-05      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.85s].
## sorting and recoding items ... [30863 item(s)] done [1.10s].
## creating transaction tree ... done [8.48s].
## checking subsets of size 1 2 done [18.65s].
## writing ... [80 rule(s)] done [3.83s].
## creating S4 object  ... done [1.89s].
#Visualizing rules

#Number of Products per basket

hist(size(transaction_prior), breaks = 0:150, xaxt="n", ylim=c(0,250000), col = "grey",
     main = "Number of Products per Order", xlab = "Order Size:Number of Products")
axis(1, at=seq(0,160,by=10), cex.axis=0.8)
mtext(paste("Total:", length(transaction_prior), "Orders,", sum(size(transaction_prior)), "Products"))

#Frequently ordered products

#We find 15 products to occur when the support is set at 0.03. This means these products are found in 3% of the total transactions which is approximately about 90,000

item_frequencies <- itemFrequency(transaction_prior, type="a")
support <- 0.03
freq_items <- sort(item_frequencies, decreasing = F)
freq_items <- freq_items[freq_items>support*length(transaction_prior)]

par(mar=c(2,10,2,2)); options(scipen=5)

barplot(freq_items, horiz=T, las=1, main="Frequent Items", cex.names=.8, xlim=c(0,500000))
mtext(paste("support:",support), padj = .8)
abline(v=support*length(transaction_prior), col="red")

#Frequent items bought together

#We desire to make 2 products and 3 product combinations and hence we choose a lower support = 0.003 which means the product is in around 0.3 % of 3 million transactions that is about 10,000 times the product is sold

basket_rules<-apriori(transaction_prior,parameter = list(sup=0.0003, conf=0.5, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5  0.0003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49677 item(s), 3214874 transaction(s)] done [18.87s].
## sorting and recoding items ... [5213 item(s)] done [1.34s].
## creating transaction tree ... done [9.64s].
## checking subsets of size 1 2 3 done [6.20s].
## writing ... [60 rule(s)] done [0.08s].
## creating S4 object  ... done [1.50s].
plot(basket_rules)

plot(head(sort(basket_rules,by="lift"),10),method="graph")

plot(basket_rules,method="grouped")

#Above figure visualizes all the three parameters: support, confidence, and lift. Confidence level is set at 50%. We get a set of 60 rules. We sort them by the value of lift which gives the efficiency of the rule and thereby make our product combinations