Motivation: The Usefulness of Market Basket Analysis

Companies’ need for data processing has generated a new science, methodology, and tools to analyze large masses of data such as Big Data or data mining. This science has developed methods and predictive algorithms that allow us to know hidden behavior patterns of the individual that, with other techniques, it is impossible to know. This helps companies understand how their customers behave when making a purchase. Once their behavior pattern is known, companies better adapt to it to achieve maximum customer satisfaction, trust, and commitment.

There are currently many tools to carry out this type of analysis. In this work, we will focus on one of them, the association rules, a methodology that we consider to have great potential. We will use the apriori algorithm.

Objective

Our goal is to earn customer loyalty and increase revenue through machine-readable marketing data, which allows us to create a user behavior profile and more accurately predict revenue.

Description of the Data Source

The data analyzed came for kaggle, https://www.kaggle.com/c/instacart-market-basket-analysis/data. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users

Once the dataset is downloaded, it weighs 200.7 MB in zip format. Once unzipped, six files appear:

departments.csv: Collect name and id for different departments.

aisle.csv: That collects the corridors and the different sections.
products.csv: Collect all the products that are for sale in the store.
orders.csv: Collect all the orders made by customers, or which is the same, all the shopping carts (Market_Basket).
order_products_prior.csv: Collect the products that were purchased in an order and all the customer’s purchase history throughout that year.
order_products_train.csv: Collect the products that were purchased in an order reflecting the last basket purchased by each customer.

For this analysis, we will use the order_products_train.csv, which offers a data sample of the order_products_prior.csv

Description of the Procedure

Get the libraries

## library(arules) ## Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
## library(arulesViz) ## Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
## library(tidyverse)
## library(ggplot2)
## library(plyr) ## Tools for Splitting, Applying and Combining Data
## library(dplyr)
## library(readxl)
## library(treemap)
## library(d3treeR)
## library(colorspace)
## library(plotly)

Load the Data

orders <- read_csv("C:/Users/karolo/Desktop/Instacart/orders.csv")
order_products <- read_csv("C:/Users/karolo/Desktop/Instacart/order_products__train.csv")
products <- read_csv("C:/Users/karolo/Desktop/Instacart/products.csv")
aisles <- read_csv("C:/Users/karolo/Desktop/Instacart/aisles.csv")
departments <- read_csv("C:/Users/karolo/Desktop/Instacart/departments.csv")

Review Structure of the Data

Products

str(products) ## Linked to departments and aisles by the id

## spec_tbl_df [49,688 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ product_id   : num [1:49688] 1 2 3 4 5 6 7 8 9 10 ...
##  $ product_name : chr [1:49688] "Chocolate Sandwich Cookies" "All-Seasons Salt" "Robust Golden Unsweetened Oolong Tea" "Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce" ...
##  $ aisle_id     : num [1:49688] 61 104 94 38 5 11 98 116 120 115 ...
##  $ department_id: num [1:49688] 19 13 7 1 13 11 7 1 16 7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   product_id = col_double(),
##   ..   product_name = col_character(),
##   ..   aisle_id = col_double(),
##   ..   department_id = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Departmemts

str(departments)

## spec_tbl_df [21 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ department_id: num [1:21] 1 2 3 4 5 6 7 8 9 10 ...
##  $ department   : chr [1:21] "Frozen" "Other" "Bakery" "Produce" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   department_id = col_double(),
##   ..   department = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Aisles

str(aisles)

## spec_tbl_df [134 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ aisle_id: num [1:134] 1 2 3 4 5 6 7 8 9 10 ...
##  $ aisle   : chr [1:134] "prepared soups salads" "specialty cheeses" "energy granola bars" "instant foods" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   aisle_id = col_double(),
##   ..   aisle = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Order_products

str(order_products) # Connnected to Products and Order by order_id

## spec_tbl_df [1,384,617 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ order_id         : num [1:1384617] 1 1 1 1 1 1 1 1 36 36 ...
##  $ product_id       : num [1:1384617] 49302 11109 10246 49683 43633 ...
##  $ add_to_cart_order: num [1:1384617] 1 2 3 4 5 6 7 8 1 2 ...
##  $ reordered        : num [1:1384617] 1 1 0 0 1 0 0 1 0 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   order_id = col_double(),
##   ..   product_id = col_double(),
##   ..   add_to_cart_order = col_double(),
##   ..   reordered = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Orders

str(orders)

## spec_tbl_df [3,421,083 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ order_id              : num [1:3421083] 2539329 2398795 473747 2254736 431534 ...
##  $ user_id               : num [1:3421083] 1 1 1 1 1 1 1 1 1 1 ...
##  $ eval_set              : chr [1:3421083] "prior" "prior" "prior" "prior" ...
##  $ order_number          : num [1:3421083] 1 2 3 4 5 6 7 8 9 10 ...
##  $ order_dow             : num [1:3421083] 2 3 3 4 4 2 1 1 1 4 ...
##  $ order_hour_of_day     : chr [1:3421083] "08" "07" "12" "07" ...
##  $ days_since_prior_order: num [1:3421083] NA 15 21 29 28 19 20 14 0 30 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   order_id = col_double(),
##   ..   user_id = col_double(),
##   ..   eval_set = col_character(),
##   ..   order_number = col_double(),
##   ..   order_dow = col_double(),
##   ..   order_hour_of_day = col_character(),
##   ..   days_since_prior_order = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Transforming the features

orders$order_hour_of_day <- as.numeric(orders$order_hour_of_day)

products <- products%>% mutate(category =  ifelse(grepl("Organic", products$product_name), "Organic",  "Non-organic"))

products <- products %>% mutate(product_name = as.factor(product_name))

aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

Combining the data

df <- products %>% group_by(department_id, aisle_id, product_id, product_name, category) %>% dplyr::summarize(n=n())
df <- df %>% left_join(departments,by="department_id")
df<-  df %>% left_join(aisles,by="aisle_id")

data <- merge(x= order_products,y= df, by= "product_id")

data <- merge(x= data, y= orders, by= "order_id")

data <- data %>% mutate(product_name = as.factor(product_name))



head(data)

##   order_id product_id add_to_cart_order reordered department_id aisle_id
## 1        1      13176                 6         0             4       24
## 2        1      49302                 1         1            16      120
## 3        1      22035                 8         1            16       21
## 4        1      11109                 2         1            16      108
## 5        1      47209                 7         0             4       24
## 6        1      49683                 4         0             4       83
##                                    product_name    category n department
## 1                        Bag of Organic Bananas     Organic 1    Produce
## 2                              Bulgarian Yogurt Non-organic 1 Dairy Eggs
## 3                   Organic Whole String Cheese     Organic 1 Dairy Eggs
## 4 Organic 4% Milk Fat Whole Milk Cottage Cheese     Organic 1 Dairy Eggs
## 5                          Organic Hass Avocado     Organic 1    Produce
## 6                                Cucumber Kirby Non-organic 1    Produce
##                  aisle user_id eval_set order_number order_dow
## 1         fresh fruits  112108    train            4         4
## 2               yogurt  112108    train            4         4
## 3      packaged cheese  112108    train            4         4
## 4 other creams cheeses  112108    train            4         4
## 5         fresh fruits  112108    train            4         4
## 6     fresh vegetables  112108    train            4         4
##   order_hour_of_day days_since_prior_order
## 1                10                      9
## 2                10                      9
## 3                10                      9
## 4                10                      9
## 5                10                      9
## 6                10                      9

Data Exploration

Orders- Time and Day

day_week <- orders %>%
    mutate(day = as.factor(order_dow)) %>%
    mutate(hour = as.factor(order_hour_of_day)) %>%
    group_by(day,hour) %>%
    summarise(count = n()) %>%
    arrange(desc(count))


day_weekp <-day_week %>%
    ggplot(aes(x=day, y=hour))+
    geom_tile(aes(fill=count), colour = "white") + 
  
    scale_fill_gradient(name= "Number of\nOrders", low = "#fff1e6",high = "#00835C")+
  
    scale_x_discrete( position = "top",
                    breaks = c("0", "1", "2", "3", "4", "5", "6"),
                    label = c("Sunday", "Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday"),
                    expand=c(0,0))+
  
   scale_y_discrete( 
                    breaks = c("0", "6", "12", "18", "23"),
                    label = c("12am", "6am", "12pm", "6pm", "11pm"),
                    expand=c(0,0))+
  
      labs(title="Which Day and What Time\nDo Customers Order the Most?",
         x="", 
         y="",
         caption = "Maximum number of orders are placed between 9:00am and 4:00pm on Sunday and Monday. There is also a big number of orders during Firday\nand Saturday.")+
  
    theme_classic()+
  
    theme(
      
    axis.line=element_blank(),                                               
    axis.ticks=element_line(size=0.4),
    axis.text = element_text(size= 10, color= "#00835C"),
    axis.line.x = element_line(color= "#00835C" ),
    
    plot.background=element_blank(),         
    plot.title = element_text(size =10, face = "bold", hjust = 0.50, vjust = 1),
    plot.caption = element_text(hjust = 0, size = 7, margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"), color = "#718c9e"),

    
    panel.grid = element_blank(),
    
    legend.position = "bottom",
    legend.title = element_text(size= 8),
    legend.margin=margin(grid::unit(0,"cm")),
    legend.key.width=grid::unit(2,"cm"),
    legend.key.height=grid::unit(0.2,"cm")
)

day_weekp

So, from the above graph, its clear that Maximum no of orders are placed between 10:00AM and 5:00PM on Sunday and Monday.

Actions: Develop strategies to increase sales during the quietest days of the week:

Offer discounts on certain items.
Offer free delivery with orders over a specific dollar amount.

Positive note: Customers can shop on a Wednesday or Thursday, and most likely, they will never encounter “out of stock” items or delivery/pick delays.

Product Portfolio

Prod_portf <- data %>% 
  group_by(product_id) %>% 
  dplyr::summarize(count=n()) %>% 
  left_join(products,by="product_id") %>% 
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  dplyr::summarize(sumcount = sum(count)) %>% 
  left_join(df, by = c("department_id", "aisle_id")) %>% 
  mutate(onesize = 1)

## Treemap

p <-  treemap(Prod_portf,index=c("department","aisle"),
        vSize="n",vColor="department",
        
        #Main
        palette= "Set3",
        title="Product Portfolio",
        sortID="-sumcount", 
        
        #Borders
        border.lwds=c(0,0),
        border.col=c("#343538","transparent"), 
        
        type="categorical", 
        fontsize.legend = 0,
        fontsize.labels=c(9,1),
        fontsize.title=12,

        overlap.labels=0.5, 
        
        
        inflate.labels=F,
        bg.labels=c("transparent"),
        align.labels=list(
              c("center", "center"), 
              c("right", "bottom")))

d3tree2( p ,  rootname = "Portfolio" )

Bestsellers Products

type <- data   %>%
    group_by(product_id)%>% 
    dplyr::summarize(count = n()) %>% 
    top_n(20, wt = count) %>%
    left_join(select(products, product_id, product_name, category), by="product_id") %>%
    arrange(desc(count))

 

best <- type %>% 
ggplot(aes(x=reorder(product_name,count), y=count, color= category, text= paste0(product_name, ", Total Orders:", count)))+    geom_point(size= 2)+
    geom_segment(aes(x=reorder(product_name,count), 
                     xend=reorder(product_name,count), 
                     y=0, 
                     yend=count), size=0.8)+
  
    scale_y_continuous(labels = scales::comma) +
  
      labs(title="Bestsellers Products",
           subtitle = "Organic vs Non-Organic",
           y="", 
           x="", 
           legend = "")+
  
    scale_color_manual("", values = c("#FF8200", "#0AAD0A"))+

  
    theme_minimal()+
  
    theme(
       axis.text.x= element_text( size= 7),
       axis.text.y= element_text( size= 8),

       plot.title = element_text(hjust=0.5, size= 12, face = "bold", vjust = 0.5),
       plot.subtitle = element_text(hjust=0.5, size= 9, vjust = 0.5),
       
       panel.grid.major.x = element_blank(),
       panel.grid.major.y = element_blank(),
       panel.grid.minor.y = element_blank(),
       panel.grid = element_line(color = "#e5e5e5"))+
  
  
     coord_flip()

ggplotly(best, tooltip = "text")

The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach. So to increase the sale of Strawberries the retailer can put it near the Bananas.

Less Popular Products

Categorically we see what areas we can improve sales on by addressing various marketing strategies.

library(flextable)
set_flextable_defaults(font.family = "Arial", font.size = 9, padding = 2)

less_p <- data   %>%
    group_by(product_id)%>% 
    dplyr::summarize(count = n()) %>% 
    top_n(-6, wt = count) %>%
    left_join(select(products, product_id, product_name, category), by="product_id") %>%
    arrange(desc(count))%>% 
        mutate(Number_of_Orders= count,
         Product = product_name,
         Category= category)%>% 
  select(Product, Number_of_Orders, Category) 

f <-flextable(head(less_p, n= 5))%>%
  align(
    j = c("Product", "Number_of_Orders", "Category"), 
    align = "center", part = "all") %>% 
  padding(padding = 3, part = "all")%>% 
  theme_zebra(odd_header = "#F26A58", even_header = "#F26A58",
              odd_body = "#fff5f5", even_body = "#f8f9fa") %>% 
  color(part = "header", color = "white")%>% 
  align(align = "right", part = "all")%>% autofit() 

f

Product	Number_of_Orders	Category
Green Chile Anytime Sauce	1	Non-organic
Pure Coconut Water With Orange	1	Non-organic
Saline Nasal Mist	1	Non-organic
Fresh Scent Dishwasher Cleaner	1	Non-organic
Mint Chocolate Flavored Syrup	1	Non-organic

less_org <- data   %>%
    group_by(product_id)%>% 
    filter(category== "Organic")%>%
    dplyr::summarize(count = n()) %>% 
    top_n(-6, wt = count) %>%
    left_join(select(products, product_id, product_name, category), by="product_id") %>%
    arrange(desc(count))%>% 
        mutate(Number_of_Orders= count,
         Product = product_name,
         Category= category)%>% 
  select(Product, Number_of_Orders, Category) 

f_org <-flextable(head(less_org, n= 5))%>%
  align(
    j = c("Product", "Number_of_Orders", "Category"), 
    align = "center", part = "all") %>% 
  padding(padding = 3, part = "all")%>% 
  theme_zebra(odd_header = "#0AAD0A", even_header = "#0AAD0A",
              odd_body = "#fff5f5" , even_body = "#FAF1E5") %>% 
  color(part = "header", color = "white")%>% 
  align(align = "right", part = "all")%>% autofit() 

f_org

Product	Number_of_Orders	Category
Organic Spaghetti Style Pasta	1	Organic
Organic Vegetable Greens & Greens Juice Blend	1	Organic
Organic Peppermint Lemonade	1	Organic
Organic Lemon Gingersnap	1	Organic
Reg; Organic Apple Cider Vinegar 16 Fl Oz	1	Organic

Market Basket Analysis Data

Market Basket Analysis is a technique that identifies the strength of association between pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence is when two or more things take place concurrently.

To conduct an Association Rule Mining or Market Basket Analysis, we need to build the transaction dataset. We need to gather information about user_id, order_id, and products_name. This last feature needs to be converted into one row, so we can get all the items that have been bought in the same transaction or order.

transactionData <- ddply(data,c("user_id","order_id"),
                       function(df1)paste(df1$product_name,
                       collapse = ","))

write.csv(transactionData,"C:/Users/karolo/Desktop/Instacart/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)

With the read.transactions () function you can directly read data from text-type files and store it in an object of type transactions, which is the storage structure used by arules

tr <- read.transactions("C:/Users/karolo/Desktop/Instacart/market_basket_transactions.csv", format = 'basket', sep=',', header = TRUE, rm.duplicates = TRUE)

## distribution of transactions with duplicates:
## 1 
## 1

summary(tr)

## transactions as itemMatrix in sparse format with
##  131209 rows (elements/itemsets/transactions) and
##  306430 columns (items) and a density of 3.867044e-05 
## 
## most frequent items:
##                 Banana Bag of Organic Bananas   Organic Strawberries 
##                  17611                  14604                  10241 
##   Organic Baby Spinach            Large Lemon                (Other) 
##                   9326                   7724                1495292 
## 
## element (itemset/transaction) length distribution:
## sizes
##    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18 
## 7929 8337 8887 8798 9332 8999 8719 8098 7239 6454 5888 5229 4636 4220 3603 3332 
##   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34 
## 2893 2521 2279 1877 1656 1461 1271 1118  924  821  726  548  524  448  373  328 
##   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
##  264  237  212  142  134  125  107   66   59   65   42   55   33   28   28   19 
##   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65   66 
##   16   17   16   17   10    5    5    6    3    3    1    5    1    1    2    4 
##   67   68   70   72   74   76   78   82 
##    2    2    1    3    1    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   10.00   11.85   15.00   82.00 
## 
## includes extended item information - examples:
##                         labels
## 1            #2 Coffee Filters
## 2 #2 Cone White Coffee Filters
## 3        #2 Mechanical Pencils

The summary(tr) is a useful command that gives us information about our transaction object. Let’s take a look at what the above output says:

There are 131,209 transactions (rows) and 306,430 items (columns). Note that 306,433 are the product names involved in the dataset, and 131,209 transactions are collections of these items.

Density tells the percentage of nonzero cells in a sparse matrix. You can say it as the total number of purchased items divided by a possible number of items in that matrix. You can calculate how many items were purchased by using density: 131,209 x 306,430 x3.867004e-05

The summary can also tell you the most frequent items.

Element (itemset/transaction) length distribution: This tells you how many transactions are there for 1-itemset, for 2-itemset, and so on. The first row indicates the number of items, and the second row shows the number of transactions.

For example, there are 7929 transactions for three items, 8337 transactions for four things, and 82 items in one transaction.

Inspect

One of the first analyses that should be carried out when working with transactions is exploring their content and size. The inspect () function shows the features that make up each transaction.

inspect(tr[1:3])

##     items                                           
## [1] {1,                                             
##      112108,                                        
##      Bag of Organic Bananas,                        
##      Bulgarian Yogurt,                              
##      Cucumber Kirby,                                
##      Lightly Smoked Sardines in Olive Oil,          
##      Organic 4% Milk Fat Whole Milk Cottage Cheese, 
##      Organic Celery Hearts,                         
##      Organic Hass Avocado,                          
##      Organic Whole String Cheese}                   
## [2] {36,                                            
##      79431,                                         
##      Asparagus,                                     
##      Cage Free Extra Large Grade AA Eggs,           
##      Grated Pecorino Romano Cheese,                 
##      Organic Garnet Sweet Potato (Yam),             
##      Organic Half & Half,                           
##      Prosciutto Americano,                          
##      Spring Water,                                  
##      Super Greens Salad}                            
## [3] {38,                                            
##      42756,                                         
##      Bunched Cilantro,                              
##      Flat Parsley Bunch,                            
##      Fresh Dill,                                    
##      Green Peas,                                    
##      Organic Baby Arugula,                          
##      Organic Biologique Limes,                      
##      Organic Hot House Tomato,                      
##      Organic Raw Unfiltered Apple Cider Vinegar,    
##      Shelled Pistachios}

It is also possible to display the results in dataframe format with the DATAFRAME () function or with as (transactions, “dataframe”).

df_tr <- as(tr, Class = "data.frame")
as.tibble(df_tr) %>% 

  head(n= 3)

## # A tibble: 3 x 1
##   items                                                                         
##   <chr>                                                                         
## 1 {1,112108,Bag of Organic Bananas,Bulgarian Yogurt,Cucumber Kirby,Lightly Smok~
## 2 {36,79431,Asparagus,Cage Free Extra Large Grade AA Eggs,Grated Pecorino Roman~
## 3 {38,42756,Bunched Cilantro,Flat Parsley Bunch,Fresh Dill,Green Peas,Organic B~

To extract the size of each transaction, the size () function is used.

tr_size <- size(tr)
summary(tr_size)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   10.00   11.85   15.00   82.00

data.frame(tr_size) %>%
  ggplot(aes(x = tr_size)) +
  geom_histogram(fill= "orange", bin = 5) +
  labs(title = "Transaction Size Distribution",
       x = "Size",
       y= "Number of Transactions") +
  theme_minimal()

itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Set2'), main="Relative Item Frequency Plot")

The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach. So to increase the sale of Strawberries the retailer can put it near the Bananas.

Apriori Algorithm

The apriori algorithm has been used to search for association rules; a function with the same name in the arules package allows it to be calculated. It is based on the anti monotone property of the support; that is, if an itemset is frequent, then all its subsets must be too.

Apriori was one of the first algorithms developed for the search for association rules, and it is still one of the most used; it has two stages:

Identify all itemsets that occur with a frequency above a specific limit (frequent itemsets).

Convert those frequent itemsets into association rules.

Before going into the details of the algorithm, a series of concepts should be defined:

Support: The fraction of which our item set occurs in our dataset.

Confidence: It’s an indication of how often the rule has been found to be true.

Coverage: It’s the support of the left part of the rule (antecedent). It is interpreted as the frequency with which the antecedent appears in the set of transactions.

Lift: It is the ratio of confidence to expected confidence. Expected confidence is simply the probability that product y is in the basket.Higher lift values indicate stronger associations.

To find a reasonable number of rules, it was necessary to considerably reduce the support and confidence to 0.001 and 0.2, respectively. This is due to the high number of shopping carts and products present in the dataset.

association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.2, maxlen=10, target= "rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 131 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[306430 item(s), 131209 transaction(s)] done [1.26s].
## sorting and recoding items ... [1729 item(s)] done [0.05s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.07s].
## writing ... [873 rule(s)] done [0.00s].
## creating S4 object  ... done [0.05s].

summary(association.rules)

## set of 873 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 380 485   8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.574   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001006   Min.   :0.2000   Min.   :0.001936   Min.   : 1.498  
##  1st Qu.:0.001136   1st Qu.:0.2261   1st Qu.:0.004344   1st Qu.: 2.167  
##  Median :0.001372   Median :0.2571   Median :0.005244   Median : 3.120  
##  Mean   :0.001981   Mean   :0.2747   Mean   :0.007634   Mean   : 5.137  
##  3rd Qu.:0.001936   3rd Qu.:0.3051   3rd Qu.:0.007599   3rd Qu.: 4.152  
##  Max.   :0.021005   Max.   :0.5714   Max.   :0.078051   Max.   :85.582  
##      count     
##  Min.   : 132  
##  1st Qu.: 149  
##  Median : 180  
##  Mean   : 260  
##  3rd Qu.: 254  
##  Max.   :2756  
## 
## mining info:
##  data ntransactions support confidence
##    tr        131209   0.001        0.2
##                                                                                           call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.2, maxlen = 10, target = "rules"))

The summary of the rules gives us some information: The number of rules : 873

The distribution of rules by length: a length of 3 items has the most rules.

The summary of quality measures: ranges of support, confidence, and lift.

Visualizing Association Rules

# Filter rules with confidence greater than 0.2 or 20%

subRules<-association.rules[quality(association.rules)$confidence> 0.2]

Since there are 873 rules, let’s print only top 15 with a confidence level > 20%:

inspect(sort(x= subRules[1:15], decreasing = TRUE, by = "confidence"))

##      lhs                                                rhs                                                 support confidence    coverage      lift count
## [1]  {Organic Yellow Squash}                         => {Organic Zucchini}                              0.001242293  0.4515235 0.002751336 13.701192   163
## [2]  {Nonfat Icelandic Style Strawberry Yogurt}      => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001097486  0.4161850 0.002637014 81.625133   144
## [3]  {Non Fat Acai & Mixed Berries Yogurt}           => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001234671  0.4060150 0.003040950 79.630534   162
## [4]  {Sweet Potato Yam}                              => {Banana}                                        0.001028893  0.3638814 0.002827550  2.711062   135
## [5]  {Organic Whole Grassmilk Milk}                  => {Bag of Organic Bananas}                        0.001196564  0.3512304 0.003406778  3.155614   157
## [6]  {Organic Fuji Apples}                           => {Bag of Organic Bananas}                        0.001021271  0.3292383 0.003101921  2.958027   134
## [7]  {Packaged Grape Tomatoes}                       => {Hass Avocados}                                 0.001173700  0.2558140 0.004588100 15.229171   154
## [8]  {Icelandic Style Skyr Blueberry Non-fat Yogurt} => {Non Fat Acai & Mixed Berries Yogurt}           0.001234671  0.2421525 0.005098736 79.630534   162
## [9]  {Packaged Grape Tomatoes}                       => {Strawberries}                                  0.001105107  0.2408638 0.004588100  5.159755   145
## [10] {Baby Cucumbers}                                => {Bag of Organic Bananas}                        0.001127971  0.2379421 0.004740528  2.137781   148
## [11] {Baby Cucumbers}                                => {Raspberries}                                   0.001036514  0.2186495 0.004740528  9.177474   136
## [12] {Baby Cucumbers}                                => {Hass Avocados}                                 0.001028893  0.2170418 0.004740528 12.920979   135
## [13] {Baby Cucumbers}                                => {Strawberries}                                  0.001028893  0.2170418 0.004740528  4.649443   135
## [14] {Icelandic Style Skyr Blueberry Non-fat Yogurt} => {Nonfat Icelandic Style Strawberry Yogurt}      0.001097486  0.2152466 0.005098736 81.625133   144
## [15] {Seedless Cucumbers}                            => {Strawberries}                                  0.001044136  0.2041729 0.005113978  4.373766   137

top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph",  engine = "htmlwidget")

Parallel coordinate plot

plot(top5subRules , method="paracoord", control=list(reorder=TRUE))

The graph above shows us that when a customer has organic strawberries, organic raspberries and organic Hass avocados in their shopping cart, they are very likely to buy a bag of organic bananas to go along with those items, and so on.

With this information, retailers can influence purchasing decisions by confidently predicting what customers need next. This will allow them to recommend products and design marketing strategies appealing to the customer segment. Another insight, which can be obtained from this analysis, is the positioning of products in specific places and the offering of bundles for less popular products.

Filtering Rules

You can filter the created rules and restrict the elements that appear on the left or right side of the rules when creating them. Suppose you know that there is interest in understanding the products sold together with the banana, the best-selling product. This means that the banana product should appear on the right side.

bananas <- apriori(data= tr, parameter= list(supp=0.0001, confidence=0.4, target= "rules"), appearance = list(rhs= "Banana"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   1e-04      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 13 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[306430 item(s), 131209 transaction(s)] done [2.86s].
## sorting and recoding items ... [10811 item(s)] done [0.12s].
## creating transaction tree ... done [0.18s].
## checking subsets of size 1 2 3 4 5 6 done [2.30s].
## writing ... [1938 rule(s)] done [0.95s].
## creating S4 object  ... done [0.16s].

summary(bananas)

## set of 1938 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##   22 1078  790   48 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.446   4.000   5.000 
## 
## summary of quality measures:
##     support            confidence        coverage              lift      
##  Min.   :0.0001067   Min.   :0.4000   Min.   :0.0001143   Min.   :2.980  
##  1st Qu.:0.0001143   1st Qu.:0.4239   1st Qu.:0.0002439   1st Qu.:3.158  
##  Median :0.0001372   Median :0.4565   Median :0.0002972   Median :3.401  
##  Mean   :0.0001814   Mean   :0.4753   Mean   :0.0003952   Mean   :3.541  
##  3rd Qu.:0.0001829   3rd Qu.:0.5116   3rd Qu.:0.0004039   3rd Qu.:3.812  
##  Max.   :0.0019358   Max.   :0.9333   Max.   :0.0047100   Max.   :6.954  
##      count      
##  Min.   : 14.0  
##  1st Qu.: 15.0  
##  Median : 18.0  
##  Mean   : 23.8  
##  3rd Qu.: 24.0  
##  Max.   :254.0  
## 
## mining info:
##  data ntransactions support confidence
##    tr        131209   1e-04        0.4
##                                                                                                                       call
##  apriori(data = tr, parameter = list(supp = 1e-04, confidence = 0.4, target = "rules"), appearance = list(rhs = "Banana"))

Top 10 Association rules whose consequent is banana

inspect(bananas[1:10])

##      lhs                                                         rhs           support confidence     coverage     lift count
## [1]  {Frozen Unsweetened Whole Strawberries}                  => {Banana} 0.0001143214  0.4838710 0.0002362643 3.605032    15
## [2]  {Chamomile Herbal Tea}                                   => {Banana} 0.0001143214  0.4285714 0.0002667500 3.193029    15
## [3]  {Yokids Organic Strawberry Banana Lowfat Yogurt Squeeze} => {Banana} 0.0001143214  0.4687500 0.0002438857 3.492375    15
## [4]  {All Natural Whole Strawberries}                         => {Banana} 0.0001219429  0.4324324 0.0002819929 3.221795    16
## [5]  {Caramel Nut Brownie}                                    => {Banana} 0.0001143214  0.4054054 0.0002819929 3.020433    15
## [6]  {Chickpea Salad}                                         => {Banana} 0.0001219429  0.4210526 0.0002896143 3.137011    16
## [7]  {Greek 100 Calories Mixed Berry Yogurt}                  => {Banana} 0.0001219429  0.4102564 0.0002972357 3.056574    16
## [8]  {Triscuit Reduced Fat Crackers}                          => {Banana} 0.0001143214  0.4054054 0.0002819929 3.020433    15
## [9]  {Natural Uncured Turkey Hot Dog}                         => {Banana} 0.0001295643  0.4146341 0.0003124786 3.089190    17
## [10] {Broccoli Floret}                                        => {Banana} 0.0001448071  0.4130435 0.0003505857 3.077339    19

Conclusions and Recommendation

Maximum no of orders are placed between 10:00AM and 5:00PM on Sunday and Monday.
The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach.
The implementation of association rules is an interesting option in the selection process of the technique to use to perform data mining; Its application is essential for the discovery of association relationships of data, being useful in the selection of a marketing strategy, decision-making processes, business management, etc.
When a customer has organic strawberries, organic raspberries and organic Hass avocados in their shopping cart, they are very likely to buy a bag of organic bananas to go along with those items, and so on. With this information, retailers can influence purchasing decisions by confidently predicting what customers need next.
We recommend to do additional work with marketing team, to increase product placement storewide.
We also recommend to increase visibility of products in the aisles, or end of aisle displays.

Next Steps

Conduct a Cluster Analysis to identify and understand the customer’s behavior. Then, the buying behavior of each group can be examined separately on measures such as brand loyalty, price willing to pay, frequency of purchase, etc. To do this analysis, refer to the K-means clustering Analysis, Gap Statistic Method, Average Silhouette or Elbow Method.
Conduct a Basket Analysis by department or aisles and increase the confidence of the model. Refer to the lift metric to find the strong-association and randomly sample out those associations.
Conduct a data analysis including real prices of the products or other additional information that allows business managers to improve their performance and predict other purchasing behaviors.

Lessons Learned

Domain knowledge - Understand the data well enough to generate questions on how to solve current and future challenges.
Keep your project simple by well defining a manageable portion of the data; then redefine throughout the EDA. Your final steps are for the cool new tricks you have learned.

Reference

Instacart Market Basket Analysis. Available online: https://www.kaggle.com/c/instacart-market-basket-analysis/data
Market Basket Analysis using R. https://www.datacamp.com/community/tutorials/market-basket-analysis-r
A Gentle Introduction on Market Basket Analysis.Association Rules. Available online: https://datascienceplus.com/a-gentle-introduction-on-market-basket-analysis%E2%80%8A-%E2%80%8Aassociation-rules/

Association Rules

Foundations of Data Science with R

Karol Orozco, Charlene Ewing