Market basket analysis of groceries data

Introduction

Market basket analysis is a data mining technique used primarily by retailers to understand their customers’ purchasing patterns. Knowing which items tend to be bought together helps a business focus on the needs of its customers, which can increase profits. For instance, the analysis might reveal that a customer who buys milk and bread is also more likely to buy eggs and butter.

Importing the necessary libraries

library(arules)
library(arulesViz)
library(kableExtra)
library(ggplot2)

Loading the dataset (CSV format)

market <- read.csv("basket_analysis.csv", stringsAsFactors = TRUE)

About the dataset

The dataset contains 999 observations (transactions) made at a grocery shop and 17 variables: one holds the transaction number and the other 16 each represent an item. The item variables are imported as factors so that arules can turn each variable-value pair into an item. Each of them takes the value “True” or “False”: “True” means the item was bought in that transaction, and “False” means it was not.

The dataset is publicly available on Kaggle: https://www.kaggle.com/datasets?search=market+basket

Initial exploration of the dataset

head(market)

##   X Apple Bread Butter Cheese  Corn  Dill  Eggs Ice.cream Kidney.Beans  Milk
## 1 0 False  True  False  False  True  True False      True        False False
## 2 1 False False  False  False False False False     False        False  True
## 3 2  True False   True  False False  True False      True        False  True
## 4 3 False False   True   True False  True False     False        False  True
## 5 4  True  True  False  False False False False     False        False False
## 6 5  True  True   True   True False  True False      True        False False
##   Nutmeg Onion Sugar Unicorn Yogurt chocolate
## 1  False False  True   False   True      True
## 2  False False False   False  False     False
## 3  False False False   False   True      True
## 4   True  True False   False  False     False
## 5  False False False   False  False     False
## 6   True False False    True   True      True

str(market)

## 'data.frame':    999 obs. of  17 variables:
##  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Apple       : Factor w/ 2 levels "False","True": 1 1 2 1 2 2 1 2 2 2 ...
##  $ Bread       : Factor w/ 2 levels "False","True": 2 1 1 1 2 2 1 1 1 1 ...
##  $ Butter      : Factor w/ 2 levels "False","True": 1 1 2 2 1 2 2 1 1 1 ...
##  $ Cheese      : Factor w/ 2 levels "False","True": 1 1 1 2 1 2 1 2 1 1 ...
##  $ Corn        : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dill        : Factor w/ 2 levels "False","True": 2 1 2 2 1 2 1 1 2 2 ...
##  $ Eggs        : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Ice.cream   : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 2 1 2 2 ...
##  $ Kidney.Beans: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Milk        : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 1 2 2 ...
##  $ Nutmeg      : Factor w/ 2 levels "False","True": 1 1 1 2 1 2 2 2 2 1 ...
##  $ Onion       : Factor w/ 2 levels "False","True": 1 1 1 2 1 1 2 1 2 2 ...
##  $ Sugar       : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 2 2 2 ...
##  $ Unicorn     : Factor w/ 2 levels "False","True": 1 1 1 1 1 2 1 1 2 2 ...
##  $ Yogurt      : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 2 2 2 1 ...
##  $ chocolate   : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 1 1 2 2 ...

Dropping the first variable (X)

The variable X holds only the transaction (order) IDs, which carry no information for mining association rules, so it is removed.

market$X <- NULL

The dataset now contains one variable fewer, i.e. 16 variables in total.

Basket analysis using the apriori algorithm

rules <- apriori(market, parameter = list(minlen = 2, conf = 0.8, supp = 0.14))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.14      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 139 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[32 item(s), 999 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [68 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

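The output shows that 68 rules satisfy a minimum support of 0.14 and a minimum confidence of 0.8, with rule lengths between 2 and 10 items (maxlen = 10 is the default). Note that 32 items are reported rather than 16, because each factor column contributes two items, one for “True” and one for “False”.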
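To make the measures reported in the tables below concrete, here is a minimal base-R sketch on five made-up baskets (the data frame toy and its values are invented for illustration and are not taken from the groceries data). Support is the share of transactions that contain every item of a rule, confidence is that support divided by the support of the left-hand side, and lift is the confidence divided by the support of the right-hand side.

```r
# Hypothetical toy data: five baskets, TRUE means the item was bought.
toy <- data.frame(
  Milk  = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  Bread = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),
  Eggs  = c(TRUE,  FALSE, TRUE,  TRUE,  FALSE)
)

# Rule under study: {Milk} => {Bread}
lhs  <- toy$Milk
rhs  <- toy$Bread
both <- lhs & rhs

support    <- mean(both)              # share of baskets with Milk and Bread
coverage   <- mean(lhs)               # share of baskets with Milk (support of the LHS)
confidence <- support / coverage      # share of Milk baskets that also contain Bread
lift       <- confidence / mean(rhs)  # above 1: bought together more often than by chance

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 3)
```

A lift close to 1 would indicate that the two sides of the rule occur together about as often as expected if they were independent.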

inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))

##     lhs                     rhs            support confidence  coverage     lift count
## [1] {Butter=False,                                                                    
##      Corn=False,                                                                      
##      Kidney.Beans=False,                                                              
##      chocolate=False}    => {Milk=False} 0.1481481  0.8554913 0.1731732 1.438781   148
## [2] {Apple=False,                                                                     
##      Butter=False,                                                                    
##      Corn=False,                                                                      
##      chocolate=False}    => {Milk=False} 0.1521522  0.8444444 0.1801802 1.420202   152
## [3] {Corn=False,                                                                      
##      Kidney.Beans=False,                                                              
##      Sugar=False,                                                                     
##      chocolate=False}    => {Milk=False} 0.1401401  0.8333333 0.1681682 1.401515   140
## [4] {Apple=False,                                                                     
##      Butter=False,                                                                    
##      Unicorn=False,                                                                   
##      chocolate=False}    => {Milk=False} 0.1441441  0.8323699 0.1731732 1.399895   144
## [5] {Apple=False,                                                                     
##      Butter=False,                                                                    
##      Kidney.Beans=False,                                                              
##      chocolate=False}    => {Milk=False} 0.1491491  0.8277778 0.1801802 1.392172   149

The table above shows the five rules with the highest confidence. Interestingly, all five have Milk=False on the right-hand side: each describes a combination of items that were not bought and that is associated with the customer not buying milk either.

The next step looks for combinations that lead to the purchase of milk by restricting the right-hand side of the rules to Milk=True.
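Rules about items that were not bought appear because every factor column yields one item per level. One possible way to mine only actual purchases, sketched here on a small made-up data frame rather than on the real file, is to turn the “True”/“False” factors into a logical matrix before coercing to transactions, so that an item exists only where the value is TRUE. The name toy and its contents are assumptions for illustration.

```r
library(arules)

# Hypothetical stand-in for `market`: factor columns with levels "False"/"True".
toy <- data.frame(
  Milk  = factor(c("True",  "True",  "False", "True")),
  Bread = factor(c("True",  "False", "True",  "True")),
  Eggs  = factor(c("False", "False", "True",  "True"))
)

# Convert every column to logical: TRUE where the item was bought.
# sapply() returns a logical matrix with one column per item.
bought <- sapply(toy, function(col) col == "True")

# A logical matrix coerces to transactions with one item per column,
# so no "=False" items are created.
trans <- as(bought, "transactions")

itemLabels(trans)  # item names come from the column names
dim(trans)         # number of transactions, number of items
```

Rules mined from such an object would contain only statements about purchased items, at the cost of no longer being able to express "did not buy" conditions.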

rules <- apriori(market, parameter = list(minlen = 2, maxlen = 5, supp = 0.01, conf = 0.8),
                 appearance = list(rhs = c("Milk=True")))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[32 item(s), 999 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [1 rule(s)] done [0.01s].
## creating S4 object  ... done [0.00s].

inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))

##     lhs               rhs            support confidence   coverage     lift count
## [1] {Apple=True,                                                                 
##      Corn=False,                                                                 
##      Eggs=True,                                                                  
##      Unicorn=True} => {Milk=True} 0.01801802  0.8181818 0.02202202 2.018182    18

With a minimum confidence of 0.8, a minimum support of 0.01, and a maximum rule length of 5, only one such rule exists: transactions that include apples, eggs, and unicorn but not corn also include milk about 82% of the time.
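As a quick sanity check, the measures reported for this rule can be reproduced from its counts with plain arithmetic. The sketch below uses only numbers from the output above (999 transactions, a count of 18, a coverage of 0.02202202, and a lift of 2.018182); the left-hand-side count and the overall share of Milk=True are derived from those values, not read from the data.

```r
n          <- 999         # transactions in the dataset
rule_count <- 18          # transactions containing the LHS and Milk=True
coverage   <- 0.02202202  # reported support of the LHS
lift       <- 2.018182    # reported lift

lhs_count  <- round(coverage * n)     # transactions matching the LHS
support    <- rule_count / n          # compare with the reported 0.01801802
confidence <- rule_count / lhs_count  # compare with the reported 0.8181818

# Lift is confidence divided by the support of the RHS, so the overall
# share of transactions with Milk=True can be recovered from it.
milk_share <- confidence / lift

c(lhs_count = lhs_count, support = support,
  confidence = confidence, milk_share = milk_share)
```

Read this way, a lift of about 2 says that milk appears roughly twice as often in baskets matching the left-hand side as in baskets overall.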

Checking whether “rules” contains any redundant rules

redundant <- is.redundant(rules, measure = "confidence")
which(redundant)

## integer(0)

The empty result means that no rule is redundant. With only a single rule in the set this is expected, since a rule can only be redundant relative to another, more general rule.
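To see what is.redundant() flags when a rule set does contain redundancy, here is a small sketch on made-up transactions (the items a, b, and c and the thresholds are invented for illustration). With the confidence measure, a rule is treated as redundant when a more general rule in the set, one with the same right-hand side and a subset of the left-hand side, has equal or higher confidence.

```r
library(arules)

# Hypothetical baskets: a and b always occur together, c is sometimes added.
baskets <- list(
  c("a", "b"),
  c("a", "b"),
  c("a", "b", "c"),
  c("a", "b", "c"),
  c("c")
)
trans <- as(baskets, "transactions")

toy_rules <- apriori(
  trans,
  parameter = list(minlen = 2, supp = 0.3, conf = 0.8),
  control   = list(verbose = FALSE)
)

# {a, c} => {b} adds nothing over {a} => {b}, which already has confidence 1.
redundant <- is.redundant(toy_rules, measure = "confidence")
inspect(toy_rules[redundant])    # rules flagged as redundant
pruned <- toy_rules[!redundant]  # keep only the non-redundant rules
```

Subsetting with !redundant is a common way to shrink a large rule set before inspecting or plotting it.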

Plotting the rule as a graph

plot(rules, method = "graph")