Market basket analysis is a data mining technique used primarily by retailers to understand their customers' purchasing patterns. Knowing which items tend to be bought together helps a business respond to customer needs, which can in turn increase profits. For instance, the analysis might reveal that a customer who buys milk and bread is also likely to buy eggs and butter.
library(arules)
library(arulesViz)
library(kableExtra)
library(ggplot2)
market <- read.csv("basket_analysis.csv", stringsAsFactors = TRUE)
The dataset contains 999 observations (transactions) made at a grocery shop and 17 variables: one transaction number and 16 items. Each item variable records whether that item was part of a transaction and is imported as a factor, so that every variable/value pair can later be treated as an item. Its value is either “True” (the item was bought) or “False” (it was not).
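As a minimal sketch of how this True/False encoding can be read, the snippet below computes the share of transactions containing each item. The `toy` data frame and its values are hypothetical and only mimic the layout of the real file.

```r
# Hypothetical toy data in the same "True"/"False" layout as the real file
toy <- data.frame(
  Milk  = factor(c("True", "False", "True", "True"), levels = c("False", "True")),
  Bread = factor(c("True", "True", "False", "True"), levels = c("False", "True")),
  Eggs  = factor(c("False", "False", "False", "True"), levels = c("False", "True"))
)

# Share of transactions in which each item was bought
item_freq <- sapply(toy, function(col) mean(col == "True"))
print(item_freq)
```

The same `sapply()` call could be applied to the item columns of `market` to see which items are bought most often.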
The dataset is openly available on Kaggle: https://www.kaggle.com/datasets?search=market+basket
head(market)
## X Apple Bread Butter Cheese Corn Dill Eggs Ice.cream Kidney.Beans Milk
## 1 0 False True False False True True False True False False
## 2 1 False False False False False False False False False True
## 3 2 True False True False False True False True False True
## 4 3 False False True True False True False False False True
## 5 4 True True False False False False False False False False
## 6 5 True True True True False True False True False False
## Nutmeg Onion Sugar Unicorn Yogurt chocolate
## 1 False False True False True True
## 2 False False False False False False
## 3 False False False False True True
## 4 True True False False False False
## 5 False False False False False False
## 6 True False False True True True
str(market)
## 'data.frame': 999 obs. of 17 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Apple : Factor w/ 2 levels "False","True": 1 1 2 1 2 2 1 2 2 2 ...
## $ Bread : Factor w/ 2 levels "False","True": 2 1 1 1 2 2 1 1 1 1 ...
## $ Butter : Factor w/ 2 levels "False","True": 1 1 2 2 1 2 2 1 1 1 ...
## $ Cheese : Factor w/ 2 levels "False","True": 1 1 1 2 1 2 1 2 1 1 ...
## $ Corn : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dill : Factor w/ 2 levels "False","True": 2 1 2 2 1 2 1 1 2 2 ...
## $ Eggs : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 2 2 2 2 ...
## $ Ice.cream : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 2 1 2 2 ...
## $ Kidney.Beans: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 2 1 1 1 ...
## $ Milk : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 1 2 2 ...
## $ Nutmeg : Factor w/ 2 levels "False","True": 1 1 1 2 1 2 2 2 2 1 ...
## $ Onion : Factor w/ 2 levels "False","True": 1 1 1 2 1 1 2 1 2 2 ...
## $ Sugar : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 2 2 2 ...
## $ Unicorn : Factor w/ 2 levels "False","True": 1 1 1 1 1 2 1 1 2 2 ...
## $ Yogurt : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 2 2 2 1 ...
## $ chocolate : Factor w/ 2 levels "False","True": 2 1 2 1 1 2 1 1 2 2 ...
The variable holding the transaction (order) IDs carries no information about which items were bought together, so it is dropped before mining association rules.
market$X <- NULL
The dataset now has one variable fewer, 16 in total.
rules <- apriori(market, parameter = list(minlen = 2, conf = 0.8, supp = 0.14))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.14 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 139
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[32 item(s), 999 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [68 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The algorithm finds 68 rules at a minimum support of 0.14, a minimum confidence of 0.8, a minimum rule length of 2 and the default maximum length of 10.
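Two numbers in the log can be checked with simple arithmetic. The sketch below assumes that each two-level factor contributes one item per level and that the reported minimum support count is the truncated product of the support threshold and the number of transactions. That assumption is inferred from the logs in this document, not taken from the arules documentation, and the variable names are illustrative.

```r
n_transactions <- 999   # rows in the dataset
n_columns      <- 16    # item columns left after dropping X
min_support    <- 0.14  # supp argument passed to apriori()

# Each two-level factor yields one item per level, e.g. Milk=True and Milk=False
n_items <- n_columns * 2

# Truncating the product matches the "Absolute minimum support count" in the log
min_count <- floor(min_support * n_transactions)

print(c(items = n_items, min_count = min_count))
```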
inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {Butter=False,
## Corn=False,
## Kidney.Beans=False,
## chocolate=False} => {Milk=False} 0.1481481 0.8554913 0.1731732 1.438781 148
## [2] {Apple=False,
## Butter=False,
## Corn=False,
## chocolate=False} => {Milk=False} 0.1521522 0.8444444 0.1801802 1.420202 152
## [3] {Corn=False,
## Kidney.Beans=False,
## Sugar=False,
## chocolate=False} => {Milk=False} 0.1401401 0.8333333 0.1681682 1.401515 140
## [4] {Apple=False,
## Butter=False,
## Unicorn=False,
## chocolate=False} => {Milk=False} 0.1441441 0.8323699 0.1731732 1.399895 144
## [5] {Apple=False,
## Butter=False,
## Kidney.Beans=False,
## chocolate=False} => {Milk=False} 0.1491491 0.8277778 0.1801802 1.392172 149
The table above shows the five rules with the highest confidence. Interestingly, all five have {Milk=False} on the right-hand side: they describe purchasing patterns that end with the customer not buying milk.
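The "=False" items in these rules arise because every factor level becomes its own item. One possible way to mine only purchased items, sketched here on hypothetical toy data, is to build a logical matrix and coerce it to transactions. The `toy` matrix and its values are made up for illustration; the coercion relies on the arules package loaded earlier.

```r
library(arules)

# Hypothetical toy data; TRUE means the item was in the basket
toy <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    FALSE, TRUE,  TRUE,
    TRUE,  TRUE,  TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Milk", "Bread", "Eggs"))
)

# A logical matrix coerces to transactions with one item per column,
# so no "=False" items are created
toy_trans <- as(toy, "transactions")
inspect(toy_trans)
```

Under the same assumption, `as(market == "True", "transactions")` should do this for the real data once the `X` column has been dropped, although the support threshold would likely need retuning.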
The next step looks for combinations that lead to the purchase of milk.
rules <- apriori(market,
                 parameter = list(minlen = 2, maxlen = 5, supp = 0.01, conf = 0.8),
                 appearance = list(rhs = c("Milk=True")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[32 item(s), 999 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [1 rule(s)] done [0.01s].
## creating S4 object ... done [0.00s].
inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {Apple=True,
## Corn=False,
## Eggs=True,
## Unicorn=True} => {Milk=True} 0.01801802 0.8181818 0.02202202 2.018182 18
With a minimum confidence of 0.8, a minimum support of 0.01 and a maximum rule length of 5, only one such rule exists.
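The columns of this rule are tied together by the standard definitions of support, confidence and lift. The sketch below recomputes them from the values printed in the table. The variable names are illustrative, and `rhs_share` is a quantity derived from the reported lift, not a figure reported by the analysis.

```r
n        <- 999          # transactions in the dataset
count    <- 18           # transactions matching both lhs and rhs (count column)
coverage <- 0.02202202   # share of transactions matching the lhs (coverage column)
lift     <- 2.018182     # lift column

support    <- count / n            # share of transactions matching the whole rule
confidence <- support / coverage   # share of lhs transactions that also contain milk
rhs_share  <- confidence / lift    # implied overall share of transactions with Milk=True

print(round(c(support = support, confidence = confidence, rhs_share = rhs_share), 4))
```

A lift of about 2 means that customers matching the left-hand side buy milk roughly twice as often as customers overall.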
redundant <- is.redundant(rules, measure="confidence")
which(redundant)
## integer(0)
The empty result shows that no rule is redundant. With only one rule in the set, there is no more general rule to compare it against, so this is expected.
plot(rules, method = "graph")