An association rule is a pattern that describes how likely one event is to occur given that another event has occurred. In other words, association rules are if/then statements that help uncover relationships between seemingly unrelated data. The most widely used example of association rule mining is market basket analysis, and in this paper we consider the items purchased in a bakery.
The arules and arulesViz packages were installed and loaded using the library function. The arules package provides the framework for representing and analyzing the transactions and patterns within the dataset. The arulesViz package is an extension of the arules package that adds visualization techniques for association rules. The Matrix package, which provides classes for dense and sparse matrices, was also attached as it is required by the arules package.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
library(Matrix)
Bakery <- read.csv("C:\\Users\\User\\Desktop\\bread basket.csv", header = F, colClasses = "factor")
#Bakery <- subset(Bakery, select = -Bakery$peroid_day)
summary(Bakery)
## V1 V2 V3
## 6279 : 11 Coffee :5471 5/2/2017 11:58 : 12
## 6412 : 11 Bread :3325 11/2/2017 14:08: 11
## 6474 : 11 Tea :1435 12/2/2017 14:35: 11
## 6716 : 11 Cake :1025 17/2/2017 14:18: 11
## 6045 : 10 Pastry : 856 9/2/2017 13:44 : 11
## 9447 : 10 Sandwich: 771 5/4/2017 17:22 : 10
## (Other):20444 (Other) :7625 (Other) :20442
head(Bakery)
## V1 V2 V3
## 1 Transaction Item date_time
## 2 1 Bread 30/10/2016 09:58
## 3 2 Scandinavian 30/10/2016 10:05
## 4 2 Scandinavian 30/10/2016 10:05
## 5 3 Hot chocolate 30/10/2016 10:07
## 6 3 Jam 30/10/2016 10:07
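Because the file was read with header = F, the column names (Transaction, Item, date_time) appear as the first data row and the columns are labelled V1 to V3. The statistics from the summary function may otherwise be ignored, as the data being analysed are qualitative.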
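The apriori algorithm is used to mine frequent item sets and association rules within the dataset. Using this algorithm, the confidence value, that is, the estimated probability of an item being selected given the items already in the basket, can be obtained. Its fundamentals are built on creating combinations of items and counting their frequencies. The quality of the association rules is indicated by the following: a) Support, which is the proportion of transactions in the dataset that contain the item set; each item set at each level must have a support equal to or greater than the minimum support. b) Confidence, which is the proportion of transactions containing the left-hand side (lhs) of a rule that also contain its right-hand side (rhs). c) Lift, which is the confidence of the rule divided by the support of its rhs; a lift above 1 indicates that the items occur together more often than would be expected if they were independent.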
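The three measures defined above can be sketched on a toy example. The block below is a minimal base R illustration, not the arules implementation: the five baskets are made up for this purpose, and the helper names `support`, `confidence` and `lift` are hypothetical.

```r
# Toy data (hypothetical, not taken from the bakery file): one character
# vector of items per transaction.
baskets <- list(
  c("Coffee", "Cake"),
  c("Coffee", "Bread"),
  c("Coffee", "Cake", "Tea"),
  c("Bread"),
  c("Tea", "Cake")
)

# Support: share of transactions that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# Lift of lhs => rhs: confidence(lhs => rhs) / support(rhs).
lift <- function(lhs, rhs, baskets) {
  confidence(lhs, rhs, baskets) / support(rhs, baskets)
}

# Measures for the toy rule {Cake} => {Coffee}.
supp_cake_coffee <- support(c("Cake", "Coffee"), baskets)
conf_cake_coffee <- confidence("Cake", "Coffee", baskets)
lift_cake_coffee <- lift("Cake", "Coffee", baskets)
```

In this toy data, Cake and Coffee appear together in two of five baskets, and Cake appears in three, so the confidence of {Cake} => {Coffee} is two thirds.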
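# cols = c(3, 2): column 3 (date_time) is used as the transaction ID and column 2 (Item) as the item
Bakery01 <- read.transactions("C:\\Users\\User\\Desktop\\bread basket.csv", format = "single", sep = ",", cols = c(3,2))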
summary(Bakery01)
## transactions as itemMatrix in sparse format with
## 6375 rows (elements/itemsets/transactions) and
## 103 columns (items) and a density of 0.02014087
##
## most frequent items:
## Coffee Bread Tea Cake Pastry (Other)
## 3133 2124 935 687 575 5771
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 2480 2069 1055 515 187 49 13 2 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.075 3.000 10.000
##
## includes extended item information - examples:
## labels
## 1 Adjustment
## 2 Afternoon with the baker
## 3 Alfajores
##
## includes extended transaction information - examples:
## transactionID
## 1 1/11/2016 09:07
## 2 1/11/2016 09:09
## 3 1/11/2016 09:26
#install.packages("RColorBrewer")
# a package that provides color schemes.
library(RColorBrewer)
itemFrequencyPlot(Bakery01, topN = 30, type = "relative", col = brewer.pal(12, "Paired"), weighted = FALSE, main = "Frequency Graph")
From the frequency graph above, it is evident that Coffee is the most purchased product and that, among the 30 most frequent items plotted, Hearty & Seasonal is the least purchased.
rules <- apriori(Bakery01, parameter = list(supp =0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 6
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6375 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules[1:10])
## lhs rhs support confidence
## [1] {Keeping It Local} => {Coffee} 0.004392157 0.8000000
## [2] {Extra Salami or Feta} => {Coffee} 0.003607843 0.8214286
## [3] {Cake, Vegan mincepie} => {Coffee} 0.001098039 0.8750000
## [4] {Keeping It Local, Tea} => {Coffee} 0.001254902 0.8000000
## [5] {Fudge, Sandwich} => {Coffee} 0.001098039 0.8750000
## [6] {Hearty & Seasonal, Sandwich} => {Coffee} 0.001411765 0.9000000
## [7] {Salad, Sandwich} => {Coffee} 0.001568627 0.8333333
## [8] {Cake, Salad} => {Coffee} 0.001254902 0.8000000
## [9] {Alfajores, Mineral water} => {Coffee} 0.001254902 0.8000000
## [10] {Farm House, Toast} => {Coffee} 0.001098039 1.0000000
## coverage lift count
## [1] 0.005490196 1.627833 28
## [2] 0.004392157 1.671435 23
## [3] 0.001254902 1.780442 7
## [4] 0.001568627 1.627833 8
## [5] 0.001254902 1.780442 7
## [6] 0.001568627 1.831312 9
## [7] 0.001882353 1.695659 10
## [8] 0.001568627 1.627833 8
## [9] 0.001568627 1.627833 8
## [10] 0.001098039 2.034791 7
The support level was lowered from the initial 1% because no association rules were generated at that level. At 0.1% support, 20 rules were generated, and the first 10 are displayed above. These rules can be interpreted as follows: 87.5% of the transactions containing Cake and Vegan mincepie also contain Coffee, and 80% of the transactions containing Keeping It Local also contain Coffee. Among the displayed rules, {Keeping It Local} => {Coffee} has the highest count (28), whilst {Farm House, Toast} => {Coffee} has the highest confidence (1.00), followed by {Hearty & Seasonal, Sandwich} => {Coffee} (0.90).
Since the lift of every rule above is greater than 1, the lhs items and Coffee are bought together more often than would be expected if they were purchased independently, so these rules are worth considering. Note, however, that several of them rest on only 7 to 10 transactions. The rules may be sorted by confidence, support, count and so on. This is illustrated below:
by_confidence<-sort(rules, by="confidence", decreasing=TRUE)
inspect(head(by_confidence))
## lhs rhs support confidence
## [1] {Farm House, Toast} => {Coffee} 0.001098039 1.0
## [2] {Cake, Hot chocolate, Sandwich} => {Coffee} 0.001254902 1.0
## [3] {Bread, Medialuna, Sandwich} => {Coffee} 0.001098039 1.0
## [4] {Hearty & Seasonal, Sandwich} => {Coffee} 0.001411765 0.9
## [5] {Pastry, Sandwich} => {Coffee} 0.001411765 0.9
## [6] {Cake, Sandwich, Tea} => {Coffee} 0.001411765 0.9
## coverage lift count
## [1] 0.001098039 2.034791 7
## [2] 0.001254902 2.034791 8
## [3] 0.001098039 2.034791 7
## [4] 0.001568627 1.831312 9
## [5] 0.001568627 1.831312 9
## [6] 0.001568627 1.831312 9
inspect(head(rules, n = 100, by = "confidence"))
## lhs rhs support confidence
## [1] {Farm House, Toast} => {Coffee} 0.001098039 1.0000000
## [2] {Cake, Hot chocolate, Sandwich} => {Coffee} 0.001254902 1.0000000
## [3] {Bread, Medialuna, Sandwich} => {Coffee} 0.001098039 1.0000000
## [4] {Hearty & Seasonal, Sandwich} => {Coffee} 0.001411765 0.9000000
## [5] {Pastry, Sandwich} => {Coffee} 0.001411765 0.9000000
## [6] {Cake, Sandwich, Tea} => {Coffee} 0.001411765 0.9000000
## [7] {Cake, Vegan mincepie} => {Coffee} 0.001098039 0.8750000
## [8] {Fudge, Sandwich} => {Coffee} 0.001098039 0.8750000
## [9] {Cake, Toast} => {Coffee} 0.002196078 0.8750000
## [10] {Cake, Sandwich, Soup} => {Coffee} 0.001098039 0.8750000
## [11] {Hot chocolate, Scone} => {Coffee} 0.001882353 0.8571429
## [12] {Cookies, Scone} => {Coffee} 0.001882353 0.8571429
## [13] {Salad, Sandwich} => {Coffee} 0.001568627 0.8333333
## [14] {Extra Salami or Feta} => {Coffee} 0.003607843 0.8214286
## [15] {Keeping It Local} => {Coffee} 0.004392157 0.8000000
## [16] {Keeping It Local, Tea} => {Coffee} 0.001254902 0.8000000
## [17] {Cake, Salad} => {Coffee} 0.001254902 0.8000000
## [18] {Alfajores, Mineral water} => {Coffee} 0.001254902 0.8000000
## [19] {Juice, Spanish Brunch} => {Coffee} 0.002509804 0.8000000
## [20] {Pastry, Toast} => {Coffee} 0.001254902 0.8000000
## coverage lift count
## [1] 0.001098039 2.034791 7
## [2] 0.001254902 2.034791 8
## [3] 0.001098039 2.034791 7
## [4] 0.001568627 1.831312 9
## [5] 0.001568627 1.831312 9
## [6] 0.001568627 1.831312 9
## [7] 0.001254902 1.780442 7
## [8] 0.001254902 1.780442 7
## [9] 0.002509804 1.780442 14
## [10] 0.001254902 1.780442 7
## [11] 0.002196078 1.744107 12
## [12] 0.002196078 1.744107 12
## [13] 0.001882353 1.695659 10
## [14] 0.004392157 1.671435 23
## [15] 0.005490196 1.627833 28
## [16] 0.001568627 1.627833 8
## [17] 0.001568627 1.627833 8
## [18] 0.001568627 1.627833 8
## [19] 0.003137255 1.627833 16
## [20] 0.001568627 1.627833 8
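As a cross-check, the lift column in the output above can be reproduced from the confidence column and the overall frequency of Coffee. The sketch below uses only base R and numbers copied from the printed output; the variable names are chosen here for illustration.

```r
# Values copied from the output above.
n_transactions <- 6375
coffee_count   <- 3133                      # from summary(Bakery01)
coffee_support <- coffee_count / n_transactions

# Confidence of three of the reported rules, all with Coffee on the rhs.
conf <- c("Keeping It Local"            = 0.8000000,
          "Hearty & Seasonal, Sandwich" = 0.9000000,
          "Farm House, Toast"           = 1.0000000)

# lift = confidence / support(rhs)
lift_check <- conf / coffee_support
round(lift_check, 6)
```

Because every rule has the same rhs, the lift here is simply the confidence rescaled by a constant, which is why sorting by confidence and sorting by lift give the same order for this rule set.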
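In addition, we can identify and remove redundant rules from the generated association rules. A rule is redundant if a more general rule, with the same rhs and a subset of its lhs, has the same or higher confidence. Redundant rules are first identified using the ‘is.redundant’ function: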
redundant_rules <- is.redundant(rules)
redundant_rules
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
summary(redundant_rules)
## Mode FALSE TRUE
## logical 19 1
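TRUE indicates a redundant rule while FALSE indicates a non-redundant rule. The summary shows that there is one redundant rule, rule [4] {Keeping It Local, Tea} => {Coffee}, which has the same confidence (0.80) as the more general rule {Keeping It Local} => {Coffee}. It can be removed as shown below: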
rules <- rules[!redundant_rules]
rules
## set of 19 rules
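The idea behind the redundancy check can be sketched in base R on three of the rules reported earlier. This is a simplified illustration and not the arules implementation: it assumes confidence as the comparison measure, only compares against rules present in the given set, and the name `is_redundant_sketch` is hypothetical.

```r
# Three rules copied from the inspect() output; all have Coffee on the rhs.
lhs  <- list(c("Keeping It Local"),
             c("Keeping It Local", "Tea"),
             c("Salad", "Sandwich"))
rhs  <- c("Coffee", "Coffee", "Coffee")
conf <- c(0.8000000, 0.8000000, 0.8333333)

# Rule i is flagged when some other rule j has the same rhs, an lhs that is a
# proper subset of rule i's lhs, and a confidence at least as high.
is_redundant_sketch <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    j != i &&
      rhs[j] == rhs[i] &&
      length(lhs[[j]]) < length(lhs[[i]]) &&
      all(lhs[[j]] %in% lhs[[i]]) &&
      conf[j] >= conf[i]
  }, logical(1)))
}, logical(1))

is_redundant_sketch
```

Only the second rule is flagged: adding Tea to the lhs does not raise the confidence above that of {Keeping It Local} => {Coffee}, so the longer rule carries no extra information.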
Now we can target a specific product to analyse which goods customers buy together with it, in other words, what else customers buy if they buy Cake. In the call below, Cake is fixed as the only item allowed on the left-hand side (lhs), while all other items may appear on the right-hand side (default = "rhs").
rules_cake <- apriori(Bakery01, parameter = list(supp=0.001, conf = 0.2), appearance = list(default="rhs", lhs = "Cake"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 6
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6375 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules_cake[1:5])
## lhs rhs support confidence coverage lift count
## [1] {} => {Bread} 0.33317647 0.3331765 1.0000000 1.0000000 2124
## [2] {} => {Coffee} 0.49145098 0.4914510 1.0000000 1.0000000 3133
## [3] {Cake} => {Tea} 0.02760784 0.2561863 0.1077647 1.7467249 176
## [4] {Cake} => {Bread} 0.02478431 0.2299854 0.1077647 0.6902812 158
## [5] {Cake} => {Coffee} 0.05882353 0.5458515 0.1077647 1.1106937 375
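In the above example, the confidence threshold was lowered from 0.8 to 0.2, with the support kept at 0.001, in order to generate association rules for the product (Cake). The first two rules have an empty lhs and simply report the overall frequencies of Bread and Coffee; setting minlen = 2 in the parameter list would exclude them. Of the remaining rules, Cake and Tea are bought together more often than expected (lift 1.75), whereas Cake and Bread are bought together less often than expected (lift 0.69). This analysis can be repeated for each product in the dataset, and the targeted item can be placed on either the left-hand side (lhs) or the right-hand side (rhs).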
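The support, confidence and lift of the three Cake rules can be recomputed from raw counts as a sanity check. The sketch below is base R only; the item counts are taken from summary(Bakery01), the pair counts from the count column above, and the variable names are chosen here for illustration.

```r
n <- 6375                                   # number of transactions

# Item counts from summary(Bakery01).
item_count <- c(Cake = 687, Tea = 935, Bread = 2124, Coffee = 3133)

# Transactions containing Cake together with each rhs item (count column).
pair_count <- c(Tea = 176, Bread = 158, Coffee = 375)

cake_support    <- pair_count / n                        # support of {Cake, rhs}
cake_confidence <- pair_count / item_count[["Cake"]]     # P(rhs | Cake)
cake_lift       <- cake_confidence / (item_count[names(pair_count)] / n)

round(cbind(cake_support, cake_confidence, cake_lift), 7)
```

The lift below 1 for Bread means that baskets with Cake contain Bread less often than baskets in general, even though Bread is the second most frequent item overall.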
The rules generated can be visualized using the arulesViz package. This is illustrated below.
#install.packages("interactions")
library(interactions)
plot(rules, method="graph")
plot(rules, method = "graph", engine = "interactive")
The first graph illustrates the products and their dependencies, and it is evident that there is a strong association between “keeping it local”, hot chocolate, coffee and Extra Salami or Feta. The interactive graph also illustrates the most important associations, in green, which are also the most frequent combinations. Because the graph is interactive, the products shown can be dragged to better visualize the relationships.
From this study, it can be concluded that coffee is the best-selling product, and the shop owner may use this as a marketing tool when customers purchase other products. Furthermore, the business owner may opt to bundle coffee with other products in order to boost sales. Lastly, the shop owner could decide to stop selling certain products, such as Hearty & Seasonal, that do not sell as well.