Association Rules project UL

Introduction

An association rule is a pattern that states the probability of an event occurring when another event occurs. In other words, association rules are if/then statements that help to uncover relationships between seemingly unrelated data. The most widely used example of association rules is market basket analysis, and in this paper we consider the items purchased in a bakery.

Packages

The arules and arulesViz packages were installed and loaded using the library function. The arules package provides the framework for representing and analyzing the transactions and patterns within the dataset. The arulesViz package is an extension of the arules package that provides visualization techniques for association rules. The Matrix package, which provides classes for dense and sparse matrices, was also attached, as it is required to run the arules package.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.

library(Matrix)

Summary of Bread basket dataset.

Bakery <- read.csv("C:\\Users\\User\\Desktop\\bread basket.csv", header = FALSE, colClasses = "factor")
#Bakery <- subset(Bakery, select = -period_day)
summary(Bakery)

##        V1               V2                     V3       
##  6279   :   11   Coffee  :5471   5/2/2017 11:58 :   12  
##  6412   :   11   Bread   :3325   11/2/2017 14:08:   11  
##  6474   :   11   Tea     :1435   12/2/2017 14:35:   11  
##  6716   :   11   Cake    :1025   17/2/2017 14:18:   11  
##  6045   :   10   Pastry  : 856   9/2/2017 13:44 :   11  
##  9447   :   10   Sandwich: 771   5/4/2017 17:22 :   10  
##  (Other):20444   (Other) :7625   (Other)        :20442

head(Bakery)

##            V1            V2               V3
## 1 Transaction          Item        date_time
## 2           1         Bread 30/10/2016 09:58
## 3           2  Scandinavian 30/10/2016 10:05
## 4           2  Scandinavian 30/10/2016 10:05
## 5           3 Hot chocolate 30/10/2016 10:07
## 6           3           Jam 30/10/2016 10:07

Because this study focuses on the relationship between items bought together in a single transaction, the period_day column is disregarded.

In addition, the statistics from the summary function may be ignored, as the data being analysed are qualitative.
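
The head() output above lists the header row (Transaction, Item, date_time) as the first data row, which is a side effect of reading the file with header = F. A minimal sketch of an alternative follows, assuming the file has a header line and a period_day column. The three inline rows stand in for the real file, and their period_day values are illustrative assumptions rather than values taken from the data:

```r
# Illustrative stand-in for the CSV file; the period_day values are assumed.
csv_text <- "Transaction,Item,date_time,period_day
1,Bread,30/10/2016 09:58,morning
2,Scandinavian,30/10/2016 10:05,morning
3,Jam,30/10/2016 10:07,morning"

# header = TRUE keeps the column names out of the data rows.
bakery_sketch <- read.csv(text = csv_text, header = TRUE, colClasses = "factor")

# Inside subset(), a column is dropped by its bare name.
bakery_sketch <- subset(bakery_sketch, select = -period_day)

str(bakery_sketch)
```

With the header parsed, the columns can be referred to by name instead of V1, V2 and V3.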

Association Rules

The apriori algorithm is used to mine frequent item sets and association rules within the dataset. By using this algorithm, the confidence value, or probability of the next item being selected, can be obtained. Its fundamentals are built on creating combinations of items and counting their frequencies. The quality of the association rules is indicated by the following: a) Support, which is the proportion of transactions in the dataset that contain the item set; every item set at each level must have a support equal to or greater than the minimum support. b) Confidence, which is the proportion of transactions containing the left-hand side (lhs) of the rule that also contain the right-hand side (rhs). c) Lift, which is the confidence divided by the support of the rhs; a lift above 1 indicates that the items occur together more often than would be expected if they were independent.
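
As a worked illustration of the three measures, the sketch below recomputes them by hand for the rule {Cake} => {Coffee}. The counts are taken from the outputs reported in this paper; the variable names are chosen here for illustration only.

```r
# Counts taken from the outputs in this report.
n_transactions <- 6375  # total number of transactions
n_cake         <- 687   # transactions containing Cake
n_coffee       <- 3133  # transactions containing Coffee
n_both         <- 375   # transactions containing both Cake and Coffee

support    <- n_both / n_transactions                   # P(Cake and Coffee)
confidence <- n_both / n_cake                           # P(Coffee | Cake)
lift       <- confidence / (n_coffee / n_transactions)  # confidence / P(Coffee)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The values agree with the {Cake} => {Coffee} row that apriori reports in the target product analysis later in this paper.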

Bakery01 <- read.transactions("C:\\Users\\User\\Desktop\\bread basket.csv", format = "single", sep = ",", cols = c(3,2))
summary(Bakery01)

## transactions as itemMatrix in sparse format with
##  6375 rows (elements/itemsets/transactions) and
##  103 columns (items) and a density of 0.02014087 
## 
## most frequent items:
##  Coffee   Bread     Tea    Cake  Pastry (Other) 
##    3133    2124     935     687     575    5771 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10 
## 2480 2069 1055  515  187   49   13    2    4    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.075   3.000  10.000 
## 
## includes extended item information - examples:
##                     labels
## 1               Adjustment
## 2 Afternoon with the baker
## 3                Alfajores
## 
## includes extended transaction information - examples:
##     transactionID
## 1 1/11/2016 09:07
## 2 1/11/2016 09:09
## 3 1/11/2016 09:26

#install.packages("RColorBrewer")
# a package that provides color schemes.
library(RColorBrewer)
# The 'Paired' palette has at most 12 colours; they are recycled across the 30 bars.
itemFrequencyPlot(Bakery01, topN = 30, type = "relative", col = brewer.pal(12, 'Paired'), weighted = FALSE, main = "Frequency Graph")

From the frequency graph above, it is evident that Coffee is the most purchased product and that Hearty & Seasonal is the least purchased of the 30 products plotted.

rules <- apriori(Bakery01, parameter = list(supp =0.001, conf = 0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6375 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(rules[1:10])

##      lhs                              rhs      support     confidence
## [1]  {Keeping It Local}            => {Coffee} 0.004392157 0.8000000 
## [2]  {Extra Salami or Feta}        => {Coffee} 0.003607843 0.8214286 
## [3]  {Cake, Vegan mincepie}        => {Coffee} 0.001098039 0.8750000 
## [4]  {Keeping It Local, Tea}       => {Coffee} 0.001254902 0.8000000 
## [5]  {Fudge, Sandwich}             => {Coffee} 0.001098039 0.8750000 
## [6]  {Hearty & Seasonal, Sandwich} => {Coffee} 0.001411765 0.9000000 
## [7]  {Salad, Sandwich}             => {Coffee} 0.001568627 0.8333333 
## [8]  {Cake, Salad}                 => {Coffee} 0.001254902 0.8000000 
## [9]  {Alfajores, Mineral water}    => {Coffee} 0.001254902 0.8000000 
## [10] {Farm House, Toast}           => {Coffee} 0.001098039 1.0000000 
##      coverage    lift     count
## [1]  0.005490196 1.627833 28   
## [2]  0.004392157 1.671435 23   
## [3]  0.001254902 1.780442  7   
## [4]  0.001568627 1.627833  8   
## [5]  0.001254902 1.780442  7   
## [6]  0.001568627 1.831312  9   
## [7]  0.001882353 1.695659 10   
## [8]  0.001568627 1.627833  8   
## [9]  0.001568627 1.627833  8   
## [10] 0.001098039 2.034791  7

Interpretation of Association Rules

The minimum support was lowered from the initial 1% because no association rules were generated at that level. At 0.1% support, 20 rules were generated, the first 10 of which are displayed above. They can be interpreted as follows: 87.5% of the transactions containing Cake and Vegan mincepie also contained Coffee, and 80% of the transactions containing “Keeping It Local” also contained Coffee. Among the rules displayed, {Keeping It Local} => {Coffee} has the highest count (28), whilst {Farm House, Toast} => {Coffee} has the highest confidence (1.00), followed by {Hearty & Seasonal, Sandwich} => {Coffee} (0.90).

Since the lift of every rule above is greater than 1, the items on the left-hand side are positively associated with Coffee, which makes these rules worth considering. The low counts (7 to 28 transactions) mean, however, that each rule rests on few observations. The rules may be sorted by confidence, support, count, etc. This is illustrated below:

by_confidence<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(head(by_confidence))

##     lhs                                rhs      support     confidence
## [1] {Farm House, Toast}             => {Coffee} 0.001098039 1.0       
## [2] {Cake, Hot chocolate, Sandwich} => {Coffee} 0.001254902 1.0       
## [3] {Bread, Medialuna, Sandwich}    => {Coffee} 0.001098039 1.0       
## [4] {Hearty & Seasonal, Sandwich}   => {Coffee} 0.001411765 0.9       
## [5] {Pastry, Sandwich}              => {Coffee} 0.001411765 0.9       
## [6] {Cake, Sandwich, Tea}           => {Coffee} 0.001411765 0.9       
##     coverage    lift     count
## [1] 0.001098039 2.034791 7    
## [2] 0.001254902 2.034791 8    
## [3] 0.001098039 2.034791 7    
## [4] 0.001568627 1.831312 9    
## [5] 0.001568627 1.831312 9    
## [6] 0.001568627 1.831312 9

inspect(head(rules, n = 100, by = "confidence"))

##      lhs                                rhs      support     confidence
## [1]  {Farm House, Toast}             => {Coffee} 0.001098039 1.0000000 
## [2]  {Cake, Hot chocolate, Sandwich} => {Coffee} 0.001254902 1.0000000 
## [3]  {Bread, Medialuna, Sandwich}    => {Coffee} 0.001098039 1.0000000 
## [4]  {Hearty & Seasonal, Sandwich}   => {Coffee} 0.001411765 0.9000000 
## [5]  {Pastry, Sandwich}              => {Coffee} 0.001411765 0.9000000 
## [6]  {Cake, Sandwich, Tea}           => {Coffee} 0.001411765 0.9000000 
## [7]  {Cake, Vegan mincepie}          => {Coffee} 0.001098039 0.8750000 
## [8]  {Fudge, Sandwich}               => {Coffee} 0.001098039 0.8750000 
## [9]  {Cake, Toast}                   => {Coffee} 0.002196078 0.8750000 
## [10] {Cake, Sandwich, Soup}          => {Coffee} 0.001098039 0.8750000 
## [11] {Hot chocolate, Scone}          => {Coffee} 0.001882353 0.8571429 
## [12] {Cookies, Scone}                => {Coffee} 0.001882353 0.8571429 
## [13] {Salad, Sandwich}               => {Coffee} 0.001568627 0.8333333 
## [14] {Extra Salami or Feta}          => {Coffee} 0.003607843 0.8214286 
## [15] {Keeping It Local}              => {Coffee} 0.004392157 0.8000000 
## [16] {Keeping It Local, Tea}         => {Coffee} 0.001254902 0.8000000 
## [17] {Cake, Salad}                   => {Coffee} 0.001254902 0.8000000 
## [18] {Alfajores, Mineral water}      => {Coffee} 0.001254902 0.8000000 
## [19] {Juice, Spanish Brunch}         => {Coffee} 0.002509804 0.8000000 
## [20] {Pastry, Toast}                 => {Coffee} 0.001254902 0.8000000 
##      coverage    lift     count
## [1]  0.001098039 2.034791  7   
## [2]  0.001254902 2.034791  8   
## [3]  0.001098039 2.034791  7   
## [4]  0.001568627 1.831312  9   
## [5]  0.001568627 1.831312  9   
## [6]  0.001568627 1.831312  9   
## [7]  0.001254902 1.780442  7   
## [8]  0.001254902 1.780442  7   
## [9]  0.002509804 1.780442 14   
## [10] 0.001254902 1.780442  7   
## [11] 0.002196078 1.744107 12   
## [12] 0.002196078 1.744107 12   
## [13] 0.001882353 1.695659 10   
## [14] 0.004392157 1.671435 23   
## [15] 0.005490196 1.627833 28   
## [16] 0.001568627 1.627833  8   
## [17] 0.001568627 1.627833  8   
## [18] 0.001568627 1.627833  8   
## [19] 0.003137255 1.627833 16   
## [20] 0.001568627 1.627833  8

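Redundant Rules

In addition, we can identify and remove redundant rules from the generated association rules. A rule is redundant if a more general rule (one with the same right-hand side and a subset of its left-hand-side items) has the same or a higher confidence. The redundant rules are first identified using the ‘is.redundant’ function: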

redundant_rules <- is.redundant(rules)
redundant_rules

##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

summary(redundant_rules)

##    Mode   FALSE    TRUE 
## logical      19       1

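TRUE indicates a redundant rule, while FALSE indicates a non-redundant rule. The summary shows that there is one redundant rule (rule 4, {Keeping It Local, Tea} => {Coffee}, which has the same confidence as the more general {Keeping It Local} => {Coffee}). It can be removed as shown below: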

rules <- rules [!redundant_rules]
rules

## set of 19 rules
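
The check behind the removed rule can be sketched by hand. The snippet below is a simplified illustration of the confidence-based criterion, not the internal code of is.redundant; the confidence values come from the rule listing earlier in this paper and the variable names are illustrative.

```r
# Confidence values taken from the rule listing in this report.
conf_general  <- 0.8  # {Keeping It Local}      => {Coffee}
conf_specific <- 0.8  # {Keeping It Local, Tea} => {Coffee}

# The longer rule is redundant when a rule with the same rhs and a subset of
# its lhs already has equal or higher confidence (improvement <= 0).
improvement  <- conf_specific - conf_general
is_redundant <- improvement <= 0
is_redundant
```

Adding Tea to the left-hand side does not raise the confidence, so the longer rule carries no extra predictive information.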

Target product analysis.

Next, we can target a single product to analyse the goods bought alongside it, in other words, what else customers buy if they buy cake. In the appearance argument, lhs = "Cake" restricts Cake to the left-hand side (lhs) of the rules, and default = "rhs" allows every other item to appear only on the right-hand side (rhs).

rules_cake <- apriori(Bakery01, parameter = list(supp=0.001, conf = 0.2), appearance = list(default="rhs", lhs = "Cake"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6375 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(rules_cake[1:5])

##     lhs       rhs      support    confidence coverage  lift      count
## [1] {}     => {Bread}  0.33317647 0.3331765  1.0000000 1.0000000 2124 
## [2] {}     => {Coffee} 0.49145098 0.4914510  1.0000000 1.0000000 3133 
## [3] {Cake} => {Tea}    0.02760784 0.2561863  0.1077647 1.7467249  176 
## [4] {Cake} => {Bread}  0.02478431 0.2299854  0.1077647 0.6902812  158 
## [5] {Cake} => {Coffee} 0.05882353 0.5458515  0.1077647 1.1106937  375

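In the above example, the confidence threshold was lowered to 0.2 in order to generate association rules for the product (Cake), while the support threshold stayed at 0.001. The first two rules have an empty left-hand side and simply report the overall frequency of Bread and Coffee. According to the remaining rules, Cake buyers are more likely than average to buy Tea (lift 1.75) and Coffee (lift 1.11), but less likely to buy Bread (lift 0.69). This analysis can be repeated for each product in the dataset, and the appearance argument can be adjusted to fix the target product on either the left-hand side (lhs) or the right-hand side (rhs).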
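
Rules with an empty left-hand side, such as the first two rows of the output above, can be excluded by requiring at least two items per rule. The following is a minimal sketch on a small made-up set of baskets (not the bakery data), assuming the arules package is installed; the object names baskets, toy and toy_rules are illustrative.

```r
library(arules)

# Tiny made-up baskets, for illustration only (not the bakery data).
baskets <- list(
  c("Cake", "Coffee"),
  c("Cake", "Tea"),
  c("Cake", "Coffee", "Bread"),
  c("Bread", "Coffee"),
  c("Coffee")
)
toy <- as(baskets, "transactions")

# minlen = 2 requires at least one item on each side of the rule, so rules
# with an empty lhs such as {} => {Coffee} are no longer returned.
toy_rules <- apriori(
  toy,
  parameter  = list(supp = 0.2, conf = 0.2, minlen = 2),
  appearance = list(default = "rhs", lhs = "Cake"),
  control    = list(verbose = FALSE)
)
inspect(toy_rules)
```

The same minlen = 2 setting could be added to the parameter list of the bakery call; the thresholds used here are chosen only to suit the toy data.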

Visualization of Association Rules.

The rules generated can be visualized using the arulesViz package. This is illustrated below.

# The graph method is provided by arulesViz (loaded earlier); no extra package is needed.
plot(rules, method="graph")

# The engine argument replaces the deprecated 'interactive' argument.
plot(rules, method="graph", engine="htmlwidget")

The first graph illustrates the products and their dependencies, and it is evident that there is a strong association between “Keeping It Local”, Hot chocolate, Coffee and Extra Salami or Feta. The interactive graph also illustrates the most important associations, in green, which are also the most frequent combinations. Because the graph is interactive, the products can be dragged to better visualize the relationships.

Conclusion

From this study, it can be concluded that coffee is the best-selling product, and the shop owner may use this as a marketing tool when customers purchase other products. Furthermore, the business owner may opt to bundle coffee with other products in order to boost sales. Lastly, the shop owner could decide to stop selling products that generate comparatively few sales, such as Hearty & Seasonal.

References

https://www.kaggle.com/mittalvasu95/the-bread-basket

https://stackoverflow.com/