Introduction

Association rules are ‘if-then’ statements that help uncover relationships between data items in large data sets and different types of databases. Association rule mining is employed to discover the correlations, patterns and associations between items or item sets.

In this project, association rules are explained using a data set from the retail industry. Applying association rule mining as a machine learning technique in retail has distinct advantages. Retailers can collect data about purchasing patterns and look for co-occurrences that indicate which products are most likely to be purchased together. The retailer can then use this information to adjust sales and marketing strategies.

Data

The data used in this project was obtained from Kaggle.

# Read the raw purchases file; every column is treated as a factor
mydata <- read.csv("C:\\Users\\cynar\\Desktop\\school\\Semester 1\\unsupervised learning\\purchases\\dataset.csv",
                   header = FALSE, colClasses = "factor")
# Preview the first rows of the data frame that was just loaded
head(mydata)

library(arules)

We can observe that the data set in use has 1499 observations in total and 14 variables that will be used to extract rules.

Creating rules

rules<-apriori(mydata)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 149 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[551 item(s), 1499 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [144 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 144 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5  6 
## 19 46 49 25  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    3.66    4.00    6.00 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1328   Min.   :0.8630   Min.   :0.1328   Min.   :3.295  
##  1st Qu.:0.1328   1st Qu.:1.0000   1st Qu.:0.1328   1st Qu.:3.785  
##  Median :0.1328   Median :1.0000   Median :0.1328   Median :4.370  
##  Mean   :0.1567   Mean   :0.9863   Mean   :0.1598   Mean   :4.315  
##  3rd Qu.:0.1721   3rd Qu.:1.0000   3rd Qu.:0.1721   3rd Qu.:5.064  
##  Max.   :0.2642   Max.   :1.0000   Max.   :0.3035   Max.   :5.810  
##      count      
##  Min.   :199.0  
##  1st Qu.:199.0  
##  Median :199.0  
##  Mean   :234.9  
##  3rd Qu.:258.0  
##  Max.   :396.0  
## 
## mining info:
##    data ntransactions support confidence
##  mydata          1499     0.1        0.8

A set of 144 rules was created from this data set using the default parameter specifications, which in this case are: confidence 0.8, support 0.1, minimum length 1 and maximum length 10. The rule length distribution shows 19 rules with 2 items, 46 rules with 3 items, 49 rules with 4 items, 25 rules with 5 items and 5 rules with 6 items. The summary of quality measures describes the spread of each measure across the 144 rules: support ranges from 0.1328 to 0.2642, confidence from 0.8630 to 1, lift from 3.295 to 5.810 and count from 199 to 396. Only 6 of the 551 items pass the support threshold, and the counts 199 and 396 match the number of blank entries in columns V9 and V13 of the summary(mydata) output shown later, which suggests that these rules are driven by empty cells rather than by products.

trans<-read.transactions("C:\\Users\\cynar\\Desktop\\school\\Semester 1\\unsupervised learning\\purchases\\dataset.csv", format = "single", sep=",", cols = c(2,3))
summary(trans)

## transactions as itemMatrix in sparse format with
##  38 rows (elements/itemsets/transactions) and
##  38 columns (items) and a density of 0.6405817 
## 
## most frequent items:
##       vegetables             soda            mixes    aluminum foil 
##               34               32               30               28 
##  spaghetti sauce          (Other) 
##               28              773 
## 
## element (itemset/transaction) length distribution:
## sizes
## 19 21 22 23 24 25 26 27 28 29 30 33 
##  1  4  6  5  6  6  2  3  2  1  1  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   22.00   24.00   24.34   25.75   33.00 
## 
## includes extended item information - examples:
##           labels
## 1   all- purpose
## 2  aluminum foil
## 3         bagels
## 
## includes extended transaction information - examples:
##    transactionID
## 1   all- purpose
## 2  aluminum foil
## 3         bagels

For the rules to be meaningful, the entries have to be converted into transactions. The summary of the transactions lists the most frequent items in the data set: vegetables have the highest frequency, followed by soda, mixes, aluminum foil and spaghetti sauce.

summary(mydata)

##                  V1                    V2                    V3      
##   vegetables      : 108    vegetables   :  98    vegetables   :  95  
##   bagels          :  53    ice cream    :  51    soda         :  50  
##   poultry         :  46    sandwich bags:  51    milk         :  46  
##   ketchup         :  44    poultry      :  44    soap         :  45  
##   individual meals:  43    beef         :  43    aluminum foil:  44  
##   shampoo         :  43    cheeses      :  43    coffee/tea   :  44  
##  (Other)          :1162   (Other)       :1169   (Other)       :1175  
##            V4                       V5                 V6      
##   vegetables: 140    vegetables      : 113    vegetables:  97  
##   lunch meat:  49                    :  51              :  51  
##   soap      :  49    tortillas       :  50    bagels    :  46  
##   fruits    :  48    beef            :  47    poultry   :  45  
##   bagels    :  47    individual meals:  44    cereals   :  44  
##   mixes     :  47    pasta           :  43    ice cream :  44  
##  (Other)    :1119   (Other)          :1151   (Other)    :1172  
##              V7                 V8                    V9      
##   vegetables  :  95              : 143                 : 199  
##               :  88    vegetables: 103    vegetables   :  89  
##               :  55              :  56                 :  59  
##   mixes       :  52    yogurt    :  43    beef         :  43  
##   cereals     :  47    butter    :  41    poultry      :  41  
##   all- purpose:  43    coffee/tea:  39    aluminum foil:  38  
##  (Other)      :1119   (Other)    :1074   (Other)       :1030  
##           V10               V11               V12                 V13     
##             :258              :296              :343                :396  
##   vegetables: 98    vegetables: 78    vegetables: 90    vegetables  : 77  
##   eggs      : 44              : 47              : 53                : 59  
##   waffles   : 43    pasta     : 38    juice     : 38    toilet paper: 42  
##   pasta     : 41    bagels    : 36    soda      : 38    cheeses     : 36  
##   ice cream : 40    ketchup   : 36    butter    : 36    mixes       : 36  
##  (Other)    :975   (Other)    :968   (Other)    :901   (Other)      :853  
##             V14     
##               :455  
##   vegetables  : 81  
##               : 41  
##   dinner rolls: 34  
##   shampoo     : 33  
##   eggs        : 31  
##  (Other)      :824

library(arulesViz)
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")

head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=40)

##                    vegetables                          soda 
##                            34                            32 
##                         mixes                 aluminum foil 
##                            30                            28 
##               spaghetti sauce                       waffles 
##                            28                            28 
##                    coffee/tea              individual meals 
##                            27                            27 
##                       ketchup                          beef 
##                            27                            26 
##                         flour                         juice 
##                            26                            26 
##                          soap                        yogurt 
##                            26                            26 
##                     ice cream                    lunch meat 
##                            25                            25 
##                          milk                          pork 
##                            25                            25 
##                       poultry                       cereals 
##                            25                            24 
##  dishwashing liquid/detergent                         pasta 
##                            24                            24 
##               sandwich loaves                       shampoo 
##                            24                            24 
##                         sugar                       cheeses 
##                            23                            22 
##                  dinner rolls                          eggs 
##                            22                            22 
##             laundry detergent                 sandwich bags 
##                            22                            22 
##                  toilet paper                  all- purpose 
##                            22                            20 
##                        bagels                        butter 
##                            20                            20 
##                  paper towels                        fruits 
##                            20                            19 
##                     hand soap                     tortillas 
##                            18                            17

According to the summary of the data, vegetables have the highest frequency and appear near the top of every column. This can be visualized with an item frequency plot of the transactions. In the plot, vegetables have the highest frequency at 34, followed by soda and mixes at 32 and 30 respectively. The remaining items have fairly similar frequencies. The plot is supported by the numerical output, which shows that vegetables appeared 34 times, soda 32 times, and so forth.
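The counting behind an item frequency table can be sketched in base R. The baskets below are made up for illustration (only the item names are borrowed from the data set), so the numbers are not those of the project data.

```r
# Hypothetical toy baskets; not the project data.
baskets <- list(
  c("vegetables", "soda", "mixes"),
  c("vegetables", "soda"),
  c("vegetables", "aluminum foil"),
  c("soda", "mixes")
)

# Count each item at most once per basket, then sort by frequency.
item_counts <- sort(table(unlist(lapply(baskets, unique))), decreasing = TRUE)
print(item_counts)

# Relative frequency: the share of baskets that contain the item.
item_share <- item_counts / length(baskets)
print(round(item_share, 2))
```

The absolute counts correspond to type="absolute" in itemFrequency, and the shares to the relative type.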

Reducing number of rules

rules <-eclat(trans, parameter=list(supp=0.65, maxlen = 6))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.65      1      6 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 24 
## 
## create itemset ... 
## set transactions ...[38 item(s), 38 transaction(s)] done [0.00s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating bit matrix ... [19 row(s), 38 column(s)] done [0.00s].
## writing  ... [29 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(rules)

##      items                           support   transIdenticalToItemsets count
## [1]  { soap, vegetables}             0.6578947 25                       25   
## [2]  { juice, vegetables}            0.6578947 25                       25   
## [3]  { spaghetti sauce, vegetables}  0.6578947 25                       25   
## [4]  { individual meals, vegetables} 0.6578947 25                       25   
## [5]  { aluminum foil, vegetables}    0.6578947 25                       25   
## [6]  { aluminum foil, soda}          0.6842105 26                       26   
## [7]  { vegetables, waffles}          0.6578947 25                       25   
## [8]  { mixes, vegetables}            0.7105263 27                       27   
## [9]  { mixes, soda}                  0.6842105 26                       26   
## [10] { soda, vegetables}             0.7368421 28                       28   
## [11] { vegetables}                   0.8947368 34                       34   
## [12] { soda}                         0.8421053 32                       32   
## [13] { mixes}                        0.7894737 30                       30   
## [14] { waffles}                      0.7368421 28                       28   
## [15] { aluminum foil}                0.7368421 28                       28   
## [16] { individual meals}             0.7105263 27                       27   
## [17] { spaghetti sauce}              0.7368421 28                       28   
## [18] { coffee/tea}                   0.7105263 27                       27   
## [19] { juice}                        0.6842105 26                       26   
## [20] { ketchup}                      0.7105263 27                       27   
## [21] { flour}                        0.6842105 26                       26   
## [22] { soap}                         0.6842105 26                       26   
## [23] { yogurt}                       0.6842105 26                       26   
## [24] { beef}                         0.6842105 26                       26   
## [25] { milk}                         0.6578947 25                       25   
## [26] { pork}                         0.6578947 25                       25   
## [27] { ice cream}                    0.6578947 25                       25   
## [28] { poultry}                      0.6578947 25                       25   
## [29] { lunch meat}                   0.6578947 25                       25

To keep the results manageable, we reduce their number by imposing parameter restrictions. We restrict the maximum length to 6 and raise the minimum support to 0.65, which leaves 29 item sets. Note that eclat returns frequent item sets rather than rules; the rules are derived from them in the next step. Inspecting the result displays the support of each item set and its count, which is the number of baskets in which the items were found together. Support is reported because it estimates the probability of the items in a set occurring together.
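The idea behind eclat, intersecting the lists of transaction ids in which each item occurs, can be sketched in base R. This is a brute-force illustration for pairs only, on made-up baskets and with an assumed threshold of 0.5; it is not how arules implements the algorithm.

```r
# Hypothetical toy baskets; not the project data.
baskets <- list(
  c("vegetables", "soda", "mixes"),
  c("vegetables", "soda"),
  c("vegetables", "mixes"),
  c("soda", "aluminum foil")
)
n <- length(baskets)
min_support <- 0.5  # assumed threshold for the toy example

# Vertical layout: for each item, the ids of the baskets that contain it.
items <- sort(unique(unlist(baskets)))
tid_lists <- lapply(items, function(it) {
  which(vapply(baskets, function(b) it %in% b, logical(1)))
})
names(tid_lists) <- items

# Support of a pair = size of the intersection of the two tid-lists / n.
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- vapply(pairs, function(p) {
  length(intersect(tid_lists[[p[1]]], tid_lists[[p[2]]])) / n
}, numeric(1))
names(pair_support) <- vapply(pairs, paste, character(1), collapse = ", ")

# Keep only the pairs that reach the minimum support.
frequent_pairs <- pair_support[pair_support >= min_support]
print(frequent_pairs)
```

Raising min_support shrinks the result in the same way that supp=0.65 does in the eclat call.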

freq_rules<-ruleInduction(rules, trans, confidence=0.5)
inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE),10))

##      lhs                    rhs           support   confidence lift     
## [1]  { soap}             => { vegetables} 0.6578947 0.9615385  1.0746606
## [2]  { juice}            => { vegetables} 0.6578947 0.9615385  1.0746606
## [3]  { aluminum foil}    => { soda}       0.6842105 0.9285714  1.1026786
## [4]  { individual meals} => { vegetables} 0.6578947 0.9259259  1.0348584
## [5]  { mixes}            => { vegetables} 0.7105263 0.9000000  1.0058824
## [6]  { spaghetti sauce}  => { vegetables} 0.6578947 0.8928571  0.9978992
## [7]  { aluminum foil}    => { vegetables} 0.6578947 0.8928571  0.9978992
## [8]  { waffles}          => { vegetables} 0.6578947 0.8928571  0.9978992
## [9]  { soda}             => { vegetables} 0.7368421 0.8750000  0.9779412
## [10] { mixes}            => { soda}       0.6842105 0.8666667  1.0291667
##      itemset
## [1]   1     
## [2]   2     
## [3]   6     
## [4]   4     
## [5]   8     
## [6]   3     
## [7]   5     
## [8]   7     
## [9]  10     
## [10]  9

Rules can be further sorted and analyzed using the confidence and support values to identify the most important relationships. Confidence is the proportion of transactions containing the left-hand side (the ‘if’ part) that also contain the right-hand side (the ‘then’ part). In this case, the rules soap => vegetables and juice => vegetables have the highest confidence (0.9615).
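As a worked check, confidence can be written as a ratio of supports. The notation conf and supp is introduced here for illustration; the counts 25 and 26 come from the eclat output above, with 38 transactions in total.

```latex
\[
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)},
\qquad
\mathrm{conf}(\{\text{soap}\} \Rightarrow \{\text{vegetables}\})
= \frac{25/38}{26/38} = \frac{25}{26} \approx 0.9615
\]
```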

inspect(head(sort(freq_rules, by = "support", decreasing = TRUE), 10))

##      lhs                 rhs              support   confidence lift     
## [1]  { vegetables}    => { soda}          0.7368421 0.8235294  0.9779412
## [2]  { soda}          => { vegetables}    0.7368421 0.8750000  0.9779412
## [3]  { vegetables}    => { mixes}         0.7105263 0.7941176  1.0058824
## [4]  { mixes}         => { vegetables}    0.7105263 0.9000000  1.0058824
## [5]  { soda}          => { aluminum foil} 0.6842105 0.8125000  1.1026786
## [6]  { aluminum foil} => { soda}          0.6842105 0.9285714  1.1026786
## [7]  { soda}          => { mixes}         0.6842105 0.8125000  1.0291667
## [8]  { mixes}         => { soda}          0.6842105 0.8666667  1.0291667
## [9]  { vegetables}    => { soap}          0.6578947 0.7352941  1.0746606
## [10] { soap}          => { vegetables}    0.6578947 0.9615385  1.0746606
##      itemset
## [1]  10     
## [2]  10     
## [3]   8     
## [4]   8     
## [5]   6     
## [6]   6     
## [7]   9     
## [8]   9     
## [9]   1     
## [10]  1

Support signifies how frequently an item set appears in the data set, that is, the probability that a transaction contains it. In this case, the combination of vegetables and soda has the highest probability of occurring together since it has the highest support value (0.7368).
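Written out, with T denoting the set of transactions (a symbol assumed here for illustration) and the count of 28 taken from the eclat output above:

```latex
\[
\mathrm{supp}(X) = \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert},
\qquad
\mathrm{supp}(\{\text{soda}, \text{vegetables}\}) = \frac{28}{38} \approx 0.7368
\]
```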

inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))

##      lhs                    rhs                 support   confidence lift    
## [1]  { soda}             => { aluminum foil}    0.6842105 0.8125000  1.102679
## [2]  { aluminum foil}    => { soda}             0.6842105 0.9285714  1.102679
## [3]  { vegetables}       => { soap}             0.6578947 0.7352941  1.074661
## [4]  { vegetables}       => { juice}            0.6578947 0.7352941  1.074661
## [5]  { soap}             => { vegetables}       0.6578947 0.9615385  1.074661
## [6]  { juice}            => { vegetables}       0.6578947 0.9615385  1.074661
## [7]  { vegetables}       => { individual meals} 0.6578947 0.7352941  1.034858
## [8]  { individual meals} => { vegetables}       0.6578947 0.9259259  1.034858
## [9]  { soda}             => { mixes}            0.6842105 0.8125000  1.029167
## [10] { mixes}            => { soda}             0.6842105 0.8666667  1.029167
##      itemset
## [1]  6      
## [2]  6      
## [3]  1      
## [4]  2      
## [5]  1      
## [6]  2      
## [7]  4      
## [8]  4      
## [9]  9      
## [10] 9

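Lift compares the confidence of a rule with its expected confidence, which is the support of the right-hand side alone. A lift above 1 means the items occur together more often than would be expected if they were independent, a lift of 1 indicates independence, and a lift below 1 means they occur together less often than expected. Here all lift values lie between about 0.98 and 1.10, so the associations are weak.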
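The three measures can be recomputed by hand for the rule aluminum foil => soda. The sketch below uses only the counts reported in the eclat output above; the variable names are chosen for illustration.

```r
# Counts taken from the eclat output above.
n_trans    <- 38  # number of transactions
count_foil <- 28  # { aluminum foil}
count_soda <- 32  # { soda}
count_both <- 26  # { aluminum foil, soda}

support    <- count_both / n_trans     # share of baskets with both items
confidence <- count_both / count_foil  # share of foil baskets that also hold soda
expected   <- count_soda / n_trans     # expected confidence = support of soda
lift       <- confidence / expected

print(round(c(support = support, confidence = confidence, lift = lift), 7))
```

The results agree with the values reported by inspect for this rule (0.6842105, 0.9285714 and 1.1026786).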

Visualizing the rules

library(arulesViz)
plot(freq_rules, method="grouped")

The grouped matrix shows the support of each rule by the size of the bubble and the lift by its colour. Soda and aluminum foil have the highest lift according to this matrix, as shown by the darker shade. Vegetables and soda have the highest support, as shown by the largest bubble. The general trend is that the larger the support of a combination of items, the smaller its lift, and vice versa.

# Scatter plot of support against confidence, shaded by lift (non-interactive by default)
plot(freq_rules, measure = c("support", "confidence"), shading = "lift", jitter = 0)

We can visualize the relationship between the lift, the confidence and the support of the rules. In the scatter plot, most rules with high lift have low to moderate support. The rule with the highest confidence has the lowest support and a high lift. The rules with the lowest lift generally have high support and low to moderate confidence. When judging the importance of a rule, confidence and support are considered the most: support estimates the probability of a transaction containing both A and B, whereas confidence measures how reliably B follows from A.

# Draw the rules as a graph linking items and rules
plot(freq_rules, method = "graph")

Since vegetables are the most frequently purchased item in this data set, they appear in almost all the rules. This graph shows support as size, in the range 0.658 to 0.737, and lift as colour, in the range 0.978 to 1.103. We can observe that vegetables are purchased with soda at a high probability of about 0.74. Vegetables are also purchased with mixes at a slightly lower probability of about 0.71, which is why that bubble is slightly smaller than the vegetables-soda bubble.

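# Rules with soda as the consequent; the item labels carry a leading space in this data set
soda <- apriori(data = trans, parameter = list(supp = 0.65, conf = 0.5),
                appearance = list(default = "lhs", rhs = " soda"), control = list(verbose = FALSE))
inspect(sort(soda, by = "lift"))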

##     lhs                 rhs     support   confidence coverage  lift      count
## [1] { aluminum foil} => { soda} 0.6842105 0.9285714  0.7368421 1.1026786 26   
## [2] { mixes}         => { soda} 0.6842105 0.8666667  0.7894737 1.0291667 26   
## [3] {}               => { soda} 0.8421053 0.8421053  1.0000000 1.0000000 32   
## [4] { vegetables}    => { soda} 0.7368421 0.8235294  0.8947368 0.9779412 28

# Draw the soda rules as a graph
plot(soda, method = "graph")

We can inspect other rules by focusing on items with high counts, such as soda. In this case, there are 4 rules that have soda as the consequent, and the rule with the highest confidence is aluminum foil => soda. According to the graph, which only shows support and lift, this rule has the highest lift and shares the lowest support (0.684) with mixes => soda. The rule with the highest support (0.842) has an empty left-hand side and simply reflects how often soda appears on its own, so its lift is exactly 1. This is also confirmed by the graph.

# Rules with mixes as the consequent
mixes <- apriori(data = trans, parameter = list(supp = 0.65, conf = 0.5),
                 appearance = list(default = "lhs", rhs = " mixes"), control = list(verbose = FALSE))
inspect(sort(mixes, by = "lift"))

##     lhs              rhs      support   confidence coverage  lift     count
## [1] { soda}       => { mixes} 0.6842105 0.8125000  0.8421053 1.029167 26   
## [2] { vegetables} => { mixes} 0.7105263 0.7941176  0.8947368 1.005882 27   
## [3] {}            => { mixes} 0.7894737 0.7894737  1.0000000 1.000000 30

# Draw the mixes rules as a graph
plot(mixes, method = "graph")

Only three rules have ‘mixes’ as the consequent. Of these, the rule { soda} => { mixes} has the highest confidence and the lowest support.

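# mine rules whose right-hand side is " waffles"
# (the item labels in this data set carry a leading space)
waffles <- apriori(data = trans,
                   parameter = list(support = 0.6, confidence = 0.5),
                   appearance = list(default = "lhs", rhs = " waffles"),
                   control = list(verbose = FALSE))
inspect(sort(waffles, by = "lift"))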

##     lhs              rhs        support   confidence coverage  lift      count
## [1] { mixes}      => { waffles} 0.6052632 0.7666667  0.7894737 1.0404762 23   
## [2] {}            => { waffles} 0.7368421 0.7368421  1.0000000 1.0000000 28   
## [3] { vegetables} => { waffles} 0.6578947 0.7352941  0.8947368 0.9978992 25   
## [4] { soda}       => { waffles} 0.6052632 0.7187500  0.8421053 0.9754464 23

plot(waffles, method = "graph")

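Only four rules have ‘waffles’ as the consequent, and one of them has an empty antecedent, which simply reflects that waffles appear in about 74% of all transactions. The results show that vegetables and waffles occur together in about 66% of transactions (support 0.658), and that a customer who buys vegetables also buys waffles with a probability of about 0.74 (confidence 0.735). All four lift values are close to 1, so none of these antecedents changes the likelihood of buying waffles by much.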
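The measures reported for the rule { vegetables} => { waffles} can be recomputed by hand from the counts in the output above. The sketch below is only an illustration: the total of 38 transactions is not stated in the output and is inferred here from the count and support columns (28 / 0.7368421), and the variable names are made up for this example.

```r
# Counts read from the inspect() output; n_total is inferred, not stated in the source.
n_total      <- 38  # assumed number of transactions (28 / 0.7368421)
n_vegetables <- 34  # transactions containing vegetables (coverage 0.8947368 * 38)
n_waffles    <- 28  # transactions containing waffles (count of the {} => waffles rule)
n_both       <- 25  # transactions containing both items (count of the rule)

support    <- n_both / n_total                  # P(vegetables and waffles)
confidence <- n_both / n_vegetables             # P(waffles | vegetables)
lift       <- confidence / (n_waffles / n_total)  # confidence relative to P(waffles)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

Support is the share of all baskets containing both items, whereas confidence is conditional on vegetables being in the basket, which is why the two values differ.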

Dissimilarity measures

Dissimilarity is a numerical measure of how different two data items are. When the dissimilarity is low, the items under observation are similar; when it is high, they are different.

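To measure dissimilarity, the Jaccard distance is used. It is one minus the Jaccard index, where A and B are the sets of transactions that contain each of the two items:

d(A,B) = 1 − J(A,B) = 1 − |A ∩ B| / |A ∪ B|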
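As a small illustration of how the Jaccard index turns into a dissimilarity, the sketch below works through two items by hand. The transaction IDs are hypothetical toy data, not taken from the data set, and the variable names are chosen only for this example.

```r
# Hypothetical transaction IDs in which each item appears (toy data).
tx_soda       <- c(1, 2, 3, 5, 8)
tx_vegetables <- c(2, 3, 5, 7, 8, 9)

# Jaccard index: shared transactions divided by transactions containing either item.
jaccard_index <- length(intersect(tx_soda, tx_vegetables)) /
  length(union(tx_soda, tx_vegetables))

# Jaccard distance (dissimilarity) is the complement of the index.
jaccard_distance <- 1 - jaccard_index

round(c(index = jaccard_index, distance = jaccard_distance), 3)
```

Here the items share 4 of the 7 transactions in which either appears, so the index is 4/7 and the distance is 3/7.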

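# keep only the items that occur in more than 70% of transactions
trans.sel <- trans[, itemFrequency(trans) > 0.7]
# pairwise dissimilarity between items (Jaccard by default)
jac <- dissimilarity(trans.sel, which = "items")
round(jac, digits = 3)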

##                    aluminum foil  coffee/tea  individual meals  ketchup  mixes
##  coffee/tea                0.472                                              
##  individual meals          0.333       0.457                                  
##  ketchup                   0.429       0.500             0.457                
##  mixes                     0.389       0.459             0.371    0.500       
##  soda                      0.235       0.361             0.361    0.405  0.278
##  spaghetti sauce           0.486       0.429             0.514    0.429  0.389
##  vegetables                0.324       0.395             0.306    0.351  0.270
##  waffles                   0.486       0.382             0.382    0.382  0.343
##                    soda  spaghetti sauce  vegetables
##  coffee/tea                                         
##  individual meals                                   
##  ketchup                                            
##  mixes                                              
##  soda                                               
##  spaghetti sauce  0.378                             
##  vegetables       0.263            0.324            
##  waffles          0.378            0.400       0.324

plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")

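The dissimilarity values are fairly low overall, with none above 0.52. The most dissimilar pairs are individual meals & spaghetti sauce (0.514), coffee/tea & ketchup and ketchup & mixes (0.500 each), aluminum foil & spaghetti sauce and aluminum foil & waffles (0.486 each), and aluminum foil & coffee/tea (0.472). The dendrogram clusters the items hierarchically on these distances: items that merge at a low height are similar, while items that only join near the top are the most dissimilar.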
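To see how the clustering uses these distances, the sketch below rebuilds a small distance object from four of the items in the table above and clusters it with the same Ward method. Restricting to four items and the object names are choices made only for this illustration; the distances are copied from the output.

```r
items <- c("aluminum foil", "soda", "vegetables", "waffles")

# Fill the lower triangle column by column with values from the Jaccard table.
d <- matrix(0, 4, 4, dimnames = list(items, items))
d[lower.tri(d)] <- c(0.235, 0.324, 0.486,  # aluminum foil vs soda, vegetables, waffles
                     0.263, 0.378,         # soda vs vegetables, waffles
                     0.324)                # vegetables vs waffles
d <- as.dist(d)

hc <- hclust(d, method = "ward.D2")

hc$merge[1, ]      # the two items joined first (negative numbers index single items)
hc$height[1]       # the distance at which they were joined
cutree(hc, k = 2)  # cluster membership when the tree is cut into two groups
```

The first merge joins the pair with the smallest dissimilarity, here aluminum foil and soda at 0.235, which is the lowest branch point a dendrogram of these four items would show.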

Affinity measure

Affinity is a numerical measure of the similarity between items in a data set. The higher the affinity value of two products, the more often they are bought together relative to how often either of them is bought.

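It is calculated as:

A(i,j) = supp(i,j) / (supp(i) + supp(j) − supp(i,j))

This is the Jaccard index of the two items, so each affinity value equals one minus the corresponding Jaccard dissimilarity reported in the previous section.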
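The formula can be checked by hand on a small example. The sketch below uses a hypothetical incidence matrix (rows are transactions, columns are items) rather than the project data, and computes the affinity directly from the supports; the object names are made up for this illustration.

```r
# Toy incidence matrix (hypothetical data): 1 means the item is in the transaction.
m <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1,
              0, 1, 0,
              1, 1, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("soda", "mixes", "waffles")))

n       <- nrow(m)
supp_ij <- crossprod(m) / n   # joint supports; the diagonal holds supp(i)
supp_i  <- diag(supp_ij)

# A(i,j) = supp(i,j) / (supp(i) + supp(j) - supp(i,j))
affinity_manual <- supp_ij / (outer(supp_i, supp_i, "+") - supp_ij)

round(affinity_manual, 3)
```

In this manual version the diagonal comes out as 1, because an item is identical to itself, whereas the affinity() output in this report shows 0 on the diagonal; only the off-diagonal values are meaningful for comparing pairs.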

a = affinity(trans.sel)
round(a, digits=3)

## An object of class "ar_similarity"
##                    aluminum foil  coffee/tea  individual meals  ketchup  mixes
##  aluminum foil             0.000       0.528             0.667    0.571  0.611
##  coffee/tea                0.528       0.000             0.543    0.500  0.541
##  individual meals          0.667       0.543             0.000    0.543  0.629
##  ketchup                   0.571       0.500             0.543    0.000  0.500
##  mixes                     0.611       0.541             0.629    0.500  0.000
##  soda                      0.765       0.639             0.639    0.595  0.722
##  spaghetti sauce           0.514       0.571             0.486    0.571  0.611
##  vegetables                0.676       0.605             0.694    0.649  0.730
##  waffles                   0.514       0.618             0.618    0.618  0.657
##                    soda  spaghetti sauce  vegetables  waffles
##  aluminum foil    0.765            0.514       0.676    0.514
##  coffee/tea       0.639            0.571       0.605    0.618
##  individual meals 0.639            0.486       0.694    0.618
##  ketchup          0.595            0.571       0.649    0.618
##  mixes            0.722            0.611       0.730    0.657
##  soda             0.000            0.622       0.737    0.622
##  spaghetti sauce  0.622            0.000       0.676    0.600
##  vegetables       0.737            0.676       0.000    0.676
##  waffles          0.622            0.600       0.676    0.000
## Slot "method":
## [1] "Affinity"

Taking into account only affinity values above 0.7, the following pairs of items have a high probability of being purchased together:

Aluminum foil & Soda (0.765)
Vegetables & Soda (0.737)
Mixes & Vegetables (0.730)
Mixes & Soda (0.722)
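The same selection can be made programmatically instead of reading the matrix by eye. The sketch below is one possible way to do it in base R; it rebuilds only the four relevant items from the affinity output above, and the object names and the 0.7 cut-off variable are introduced just for this example.

```r
items <- c("aluminum foil", "mixes", "soda", "vegetables")

# Upper triangle filled column by column with values from the affinity output.
aff <- matrix(0, 4, 4, dimnames = list(items, items))
aff[upper.tri(aff)] <- c(0.611,                # aluminum foil - mixes
                         0.765, 0.722,         # aluminum foil - soda, mixes - soda
                         0.676, 0.730, 0.737)  # aluminum foil, mixes, soda - vegetables

threshold <- 0.7
idx <- which(aff > threshold, arr.ind = TRUE)  # row/column positions above the cut-off

high <- data.frame(item1    = items[idx[, "row"]],
                   item2    = items[idx[, "col"]],
                   affinity = aff[idx])
high <- high[order(-high$affinity), ]          # strongest pairs first
high
```

Filling only the upper triangle means each pair is listed once, rather than twice as in the full symmetric matrix.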

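par(mar = c(4, 8, 4, 4))   # widen the left margin for the item labels
image(a, axes = FALSE)     # draw the affinity matrix as a grid of shaded cells
# label both axes with the item names
axis(1, at = seq(0, 1, length.out = ncol(a)), labels = rownames(a), cex.axis = 0.5)
axis(2, at = seq(0, 1, length.out = ncol(a)), labels = rownames(a), cex.axis = 0.6, las = 2)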

The affinity results can also be plotted as a matrix. The darker the shade of a cell, the higher the affinity of that pair of items and hence the more likely the two items are to be bought together; the lighter the shade, the lower the affinity.

Association rules 1

Cynara Nyahoda

2/27/2021