Association rules are ‘if-then’ statements that help discover relationships between data items in large data sets and different types of databases. Association rule mining is employed to observe and discover the correlations, patterns and associations between data items or item sets.
In this project, association rules are explained using a data set from the retail industry. Applying association rules in retail as a machine learning technique has distinct advantages. Retailers can collect data about purchasing patterns and then look for co-occurrences that identify the products most likely to be purchased together. From the results, the retailer can adjust their sales and marketing strategies.
The data used in this project was obtained from Kaggle.
mydata <- read.csv("C:\\Users\\cynar\\Desktop\\school\\Semester 1\\unsupervised learning\\purchases\\dataset.csv",
header = F, colClasses = "factor")
head(mydata)
library(arules)
We can observe that the data set in use has 1499 observations in total and 14 variables that will be used to extract rules.
rules<-apriori(mydata)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 149
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[551 item(s), 1499 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [144 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 144 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 19 46 49 25 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 4.00 3.66 4.00 6.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1328 Min. :0.8630 Min. :0.1328 Min. :3.295
## 1st Qu.:0.1328 1st Qu.:1.0000 1st Qu.:0.1328 1st Qu.:3.785
## Median :0.1328 Median :1.0000 Median :0.1328 Median :4.370
## Mean :0.1567 Mean :0.9863 Mean :0.1598 Mean :4.315
## 3rd Qu.:0.1721 3rd Qu.:1.0000 3rd Qu.:0.1721 3rd Qu.:5.064
## Max. :0.2642 Max. :1.0000 Max. :0.3035 Max. :5.810
## count
## Min. :199.0
## 1st Qu.:199.0
## Median :199.0
## Mean :234.9
## 3rd Qu.:258.0
## Max. :396.0
##
## mining info:
## data ntransactions support confidence
## mydata 1499 0.1 0.8
A set of 144 rules was created from this data set using the default parameter specifications, which in this case are: confidence 0.8, support 0.1, minimum length 1 and maximum length 10. The rule length distribution shows 19 rules with 2 items, 46 rules with 3 items, 49 rules with 4 items, 25 rules with 5 items and 5 rules with 6 items. The summary of quality measures describes how each measure is distributed across the 144 rules. Support ranges from 0.1328 to 0.2642, so every rule clears the minimum support of 0.1. Likewise, confidence ranges from 0.8630 to 1, lift from 3.295 to 5.810, and count from 199 to 396.
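The support and count columns of the summary are two views of the same quantity: count is the number of rows that contain every item of a rule, and support is that count divided by the number of rows. A minimal sketch in base R, using only figures printed in the summary above (the variable names are illustrative and not part of arules):

```r
# Values copied from the summary(rules) output above
n_transactions <- 1499
min_count <- 199   # smallest count among the 144 rules
max_count <- 396   # largest count among the 144 rules

# support = count / number of transactions
min_support <- min_count / n_transactions
max_support <- max_count / n_transactions

# Rounded to 4 digits these give 0.1328 and 0.2642, the Min. and Max. of support
round(c(min = min_support, max = max_support), 4)
```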
trans<-read.transactions("C:\\Users\\cynar\\Desktop\\school\\Semester 1\\unsupervised learning\\purchases\\dataset.csv", format = "single", sep=",", cols = c(2,3))
summary(trans)
## transactions as itemMatrix in sparse format with
## 38 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.6405817
##
## most frequent items:
## vegetables soda mixes aluminum foil
## 34 32 30 28
## spaghetti sauce (Other)
## 28 773
##
## element (itemset/transaction) length distribution:
## sizes
## 19 21 22 23 24 25 26 27 28 29 30 33
## 1 4 6 5 6 6 2 3 2 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 22.00 24.00 24.34 25.75 33.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
##
## includes extended transaction information - examples:
## transactionID
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
For the rules to be meaningful, the entries have to be converted into transactions. The summary of the transactions lists the most frequent items in the data set: vegetables have the highest frequency, followed by soda, mixes, aluminum foil and spaghetti sauce.
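To make the idea of a transaction object concrete, the sketch below builds a small binary incidence matrix in base R and counts how often each item occurs. The baskets are hypothetical toy data, not rows of the project data set, and the object names are illustrative.

```r
# Hypothetical toy baskets (not taken from the project data)
baskets <- list(
  c("vegetables", "soda", "mixes"),
  c("vegetables", "soda"),
  c("vegetables", "aluminum foil"),
  c("soda", "mixes")
)

items <- sort(unique(unlist(baskets)))

# One row per basket, one column per item; TRUE if the basket contains the item
incidence <- t(sapply(baskets, function(b) items %in% b))
colnames(incidence) <- items

# Absolute item frequency: number of baskets containing each item
item_counts <- sort(colSums(incidence), decreasing = TRUE)
item_counts
```

In arules, a list of baskets like this one can be coerced with `as(baskets, "transactions")`, after which `itemFrequency()` reports the same counts.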
summary(mydata)
## V1 V2 V3
## vegetables : 108 vegetables : 98 vegetables : 95
## bagels : 53 ice cream : 51 soda : 50
## poultry : 46 sandwich bags: 51 milk : 46
## ketchup : 44 poultry : 44 soap : 45
## individual meals: 43 beef : 43 aluminum foil: 44
## shampoo : 43 cheeses : 43 coffee/tea : 44
## (Other) :1162 (Other) :1169 (Other) :1175
## V4 V5 V6
## vegetables: 140 vegetables : 113 vegetables: 97
## lunch meat: 49 : 51 : 51
## soap : 49 tortillas : 50 bagels : 46
## fruits : 48 beef : 47 poultry : 45
## bagels : 47 individual meals: 44 cereals : 44
## mixes : 47 pasta : 43 ice cream : 44
## (Other) :1119 (Other) :1151 (Other) :1172
## V7 V8 V9
## vegetables : 95 : 143 : 199
## : 88 vegetables: 103 vegetables : 89
## : 55 : 56 : 59
## mixes : 52 yogurt : 43 beef : 43
## cereals : 47 butter : 41 poultry : 41
## all- purpose: 43 coffee/tea: 39 aluminum foil: 38
## (Other) :1119 (Other) :1074 (Other) :1030
## V10 V11 V12 V13
## :258 :296 :343 :396
## vegetables: 98 vegetables: 78 vegetables: 90 vegetables : 77
## eggs : 44 : 47 : 53 : 59
## waffles : 43 pasta : 38 juice : 38 toilet paper: 42
## pasta : 41 bagels : 36 soda : 38 cheeses : 36
## ice cream : 40 ketchup : 36 butter : 36 mixes : 36
## (Other) :975 (Other) :968 (Other) :901 (Other) :853
## V14
## :455
## vegetables : 81
## : 41
## dinner rolls: 34
## shampoo : 33
## eggs : 31
## (Other) :824
library(arulesViz)
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=40)
## vegetables soda
## 34 32
## mixes aluminum foil
## 30 28
## spaghetti sauce waffles
## 28 28
## coffee/tea individual meals
## 27 27
## ketchup beef
## 27 26
## flour juice
## 26 26
## soap yogurt
## 26 26
## ice cream lunch meat
## 25 25
## milk pork
## 25 25
## poultry cereals
## 25 24
## dishwashing liquid/detergent pasta
## 24 24
## sandwich loaves shampoo
## 24 24
## sugar cheeses
## 23 22
## dinner rolls eggs
## 22 22
## laundry detergent sandwich bags
## 22 22
## toilet paper all- purpose
## 22 20
## bagels butter
## 20 20
## paper towels fruits
## 20 19
## hand soap tortillas
## 18 17
According to the summary of the data, vegetables have the highest frequency and appear among the most frequent values of every variable. This can be visualized with an item frequency plot of the transactions. In the plot, vegetables have the highest frequency at 34, followed by soda and mixes at 32 and 30 respectively. The rest of the top 10 items have similar frequencies, between 26 and 28. The numerical output confirms the plot: vegetables appear 34 times, soda 32 times, and so forth.
rules <-eclat(trans, parameter=list(supp=0.65, maxlen = 6))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.65 1 6 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 24
##
## create itemset ...
## set transactions ...[38 item(s), 38 transaction(s)] done [0.00s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating bit matrix ... [19 row(s), 38 column(s)] done [0.00s].
## writing ... [29 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(rules)
## items support transIdenticalToItemsets count
## [1] { soap, vegetables} 0.6578947 25 25
## [2] { juice, vegetables} 0.6578947 25 25
## [3] { spaghetti sauce, vegetables} 0.6578947 25 25
## [4] { individual meals, vegetables} 0.6578947 25 25
## [5] { aluminum foil, vegetables} 0.6578947 25 25
## [6] { aluminum foil, soda} 0.6842105 26 26
## [7] { vegetables, waffles} 0.6578947 25 25
## [8] { mixes, vegetables} 0.7105263 27 27
## [9] { mixes, soda} 0.6842105 26 26
## [10] { soda, vegetables} 0.7368421 28 28
## [11] { vegetables} 0.8947368 34 34
## [12] { soda} 0.8421053 32 32
## [13] { mixes} 0.7894737 30 30
## [14] { waffles} 0.7368421 28 28
## [15] { aluminum foil} 0.7368421 28 28
## [16] { individual meals} 0.7105263 27 27
## [17] { spaghetti sauce} 0.7368421 28 28
## [18] { coffee/tea} 0.7105263 27 27
## [19] { juice} 0.6842105 26 26
## [20] { ketchup} 0.7105263 27 27
## [21] { flour} 0.6842105 26 26
## [22] { soap} 0.6842105 26 26
## [23] { yogurt} 0.6842105 26 26
## [24] { beef} 0.6842105 26 26
## [25] { milk} 0.6578947 25 25
## [26] { pork} 0.6578947 25 25
## [27] { ice cream} 0.6578947 25 25
## [28] { poultry} 0.6578947 25 25
## [29] { lunch meat} 0.6578947 25 25
To keep the results manageable, we reduce their number by imposing parameter restrictions: the maximum length is restricted to 6 and the minimum support is raised to 0.65, which reduces the output to 29 frequent itemsets. Inspecting them shows the support of each itemset and its count, which is the number of baskets that contain all of its items. Eclat finds all itemsets that occur together frequently, which is why support appears in the results: it estimates the probability that the items of an itemset occur together in a transaction.
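Eclat works on a vertical layout: each item keeps the list of transaction IDs (a tid-list) in which it occurs, and the support of an itemset follows from intersecting those lists. The sketch below illustrates that idea in base R on hypothetical toy data; `tid_lists` and `itemset_support` are illustrative names and this is not the arules implementation.

```r
# Hypothetical tid-lists: for each item, the IDs of the transactions containing it
tid_lists <- list(
  vegetables = c(1, 2, 3, 5),
  soda       = c(1, 2, 4, 5),
  mixes      = c(1, 4)
)
n_transactions <- 5

# Support of an itemset = size of the intersection of its tid-lists / n
itemset_support <- function(itemset) {
  shared <- Reduce(intersect, tid_lists[itemset])
  length(shared) / n_transactions
}

itemset_support(c("vegetables", "soda"))           # transactions 1, 2 and 5
itemset_support(c("vegetables", "soda", "mixes"))  # transaction 1 only
```

An itemset is reported as frequent only if this value reaches the minimum support, which is 0.65 in the call above.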
freq_rules<-ruleInduction(rules, trans, confidence=0.5)
inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE),10))
## lhs rhs support confidence lift
## [1] { soap} => { vegetables} 0.6578947 0.9615385 1.0746606
## [2] { juice} => { vegetables} 0.6578947 0.9615385 1.0746606
## [3] { aluminum foil} => { soda} 0.6842105 0.9285714 1.1026786
## [4] { individual meals} => { vegetables} 0.6578947 0.9259259 1.0348584
## [5] { mixes} => { vegetables} 0.7105263 0.9000000 1.0058824
## [6] { spaghetti sauce} => { vegetables} 0.6578947 0.8928571 0.9978992
## [7] { aluminum foil} => { vegetables} 0.6578947 0.8928571 0.9978992
## [8] { waffles} => { vegetables} 0.6578947 0.8928571 0.9978992
## [9] { soda} => { vegetables} 0.7368421 0.8750000 0.9779412
## [10] { mixes} => { soda} 0.6842105 0.8666667 1.0291667
## itemset
## [1] 1
## [2] 2
## [3] 6
## [4] 4
## [5] 8
## [6] 3
## [7] 5
## [8] 7
## [9] 10
## [10] 9
Rules can be further sorted and analyzed by confidence and support to identify the most important relationships. Confidence is the proportion of transactions containing the left-hand side (the ‘if’ part) that also contain the right-hand side (the ‘then’ part). In this case, the rules soap => vegetables and juice => vegetables have the highest confidence, about 0.96.
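Confidence can be checked by hand from the itemset counts: it is the count of the whole rule divided by the count of its left-hand side. A minimal sketch in base R, using counts copied from the Eclat output earlier in this report (the variable names are illustrative):

```r
# Counts copied from the inspect(rules) output of the Eclat step (38 transactions)
n_transactions        <- 38
count_soap            <- 26  # { soap}
count_soap_vegetables <- 25  # { soap, vegetables}

# support(soap => vegetables): share of all transactions containing both items
support_rule <- count_soap_vegetables / n_transactions

# confidence(soap => vegetables): share of soap transactions that also contain vegetables
confidence_rule <- count_soap_vegetables / count_soap

round(c(support = support_rule, confidence = confidence_rule), 7)
```

The two values agree with the first row of the table sorted by confidence (0.6578947 and 0.9615385).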
inspect(head(sort(freq_rules, by = "support", decreasing = TRUE), 10))
## lhs rhs support confidence lift
## [1] { vegetables} => { soda} 0.7368421 0.8235294 0.9779412
## [2] { soda} => { vegetables} 0.7368421 0.8750000 0.9779412
## [3] { vegetables} => { mixes} 0.7105263 0.7941176 1.0058824
## [4] { mixes} => { vegetables} 0.7105263 0.9000000 1.0058824
## [5] { soda} => { aluminum foil} 0.6842105 0.8125000 1.1026786
## [6] { aluminum foil} => { soda} 0.6842105 0.9285714 1.1026786
## [7] { soda} => { mixes} 0.6842105 0.8125000 1.0291667
## [8] { mixes} => { soda} 0.6842105 0.8666667 1.0291667
## [9] { vegetables} => { soap} 0.6578947 0.7352941 1.0746606
## [10] { soap} => { vegetables} 0.6578947 0.9615385 1.0746606
## itemset
## [1] 10
## [2] 10
## [3] 8
## [4] 8
## [5] 6
## [6] 6
## [7] 9
## [8] 9
## [9] 1
## [10] 1
Support signifies how frequently an itemset appears in the data set, that is, the probability that a transaction contains it. In this case, the combination of vegetables and soda has the highest probability of occurring together, since it has the highest support value (0.737).
inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))
## lhs rhs support confidence lift
## [1] { soda} => { aluminum foil} 0.6842105 0.8125000 1.102679
## [2] { aluminum foil} => { soda} 0.6842105 0.9285714 1.102679
## [3] { vegetables} => { soap} 0.6578947 0.7352941 1.074661
## [4] { vegetables} => { juice} 0.6578947 0.7352941 1.074661
## [5] { soap} => { vegetables} 0.6578947 0.9615385 1.074661
## [6] { juice} => { vegetables} 0.6578947 0.9615385 1.074661
## [7] { vegetables} => { individual meals} 0.6578947 0.7352941 1.034858
## [8] { individual meals} => { vegetables} 0.6578947 0.9259259 1.034858
## [9] { soda} => { mixes} 0.6842105 0.8125000 1.029167
## [10] { mixes} => { soda} 0.6842105 0.8666667 1.029167
## itemset
## [1] 6
## [2] 6
## [3] 1
## [4] 2
## [5] 1
## [6] 2
## [7] 4
## [8] 4
## [9] 9
## [10] 9
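Lift compares a rule’s confidence with its expected confidence, which is the support of the right-hand side on its own. A lift above 1 means the items occur together more often than they would if they were independent, and a lift below 1 means less often. Here the largest lift is about 1.10, so the associations are weak.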
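As a worked example of lift, the sketch below recomputes two values from the tables above using only itemset counts. The helper `lift_from_counts` is a hypothetical function written for this illustration, not an arules function.

```r
# Lift = confidence(lhs => rhs) / support(rhs)
lift_from_counts <- function(count_both, count_lhs, count_rhs, n) {
  confidence <- count_both / count_lhs  # observed confidence of the rule
  expected   <- count_rhs / n           # confidence expected if lhs and rhs were independent
  confidence / expected
}

n_transactions <- 38

# aluminum foil => soda: 26 joint, 28 aluminum foil, 32 soda
lift_foil_soda <- lift_from_counts(26, 28, 32, n_transactions)

# soda => vegetables: 28 joint, 32 soda, 34 vegetables
lift_soda_veg <- lift_from_counts(28, 32, 34, n_transactions)

round(c(foil_soda = lift_foil_soda, soda_vegetables = lift_soda_veg), 6)
```

The first value is slightly above 1 and the second slightly below 1, matching the 1.102679 and 0.9779412 reported by `inspect()`.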
library(arulesViz)
plot(freq_rules, method="grouped")
The grouped matrix shows the support of each rule by the size of the bubble and the lift by the colour. Soda and aluminum foil have the highest lift, as shown by the darkest shade, while vegetables and soda have the highest support, as shown by the largest bubble. The general trend in the matrix is that the larger the support of a combination of items, the smaller its lift, and vice versa.
plot(freq_rules, measure=c("support", "confidence"), shading="lift", jitter=0)
We can visualize the relationship between the lift, the confidence and the support of the rules. In the scatter plot, most rules with high lift have relatively low to moderate support. The rule with the highest confidence has the lowest support and a high lift. The rules with the lowest lift generally have high support and relatively low to moderate confidence. When judging the importance of a rule, confidence and support are considered the most: support estimates the probability that a transaction contains both A and B, whereas confidence measures the rule’s precision.
plot(freq_rules, method="graph")
Since vegetables are the most frequently purchased item in this data set, they appear in most of the rules. The graph shows support as node size, in the range 0.658 to 0.737, and lift as colour, in the range 0.978 to 1.103. Vegetables are purchased together with soda with a support of about 0.74, and with mixes with a support of about 0.71, which is why the vegetables-mixes bubble is slightly smaller than the vegetables-soda bubble.
soda<-apriori(data=trans, parameter=list(supp=0.65,conf = 0.5),
appearance=list(default="lhs", rhs=" soda"), control=list(verbose=F))
inspect(sort(soda, by='lift'))
## lhs rhs support confidence coverage lift count
## [1] { aluminum foil} => { soda} 0.6842105 0.9285714 0.7368421 1.1026786 26
## [2] { mixes} => { soda} 0.6842105 0.8666667 0.7894737 1.0291667 26
## [3] {} => { soda} 0.8421053 0.8421053 1.0000000 1.0000000 32
## [4] { vegetables} => { soda} 0.7368421 0.8235294 0.8947368 0.9779412 28
plot(soda, method="graph")
We can inspect other rules for items with high counts, such as soda. In this case, there are 4 rules with soda as the consequent, and the rule with the highest confidence is aluminum foil => soda. In the graph, which shows only support and lift, this rule has the lowest support (tied with mixes => soda) but the highest lift. The rule with the highest support has an empty left-hand side, so its confidence is simply the support of soda (0.842) and its lift is exactly 1. This is also confirmed by the graph.
mixes<-apriori(data=trans, parameter=list(supp=0.65,conf = 0.5),
appearance=list(default="lhs", rhs=" mixes"), control=list(verbose=F))
inspect(sort(mixes, by='lift'))
## lhs rhs support confidence coverage lift count
## [1] { soda} => { mixes} 0.6842105 0.8125000 0.8421053 1.029167 26
## [2] { vegetables} => { mixes} 0.7105263 0.7941176 0.8947368 1.005882 27
## [3] {} => { mixes} 0.7894737 0.7894737 1.0000000 1.000000 30
plot(mixes, method="graph")
There are only 3 rules that have ‘mixes’ as the consequent. The rule soda => mixes has the highest confidence (0.8125) and the lowest support (0.684) of the three.
waffles<-apriori(data=trans, parameter=list(supp=0.6,conf = 0.5),
appearance=list(default="lhs", rhs=" waffles"), control=list(verbose=F))
inspect(sort(waffles, by='lift'))
## lhs rhs support confidence coverage lift count
## [1] { mixes} => { waffles} 0.6052632 0.7666667 0.7894737 1.0404762 23
## [2] {} => { waffles} 0.7368421 0.7368421 1.0000000 1.0000000 28
## [3] { vegetables} => { waffles} 0.6578947 0.7352941 0.8947368 0.9978992 25
## [4] { soda} => { waffles} 0.6052632 0.7187500 0.8421053 0.9754464 23
plot(waffles, method="graph")
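There are only 4 rules that have ‘waffles’ as the consequent. The results show that vegetables and waffles are bought together in about 66% of transactions (support 0.658), and that a customer who buys vegetables also buys waffles with a probability of about 0.74 (confidence 0.735).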
Dissimilarity is a numerical measure of how different two data items are. When the dissimilarity is low, the items under observation are similar; when it is high, the items are different.
To measure dissimilarity, the Jaccard distance is used, which is one minus the Jaccard index:
$$d(A,B) = 1 - J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$
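A dissimilarity built on the Jaccard index is one minus the index, so two items that always occur together score 0. The sketch below shows the calculation in base R for two items represented as logical vectors over the same transactions; the function name and the toy vectors are assumptions made for illustration.

```r
# Jaccard dissimilarity between two items, each given as a logical vector
# marking the transactions in which the item occurs
jaccard_dissimilarity <- function(a, b) {
  both   <- sum(a & b)  # transactions containing both items
  either <- sum(a | b)  # transactions containing at least one of them
  1 - both / either
}

# Hypothetical toy items over five transactions
item_a <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
item_b <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

jaccard_dissimilarity(item_a, item_b)  # 2 shared out of 4 -> 0.5
```

With the counts from this report (26 transactions with both aluminum foil and soda, 28 with aluminum foil, 32 with soda), the same formula gives 1 - 26/34, which rounds to the 0.235 shown in the dissimilarity table.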
trans.sel<-trans[,itemFrequency(trans)>0.7]
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)
## aluminum foil coffee/tea individual meals ketchup mixes
## coffee/tea 0.472
## individual meals 0.333 0.457
## ketchup 0.429 0.500 0.457
## mixes 0.389 0.459 0.371 0.500
## soda 0.235 0.361 0.361 0.405 0.278
## spaghetti sauce 0.486 0.429 0.514 0.429 0.389
## vegetables 0.324 0.395 0.306 0.351 0.270
## waffles 0.486 0.382 0.382 0.382 0.343
## soda spaghetti sauce vegetables
## coffee/tea
## individual meals
## ketchup
## mixes
## soda
## spaghetti sauce 0.378
## vegetables 0.263 0.324
## waffles 0.378 0.400 0.324
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
The dissimilarity values are fairly low overall. The largest are between aluminum foil & coffee/tea, aluminum foil & spaghetti sauce, aluminum foil & waffles, individual meals & spaghetti sauce, and ketchup & mixes. The dendrogram clusters the items hierarchically, so items that merge at a low height are the most similar and items that only merge near the top are the most dissimilar.
Affinity is the numerical measure of the similarity between items in a data set. In this case, the higher the affinity value, the higher the probability that two products are similar and will be bought together.
Calculated as:
$$A(i,j) = \frac{\mathrm{supp}(i,j)}{\mathrm{supp}(i) + \mathrm{supp}(j) - \mathrm{supp}(i,j)}$$
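The affinity formula can be checked by hand. Because every support shares the same denominator (the number of transactions), counts can be used directly. The sketch below is base R with a hypothetical helper name, using counts taken from the Eclat output earlier in this report.

```r
# Affinity of items i and j from counts: joint / (i + j - joint)
affinity_from_counts <- function(count_ij, count_i, count_j) {
  count_ij / (count_i + count_j - count_ij)
}

# mixes & vegetables: 27 joint, 30 mixes, 34 vegetables
aff_mixes_veg <- affinity_from_counts(27, 30, 34)

# aluminum foil & soda: 26 joint, 28 aluminum foil, 32 soda
aff_foil_soda <- affinity_from_counts(26, 28, 32)

round(c(mixes_vegetables = aff_mixes_veg, foil_soda = aff_foil_soda), 3)
```

These round to 0.730 and 0.765, the values in the affinity matrix, and each equals one minus the corresponding Jaccard dissimilarity (0.270 and 0.235).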
a = affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
## aluminum foil coffee/tea individual meals ketchup mixes
## aluminum foil 0.000 0.528 0.667 0.571 0.611
## coffee/tea 0.528 0.000 0.543 0.500 0.541
## individual meals 0.667 0.543 0.000 0.543 0.629
## ketchup 0.571 0.500 0.543 0.000 0.500
## mixes 0.611 0.541 0.629 0.500 0.000
## soda 0.765 0.639 0.639 0.595 0.722
## spaghetti sauce 0.514 0.571 0.486 0.571 0.611
## vegetables 0.676 0.605 0.694 0.649 0.730
## waffles 0.514 0.618 0.618 0.618 0.657
## soda spaghetti sauce vegetables waffles
## aluminum foil 0.765 0.514 0.676 0.514
## coffee/tea 0.639 0.571 0.605 0.618
## individual meals 0.639 0.486 0.694 0.618
## ketchup 0.595 0.571 0.649 0.618
## mixes 0.722 0.611 0.730 0.657
## soda 0.000 0.622 0.737 0.622
## spaghetti sauce 0.622 0.000 0.676 0.600
## vegetables 0.737 0.676 0.000 0.676
## waffles 0.622 0.600 0.676 0.000
## Slot "method":
## [1] "Affinity"
Taking into account only the affinity values greater than 0.7, the following pairs of items have a high probability of being purchased together:
Aluminum foil & soda; mixes & soda; mixes & vegetables; vegetables & soda.
par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.5)
axis(2,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.6, las=2)
The affinity values can also be plotted as a matrix. The darker the shade of a cell, the higher the affinity of that pair of items, and hence the higher the probability that the two items will be bought together. Likewise, the lighter the shade, the lower the affinity.
Association rule mining can explore in detail the frequent purchases and customer patterns that help retailers adjust their strategy and potentially increase customer satisfaction, sales and profits. For instance, aluminum foil and soda have a high probability of being purchased together, so retailers can consider placing them next to each other on shelves or bundling them for promotions in their catalogs. A simple change like this, derived from association rules, can improve the customer’s shopping experience and more.