Analysis of sales data is one of the most common applications of association rules. The simplest definition of an association rule is “if something happens, then another thing also tends to happen”. For sales data, this becomes the statement “if a customer buys X, he also tends to buy Y”, where X and Y are itemsets. Mining such rules is very important in retail. Companies can take advantage of them by arranging the store or catalog according to which products are likely to be purchased together, by setting sales promotions that stimulate the sale of a specific product, or by offering personalized discounts. [https://www.cs.helsinki.fi/u/htoivone/pubs/advances.pdf]
During the analysis, I will use the Groceries dataset provided by the arules package. It is also published as a text file on GitHub (see the URL in the code below). The file can be imported with the read.table() function, split into items with str_split_fixed(), written back to disk and then read as an object of class transactions with read.transactions().
library(stringr)
Groceries <- read.table("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv", sep=";")
head(Groceries)
## V1
## 1 citrus fruit,semi-finished bread,margarine,ready soups
## 2 tropical fruit,yogurt,coffee
## 3 whole milk
## 4 pip fruit,yogurt,cream cheese ,meat spreads
## 5 other vegetables,whole milk,condensed milk,long life bakery product
## 6 whole milk,butter,yogurt,rice,abrasive cleaner
transactions <- str_split_fixed(Groceries$V1, ",", n = Inf)
head(transactions[,1:4]) # show only the first 4 of the 32 columns to keep the output narrow
## [,1] [,2] [,3] [,4]
## [1,] "citrus fruit" "semi-finished bread" "margarine" "ready soups"
## [2,] "tropical fruit" "yogurt" "coffee" ""
## [3,] "whole milk" "" "" ""
## [4,] "pip fruit" "yogurt" "cream cheese " "meat spreads"
## [5,] "other vegetables" "whole milk" "condensed milk" "long life bakery product"
## [6,] "whole milk" "butter" "yogurt" "rice"
write.csv(transactions, file = "transactions.csv", row.names = F)
library(arules)
Groceries <- read.transactions("transactions.csv", format = "basket", sep = ",", skip=1)
However, it is also available as an object of class transactions and can be analyzed straightaway. It can be imported through data() function.
data(Groceries)
The first step of the analysis will be inspecting the detailed information of the data.
cat("Number of baskets:", length(Groceries))
## Number of baskets: 9835
cat("Total number of items in all baskets:", sum(size(Groceries)))
## Total number of items in all baskets: 43367
The following output shows the first 6 products in the dataset. Besides the label of the product, one can also see two levels associated with each product, where level1 is the broadest one. Groups of products will be the subject of the hierarchical rule mining section.
head(itemInfo(Groceries))
##              labels  level2           level1
## 1       frankfurter sausage meat and sausage
## 2           sausage sausage meat and sausage
## 3        liver loaf sausage meat and sausage
## 4               ham sausage meat and sausage
## 5              meat sausage meat and sausage
## 6 finished products sausage meat and sausage
Knowing the products, one can also check which items are most frequent in transactions. The plot below shows the 10 most frequently occurring products in the Groceries data.
library(arulesViz)
itemFrequencyPlot(Groceries, topN=10, type="relative", main="Items Frequency", cex.names=0.8)
Since we know which items are the most frequent, let’s examine which are the least frequent. The function below lists the 10 items with the lowest relative frequency, which are thus the least interesting for the customers. This may be caused by the character of the store or by its prices. In any case, these products should be reviewed by the store managers.
head(sort(itemFrequency(Groceries), decreasing=FALSE), n=10)
##             baby food  sound storage medium preservation products       kitchen utensil                  bags
##          0.0001016777          0.0001016777          0.0002033554          0.0004067107          0.0004067107
##        frozen chicken        baby cosmetics        toilet cleaner        salad dressing                whisky
##          0.0006100661          0.0006100661          0.0007117438          0.0008134215          0.0008134215
The number of products per basket is also interesting for the analysis. The plot below shows the distribution of the number of items per basket.
hist(size(Groceries), breaks = 0:40, xaxt="n", ylim=c(0,2500),
main = "Number of items in particular baskets", xlab = "Items")
axis(1, at=seq(0,40,by=5), cex.axis=0.8)
cat("The biggest basket consists of", max(size(Groceries)), "products.")
## The biggest basket consists of 32 products.
One can see that baskets consisting of a single product are the most frequent, and that the number of baskets decreases as the number of items grows.
The Eclat algorithm is used to identify frequent patterns in transaction data. It takes a minimum support value as input and rejects itemsets whose support is lower. Support is the fraction of all transactions in which the itemset appears. Thus, the result is the set of frequent itemsets. ^ https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Eclat_Algorithm
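The call that creates the itemsets object is not shown in the text. A likely reconstruction, assuming the thresholds described for this step (minimum support 0.01, itemsets of 2 to 20 items), might look like the sketch below; the control argument is my addition to silence progress output.

```r
library(arules)
data(Groceries)

# Mine frequent itemsets with Eclat.
# Thresholds follow the description in the text; verbose = FALSE is an assumption.
itemsets <- eclat(Groceries,
                  parameter = list(supp = 0.01, minlen = 2, maxlen = 20),
                  control = list(verbose = FALSE))

# Every returned itemset has at least two items and support of at least 1%.
summary(size(itemsets))
summary(quality(itemsets)$support)
```

Raising supp shortens the list quickly, so it is worth trying a few values before inducing rules.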
inspect(head(itemsets))
## items support count
## [1] {whole milk,hard cheese} 0.01006609 99
## [2] {whole milk,butter milk} 0.01159126 114
## [3] {other vegetables,butter milk} 0.01037112 102
## [4] {ham,whole milk} 0.01148958 113
## [5] {whole milk,sliced cheese} 0.01077783 106
## [6] {whole milk,oil} 0.01128622 111
As can be seen above, the output lists the first six frequent itemsets found by the Eclat algorithm. The result is determined by the parameters set in the eclat() function: minimum support equal to 0.01, minimum itemset length equal to 2 and maximum itemset length equal to 20. Considering that support ranges from 0 to 1, and assuming that higher values are better, one can deduce that the mined itemsets indicate at most a moderate pattern of buying items together, e.g. whole milk and yogurt, rolls/buns and whole milk, other vegetables and whole milk.
The next step of the analysis is the induction of rules from the mined itemsets. It can be done with the function ruleInduction(), whose default method uses a prefix tree. The calls below pass three arguments: the output object of the Eclat algorithm, the underlying dataset and the confidence threshold. Confidence can be understood as a measure of the strength of the rule. Let’s assume two levels of confidence, 1 and 0.5, and check how many rules are found.
rules <- ruleInduction(itemsets, Groceries, confidence=1)
rules
## set of 0 rules
rules <- ruleInduction(itemsets, Groceries, confidence=0.5)
rules
## set of 15 rules
No rule reaches a confidence of 1, so rules can be inspected only after lowering the threshold to 0.5. The confidence of the resulting rules is therefore between 50% and 100%. Let’s now look at the mined rules.
inspect(rules)
## lhs rhs support confidence lift itemset
## [1] {curd,yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 54
## [2] {other vegetables,butter} => {whole milk} 0.01148958 0.5736041 2.244885 101
## [3] {other vegetables,domestic eggs} => {whole milk} 0.01230300 0.5525114 2.162336 116
## [4] {yogurt,whipped/sour cream} => {whole milk} 0.01087951 0.5245098 2.052747 137
## [5] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423 1.984385 139
## [6] {pip fruit,other vegetables} => {whole milk} 0.01352313 0.5175097 2.025351 148
## [7] {citrus fruit,root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 170
## [8] {tropical fruit,root vegetables} => {whole milk} 0.01199797 0.5700483 2.230969 208
## [9] {tropical fruit,root vegetables} => {other vegetables} 0.01230300 0.5845411 3.020999 209
## [10] {tropical fruit,yogurt} => {whole milk} 0.01514997 0.5173611 2.024770 210
## [11] {root vegetables,yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 220
## [12] {root vegetables,yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 221
## [13] {root vegetables,rolls/buns} => {whole milk} 0.01270971 0.5230126 2.046888 222
## [14] {root vegetables,rolls/buns} => {other vegetables} 0.01220132 0.5020921 2.594890 223
## [15] {other vegetables,yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 238
The output consists of six columns. The first two, lhs and rhs, refer to the itemset in the antecedent (if I buy x…) and the itemset in the consequent (…then I buy y), respectively. One can also see the measures mentioned earlier: support and confidence. There is also a measure called lift. Lift indicates how much more or less likely a shopping pattern is compared to the situation in which the items are independent.
The strongest rule seems to be the one indicating that if a customer buys curd and yogurt, he also buys whole milk. For this particular rule, around 1% of all transactions in the dataset contain curd, yogurt and whole milk together. According to the value of confidence, the probability that whole milk appears in a transaction which contains curd and yogurt is 0.58. Moreover, these items occur together 2.28 times as often as we would expect if the antecedent and the consequent were independent.
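The three measures are linked by simple arithmetic: confidence is the support of the whole rule divided by the support of the lhs, and lift is confidence divided by the support of the rhs. The short sketch below only re-uses the numbers printed for {curd,yogurt} => {whole milk}; the variable names are mine and the implied counts are derived values, not read from the data.

```r
n_baskets  <- 9835          # number of baskets reported earlier
support    <- 0.01006609    # supp of curd, yogurt and whole milk together
confidence <- 0.5823529
lift       <- 2.279125

count_rule  <- support * n_baskets     # baskets containing all three items
support_lhs <- support / confidence    # implied supp({curd, yogurt})
support_rhs <- confidence / lift       # implied supp({whole milk})

round(count_rule)                # about 99 baskets
round(support_lhs * n_baskets)   # about 170 baskets with curd and yogurt
round(support_rhs, 3)            # about 0.256, the share of baskets with whole milk
```

Read this way, the lift of 2.28 says that whole milk appears in 58% of the baskets with curd and yogurt, compared with roughly 26% of all baskets.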
The second algorithm is Apriori. Given a set of itemsets (here, the baskets), the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. It works iteratively using a “bottom up” approach: the starting point is a single item, and if this item occurs together with another item often enough (that is, the minimum support condition is met), a new, larger itemset is created. The algorithm terminates when no further extensions of the current itemsets are found. ^ https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Apriori_Algorithm
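The call that creates rules1 is not shown. A plausible reconstruction, assuming the same thresholds as in the Eclat step (minimum support 0.01, minimum confidence 0.5) and the argument style used in the item-specific calls later in the text, is sketched here.

```r
library(arules)
data(Groceries)

# Mine association rules directly with Apriori.
# supp and conf are assumed to match the Eclat-based analysis above.
rules1 <- apriori(Groceries,
                  parameter = list(supp = 0.01, conf = 0.5, target = "rules"),
                  control = list(verbose = FALSE))
rules1
```

Unlike the Eclat route, apriori() produces rules in one step, so no separate ruleInduction() call is needed.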
inspect(rules1)
## lhs rhs support confidence lift count
## [1] {curd,yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
## [2] {other vegetables,butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
## [3] {other vegetables,domestic eggs} => {whole milk} 0.01230300 0.5525114 2.162336 121
## [4] {yogurt,whipped/sour cream} => {whole milk} 0.01087951 0.5245098 2.052747 107
## [5] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423 1.984385 144
## [6] {pip fruit,other vegetables} => {whole milk} 0.01352313 0.5175097 2.025351 133
## [7] {citrus fruit,root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [8] {tropical fruit,root vegetables} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [9] {tropical fruit,root vegetables} => {whole milk} 0.01199797 0.5700483 2.230969 118
## [10] {tropical fruit,yogurt} => {whole milk} 0.01514997 0.5173611 2.024770 149
## [11] {root vegetables,yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
## [12] {root vegetables,yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 143
## [13] {root vegetables,rolls/buns} => {other vegetables} 0.01220132 0.5020921 2.594890 120
## [14] {root vegetables,rolls/buns} => {whole milk} 0.01270971 0.5230126 2.046888 125
## [15] {other vegetables,yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 219
To check whether the rules are statistically significant, I will use the is.significant() function, which is based on Fisher’s exact test.
is.significant(rules1, Groceries)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
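To make the test less of a black box, the same kind of check can be sketched by hand for one rule with base R. The counts below for {curd,yogurt} => {whole milk} are reconstructed from the reported support, confidence and lift, and the one-sided alternative is my assumption about how such a test is usually set up; is.significant() may differ in details such as multiple-testing adjustment.

```r
# 2x2 contingency table for {curd, yogurt} => {whole milk}.
# Counts are derived from the rule table, not read from the data.
n      <- 9835   # all baskets
n_both <- 99     # baskets with lhs and rhs together
n_lhs  <- 170    # baskets with curd and yogurt
n_rhs  <- 2513   # baskets with whole milk

tab <- matrix(c(n_both,         n_lhs - n_both,
                n_rhs - n_both, n - n_lhs - n_rhs + n_both),
              nrow = 2, byrow = TRUE,
              dimnames = list(lhs = c("yes", "no"), rhs = c("yes", "no")))

# One-sided test: is the rhs more common when the lhs is present?
p_value <- fisher.test(tab, alternative = "greater")$p.value
p_value < 0.01
```

A small p-value means the co-occurrence is unlikely to be a product of chance alone, which is what a TRUE in the output above expresses.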
All 15 rules pass the test. Although the two algorithms work differently, Apriori returns exactly the same 15 rules, with the same support, confidence and lift, as the rules induced from the Eclat itemsets. Only the order of some rules and the last column of the output differ.
Since the rules obtained using both algorithms are the same, I will use the rules object (obtained with Eclat) in further analysis.
The above results can also be presented graphically. Below, one can see the 15 obtained rules. The more red the rectangle, the stronger the association rule. The strength of the rule is measured with lift. The arrangement of the rectangles depends on the values of support and confidence.
arulesViz::plotly_arules(rules, method="matrix", measure=c("support","confidence"))
The next plot is analogous to the previous one. It carries the same information, presented in a slightly different way.
plot(rules, method="grouped")
One can also visualize the rules in the form of a graph. The arrows show the direction of the rule, e.g. if a person buys citrus fruit, it is possible he will also buy other vegetables. The size of the circles indicates the support and the color indicates the lift. It is easy to see which items are combined most frequently, for example other vegetables, whole milk or yogurt.
plot(rules, method="graph", shading="lift")
After analysing all rules together, I will now focus on particular items. As said before, the most frequent items in transactions include whole milk, other vegetables, rolls/buns, soda and yogurt. Thus, I will run the Apriori algorithm with each of them as the consequent and check whether there are any interesting patterns not spotted yet. The minimum support is set to 1%. The minimum confidence takes one of three values: 0.5 where possible, and 0.3 or 0.2 for items for which no rules are generated at the 0.5 level.
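Because the same call is repeated for every item with only the consequent and the confidence changing, it can be wrapped in a small helper. The function name rules_for_item and its defaults are hypothetical and serve only as an illustration; the calls in the following subsections do the same thing explicitly.

```r
library(arules)
data(Groceries)

# Hypothetical helper: mine rules with a fixed right-hand-side item
# and return them sorted by confidence.
rules_for_item <- function(item, min_conf, min_supp = 0.01) {
  found <- apriori(Groceries,
                   parameter = list(supp = min_supp, conf = min_conf, target = "rules"),
                   appearance = list(default = "lhs", rhs = item),
                   control = list(verbose = FALSE))
  sort(found, by = "confidence", decreasing = TRUE)
}

# Example: rules that end in whole milk, at the 0.5 confidence level.
milk_sketch <- rules_for_item("whole milk", min_conf = 0.5)
inspect(head(milk_sketch, 3))
```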
Whole milk
rules.milk <- apriori(data=Groceries, parameter=list(supp=0.01, conf = 0.5, target="rules"), appearance = list(default="lhs", rhs="whole milk"), control=list(verbose=F))
rules.milk.byconf <- sort(rules.milk, by="confidence", decreasing=TRUE)
inspect(rules.milk.byconf)
## lhs rhs support confidence lift count
## [1] {curd,yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
## [2] {other vegetables,butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
## [3] {tropical fruit,root vegetables} => {whole milk} 0.01199797 0.5700483 2.230969 118
## [4] {root vegetables,yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 143
## [5] {other vegetables,domestic eggs} => {whole milk} 0.01230300 0.5525114 2.162336 121
## [6] {yogurt,whipped/sour cream} => {whole milk} 0.01087951 0.5245098 2.052747 107
## [7] {root vegetables,rolls/buns} => {whole milk} 0.01270971 0.5230126 2.046888 125
## [8] {pip fruit,other vegetables} => {whole milk} 0.01352313 0.5175097 2.025351 133
## [9] {tropical fruit,yogurt} => {whole milk} 0.01514997 0.5173611 2.024770 149
## [10] {other vegetables,yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 219
## [11] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423 1.984385 144
It turns out that for minimum support 1% and minimum confidence 0.5, the Apriori algorithm mines 11 rules with whole milk as the consequent. According to confidence, the strongest one is {curd, yogurt} => {whole milk}. The rule with the highest support is {other vegetables, yogurt} => {whole milk}. The highest lift also belongs to the {curd, yogurt} => {whole milk} rule.
Other vegetables
rules.vege <- apriori(data=Groceries, parameter=list(supp=0.01, conf = 0.5, target="rules"), appearance = list(default="lhs", rhs="other vegetables"), control=list(verbose=F))
rules.vege.byconf <- sort(rules.vege, by="confidence", decreasing=TRUE)
inspect(rules.vege.byconf)
## lhs rhs support confidence lift count
## [1] {citrus fruit,root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [3] {root vegetables,rolls/buns} => {other vegetables} 0.01220132 0.5020921 2.594890 120
## [4] {root vegetables,yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
For minimum support 1% and minimum confidence 0.5, the algorithm mined 4 rules. According to confidence, the strongest one is {citrus fruit, root vegetables} => {other vegetables}. The rule with the highest support is {root vegetables, yogurt} => {other vegetables}. The highest lift belongs to the {citrus fruit, root vegetables} => {other vegetables} rule.
Rolls/buns
rules.roll <- apriori(data=Groceries, parameter=list(supp=0.01, conf = 0.3, target="rules"), appearance = list(default="lhs", rhs="rolls/buns"), control=list(verbose=F))
rules.roll.byconf <- sort(rules.roll, by="confidence", decreasing=TRUE)
inspect(head(rules.roll.byconf))
## lhs rhs support confidence lift count
## [1] {frankfurter} => {rolls/buns} 0.01921708 0.3258621 1.771616 189
## [2] {sausage} => {rolls/buns} 0.03060498 0.3257576 1.771048 301
With minimum confidence 0.5, the algorithm found no rules, so I decreased the threshold to 0.3. With 1% support, two rules concerning rolls/buns were then mined. Both rules are comparably strong. They say that if a person buys frankfurter or sausage, he also tends to buy rolls/buns. The confidence of both rules is 0.33 and the lift is 1.77.
Soda
rules.soda <- apriori(data=Groceries, parameter=list(supp=0.01, conf = 0.2, target="rules"), appearance = list(default="lhs", rhs="soda"), control=list(verbose=F))
rules.soda.byconf <- sort(rules.soda, by="confidence", decreasing=TRUE)
inspect(rules.soda.byconf)
## lhs rhs support confidence lift count
## [1] {chocolate} => {soda} 0.01352313 0.2725410 1.562939 133
## [2] {bottled water} => {soda} 0.02897814 0.2621895 1.503577 285
## [3] {sausage} => {soda} 0.02430097 0.2586580 1.483324 239
## [4] {fruit/vegetable juice} => {soda} 0.01840366 0.2545710 1.459887 181
## [5] {shopping bags} => {soda} 0.02460600 0.2497420 1.432194 242
## [6] {white bread} => {soda} 0.01026945 0.2439614 1.399044 101
## [7] {pastry} => {soda} 0.02104728 0.2365714 1.356665 207
## [8] {napkins} => {soda} 0.01199797 0.2291262 1.313969 118
## [9] {bottled beer} => {soda} 0.01698017 0.2108586 1.209209 167
## [10] {rolls/buns} => {soda} 0.03833249 0.2084024 1.195124 377
## [11] {pork} => {soda} 0.01189629 0.2063492 1.183350 117
With minimum confidence 0.5, the algorithm again found no rules. Thus, I decreased the minimum confidence to 0.2 and got eleven rules. The strongest one seems to be {chocolate} => {soda}, with confidence 0.27 and lift 1.56. However, the rule with the highest support on the list is {rolls/buns} => {soda}.
Yogurt
rules.yogurt <- apriori(data=Groceries, parameter=list(supp=0.01, conf = 0.3, target="rules"), appearance = list(default="lhs", rhs="yogurt"), control=list(verbose=F))
rules.yogurt.byconf <- sort(rules.yogurt, by="confidence", decreasing=TRUE)
inspect(head(rules.yogurt.byconf))
## lhs rhs support confidence lift count
## [1] {whole milk,curd} => {yogurt} 0.01006609 0.3852140 2.761356 99
## [2] {tropical fruit,whole milk} => {yogurt} 0.01514997 0.3581731 2.567516 149
## [3] {other vegetables,whipped/sour cream} => {yogurt} 0.01016777 0.3521127 2.524073 100
## [4] {tropical fruit,other vegetables} => {yogurt} 0.01230300 0.3427762 2.457146 121
## [5] {whole milk,whipped/sour cream} => {yogurt} 0.01087951 0.3375394 2.419607 107
## [6] {citrus fruit,whole milk} => {yogurt} 0.01026945 0.3366667 2.413350 101
The last analysed item was yogurt. As with rolls/buns, I had to lower the minimum confidence to 0.3, since there were no association rules at all at the 0.5 level. From the rules above, it is clear that the strongest rule in terms of confidence is {whole milk,curd} => {yogurt}. It also has the highest lift, 2.76.
Moving to graphical analysis, the graphs of the rules mined above are shown below.
plot(rules.milk, method="graph", cex=0.7, shading="lift")
plot(rules.vege, method="graph", cex=0.7, shading="lift")
plot(rules.roll, method="graph", cex=0.7, shading="lift")
plot(rules.soda, method="graph", cex=0.7, shading="lift")
plot(rules.yogurt, method="graph", cex=0.7, shading="lift")
Starting from the first graph which concerns whole milk,
The plots below show parallel coordinates. The meaning of the x-axis is worth mentioning. Positions 3, 2 and 1 hold the items of the lhs, that is, the itemset which the customer already has in the basket, and each line ends at the rhs item. The position numbers only order the items within a rule; the transactions are stored as sets, so they do not record the order in which items were added to the basket.
plot(rules.milk, method="paracoord")
plot(rules.vege, method="paracoord")
plot(rules.roll, method="paracoord")
plot(rules.soda, method="paracoord")
plot(rules.yogurt, method="paracoord")
Since the Groceries dataset contains two item levels in addition to the item names, it is possible to conduct hierarchical rule mining. It is based on aggregating items into groups and checking whether any group of products is associated with another. Moreover, one can also analyse relationships between individual items and groups of items. I will start by mining rules in which both the antecedent and the consequent are groups of products.
The unique categories of level1 are as follows:
unique(Groceries@itemInfo[["level1"]])
## [1] meat and sausage fruit and vegetables fresh products processed food canned food
## [6] drinks snacks and candies detergent perfumery non-food
## 10 Levels: canned food detergent drinks fresh products fruit and vegetables meat and sausage non-food ... snacks and candies
The unique categories of level2 are as follows:
unique(Groceries@itemInfo[["level2"]])
## [1] sausage poultry pork
## [4] beef fish fruit
## [7] vegetables packaged fruit/vegetables dairy produce
## [10] shelf-stable dairy cheese delicatessen
## [13] frozen foods eggs bread and backed goods
## [16] staple foods vinegar/oils sweetener
## [19] condiments soups/sauces health food
## [22] bakery improver pudding powder canned fruit/vegetables
## [25] jam/sweet spreads meat spreads canned fish
## [28] pet food/care baby food coffee
## [31] tea/cocoa drinks non-alc. drinks beer
## [34] hard drinks wine snacks
## [37] long-life bakery products chewing gum chocolate
## [40] candy seasonal products detergent/softener
## [43] cleaner bathroom cleaner hair care
## [46] dental care cosmetics soap
## [49] personal hygiene perfumery non-food kitchen
## [52] non-food house keeping products games/books/hobby garden
## [55] bags
## 55 Levels: baby food bags bakery improver bathroom cleaner beef beer bread and backed goods candy ... wine
Since level2 has more categories than level1 (55 versus 10), I will use it in order to enrich the analysis.
trans_level2 <- aggregate(Groceries, by="level2")
inspect(head(trans_level2))
## items
## [1] {bread and backed goods,fruit,soups/sauces,vinegar/oils}
## [2] {coffee,dairy produce,fruit}
## [3] {dairy produce}
## [4] {cheese,dairy produce,fruit,meat spreads}
## [5] {dairy produce,long-life bakery products,shelf-stable dairy,vegetables}
## [6] {cleaner,dairy produce,staple foods}
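The code that mines rules on the aggregated transactions is not shown. A likely reconstruction, assuming the 5% support and 0.5 confidence mentioned in the discussion of the results and the object names used in the surrounding calls, could be:

```r
library(arules)
data(Groceries)

# Replace every item by its level2 group.
trans_level2 <- aggregate(Groceries, by = "level2")

# Mine rules between groups; the thresholds are assumed from the text.
rules.trans_level2 <- apriori(trans_level2,
                              parameter = list(supp = 0.05, conf = 0.5, target = "rules"),
                              control = list(verbose = FALSE))

# Sort by confidence for inspection.
rules.by.conf2 <- sort(rules.trans_level2, by = "confidence", decreasing = TRUE)
```

The support threshold is higher than in the item-level analysis because groups occur in far more baskets than single products do.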
inspect(rules.by.conf2)
## lhs rhs support confidence lift count
## [1] {fruit,vegetables} => {dairy produce} 0.07869853 0.7350427 1.659203 774
## [2] {bread and backed goods,fruit} => {dairy produce} 0.07727504 0.7183365 1.621492 760
## [3] {bread and backed goods,vegetables} => {dairy produce} 0.08195221 0.7051619 1.591753 806
## [4] {sausage,vegetables} => {dairy produce} 0.05266904 0.6906667 1.559033 518
## [5] {non-alc. drinks,vegetables} => {dairy produce} 0.06446365 0.6817204 1.538839 634
## [6] {fruit,non-alc. drinks} => {dairy produce} 0.06375191 0.6807818 1.536720 627
## [7] {cheese} => {dairy produce} 0.08459583 0.6677368 1.507274 832
## [8] {vinegar/oils} => {dairy produce} 0.05866802 0.6519774 1.471700 577
## [9] {fruit} => {dairy produce} 0.15638027 0.6277551 1.417024 1538
## [10] {vegetables} => {dairy produce} 0.17041179 0.6242086 1.409018 1676
## [11] {bread and backed goods,sausage} => {dairy produce} 0.06395526 0.6172718 1.393360 629
## [12] {long-life bakery products} => {dairy produce} 0.05002542 0.6007326 1.356026 492
## [13] {dairy produce,sausage} => {bread and backed goods} 0.06395526 0.5956439 1.724002 629
## [14] {bread and backed goods,non-alc. drinks} => {dairy produce} 0.07229283 0.5818331 1.313364 711
## [15] {frozen foods} => {dairy produce} 0.06710727 0.5739130 1.295487 660
## [16] {sausage} => {dairy produce} 0.10737163 0.5677419 1.281557 1056
## [17] {sausage} => {bread and backed goods} 0.10360956 0.5478495 1.585668 1019
## [18] {bread and backed goods} => {dairy produce} 0.18769700 0.5432607 1.226295 1846
## [19] {dairy produce,fruit} => {vegetables} 0.07869853 0.5032510 1.843379 774
## [20] {cheese} => {bread and backed goods} 0.06365023 0.5024077 1.454144 626
Assuming 5% support and 0.5 confidence, the Apriori algorithm mined 20 association rules. The strongest one in terms of confidence is {fruit, vegetables} => {dairy produce}, with confidence 0.74 and support 0.08. The highest lift, on the other hand, is observed for the {dairy produce, fruit} => {vegetables} rule.
Additionally, one can also visualize the results with a graph.
plot(rules.trans_level2, method="graph", cex=0.7, shading="lift")
The second part of the analysis will be looking for the relationships between individual items and groups of items. In order to do that, I will use addAggregate() function and then run Apriori algorithm.
multilevel <- addAggregate(Groceries, "level2")
inspect(head(multilevel)) # the * indicates group-level items
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups,
## bread and backed goods*,
## fruit*,
## soups/sauces*,
## vinegar/oils*}
## [2] {tropical fruit,
## yogurt,
## coffee,
## coffee*,
## dairy produce*,
## fruit*}
## [3] {whole milk,
## dairy produce*}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads,
## cheese*,
## dairy produce*,
## fruit*,
## meat spreads*}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product,
## dairy produce*,
## long-life bakery products*,
## shelf-stable dairy*,
## vegetables*}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner,
## cleaner*,
## dairy produce*,
## staple foods*}
inspect(head(rules_multilevel))
## lhs rhs support confidence lift count
## [1] {canned beer} => {beer*} 0.07768175 1.0000000 6.428105 764
## [2] {curd} => {dairy produce*} 0.05327911 1.0000000 2.257287 524
## [3] {coffee} => {coffee*} 0.05805796 1.0000000 15.415361 571
## [4] {coffee*} => {coffee} 0.05805796 0.8949843 15.415361 571
## [5] {beef} => {beef*} 0.05246568 1.0000000 12.202233 516
## [6] {beef*} => {beef} 0.05246568 0.6401985 12.202233 516
It turns out that all of the printed rules are spurious, meaning that the lhs and rhs refer to the same product. For example, the first rule is {canned beer} => {beer*}, which says that if the customer buys canned beer, he also buys an item from the beer group. In order to filter out such spurious results, one can use filterAggregate().
rules <- filterAggregate(multilevel)
rules
## transactions in sparse format with
## 0 transactions (rows) and
## 224 items (columns)
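After filtering, no transactions are left. Note, however, that filterAggregate() was applied here to the transactions object multilevel, in which every basket contains each item together with its own group, so every basket is removed. The result therefore says nothing about the mined rules; to drop the spurious rules, the function has to be applied to the rules object (rules_multilevel) instead.
Besides support, confidence and lift, there are also other measures of association rule quality. Among others, the Jaccard index and affinity are worth mentioning.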
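Rules, rather than transactions, can be passed to filterAggregate(). The sketch below assumes 5% support and 0.5 confidence for the multilevel rules, because the original call that created rules_multilevel is not shown; rules_filtered is a name introduced here for illustration.

```r
library(arules)
data(Groceries)

# Add the level2 groups as extra items (marked with * in the labels).
multilevel <- addAggregate(Groceries, "level2")

# Assumed thresholds; the original mining call is not part of the text.
rules_multilevel <- apriori(multilevel,
                            parameter = list(supp = 0.05, conf = 0.5, target = "rules"),
                            control = list(verbose = FALSE))

# Drop rules that contain an item together with its own group.
rules_filtered <- filterAggregate(rules_multilevel)

length(rules_multilevel)
length(rules_filtered)
```

Whatever remains after this step links items to groups they do not belong to, which is the kind of relationship this part of the analysis is after.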
The Jaccard index tells how likely two items are to be bought together. It can be read as a conditional probability: the probability that a basket contains both items, given that it contains at least one of them. The formal equation is as follows ^(http://michael.hahsler.net/research/association_rules/measures.html#jaccard):
\[\mathrm{Jaccard}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) + \mathrm{supp}(Y) - \mathrm{supp}(X \cup Y)}\]
In R, the dissimilarity() function with “jaccard” as the method returns the Jaccard distance, i.e. 1 minus the index defined above.
trans <- Groceries[,itemFrequency(Groceries)>0.1]
jaccard <- dissimilarity(trans, which="items", method = "jaccard")
round(jaccard, 2)
##                  tropical fruit root vegetables other vegetables whole milk yogurt rolls/buns bottled water
## root vegetables            0.89
## other vegetables           0.86            0.81
## whole milk                 0.87            0.85             0.80
## yogurt                     0.86            0.88             0.85       0.83
## rolls/buns                 0.91            0.91             0.87       0.85   0.88
## bottled water              0.91            0.92             0.91       0.90   0.90       0.91
## soda                       0.92            0.93             0.90       0.90   0.90       0.88          0.89
The result is a matrix of Jaccard dissimilarities. The higher the value, the less likely two items are to occur in the same transaction. According to the output, soda and root vegetables are the least likely to occur together.
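The relation between the index and the distance can be checked on a tiny example in base R. The incidence matrix below is made up for illustration, and jaccard_index is a hypothetical helper, not part of arules.

```r
# Toy incidence matrix: rows are baskets, columns are items (made-up data).
baskets <- matrix(c(1, 1, 0,
                    1, 0, 1,
                    1, 1, 0,
                    0, 1, 0,
                    1, 1, 1),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("milk", "bread", "soda")))

# Share of baskets with both items divided by share with at least one.
jaccard_index <- function(x, y) {
  both   <- mean(x == 1 & y == 1)
  either <- mean(x == 1 | y == 1)
  both / either
}

idx <- jaccard_index(baskets[, "milk"], baskets[, "bread"])
idx        # 0.6: similarity of milk and bread
1 - idx    # 0.4: the corresponding Jaccard distance
```

Milk and bread share three of the five baskets in which at least one of them appears, hence the index of 0.6 and the distance of 0.4.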
Affinity, on the other hand, is a similarity measure: the higher the value, the more similar the items. The formal equation is as follows ^(https://rdrr.io/cran/arules/man/affinity.html):
\[A(X, Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) + \mathrm{supp}(Y) - \mathrm{supp}(X \cup Y)}\]
aff <- affinity(trans)
round(aff, 2)
## An object of class "ar_similarity"
##                  tropical fruit root vegetables other vegetables whole milk yogurt rolls/buns bottled water soda
## tropical fruit             0.00            0.11             0.14       0.13   0.14       0.09          0.09 0.08
## root vegetables            0.11            0.00             0.19       0.15   0.12       0.09          0.08 0.07
## other vegetables           0.14            0.19             0.00       0.20   0.15       0.13          0.09 0.10
## whole milk                 0.13            0.15             0.20       0.00   0.17       0.15          0.10 0.10
## yogurt                     0.14            0.12             0.15       0.17   0.00       0.12          0.10 0.10
## rolls/buns                 0.09            0.09             0.13       0.15   0.12       0.00          0.09 0.12
## bottled water              0.09            0.08             0.09       0.10   0.10       0.09          0.00 0.11
## soda                       0.08            0.07             0.10       0.10   0.10       0.12          0.11 0.00
## Slot "method":
## [1] "Affinity"
One can easily spot that, for each pair of items, the Jaccard dissimilarity and the affinity sum to 1. The maximum similarity is observed for whole milk and other vegetables (0.20), which confirms the previous conclusions.
The last part of the quality measures analysis is a visualization of the above matrix. The more red the rectangle, the more similar the items.
image(aff, axes = FALSE)
axis(1, at=seq(0,1,length.out=ncol(aff)), labels=rownames(aff), cex.axis=0.6, las=2)
axis(2, at=seq(0,1,length.out=ncol(aff)), labels=rownames(aff), cex.axis=0.6, las=1)
Most of the rectangles are yellow and orange, which is as expected. In the data, items are not very similar to each other: no pair of items has an affinity higher than 0.2.
Since association rules are very useful in setting a store's strategy, I will try to draw a few conclusions from the above analysis.