Association rules - market basket analysis

Basket analysis
Dataset
Initial analysis
The Eclat Algorithm
The Apriori algorithm
Individual rule representation
Hierarchical rules
Other quality measures
Conclusions

Basket analysis

Analysis of sales data is one of the most common applications of association rules. The simplest definition of an association rule is “if something happens, then another thing also tends to happen”. For sales data, this translates to the statement “if a customer buys X, he also tends to buy Y”, where X and Y are itemsets. Mining such rules is very valuable in retail. Companies can take advantage of them by arranging the store or catalogs according to which products are likely to be purchased together, by setting up promotions that stimulate the sale of a specific product, or by offering personalized discounts. [https://www.cs.helsinki.fi/u/htoivone/pubs/advances.pdf]

Dataset

During the analysis, I will use the Groceries dataset provided by the arules package. It is also published as a text file on GitHub. That file can be imported with the read.table() function and then transformed into an object of class transactions with read.transactions().

library(stringr)
Groceries <- read.table("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv", sep=";")
head(Groceries)

##                                                                    V1
## 1              citrus fruit,semi-finished bread,margarine,ready soups
## 2                                        tropical fruit,yogurt,coffee
## 3                                                          whole milk
## 4                         pip fruit,yogurt,cream cheese ,meat spreads
## 5 other vegetables,whole milk,condensed milk,long life bakery product
## 6                      whole milk,butter,yogurt,rice,abrasive cleaner

transactions <- str_split_fixed(Groceries$V1, ",", n = Inf)
head(transactions[, 1:4]) # show only the first 4 columns to keep the output readable

##      [,1]               [,2]                  [,3]             [,4]                      
## [1,] "citrus fruit"     "semi-finished bread" "margarine"      "ready soups"             
## [2,] "tropical fruit"   "yogurt"              "coffee"         ""                        
## [3,] "whole milk"       ""                    ""               ""                        
## [4,] "pip fruit"        "yogurt"              "cream cheese "  "meat spreads"            
## [5,] "other vegetables" "whole milk"          "condensed milk" "long life bakery product"
## [6,] "whole milk"       "butter"              "yogurt"         "rice"

write.csv(transactions, file = "transactions.csv", row.names = FALSE)

library(arules)
Groceries <- read.transactions("transactions.csv", format = "basket", sep = ",", skip=1)

However, the dataset is also available in arules as an object of class transactions and can be analyzed straight away. It can be loaded with the data() function.

data(Groceries)

Initial analysis

The first step of the analysis is to inspect basic information about the data.

cat("Number of baskets:", length(Groceries))

## Number of baskets: 9835

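cat("Total number of items in all baskets:", sum(size(Groceries)))

## Total number of items in all baskets: 43367

Note that sum(size(Groceries)) counts every item occurrence in every basket. The number of distinct items is the number of columns of the transactions object, ncol(Groceries).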

The following output shows the first 6 products in the dataset. Besides the product label, one can also see two group levels assigned to each product, where level1 is the broadest one. Groups of products will be the subject of the hierarchical rule mining section.

head(itemInfo(Groceries))

##              labels  level2           level1
## 1       frankfurter sausage meat and sausage
## 2           sausage sausage meat and sausage
## 3        liver loaf sausage meat and sausage
## 4               ham sausage meat and sausage
## 5              meat sausage meat and sausage
## 6 finished products sausage meat and sausage

Knowing the products, one can also check which items are most frequent in transactions. The plot below shows the 10 most frequently occurring products in the Groceries data.

library(arulesViz)
itemFrequencyPlot(Groceries, topN=10, type="relative", main="Items Frequency", cex.names=0.8)

Now that we know the most frequent items, let’s examine the least frequent ones. The call below lists the 10 items with the lowest relative frequency, which are thus the least popular among the customers. This may be caused by the character of the store or by its prices. In any case, these products deserve the attention of the store managers.

head(sort(itemFrequency(Groceries), decreasing=FALSE), n=10)

##             baby food  sound storage medium preservation products       kitchen utensil                  bags 
##          0.0001016777          0.0001016777          0.0002033554          0.0004067107          0.0004067107 
##        frozen chicken        baby cosmetics        toilet cleaner        salad dressing                whisky 
##          0.0006100661          0.0006100661          0.0007117438          0.0008134215          0.0008134215

The number of products per basket is also interesting for the analysis. The plot below shows the distribution of the number of items per basket.

hist(size(Groceries), breaks = 0:40, xaxt="n", ylim=c(0,2500), 
     main = "Number of items in particular baskets", xlab = "Items")
axis(1, at=seq(0,40,by=5), cex.axis=0.8)

cat("The biggest basket consists of", max(size(Groceries)), "products.")

## The biggest basket consists of 32 products.

One can see that the most frequent baskets are those consisting of a single product. The number of baskets decreases as the number of items grows.

The Eclat Algorithm

The Eclat algorithm is used to identify frequent patterns in transaction data. It takes a minimum support value as input and rejects itemsets with lower support. The support of an itemset is the fraction of all transactions that contain it. Thus, the results are the most frequent itemsets. [https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Eclat_Algorithm]
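The call that creates the itemsets object is not shown in this report. A minimal sketch, assuming the parameter values described in this section (minimum support 0.01, itemset length between 2 and 20), could look as follows. The exact arguments used originally are an assumption.

```r
library(arules)
data(Groceries)

# Assumed reconstruction of the missing call: the parameter values follow
# the description in the text, everything else is left at eclat() defaults.
itemsets <- eclat(Groceries,
                  parameter = list(supp = 0.01, minlen = 2, maxlen = 20),
                  control = list(verbose = FALSE))
itemsets
```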

inspect(head(itemsets))

##     items                          support    count
## [1] {whole milk,hard cheese}       0.01006609  99  
## [2] {whole milk,butter milk}       0.01159126 114  
## [3] {other vegetables,butter milk} 0.01037112 102  
## [4] {ham,whole milk}               0.01148958 113  
## [5] {whole milk,sliced cheese}     0.01077783 106  
## [6] {whole milk,oil}               0.01128622 111

As can be seen above, the Eclat algorithm returns frequent itemsets; the output shows the first six of them. The result is determined by the parameters set in the eclat() function: a minimum support of 0.01, a minimum itemset length of 2 and a maximum itemset length of 20. Given that support ranges from 0 to 1 and that higher values are better, the results indicate at most a moderate tendency to buy items together, e.g. whole milk and yogurt, rolls/buns and whole milk, or other vegetables and whole milk.

The next step of the analysis is to induce rules from the mined itemsets. This can be done with the ruleInduction() function. Its default method is based on a prefix tree. The function needs at least three arguments: the output object of the Eclat algorithm, the underlying dataset and the minimum confidence. Confidence can be understood as a measure of the strength of the rule. Let’s try two confidence levels, 1 and 0.5, and check how many rules are found.

rules <- ruleInduction(itemsets, Groceries, confidence=1)
rules

## set of 0 rules

rules <- ruleInduction(itemsets, Groceries, confidence=0.5)
rules

## set of 15 rules

Rules are found only at the 0.5 confidence level. This means that the confidence of each mined rule is between 50% and 100%. Let’s now look at the mined rules.

inspect(rules)

##      lhs                                      rhs                support    confidence lift     itemset
## [1]  {curd,yogurt}                         => {whole milk}       0.01006609 0.5823529  2.279125  54    
## [2]  {other vegetables,butter}             => {whole milk}       0.01148958 0.5736041  2.244885 101    
## [3]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300 0.5525114  2.162336 116    
## [4]  {yogurt,whipped/sour cream}           => {whole milk}       0.01087951 0.5245098  2.052747 137    
## [5]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159 0.5070423  1.984385 139    
## [6]  {pip fruit,other vegetables}          => {whole milk}       0.01352313 0.5175097  2.025351 148    
## [7]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112 0.5862069  3.029608 170    
## [8]  {tropical fruit,root vegetables}      => {whole milk}       0.01199797 0.5700483  2.230969 208    
## [9]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999 209    
## [10] {tropical fruit,yogurt}               => {whole milk}       0.01514997 0.5173611  2.024770 210    
## [11] {root vegetables,yogurt}              => {whole milk}       0.01453991 0.5629921  2.203354 220    
## [12] {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078 221    
## [13] {root vegetables,rolls/buns}          => {whole milk}       0.01270971 0.5230126  2.046888 222    
## [14] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890 223    
## [15] {other vegetables,yogurt}             => {whole milk}       0.02226741 0.5128806  2.007235 238

The output consists of six columns. The first two, lhs and rhs, refer to the itemset in the antecedent (if I buy x…) and the itemset in the consequent (…then I buy y), respectively. One can also see the previously mentioned measures, support and confidence, as well as a measure called lift. Lift indicates how much more or less likely the items are to be bought together than they would be if they were independent.

Consider the first rule, which states that if a customer buys curd and yogurt, he also buys whole milk. Around 1% of all transactions in the dataset contain curd, yogurt and whole milk together. According to the confidence, the probability that whole milk appears in a transaction which contains curd and yogurt is 0.58. Moreover, these items occur together 2.28 times more often than would be expected if the antecedent and the consequent were independent. The highest confidence (0.59) and lift (3.03) in the list belong to {citrus fruit, root vegetables} => {other vegetables}.
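To make the three measures concrete, the following base R sketch computes them by hand for one rule. The five baskets are hypothetical toy data, not taken from Groceries, and the helper names are chosen only for this illustration.

```r
# Hypothetical toy baskets (not the Groceries data)
baskets <- list(
  c("curd", "yogurt", "whole milk"),
  c("curd", "yogurt"),
  c("whole milk", "butter"),
  c("yogurt", "whole milk"),
  c("curd", "yogurt", "whole milk", "butter")
)

# Support: fraction of baskets that contain all of the given items
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("curd", "yogurt")
rhs <- "whole milk"

supp_rule <- support(c(lhs, rhs))       # baskets containing lhs and rhs
conf_rule <- supp_rule / support(lhs)   # share of lhs baskets that also contain rhs
lift_rule <- conf_rule / support(rhs)   # confidence relative to the rhs base rate

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

In this toy data the lift is below 1, so the antecedent makes whole milk slightly less likely than its base rate; the rules mined from Groceries all have a lift above 1.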

The Apriori algorithm

The second algorithm applied is Apriori. Given a set of transactions, the algorithm attempts to find itemsets which are common to at least a minimum number C of the transactions. It works iteratively using a “bottom up” approach: it starts from single items, and if an item is associated with another item (under a specific minimum support condition), a new, larger itemset is created. The algorithm terminates when no further extensions of the current itemsets are found. [https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Apriori_Algorithm]
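The bottom-up idea can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy baskets, not the implementation used by arules: frequent itemsets are extended by one item at a time, and the search stops when no extension reaches the minimum count.

```r
# Hypothetical toy baskets and threshold (2 of 5 baskets, i.e. support 0.4)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("bread", "butter"),
  c("milk", "bread", "yogurt")
)
min_count <- 2

# Number of baskets that contain all of the given items
count <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}
is_frequent <- function(s) count(s) >= min_count

# Level 1: frequent single items
items    <- sort(unique(unlist(baskets)))
current  <- Filter(is_frequent, as.list(items))
frequent <- current

# Level k + 1: extend every frequent k-itemset by one later item and keep
# only the frequent extensions; stop when nothing is left to extend.
while (length(current) > 0) {
  candidates <- list()
  for (s in current) {
    for (i in items[items > max(s)]) {
      candidates[[length(candidates) + 1]] <- c(s, i)
    }
  }
  current  <- Filter(is_frequent, candidates)
  frequent <- c(frequent, current)
}

frequent
```

Extending only with items that sort after the last item of an itemset generates every candidate once. No frequent itemset is missed, because every subset of a frequent itemset is itself frequent.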

inspect(rules1)

##      lhs                                      rhs                support    confidence lift     count
## [1]  {curd,yogurt}                         => {whole milk}       0.01006609 0.5823529  2.279125  99  
## [2]  {other vegetables,butter}             => {whole milk}       0.01148958 0.5736041  2.244885 113  
## [3]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300 0.5525114  2.162336 121  
## [4]  {yogurt,whipped/sour cream}           => {whole milk}       0.01087951 0.5245098  2.052747 107  
## [5]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159 0.5070423  1.984385 144  
## [6]  {pip fruit,other vegetables}          => {whole milk}       0.01352313 0.5175097  2.025351 133  
## [7]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112 0.5862069  3.029608 102  
## [8]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999 121  
## [9]  {tropical fruit,root vegetables}      => {whole milk}       0.01199797 0.5700483  2.230969 118  
## [10] {tropical fruit,yogurt}               => {whole milk}       0.01514997 0.5173611  2.024770 149  
## [11] {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078 127  
## [12] {root vegetables,yogurt}              => {whole milk}       0.01453991 0.5629921  2.203354 143  
## [13] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890 120  
## [14] {root vegetables,rolls/buns}          => {whole milk}       0.01270971 0.5230126  2.046888 125  
## [15] {other vegetables,yogurt}             => {whole milk}       0.02226741 0.5128806  2.007235 219

To check whether the rules are statistically significant, I will use the is.significant() function, which is based on Fisher’s exact test.

is.significant(rules1, Groceries)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
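To illustrate what such a test looks at, here is a sketch for the first rule, {curd,yogurt} => {whole milk}, using fisher.test() from base R. The counts are derived from the rule table above: 99 baskets contain both sides, 99 / 0.5823529 = 170 contain the antecedent, and 0.5823529 / 2.279125 × 9835 ≈ 2513 contain whole milk. The 0.01 threshold and the one-sided alternative are assumptions made for this illustration and may differ from the settings of is.significant().

```r
n      <- 9835                               # number of baskets
n_both <- 99                                 # curd, yogurt and whole milk
n_lhs  <- round(n_both / 0.5823529)          # curd and yogurt (count / confidence)
n_rhs  <- round(0.5823529 / 2.279125 * n)    # whole milk (confidence / lift * n)

# 2 x 2 contingency table: antecedent present/absent vs. consequent present/absent
tab <- matrix(c(n_both,          n_lhs - n_both,
                n_rhs - n_both,  n - n_lhs - n_rhs + n_both),
              nrow = 2, byrow = TRUE,
              dimnames = list(lhs = c("yes", "no"), rhs = c("yes", "no")))

# One-sided test: are lhs and rhs found together more often than by chance?
test <- fisher.test(tab, alternative = "greater")
test$p.value < 0.01
```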

Although the two algorithms work differently, Apriori yields the same 15 rules as Eclat, with identical support, confidence and lift. Only the order of some rules and the last column (count instead of itemset) differ.

Since the rules obtained with both algorithms are the same, I will use the rules object (obtained with Eclat) in further analysis.

The above results can also be presented graphically. Below, one can see the 15 obtained rules. The more red the rectangle, the stronger the association rule. The strength of the rule is measured with lift. The arrangement of the rectangles depends on the values of support and confidence.

arulesViz::plotly_arules(rules, method="matrix", measure=c("support","confidence"))

The next plot is analogous to the previous one. It carries the same information, presented in a slightly different way.

plot(rules, method="grouped")

One can also visualize the rules in the form of a graph. The arrows show the direction of a rule, e.g. if a person buys citrus fruit, he may also buy other vegetables. The size of the circles indicates the support and the color indicates the lift. It is easy to see which items are combined most frequently, for example other vegetables, whole milk or yogurt.

plot(rules, method="graph", shading="lift")

Individual rule representation

After analysing all rules together, I will now focus on particular items. As shown before, the most frequent items in transactions include whole milk, other vegetables, rolls/buns, soda and yogurt. Thus, I will run the Apriori algorithm with each of them as the consequent and check whether there are any interesting patterns not spotted yet. I set the minimum support to 1%. The minimum confidence takes three levels: 0.5 where possible, and 0.3 or 0.2 where no rules are generated at the higher level.

Whole milk

rules.milk <- apriori(data=Groceries,  parameter=list(supp=0.01, conf = 0.5, target="rules"), appearance = list(default="lhs", rhs="whole milk"), control=list(verbose=F))

rules.milk.byconf <- sort(rules.milk, by="confidence", decreasing=TRUE)

inspect(rules.milk.byconf)

##      lhs                                      rhs          support    confidence lift     count
## [1]  {curd,yogurt}                         => {whole milk} 0.01006609 0.5823529  2.279125  99  
## [2]  {other vegetables,butter}             => {whole milk} 0.01148958 0.5736041  2.244885 113  
## [3]  {tropical fruit,root vegetables}      => {whole milk} 0.01199797 0.5700483  2.230969 118  
## [4]  {root vegetables,yogurt}              => {whole milk} 0.01453991 0.5629921  2.203354 143  
## [5]  {other vegetables,domestic eggs}      => {whole milk} 0.01230300 0.5525114  2.162336 121  
## [6]  {yogurt,whipped/sour cream}           => {whole milk} 0.01087951 0.5245098  2.052747 107  
## [7]  {root vegetables,rolls/buns}          => {whole milk} 0.01270971 0.5230126  2.046888 125  
## [8]  {pip fruit,other vegetables}          => {whole milk} 0.01352313 0.5175097  2.025351 133  
## [9]  {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770 149  
## [10] {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235 219  
## [11] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385 144

It turns out that for a minimum support of 1%, the Apriori algorithm mines 11 rules with whole milk as the consequent. By confidence, the strongest one is {curd, yogurt} => {whole milk}. By support, the strongest one is {other vegetables, yogurt} => {whole milk}. The highest lift also belongs to the {curd, yogurt} => {whole milk} rule.

Other vegetables

rules.vege <- apriori(data=Groceries,  parameter=list(supp=0.01, conf = 0.5, target="rules"), appearance = list(default="lhs", rhs="other vegetables"), control=list(verbose=F))

rules.vege.byconf <- sort(rules.vege, by="confidence", decreasing=TRUE)

inspect(rules.vege.byconf)

##     lhs                                 rhs                support    confidence lift     count
## [1] {citrus fruit,root vegetables}   => {other vegetables} 0.01037112 0.5862069  3.029608 102  
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300 0.5845411  3.020999 121  
## [3] {root vegetables,rolls/buns}     => {other vegetables} 0.01220132 0.5020921  2.594890 120  
## [4] {root vegetables,yogurt}         => {other vegetables} 0.01291307 0.5000000  2.584078 127

For minimum support 1%, the algorithm mined 4 rules. According to confidence, the strongest one is {citrus fruit, root vegetables} => {other vegetables}. As to support, the strongest one is {root vegetables, yogurt} => {other vegetables}. The highest lift refers to {citrus fruit,root vegetables} => {other vegetables} rule.

Rolls/buns

rules.roll <- apriori(data=Groceries,  parameter=list(supp=0.01, conf = 0.3, target="rules"), appearance = list(default="lhs", rhs="rolls/buns"), control=list(verbose=F))

rules.roll.byconf <- sort(rules.roll, by="confidence", decreasing=TRUE)

inspect(head(rules.roll.byconf))

##     lhs              rhs          support    confidence lift     count
## [1] {frankfurter} => {rolls/buns} 0.01921708 0.3258621  1.771616 189  
## [2] {sausage}     => {rolls/buns} 0.03060498 0.3257576  1.771048 301

At a minimum confidence of 0.5, the algorithm found no rules. Thus, I decreased the minimum value to 0.3. After that, two rules concerning rolls/buns were mined at 1% support. Both rules are comparably strong. They say that if a person buys frankfurter or sausage, he also tends to buy rolls/buns. The confidence of both rules is 0.33 and the lift is 1.77.

Soda

rules.soda <- apriori(data=Groceries,  parameter=list(supp=0.01, conf = 0.2, target="rules"), appearance = list(default="lhs", rhs="soda"), control=list(verbose=F))

rules.soda.byconf <- sort(rules.soda, by="confidence", decreasing=TRUE)

inspect(rules.soda.byconf)

##      lhs                        rhs    support    confidence lift     count
## [1]  {chocolate}             => {soda} 0.01352313 0.2725410  1.562939 133  
## [2]  {bottled water}         => {soda} 0.02897814 0.2621895  1.503577 285  
## [3]  {sausage}               => {soda} 0.02430097 0.2586580  1.483324 239  
## [4]  {fruit/vegetable juice} => {soda} 0.01840366 0.2545710  1.459887 181  
## [5]  {shopping bags}         => {soda} 0.02460600 0.2497420  1.432194 242  
## [6]  {white bread}           => {soda} 0.01026945 0.2439614  1.399044 101  
## [7]  {pastry}                => {soda} 0.02104728 0.2365714  1.356665 207  
## [8]  {napkins}               => {soda} 0.01199797 0.2291262  1.313969 118  
## [9]  {bottled beer}          => {soda} 0.01698017 0.2108586  1.209209 167  
## [10] {rolls/buns}            => {soda} 0.03833249 0.2084024  1.195124 377  
## [11] {pork}                  => {soda} 0.01189629 0.2063492  1.183350 117

At a minimum confidence of 0.5, the algorithm again found no rules. Thus, I decreased the minimum confidence to 0.2 and got eleven rules. The strongest one seems to be {chocolate} => {soda}, with confidence 0.27 and lift 1.56. However, the rule with the highest support on the list is {rolls/buns} => {soda}.

Yogurt

rules.yogurt <- apriori(data=Groceries,  parameter=list(supp=0.01, conf = 0.3, target="rules"), appearance = list(default="lhs", rhs="yogurt"), control=list(verbose=F))

rules.yogurt.byconf <- sort(rules.yogurt, by="confidence", decreasing=TRUE)

inspect(head(rules.yogurt.byconf))

##     lhs                                      rhs      support    confidence lift     count
## [1] {whole milk,curd}                     => {yogurt} 0.01006609 0.3852140  2.761356  99  
## [2] {tropical fruit,whole milk}           => {yogurt} 0.01514997 0.3581731  2.567516 149  
## [3] {other vegetables,whipped/sour cream} => {yogurt} 0.01016777 0.3521127  2.524073 100  
## [4] {tropical fruit,other vegetables}     => {yogurt} 0.01230300 0.3427762  2.457146 121  
## [5] {whole milk,whipped/sour cream}       => {yogurt} 0.01087951 0.3375394  2.419607 107  
## [6] {citrus fruit,whole milk}             => {yogurt} 0.01026945 0.3366667  2.413350 101

The last analysed item is yogurt. Just as in the previous cases, I had to lower the minimum confidence, here to 0.3, since there were no association rules at all at the 0.5 level. Among the rules above, the strongest one by confidence is {whole milk,curd} => {yogurt}. It also has the highest lift, 2.76.

Moving to graphical analysis, the graphs of the rules mined above are shown below.

plot(rules.milk, method="graph", cex=0.7, shading="lift")

plot(rules.vege, method="graph", cex=0.7, shading="lift")

plot(rules.roll, method="graph", cex=0.7, shading="lift")

plot(rules.soda, method="graph", cex=0.7, shading="lift")

plot(rules.yogurt, method="graph", cex=0.7, shading="lift")

Starting from the first graph, which concerns whole milk,

The plots below show parallel coordinates. The meaning of the x axis is worth mentioning. Positions 3, 2 and 1 are associated with the lhs, i.e. the itemset which the customer already has in the basket, where 3 and 2 are the most recent additions and 1 is the item added previously.

plot(rules.milk, method="paracoord")

plot(rules.vege, method="paracoord")

plot(rules.roll, method="paracoord")

plot(rules.soda, method="paracoord")

plot(rules.yogurt, method="paracoord")

Hierarchical rules

Since the Groceries dataset contains two group levels in addition to item names, it is possible to conduct hierarchical rule mining. It is based on aggregating items into groups and checking whether any group of products is associated with another. Moreover, one can also analyse relationships between individual items and groups of items. I will start by mining rules in which both the antecedent and the consequent are groups of products.

The unique categories of level1 are as follows:

unique(Groceries@itemInfo[["level1"]])

##  [1] meat and sausage     fruit and vegetables fresh products       processed food       canned food         
##  [6] drinks               snacks and candies   detergent            perfumery            non-food            
## 10 Levels: canned food detergent drinks fresh products fruit and vegetables meat and sausage non-food ... snacks and candies

The unique categories of level2 are as follows:

unique(Groceries@itemInfo[["level2"]])

##  [1] sausage                         poultry                         pork                           
##  [4] beef                            fish                            fruit                          
##  [7] vegetables                      packaged fruit/vegetables       dairy produce                  
## [10] shelf-stable dairy              cheese                          delicatessen                   
## [13] frozen foods                    eggs                            bread and backed goods         
## [16] staple foods                    vinegar/oils                    sweetener                      
## [19] condiments                      soups/sauces                    health food                    
## [22] bakery improver                 pudding powder                  canned fruit/vegetables        
## [25] jam/sweet spreads               meat spreads                    canned fish                    
## [28] pet food/care                   baby food                       coffee                         
## [31] tea/cocoa drinks                non-alc. drinks                 beer                           
## [34] hard drinks                     wine                            snacks                         
## [37] long-life bakery products       chewing gum                     chocolate                      
## [40] candy                           seasonal products               detergent/softener             
## [43] cleaner                         bathroom cleaner                hair care                      
## [46] dental care                     cosmetics                       soap                           
## [49] personal hygiene                perfumery                       non-food kitchen               
## [52] non-food house keeping products games/books/hobby               garden                         
## [55] bags                           
## 55 Levels: baby food bags bakery improver bathroom cleaner beef beer bread and backed goods candy ... wine

Since level2 contains more categories than level1 (55 versus 10), I will use it in order to enrich the analysis.

trans_level2 <- aggregate(Groceries, by="level2")
inspect(head(trans_level2))

##     items                                                                  
## [1] {bread and backed goods,fruit,soups/sauces,vinegar/oils}               
## [2] {coffee,dairy produce,fruit}                                           
## [3] {dairy produce}                                                        
## [4] {cheese,dairy produce,fruit,meat spreads}                              
## [5] {dairy produce,long-life bakery products,shelf-stable dairy,vegetables}
## [6] {cleaner,dairy produce,staple foods}
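What the aggregation does to a single basket can be reproduced by hand. The sketch below uses the sixth basket of the dataset and an item-to-group lookup read off the outputs shown in this report. It is a plain base R illustration of the idea, not the arules implementation.

```r
# Item -> level2 group, as implied by the raw and aggregated sixth basket
level2 <- c(
  "whole milk"       = "dairy produce",
  "butter"           = "dairy produce",
  "yogurt"           = "dairy produce",
  "rice"             = "staple foods",
  "abrasive cleaner" = "cleaner"
)

basket <- c("whole milk", "butter", "yogurt", "rice", "abrasive cleaner")

# Replace each item by its group and drop duplicated groups
aggregated <- sort(unique(unname(level2[basket])))
aggregated
```

Five items collapse into three groups, because whole milk, butter and yogurt all belong to dairy produce.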

inspect(rules.by.conf2)

##      lhs                                         rhs                      support    confidence lift     count
## [1]  {fruit,vegetables}                       => {dairy produce}          0.07869853 0.7350427  1.659203  774 
## [2]  {bread and backed goods,fruit}           => {dairy produce}          0.07727504 0.7183365  1.621492  760 
## [3]  {bread and backed goods,vegetables}      => {dairy produce}          0.08195221 0.7051619  1.591753  806 
## [4]  {sausage,vegetables}                     => {dairy produce}          0.05266904 0.6906667  1.559033  518 
## [5]  {non-alc. drinks,vegetables}             => {dairy produce}          0.06446365 0.6817204  1.538839  634 
## [6]  {fruit,non-alc. drinks}                  => {dairy produce}          0.06375191 0.6807818  1.536720  627 
## [7]  {cheese}                                 => {dairy produce}          0.08459583 0.6677368  1.507274  832 
## [8]  {vinegar/oils}                           => {dairy produce}          0.05866802 0.6519774  1.471700  577 
## [9]  {fruit}                                  => {dairy produce}          0.15638027 0.6277551  1.417024 1538 
## [10] {vegetables}                             => {dairy produce}          0.17041179 0.6242086  1.409018 1676 
## [11] {bread and backed goods,sausage}         => {dairy produce}          0.06395526 0.6172718  1.393360  629 
## [12] {long-life bakery products}              => {dairy produce}          0.05002542 0.6007326  1.356026  492 
## [13] {dairy produce,sausage}                  => {bread and backed goods} 0.06395526 0.5956439  1.724002  629 
## [14] {bread and backed goods,non-alc. drinks} => {dairy produce}          0.07229283 0.5818331  1.313364  711 
## [15] {frozen foods}                           => {dairy produce}          0.06710727 0.5739130  1.295487  660 
## [16] {sausage}                                => {dairy produce}          0.10737163 0.5677419  1.281557 1056 
## [17] {sausage}                                => {bread and backed goods} 0.10360956 0.5478495  1.585668 1019 
## [18] {bread and backed goods}                 => {dairy produce}          0.18769700 0.5432607  1.226295 1846 
## [19] {dairy produce,fruit}                    => {vegetables}             0.07869853 0.5032510  1.843379  774 
## [20] {cheese}                                 => {bread and backed goods} 0.06365023 0.5024077  1.454144  626

Assuming 5% support and 0.5 confidence, the Apriori algorithm mined 20 association rules. The strongest one by confidence is {fruit, vegetables} => {dairy produce}, with support equal to 0.08. The highest lift, on the other hand, is observed for the {dairy produce, fruit} => {vegetables} rule.
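The calls that create rules.trans_level2 and rules.by.conf2 are not shown in this report. A sketch consistent with the thresholds stated above (5% support, 0.5 confidence) might look like this. The object names are taken from the surrounding code, while the remaining arguments are assumptions.

```r
library(arules)
data(Groceries)

# Aggregate items to their level2 groups, as in the code above
trans_level2 <- aggregate(Groceries, by = "level2")

# Assumed reconstruction of the missing calls
rules.trans_level2 <- apriori(trans_level2,
                              parameter = list(supp = 0.05, conf = 0.5, target = "rules"),
                              control = list(verbose = FALSE))
rules.by.conf2 <- sort(rules.trans_level2, by = "confidence", decreasing = TRUE)
rules.by.conf2
```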

Additionally, one can also visualize the results with a graph.

plot(rules.trans_level2, method="graph", cex=0.7, shading="lift")

The second part of the analysis will be looking for the relationships between individual items and groups of items. In order to do that, I will use addAggregate() function and then run Apriori algorithm.

multilevel <- addAggregate(Groceries, "level2")
inspect(head(multilevel)) # the * indicates group-level items

##     items                       
## [1] {citrus fruit,              
##      semi-finished bread,       
##      margarine,                 
##      ready soups,               
##      bread and backed goods*,   
##      fruit*,                    
##      soups/sauces*,             
##      vinegar/oils*}             
## [2] {tropical fruit,            
##      yogurt,                    
##      coffee,                    
##      coffee*,                   
##      dairy produce*,            
##      fruit*}                    
## [3] {whole milk,                
##      dairy produce*}            
## [4] {pip fruit,                 
##      yogurt,                    
##      cream cheese ,             
##      meat spreads,              
##      cheese*,                   
##      dairy produce*,            
##      fruit*,                    
##      meat spreads*}             
## [5] {other vegetables,          
##      whole milk,                
##      condensed milk,            
##      long life bakery product,  
##      dairy produce*,            
##      long-life bakery products*,
##      shelf-stable dairy*,       
##      vegetables*}               
## [6] {whole milk,                
##      butter,                    
##      yogurt,                    
##      rice,                      
##      abrasive cleaner,          
##      cleaner*,                  
##      dairy produce*,            
##      staple foods*}

inspect(head(rules_multilevel))

##     lhs              rhs              support    confidence lift      count
## [1] {canned beer} => {beer*}          0.07768175 1.0000000   6.428105 764  
## [2] {curd}        => {dairy produce*} 0.05327911 1.0000000   2.257287 524  
## [3] {coffee}      => {coffee*}        0.05805796 1.0000000  15.415361 571  
## [4] {coffee*}     => {coffee}         0.05805796 0.8949843  15.415361 571  
## [5] {beef}        => {beef*}          0.05246568 1.0000000  12.202233 516  
## [6] {beef*}       => {beef}           0.05246568 0.6401985  12.202233 516

It turns out that all of the printed rules are spurious, meaning that the lhs and rhs refer to the same product. For example, the first rule is {canned beer} => {beer*}, which says that if a customer buys canned beer, he also buys an item from the beer group. This is true by construction, hence the confidence of 1. In order to filter out such spurious results, one can use filterAggregate().

rules <- filterAggregate(multilevel)
rules

## transactions in sparse format with
##  0 transactions (rows) and
##  224 items (columns)

After filtering, nothing is left: the output shows 0 transactions. Each transaction contains an item together with its own group, so all of the transactions were in fact spurious and this analysis doesn’t bring added value.
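Note that the call above passes the transactions object to filterAggregate(). A minimal sketch of filtering the mined rules instead is given below. It assumes that the built-in Groceries data from arules, with its "level2" item hierarchy, can stand in for the data used here; the object names ending in _demo and the support and confidence thresholds are illustrative only.

```r
library(arules)

# Assumption: the built-in Groceries data and its "level2" hierarchy
# stand in for the multilevel transactions used in this article.
data("Groceries")
multilevel_demo <- addAggregate(Groceries, "level2")

# Illustrative thresholds, not the ones used for rules_multilevel above
rules_demo <- apriori(
  multilevel_demo,
  parameter = list(support = 0.05, confidence = 0.5),
  control = list(verbose = FALSE)
)

# Drop rules that contain both an item and its own group,
# e.g. {canned beer} => {beer*}
rules_filtered <- filterAggregate(rules_demo)

length(rules_demo)      # number of rules before filtering
length(rules_filtered)  # number of rules after filtering
```

Whether any rules survive depends on the chosen thresholds, so the counts should be checked rather than assumed.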

Other quality measures

Besides support, confidence and lift, there are various other measures of association rule quality. Among them, the Jaccard index and affinity are worth mentioning.

The Jaccard index tells how likely two items are to be bought together. It can be read as a conditional probability: the probability that a transaction contains both items, given that it contains at least one of them. The formal equation is as follows ^(http://michael.hahsler.net/research/association_rules/measures.html#jaccard):

\[Jaccard(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) + supp(Y) - supp(X \cup Y)}\]

In R, it can be calculated with the dissimilarity() function, setting “jaccard” as the method. Note that the function returns the Jaccard distance, i.e. 1 minus the Jaccard index.

trans <- Groceries[,itemFrequency(Groceries)>0.1]
jaccard <- dissimilarity(trans, which="items", method = "jaccard")
round(jaccard, 2)

##                  tropical fruit root vegetables other vegetables whole milk yogurt rolls/buns bottled water
## root vegetables            0.89                                                                            
## other vegetables           0.86            0.81                                                            
## whole milk                 0.87            0.85             0.80                                           
## yogurt                     0.86            0.88             0.85       0.83                                
## rolls/buns                 0.91            0.91             0.87       0.85   0.88                         
## bottled water              0.91            0.92             0.91       0.90   0.90       0.91              
## soda                       0.92            0.93             0.90       0.90   0.90       0.88          0.89

The result is a matrix of pairwise Jaccard distances. The higher the value, the less likely the two items are to occur in the same transaction. According to the output, soda and root vegetables (0.93) are the least likely to occur together.
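The link between the Jaccard index and the distance reported by dissimilarity() can be checked by hand. Below is a minimal base-R sketch on made-up baskets; the baskets matrix and the jaccard_index() helper are hypothetical and not part of arules. The result is compared with dist(..., method = "binary") from base R, which computes the same Jaccard distance on binary data.

```r
# Hypothetical toy data: rows are baskets, columns are items (1 = item bought)
baskets <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "soda"))
)

# Jaccard index computed from supports, following the formula above
jaccard_index <- function(x, y) {
  supp_x  <- mean(x == 1)           # share of baskets containing x
  supp_y  <- mean(y == 1)           # share of baskets containing y
  supp_xy <- mean(x == 1 & y == 1)  # share of baskets containing both
  supp_xy / (supp_x + supp_y - supp_xy)
}

idx <- jaccard_index(baskets[, "milk"], baskets[, "bread"])
dist_jaccard <- 1 - idx  # Jaccard distance = 1 - Jaccard index

# Cross-check: base R's "binary" distance between the item columns
dist_base <- as.matrix(dist(t(baskets), method = "binary"))["milk", "bread"]

c(index = idx, distance = dist_jaccard, base_r = dist_base)
```

Here milk and bread each appear in 4 of 5 baskets and together in 3, so the index is 0.6 and the distance is 0.4.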

Affinity, on the other hand, is a similarity measure: the higher the value, the more similar the items. The formal equation is as follows^(https://rdrr.io/cran/arules/man/affinity.html):

\[A(X,Y) = \frac{supp(X, Y)}{supp(X)+supp(Y)-supp(X, Y)}\]

aff <- affinity(trans)
round(aff, 2)

## An object of class "ar_similarity"
##                  tropical fruit root vegetables other vegetables whole milk yogurt rolls/buns bottled water soda
## tropical fruit             0.00            0.11             0.14       0.13   0.14       0.09          0.09 0.08
## root vegetables            0.11            0.00             0.19       0.15   0.12       0.09          0.08 0.07
## other vegetables           0.14            0.19             0.00       0.20   0.15       0.13          0.09 0.10
## whole milk                 0.13            0.15             0.20       0.00   0.17       0.15          0.10 0.10
## yogurt                     0.14            0.12             0.15       0.17   0.00       0.12          0.10 0.10
## rolls/buns                 0.09            0.09             0.13       0.15   0.12       0.00          0.09 0.12
## bottled water              0.09            0.08             0.09       0.10   0.10       0.09          0.00 0.11
## soda                       0.08            0.07             0.10       0.10   0.10       0.12          0.11 0.00
## Slot "method":
## [1] "Affinity"

One can easily spot that, for each pair of items, the Jaccard dissimilarity and the affinity measure sum to 1. The maximum similarity is observed for whole milk and other vegetables (0.20), which confirms the previous conclusions.
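This follows directly from the two definitions, assuming that dissimilarity() with the “jaccard” method reports the distance \(d_J = 1 - Jaccard\):

\[d_J(X,Y) + A(X,Y) = \left(1 - \frac{supp(X, Y)}{supp(X)+supp(Y)-supp(X, Y)}\right) + \frac{supp(X, Y)}{supp(X)+supp(Y)-supp(X, Y)} = 1\]

Taking the values from the two outputs above: for whole milk and other vegetables, 0.80 + 0.20 = 1, and for soda and root vegetables, 0.93 + 0.07 = 1.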

The last part of the quality measures analysis is a visualization of the affinity matrix. The redder the rectangle, the more similar the items.

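# Heatmap of the affinity matrix; the default axes are replaced with item labels
image(aff, axes = FALSE)
axis(1, at = seq(0, 1, length.out = ncol(aff)), labels = rownames(aff), cex.axis = 0.6, las = 2)
axis(2, at = seq(0, 1, length.out = ncol(aff)), labels = rownames(aff), cex.axis = 0.6, las = 1)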

Most of the rectangles are yellow and orange, which is as expected: the items in the data are not very similar to each other, and no pair has an affinity higher than 0.2.

Conclusions

Since association rules are very useful in setting a store's strategy, I will draw a few conclusions from the above analysis.

Concerning the least sold products, which are baby food, sound storage medium, preservation products, kitchen utensil, bags, frozen chicken, baby cosmetics, toilet cleaner, salad dressing and whisky, the store managers should consider:
- withdrawing them from sale in order not to warehouse unprofitable items,
- setting up a sales promotion (3 for 2, or the third one for 50% of the price) or a promotion combining these products with more popular ones (e.g. vegetables with salad dressing, soda with whisky, etc.),
- creating an attractive advertisement or a thematic booth in the store (e.g. “Everything you need in your kitchen” with an Ikea-like exhibition).

Concerning the association rules, the store managers should consider:
- placing the booth with dairy products near fruits, vegetables, baked goods and sausages,
- running cross-marketing campaigns,
- informing customers who bought any of the above-mentioned products about promotions on the other ones.