Introduction

Association rule mining looks for frequently occurring patterns, correlations, or associations in datasets. It analyzes the data for frequent patterns in order to uncover the underlying relationships.

Association rules are most frequently used in market basket analysis. This analysis examines customers’ purchasing behavior and determines purchasing patterns by measuring how often products are bought together.

In this project we will apply association rule mining with the Apriori algorithm to consumers’ market baskets. By analyzing the baskets, we will gain insights into consumer purchasing behavior.

Exploratory Data Analysis

This part presents an exploratory data analysis of the data, using summary statistics to get insight into its distribution. We will analyze the data with R code and add visualizations to examine it more closely.

Review of data set

The Groceries dataset comes from the “arules” R package. It has 9835 rows (transactions) and 169 items.

Loading Packages

We install any missing packages and load them into our environment.

requiredPackages <- c("dplyr", "tidyverse", "ggplot2", "readr", "knitr", "corrplot", "RColorBrewer", "arules", "arulesViz", "rmarkdown")
# Install each package only if it is missing, then load it
for (pkg in requiredPackages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

Statistical Analysis of Data

First, we will get the Groceries dataset from library(arules).

data(Groceries)
transactiondata <- Groceries

We read our data, and assigned it to the “transactiondata” variable.

Let’s look at the dimensions of the dataset.

dim(transactiondata)
## [1] 9835  169

Let’s get the names of the products in the data set.

itemLabels(transactiondata)
##   [1] "frankfurter"               "sausage"                  
##   [3] "liver loaf"                "ham"                      
##   [5] "meat"                      "finished products"        
##   [7] "organic sausage"           "chicken"                  
##   [9] "turkey"                    "pork"                     
##  [11] "beef"                      "hamburger meat"           
##  [13] "fish"                      "citrus fruit"             
##  [15] "tropical fruit"            "pip fruit"                
##  [17] "grapes"                    "berries"                  
##  [19] "nuts/prunes"               "root vegetables"          
##  [21] "onions"                    "herbs"                    
##  [23] "other vegetables"          "packaged fruit/vegetables"
##  [25] "whole milk"                "butter"                   
##  [27] "curd"                      "dessert"                  
##  [29] "butter milk"               "yogurt"                   
##  [31] "whipped/sour cream"        "beverages"                
##  [33] "UHT-milk"                  "condensed milk"           
##  [35] "cream"                     "soft cheese"              
##  [37] "sliced cheese"             "hard cheese"              
##  [39] "cream cheese "             "processed cheese"         
##  [41] "spread cheese"             "curd cheese"              
##  [43] "specialty cheese"          "mayonnaise"               
##  [45] "salad dressing"            "tidbits"                  
##  [47] "frozen vegetables"         "frozen fruits"            
##  [49] "frozen meals"              "frozen fish"              
##  [51] "frozen chicken"            "ice cream"                
##  [53] "frozen dessert"            "frozen potato products"   
##  [55] "domestic eggs"             "rolls/buns"               
##  [57] "white bread"               "brown bread"              
##  [59] "pastry"                    "roll products "           
##  [61] "semi-finished bread"       "zwieback"                 
##  [63] "potato products"           "flour"                    
##  [65] "salt"                      "rice"                     
##  [67] "pasta"                     "vinegar"                  
##  [69] "oil"                       "margarine"                
##  [71] "specialty fat"             "sugar"                    
##  [73] "artif. sweetener"          "honey"                    
##  [75] "mustard"                   "ketchup"                  
##  [77] "spices"                    "soups"                    
##  [79] "ready soups"               "Instant food products"    
##  [81] "sauces"                    "cereals"                  
##  [83] "organic products"          "baking powder"            
##  [85] "preservation products"     "pudding powder"           
##  [87] "canned vegetables"         "canned fruit"             
##  [89] "pickled vegetables"        "specialty vegetables"     
##  [91] "jam"                       "sweet spreads"            
##  [93] "meat spreads"              "canned fish"              
##  [95] "dog food"                  "cat food"                 
##  [97] "pet care"                  "baby food"                
##  [99] "coffee"                    "instant coffee"           
## [101] "tea"                       "cocoa drinks"             
## [103] "bottled water"             "soda"                     
## [105] "misc. beverages"           "fruit/vegetable juice"    
## [107] "syrup"                     "bottled beer"             
## [109] "canned beer"               "brandy"                   
## [111] "whisky"                    "liquor"                   
## [113] "rum"                       "liqueur"                  
## [115] "liquor (appetizer)"        "white wine"               
## [117] "red/blush wine"            "prosecco"                 
## [119] "sparkling wine"            "salty snack"              
## [121] "popcorn"                   "nut snack"                
## [123] "snack products"            "long life bakery product" 
## [125] "waffles"                   "cake bar"                 
## [127] "chewing gum"               "chocolate"                
## [129] "cooking chocolate"         "specialty chocolate"      
## [131] "specialty bar"             "chocolate marshmallow"    
## [133] "candy"                     "seasonal products"        
## [135] "detergent"                 "softener"                 
## [137] "decalcifier"               "dish cleaner"             
## [139] "abrasive cleaner"          "cleaner"                  
## [141] "toilet cleaner"            "bathroom cleaner"         
## [143] "hair spray"                "dental care"              
## [145] "male cosmetics"            "make up remover"          
## [147] "skin care"                 "female sanitary products" 
## [149] "baby cosmetics"            "soap"                     
## [151] "rubbing alcohol"           "hygiene articles"         
## [153] "napkins"                   "dishes"                   
## [155] "cookware"                  "kitchen utensil"          
## [157] "cling film/bags"           "kitchen towels"           
## [159] "house keeping products"    "candles"                  
## [161] "light bulbs"               "sound storage medium"     
## [163] "newspapers"                "photo/film"               
## [165] "pot plants"                "flower soil/fertilizer"   
## [167] "flower (seeds)"            "shopping bags"            
## [169] "bags"

Let’s see the average number of items in a transaction.

mean(size(transactiondata))
## [1] 4.409456

Let’s continue to look at the dataset in more detail.

summary(transactiondata)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
str(transactiondata)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  3 variables:
##   .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
##   .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
##   .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables
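
The density and the mean basket size reported by summary() can be cross-checked by hand. The following is a minimal sketch in base R; it takes the number of non-zero entries (43367) from the length of the @i slot in the str() output above, and the variable names are chosen here for illustration only.

```r
# Values read from the output above
n_transactions <- 9835
n_items        <- 169
n_nonzero      <- 43367   # length of the @i slot in str(transactiondata)

# Density: share of filled cells in the transaction-by-item matrix
density <- n_nonzero / (n_transactions * n_items)

# Mean basket size: purchased items per transaction
mean_size <- n_nonzero / n_transactions

round(density, 8)    # 0.02609146
round(mean_size, 6)  # 4.409456
```

Both values agree with the summary, which is a quick sanity check that the sparse matrix is being read as expected.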

The plot produced by the code below shows the 10 most frequently purchased items.

itemFrequencyPlot(transactiondata, topN = 10,
                  col = "blue",
                  main = "Item Frequency",
                  type = "relative",
                  ylab = "Relative Frequency")

Let’s look at the purchase frequency of all our items.

items <- itemFrequency(transactiondata, type = "absolute")
sorteddata <- sort(items, decreasing = TRUE)

barplot(sorteddata, main = "Frequency for All Items", xlab = "Items", ylab = "Frequency", col = "blue", cex.names = 0.01)

Our model will work better if we remove rarely purchased items. So let’s set a threshold and drop the products whose relative frequency is not above 0.01, that is, items that appear in at most 1% of transactions.

transactiondata <- transactiondata[, itemFrequency(transactiondata)>0.01]
dim(transactiondata)
## [1] 9835   88
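
To make the filtering step concrete, here is a small sketch of the same idea in base R on a made-up logical matrix. The data, the item names, and the 0.25 threshold are hypothetical and only serve the illustration; the analysis above uses itemFrequency() on the real transactions with a threshold of 0.01.

```r
# Toy transaction-by-item matrix (hypothetical data): rows are baskets, columns are items
baskets <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "caviar"))
)

# Relative frequency of each item = share of baskets that contain it
item_freq <- colMeans(baskets)

# Keep only the items above the threshold (0.25 here, 0.01 in the analysis above)
kept <- baskets[, item_freq > 0.25, drop = FALSE]
colnames(kept)

# In the Groceries data, a relative frequency above 0.01 means more than this many baskets
0.01 * 9835
```

In the toy data only "caviar" falls below the threshold and is dropped. For Groceries, the last line gives 98.35, so an item has to appear in at least 99 baskets to be kept.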

With the rare items removed, 88 items remain and we can continue with the model.

Apriori Algorithm

The Apriori algorithm is one of the association rule mining methods. Its name comes from “a priori”, meaning “prior”, because each step uses information obtained in the previous step: the frequent itemsets found at one level are used to build the candidates for the next.

The Apriori algorithm was developed for mining very large databases. Its purpose is to reveal associations between the items that occur together in database records (transactions).
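
The level-wise idea can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy baskets, not the implementation used by the arules package: it builds candidates of size k from the frequent itemsets of size k - 1 and keeps those that reach the minimum support, but it skips the usual subset-pruning optimization. All names and data here are assumptions made for the example.

```r
# Hypothetical toy baskets; each element is one transaction
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter")
)
min_support <- 0.5

# Share of baskets that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support(s) >= min_support, as.list(items))
all_frequent <- frequent

# Level k: build candidates only from the frequent itemsets of level k - 1
while (length(frequent) > 1) {
  k <- length(frequent[[1]]) + 1
  pairs <- combn(length(frequent), 2, simplify = FALSE)
  candidates <- unique(lapply(pairs, function(p) {
    sort(union(frequent[[p[1]]], frequent[[p[2]]]))
  }))
  candidates <- Filter(function(s) length(s) == k, candidates)
  frequent <- Filter(function(s) support(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
}

# Print each frequent itemset with its support
for (s in all_frequent) {
  cat("{", paste(s, collapse = ", "), "}", support(s), "\n")
}
```

With these baskets the three single items and the three pairs are frequent, while the triple appears in only two of five baskets and is rejected, so the search stops at level 3.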

Let’s start building the model.

Two criteria are used to measure the strength of the relationship between purchased products in market basket analysis: support and confidence. The support of a rule X => Y is the share of transactions that contain both X and Y; its confidence is the share of transactions containing X that also contain Y.

Let’s use a minimum support of 0.01 and a minimum confidence of 0.35 in the model.
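
As a minimal sketch of how these measures are computed, the following base R example works them out on hypothetical toy baskets. It also computes lift, which appears in the model output later in this report. The data and the helper function are assumptions made for illustration.

```r
# Hypothetical toy baskets
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "bread")
)

# Share of baskets that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

lhs <- "bread"   # X
rhs <- "milk"    # Y

supp_rule <- support(c(lhs, rhs))       # share of baskets with both X and Y
conf_rule <- supp_rule / support(lhs)   # share of X baskets that also contain Y
lift_rule <- conf_rule / support(rhs)   # confidence relative to the base rate of Y

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

Here the rule {bread} => {milk} has a confidence of 0.75, but its lift is below 1 because milk is already in 80% of all baskets. This is why lift is worth reading alongside confidence.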

# Minimum support 0.01, minimum confidence 0.35.
modelmba <- apriori(transactiondata, parameter = list(supp = 0.01, conf = 0.35, maxlen = 10, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.35    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [89 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(modelmba)
## set of 89 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 44 45 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.506   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.3510   Min.   :0.01729   Min.   :1.388  
##  1st Qu.:0.01149   1st Qu.:0.3889   1st Qu.:0.02613   1st Qu.:1.617  
##  Median :0.01423   Median :0.4201   Median :0.03294   Median :1.868  
##  Mean   :0.01825   Mean   :0.4372   Mean   :0.04309   Mean   :1.937  
##  3rd Qu.:0.02166   3rd Qu.:0.4740   3rd Qu.:0.05247   3rd Qu.:2.155  
##  Max.   :0.07483   Max.   :0.5862   Max.   :0.19349   Max.   :3.295  
##      count      
##  Min.   : 99.0  
##  1st Qu.:113.0  
##  Median :140.0  
##  Mean   :179.4  
##  3rd Qu.:213.0  
##  Max.   :736.0  
## 
## mining info:
##             data ntransactions support confidence
##  transactiondata          9835    0.01       0.35
##                                                                                                        call
##  apriori(data = transactiondata, parameter = list(supp = 0.01, conf = 0.35, maxlen = 10, target = "rules"))

We see that the model consists of 89 rules: 44 rules of length 2 and 45 rules of length 3, where the length counts the items on both the lhs (X) and the rhs (Y), so 44 + 45 = 89.

The mean lift is 1.937, and even the minimum lift (1.388) is above 1. A lift above 1 means that when product X is purchased, product Y is more likely to be purchased than it is in general. Confidence and support are above the chosen thresholds for every rule, as the minimum values in the summary confirm.
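
The rule length distribution in the summary can be checked with a little arithmetic: 44 rules of length 2 and 45 rules of length 3 should add up to 89 rules, with a mean length matching the 2.506 reported above. A short sketch in base R, with variable names chosen for illustration:

```r
# Rule length distribution from summary(modelmba): length -> number of rules
rule_counts <- c("2" = 44, "3" = 45)

total_rules <- sum(rule_counts)
mean_length <- sum(as.numeric(names(rule_counts)) * rule_counts) / total_rules

total_rules             # 89
round(mean_length, 3)   # 2.506
```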

The graph below shows the confidence and support criterion values and lift values.

plot(modelmba, main="89 Rules")

Let’s see only the confidence values in the graph.

plot(modelmba, measure = "confidence", main="89 Rules")

Below are network graphs of the first 20 and the last 10 rules, followed by a grouped matrix of the first 20 rules.

set.seed(123)
# Network graph of the first 20 rules in the model
# (`main` is not a control parameter of the graph method, so no title is passed)
plot(head(modelmba, 20), method = "graph")

set.seed(123)
# Network graph of the last 10 rules in the model
# (`main` is not a control parameter of the graph method, so no title is passed)
plot(tail(modelmba, 10), method = "graph")

set.seed(123)
# Grouped matrix of the first 20 rules in the model
# (`main` is not a control parameter of the grouped method, so no title is passed)
plot(head(modelmba, 20), method = "grouped")

The tables below show the top 15 association rules generated by the Apriori algorithm, sorted in turn by confidence, lift, and support, with their corresponding metrics.

rules_v2 = sort(modelmba, by = "confidence")[1:15]
inspect(rules_v2)
##      lhs                                       rhs                support   
## [1]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [2]  {tropical fruit, root vegetables}      => {other vegetables} 0.01230300
## [3]  {curd, yogurt}                         => {whole milk}       0.01006609
## [4]  {other vegetables, butter}             => {whole milk}       0.01148958
## [5]  {tropical fruit, root vegetables}      => {whole milk}       0.01199797
## [6]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [7]  {other vegetables, domestic eggs}      => {whole milk}       0.01230300
## [8]  {yogurt, whipped/sour cream}           => {whole milk}       0.01087951
## [9]  {root vegetables, rolls/buns}          => {whole milk}       0.01270971
## [10] {pip fruit, other vegetables}          => {whole milk}       0.01352313
## [11] {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [12] {other vegetables, yogurt}             => {whole milk}       0.02226741
## [13] {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
## [14] {root vegetables, rolls/buns}          => {other vegetables} 0.01220132
## [15] {root vegetables, yogurt}              => {other vegetables} 0.01291307
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5823529  0.01728521 2.279125  99  
## [4]  0.5736041  0.02003050 2.244885 113  
## [5]  0.5700483  0.02104728 2.230969 118  
## [6]  0.5629921  0.02582613 2.203354 143  
## [7]  0.5525114  0.02226741 2.162336 121  
## [8]  0.5245098  0.02074225 2.052747 107  
## [9]  0.5230126  0.02430097 2.046888 125  
## [10] 0.5175097  0.02613116 2.025351 133  
## [11] 0.5173611  0.02928317 2.024770 149  
## [12] 0.5128806  0.04341637 2.007235 219  
## [13] 0.5070423  0.02887646 1.984385 144  
## [14] 0.5020921  0.02430097 2.594890 120  
## [15] 0.5000000  0.02582613 2.584078 127

As an example, the first rule, {citrus fruit, root vegetables} => {other vegetables}, has the highest confidence (0.586): about 58.6% of the transactions that contain citrus fruit and root vegetables also contain other vegetables.
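
The confidence of this rule can be reproduced from the other columns of the table. The sketch below, in base R with illustrative variable names, recovers the number of baskets containing the left-hand side from the coverage and divides the rule count by it.

```r
n_transactions <- 9835

# First rule of the confidence-sorted table above
rule_count <- 102          # baskets with citrus fruit, root vegetables and other vegetables
coverage   <- 0.01769192   # share of baskets with citrus fruit and root vegetables

# Number of baskets that contain the left-hand side
lhs_count  <- round(coverage * n_transactions)

# Confidence: share of left-hand-side baskets that also contain the right-hand side
confidence <- rule_count / lhs_count

lhs_count              # 174
round(confidence, 7)   # 0.5862069
```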

rules_v2 = sort(modelmba, by = "lift")[1:15]
inspect(rules_v2)
##      lhs                                       rhs                support   
## [1]  {citrus fruit, other vegetables}       => {root vegetables}  0.01037112
## [2]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [3]  {tropical fruit, root vegetables}      => {other vegetables} 0.01230300
## [4]  {whole milk, curd}                     => {yogurt}           0.01006609
## [5]  {root vegetables, rolls/buns}          => {other vegetables} 0.01220132
## [6]  {root vegetables, yogurt}              => {other vegetables} 0.01291307
## [7]  {tropical fruit, whole milk}           => {yogurt}           0.01514997
## [8]  {yogurt, whipped/sour cream}           => {other vegetables} 0.01016777
## [9]  {other vegetables, whipped/sour cream} => {yogurt}           0.01016777
## [10] {root vegetables, whole milk}          => {other vegetables} 0.02318251
## [11] {onions}                               => {other vegetables} 0.01423488
## [12] {pork, whole milk}                     => {other vegetables} 0.01016777
## [13] {whole milk, whipped/sour cream}       => {other vegetables} 0.01464159
## [14] {pip fruit, whole milk}                => {other vegetables} 0.01352313
## [15] {curd, yogurt}                         => {whole milk}       0.01006609
##      confidence coverage   lift     count
## [1]  0.3591549  0.02887646 3.295045 102  
## [2]  0.5862069  0.01769192 3.029608 102  
## [3]  0.5845411  0.02104728 3.020999 121  
## [4]  0.3852140  0.02613116 2.761356  99  
## [5]  0.5020921  0.02430097 2.594890 120  
## [6]  0.5000000  0.02582613 2.584078 127  
## [7]  0.3581731  0.04229792 2.567516 149  
## [8]  0.4901961  0.02074225 2.533410 100  
## [9]  0.3521127  0.02887646 2.524073 100  
## [10] 0.4740125  0.04890696 2.449770 228  
## [11] 0.4590164  0.03101169 2.372268 140  
## [12] 0.4587156  0.02216573 2.370714 100  
## [13] 0.4542587  0.03223183 2.347679 144  
## [14] 0.4493243  0.03009659 2.322178 133  
## [15] 0.5823529  0.01728521 2.279125  99

As an example, the first rule in the table shows that customers who purchase {citrus fruit, other vegetables} are about 3.3 times as likely to purchase {root vegetables} as customers in general (lift 3.295).
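
The lift value follows from dividing the rule's confidence by the overall support of the right-hand side. A small sketch in base R, assuming the support of {root vegetables} is read from the coverage column of the {root vegetables} rules in the support-sorted table further below; the variable names are illustrative.

```r
# First rule of the lift-sorted table: {citrus fruit, other vegetables} => {root vegetables}
confidence  <- 0.3591549

# Share of all baskets that contain root vegetables
# (coverage of the {root vegetables} rules in the support-sorted table)
rhs_support <- 0.10899847

# Lift: how much more likely the right-hand side is, given the left-hand side
lift <- confidence / rhs_support
round(lift, 3)   # 3.295
```

Root vegetables appear in about 10.9% of all baskets but in about 35.9% of the baskets that contain citrus fruit and other vegetables, which is where the factor of roughly 3.3 comes from.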

rules_v2 = sort(modelmba, by = "support")[1:15]
inspect(rules_v2)
##      lhs                        rhs                support    confidence
## [1]  {other vegetables}      => {whole milk}       0.07483477 0.3867578 
## [2]  {yogurt}                => {whole milk}       0.05602440 0.4016035 
## [3]  {root vegetables}       => {whole milk}       0.04890696 0.4486940 
## [4]  {root vegetables}       => {other vegetables} 0.04738180 0.4347015 
## [5]  {tropical fruit}        => {whole milk}       0.04229792 0.4031008 
## [6]  {pastry}                => {whole milk}       0.03324860 0.3737143 
## [7]  {whipped/sour cream}    => {whole milk}       0.03223183 0.4496454 
## [8]  {citrus fruit}          => {whole milk}       0.03050330 0.3685504 
## [9]  {pip fruit}             => {whole milk}       0.03009659 0.3978495 
## [10] {domestic eggs}         => {whole milk}       0.02999492 0.4727564 
## [11] {whipped/sour cream}    => {other vegetables} 0.02887646 0.4028369 
## [12] {butter}                => {whole milk}       0.02755465 0.4972477 
## [13] {fruit/vegetable juice} => {whole milk}       0.02663955 0.3684951 
## [14] {curd}                  => {whole milk}       0.02613116 0.4904580 
## [15] {brown bread}           => {whole milk}       0.02521607 0.3887147 
##      coverage   lift     count
## [1]  0.19349263 1.513634 736  
## [2]  0.13950178 1.571735 551  
## [3]  0.10899847 1.756031 481  
## [4]  0.10899847 2.246605 466  
## [5]  0.10493137 1.577595 416  
## [6]  0.08896797 1.462587 327  
## [7]  0.07168277 1.759754 317  
## [8]  0.08276563 1.442377 300  
## [9]  0.07564820 1.557043 296  
## [10] 0.06344687 1.850203 295  
## [11] 0.07168277 2.081924 284  
## [12] 0.05541434 1.946053 271  
## [13] 0.07229283 1.442160 262  
## [14] 0.05327911 1.919481 257  
## [15] 0.06487036 1.521293 248

As an example, the first row shows that {other vegetables} and {whole milk} were bought together in 736 transactions, which is 7.48% of all transactions in the dataset.

Conclusion

I analyzed the association rules between different items, focusing on the most common ones. The results, especially the last three tables above, reveal the strongest rules, which can be used to improve product placement in the store.

inspectDT(head(modelmba, by = "lift", 10))

References

Lecture notes of Jacek Lewkowicz - Unsupervised Learning-University of Warsaw.

https://www.datacamp.com/community/tutorials/market-basket-analysis-r

https://www.geeksforgeeks.org/apriori-algorithm-in-r-programming/

https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce