Association Rules Mining

Introduction

Market Basket Analysis is one of the main techniques that large retailers use to uncover relationships between products. It works by searching for combinations of products that frequently appear together in transactions, which lets a retailer see which purchases tend to go together. Association rules are widely used to analyse retail basket or transaction data. The aim is to identify strong rules in the transaction data, where the strength of a rule is judged with measures of interestingness such as support, confidence and lift.

Dataset Details

The dataset contains 38765 rows, each recording one item purchased by a customer at a grocery retailer. Market Basket Analysis, using algorithms such as Apriori, can be applied to these purchases to generate association rules between items.

Apriori Algorithm

Apriori is an algorithm for mining frequent itemsets and learning association rules over relational databases. It first identifies the individual items that occur frequently in the database and then extends them to progressively larger itemsets, as long as those itemsets still occur frequently enough. The frequent itemsets found by Apriori can then be used to derive association rules that highlight general patterns in the database, which has applications in fields such as market basket analysis.
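
The level-wise search described above can be sketched in a few lines of base R. This is only an illustration on hypothetical toy baskets, not the implementation used by the arules package: the names `baskets`, `min_support` and `support_of` are invented for the example, and the usual candidate-pruning step is left out because every candidate's support is counted directly.

```r
# Toy baskets (hypothetical data, not taken from Groceries_dataset.csv)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "soda"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "soda")
)
min_support <- 0.5  # an itemset must appear in at least half of the baskets

# Fraction of baskets that contain every item of `itemset`
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support_of(s) >= min_support, as.list(items))
all_frequent <- frequent

# Level k: join pairs of frequent (k-1)-itemsets and keep the frequent unions
k <- 2
while (length(frequent) > 1) {
  candidates <- unique(lapply(
    combn(length(frequent), 2, simplify = FALSE),
    function(idx) sort(union(frequent[[idx[1]]], frequent[[idx[2]]]))
  ))
  candidates <- Filter(function(s) length(s) == k, candidates)
  frequent <- Filter(function(s) support_of(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

frequent_labels <- vapply(all_frequent, paste, character(1), collapse = ", ")
print(frequent_labels)
```

On this toy data the search stops at level 3, because the only candidate of three items falls below the support threshold.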

Results & Discussion

First, the dataset is read and inspected:

data <- read.csv("Groceries_dataset.csv")
head(data)

The dimensions of the dataset are checked.

dim(data)

## [1] 38765     3

The summary of the dataset is obtained.

summary(data)

##  Member_number      Date           itemDescription   
##  Min.   :1000   Length:38765       Length:38765      
##  1st Qu.:2002   Class :character   Class :character  
##  Median :3005   Mode  :character   Mode  :character  
##  Mean   :3004                                        
##  3rd Qu.:4007                                        
##  Max.   :5000

The first column (Member_number) is removed, and the number of unique items in the dataset is checked: there are 167.

data <- data[,c(2:3)]
length(unique(data$itemDescription))

## [1] 167

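Next, the data are read as transactions and the summary is printed. Because columns 2 and 3 (Date and itemDescription) are used, each transaction groups all items purchased on one date. The summary shows that whole milk and other vegetables are the most frequently occurring items, and the median transaction size is 35 items. Note that the summary reports 168 items against the 167 unique descriptions counted above, along with a single transaction of length 1; this suggests that the CSV header row was read as a transaction, which passing header = TRUE to read.transactions should avoid.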

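library(arules)  # provides read.transactions(), eclat(), ruleInduction(), apriori()

# Column 2 (Date) is used as the transaction id, column 3 (itemDescription) as the item
transactions_data <- read.transactions("Groceries_dataset.csv",
                                       format = "single", 
                                       sep = ",", 
                                       cols = c(2, 3))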
summary(transactions_data)

## transactions as itemMatrix in sparse format with
##  729 rows (elements/itemsets/transactions) and
##  168 columns (items) and a density of 0.2060226 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##              697              666              649              617 
##           yogurt          (Other) 
##              600            22003 
## 
## element (itemset/transaction) length distribution:
## sizes
##  1 17 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
##  1  1  2  3  6  6  5 11 16 24 21 32 33 24 45 42 47 33 61 34 47 41 30 40 23 22 
## 43 44 45 46 47 48 49 51 52 55 56 59 
## 22 16 10  5 11  3  6  1  2  1  1  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   30.00   35.00   34.61   39.00   59.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1    01-01-2014
## 2    01-01-2015
## 3    01-02-2014
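
To make the "single" format more concrete, the base R sketch below groups a long table into one basket per date, which is roughly what happens when the Date column is used as the transaction id. The rows in `toy` are hypothetical and the helper names are assumptions made for the illustration; the real work is done inside read.transactions.

```r
# Toy long-format data mirroring the Date and itemDescription columns
# (hypothetical rows, not taken from Groceries_dataset.csv)
toy <- data.frame(
  Date = c("01-01-2014", "01-01-2014", "01-01-2014", "02-01-2014", "02-01-2014"),
  itemDescription = c("whole milk", "soda", "whole milk", "yogurt", "soda"),
  stringsAsFactors = FALSE
)

# One basket per date; an item repeated within a date is kept only once,
# because a transaction records presence or absence, not quantity
toy_baskets <- lapply(split(toy$itemDescription, toy$Date), unique)

# Number of distinct items in each basket
basket_sizes <- lengths(toy_baskets)
print(basket_sizes)
```

Grouping by date rather than by customer and date produces few, very large baskets, which is why the transaction sizes in the summary above are so high.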

The five most frequent items in the dataset are plotted below. Whole milk, other vegetables and rolls/buns are the three most frequently occurring items.

itemFrequencyPlot(transactions_data, 
                  topN = 5, 
                  type = "absolute", 
                  main = "Top 5 items by frequency", 
                  col = "steel blue")

Frequent itemsets must be mined before rules can be derived from the data. I use the Eclat algorithm for this step; it is an alternative to Apriori that searches depth-first over lists of transaction ids and is often faster on large databases. Eclat returns the itemsets that occur together most often in the transactions. From the output it can be seen that 16557 frequent itemsets are found with a minimum support of 0.20 and a maximum length of 10.

association_rules <- eclat(transactions_data, 
                           parameter = list(supp = 0.20, 
                                            maxlen = 10))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE     0.2      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 145 
## 
## create itemset ... 
## set transactions ...[168 item(s), 729 transaction(s)] done [0.00s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 729 column(s)] done [0.00s].
## writing  ... [16557 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
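
The idea behind Eclat can be illustrated with a small base R sketch: each item is stored with the ids of the transactions that contain it, and the support of an itemset is obtained by intersecting those id lists. The data and the names `tid_lists`, `n_baskets` and `itemset_support` are hypothetical; this is not how arules implements the algorithm internally.

```r
# Vertical layout: for each item, the ids of the baskets that contain it
# (hypothetical toy data)
tid_lists <- list(
  milk   = c(1, 2, 3, 5),
  bread  = c(1, 2, 4, 5),
  butter = c(1, 4, 5)
)
n_baskets <- 5

# Support of an itemset = size of the intersection of its id lists / n
itemset_support <- function(items) {
  tids <- Reduce(intersect, tid_lists[items])
  length(tids) / n_baskets
}

support_milk_bread <- itemset_support(c("milk", "bread"))
support_all_three  <- itemset_support(c("milk", "bread", "butter"))
print(c(support_milk_bread, support_all_three))
```

Because the id list of a larger itemset is built from the lists of its subsets, the data do not have to be rescanned for every candidate.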

Next, the top 10 itemsets by support are printed below. Support is the proportion of transactions that contain the itemset, whereas count is the absolute number of transactions that contain it. The table shows that the vast majority of transactions contain whole milk (about 96%) and other vegetables (about 91%).

inspect(head(sort(association_rules, 
                  by = "support"), 
             10))

##      items                          support   count
## [1]  {whole milk}                   0.9561043 697  
## [2]  {other vegetables}             0.9135802 666  
## [3]  {rolls/buns}                   0.8902606 649  
## [4]  {other vegetables, whole milk} 0.8737997 637  
## [5]  {rolls/buns, whole milk}       0.8504801 620  
## [6]  {soda}                         0.8463649 617  
## [7]  {yogurt}                       0.8230453 600  
## [8]  {soda, whole milk}             0.8161866 595  
## [9]  {other vegetables, rolls/buns} 0.8148148 594  
## [10] {whole milk, yogurt}           0.7873800 574
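
The relationship between the two columns can be checked by hand: support is simply count divided by the number of transactions. The short sketch below redoes that arithmetic for three rows of the table, using the 729 transactions reported by the summary; the variable names are chosen for the illustration only.

```r
n_transactions <- 729  # rows reported by summary(transactions_data)

# Counts copied from rows [1], [2] and [4] of the table above
counts <- c("whole milk" = 697,
            "other vegetables" = 666,
            "other vegetables, whole milk" = 637)

# Support is the share of transactions that contain the itemset
supports <- counts / n_transactions
print(round(supports, 7))
```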

Confidence is the probability that A and B occur together divided by the probability of A; in other words, it estimates how likely B is given A. According to the results, the confidence of buying whole milk reaches 1 for several antecedents, for example when a transaction contains chicken, citrus fruit, tropical fruit and yogurt. Because the majority of transactions contain whole milk, most of the rules naturally have whole milk as their consequent, and the lift values of about 1.04 show that these antecedents raise the likelihood of whole milk only slightly.

frequent_rules <- ruleInduction(association_rules, 
                                transactions_data, 
                                confidence = 0.8)
inspect(head(sort(frequent_rules, 
                  by = "confidence", 
                  decreasing = TRUE),
             10))

##      lhs                     rhs            support confidence     lift itemset
## [1]  {chicken,                                                                 
##       citrus fruit,                                                            
##       tropical fruit,                                                          
##       yogurt}             => {whole milk} 0.2043896  1.0000000 1.045911    2041
## [2]  {brown bread,                                                             
##       butter,                                                                  
##       rolls/buns,                                                              
##       soda}               => {whole milk} 0.2167353  1.0000000 1.045911    6528
## [3]  {butter,                                                                  
##       root vegetables,                                                         
##       soda,                                                                    
##       whipped/sour cream} => {whole milk} 0.2112483  1.0000000 1.045911    6578
## [4]  {butter,                                                                  
##       other vegetables,                                                        
##       rolls/buns,                                                              
##       soda,                                                                    
##       whipped/sour cream} => {whole milk} 0.2153635  1.0000000 1.045911    6597
## [5]  {butter,                                                                  
##       rolls/buns,                                                              
##       soda,                                                                    
##       whipped/sour cream} => {whole milk} 0.2331962  1.0000000 1.045911    6598
## [6]  {butter,                                                                  
##       soda,                                                                    
##       whipped/sour cream} => {whole milk} 0.2524005  0.9945946 1.040257    6601
## [7]  {chicken,                                                                 
##       frankfurter}        => {whole milk} 0.2427984  0.9943820 1.040035    1867
## [8]  {brown bread,                                                             
##       butter,                                                                  
##       soda}               => {whole milk} 0.2359396  0.9942197 1.039865    6530
## [9]  {chicken,                                                                 
##       citrus fruit,                                                            
##       tropical fruit}     => {whole milk} 0.2345679  0.9941860 1.039830    2046
## [10] {chicken,                                                                 
##       citrus fruit,                                                            
##       other vegetables,                                                        
##       yogurt}             => {whole milk} 0.2345679  0.9941860 1.039830    2065
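
The lift column can be reproduced from the confidence column, since the lift of a rule A => B is its confidence divided by the support of B. The sketch below checks this for two of the rules above, taking the whole milk count of 697 out of 729 transactions from the earlier output; `lift_from_confidence` is a helper name made up for this example.

```r
n_transactions <- 729
count_milk     <- 697                        # transactions containing whole milk
support_milk   <- count_milk / n_transactions

# Lift of a rule A => {whole milk} is confidence / support(whole milk)
lift_from_confidence <- function(confidence) confidence / support_milk

lift_rule_1 <- lift_from_confidence(1.0000000)  # rules [1] to [5]
lift_rule_6 <- lift_from_confidence(0.9945946)  # rule [6]
print(round(c(lift_rule_1, lift_rule_6), 6))
```

Since whole milk is already present in about 96% of transactions, even a rule with confidence 1 cannot have a lift much above 1.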

Because whole milk is the most frequently purchased item in the basket analysis, the top rules all point to it and we see no rules without it. To identify a wider variety of patterns, let's examine the rules item by item, beginning with other vegetables, the second most frequently purchased item. I use the apriori algorithm here because it allows the items that may appear on the right-hand side of a rule to be restricted. The confidence threshold needs to be lowered to do this.

# Keep only rules whose right-hand side is "other vegetables"
other_vegetables <- apriori(transactions_data, 
                            parameter = list(supp = 0.1,
                                             conf = 0.48),
                            appearance = list(default = "lhs", 
                                              rhs = "other vegetables"), 
                            control = list(verbose = FALSE)) 
inspect(head(sort(other_vegetables, 
                  by = "support",
                  decreasing = TRUE),
             10))

##      lhs                         rhs                support   confidence
## [1]  {}                       => {other vegetables} 0.9135802 0.9135802 
## [2]  {whole milk}             => {other vegetables} 0.8737997 0.9139168 
## [3]  {rolls/buns}             => {other vegetables} 0.8148148 0.9152542 
## [4]  {rolls/buns, whole milk} => {other vegetables} 0.7777778 0.9145161 
## [5]  {soda}                   => {other vegetables} 0.7722908 0.9124797 
## [6]  {yogurt}                 => {other vegetables} 0.7572016 0.9200000 
## [7]  {soda, whole milk}       => {other vegetables} 0.7434842 0.9109244 
## [8]  {whole milk, yogurt}     => {other vegetables} 0.7242798 0.9198606 
## [9]  {root vegetables}        => {other vegetables} 0.6886145 0.9127273 
## [10] {rolls/buns, soda}       => {other vegetables} 0.6844993 0.9139194 
##      coverage  lift      count
## [1]  1.0000000 1.0000000 666  
## [2]  0.9561043 1.0003684 637  
## [3]  0.8902606 1.0018323 594  
## [4]  0.8504801 1.0010244 567  
## [5]  0.8463649 0.9987954 563  
## [6]  0.8230453 1.0070270 552  
## [7]  0.8161866 0.9970929 542  
## [8]  0.7873800 1.0068745 528  
## [9]  0.7544582 0.9990663 502  
## [10] 0.7489712 1.0003713 499

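All ten rules have a lift within about 1% of 1, so none of these antecedents meaningfully changes the likelihood of buying other vegetables. A graphical representation of the rules is shown below. Rules for other items can be created in the same way as for other vegetables.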

library(arulesViz)  # provides the plot() method for rules

plot(other_vegetables, 
     method = "graph")

Folefac A Walsh

2023-02-28
