Dataset

The dataset used in this project is the Groceries dataset.

Link to the dataset: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

library(arules)

## Warning: package 'arules' was built under R version 4.4.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

## Warning: package 'arulesViz' was built under R version 4.4.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

groceries <- read.csv("C://Users/Admin/Desktop/Groceries_dataset.csv")
head(groceries)

##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns

The head() function shows the first six rows of the dataset.

summary(groceries)

##  Member_number      Date           itemDescription   
##  Min.   :1000   Length:38765       Length:38765      
##  1st Qu.:2002   Class :character   Class :character  
##  Median :3005   Mode  :character   Mode  :character  
##  Mean   :3004                                        
##  3rd Qu.:4007                                        
##  Max.   :5000

The summary() function gives an overview of each column: the dataset has 38765 rows, and both Date and itemDescription are stored as character.
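Since Date is read in as character, it may be worth parsing it before any date-based work. The following is a minimal base R sketch, assuming the day-month-year layout seen in the head() output; the data frame `toy` is illustrative and only copies three of the rows shown above.

```r
# Toy rows copied from the head() output (illustrative only).
toy <- data.frame(
  Member_number   = c(1808, 2552, 2300),
  Date            = c("21-07-2015", "05-01-2015", "19-09-2015"),
  itemDescription = c("tropical fruit", "whole milk", "pip fruit"),
  stringsAsFactors = FALSE
)

# Parse the day-month-year strings into Date objects.
toy$Date <- as.Date(toy$Date, format = "%d-%m-%Y")

# Count missing values per column (a failed parse would show up as NA).
na_per_column <- colSums(is.na(toy))

str(toy)
na_per_column
```

The same two steps could be applied to the full `groceries` data frame; the analysis in this report does not depend on them.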

Data Preprocessing

Data preprocessing is an essential step in any unsupervised learning project: the data must be well structured before any algorithm is applied.

Convert the data to the transactions format, with one transaction per member:

groceries_trans <- split(groceries$itemDescription, groceries$Member_number)
transactions <- as(groceries_trans, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions
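The warning about duplicated items is expected: split() pools every item a member bought over the whole period into one vector, so an item bought on several visits appears more than once, and the coercion keeps each item only once per transaction. A small base R sketch of that grouping on made-up rows (the data frame `toy` and its values are hypothetical):

```r
toy <- data.frame(
  Member_number   = c(1000, 1000, 1000, 1001, 1001),
  Date            = c("01-01-2015", "01-01-2015", "05-03-2015",
                      "02-02-2015", "02-02-2015"),
  itemDescription = c("whole milk", "yogurt", "whole milk",
                      "soda", "rolls/buns"),
  stringsAsFactors = FALSE
)

# One basket per member: purchases from all dates are pooled together.
by_member <- split(toy$itemDescription, toy$Member_number)

# Dropping repeats mimics what the coercion warning describes.
by_member_unique <- lapply(by_member, unique)

# Alternative grouping (not used in this report): one basket per member and date.
by_visit <- split(toy$itemDescription, paste(toy$Member_number, toy$Date))

lengths(by_member)         # items per member, repeats included
lengths(by_member_unique)  # items per member after removing repeats
length(by_visit)           # number of member-date baskets
```

Grouping by member produces fewer and larger baskets than grouping by visit, which tends to raise support and confidence values; the rules in this report should be read with that choice in mind.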

summary(transactions)

## transactions as itemMatrix in sparse format with
##  3898 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.05340678 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             1786             1468             1363             1222 
##           yogurt          (Other) 
##             1103            27824 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   6 248  87 331 261 381 303 332 340 296 276 238 181 179 123  97  66  46  39  28 
##  21  22  23  24  25  26 
##  15  13   3   5   2   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.500   8.919  12.000  26.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1          1000
## 2          1001
## 3          1002

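The dataset is now in transactions form, which the arules functions require for mining association rules. Because the items were grouped by Member_number, each of the 3898 transactions is the set of distinct items that one member bought over the whole period, not a single shopping trip.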
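The figures printed by summary(transactions) are consistent with each other: the density is the share of filled cells in the 3898 x 167 item matrix, which equals the mean transaction size divided by the number of items. A quick check using only the numbers from the output above:

```r
n_transactions <- 3898
n_items        <- 167

# Counts of the five most frequent items plus "(Other)" from the summary.
item_counts <- c(1786, 1468, 1363, 1222, 1103, 27824)

total_entries <- sum(item_counts)                            # filled cells
mean_size     <- total_entries / n_transactions              # items per transaction
density       <- total_entries / (n_transactions * n_items)  # share of filled cells

round(mean_size, 3)
round(density, 8)
```

The two rounded values should agree with the mean of 8.919 and the density of 0.05340678 reported in the summary.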

Item Frequency

Item frequency shows which items occur most often in the dataset.

Frequency of the items:

itemFrequencyPlot(transactions, topN = 15, type = "absolute", ylim = c(0, 2000), main = "Item frequency", col = "#56B4E9")

The graph above shows that the most frequent items in the dataset are whole milk and other vegetables. The same graph can be expressed in percentages:

itemFrequencyPlot(transactions, topN = 15, type = "relative", ylim = c(0, 0.5), main = "Item frequency", col = "#56B4E9")

The relative frequency plot shows that whole milk is the most popular item: it appears in about 46% of the transactions (1786 of 3898).
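The relative frequencies can be reproduced from the counts printed by summary(transactions). A short base R sketch (the vector name `top_counts` is only for illustration):

```r
n_transactions <- 3898

# Absolute counts of the five most frequent items, from summary(transactions).
top_counts <- c(
  "whole milk"       = 1786,
  "other vegetables" = 1468,
  "rolls/buns"       = 1363,
  "soda"             = 1222,
  "yogurt"           = 1103
)

# Relative frequency = share of transactions that contain the item.
relative_freq <- top_counts / n_transactions
round(relative_freq, 3)
```

These shares are the single-item supports, and they reappear later as the coverage of the rules whose left-hand side is that item.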

Apriori

Apriori is an algorithm that mines association rules from transaction data. Its main thresholds are the minimum support and the minimum confidence; lift is not a threshold but a quality measure reported for each rule. The thresholds must be chosen with care: if they are set too high, no rules are found, and if they are set too low, the algorithm returns a large number of weak rules.
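For a rule A => B, support is the share of transactions containing both A and B, confidence is the share of transactions containing A that also contain B, and lift is the confidence divided by the overall share of transactions containing B. The helper below is a sketch of these standard definitions in base R; the function name `rule_measures` and the example counts are hypothetical and are not part of arules.

```r
# Quality measures for a rule A => B, computed from raw counts.
rule_measures <- function(n_both, n_lhs, n_rhs, n_total) {
  support    <- n_both / n_total                 # P(A and B)
  confidence <- n_both / n_lhs                   # P(B | A)
  lift       <- confidence / (n_rhs / n_total)   # P(B | A) / P(B)
  c(support = support, confidence = confidence, lift = lift)
}

# Hypothetical example: 100 baskets, 40 contain A, 50 contain B, 30 contain both.
m <- rule_measures(n_both = 30, n_lhs = 40, n_rhs = 50, n_total = 100)
m
```

A lift close to 1 means that A and B occur together about as often as expected if they were independent, so lift is usually read alongside confidence rather than on its own.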

rules <- apriori(transactions)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 389 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With the default thresholds (support 0.1, confidence 0.8) no rules are found, so the defaults need to be relaxed.

After several attempts, I kept the minimum support at 0.1 and lowered the minimum confidence to 0.5.

rules <- apriori(transactions, parameter = list(supp = 0.1, conf = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 389 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 5 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 
## 5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1070   Min.   :0.5082   Min.   :0.2060   Min.   :1.109  
##  1st Qu.:0.1124   1st Qu.:0.5106   1st Qu.:0.2137   1st Qu.:1.114  
##  Median :0.1506   Median :0.5193   Median :0.2830   Median :1.133  
##  Mean   :0.1480   Mean   :0.5192   Mean   :0.2858   Mean   :1.133  
##  3rd Qu.:0.1786   3rd Qu.:0.5258   3rd Qu.:0.3497   3rd Qu.:1.148  
##  Max.   :0.1914   Max.   :0.5322   Max.   :0.3766   Max.   :1.162  
##      count      
##  Min.   :417.0  
##  1st Qu.:438.0  
##  Median :587.0  
##  Mean   :576.8  
##  3rd Qu.:696.0  
##  Max.   :746.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          3898     0.1        0.5
##                                                                    call
##  apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.5))

The algorithm now finds 5 rules, each with a single item on the left-hand side and a lift between 1.11 and 1.16.
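The summary also explains why the first run with the default settings returned no rules: the highest confidence among the rules with support of at least 0.1 is about 0.53, well under the default threshold of 0.8. A small check using only values printed in the outputs above:

```r
n_transactions    <- 3898
support_threshold <- 0.1

# Support threshold expressed as a number of transactions (389.8);
# the apriori log prints this as 389.
threshold_count <- support_threshold * n_transactions

# Confidence range of the five rules, from summary(rules).
conf_range <- c(min = 0.5082, max = 0.5322)

survive_default <- conf_range[["max"]] >= 0.8   # default confidence threshold
survive_relaxed <- conf_range[["min"]] >= 0.5   # threshold used in this report

c(default = survive_default, relaxed = survive_relaxed)
```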

set.seed(240)
plot(rules, measure=c("support","lift"), shading="confidence", main="Groceries Transactions Rules")

The scatter plot places each rule by its support and lift. The colour intensity encodes confidence, so the most intensely red dots are the rules with the highest confidence.

plot(rules, method="graph", measure="support", shading="lift")

The graph shows that whole milk is the consequent (right-hand side) of all five rules; it is also the most frequent item in the dataset.

Inspect

The rules can now be inspected in turn, sorted by support, confidence and lift.

Support

inspect(sort(rules, by = "support", decreasing = TRUE), linebreak = FALSE)

##     lhs                   rhs          support   confidence coverage  lift    
## [1] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
## [2] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [3] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [4] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [5] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
##     count
## [1] 746  
## [2] 696  
## [3] 587  
## [4] 438  
## [5] 417

Confidence

inspect(sort(rules, by = "confidence", decreasing = TRUE), linebreak = FALSE)

##     lhs                   rhs          support   confidence coverage  lift    
## [1] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [2] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [3] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
## [4] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
##     count
## [1] 587  
## [2] 438  
## [3] 417  
## [4] 696  
## [5] 746

Lift

inspect(sort(rules, by = "lift", decreasing = TRUE), linebreak = FALSE)

##     lhs                   rhs          support   confidence coverage  lift    
## [1] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [2] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [3] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
## [4] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
##     count
## [1] 587  
## [2] 438  
## [3] 417  
## [4] 696  
## [5] 746
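Sorting by confidence and sorting by lift give the same order here. This follows from the definition of lift: all five rules share the consequent whole milk, so each lift is the rule's confidence divided by the same constant, the support of whole milk. A base R check with the values copied from the tables above (the data frame `rules_tbl` is only for illustration):

```r
rules_tbl <- data.frame(
  lhs        = c("other vegetables", "rolls/buns", "yogurt",
                 "bottled water", "sausage"),
  confidence = c(0.5081744, 0.5106383, 0.5321850, 0.5258103, 0.5193026),
  lift       = c(1.109106, 1.114484, 1.161510, 1.147597, 1.133394),
  stringsAsFactors = FALSE
)

# Share of transactions that contain whole milk (1786 of 3898).
supp_whole_milk <- 1786 / 3898

# Lift recomputed as confidence divided by the support of the consequent.
rules_tbl$lift_check <- rules_tbl$confidence / supp_whole_milk

# Dividing by a positive constant cannot change the ranking.
order_by_conf <- order(rules_tbl$confidence, decreasing = TRUE)
order_by_lift <- order(rules_tbl$lift, decreasing = TRUE)
same_order <- identical(order_by_conf, order_by_lift)

rules_tbl
same_order
```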

plot(rules, method="paracoord", control=list(reorder=TRUE))

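Consistent with the tables above, the rule {yogurt} => {whole milk} is the strongest of the five: it has the highest confidence (0.53) and lift (1.16), meaning that about 53% of the members who bought yogurt also bought whole milk.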

rules_yogurt <- apriori(data = transactions,
                        parameter = list(supp = 0.01, conf = 0.005, minlen = 2),
                        appearance = list(default = "rhs", lhs = "yogurt"),
                        control = list(verbose = FALSE))
rules_yogurt_byconf <- sort(rules_yogurt, by = "confidence", decreasing = TRUE)
inspect(rules_yogurt_byconf[1:2], linebreak = FALSE)

##     lhs         rhs                support   confidence coverage  lift    count
## [1] {yogurt} => {whole milk}       0.1505900 0.532185   0.2829656 1.16151 587  
## [2] {yogurt} => {other vegetables} 0.1203181 0.425204   0.2829656 1.12905 469

Among members who bought yogurt, whole milk is the most likely companion item (confidence 0.53), followed by other vegetables (confidence 0.43, lift 1.13).
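To judge how strong the {yogurt} => {other vegetables} rule is, its confidence can be compared with the baseline share of members who bought other vegetables at all. A base R sketch using the counts from the outputs above (1103 yogurt buyers, 1468 buyers of other vegetables, 469 buyers of both):

```r
n_transactions     <- 3898
n_yogurt           <- 1103   # members who bought yogurt
n_other_vegetables <- 1468   # members who bought other vegetables
n_both             <- 469    # members who bought both

confidence <- n_both / n_yogurt                    # P(other vegetables | yogurt)
baseline   <- n_other_vegetables / n_transactions  # P(other vegetables)
lift       <- confidence / baseline

round(c(confidence = confidence, baseline = baseline, lift = lift), 4)
```

The confidence is only a few percentage points above the baseline, which is what a lift of about 1.13 expresses.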

rules_whole_milk <- apriori(data = transactions,
                            parameter = list(supp = 0.01, conf = 0.005, minlen = 2),
                            appearance = list(default = "rhs", lhs = "whole milk"),
                            control = list(verbose = FALSE))
rules_whole_milk_byconf <- sort(rules_whole_milk, by = "confidence", decreasing = TRUE)
inspect(rules_whole_milk_byconf[1:2], linebreak = FALSE)

##     lhs             rhs                support   confidence coverage  lift    
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932  0.4581837 1.109106
## [2] {whole milk} => {rolls/buns}       0.1785531 0.3896976  0.4581837 1.114484
##     count
## [1] 746  
## [2] 696

About 42% of the members who bought whole milk also bought other vegetables. With a lift of 1.11, this is only modestly above the 38% of all members who bought other vegetables.
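The rule {whole milk} => {other vegetables} is the reverse of {other vegetables} => {whole milk} from the earlier tables. The two share the same support and lift but differ in confidence, because confidence is conditioned on the left-hand side. A base R sketch with the counts from the outputs above:

```r
n_transactions <- 3898
n_milk <- 1786   # members who bought whole milk
n_veg  <- 1468   # members who bought other vegetables
n_both <- 746    # members who bought both

# Confidence depends on which item is on the left-hand side.
conf_milk_to_veg <- n_both / n_milk
conf_veg_to_milk <- n_both / n_veg

# Lift is symmetric: both directions give the same value.
lift_milk_to_veg <- conf_milk_to_veg / (n_veg / n_transactions)
lift_veg_to_milk <- conf_veg_to_milk / (n_milk / n_transactions)

round(c(conf_milk_to_veg = conf_milk_to_veg,
        conf_veg_to_milk = conf_veg_to_milk,
        lift_milk_to_veg = lift_milk_to_veg,
        lift_veg_to_milk = lift_veg_to_milk), 4)
```

Because whole milk is the more frequent item, the rule pointing towards it has the higher confidence; this is also why only the rules with whole milk on the right-hand side passed the 0.5 confidence threshold.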

Association Rules for Groceries

Eljan Abbaszada

2025-01-17

Introduction

Literature Review