Association rule mining is a machine learning technique that helps to uncover relationships between items in large databases. One example of its application is market basket analysis, a data mining procedure used by retailers to discover relationships between the items people buy, identify customer purchasing patterns, and then use them to increase sales.
To implement the market basket analysis algorithms, the arules and arulesViz packages were used. These packages provide an environment for representing, manipulating, measuring, visualizing and analyzing transaction data using association rules.
The Groceries dataset used in this paper is built into the arules package and consists of 30 days of real-world transaction data from a local grocery outlet.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
data("Groceries")
transactions = Groceries
head(Groceries)
## transactions in sparse format with
## 6 transactions (rows) and
## 169 items (columns)
summary(transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
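As a quick sanity check of the summary above, the density and the mean transaction size can be recomputed from the printed counts. The sketch below uses only numbers copied from that output; the variable names are illustrative and not part of arules.

```r
# Figures copied from the summary() output above
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items + "(Other)"

# Total number of item occurrences, i.e. non-empty cells of the sparse matrix
n_filled <- sum(item_counts)

# Density: share of non-empty cells in the transactions x items matrix
density <- n_filled / (n_transactions * n_items)

# Mean transaction size: average number of items per transaction
mean_size <- n_filled / n_transactions

round(density, 8)
round(mean_size, 3)
```

Both values should agree with the density and the mean of the length distribution reported by summary().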
LIST(head(transactions))
## [[1]]
## [1] "citrus fruit" "semi-finished bread" "margarine"
## [4] "ready soups"
##
## [[2]]
## [1] "tropical fruit" "yogurt" "coffee"
##
## [[3]]
## [1] "whole milk"
##
## [[4]]
## [1] "pip fruit" "yogurt" "cream cheese " "meat spreads"
##
## [[5]]
## [1] "other vegetables" "whole milk"
## [3] "condensed milk" "long life bakery product"
##
## [[6]]
## [1] "whole milk" "butter" "yogurt" "rice"
## [5] "abrasive cleaner"
inspect(head(transactions))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
size(head(transactions))
## [1] 4 3 1 4 4 5
length(head(transactions))
## [1] 6
The plot below shows the 25 items that appeared most frequently in this dataset. The five most frequent items are, in order, whole milk, other vegetables, rolls/buns, soda and yogurt.
itemFrequencyPlot(
transactions,
topN = 25,
type = "absolute",
main = "Item frequency",
cex.names = 0.85
)
The plot below shows how sparse the matrix is for the first 10 transactions.
image(transactions[1:10])
The plot below shows how sparse the matrix is for a random sample of 80 transactions.
image(sample(transactions, 80))
The first table below shows the fraction of all transactions in which a given product occurred; the second shows the corresponding absolute counts.
head(round(itemFrequency(transactions),3))
## frankfurter sausage liver loaf ham
## 0.059 0.094 0.005 0.026
## meat finished products
## 0.026 0.007
head(itemFrequency(transactions, type="absolute"))
## frankfurter sausage liver loaf ham
## 580 924 50 256
## meat finished products
## 254 64
The crossTable() function returns symmetric n x n matrices containing co-occurrence measures for pairs of items.
The count measure indicates the number of transactions in which both items occurred together.
cctab<-crossTable(transactions, measure="count", sort=TRUE)
head(round(cctab,2))
The support measure shows the fraction of all transactions in which both items appeared together.
stab<-crossTable(transactions, measure="support", sort=TRUE)
head(round(stab, 3))
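The lift measure compares how often two given products are bought together with how often this would be expected if they were bought independently of each other; values above 1 indicate that the products co-occur more often than expected.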
ltab<-crossTable(transactions, measure="lift", sort=TRUE)
head(ltab)
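To make the three measures concrete, they can be reproduced by hand on a small example. The sketch below uses base R only and a hypothetical toy matrix of five baskets (not taken from Groceries); it illustrates the definitions and is not how crossTable() is implemented. Setting the diagonal of the lift table to NA is a choice made here for readability.

```r
# Toy data (hypothetical): 5 baskets (rows) x 3 items (columns), 1 = item bought
baskets <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "butter"))
)
n <- nrow(baskets)

# Count: number of baskets containing both items (t(X) %*% X)
count_tab <- crossprod(baskets)

# Support: fraction of baskets containing both items
support_tab <- count_tab / n

# Lift: observed joint support divided by the support expected under independence
item_supp <- diag(support_tab)                 # single-item supports on the diagonal
lift_tab  <- support_tab / outer(item_supp, item_supp)
diag(lift_tab) <- NA                           # an item paired with itself is not informative

count_tab["milk", "bread"]    # 3 baskets contain both
support_tab["milk", "bread"]  # 3 / 5 = 0.6
lift_tab["milk", "bread"]     # 0.6 / (0.8 * 0.8) = 0.9375
```

In this toy example milk and bread have a lift slightly below 1, so they appear together a little less often than independence would predict.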
The Apriori algorithm is used to determine the association rules between items. This algorithm identifies frequently occurring sets of items and generates rules based on them. First it finds frequent single items in the database, and then it extends them with further items as long as the resulting sets appear together sufficiently often in the database.
The minimum support and confidence thresholds were set to 0.006 and 0.25 respectively, so the algorithm returned only rules with a support of at least 0.6% and a confidence of at least 25%. Setting minlen = 2 additionally excludes rules with an empty left-hand side.
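The level-wise search described above can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy baskets, not the implementation used by arules: it only finds frequent itemsets, and it omits both the subset-pruning step of the full algorithm and the rule-generation phase. All names (baskets, support_of, min_support) are made up for the example.

```r
# Toy baskets (hypothetical, not taken from Groceries)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "bread")
)
min_support <- 0.4

# Support of an itemset: share of baskets that contain all of its items
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items    <- sort(unique(unlist(baskets)))
current  <- Filter(function(s) support_of(s) >= min_support, as.list(items))
frequent <- current

# Level k + 1: extend each frequent k-itemset by one item and keep
# only the candidates that are still frequent
while (length(current) > 0) {
  candidates <- list()
  for (s in current) {
    # append only items that sort after the last one, so each set is built once
    for (i in items[items > max(s)]) {
      candidates[[length(candidates) + 1]] <- c(s, i)
    }
  }
  current  <- Filter(function(s) support_of(s) >= min_support, candidates)
  frequent <- c(frequent, current)
}

# Print every frequent itemset together with its support
for (s in frequent) {
  cat("{", paste(s, collapse = ", "), "}", support_of(s), "\n")
}
```

Because an infrequent itemset is never extended, the search stops as soon as no candidate of the current size reaches the threshold; on the toy data this yields five frequent itemsets.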
rules.transactions<-apriori(transactions, parameter=list(support =
0.006, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.by.conf<-sort(rules.transactions, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {butter,
## whipped/sour cream} => {whole milk} 0.006710727 0.6600000 0.010167768 2.583008 66
## [2] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [3] {root vegetables,
## butter} => {whole milk} 0.008235892 0.6377953 0.012913066 2.496107 81
## [4] {tropical fruit,
## curd} => {whole milk} 0.006507372 0.6336634 0.010269446 2.479936 64
## [5] {tropical fruit,
## butter} => {whole milk} 0.006202339 0.6224490 0.009964413 2.436047 61
## [6] {tropical fruit,
## other vegetables,
## yogurt} => {whole milk} 0.007625826 0.6198347 0.012302999 2.425816 75
rules.by.lift<-sort(rules.transactions, by="lift", decreasing=TRUE)
inspect(head(rules.by.lift))
## lhs rhs support confidence coverage lift count
## [1] {herbs} => {root vegetables} 0.007015760 0.4312500 0.01626843 3.956477 69
## [2] {berries} => {whipped/sour cream} 0.009049314 0.2721713 0.03324860 3.796886 89
## [3] {tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.007015760 0.4107143 0.01708185 3.768074 69
## [4] {beef,
## other vegetables} => {root vegetables} 0.007930859 0.4020619 0.01972547 3.688692 78
## [5] {tropical fruit,
## other vegetables} => {pip fruit} 0.009456024 0.2634561 0.03589222 3.482649 93
## [6] {beef,
## whole milk} => {root vegetables} 0.008032537 0.3779904 0.02125064 3.467851 79
rules.by.count<- sort(rules.transactions, by="count", decreasing=TRUE)
inspect(head(rules.by.count))
## lhs rhs support confidence coverage
## [1] {other vegetables} => {whole milk} 0.07483477 0.3867578 0.1934926
## [2] {whole milk} => {other vegetables} 0.07483477 0.2928770 0.2555160
## [3] {rolls/buns} => {whole milk} 0.05663447 0.3079049 0.1839349
## [4] {yogurt} => {whole milk} 0.05602440 0.4016035 0.1395018
## [5] {root vegetables} => {whole milk} 0.04890696 0.4486940 0.1089985
## [6] {root vegetables} => {other vegetables} 0.04738180 0.4347015 0.1089985
## lift count
## [1] 1.513634 736
## [2] 1.513634 736
## [3] 1.205032 557
## [4] 1.571735 551
## [5] 1.756031 481
## [6] 2.246605 466
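The quality measures in these tables are linked by simple ratios, which can be verified for the rule {other vegetables} => {whole milk}. The sketch below uses only counts that appear in the outputs above (the item frequencies from summary() and the rule count); the variable names are illustrative.

```r
# Counts copied from the outputs above
n      <- 9835   # number of transactions
n_both <- 736    # transactions containing other vegetables and whole milk
n_lhs  <- 1903   # transactions containing other vegetables (left-hand side)
n_rhs  <- 2513   # transactions containing whole milk (right-hand side)

support    <- n_both / n                 # share of transactions with both items
coverage   <- n_lhs / n                  # support of the left-hand side
confidence <- n_both / n_lhs             # equivalently support / coverage
lift       <- confidence / (n_rhs / n)   # confidence relative to the support of the rhs

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 6)
```

The rounded values should match the first row of the table sorted by count.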
rules_butter <-
apriori(
data = transactions,
parameter = list(supp = 0.001, conf = 0.15),
appearance = list(default = "lhs", rhs = "butter"),
control = list(verbose = FALSE)
)
inspect(rules_butter[1:5], linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {jam} => {butter} 0.001220132 0.2264151 0.005388917
## [2] {Instant food products} => {butter} 0.001220132 0.1518987 0.008032537
## [3] {flower (seeds)} => {butter} 0.001626843 0.1568627 0.010371124
## [4] {turkey} => {butter} 0.001525165 0.1875000 0.008134215
## [5] {rice} => {butter} 0.001830198 0.2400000 0.007625826
## lift count
## [1] 4.085858 12
## [2] 2.741145 12
## [3] 2.830725 16
## [4] 3.383601 15
## [5] 4.331009 18
inspectDT(rules.transactions)
plot(rules.transactions, shading = "order",engine = "html")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules.transactions, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).
plot(rules.transactions, method = "matrix", engine = "html")
plot(rules.transactions, method="paracoord", control=list(reorder=TRUE))
plot(rules.transactions, method = "graph", limit = 20, engine = "html" )
The Eclat algorithm finds frequent itemsets and provides quality measures for them. It was introduced to address a weakness of the Apriori algorithm described above: instead of scanning the original database repeatedly, Eclat keeps, for each item, the list of transactions that contain it and obtains the support of larger itemsets by intersecting the lists produced at the previous stage, which is usually more efficient. Moreover, Eclat mines itemsets rather than rules, so it reports support and count but not confidence and lift; these are only available after a separate rule induction step.
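The idea of working on transaction-id lists can be illustrated with a short base R sketch. It uses hypothetical toy baskets and made-up names (tidlists, eclat_dfs, min_count) and is meant as a simplified illustration of the technique, not as the implementation behind eclat() in arules.

```r
# Toy baskets (hypothetical, not taken from Groceries)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "bread")
)
min_count <- 2

# Vertical layout: for every item, the ids of the baskets that contain it
items    <- sort(unique(unlist(baskets)))
tidlists <- lapply(items, function(i) {
  which(vapply(baskets, function(b) i %in% b, logical(1)))
})
names(tidlists) <- items

# Depth-first search: extend `prefix` with each remaining item by intersecting
# tid-lists; the size of the intersection is the support count of the new itemset
eclat_dfs <- function(prefix, prefix_tids, candidates) {
  found <- integer(0)
  for (k in seq_along(candidates)) {
    item <- candidates[k]
    tids <- intersect(prefix_tids, tidlists[[item]])
    if (length(tids) >= min_count) {
      itemset <- c(prefix, item)
      found[paste(itemset, collapse = ", ")] <- length(tids)
      # recurse with the items that come after the current one
      found <- c(found, eclat_dfs(itemset, tids, candidates[-seq_len(k)]))
    }
  }
  found
}

freq_counts <- eclat_dfs(character(0), seq_along(baskets), items)
freq_counts
```

The original baskets are read only once, when the tid-lists are built; every later support count comes from intersecting lists that are already in memory.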
inspect(transactions[1:5])
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
freq.items<-eclat(transactions, parameter=list(supp=0.01, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 98
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating sparse bit matrix ... [88 row(s), 9835 column(s)] done [0.00s].
## writing ... [333 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
From the first row of the results below we can see that whole milk and hard cheese occur together in about 1% of all transactions (support 0.0101, i.e. 99 of 9,835 transactions).
inspect(freq.items[1:5])
## items support count
## [1] {whole milk, hard cheese} 0.01006609 99
## [2] {whole milk, butter milk} 0.01159126 114
## [3] {other vegetables, butter milk} 0.01037112 102
## [4] {ham, whole milk} 0.01148958 113
## [5] {whole milk, sliced cheese} 0.01077783 106
round(support(items(freq.items), transactions) , 2)
## [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [16] 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01
## [31] 0.02 0.02 0.01 0.02 0.01 0.01 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.02
## [46] 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.02 0.01 0.03 0.02 0.01 0.02 0.01 0.01
## [61] 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.01 0.02 0.02
## [76] 0.02 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.01 0.01 0.01
## [91] 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.01 0.01
## [106] 0.01 0.01 0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.01
## [121] 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01
## [136] 0.01 0.01 0.01 0.01 0.03 0.03 0.01 0.02 0.01 0.02 0.01 0.01 0.01 0.03 0.03
## [151] 0.01 0.02 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.02 0.02
## [166] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.03 0.02 0.02 0.01 0.02 0.02 0.01
## [181] 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.02 0.01 0.03 0.03 0.03 0.02
## [196] 0.02 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.02 0.03 0.02 0.02 0.01 0.01 0.02
## [211] 0.01 0.01 0.02 0.04 0.04 0.02 0.03 0.02 0.02 0.01 0.01 0.01 0.01 0.02 0.05
## [226] 0.05 0.02 0.03 0.02 0.01 0.01 0.04 0.03 0.04 0.03 0.02 0.01 0.02 0.06 0.04
## [241] 0.03 0.02 0.06 0.04 0.07 0.26 0.19 0.18 0.14 0.17 0.11 0.10 0.11 0.09 0.10
## [256] 0.08 0.09 0.08 0.07 0.07 0.06 0.08 0.06 0.06 0.06 0.08 0.06 0.06 0.05 0.05
## [271] 0.05 0.05 0.06 0.05 0.04 0.04 0.04 0.08 0.04 0.04 0.04 0.03 0.04 0.03 0.03
## [286] 0.03 0.03 0.03 0.03 0.02 0.03 0.03 0.03 0.02 0.03 0.03 0.03 0.02 0.02 0.03
## [301] 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02
## [316] 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.01
## [331] 0.01 0.01 0.01
freq.rules<-ruleInduction(freq.items, transactions, confidence=0.9)
freq.rules
## set of 0 rules
inspect(freq.rules) # the rule set is empty: no rule induced from these itemsets reaches 90% confidence
summary(rules.transactions)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 150 297 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.711 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.006101 Min. :0.2500 Min. :0.009964 Min. :0.9932
## 1st Qu.:0.007117 1st Qu.:0.2971 1st Qu.:0.018709 1st Qu.:1.6229
## Median :0.008744 Median :0.3554 Median :0.024809 Median :1.9332
## Mean :0.011539 Mean :0.3786 Mean :0.032608 Mean :2.0351
## 3rd Qu.:0.012303 3rd Qu.:0.4495 3rd Qu.:0.035892 3rd Qu.:2.3565
## Max. :0.074835 Max. :0.6600 Max. :0.255516 Max. :3.9565
## count
## Min. : 60.0
## 1st Qu.: 70.0
## Median : 86.0
## Mean :113.5
## 3rd Qu.:121.0
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.006 0.25
## call
## apriori(data = transactions, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
In total, 463 rules were found. Most of them (297) consist of 3 items.
In this paper, market basket analysis was performed in R. We explored the Groceries dataset, applied the Apriori algorithm to create association rules and the Eclat algorithm to discover the most frequent itemsets, and interpreted the obtained results. Such an analysis can help to better understand customer buying patterns and can be employed by retailers to boost sales.