https://en.wikipedia.org/wiki/Association_rule_learning
Intuition lecture 167 https://www.udemy.com/machinelearning/learn/lecture/6455326
Lecture 169 https://www.udemy.com/machinelearning/learn/lecture/5935024
Eclat works with support only. It is essentially a simplified version of the Apriori algorithm: instead of rules with confidence and lift, it returns only the sets of items that are frequently found together in our transactions.
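To make the "support only" idea concrete, here is a minimal base-R sketch of the core Eclat step on a toy dataset. The transactions and the 0.5 threshold are made up for illustration, and this is not how arules implements it internally. It only shows the principle: store, for each item, the ids of the transactions that contain it, then intersect those id lists to get the support of a pair.

```r
# Toy transactions (hypothetical data, not the course dataset)
transactions <- list(
  c("milk", "eggs", "bread"),
  c("milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs", "butter")
)
n <- length(transactions)

# Vertical layout: for every item, the ids of the transactions containing it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tid_lists) <- items

# Support of a pair = size of the intersection of the two id lists / n
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- vapply(pairs, function(p) {
  length(intersect(tid_lists[[p[1]]], tid_lists[[p[2]]])) / n
}, numeric(1))
names(pair_support) <- vapply(pairs, paste, character(1), collapse = ",")

# Keep only the pairs that reach the minimum support (assumed threshold)
min_support <- 0.5
frequent_pairs <- pair_support[pair_support >= min_support]
print(frequent_pairs)  # only "eggs,milk" survives, with support 0.75
```

No confidence or lift is computed anywhere; the only question asked is "how often do these items appear together?".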
Check the working directory with getwd() so you always know where you are working.
First let’s import as normal.
# note: the file has no header row, so header = FALSE stops read.csv from treating the first transaction as column names
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
The dataset as imported by read.csv, prior to creating the sparse matrix.
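We need to import the dataset in a particular way. read.csv gives us a data frame with one column per basket position and empty cells wherever a transaction has fewer items, but eclat() in the arules package expects a transactions object. To clean things up we’ll drop the empty cells and represent each transaction as the set of items it contains. To do this we’ll create a sparse matrix using the arules package: one row per transaction, one column per item, and only the non-empty cells stored. https://en.wikipedia.org/wiki/Sparse_matrix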
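As a rough picture of what that matrix looks like, here is a small base-R sketch that turns a list of baskets into a transaction-by-item matrix. The baskets are invented for the example, and the matrix here is an ordinary dense one; arules keeps the same information in a sparse format so that the many empty cells cost nothing.

```r
# Hypothetical baskets, one character vector per transaction
baskets <- list(
  c("milk", "eggs"),
  c("bread"),
  c("milk", "bread", "butter")
)

# One column per distinct item
items <- sort(unique(unlist(baskets)))

# One row per basket: TRUE where the basket contains the item
incidence <- t(vapply(baskets, function(b) items %in% b, logical(length(items))))
colnames(incidence) <- items
print(incidence * 1)  # show as 0/1

# Density = share of cells that are filled
density <- sum(incidence) / length(incidence)
print(density)
```

With 119 items and only about 4 items per basket in the real data, almost every cell is empty, which is why a sparse representation makes sense.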
# install.packages('arules')
library(arules)
# rm.duplicates = TRUE removes items repeated within a single transaction, since a transaction is treated as a set of distinct items
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 5
The 1 and 5 mean that 5 transactions each contained 1 duplicated item.
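The same bookkeeping can be sketched in base R. The rows below are invented, and this only mirrors my reading of the "distribution of transactions with duplicates" output (number of duplicates on top, number of transactions underneath); it is not the arules code.

```r
# Hypothetical raw rows, some with the same item listed more than once
rows <- list(
  c("milk", "eggs", "milk"),
  c("bread", "butter"),
  c("tea", "tea", "tea", "honey")
)

# Number of duplicated entries in each row
dup_counts <- vapply(rows, function(r) sum(duplicated(r)), integer(1))

# How many rows have 1 duplicate, 2 duplicates, ...
print(table(dup_counts[dup_counts > 0]))

# Removing the duplicates leaves each row as a set of distinct items
clean <- lapply(rows, unique)
print(clean)
```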
Let’s have a look.
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate
##          1788          1348          1306          1282          1229
##       (Other)
##         22405
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17
##   16   18   19   20
##    4    1    2    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   3.914   5.000  20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
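The density and the mean basket size in this summary can be checked by hand from the length distribution it prints. A small sketch, using only numbers copied from the output above:

```r
# Basket sizes and how many transactions have each size (from summary(dataset))
sizes  <- c(1:16, 18:20)
counts <- c(1754, 1358, 1044, 816, 667, 493, 391, 324, 259, 139,
            102, 67, 40, 22, 17, 4, 1, 2, 1)

n_transactions <- sum(counts)          # rows of the matrix
total_items    <- sum(sizes * counts)  # filled cells of the matrix
n_items        <- 119                  # columns of the matrix

# Density = filled cells / all cells
density <- total_items / (n_transactions * n_items)
# Mean basket size = filled cells / rows
mean_size <- total_items / n_transactions

print(c(n_transactions = n_transactions, total_items = total_items))
print(round(density, 8))
print(round(mean_size, 3))
```

So a density of about 0.033 simply says that an average basket holds roughly 3.9 of the 119 possible items.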
itemFrequencyPlot() is a function from the arules package that plots how often each item occurs.
# topN limits the plot to the N most frequent items
itemFrequencyPlot(dataset, topN = 40)
There is no rule of thumb for choosing the support; the short answer is that it depends ;-) Support measures how frequently an item (or set of items) appears in the data: the number of transactions containing it divided by the total number of transactions. The minimum support tells the algorithm to ignore items that don’t show up frequently enough. Looking at the frequency plot from itemFrequencyPlot, items become less frequent, and so less impactful, as we move to the right, and their support is correspondingly lower.
Treating the dataset as one week of transactions, we’ll go with items that are purchased 3-4 times a day: 3 x 7 = 21 purchases a week, divided by the total number of transactions. The support parameter is described as: a numeric value for the minimal support of an item set (default: 0.1)
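As a quick sanity check of what support means, here it is computed for the five most frequent items, using the counts printed by summary(dataset). The 0.003 threshold is the one we are about to choose; the variable names are just for this sketch.

```r
n_transactions <- 7501

# Item counts copied from the "most frequent items" part of the summary
item_counts <- c("mineral water" = 1788, eggs = 1348, spaghetti = 1306,
                 "french fries" = 1282, chocolate = 1229)

# Support = transactions containing the item / all transactions
item_support <- item_counts / n_transactions
print(round(item_support, 4))

# All of these clear a minimum support of 0.003 by a wide margin
min_support <- 0.003
print(item_support >= min_support)
```

Mineral water, for example, appears in almost a quarter of all baskets, so a threshold of a fraction of a percent only filters out the genuinely rare items.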
# calculate the minimum support we want to use (7500 is the rounded number of transactions)
3*7/7500
## [1] 0.0028
We’ll set the support at 0.003 to start. minlen is the minimum number of items an itemset must contain; we choose two, so single items are not returned.
rules = eclat(data = dataset, parameter = list(support = 0.003, minlen = 2 ))
## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   0.003      2     10 frequent itemsets FALSE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 22
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating sparse bit matrix ... [115 row(s), 7501 column(s)] done [0.00s].
## writing ... [1328 set(s)] done [0.02s].
## Creating S4 object ... done [0.00s].
The most important bit is how many sets our algorithm wrote: writing ... [1328 set(s)] done [0.02s].
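The log also reports "Absolute minimum support count: 22". That number is consistent with the relative support multiplied by the number of transactions and rounded down. The rounding-down part is my reading of the printed value, not something the output states, so treat this as a sketch:

```r
n_transactions <- 7501
min_support    <- 0.003

# Relative support expressed as a number of transactions
raw_count <- min_support * n_transactions
abs_count <- floor(raw_count)  # assumption: rounded down, which matches the log

print(raw_count)
print(abs_count)

# What that means per day if the data covers one week
print(round(abs_count / 7, 2))
```

In other words, an itemset has to show up in at least 22 baskets, a little over 3 a day, to be kept.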
We can choose what to sort the itemsets by, in this case support. The [1:20] tells R we want the first 20 itemsets, that is, the most common sets of items.
inspect(sort(rules, by = 'support')[1:20])
## items support count
## [1] {mineral water,spaghetti} 0.05972537 448
## [2] {chocolate,mineral water} 0.05265965 395
## [3] {eggs,mineral water} 0.05092654 382
## [4] {milk,mineral water} 0.04799360 360
## [5] {ground beef,mineral water} 0.04092788 307
## [6] {ground beef,spaghetti} 0.03919477 294
## [7] {chocolate,spaghetti} 0.03919477 294
## [8] {eggs,spaghetti} 0.03652846 274
## [9] {eggs,french fries} 0.03639515 273
## [10] {frozen vegetables,mineral water} 0.03572857 268
## [11] {milk,spaghetti} 0.03546194 266
## [12] {chocolate,french fries} 0.03439541 258
## [13] {mineral water,pancakes} 0.03372884 253
## [14] {french fries,mineral water} 0.03372884 253
## [15] {chocolate,eggs} 0.03319557 249
## [16] {chocolate,milk} 0.03212905 241
## [17] {green tea,mineral water} 0.03106252 233
## [18] {eggs,milk} 0.03079589 231
## [19] {burgers,eggs} 0.02879616 216
## [20] {french fries,green tea} 0.02852953 214
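The support and count columns carry the same information: support is just count divided by the number of transactions. A short check on the first three rows, with the values copied from the table above:

```r
n_transactions <- 7501

# Counts for the top three itemsets, as printed by inspect()
counts <- c("{mineral water,spaghetti}" = 448,
            "{chocolate,mineral water}" = 395,
            "{eggs,mineral water}"      = 382)

# Support = count / number of transactions
support <- counts / n_transactions
print(round(support, 8))

# Supports as printed in the table, for comparison
printed <- c(0.05972537, 0.05265965, 0.05092654)
print(all(abs(support - printed) < 1e-7))
```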
How about 4 times a day instead of 3? Let’s calculate that.
# calculate the minimum support we want to use (7500 is the rounded number of transactions)
4*7/7500
## [1] 0.003733333
Let’s round that to 0.004.
rules = eclat(data = dataset, parameter = list(support = 0.004, minlen = 2 ))
## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   0.004      2     10 frequent itemsets FALSE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 30
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [114 item(s)] done [0.00s].
## creating sparse bit matrix ... [114 row(s), 7501 column(s)] done [0.00s].
## writing ... [845 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
This time: writing ... [845 set(s)] done [0.01s]. Let’s look at the top 20 sets. There is no change in the top sets: raising the support only changes the number of sets returned (845 instead of 1328), because fewer sets meet the stricter criterion.
inspect(sort(rules, by = 'support')[1:20])
## items support count
## [1] {mineral water,spaghetti} 0.05972537 448
## [2] {chocolate,mineral water} 0.05265965 395
## [3] {eggs,mineral water} 0.05092654 382
## [4] {milk,mineral water} 0.04799360 360
## [5] {ground beef,mineral water} 0.04092788 307
## [6] {ground beef,spaghetti} 0.03919477 294
## [7] {chocolate,spaghetti} 0.03919477 294
## [8] {eggs,spaghetti} 0.03652846 274
## [9] {eggs,french fries} 0.03639515 273
## [10] {frozen vegetables,mineral water} 0.03572857 268
## [11] {milk,spaghetti} 0.03546194 266
## [12] {chocolate,french fries} 0.03439541 258
## [13] {mineral water,pancakes} 0.03372884 253
## [14] {french fries,mineral water} 0.03372884 253
## [15] {chocolate,eggs} 0.03319557 249
## [16] {chocolate,milk} 0.03212905 241
## [17] {green tea,mineral water} 0.03106252 233
## [18] {eggs,milk} 0.03079589 231
## [19] {burgers,eggs} 0.02879616 216
## [20] {french fries,green tea} 0.02852953 214
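Why the top 20 is identical in both runs can be seen from the counts alone. The sketch below uses the counts from the table and the two "Absolute minimum support count" values from the Eclat logs (22 and 30); the tail counts at the end are invented, purely to show where the threshold does bite.

```r
# Counts of the top 20 itemsets, copied from the inspect() output
top_counts <- c(448, 395, 382, 360, 307, 294, 294, 274, 273, 268,
                266, 258, 253, 253, 249, 241, 233, 231, 216, 214)

# Absolute minimum support counts reported by the two runs
abs_min <- c(support_0.003 = 22, support_0.004 = 30)

# Every top itemset clears both thresholds, so the top 20 cannot change
print(all(top_counts >= abs_min[["support_0.004"]]))

# Hypothetical counts from the rare end of the list
tail_counts <- c(35, 31, 29, 25, 22)
kept <- vapply(abs_min, function(m) sum(tail_counts >= m), numeric(1))
print(kept)  # fewer itemsets survive the higher threshold
```

The threshold only trims the rare end of the list, which is where the drop from 1328 to 845 sets comes from.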
=========================
GitHub files: https://github.com/ghettocounselor
Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf