https://en.wikipedia.org/wiki/Association_rule_learning

Intuition lecture 167 https://www.udemy.com/machinelearning/learn/lecture/6455326

Lecture 169 https://www.udemy.com/machinelearning/learn/lecture/5935024

Eclat deals only with support. It is really a simplified version of the Apriori algorithm, and what it returns is simply the sets of items that are frequently found together in our transactions.
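
To make the idea concrete, here is a minimal sketch in base R of how support for pairs of items can be found by intersecting transaction-id lists, which is the vertical layout Eclat is known for. The baskets, the 0.4 threshold and all variable names are made up for illustration; it only handles pairs and is not how arules implements eclat internally.

```r
# Toy transactions (hypothetical data, not the course dataset)
transactions <- list(
  c("mineral water", "spaghetti", "eggs"),
  c("mineral water", "spaghetti"),
  c("eggs", "milk"),
  c("mineral water", "milk"),
  c("spaghetti", "mineral water", "milk")
)
n <- length(transactions)

# Vertical layout: for each item, the ids of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(it) {
  which(sapply(transactions, function(tr) it %in% tr))
})
names(tid_lists) <- items

# Support of a pair = size of the intersection of the two id lists / n
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- sapply(pairs, function(p) {
  length(intersect(tid_lists[[p[1]]], tid_lists[[p[2]]])) / n
})
names(pair_support) <- sapply(pairs, paste, collapse = ", ")

# Keep only the pairs that reach the (assumed) minimum support
min_support <- 0.4
frequent_pairs <- sort(pair_support[pair_support >= min_support],
                       decreasing = TRUE)
print(frequent_pairs)
```

Only support is computed here; there is no confidence or lift, which is the sense in which Eclat is simpler than Apriori.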

Check the working directory with getwd() so you always know where you are working.

Data Preprocessing

First let’s import as normal.

# the file has no header row, so header = FALSE tells read.csv to treat the first line as data
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)

Figure: dataset intro, prior to creating the sparse matrix.

Sparse Matrix

We need to import the dataset in a particular way. read.csv gave us a data frame with one row per transaction, but what eclat expects is a set of transactions: in effect a list of lists, holding just the items in each transaction. To clean things up we’ll also remove all the nulls (empty cells) from each row, so that each transaction lists only the items actually bought. To do this we’ll create a sparse matrix using the arules package. https://en.wikipedia.org/wiki/Sparse_matrix

# install.packages('arules')
library(arules)
# rm.duplicates = TRUE removes duplicate items within a transaction, because eclat cannot handle duplicates
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1 
## 5

The 1 and 5 mean that 5 transactions each contained 1 duplicated item.
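
As a rough illustration of what that report counts, the sketch below does the same bookkeeping in base R on three made-up baskets. The data and variable names are hypothetical, and this only mimics the idea, not the internals of read.transactions.

```r
# Hypothetical baskets; the second one lists "eggs" twice
baskets <- list(
  c("milk", "bread"),
  c("eggs", "milk", "eggs"),
  c("spaghetti")
)

# Number of duplicated items in each basket
n_dups <- sapply(baskets, function(b) sum(duplicated(b)))

# Tabulate only the baskets that had duplicates:
# the name is the number of duplicates, the value is how many baskets had that many
dup_distribution <- table(n_dups[n_dups > 0])
print(dup_distribution)

# Remove the duplicates so each item appears at most once per basket
cleaned <- lapply(baskets, unique)
print(cleaned[[2]])
```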

Let’s have a look.

summary(dataset)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17 
##   16   18   19   20 
##    4    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
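
The density figure can be checked by hand. The sketch below assumes that the "most frequent items" counts, including (Other), add up to the total number of filled cells in the sparse matrix; that assumption is consistent with the printed density and mean.

```r
# Figures copied from the summary output above
n_transactions <- 7501
n_items <- 119
item_counts <- c(1788, 1348, 1306, 1282, 1229, 22405)  # top five plus (Other)

# Assumption: these counts cover every item occurrence in the data
filled_cells <- sum(item_counts)

# Density = filled cells / all cells in the transactions x items matrix
density <- filled_cells / (n_transactions * n_items)
print(density)

# The same total also gives the mean basket size
mean_basket_size <- filled_cells / n_transactions
print(mean_basket_size)
```

So a density of about 0.033 is another way of saying that an average basket holds about 3.9 of the 119 possible items.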

Plot the dataset

itemFrequencyPlot is a function from the arules package.

# topN is how many of the most frequent items to plot
itemFrequencyPlot(dataset, topN = 40)
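
For intuition about what the plot shows, here is a small base R sketch that computes relative item frequencies for a few made-up baskets and keeps the top N. The baskets and names are hypothetical, and this is a stand-in for the idea rather than the arules implementation.

```r
# Hypothetical baskets
baskets <- list(
  c("mineral water", "eggs"),
  c("mineral water", "spaghetti"),
  c("eggs", "mineral water", "chocolate"),
  c("spaghetti", "mineral water"),
  c("eggs")
)

# Relative frequency: share of baskets that contain each item
item_freq <- table(unlist(baskets)) / length(baskets)

# Keep only the top_n most frequent items, most frequent first
top_n <- 2
top_items <- head(sort(item_freq, decreasing = TRUE), top_n)
print(top_items)
```

Passing top_items to barplot() would draw a simple version of the same chart.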

Eclat Steps

Figure: the steps to work through to tune the algorithm.

Training Eclat on the dataset - creating sets of interest

There is no rule of thumb for selecting the support; the short answer is that it depends ;-) Support measures how frequently an item set appears in the data, so the minimum support tells the algorithm to ignore item sets that don’t show up frequently enough. Looking at the frequency plot above, we can see that as we move to the right the items become less frequent and less impactful, and so their support is lower.

We’ll go with things that are purchased 3-4 times a day. Treating the dataset as one week of transactions, that is 3 x 7 = 21 purchases a week, divided by the total number of transactions.

For reference, the parameter is support: a numeric value for the minimal support of an item set (default: 0.1)

# calculate the support we want to use: 3 a day x 7 days / roughly 7500 transactions
3*7/7500
## [1] 0.0028

We’ll set support at 0.003 to start. minlen is the minimum number of items we want in a set; we choose 2.

rules = eclat(data = dataset, parameter = list(support = 0.003, minlen = 2 ))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   0.003      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 22 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating sparse bit matrix ... [115 row(s), 7501 column(s)] done [0.00s].
## writing  ... [1328 set(s)] done [0.02s].
## Creating S4 object  ... done [0.00s].

The most important bit is how many sets our algorithm wrote: writing … [1328 set(s)] done [0.02s].
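
The "Absolute minimum support count: 22" line in the output is the relative support turned into a number of transactions. A quick check, assuming the count is obtained by rounding down (an assumption that matches the printed value, not something taken from the arules source):

```r
n_transactions <- 7501
min_support <- 0.003

# 0.003 * 7501 = 22.503; rounding down gives the printed count
min_count <- floor(min_support * n_transactions)
print(min_count)
```

In other words, a set has to appear in at least 22 of the 7501 transactions to be kept.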

Visualising the results

arules lets us state what we want to sort by, in this case support. The [1:20] tells R we want the first 20 sets, the most common sets of things.

inspect(sort(rules, by = 'support')[1:20])
##      items                             support    count
## [1]  {mineral water,spaghetti}         0.05972537 448  
## [2]  {chocolate,mineral water}         0.05265965 395  
## [3]  {eggs,mineral water}              0.05092654 382  
## [4]  {milk,mineral water}              0.04799360 360  
## [5]  {ground beef,mineral water}       0.04092788 307  
## [6]  {ground beef,spaghetti}           0.03919477 294  
## [7]  {chocolate,spaghetti}             0.03919477 294  
## [8]  {eggs,spaghetti}                  0.03652846 274  
## [9]  {eggs,french fries}               0.03639515 273  
## [10] {frozen vegetables,mineral water} 0.03572857 268  
## [11] {milk,spaghetti}                  0.03546194 266  
## [12] {chocolate,french fries}          0.03439541 258  
## [13] {mineral water,pancakes}          0.03372884 253  
## [14] {french fries,mineral water}      0.03372884 253  
## [15] {chocolate,eggs}                  0.03319557 249  
## [16] {chocolate,milk}                  0.03212905 241  
## [17] {green tea,mineral water}         0.03106252 233  
## [18] {eggs,milk}                       0.03079589 231  
## [19] {burgers,eggs}                    0.02879616 216  
## [20] {french fries,green tea}          0.02852953 214
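
The support column is just the count divided by the number of transactions. A short check on the first three rows, with the counts copied from the output above:

```r
n_transactions <- 7501

# Counts copied from the first three rows of the inspect() output
counts <- c(
  "mineral water,spaghetti" = 448,
  "chocolate,mineral water" = 395,
  "eggs,mineral water"      = 382
)

# Support = share of all transactions that contain the whole set
support <- counts / n_transactions
print(round(support, 8))
```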

Adjust support just for fun

How about 4 times a day instead of 3? Let’s calculate that.

# calculate the support we want to use: 4 a day x 7 days / roughly 7500 transactions
4*7/7500
## [1] 0.003733333

Let’s round that to 0.004.

rules = eclat(data = dataset, parameter = list(support = 0.004, minlen = 2 ))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   0.004      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 30 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [114 item(s)] done [0.00s].
## creating sparse bit matrix ... [114 row(s), 7501 column(s)] done [0.00s].
## writing  ... [845 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

This time we get writing … [845 set(s)] done [0.01s]. Let’s look at 20 sets. Really there is no change in the top sets: raising the support only reduces the number of sets returned (845 instead of 1328), because fewer sets meet the stricter criterion.

inspect(sort(rules, by = 'support')[1:20])
##      items                             support    count
## [1]  {mineral water,spaghetti}         0.05972537 448  
## [2]  {chocolate,mineral water}         0.05265965 395  
## [3]  {eggs,mineral water}              0.05092654 382  
## [4]  {milk,mineral water}              0.04799360 360  
## [5]  {ground beef,mineral water}       0.04092788 307  
## [6]  {ground beef,spaghetti}           0.03919477 294  
## [7]  {chocolate,spaghetti}             0.03919477 294  
## [8]  {eggs,spaghetti}                  0.03652846 274  
## [9]  {eggs,french fries}               0.03639515 273  
## [10] {frozen vegetables,mineral water} 0.03572857 268  
## [11] {milk,spaghetti}                  0.03546194 266  
## [12] {chocolate,french fries}          0.03439541 258  
## [13] {mineral water,pancakes}          0.03372884 253  
## [14] {french fries,mineral water}      0.03372884 253  
## [15] {chocolate,eggs}                  0.03319557 249  
## [16] {chocolate,milk}                  0.03212905 241  
## [17] {green tea,mineral water}         0.03106252 233  
## [18] {eggs,milk}                       0.03079589 231  
## [19] {burgers,eggs}                    0.02879616 216  
## [20] {french fries,green tea}          0.02852953 214
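
Why the top of the list does not move can be seen with a small filter. The sketch below uses three counts from the output above plus one invented rare set (labelled as hypothetical) that passes the 0.003 threshold but not the 0.004 one.

```r
n_transactions <- 7501

# Three counts from the output above, plus one made-up rare set
counts <- c(
  "mineral water,spaghetti" = 448,
  "chocolate,mineral water" = 395,
  "eggs,mineral water"      = 382,
  "hypothetical rare pair"  = 25   # invented for illustration
)
support <- counts / n_transactions

# Apply the two minimum-support thresholds used in this post
kept_at_003 <- names(support)[support >= 0.003]
kept_at_004 <- names(support)[support >= 0.004]

print(length(kept_at_003))
print(length(kept_at_004))

# The most frequent sets clear both thresholds, so the top of the list is unchanged
print(identical(head(kept_at_003, 3), head(kept_at_004, 3)))
```

Only the tail of the list is trimmed by a higher threshold; the sets with the largest support are far above either cut-off.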

=========================
GitHub files: https://github.com/ghettocounselor

Useful PDF for common questions in lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf