Intuition Lecture 159 https://www.udemy.com/machinelearning/learn/lecture/6455322

R Lectures:

161 https://www.udemy.com/machinelearning/learn/lecture/5921986

162 https://www.udemy.com/machinelearning/learn/lecture/5927156

163 https://www.udemy.com/machinelearning/learn/lecture/5932642

https://en.wikipedia.org/wiki/Association_rule_learning

https://en.wikipedia.org/wiki/Apriori_algorithm

Overview

The Apriori (a reference to prior knowledge) algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions, for example, collections of items bought by customers, or details of website visits or IP addresses.

Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.

Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data: people who bought beer also bought diapers, people who bought diapers also bought laundry detergent, and so on. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
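To make the level-wise idea concrete, here is a minimal base-R sketch (illustration only, not the arules implementation we use below; it uses a simplified extend-by-one candidate step rather than the full join-and-prune, and the names apriori_sketch, count_support and min_count are made up for this example).

# a toy level-wise frequent-itemset miner; min_count plays the role of C above
apriori_sketch <- function(transactions, min_count) {
  # support count: in how many transactions does every item of the set appear?
  count_support <- function(itemset) {
    sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
  }
  items <- sort(unique(unlist(transactions)))
  # level 1: frequent single items
  frequent <- Filter(function(s) count_support(s) >= min_count, as.list(items))
  all_frequent <- frequent
  while (length(frequent) > 0) {
    # candidate generation: extend each frequent itemset by one more item;
    # extending only frequent sets is the downward closure lemma at work
    candidates <- unique(unlist(lapply(frequent, function(s) {
      lapply(setdiff(items, s), function(i) sort(c(s, i)))
    }), recursive = FALSE))
    # test the candidates against the data; keep those meeting the threshold
    frequent <- Filter(function(s) count_support(s) >= min_count, candidates)
    all_frequent <- c(all_frequent, frequent)
  }
  all_frequent
}

# toy usage: beer/diapers style baskets with a threshold of 2 transactions
tx <- list(c("beer", "diapers"),
           c("beer", "diapers", "detergent"),
           c("diapers", "detergent"))
apriori_sketch(tx, min_count = 2)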

Check your working directory with getwd() so you always know where you are working.
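For example (the setwd path is just a placeholder):

getwd()                        # print the current working directory
# setwd('~/Machine_Learning')  # change it if needed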

Data Preprocessing

First, let's import the data as usual.

# the file has no header row, so tell read.csv not to treat the first line as column names
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
Screenshot: a first look at the dataset, prior to creating the sparse matrix.

Sparse Matrix

We need to import the dataset in a particular way. We currently have a plain data frame, but what apriori is expecting is, in effect, a list of lists: the set of items in each transaction, with the empty cells removed from each row. To get there we'll read the file into a sparse matrix using the arules package. https://en.wikipedia.org/wiki/Sparse_matrix

# install.packages('arules')
library(arules)
# rm.duplicates = TRUE removes duplicated items within a transaction, which the Apriori algorithm cannot handle
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1 
## 5

The 1 over 5 means there were 5 transactions that each contained 1 duplicated item.
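To see what rm.duplicates is doing, here is a small self-contained illustration using a throwaway temp file (toy data, not the course dataset):

# one basket lists 'eggs' twice; rm.duplicates = TRUE keeps a single copy
# and reports the duplicates it removed, as above
tmp <- tempfile(fileext = '.csv')
writeLines(c('eggs,eggs,milk', 'bread,milk'), tmp)
toy <- read.transactions(tmp, sep = ',', rm.duplicates = TRUE)
inspect(toy)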

Let’s have a look.

summary(dataset)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17 
##   16   18   19   20 
##    4    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Plot the dataset

itemFrequencyPlot is a function from the arules package.

# topN sets how many of the most frequent items to plot
itemFrequencyPlot(dataset, topN = 60)
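As an aside, itemFrequencyPlot can also filter by a minimum support instead of a fixed count; the 0.05 here is just an example threshold:

# plot only the items that appear in at least 5% of all transactions
itemFrequencyPlot(dataset, support = 0.05)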
Screenshot: the steps to work through to tune the algorithm.

Training Apriori on the dataset - creating rules

There is no rule of thumb for selecting the support and confidence; the short answer is, it depends. Support captures how frequently an item appears in the data, so the minimum support tells the algorithm which rarely-bought items to ignore. Looking at our frequency plot above, items further to the right are purchased less often and have less impact, so their support is lower. We'll focus on items purchased 3-4 times a day: 3 x 7 = 21 purchases a week, divided by the total number of transactions. From the arules documentation: support, a numeric value for the minimal support of an item set (default: 0.1).

# calculated our support that we want to use
3*7/7500
## [1] 0.0028

Setting confidence depends on the business use case. We'll start with the default and then make adjustments based on how the algorithm does. From the documentation: confidence, a numeric value for the minimal confidence of rules/association hyperedges (default: 0.8). For frequent itemsets it is set to NA.

rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The most important bit is how many rules the algorithm wrote: writing ... [0 rule(s)] done [0.00s]. Think about what we were asking with a confidence of 0.8: we required a rule to hold 4 out of 5 times, and at our support level nothing in this data is that strong. So let's try again with a lower confidence.

rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

OK, that's better: writing ... [281 rule(s)] done [0.00s]. We'll move forward with these rules on to Step 4.

Visualising the results

The by argument lets us state what we want to sort by, in this case lift. The [1:10] tells us we want the first 10 rules by lift, the strongest rules. https://en.wikipedia.org/wiki/Association_rule_learning#Lift

inspect(sort(rules, by = 'lift')[1:10])
##      lhs                    rhs                     support confidence     lift count
## [1]  {mineral water,                                                                 
##       whole wheat pasta} => {olive oil}         0.003866151  0.4027778 6.115863    29
## [2]  {spaghetti,                                                                     
##       tomato sauce}      => {ground beef}       0.003066258  0.4893617 4.980600    23
## [3]  {french fries,                                                                  
##       herb & pepper}     => {ground beef}       0.003199573  0.4615385 4.697422    24
## [4]  {cereals,                                                                       
##       spaghetti}         => {ground beef}       0.003066258  0.4600000 4.681764    23
## [5]  {frozen vegetables,                                                             
##       mineral water,                                                                 
##       soup}              => {milk}              0.003066258  0.6052632 4.670863    23
## [6]  {chocolate,                                                                     
##       herb & pepper}     => {ground beef}       0.003999467  0.4411765 4.490183    30
## [7]  {chocolate,                                                                     
##       mineral water,                                                                 
##       shrimp}            => {frozen vegetables} 0.003199573  0.4210526 4.417225    24
## [8]  {frozen vegetables,                                                             
##       mineral water,                                                                 
##       olive oil}         => {milk}              0.003332889  0.5102041 3.937285    25
## [9]  {cereals,                                                                       
##       ground beef}       => {spaghetti}         0.003066258  0.6764706 3.885303    23
## [10] {frozen vegetables,                                                             
##       soup}              => {milk}              0.003999467  0.5000000 3.858539    30
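Before tuning further, note how the columns relate: confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs), and count = support x the number of transactions. A quick arithmetic sanity check on rule [2] above:

# rule [2]: {spaghetti, tomato sauce} => {ground beef}
supp <- 0.003066258    # support of the full itemset
conf <- 0.4893617      # reported confidence
supp * 7501            # ~23, the count column
supp / conf            # implied support of the lhs, ~0.00627 (about 47 baskets)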

So, let's adjust the confidence to help clean up the rules; basically we want fewer rules that look strong only because they involve items with lots of support, like chocolate. We'll go with 0.2.

rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1348 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

writing ... [1348 rule(s)] done [0.00s]. That's a lot more rules :-) and that's expected with a lower confidence threshold.

inspect(sort(rules, by = 'lift')[1:10])
##      lhs                                       rhs             support    
## [1]  {mineral water,whole wheat pasta}      => {olive oil}     0.003866151
## [2]  {frozen vegetables,milk,mineral water} => {soup}          0.003066258
## [3]  {fromage blanc}                        => {honey}         0.003332889
## [4]  {spaghetti,tomato sauce}               => {ground beef}   0.003066258
## [5]  {light cream}                          => {chicken}       0.004532729
## [6]  {pasta}                                => {escalope}      0.005865885
## [7]  {french fries,herb & pepper}           => {ground beef}   0.003199573
## [8]  {cereals,spaghetti}                    => {ground beef}   0.003066258
## [9]  {frozen vegetables,mineral water,soup} => {milk}          0.003066258
## [10] {french fries,ground beef}             => {herb & pepper} 0.003199573
##      confidence lift     count
## [1]  0.4027778  6.115863 29   
## [2]  0.2771084  5.484407 23   
## [3]  0.2450980  5.164271 25   
## [4]  0.4893617  4.980600 23   
## [5]  0.2905983  4.843951 34   
## [6]  0.3728814  4.700812 44   
## [7]  0.4615385  4.697422 24   
## [8]  0.4600000  4.681764 23   
## [9]  0.6052632  4.670863 23   
## [10] 0.2307692  4.665768 24
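If very common items like chocolate still dominate, arules can also filter a rule set directly with subset(); a quick sketch (the item label must match the dataset's spelling):

# keep only rules whose left-hand side does not contain chocolate
no_choc <- subset(rules, subset = !(lhs %in% 'chocolate'))
inspect(sort(no_choc, by = 'lift')[1:5])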

Lift

Lift (originally called interest): in data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response. https://en.wikipedia.org/wiki/Association_rule_learning#Lift

For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%). https://en.wikipedia.org/wiki/Lift_(data_mining)
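In R terms, that example and the rule columns above line up like this:

# lift = target response rate / population average response rate
0.20 / 0.05            # 4.0, the lift from the example above

# for a rule X => Y, lift = confidence / support(Y), so confidence / lift
# recovers the support of the right-hand side; rule [1] further above:
0.4027778 / 6.115863   # ~0.0659, the implied support of {olive oil}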

Adjust support

How about 4 times a day instead of three? Let's calculate that.

# calculated our support that we want to use
4*7/7500
## [1] 0.003733333

Let's round that up to 0.004.

rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 30 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [114 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [811 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

writing ... [811 rule(s)] done [0.00s]. Let's look at the top 20 rules this time.

inspect(sort(rules, by = 'lift')[1:20])
##      lhs                       rhs                     support confidence     lift count
## [1]  {light cream}          => {chicken}           0.004532729  0.2905983 4.843951    34
## [2]  {pasta}                => {escalope}          0.005865885  0.3728814 4.700812    44
## [3]  {pasta}                => {shrimp}            0.005065991  0.3220339 4.506672    38
## [4]  {eggs,                                                                             
##       ground beef}          => {herb & pepper}     0.004132782  0.2066667 4.178455    31
## [5]  {whole wheat pasta}    => {olive oil}         0.007998933  0.2714932 4.122410    60
## [6]  {herb & pepper,                                                                    
##       spaghetti}            => {ground beef}       0.006399147  0.3934426 4.004360    48
## [7]  {herb & pepper,                                                                    
##       mineral water}        => {ground beef}       0.006665778  0.3906250 3.975683    50
## [8]  {tomato sauce}         => {ground beef}       0.005332622  0.3773585 3.840659    40
## [9]  {mushroom cream sauce} => {escalope}          0.005732569  0.3006993 3.790833    43
## [10] {frozen vegetables,                                                                
##       mineral water,                                                                    
##       spaghetti}            => {ground beef}       0.004399413  0.3666667 3.731841    33
## [11] {olive oil,                                                                        
##       tomatoes}             => {spaghetti}         0.004399413  0.6111111 3.509912    33
## [12] {frozen vegetables,                                                                
##       spaghetti}            => {tomatoes}          0.006665778  0.2392344 3.498046    50
## [13] {mineral water,                                                                    
##       soup}                 => {olive oil}         0.005199307  0.2254335 3.423030    39
## [14] {ground beef,                                                                      
##       milk}                 => {olive oil}         0.004932676  0.2242424 3.404944    37
## [15] {eggs,                                                                             
##       herb & pepper}        => {ground beef}       0.004132782  0.3297872 3.356491    31
## [16] {spaghetti,                                                                        
##       tomatoes}             => {frozen vegetables} 0.006665778  0.3184713 3.341054    50
## [17] {herb & pepper}        => {ground beef}       0.015997867  0.3234501 3.291994   120
## [18] {grated cheese,                                                                    
##       spaghetti}            => {ground beef}       0.005332622  0.3225806 3.283144    40
## [19] {cooking oil,                                                                      
##       ground beef}          => {spaghetti}         0.004799360  0.5714286 3.281995    36
## [20] {frozen vegetables,                                                                
##       olive oil}            => {milk}              0.004799360  0.4235294 3.268410    36
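Finally, if you want to work with or export the rules outside arules, one option is converting them to a data frame (a sketch; the file name is just an example):

# convert the rule set to a plain data frame and sort by lift
rules_df <- as(rules, 'data.frame')
head(rules_df[order(-rules_df$lift), ], 10)
# write.csv(rules_df, 'apriori_rules.csv', row.names = FALSE)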

=========================
GitHub files: https://github.com/ghettocounselor

Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf