Intuition Lecture 159 https://www.udemy.com/machinelearning/learn/lecture/6455322
R Lectures:
161 https://www.udemy.com/machinelearning/learn/lecture/5921986
162 https://www.udemy.com/machinelearning/learn/lecture/5927156
163 https://www.udemy.com/machinelearning/learn/lecture/5932642
https://en.wikipedia.org/wiki/Association_rule_learning
https://en.wikipedia.org/wiki/Apriori_algorithm
The Apriori algorithm (the name refers to its use of prior knowledge) was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions, for example collections of items bought by customers, or details of website visits or IP addresses.
Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi) or having no timestamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data: people who bought beer also bought diapers, people who bought diapers also bought laundry detergent, and so on. The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1, then prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine which of the candidates are in fact frequent.
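To make the level-wise idea concrete, here is a tiny toy sketch in R (illustration only, on made-up transactions; for the real analysis below we rely on the arules package):
# four made-up transactions
transactions <- list(
  c("beer", "diapers"),
  c("beer", "diapers", "detergent"),
  c("diapers", "detergent"),
  c("beer", "bread")
)
min_count <- 2   # the threshold C: keep itemsets appearing in at least 2 transactions
# count how many transactions contain every item of a given itemset
support_count <- function(itemset) {
  sum(sapply(transactions, function(t) all(itemset %in% t)))
}
# frequent 1-itemsets
items  <- unique(unlist(transactions))
freq_1 <- items[sapply(items, support_count) >= min_count]
# candidate 2-itemsets built from the frequent 1-itemsets, then filtered again
cand_2 <- combn(sort(freq_1), 2, simplify = FALSE)
freq_2 <- Filter(function(s) support_count(s) >= min_count, cand_2)
freq_2   # should contain {beer, diapers} and {detergent, diapers}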
Check the working directory with getwd() so you always know where you are working.
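For example (the setwd() path is just a placeholder):
getwd()                          # print the current working directory
# setwd("path/to/your/project")  # uncomment and adjust if you need to change it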
First let’s import as normal.
# the file has no header row, so tell read.csv not to treat the first line as column names
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
A first look at the raw dataset, prior to creating the Sparse Matrix.
We need to import the dataset in a particular way. What we have is a flat table, but what apriori is expecting is a list of lists: for each transaction, the list of items it contains, with all the empty cells removed. To do this we’ll create a Sparse Matrix using the arules package. https://en.wikipedia.org/wiki/Sparse_matrix
# install.packages('arules')
library(arules)
# rm.duplicates = TRUE removes duplicated items within a transaction, because the Apriori algorithm cannot handle duplicates
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 5
The 1 over 5 means there were 5 transactions that each contained 1 duplicated item.
Let’s have a look.
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17
## 16 18 19 20
## 4 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
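Worth noting in that summary: 7501 transactions, 119 distinct items, and a density of about 0.033, i.e. roughly 3.3% of the cells in the 7501 x 119 item matrix are filled. A quick arithmetic check:
# non-empty cells = rows x columns x density
7501 * 119 * 0.03288973
# about 29358, the total number of items purchased across all transactions
# (the item counts in the summary add up to the same: 1788 + 1348 + 1306 + 1282 + 1229 + 22405 = 29358)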
itemFrequencyPlot() is a function from the arules package.
# topN sets how many of the most frequent items to plot
itemFrequencyPlot(dataset, topN = 60)
Item frequency plot of the top 60 items.
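If you want the numbers behind the plot rather than the bars, itemFrequency() from the same package returns them as a vector (a quick sketch):
# absolute counts per item; the top 5 should match the most frequent items in summary() above
head(sort(itemFrequency(dataset, type = "absolute"), decreasing = TRUE), 5)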
There is no rule of thumb for selecting the Support and Confidence; the short answer is: it depends ;-) Support captures how frequently an item appears in the data, so the minimum support is there to tell the algorithm which items to ignore. Looking at the frequency plot above, the further right we go the less frequent, and less impactful, the items become, and the lower their support is. We’ll go with items that are purchased 3-4 times a day: 3 x 7 = 21 purchases a week, divided by the total number of transactions. From the arules documentation, support is “a numeric value for the minimal support of an item set (default: 0.1)”.
# calculate the support we want to use
3*7/7500
## [1] 0.0028
Setting the confidence depends on the business use case. We’ll start with the default and then adjust it based on how the algorithm performs. From the arules documentation, confidence is “a numeric value for the minimal confidence of rules/association hyperedges (default: 0.8). For frequent itemsets it is set to NA.”
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The most important bit is how many rules the algorithm wrote: writing … [0 rule(s)] done [0.00s]. Think about what we were asking for with a confidence of 0.8: we wanted rules that hold 4 out of 5 times, which is far too strict here. So let’s try again.
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
OK, that’s better: writing … [281 rule(s)] done [0.00s]. We’ll move forward with these rules into Step 4.
The by argument lets us state what we want to sort by, in this case lift. The [1:10] tells inspect we want the first 10 rules by lift, i.e. the strongest rules. https://en.wikipedia.org/wiki/Association_rule_learning#Lift
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence lift count
## [1] {mineral water,
## whole wheat pasta} => {olive oil} 0.003866151 0.4027778 6.115863 29
## [2] {spaghetti,
## tomato sauce} => {ground beef} 0.003066258 0.4893617 4.980600 23
## [3] {french fries,
## herb & pepper} => {ground beef} 0.003199573 0.4615385 4.697422 24
## [4] {cereals,
## spaghetti} => {ground beef} 0.003066258 0.4600000 4.681764 23
## [5] {frozen vegetables,
## mineral water,
## soup} => {milk} 0.003066258 0.6052632 4.670863 23
## [6] {chocolate,
## herb & pepper} => {ground beef} 0.003999467 0.4411765 4.490183 30
## [7] {chocolate,
## mineral water,
## shrimp} => {frozen vegetables} 0.003199573 0.4210526 4.417225 24
## [8] {frozen vegetables,
## mineral water,
## olive oil} => {milk} 0.003332889 0.5102041 3.937285 25
## [9] {cereals,
## ground beef} => {spaghetti} 0.003066258 0.6764706 3.885303 23
## [10] {frozen vegetables,
## soup} => {milk} 0.003999467 0.5000000 3.858539 30
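As an optional sanity check (a sketch using subset() from arules), we can count how many of these 281 rules contain a very frequent item such as chocolate:
# rules with chocolate in the antecedent (left-hand side) ...
length(subset(rules, subset = lhs %in% "chocolate"))
# ... and in the consequent (right-hand side)
length(subset(rules, subset = rhs %in% "chocolate"))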
So, let’s adjust the confidence to help clean up the rules; basically we are looking to have fewer rules that are heavily influenced by items with lots of support, like chocolate. We’ll go with 0.2.
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1348 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
writing … [1348 rule(s)] done [0.00s]. That’s a lot more rules :-) and that’s expected.
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support
## [1] {mineral water,whole wheat pasta} => {olive oil} 0.003866151
## [2] {frozen vegetables,milk,mineral water} => {soup} 0.003066258
## [3] {fromage blanc} => {honey} 0.003332889
## [4] {spaghetti,tomato sauce} => {ground beef} 0.003066258
## [5] {light cream} => {chicken} 0.004532729
## [6] {pasta} => {escalope} 0.005865885
## [7] {french fries,herb & pepper} => {ground beef} 0.003199573
## [8] {cereals,spaghetti} => {ground beef} 0.003066258
## [9] {frozen vegetables,mineral water,soup} => {milk} 0.003066258
## [10] {french fries,ground beef} => {herb & pepper} 0.003199573
## confidence lift count
## [1] 0.4027778 6.115863 29
## [2] 0.2771084 5.484407 23
## [3] 0.2450980 5.164271 25
## [4] 0.4893617 4.980600 23
## [5] 0.2905983 4.843951 34
## [6] 0.3728814 4.700812 44
## [7] 0.4615385 4.697422 24
## [8] 0.4600000 4.681764 23
## [9] 0.6052632 4.670863 23
## [10] 0.2307692 4.665768 24
Lift (originally called interest): https://en.wikipedia.org/wiki/Association_rule_learning#Lift. In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.
For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%). https://en.wikipedia.org/wiki/Lift_(data_mining)
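To tie this back to our output: for a rule, lift is the rule’s confidence divided by the support of its right-hand side. A quick check for rule [1] above (a sketch; the olive oil support comes from itemFrequency on our dataset and should be roughly 0.066):
# support of the right-hand side item
itemFrequency(dataset)["olive oil"]
# lift = confidence / support(rhs): roughly 0.403 / 0.066, i.e. about 6.1,
# which matches the reported lift of 6.115863
0.4027778 / itemFrequency(dataset)["olive oil"]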
How about items purchased 4 times a day instead of 3? Let’s calculate that.
# calculate the support we want to use
4*7/7500
## [1] 0.003733333
Let’s round that to 0.004
rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 30
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [114 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [811 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
writing … [811 rule(s)] done [0.00s]. Let’s look at 20 rules this time.
inspect(sort(rules, by = 'lift')[1:20])
## lhs rhs support confidence lift count
## [1] {light cream} => {chicken} 0.004532729 0.2905983 4.843951 34
## [2] {pasta} => {escalope} 0.005865885 0.3728814 4.700812 44
## [3] {pasta} => {shrimp} 0.005065991 0.3220339 4.506672 38
## [4] {eggs,
## ground beef} => {herb & pepper} 0.004132782 0.2066667 4.178455 31
## [5] {whole wheat pasta} => {olive oil} 0.007998933 0.2714932 4.122410 60
## [6] {herb & pepper,
## spaghetti} => {ground beef} 0.006399147 0.3934426 4.004360 48
## [7] {herb & pepper,
## mineral water} => {ground beef} 0.006665778 0.3906250 3.975683 50
## [8] {tomato sauce} => {ground beef} 0.005332622 0.3773585 3.840659 40
## [9] {mushroom cream sauce} => {escalope} 0.005732569 0.3006993 3.790833 43
## [10] {frozen vegetables,
## mineral water,
## spaghetti} => {ground beef} 0.004399413 0.3666667 3.731841 33
## [11] {olive oil,
## tomatoes} => {spaghetti} 0.004399413 0.6111111 3.509912 33
## [12] {frozen vegetables,
## spaghetti} => {tomatoes} 0.006665778 0.2392344 3.498046 50
## [13] {mineral water,
## soup} => {olive oil} 0.005199307 0.2254335 3.423030 39
## [14] {ground beef,
## milk} => {olive oil} 0.004932676 0.2242424 3.404944 37
## [15] {eggs,
## herb & pepper} => {ground beef} 0.004132782 0.3297872 3.356491 31
## [16] {spaghetti,
## tomatoes} => {frozen vegetables} 0.006665778 0.3184713 3.341054 50
## [17] {herb & pepper} => {ground beef} 0.015997867 0.3234501 3.291994 120
## [18] {grated cheese,
## spaghetti} => {ground beef} 0.005332622 0.3225806 3.283144 40
## [19] {cooking oil,
## ground beef} => {spaghetti} 0.004799360 0.5714286 3.281995 36
## [20] {frozen vegetables,
## olive oil} => {milk} 0.004799360 0.4235294 3.268410 36
=========================
Github files: https://github.com/ghettocounselor
Useful PDF for common questions in Lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf