# initial read of the raw csv: every row is one transaction
dataset = read.csv('groceries.csv', header = FALSE)
Association rule learning: what and why
A famous story (or myth) about association rule mining is the “beer and diapers” story. A purported survey of the behavior of supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer. There are several explanations out there; one of them says that “when fathers are sent out on an errand to buy diapers, they often purchase a six-pack of their favorite beer as a reward.”
This association spread so widely that in 1998 IBM even aired a television ad that used the beer-and-diapers example.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. If the association between two products is strong, store managers can put those two products together, so that when a customer comes to buy one of them, he or she sees the other product and might think, “hey, that could be a great combination.” For example, by keeping butter right next to the bread, customers will tend to buy both products instead of just one of them. The same technique is applied by e-commerce websites. Remember the “people also buy this together” pop-up? That is association rule learning at work.
The dataset
The given dataset has 15296 rows, and each row represents a transaction. The transactions were made over a period of one week. The column values represent the different products bought in that transaction. So basically, each time a customer walks out of the shop with a basket, it counts as one transaction/row. A single customer can of course come in multiple times and buy the same or a different set of products; each visit is counted as another transaction. So there were a total of 15296 transactions over the week.
Since this is a real-life dataset, a few things that happen in real life might create problems for our analysis:
1. If a particular customer comes twice over the period and buys the same set of products, it is counted as a transaction again, but it is redundant for our model, because (i) we do not want the model to learn from the habits of one particular customer, and (ii) to build a good model we need as many different combinations of products as possible, so that the model learns to establish connections from that variety rather than from the repetition of one set of products.
2. If a customer buys a single product, say whole milk, and another customer comes in and also buys only whole milk, then again we see a repetition, which we want to avoid; it only needs to be counted once.
Checking for duplicate rows
sum(duplicated(dataset))
## [1] 5357
As I suspected, there are a total of 5357 duplicate rows, which will not add any value to our model, so we need to get rid of them. The good thing is that when we build the Apriori model, the repetition is taken care of for us. However, we should know what the model is doing and why.
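If we wanted to drop the duplicates ourselves, here is a minimal sketch (not strictly needed, since we re-read the data below anyway; dataset_unique is just an illustrative name):
# keep only the first occurrence of each identical row
dataset_unique = dataset[!duplicated(dataset), ]
nrow(dataset_unique)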
Another thing to know is that the Apriori model is built on a sparse matrix. Sparse matrices are matrices in which the majority of the elements are zero; in other words, a sparse matrix has more zero elements than non-zero elements. So we need to convert our dataset into a sparse matrix, which we can do by reading the whole csv file directly as a sparse matrix. While reading the csv file this way, we need to specify that the separator is a comma, which we did not need to do in our regular read.csv command because there R knows the separator by default. We need to import the arules package to read the file as a sparse matrix, and since we do not have a header in our dataset (the first row is itself a transaction), we have to set header to FALSE.
One last thing: since this is real-life data, there could be human errors, such as the same product being entered twice in one transaction. Say a customer buys yogurt, whole milk and soda, and by mistake soda is typed twice in that transaction. The duplicate entry is of no use to us; we already know from that transaction that yogurt, whole milk and soda are bought together, so counting soda twice is redundant. We just need to add rm.duplicates = TRUE to get rid of this repetition.
Creating the sparse matrix:
library(arules)
# read the csv directly as a transactions (sparse matrix) object
dataset = read.transactions('groceries.csv', sep = ',', header = FALSE, rm.duplicates = TRUE)
So now our dataset is converted into a sparse matrix. We can't view it the way we view a regular data frame, but we can still print individual transactions and, more importantly, look at its summary.
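As a quick aside, here is a sketch of peeking at individual transactions with arules' inspect():
# print the first three transactions (each is a set of items)
inspect(dataset[1:3])
The summary gives us the fuller picture: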
summary(dataset)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Findings from the summary of the sparse matrix:
1. As expected, read.transactions took care of the repetition of rows for us, since we now have only 9835 rows.
2. Further, we have 169 columns, and each column represents one product.
3. Each row represents a transaction and each column a product. If that product was bought in that transaction, the sparse matrix holds a 1; if not, a 0.
4. The density of our sparse matrix is 0.026, meaning 2.6% of all the cells in the sparse matrix hold a non-zero value and the rest are 0.
5. The most frequently bought product is whole milk, followed by other vegetables and then rolls/buns.
6. The least number of products any customer bought is 1, and the most products in one customer's basket is 32.
7. On average about 4.4 products are bought in a single purchase, and the median is 3, which is a good sign: most customers who come to the store buy multiple items.
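These counts can also be pulled out directly; a small sketch using arules' itemFrequency():
# absolute counts of the five most frequent items (should match the summary above)
sort(itemFrequency(dataset, type = 'absolute'), decreasing = TRUE)[1:5]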
Getting a visual representation of the top 30 most frequently bought products:
itemFrequencyPlot(dataset,topN = 30)
So these are the 30 most-bought products, along with how many times each was bought.
How does Apriori work:
The Apriori algorithm works with 3 measures: support, confidence and lift.
Support:
Support here is a kind of probability: the number of transactions containing the desired product divided by the total number of transactions during that period.
Since this is weekly data and we want to consider only products that were bought at least 4 times per day, that makes 4 × 7 = 28 purchases in the week. So our support = 28/9835 = 0.0028 ≈ 0.003.
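The same arithmetic in R (a small sketch; length() of a transactions object gives the number of transactions):
# minimum support: 4 purchases a day * 7 days, out of all 9835 transactions
min_support = (4 * 7) / length(dataset)
min_support   # ~0.00285, which we round to 0.003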
Confidence:
Confidence is the percentage of cases in which the items are bought together. For example, a 60% confidence for bread with butter means that 60% of the time when people buy bread, they also buy butter. Ideally we could start with a very high confidence, like 80%, but then again it would give us very obvious results, which we don't need ML to tell us.
Lift:
Lift is the ratio of a rule's confidence to the support of its right-hand side, i.e. how much more likely the second product becomes once the first is in the basket. We only need to specify support and confidence in our model; it calculates lift on its own, and we can arrange the rules accordingly.
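Putting the three measures together for a rule X => Y over n transactions (a sketch; the concrete numbers anticipate the {liquor} => {bottled beer} rule we will actually find below):
#   support(X => Y)    = count(X and Y together) / n
#   confidence(X => Y) = count(X and Y together) / count(X)
#   lift(X => Y)       = confidence(X => Y) / support(Y)
# re-derive lift for {liquor} => {bottled beer}, whose confidence is 0.4220183
supp_beer = itemFrequency(dataset)['bottled beer']   # support of the right-hand side
0.4220183 / supp_beer   # ~5.24, the lift apriori reports for this rule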
Training Apriori on the dataset:
rules_80 = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_80
## set of 1 rules
So, for products bought together at least about 4 times a day (our support threshold) and with a confidence level of 80%, we found only 1 rule. Let's check that rule.
inspect(sort(rules_80, by = 'lift')[1])
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## tropical fruit,
## whole milk} => {other vegetables} 0.003152008 0.8857143 0.003558719 4.577509 31
Findings from the above rule:
So if citrus fruit, root vegetables, tropical fruit and whole milk are bought, the customer also buys “other vegetables” in 88.6% of the cases. That doesn't really give us much insight, because it's common sense: a customer who is already buying healthy fruit and vegetables will most probably buy other vegetables as well. We don't need a machine learning model to tell us this rule.
So keeping the confidence level at 80% is not a good idea, because it gives us only the most obvious rules, which we do not need.
Keeping the confidence level at 40%:
rules_40 = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [823 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_40
## set of 823 rules
inspect(sort(rules_40, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [2] {other vegetables,
## root vegetables,
## tropical fruit,
## whole milk} => {citrus fruit} 0.003152008 0.4492754 0.007015760 5.428284 31
## [3] {liquor} => {bottled beer} 0.004677173 0.4220183 0.011082867 5.240594 46
## [4] {citrus fruit,
## other vegetables,
## root vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 0.005795628 5.183004 31
## [5] {herbs,
## whole milk} => {root vegetables} 0.004168785 0.5394737 0.007727504 4.949369 41
## [6] {herbs,
## other vegetables} => {root vegetables} 0.003863752 0.5000000 0.007727504 4.587220 38
## [7] {citrus fruit,
## root vegetables,
## tropical fruit,
## whole milk} => {other vegetables} 0.003152008 0.8857143 0.003558719 4.577509 31
## [8] {butter,
## other vegetables,
## yogurt} => {tropical fruit} 0.003050330 0.4761905 0.006405694 4.538114 30
## [9] {citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.004473818 0.4943820 0.009049314 4.535678 44
## [10] {beef,
## tropical fruit} => {root vegetables} 0.003762074 0.4933333 0.007625826 4.526057 37
So we found 823 rules, sorted here in descending order of their lift values. Let's look at the top 10 rules.
Findings from the rules at the 40% confidence level:
We can see that “whole milk” and “other vegetables” appear in most of these 10 rules. This is because they have high support: they land in most of the baskets, so they form strong rules almost by default. To get around this, we lower the confidence threshold (as we do next), which lets rules about less common products through as well.
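Another option (a sketch using arules' subset(); we don't use it in this analysis) is to filter the dominant items out of the mined rules directly:
# drop rules whose right-hand side is one of the dominant items
rules_filtered = subset(rules_40, subset = !(rhs %in% c('whole milk', 'other vegetables')))
inspect(sort(rules_filtered, by = 'lift')[1:5])   # top 5 remaining rules by lift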
Keeping the confidence level at 20%:
rules_20 = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [2246 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_20
## set of 2246 rules
So at the 20% confidence level there are a total of 2246 rules. Let's check the top 10 of these rules in descending order of their lift value.
inspect(sort(rules_20, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Instant food products} => {hamburger meat} 0.003050330 0.3797468 0.008032537 11.421438 30
## [2] {flour} => {sugar} 0.004982206 0.2865497 0.017386884 8.463112 49
## [3] {processed cheese} => {white bread} 0.004168785 0.2515337 0.016573462 5.975445 41
## [4] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [5] {other vegetables,
## root vegetables,
## tropical fruit,
## whole milk} => {citrus fruit} 0.003152008 0.4492754 0.007015760 5.428284 31
## [6] {liquor} => {bottled beer} 0.004677173 0.4220183 0.011082867 5.240594 46
## [7] {citrus fruit,
## other vegetables,
## root vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 0.005795628 5.183004 31
## [8] {berries,
## whole milk} => {whipped/sour cream} 0.004270463 0.3620690 0.011794611 5.050990 42
## [9] {herbs,
## whole milk} => {root vegetables} 0.004168785 0.5394737 0.007727504 4.949369 41
## [10] {tropical fruit,
## whole milk,
## yogurt} => {curd} 0.003965430 0.2617450 0.015149975 4.912713 39
Findings from the rules at the 20% confidence level:
Now we see some good findings that we could not have thought of without ML telling us:
1. When instant food products are bought, the customer also buys hamburger meat in 37.97% of the cases.
2. Processed cheese and white bread are also bought together in 25% of the cases, which makes sense because these products often go hand in hand. The percentage is not very high, but it can be exploited.
3. As per rule 6, if liquor is bought, bottled beer is also bought, but this association is obvious. The same goes for the 9th rule, the association of herbs and whole milk with root vegetables. So let's check the next 20 rules.
inspect(sort(rules_20, by = 'lift')[11:30])
## lhs rhs support confidence coverage lift count
## [1] {other vegetables,
## whipped/sour cream,
## whole milk} => {butter} 0.003965430 0.2708333 0.014641586 4.887424 39
## [2] {butter,
## other vegetables,
## whole milk} => {whipped/sour cream} 0.003965430 0.3451327 0.011489578 4.814724 39
## [3] {frozen vegetables,
## other vegetables} => {chicken} 0.003558719 0.2000000 0.017793594 4.661137 35
## [4] {herbs,
## other vegetables} => {root vegetables} 0.003863752 0.5000000 0.007727504 4.587220 38
## [5] {citrus fruit,
## root vegetables,
## tropical fruit,
## whole milk} => {other vegetables} 0.003152008 0.8857143 0.003558719 4.577509 31
## [6] {onions,
## whole milk} => {butter} 0.003050330 0.2521008 0.012099644 4.549379 30
## [7] {long life bakery product,
## whole milk} => {chocolate} 0.003050330 0.2255639 0.013523132 4.545945 30
## [8] {butter,
## other vegetables,
## yogurt} => {tropical fruit} 0.003050330 0.4761905 0.006405694 4.538114 30
## [9] {citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.004473818 0.4943820 0.009049314 4.535678 44
## [10] {beef,
## tropical fruit} => {root vegetables} 0.003762074 0.4933333 0.007625826 4.526057 37
## [11] {onions,
## other vegetables,
## whole milk} => {root vegetables} 0.003253686 0.4923077 0.006609049 4.516648 32
## [12] {domestic eggs,
## other vegetables,
## whole milk} => {butter} 0.003050330 0.2479339 0.012302999 4.474183 30
## [13] {other vegetables,
## tropical fruit,
## yogurt} => {butter} 0.003050330 0.2479339 0.012302999 4.474183 30
## [14] {beef,
## soda} => {root vegetables} 0.003965430 0.4875000 0.008134215 4.472540 39
## [15] {other vegetables,
## root vegetables,
## tropical fruit} => {citrus fruit} 0.004473818 0.3636364 0.012302999 4.393567 44
## [16] {curd,
## tropical fruit,
## whole milk} => {yogurt} 0.003965430 0.6093750 0.006507372 4.368224 39
## [17] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {tropical fruit} 0.003558719 0.4545455 0.007829181 4.331836 35
## [18] {hygiene articles,
## whole milk} => {butter} 0.003050330 0.2380952 0.012811388 4.296636 30
## [19] {other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.003558719 0.4666667 0.007625826 4.281405 35
## [20] {butter,
## tropical fruit} => {whipped/sour cream} 0.003050330 0.3061224 0.009964413 4.270517 30
1. Rule number 7 is interesting and also makes a lot of sense: if long life bakery products are bought, chocolate is also bought, since both are used in making cakes and pastries.
2. When beef and soda are bought, root vegetables are also bought in 48.75% of the cases, so this could also be a good association.
Caution:
Rule number 28 seemed funny to me:
inspect(sort(rules_20, by = 'lift')[28])
## lhs rhs support confidence coverage
## [1] {hygiene articles,whole milk} => {butter} 0.00305033 0.2380952 0.01281139
## lift count
## [1] 4.296636 30
When hygiene articles and whole milk are bought, butter is also bought. But hygiene articles are a completely different kind of product; the association that plausibly matters here is only the one between whole milk and butter.
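One way to probe this (a sketch using arules' crossTable(); ct is just an illustrative name) is to check how strong the milk-butter link is on its own:
# pairwise co-occurrence counts; the diagonal holds single-item counts
ct = crossTable(dataset, measure = 'count')
ct['whole milk', 'butter'] / ct['whole milk', 'whole milk']   # confidence of {whole milk} => {butter} alone
Comparing this stand-alone confidence with the 23.8% confidence of rule 28 tells us how much the hygiene-articles condition really changes the picture.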
End note:
There are plenty of rules to look into, but each association must be looked at with caution. Association rules are used not only in retail stores but also on e-commerce websites.