In this study we mine association rules and explore frequent itemsets in a supermarket transaction data set using the arules package.
Each line of the txt file holds one transaction-item pair in the following format:
col1 -> transactionId
col2 -> productName
The read.transactions function reads the data and creates a transactions object. A transactions object is stored in a special matrix format with one column for each distinct product and one row for each transaction. We choose ‘single’ as the format parameter since each line holds a single item associated with a transaction id; lines that share an id belong to the same transaction.
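# Load arules (rule mining) and arulesViz (rule plots)
library(arules)
library(arulesViz)
# Read (transactionId, productName) pairs, one item per line
GTransactions <- read.transactions("transactions.txt", sep = ",",
                                   format = "single", cols = c(1, 2))
# Plot the relative frequency of items found in at least 10% of transactions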
itemFrequencyPlot(GTransactions, support=0.10)
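To make the ‘single’ format concrete, here is a minimal base R sketch (not arules code) that turns a few made-up transactionId,productName lines into the kind of transaction-by-item matrix described above. The sample lines and variable names are hypothetical, and arules itself keeps this matrix in a sparse form rather than as a dense logical matrix.

```r
# Hypothetical sample in "single" format: one (transactionId, productName) pair per line
sample_lines <- c("1,Wine", "1,Sauces", "2,Wine", "2,Rice", "3,Rice")

pairs <- read.csv(text = paste(sample_lines, collapse = "\n"), header = FALSE,
                  col.names = c("transactionId", "productName"))

# Rows are transactions, columns are products; TRUE means the product is in the basket
incidence <- unclass(table(pairs$transactionId, pairs$productName)) > 0
print(incidence)

# Share of transactions containing each item (the quantity an item frequency plot shows)
item_support <- colMeans(incidence)
print(item_support)
```

Transaction 1 ends up as a single row with Wine and Sauces set to TRUE, even though it was spread over two input lines.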
We will use the Apriori algorithm, which counts how often item combinations occur in order to generate rules. There are two important parameters to define.
Support is the probability that a transaction contains all items of an itemset.
Confidence is a conditional probability: given that a transaction contains item A, the chance that it also contains item B.
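The two definitions can be checked by hand on a tiny example. The sketch below uses base R only; the baskets and the helper names support and confidence are made up for illustration and are not part of arules.

```r
# Hypothetical toy baskets, used only to illustrate the two definitions
baskets <- list(
  c("Wine", "Sauces"),
  c("Wine", "Sauces", "Rice"),
  c("Wine"),
  c("Rice"),
  c("Sauces", "Rice")
)

# Support: share of baskets that contain every item of the itemset
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support of the combined itemset divided by support of lhs
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("Wine", "Sauces"))   # 2 of the 5 baskets contain both items
confidence("Wine", "Sauces")   # 2 of the 3 wine baskets also contain sauces
```

With thresholds such as support = 0.01 and confidence = 0.5, a rule is kept only if both of these numbers reach the given minimum.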
# Run apriori algorithm and generate rules
rules <- apriori(GTransactions, parameter = list(support=0.01, confidence=0.5))
# summary of rules
summary(rules)
## set of 7501 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 32 2622 4447 400
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.695 4.000 5.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.01001 Min. :0.5002 Min. : 1.622
## 1st Qu.:0.01052 1st Qu.:0.6673 1st Qu.: 6.373
## Median :0.01119 Median :0.7648 Median : 8.007
## Mean :0.01165 Mean :0.7335 Mean : 8.877
## 3rd Qu.:0.01211 3rd Qu.:0.8048 3rd Qu.:11.468
## Max. :0.03301 Max. :0.9742 Max. :28.188
##
## mining info:
## data ntransactions support confidence
## GTransactions 64808 0.01 0.5
If we are interested in rules that contain certain items, we can create a subset of the rules with the subset function.
For example, to see what else a customer is likely to purchase when there is wine in the basket, we can inspect the rules that include wine on the left-hand side.
The rules below show that customers who purchase wine together with another item also tend to buy fresh vegetables.
Support and confidence alone are not good indicators of correlation, because a rule can reach a high confidence simply because its right-hand side is a popular item. Lift is the measure that tells us whether the events are correlated: a lift larger than 1 indicates a positive correlation, a lift equal to 1 indicates that the events are independent, and a lift smaller than 1 indicates a negative correlation.
# Rules with wine on the left-hand side
WineRulesL <- subset(rules, subset = lhs %pin% "Wine")
inspect(WineRulesL)
## lhs rhs support confidence
## [1] {Candles,Wine} => {Fresh Vegetables} 0.01029194 0.8707572
## [2] {Fresh Chicken,Wine} => {Fresh Vegetables} 0.01023022 0.8851802
## [3] {Sauces,Wine} => {Fresh Vegetables} 0.01492100 0.9088346
## [4] {Cooking Oil,Wine} => {Fresh Vegetables} 0.01272991 0.7405745
## [5] {Rice,Wine} => {Fresh Vegetables} 0.01030737 0.8077388
## [6] {Juice,Wine} => {Fresh Vegetables} 0.01024565 0.7272727
## [7] {Fresh Fruit,Wine} => {Fresh Vegetables} 0.01530675 0.5814771
## lift
## [1] 2.821460
## [2] 2.868195
## [3] 2.944840
## [4] 2.399638
## [5] 2.617266
## [6] 2.356537
## [7] 1.884124
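The lift column can be cross-checked from the printed values. This is a minimal sketch, assuming the standard definition lift = confidence / support(rhs): since all seven rules point to Fresh Vegetables, dividing confidence by lift should recover the same right-hand-side support every time. The numbers are copied from the output above; the variable names are only illustrative.

```r
# Confidence and lift of the seven wine rules, copied from the inspect() output
confidence <- c(0.8707572, 0.8851802, 0.9088346, 0.7405745,
                0.8077388, 0.7272727, 0.5814771)
lift <- c(2.821460, 2.868195, 2.944840, 2.399638,
          2.617266, 2.356537, 1.884124)

# Implied support of the right-hand side, {Fresh Vegetables}
rhs_support <- confidence / lift
print(round(rhs_support, 4))
```

Every value comes out at roughly 0.309, so Fresh Vegetables appears in about 31% of all transactions. A confidence between 0.58 and 0.91 is therefore about 1.9 to 2.9 times that baseline, which is what the lift column reports.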
# Rules with wine on the right-hand side, drawn as a graph shaded by lift
WineRulesR <- subset(rules, subset = rhs %pin% "Wine")
plot(WineRulesR, method = "graph", shading = "lift")
If there is a fresh food item in the basket, the probability of also purchasing deodorizers and sour cream is more than 15 times higher.
# Rules whose left-hand side has an item matching "Fresh", with high lift and confidence
FreshRulesL <- subset(rules, subset = lhs %pin% "Fresh" & lift > 15 & confidence > 0.8)
plot(FreshRulesL, method="grouped", control=list(k=50))
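The %pin% operator used in these subsets matches item labels partially, so "Fresh" selects rules whose left-hand side contains any item whose label includes that text. The base R sketch below only mimics that matching idea with grepl on a few labels taken from the rule listings in this document; it is an illustration, not how arules implements the operator.

```r
# Item labels taken from the rule listings in this document
item_labels <- c("Fresh Vegetables", "Fresh Chicken", "Fresh Fruit",
                 "Wine", "Canned Vegetables")

# Partial match: any label that contains the pattern (the idea behind %pin%)
partial <- item_labels[grepl("Fresh", item_labels, fixed = TRUE)]

# Exact match: only a label equal to the pattern (the idea behind %in%)
exact <- item_labels[item_labels %in% "Fresh"]

print(partial)
print(exact)
```

The partial match returns the three fresh items, while the exact match returns nothing because no item is labeled just "Fresh". This is why a short pattern is convenient for selecting a whole product family.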
Itemsets such as {Cheese, Milk, Pancake Mix, Sliced Bread} and {Cheese, Juice, Milk, Pancake Mix} increase the probability of purchasing Canned Vegetables more than 8 times (lift of about 8.15).
CannedRulesR <- subset(rules, subset = rhs %pin% "Canned" & lift > 5 & confidence > 0.8)
inspect(sort(CannedRulesR, by='lift')[1:5])
## lhs rhs support confidence lift
## [1] {Cheese,
## Milk,
## Pancake Mix,
## Sliced Bread} => {Canned Vegetables} 0.01064683 0.8508015 8.154206
## [2] {Cheese,
## Juice,
## Milk,
## Pancake Mix} => {Canned Vegetables} 0.01055425 0.8507463 8.153677
## [3] {Juice,
## Milk,
## Pancake Mix} => {Canned Vegetables} 0.01242131 0.8491561 8.138437
## [4] {Cheese,
## Milk,
## Pancake Mix} => {Canned Vegetables} 0.01232872 0.8490967 8.137867
## [5] {Juice,
## Milk,
## Pancake Mix,
## Sliced Bread} => {Canned Vegetables} 0.01063140 0.8485222 8.132361
If we define small transactions as those with 6 or fewer items and large transactions as those with more than 6 items, and mine each group separately, we reach the rules below.
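# Split transactions by basket size: large (> 6 items) and small (<= 6 items)
LTransactions <- subset(GTransactions, subset = size(GTransactions) > 6)
STransactions <- subset(GTransactions, subset = size(GTransactions) <= 6)
# Mine each group with its own thresholds
Lrules <- apriori(LTransactions, parameter = list(support = 0.05, confidence = 0.8))
Srules <- apriori(STransactions, parameter = list(support = 0.01, confidence = 0.25))
# Plot the large-basket rules with a Spectral color palette
library("RColorBrewer")
plot(Lrules, control = list(col = brewer.pal(5, "Spectral")), main = "")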
In the rules mined from transactions that have more than six items, the top patterns by lift mostly lead to Cottage Cheese purchases, typically from baskets that contain Jam and Waffles, and one rule leads to Muffins.
inspect(sort(Lrules, by='lift')[1:5])
## lhs rhs support confidence
## [1] {Jam,Sliced Bread,Waffles} => {Cottage Cheese} 0.05549930 0.8127896
## [2] {Bagels,Milk} => {Muffins} 0.05701810 0.8342593
## [3] {Cereal,Jam,Waffles} => {Cottage Cheese} 0.05473991 0.8061510
## [4] {Jelly,Sliced Bread,Waffles} => {Cottage Cheese} 0.05353753 0.8041825
## [5] {Jam,Jelly,Waffles} => {Cottage Cheese} 0.05341096 0.8038095
## lift
## [1] 5.276788
## [2] 5.235490
## [3] 5.233688
## [4] 5.220909
## [5] 5.218487
There is a strong positive association between shower soap and cleaners. The presence of shower soap makes cleaners approximately 12.7 times more likely to appear in the same transaction. Out of 100 customers who bought shower soap, about 68 bought cleaners too.
# Keep only rules with one item on each side (rule length 2)
ProductRules <- subset(rules, size(rules) < 3)
inspect(sort(ProductRules, by='lift')[1:15])
## lhs rhs support confidence lift
## [1] {Shower Soap} => {Cleaners} 0.01004506 0.6760125 12.673131
## [2] {Deodorizers} => {Pancake Mix} 0.02084619 0.5029784 9.329429
## [3] {Deodorizers} => {Frozen Chicken} 0.02163313 0.5219657 7.762174
## [4] {Dishwasher Soap} => {Sliced Bread} 0.01013764 0.6759259 6.454311
## [5] {Shower Soap} => {Sliced Bread} 0.01002963 0.6749740 6.445221
## [6] {Dishwasher Soap} => {Juice} 0.01032280 0.6882716 6.301103
## [7] {Shower Soap} => {Juice} 0.01006049 0.6770509 6.198377
## [8] {Deodorizers} => {Cereal} 0.02132453 0.5145197 5.745175
## [9] {Pancake Mix} => {Juice} 0.03300518 0.6121923 5.604600
## [10] {Bagels} => {Sliced Bread} 0.02711085 0.5738080 5.479202
## [11] {Bagels} => {Juice} 0.02738859 0.5796865 5.307010
## [12] {Deodorizers} => {Sliced Bread} 0.02280583 0.5502606 5.254352
## [13] {Sauces} => {Wine} 0.01641773 0.5256917 5.130878
## [14] {Deodorizers} => {Juice} 0.02258980 0.5450484 4.989899
## [15] {Pancake Mix} => {Sliced Bread} 0.02789779 0.5174585 4.941130
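The first row can be translated back into basket counts. This is a rough sketch that assumes the standard definitions (support is the share of all transactions containing both sides, confidence = support / support(lhs), lift = confidence / support(rhs)) and uses the 64808 transactions reported in the mining info; the variable names are illustrative.

```r
n_transactions <- 64808      # from the mining info in summary(rules)
support    <- 0.01004506     # {Shower Soap} => {Cleaners}
confidence <- 0.6760125
lift       <- 12.673131

n_both    <- support * n_transactions   # baskets with both items
n_lhs     <- n_both / confidence        # baskets with Shower Soap
rhs_share <- confidence / lift          # share of all baskets with Cleaners

# Approximate counts, then percentages with and without shower soap in the basket
round(c(both = n_both, shower_soap = n_lhs))
round(100 * c(with_soap = confidence, overall = rhs_share), 1)
```

Under these assumptions, about 651 of roughly 963 shower soap baskets also contain cleaners (about 68 in 100), compared with only about 5 in 100 baskets overall. The ratio of those two rates is the lift of about 12.7.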