In this section, I will create association rules that identify relationships between the items bought together in the dataset's transactions.
# calling the library
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Loading the dataset
df <- read.transactions('http://bit.ly/SupermarketDatasetII',sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
# Checking the class of the dataset
class(df)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing the dataset
inspect(df[1:5])
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
# Generating the statistical summary of the data
summary(df)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
Observation: the most frequent items are mineral water (1788 transactions), eggs (1348), spaghetti (1306), french fries (1282) and chocolate (1229).
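The figures in the summary can be cross-checked against each other. The sketch below uses only base R and the numbers printed above; the variable names are my own and not part of `arules`.

```r
# Values copied from the summary(df) output above
n_transactions <- 7501
n_items        <- 119
top_counts     <- c(1788, 1348, 1306, 1282, 1229)  # five most frequent items
other_count    <- 22405                            # "(Other)" in the summary

# Total number of item occurrences across all transactions
total_item_occurrences <- sum(top_counts) + other_count

# Density: share of filled cells in the transaction-by-item matrix
density <- total_item_occurrences / (n_transactions * n_items)

# Mean basket size: average number of items per transaction
mean_basket_size <- total_item_occurrences / n_transactions

density
mean_basket_size
```

Both values should agree with the reported density (0.03288973) and the mean transaction length (3.914).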
# Displaying the 10 most common items in the dataset, and the items whose support (relative frequency) is at least 10%
par(mfrow = c(1, 2))
# plot the frequency of items
itemFrequencyPlot(df, topN = 10, col="blue")
itemFrequencyPlot(df, support = 0.1, col="darkred")
The items in the second plot each have a support of at least 10%, that is, each appears in at least 10% of the transactions.
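As a rough illustration of what the 10% threshold means, the relative support of the five most frequent items can be computed by hand from the counts in the summary. This is a base R sketch with names of my own choosing. In `arules` itself, `itemFrequency(df)` should return these relative frequencies for every item.

```r
# Counts copied from the "most frequent items" part of summary(df)
n_transactions <- 7501
item_counts <- c(
  "mineral water" = 1788,
  "eggs"          = 1348,
  "spaghetti"     = 1306,
  "french fries"  = 1282,
  "chocolate"     = 1229
)

# Relative support: share of transactions that contain the item
relative_support <- item_counts / n_transactions
round(relative_support, 4)

# Items that reach the 10% threshold used in the second plot
names(relative_support)[relative_support >= 0.1]
```

All five items clear the threshold, with mineral water present in roughly 24% of the transactions.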
# Building a model based on association rules
# Using Min Support as 0.001 and confidence as 0.6
rules <- apriori (df, parameter = list(supp = 0.001, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [545 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 545 rules
Building the model with a minimum support of 0.001 and a minimum confidence of 60% generates a set of 545 rules.
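To get a feel for how permissive a minimum support of 0.001 is, it helps to convert it into a number of transactions. The base R sketch below does that arithmetic; the variable names are illustrative only.

```r
n_transactions <- 7501
min_support    <- 0.001

# Smallest whole number of transactions whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)
min_count

# Support that this count corresponds to
min_count / n_transactions
```

A rule therefore needs to hold in only 8 transactions to be kept, which is consistent with the smallest count and support reported by `summary(rules)`. The `Absolute minimum support count: 7` line in the log appears to be the same product truncated rather than rounded up.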
# Exploring the model
summary(rules)
## set of 545 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 146 329 67 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 4.000 3.866 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.6000 Min. :0.001067 Min. : 2.517
## 1st Qu.:0.001067 1st Qu.:0.6250 1st Qu.:0.001600 1st Qu.: 2.797
## Median :0.001200 Median :0.6667 Median :0.001866 Median : 3.446
## Mean :0.001409 Mean :0.6893 Mean :0.002081 Mean : 3.889
## 3rd Qu.:0.001466 3rd Qu.:0.7273 3rd Qu.:0.002266 3rd Qu.: 4.177
## Max. :0.005066 Max. :1.0000 Max. :0.007999 Max. :34.970
## count
## Min. : 8.00
## 1st Qu.: 8.00
## Median : 9.00
## Mean :10.57
## 3rd Qu.:11.00
## Max. :38.00
##
## mining info:
## data ntransactions support confidence
## df 7501 0.001 0.6
## call
## apriori(data = df, parameter = list(supp = 0.001, conf = 0.6))
Most rules have 3 or 4 items (146 and 329 rules respectively); only 3 rules have 6 items. The summary also reports the distribution of the quality measures support, confidence, coverage and lift.
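The quality measures in the summary follow the standard definitions: support is the share of transactions containing every item in the rule, coverage is the share containing the left-hand side, confidence is support divided by coverage, and lift is confidence divided by the support of the right-hand side. The sketch below computes them in base R for a made-up set of five baskets. The data and the helper names `contains` and `support` are hypothetical and not taken from the supermarket dataset or from `arules`.

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("milk", "bread", "butter")
)

# TRUE if a basket contains every item in `items`
contains <- function(basket, items) all(items %in% basket)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, contains, logical(1), items = items))
}

# Rule under study: {milk} => {bread}
lhs <- "milk"
rhs <- "bread"

rule_support <- support(c(lhs, rhs))      # 3 of 5 baskets
coverage     <- support(lhs)              # 4 of 5 baskets
confidence   <- rule_support / coverage   # 3 of the 4 milk baskets
lift         <- confidence / support(rhs) # confidence relative to bread's base rate

c(support = rule_support, coverage = coverage,
  confidence = confidence, lift = lift)
```

A lift below 1, as in this toy rule, means the right-hand side is slightly less common among baskets with the left-hand side than in the data overall; every rule mined above has a lift well over 1.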
# observing the first 10 rules built
inspect(rules[1:10])
## lhs rhs support confidence
## [1] {cookies, shallot} => {low fat yogurt} 0.001199840 0.6000000
## [2] {low fat yogurt, shallot} => {cookies} 0.001199840 0.6923077
## [3] {cookies, shallot} => {green tea} 0.001199840 0.6000000
## [4] {cookies, shallot} => {french fries} 0.001199840 0.6000000
## [5] {low fat yogurt, shallot} => {french fries} 0.001066524 0.6153846
## [6] {burger sauce, chicken} => {mineral water} 0.001066524 0.6666667
## [7] {frozen smoothie, spinach} => {mineral water} 0.001066524 0.8888889
## [8] {milk, spinach} => {mineral water} 0.001066524 0.6666667
## [9] {spaghetti, spinach} => {mineral water} 0.001333156 0.7142857
## [10] {olive oil, strong cheese} => {spaghetti} 0.001066524 0.7272727
## coverage lift count
## [1] 0.001999733 7.840767 9
## [2] 0.001733102 8.611940 9
## [3] 0.001999733 4.541473 9
## [4] 0.001999733 3.510608 9
## [5] 0.001733102 3.600624 8
## [6] 0.001599787 2.796793 8
## [7] 0.001199840 3.729058 8
## [8] 0.001599787 2.796793 8
## [9] 0.001866418 2.996564 10
## [10] 0.001466471 4.177085 8
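Interpretation of a few of these rules:
- Rule [7], {frozen smoothie, spinach} => {mineral water}, has a confidence of 0.89: of the 9 transactions containing frozen smoothie and spinach, 8 also contain mineral water. Its lift of 3.73 means this is about 3.7 times the rate at which mineral water appears in the dataset overall.
- Rule [2], {low fat yogurt, shallot} => {cookies}, has a confidence of 0.69 (9 of 13 transactions) and a lift of 8.61, so cookies appear about 8.6 times more often alongside low fat yogurt and shallot than in the dataset overall.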
# ordering the rules by a criterion
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {french fries,
## mushroom cream sauce,
## pasta} => {escalope} 0.001066524 1.00 0.001066524 12.606723 8
## [2] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.00 0.001199840 4.195190 9
## [3] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.00 0.001066524 7.717078 8
## [4] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.00 0.001199840 4.195190 9
## [5] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.95 0.002666311 11.976387 19
rules<-sort(rules, by="support", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {frozen vegetables,
## soup} => {mineral water} 0.005065991 0.6333333 0.007998933 2.656954 38
## [2] {olive oil,
## tomatoes} => {spaghetti} 0.004399413 0.6111111 0.007199040 3.509912 33
## [3] {pancakes,
## soup} => {mineral water} 0.004266098 0.6274510 0.006799093 2.632276 32
## [4] {chocolate,
## eggs,
## ground beef} => {mineral water} 0.003999467 0.6122449 0.006532462 2.568484 30
## [5] {frozen vegetables,
## ground beef,
## milk} => {mineral water} 0.003732836 0.6511628 0.005732569 2.731752 28
rules<-sort(rules, by="lift", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {escalope,
## french fries,
## pasta} => {mushroom cream sauce} 0.001066524 0.6666667 0.001599787 34.96970 8
## [2] {fresh tuna,
## fromage blanc} => {honey} 0.001599787 0.6666667 0.002399680 14.04682 12
## [3] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.72218 10
## [4] {french fries,
## mushroom cream sauce,
## pasta} => {escalope} 0.001066524 1.0000000 0.001066524 12.60672 8
## [5] {milk,
## pasta} => {shrimp} 0.001599787 0.8571429 0.001866418 11.99520 12
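Interpretation:
- Ordering by confidence in descending order, the top five rules comprise 4 rules with 100% confidence and 1 with 95% confidence.
- Ordering by support in descending order, the first rule has a support of about 0.005, meaning its items appear together in roughly 0.5% of the transactions (38 of 7501).
- Ordering by lift in descending order, the first rule has a lift of 34.97, meaning mushroom cream sauce is about 35 times more likely to appear in a transaction containing escalope, french fries and pasta than in the dataset overall.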
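Because lift is defined as confidence divided by the support of the right-hand side, the baseline frequency of mushroom cream sauce can be recovered from the top-lift rule. This is a base R sketch using the rounded values printed above, so the result is approximate.

```r
# Values copied from the first rule when sorted by lift
n_transactions <- 7501
confidence     <- 0.6666667
lift           <- 34.96970

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift

# Approximate number of transactions that contain mushroom cream sauce
rhs_count <- rhs_support * n_transactions

rhs_support
rhs_count
```

Mushroom cream sauce appears in only about 1.9% of all transactions (roughly 143), but in about 67% of those containing escalope, french fries and pasta, which is where the very high lift comes from.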
# Visualizing the rules
# calling the library
library(arulesViz)
# plotting
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules, method = 'grouped')
Suppose we’re interested in running a promotion on chocolate. We could create a subset of the rules that have chocolate on the right-hand side. This would tell us which items customers tend to buy together with chocolate (the rules describe items in the same transaction, not the order in which they were picked):
chocolate <- subset(rules, subset = rhs %pin% "chocolate")
# Then order by confidence
chocolate <- sort(chocolate, by="confidence", decreasing=TRUE)
inspect(chocolate[1:5])
## lhs rhs support confidence
## [1] {escalope, french fries, shrimp} => {chocolate} 0.001066524 0.8888889
## [2] {red wine, tomato sauce} => {chocolate} 0.001066524 0.8000000
## [3] {burgers, olive oil, pancakes} => {chocolate} 0.001199840 0.7500000
## [4] {almonds, olive oil, spaghetti} => {chocolate} 0.001066524 0.7272727
## [5] {almonds, milk, spaghetti} => {chocolate} 0.001066524 0.7272727
## coverage lift count
## [1] 0.001199840 5.425188 8
## [2] 0.001333156 4.882669 8
## [3] 0.001599787 4.577502 9
## [4] 0.001466471 4.438790 8
## [5] 0.001466471 4.438790 8
Observation:
Chocolate was bought in transactions that contained the following item sets: {escalope, french fries, shrimp}, {red wine, tomato sauce}, {burgers, olive oil, pancakes}, {almonds, olive oil, spaghetti} and {almonds, milk, spaghetti}.
Association rules allow the business to arrange items on the shelves according to how they have been bought together before, or how likely they are to be picked given a set of items already picked.
If, say, the business is running a sales promotion on chocolate, it is reasonable to place chocolate next to the following items: {escalope, french fries, shrimp}, {red wine, tomato sauce}, {burgers, olive oil, pancakes}, {almonds, olive oil, spaghetti} and {almonds, milk, spaghetti}, because the rules show that chocolate was picked in 73% to 89% of the transactions containing these item sets. Each rule rests on only 8 or 9 transactions, however, so the placement is better treated as a hypothesis to test than as a certainty.
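Before acting on the chocolate rules, it is worth converting their support and coverage back into raw transaction counts. The base R sketch below does this from the values printed by `inspect(chocolate[1:5])`. The data frame and its column names are my own construction, not an `arules` object.

```r
n_transactions <- 7501

# Values copied from the five chocolate rules above
chocolate_rules <- data.frame(
  lhs = c("escalope, french fries, shrimp",
          "red wine, tomato sauce",
          "burgers, olive oil, pancakes",
          "almonds, olive oil, spaghetti",
          "almonds, milk, spaghetti"),
  support  = c(0.001066524, 0.001066524, 0.001199840, 0.001066524, 0.001066524),
  coverage = c(0.001199840, 0.001333156, 0.001599787, 0.001466471, 0.001466471)
)

# Transactions containing the left-hand side plus chocolate
chocolate_rules$rule_count <- round(chocolate_rules$support * n_transactions)

# Transactions containing the left-hand side at all
chocolate_rules$lhs_count <- round(chocolate_rules$coverage * n_transactions)

# Confidence recomputed from the counts
chocolate_rules$confidence <- chocolate_rules$rule_count / chocolate_rules$lhs_count

chocolate_rules[, c("lhs", "lhs_count", "rule_count", "confidence")]
```

Each left-hand side occurs in only 9 to 12 of the 7501 transactions, so the high confidence values come from very small samples.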