In this session, we will go through an example of association rules using the arules package. The documentation of this package can be found by visiting the following link: https://www.rdocumentation.org/packages/arules/versions/1.6-4. Below is an extract from its documentation:
“It provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). It also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat.”
# We first install the required arules library
#
# install.packages("arules")
# Loading the arules library
#
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Loading our transactions dataset from our csv file
# ---
# We will use the read.transactions function, which loads data from comma-separated files
# and converts them to the class transactions, which is the kind of data that
# we will require while working with models of association rules
# ---
#
path <- "http://bit.ly/GroceriesDataset"
Transactions <- read.transactions(path, sep = ",")
Transactions
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
# Verifying the object's class
# ---
# This should show us transactions as the type of data that we will need
# ---
#
class(Transactions)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
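To see what the transactions class holds without downloading anything, here is a minimal sketch that builds a transactions object from an in-memory list. The baskets and the name `toy` are made up for illustration and are not part of the Groceries data.

```r
library(arules)

# Hypothetical baskets, one character vector per transaction
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns", "soda"),
  c("yogurt")
)

# Coerce the list to the sparse transactions class used by arules
toy <- as(baskets, "transactions")

class(toy)        # same class as the object returned by read.transactions
length(toy)       # number of transactions (rows)
itemLabels(toy)   # distinct items (columns)
size(toy)         # number of items in each transaction
```

Each list element becomes one row, and each distinct item becomes one column of the sparse matrix.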
# Previewing our first 5 transactions
#
inspect(Transactions[1:5])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
# To preview the items that make up our dataset,
# we can alternatively do the following
# ---
#
items<-as.data.frame(itemLabels(Transactions))
colnames(items) <- "Item"
head(items, 10)
## Item
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
## 4 baby food
## 5 bags
## 6 baking powder
## 7 bathroom cleaner
## 8 beef
## 9 berries
## 10 beverages
# Generating a summary of the transaction dataset
# ---
# This would give us some information such as the most purchased items,
# distribution of the item sets (no. of items purchased in each transaction), etc.
# ---
#
summary(Transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
# Exploring the frequency of some items,
# i.e. the items in columns 8 to 10, first as absolute counts
# and then as a percentage of the total transactions
#
itemFrequency(Transactions[, 8:10],type = "absolute")
## beef berries beverages
## 516 327 256
round(itemFrequency(Transactions[, 8:10],type = "relative")*100,2)
## beef berries beverages
## 5.25 3.32 2.60
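The relative frequencies above are simply the absolute counts divided by the number of transactions (9835, as reported when the dataset was loaded). A quick check in base R, with the counts copied from the output above:

```r
# Absolute counts copied from the itemFrequency(..., type = "absolute") output
abs_counts <- c(beef = 516, berries = 327, beverages = 256)

# Number of transactions reported for the Groceries dataset
n_transactions <- 9835

# Relative frequency as a percentage, rounded to two decimals
pct <- round(abs_counts / n_transactions * 100, 2)
pct
```

The result should agree with the percentages printed by the second itemFrequency call.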
# Producing a chart of frequencies and filtering
# to consider only items with a minimum percentage
# of support, or only the top x items
# ---
# Displaying the top 10 most common items in the transactions dataset
# and the items whose relative frequency (support) is at least 10%
#
par(mfrow = c(1, 2))
# plot the frequency of items
itemFrequencyPlot(Transactions, topN = 10,col="darkgreen")
itemFrequencyPlot(Transactions, support = 0.1, col="darkred")
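The two filters used in the plots (topN and support) can also be applied to the raw frequencies when a table is preferred to a chart. The following is a minimal sketch on made-up baskets; the object `toy` and the thresholds are illustrative assumptions, not values from the Groceries data.

```r
library(arules)

# Hypothetical baskets for illustration
toy <- as(list(
  c("milk", "bread"),
  c("milk", "eggs"),
  c("milk"),
  c("bread", "eggs")
), "transactions")

# Relative frequency (support) of every item
freq <- itemFrequency(toy, type = "relative")

# Analogous to topN = 2: the two most frequent items
top2 <- head(sort(freq, decreasing = TRUE), 2)

# Analogous to support = 0.6: items present in at least 60% of transactions
frequent <- freq[freq >= 0.6]

top2
frequent
```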
# Building a model based on association rules
# using the apriori function
# ---
# We use Min Support as 0.001 and confidence as 0.8
# ---
#
rules <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 410 rules
# We use measures of significance and interest on the rules,
# determining which ones are interesting and which to discard.
# ---
# Since we built the model using a minimum support of 0.001
# and a confidence of 0.8, we obtained 410 rules.
# To illustrate the sensitivity of the model to these two parameters,
# we will see what happens if we increase the support or lower the confidence level
#
# Building an apriori model with Min Support as 0.002 and confidence as 0.8.
rules2 <- apriori (Transactions,parameter = list(supp = 0.002, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Building an apriori model with Min Support as 0.001 and confidence as 0.6.
rules3 <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [2918 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules2
## set of 11 rules
rules3
## set of 2918 rules
In our first example, we increased the minimum support from 0.001 to 0.002 and the number of rules dropped from 410 to only 11. This suggests that a high minimum support can make the model lose interesting rules. In the second example, we kept the minimum support at 0.001 and lowered the minimum confidence to 0.6, and the number of rules went from 410 to 2918. A low confidence threshold therefore increases the number of rules considerably, and many of them will not be useful.
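One way to see why the rule count is so sensitive to support is to convert the relative threshold into a number of transactions. The sketch below is plain arithmetic on the dataset size reported earlier; the helper name `min_count` is hypothetical and is not an arules function.

```r
# Number of transactions in the Groceries dataset
n_transactions <- 9835

# Smallest whole number of transactions whose share reaches the threshold
min_count <- function(supp, n) ceiling(supp * n)

min_count(0.001, n_transactions)  # itemsets must occur in at least this many transactions
min_count(0.002, n_transactions)
```

Moving from 0.001 to 0.002 doubles the number of transactions an itemset must appear in, which is consistent with the minimum count of 10 shown by summary(rules) for the first model. The "Absolute minimum support count" lines in the apriori log (9 and 19) appear to be the same products rounded down.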
# We can explore our model
# through the use of the summary function as shown
# ---
# Upon running the code, the function gives us information about the model,
# i.e. the distribution of rule sizes (the number of items in lhs + rhs).
# In our case, most rules have 4 or 5 items, though some rules have up to 6.
# More statistical information such as support, lift and confidence is also provided.
# ---
#
summary(rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## Transactions 9835 0.001 0.8
## call
## apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))
# Observing rules built in our model i.e. first 5 model rules
# ---
#
inspect(rules[1:5])
## lhs rhs support confidence
## [1] {liquor, red/blush wine} => {bottled beer} 0.001931876 0.9047619
## [2] {cereals, curd} => {whole milk} 0.001016777 0.9090909
## [3] {cereals, yogurt} => {whole milk} 0.001728521 0.8095238
## [4] {butter, jam} => {whole milk} 0.001016777 0.8333333
## [5] {bottled beer, soups} => {whole milk} 0.001118454 0.9166667
## coverage lift count
## [1] 0.002135231 11.235269 19
## [2] 0.001118454 3.557863 10
## [3] 0.002135231 3.168192 17
## [4] 0.001220132 3.261374 10
## [5] 0.001220132 3.587512 11
# Interpretation of the first rule:
# ---
# In about 90% of the transactions that contain liquor and red/blush wine
# (confidence 0.905), bottled beer is bought too
# ---
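The 90% figure can be recovered from the other columns of the first rule: confidence is the rule's support divided by its coverage (the support of the lhs). The values below are copied from the inspect() output; the variable names are only illustrative.

```r
support  <- 0.001931876   # share of transactions with liquor, red/blush wine and bottled beer
coverage <- 0.002135231   # share of transactions with liquor and red/blush wine
n_transactions <- 9835

# Confidence of {liquor, red/blush wine} => {bottled beer}
confidence <- support / coverage
round(confidence, 4)

# The same ratio expressed as transaction counts
rule_count <- round(support * n_transactions)
lhs_count  <- round(coverage * n_transactions)
c(rule_count, lhs_count)
```

In other words, of the transactions containing both liquor and red/blush wine, the ones counted in the rule's count column also contain bottled beer.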
# Ordering these rules by a criterion such as the level of confidence,
# then looking at the first five rules.
# We can also use different criteria such as: (by = "lift" or by = "support")
#
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
## [3] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [4] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
## [5] {butter,
## domestic eggs,
## soft cheese} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
# Interpretation
# ---
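# The first five rules all have a confidence of 1, i.e. 100%:
# in this dataset, every transaction that contains the lhs items also contains whole milk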
# ---
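These five rules also share the same lift, which follows from the definition lift = confidence / support(rhs). With a confidence of 1 and whole milk appearing in 2513 of the 9835 transactions (both figures come from the summary output earlier), the lift can be checked by hand:

```r
# Figures copied from summary(Transactions)
whole_milk_count <- 2513
n_transactions   <- 9835

# Support of the rhs item {whole milk}
rhs_support <- whole_milk_count / n_transactions

# Lift of a rule with confidence 1 and whole milk on the rhs
lift <- 1 / rhs_support
round(lift, 6)
```

Because whole milk is the most frequent item, even a rule with 100% confidence has only a moderate lift.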
# If we're interested in making a promotion relating to the sale of yogurt,
# we could create a subset of rules with yogurt on the right-hand side
# ---
# This would tell us the items that customers tend to buy together with yogurt
# ---
#
yogurt <- subset(rules, subset = rhs %pin% "yogurt")
# Then order by confidence
yogurt<-sort(yogurt, by="confidence", decreasing=TRUE)
inspect(yogurt[1:5])
## lhs rhs support confidence coverage lift count
## [1] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [2] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [3] {cream cheese,
## curd,
## other vegetables,
## whipped/sour cream} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [4] {butter,
## other vegetables,
## tropical fruit,
## white bread} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [5] {pip fruit,
## sausage,
## sliced cheese} => {yogurt} 0.001220132 0.8571429 0.001423488 6.144315 12
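The %pin% operator used above does partial matching on item labels, whereas %in% requires an exact label. The distinction matters when labels share a substring. A minimal sketch on made-up baskets (the items and the name `toy` are illustrative only, and the operators are applied to transactions rather than rules to keep the example small):

```r
library(arules)

# Hypothetical baskets with two labels that both contain "yogurt"
toy <- as(list(
  c("yogurt", "whole milk"),
  c("frozen yogurt", "whole milk"),
  c("rolls/buns")
), "transactions")

# Exact match: only transactions containing the item labelled exactly "yogurt"
exact <- toy %in% "yogurt"

# Partial match: any item whose label contains "yogurt"
partial <- toy %pin% "yogurt"

exact
partial
```

When only the exact item is wanted, `rhs %in% "yogurt"` inside subset() is the stricter choice.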
# What if we wanted to determine the items that customers who have
# bought yogurt might also buy?
# ---
#
# Subset the rules
yogurt <- subset(rules, subset = lhs %pin% "yogurt")
# Order by confidence
yogurt<-sort(yogurt, by="confidence", decreasing=TRUE)
# inspect rules 15 to 19 of the sorted subset
inspect(yogurt[15:19])
## lhs rhs support confidence coverage lift count
## [1] {butter,
## domestic eggs,
## tropical fruit,
## yogurt} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [2] {cream cheese,
## other vegetables,
## pip fruit,
## yogurt} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [3] {curd,
## domestic eggs,
## tropical fruit,
## yogurt} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [4] {butter,
## domestic eggs,
## root vegetables,
## yogurt} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [5] {domestic eggs,
## tropical fruit,
## whipped/sour cream,
## yogurt} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## Challenge 1
# ---
# Question: Build an apriori model and preview the rules with the highest confidence,
# given the following dataset.
# ---
url <- 'http://bit.ly/AssociativeAnalysisDataset'
# ---
# OUR CODE GOES BELOW
#
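# loading the dataset
# Note: no sep argument is given here, so each line is split on whitespace rather than on commas.
# This is why labels such as "cheese,energy" appear as single items in the outputs below;
# passing sep = "," would treat each comma-separated product as its own item.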
data <- read.transactions(url, header=FALSE)
## Warning in asMethod(object): removing duplicated items in transactions
# checking datatype of dataset
class(data)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# checking items in our dataset
inspect(head(data, 2))
## items
## [1] {cheese,energy,
## drink,tomato,
## fat,
## flour,yams,cottage,
## grapes,whole,
## juice,frozen,
## juice,low,
## mix,green,
## oil,
## shrimp,almonds,avocado,vegetables,
## smoothie,spinach,olive,
## tea,honey,salad,mineral,
## water,salmon,antioxydant,
## weat,
## yogurt,green}
## [2] {burgers,meatballs,eggs}
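Item labels such as "cheese,energy" above suggest that each line was split on whitespace rather than on commas, which is the default behaviour of read.transactions when sep is not supplied. The sketch below shows the difference on a small temporary file; the file contents and object names are made up for illustration.

```r
library(arules)

# Write a tiny comma-separated basket file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("mineral water,green tea", "eggs,mineral water"), tmp)

# Default separator: fields are split on whitespace
by_space <- read.transactions(tmp, header = FALSE)

# Explicit comma separator: each product becomes its own item
by_comma <- read.transactions(tmp, sep = ",", header = FALSE)

itemLabels(by_space)
itemLabels(by_comma)

unlink(tmp)
```

With the comma separator, multi-word products such as "mineral water" stay intact, and the number of distinct items is typically much smaller.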
# create a dataframe for items
items<- as.data.frame(itemLabels(data))
# rename column
colnames(items) <- 'Item'
# preview the dataset
head(items)
## Item
## 1 &
## 2 accessories
## 3 accessories,antioxydant
## 4 accessories,champagne,fresh
## 5 accessories,champagne,protein
## 6 accessories,chocolate
# summary of the transactions object
summary(data)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 5729 columns (items) and a density of 0.0005421748
##
## most frequent items:
## tea wheat mineral fat yogurt (Other)
## 803 645 577 574 543 20157
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16
## 1603 2007 1382 942 651 407 228 151 70 39 13 5 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.106 4.000 16.000
##
## includes extended item information - examples:
## labels
## 1 &
## 2 accessories
## 3 accessories,antioxydant
# checking item frequency for the first 5 items
itemFrequency(data[, 1:5],type = "absolute")
## & accessories
## 371 10
## accessories,antioxydant accessories,champagne,fresh
## 1 1
## accessories,champagne,protein
## 1
# percentage frequency of the first 10 items relative to the total number of transactions
round(itemFrequency(data[, 1:10],type = "relative")*100,2)
## & accessories
## 4.95 0.13
## accessories,antioxydant accessories,champagne,fresh
## 0.01 0.01
## accessories,champagne,protein accessories,chocolate
## 0.01 0.01
## accessories,chocolate,champagne,frozen accessories,chocolate,frozen
## 0.01 0.01
## accessories,chocolate,low accessories,chocolate,pasta,salt
## 0.01 0.01
# create subplot
par(mfrow = c(1, 2))
# plot the frequency of items
itemFrequencyPlot(data, topN = 10,col="darkgreen",ylim=c(0,0.11))
itemFrequencyPlot(data, support = 0.04, col="darkred",ylim=c(0,0.11))
# Building a model based on association rules
# using the apriori function
# ---
# We use Min Support as 0.001 and confidence as 0.8
# ---
#
rules <- apriori (data, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5729 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [354 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [271 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 271 rules
# rules summary
summary(rules)
## set of 271 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 107 144 20
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.679 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.800 Min. :0.001067 Min. : 7.611
## 1st Qu.:0.001200 1st Qu.:0.931 1st Qu.:0.001200 1st Qu.: 11.630
## Median :0.001600 Median :1.000 Median :0.001600 Median : 13.068
## Mean :0.002834 Mean :0.963 Mean :0.002973 Mean : 22.372
## 3rd Qu.:0.002666 3rd Qu.:1.000 3rd Qu.:0.002800 3rd Qu.: 20.218
## Max. :0.068391 Max. :1.000 Max. :0.076523 Max. :613.718
## count
## Min. : 8.00
## 1st Qu.: 9.00
## Median : 12.00
## Mean : 21.26
## 3rd Qu.: 20.00
## Max. :513.00
##
## mining info:
## data ntransactions support confidence
## data 7501 0.001 0.8
## call
## apriori(data = data, parameter = list(supp = 0.001, conf = 0.8))
# Observing rules built in our model i.e. first 5 model rules
# ---
#
inspect(rules[1:5])
## lhs rhs support confidence
## [1] {cookies,low} => {yogurt} 0.001066524 1
## [2] {cookies,low} => {fat} 0.001066524 1
## [3] {extra} => {dark} 0.001066524 1
## [4] {burgers,whole} => {wheat} 0.001199840 1
## [5] {fries,escalope,pasta,mushroom} => {cream} 0.001066524 1
## coverage lift count
## [1] 0.001066524 13.81400 8
## [2] 0.001066524 13.06794 8
## [3] 0.001066524 83.34444 8
## [4] 0.001199840 11.62946 9
## [5] 0.001066524 47.77707 8
# Looking at the first five rules sorted by lift
#
#
rules<-sort(rules, by="lift", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage
## [1] {&, fresh} => {tuna,herb} 0.001199840 0.9 0.001333156
## [2] {parmesan, wheat} => {cheese,whole} 0.001333156 1.0 0.001333156
## [3] {fat, tea} => {yogurt,green} 0.004666045 1.0 0.004666045
## [4] {&, grated} => {cheese,herb} 0.004666045 1.0 0.004666045
## [5] {bar,hand} => {protein} 0.001199840 1.0 0.001199840
## lift count
## [1] 613.7182 9
## [2] 258.6552 10
## [3] 197.3947 35
## [4] 153.0816 35
## [5] 144.2500 9
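Lifts in the hundreds usually mean that the rhs is very rare. Rearranging lift = confidence / support(rhs) gives the implied number of transactions that contain the rhs. The values below are copied from the output above; the variable names are only illustrative.

```r
# Number of transactions in this dataset
n_transactions <- 7501

# Confidence and lift of the five rules with the highest lift
confidence <- c(0.9, 1, 1, 1, 1)
lift       <- c(613.7182, 258.6552, 197.3947, 153.0816, 144.2500)

# support(rhs) = confidence / lift, converted to a transaction count
rhs_count <- round(n_transactions * confidence / lift)
rhs_count
```

Each rhs appears in only a few dozen of the 7501 transactions, so even a handful of co-occurrences produces a very large lift.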
## Challenge 2
# ---
# Question: Build an apriori model and preview the rules with the highest confidence.
# ---
# OUR CODE GOES BELOW
#
# Looking at the first five rules sorted by confidence
#
#
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
## lhs rhs support confidence coverage
## [1] {parmesan, wheat} => {cheese,whole} 0.001333156 1 0.001333156
## [2] {fat, tea} => {yogurt,green} 0.004666045 1 0.004666045
## [3] {&, grated} => {cheese,herb} 0.004666045 1 0.004666045
## [4] {bar,hand} => {protein} 0.001199840 1 0.001199840
## [5] {flour,green} => {weat} 0.001199840 1 0.001199840
## lift count
## [1] 258.6552 10
## [2] 197.3947 35
## [3] 153.0816 35
## [4] 144.2500 9
## [5] 107.1571 9