This is a short vignette on the arules library and the association rules that can be mined from transactional datasets with it.
Note that the class of Groceries is “transactions.” This matters because apriori() works on transactions objects; other inputs, such as a list or a binary matrix, are coerced to that class first. Inspecting the first two transactions in the Groceries data shows that each one is a set of common grocery items.
library(arules)
## Warning: package 'arules' was built under R version 3.6.3
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
inspect(head(Groceries, 2))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
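Data that does not already have the transactions class can be coerced to it before mining. The following is a minimal sketch rather than part of the original analysis: the `baskets` list and its contents are hypothetical and invented for illustration.

```r
library(arules)

# Hypothetical baskets, invented for illustration (not taken from Groceries)
baskets <- list(
  c("whole milk", "yogurt", "curd"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "tropical fruit")
)

# Coerce the plain list to the transactions class used by arules
trans <- as(baskets, "transactions")

class(trans)    # class of the coerced object
inspect(trans)  # one row per basket
```

Each list element becomes one transaction, and the distinct strings across all elements become the item labels.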
The eclat() function takes a transactions object and returns its frequent itemsets. The results can be changed by adjusting the minimum support (supp) and the maximum number of items per itemset (maxlen). As shown below, with supp = 0.1 the function returns 8 itemsets, each a single item that occurs in at least 10% of the 9,835 transactions. The maxlen = 20 setting only caps the length of an itemset; it does not filter transactions by size.
The itemFrequencyPlot() function, shown below, produces a bar chart of the topN (here 8) most common items in the transactions.
freq <- eclat(Groceries, parameter = list(supp = 0.1, maxlen = 20))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.1 1 20 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 983
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating bit matrix ... [8 row(s), 9835 column(s)] done [0.00s].
## writing ... [8 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq)
## items support count
## [1] {whole milk} 0.2555160 2513
## [2] {other vegetables} 0.1934926 1903
## [3] {rolls/buns} 0.1839349 1809
## [4] {yogurt} 0.1395018 1372
## [5] {soda} 0.1743772 1715
## [6] {root vegetables} 0.1089985 1072
## [7] {tropical fruit} 0.1049314 1032
## [8] {bottled water} 0.1105236 1087
itemFrequencyPlot(Groceries, topN=8, type="absolute", main="Frequency")
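The support column reported by inspect(freq) is the item count divided by the total number of transactions. As a rough check in base R, the counts below are copied from that output; the variable names are only illustrative.

```r
# Total number of transactions, as reported in the eclat() log
n_transactions <- 9835

# Item counts copied from the inspect(freq) output
counts <- c(
  "whole milk"       = 2513,
  "other vegetables" = 1903,
  "rolls/buns"       = 1809,
  "yogurt"           = 1372,
  "soda"             = 1715,
  "root vegetables"  = 1072,
  "tropical fruit"   = 1032,
  "bottled water"    = 1087
)

# Support is the share of transactions that contain the item
support <- counts / n_transactions
round(support, 7)

# Every listed item clears the 10% minimum support threshold
all(support >= 0.1)
```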
In the following block of code, the apriori() function creates the object grocery_rules. We keep only rules with a support of at least 0.01 and a confidence of at least 0.05. We then inspect grocery_rules twice, sorted first by lift and then by confidence. The rule with the highest lift is {whole milk, yogurt} => {curd}. The rule with the highest confidence is {citrus fruit, root vegetables} => {other vegetables}.
grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.05))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [541 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(sort(grocery_rules, by = "lift"), 10))
## lhs rhs support
## [1] {whole milk,yogurt} => {curd} 0.01006609
## [2] {citrus fruit,other vegetables} => {root vegetables} 0.01037112
## [3] {other vegetables,yogurt} => {whipped/sour cream} 0.01016777
## [4] {tropical fruit,other vegetables} => {root vegetables} 0.01230300
## [5] {root vegetables} => {beef} 0.01738688
## [6] {beef} => {root vegetables} 0.01738688
## [7] {citrus fruit,root vegetables} => {other vegetables} 0.01037112
## [8] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [9] {other vegetables,whole milk} => {root vegetables} 0.02318251
## [10] {other vegetables,whole milk} => {butter} 0.01148958
## confidence lift count
## [1] 0.1796733 3.372304 99
## [2] 0.3591549 3.295045 102
## [3] 0.2341920 3.267062 100
## [4] 0.3427762 3.144780 121
## [5] 0.1595149 3.040367 171
## [6] 0.3313953 3.040367 171
## [7] 0.5862069 3.029608 102
## [8] 0.5845411 3.020999 121
## [9] 0.3097826 2.842082 228
## [10] 0.1535326 2.770630 113
inspect(head(sort(grocery_rules, by = "confidence"), 10))
## lhs rhs support
## [1] {citrus fruit,root vegetables} => {other vegetables} 0.01037112
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [3] {curd,yogurt} => {whole milk} 0.01006609
## [4] {other vegetables,butter} => {whole milk} 0.01148958
## [5] {tropical fruit,root vegetables} => {whole milk} 0.01199797
## [6] {root vegetables,yogurt} => {whole milk} 0.01453991
## [7] {other vegetables,domestic eggs} => {whole milk} 0.01230300
## [8] {yogurt,whipped/sour cream} => {whole milk} 0.01087951
## [9] {root vegetables,rolls/buns} => {whole milk} 0.01270971
## [10] {pip fruit,other vegetables} => {whole milk} 0.01352313
## confidence lift count
## [1] 0.5862069 3.029608 102
## [2] 0.5845411 3.020999 121
## [3] 0.5823529 2.279125 99
## [4] 0.5736041 2.244885 113
## [5] 0.5700483 2.230969 118
## [6] 0.5629921 2.203354 143
## [7] 0.5525114 2.162336 121
## [8] 0.5245098 2.052747 107
## [9] 0.5230126 2.046888 125
## [10] 0.5175097 2.025351 133
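To make the three quality measures concrete, the sketch below recomputes them by hand for the top-lift rule, {whole milk, yogurt} => {curd}. Only the rule count (99) and the number of transactions (9835) come from the output above. The counts for the left-hand side (551) and for curd (524) are not printed anywhere in this vignette; they are back-calculated from the reported confidence and lift and should be treated as assumptions.

```r
n          <- 9835  # total transactions (from the apriori() log)
count_rule <- 99    # transactions with whole milk, yogurt and curd (from the output)
count_lhs  <- 551   # ASSUMPTION: transactions with whole milk and yogurt
count_rhs  <- 524   # ASSUMPTION: transactions with curd

# support: share of all transactions that contain every item in the rule
support <- count_rule / n

# confidence: share of left-hand-side transactions that also contain the right-hand side
confidence <- count_rule / count_lhs

# lift: confidence relative to the baseline frequency of the right-hand side
lift <- confidence / (count_rhs / n)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

A lift above 1 means the right-hand side turns up more often alongside the left-hand side than its overall frequency would suggest.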
Using apriori(), we can extract association rules from the Groceries dataset. The rule {whole milk, yogurt} => {curd} has the highest lift (3.37), with {citrus fruit, other vegetables} => {root vegetables} a close second (3.30). This makes sense, as a person who buys multiple types of dairy or produce is more likely to also buy a third type of dairy or produce.
For the high-confidence rules, the top two combine multiple produce items, while the rest of the top 10 all have whole milk as the consequent. As the frequency functions above show, whole milk is the most common item, appearing in about 26% of transactions, which likely inflates the confidence of these rules: any rule that predicts whole milk starts from that high baseline. Their lift values of roughly 2.0 to 2.3 are lower than those of the top two rules (about 3.0). To drill down into the truly valuable associations in the dataset, it may be worth removing high-frequency items such as whole milk.
Association rule mining, and in turn the arules package, is a very helpful way to find associations between items that may not have been anticipated. This information can be useful for any entity that wants to promote associated products to its customers or to identify associations in its data that it was not aware of.
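One way to set whole milk aside, sketched here as an assumption rather than as the analysis actually performed, is to filter the mined rules with subset(). The name `no_milk` is illustrative, and the rules are re-mined with the same thresholds so that the snippet runs on its own.

```r
library(arules)
data(Groceries)

# Re-mine the rules with the same thresholds as above (log output suppressed)
grocery_rules <- apriori(
  Groceries,
  parameter = list(support = 0.01, confidence = 0.05),
  control = list(verbose = FALSE)
)

# Keep only rules that mention whole milk on neither side
no_milk <- subset(
  grocery_rules,
  !(lhs %in% "whole milk") & !(rhs %in% "whole milk")
)

# Re-rank what remains by confidence
inspect(head(sort(no_milk, by = "confidence"), 5))
```

Filtering after mining leaves the original rule set intact for comparison. Sorting by lift instead of confidence is another way to reduce the influence of very frequent consequents.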
https://towardsdatascience.com/association-rule-mining-in-r-ddf2d044ae50
http://r-statistics.co/Association-Mining-With-R.html
Data Science for Business Chapter 12