Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. The arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
The Apriori algorithm is used to find all rules (the default association type for apriori()) that meet a minimum support and a minimum confidence. In this analysis the thresholds are a support of 0.009 (0.9% of transactions) and a confidence of 0.25. A common application of the Apriori algorithm is market basket analysis. It provides insights into which products tend to be purchased together and which are most amenable to promotion. The drawback of the Apriori algorithm is that it can be slow on large data sets and can generate a large number of rules that are difficult to interpret, although visualization and filtering techniques can help.
The data used here is the ‘Groceries’ data set. It contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories. The data set is provided for arules by Michael Hahsler, Kurt Hornik and Thomas Reutterer.
Before proceeding, load the libraries that will be used for the analysis:
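To make the two thresholds concrete, here is a minimal base R sketch of how support and confidence are defined. The toy transactions and the helper functions `support()` and `confidence()` are made up for illustration; they are not part of arules and this is not how apriori() is implemented internally.

```r
# Toy transactions, invented for illustration only
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("milk")
)

# Hypothetical helper: fraction of transactions containing every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

# Hypothetical helper: confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

supp_bm <- support(c("bread", "milk"), transactions)   # 2 of 5 transactions
conf_bm <- confidence("bread", "milk", transactions)   # 2 of the 3 bread baskets

# A rule is kept only if it clears both thresholds (values chosen for the toy data)
keep_rule <- supp_bm >= 0.3 && conf_bm >= 0.5

print(c(support = supp_bm, confidence = conf_bm))
```

Raising either threshold shrinks the rule set; lowering them is what makes the number of rules grow quickly on real data.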
library(arules)
library(arulesViz)
library(tidyverse)
Importing the data and summarizing it:
data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
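A couple of insights that can be derived from the summary: the data holds 9835 transactions over 169 items, and a density of about 0.026 means that only 2.6% of the cells in the transaction-by-item matrix are filled (43,367 item purchases in total). Whole milk is the most frequent item, appearing in 2513 transactions (about 25.6%). Baskets are small: 2159 transactions contain a single item, the median is 3 items and the mean is about 4.4, while the largest basket has 32 items.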
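The density and mean basket size reported by summary() can be cross-checked by hand from the transaction-size distribution. The sketch below uses only base R and numbers copied from the output above; the variable names are illustrative.

```r
# Transaction sizes and their frequencies, copied from the summary() output
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

n_items     <- 169                    # number of item categories
n_trans     <- sum(counts)            # number of transactions
total_items <- sum(sizes * counts)    # item purchases across all transactions

mean_size <- total_items / n_trans              # average basket size
density   <- total_items / (n_trans * n_items)  # share of filled matrix cells

print(c(transactions = n_trans, items_bought = total_items,
        mean_size = mean_size, density = density))
```

The total also matches the sum of the "most frequent items" counts, including the (Other) bucket.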
Let’s see what the Groceries data has in store for us:
grocery <- as(Groceries, 'data.frame')
head(grocery)
## items
## 1 {citrus fruit,semi-finished bread,margarine,ready soups}
## 2 {tropical fruit,yogurt,coffee}
## 3 {whole milk}
## 4 {pip fruit,yogurt,cream cheese ,meat spreads}
## 5 {other vegetables,whole milk,condensed milk,long life bakery product}
## 6 {whole milk,butter,yogurt,rice,abrasive cleaner}
itemFrequency(Groceries[,1:5])
## frankfurter sausage liver loaf ham meat
## 0.058973055 0.093950178 0.005083884 0.026029487 0.025826131
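itemFrequency() returns relative frequencies (support) by default. As a small sketch, the `type = "absolute"` argument gives raw counts instead, which makes it easy to list the top sellers numerically rather than only in a plot. The variable names here are illustrative.

```r
library(arules)
data("Groceries")

n_trans  <- length(Groceries)                       # number of transactions
abs_freq <- itemFrequency(Groceries, type = "absolute")
rel_freq <- itemFrequency(Groceries)                # default: type = "relative"

# Five most frequently purchased items, by transaction count
top5 <- head(sort(abs_freq, decreasing = TRUE), 5)
print(top5)

# Relative frequency is simply the count divided by the number of transactions
print(head(abs_freq / n_trans))
```

The counts for the top items should agree with the "most frequent items" block of summary(Groceries).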
itemFrequencyPlot(Groceries, support= 0.10)
itemFrequencyPlot(Groceries, support= 0.05)
itemFrequencyPlot(Groceries, topN = 20)
image(Groceries[1:5])
image(sample(Groceries, 100))
basket <- apriori(Groceries, parameter = list(support = 0.009, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.009 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 88
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [224 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
basket
## set of 224 rules
summary(basket)
## set of 224 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 111 113
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.504 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.009049 Min. :0.2513 Min. :0.9932 Min. : 89.0
## 1st Qu.:0.010066 1st Qu.:0.2974 1st Qu.:1.5767 1st Qu.: 99.0
## Median :0.012303 Median :0.3603 Median :1.8592 Median :121.0
## Mean :0.016111 Mean :0.3730 Mean :1.9402 Mean :158.5
## 3rd Qu.:0.018480 3rd Qu.:0.4349 3rd Qu.:2.2038 3rd Qu.:181.8
## Max. :0.074835 Max. :0.6389 Max. :3.7969 Max. :736.0
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.009 0.25
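A couple of insights from the summary: the 224 rules split almost evenly between 111 rules with two items and 113 rules with three items. Support ranges from about 0.009 to 0.075 (89 to 736 transactions), confidence from 0.25 to 0.64, and lift from 0.99 to 3.80. Since the first quartile of lift is already 1.58, at least three quarters of the rules have a lift well above 1.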
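Support and count in the summary are two views of the same quantity, which a few lines of base R can confirm with numbers taken from the output above. The remark about the printed "Absolute minimum support count: 88" is an assumption on my part: it looks like the product of the support threshold and the number of transactions, rounded down.

```r
n_trans     <- 9835    # transactions in Groceries
min_support <- 0.009   # threshold passed to apriori()

# Threshold expressed in transactions: 88.515, so a rule needs at least 89
threshold_count <- min_support * n_trans
min_count       <- ceiling(threshold_count)

# Convert the smallest and largest rule counts from the summary back to support
supp_min <- 89  / n_trans
supp_max <- 736 / n_trans

print(c(threshold = threshold_count, min_count = min_count,
        supp_min = supp_min, supp_max = supp_max))
```

The two converted values line up with the Min. and Max. entries of the support column.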
inspect(basket[1:10])
##      lhs                rhs                support     confidence lift     count
## [1]  {baking powder} => {whole milk}       0.009252669 0.5229885  2.046793  91
## [2]  {grapes}        => {other vegetables} 0.009049314 0.4045455  2.090754  89
## [3]  {meat}          => {other vegetables} 0.009964413 0.3858268  1.994013  98
## [4]  {meat}          => {whole milk}       0.009964413 0.3858268  1.509991  98
## [5]  {frozen meals}  => {whole milk}       0.009862735 0.3476703  1.360659  97
## [6]  {hard cheese}   => {other vegetables} 0.009456024 0.3858921  1.994350  93
## [7]  {hard cheese}   => {whole milk}       0.010066090 0.4107884  1.607682  99
## [8]  {butter milk}   => {other vegetables} 0.010371124 0.3709091  1.916916 102
## [9]  {butter milk}   => {whole milk}       0.011591256 0.4145455  1.622385 114
## [10] {ham}           => {other vegetables} 0.009150991 0.3515625  1.816930  90
The first ten rules are shown here, each with its support, confidence, lift and count. The lift of a rule measures how much more likely an item or itemset is to be purchased, given that another item or itemset has been purchased, relative to its typical rate of purchase. A lift above 1 means the two sides of the rule occur together more often than would be expected if they were independent.
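As a worked example of that definition, lift can be recomputed by hand as the rule's confidence divided by the support of its right-hand side. The sketch below does this in base R for the rule {baking powder} => {whole milk}, using only figures printed earlier in this post; the variable names are illustrative.

```r
n_trans    <- 9835        # transactions in Groceries
count_milk <- 2513        # transactions containing whole milk, from summary(Groceries)
conf_rule  <- 0.5229885   # confidence of {baking powder} => {whole milk}, from inspect()

# Typical purchase rate of whole milk across all baskets
supp_milk <- count_milk / n_trans

# lift = P(milk | baking powder) / P(milk)
lift_rule <- conf_rule / supp_milk

print(c(support_milk = supp_milk, lift = lift_rule))
```

The result agrees with the lift of about 2.05 reported for the first rule: baskets with baking powder are roughly twice as likely to contain whole milk as an average basket.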
inspect(sort(basket, by = "lift")[1:5])
## lhs rhs support confidence lift count
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886 89
## [2] {tropical fruit,
## other vegetables} => {pip fruit} 0.009456024 0.2634561 3.482649 93
## [3] {pip fruit,
## other vegetables} => {tropical fruit} 0.009456024 0.3618677 3.448613 93
## [4] {citrus fruit,
## other vegetables} => {root vegetables} 0.010371124 0.3591549 3.295045 102
## [5] {tropical fruit,
## other vegetables} => {root vegetables} 0.012302999 0.3427762 3.144780 121
Sometimes the marketing team needs to promote a specific product. Say they want to promote berries and want to find out how often, and with which items, berries are purchased. The subset function enables one to find subsets of transactions, items or rules. The %in% operator matches item labels exactly and selects every rule that contains any of the given items.
berries <- subset(basket, items %in% "berries")
inspect(berries)
##     lhs          rhs                  support     confidence lift     count
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886  89
## [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848 104
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280 101
## [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328 116
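The same subset() call can be narrowed further. As a sketch, the example below restricts the match to one side of the rule with `lhs` or `rhs`, combines it with a condition on a quality measure, and uses arules' `%pin%` operator for partial matching of item labels. The object names `berries_strong` and `fruit_rhs` are illustrative, and the mining call repeats the parameters used earlier (with verbose output switched off).

```r
library(arules)
data("Groceries")

# Same thresholds as the apriori() call earlier in the post
basket <- apriori(Groceries,
                  parameter = list(support = 0.009, confidence = 0.25, minlen = 2),
                  control = list(verbose = FALSE))

# Rules with berries on the left-hand side AND a lift above 2
berries_strong <- subset(basket, lhs %in% "berries" & lift > 2)
inspect(berries_strong)

# %pin% matches partially: any right-hand side whose label contains "fruit"
fruit_rhs <- subset(basket, rhs %pin% "fruit")
inspect(head(sort(fruit_rhs, by = "lift"), 3))
```

With these thresholds the first subset should keep only the whipped/sour cream and yogurt rules from the berries table.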
plot(basket)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(basket, measure=c("support", "lift"), shading="confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(basket, shading="order", control=list(main = "Two-key plot"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(basket, measure=c("support", "lift"), shading="confidence", interactive=TRUE)
plot(basket, method="grouped")
# type = "items" is dropped: this arulesViz version does not list it among the accepted control parameters
plot(basket, method="graph")
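Graph plots get crowded quickly, and the graph method caps the number of rules it draws (the control parameters above show max = 100, fewer than the 224 rules mined). One option, sketched here under that assumption, is to pick a small set of the strongest rules first. The name `top_rules` and the cut-off of 20 are arbitrary choices for illustration.

```r
library(arules)
data("Groceries")

# Same thresholds as the apriori() call earlier in the post
basket <- apriori(Groceries,
                  parameter = list(support = 0.009, confidence = 0.25, minlen = 2),
                  control = list(verbose = FALSE))

# Keep the 20 rules with the highest lift (sort() orders in decreasing order by default)
top_rules <- sort(basket, by = "lift")[1:20]

# Quick look at the strongest three before plotting
inspect(top_rules[1:3])
```

The reduced set can then be passed to the plotting calls in place of basket, for example plot(top_rules, method="graph") with arulesViz loaded.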