library("arules")
library("arulesViz")
library("plotly")
shop <- read.csv("groceries_data.csv")
nrow(shop)
## [1] 9835
ncol(shop)
## [1] 32
nrow(shop) returns the total number of transactions (one row per customer basket), while ncol(shop) returns the number of item columns, which here matches the largest number of items bought in a single basket. Next, I read the data as transactions using the arules package.
trans <- read.transactions("groceries_data.csv", format = "basket", sep = ",", header = TRUE)
trans
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
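To make the basket format more concrete, here is a minimal sketch that builds a transactions object from a small, made-up list of baskets instead of a CSV file. The baskets and the `toy` object are invented for illustration only and are not taken from groceries_data.csv.

```r
library("arules")

# Three hypothetical baskets (illustrative only)
baskets <- list(
  c("whole milk", "rolls/buns"),
  c("whole milk", "yogurt", "rolls/buns"),
  c("soda")
)

# Coerce the list to a sparse transactions object:
# one row per basket, one column per distinct item
toy <- as(baskets, "transactions")

dim(toy)                               # transactions, distinct items
itemFrequency(toy, type = "absolute")  # number of baskets containing each item
```

Each basket becomes one row and each distinct item one column, which is the same sparse layout reported for `trans` above.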
A summary of the transactions can be obtained with the summary function.
summary(trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
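The density figure in the summary can be tied back to the other numbers with a little arithmetic. The sketch below uses only values printed in the summary output; the variable names are my own.

```r
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146  # as printed by summary(trans)

# Density is the share of filled cells in the transaction-by-item matrix,
# so multiplying back gives the total number of purchased items
total_items <- round(density * n_transactions * n_items)
total_items                                  # 43367

# The same total follows from the "most frequent items" counts
sum(c(2513, 1903, 1809, 1715, 1372, 34055))  # 43367

# Average basket size; compare with the Mean of 4.409 in the summary
round(total_items / n_transactions, 3)       # 4.409
```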
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
To get an idea of the least frequent items, I sort the item frequencies in decreasing order and look at the tail of the result.
tail(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=10)
##        salad dressing                whisky        toilet cleaner
##                     8                     8                     7
##        baby cosmetics        frozen chicken                  bags
##                     6                     6                     4
##       kitchen utensil preservation products             baby food
##                     4                     2                     1
##  sound storage medium
##                     1
Association rule analysis is a technique for uncovering how items are associated with each other. There are three common measures of association: support, confidence and lift. They are used as constraints to select the best rules from the set of all possible rules. In this paper, I set the support threshold to 0.01, which means that the items in a rule must appear together in at least 1% of the 9835 transactions (roughly 100 of them), and the confidence threshold to 0.50, which means that the right-hand side must appear in at least 50% of the transactions that contain the left-hand side. The rules were generated as shown below.
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.50))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
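As a quick check on what the support threshold means in absolute terms, the sketch below converts it into a number of transactions. The variable names are my own. The value of 98 in the log above appears to be the same product truncated to an integer.

```r
n_transactions <- 9835
min_support    <- 0.01

# Support threshold expressed as a number of transactions
min_count <- min_support * n_transactions
min_count           # 98.35

# Smallest whole number of transactions that satisfies the threshold
ceiling(min_count)  # 99
```

A rule therefore needs its items to appear together in at least 99 baskets, which agrees with the smallest count among the 15 rules found.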
\[ \text{Support}(A \Rightarrow B) = \frac{\text{Number of transactions with both } A \text{ and } B}{\text{Total number of transactions}} \]
library("DT")
support_rules <- sort(rules, by = "support", decreasing = TRUE)
support_table <- inspect(support_rules)
##      lhs                                       rhs                support
## [1]  {other vegetables, yogurt}             => {whole milk}       0.02226741
## [2]  {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [3]  {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
## [4]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [5]  {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [6]  {root vegetables, yogurt}              => {other vegetables} 0.01291307
## [7]  {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [8]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [9]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [10] {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [11] {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [12] {butter, other vegetables}             => {whole milk}       0.01148958
## [13] {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [14] {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [15] {curd, yogurt}                         => {whole milk}       0.01006609
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219
## [2]  0.5173611  0.02928317 2.024770 149
## [3]  0.5070423  0.02887646 1.984385 144
## [4]  0.5629921  0.02582613 2.203354 143
## [5]  0.5175097  0.02613116 2.025351 133
## [6]  0.5000000  0.02582613 2.584078 127
## [7]  0.5230126  0.02430097 2.046888 125
## [8]  0.5525114  0.02226741 2.162336 121
## [9]  0.5845411  0.02104728 3.020999 121
## [10] 0.5020921  0.02430097 2.594890 120
## [11] 0.5700483  0.02104728 2.230969 118
## [12] 0.5736041  0.02003050 2.244885 113
## [13] 0.5245098  0.02074225 2.052747 107
## [14] 0.5862069  0.01769192 3.029608 102
## [15] 0.5823529  0.01728521 2.279125 99
datatable(support_table)
The rule with the highest support was {other vegetables, yogurt} => {whole milk}, which appeared 219 times, about 2.2% of all transactions. The rule with the lowest support was {curd, yogurt} => {whole milk}, which occurred only 99 times, about 1% of all transactions.
\[ \text{Confidence}(A \Rightarrow B) = \frac{\text{Number of transactions with both } A \text{ and } B}{\text{Number of transactions with } A} \]
confidence_rules <- sort(rules, by = "confidence", decreasing = TRUE)
confidence_table <- inspect(confidence_rules)
##      lhs                                       rhs                support
## [1]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [3]  {curd, yogurt}                         => {whole milk}       0.01006609
## [4]  {butter, other vegetables}             => {whole milk}       0.01148958
## [5]  {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [6]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [7]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [8]  {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [9]  {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [10] {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [11] {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [12] {other vegetables, yogurt}             => {whole milk}       0.02226741
## [13] {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
## [14] {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [15] {root vegetables, yogurt}              => {other vegetables} 0.01291307
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102
## [2]  0.5845411  0.02104728 3.020999 121
## [3]  0.5823529  0.01728521 2.279125 99
## [4]  0.5736041  0.02003050 2.244885 113
## [5]  0.5700483  0.02104728 2.230969 118
## [6]  0.5629921  0.02582613 2.203354 143
## [7]  0.5525114  0.02226741 2.162336 121
## [8]  0.5245098  0.02074225 2.052747 107
## [9]  0.5230126  0.02430097 2.046888 125
## [10] 0.5175097  0.02613116 2.025351 133
## [11] 0.5173611  0.02928317 2.024770 149
## [12] 0.5128806  0.04341637 2.007235 219
## [13] 0.5070423  0.02887646 1.984385 144
## [14] 0.5020921  0.02430097 2.594890 120
## [15] 0.5000000  0.02582613 2.584078 127
datatable(confidence_table)
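If a plain data frame of the rules is preferred as input for a table widget, one option is to coerce the rules object directly rather than capturing what inspect() prints. The sketch below is only an illustration under one assumption: it uses the Groceries transactions bundled with arules as a stand-in for groceries_data.csv, and the name `confidence_df` is my own.

```r
library("arules")

# Stand-in data (assumption): the Groceries transactions shipped with arules
data("Groceries", package = "arules")

rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.50),
  control   = list(verbose = FALSE)
)

# Coerce the sorted rules to a data frame: one row per rule, with the
# rule text and its quality measures as columns
confidence_df <- as(sort(rules, by = "confidence"), "data.frame")
head(confidence_df, n = 3)

# confidence_df could then be passed to DT::datatable()
```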
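\[ \text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Expected confidence}(B)} \]
\[ \text{Expected confidence}(B) = \frac{\text{Number of transactions with } B}{\text{Total number of transactions}} \]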
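The three measures can be checked by hand for the rule {other vegetables, yogurt} => {whole milk}. The sketch below uses counts printed earlier in the output, with one assumption: the number of transactions containing the left-hand side (427) is not printed directly, so I derive it from the coverage column as 0.04341637 × 9835.

```r
n_total <- 9835  # total transactions
n_both  <- 219   # count for {other vegetables, yogurt} => {whole milk}
n_lhs   <- 427   # assumed: round(0.04341637 * 9835), from the coverage column
n_rhs   <- 2513  # whole milk, from the most frequent items in summary(trans)

support    <- n_both / n_total
confidence <- n_both / n_lhs
expected   <- n_rhs / n_total      # expected confidence
lift       <- confidence / expected

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The results should agree with the values reported for this rule (0.02226741, 0.5128806 and 2.007235) up to rounding. A lift above 1 means whole milk appears with this pair more often than its overall frequency alone would suggest.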
lift_rules <- sort(rules, by = "lift", decreasing = TRUE)
lift_table <- inspect(lift_rules)
##      lhs                                       rhs                support
## [1]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}              => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                         => {whole milk}       0.01006609
## [6]  {butter, other vegetables}             => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [11] {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [12] {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [13] {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [14] {other vegetables, yogurt}             => {whole milk}       0.02226741
## [15] {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102
## [2]  0.5845411  0.02104728 3.020999 121
## [3]  0.5020921  0.02430097 2.594890 120
## [4]  0.5000000  0.02582613 2.584078 127
## [5]  0.5823529  0.01728521 2.279125 99
## [6]  0.5736041  0.02003050 2.244885 113
## [7]  0.5700483  0.02104728 2.230969 118
## [8]  0.5629921  0.02582613 2.203354 143
## [9]  0.5525114  0.02226741 2.162336 121
## [10] 0.5245098  0.02074225 2.052747 107
## [11] 0.5230126  0.02430097 2.046888 125
## [12] 0.5175097  0.02613116 2.025351 133
## [13] 0.5173611  0.02928317 2.024770 149
## [14] 0.5128806  0.04341637 2.007235 219
## [15] 0.5070423  0.02887646 1.984385 144
datatable(lift_table)
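Rather than scanning a sorted table, the rules can also be filtered on a quality measure or on the items they contain. The following is a rough sketch with two assumptions: the Groceries data bundled with arules stands in for groceries_data.csv, and the lift cut-off of 2.5 is an arbitrary value chosen for illustration.

```r
library("arules")

# Stand-in data (assumption): the Groceries transactions shipped with arules
data("Groceries", package = "arules")

rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.50),
  control   = list(verbose = FALSE)
)

# Keep only rules whose lift exceeds a hypothetical cut-off of 2.5
strong_rules <- subset(rules, subset = lift > 2.5)

# Keep only rules that predict whole milk on the right-hand side
milk_rules <- subset(rules, subset = rhs %in% "whole milk")

inspect(sort(strong_rules, by = "lift"))
```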
The rules can also be explored interactively with the plotly package. Hovering the mouse over a plotted point shows its support, confidence and lift values.
plot(rules, engine = "plotly")
yogurt_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.001, conf = 0.9),
  appearance = list(default = "lhs", rhs = "yogurt"),
  control = list(verbose = FALSE)
)
yogurt_rules_table <- inspect(yogurt_rules, linebreak = FALSE)
##     lhs                                                           rhs
## [1] {butter, cream cheese, root vegetables}                    => {yogurt}
## [2] {butter, sliced cheese, tropical fruit, whole milk}        => {yogurt}
## [3] {cream cheese, curd, other vegetables, whipped/sour cream} => {yogurt}
## [4] {butter, other vegetables, tropical fruit, white bread}    => {yogurt}
##     support     confidence coverage    lift     count
## [1] 0.001016777 0.9090909  0.001118454 6.516698 10
## [2] 0.001016777 0.9090909  0.001118454 6.516698 10
## [3] 0.001016777 0.9090909  0.001118454 6.516698 10
## [4] 0.001016777 0.9090909  0.001118454 6.516698 10
datatable(yogurt_rules_table)
plot(yogurt_rules, method="graph")
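The same appearance argument can be turned around to ask what tends to be bought together with yogurt, by fixing yogurt on the left-hand side instead. This is only a sketch: it uses the Groceries data bundled with arules as a stand-in for groceries_data.csv, the thresholds are arbitrary choices of mine, and `minlen = 2` is set so that rules with an empty left-hand side are left out.

```r
library("arules")

# Stand-in data (assumption): the Groceries transactions shipped with arules
data("Groceries", package = "arules")

after_yogurt <- apriori(
  data = Groceries,
  parameter = list(supp = 0.01, conf = 0.20, minlen = 2),
  appearance = list(lhs = "yogurt", default = "rhs"),
  control = list(verbose = FALSE)
)

inspect(sort(after_yogurt, by = "confidence"))
```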