Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.
Loading Libraries and Data
library(arules)
library(arulesViz)
Grocery <- read.transactions("GroceryDataSet.csv", sep = ",")
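To illustrate the file layout that read.transactions expects here, the following minimal sketch writes three hypothetical receipts to a temporary file and reads them back in basket format (one receipt per line, items separated by commas). The file contents and the name `toy` are assumptions for illustration, not part of the assignment data.

```r
library(arules)

# Three hypothetical receipts, one per line (illustrative only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("citrus fruit,margarine,ready soups",
             "coffee,tropical fruit,yogurt",
             "whole milk"), tmp)

# format = "basket": every line is one transaction
toy <- read.transactions(tmp, format = "basket", sep = ",")

size(toy)        # number of items on each receipt
itemLabels(toy)  # distinct items found across all receipts
```

Each distinct item becomes one column of the sparse matrix, which is why the real file yields far more columns than any single receipt has items.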
Data Summary and Inspection
Grocery
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
summary(Grocery)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
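The density in the summary is the share of non-empty cells in the 9835 x 169 matrix. As a quick sanity check, the base R arithmetic below uses only numbers printed in the summary output; the variable names are chosen here for illustration.

```r
n_trans <- 9835        # rows (transactions)
n_items <- 169         # columns (items)
density <- 0.02609146  # share of non-empty cells

# Total number of items purchased across all receipts
total_items <- round(n_trans * n_items * density)
total_items

# The "most frequent items" counts, including (Other), should add up to the same total
top_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
sum(top_counts)

# Average receipt length, to compare with the Mean in the size summary
total_items / n_trans
```

The three figures agree with each other, which suggests the file was parsed as intended.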
inspect(Grocery[1:6])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
## [6] {abrasive cleaner,
## butter,
## rice,
## whole milk,
## yogurt}
Item Frequency and Plot
itemFrequency(Grocery[,1:10])
## abrasive cleaner artif. sweetener   baby cosmetics        baby food
##     0.0035587189     0.0032536858     0.0006100661     0.0001016777
##             bags    baking powder bathroom cleaner             beef
##     0.0004067107     0.0176919166     0.0027452974     0.0524656838
##          berries        beverages
##     0.0332486019     0.0260294865
itemFrequencyPlot(Grocery, topN=10)
Generating Association Rules
The minimum support and confidence are lowered from the apriori defaults of 0.1 and 0.8, respectively, to 0.007 and 0.25 so that more than a handful of rules are generated.
basket.model <- apriori(Grocery, parameter = list(support=0.007, confidence=0.25, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.007 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 68
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [104 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [363 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
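To make the support threshold concrete, this short base R sketch converts the relative minimum support into a number of receipts. The variable names are illustrative; the reading of the logged "Absolute minimum support count: 68" as a truncated value is an assumption based on the arithmetic, not something the log states.

```r
n_trans     <- 9835
min_support <- 0.007

# Threshold expressed as a number of receipts (not a whole number)
min_support * n_trans

# Truncated value, which appears to be what the apriori log prints
floor(min_support * n_trans)

# Smallest whole count that actually clears the threshold
ceiling(min_support * n_trans)
```

An itemset therefore has to appear on at least 69 receipts to be kept, which is consistent with the smallest count among the rules shown in this report.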
summary(basket.model)
## set of 363 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 137 214 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.656 3.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.007016 Min. :0.2500 Min. :0.9932 Min. : 69.0
## 1st Qu.:0.008134 1st Qu.:0.2962 1st Qu.:1.6060 1st Qu.: 80.0
## Median :0.009659 Median :0.3551 Median :1.9086 Median : 95.0
## Mean :0.012945 Mean :0.3743 Mean :2.0072 Mean :127.3
## 3rd Qu.:0.013777 3rd Qu.:0.4420 3rd Qu.:2.3289 3rd Qu.:135.5
## Max. :0.074835 Max. :0.6389 Max. :3.9565 Max. :736.0
##
## mining info:
## data ntransactions support confidence
## Grocery 9835 0.007 0.25
inspect(basket.model[1:6])
##     lhs                      rhs                support     confidence lift     count
## [1] {herbs}               => {root vegetables}  0.007015760 0.4312500  3.956477    69
## [2] {herbs}               => {other vegetables} 0.007727504 0.4750000  2.454874    76
## [3] {herbs}               => {whole milk}       0.007727504 0.4750000  1.858983    76
## [4] {processed cheese}    => {whole milk}       0.007015760 0.4233129  1.656698    69
## [5] {semi-finished bread} => {whole milk}       0.007117438 0.4022989  1.574457    70
## [6] {detergent}           => {whole milk}       0.008947636 0.4656085  1.822228    88
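The top 10 rules are ranked by lift. Lift compares how often the LHS and RHS items occur together with how often they would occur together if they were independent: a lift above 1 means they co-occur more often than chance predicts, and the higher the lift, the stronger the association.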
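The three measures can be computed by hand, which helps when reading the rule tables. The base R sketch below does so on five hypothetical receipts (made-up data, not taken from the Groceries file); the helpers `has` and `supp` are names introduced here for illustration.

```r
# Five hypothetical receipts (illustrative data only)
baskets <- list(
  c("herbs", "root vegetables", "whole milk"),
  c("herbs", "root vegetables"),
  c("whole milk", "yogurt"),
  c("root vegetables", "whole milk"),
  c("herbs", "whole milk")
)
n <- length(baskets)

# TRUE for every receipt that contains all of the given items
has  <- function(items) vapply(baskets, function(b) all(items %in% b), logical(1))
# Support: share of receipts containing all of the given items
supp <- function(items) sum(has(items)) / n

lhs <- "herbs"
rhs <- "root vegetables"

support    <- supp(c(lhs, rhs))       # share of receipts with both LHS and RHS
confidence <- support / supp(lhs)     # share of LHS receipts that also hold RHS
lift       <- confidence / supp(rhs)  # confidence relative to the RHS base rate

c(support = support, confidence = confidence, lift = lift)
```

The same relations can be used to read the printed rules: since confidence is the rule count divided by the LHS count, round(69 / 0.4312500) recovers the number of receipts containing herbs for the rule {herbs} => {root vegetables}.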
Top 10 Association Rules by Lift
inspect(sort(basket.model, by="lift")[1:10])
##      lhs                   rhs                  support     confidence lift     count
## [1]  {herbs}            => {root vegetables}    0.007015760 0.4312500  3.956477    69
## [2]  {berries}          => {whipped/sour cream} 0.009049314 0.2721713  3.796886    89
## [3]  {other vegetables,
##       tropical fruit,
##       whole milk}       => {root vegetables}    0.007015760 0.4107143  3.768074    69
## [4]  {beef,
##       other vegetables} => {root vegetables}    0.007930859 0.4020619  3.688692    78
## [5]  {other vegetables,
##       tropical fruit}   => {pip fruit}          0.009456024 0.2634561  3.482649    93
## [6]  {beef,
##       whole milk}       => {root vegetables}    0.008032537 0.3779904  3.467851    79
## [7]  {other vegetables,
##       pip fruit}        => {tropical fruit}     0.009456024 0.3618677  3.448613    93
## [8]  {citrus fruit,
##       other vegetables} => {root vegetables}    0.010371124 0.3591549  3.295045   102
## [9]  {other vegetables,
##       whole milk,
##       yogurt}           => {tropical fruit}     0.007625826 0.3424658  3.263712    75
## [10] {other vegetables,
##       whole milk,
##       yogurt}           => {root vegetables}    0.007829181 0.3515982  3.225716    77
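Several of the rules above share the same right-hand side, so it can be useful to filter the rule set rather than only sort it. The sketch below is one possible way to do that with subset. To stay self-contained it uses the `Groceries` data bundled with arules, which is assumed (not verified here) to match GroceryDataSet.csv; the object names `rules`, `strong` and `root_veg` are illustrative.

```r
library(arules)

# Bundled data set; assumed to correspond to GroceryDataSet.csv
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.007, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Keep only rules whose lift exceeds 3
strong <- subset(rules, subset = lift > 3)

# Keep only rules that predict root vegetables
root_veg <- subset(rules, subset = rhs %in% "root vegetables")

# Show the five strongest of those by lift
inspect(head(sort(root_veg, by = "lift"), n = 5))
```

Filtering on the right-hand side answers a more targeted question ("what leads to root vegetables?") than a global top 10 does.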
Graph of Top 10 Rules by Lift
top10Rules <- head(basket.model, n = 10, by = "lift")
plot(top10Rules, method = "graph")
Conclusion
Most of the top rules combine fruits and vegetables. This is not surprising, as these items are conveniently located in the same section of grocery stores. It is also not surprising that whole milk comes up often in the list, since on its own it is the most frequently bought item in the pool of transactions. What may be a surprise is the rule about berries and whipped/sour cream: they are not physically near each other in the store, but it makes sense to market them together (e.g. when one is on sale, the other may be on sale, too).