Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
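To make the three measures concrete, here is a minimal base R sketch on five made-up baskets. The baskets list and the support() helper are hypothetical and for illustration only; they are not part of the assignment data. For a rule A => B, support is the share of transactions containing both A and B, confidence is support(A and B) / support(A), and lift is confidence / support(B).

```r
# Five hypothetical receipts (illustrative only, not from the Groceries data)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "eggs")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {bread} => {milk}
supp_rule <- support(c("bread", "milk"), baskets)   # both sides together
conf_rule <- supp_rule / support("bread", baskets)  # share of bread baskets with milk
lift_rule <- conf_rule / support("milk", baskets)   # confidence relative to the milk baseline

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy data the lift comes out slightly below 1, meaning bread buyers are a little less likely than average to buy milk. A lift above 1 indicates a positive association, which is what the rules mined below are ranked on.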
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
grocerydata <- read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocerydata)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
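The summary figures can be cross-checked with a little arithmetic. The sketch below uses only numbers printed in the summary (the variable names are ours): the item counts sum to the total number of item occurrences, and dividing that total by the matrix size or by the number of transactions should reproduce the reported density and mean basket size.

```r
n_transactions <- 9835
n_items        <- 169

# Counts from the "most frequent items" block, including the (Other) bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_items <- sum(item_counts)  # item occurrences across all receipts

# Fraction of non-empty cells in the 9835 x 169 transaction matrix
density <- total_items / (n_transactions * n_items)

# Average number of items per receipt
mean_size <- total_items / n_transactions

print(total_items)
print(round(density, 8))
print(round(mean_size, 3))
```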
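We can visualize the most frequent items with an item frequency plot. The summary above shows that the top items are whole milk, other vegetables, rolls/buns, soda and yogurt. Note that subsetting the transactions to their first 30 columns (grocerydata[, 1:30]) before plotting limits the chart to the first 30 items in alphabetical order, which appears to be why bottled water, citrus fruit and beer came out on top in an earlier version of this chart; the call below plots the top 30 across all 169 items. It would be interesting to see the top grocery items during Covid.
itemFrequencyPlot(grocerydata, topN = 30, main = "Top 30 Frequent Items")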
Create Rules
We created rules using a minimum support of 0.005 and a minimum confidence of 0.6, limiting each rule to at most 3 items (maxlen = 3). This returned 13 rules.
rules <- apriori(grocerydata, parameter = list(supp = 0.005, conf = 0.6, maxlen=3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 3 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.04s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3
## Warning in apriori(grocerydata, parameter = list(supp = 0.005, conf = 0.6, :
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!
## done [0.01s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
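The line "Absolute minimum support count: 49" follows from the relative threshold: 0.005 of 9835 transactions is 49.175, which is reported as 49. A quick check in base R (variable names are ours; the truncation step is our reading of the printed value, not a statement about arules internals):

```r
min_support    <- 0.005
n_transactions <- 9835

raw_count <- min_support * n_transactions  # about 49.175
print(raw_count)
print(floor(raw_count))                    # truncated value, as shown in the log

# By the definition of support (count / n >= 0.005), an itemset needs
# to appear in at least this many receipts to clear the threshold
print(ceiling(raw_count))
```

This is consistent with the rule summary that follows, where the smallest rule count is 51.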
summary(rules)
## set of 13 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence lift count
## Min. :0.005186 Min. :0.6022 Min. :2.357 Min. :51.00
## 1st Qu.:0.005592 1st Qu.:0.6071 1st Qu.:2.434 1st Qu.:55.00
## Median :0.005999 Median :0.6224 Median :2.480 Median :59.00
## Mean :0.006398 Mean :0.6249 Mean :2.562 Mean :62.92
## 3rd Qu.:0.006711 3rd Qu.:0.6378 3rd Qu.:2.537 3rd Qu.:66.00
## Max. :0.009354 Max. :0.6600 Max. :3.124 Max. :92.00
##
## mining info:
## data ntransactions support confidence
## grocerydata 9835 0.005 0.6
inspect(rules)
## lhs rhs support
## [1] {onions,root vegetables} => {other vegetables} 0.005693950
## [2] {curd,tropical fruit} => {whole milk} 0.006507372
## [3] {domestic eggs,margarine} => {whole milk} 0.005185562
## [4] {butter,domestic eggs} => {whole milk} 0.005998983
## [5] {butter,whipped/sour cream} => {whole milk} 0.006710727
## [6] {bottled water,butter} => {whole milk} 0.005388917
## [7] {butter,tropical fruit} => {whole milk} 0.006202339
## [8] {butter,root vegetables} => {whole milk} 0.008235892
## [9] {butter,yogurt} => {whole milk} 0.009354347
## [10] {domestic eggs,pip fruit} => {whole milk} 0.005388917
## [11] {domestic eggs,tropical fruit} => {whole milk} 0.006914082
## [12] {pip fruit,whipped/sour cream} => {other vegetables} 0.005592272
## [13] {pip fruit,whipped/sour cream} => {whole milk} 0.005998983
## confidence lift count
## [1] 0.6021505 3.112008 56
## [2] 0.6336634 2.479936 64
## [3] 0.6219512 2.434099 51
## [4] 0.6210526 2.430582 59
## [5] 0.6600000 2.583008 66
## [6] 0.6022727 2.357084 53
## [7] 0.6224490 2.436047 61
## [8] 0.6377953 2.496107 81
## [9] 0.6388889 2.500387 92
## [10] 0.6235294 2.440275 53
## [11] 0.6071429 2.376144 68
## [12] 0.6043956 3.123610 55
## [13] 0.6483516 2.537421 59
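As a sanity check, the quality measures for any rule can be reproduced from counts already shown. The sketch below does this in base R for rule [9], {butter,yogurt} => {whole milk}. The variable names are ours, and the left-hand-side count of 144 is not printed anywhere in the output; it is inferred from count / confidence (92 / 0.6388889) and should be treated as an assumption.

```r
n_transactions <- 9835
count_rule     <- 92    # receipts with butter, yogurt and whole milk (rule [9])
count_rhs      <- 2513  # receipts with whole milk (from summary(grocerydata))
count_lhs      <- 144   # assumed: receipts with butter and yogurt, inferred as 92 / 0.6388889

support    <- count_rule / n_transactions                # share of all receipts
confidence <- count_rule / count_lhs                     # share of lhs receipts with rhs
lift       <- confidence / (count_rhs / n_transactions)  # confidence vs. rhs baseline

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

A lift of about 2.5 means that receipts containing butter and yogurt include whole milk roughly 2.5 times as often as receipts overall (about 63.9% versus 25.6%).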
library("arulesViz")
## Loading required package: grid
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
library("grid")
plot(rules, method='two-key plot')
top5 <- head(rules, n=5, by ="lift")
plot(top5, method = "graph", engine="htmlwidget")
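When we inspect our rules, we see that the majority of them are associated with milk, butter and vegetables: 11 of the 13 rules have whole milk on the right-hand side, and the remaining two predict other vegetables.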
Top 10 Rules by Lift
inspect(head(rules, n = 10, by = "lift"))
## lhs rhs support confidence
## [1] {pip fruit,whipped/sour cream} => {other vegetables} 0.005592272 0.6043956
## [2] {onions,root vegetables} => {other vegetables} 0.005693950 0.6021505
## [3] {butter,whipped/sour cream} => {whole milk} 0.006710727 0.6600000
## [4] {pip fruit,whipped/sour cream} => {whole milk} 0.005998983 0.6483516
## [5] {butter,yogurt} => {whole milk} 0.009354347 0.6388889
## [6] {butter,root vegetables} => {whole milk} 0.008235892 0.6377953
## [7] {curd,tropical fruit} => {whole milk} 0.006507372 0.6336634
## [8] {domestic eggs,pip fruit} => {whole milk} 0.005388917 0.6235294
## [9] {butter,tropical fruit} => {whole milk} 0.006202339 0.6224490
## [10] {domestic eggs,margarine} => {whole milk} 0.005185562 0.6219512
## lift count
## [1] 3.123610 55
## [2] 3.112008 56
## [3] 2.583008 66
## [4] 2.537421 59
## [5] 2.500387 92
## [6] 2.496107 81
## [7] 2.479936 64
## [8] 2.440275 53
## [9] 2.436047 61
## [10] 2.434099 51
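Because inspect() prints rules in the order they are stored, it is worth confirming the lift ranking independently. The sketch below ranks the 13 lift values printed by inspect(rules) in base R; the vector is transcribed from that output and the variable names are ours.

```r
# Lift values for rules [1]..[13], transcribed from the inspect(rules) output
lift <- c(3.112008, 2.479936, 2.434099, 2.430582, 2.583008, 2.357084,
          2.436047, 2.496107, 2.500387, 2.440275, 2.376144, 3.123610,
          2.537421)

# Original indices of the ten rules with the highest lift, best first
top10 <- order(lift, decreasing = TRUE)[1:10]
print(top10)
print(lift[top10])
```

The two rules predicting other vegetables (original rules [12] and [1]) lead the ranking, and original rules [4], [11] and [6] fall outside the top 10. With the arules objects in memory, inspect(head(rules, n = 10, by = "lift")) should give the same ordering.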