Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row holds one item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, along with your top 10 rules by lift.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
#install.packages("Matrix")
# Load the package
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
transaction <- read.transactions('C://Users//Bikash_Bhowmik//Downloads//GroceryDataSet.csv', sep = ',', header = FALSE)
head(transaction)
## transactions in sparse format with
## 6 transactions (rows) and
## 169 items (columns)
str(transaction)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 1 variable:
## .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
Look at a summary of the data.
summary(transaction)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
I can see that whole milk, other vegetables, rolls/buns, soda and yogurt are the most purchased items.
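As a sanity check, the headline figures in the summary can be reproduced from counts that already appear in the output. The sketch below uses base R only. The variable names are my own, and the non-zero cell count (43367) is taken from the length of the `i` slot in the `str()` output, on the assumption that it counts one entry per item per receipt.

```r
# Figures copied from the str() and summary() output above
n_transactions <- 9835    # rows (receipts)
n_items        <- 169     # columns (distinct items)
n_nonzero      <- 43367   # length of the @i slot (assumed: one entry per item per receipt)

# Density: share of cells in the 9835 x 169 matrix that are filled
density <- n_nonzero / (n_transactions * n_items)

# Mean basket size: items per transaction
mean_size <- n_nonzero / n_transactions

# Compare with the reported density (0.02609146) and mean size (4.409)
print(density)
print(mean_size)
```

A density of about 2.6% means a typical receipt touches only a handful of the 169 items, which is why the data is stored in sparse format.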
Looking at the top 25 items, first by relative frequency and then by absolute count:
itemFrequencyPlot(transaction, topN = 25)
itemFrequencyPlot(transaction, topN = 25, type="absolute")
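The two plots show the same ranking on different scales: relative frequency is the share of transactions that contain an item, while the absolute version is the raw count. A minimal sketch of the conversion for the five most frequent items, using counts copied from the summary output (the object and column names are illustrative):

```r
n_transactions <- 9835

# Counts of the five most frequent items, from summary(transaction)
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative frequency = count / number of transactions
freq <- data.frame(
  absolute = top_counts,
  relative = round(top_counts / n_transactions, 4)
)
print(freq)
```

The relative scale is the one that matters when choosing a support threshold, since `apriori` takes support as a proportion.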
I will use the apriori function to develop association rules, which describe relationships between items in the transactional data, in order to identify patterns and associations among the grocery items.
The relative frequencies above range from near 0 to a little over 0.25 (whole milk appears in 2513 of 9835 transactions). I want to capture enough items to find associations, so I will set the minimum support to 0.01. (I also experimented with 0.2, which returned no rules.)
association <- apriori(transaction, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
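The log line "Absolute minimum support count: 98" follows from the chosen threshold: 0.01 * 9835 = 98.35, which the log shows as 98. The sketch below works through that arithmetic in base R and also suggests why a support of 0.2 left nothing to mine. Counts are copied from the summary output, the variable names are my own, and the explanation for the 0.2 run assumes the same confidence of 0.5 was used.

```r
n_transactions <- 9835
min_support    <- 0.01

# Threshold expressed as a number of transactions
raw_threshold <- min_support * n_transactions   # about 98.35

# Smallest whole count whose relative support reaches 0.01
smallest_qualifying <- ceiling(raw_threshold)
print(smallest_qualifying / n_transactions)

# Which of the five most frequent items would survive support = 0.2?
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
frequent_at_0.2 <- names(top_counts)[top_counts / n_transactions >= 0.2]
print(frequent_at_0.2)
```

Only whole milk clears 0.2, and every other item is rarer than yogurt, so no itemset of two or more items can reach that support. Whole milk's own frequency (about 0.26) is also below the 0.5 confidence cutoff, which would leave no rules at all. At 0.01, the smallest qualifying count is 99, which matches the minimum count among the top 10 rules.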
rules <- head(sort(association, by = "lift"), 10)
summary(rules)
## set of 10 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 10
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.5000 Min. :0.01729 Min. :2.053
## 1st Qu.:0.01103 1st Qu.:0.5315 1st Qu.:0.02021 1st Qu.:2.210
## Median :0.01210 Median :0.5665 Median :0.02105 Median :2.262
## Mean :0.01191 Mean :0.5539 Mean :0.02161 Mean :2.440
## 3rd Qu.:0.01230 3rd Qu.:0.5802 3rd Qu.:0.02379 3rd Qu.:2.592
## Max. :0.01454 Max. :0.5862 Max. :0.02583 Max. :3.030
## count
## Min. : 99.0
## 1st Qu.:108.5
## Median :119.0
## Mean :117.1
## 3rd Qu.:121.0
## Max. :143.0
##
## mining info:
## data ntransactions support confidence
## transaction 9835 0.01 0.5
## call
## apriori(data = transaction, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)
## lhs rhs support
## [1] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
## [2] {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3] {rolls/buns, root vegetables} => {other vegetables} 0.01220132
## [4] {root vegetables, yogurt} => {other vegetables} 0.01291307
## [5] {curd, yogurt} => {whole milk} 0.01006609
## [6] {butter, other vegetables} => {whole milk} 0.01148958
## [7] {root vegetables, tropical fruit} => {whole milk} 0.01199797
## [8] {root vegetables, yogurt} => {whole milk} 0.01453991
## [9] {domestic eggs, other vegetables} => {whole milk} 0.01230300
## [10] {whipped/sour cream, yogurt} => {whole milk} 0.01087951
## confidence coverage lift count
## [1] 0.5862069 0.01769192 3.029608 102
## [2] 0.5845411 0.02104728 3.020999 121
## [3] 0.5020921 0.02430097 2.594890 120
## [4] 0.5000000 0.02582613 2.584078 127
## [5] 0.5823529 0.01728521 2.279125 99
## [6] 0.5736041 0.02003050 2.244885 113
## [7] 0.5700483 0.02104728 2.230969 118
## [8] 0.5629921 0.02582613 2.203354 143
## [9] 0.5525114 0.02226741 2.162336 121
## [10] 0.5245098 0.02074225 2.052747 107
The top 10 rules by lift are listed above. The strongest is {citrus fruit, root vegetables} => {other vegetables}: it has both the highest lift (3.03) and the highest confidence (0.586), meaning that about 59% of baskets containing citrus fruit and root vegetables also contain other vegetables, roughly three times the rate across all baskets. Among the rules that lead to whole milk, {curd, yogurt} has the strongest association (lift 2.28), while {root vegetables, yogurt} => {whole milk} has the highest support of the ten: those three items appear together in 143 transactions.
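To make the three measures concrete, they can be recomputed by hand from the numbers in the output. Support is the rule's count divided by the number of transactions, confidence is support divided by coverage (the support of the left-hand side), and lift is confidence divided by the support of the right-hand side. The following is a sketch in base R under those definitions. The data frame and its column names are my own, and all figures are copied from the `inspect(rules)` and `summary(transaction)` output.

```r
n_transactions <- 9835

# Right-hand-side item counts, from summary(transaction)
rhs_counts <- c("other vegetables" = 1903, "whole milk" = 2513)

# count, coverage and lift copied from inspect(rules), in the order shown
rules_tbl <- data.frame(
  rhs      = c(rep("other vegetables", 4), rep("whole milk", 6)),
  count    = c(102, 121, 120, 127, 99, 113, 118, 143, 121, 107),
  coverage = c(0.01769192, 0.02104728, 0.02430097, 0.02582613, 0.01728521,
               0.02003050, 0.02104728, 0.02582613, 0.02226741, 0.02074225),
  reported_lift = c(3.029608, 3.020999, 2.594890, 2.584078, 2.279125,
                    2.244885, 2.230969, 2.203354, 2.162336, 2.052747),
  stringsAsFactors = FALSE
)

# support = count / N
rules_tbl$support <- rules_tbl$count / n_transactions
# confidence = support(lhs and rhs) / support(lhs)
rules_tbl$confidence <- rules_tbl$support / rules_tbl$coverage
# lift = confidence / support(rhs)
rhs_support <- unname(rhs_counts[rules_tbl$rhs]) / n_transactions
rules_tbl$lift <- rules_tbl$confidence / rhs_support

print(round(rules_tbl[, c("support", "confidence", "lift", "reported_lift")], 4))
```

For the first rule, for example, the baseline rate for other vegetables is 1903 / 9835, or about 19%, so a confidence of about 59% corresponds to a lift of about 3.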
Use the arulesViz package to visualize the associations and their lift and support.
library(arulesViz)
# `type` is not a control parameter of the default ggplot2 graph engine,
# so the graph is drawn with the default settings.
plot(association, method = "graph")