The dataset is the Groceries dataset in the arules R package.
The groceries data contain 9,835 transactions and 169 grocery item types. Transaction data are best stored as a sparse matrix, because any single transaction contains only a small fraction of the available items. The read.transactions() function in the arules package reads the groceries file directly into such a sparse matrix.
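To make the idea of a sparse item matrix concrete, here is a minimal base R sketch. The three baskets and all variable names are hypothetical and are not taken from the groceries data; the sketch only illustrates why storing the positions of the nonzero cells can be cheaper than storing every cell, and it is not how arules implements its storage internally.

```r
# Hypothetical toy baskets (not from the groceries data)
baskets <- list(
  c("milk", "bread"),
  c("milk"),
  c("bread", "butter", "jam")
)

# One column per distinct item, one row per transaction
items <- sort(unique(unlist(baskets)))
dense <- t(sapply(baskets, function(b) items %in% b))
colnames(dense) <- items

# Density: proportion of cells that are TRUE (item present in the basket)
density <- sum(dense) / length(dense)

# A sparse representation keeps only the (row, col) positions of TRUE cells
nonzero <- which(dense, arr.ind = TRUE)

print(dense)
print(density)
print(nrow(nonzero))
```

With 169 items and a density below 3%, as in the groceries data, the list of nonzero positions is far shorter than the full matrix.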
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
groceries <- read.transactions("groceries.csv", sep = ",")
head(groceries)
## transactions in sparse format with
## 6 transactions (rows) and
## 169 items (columns)
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The density is 0.0261 (2.6%), which is the proportion of nonzero cells in the matrix. Multiplying 9,835 transactions by 169 items gives 1,662,115 cells in the groceries matrix.
Applying the density (0.02609146) to those 1,662,115 cells gives 43,367, the total number of items purchased during the 30-day period in which the data were collected (each item is counted once per transaction).
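The density arithmetic can be checked with a few lines of base R. The numbers are copied from the summary() output above; the variable names are only illustrative.

```r
# Values taken from summary(groceries)
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total number of cells in the transaction-by-item matrix
n_cells <- n_transactions * n_items

# Nonzero cells, i.e. item purchases (one per item per transaction)
n_purchased <- round(n_cells * density)

# Average number of items per transaction
avg_items <- n_purchased / n_transactions

print(n_cells)
print(n_purchased)
print(avg_items)
```

The average basket size computed this way should agree with the mean of the transaction length distribution reported by summary(groceries), which is 4.409.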
# Sort item frequencies in ascending order so head() shows the rarest items
a <- sort(itemFrequency(groceries), decreasing = FALSE)
head(a)
##             baby food  sound storage medium preservation products
##          0.0001016777          0.0001016777          0.0002033554
##                  bags       kitchen utensil        baby cosmetics
##          0.0004067107          0.0004067107          0.0006100661
itemFrequencyPlot(groceries, support = 0.05)
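# Visualize the sparse matrix for a random sample of 100 transactions
image(sample(groceries, 100))
In the sampled sparse matrix, purchases look fairly randomly scattered, with no obvious pattern that would point to a data problem. That is a good indication that I can move on to the next step (model training).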
I will set the support threshold to 0.006. I want to include only item combinations that were purchased at least twice a day on average, i.e. about 60 times in the 30-day period, and 60/9,835 ≈ 0.006.
I will start with a confidence threshold of 0.25 and adjust it as needed.
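The support threshold follows from simple arithmetic, sketched below in base R. The "twice a day over 30 days" assumption comes from the text above; the variable names are illustrative.

```r
# Assumption from the text: at least 2 purchases per day over a 30-day period
purchases_per_day <- 2
n_days            <- 30
n_transactions    <- 9835

# Minimum number of transactions that must contain the itemset
min_count <- purchases_per_day * n_days

# Express that count as a proportion of all transactions
min_support <- min_count / n_transactions

print(min_count)
print(min_support)
print(round(min_support, 3))
```

Rounding the exact ratio down to 0.006 is a convenience; an itemset still needs to appear in 60 transactions to reach a support of at least 0.006.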
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
With these thresholds, apriori() returns 463 rules.
summary(groceryrules)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 150 297 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.711 3.000 4.000
##
## summary of quality measures:
##     support           confidence          lift            count
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932   Min.   : 60.0
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229   1st Qu.: 70.0
##  Median :0.008744   Median :0.3554   Median :1.9332   Median : 86.0
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351   Mean   :113.5
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565   3rd Qu.:121.0
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565   Max.   :736.0
##
## mining info:
## data ntransactions support confidence
## groceries 9835 0.006 0.25
150 rules have 2 items, 297 have 3 items, and 16 have 4 items (counting both the left-hand and right-hand sides).
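As a quick consistency check, the rule length distribution can be tied back to the total rule count and the mean rule length in the summary. This is a small base R sketch using the counts reported above; the variable names are illustrative.

```r
# Rule length distribution from summary(groceryrules)
rule_sizes  <- c(2, 3, 4)
rule_counts <- c(150, 297, 16)

# Total number of rules
n_rules <- sum(rule_counts)

# Weighted mean rule length (lhs + rhs)
mean_length <- sum(rule_sizes * rule_counts) / n_rules

print(n_rules)
print(mean_length)
```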
inspect(groceryrules[1:5])
## lhs rhs support confidence lift
## [1] {potted plants} => {whole milk} 0.006914082 0.4000000 1.565460
## [2] {pasta} => {whole milk} 0.006100661 0.4054054 1.586614
## [3] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477
## [4] {herbs} => {other vegetables} 0.007727504 0.4750000 2.454874
## [5] {herbs} => {whole milk} 0.007727504 0.4750000 1.858983
## count
## [1] 68
## [2] 60
## [3] 69
## [4] 76
## [5] 76
Depending on the objectives, the most interesting rules may be those with the highest support, confidence, or lift.
inspect(sort(groceryrules, by = "lift")[1:10])
## lhs rhs support confidence lift count
## [1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477 69
## [2] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886 89
## [3] {other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.007015760 0.4107143 3.768074 69
## [4] {beef,
## other vegetables} => {root vegetables} 0.007930859 0.4020619 3.688692 78
## [5] {other vegetables,
## tropical fruit} => {pip fruit} 0.009456024 0.2634561 3.482649 93
## [6] {beef,
## whole milk} => {root vegetables} 0.008032537 0.3779904 3.467851 79
## [7] {other vegetables,
## pip fruit} => {tropical fruit} 0.009456024 0.3618677 3.448613 93
## [8] {pip fruit,
## yogurt} => {tropical fruit} 0.006405694 0.3559322 3.392048 63
## [9] {citrus fruit,
## other vegetables} => {root vegetables} 0.010371124 0.3591549 3.295045 102
## [10] {other vegetables,
## whole milk,
## yogurt} => {tropical fruit} 0.007625826 0.3424658 3.263712 75
Compared with the typical customer, customers who bought herbs were about 4 times as likely to buy root vegetables (lift 3.96), and customers who bought berries were about 3.8 times as likely to buy whipped/sour cream (lift 3.80).
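Lift is the rule's confidence divided by the support of its right-hand side, so the lift values can be reproduced by hand. The base R sketch below does this for the three rules in the first inspect() output whose right-hand side is whole milk, using the whole milk count (2,513) from summary(groceries). The variable names are illustrative, and the confidences are the rounded values printed above, so the results agree with the reported lifts only up to rounding.

```r
# Values taken from the output above
n_transactions   <- 9835
whole_milk_count <- 2513

# Support of the right-hand side: share of transactions containing whole milk
support_rhs <- whole_milk_count / n_transactions

# Confidence of three rules of the form {x} => {whole milk}
confidence <- c(potted_plants = 0.4000000,
                pasta         = 0.4054054,
                herbs         = 0.4750000)

# lift = confidence(lhs => rhs) / support(rhs)
lift <- confidence / support_rhs

print(round(support_rhs, 4))
print(round(lift, 4))
```

A lift above 1 means the right-hand item appears more often in baskets containing the left-hand item than in baskets overall.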