Identification of items that are usually bought together from a grocery store using market basket analysis

Get, explore and prepare data

The dataset is the Groceries dataset from the arules R package, read here from a local CSV file (groceries.csv).

The groceries data contains 9,835 transactions and 169 distinct grocery items. Transaction data are best handled as a sparse matrix. The read.transactions() function in the arules package can be used to create a sparse matrix for the groceries dataset.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
groceries <- read.transactions("groceries.csv", sep = ",")
head(groceries)

## transactions in sparse format with
##  6 transactions (rows) and
##  169 items (columns)

summary(groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The density is 0.0261 (2.6%), which is the proportion of nonzero cells in the matrix. Multiplying 9,835 transactions by 169 items gives 1,662,115 cells in the groceries matrix.

2.6% of 1,662,115 is 43,367, the total number of items purchased during the 30-day period when the data was collected (counting each item at most once per transaction).
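The arithmetic above can be checked directly in R. This is a minimal sketch that uses only the numbers printed by summary(groceries); the variable names are my own.

```r
# Numbers taken from the summary(groceries) output above
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total number of cells in the sparse matrix
total_cells <- n_transactions * n_items
total_cells

# Nonzero cells = items purchased (each item counted once per transaction)
items_purchased <- round(total_cells * density)
items_purchased

# Average number of distinct items per transaction
items_per_transaction <- items_purchased / n_transactions
items_per_transaction
```

The last value, about 4.41, agrees with the mean transaction size reported in the summary.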

Examine the frequency of items in the groceries dataset (sorted in increasing order, so head() shows the rarest items)

a <- sort(itemFrequency(groceries), decreasing = FALSE)
head(a)

##             baby food  sound storage medium preservation products 
##          0.0001016777          0.0001016777          0.0002033554 
##                  bags       kitchen utensil        baby cosmetics 
##          0.0004067107          0.0004067107          0.0006100661
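By default itemFrequency() returns relative frequencies (support), so multiplying by the number of transactions recovers the raw counts. A small sketch using the values printed above; the vector names are hypothetical.

```r
# Relative frequencies copied from the output above
rare_items <- c("baby food"             = 0.0001016777,
                "sound storage medium"  = 0.0001016777,
                "preservation products" = 0.0002033554,
                "bags"                  = 0.0004067107,
                "kitchen utensil"       = 0.0004067107,
                "baby cosmetics"        = 0.0006100661)

# Convert support to the number of transactions containing each item
item_counts <- round(rare_items * 9835)
item_counts
```

So the two rarest items each appear in a single transaction out of 9,835. With the transactions object loaded, itemFrequency(groceries, type = "absolute") should give these counts directly.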

Graphical presentation of items that appear in at least 5% of the transactions

itemFrequencyPlot(groceries, support = 0.05)

Visual inspection of the sparse matrix for a random sample of 100 transactions

image(sample(groceries, 100))

In this sample the filled cells look fairly randomly scattered, with no obvious patterns. This is a good indication that we can progress to the next step (model training).

Support and confidence threshold setting

I will set the support threshold to 0.006. We want to include items that were purchased at least twice a day on average, i.e. about 60 times in a 30-day month, and 60/9835 ≈ 0.006.
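The reasoning behind the support threshold can be written out as a short calculation. This is only a sketch of the arithmetic; the variable names are assumptions of mine, and the 30-day period is the collection window mentioned earlier.

```r
n_transactions <- 9835   # from summary(groceries)
days           <- 30     # length of the data collection period
min_per_day    <- 2      # "at least twice a day on average"

min_count   <- min_per_day * days          # 60 purchases in the period
min_support <- min_count / n_transactions  # about 0.0061

round(min_support, 3)
```

Rounding down to 0.006 makes the threshold marginally more permissive than the exact ratio.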

I will start with a confidence threshold of 0.25 and adjust as needed.

groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].

This gives 463 rules.

Model evaluation (how useful are the 463 rules?)

summary(groceryrules)

## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932   Min.   : 60.0  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229   1st Qu.: 70.0  
##  Median :0.008744   Median :0.3554   Median :1.9332   Median : 86.0  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351   Mean   :113.5  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565   3rd Qu.:121.0  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565   Max.   :736.0  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

150 rules have 2 items, 297 have 3 items, and 16 have 4 items.
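As a quick consistency check, the rule length distribution should add up to the total number of rules and reproduce the reported mean length. A minimal sketch with the counts copied from the summary; the variable names are mine.

```r
# Rule length distribution from summary(groceryrules)
rule_lengths <- c(2, 3, 4)
rule_counts  <- c(150, 297, 16)

# Total number of rules
total_rules <- sum(rule_counts)
total_rules

# Mean rule length, weighted by how many rules have each length
mean_length <- weighted.mean(rule_lengths, w = rule_counts)
mean_length
```

The mean comes out at about 2.711, matching the summary.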

inspect(groceryrules[1:5])

##     lhs                rhs                support     confidence lift    
## [1] {potted plants} => {whole milk}       0.006914082 0.4000000  1.565460
## [2] {pasta}         => {whole milk}       0.006100661 0.4054054  1.586614
## [3] {herbs}         => {root vegetables}  0.007015760 0.4312500  3.956477
## [4] {herbs}         => {other vegetables} 0.007727504 0.4750000  2.454874
## [5] {herbs}         => {whole milk}       0.007727504 0.4750000  1.858983
##     count
## [1] 68   
## [2] 60   
## [3] 69   
## [4] 76   
## [5] 76
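The three quality measures are related by simple ratios: support is the share of transactions that contain every item in the rule, confidence is the rule's count divided by the count of the left-hand side, and lift is confidence divided by the support of the right-hand side. As a sketch, rule [1] can be reproduced from numbers already printed above (whole milk appears in 2,513 transactions). The left-hand-side count is not printed anywhere; it is inferred here as count / confidence.

```r
n_transactions <- 9835
rule_count     <- 68     # transactions with potted plants AND whole milk
confidence     <- 0.4    # reported by inspect()
milk_count     <- 2513   # whole milk, from summary(groceries)

# Share of all transactions that contain both items
support <- rule_count / n_transactions

# Inferred number of transactions that contain potted plants
lhs_count <- rule_count / confidence

# Confidence relative to the baseline rate of buying whole milk
lift <- confidence / (milk_count / n_transactions)

c(support = support, lhs_count = lhs_count, lift = lift)
```

The support and lift computed this way agree with the values shown for rule [1].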

Depending on the objectives, the most interesting rules might be the ones with the highest support, confidence or lift.

To obtain the top 10 rules sorted by the lift statistic:

inspect(sort(groceryrules, by = "lift")[1:10])

##      lhs                   rhs                      support confidence     lift count
## [1]  {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477    69
## [2]  {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886    89
## [3]  {other vegetables,                                                              
##       tropical fruit,                                                                
##       whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074    69
## [4]  {beef,                                                                          
##       other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692    78
## [5]  {other vegetables,                                                              
##       tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649    93
## [6]  {beef,                                                                          
##       whole milk}       => {root vegetables}    0.008032537  0.3779904 3.467851    79
## [7]  {other vegetables,                                                              
##       pip fruit}        => {tropical fruit}     0.009456024  0.3618677 3.448613    93
## [8]  {pip fruit,                                                                     
##       yogurt}           => {tropical fruit}     0.006405694  0.3559322 3.392048    63
## [9]  {citrus fruit,                                                                  
##       other vegetables} => {root vegetables}    0.010371124  0.3591549 3.295045   102
## [10] {other vegetables,                                                              
##       whole milk,                                                                    
##       yogurt}           => {tropical fruit}     0.007625826  0.3424658 3.263712    75

Interpretation of first and second rules:

Compared to the typical customer, people who bought herbs were about 4 times as likely to buy root vegetables (lift 3.96), and people who bought berries were about 3.8 times as likely to buy whipped/sour cream (lift 3.80).
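One way to make this interpretation concrete is to recover the baseline purchase rate of the right-hand-side item, since lift = confidence / support(rhs). The sketch below uses only the confidence and lift values printed for rules [1] and [2]; the data frame and column names are mine.

```r
# Confidence and lift copied from the top two rules above
rules <- data.frame(
  rule       = c("herbs => root vegetables", "berries => whipped/sour cream"),
  confidence = c(0.4312500, 0.2721713),
  lift       = c(3.956477, 3.796886)
)

# Baseline share of all transactions that contain the right-hand-side item
rules$baseline <- rules$confidence / rules$lift

# Compare the conditional rate (confidence) with the baseline rate, in percent
round(100 * rules[, c("confidence", "baseline")], 1)
```

About 43% of herb buyers also bought root vegetables, against roughly 11% of all customers; about 27% of berry buyers also bought whipped/sour cream, against roughly 7% overall.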

Reference:

Machine Learning with R (2nd edition) by Brett Lantz