Homework #8: identifying frequently purchased groceries with association rules

Step 1 - collecting data

Our market basket analysis will utilize the purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions or about 327 transactions per day (roughly 30 transactions per hour in a 12-hour business day), suggesting that the retailer is not particularly large, nor is it particularly small.

The typical grocery store offers a huge variety of items. There might be five brands of milk, a dozen different types of laundry detergent, and three brands of coffee. Given the moderate size of the retailer, we will assume that they are not terribly concerned with finding rules that apply only to a specific brand of milk or detergent. With this in mind, all brand names can be removed from the purchases. This reduces the number of groceries to a more manageable 169 types, using broad categories such as chicken, frozen meals, margarine, and soda.

Step 2 - exploring and preparing the data

Because we are loading transactional data, we cannot simply use the read.csv() function used previously. Instead, arules provides a read.transactions() function that is similar to read.csv(), except that it produces a sparse matrix suitable for transactional data. The sep = "," parameter specifies that items in the input file are separated by commas. To read the groceries.csv data into a sparse matrix named groceries, run the following lines:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
setwd('C:/Users/KEVIN/Downloads')
groceries <- read.transactions("groceries.csv", sep = ",")

To see some basic information about the groceries matrix we just created, use the summary() function on the object:

summary(groceries) 
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The first block of information in the output (as shown previously) provides a summary of the sparse matrix we created. The output 9835 rows refers to the number of transactions, and the output 169 columns refers to the 169 different items that might appear in someone’s grocery basket. Each cell in the matrix is 1 if the item was purchased for the corresponding transaction, or 0 otherwise.
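The 1/0 layout and the density figure can be illustrated in base R. This is only a sketch: it builds an ordinary dense matrix from three baskets (the same three that head the groceries data), whereas arules stores the data in a sparse format. The names baskets, items, and m are made up for the illustration. The last lines assume that density is the share of non-empty cells, which is consistent with the counts in the summary() output.

```r
# Three toy baskets (illustrative only)
baskets <- list(
  c("citrus fruit", "margarine", "ready soups", "semi-finished bread"),
  c("coffee", "tropical fruit", "yogurt"),
  c("whole milk")
)

# All distinct items, in alphabetical order
items <- sort(unique(unlist(baskets)))

# One row per transaction, one column per item: 1 = purchased, 0 = not
m <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(m) <- items
m

# Density of the toy matrix: share of cells that are 1
toy_density <- sum(m) / length(m)
toy_density

# Same idea applied to the figures reported by summary(groceries)
total_items <- 9835 * 169 * 0.02609146   # rows * columns * density
round(total_items)                       # items purchased over the month
round(total_items / 9835, 3)             # average items per transaction
```

The average recovered this way agrees with the mean transaction length of 4.409 reported in the summary.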

The arules package includes some useful features for examining transaction data. To look at the contents of the sparse matrix, use the inspect() function in combination with the vector operators. The first five transactions can be viewed as follows:

# look at the first five transactions
inspect(groceries[1:5])
##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}

The next block of the summary() output shown earlier lists the items that were most commonly found in the transactional data. Since 2,513 / 9,835 = 0.2555, we can determine that whole milk appeared in 25.6 percent of the transactions. Other vegetables, rolls/buns, soda, and yogurt round out the list of common items.
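The same division can be applied to every item in that list. The sketch below copies the counts by hand from the summary() output; the vector name item_counts is an assumption made for the example and is not an arules object.

```r
n_transactions <- 9835

# Counts copied from the "most frequent items" block of summary(groceries)
item_counts <- c(
  "whole milk"       = 2513,
  "other vegetables" = 1903,
  "rolls/buns"       = 1809,
  "soda"             = 1715,
  "yogurt"           = 1372
)

# Support = share of transactions that contain the item
item_support <- item_counts / n_transactions
round(item_support, 3)

# Items that clear a 10 percent support threshold
names(item_support)[item_support >= 0.1]
```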

The transactions shown by inspect() match the contents of the original CSV file. To examine a particular item (that is, a column of data), it is possible to use the [row, column] matrix notation. Using this with the itemFrequency() function allows us to see the proportion of transactions that contain the item. This allows us, for instance, to view the support level for the first three items in the grocery data:

itemFrequency(groceries[, 1:3])
## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661

Visualizing item support - item frequency plots

To display only the items that appear in at least 10 percent of transactions, pass the support parameter to itemFrequencyPlot():

itemFrequencyPlot(groceries, support = 0.1)

If you would rather limit the plot to a specific number of items, use the topN parameter of itemFrequencyPlot(). The bars are then sorted by decreasing support, as in the following plot of the top 20 items in the groceries data:

itemFrequencyPlot(groceries, topN = 20)

Visualizing the transaction data - plotting the sparse matrix

In addition to looking at the items, it’s also possible to visualize the entire sparse matrix. To do so, use the image() function. The command to display the sparse matrix for the first five transactions is as follows:

image(groceries[1:5])

Keep in mind that this visualization will not be as useful for extremely large transaction databases, because the cells will be too small to discern. Still, by combining it with the sample() function, you can view the sparse matrix for a randomly sampled set of transactions. The command to plot a random selection of 100 transactions is as follows:

image(sample(groceries, 100))

A few columns seem fairly heavily populated, indicating some very popular items at the store. But overall, the distribution of dots seems fairly random. Given nothing else of note, let’s continue with our analysis.

Step 3 - training a model on the data

With data preparation completed, we can now work at finding associations among shopping cart items. We will use the implementation of the Apriori algorithm in the arules package, which we have already been using to explore and prepare the groceries data. You’ll need to install and load this package if you have not done so already. Sets of rules are created with the apriori() function.

If we run apriori() with its default settings of support = 0.1 and confidence = 0.8, we end up with a set of zero rules:

apriori(groceries)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

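A support of 0.1 is too strict for this data, so we need a lower threshold. If we require an itemset to be purchased at least twice a day on average, it must appear about 60 times in the month of data. Since 60 out of 9,835 is approximately 0.006, we’ll try setting the support there first. We also lower the confidence threshold to 0.25 and set minlen = 2 so that every rule contains at least two items.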
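The link between a relative support threshold and an absolute transaction count can be checked with a few lines of arithmetic. The sketch below assumes that the "Absolute minimum support count" printed by apriori() is the threshold times the number of transactions, rounded down. That assumption matches the 983 and 59 in the two runs but is not stated in the output itself. The helper min_count() is hypothetical.

```r
n_transactions <- 9835

# Relative support implied by roughly 60 purchases in the month
60 / n_transactions

# Hypothetical helper: absolute count implied by a relative support
# threshold, assuming the result is rounded down
min_count <- function(support) floor(support * n_transactions)

min_count(0.1)     # default support
min_count(0.006)   # lowered support
```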

The full command to find a set of association rules using the Apriori algorithm is as follows:

groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

This saves our rules in a rules object, which we can peek into by typing its name:

groceryrules
## set of 463 rules

Our groceryrules object contains a set of 463 association rules. To determine whether any of them are useful, we’ll have to dig deeper.

Step 4 - evaluating model performance

To obtain a high-level overview of the association rules, we can use summary() as follows. The rule length distribution tells us how many rules have each count of items. In our rule set, 150 rules have only two items, while 297 have three, and 16 have four. The summary statistics associated with this distribution are also given:

In the final section of the summary() output, we receive mining information, telling us how the rules were chosen. Here, we see that the groceries data, which contained 9,835 transactions, was used to construct rules with a minimum support of 0.006 and a minimum confidence of 0.25:

summary(groceryrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25
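The rule length figures in this output can be cross-checked by hand. The short sketch below re-enters the distribution from the summary and recomputes the total and the mean; the variable names are chosen only for the example.

```r
# Rule length distribution copied from summary(groceryrules)
rule_sizes  <- c(2, 3, 4)
rule_counts <- c(150, 297, 16)

# Total number of rules
total_rules <- sum(rule_counts)
total_rules

# Mean rule length, weighted by the number of rules of each size
mean_size <- sum(rule_sizes * rule_counts) / total_rules
round(mean_size, 3)
```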

We can take a look at specific rules using the inspect() function. For instance, the first three rules in the groceryrules object can be viewed as follows:

inspect(groceryrules[1:3])
##     lhs                rhs               support     confidence lift    
## [1] {potted plants} => {whole milk}      0.006914082 0.4000000  1.565460
## [2] {pasta}         => {whole milk}      0.006100661 0.4054054  1.586614
## [3] {herbs}         => {root vegetables} 0.007015760 0.4312500  3.956477

Step 5 - improving model performance

To reorder the groceryrules object, we can apply sort() while specifying "support", "confidence", or "lift" for the by parameter. By combining the sort function with vector operators, we can obtain a specific number of interesting rules. For instance, the best five rules according to the lift statistic can be examined using the following command:

inspect(sort(groceryrules, by = "lift")[1:5])
##     lhs                   rhs                      support confidence     lift
## [1] {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477
## [2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886
## [3] {other vegetables,                                                        
##      tropical fruit,                                                          
##      whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074
## [4] {beef,                                                                    
##      other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692
## [5] {other vegetables,                                                        
##      tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649

The first rule, with a lift of about 3.96, implies that people who buy herbs are nearly four times more likely to buy root vegetables than the typical customer.
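Lift can be reproduced from numbers already shown, using the standard definition: the confidence of the rule divided by the support of its right-hand side. The output above does not spell out this definition, so treat the sketch as a check under that assumption. It uses the {potted plants} => {whole milk} rule because the whole milk count appears in summary(groceries).

```r
n_transactions   <- 9835
whole_milk_count <- 2513   # from the summary(groceries) output
confidence       <- 0.4    # {potted plants} => {whole milk}

# Support of the right-hand side: share of all baskets with whole milk
support_rhs <- whole_milk_count / n_transactions

# Lift: how much more often whole milk appears with potted plants
# than in a typical basket
lift <- confidence / support_rhs
round(lift, 4)
```

The result agrees with the lift of about 1.565 that inspect() reports for this rule.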

The subset() function provides a method to search for subsets of transactions, items, or rules. To use it to find any rules with berries appearing in the rule, use the following command. It will store the rules in a new object titled berryrules:

berryrules <- subset(groceryrules, items %in% "berries")

We can then inspect the rules as we did with the larger set:

inspect(berryrules)
##     lhs          rhs                  support     confidence lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328
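Since all four rules share the left-hand side {berries}, they offer a way to check how support and confidence relate. Under the standard definition, confidence is the rule's support divided by the support of its left-hand side, so dividing support by confidence should give the same value for every row. The data frame below is typed in by hand from the output above and is not produced by arules.

```r
n_transactions <- 9835

# Values copied from inspect(berryrules)
berry <- data.frame(
  rhs        = c("whipped/sour cream", "yogurt", "other vegetables", "whole milk"),
  support    = c(0.009049314, 0.010574479, 0.010269446, 0.011794611),
  confidence = c(0.2721713, 0.3180428, 0.3088685, 0.3547401),
  stringsAsFactors = FALSE
)

# Support of the left-hand side {berries}, recovered from each rule
berry$lhs_support <- berry$support / berry$confidence

# The same quantity expressed as a number of transactions
berry$lhs_count <- round(berry$lhs_support * n_transactions)
berry
```

Every row yields the same left-hand-side count, as expected when the rules share an antecedent.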

To share the results of your market basket analysis, you can save the rules to a CSV file with the write() function. This will produce a CSV file that can be used in most spreadsheet programs including Microsoft Excel:

write(groceryrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)

Sometimes it is also convenient to convert the rules into an R data frame. This can be accomplished easily using the as() function, as follows:

groceryrules_df <- as(groceryrules, "data.frame")

This creates a data frame with the rules in the factor format, and numeric vectors for support, confidence, and lift:

str(groceryrules_df)
## 'data.frame':    463 obs. of  4 variables:
##  $ rules     : Factor w/ 463 levels "{baking powder} => {other vegetables}",..: 340 302 207 206 208 341 402 21 139 140 ...
##  $ support   : num  0.00691 0.0061 0.00702 0.00773 0.00773 ...
##  $ confidence: num  0.4 0.405 0.431 0.475 0.475 ...
##  $ lift      : num  1.57 1.59 3.96 2.45 1.86 ...
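Once the rules are in a data frame, ordinary base R tools apply. The sketch below uses a three-row stand-in named rules_df, typed in from the first three rules shown earlier, in place of the full groceryrules_df. It sorts by lift and then filters on the rule text.

```r
# Stand-in for groceryrules_df, holding only the first three rules
rules_df <- data.frame(
  rules = c("{potted plants} => {whole milk}",
            "{pasta} => {whole milk}",
            "{herbs} => {root vegetables}"),
  support    = c(0.006914082, 0.006100661, 0.007015760),
  confidence = c(0.4000000, 0.4054054, 0.4312500),
  lift       = c(1.565460, 1.586614, 3.956477),
  stringsAsFactors = FALSE
)

# Sort by decreasing lift, similar in spirit to sort(groceryrules, by = "lift")
by_lift <- rules_df[order(rules_df$lift, decreasing = TRUE), ]
by_lift

# Keep only rules whose text mentions whole milk (literal match)
milk_rules <- rules_df[grepl("whole milk", rules_df$rules, fixed = TRUE), ]
milk_rules
```

Text matching on the rules column cannot tell the left-hand side from the right-hand side, so subset() on the rules object remains the better tool for precise item queries.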

Summary

Association rules are frequently used to provide useful insights into the massive transaction databases of large retailers. The Apriori algorithm, which we studied in this chapter, does so by setting minimum thresholds of interestingness and reporting only the associations that meet these criteria.

References

Lantz, Brett. Machine Learning with R. 2nd ed. Birmingham: Packt Publishing Ltd, 2015. Print.
