Introduction

This project is based on Chapter 8 of Machine Learning with R by Brett Lantz.

A link to the book: https://bit.ly/3gsf2e0

This project is for educational purposes only.

The aim is to perform a market basket analysis of transactional data from a grocery store.

Required packages

We will use the arules package.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

Step 1 - collecting data

The dataset was adapted from the Groceries dataset in arules R package.

Step 2 - exploring and preparing the data

As the items in each purchase are separated by commas but the transactions vary in length, the file cannot be read properly with read.csv. Instead, we will use the arules package to create a sparse matrix for the transaction data.

Data preparation - creating a sparse matrix for transaction data

groceries <- read.transactions("groceries.csv", sep = ",")

#Exploring the summary of the sparse matrix
summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

We can see the dimensions of the sparse matrix: 9835 rows by 169 columns. The 169 columns represent the different items that might appear in someone’s grocery basket. Each cell in the matrix is 1 if the item was purchased in the corresponding transaction, and 0 otherwise.

The density value of 0.026 represents the proportion of non-zero matrix cells. Below it we can see the most frequently purchased items, and finally some statistics about the transaction sizes. For example, a total of 2159 transactions contained only a single item.
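
As a quick cross-check of these numbers, we could tabulate the transaction sizes ourselves; this is a minimal sketch assuming the size() accessor from the arules package.

#number of items per transaction
basket_sizes <- size(groceries)
table(basket_sizes)[1]            #should report 2159 single-item transactions
#density = non-zero cells / total cells
sum(basket_sizes) / (9835 * 169)  #should be close to 0.026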

#we can inspect the data further using inspect() and itemFrequency()
inspect(groceries[1:5])
##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}
itemFrequency(groceries[, 1:3])
## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661

We notice that the items are stored in alphabetical order; the first one, abrasive cleaner, is found in about 0.3 percent of the transactions.
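
If raw purchase counts are easier to read than proportions, itemFrequency() can also report absolute counts; a small sketch assuming the type argument documented in arules:

#absolute purchase counts for the same first three items
itemFrequency(groceries[, 1:3], type = "absolute")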

Visualizing item support - item frequency plot

itemFrequencyPlot(groceries, support = 0.1)

To see the top 20 items:

itemFrequencyPlot(groceries, topN = 20)

We can see that whole milk and other vegetables are at the top of the list.
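
To confirm the plot numerically, we can sort the item supports directly; a minimal sketch:

#top 5 items by support, should match the tallest bars in the plot
sort(itemFrequency(groceries), decreasing = TRUE)[1:5]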

Visualizing the transaction data - plotting the sparse matrix

We will use image() to get a bird’s-eye view of the sparse matrix.

#plot the sparse matrix for the first 5 transactions
image(groceries[1:5])

To see a random sample of 100 transactions:

image(sample(groceries, 100))

We can look for items that appear in many of the sampled transactions; columns that are filled in across many rows correspond to frequently purchased items.

Step 3 - training a model on the data

We will use the apriori() function to train the model, but we have to choose parameters such as support, confidence, and minlen carefully so that only potentially useful rules are kept.
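
One way to arrive at support = 0.006 roughly follows the reasoning in the book: an item purchased about twice a day over about a month of data appears in around 60 transactions, and dividing that count by the 9835 transactions gives the threshold. A small sketch of that arithmetic:

#an item bought about twice a day over ~30 days appears in ~60 transactions
min_count <- 2 * 30
min_count / 9835   #roughly 0.006, used as the support threshold below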

groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
groceryrules
## set of 463 rules

We now have 463 rules; next we need to determine which of them are useful to us.

Step 4 - evaluating model performance

summary(groceryrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.009964   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:0.018709   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :0.024809   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :0.032608   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:0.035892   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :0.255516   Max.   :3.9565  
##      count      
##  Min.   : 60.0  
##  1st Qu.: 70.0  
##  Median : 86.0  
##  Mean   :113.5  
##  3rd Qu.:121.0  
##  Max.   :736.0  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

To take a closer look at specific rules:

inspect(groceryrules[1:3])
##     lhs                rhs               support     confidence coverage  
## [1] {potted plants} => {whole milk}      0.006914082 0.4000000  0.01728521
## [2] {pasta}         => {whole milk}      0.006100661 0.4054054  0.01504830
## [3] {herbs}         => {root vegetables} 0.007015760 0.4312500  0.01626843
##     lift     count
## [1] 1.565460 68   
## [2] 1.586614 60   
## [3] 3.956477 69

We want to divide the rules into three categories: actionable, trivial, and inexplicable.

Step 5 - improving model performance

Sorting the set of association rules

We will sort the rules by lift:

inspect(sort(groceryrules, by = "lift")[1:5])
##     lhs                   rhs                      support confidence   coverage     lift count
## [1] {herbs}            => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {other vegetables,                                                                         
##      tropical fruit,                                                                           
##      whole milk}       => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69
## [4] {beef,                                                                                     
##      other vegetables} => {root vegetables}    0.007930859  0.4020619 0.01972547 3.688692    78
## [5] {other vegetables,                                                                         
##      tropical fruit}   => {pip fruit}          0.009456024  0.2634561 0.03589222 3.482649    93

From the first rule, we can say that people who buy herbs are nearly four times more likely to buy root vegetables than the typical customer.
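
Recall that lift is the rule’s confidence divided by the support of the right-hand-side item. As a rough cross-check of the first rule, a sketch using itemFrequency():

#lift(herbs => root vegetables) = confidence / support(root vegetables)
rhs_support <- itemFrequency(groceries)["root vegetables"]
0.43125 / rhs_support   #should be close to the reported lift of 3.956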

Taking subsets of association rules

berryrules <- subset(groceryrules, items %in% "berries")

inspect(berryrules)
##     lhs          rhs                  support     confidence coverage  lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  0.0332486 3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  0.0332486 2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  0.0332486 1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  0.0332486 1.388328
##     count
## [1]  89  
## [2] 104  
## [3] 101  
## [4] 116
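
For reference, subset() also accepts related matching operators in arules: %pin% for partial (substring) matches and %ain% to require all of the listed items. A couple of example calls, assuming these operators as described in the arules documentation:

#partial matching: any rule whose items contain the substring "fruit"
fruitrules <- subset(groceryrules, items %pin% "fruit")
#rules containing both berries and yogurt together
subset(groceryrules, items %ain% c("berries", "yogurt"))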

Saving association rules to a file or a data frame

write(groceryrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)

groceryrules_df <- as(groceryrules, "data.frame")

str(groceryrules_df)
## 'data.frame':    463 obs. of  6 variables:
##  $ rules     : chr  "{potted plants} => {whole milk}" "{pasta} => {whole milk}" "{herbs} => {root vegetables}" "{herbs} => {other vegetables}" ...
##  $ support   : num  0.00691 0.0061 0.00702 0.00773 0.00773 ...
##  $ confidence: num  0.4 0.405 0.431 0.475 0.475 ...
##  $ coverage  : num  0.0173 0.015 0.0163 0.0163 0.0163 ...
##  $ lift      : num  1.57 1.59 3.96 2.45 1.86 ...
##  $ count     : int  68 60 69 76 76 69 70 67 63 88 ...
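
Once the rules are in a data frame, ordinary data-frame operations apply; for example, a short sketch listing the three rules with the highest lift:

#top 3 rules by lift, taken from the data frame version
head(groceryrules_df[order(groceryrules_df$lift, decreasing = TRUE), ], 3)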

Summary

Association rules help us find out which items tend to be purchased together with other items; this may help retailers and wholesalers find the best combinations of items for promotions.