library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(cluster) 
set.seed(12345)

Instructions

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Loading the data

groceryfile = "https://raw.githubusercontent.com/jerryjerald27/data624/refs/heads/main/GroceryDataSet.csv"
data <- read.csv(groceryfile, header=FALSE)
head(data)
##                 V1                  V2             V3                       V4
## 1     citrus fruit semi-finished bread      margarine              ready soups
## 2   tropical fruit              yogurt         coffee                         
## 3       whole milk                                                            
## 4        pip fruit              yogurt  cream cheese              meat spreads
## 5 other vegetables          whole milk condensed milk long life bakery product
## 6       whole milk              butter         yogurt                     rice
##                 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6 abrasive cleaner                                                            
##   V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1                                            
## 2                                            
## 3                                            
## 4                                            
## 5                                            
## 6

We are going to have to read them in as transactions for the best fit

tdata = read.transactions(groceryfile, sep = ',')
summary(tdata)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Here we see that when read as transactions using read.transactions there are 9835 total transactions, and the most number of unique items is 169. is there is at least 1 transactions with 169 unique items. We also see the top 5 most frequent items in the transactions Whole milk followed by other vegetables, rolls/buns, soda and yogurt

Mining for rules

Now we can mine with apriori from the arules library that helps us with association rules. I am expecting to see the top 5 show up a lot on the Right hand side of these rules when not ordered. Confidence lets us chose the strength of the rules that show up, the higher they are the more certain that these items show up together. Support is the proportion of transactions that contain all the items in the rule. In this example we are okay with the support being around .001 which means the rules we uncover are going to have items that occur together in around 9.8 transactions out of 9835. This seems reasonable compared to setting a higher support that might hide seasonal patterns or event specific buying patterns.

rules <- apriori(tdata, parameter = list(supp = 0.001,conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules)
## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001247   Mean   :0.8663   Mean   :0.001449   Mean   : 3.951  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.341  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.27  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##   data ntransactions support confidence
##  tdata          9835   0.001        0.8
##                                                               call
##  apriori(data = tdata, parameter = list(supp = 0.001, conf = 0.8))

Inspecting the rules

Now we inspect the rules to see the confidence lift and support. Inspecting gives us the coverage, lift and count for the rules along with the support and confidence. Count is basically the number of times the rule appears, we have set a minimum of 9.8 with our confidence at 0.001. Coverage is how often the left hand side appears across all the transactions regardless of the RHS being present. And Lift tells us the deviation between the RHS occurring randomly vs RHS occurring when the LHS has occurred. The higher the Lift the more positive the association.

inspect(rules[1:10])
##      lhs                          rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                          
##       red/blush wine}          => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {cereals,                                                                                         
##       curd}                    => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [3]  {cereals,                                                                                         
##       yogurt}                  => {whole milk}       0.001728521  0.8095238 0.002135231  3.168192    17
## [4]  {butter,                                                                                          
##       jam}                     => {whole milk}       0.001016777  0.8333333 0.001220132  3.261374    10
## [5]  {bottled beer,                                                                                    
##       soups}                   => {whole milk}       0.001118454  0.9166667 0.001220132  3.587512    11
## [6]  {house keeping products,                                                                          
##       napkins}                 => {whole milk}       0.001321810  0.8125000 0.001626843  3.179840    13
## [7]  {house keeping products,                                                                          
##       whipped/sour cream}      => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [8]  {pastry,                                                                                          
##       sweet spreads}           => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [9]  {curd,                                                                                            
##       turkey}                  => {other vegetables} 0.001220132  0.8000000 0.001525165  4.134524    12
## [10] {rice,                                                                                            
##       sugar}                   => {whole milk}       0.001220132  1.0000000 0.001220132  3.913649    12

Extracting top 10 by lift

inspect(sort(rules, by = "lift", decreasing = TRUE)[1:10])
##      lhs                         rhs                   support confidence    coverage      lift count
## [1]  {liquor,                                                                                        
##       red/blush wine}         => {bottled beer}    0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       other vegetables,                                                                              
##       soda}                   => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [4]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       grapes}                 => {tropical fruit}  0.001118454  0.8461538 0.001321810  8.063879    11
## [5]  {other vegetables,                                                                              
##       rice,                                                                                          
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [6]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk}             => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [7]  {ham,                                                                                           
##       other vegetables,                                                                              
##       pip fruit,                                                                                     
##       yogurt}                 => {tropical fruit}  0.001016777  0.8333333 0.001220132  7.941699    10
## [8]  {beef,                                                                                          
##       citrus fruit,                                                                                  
##       other vegetables,                                                                              
##       tropical fruit}         => {root vegetables} 0.001016777  0.8333333 0.001220132  7.645367    10
## [9]  {butter,                                                                                        
##       cream cheese,                                                                                  
##       root vegetables}        => {yogurt}          0.001016777  0.9090909 0.001118454  6.516698    10
## [10] {butter,                                                                                        
##       sliced cheese,                                                                                 
##       tropical fruit,                                                                                
##       whole milk}             => {yogurt}          0.001016777  0.9090909 0.001118454  6.516698    10

We can explain our top rule here and its metrics. About 0.19 percent of all the transactions, or 19, contains liquor,wine, and bottled beer. The LHS alone, liquor and redwine together appear around 0.21 or 21 times as shown by the coverage. I.e., out of 21 transactions that have the LHS, 19 of them have the RHS, which puts the probability that they buy bottled beer when they have liquor and red wine, at around 90 percent (19/21 * 100), which is denoted by our confidence at 90 percent. Our Lift tells us that customers who buy liquor and wine are about 11 times more likely to also buy bottled beer, which is pretty high.This makes a lot of sense as it looks like my ideal night in with friends grocery run! The next set of citrus fruits, juice and yogurt looks like what id run out to get to cure my hangover the next day.

Cluster analysis

Clustering by size , lets say shopping cart big small,shopping basket big small, or a tiny bag. WIth a size break

sizes <- size(tdata)
clusters <- cut(sizes, breaks = 5, labels = FALSE)
table(clusters)
## clusters
##    1    2    3    4    5 
## 8151 1411  235   31    7
breaks <- seq(min(sizes), max(sizes), length.out = 6) 
print(breaks)
## [1]  1.0  7.2 13.4 19.6 25.8 32.0
for (i in 1:5) {
  clusters_t <- tdata[clusters == i]
  item_no_t <- itemFrequency(clusters_t, type = "absolute")
  cat("Cluster number ", i )
  print(head(sort(item_no_t, decreasing = TRUE), 5))
}
## Cluster number  1      whole milk       rolls/buns             soda other vegetables 
##             1598             1274             1222             1065 
##           yogurt 
##              761 
## Cluster number  2      whole milk other vegetables           yogurt       rolls/buns 
##              727              662              473              446 
##             soda 
##              387 
## Cluster number  3      whole milk other vegetables           yogurt  root vegetables 
##              161              148              116              107 
##             soda 
##               94 
## Cluster number  4other vegetables       whole milk   tropical fruit           pastry 
##               24               21               19               16 
##        pip fruit 
##               16 
## Cluster number  5         butter root vegetables  tropical fruit      whole milk          yogurt 
##               7               6               6               6               6

Most transactions seems to fall in the small cluster with only very few in the biggest transactions cluster. As the sizes get bigger the top items seem to be going from snack materials to actual household groceries.

WE could also use something like k-means or kmedoids.We would first need to convert tdata into a binary matrix with 1/0 for each item, run the clustering and see how the items are distributed

tmatrix <- as(tdata, "matrix") * 1
dist_matrix <- daisy(tmatrix, metric = "gower") 
pam_res <- pam(dist_matrix, k = 5)
clusters <- pam_res$clustering
table(clusters)
## clusters
##    1    2    3    4    5 
## 3817 1175 1631 1661 1551
for (i in 1:5) {
   clusters_t <- tdata[clusters == i]
  item_no_t <- itemFrequency(clusters_t, type = "absolute")
  cat("Cluster number ", i )
  print(head(sort(item_no_t, decreasing = TRUE), 5))
}
## Cluster number  1  canned beer shopping bags  bottled beer    newspapers        pastry 
##           719           347           317           247           240 
## Cluster number  2        yogurt  bottled water           soda     rolls/buns tropical fruit 
##            713            688            255            203            185 
## Cluster number  3     whole milk      rolls/buns          yogurt root vegetables  tropical fruit 
##            1631             316             260             231             219 
## Cluster number  4other vegetables       whole milk  root vegetables           yogurt 
##             1661              634              402              316 
##   tropical fruit 
##              302 
## Cluster number  5         soda    rolls/buns       sausage shopping bags        pastry 
##           977           921           207           168           146