library(arules)    # transactions class and the apriori algorithm
library(arulesViz) # plots for association rules
library(dplyr)     # general data manipulation

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Market Basket Analysis

First we read in the transactions from the GroceryDataSet.csv file in basket format.

transactions <- read.transactions("https://raw.githubusercontent.com/Luz917/data624hw10/main/GroceryDataSet.csv", format = 'basket', sep=',')

transactions
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
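If we want to peek at a few raw baskets directly, inspect() can be applied to a slice of the transactions object (a small optional check; output not shown):

# Look at the first three baskets and the sizes of the first five
inspect(transactions[1:3])
size(transactions[1:5])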

Summary of transactions.

summary(transactions)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Top 10 Items

Here we can see the ten most frequently purchased items in the grocery data.

itemFrequencyPlot(transactions,topN=10,col= 'Purple', main="Grocery List")
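The same ranking can be pulled out numerically with itemFrequency(), which returns the support of every item (a small sketch; output not shown):

# Relative frequency (support) of each item, sorted from most to least common
item_freq <- sort(itemFrequency(transactions, type = "relative"), decreasing = TRUE)
round(head(item_freq, 10), 4)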

arules package

We use the apriori function to mine association rules, with a minimum support of 0.02 and a minimum confidence of 0.3. Support is the fraction of transactions that contain all items in a rule; confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side.

arulez <- apriori(transactions, parameter = list(support = 0.02, confidence = 0.3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Let's take a look at the rules.

  • Apriori produced a set of 37 rules.
summary(arulez)
## set of 37 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 32  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.135   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02003   Min.   :0.3079   Min.   :0.04342   Min.   :1.205  
##  1st Qu.:0.02318   1st Qu.:0.3483   1st Qu.:0.05765   1st Qu.:1.514  
##  Median :0.02755   Median :0.3868   Median :0.07229   Median :1.756  
##  Mean   :0.03134   Mean   :0.3915   Mean   :0.08227   Mean   :1.730  
##  3rd Qu.:0.03325   3rd Qu.:0.4249   3rd Qu.:0.09395   3rd Qu.:1.915  
##  Max.   :0.07483   Max.   :0.5129   Max.   :0.19349   Max.   :2.842  
##      count      
##  Min.   :197.0  
##  1st Qu.:228.0  
##  Median :271.0  
##  Mean   :308.3  
##  3rd Qu.:327.0  
##  Max.   :736.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9835    0.02        0.3
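For further filtering and sorting it can be convenient to flatten the rule set into a plain data frame; a small sketch using as() and dplyr (output not shown):

# Convert the rules to a data frame and order by lift
rulez_df <- as(arulez, "data.frame")
rulez_df %>%
  arrange(desc(lift)) %>%
  head(10)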

A look at the top 10 rules sorted by lift.

inspect(head(arulez, n = 10, by = "lift"))
##      lhs                                   rhs                support   
## [1]  {other vegetables,whole milk}      => {root vegetables}  0.02318251
## [2]  {root vegetables,whole milk}       => {other vegetables} 0.02318251
## [3]  {root vegetables}                  => {other vegetables} 0.04738180
## [4]  {whipped/sour cream}               => {other vegetables} 0.02887646
## [5]  {whole milk,yogurt}                => {other vegetables} 0.02226741
## [6]  {other vegetables,yogurt}          => {whole milk}       0.02226741
## [7]  {butter}                           => {whole milk}       0.02755465
## [8]  {pork}                             => {other vegetables} 0.02165735
## [9]  {curd}                             => {whole milk}       0.02613116
## [10] {other vegetables,root vegetables} => {whole milk}       0.02318251
##      confidence coverage   lift     count
## [1]  0.3097826  0.07483477 2.842082 228  
## [2]  0.4740125  0.04890696 2.449770 228  
## [3]  0.4347015  0.10899847 2.246605 466  
## [4]  0.4028369  0.07168277 2.081924 284  
## [5]  0.3974592  0.05602440 2.054131 219  
## [6]  0.5128806  0.04341637 2.007235 219  
## [7]  0.4972477  0.05541434 1.946053 271  
## [8]  0.3756614  0.05765125 1.941476 213  
## [9]  0.4904580  0.05327911 1.919481 257  
## [10] 0.4892704  0.04738180 1.914833 228
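Lift is the confidence of a rule divided by the unconditional support of its right-hand side, so values above 1 mean the items co-occur more often than expected by chance. As a quick sanity check, we can recompute the lift of rule [3], {root vegetables} => {other vegetables}, from the numbers above (a small sketch; the result should match 2.246605 up to rounding):

# lift(A => B) = confidence(A => B) / support(B)
conf_rule3 <- 0.4347015                                        # confidence from the table above
supp_rhs   <- itemFrequency(transactions)["other vegetables"]  # support of the rhs on its own
conf_rule3 / supp_rhs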

A look at the top 10 rules sorted by confidence.

inspect(head(arulez, n = 10, by = "confidence"))
##      lhs                                   rhs                support   
## [1]  {other vegetables,yogurt}          => {whole milk}       0.02226741
## [2]  {butter}                           => {whole milk}       0.02755465
## [3]  {curd}                             => {whole milk}       0.02613116
## [4]  {other vegetables,root vegetables} => {whole milk}       0.02318251
## [5]  {root vegetables,whole milk}       => {other vegetables} 0.02318251
## [6]  {domestic eggs}                    => {whole milk}       0.02999492
## [7]  {whipped/sour cream}               => {whole milk}       0.03223183
## [8]  {root vegetables}                  => {whole milk}       0.04890696
## [9]  {root vegetables}                  => {other vegetables} 0.04738180
## [10] {frozen vegetables}                => {whole milk}       0.02043721
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219  
## [2]  0.4972477  0.05541434 1.946053 271  
## [3]  0.4904580  0.05327911 1.919481 257  
## [4]  0.4892704  0.04738180 1.914833 228  
## [5]  0.4740125  0.04890696 2.449770 228  
## [6]  0.4727564  0.06344687 1.850203 295  
## [7]  0.4496454  0.07168277 1.759754 317  
## [8]  0.4486940  0.10899847 1.756031 481  
## [9]  0.4347015  0.10899847 2.246605 466  
## [10] 0.4249471  0.04809354 1.663094 201
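Whole milk clearly dominates the right-hand sides. If we wanted to focus on only those rules, arules provides subset() with item-matching operators; a small sketch (output not shown):

# Keep only rules that predict whole milk and order them by confidence
milk_rules <- subset(arulez, subset = rhs %in% "whole milk")
inspect(sort(milk_rules, by = "confidence"))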

A scatter plot of the 37 rules, with support and confidence on the axes and the points shaded by lift. You can see that the points are scattered; the jitter argument spreads overlapping points apart.

plot(arulez, jitter = 5)

Two more views of the top 10 rules by confidence: a parallel coordinates (paracoord) plot and an interactive graph rendered as an html widget.

head(arulez, n = 10, by = "confidence") %>%  
    plot(method = "paracoord")

head(arulez, n = 10, by = "confidence") %>% 
  plot(method = "graph", engine = "htmlwidget")

Cluster analysis

We use hierarchical cluster analysis to cluster the items that appear in the transactions, based on an item-to-item dissimilarity matrix from arules and keeping only items that appear in more than 5% of baskets. Hierarchical clustering produces dendrograms, and the hclust function supports several linkage methods; we will look at the complete method and the Ward method.

Complete Method

trans2 <- transactions[ , itemFrequency(transactions) > 0.05]  # keep items appearing in more than 5% of baskets
glist <- dissimilarity(trans2, which = "items")                # item-to-item dissimilarity matrix
# cluster and plot the dendrogram
hc1 <- hclust(glist, method = "complete")

plot(hc1, cex = 0.8, hang = 2)

Ward Method

  • Separated into 5 different clusters, with borders drawn around them.
hc2 <- hclust(glist, method = "ward.D2")

plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 5, border = 2:5)
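The boxes drawn by rect.hclust only mark the clusters on the plot; to get the actual cluster membership of each item, cutree() can be applied to the same tree. A small sketch (output not shown):

# Cut the Ward tree into the same 5 clusters and list the items in each
clusters <- cutree(hc2, k = 5)
split(names(clusters), clusters)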

A tanglegram compares the two dendrograms side by side.

library(dendextend)
# Compute 2 hierarchical clusterings
hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")

# Create two dendrograms
dend1 <- as.dendrogram(hc3)
dend2 <- as.dendrogram(hc4)

tanglegram(dend1, dend2)

You can see that long item labels are wrapped onto multiple lines, and the lines in the middle connect the matching items between the two dendrograms.
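dendextend also provides entanglement() as a numeric summary of how well the two trees line up, with 0 meaning a perfect alignment and 1 the worst case; a small sketch (output not shown):

# Quantify the alignment between the two dendrograms (lower is better)
entanglement(dend1, dend2)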

References

https://stackoverflow.com/questions/49628304/create-interactive-paracoord-plot-for-association-rules

https://uc-r.github.io/hc_clustering