Homework 10: Market Basket and Clusters

Instructions

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift. Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Packages

library(arules)
library(arulesViz)
library(dplyr)
library(dendextend)

Read the transaction of the data set:

transactions <- read.transactions("GroceryDataSet.csv", format = 'basket', sep=',')

transactions
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

Summary of the data set

summary(transactions)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Top 10 items in the list which has most frequency

itemFrequencyPlot(transactions,topN=10,main="Grocery List")

Use apriori function from the arulez package to establish the rules.

arulez <- apriori(transactions, parameter = list(support = 0.02, confidence = 0.3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Summary of the rules

summary(arulez)
## set of 37 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 32  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.135   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02003   Min.   :0.3079   Min.   :0.04342   Min.   :1.205  
##  1st Qu.:0.02318   1st Qu.:0.3483   1st Qu.:0.05765   1st Qu.:1.514  
##  Median :0.02755   Median :0.3868   Median :0.07229   Median :1.756  
##  Mean   :0.03134   Mean   :0.3915   Mean   :0.08227   Mean   :1.730  
##  3rd Qu.:0.03325   3rd Qu.:0.4249   3rd Qu.:0.09395   3rd Qu.:1.915  
##  Max.   :0.07483   Max.   :0.5129   Max.   :0.19349   Max.   :2.842  
##      count      
##  Min.   :197.0  
##  1st Qu.:228.0  
##  Median :271.0  
##  Mean   :308.3  
##  3rd Qu.:327.0  
##  Max.   :736.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9835    0.02        0.3
##                                                                              call
##  apriori(data = transactions, parameter = list(support = 0.02, confidence = 0.3))

Inspect the head of the list sorted by lift

inspect(head(arulez, n = 10, by = "lift"))
##      lhs                                    rhs                support   
## [1]  {other vegetables, whole milk}      => {root vegetables}  0.02318251
## [2]  {root vegetables, whole milk}       => {other vegetables} 0.02318251
## [3]  {root vegetables}                   => {other vegetables} 0.04738180
## [4]  {whipped/sour cream}                => {other vegetables} 0.02887646
## [5]  {whole milk, yogurt}                => {other vegetables} 0.02226741
## [6]  {other vegetables, yogurt}          => {whole milk}       0.02226741
## [7]  {butter}                            => {whole milk}       0.02755465
## [8]  {pork}                              => {other vegetables} 0.02165735
## [9]  {curd}                              => {whole milk}       0.02613116
## [10] {other vegetables, root vegetables} => {whole milk}       0.02318251
##      confidence coverage   lift     count
## [1]  0.3097826  0.07483477 2.842082 228  
## [2]  0.4740125  0.04890696 2.449770 228  
## [3]  0.4347015  0.10899847 2.246605 466  
## [4]  0.4028369  0.07168277 2.081924 284  
## [5]  0.3974592  0.05602440 2.054131 219  
## [6]  0.5128806  0.04341637 2.007235 219  
## [7]  0.4972477  0.05541434 1.946053 271  
## [8]  0.3756614  0.05765125 1.941476 213  
## [9]  0.4904580  0.05327911 1.919481 257  
## [10] 0.4892704  0.04738180 1.914833 228

List sorted by the confidence:

inspect(head(arulez, n = 10, by = "confidence"))
##      lhs                                    rhs                support   
## [1]  {other vegetables, yogurt}          => {whole milk}       0.02226741
## [2]  {butter}                            => {whole milk}       0.02755465
## [3]  {curd}                              => {whole milk}       0.02613116
## [4]  {other vegetables, root vegetables} => {whole milk}       0.02318251
## [5]  {root vegetables, whole milk}       => {other vegetables} 0.02318251
## [6]  {domestic eggs}                     => {whole milk}       0.02999492
## [7]  {whipped/sour cream}                => {whole milk}       0.03223183
## [8]  {root vegetables}                   => {whole milk}       0.04890696
## [9]  {root vegetables}                   => {other vegetables} 0.04738180
## [10] {frozen vegetables}                 => {whole milk}       0.02043721
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219  
## [2]  0.4972477  0.05541434 1.946053 271  
## [3]  0.4904580  0.05327911 1.919481 257  
## [4]  0.4892704  0.04738180 1.914833 228  
## [5]  0.4740125  0.04890696 2.449770 228  
## [6]  0.4727564  0.06344687 1.850203 295  
## [7]  0.4496454  0.07168277 1.759754 317  
## [8]  0.4486940  0.10899847 1.756031 481  
## [9]  0.4347015  0.10899847 2.246605 466  
## [10] 0.4249471  0.04809354 1.663094 201

Plot of the 37 rules.

plot(arulez, jitter = 5)

Plot of the 10 rules using the paracoord method:

head(arulez, n = 10, by = "confidence") %>%  
    plot(method = "paracoord")

head(arulez, n = 10, by = "confidence") %>% 
  plot(method = "graph", engine = "htmlwidget")

Cluster analysis:

Hierarchical cluster analysis: it creates cluster dedograms.

Complete Method:

trans2 <- transactions[ , itemFrequency(transactions) > 0.05]
glist <- dissimilarity(trans2, which = "items")
# plot dendrogram
hc1 <- hclust(glist, method = "complete" )

plot(hc1, cex = 0.8, hang = 2)

Ward.D2 Method:

5 different clusters.

hc2 <- hclust(glist, method = "ward.D2" )

plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 5, border = 2:5)

hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")


dend1 <- as.dendrogram (hc3)
dend2 <- as.dendrogram (hc4)

tanglegram(dend1, dend2)

Ref: https://uc-r.github.io/hc_clustering