Market Basket Analysis and Clustering

Problem

The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each comma-separated field in a line represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.

library(arules)
library(arulesViz)
library(caret)
library(caTools)
library(cluster)
library(corrplot)
# Attach plyr before dplyr so that dplyr's verbs (mutate, summarise) are not masked by plyr's
library(plyr)
library(dplyr)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(knitr)
library(lattice)
library(lubridate)
library(mice)
library(RColorBrewer)
library(reshape2)
library(readxl)
library(tidyr)
library(tidyverse)
library(utils)

Read Transaction Data

We start by reading the CSV file containing the grocery transactions. We use the read.transactions() function from the arules package, which reads data stored in transaction (basket) format.

tr = read.transactions("GroceryDataSet.csv", format = 'basket', sep=',')
summary(tr)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
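
The density and mean basket size in this summary can be cross-checked from the item counts it reports. The following is a small sketch of that arithmetic, using only numbers printed above; the variable names are illustrative and not part of the original analysis.

```r
# Item counts reported by summary(tr): the five most frequent items plus "(Other)"
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)
n_transactions <- 9835
n_items        <- 169

# Total number of purchased items, i.e. non-empty cells of the sparse matrix
total_purchases <- sum(item_counts)

# Density: share of non-empty cells in the 9835 x 169 matrix
density <- total_purchases / (n_transactions * n_items)

# Mean number of items per transaction
mean_basket_size <- total_purchases / n_transactions

print(round(density, 8))
print(round(mean_basket_size, 3))
```

Both values agree with the summary (density 0.02609146, mean basket size 4.409), which confirms that density here is simply the fraction of item slots that are filled.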

The most common items can be shown using both absolute and relative frequency plots.

itemFrequencyPlot(tr, topN = 20, type = "absolute",
                  col = brewer.pal(8,'Pastel2'), main = "Absolute Item Frequency Plot")

itemFrequencyPlot(tr, topN = 20, type = "relative",
                  col = brewer.pal(8,'Pastel2'), main = "Relative Item Frequency Plot")

Association Rule Learning

In the context of association rule mining, the following terms have specific definitions.

Support: The fraction of transactions that contain an itemset. For example, an itemset that appears in half of all transactions has Support = 0.5. This can also be thought of as how popular an itemset is.

Confidence: For a rule X => Y, the fraction of transactions containing X that also contain Y, that is, the Support of X and Y together divided by the Support of X. It indicates how often the rule holds when its left-hand side is present.

Lift: For a rule X => Y, Lift equals the Support of the two itemsets together, divided by the product of their individual Supports (equivalently, Confidence divided by the Support of Y). For itemsets that occur independently of each other, the value of Lift is 1. Lift greater than 1 means the itemsets occur together more often than independence would predict, and Lift below 1 means they occur together less often.
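
As a worked illustration of these three measures, the sketch below computes them by hand in base R for one rule. The five baskets and the helper function `support()` are made up for this example and are not taken from the Groceries data.

```r
# Five hypothetical baskets (not from the Groceries data)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "butter"),
  c("milk", "bread", "jam")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {bread} => {milk}
lhs <- "bread"
rhs <- "milk"

supp_rule  <- support(c(lhs, rhs))                        # 3 of 5 baskets
confidence <- supp_rule / support(lhs)                    # (3/5) / (4/5)
lift       <- supp_rule / (support(lhs) * support(rhs))   # (3/5) / (4/5 * 4/5)

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

Here the confidence is 0.75 but the lift is slightly below 1, because milk is already in 80% of the baskets. A rule can therefore have high confidence without indicating any positive association.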

The arules package is one of the most widely used R packages for learning association rules. We use it to generate association rules as follows.

# Minimum support 0.002, minimum confidence 0.5, at most 10 items per rule.
association.rules = arules::apriori(tr, parameter=list(supp=0.002, conf=0.5, maxlen=10))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(association.rules)

## set of 1098 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   6 576 471  45 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.505   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence          lift           count       
##  Min.   :0.002034   Min.   :0.5000   Min.   :1.957   Min.   : 20.00  
##  1st Qu.:0.002237   1st Qu.:0.5263   1st Qu.:2.194   1st Qu.: 22.00  
##  Median :0.002644   Median :0.5676   Median :2.584   Median : 26.00  
##  Mean   :0.003289   Mean   :0.5845   Mean   :2.668   Mean   : 32.35  
##  3rd Qu.:0.003559   3rd Qu.:0.6223   3rd Qu.:2.899   3rd Qu.: 35.00  
##  Max.   :0.022267   Max.   :0.8857   Max.   :7.154   Max.   :219.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr          9835   0.002        0.5

Inspect the first 10 rules. These appear in the order apriori returned them and are not ranked by any quality measure:

inspect(association.rules[1:10])

##      lhs                                    rhs                support    
## [1]  {cereals}                           => {whole milk}       0.003660397
## [2]  {jam}                               => {whole milk}       0.002948653
## [3]  {specialty cheese}                  => {other vegetables} 0.004270463
## [4]  {rice}                              => {other vegetables} 0.003965430
## [5]  {rice}                              => {whole milk}       0.004677173
## [6]  {baking powder}                     => {whole milk}       0.009252669
## [7]  {specialty cheese,yogurt}           => {whole milk}       0.002033554
## [8]  {specialty cheese,whole milk}       => {yogurt}           0.002033554
## [9]  {other vegetables,specialty cheese} => {whole milk}       0.002236909
## [10] {specialty cheese,whole milk}       => {other vegetables} 0.002236909
##      confidence lift     count
## [1]  0.6428571  2.515917 36   
## [2]  0.5471698  2.141431 29   
## [3]  0.5000000  2.584078 42   
## [4]  0.5200000  2.687441 39   
## [5]  0.6133333  2.400371 46   
## [6]  0.5229885  2.046793 91   
## [7]  0.7142857  2.795464 20   
## [8]  0.5405405  3.874793 20   
## [9]  0.5238095  2.050007 22   
## [10] 0.5945946  3.072957 22
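
The reported quality measures can be verified by hand. As a sketch, the following recomputes the count and lift of rule [1], {cereals} => {whole milk}, from the figures printed above and the whole milk frequency in summary(tr); the variable names are illustrative.

```r
# Values printed above for rule [1] {cereals} => {whole milk}
rule_support     <- 0.003660397
rule_confidence  <- 0.6428571
n_transactions   <- 9835
whole_milk_count <- 2513      # frequency of whole milk from summary(tr)

# Relative support times the number of transactions gives the absolute count
rule_count <- rule_support * n_transactions

# Lift = confidence / support of the right-hand side
rhs_support  <- whole_milk_count / n_transactions
lift_by_hand <- rule_confidence / rhs_support

print(round(rule_count))
print(round(lift_by_hand, 4))
```

The results match the count (36) and lift (2.5159) shown in the table, up to the rounding of the printed inputs.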

Rules with Highest Lift

Here we keep only the rules with confidence greater than 0.6 (60%), then take the 10 rules with the highest lift.

subRules = association.rules[quality(association.rules)$confidence > 0.6]

top10RulesByLift = head(subRules, n = 10, by = "lift")
inspect(top10RulesByLift)

##      lhs                     rhs                    support confidence     lift count
## [1]  {beef,                                                                          
##       citrus fruit,                                                                  
##       other vegetables}   => {root vegetables}  0.002135231  0.6363636 5.838280    21
## [2]  {citrus fruit,                                                                  
##       other vegetables,                                                              
##       tropical fruit,                                                                
##       whole milk}         => {root vegetables}  0.003152008  0.6326531 5.804238    31
## [3]  {citrus fruit,                                                                  
##       frozen vegetables,                                                             
##       other vegetables}   => {root vegetables}  0.002033554  0.6250000 5.734025    20
## [4]  {beef,                                                                          
##       other vegetables,                                                              
##       tropical fruit}     => {root vegetables}  0.002745297  0.6136364 5.629770    27
## [5]  {butter,                                                                        
##       other vegetables,                                                              
##       tropical fruit,                                                                
##       whole milk}         => {yogurt}           0.002338587  0.6969697 4.996135    23
## [6]  {citrus fruit,                                                                  
##       root vegetables,                                                               
##       tropical fruit,                                                                
##       whole milk}         => {other vegetables} 0.003152008  0.8857143 4.577509    31
## [7]  {other vegetables,                                                              
##       rolls/buns,                                                                    
##       tropical fruit,                                                                
##       whole milk}         => {yogurt}           0.002541942  0.6250000 4.480230    25
## [8]  {rolls/buns,                                                                    
##       tropical fruit,                                                                
##       whipped/sour cream} => {yogurt}           0.002135231  0.6176471 4.427521    21
## [9]  {curd,                                                                          
##       tropical fruit,                                                                
##       whole milk}         => {yogurt}           0.003965430  0.6093750 4.368224    39
## [10] {grapes,                                                                        
##       tropical fruit,                                                                
##       whole milk}         => {other vegetables} 0.002033554  0.8000000 4.134524    20
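
The filter-then-rank logic can be illustrated in base R on a small table. As a sketch, the data frame below holds the confidence and lift of the first ten rules printed earlier by inspect(association.rules[1:10]); the data frame itself and its column names are only for illustration.

```r
# Confidence and lift of the first ten rules, copied from the earlier inspect() output
first_ten <- data.frame(
  rule       = 1:10,
  confidence = c(0.6428571, 0.5471698, 0.5000000, 0.5200000, 0.6133333,
                 0.5229885, 0.7142857, 0.5405405, 0.5238095, 0.5945946),
  lift       = c(2.515917, 2.141431, 2.584078, 2.687441, 2.400371,
                 2.046793, 2.795464, 3.874793, 2.050007, 3.072957)
)

# Step 1: keep rules with confidence above 0.6
kept <- first_ten[first_ten$confidence > 0.6, ]

# Step 2: order the remaining rules by decreasing lift
ranked <- kept[order(kept$lift, decreasing = TRUE), ]

print(ranked$rule)
```

Only rules 7, 1, and 5 survive the filter, in that order. Rule 8 has the highest lift of the ten (3.87) but is dropped because its confidence is 0.54, so the confidence threshold directly shapes which rules can appear in a top-10-by-lift list.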

plot(top10RulesByLift, main="Scatter plot for Top 10 rules by Lift")

plot(top10RulesByLift, method="paracoord")

Interactive Scatter Plot

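# plotly_arules() is deprecated in recent arulesViz releases; use the plotly engine of plot() instead
plot(top10RulesByLift, engine = "plotly")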

Graph-based Visualization of Rules

plot(top10RulesByLift, method = "graph",  engine = "htmlwidget")

Finding clusters using K-Means

In this section we perform a cluster analysis on the data using the K-Means algorithm. For this, we first convert the transactions into an incidence matrix with one row per transaction and one column per item, then normalize (center and scale) each column and store the result as a data frame.

tr_data = as(tr, "matrix")
norm_data = as.data.frame(scale(tr_data))
dim(norm_data)

## [1] 9835  169
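
To see what scaling does to binary purchase indicators, the sketch below standardizes a small made-up incidence matrix with one common and one rare item. The matrix and its column names are hypothetical and chosen only to make the effect visible.

```r
# Hypothetical incidence matrix: 100 transactions, one common and one rare item
incidence <- cbind(
  common_item = rep(c(TRUE, FALSE), times = c(25, 75)),  # in 25% of baskets
  rare_item   = rep(c(TRUE, FALSE), times = c(1, 99))    # in 1% of baskets
)

# Subtract each column's mean and divide by its standard deviation
scaled <- scale(incidence)

# The first transaction contains both items; these are its z-scores
z_when_present <- scaled[1, ]
print(round(z_when_present, 2))
```

After scaling, the presence of the rare item is worth a z-score of 9.9, against about 1.72 for the common item. Rare items therefore weigh heavily in the Euclidean distances that K-Means uses, which may help explain why a few very small clusters can form around them.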

set.seed(1234)
kmfit = kmeans(norm_data, centers=5, nstart = 25)
str(kmfit)

## List of 9
##  $ cluster     : int [1:9835] 4 4 4 4 4 1 4 4 4 4 ...
##  $ centers     : num [1:5, 1:169] 0.1246 -0.0598 0.3498 -0.0418 0.6703 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##  $ totss       : num 1661946
##  $ withinss    : num [1:5] 823255 16544 9517 741302 16571
##  $ tot.withinss: num 1607189
##  $ betweenss   : num 54757
##  $ size        : int [1:5] 2277 17 41 7477 23
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

summary(kmfit)

##              Length Class  Mode   
## cluster      9835   -none- numeric
## centers       845   -none- numeric
## totss           1   -none- numeric
## withinss        5   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            5   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric
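
The sums of squares in the str(kmfit) output can be put in context with a little arithmetic. The sketch below uses only figures printed above; the variable names are illustrative, and the interpretation is a rough reading rather than a formal test.

```r
# Figures reported by dim(norm_data) and str(kmfit)
n         <- 9835
p         <- 169
totss     <- 1661946
betweenss <- 54757

# Each scaled column has variance 1, so its sum of squares is n - 1
expected_totss <- (n - 1) * p

# Share of the total variation accounted for by the cluster assignment
explained_share <- betweenss / totss

print(expected_totss)
print(round(explained_share, 3))
```

The expected total matches the reported totss exactly, and the between-cluster share is only about 3.3%. Together with the very uneven cluster sizes (7477 and 2277 versus 17, 41, and 23), this suggests that the five clusters separate the transactions only weakly.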

The following plot shows the clusters found by the K-Means algorithm. Since the data has 169 dimensions, the plot projects the observations onto the first two principal components of a PCA transformation.

#print(kmfit$centers)
factoextra::fviz_cluster(kmfit, data = norm_data)
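
The projection behind such a plot can be sketched in base R. The example below runs K-Means on a small simulated data set and computes the first two principal component scores with prcomp(); the data, object names, and number of clusters are assumptions for illustration, and this is an approximation of what the plotting function does rather than its actual implementation.

```r
set.seed(1)

# Hypothetical scaled data: 60 observations, 5 variables, two shifted groups
toy <- scale(rbind(
  matrix(rnorm(30 * 5, mean = 0), ncol = 5),
  matrix(rnorm(30 * 5, mean = 3), ncol = 5)
))
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

# Principal component analysis; keep the scores on the first two components
pca <- prcomp(toy)
coords <- data.frame(
  PC1     = pca$x[, 1],
  PC2     = pca$x[, 2],
  cluster = factor(toy_fit$cluster)
)

# Proportion of the total variance retained by the two plotted components
variance_kept <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)

# plot(coords$PC1, coords$PC2, col = coords$cluster) would draw the projection
print(head(coords, 3))
```

Checking variance_kept is worthwhile before reading much into a two-dimensional cluster plot: with 169 sparse item columns, two components may retain only a small part of the variation.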

norm_data %>%
  mutate(Cluster = kmfit$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")

## # A tibble: 5 x 170
##   Cluster `abrasive clean… `artif. sweeten… `baby cosmetics` `baby food`
##     <int>            <dbl>            <dbl>            <dbl>       <dbl>
## 1       1           0.125            0.113            0.0464     -0.0101
## 2       2          -0.0598           0.976           -0.0247      5.82  
## 3       3           0.350           -0.0571          -0.0247     -0.0101
## 4       4          -0.0418          -0.0360          -0.0139     -0.0101
## 5       5           0.670           -0.0571          -0.0247     -0.0101
## # … with 165 more variables: bags <dbl>, `baking powder` <dbl>, `bathroom
## #   cleaner` <dbl>, beef <dbl>, berries <dbl>, beverages <dbl>, `bottled
## #   beer` <dbl>, `bottled water` <dbl>, brandy <dbl>, `brown bread` <dbl>,
## #   butter <dbl>, `butter milk` <dbl>, `cake bar` <dbl>, candles <dbl>,
## #   candy <dbl>, `canned beer` <dbl>, `canned fish` <dbl>, `canned
## #   fruit` <dbl>, `canned vegetables` <dbl>, `cat food` <dbl>,
## #   cereals <dbl>, `chewing gum` <dbl>, chicken <dbl>, chocolate <dbl>,
## #   `chocolate marshmallow` <dbl>, `citrus fruit` <dbl>, cleaner <dbl>,
## #   `cling film/bags` <dbl>, `cocoa drinks` <dbl>, coffee <dbl>,
## #   `condensed milk` <dbl>, `cooking chocolate` <dbl>, cookware <dbl>,
## #   cream <dbl>, `cream cheese` <dbl>, curd <dbl>, `curd cheese` <dbl>,
## #   decalcifier <dbl>, `dental care` <dbl>, dessert <dbl>,
## #   detergent <dbl>, `dish cleaner` <dbl>, dishes <dbl>, `dog food` <dbl>,
## #   `domestic eggs` <dbl>, `female sanitary products` <dbl>, `finished
## #   products` <dbl>, fish <dbl>, flour <dbl>, `flower (seeds)` <dbl>,
## #   `flower soil/fertilizer` <dbl>, frankfurter <dbl>, `frozen
## #   chicken` <dbl>, `frozen dessert` <dbl>, `frozen fish` <dbl>, `frozen
## #   fruits` <dbl>, `frozen meals` <dbl>, `frozen potato products` <dbl>,
## #   `frozen vegetables` <dbl>, `fruit/vegetable juice` <dbl>,
## #   grapes <dbl>, `hair spray` <dbl>, ham <dbl>, `hamburger meat` <dbl>,
## #   `hard cheese` <dbl>, herbs <dbl>, honey <dbl>, `house keeping
## #   products` <dbl>, `hygiene articles` <dbl>, `ice cream` <dbl>, `instant
## #   coffee` <dbl>, `Instant food products` <dbl>, jam <dbl>,
## #   ketchup <dbl>, `kitchen towels` <dbl>, `kitchen utensil` <dbl>, `light
## #   bulbs` <dbl>, liqueur <dbl>, liquor <dbl>, `liquor (appetizer)` <dbl>,
## #   `liver loaf` <dbl>, `long life bakery product` <dbl>, `make up
## #   remover` <dbl>, `male cosmetics` <dbl>, margarine <dbl>,
## #   mayonnaise <dbl>, meat <dbl>, `meat spreads` <dbl>, `misc.
## #   beverages` <dbl>, mustard <dbl>, napkins <dbl>, newspapers <dbl>, `nut
## #   snack` <dbl>, `nuts/prunes` <dbl>, oil <dbl>, onions <dbl>, `organic
## #   products` <dbl>, `organic sausage` <dbl>, `other vegetables` <dbl>,
## #   `packaged fruit/vegetables` <dbl>, …
