HW10: Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
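
For reference, the three measures can be computed by hand. The sketch below uses a small made-up set of five baskets (hypothetical data, not taken from the Groceries file) and a helper named `support_of` that is introduced here only for illustration. Support is the share of baskets containing an itemset, confidence is support(lhs and rhs) / support(lhs), and lift is confidence / support(rhs).

```r
# Toy baskets (hypothetical, for illustration only)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("milk")
)

# Share of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Measures for the rule {bread} => {milk}
supp_rule <- support_of(c("bread", "milk"), baskets)
conf_rule <- supp_rule / support_of("bread", baskets)
lift_rule <- conf_rule / support_of("milk", baskets)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift below 1, as in this toy case, means the left-hand item makes the right-hand item slightly less likely than its baseline rate; a lift above 1 means the items appear together more often than independence would predict.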

Load the data and take a quick look at it with summary(). The “arules” library's read.transactions() then reads the same file as a set of transactions.

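library(arules)        # read.transactions(), apriori(), inspect()
library(arulesViz)     # plot() methods for rules, e.g. method = "paracoord"
library(RColorBrewer)  # brewer.pal()
library(ggplot2)       # theme_minimal(), theme(), element_text()

df <- read.csv("C:/Users/xmei/Desktop/GroceryDataSet.csv",header = FALSE)
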
summary(df)
##       V1                 V2                 V3                 V4           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       V5                 V6                 V7                 V8           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       V9                V10                V11                V12           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V13                V14                V15                V16           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V17                V18                V19                V20           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V21                V22                V23                V24           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V25                V26                V27                V28           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V29                V30                V31                V32           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
tr = read.transactions("C:/Users/xmei/Desktop/GroceryDataSet.csv", format = 'basket', sep=',')
summary(tr)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
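
The density figure in this summary can be cross-checked with plain arithmetic. The sketch below only retypes numbers from the output above (the variable names are my own): density is the total number of item occurrences divided by the number of cells in the 9835 x 169 incidence matrix, and the mean basket size is the same total divided by the number of transactions.

```r
# Figures copied from the summary(tr) output
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five plus "(Other)"

total_items <- sum(item_counts)                            # all item occurrences
density     <- total_items / (n_transactions * n_items)    # share of filled cells
mean_basket <- total_items / n_transactions                # average items per receipt

print(total_items)
print(round(density, 8))
print(round(mean_basket, 3))
```

Both results should agree with the reported density and with the mean of the length distribution, which suggests the item counts and the sparse matrix describe the same data.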

The bar chart below shows the 20 most frequent items and their absolute counts. The most frequent items include whole milk, other vegetables, rolls/buns, soda, and yogurt.

itemFrequencyPlot(tr, topN = 20, type = "absolute",
                  col = brewer.pal(8,'Pastel2'), main = "Absolute Item Frequency Plot")
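
Absolute counts are easier to compare with the support threshold used later once they are turned into shares of all transactions. A minimal sketch, using only the five counts printed by summary(tr); the vector names are assumptions for illustration:

```r
# Counts copied from the "most frequent items" part of summary(tr)
n_transactions <- 9835
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative frequency = single-item support
relative_freq <- top_counts / n_transactions
print(round(relative_freq, 3))
```

Whole milk appears in roughly a quarter of all receipts. The same shares can be plotted directly by calling itemFrequencyPlot() with type = "relative" instead of "absolute".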

Mine association rules with apriori(), requiring a minimum support of 0.002, a minimum confidence of 0.5, and at most 10 items per rule.

association.rules = arules::apriori(tr, parameter=list(supp=0.002, conf=0.5, maxlen=10))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
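
The support threshold translates into a minimum number of receipts. The sketch below is plain arithmetic on the parameters above (variable names are my own). The log line reports 19, which appears to be the truncated value of 0.002 * 9835; a rule needs a whole-number count at or above that product to qualify.

```r
n_transactions <- 9835
min_support    <- 0.002

threshold <- min_support * n_transactions   # fractional number of transactions
min_count <- ceiling(threshold)             # smallest whole count that meets it

print(threshold)
print(min_count)
```

This is consistent with the smallest count of 20 reported in the rule summary.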

Summary of the mined rules.

summary(association.rules)
## set of 1098 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   6 576 471  45 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.505   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.002034   Min.   :0.5000   Min.   :0.002440   Min.   :1.957  
##  1st Qu.:0.002237   1st Qu.:0.5263   1st Qu.:0.003864   1st Qu.:2.194  
##  Median :0.002644   Median :0.5676   Median :0.004677   Median :2.584  
##  Mean   :0.003289   Mean   :0.5845   Mean   :0.005765   Mean   :2.668  
##  3rd Qu.:0.003559   3rd Qu.:0.6223   3rd Qu.:0.006304   3rd Qu.:2.899  
##  Max.   :0.022267   Max.   :0.8857   Max.   :0.043416   Max.   :7.154  
##      count       
##  Min.   : 20.00  
##  1st Qu.: 22.00  
##  Median : 26.00  
##  Mean   : 32.35  
##  3rd Qu.: 35.00  
##  Max.   :219.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr          9835   0.002        0.5
##                                                                                 call
##  arules::apriori(data = tr, parameter = list(supp = 0.002, conf = 0.5, maxlen = 10))

The first 10 rules, in the order apriori() returned them (not yet ranked by any quality measure).

inspect(association.rules[1:10])
##      lhs                                     rhs                support    
## [1]  {cereals}                            => {whole milk}       0.003660397
## [2]  {jam}                                => {whole milk}       0.002948653
## [3]  {specialty cheese}                   => {other vegetables} 0.004270463
## [4]  {rice}                               => {other vegetables} 0.003965430
## [5]  {rice}                               => {whole milk}       0.004677173
## [6]  {baking powder}                      => {whole milk}       0.009252669
## [7]  {specialty cheese, yogurt}           => {whole milk}       0.002033554
## [8]  {specialty cheese, whole milk}       => {yogurt}           0.002033554
## [9]  {other vegetables, specialty cheese} => {whole milk}       0.002236909
## [10] {specialty cheese, whole milk}       => {other vegetables} 0.002236909
##      confidence coverage    lift     count
## [1]  0.6428571  0.005693950 2.515917 36   
## [2]  0.5471698  0.005388917 2.141431 29   
## [3]  0.5000000  0.008540925 2.584078 42   
## [4]  0.5200000  0.007625826 2.687441 39   
## [5]  0.6133333  0.007625826 2.400371 46   
## [6]  0.5229885  0.017691917 2.046793 91   
## [7]  0.7142857  0.002846975 2.795464 20   
## [8]  0.5405405  0.003762074 3.874793 20   
## [9]  0.5238095  0.004270463 2.050007 22   
## [10] 0.5945946  0.003762074 3.072957 22
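
The quality columns are all derived from a handful of counts. As a check, the sketch below recomputes rule [6], {baking powder} => {whole milk}, from numbers already printed in this report: the rule count (91), the reported coverage, and the whole milk count from summary(tr). The variable names are assumptions for illustration.

```r
n_transactions <- 9835
count_rule     <- 91           # receipts with baking powder AND whole milk
coverage_lhs   <- 0.017691917  # reported coverage = support of {baking powder}
count_milk     <- 2513         # whole milk count from summary(tr)

# Recover the left-hand-side count from the coverage
count_lhs  <- round(coverage_lhs * n_transactions)

support    <- count_rule / n_transactions
confidence <- count_rule / count_lhs
lift       <- confidence / (count_milk / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The recomputed values should match the row for rule [6] above to the printed precision.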

Top 10 rules by lift. The rules are first filtered to those with confidence strictly greater than 50% (which drops rules at exactly 0.5), and the 10 with the highest lift are then listed.

subRules = association.rules[quality(association.rules)$confidence > 0.5]

top10RulesByLift = head(subRules, n = 10, by = "lift")
inspect(top10RulesByLift)
##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {butter,                                                                                      
##       hard cheese}        => {whipped/sour cream} 0.002033554  0.5128205 0.003965430 7.154028    20
## [2]  {beef,                                                                                        
##       citrus fruit,                                                                                
##       other vegetables}   => {root vegetables}    0.002135231  0.6363636 0.003355363 5.838280    21
## [3]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit,                                                                              
##       whole milk}         => {root vegetables}    0.003152008  0.6326531 0.004982206 5.804238    31
## [4]  {citrus fruit,                                                                                
##       frozen vegetables,                                                                           
##       other vegetables}   => {root vegetables}    0.002033554  0.6250000 0.003253686 5.734025    20
## [5]  {beef,                                                                                        
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.002745297  0.6136364 0.004473818 5.629770    27
## [6]  {bottled water,                                                                               
##       root vegetables,                                                                             
##       yogurt}             => {tropical fruit}     0.002236909  0.5789474 0.003863752 5.517391    22
## [7]  {herbs,                                                                                       
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.002440264  0.6000000 0.004067107 5.504664    24
## [8]  {grapes,                                                                                      
##       pip fruit}          => {tropical fruit}     0.002135231  0.5675676 0.003762074 5.408941    21
## [9]  {herbs,                                                                                       
##       yogurt}             => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
## [10] {beef,                                                                                        
##       other vegetables,                                                                            
##       soda}               => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
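
To make the filter-then-rank step concrete without the full rule set, the sketch below replays it in base R on just the first ten rules printed earlier, retyped into a plain data frame (the data frame and its column names are my own, not an arules structure). It illustrates that the strict `>` comparison drops the rule whose confidence is exactly 0.5.

```r
# First ten rules as printed by inspect(association.rules[1:10])
first10 <- data.frame(
  rule = c("{cereals} => {whole milk}",
           "{jam} => {whole milk}",
           "{specialty cheese} => {other vegetables}",
           "{rice} => {other vegetables}",
           "{rice} => {whole milk}",
           "{baking powder} => {whole milk}",
           "{specialty cheese, yogurt} => {whole milk}",
           "{specialty cheese, whole milk} => {yogurt}",
           "{other vegetables, specialty cheese} => {whole milk}",
           "{specialty cheese, whole milk} => {other vegetables}"),
  confidence = c(0.6428571, 0.5471698, 0.5, 0.52, 0.6133333,
                 0.5229885, 0.7142857, 0.5405405, 0.5238095, 0.5945946),
  lift = c(2.515917, 2.141431, 2.584078, 2.687441, 2.400371,
           2.046793, 2.795464, 3.874793, 2.050007, 3.072957),
  stringsAsFactors = FALSE
)

# Same logic as the subsetting and head(..., by = "lift") calls above
kept   <- first10[first10$confidence > 0.5, ]
ranked <- kept[order(kept$lift, decreasing = TRUE), ]

print(nrow(kept))
print(head(ranked, 3))
```

If rules at exactly the confidence threshold should be kept, the comparison would need to be `>=` instead.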

A parallel coordinates plot visualizes the flow from left-hand-side items to the right-hand-side item for the top 10 rules.

plot(top10RulesByLift, method="paracoord")

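Cluster analysis of the transactions using k-means. The transactions are converted to a 9835 x 169 item-incidence matrix, and each item column is scaled before clustering with 5 centers.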

tr_data = as(tr, "matrix")
norm_data = as.data.frame(scale(tr_data))
dim(norm_data)
## [1] 9835  169
set.seed(1234)
kmfit = kmeans(norm_data, centers=5, nstart = 25)
str(kmfit)
## List of 9
##  $ cluster     : int [1:9835] 4 4 4 4 4 1 4 4 4 4 ...
##  $ centers     : num [1:5, 1:169] 0.1246 -0.0598 0.3498 -0.0418 0.6703 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##  $ totss       : num 1661946
##  $ withinss    : num [1:5] 823255 16544 9517 741302 16571
##  $ tot.withinss: num 1607189
##  $ betweenss   : num 54757
##  $ size        : int [1:5] 2277 17 41 7477 23
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
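
A few quick checks on this output help judge how well the clusters separate. The sketch below retypes the sums of squares and cluster sizes from str(kmfit) (values as displayed, so rounded) and computes the share of total variation captured between clusters and the share of transactions in each cluster. The variable names are assumptions for illustration.

```r
# Values copied from the str(kmfit) output
totss     <- 1661946
withinss  <- c(823255, 16544, 9517, 741302, 16571)
betweenss <- 54757
sizes     <- c(2277, 17, 41, 7477, 23)
n <- 9835   # transactions
p <- 169    # item columns

# For scaled columns, each contributes (n - 1) to the total sum of squares
expected_totss <- (n - 1) * p

explained  <- betweenss / totss    # between-cluster share of total variation
size_share <- sizes / sum(sizes)   # fraction of transactions per cluster

print(expected_totss)
print(round(explained, 4))
print(round(size_share, 4))
```

Only a few percent of the total variation lies between clusters, and one cluster holds about three quarters of the transactions, so the 5-cluster solution may not separate the baskets very strongly. That is a reading of these numbers rather than a formal test.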

Use visualization to show the quality and separation of the identified clusters by projecting the normalized data (norm_data) onto the first two principal components.

factoextra::fviz_cluster(kmfit, data = norm_data,
                        
                         palette = "Set2",          
                         ellipse.type = "convex",   
                         repel = TRUE,            
                         star.plot = TRUE,         
                         ggtheme = theme_minimal(), 
                         main = "K-means Clustering Results",
                         xlab = "Principal Component 1",
                         ylab = "Principal Component 2") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold"))