DATA624 - Homework 10

Author

Anthony Josue Roman

Overview

This project uses market basket analysis to identify sets of goods that are often bought together. Association rules are evaluated by their support, confidence, and lift, and the ten rules with the highest lift are reported in descending order.
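As a minimal sketch of what the three measures mean, the base R snippet below computes them by hand for one rule. The five baskets and the helper `support()` are hypothetical and exist only for illustration; they are not taken from GroceryDataSet.csv, and the analysis itself relies on arules rather than this helper.

```r
# Hypothetical toy baskets (not taken from GroceryDataSet.csv).
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("bread", "milk"),
  c("milk"),
  c("butter", "milk")
)

# Illustrative helper: share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Metrics for the rule {bread} => {butter}.
lhs <- "bread"
rhs <- "butter"

supp_rule  <- support(c(lhs, rhs), baskets)       # share of baskets with lhs and rhs
confidence <- supp_rule / support(lhs, baskets)   # share of lhs baskets that also hold rhs
lift       <- confidence / support(rhs, baskets)  # confidence relative to the rhs base rate

round(c(support = supp_rule, confidence = confidence, lift = lift), 3)
```

A lift close to 1, as in this toy case, means the rule adds little beyond the base rate of the right-hand item; the rules reported later have much larger values.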

Load Libraries

library(arules)
library(arulesViz)
library(tidyverse)
library(cluster)
library(factoextra)

Load the Grocery Transactions

groceries <- read.transactions(
  "GroceryDataSet.csv",
  format = "basket",
  sep = ","
)

summary(groceries)

transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146 

most frequent items:
      whole milk other vegetables       rolls/buns             soda 
            2513             1903             1809             1715 
          yogurt          (Other) 
            1372            34055 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
  17   18   19   20   21   22   23   24   26   27   28   29   32 
  29   14   14    9   11    4    6    1    1    1    1    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   4.409   6.000  32.000 

includes extended item information - examples:
            labels
1 abrasive cleaner
2 artif. sweetener
3   baby cosmetics

Inspect the First Five Transactions

inspect(groceries[1:5])

    items                      
[1] {citrus fruit,             
     margarine,                
     ready soups,              
     semi-finished bread}      
[2] {coffee,                   
     tropical fruit,           
     yogurt}                   
[3] {whole milk}               
[4] {cream cheese,             
     meat spreads,             
     pip fruit,                
     yogurt}                   
[5] {condensed milk,           
     long life bakery product, 
     other vegetables,         
     whole milk}

Item Frequency Plot

itemFrequencyPlot(
  groceries,
  topN = 20,
  type = "absolute",
  main = "Top 20 Most Frequently Purchased Items"
)

Generate Association Rules

rules <- apriori(
  groceries,
  parameter = list(
    supp = 0.001,
    conf = 0.3,
    minlen = 2
  )
)

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.3    0.1    1 none FALSE            TRUE       5   0.001      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 9 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.00s].
writing ... [13770 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

summary(rules)

set of 13770 rules

rule length distribution (lhs + rhs):sizes
   2    3    4    5    6 
 239 5000 6942 1530   59 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   3.000   4.000   3.722   4.000   6.000 

summary of quality measures:
    support           confidence        coverage             lift       
 Min.   :0.001017   Min.   :0.3000   Min.   :0.001017   Min.   : 1.174  
 1st Qu.:0.001118   1st Qu.:0.3654   1st Qu.:0.002237   1st Qu.: 2.206  
 Median :0.001322   Median :0.4545   Median :0.003152   Median : 2.840  
 Mean   :0.001889   Mean   :0.4833   Mean   :0.004337   Mean   : 3.132  
 3rd Qu.:0.001830   3rd Qu.:0.5758   3rd Qu.:0.004474   3rd Qu.: 3.670  
 Max.   :0.074835   Max.   :1.0000   Max.   :0.193493   Max.   :35.716  
     count       
 Min.   : 10.00  
 1st Qu.: 11.00  
 Median : 13.00  
 Mean   : 18.58  
 3rd Qu.: 18.00  
 Max.   :736.00  

mining info:
      data ntransactions support confidence
 groceries          9835   0.001        0.3
                                                                              call
 apriori(data = groceries, parameter = list(supp = 0.001, conf = 0.3, minlen = 2))

Top 10 Rules by Lift

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)

top10_rules <- inspect(rules_lift[1:10])

     lhs                                rhs                support    
[1]  {bottled beer, red/blush wine}  => {liquor}           0.001931876
[2]  {ham, white bread}              => {processed cheese} 0.001931876
[3]  {bottled beer, liquor}          => {red/blush wine}   0.001931876
[4]  {Instant food products, soda}   => {hamburger meat}   0.001220132
[5]  {curd, sugar}                   => {flour}            0.001118454
[6]  {baking powder, sugar}          => {flour}            0.001016777
[7]  {processed cheese, white bread} => {ham}              0.001931876
[8]  {popcorn, soda}                 => {salty snack}      0.001220132
[9]  {baking powder, flour}          => {sugar}            0.001016777
[10] {ham, processed cheese}         => {white bread}      0.001931876
     confidence coverage    lift     count
[1]  0.3958333  0.004880529 35.71579 19   
[2]  0.3800000  0.005083884 22.92822 19   
[3]  0.4130435  0.004677173 21.49356 19   
[4]  0.6315789  0.001931876 18.99565 12   
[5]  0.3235294  0.003457041 18.60767 11   
[6]  0.3125000  0.003253686 17.97332 10   
[7]  0.4634146  0.004168785 17.80345 19   
[8]  0.6315789  0.001931876 16.69779 12   
[9]  0.5555556  0.001830198 16.40807 10   
[10] 0.6333333  0.003050330 15.04549 19

top10_rules

                                 lhs                   rhs     support
[1]   {bottled beer, red/blush wine} =>           {liquor} 0.001931876
[2]               {ham, white bread} => {processed cheese} 0.001931876
[3]           {bottled beer, liquor} =>   {red/blush wine} 0.001931876
[4]    {Instant food products, soda} =>   {hamburger meat} 0.001220132
[5]                    {curd, sugar} =>            {flour} 0.001118454
[6]           {baking powder, sugar} =>            {flour} 0.001016777
[7]  {processed cheese, white bread} =>              {ham} 0.001931876
[8]                  {popcorn, soda} =>      {salty snack} 0.001220132
[9]           {baking powder, flour} =>            {sugar} 0.001016777
[10]         {ham, processed cheese} =>      {white bread} 0.001931876
     confidence    coverage     lift count
[1]   0.3958333 0.004880529 35.71579    19
[2]   0.3800000 0.005083884 22.92822    19
[3]   0.4130435 0.004677173 21.49356    19
[4]   0.6315789 0.001931876 18.99565    12
[5]   0.3235294 0.003457041 18.60767    11
[6]   0.3125000 0.003253686 17.97332    10
[7]   0.4634146 0.004168785 17.80345    19
[8]   0.6315789 0.001931876 16.69779    12
[9]   0.5555556 0.001830198 16.40807    10
[10]  0.6333333 0.003050330 15.04549    19

The top association rules are ranked by lift, which measures how strongly the items in a rule depend on one another. The greater the lift, the stronger the association: customers who buy the items on the left-hand side are much more likely than the average customer to also buy the item on the right-hand side in the same transaction. All ten rules have a lift above 15, although each is supported by only 10 to 19 of the 9835 transactions.
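The columns of the table are tied together by simple identities, which can be checked by hand. The sketch below does so for the first rule, {bottled beer, red/blush wine} => {liquor}, using only the numbers printed above. It assumes the usual arules definitions, in which coverage is the support of the left-hand side; the variable names are illustrative.

```r
# Values copied from the first row of the top-10 table above.
n_transactions <- 9835
rule_support   <- 0.001931876
rule_conf      <- 0.3958333
rule_coverage  <- 0.004880529
rule_lift      <- 35.71579

# count = support * number of transactions
implied_count <- rule_support * n_transactions

# confidence = support / coverage, where coverage is the support of the lhs
implied_conf <- rule_support / rule_coverage

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
implied_rhs_support <- rule_conf / rule_lift

c(
  count       = round(implied_count),
  confidence  = round(implied_conf, 4),
  rhs_support = round(implied_rhs_support, 4)
)
```

The implied support of the right-hand item is small, which is why a confidence just under 0.4 still produces a lift above 35: lift is large when the consequent is rare overall but common among baskets that match the left-hand side.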

Association Rule Graph

plot(
  rules_lift[1:10],
  method = "graph",
  engine = "htmlwidget"
)

The association rule graph shows how the items in the ten highest-lift rules are connected. Items linked by the same rule are bought together more often than chance would predict, even though each rule covers only a small number of transactions. Several groups of connected products appear, including alcoholic beverages, snacks, baking ingredients, and processed foods.

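Lift measures the association between the two sides of a rule. It is the ratio of how often the two sides occur together to how often they would occur together if they were independent, so a value of 1 indicates no association and larger values indicate co-occurrence beyond what chance alone would produce.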

Extra Credit: Cluster Analysis

binary_matrix <- as(groceries, "matrix")

sample_data <- binary_matrix[1:500, ]

sample_data <- sample_data[, apply(sample_data, 2, var) > 0]

set.seed(123)
kmeans_result <- kmeans(sample_data, centers = 3, nstart = 25)

fviz_cluster(
  kmeans_result,
  data = sample_data,
  labelsize = 0
)

A basic k-means cluster analysis was run on a subset of the data, the first 500 transactions. After items with zero variance in that subset were removed, the transactions were grouped into three clusters according to which items they contain.

The plot shows most transactions concentrated near the origin, with a few lying far from the main mass. These outlying points may correspond to more distinctive baskets. Because the first two principal components explain only a small share of the variance in the data, the clusters overlap considerably in the plot, which is typical of sparse transaction data such as grocery baskets.
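To illustrate why two principal components capture so little of a sparse binary matrix, the sketch below runs a PCA on simulated data with roughly the same shape and density as the sample. The matrix `sim` is a hypothetical stand-in in which items are drawn independently, which real baskets are not, so the exact share will differ from the one in the plot above. It also assumes the cluster plot is a projection of standardized data onto the first two components.

```r
set.seed(123)

# Hypothetical stand-in for the sampled transactions: 500 baskets x 169 items,
# each item present independently with probability 0.026.
n_baskets <- 500
n_items   <- 169
sim <- matrix(
  rbinom(n_baskets * n_items, size = 1, prob = 0.026),
  nrow = n_baskets,
  ncol = n_items
)

# Drop constant columns, mirroring the variance filter used above.
sim <- sim[, apply(sim, 2, var) > 0, drop = FALSE]

# PCA on standardized columns.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share captured by the two components a 2-D cluster plot would display.
pve2 <- sum(var_explained[1:2])
pve2
```

The same calculation could be run on the real sample by substituting `sample_data` (converted to numeric, for example with `sample_data * 1`) for `sim`.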

In summary, the cluster analysis provides only a rough grouping of purchasing behavior.

Conclusion

Association rule mining was used in this study to extract patterns from grocery purchases. Rules were ranked by lift because lift highlights item sets that occur together more often than would be expected by chance. The cluster analysis added a rough grouping of transactions, although the clusters overlap heavily.