Homework 10 Data 624

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

groc <- read.transactions("GroceryDataSet.csv", sep=",")
summary(groc)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

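Each line in the dataset represents a shopping transaction (receipt), and each comma-separated field on that line is an item. The summary reports 9,835 transactions and 169 unique items, with a density of about 2.6% (the share of non-empty cells in the sparse transaction-by-item matrix). It also lists the five most frequent items, led by whole milk with 2,513 purchases, and the distribution of basket sizes: the median basket holds 3 items, the mean is 4.41, and the largest holds 32.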
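The density and mean basket size can be reproduced from the item counts printed in the summary. The following is a small arithmetic check in base R, assuming the counts are copied correctly from the output above; the variable names are illustrative only.

```r
# Counts copied from the summary(groc) output above
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items plus "(Other)"
n_transactions <- 9835
n_items <- 169

# Every item occurrence across all receipts
total_purchases <- sum(item_counts)

# Share of non-empty cells in the transaction-by-item matrix
density <- total_purchases / (n_transactions * n_items)

# Average number of items per receipt
mean_basket <- total_purchases / n_transactions

round(density, 5)
round(mean_basket, 3)
```

Both values agree with the density and the mean reported by summary().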

size(head(groc)) # number of items in each observation

## [1] 4 3 1 4 4 5

LIST(head(groc, 3))

## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"

size() reveals how many items were purchased in each transaction. LIST() displays the actual items in the first three transactions, helping us understand what a transaction looks like.

frequentItems <- eclat(groc, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(frequentItems)

##      items                          support    count
## [1]  {other vegetables, whole milk} 0.07483477  736 
## [2]  {whole milk}                   0.25551601 2513 
## [3]  {other vegetables}             0.19349263 1903 
## [4]  {rolls/buns}                   0.18393493 1809 
## [5]  {yogurt}                       0.13950178 1372 
## [6]  {soda}                         0.17437722 1715 
## [7]  {root vegetables}              0.10899847 1072 
## [8]  {tropical fruit}               0.10493137 1032 
## [9]  {bottled water}                0.11052364 1087 
## [10] {sausage}                      0.09395018  924 
## [11] {shopping bags}                0.09852567  969 
## [12] {citrus fruit}                 0.08276563  814 
## [13] {pastry}                       0.08896797  875 
## [14] {pip fruit}                    0.07564820  744 
## [15] {whipped/sour cream}           0.07168277  705 
## [16] {fruit/vegetable juice}        0.07229283  711 
## [17] {newspapers}                   0.07981698  785 
## [18] {bottled beer}                 0.08052872  792 
## [19] {canned beer}                  0.07768175  764

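The eclat algorithm identifies items or combinations of items whose support meets a minimum threshold. With supp = 0.07, an itemset must appear in roughly 7% of the 9,835 transactions (arules reports an absolute minimum support count of 688). Nineteen itemsets qualify: 18 single items and one pair, {other vegetables, whole milk}, which appears in 736 transactions.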
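To make the definition of support concrete, here is a minimal base R sketch on a handful of made-up baskets. The baskets and the helper function support() are hypothetical and are not part of arules or of the Groceries data.

```r
# Hypothetical toy baskets (not taken from the Groceries data)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("soda"),
  c("whole milk", "other vegetables", "yogurt"),
  c("rolls/buns", "soda")
)

# Support of an itemset: the share of baskets that contain every item in it
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

support("whole milk", baskets)                         # in 3 of 5 baskets
support(c("whole milk", "other vegetables"), baskets)  # in 2 of 5 baskets
```

An itemset is "frequent" when this share reaches the chosen threshold, which is what the supp parameter controls in eclat().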

itemFrequencyPlot(groc, topN=20, type="absolute", main="Item Frequency") # plot frequent items

This plot shows the 20 most frequently purchased items with their absolute counts, making it easy to see which items appear in the most baskets.

# Define thresholds
supp_val <- 0.001  # Minimum support: At least 0.1% of transactions
conf_val <- 0.9    # Minimum confidence: At least 90% reliability
maxlen_val <- 5    # Maximum number of items in a rule

rules <- apriori(groc, parameter=list(supp=supp_val, conf=conf_val, maxlen=maxlen_val))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5

## Warning in apriori(groc, parameter = list(supp = supp_val, conf = conf_val, :
## Mining stopped (maxlen reached). Only patterns up to a length of 5 returned!

##  done [0.02s].
## writing ... [123 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

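Support is the fraction of transactions that contain all of a rule’s items. A threshold as low as 0.001 (reported by arules as an absolute minimum count of 9) captures rare patterns but risks picking up noise. Confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side; here a rule is kept only if that fraction is at least 0.9. Maxlen caps rule complexity by limiting the total number of items in a rule, counting both sides. With these settings apriori returns 123 rules, and the warning notes that longer patterns were not searched because maxlen was reached.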
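Confidence can be illustrated the same way on a small incidence matrix, which mirrors how arules stores transactions internally. This is a sketch only: the matrix m and the helper confidence() are made up for illustration and do not come from the homework data.

```r
# Hypothetical 5 x 3 incidence matrix: rows are baskets, columns are items
m <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    FALSE, FALSE, FALSE,
    TRUE,  TRUE,  TRUE,
    FALSE, TRUE,  FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("whole milk", "yogurt", "other vegetables"))
)

# Confidence of lhs => rhs: among baskets holding all lhs items,
# the share that also hold the rhs item
confidence <- function(lhs, rhs, m) {
  has_lhs <- rowSums(m[, lhs, drop = FALSE]) == length(lhs)
  sum(has_lhs & m[, rhs]) / sum(has_lhs)
}

confidence("yogurt", "whole milk", m)                         # 2 of 3 yogurt baskets
confidence(c("whole milk", "yogurt"), "other vegetables", m)  # 1 of 2 baskets
```

A rule passes the conf = 0.9 filter used above only when this ratio is at least 0.9.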

rules_conf <- sort(rules, by="confidence", decreasing=TRUE)

inspect(head(rules_conf, 10))

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {rice,                                                                                      
##       sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2]  {canned fish,                                                                               
##       hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3]  {butter,                                                                                    
##       rice,                                                                                      
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4]  {flour,                                                                                     
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5]  {butter,                                                                                    
##       domestic eggs,                                                                             
##       soft cheese}        => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
## [7]  {butter,                                                                                    
##       hygiene articles,                                                                          
##       pip fruit}          => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [8]  {hygiene articles,                                                                          
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [9]  {hygiene articles,                                                                          
##       pip fruit,                                                                                 
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [10] {cream cheese,                                                                              
##       domestic eggs,                                                                             
##       sugar}              => {whole milk}       0.001118454          1 0.001118454 3.913649    11

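Sorting by confidence puts the most reliable rules first. All ten rules shown have a confidence of 1, but each rests on only 10 to 17 transactions, and nine of them predict whole milk, the most frequent item, which is why their lift is a modest 3.91.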

Lift Evaluation

rules_lift <- sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules_lift, 10))

##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                   
##       fruit/vegetable juice,                                                                          
##       other vegetables,                                                                               
##       soda}                   => {root vegetables}  0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {butter,                                                                                         
##       cream cheese,                                                                                   
##       root vegetables}        => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [4]  {butter,                                                                                         
##       sliced cheese,                                                                                  
##       tropical fruit,                                                                                 
##       whole milk}             => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [5]  {cream cheese,                                                                                   
##       curd,                                                                                           
##       other vegetables,                                                                               
##       whipped/sour cream}     => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [6]  {butter,                                                                                         
##       other vegetables,                                                                               
##       tropical fruit,                                                                                 
##       white bread}            => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [7]  {citrus fruit,                                                                                   
##       root vegetables,                                                                                
##       soft cheese}            => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [8]  {brown bread,                                                                                    
##       pip fruit,                                                                                      
##       whipped/sour cream}     => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [9]  {grapes,                                                                                         
##       tropical fruit,                                                                                 
##       whole milk,                                                                                     
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [10] {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10

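Lift compares how often the items in a rule are bought together with how often they would be if they were purchased independently: it is the rule’s confidence divided by the support of the right-hand side. A lift above 1 indicates a positive association, and higher values indicate stronger ones. For example, a lift of 11.24 for {liquor, red/blush wine} => {bottled beer} means that baskets containing liquor and red/blush wine are about 11 times more likely to include bottled beer than an average basket.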
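The reported lift values can be checked by hand from numbers already printed in this report, using lift = confidence / support(rhs). The snippet below is a quick arithmetic check in base R; the variable names are illustrative.

```r
# Values copied from the inspect() and eclat output above
conf_rule <- 0.9047619   # confidence of {liquor, red/blush wine} => {bottled beer}
supp_beer <- 0.08052872  # support of {bottled beer}
supp_milk <- 0.25551601  # support of {whole milk}

# lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
lift_beer <- conf_rule / supp_beer

# Any rule with confidence 1 that predicts whole milk has this lift
lift_milk <- 1 / supp_milk

round(lift_beer, 2)
round(lift_milk, 3)
```

This also explains why every confidence-1 rule predicting whole milk shares the same lift of 3.913649: the lift depends only on the confidence and on how common the right-hand side is.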

library(arulesViz)  # provides the plot() methods for rules

rules1 <- head(rules_lift, n = 10, by = "lift")
plot(rules1, method = "grouped", control = list(k = 10))

library(ggplot2)

# Extract data for the scatterplot
rules_df <- as(rules_lift, "data.frame")

# Create the scatter plot
ggplot(rules_df, aes(x = support, y = lift, color = confidence)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_gradient(low = "lightpink", high = "red", name = "Confidence") +
  labs(
    title = "Scatter Plot of Association Rules",
    x = "Support",
    y = "Lift"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

Nikoleta Emanouilidi

2024-11-23