Association Rules_groceries

Introduction

Association rule mining uncovers patterns and relationships between items in transactional data. It is widely applied in market basket analysis to identify product bundles and understand customer purchasing behavior. This project applies the Apriori algorithm to a Groceries dataset to discover frequent itemsets and generate strong association rules. These insights can inform marketing strategies, such as product recommendations and cross-selling.

Dataset Description

The Groceries dataset consists of transactional data collected from a retail store. Key characteristics of the dataset:

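Transactions: Each row of the raw file records one item bought by a member. Rows are grouped by Member_number, so each transaction is the set of distinct items purchased by one member.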
Items: Products purchased by customers, including food and beverages like “whole milk,” “sausage,” “tropical fruit,” etc.
Objective: To identify frequently purchased combinations and derive actionable insights.

Data Quality Check

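# Packages used throughout this report
library(arules)     # transactions class, apriori(), inspect()
library(arulesViz)  # plot() methods for rules
library(ggplot2)    # barplot of top rules

data <- read.csv("C:/Users/abdul/Documents/data/Groceries_dataset.csv")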

missing_values <- sum(is.na(data))
if (missing_values > 0) {
  cat("Missing values detected: ", missing_values, "\n")
  data <- na.omit(data)
  cat("Missing values removed.\n")
} else {
  cat("No missing values detected.\n")
}

## Missing values detected:  28765 
## Missing values removed.

transactions <- split(data$itemDescription, data$Member_number)
transactions <- lapply(transactions, unique)  # Remove duplicates in transactions
transactions <- as(transactions, "transactions")

summary(transactions)

## transactions as itemMatrix in sparse format with
##  3512 rows (elements/itemsets/transactions) and
##  158 columns (items) and a density of 0.01721223 
## 
## most frequent items:
##          sausage       whole milk      frankfurter   tropical fruit 
##              729              670              536              454 
## other vegetables          (Other) 
##              441             6721 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11 
## 937 897 722 471 271 132  47  26   5   3   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    2.72    4.00   11.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1          1000
## 2          1001
## 3          1002
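
The mean transaction size and the density reported above can be cross-checked from the printed size distribution. The following is a minimal base R sketch; the vectors are copied by hand from the summary output, and the variable names are illustrative rather than part of the original analysis.

```r
# Transaction sizes and their frequencies, copied from the summary() output
sizes  <- 1:11
counts <- c(937, 897, 722, 471, 271, 132, 47, 26, 5, 3, 1)

n_transactions <- sum(counts)          # number of transactions (rows)
n_items        <- 158                  # number of distinct items (columns)
total_entries  <- sum(sizes * counts)  # item occurrences across all transactions

mean_size <- total_entries / n_transactions
# Density = share of filled cells in the transactions-by-items matrix
density   <- total_entries / (n_transactions * n_items)

cat("Transactions:", n_transactions, "\n")
cat("Mean size:   ", round(mean_size, 2), "\n")
cat("Density:     ", signif(density, 7), "\n")
```

A density of about 1.7% means the average transaction covers fewer than 3 of the 158 items, so specific item combinations are rare and a low support threshold is needed to find any rules.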

Exploratory Data Analysis

itemFrequencyPlot(transactions, topN = 20, col = "steelblue", main = "Top 20 Purchased Items")

Applying the Apriori Algorithm

The Apriori Algorithm is employed to mine frequent itemsets and generate association rules. Key thresholds:

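Support: Proportion of transactions that contain an itemset. The minimum is set to 0.002.
Confidence: Estimated probability that a transaction contains the consequent given that it contains the antecedent. The minimum is set to 0.4.
Lift: Confidence divided by the overall support of the consequent. Values above 1 mean the items co-occur more often than expected if they were independent.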
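
To make the three measures concrete, here is a small base R sketch on a hypothetical set of five baskets. The baskets and the helper functions `contains()` and `support()` are invented for illustration and are not part of the Groceries data or the arules package.

```r
# Hypothetical toy baskets, for illustration only (not from the Groceries data)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk"),
  c("butter", "bread")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Support of an itemset = share of baskets containing all of its items
support <- function(items) mean(contains(items))

lhs <- "milk"   # antecedent
rhs <- "bread"  # consequent

rule_support    <- support(c(lhs, rhs))          # share with both items
rule_confidence <- rule_support / support(lhs)   # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)

cat("support:", rule_support,
    " confidence:", rule_confidence,
    " lift:", rule_lift, "\n")
```

In this toy data the lift is below 1, meaning milk buyers are slightly less likely than average to buy bread. The analysis below keeps only rules with lift above 1.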

Generating Rules

rules <- apriori(transactions, 
                 parameter = list(supp = 0.002, conf = 0.4, minlen = 2, maxlen = 5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.002      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[158 item(s), 3512 transaction(s)] done [0.00s].
## sorting and recoding items ... [95 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
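
The "Absolute minimum support count" line in the log follows from the relative threshold and the number of transactions. A quick check using only numbers printed above is sketched here; the variable names are illustrative, and the truncation step is an assumption that happens to reproduce the printed 7, since the log does not show how the product is rounded.

```r
min_support    <- 0.002   # supp passed to apriori()
n_transactions <- 3512    # from the "set transactions" line of the log

# Relative support threshold expressed as a number of transactions
raw_count <- min_support * n_transactions

cat("Relative threshold as a count:", raw_count, "\n")
cat("Truncated to an integer:      ", trunc(raw_count), "\n")
```

A support of 0.002 therefore corresponds to only about 7 of the 3,512 transactions, so any rule near the threshold rests on very few observations.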

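# Sort rules by lift, strongest first
rules <- sort(rules, by = "lift")

# Keep rules with lift above 1. The confidence filter here is strict (>),
# whereas apriori() treats 0.4 as a minimum, so rules sitting exactly at
# the threshold are dropped at this step.
strong_rules <- subset(rules, lift > 1.0 & confidence > 0.4)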
if (length(strong_rules) > 100) {
  strong_rules <- strong_rules[1:100]
}

Inspecting Strong Rules

if (length(strong_rules) > 0) {
  inspect(strong_rules[1:min(10, length(strong_rules))])
} else {
  print("No strong association rules found.")
}

##     lhs                                 rhs          support     confidence
## [1] {bottled beer, other vegetables} => {sausage}    0.002847380 0.5263158 
## [2] {pastry, tropical fruit}         => {whole milk} 0.003416856 0.4800000 
## [3] {root vegetables, soda}          => {sausage}    0.002277904 0.4210526 
## [4] {beef, rolls/buns}               => {sausage}    0.003132118 0.4074074 
##     coverage    lift     count
## [1] 0.005410023 2.535557 10   
## [2] 0.007118451 2.516060 12   
## [3] 0.005410023 2.028446  8   
## [4] 0.007687927 1.962709 11
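
The printed measures can be reproduced from raw counts, which also shows how small the underlying samples are. The sketch below uses base R and only numbers from the outputs above. The antecedent counts are not printed directly; they are derived here as coverage times the number of transactions, and the variable names are illustrative.

```r
n_transactions <- 3512                   # from summary(transactions)

# One entry per rule, in the order printed by inspect()
rule_count <- c(10, 12, 8, 11)           # "count" column
lhs_count  <- round(c(0.005410023, 0.007118451, 0.005410023, 0.007687927) *
                    n_transactions)      # coverage * n (derived, not printed)
rhs_count  <- c(729, 670, 729, 729)      # sausage = 729, whole milk = 670

support    <- rule_count / n_transactions            # share with lhs and rhs
confidence <- rule_count / lhs_count                 # P(rhs | lhs)
lift       <- confidence / (rhs_count / n_transactions)

print(data.frame(lhs_count, support, confidence, lift))
```

The antecedents occur in only 19 to 27 of the 3,512 transactions, so each confidence estimate rests on a small sample and should be read with that in mind.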

Visualizations

Network Graph of Rules

if (length(rules) > 0) {
  plot(rules[1:min(20, length(rules))], method = "graph", control = list(layout = "circle", max = 20))
}

Scatterplot of Rules

if (length(rules) > 0) {
  plot(rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift", jitter = 0)
}

Grouped Matrix Visualization

if (length(rules) > 0) {
  plot(rules[1:min(20, length(rules))], method = "grouped")
}

Barplot of Top 10 Rules by Lift

if (length(strong_rules) > 0) {
  num_rules <- min(10, length(strong_rules))

  # Select the top rules by lift first, then convert, so that labels and
  # quality measures always come from the same rules
  top_rules <- as(head(sort(strong_rules, by = "lift"), num_rules), "data.frame")
  # The data frame already has a `rules` column formatted as "{lhs} => {rhs}"

  ggplot(top_rules, aes(x = reorder(rules, lift), y = lift)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    geom_text(aes(label = round(lift, 2)), hjust = -0.2) +
    labs(title = "Top 10 Association Rules by Lift", x = "Rules", y = "Lift")
}

Interpretation of Results

Key Insights

Top Purchased Items:
- The most frequently purchased items are sausage, whole milk, and frankfurter. Sausage or whole milk is the consequent of every strong rule found.
Network Graph:
- Items such as whole milk and tropical fruit form strong connections with other items, indicating frequent co-purchases.
Scatterplot:
- The rules {bottled beer, other vegetables} => {sausage} and {pastry, tropical fruit} => {whole milk} have the highest lift (about 2.5), although each is supported by only 10 to 12 transactions.
Barplot of Rules:
- Ranked by lift, the four strong rules range from about 1.96 to 2.54. They are candidates for strategies such as product bundling or targeted promotions.

Conclusion

This analysis demonstrates the effectiveness of association rule mining for discovering meaningful patterns in transactional data. The insights derived can guide strategies for product bundling, targeted promotions, and inventory management. Future work can extend this analysis by exploring temporal trends or incorporating demographic data to further refine insights.

Association Rules_groceries_USL

Abdul Hannan

2025-02-09