1. Introduction

The goal of this project is to perform a comprehensive Market Basket Analysis (MBA) on an online retail dataset. While standard association rules (Apriori algorithm) are excellent for finding frequent patterns, they often miss high-value relationships that occur less frequently.

This project compares two approaches:

Standard Apriori: Identifying the most popular item combinations.

Goal: To find items that are frequently bought together.

How it works: It uses the Apriori algorithm to identify patterns based on frequency (Support).

Typical Result: Rules like {Bread} => {Milk}.

Business Use: Great for store layout optimization (putting popular items together) or general product bundling.

Limitation: It is biased towards cheap, high-volume items (like food or stationery) and often ignores expensive items that sell less frequently.

Weighted/Value-Based Apriori: Identifying the most profitable item combinations by incorporating item price into the rule evaluation.

Goal: To find item combinations that generate the most revenue or profit.

How it works: It integrates item prices (or profit margins) into the analysis. It typically requires lowering the support threshold (to catch rare items) and then filtering the resulting rules based on their monetary value.

Typical Result: Rules like {Laptop} => {Extended Warranty}.

Business Use: Essential for cross-selling strategies where the goal is to increase the total basket value, rather than just the number of items sold.
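The contrast between the two approaches can be sketched on a handful of toy baskets. The baskets, the helper names `support()` and `rule_metrics()`, and the 0.2 threshold below are illustrative assumptions, not part of the project's data or code; the sketch only shows how the usual support, confidence and lift figures are derived in base R.

```r
# Toy baskets (hypothetical data, not taken from the Online Retail II dataset)
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Milk", "Eggs"),
  c("Bread", "Eggs"),
  c("Milk"),
  c("Laptop", "Extended Warranty"),
  c("Bread", "Milk"),
  c("Eggs"),
  c("Bread")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift for the rule lhs => rhs
rule_metrics <- function(lhs, rhs, baskets) {
  supp <- support(c(lhs, rhs), baskets)
  conf <- supp / support(lhs, baskets)
  lift <- conf / support(rhs, baskets)
  c(support = supp, confidence = conf, lift = lift)
}

bread_milk      <- rule_metrics("Bread", "Milk", baskets)
laptop_warranty <- rule_metrics("Laptop", "Extended Warranty", baskets)
print(round(rbind(bread_milk, laptop_warranty), 3))

# With an assumed minimum support of 0.2, only the frequent rule survives
min_support <- 0.2
print(c(bread_milk      = bread_milk[["support"]] >= min_support,
        laptop_warranty = laptop_warranty[["support"]] >= min_support))
```

In this toy set the laptop rule has the higher confidence and lift but appears in a single basket, so a frequency threshold discards it. That is the situation the value-based approach is meant to address.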

2. Review of the dataset

We utilize the Online Retail II dataset, which contains transactions from a UK-based online retailer. The dataset includes fields such as:

  • Invoice
  • StockCode
  • Description
  • Quantity
  • InvoiceDate
  • Price

3. Data Preparation

3.1 Loading Libraries

First, we load the required libraries.

# Load packages
library(tidyverse)
library(arules)
library(arulesViz)
library(DT)
library(lubridate)
library(knitr)

3.2 Reading Data

We load the dataset and perform an initial inspection.

df <- read.csv("online_retail_II.csv", stringsAsFactors = FALSE)
knitr::kable(head(df, 5), caption = "Preview of Uploaded Data")
Preview of Uploaded Data

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country |
|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom |

3.3 Data Cleaning

Real-world retail data requires significant cleaning. We perform the following steps:

  1. Remove Missing IDs: Transactions without a Customer ID cannot be tracked effectively.

  2. Remove Cancellations: Invoices starting with C represent returned items.

  3. Validate Values: We ensure Quantity and Price are positive.

  4. Feature Engineering: We create a TotalValue column to understand the monetary impact of each transaction.

df_clean <- df %>%
  filter(!is.na(Customer.ID)) %>%
  filter(!grepl("^C", Invoice)) %>%
  filter(Quantity > 0, Price > 0) %>%
  mutate(InvoiceDate = parse_date_time(InvoiceDate, orders = c("mdy HM", "dmy HM", "ymd HMS"))) %>%
  mutate(Date = as.Date(InvoiceDate)) %>%
  mutate(TotalValue = Quantity * Price)
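To see what each filter removes, the same conditions can be applied to a few hand-made rows. This is a minimal sketch in base R so that it runs without the tidyverse; the `toy` data frame and its values are hypothetical, and the date parsing step is left out.

```r
# Hypothetical rows, one per case the cleaning step is meant to handle
toy <- data.frame(
  Invoice     = c("489434", "489435", "C489449", "489436", "489437"),
  Quantity    = c(12, 6, -1, 4, -3),
  Price       = c(6.95, 2.10, 4.25, 0, 1.25),
  Customer.ID = c(13085, NA, 16321, 13078, 15362),
  stringsAsFactors = FALSE
)

keep <- !is.na(toy$Customer.ID) &    # drop rows without a customer ID
  !grepl("^C", toy$Invoice) &        # drop cancellation invoices
  toy$Quantity > 0 & toy$Price > 0   # drop non-positive quantities and prices

toy_clean <- toy[keep, ]
toy_clean$TotalValue <- toy_clean$Quantity * toy_clean$Price
print(toy_clean)
```

Only the first row passes all conditions. Each of the other rows fails at least one of them.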

3.4 Preparing Transactions

# Group by Invoice
trans_list <- split(df_clean$Description, df_clean$Invoice)
trans <- as(trans_list, "transactions")

The original dataset is in long format, where each item in a single invoice occupies its own row. However, the Apriori algorithm requires data in a basket format, where each row represents a unique transaction containing a set of items.

We use the split() function to group items by their Invoice number and then convert this list into a sparse transactions object using the arules package. This structure is memory-efficient and required for calculating support and confidence.
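The long-to-basket conversion can be illustrated on a tiny example. The sketch below uses only base R and builds a dense 0/1 matrix for readability; the arules transactions object stores the same information in sparse form. The `long` data frame is hypothetical, and the `unique()` call is an added precaution for invoices that list the same item more than once.

```r
# Hypothetical long-format data: one row per item per invoice
long <- data.frame(
  Invoice     = c("A1", "A1", "A2", "A2", "A2", "A3"),
  Description = c("MUG", "PLATE", "MUG", "BOWL", "MUG", "PLATE"),
  stringsAsFactors = FALSE
)

# One character vector of items per invoice; unique() drops repeated lines
basket_list <- lapply(split(long$Description, long$Invoice), unique)

# Dense incidence matrix: rows are baskets, columns are items
items <- sort(unique(long$Description))
incidence <- t(vapply(basket_list,
                      function(b) as.integer(items %in% b),
                      integer(length(items))))
colnames(incidence) <- items
print(incidence)

# Column means give the support of each single item
print(round(colMeans(incidence), 3))
```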

3.5 Calculating Item Price (For Weighted Analysis)

To find high-value rules, we need the average price of each item.

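# Average unit price per item; the AvgPrice column is used again in Section 5.2
item_prices <- df_clean %>%
  group_by(Description) %>%
  summarise(AvgPrice = mean(Price, na.rm = TRUE))

# Preview the first rows of the price table
knitr::kable(head(item_prices, 5), caption = "Most Expensive Items")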
Most Expensive Items
Description
DOORMAT UNION JACK GUNS AND ROSES
3 STRIPEY MICE FELTCRAFT
4 PURPLE FLOCK DINNER CANDLES
50’S CHRISTMAS GIFT BAG LARGE
ANIMAL STICKERS
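How the average is defined can matter when the same item is sold at different prices. The sketch below compares a simple mean of the listed prices with a quantity-weighted mean (revenue divided by units sold). The `sales` data are hypothetical and the weighted variant is shown only as an alternative, not as the method used in this report.

```r
# Hypothetical sales lines: FRAME is sold in bulk at a discounted price
sales <- data.frame(
  Description = c("FRAME", "FRAME", "FRAME", "CANDLE", "CANDLE"),
  Quantity    = c(1, 1, 48, 10, 10),
  Price       = c(12.50, 12.50, 9.00, 1.25, 1.05),
  stringsAsFactors = FALSE
)

# Simple mean of the listed prices, one value per item
avg_simple <- tapply(sales$Price, sales$Description, mean)

# Quantity-weighted mean: total revenue divided by total units sold
revenue      <- tapply(sales$Quantity * sales$Price, sales$Description, sum)
units        <- tapply(sales$Quantity, sales$Description, sum)
avg_weighted <- revenue / units

print(round(cbind(avg_simple, avg_weighted), 2))
```

In this toy case FRAME averages above 10 under the simple mean and below 10 under the weighted mean, so the choice would change whether it passes a price filter of 10.00.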

4. Exploratory Data Analysis (EDA)

Before mining rules, we analyze the basic characteristics of the transactions.

4.2 Transaction Size Distribution

How many items do people usually buy in one go?

Figure 2. Distribution of Basket Sizes

Most Common Behavior: The highest concentration of transactions involves very small baskets, with 1–5 items being the most frequent.

Distribution Trend: There is a steady decline in frequency as the number of items per basket increases.

Volume Metrics: Thousands of baskets contain fewer than 10 items, whereas baskets with 30 or more items number well under 1,000.
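A basket-size distribution of this kind can be tabulated from a list of baskets as follows. This is a sketch on made-up baskets; the bin edges are assumptions chosen to mirror the ranges discussed above. For an arules transactions object, `size(trans)` should give the same per-basket counts.

```r
# Hypothetical baskets with 1, 2, 3, 5, 6, 10, 12 and 31 items
basket_list <- lapply(c(1, 2, 3, 5, 6, 10, 12, 31),
                      function(n) paste0("ITEM", seq_len(n)))

# Number of items in each basket
sizes <- lengths(basket_list)

# Group the sizes into ranges (right-closed intervals)
bins <- cut(sizes,
            breaks = c(0, 5, 10, 30, Inf),
            labels = c("1-5", "6-10", "11-30", "31+"))
size_counts <- table(bins)
print(size_counts)
```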

5. Association Rules

5.1 Standard Association Rules

We look for standard patterns with:

  • Support: 0.01 (the items in a rule appear together in at least 1% of baskets)

  • Confidence: 0.5 (at least 50% of baskets containing the left-hand side also contain the right-hand side)

rules_std <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 369 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5283 item(s), 36969 transaction(s)] done [0.15s].
## sorting and recoding items ... [538 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [98 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
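The "Absolute minimum support count" in the log follows from the relative threshold and the number of transactions. The arithmetic below is a sketch that assumes the count is obtained by truncating the product, which is consistent with the figure of 369 reported above.

```r
n_transactions <- 36969   # from the "set transactions" line of the log

# Relative support threshold converted to a number of baskets
min_count_std <- floor(0.01 * n_transactions)    # 0.01 * 36969 = 369.69
min_count_low <- floor(0.001 * n_transactions)   # 0.001 * 36969 = 36.969

print(c(support_0.01 = min_count_std, support_0.001 = min_count_low))
```

At a support of 0.01 an itemset needs roughly 370 baskets to qualify. At 0.001 a few dozen baskets are enough, which is why a lower threshold lets much rarer items into the rule set.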

5.1.1 Visualizing the Rules

library(arules)
library(arulesViz)
inspect(head(sort(rules_std, by = "lift"), 5))
##     lhs                                rhs                                support confidence   coverage     lift count
## [1] {POPPY'S PLAYHOUSE LIVINGROOM } => {POPPY'S PLAYHOUSE BEDROOM }    0.01033298  0.8304348 0.01244286 55.21645   382
## [2] {POPPY'S PLAYHOUSE BEDROOM }    => {POPPY'S PLAYHOUSE LIVINGROOM } 0.01033298  0.6870504 0.01503963 55.21645   382
## [3] {POPPY'S PLAYHOUSE KITCHEN}     => {POPPY'S PLAYHOUSE LIVINGROOM } 0.01103627  0.6623377 0.01666261 53.23035   408
## [4] {POPPY'S PLAYHOUSE LIVINGROOM } => {POPPY'S PLAYHOUSE KITCHEN}     0.01103627  0.8869565 0.01244286 53.23035   408
## [5] {POPPY'S PLAYHOUSE BEDROOM }    => {POPPY'S PLAYHOUSE KITCHEN}     0.01276745  0.8489209 0.01503963 50.94765   472
plot(head(sort(rules_std, by = "lift"), 10), method = "graph", engine = "htmlwidget")
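The quality measures in the table are linked by simple ratios, so they can be cross-checked by hand. The sketch below recomputes support, confidence and lift for rule [1] from figures printed above. It assumes that the coverage of rule [2] equals the support of {POPPY'S PLAYHOUSE BEDROOM }, since coverage is the support of a rule's left-hand side.

```r
n            <- 36969        # number of transactions (from the Apriori log)
count_rule   <- 382          # count for rule [1]
coverage_lhs <- 0.01244286   # coverage of rule [1]: support of the LIVINGROOM item
support_rhs  <- 0.01503963   # coverage of rule [2]: support of the BEDROOM item

support_rule <- count_rule / n               # share of baskets with both items
confidence   <- support_rule / coverage_lhs  # P(rhs | lhs)
lift         <- confidence / support_rhs     # confidence relative to the rhs base rate

print(round(c(support = support_rule, confidence = confidence, lift = lift), 4))
```

The results agree with the reported values (0.0103, 0.8304 and 55.22) up to rounding of the printed inputs.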

5.2 Weighted (High-Value) Rules

The Problem: The standard analysis above highlights cheap items (Lunch Bags, Bunting). It misses expensive items (Furniture, Electronics) because they sell less often (lower support).

The Solution: We implement a value-based approach:

  1. Calculate the average price of every item.

  2. Lower the support threshold (to 0.001) to catch rare items.

  3. Filter the resulting rules based on the value of the recommended item.

The average item price was already calculated in Section 3.5.

Mining & Filtering High-Value Rules

# We lower support to find rare expensive items, then filter by Price
# 1. Run Apriori with lower support (0.001) to catch expensive items that sell less often
rules_weighted <- apriori(trans, parameter = list(supp = 0.001, conf = 0.3))

# 2. Map Prices to the Rules
# We look at the Right-Hand-Side (RHS) of the rule
rhs_labels <- labels(rhs(rules_weighted))
rhs_labels <- gsub("\\{|\\}", "", rhs_labels) 
# Match with price list
rule_values <- item_prices$AvgPrice[match(rhs_labels, item_prices$Description)]
# Add value to the rule quality measures
quality(rules_weighted)$RuleValue <- rule_values

# 3. Filter: Keep only rules where the recommended item costs > 10.00
rules_high_value <- subset(rules_weighted, RuleValue > 10.00)
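The price lookup and filter can be tried out on a plain data frame without mining any rules. In the sketch below the `rules_df` and `prices` tables are hypothetical stand-ins for the rule set and for `item_prices`. It also makes explicit that a right-hand-side item missing from the price table gets `NA` and is excluded from the high-value set.

```r
# Hypothetical rules, written the way arules labels them
rules_df <- data.frame(
  lhs        = c("{FRAME A}", "{MUG}", "{CANDLE}"),
  rhs        = c("{FRAME B}", "{PLATE}", "{UNKNOWN ITEM}"),
  confidence = c(0.77, 0.55, 0.40),
  stringsAsFactors = FALSE
)

# Hypothetical price table with the same columns as item_prices
prices <- data.frame(
  Description = c("FRAME B", "PLATE", "MUG"),
  AvgPrice    = c(12.39, 2.10, 1.65),
  stringsAsFactors = FALSE
)

# Strip the surrounding braces, then look up each RHS item's average price
rhs_item <- gsub("^\\{|\\}$", "", rules_df$rhs)
rules_df$RuleValue <- prices$AvgPrice[match(rhs_item, prices$Description)]

# Keep rules whose recommended item costs more than 10.00; unmatched items are dropped
high_value <- rules_df[!is.na(rules_df$RuleValue) & rules_df$RuleValue > 10, ]
print(high_value)
```

Matching on a single label works here because each rule produced by `apriori()` has exactly one item on its right-hand side.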

5.3 Comparison of Results

Summary of Key Association Rules

| Type | LHS | RHS | Support | Confidence | Lift | Count | RuleValue |
|---|---|---|---|---|---|---|---|
| Top Standard Rule (Popularity) | {POPPY’S PLAYHOUSE LIVINGROOM} | {POPPY’S PLAYHOUSE BEDROOM} | 0.01033 | 0.8304 | 55.22 | 382 | |
| Top Weighted Rule (Profit) | {LANDMARK FRAME COVENT GARDEN} | {LANDMARK FRAME OXFORD STREET} | 0.00108 | 0.7692 | 526.62 | 40 | 12.389 |

Key Insights

Standard Rule: Shows high Confidence (83%), meaning customers buying the Livingroom set are very likely to buy the Bedroom set as well.

  • Business Impact: These rules drive volume. They are useful for increasing the number of items in a basket, but since the items cost very little, they have a marginal impact on total revenue.

Weighted Rule: Shows an extremely high Lift (526.62), indicating a very strong association between the two Landmark Frame locations that is far from random.

  • Business Impact: These rules drive value. Even though these transactions happen less often (lower support), a single conversion on this rule is worth 10x–20x more revenue than a standard rule.

Rule Value: The RuleValue of 12.389 is the average price of the recommended (right-hand-side) item. It is what qualifies this rule under the price filter of 10.00, a dimension that the standard, popularity-based ranking does not consider.

Conclusion: Relying solely on standard Apriori biases the strategy towards cheap and cheerful products. The Weighted analysis balances this by uncovering the hidden gems that drive the store’s actual profitability.

6. Interactive Dashboard (Shiny App)

To allow exploring these rules dynamically, a Shiny App was developed.

Link to the Shiny App