The goal of this project is to perform a comprehensive Market Basket Analysis (MBA) on an online retail dataset. While standard association rules (Apriori algorithm) are excellent for finding frequent patterns, they often miss high-value relationships that occur less frequently.
This project compares two approaches:
Standard Apriori: Identifying the most popular item combinations.
Goal: To find items that are frequently bought together.
How it works: It uses the Apriori algorithm to identify patterns based on frequency (Support).
Typical Result: Rules like {Bread} => {Milk}.
Business Use: Great for store layout optimization (putting popular items together) or general product bundling.
Limitation: It is biased towards cheap, high-volume items (like food or stationery) and often ignores expensive items that sell less frequently.
Weighted/Value-Based Apriori: Identifying the most profitable item combinations by incorporating item price into the rule evaluation.
Goal: To find item combinations that generate the most revenue or profit.
How it works: It integrates item prices (or profit margins) into the analysis. It typically requires lowering the support threshold (to catch rare items) and then filtering the resulting rules based on their monetary value.
Typical Result: Rules like {Laptop} => {Extended Warranty}.
Business Use: Essential for cross-selling strategies where the goal is to increase the total basket value, rather than just the number of items sold.
We use the Online Retail II dataset, which contains transactions from a UK-based online retailer. The dataset includes the following fields: Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer.ID, and Country.
First, we load the required libraries. The code in this report relies on dplyr, arules, and knitr.
We then load the dataset and perform an initial inspection.
df <- read.csv("online_retail_II.csv", stringsAsFactors = FALSE)
knitr::kable(head(df, 5), caption = "Preview of Uploaded Data")

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country |
|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom |
Real-world retail data requires significant cleaning. We perform the following steps:
Remove Missing IDs: Transactions without a Customer ID cannot be tracked effectively.
Remove Cancellations: Invoices starting with C represent returned items.
Validate Values: We ensure Quantity and Price are positive.
Feature Engineering: We create a TotalValue column to understand the monetary impact of each transaction.
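The cleaning code itself is not shown in this section. A minimal base R sketch of the four steps, assuming the column names from the preview table and that TotalValue is Quantity times Price, might look like the following. The `toy` data frame and its values are invented for illustration only.

```r
# Toy stand-in for the raw data; all values are made up.
toy <- data.frame(
  Invoice     = c("489434", "489434", "C489449", "489450", "489451"),
  Description = c("PINK CHERRY LIGHTS", "WHITE CHERRY LIGHTS",
                  "PINK CHERRY LIGHTS", "ANIMAL STICKERS", "ANIMAL STICKERS"),
  Quantity    = c(12, 12, -12, 5, 3),
  Price       = c(6.75, 6.75, 6.75, 0.21, 0),
  Customer.ID = c(13085, 13085, 13085, NA, 13090),
  stringsAsFactors = FALSE
)

keep <- !is.na(toy$Customer.ID) &     # 1. drop rows with no customer ID
  !startsWith(toy$Invoice, "C") &     # 2. drop cancellations (invoice starts with C)
  toy$Quantity > 0 & toy$Price > 0    # 3. keep only positive quantity and price

toy_clean <- toy[keep, ]

# 4. Monetary value of each invoice line (assumed definition of TotalValue).
toy_clean$TotalValue <- toy_clean$Quantity * toy_clean$Price

print(toy_clean)
```

Only the first two rows survive here: the third is a cancellation, the fourth has no customer ID, and the fifth has a zero price.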
# Group by Invoice
trans_list <- split(df_clean$Description, df_clean$Invoice)
trans <- as(trans_list, "transactions")

The original dataset is in long format, where each item in a single invoice occupies its own row. However, the Apriori algorithm requires data in a basket format, where each row represents a unique transaction containing a set of items.
We use the split() function to group items by their Invoice number and then convert this list into a sparse transactions object using the arules package. This structure is memory-efficient and required for calculating support and confidence.
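To illustrate the long-to-basket conversion without the full dataset, here is a small base R sketch on made-up rows. The `unique()` step is an addition that does not appear in the code above. It is included on the assumption that an item can occur on several rows of one invoice, in which case the coercion to a transactions object may warn or fail depending on the arules version.

```r
# Long format: one row per (invoice, item) pair. Values are invented.
long <- data.frame(
  Invoice     = c("A1", "A1", "A2", "A2", "A2", "A3"),
  Description = c("MUG", "TEAPOT", "MUG", "MUG", "PLATE", "TEAPOT"),
  stringsAsFactors = FALSE
)

# split() returns a named list: one character vector of items per invoice.
baskets <- split(long$Description, long$Invoice)

# A basket is a set, so repeated items within one invoice are dropped.
baskets <- lapply(baskets, unique)

str(baskets)
```

The resulting list has the same shape as `trans_list` and could then be passed to `as(baskets, "transactions")` once arules is loaded.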
We need each item's average price in order to identify high-value rules.
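# Average unit price per item; AvgPrice is the column used later when mapping prices to rules
item_prices <- df_clean %>%
  group_by(Description) %>%
  summarise(AvgPrice = mean(Price, na.rm = TRUE))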
# Preview expensive items
knitr::kable(head(item_prices, 5), caption = "Most Expensive Items")

| Description |
|---|
| DOORMAT UNION JACK GUNS AND ROSES |
| 3 STRIPEY MICE FELTCRAFT |
| 4 PURPLE FLOCK DINNER CANDLES |
| 50’S CHRISTMAS GIFT BAG LARGE |
| ANIMAL STICKERS |
Before mining rules, we analyze the basic characteristics of the transactions.
What are the best-selling products by frequency?
Figure 1. Top 10 Most Frequent Items
Top Performer: The WHITE HANGING HEART T-LIGHT HOLDER is the most frequent item by a significant margin, with a frequency exceeding 5,000.
Second and Third Place: The REGENCY CAKESTAND 3 TIER follows with a frequency of approximately 3,400, followed by the ASSORTED COLOUR BIRD ORNAMENT at nearly 2,800.
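The plotting code behind Figure 1 is not shown in this section. As a rough base R sketch of the underlying count (the number of baskets containing each item), using hypothetical baskets:

```r
# Hypothetical baskets; item names are invented.
baskets <- list(
  A1 = c("MUG", "TEAPOT"),
  A2 = c("MUG", "PLATE"),
  A3 = c("TEAPOT", "MUG", "SPOON")
)

# Count in how many baskets each item appears, most frequent first.
# unique() guards against an item being listed twice in one basket.
item_counts <- sort(table(unlist(lapply(baskets, unique))), decreasing = TRUE)

# Keep the top entries (top 10 in the report, top 2 in this toy example).
top_items <- head(item_counts, 2)
print(top_items)
```

On a transactions object, arules offers `itemFrequency(trans, type = "absolute")` and `itemFrequencyPlot(trans, topN = 10, type = "absolute")` for the same purpose. Whether the report used them is an assumption.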
How many items do people usually buy in one go?
Figure 2. Distribution of Basket Sizes
Most Common Behavior: The highest concentration of transactions involves very small baskets, with 1–5 items being the most frequent.
Distribution Trend: There is a steady decline in frequency as the number of items per basket increases.
Volume Metrics: While thousands of baskets contain fewer than 10 items, the count falls well below 1,000 for baskets containing 30 or more items.
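The basket-size distribution in Figure 2 can be approximated from a list of baskets with `lengths()`. The sketch below uses invented baskets and a hypothetical cut-off of two items for "small" baskets.

```r
# Hypothetical baskets; item names are invented.
baskets <- list(
  A1 = c("MUG", "TEAPOT"),
  A2 = "MUG",
  A3 = c("TEAPOT", "MUG", "SPOON"),
  A4 = c("PLATE", "MUG")
)

# Number of items in each basket.
basket_sizes <- lengths(baskets)

# How many baskets have each size (the data behind a histogram).
size_table <- table(basket_sizes)

# Share of baskets at or below the chosen cut-off.
share_small <- mean(basket_sizes <= 2)

print(size_table)
print(share_small)
```

For a transactions object, `size(trans)` from arules returns the same per-basket counts.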
We look for standard patterns with:
Support: 0.01 (the items in the rule must appear together in at least 1% of baskets)
Confidence: 0.5 (at least 50% of baskets containing the left-hand side must also contain the right-hand side)
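To make the two thresholds concrete, the sketch below computes support, confidence, and lift by hand for the rule {BREAD} => {MILK} on five invented baskets. The `support()` helper is hypothetical and is not part of arules.

```r
# Five invented baskets.
baskets <- list(
  c("BREAD", "MILK"),
  c("BREAD", "MILK", "EGGS"),
  c("BREAD"),
  c("MILK"),
  c("EGGS")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {BREAD} => {MILK}
supp_rule  <- support(c("BREAD", "MILK"), baskets)   # baskets with both items
confidence <- supp_rule / support("BREAD", baskets)  # of BREAD baskets, share with MILK
lift       <- confidence / support("MILK", baskets)  # confidence relative to MILK's base rate

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

With a minimum support of 0.01 and a minimum confidence of 0.5, this toy rule would be kept. A lift above 1 means the two items co-occur more often than they would if bought independently.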
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 369
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[5283 item(s), 36969 transaction(s)] done [0.15s].
## sorting and recoding items ... [538 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [98 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## lhs rhs support confidence coverage lift count
## [1] {POPPY'S PLAYHOUSE LIVINGROOM } => {POPPY'S PLAYHOUSE BEDROOM } 0.01033298 0.8304348 0.01244286 55.21645 382
## [2] {POPPY'S PLAYHOUSE BEDROOM } => {POPPY'S PLAYHOUSE LIVINGROOM } 0.01033298 0.6870504 0.01503963 55.21645 382
## [3] {POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE LIVINGROOM } 0.01103627 0.6623377 0.01666261 53.23035 408
## [4] {POPPY'S PLAYHOUSE LIVINGROOM } => {POPPY'S PLAYHOUSE KITCHEN} 0.01103627 0.8869565 0.01244286 53.23035 408
## [5] {POPPY'S PLAYHOUSE BEDROOM } => {POPPY'S PLAYHOUSE KITCHEN} 0.01276745 0.8489209 0.01503963 50.94765 472
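The reported measures can be cross-checked with simple arithmetic: support is the rule count divided by the number of transactions, and lift is confidence divided by the support of the right-hand side. For rule [1], the support of the right-hand side (BEDROOM) is the coverage reported for rule [2], where BEDROOM is the left-hand side. The variable names below are illustrative and the numbers are copied from the output above.

```r
n_transactions <- 36969       # from the "set transactions" line
count_rule1    <- 382         # count of rule [1]
confidence1    <- 0.8304348   # confidence of rule [1]
supp_bedroom   <- 0.01503963  # coverage of rule [2] = support of {BEDROOM}

# Support = share of all transactions that contain both items.
support1 <- count_rule1 / n_transactions

# Lift = confidence relative to the base rate of the right-hand side.
lift1 <- confidence1 / supp_bedroom

print(round(support1, 8))
print(round(lift1, 3))
```

Both values agree with the reported support and lift of rule [1] up to rounding of the printed inputs.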
The Problem: The standard analysis above favors cheap, frequently bought items (e.g. Lunch Bags, Bunting). It misses expensive items (Furniture, Electronics) because they sell less often and therefore fall below the support threshold.
The Solution: We implement a value-based approach:
1. Calculate the Average Price of every item.
2. Lower the Support threshold (to 0.001) to catch rare items.
3. Filter the resulting rules based on the Value of the recommended item.
We already calculated the average price of each item in Section 3.5.
Mining & Filtering High-Value Rules
# We lower support to find rare expensive items, then filter by Price
# 1. Run Apriori with lower support (0.001) to catch expensive items that sell less often
rules_weighted <- apriori(trans, parameter = list(supp = 0.001, conf = 0.3))
# 2. Map Prices to the Rules
# We look at the Right-Hand-Side (RHS) of the rule
rhs_labels <- labels(rhs(rules_weighted))
rhs_labels <- gsub("\\{|\\}", "", rhs_labels)
# Match with price list
rule_values <- item_prices$AvgPrice[match(rhs_labels, item_prices$Description)]
# Add value to the rule quality measures
quality(rules_weighted)$RuleValue <- rule_values
# 3. Filter: Keep only rules where the recommended item costs > 10.00
rules_high_value <- subset(rules_weighted, RuleValue > 10.00)

| Type | LHS | RHS | Support | Confidence | Lift | Count | RuleValue |
|---|---|---|---|---|---|---|---|
| Top Standard Rule (Popularity) | {POPPY’S PLAYHOUSE LIVINGROOM} | {POPPY’S PLAYHOUSE BEDROOM} | 0.01033 | 0.8304 | 55.22 | 382 | |
| Top Weighted Rule (Profit) | {LANDMARK FRAME COVENT GARDEN} | {LANDMARK FRAME OXFORD STREET} | 0.00108 | 0.7692 | 526.62 | 40 | 12.389 |
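The price-mapping step in the code above relies on `match()`, which compares strings exactly, so a label and a Description that differ only in trailing whitespace will not match. The base R sketch below shows that behavior on invented labels and prices. Using `which()` to skip unmatched labels is an illustrative choice, not necessarily how the report handles them.

```r
# Hypothetical RHS labels in braces, and a hypothetical price table.
rhs_labels <- c("{TEAPOT}", "{MUG }", "{UNKNOWN ITEM}")
item_prices <- data.frame(
  Description = c("MUG ", "TEAPOT"),   # note the trailing space in "MUG "
  AvgPrice    = c(2.5, 12.0),
  stringsAsFactors = FALSE
)

# Strip the braces, leaving the item name exactly as stored in the data.
clean_labels <- gsub("\\{|\\}", "", rhs_labels)

# Exact lookup; labels with no matching Description yield NA.
rule_values <- item_prices$AvgPrice[match(clean_labels, item_prices$Description)]

# Comparing NA with a number gives NA, so which() keeps only rules whose
# recommended item is known to cost more than the threshold.
high_value_idx <- which(rule_values > 10)

print(rule_values)
print(high_value_idx)
```

Checking `sum(is.na(rule_values))` after the lookup is a quick way to spot labels that failed to match.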
Key Insights
Standard Rule: Shows high Confidence (83%), meaning customers buying the Livingroom set are very likely to buy the Bedroom set as well.
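Weighted Rule: Shows an extremely high Lift (526.62), indicating that the two Landmark Frame items are bought together far more often than chance would predict. The rule rests on only 40 transactions, so it is less well supported than the standard rule (382 transactions).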
Rule Value: The weighted rule's RuleValue of 12.389 is the average price of the recommended (RHS) item, which clears the 10.00 filter threshold. Because it is a price rather than a margin, it indicates basket value rather than profit.
Conclusion: Relying solely on standard Apriori biases the strategy towards cheap and cheerful products. The Weighted analysis balances this by uncovering the hidden gems that drive the store’s actual profitability.
To allow exploring these rules dynamically, a Shiny App was developed.