This analysis investigates transactional synchronicity within the UK retail market. By applying the Apriori Algorithm to approximately 981,000 records, we identify hidden product associations to uncover “purchase triggers” and optimize inventory allocation. The study moves beyond simple sales reporting to find robust product affinities that can drive cross-selling strategies.
UK retailers often face missed cross-selling opportunities due to inefficient warehouse allocation and lack of insight into bundling potential. The historical approach of treating every SKU independently ignores the complex buying behaviors of modern consumers.
This project bridges that gap by analyzing high-volume transaction data to find robust product affinities. We aim to transition from random sales patterns to a behaviorally-aligned strategy.
We define “transactions” using the Online Retail II dataset, sourced from Kaggle. This dataset contains all transactions recorded by a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware, and many of its customers are wholesalers.
# Load packages (the original setup chunk is not shown, so these are assumed)
library(dplyr)
library(knitr)
library(kableExtra)

# Load Data
raw_data <- read.csv("online_retail_II.csv")
head(raw_data) %>%
  kable(caption = "Preview of Raw Transaction Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country |
|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom |
| 489434 | 22064 | PINK DOUGHNUT TRINKET POT | 24 | 2009-12-01 07:45:00 | 1.65 | 13085 | United Kingdom |
# Clean and Preprocess
library(lubridate) # hour()
library(stringr)   # word()

clean_data <- raw_data %>%
  filter(!is.na(Customer.ID), Quantity > 0, Description != "") %>%
  filter(Country == "United Kingdom") %>%
  mutate(
    InvoiceDate = as.POSIXct(InvoiceDate),
    Hour = hour(InvoiceDate),
    Category = word(Description, 1) # Extract first word as proxy for category
  )

# Preview Data
head(clean_data) %>%
  kable(caption = "Preview of Cleaned Transaction Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country | Hour | Category |
|---|---|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 7 | 15CM |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 7 | PINK |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 7 | |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom | 7 | RECORD |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 7 | STRAWBERRY |
| 489434 | 22064 | PINK DOUGHNUT TRINKET POT | 24 | 2009-12-01 07:45:00 | 1.65 | 13085 | United Kingdom | 7 | PINK |
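
The empty Category in the third preview row suggests that first-word extraction can fail for some descriptions. One possible cause, assumed here rather than confirmed by the source, is leading whitespace in Description. The following is a minimal base R sketch of a more defensive extraction; the helper `first_word()` and the sample strings are hypothetical.

```r
# Hypothetical sample descriptions; the leading space on the second entry is
# an assumption about how an empty first word could arise.
descriptions <- c(
  "PINK CHERRY LIGHTS",
  " WHITE CHERRY LIGHTS",
  "STRAWBERRY CERAMIC TRINKET BOX"
)

# Trim surrounding whitespace, then drop everything from the first space on.
first_word <- function(x) {
  sub(" .*$", "", trimws(x))
}

categories <- first_word(descriptions)
print(categories)
```

If this matches the real cause, applying `trimws()` to Description before the `mutate()` step would be one way to avoid blank categories.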
We convert the long-format dataframe into a sparse transaction matrix suitable for the Apriori algorithm.
# Convert to transactions format
library(arules)

# unique() drops repeated items within an invoice before coercion
trans_uk <- as(lapply(split(clean_data$Description, clean_data$Invoice), unique), "transactions")
# Optional: Category level transactions
# trans_cats <- as(lapply(split(clean_data$Category, clean_data$Invoice), unique), "transactions")
print(trans_uk)

## transactions in sparse format with
## 33546 transactions (rows) and
## 5249 items (columns)
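
To make the long-to-basket conversion concrete, here is a small base R sketch on a made-up data frame (`toy` and its values are illustrative, not taken from the dataset). It builds the same kind of invoice-by-item incidence structure that the transactions object stores sparsely.

```r
# Toy long-format data: one row per invoice line (values are invented).
toy <- data.frame(
  Invoice = c("A", "A", "A", "B", "B", "B", "C"),
  Description = c("CUPS", "CUPS", "PLATES", "CUPS", "PLATES", "NAPKINS", "CUPS"),
  stringsAsFactors = FALSE
)

# One basket per invoice, with repeated items collapsed.
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Dense logical incidence matrix: rows are invoices, columns are items.
incidence <- unclass(table(toy$Invoice, toy$Description)) > 0

print(baskets)
print(incidence)
```

A dense matrix like this is only practical for toy data; with tens of thousands of invoices and thousands of items, almost every cell is FALSE, which is why a sparse representation is used.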
Before mining for rules, we analyze the statistical context of the transactions.
Understanding “Basket Size” helps determine if shoppers are purchasing in bulk or engaged in targeted, small-volume purchasing.
basket_sizes <- size(trans_uk)
hist(basket_sizes,
  main = "UK Basket Size Distribution",
  xlab = "Items per Basket", col = "skyblue", border = "white", breaks = 50
)
abline(v = mean(basket_sizes), col = "red", lwd = 2, lty = 2)
legend("topright",
  legend = paste("Mean Size:", round(mean(basket_sizes), 1)),
  col = "red", lty = 2, lwd = 2
)

Identifying the primary “anchors” for UK consumers allows us to understand the base volume drivers.
itemFrequencyPlot(trans_uk,
  topN = 10, type = "relative",
  col = "steelblue", main = "Top 10 UK Products (Relative Frequency)"
)

Understanding when purchases happen allows for better staffing and ad scheduling. We analyze transaction volume by hour of day.
library(ggplot2)

uk_temporal_summary <- clean_data %>%
  group_by(Hour) %>%
  summarise(Transactions = n_distinct(Invoice), .groups = "drop")
ggplot(uk_temporal_summary, aes(x = Hour, y = Transactions)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "UK Transaction Volume by Hour of Day",
    subtitle = "Peak activity analysis",
    x = "Hour of Day",
    y = "Number of Invoices"
  )

We apply the Apriori algorithm with a minimum support of 0.01 and a minimum confidence of 0.5. Given the high volume of data (~981k records across 33,546 baskets), a support of 0.01 means an itemset must appear in at least 336 invoices, and a confidence of 0.5 means the consequent appears in at least half of the baskets that contain the antecedent. These thresholds keep the focus on high-signal patterns rather than statistical noise.
rules_uk <- apriori(trans_uk,
  parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)
# Summary of rules
summary(rules_uk)

## set of 98 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 73 25
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.255 2.750 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01002 Min. :0.5015 Min. :0.01172 Min. : 5.037
## 1st Qu.:0.01067 1st Qu.:0.5615 1st Qu.:0.01700 1st Qu.:11.875
## Median :0.01236 Median :0.6107 Median :0.01961 Median :20.363
## Mean :0.01362 Mean :0.6355 Mean :0.02187 Mean :23.070
## 3rd Qu.:0.01529 3rd Qu.:0.6850 3rd Qu.:0.02468 3rd Qu.:31.550
## Max. :0.03261 Max. :0.8931 Max. :0.05124 Max. :56.424
## count
## Min. : 336.0
## 1st Qu.: 358.0
## Median : 414.5
## Mean : 456.9
## 3rd Qu.: 513.0
## Max. :1094.0
##
## mining info:
## data ntransactions support confidence
## trans_uk 33546 0.01 0.5
## call
## apriori(data = trans_uk, parameter = list(supp = 0.01, conf = 0.5, minlen = 2), control = list(verbose = FALSE))
We sort the rules by Lift, which measures how much more likely the consequent is given the antecedent, compared to its baseline probability. A lift > 1 indicates a positive association.
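
For reference, the three measures can be written in standard notation, where $N$ is the number of transactions and $X \Rightarrow Y$ is a rule with antecedent $X$ and consequent $Y$. These are the usual textbook definitions, stated here as a sketch and not quoted from the source:

$$
\begin{aligned}
\operatorname{supp}(X) &= \frac{|\{\, t : X \subseteq t \,\}|}{N}, \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)} = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}.
\end{aligned}
$$

The last expression is symmetric in $X$ and $Y$, so a rule and its reverse always share the same lift even when their confidences differ.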
rules_uk_sorted <- sort(rules_uk, by = "lift")
# View top rules
as(head(rules_uk_sorted, 10), "data.frame") %>%
  kable(caption = "Top 10 Association Rules by Lift") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

| | rules | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| 6 | {POPPY’S PLAYHOUSE LIVINGROOM } => {POPPY’S PLAYHOUSE KITCHEN} | 0.0104632 | 0.8931298 | 0.0117153 | 56.42360 | 351 |
| 7 | {POPPY’S PLAYHOUSE KITCHEN} => {POPPY’S PLAYHOUSE LIVINGROOM } | 0.0104632 | 0.6610169 | 0.0158290 | 56.42360 | 351 |
| 12 | {POPPY’S PLAYHOUSE KITCHEN} => {POPPY’S PLAYHOUSE BEDROOM } | 0.0120134 | 0.7589454 | 0.0158290 | 53.93979 | 403 |
| 11 | {POPPY’S PLAYHOUSE BEDROOM } => {POPPY’S PLAYHOUSE KITCHEN} | 0.0120134 | 0.8538136 | 0.0140702 | 53.93979 | 403 |
| 2 | {SET/6 RED SPOTTY PAPER PLATES} => {SET/6 RED SPOTTY PAPER CUPS} | 0.0101055 | 0.6634051 | 0.0152328 | 50.80956 | 339 |
| 1 | {SET/6 RED SPOTTY PAPER CUPS} => {SET/6 RED SPOTTY PAPER PLATES} | 0.0101055 | 0.7739726 | 0.0130567 | 50.80956 | 339 |
| 8 | {RED STRIPE CERAMIC DRAWER KNOB} => {BLUE STRIPE CERAMIC DRAWER KNOB} | 0.0104931 | 0.6654064 | 0.0157694 | 38.75299 | 352 |
| 9 | {BLUE STRIPE CERAMIC DRAWER KNOB} => {RED STRIPE CERAMIC DRAWER KNOB} | 0.0104931 | 0.6111111 | 0.0171705 | 38.75299 | 352 |
| 45 | {KEY FOB , BACK DOOR } => {KEY FOB , SHED} | 0.0122816 | 0.7253521 | 0.0169320 | 37.60844 | 412 |
| 46 | {KEY FOB , SHED} => {KEY FOB , BACK DOOR } | 0.0122816 | 0.6367852 | 0.0192869 | 37.60844 | 412 |
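
As a sanity check on how the columns relate, the sketch below recomputes confidence and lift for the top rule from the tabulated values alone. It assumes that coverage is the support of the antecedent, so the coverage of the reverse rule (rule 7) serves as the support of KITCHEN. The variable names are illustrative.

```r
# Values copied from rules 6 and 7 in the table above.
support_both <- 0.0104632        # support of {LIVINGROOM, KITCHEN}
coverage_livingroom <- 0.0117153 # coverage of rule 6 = support of LIVINGROOM
coverage_kitchen <- 0.0158290    # coverage of rule 7 = support of KITCHEN

# confidence = supp(X and Y) / supp(X)
confidence <- support_both / coverage_livingroom

# lift = confidence / supp(Y)
lift <- confidence / coverage_kitchen

print(round(confidence, 3))
print(round(lift, 1))
```

Both results should land close to the tabulated confidence and lift for rule 6; small differences come from rounding in the printed table.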
This visualization tracks the transition from the Left-Hand Side (LHS) of a rule to the Right-Hand Side (RHS). Thicker lines indicate stronger support for specific antecedents “flowing” into consequents.
library(arulesViz)

top_rules_uk <- head(rules_uk_sorted, 20)
plot(top_rules_uk, method = "paracoord", control = list(reorder = TRUE))

We identify the “efficient frontier” of rules: those that balance frequency (Support) with reliability (Confidence).
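
The source does not show how the frontier is computed. One plausible reading, assumed here, is the Pareto frontier: the rules that no other rule beats on both support and confidence at once. The base R sketch below applies that idea to five rules taken from the lift table; `is_dominated()` and the shortened rule labels are hypothetical.

```r
# Support and confidence copied from the lift table; labels are shortened.
rules_df <- data.frame(
  rule = c(
    "LIVINGROOM => KITCHEN", "KITCHEN => BEDROOM", "BEDROOM => KITCHEN",
    "BACK DOOR => SHED", "SHED => BACK DOOR"
  ),
  support = c(0.0104632, 0.0120134, 0.0120134, 0.0122816, 0.0122816),
  confidence = c(0.8931298, 0.7589454, 0.8538136, 0.7253521, 0.6367852),
  stringsAsFactors = FALSE
)

# Rule i is dominated if some rule is at least as good on both measures
# and strictly better on at least one.
is_dominated <- function(i, df) {
  any(
    df$support >= df$support[i] & df$confidence >= df$confidence[i] &
      (df$support > df$support[i] | df$confidence > df$confidence[i])
  )
}

dominated <- vapply(seq_len(nrow(rules_df)), is_dominated, logical(1), df = rules_df)
frontier <- rules_df[!dominated, ]
print(frontier$rule)
```

On the full rule set, the same filter could be applied to the data frame obtained from `as(rules_uk, "data.frame")`.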
The analysis of 33,546 unique transactions and approximately 981,000 records shows that UK purchasing behavior contains strong, non-random co-purchase patterns. Applying the Apriori algorithm with a support of 0.01 and confidence of 0.5 yielded 98 high-signal rules with lift between 5.0 and 56.4, giving the retailer an evidence base for moving from intuitive guesses to a behaviorally aligned strategy.
Core Data Insights: