1 Abstract

This analysis investigates transactional synchronicity within the UK retail market. By applying the Apriori Algorithm to approximately 981,000 records, we identify hidden product associations to uncover “purchase triggers” and optimize inventory allocation. The study moves beyond simple sales reporting to find robust product affinities that can drive cross-selling strategies.

2 Introduction

2.1 Analysis Context

UK retailers often face missed cross-selling opportunities due to inefficient warehouse allocation and lack of insight into bundling potential. The historical approach of treating every SKU independently ignores the complex buying behaviors of modern consumers.

This project bridges that gap by analyzing high-volume transaction data to find robust product affinities. We aim to move from treating each sale as an isolated event to a behaviorally aligned strategy.

2.2 Analytical Goals

  1. Identify Purchase Triggers: Discover which products serve as antecedents (drivers) for subsequent purchases.
  2. Optimize Visual Merchandising: Use association rules to suggest product placements (e.g., placing “Jumbo Bag” near “Lunch Box”).
  3. Validate “Collection” Behavior: Assess if consumers are “collecting” specific aesthetic lines rather than buying random items.

3 Methodology & Data

3.1 Data Acquisition

We draw our transactions from the Online Retail II dataset, sourced from Kaggle. This dataset contains all transactions for a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware, and many of its customers are wholesalers.

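# Load packages (assumed installed) used throughout the analysis
library(dplyr); library(stringr); library(lubridate) # wrangling
library(knitr); library(kableExtra) # tables
library(ggplot2); library(arules); library(arulesViz) # plots and rule mining

# Load Data
raw_data <- read.csv("online_retail_II.csv")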

head(raw_data) %>%
    kable(caption = "Preview of Raw Transaction Data") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Preview of Raw Transaction Data

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country |
|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom |
| 489434 | 22064 | PINK DOUGHNUT TRINKET POT | 24 | 2009-12-01 07:45:00 | 1.65 | 13085 | United Kingdom |

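# Clean and Preprocess
clean_data <- raw_data %>%
    # Keep identified customers, positive quantities, and non-empty descriptions
    filter(!is.na(Customer.ID), Quantity > 0, Description != "") %>%
    # Restrict the analysis to UK orders
    filter(Country == "United Kingdom") %>%
    mutate(
        InvoiceDate = as.POSIXct(InvoiceDate),
        Hour = hour(InvoiceDate),
        # First word as a rough proxy for category; word() returns ""
        # when Description begins with a space
        Category = word(Description, 1)
    )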

# Preview Data
head(clean_data) %>%
    kable(caption = "Preview of Cleaned Transaction Data") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Preview of Cleaned Transaction Data

| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer.ID | Country | Hour | Category |
|---|---|---|---|---|---|---|---|---|---|
| 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 7 | 15CM |
| 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 7 | PINK |
| 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 7 | |
| 489434 | 22041 | RECORD FRAME 7” SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom | 7 | RECORD |
| 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 7 | STRAWBERRY |
| 489434 | 22064 | PINK DOUGHNUT TRINKET POT | 24 | 2009-12-01 07:45:00 | 1.65 | 13085 | United Kingdom | 7 | PINK |

3.2 Transaction Formatting

We convert the long-format dataframe into a sparse transaction matrix suitable for the Apriori algorithm.

# Convert to transactions format: one basket of unique item descriptions per invoice
baskets <- lapply(split(clean_data$Description, clean_data$Invoice), unique)
trans_uk <- as(baskets, "transactions")
# Optional: Category level transactions
# trans_cats <- as(split(clean_data$Category, clean_data$Invoice), "transactions")

print(trans_uk)
## transactions in sparse format with
##  33546 transactions (rows) and
##  5249 items (columns)
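
To make the long-to-sparse conversion concrete, the sketch below applies the same steps to a small, made-up data frame. The object `toy` and its invoice and item values are hypothetical and exist only for illustration; the example assumes the arules package is installed.

```r
library(arules)

# Hypothetical long-format data: one row per invoice line (illustrative values only)
toy <- data.frame(
    Invoice = c("A1", "A1", "A2", "A2", "A2", "A3"),
    Description = c("TEA SET", "MUG", "TEA SET", "MUG", "TRAY", "TRAY")
)

# One vector of unique items per invoice, then coerce to the transactions class
toy_baskets <- lapply(split(toy$Description, toy$Invoice), unique)
toy_trans <- as(toy_baskets, "transactions")

# Each row is an invoice and each column an item; size() counts items per basket
inspect(toy_trans)
size(toy_trans)
```

Each invoice becomes one row of the sparse matrix, so the number of rows equals the number of distinct invoices rather than the number of line items.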

4 Exploratory Data Analysis (EDA)

Before mining for rules, we analyze the statistical context of the transactions.

4.1 Statistical Context: Basket Sizes

Understanding “Basket Size” helps determine if shoppers are purchasing in bulk or engaged in targeted, small-volume purchasing.

basket_sizes <- size(trans_uk)
hist(basket_sizes,
    main = "UK Basket Size Distribution",
    xlab = "Items per Basket", col = "skyblue", border = "white", breaks = 50
)
abline(v = mean(basket_sizes), col = "red", lwd = 2, lty = 2)
legend("topright",
    legend = paste("Mean Size:", round(mean(basket_sizes), 1)),
    col = "red", lty = 2, lwd = 2
)

4.2 Item Frequency

Identifying the primary “anchors” for UK consumers allows us to understand the base volume drivers.

itemFrequencyPlot(trans_uk,
    topN = 10, type = "relative",
    col = "steelblue", main = "Top 10 UK Products (Relative Frequency)"
)

4.3 Temporal Analysis

Understanding when purchases happen allows for better staffing and ad scheduling. We analyze transaction volume by hour of day.

uk_temporal_summary <- clean_data %>%
    group_by(Hour) %>%
    summarise(Transactions = n_distinct(Invoice), .groups = "drop")

ggplot(uk_temporal_summary, aes(x = Hour, y = Transactions)) +
    geom_col(fill = "steelblue") +
    theme_minimal() +
    labs(
        title = "UK Transaction Volume by Hour of Day",
        subtitle = "Peak activity analysis",
        x = "Hour of Day",
        y = "Number of Invoices"
    )

5 Association Rule Mining

5.1 Algorithmic Approach

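We apply the Apriori algorithm with a minimum support of 0.01 and a minimum confidence of 0.5. Support is measured against the 33,546 invoice-level transactions rather than the ~981k line-item records, so an itemset must appear in at least 336 invoices to qualify. The confidence threshold additionally requires the consequent to appear in at least half of the baskets that contain the antecedent. Together, these thresholds favor recurring patterns over one-off co-occurrences.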

rules_uk <- apriori(trans_uk,
    parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
    control = list(verbose = FALSE)
)

# Summary of rules
summary(rules_uk)
## set of 98 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 73 25 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.255   2.750   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01002   Min.   :0.5015   Min.   :0.01172   Min.   : 5.037  
##  1st Qu.:0.01067   1st Qu.:0.5615   1st Qu.:0.01700   1st Qu.:11.875  
##  Median :0.01236   Median :0.6107   Median :0.01961   Median :20.363  
##  Mean   :0.01362   Mean   :0.6355   Mean   :0.02187   Mean   :23.070  
##  3rd Qu.:0.01529   3rd Qu.:0.6850   3rd Qu.:0.02468   3rd Qu.:31.550  
##  Max.   :0.03261   Max.   :0.8931   Max.   :0.05124   Max.   :56.424  
##      count       
##  Min.   : 336.0  
##  1st Qu.: 358.0  
##  Median : 414.5  
##  Mean   : 456.9  
##  3rd Qu.: 513.0  
##  Max.   :1094.0  
## 
## mining info:
##      data ntransactions support confidence
##  trans_uk         33546    0.01        0.5
##                                                                                                              call
##  apriori(data = trans_uk, parameter = list(supp = 0.01, conf = 0.5, minlen = 2), control = list(verbose = FALSE))
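
As a quick sanity check on what the support threshold means in absolute terms, the arithmetic below converts the relative support into a minimum invoice count. The variable names are illustrative; the two input figures are taken from the mining summary.

```r
# Figures taken from the mining summary
n_transactions <- 33546
min_support <- 0.01

# An itemset must appear in at least this many invoices to pass the threshold
min_count <- ceiling(min_support * n_transactions)
min_count
```

The result can be compared with the smallest value in the `count` column of the summary.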

5.2 Rule Inspection

We sort the rules by Lift, which measures how much more likely the consequent is given the antecedent, compared to its baseline probability. A lift > 1 indicates a positive association.
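
For reference, the quality measures reported by the summary can be written out as follows. The notation is introduced here for illustration: $X \Rightarrow Y$ is a rule, $N$ is the number of transactions, and $\mathrm{supp}(A)$ is the fraction of transactions that contain every item in $A$.

$$
\begin{aligned}
\mathrm{supp}(X \Rightarrow Y) &= \mathrm{supp}(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{N} \\
\mathrm{coverage}(X \Rightarrow Y) &= \mathrm{supp}(X) \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)} = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
\end{aligned}
$$

Because the last expression is symmetric in $X$ and $Y$, a rule and its reverse always share the same lift, even though their confidence values differ.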

rules_uk_sorted <- sort(rules_uk, by = "lift")

# View top rules
as(head(rules_uk_sorted, 10), "data.frame") %>%
    kable(caption = "Top 10 Association Rules by Lift") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))

Top 10 Association Rules by Lift

| | rules | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| 6 | {POPPY’S PLAYHOUSE LIVINGROOM } => {POPPY’S PLAYHOUSE KITCHEN} | 0.0104632 | 0.8931298 | 0.0117153 | 56.42360 | 351 |
| 7 | {POPPY’S PLAYHOUSE KITCHEN} => {POPPY’S PLAYHOUSE LIVINGROOM } | 0.0104632 | 0.6610169 | 0.0158290 | 56.42360 | 351 |
| 12 | {POPPY’S PLAYHOUSE KITCHEN} => {POPPY’S PLAYHOUSE BEDROOM } | 0.0120134 | 0.7589454 | 0.0158290 | 53.93979 | 403 |
| 11 | {POPPY’S PLAYHOUSE BEDROOM } => {POPPY’S PLAYHOUSE KITCHEN} | 0.0120134 | 0.8538136 | 0.0140702 | 53.93979 | 403 |
| 2 | {SET/6 RED SPOTTY PAPER PLATES} => {SET/6 RED SPOTTY PAPER CUPS} | 0.0101055 | 0.6634051 | 0.0152328 | 50.80956 | 339 |
| 1 | {SET/6 RED SPOTTY PAPER CUPS} => {SET/6 RED SPOTTY PAPER PLATES} | 0.0101055 | 0.7739726 | 0.0130567 | 50.80956 | 339 |
| 8 | {RED STRIPE CERAMIC DRAWER KNOB} => {BLUE STRIPE CERAMIC DRAWER KNOB} | 0.0104931 | 0.6654064 | 0.0157694 | 38.75299 | 352 |
| 9 | {BLUE STRIPE CERAMIC DRAWER KNOB} => {RED STRIPE CERAMIC DRAWER KNOB} | 0.0104931 | 0.6111111 | 0.0171705 | 38.75299 | 352 |
| 45 | {KEY FOB , BACK DOOR } => {KEY FOB , SHED} | 0.0122816 | 0.7253521 | 0.0169320 | 37.60844 | 412 |
| 46 | {KEY FOB , SHED} => {KEY FOB , BACK DOOR } | 0.0122816 | 0.6367852 | 0.0192869 | 37.60844 | 412 |
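
The top rule's figures can be reproduced by hand, which is a useful check on how to read the table. The sketch below uses only values copied from the table and the mining summary. It relies on two standard identities: lift is confidence divided by the support of the consequent, and a rule's coverage is the support of its antecedent. The variable names are illustrative.

```r
# Values copied from the table and the mining summary
n_transactions <- 33546
count_rule <- 351        # baskets containing both LIVINGROOM and KITCHEN
conf_rule <- 0.8931298   # confidence of {LIVINGROOM} => {KITCHEN}
supp_kitchen <- 0.0158290 # coverage of the {KITCHEN} => ... rules, i.e. support of KITCHEN

# Support is the share of all baskets that contain both items
support_rule <- count_rule / n_transactions

# Lift compares the rule's confidence with the consequent's baseline frequency
lift_rule <- conf_rule / supp_kitchen

round(c(support = support_rule, lift = lift_rule), 4)
```

Both values agree with the table up to rounding of the inputs.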

6 Visualizing Behavioral Flow

6.1 Parallel Coordinates Plot

This visualization tracks the transition from the Left-Hand Side (LHS) of a rule to the Right-Hand Side (RHS). Thicker lines indicate stronger support for specific antecedents “flowing” into consequents.

top_rules_uk <- head(rules_uk_sorted, 20)
plot(top_rules_uk, method = "paracoord", control = list(reorder = TRUE))

6.2 Rule Density (Support vs Confidence)

We identify the “efficient frontier” of rules—those that balance frequency (Support) with reliability (Confidence).

plot(rules_uk, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
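
One way to pull out the rules in the upper-right region of this scatterplot is to filter on both measures at once. The sketch below is an illustration rather than the method used in this report: it runs on the `Groceries` dataset bundled with arules so that it is self-contained, and the median-based cut-offs are an arbitrary assumption chosen for demonstration.

```r
library(arules)

# Groceries ships with arules; it stands in for trans_uk so the sketch runs on its own
data("Groceries")
demo_rules <- apriori(Groceries,
    parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
    control = list(verbose = FALSE)
)

# Hypothetical cut-offs: keep rules at or above the median on both measures
q <- quality(demo_rules)
keep <- q$support >= median(q$support) & q$confidence >= median(q$confidence)
frontier <- demo_rules[keep]

# Rank the surviving rules by lift for inspection
inspect(sort(frontier, by = "lift"))
```

The same indexing pattern would apply to `rules_uk`, with cut-offs chosen to suit the business question.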

6.3 Grouped Matrix Plot

This plot clusters rules with similar antecedents to show which groups of items drive the highest Lift, helping assess whether consumers are “collecting” specific aesthetic lines.

plot(rules_uk, method = "grouped")

7 Conclusion

The analysis of 33,546 unique transactions and approximately 981,000 records indicates that UK purchasing behavior follows non-random co-purchase patterns. Applying the Apriori algorithm with a minimum support of 0.01 and a minimum confidence of 0.5 yielded 98 rules, all with lift above 5, which give the retailer an evidence base for moving from intuitive guesses to a behaviorally aligned strategy.

Core Data Insights:

  • High Engagement Scale: A mean basket size of 20.6 items shows that the average invoice spans many distinct products rather than a single item, consistent with a customer base that includes many wholesalers.
  • Collection-Based Purchasing: The strongest associations occur within product lines such as Poppy’s Playhouse. A lift of 56.42 for the Living Room and Kitchen sets means that baskets containing one are about 56 times more likely to contain the other than a randomly chosen basket.
  • Temporal Optimization: Invoice volume peaks during Hour 12 (midday), defining a critical daily window for targeted marketing and staffing.