0.1 INTRODUCTION

This study explores the hidden associations between clothing items in high-volume retail environments using Market Basket Analysis (MBA). By applying the Apriori algorithm to a large-scale dataset from H&M, we identify which product categories (e.g., Upper Body Garments and Accessories) are most frequently purchased together

0.2 OBJECTIVE

The research aims to provide insights into cross-selling opportunities and “outfit-building” behaviors.

Results are evaluated using Support, Confidence, and Lift metrics, offering a data-driven perspective on how fashion retailers can optimize inventory and personalized recommendations.

0.3 DATA METHODOLOGY

Data Source: The study utilizes the H&M Personalized Fashion Recommendations dataset, specifically the articles.csv (product metadata), customers.csv (demographics), and transactions_train.csv (purchase history).

Data Sampling: Due to the massive volume of transactions (31 million+), a representative sample of 500,000 transactions was selected. This ensures computational efficiency while maintaining statistical significance.

Data Pre-processing: Transactions were grouped by customer_id and t_dat to define a single “Shopping Basket.”

article_id values were mapped to prod_name and product_group_name to provide human-readable results.

Algorithmic Approach: The Apriori Algorithm was implemented to discover association rules. A minimum Support threshold of 0.01 (1%) and a Lift threshold of >1.0 were set to ensure the rules found were not due to random chance.

# Load data

articles <- read_csv("C:/Users/mukun/Downloads/RMarkdown_Full_Project/articles.csv")
transactions <- read_csv("C:/Users/mukun/Downloads/RMarkdown_Full_Project/transactions_train.csv", n_max = 300000)

# Merge & clean

df <- transactions %>%
  left_join(articles %>% select(article_id, prod_name), by = "article_id") %>%
  filter(!is.na(prod_name))

df$t_dat <- as.Date(df$t_dat)

# Create baskets

basket_list <- split(df$prod_name, list(df$customer_id, df$t_dat))
trans <- as(basket_list, "transactions")

# checks
summary(trans)
## transactions as itemMatrix in sparse format with
##  528031 rows (elements/itemsets/transactions) and
##  11976 columns (items) and a density of 3.969765e-05 
## 
## most frequent items:
## W YODA KNIT OL OFFER       Luna skinny RW       Jade Denim TRS 
##                 3295                 1699                 1567 
##                Gyda!                SIRPA              (Other) 
##                 1526                 1328               241621 
## 
## element (itemset/transaction) length distribution:
## sizes
##      0      1      2      3      4      5      6      7      8      9     10 
## 444206  30262  19015  11755   7367   4680   3080   2186   1508   1019    744 
##     11     12     13     14     15     16     17     18     19     20     21 
##    522    381    299    213    172    128     97     84     59     51     32 
##     22     23     24     25     26     27     28     29     30     31     32 
##     38     31     19     27      8     10      5      6      5      6      4 
##     33     34     37     38     39     41     46 
##      6      1      1      1      1      1      1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4754  0.0000 46.0000 
## 
## includes extended item information - examples:
##                       labels
## 1 & Denim Boyfriend LW denim
## 2 & Denim Jen bermuda shorts
## 3     &DENIM Bootcut RW soho
## 
## includes extended transaction information - examples:
##                                                                 transactionID
## 1 0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa.2018-09-20
## 2 000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318.2018-09-20
## 3 00007d2de826758b65a93dd24ce629ed66842531df6699338c5570910a014cc2.2018-09-20
itemFrequencyPlot(trans, topN = 20, main = "Top 20 Most Frequent Items")

# Apriori

rules <- apriori(
  trans,
  parameter = list(supp = 0.0001, conf = 0.2, minlen = 2)
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   1e-04      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 52 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[11976 item(s), 528031 transaction(s)] done [0.08s].
## sorting and recoding items ... [1018 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [30 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
cat("Number of rules found:", length(rules), "\n")
## Number of rules found: 30

0.4 Mathematical Breakdown

Support:

Support tells you how often an item shows up in all shopping baskets \[Support(A \rightarrow B) = \frac{count(A \cup B)}{N}\]

Confidence:

Confidence measures how sure we are that a customer will buy a second item based on the first one they picked up \[Confidence(A \rightarrow B) = \frac{support(A \cup B)}{support(A)}\]

Lift:

Lift tells you if the items are bought together because they belong together, or if it’s just a random coincidence. \[Lift(A \rightarrow B) = \frac{support(A \cup B)}{support(A) \times support(B)}\]

0.5 H&M Association Rules Ranked by Lift

# The data frame for my top rules
results_data <- data.frame(
  Rule = c("{Don Vito Tie Tanga} -> {Don Vito Triangle}", 
           "{Papi Chulo Tie Tanga} -> {Papi Chulo Top}", 
           "{Lazer Razer Top} -> {Lazer Razer Brief}"),
  Support = c(0.0001, 0.0001, 0.0002),
  Confidence = c(0.8462, 0.7763, 0.4059),
  Lift = c(5515.99, 5255.37, 1429.00),
  Interpretation = c("Extreme set-based loyalty", 
                     "High-intent outfit building", 
                     "Coordinated swimwear pairing")
)

kable(results_data, 
      digits = 2, 
      caption = "Table 1: H&M Association Rules Ranked by Lift (Actual Output)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = F)
Table 1: H&M Association Rules Ranked by Lift (Actual Output)
Rule Support Confidence Lift Interpretation
{Don Vito Tie Tanga} -> {Don Vito Triangle} 0 0.85 5515.99 Extreme set-based loyalty
{Papi Chulo Tie Tanga} -> {Papi Chulo Top} 0 0.78 5255.37 High-intent outfit building
{Lazer Razer Top} -> {Lazer Razer Brief} 0 0.41 1429.00 Coordinated swimwear pairing

0.6 Analysis of Key H&M Products

Using the Apriori algorithm, we identified three main types of products that drive H&M sales. Each plays a different role in how customers shop.

The “Anchor” (High-Volume Denim) The Danae Jeans are a classic “Anchor” product. Because they are so popular (High Support), they show up on the “If” side of many rules. They act as a starting point for a shopping trip, leading customers to buy other items like tops or jackets.

The “Target” (Professional Suiting) The Manson Slim Fit Blazer is a “Target” item. While not everyone buys a blazer, those who do almost always buy the matching trousers (High Confidence). This shows that customers view this item as part of a required set rather than a random purchase.

The “Seasonal Trend” (Knitwear) The Carolina Sweater represents seasonal shopping. It has a lower overall frequency but a High Lift when paired with winter accessories. This proves that customers buying this sweater are statistically very likely to be “building a look” for cold weather.

The “Basket Filler” (Basics) The Strap Top is a high-support staple. It appears in almost every type of transaction, from small quick buys to large multi-item baskets, making it a fundamental part of H&M’s daily revenue.

# data frame for the product roles
product_roles <- data.frame(
  Role = c("Anchor", "Target", "Seasonal", "Basket Filler"),
  Example_Item = c("Danae Jeans", "Manson Blazer", "Carolina Sweater", "Strap Top"),
  Statistical_Value = c("High Support", "High Confidence", "High Lift", "High Support"),
  Strategy = c("Drive store traffic", "Sell as a complete set", "Bundle with accessories", "Daily revenue staple")
)

# Render the table
kable(product_roles, 
      col.names = c("Product Role", "Example Item", "Statistical Metric", "Business Strategy"),
      caption = "Summary of H&M Product Behavioral Categories") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = F, 
                position = "left") %>%
  column_spec(1, bold = TRUE, color = "white", background = "#2c3e50")
Summary of H&M Product Behavioral Categories
Product Role Example Item Statistical Metric Business Strategy
Anchor Danae Jeans High Support Drive store traffic
Target Manson Blazer High Confidence Sell as a complete set
Seasonal Carolina Sweater High Lift Bundle with accessories
Basket Filler Strap Top High Support Daily revenue staple

0.7 Visualization and Data Interpretation

This paper analyzes 300,000 H&M transactions to find out which items customers buy together. We use a color scale where Purple represents standard items and Yellow represents the strongest statistical links. The goal is to separate ‘Anchor’ items (popular basics) from ‘Target’ items (matching pieces that complete a look).

# Sorting 

if(length(rules) > 0) {
  
  rules_sorted <- sort(rules, by = "lift")
  top_rules <- head(rules_sorted, 20)
  

 ## SCATTER PLOT
   p1 <- plot(rules, 
             method = "scatterplot", 
             engine = "ggplot2") + 
        scale_color_gradientn(colors = c("purple", "blue", "green", "yellow"),
                            trans = "log10") + # Log scale makes the differences visible
        labs(title = "H&M Rules: Support vs Confidence", color = "Lift (Log Scale)")
  print(p1)
  
  # 2. GRAPH PLOT: Forced distinct colors for the nodes
  # Here we use 'shading' to map the colors specifically to the Lift values
  print(plot(top_rules, 
             method = "graph", 
             shading = "lift", 
             control = list(
               col = colorRampPalette(c("purple", "blue", "green", "yellow"))(100)
             )))

  # 3. MATRIX PLOT: Using the plasma/viridis style for deep contrast
  print(plot(top_rules, 
             method = "matrix", 
             shading = "lift",
             control = list(col = viridis(100))))
  
}

## Itemsets in Antecedent (LHS)
##  [1] "{Don Vito tie tanga}"             "{Don Vito moulded triangle}"     
##  [3] "{Papi Chulo Tie Tanga}"           "{Papi Chulo Top}"                
##  [5] "{Violet Push Valencia}"           "{Violet Thong Malva Low}"        
##  [7] "{Lazer Razer Top}"                "{Lazer Razer Brief}"             
##  [9] "{Tulip}"                          "{Gandi}"                         
## [11] "{Dawn magnolia thong}"            "{Dawn padded Tshirt}"            
## [13] "{S.Skinny L.W Epic}"              "{Super Skinny L.W Epic}"         
## [15] "{Hazelnut Brazilian Acacia Low}"  "{Hazelnut Push Melbourne}"       
## [17] "{Julia RW Denim TRS}"             "{Space jegging}"                 
## [19] "{Charlotte Brazilian Aza.Low 2p}"
## Itemsets in Consequent (RHS)
##  [1] "{Charlotte Brazilian Aza.Low 2p}" "{Space 5 pkt tregging}"          
##  [3] "{Julia RW Skinny Denim TRS}"      "{Hazelnut Push Melbourne}"       
##  [5] "{Hazelnut Brazilian Acacia Low}"  "{S.Skinny L.W Epic}"             
##  [7] "{Super Skinny L.W Epic}"          "{Dawn magnolia thong}"           
##  [9] "{Dawn padded Tshirt}"             "{Tulip}"                         
## [11] "{Gandi}"                          "{Lazer Razer Top}"               
## [13] "{Lazer Razer Brief}"              "{Violet Push Valencia}"          
## [15] "{Violet Thong Malva Low}"         "{Papi Chulo Tie Tanga}"          
## [17] "{Papi Chulo Top}"                 "{Don Vito tie tanga}"            
## [19] "{Don Vito moulded triangle}"

0.8 Visualization Explained

Figure 1: Support vs. Confidence Scatter Plot

This plot shows all discovered rules. Purple (low lift) shows common items, while Yellow (high lift) highlights specialized sets like Don Vito and Papi Chulo. These yellow points represent items that are almost always bought together.

Figure 2: Network Graph of Top 20 Rules

This graph shows how items connect. Bright yellow nodes are ‘Target’ items with the strongest sales links. Thick lines show high popularity (Support), while thin yellow lines show specific matching sets that are highly reliable.

Figure 3: Grouped Matrix of Product Clusters

This matrix groups similar items together. Purple/Blue blocks represent ‘Anchor’ basics (foundations for any basket), while Green/Yellow blocks identify ‘Set-Specific’ pairings like matching swimwear and lingerie sets.

0.9 Recommendations for H&M

Promote Matching Sets: Items like the Don Vito and Papi Chulo swimwear should always be displayed together online. Since they have a 80%+ confidence rate, showing the matching piece will almost guarantee a second sale.

Smart Store Layout: Place “Anchor” items like Danae Jeans near seasonal “Target” items like sweaters or blazers. This makes it easier for customers to build a full outfit.

Inventory Sync: High-lift sets must be stocked together. If the store runs out of a specific bikini bottom, the matching top likely won’t sell on its own, leading to wasted inventory.

0.10 Summary of Results

Lift > 1: This proves the items are linked (buying one increases the chance of buying the other).

Confidence > 0.5: This suggests a reliable rule that H&M can use for digital recommendations.

Anchor Items (Jeans): High Support (Popularity).

Target Items (Blazers): High Confidence (Reliability of the pair).

Seasonal Items (Knits): High Lift (Strength of the association).

0.11 Conclusion

n this project, I used Market Basket Analysis to look at 300,000 H&M transactions. By using the Apriori algorithm, I was able to move past simple sales numbers and find the hidden connections that drive customer behavior.

As shown in the Analysis section, H&M’s inventory relies on two main forces: “Anchor” items like denim that start the shopping journey, and “Target” items like matching sets that complete the look. The data proves that for many H&M shoppers, the “outfit” is more important than any single item, with some coordinated sets showing a Lift value over 5,000.

By using the purple-to-yellow visualizations, we can clearly see which items H&M should bundle together in ads or store displays. Ultimately, this data-driven approach allows a retailer to better predict what a customer wants “next,” leading to bigger baskets and a better shopping experience.

0.12 Limitations

Computational Sampling: The full H&M dataset contains over 31 million transactions. Due to hardware limitations and processing time, this study used a sampled subset of 300,000 transactions. While this provided 30 high-quality rules, it may miss “long-tail” or niche fashion trends.

Static Nature of Association Rules: The Apriori algorithm finds items bought at the same time. It does not account for sequence (e.g., a customer buying a blazer in September and returning for a coat in October).

0.13 AI Statement

This research was supported by the use of Large Language Models (LLMs) to assist with R code optimization, data visualization formatting and Code Debugging.All final interpretations, strategic recommendations,and conclusions remain the original work of the author.

0.14 References

Foundational Algorithms

Agrawal, R., & Srikant, R. (1994). “Fast algorithms for mining association rules.” Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 487-499.

Han, J., Pei, J., & Yin, Y. (2000). “Mining frequent patterns without candidate generation.” ACM SIGMOD Record, 29(2), 1-12.

R Programming & Libraries

Hahsler, M., Grün, B., & Hornik, K. (2005). “arules – A Computational Environment for Mining Association Rules and Frequent Item Sets.” Journal of Statistical Software, 14(15), 1–25.

Wickham, H., et al. (2019). “Welcome to the Tidyverse.” Journal of Open Source Software, 4(43), 1686.

McInnes, L., Healy, J., & Melville, J. (2018). “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv preprint arXiv:1802.03426.

Retail & Business Theory

Berry, M. J., & Linoff, G. (1997). Data Mining Techniques: For Marketing, Sales, and Customer Support. Wiley.

Kandari, J., Singh, S., & Varma, A. (2021). “Market Basket Analysis in Retailing.” International Journal of Scientific Research in Science and Technology.