Introduction

Association rule mining is a key technique in data mining used to uncover hidden relationships between items in transactional datasets. This project focuses on applying Apriori and Eclat algorithms to analyze retail transaction data. By identifying patterns in customer purchases, we aim to generate insights that can help businesses make data-driven decisions regarding product placement, bundling strategies, and customer purchasing behaviors.

The Online Retail II dataset contains all transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware, and many of its customers are wholesalers.

Because the dataset is large, it was filtered to reduce its size while keeping meaningful information. By selecting transactions from Poland and focusing on the top 20 most frequently purchased items, we keep the computation tractable while retaining the most common purchase patterns.

Goal

The primary objective of this project is to discover frequent itemsets and association rules in a retail dataset using the Apriori and Eclat algorithms. The data is cleaned first:

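library(dplyr)  # provides %>% and filter()

# Drop rows with missing values
data <- na.omit(data)

# Keep only positive quantities (negative quantities typically mark returns or cancellations)
data <- data %>% filter(Quantity > 0)

# Store invoice dates as Date and customer IDs as a categorical variable
data$InvoiceDate <- as.Date(data$InvoiceDate)
data$Customer.ID <- as.factor(data$Customer.ID)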

Exploratory Data Analysis

ggplot(data, aes(x = Quantity)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Purchase Quantities", x = "Quantity", y = "Frequency") +
  theme_minimal() +
  xlim(0, 100)  
## Warning: Removed 9959 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

ggplot(data, aes(x = Price)) +
  geom_histogram(binwidth = 1, fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Product Prices", x = "Price", y = "Frequency") +
  theme_minimal() +
  xlim(0, 50) 
## Warning: Removed 519 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

country_sales <- data %>%
  group_by(Country) %>%
  summarise(TotalSales = sum(Quantity * Price)) %>%
  arrange(desc(TotalSales)) %>%
  head(10)

ggplot(country_sales, aes(x = reorder(Country, TotalSales), y = TotalSales)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Countries by Sales", x = "Country", y = "Total Sales") +
  theme_minimal()

country_prices <- data %>%
  group_by(Country) %>%
  summarise(AveragePrice = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(AveragePrice))

ggplot(country_prices, aes(x = reorder(Country, AveragePrice), y = AveragePrice)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  coord_flip() +
  labs(title = "Average Product Price by Country", x = "Country", y = "Average Price") +
  theme_minimal()

library(ggplot2)
library(dplyr)

data$InvoiceDate <- as.POSIXct(data$InvoiceDate, format="%Y-%m-%d %H:%M:%S")

time_series <- data %>%
  mutate(Date = as.Date(InvoiceDate)) %>%
  group_by(Date) %>%
  summarise(DailySales = sum(Quantity * Price))

ggplot(time_series, aes(x = Date, y = DailySales)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Sales Trend Over Time", x = "Date", y = "Total Sales") +
  theme_minimal()

top_products <- data %>%
  group_by(Description) %>%
  summarise(TotalQuantity = sum(Quantity)) %>%
  arrange(desc(TotalQuantity)) %>%
  head(10)

ggplot(top_products, aes(x = reorder(Description, TotalQuantity), y = TotalQuantity)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Most Sold Products", x = "Product", y = "Total Quantity Sold") +
  theme_minimal()

## Warning in asMethod(object): removing duplicated items in transactions
## [1] "Number of transactions: 28"
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data_trans, parameter = list(supp = 0.003, conf = 0.6)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [0.01s].
## writing ... [914017 rule(s)] done [0.05s].
## creating S4 object  ... done [0.22s].
##      lhs                                       rhs                         support confidence   coverage lift count
## [1]  {EDWARDIAN PARASOL BLACK,                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [2]  {POSTAGE,                                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [3]  {EDWARDIAN PARASOL BLACK,                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       SMALL CHINESE STYLE SCISSOR}          => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [4]  {EDWARDIAN PARASOL BLACK,                                                                                     
##       POSTAGE,                                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [5]  {EDWARDIAN PARASOL BLACK,                                                                                     
##       LARGE CHINESE STYLE SCISSOR,                                                                                 
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [6]  {EDWARDIAN PARASOL BLACK,                                                                                     
##       PANTRY WASHING UP BRUSH,                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [7]  {POSTAGE,                                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       SMALL CHINESE STYLE SCISSOR}          => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [8]  {PANTRY WASHING UP BRUSH,                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       SMALL CHINESE STYLE SCISSOR}          => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [9]  {LARGE CHINESE STYLE SCISSOR,                                                                                 
##       POSTAGE,                                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [10] {PANTRY WASHING UP BRUSH,                                                                                     
##       POSTAGE,                                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [11] {LARGE CHINESE STYLE SCISSOR,                                                                                 
##       PANTRY WASHING UP BRUSH,                                                                                     
##       SET OF 6 SPICE TINS PANTRY DESIGN}    => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [12] {RECIPE BOX PANTRY YELLOW DESIGN,                                                                             
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [13] {CERAMIC BOWL WITH STRAWBERRY DESIGN,                                                                         
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [14] {CERAMIC STRAWBERRY CAKE MONEY BANK,                                                                          
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [15] {IVORY KITCHEN SCALES,                                                                                        
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [16] {LARGE HEART MEASURING SPOONS,                                                                                
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [17] {CERAMIC CAKE BOWL + HANGING CAKES,                                                                           
##       SET OF 6 SPICE TINS PANTRY DESIGN,                                                                           
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [18] {CERAMIC BOWL WITH STRAWBERRY DESIGN,                                                                         
##       RECIPE BOX PANTRY YELLOW DESIGN,                                                                             
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [19] {CERAMIC STRAWBERRY CAKE MONEY BANK,                                                                          
##       RECIPE BOX PANTRY YELLOW DESIGN,                                                                             
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1
## [20] {IVORY KITCHEN SCALES,                                                                                        
##       RECIPE BOX PANTRY YELLOW DESIGN,                                                                             
##       VINTAGE CREAM CAT FOOD CONTAINER}     => {JAM MAKING SET PRINTED} 0.03571429          1 0.03571429    7     1

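The following parameters are used in this application of the Apriori method: supp = 0.003 and conf = 0.6. With only 28 transactions, a minimum support of 0.003 corresponds to an absolute count below one, so every itemset that occurs in at least one transaction passes the threshold. This is why the run returns 914,017 rules, including many backed by a single transaction (count = 1 in the listing above).

The dataset was initially too large, so it was filtered to include only transactions from Poland. This step focuses the analysis on a specific market and reduces the computational cost.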

Since the dataset contains a large number of unique products, the analysis is limited to the top 20 most frequently purchased items. This helps in identifying meaningful associations while reducing noise from less frequent products.
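
The two filtering steps can be illustrated with a minimal base R sketch. The data frame `sales`, the `Invoice` column name, and the helper `keep_top_items()` are hypothetical, and "most frequently purchased" is assumed here to mean the number of invoice lines per product; the report's own filtering code is not shown, so this is not necessarily how it was done.

```r
# Hypothetical miniature of the retail table (not real data)
sales <- data.frame(
  Invoice     = c("A1", "A1", "A2", "A2", "A3", "A3", "B1"),
  Description = c("MUG", "TEA TOWEL", "MUG", "SPOON", "MUG", "TEA TOWEL", "MUG"),
  Country     = c(rep("Poland", 6), "France"),
  stringsAsFactors = FALSE
)

keep_top_items <- function(df, country, n_items) {
  # Step 1: keep only the rows for one country
  sub <- df[df$Country == country, ]
  # Step 2: count invoice lines per product and keep the n most frequent products
  counts <- sort(table(sub$Description), decreasing = TRUE)
  top <- names(counts)[seq_len(min(n_items, length(counts)))]
  sub[sub$Description %in% top, ]
}

filtered <- keep_top_items(sales, country = "Poland", n_items = 2)
print(filtered)
```

In this toy example only MUG and TEA TOWEL survive for Poland, and the single SPOON line is dropped.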

The filtered dataset is converted into a transaction format, where each invoice represents a unique transaction containing multiple items. Transaction format is necessary for association rule mining.
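
As a rough illustration of the transaction format, the base R sketch below groups invoice lines into one basket per invoice and builds a binary incidence matrix. The data and the `Invoice` column name are hypothetical. In arules, a list of baskets like this can be coerced with `as(baskets, "transactions")`; that the report used exactly that call is an assumption, although the "removing duplicated items in transactions" warning in the output is consistent with it.

```r
# Hypothetical invoice lines (not real data); "Invoice" is an assumed column name
invoice_lines <- data.frame(
  Invoice     = c("A1", "A1", "A1", "A2", "A2", "A3"),
  Description = c("MUG", "TEA TOWEL", "MUG", "MUG", "SPOON", "SPOON"),
  stringsAsFactors = FALSE
)

# One basket per invoice, with repeated items collapsed to a single entry
baskets <- lapply(split(invoice_lines$Description, invoice_lines$Invoice), unique)

# Binary incidence matrix: rows are invoices, columns are items
items <- sort(unique(invoice_lines$Description))
incidence <- t(vapply(baskets, function(b) items %in% b, logical(length(items))))
colnames(incidence) <- items

print(incidence)
```

Each row of `incidence` is one transaction, which is the shape of input that association rule mining operates on.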

The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.

plot(rules, measure=c("support", "confidence"), shading="lift", engine= "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
rules <- apriori(data_trans, parameter = list(supp = 0.005, conf = 0.25))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data_trans, parameter = list(supp = 0.005, conf = 0.25)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [0.01s].
## writing ... [955314 rule(s)] done [0.05s].
## creating S4 object  ... done [0.13s].
summary(rules)
## set of 955314 rules
## 
## rule length distribution (lhs + rhs):sizes
##      1      2      3      4      5      6      7      8      9     10 
##      4    337   2969  12900  39295  89250 156380 215312 235017 203850 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   7.000   8.000   8.127   9.000  10.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.03571   Min.   :0.2500   Min.   :0.03571   Min.   :0.7778  
##  1st Qu.:0.03571   1st Qu.:1.0000   1st Qu.:0.03571   1st Qu.:4.6667  
##  Median :0.03571   Median :1.0000   Median :0.03571   Median :5.6000  
##  Mean   :0.03698   Mean   :0.9769   Mean   :0.03878   Mean   :4.9335  
##  3rd Qu.:0.03571   3rd Qu.:1.0000   3rd Qu.:0.03571   3rd Qu.:5.6000  
##  Max.   :0.32143   Max.   :1.0000   Max.   :1.00000   Max.   :7.0000  
##      count      
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.035  
##  3rd Qu.:1.000  
##  Max.   :9.000  
## 
## mining info:
##        data ntransactions support confidence
##  data_trans            28   0.005       0.25
##                                                                     call
##  apriori(data = data_trans, parameter = list(supp = 0.005, conf = 0.25))
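
The quality measures reported above (support, confidence, coverage, lift, count) can be reproduced by hand. The base R sketch below does so on a small set of hypothetical baskets; the helper names `support()` and `rule_measures()` are illustrative and are not part of arules.

```r
# Toy baskets (hypothetical data, not taken from the retail dataset)
baskets <- list(
  c("tea", "mug"),
  c("tea", "mug", "spoon"),
  c("tea"),
  c("spoon")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Quality measures for the rule lhs => rhs
rule_measures <- function(lhs, rhs, baskets) {
  supp_rule <- support(c(lhs, rhs), baskets)
  supp_lhs  <- support(lhs, baskets)
  supp_rhs  <- support(rhs, baskets)
  conf <- supp_rule / supp_lhs
  c(support    = supp_rule,
    confidence = conf,
    coverage   = supp_lhs,
    lift       = conf / supp_rhs,
    count      = supp_rule * length(baskets))
}

m <- rule_measures("mug", "tea", baskets)
print(m)
```

The same arithmetic explains the summary: a support of 0.03571 on 28 transactions is 1/28, which means the rule is observed in a single basket.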

Eclat Method

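Eclat computes frequent itemsets from intersections of transaction ID lists (tid-lists): each item is stored with the IDs of the transactions that contain it, and the support of an itemset is the size of the intersection of its items' tid-lists. Unlike Apriori, it does not count candidate itemsets by repeatedly scanning the transactions, which often makes it faster, although the tid-lists themselves take up memory.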
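
The idea of intersecting transaction IDs can be sketched in a few lines of base R. This is a simplified illustration on hypothetical baskets, not the implementation used by `eclat()` in arules; the function name `eclat_sketch()` and the absolute threshold `min_count` are assumptions made for the example.

```r
# Toy baskets (hypothetical data)
baskets <- list(
  t1 = c("mug", "tea"),
  t2 = c("mug", "tea", "spoon"),
  t3 = c("tea"),
  t4 = c("mug", "spoon")
)

# Vertical layout: for each item, the IDs of the transactions that contain it
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(setNames(items, items), function(it) {
  names(baskets)[vapply(baskets, function(b) it %in% b, logical(1))]
})

# Depth-first search: extend an itemset by intersecting tid-lists
eclat_sketch <- function(prefix, prefix_tids, candidates, min_count) {
  found <- list()
  for (i in seq_along(candidates)) {
    item <- candidates[i]
    tids <- if (is.null(prefix_tids)) {
      tidlists[[item]]
    } else {
      intersect(prefix_tids, tidlists[[item]])
    }
    if (length(tids) >= min_count) {
      itemset <- c(prefix, item)
      found[[paste(itemset, collapse = ",")]] <- length(tids)
      # Only items later in the ordering are tried, so each itemset is visited once
      found <- c(found, eclat_sketch(itemset, tids, candidates[-seq_len(i)], min_count))
    }
  }
  found
}

frequent <- eclat_sketch(character(0), NULL, items, min_count = 2)
print(unlist(frequent))
```

The support count of each itemset is simply the length of its intersected tid-list, so the original baskets are never scanned again after the vertical layout is built.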

plot(rules, measure=c("support", "confidence"), shading="lift", method="scatterplot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

itemsets <- eclat(data_trans, parameter = list(supp = 0.0005, maxlen = 4))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   5e-04      1      4 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 0 
## 
## create itemset ... 
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating bit matrix ... [24 row(s), 28 column(s)] done [0.00s].
## writing  ... [4462 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
freq.items <- sort(itemsets, by = "support")[seq_len(min(10, length(itemsets)))]
# `type` is not a recognized control parameter for the igraph engine, so it is omitted
plot(freq.items, method="graph", engine="igraph")

plot(freq.items, method="graph", engine="htmlwidget")

Eclat results: mining frequent itemsets with Eclat shows which items are commonly bought together in Poland.

Apriori results: the generated rules add direction to these relationships, in the form “If a customer buys X, they are likely to buy Y.”

In practice, the two methods complement each other: Eclat can first find the frequent itemsets, and association rules can then be derived from those itemsets and filtered by confidence and lift.
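
How rules can follow from frequent itemsets is shown in the base R sketch below, which turns a table of itemset support counts into single-consequent rules. The counts are hypothetical and `rules_from_itemsets()` is an illustrative helper. arules provides `ruleInduction()` for this purpose; the sketch is a hand-rolled illustration and not that function's implementation.

```r
# Hypothetical support counts for frequent itemsets over 4 transactions
n_trans <- 4
counts <- c("mug" = 3, "spoon" = 2, "tea" = 3, "mug,spoon" = 2, "mug,tea" = 2)

rules_from_itemsets <- function(counts, n_trans, min_conf) {
  out <- list()
  for (key in names(counts)) {
    items <- strsplit(key, ",", fixed = TRUE)[[1]]
    if (length(items) < 2) next  # a rule needs both a left- and a right-hand side
    for (rhs in items) {
      lhs_key <- paste(setdiff(items, rhs), collapse = ",")
      # confidence = count(lhs and rhs together) / count(lhs)
      conf <- counts[[key]] / counts[[lhs_key]]
      if (conf >= min_conf) {
        out[[length(out) + 1]] <- data.frame(
          rule       = paste0("{", lhs_key, "} => {", rhs, "}"),
          support    = counts[[key]] / n_trans,
          confidence = conf,
          lift       = conf / (counts[[rhs]] / n_trans)
        )
      }
    }
  }
  do.call(rbind, out)
}

rules_tbl <- rules_from_itemsets(counts, n_trans, min_conf = 0.8)
print(rules_tbl)
```

With a confidence threshold of 0.8, only `{spoon} => {mug}` survives in this toy example, because every basket containing a spoon also contains a mug.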

To sum up, applying association rule mining to customer purchases in Poland surfaced a number of co-purchase patterns, in keeping with its traditional use in market basket analysis. Using the Apriori and Eclat algorithms, we identified frequent itemsets and rules that could help businesses optimize product bundling, inventory management, and marketing strategies. Because the filtered data contains only 28 transactions and most rules rest on a single transaction, these patterns should be read as exploratory rather than conclusive. There is still much to explore in the field of association rule mining, but even this simple approach illustrates its practical applications in retail analytics.