Association rule mining is a key technique in data mining used to uncover hidden relationships between items in transactional datasets. This project focuses on applying Apriori and Eclat algorithms to analyze retail transaction data. By identifying patterns in customer purchases, we aim to generate insights that can help businesses make data-driven decisions regarding product placement, bundling strategies, and customer purchasing behaviors.
The Online Retail II dataset contains all transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.
Due to the large size of the dataset, it was necessary to apply filtering techniques to reduce its size while maintaining meaningful information. By selecting transactions from Poland and focusing on the top 20 most frequently purchased items, we ensure computational efficiency without losing significant patterns.
The primary objective of this project is to discover frequent itemsets and association rules in a retail dataset using the Apriori and Eclat algorithms.
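library(dplyr)    # provides %>%, filter(), group_by(), summarise()
library(ggplot2)  # provides the plotting functions used below
data <- na.omit(data)                   # drop rows with any missing value
data <- data %>% filter(Quantity > 0)   # keep purchases only (non-positive quantities are removed)
data$InvoiceDate <- as.Date(data$InvoiceDate)
data$Customer.ID <- as.factor(data$Customer.ID)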
# Histogram of purchase quantities; xlim() drops values outside 0-100 (hence the warnings below)
ggplot(data, aes(x = Quantity)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Purchase Quantities", x = "Quantity", y = "Frequency") +
  theme_minimal() +
  xlim(0, 100)
## Warning: Removed 9959 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# Histogram of unit prices; xlim() drops values outside 0-50 (hence the warnings below)
ggplot(data, aes(x = Price)) +
  geom_histogram(binwidth = 1, fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Product Prices", x = "Price", y = "Frequency") +
  theme_minimal() +
  xlim(0, 50)
## Warning: Removed 519 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# Total revenue (Quantity * Price) per country, top 10 only
country_sales <- data %>%
  group_by(Country) %>%
  summarise(TotalSales = sum(Quantity * Price)) %>%
  arrange(desc(TotalSales)) %>%
  head(10)
ggplot(country_sales, aes(x = reorder(Country, TotalSales), y = TotalSales)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Countries by Sales", x = "Country", y = "Total Sales") +
  theme_minimal()
# Mean unit price per country, sorted from highest to lowest
country_prices <- data %>%
  group_by(Country) %>%
  summarise(AveragePrice = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(AveragePrice))
ggplot(country_prices, aes(x = reorder(Country, AveragePrice), y = AveragePrice)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  coord_flip() +
  labs(title = "Average Product Price by Country", x = "Country", y = "Average Price") +
  theme_minimal()
library(ggplot2)
library(dplyr)
# InvoiceDate was converted to Date earlier, so the time of day is no longer available here
data$InvoiceDate <- as.POSIXct(data$InvoiceDate, format = "%Y-%m-%d %H:%M:%S")
# Total revenue (Quantity * Price) per calendar day
time_series <- data %>%
  mutate(Date = as.Date(InvoiceDate)) %>%
  group_by(Date) %>%
  summarise(DailySales = sum(Quantity * Price))
ggplot(time_series, aes(x = Date, y = DailySales)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Sales Trend Over Time", x = "Date", y = "Total Sales") +
  theme_minimal()
# Ten products with the largest total quantity sold
top_products <- data %>%
  group_by(Description) %>%
  summarise(TotalQuantity = sum(Quantity)) %>%
  arrange(desc(TotalQuantity)) %>%
  head(10)
ggplot(top_products, aes(x = reorder(Description, TotalQuantity), y = TotalQuantity)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Most Sold Products", x = "Product", y = "Total Quantity Sold") +
  theme_minimal()
## Warning in asMethod(object): removing duplicated items in transactions
## [1] "Number of transactions: 28"
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data_trans, parameter = list(supp = 0.003, conf = 0.6)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.01s].
## writing ... [914017 rule(s)] done [0.05s].
## creating S4 object ... done [0.22s].
## lhs rhs support confidence coverage lift count
## [1] {EDWARDIAN PARASOL BLACK,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [2] {POSTAGE,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [3] {EDWARDIAN PARASOL BLACK,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## SMALL CHINESE STYLE SCISSOR} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [4] {EDWARDIAN PARASOL BLACK,
## POSTAGE,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [5] {EDWARDIAN PARASOL BLACK,
## LARGE CHINESE STYLE SCISSOR,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [6] {EDWARDIAN PARASOL BLACK,
## PANTRY WASHING UP BRUSH,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [7] {POSTAGE,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## SMALL CHINESE STYLE SCISSOR} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [8] {PANTRY WASHING UP BRUSH,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## SMALL CHINESE STYLE SCISSOR} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [9] {LARGE CHINESE STYLE SCISSOR,
## POSTAGE,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [10] {PANTRY WASHING UP BRUSH,
## POSTAGE,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [11] {LARGE CHINESE STYLE SCISSOR,
## PANTRY WASHING UP BRUSH,
## SET OF 6 SPICE TINS PANTRY DESIGN} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [12] {RECIPE BOX PANTRY YELLOW DESIGN,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [13] {CERAMIC BOWL WITH STRAWBERRY DESIGN,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [14] {CERAMIC STRAWBERRY CAKE MONEY BANK,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [15] {IVORY KITCHEN SCALES,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [16] {LARGE HEART MEASURING SPOONS,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [17] {CERAMIC CAKE BOWL + HANGING CAKES,
## SET OF 6 SPICE TINS PANTRY DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [18] {CERAMIC BOWL WITH STRAWBERRY DESIGN,
## RECIPE BOX PANTRY YELLOW DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [19] {CERAMIC STRAWBERRY CAKE MONEY BANK,
## RECIPE BOX PANTRY YELLOW DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
## [20] {IVORY KITCHEN SCALES,
## RECIPE BOX PANTRY YELLOW DESIGN,
## VINTAGE CREAM CAT FOOD CONTAINER} => {JAM MAKING SET PRINTED} 0.03571429 1 0.03571429 7 1
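The quality measures in the listing above can be checked by hand. The following is a minimal sketch in base R that takes the values printed for rule [1] and the reported number of transactions, and recomputes support, coverage, and the implied support of the right-hand side from the standard definitions. The variable names are illustrative and are not part of the original analysis.

```r
# Values read from rule [1] in the listing above
n_transactions <- 28   # "Number of transactions: 28"
rule_count     <- 1    # count column
confidence     <- 1    # confidence column
lift           <- 7    # lift column

# support = transactions containing lhs and rhs / all transactions
support <- rule_count / n_transactions

# coverage = support of the lhs alone = support / confidence
coverage <- support / confidence

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
rhs_count   <- round(rhs_support * n_transactions)

print(round(support, 8))   # 0.03571429, matching the support column
print(round(coverage, 8))  # 0.03571429, matching the coverage column
print(rhs_count)           # 4
```

Read this way, a lift of 7 with a confidence of 1 means the right-hand-side item appears in 4 of the 28 transactions, while the rule itself rests on a single transaction.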
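The following parameters are used in this application of the Apriori method: supp = 0.003 and conf = 0.6. With only 28 transactions, a minimum support of 0.003 corresponds to less than one transaction (the output above reports an absolute minimum support count of 0), so every itemset that occurs at least once passes the support threshold; this is why 914,017 rules are generated.
The dataset was initially too large, so filtering was applied to include only transactions from Poland. This step focuses the analysis on a specific market while improving computational efficiency.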
Since the dataset contains a large number of unique products, the analysis is limited to the top 20 most frequently purchased items. This helps in identifying meaningful associations while reducing noise from less frequent products.
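The filtering chunk itself is not shown in this report, so the following is only a sketch of how the two steps described above (one country, then the most frequent items) could look in base R. The toy data frame, the column name `Invoice`, the variable names, and the choice to measure frequency by number of invoice lines are all assumptions made for illustration.

```r
# Toy stand-in for the cleaned data; `Invoice` is an assumed column name
toy <- data.frame(
  Invoice     = c("A1", "A1", "A2", "A2", "A2", "A3", "B1"),
  Description = c("POSTAGE", "JAM MAKING SET PRINTED",
                  "POSTAGE", "JAM MAKING SET PRINTED", "IVORY KITCHEN SCALES",
                  "POSTAGE", "POSTAGE"),
  Country     = c(rep("Poland", 6), "United Kingdom"),
  stringsAsFactors = FALSE
)

top_n <- 2  # the report uses 20; 2 keeps the toy example readable

# Step 1: keep a single country
poland <- toy[toy$Country == "Poland", ]

# Step 2: rank items by number of invoice lines and keep the top_n
item_counts <- sort(table(poland$Description), decreasing = TRUE)
top_items   <- names(item_counts)[seq_len(min(top_n, length(item_counts)))]

# Step 3: keep only the rows that belong to those items
poland_top <- poland[poland$Description %in% top_items, ]

print(top_items)
print(nrow(poland_top))
```

Ranking by total quantity instead of invoice lines would be an equally plausible reading of "most frequently purchased"; the report does not say which was used.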
The filtered dataset is converted into a transaction format, where each invoice represents a unique transaction containing multiple items. Transaction format is necessary for association rule mining.
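As a sketch of what the transaction format means, the base R snippet below groups invoice lines into one basket per invoice. The data and the column name `Invoice` are hypothetical. The final comment shows the coercion that the arules package provides for such a list; whether the original analysis built `data_trans` exactly this way is not shown in the report.

```r
# Toy invoice lines; `Invoice` is an assumed column name
invoice_lines <- data.frame(
  Invoice     = c("A1", "A1", "A1", "A2", "A3", "A3"),
  Description = c("POSTAGE", "JAM MAKING SET PRINTED", "POSTAGE",
                  "POSTAGE", "IVORY KITCHEN SCALES", "POSTAGE"),
  stringsAsFactors = FALSE
)

# One basket per invoice; unique() drops items repeated within an invoice
baskets <- lapply(split(invoice_lines$Description, invoice_lines$Invoice), unique)

print(lengths(baskets))

# With the arules package loaded, a list like this can be coerced with
#   data_trans <- as(baskets, "transactions")
```

Dropping repeated items inside an invoice matters because a transaction records only whether an item is present, not how many times it was bought.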
The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.
plot(rules, measure = c("support", "confidence"), shading = "lift", engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
rules <- apriori(data_trans, parameter = list(supp = 0.005, conf = 0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data_trans, parameter = list(supp = 0.005, conf = 0.25)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.01s].
## writing ... [955314 rule(s)] done [0.05s].
## creating S4 object ... done [0.13s].
summary(rules)
## set of 955314 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10
## 4 337 2969 12900 39295 89250 156380 215312 235017 203850
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 8.000 8.127 9.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.03571 Min. :0.2500 Min. :0.03571 Min. :0.7778
## 1st Qu.:0.03571 1st Qu.:1.0000 1st Qu.:0.03571 1st Qu.:4.6667
## Median :0.03571 Median :1.0000 Median :0.03571 Median :5.6000
## Mean :0.03698 Mean :0.9769 Mean :0.03878 Mean :4.9335
## 3rd Qu.:0.03571 3rd Qu.:1.0000 3rd Qu.:0.03571 3rd Qu.:5.6000
## Max. :0.32143 Max. :1.0000 Max. :1.00000 Max. :7.0000
## count
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.035
## 3rd Qu.:1.000
## Max. :9.000
##
## mining info:
## data ntransactions support confidence
## data_trans 28 0.005 0.25
## call
## apriori(data = data_trans, parameter = list(supp = 0.005, conf = 0.25))
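The summary above shows a median count of 1, which follows from how small the relative support thresholds are compared with the number of transactions. The sketch below converts the thresholds used in this report into transaction counts and shows, as a purely hypothetical alternative, how a relative threshold could be derived from a desired minimum count. The value `min_count` is an illustrative assumption and was not used in the original analysis.

```r
n_transactions <- 28

# Relative thresholds used in this report and the transaction counts they imply
supp_used <- c(apriori_first = 0.003, apriori_second = 0.005, eclat = 0.0005)
implied_counts <- supp_used * n_transactions
print(implied_counts)  # every value is well below one transaction

# Hypothetical alternative: require an itemset to occur in at least 3 invoices
min_count <- 3
supp_for_min_count <- min_count / n_transactions
print(round(supp_for_min_count, 4))  # 0.1071
```

Since every threshold maps to less than one transaction, none of them filters anything out, which is consistent with the "Absolute minimum support count: 0" line in the output.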
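The Eclat algorithm computes frequent itemsets by intersecting lists of transaction IDs (a vertical data layout): the support of an itemset is the size of the intersection of its items' ID lists. Unlike Apriori, it does not generate candidate itemsets level by level and count them against the transactions, which usually reduces the number of passes over the data.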
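The intersection idea can be sketched in a few lines of base R. This is not the implementation used by the `eclat()` function and it omits the depth-first search over itemsets; it only illustrates how support falls out of intersecting transaction-ID lists. The baskets and all names are hypothetical.

```r
# Toy baskets (hypothetical data, not taken from the Polish subset)
baskets <- list(
  t1 = c("POSTAGE", "JAM MAKING SET PRINTED"),
  t2 = c("POSTAGE", "IVORY KITCHEN SCALES"),
  t3 = c("POSTAGE", "JAM MAKING SET PRINTED", "IVORY KITCHEN SCALES"),
  t4 = c("JAM MAKING SET PRINTED")
)

# Vertical layout: for each item, the ids of the transactions containing it
items    <- sort(unique(unlist(baskets)))
tid_list <- lapply(setNames(items, items), function(item) {
  names(baskets)[vapply(baskets, function(b) item %in% b, logical(1))]
})

# Support of an itemset = size of the intersection of its items' tid-lists
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_list[itemset])
  length(tids) / length(baskets)
}

print(tid_list[["POSTAGE"]])                                      # "t1" "t2" "t3"
print(itemset_support(c("POSTAGE", "JAM MAKING SET PRINTED")))    # 0.5
print(itemset_support(items))                                     # 0.25
```

Once the ID lists exist, the support of any larger itemset is obtained from the lists alone, without going back to the original baskets.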
plot(rules, measure=c("support", "confidence"), shading="lift", method="scatterplot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
itemsets <- eclat(data_trans, parameter = list(supp = 0.0005, maxlen = 4))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 5e-04 1 4 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 0
##
## create itemset ...
## set transactions ...[24 item(s), 28 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating bit matrix ... [24 row(s), 28 column(s)] done [0.00s].
## writing ... [4462 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
freq.items <- sort(itemsets, by = "support")[seq_len(min(10, length(itemsets)))]
# `type` is not a recognised control parameter for this plot, so it is omitted
plot(freq.items, method="graph", engine="igraph")
plot(freq.items, method="graph", engine="htmlwidget")
Eclat results: Mining frequent itemsets with Eclat identifies which items are commonly bought together in Poland.
Apriori results: The generated rules give deeper insight into the relationships between items, in the form “If a customer buys X, they are likely to buy Y.”
In practice, the two methods complement each other: Eclat can first find the frequent itemsets, and rules can then be generated from those itemsets and filtered by confidence, as Apriori does.
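To make the last point concrete, the sketch below derives one rule from itemset supports alone, which is all a frequent-itemset miner such as Eclat reports. The support values are hypothetical and the variable names are illustrative. In arules, the `ruleInduction()` function serves this purpose for mined itemsets, but it is not used in this report.

```r
# Hypothetical itemset supports, as a frequent-itemset miner might report them
supp_lhs  <- 0.75  # support of {POSTAGE}
supp_rhs  <- 0.75  # support of {JAM MAKING SET PRINTED}
supp_both <- 0.50  # support of {POSTAGE, JAM MAKING SET PRINTED}

# Rule {POSTAGE} => {JAM MAKING SET PRINTED}
confidence <- supp_both / supp_lhs   # share of lhs baskets that also hold rhs
lift       <- confidence / supp_rhs  # confidence relative to the rhs base rate

print(round(confidence, 4))  # 0.6667
print(round(lift, 4))        # 0.8889

# Keep the rule only if it clears a chosen confidence threshold
min_conf <- 0.6
print(confidence >= min_conf)  # TRUE
```

In this toy case the rule passes the confidence threshold even though its lift is below 1, which is a reminder to read confidence and lift together.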
To sum up, applying association rule mining to customer purchases in Poland provided useful insights. Using the Apriori and Eclat algorithms, we uncovered co-purchase patterns among frequently bought products that could help businesses optimize product bundling, inventory management, and marketing strategies. Because the filtered dataset contains only 28 transactions and most rules are supported by a single transaction (median count of 1), these patterns should be read as exploratory rather than conclusive. There is still much to explore in the field of association rule mining, but even the current approach illustrates its practical applications in retail analytics.