The modern retail landscape relies heavily on data-driven insights to optimize operations and enhance customer satisfaction. One powerful analytical technique is association rule mining, which helps identify patterns in customer purchasing behavior. By uncovering relationships between products frequently bought together, businesses can improve their marketing strategies, optimize store layouts, and increase sales through effective cross-selling.
In this project, we applied association rule mining to a market dataset to discover hidden patterns in transaction data. We utilized the Apriori algorithm, a popular technique in data mining, to generate association rules and interpret their significance. This report details our methodology, findings, and actionable business recommendations derived from the analysis.
The dataset used in this analysis consists of 464 transactions and 22 distinct items typically found in a grocery store, including products like Milk, Bread, Butter, and Egg. Each row in the dataset represents a transaction, while each column corresponds to a specific product. A value of 1 indicates the item was purchased in that transaction, while 0 indicates it was not.
Before conducting association rule mining, we performed several preprocessing steps to ensure the data was in the correct format for analysis. This involved converting the binary data to a logical format and transforming the dataset into a transaction class using the arules package.
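# Load arules, which provides the transactions class, apriori() and inspect()
library(arules)
# Load dataset (semicolon-separated, "." as decimal mark)
market_data <- read.csv("market.csv", sep = ";", dec = ".")
# Convert each 0/1 column to logical, then coerce the data frame to transactions
market_data_logical <- as.data.frame(lapply(market_data, function(x) as.logical(as.integer(x))))
transactions <- as(market_data_logical, "transactions")
# Inspect the first five transactions
inspect(transactions[1:5])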
## items transactionID
## [1] {Bread,
## Bacon,
## Banana,
## Apple,
## Hazelnut,
## Carrot,
## HeavyCream,
## Egg,
## Sugar} 1
## [2] {Bread,
## Honey,
## Bacon,
## Banana,
## Apple,
## Hazelnut,
## Cucumber,
## Milk,
## Butter,
## Flour,
## Olive,
## Shampoo} 2
## [3] {Honey,
## Bacon,
## Toothpaste,
## Banana,
## Apple,
## Hazelnut,
## Cheese,
## Meat,
## Cucumber,
## Onion,
## Milk,
## ShavingFoam,
## Salt,
## Flour,
## HeavyCream,
## Egg,
## Sugar} 3
## [4] {Bread,
## Honey,
## Toothpaste,
## Apple,
## Cucumber,
## Onion,
## Milk,
## Flour,
## Egg,
## Olive,
## Shampoo} 4
## [5] {Honey} 5
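To make the conversion step concrete, the following base R sketch applies the same 0/1 to logical conversion to a tiny table and lists the items in each row. The toy table and its values are made up for illustration and are not taken from market.csv.

```r
# Toy 0/1 purchase table; item names and values are hypothetical
toy <- data.frame(Milk = c(1, 0, 1), Bread = c(1, 1, 0), Egg = c(0, 1, 1))

# Same conversion as in the report: 0/1 becomes FALSE/TRUE, column by column
toy_logical <- as.data.frame(lapply(toy, function(x) as.logical(as.integer(x))))

# For each row, keep the names of the columns that are TRUE
baskets <- lapply(seq_len(nrow(toy_logical)), function(i) {
  names(toy_logical)[unlist(toy_logical[i, ])]
})

print(baskets)
```

Each element of baskets plays the role of one transaction: only the purchased items are kept, which is the view inspect() shows above.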
Before applying the Apriori algorithm, it is important to explore the dataset to understand item frequency and general transaction patterns. This initial exploration provides insights into which products are most commonly purchased and sets expectations for the association rules.
# Plot item frequency
itemFrequencyPlot(transactions, topN = 10, type = "absolute", main = "Top 10 Purchased Items")
# Summary of transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
## 464 rows (elements/itemsets/transactions) and
## 22 columns (items) and a density of 0.3993926
##
## most frequent items:
## Banana Cheese Bacon Hazelnut Honey (Other)
## 208 206 200 195 193 3075
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 19 22 11 30 33 28 25 35 37 45 42 43 41 27 18 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 9.000 8.787 12.000 17.000
##
## includes extended item information - examples:
## labels variables levels
## 1 Bread Bread TRUE
## 2 Honey Honey TRUE
## 3 Bacon Bacon TRUE
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
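Interpretation: The five most frequent items are Banana (208 transactions), Cheese (206), Bacon (200), Hazelnut (195), and Honey (193), so no single product dominates. Baskets are fairly large, with a median of 9 and a mean of about 8.8 items out of 22, which matches the reported density of roughly 0.40.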
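The density and mean basket size reported by summary() can be cross-checked from the length distribution alone. The sketch below copies the sizes and counts from the output above; the variable names are chosen here for illustration.

```r
# Transaction-length distribution copied from the summary() output above
sizes  <- 1:17
counts <- c(19, 22, 11, 30, 33, 28, 25, 35, 37, 45, 42, 43, 41, 27, 18, 5, 3)

n_transactions <- sum(counts)          # number of transactions
n_items        <- 22                   # number of item columns
total_items    <- sum(sizes * counts)  # total item occurrences over all baskets

# Density is the share of filled cells in the transaction-by-item matrix
density   <- total_items / (n_transactions * n_items)
mean_size <- total_items / n_transactions

print(c(transactions = n_transactions, item_occurrences = total_items))
print(round(c(density = density, mean_size = mean_size), 4))
```

The totals agree with the 464 transactions, the density of 0.3993926, and the mean length of 8.787 shown in the summary.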
With a foundational understanding of the data, we applied the Apriori algorithm to uncover relationships between items. The algorithm identifies frequent itemsets and generates association rules that meet specified minimum thresholds for support and confidence. Lift is not a threshold in our runs; we use it afterwards to rank the resulting rules.
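As a reminder of what the three measures mean, here is a minimal base R sketch that computes them by hand for one rule. The baskets are made up for illustration, and the helper function support() is defined here rather than taken from arules.

```r
# Toy baskets, hypothetical and not taken from market.csv
baskets <- list(
  c("Bread", "Butter", "Bacon"),
  c("Bread", "Butter"),
  c("Butter", "Bacon"),
  c("Bread", "Bacon"),
  c("Milk")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("Bread", "Butter")
rhs <- "Bacon"

rule_support    <- support(c(lhs, rhs), baskets)           # share with lhs and rhs together
rule_confidence <- rule_support / support(lhs, baskets)    # share of lhs baskets that also hold rhs
rule_lift       <- rule_confidence / support(rhs, baskets) # confidence relative to the rhs base rate

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

A lift above 1 indicates that the right-hand item appears more often with the left-hand items than its overall frequency would suggest; in this toy example the lift is below 1.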
We started with moderate thresholds to capture a broad range of rules:
# Initial parameters
rules <- apriori(transactions, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 23
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[22 item(s), 464 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2597 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(rules, by = "lift")[1:10])
## lhs rhs support confidence
## [1] {Honey, Bacon, Carrot, Egg} => {Meat} 0.05603448 0.8125000
## [2] {Bacon, Banana, Butter, Egg} => {Cheese} 0.05387931 0.9259259
## [3] {Bacon, Meat, Carrot, Onion} => {Honey} 0.05387931 0.8620690
## [4] {Honey, Banana, Meat, Onion} => {Bacon} 0.05172414 0.8888889
## [5] {Bacon, Meat, Salt} => {Sugar} 0.06465517 0.7500000
## [6] {Banana, Cheese, Butter, ShavingFoam} => {Bacon} 0.06034483 0.8750000
## [7] {Honey, Bacon, Carrot, Onion} => {Meat} 0.05387931 0.7812500
## [8] {Bacon, Hazelnut, Cheese, ShavingFoam} => {Butter} 0.05818966 0.7500000
## [9] {Honey, Meat, Carrot, Onion} => {Bacon} 0.05387931 0.8620690
## [10] {Toothpaste, Hazelnut, Shampoo} => {Butter} 0.05603448 0.7428571
## coverage lift count
## [1] 0.06896552 2.094444 26
## [2] 0.05818966 2.085581 25
## [3] 0.06250000 2.072539 25
## [4] 0.05818966 2.062222 24
## [5] 0.08620690 2.047059 30
## [6] 0.06896552 2.030000 28
## [7] 0.06896552 2.013889 25
## [8] 0.07758621 2.000000 27
## [9] 0.06250000 2.000000 25
## [10] 0.07543103 1.980952 26
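The "Absolute minimum support count" lines in the Apriori logs (23 for a support of 0.05, 32 for a support of 0.07) can be reproduced from the threshold and the number of transactions. The sketch below assumes the printed value is the truncated product; this is inferred from the logged numbers, not from the arules source, and min_count() is a helper named here for illustration.

```r
n_transactions <- 464

# Assumption: the logged count is the truncated product of threshold and n
min_count <- function(min_support, n) floor(min_support * n)

initial_count <- min_count(0.05, n_transactions)  # run with supp = 0.05
refined_count <- min_count(0.07, n_transactions)  # run with supp = 0.07

print(c(initial = initial_count, refined = refined_count))
```

Because 0.05 × 464 = 23.2 and 0.07 × 464 = 32.48 are not whole numbers, a rule needs at least 24 or 33 transactions to reach the threshold, which is consistent with the smallest counts among the rules listed in the report.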
To focus on more significant associations, we increased the support and confidence thresholds, narrowing our analysis to stronger relationships.
# Refined parameters
refined_rules <- apriori(transactions, parameter = list(supp = 0.07, conf = 0.75, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.07 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 32
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[22 item(s), 464 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [12 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(refined_rules, by = "lift")[1:10])
## lhs rhs support confidence
## [1] {Honey, Bacon, Onion} => {Meat} 0.07543103 0.7608696
## [2] {Banana, Butter, ShavingFoam} => {Bacon} 0.07974138 0.8043478
## [3] {Banana, Cheese, Butter} => {Bacon} 0.09051724 0.7924528
## [4] {Hazelnut, Butter, ShavingFoam} => {Bacon} 0.08405172 0.7800000
## [5] {Cheese, Butter, Egg} => {Bacon} 0.08189655 0.7755102
## [6] {Bread, Butter, ShavingFoam} => {Bacon} 0.07327586 0.7727273
## [7] {Bacon, Butter, Egg} => {Cheese} 0.08189655 0.7916667
## [8] {Carrot, Onion, Butter} => {Cheese} 0.07112069 0.7857143
## [9] {Carrot, ShavingFoam, Olive} => {Banana} 0.07112069 0.7674419
## [10] {Bacon, Meat, Butter} => {Cheese} 0.07758621 0.7500000
## coverage lift count
## [1] 0.09913793 1.961353 35
## [2] 0.09913793 1.866087 37
## [3] 0.11422414 1.838491 42
## [4] 0.10775862 1.809600 39
## [5] 0.10560345 1.799184 38
## [6] 0.09482759 1.792727 34
## [7] 0.10344828 1.783172 38
## [8] 0.09051724 1.769764 33
## [9] 0.09267241 1.711986 33
## [10] 0.10344828 1.689320 36
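The columns of this table are tied together by simple ratios, which can be checked for refined rule [2], {Banana, Butter, ShavingFoam} => {Bacon}. In the sketch below, the rule count and the Bacon frequency are read from the outputs above, while the left-hand-side count of 46 is inferred from the reported coverage (0.09913793 × 464).

```r
n_transactions <- 464   # from the summary() output
count_rule     <- 37    # transactions containing both sides of rule [2]
count_bacon    <- 200   # Bacon frequency from the summary() output
count_lhs      <- 46    # inferred from coverage: 0.09913793 * 464

support    <- count_rule / n_transactions                # share with lhs and rhs together
coverage   <- count_lhs / n_transactions                 # share with the lhs
confidence <- count_rule / count_lhs                     # support / coverage
lift       <- confidence / (count_bacon / n_transactions) # confidence / support of Bacon

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 6))
```

The results match the reported support (0.07974138), coverage (0.09913793), confidence (0.8043478), and lift (1.866087) for that rule.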
To derive actionable insights, we conducted a focused analysis on two staple products: Milk and Bread. As everyday staples, any strong associations they have with other products would be directly useful for cross-selling.
# Milk-related rules
milk_rules <- subset(refined_rules, items %in% "Milk")
inspect(milk_rules)
# Bread-related rules
bread_rules <- subset(refined_rules, items %in% "Bread")
inspect(bread_rules)
## lhs rhs support confidence coverage
## [1] {Bread, Butter, ShavingFoam} => {Bacon} 0.07327586 0.7727273 0.09482759
## lift count
## [1] 1.792727 34
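| LHS (Items Bought) | RHS (Item Added) | Support | Confidence | Lift |
|---|---|---|---|---|
| {Bread, Butter, ShavingFoam} | {Bacon} | 0.073 | 0.773 | 1.793 |
Interpretation: At the refined thresholds (support 0.07, confidence 0.75), no rule contains Milk, so inspect(milk_rules) prints nothing, and exactly one rule contains Bread. Customers who buy Bread, Butter, and ShavingFoam together also buy Bacon in about 77% of cases, roughly 1.8 times the rate expected if the purchases were independent.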
Visualization is a powerful tool to understand the complexity of relationships between items. We used several methods to present our findings visually.
# Load arulesViz, which provides the plot() methods for rules
library(arulesViz)
# Grouped matrix plot of the refined rules
# (main is not a control parameter of this method, so no title is passed)
plot(refined_rules, method = "grouped")
# Network graph visualization of the refined rules
# (main is not a control parameter of this method, so no title is passed)
if (length(refined_rules) > 0) {
  plot(refined_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No refined rules available to plot.\n")
}
# Focused visualization for Milk-related rules
if (length(milk_rules) > 0) {
  plot(milk_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No Milk-related rules available to plot.\n")
}
## No Milk-related rules available to plot.
# Focused visualization for Bread-related rules
if (length(bread_rules) > 0) {
  plot(bread_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No Bread-related rules available to plot.\n")
}
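Visualization Insights: The plots draw on the same 12 refined rules, so their structure follows the rule table above. Among the ten highest-lift rules, Bacon is the consequent in five and Cheese in three, and Butter appears in the antecedent of eight, which makes Bacon, Cheese, and Butter the most connected items among these rules. The Milk plot is skipped because no Milk-related rule passed the refined thresholds, and the Bread plot shows the single rule {Bread, Butter, ShavingFoam} => {Bacon}.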
Based on the findings from this analysis, several strategic recommendations can be made to enhance sales and improve customer experience.
This analysis successfully leveraged association rule mining to uncover valuable insights into customer purchasing behaviors. By identifying strong relationships between products, we provided actionable recommendations that can drive sales, optimize store layouts, and enhance marketing strategies. The combination of data-driven insights and practical applications underscores the power of association rule mining in modern retail analytics.
arules Package Documentation: https://cran.r-project.org/web/packages/arules/arules.pdf