In a data-driven landscape, uncovering meaningful patterns in large data sets is crucial for informed decision-making. This project focuses on association rule mining, using the Apriori and Eclat algorithms to reveal significant relationships within a grocery transaction data set and to show what such rules can say about customer purchasing behavior. The data set was downloaded from Kaggle and is named Groceries_Groceries.
library(readr)
groceries_groceries <- read_csv("C:/Users/lenovo/Desktop/project 3/groceries - groceries.csv")
Rows: 9835 Columns: 1
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Item 1;Item 2;Item 3;Item 4;Item 5;Item 6;Item 7;Item 8;Item 9;Item 10;Item 11;Item 12;...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(groceries_groceries)
str(groceries_groceries)
spc_tbl_ [9,835 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Item 1;Item 2;Item 3;Item 4;Item 5;Item 6;Item 7;Item 8;Item 9;Item 10;Item 11;Item 12;Item 13;Item 14;Item 15;Item 16;Item 17;Item 18;Item 19;Item 20;Item 21;Item 22;Item 23;Item 24;Item 25;Item 26;Item 27;Item 28;Item 29;Item 30;Item 31;Item 32: chr [1:9835] "citrus fruit;semi-finished bread;margarine;ready soups;;;;;;;;;;;;;;;;;;;;;;;;;;;;" "tropical fruit;yogurt;coffee;;;;;;;;;;;;;;;;;;;;;;;;;;;;;" "whole milk;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;" "pip fruit;yogurt;cream cheese;meat spreads;;;;;;;;;;;;;;;;;;;;;;;;;;;;" ...
- attr(*, "spec")=
.. cols(
.. `Item 1;Item 2;Item 3;Item 4;Item 5;Item 6;Item 7;Item 8;Item 9;Item 10;Item 11;Item 12;Item 13;Item 14;Item 15;Item 16;Item 17;Item 18;Item 19;Item 20;Item 21;Item 22;Item 23;Item 24;Item 25;Item 26;Item 27;Item 28;Item 29;Item 30;Item 31;Item 32` = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(groceries_groceries)
I transformed the data into a format suitable for association rule mining, that is, a transaction format in which each row represents a transaction and each item is a separate column with a binary value indicating whether the item is present.
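As a rough illustration of that binary layout, the base R sketch below builds an incidence matrix from a small list of baskets. The three baskets are the first rows shown in the `str()` output, with the padding removed; `toy_baskets` and `incidence` are names invented here, and arules uses its own sparse representation internally rather than a dense matrix like this one.

```r
# Toy baskets (first three rows of the data, padding removed).
toy_baskets <- list(
  c("citrus fruit", "semi-finished bread", "margarine", "ready soups"),
  c("tropical fruit", "yogurt", "coffee"),
  c("whole milk")
)

# One column per distinct item, one row per transaction.
items <- sort(unique(unlist(toy_baskets)))
incidence <- t(sapply(toy_baskets, function(basket) items %in% basket))
colnames(incidence) <- items

# TRUE marks an item that is present in the transaction.
print(incidence)
```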
library(arules)
transactions_list <- strsplit(groceries_groceries$`Item 1;Item 2;Item 3;Item 4;Item 5;Item 6;Item 7;Item 8;Item 9;Item 10;Item 11;Item 12;Item 13;Item 14;Item 15;Item 16;Item 17;Item 18;Item 19;Item 20;Item 21;Item 22;Item 23;Item 24;Item 25;Item 26;Item 27;Item 28;Item 29;Item 30;Item 31;Item 32`, ";")
Convert the list to a transactions object. Duplicated items within a transaction are removed during the conversion.
transactions <- as(transactions_list, "transactions")
Warning: removing duplicated items in transactions
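The warning above is worth a closer look. Every row is padded with semicolons up to 32 fields, and `strsplit()` keeps the empty fields between them, so a basket can end up containing an empty string that is then treated as an item. The base R sketch below shows the effect on one row from the data and one possible way to drop the empty fields before coercion. `raw_row`, `split_items` and `clean_items` are names introduced here for illustration, and the results reported in the rest of this analysis come from the unfiltered data.

```r
# One row as it appears in the CSV (trailing padding shortened for display).
raw_row <- "tropical fruit;yogurt;coffee;;;;;"

# strsplit() keeps the empty fields between consecutive semicolons.
split_items <- strsplit(raw_row, ";")[[1]]
print(split_items)

# Keep only non-empty strings so that "" is not counted as an item.
clean_items <- split_items[nzchar(split_items)]
print(clean_items)
```

Applied to the whole list, the same filter could be written as `lapply(transactions_list, function(x) x[nzchar(x)])` before the call to `as(..., "transactions")`.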
min_support <- 0.10
itemFrequencyPlot(transactions, support = min_support, type = "relative", main = "Commodities Frequency Plot (10% Support)")
The item frequency plot at 10% support shows that whole milk was the most frequently purchased item, followed by other vegetables and rolls/buns.
The data are now in a transaction format suitable for association rule mining with the arules package.
First, I would like to see how many rules are generated without specifying a support or confidence level, so that apriori() falls back on its defaults (support 0.1, confidence 0.8).
rules <- apriori(transactions)
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 983
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[170 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.02s].
writing ... [9 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(rules)
I observed only nine rules when running with the default support and confidence levels. To explore more relationships for the purpose of this study, I change the support to 3% and the confidence to 0.5 and print the rules.
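A relative support threshold translates into an absolute number of transactions. The short calculation below assumes the count is the threshold multiplied by the 9,835 transactions and truncated to a whole number, which agrees with the 983 reported for the default run; the variable names are illustrative.

```r
# Number of transactions reported when the data were loaded.
n_transactions <- 9835

# Relative support thresholds: the default and the one chosen here.
thresholds <- c(default = 0.10, chosen = 0.03)

# Assumed conversion: multiply and truncate to a whole number of transactions.
min_counts <- floor(thresholds * n_transactions)
print(min_counts)
```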
rules <- apriori(transactions, parameter = list(support = 0.03, confidence = 0.5))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 295
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[170 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [45 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [64 rule(s)] done [0.00s].
creating S4 object ... done [0.01s].
inspect(rules)
first_six_rules <- rules[1:6]
inspect(first_six_rules)
These rules suggest that the items mentioned are each purchased in a small percentage of transactions. A lift of 1.0 indicates that the presence of the antecedent does not change the likelihood of the consequent beyond what would be expected by chance. Rules of this kind typically arise when the consequent appears in almost every transaction; here that is most likely the empty string left over from the semicolon padding, which the split keeps as an item, so these rules should not be read as evidence that the items are bought on their own.
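To make the lift argument concrete, here is a small base R sketch, not part of the original analysis, that computes support, confidence and lift for a rule A => B from a toy list of baskets. The baskets and the helpers `support()` and `rule_metrics()` are hypothetical. The second call shows a degenerate case: a consequent that appears in every basket always yields a confidence of 1 and a lift of 1, whatever the antecedent.

```r
# Hypothetical toy baskets; "bag" is deliberately present in every basket.
baskets <- list(
  c("milk", "bread", "bag"),
  c("milk", "bag"),
  c("bread", "butter", "bag"),
  c("milk", "bread", "butter", "bag")
)

# Share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift of the rule lhs => rhs.
rule_metrics <- function(lhs, rhs, baskets) {
  supp_rule <- support(c(lhs, rhs), baskets)
  conf <- supp_rule / support(lhs, baskets)
  c(support = supp_rule,
    confidence = conf,
    lift = conf / support(rhs, baskets))
}

print(rule_metrics("butter", "bread", baskets))  # lift above 1
print(rule_metrics("milk", "bag", baskets))      # confidence 1, lift 1
```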
last_six_rules <- rules[(length(rules)-5):length(rules)]
inspect(last_six_rules)
The last six rules have pairs of items as antecedents. For rolls/buns and soda, the support is around 3.83% and the confidence is 100%. For other vegetables and soda, the support is around 3.27% and the confidence is 100%. For soda and whole milk, the support is around 4.01% and the confidence is 100%. For other vegetables and rolls/buns, the support is around 4.25% and the confidence is approximately 99.76%. In summary, each pair occurs in roughly 3% to 4% of all transactions, and the consequent follows almost every time the pair is bought. A confidence this high indicates a strong association only if the lift is also well above 1; when the consequent is present in nearly every transaction, a confidence close to 100% is expected regardless of the antecedent.
The rules are then sorted by support in descending order.
rules_by_support <- sort(rules, by = "support", decreasing = TRUE)
six_rules_highest_support <- rules_by_support[1:6]
inspect(six_rules_highest_support)
The rules with the highest support highlight the most prevalent items in the data set. Whole milk, other vegetables, rolls/buns, soda, and yogurt are among the items with the highest support, indicating their popularity among customers. The support values give an idea of how frequently each item occurs in transactions.
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
six_rules_highest_lift <- rules_by_lift[1:6]
inspect(six_rules_highest_lift)
In summary, these rules suggest that specific items such as speciality chocolate, UHT-milk, onions, berries, hamburger meat and salty snack are purchased independently of the consequent: their presence on the left-hand side does not significantly affect the likelihood of the right-hand side being purchased, as indicated by lift values close to 1.
rules_by_confidence <- sort(rules, by = "confidence", decreasing = TRUE)
six_rules_highest_confidence <- rules_by_confidence[1:6]
inspect(six_rules_highest_confidence)
These are the rules with the highest confidence. For items such as speciality chocolate, UHT-milk, onions, berries, hamburger meat and salty snack, the consequent follows almost every time the item is bought, although the low support shows that each of these rules applies to only a small share of transactions.
Create a scatter plot of the rules, shaded by support
library(arulesViz)
plot(rules, method = "scatter", shading = "support", control = list(jitter = 0))
The scatter plot shows the relationship between support and confidence for each rule, with the shading also encoding support.
Analysis of change in support and confidence to get more meaningful results
Filter rules with at least 5% support and at least 80% confidence
library(arules)
filtered_rules <- subset(rules, support >= 0.05 & confidence >= 0.8)
inspect(filtered_rules)
The filtered rules describe items that can be purchased together, with baskets of no more than two items. The following item sets were determined: {rolls/buns, whole milk}, {other vegetables, whole milk} and {whole milk, yogurt}. This is further explained by the graph below.
# Filtered Rules Graph
Create a graph-based plot of the filtered rules
library(arulesViz)
plot(filtered_rules, method = "graph")
library(arulesViz)
plot(filtered_rules, method="grouped")
The grouped matrix shows the support of each rule by bubble size and the lift by color. Beef, curd and butter have the highest lift values according to this matrix. Whole milk and vegetables have the highest support, as shown by the large bubbles. The 13th rule shows a low level of both support and lift.
plot(filtered_rules, measure = c("support", "confidence"), shading = "lift", jitter = 0)
This plot shows the relationship between lift, confidence, and support. Most of the high-lift rules in the scatter plot have relatively low to moderate support. The rule with the highest confidence in this example has the lowest support and a high lift. The rules with the lowest lift often have high support and relatively low to moderate confidence. To judge the importance of a rule, the confidence and support values are examined first: confidence measures how often the rule holds when A is present, whereas support is the proportion of transactions containing both A and B.
library(arulesViz)
plot(filtered_rules, method = "graph")
The plot shows interesting relationships between items that can be bought together, including (other vegetables and whole milk), (butter, domestic eggs and bread) and (rolls/buns and yogurt), to mention a few.
Find rules with ‘Whole Milk’ in the antecedent
antecedent_rules <- subset(filtered_rules, subset = lhs %in% "whole milk")
inspect(antecedent_rules)
library(arulesViz)
# 4 rule graph
plot(antecedent_rules, method = "graph")
For further exploration, I plotted a graph of the rules associated with purchasing whole milk.
library(arulesViz)
antecedent_rules1 <- subset(filtered_rules, subset = lhs %in% "other vegetables")
inspect(antecedent_rules1)
# 2 rule graph
plot(antecedent_rules1, method = "graph")
According to the filtered rules, whole milk and other vegetables are likely to be purchased only in pairs.
antecedent_rules2 <- subset(filtered_rules, subset = lhs %in% "yogurt")
inspect(antecedent_rules2)
# 2 rule graph
plot(antecedent_rules2, method = "graph")
This graph shows that yogurt and whole milk have a high chance of being bought together, with a high support level.
library(arules)
library(arulesViz)
rule_measures <- interestMeasure(filtered_rules, measure = c("support", "confidence", "lift"))
rule_measures_df <- as.data.frame(rule_measures)
print(rule_measures_df)
plot(rule_measures_df, main = "Affinity Measures")
Out of curiosity, I computed affinity measures for the filtered rules. The following can be concluded:
High support values: the support values are close to 1, which suggests that the item sets involved in the rules are quite common in the data set.
High confidence values: the confidence values are also close to 1 in many cases. This indicates strong conditional probabilities: if one item is present, the other is very likely to be present as well.
Lift close to 1: the lift values are close to 1, indicating that the presence of one item does not significantly affect the likelihood of the other item being present. This suggests independence.
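Since a lift near 1 signals independence, one way to narrow such a table to the rules that carry information is to keep only the rows whose lift departs from 1 by some margin. The sketch below uses a hypothetical data frame with the same three columns as `rule_measures_df`; the values and the 0.1 margin are made up for illustration.

```r
# Hypothetical measures table with the same columns as rule_measures_df.
toy_measures <- data.frame(
  support    = c(0.98, 0.25, 0.07, 0.06),
  confidence = c(1.00, 0.99, 0.85, 0.81),
  lift       = c(1.00, 1.01, 1.90, 0.70)
)

# Keep rules whose lift differs from 1 by more than the chosen margin.
margin <- 0.1
informative <- toy_measures[abs(toy_measures$lift - 1) > margin, ]
print(informative)
```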
Further exploration was then done using Eclat, this time with a lower support level, to gain deeper insight into the baskets.
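Eclat differs from Apriori mainly in how it counts support: it stores, for each item, the set of transaction IDs that contain it, and obtains the support of an item set by intersecting those ID sets. A minimal base R sketch of that idea follows, on hypothetical toy baskets. `tid_lists` and `itemset_support()` are illustrative names, and the actual `eclat()` implementation in arules is considerably more elaborate.

```r
# Hypothetical toy baskets.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("whole milk", "yogurt", "soda"),
  c("soda")
)

# Vertical layout: for each item, the IDs of the baskets that contain it.
all_items <- sort(unique(unlist(baskets)))
tid_lists <- lapply(all_items, function(item) {
  which(vapply(baskets, function(b) item %in% b, logical(1)))
})
names(tid_lists) <- all_items

# Support of an item set: size of the intersection of its tid-lists,
# divided by the number of baskets.
itemset_support <- function(items) {
  shared <- Reduce(intersect, tid_lists[items])
  length(shared) / length(baskets)
}

print(itemset_support(c("whole milk", "yogurt")))
```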
library(arules)
min_support <- 0.02
frequent_itemsets <- eclat(transactions, parameter = list(support = min_support))
Eclat
parameter specification:
algorithmic control:
Absolute minimum support count: 196
create itemset ...
set transactions ...[170 item(s), 9835 transaction(s)] done [0.07s].
sorting and recoding items ... [60 item(s)] done [0.00s].
creating sparse bit matrix ... [60 row(s), 9835 column(s)] done [0.01s].
writing ... [244 set(s)] done [0.02s].
Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
The output shows the frequently purchased products in the data set and their support, together with the corresponding counts. At this low threshold the list is too long to be very informative, so the support is increased to isolate the most frequently purchased products and support an informed decision.
high_support <- 0.10
highsupport_frequency <- eclat(transactions, parameter = list(support = high_support))
Eclat
parameter specification:
algorithmic control:
Absolute minimum support count: 983
create itemset ...
set transactions ...[170 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating bit matrix ... [9 row(s), 9835 column(s)] done [0.01s].
writing ... [17 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
inspect(highsupport_frequency)
library(arulesViz)
plot(highsupport_frequency, method = "graph")
With a further increase in support, one can identify the products bought most frequently; these include soda, yogurt, tropical fruit, rolls/buns, other vegetables and whole milk, to mention a few, with whole milk having the highest support.
The ruleInduction function is often used to generate association rules from frequent item sets.
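In outline, rule induction splits each frequent item set into an antecedent and a consequent and keeps the split when its confidence, supp(item set) / supp(antecedent), reaches the threshold. The base R sketch below does this for single-item consequents over a hypothetical table of item set supports. `supports`, `key()` and `induce_rules()` are made-up names, and this is a simplification rather than a description of how `ruleInduction()` is implemented.

```r
# Hypothetical supports of frequent item sets, keyed by sorted item names.
key <- function(items) paste(sort(items), collapse = ",")
supports <- c(0.40, 0.25, 0.22)
names(supports) <- c(key("whole milk"), key("yogurt"),
                     key(c("whole milk", "yogurt")))

# For one item set, try each item as the consequent and keep rules whose
# confidence = supp(item set) / supp(antecedent) meets the threshold.
induce_rules <- function(itemset, min_conf) {
  rules <- list()
  for (rhs in itemset) {
    lhs <- setdiff(itemset, rhs)
    conf <- supports[[key(itemset)]] / supports[[key(lhs)]]
    if (conf >= min_conf) {
      rules[[length(rules) + 1]] <- data.frame(
        lhs = key(lhs), rhs = rhs, confidence = conf
      )
    }
  }
  do.call(rbind, rules)
}

print(induce_rules(c("whole milk", "yogurt"), min_conf = 0.8))
```

With these made-up supports, only the split with yogurt as the antecedent passes the 0.8 threshold, which shows that confidence depends on the direction of the rule.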
rules1 <- ruleInduction(frequent_itemsets, transactions, confidence = 0.8)
inspect(rules1)
library(arulesViz)
plot(rules1, method = "graph")
The plot shows the relationships between frequently purchased goods, with whole milk, other vegetables and rolls/buns having the most association rules among the frequent items.
ruless2 <- ruleInduction(highsupport_frequency, transactions, confidence = 0.8)
# Inspect association rules
inspect(ruless2)
library(arulesViz)
plot(ruless2, method = "graph")
These association rules suggest that each of the mentioned items tends to be purchased independently of the others. The lift values are close to 1, indicating that the presence or absence of other items in the transaction does not significantly affect the purchase of these individual items. The confidence is also high, and these items are likely to be purchased frequently.
Several important findings emerge when contrasting the association rule mining outputs from the Apriori and Eclat methods. The Apriori algorithm identified association rules from the data set, focusing mostly on products that are likely to be purchased together. These rules indicated that the likelihood of purchasing other items was not greatly affected by the presence of these items, as shown by lift values close to 1. Conversely, the Eclat algorithm concentrated on finding frequent item sets, especially high-support products that were bought often, from which association rules were then induced; these highlighted products that have a high propensity to be purchased independently. While both algorithms offered useful information about item associations and purchasing patterns, Eclat offered greater detail, as it provided the most frequently purchased item sets in the grocery data set, while Apriori produced a smaller number of rules that were also helpful. Given this data set, one can therefore determine the most frequently purchased goods and make decisions for a store accordingly.