Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
groc <- read.transactions("GroceryDataSet.csv", sep=",")
summary(groc)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Each line in the dataset represents a shopping transaction (receipt), and each column of the resulting sparse matrix represents an item. The summary reports 9,835 transactions over 169 distinct items, the most frequent items (whole milk appears in 2,513 baskets), and the distribution of transaction sizes: the median basket holds 3 items, the mean is about 4.4, and the largest holds 32.
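The density and mean basket size in the summary can be checked by hand. The following is a minimal sketch in base R that uses only the counts printed above; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary() output above
n_transactions <- 9835
n_items <- 169

# Total item occurrences: the five most frequent items plus "(Other)"
total_occurrences <- sum(c(2513, 1903, 1809, 1715, 1372, 34055))

# Density = filled cells / all cells of the transaction-by-item matrix
density <- total_occurrences / (n_transactions * n_items)

# Mean basket size = item occurrences per transaction
mean_basket_size <- total_occurrences / n_transactions

print(round(density, 8))
print(round(mean_basket_size, 3))
```

Both values should agree with the density and the mean reported by summary(groc).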
size(head(groc)) # number of items in each observation
## [1] 4 3 1 4 4 5
LIST(head(groc, 3))
## [[1]]
## [1] "citrus fruit" "margarine" "ready soups"
## [4] "semi-finished bread"
##
## [[2]]
## [1] "coffee" "tropical fruit" "yogurt"
##
## [[3]]
## [1] "whole milk"
size() returns the number of items in each transaction; the first six baskets hold between 1 and 5 items. LIST() displays the actual items in the first three transactions, which shows what a single transaction looks like.
frequentItems <- eclat(groc, parameter = list(supp = 0.07, maxlen = 15)) # frequent itemsets with support of at least 7%
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 688
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing ... [19 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequentItems)
## items support count
## [1] {other vegetables, whole milk} 0.07483477 736
## [2] {whole milk} 0.25551601 2513
## [3] {other vegetables} 0.19349263 1903
## [4] {rolls/buns} 0.18393493 1809
## [5] {yogurt} 0.13950178 1372
## [6] {soda} 0.17437722 1715
## [7] {root vegetables} 0.10899847 1072
## [8] {tropical fruit} 0.10493137 1032
## [9] {bottled water} 0.11052364 1087
## [10] {sausage} 0.09395018 924
## [11] {shopping bags} 0.09852567 969
## [12] {citrus fruit} 0.08276563 814
## [13] {pastry} 0.08896797 875
## [14] {pip fruit} 0.07564820 744
## [15] {whipped/sour cream} 0.07168277 705
## [16] {fruit/vegetable juice} 0.07229283 711
## [17] {newspapers} 0.07981698 785
## [18] {bottled beer} 0.08052872 792
## [19] {canned beer} 0.07768175 764
The eclat algorithm finds itemsets whose support meets a minimum threshold, here 7% of transactions (an absolute count of 688 out of 9,835). At this threshold it returns 19 itemsets: 18 single items and one pair, {other vegetables, whole milk}.
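To make the threshold concrete, here is a small base R sketch that converts the relative support into a transaction count and checks the one pair that passed. It only reuses numbers from the output above. Truncating with floor() is an assumption chosen because it reproduces the count shown in the log.

```r
n_transactions <- 9835
min_support <- 0.07

# Truncating the product reproduces the "Absolute minimum support count"
# printed in the eclat log (688)
min_count <- floor(min_support * n_transactions)

# Support of {other vegetables, whole milk}: count / number of transactions
pair_support <- 736 / n_transactions

print(min_count)
print(pair_support >= min_support)  # the pair clears the 7% threshold
```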
itemFrequencyPlot(groc, topN=20, type="absolute", main="Item Frequency") # plot frequent items
This plot shows the 20 most frequently purchased items with their absolute frequencies, which helps identify the items that appear most often in customer baskets.
# Define thresholds
supp_val <- 0.001 # Minimum support: At least 0.1% of transactions
conf_val <- 0.9 # Minimum confidence: At least 90% reliability
maxlen_val <- 5 # Maximum number of items in a rule
rules <- apriori(groc, parameter=list(supp=supp_val, conf=conf_val, maxlen=maxlen_val))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(groc, parameter = list(supp = supp_val, conf = conf_val, :
## Mining stopped (maxlen reached). Only patterns up to a length of 5 returned!
## done [0.02s].
## writing ... [123 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
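Support is the fraction of transactions that contain all items in a rule. A low threshold captures rare patterns but risks admitting noise: at 0.001, a rule needs only 9 supporting transactions. Confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side; here a rule must reach at least 90% to be kept. Maxlen caps the number of items per rule at 5, which is why the log warns that mining stopped at that length. With these settings apriori returns 123 rules.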
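For reference, the standard definitions of the measures reported in the rule tables can be written as follows, for a rule $X \Rightarrow Y$ over $N$ transactions, where $\sigma(\cdot)$ is the number of transactions containing an itemset. The symbols are introduced here for illustration only.

$$
\begin{aligned}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\sigma(X \cup Y)}{N}, &
\operatorname{coverage}(X \Rightarrow Y) &= \frac{\sigma(X)}{N}, \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\sigma(X \cup Y)}{\sigma(X)}, &
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\sigma(Y)/N}.
\end{aligned}
$$

The count column in the output corresponds to $\sigma(X \cup Y)$.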
rules_conf <- sort(rules, by="confidence", decreasing=TRUE)
inspect(head(rules_conf, 10))
## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
## [3] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [4] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
## [5] {butter,
## domestic eggs,
## soft cheese} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [6] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1 0.001016777 5.168156 10
## [7] {butter,
## hygiene articles,
## pip fruit} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [8] {hygiene articles,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [9] {hygiene articles,
## pip fruit,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [10] {cream cheese,
## domestic eggs,
## sugar} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
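Sorting by confidence surfaces the rules whose right-hand side follows the left-hand side most consistently. All ten rules shown have a confidence of 1, but each rests on only 10 to 17 transactions, so they are better read as candidates than as established patterns. Nine of the ten predict whole milk, the most frequent item in the data set, which is why they share the same lift of 3.91.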
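The repeated lift values in the table follow directly from the item frequencies. As a rough check in base R, using counts from the summary() output (variable names are illustrative): when confidence is 1, lift reduces to one over the support of the right-hand side.

```r
n_transactions <- 9835

# Item counts taken from the summary() output
whole_milk_count <- 2513
other_vegetables_count <- 1903

confidence <- 1  # every rule in the top 10 by confidence

# lift = confidence / support(rhs)
lift_whole_milk <- confidence / (whole_milk_count / n_transactions)
lift_other_vegetables <- confidence / (other_vegetables_count / n_transactions)

print(round(lift_whole_milk, 6))
print(round(lift_other_vegetables, 6))
```

These should match the 3.913649 and 5.168156 shown in the lift column.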
Lift Evaluation
rules_lift <- sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules_lift, 10))
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [4] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [5] {cream cheese,
## curd,
## other vegetables,
## whipped/sour cream} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [6] {butter,
## other vegetables,
## tropical fruit,
## white bread} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [7] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
## [8] {brown bread,
## pip fruit,
## whipped/sour cream} => {other vegetables} 0.001118454 1.0000000 0.001118454 5.168156 11
## [9] {grapes,
## tropical fruit,
## whole milk,
## yogurt} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
## [10] {ham,
## pip fruit,
## tropical fruit,
## yogurt} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
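Lift compares how often the items in a rule occur together with how often they would occur together if the two sides were independent. A lift of 1 indicates no association, and higher values indicate a stronger positive association. For example, the lift of 11.24 for {liquor, red/blush wine} => {bottled beer} means that baskets containing liquor and red/blush wine include bottled beer about 11 times as often as baskets in general. Note that this rule is backed by only 19 transactions.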
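One way to read the top rule is to compare the observed number of baskets with the number expected under independence. This is a sketch in base R built from the counts in the tables above. The left-hand-side count of 21 is derived from the reported coverage (0.002135231 × 9835), and the variable names are illustrative.

```r
n_transactions <- 9835
lhs_count  <- 21    # baskets with liquor and red/blush wine (from coverage)
rhs_count  <- 792   # baskets with bottled beer
both_count <- 19    # baskets with all three items (count column)

# Baskets expected to contain both sides if they were independent
expected_count <- lhs_count * rhs_count / n_transactions

# Lift = observed co-occurrences / expected co-occurrences
lift <- both_count / expected_count
confidence <- both_count / lhs_count

print(round(expected_count, 2))
print(round(confidence, 7))
print(round(lift, 6))
```

Fewer than two such baskets would be expected by chance, against 19 observed.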
library(arulesViz) # provides the plot() method for rules objects
rules1 <- head(rules_lift, n = 10, by = "lift")
plot(rules1, method = "grouped", control = list(k = 10))
library(ggplot2)
# Extract data for the scatterplot
rules_df <- as(rules_lift, "data.frame")
# Create the scatter plot
ggplot(rules_df, aes(x = support, y = lift, color = confidence)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_gradient(low = "lightpink", high = "red", name = "Confidence") +
  labs(
    title = "Scatter Plot of Association Rules",
    x = "Support",
    y = "Lift"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )