This assignment uses a grocery transaction dataset. Each row represents one customer receipt (transaction), and each column in a row holds one item purchased in that transaction. The goal is to use R to mine the data for association rules, report support, confidence, and lift, and list the top 10 rules by lift, followed by a simple cluster analysis of the data.
groceries_raw <- read.csv("GroceryDataSet.csv",
header = FALSE,
stringsAsFactors = FALSE)
dim(groceries_raw)
## [1] 9835 32
head(groceries_raw)
## V1 V2 V3 V4
## 1 citrus fruit semi-finished bread margarine ready soups
## 2 tropical fruit yogurt coffee
## 3 whole milk
## 4 pip fruit yogurt cream cheese meat spreads
## 5 other vegetables whole milk condensed milk long life bakery product
## 6 whole milk butter yogurt rice
## V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1
## 2
## 3
## 4
## 5
## 6 abrasive cleaner
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1
## 2
## 3
## 4
## 5
## 6
summary(groceries_raw)
## V1 V2 V3 V4
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V5 V6 V7 V8
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V9 V10 V11 V12
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V13 V14 V15 V16
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V17 V18 V19 V20
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V21 V22 V23 V24
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V25 V26 V27 V28
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V29 V30 V31 V32
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
In this dataset, blank cells are not errors: receipts contain different numbers of items, so shorter rows are padded with empty strings up to the 32-column width of the file.
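As a quick illustration of this layout, the sketch below counts the non-blank cells in each row of a small made-up data frame (`toy_raw` is a hypothetical stand-in, not the real file). Applied to `groceries_raw` before the blanks are recoded, the same expression would give the number of items on each receipt.

```r
# Hypothetical stand-in for the raw data: three receipts padded with ""
toy_raw <- data.frame(
  V1 = c("whole milk", "yogurt", "soda"),
  V2 = c("butter", "", "coffee"),
  V3 = c("", "", "rice"),
  stringsAsFactors = FALSE
)

# Count the non-blank cells in each row, i.e. the items on each receipt
items_per_receipt <- rowSums(toy_raw != "")
print(items_per_receipt)
```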
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Treat blank cells as missing (they are padding, not items)
groceries_raw[groceries_raw == ""] <- NA
# Remove extra spaces from item names
groceries_raw[] <- lapply(groceries_raw, function(x) trimws(x))
# Convert each row into a transaction list
groceries_list <- apply(groceries_raw, 1, function(x) {
x <- na.omit(x)
x <- x[x != ""]
unique(x)
})
# Convert the list to an arules transactions object
groceries_trans <- as(groceries_list, "transactions")
summary(groceries_trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
inspect(groceries_trans[1:5])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
itemFrequencyPlot(groceries_trans,
topN = 20,
type = "absolute",
main = "Top 20 Most Frequently Purchased Items",
col = "lightgreen")
# Mining the data for association rules
# First model: moderate support/confidence
rules_1 <- apriori(
groceries_trans,
parameter = list(
supp = 0.001,
conf = 0.20,
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [21633 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules_1)
## set of 21633 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 620 9337 9824 1792 60
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.599 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.2000 Min. :0.001017 Min. : 0.8028
## 1st Qu.:0.001118 1st Qu.:0.2632 1st Qu.:0.002745 1st Qu.: 2.1178
## Median :0.001322 Median :0.3548 Median :0.004169 Median : 2.7571
## Mean :0.001948 Mean :0.3967 Mean :0.005840 Mean : 3.0214
## 3rd Qu.:0.001932 3rd Qu.:0.5000 3rd Qu.:0.006101 3rd Qu.: 3.6148
## Max. :0.074835 Max. :1.0000 Max. :0.255516 Max. :35.7158
## count
## Min. : 10.00
## 1st Qu.: 11.00
## Median : 13.00
## Mean : 19.15
## 3rd Qu.: 19.00
## Max. :736.00
##
## mining info:
## data ntransactions support confidence
## groceries_trans 9835 0.001 0.2
## call
## apriori(data = groceries_trans, parameter = list(supp = 0.001, conf = 0.2, minlen = 2))
# Second model: stricter support and confidence
rules_2 <- apriori(
groceries_trans,
parameter = list(
supp = 0.002,
conf = 0.30,
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.002 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [3119 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules_2)
## set of 3119 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 181 1863 1001 74
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.31 4.00 5.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.002034 Min. :0.3000 Min. :0.002440 Min. : 1.176
## 1st Qu.:0.002339 1st Qu.:0.3607 1st Qu.:0.005084 1st Qu.: 1.957
## Median :0.002847 Median :0.4369 Median :0.006711 Median : 2.356
## Mean :0.003931 Mean :0.4558 Mean :0.009361 Mean : 2.517
## 3rd Qu.:0.003965 3rd Qu.:0.5355 3rd Qu.:0.009558 3rd Qu.: 2.904
## Max. :0.074835 Max. :0.8857 Max. :0.193493 Max. :11.421
## count
## Min. : 20.00
## 1st Qu.: 23.00
## Median : 28.00
## Mean : 38.66
## 3rd Qu.: 39.00
## Max. :736.00
##
## mining info:
## data ntransactions support confidence
## groceries_trans 9835 0.002 0.3
## call
## apriori(data = groceries_trans, parameter = list(supp = 0.002, conf = 0.3, minlen = 2))
# Third model: lower support to discover less frequent but possibly strong rules
rules_3 <- apriori(
groceries_trans,
parameter = list(
supp = 0.0005,
conf = 0.20,
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 5e-04 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [164 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.01s].
## writing ... [111652 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
summary(rules_3)
## set of 111652 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7
## 768 22681 55303 27954 4680 266
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 4.000 4.124 5.000 7.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.0005084 Min. :0.2000 Min. :0.0005084 Min. : 0.7827
## 1st Qu.:0.0005084 1st Qu.:0.2857 1st Qu.:0.0011185 1st Qu.: 2.6091
## Median :0.0006101 Median :0.4000 Median :0.0017285 Median : 3.6494
## Mean :0.0008728 Mean :0.4546 Mean :0.0024302 Mean : 4.4219
## 3rd Qu.:0.0008134 3rd Qu.:0.5833 3rd Qu.:0.0025419 3rd Qu.: 5.1982
## Max. :0.0748348 Max. :1.0000 Max. :0.2555160 Max. :81.9583
## count
## Min. : 5.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 8.584
## 3rd Qu.: 8.000
## Max. :736.000
##
## mining info:
## data ntransactions support confidence
## groceries_trans 9835 5e-04 0.2
## call
## apriori(data = groceries_trans, parameter = list(supp = 5e-04, conf = 0.2, minlen = 2))
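The support thresholds in the three runs can be translated into transaction counts. The sketch below is a rough illustration that assumes a rule meets the threshold when its count divided by the number of transactions is at least `supp`; the variable names are made up for the example. The results agree with the minimum `count` values reported in the three summaries (10, 20, and 5).

```r
n_transactions <- 9835  # number of transactions reported by summary(groceries_trans)
supp_thresholds <- c(model_1 = 0.001, model_2 = 0.002, model_3 = 0.0005)

# Smallest whole number of transactions whose share reaches each threshold
min_counts <- ceiling(supp_thresholds * n_transactions)
print(min_counts)
```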
Association rules were generated using the Apriori algorithm to identify relationships between items in the dataset. The strength of these rules is evaluated using three key measures: support, confidence, and lift.
Support is the proportion of transactions that contain every item in a rule. Support values in this analysis are low: the mean is about 0.002 for the first model (roughly 0.0009 to 0.004 across the three parameter settings), so most item combinations appear in well under 1% of transactions.
Confidence is the proportion of transactions containing the left-hand side of a rule that also contain the right-hand side. The mean confidence is approximately 0.40 to 0.46 across the three models, so on average the right-hand side appears in about 40 to 46% of the transactions that contain the left-hand side.
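Lift compares how often the two sides of a rule occur together with how often they would if they were independent. The mean lift is well above 1 (about 2.5 to 4.4 depending on the parameters), so most rules describe combinations that occur more often than independence would predict. A few rules have very high lift (the maximum is 35.7 in the first model and 82.0 in the third), while the minimum lift in the first and third models is about 0.8, meaning a small number of rules describe items that co-occur slightly less often than expected.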
Overall, while most rules have low support, the confidence and lift values indicate meaningful relationships between items in the dataset.
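To make the three measures concrete, the sketch below recomputes them from transaction counts for the rule {bottled beer, red/blush wine} => {liquor}, which is the top rule by lift in the first model. The count of 19 is the `count` reported for that rule; the counts of 48 for the left-hand side and 109 for liquor are not printed in the output and are back-calculated here from the reported coverage and lift, so treat them as assumptions.

```r
n         <- 9835   # total transactions
count_all <- 19     # transactions with bottled beer, red/blush wine, and liquor
count_lhs <- 48     # assumed: transactions with bottled beer and red/blush wine
count_rhs <- 109    # assumed: transactions with liquor

support    <- count_all / n                  # share of all transactions
confidence <- count_all / count_lhs          # share of LHS transactions
lift       <- confidence / (count_rhs / n)   # confidence relative to RHS base rate

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

Under these assumed counts the values reproduce the reported support (0.0019), confidence (0.3958), and lift (35.7158) for this rule.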
rules_lift <- sort(rules_1, by = "lift", decreasing = TRUE)
top10_rules <- head(rules_lift, 10)
inspect(top10_rules)
## lhs rhs support
## [1] {bottled beer, red/blush wine} => {liquor} 0.001931876
## [2] {hamburger meat, soda} => {Instant food products} 0.001220132
## [3] {ham, white bread} => {processed cheese} 0.001931876
## [4] {bottled beer, liquor} => {red/blush wine} 0.001931876
## [5] {Instant food products, soda} => {hamburger meat} 0.001220132
## [6] {curd, sugar} => {flour} 0.001118454
## [7] {baking powder, sugar} => {flour} 0.001016777
## [8] {processed cheese, white bread} => {ham} 0.001931876
## [9] {fruit/vegetable juice, ham} => {processed cheese} 0.001118454
## [10] {margarine, sugar} => {flour} 0.001626843
## confidence coverage lift count
## [1] 0.3958333 0.004880529 35.71579 19
## [2] 0.2105263 0.005795628 26.20919 12
## [3] 0.3800000 0.005083884 22.92822 19
## [4] 0.4130435 0.004677173 21.49356 19
## [5] 0.6315789 0.001931876 18.99565 12
## [6] 0.3235294 0.003457041 18.60767 11
## [7] 0.3125000 0.003253686 17.97332 10
## [8] 0.4634146 0.004168785 17.80345 19
## [9] 0.2894737 0.003863752 17.46610 11
## [10] 0.2962963 0.005490595 17.04137 16
# Table with support, confidence, and lift
top10_table <- as(top10_rules, "data.frame")
top10_table <- top10_table[, c("rules", "support", "confidence", "lift")]
print(top10_table)
## rules support confidence
## 633 {bottled beer,red/blush wine} => {liquor} 0.001931876 0.3958333
## 696 {hamburger meat,soda} => {Instant food products} 0.001220132 0.2105263
## 1489 {ham,white bread} => {processed cheese} 0.001931876 0.3800000
## 632 {bottled beer,liquor} => {red/blush wine} 0.001931876 0.4130435
## 695 {Instant food products,soda} => {hamburger meat} 0.001220132 0.6315789
## 2022 {curd,sugar} => {flour} 0.001118454 0.3235294
## 1916 {baking powder,sugar} => {flour} 0.001016777 0.3125000
## 1488 {processed cheese,white bread} => {ham} 0.001931876 0.4634146
## 1492 {fruit/vegetable juice,ham} => {processed cheese} 0.001118454 0.2894737
## 2025 {margarine,sugar} => {flour} 0.001626843 0.2962963
## lift
## 633 35.71579
## 696 26.20919
## 1489 22.92822
## 632 21.49356
## 695 18.99565
## 2022 18.60767
## 1916 17.97332
## 1488 17.80345
## 1492 17.46610
## 2025 17.04137
# Export the top 10 rules to a CSV file
write.csv(top10_table,
"top10_grocery_rules_by_lift.csv",
row.names = FALSE)
The top 10 association rules were selected based on the highest lift values, indicating the strongest relationships between items.
The strongest rule shows that transactions containing bottled beer and red/blush wine also contain liquor about 40% of the time (confidence = 0.40), with a lift of 35.72. Liquor is therefore over 35 times more common in these baskets than in transactions overall, indicating a very strong association.
Another strong rule links hamburger meat and soda to instant food products (lift = 26.21). The confidence is modest (0.21), but instant food products are rare overall, so the co-occurrence is far above what independence would predict. This suggests a pattern of convenience or quick-meal purchases.
The rule {ham, white bread} → {processed cheese} (lift = 22.93) reflects a common food pairing, suggesting customers are purchasing ingredients for sandwiches.
The reverse of that rule is more predictive: customers who buy instant food products and soda also purchase hamburger meat 63% of the time (confidence = 0.63, lift = 19.00), although the rule rests on only 12 transactions.
Several rules involve baking-related items. For example, combinations such as {curd, sugar}, {baking powder, sugar}, and {margarine, sugar} all lead to flour, suggesting that these items are commonly purchased together for baking purposes.
Overall, the support values are low: each of these rules rests on only 10 to 19 of the 9,835 transactions. The high lift values nonetheless show that these item combinations occur together far more often than independence would predict, which points to meaningful purchasing patterns in the dataset.
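One way to check whether a rule this rare could be a chance co-occurrence is an exact test on the 2x2 table of transaction counts. The sketch below does this for the rule {bottled beer, red/blush wine} => {liquor} with base R's `fisher.test`. It is an illustrative add-on, not part of the original analysis, and the totals 48 and 109 are back-calculated from the reported coverage and lift rather than read from the data.

```r
n         <- 9835   # total transactions
both      <- 19     # transactions containing the LHS and liquor
lhs_total <- 48     # assumed: transactions containing the LHS
rhs_total <- 109    # assumed: transactions containing liquor

# 2x2 table: rows = LHS present/absent, columns = liquor present/absent
tab <- matrix(
  c(both, rhs_total - both,
    lhs_total - both, n - lhs_total - rhs_total + both),
  nrow = 2,
  dimnames = list(lhs = c("yes", "no"), liquor = c("yes", "no"))
)

# One-sided test: is liquor more common when the LHS is present?
result <- fisher.test(tab, alternative = "greater")
print(tab)
print(result$p.value)
```

Because thousands of rules were mined, a small p-value for a single rule would still need a multiple-comparison adjustment before being read as firm evidence.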
These findings demonstrate how association rule mining can uncover meaningful consumer purchasing behavior.
library(arulesViz)
plot(top10_rules,
method = "scatter",
measure = c("support", "confidence"),
shading = "lift")
plot(top10_rules,
method = "graph",
engine = "htmlwidget")
This scatter plot displays the top 10 association rules, with support on the x-axis and confidence on the y-axis. Each point represents a rule, and the color intensity indicates the lift value, where darker red corresponds to stronger associations.
The color scale shows that all ten rules have high lift: every value is above 17, and the top rule exceeds 35. This indicates that these item combinations occur far more frequently than expected by chance.
Overall, the plot shows that these rules are rare (low support) but strongly associated (high lift). Confidence is more mixed, ranging from 0.21 to 0.63, and each rule rests on fewer than 20 transactions, so the rules are best read as promising leads for understanding customer purchasing behavior rather than as firmly established patterns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Convert transactions to a binary item matrix
binary_matrix <- as(groceries_trans, "matrix")
# Top 30 most frequent items
item_freq <- itemFrequency(groceries_trans, type = "absolute")
top_items <- names(sort(item_freq, decreasing = TRUE))[1:30]
top_binary_matrix <- binary_matrix[, top_items]
# Transpose so that rows are items and columns are transactions
item_matrix <- t(top_binary_matrix)
# Compute the binary (Jaccard) distance between items
dist_matrix <- dist(item_matrix, method = "binary")
# Hierarchical clustering
hc <- hclust(dist_matrix, method = "complete")
# Plot
plot(hc,
main = "Hierarchical Clustering of Top 30 Grocery Items",
xlab = "",
sub = "",
cex = 0.7)
# Cut the tree into 5 clusters
clusters <- cutree(hc, k = 5)
clusters
## whole milk other vegetables rolls/buns
## 1 1 2
## soda yogurt bottled water
## 3 1 3
## root vegetables tropical fruit shopping bags
## 1 1 2
## sausage pastry citrus fruit
## 2 2 1
## bottled beer newspapers canned beer
## 4 3 5
## pip fruit fruit/vegetable juice whipped/sour cream
## 1 3 1
## brown bread domestic eggs frankfurter
## 2 1 2
## margarine coffee pork
## 1 4 1
## butter curd beef
## 1 1 1
## napkins chocolate frozen vegetables
## 3 3 1
# Cluster table
cluster_table <- data.frame(
item = names(clusters),
cluster = clusters
)
cluster_table <- cluster_table %>%
arrange(cluster, item)
print(cluster_table)
## item cluster
## beef beef 1
## butter butter 1
## citrus fruit citrus fruit 1
## curd curd 1
## domestic eggs domestic eggs 1
## frozen vegetables frozen vegetables 1
## margarine margarine 1
## other vegetables other vegetables 1
## pip fruit pip fruit 1
## pork pork 1
## root vegetables root vegetables 1
## tropical fruit tropical fruit 1
## whipped/sour cream whipped/sour cream 1
## whole milk whole milk 1
## yogurt yogurt 1
## brown bread brown bread 2
## frankfurter frankfurter 2
## pastry pastry 2
## rolls/buns rolls/buns 2
## sausage sausage 2
## shopping bags shopping bags 2
## bottled water bottled water 3
## chocolate chocolate 3
## fruit/vegetable juice fruit/vegetable juice 3
## napkins napkins 3
## newspapers newspapers 3
## soda soda 3
## bottled beer bottled beer 4
## coffee coffee 4
## canned beer canned beer 5
Hierarchical clustering was used to group the top 30 most frequently purchased grocery items based on similarity in purchasing patterns. The figure visually represents how items are grouped together, with items that are closer on the tree being more similar in terms of how often they are purchased together.
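The `method = "binary"` distance used for this clustering is the Jaccard distance: among the transactions that contain at least one of two items, it is the share that contain exactly one of them. The sketch below checks this on two made-up purchase indicator vectors (`item_a` and `item_b` are hypothetical, not taken from the data).

```r
# Hypothetical purchase indicators for two items across six transactions
item_a <- c(1, 1, 0, 1, 0, 0)
item_b <- c(1, 0, 0, 1, 1, 0)

# Distance as computed by dist()
d_builtin <- as.numeric(dist(rbind(item_a, item_b), method = "binary"))

# Same quantity by hand: 1 - |A and B| / |A or B|
both     <- sum(item_a == 1 & item_b == 1)
either   <- sum(item_a == 1 | item_b == 1)
d_manual <- 1 - both / either

print(c(builtin = d_builtin, manual = d_manual))
```

A distance near 0 means two items almost always appear in the same baskets, and a distance near 1 means they rarely do.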
The clustering results show several meaningful groupings. The largest cluster contains 15 items, mostly dairy, produce, eggs, and meat (for example whole milk, yogurt, other vegetables, and beef). A second cluster groups brown bread, rolls/buns, pastry, sausage, and frankfurter (along with shopping bags), suggesting meal or breakfast combinations. A third cluster includes the beverages bottled water, soda, and fruit/vegetable juice together with chocolate, napkins, and newspapers, indicating that these items have similar purchase patterns.
Bottled beer and coffee form a small fourth cluster, and canned beer forms a cluster of its own, indicating that it does not co-occur strongly with the other items in the top 30. Yogurt, by contrast, falls in the large first cluster alongside the other dairy products.
Overall, the clustering analysis complements the association rules by identifying groups of similar items, while the association rules identify directional relationships between specific item combinations. Together, these methods provide a deeper understanding of customer purchasing behavior.
This analysis used market basket analysis to identify relationships between grocery items that are frequently purchased together. The Apriori algorithm generated association rules, and the top 10 rules were ranked by lift. Lift was particularly useful because it highlights relationships that occur more frequently than expected by chance, allowing for the identification of strong and meaningful item associations.
The data were examined for missing values, and blank entries were treated as the absence of items in a transaction rather than missing data. Therefore, no imputation was required, as each transaction simply contained a different number of purchased items.
The results revealed clear patterns in consumer purchasing behavior, such as common combinations of convenience foods, alcoholic beverages, sandwich ingredients, and baking-related items. These insights can be valuable for retail strategies, including product placement, promotions, and inventory management.
The hierarchical clustering analysis further supported these findings by grouping similar items based on purchasing patterns. While association rules identify directional relationships between items, clustering highlights broader groupings of related products. Together, these methods provide a more comprehensive understanding of customer behavior.
Overall, this analysis demonstrates how data mining techniques can uncover meaningful patterns in transaction data and support data-driven decision-making in retail environments.