This project uses the Groceries dataset to perform Market Basket Analysis. The aim is to discover interesting association rules between purchased items.
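library(arules)        # read.transactions(), apriori(), inspect()
library(arulesViz)     # plot() methods for rules
library(dplyr)         # select(), %>%
library(knitr)         # kable()
library(kableExtra)    # kable_styling()
library(RColorBrewer)  # brewer.pal()
data_path <- "C:/Users/Admin/Downloads/GroceryDataSet.csv"
data <- read.csv(data_path, header = FALSE, stringsAsFactors = FALSE)
# Display summary of raw data
summary(data) %>%
  kable(caption = "Summary of Raw Grocery Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)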
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | Length:9835 | |
Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | |
Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character |
# Load transactions for market basket analysis
trans <- read.transactions(data_path, format = "basket", sep = ",")
# Summary of transaction data
summary(trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary shows that ‘whole milk’ is the most frequent item, appearing in 2,513 of the 9,835 transactions (about 26%). It is followed by ‘other vegetables’ (1,903), ‘rolls/buns’ (1,809), ‘soda’ (1,715), and ‘yogurt’ (1,372). These counts suggest that these items are staples in many customer baskets. Baskets are small on the whole: the median transaction contains 3 items and the mean is 4.4. To see the distribution visually, I’ll use an item frequency plot to highlight the top purchased items.
itemFrequencyPlot(trans,
                  topN = 10,
                  type = "absolute",
                  col = brewer.pal(8, "Pastel1"),
                  main = "Top 10 Items by Absolute Frequency")
The frequency plot above provides a much clearer visual representation of the top items purchased. I used the `itemFrequencyPlot` function to create this bar chart based on the transaction data stored in the itemMatrix. This approach helps highlight the most common items across all transactions and makes it easier to compare their absolute frequencies at a glance.
To extract association rules, I ran the Apriori algorithm with minimum thresholds for support and confidence. These thresholds filter out rare or unreliable combinations so that the remaining rules reflect patterns that recur in the data. Support is the fraction of transactions that contain an itemset; confidence is the fraction of transactions containing the left-hand side of a rule that also contain its right-hand side. Here I require a rule to occur in at least 42 of the 9,835 transactions (a support of about 0.0043) and to have a confidence of at least 0.5.
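To make the two measures concrete, the sketch below computes them by hand in base R on a tiny made-up basket matrix. The data, the `baskets` matrix, and the `support()` and `confidence()` helpers are illustrative assumptions only; they are not part of the Groceries analysis or of the arules API.

```r
# Hypothetical toy data: 5 baskets x 3 items (TRUE = item is in the basket).
baskets <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  TRUE,
    FALSE, TRUE,  FALSE,
    TRUE,  TRUE,  FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "yogurt"))
)

# Support: share of baskets that contain every item in the itemset.
support <- function(items) {
  mean(rowSums(baskets[, items, drop = FALSE]) == length(items))
}

# Confidence of lhs => rhs: support of the combined itemset divided by
# the support of the left-hand side alone.
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("milk", "bread"))   # 3 of 5 baskets contain both items
confidence("milk", "bread")   # 3 of the 4 milk baskets also contain bread
```

In this toy example the rule {milk} => {bread} would pass a confidence threshold of 0.5, since three of the four milk baskets also contain bread.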
# Set minimum support: 42 transactions out of the total
min_support <- 42 / length(trans)
min_support
## [1] 0.004270463
# Generate rules using apriori
rules <- apriori(trans, parameter = list(supp = min_support, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.004270463 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 42
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [124 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [177 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Sort by lift
rules <- sort(rules, by = "lift", decreasing = TRUE)
# View top 10 rules
top_rules <- head(rules, 10)
inspect(top_rules)
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## tropical fruit} => {other vegetables} 0.004473818 0.7857143 0.005693950 4.060694 44
## [2] {tropical fruit,
## whipped/sour cream,
## whole milk} => {yogurt} 0.004372140 0.5512821 0.007930859 3.951792 43
## [3] {curd,
## tropical fruit} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [4] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [5] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [6] {root vegetables,
## tropical fruit,
## yogurt} => {other vegetables} 0.004982206 0.6125000 0.008134215 3.165495 49
## [7] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [8] {onions,
## root vegetables} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [9] {cream cheese,
## root vegetables} => {other vegetables} 0.004473818 0.5945946 0.007524148 3.072957 44
## [10] {beef,
## tropical fruit} => {other vegetables} 0.004473818 0.5866667 0.007625826 3.031985 44
The table above displays the top 10 association rules discovered through the Apriori algorithm, ranked by their lift values. Lift helps quantify the strength of a rule by measuring how much more frequently the items in the rule appear together than would be expected if they were occurring independently. A lift greater than 1 indicates that the items are positively associated, meaning the presence of items on the left-hand side of the rule increases the likelihood of seeing the item(s) on the right-hand side in the same transaction.
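As a worked check, the lift of rule [1] can be recomputed from numbers already shown above: the rule's confidence (0.7857143) and the frequency of ‘other vegetables’ in the transaction summary (1,903 of 9,835 transactions). The symbols X, Y, conf, and supp are shorthand introduced here for the left-hand side, the right-hand side, confidence, and support.

```latex
\[
\operatorname{lift}(X \Rightarrow Y)
  = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{0.7857143}{1903 / 9835}
  \approx \frac{0.7857143}{0.1934926}
  \approx 4.06
\]
```

This agrees with the lift of 4.060694 reported for rule [1]: baskets containing citrus fruit, root vegetables, and tropical fruit include other vegetables about four times as often as baskets in general.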
# Convert rules to a data frame for tabular view
rules_df <- as(rules, "data.frame")
# Show support, confidence, and lift for the top 10 rules (already sorted by lift)
rules_df %>%
  head(10) %>%
  select(support, confidence, lift) %>%
  kable(caption = "Support, Confidence, and Lift of the Top 10 Rules by Lift") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
rule index | support | confidence | lift |
---|---|---|---|
157 | 0.0044738 | 0.7857143 | 4.060694 |
144 | 0.0043721 | 0.5512821 | 3.951792 |
46 | 0.0052872 | 0.5148515 | 3.690645 |
161 | 0.0057956 | 0.6333333 | 3.273165 |
154 | 0.0054906 | 0.6136364 | 3.171368 |
163 | 0.0049822 | 0.6125000 | 3.165495 |
101 | 0.0055923 | 0.6043956 | 3.123610 |
11 | 0.0056940 | 0.6021505 | 3.112008 |
21 | 0.0044738 | 0.5945946 | 3.072957 |
37 | 0.0044738 | 0.5866667 | 3.031985 |
# Visualize the top rules
plot(top_rules, method = "graph", engine = "htmlwidget")
plot(top_rules, method = "grouped")
The grouped matrix visualization above shows how groups of left-hand-side itemsets are associated with specific right-hand-side items. Each balloon summarizes the rules linking one left-hand-side group to one right-hand-side item, with its size indicating support and its color intensity indicating lift. This view makes it easier to spot which item combinations lead to a specific product being purchased and how strong those associations are compared with what independence would predict.
# Convert transactions to a binary matrix
item_matrix <- as(trans, "matrix")
# Select top 50 most frequent items
top_items <- sort(colSums(item_matrix), decreasing = TRUE)
top_item_matrix <- item_matrix[, names(top_items)[1:50]]
# Apply k-means clustering with 3 clusters
set.seed(123)
clust <- kmeans(top_item_matrix, centers = 3)
# View cluster distribution
table(clust$cluster)
##
## 1 2 3
## 2300 5030 2505
The clustering groups transactions with similar item patterns into three segments of 2,300, 5,030, and 2,505 transactions. Segments like these could help a retailer target specific types of shopping behavior with promotions or personalized recommendations.
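One way to interpret segments like these is to look at the cluster centers: for a 0/1 item matrix, each center value is the share of that cluster's transactions containing the item. The following is a minimal base R sketch on made-up data. The `toy` matrix, its item names, and the 0.5 cutoff are hypothetical; in the analysis above the same idea would apply to `clust$centers`.

```r
# Hypothetical toy data: 8 baskets x 4 items, coded 1 if the item is present.
toy <- rbind(
  c(1, 1, 0, 0),
  c(1, 1, 0, 0),
  c(1, 1, 1, 0),
  c(1, 0, 0, 0),
  c(0, 0, 1, 1),
  c(0, 0, 1, 1),
  c(0, 1, 1, 1),
  c(0, 0, 0, 1)
)
colnames(toy) <- c("milk", "yogurt", "soda", "snacks")

# Several random starts make the result less sensitive to initialization.
set.seed(123)
toy_clust <- kmeans(toy, centers = 2, nstart = 10)

# Each row of the centers matrix is one cluster; each value is the
# proportion of that cluster's baskets containing the item.
profile <- round(toy_clust$centers, 2)
profile

# Items that appear in more than half of a cluster's baskets.
apply(profile, 1, function(p) names(p)[p > 0.5])
```

Items with a high proportion in one cluster and a low proportion in the others are the ones that characterize that segment.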
This market basket analysis revealed clear patterns in the grocery transaction data. Whole milk, other vegetables, rolls/buns, soda, and yogurt were the most frequently purchased items. With a minimum support of 42 transactions and a minimum confidence of 0.5, the Apriori algorithm produced 177 rules. The ten strongest all have a lift above 3, and each predicts either other vegetables or yogurt from combinations such as fruit, root vegetables, and dairy items. The k-means clustering split the transactions into three segments, which gives a starting point for more targeted inventory decisions, personalized promotions, and strategic product placement.