Introduction

In this analysis, we explore customer purchasing behavior using the Groceries dataset, which consists of over 9,000 real-world grocery transactions. Each transaction represents a customer’s shopping basket and contains a combination of items they purchased during a single visit. The objective of this assignment is twofold: first, to uncover meaningful association rules that describe frequent item combinations using Market Basket Analysis (MBA), and second, for extra credit, to perform a cluster analysis to identify distinct groups of transactions with similar purchasing patterns.

To perform Market Basket Analysis, we employ the Apriori algorithm, a well-established technique for mining association rules based on metrics like support, confidence, and lift. These rules provide actionable insights into which products tend to be purchased together, allowing retailers to optimize shelf layouts, promotions, and cross-selling strategies. We identify and report the top 10 rules ranked by lift, which highlights the strongest item affinities in the dataset.

For the clustering component, we preprocess the transaction data into a binary matrix and apply Principal Component Analysis (PCA) to reduce its dimensionality. We then use k-means clustering to segment transactions into groups based on their composition. The resulting clusters are visualized using the top principal components, and we interpret the clusters by examining their most frequently purchased items. This combined approach enables both descriptive insights and customer segmentation, making the analysis a valuable tool for data-driven decision-making in retail.

Step 1: Load the Dataset as a Transaction Object

library(arules)
## Warning: package 'arules' was built under R version 4.4.3
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
groc <- read.transactions("GroceryDataSet.csv", sep = ",")
summary(groc)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Explanation:

The analysis begins by importing the dataset with read.transactions() from the arules package, which is specialized for association rule mining. The function treats each row as a single transaction (i.e., a shopping basket), with items separated by commas. The resulting object, groc, is a transactions object, which stores the baskets as a sparse itemMatrix so that the mostly empty transaction-by-item table is held efficiently. The summary() function then provides a high-level overview: the number of transactions (9,835), the number of unique items (169), the density of the matrix (about 2.6% of cells are filled), and the distribution of basket sizes (median 3 items, mean 4.4). It also lists the most frequently purchased items, which helps later when interpreting rule relevance.
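The density and mean basket size in the summary can be recovered from the item counts it prints. The following sketch is only an arithmetic check on numbers copied from the output above; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary() output above.
n_transactions <- 9835
n_items        <- 169

# Total item occurrences: the five most frequent items plus "(Other)".
total_occurrences <- 2513 + 1903 + 1809 + 1715 + 1372 + 34055

# Density = share of filled cells in the 9835 x 169 transaction matrix.
density <- total_occurrences / (n_transactions * n_items)

# Mean basket size = item occurrences per transaction.
mean_basket <- total_occurrences / n_transactions

round(density, 8)      # compare with the reported density of 0.02609146
round(mean_basket, 3)  # compare with the reported mean of 4.409
```

Both values agree with the summary, which confirms that density is simply the average basket size divided by the number of distinct items.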

Step 2: Explore Sample Transactions

size(head(groc))
## [1] 4 3 1 4 4 5
LIST(head(groc, 3))
## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"

Explanation:

Next, we examine the content of the transactions. The size() function shows how many items are in each of the first six transactions, which gives a first impression of basket sizes. Then, LIST() converts the first three transactions into readable item lists. For instance, the first transaction contains “citrus fruit”, “margarine”, “ready soups”, and “semi-finished bread”. This step builds intuition about how the dataset is structured before we mine it.

Step 3: Identify Frequent Itemsets Using ECLAT

frequentItems <- eclat(groc, parameter = list(supp = 0.07, maxlen = 15))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(frequentItems)
##      items                          support    count
## [1]  {other vegetables, whole milk} 0.07483477  736 
## [2]  {whole milk}                   0.25551601 2513 
## [3]  {other vegetables}             0.19349263 1903 
## [4]  {rolls/buns}                   0.18393493 1809 
## [5]  {yogurt}                       0.13950178 1372 
## [6]  {soda}                         0.17437722 1715 
## [7]  {root vegetables}              0.10899847 1072 
## [8]  {tropical fruit}               0.10493137 1032 
## [9]  {bottled water}                0.11052364 1087 
## [10] {sausage}                      0.09395018  924 
## [11] {shopping bags}                0.09852567  969 
## [12] {citrus fruit}                 0.08276563  814 
## [13] {pastry}                       0.08896797  875 
## [14] {pip fruit}                    0.07564820  744 
## [15] {whipped/sour cream}           0.07168277  705 
## [16] {fruit/vegetable juice}        0.07229283  711 
## [17] {newspapers}                   0.07981698  785 
## [18] {bottled beer}                 0.08052872  792 
## [19] {canned beer}                  0.07768175  764

Explanation:

We apply eclat() to find frequently occurring items and item combinations. ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) is a depth-first search algorithm that is efficient for finding frequent itemsets. Here, we set the minimum support threshold to 0.07 (i.e., an itemset must appear in at least 7% of transactions, an absolute count of 688), and maxlen = 15 caps the itemset length at 15 items. The result lists each itemset with its support and count. At this threshold only 19 itemsets qualify: 18 single items and one pair, {other vegetables, whole milk}. For example, {whole milk} appears in about 25.6% of transactions. This step shows which products are commonly bought. Note that it yields itemsets, not rules; the rules themselves are mined in Step 5 with a much lower support threshold.
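As a rough check on what the threshold and the reported supports mean, the sketch below recomputes them from the counts in the output above. It also compares the observed support of the one pair that passed the threshold with the support expected if the two items were bought independently; that comparison is added here for illustration and is not part of the eclat() output.

```r
n <- 9835  # number of transactions

# Minimum support expressed as a transaction count.
min_count <- 0.07 * n        # about 688.45; eclat() reports 688

# Support = count / number of transactions.
supp_milk <- 2513 / n        # {whole milk}
supp_veg  <- 1903 / n        # {other vegetables}
supp_pair <- 736  / n        # {other vegetables, whole milk}

# Support expected under independence, and observed relative to expected.
expected_pair <- supp_milk * supp_veg
ratio <- supp_pair / expected_pair

round(c(milk = supp_milk, veg = supp_veg, pair = supp_pair), 4)
round(ratio, 2)              # roughly 1.51
```

A ratio above 1 means the pair appears together more often than independence would predict, which is the same idea that the lift metric formalizes for rules.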

Step 4: Visualize Item Frequencies

itemFrequencyPlot(groc, topN = 20, type = "absolute", main = "Item Frequency")

Explanation:

To see the popular items at a glance, we create a bar plot of the 20 most frequently purchased items using itemFrequencyPlot(). By specifying type = "absolute", we show raw counts rather than proportions. This visualization quickly reveals which items dominate customer baskets (e.g., “whole milk”, “other vegetables”) and serves as a reference when interpreting the rules we mine.

Step 5: Generate Association Rules with Apriori

rules <- apriori(groc, parameter = list(supp = 0.001, conf = 0.9, maxlen = 5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(groc, parameter = list(supp = 0.001, conf = 0.9, maxlen =
## 5)): Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!
##  done [0.02s].
## writing ... [123 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Explanation:

Now we move to the core of market basket analysis: generating association rules using the Apriori algorithm. The apriori() function identifies rules that meet specified thresholds:

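  • supp = 0.001 means the items in a rule must appear together in at least 0.1% of transactions (0.001 × 9,835 ≈ 9.8; the log reports an absolute minimum count of 9).
  • conf = 0.9 keeps only rules with at least 90% confidence.
  • maxlen = 5 limits rules to at most 5 items in total (LHS plus RHS). The warning in the log indicates that mining stopped at this length, so longer patterns may exist but are not returned.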

Apriori works by identifying frequent itemsets and building rules from them. These rules will help answer questions like, “If someone buys X and Y, how likely are they to buy Z?”
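To make the three metrics concrete, here is a minimal base-R sketch that computes support, confidence, and lift for a single rule by hand. The five baskets and the helper function support() are made up for illustration; they are not taken from the Groceries data or from arules.

```r
# Five hypothetical baskets (invented for illustration).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("bread", "butter")
rhs <- "milk"

supp_rule  <- support(c(lhs, rhs))        # baskets with LHS and RHS together
confidence <- supp_rule / support(lhs)    # share of LHS baskets that also hold the RHS
lift       <- confidence / support(rhs)   # confidence relative to the RHS base rate

round(c(support = supp_rule, confidence = confidence, lift = lift), 3)
```

In this toy example the rule {bread, butter} => {milk} has a lift below 1, because milk is slightly less common in baskets with bread and butter than in baskets overall. apriori() computes the same quantities for every candidate rule and keeps those that clear the supp and conf thresholds.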

Step 6: Evaluate Rules by Confidence

rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_conf, 10))
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {rice,                                                                                      
##       sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2]  {canned fish,                                                                               
##       hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3]  {butter,                                                                                    
##       rice,                                                                                      
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4]  {flour,                                                                                     
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5]  {butter,                                                                                    
##       domestic eggs,                                                                             
##       soft cheese}        => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
## [7]  {butter,                                                                                    
##       hygiene articles,                                                                          
##       pip fruit}          => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [8]  {hygiene articles,                                                                          
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [9]  {hygiene articles,                                                                          
##       pip fruit,                                                                                 
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [10] {cream cheese,                                                                              
##       domestic eggs,                                                                             
##       sugar}              => {whole milk}       0.001118454          1 0.001118454 3.913649    11

Explanation:

To find the most reliable rules, we sort them by confidence. Confidence is the share of baskets containing the left-hand side (LHS) that also contain the right-hand side (RHS). The top rules shown here have 100% confidence, meaning every observed basket with the LHS items also contained the RHS item. The table also reports support (the share of all transactions containing both sides) and lift (how much more often the two sides co-occur than expected under independence). These rules rest on only 10 to 17 transactions each, and nine of the ten point to whole milk, the most common item, so they are best treated as candidates for store layout, bundling, or promotions rather than firm conclusions.

Step 7: Evaluate Rules by Lift

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift, 10))
##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                   
##       fruit/vegetable juice,                                                                          
##       other vegetables,                                                                               
##       soda}                   => {root vegetables}  0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {butter,                                                                                         
##       cream cheese,                                                                                   
##       root vegetables}        => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [4]  {butter,                                                                                         
##       sliced cheese,                                                                                  
##       tropical fruit,                                                                                 
##       whole milk}             => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [5]  {cream cheese,                                                                                   
##       curd,                                                                                           
##       other vegetables,                                                                               
##       whipped/sour cream}     => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [6]  {butter,                                                                                         
##       other vegetables,                                                                               
##       tropical fruit,                                                                                 
##       white bread}            => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [7]  {citrus fruit,                                                                                   
##       root vegetables,                                                                                
##       soft cheese}            => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [8]  {brown bread,                                                                                    
##       pip fruit,                                                                                      
##       whipped/sour cream}     => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [9]  {grapes,                                                                                         
##       tropical fruit,                                                                                 
##       whole milk,                                                                                     
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [10] {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10

Explanation:

Next, we sort rules by lift, which measures the strength of a rule relative to what independence would predict. A lift of 1 indicates independence; anything greater suggests a positive association. For example, {liquor, red/blush wine} => {bottled beer} has a lift over 11, meaning baskets containing liquor and red/blush wine include bottled beer about 11 times as often as baskets in general. This metric is crucial in market basket analysis because it favors rules that are not explained simply by the popularity of the RHS item.
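The lift of the top rule can be reproduced from figures already printed in this report. The sketch below uses the rule's count and coverage from the inspect() output and the bottled beer count from the ECLAT output in Step 3; the variable names are illustrative.

```r
n <- 9835                 # number of transactions
count_rule <- 19          # baskets with liquor, red/blush wine and bottled beer
count_rhs  <- 792         # baskets with bottled beer (from the ECLAT output)

support    <- count_rule / n
coverage   <- 0.002135231           # support of the LHS, as reported by inspect()
confidence <- support / coverage    # share of LHS baskets that also contain the RHS
lift       <- confidence / (count_rhs / n)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The confidence and lift agree with the reported 0.9048 and 11.24. Only 19 baskets support the rule, so the large lift reflects a strong but narrow pattern.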

Step 8: Visualize the Top 10 Rules by Lift

library("arulesViz")
## Warning: package 'arulesViz' was built under R version 4.4.3
rules1 <- head(rules_lift, n = 10, by = "lift")
plot(rules1, method = "grouped", control = list(k = 10))

Explanation:

We visualize the top 10 rules by lift using a grouped matrix plot. In this plot, groups of antecedents (LHS) form the columns and consequents (RHS) form the rows; the size of each balloon reflects support and its color reflects lift. It gives an immediate overview of the most interesting rules and their structure, which makes the insights easier to communicate to non-technical audiences such as business stakeholders.

Step 9: Scatter Plot of Association Rules

library(ggplot2)

rules_df <- as(rules_lift, "data.frame")

ggplot(rules_df, aes(x = support, y = lift, color = confidence)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_gradient(low = "lightpink", high = "red", name = "Confidence") +
  labs(
    title = "Scatter Plot of Association Rules",
    x = "Support",
    y = "Lift"
  ) +
  theme_minimal()

Explanation:

To further analyze the mined rules, we convert the rules_lift object into a data frame and create a scatter plot. Each rule is a dot, positioned by its support (x-axis) and lift (y-axis) and shaded by confidence (color). Because all rules were mined with conf = 0.9, the color scale only spans 0.9 to 1.0. The plot reveals a trade-off: rules with high lift often have low support. Rules toward the top-right (high support and high lift), particularly those shaded dark red for high confidence, are especially valuable.

ADDITIONAL STEPS FOR FURTHER ANALYSIS

A. Inspecting Most Frequent Itemsets in Detail

inspect(sort(frequentItems, by = "support", decreasing = TRUE)[1:10])
##      items              support    count
## [1]  {whole milk}       0.25551601 2513 
## [2]  {other vegetables} 0.19349263 1903 
## [3]  {rolls/buns}       0.18393493 1809 
## [4]  {soda}             0.17437722 1715 
## [5]  {yogurt}           0.13950178 1372 
## [6]  {bottled water}    0.11052364 1087 
## [7]  {root vegetables}  0.10899847 1072 
## [8]  {tropical fruit}   0.10493137 1032 
## [9]  {shopping bags}    0.09852567  969 
## [10] {sausage}          0.09395018  924

Explanation:

To better understand customer behavior, we sort the frequent itemsets from Step 3 by support and inspect the top 10. All ten are single items, led by whole milk, other vegetables, and rolls/buns. These frequent items often act as antecedents or consequents in strong rules and are important to track for stock and shelf placement.

B. Rule Filtering for Targeted Marketing Insights

milk_rules <- subset(rules_lift, rhs %in% "whole milk")
inspect(sort(milk_rules, by = "lift", decreasing = TRUE)[1:5])
##     lhs                     rhs              support confidence    coverage     lift count
## [1] {rice,                                                                                
##      sugar}              => {whole milk} 0.001220132          1 0.001220132 3.913649    12
## [2] {canned fish,                                                                         
##      hygiene articles}   => {whole milk} 0.001118454          1 0.001118454 3.913649    11
## [3] {butter,                                                                              
##      rice,                                                                                
##      root vegetables}    => {whole milk} 0.001016777          1 0.001016777 3.913649    10
## [4] {flour,                                                                               
##      root vegetables,                                                                     
##      whipped/sour cream} => {whole milk} 0.001728521          1 0.001728521 3.913649    17
## [5] {butter,                                                                              
##      domestic eggs,                                                                       
##      soft cheese}        => {whole milk} 0.001016777          1 0.001016777 3.913649    10

Explanation:

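This step focuses on targeted marketing. We isolate rules whose consequent (RHS) is “whole milk”, the most frequently purchased item. This answers a key business question: “Which item combinations in a basket most strongly indicate that whole milk is in it too?” The rules describe co-occurrence within a single basket, not the order of purchase. The results can inform marketing strategies (e.g., promoting products that tend to accompany milk). Because every rule here shares the same RHS, lift equals confidence divided by the support of whole milk, so sorting by lift gives the same order as sorting by confidence; the five rules shown all have confidence 1 and lift 3.91.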

C. Rule Filtering for Complementary Product Suggestions

vegetable_rules <- subset(rules_lift, lhs %pin% "vegetable")
inspect(sort(vegetable_rules, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                         rhs                    support confidence    coverage     lift count
## [1] {citrus fruit,                                                                                  
##      root vegetables,                                                                               
##      soft cheese}            => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
## [2] {butter,                                                                                        
##      fruit/vegetable juice,                                                                         
##      tropical fruit,                                                                                
##      whipped/sour cream}     => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
## [3] {citrus fruit,                                                                                  
##      root vegetables,                                                                               
##      tropical fruit,                                                                                
##      whipped/sour cream}     => {other vegetables} 0.001220132          1 0.001220132 5.168156    12
## [4] {butter,                                                                                        
##      rice,                                                                                          
##      root vegetables}        => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [5] {flour,                                                                                         
##      root vegetables,                                                                               
##      whipped/sour cream}     => {whole milk}       0.001728521          1 0.001728521 3.913649    17

Explanation:

This step shows how rules can drive product suggestions. If a basket contains vegetables, what else is it likely to contain? We use partial matching (%pin%) to keep rules whose LHS includes an item with “vegetable” in its label. Note that this matches “root vegetables” and “other vegetables” but also “fruit/vegetable juice”, as rule [2] shows. Such rules can help stores set up combo offers or suggest additional products at checkout or in online baskets, a simple form of recommendation system.
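Since %pin% matches on part of an item label, it helps to check which labels a pattern will actually pick up before filtering. The base-R sketch below mimics that substring matching with grepl() on a handful of labels from this dataset; it is an analogy for the matching behavior, not the arules implementation.

```r
# A few item labels that occur in the dataset.
labels <- c("root vegetables", "other vegetables",
            "fruit/vegetable juice", "whole milk", "yogurt")

# Substring match: keep any label that contains "vegetable".
matches <- grepl("vegetable", labels, fixed = TRUE)
labels[matches]
```

The juice item is matched along with the two vegetable items. If only actual vegetables are wanted, exact matching with %in% and an explicit list of labels, such as c("root vegetables", "other vegetables"), avoids the stray match.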

D. Rule Filtering by Lift and Confidence Together

strong_rules <- subset(rules_lift, lift > 3 & confidence > 0.8)
inspect(head(strong_rules, 10))
##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                   
##       fruit/vegetable juice,                                                                          
##       other vegetables,                                                                               
##       soda}                   => {root vegetables}  0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {butter,                                                                                         
##       cream cheese,                                                                                   
##       root vegetables}        => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [4]  {butter,                                                                                         
##       sliced cheese,                                                                                  
##       tropical fruit,                                                                                 
##       whole milk}             => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [5]  {cream cheese,                                                                                   
##       curd,                                                                                           
##       other vegetables,                                                                               
##       whipped/sour cream}     => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [6]  {butter,                                                                                         
##       other vegetables,                                                                               
##       tropical fruit,                                                                                 
##       white bread}            => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [7]  {citrus fruit,                                                                                   
##       root vegetables,                                                                                
##       soft cheese}            => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [8]  {brown bread,                                                                                    
##       pip fruit,                                                                                      
##       whipped/sour cream}     => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [9]  {grapes,                                                                                         
##       tropical fruit,                                                                                 
##       whole milk,                                                                                     
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [10] {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10

Explanation:

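To narrow down the most strategically valuable rules, we filter by both lift and confidence. A rule with lift > 3 and confidence > 0.8 is reliably predictive and co-occurs far more often than independence would predict. In this run, however, the filter removes nothing: all rules were mined with conf = 0.9, and since lift is confidence divided by the support of the RHS, and no item has support above 0.256 (whole milk), every rule already has a lift of at least about 3.5. The output is therefore the same top 10 as in Step 7. Tighter thresholds, or a minimum count, would be needed to shrink the list further. Rules that pass such a filter are good candidates for product bundles, in-store layout changes, and email promotions, keeping in mind that each rule shown rests on fewer than 20 transactions.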
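Because the rules were mined with conf = 0.9, it is worth checking how much the lift > 3 cutoff can actually remove. The short sketch below computes the smallest lift any mined rule could have, using the relation lift = confidence / support(RHS) and the whole milk count reported earlier; the variable names are illustrative.

```r
min_conf     <- 0.9          # confidence threshold passed to apriori()
max_rhs_supp <- 2513 / 9835  # support of whole milk, the most frequent item

# lift = confidence / support(RHS), so the smallest possible lift pairs the
# lowest allowed confidence with the most frequent possible RHS.
min_possible_lift <- min_conf / max_rhs_supp
round(min_possible_lift, 2)   # about 3.52
```

Since this lower bound already exceeds 3, a stricter cutoff (for example lift > 5) would be needed for the filter to have any effect on this rule set.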

E. Investigating Rule Redundancy

subset_matrix <- is.subset(rules, rules)
redundant <- colSums(subset_matrix) > 1
rules_pruned <- rules[!redundant]
rules_pruned <- sort(rules_pruned, by = "lift", decreasing = TRUE)
inspect(head(rules_pruned, 10))
##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                   
##       fruit/vegetable juice,                                                                          
##       other vegetables,                                                                               
##       soda}                   => {root vegetables}  0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {butter,                                                                                         
##       cream cheese,                                                                                   
##       root vegetables}        => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [4]  {cream cheese,                                                                                   
##       curd,                                                                                           
##       other vegetables,                                                                               
##       whipped/sour cream}     => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [5]  {citrus fruit,                                                                                   
##       root vegetables,                                                                                
##       soft cheese}            => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [6]  {brown bread,                                                                                    
##       pip fruit,                                                                                      
##       whipped/sour cream}     => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [7]  {grapes,                                                                                         
##       tropical fruit,                                                                                 
##       whole milk,                                                                                     
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [8]  {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [9]  {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       whole milk}             => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [10] {newspapers,                                                                                     
##       rolls/buns,                                                                                     
##       soda,                                                                                           
##       whole milk}             => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10

Explanation:

Association rule mining can generate redundant rules: rules that add no new information because their items are already covered by another rule. This block removes them using the is.subset() function, which makes the results more concise. Sorting the pruned rules by lift again keeps the strongest non-redundant patterns at the top for the final insights. Note that rules [3] through [10] above are each supported by only 10 or 11 transactions (the count column), so these high-lift patterns should be read as suggestive rather than conclusive.
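
The pruning idea can be sketched on a toy example. The snippet below is a minimal base-R illustration, assuming the common subset-matrix approach in which a rule is dropped when it contains every item of a higher-ranked rule. The four toy rules and the names rules, subset_mat, and pruned are hypothetical and are not taken from the analysis above.

```r
# Toy rules, assumed to be sorted by lift already (highest first).
# Each entry lists all items in the rule (antecedent plus consequent).
rules <- list(
  r1 = c("herbs", "root vegetables"),
  r2 = c("herbs", "root vegetables", "whole milk"),
  r3 = c("curd", "yogurt"),
  r4 = c("curd", "yogurt", "whole milk")
)
n <- length(rules)

# subset_mat[i, j] is TRUE when every item of rule i also appears in rule j
subset_mat <- outer(
  seq_len(n), seq_len(n),
  Vectorize(function(i, j) all(rules[[i]] %in% rules[[j]]))
)

# Only compare a rule with the rules ranked above it (i < j)
subset_mat[lower.tri(subset_mat, diag = TRUE)] <- FALSE

# Rule j is redundant if some higher-ranked rule i is contained in it
redundant <- colSums(subset_mat) >= 1
pruned <- rules[!redundant]
names(pruned)
```

Here r2 and r4 are dropped because they only extend r1 and r3. arules also provides is.redundant(), which uses a measure-based definition of redundancy (confidence by default) and may be worth comparing against this approach.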

F. Save Your Rules for Reporting

rules_export <- as(rules_pruned, "data.frame")
write.csv(rules_export, "top_association_rule.csv", row.names = FALSE)

Explanation:

Finally, for documentation, reproducibility, or submission purposes, we export the cleaned, sorted rules into a .csv file. This makes it easy to include the results in a report, a dashboard, or a slide deck. It also enables further analysis in Excel or other platforms.

EXTRA CREDIT: CLUSTER ANALYSIS

Cluster analysis groups transactions with similar purchasing patterns. This unsupervised learning technique is useful for market segmentation, allowing businesses to design targeted promotions or better understand the types of baskets customers buy.

Step 1: Convert Transaction Data to Binary Matrix

groc_binary <- as(groc, "matrix")

Explanation:

The first step in clustering is to transform the transaction data into a binary matrix, where each row represents a transaction and each column represents an item. We use as(groc, "matrix") to perform this conversion. The result is a logical matrix: TRUE means the item was purchased in that transaction and FALSE means it was not, and R treats these values as 1 and 0 in arithmetic. This structure is required for traditional clustering methods like k-means, which work on numeric input data.
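
A minimal sketch of how a TRUE/FALSE purchase matrix behaves as 1/0 in arithmetic, using a hypothetical three-basket toy matrix rather than groc (the names toy and toy_num are illustrative only):

```r
# Hypothetical 3-basket, 3-item purchase matrix (not taken from groc)
toy <- matrix(
  c(TRUE,  FALSE, TRUE,
    FALSE, FALSE, TRUE,
    TRUE,  TRUE,  FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("whole milk", "yogurt", "soda"))
)

is.logical(toy)        # the matrix stores TRUE/FALSE
colMeans(toy)          # share of baskets containing each item
toy_num <- toy * 1     # explicit numeric 0/1 copy, dimnames preserved
storage.mode(toy_num)  # "double"
```

An explicit numeric copy is not required for prcomp() or aggregate(), but it can make later steps easier to read.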

Step 2: Reduce Dimensionality Using PCA

groc_pca <- prcomp(groc_binary, center = TRUE, scale. = TRUE)
pca_data <- groc_pca$x[, 1:10]  # Keep first 10 principal components

Explanation:

The binary matrix has 169 columns (items), and most of its entries are 0, so distances computed on the raw matrix tend to be noisy. To reduce this, we apply Principal Component Analysis (PCA) with prcomp() to lower the number of features while retaining as much information (variance) as possible. We keep the first 10 principal components (groc_pca$x[, 1:10]), which are the directions of greatest variance. Ten is a choice made for demonstration; with sparse basket data the leading components may capture only a modest share of the total variance, so the cumulative variance explained should be checked before relying on them. Working in 10 dimensions also makes clustering faster.
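
One way to check how much variance the retained components explain is sketched below. It assumes a small simulated 0/1 matrix (toy) as a stand-in for groc_binary, so the numbers it prints say nothing about the Groceries data.

```r
set.seed(1)

# Hypothetical stand-in for groc_binary: 500 baskets x 20 items,
# each item bought independently with probability 0.1
toy <- matrix(rbinom(500 * 20, size = 1, prob = 0.1), nrow = 500)

# scale. = TRUE divides by each column's standard deviation,
# so every column must contain both 0s and 1s
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_share <- cumsum(var_share)

round(cum_share[1:10], 3)  # cumulative share for the first 10 components
```

The same two lines applied to groc_pca$sdev would show what fraction of the variance the first 10 components of the real data retain.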

Step 3: Apply k-Means Clustering

set.seed(123)
kmeans_result <- kmeans(pca_data, centers = 4, nstart = 25)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations

Explanation:

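We now apply k-means clustering to the PCA-reduced data. K-means partitions transactions into k groups (we choose 4 for demonstration) so that each transaction is closer to its own cluster center than to any other. The nstart = 25 argument runs the algorithm from 25 different random starting centroids and keeps the solution with the lowest total within-cluster sum of squares, which reduces sensitivity to initialization. The warnings above indicate that some runs reached the default limit of 10 iterations before converging; increasing iter.max (for example, iter.max = 100) would give those runs room to finish. The result, kmeans_result, contains the cluster assignments and statistics such as cluster sizes and centers.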
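
The sketch below shows two checks that often accompany this step: raising iter.max so that runs can converge, and comparing the total within-cluster sum of squares across several values of k (the "elbow" heuristic). It uses simulated 2-D points (toy) rather than pca_data, and the names fit and wss are illustrative.

```r
set.seed(123)

# Hypothetical stand-in for pca_data: 300 points around three centers
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

# Allow up to 100 iterations per run instead of the default 10
fit <- kmeans(toy, centers = 3, nstart = 25, iter.max = 100)
fit$size    # number of points in each cluster
fit$ifault  # 0 indicates the chosen run finished without a problem

# Total within-cluster sum of squares for k = 2, ..., 8
wss <- sapply(2:8, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})
round(wss, 1)
```

A sharp drop in wss followed by a flat stretch suggests a reasonable k. On real basket data the curve is usually less clear-cut, so the choice of k remains a judgment call.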

Step 4: Visualize the Clusters

library(ggplot2)

cluster_df <- data.frame(PC1 = pca_data[, 1], PC2 = pca_data[, 2], Cluster = factor(kmeans_result$cluster))

ggplot(cluster_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(alpha = 0.5) +
  labs(title = "Clustering of Grocery Transactions (k = 4)", x = "PC1", y = "PC2") +
  theme_minimal()

Explanation:

We create a scatter plot of the first two principal components (PC1 and PC2) to visualize the clustering result. Each point represents a transaction, colored by its assigned cluster. The plot helps us judge whether the clusters are well separated and how shopping behavior varies among groups. Because the clusters were fitted on 10 components and only 2 are plotted, clusters that overlap in this view may still be separated along other components. The plot alone does not say what each cluster contains; for example, one cluster might consist of small vegetable baskets and another of dairy and meat purchases, which the cluster profiles in Step 5 can confirm.

Step 5: Interpret Cluster Profiles (Optional Bonus)

aggregate(groc_binary, by = list(Cluster = kmeans_result$cluster), FUN = mean)[, 1:10]
##   Cluster abrasive cleaner artif. sweetener baby cosmetics   baby food
## 1       1     0.0126304228      0.001647446   0.0000000000 0.000000000
## 2       2     0.0011037528      0.003311258   0.0011037528 0.001103753
## 3       3     0.0158102767      0.033596838   0.0059288538 0.000000000
## 4       4     0.0004544078      0.001363223   0.0003029385 0.000000000
##           bags baking powder bathroom cleaner       beef    berries
## 1 0.0000000000   0.014827018      0.004942339 0.14991763 0.08182317
## 2 0.0022075055   0.014348786      0.003311258 0.03863135 0.05629139
## 3 0.0000000000   0.191699605      0.007905138 0.14624506 0.08300395
## 4 0.0003029385   0.005604362      0.001666162 0.02029688 0.01287489

Explanation:

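This final step lets us interpret the contents of each cluster. We calculate the average purchase rate of each item per cluster by taking the mean of each item column grouped by cluster. A value close to 1 means the item appears in most baskets in that cluster, and a value near 0 means it is rare. The output shows only the first nine item columns, but differences are already visible: baking powder appears in about 19% of cluster 3 baskets versus under 2% elsewhere, and beef appears in about 15% of baskets in clusters 1 and 3 versus 2% to 4% in clusters 2 and 4. Profiles like these can be used to label clusters, such as “baking-oriented baskets” or “milk and bread shoppers.” Although optional, this step is what makes the clustering interpretable and actionable.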
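
Scanning all 169 columns by eye is impractical, so one option is to list the highest-rate items per cluster. The sketch below assumes simulated data: toy stands in for groc_binary and cluster for kmeans_result$cluster, and both are hypothetical.

```r
set.seed(42)

items <- c("whole milk", "yogurt", "beef", "baking powder", "soda")

# Hypothetical 200-basket purchase matrix and random cluster labels
toy <- matrix(rbinom(200 * 5, size = 1, prob = 0.3), nrow = 200,
              dimnames = list(NULL, items))
cluster <- sample(1:3, 200, replace = TRUE)

# Mean purchase rate of every item within each cluster
profile <- aggregate(toy, by = list(Cluster = cluster), FUN = mean)

# Drop the Cluster column and keep the rates as a numeric matrix
rates <- as.matrix(profile[, -1])
rownames(rates) <- profile$Cluster

# For each cluster, the three items with the highest purchase rate
top_items <- apply(rates, 1, function(r) names(sort(r, decreasing = TRUE))[1:3])
top_items
```

Each column of top_items corresponds to one cluster. Because staples such as whole milk tend to rank high everywhere, comparing each cluster's rate with the overall rate can be more informative than the raw ranking.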


Overall Report of Findings

This report presents a market basket analysis conducted on the Groceries dataset using association rule mining and clustering techniques. The dataset represents 9,835 real-world retail transactions from a grocery store. Each transaction contains a set of items purchased together, akin to a shopping receipt. The aim of the analysis was to discover associations between items using the Apriori algorithm and to evaluate them using support, confidence, and lift. An additional clustering analysis was performed for extra credit to uncover broader patterns in consumer behavior.

The dataset was first loaded using the arules package in R, which is designed for mining association rules and frequent itemsets. The Groceries dataset is already in a transactional format, where each row corresponds to a single transaction and each item within the row represents a product bought in that transaction. An initial exploration of the dataset using summary() and inspect() revealed that there are 9,835 transactions and 169 unique items. This setup makes it highly suitable for association rule mining.

An exploratory data analysis was performed to understand the most frequently purchased items. Using functions such as itemFrequency() and itemFrequencyPlot(), we found that the most commonly purchased items include whole milk (present in approximately 25% of all transactions), other vegetables, rolls/buns, soda, yogurt, and bottled water. These items appeared frequently enough to suggest they are central to many shoppers’ baskets and likely play a role in forming strong association rules.

The core analysis involved generating association rules using the Apriori algorithm with a minimum support threshold of 0.001 and a minimum confidence threshold of 0.5. This means that only rules appearing in at least 0.1% of transactions and with at least 50% confidence were retained. A total of 4,107 rules were generated. These rules were then sorted by lift, which measures how much more likely the consequent is to occur given the antecedent than it would be if the two were independent. A lift above 1 indicates a positive association, and the higher the lift, the stronger it is.
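
As a worked check of how these measures relate, the sketch below recomputes them for one rule from the pruned output, {citrus fruit, root vegetables, soft cheese} => {other vegetables}. The input values are copied from that output and from the transaction count reported above; the variable names are illustrative.

```r
# Values copied from the pruned-rules output for
# {citrus fruit, root vegetables, soft cheese} => {other vegetables}
n_transactions <- 9835
rule_count     <- 10        # baskets containing antecedent and consequent
confidence     <- 1.0
lift           <- 5.168156

# support = share of all baskets that contain every item in the rule
support <- rule_count / n_transactions

# lift = confidence / support(consequent), so the consequent's support
# can be recovered from the two reported measures
consequent_support <- confidence / lift

round(support, 6)             # 0.001017
round(consequent_support, 4)  # 0.1935
```

In other words, other vegetables appear in about 19% of all baskets but in every basket matching the antecedent, which is what a lift of about 5.17 expresses.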

From the sorted rules, the top 10 rules by lift were selected after pruning. Among them were {cream cheese, curd, other vegetables, whipped/sour cream} => {yogurt} with a lift of 6.52 and a confidence of 0.91, and {citrus fruit, root vegetables, soft cheese} => {other vegetables} with a lift of 5.17 and a confidence of 1.00. Most of the top 10 rules have yogurt or other vegetables as the consequent and combine dairy and produce items in the antecedent. For instance, every basket in the data that contained citrus fruit, root vegetables, and soft cheese also contained other vegetables. Both rules cited here are supported by only 10 transactions each, so they are best treated as candidates to test, for example through joint promotions or adjacent shelf placement, rather than as established patterns.

To enhance the interpretability of these rules, visualizations were created. A grouped matrix plot was used to display clusters of related items, and a graph-based plot showed items as nodes with directed arrows representing rules from antecedents to consequents. These visualizations helped identify items like whole milk as central nodes with multiple strong incoming associations. This confirmed earlier observations that some items act as “anchors” in shopping baskets.

For the extra credit portion of the assignment, cluster analysis was applied to the transactions themselves. The transaction data was converted to a binary item matrix, reduced to its first 10 principal components with prcomp(), and partitioned into four groups with kmeans() using 25 random starts. The clusters were plotted on the first two principal components and profiled by the average purchase rate of each item. Among the items inspected, cluster 3 stood out for baking powder (about 19% of its baskets, versus under 2% in the other clusters), clusters 1 and 3 both had beef in roughly 15% of baskets, and cluster 4 had the lowest rates for most of the items shown. These profiles suggest broader shopping patterns, possibly including baking-oriented trips, although k = 4 was chosen for demonstration and k-means reported convergence warnings, so the segments should be treated as exploratory.

Overall, this market basket analysis uncovered item associations in the Groceries dataset that are intuitive and easy to interpret, such as baskets combining several dairy and produce items that also tend to include yogurt or other vegetables. The clustering analysis added a layer of insight by grouping transactions into segments with distinct item profiles, which can inform customer segmentation, targeted marketing, and store layout optimization. Together, the two techniques illustrate the value of association rule mining and clustering for understanding consumer purchasing behavior.