The Grocery Data Set contains 9,835 transactions, where each row represents a customer receipt and each column contains an item purchased during that transaction. This type of analysis is called Market Basket Analysis because it examines which products are frequently purchased together by customers.
The objectives of this assignment are to:
Perform association rule mining using R
Calculate and interpret support, confidence, and lift values
Identify the top 10 association rules based on lift
Conduct a simple cluster analysis for extra credit
# Load libraries
library(arules)
## Warning: package 'arules' was built under R version 4.5.3
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.5.3
library(cluster)
## Warning: package 'cluster' was built under R version 4.5.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.5.3
## Loading required package: ggplot2
# Load grocery dataset
raw_data <- read.csv(
  "C:/Users/rbron/Downloads/GroceryDataSet.csv",
  header = FALSE,
  stringsAsFactors = FALSE
)
# Convert rows into transaction format
transactions_list <- apply(raw_data, 1, function(x) {
  # Remove empty values and NA values
  x <- x[x != "" & !is.na(x)]
  # Convert to character vector
  as.character(x)
})
# Convert to transactions object
transactions <- as(transactions_list, "transactions")
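As an alternative to converting the rows by hand, arules also provides `read.transactions()`, which can read a file with one receipt per line directly. The sketch below is illustrative only: the three receipts are made-up sample data written to a temporary file, and the names `sample_lines`, `tmp_file`, and `sample_tx` are hypothetical rather than part of this assignment's code.

```r
library(arules)

# Hypothetical sample receipts (not the real data set): one receipt per line,
# items separated by commas.
sample_lines <- c(
  "citrus fruit,margarine",
  "whole milk",
  "yogurt,whole milk,coffee"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp_file)

# format = "basket" treats each line of the file as one transaction
sample_tx <- read.transactions(tmp_file, format = "basket", sep = ",")

size(sample_tx)        # number of items in each receipt
itemLabels(sample_tx)  # distinct items found in the file
```

Whether this reads the real GroceryDataSet.csv exactly as the manual conversion does would need to be checked by comparing the two `summary()` outputs.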
# Display summary of transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
# Inspect first five transactions
inspect(transactions[1:5])
##     items
## [1] {citrus fruit,
##      margarine,
##      ready soups,
##      semi-finished bread}
## [2] {coffee,
##      tropical fruit,
##      yogurt}
## [3] {whole milk}
## [4] {cream cheese ,
##      meat spreads,
##      pip fruit,
##      yogurt}
## [5] {condensed milk,
##      long life bakery product,
##      other vegetables,
##      whole milk}
# Generate association rules using the Apriori algorithm
rules <- apriori(
  transactions,
  parameter = list(
    supp = 0.001,
    conf = 0.20,
    minlen = 2
  )
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [21633 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# Display summary of generated rules
summary(rules)
## set of 21633 rules
##
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6
##  620 9337 9824 1792   60
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   4.000   3.599   4.000   6.000
##
## summary of quality measures:
##     support           confidence        coverage             lift
##  Min.   :0.001017   Min.   :0.2000   Min.   :0.001017   Min.   : 0.8028
##  1st Qu.:0.001118   1st Qu.:0.2632   1st Qu.:0.002745   1st Qu.: 2.1178
##  Median :0.001322   Median :0.3548   Median :0.004169   Median : 2.7571
##  Mean   :0.001948   Mean   :0.3967   Mean   :0.005840   Mean   : 3.0214
##  3rd Qu.:0.001932   3rd Qu.:0.5000   3rd Qu.:0.006101   3rd Qu.: 3.6148
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158
##      count
##  Min.   : 10.00
##  1st Qu.: 11.00
##  Median : 13.00
##  Mean   : 19.15
##  3rd Qu.: 19.00
##  Max.   :736.00
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.001 0.2
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.2, minlen = 2))
# Sort association rules by lift
rules_sorted <- sort(
  rules,
  by = "lift",
  decreasing = TRUE
)
# Display the top 10 rules
inspect(head(rules_sorted, 10))
##      lhs                                rhs                     support
## [1]  {bottled beer, red/blush wine}  => {liquor}                0.001931876
## [2]  {hamburger meat, soda}          => {Instant food products} 0.001220132
## [3]  {ham, white bread}              => {processed cheese}      0.001931876
## [4]  {bottled beer, liquor}          => {red/blush wine}        0.001931876
## [5]  {Instant food products, soda}   => {hamburger meat}        0.001220132
## [6]  {curd, sugar}                   => {flour}                 0.001118454
## [7]  {baking powder, sugar}          => {flour}                 0.001016777
## [8]  {processed cheese, white bread} => {ham}                   0.001931876
## [9]  {fruit/vegetable juice, ham}    => {processed cheese}      0.001118454
## [10] {margarine, sugar}              => {flour}                 0.001626843
##      confidence coverage    lift     count
## [1]  0.3958333  0.004880529 35.71579 19
## [2]  0.2105263  0.005795628 26.20919 12
## [3]  0.3800000  0.005083884 22.92822 19
## [4]  0.4130435  0.004677173 21.49356 19
## [5]  0.6315789  0.001931876 18.99565 12
## [6]  0.3235294  0.003457041 18.60767 11
## [7]  0.3125000  0.003253686 17.97332 10
## [8]  0.4634146  0.004168785 17.80345 19
## [9]  0.2894737  0.003863752 17.46610 11
## [10] 0.2962963  0.005490595 17.04137 16
# Convert association rules to a data frame
rules_df <- as(rules_sorted, "data.frame")
# Store the top 10 rules
top10_rules <- head(rules_df, 10)
# Display the top 10 rules
print(top10_rules)
##                                                  rules     support confidence
## 633          {bottled beer,red/blush wine} => {liquor} 0.001931876  0.3958333
## 696   {hamburger meat,soda} => {Instant food products} 0.001220132  0.2105263
## 1489           {ham,white bread} => {processed cheese} 0.001931876  0.3800000
## 632          {bottled beer,liquor} => {red/blush wine} 0.001931876  0.4130435
## 695   {Instant food products,soda} => {hamburger meat} 0.001220132  0.6315789
## 2022                           {curd,sugar} => {flour} 0.001118454  0.3235294
## 1916                  {baking powder,sugar} => {flour} 0.001016777  0.3125000
## 1488           {processed cheese,white bread} => {ham} 0.001931876  0.4634146
## 1492 {fruit/vegetable juice,ham} => {processed cheese} 0.001118454  0.2894737
## 2025                      {margarine,sugar} => {flour} 0.001626843  0.2962963
##         coverage     lift count
## 633  0.004880529 35.71579    19
## 696  0.005795628 26.20919    12
## 1489 0.005083884 22.92822    19
## 632  0.004677173 21.49356    19
## 695  0.001931876 18.99565    12
## 2022 0.003457041 18.60767    11
## 1916 0.003253686 17.97332    10
## 1488 0.004168785 17.80345    19
## 1492 0.003863752 17.46610    11
## 2025 0.005490595 17.04137    16
# Create a scatter plot of association rules
plot(
  rules_sorted,
  method = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift"
)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Create a graph visualization of the top 20 rules
plot(
  head(rules_sorted, 20),
  method = "graph"
)
Support represents how often a combination of items appears across all transactions in the dataset. It is found by dividing the number of transactions that contain every item in the rule (left-hand and right-hand side together) by the total number of transactions. Higher support values indicate that the item combination occurs more frequently.
The summary results showed a maximum support value of approximately 0.0748 (736 transactions), meaning the most common rule applied to about 7.5% of all grocery receipts. The ten rules with the highest lift had support values between about 0.001 and 0.002 (10 to 19 transactions each), so these combinations are rare and their strength should be read with that small sample in mind.
For example, the rule:
{bottled beer, red/blush wine} => {liquor}
had a support value of 0.00193. This means that roughly 0.19% of all transactions included bottled beer, red/blush wine, and liquor together.
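The support figure for this rule can be reproduced by hand from the counts in the output above. This is a minimal sketch of the arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Values taken from the output above
n_transactions <- 9835  # total receipts, from summary(transactions)
rule_count     <- 19    # "count" column for {bottled beer, red/blush wine} => {liquor}

# Support = receipts containing all items in the rule / all receipts
support <- rule_count / n_transactions

round(support, 9)        # 0.001931876, matching the "support" column
round(100 * support, 2)  # 0.19 (percent of all receipts)
```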
Confidence measures the likelihood that a customer purchases the right-hand-side item (B) when the left-hand-side items (A) are already in the shopping basket. It is calculated by dividing the number of transactions containing all items in the rule by the number of transactions containing the left-hand side.
The analysis used a minimum confidence threshold of 0.20, so for every generated rule the right-hand-side item appeared in at least 20% of the transactions that contained the left-hand-side items. Across the 21,633 generated rules, the average confidence was about 0.397, and the highest confidence was 1.00.
One strong example was the rule:
{Instant food products, soda} => {hamburger meat}
This rule had a confidence value of 0.632, meaning that about 63% of the transactions containing instant food products and soda (12 of 19) also contained hamburger meat. This suggests a strong relationship between these products, although it rests on a small number of receipts.
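The confidence for this rule can be checked against the reported counts. The sketch below recovers the left-hand-side count from the `coverage` column (coverage is the support of the left-hand side) and then divides; the variable names are illustrative.

```r
# Values taken from the top-10 output for
# {Instant food products, soda} => {hamburger meat}
n_transactions <- 9835
rule_count     <- 12           # receipts containing all three items
coverage       <- 0.001931876  # share of receipts containing the left-hand side

# Receipts containing instant food products and soda
lhs_count <- round(coverage * n_transactions)  # 19

# Confidence = receipts with all items / receipts with the left-hand side
confidence <- rule_count / lhs_count

round(confidence, 7)  # 0.6315789, matching the "confidence" column
```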
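Lift measures the strength of an association compared to what would be expected if the products were purchased independently. It is the rule's confidence divided by the support of the right-hand-side item. A lift greater than 1 indicates that the items are bought together more often than expected by chance, a lift of 1 indicates independence, and a lift below 1 indicates that they are bought together less often than expected.
Lift ranged from 0.80 to 35.72 across all rules, with an average of approximately 3.02 and a median of 2.76, showing that most of the generated rules describe positive associations. The highest lift value, 35.72, indicates an especially strong purchasing relationship.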
The strongest rule identified was:
{bottled beer, red/blush wine} => {liquor}
with a lift value of 35.72. This means that a transaction containing bottled beer and red/blush wine was about 35.7 times as likely to also contain liquor as a transaction chosen at random.
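The lift of this rule can be unpacked using only numbers from the output above. The support of {liquor} on its own is not printed, so the sketch below derives it from the reported confidence and lift; that derived value, and the variable names, are assumptions of this illustration rather than reported results.

```r
# Values taken from the top-10 output for
# {bottled beer, red/blush wine} => {liquor}
n_transactions <- 9835
confidence     <- 0.3958333
lift           <- 35.71579
coverage       <- 0.004880529  # support of the left-hand side
observed_count <- 19           # receipts containing all three items

# Lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
rhs_count   <- rhs_support * n_transactions  # about 109 receipts with liquor

# Receipts expected to contain all three items if the left-hand side and
# liquor were purchased independently
expected_count <- coverage * rhs_support * n_transactions  # about 0.53

# Observed / expected reproduces the lift
observed_count / expected_count  # about 35.7
```

Because the expected count is well below one receipt, a handful of additional or missing transactions would change this lift considerably.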
Other strong association rules included:
{ham, white bread} => {processed cheese} with a lift of 22.93
{Instant food products, soda} => {hamburger meat} with a lift of 19.00
{processed cheese, white bread} => {ham} with a lift of 17.80
These rules suggest that customers often purchase related food products and meal ingredients together.
The scatter plot displayed support on the x-axis and confidence on the y-axis, while darker colors represented higher lift values. Most rules appeared at low support levels, meaning many item combinations occurred infrequently. However, several rules showed high confidence and high lift, revealing strong purchasing relationships even when the combinations were less common.
The network graph further illustrated the strongest product associations. Alcohol-related items such as bottled beer, liquor, and red/blush wine formed one of the clearest clusters, reflecting very strong relationships between these products. Other noticeable connections involved ham, processed cheese, white bread, soda, hamburger meat, flour, sugar, and baking powder, indicating that shoppers commonly purchased complementary ingredients together.
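The rules with the highest lift point to the strongest relationships between products that customers purchased together, but each of them rests on only 10 to 19 receipts. A rule involving more common items reads the same way, for example:
{yogurt, tropical fruit} => {whole milk}
This rule says that customers who buy yogurt and tropical fruit are also likely to buy whole milk. It is not among the top 10 rules by lift, and its support, confidence, and lift are not shown in the output above, so its strength should be confirmed before acting on it. As with any rule, a lift greater than 1 would mean that the items are bought together more often than expected if they were purchased independently.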
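The quality measures for the yogurt and tropical fruit rule can be looked up with `subset()`. The sketch below is self-contained and therefore uses the `Groceries` data set that ships with arules, on the assumption that it holds the same receipts as GroceryDataSet.csv; in the analysis above, the existing `rules` object could be subset in the same way. The names `rules_demo`, `milk_rules`, and `milk_rule` are illustrative.

```r
library(arules)

# Assumption: the built-in Groceries data matches GroceryDataSet.csv
data("Groceries")

rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.20, minlen = 2),
  control = list(verbose = FALSE)
)

# Rules whose left-hand side contains both items and whose
# right-hand side is whole milk
milk_rules <- subset(
  rules_demo,
  subset = lhs %ain% c("yogurt", "tropical fruit") & rhs %in% "whole milk"
)

# Keep only the rule with exactly those two items on the left-hand side
milk_rule <- milk_rules[size(lhs(milk_rules)) == 2]

inspect(milk_rule)  # prints support, confidence, coverage, lift, and count
```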
Businesses can use this information to:
Place related products near each other in stores
Create product bundle promotions
Recommend related products online
Increase cross-selling opportunities
Cluster analysis groups transactions with similar purchasing patterns.
# Convert sparse transaction data into a regular matrix
transaction_matrix <- as(transactions, "matrix")
# Run k-means clustering
set.seed(123)
clusters <- kmeans(
  transaction_matrix,
  centers = 3
)
# Display the first few cluster assignments
head(clusters$cluster)
## [1] 1 1 3 1 3 3
# Display the size of each cluster
clusters$size
## [1] 6554 1085 2196
# Visualize the k-means clusters
fviz_cluster(
  clusters,
  data = transaction_matrix
)
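The choice of three clusters above is fixed in the code. One common way to sanity-check such a choice is an elbow plot of the total within-cluster sum of squares for several values of k. The sketch below is a hedged illustration, not part of the original analysis: it uses the built-in `Groceries` data on the assumption that it matches the CSV, and the range 1 to 6 and the `nstart` and `iter.max` settings are arbitrary choices.

```r
library(arules)

# Assumption: the built-in Groceries data matches GroceryDataSet.csv
data("Groceries")
item_matrix <- as(Groceries, "matrix") * 1  # logical -> 0/1 numeric

set.seed(123)
k_values <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(item_matrix, centers = k, nstart = 3, iter.max = 30)$tot.withinss
})

# Look for the k after which the curve flattens
plot(
  k_values, wss,
  type = "b",
  xlab = "Number of clusters (k)",
  ylab = "Total within-cluster sum of squares"
)
```

With sparse 0/1 purchase data the curve often has no sharp bend, in which case the choice of k remains a judgment call.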
The k-means cluster analysis grouped the 9,835 grocery transactions into three clusters of receipts with similar item combinations. The number of clusters was set to three in advance rather than estimated from the data.
The cluster sizes were:
Cluster 1: 6,554 transactions
Cluster 2: 1,085 transactions
Cluster 3: 2,196 transactions
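Cluster 1 was the largest group, covering about 67% of all transactions, which indicates that most receipts look alike in terms of the items they contain. Cluster 2 was the smallest (about 11%) and may represent less typical baskets, while Cluster 3 (about 22%) formed another distinct group. The output above does not show which items characterize each cluster, so these interpretations are tentative until the cluster centers are inspected.
The cluster plot showed the three groups in different colors, projected onto the first two principal components of the 169 item columns. The groups overlapped in places but still occupied visibly different regions of the plot.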
Businesses can use these results to improve marketing, product placement, and customer recommendations.
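To see what actually distinguishes the clusters, the cluster centers can be inspected. The sketch below is one possible approach under stated assumptions: it uses the built-in `Groceries` data in place of the CSV, the names `item_matrix`, `fit`, and `top_items` are illustrative, and the cluster numbering or sizes it produces may differ from the output above.

```r
library(arules)

# Assumption: the built-in Groceries data matches GroceryDataSet.csv
data("Groceries")
item_matrix <- as(Groceries, "matrix") * 1  # logical -> 0/1 numeric

set.seed(123)
fit <- kmeans(item_matrix, centers = 3)

# For 0/1 data, each cluster center holds the share of that cluster's
# receipts containing each item; list the five most common items per cluster
top_items <- apply(fit$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:5]
})
top_items

# Average number of items per receipt in each cluster
tapply(rowSums(item_matrix), fit$cluster, mean)
```

If the clusters differ mainly in basket size rather than in which items are bought, that would argue for describing them as small, medium, and large baskets rather than as distinct customer types.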
This project used market basket analysis to explore customer purchasing patterns within the Grocery Data Set using R. By applying the Apriori algorithm, the analysis identified products that were frequently purchased together and measured the strength of these relationships using support, confidence, and lift.
The results showed several strong associations between grocery items, especially among products that are commonly used together or purchased as part of the same meal. Visualizations such as scatter plots and network graphs helped display these relationships more clearly.
The cluster analysis also divided the transactions into three groups with similar item combinations, suggesting that distinct shopping patterns exist in the data.
Overall, the analysis demonstrated how association rule mining and clustering can provide valuable business insights. Companies can use these findings to improve marketing strategies, organize store layouts, recommend related products, and increase sales through cross-selling opportunities.