Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
library(arules)
library(arulesViz)
library(cluster)
To load the data from the CSV file, we’ll use the read.transactions function from the arules package. This will create a transactions object used for mining associations.
groceries <- read.transactions("GroceryDataSet.csv", format = "basket", sep = ",", rm.duplicates = TRUE)
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
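The density and mean basket size in the summary follow directly from the item counts printed above. As a quick sanity check, the sketch below reproduces both figures with plain arithmetic; the variable names are ours and are not part of arules.

```r
# Figures copied from the summary() output above
n_transactions <- 9835
n_items        <- 169

# Total item occurrences: the five most frequent items plus "(Other)"
total_occurrences <- sum(c(2513, 1903, 1809, 1715, 1372, 34055))

# Density = share of filled cells in the transactions-by-items matrix
density <- total_occurrences / (n_transactions * n_items)

# Mean basket size = item occurrences per transaction
mean_basket <- total_occurrences / n_transactions

print(round(density, 8))      # compare with the reported density 0.02609146
print(round(mean_basket, 3))  # compare with the reported mean 4.409
```

A density of about 2.6% means that a typical receipt contains only a handful of the 169 possible items, which is why the data is stored in sparse format.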
# Top 20 most frequent items
itemFrequencyPlot(groceries, topN=20, type="absolute", main="Top 20 Items")
This plot shows the 20 most frequently purchased items in the dataset. Whole milk is the most frequently purchased item (2,513 transactions), followed by other vegetables (1,903) and rolls/buns (1,809).
Now that the dataset is a transactions object, we can use the Apriori algorithm to mine the data and uncover hidden patterns. The Apriori algorithm finds frequent itemsets and generates association rules by identifying item combinations that appear together often enough to meet specified thresholds. It works efficiently by using the principle that if an itemset is frequent, all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, every larger itemset that contains it can be skipped without counting.
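To make the pruning principle concrete, here is a minimal base-R sketch of a level-wise frequent itemset search. It is an illustration only, not how arules implements Apriori: the toy baskets, the threshold min_count, and the helper count_itemset are all hypothetical.

```r
# Toy baskets (hypothetical data, not taken from the Groceries file)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("bread", "butter"),
  c("milk", "bread", "yogurt")
)
min_count <- 2  # an itemset is "frequent" if at least 2 baskets contain it

# Number of baskets that contain every item of an itemset
count_itemset <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) count_itemset(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k: join frequent (k-1)-itemsets, prune, then count the survivors
while (length(frequent) > 1) {
  k <- length(frequent[[1]]) + 1

  # Candidate generation: unions of two frequent itemsets that have size k
  candidates <- list()
  for (i in seq_along(frequent)) {
    for (j in seq_along(frequent)) {
      if (i < j) {
        u <- sort(union(frequent[[i]], frequent[[j]]))
        if (length(u) == k) candidates[[paste(u, collapse = ",")]] <- u
      }
    }
  }

  # Pruning: drop a candidate if any of its (k-1)-subsets is not frequent
  prev_keys <- vapply(frequent, paste, character(1), collapse = ",")
  keep <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, function(s) paste(s, collapse = ",") %in% prev_keys,
               logical(1)))
  }, candidates)

  # Only the surviving candidates are counted against the baskets
  frequent <- unname(Filter(function(s) count_itemset(s) >= min_count, keep))
  all_frequent <- c(all_frequent, frequent)
}

for (s in all_frequent) {
  cat("{", paste(s, collapse = ", "), "} count =", count_itemset(s), "\n")
}
```

In this toy data both 3-item candidates are discarded in the pruning step because each contains an infrequent pair, so they are never counted against the baskets.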
# We set support to 0.01 (1%) to ignore very rare items.
# We set confidence to 0.1 (10%) to filter out very weak rules.
rules <- apriori(groceries, parameter = list(supp = 0.01, conf = 0.1, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [435 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The data contain 9,835 transactions, so a minimum support of 1% corresponds to the absolute minimum support count of 98 reported in the log (0.01 × 9835 = 98.35). Items that fall below this threshold are ignored by Apriori: of the 169 items in the CSV, 88 were considered frequent. Next, the algorithm combined these 88 frequent items into 2-item candidates (like Milk and Bread) and kept those that also met the support threshold. It continued this process with larger itemsets (the log shows subsets of size 1 through 4 being checked) until no larger frequent itemsets could be found. Finally, it generated rules from these frequent itemsets. We set a minimum confidence of 10%, so the algorithm only kept rules where the right-hand side (RHS) appears in at least 10% of the transactions that contain the left-hand side (LHS). This resulted in a final set of 435 association rules.
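The relationship between the relative support threshold and a number of receipts is simple arithmetic. The short sketch below uses only figures from the log; the variable names are ours.

```r
n_transactions <- 9835   # from the apriori() log above
min_support    <- 0.01   # the supp parameter passed to apriori()

# Relative threshold expressed as a number of receipts
threshold_count <- min_support * n_transactions
print(threshold_count)   # the log reports an absolute minimum support count of 98

# Relative support of an itemset that occurs in 99 receipts
support_99 <- 99 / n_transactions
print(round(support_99, 8))
```

An itemset found in 99 receipts therefore sits just above the 1% threshold.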
# Sort the rules
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))
## lhs rhs support
## [1] {whole milk, yogurt} => {curd} 0.01006609
## [2] {citrus fruit, other vegetables} => {root vegetables} 0.01037112
## [3] {other vegetables, yogurt} => {whipped/sour cream} 0.01016777
## [4] {other vegetables, tropical fruit} => {root vegetables} 0.01230300
## [5] {root vegetables} => {beef} 0.01738688
## [6] {beef} => {root vegetables} 0.01738688
## [7] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
## [8] {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [9] {other vegetables, whole milk} => {root vegetables} 0.02318251
## [10] {other vegetables, whole milk} => {butter} 0.01148958
## confidence coverage lift count
## [1] 0.1796733 0.05602440 3.372304 99
## [2] 0.3591549 0.02887646 3.295045 102
## [3] 0.2341920 0.04341637 3.267062 100
## [4] 0.3427762 0.03589222 3.144780 121
## [5] 0.1595149 0.10899847 3.040367 171
## [6] 0.3313953 0.05246568 3.040367 171
## [7] 0.5862069 0.01769192 3.029608 102
## [8] 0.5845411 0.02104728 3.020999 121
## [9] 0.3097826 0.07483477 2.842082 228
## [10] 0.1535326 0.07483477 2.770630 113
This table shows the top 10 association rules ranked by lift, which measures how much more often the LHS and RHS occur together than would be expected if they were independent. The strongest rule is {whole milk, yogurt} => {curd}: a customer with both whole milk and yogurt in the basket is about 3.37 times as likely to buy curd as the average shopper. However, the support is only about 0.01, meaning that only about 1% of all transactions (99 receipts) contain this exact combination.
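Confidence and lift can be recomputed by hand from the support and coverage columns. The sketch below does this for rules [5] and [6], which involve the same two items; it assumes the standard definitions, with coverage read as the support of the LHS, and the variable names are ours.

```r
# Values copied from rules [5] and [6] in the table above
supp_both <- 0.01738688   # support of {root vegetables, beef}
supp_root <- 0.10899847   # coverage of rule [5] = support of {root vegetables}
supp_beef <- 0.05246568   # coverage of rule [6] = support of {beef}

# confidence(A => B) = support(A and B) / support(A)
conf_root_to_beef <- supp_both / supp_root
conf_beef_to_root <- supp_both / supp_beef

# lift(A => B) = support(A and B) / (support(A) * support(B))
lift <- supp_both / (supp_root * supp_beef)

print(round(conf_root_to_beef, 4))  # table value for rule [5]: 0.1595149
print(round(conf_beef_to_root, 4))  # table value for rule [6]: 0.3313953
print(round(lift, 3))               # table value for both rules: 3.040367
```

Confidence depends on the direction of the rule, while lift is symmetric in the two sides, which is why rules [5] and [6] share the same lift but have different confidence.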
One noticeable takeaway from this table is that root vegetables and other vegetables frequently appear in the same rules, on either side. This makes sense, as vegetables are usually shelved in the same section of the market and shoppers tend to buy them together. Rules 7 and 8 have the highest confidence in the table (about 59% and 58%): once a shopper has root vegetables and citrus or tropical fruit in the basket, there is a better-than-even chance that other vegetables are in it too. This suggests that fruit and vegetable purchases cluster together.
Although market basket analysis provides insight into product relationships, it treats all transactions as a single group. By applying cluster analysis, we can group similar transactions together and identify purchasing patterns. To do this, we use k-means clustering, which separates transactions into groups based on similarities in their item-purchase patterns.
# Convert transactions to a binary matrix
# This creates a table where rows are transactions (receipts) and columns are every possible item.
trans_matrix <- as(groceries, "matrix")
# Turn the TRUE/FALSE entries into numeric 0s and 1s before clustering
trans_matrix <- trans_matrix * 1
# Run K-Means Clustering
set.seed(123)
# We choose 3 clusters (centers=3) as a starting point.
kmeans_result <- kmeans(trans_matrix, centers = 3)
#Cluster Sizes
cat("Number of shoppers in each cluster:\n")
## Number of shoppers in each cluster:
print(kmeans_result$size)
## [1] 6554 1085 2196
# Top Items per Cluster
cat("\nTop 5 items that define each cluster (by frequency):\n")
##
## Top 5 items that define each cluster (by frequency):
# Get the centers matrix (Rows = Clusters, Cols = Items)
centers <- kmeans_result$centers
# Ensure column names are set (items)
colnames(centers) <- colnames(trans_matrix)
# Show top items by cluster
for(i in 1:3) {
cat(paste0("\n--- CLUSTER ", i, " ---\n"))
top_items <- sort(centers[i, ], decreasing = TRUE)
print(head(top_items, 5))
}
##
## --- CLUSTER 1 ---
## soda rolls/buns other vegetables yogurt
## 0.16875191 0.16615807 0.15410436 0.10329570
## shopping bags
## 0.09719255
##
## --- CLUSTER 2 ---
## bottled water whole milk soda other vegetables
## 1.0000000 0.3096774 0.2626728 0.2230415
## rolls/buns
## 0.2193548
##
## --- CLUSTER 3 ---
## whole milk other vegetables rolls/buns yogurt
## 0.9913479 0.2964481 0.2194900 0.2144809
## root vegetables
## 0.1930783
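The values printed for each cluster are entries of the k-means centers. For 0/1 data, a center entry is simply the share of transactions in that cluster that contain the item. The sketch below checks this on a small, hypothetical purchase matrix (toy, its item names, and nstart = 10 are our choices, not part of the analysis above).

```r
# Hypothetical 0/1 purchase matrix: 6 transactions, 3 items
toy <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 0,
    0, 0, 1,
    0, 1, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "water"))
)

set.seed(1)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Recompute each center by hand as the column means of its member rows
manual <- t(sapply(1:2, function(k) {
  colMeans(toy[fit$cluster == k, , drop = FALSE])
}))

print(fit$centers)
print(all.equal(unname(fit$centers), unname(manual)))
```

Read this way, a center value of 1.0 for an item means every transaction in that cluster contains it.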
The k-means analysis splits the shoppers into three segments. Cluster 1 represents the majority of the shoppers with 6,554 (67%). The most common item in this cluster is soda at 17%. The top five items in this cluster range from about 10% to 17%, indicating that there is no single dominant product.
Cluster 2 represents 1,085 shoppers (11%). This cluster has a 100% purchase rate of bottled water, meaning every single customer in this group bought bottled water, frequently accompanied by whole milk (31%) and soda (26%). Cluster 2 seems to be characterized by beverage shoppers.
Lastly, cluster 3 contains 2,196 shoppers (22%) and is defined by whole milk, which appears in 99% of its transactions. Similar to cluster 2, these shoppers buy other vegetables and rolls/buns as well. Cluster 3 shoppers also have a fairly high probability (21%) of purchasing yogurt, which makes sense since whole milk and yogurt are both dairy products that are probably shelved next to each other.
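The cluster shares can be checked directly from the sizes printed by kmeans_result$size. A short sketch, with variable names of our own choosing:

```r
cluster_sizes <- c(6554, 1085, 2196)   # copied from the kmeans_result$size output
total <- sum(cluster_sizes)            # should match the number of transactions
shares <- round(100 * cluster_sizes / total, 1)

print(total)
print(shares)   # percentage of transactions in clusters 1, 2 and 3
```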