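# Load the packages used below: arules (read.transactions, apriori),
# cluster (daisy) and factoextra (fviz_cluster)
library(arules)
library(cluster)
library(factoextra)
# Load the dataset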
groceries <- read.transactions("/Users/aelsaeyed/Downloads/GroceryDataSet.csv", format = "basket", sep = ",")
# Summary of the data
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
# Inspect the first few transactions
inspect(groceries[1:5])
##     items
## [1] {citrus fruit,
##      margarine,
##      ready soups,
##      semi-finished bread}
## [2] {coffee,
##      tropical fruit,
##      yogurt}
## [3] {whole milk}
## [4] {cream cheese,
##      meat spreads,
##      pip fruit,
##      yogurt}
## [5] {condensed milk,
##      long life bakery product,
##      other vegetables,
##      whole milk}
The dataset has 9,835 rows (transactions) and 169 columns (distinct items). The density is 0.02609146, or about 2.6%. It is calculated as the number of filled cells (item purchases) divided by the total number of cells in the transaction-by-item matrix (9,835 × 169). Only 2.6% of the matrix is non-empty, which makes it quite sparse. The average transaction contains 4.4 items, which is small relative to the 169 items available.
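As a quick sanity check, the density and the mean basket size can be reproduced from the numbers in the summary output. This is only a sketch: it hard-codes the item counts printed above instead of reading them from `groceries`, and the variable names are my own.

```r
# Figures copied from the summary() output above
n_transactions <- 9835
n_items <- 169
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top 5 items plus "(Other)"

# Total number of filled cells in the transaction-by-item matrix
total_purchased <- sum(item_counts)

# Density: filled cells divided by all cells
density <- total_purchased / (n_transactions * n_items)

# Mean basket size: filled cells divided by number of transactions
mean_basket <- total_purchased / n_transactions

print(c(density = density, mean_basket = mean_basket))
```

Both values should agree with the summary (0.02609146 and 4.409) up to rounding.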
# Generate rules
rules <- apriori(groceries, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Sort rules by lift
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
# Inspect top 10 rules by lift
top_rules <- rules_sorted[1:10]
inspect(top_rules)
##      lhs                                  rhs                support
## [1]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}     => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}         => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                    => {whole milk}       0.01006609
## [6]  {butter, other vegetables}        => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit} => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}         => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables} => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}      => {whole milk}       0.01087951
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102
## [2]  0.5845411  0.02104728 3.020999 121
## [3]  0.5020921  0.02430097 2.594890 120
## [4]  0.5000000  0.02582613 2.584078 127
## [5]  0.5823529  0.01728521 2.279125  99
## [6]  0.5736041  0.02003050 2.244885 113
## [7]  0.5700483  0.02104728 2.230969 118
## [8]  0.5629921  0.02582613 2.203354 143
## [9]  0.5525114  0.02226741 2.162336 121
## [10] 0.5245098  0.02074225 2.052747 107
Taking the first row of the top 10 rules as an example, the Support column shows that “citrus fruit” and “root vegetables” appear together with “other vegetables” in 1.04% of all transactions.
The Confidence column shows that 58.62% of the transactions that include “citrus fruit” and “root vegetables” also include “other vegetables”.
Lift shows that “other vegetables” is about 3.03 times more likely to appear in a transaction that contains “citrus fruit” and “root vegetables” than in a transaction chosen at random.
Finally, Count is the number of transactions that contain every item in the rule, 102 in this case.
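To see how these measures relate to one another, the first rule can be recomputed by hand. The sketch below hard-codes counts taken from the output above. The count for the left-hand side (174) is not printed directly; I infer it from coverage × 9,835, so treat it as an assumption. The variable names are mine.

```r
n_transactions <- 9835
count_rule <- 102   # transactions containing lhs and rhs together (count column)
count_lhs <- 174    # assumed: round(0.01769192 * 9835), i.e. coverage * n
count_rhs <- 1903   # "other vegetables", from the summary's most frequent items

support <- count_rule / n_transactions            # share of all transactions
coverage <- count_lhs / n_transactions            # share containing the lhs
confidence <- count_rule / count_lhs              # P(rhs | lhs)
lift <- confidence / (count_rhs / n_transactions) # confidence relative to P(rhs)

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 4))
```

The results should match the first row of the rules table up to rounding, which also shows that lift is simply confidence divided by the overall frequency of the right-hand side.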
# Convert transactions to a 0/1 numeric matrix (rows = transactions, columns = items)
groceries_numeric <- as(groceries, "matrix") * 1
# First attempt, which triggered the warning described below:
# dissimilarity <- daisy(groceries_numeric, metric = "gower")
# Declare every column an asymmetric binary variable
dissimilarity <- daisy(groceries_numeric, metric = "gower", type = list(asymm = 1:ncol(groceries_numeric)))
At first, R warned that it was treating my variables as continuous (the default behavior) when calculating dissimilarity. Since the data are binary (1 if an item is present, 0 if not), I overrode this by declaring every column an asymmetric binary variable, so that two transactions are not considered similar merely because they both lack the same item.
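To illustrate what an asymmetric binary dissimilarity does, here is a toy example in base R. It uses `dist(method = "binary")` rather than `daisy()`, so it is a stand-in and not the call used above; my understanding is that `daisy()` with every column declared asymmetric binary gives the same Jaccard-style values, but I have not verified that on this dataset. The three baskets are made up.

```r
# Three hypothetical baskets over four items (1 = bought, 0 = not bought)
baskets <- rbind(
  a = c(milk = 1, bread = 1, beer = 0, wine = 0),
  b = c(milk = 1, bread = 0, beer = 0, wine = 0),
  c = c(milk = 0, bread = 0, beer = 1, wine = 1)
)

# "binary" distance: among items bought by at least one of the two baskets,
# the proportion bought by only one of them; shared absences are ignored
d <- dist(baskets, method = "binary")
print(d)
```

Baskets `a` and `b` share milk and differ only on bread, so their distance is 0.5. Basket `c` has nothing in common with either, so both of its distances are 1, even though `b` and `c` agree that bread was not bought.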
# Perform hierarchical clustering
hc <- hclust(dissimilarity, method = "ward.D2")
# Plot the dendrogram
plot(hc, labels = FALSE, main = "Dendrogram of Groceries Transactions")
# Cut the tree into 5 clusters
clusters <- cutree(hc, k = 5)
# Visualize the clusters
fviz_cluster(list(data = groceries_numeric, cluster = clusters), geom = "point", stand = FALSE)
# Convert the transactions object to a binary numeric matrix
groceries_matrix <- as(groceries, "matrix") * 1
cluster_labels <- cutree(hc, k = 5)
# Add cluster labels to the matrix
groceries_with_clusters <- data.frame(groceries_matrix, Cluster = cluster_labels)
# Calculate the average presence of items per cluster
cluster_summary <- aggregate(. ~ Cluster, data = groceries_with_clusters, FUN = mean)
# Extract the top 5 items for each cluster
top_products <- apply(cluster_summary[,-1], 1, function(x) {
  sorted_indices <- order(x, decreasing = TRUE) # Sort by frequency
  colnames(cluster_summary)[-1][sorted_indices[1:5]] # Get top 5 product names
})
# Convert to a readable dataframe
top_products_df <- data.frame(Cluster = 1:5, t(top_products))
colnames(top_products_df)[-1] <- paste0("Product_", 1:5)
# Print the dataframe
print(top_products_df)
##   Cluster    Product_1        Product_2      Product_3        Product_4
## 1       1   whole.milk other.vegetables     rolls.buns           yogurt
## 2       2   rolls.buns       whole.milk           soda other.vegetables
## 3       3  canned.beer    shopping.bags     rolls.buns    bottled.water
## 4       4         soda    bottled.water    canned.beer       newspapers
## 5       5 bottled.beer           liquor red.blush.wine    shopping.bags
##      Product_5
## 1         soda
## 2 bottled.beer
## 3   newspapers
## 4 bottled.beer
## 5   whole.milk
Here I print the five most frequently occurring products in each cluster. The clusters share many of the same items: staples such as milk, vegetables, and water, as well as alcohol.
I was curious to see whether any cluster contains products that do not appear in the other clusters. Based on the cluster plot, that should only happen for cluster 1: the other clusters sit inside cluster 1 and in fact seem to stack on top of one another.
I think these top 5 products per cluster give a good idea of the “types” of buyers, but a longer list of top products per cluster might give a more accurate picture.
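One rough way to check for cluster-specific products is to compare the top-5 lists directly. The sketch below hard-codes the lists from the table above and keeps, for each cluster, the items that appear in no other cluster's top 5. It only looks at the top-5 lists, not at every product bought within a cluster, so it is a narrower question than the one the plot answers. The names `top5` and `unique_items` are my own.

```r
# Top-5 products per cluster, copied from the printed table
top5 <- list(
  `1` = c("whole.milk", "other.vegetables", "rolls.buns", "yogurt", "soda"),
  `2` = c("rolls.buns", "whole.milk", "soda", "other.vegetables", "bottled.beer"),
  `3` = c("canned.beer", "shopping.bags", "rolls.buns", "bottled.water", "newspapers"),
  `4` = c("soda", "bottled.water", "canned.beer", "newspapers", "bottled.beer"),
  `5` = c("bottled.beer", "liquor", "red.blush.wine", "shopping.bags", "whole.milk")
)

# For each cluster, keep items absent from every other cluster's top 5
unique_items <- lapply(names(top5), function(k) {
  others <- unlist(top5[names(top5) != k])
  setdiff(top5[[k]], others)
})
names(unique_items) <- names(top5)

print(unique_items)
```

On these lists, only cluster 1 (yogurt) and cluster 5 (liquor and red blush wine) have items that no other cluster's top 5 contains.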
I want to expand this to get the top 10 most occurring products:
# Convert the transactions object to a binary numeric matrix
groceries_matrix <- as(groceries, "matrix") * 1
cluster_labels <- cutree(hc, k = 5)
# Add cluster labels to the matrix
groceries_with_clusters <- data.frame(groceries_matrix, Cluster = cluster_labels)
# Calculate the average presence of items per cluster
cluster_summary <- aggregate(. ~ Cluster, data = groceries_with_clusters, FUN = mean)
# Extract the top 10 items for each cluster
top_products <- apply(cluster_summary[,-1], 1, function(x) {
  sorted_indices <- order(x, decreasing = TRUE) # Sort by frequency
  colnames(cluster_summary)[-1][sorted_indices[1:10]] # Get top 10 product names
})
# Convert to a readable dataframe
top_products_df <- data.frame(Cluster = 1:5, t(top_products))
colnames(top_products_df)[-1] <- paste0("Product_", 1:10)
# Print the dataframe
print(top_products_df)
##   Cluster    Product_1        Product_2      Product_3        Product_4
## 1       1   whole.milk other.vegetables     rolls.buns           yogurt
## 2       2   rolls.buns       whole.milk           soda other.vegetables
## 3       3  canned.beer    shopping.bags     rolls.buns    bottled.water
## 4       4         soda    bottled.water    canned.beer       newspapers
## 5       5 bottled.beer           liquor red.blush.wine    shopping.bags
##      Product_5             Product_6       Product_7       Product_8
## 1         soda       root.vegetables   bottled.water  tropical.fruit
## 2 bottled.beer                yogurt         sausage           sugar
## 3   newspapers               dessert misc..beverages      white.wine
## 4 bottled.beer fruit.vegetable.juice       ice.cream          coffee
## 5   whole.milk    liquor..appetizer.            soda root.vegetables
##       Product_9 Product_10
## 1 shopping.bags    sausage
## 2    newspapers   UHT.milk
## 3        yogurt whole.milk
## 4       napkins      candy
## 5 bottled.water    napkins
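The top-5 and top-10 steps differ only in how many items are kept, so the extraction could be wrapped in a small helper. This is a sketch, not the code that produced the tables above: `top_n_items()` and the toy summary are hypothetical, and the helper assumes a data frame with a `Cluster` column plus one column of mean item presence per product, like `cluster_summary`.

```r
# Hypothetical helper: top n items per cluster from a per-cluster mean table
top_n_items <- function(cluster_summary, n) {
  item_cols <- setdiff(names(cluster_summary), "Cluster")
  item_means <- cluster_summary[, item_cols, drop = FALSE]

  # For each cluster (row), take the names of the n largest means
  rows <- lapply(seq_len(nrow(item_means)), function(i) {
    x <- unlist(item_means[i, , drop = FALSE])
    names(x)[order(x, decreasing = TRUE)[seq_len(n)]]
  })

  # One row per cluster, one column per rank
  out <- data.frame(Cluster = cluster_summary$Cluster,
                    do.call(rbind, rows),
                    stringsAsFactors = FALSE)
  colnames(out)[-1] <- paste0("Product_", seq_len(n))
  out
}

# Toy input with made-up means for two clusters and three products
toy_summary <- data.frame(
  Cluster = 1:2,
  whole.milk = c(0.40, 0.05),
  soda = c(0.10, 0.30),
  canned.beer = c(0.02, 0.25)
)

top2 <- top_n_items(toy_summary, 2)
print(top2)
```

With the real data, `top_n_items(cluster_summary, 5)` and `top_n_items(cluster_summary, 10)` would be expected to give the same rankings as the two tables above.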
It looks like cluster 1 represents general-purpose buyers who get everything from whole milk to vegetables to fruit to sausage, and clusters 2 and 3 are similar.
Cluster 4 is interesting because it features many beverages as well as snacks and sweets like ice cream and candy.
Lastly, cluster 5 has a lot of liquor, as well as red blush wine, which seems to be a special-occasion type of purchase.
This information could help drive changes in the store, such as where to place products for convenience, and could boost sales by placing related products near one another.