624-HW10

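# Load the packages used in this report
library(arules)      # read.transactions(), apriori(), inspect()
library(cluster)     # daisy()
library(factoextra)  # fviz_cluster()

# Load the dataset
groceries <- read.transactions("/Users/aelsaeyed/Downloads/GroceryDataSet.csv", format = "basket", sep = ",")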

# Summary of the data
summary(groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

# Inspect the first few transactions
inspect(groceries[1:5])

##     items                      
## [1] {citrus fruit,             
##      margarine,                
##      ready soups,              
##      semi-finished bread}      
## [2] {coffee,                   
##      tropical fruit,           
##      yogurt}                   
## [3] {whole milk}               
## [4] {cream cheese,             
##      meat spreads,             
##      pip fruit,                
##      yogurt}                   
## [5] {condensed milk,           
##      long life bakery product, 
##      other vegetables,         
##      whole milk}

We have 9,835 rows (transactions) and 169 columns (items available to buy). The summary reports “a density of 0.02609146”, or about 2.6%. Density is the total number of items purchased divided by the total number of cells in the matrix (9,835 × 169). Only 2.6% of the cells record a purchase, which makes this a very sparse matrix. The average transaction contains 4.4 items, which is very low compared with the 169 items available.
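The density and the mean transaction size can be reproduced by hand from the counts printed by summary(). This is a minimal sketch that uses only the numbers in that output; the variable names are my own and are not part of arules.

```r
# Numbers copied from the summary(groceries) output
n_transactions <- 9835
n_items <- 169

# Counts of the five most frequent items plus the "(Other)" bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Every purchased item is one filled cell in the transaction matrix
total_purchases <- sum(item_counts)

# Density: filled cells divided by all cells
density <- total_purchases / (n_transactions * n_items)

# Mean transaction size: filled cells divided by number of transactions
mean_basket_size <- total_purchases / n_transactions

print(density)
print(mean_basket_size)
```

Both values should agree, up to rounding, with the density and the Mean reported by summary().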

# Generate rules
rules <- apriori(groceries, parameter = list(support = 0.01, confidence = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Sort rules by lift
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)

# Inspect top 10 rules by lift
top_rules <- rules_sorted[1:10]
inspect(top_rules)

##      lhs                                  rhs                support   
## [1]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}     => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}         => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                    => {whole milk}       0.01006609
## [6]  {butter, other vegetables}        => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit} => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}         => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables} => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}      => {whole milk}       0.01087951
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5020921  0.02430097 2.594890 120  
## [4]  0.5000000  0.02582613 2.584078 127  
## [5]  0.5823529  0.01728521 2.279125  99  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5525114  0.02226741 2.162336 121  
## [10] 0.5245098  0.02074225 2.052747 107

Taking the first of the top 10 rules as an example, the Support column shows that “citrus fruit” and “root vegetables” appear together with “other vegetables” in 1.04% of all transactions.

The Confidence column shows that 58.62% of the transactions that include “citrus fruit” and “root vegetables” also include “other vegetables”.

Lift shows that “other vegetables” is about 3.03 times as likely to appear in a transaction that contains “citrus fruit” and “root vegetables” as it is in a transaction picked at random.

Finally, Count is the number of transactions that contain every item in the rule, 102 in this case.
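The four measures for the first rule can be checked with simple arithmetic. This sketch uses the rule's count (102), the “other vegetables” frequency from summary() (1903), and a left-hand-side count of 174. That last number is an assumption: I recovered it by multiplying the printed coverage (0.01769192) by 9,835 and rounding.

```r
n_transactions <- 9835

count_rule <- 102   # transactions with citrus fruit, root vegetables and other vegetables
count_lhs  <- 174   # assumed: round(0.01769192 * 9835), transactions with the lhs items
count_rhs  <- 1903  # transactions with other vegetables (from summary())

# Support: share of all transactions that contain every item in the rule
support <- count_rule / n_transactions

# Coverage: share of all transactions that contain the lhs
coverage <- count_lhs / n_transactions

# Confidence: share of lhs transactions that also contain the rhs
confidence <- support / coverage

# Lift: confidence relative to how often the rhs appears overall
lift <- confidence / (count_rhs / n_transactions)

print(c(support = support, confidence = confidence, coverage = coverage, lift = lift))
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its baseline frequency; a lift of exactly 1 would mean the two are independent.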

# Convert transactions to a 0/1 numeric matrix
groceries_numeric <- as(groceries, "matrix") * 1

# First attempt, which produced a warning about the variable types:
# dissimilarity <- daisy(groceries_numeric, metric = "gower")

# Declare every column as an asymmetric binary variable
dissimilarity <- daisy(groceries_numeric, metric = "gower", type = list(asymm = seq_len(ncol(groceries_numeric))))

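At first I was getting a warning that R was treating my 0/1 variables as continuous (interval-scaled) ones, which is the default behavior for numeric columns when calculating dissimilarity. Since my data is binary (1 if the item is present, 0 if not), I changed this by declaring every column as asymmetric binary through the type argument, so that two transactions are not counted as similar just because they both lack an item.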
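As far as I understand daisy(), when every column is asymmetric binary the Gower dissimilarity ignores items that both transactions lack, which makes it the Jaccard distance. The sketch below computes that distance by hand for two made-up transactions; the function name and the toy items are hypothetical and are not part of the cluster package.

```r
# Jaccard-style dissimilarity for two 0/1 vectors:
# mismatches divided by positions where at least one vector has a 1
jaccard_dissimilarity <- function(a, b) {
  either <- sum(a == 1 | b == 1)   # items bought in at least one transaction
  if (either == 0) {
    return(NA_real_)               # two empty transactions: undefined
  }
  mismatch <- sum(a != b)          # items bought in exactly one of the two
  mismatch / either
}

# Toy transactions over four hypothetical items
items <- c("whole milk", "yogurt", "soda", "bottled beer")
t1 <- setNames(c(1, 1, 0, 0), items)
t2 <- setNames(c(1, 0, 1, 0), items)

d <- jaccard_dissimilarity(t1, t2)
print(d)
```

Here “bottled beer” is absent from both transactions, so it does not count toward the denominator. With 169 items and about 4 items per transaction, treating shared absences as agreement would make almost every pair of transactions look alike.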

# Perform hierarchical clustering
hc <- hclust(dissimilarity, method = "ward.D2")

# Plot the dendrogram
plot(hc, labels = FALSE, main = "Dendrogram of Groceries Transactions")

# Cut the tree into 5 clusters
clusters <- cutree(hc, k = 5)

# Visualize the clusters
fviz_cluster(list(data = groceries_numeric, cluster = clusters), geom = "point", stand = FALSE)

# Convert the transactions object to a binary numeric matrix
groceries_matrix <- as(groceries, "matrix") * 1

cluster_labels <- cutree(hc, k = 5)
# Add cluster labels to the matrix
groceries_with_clusters <- data.frame(groceries_matrix, Cluster = cluster_labels)


# Calculate the average presence of items per cluster
cluster_summary <- aggregate(. ~ Cluster, data = groceries_with_clusters, FUN = mean)

# Extract the top 5 items for each cluster
top_products <- apply(cluster_summary[,-1], 1, function(x) {
  sorted_indices <- order(x, decreasing = TRUE) # Sort by frequency
  colnames(cluster_summary)[-1][sorted_indices[1:5]] # Get top 5 product names
})

# Convert to a readable dataframe
top_products_df <- data.frame(Cluster = 1:5, t(top_products))
colnames(top_products_df)[-1] <- paste0("Product_", 1:5)

# Print the dataframe
print(top_products_df)

##   Cluster    Product_1        Product_2      Product_3        Product_4
## 1       1   whole.milk other.vegetables     rolls.buns           yogurt
## 2       2   rolls.buns       whole.milk           soda other.vegetables
## 3       3  canned.beer    shopping.bags     rolls.buns    bottled.water
## 4       4         soda    bottled.water    canned.beer       newspapers
## 5       5 bottled.beer           liquor red.blush.wine    shopping.bags
##      Product_5
## 1         soda
## 2 bottled.beer
## 3   newspapers
## 4 bottled.beer
## 5   whole.milk

Here I print out the 5 most frequently occurring products in each cluster. The clusters share many of the same products: everyday groceries like milk, vegetables, water, and alcohol.

I was curious to see if there are any products in a cluster that don’t appear in the other clusters. Based on the cluster plot, that should only happen for cluster 1. The other clusters sit inside cluster 1, and in fact they seem to stack on top of each other.
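One way to check this directly is to look for items whose average presence is above zero in exactly one cluster. The sketch below uses a small made-up table in place of cluster_summary; the numbers and item names are hypothetical, and the same logic could be applied to as.matrix(cluster_summary[, -1]).

```r
# Toy cluster-by-item table of mean item presence (made-up numbers)
toy_summary <- rbind(
  c(0.30, 0.10, 0.05, 0.00),
  c(0.20, 0.00, 0.00, 0.00),
  c(0.10, 0.00, 0.00, 0.15)
)
rownames(toy_summary) <- c("1", "2", "3")                      # cluster labels
colnames(toy_summary) <- c("milk", "sausage", "candy", "liquor")

# TRUE where an item was bought at least once in a cluster
present <- toy_summary > 0

# Items that appear in exactly one cluster
exclusive <- colSums(present) == 1

# For each exclusive item, the label of the cluster that contains it
owner <- apply(present[, exclusive, drop = FALSE], 2, function(col) {
  rownames(present)[which(col)]
})

print(owner)
```

In this toy table, milk appears in every cluster and is dropped, while the remaining items are each mapped to the single cluster that contains them.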

I think these top 5 products per cluster give a good idea of the “types” of buyers, but seeing more of the top products per cluster might give a more accurate picture.

I want to expand this to get the top 10 most occurring products:

# Convert the transactions object to a binary numeric matrix
groceries_matrix <- as(groceries, "matrix") * 1

cluster_labels <- cutree(hc, k = 5)
# Add cluster labels to the matrix
groceries_with_clusters <- data.frame(groceries_matrix, Cluster = cluster_labels)


# Calculate the average presence of items per cluster
cluster_summary <- aggregate(. ~ Cluster, data = groceries_with_clusters, FUN = mean)

# Extract the top 10 items for each cluster
top_products <- apply(cluster_summary[,-1], 1, function(x) {
  sorted_indices <- order(x, decreasing = TRUE) # Sort by frequency
  colnames(cluster_summary)[-1][sorted_indices[1:10]] # Get top 10 product names
})

# Convert to a readable dataframe
top_products_df <- data.frame(Cluster = 1:5, t(top_products))
colnames(top_products_df)[-1] <- paste0("Product_", 1:10)

# Print the dataframe
print(top_products_df)

##   Cluster    Product_1        Product_2      Product_3        Product_4
## 1       1   whole.milk other.vegetables     rolls.buns           yogurt
## 2       2   rolls.buns       whole.milk           soda other.vegetables
## 3       3  canned.beer    shopping.bags     rolls.buns    bottled.water
## 4       4         soda    bottled.water    canned.beer       newspapers
## 5       5 bottled.beer           liquor red.blush.wine    shopping.bags
##      Product_5             Product_6       Product_7       Product_8
## 1         soda       root.vegetables   bottled.water  tropical.fruit
## 2 bottled.beer                yogurt         sausage           sugar
## 3   newspapers               dessert misc..beverages      white.wine
## 4 bottled.beer fruit.vegetable.juice       ice.cream          coffee
## 5   whole.milk    liquor..appetizer.            soda root.vegetables
##       Product_9 Product_10
## 1 shopping.bags    sausage
## 2    newspapers   UHT.milk
## 3        yogurt whole.milk
## 4       napkins      candy
## 5 bottled.water    napkins

It looks like cluster 1 represents general-purpose buyers who get everything from whole milk to vegetables to fruit to sausage. Clusters 2 and 3 are similar.

Cluster 4 is interesting because it features many beverages as well as snacks and sweets like ice cream and candy.

Lastly, cluster 5 has a lot of liquor, as well as red blush wine, which seems to be a special-occasion type of purchase.

This information could help drive changes in the store, such as placing related products near each other for convenience, or placing them suggestively to drive sales.

Ahmed Elsaeyed

2024-11-21