Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name “Market Basket Analysis”.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.

library(gridExtra)
library(knitr)
library(kableExtra)
library(readxl)
library(ggplot2)
library(dplyr)
library(arulesViz)
library(tidyverse)
library(igraph)
data <- read.transactions('GroceryDataSet.csv', sep = ",")

summary(data)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Clearly, “whole milk”, “other vegetables”, “rolls/buns”, and “soda” are the four most frequent items, followed by “yogurt”.
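
The density and mean basket size reported by summary() can be cross-checked with a little arithmetic. The sketch below uses only the counts printed above; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary(data) output above
n_transactions <- 9835
n_items <- 169
top_counts <- c(`whole milk` = 2513, `other vegetables` = 1903,
                `rolls/buns` = 1809, soda = 1715, yogurt = 1372)
other_count <- 34055

# Total number of item occurrences across all baskets
total_items <- sum(top_counts) + other_count

# Density: share of filled cells in the transactions-by-items matrix
density <- total_items / (n_transactions * n_items)

# Mean basket size and relative frequency (support) of the top items
mean_basket <- total_items / n_transactions
item_support <- top_counts / n_transactions

print(round(density, 8))
print(round(mean_basket, 3))
print(round(item_support, 4))
```

The density and mean basket size computed this way should agree with the 0.02609146 and 4.409 shown in the summary, which confirms that density is simply the average basket size divided by the number of distinct items.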

Top 20 most frequent items:

itemFrequencyPlot(data, topN = 20, type = "absolute", main = "Top 20 Items")

Association rule mining

support <- 0.001
confidence <- 0.4
We.Rules <- apriori(data, parameter = list(support = support, confidence = confidence), control = list(verbose = FALSE))

summary(We.Rules)
## set of 8955 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##   81 2771 4804 1245   54 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.824   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.4000   Min.   :0.001017   Min.   : 1.565  
##  1st Qu.:0.001118   1st Qu.:0.4583   1st Qu.:0.001932   1st Qu.: 2.316  
##  Median :0.001322   Median :0.5319   Median :0.002542   Median : 2.870  
##  Mean   :0.001811   Mean   :0.5579   Mean   :0.003478   Mean   : 3.191  
##  3rd Qu.:0.001830   3rd Qu.:0.6296   3rd Qu.:0.003559   3rd Qu.: 3.733  
##  Max.   :0.056024   Max.   :1.0000   Max.   :0.139502   Max.   :21.494  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 17.81  
##  3rd Qu.: 18.00  
##  Max.   :551.00  
## 
## mining info:
##  data ntransactions support confidence
##  data          9835   0.001        0.4
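
The thresholds passed to apriori() translate directly into basket counts. The short sketch below (base R, with illustrative variable names that are not part of the original analysis) shows why the minimum count in the summary is 10.

```r
# Values taken from the mining info above
n_transactions <- 9835
min_support <- 0.001

# A rule must occur in at least this many baskets to reach the support threshold
min_count <- ceiling(min_support * n_transactions)

# Smallest and largest support implied by the count column of the summary
observed_min_support <- min_count / n_transactions
observed_max_support <- 551 / n_transactions

print(min_count)
print(round(c(observed_min_support, observed_max_support), 6))
```

With a support threshold this low, a rule backed by only 10 receipts qualifies, which is one reason the rule set is so large (8955 rules).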

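The table below shows 10 rules with their support, confidence, and lift, sorted in descending order of lift. As the “Selecting by count” message indicates, top_n(10) was called without a ranking column, so it kept the 10 rules with the highest count rather than the 10 with the highest lift: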

We.Rules %>% DATAFRAME() %>% arrange(desc(lift)) %>% top_n(10) %>% kable()
## Selecting by count
| LHS | RHS | support | confidence | coverage | lift | count |
|:---|:---|---:|---:|---:|---:|---:|
| {root vegetables} | {other vegetables} | 0.0473818 | 0.4347015 | 0.1089985 | 2.246605 | 466 |
| {whipped/sour cream} | {other vegetables} | 0.0288765 | 0.4028369 | 0.0716828 | 2.081924 | 284 |
| {butter} | {whole milk} | 0.0275547 | 0.4972477 | 0.0554143 | 1.946053 | 271 |
| {curd} | {whole milk} | 0.0261312 | 0.4904580 | 0.0532791 | 1.919480 | 257 |
| {domestic eggs} | {whole milk} | 0.0299949 | 0.4727564 | 0.0634469 | 1.850203 | 295 |
| {whipped/sour cream} | {whole milk} | 0.0322318 | 0.4496454 | 0.0716828 | 1.759754 | 317 |
| {root vegetables} | {whole milk} | 0.0489070 | 0.4486940 | 0.1089985 | 1.756031 | 481 |
| {margarine} | {whole milk} | 0.0241993 | 0.4131944 | 0.0585663 | 1.617098 | 238 |
| {tropical fruit} | {whole milk} | 0.0422979 | 0.4031008 | 0.1049314 | 1.577595 | 416 |
| {yogurt} | {whole milk} | 0.0560244 | 0.4016035 | 0.1395018 | 1.571735 | 551 |
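
The “Selecting by count” message printed with this table is worth a closer look: when top_n() is called without a wt argument, it ranks by the last column of the data frame. The sketch below reproduces the effect on a small, made-up rules table (the rule names and values are hypothetical) and shows one way to rank by lift explicitly. On the real rules object, head(We.Rules, n = 10, by = 'lift'), as used for the plot further down, selects by lift directly.

```r
library(dplyr)

# Hypothetical rules table (made-up values); count is the last column,
# as it is in the output of DATAFRAME()
rules_df <- data.frame(
  rule  = c("A => B", "C => D", "E => F", "G => H"),
  lift  = c(21.5, 2.2, 11.0, 1.6),
  count = c(10, 466, 12, 551),
  stringsAsFactors = FALSE
)

# top_n() without wt ranks by the last column (count), not by lift
by_count <- rules_df %>% arrange(desc(lift)) %>% top_n(2)

# Naming the ranking column keeps the rules with the highest lift
by_lift <- rules_df %>% slice_max(lift, n = 2)

print(by_count$rule)
print(by_lift$rule)
```

In the toy table, the two high-lift rules are rare and the two frequent rules have modest lift, so the two selections do not overlap at all. The same pattern is plausible for the grocery rules, where the summary shows lifts up to 21.494 alongside counts as low as 10.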

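Among the rules listed, the greatest lift (2.246605) belongs to {root vegetables} => {other vegetables}: shoppers who buy root vegetables are about 2.2 times as likely to buy other vegetables as shoppers in general. The rule has support 0.0473818 and confidence 0.4347015. These are not the highest-lift rules overall, since the summary above reports a maximum lift of 21.494.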
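
The three measures for {root vegetables} => {other vegetables} can be recomputed by hand from numbers already printed in this report. This is a minimal base R sketch with illustrative variable names; it assumes the usual definitions, namely support = count / n, confidence = support / coverage, and lift = confidence / support of the right-hand side.

```r
# Values copied from the outputs above
n_transactions <- 9835
rule_count <- 466       # baskets containing both items (count column)
coverage <- 0.1089985   # support of the left-hand side, {root vegetables}
rhs_count <- 1903       # baskets containing "other vegetables", from summary(data)

# Share of all baskets that contain both items
support <- rule_count / n_transactions

# Share of root-vegetable baskets that also contain other vegetables
confidence <- support / coverage

# How much more likely the right-hand side is, given the left-hand side
lift <- confidence / (rhs_count / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift above 1 means the two item sets occur together more often than they would if they were bought independently.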

Let’s visualize the item associations for the 10 rules with the highest lift:

srules <- head(We.Rules, n = 10, by = 'lift')
plot(srules, method = 'graph')

Cluster analysis

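For the extra credit, we treat the data as a network: each shopper is linked to the items they bought, and the Louvain community detection algorithm from the igraph package groups shoppers and items into clusters.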

# Reshape to long format: one row per (shopper, item) pair
groceries <- read.csv("GroceryDataSet.csv", header = FALSE) %>%
  mutate(shopper_id = row_number()) %>%
  pivot_longer(-shopper_id) %>%
  filter(value != "") %>%
  select(-name)
# Build an undirected shopper-item graph and detect communities with Louvain
communities <- groceries %>%
  rename(to = value, from = shopper_id) %>%
  graph_from_data_frame(directed = FALSE) %>%
  cluster_louvain() %>%
  communities()
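
To make the idea concrete, here is a minimal sketch of the same technique on a toy edge list. The shoppers, items, and variable names are hypothetical and chosen only for illustration. Louvain may process nodes in random order depending on the igraph version, so the sketch sets a seed, which the analysis above does not do; cluster numbers can therefore differ between runs.

```r
library(igraph)

# Hypothetical toy baskets: shoppers 1-2 buy dairy, shoppers 3-4 buy drinks
edges <- data.frame(
  from = c(1, 1, 2, 2, 3, 3, 4, 4),
  to   = c("milk", "butter", "milk", "butter", "beer", "wine", "beer", "wine"),
  stringsAsFactors = FALSE
)

set.seed(42)  # fix the seed so repeated runs give the same labelling

# Undirected graph whose vertices are both shoppers and items
g <- graph_from_data_frame(edges, directed = FALSE)
toy_clusters <- cluster_louvain(g)

# Community id for every vertex, keyed by vertex name
memb <- as.integer(membership(toy_clusters))
names(memb) <- V(g)$name

# Vertices grouped by community
print(split(names(memb), memb))
```

Because the dairy and drinks shoppers share no items, the two halves of the toy graph are disconnected and cannot end up in the same community. In the real data the groups overlap heavily, so the communities are a softer summary of which items tend to share shoppers.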

Next, we label each cluster by the items it contains and count the shoppers assigned to it:

# Item names; every other vertex in a community is a shopper ID
products <- as.character(unique(groceries$value))
# One row per community: its items form the label, its shoppers are counted
df <- data.frame(name = character(), members = integer())
for (i in seq_along(communities)) {
  is_product <- communities[[i]] %in% products
  cluster_name <- paste0(i, ": ", paste(communities[[i]][is_product], collapse = " + "))
  df <- rbind(df, data.frame(name = cluster_name, members = sum(!is_product)))
}
df %>%
  arrange(desc(members)) %>%
  kable()
| name | members |
|:---|---:|
| 8: chocolate + soda + specialty bar + pastry + salty snack + waffles + candy + dessert + chocolate marshmallow + specialty chocolate + popcorn + cake bar + snack products + finished products + make up remover + potato products + hair spray + light bulbs + baby food + tidbits | 1292 |
| 10: other vegetables + rice + abrasive cleaner + flour + beef + chicken + root vegetables + bathroom cleaner + spices + pork + turkey + oil + curd cheese + onions + herbs + dog food + frozen fish + salad dressing + vinegar + roll products + frozen fruits | 1087 |
| 12: ready soups + rolls/buns + frankfurter + sausage + spread cheese + hard cheese + canned fish + seasonal products + frozen potato products + sliced cheese + soft cheese + meat + mustard + mayonnaise + nut snack + ketchup + cream | 1053 |
| 13: whole milk + butter + cereals + curd + detergent + hamburger meat + flower (seeds) + canned vegetables + pasta + softener + Instant food products + honey + cocoa drinks + cleaner + soups + soap + pudding powder | 857 |
| 5: liquor (appetizer) + canned beer + shopping bags + misc. beverages + chewing gum + brandy + liqueur + whisky | 730 |
| 7: yogurt + cream cheese + meat spreads + packaged fruit/vegetables + butter milk + berries + whipped/sour cream + baking powder + specialty cheese + instant coffee + organic sausage + cooking chocolate + kitchen utensil | 674 |
| 4: tropical fruit + pip fruit + white bread + processed cheese + sweet spreads + beverages + ham + cookware + tea + syrup + baby cosmetics + specialty vegetables + sound storage medium | 624 |
| 15: citrus fruit + hygiene articles + domestic eggs + cat food + cling film/bags + canned fruit + dental care + flower soil/fertilizer + female sanitary products + dish cleaner + house keeping products + rubbing alcohol + preservation products | 569 |
| 16: bottled beer + red/blush wine + prosecco + liquor + rum | 432 |
| 11: UHT-milk + bottled water + white wine + male cosmetics | 349 |
| 2: long life bakery product + pot plants + fruit/vegetable juice + pickled vegetables + jam + bags | 341 |
| 3: semi-finished bread + newspapers + pet care + nuts/prunes + toilet cleaner | 298 |
| 6: dishes + napkins + grapes + zwieback + decalcifier | 293 |
| 1: coffee + condensed milk + sparkling wine + fish + kitchen towels | 287 |
| 18: sugar + frozen vegetables + salt + skin care + liver loaf + frozen chicken | 273 |
| 14: frozen dessert + ice cream + frozen meals | 262 |
| 9: margarine + artif. sweetener + specialty fat + candles + organic products | 207 |
| 17: brown bread + sauces | 128 |
| 19: photo/film | 79 |