That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row holds one item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.
library(gridExtra)
library(knitr)
library(kableExtra)
library(readxl)
library(ggplot2)
library(dplyr)
library(arulesViz)
library(tidyverse)
library(igraph)
data <- read.transactions('GroceryDataSet.csv', sep = ",")
summary(data)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The four most frequent items are whole milk, other vegetables, rolls/buns, and soda.
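As a quick consistency check on the summary above, the density can be converted back into a total item count. The sketch below uses only numbers printed by `summary(data)`; the variable names are illustrative and not part of the original analysis.

```r
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Density is the share of filled cells in the transaction-by-item matrix,
# so rows * columns * density recovers the number of purchased items
total_purchases <- round(n_transactions * n_items * density)

# The same total from the "most frequent items" block, including (Other)
freq_total <- sum(c(2513, 1903, 1809, 1715, 1372, 34055))

# Average basket size and the share of receipts containing whole milk
mean_basket <- total_purchases / n_transactions
whole_milk_share <- 2513 / n_transactions
```

Both totals agree, and the average basket size matches the mean of 4.409 reported in the length distribution.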
itemFrequencyPlot(data, topN = 20, type = "absolute", main = "Top 20 Items")
support <- 0.001
confidence <- 0.4
We.Rules <- apriori(data, parameter = list(support = support, confidence = confidence), control = list(verbose = FALSE))
summary(We.Rules)
## set of 8955 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 81 2771 4804 1245 54
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.824 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.4000 Min. :0.001017 Min. : 1.565
## 1st Qu.:0.001118 1st Qu.:0.4583 1st Qu.:0.001932 1st Qu.: 2.316
## Median :0.001322 Median :0.5319 Median :0.002542 Median : 2.870
## Mean :0.001811 Mean :0.5579 Mean :0.003478 Mean : 3.191
## 3rd Qu.:0.001830 3rd Qu.:0.6296 3rd Qu.:0.003559 3rd Qu.: 3.733
## Max. :0.056024 Max. :1.0000 Max. :0.139502 Max. :21.494
## count
## Min. : 10.00
## 1st Qu.: 11.00
## Median : 13.00
## Mean : 17.81
## 3rd Qu.: 18.00
## Max. :551.00
##
## mining info:
## data ntransactions support confidence
## data 9835 0.001 0.4
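The support threshold is easier to judge as an absolute number of receipts. A minimal sketch, assuming a rule qualifies when its count divided by the number of transactions reaches the threshold (variable names are illustrative):

```r
n_transactions <- 9835
min_support <- 0.001

# Smallest whole number of receipts whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)

# Support of a rule that appears in exactly min_count receipts
min_observed_support <- min_count / n_transactions
```

This is consistent with the summary, where the minimum count is 10 and the minimum support is 0.001017. A rule seen in only 10 of 9835 receipts rests on very little evidence, which is worth keeping in mind when reading rules with high lift.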
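The table below shows 10 rules with their support, confidence, and lift, sorted in descending order of lift. Because `top_n(10)` is called without a ranking column, it selects on the last column (count), as the message in the output notes. These are therefore the 10 rules with the highest count rather than the 10 with the highest lift; `top_n(10, lift)` would select by lift instead.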
We.Rules %>% DATAFRAME() %>% arrange(desc(lift)) %>% top_n(10) %>% kable()
## Selecting by count
LHS | RHS | support | confidence | coverage | lift | count |
---|---|---|---|---|---|---|
{root vegetables} | {other vegetables} | 0.0473818 | 0.4347015 | 0.1089985 | 2.246605 | 466 |
{whipped/sour cream} | {other vegetables} | 0.0288765 | 0.4028369 | 0.0716828 | 2.081924 | 284 |
{butter} | {whole milk} | 0.0275547 | 0.4972477 | 0.0554143 | 1.946053 | 271 |
{curd} | {whole milk} | 0.0261312 | 0.4904580 | 0.0532791 | 1.919480 | 257 |
{domestic eggs} | {whole milk} | 0.0299949 | 0.4727564 | 0.0634469 | 1.850203 | 295 |
{whipped/sour cream} | {whole milk} | 0.0322318 | 0.4496454 | 0.0716828 | 1.759754 | 317 |
{root vegetables} | {whole milk} | 0.0489070 | 0.4486940 | 0.1089985 | 1.756031 | 481 |
{margarine} | {whole milk} | 0.0241993 | 0.4131944 | 0.0585663 | 1.617098 | 238 |
{tropical fruit} | {whole milk} | 0.0422979 | 0.4031008 | 0.1049314 | 1.577595 | 416 |
{yogurt} | {whole milk} | 0.0560244 | 0.4016035 | 0.1395018 | 1.571735 | 551 |
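Among the ten rules shown, the greatest lift (2.246605) belongs to {root vegetables} => {other vegetables}, that is, the purchase of other vegetables given a purchase of root vegetables. The support and confidence of this rule are 0.0473818 and 0.4347015, respectively. Because the table was selected by count, it does not contain the highest-lift rules overall: the rule summary above reports a maximum lift of 21.494.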
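The three measures for {root vegetables} => {other vegetables} can be recomputed by hand from counts already shown in this report. In the sketch below, the count for root vegetables is derived from the rule's coverage, which is an assumption about how coverage is defined (the support of the left-hand side); the variable names are illustrative.

```r
n <- 9835             # total transactions
n_both <- 466         # receipts with root vegetables and other vegetables
n_rhs <- 1903         # receipts with other vegetables (from the summary)

# Coverage is the support of the left-hand side, so it gives the LHS count
n_lhs <- round(0.1089985 * n)

support <- n_both / n                   # P(LHS and RHS)
confidence <- n_both / n_lhs            # P(RHS | LHS)
lift <- confidence / (n_rhs / n)        # confidence relative to P(RHS)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

A lift above 1 means the two item sets appear together more often than they would if they were purchased independently.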
Let's look at the item associations through a visualization of the 10 rules with the highest lift:
srules <- head(We.Rules, n = 10, by = 'lift')
plot(srules, method = 'graph')
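When ranking rules in a data frame, naming the ranking column explicitly avoids depending on column order: `top_n()` without a `wt` argument selects by the last column, which is what the `Selecting by count` message earlier in this report indicates. The sketch below illustrates the difference on a small made-up data frame; the rules and values are hypothetical.

```r
library(dplyr)

# Hypothetical rules: high-count rules have low lift and vice versa
rules_df <- data.frame(
  rule  = c("A => B", "C => D", "E => F", "G => H"),
  lift  = c(1.2, 5.0, 3.1, 2.4),
  count = c(500, 10, 40, 300),
  stringsAsFactors = FALSE
)

# No ranking column given: top_n() falls back to the last column (count)
by_count <- rules_df %>% top_n(2)

# Ranking column named explicitly: the two rules with the highest lift
by_lift <- rules_df %>% slice_max(lift, n = 2)
```

Here `by_count` and `by_lift` contain entirely different rules, which mirrors how a count-based selection can hide the highest-lift rules.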
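For the cluster analysis, we treat shoppers and items as vertices of a network, with an edge wherever a shopper bought an item, and use the igraph package to detect communities with the Louvain method.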
# Reshape to one row per (shopper, item) pair
groceries <- read.csv("GroceryDataSet.csv", header = FALSE) %>%
  mutate(shopper_id = row_number()) %>%
  pivot_longer(-shopper_id) %>%
  filter(value != "") %>%
  select(-name)
# Build an undirected shopper-item graph and extract Louvain communities
communities <- groceries %>%
  rename(to = value, from = shopper_id) %>%
  graph_from_data_frame(directed = FALSE) %>%
  cluster_louvain() %>%
  communities()
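To make the graph construction concrete, here is a minimal sketch of the same idea on a toy set of receipts. The shoppers and items are hypothetical, and the two groups are deliberately disconnected so that they cannot end up in the same community.

```r
library(igraph)

# Hypothetical (shopper, item) pairs: s1-s3 buy milk and bread,
# s4-s6 buy beer and chips
edges <- data.frame(
  from = rep(c("s1", "s2", "s3", "s4", "s5", "s6"), each = 2),
  to   = c(rep(c("milk", "bread"), 3), rep(c("beer", "chips"), 3)),
  stringsAsFactors = FALSE
)

set.seed(42)  # Louvain may visit vertices in random order
g  <- graph_from_data_frame(edges, directed = FALSE)
cl <- cluster_louvain(g)

# Named vector: vertex name -> community id
memb <- as.vector(membership(cl))
names(memb) <- V(g)$name

# Shoppers and items share one vertex set, so keep only the items
# and list them per community
items <- unique(edges$to)
is_item <- names(memb) %in% items
split(names(memb)[is_item], memb[is_item])
```

Because shoppers and items live in the same graph, each community mixes both kinds of vertex, which is why the full analysis separates them in a later step.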
Each community contains both shoppers and items, so next we list the items in each cluster and count the shoppers assigned to it:
products <- as.character(unique(groceries$value))
# Start from an empty summary table
df <- data.frame(name = character(), members = integer())
for (i in seq_along(communities)) {
  cluster_name <- paste0(i, ": ")
  cluster_members <- 0
  for (member in communities[[i]]) {
    if (member %in% products) {
      # Vertex is an item: append it to the cluster label
      cluster_name <- paste0(cluster_name, member, " + ")
    } else {
      # Vertex is a shopper: count it
      cluster_members <- cluster_members + 1
    }
  }
  # Drop the trailing " + "
  cluster_name <- substr(cluster_name, 1, nchar(cluster_name) - 3)
  df <- rbind(df, data.frame(name = cluster_name, members = cluster_members))
}
df %>% arrange(desc(members)) %>% kable()
name | members |
---|---|
8: chocolate + soda + specialty bar + pastry + salty snack + waffles + candy + dessert + chocolate marshmallow + specialty chocolate + popcorn + cake bar + snack products + finished products + make up remover + potato products + hair spray + light bulbs + baby food + tidbits | 1292 |
10: other vegetables + rice + abrasive cleaner + flour + beef + chicken + root vegetables + bathroom cleaner + spices + pork + turkey + oil + curd cheese + onions + herbs + dog food + frozen fish + salad dressing + vinegar + roll products + frozen fruits | 1087 |
12: ready soups + rolls/buns + frankfurter + sausage + spread cheese + hard cheese + canned fish + seasonal products + frozen potato products + sliced cheese + soft cheese + meat + mustard + mayonnaise + nut snack + ketchup + cream | 1053 |
13: whole milk + butter + cereals + curd + detergent + hamburger meat + flower (seeds) + canned vegetables + pasta + softener + Instant food products + honey + cocoa drinks + cleaner + soups + soap + pudding powder | 857 |
5: liquor (appetizer) + canned beer + shopping bags + misc. beverages + chewing gum + brandy + liqueur + whisky | 730 |
7: yogurt + cream cheese + meat spreads + packaged fruit/vegetables + butter milk + berries + whipped/sour cream + baking powder + specialty cheese + instant coffee + organic sausage + cooking chocolate + kitchen utensil | 674 |
4: tropical fruit + pip fruit + white bread + processed cheese + sweet spreads + beverages + ham + cookware + tea + syrup + baby cosmetics + specialty vegetables + sound storage medium | 624 |
15: citrus fruit + hygiene articles + domestic eggs + cat food + cling film/bags + canned fruit + dental care + flower soil/fertilizer + female sanitary products + dish cleaner + house keeping products + rubbing alcohol + preservation products | 569 |
16: bottled beer + red/blush wine + prosecco + liquor + rum | 432 |
11: UHT-milk + bottled water + white wine + male cosmetics | 349 |
2: long life bakery product + pot plants + fruit/vegetable juice + pickled vegetables + jam + bags | 341 |
3: semi-finished bread + newspapers + pet care + nuts/prunes + toilet cleaner | 298 |
6: dishes + napkins + grapes + zwieback + decalcifier | 293 |
1: coffee + condensed milk + sparkling wine + fish + kitchen towels | 287 |
18: sugar + frozen vegetables + salt + skin care + liver loaf + frozen chicken | 273 |
14: frozen dessert + ice cream + frozen meals | 262 |
9: margarine + artif. sweetener + specialty fat + candles + organic products | 207 |
17: brown bread + sauces | 128 |
19: photo/film | 79 |
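The cluster summary can also be built without an explicit nested loop. The following is a sketch of one base R alternative, shown on a small hypothetical `communities` list and `products` vector rather than the real data:

```r
# Hypothetical inputs in the same shape as the objects used above
communities <- list(`1` = c("1", "2", "milk", "bread"), `2` = c("3", "beer"))
products <- c("milk", "bread", "beer")

summary_df <- data.frame(
  # Label: cluster index followed by its items joined with " + "
  name = vapply(
    seq_along(communities),
    function(i) paste0(i, ": ",
                       paste(intersect(communities[[i]], products),
                             collapse = " + ")),
    character(1)
  ),
  # Shoppers are the vertices that are not items
  members = unname(vapply(communities,
                          function(m) sum(!(m %in% products)),
                          integer(1))),
  stringsAsFactors = FALSE
)

# Largest clusters first
summary_df <- summary_df[order(-summary_df$members), ]
```

Joining with `collapse` avoids having to trim a trailing separator, and it also handles a cluster that contains no items.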