Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 5 before midnight.
The required packages are loaded and the grocery data set is read using the read.transactions() function.
library(arulesViz)
library(dplyr)
library(RColorBrewer)
groc <- read.transactions('GroceryDataSet.csv', sep=',')
summary(groc)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The item frequency plot can be created using the itemFrequencyPlot() function in the arules package:
itemFrequencyPlot(groc, topN=20, col=brewer.pal(8,'Pastel1'))
In order to mine the association rules using the apriori() function in the arules package, the support and confidence parameters need to be determined. Setting the values too high will filter out many rules. Setting the values too low will introduce many trivial rules.
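For reference, the three measures reported in this analysis can be computed by hand. The following is a minimal sketch on a made-up set of five baskets; the basket contents and the helper name `support` are illustrative assumptions and are not part of the Groceries data. It uses the standard definitions: support is the fraction of transactions containing an itemset, the confidence of {A} => {B} is support(A and B) / support(A), and lift is that confidence divided by support(B).

```r
# Hypothetical toy baskets (not taken from the Groceries data set)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Measures for the rule {bread} => {butter}
supp_rule <- support(c("bread", "butter"))  # 2 of 5 baskets = 0.4
conf_rule <- supp_rule / support("bread")   # 0.4 / 0.6
lift_rule <- conf_rule / support("butter")  # confidence relative to the baseline rate of butter

round(c(support = supp_rule, confidence = conf_rule, lift = lift_rule), 3)
```

A lift above 1 indicates that the two sides of the rule occur together more often than would be expected if they were independent.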
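Below, I have written a wrapper function that takes the transaction data and the desired minimum support and confidence as input, and outputs the top N rules by lift (default 10) mined using the Apriori algorithm. With topOnly = FALSE it also returns the total number of rules found and the full rule set.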
arFit <- function(data, support, confidence, topN=10, topOnly=TRUE){
  # Mine rules with the given minimum support and confidence
  rules <- apriori(data, parameter = list(support=support, confidence=confidence), control=list(verbose = FALSE))
  rulesLen <- length(rules)
  # Keep the topN rules ranked by lift and flatten them into a data frame
  topRules <- head(rules, n=topN, by='lift')
  topRules <- data.frame(lhs=labels(lhs(topRules)), rhs=labels(rhs(topRules)), topRules@quality)
  # Return only the top rules, or also the rule count and the full rule set
  if (topOnly) return(topRules)
  list(rulesLen, topRules, rules)
}
Holding the confidence constant at 0.1, we can see the effect of support on the number of association rules found. As the plot below shows, the number of rules grows roughly exponentially as the minimum support gets smaller.
values <- seq(0.001, 0.1, by=0.001)
numRules <- c()
for (val in values){
fit <- arFit(groc, support=val, confidence=0.1, topOnly = FALSE)
numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Support', ylab='Association Rules Found', type='l')
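The same effect can be reproduced at a very small scale without the arules package. The sketch below is only an illustration under stated assumptions: the six baskets are invented, and it counts frequent item pairs rather than full association rules. It shows how lowering the minimum support admits more candidates.

```r
# Hypothetical toy baskets (not taken from the Groceries data set)
baskets <- list(c("a", "b"), c("a", "b", "c"), c("a", "c"),
                "a", c("b", "d"), c("a", "b"))

# All unordered item pairs and the fraction of baskets containing each pair
items <- sort(unique(unlist(baskets)))
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- sapply(pairs, function(p) {
  mean(sapply(baskets, function(b) all(p %in% b)))
})
names(pair_support) <- sapply(pairs, paste, collapse = ",")

# Number of pairs that survive each minimum support threshold
thresholds <- c(0.5, 0.3, 0.1)
n_frequent <- sapply(thresholds, function(t) sum(pair_support >= t))
data.frame(min_support = thresholds, frequent_pairs = n_frequent)
```

With these baskets, one pair passes a threshold of 0.5, two pass 0.3, and four pass 0.1.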
Holding the support constant at 0.001, we can also see that the smaller the minimum confidence, the more rules are found:
values <- seq(0.1, 1, by=0.01)
numRules <- c()
for (val in values){
fit <- arFit(groc, support=0.001, confidence=val, topOnly = FALSE)
numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Confidence', ylab='Association Rules Found', type='l')
After some trials, I found some interesting rules by setting the minimum support to 0.002 and the minimum confidence to 0.1. The support, confidence, and lift of the top 10 association rules (ranked by lift) with these parameters are shown below.
rules <- arFit(groc, 0.002, 0.1, topOnly = FALSE)
rules[[2]]
## lhs rhs support
## 47 {Instant food products} {hamburger meat} 0.003050330
## 1584 {sugar,whole milk} {flour} 0.002846975
## 19 {liquor} {red/blush wine} 0.002135231
## 20 {red/blush wine} {liquor} 0.002135231
## 1583 {flour,whole milk} {sugar} 0.002846975
## 398 {flour} {sugar} 0.004982206
## 399 {sugar} {flour} 0.004982206
## 1728 {hard cheese,whipped/sour cream} {butter} 0.002033554
## 36 {popcorn} {salty snack} 0.002236909
## 1729 {butter,whipped/sour cream} {hard cheese} 0.002033554
## confidence lift count
## 47 0.3797468 11.421438 30
## 1584 0.1891892 10.881144 28
## 19 0.1926606 10.025484 21
## 20 0.1111111 10.025484 21
## 1583 0.3373494 9.963457 28
## 398 0.2865497 8.463112 49
## 399 0.1471471 8.463112 49
## 1728 0.4545455 8.202669 20
## 36 0.3098592 8.192110 22
## 1729 0.2000000 8.161826 20
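As a sanity check, the columns of the table above can be cross-checked against each other using the standard definitions of confidence and lift. The sketch below copies the values for the {flour} and {sugar} rules (rows 398 and 399) and the transaction count from summary(groc); the variable names are illustrative.

```r
# Values copied from the rule table above (rows 398 and 399)
supp_fs  <- 0.004982206  # support of {flour, sugar}
conf_f_s <- 0.2865497    # confidence of {flour} => {sugar}
conf_s_f <- 0.1471471    # confidence of {sugar} => {flour}
n_trans  <- 9835         # number of transactions reported by summary(groc)

# confidence = support(both) / support(lhs), so each single-item support follows
supp_flour <- supp_fs / conf_f_s
supp_sugar <- supp_fs / conf_s_f

# lift = support(both) / (support(lhs) * support(rhs)); symmetric in lhs and rhs
lift <- supp_fs / (supp_flour * supp_sugar)

# count = support * number of transactions
count <- supp_fs * n_trans

round(c(lift = lift, count = count), 3)
```

The result agrees with the reported lift of about 8.46 and count of 49. Because lift is symmetric, a rule and its reverse (rows 19 and 20, or 398 and 399) share the same lift even though their confidences differ.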
The rules can be visualized using the arulesViz package. A particularly interesting visualization is the graph of the rules:
subrules <- head(rules[[3]], n=10, by='lift')
plot(subrules, method = 'graph')
These rules can help the grocery store with product placement, advertising, or promotions. For example, the store could place salty snacks next to popcorn, or advertise a particular brand of hamburger meat in the aisle for instant food products.
For the extra-credit cluster analysis, the transaction data is first converted into a data frame. The goal is to cluster the items, with the transactions as dimensions, so the data frame will have 169 rows (items) and 9835 columns (transactions):
df <- groc@data %>% as.matrix() %>% as.data.frame()
row.names(df) <- groc@itemInfo$labels
dim(df)
## [1] 169 9835
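The shape of this items-by-transactions layout can be illustrated in base R. This is a minimal sketch on three invented baskets (an assumption for illustration, not the Groceries data): each row is an item, each column is a transaction, and a cell is TRUE when the item appears in that transaction.

```r
# Hypothetical toy baskets (not taken from the Groceries data set)
baskets <- list(c("milk", "bread"), c("milk"), c("bread", "butter"))

# One row per item, one column per transaction
items <- sort(unique(unlist(baskets)))
incidence <- sapply(baskets, function(b) items %in% b)
rownames(incidence) <- items

dim(incidence)   # items x transactions
incidence
```

Clustering the rows of such a matrix groups items by the set of transactions they appear in.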
Next, I will try the kmeans function to perform K-means clustering. The centers parameter specifies the desired number of clusters. The nstart parameter runs the algorithm n times, each time with a different set of initial centers, and keeps the best result. The iter.max parameter specifies the maximum number of iterations.
set.seed(1)
cluster <- kmeans(df, centers=10, nstart=50, iter.max=20)
str(cluster)
## List of 9
## $ cluster : Named int [1:169] 9 9 9 9 9 9 9 6 9 9 ...
## ..- attr(*, "names")= chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## $ centers : num [1:10, 1:9835] 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:10] "1" "2" "3" "4" ...
## .. ..$ : chr [1:9835] "V1" "V2" "V3" "V4" ...
## $ totss : num 41486
## $ withinss : num [1:10] 845 0 0 0 0 ...
## $ tot.withinss: num 28405
## $ betweenss : num 13081
## $ size : int [1:10] 2 1 1 1 1 21 1 1 138 2
## $ iter : int 4
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
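The sums of squares in this output are related by totss = tot.withinss + betweenss (here 28405 + 13081 = 41486), so betweenss / totss, about 0.32 for this fit, gives a rough sense of how much of the total variation the clusters account for. The sketch below shows the same fields on eight made-up points that form two obvious groups; the data are an assumption for illustration only.

```r
# Hypothetical toy data: two tight groups of four points each
pts <- cbind(x = c(0, 0.1, 0.2, 0.1, 5, 5.1, 5.2, 5.1),
             y = c(0, 0.1, 0,   0.2, 5, 5.1, 5,   5.2))

set.seed(1)
fit <- kmeans(pts, centers = 2, nstart = 5, iter.max = 20)

fit$size                                                 # points per cluster
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)   # the identity holds
fit$betweenss / fit$totss                                # share of variation between clusters
```

For well-separated groups like these the ratio is close to 1, which is a useful contrast with the much lower value obtained for the grocery items.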
We can retrieve the clusters, and find the group distribution:
cluster$cluster %>% table()
## .
## 1 2 3 4 5 6 7 8 9 10
## 2 1 1 1 1 21 1 1 138 2
It appears most items are concentrated in two groups: one with 21 items and the other with 138. Let’s take a look at these two large groups, starting with the group of 21:
cluster$cluster[cluster$cluster==6]
## beef bottled beer brown bread
## 6 6 6
## butter chicken chocolate
## 6 6 6
## citrus fruit coffee curd
## 6 6 6
## domestic eggs frankfurter frozen vegetables
## 6 6 6
## fruit/vegetable juice margarine napkins
## 6 6 6
## newspapers pastry pip fruit
## 6 6 6
## pork whipped/sour cream white bread
## 6 6 6
The items in this group are all food and drink related, with the exception of napkins and newspapers. The group of 138 is shown next:
cluster$cluster[cluster$cluster==9]
## abrasive cleaner artif. sweetener
## 9 9
## baby cosmetics baby food
## 9 9
## bags baking powder
## 9 9
## bathroom cleaner berries
## 9 9
## beverages brandy
## 9 9
## butter milk cake bar
## 9 9
## candles candy
## 9 9
## canned beer canned fish
## 9 9
## canned fruit canned vegetables
## 9 9
## cat food cereals
## 9 9
## chewing gum chocolate marshmallow
## 9 9
## cleaner cling film/bags
## 9 9
## cocoa drinks condensed milk
## 9 9
## cooking chocolate cookware
## 9 9
## cream cream cheese
## 9 9
## curd cheese decalcifier
## 9 9
## dental care dessert
## 9 9
## detergent dish cleaner
## 9 9
## dishes dog food
## 9 9
## female sanitary products finished products
## 9 9
## fish flour
## 9 9
## flower (seeds) flower soil/fertilizer
## 9 9
## frozen chicken frozen dessert
## 9 9
## frozen fish frozen fruits
## 9 9
## frozen meals frozen potato products
## 9 9
## grapes hair spray
## 9 9
## ham hamburger meat
## 9 9
## hard cheese herbs
## 9 9
## honey house keeping products
## 9 9
## hygiene articles ice cream
## 9 9
## instant coffee Instant food products
## 9 9
## jam ketchup
## 9 9
## kitchen towels kitchen utensil
## 9 9
## light bulbs liqueur
## 9 9
## liquor liquor (appetizer)
## 9 9
## liver loaf long life bakery product
## 9 9
## make up remover male cosmetics
## 9 9
## mayonnaise meat
## 9 9
## meat spreads misc. beverages
## 9 9
## mustard nut snack
## 9 9
## nuts/prunes oil
## 9 9
## onions organic products
## 9 9
## organic sausage packaged fruit/vegetables
## 9 9
## pasta pet care
## 9 9
## photo/film pickled vegetables
## 9 9
## popcorn pot plants
## 9 9
## potato products preservation products
## 9 9
## processed cheese prosecco
## 9 9
## pudding powder ready soups
## 9 9
## red/blush wine rice
## 9 9
## roll products rubbing alcohol
## 9 9
## rum salad dressing
## 9 9
## salt salty snack
## 9 9
## sauces seasonal products
## 9 9
## semi-finished bread skin care
## 9 9
## sliced cheese snack products
## 9 9
## soap soft cheese
## 9 9
## softener sound storage medium
## 9 9
## soups sparkling wine
## 9 9
## specialty bar specialty cheese
## 9 9
## specialty chocolate specialty fat
## 9 9
## specialty vegetables spices
## 9 9
## spread cheese sugar
## 9 9
## sweet spreads syrup
## 9 9
## tea tidbits
## 9 9
## toilet cleaner turkey
## 9 9
## UHT-milk vinegar
## 9 9
## waffles whisky
## 9 9
## white wine zwieback
## 9 9
It’s difficult to find what these items share in common. The transaction data may not have enough information to distinguish these items into groups.
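One possible contributing factor, offered here as a hedged observation rather than a tested conclusion, is the distance measure. K-means works with squared Euclidean distances, and for 0/1 vectors the Euclidean distance between two items is the square root of the number of transactions that contain exactly one of them. That quantity depends heavily on how often each item is bought, and it is consistent with the output above: the five most frequent items listed by summary(groc) appear in neither of the two large clusters. The sketch below uses four invented items to show the effect, and compares it with the "binary" method of base R's dist(), which is an alternative not used in this analysis.

```r
# Hypothetical items (rows) across 8 transactions (columns)
items <- rbind(
  rare1 = c(1, 0, 0, 0, 0, 0, 0, 0),
  rare2 = c(0, 1, 0, 0, 0, 0, 0, 0),  # never bought together with rare1
  freq1 = c(1, 1, 1, 1, 1, 1, 0, 0),
  freq2 = c(1, 1, 1, 1, 1, 0, 1, 0)   # bought together with freq1 five times
)

d_euc <- as.matrix(dist(items, method = "euclidean"))
d_bin <- as.matrix(dist(items, method = "binary"))

# Euclidean: both pairs differ in exactly two transactions, so they look equally close
d_euc["rare1", "rare2"]
d_euc["freq1", "freq2"]

# Binary: share of transactions with only one of the two items,
# among transactions with at least one of them
d_bin["rare1", "rare2"]
d_bin["freq1", "freq2"]
```

Under the Euclidean distance the two rare items, which never co-occur, are exactly as close as the two frequent items that are usually bought together; the binary distance separates the two cases.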
Here’s what happens when we cluster the items using 20 centers:
set.seed(1)
kmeans(df, centers=20, nstart=50, iter.max=20)$cluster %>% table()
## .
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 21 1 1 1 1 130 1 1 1 1 1 1 1
## 19 20
## 1 1
and 30 centers:
set.seed(1)
kmeans(df, centers=30, nstart=50, iter.max=20)$cluster %>% table()
## .
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 122 1 1 1 1 19 1 1 1 1 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30
## 1 1 1 1 1 1 1 1 1 1 1 1
As you can see, the items are still largely grouped into two main clusters, with every additional center capturing a single item. So increasing the number of cluster centers is not helpful.
Next, I tried hierarchical clustering via the hclust function, using the “average” linkage method. Below is the plot of the hierarchical structure of the grouping.
dist_mat <- dist(df, method = 'euclidean')
hcluster <- hclust(dist_mat, method = 'average')
plot(hcluster)
I used the function cutree to get the desired number of clusters (20):
cutree(hcluster, 20) %>% table()
## .
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 150 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 20
## 1 1
It appears that there is just one concentrated cluster using this method.
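To make the hclust and cutree steps concrete, here is a minimal sketch on six made-up one-dimensional points (the data and labels are assumptions for illustration). hclust builds the full merge tree from a distance matrix, and cutree(tree, k) cuts that tree into k groups, returning one cluster label per observation.

```r
# Hypothetical toy data: three visible groups along one axis
pts <- matrix(c(0, 0.1, 0.25, 5, 5.3, 9), ncol = 1,
              dimnames = list(letters[1:6], NULL))

tree <- hclust(dist(pts, method = "euclidean"), method = "average")

# Cut the tree into 3 clusters and count the members of each
groups <- cutree(tree, k = 3)
groups
table(groups)
```

Here the cut recovers the groups {a, b, c}, {d, e}, and {f}. A single far-away point such as f ends up in its own cluster, which is the same pattern as the many singleton clusters seen with the grocery items.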
Next, I tried the “ward.D” method of clustering:
hcluster <- hclust(dist_mat, method = 'ward.D')
plot(hcluster)
cutree(hcluster, 20) %>% table()
## .
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 54 40 15 1 1 44 1 1 1 1 1 1 1 1 1 1 1 1 1 1
This clustering method is more interesting, separating most of the items into 4 large groups (Groups 1, 2, 3, and 6). We can extract these groups and take a look. Again, it’s difficult to see what the items in each group have in common.
Below is the Group 1 cluster:
clusters <- cutree(hcluster, 20) %>% sort()
names(clusters[clusters == 1])
## [1] "abrasive cleaner" "artif. sweetener"
## [3] "baby cosmetics" "baby food"
## [5] "bags" "bathroom cleaner"
## [7] "brandy" "canned fruit"
## [9] "cleaner" "cocoa drinks"
## [11] "cooking chocolate" "cookware"
## [13] "cream" "curd cheese"
## [15] "decalcifier" "fish"
## [17] "flower soil/fertilizer" "frozen chicken"
## [19] "frozen fruits" "hair spray"
## [21] "honey" "jam"
## [23] "ketchup" "kitchen utensil"
## [25] "light bulbs" "liqueur"
## [27] "liver loaf" "make up remover"
## [29] "male cosmetics" "meat spreads"
## [31] "nut snack" "nuts/prunes"
## [33] "organic products" "organic sausage"
## [35] "potato products" "preservation products"
## [37] "prosecco" "pudding powder"
## [39] "ready soups" "rubbing alcohol"
## [41] "rum" "salad dressing"
## [43] "skin care" "snack products"
## [45] "soap" "sound storage medium"
## [47] "specialty fat" "specialty vegetables"
## [49] "spices" "syrup"
## [51] "tea" "tidbits"
## [53] "toilet cleaner" "whisky"
Below is the Group 2 cluster:
names(clusters[clusters == 2])
## [1] "baking powder" "berries"
## [3] "beverages" "butter milk"
## [5] "candy" "cat food"
## [7] "chewing gum" "cream cheese"
## [9] "dessert" "detergent"
## [11] "dishes" "flour"
## [13] "frozen meals" "grapes"
## [15] "ham" "hamburger meat"
## [17] "hard cheese" "herbs"
## [19] "hygiene articles" "ice cream"
## [21] "long life bakery product" "meat"
## [23] "misc. beverages" "oil"
## [25] "onions" "pickled vegetables"
## [27] "pot plants" "processed cheese"
## [29] "red/blush wine" "salty snack"
## [31] "semi-finished bread" "sliced cheese"
## [33] "soft cheese" "specialty bar"
## [35] "specialty chocolate" "sugar"
## [37] "UHT-milk" "waffles"
## [39] "white bread" "white wine"
Below is the Group 3 cluster:
names(clusters[clusters == 3])
## [1] "beef" "brown bread"
## [3] "butter" "chicken"
## [5] "chocolate" "coffee"
## [7] "curd" "domestic eggs"
## [9] "frankfurter" "frozen vegetables"
## [11] "fruit/vegetable juice" "margarine"
## [13] "napkins" "pork"
## [15] "whipped/sour cream"
Below is the Group 6 cluster:
names(clusters[clusters == 6])
## [1] "cake bar" "candles"
## [3] "canned fish" "canned vegetables"
## [5] "cereals" "chocolate marshmallow"
## [7] "cling film/bags" "condensed milk"
## [9] "dental care" "dish cleaner"
## [11] "dog food" "female sanitary products"
## [13] "finished products" "flower (seeds)"
## [15] "frozen dessert" "frozen fish"
## [17] "frozen potato products" "house keeping products"
## [19] "instant coffee" "Instant food products"
## [21] "kitchen towels" "liquor"
## [23] "liquor (appetizer)" "mayonnaise"
## [25] "mustard" "packaged fruit/vegetables"
## [27] "pasta" "pet care"
## [29] "photo/film" "popcorn"
## [31] "rice" "roll products"
## [33] "salt" "sauces"
## [35] "seasonal products" "softener"
## [37] "soups" "sparkling wine"
## [39] "specialty cheese" "spread cheese"
## [41] "sweet spreads" "turkey"
## [43] "vinegar" "zwieback"