Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(TSstudio)
library(RColorBrewer)
library(GGally)
library(grid)
library(gridExtra)
library(mlbench)
library(psych)
library(cowplot)
library(corrplot)
library(caret)
library(geoR)
library(reshape)
library(naniar)
library(mice)
library(DMwR)
library(AppliedPredictiveModeling)
library(pls)
library(glmnet)
library(elasticnet)
library(earth)
library(kernlab)
library(randomForest)
library(vip)
library(party)
library(Cubist)
library(gbm)
library(rpart.plot)
library(arulesViz)

Problem Statement

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

grocTrans <- read.transactions('GroceryDataSet.csv', sep=',')
summary(grocTrans)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Top 20 Items Purchased

itemFrequencyPlot(grocTrans, topN=20, col=brewer.pal(11,'RdBu'), type="absolute", main="Top 20 Grocery  Items")

In order to mine the association rules using the apriori() function in the arules package, the support and confidence parameters need to be determined. Setting the values too high will filter out many rules. Setting the values too low will introduce many trivial rules.

Below, is a wrapper function that takes the transaction data and the desired support and confidence as input, and output the top N rules (default 10) mined using the Apriori algorithm.

arFit <- function(data, support, confidence, topN=10, topOnly=TRUE){
  rules <- apriori(data, parameter = list(support=support, confidence=confidence), control=list(verbose = FALSE)) 
  rulesLen <- length(rules)
  topRules <- head(rules, n=topN, by='lift')
  topRules <- data.frame(lhs=labels(lhs(topRules)), rhs=labels(rhs(topRules)), topRules@quality)
  ifelse(topOnly, return(topRules), return(list(rulesLen, topRules, rules)))
}

Holding the confidence constant at 0.1, we can see the effects of support on the number of association rules found. As can be seen in the below plot, as support gets smaller, the number of rules increases exponentially.

values <- seq(0.001, 0.1, by=0.001)
numRules <- c()
for (val in values){
  fit <- arFit(grocTrans, support=val, confidence=0.1, topOnly = FALSE)
  numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Support', ylab='Association Rules Found', type='l')

Holding the support constant at 0.001, we can also see that smaller the confident, more rules are found:

values <- seq(0.1, 1, by=0.01)
numRules <- c()
for (val in values){
  fit <- arFit(grocTrans, support=0.001, confidence=val, topOnly = FALSE)
  numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Confidence', ylab='Association Rules Found', type='l')

After some trials, I found some interest rules setting minimum support to 0.002 and minimum confidence to 0.1. The support, confidence, and lift of the top 10 association rules with the aforementioned parameters are found below.

rules <- arFit(grocTrans, 0.002, 0.1, topOnly = FALSE)
rules[[2]]
## Warning: namespace 'highr' is not available and has been replaced
## by .GlobalEnv when processing object '<unknown>'
lhs rhs support confidence coverage lift count
47 {Instant food products} {hamburger meat} 0.0030503 0.3797468 0.0080325 11.421438 30
1584 {sugar,whole milk} {flour} 0.0028470 0.1891892 0.0150483 10.881144 28
19 {liquor} {red/blush wine} 0.0021352 0.1926606 0.0110829 10.025484 21
20 {red/blush wine} {liquor} 0.0021352 0.1111111 0.0192171 10.025484 21
1583 {flour,whole milk} {sugar} 0.0028470 0.3373494 0.0084392 9.963457 28
398 {flour} {sugar} 0.0049822 0.2865497 0.0173869 8.463112 49
399 {sugar} {flour} 0.0049822 0.1471471 0.0338587 8.463112 49
1728 {hard cheese,whipped/sour cream} {butter} 0.0020336 0.4545455 0.0044738 8.202669 20
36 {popcorn} {salty snack} 0.0022369 0.3098592 0.0072191 8.192110 22
1729 {butter,whipped/sour cream} {hard cheese} 0.0020336 0.2000000 0.0101678 8.161826 20

The rules can be visualized using arulesViz package. A particular interesting visulization is the graph of the rules:

subrules <- head(rules[[3]], n=10, by='lift')
plot(subrules, method = 'graph')

These rules can help the grocery store in terms of product placement, advertisement, or promotion. For example, placing salty snacks next to popcorn, advertisement of certain brand of hamburger meat in the aisle for instant food products, etc.

Clustering Analysis

First, the transaction data is converted into data frame. The goal is to cluster the items, with the transactions as dimensions. So the data frame will be 169 rows (items) by 9835 columns (transactions):

df <- grocTrans@data %>% as.matrix()  %>% as.data.frame() 
row.names(df) <- grocTrans@itemInfo$labels
dim(df)
## [1]  169 9835

Next, I will try the kmeans function to perform K-means clustering. The centers parameter specifies the number of desired clusters. The nstart parameter repeats the algorithm for n times, each time with different set of initial centers; and pick the best one. The iter.max parameter specifies the maximum number of iteration.

set.seed(1)
cluster <- kmeans(df, centers=10, nstart=50, iter.max=20)
str(cluster)
## List of 9
##  $ cluster     : Named int [1:169] 8 8 8 8 8 8 8 9 8 8 ...
##   ..- attr(*, "names")= chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##  $ centers     : num [1:10, 1:9835] 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:9835] "V1" "V2" "V3" "V4" ...
##  $ totss       : num 41486
##  $ withinss    : num [1:10] 0 0 845 792 0 ...
##  $ tot.withinss: num 28405
##  $ betweenss   : num 13081
##  $ size        : int [1:10] 1 1 2 2 1 1 1 138 21 1
##  $ iter        : int 4
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

We can retrieve the clusters, and find the group distribution:

cluster$cluster %>% table()
## .
##   1   2   3   4   5   6   7   8   9  10 
##   1   1   2   2   1   1   1 138  21   1

It appears most items are concerntrated in two groups - one with 21 items and the other with 138. Let’s take a look at the large groups:

cluster$cluster[cluster$cluster==6]
## bottled water 
##             6

It appears this group are all food and drink related items, with the exception of newspapers.

cluster$cluster[cluster$cluster==9]
##                  beef          bottled beer           brown bread 
##                     9                     9                     9 
##                butter               chicken             chocolate 
##                     9                     9                     9 
##          citrus fruit                coffee                  curd 
##                     9                     9                     9 
##         domestic eggs           frankfurter     frozen vegetables 
##                     9                     9                     9 
## fruit/vegetable juice             margarine               napkins 
##                     9                     9                     9 
##            newspapers                pastry             pip fruit 
##                     9                     9                     9 
##                  pork    whipped/sour cream           white bread 
##                     9                     9                     9

It’s difficult to find what these items share in common. The transaction data may not have enough information to distinguish these items into groups.

Here’s what happen when we cluster the items into 20 centers:

set.seed(1)
kmeans(df, centers=20, nstart=50, iter.max=20)$cluster %>% table()
## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1  21   1   1   1   1   1   1   1   1   1   1   1   1   1 130   1   1   1   1

and 30 centers:

set.seed(1)
kmeans(df, centers=30, nstart=50, iter.max=20)$cluster %>% table()
## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1 122   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30 
##   1   1   1   1  19   1   1   1   1   1

As you can see, the items are still largely group under two main clusters. So increasing the number of cluster centers is not helpful.

Next, I tried hierarchical clustering via hclust funtcion, using the “average” cluster method. Blow is the plot of the hierarchical structural of the groupping.

dist_mat <- dist(df, method = 'euclidean')
hcluster <- hclust(dist_mat, method = 'average')
plot(hcluster)

I used the function cutree to get the desired number of clusters (20):

cutree(hcluster, 20) %>% table()
## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 150   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

It appears that there is just one concentrated cluster using this method.

Next, I tried the “ward.D” method of clustering:

hcluster <- hclust(dist_mat, method = 'ward.D')
plot(hcluster)

cutree(hcluster, 20) %>% table()
## .
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 54 40 15  1  1 44  1  1  1  1  1  1  1  1  1  1  1  1  1  1

This clustering method is more interesting, separating the items into 4 large groups. We can extract these groups and take a look (Group 1, 2, 3, and 6). Again, it’s difficult to see what these items share in common in their respective groups.

Below is Group 1 cluster:

clusters <- cutree(hcluster, 20) %>% sort()
names(clusters[clusters == 1])
##  [1] "abrasive cleaner"       "artif. sweetener"       "baby cosmetics"        
##  [4] "baby food"              "bags"                   "bathroom cleaner"      
##  [7] "brandy"                 "canned fruit"           "cleaner"               
## [10] "cocoa drinks"           "cooking chocolate"      "cookware"              
## [13] "cream"                  "curd cheese"            "decalcifier"           
## [16] "fish"                   "flower soil/fertilizer" "frozen chicken"        
## [19] "frozen fruits"          "hair spray"             "honey"                 
## [22] "jam"                    "ketchup"                "kitchen utensil"       
## [25] "light bulbs"            "liqueur"                "liver loaf"            
## [28] "make up remover"        "male cosmetics"         "meat spreads"          
## [31] "nut snack"              "nuts/prunes"            "organic products"      
## [34] "organic sausage"        "potato products"        "preservation products" 
## [37] "prosecco"               "pudding powder"         "ready soups"           
## [40] "rubbing alcohol"        "rum"                    "salad dressing"        
## [43] "skin care"              "snack products"         "soap"                  
## [46] "sound storage medium"   "specialty fat"          "specialty vegetables"  
## [49] "spices"                 "syrup"                  "tea"                   
## [52] "tidbits"                "toilet cleaner"         "whisky"

Below is Group 2 cluster:

names(clusters[clusters == 2])
##  [1] "baking powder"            "berries"                 
##  [3] "beverages"                "butter milk"             
##  [5] "candy"                    "cat food"                
##  [7] "chewing gum"              "cream cheese"            
##  [9] "dessert"                  "detergent"               
## [11] "dishes"                   "flour"                   
## [13] "frozen meals"             "grapes"                  
## [15] "ham"                      "hamburger meat"          
## [17] "hard cheese"              "herbs"                   
## [19] "hygiene articles"         "ice cream"               
## [21] "long life bakery product" "meat"                    
## [23] "misc. beverages"          "oil"                     
## [25] "onions"                   "pickled vegetables"      
## [27] "pot plants"               "processed cheese"        
## [29] "red/blush wine"           "salty snack"             
## [31] "semi-finished bread"      "sliced cheese"           
## [33] "soft cheese"              "specialty bar"           
## [35] "specialty chocolate"      "sugar"                   
## [37] "UHT-milk"                 "waffles"                 
## [39] "white bread"              "white wine"

Below is Group 3 cluster:

names(clusters[clusters == 3])
##  [1] "beef"                  "brown bread"           "butter"               
##  [4] "chicken"               "chocolate"             "coffee"               
##  [7] "curd"                  "domestic eggs"         "frankfurter"          
## [10] "frozen vegetables"     "fruit/vegetable juice" "margarine"            
## [13] "napkins"               "pork"                  "whipped/sour cream"

Below is Group 6 cluster:

names(clusters[clusters == 6])
##  [1] "cake bar"                  "candles"                  
##  [3] "canned fish"               "canned vegetables"        
##  [5] "cereals"                   "chocolate marshmallow"    
##  [7] "cling film/bags"           "condensed milk"           
##  [9] "dental care"               "dish cleaner"             
## [11] "dog food"                  "female sanitary products" 
## [13] "finished products"         "flower (seeds)"           
## [15] "frozen dessert"            "frozen fish"              
## [17] "frozen potato products"    "house keeping products"   
## [19] "instant coffee"            "Instant food products"    
## [21] "kitchen towels"            "liquor"                   
## [23] "liquor (appetizer)"        "mayonnaise"               
## [25] "mustard"                   "packaged fruit/vegetables"
## [27] "pasta"                     "pet care"                 
## [29] "photo/film"                "popcorn"                  
## [31] "rice"                      "roll products"            
## [33] "salt"                      "sauces"                   
## [35] "seasonal products"         "softener"                 
## [37] "soups"                     "sparkling wine"           
## [39] "specialty cheese"          "spread cheese"            
## [41] "sweet spreads"             "turkey"                   
## [43] "vinegar"                   "zwieback"