homework

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of the stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 5 before midnight.

The required packages are loaded and the grocery data set is read using the read.transactions() function.

library(arulesViz)
library(dplyr)
library(RColorBrewer)

groc <- read.transactions('GroceryDataSet.csv', sep=',')
summary(groc)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Association Rules

The item frequency plot can be created using the itemFrequencyPlot() function in the arules package:

itemFrequencyPlot(groc, topN=20, col=brewer.pal(8,'Pastel1'))

In order to mine the association rules using the apriori() function in the arules package, the support and confidence parameters need to be determined. Setting the values too high will filter out many rules. Setting the values too low will introduce many trivial rules.
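
To make the three measures concrete, here is a minimal base-R sketch on five made-up baskets. The `baskets` list and the `supp()` helper are illustrative assumptions; they are not part of the Groceries data or of the arules API. The sketch computes support, confidence, and lift for the rule {flour} => {sugar} directly from their usual definitions.

```r
# Toy baskets (hypothetical, not taken from the Groceries data)
baskets <- list(
  c("flour", "sugar", "whole milk"),
  c("flour", "sugar"),
  c("sugar", "whole milk"),
  c("whole milk"),
  c("flour", "whole milk")
)

# Fraction of baskets that contain every item in `items`
supp <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

antecedent <- "flour"
consequent <- "sugar"

support    <- supp(c(antecedent, consequent))  # share of baskets with both items
confidence <- support / supp(antecedent)       # share of flour baskets that also hold sugar
lift       <- confidence / supp(consequent)    # confidence relative to sugar's baseline share

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 means the two items appear together more often than would be expected if they were bought independently.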

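Below, I have written a wrapper function that takes the transaction data and the desired support and confidence as input and outputs the top N rules (default 10), ranked by lift, mined using the Apriori algorithm. With topOnly=FALSE it instead returns a list holding the rule count, the top rules, and the full rule set.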

arFit <- function(data, support, confidence, topN=10, topOnly=TRUE){
  # Mine rules at the given minimum support and confidence
  rules <- apriori(data, parameter = list(support=support, confidence=confidence), control=list(verbose = FALSE))
  rulesLen <- length(rules)
  # Keep the topN rules ranked by lift and flatten them into a data frame
  topRules <- head(rules, n=topN, by='lift')
  topRules <- data.frame(lhs=labels(lhs(topRules)), rhs=labels(rhs(topRules)), quality(topRules))
  # Plain if/else: ifelse() is meant for vectors, not for control flow
  if (topOnly) topRules else list(rulesLen, topRules, rules)
}

Holding the confidence constant at 0.1, we can see the effect of support on the number of association rules found. As the plot below shows, the number of rules grows roughly exponentially as the support threshold decreases.

values <- seq(0.001, 0.1, by=0.001)
numRules <- c()
for (val in values){
  fit <- arFit(groc, support=val, confidence=0.1, topOnly = FALSE)
  numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Support', ylab='Association Rules Found', type='l')

Holding the support constant at 0.001, we can also see that the smaller the confidence, the more rules are found:

values <- seq(0.1, 1, by=0.01)
numRules <- c()
for (val in values){
  fit <- arFit(groc, support=0.001, confidence=val, topOnly = FALSE)
  numRules <- c(numRules, fit[[1]])
}
plot(x=values, y=numRules, xlab='Confidence', ylab='Association Rules Found', type='l')

After some trials, I found some interesting rules by setting the minimum support to 0.002 and the minimum confidence to 0.1. The support, confidence, and lift of the top 10 association rules (ranked by lift) under these parameters are shown below.

rules <- arFit(groc, 0.002, 0.1, topOnly = FALSE)
rules[[2]]

##                                   lhs              rhs     support
## 47            {Instant food products} {hamburger meat} 0.003050330
## 1584               {sugar,whole milk}          {flour} 0.002846975
## 19                           {liquor} {red/blush wine} 0.002135231
## 20                   {red/blush wine}         {liquor} 0.002135231
## 1583               {flour,whole milk}          {sugar} 0.002846975
## 398                           {flour}          {sugar} 0.004982206
## 399                           {sugar}          {flour} 0.004982206
## 1728 {hard cheese,whipped/sour cream}         {butter} 0.002033554
## 36                          {popcorn}    {salty snack} 0.002236909
## 1729      {butter,whipped/sour cream}    {hard cheese} 0.002033554
##      confidence      lift count
## 47    0.3797468 11.421438    30
## 1584  0.1891892 10.881144    28
## 19    0.1926606 10.025484    21
## 20    0.1111111 10.025484    21
## 1583  0.3373494  9.963457    28
## 398   0.2865497  8.463112    49
## 399   0.1471471  8.463112    49
## 1728  0.4545455  8.202669    20
## 36    0.3098592  8.192110    22
## 1729  0.2000000  8.161826    20
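
As a sanity check, the columns in this table are tied together by simple arithmetic. The sketch below uses only numbers printed above (9835 transactions from the summary, plus the two flour/sugar rows); the variable names are my own, and it is meant as an illustration rather than additional output from the analysis.

```r
n <- 9835                       # number of transactions reported by summary(groc)

# Rule 398: {flour} => {sugar}
count_398   <- 49
support_398 <- count_398 / n    # support = count / number of transactions
conf_398    <- 0.2865497
lift_398    <- 8.463112

# Rule 399: {sugar} => {flour} covers the same itemset, hence the same support and lift
conf_399 <- 0.1471471

# support(sugar) can be recovered in two independent ways:
supp_sugar_a <- conf_398 / lift_398       # lift = confidence / support(rhs)
supp_sugar_b <- support_398 / conf_399    # confidence = support / support(lhs)

round(c(support_398, supp_sugar_a, supp_sugar_b), 5)
```

Both routes give a support for sugar of about 0.034, which suggests the table is internally consistent.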

The rules can be visualized using the arulesViz package. A particularly interesting visualization is the graph of the rules:

subrules <- head(rules[[3]], n=10, by='lift')
plot(subrules, method = 'graph')

These rules can help the grocery store with product placement, advertising, or promotions. For example, the store could place salty snacks next to popcorn, or advertise a certain brand of hamburger meat in the instant food products aisle.

Clustering

First, the transaction data is converted into a data frame. The goal is to cluster the items, with the transactions as dimensions, so the data frame will be 169 rows (items) by 9835 columns (transactions):

df <- groc@data %>% as.matrix()  %>% as.data.frame() 
row.names(df) <- groc@itemInfo$labels
dim(df)

## [1]  169 9835

Next, I will try the kmeans function to perform K-means clustering. The centers parameter specifies the desired number of clusters. The nstart parameter runs the algorithm n times, each time with a different set of initial centers, and keeps the best result. The iter.max parameter specifies the maximum number of iterations.

set.seed(1)
cluster <- kmeans(df, centers=10, nstart=50, iter.max=20)
str(cluster)

## List of 9
##  $ cluster     : Named int [1:169] 9 9 9 9 9 9 9 6 9 9 ...
##   ..- attr(*, "names")= chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##  $ centers     : num [1:10, 1:9835] 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:9835] "V1" "V2" "V3" "V4" ...
##  $ totss       : num 41486
##  $ withinss    : num [1:10] 845 0 0 0 0 ...
##  $ tot.withinss: num 28405
##  $ betweenss   : num 13081
##  $ size        : int [1:10] 2 1 1 1 1 21 1 1 138 2
##  $ iter        : int 4
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
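
To illustrate what nstart does and how the sums of squares in this output relate to each other, here is a small sketch on synthetic data. The `toy` matrix is hypothetical and unrelated to the grocery data. It assumes that, with the same seed, both calls begin from the same first random start, so the multi-start fit should never end up with a larger total within-cluster sum of squares.

```r
set.seed(42)
# Synthetic 2-D data: three blobs of 30 points each (hypothetical)
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 5),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Reset the seed before each call so the two fits are comparable
set.seed(1)
fit_single <- kmeans(toy, centers = 3, nstart = 1,  iter.max = 20)
set.seed(1)
fit_multi  <- kmeans(toy, centers = 3, nstart = 25, iter.max = 20)

# nstart keeps the run with the lowest total within-cluster sum of squares
c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)

# totss = tot.withinss + betweenss holds for any fit
all.equal(fit_multi$totss, fit_multi$tot.withinss + fit_multi$betweenss)
```

The same decomposition can be read off the str() output above: 28405 + 13081 = 41486.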

We can retrieve the clusters, and find the group distribution:

cluster$cluster %>% table()

## .
##   1   2   3   4   5   6   7   8   9  10 
##   2   1   1   1   1  21   1   1 138   2

It appears most items are concentrated in two groups, one with 21 items and the other with 138. Let’s take a look at these two large groups:

cluster$cluster[cluster$cluster==6]

##                  beef          bottled beer           brown bread 
##                     6                     6                     6 
##                butter               chicken             chocolate 
##                     6                     6                     6 
##          citrus fruit                coffee                  curd 
##                     6                     6                     6 
##         domestic eggs           frankfurter     frozen vegetables 
##                     6                     6                     6 
## fruit/vegetable juice             margarine               napkins 
##                     6                     6                     6 
##            newspapers                pastry             pip fruit 
##                     6                     6                     6 
##                  pork    whipped/sour cream           white bread 
##                     6                     6                     6

This group consists almost entirely of food and drink items, with the exception of napkins and newspapers.

cluster$cluster[cluster$cluster==9]

##          abrasive cleaner          artif. sweetener 
##                         9                         9 
##            baby cosmetics                 baby food 
##                         9                         9 
##                      bags             baking powder 
##                         9                         9 
##          bathroom cleaner                   berries 
##                         9                         9 
##                 beverages                    brandy 
##                         9                         9 
##               butter milk                  cake bar 
##                         9                         9 
##                   candles                     candy 
##                         9                         9 
##               canned beer               canned fish 
##                         9                         9 
##              canned fruit         canned vegetables 
##                         9                         9 
##                  cat food                   cereals 
##                         9                         9 
##               chewing gum     chocolate marshmallow 
##                         9                         9 
##                   cleaner           cling film/bags 
##                         9                         9 
##              cocoa drinks            condensed milk 
##                         9                         9 
##         cooking chocolate                  cookware 
##                         9                         9 
##                     cream              cream cheese 
##                         9                         9 
##               curd cheese               decalcifier 
##                         9                         9 
##               dental care                   dessert 
##                         9                         9 
##                 detergent              dish cleaner 
##                         9                         9 
##                    dishes                  dog food 
##                         9                         9 
##  female sanitary products         finished products 
##                         9                         9 
##                      fish                     flour 
##                         9                         9 
##            flower (seeds)    flower soil/fertilizer 
##                         9                         9 
##            frozen chicken            frozen dessert 
##                         9                         9 
##               frozen fish             frozen fruits 
##                         9                         9 
##              frozen meals    frozen potato products 
##                         9                         9 
##                    grapes                hair spray 
##                         9                         9 
##                       ham            hamburger meat 
##                         9                         9 
##               hard cheese                     herbs 
##                         9                         9 
##                     honey    house keeping products 
##                         9                         9 
##          hygiene articles                 ice cream 
##                         9                         9 
##            instant coffee     Instant food products 
##                         9                         9 
##                       jam                   ketchup 
##                         9                         9 
##            kitchen towels           kitchen utensil 
##                         9                         9 
##               light bulbs                   liqueur 
##                         9                         9 
##                    liquor        liquor (appetizer) 
##                         9                         9 
##                liver loaf  long life bakery product 
##                         9                         9 
##           make up remover            male cosmetics 
##                         9                         9 
##                mayonnaise                      meat 
##                         9                         9 
##              meat spreads           misc. beverages 
##                         9                         9 
##                   mustard                 nut snack 
##                         9                         9 
##               nuts/prunes                       oil 
##                         9                         9 
##                    onions          organic products 
##                         9                         9 
##           organic sausage packaged fruit/vegetables 
##                         9                         9 
##                     pasta                  pet care 
##                         9                         9 
##                photo/film        pickled vegetables 
##                         9                         9 
##                   popcorn                pot plants 
##                         9                         9 
##           potato products     preservation products 
##                         9                         9 
##          processed cheese                  prosecco 
##                         9                         9 
##            pudding powder               ready soups 
##                         9                         9 
##            red/blush wine                      rice 
##                         9                         9 
##             roll products           rubbing alcohol 
##                         9                         9 
##                       rum            salad dressing 
##                         9                         9 
##                      salt               salty snack 
##                         9                         9 
##                    sauces         seasonal products 
##                         9                         9 
##       semi-finished bread                 skin care 
##                         9                         9 
##             sliced cheese            snack products 
##                         9                         9 
##                      soap               soft cheese 
##                         9                         9 
##                  softener      sound storage medium 
##                         9                         9 
##                     soups            sparkling wine 
##                         9                         9 
##             specialty bar          specialty cheese 
##                         9                         9 
##       specialty chocolate             specialty fat 
##                         9                         9 
##      specialty vegetables                    spices 
##                         9                         9 
##             spread cheese                     sugar 
##                         9                         9 
##             sweet spreads                     syrup 
##                         9                         9 
##                       tea                   tidbits 
##                         9                         9 
##            toilet cleaner                    turkey 
##                         9                         9 
##                  UHT-milk                   vinegar 
##                         9                         9 
##                   waffles                    whisky 
##                         9                         9 
##                white wine                  zwieback 
##                         9                         9

It’s difficult to find what these items share in common. The transaction data may not have enough information to distinguish these items into groups.
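
One possible explanation, offered as an assumption rather than something verified on this data set: with 0/1 purchase indicators, the squared Euclidean distance between two items equals the number of transactions that contain exactly one of them. Rarely bought items therefore all sit close to each other, regardless of what they are bought with. The toy vectors below are hypothetical.

```r
# Hypothetical purchase indicators over 10 transactions (1 = item in basket)
items <- rbind(
  popular_a = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
  popular_b = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0),
  rare_a    = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  rare_b    = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Squared Euclidean distances between items
d2 <- as.matrix(dist(items, method = "euclidean"))^2

# Closed form: n_a + n_b - 2 * n_ab (baskets containing exactly one of the two items)
n    <- rowSums(items)        # purchase count per item
n_ab <- items %*% t(items)    # co-purchase counts
d2_formula <- outer(n, n, "+") - 2 * n_ab

all.equal(d2, d2_formula, check.attributes = FALSE)
d2["rare_a", "rare_b"]        # never bought together, yet close
d2["popular_a", "popular_b"]  # bought together three times, yet farther apart
```

If this reading is right, the large cluster would mostly be a collection of low-frequency items rather than items that are bought together.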

Here’s what happens when we cluster the items with 20 centers:

set.seed(1)
kmeans(df, centers=20, nstart=50, iter.max=20)$cluster %>% table()

## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1  21   1   1   1   1 130   1   1   1   1   1   1   1 
##  19  20 
##   1   1

and 30 centers:

set.seed(1)
kmeans(df, centers=30, nstart=50, iter.max=20)$cluster %>% table()

## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1 122   1   1   1   1  19   1   1   1   1   1   1 
##  19  20  21  22  23  24  25  26  27  28  29  30 
##   1   1   1   1   1   1   1   1   1   1   1   1

As you can see, the items are still largely grouped into two main clusters, so increasing the number of cluster centers does not help.

Next, I tried hierarchical clustering via the hclust function, using the “average” linkage method. Below is a plot of the hierarchical structure of the grouping.

dist_mat <- dist(df, method = 'euclidean')
hcluster <- hclust(dist_mat, method = 'average')
plot(hcluster)

I used the function cutree to get the desired number of clusters (20):

cutree(hcluster, 20) %>% table()

## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 150   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  19  20 
##   1   1

It appears that there is just one concentrated cluster using this method.

Next, I tried the “ward.D” method of clustering:

hcluster <- hclust(dist_mat, method = 'ward.D')
plot(hcluster)

cutree(hcluster, 20) %>% table()

## .
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 54 40 15  1  1 44  1  1  1  1  1  1  1  1  1  1  1  1  1  1

This clustering method is more interesting, separating the items into 4 large groups (Groups 1, 2, 3, and 6). We can extract these groups and take a look. Again, it’s difficult to see what the items in each group have in common.
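
A side note on the method name: as I understand the hclust documentation, 'ward.D' corresponds to Ward's criterion only when it is given squared distances, while 'ward.D2' squares the dissimilarities internally. The sketch below, on hypothetical random data, compares the variants; it is an illustration and not a re-analysis of the grocery items.

```r
set.seed(1)
# Hypothetical data: 30 points in 5 dimensions
toy <- matrix(rnorm(150), nrow = 30)
d   <- dist(toy, method = "euclidean")

hc_d    <- hclust(d,   method = "ward.D")   # unsquared distances, as in the analysis above
hc_d_sq <- hclust(d^2, method = "ward.D")   # ward.D on squared distances
hc_d2   <- hclust(d,   method = "ward.D2")  # squares the distances internally

# The last two should give the same partition; the first may differ
identical(cutree(hc_d_sq, 4), cutree(hc_d2, 4))
table(ward.D = cutree(hc_d, 4), ward.D2 = cutree(hc_d2, 4))
```

Rerunning the item clustering with 'ward.D2' would be a cheap way to see whether the four large groups are stable.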

Below is the Group 1 cluster:

clusters <- cutree(hcluster, 20) %>% sort()
names(clusters[clusters == 1])

##  [1] "abrasive cleaner"       "artif. sweetener"      
##  [3] "baby cosmetics"         "baby food"             
##  [5] "bags"                   "bathroom cleaner"      
##  [7] "brandy"                 "canned fruit"          
##  [9] "cleaner"                "cocoa drinks"          
## [11] "cooking chocolate"      "cookware"              
## [13] "cream"                  "curd cheese"           
## [15] "decalcifier"            "fish"                  
## [17] "flower soil/fertilizer" "frozen chicken"        
## [19] "frozen fruits"          "hair spray"            
## [21] "honey"                  "jam"                   
## [23] "ketchup"                "kitchen utensil"       
## [25] "light bulbs"            "liqueur"               
## [27] "liver loaf"             "make up remover"       
## [29] "male cosmetics"         "meat spreads"          
## [31] "nut snack"              "nuts/prunes"           
## [33] "organic products"       "organic sausage"       
## [35] "potato products"        "preservation products" 
## [37] "prosecco"               "pudding powder"        
## [39] "ready soups"            "rubbing alcohol"       
## [41] "rum"                    "salad dressing"        
## [43] "skin care"              "snack products"        
## [45] "soap"                   "sound storage medium"  
## [47] "specialty fat"          "specialty vegetables"  
## [49] "spices"                 "syrup"                 
## [51] "tea"                    "tidbits"               
## [53] "toilet cleaner"         "whisky"

Below is the Group 2 cluster:

names(clusters[clusters == 2])

##  [1] "baking powder"            "berries"                 
##  [3] "beverages"                "butter milk"             
##  [5] "candy"                    "cat food"                
##  [7] "chewing gum"              "cream cheese"            
##  [9] "dessert"                  "detergent"               
## [11] "dishes"                   "flour"                   
## [13] "frozen meals"             "grapes"                  
## [15] "ham"                      "hamburger meat"          
## [17] "hard cheese"              "herbs"                   
## [19] "hygiene articles"         "ice cream"               
## [21] "long life bakery product" "meat"                    
## [23] "misc. beverages"          "oil"                     
## [25] "onions"                   "pickled vegetables"      
## [27] "pot plants"               "processed cheese"        
## [29] "red/blush wine"           "salty snack"             
## [31] "semi-finished bread"      "sliced cheese"           
## [33] "soft cheese"              "specialty bar"           
## [35] "specialty chocolate"      "sugar"                   
## [37] "UHT-milk"                 "waffles"                 
## [39] "white bread"              "white wine"

Below is the Group 3 cluster:

names(clusters[clusters == 3])

##  [1] "beef"                  "brown bread"          
##  [3] "butter"                "chicken"              
##  [5] "chocolate"             "coffee"               
##  [7] "curd"                  "domestic eggs"        
##  [9] "frankfurter"           "frozen vegetables"    
## [11] "fruit/vegetable juice" "margarine"            
## [13] "napkins"               "pork"                 
## [15] "whipped/sour cream"

Below is the Group 6 cluster:

names(clusters[clusters == 6])

##  [1] "cake bar"                  "candles"                  
##  [3] "canned fish"               "canned vegetables"        
##  [5] "cereals"                   "chocolate marshmallow"    
##  [7] "cling film/bags"           "condensed milk"           
##  [9] "dental care"               "dish cleaner"             
## [11] "dog food"                  "female sanitary products" 
## [13] "finished products"         "flower (seeds)"           
## [15] "frozen dessert"            "frozen fish"              
## [17] "frozen potato products"    "house keeping products"   
## [19] "instant coffee"            "Instant food products"    
## [21] "kitchen towels"            "liquor"                   
## [23] "liquor (appetizer)"        "mayonnaise"               
## [25] "mustard"                   "packaged fruit/vegetables"
## [27] "pasta"                     "pet care"                 
## [29] "photo/film"                "popcorn"                  
## [31] "rice"                      "roll products"            
## [33] "salt"                      "sauces"                   
## [35] "seasonal products"         "softener"                 
## [37] "soups"                     "sparkling wine"           
## [39] "specialty cheese"          "spread cheese"            
## [41] "sweet spreads"             "turkey"                   
## [43] "vinegar"                   "zwieback"

homework_10

Jun Yan

April 12, 2019
