HMWK8

Ch 8: ASSOCIATION RULES.

This is an R Markdown Notebook.

EXAMPLE: Identifying Frequently-Purchased Groceries.

STEP 1: DATA COLLECTION.

STEP 2: EXPLORING AND PREPARING THE DATA.

  • Load the grocery data into a sparse matrix, which stores only the cells that are occupied by an item. This makes the structure more memory-efficient than an equivalently sized matrix or data frame. Creating the sparse matrix from the transactional data requires the arules package.
library(arules)
## Warning: package 'arules' was built under R version 3.3.3
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
GROCERIES <- read.transactions("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv", sep = ",")
summary(GROCERIES)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
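As a quick sanity check on the summary above, the density figure can be converted back into a count of purchased items. The sketch below is plain base R arithmetic on numbers copied from the output; the variable names are illustrative and are not part of arules.

```r
# Figures copied from the summary(GROCERIES) output above.
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total cells in the matrix, and how many of them are occupied.
n_cells         <- n_transactions * n_items
items_purchased <- round(n_cells * density)

# Average basket size; compare with the Mean in the length distribution.
mean_basket <- items_purchased / n_transactions

# Support of the most frequent item (whole milk, 2513 transactions).
milk_support <- 2513 / n_transactions

print(c(items_purchased = items_purchased,
        mean_basket     = round(mean_basket, 3),
        milk_support    = round(milk_support, 4)))
```

The item count comes to 43367, which equals the sum of the counts in the "most frequent items" table (including "(Other)"). Dividing by the number of transactions gives the mean basket size of about 4.409.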
  • Looking at the first five transactions. This range can be adjusted, depending on which transactions you want to check.
inspect(GROCERIES[1:5])
##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}
  • Examining the frequency (the proportion of transactions that contain the item) of the first three items.
itemFrequency(GROCERIES[, 1:3])
## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661
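To make the meaning of these proportions concrete, here is a minimal base R sketch that computes the same kind of item frequency by hand. The baskets are hypothetical toy data, not the grocery data, and the code is only an illustration of the definition, not how arules computes it internally.

```r
# Hypothetical toy baskets (not taken from the grocery data).
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt", "berries"),
  c("bread"),
  c("milk", "bread", "yogurt")
)

# Every distinct item that appears in any basket.
items <- sort(unique(unlist(baskets)))

# Frequency of an item = share of baskets that contain it.
item_support <- vapply(items, function(item) {
  mean(vapply(baskets, function(b) item %in% b, logical(1)))
}, numeric(1))

print(sort(item_support, decreasing = TRUE))
```

Here "milk" appears in three of the four baskets, so its frequency is 0.75, while "berries" appears in one and gets 0.25.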
  • Plotting the frequency of the items that have at least 10% support.
itemFrequencyPlot(GROCERIES, support = 0.1)

  • Plotting the 20 most frequent items, ordered from the highest frequency to the lowest.
itemFrequencyPlot(GROCERIES, topN = 20)

  • Creating a visualization of the sparse matrix for the first five transactions.
image(GROCERIES[1:5])

  • Visualization of a random sample of 100 transactions.
image(sample(GROCERIES, 100))

STEP 3: TRAINING A MODEL ON THE DATA.

library(arules)
  • The default settings (support = 0.1, confidence = 0.8) result in zero rules learned.
apriori(GROCERIES)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules
  • Lowering the support and confidence thresholds to learn more rules.
Groceryrules <- apriori(GROCERIES, parameter = list(support = 0.006,
                          confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
Groceryrules
## set of 463 rules
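The two "Absolute minimum support count" lines in the logs above show what each support threshold means in actual transactions. The sketch below reproduces them in base R. It assumes the count is the truncated product of support and the number of transactions, which agrees with both logs but is not taken from the arules source; `min_count` is a hypothetical helper.

```r
n_transactions <- 9835

# Assumption: the minimum count is the truncated product support * n.
min_count <- function(support, n) floor(support * n)

default_count <- min_count(0.1,   n_transactions)  # default support
lowered_count <- min_count(0.006, n_transactions)  # support used above

print(c(default = default_count, lowered = lowered_count))
```

With the default threshold an item must appear in 983 transactions, and the log shows that only 8 of the 169 items do. Lowering the threshold to 59 transactions leaves 109 items to build rules from.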

STEP 4: EVALUATING MODEL PERFORMANCE.

  • First, obtain a summary of the grocery association rules.
summary(Groceryrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## 
## mining info:
##       data ntransactions support confidence
##  GROCERIES          9835   0.006       0.25
  • Looking at the first five rules. Again, you can extend the range according to what you need.
inspect(Groceryrules[1:5])
##     lhs                rhs                support     confidence lift    
## [1] {potted plants} => {whole milk}       0.006914082 0.4000000  1.565460
## [2] {pasta}         => {whole milk}       0.006100661 0.4054054  1.586614
## [3] {herbs}         => {root vegetables}  0.007015760 0.4312500  3.956477
## [4] {herbs}         => {other vegetables} 0.007727504 0.4750000  2.454874
## [5] {herbs}         => {whole milk}       0.007727504 0.4750000  1.858983
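The three quality measures are related: lift is the rule's confidence divided by the support of its right-hand side. The base R sketch below checks this for the first rule, {potted plants} => {whole milk}, using only numbers printed earlier in this notebook. The variable names are illustrative.

```r
n_transactions   <- 9835
whole_milk_count <- 2513         # from the summary(GROCERIES) output

rule_support    <- 0.006914082   # {potted plants} => {whole milk}
rule_confidence <- 0.4

# Support of the right-hand side on its own.
rhs_support <- whole_milk_count / n_transactions

# lift = confidence(lhs => rhs) / support(rhs)
lift <- rule_confidence / rhs_support

# Transactions containing both items, and those containing the lhs at all.
both_count <- round(rule_support * n_transactions)
lhs_count  <- round(both_count / rule_confidence)

print(c(lift = round(lift, 4), both = both_count, lhs = lhs_count))
```

The computed lift agrees with the 1.565460 reported by inspect(). In counts, 68 transactions contain both items out of 170 that contain potted plants, which is where the confidence of 0.4 comes from.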

STEP 5: IMPROVING MODEL PERFORMANCE.

  • Sorting grocery rules by lift.
inspect(sort(Groceryrules, by = "lift")[1:5])
##     lhs                   rhs                      support confidence     lift
## [1] {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477
## [2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886
## [3] {other vegetables,                                                        
##      tropical fruit,                                                          
##      whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074
## [4] {beef,                                                                    
##      other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692
## [5] {other vegetables,                                                        
##      tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649
  • Finding subsets of rules containing any berry items.
Berryrules <- subset(Groceryrules, items %in% "berries")
inspect(Berryrules)
##     lhs          rhs                  support     confidence lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328
  • Writing the rules to a CSV file.
write(Groceryrules, file = "Groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)
  • Converting the rule set to a data frame.
Groceryrules_df <- as(Groceryrules, "data.frame")
str(Groceryrules_df)
## 'data.frame':    463 obs. of  4 variables:
##  $ rules     : Factor w/ 463 levels "{baking powder} => {other vegetables}",..: 340 302 207 206 208 341 402 21 139 140 ...
##  $ support   : num  0.00691 0.0061 0.00702 0.00773 0.00773 ...
##  $ confidence: num  0.4 0.405 0.431 0.475 0.475 ...
##  $ lift      : num  1.57 1.59 3.96 2.45 1.86 ...
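Once the rules are in a data frame, ordinary base R tools can sort and filter them. The sketch below rebuilds a small data frame by hand from the first five rules shown by inspect() earlier, so it runs without arules. `rules_df` and the thresholds are illustrative choices, not part of the original analysis.

```r
# First five rules, copied from the inspect(Groceryrules[1:5]) output.
rules_df <- data.frame(
  rules = c("{potted plants} => {whole milk}",
            "{pasta} => {whole milk}",
            "{herbs} => {root vegetables}",
            "{herbs} => {other vegetables}",
            "{herbs} => {whole milk}"),
  support    = c(0.006914082, 0.006100661, 0.007015760,
                 0.007727504, 0.007727504),
  confidence = c(0.4000000, 0.4054054, 0.4312500, 0.4750000, 0.4750000),
  lift       = c(1.565460, 1.586614, 3.956477, 2.454874, 1.858983),
  stringsAsFactors = FALSE
)

# Sort by lift, highest first.
by_lift <- rules_df[order(rules_df$lift, decreasing = TRUE), ]

# Keep rules whose text mentions herbs (literal match, not a regex).
herb_rules <- rules_df[grepl("{herbs}", rules_df$rules, fixed = TRUE), ]

# Keep rules that clear two illustrative thresholds.
strong <- subset(rules_df, lift > 2 & confidence > 0.4)

print(by_lift)
```

The same expressions should work on Groceryrules_df, with one caveat: its rules column is a factor there, so wrap it in as.character() before any string matching.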