Introduction

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Data Summary

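The data summary lists the most frequently purchased items. In this data set, the three most popular items are whole milk (in 2,513 transactions), other vegetables (1,903) and rolls/buns (1,809), followed by soda and yogurt. The (Other) row aggregates the remaining 164 items.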

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Frequency Plot

The frequency plot gives us two other ways to look at popular items: absolute and relative. The absolute plot shows the number of transactions that contain each item, while the relative plot shows that count as a proportion of all transactions in the data set. Both plots list the same top 10 items in the same order; only the scale of the axis differs.
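
As a quick check of how the two scales relate, the relative frequency is the absolute count divided by the number of transactions. The sketch below reuses the counts printed in the data summary; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the "most frequent items" block of the data summary
n_transactions <- 9835
absolute <- c("whole milk" = 2513, "other vegetables" = 1903,
              "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative frequency = share of all transactions that contain the item
relative <- absolute / n_transactions
print(round(relative, 3))
```

Whole milk, for example, appears in roughly a quarter of all transactions.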

Rules Plot

The apriori call uses a minimum support of 0.01 and a minimum confidence of 0.49. Support is the proportion of transactions that contain every item in a rule, so the minimum support sets the threshold of popularity a rule must meet. Confidence shows how likely item Y was purchased when item X was purchased. An initial confidence level of 0.8 resulted in no rules, 0.5 resulted in 15 rules and 0.1 resulted in 435. This example uses 0.49, which returns 19 rules and visualizes more than one order (rules of 2 and 3 items).
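
To make the two measures concrete, here is a minimal base R sketch that computes support and confidence by hand. The five baskets and the helper `contains()` are made up for illustration and are not taken from the Groceries data.

```r
# Five hypothetical baskets (illustrative only, not from the data set)
baskets <- list(
  c("butter", "whole milk"),
  c("butter", "whole milk", "yogurt"),
  c("butter", "rolls/buns"),
  c("whole milk", "yogurt"),
  c("rolls/buns", "soda")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

n <- length(baskets)

# Rule under test: {butter} => {whole milk}
support_x  <- sum(contains("butter")) / n                    # 3 of 5 baskets
support_xy <- sum(contains(c("butter", "whole milk"))) / n   # 2 of 5 baskets

# Confidence = share of butter baskets that also contain whole milk
confidence <- support_xy / support_x                         # 2 of 3 baskets

print(c(support = support_xy, confidence = confidence))
```

With a minimum support of 0.01 on 9835 transactions, a rule has to appear in about 98 receipts to be kept, which matches the absolute minimum support count reported by apriori.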

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.49    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

## set of 19 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
##  2 17 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.895   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01007   Min.   :0.4902   Min.   :1.919   Min.   : 99.0  
##  1st Qu.:0.01118   1st Qu.:0.5010   1st Qu.:2.016   1st Qu.:110.0  
##  Median :0.01230   Median :0.5175   Median :2.162   Median :121.0  
##  Mean   :0.01430   Mean   :0.5312   Mean   :2.255   Mean   :140.6  
##  3rd Qu.:0.01459   3rd Qu.:0.5665   3rd Qu.:2.406   3rd Qu.:143.5  
##  Max.   :0.02755   Max.   :0.5862   Max.   :3.030   Max.   :271.0  
## 
## mining info:
##  data ntransactions support confidence
##  gcds          9835    0.01       0.49

##      lhs                        rhs                   support confidence     lift count
## [1]  {curd}                  => {whole milk}       0.02613116  0.4904580 1.919481   257
## [2]  {butter}                => {whole milk}       0.02755465  0.4972477 1.946053   271
## [3]  {curd,                                                                            
##       yogurt}                => {whole milk}       0.01006609  0.5823529 2.279125    99
## [4]  {butter,                                                                          
##       other vegetables}      => {whole milk}       0.01148958  0.5736041 2.244885   113
## [5]  {domestic eggs,                                                                   
##       other vegetables}      => {whole milk}       0.01230300  0.5525114 2.162336   121
## [6]  {fruit/vegetable juice,                                                           
##       other vegetables}      => {whole milk}       0.01047280  0.4975845 1.947371   103
## [7]  {whipped/sour cream,                                                              
##       yogurt}                => {other vegetables} 0.01016777  0.4901961 2.533410   100
## [8]  {whipped/sour cream,                                                              
##       yogurt}                => {whole milk}       0.01087951  0.5245098 2.052747   107
## [9]  {other vegetables,                                                                
##       whipped/sour cream}    => {whole milk}       0.01464159  0.5070423 1.984385   144
## [10] {other vegetables,                                                                
##       pip fruit}             => {whole milk}       0.01352313  0.5175097 2.025351   133
## [11] {citrus fruit,                                                                    
##       root vegetables}       => {other vegetables} 0.01037112  0.5862069 3.029608   102
## [12] {root vegetables,                                                                 
##       tropical fruit}        => {other vegetables} 0.01230300  0.5845411 3.020999   121
## [13] {root vegetables,                                                                 
##       tropical fruit}        => {whole milk}       0.01199797  0.5700483 2.230969   118
## [14] {tropical fruit,                                                                  
##       yogurt}                => {whole milk}       0.01514997  0.5173611 2.024770   149
## [15] {root vegetables,                                                                 
##       yogurt}                => {other vegetables} 0.01291307  0.5000000 2.584078   127
## [16] {root vegetables,                                                                 
##       yogurt}                => {whole milk}       0.01453991  0.5629921 2.203354   143
## [17] {rolls/buns,                                                                      
##       root vegetables}       => {other vegetables} 0.01220132  0.5020921 2.594890   120
## [18] {rolls/buns,                                                                      
##       root vegetables}       => {whole milk}       0.01270971  0.5230126 2.046888   125
## [19] {other vegetables,                                                                
##       yogurt}                => {whole milk}       0.02226741  0.5128806 2.007235   219
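
The table above lists the rules in the order apriori returned them. Since the assignment asks for the top 10 rules by lift, a sketch like the following would print them in ranked order. It assumes the arules package is installed, and it uses the Groceries data bundled with arules as a stand-in for GroceryDataSet.csv. The bundled set also has 9835 transactions and 169 items, but treating the two as identical is an assumption.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# Same thresholds as the analysis; verbose = FALSE suppresses the progress log
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.49),
                 control = list(verbose = FALSE))

# Keep the 10 rules with the highest lift, sorted in decreasing order
top10_by_lift <- head(rules, n = 10, by = "lift")
inspect(top10_by_lift)
```

In the table above, the two highest-lift rules are [11] {citrus fruit, root vegetables} => {other vegetables} and [12] {root vegetables, tropical fruit} => {other vegetables}, both with a lift of about 3.0.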

Lift and Confidence Plot

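Lift gives us how much more likely item Y is purchased when X is purchased, while controlling for how popular Y is. A lift above 1 means X and Y appear together more often than would be expected if they were purchased independently. Below we are plotting the lift and confidence of each rule.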
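
Lift can be checked by hand as the confidence of a rule divided by the support of its right-hand side. The sketch below does this for rule [2], {butter} => {whole milk}, using only numbers printed earlier in this report; the variable names are illustrative.

```r
# Values copied from the output above
n_transactions   <- 9835        # data summary
count_whole_milk <- 2513        # "most frequent items"
confidence       <- 0.4972477   # rule [2] {butter} => {whole milk}

# Support of the right-hand side: share of transactions with whole milk
support_rhs <- count_whole_milk / n_transactions

# Lift = confidence / support of the right-hand side
lift <- confidence / support_rhs
print(round(lift, 3))   # about 1.946, in line with the lift reported for rule [2]
```

Put differently, customers who buy butter are almost twice as likely to buy whole milk as a customer picked at random.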

Cluster Analysis

The cluster plot groups similar rules, then shows the distribution of the grouped rules. In the plots below we use 10 as the number of clusters.

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
library(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(arules)
library(arulesViz)
library(cluster)
library(igraph)
library(visNetwork)
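# Read the receipts in basket format: one transaction per line, items separated by commas
gcds <- read.transactions("GroceryDataSet.csv", header = FALSE, format = "basket", sep = ",")
summary(gcds)

# Absolute and relative item frequencies for the 10 most popular items, side by side
par(mfrow = c(1, 2))
itemFrequencyPlot(gcds, topN = 10, type = "absolute", main = "Absolute")
itemFrequencyPlot(gcds, topN = 10, type = "relative", main = "Relative")

# Mine rules with a minimum support of 0.01 and a minimum confidence of 0.49
rules <- apriori(gcds, parameter = list(supp = 0.01, conf = 0.49))
summary(rules)
inspect(rules)

# Scatter plots of the rules; the two-key plot colors points by rule order
plot(rules, jitter = 3)
plot(rules, jitter = 3, method = "two-key plot")

# Top 10 rules by lift (parallel coordinates) and by confidence (interactive graph)
top10rules2 <- head(rules, n = 10, by = "lift")
plot(top10rules2, method = "paracoord")
top10rules <- head(rules, n = 10, by = "confidence")
plot(top10rules, method = "graph", engine = "htmlwidget")

# Grouped plot with 10 clusters, then a graph of the top 10 rules by lift
plot(rules, method = "grouped", control = list(k = 10))
subrules <- head(sort(rules, by = "lift"), 10)
plot(subrules, method = "graph")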

Data 624 Homework 10

Anthony Pagan