Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
The data summary details the most popular items. In this dataset, the three most popular items are whole milk, other vegetables and rolls/buns, followed by soda and yogurt.
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
The frequency plots give us two other ways to look at popular items, absolute and relative. The absolute plot shows how many transactions contain each item, while the relative plot shows the share of all transactions in the data set that contain it. Both plots list the same items in the same order, since the relative frequency is simply the count divided by the number of transactions.
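The link between the two plots can be checked by hand. The sketch below assumes only the item counts and the transaction total printed in the summary output above, and divides each count by 9835 to get the relative frequency. The variable names are illustrative and not part of the original analysis.

```r
# Item counts copied from the "most frequent items" part of the summary output
counts <- c("whole milk" = 2513, "other vegetables" = 1903,
            "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Number of transactions (rows) reported by the same summary
n_transactions <- 9835

# Relative frequency = share of all transactions that contain the item
relative <- counts / n_transactions
print(round(relative, 4))
```

Because every count is divided by the same total, the ranking of the items is identical in the absolute and relative plots.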
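For the apriori run we use a minimum support of 0.01 and a minimum confidence of 0.49. Support is the share of all transactions that contain the items in a rule, so it sets the threshold of popularity: a support of 0.01 keeps only rules seen in roughly 1% of the 9835 receipts. Confidence shows how likely item Y was purchased when item X was purchased. An initial confidence level of 0.8 resulted in no rules, 0.5 resulted in 15 rules and 0.1 resulted in 435. This example uses 0.49, which yields 19 rules and includes rules of more than one order (length 2 as well as length 3).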
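To make the two thresholds concrete, the sketch below computes support and confidence by hand in base R for a handful of made-up baskets. The baskets and the helper functions `support()` and `confidence()` are hypothetical and only illustrate the definitions; they are not arules functions and the data is not taken from the Groceries file.

```r
# Hypothetical toy baskets, NOT taken from the Groceries data
baskets <- list(
  c("butter", "whole milk"),
  c("butter", "whole milk", "yogurt"),
  c("whole milk"),
  c("butter"),
  c("yogurt", "soda")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support of both sides divided by support of lhs
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# Rule {butter} => {whole milk}
supp_rule <- support(c("butter", "whole milk"), baskets)  # 2 of the 5 baskets
conf_rule <- confidence("butter", "whole milk", baskets)  # 2 of the 3 butter baskets
print(c(support = supp_rule, confidence = conf_rule))
```

With thresholds like those used here, a rule is kept only if both values clear their minimum, which is why lowering the confidence threshold lets more rules through.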
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.49    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 19 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 2 17
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.895 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01007 Min. :0.4902 Min. :1.919 Min. : 99.0
## 1st Qu.:0.01118 1st Qu.:0.5010 1st Qu.:2.016 1st Qu.:110.0
## Median :0.01230 Median :0.5175 Median :2.162 Median :121.0
## Mean :0.01430 Mean :0.5312 Mean :2.255 Mean :140.6
## 3rd Qu.:0.01459 3rd Qu.:0.5665 3rd Qu.:2.406 3rd Qu.:143.5
## Max. :0.02755 Max. :0.5862 Max. :3.030 Max. :271.0
##
## mining info:
## data ntransactions support confidence
## gcds 9835 0.01 0.49
##      lhs                                          rhs                   support confidence     lift count
## [1]  {curd}                                    => {whole milk}       0.02613116  0.4904580 1.919481   257
## [2]  {butter}                                  => {whole milk}       0.02755465  0.4972477 1.946053   271
## [3]  {curd, yogurt}                            => {whole milk}       0.01006609  0.5823529 2.279125    99
## [4]  {butter, other vegetables}                => {whole milk}       0.01148958  0.5736041 2.244885   113
## [5]  {domestic eggs, other vegetables}         => {whole milk}       0.01230300  0.5525114 2.162336   121
## [6]  {fruit/vegetable juice, other vegetables} => {whole milk}       0.01047280  0.4975845 1.947371   103
## [7]  {whipped/sour cream, yogurt}              => {other vegetables} 0.01016777  0.4901961 2.533410   100
## [8]  {whipped/sour cream, yogurt}              => {whole milk}       0.01087951  0.5245098 2.052747   107
## [9]  {other vegetables, whipped/sour cream}    => {whole milk}       0.01464159  0.5070423 1.984385   144
## [10] {other vegetables, pip fruit}             => {whole milk}       0.01352313  0.5175097 2.025351   133
## [11] {citrus fruit, root vegetables}           => {other vegetables} 0.01037112  0.5862069 3.029608   102
## [12] {root vegetables, tropical fruit}         => {other vegetables} 0.01230300  0.5845411 3.020999   121
## [13] {root vegetables, tropical fruit}         => {whole milk}       0.01199797  0.5700483 2.230969   118
## [14] {tropical fruit, yogurt}                  => {whole milk}       0.01514997  0.5173611 2.024770   149
## [15] {root vegetables, yogurt}                 => {other vegetables} 0.01291307  0.5000000 2.584078   127
## [16] {root vegetables, yogurt}                 => {whole milk}       0.01453991  0.5629921 2.203354   143
## [17] {rolls/buns, root vegetables}             => {other vegetables} 0.01220132  0.5020921 2.594890   120
## [18] {rolls/buns, root vegetables}             => {whole milk}       0.01270971  0.5230126 2.046888   125
## [19] {other vegetables, yogurt}                => {whole milk}       0.02226741  0.5128806 2.007235   219
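The assignment asks for the top 10 rules by lift, while the listing above is in the order apriori returned the rules. As a sketch, the lift values can be copied from the listing into a data frame and re-ranked in base R. The data frame `rules_df` and its column names are illustrative and not part of the original analysis.

```r
# Rules and lift values copied from the inspect() listing above
rules_df <- data.frame(
  rule = c(
    "{curd} => {whole milk}",
    "{butter} => {whole milk}",
    "{curd, yogurt} => {whole milk}",
    "{butter, other vegetables} => {whole milk}",
    "{domestic eggs, other vegetables} => {whole milk}",
    "{fruit/vegetable juice, other vegetables} => {whole milk}",
    "{whipped/sour cream, yogurt} => {other vegetables}",
    "{whipped/sour cream, yogurt} => {whole milk}",
    "{other vegetables, whipped/sour cream} => {whole milk}",
    "{other vegetables, pip fruit} => {whole milk}",
    "{citrus fruit, root vegetables} => {other vegetables}",
    "{root vegetables, tropical fruit} => {other vegetables}",
    "{root vegetables, tropical fruit} => {whole milk}",
    "{tropical fruit, yogurt} => {whole milk}",
    "{root vegetables, yogurt} => {other vegetables}",
    "{root vegetables, yogurt} => {whole milk}",
    "{rolls/buns, root vegetables} => {other vegetables}",
    "{rolls/buns, root vegetables} => {whole milk}",
    "{other vegetables, yogurt} => {whole milk}"
  ),
  lift = c(
    1.919481, 1.946053, 2.279125, 2.244885, 2.162336, 1.947371, 2.533410,
    2.052747, 1.984385, 2.025351, 3.029608, 3.020999, 2.230969, 2.024770,
    2.584078, 2.203354, 2.594890, 2.046888, 2.007235
  ),
  stringsAsFactors = FALSE
)

# Sort by lift, highest first, and keep the first 10 rows
top10 <- head(rules_df[order(rules_df$lift, decreasing = TRUE), ], 10)
print(top10, row.names = FALSE)
```

The two strongest rules by lift both point to other vegetables, from {citrus fruit, root vegetables} and from {root vegetables, tropical fruit}.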
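Lift tells us how much more likely item Y is to be purchased when X is purchased than it is to be purchased in general, which controls for how popular Y is on its own. A lift above 1 means X and Y appear together more often than we would expect if they were independent. Below we plot the support, confidence and lift of each rule.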
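Lift can be recomputed from numbers already shown in this report, since it equals the confidence of a rule divided by the support of its right-hand side. The sketch below does this for {butter} => {whole milk}, assuming only the whole milk count and the transaction total from the data summary and the confidence from the rules listing. The variable names are illustrative.

```r
n_transactions   <- 9835        # transactions, from the data summary
count_whole_milk <- 2513        # whole milk count, from the data summary
conf_butter_milk <- 0.4972477   # confidence of {butter} => {whole milk}, from the rules listing

# Support of the right-hand side: how popular whole milk is on its own
supp_whole_milk <- count_whole_milk / n_transactions

# Lift = confidence of the rule / support of the right-hand side
lift_butter_milk <- conf_butter_milk / supp_whole_milk
print(round(lift_butter_milk, 3))  # about 1.946, in line with the lift column for this rule
```

Whole milk is in about a quarter of all baskets, so a confidence of roughly 0.5 means buying butter about doubles the chance of seeing whole milk in the same basket.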
The grouped plot clusters similar rules and then shows how the rules are distributed across those groups. In the plot below we use 10 as the number of clusters.
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(arules)
library(arulesViz)
library(cluster)
library(igraph)
library(visNetwork)
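# Read the receipts: one transaction per line, items separated by commas
gcds <- read.transactions("GroceryDataSet.csv", header = FALSE, format = "basket", sep = ",")
summary(gcds)
#itemFrequency(gcds)
# Top 10 items side by side: absolute counts and relative frequencies
par(mfrow = c(1, 2))
itemFrequencyPlot(gcds, topN = 10, type = "absolute", main = "Absolute")
itemFrequencyPlot(gcds, topN = 10, type = "relative", main = "Relative")
# Mine rules with minimum support 0.01 and minimum confidence 0.49
rules <- apriori(gcds, parameter = list(supp = 0.01, conf = 0.49))
summary(rules)
inspect(rules)
# Scatter plots of the rule quality measures
plot(rules, jitter = 3)
plot(rules, jitter = 3, method = "two-key plot")
# Top 10 rules by lift as a parallel coordinates plot
top10rules2 <- head(rules, n = 10, by = "lift")
plot(top10rules2, method = "paracoord")
# Top 10 rules by confidence as an interactive graph
top10rules <- head(rules, n = 10, by = "confidence")
plot(top10rules, method = "graph", engine = "htmlwidget")
# Grouped plot of all rules with 10 clusters
plot(rules, method = "grouped", control = list(k = 10))
# Top 10 rules by lift as a static graph
subrules <- head(sort(rules, by = "lift"), 10)
plot(subrules, method = "graph")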