if (!require("arules")) install.packages("arules")
if (!require("arulesViz")) install.packages("arulesViz")
library(arules)
library(arulesViz)
Note: This exercise follows the Market Basket Analysis technique used by Salem Marafi (2014) in his Market Basket Analysis with R post where he analyzes a similar dataset.
Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of the items that went into a customer’s basket, hence ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.
github <- "https://raw.githubusercontent.com/jzuniga123"
file <- "/SPS/master/DATA%20624/GroceryDataSet.csv"
Groceries <- read.transactions(paste0(github, file), format="basket", sep=",")
The read.transactions() function reads transaction data and creates an object type that can be analyzed with the arules package. The format parameter can be either basket (“each line in the transaction data file represents a transaction where the items are separated by the characters specified by sep”) or single (“each line corresponds to a single item, containing at least ids for the transaction and the item”).
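To make the two layouts concrete, here is a minimal base-R sketch that turns "single"-style rows into "basket"-style lines. The data and the names single_rows, baskets, and basket_lines are made up for illustration; read.transactions() does its own parsing and does not need this step.

```r
# Hypothetical "single"-format data: one (transaction id, item) pair per row.
single_rows <- data.frame(
  id   = c(1, 1, 2, 2, 2, 3),
  item = c("milk", "bread", "milk", "eggs", "bread", "soda"),
  stringsAsFactors = FALSE
)

# Group the items by transaction id to recover one basket per transaction.
baskets <- split(single_rows$item, single_rows$id)

# Collapse each basket into one comma-separated line, which is the layout
# that format = "basket" with sep = "," describes.
basket_lines <- vapply(baskets, paste, character(1), collapse = ",")
print(unname(basket_lines))
```

Both layouts carry the same information; the Groceries file used here is already in the basket layout, one receipt per line.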
According to Marafi, given items \(I=\left\{ i_{ j },i_{ k },...,i_{ n } \right\}\) and transactions \(t_{ n }\) comprised of items \(\left\{ i_{ j },i_{ k },...,i_{ n } \right\}\), an association rule is a relationship \(\left\{ i_{ 1 },i_{ 2 } \right\} \Rightarrow i_{ 3 }\) such that the purchase of the antecedents implies the likely purchase of the consequent. The strength of the association is measured quantitatively in terms of Support, Confidence, and Lift. Support measures how frequently the antecedents and the consequent appear together in the dataset, Confidence measures how often the consequent appears in the transactions that contain the antecedents, and Lift compares that confidence with the consequent’s overall frequency, so a lift above 1 means the antecedents make the consequent more likely than it is in general. Analysis involves starting with a priori minimum thresholds for support and confidence, sorting results, removing redundancies, then targeting items if desired.
The itemFrequencyPlot() function, as its name states, creates a plot showing the frequency of the items in the dataset. The apriori() function mines association rules from transaction data. The is.redundant() function flags redundant rules so that they can be removed, and the inspect() function displays the resulting rules.
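The three measures can be computed by hand on a small example. The sketch below uses made-up transactions (not the Groceries data) and the usual definitions: support is the share of transactions containing every item in the rule, confidence is that support divided by the support of the antecedent, and lift is confidence divided by the support of the consequent. The helper name support and the chosen rule are assumptions for illustration only.

```r
# Toy transactions, made up for illustration (not taken from the Groceries data).
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("milk", "bread", "eggs")
)

# Share of transactions that contain every item in `itemset`.
support <- function(itemset) {
  hits <- vapply(transactions, function(basket) all(itemset %in% basket), logical(1))
  sum(hits) / length(transactions)
}

lhs <- "bread"  # antecedent
rhs <- "milk"   # consequent

rule_support    <- support(c(lhs, rhs))           # lhs and rhs together
rule_confidence <- rule_support / support(lhs)    # rhs given lhs
rule_lift       <- rule_confidence / support(rhs) # confidence relative to rhs frequency

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

Here the rule {bread} => {milk} has support 0.6 and confidence 0.75, yet its lift is 0.9375, below 1, because milk already appears in 80% of the toy baskets. A very common consequent can produce high confidence without a genuinely strong association.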
itemFrequencyPlot(Groceries, topN=20, type="relative")
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8, maxlen=3))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##       3  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
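The log above shows apriori() checking itemsets of size 1, 2, and 3 in turn. The following is a minimal base-R sketch of that level-wise idea on made-up transactions. It is not the arules implementation, it leaves out the subset-pruning step of the full Apriori algorithm, and the names transactions, support_of, is_frequent, and min_support are assumptions chosen for illustration.

```r
# Toy transactions and threshold, made up for illustration.
transactions <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "eggs"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "eggs")
)
min_support <- 0.4

# Share of transactions that contain every item in `itemset`.
support_of <- function(itemset) {
  hits <- vapply(transactions, function(basket) all(itemset %in% basket), logical(1))
  sum(hits) / length(transactions)
}
is_frequent <- function(itemset) support_of(itemset) >= min_support

# Level 1: frequent single items.
items    <- sort(unique(unlist(transactions)))
current  <- Filter(is_frequent, as.list(items))
frequent <- list()
k <- 1

# Level k: merge pairs of frequent (k-1)-itemsets into size-k candidates,
# then keep only the candidates that meet the support threshold.
while (length(current) > 0) {
  frequent <- c(frequent, current)
  k <- k + 1
  candidates <- list()
  if (length(current) >= 2) {
    for (pair in combn(length(current), 2, simplify = FALSE)) {
      merged <- sort(union(current[[pair[1]]], current[[pair[2]]]))
      if (length(merged) == k) candidates[[paste(merged, collapse = ",")]] <- merged
    }
  }
  current <- Filter(is_frequent, unname(candidates))
}

for (itemset in frequent) cat("{", paste(itemset, collapse = ", "), "}\n")
```

Because an itemset can only be frequent if its subsets are, each level only has to consider combinations of itemsets that survived the previous level, which is what keeps the search tractable on 169 items.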
rules <- sort(rules, by="confidence", decreasing=TRUE)
rules <- rules[!is.redundant(rules)]
inspect(head(rules, 10))
##      lhs                          rhs                    support confidence      lift count
## [1]  {rice,
##       sugar}                   => {whole milk}       0.001220132  1.0000000  3.913649    12
## [2]  {canned fish,
##       hygiene articles}        => {whole milk}       0.001118454  1.0000000  3.913649    11
## [3]  {house keeping products,
##       whipped/sour cream}      => {whole milk}       0.001220132  0.9230769  3.612599    12
## [4]  {bottled water,
##       rice}                    => {whole milk}       0.001220132  0.9230769  3.612599    12
## [5]  {bottled beer,
##       soups}                   => {whole milk}       0.001118454  0.9166667  3.587512    11
## [6]  {grapes,
##       onions}                  => {other vegetables} 0.001118454  0.9166667  4.737476    11
## [7]  {hard cheese,
##       oil}                     => {other vegetables} 0.001118454  0.9166667  4.737476    11
## [8]  {cereals,
##       curd}                    => {whole milk}       0.001016777  0.9090909  3.557863    10
## [9]  {pastry,
##       sweet spreads}           => {whole milk}       0.001016777  0.9090909  3.557863    10
## [10] {liquor,
##       red/blush wine}          => {bottled beer}     0.001931876  0.9047619 11.235269    19
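The table above is ordered by confidence, whereas the assignment asks for rules ranked by lift. As a minimal sketch, the ten displayed rules can be transcribed into a data frame and reordered with base R. This only reorders the ten rules shown, not all 29, and the data frame name top_by_confidence is an assumption for illustration. With the rules object itself, passing by="lift" to sort() should give the ranking over the full set.

```r
# The ten rules printed above, transcribed by hand from the inspect() output.
top_by_confidence <- data.frame(
  rule = c(
    "{rice, sugar} => {whole milk}",
    "{canned fish, hygiene articles} => {whole milk}",
    "{house keeping products, whipped/sour cream} => {whole milk}",
    "{bottled water, rice} => {whole milk}",
    "{bottled beer, soups} => {whole milk}",
    "{grapes, onions} => {other vegetables}",
    "{hard cheese, oil} => {other vegetables}",
    "{cereals, curd} => {whole milk}",
    "{pastry, sweet spreads} => {whole milk}",
    "{liquor, red/blush wine} => {bottled beer}"
  ),
  confidence = c(1.0000000, 1.0000000, 0.9230769, 0.9230769, 0.9166667,
                 0.9166667, 0.9166667, 0.9090909, 0.9090909, 0.9047619),
  lift = c(3.913649, 3.913649, 3.612599, 3.612599, 3.587512,
           4.737476, 4.737476, 3.557863, 3.557863, 11.235269),
  stringsAsFactors = FALSE
)

# Reorder by lift, highest first; tied rules keep their original relative order.
by_lift <- top_by_confidence[order(-top_by_confidence$lift), ]
print(by_lift, row.names = FALSE)
```

Ranked this way, {liquor, red/blush wine} => {bottled beer} moves from last place to first, followed by the two rules that point to other vegetables.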
summary(rules)
## set of 29 rules
##
## rule length distribution (lhs + rhs):sizes
##  3
## 29
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       3       3       3       3       3       3
##
## summary of quality measures:
##      support          confidence          lift            count
##  Min.   :0.001017   Min.   :0.8000   Min.   : 3.131   Min.   :10.00
##  1st Qu.:0.001118   1st Qu.:0.8125   1st Qu.: 3.261   1st Qu.:11.00
##  Median :0.001220   Median :0.8462   Median : 3.613   Median :12.00
##  Mean   :0.001473   Mean   :0.8613   Mean   : 4.000   Mean   :14.48
##  3rd Qu.:0.001729   3rd Qu.:0.9091   3rd Qu.: 4.199   3rd Qu.:17.00
##  Max.   :0.002542   Max.   :1.0000   Max.   :11.235   Max.   :25.00
##
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.8
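The reported figures can be cross-checked with a little arithmetic. The sketch below uses only numbers printed in the output above (9835 transactions, the 0.001 support threshold, and rule [1] with count 12, confidence 1, and lift 3.913649); the variable names are assumptions for illustration.

```r
n_transactions <- 9835   # from the apriori() log
min_support    <- 0.001  # the supp argument passed to apriori()

# Smallest whole-number count whose relative support reaches the threshold.
min_count <- ceiling(min_support * n_transactions)

# Rule [1], {rice, sugar} => {whole milk}: support is count / transactions.
support_rule1 <- 12 / n_transactions

# Lift is confidence / support(rhs), so support(rhs) = confidence / lift.
implied_rhs_support <- 1.0 / 3.913649

print(c(min_count = min_count,
        support_rule1 = support_rule1,
        implied_rhs_support = implied_rhs_support))
```

The threshold works out to 9.835 transactions, so a rule needs at least 10 occurrences, which agrees with the minimum count of 10 in the summary (the log line reporting 9 is consistent with 9.835 truncated downward). The support of rule [1] reproduces the printed 0.001220132, and the implied support of whole milk is roughly 0.2555, meaning about a quarter of all receipts contain it.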
plot(rules, method="graph", layout=igraph::in_circle())
The frequency plot shows that in these data the five most purchased items, in order of frequency, are whole milk, other vegetables, rolls/buns, soda, and yogurt. Without maxlen specified, the apriori() function returns 410 association rules for these data, and the is.redundant() function reduces that number from 410 to 392. These rules, however, have antecedents with several items. To avoid overly long rules, the apriori() function is run with maxlen=3 specified. This reduces the number of rules from 410 to 29 and leaves the is.redundant() function with nothing to remove, since there are no redundancies in the 29 rules. Inspecting the top ten rules sorted by confidence shows that most of the associations are with whole milk and other vegetables, which are the two most purchased items. This relationship can also be seen in the plot of the network, where most arrows point toward whole milk and other vegetables. The one exception among the top ten is {liquor, red/blush wine} => {bottled beer}, whose lift of 11.235 is the maximum across all 29 rules (the median is 3.613).
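The redundancy filter can be illustrated with a simplified base-R sketch. It assumes the definition that a rule is redundant when a more general rule (same consequent, fewer antecedent items, all of them contained in the longer rule) has confidence at least as high. The rules, confidences, and the helper is_redundant_toy are made up for illustration and are not the arules implementation.

```r
# Hypothetical rules, made up for illustration: lhs items, rhs item, confidence.
rules_toy <- list(
  list(lhs = c("rice"),                  rhs = "whole milk", confidence = 0.90),
  list(lhs = c("rice", "sugar"),         rhs = "whole milk", confidence = 0.90),
  list(lhs = c("rice", "bottled water"), rhs = "whole milk", confidence = 0.95),
  list(lhs = c("sugar"),                 rhs = "whole milk", confidence = 0.60)
)

# A rule is flagged when some other rule has the same rhs, a strictly smaller
# lhs contained in this rule's lhs, and confidence at least as high.
is_redundant_toy <- function(rule, all_rules) {
  any(vapply(all_rules, function(other) {
    identical(other$rhs, rule$rhs) &&
      length(other$lhs) < length(rule$lhs) &&
      all(other$lhs %in% rule$lhs) &&
      other$confidence >= rule$confidence
  }, logical(1)))
}

flags <- vapply(rules_toy, is_redundant_toy, logical(1), all_rules = rules_toy)
print(flags)
```

Only the second toy rule is flagged: adding sugar to {rice} does not raise the confidence, so the longer rule says nothing the shorter one does not. Adding bottled water does raise it, so that rule is kept.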
https://rpubs.com/jeknov/movieRec
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
https://cran.r-project.org/web/packages/arulesViz/arulesViz.pdf
https://www.rdocumentation.org/packages/arules/versions/1.6-1/topics/apriori
https://www.rdocumentation.org/packages/arules/versions/1.6-1/topics/inspect
https://www.rdocumentation.org/packages/arules/versions/1.6-1/topics/read.transactions
https://www.rdocumentation.org/packages/arules/versions/1.6-1/topics/itemFrequencyPlot