Market Basket Analysis

Note: This exercise follows the Market Basket Analysis technique used by Salem Marafi (2014) in his Market Basket Analysis with R post where he analyzes a similar dataset.

Exercise

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of the items that went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.

library(arules)  # provides read.transactions(), apriori(), is.redundant(), inspect()

github <- "https://raw.githubusercontent.com/jzuniga123"
file <- "/SPS/master/DATA%20624/GroceryDataSet.csv"
Groceries <- read.transactions(paste0(github, file), format="basket", sep=",")

The read.transactions() function reads transaction data and creates an object type that can be analyzed with the arules package. The format parameter can be either basket (“each line in the transaction data file represents a transaction where the items are separated by the characters specified by sep”) or single (“each line corresponds to a single item, containing at least ids for the transaction and the item”).
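The difference between the two layouts is easier to see on a small file. The sketch below writes three made-up receipts (not taken from the Groceries data) to temporary files in both layouts and reads each back with read.transactions(). The file names and item names are illustrative, and the cols argument is assumed to identify the transaction-id column and the item column for the single layout.

```r
library(arules)

# Three hypothetical receipts in "basket" layout: one line per receipt.
basket_file <- tempfile(fileext = ".csv")
writeLines(c("milk,bread", "milk,eggs,butter", "bread"), basket_file)

# The same receipts in "single" layout: one line per (transaction id, item) pair.
single_file <- tempfile(fileext = ".csv")
writeLines(c("1,milk", "1,bread", "2,milk", "2,eggs", "2,butter", "3,bread"),
           single_file)

tr_basket <- read.transactions(basket_file, format = "basket", sep = ",")

# For the single layout, cols gives the id column and the item column.
tr_single <- read.transactions(single_file, format = "single", sep = ",",
                               cols = c(1, 2))

# Both objects should describe 3 transactions over the same 4 items.
dim(tr_basket)
dim(tr_single)
```

The Groceries file used in this exercise is in the basket layout, which is why only format and sep are supplied in the call above.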

Approach

According to Marafi, given items $I=\left\{ i_{ j },i_{ k },\ldots,i_{ n } \right\}$ and transactions $t_{ n }$ comprised of items $\left\{ i_{ j },i_{ k },\ldots,i_{ n } \right\}$, an association rule is a relationship such as $\left\{ i_{ 1 },i_{ 2 } \right\} \Rightarrow \left\{ i_{ 3 } \right\}$, in which the purchase of the antecedents implies the likely purchase of the consequent. The strength of the association is measured quantitatively in terms of Support, Confidence, and Lift. Support measures the share of transactions in the dataset that contain every item in the rule. Confidence measures how often the consequent appears in the transactions that contain the antecedents, that is, the conditional probability of the consequent given the antecedents. Lift is the ratio of the confidence to the overall frequency of the consequent; a value above 1 indicates that the antecedents make the consequent more likely than it is in the dataset as a whole.

Analysis involves starting with minimum thresholds for support and confidence, sorting the results, removing redundancies, then targeting items if desired. The itemFrequencyPlot() function, as its name states, creates a plot showing the frequency of the items in the dataset. The apriori() function mines association rules from transaction data. The is.redundant() function flags redundant rules so that they can be removed, and the inspect() function displays the resulting rules.
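As a rough illustration of how the three measures relate, the sketch below computes them by hand in base R for one rule on five made-up receipts. The data, the support() helper, and the variable names are hypothetical and are not part of the arules package or the Groceries data.

```r
# Toy data: five hypothetical receipts (not taken from the Groceries data).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# Share of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("bread", "butter")   # antecedents
rhs <- "milk"                 # consequent

supp <- support(c(lhs, rhs))  # share of baskets with antecedents and consequent
conf <- supp / support(lhs)   # share of antecedent baskets that also hold rhs
lift <- conf / support(rhs)   # confidence relative to the base rate of rhs

round(c(support = supp, confidence = conf, lift = lift), 3)
```

In this toy case the lift is below 1: milk is in four of the five baskets overall, but in only two of the three baskets that contain both bread and butter, so the antecedents make milk slightly less likely rather than more.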

Results

itemFrequencyPlot(Groceries, topN=20, type="relative")

rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8, maxlen=3))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules <- sort(rules, by="confidence", decreasing=TRUE)
rules <- rules[!is.redundant(rules)]
inspect(head(rules, 10))

##      lhs                         rhs                    support confidence      lift count
## [1]  {rice,                                                                               
##       sugar}                  => {whole milk}       0.001220132  1.0000000  3.913649    12
## [2]  {canned fish,                                                                        
##       hygiene articles}       => {whole milk}       0.001118454  1.0000000  3.913649    11
## [3]  {house keeping products,                                                             
##       whipped/sour cream}     => {whole milk}       0.001220132  0.9230769  3.612599    12
## [4]  {bottled water,                                                                      
##       rice}                   => {whole milk}       0.001220132  0.9230769  3.612599    12
## [5]  {bottled beer,                                                                       
##       soups}                  => {whole milk}       0.001118454  0.9166667  3.587512    11
## [6]  {grapes,                                                                             
##       onions}                 => {other vegetables} 0.001118454  0.9166667  4.737476    11
## [7]  {hard cheese,                                                                        
##       oil}                    => {other vegetables} 0.001118454  0.9166667  4.737476    11
## [8]  {cereals,                                                                            
##       curd}                   => {whole milk}       0.001016777  0.9090909  3.557863    10
## [9]  {pastry,                                                                             
##       sweet spreads}          => {whole milk}       0.001016777  0.9090909  3.557863    10
## [10] {liquor,                                                                             
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 11.235269    19
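The columns in this listing are tied together by simple arithmetic, which can be checked by hand. The sketch below uses only figures printed in the output (9835 transactions, plus the count, confidence, and lift of rules [1] and [10]); the variable names are illustrative.

```r
n <- 9835                        # transactions reported by apriori()

# Rule [1]: {rice, sugar} => {whole milk}, count 12, confidence 1, lift 3.913649
supp_rule1 <- 12 / n             # support = count / number of transactions
milk_share <- 1 / 3.913649       # lift = confidence / support(rhs), so
                                 # support(rhs) = confidence / lift

# Rule [10]: {liquor, red/blush wine} => {bottled beer},
# confidence 0.9047619, lift 11.235269
beer_share <- 0.9047619 / 11.235269

round(c(supp_rule1 = supp_rule1, milk_share = milk_share,
        beer_share = beer_share), 4)
```

The implied shares show why the lifts differ so much. Whole milk appears in roughly a quarter of all baskets, so even a rule with confidence 1 reaches a lift of only about 3.9, whereas bottled beer appears in roughly 8% of baskets, which gives rule [10] a lift above 11 despite its lower confidence.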

summary(rules)

## set of 29 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 29 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.001017   Min.   :0.8000   Min.   : 3.131   Min.   :10.00  
##  1st Qu.:0.001118   1st Qu.:0.8125   1st Qu.: 3.261   1st Qu.:11.00  
##  Median :0.001220   Median :0.8462   Median : 3.613   Median :12.00  
##  Mean   :0.001473   Mean   :0.8613   Mean   : 4.000   Mean   :14.48  
##  3rd Qu.:0.001729   3rd Qu.:0.9091   3rd Qu.: 4.199   3rd Qu.:17.00  
##  Max.   :0.002542   Max.   :1.0000   Max.   :11.235   Max.   :25.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.8

library(arulesViz)  # provides the plot() method for association rules
plot(rules, method="graph", layout=igraph::in_circle())

Interpretation

The frequency plot shows that in these data the five most purchased items, in order of frequency, are whole milk, other vegetables, rolls/buns, soda, and yogurt. Without the maxlen restriction, the apriori() function returns 410 association rules for these data, and using the is.redundant() function reduces the number of rules from 410 to 392. These rules, however, have antecedents with several items. To avoid overly long rules, the apriori() function is run with maxlen=3 specified. This reduces the number of rules from 410 to 29 and makes the is.redundant() step unnecessary, since there are no redundancies among the 29 rules. Inspecting the top ten rules sorted by confidence shows that most of the associations are with whole milk and other vegetables, which are the two most purchased items. This relationship can also be seen in the network plot, where most arrows point toward whole milk and other vegetables.
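The exercise asks for the top 10 rules by lift, while the listing in the Results is sorted by confidence. A minimal sketch of the lift-based ranking follows. It assumes that the Groceries data bundled with the arules package matches the CSV file read earlier (both contain 9835 transactions and 169 items, but this equivalence is an assumption), and the name top_lift is illustrative.

```r
library(arules)

# Assumption: the copy of Groceries shipped with arules stands in for the CSV.
data("Groceries")

# Same thresholds as in the Results section; verbose output is suppressed.
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, maxlen = 3),
                 control = list(verbose = FALSE))

# Drop redundant rules, then rank by lift instead of confidence.
rules <- rules[!is.redundant(rules)]
top_lift <- head(sort(rules, by = "lift", decreasing = TRUE), 10)

inspect(top_lift)
```

Ranking by lift favours rules whose consequent is relatively rare, so the order can differ noticeably from the confidence-based listing, in which frequently purchased consequents dominate.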


Jose Zuniga
