Validating Market Basket Analysis

Market basket analysis is a powerful tool for many modern companies: by recommending relevant products, it can boost revenue from customers who might not otherwise have purchased those items. However, when mining large datasets, there is always a risk that a given finding is due to pure chance, especially when thousands of candidate findings are examined. One way to verify a market basket rule is a chi-square test of independence, as I will demonstrate here.

Association rules

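library(arules)    # apriori() and the "transactions" class
library(stringr)   # str_split()
library(magrittr)  # %>% pipe

# Read one basket per line; with sep = ';' each comma-separated line stays in a single column
rec_data <- read.csv('GroceryDataSet.csv', header = FALSE, sep = ';')
# Split each basket on commas, then drop empty strings
rec_data <- rec_data %>% apply(1, str_split, ',')
rec_data <- rec_data %>% lapply(function(x){x[[1]][x[[1]] != '']})
basket <- as(rec_data, "transactions")
params <- new("APparameter", support = 0.001, confidence = .3, minlen = 1,
              maxlen = 2)
assoc_rules <- apriori(basket, parameter = params)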
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       2  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(basket, parameter = params): Mining stopped (maxlen reached).
## Only patterns up to a length of 2 returned!
##  done [0.00s].
## writing ... [239 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_by_lift <- sort(assoc_rules, by = 'lift', decreasing = TRUE)

knitr::kable(as(rules_by_lift[1:10,], 'data.frame'), format = 'pandoc', digits = 3L)
      rules                                          support   confidence   coverage     lift   count
----  --------------------------------------------  --------  -----------  ---------  -------  ------
86    {Instant food products} => {hamburger meat}      0.003        0.380      0.008   11.421      30
80    {popcorn} => {salty snack}                       0.002        0.310      0.007    8.192      22
35    {specialty fat} => {margarine}                   0.001        0.333      0.004    5.692      12
65    {liquor} => {bottled beer}                       0.005        0.422      0.011    5.241      46
52    {ketchup} => {domestic eggs}                     0.001        0.310      0.004    4.878      13
23    {canned fruit} => {citrus fruit}                 0.001        0.344      0.003    4.153      11
16    {potato products} => {pastry}                    0.001        0.357      0.003    4.014      10
138   {herbs} => {root vegetables}                     0.007        0.431      0.016    3.956      69
105   {rice} => {root vegetables}                      0.003        0.413      0.008    3.792      31
41    {abrasive cleaner} => {root vegetables}          0.001        0.371      0.004    3.408      13

The table above shows the ten rules with the highest lift. However, these patterns could be due to random chance. We can check the statistical significance of a market basket rule with a chi-square test of independence. As an example, we will use the rule that herbs lead to the purchase of root vegetables.
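To make the columns of the rules table concrete, the metrics for the herbs rule can be recomputed from raw counts. This is only a sketch: the totals of 160 baskets with herbs and 1,072 baskets with root vegetables are taken from the contingency table built in the next section, and the variable names are my own rather than part of the original analysis.

```r
n_baskets         <- 9835  # total transactions reported by apriori
both              <- 69    # baskets with herbs and root vegetables (the rule's count)
baskets_herbs     <- 160   # baskets containing herbs (69 + 91, from the next section)
baskets_root_vegs <- 1072  # baskets containing root vegetables (69 + 1003, same source)

support    <- both / n_baskets           # share of all baskets containing both items
coverage   <- baskets_herbs / n_baskets  # share of baskets containing the left-hand side
confidence <- both / baskets_herbs       # P(root vegetables | herbs)
lift       <- confidence / (baskets_root_vegs / n_baskets)  # confidence relative to the base rate

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 3)
```

Rounded to three digits, these values should match the 0.007, 0.431, 0.016 and 3.956 shown for rule 138 in the table.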

Chi-square test of independence

# Count the baskets in each cell of the 2x2 table: herbs (yes/no) by root vegetables (yes/no)
herbs_vegs <- length(Filter(function(x){('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))

nherbs_vegs <- length(Filter(function(x){!('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))

herbs_nvegs <- length(Filter(function(x){('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))

nherbs_nvegs <- length(Filter(function(x){!('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))

# Columns: baskets with herbs and without herbs; rows: with and without root vegetables
herbs <- c(herbs_vegs, herbs_nvegs)
n_herbs <- c(nherbs_vegs, nherbs_nvegs)

count_df <- data.frame(herbs, n_herbs, row.names = c('root vegetables', 'no root vegetables'))

chisq_results <- chisq.test(count_df)

chisq_results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  count_df
## X-squared = 170.56, df = 1, p-value < 2.2e-16

The chi-square test returns a very small p-value (below 2.2e-16), indicating that an association this strong would be very unlikely if herbs and root vegetables were purchased independently. We can also compare the expected counts with the actual counts to confirm this finding.

round(chisq_results$expected, 2)
##                     herbs n_herbs
## root vegetables     17.44 1054.56
## no root vegetables 142.56 8620.44
count_df
##                    herbs n_herbs
## root vegetables       69    1003
## no root vegetables    91    8672
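The two tables above are enough to reconstruct the reported statistic by hand, which can serve as a sanity check. The sketch below re-enters the observed counts as a matrix and applies the continuity-corrected Pearson formula, sum((|O - E| - 0.5)^2 / E). The names `observed`, `expected`, `yates_stat` and `p_value` are my own and not part of the original analysis.

```r
# Observed counts copied from count_df above (filled column by column)
observed <- matrix(c(69, 91, 1003, 8672), nrow = 2,
                   dimnames = list(c('root vegetables', 'no root vegetables'),
                                   c('herbs', 'n_herbs')))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic with Yates' continuity correction (0.5 taken off each |O - E|)
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(yates_stat, df = 1, lower.tail = FALSE)

round(yates_stat, 2)             # should agree with the 170.56 reported by chisq.test
observed[1, 1] / expected[1, 1]  # observed vs expected co-occurrence
```

The final ratio, observed over expected co-occurrence, is about 3.956, which is the lift that apriori reported for this rule. Lift and the chi-square test are therefore measuring the same departure from independence, with the test adding a statement about how surprising that departure is.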

Under independence we would expect only about 17 baskets to contain both herbs and root vegetables, yet 69 actually do. This gap indicates that the relationship is statistically significant and unlikely to be random variation in the data. A chi-square test is therefore an effective way to validate market basket rules statistically.
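Because this rule was picked out of 239 mined rules, one cautious extension (my own addition, not part of the analysis above) is to adjust the p-value for the number of rules that could have been tested. A minimal sketch using base R's `p.adjust` with a Bonferroni correction, recomputing the p-value from the reported statistic:

```r
# p-value implied by the reported statistic (X-squared = 170.56, df = 1)
p_value <- pchisq(170.56, df = 1, lower.tail = FALSE)

# Assumption: each of the 239 mined rules is treated as a separate test
n_rules <- 239
p_bonferroni <- p.adjust(p_value, method = 'bonferroni', n = n_rules)

p_bonferroni < 0.05  # TRUE here: the herbs rule survives the correction
```

Bonferroni is deliberately conservative. `p.adjust` also offers less strict methods such as 'holm' and 'BH', which may be preferable when a full vector of p-values, one per rule, is available.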