Validating Market Basket Analysis

Market basket analysis is a powerful tool for many modern companies: it lets them boost revenue by recommending relevant products that customers might not otherwise have purchased. However, when mining large datasets, there is always a risk that a given finding is due to pure chance, especially when thousands of candidate rules are examined. One way to verify a market basket rule is a chi-square test of independence, as I will demonstrate here.

Association rules

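library(arules)    # transactions class, APparameter, apriori()
library(stringr)   # str_split()
library(magrittr)  # %>% pipe (also exported by dplyr)

# Read one basket per row as a single column, since the file contains no ';'
rec_data <- read.csv('GroceryDataSet.csv', header = FALSE, sep = ';')
# Split each row on commas, then unwrap the result and drop empty strings
rec_data <- rec_data %>% apply(1, str_split, ',')
rec_data <- rec_data %>% lapply(function(x){x[[1]][x[[1]] != '']})
basket <- as(rec_data, "transactions")
# maxlen = 2 limits each rule to at most two items in total
params <- new("APparameter", support = 0.001, confidence = .3, minlen = 1,
              maxlen = 2)
assoc_rules <- apriori(basket, parameter = params)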

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       2  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2

## Warning in apriori(basket, parameter = params): Mining stopped (maxlen reached).
## Only patterns up to a length of 2 returned!

##  done [0.00s].
## writing ... [239 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_by_lift <- sort(assoc_rules, by = 'lift', decreasing = TRUE)

knitr::kable(as(rules_by_lift[1:10,], 'data.frame'), format = 'pandoc', digits = 3L)

      rules                                        support  confidence  coverage    lift  count
----  -------------------------------------------  -------  ----------  --------  ------  -----
86    {Instant food products} => {hamburger meat}    0.003       0.380     0.008  11.421     30
80    {popcorn} => {salty snack}                     0.002       0.310     0.007   8.192     22
35    {specialty fat} => {margarine}                 0.001       0.333     0.004   5.692     12
65    {liquor} => {bottled beer}                     0.005       0.422     0.011   5.241     46
52    {ketchup} => {domestic eggs}                   0.001       0.310     0.004   4.878     13
23    {canned fruit} => {citrus fruit}               0.001       0.344     0.003   4.153     11
16    {potato products} => {pastry}                  0.001       0.357     0.003   4.014     10
138   {herbs} => {root vegetables}                   0.007       0.431     0.016   3.956     69
105   {rice} => {root vegetables}                    0.003       0.413     0.008   3.792     31
41    {abrasive cleaner} => {root vegetables}        0.001       0.371     0.004   3.408     13

The table above shows the ten rules with the highest lift. However, it is possible that these patterns are due to random chance. We can check whether a rule is statistically significant using a chi-square test of independence. As an example, we will test the rule {herbs} => {root vegetables}, that is, whether baskets containing herbs are more likely to also contain root vegetables.
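To make the columns of the rule table concrete, the measures for {herbs} => {root vegetables} can be recomputed from raw counts. This is a minimal sketch, not a description of how arules computes them internally. The variable names are my own, and the counts are copied from outputs in this post: 9835 transactions from the apriori log, 69 joint baskets from the rule's count column, and the herbs and root vegetable totals from the contingency table in the next section.

```r
# Counts for {herbs} => {root vegetables}, copied from the outputs in this post
n_total <- 9835   # transactions reported by apriori()
n_herbs <- 160    # baskets containing herbs (69 + 91)
n_vegs  <- 1072   # baskets containing root vegetables (69 + 1003)
n_both  <- 69     # baskets containing both items (the rule's count)

support    <- n_both / n_total                  # P(herbs and root vegetables)
coverage   <- n_herbs / n_total                 # P(herbs)
confidence <- n_both / n_herbs                  # P(root vegetables | herbs)
lift       <- confidence / (n_vegs / n_total)   # confidence relative to P(root vegetables)

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 3)
```

The rounded values match the herbs row of the table (0.007, 0.431, 0.016, 3.956). A lift near 1 would mean herbs tell us nothing about root vegetables; a lift near 4 means root vegetables appear about four times as often in herb baskets as in baskets overall.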

Chi-square test of independence

# Count the baskets in each cell of the 2x2 contingency table ("n" prefix = item absent)
herbs_vegs <- length(Filter(function(x){('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))

nherbs_vegs <- length(Filter(function(x){!('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))

herbs_nvegs <- length(Filter(function(x){('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))

nherbs_nvegs <- length(Filter(function(x){!('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))

# Columns: herbs present / absent; rows: root vegetables present / absent
herbs <- c(herbs_vegs, herbs_nvegs)
n_herbs <- c(nherbs_vegs, nherbs_nvegs)

count_df <- data.frame(herbs, n_herbs, row.names = c('root vegetables', 'no root vegetables'))

# Test the null hypothesis that the two items are purchased independently
chisq_results <- chisq.test(count_df)

chisq_results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  count_df
## X-squared = 170.56, df = 1, p-value < 2.2e-16

The chi-square test returns a very small p-value (below 2.2e-16), so we reject the hypothesis that herbs and root vegetables are purchased independently; an association this strong is very unlikely to arise by pure chance. We can also compare the counts expected under independence with the observed counts to confirm this finding.

round(chisq_results$expected, 2)

##                     herbs n_herbs
## root vegetables     17.44 1054.56
## no root vegetables 142.56 8620.44

count_df

##                    herbs n_herbs
## root vegetables       69    1003
## no root vegetables    91    8672
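The expected counts and the test statistic reported above can be reproduced by hand, which shows what chisq.test is doing. The sketch below hard-codes the observed counts from count_df so that it runs on its own; the variable names are my own. It relies on R's documented default of applying Yates' continuity correction to 2x2 tables.

```r
# Observed counts, copied from count_df above (the matrix is filled column by column)
observed <- matrix(c(69, 91, 1003, 8672), nrow = 2,
                   dimnames = list(c("root vegetables", "no root vegetables"),
                                   c("herbs", "n_herbs")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic with Yates' continuity correction (subtract 0.5 from each |O - E|)
statistic <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-square distribution with 1 degree of freedom
p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)

round(expected, 2)
round(statistic, 2)
p_value
```

The statistic rounds to 170.56, in agreement with the chisq.test output, and every cell deviates from its expected count by the same absolute amount, as is always the case in a 2x2 table.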

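Under independence we would expect only about 17 baskets to contain both herbs and root vegetables, but we observe 69. The ratio of observed to expected co-occurrences, 69 / 17.44 ≈ 3.96, is the lift of the rule reported earlier, so the association is both statistically significant and large in practical terms rather than a random fluctuation in the data. This is an effective way to validate your market basket analyses statistically.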
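The same check can be wrapped in a small function so that it is easy to repeat for other rules. The helper below, rule_chisq, is hypothetical (it is not part of arules), and the toy baskets are made up purely to show the call; they are far too few for a meaningful test, so R will warn that the chi-square approximation may be incorrect.

```r
# Hypothetical helper: chi-square test of independence for a rule lhs => rhs.
# `baskets` is a list of character vectors, like rec_data in this post.
rule_chisq <- function(baskets, lhs, rhs) {
  has_lhs <- vapply(baskets, function(items) lhs %in% items, logical(1))
  has_rhs <- vapply(baskets, function(items) rhs %in% items, logical(1))
  # Fixing the factor levels keeps the table 2x2 even when a cell is empty
  counts <- table(rhs = factor(has_rhs, levels = c(TRUE, FALSE)),
                  lhs = factor(has_lhs, levels = c(TRUE, FALSE)))
  list(counts = counts, test = chisq.test(counts))
}

# Toy baskets, invented for illustration only
toy_baskets <- list(
  c("herbs", "root vegetables"),
  c("herbs", "root vegetables", "rice"),
  c("herbs", "whole milk"),
  c("root vegetables"),
  c("whole milk"),
  c("rice", "whole milk"),
  c("bottled beer"),
  c("whole milk", "pastry")
)

result <- rule_chisq(toy_baskets, "herbs", "root vegetables")
result$counts
```

On the real data, rule_chisq(rec_data, "herbs", "root vegetables") should yield the same contingency table as count_df, since the rows and columns are laid out the same way. When many rules are tested this way, the resulting p-values should be corrected for multiple comparisons, for example with p.adjust, because some rules will look significant by chance alone.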
