Market basket analysis is a powerful tool for many modern companies: recommending relevant products lets them boost revenue from customers who might not otherwise have bought those items. However, when mining large datasets there is always a risk that a given finding is due to pure chance, especially when thousands of candidate rules are examined. One way to check a market basket rule is a chi-square test of independence, as I will demonstrate here.
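library(arules)    # transactions class and apriori()
library(stringr)   # str_split()
library(magrittr)  # %>% pipe
# Read with sep = ';' so that each comma-separated row arrives as a single field
rec_data <- read.csv('GroceryDataSet.csv', header = FALSE, sep = ';')
# Split each row on commas, then drop the empty strings
rec_data <- rec_data %>% apply(1, str_split, ',')
rec_data <- rec_data %>% lapply(function(x){x[[1]][x[[1]] != '']})
basket <- as(rec_data, "transactions")
# Mine rules of at most two items with minimum support 0.001 and confidence 0.3
params <- new("APparameter", support = 0.001, confidence = .3, minlen = 1,
maxlen = 2)
assoc_rules <- apriori(basket, parameter = params)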
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(basket, parameter = params): Mining stopped (maxlen reached).
## Only patterns up to a length of 2 returned!
## done [0.00s].
## writing ... [239 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_by_lift <- sort(assoc_rules, by = 'lift', decreasing = TRUE)
knitr::kable(as(rules_by_lift[1:10,], 'data.frame'), format = 'pandoc', digits = 3L)
| | rules | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| 86 | {Instant food products} => {hamburger meat} | 0.003 | 0.380 | 0.008 | 11.421 | 30 |
| 80 | {popcorn} => {salty snack} | 0.002 | 0.310 | 0.007 | 8.192 | 22 |
| 35 | {specialty fat} => {margarine} | 0.001 | 0.333 | 0.004 | 5.692 | 12 |
| 65 | {liquor} => {bottled beer} | 0.005 | 0.422 | 0.011 | 5.241 | 46 |
| 52 | {ketchup} => {domestic eggs} | 0.001 | 0.310 | 0.004 | 4.878 | 13 |
| 23 | {canned fruit} => {citrus fruit} | 0.001 | 0.344 | 0.003 | 4.153 | 11 |
| 16 | {potato products} => {pastry} | 0.001 | 0.357 | 0.003 | 4.014 | 10 |
| 138 | {herbs} => {root vegetables} | 0.007 | 0.431 | 0.016 | 3.956 | 69 |
| 105 | {rice} => {root vegetables} | 0.003 | 0.413 | 0.008 | 3.792 | 31 |
| 41 | {abrasive cleaner} => {root vegetables} | 0.001 | 0.371 | 0.004 | 3.408 | 13 |
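To make the columns of this table concrete, here is a minimal sketch that recomputes support, confidence, coverage, and lift for {herbs} => {root vegetables} from raw basket counts. The counts are the ones reported elsewhere in this post (9835 transactions; 69 baskets with both items, 91 with herbs only, 1003 with root vegetables only); the variable names are my own and are only for illustration.

```r
# Counts taken from the output shown in this post
n_total       <- 9835        # all transactions
n_both        <- 69          # baskets with herbs and root vegetables
baskets_herbs <- 69 + 91     # baskets containing herbs
baskets_roots <- 69 + 1003   # baskets containing root vegetables

support    <- n_both / n_total                       # P(herbs and root vegetables)
confidence <- n_both / baskets_herbs                 # P(root vegetables | herbs)
coverage   <- baskets_herbs / n_total                # P(herbs)
lift       <- confidence / (baskets_roots / n_total) # confidence relative to P(root vegetables)

# Compare with the herbs row of the table: 0.007, 0.431, 0.016, 3.956
round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 3)
```

A lift above 1 means the two items appear together more often than they would if they were bought independently, which is exactly the hypothesis a test of independence examines.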
Above are the ten rules with the highest lift. However, it is possible that these patterns in the data are due to random chance. We can check the significance of a market basket rule using a chi-square test of independence. As an example, we will use the rule that buying herbs leads to buying root vegetables.
# Count the baskets in each cell of the 2x2 table (herbs yes/no by root vegetables yes/no)
herbs_vegs <- length(Filter(function(x){('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))
nherbs_vegs <- length(Filter(function(x){!('herbs' %in% x) && ('root vegetables' %in% x)}, rec_data))
herbs_nvegs <- length(Filter(function(x){('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))
nherbs_nvegs <- length(Filter(function(x){!('herbs' %in% x) && !('root vegetables' %in% x)}, rec_data))
# Columns: herbs present / absent; rows: root vegetables present / absent
herbs <- c(herbs_vegs, herbs_nvegs)
n_herbs <- c(nherbs_vegs, nherbs_nvegs)
count_df <- data.frame(herbs, n_herbs, row.names = c('root vegetables', 'no root vegetables'))
# Test the null hypothesis that the two purchases are independent
chisq_results <- chisq.test(count_df)
chisq_results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: count_df
## X-squared = 170.56, df = 1, p-value < 2.2e-16
The chi-square test returns a p-value below 2.2e-16, so we reject the null hypothesis that herb purchases and root vegetable purchases are independent; an association this strong is very unlikely to arise by chance alone. We can also compare the expected counts under independence with the observed counts to see the direction and size of the effect.
round(chisq_results$expected, 2)
## herbs n_herbs
## root vegetables 17.44 1054.56
## no root vegetables 142.56 8620.44
count_df
## herbs n_herbs
## root vegetables 69 1003
## no root vegetables 91 8672
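For readers who want to see where these numbers come from, the sketch below rebuilds the expected counts and the test statistic by hand in base R. It hard-codes the observed counts printed above rather than reading the data file, and the names `obs`, `expected`, `yates`, and `x2` are illustrative. The continuity correction follows the usual Yates form for a 2x2 table.

```r
# Observed 2x2 table, filled column by column to match count_df
obs <- matrix(c(69, 91, 1003, 8672), nrow = 2,
              dimnames = list(c('root vegetables', 'no root vegetables'),
                              c('herbs', 'n_herbs')))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction shrinks each |observed - expected| by at most 0.5
yates <- min(0.5, abs(obs - expected))
x2 <- sum((abs(obs - expected) - yates)^2 / expected)

# Upper-tail p-value with (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

round(expected, 2)
x2
```

The value of `x2` should agree with the `X-squared` statistic reported by `chisq.test`, which applies the same correction to 2x2 tables by default.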
Under independence we would expect only about 17 baskets to contain both herbs and root vegetables, but we observe 69. That is roughly four times as many, which matches the rule's lift of 3.956 and indicates a real association rather than random variation in the data. This is an effective way to validate your market basket analyses statistically.
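Since this rule was picked from the 239 rules that were mined, a cautious reader may want to account for multiple comparisons. The sketch below is an assumption on my part and not part of the analysis above: it applies a simple Bonferroni threshold at a significance level of 0.05 and checks whether the reported statistic of 170.56 still clears it. The variable names are illustrative.

```r
n_rules <- 239    # number of rules mined by apriori above
alpha   <- 0.05   # assumed family-wise significance level

# Bonferroni: each individual test must beat alpha / number of tests
bonferroni_threshold <- alpha / n_rules

# p-value for the reported chi-square statistic with 1 degree of freedom
p_herbs <- pchisq(170.56, df = 1, lower.tail = FALSE)

# TRUE if the rule remains significant after the correction
p_herbs < bonferroni_threshold
```

Bonferroni is deliberately conservative. When testing every mined rule at once, `p.adjust` in base R offers this and less strict alternatives through its `method` argument.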