Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of what went into a customer's basket, hence the term 'Market Basket Analysis'.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(arules)
library(arulesViz)
We will load the GroceryDataSet.csv file into an arules transactions object.
grocery_data_raw = read.csv("GroceryDataSet.csv")
dim(grocery_data_raw)
## [1] 9834   32
Note that read.csv mangles this ragged basket format: it consumes the first basket as a header row (hence 9834 rows instead of 9835) and pads every basket out to 32 columns, the size of the largest basket. The arules reader handles the format correctly.
# Read the file directly into an arules transactions object.
grocery_transactions = read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocery_transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
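As an optional sanity check, we can print a few raw baskets with the arules function inspect():
# Print the first three transactions as item sets.
inspect(head(grocery_transactions, 3))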
There are 9835 transactions with 169 distinct items. Whole milk, other vegetables, rolls/buns, soda, and yogurt are the five most frequent items, as seen when we use itemFrequencyPlot to visualize the 20 most frequent items.
plot_item_frequencies_ranked = itemFrequencyPlot(grocery_transactions, topN = 20, type = "absolute", main = "Frequency Ranked")
We now display rules computed by the apriori algorithm, keeping those with support \(P(A \cap B) \geq 0.001\) and confidence \(P(A \cap B)/P(A) \geq 0.3\). By choosing a relatively high confidence level and a support of 0.1%, we require at least 9-10 transactions involving both \(A\) and \(B\), which reduces the risk that the discovered associations are statistically spurious.
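As a quick check of that threshold arithmetic (length() on a transactions object returns the number of transactions):
# 0.1% of 9835 transactions is ~9.8, which matches the absolute
# minimum support count of 9 that apriori reports below.
0.001 * length(grocery_transactions)
## [1] 9.835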
rules <- apriori(grocery_transactions, parameter = list(supp = 0.001, conf = 0.3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [13770 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
Nonetheless, we still obtain 13,770 rules. We rank them by lift and display the top 10 below.
rules_lift = sort(rules, by = "lift", decreasing = TRUE)  # Rank rules by lift
top_rules_by_lift = DATAFRAME(head(rules_lift, n = 10))
top_rules_by_lift %>% kable(digits = 3) %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

|      | LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|
| 252 | {bottled beer,red/blush wine} | {liquor} | 0.002 | 0.396 | 0.005 | 35.716 | 19 |
| 906 | {ham,white bread} | {processed cheese} | 0.002 | 0.380 | 0.005 | 22.928 | 19 |
| 251 | {bottled beer,liquor} | {red/blush wine} | 0.002 | 0.413 | 0.005 | 21.494 | 19 |
| 311 | {Instant food products,soda} | {hamburger meat} | 0.001 | 0.632 | 0.002 | 18.996 | 12 |
| 1269 | {curd,sugar} | {flour} | 0.001 | 0.324 | 0.003 | 18.608 | 11 |
| 1193 | {baking powder,sugar} | {flour} | 0.001 | 0.312 | 0.003 | 17.973 | 10 |
| 905 | {processed cheese,white bread} | {ham} | 0.002 | 0.463 | 0.004 | 17.803 | 19 |
| 281 | {popcorn,soda} | {salty snack} | 0.001 | 0.632 | 0.002 | 16.698 | 12 |
| 1192 | {baking powder,flour} | {sugar} | 0.001 | 0.556 | 0.002 | 16.408 | 10 |
| 904 | {ham,processed cheese} | {white bread} | 0.002 | 0.633 | 0.003 | 15.045 | 19 |
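For reference, lift compares how often \(A\) and \(B\) occur together with what independence would predict:

\[ \text{lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)} = \frac{\text{confidence}(A \Rightarrow B)}{P(B)} \]

A lift of 35.7 for the top rule therefore means liquor appears in baskets containing bottled beer and red/blush wine about 36 times more often than its base rate would suggest.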
The top rule, for example, associates the purchase of bottled beer and red/blush wine with an additional purchase of liquor. The support \(P(A \cap B)\) is 0.2%. The confidence of 0.396 tells us that 39.6% of baskets containing the former also include the latter (liquor), and with a lift of 35.7 this association is extremely unlikely to be due to chance. We also notice a rule in the opposite direction: the third rule is effectively an association of bottled beer and liquor with the purchase of red/blush wine. As these rules are simply correlations among triples of items, we cannot really infer causality from them.
Rather, we must judge the reasonableness of an observed association using common sense, business knowledge, or other quantitative evidence. For example, the fourth rule, (Instant food products, soda) \(\implies\) hamburger meat, does not quite make sense to me. While soda and hamburger meat might be associated with a BBQ, it is unclear what role instant food products would play.
We should also be cautious about inferring product placement strategies from association rules. While some associations may suggest moving associated products closer together to increase revenues, the causality may actually run in the other direction: some products are traditionally placed together. For example, baking powder, sugar and flour are usually in the same aisle. Relocating products to strengthen one association (and increase its revenues) may weaken other associations (and decrease theirs). It may be helpful to search the rule set for related associations and inspect the net impact, as sketched below.
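A minimal sketch of such a search, assuming we want every mined rule that mentions flour on either side (arules' subset() supports %in% on lhs and rhs):
# Rank flour-related rules by support to gauge which associations
# a relocation would most affect.
flour_rules <- subset(rules, lhs %in% "flour" | rhs %in% "flour")
inspect(head(sort(flour_rules, by = "support"), 10))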
Lastly, we visualize the rules with lift on the y-axis and support on the x-axis; the color intensity reflects the confidence of each rule. This lets us check whether any rules with lower lift have offsetting greater support, since more support means more revenue could be impacted by a good marketing decision.
plot(rules, method = "scatterplot", measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
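As an optional complement, arulesViz can also draw rules as an item network; a sketch for the ten highest-lift rules:
# Visualize the top 10 rules as a graph of items connected by rules.
plot(head(rules_lift, n = 10), method = "graph")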
To perform the clustering analysis, we convert the transactions and items into a large matrix, much like the term-document matrix used in text processing. In text processing, the matrix has \(N\) rows representing a vocabulary of \(N\) words and \(M\) columns representing the documents in which the words may be found. In market basket analysis, the matrix has \(N\) rows for items and \(M\) columns for transactions. The entry \(M[i,j]\) is 1 if item \(i\) appears in transaction \(j\) and 0 otherwise, since the Groceries data records presence rather than counts.
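We can confirm this orientation directly on the arules object; the underlying itemMatrix stores items in rows and transactions in columns:
dim(grocery_transactions@data)
## [1]  169 9835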
We use the FactoMineR and factoextra packages to facilitate plotting the clusters after extracting the matrix from the arules transactions object.
library(FactoMineR)
it_df = as.matrix(grocery_transactions@data) %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
# Make the labels into a column.
it_df$row_name = grocery_transactions@itemInfo$labels
# Then do a hack to convert an explicit column to row names.
it_df %>% column_to_rownames(var = "row_name") -> it_df
Now we apply hierarchical clustering on principal components (HCPC) to obtain the main types of market baskets.
# Convert the logical matrix to a numeric one. This is the fastest way.
it_df = 1 * it_df
res.PCA <- PCA(it_df, ncp = 10, scale.unit = FALSE, graph = FALSE)
res.HCPC <- HCPC(res.PCA, nb.clust = 6, consol = TRUE, graph = TRUE)
The HCPC plot below shows the food items grouped on a principal components plot using PC dimensions one and two.
The plot suggests that whole milk and other vegetables each form their own clusters. This means that business decisions on milk, soda, rolls/buns and vegetables need to be made individually, and those decisions will be consequential. It also suggests that the clusters are highly imbalanced, contrary to our preference for balanced groups.
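To quantify that imbalance, we can tabulate the cluster sizes (HCPC returns the assignments in data.clust$clust):
# Count how many items fall into each of the six clusters.
table(res.HCPC$data.clust$clust)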
library(factoextra)
fviz_cluster(res.HCPC, repel = TRUE, show.clust.cent = TRUE, palette = "jco", ggtheme = theme_minimal(), main = "Grocery Factor Map")