Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of what went into a customer's basket, hence the term 'Market Basket Analysis'.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(arules)
library(arulesViz)
We will load the GroceryDataSet.csv file into an arules transactions object.
grocery_data_raw = read.csv("GroceryDataSet.csv")
dim(grocery_data_raw)
## [1] 9834   32
Note that read.csv mangles this ragged basket format: it consumes the first basket as a header row (hence 9834 rows instead of 9835) and pads every basket out to 32 columns, the size of the largest basket. The arules reader handles the format correctly.
# Read the file directly into an arules transactions object.
grocery_transactions = read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocery_transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
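As an optional sanity check, we can print a few raw baskets with the arules function inspect():
# Print the first three transactions as item sets.
inspect(head(grocery_transactions, 3))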
There are 9835 transactions with 169 distinct items. Whole milk, other vegetables, rolls/buns, soda, and yogurt are the five most frequent items, as seen when we use itemFrequencyPlot to visualize the 20 most frequent items.
plot_item_frequencies_ranked = itemFrequencyPlot(grocery_transactions, topN = 20, type = "absolute", main = "Frequency Ranked")
We now display rules computed by the apriori algorithm, keeping those with support \(P(A \cap B) \geq 0.001\) and confidence \(P(A \cap B)/P(A) \geq 0.3\). By choosing a relatively high confidence level and a support of 0.1%, we require at least 9-10 transactions involving both \(A\) and \(B\), which reduces the risk that the discovered associations are statistically spurious.
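As a quick check of that threshold arithmetic (length() on a transactions object returns the number of transactions):
# 0.1% of 9835 transactions is ~9.8, which matches the absolute
# minimum support count of 9 that apriori reports below.
0.001 * length(grocery_transactions)
## [1] 9.835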
rules <- apriori(grocery_transactions, parameter = list(supp = 0.001, conf = 0.3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [13770 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
Nonetheless, we still obtain 13,770 rules. We rank them by lift and display the top 10 below.
rules_lift = sort(rules, by = "lift", decreasing = TRUE)  # Rank rules by lift
top_rules_by_lift = DATAFRAME(head(rules_lift, n = 10))
top_rules_by_lift %>% kable(digits = 3) %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

|      | LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|
| 252 | {bottled beer,red/blush wine} | {liquor} | 0.002 | 0.396 | 0.005 | 35.716 | 19 |
| 906 | {ham,white bread} | {processed cheese} | 0.002 | 0.380 | 0.005 | 22.928 | 19 |
| 251 | {bottled beer,liquor} | {red/blush wine} | 0.002 | 0.413 | 0.005 | 21.494 | 19 |
| 311 | {Instant food products,soda} | {hamburger meat} | 0.001 | 0.632 | 0.002 | 18.996 | 12 |
| 1269 | {curd,sugar} | {flour} | 0.001 | 0.324 | 0.003 | 18.608 | 11 |
| 1193 | {baking powder,sugar} | {flour} | 0.001 | 0.312 | 0.003 | 17.973 | 10 |
| 905 | {processed cheese,white bread} | {ham} | 0.002 | 0.463 | 0.004 | 17.803 | 19 |
| 281 | {popcorn,soda} | {salty snack} | 0.001 | 0.632 | 0.002 | 16.698 | 12 |
| 1192 | {baking powder,flour} | {sugar} | 0.001 | 0.556 | 0.002 | 16.408 | 10 |
| 904 | {ham,processed cheese} | {white bread} | 0.002 | 0.633 | 0.003 | 15.045 | 19 |
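For reference, lift compares how often \(A\) and \(B\) occur together with what independence would predict:

\[ \text{lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)} = \frac{\text{confidence}(A \Rightarrow B)}{P(B)} \]

A lift of 35.7 for the top rule therefore means liquor appears in baskets containing bottled beer and red/blush wine about 36 times more often than its base rate would suggest.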
The top rule, for example, associates the purchase of bottled beer and red/blush wine with an additional purchase of liquor. The support \(P(A \cap B)\) is 0.2%. The confidence of 0.396 tells us that 39.6% of baskets containing the former also include the latter (liquor), and with a lift of 35.7 this association is extremely unlikely to be due to chance. We also notice a rule in the opposite direction: the third rule is effectively an association of bottled beer and liquor with the purchase of red/blush wine. As these rules are simply correlations among triples of items, we cannot really infer causality from them.
Rather, we must judge the reasonableness of an observed association using common sense, business knowledge, or other quantitative evidence. For example, the fourth rule, (Instant food products, soda) \(\implies\) hamburger meat, does not quite make sense to me. While soda and hamburger meat might be associated with a BBQ, it is unclear what role instant food products would play.
We should also be cautious about inferring product placement strategies from association rules. While some associations may suggest moving associated products closer together to increase revenues, the causality may actually run in the other direction: some products are traditionally placed together. For example, baking powder, sugar and flour are usually in the same aisle. Relocating products to strengthen one association (and increase its revenues) may weaken other associations (and decrease theirs). It may be helpful to search the rule set for related associations and inspect the net impact, as sketched below.
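A minimal sketch of such a search, assuming we want every mined rule that mentions flour on either side (arules' subset() supports %in% on lhs and rhs):
# Rank flour-related rules by support to gauge which associations
# a relocation would most affect.
flour_rules <- subset(rules, lhs %in% "flour" | rhs %in% "flour")
inspect(head(sort(flour_rules, by = "support"), 10))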
Lastly, we visualize the rules with lift on the y-axis and support on the x-axis; the color intensity reflects the confidence of each rule. This lets us check whether any rules with lower lift have offsetting greater support, since more support means more revenue could be impacted by a good marketing decision.
plot(rules, method = "scatterplot", measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
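As an optional complement, arulesViz can also draw rules as an item network; a sketch for the ten highest-lift rules:
# Visualize the top 10 rules as a graph of items connected by rules.
plot(head(rules_lift, n = 10), method = "graph")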
To perform the clustering analysis, we convert the transactions and items into a large matrix, much like the term-document matrix used in text processing. In text processing, the matrix has \(N\) rows representing a vocabulary of \(N\) words and \(M\) columns representing the documents in which the words may be found. In market basket analysis, the matrix has \(N\) rows for items and \(M\) columns for transactions. The entry \(M[i,j]\) is 1 if item \(i\) appears in transaction \(j\) and 0 otherwise, since the Groceries data records presence rather than counts.
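We can confirm this orientation directly on the arules object; the underlying itemMatrix stores items in rows and transactions in columns:
dim(grocery_transactions@data)
## [1]  169 9835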
We use the FactoMineR and factoextra packages to facilitate plotting the clusters after extracting the matrix from the arules transactions object.
library(FactoMineR)
it_df = as.matrix(grocery_transactions@data) %>% as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
# Make the labels into a column.
it_df$row_name = grocery_transactions@itemInfo$labels
# Then do a hack to convert an explicit column to row names.
it_df %>% column_to_rownames(var = "row_name") -> it_df
Now we apply hierarchical clustering on principal components (HCPC) to obtain the main types of market baskets.
# Convert the logical matrix to a numeric one. This is the fastest way.
it_df = 1 * it_df
res.PCA <- PCA(it_df, ncp = 10, scale.unit = FALSE, graph = FALSE)
res.HCPC <- HCPC(res.PCA, nb.clust = 6, consol = TRUE, graph = TRUE)
The HCPC plot below shows the food items grouped on a principal components plot using PC dimensions one and two.
The plot suggests that whole milk and other vegetables each form their own clusters. This means that business decisions on milk, soda, rolls/buns and vegetables need to be made individually, and those decisions will be consequential. It also suggests that the clusters are highly imbalanced, contrary to our preference for balanced groups.
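To quantify that imbalance, we can tabulate the cluster sizes (HCPC returns the assignments in data.clust$clust):
# Count how many items fall into each of the six clusters.
table(res.HCPC$data.clust$clust)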
library(factoextra)
fviz_cluster(res.HCPC, repel = TRUE, show.clust.cent = TRUE, palette = "jco", ggtheme = theme_minimal(), main = "Grocery Factor Map")