Importing Data
Data Exploration and Preparation
Model Training
Model Evaluation
Market basket analysis works behind the scenes in the recommendation systems of many brick-and-mortar and online retailers. The learned association rules indicate which combinations of items are often purchased together.
In this tutorial, we will perform a market basket analysis of transactional data from a grocery store.
However, the techniques could be applied to many different types of problems, from movie recommendations, to dating sites, to finding dangerous interactions among medications.
Our market basket analysis will utilize the purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions or about 327 transactions per day.
First, load the data file groceries.csv, available from Canvas under Module Week 8.
groceries <- read.csv("groceries.csv")
Transactional data is stored in a slightly different format than the one we used previously.
Most of our prior analyses utilized data in matrix form, where rows indicated example instances and columns indicated features.
Let’s first browse the data. What differences do you notice? What problems do you notice?
str(groceries)
## 'data.frame':    15295 obs. of  4 variables:
##  $ citrus.fruit       : chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
##  $ semi.finished.bread: chr  "yogurt" "" "yogurt" "whole milk" ...
##  $ margarine          : chr  "coffee" "" "cream cheese" "condensed milk" ...
##  $ ready.soups        : chr  "" "" "meat spreads" "long life bakery product" ...
Why not just store this as a data frame as we did in most of our analyses?
A conventional data structure quickly becomes too large to fit in the available memory with transactional data.
We need a new data structure that does not treat a transaction as a set of positions to be filled (or not filled) with specific items. A sparse matrix meets this need:
Row: Each row in the sparse matrix indicates a transaction.
Column: The sparse matrix has a column (that is, feature) for every item.
Memory: A sparse matrix does not actually store the full matrix in memory; it only stores the cells that are occupied by an item.
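As a rough illustration of the memory point, the following base R sketch compares how many cells a dense matrix would hold with how many are actually occupied. The three small baskets and all variable names are illustrative assumptions, and this is not how arules stores its data internally; it only shows the idea of keeping the occupied cells.

```r
# Three small baskets stored as a list: each basket keeps only its own items.
baskets <- list(
  c("citrus fruit", "semi-finished bread", "margarine", "ready soups"),
  c("tropical fruit", "yogurt", "coffee"),
  c("whole milk")
)

# Every distinct item becomes one column of the matrix.
items <- sort(unique(unlist(baskets)))

# Dense representation: one logical cell per (transaction, item) pair.
dense <- t(sapply(baskets, function(b) items %in% b))
colnames(dense) <- items

cells_total    <- length(dense)  # cells a dense matrix must store
cells_occupied <- sum(dense)     # cells a sparse matrix would store
density        <- cells_occupied / cells_total

cat("total cells:   ", cells_total, "\n")
cat("occupied cells:", cells_occupied, "\n")
cat("density:       ", round(density, 3), "\n")
```

With only three baskets, the dense form is already two-thirds empty. With thousands of transactions and hundreds of items, the empty share grows much larger.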
To create a sparse matrix, first install the arules package and then load it.
#install.packages("arules")
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
##     abbreviate, write
#Create a sparse matrix
groceries <- read.transactions("groceries.csv", sep = ",")
#Explore the sparse matrix
summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
The density value of 0.02609146 (2.6 percent) refers to the proportion of nonzero matrix cells.
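The density figure can be turned back into a count of purchased items. The short calculation below is a sketch that uses only numbers from the summary() output; the variable names are our own.

```r
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Number of cells in the full transaction-by-item matrix.
total_cells <- n_transactions * n_items        # 1662115

# Occupied cells, i.e. items purchased across all transactions
# (ignoring repeat purchases of the same item within a transaction).
items_purchased <- total_cells * density
round(items_purchased)                         # 43367

# Average number of distinct items per transaction.
round(items_purchased / n_transactions, 3)     # 4.409
```

The last value agrees with the mean transaction size reported by summary().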
To look at the contents of the sparse matrix, use the inspect() function in combination with the vector operators. The first three transactions can be viewed as follows.
inspect(groceries[1:3])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
We can visualize the transaction data by plotting the sparse matrix. To do so, use the image() function.
The resulting diagram depicts a matrix with 5 rows and 169 columns, indicating the 5 transactions and 169 possible items we requested.
#Display the sparse matrix for the first five transactions
image(groceries[1:5])
This visualization will not be as useful for extremely large transaction databases, because the cells will be too small to discern.
Still, by combining it with the sample() function, you can view the sparse matrix for a randomly sampled set of transactions.
image(sample(groceries, 100))
We can view the frequency of a certain item among all the transactions by using the itemFrequency() function.
#To view the support level for the first three items in the grocery data:
itemFrequency(groceries[, 1:3])
## abrasive cleaner artif. sweetener   baby cosmetics
##     0.0035587189     0.0032536858     0.0006100661
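The values returned by itemFrequency() are support levels, that is, the share of transactions that contain each item. A minimal base R sketch of that calculation might look like this. The baskets are invented for illustration, and item_support() is a hypothetical helper, not an arules function.

```r
# Invented baskets, used purely for illustration.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "soda"),
  c("rolls/buns"),
  c("whole milk", "rolls/buns", "yogurt")
)

# Support of an item = share of baskets that contain it.
item_support <- function(item, baskets) {
  mean(vapply(baskets, function(b) item %in% b, logical(1)))
}

item_support("whole milk", baskets)  # in 3 of 4 baskets
item_support("soda", baskets)        # in 1 of 4 baskets
```

Read the same way, the support of 0.0035587189 for abrasive cleaner corresponds to about 35 of the 9,835 transactions.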
To present these statistics visually, use the itemFrequencyPlot() function. As shown in the following plot, this results in a bar chart showing the eight items in the groceries data with at least 10 percent support:
itemFrequencyPlot(groceries, support = 0.1)
If you would rather limit the plot to a specific number of items, use the topN parameter of itemFrequencyPlot():
itemFrequencyPlot(groceries, topN = 20)
We can now work at finding associations among shopping cart items. The following table shows the syntax to create sets of rules with the apriori() function:
Finding support and confidence parameters that produce a reasonable number of association rules sometimes takes trial and error.
If you set these levels too high, you might find no rules or rules that are too generic to be very useful.
If you set them too low, you might get an unwieldy number of rules or, worse, the learning phase might take a very long time or run out of memory.
Minimum support: Think about the smallest number of transactions you would need before you would consider a pattern interesting.
For instance, you could argue that if an item is purchased twice a day (about 60 times in a month of data), it may be an interesting pattern. Since 60 out of 9,835 is approximately 0.006, we’ll try setting the support there first.
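The arithmetic behind this starting value is easy to check directly. The sketch below assumes a 30-day month, which the text implies but does not state; the variable names are our own.

```r
n_transactions    <- 9835
purchases_per_day <- 2
days_in_month     <- 30   # assumption: a 30-day month

min_count   <- purchases_per_day * days_in_month   # 60 purchases
min_support <- min_count / n_transactions

round(min_support, 3)     # 0.006

# Going the other way: how many transactions does a support of 0.006 represent?
0.006 * n_transactions    # just over 59
```

Because 0.006 is a rounded value, the effective cutoff sits slightly below 60 transactions.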
We’ll start with a confidence threshold of 0.25, which means that in order to be included in the results, the rule has to be correct at least 25 percent of the time.
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
groceryrules
## set of 463 rules
To obtain a high-level overview of the association rules, we can use summary() as follows.
summary(groceryrules)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
##   2   3   4
## 150 297  16
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   2.000   3.000   2.711   3.000   4.000
##
## summary of quality measures:
##     support           confidence        coverage             lift
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.009964   Min.   :0.9932
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:0.018709   1st Qu.:1.6229
##  Median :0.008744   Median :0.3554   Median :0.024809   Median :1.9332
##  Mean   :0.011539   Mean   :0.3786   Mean   :0.032608   Mean   :2.0351
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:0.035892   3rd Qu.:2.3565
##  Max.   :0.074835   Max.   :0.6600   Max.   :0.255516   Max.   :3.9565
##      count
##  Min.   : 60.0
##  1st Qu.: 70.0
##  Median : 86.0
##  Mean   :113.5
##  3rd Qu.:121.0
##  Max.   :736.0
##
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25
##  call
##  apriori(data = groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
In our rule set, 150 rules have only two items, while 297 have three, and 16 have four.
We can take a look at specific rules using the inspect() function. For instance, the first three rules in the groceryrules object can be viewed as follows:
inspect(groceryrules[1:3])
## lhs rhs support confidence coverage
## [1] {potted plants} => {whole milk} 0.006914082 0.4000000 0.01728521
## [2] {pasta} => {whole milk} 0.006100661 0.4054054 0.01504830
## [3] {herbs} => {root vegetables} 0.007015760 0.4312500 0.01626843
## lift count
## [1] 1.565460 68
## [2] 1.586614 60
## [3] 3.956477 69
Interpretation of the first rule:
If a customer buys potted plants, they will also buy whole milk.
This rule covers about 0.7 percent of the transactions (support of 0.0069).
It is correct in 40 percent of purchases involving potted plants (confidence of 0.40).
The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that they bought a potted plant. Here the lift is about 1.57.
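These three measures can be recomputed from numbers that already appear in the output. The sketch below is a worked check, not part of the arules workflow. It takes the rule's count (68) and coverage from the inspect() output and the whole milk count (2,513) from the earlier summary(groceries) output; the variable names are our own.

```r
n          <- 9835        # total transactions
count_both <- 68          # potted plants and whole milk together (count column)
coverage   <- 0.01728521  # share of transactions with potted plants (coverage column)
count_milk <- 2513        # transactions containing whole milk

support    <- count_both / n                 # share of all transactions matching the rule
confidence <- support / coverage             # share of potted-plant baskets that include milk
lift       <- confidence / (count_milk / n)  # confidence relative to milk's baseline rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The results agree with the reported values of roughly 0.0069, 0.40, and 1.57. A lift above 1 means that whole milk shows up more often in potted-plant baskets than in baskets overall.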
A common approach is to take the association rules and divide them into the following three categories:
Actionable: provide a clear and useful insight
Trivial: rules so obvious that they are not worth mentioning
Inexplicable: unclear connections between the items
Depending upon the objectives of the market basket analysis, the most useful rules might be the ones with the highest support, confidence, or lift.
inspect(sort(groceryrules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {herbs} => {root vegetables} 0.007015760 0.4312500 0.01626843 3.956477 69
## [2] {berries} => {whipped/sour cream} 0.009049314 0.2721713 0.03324860 3.796886 89
## [3] {other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.007015760 0.4107143 0.01708185 3.768074 69
## [4] {beef,
## other vegetables} => {root vegetables} 0.007930859 0.4020619 0.01972547 3.688692 78
## [5] {other vegetables,
## tropical fruit} => {pip fruit} 0.009456024 0.2634561 0.03589222 3.482649 93
Suppose that given the preceding rule, the marketing team is excited about the possibilities of creating an advertisement to promote berries, which are now in season. Before finalizing the campaign, however, they ask you to investigate whether berries are often purchased with other items. To answer this question, we’ll need to find all the rules that include berries in some form.
The subset() function provides a method to search for subsets of transactions, items, or rules.
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
## lhs rhs support confidence coverage lift
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713 0.0332486 3.796886
## [2] {berries} => {yogurt} 0.010574479 0.3180428 0.0332486 2.279848
## [3] {berries} => {other vegetables} 0.010269446 0.3088685 0.0332486 1.596280
## [4] {berries} => {whole milk} 0.011794611 0.3547401 0.0332486 1.388328
## count
## [1] 89
## [2] 104
## [3] 101
## [4] 116
To share the results of your market basket analysis, you can save the rules to a CSV file with the write() function.
write(groceryrules, file = "groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)
Association rules are frequently used to find useful insights in the massive transaction databases of large retailers.
Because this is an unsupervised learning process, we can extract knowledge from large databases without any prior knowledge of what patterns to seek.
The challenge is to reduce a large volume of data to a manageable set of insights. We did this by setting minimum thresholds for support and confidence and then ranking the resulting rules by measures such as lift.