DATA624 HW5: MBA
Introduction
Imagine 1000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
Here is the dataset = GroceryDataSet.csv (comma separated file).
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Data Prep
Visual inspection of the data identifies an unequal number of columns per row. Pulling the data into a dataframe and assigning each row a number to represent each basket, we pivot the data from wide to long for analysis.
url <- "https://raw.githubusercontent.com/klgriffen96/summer23_data624/main/hw_2/GroceryDataSet.csv"
data <- read.csv(url, header=FALSE)
data[data == ""] <- NA # replace empty cells with NA
data$tid <- as.integer(row.names(data)) # create transaction ID column using row indices
# pivot data
baskets <- data |>
pivot_longer(cols = starts_with("V"),
values_to = "item",
values_drop_na = TRUE) |>
select(tid, item)
kable(head(baskets)) |> kable_styling()
tid | item |
---|---|
1 | citrus fruit |
1 | semi-finished bread |
1 | margarine |
1 | ready soups |
2 | tropical fruit |
2 | yogurt |
We can visualize the frequency of items in our baskets,
baskets |>
group_by(item) |>
summarize(count = n()) |>
arrange(desc(count)) |>
head(10) |>
ggplot(aes(x=reorder(item,count), y=count)) +
geom_col(fill="blue") +
labs(x=NULL, y='Count') + ggtitle('Top 10 Items') +
coord_flip() +
theme_classic()
And the distribution of items per transaction,
baskets |>
group_by(tid) |>
summarize(total_items = n()) |>
ggplot(aes(x=total_items)) +
geom_histogram(fill="blue", bins=30) +
theme_classic()
We can see most of the baskets contain 1 or 2 items, the maximum basket size appears to be just about 30, we can check this:
baskets |>
group_by(tid) |>
summarize(total_items = n()) |>
filter(total_items == max(total_items)) |>
kable() |>
kable_styling()
tid | total_items |
---|---|
1217 | 32 |
Now we prepare the data for Market Basket Analysis, first we create a transaction object with the data,
# Transform tid into a factor
baskets$tid <- factor(baskets$tid)
# split into groups
baskets_list <- split(baskets$item,
baskets$tid)
# transform to transactional dataset
baskets_trx <- as(baskets_list,"transactions")
# inspect transactions
inspect(head(baskets_trx))
## items transactionID
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread} 1
## [2] {coffee,
## tropical fruit,
## yogurt} 2
## [3] {whole milk} 3
## [4] {cream cheese ,
## meat spreads,
## pip fruit,
## yogurt} 4
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk} 5
## [6] {abrasive cleaner,
## butter,
## rice,
## whole milk,
## yogurt} 6
We can also call the summary
function on the
transactions object,
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
All of these operations can be simplified by calling the
read.transactions()
function to read in the data as a
transaction object.
basket_tr <- read.transactions(url, sep=",")
itemFrequencyPlot(basket_tr,
topN=10,
type="absolute",
xlab="Count",
ylab="",
col="blue",
main="Top 10 items",
cex.names=0.8,
horiz=TRUE)
Market Basket Analysis
Association rule mining consists of two subtasks which provide insights into the relationships between items in transactional data:
- Frequent itemset generation
- Rule generation
Support
First we’ll take a look at the itemsets in the data to identify the itemsets that occur most frequently. An itemset is a combination of items that appear together in a set of transactions. The support measures the frequency or occurrence of an itemset in the data by dividing the number of transactions that contain a specific itemset by the total number of transactions. High support indicates the itemset appears frequently.
# Frequent itemsets for all items
support_all <- apriori(basket_tr,
parameter = list(target="frequent itemsets",
supp = 0.01,
minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [245 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# inspect(head(sort(support_all, by="support),10))
# 10 most frequent items
support_all |>
as("data.frame") |>
arrange(desc(support)) |>
head(10) |>
kable() |>
kable_styling()
items | support | count |
---|---|---|
{other vegetables,whole milk} | 0.0748348 | 736 |
{rolls/buns,whole milk} | 0.0566345 | 557 |
{whole milk,yogurt} | 0.0560244 | 551 |
{root vegetables,whole milk} | 0.0489070 | 481 |
{other vegetables,root vegetables} | 0.0473818 | 466 |
{other vegetables,yogurt} | 0.0434164 | 427 |
{other vegetables,rolls/buns} | 0.0426029 | 419 |
{tropical fruit,whole milk} | 0.0422979 | 416 |
{soda,whole milk} | 0.0400610 | 394 |
{rolls/buns,soda} | 0.0383325 | 377 |
Interpreting these results, the support column indicates the proportion of transactions that contain each itemset: 7.5% of all transactions contain “other vegetables” and “whole milk”, 5.7% of all transactions contain “other vegetables”rolls/buns” and “whole milk”, and so on.
Rules
Extracting rules allows us to observe the support, confidence, and lift measures of frequent itemsets.
rules <- apriori(baskets_trx,
parameter= list(supp = 0.01, # minimum support
conf = 0.4, # minimum confidence
minlen = 2)) # disallow empty sets
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [62 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The arulesViz
package contains a number of functions
that enhance graphical display of association rules. The
inspectDT
function outputs an interactive display which can
be sorted by column:
Plotting the rules with the engine = "plotly"
argument
creates an interactive visualization that allows the user to hover over
data points in order to view association rule metrics,
Confidence
Each itemset consists of antecedents and consequents. Antecedents are the items or itemsets on the left-hand side of the association rule that represents a set of items or conditions that act as the premise or condition for the rule. Consequents are the items or itemsets on the right-hand side of the association rule that represents the set of items or outcomes that are predicted or observed based on the presence of the antecent. Confidence measures the likelihood or probability of finding the consequent item(s) in a rule given the antecedent(s). It is the proportion of transactions containing the antecedent that also contain the consequent. High confidence suggests a strong association between the antecedent and consequent.
rules | support | confidence | coverage | lift | count |
---|---|---|---|---|---|
{citrus fruit,root vegetables} => {other vegetables} | 0.0103711 | 0.5862069 | 0.0176919 | 3.029608 | 102 |
{root vegetables,tropical fruit} => {other vegetables} | 0.0123030 | 0.5845411 | 0.0210473 | 3.020999 | 121 |
{curd,yogurt} => {whole milk} | 0.0100661 | 0.5823529 | 0.0172852 | 2.279125 | 99 |
{butter,other vegetables} => {whole milk} | 0.0114896 | 0.5736041 | 0.0200305 | 2.244885 | 113 |
{root vegetables,tropical fruit} => {whole milk} | 0.0119980 | 0.5700483 | 0.0210473 | 2.230969 | 118 |
{root vegetables,yogurt} => {whole milk} | 0.0145399 | 0.5629921 | 0.0258261 | 2.203354 | 143 |
{domestic eggs,other vegetables} => {whole milk} | 0.0123030 | 0.5525114 | 0.0222674 | 2.162336 | 121 |
{whipped/sour cream,yogurt} => {whole milk} | 0.0108795 | 0.5245098 | 0.0207422 | 2.052747 | 107 |
{rolls/buns,root vegetables} => {whole milk} | 0.0127097 | 0.5230126 | 0.0243010 | 2.046888 | 125 |
{other vegetables,pip fruit} => {whole milk} | 0.0135231 | 0.5175097 | 0.0261312 | 2.025351 | 133 |
We can interpret the confidence that 58% of customers that bought “citrus fruit” and “root vegetables” also bought “other vegetables”, and so on.
Lift
Lift measures the strength of association between the antecedent and consequent items compared to their individual occurrences. It is the ratio of the observed support to the expected support if the antecedent and consequent were independent. Lift greater than 1 indicates a positive association, while lift less than 1 indicates a negative association.
rules | support | confidence | coverage | lift | count |
---|---|---|---|---|---|
{citrus fruit,root vegetables} => {other vegetables} | 0.0103711 | 0.5862069 | 0.0176919 | 3.029608 | 102 |
{root vegetables,tropical fruit} => {other vegetables} | 0.0123030 | 0.5845411 | 0.0210473 | 3.020999 | 121 |
{rolls/buns,root vegetables} => {other vegetables} | 0.0122013 | 0.5020921 | 0.0243010 | 2.594890 | 120 |
{root vegetables,yogurt} => {other vegetables} | 0.0129131 | 0.5000000 | 0.0258261 | 2.584078 | 127 |
{whipped/sour cream,yogurt} => {other vegetables} | 0.0101678 | 0.4901961 | 0.0207422 | 2.533410 | 100 |
{root vegetables,whole milk} => {other vegetables} | 0.0231825 | 0.4740125 | 0.0489070 | 2.449770 | 228 |
{onions} => {other vegetables} | 0.0142349 | 0.4590164 | 0.0310117 | 2.372268 | 140 |
{pork,whole milk} => {other vegetables} | 0.0101678 | 0.4587156 | 0.0221657 | 2.370714 | 100 |
{whipped/sour cream,whole milk} => {other vegetables} | 0.0146416 | 0.4542587 | 0.0322318 | 2.347679 | 144 |
{pip fruit,whole milk} => {other vegetables} | 0.0135231 | 0.4493243 | 0.0300966 | 2.322178 | 133 |
Again, looking at the first line, we can interpret the lift of 3.03 to represent the occurrence of the consequent (other vegetables) is 3.03 times likelier when the antecedent items (citrus fruit & root vegetables) are present, and so on.