Intro: Market Basket Analysis

The purpose of a Market Basket Analysis is to identify and visualize statistically significant associations between unique items across a multitude of occurrences. As its name suggests, it originated in the analysis of items frequently purchased together in supermarket transactions, and we’ll use it in an online retail example below.

In selecting our model and drawing conclusions, we’ll be evaluating associations according to the following association measures:

Support

The support of an item or itemset is its frequency relative to the total number of orders in the dataset. It can be measured for individual items as well as for associations between them. Simply put,

\[ support (A, B) = \frac{orders\ with\ (A, B)}{all \ orders} \]

Confidence

The confidence of an association with respect to one of its items is the association’s frequency relative to that individual item’s frequency.

\[ confidence (A \Rightarrow B) = \frac{orders\ with\ (A, B)}{orders\ with\ A} \]

A high-confidence association indicates that a high proportion of orders featuring the “Left Hand Side” item also feature the “Right Hand Side” item. Note here that

\[ confidence (A \Rightarrow B) \neq confidence (B \Rightarrow A) \]

Lift

Lift quantifies the support of the association versus the support its individual items would have if they occurred independently. It is the frequency of the association divided by the product of the individual items’ frequencies:

\[ lift (A, B) = \frac{support(A, B)}{support(A) \times support(B)} \]
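To make all three measures concrete, here’s a quick hand-computed sketch in R; the five toy baskets and item names are invented purely for illustration:

#five toy baskets (invented for illustration)
baskets <- list(
  c("tea", "cake"),
  c("tea", "cake"),
  c("tea"),
  c("tea"),
  c("cake")
)

n <- length(baskets)
has_tea <- sapply(baskets, function(b) "tea" %in% b)    # tea appears in 4 of 5 baskets
has_cake <- sapply(baskets, function(b) "cake" %in% b)  # cake appears in 3 of 5 baskets

support_both <- sum(has_tea & has_cake) / n              # 2/5 = 0.4
conf_tea_cake <- sum(has_tea & has_cake) / sum(has_tea)  # 2/4 = 0.5
conf_cake_tea <- sum(has_tea & has_cake) / sum(has_cake) # 2/3, approx. 0.667
lift_both <- support_both / ((sum(has_tea) / n) * (sum(has_cake) / n))  # 0.4 / (0.8 * 0.6), approx. 0.833

Note the asymmetry: confidence(tea ⇒ cake) is 0.5 while confidence(cake ⇒ tea) is roughly 0.667. A lift below 1, as here, indicates the items co-occur less often than independence would predict.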

Step 1: Import and describe transaction data

In part 1, we exported the results of our SQL query SELECT * FROM TransactionLineView to .csv.

Here, we’ll begin by importing this data from .csv and taking a look at the size and shape of this dataset. It includes 401,365 online retail transaction line items from 18,533 unique transactions, shipped over the course of 2009-2010 from a UK-based online home decor retailer.

Load Packages

For the “meat” of our analysis, we’ll rely on the packages arules and arulesViz.

library(tidyverse)
library(arules)
library(arulesViz)
library(DT)

options(scipen = 999) #disable scientific notation in printed output
#import transactions from .csv (as exported from database TransactionLineView in part 1)
transactions_raw <- read_delim("https://raw.githubusercontent.com/curdferguson/CUNY607final_MarketBasket/main/transactions_raw.csv", delim= ",")

Describe Transactions

The 18,533 transactions grossed the eCommerce retailer 8,982,405 GBP, with an average order value of 484.67 GBP and an average basket size of 21.7 unique items per transaction. While the company shipped to over 35 countries throughout Europe and as far afield as Brazil, the US, Japan, and Australia, 87.68% of its revenue shipped within the UK.

#view results in a DataTable
datatable(head(transactions_raw), filter = "none")
transactions_by_transaction <- transactions_raw %>% 
  group_by(InvoiceID) %>% 
  summarize(Customer = max(CustomerID), 
            Country = max(Country), 
            `Basket Size` = n(), 
            Total = round(sum(Price * Qty), 2))

#summarize transactions
transactions_summary <- transactions_by_transaction %>% 
  summarize(`No. of Transactions` = n(), 
            `Avg. Basket Size` = round(sum(`Basket Size`) / `No. of Transactions`, 3), 
            `Avg. Order Value` = round(sum(Total) / `No. of Transactions`, 2), 
            Revenue = sum(Total))

datatable(transactions_summary)
# calculate transactions by country
transactions_by_country <- transactions_by_transaction %>% 
  group_by(Country) %>% 
  summarize(`No. of Transactions` = n(), 
            `Avg. Basket Size` = round(sum(`Basket Size`) / `No. of Transactions`, 3), 
            `Avg. Order Value` = round(sum(Total) / `No. of Transactions`, 2), 
            Revenue = sum(Total), 
            `% of Revenue` = round(Revenue / sum(transactions_summary$Revenue), 4)) %>% 
  arrange(desc(`% of Revenue`))

datatable(transactions_by_country)

Describe Item Makeup of the Transactions

There are 3,683 unique items, identified by their StockCode and Description fields, that appear across our 18,533 transactions. While we could analyze at length the percentage revenue or average quantity per order of the items in this dataset, we’ve narrowed our focus here to the variable of interest to Market Basket Analysis - the number of orders on which a particular item appears (without regard to the quantity ordered).

We’re going to remove 3 miscellaneous non-product items - POSTAGE, Manual, and CARRIAGE - from the dataset at this point, as one in particular appears in our top 10 and may otherwise skew the analysis.
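One way to do this with dplyr - a minimal sketch, assuming these Description values match the raw data exactly:

#remove miscellaneous non-product line items
transactions_raw <- transactions_raw %>% 
  filter(!Description %in% c("POSTAGE", "Manual", "CARRIAGE"))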

Finally, we’re going to view absolute and relative frequency plots for the items in our dataset, to get a read on how the quantity of orders is distributed amongst unique items.

#top 10 items by # of orders shipped
transactions_by_item_orders <- transactions_raw %>% 
  group_by(StockCode) %>% 
  summarize(Description = max(Description), 
            Orders = n(), 
            Quantity = sum(Qty), 
            `Avg. Qty` = round(Quantity / Orders, 3), 
            Price = max(Price), 
            Total = round(sum(Price * Qty), 2)) %>% 
  arrange(desc(Orders))

datatable(head(transactions_by_item_orders, 10))

Step 2: Wrangle data to “single” format

“Single” format is one of two input types allowed by the read.transactions function of package arules. The “single” format lists two columns - one for the transaction’s unique identifier (in this case, InvoiceID) and one for Description - the unique item name.

Market Basket Analysis is agnostic toward the quantity and price of items on each transaction line, so we can exclude those columns.

We’ll use the arules package’s Item Frequency Plot function to get a quick read on the top items in our set. Will we see these top items show up in significant associations?

#wrangle to "single" format and write to .csv
transactions_mba <- transactions_raw %>% distinct(InvoiceID, Description)

transactions_mba$Description <- transactions_mba$Description %>% str_trim(side = "both")
#write_delim(transactions_mba, file = "transactions_mba.csv", delim = ",", na = "", append = FALSE)

# create transaction object from transactions_mba .csv file, and view summary output
transobj_mba <- read.transactions(file = "https://raw.githubusercontent.com/curdferguson/CUNY607final_MarketBasket/main/transactions_mba.csv", format = "single", header = TRUE, sep = ",", cols = c(1, 2), rm.duplicates = TRUE)

summary(transobj_mba)
## transactions as itemMatrix in sparse format with
##  13094 rows (elements/itemsets/transactions) and
##  6925 columns (items) and a density of 0.002195047 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                                992                                953 
##            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
##                                830                                715 
##      ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
##                                696                             194852 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2047  943  661  562  514  483  414  402  369  367  366  323  348  317  331  294 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  261  266  283  259  213  202  185  174  137  126  135  131  130  111  104   81 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##   69   88   80   62   72   66   67   64   55   44   46   45   42   44   36   31 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   31   33   29   30   24   23   34   18   27   22   13   16   17   10   17    9 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   17   15   13   13    7    8    8    8    5    7    6    8    7    5    9    7 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   11    7    6    6    5    6    8    2    5    4    5    5    1    1    3    2 
##   97   98   99  100  101  102  104  105  107  108  109  110  111  113  115  116 
##    3    2    4    3    1    1    4    1    4    1    1    4    3    2    1    2 
##  118  120  122  123  124  126  129  132  136  140  144  147  149  153  154  157 
##    2    1    1    2    1    1    2    1    1    1    1    1    1    1    1    1 
##  167  169  176  178  180  185  198  199  204  226  231  282  294  319 
##    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0    10.0    15.2    21.0   319.0 
## 
## includes extended item information - examples:
##                                   labels
## 1      "ASSORTED FLOWER COLOUR ""LEIS"""
## 2  "CHARLIE+LOLA""EXTREMELY BUSY"" SIGN"
## 3 "FLOWER GLASS GARLAND NECKL.36""BLACK"
## 
## includes extended transaction information - examples:
##   transactionID
## 1        536365
## 2        536366
## 3        536367
# show an item frequency plot
itemFrequencyPlot(transobj_mba, type = c("absolute"), topN = 1000, horiz = FALSE, names = FALSE, col = "lightblue", angle = 45, xlab = "Items", ylab = "Frequency", main = "Absolute Item Frequency")

# show a horizontal frequency plot for the top 10 items
itemFrequencyPlot(transobj_mba, type = c("absolute"), topN = 10, horiz = TRUE, names = TRUE, col = "lightblue", xlab = "Items", ylab = "", main = "TOP 10 Absolute Item Frequency")
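For the relative view, itemFrequencyPlot also accepts type = "relative", which plots each item’s support as a share of all transactions:

# show a top-10 relative item frequency plot (share of transactions)
itemFrequencyPlot(transobj_mba, type = c("relative"), topN = 10, horiz = TRUE, names = TRUE, col = "lightblue", xlab = "Items", ylab = "", main = "TOP 10 Relative Item Frequency")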

Step 3: Select support and confidence levels based on the dataset

The strength and significance of the association rules we’ll derive from our data are determined primarily by two parameters, support and confidence, as discussed above. We’re going to settle on support and confidence levels for our analysis by looping the apriori algorithm over the dataset at different confidence and support levels, and counting the number of rules generated.

The sweet spot we’re looking for is the intersection of support and confidence that yields 10-20 rules at the highest value of each. We’ll start by comparing support levels of 3%, 2%, 1%, and 0.5%…

# Support and confidence values
supportLevels <- c(0.03, 0.02, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# Empty integers 
rules_sup3 <- integer(length=9)
rules_sup2 <- integer(length=9)
rules_sup1 <- integer(length=9)
rules_sup0.5 <- integer(length=9)

# Apriori algorithm with a support level of 3%
for (i in 1:length(confidenceLevels)) {
  
  rules_sup3[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[1], 
                                   conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels)){
  
  rules_sup2[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[2], 
                                  conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)){
  
  rules_sup1[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[3], 
                                  conf=confidenceLevels[i], target="rules")))
  
}

# Apriori algorithm with a support level of 0.5%
for (i in 1:length(confidenceLevels)){
  
  rules_sup0.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[4], 
                                    conf=confidenceLevels[i], target="rules")))
  
}

# code chunk citation: Garcia, Xavier Vivancos.  "Market Basket Analysis."  Kaggle.com.  Published May 2, 2020.  Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.

At our lowest levels of confidence and support, the algorithm generates over 2000 rules - far too many to offer us useful insights. At a support level of 3%, by contrast, the model appears to offer us zero statistically significant association rules. So in the next step, we’ll recalibrate to look at support levels from 1% - 2.5%, and 50% or greater confidence.

rules_plot <- data.frame(
  confidence = confidenceLevels,
  support_3 = rules_sup3,
  support_2 = rules_sup2,
  support_1 = rules_sup1,
  support_0.5 = rules_sup0.5) %>% 
  pivot_longer(cols = c(support_3, support_2, support_1, support_0.5), names_to = "support", values_to = "rules")

rules_plot %>% ggplot(aes(x = confidence, y = rules, color = support)) +
  geom_point() + geom_line() + scale_color_brewer()

# Support and confidence values
supportLevels2 <- c(0.025, 0.02, 0.015, 0.01)
confidenceLevels2 <- c(0.9, 0.8, 0.7, 0.6, 0.5)

# Empty integers 
rules_sup2.5 <- integer(length=5)
rules_sup2 <- integer(length=5)
rules_sup1.5 <- integer(length=5)
rules_sup1 <- integer(length=5)

# Apriori algorithm with a support level of 2.5%
for (i in 1:length(confidenceLevels2)) {
  
  rules_sup2.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[1], 
                                   conf=confidenceLevels2[i], target="rules")))
  
}

# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels2)){
  
  rules_sup2[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[2], 
                                  conf=confidenceLevels2[i], target="rules")))
  
}

# Apriori algorithm with a support level of 1.5%
for (i in 1:length(confidenceLevels2)){
  
  rules_sup1.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[3], 
                                  conf=confidenceLevels2[i], target="rules")))
  
}

# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels2)){
  
  rules_sup1[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[4], 
                                    conf=confidenceLevels2[i], target="rules")))
  
}

# code chunk citation: Garcia, Xavier Vivancos.  "Market Basket Analysis."  Kaggle.com.  Published May 2, 2020.  Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.

The recalibrated algorithm helps us drill down into the region of most value to us. We can see that there are two options that meet our criteria: a support of 1.0% with confidence of 70%, and a support of 1.5% with confidence of 60%.

Below, we will run the apriori algorithm at each of these intersections of support and confidence, and compare the results.

rules_plot1 <- data.frame(
  confidence = confidenceLevels2,
  support_2.5 = rules_sup2.5,
  support_2 = rules_sup2,
  support_1.5 = rules_sup1.5,
  support_1 = rules_sup1) %>% 
  pivot_longer(cols = c(support_2.5, support_2, support_1.5, support_1), names_to = "support", values_to = "rules")

rules_plot1 %>% ggplot(aes(x = confidence, y = rules, color = support)) +
  geom_point() + geom_line() + scale_color_brewer()

Step 4: Run our apriori analysis and interpret the results

Support 1.0%, Confidence 70%

There are a few immediate takeaways from our dataset:

- Items with the greatest absolute frequency in our dataset are not necessarily the items that appear most frequently in association rules (though a few notables do re-occur).
- A few categories stand out for this company as driving significant associations: teacups/saucers and cake stands; lunch bags; party bunting; and hanging candle holders.
- The lift amongst the associations in this group of 19 rules is extraordinary - a lift of 47.615 for pink and blue Happy Birthday bunting means customers purchase the two items together 47.615 times more often than we would expect if purchases of each were independent.

(Recall that, formally, lift is the frequency of the association divided by the product of the individual items’ frequencies, assuming independence.)

\[ lift (A, B) = \frac{support(A, B)}{support(A) \times support(B)} \]

rules1 <- apriori(transobj_mba, parameter=list(sup=0.01, conf=0.7, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 130 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6925 item(s), 13094 transaction(s)] done [0.30s].
## sorting and recoding items ... [397 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspectDT(rules1)
summary(rules1)
## set of 19 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
##  9 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.526   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01000   Min.   :0.7000   Min.   :0.01298   Min.   : 9.24  
##  1st Qu.:0.01100   1st Qu.:0.7189   1st Qu.:0.01466   1st Qu.:23.67  
##  Median :0.01168   Median :0.7487   Median :0.01642   Median :25.52  
##  Mean   :0.01351   Mean   :0.7608   Mean   :0.01779   Mean   :28.22  
##  3rd Qu.:0.01604   3rd Qu.:0.7978   3rd Qu.:0.02058   3rd Qu.:31.00  
##  Max.   :0.02207   Max.   :0.8787   Max.   :0.02948   Max.   :47.61  
##      count      
##  Min.   :131.0  
##  1st Qu.:144.0  
##  Median :153.0  
##  Mean   :176.9  
##  3rd Qu.:210.0  
##  Max.   :289.0  
## 
## mining info:
##          data ntransactions support confidence
##  transobj_mba         13094    0.01        0.7
rules_scatter <- plot(rules1, measure=c("support", "lift"), 
          shading = "confidence",
          interactive = FALSE)
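To pull the headline associations out directly, arules provides a sort method for rule sets by any quality measure; sorting rules1 by lift should surface the Happy Birthday bunting pair noted above:

# inspect the top 3 rules by lift (sort is decreasing by default)
inspect(head(sort(rules1, by = "lift"), 3))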

Support 1.5%, Confidence 60%

In the case of this dataset, we happened to lose the 4 rules with the most significant lift by raising our support threshold by 0.5 percentage points. I would not necessarily generalize that we should err on the side of greater confidence over greater support - the relationship may vary considerably depending on the distribution of items and the nature of the observations. (A short sketch after the output below isolates exactly which rules were lost.)

rules2 <- apriori(transobj_mba, parameter=list(sup=0.015, conf=0.6, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.015      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6925 item(s), 13094 transaction(s)] done [0.34s].
## sorting and recoding items ... [197 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.01s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(rules2)
##      lhs                                    rhs                                     support confidence   coverage      lift count
## [1]  {PINK REGENCY TEACUP AND SAUCER}    => {GREEN REGENCY TEACUP AND SAUCER}    0.01909271  0.7987220 0.02390408 27.094473   250
## [2]  {GREEN REGENCY TEACUP AND SAUCER}   => {PINK REGENCY TEACUP AND SAUCER}     0.01909271  0.6476684 0.02947915 27.094473   250
## [3]  {PINK REGENCY TEACUP AND SAUCER}    => {ROSES REGENCY TEACUP AND SAUCER}    0.01825263  0.7635783 0.02390408 23.197898   239
## [4]  {JUMBO BAG STRAWBERRY}              => {JUMBO BAG RED RETROSPOT}            0.01580877  0.6330275 0.02497327  9.986581   207
## [5]  {GARDENERS KNEELING PAD CUP OF TEA} => {GARDENERS KNEELING PAD KEEP CALM}   0.01580877  0.7040816 0.02245303 25.258205   207
## [6]  {RED HANGING HEART T-LIGHT HOLDER}  => {WHITE HANGING HEART T-LIGHT HOLDER} 0.01687796  0.6443149 0.02619520  8.504696   221
## [7]  {GREEN REGENCY TEACUP AND SAUCER}   => {ROSES REGENCY TEACUP AND SAUCER}    0.02207118  0.7487047 0.02947915 22.746030   289
## [8]  {ROSES REGENCY TEACUP AND SAUCER}   => {GREEN REGENCY TEACUP AND SAUCER}    0.02207118  0.6705336 0.03291584 22.746030   289
## [9]  {ALARM CLOCK BAKELIKE GREEN}        => {ALARM CLOCK BAKELIKE RED}           0.01985642  0.6735751 0.02947915 19.599539   260
## [10] {JUMBO BAG PINK POLKADOT}           => {JUMBO BAG RED RETROSPOT}            0.02008554  0.6232227 0.03222850  9.831902   263
## [11] {DOLLY GIRL LUNCH BOX}              => {SPACEBOY LUNCH BOX}                 0.01634336  0.6484848 0.02520238 23.851856   214
## [12] {SPACEBOY LUNCH BOX}                => {DOLLY GIRL LUNCH BOX}               0.01634336  0.6011236 0.02718803 23.851856   214
## [13] {GREEN REGENCY TEACUP AND SAUCER,                                                                                           
##       PINK REGENCY TEACUP AND SAUCER}    => {ROSES REGENCY TEACUP AND SAUCER}    0.01603788  0.8400000 0.01909271 25.519629   210
## [14] {PINK REGENCY TEACUP AND SAUCER,                                                                                            
##       ROSES REGENCY TEACUP AND SAUCER}   => {GREEN REGENCY TEACUP AND SAUCER}    0.01603788  0.8786611 0.01825263 29.806187   210
## [15] {GREEN REGENCY TEACUP AND SAUCER,                                                                                           
##       ROSES REGENCY TEACUP AND SAUCER}   => {PINK REGENCY TEACUP AND SAUCER}     0.01603788  0.7266436 0.02207118 30.398311   210
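As a sanity check on the lift column, rule [1] can be reproduced from figures in the table itself - its support, its coverage (the support of PINK REGENCY TEACUP AND SAUCER), and the coverage of rule [2] (the support of GREEN REGENCY TEACUP AND SAUCER):

\[ lift = \frac{0.019093}{0.023904 \times 0.029479} \approx 27.09 \]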
summary(rules2)
## set of 15 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 12  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     2.0     2.0     2.2     2.0     3.0 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01581   Min.   :0.6011   Min.   :0.01825   Min.   : 8.505  
##  1st Qu.:0.01604   1st Qu.:0.6460   1st Qu.:0.02318   1st Qu.:21.173  
##  Median :0.01688   Median :0.6736   Median :0.02520   Median :23.852  
##  Mean   :0.01799   Mean   :0.7068   Mean   :0.02579   Mean   :21.966  
##  3rd Qu.:0.01947   3rd Qu.:0.7561   3rd Qu.:0.02948   3rd Qu.:26.307  
##  Max.   :0.02207   Max.   :0.8787   Max.   :0.03292   Max.   :30.398  
##      count      
##  Min.   :207.0  
##  1st Qu.:210.0  
##  Median :221.0  
##  Mean   :235.5  
##  3rd Qu.:255.0  
##  Max.   :289.0  
## 
## mining info:
##          data ntransactions support confidence
##  transobj_mba         13094   0.015        0.6
rules_scatter2 <- plot(rules2, measure=c("support", "lift"), 
          shading = "confidence",
          interactive = FALSE)
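Since arules supports set operations on rule objects, we can isolate exactly which rules dropped out between the two parameterizations - a brief sketch, assuming the rules1 and rules2 objects created above:

# rules mined at (support 1%, confidence 70%) that drop out at (support 1.5%, confidence 60%)
lost_rules <- setdiff(rules1, rules2)
inspect(sort(lost_rules, by = "lift"))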

Conclusion:

The analysis we performed here reveals 19 associations between specific items that suggest strong cross-promotional opportunities for this retailer in 4 categories: teacups/saucers and cake stands; lunch bags and boxes; party bunting; and hanging candle holders.

Activating these opportunities in the real world would require more information about the nature of the company’s customers and their purchasing habits. While the dataset is characterized as “online retail transactions”, the high average order value and basket size, combined with the types of association rules uncovered - which skew toward items within the same category - indicate this site’s customers are likely in the B2B space, reselling the items through online or brick-and-mortar retail.

References:

Garcia, Xavier Vivancos. “Market Basket Analysis.” Kaggle.com. Published May 2, 2020. Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.

Goel, Niharika. “Market Basket Analysis Association Rules.” Medium.com. Published July 23, 2018. Accessed from: https://medium.com/@niharika.goel/market-basket-analysis-association-rules-e7c27b377bd8.