Market Basket Analysis of Online Retail Data Using FP-Growth

Introduction
- Significance of FP-Growth
Methodology
- Selection of Dataset
- Step by Step FP-Growth
  - Process Overview
  - Internal Mining Process of FP-Growth
- Application of FP-Growth
  - Preparing Data for FP-Growth in R
  - Parameters of FP-Growth function (fim4r function) and how to set it up
- Analysis and Results
Conclusion

Introduction

The exploration of transactional data for pattern discovery is a critical aspect of retail analytics. This project applies the FP-Growth algorithm to a dataset from a UK-based online retailer that captures transactions between December 1, 2009, and December 9, 2011. The dataset, “Online Retail II” from the UCI (UC Irvine) Machine Learning Repository, records transactions involving a wide array of unique giftware products sold to both individual customers and wholesalers. The purpose of the research is to mine frequent itemsets with FP-Growth and to generate association rules that reveal recurring patterns in customer purchase behavior.

Significance of FP-Growth

FP-Growth mines frequent itemsets without candidate generation, which is a common bottleneck in traditional association rule learning methods such as Apriori. By constructing an FP-tree, the algorithm reduces the number of database scans to two (one to count item frequencies and one to build the tree), which makes the analysis faster and more scalable. This efficiency makes FP-Growth better suited than candidate-generation algorithms to large datasets like “Online Retail II”, and the patterns it finds can inform strategic business decisions in retail management and marketing.

Installation of libraries used in the code:

# Install necessary packages if not already installed
if(!require("tidyverse")) install.packages("tidyverse")
if(!require("arules")) install.packages("arules")
if(!require("arulesViz")) install.packages("arulesViz")
if(!require("gridExtra")) install.packages("gridExtra")
if(!require("dplyr")) install.packages("dplyr")
if(!require("readxl")) install.packages("readxl")
if(!require("plyr")) install.packages("plyr")
if(!require("ggplot2")) install.packages("ggplot2")
if(!require("knitr")) install.packages("knitr")
if(!require("lubridate")) install.packages("lubridate")
if(!require("kableExtra")) install.packages("kableExtra")
if(!require("RColorBrewer")) install.packages("RColorBrewer")
if(!require("purrr")) install.packages("purrr")
if(!require("tidyr")) install.packages("tidyr")

Libraries used in the code and their descriptions:

# Load libraries and add comments
library(tidyverse) # Used for data manipulation and visualization
library(arules) # Used for mining association rules and frequent itemsets
library(arulesViz) # Used for visualizing association rules and frequent itemsets
library(gridExtra) # Functions for arranging multiple plots in complex layouts
library(dplyr) # Additional package for data manipulation
library(readxl) # Reading Excel files
library(plyr) # Contains tools for splitting, applying, and combining data
library(ggplot2) # System for graphics
library(knitr) # Dynamic report generation
library(lubridate) # Easier date and time manipulation
library(kableExtra) # Enhanced 'kable' tables with additional styling options
library(RColorBrewer) # Color schemes for graphics
library(purrr) # Functional programming tools
library(tidyr) # Tools for tidying data

Methodology

This section outlines the procedure used to analyze transaction data with the FP-Growth algorithm. It begins with the selection of a suitable dataset, followed by the data preparation steps that put the data into a form suitable for mining frequent itemsets.

Selection of Dataset

The dataset chosen for this study is “Online Retail II”, accessible from the UCI Machine Learning Repository. It is a comprehensive collection of transactions from a UK-based online retailer that operates without a physical storefront. Spanning December 1, 2009, to December 9, 2011, the dataset provides detailed records of sales transactions, including the products sold, their prices, the transaction dates, and customer information.

Key attributes of the dataset include:

Invoice: A unique identifier for each transaction, where codes starting with ‘C’ indicate cancellations.

StockCode: A unique identifier assigned to each product.

Description: The name of the product.

Quantity: The number of units sold in each transaction.

InvoiceDate: The date and time when each transaction occurred.

Price: The price per unit of the product, in pounds sterling.

Customer ID: A unique identifier for each customer.

Country: The country where the customer is residing.

This dataset was selected for the volume and quality of its transactional data across a broad spectrum of products and for its potential to reveal insights into consumer purchasing patterns.

The “Online Retail II” dataset’s nature provides an ideal base for applying the FP-Growth algorithm to uncover associations between products and to analyze trends in customer buying behavior over the two-year period.

# Load dataset and check str
file_path <- "online-retail/online_retail_II.xlsx"

# Read the Excel file into an R data frame (read_excel() reads only the first sheet by default)
online_retail_data <- read_excel(file_path)

# Display the dimensions and structure of the loaded dataset
dim(online_retail_data)

## [1] 525461      8

str(online_retail_data)

## tibble [525,461 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Invoice    : chr [1:525461] "489434" "489434" "489434" "489434" ...
##  $ StockCode  : chr [1:525461] "85048" "79323P" "79323W" "22041" ...
##  $ Description: chr [1:525461] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
##  $ Quantity   : num [1:525461] 12 12 12 48 24 24 24 10 12 12 ...
##  $ InvoiceDate: POSIXct[1:525461], format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
##  $ Price      : num [1:525461] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
##  $ Customer ID: num [1:525461] 13085 13085 13085 13085 13085 ...
##  $ Country    : chr [1:525461] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

The dataset contains missing values in the Description and Customer ID fields.

For this analysis, rows with a missing Description must be handled, while missing Customer ID values can be ignored because customer identity is not used in the mining.

# Count the number of missing values in each column
na_counts <- sapply(online_retail_data, function(x) sum(is.na(x)))
na_counts

##     Invoice   StockCode Description    Quantity InvoiceDate       Price 
##           0           0        2928           0           0           0 
## Customer ID     Country 
##      107927           0

# Handle missing descriptions
online_retail_data <- online_retail_data %>%
  filter(!is.na(Description))

# Check the dimensions after removal of missing descriptions
dim(online_retail_data)

## [1] 522533      8

The rows containing missing descriptions were removed.

Following the removal of missing descriptions, the next steps cover the further data preparation necessary for the FP-Growth analysis:

#online_retail_data$Invoice 

#online_retail_data$StockCode 

online_retail_data$Description <- as.factor(online_retail_data$Description) # Convert the 'Description' column to a factor for categorical analysis

dim(online_retail_data) # Print dimensions before removing cancelled transactions

## [1] 522533      8

# Filter out cancelled transactions (non-positive Quantity) and invoices whose number contains letters
online_retail_data <- online_retail_data %>%
  filter(Quantity > 0, !grepl("[a-zA-Z]", Invoice)) # Keep rows with positive Quantity and a purely numeric Invoice
dim(online_retail_data) # Print dimensions after filtering to see effect

## [1] 512030      8
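
The effect of the two filter conditions can be checked on a handful of rows. The sketch below uses made-up values (the invoice numbers and quantities are assumptions for illustration only) and plain base R instead of dplyr:

```r
# Toy rows (hypothetical values) mimicking the Invoice and Quantity columns
toy <- data.frame(
  Invoice  = c("489434", "C489449", "489435", "A506401"),
  Quantity = c(12, -4, 0, 1),
  stringsAsFactors = FALSE
)

# Same two conditions as in the filter above, written in base R:
# positive quantity AND no letter anywhere in the invoice number
keep <- toy$Quantity > 0 & !grepl("[a-zA-Z]", toy$Invoice)
toy_clean <- toy[keep, ]

print(toy_clean)
```

Only the first row survives: the second and third are dropped by the quantity condition, and the fourth by the letter check on the invoice number.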

online_retail_data$Country <- as.factor(online_retail_data$Country) # Convert Country to factor

online_retail_data$Date <- as.Date(online_retail_data$InvoiceDate) # Extract Date from InvoiceDate

online_retail_data$Time <- format(online_retail_data$InvoiceDate,"%H:%M:%S") # Extract Time from InvoiceDate

#online_retail_data$Price

dim(online_retail_data) # Print final dimensions after all operations

## [1] 512030     10

The dataset is now preprocessed and structured for the application of the FP-Growth algorithm to mine frequent itemsets and derive association rules.

Step by Step FP-Growth

The FP-Growth algorithm represents a significant advance in the field of association rule mining, offering a more efficient alternative to the traditionally used Apriori algorithm. Unlike Apriori, which requires multiple scans of the transaction database to generate frequent itemsets, FP-Growth compresses the database into a compact structure called the FP-tree (Frequent Pattern tree) and then extracts frequent itemsets directly from this tree. This approach significantly reduces the computational burden, especially for large datasets.

Process Overview:

Data Preprocessing: Before applying FP-Growth, the data must be brought into a suitable format. This involves cleaning the data to remove null or malformed entries that may cause errors, filtering out irrelevant transactions (e.g., canceled orders), and transforming the dataset into a list of transactions where each transaction is the set of items bought together.

Building the FP-tree: The FP-Growth algorithm starts by creating the FP-tree, a compact representation of the transaction database in which each node holds a single item with a count and each path from the root represents transactions that share a common prefix. The tree is constructed by reading each transaction with its infrequent items removed and its remaining items sorted by descending overall frequency, so that transactions with common frequent items share nodes.

Extracting Frequent Itemsets: Once the FP-tree is built, the algorithm recursively extracts frequent itemsets by exploring the conditional pattern base of each item, starting from the least frequent item and combining it with the frequent items found in its conditional pattern base.

Internal Mining Process of FP-Growth:

Conditional FP-tree Construction: For each item (starting from the least frequent one), FP-Growth constructs a conditional FP-tree. This tree represents only those transactions that contain the given item. The process involves tracing the path of each item back to the root of the FP-tree, capturing only those nodes that are part of transactions including the target item. This effectively filters the dataset to focus on relevant subsets for each item, reducing the size of the data to be analyzed further.

Recursive Mining: The algorithm then recursively mines these conditional FP-trees, each time constructing a new tree for an item in the subset, until no more frequent itemsets can be found. This recursive approach allows FP-Growth to efficiently discover all frequent itemsets without having to generate candidate itemsets, unlike the Apriori algorithm.

Mining Separation: Internally, FP-Growth separates the mining process into two distinct parts:

Identifying frequent items: This involves scanning the database to count the frequency of each item and then filtering out those that do not meet the support threshold.
Constructing conditional bases: For each frequent item, its conditional pattern base (the collection of prefix paths in the FP-tree leading to the item) is identified. These pattern bases are then used to build conditional FP-trees, which are smaller and focused on specific itemsets.
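
The two parts described above can be illustrated with a small base-R sketch. It is a simplification under stated assumptions, not the implementation behind fim4r: it skips the compressed prefix tree and recurses directly on conditional pattern bases stored as plain lists. The function name fp_mine and the five baskets are hypothetical.

```r
# Recursive pattern growth over a pattern base.
# `base` is a list of entries, each list(items = <character vector>, count = <number>).
fp_mine <- function(base, min_count, suffix = character(0)) {
  items_vec <- unlist(lapply(base, function(e) e$items))
  if (length(items_vec) == 0) return(list())
  weights <- unlist(lapply(base, function(e) rep(e$count, length(e$items))))

  # Part 1: weighted item counts, then drop items below the support threshold
  counts   <- vapply(split(weights, items_vec), sum, numeric(1))
  frequent <- counts[counts >= min_count]
  if (length(frequent) == 0) return(list())

  # Frequency-descending order (ties broken by name), as used for FP-tree paths
  ord <- names(frequent)[order(-frequent, names(frequent))]

  results <- list()
  # Part 2: visit items from least to most frequent
  for (item in rev(ord)) {
    new_set <- c(item, suffix)
    results[[length(results) + 1]] <- list(items = sort(new_set),
                                           count = frequent[[item]])

    # Conditional pattern base: the more frequent items that co-occur with `item`
    higher <- ord[seq_len(match(item, ord) - 1)]
    cond <- list()
    for (entry in base) {
      if (item %in% entry$items) {
        prefix <- intersect(higher, entry$items)
        if (length(prefix) > 0) {
          cond[[length(cond) + 1]] <- list(items = prefix, count = entry$count)
        }
      }
    }
    # Recurse on the smaller, item-specific base
    results <- c(results, fp_mine(cond, min_count, new_set))
  }
  results
}

# Five hypothetical baskets; an itemset is frequent if it occurs in at least 3 of them
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)
base  <- lapply(baskets, function(b) list(items = b, count = 1))
found <- fp_mine(base, min_count = 3)

labels   <- vapply(found, function(r) paste(r$items, collapse = ","), character(1))
supports <- vapply(found, function(r) r$count, numeric(1))
print(data.frame(itemset = labels, count = supports))
```

On these baskets the sketch reports four frequent single items and four frequent pairs, and no candidate itemsets are generated and tested along the way. A real FP-tree would additionally merge identical prefixes into shared nodes, which is where the memory savings on large datasets come from.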

Application of FP-Growth

This section shows how the raw retail data is transformed into a format suitable for identifying frequent itemsets. The plyr package’s ddply function is used to aggregate items within transactions, followed by data cleaning and formatting tasks.

Preparing Data for FP-Growth in R

Aggregation

# Aggregate transaction items by Invoice and Date
transaction_data <- ddply(online_retail_data, c("Invoice", "Date")
                          , function(df1) {
  paste(df1$Description, collapse = ",")
})

# The aggregated data now consists of items listed together in transactions,
# making it easier to analyze them as sets for frequent pattern mining.

Removal of redundancies

Before proceeding, the dataset will be simplified by removing unnecessary columns.

This step ensures that transaction data is focused only on the items involved in each transaction, which is necessary for the FP-Growth algorithm.

# Remove the Invoice and Date columns as they are no longer needed (for this analysis)
transaction_data$Invoice <- NULL # Remove Invoice column (it is named Invoice, not InvoiceNo, in this dataset)
transaction_data$Date <- NULL # Remove Date column

# Rename the remaining column to 'items' for clarity
colnames(transaction_data) <- c("items")

Writing CSV with Transaction Data

The next step involves saving the processed transaction data to a CSV file. This file will then be read into an R transaction object, which is a format required by the FP-Growth algorithm for mining association rules.

# Write CSV with transaction data (row.names = FALSE so that row numbers are not read back as items)
write.csv(transaction_data,'market_basket_transactions.csv', quote = FALSE, row.names = FALSE)


tr <- read.transactions('market_basket_transactions.csv', format = 'basket', sep=',')
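
As a possible alternative to the CSV round trip, a transactions object can also be built directly from a list of item vectors. The sketch below uses hypothetical toy rows; only the column names follow the dataset. One reason to consider it: in the basket format every comma-separated field becomes an item, so a description that itself contains a comma would be split into two items, and any identifier column written to the file would be read back as an item.

```r
library(arules)

# Hypothetical toy line items; column names follow the dataset
toy <- data.frame(
  Invoice     = c("1", "1", "2", "2", "2", "3"),
  Description = c("TEA SET", "CAKE STAND", "TEA SET",
                  "CAKE STAND", "NAPKINS, RED", "TEA SET"),
  stringsAsFactors = FALSE
)

# One character vector of unique items per invoice
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Coerce the named list to an arules transactions object
tr_toy <- as(baskets, "transactions")

size(tr_toy)        # number of items in each transaction
itemLabels(tr_toy)  # "NAPKINS, RED" stays a single item
```

Applied to the real data, the same pattern would use online_retail_data$Description and online_retail_data$Invoice in the split() call.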

Reading Transaction Data for FP-Growth

With the data saved in a CSV file, proceed to read it into an R transaction object.

This step utilizes the arules package, which provides the necessary functionality for association rule mining, including the FP-Growth algorithm.

summary(tr)

## transactions as itemMatrix in sparse format with
##  21000 rows (elements/itemsets/transactions) and
##  49706 columns (items) and a density of 0.0004449678 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                               2961                               1808 
##     STRAWBERRY CERAMIC TRINKET BOX      ASSORTED COLOUR BIRD ORNAMENT 
##                               1423                               1283 
##   PACK OF 72 RETRO SPOT CAKE CASES                            (Other) 
##                               1233                             455761 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
##    1 2276  970  807  745  724  678  672  588  651  592  592  602  574  541  559 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##  498  520  531  482  465  445  392  362  334  310  293  255  268  227  214  185 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##  184  170  138  139  147  127  139  128  122   94  112  112   87   77   79   77 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   71   80   72   59   69   51   50   49   57   49   52   41   28   34   24   31 
##   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
##   26   27   28   31   23   22   20   25   20   25   13   19   13   14   13   17 
##   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96   97 
##    7   16   11   10   11   11    7    7   10   12   10   14    5   10   11    6 
##   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112  113 
##    7   10    9    5    7    6    3    8    9   10    6    6    6    5    6    3 
##  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128  129 
##    2    9    6    5    4    6    3    3    5    9    6    3    2    1    3    1 
##  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145 
##    2    3    3    2    4    2    8    4    4    6    7    3    4    3    1    2 
##  146  147  148  149  150  151  152  153  154  155  156  159  160  161  162  163 
##    2    1    2    3    2    2    2    2    5    3    3    2    3    5    2    1 
##  164  165  166  167  168  169  170  171  172  173  175  176  177  178  179  181 
##    5    3    3    5    3    2    2    5    1    2    2    2    1    2    1    3 
##  182  187  189  190  191  193  195  196  197  198  201  205  207  208  212  213 
##    1    1    2    2    1    4    1    1    1    2    1    1    1    2    1    3 
##  214  215  216  217  219  220  224  226  227  228  229  230  231  234  236  237 
##    1    1    1    1    1    2    2    1    1    1    3    1    1    2    4    2 
##  247  248  259  261  263  264  268  269  278  279  282  283  285  293  294  300 
##    1    2    1    1    1    1    1    1    1    2    1    1    1    1    1    1 
##  302  304  311  312  314  318  327  334  345  362  365  367  381  400  419  421 
##    1    1    2    1    1    1    1    1    2    1    1    1    1    1    1    1 
##  425  429  514 
##    2    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   16.00   22.12   27.00  514.00 
## 
## includes extended item information - examples:
##                     labels
## 1    *Boombox Ipod Classic
## 2 *USB Office Glitter Lamp
## 3                        ?

The summary of the transaction object provides an overview of the dataset, including the number of transactions and items, item distribution across transactions, and other relevant statistics. This information is important for understanding the data’s structure before proceeding with the FP-Growth algorithm for mining frequent itemsets and association rules.

Parameters of FP-Growth function (fim4r function) and how to set it up

The fim4r function serves as an interface to various mining algorithms, enabling the mining of different patterns from transaction data. Each parameter within fim4r is designed to customize the mining process to meet specific analytical goals.

Parameters of `fim4r` Function

transactions: The dataset for the mining process, expected to be a collection of transactions where each transaction comprises a set of items purchased together. This should be formatted as a transactions object, a specific class in R designed to handle transaction data efficiently.
method: Determines the mining algorithm to be employed. Each algorithm offers unique advantages:
- “apriori”: Best for smaller datasets, operating through candidate itemset generation and subsequent pruning based on the support threshold.
- “eclat”: Uses a depth-first search to efficiently uncover frequent itemsets, surpassing Apriori’s performance on dense datasets.
- “fpgrowth”: Optimized for large datasets by constructing and mining from a compact FP-tree structure, eliminating the need for candidate generation. Used in this research.
- “relim”, “sam”: Focus on mining itemsets by recursively removing non-frequent items or employing a split-and-merge strategy.
- “carpenter”, “ista”: Specialized for identifying closed itemsets, which reduces output redundancy by reporting only itemsets that have no superset with the same support.
support: Sets the minimum support threshold, expressed as a fraction of total transactions. Only itemsets appearing in at least this fraction of transactions are considered frequent, so infrequent itemsets are discarded.
confidence: Applicable when the target is set to “rules”, establishing the minimum confidence level for a rule to be deemed significant. This parameter evaluates the probability of purchasing the consequent given the antecedent’s purchase.
target: Specifies the desired pattern type for mining, with options including “frequent”, “closed”, “maximal”, “generators”, or “rules”. This choice allows analysts to focus on particular pattern types that align with their analytical objectives.
- Frequent
  - Description: Identifies itemsets that appear together in transactions more frequently than a specified threshold (support).
  - Use Case: Useful for discovering common product combinations, foundational for market basket analysis.
- Closed
  - Description: Frequent itemsets for which no superset exists with the same support count, reducing redundancy.
  - Use Case: Helps in making the analysis more concise by focusing on itemsets that provide unique information.
- Maximal
  - Description: Frequent itemsets that are not subsets of any other frequent itemset, representing the largest sets of items appearing together frequently.
  - Use Case: Useful for understanding the broadest item combinations without considering all possible frequent subsets.
- Generators
  - Description: The smallest sets of items that generate a closed itemset under the closure operation, representing irreducible item combinations.
  - Use Case: Provides insights into the base combinations of items leading to larger closed itemsets, useful for understanding foundational item combinations.
- Rules
  - Description: Focuses on generating association rules from the frequent itemsets, represented as \(A \Rightarrow B\), where \(A\) and \(B\) are itemsets. These rules are evaluated based on metrics like:
    - Confidence: The probability of seeing the consequent in transactions containing the antecedent, calculated as \(\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}\).
    - Lift: How many times more often the antecedent and consequent occur together than would be expected if they were independent, calculated as \(\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}\).
    - Use Case: Important for uncovering actionable insights for strategic decision-making in marketing, promotions, and inventory management.
originalSupport: A boolean indicating whether the support threshold applies to the entire rule (both LHS and RHS) or only to the LHS (antecedent), affecting the selection of relevant rules based on given criteria.
appear: Enables restrictions on item appearances within the rules, directing the mining towards patterns of greater interest or relevance, such as rules where specific items appear only as consequents.
report: While not accessible via the interface, reporting parameters typically influence the level of detail or the format of the mining output in broader contexts.
verbose: When TRUE, this logical flag triggers the function to print detailed information about used parameters and the mining process, enabling debugging and deeper understanding of the operation.
…: Represents additional, method-specific arguments that can be passed to the underlying mining function, offering further customization for algorithmic tuning.
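
The confidence and lift formulas listed under the “rules” target can be checked by hand on a few baskets. The following base-R sketch uses five hypothetical baskets and a small helper function, support(), defined here for illustration only (it is not part of arules or fim4r):

```r
# Five hypothetical baskets
baskets <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B")
)

# Fraction of baskets that contain every item of `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

supp_A  <- support("A", baskets)          # 4 of 5 baskets
supp_B  <- support("B", baskets)          # 4 of 5 baskets
supp_AB <- support(c("A", "B"), baskets)  # 3 of 5 baskets

# confidence(A => B) = support(A and B) / support(A)
conf_A_B <- supp_AB / supp_A              # 0.6 / 0.8
# lift(A => B) = confidence(A => B) / support(B)
lift_A_B <- conf_A_B / supp_B             # 0.75 / 0.8

round(c(confidence = conf_A_B, lift = lift_A_B), 4)
```

Here the confidence is 0.75 and the lift is about 0.94. A lift just below 1 means that A and B appear together slightly less often than independence would predict, even though the confidence looks high.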

Initial Parameters

Support (0.001 or 0.1%): Targets itemsets appearing in at least 0.1% of transactions, aiming to uncover rare associations.
Confidence (80%): Seeks rules where the consequent is purchased in at least 80% of the transactions that contain the antecedent.
Target (“rules”): Focuses on generating association rules rather than just frequent itemsets.
Method (“fpgrowth”): Utilizes the FP-Growth algorithm for efficient mining, suitable for large datasets.

Parameter Adjustment Guide

Support

Adjusting Upward: Increase the support threshold if too many rules are generated, focusing on more common and potentially actionable associations.
Adjusting Downward: Decrease further only if seeking rarer itemsets.

Confidence

Adjusting Downward: Lower the confidence threshold to discover more rules, including those with weaker but potentially interesting associations.
Adjusting Upward: Increase only if seeking rules with very high predictive reliability, which may reduce the total number of rules found.

Considerations for Parameter Testing

Dataset Characteristics: The size and density of the dataset may require adjustments to manage output volume and relevance.
Analysis Goals: Tailor parameters based on whether the focus is on exploring the data or looking for specific types of associations.
Output Volume: If there are too many rules to review, it may be necessary to tighten the parameters until the rule set is manageable.

Next Steps in Analysis

Iterative Testing: Start with the initial settings, analyze the output, and adjust based on findings and objectives.
Rule Quality Metrics: Consider additional metrics like lift, leverage, and conviction for deeper insights.
Incorporate Domain Knowledge: Use known patterns or hypotheses to guide the choice of thresholds.

Adjusting parameters based on these guidelines can help refine the mining process to better align with specific analytical goals and the characteristics of the given dataset.
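
The iterative testing described above can be organized as a small sweep over support thresholds. The sketch below is an illustration under assumptions, not the procedure used in this analysis: it runs arules’ own apriori() on the Groceries example data shipped with arules, so that it works without the fim4r backend and without the retail file, and the grid of support values is arbitrary.

```r
library(arules)
data("Groceries")  # example transactions shipped with arules (not the retail data)

# Candidate minimum-support values to compare (arbitrary grid)
support_grid <- c(0.001, 0.002, 0.005, 0.01)

# Mine rules at each threshold and record how many are returned
rule_counts <- vapply(support_grid, function(s) {
  rules <- apriori(
    Groceries,
    parameter = list(support = s, confidence = 0.8, target = "rules"),
    control   = list(verbose = FALSE)
  )
  length(rules)
}, numeric(1))

data.frame(support = support_grid, n_rules = rule_counts)
```

The number of rules can only stay the same or shrink as the support threshold rises, so a table like this shows quickly where the output becomes small enough to inspect by hand. With the retail data, the call inside the loop would be replaced by the fim4r() call with method = "fpgrowth".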

Analysis and Results

This part moves from data preparation to the analytical process, using the FP-Growth algorithm to mine frequent itemsets within the transaction data.

The FP-Growth algorithm, known for its efficiency and scalability, avoids the lengthy candidate generation and verification steps that are characteristic of earlier algorithms such as Apriori.

By directly constructing a compressed version of the dataset in the form of an FP-tree, FP-Growth extracts frequent itemsets faster and without repeated database scans.

Analyzing Transaction Data with FP-Growth

The application of FP-Growth to the “Online Retail II” dataset aims to uncover patterns of product associations that frequently occur within customer transactions. These patterns, manifested as sets of items that are often bought together, provide valuable insights into customer purchasing behavior.

Understanding these associations enables the identification of potential strategies for cross-selling, promotions, and inventory management aimed at improving sales and customer satisfaction.

association_rules <- fim4r(tr, method = "fpgrowth", support = 0.001, confidence = 0.8, target = "rules")

## fim4r.fpgrowth 
## 
## Parameter specification:
##  supp conf target report
##   0.1   80  rules    scl
## 
## Data size: 21000 transactions and 49706 items 
## Result: 143316 rules

association_rules

## set of 143316 rules

summary(association_rules)

## set of 143316 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10    11    12 
##   187  8852 25429 30082 30662 25013 14881  6169  1722   295    24 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   5.815   7.000  12.000 
## 
## summary of quality measures:
##     support           confidence          lift              count       
##  Min.   :0.001000   Min.   :0.8000   Min.   :   5.674   Min.   : 21.00  
##  1st Qu.:0.001048   1st Qu.:0.9091   1st Qu.:  35.993   1st Qu.: 22.00  
##  Median :0.001190   Median :0.9583   Median :  58.706   Median : 25.00  
##  Mean   :0.001355   Mean   :0.9448   Mean   :  92.703   Mean   : 28.39  
##  3rd Qu.:0.001571   3rd Qu.:1.0000   3rd Qu.:  95.703   3rd Qu.: 33.00  
##  Max.   :0.019381   Max.   :1.0000   Max.   :1000.000   Max.   :407.00

inspectDT(head(sort(association_rules, by = "confidence"), 3))

This run produced too many rules to inspect: roughly 45 thousand rules at confidence 1 and about 143 thousand at confidence 0.8 or above. Because it is better to adjust support before confidence, the support threshold is raised in the next step.

The chart above provides a depiction of the most popular items within the dataset.

The section below details the creation of stronger association rules using the FP-Growth algorithm, raising the minimum support from 0.001 to 0.01 while keeping the minimum confidence at 0.8.

stronger_association_rules <- fim4r(tr, method = "fpgrowth", support = 0.01, confidence = 0.8, target = "rules")

## fim4r.fpgrowth 
## 
## Parameter specification:
##  supp conf target report
##     1   80  rules    scl
## 
## Data size: 21000 transactions and 49706 items 
## Result: 32 rules

summary(stronger_association_rules)

## set of 32 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 21 11 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.344   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01010   Min.   :0.8000   Min.   :16.93   Min.   :212.0  
##  1st Qu.:0.01125   1st Qu.:1.0000   1st Qu.:39.50   1st Qu.:236.2  
##  Median :0.01219   Median :1.0000   Median :47.24   Median :256.0  
##  Mean   :0.01277   Mean   :0.9634   Mean   :49.63   Mean   :268.1  
##  3rd Qu.:0.01367   3rd Qu.:1.0000   3rd Qu.:62.11   3rd Qu.:287.0  
##  Max.   :0.01938   Max.   :1.0000   Max.   :70.23   Max.   :407.0

inspectDT(head(sort(stronger_association_rules, by = "confidence"), 10))

The revised table provides a clearer view than the previous one, with fewer items per rule.

To understand how the table functions, consider the following explanations for two of the rules:

{MAGIC GARDEN} => {HOOK} (Rule [2]): MAGIC GARDEN and HOOK appear together in 1.7% of transactions (350 transactions), and every time MAGIC GARDEN is bought, HOOK is also bought. The rule's high lift indicates a very strong association; for example, a lift of 40 means that the two items are found together 40 times more often than would be expected if they were independent.

{PINK/WHITE SPOTS} => {CHARLOTTE BAG} (Rule [9]): This combination appears in 1.4% of transactions, with a 100% chance of CHARLOTTE BAG being bought whenever PINK/WHITE SPOTS items are bought, seen in 302 transactions. The high lift indicates a strong association.

To refine the analysis further, the next step identifies and removes rules whose item set contains the complete item set of another rule.

# Use the code below to remove such rules:
stronger_subset_rules <- which(colSums(is.subset(stronger_association_rules, stronger_association_rules)) > 1) # indices of rules that have another rule as a subset
length(stronger_subset_rules)

## [1] 15

subset_stronger_association_rules <- stronger_association_rules[-stronger_subset_rules] # Remove subset rules.

subset_stronger_association_rules

## set of 17 rules
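
How the colSums(is.subset(...)) filter decides which rules to flag is easier to see on a toy example. The sketch below uses three hand-made item sets as stand-ins for the items of three rules; the sets are hypothetical, and sparse = FALSE is passed only so that the result is an ordinary logical matrix.

```r
library(arules)

# Three hypothetical item sets standing in for the items of three rules
sets <- as(list(c("A", "B"), c("A", "B", "C"), c("D")), "itemMatrix")

# Entry [i, j] is TRUE when set i is a subset of set j
# (every set counts as a subset of itself)
m <- is.subset(sets, sets, sparse = FALSE)

# A column sum above 1 flags a set that contains at least one other set
flagged <- which(colSums(m) > 1)
flagged
```

Only the second set, {A, B, C}, is flagged, because it contains {A, B}. Since every set is a subset of itself, two rules built from exactly the same items (for example {A} => {B} and {B} => {A}) would flag each other and both be removed by this filter, which is worth keeping in mind when reading the reduced rule set.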

After filtering, the remaining 17 rules are summarized and visualized below.

summary(subset_stronger_association_rules) #shows the following:

## set of 17 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 15  2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.118   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01010   Min.   :0.8014   Min.   :16.93   Min.   :212.0  
##  1st Qu.:0.01110   1st Qu.:1.0000   1st Qu.:36.84   1st Qu.:233.0  
##  Median :0.01257   Median :1.0000   Median :40.38   Median :264.0  
##  Mean   :0.01302   Mean   :0.9693   Mean   :43.24   Mean   :273.3  
##  3rd Qu.:0.01438   3rd Qu.:1.0000   3rd Qu.:48.22   3rd Qu.:302.0  
##  Max.   :0.01938   Max.   :1.0000   Max.   :70.23   Max.   :407.0

An interactive DataTable is created using inspectDT to display the top 10 association rules sorted by confidence.

This view supports an in-depth exploration of the strongest rules (the 10 highest by confidence), allowing users to interactively examine the details of each rule, including the items involved and the metrics that quantify their association.

This approach assists in identifying the most reliable patterns within the dataset.

inspectDT(head(sort(subset_stronger_association_rules, by = "confidence"), 10))

Below are some graphs that may be helpful for visualizing the association rules.

The graph below shows the association rules filtered to those with a confidence greater than 40%. Because the rules were mined with a minimum confidence of 0.8, this filter retains all 143,316 rules.

The visualization, created using the default plotting method (a scatter plot), provides a graphical representation of the rules, illustrating the strength of association.

# Filter rules with confidence greater than 0.4 (40%)
subRules <- association_rules[quality(association_rules)$confidence > 0.4]
# Plot subRules
plot(subRules)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The scatter plot visualizes 143,316 association rules by their support and confidence. Darker red points indicate a higher lift, suggesting strong item associations, while the clustering at the low-support edge of the plot reflects the prevalence of rules with high confidence but low support.
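
The confidence filter applied before plotting can also be written with subset(), which some readers find easier to read than indexing on quality(). The sketch below is only an illustration: it assumes the arules package is installed, uses the Groceries data shipped with arules in place of the retail transactions, and mines demo rules with apriori() rather than the FP-Growth call used in this document.

```r
library(arules)

# Groceries ships with arules; it stands in for the retail transactions here
data("Groceries")

# apriori() is used only to obtain some rules to filter
# (an assumption for this sketch, not the FP-Growth call from the document)
demo_rules <- apriori(
  Groceries,
  parameter = list(support = 0.01, confidence = 0.2),
  control = list(verbose = FALSE)
)

# Index-based filter, in the same style as the text
by_index <- demo_rules[quality(demo_rules)$confidence > 0.4]

# The same filter expressed with subset()
by_subset <- subset(demo_rules, subset = confidence > 0.4)

# Both approaches keep the same number of rules
print(length(by_index) == length(by_subset))
```

With subset(), further conditions such as a minimum lift can be combined in the same expression using &.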

The two-key plot below is another variation of this visualization: it assigns different colors to rules of different orders, where the order of a rule is the number of items it contains. An “order 3” rule might therefore look like either of the following:

{Item A, Item C} => {Item B}
{Item A} => {Item B, Item C}

plot(subRules,method="two-key plot")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
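
The order used for coloring in the two-key plot can be inspected directly with size(), which returns the number of items (lhs plus rhs) in each rule. The sketch below is a minimal illustration under the same assumptions as any stand-alone demo: it uses the Groceries data from arules and rules mined with apriori(), not the rules generated in this document.

```r
library(arules)

# Stand-in data and rules (assumption: not the document's retail rules)
data("Groceries")
demo_rules <- apriori(
  Groceries,
  parameter = list(support = 0.01, confidence = 0.2),
  control = list(verbose = FALSE)
)

# Order of each rule = number of items on both sides of the rule
rule_order <- size(demo_rules)

# How many rules exist for each order
print(table(rule_order))

# Keep only the order-3 rules and show the three with the highest lift
order3_rules <- demo_rules[rule_order == 3]
inspect(head(order3_rules, n = 3, by = "lift"))
```

Applied to subRules, the same table(size(...)) call would show how many rules fall into each color group of the two-key plot.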

Below is an interactive graph-based visualization of the 10 rules with the highest confidence from subset_stronger_association_rules. In such a graph:

# 10 rules from subset_stronger_association_rules with the highest confidence
top10subRules <- head(subset_stronger_association_rules, n = 10, by = "confidence")
# engine = "htmlwidget" produces the interactive version of the graph
plot(top10subRules, method = "graph", engine = "htmlwidget")

Nodes (Circles): represent items and, in the graph method of arulesViz, the rules themselves.

Edges (Arrows): connect items to rules. Arrows point from the antecedent items (the “if” part) into a rule node, and from the rule node to its consequent (the “then” part).

Node Size and Color: denote the significance of a rule. By default, a larger rule node indicates higher support and a more intense color indicates higher lift.

Interactivity: The “Select by id” dropdown allows users to select specific rules or items to examine more closely.

The parallel coordinates plot below is generated for the top 10 rules by lift from the refined subset_stronger_association_rules.

This non-interactive visualization displays the relationships between the antecedent (lhs) and consequent (rhs) of each rule, offering a clear view of how lift differs across these highly influential rules. The plot is useful for comparing the relative strength of the top associations and for highlighting patterns that may be effective for cross-selling or promotional strategies. (Adding the rule names on the right side of the graph could also be considered.)

subRules2<-head(subset_stronger_association_rules, n=10, by="lift")
plot(subRules2, method="paracoord")

Conclusion

Applying the FP-Growth algorithm to the “Online Retail II” dataset has highlighted the practical use of the technique in uncovering purchasing behaviors. Most of the focus here has been on understanding and setting the algorithm’s parameters and on visualizing the outcomes, so future research would benefit from a deeper analytical approach. One option is to use other methods, such as Eclat, alongside FP-Growth to gain a more comprehensive understanding of consumer behavior. The insights gained from such research could support informed decisions in retail management.

Market Basket Analysis of Online Retail Data Using FP-Growth

Maciej Kuchciak

February 2024

