The exploration of transactional data for pattern discovery is a critical aspect of retail analytics. This project applies the FP-Growth algorithm to a dataset from a UK-based online retailer covering transactions between December 1, 2009, and December 9, 2011. The dataset, “Online Retail II” from the UCI (UC Irvine) Machine Learning Repository, details transactions involving a wide array of unique giftware products sold to both individual customers and wholesalers. The purpose of applying FP-Growth is to mine frequent itemsets and generate association rules that reveal recurring patterns in customer purchase behavior.
FP-Growth mines frequent itemsets without candidate generation, the step that limits traditional association rule learning methods such as Apriori. By constructing an FP-tree, the algorithm reduces the number of database scans required, which makes the analysis faster and more scalable. This efficiency makes FP-Growth better suited than candidate-based algorithms to large datasets like “Online Retail II”, and the patterns it finds can inform strategic decisions in retail management and marketing.
Installation of libraries used in the code:
# Install necessary packages if not already installed
if(!require("tidyverse")) install.packages("tidyverse")
if(!require("arules")) install.packages("arules")
if(!require("arulesViz")) install.packages("arulesViz")
if(!require("gridExtra")) install.packages("gridExtra")
if(!require("dplyr")) install.packages("dplyr")
if(!require("readxl")) install.packages("readxl")
if(!require("plyr")) install.packages("plyr")
if(!require("ggplot2")) install.packages("ggplot2")
if(!require("knitr")) install.packages("knitr")
if(!require("lubridate")) install.packages("lubridate")
if(!require("kableExtra")) install.packages("kableExtra")
if(!require("RColorBrewer")) install.packages("RColorBrewer")
if(!require("purrr")) install.packages("purrr")
if(!require("tidyr")) install.packages("tidyr")
Libraries used in the code and their descriptions:
# Load libraries and add comments
library(tidyverse) # Used for data manipulation and visualization
library(arules) # Used for mining association rules and frequent itemsets
library(arulesViz) # Used for visualizing association rules and frequent itemsets
library(gridExtra) # Provides functions for arranging multiple plots in one layout
library(dplyr) # Additional package for data manipulation
library(readxl) # Reading Excel files
library(plyr) # Contains tools for splitting, applying, and combining data
library(ggplot2) # System for graphics
library(knitr) # Dynamic report generation
library(lubridate) # Easier date and time manipulation
library(kableExtra) # Enhanced 'kable' tables with additional styling options
library(RColorBrewer) # Color schemes for graphics
library(purrr) # Functional programming tools
library(tidyr) # Tools for tidying data
This section outlines the procedure used to analyze transaction data with the FP-Growth algorithm. It begins with the selection of a suitable dataset, followed by the data preparation steps that put the dataset into a form suitable for mining frequent itemsets.
The dataset chosen for this study is “Online Retail II”, accessible from the UCI Machine Learning Repository. It represents a comprehensive collection of transactions from a UK-based online retailer that operates without a physical storefront. Spanning December 1, 2009, to December 9, 2011, the dataset provides a detailed record of sales transactions, including data on the products sold, their prices, the transaction dates, and customer information.
Key attributes of the dataset include:
Invoice: A unique identifier for each transaction, where codes starting with ‘C’ indicate cancellations.
StockCode: A unique identifier assigned to each product.
Description: The name of the product.
Quantity: The number of units of the product on each invoice line.
InvoiceDate: The date and time when each transaction occurred.
Price: The price per unit of the product, in sterling.
Customer ID: A unique identifier for each customer.
Country: The country where the customer resides.
This dataset was selected for the volume and quality of its transactional data across a broad spectrum of products and for its potential to reveal insights into consumer purchasing patterns.
The “Online Retail II” dataset’s nature provides an ideal base for applying the FP-Growth algorithm to uncover associations between products and to analyze trends in customer buying behavior over the two-year period.
# Load dataset and check str
file_path <- "online-retail/online_retail_II.xlsx"
#read excel into R dataframe
online_retail_data <- read_excel(file_path)
# Display the dimensions and structure of the loaded dataset
dim(online_retail_data)
## [1] 525461 8
str(online_retail_data)
## tibble [525,461 × 8] (S3: tbl_df/tbl/data.frame)
## $ Invoice : chr [1:525461] "489434" "489434" "489434" "489434" ...
## $ StockCode : chr [1:525461] "85048" "79323P" "79323W" "22041" ...
## $ Description: chr [1:525461] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
## $ Quantity : num [1:525461] 12 12 12 48 24 24 24 10 12 12 ...
## $ InvoiceDate: POSIXct[1:525461], format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
## $ Price : num [1:525461] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
## $ Customer ID: num [1:525461] 13085 13085 13085 13085 13085 ...
## $ Country : chr [1:525461] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
The dataset contains missing values in the Description and CustomerID fields.
For the purpose of this analysis, handling missing Description values is necessary, while CustomerID can be ignored.
# Count the number of missing values in each column
na_counts <- sapply(online_retail_data, function(x) sum(is.na(x)))
na_counts
## Invoice StockCode Description Quantity InvoiceDate Price
## 0 0 2928 0 0 0
## Customer ID Country
## 107927 0
# Handle missing descriptions
online_retail_data <- online_retail_data %>%
filter(!is.na(Description))
# Check the dimensions after removal of missing descriptions
dim(online_retail_data)
## [1] 522533 8
The rows containing missing descriptions were removed.
Following the removal of missing descriptions, the next steps cover the further data preparation necessary for FP-Growth analysis:
#online_retail_data$Invoice
#online_retail_data$StockCode
online_retail_data$Description <- as.factor(online_retail_data$Description) # Convert the 'Description' column to a factor for categorical analysis
dim(online_retail_data) # Print dimensions before removing cancelled transactions
## [1] 522533 8
# Filter out cancelled transactions (Quantity less than or equal to 0) and invoices whose code contains letters
online_retail_data <- online_retail_data %>%
filter(Quantity > 0, !grepl("[a-zA-Z]", Invoice)) # Keep rows with positive Quantity and numeric Invoice
dim(online_retail_data) # Print dimensions after filtering to see effect
## [1] 512030 8
online_retail_data$Country <- as.factor(online_retail_data$Country) # Convert Country to factor
online_retail_data$Date <- as.Date(online_retail_data$InvoiceDate) # Extract Date from InvoiceDate
online_retail_data$Time <- format(online_retail_data$InvoiceDate,"%H:%M:%S") # Extract Time from InvoiceDate
#online_retail_data$Price
dim(online_retail_data) # Print final dimensions after all operations
## [1] 512030 10
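The filtering rule used above can be checked on a handful of rows before it is applied to the full dataset. The sketch below uses base R only; the invoice codes and quantities are made-up values chosen to hit each case, not rows from the dataset.

```r
# Toy rows with hypothetical invoice codes; a letter marks a non-sale entry.
toy <- data.frame(
  Invoice  = c("489434", "C489449", "489435", "A506401"),
  Quantity = c(12, -3, 0, 1),
  stringsAsFactors = FALSE
)

# Same condition as in the report: positive quantity and no letters in the code.
keep <- toy$Quantity > 0 & !grepl("[a-zA-Z]", toy$Invoice)
kept <- toy[keep, ]
print(kept)
```

Only the first row passes both conditions: the second and third fail on quantity, and the fourth fails on the letter in its code.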
The dataset is now preprocessed and structured for the application of the FP-Growth algorithm to mine frequent itemsets and derive association rules.
The FP-Growth algorithm represents a significant advancement in the field of association rule mining, offering a more efficient alternative to the traditionally used Apriori algorithm. Unlike Apriori, which requires multiple scans of the transaction database to generate frequent itemsets, FP-Growth compresses the database into a compact structure called the FP-tree (Frequent Pattern tree) and then extracts frequent itemsets directly from this tree. This approach significantly reduces the computational burden, especially for large datasets.
Data Preprocessing: Before applying FP-Growth, it’s important to preprocess the data to ensure it’s in a suitable format. This involves cleaning the data to remove any null values or entries that may cause errors, filtering out irrelevant transactions (e.g., canceled orders), and transforming the dataset into a list of transactions where each transaction is a set of items bought together.
Building the FP-tree: The FP-Growth algorithm starts by creating the FP-tree, a compact representation of the transaction database where nodes correspond to items and paths from the root represent transactions. The tree is constructed by reading each transaction with its items sorted by their overall frequency in the dataset, so that transactions with common frequent items share a prefix.
Extracting Frequent Itemsets: Once the FP-tree is built, the algorithm recursively extracts frequent itemsets by exploring the conditional pattern base of each item, starting from the least frequent item, and combining the item with the patterns found there.
Conditional FP-tree Construction: For each item (starting from the least frequent one), FP-Growth constructs a conditional FP-tree. This tree represents only those transactions that contain the given item. The process involves tracing the path of each item back to the root of the FP-tree, capturing only those nodes that are part of transactions including the target item. This effectively filters the dataset to focus on relevant subsets for each item, reducing the size of the data to be analyzed further.
Recursive Mining: The algorithm then recursively mines these conditional FP-trees, each time constructing a new tree for an item in the subset, until no more frequent itemsets can be found. This recursive approach allows FP-Growth to efficiently discover all frequent itemsets without having to generate candidate itemsets, unlike the Apriori algorithm.
Mining Separation: Internally, FP-Growth separates the mining process into two distinct parts:
Identifying frequent items: This involves scanning the database to find the frequency of each item and then filtering out those that do not meet the support threshold.
Constructing conditional bases: For each frequent item, its conditional pattern base (a collection of prefix paths in the FP-tree leading to the item) is identified. These pattern bases are then used to build conditional FP-trees, which are smaller and focused on specific itemsets.
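The first steps described above (counting items, ordering each transaction by item frequency, and collecting the prefixes that lead to an item) can be sketched in base R. This is an illustration on toy data, not the implementation used by fim4r: it works on plain lists instead of a real FP-tree, and the items, the `min_count` threshold, and the helper `pattern_base` are assumptions made for the example.

```r
# Toy transactions (hypothetical items a-e).
transactions <- list(
  c("a", "b", "c"),
  c("b", "c", "d"),
  c("a", "b"),
  c("b", "e"),
  c("a", "b", "c")
)
min_count <- 2  # assumed minimum support count

# Scan 1: count each item and keep those meeting the threshold,
# sorted by descending frequency (ties broken alphabetically).
counts <- table(unlist(transactions))
frequent <- counts[counts >= min_count]
item_order <- names(frequent)[order(-as.integer(frequent), names(frequent))]

# Scan 2: drop infrequent items and reorder every transaction.
ordered <- lapply(transactions, function(tx) item_order[item_order %in% tx])

# Conditional pattern base of an item: the prefix that precedes it in
# each ordered transaction containing it (empty prefixes are dropped).
pattern_base <- function(item) {
  hits <- Filter(function(tx) item %in% tx, ordered)
  prefixes <- lapply(hits, function(tx) tx[seq_len(match(item, tx) - 1)])
  Filter(length, prefixes)
}

print(item_order)
print(pattern_base("c"))
```

In a real FP-tree the ordered transactions are merged into shared paths with counts, so identical prefixes are stored once; the list form here only makes the ordering and the prefix idea visible.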
This section shows how the raw retail data is transformed into a format suitable for identifying frequent itemsets. The plyr package’s ddply function is used to aggregate items within transactions, followed by data cleaning and formatting tasks.
Aggregation
# Aggregate transaction items by Invoice and Date
transaction_data <- ddply(online_retail_data, c("Invoice", "Date")
, function(df1) {
paste(df1$Description, collapse = ",")
})
# The aggregated data now consists of items listed together in transactions,
# making it easier to analyze them as sets for frequent pattern mining.
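To see what the aggregation produces, the same idea can be tried on a few toy rows. The sketch below uses base R's `tapply` instead of `ddply` and groups by invoice only (the report groups by invoice and date); the invoice numbers and product names are hypothetical.

```r
# Toy invoice lines (hypothetical invoices and products).
line_items <- data.frame(
  Invoice     = c("1001", "1001", "1002", "1003", "1003", "1003"),
  Description = c("MUG", "PLATE", "MUG", "BOWL", "MUG", "PLATE"),
  stringsAsFactors = FALSE
)

# Collapse the items of each invoice into one comma-separated string,
# which mirrors the ddply step on a small scale.
baskets <- tapply(line_items$Description, line_items$Invoice,
                  paste, collapse = ",")
print(baskets)
```

Each element of `baskets` is one transaction written as a single line of text, which is the shape the basket-format CSV needs.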
Removal of redundancies
Before proceeding, the dataset will be simplified by removing unnecessary columns.
This step ensures that transaction data is focused only on the items involved in each transaction, which is necessary for the FP-Growth algorithm.
# Remove the Invoice and Date columns as they are no longer needed (for this analysis)
transaction_data$Invoice <- NULL # Remove Invoice column (the loaded data names it Invoice, not InvoiceNo)
transaction_data$Date <- NULL # Remove Date column
# Rename the remaining column to 'items' for clarity
colnames(transaction_data) <- c("items")
Writing CSV with Transaction Data
The next step involves saving the processed transaction data to a CSV file. This file will then be read into an R transaction object, which is a format required by the FP-Growth algorithm for mining association rules.
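# Write CSV with transaction data
# row.names = FALSE keeps the row index from being written as the first field of every basket
write.csv(transaction_data, 'market_basket_transactions.csv', quote = FALSE, row.names = FALSE)
# skip = 1 skips the header line so that it is not read as a transaction
tr <- read.transactions('market_basket_transactions.csv', format = 'basket', sep = ',', skip = 1)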
Reading Transaction Data for FP-Growth
With the data saved in a CSV file, it is read into an R transactions object with read.transactions.
This step relies on the arules package, which provides the necessary functionality for association rule mining, including access to the FP-Growth algorithm. The summary of the resulting object is shown below.
summary(tr)
## transactions as itemMatrix in sparse format with
## 21000 rows (elements/itemsets/transactions) and
## 49706 columns (items) and a density of 0.0004449678
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 2961 1808
## STRAWBERRY CERAMIC TRINKET BOX ASSORTED COLOUR BIRD ORNAMENT
## 1423 1283
## PACK OF 72 RETRO SPOT CAKE CASES (Other)
## 1233 455761
##
## element (itemset/transaction) length distribution:
## sizes
## 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 1 2276 970 807 745 724 678 672 588 651 592 592 602 574 541 559
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 498 520 531 482 465 445 392 362 334 310 293 255 268 227 214 185
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 184 170 138 139 147 127 139 128 122 94 112 112 87 77 79 77
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
## 71 80 72 59 69 51 50 49 57 49 52 41 28 34 24 31
## 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
## 26 27 28 31 23 22 20 25 20 25 13 19 13 14 13 17
## 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## 7 16 11 10 11 11 7 7 10 12 10 14 5 10 11 6
## 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## 7 10 9 5 7 6 3 8 9 10 6 6 6 5 6 3
## 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
## 2 9 6 5 4 6 3 3 5 9 6 3 2 1 3 1
## 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
## 2 3 3 2 4 2 8 4 4 6 7 3 4 3 1 2
## 146 147 148 149 150 151 152 153 154 155 156 159 160 161 162 163
## 2 1 2 3 2 2 2 2 5 3 3 2 3 5 2 1
## 164 165 166 167 168 169 170 171 172 173 175 176 177 178 179 181
## 5 3 3 5 3 2 2 5 1 2 2 2 1 2 1 3
## 182 187 189 190 191 193 195 196 197 198 201 205 207 208 212 213
## 1 1 2 2 1 4 1 1 1 2 1 1 1 2 1 3
## 214 215 216 217 219 220 224 226 227 228 229 230 231 234 236 237
## 1 1 1 1 1 2 2 1 1 1 3 1 1 2 4 2
## 247 248 259 261 263 264 268 269 278 279 282 283 285 293 294 300
## 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## 302 304 311 312 314 318 327 334 345 362 365 367 381 400 419 421
## 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## 425 429 514
## 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 16.00 22.12 27.00 514.00
##
## includes extended item information - examples:
## labels
## 1 *Boombox Ipod Classic
## 2 *USB Office Glitter Lamp
## 3 ?
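The summary of the transaction object gives an overview of the dataset: the number of transactions and items, the most frequent items, and the distribution of transaction sizes. This information is important for understanding the data’s structure before mining frequent itemsets and association rules with FP-Growth. The item count in the printed summary (49,706) is far higher than one would expect from product descriptions alone; this is consistent with the row index and invoice number on each CSV line having been read as items. Such one-off labels cannot reach the support thresholds used later, so they do not appear in the rules.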
The fim4r function serves as an interface to various
mining algorithms, enabling the mining of different patterns from
transaction data. Each parameter within fim4r is designed
to customize the mining process to meet specific analytical goals.
fim4r Function
transactions: The dataset for the mining process, expected to be a collection of transactions where each transaction comprises a set of items purchased together. This should be formatted as a transactions object, a specific class in R designed to handle transaction data efficiently.
method: Determines the mining algorithm to be employed. Each algorithm offers unique advantages:
“apriori”: Best for smaller datasets; it generates candidate itemsets and then prunes them against the support threshold.
“eclat”: Uses a depth-first search to uncover frequent itemsets and can outperform Apriori on dense datasets.
“fpgrowth”: Optimized for large datasets by constructing and mining from a compact FP-tree structure, eliminating the need for candidate generation. Used in this research.
“relim”, “sam”: Mine itemsets by recursive elimination of items (relim) or by a split-and-merge strategy (sam).
“carpenter”, “ista”: Specialized for identifying closed itemsets, which reduces output redundancy compared with listing all frequent itemsets.
support: Sets the minimum support threshold, expressed as a fraction of total transactions. Only itemsets appearing in at least this fraction of transactions are considered frequent, so infrequent itemsets are discarded.
confidence: Applicable when the
target is set to “rules”, establishing the minimum
confidence level for a rule to be deemed significant. This parameter
evaluates the probability of purchasing the consequent given the
antecedent’s purchase.
target: Specifies the desired pattern type for mining, with options including “frequent”, “closed”, “maximal”, “generators”, or “rules”. This choice allows analysts to focus on particular pattern types that align with their analytical objectives.
originalSupport: A boolean indicating whether the support threshold applies to the entire rule (both LHS and RHS) or only to the LHS (antecedent), affecting the selection of relevant rules based on given criteria.
appear: Enables restrictions on item appearances within the rules, directing the mining towards patterns of greater interest or relevance, such as rules where specific items appear only as consequents.
report: While not accessible via the interface, reporting parameters typically influence the level of detail or the format of the mining output in broader contexts.
verbose: When TRUE, this logical flag triggers the function to print detailed information about used parameters and the mining process, enabling debugging and deeper understanding of the operation.
…: Additional, method-specific arguments that are passed to the underlying mining function, offering further customization for algorithmic tuning.
Considerations for Parameter Testing
Adjusting parameters along these lines can help refine the mining process so that it better fits the analytical goals and the characteristics of the given dataset.
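The support and confidence parameters are easier to tune with the underlying quantities in mind. The following base-R sketch computes support, confidence, and lift for one rule by hand; the baskets, the items, and the helper `support` are made up for illustration and are not part of arules or fim4r.

```r
# Toy baskets (hypothetical items).
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread"),
  c("jam"),
  c("bread", "butter")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {butter} => {bread}
lhs <- "butter"
rhs <- "bread"
rule_support    <- support(c(lhs, rhs))            # baskets with both items
rule_confidence <- rule_support / support(lhs)     # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to baseline

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

A rule is reported only if its support and confidence both reach the chosen minimums, so raising either threshold shrinks the result set.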
This part moves from data preparation to the analytical process, using the FP-Growth algorithm to mine frequent itemsets within the transaction data.
The FP-Growth algorithm, known for its efficiency and scalability, avoids the lengthy candidate generation and verification processes that are characteristic of earlier algorithms such as Apriori.
By constructing a compressed version of the dataset in the form of an FP-tree, FP-Growth extracts frequent itemsets faster and without repeated database scans.
The application of FP-Growth to the “Online Retail II” dataset aims to uncover patterns of product associations that frequently occur within customer transactions. These patterns, manifested as sets of items that are often bought together, provide valuable insights into customer purchasing behavior.
Understanding these associations enables the identification of potential strategies for cross-selling, promotions, and inventory management aimed at improving sales and customer satisfaction.
association_rules <- fim4r(tr, method = "fpgrowth", support = 0.001, confidence = 0.8, target = "rules")
## fim4r.fpgrowth
##
## Parameter specification:
## supp conf target report
## 0.1 80 rules scl
##
## Data size: 21000 transactions and 49706 items
## Result: 143316 rules
association_rules
## set of 143316 rules
summary(association_rules)
## set of 143316 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10 11 12
## 187 8852 25429 30082 30662 25013 14881 6169 1722 295 24
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 5.815 7.000 12.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001000 Min. :0.8000 Min. : 5.674 Min. : 21.00
## 1st Qu.:0.001048 1st Qu.:0.9091 1st Qu.: 35.993 1st Qu.: 22.00
## Median :0.001190 Median :0.9583 Median : 58.706 Median : 25.00
## Mean :0.001355 Mean :0.9448 Mean : 92.703 Mean : 28.39
## 3rd Qu.:0.001571 3rd Qu.:1.0000 3rd Qu.: 95.703 3rd Qu.: 33.00
## Max. :0.019381 Max. :1.0000 Max. :1000.000 Max. :407.00
inspectDT(head(sort(association_rules, by = "confidence"), 3))
This run shows that there are too many rules to inspect: roughly 45 thousand rules at confidence 1 and close to 150 thousand at confidence 0.8 or above. Since it is better to adjust support before confidence, the support threshold is raised in the next step.
The chart above provides a depiction of the most popular items within the dataset.
The section below details the creation of stronger association rules with the FP-Growth algorithm by raising the minimum support from 0.001 to 0.01 while keeping the minimum confidence at 0.8.
stronger_association_rules <- fim4r(tr, method = "fpgrowth", support = 0.01, confidence = 0.8, target = "rules")
## fim4r.fpgrowth
##
## Parameter specification:
## supp conf target report
## 1 80 rules scl
##
## Data size: 21000 transactions and 49706 items
## Result: 32 rules
summary(stronger_association_rules)
## set of 32 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 21 11
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.344 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01010 Min. :0.8000 Min. :16.93 Min. :212.0
## 1st Qu.:0.01125 1st Qu.:1.0000 1st Qu.:39.50 1st Qu.:236.2
## Median :0.01219 Median :1.0000 Median :47.24 Median :256.0
## Mean :0.01277 Mean :0.9634 Mean :49.63 Mean :268.1
## 3rd Qu.:0.01367 3rd Qu.:1.0000 3rd Qu.:62.11 3rd Qu.:287.0
## Max. :0.01938 Max. :1.0000 Max. :70.23 Max. :407.0
inspectDT(head(sort(stronger_association_rules, by = "confidence"), 10))
The revised table provides a clearer view than the previous one, with fewer items per rule (at most three).
To understand how the table functions, consider the following explanations for two of the rules:
{MAGIC GARDEN} => {HOOK} (Rule [2]): MAGIC GARDEN and HOOK appear together in 1.7% of transactions, and every time MAGIC GARDEN is bought, HOOK is also bought. This rule is particularly strong, with a lift indicating a very high likelihood of HOOK being purchased with MAGIC GARDEN (for example a lift of 40 means that the likelihood of finding both items together is 40 times higher than if they were independent), observed in 350 transactions.
{PINK/WHITE SPOTS} => {CHARLOTTE BAG} (Rule [9]): This combination appears in 1.4% of transactions, with a 100% chance of CHARLOTTE BAG being bought whenever PINK/WHITE SPOTS items are bought, seen in 302 transactions. The high lift indicates a strong association.
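The figures quoted for the first rule can be reproduced with simple arithmetic. In the sketch below, the transaction count and rule count come from the text above, while the lift of 40 is only the illustrative value used in the explanation, not the measured lift of the rule.

```r
n_transactions <- 21000  # from the transaction summary
rule_count     <- 350    # transactions containing both items
confidence     <- 1.0    # every MAGIC GARDEN basket also contains HOOK
lift_example   <- 40     # illustrative value from the text, not the measured lift

# Support of the rule: share of all transactions that contain both items.
rule_support <- rule_count / n_transactions

# Lift = confidence / support(rhs), so a lift of 40 at confidence 1
# would imply that the consequent appears in 1/40 of all transactions.
rhs_support <- confidence / lift_example
rhs_count   <- rhs_support * n_transactions

print(round(rule_support, 4))
print(rhs_count)
```

Read this way, a high lift says that the consequent is rare overall yet almost always present when the antecedent is bought.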
To refine the analysis further, the next step identifies and removes redundant rules, that is, rules whose items include all the items of another rule in the set.
# Use the code below to remove such rules:
stronger_subset_rules <- which(colSums(is.subset(stronger_association_rules, stronger_association_rules)) > 1) # get subset rules in vector
length(stronger_subset_rules)
## [1] 15
subset_stronger_association_rules <- stronger_association_rules[-stronger_subset_rules] # Remove subset rules.
subset_stronger_association_rules
## set of 17 rules
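The column-sum test used above can be illustrated without arules. The sketch below builds the same kind of subset matrix by hand for four hypothetical rules, assuming (as the call above suggests) that each rule is compared through its combined set of lhs and rhs items.

```r
# Combined item sets (lhs and rhs) of four hypothetical rules.
rule_items <- list(
  r1 = c("A", "B"),       # {A} => {B}
  r2 = c("A", "B", "C"),  # {A, C} => {B}
  r3 = c("D", "E"),       # {D} => {E}
  r4 = c("D", "E")        # {E} => {D}
)

# subset_matrix[i, j] is TRUE when all items of rule i occur in rule j.
n <- length(rule_items)
subset_matrix <- matrix(FALSE, n, n,
                        dimnames = list(names(rule_items), names(rule_items)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    subset_matrix[i, j] <- all(rule_items[[i]] %in% rule_items[[j]])
  }
}

# Every rule is a subset of itself, so a column sum above 1 means the
# rule in that column contains the items of at least one other rule.
flagged <- names(which(colSums(subset_matrix) > 1))
print(flagged)
```

In this toy case the longer rule r2 is flagged because it contains r1, and r3 and r4 flag each other because they share the same items. If the real data behaves the same way, this would explain why some two-item rules were removed along with the three-item ones.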
After filtering out these rules, the remaining set is used to further summarize and visualize the findings (only 17 rules are left at this point).
summary(subset_stronger_association_rules) #shows the following:
## set of 17 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 15 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.118 2.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01010 Min. :0.8014 Min. :16.93 Min. :212.0
## 1st Qu.:0.01110 1st Qu.:1.0000 1st Qu.:36.84 1st Qu.:233.0
## Median :0.01257 Median :1.0000 Median :40.38 Median :264.0
## Mean :0.01302 Mean :0.9693 Mean :43.24 Mean :273.3
## 3rd Qu.:0.01438 3rd Qu.:1.0000 3rd Qu.:48.22 3rd Qu.:302.0
## Max. :0.01938 Max. :1.0000 Max. :70.23 Max. :407.0
An interactive DataTable is created using inspectDT to display the top 10 association rules sorted by confidence.
This table allows an in-depth exploration of the strongest rules (the 10 highest by confidence): users can interactively examine the details of each rule, including the items involved and the metrics that quantify their association.
This approach assists in identifying the most reliable patterns within the dataset.
inspectDT(head(sort(subset_stronger_association_rules, by = "confidence"), 10))
Below are some graphs that help visualize the association rules.
The graph below is built from rules filtered to a confidence greater than 40%. Because every rule in association_rules was mined with a minimum confidence of 0.8, this filter keeps all of them.
The visualization, created with the default plot method, shows each rule by its support and confidence, with lift indicated by color.
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association_rules[quality(association_rules)$confidence>0.4]
# Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The scatter plot visualizes all 143,316 association rules by their support and confidence. Darker red points indicate a higher lift, suggesting strong item associations, while the concentration of points at low support values reflects the prevalence of rules with high confidence but low support.
Below is another variation of the visualization, the two-key plot, in which color indicates the order of a rule, that is, the number of items it contains. An “order 3” rule contains three items, for example:
{Item A, Item C} => {Item B} or {Item A} => {Item B, Item C}
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Below is an interactive graph-based visualization of the top 10 rules by confidence from the refined subset of association rules. In such a graph:
# 10 rules from subset_stronger_association_rules having the highest confidence.
top10subRules <- head(subset_stronger_association_rules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget") #interactive plot engine=htmlwidget
Nodes (Circles): represent items or itemsets.
Edges (Lines): signify association rules, where the direction of the arrow points from the antecedent (the “if” part) to the consequent (the “then” part) of the rule.
Node Size and Color: These can denote the significance of the items, with larger or more vividly colored nodes possibly indicating higher metrics such as support or lift.
Interactivity: The “Select by id” dropdown allows users to select specific rules or items to examine more closely.
A parallel coordinates plot is generated below for the top 10 rules by lift from the refined subset_stronger_association_rules.
This non-interactive visualization displays the relationships between the antecedent (lhs) and consequent (rhs) of each rule, offering a clear view of how lift differs across these rules. The parallel coordinates plot helps compare the relative strength and lift of the top associations, highlighting patterns that may be effective for cross-selling or promotional strategies. (Adding rule labels on the right side of the graph could also be considered.)
subRules2<-head(subset_stronger_association_rules, n=10, by="lift")
plot(subRules2, method="paracoord")
The application of the FP-Growth algorithm to the “Online Retail II” dataset has highlighted the practical use of the technique in uncovering purchasing behaviors. Most of the focus here has been on understanding and using the algorithm’s parameters and on visualizing the outcomes, so future research would benefit from a deeper analytical approach. One option is to complement FP-Growth with other methods such as Eclat to gain a more comprehensive understanding of consumer behavior. The insights gained from such research could support informed decisions in retail management.