# Installing the 'arules' and 'arulesViz' packages for association rule mining and visualization
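# Install only when a package is missing; this avoids the "Updating loaded packages" error that appears when reinstalling a package that is already loaded.
if (!requireNamespace("arules", quietly = TRUE)) install.packages("arules")
if (!requireNamespace("arulesViz", quietly = TRUE)) install.packages("arulesViz")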
library(arules)
library(arulesViz)
# Importing dataset 'book'
book <- read.csv('book.csv')
# Displaying a summary of the dataset to understand its structure
summary(book)
   ChildBks         YouthBks         CookBks        DoItYBks          RefBks           ArtBks          GeogBks         ItalCook
 Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000
 Median :0.000   Median :0.0000   Median :0.000   Median :0.000   Median :0.0000   Median :0.000   Median :0.000   Median :0.0000
 Mean   :0.423   Mean   :0.2475   Mean   :0.431   Mean   :0.282   Mean   :0.2145   Mean   :0.241   Mean   :0.276   Mean   :0.1135
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.0000
 Max.   :1.000   Max.   :1.0000   Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.000   Max.   :1.000   Max.   :1.0000
   ItalAtlas        ItalArt          Florence
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.000   Median :0.0000   Median :0.0000
 Mean   :0.037   Mean   :0.0485   Mean   :0.1085
 3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.0000
 Max.   :1.000   Max.   :1.0000   Max.   :1.0000
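Because every column is a 0/1 purchase indicator, each column mean in the summary is the share of customers who bought that category, which is also the support of the corresponding purchase item. A minimal base-R sketch of that reading, using three of the printed means and the 2000 rows reported by str(); the names `col_means` and `n_customers` are illustrative and not part of the original notebook:

```r
# Column means copied from the summary() output; object names are illustrative.
col_means <- c(ChildBks = 0.423, CookBks = 0.431, ItalAtlas = 0.037)
n_customers <- 2000  # number of rows in the dataset, as reported by str()

# The mean of a 0/1 column is the fraction of rows equal to 1,
# so the number of buyers is mean * number of rows.
buyers <- round(col_means * n_customers)
print(buyers)
```

Read this way, roughly four in ten customers bought a children's book or a cookbook, while the Italian atlas was bought by fewer than one in twenty.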
# Converting all binary variables in the dataset to categorical variables
book1 <- as.data.frame(lapply(book, as.factor))
# Checking the structure of the modified dataset to ensure conversion to categorical variables
str(book1)
'data.frame': 2000 obs. of 11 variables:
$ ChildBks : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 1 2 2 ...
$ YouthBks : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 2 1 2 ...
$ CookBks : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 2 ...
$ DoItYBks : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
$ RefBks : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 1 1 ...
$ ArtBks : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ GeogBks : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 2 ...
$ ItalCook : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ItalAtlas: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ItalArt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Florence : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
Transforming the data frame into a ‘transactions’ class for association rule mining
book2 <- as(book1, "transactions")
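Coercing a data frame of factors to transactions turns every column=level pair into its own item, so 11 two-level columns yield 22 items, including "not purchased" items such as ItalCook=0. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical). It also shows, as an assumed alternative rather than the approach used in this analysis, how a logical matrix keeps only the purchase items.

```r
library(arules)

# Hypothetical 0/1 purchase data, standing in for book.csv.
toy <- data.frame(A = c(1, 0, 1, 1), B = c(0, 0, 1, 1), C = c(1, 1, 0, 1))

# Factor route (as in the text): one item per column/level pair.
toy_factor <- as.data.frame(lapply(toy, as.factor))
trans_all <- as(toy_factor, "transactions")
itemLabels(trans_all)     # "A=0" "A=1" "B=0" "B=1" "C=0" "C=1"

# Logical-matrix route: only TRUE cells (purchases) become items.
trans_bought <- as(as.matrix(toy) == 1, "transactions")
itemLabels(trans_bought)  # "A" "B" "C"
```

With the factor route every transaction contains exactly one item per column, which is one reason the rule counts further down are so large.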
Plotting item frequency for the top 22 items in the transaction dataset
itemFrequencyPlot(book2, topN=22)
Generating association rules with specific parameters
rule1 <- apriori(book2, parameter = list(supp = 0.01, conf = 0.4, minlen = 5, maxlen = 10))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 20
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 2000 transaction(s)] done [0.00s].
sorting and recoding items ... [22 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10
Warning: Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
done [0.01s].
writing ... [127070 rule(s)] done [0.04s].
creating S4 object ... done [0.03s].
rule1
set of 127070 rules
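The "Absolute minimum support count" line in the log is the relative support threshold expressed as a number of transactions. A quick check of the five thresholds used in this analysis against the counts printed in the logs (object names are illustrative):

```r
n_transactions <- 2000
supp <- c(rule1 = 0.01, rule2 = 0.05, rule3 = 0.04, rule4 = 0.06, rule5 = 0.03)

# Relative support times the number of transactions; round() only guards
# against floating-point noise in the product.
min_count <- round(supp * n_transactions)
print(min_count)
```

For rule1, an itemset therefore needs to occur in only 20 of the 2000 transactions to be considered, which helps explain the very large number of rules returned.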
Inspecting the top 15 rules sorted by support, the default criterion of sort()
inspect(head(sort(rule1, by = "support"), n = 15))
head(quality(rule1))
Sorting and inspecting rules by ‘lift’
rule1_lift <- sort(rule1, by = "lift", descending = TRUE)
inspect(head(rule1_lift))
Sorting and inspecting rules by ‘confidence’
rule1_confidence <- sort(rule1, by = "confidence", descending = TRUE)
inspect(head(rule1_confidence))
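For reference, the quality measures used for sorting follow the standard definitions. For a rule X ⇒ Y over N transactions, with count(·) denoting the number of transactions that contain an itemset:

```latex
\begin{aligned}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\operatorname{count}(X \cup Y)}{N},\\[4pt]
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)},\\[4pt]
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}.
\end{aligned}
```

A lift of 1 means X and Y occur together exactly as often as expected if they were independent, so sorting by lift and sorting by confidence can rank the same rules very differently.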
Visualizing the generated rules using different plots
# Visualization using a scatter plot. Adjust 'measure' arguments as needed.
plot(rule1, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rule1, method = "two-key plot")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Rules with higher support (towards the right of the plot) are based on items that are more common in the dataset.
Rules with higher confidence (towards the top of the plot) are more reliable in predicting the consequent in a transaction.
The majority of rules with high confidence also have relatively low support, which is a common occurrence in large datasets where specific item combinations occur infrequently but are highly predictable.
There is a variety of rules with different numbers of items involved (as indicated by the different colors), but it seems that rules with fewer items (order 5 and 6) are more common.
The ‘gaps’ in the plot (horizontal lines without points) might indicate thresholds or boundaries where no rules meet the criteria to be plotted, possibly due to the parameter settings in the apriori algorithm.
##############################################################################
Generating and inspecting different sets of rules with varied parameters, and visualizing them using different methods
rule2 <- apriori(book1, parameter = list(supp = 0.05, confidence = 0.8, minlen = 6, maxlen = 20))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 100
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 2000 transaction(s)] done [0.00s].
sorting and recoding items ... [20 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 done [0.01s].
writing ... [15855 rule(s)] done [0.01s].
creating S4 object ... done [0.00s].
inspect(head(rule2, 10))
rule2
set of 15855 rules
par(mar = c(5, 8, 4, 2) + 0.1)
# 'cex' is not a control parameter of the grouped plot (arulesViz reports "Unknown control parameters: cex"), so it is omitted.
# The control parameters it does accept are k, aggr.fun, rhs_max, lhs_label_items, col, groups, engine, and verbose.
plot(rule2, method = "grouped")
In the grouped plot, each bubble summarizes a group of rules: rows correspond to consequent (RHS) items and columns to clusters of antecedent (LHS) itemsets. Bubble size reflects the aggregated support of the group, and color intensity reflects its aggregated lift.
Rows with larger and more intensely colored bubbles therefore mark consequents that are predicted by frequent rule groups with comparatively high lift.
Comparing bubble sizes across the plot gives an idea of how common the underlying itemsets are within the transactions in the dataset.
The rule groups with higher lift, indicated by the darker shades, are particularly interesting because they may reveal associations that are not immediately obvious.
Only a limited number of consequents is displayed (rhs_max = 10 by default), so some strongly predicted items may not appear in the visible plot area.
###########################################################################
rule3 <- apriori(book1, parameter = list(supp = 0.04, confidence = 0.6, minlen = 7, maxlen = 10))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 80
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 2000 transaction(s)] done [0.00s].
sorting and recoding items ... [21 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10
Warning: Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
done [0.01s].
writing ... [14572 rule(s)] done [0.01s].
creating S4 object ... done [0.00s].
inspect(head(rule3, 10))
rule3
set of 14572 rules
plot(rule3, method = "graph")
Warning: Too many rules supplied. Only plotting the best 100 using ‘lift’ (change control parameter max if needed).
In this graph, rules are drawn as circles and items as labeled vertices connected to them. Larger circles are rules with higher support, so they apply to a larger portion of the transactions.
Circles with a darker color are rules with higher lift, which are of particular interest because they indicate that the association between the items is stronger than expected by chance. This could suggest a potential for cross-selling or promotions.
The structure of the network can give you an idea of how items are interconnected. For instance, if many nodes (items) are connected to a single node, this central node may be a key item that is frequently bought with various other items.
Due to the overlap and density of the plot, it may be difficult to identify specific rules or the direction of the association (which item is the antecedent and which is the consequent). Interactive tools or filtering to show a subset of rules may be helpful for deeper analysis.
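One way to act on the filtering suggestion is to keep only rules whose consequent is an actual purchase and whose lift clears a threshold, then plot that smaller set. The sketch below uses a small made-up dataset so that it runs on its own. The data frame `toy`, the thresholds, and the object names are assumptions; with the notebook's objects, the same `subset()` call would be applied to `rule3`.

```r
library(arules)

# Hypothetical 0/1 purchase data, converted the same way as in the text.
toy <- data.frame(
  A = c(1, 1, 1, 0, 1, 0, 1, 1),
  B = c(1, 1, 1, 0, 1, 0, 0, 1),
  C = c(0, 1, 0, 1, 0, 1, 1, 0)
)
toy_trans <- as(as.data.frame(lapply(toy, as.factor)), "transactions")

rules <- apriori(toy_trans,
                 parameter = list(supp = 0.2, conf = 0.6, minlen = 2),
                 control = list(verbose = FALSE))

# Keep rules that predict a purchase (consequent label contains "=1")
# and whose lift exceeds 1.
purchase_rules <- subset(rules, subset = rhs %pin% "=1" & lift > 1)
inspect(head(purchase_rules, n = 5, by = "lift"))
```

A filtered set of this size can be passed to `plot(..., method = "graph")` without triggering the "Too many rules supplied" warning, and the direction of each rule stays readable.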
###########################################################################
rule4 <- apriori(book1, parameter = list(supp = 0.06, confidence = 0.7, minlen = 8, maxlen = 15))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 120
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 2000 transaction(s)] done [0.00s].
sorting and recoding items ... [20 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 done [0.01s].
writing ... [4557 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(head(rule4, 10))
lhs rhs support confidence coverage lift count
[1] {CookBks=0,
DoItYBks=1,
RefBks=0,
ArtBks=0,
ItalCook=0,
ItalArt=0,
Florence=0} => {ItalAtlas=0} 0.0605 1.0000000 0.0605 1.038422 121
[2] {CookBks=0,
DoItYBks=1,
RefBks=0,
ArtBks=0,
ItalCook=0,
ItalAtlas=0,
Florence=0} => {ItalArt=0} 0.0605 1.0000000 0.0605 1.050972 121
[3] {CookBks=0,
DoItYBks=1,
RefBks=0,
ArtBks=0,
ItalCook=0,
ItalAtlas=0,
ItalArt=0} => {Florence=0} 0.0605 0.9680000 0.0625 1.085810 121
[4] {CookBks=0,
DoItYBks=1,
RefBks=0,
ArtBks=0,
ItalAtlas=0,
ItalArt=0,
Florence=0} => {ItalCook=0} 0.0605 1.0000000 0.0605 1.128032 121
[5] {CookBks=0,
DoItYBks=1,
ArtBks=0,
ItalCook=0,
ItalAtlas=0,
ItalArt=0,
Florence=0} => {RefBks=0} 0.0605 0.8897059 0.0680 1.132662 121
[6] {CookBks=0,
DoItYBks=1,
RefBks=0,
ItalCook=0,
ItalAtlas=0,
ItalArt=0,
Florence=0} => {ArtBks=0} 0.0605 0.8897059 0.0680 1.172208 121
[7] {YouthBks=0,
DoItYBks=1,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=0} => {ItalAtlas=0} 0.0655 0.9924242 0.0660 1.030555 131
[8] {YouthBks=0,
DoItYBks=1,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalAtlas=0,
Florence=0} => {ItalArt=0} 0.0655 1.0000000 0.0655 1.050972 131
[9] {YouthBks=0,
DoItYBks=1,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalAtlas=0,
ItalArt=0} => {Florence=0} 0.0655 0.9776119 0.0670 1.096592 131
[10] {YouthBks=0,
DoItYBks=1,
ArtBks=0,
GeogBks=0,
ItalAtlas=0,
ItalArt=0,
Florence=0} => {ItalCook=0} 0.0655 0.9703704 0.0675 1.094608 131
rule4
set of 4557 rules
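The columns of the inspect() output are tied together by simple ratios: confidence is support divided by coverage (the support of the left-hand side), and count is support times the number of transactions. A quick check on rule [3] of the listing, with values copied from the printed table (object names are illustrative):

```r
# Values copied from rule [3] of the rule4 output.
n_transactions <- 2000
support  <- 0.0605   # share of transactions containing both LHS and RHS
coverage <- 0.0625   # share of transactions containing the LHS

confidence <- support / coverage
count      <- round(support * n_transactions)   # transactions matching the whole rule
lhs_count  <- round(coverage * n_transactions)  # transactions matching the LHS

print(c(confidence = confidence, count = count, lhs_count = lhs_count))
```

In other words, 121 of the 125 transactions that match the left-hand side also contain Florence=0, which is the printed confidence of 0.968.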
plot(rule4, method = "paracoord")
rule5 <- apriori(book1, parameter = list(supp = 0.03, confidence = 0.85, minlen = 9, maxlen = 20))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 60
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 2000 transaction(s)] done [0.00s].
sorting and recoding items ... [22 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 done [0.01s].
writing ... [2340 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(head(rule5, 10))
lhs rhs support confidence coverage lift count
[1] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
Florence=1} => {ItalArt=0} 0.0325 1.0000000 0.0325 1.050972 65
[2] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalArt=0,
Florence=1} => {ItalCook=0} 0.0325 1.0000000 0.0325 1.128032 65
[3] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {ArtBks=0} 0.0325 0.8904110 0.0365 1.173137 65
[4] {ChildBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {YouthBks=0} 0.0325 0.9701493 0.0335 1.289235 65
[5] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {GeogBks=0} 0.0325 0.9285714 0.0350 1.282557 65
[6] {ChildBks=0,
YouthBks=0,
CookBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {DoItYBks=0} 0.0325 0.9848485 0.0330 1.371655 65
[7] {YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {ChildBks=0} 0.0325 0.8552632 0.0380 1.482259 65
[8] {ChildBks=0,
YouthBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
ItalArt=0,
Florence=1} => {CookBks=0} 0.0325 0.9027778 0.0360 1.586604 65
[9] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalCook=0,
Florence=1} => {ItalAtlas=0} 0.0320 0.9846154 0.0325 1.022446 64
[10] {ChildBks=0,
YouthBks=0,
CookBks=0,
DoItYBks=0,
ArtBks=0,
GeogBks=0,
ItalAtlas=0,
Florence=1} => {ItalCook=0} 0.0320 1.0000000 0.0320 1.128032 64
rule5
set of 2340 rules
plot(rule5, method = "two-key plot")
To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
All rules have a confidence of at least 85%, since that is the minimum confidence passed to apriori(); whenever the antecedent of a rule holds, its consequent is therefore very likely to hold as well.
There is a wide range of support for these rules, with no clear concentration of points towards higher support values. This suggests that while the rules are reliable (high confidence), they may not apply to a large portion of the dataset (lower support).
The spread of points across the confidence levels, particularly in the high-confidence area, suggests there are many strong rules that could be leveraged for marketing strategies, such as product placement or promotions.
The rules are relatively evenly distributed across the different order sizes (9, 10, and 11), with no single order size dominating the high-confidence, high-support area of the plot.
The cluster of points at the lower support levels suggests that there are potentially interesting but less frequent itemsets that could be the focus of niche marketing strategies.
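High confidence alone can overstate a rule when its consequent is common anyway. Lift divides confidence by the consequent's overall support, and for a "not purchased" item that baseline is one minus the column mean from the summary. A sketch for rules [1] and [8] of the rule5 listing, using only printed values (object names are illustrative):

```r
# Baseline support of each consequent = 1 - mean of the 0/1 column.
baseline <- c("ItalArt=0" = 1 - 0.0485, "CookBks=0" = 1 - 0.431)

# Rule [1] has a printed confidence of 1; rule [8] is support / coverage.
confidence <- c("ItalArt=0" = 1.0, "CookBks=0" = 0.0325 / 0.0360)

lift <- confidence / baseline
print(round(lift, 6))
```

The rule predicting ItalArt=0 has perfect confidence but a lift of only about 1.05, because roughly 95% of customers did not buy an Italian art book in the first place. The rule predicting CookBks=0 has lower confidence but a noticeably higher lift.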
top5rules <- head(rule5, n=5, by = "confidence")
plot(top5rules, engine = "htmlwidget", method = "graph")
Key Influencers: The items at the tail of the arrows are key influencers, meaning that their presence in a transaction strongly suggests the likelihood of the item at the head being purchased.
High Confidence Associations: The rules represented in this graph are the strongest in the dataset in terms of confidence, so the relationships shown are highly reliable.
Interconnectedness: The graph shows how different items are interconnected. If an item is a common consequent (many arrows pointing to it), it might be a popular item that can be targeted for promotional strategies.
Possible Item Combinations: The graph illustrates how different item combinations lead to the purchase of other items. This can inform strategies for product placement, inventory management, and cross-selling.
The top 5 rules based on confidence highlight the most predictable patterns within the dataset, and the graph summarizes them visually, making the strongest relationships easier to identify. Two caveats apply before using them to guide marketing and sales decisions. First, the items are column=value pairs, and in the first ten rules of the rule5 listing every consequent is a "not purchased" state such as ItalArt=0, so those rules describe what customers tend not to buy rather than what they buy together. Second, confidence should be read alongside lift, which ranges from about 1.02 to 1.59 in that listing. If a purchase item such as “ItalCook=1” (Italian cookbooks) does appear often as an antecedent, it might be beneficial to place the items that frequently follow it nearby in a store or to bundle them in promotions.