Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies strong rules in a database using algorithms such as Apriori and Eclat.
The data set comes from online retail shops, with transactions recorded across various countries. For each purchase it captures the CustomerID, ProductID, Quantity, Price, Country, Category, and Date.
Objective: This project aims to analyze customer purchasing behavior based on geographical patterns. The primary goals include:
Geographical Insights: Exploring and analysing sales patterns across different countries to identify potential market trends or regional preferences.
Performance Metrics: Calculating key performance metrics such as support, confidence, and lift for association rules to quantify the strength and relevance of discovered patterns.
Visualization: Utilizing visualizations to present meaningful insights, aiding in the interpretation of complex patterns and relationships within the dataset.
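The three measures named above have standard definitions. As a brief reference, for a rule $A \Rightarrow B$ over $N$ transactions, where $\mathrm{count}(X)$ denotes the number of transactions containing every item in $X$ (this notation is introduced here for illustration and does not come from the data set):

$$
\mathrm{supp}(A \Rightarrow B) = \frac{\mathrm{count}(A \cup B)}{N}, \qquad
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{count}(A \cup B)}{\mathrm{count}(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{count}(B)/N}
$$

A lift of 1 means that $A$ and $B$ occur together as often as expected under independence; values above or below 1 indicate a positive or negative association.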
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("arules")
##
## The downloaded binary packages are in
## /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp7mTaGS/downloaded_packages
install.packages("igraph")
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## igraph 1.5.1 1.6.0 TRUE
## installing the source package 'igraph'
## Warning in install.packages("igraph"): installation of package 'igraph' had
## non-zero exit status
install.packages("arulesViz")
##
## The downloaded binary packages are in
## /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp7mTaGS/downloaded_packages
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(readr)
online_retail_data <- read_csv("online_retail_data.csv")
## Rows: 4000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Category
## dbl (4): CustomerID, ProductID, Quantity, Price
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(online_retail_data)
## # A tibble: 6 × 7
## CustomerID ProductID Quantity Price Country Category Date
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <date>
## 1 1102 22 4 22.4 UK Clothing 2022-01-01
## 2 1435 48 1 59.5 USA Electronics 2022-01-02
## 3 1860 2 1 91.4 Canada Home 2022-01-03
## 4 1270 10 2 80.0 Canada Clothing 2022-01-04
## 5 1106 34 9 92.2 USA Home 2022-01-05
## 6 1071 16 2 97.9 Canada Clothing 2022-01-06
As the output of head() shows, each row records a single purchase: the customer and product identifiers, the quantity and price, the country, the product category, and the purchase date.
retail_dimensions <- dim(online_retail_data)
retail_dimensions
## [1] 4000 7
The data set used to extract the rules has 4000 observations and 7 variables.
missing_values <- sum(is.na(online_retail_data))
missing_values
## [1] 0
As shown above, the data set contains no missing values.
This step transforms the data set into the transaction format expected by the arules package so that association rules can be mined. Each customer becomes one transaction, and its items are the countries in which that customer has at least one recorded purchase.
online_retail_data <- read.csv("online_retail_data.csv")
binary_matrix <- table(online_retail_data$CustomerID, online_retail_data$Country) > 0
trans <- as(binary_matrix, "transactions")
length(trans)
## [1] 970
LIST(head(trans))
## $`1000`
## [1] "Australia" "Canada" "UK" "USA"
##
## $`1001`
## [1] "Australia" "Canada" "UK"
##
## $`1002`
## [1] "Australia" "Canada"
##
## $`1003`
## [1] "UK"
##
## $`1004`
## [1] "Australia" "Canada" "USA"
##
## $`1005`
## [1] "UK"
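To make the conversion step concrete, the sketch below applies the same table(...) > 0 idea to a small made-up purchase log using base R only. The toy data frame and its values are hypothetical; only the column names CustomerID and Country come from the data set.

```r
# Toy purchase log (hypothetical values, not taken from the data set)
toy <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  Country    = c("UK", "USA", "UK", "UK", "Canada", "USA"),
  stringsAsFactors = FALSE
)

# Count purchases per customer/country pair, then reduce counts to presence/absence
incidence <- table(toy$CustomerID, toy$Country) > 0
incidence

# Equivalent view: one basket per customer, holding the distinct countries
baskets <- lapply(split(toy$Country, toy$CustomerID), function(x) sort(unique(x)))
baskets
```

Repeated purchases in the same country collapse into a single item, which is why customer 2 ends up with two items rather than three.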
Contingency table
# Pairwise co-occurrence counts; the diagonal holds each item's own count
ctab <- crossTable(trans, measure = "count", sort = TRUE)
ctab
## UK Canada USA Australia
## UK 631 401 391 404
## Canada 401 630 391 390
## USA 391 391 618 384
## Australia 404 390 384 612
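The contingency table already contains everything needed to compute support, confidence, and lift for any pairwise rule. As a small sketch, the base R snippet below does this for {UK} => {Australia}, using counts copied from the table and the 970 transactions reported by length(trans). The variable names are chosen for illustration.

```r
# Counts copied from the contingency table above
n_transactions <- 970
n_uk           <- 631   # transactions containing UK
n_australia    <- 612   # transactions containing Australia
n_both         <- 404   # transactions containing both UK and Australia

support    <- n_both / n_transactions               # share of all transactions with both
confidence <- n_both / n_uk                         # share of UK transactions that include Australia
lift       <- confidence / (n_australia / n_transactions)  # confidence relative to Australia's base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

These values agree with the figures that the rule induction step reports for the same rule further on (0.4164948, 0.6402536, and 1.0147810).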
# Displaying a summary of the transactions
summary(trans)
inspect(trans)
sample(trans)
#transactions as itemMatrix in sparse format with
# 970 rows (elements/itemsets/transactions) and
# 4 columns (items) and a density of 0.6420103
#most frequent items:
# UK Canada USA Australia (Other)
# 631 630 618 612 0
#element (itemset/transaction) length distribution:
#sizes
# 1 2 3 4
#130 318 363 159
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.000 2.000 3.000 2.568 3.000 4.000
#includes extended item information - examples:
# labels
#1 Australia
#2 Canada
#3 UK
#includes extended transaction information - examples:
# transactionID
#1 1000
#2 1001
#3 1002
Mining rules with arules requires the data in transaction form, which the conversion above provides. In the summary, the items are the four countries and each transaction is one customer. The item counts show in how many transactions each country appears: the UK is the most frequent item (631), though Canada (630), the USA (618), and Australia (612) are close behind. The length distribution gives the number of items per transaction: 130 transactions contain one country, 318 contain two, 363 contain three, and 159 contain all four. The median transaction length is 3, so at least half of the transactions contain three countries or fewer.
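The figures in the summary are tied together by simple arithmetic. The sketch below, using only base R and numbers copied from the summary output, recomputes the number of transactions, the mean transaction length, and the density from the length distribution.

```r
# Length distribution copied from the summary: number of transactions per size
sizes <- c(`1` = 130, `2` = 318, `3` = 363, `4` = 159)
item_counts <- c(UK = 631, Canada = 630, USA = 618, Australia = 612)

n_transactions <- sum(sizes)                             # total transactions
n_entries      <- sum(as.integer(names(sizes)) * sizes)  # total item occurrences
mean_length    <- n_entries / n_transactions             # average items per transaction
density        <- n_entries / (n_transactions * length(item_counts))  # filled share of the matrix

c(n_transactions = n_transactions, n_entries = n_entries,
  mean_length = round(mean_length, 3), density = round(density, 7))
```

The total number of item occurrences also equals the sum of the four item counts, which is a quick consistency check on the conversion.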
The Apriori algorithm first finds frequent itemsets and then derives rules from them.
# initial
rules <- apriori(online_retail_data)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 400
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4020 item(s), 4000 transaction(s)] done [0.01s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 0 rules
The initial run of the Apriori algorithm used the default parameters, support = 0.1 and confidence = 0.8. Because apriori() was called on the raw data frame, each of the 4000 rows was treated as one transaction and the seven columns were discretized into items automatically, as the warning notes. With these defaults, a rule is kept only if its items occur together in at least 10% of transactions (400 rows) and its confidence is at least 80%.
These conditions turned out to be too stringent for a data set with such a diverse set of items: no rule satisfied both thresholds, so the output contains zero rules.
For this reason, the parameters were relaxed to support = 0.01 and confidence = 0.5.
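The effect of the support threshold is easiest to see as an absolute count. The sketch below converts the two relative thresholds into row counts for the 4000-row data set; min_count is a hypothetical helper written for this illustration, not an arules function.

```r
# Hypothetical helper: relative support threshold expressed as a transaction count
min_count <- function(support, n_transactions) support * n_transactions

default_run  <- min_count(0.10, 4000)  # default support
adjusted_run <- min_count(0.01, 4000)  # relaxed support

c(default = default_run, adjusted = adjusted_run)
```

These correspond to the "Absolute minimum support count" lines that apriori() prints for each run: an itemset had to appear in 400 rows under the defaults, but only in 40 rows after the adjustment.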
#adjusted
rules <- apriori(online_retail_data, parameter = list(support = 0.01, confidence = 0.5))
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 40
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4020 item(s), 4000 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [50 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 50 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 4 46
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 4.00 4.00 3.92 4.00 4.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01000 Min. :0.5000 Min. :0.01950 Min. :1.127
## 1st Qu.:0.01125 1st Qu.:0.5076 1st Qu.:0.02125 1st Qu.:1.144
## Median :0.01300 Median :0.5160 Median :0.02438 Median :1.166
## Mean :0.01475 Mean :0.5221 Mean :0.02841 Mean :1.190
## 3rd Qu.:0.01469 3rd Qu.:0.5366 3rd Qu.:0.02881 3rd Qu.:1.220
## Max. :0.04050 Max. :0.5696 Max. :0.07975 Max. :1.534
## count
## Min. : 40.00
## 1st Qu.: 45.00
## Median : 52.00
## Mean : 59.00
## 3rd Qu.: 58.75
## Max. :162.00
##
## mining info:
## data ntransactions support confidence
## online_retail_data 4000 0.01 0.5
## call
## apriori(data = online_retail_data, parameter = list(support = 0.01, confidence = 0.5))
Lowering the support and confidence thresholds resulted in the discovery of 50 rules, which shows that relaxing the criteria allowed the algorithm to find associations in the data. The rule length distribution shows 4 rules with 3 items and 46 rules with 4 items, so the median rule length is 4.
The first and third quartiles of rule length are also 4, which reflects that 46 of the 50 rules have exactly four items. The quality measures show that the rules only narrowly clear the thresholds: support ranges from 0.010 to 0.0405 (40 to 162 rows), confidence from 0.50 to 0.57, and lift from 1.13 to 1.53. With a median lift of about 1.17, most of these associations are weak.
inspect(rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {ProductID=[1,17),
## Category=Home} => {Quantity=[6,9]} 0.04050 0.5078370 0.07975 1.144421 162
## [2] {Country=USA,
## Category=Toys} => {Quantity=[6,9]} 0.03050 0.5083333 0.06000 1.145540 122
## [3] {CustomerID=[1e+03,1.34e+03),
## Country=USA} => {Quantity=[6,9]} 0.03850 0.5016287 0.07675 1.130431 154
## [4] {Country=UK,
## Category=Toys} => {Quantity=[6,9]} 0.03375 0.5075188 0.06650 1.143704 135
## [5] {Quantity=[1,3),
## Price=[1.01,34.2),
## Category=Home} => {CustomerID=[1.34e+03,1.67e+03)} 0.01000 0.5128205 0.01950 1.534242 40
## [6] {ProductID=[1,17),
## Country=Australia,
## Category=Home} => {Quantity=[6,9]} 0.01125 0.5696203 0.01975 1.283651 45
## [7] {Price=[66.7,100],
## Country=Australia,
## Category=Home} => {Quantity=[6,9]} 0.01125 0.5113636 0.02200 1.152369 45
## [8] {CustomerID=[1e+03,1.34e+03),
## Price=[66.7,100],
## Country=Australia} => {Quantity=[6,9]} 0.01625 0.5078125 0.03200 1.144366 65
## [9] {CustomerID=[1e+03,1.34e+03),
## ProductID=[17,34),
## Country=Australia} => {Quantity=[6,9]} 0.01500 0.5172414 0.02900 1.165614 60
## [10] {CustomerID=[1e+03,1.34e+03),
## Country=USA,
## Category=Home} => {Quantity=[6,9]} 0.01075 0.5375000 0.02000 1.211268 43
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=60)
## UK Canada USA Australia
## 631 630 618 612
The item counts are nearly identical across countries. The UK has the highest frequency, appearing in 631 of the 970 transactions, followed closely by Canada (630), the USA (618), and Australia (612). The item frequency plot above visualizes these counts; its tallest bar, for the UK, reaches 631.
rules <- eclat(trans, parameter = list(supp = 0.1, maxlen = 4))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.1 1 4 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 97
##
## create itemset ...
## set transactions ...[4 item(s), 970 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating bit matrix ... [4 row(s), 970 column(s)] done [0.00s].
## writing ... [15 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
To extract meaningful patterns and associations from the data set, we used a two-step approach: the Eclat algorithm followed by rule induction. Eclat was first applied to identify frequent itemsets, that is, combinations of items that co-occur frequently in the data. It found 15 of them, which is every non-empty combination of the four countries.
freq_rules<-ruleInduction(rules, trans, confidence=0.5)
inspect(freq_rules[1:10])
## lhs rhs support confidence lift
## [1] {Canada, UK, USA} => {Australia} 0.1639175 0.6360000 1.0080392
## [2] {Australia, UK, USA} => {Canada} 0.1639175 0.6334661 0.9753367
## [3] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 1.0269813
## [4] {Australia, Canada, UK} => {USA} 0.1639175 0.6115385 0.9598581
## [5] {UK, USA} => {Australia} 0.2587629 0.6419437 1.0174598
## [6] {Australia, USA} => {UK} 0.2587629 0.6536458 1.0048121
## [7] {Australia, UK} => {USA} 0.2587629 0.6212871 0.9751594
## [8] {Canada, USA} => {Australia} 0.2453608 0.6086957 0.9647627
## [9] {Australia, USA} => {Canada} 0.2453608 0.6197917 0.9542824
## [10] {Australia, Canada} => {USA} 0.2453608 0.6102564 0.9578458
## itemset
## [1] 1
## [2] 1
## [3] 1
## [4] 1
## [5] 2
## [6] 2
## [7] 2
## [8] 3
## [9] 3
## [10] 3
The frequent itemsets were then turned into association rules with the ruleInduction function, which computes support, confidence, and lift for each candidate rule and keeps those with a confidence of at least 0.5. The resulting rules can then be filtered and ranked by these measures.
Therefore, the 15 itemsets found in Eclat were transformed into 28 association rules during the rule induction process. Each association rule represents a potential relationship between items with associated metrics indicating the strength and significance of the relationship.
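The counts of 15 itemsets and 28 rules follow from simple combinatorics when every combination of the four items is frequent and every candidate rule passes the confidence threshold, as the outputs above indicate. The base R sketch below reproduces both numbers, assuming each rule has a single item on the right-hand side, as all rules in the listings do.

```r
n_items <- 4
sizes   <- 1:n_items

# Number of distinct itemsets of each size drawn from four items
itemsets_per_size <- choose(n_items, sizes)
n_itemsets <- sum(itemsets_per_size)

# An itemset of size k >= 2 yields k rules, one per choice of consequent item;
# single-item itemsets yield no rule with a non-empty left-hand side
rules_per_size <- ifelse(sizes >= 2, sizes * itemsets_per_size, 0)
n_rules <- sum(rules_per_size)

c(itemsets = n_itemsets, rules = n_rules)
```

In other words, with only four items and thresholds this low, the rule set is exhaustive rather than selective.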
inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE), 10))
## lhs rhs support confidence lift
## [1] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 1.0269813
## [2] {Australia, Canada} => {UK} 0.2680412 0.6666667 1.0248283
## [3] {Australia} => {UK} 0.4164948 0.6601307 1.0147810
## [4] {Australia, USA} => {UK} 0.2587629 0.6536458 1.0048121
## [5] {Canada, UK} => {Australia} 0.2680412 0.6483791 1.0276596
## [6] {Australia, UK} => {Canada} 0.2680412 0.6435644 0.9908848
## [7] {UK, USA} => {Australia} 0.2587629 0.6419437 1.0174598
## [8] {UK} => {Australia} 0.4164948 0.6402536 1.0147810
## [9] {UK, USA} => {Canada} 0.2577320 0.6393862 0.9844518
## [10] {Canada, USA} => {UK} 0.2577320 0.6393862 0.9828916
## itemset
## [1] 1
## [2] 8
## [3] 9
## [4] 2
## [5] 8
## [6] 8
## [7] 2
## [8] 9
## [9] 4
## [10] 4
The rule with the highest confidence is {Australia, Canada, USA} => {UK}. Its support of 16.39% is the share of all transactions that contain all four countries. Its confidence of 66.81% means that, among transactions containing Australia, Canada, and the USA, 66.81% also contain the UK. The lift of 1.03 indicates only a slight positive association between the antecedent and the consequent.
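As a worked check, the figures for the rule {Australia, Canada, USA} => {UK} can be recomputed from counts reported elsewhere in the output: 159 transactions contain all four countries, 238 contain Australia, Canada, and the USA, and 631 of the 970 transactions contain the UK.

$$
\mathrm{supp} = \frac{159}{970} \approx 0.1639, \qquad
\mathrm{conf} = \frac{159}{238} \approx 0.6681, \qquad
\mathrm{lift} = \frac{159/238}{631/970} \approx 1.0270
$$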
By comparison, the rule {UK} => {Canada}, which falls outside the top ten, has a support of 41.34% and a confidence of 63.55%: of the transactions containing the UK, 63.55% also contain Canada. Its lift of 0.98 indicates a slight negative association, meaning Canada appears about 2% less often alongside the UK than it would if the two were independent. Across all 28 rules, the lowest confidence is 60.87%, for {Canada, USA} => {Australia}.
inspect(head(sort(freq_rules, by = "support", decreasing = TRUE), 10))
## lhs rhs support confidence lift itemset
## [1] {UK} => {Australia} 0.4164948 0.6402536 1.0147810 9
## [2] {Australia} => {UK} 0.4164948 0.6601307 1.0147810 9
## [3] {UK} => {Canada} 0.4134021 0.6354992 0.9784670 11
## [4] {Canada} => {UK} 0.4134021 0.6365079 0.9784670 11
## [5] {USA} => {UK} 0.4030928 0.6326861 0.9725919 5
## [6] {UK} => {USA} 0.4030928 0.6196513 0.9725919 5
## [7] {USA} => {Canada} 0.4030928 0.6326861 0.9741357 6
## [8] {Canada} => {USA} 0.4030928 0.6206349 0.9741357 6
## [9] {Canada} => {Australia} 0.4020619 0.6190476 0.9811702 10
## [10] {Australia} => {Canada} 0.4020619 0.6372549 0.9811702 10
In terms of support, the top rule is {UK} => {Australia}: 41.65% of all transactions contain both countries, and 64.03% of the transactions containing the UK also contain Australia. The lift of 1.01 indicates a very mild positive association.
Further down the list, the rule {Canada} => {USA} has a support of 40.31% and a confidence of 62.06%, meaning that 62.06% of the transactions containing Canada also contain the USA. The lift of 0.97 indicates a slight negative association: the USA is slightly less likely to appear when Canada is present.
High support here mainly reflects that each country appears in more than 60% of the transactions, so any pair of countries co-occurs often. With lift values this close to 1, neither rule represents a strong association.
inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))
## lhs rhs support confidence lift
## [1] {Canada, UK} => {Australia} 0.2680412 0.6483791 1.0276596
## [2] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 1.0269813
## [3] {Australia, Canada} => {UK} 0.2680412 0.6666667 1.0248283
## [4] {UK, USA} => {Australia} 0.2587629 0.6419437 1.0174598
## [5] {UK} => {Australia} 0.4164948 0.6402536 1.0147810
## [6] {Australia} => {UK} 0.4164948 0.6601307 1.0147810
## [7] {Canada, UK, USA} => {Australia} 0.1639175 0.6360000 1.0080392
## [8] {Australia, USA} => {UK} 0.2587629 0.6536458 1.0048121
## [9] {Australia, UK} => {Canada} 0.2680412 0.6435644 0.9908848
## [10] {USA} => {Australia} 0.3958763 0.6213592 0.9848341
## itemset
## [1] 8
## [2] 1
## [3] 8
## [4] 2
## [5] 9
## [6] 9
## [7] 1
## [8] 2
## [9] 8
## [10] 7
The rule with the highest lift is {Canada, UK} => {Australia}. Its lift of 1.03 means that Australia appears about 1.03 times as often in transactions containing Canada and the UK as it would if these events were independent. The support of 26.80% is the share of all transactions that contain all three countries, and the confidence of 64.84% is the share of transactions containing Canada and the UK that also contain Australia.
As an example of a lift below 1, the rule {USA} => {Canada} has a lift of 0.97, so Canada appears 0.97 times as often alongside the USA as it would under independence. The support of 40.31% is the share of all transactions containing both countries, and the confidence of 63.27% is the share of transactions containing the USA that also contain Canada. Across all 28 rules, the lowest lift is 0.95, for {Australia, USA} => {Canada}.
These visualizations assist in uncovering patterns in customer purchasing behavior based on their country identities, leading to actionable insights for marketing or product placement strategies.
plot(freq_rules, method = "matrix", measure = c("support", "confidence"), shading = "lift")
## Itemsets in Antecedent (LHS)
## [1] "{Australia,Canada,USA}" "{Canada,UK,USA}" "{Canada,UK}"
## [4] "{UK,USA}" "{Australia}" "{Australia,Canada}"
## [7] "{UK}" "{Australia,UK}" "{Australia,USA}"
## [10] "{Canada}" "{USA}" "{Australia,UK,USA}"
## [13] "{Canada,USA}" "{Australia,Canada,UK}"
## Itemsets in Consequent (RHS)
## [1] "{USA}" "{Canada}" "{Australia}" "{UK}"
This matrix plot arranges the rules in a grid, with the antecedent itemsets along one axis and the consequent itemsets along the other; the numbered lists printed above map the axis indices to itemsets. The shading of each cell encodes a quality measure of the corresponding rule, giving a compact overview of all 28 rules.
plot(freq_rules, measure = c("support", "confidence"), shading = "lift", jitter = 0)
Here we visualize the relationship between the lift, confidence, and support of the rules. In the scatter plot, most rules with high lift have low to moderate support. The rule with the highest confidence has the lowest support and a high lift. The rules with the lowest lift generally have moderate to high support and low to moderate confidence. When judging the importance of a rule, support and confidence are considered first: support is the probability that a transaction contains both A and B, whereas confidence measures the rule's precision, that is, how often B is present when A is.
In this section, we analyze the full set of association rules, as represented by the graph and the accompanying rule list. The rules were obtained through Eclat followed by rule induction. The objective is to show the relationships and patterns within the transactional data and the associations between the different countries.
plot(freq_rules, method = "graph")
inspect(freq_rules)
## lhs rhs support confidence lift
## [1] {Canada, UK, USA} => {Australia} 0.1639175 0.6360000 1.0080392
## [2] {Australia, UK, USA} => {Canada} 0.1639175 0.6334661 0.9753367
## [3] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 1.0269813
## [4] {Australia, Canada, UK} => {USA} 0.1639175 0.6115385 0.9598581
## [5] {UK, USA} => {Australia} 0.2587629 0.6419437 1.0174598
## [6] {Australia, USA} => {UK} 0.2587629 0.6536458 1.0048121
## [7] {Australia, UK} => {USA} 0.2587629 0.6212871 0.9751594
## [8] {Canada, USA} => {Australia} 0.2453608 0.6086957 0.9647627
## [9] {Australia, USA} => {Canada} 0.2453608 0.6197917 0.9542824
## [10] {Australia, Canada} => {USA} 0.2453608 0.6102564 0.9578458
## [11] {UK, USA} => {Canada} 0.2577320 0.6393862 0.9844518
## [12] {Canada, USA} => {UK} 0.2577320 0.6393862 0.9828916
## [13] {Canada, UK} => {USA} 0.2577320 0.6234414 0.9785407
## [14] {USA} => {UK} 0.4030928 0.6326861 0.9725919
## [15] {UK} => {USA} 0.4030928 0.6196513 0.9725919
## [16] {USA} => {Canada} 0.4030928 0.6326861 0.9741357
## [17] {Canada} => {USA} 0.4030928 0.6206349 0.9741357
## [18] {USA} => {Australia} 0.3958763 0.6213592 0.9848341
## [19] {Australia} => {USA} 0.3958763 0.6274510 0.9848341
## [20] {Canada, UK} => {Australia} 0.2680412 0.6483791 1.0276596
## [21] {Australia, UK} => {Canada} 0.2680412 0.6435644 0.9908848
## [22] {Australia, Canada} => {UK} 0.2680412 0.6666667 1.0248283
## [23] {UK} => {Australia} 0.4164948 0.6402536 1.0147810
## [24] {Australia} => {UK} 0.4164948 0.6601307 1.0147810
## [25] {Canada} => {Australia} 0.4020619 0.6190476 0.9811702
## [26] {Australia} => {Canada} 0.4020619 0.6372549 0.9811702
## [27] {UK} => {Canada} 0.4134021 0.6354992 0.9784670
## [28] {Canada} => {UK} 0.4134021 0.6365079 0.9784670
## itemset
## [1] 1
## [2] 1
## [3] 1
## [4] 1
## [5] 2
## [6] 2
## [7] 2
## [8] 3
## [9] 3
## [10] 3
## [11] 4
## [12] 4
## [13] 4
## [14] 5
## [15] 5
## [16] 6
## [17] 6
## [18] 7
## [19] 7
## [20] 8
## [21] 8
## [22] 8
## [23] 9
## [24] 9
## [25] 10
## [26] 10
## [27] 11
## [28] 11
Inspecting the full list of 28 association rules, we observe rules involving every combination of Australia, Canada, the UK, and the USA. Each rule is characterized by its support, confidence, and lift, which quantify the strength of the association.
In the graph, nodes represent the items (countries) and the rules, while the edges connect each rule to the items in its antecedent and consequent. The graph helps in understanding the network of relationships captured by the rules.
Geographical Associations:
Rules such as {Canada, UK, USA} => {Australia} and {Australia, UK, USA} => {Canada} relate purchases across three countries to a fourth. With lifts of 1.01 and 0.98 respectively, the first suggests that Australia is marginally more likely to appear alongside the other three countries, and the second that Canada is marginally less likely.
Cross-Country Purchasing Patterns:
Rules like {UK} => {Australia} and {USA} => {Canada} describe pairwise cross-country patterns, indicating how often purchases in one country appear together with purchases in another for the same customer.
Negative Associations:
Some rules, with lift values slightly below 1, indicate negative associations. For instance, {UK} => {USA} and {Canada, UK} => {USA} suggest that customers purchasing from the UK, or from both Canada and the UK, are slightly less likely to include items from the USA than would be expected by chance.
In this section, we delve into the intricacies of association rules with a specific focus on the presence of the United Kingdom (UK) in transactions. By concentrating on the UK, we aim to uncover meaningful patterns and relationships between this specific market and other countries within our dataset.
# Apriori algorithm with appearance constraint for "UK"
UK_rules <- apriori(
data = trans,
parameter = list(supp = 0.01, conf = 0.5), # Adjust support and confidence as needed
appearance = list(default = "lhs", rhs = "UK"),
control = list(verbose = FALSE)
)
inspect(sort(UK_rules, by = 'lift'))
## lhs rhs support confidence coverage lift
## [1] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 0.2453608 1.0269813
## [2] {Australia, Canada} => {UK} 0.2680412 0.6666667 0.4020619 1.0248283
## [3] {Australia} => {UK} 0.4164948 0.6601307 0.6309278 1.0147810
## [4] {Australia, USA} => {UK} 0.2587629 0.6536458 0.3958763 1.0048121
## [5] {} => {UK} 0.6505155 0.6505155 1.0000000 1.0000000
## [6] {Canada, USA} => {UK} 0.2577320 0.6393862 0.4030928 0.9828916
## [7] {Canada} => {UK} 0.4134021 0.6365079 0.6494845 0.9784670
## [8] {USA} => {UK} 0.4030928 0.6326861 0.6371134 0.9725919
## count
## [1] 159
## [2] 260
## [3] 404
## [4] 251
## [5] 631
## [6] 250
## [7] 401
## [8] 391
The rules are presented in the format where “lhs” (left-hand side) represents the antecedent (items present before the arrow), and “rhs” (right-hand side) represents the consequent (items after the arrow). They are sorted based on the lift value in descending order.
The top rule, {Australia, Canada, USA} => {UK}, indicates that the combination of Australia, Canada, and the USA is associated with the presence of the UK. Its lift of 1.027 is slightly above 1, suggesting a weak positive association: customers buying from Australia, Canada, and the USA are about 2.7% more likely to include items from the UK than would be expected by chance. Approximately 16.4% of all transactions (159 of 970) contain all four countries. Given the presence of Australia, Canada, and the USA, 66.8% of transactions also contain the UK. Rule 5, with an empty left-hand side, is the baseline: its confidence of 65.1% is simply the share of all transactions that contain the UK.
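Each constrained run also returns a rule with an empty left-hand side, such as {} => {UK}, which only restates the item's overall frequency. If that baseline rule is not wanted, one option is to set minlen = 2 in the parameter list. The sketch below shows this on a small hypothetical transaction set; toy_trans and its contents are made up for illustration, and the snippet assumes the arules package is installed.

```r
library(arules)

# Hypothetical toy transactions, not taken from the data set
toy_trans <- as(
  list(c("UK", "Canada"), c("UK", "USA"), c("UK", "Canada", "USA"), c("Canada")),
  "transactions"
)

# minlen = 2 requires at least two items per rule (lhs + rhs combined),
# so with a single-item rhs the lhs can no longer be empty
toy_rules <- apriori(
  data = toy_trans,
  parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
  appearance = list(default = "lhs", rhs = "UK"),
  control = list(verbose = FALSE)
)

inspect(toy_rules)
```

Keeping the empty-left-hand-side rule is also a reasonable choice, since it shows the base rate against which the other confidences are compared.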
In this segment, we turn to the association rules involving Canada. Using the Apriori algorithm with an appearance constraint that restricts the consequent to “Canada,” we look for patterns within the transactional data.
# Apriori algorithm with appearance constraint for "Canada"
Canada_rules <- apriori(
data = trans,
parameter = list(supp = 0.01, conf = 0.5), # Adjust support and confidence as needed
appearance = list(default = "lhs", rhs = "Canada"),
control = list(verbose = FALSE)
)
By employing parameters such as support and confidence, we seek to uncover rules that showcase the association between Canada and other items in our dataset.
inspect(sort(Canada_rules, by = 'lift'))
## lhs rhs support confidence coverage lift
## [1] {} => {Canada} 0.6494845 0.6494845 1.0000000 1.0000000
## [2] {Australia, UK} => {Canada} 0.2680412 0.6435644 0.4164948 0.9908848
## [3] {UK, USA} => {Canada} 0.2577320 0.6393862 0.4030928 0.9844518
## [4] {Australia} => {Canada} 0.4020619 0.6372549 0.6309278 0.9811702
## [5] {UK} => {Canada} 0.4134021 0.6354992 0.6505155 0.9784670
## [6] {Australia, UK, USA} => {Canada} 0.1639175 0.6334661 0.2587629 0.9753367
## [7] {USA} => {Canada} 0.4030928 0.6326861 0.6371134 0.9741357
## [8] {Australia, USA} => {Canada} 0.2453608 0.6197917 0.3958763 0.9542824
## count
## [1] 630
## [2] 260
## [3] 250
## [4] 390
## [5] 401
## [6] 159
## [7] 391
## [8] 238
plot(Canada_rules, method = "graph")
As shown above, every rule with a non-empty left-hand side has a lift slightly below 1, indicating a weak negative association. For example, customers buying from the UK and the USA are about 1.6% less likely to include items from Canada than would be expected by chance, and customers buying from Australia and the UK are about 0.9% less likely.
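The "percent more or less likely" statements in these sections are obtained by subtracting 1 from the lift. The short base R sketch below applies that conversion to lift values copied from the constrained Apriori outputs in this document; the vector and its labels are set up here for illustration.

```r
# Lift values copied from the constrained Apriori outputs
lift <- c(
  "{Australia, Canada, USA} => {UK}" = 1.0269813,
  "{Canada, UK} => {Australia}"      = 1.0276596,
  "{UK, USA} => {Australia}"         = 1.0174598,
  "{Australia, UK} => {Canada}"      = 0.9908848,
  "{UK, USA} => {Canada}"            = 0.9844518
)

# Percentage deviation from independence (lift = 1)
pct_change <- round((lift - 1) * 100, 1)
pct_change
```

Positive values mean the consequent appears more often than its base rate when the antecedent is present; negative values mean it appears less often.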
The appearance constraint ensures that our rules specifically involve the presence of Australia in transactions.
# Apriori algorithm with appearance constraint for "AUSTRALIA"
Australia_rules <- apriori(
data = trans,
parameter = list(supp = 0.01, conf = 0.5), # Adjust support and confidence as needed
appearance = list(default = "lhs", rhs = "Australia"),
control = list(verbose = FALSE)
)
inspect(sort(Australia_rules, by = 'lift'))
## lhs rhs support confidence coverage lift
## [1] {Canada, UK} => {Australia} 0.2680412 0.6483791 0.4134021 1.0276596
## [2] {UK, USA} => {Australia} 0.2587629 0.6419437 0.4030928 1.0174598
## [3] {UK} => {Australia} 0.4164948 0.6402536 0.6505155 1.0147810
## [4] {Canada, UK, USA} => {Australia} 0.1639175 0.6360000 0.2577320 1.0080392
## [5] {} => {Australia} 0.6309278 0.6309278 1.0000000 1.0000000
## [6] {USA} => {Australia} 0.3958763 0.6213592 0.6371134 0.9848341
## [7] {Canada} => {Australia} 0.4020619 0.6190476 0.6494845 0.9811702
## [8] {Canada, USA} => {Australia} 0.2453608 0.6086957 0.4030928 0.9647627
## count
## [1] 260
## [2] 251
## [3] 404
## [4] 159
## [5] 612
## [6] 384
## [7] 390
## [8] 238
plot(Australia_rules, method = "graph", control = list(layout = "stress", circular = FALSE))
A lift slightly above 1 indicates a positive association. Customers purchasing from {Canada, UK} are about 2.8% more likely to include items from Australia than would be expected by chance, and customers purchasing from {UK, USA} are about 1.7% more likely. From rule 1, given the presence of items from Canada and the UK, 64.8% of transactions also contain items from Australia; from rule 2, given the presence of items from the UK and the USA, the figure is 64.2%.
Overall, rules 1 to 4 show weak positive associations between Australia and combinations that include the UK, with lift values above 1 indicating that these combinations co-occur slightly more often than expected by chance. Rules 6 to 8, none of which include the UK in the antecedent, have lift values below 1. Rule 5, with an empty left-hand side, is the baseline: Australia appears in 63.1% of all transactions.
By setting the same support and confidence thresholds, we aim to identify rules that relate the USA to the other items in our dataset. The appearance constraint restricts the right-hand side of every rule to USA.
# Apriori algorithm with appearance constraint for "USA"
USA_rules <- apriori(
data = trans,
parameter = list(supp = 0.01, conf = 0.5), # Adjust support and confidence as needed
appearance = list(default = "lhs", rhs = "USA"),
control = list(verbose = FALSE)
)
inspect(sort(USA_rules, by = 'lift'))
## lhs rhs support confidence coverage lift
## [1] {} => {USA} 0.6371134 0.6371134 1.0000000 1.0000000
## [2] {Australia} => {USA} 0.3958763 0.6274510 0.6309278 0.9848341
## [3] {Canada, UK} => {USA} 0.2577320 0.6234414 0.4134021 0.9785407
## [4] {Australia, UK} => {USA} 0.2587629 0.6212871 0.4164948 0.9751594
## [5] {Canada} => {USA} 0.4030928 0.6206349 0.6494845 0.9741357
## [6] {UK} => {USA} 0.4030928 0.6196513 0.6505155 0.9725919
## [7] {Australia, Canada, UK} => {USA} 0.1639175 0.6115385 0.2680412 0.9598581
## [8] {Australia, Canada} => {USA} 0.2453608 0.6102564 0.4020619 0.9578458
## count
## [1] 618
## [2] 384
## [3] 250
## [4] 251
## [5] 391
## [6] 391
## [7] 159
## [8] 238
plot(USA_rules, method = "graph", control = list(layout = "stress", circular = FALSE))
From rule 2, around 39.6% of transactions involve items from Australia along with items from the USA. From rule 3, approximately 25.8% of transactions involve items from Canada and the UK along with items from the USA. Overall, the rules show weak negative associations between the USA and the combinations of items from Australia, Canada, and the UK: the lift values below 1 indicate that these combinations co-occur slightly less often than expected by chance. Rule 1, with an empty left-hand side, is the baseline: items from the USA appear in 63.7% of all transactions.
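One detail worth noting is that lift is symmetric: {Australia} => {USA} here and {USA} => {Australia} in the Australia rules both report 0.9848341, even though their confidences differ. The sketch below checks this in base R from the printed counts; the total of 970 transactions is an assumption inferred from count / support.

```r
# Counts taken from the rule tables above.
n_trans <- 970  # assumption: inferred from count / support = 618 / 0.6371134
n_aus   <- 612  # transactions containing Australia
n_usa   <- 618  # transactions containing USA
n_both  <- 384  # transactions containing both

# Confidence depends on the direction of the rule ...
conf_aus_to_usa <- n_both / n_aus
conf_usa_to_aus <- n_both / n_usa

# ... but lift does not: both reduce to n_trans * n_both / (n_aus * n_usa).
lift_aus_to_usa <- conf_aus_to_usa / (n_usa / n_trans)
lift_usa_to_aus <- conf_usa_to_aus / (n_aus / n_trans)

print(round(c(conf_aus_to_usa, conf_usa_to_aus, lift_aus_to_usa, lift_usa_to_aus), 7))
```

Because of this symmetry, a lift below 1 says only that the two items co-occur less often than independence would predict; it says nothing about which item drives the other.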
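Dissimilarity is a numerical measure of how different two data items are. When the dissimilarity is low, the items under observation are similar; when it is high, the items are different.
To measure dissimilarity, the Jaccard index is used. The index itself measures similarity, so the dissimilarity reported below is one minus the index, where A and B are the sets of transactions containing each item:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d(A,B) = 1 - J(A,B)$$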
trans.sel <- trans[, itemFrequency(trans) > 0.5]
jac <- dissimilarity(trans.sel, which = "items")
round(jac, digits = 3)
## Australia Canada UK
## Canada 0.542
## UK 0.518 0.534
## USA 0.546 0.544 0.544
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
Canada vs. Australia (0.542): The Jaccard dissimilarity between the items Canada and Australia is 54.2%. Among the transactions that contain at least one of the two items, 54.2% contain only one of them and 45.8% contain both.
UK vs. Australia (0.518): The Jaccard dissimilarity between the UK and Australia is 51.8%, the lowest in the matrix. Among the transactions that contain at least one of the two items, 51.8% contain only one of them.
USA vs. Canada (0.544): The Jaccard dissimilarity between the USA and Canada is 54.4%. Among the transactions that contain at least one of the two items, 54.4% contain only one of them.
USA vs. UK (0.544): The Jaccard dissimilarity between the USA and the UK is also 54.4%, with the same interpretation.
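The way the Jaccard dissimilarity counts transactions can be illustrated on a toy example. The two incidence vectors below are hypothetical and are not taken from the dataset; the sketch computes one minus the Jaccard index by hand and compares it with base R's `dist(..., method = "binary")`, which uses the same definition.

```r
# Hypothetical incidence vectors: 1 means the item appears in that transaction.
australia <- c(1, 1, 0, 1, 0, 1)
canada    <- c(1, 0, 1, 1, 0, 0)

both   <- sum(australia == 1 & canada == 1)  # transactions containing both items
either <- sum(australia == 1 | canada == 1)  # transactions containing at least one

jaccard_dissimilarity <- 1 - both / either

# Base R's "binary" distance: share of positions where only one vector is on,
# among positions where at least one is on.
base_r <- as.numeric(dist(rbind(australia, canada), method = "binary"))

print(c(by_hand = jaccard_dissimilarity, base_r = base_r))
```

Transactions that contain neither item do not enter the calculation, which is why the measure suits sparse transaction data.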
Affinity measures the strength of association. In this case, the higher the affinity value, the higher the probability that two items (here, countries) are bought together or share common purchasing patterns.
It is calculated as:
$$A(i,j) = \frac{\mathrm{supp}(i,j)}{\mathrm{supp}(i) + \mathrm{supp}(j) - \mathrm{supp}(i,j)}$$
a <- affinity(trans.sel)
round(a, digits = 3)
## An object of class "ar_similarity"
## Australia Canada UK USA
## Australia 0.000 0.458 0.482 0.454
## Canada 0.458 0.000 0.466 0.456
## UK 0.482 0.466 0.000 0.456
## USA 0.454 0.456 0.456 0.000
## Slot "method":
## [1] "Affinity"
All pairwise affinity values are below 0.5. Within that range, the pairs with the highest probability of being purchased together are the UK and Australia (0.482), followed by Canada and the UK (0.466) and Australia and Canada (0.458).
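The affinity values above can be reproduced in base R from the counts printed in the rule tables. This is a sketch under one assumption: the Canada and UK pair count of 401 is not printed directly and is inferred from coverage × 970 transactions. Because the total number of transactions cancels, counts can be used in place of supports.

```r
# Item counts from the rules with an empty left-hand side.
item_count <- c(Australia = 612, Canada = 630, UK = 631, USA = 618)

# Pair counts from the single-item rules; 401 for Canada/UK is an inferred value.
pairs <- data.frame(
  i    = c("Australia", "Australia", "Australia", "Canada", "Canada", "UK"),
  j    = c("Canada", "UK", "USA", "UK", "USA", "USA"),
  both = c(390, 404, 384, 401, 391, 391),
  stringsAsFactors = FALSE
)

# A(i, j) = both / (count_i + count_j - both)
pairs$affinity <- pairs$both /
  unname(item_count[pairs$i] + item_count[pairs$j] - pairs$both)

# Strongest pair first.
print(pairs[order(-pairs$affinity), ])
```

Each affinity is also one minus the corresponding Jaccard dissimilarity, so the affinity matrix and the dissimilarity matrix carry the same information.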
The goal of assessing redundancy is to identify and eliminate rules that provide essentially the same information. Below, I check each rule set for redundancy to find rules that are subsumed by others or provide similar information, which allows me to focus on the most relevant rules.
is_redundant_uk <- is.redundant(UK_rules)
is_redundant_uk
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
supporting_transactions_uk <- supportingTransactions(UK_rules, trans)
is_redundant_canada <- is.redundant(Canada_rules)
is_redundant_canada
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
supporting_transactions_canada <- supportingTransactions(Canada_rules, trans)
is_redundant_australia <- is.redundant(Australia_rules)
is_redundant_australia
## [1] FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
supporting_transactions_australia <- supportingTransactions(Australia_rules, trans)
is_redundant_usa <- is.redundant(USA_rules)
is_redundant_usa
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
supporting_transactions_usa <- supportingTransactions(USA_rules, trans)
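The Australia flags can be cross-checked by hand. The sketch below assumes the confidence-based definition of redundancy described in the arules documentation for `is.redundant()`: a rule is redundant if a more general rule (a proper subset of its left-hand side, same right-hand side) has the same or higher confidence. The helper `is_proper_subset` is a hypothetical name, and the rules are listed in the lift-sorted order of the inspected table, which may differ from the order of the logical vector printed above.

```r
# Australia rules (all with rhs = Australia) in the order of the sorted table.
rules <- list(
  list(lhs = c("Canada", "UK"),        conf = 0.6483791),
  list(lhs = c("UK", "USA"),           conf = 0.6419437),
  list(lhs = "UK",                     conf = 0.6402536),
  list(lhs = c("Canada", "UK", "USA"), conf = 0.6360000),
  list(lhs = character(0),             conf = 0.6309278),
  list(lhs = "USA",                    conf = 0.6213592),
  list(lhs = "Canada",                 conf = 0.6190476),
  list(lhs = c("Canada", "USA"),       conf = 0.6086957)
)

# Hypothetical helper: TRUE if a is a proper subset of b.
is_proper_subset <- function(a, b) length(a) < length(b) && all(a %in% b)

# A rule is flagged if any more general rule is at least as confident.
redundant <- sapply(seq_along(rules), function(k) {
  any(sapply(seq_along(rules), function(m) {
    is_proper_subset(rules[[m]]$lhs, rules[[k]]$lhs) &&
      rules[[m]]$conf >= rules[[k]]$conf
  }))
})

print(redundant)
print(sum(!redundant))  # number of non-redundant rules
```

Under this definition the rule with the empty left-hand side can never be redundant, and any rule whose confidence falls below the baseline is always flagged.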
Having identified the redundant rules, the next step is to remove them so that only the most important ones remain.
non_redundant_uk_rules <- UK_rules[!is_redundant_uk]
non_redundant_uk_rules
## set of 4 rules
non_redundant_usa_rules <- USA_rules[!is_redundant_usa]
non_redundant_usa_rules
## set of 1 rules
non_redundant_canada_rules <- Canada_rules[!is_redundant_canada]
non_redundant_canada_rules
## set of 1 rules
non_redundant_australia_rules <- Australia_rules[!is_redundant_australia]
non_redundant_australia_rules
## set of 4 rules
all_rules <- c(non_redundant_uk_rules, non_redundant_canada_rules, non_redundant_australia_rules, non_redundant_usa_rules)
inspect(sort(all_rules, by = 'lift')[1:10])
## lhs rhs support confidence coverage
## [1] {Canada, UK} => {Australia} 0.2680412 0.6483791 0.4134021
## [2] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672 0.2453608
## [3] {Australia, Canada} => {UK} 0.2680412 0.6666667 0.4020619
## [4] {UK, USA} => {Australia} 0.2587629 0.6419437 0.4030928
## [5] {Australia} => {UK} 0.4164948 0.6601307 0.6309278
## [6] {UK} => {Australia} 0.4164948 0.6402536 0.6505155
## [7] {} => {UK} 0.6505155 0.6505155 1.0000000
## [8] {} => {Canada} 0.6494845 0.6494845 1.0000000
## [9] {} => {Australia} 0.6309278 0.6309278 1.0000000
## [10] {} => {USA} 0.6371134 0.6371134 1.0000000
## lift count
## [1] 1.027660 260
## [2] 1.026981 159
## [3] 1.024828 260
## [4] 1.017460 251
## [5] 1.014781 404
## [6] 1.014781 404
## [7] 1.000000 631
## [8] 1.000000 630
## [9] 1.000000 612
## [10] 1.000000 618
plot(all_rules, method="paracoord", control=list(reorder=TRUE))
As shown above, none of the remaining rules is redundant. We are left with the ten most useful association rules, which still indicate that customers from Australia, Canada, and the USA exhibited a high likelihood of purchasing items from the UK.
The analysis used the Apriori algorithm, the Eclat algorithm, and an affinity matrix to extract association rules that describe customer purchasing patterns. Notably, we identified associations between certain countries and product categories, revealing interesting cross-border shopping trends.
Customers from Australia, Canada, and the USA exhibited a high likelihood of purchasing items from the UK, indicating potential opportunities for targeted marketing or product bundling.
The affinity matrix illuminated the similarity between countries based on customer transactions. Notably, the matrix revealed distinct patterns, such as higher affinity between the UK and Australia.
Overall, the redundancy analysis supports the conclusion that the UK is the main market point for all of these countries.