INTRODUCTION

Association Rule Mining

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in a database, and several algorithms can be used for this; Apriori and Eclat are both applied below.

DATA SET DETAILS

The data set involves online retail shops where customers from various countries are actively engaging in transactions. The data set captures details such as CustomerID, ProductID, Quantity, Price, Country, Category, and Date for each purchase.

Objective: This project aims to analyze customer purchasing behavior based on geographical patterns. The primary goals include:

Geographical Insights: Exploring and analysing sales patterns across different countries to identify potential market trends or regional preferences.
Performance Metrics: Calculating key performance metrics such as support, confidence, and lift for association rules to quantify the strength and relevance of discovered patterns.
Visualization: Utilizing visualizations to present meaningful insights, aiding in the interpretation of complex patterns and relationships within the dataset.
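
The three metrics named above can be illustrated on a tiny example. The sketch below is only an illustration in base R: the matrix `toy` and the items A and B are hypothetical and are not taken from the retail data.

```r
# Toy data (hypothetical): each row is a transaction, each column an item.
toy <- matrix(
  c(TRUE,  TRUE,
    TRUE,  FALSE,
    FALSE, TRUE,
    TRUE,  TRUE,
    FALSE, FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("A", "B"))
)

n <- nrow(toy)
support_A  <- sum(toy[, "A"]) / n                 # P(A)
support_B  <- sum(toy[, "B"]) / n                 # P(B)
support_AB <- sum(toy[, "A"] & toy[, "B"]) / n    # P(A and B)

confidence <- support_AB / support_A   # P(B | A) for the rule {A} => {B}
lift       <- confidence / support_B   # > 1: positive association, < 1: negative

rule_metrics <- c(support = support_AB, confidence = confidence, lift = lift)
print(round(rule_metrics, 3))
```

A lift close to 1 means the antecedent tells us little about the consequent, which matters when reading the rules later in this report.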

Loading Data

options(repos = c(CRAN = "https://cloud.r-project.org"))

install.packages("arules")

## 
## The downloaded binary packages are in
##  /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp7mTaGS/downloaded_packages

install.packages("igraph")

## 
##   There is a binary version available but the source version is later:
##        binary source needs_compilation
## igraph  1.5.1  1.6.0              TRUE

## installing the source package 'igraph'

## Warning in install.packages("igraph"): installation of package 'igraph' had
## non-zero exit status

install.packages("arulesViz")

## 
## The downloaded binary packages are in
##  /var/folders/11/1z2rlbps68d_flh4dt29lm100000gq/T//Rtmp7mTaGS/downloaded_packages

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(readr)

online_retail_data <- read_csv("online_retail_data.csv")

## Rows: 4000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Category
## dbl  (4): CustomerID, ProductID, Quantity, Price
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(online_retail_data)

## # A tibble: 6 × 7
##   CustomerID ProductID Quantity Price Country Category    Date      
##        <dbl>     <dbl>    <dbl> <dbl> <chr>   <chr>       <date>    
## 1       1102        22        4  22.4 UK      Clothing    2022-01-01
## 2       1435        48        1  59.5 USA     Electronics 2022-01-02
## 3       1860         2        1  91.4 Canada  Home        2022-01-03
## 4       1270        10        2  80.0 Canada  Clothing    2022-01-04
## 5       1106        34        9  92.2 USA     Home        2022-01-05
## 6       1071        16        2  97.9 Canada  Clothing    2022-01-06

As the output of head() shows, each row records a single purchase: the customer and product identifiers, the quantity and price, the customer's country, the product category, and the date on which the transaction occurred.

retail_dimensions <- dim(online_retail_data)
retail_dimensions

## [1] 4000    7

The data set used to extract the rules therefore contains 4000 observations and 7 variables.

Data cleaning

Checking for missing values

missing_values <- sum(is.na(online_retail_data))
missing_values

## [1] 0

As shown above, the data set contains no missing values.
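
If the total had not been zero, a per-column breakdown would show where the gaps are. The following is a minimal sketch on a hypothetical data frame `demo`, with two missing values planted on purpose; it is not the retail data.

```r
# Hypothetical stand-in for the retail data, with two NAs planted on purpose.
demo <- data.frame(
  CustomerID = c(1102, 1435, NA, 1270),
  Quantity   = c(4, 1, 1, NA),
  Country    = c("UK", "USA", "Canada", "Canada")
)

na_per_column <- colSums(is.na(demo))       # missing values per column
na_total      <- sum(is.na(demo))           # same overall check as in the text
complete_rows <- sum(complete.cases(demo))  # rows without any missing value

print(na_per_column)
```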

TRANSFORMING DATA INTO TRANSACTIONS

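This step transforms the data set into the transaction format expected by the arules package so that association rule mining can be performed on it. Each CustomerID becomes one transaction, and the countries in which that customer made purchases become its items.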

online_retail_data <- read.csv("online_retail_data.csv")

binary_matrix <- table(online_retail_data$CustomerID, online_retail_data$Country) > 0

trans <- as(binary_matrix, "transactions")

length(trans)

## [1] 970

LIST(head(trans))

## $`1000`
## [1] "Australia" "Canada"    "UK"        "USA"      
## 
## $`1001`
## [1] "Australia" "Canada"    "UK"       
## 
## $`1002`
## [1] "Australia" "Canada"   
## 
## $`1003`
## [1] "UK"
## 
## $`1004`
## [1] "Australia" "Canada"    "USA"      
## 
## $`1005`
## [1] "UK"
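
The conversion reduces the 4000 purchase rows to 970 transactions, one per customer. The sketch below shows the same table() > 0 step in base R on a hypothetical purchase log (`purchases` is made up for illustration); it stops before the as(..., "transactions") call, which needs arules.

```r
# Hypothetical purchase log: one row per purchase.
purchases <- data.frame(
  CustomerID = c(1000, 1000, 1000, 1001, 1001, 1002),
  Country    = c("UK", "UK", "USA", "Canada", "UK", "Canada")
)

# table() counts purchases per (customer, country) pair;
# comparing with 0 keeps only whether the pair occurred at all.
counts    <- table(purchases$CustomerID, purchases$Country)
incidence <- counts > 0
print(incidence)

# Countries per customer, analogous to the LIST(trans) output above.
baskets <- lapply(rownames(incidence), function(id) {
  colnames(incidence)[incidence[id, ]]
})
names(baskets) <- rownames(incidence)
print(baskets)
```

Repeated purchases in the same country collapse into a single item, so quantities and purchase counts play no role in the rules mined from `trans`.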

Contingency table

# Count, for every pair of countries, the transactions that contain both
ctab <- crossTable(trans, measure = "count", sort = TRUE)
ctab

##            UK Canada USA Australia
## UK        631    401 391       404
## Canada    401    630 391       390
## USA       391    391 618       384
## Australia 404    390 384       612
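
In the table above, the diagonal holds the number of transactions containing each country and the off-diagonal cells hold the number containing both countries of a pair (for example, 404 for UK and Australia). A count table of this kind can be reproduced from a 0/1 incidence matrix with a cross product. The sketch below uses a hypothetical matrix `incidence`, not the retail data.

```r
# Hypothetical incidence matrix: rows are customers, columns are countries.
incidence <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("UK", "Canada", "USA"))
)

# t(M) %*% M: the diagonal holds item counts, the off-diagonal
# cells hold the number of customers shared by each pair of countries.
co_occurrence <- crossprod(incidence)
print(co_occurrence)
```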

# Displaying a summary of the transactions
summary(trans)
inspect(trans)
sample(trans)

#transactions as itemMatrix in sparse format with
# 970 rows (elements/itemsets/transactions) and
# 4 columns (items) and a density of 0.6420103 

#most frequent items:
#       UK    Canada       USA Australia   (Other) 
#      631       630       618       612         0 

#element (itemset/transaction) length distribution:
#sizes
#  1   2   3   4 
#130 318 363 159 

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  1.000   2.000   3.000   2.568   3.000   4.000 

#includes extended item information - examples:
#     labels
#1 Australia
#2    Canada
#3        UK

#includes extended transaction information - examples:
#  transactionID
#1          1000
#2          1001
#3          1002

Before rules can be mined, the entries have to be converted into transactions. In the summary statistics above, the items are countries, and the counts show how many transactions each country appears in; the UK is the most frequent item (631). The length distribution gives the sizes of the transactions in the data set. For example, there are 130 transactions with only one item, 318 transactions with two items, and so on. The median transaction length is 3, meaning that at least half of the transactions have three items or fewer.
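
The density and mean length reported in the summary can be checked by hand from the length distribution. The sketch below only re-does that arithmetic in base R with the numbers copied from the output above.

```r
# Values copied from the summary(trans) output above.
sizes      <- c(1, 2, 3, 4)            # items per transaction
n_per_size <- c(130, 318, 363, 159)    # transactions of each size
n_items    <- 4                        # distinct countries
n_trans    <- sum(n_per_size)          # number of transactions

total_entries <- sum(sizes * n_per_size)   # filled cells in the matrix
mean_length   <- total_entries / n_trans
density       <- total_entries / (n_trans * n_items)

print(c(total = total_entries, mean = round(mean_length, 3),
        density = round(density, 7)))
```

The total of filled cells also equals the sum of the four item frequencies (631 + 630 + 618 + 612), which is a quick consistency check on the summary.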

Apriori Algorithm

The Apriori algorithm first finds frequent itemsets and then derives rules from them.
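
The frequent-itemset phase can be sketched as a level-wise search. The code below is a simplified base R illustration under stated assumptions: the matrix `baskets` and the threshold `min_count` are hypothetical, and this is not how arules implements apriori() internally.

```r
# Hypothetical baskets stored as a logical matrix (rows = transactions).
baskets <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 1,
    0, 1, 1,
    1, 1, 1) == 1,
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)
min_count <- 3   # an itemset is frequent if at least 3 of 5 rows contain it

# Number of rows in which all items of the set are present.
itemset_count <- function(items) {
  sum(rowSums(baskets[, items, drop = FALSE]) == length(items))
}

frequent <- list()
for (k in seq_len(ncol(baskets))) {
  candidates <- combn(colnames(baskets), k, simplify = FALSE)
  if (k > 1) {
    # Apriori pruning: every (k-1)-subset must already be frequent.
    previous <- vapply(frequent[[k - 1]], paste, character(1), collapse = ",")
    candidates <- Filter(function(set) {
      subsets <- combn(set, k - 1, FUN = paste, collapse = ",")
      all(subsets %in% previous)
    }, candidates)
  }
  keep <- Filter(function(set) itemset_count(set) >= min_count, candidates)
  if (length(keep) == 0) break
  frequent[[k]] <- keep
}

print(frequent)
```

Rules are then formed by splitting each frequent itemset into an antecedent and a consequent and keeping the splits whose confidence passes the threshold.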

Creating Rules

#initial
rules <- apriori(online_retail_data)

## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default
## discretization (see '? discretizeDF').

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 400 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4020 item(s), 4000 transaction(s)] done [0.01s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

 summary(rules)

## set of 0 rules

The initial run of the Apriori algorithm used the default parameters, support = 0.1 and confidence = 0.8. With these settings the algorithm only keeps rules whose items occur together in at least 10% of transactions and whose confidence is at least 80%.

These conditions turned out to be too stringent: they filtered out a large portion of potential rules and limited the number of frequent itemsets, especially because the data set has a diverse set of items. As a result, the output contained zero rules.

For this reason, I adjusted the parameters to support = 0.01 and confidence = 0.5, which makes the search less restrictive.
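
The "Absolute minimum support count" lines printed by the two apriori() runs in this section are consistent with multiplying the relative support threshold by the number of transactions. The sketch below re-does that arithmetic in base R; the variable names are illustrative.

```r
n_transactions <- 4000   # rows in online_retail_data

# Support thresholds used in the two apriori() runs in this section.
thresholds <- c(default = 0.1, adjusted = 0.01)

# Minimum number of rows an itemset must appear in to count as frequent.
min_counts <- thresholds * n_transactions
print(min_counts)
```

Lowering the threshold from 400 rows to 40 rows is what allows itemsets built from several binned columns to qualify.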

#adjusted
rules <- apriori(online_retail_data, parameter = list(support = 0.01, confidence = 0.5))

## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default
## discretization (see '? discretizeDF').

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 40 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4020 item(s), 4000 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [50 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 50 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4 
##  4 46 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    4.00    3.92    4.00    4.00 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01000   Min.   :0.5000   Min.   :0.01950   Min.   :1.127  
##  1st Qu.:0.01125   1st Qu.:0.5076   1st Qu.:0.02125   1st Qu.:1.144  
##  Median :0.01300   Median :0.5160   Median :0.02438   Median :1.166  
##  Mean   :0.01475   Mean   :0.5221   Mean   :0.02841   Mean   :1.190  
##  3rd Qu.:0.01469   3rd Qu.:0.5366   3rd Qu.:0.02881   3rd Qu.:1.220  
##  Max.   :0.04050   Max.   :0.5696   Max.   :0.07975   Max.   :1.534  
##      count       
##  Min.   : 40.00  
##  1st Qu.: 45.00  
##  Median : 52.00  
##  Mean   : 59.00  
##  3rd Qu.: 58.75  
##  Max.   :162.00  
## 
## mining info:
##                data ntransactions support confidence
##  online_retail_data          4000    0.01        0.5
##                                                                                    call
##  apriori(data = online_retail_data, parameter = list(support = 0.01, confidence = 0.5))

The adjustment to lower support and confidence thresholds resulted in the discovery of 50 rules. This suggests that relaxing the criteria allowed the algorithm to find associations in the data. The rule length distribution shows 4 rules that contain 3 items and 46 rules that contain 4 items, giving a median length of 4; most rules therefore involve four items.

The support, confidence, and lift values vary across the rules, providing insights into the strength and relevance of the discovered associations. Both the first and the third quartile of rule length are 4, which confirms that a substantial proportion of the rules share the same length. The summary of quality measures shows support ranging from 0.010 to 0.0405, confidence from 0.50 to 0.57, and lift from 1.13 to 1.53, so every rule has a lift above 1, although most only slightly.

inspect(rules[1:10])

##      lhs                               rhs                              support confidence coverage     lift count
## [1]  {ProductID=[1,17),                                                                                           
##       Category=Home}                => {Quantity=[6,9]}                 0.04050  0.5078370  0.07975 1.144421   162
## [2]  {Country=USA,                                                                                                
##       Category=Toys}                => {Quantity=[6,9]}                 0.03050  0.5083333  0.06000 1.145540   122
## [3]  {CustomerID=[1e+03,1.34e+03),                                                                                
##       Country=USA}                  => {Quantity=[6,9]}                 0.03850  0.5016287  0.07675 1.130431   154
## [4]  {Country=UK,                                                                                                 
##       Category=Toys}                => {Quantity=[6,9]}                 0.03375  0.5075188  0.06650 1.143704   135
## [5]  {Quantity=[1,3),                                                                                             
##       Price=[1.01,34.2),                                                                                          
##       Category=Home}                => {CustomerID=[1.34e+03,1.67e+03)} 0.01000  0.5128205  0.01950 1.534242    40
## [6]  {ProductID=[1,17),                                                                                           
##       Country=Australia,                                                                                          
##       Category=Home}                => {Quantity=[6,9]}                 0.01125  0.5696203  0.01975 1.283651    45
## [7]  {Price=[66.7,100],                                                                                           
##       Country=Australia,                                                                                          
##       Category=Home}                => {Quantity=[6,9]}                 0.01125  0.5113636  0.02200 1.152369    45
## [8]  {CustomerID=[1e+03,1.34e+03),                                                                                
##       Price=[66.7,100],                                                                                           
##       Country=Australia}            => {Quantity=[6,9]}                 0.01625  0.5078125  0.03200 1.144366    65
## [9]  {CustomerID=[1e+03,1.34e+03),                                                                                
##       ProductID=[17,34),                                                                                          
##       Country=Australia}            => {Quantity=[6,9]}                 0.01500  0.5172414  0.02900 1.165614    60
## [10] {CustomerID=[1e+03,1.34e+03),                                                                                
##       Country=USA,                                                                                                
##       Category=Home}                => {Quantity=[6,9]}                 0.01075  0.5375000  0.02000 1.211268    43
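
The columns of this output are linked: confidence is support divided by coverage, and dividing confidence by lift recovers the overall share of the consequent. The sketch below checks this for rule [1] in base R, using only numbers copied from the output above; the derived share of Quantity=[6,9] is an inference from those numbers, not a value reported by arules.

```r
# Values copied from rule [1] in the inspect() output above.
n        <- 4000
count    <- 162        # transactions matching both lhs and rhs
support  <- 0.04050
coverage <- 0.07975    # share of transactions matching the lhs
lift     <- 1.144421

confidence  <- support / coverage   # P(rhs | lhs)
rhs_support <- confidence / lift    # implied overall share of Quantity=[6,9]

print(c(support_from_count = count / n,
        confidence = round(confidence, 7),
        rhs_support = round(rhs_support, 4)))
```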

Generating item-frequencies

itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")

head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=60)

##        UK    Canada       USA Australia 
##       631       630       618       612

According to the summary of the data, the four frequencies are close to one another, but the UK has the highest (631) and therefore appears in the most transactions, meaning that slightly more customers purchased from the UK than from any other country. The item frequency plot above visualizes these counts for the countries that customers purchase from.

ECLAT ALGORITHM

rules <- eclat(trans, parameter = list(supp = 0.1, maxlen = 4))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE     0.1      1      4 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 97 
## 
## create itemset ... 
## set transactions ...[4 item(s), 970 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating bit matrix ... [4 row(s), 970 column(s)] done [0.00s].
## writing  ... [15 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

To extract meaningful patterns and associations from our dataset, we employed a two-step approach involving the Eclat algorithm and the rule induction process. First, Eclat was applied to identify frequent itemsets (15 in total), representing combinations of items that co-occur frequently in the data. This step is essential for discovering potential associations among items.

freq_rules<-ruleInduction(rules, trans, confidence=0.5)

inspect(freq_rules[1:10])

##      lhs                         rhs         support   confidence lift     
## [1]  {Canada, UK, USA}        => {Australia} 0.1639175 0.6360000  1.0080392
## [2]  {Australia, UK, USA}     => {Canada}    0.1639175 0.6334661  0.9753367
## [3]  {Australia, Canada, USA} => {UK}        0.1639175 0.6680672  1.0269813
## [4]  {Australia, Canada, UK}  => {USA}       0.1639175 0.6115385  0.9598581
## [5]  {UK, USA}                => {Australia} 0.2587629 0.6419437  1.0174598
## [6]  {Australia, USA}         => {UK}        0.2587629 0.6536458  1.0048121
## [7]  {Australia, UK}          => {USA}       0.2587629 0.6212871  0.9751594
## [8]  {Canada, USA}            => {Australia} 0.2453608 0.6086957  0.9647627
## [9]  {Australia, USA}         => {Canada}    0.2453608 0.6197917  0.9542824
## [10] {Australia, Canada}      => {USA}       0.2453608 0.6102564  0.9578458
##      itemset
## [1]  1      
## [2]  1      
## [3]  1      
## [4]  1      
## [5]  2      
## [6]  2      
## [7]  2      
## [8]  3      
## [9]  3      
## [10] 3

To turn these itemsets into rules, we then used the ruleInduction function. This function transforms the frequent itemsets into association rules and reports metrics such as support, confidence, and lift for each one. With these metrics we were able to filter the rules (here with a minimum confidence of 0.5) and rank them, focusing on those with higher confidence levels and stronger support.

Therefore, the 15 itemsets found in Eclat were transformed into 28 association rules during the rule induction process. Each association rule represents a potential relationship between items with associated metrics indicating the strength and significance of the relationship.
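
The counts of 15 itemsets and 28 rules can be reproduced by simple counting, assuming that every non-empty combination of the four countries is frequent and that each rule has exactly one item in its consequent (both assumptions are consistent with the output shown above). A minimal base R sketch:

```r
n_items <- 4            # UK, Canada, USA, Australia
sizes   <- 1:n_items

itemsets_per_size <- choose(n_items, sizes)   # itemsets of size 1, 2, 3, 4

# An itemset of size k yields k rules with a single-item consequent;
# single-item sets are skipped because no antecedent would remain.
rules_per_size <- ifelse(sizes >= 2, sizes * itemsets_per_size, 0)

print(c(itemsets = sum(itemsets_per_size), rules = sum(rules_per_size)))
```

All 28 candidate rules survive because their confidence values lie above the 0.5 threshold.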

Analyzing rules by confidence, support and lift

inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE), 10))

##      lhs                         rhs         support   confidence lift     
## [1]  {Australia, Canada, USA} => {UK}        0.1639175 0.6680672  1.0269813
## [2]  {Australia, Canada}      => {UK}        0.2680412 0.6666667  1.0248283
## [3]  {Australia}              => {UK}        0.4164948 0.6601307  1.0147810
## [4]  {Australia, USA}         => {UK}        0.2587629 0.6536458  1.0048121
## [5]  {Canada, UK}             => {Australia} 0.2680412 0.6483791  1.0276596
## [6]  {Australia, UK}          => {Canada}    0.2680412 0.6435644  0.9908848
## [7]  {UK, USA}                => {Australia} 0.2587629 0.6419437  1.0174598
## [8]  {UK}                     => {Australia} 0.4164948 0.6402536  1.0147810
## [9]  {UK, USA}                => {Canada}    0.2577320 0.6393862  0.9844518
## [10] {Canada, USA}            => {UK}        0.2577320 0.6393862  0.9828916
##      itemset
## [1]  1      
## [2]  8      
## [3]  9      
## [4]  2      
## [5]  8      
## [6]  8      
## [7]  2      
## [8]  9      
## [9]  4      
## [10] 4

The highest-confidence rule, {Australia, Canada, USA} => {UK}, has a support of 16.39%, meaning that 16.39% of all transactions contain all four countries. Its confidence of 66.81% means that, of the transactions containing Australia, Canada, and the USA, 66.81% also contain the UK. The lift value of 1.03 suggests only a slight positive association between the antecedent and consequent.

By comparison, the rule {UK} => {Canada}, which falls outside the top ten, has a support of 41.34% and a confidence of 63.55%: of the transactions containing the UK, 63.55% also contain Canada. The lift value of 0.98 suggests a slightly negative association, meaning that Canada appears alongside the UK about 2% less often than it would if the two were independent.

inspect(head(sort(freq_rules, by = "support", decreasing = TRUE), 10))

##      lhs            rhs         support   confidence lift      itemset
## [1]  {UK}        => {Australia} 0.4164948 0.6402536  1.0147810  9     
## [2]  {Australia} => {UK}        0.4164948 0.6601307  1.0147810  9     
## [3]  {UK}        => {Canada}    0.4134021 0.6354992  0.9784670 11     
## [4]  {Canada}    => {UK}        0.4134021 0.6365079  0.9784670 11     
## [5]  {USA}       => {UK}        0.4030928 0.6326861  0.9725919  5     
## [6]  {UK}        => {USA}       0.4030928 0.6196513  0.9725919  5     
## [7]  {USA}       => {Canada}    0.4030928 0.6326861  0.9741357  6     
## [8]  {Canada}    => {USA}       0.4030928 0.6206349  0.9741357  6     
## [9]  {Canada}    => {Australia} 0.4020619 0.6190476  0.9811702 10     
## [10] {Australia} => {Canada}    0.4020619 0.6372549  0.9811702 10

In terms of support, the top rule, {UK} => {Australia}, has a support of 41.65%, meaning that 41.65% of all transactions contain both the UK and Australia. Its confidence of 64.03% means that 64.03% of the transactions containing the UK also contain Australia. The lift value of 1.01 suggests a mild positive association: Australia is slightly more likely to appear when the UK is present.

Further down the list, {Canada} => {USA} has a support of 40.31% and a confidence of 62.06%, so 62.06% of the transactions containing Canada also contain the USA. The lift value of 0.97 indicates a slight negative association: the USA is slightly less likely to appear when Canada is present.

Although the rule with the highest support pairs the UK with Australia in the largest share of transactions, its lift is barely above 1, so the association itself is weak. Likewise, the lift of 0.97 for {Canada} => {USA} points to an association only marginally weaker than independence would predict.
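
The metrics of the two top rules can be recomputed by hand from the contingency table shown earlier. The sketch below uses only counts copied from that table and explains why {UK} => {Australia} and {Australia} => {UK} share support and lift but differ in confidence.

```r
# Counts copied from the contingency table (ctab) shown earlier.
n           <- 970
n_uk        <- 631
n_australia <- 612
n_both      <- 404

support     <- n_both / n             # same for both directions
conf_uk_aus <- n_both / n_uk          # {UK} => {Australia}
conf_aus_uk <- n_both / n_australia   # {Australia} => {UK}
lift        <- support / ((n_uk / n) * (n_australia / n))

print(round(c(support = support, conf_uk_aus = conf_uk_aus,
              conf_aus_uk = conf_aus_uk, lift = lift), 7))
```

Confidence divides by the antecedent count, so it depends on the direction of the rule, while support and lift are symmetric in the two countries.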

inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))

##      lhs                         rhs         support   confidence lift     
## [1]  {Canada, UK}             => {Australia} 0.2680412 0.6483791  1.0276596
## [2]  {Australia, Canada, USA} => {UK}        0.1639175 0.6680672  1.0269813
## [3]  {Australia, Canada}      => {UK}        0.2680412 0.6666667  1.0248283
## [4]  {UK, USA}                => {Australia} 0.2587629 0.6419437  1.0174598
## [5]  {UK}                     => {Australia} 0.4164948 0.6402536  1.0147810
## [6]  {Australia}              => {UK}        0.4164948 0.6601307  1.0147810
## [7]  {Canada, UK, USA}        => {Australia} 0.1639175 0.6360000  1.0080392
## [8]  {Australia, USA}         => {UK}        0.2587629 0.6536458  1.0048121
## [9]  {Australia, UK}          => {Canada}    0.2680412 0.6435644  0.9908848
## [10] {USA}                    => {Australia} 0.3958763 0.6213592  0.9848341
##      itemset
## [1]  8      
## [2]  1      
## [3]  8      
## [4]  2      
## [5]  9      
## [6]  9      
## [7]  1      
## [8]  2      
## [9]  8      
## [10] 7

The rule with the highest lift, {Canada, UK} => {Australia}, has a lift of 1.03, suggesting that transactions containing Canada and the UK include Australia about 1.03 times as often as would be expected if these events were independent. The support of 26.80% indicates that all three countries appear together in approximately 26.80% of the transactions. Additionally, the confidence of 64.84% means that 64.84% of the transactions containing Canada and the UK also contain Australia.

By contrast, the rule {USA} => {Canada}, which does not make the top ten, has a lift of 0.97, indicating that Canada appears in transactions containing the USA 0.97 times as often as it would if these events were independent. The support of 40.31% signifies that both countries appear together in approximately 40.31% of transactions. Moreover, the confidence of 63.27% means that 63.27% of the transactions containing the USA also contain Canada.

Visualizing the rules

These visualizations assist in uncovering patterns in customer purchasing behavior based on their country identities, leading to actionable insights for marketing or product placement strategies.

plot(freq_rules, method = "matrix", measure = c("support", "confidence"), shading = "lift", interactive = FALSE)

## Warning in plot.rules(freq_rules, method = "matrix", measure = c("support", :
## The parameter interactive is deprecated. Use engine='interactive' instead.

## Itemsets in Antecedent (LHS)
##  [1] "{Australia,Canada,USA}" "{Canada,UK,USA}"        "{Canada,UK}"           
##  [4] "{UK,USA}"               "{Australia}"            "{Australia,Canada}"    
##  [7] "{UK}"                   "{Australia,UK}"         "{Australia,USA}"       
## [10] "{Canada}"               "{USA}"                  "{Australia,UK,USA}"    
## [13] "{Canada,USA}"           "{Australia,Canada,UK}" 
## Itemsets in Consequent (RHS)
## [1] "{USA}"       "{Canada}"    "{Australia}" "{UK}"

This matrix plot arranges the association rules in a grid of antecedent (LHS) itemsets against consequent (RHS) itemsets, which are listed in the output above. The color shading encodes the rule quality measures, giving a compact overview of how rule strength varies across combinations of antecedent and consequent.

plot(freq_rules, measure=c("support", "confidence"), shading="lift", interactive=FALSE, jitter =0)

## Warning in plot.rules(freq_rules, measure = c("support", "confidence"), : The
## parameter interactive is deprecated. Use engine='interactive' instead.

Here, we can visualize the relationship between the lift, the confidence, and the support of the rules. In the scatter plot, most rules with high lift values have relatively low to moderate support values. The rule with the highest confidence has the lowest support value and a high lift value. The rules with the lowest lift values generally have moderate to high support values and relatively low to moderate confidence values. When judging the importance of a rule, confidence and support are considered the most. The support value gives the probability that a transaction contains both A and B, whereas the confidence value measures the rule's precision, that is, how often B appears in transactions that contain A.

Comprehensive Association Analysis

In this section, we analyze the full set of association rules in freq_rules, as represented by the graph and the accompanying rule list. These rules were generated through Eclat followed by rule induction. The objective is to show the relationships and patterns within the transactional data, shedding light on the associations between the different countries.

plot(freq_rules, method="graph", control =list(type="items") )

## Warning: Unknown control parameters: type

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

inspect(freq_rules)

##      lhs                         rhs         support   confidence lift     
## [1]  {Canada, UK, USA}        => {Australia} 0.1639175 0.6360000  1.0080392
## [2]  {Australia, UK, USA}     => {Canada}    0.1639175 0.6334661  0.9753367
## [3]  {Australia, Canada, USA} => {UK}        0.1639175 0.6680672  1.0269813
## [4]  {Australia, Canada, UK}  => {USA}       0.1639175 0.6115385  0.9598581
## [5]  {UK, USA}                => {Australia} 0.2587629 0.6419437  1.0174598
## [6]  {Australia, USA}         => {UK}        0.2587629 0.6536458  1.0048121
## [7]  {Australia, UK}          => {USA}       0.2587629 0.6212871  0.9751594
## [8]  {Canada, USA}            => {Australia} 0.2453608 0.6086957  0.9647627
## [9]  {Australia, USA}         => {Canada}    0.2453608 0.6197917  0.9542824
## [10] {Australia, Canada}      => {USA}       0.2453608 0.6102564  0.9578458
## [11] {UK, USA}                => {Canada}    0.2577320 0.6393862  0.9844518
## [12] {Canada, USA}            => {UK}        0.2577320 0.6393862  0.9828916
## [13] {Canada, UK}             => {USA}       0.2577320 0.6234414  0.9785407
## [14] {USA}                    => {UK}        0.4030928 0.6326861  0.9725919
## [15] {UK}                     => {USA}       0.4030928 0.6196513  0.9725919
## [16] {USA}                    => {Canada}    0.4030928 0.6326861  0.9741357
## [17] {Canada}                 => {USA}       0.4030928 0.6206349  0.9741357
## [18] {USA}                    => {Australia} 0.3958763 0.6213592  0.9848341
## [19] {Australia}              => {USA}       0.3958763 0.6274510  0.9848341
## [20] {Canada, UK}             => {Australia} 0.2680412 0.6483791  1.0276596
## [21] {Australia, UK}          => {Canada}    0.2680412 0.6435644  0.9908848
## [22] {Australia, Canada}      => {UK}        0.2680412 0.6666667  1.0248283
## [23] {UK}                     => {Australia} 0.4164948 0.6402536  1.0147810
## [24] {Australia}              => {UK}        0.4164948 0.6601307  1.0147810
## [25] {Canada}                 => {Australia} 0.4020619 0.6190476  0.9811702
## [26] {Australia}              => {Canada}    0.4020619 0.6372549  0.9811702
## [27] {UK}                     => {Canada}    0.4134021 0.6354992  0.9784670
## [28] {Canada}                 => {UK}        0.4134021 0.6365079  0.9784670
##      itemset
## [1]   1     
## [2]   1     
## [3]   1     
## [4]   1     
## [5]   2     
## [6]   2     
## [7]   2     
## [8]   3     
## [9]   3     
## [10]  3     
## [11]  4     
## [12]  4     
## [13]  4     
## [14]  5     
## [15]  5     
## [16]  6     
## [17]  6     
## [18]  7     
## [19]  7     
## [20]  8     
## [21]  8     
## [22]  8     
## [23]  9     
## [24]  9     
## [25] 10     
## [26] 10     
## [27] 11     
## [28] 11

Inspecting the full list of association rules, we observe a diverse set of rules capturing associations involving Australia, Canada, the UK, and the USA. Each rule is characterized by its support, confidence, and lift values, providing quantitative insights into the strength and significance of the associations.

In the graph, nodes represent the items (here, countries) and the rules, while the edges connect each rule to the items it involves. The graph aids in understanding the network of relationships captured by the association rules.

Geographical Associations:

Rules such as {Canada, UK, USA} => {Australia} and {Australia, UK, USA} => {Canada} describe geographical associations: customers who purchase in three of the countries often purchase in the fourth as well (confidence of about 63%). However, their lift values of 1.01 and 0.98 show that this is close to what independence would predict.

Cross-Country Purchasing Patterns:

Rules like {UK} => {Australia} and {USA} => {Canada} describe cross-country purchasing patterns, indicating which countries commonly appear together in the same customer's purchases.

Negative Associations:

Some rules, with lift values slightly below 1, indicate negative associations. For instance, {UK} => {USA} and {Canada, UK} => {USA} suggest that customers purchasing from the UK or the combination of Canada and the UK are slightly less likely to include items from the USA compared to random chance.

UK RULES

In this section, we delve into the intricacies of association rules with a specific focus on the presence of the United Kingdom (UK) in transactions. By concentrating on the UK, we aim to uncover meaningful patterns and relationships between this specific market and other countries within our dataset.

# Apriori algorithm with appearance constraint for "UK"
UK_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.01, conf = 0.5),  # Adjust support and confidence as needed
  appearance = list(default = "lhs", rhs = "UK"),
  control = list(verbose = FALSE)
)

inspect(sort(UK_rules, by = 'lift'))

##     lhs                         rhs  support   confidence coverage  lift     
## [1] {Australia, Canada, USA} => {UK} 0.1639175 0.6680672  0.2453608 1.0269813
## [2] {Australia, Canada}      => {UK} 0.2680412 0.6666667  0.4020619 1.0248283
## [3] {Australia}              => {UK} 0.4164948 0.6601307  0.6309278 1.0147810
## [4] {Australia, USA}         => {UK} 0.2587629 0.6536458  0.3958763 1.0048121
## [5] {}                       => {UK} 0.6505155 0.6505155  1.0000000 1.0000000
## [6] {Canada, USA}            => {UK} 0.2577320 0.6393862  0.4030928 0.9828916
## [7] {Canada}                 => {UK} 0.4134021 0.6365079  0.6494845 0.9784670
## [8] {USA}                    => {UK} 0.4030928 0.6326861  0.6371134 0.9725919
##     count
## [1] 159  
## [2] 260  
## [3] 404  
## [4] 251  
## [5] 631  
## [6] 250  
## [7] 401  
## [8] 391

The rules are presented in the format where “lhs” (left-hand side) represents the antecedent (items present before the arrow), and “rhs” (right-hand side) represents the consequent (items after the arrow). They are sorted based on the lift value in descending order.

Rule 1 indicates that the combination of items from Australia, Canada, and the USA is associated with the presence of items from the UK. Its lift of 1.027, slightly above 1, suggests a weak positive association: customers buying from Australia, Canada, and the USA are about 2.7% more likely to include items from the UK than the baseline rate. Approximately 16.4% of all transactions contain all four countries (the support). Given the presence of items from Australia, Canada, and the USA, there is a 66.8% confidence that items from the UK are also present.
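The four measures for rule 1 can be rebuilt from raw counts. This is a sketch under one assumption: the total of 970 transactions is not stated in the output but is implied by count / support (for example 631 / 0.6505155). The count of 238 for {Australia, Canada, USA} is taken from the USA rules table.

```r
n_transactions <- 970   # assumption: implied by count / support
count_rule     <- 159   # transactions with Australia, Canada, USA and UK
count_lhs      <- 238   # transactions with Australia, Canada and USA
count_rhs      <- 631   # transactions with UK

support    <- count_rule / n_transactions   # share of all transactions matching the whole rule
coverage   <- count_lhs / n_transactions    # share of all transactions matching the antecedent
confidence <- count_rule / count_lhs        # share of antecedent transactions that also contain UK
lift       <- confidence / (count_rhs / n_transactions)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```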

CANADA RULES

In this segment, our attention turns to unraveling the intricate association rules involving the vibrant market of Canada. Utilizing the Apriori algorithm with an appearance constraint for “Canada,” we aim to discern meaningful patterns and relationships within the transactional data.

# Apriori algorithm with appearance constraint for "Canada"
Canada_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.01, conf = 0.5),  # Adjust support and confidence as needed
  appearance = list(default = "lhs", rhs = "Canada"),
  control = list(verbose = FALSE)
)

By employing parameters such as support and confidence, we seek to uncover rules that showcase the association between Canada and other items in our dataset.

inspect(sort(Canada_rules, by = 'lift'))

##     lhs                     rhs      support   confidence coverage  lift     
## [1] {}                   => {Canada} 0.6494845 0.6494845  1.0000000 1.0000000
## [2] {Australia, UK}      => {Canada} 0.2680412 0.6435644  0.4164948 0.9908848
## [3] {UK, USA}            => {Canada} 0.2577320 0.6393862  0.4030928 0.9844518
## [4] {Australia}          => {Canada} 0.4020619 0.6372549  0.6309278 0.9811702
## [5] {UK}                 => {Canada} 0.4134021 0.6354992  0.6505155 0.9784670
## [6] {Australia, UK, USA} => {Canada} 0.1639175 0.6334661  0.2587629 0.9753367
## [7] {USA}                => {Canada} 0.4030928 0.6326861  0.6371134 0.9741357
## [8] {Australia, USA}     => {Canada} 0.2453608 0.6197917  0.3958763 0.9542824
##     count
## [1] 630  
## [2] 260  
## [3] 250  
## [4] 390  
## [5] 401  
## [6] 159  
## [7] 391  
## [8] 238

plot(Canada_rules, method = "graph", control = list(layout = "stress", circular = FALSE))

As shown above, lift values slightly below 1 indicate a weak negative association. Customers buying from the UK and the USA are about 1.6% less likely to include items from Canada than the baseline rate, and customers buying from Australia and the UK are about 0.9% less likely.
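The percentages quoted here are the lift values expressed as a change relative to 1. A short sketch of that conversion, with lifts copied from the Canada rules table:

```r
# Lift values for two Canada rules, copied from the inspect() output
lift <- c("{Australia, UK}" = 0.9908848, "{UK, USA}" = 0.9844518)

# Percentage change in the likelihood of the consequent relative to its baseline
pct_change <- (lift - 1) * 100
round(pct_change, 1)
```

Negative values correspond to rules whose consequent is slightly less likely than its baseline rate.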

AUSTRALIA RULES

The appearance constraint ensures that our rules specifically involve the presence of Australia in transactions.

# Apriori algorithm with appearance constraint for "Australia"
Australia_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.01, conf = 0.5),  # Adjust support and confidence as needed
  appearance = list(default = "lhs", rhs = "Australia"),
  control = list(verbose = FALSE)
)

inspect(sort(Australia_rules, by = 'lift'))

##     lhs                  rhs         support   confidence coverage  lift     
## [1] {Canada, UK}      => {Australia} 0.2680412 0.6483791  0.4134021 1.0276596
## [2] {UK, USA}         => {Australia} 0.2587629 0.6419437  0.4030928 1.0174598
## [3] {UK}              => {Australia} 0.4164948 0.6402536  0.6505155 1.0147810
## [4] {Canada, UK, USA} => {Australia} 0.1639175 0.6360000  0.2577320 1.0080392
## [5] {}                => {Australia} 0.6309278 0.6309278  1.0000000 1.0000000
## [6] {USA}             => {Australia} 0.3958763 0.6213592  0.6371134 0.9848341
## [7] {Canada}          => {Australia} 0.4020619 0.6190476  0.6494845 0.9811702
## [8] {Canada, USA}     => {Australia} 0.2453608 0.6086957  0.4030928 0.9647627
##     count
## [1] 260  
## [2] 251  
## [3] 404  
## [4] 159  
## [5] 612  
## [6] 384  
## [7] 390  
## [8] 238

plot(Australia_rules, method = "graph", control = list(layout = "stress", circular = FALSE))

The lift values slightly above 1 indicate a positive association. Customers purchasing from {Canada, UK} are about 2.8% more likely to include items from Australia than the baseline rate, and customers purchasing from {UK, USA} are about 1.7% more likely. From rule 1, given the presence of items from Canada and the UK, there is a 64.8% confidence that items from Australia are also present; from rule 2, given items from the UK and the USA, the confidence is 64.2%.

Overall, rules 1 to 4 highlight weak positive associations between Australia and combinations involving the UK: their lift values above 1 indicate that these combinations co-occur slightly more often than expected by chance. Rules 6 to 8, which do not involve the UK, have lift below 1. Rule 5, with an empty left-hand side, is the baseline: items from Australia appear in 63.1% of all transactions.
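For a rule with an empty left-hand side, the measures reduce to the frequency of the consequent. The derivation below uses the rounded support from the table and assumes the usual convention that the empty itemset has support 1:

```latex
\[
\mathrm{conf}(\emptyset \Rightarrow \{\text{Australia}\})
  = \frac{\mathrm{supp}(\{\text{Australia}\})}{\mathrm{supp}(\emptyset)}
  = \frac{0.6309}{1} = 0.6309,
\qquad
\mathrm{lift}(\emptyset \Rightarrow \{\text{Australia}\})
  = \frac{\mathrm{conf}(\emptyset \Rightarrow \{\text{Australia}\})}{\mathrm{supp}(\{\text{Australia}\})}
  = 1
\]
```

This is why support and confidence coincide for such a rule and why its lift is exactly 1.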

USA RULES

By setting specific support and confidence thresholds, we aim to identify rules that highlight the relationships between the USA and other items in our dataset. The appearance constraint ensures that our rules specifically involve the presence of the USA in transactions.

# Apriori algorithm with appearance constraint for "USA"
USA_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.01, conf = 0.5),  # Adjust support and confidence as needed
  appearance = list(default = "lhs", rhs = "USA"),
  control = list(verbose = FALSE)
)

inspect(sort(USA_rules, by = 'lift'))

##     lhs                        rhs   support   confidence coverage  lift     
## [1] {}                      => {USA} 0.6371134 0.6371134  1.0000000 1.0000000
## [2] {Australia}             => {USA} 0.3958763 0.6274510  0.6309278 0.9848341
## [3] {Canada, UK}            => {USA} 0.2577320 0.6234414  0.4134021 0.9785407
## [4] {Australia, UK}         => {USA} 0.2587629 0.6212871  0.4164948 0.9751594
## [5] {Canada}                => {USA} 0.4030928 0.6206349  0.6494845 0.9741357
## [6] {UK}                    => {USA} 0.4030928 0.6196513  0.6505155 0.9725919
## [7] {Australia, Canada, UK} => {USA} 0.1639175 0.6115385  0.2680412 0.9598581
## [8] {Australia, Canada}     => {USA} 0.2453608 0.6102564  0.4020619 0.9578458
##     count
## [1] 618  
## [2] 384  
## [3] 250  
## [4] 251  
## [5] 391  
## [6] 391  
## [7] 159  
## [8] 238

plot(USA_rules, method = "graph", control = list(layout = "stress", circular = FALSE))

From rule 2, around 39.6% of transactions involve items from Australia along with items from the USA. From rule 3, approximately 25.8% of transactions involve items from Canada and the UK along with items from the USA. Overall, the rules highlight weak negative associations between the USA and combinations of items from Australia, Canada, and the UK: the lift values below 1 indicate that these combinations co-occur slightly less often than expected by chance. Rule 1, with an empty left-hand side, is the baseline: items from the USA appear in 63.7% of all transactions.

Dissimilarity Measures

Dissimilarity is a numerical measure of how different two data items are. When the dissimilarity is low, the items under observation are similar; when it is high, they are different.

To measure dissimilarity, the Jaccard index is used, where A and B are the sets of transactions containing each item. The dissimilarity is one minus the index:

J(A,B) = |A ∩ B| / |A ∪ B|,   d(A,B) = 1 − J(A,B)

trans.sel <- trans[, itemFrequency(trans) > 0.5]
jac <- dissimilarity(trans.sel, which = "items")
round(jac, digits = 3)

##        Australia Canada    UK
## Canada     0.542             
## UK         0.518  0.534      
## USA        0.546  0.544 0.544

plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")

Canada vs. Australia (0.542): Among the transactions that contain Canada or Australia, 54.2% contain only one of the two; the remaining 45.8% contain both.
UK vs. Australia (0.518): Among the transactions that contain the UK or Australia, 51.8% contain only one of the two. This is the lowest dissimilarity in the table, so the UK and Australia are the most similar pair.
USA vs. Canada (0.544): Among the transactions that contain the USA or Canada, 54.4% contain only one of the two.
USA vs. UK (0.544): Among the transactions that contain the USA or the UK, 54.4% contain only one of the two.
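To make the measure concrete, here is a minimal base-R sketch of the Jaccard dissimilarity between two items. The transaction IDs are made up for illustration and do not come from the data set.

```r
# Hypothetical transaction IDs in which each country appears (illustrative only)
canada_tx    <- c(1, 2, 3, 5, 8)
australia_tx <- c(2, 3, 4, 8, 9)

both   <- length(intersect(canada_tx, australia_tx))  # transactions containing both countries
either <- length(union(canada_tx, australia_tx))      # transactions containing at least one

jaccard_index         <- both / either
jaccard_dissimilarity <- 1 - jaccard_index
round(jaccard_dissimilarity, 3)
```

Here 3 of the 7 transactions contain both countries, so the dissimilarity is 4/7: the share of those transactions that contain only one of the two.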

Affinity Measure

Affinity measures the strength of association. In this case, the higher the affinity value, the higher the probability that two items (here, countries) are bought together or share common purchasing patterns.

Calculated as:

A(i,j) = supp(i,j) / (supp(i) + supp(j) − supp(i,j))

a <- affinity(trans.sel)
round(a, digits = 3)

## An object of class "ar_similarity"
##           Australia Canada    UK   USA
## Australia     0.000  0.458 0.482 0.454
## Canada        0.458  0.000 0.466 0.456
## UK            0.482  0.466 0.000 0.456
## USA           0.454  0.456 0.456 0.000
## Slot "method":
## [1] "Affinity"
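As a check on the formula, the UK and Australia entry can be recomputed from supports already reported in the rule tables. This is a sketch with illustrative variable names; it also shows that affinity and Jaccard dissimilarity sum to 1 for the same pair.

```r
# Supports copied from the rule tables in this report
supp_aus    <- 0.6309278  # support of {Australia}
supp_uk     <- 0.6505155  # support of {UK}
supp_aus_uk <- 0.4164948  # support of {Australia, UK}

affinity_aus_uk <- supp_aus_uk / (supp_aus + supp_uk - supp_aus_uk)
round(affinity_aus_uk, 3)      # compare with the affinity matrix entry
round(1 - affinity_aus_uk, 3)  # compare with the Jaccard dissimilarity entry
```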

All pairwise affinities are below 0.5, so no pair stands out strongly. The pairs most likely to be purchased together are UK and Australia (0.482), followed by UK and Canada (0.466); the remaining pairs lie between 0.454 and 0.458.

Assessing Redundancy

The goal of assessing redundancy is to identify and eliminate rules that provide essentially the same information. Below, I check each rule set for redundancy. This identifies rules that are subsumed by others or provide similar information, allowing me to focus on the most relevant rules.
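The idea behind the check can be sketched in base R. The assumption here is the commonly used criterion that a rule is redundant when a more general rule (an antecedent that is a proper subset, same consequent) has confidence at least as high; the exact behaviour of is.redundant() may differ in details. Antecedents and confidences are copied from the sorted UK rules table, so the indices follow that table rather than the internal rule order.

```r
# UK rules as printed by inspect(sort(UK_rules, by = 'lift'))
lhs <- list(
  c("Australia", "Canada", "USA"),
  c("Australia", "Canada"),
  c("Australia"),
  c("Australia", "USA"),
  character(0),
  c("Canada", "USA"),
  c("Canada"),
  c("USA")
)
conf <- c(0.6680672, 0.6666667, 0.6601307, 0.6536458,
          0.6505155, 0.6393862, 0.6365079, 0.6326861)

# TRUE if `a` is a proper subset of `b`
is_proper_subset <- function(a, b) all(a %in% b) && length(a) < length(b)

# Flag a rule when some more general rule predicts the consequent at least as well
redundant <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    is_proper_subset(lhs[[j]], lhs[[i]]) && conf[j] >= conf[i]
  }, logical(1)))
}, logical(1))

which(!redundant)  # positions of the rules that survive
```

Under this criterion four UK rules survive, which agrees with the size of the non-redundant UK rule set reported further down.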

is_redundant_uk <- is.redundant(UK_rules)
is_redundant_uk

## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE

supporting_transactions_uk <- supportingTransactions(UK_rules, trans)



is_redundant_canada <- is.redundant(Canada_rules)
is_redundant_canada

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

supporting_transactions_canada <- supportingTransactions(Canada_rules, trans)



is_redundant_australia <- is.redundant(Australia_rules)
is_redundant_australia

## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE

supporting_transactions_australia <- supportingTransactions(Australia_rules, trans)




is_redundant_usa <- is.redundant(USA_rules)
is_redundant_usa

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

supporting_transactions_usa <- supportingTransactions(USA_rules, trans)

Having identified the redundant rules, the next step is to remove them so that only the most important ones remain.

non_redundant_uk_rules <-UK_rules[!is_redundant_uk]
non_redundant_uk_rules

## set of 4 rules

non_redundant_usa_rules <- USA_rules[!is_redundant_usa]
non_redundant_usa_rules

## set of 1 rules

non_redundant_canada_rules <- Canada_rules[!is_redundant_canada]
non_redundant_canada_rules

## set of 1 rules

non_redundant_australia_rules <- Australia_rules[!is_redundant_australia]
non_redundant_australia_rules

## set of 4 rules

all_rules <- c(non_redundant_uk_rules, non_redundant_canada_rules, non_redundant_australia_rules, non_redundant_usa_rules)


inspect(sort(all_rules, by = 'lift')[1:10])

##      lhs                         rhs         support   confidence coverage 
## [1]  {Canada, UK}             => {Australia} 0.2680412 0.6483791  0.4134021
## [2]  {Australia, Canada, USA} => {UK}        0.1639175 0.6680672  0.2453608
## [3]  {Australia, Canada}      => {UK}        0.2680412 0.6666667  0.4020619
## [4]  {UK, USA}                => {Australia} 0.2587629 0.6419437  0.4030928
## [5]  {Australia}              => {UK}        0.4164948 0.6601307  0.6309278
## [6]  {UK}                     => {Australia} 0.4164948 0.6402536  0.6505155
## [7]  {}                       => {UK}        0.6505155 0.6505155  1.0000000
## [8]  {}                       => {Canada}    0.6494845 0.6494845  1.0000000
## [9]  {}                       => {Australia} 0.6309278 0.6309278  1.0000000
## [10] {}                       => {USA}       0.6371134 0.6371134  1.0000000
##      lift     count
## [1]  1.027660 260  
## [2]  1.026981 159  
## [3]  1.024828 260  
## [4]  1.017460 251  
## [5]  1.014781 404  
## [6]  1.014781 404  
## [7]  1.000000 631  
## [8]  1.000000 630  
## [9]  1.000000 612  
## [10] 1.000000 618

plot(all_rules, method="paracoord", control=list(reorder=TRUE))

As shown above, each of these rules is unique, so we are left with the 10 most useful association rules. They still support the finding that customers from Australia, Canada, and the USA have a high likelihood of also purchasing items from the UK (confidence of about 67% for rule 2), although lifts of at most 1.028 show that the effect over the baseline is small.

Conclusion

The analysis employed association rule mining techniques, namely the Apriori algorithm, the Eclat algorithm, and the affinity matrix, to extract meaningful insights. It uncovered association rules that provide valuable insights into customer purchasing patterns. Notably, we identified associations between certain countries and product categories, revealing interesting cross-border shopping trends.

Customers from Australia, Canada, and the USA exhibited a high likelihood of purchasing items from the UK, indicating potential opportunities for targeted marketing or product bundling.

The affinity matrix illuminated the similarity between countries based on customer transactions. Notably, the matrix revealed distinct patterns, such as higher affinity between the UK and Australia.

Overall, the redundancy analysis supports the finding that the UK is the main market point for all of these countries.

associationsProject

cynthia T.M Nyahoda

2024-01-28