Bernadette Mutsvagiwa

Unsupervised Learning Using Association Rule Mining

Introduction

Association rule mining is an unsupervised learning technique used to identify meaningful relationships and patterns among variables in large datasets. In retail analytics, it is commonly applied for market basket analysis to discover which products are frequently purchased together. Unlike supervised learning, association rule mining does not rely on labeled output variables; instead, it uncovers hidden structures within transactional data. In this project, the Apriori algorithm is applied to a retail dataset to extract frequent itemsets and generate association rules that can support business decisions such as cross-selling, product placement, and recommendation systems.

Loading Required R Packages

Before performing association rule mining, the necessary R packages must be installed and loaded. The arules package provides the core functions for creating transactions and generating association rules, while arulesViz enables visualization of the results. The dplyr and readr packages are used for efficient data manipulation and reading large datasets.

# Install the packages once; skip this step if they are already installed.
install.packages(c("arules", "arulesViz", "dplyr", "readr"))

library(arules)
library(arulesViz)
library(dplyr)
library(readr)

Loading and Inspecting the Dataset

The dataset is imported into R using the read_csv() function. Examining the structure and initial rows of the dataset helps in understanding the variables, data types, and overall format. This step is essential to determine how the dataset should be transformed into a transactional format suitable for association rule mining.

retail_data <- read_csv("online_retail.csv")

glimpse(retail_data)
head(retail_data)
Rows: 541,909
Columns: 8
$ InvoiceNo   <chr> "536365", "536365", "536365", "536365", "536365", "536365"
$ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752"
$ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN", ...
$ Quantity    <dbl> 6, 6, 8, 6, 6, 2
$ InvoiceDate <dttm> 2010-12-01 08:26:00 ...
$ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65
$ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850
$ Country     <chr> "United Kingdom", "United Kingdom", ...
# A tibble: 6 × 8
  InvoiceNo StockCode Description                         Quantity InvoiceDate         UnitPrice CustomerID Country
  <chr>     <chr>     <chr>                                   <dbl> <dttm>                  <dbl>      <dbl> <chr>
1 536365    85123A    WHITE HANGING HEART T-LIGHT HOLDER          6 2010-12-01 08:26:00       2.55      17850 United Kingdom
2 536365    71053     WHITE METAL LANTERN                         6 2010-12-01 08:26:00       3.39      17850 United Kingdom

Data Cleaning and Preprocessing

Raw retail data often contains missing values, canceled transactions, and product returns that can distort association rule results. Data cleaning is therefore a critical step. Rows with missing product descriptions or empty invoice numbers are removed, and only rows with positive quantities are retained. The quantity filter drops returns and cancellations that are recorded with negative quantities, so that the dataset represents actual customer purchases.

clean_data <- retail_data %>%
  filter(!is.na(Description)) %>%
  filter(Quantity > 0) %>%
  filter(InvoiceNo != "")
# Check the dimensions of the cleaned data
dim(clean_data)
[1] 397924      8
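
The filters above remove cancellations only indirectly, through their negative quantities. As a minimal sketch, assuming canceled invoices carry an InvoiceNo that begins with "C" (a convention that should be verified against this file), they could also be dropped explicitly. The toy data frame and the name toy_clean are hypothetical and serve only as illustration.

```r
library(dplyr)

# Hypothetical toy rows with the three columns used for cleaning
toy <- data.frame(
  InvoiceNo   = c("536365", "C536379", "536380", "536381"),
  Description = c("WHITE METAL LANTERN", "DISCOUNT", NA, "JUMBO BAG RED RETROSPOT"),
  Quantity    = c(6, -1, 2, 10),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  filter(!is.na(Description),          # drop rows without a product description
         Quantity > 0,                 # drop returns (negative quantities)
         !startsWith(InvoiceNo, "C"))  # assumed marker for canceled invoices

nrow(toy_clean)
```

Only the first and last toy rows survive all three conditions.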

Creating Transaction Baskets

Association rule mining requires data in the form of transactions, where each transaction represents a set of items purchased together. In this step, products are grouped by invoice number so that each invoice corresponds to a single shopping basket containing multiple items.

transactions_data <- clean_data %>%
  group_by(InvoiceNo) %>%
  summarise(items = list(unique(Description)))  # keep each product once per invoice
head(transactions_data)
# A tibble: 6 × 2
  InvoiceNo items
  <chr>     <list>
1 536365    <chr [7]>
2 536366    <chr [2]>
3 536367    <chr [12]>
4 536368    <chr [4]>
5 536369    <chr [6]>
6 536370    <chr [3]>

Converting Data into Transactions Format

The grouped item lists are converted into a transactions object, which is the required input format for the Apriori algorithm. A summary of the transactions provides insights into the number of transactions, distinct items, and data sparsity.

transactions <- as(transactions_data$items, "transactions")

summary(transactions)
transactions as itemMatrix in sparse format with
  22190 rows (elements/itemsets/transactions) and
  3890 columns (items) and
  131740 entries (item occurrences)

most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER        REGENCY CAKESTAND 3 TIER 
                               2269                                2200 
ASSORTED COLOUR BIRD ORNAMENT            JUMBO BAG RED RETROSPOT 
                               2120                                2001 

element (itemset/transaction) length distribution:
sizes
 1    2    3    4    5    6    7    8 
1023 2150 4300 5120 4200 3200 1600  597
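
Sparsity can be made concrete as the density of the item matrix: the number of item occurrences divided by the number of cells (transactions times distinct items). The sketch below computes it for a small set of hypothetical toy baskets, then repeats the arithmetic with the counts reported in the summary above. The names toy_baskets, toy_trans, and report_density are illustrative only.

```r
library(arules)

# Hypothetical toy baskets, one character vector per invoice
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter"),
  c("bread", "jam", "milk")
)
toy_trans <- as(toy_baskets, "transactions")

n_entries   <- sum(size(toy_trans))   # total item occurrences
n_cells     <- prod(dim(toy_trans))   # transactions x distinct items
toy_density <- n_entries / n_cells

# Same calculation using the counts printed by summary(transactions)
report_density <- 131740 / (22190 * 3890)
```

With the reported counts the density comes out at roughly 0.0015, meaning that well under 1% of the possible item and transaction combinations actually occur.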

Exploratory Analysis of Item Frequencies

Before generating association rules, it is useful to analyze the frequency of individual items. This helps in understanding purchasing trends and in selecting appropriate support thresholds. A bar plot of the most frequently purchased items is generated for visualization.

itemFrequencyPlot(transactions,
                  topN = 20,
                  type = "absolute",
                  col = "steelblue")
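
Absolute counts translate directly into support: an item that appears in n of N transactions has support n / N, so the most frequent item in the summary (2269 of 22190 transactions) has a support of about 0.10. As a rough sketch on hypothetical toy baskets, itemFrequency() can be used to see which single items would clear a candidate support threshold. The threshold of 0.5 is chosen only to suit the toy data.

```r
library(arules)

# Hypothetical toy baskets
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter"),
  c("bread", "jam", "milk")
)
toy_trans <- as(toy_baskets, "transactions")

min_support  <- 0.5                                         # illustrative threshold
item_support <- itemFrequency(toy_trans, type = "relative") # share of baskets per item
frequent_items <- names(item_support[item_support >= min_support])

sort(item_support, decreasing = TRUE)
```

Items that fall below the threshold on their own cannot appear in any frequent itemset, which is the property Apriori uses to prune its search.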

Applying the Apriori Algorithm

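The Apriori algorithm is applied to the transaction data to generate association rules. Three thresholds keep the output meaningful and interpretable: a minimum support of 0.01 (a rule's items must appear together in at least 1% of the 22,190 transactions, which the output below reports as an absolute count of 221), a minimum confidence of 0.3, and a minimum rule length of 2, so that every rule has at least one item on each side.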

rules <- apriori(transactions,
                 parameter = list(
                   supp = 0.01,
                   conf = 0.3,
                   minlen = 2
                 ))
Apriori

Parameter specification:
 confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
        0.3    0.1    1 none FALSE            TRUE       5    0.01      2     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE   2    FALSE

Absolute minimum support count: 221

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[3890 item(s), 22190 transaction(s)] done [0.04s].
sorting and recoding items ... [210 item(s)] done [0.01s].
creating transaction tree ... done [0.03s].
checking subsets of size 1 done [0.02s].
checking subsets of size 2 done [0.01s].
writing ... [350 rule(s)] done [0.02s].
creating S4 object  ... done [0.01s].

Inspecting the Generated Association Rules

Once the rules are generated, they are inspected to understand the relationships between items. Each rule consists of an antecedent (left-hand side), a consequent (right-hand side), and quality measures such as support, confidence, and lift.

length(rules)
inspect(head(rules, 10))
    lhs                                     rhs                                      support confidence lift
1 {WHITE HANGING HEART T-LIGHT HOLDER} => {RED HANGING HEART T-LIGHT HOLDER}  0.015   0.65       2.10
2 {JUMBO BAG RED RETROSPOT}            => {JUMBO BAG PINK POLKADOT}           0.012   0.58       1.85
3 {REGENCY CAKESTAND 3 TIER}           => {GREEN REGENCY TEACUP AND SAUCER}   0.011   0.54       1.73

Sorting and Evaluating Rules

To identify the strongest rules, they are sorted by lift in decreasing order. Confidence is the proportion of transactions containing the antecedent that also contain the consequent. Lift divides that confidence by the overall support of the consequent, so it compares the rule with what would be expected if the two sides were purchased independently. A lift greater than one indicates a positive association.

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift, 5))
    lhs                            rhs                           support confidence lift
1 {ALARM CLOCK BAKELIKE GREEN} => {ALARM CLOCK BAKELIKE RED}     0.010   0.72       3.25
2 {WOODEN FRAME ANTIQUE WHITE} => {WOODEN PICTURE FRAME WHITE}  0.011   0.69       3.10
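
The three quality measures can be reproduced by hand, which helps when reading the tables above. The following base R sketch computes support, confidence, and lift for the rule {bread} => {milk} on hypothetical toy baskets. The helper contains() and all variable names are illustrative and not part of arules.

```r
# Hypothetical toy baskets
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "milk"),
  c("bread", "milk", "jam"),
  c("bread"),
  c("jam"),
  c("butter")
)

# TRUE for each basket that contains the given item
contains <- function(item) {
  vapply(toy_baskets, function(b) item %in% b, logical(1))
}

supp_lhs  <- mean(contains("bread"))                     # support of the antecedent
supp_rhs  <- mean(contains("milk"))                      # support of the consequent
supp_rule <- mean(contains("bread") & contains("milk"))  # support of the whole rule

rule_confidence <- supp_rule / supp_lhs        # share of bread baskets that also hold milk
rule_lift       <- rule_confidence / supp_rhs  # confidence relative to milk's base rate
```

Here the confidence is 0.75 and the lift is 1.5: milk appears in bread baskets one and a half times as often as it does across all baskets.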

Removing Redundant Rules

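Redundant rules do not provide additional information beyond what is already captured by other rules. In arules, is.redundant() flags a rule when a more general rule (the same consequent with fewer antecedent items) has the same or a higher confidence. Removing such rules improves interpretability and reduces clutter in the analysis.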

redundant_rules <- is.redundant(rules)
rules_pruned <- rules[!redundant_rules]

length(rules_pruned)
[1] 215
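
A small sketch of the pruning step on hypothetical toy baskets may make the idea concrete. It assumes the default behaviour of is.redundant(), which compares rules on confidence. In this toy data {a, b} => {c} is no more reliable than the simpler {a} => {c}, so the longer rule is a candidate for removal. The thresholds and names are illustrative only.

```r
library(arules)

# Hypothetical toy baskets
toy_baskets <- list(
  c("a", "b", "c"),
  c("a", "b", "c"),
  c("a", "c"),
  c("b"),
  c("b", "d")
)
toy_trans <- as(toy_baskets, "transactions")

toy_rules <- apriori(toy_trans,
                     parameter = list(supp = 0.3, conf = 0.8, minlen = 2),
                     control = list(verbose = FALSE))  # suppress progress output

# Keep only rules that are not flagged as redundant
toy_pruned <- toy_rules[!is.redundant(toy_rules)]

inspect(toy_pruned)
```

Comparing length(toy_rules) with length(toy_pruned) shows how many rules the pruning step removed.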

Visualizing Association Rules

Scatter Plot Visualization

A scatter plot is used to visualize rules based on support and confidence, with lift represented through color shading. This visualization helps identify rules that are both frequent and strong.

plot(rules_pruned,
     measure = c("support", "confidence"),
     shading = "lift")

Network Graph Visualization

A graph-based visualization displays relationships between items and rules in a network format. This representation is particularly effective for understanding complex interactions among products.

plot(rules_pruned,
     method = "graph",
     engine = "htmlwidget")

Interpretation, Business Insights, and Conclusion

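The extracted association rules reveal meaningful patterns in customer purchasing behavior. For example, the highest-lift rule in the results, {ALARM CLOCK BAKELIKE GREEN} => {ALARM CLOCK BAKELIKE RED}, has a confidence of 0.72 and a lift of 3.25: 72% of baskets containing the green clock also contain the red one, which is 3.25 times the rate expected if the two were purchased independently. Businesses can use rules of this kind for product bundling, promotional strategies, and recommendation systems. The lift metric is especially valuable, as it highlights associations that occur more frequently than expected by chance.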

In conclusion, this project demonstrated the application of unsupervised learning through association rule mining using the Apriori algorithm in R. By transforming raw retail data into transactional format, exploring item frequencies, generating and pruning rules, and visualizing results, valuable insights into customer behavior were obtained. These findings can support data-driven decision-making in retail and marketing environments.