Association Rules: Market Basket Analysis

Introduction

Association rules are a rule-based machine learning technique for discovering interesting patterns or relationships in large transactional datasets, such as retail purchase records, website clickstream data or medical records. In particular, association rule mining identifies frequent itemsets and generates rules that indicate how likely items are to co-occur within a dataset.
An association rule expresses a relationship between two or more items: if one or more items are present in a transaction, the other item(s) are likely to be present in that same transaction.
This project focuses on market basket analysis, with the aim of understanding customer preferences and supporting profit maximization.
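The likelihood of co-occurrence described above is usually quantified with three measures: support, confidence and lift. The following is a minimal base R sketch of how they can be computed by hand. The baskets and the helper function `support()` are hypothetical and exist only for illustration; they are not taken from the project data or from any package.

```r
# Toy baskets (hypothetical data, not taken from DataSetA.csv)
baskets <- list(
  c("Bread", "Butter", "Milk"),
  c("Bread", "Butter"),
  c("Milk", "Sugar"),
  c("Bread", "Milk"),
  c("Butter", "Sugar")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Measures for the rule {Bread} => {Butter}
supp_rule <- support(c("Bread", "Butter"), baskets)  # share of baskets with both items
conf_rule <- supp_rule / support("Bread", baskets)   # share of Bread baskets that also hold Butter
lift_rule <- conf_rule / support("Butter", baskets)  # confidence relative to Butter's overall share

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift of 1 would mean that the two items occur together exactly as often as expected if they were independent; values above 1 indicate a positive association.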

Required Libraries

library(arules)
library(arulesViz)
library(kableExtra)
library(ggplot2)

Dataset description

The dataset consists of 12,525 transactions. The number of items in each transaction depends on what the customer purchased. In this project we use the Apriori algorithm to determine how frequently items are bought together, in order to understand customer preferences. The twelve items are Bread, Butter, Cheese, Coffee Powder, Ghee, Lassi, Milk, Panner, Sugar, Sweet, Tea Powder and Yougurt (item names are spelled as they appear in the data).

Analysis and Manipulation

Data loading and Summary

The data are transactional: each row records the items one customer purchased on a single occasion. They are therefore loaded with read.transactions.

data <- read.transactions("DataSetA.csv", header = TRUE, sep = ",")

summary(data)

## transactions as itemMatrix in sparse format with
##  12525 rows (elements/itemsets/transactions) and
##  12 columns (items) and a density of 0.4371723 
## 
## most frequent items:
##          Milk          Ghee Coffee Powder       Yougurt         Bread 
##          5526          5509          5508          5502          5484 
##       (Other) 
##         38178 
## 
## element (itemset/transaction) length distribution:
## sizes
##    2    3    4    5    6    7    8    9   10   11 
## 1402 1592 1666 1947 2124 1998 1266  438   84    8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   5.246   7.000  11.000 
## 
## includes extended item information - examples:
##   labels
## 1  Bread
## 2 Butter
## 3 Cheese

str(data)

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:65707] 3 4 1 2 5 10 0 1 2 3 ...
##   .. .. ..@ p       : int [1:12526] 0 2 6 12 18 24 26 29 35 37 ...
##   .. .. ..@ Dim     : int [1:2] 12 12525
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  12 obs. of  1 variable:
##   .. ..$ labels: chr [1:12] "Bread" "Butter" "Cheese" "Coffee Powder" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables
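The mean basket size and the density reported by summary() can be cross-checked against the size distribution with a few lines of arithmetic. This is only a sanity-check sketch in base R; the variable names are illustrative, and the numbers are copied from the output above.

```r
# Basket-size distribution copied from the summary() output above
sizes  <- 2:11
counts <- c(1402, 1592, 1666, 1947, 2124, 1998, 1266, 438, 84, 8)

n_transactions <- sum(counts)          # number of baskets
n_entries      <- sum(sizes * counts)  # total number of purchased items
n_items        <- 12                   # number of distinct items (columns)

# Mean basket size and share of filled cells in the transaction-by-item matrix
mean_size <- n_entries / n_transactions
density   <- n_entries / (n_transactions * n_items)

print(c(mean_size = round(mean_size, 3), density = round(density, 7)))
```

The total of 65,707 entries agrees with the length of the `i` slot shown by str(data), so the three outputs are consistent with each other.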

Data Inspection

inspect(data)

The output of the inspection is not shown because it runs to 12,525 rows and would distract from the analysis.

Check for Frequencies

length(data)

## [1] 12525

itemFrequency(data, type="relative")

##         Bread        Butter        Cheese Coffee Powder          Ghee 
##     0.4378443     0.4375250     0.4371257     0.4397605     0.4398403 
##         Lassi          Milk        Panner         Sugar         Sweet 
##     0.4336128     0.4411976     0.4346507     0.4376846     0.4377645 
##    Tea Powder       Yougurt 
##     0.4297804     0.4392814

itemFrequency(data, type="absolute")

##         Bread        Butter        Cheese Coffee Powder          Ghee 
##          5484          5480          5475          5508          5509 
##         Lassi          Milk        Panner         Sugar         Sweet 
##          5431          5526          5444          5482          5483 
##    Tea Powder       Yougurt 
##          5383          5502

Apriori Algorithm

The Apriori algorithm is a data mining technique for mining association rules from transactional databases. Its main goal is to find relationships among the items in a dataset.
The algorithm works by first identifying frequent itemsets in the dataset. An itemset is considered frequent if it appears in at least a minimum share of the transactions, known as the support threshold. The algorithm then generates association rules from these frequent itemsets, keeping only those that meet a minimum confidence threshold.
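The level-wise search for frequent itemsets can be sketched in base R as follows. This is a simplified illustration on hypothetical toy baskets: it grows itemsets one item at a time and keeps those that meet the support threshold, but it omits the subset-pruning optimisation of the full algorithm and is not how arules implements it. All names (`baskets`, `support`, `all_frequent`) are assumptions made for the example.

```r
# Toy baskets (hypothetical, for illustration only)
baskets <- list(
  c("Bread", "Butter", "Milk"),
  c("Bread", "Butter"),
  c("Bread", "Milk"),
  c("Butter", "Milk"),
  c("Bread", "Butter", "Milk", "Sugar")
)
min_support <- 0.5

# Fraction of baskets containing all items of an itemset
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support(s) >= min_support, as.list(items))
all_frequent <- frequent

# Level k: join frequent (k-1)-itemsets, keep candidates that meet min_support
while (length(frequent) > 1) {
  k <- length(frequent[[1]]) + 1
  pairs <- combn(length(frequent), 2, simplify = FALSE)
  candidates <- lapply(pairs, function(p) {
    sort(union(frequent[[p[1]]], frequent[[p[2]]]))
  })
  candidates <- unique(Filter(function(s) length(s) == k, candidates))
  frequent <- Filter(function(s) support(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
}

# Print every frequent itemset with its support
for (s in all_frequent) {
  cat(sprintf("{%s} support = %.1f\n", paste(s, collapse = ", "), support(s)))
}
```

The search stops as soon as a level yields fewer than two frequent itemsets, because no larger candidate can then be formed by joining.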

Rules <- apriori(data, parameter = list(minlen = 2, conf = .4, supp = 0.15))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.15      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1878 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[12 item(s), 12525 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [132 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

In the above code, each rule must contain at least two items (minlen = 2). The minimum confidence of 0.4 means that the consequent must appear in at least 40% of the transactions that contain the antecedent, and the minimum support of 0.15 means that the items of a rule must appear together in at least 15% of all transactions. With these thresholds, 132 rules are generated.
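The thresholds can be related to the figures in the log with a little arithmetic. The sketch below is only a consistency check in base R with illustrative variable names; it does not reproduce how arules computes or rounds these values internally.

```r
n_transactions <- 12525
min_support    <- 0.15
n_items        <- 12

# Support threshold expressed as a number of transactions
# (the log reports "Absolute minimum support count: 1878")
min_count_exact <- min_support * n_transactions

# Number of ordered item pairs, i.e. possible one-to-one rules A => B
n_pair_rules <- n_items * (n_items - 1)

print(c(min_count_exact = min_count_exact, n_pair_rules = n_pair_rules))
```

The number of ordered pairs equals the number of rules written by apriori, which is consistent with one rule per ordered pair of items, although the log alone does not prove this.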

Next, a redundancy check is performed. Redundancy refers to the situation where multiple rules convey the same information or provide similar insights. Redundant rules can clutter the output of the algorithm and make the results difficult to interpret.
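One common way to make redundancy precise is to call a rule redundant when a more general rule (same consequent, fewer items in the antecedent) has the same or a higher confidence. The base R sketch below applies that definition to three hypothetical rules. The rules, the confidence values and the variable names are invented for illustration, and the code is not the arules implementation.

```r
# Hypothetical rules, all with the consequent (rhs) "Milk"
lhs        <- list("Bread", c("Bread", "Butter"), c("Bread", "Sugar"))
rhs        <- c("Milk", "Milk", "Milk")
confidence <- c(0.50, 0.48, 0.60)

# Rule i is redundant if some rule j with the same rhs has an lhs that is a
# proper subset of rule i's lhs and a confidence at least as high
is_redundant <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    j != i &&
      rhs[j] == rhs[i] &&
      length(lhs[[j]]) < length(lhs[[i]]) &&
      all(lhs[[j]] %in% lhs[[i]]) &&
      confidence[j] >= confidence[i]
  }, logical(1)))
}, logical(1))

print(is_redundant)
```

Here only the second rule is flagged: adding Butter to the antecedent does not improve on the confidence of {Bread} => {Milk}, whereas adding Sugar does.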

Redundancy Check

redundant <- is.redundant(Rules, measure="confidence")
which(redundant)

## integer(0)

Since no redundant rules are found, the next step is to plot the rules.

Visualization

plot(Rules, measure=c("support","lift"), shading="confidence")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(Rules, method="grouped")

plot(Rules, method="graph")

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

As seen in the graph, there are too many rules to visualize properly. The rules are therefore inspected directly to find items that rarely appear among the strongest rules, and the whole transactional dataset is then searched for rules involving such an item.

Rules Inspection

inspect(head(sort(Rules, by = "confidence", decreasing = TRUE), 20))

##      lhs                rhs             support   confidence coverage  lift    
## [1]  {Lassi}         => {Sweet}         0.2056687 0.4743141  0.4336128 1.083492
## [2]  {Sweet}         => {Lassi}         0.2056687 0.4698158  0.4377645 1.083492
## [3]  {Butter}        => {Sugar}         0.2052695 0.4691606  0.4375250 1.071915
## [4]  {Sugar}         => {Butter}        0.2052695 0.4689894  0.4376846 1.071915
## [5]  {Panner}        => {Bread}         0.2035928 0.4684056  0.4346507 1.069799
## [6]  {Coffee Powder} => {Ghee}          0.2057485 0.4678649  0.4397605 1.063715
## [7]  {Ghee}          => {Coffee Powder} 0.2057485 0.4677800  0.4398403 1.063715
## [8]  {Sugar}         => {Milk}          0.2046307 0.4675301  0.4376846 1.059684
## [9]  {Lassi}         => {Milk}          0.2027146 0.4675014  0.4336128 1.059619
## [10] {Bread}         => {Panner}        0.2035928 0.4649891  0.4378443 1.069799
## [11] {Tea Powder}    => {Sweet}         0.1998403 0.4649824  0.4297804 1.062175
## [12] {Yougurt}       => {Coffee Powder} 0.2039122 0.4641948  0.4392814 1.055563
## [13] {Butter}        => {Sweet}         0.2030339 0.4640511  0.4375250 1.060047
## [14] {Milk}          => {Sugar}         0.2046307 0.4638075  0.4411976 1.059684
## [15] {Sweet}         => {Butter}        0.2030339 0.4637972  0.4377645 1.060047
## [16] {Coffee Powder} => {Yougurt}       0.2039122 0.4636892  0.4397605 1.055563
## [17] {Panner}        => {Ghee}          0.2014371 0.4634460  0.4346507 1.053669
## [18] {Sweet}         => {Bread}         0.2027146 0.4630677  0.4377645 1.057608
## [19] {Bread}         => {Sweet}         0.2027146 0.4629832  0.4378443 1.057608
## [20] {Lassi}         => {Coffee Powder} 0.2004790 0.4623458  0.4336128 1.051358
##      count
## [1]  2576 
## [2]  2576 
## [3]  2571 
## [4]  2571 
## [5]  2550 
## [6]  2577 
## [7]  2577 
## [8]  2563 
## [9]  2539 
## [10] 2550 
## [11] 2503 
## [12] 2554 
## [13] 2543 
## [14] 2563 
## [15] 2543 
## [16] 2554 
## [17] 2523 
## [18] 2539 
## [19] 2539 
## [20] 2511
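The columns of the table above are linked by simple ratios, which can be checked by hand for rule [1], {Lassi} => {Sweet}. The sketch below uses the rule count from the table and the absolute item frequencies reported earlier; the variable names are illustrative only.

```r
n           <- 12525  # number of transactions
count_both  <- 2576   # baskets with Lassi and Sweet (count of rule [1])
count_lassi <- 5431   # baskets with Lassi (absolute item frequency)
count_sweet <- 5483   # baskets with Sweet (absolute item frequency)

support    <- count_both / n                  # share of baskets with both items
confidence <- count_both / count_lassi        # share of Lassi baskets that also hold Sweet
coverage   <- count_lassi / n                 # support of the antecedent
lift       <- confidence / (count_sweet / n)  # confidence relative to Sweet's overall share

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 6))
```

A lift of 1 would mean that the two items are bought together exactly as often as expected under independence, so the values of roughly 1.05 to 1.08 in the table indicate associations that are positive but modest.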

Among the 20 rules with the highest confidence, Panner appears as the consequent (rhs) only once, in rule [10]. The analysis therefore continues with the rules that lead to the purchase of Panner.

Rules of Purchasing Panner

PRules <- apriori(data, parameter = list(minlen = 2, conf = .42, supp = 0.15),
                 appearance=list(rhs=c("Panner")))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.42    0.1    1 none FALSE            TRUE       5    0.15      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1878 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[12 item(s), 12525 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Eleven rules have Panner as the consequent.

Visualization

plot(PRules, measure=c("support","lift"), shading="confidence", color="blue")

plot(PRules, method="grouped",color="blue")

## Warning: Unknown control parameters: color

## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE

plot(PRules, method="graph", color="blue")

As seen from the plot, Bread has the highest lift (strength of association) with Panner, and Tea Powder has the lowest. Customers who buy Bread are therefore more likely to also buy Panner than customers who buy any of the other items.

Conclusion

To summarize, the dataset is well suited to association rule analysis with the Apriori algorithm. Although the support and confidence thresholds needed some adjustment to yield a manageable number of rules, the technique gave reasonable results, which were checked using Panner as a test case.

Reference

https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset
https://www.geeksforgeeks.org/apriori-algorithm/
Lecture notes by Prof. Jacek Lewkowicz

Anirban Das (ID: 454449)

2023-02-27
