Association rule mining is a rule-based machine learning technique
used for discovering interesting patterns or relationships in large
transactional datasets, such as market basket data, website
clickstream data, or medical records. In particular, it is used to
identify frequent itemsets and to generate rules that
indicate how likely items are to co-occur within a dataset.
An association rule expresses a relationship between two or more
items such that if one or more items are present in a transaction, it is
highly likely that the other item(s) will also be present in that same
transaction.
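The three measures reported later in this analysis (support, confidence, and lift) can be written down compactly. The notation below is an assumption made for illustration: T is the set of transactions, and X and Y are disjoint itemsets forming the rule X => Y.

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align*}
```

A lift of 1 means X and Y occur together exactly as often as expected if they were independent; values above 1 indicate a positive association.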
In this project, market basket analysis is used to understand
customer preferences, with a view to maximizing profit.
library(arules)
library(arulesViz)
library(kableExtra)
library(ggplot2)
The dataset consists of 12,525 transactions. The number of
items in each transaction depends on what the customer purchased. In
this project we use the Apriori algorithm to determine how frequently
items are purchased and which items tend to be purchased together, in
order to understand customer preferences. The twelve items are Bread,
Butter, Cheese, Coffee Powder, Ghee, Lassi, Milk, Panner, Sugar, Sweet,
Tea Powder, and Yougurt (item labels are spelled as they appear in the data).
The data is transactional: each row records one customer's purchase on a single occasion. It is therefore loaded with read.transactions.
data <- read.transactions("DataSetA.csv", header = TRUE, sep = ",")
summary(data)
## transactions as itemMatrix in sparse format with
## 12525 rows (elements/itemsets/transactions) and
## 12 columns (items) and a density of 0.4371723
##
## most frequent items:
## Milk Ghee Coffee Powder Yougurt Bread
## 5526 5509 5508 5502 5484
## (Other)
## 38178
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11
## 1402 1592 1666 1947 2124 1998 1266 438 84 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 5.246 7.000 11.000
##
## includes extended item information - examples:
## labels
## 1 Bread
## 2 Butter
## 3 Cheese
str(data)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:65707] 3 4 1 2 5 10 0 1 2 3 ...
## .. .. ..@ p : int [1:12526] 0 2 6 12 18 24 26 29 35 37 ...
## .. .. ..@ Dim : int [1:2] 12 12525
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 12 obs. of 1 variable:
## .. ..$ labels: chr [1:12] "Bread" "Butter" "Cheese" "Coffee Powder" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
inspect(data)
The output of the inspection is omitted here: at 12525 rows it is
too long and would distract from the analysis.
length(data)
## [1] 12525
itemFrequency(data, type="relative")
## Bread Butter Cheese Coffee Powder Ghee
## 0.4378443 0.4375250 0.4371257 0.4397605 0.4398403
## Lassi Milk Panner Sugar Sweet
## 0.4336128 0.4411976 0.4346507 0.4376846 0.4377645
## Tea Powder Yougurt
## 0.4297804 0.4392814
itemFrequency(data, type="absolute")
## Bread Butter Cheese Coffee Powder Ghee
## 5484 5480 5475 5508 5509
## Lassi Milk Panner Sugar Sweet
## 5431 5526 5444 5482 5483
## Tea Powder Yougurt
## 5383 5502
The Apriori algorithm is a data mining technique for discovering
association rules in transactional databases. Its main goal is to
find relationships among the items in a dataset.
The algorithm works by first identifying frequent itemsets in the dataset.
An itemset is considered frequent if the share of transactions that
contain it meets a minimum value, known as the support threshold. The
algorithm then generates association rules from these frequent itemsets,
keeping only those that meet a minimum confidence threshold.
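The frequent-itemset step described above can be sketched in base R on a toy example. This is a simplified illustration, not the implementation used by arules: the five transactions, the helper names `support_of` and `frequent_itemsets`, and the 0.5 threshold are all made up for the example, and the sketch counts support for every joined candidate instead of applying Apriori's subset-based pruning.

```r
# Toy transactions (hypothetical, not taken from DataSetA.csv)
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Butter", "Milk"),
  c("Butter", "Milk"),
  c("Bread", "Butter"),
  c("Bread", "Butter", "Milk")
)

# Fraction of transactions that contain every item in `itemset`
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# Level-wise search: size-k candidates are built only from
# frequent itemsets of size k - 1
frequent_itemsets <- function(transactions, min_support) {
  items <- sort(unique(unlist(transactions)))
  current <- Filter(function(s) support_of(s, transactions) >= min_support,
                    as.list(items))
  result <- list()
  while (length(current) > 0) {
    result <- c(result, current)
    k <- length(current[[1]]) + 1
    # Join step: unions of two frequent itemsets that have exactly k items
    candidates <- list()
    for (a in current) for (b in current) {
      u <- sort(union(a, b))
      if (length(u) == k) candidates[[paste(u, collapse = ",")]] <- u
    }
    # Keep only the candidates that meet the support threshold
    current <- Filter(function(s) support_of(s, transactions) >= min_support,
                      unname(candidates))
  }
  result
}

freq <- frequent_itemsets(transactions, min_support = 0.5)
freq_labels <- vapply(freq, paste, character(1), collapse = ",")
print(freq_labels)
```

In this toy data every single item and every pair is frequent, while the triple {Bread, Butter, Milk} appears in only two of five transactions and is dropped.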
Rules <- apriori(data, parameter = list(minlen = 2, conf = .4, supp = 0.15))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.15 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1878
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[12 item(s), 12525 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [132 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
In the above code, each rule must contain at least 2 items
(minlen = 2). The minimum confidence is 0.4, so the right-hand side
must appear in at least 40% of the transactions that contain the
left-hand side. The minimum support is 0.15, so the items of a rule must
occur together in at least 15% of all transactions. With these
thresholds, 132 rules are generated.
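The "Absolute minimum support count: 1878" line in the log can be related to the relative threshold with a quick calculation. The sketch below assumes the count is obtained by truncating the product of the support threshold and the number of transactions; that assumption is inferred from the log output, not taken from the arules documentation.

```r
n_transactions <- 12525  # number of transactions, from summary(data)
min_support    <- 0.15   # supp value passed to apriori()

# 0.15 * 12525 = 1878.75; truncating gives the count shown in the log
min_count <- floor(min_support * n_transactions)
print(min_count)
```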
Next, a redundancy check is performed. Redundancy refers to the situation
where multiple rules convey the same information or provide similar
insights. Redundant rules can clutter the output of the algorithm and
make it difficult to interpret the results.
redundant <- is.redundant(Rules, measure="confidence")
which(redundant)
## integer(0)
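To make the check more concrete, here is a small base R sketch of the redundancy criterion, assuming the definition that a rule is redundant when a more general rule (same right-hand side, left-hand side a proper subset) has equal or higher confidence. The three rules and their confidence values are hypothetical, and `is_more_general` is a helper invented for this illustration, not an arules function.

```r
# Hypothetical rules (illustrative values, not mined from the data)
rules <- list(
  list(lhs = c("Bread"),           rhs = "Panner", confidence = 0.46),
  list(lhs = c("Bread", "Milk"),   rhs = "Panner", confidence = 0.45),
  list(lhs = c("Bread", "Butter"), rhs = "Panner", confidence = 0.50)
)

# TRUE if `general` has the same rhs and an lhs that is a proper subset
# of the lhs of `specific`
is_more_general <- function(general, specific) {
  general$rhs == specific$rhs &&
    length(general$lhs) < length(specific$lhs) &&
    all(general$lhs %in% specific$lhs)
}

# A rule is flagged when some more general rule is at least as confident
redundant <- vapply(rules, function(r) {
  any(vapply(rules, function(g) {
    is_more_general(g, r) && g$confidence >= r$confidence
  }, logical(1)))
}, logical(1))

print(redundant)
```

Only the second rule is flagged: adding Milk to the left-hand side does not raise confidence above the simpler {Bread} rule. The 132 rules found above equal 12 x 11, the number of ordered item pairs, which suggests (though the log does not confirm it) that every rule has a single item on the left-hand side; if so, no more general rule exists for any of them, which would be consistent with the empty result.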
With no redundant rules found, the next step is to plot the rules
and explore them visually.
plot(Rules, measure=c("support","lift"), shading="confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(Rules, method="grouped")
plot(Rules, method="graph")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
As seen in the graph, there are too many rules to visualize
properly. The rules are therefore inspected directly to spot items that
rarely appear among the strongest rules, and the search then continues
with rules for such an item across the whole transactional dataset.
inspect(head(sort(Rules, by = "confidence", decreasing = TRUE), 20))
##      lhs                rhs             support   confidence coverage  lift     count
## [1]  {Lassi}         => {Sweet}         0.2056687 0.4743141  0.4336128 1.083492 2576
## [2]  {Sweet}         => {Lassi}         0.2056687 0.4698158  0.4377645 1.083492 2576
## [3]  {Butter}        => {Sugar}         0.2052695 0.4691606  0.4375250 1.071915 2571
## [4]  {Sugar}         => {Butter}        0.2052695 0.4689894  0.4376846 1.071915 2571
## [5]  {Panner}        => {Bread}         0.2035928 0.4684056  0.4346507 1.069799 2550
## [6]  {Coffee Powder} => {Ghee}          0.2057485 0.4678649  0.4397605 1.063715 2577
## [7]  {Ghee}          => {Coffee Powder} 0.2057485 0.4677800  0.4398403 1.063715 2577
## [8]  {Sugar}         => {Milk}          0.2046307 0.4675301  0.4376846 1.059684 2563
## [9]  {Lassi}         => {Milk}          0.2027146 0.4675014  0.4336128 1.059619 2539
## [10] {Bread}         => {Panner}        0.2035928 0.4649891  0.4378443 1.069799 2550
## [11] {Tea Powder}    => {Sweet}         0.1998403 0.4649824  0.4297804 1.062175 2503
## [12] {Yougurt}       => {Coffee Powder} 0.2039122 0.4641948  0.4392814 1.055563 2554
## [13] {Butter}        => {Sweet}         0.2030339 0.4640511  0.4375250 1.060047 2543
## [14] {Milk}          => {Sugar}         0.2046307 0.4638075  0.4411976 1.059684 2563
## [15] {Sweet}         => {Butter}        0.2030339 0.4637972  0.4377645 1.060047 2543
## [16] {Coffee Powder} => {Yougurt}       0.2039122 0.4636892  0.4397605 1.055563 2554
## [17] {Panner}        => {Ghee}          0.2014371 0.4634460  0.4346507 1.053669 2523
## [18] {Sweet}         => {Bread}         0.2027146 0.4630677  0.4377645 1.057608 2539
## [19] {Bread}         => {Sweet}         0.2027146 0.4629832  0.4378443 1.057608 2539
## [20] {Lassi}         => {Coffee Powder} 0.2004790 0.4623458  0.4336128 1.051358 2511
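The measures in the table follow directly from the counts. As a check, the sketch below recomputes the values for rule [1], {Lassi} => {Sweet}, using only numbers reported above: the transaction total, the absolute item frequencies of Lassi and Sweet, and the rule's count. The variable names are chosen for this illustration.

```r
n       <- 12525  # transactions, from summary(data)
n_lassi <- 5431   # absolute frequency of Lassi
n_sweet <- 5483   # absolute frequency of Sweet
n_both  <- 2576   # count reported for {Lassi} => {Sweet}

support    <- n_both / n                  # share of all transactions with both items
confidence <- n_both / n_lassi            # share of Lassi transactions that also have Sweet
coverage   <- n_lassi / n                 # support of the left-hand side
lift       <- confidence / (n_sweet / n)  # confidence relative to the baseline rate of Sweet

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 7))
```

The results should agree with row [1] to the printed precision. A lift of about 1.08 means Sweet is only about 8% more likely in baskets that contain Lassi than in baskets overall.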
Among the top 20 rules by confidence, Panner appears on the
right-hand side only once (rule [10], {Bread} => {Panner}). A separate
analysis is therefore run to find the rules that lead to the purchase of Panner.
PRules <- apriori(data, parameter = list(minlen = 2, conf = .42, supp = 0.15),
                  appearance = list(rhs = c("Panner")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.42 0.1 1 none FALSE TRUE 5 0.15 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1878
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[12 item(s), 12525 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Eleven rules have Panner on the right-hand side.
plot(PRules, measure=c("support","lift"), shading="confidence", color="blue")
plot(PRules, method="grouped",color="blue")
## Warning: Unknown control parameters: color
## Available control parameters (with default values):
## k = 20
## aggr.fun = function (x, ...) UseMethod("mean")
## rhs_max = 10
## lhs_label_items = 2
## col = c("#EE0000FF", "#EEEEEEFF")
## groups = NULL
## engine = ggplot2
## verbose = FALSE
plot(PRules, method="graph", color="blue")
As seen from the plot, Bread has the highest lift (strength of
association) with Panner, and Tea Powder has the lowest. Buying Bread
therefore raises the likelihood of buying Panner more than buying any
of the other items does.
To summarize, the dataset is well suited to association rule mining with the Apriori algorithm. The support and confidence thresholds needed some tuning to yield a manageable number of rules, but overall the technique produced sensible results, as the test case with Panner on the right-hand side showed.