rm(list=ls())
library(arules)
library(arulesViz)
We’ll first look at how the data is laid out in the given CSV file, and then read it into the transactions class.
products.initial <- read.csv("MarketBasketAnalysis.csv", sep=",", colClasses = "factor")
str(products.initial)
## 'data.frame': 15001 obs. of 5 variables:
## $ A : Factor w/ 15001 levels "30000","30001",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Quantity : Factor w/ 68 levels "1","10","100",..: 13 13 13 13 13 13 1 1 1 1 ...
## $ Transaction: Factor w/ 6726 levels "100001","100007",..: 5594 5594 5594 5594 5594 5595 5596 5596 5596 5596 ...
## $ Store : Factor w/ 10 levels "1","10","2","3",..: 7 7 7 7 7 1 7 7 7 7 ...
## $ Product : Factor w/ 17 levels "Bow","Candy Bar",..: 5 2 2 2 2 8 2 2 2 5 ...
This is the structure of the data as read by read.csv: one row per
purchased product, with every column stored as a factor. A data frame
like this can’t be used as input to the Apriori algorithm.
We need to read the data into the transactions class, which can be done
using “read.transactions” from the package “arules”.
rm(products.initial)
products = read.transactions("MarketBasketAnalysis.csv", format = "single", sep = ",", cols = c("Transaction", "Product"), header = TRUE)
Now look at the structure of the dataset: its class is ‘transactions’.
str(products)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:9629] 1 15 4 16 1 3 4 7 15 4 ...
## .. .. ..@ p : int [1:6727] 0 1 2 3 4 9 11 12 13 14 ...
## .. .. ..@ Dim : int [1:2] 17 6726
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 17 obs. of 1 variable:
## .. ..$ labels: chr [1:17] "Bow" "Candy Bar" "Deodorant" "Greeting Cards" ...
## ..@ itemsetInfo:'data.frame': 6726 obs. of 1 variable:
## .. ..$ transactionID: chr [1:6726] "100001" "100007" "100010" "100013" ...
inspect(head(products))
## items transactionID
## [1] {Candy Bar} 100001
## [2] {Toothpaste} 100007
## [3] {Magazine} 100010
## [4] {Wrapping Paper} 100013
## [5] {Candy Bar,
## Greeting Cards,
## Magazine,
## Pencils,
## Toothpaste} 100016
## [6] {Magazine,
## Pencils} 100019
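The grouping performed for format = "single" can be pictured with a few lines of base R. This is only a sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not rows from the CSV), and it is not how arules implements the reader internally:

```r
# Toy long-format data: one row per (transaction, product) pair.
# The values are invented for illustration only.
toy <- data.frame(
  Transaction = c("t1", "t1", "t2", "t3", "t3", "t3"),
  Product     = c("Magazine", "Pencils", "Candy Bar",
                  "Magazine", "Magazine", "Toothpaste"),
  stringsAsFactors = FALSE
)

# Group the products by transaction ID and keep each item once per basket
baskets <- lapply(split(toy$Product, toy$Transaction), unique)

# Number of distinct items in each basket
lengths(baskets)
```

Repeated purchases of the same product within a transaction collapse to a single item, which is presumably why 15001 CSV rows end up as only 9629 filled cells in the transactions object.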
itemInfo contains the list of items present in the dataset. This dataset lists 17 unique items/products.
print(products@itemInfo)
## labels
## 1 Bow
## 2 Candy Bar
## 3 Deodorant
## 4 Greeting Cards
## 5 Magazine
## 6 Markers
## 7 Pain Reliever
## 8 Pencils
## 9 Pens
## 10 Perfume
## 11 Photo Processing
## 12 Prescription Med
## 13 Shampoo
## 14 Soap
## 15 Toothbrush
## 16 Toothpaste
## 17 Wrapping Paper
itemsetInfo has the list of transactions. This dataset has 6726 unique transactions.
head(products@itemsetInfo)
## transactionID
## 1 100001
## 2 100007
## 3 100010
## 4 100013
## 5 100016
## 6 100019
nrow(products@itemsetInfo)
## [1] 6726
data is the sparse matrix with one row per item (17 rows) and one column per transaction (6726 columns). A cell is marked with ‘|’ if the transaction contains the item and with ‘.’ if it doesn’t.
products@data
## 17 x 6726 sparse Matrix of class "ngCMatrix"
##
## [1,] . . . . . . . . . . . . . . . . . . . . . . | . . . . . . . . . . ......
## [2,] | . . . | . . . . . | . | . | . . . . . . . . | . . . . . | . . | ......
## [3,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [4,] . . . . | . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [5,] . . | . | | . | . | . . . | | . . . | . . | . . | . . | . | . . . ......
## [6,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [7,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [8,] . . . . | | | . . . | . | . . . . . . . . . . . . . . . . . . . . ......
## [9,] . . . . . . . . . . . . . . . . . . . . . . . . . . | . . . | . . ......
## [10,] . . . . . . . . . . . . . | . . . | | | . | . . . | . . . . . | | ......
## [11,] . . . . . . . . | . . . . . . . . . . . . . . . . . . . . . . . . ......
## [12,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [13,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
## [14,] . . . . . . . . . . . | . . . . . . . . . . . . . . . . . . . . . ......
## [15,] . . . . . . . . . . . . . | . . | . | . . . . . . . . . . . . . . ......
## [16,] . | . . | . . . . . | . | . . | . . . . | . . . . . . . | . . . . ......
## [17,] . . . | . . . . . . . . . . . | . . . . . . . . . . . . . . . . . ......
##
## .....suppressing 6693 columns in show(); maybe adjust options(max.print=, width=)
## ..............................
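The same items-by-transactions layout can be rebuilt by hand for a small example. The sketch below uses base R and three invented baskets (the `baskets` list is an assumption for illustration); TRUE plays the role of ‘|’ and FALSE the role of ‘.’:

```r
# Three invented baskets, keyed by a made-up transaction ID
baskets <- list(
  t1 = c("Magazine", "Pencils"),
  t2 = "Candy Bar",
  t3 = c("Magazine", "Toothpaste")
)

# All distinct items, in alphabetical order (one row each)
items <- sort(unique(unlist(baskets)))

# One column per transaction: TRUE where the basket contains the item
incidence <- sapply(baskets, function(b) items %in% b)
rownames(incidence) <- items
incidence
```

Each column can be read as one basket and each row as the set of transactions containing that item, which is how the printed ngCMatrix above should be read as well.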
We’ll look at the summary of the dataset and do some basic EDA (Exploratory Data Analysis) on these transactions, such as the frequency of each item and frequency plots.
The summary gives the information below.
summary(products)
## transactions as itemMatrix in sparse format with
## 6726 rows (elements/itemsets/transactions) and
## 17 columns (items) and a density of 0.08421228
##
## most frequent items:
## Magazine Candy Bar Toothpaste Greeting Cards Pens
## 1560 1182 1093 1028 969
## (Other)
## 3797
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9
## 4848 1192 470 135 54 19 3 3 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.432 2.000 9.000
##
## includes extended item information - examples:
## labels
## 1 Bow
## 2 Candy Bar
## 3 Deodorant
##
## includes extended transaction information - examples:
## transactionID
## 1 100001
## 2 100007
## 3 100010
Item density is the total number of item occurrences across all transactions divided by the product of the number of items and the number of transactions. Put simply, it is the number of cells marked “|” in the sparse matrix divided by the total number of cells. The calculation is shown below.
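# Count the filled cells ("|") row by row, i.e. item by item
itemNoTrans.Total = 0
for(i in seq_len(nrow(products@itemInfo)))
{
  # Logical vector: TRUE for every transaction that contains item i
  itemNo = products@data[i,]
  itemNoTrans = sum(itemNo)
  itemNoTrans.Total = itemNoTrans.Total + itemNoTrans
}
# Filled cells divided by all cells (transactions x items)
items.density = itemNoTrans.Total / (nrow(products) * ncol(products))
print(items.density)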
## [1] 0.08421228
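As a cross-check, the same figure can be reproduced from numbers already printed by str(products): the `i` slot has 9629 entries (one per filled cell) and the matrix is 17 x 6726. A minimal sketch, with the three constants copied by hand from that output:

```r
# Values copied from the str(products) output above
n.filled       <- 9629   # length of the @i slot = number of "|" cells
n.items        <- 17
n.transactions <- 6726

# Filled cells divided by total cells
density <- n.filled / (n.items * n.transactions)
round(density, 8)
```

This should agree with the density reported by summary(products).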
The frequency of an item is the number of transactions that contain the item divided by the total number of transactions. The calculation for Magazine (item 5) is shown below.
# Transaction list for item 5 (Magazine)
Magazine.Transactions = products@data[5,]
Magazine.Freq = sum(Magazine.Transactions)/length(Magazine.Transactions)
print(Magazine.Freq)
## [1] 0.2319358
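The same ratio can be worked out for the five most frequent items using the counts printed by summary(products). This is a small sketch in base R; the counts are copied by hand from that output:

```r
# Counts of the five most frequent items, copied from summary(products)
item.counts <- c(Magazine = 1560, `Candy Bar` = 1182, Toothpaste = 1093,
                 `Greeting Cards` = 1028, Pens = 969)
n.transactions <- 6726

# Relative frequency (support) of each single item
item.support <- item.counts / n.transactions
round(item.support, 4)
```

The value for Magazine matches the 0.2319358 computed from the sparse matrix, and the other four values reappear later as the coverage of rules that have these items alone on the left hand side.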
We can see different frequency plots below. Magazine is the most frequent item and Deodorant is the least frequent.
itemFrequencyPlot(products)
The top 5 most frequent items can be seen below.
itemFrequencyPlot(products, topN=5)
#itemFrequencyPlot(products, topN=ncol(products))
A frequency plot restricted by support can be seen below.
Frequency of items with a support of at least 0.1:
itemFrequencyPlot(products, support = 0.1)
We pass two arguments to the apriori function:
1. The transactions dataset
2. A list of parameters
The parameters are:
Minimum Support = 0.01 => Rules must have a support of at least 0.01.
Minimum Confidence = 0.1 => Rules must have a confidence of at least
0.1.
Minimum Length = 2 => A rule must involve at least two items (LHS and
RHS combined).
Maximum Length = 5 => A rule may involve at most five items.
The defaults are a minimum support of 0.1, a minimum confidence of 0.8 and a maxlen of 10 items.
Support: It’s the percentage of transactions that contain all of the items in an itemset (e.g., pencil, paper and rubber). The higher the support the more frequently the itemset occurs. Rules with a high support are preferred since they are likely to be applicable to a large number of future transactions.
Confidence: It’s the probability that a transaction that contains the items on the left hand side of the rule (in our example, pencil and paper) also contains the item on the right hand side (a rubber). The higher the confidence, the greater the likelihood that the item on the right hand side will be purchased or, in other words, the greater the return rate you can expect for a given rule.
Lift: It’s the probability of all of the items in a rule occurring together (otherwise known as the support) divided by the product of the probabilities of the items on the left and right hand side occurring as if there was no association between them. For example, if pencil, paper and rubber occurred together in 2.5% of all transactions, pencil and paper in 10% of transactions and rubber in 8% of transactions, then the lift would be: 0.025/(0.1*0.08) = 3.125. A lift of more than 1 suggests that the presence of pencil and paper increases the probability that a rubber will also occur in the transaction. Overall, lift summarises the strength of association between the products on the left and right hand side of the rule; the larger the lift the greater the link between the two products.
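The pencil, paper and rubber example can be checked with a few lines of arithmetic. The sketch below only restates the three percentages from the text; the variable names are chosen here for illustration:

```r
# Figures from the pencil/paper/rubber example in the text
supp.all <- 0.025   # share of transactions with pencil, paper and rubber
supp.lhs <- 0.10    # share of transactions with pencil and paper
supp.rhs <- 0.08    # share of transactions with a rubber

# Confidence: how often the RHS appears given that the LHS is present
confidence <- supp.all / supp.lhs

# Lift: joint support relative to what independence would predict
lift <- supp.all / (supp.lhs * supp.rhs)

c(confidence = confidence, lift = lift)
```

With these inputs the confidence is 0.25 and the lift is 3.125, the value quoted above. Equivalently, lift is the confidence divided by the support of the right hand side (0.25 / 0.08).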
products.apriori <- apriori(products, parameter=list(support=0.01, confidence = 0.1, minlen=2, maxlen =5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 67
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 6726 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [49 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(products.apriori)
## set of 49 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 28 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.429 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01011 Min. :0.1231 Min. :0.02275 Min. :0.6007
## 1st Qu.:0.01204 1st Qu.:0.1960 1st Qu.:0.03747 1st Qu.:1.0569
## Median :0.01710 Median :0.2596 Median :0.06765 Median :1.4774
## Mean :0.02192 Mean :0.2713 Mean :0.09811 Mean :1.7122
## 3rd Qu.:0.02988 3rd Qu.:0.3309 3rd Qu.:0.15284 3rd Qu.:2.2982
## Max. :0.04609 Max. :0.4837 Max. :0.23194 Max. :3.0575
## count
## Min. : 68.0
## 1st Qu.: 81.0
## Median :115.0
## Mean :147.4
## 3rd Qu.:201.0
## Max. :310.0
##
## mining info:
## data ntransactions support confidence
## products 6726 0.01 0.1
## call
## apriori(data = products, parameter = list(support = 0.01, confidence = 0.1, minlen = 2, maxlen = 5))
The algorithm created 49 rules, of which 28 involve two items and 21 involve three items.
The summary of quality measures gives the minimum, quartiles, mean and maximum of support, confidence, coverage, lift and count.
The full list of rules produced by the algorithm is shown below.
inspect(products.apriori)
## lhs rhs support confidence
## [1] {Bow} => {Toothbrush} 0.01011002 0.1959654
## [2] {Toothbrush} => {Bow} 0.01011002 0.1494505
## [3] {Photo Processing} => {Magazine} 0.01650312 0.2975871
## [4] {Toothbrush} => {Perfume} 0.01709783 0.2527473
## [5] {Perfume} => {Toothbrush} 0.01709783 0.2068345
## [6] {Toothbrush} => {Magazine} 0.01129944 0.1670330
## [7] {Perfume} => {Magazine} 0.01397562 0.1690647
## [8] {Pens} => {Magazine} 0.02007136 0.1393189
## [9] {Pencils} => {Toothpaste} 0.02274755 0.1683168
## [10] {Toothpaste} => {Pencils} 0.02274755 0.1399817
## [11] {Pencils} => {Greeting Cards} 0.02988403 0.2211221
## [12] {Greeting Cards} => {Pencils} 0.02988403 0.1955253
## [13] {Pencils} => {Candy Bar} 0.03508772 0.2596260
## [14] {Candy Bar} => {Pencils} 0.03508772 0.1996616
## [15] {Pencils} => {Magazine} 0.02854594 0.2112211
## [16] {Magazine} => {Pencils} 0.02854594 0.1230769
## [17] {Toothpaste} => {Greeting Cards} 0.03330360 0.2049405
## [18] {Greeting Cards} => {Toothpaste} 0.03330360 0.2178988
## [19] {Toothpaste} => {Candy Bar} 0.04148082 0.2552608
## [20] {Candy Bar} => {Toothpaste} 0.04148082 0.2360406
## [21] {Toothpaste} => {Magazine} 0.02988403 0.1838975
## [22] {Magazine} => {Toothpaste} 0.02988403 0.1288462
## [23] {Greeting Cards} => {Candy Bar} 0.04608980 0.3015564
## [24] {Candy Bar} => {Greeting Cards} 0.04608980 0.2622673
## [25] {Greeting Cards} => {Magazine} 0.03746655 0.2451362
## [26] {Magazine} => {Greeting Cards} 0.03746655 0.1615385
## [27] {Candy Bar} => {Magazine} 0.03999405 0.2275804
## [28] {Magazine} => {Candy Bar} 0.03999405 0.1724359
## [29] {Pencils, Toothpaste} => {Candy Bar} 0.01100208 0.4836601
## [30] {Candy Bar, Pencils} => {Toothpaste} 0.01100208 0.3135593
## [31] {Candy Bar, Toothpaste} => {Pencils} 0.01100208 0.2652330
## [32] {Greeting Cards, Pencils} => {Magazine} 0.01204282 0.4029851
## [33] {Magazine, Pencils} => {Greeting Cards} 0.01204282 0.4218750
## [34] {Greeting Cards, Magazine} => {Pencils} 0.01204282 0.3214286
## [35] {Candy Bar, Pencils} => {Magazine} 0.01040737 0.2966102
## [36] {Magazine, Pencils} => {Candy Bar} 0.01040737 0.3645833
## [37] {Candy Bar, Magazine} => {Pencils} 0.01040737 0.2602230
## [38] {Greeting Cards, Toothpaste} => {Candy Bar} 0.01457032 0.4375000
## [39] {Candy Bar, Toothpaste} => {Greeting Cards} 0.01457032 0.3512545
## [40] {Candy Bar, Greeting Cards} => {Toothpaste} 0.01457032 0.3161290
## [41] {Greeting Cards, Toothpaste} => {Magazine} 0.01115076 0.3348214
## [42] {Magazine, Toothpaste} => {Greeting Cards} 0.01115076 0.3731343
## [43] {Greeting Cards, Magazine} => {Toothpaste} 0.01115076 0.2976190
## [44] {Candy Bar, Toothpaste} => {Magazine} 0.01323223 0.3189964
## [45] {Magazine, Toothpaste} => {Candy Bar} 0.01323223 0.4427861
## [46] {Candy Bar, Magazine} => {Toothpaste} 0.01323223 0.3308550
## [47] {Candy Bar, Greeting Cards} => {Magazine} 0.01724651 0.3741935
## [48] {Greeting Cards, Magazine} => {Candy Bar} 0.01724651 0.4603175
## [49] {Candy Bar, Magazine} => {Greeting Cards} 0.01724651 0.4312268
## coverage lift count
## [1] 0.05159084 2.8968426 68
## [2] 0.06764793 2.8968426 68
## [3] 0.05545644 1.2830584 111
## [4] 0.06764793 3.0575144 115
## [5] 0.08266429 3.0575144 115
## [6] 0.06764793 0.7201691 76
## [7] 0.08266429 0.7289292 94
## [8] 0.14406780 0.6006787 135
## [9] 0.13514719 1.0357722 153
## [10] 0.16250372 1.0357722 153
## [11] 0.13514719 1.4467581 201
## [12] 0.15283973 1.4467581 201
## [13] 0.13514719 1.4773640 236
## [14] 0.17573595 1.4773640 236
## [15] 0.13514719 0.9106880 192
## [16] 0.23193577 0.9106880 192
## [17] 0.16250372 1.3408852 224
## [18] 0.15283973 1.3408852 224
## [19] 0.16250372 1.4525244 279
## [20] 0.17573595 1.4525244 279
## [21] 0.16250372 0.7928813 201
## [22] 0.23193577 0.7928813 201
## [23] 0.15283973 1.7159632 310
## [24] 0.17573595 1.7159632 310
## [25] 0.15283973 1.0569141 252
## [26] 0.23193577 1.0569141 252
## [27] 0.17573595 0.9812215 269
## [28] 0.23193577 0.9812215 269
## [29] 0.02274755 2.7521980 74
## [30] 0.03508772 1.9295517 74
## [31] 0.04148082 1.9625489 74
## [32] 0.02988403 1.7374856 81
## [33] 0.02854594 2.7602444 81
## [34] 0.03746655 2.3783593 81
## [35] 0.03508772 1.2788462 70
## [36] 0.02854594 2.0746087 70
## [37] 0.03999405 1.9254788 70
## [38] 0.03330360 2.4895305 98
## [39] 0.04148082 2.2981884 98
## [40] 0.04608980 1.9453649 98
## [41] 0.03330360 1.4435955 75
## [42] 0.02988403 2.4413439 75
## [43] 0.03746655 1.8314599 75
## [44] 0.04148082 1.3753653 89
## [45] 0.02988403 2.5196101 89
## [46] 0.03999405 2.0359843 89
## [47] 0.04608980 1.6133499 116
## [48] 0.03746655 2.6193699 116
## [49] 0.03999405 2.8214312 116
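The quality measures in this table can be recomputed by hand for a single rule. The sketch below takes rule [4], {Toothbrush} => {Perfume}. The joint count (115) comes from the count column; the two single-item counts are assumptions recovered by multiplying the coverage of rules [4] and [5] by 6726 and rounding:

```r
n <- 6726                 # total number of transactions
count.both       <- 115   # Toothbrush and Perfume together (count column)
count.toothbrush <- 455   # assumption: 0.06764793 * 6726, rounded
count.perfume    <- 556   # assumption: 0.08266429 * 6726, rounded

support    <- count.both / n                 # share with both items
coverage   <- count.toothbrush / n           # support of the LHS
confidence <- count.both / count.toothbrush  # support / coverage
lift       <- confidence / (count.perfume / n)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

Swapping the roles of the two items changes coverage and confidence but not support or lift, which is why rules [4] and [5] share the same lift.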
These are the 10 rules with the highest support (every rule already meets the 1% minimum support).
inspect(sort(products.apriori, by="support")[1:10])
## lhs rhs support confidence coverage
## [1] {Greeting Cards} => {Candy Bar} 0.04608980 0.3015564 0.1528397
## [2] {Candy Bar} => {Greeting Cards} 0.04608980 0.2622673 0.1757360
## [3] {Toothpaste} => {Candy Bar} 0.04148082 0.2552608 0.1625037
## [4] {Candy Bar} => {Toothpaste} 0.04148082 0.2360406 0.1757360
## [5] {Candy Bar} => {Magazine} 0.03999405 0.2275804 0.1757360
## [6] {Magazine} => {Candy Bar} 0.03999405 0.1724359 0.2319358
## [7] {Greeting Cards} => {Magazine} 0.03746655 0.2451362 0.1528397
## [8] {Magazine} => {Greeting Cards} 0.03746655 0.1615385 0.2319358
## [9] {Pencils} => {Candy Bar} 0.03508772 0.2596260 0.1351472
## [10] {Candy Bar} => {Pencils} 0.03508772 0.1996616 0.1757360
## lift count
## [1] 1.7159632 310
## [2] 1.7159632 310
## [3] 1.4525244 279
## [4] 1.4525244 279
## [5] 0.9812215 269
## [6] 0.9812215 269
## [7] 1.0569141 252
## [8] 1.0569141 252
## [9] 1.4773640 236
## [10] 1.4773640 236
These are the top 10 rules when sorted by support and then by confidence; every rule meets the minimum support of 1% and the minimum confidence of 10%.
inspect(sort(products.apriori, by=c("support","confidence"))[1:10])
## lhs rhs support confidence coverage
## [1] {Greeting Cards} => {Candy Bar} 0.04608980 0.3015564 0.1528397
## [2] {Candy Bar} => {Greeting Cards} 0.04608980 0.2622673 0.1757360
## [3] {Toothpaste} => {Candy Bar} 0.04148082 0.2552608 0.1625037
## [4] {Candy Bar} => {Toothpaste} 0.04148082 0.2360406 0.1757360
## [5] {Candy Bar} => {Magazine} 0.03999405 0.2275804 0.1757360
## [6] {Magazine} => {Candy Bar} 0.03999405 0.1724359 0.2319358
## [7] {Greeting Cards} => {Magazine} 0.03746655 0.2451362 0.1528397
## [8] {Magazine} => {Greeting Cards} 0.03746655 0.1615385 0.2319358
## [9] {Pencils} => {Candy Bar} 0.03508772 0.2596260 0.1351472
## [10] {Candy Bar} => {Pencils} 0.03508772 0.1996616 0.1757360
## lift count
## [1] 1.7159632 310
## [2] 1.7159632 310
## [3] 1.4525244 279
## [4] 1.4525244 279
## [5] 0.9812215 269
## [6] 0.9812215 269
## [7] 1.0569141 252
## [8] 1.0569141 252
## [9] 1.4773640 236
## [10] 1.4773640 236
These are the top 5 rules sorted in descending order of lift; every rule meets the minimum support of 1% and the minimum confidence of 10%.
inspect(sort(products.apriori, by="lift")[1:5])
## lhs rhs support confidence coverage
## [1] {Perfume} => {Toothbrush} 0.01709783 0.2068345 0.08266429
## [2] {Toothbrush} => {Perfume} 0.01709783 0.2527473 0.06764793
## [3] {Bow} => {Toothbrush} 0.01011002 0.1959654 0.05159084
## [4] {Toothbrush} => {Bow} 0.01011002 0.1494505 0.06764793
## [5] {Candy Bar, Magazine} => {Greeting Cards} 0.01724651 0.4312268 0.03999405
## lift count
## [1] 3.057514 115
## [2] 3.057514 115
## [3] 2.896843 68
## [4] 2.896843 68
## [5] 2.821431 116
The list of rules above contains some redundancy.
For example, it contains {Pencils} => {Candy Bar}, {Toothpaste} => {Candy
Bar} and {Pencils, Toothpaste} => {Candy Bar}; the third rule overlaps
with the first two.
The commands below flag a rule as redundant when its itemset (LHS and RHS
combined) contains the itemset of a rule listed earlier, and then remove
the flagged rules.
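# subset.matrix[i, j] is TRUE when the itemset of rule i is contained in the itemset of rule j
subset.matrix <- is.subset(products.apriori, products.apriori, sparse = FALSE)
# Keep only comparisons against earlier rules (upper triangle)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
# Flag a rule when at least one earlier rule's itemset is contained in it
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1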
which(redundant)
## {Bow,Toothbrush} {Perfume,Toothbrush}
## 2 5
## {Pencils,Toothpaste} {Greeting Cards,Pencils}
## 10 12
## {Candy Bar,Pencils} {Magazine,Pencils}
## 14 16
## {Greeting Cards,Toothpaste} {Candy Bar,Toothpaste}
## 18 20
## {Magazine,Toothpaste} {Candy Bar,Greeting Cards}
## 22 24
## {Greeting Cards,Magazine} {Candy Bar,Magazine}
## 26 28
## {Candy Bar,Pencils,Toothpaste} {Candy Bar,Pencils,Toothpaste}
## 29 30
## {Candy Bar,Pencils,Toothpaste} {Greeting Cards,Magazine,Pencils}
## 31 32
## {Greeting Cards,Magazine,Pencils} {Greeting Cards,Magazine,Pencils}
## 33 34
## {Candy Bar,Magazine,Pencils} {Candy Bar,Magazine,Pencils}
## 35 36
## {Candy Bar,Magazine,Pencils} {Candy Bar,Greeting Cards,Toothpaste}
## 37 38
## {Candy Bar,Greeting Cards,Toothpaste} {Candy Bar,Greeting Cards,Toothpaste}
## 39 40
## {Greeting Cards,Magazine,Toothpaste} {Greeting Cards,Magazine,Toothpaste}
## 41 42
## {Greeting Cards,Magazine,Toothpaste} {Candy Bar,Magazine,Toothpaste}
## 43 44
## {Candy Bar,Magazine,Toothpaste} {Candy Bar,Magazine,Toothpaste}
## 45 46
## {Candy Bar,Greeting Cards,Magazine} {Candy Bar,Greeting Cards,Magazine}
## 47 48
## {Candy Bar,Greeting Cards,Magazine}
## 49
products.apriori.pruned <- products.apriori[!redundant]
inspect(sort(products.apriori.pruned, by="lift"))
## lhs rhs support confidence coverage
## [1] {Toothbrush} => {Perfume} 0.01709783 0.2527473 0.06764793
## [2] {Bow} => {Toothbrush} 0.01011002 0.1959654 0.05159084
## [3] {Greeting Cards} => {Candy Bar} 0.04608980 0.3015564 0.15283973
## [4] {Pencils} => {Candy Bar} 0.03508772 0.2596260 0.13514719
## [5] {Toothpaste} => {Candy Bar} 0.04148082 0.2552608 0.16250372
## [6] {Pencils} => {Greeting Cards} 0.02988403 0.2211221 0.13514719
## [7] {Toothpaste} => {Greeting Cards} 0.03330360 0.2049405 0.16250372
## [8] {Photo Processing} => {Magazine} 0.01650312 0.2975871 0.05545644
## [9] {Greeting Cards} => {Magazine} 0.03746655 0.2451362 0.15283973
## [10] {Pencils} => {Toothpaste} 0.02274755 0.1683168 0.13514719
## [11] {Candy Bar} => {Magazine} 0.03999405 0.2275804 0.17573595
## [12] {Pencils} => {Magazine} 0.02854594 0.2112211 0.13514719
## [13] {Toothpaste} => {Magazine} 0.02988403 0.1838975 0.16250372
## [14] {Perfume} => {Magazine} 0.01397562 0.1690647 0.08266429
## [15] {Toothbrush} => {Magazine} 0.01129944 0.1670330 0.06764793
## [16] {Pens} => {Magazine} 0.02007136 0.1393189 0.14406780
## lift count
## [1] 3.0575144 115
## [2] 2.8968426 68
## [3] 1.7159632 310
## [4] 1.4773640 236
## [5] 1.4525244 279
## [6] 1.4467581 201
## [7] 1.3408852 224
## [8] 1.2830584 111
## [9] 1.0569141 252
## [10] 1.0357722 153
## [11] 0.9812215 269
## [12] 0.9106880 192
## [13] 0.7928813 201
## [14] 0.7289292 94
## [15] 0.7201691 76
## [16] 0.6006787 135
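To see what the pruning step does, it can be traced in base R on a handful of itemsets. This is a sketch, not the arules implementation: the `itemsets` list is a hand-picked sample of rule itemsets, and `outer` with `Vectorize` stands in for is.subset:

```r
# Itemsets (LHS and RHS combined) of five rules, in mining order
itemsets <- list(
  c("Bow", "Toothbrush"),                   # {Bow} => {Toothbrush}
  c("Bow", "Toothbrush"),                   # {Toothbrush} => {Bow}
  c("Magazine", "Photo Processing"),        # {Photo Processing} => {Magazine}
  c("Candy Bar", "Pencils"),                # {Pencils} => {Candy Bar}
  c("Candy Bar", "Pencils", "Toothpaste")   # {Pencils, Toothpaste} => {Candy Bar}
)
n <- length(itemsets)

# subset.matrix[i, j] is TRUE when itemset i is contained in itemset j
subset.matrix <- outer(seq_len(n), seq_len(n),
  Vectorize(function(i, j) all(itemsets[[i]] %in% itemsets[[j]])))

# Only compare each rule against the rules listed before it
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
which(redundant)
```

Rules 2 and 5 are flagged: the second because it has the same itemset as the first with the direction reversed, and the fifth because it extends the fourth. Note that this criterion ignores confidence and lift, so it also drops {Pencils, Toothpaste} => {Candy Bar}, the rule with the highest confidence in the full list. arules also provides an is.redundant() function based on a confidence comparison, which would likely keep a different set of rules.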
We can also plot these association rules using different graphs.
Scatter Plot: Plots all rules with support on the x-axis
and confidence on the y-axis; color represents the lift. The brighter
the color, the higher the lift.
The point in the top right corner has high support and high
confidence but low lift.
The main disadvantage of this graph is that it doesn’t show the actual
rules.
plot(products.apriori.pruned)
Grouped Plot: Plots the LHS items against the RHS items,
with connecting lines and a bubble placed at each intersection of an LHS
and an RHS.
The size of the bubble represents support and the color represents
lift.
Compared to the scatter plot this is a better view, as it shows the LHS
and RHS items and their associations along with support and lift.
Confidence, however, is not shown in this graph.
plot(products.apriori.pruned,method="group")
Graph Plot: Plots the items as nodes and joins associated
items with arrows, with a bubble in between for each rule.
Bubble size represents support and color represents lift.
There is an interactive version, but it can’t be rendered in the
markdown file, so it is commented out.
#plot(products.apriori.pruned,method="graph",interactive=TRUE,shading=NA)
plot(products.apriori.pruned,method="graph")
The item frequency plot showed that Magazine is the most frequent item, and hence it’s placed in the center of the graph with the other items plotted around it.
Next we’ll see how Magazine is associated with the other items.
products.apriori.magazine <- apriori(products, parameter=list(support=0.01, confidence = 0.1, minlen=2, maxlen =5),
appearance = list(rhs=c("Magazine"), default = "lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 67
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[17 item(s), 6726 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(products.apriori.magazine, by="lift"))
## lhs rhs support confidence
## [1] {Greeting Cards, Pencils} => {Magazine} 0.01204282 0.4029851
## [2] {Candy Bar, Greeting Cards} => {Magazine} 0.01724651 0.3741935
## [3] {Greeting Cards, Toothpaste} => {Magazine} 0.01115076 0.3348214
## [4] {Candy Bar, Toothpaste} => {Magazine} 0.01323223 0.3189964
## [5] {Photo Processing} => {Magazine} 0.01650312 0.2975871
## [6] {Candy Bar, Pencils} => {Magazine} 0.01040737 0.2966102
## [7] {Greeting Cards} => {Magazine} 0.03746655 0.2451362
## [8] {Candy Bar} => {Magazine} 0.03999405 0.2275804
## [9] {Pencils} => {Magazine} 0.02854594 0.2112211
## [10] {Toothpaste} => {Magazine} 0.02988403 0.1838975
## [11] {Perfume} => {Magazine} 0.01397562 0.1690647
## [12] {Toothbrush} => {Magazine} 0.01129944 0.1670330
## [13] {Pens} => {Magazine} 0.02007136 0.1393189
## coverage lift count
## [1] 0.02988403 1.7374856 81
## [2] 0.04608980 1.6133499 116
## [3] 0.03330360 1.4435955 75
## [4] 0.04148082 1.3753653 89
## [5] 0.05545644 1.2830584 111
## [6] 0.03508772 1.2788462 70
## [7] 0.15283973 1.0569141 252
## [8] 0.17573595 0.9812215 269
## [9] 0.13514719 0.9106880 192
## [10] 0.16250372 0.7928813 201
## [11] 0.08266429 0.7289292 94
## [12] 0.06764793 0.7201691 76
## [13] 0.14406780 0.6006787 135
The algorithm produces 13 rules. We’ll plot these rules using the graph method and look for some insights. This plot differs from the previous graph plot: here the item Magazine is linked directly to the other itemsets, whereas the previous plot used individual items in place of itemsets.
The width of an arrow represents support and its color represents lift. Candy Bar => Magazine has the highest support but a relatively low lift.
plot(products.apriori.magazine,method="graph",control = list(type="itemsets"))
In Market Basket Analysis, it is hard to fix thresholds for
support, confidence and lift in advance and simply pick the itemsets
that fall above them.
Picking “appropriate” values for support and confidence can be
difficult, as this is very much an unsupervised process. It’s better to
choose these values based on domain knowledge and on a review of the
association rules produced.
In our case, we recommend {Photo Processing, Magazine}, {Greeting Cards,
Candy Bar} and {Toothbrush, Perfume} as itemsets to be combined.
As noted above, domain knowledge can help find more appropriate combinations of itemsets.
References: https://select-statistics.co.uk/blog/market-basket-analysis-understanding-customer-behaviour/