Performing Market Basket Analysis using Association Rules
Performing Market Basket Analysis using Association Rules
- Loading the arules library and the data set into sparse matrix format
- Exploring groceries dataset
- Looking at the first 5 transactions and the proportion of transaction that contains items
- Visualizating Item Frequency Plot with top 10 most frequent item
- Visualizing the sparse matrix
- Training a model with Apriori Algorithm
- Evaluating the model
- Sorting the association rules according to lift
- Taking subsets of association rules
Loading the arules library and the data set into sparse matrix format
suppressPackageStartupMessages(library(arules))
## Warning: package 'arules' was built under R version 3.4.4
groceries <- read.transactions("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv",
sep = ",")
Exploring groceries dataset
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
There are 9835 transactions and 169 items. Density is 2.6% which means there are 2.6% nonzero matrix cells which are 9835x169 = 1662115, 1662115x0.02609146 = 43,367. Whole milk is the most frequent item. Further we get statistics about the size of the transactions. We can see 2159 transaction has 1 item and so on. The average items/transaction are 4.409.
Looking at the first 5 transactions and the proportion of transaction that contains items
inspect(groceries[1:5])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
itemFrequency(groceries[,1:5])
## abrasive cleaner artif. sweetener baby cosmetics baby food
## 0.0035587189 0.0032536858 0.0006100661 0.0001016777
## bags
## 0.0004067107
We can see abrasive cleaner is present in 3.55% of transactions while artif. sweetener in 3.25% and so on. The items are displayed according to alphabetical order.
Visualizating Item Frequency Plot with top 10 most frequent item
itemFrequencyPlot(groceries,support=0.1, topN=10)
The histogram shows the eight items in the groceries data with at least 10 percent support.
Visualizing the sparse matrix
image(groceries[1:10])
The image function displays first 10 rows and 169 columns. Cells in the matrix are filled with black for transactions (rows) where the item (column) was purchased.
image(sample(groceries, 100))
Using sample function with image function creates a big visualization of the sparse matrix. A random 100 sample is plotted and thus we can get insights about the items in a transaction. We can see that some columns are heavily populated with the black dots indicating those items are more popular and are present in many transactions. Lets continue with our analysis further.
Training a model with Apriori Algorithm
With the transaction data we have, we can find the association between the items in the dataset using Apriori algorithm from the arules package. For finding the association rules, support and confidence parameter plays a vital role. If these are set high then there will be few or no rules and setting it low can give very high number of unreliable rules and the operation will take a lot of time and may run out of memory. Lets say we are interested in finding out items which are sold twice a day, 60 times a month which equals 60/ 9835 = 0.006. This is our setting for support parameter. We can keep confidence level to 0.25. minlen is set to 2 to remove rules that contain less than two items.
asso_rules <- apriori(groceries, parameter = list(support =
0.006, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Evaluating the model
summary(asso_rules)
## set of 463 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 150 297 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.711 3.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.006101 Min. :0.2500 Min. :0.9932 Min. : 60.0
## 1st Qu.:0.007117 1st Qu.:0.2971 1st Qu.:1.6229 1st Qu.: 70.0
## Median :0.008744 Median :0.3554 Median :1.9332 Median : 86.0
## Mean :0.011539 Mean :0.3786 Mean :2.0351 Mean :113.5
## 3rd Qu.:0.012303 3rd Qu.:0.4495 3rd Qu.:2.3565 3rd Qu.:121.0
## Max. :0.074835 Max. :0.6600 Max. :3.9565 Max. :736.0
##
## mining info:
## data ntransactions support confidence
## groceries 9835 0.006 0.25
By using summary function we can get a detailed overview of the association rules. As we can see there are 463 association rules. Rule length distribution tells us how many items are present in how many rules. 2 items are present in 150 rules, 3 in 297 rules and 4 in 16 rules.
To further our analysis we can inspect individual rules by using inspect function.
inspect(asso_rules[1:5])
## lhs rhs support confidence lift
## [1] {potted plants} => {whole milk} 0.006914082 0.4000000 1.565460
## [2] {pasta} => {whole milk} 0.006100661 0.4054054 1.586614
## [3] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477
## [4] {herbs} => {other vegetables} 0.007727504 0.4750000 2.454874
## [5] {herbs} => {whole milk} 0.007727504 0.4750000 1.858983
## count
## [1] 68
## [2] 60
## [3] 69
## [4] 76
## [5] 76
itemsets<-eclat(groceries)[1:5]
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.1 1 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 983
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating bit matrix ... [8 row(s), 9835 column(s)] done [0.00s].
## writing ... [8 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
support(items(itemsets), groceries)
## [1] 0.2555160 0.1934926 0.1839349 0.1395018 0.1743772
The first five rules are seen here. Also, we can see support for the top 5 most frequent items. We can see the lift column along with support and confidence. The lift of a rule measures how much likely an item or itemset is purchased relative to its typical rate of purchase, given that you know another item or itemsethas been purchased.
Sorting the association rules according to lift
inspect(sort(asso_rules, by = "lift")[1:5])
## lhs rhs support confidence lift count
## [1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477 69
## [2] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886 89
## [3] {other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.007015760 0.4107143 3.768074 69
## [4] {beef,
## other vegetables} => {root vegetables} 0.007930859 0.4020619 3.688692 78
## [5] {other vegetables,
## tropical fruit} => {pip fruit} 0.009456024 0.2634561 3.482649 93
We can sort the rules according to support, confidence or lift. Here the first rule is with the highest lift which indicates that customers who buy herbs are almost 4 times more likely to buy root vegetables verses other customers.
Taking subsets of association rules
Sometimes the marketing team requires to promote a specific product, say they want to promote berries, and want to find out how often and with which items the berries are purchased. The subset function enables one to find subsets of transactions, items or rules. The %in% operator is used for exact matching
berryrules <- subset(asso_rules, items %in% "berries")
inspect(berryrules)
## lhs rhs support confidence lift
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886
## [2] {berries} => {yogurt} 0.010574479 0.3180428 2.279848
## [3] {berries} => {other vegetables} 0.010269446 0.3088685 1.596280
## [4] {berries} => {whole milk} 0.011794611 0.3547401 1.388328
## count
## [1] 89
## [2] 104
## [3] 101
## [4] 116
We find 4 rules related to berries 2 of which seems interesting and gives useful insight. In addition to whipped cream, berries are also purchased with yogurt frequently that could serve well for breakfast, lunch and dessert.
vegrules <- subset(asso_rules, items %pin% "vegetable")
inspect(sort(vegrules, by="lift")[1:10])
## lhs rhs support confidence lift count
## [1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477 69
## [2] {other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.007015760 0.4107143 3.768074 69
## [3] {beef,
## other vegetables} => {root vegetables} 0.007930859 0.4020619 3.688692 78
## [4] {other vegetables,
## tropical fruit} => {pip fruit} 0.009456024 0.2634561 3.482649 93
## [5] {beef,
## whole milk} => {root vegetables} 0.008032537 0.3779904 3.467851 79
## [6] {other vegetables,
## pip fruit} => {tropical fruit} 0.009456024 0.3618677 3.448613 93
## [7] {citrus fruit,
## other vegetables} => {root vegetables} 0.010371124 0.3591549 3.295045 102
## [8] {other vegetables,
## whole milk,
## yogurt} => {tropical fruit} 0.007625826 0.3424658 3.263712 75
## [9] {other vegetables,
## whole milk,
## yogurt} => {root vegetables} 0.007829181 0.3515982 3.225716 77
## [10] {other vegetables,
## tropical fruit,
## whole milk} => {yogurt} 0.007625826 0.4464286 3.200164 75
Now we can see top 10 rules with highest lift which has some kind of vegetable in the transaction. The %pin% operator is used to partial matching.We can see herbs and root vegetables are purchased four times more often together as seen before. Also it is very useful to know that customers buying other vegetables,tropical fruit and whole milk are 3.77 times more likely to buy root vegetables. Thus we can find several other useful rules and our marketing team can plan their promotional offers likewise.
**THANK YOU SO MUCH FOR READING**