Understanding consumer behavior is a key to success for many companies. Purchases of different goods are often related, i.e. the purchase of one good drives people to purchase another; knowledge of such patterns may help in creating an effective sales strategy. This report aims to reveal the purchasing patterns of consumers. For this purpose we will analyze market basket data with association rules. The rules will be mined using the Apriori algorithm. The Apriori algorithm uses three measures:
- Support - how often an item or a rule appears in the data set
- Confidence - the share of transactions containing one item (or set of items) that also contain another specific item
- Lift - informs about the association between two items. A value greater than 1 suggests a positive association between the items, a value lower than 1 a negative association, and a value close to 1 implies a lack of dependency.
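The three measures can be computed by hand on a small example. The sketch below uses hypothetical toy baskets (not the Kaggle data) and a helper named `support()` that is introduced here only for illustration; it evaluates the rule {pepsi} => {coke}.

```r
# Hypothetical toy baskets, not taken from the Kaggle file
baskets <- list(
  c("coke", "pepsi", "peas"),
  c("coke", "pepsi"),
  c("coke", "pizza"),
  c("pizza", "lasagna"),
  c("pepsi", "pizza")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "pepsi"
rhs <- "coke"

supp_rule <- support(c(lhs, rhs))      # share of baskets with both lhs and rhs
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the base rate of rhs

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy data the lift is slightly above 1, so pepsi and coke appear together a little more often than they would if they were bought independently.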
The dataset provides information about 2000 store transactions. Each row in the data represents one market basket, and the 42 columns stand for 42 different products. The data comes from Kaggle (https://www.kaggle.com/arronlacey/market-basket-analysis?select=market_basket_analysis.csv). Before applying the Apriori algorithm to the data set, we will transform the data from matrix format to basket format and try to learn more about the transactions. Then we will use the itemFrequencyPlot() function to draw bar plots showing the distribution of the products.
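The report does not show the conversion step itself, so the following is only a minimal sketch of what moving from matrix format to basket format means. The matrix `m` and its product columns are hypothetical; the optional last step assumes the arules package is installed and uses its coercion of a logical matrix to a transactions object.

```r
# Hypothetical 0/1 purchase matrix: rows are baskets, columns are products
m <- matrix(
  c(1, 0, 1,
    0, 1, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("coke", "pizza", "twix"))
)

# Basket format: one character vector of product names per row
baskets <- lapply(seq_len(nrow(m)), function(i) colnames(m)[m[i, ] == 1])

# Absolute item frequencies, the quantity an item frequency bar plot displays
item_counts <- sort(colSums(m), decreasing = TRUE)

# Optional: with arules installed, coerce the logical matrix to transactions
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  market_toy <- as(m == 1, "transactions")
}

print(baskets)
```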
summary(market)
## transactions as itemMatrix in sparse format with
## 2000 rows (elements/itemsets/transactions) and
## 43 columns (items) and a density of 0.1183023
##
## most frequent items:
##   pizza    mars    coke lasagna    twix (Other)
##     491     486     482     433     425    7857
##
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  16  17
## 124 324 275 247 224 219 173 128 107  78  57  31   9   2   1   1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   3.000   5.000   5.087   7.000  17.000
##
## includes extended item information - examples:
## labels
## 1 7up
## 2 bbq
## 3 bread
length(market)
## [1] 2000
# limit the plot to 20 items
#absolute
itemFrequencyPlot(market, topN = 20, type="absolute", col ="purple", cex.names=.8, main="Item frequency - absolute")
#relative
itemFrequencyPlot(market, topN = 20, type="relative", col ="purple", cex.names=.8, main="Item frequency - relative")
We see that pizza, mars and coke are the most frequently purchased items, appearing in 491, 486 and 482 of the 2000 baskets, respectively.
Now we will look for association rules using the Apriori algorithm. We set the minimum support to 0.015, so that every rule is supported by at least 30 (0.015 * 2000) transactions, and the minimum confidence to 0.65, so that at least 65% of the transactions containing the left-hand side also contain the right-hand side. The minimum length of a rule is 2 items.
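To illustrate how a minimum support threshold drives the search, here is a simplified base R sketch of the level-wise idea behind Apriori: count single items, keep the frequent ones, build larger candidates only from them, and discard any candidate that has an infrequent subset before counting it. The baskets, the threshold `min_count` and the helpers `count_of()` and `is_in()` are hypothetical, and this is not how the `apriori()` function in arules is implemented internally.

```r
# Hypothetical toy baskets, not the Kaggle data
baskets <- list(
  c("coke", "pepsi", "peas"),
  c("coke", "pepsi"),
  c("coke", "pizza"),
  c("pizza", "lasagna"),
  c("coke", "pepsi", "pizza")
)
min_count <- 2  # assumed threshold: an itemset must occur in at least 2 baskets

# Number of baskets that contain every item of `itemset`
count_of <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# TRUE if `itemset` equals one of the sets in `sets`
is_in <- function(itemset, sets) {
  any(vapply(sets, function(s) setequal(s, itemset), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) count_of(s) >= min_count, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 0) {
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)
  # Pruning step: drop a candidate if any of its (k-1)-subsets is infrequent
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, is_in, logical(1), sets = frequent))
  }, candidates)
  # Only the surviving candidates are counted against the data
  frequent <- Filter(function(s) count_of(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

# Print each frequent itemset with its count
for (s in all_frequent) {
  cat("{", paste(s, collapse = ", "), "}", count_of(s), "\n")
}
```

In this toy run the three-item candidate {coke, pepsi, pizza} is never counted, because its subset {pepsi, pizza} already failed the threshold.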
marketrules <- apriori(market, parameter = list(support = 0.015, confidence = 0.65, minlen = 2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##        0.65    0.1    1 none FALSE            TRUE       5   0.015      2     10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 30
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 2000 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.06s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
marketrules
## set of 35 rules
summary(marketrules)
## set of 35 rules
##
## rule length distribution (lhs + rhs):sizes
##  2  3
##  1 34
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   3.000   2.971   3.000   3.000
##
## summary of quality measures:
##     support         confidence        coverage            lift           count
##  Min.   :0.0150   Min.   :0.6522   Min.   :0.01900   Min.   :2.660   Min.   : 30.0
##  1st Qu.:0.0160   1st Qu.:0.6638   1st Qu.:0.02350   1st Qu.:2.755   1st Qu.: 32.0
##  Median :0.0175   Median :0.6852   Median :0.02450   Median :2.910   Median : 35.0
##  Mean   :0.0196   Mean   :0.6989   Mean   :0.02829   Mean   :3.040   Mean   : 39.2
##  3rd Qu.:0.0195   3rd Qu.:0.7205   3rd Qu.:0.02850   3rd Qu.:3.086   3rd Qu.: 39.0
##  Max.   :0.0805   Max.   :0.8421   Max.   :0.12200   Max.   :4.459   Max.   :161.0
##
## mining info:
## data ntransactions support confidence
## market 2000 0.015 0.65
## call
## apriori(data = market, parameter = list(support = 0.015, confidence = 0.65, minlen = 2))
The total number of rules is 35.
# reorder the rules so that we are able to inspect the most meaningful ones
inspect(sort(marketrules, by = "confidence")[1:10])
##      lhs                          rhs     support confidence coverage lift     count
## [1]  {peas, pepsi}             => {coke}  0.0160  0.8421053  0.0190   3.494213 32
## [2]  {chicken.tikka, potatoes} => {pizza} 0.0190  0.7916667  0.0240   3.224711 38
## [3]  {7up, milk}               => {coke}  0.0170  0.7906977  0.0215   3.280903 34
## [4]  {7up, potatoes}           => {coke}  0.0175  0.7777778  0.0225   3.227294 35
## [5]  {bulmers, lasagna}        => {pizza} 0.0230  0.7419355  0.0310   3.022140 46
## [6]  {newspaper, pepsi}        => {coke}  0.0155  0.7380952  0.0210   3.062636 31
## [7]  {bread, chicken.tikka}    => {pizza} 0.0180  0.7346939  0.0245   2.992643 36
## [8]  {pepsi, potatoes}         => {coke}  0.0205  0.7321429  0.0280   3.037937 41
## [9]  {fosters, twix}           => {mars}  0.0155  0.7209302  0.0215   2.966791 31
## [10] {pepsi, twix}             => {coke}  0.0180  0.7200000  0.0250   2.987552 36
inspect(sort(marketrules, by = "lift")[1:10])
##      lhs                          rhs       support confidence coverage lift     count
## [1]  {lasagna, red.wine}       => {bulmers} 0.0160  0.6666667  0.0240   4.459309 32
## [2]  {bread, tea}              => {cheese}  0.0155  0.6595745  0.0235   4.071447 31
## [3]  {ham, mayonnaise}         => {cheese}  0.0160  0.6530612  0.0245   4.031242 32
## [4]  {instant.coffee, mars}    => {milk}    0.0195  0.6724138  0.0290   3.664380 39
## [5]  {peas, pepsi}             => {coke}    0.0160  0.8421053  0.0190   3.494213 32
## [6]  {7up, milk}               => {coke}    0.0170  0.7906977  0.0215   3.280903 34
## [7]  {7up, potatoes}           => {coke}    0.0175  0.7777778  0.0225   3.227294 35
## [8]  {chicken.tikka, potatoes} => {pizza}   0.0190  0.7916667  0.0240   3.224711 38
## [9]  {instant.coffee, pizza}   => {lasagna} 0.0175  0.6730769  0.0260   3.108900 35
## [10] {newspaper, pepsi}        => {coke}    0.0155  0.7380952  0.0210   3.062636 31
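We see that 84.21% of the baskets containing peas and pepsi also contain coke, and over 79% of the baskets containing chicken.tikka and potatoes also contain pizza. Lift is highest for the rule {lasagna, red.wine} => {bulmers} (4.46), followed by {bread, tea} => {cheese} (4.07) and {ham, mayonnaise} => {cheese} (4.03), which implies a strong positive association between these items.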
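As a quick consistency check, lift can be recomputed from the numbers already printed in this report, since lift equals the rule's confidence divided by the support of its right-hand side. The sketch below only uses the item counts from summary(market) and the confidences from the inspect() output; the variable names are chosen here for illustration.

```r
n_transactions <- 2000

# Item counts taken from the summary(market) output
item_count <- c(pizza = 491, coke = 482)

# {peas, pepsi} => {coke}, confidence from the inspect() output
conf_coke <- 0.8421053
lift_coke <- conf_coke / (item_count[["coke"]] / n_transactions)

# {chicken.tikka, potatoes} => {pizza}, confidence from the inspect() output
conf_pizza <- 0.7916667
lift_pizza <- conf_pizza / (item_count[["pizza"]] / n_transactions)

print(round(c(coke = lift_coke, pizza = lift_pizza), 4))
```

Both values agree with the lift column reported by inspect() for these two rules.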
Scatter-Plot
library(arulesViz)
library(plotly)
plot(marketrules, engine="plotly")
The plot above shows that the rules with the highest lift have low support and a confidence close to the 0.65 threshold.
Two-key plot
plot(marketrules, method = "two-key plot", engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The two-key plot above also shows the number of items in each rule. Of the 35 rules, 34 contain 3 items and 1 contains 2 items.
Grouped matrix-based visualization
# reorder is not among the control parameters of the grouped method, so it is omitted
plot(marketrules, method="grouped")
The balloon plot above shows the antecedent groups (LHS) as columns and the consequents (RHS) as rows. The group containing the most interesting rules according to lift is shown in the leftmost column. It contains 1 rule, whose consequent is bulmers.
Graph-based visualization
# main is not a control parameter of the graph method, so the title argument is omitted
plot(marketrules, method="graph", measure="support", shading="lift")
The graph above shows the discovered rules between products. The arrows show the direction of each rule, from the antecedent items to the consequent. With measure="support" and shading="lift", the size of a vertex represents support and its color represents lift.
Parallel Coordinates Plot
marketrules2<-head(marketrules, n=10, by="lift")
plot(marketrules2, method="paracoord", control=list(reorder=TRUE))
The positions refer to the items in the LHS, where 2 is the most recent addition to the basket and 1 is the item people already had. The parallel coordinates plot indicates that when people buy mayonnaise and ham, they are also likely to buy cheese.
Association rules are an extremely useful tool for studying patterns of behavior, and their use is not limited to market basket analysis. The results of the Apriori algorithm used in this report are easy to understand and interpret. Another advantage is that the algorithm works well with large data sets, which makes it possible to extract useful information that is usually hard to find when the data has many dimensions. For the analysed dataset, the behavior of customers reflects their eating habits to some extent, and some rules suggest an intention to prepare a specific meal.