Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies strong rules in databases using measures of interestingness. Given transactions that each contain a variety of items, association rules are meant to describe how or why certain items are connected. (Wikipedia)
This dataset is the Groceries market basket data. It is described as containing 9835 customer transactions and 169 unique items; note that the file as read in this report yields 7932 transactions and 167 items, according to the apriori() log further down.
This paper applies association rules to this dataset to find the shopping combinations customers prefer most, and to help the store decide how best to combine its products.
As the chart below shows, whole milk, soda, rolls/buns, yogurt, bottled water, shopping bags, tropical fruit, canned beer, sausage and pastry are the ten most frequently purchased items.
library(arules)
library(arulesViz)
setwd("E:/Master of Data science/6-unsupervised learing/7-ULproject/project3-Association rules")
trans1<-read.transactions("groceries.csv", format="basket", sep=",", skip=0)
itemFrequency(trans1, type="relative")
## abrasive cleaner artif. sweetener baby cosmetics
## 0.0023953606 0.0027735754 0.0006303580
## bags baking powder bathroom cleaner
## 0.0005042864 0.0128593041 0.0021432173
## beef berries beverages
## 0.0405950580 0.0284921836 0.0258446798
## bottled beer bottled water brandy
## 0.0798033283 0.1062783661 0.0044125063
## brown bread butter butter milk
## 0.0572365103 0.0438729198 0.0218103883
## cake bar candles candy
## 0.0117246596 0.0081946546 0.0284921836
## canned beer canned fish canned fruit
## 0.0850983359 0.0123550177 0.0026475038
## canned vegetables cat food cereals
## 0.0075642965 0.0208018154 0.0045385779
## chewing gum chicken chocolate
## 0.0204236006 0.0310136157 0.0457639939
## chocolate marshmallow citrus fruit cleaner
## 0.0086989410 0.0668179526 0.0042864347
## cling film/bags cocoa drinks coffee
## 0.0100857287 0.0021432173 0.0553454362
## condensed milk cooking chocolate cookware
## 0.0095814423 0.0022692890 0.0027735754
## cream cream cheese curd
## 0.0007564297 0.0321482602 0.0447554211
## curd cheese decalcifier dental care
## 0.0036560767 0.0012607161 0.0046646495
## dessert detergent dish cleaner
## 0.0316439738 0.0158850227 0.0100857287
## dishes dog food domestic eggs
## 0.0143721634 0.0078164397 0.0510590015
## female sanitary products finished products fish
## 0.0057992940 0.0055471508 0.0027735754
## flour flower (seeds) flower soil/fertilizer
## 0.0137418053 0.0081946546 0.0021432173
## frankfurter frozen chicken frozen dessert
## 0.0526979324 0.0007564297 0.0088250126
## frozen fish frozen fruits frozen meals
## 0.0086989410 0.0005042864 0.0258446798
## frozen potato products frozen vegetables fruit/vegetable juice
## 0.0071860817 0.0375693394 0.0635400908
## grapes hair spray ham
## 0.0165153807 0.0011346445 0.0209278870
## hamburger meat hard cheese herbs
## 0.0240796773 0.0186585981 0.0105900151
## honey house keeping products hygiene articles
## 0.0015128593 0.0069339385 0.0289964700
## ice cream instant coffee Instant food products
## 0.0247100353 0.0068078669 0.0065557237
## jam ketchup kitchen towels
## 0.0044125063 0.0034039334 0.0046646495
## kitchen utensil light bulbs liqueur
## 0.0003782148 0.0035300050 0.0010085729
## liquor liquor (appetizer) liver loaf
## 0.0121028744 0.0080685830 0.0044125063
## long life bakery product make up remover male cosmetics
## 0.0331568331 0.0008825013 0.0046646495
## margarine mayonnaise meat
## 0.0481593545 0.0069339385 0.0196671710
## meat spreads misc. beverages mustard
## 0.0041603631 0.0282400403 0.0108421583
## napkins newspapers nut snack
## 0.0470247100 0.0750126072 0.0028996470
## nuts/prunes oil onions
## 0.0030257186 0.0224407463 0.0208018154
## organic products organic sausage packaged fruit/vegetables
## 0.0012607161 0.0021432173 0.0122289460
## pasta pastry pet care
## 0.0133635905 0.0823247605 0.0093292990
## photo/film pickled vegetables pip fruit
## 0.0100857287 0.0142460918 0.0613968734
## popcorn pork potato products
## 0.0069339385 0.0446293495 0.0025214322
## potted plants preservation products processed cheese
## 0.0160110943 0.0001260716 0.0137418053
## prosecco pudding powder ready soups
## 0.0021432173 0.0018910741 0.0015128593
## red/blush wine rice roll products
## 0.0176500252 0.0045385779 0.0068078669
## rolls/buns root vegetables rubbing alcohol
## 0.1752395361 0.0763993949 0.0007564297
## rum salad dressing salt
## 0.0036560767 0.0003782148 0.0088250126
## salty snack sauces sausage
## 0.0335350479 0.0049167927 0.0830811901
## seasonal products semi-finished bread shopping bags
## 0.0131114473 0.0155068079 0.0934190620
## skin care sliced cheese snack products
## 0.0028996470 0.0191628845 0.0026475038
## soap soda soft cheese
## 0.0026475038 0.1756177509 0.0123550177
## softener sound storage medium soups
## 0.0047907211 0.0001260716 0.0045385779
## sparkling wine specialty bar specialty cheese
## 0.0050428643 0.0269793243 0.0052950076
## specialty chocolate specialty fat specialty vegetables
## 0.0301311145 0.0031517902 0.0012607161
## spices spread cheese sugar
## 0.0040342915 0.0100857287 0.0286182552
## sweet spreads syrup tea
## 0.0081946546 0.0026475038 0.0028996470
## tidbits toilet cleaner tropical fruit
## 0.0023953606 0.0006303580 0.0856026223
## turkey UHT-milk vinegar
## 0.0051689360 0.0313918306 0.0050428643
## waffles whipped/sour cream whisky
## 0.0351739788 0.0530761473 0.0007564297
## white bread white wine whole milk
## 0.0351739788 0.0208018154 0.2240292486
## yogurt zwieback
## 0.1191376702 0.0064296520
itemFrequencyPlot(trans1, topN=30, type="relative", main="Item Frequency")
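The relative frequencies printed above are the share of transactions that contain each item. As a rough illustration of that definition, here is a minimal base-R sketch on a handful of hypothetical toy baskets (the `baskets` list and the `rel_freq` name are made up for this example and are not taken from groceries.csv):

```r
# Hypothetical toy baskets, used only to illustrate the definition
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns", "soda"),
  c("soda"),
  c("whole milk", "yogurt", "soda")
)

# Every distinct item that occurs in at least one basket
items <- sort(unique(unlist(baskets)))

# Relative frequency = share of baskets that contain the item
rel_freq <- vapply(
  items,
  function(it) mean(vapply(baskets, function(b) it %in% b, logical(1))),
  numeric(1)
)

# Most frequent items first
print(sort(rel_freq, decreasing = TRUE))
```

On real data, `itemFrequency()` from arules does this work directly on the transactions object.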
Support is the proportion of all transactions that contain a given product combination, so its value lies in the interval [0,1]. The larger the value, the more frequently the product combination occurs. \[ Support=\frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions}=P(A\cap B)\]
Confidence is the proportion of transactions containing A that also contain B, and its value also lies in the interval [0,1]. The larger the value, the more likely it is that a basket containing A also contains B, and the better suited A and B are to being sold as a bundle. Confidence is often used for cross-selling recommendations: usually, the higher the confidence, the better the recommendation.
\[ Confidence=\frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions\ with\ A}=\frac{P(A\cap B)}{P(A)}\]
Lift is the ratio of the observed frequency with which A and B are purchased together to the frequency expected if A and B were purchased independently. A lift above 1 means the two items occur together more often than independence would predict, and the larger the value, the stronger the case for selling them as a bundle. \[ Lift=\frac{Confidence}{Expected\ Confidence}=\frac{P(A\cap B)}{P(A)\,P(B)}\]
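To make the three formulas concrete, the following sketch recomputes them by hand for {yogurt} => {whole milk}. The inputs are copied from outputs in this report: the item frequencies listed earlier, and the transaction total and rule count printed by apriori() and inspect(). The variable names are only illustrative.

```r
# Figures copied from the outputs in this report
n_total  <- 7932          # transactions reported in the apriori() log
n_both   <- 332           # count reported for {yogurt} => {whole milk}
p_yogurt <- 0.1191376702  # relative item frequency of yogurt
p_milk   <- 0.2240292486  # relative item frequency of whole milk

support    <- n_both / n_total      # P(A and B)
confidence <- support / p_yogurt    # P(A and B) / P(A)
lift       <- confidence / p_milk   # P(A and B) / (P(A) * P(B))

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The results should agree, up to rounding, with the support, confidence and lift columns that inspect() prints for this rule.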
After trying several parameter values, the minimum support was set to 0.01 and the minimum confidence to 0.3. With these thresholds the rules do not consist of single items; they are combinations of two or more items, which is more useful for deciding which products to sell as bundles.
rules.basket <- apriori(trans1, parameter = list(supp = 0.01, conf = 0.3, minlen=1, maxlen=15))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 15 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 79
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 7932 transaction(s)] done [0.00s].
## sorting and recoding items ... [81 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [30 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(sort(sort(rules.basket, by ="confidence"),by="support",decreasing = TRUE),15))
## lhs rhs support confidence coverage
## [1] {yogurt} => {whole milk} 0.04185577 0.3513228 0.11913767
## [2] {root vegetables} => {whole milk} 0.03189612 0.4174917 0.07639939
## [3] {tropical fruit} => {whole milk} 0.03126576 0.3652430 0.08560262
## [4] {pastry} => {whole milk} 0.02811397 0.3415008 0.08232476
## [5] {sausage} => {rolls/buns} 0.02697932 0.3247344 0.08308119
## [6] {newspapers} => {whole milk} 0.02357539 0.3142857 0.07501261
## [7] {domestic eggs} => {whole milk} 0.02193646 0.4296296 0.05105900
## [8] {whipped/sour cream} => {whole milk} 0.02181039 0.4109264 0.05307615
## [9] {citrus fruit} => {whole milk} 0.02168432 0.3245283 0.06681795
## [10] {pip fruit} => {whole milk} 0.02054967 0.3347023 0.06139687
## [11] {curd} => {whole milk} 0.02017146 0.4507042 0.04475542
## [12] {fruit/vegetable juice} => {whole milk} 0.02004539 0.3154762 0.06354009
## [13] {butter} => {whole milk} 0.01991931 0.4540230 0.04387292
## [14] {brown bread} => {whole milk} 0.01966717 0.3436123 0.05723651
## [15] {margarine} => {whole milk} 0.01853253 0.3848168 0.04815935
## lift count
## [1] 1.568200 332
## [2] 1.863559 253
## [3] 1.630336 248
## [4] 1.524358 223
## [5] 1.853089 214
## [6] 1.402878 187
## [7] 1.917739 174
## [8] 1.834253 173
## [9] 1.448598 172
## [10] 1.494011 163
## [11] 2.011810 160
## [12] 1.408192 159
## [13] 2.026624 158
## [14] 1.533783 156
## [15] 1.717708 147
The rules are sorted by confidence and then by support in descending order, so the listing shows the 15 rules with the highest support. All but one of them have whole milk on the right-hand side, while the items on the left-hand side are varied; the left-hand sides with the highest support are yogurt and root vegetables. With confidences between about 0.31 and 0.45, roughly a third to just under half of the baskets containing these items also contain whole milk.
The store also sells butter milk, UHT-milk, and condensed milk, but whole milk appears in far more transactions (a relative frequency of 0.224, against 0.022, 0.031 and 0.010 respectively), which suggests that consumers prefer whole milk.
plot(rules.basket, method="grouped")
As the graph above shows, whole milk is the item that customers purchase most often. {curd}~{yogurt} has the largest lift value, which means that curd and yogurt are purchased together more often than would be expected if they were bought independently. {sausage}~{rolls/buns} has the largest support (0.027) among the rules that do not have whole milk on the right-hand side, so this pair accounts for a notable share of all transactions. {yogurt}~{whole milk} has the largest support overall (0.042), so it is also among the combinations customers purchase most frequently.
plot(rules.basket, method="graph")
As the graph shows, whole milk sits at the center, which means that it is an item customers often buy and that it appears in most of the rules. Root vegetables and yogurt are the items most often purchased together with whole milk.
Since the analysis above shows that whole milk is the item customers purchase most frequently, the next step is to find which items are most frequently purchased together with it.
rules_Whole_milk <- apriori(data=trans1, parameter=list(supp=0.02, conf=0.3),
                            appearance=list(default="lhs", rhs="whole milk"), control=list(verbose=FALSE))
inspect(sort(rules_Whole_milk, by='lift'))
## lhs rhs support confidence coverage
## [1] {curd} => {whole milk} 0.02017146 0.4507042 0.04475542
## [2] {domestic eggs} => {whole milk} 0.02193646 0.4296296 0.05105900
## [3] {root vegetables} => {whole milk} 0.03189612 0.4174917 0.07639939
## [4] {whipped/sour cream} => {whole milk} 0.02181039 0.4109264 0.05307615
## [5] {tropical fruit} => {whole milk} 0.03126576 0.3652430 0.08560262
## [6] {yogurt} => {whole milk} 0.04185577 0.3513228 0.11913767
## [7] {pastry} => {whole milk} 0.02811397 0.3415008 0.08232476
## [8] {pip fruit} => {whole milk} 0.02054967 0.3347023 0.06139687
## [9] {citrus fruit} => {whole milk} 0.02168432 0.3245283 0.06681795
## [10] {fruit/vegetable juice} => {whole milk} 0.02004539 0.3154762 0.06354009
## [11] {newspapers} => {whole milk} 0.02357539 0.3142857 0.07501261
## lift count
## [1] 2.011810 160
## [2] 1.917739 174
## [3] 1.863559 253
## [4] 1.834253 173
## [5] 1.630336 248
## [6] 1.568200 332
## [7] 1.524358 223
## [8] 1.494011 163
## [9] 1.448598 172
## [10] 1.408192 159
## [11] 1.402878 187
is.significant(rules_Whole_milk, trans1)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.superset(rules_Whole_milk)
## 11 x 11 sparse Matrix of class "ngCMatrix"
##
## {curd,whole milk} | . . . . . . . . . .
## {whipped/sour cream,whole milk} . | . . . . . . . . .
## {domestic eggs,whole milk} . . | . . . . . . . .
## {newspapers,whole milk} . . . | . . . . . . .
## {pip fruit,whole milk} . . . . | . . . . . .
## {fruit/vegetable juice,whole milk} . . . . . | . . . . .
## {citrus fruit,whole milk} . . . . . . | . . . .
## {pastry,whole milk} . . . . . . . | . . .
## {root vegetables,whole milk} . . . . . . . . | . .
## {tropical fruit,whole milk} . . . . . . . . . | .
## {whole milk,yogurt} . . . . . . . . . . |
The results above show that all 11 rules are significant and that no rule is a superset of another: the matrix has entries only on its diagonal. The largest lift values belong to {curd}~{whole milk} (2.01) and then {domestic eggs}~{whole milk} (1.92), but their counts (160 and 174) are comparatively low. The largest count belongs to {yogurt}~{whole milk} (332).
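As a rough cross-check of the significance result, the sketch below runs a one-sided Fisher's exact test by hand for {curd} => {whole milk}. It assumes that is.significant() relies on Fisher's exact test with a 0.01 significance level by default; that is my understanding of arules and may differ between versions. The counts are reconstructed from figures in this report (coverage and item frequency multiplied by the 7932 transactions), and the variable names are illustrative.

```r
# Counts reconstructed from the figures in this report
n_total <- 7932   # transactions reported in the apriori() log
n_curd  <- 355    # coverage of {curd}: 0.04475542 * 7932
n_milk  <- 1777   # item frequency of whole milk: 0.2240292486 * 7932
n_both  <- 160    # count reported for {curd} => {whole milk}

# 2 x 2 contingency table: rows = curd yes/no, columns = whole milk yes/no
tab <- matrix(
  c(n_both,          n_curd - n_both,
    n_milk - n_both, n_total - n_curd - n_milk + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(curd = c("yes", "no"), whole_milk = c("yes", "no"))
)

# One-sided test: do curd and whole milk co-occur more often than chance?
p_value <- fisher.test(tab, alternative = "greater")$p.value
print(p_value < 0.01)
```

A p-value below the chosen level is consistent with the TRUE reported for this rule, although the exact test and any multiple-comparison adjustment used by the package are not confirmed here.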
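# cex is not among the control parameters listed in the output below, which is presumably why they are printed
plot(rules_Whole_milk, method="graph", control=list(cex=0.9))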
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
As the chart above shows, the items most strongly associated with whole milk include domestic eggs and root vegetables, which have high lift (1.92 and 1.86), and yogurt, which has the highest support (0.042).
The Jaccard Index, also known as the Jaccard similarity coefficient,
is a statistic for comparing the similarity of finite sample sets.
It is defined as the ratio of the size of the intersection of two
sets to the size of their union, and is used to quantify the
degree of similarity between two sets. The Jaccard Index is calculated
using the following formula:
Jaccard coefficient (similarity): \[
J(X,Y)=\frac{\lvert X\cap Y \rvert}{\lvert X\cup Y \rvert}\]
Jaccard distance (dissimilarity) is 1 − Jaccard coefficient:
\[ d_J(X,Y)=1-J(X,Y)=\frac{\lvert X\cup Y \rvert-\lvert X\cap Y \rvert}{\lvert X\cup Y \rvert}\]
The Jaccard coefficient (similarity) lies in [0,1]. A value of 0 means that the two sets have no elements in common, and a value of 1 means that the two sets are identical. The larger the coefficient, the more similar the two products are; here each product is represented by the set of transactions that contain it.
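The definition can be checked on a small example. In the sketch below, each item is represented by the set of transaction IDs in which it appears; the IDs are hypothetical and chosen only for illustration.

```r
# Hypothetical transaction IDs in which each of two items appears
tx_x <- c(1, 2, 3, 5, 8)
tx_y <- c(2, 3, 4, 8, 9, 10)

# Jaccard similarity: size of the intersection over size of the union
jaccard_sim  <- length(intersect(tx_x, tx_y)) / length(union(tx_x, tx_y))

# Jaccard distance: one minus the similarity
jaccard_dist <- 1 - jaccard_sim

print(c(similarity = jaccard_sim, distance = jaccard_dist))
```

Here the two items share 3 of the 8 distinct transactions, so the similarity is 3/8 = 0.375 and the distance is 0.625.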
trans.sel<-trans1[,itemFrequency(trans1)>0.1]
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)
##            bottled water rolls/buns  soda whole milk
## rolls/buns         0.920
## soda               0.886      0.888
## whole milk         0.903      0.863 0.912
## yogurt             0.911      0.893 0.913      0.861
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
The results show that ‘bottled water’ and ‘rolls/buns’ have a Jaccard distance of 0.920, the largest value in the table, so they are the most dissimilar pair.
According to the dendrogram, the most similar pairs of goods are whole milk and yogurt (which have the smallest distance, 0.861), and bottled water and soda.
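The first merge of the dendrogram can be reproduced without the transaction data by rebuilding the distance matrix from the rounded values printed above. This is only a sketch: the object names are illustrative, and because the inputs are rounded to three decimals the full tree may differ slightly from the one drawn from `jac`.

```r
# Item labels in the order used by the printed distance matrix
items <- c("bottled water", "rolls/buns", "soda", "whole milk", "yogurt")

# Fill the lower triangle column by column with the printed Jaccard distances
m <- matrix(0, nrow = 5, ncol = 5, dimnames = list(items, items))
m[lower.tri(m)] <- c(0.920, 0.886, 0.903, 0.911,  # distances to bottled water
                     0.888, 0.863, 0.893,         # distances to rolls/buns
                     0.912, 0.913,                # distances to soda
                     0.861)                       # whole milk vs yogurt
d <- as.dist(m)

# Same linkage method as in the report
hc <- hclust(d, method = "ward.D2")

# The first row of hc$merge holds the two items joined first
first_pair <- items[abs(hc$merge[1, ])]
print(first_pair)
```

The first pair joined is the one with the smallest distance, which is the pair described above as most similar.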
Affinity is a measure of the similarity of two items and can be represented as:
\[ A(i,j)=\frac{supp(i,j)}{supp(i)+supp(j)-supp(i,j)}\]
This has the same form as the Jaccard coefficient, expressed in terms of supports. The larger the value, the more similar the two products are, and the more inclined customers are to buy them together.
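As a worked instance of the formula, the sketch below computes the affinity of whole milk and yogurt by hand from supports reported earlier in this report: the two item frequencies and the support of {yogurt} => {whole milk}. The variable names are illustrative.

```r
# Supports copied from earlier outputs in this report
supp_milk   <- 0.2240292486  # relative item frequency of whole milk
supp_yogurt <- 0.1191376702  # relative item frequency of yogurt
supp_both   <- 0.04185577    # support of {yogurt} => {whole milk}

# Affinity: joint support over the support of "either item"
affinity_milk_yogurt <- supp_both / (supp_milk + supp_yogurt - supp_both)

print(round(affinity_milk_yogurt, 3))      # affinity
print(round(1 - affinity_milk_yogurt, 3))  # corresponding Jaccard distance
```

The result is about 0.139, and one minus that value is about 0.861, which matches the Jaccard distance reported for this pair.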
a = affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
## bottled water rolls/buns soda whole milk yogurt
## bottled water 0.000 0.080 0.114 0.097 0.089
## rolls/buns 0.080 0.000 0.112 0.137 0.107
## soda 0.114 0.112 0.000 0.088 0.087
## whole milk 0.097 0.137 0.088 0.000 0.139
## yogurt 0.089 0.107 0.087 0.139 0.000
## Slot "method":
## [1] "Affinity"
par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1, at=seq(0, 1, length.out=ncol(a)), labels=rownames(a), cex.axis=0.5)
axis(2, at=seq(0, 1, length.out=ncol(a)), labels=rownames(a), cex.axis=0.6, las=2)
In the graph above, the darker the color of a cell, the more likely it is that the two items will be purchased together. The graph shows that consumers often buy together whole milk and rolls/buns, whole milk and yogurt, soda and bottled water, and soda and rolls/buns.
Based on the above analysis, the product most frequently purchased by the customers of this store is whole milk. The products most frequently purchased along with whole milk are yogurt and rolls/buns. Other frequently purchased combinations include soda and bottled water, and soda and rolls/buns.
The owner of the store can adjust the product placement based on the results of the above analysis.