The data used in the following analysis covers one week of transactions from a well-known chain of marts in the south of France.
Set working directory and read the data.
setwd("C:\\Users\\admin\\Desktop\\shubhangi")
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
head(dataset)
The rows represent the transactions and the columns hold the items purchased in each transaction.
For example, the second basket (transaction) contains burgers, meatballs and eggs.
no_basket=dim(dataset)[1]
no_basket
## [1] 7501
most_item=dim(dataset)[2]
most_item
## [1] 20
The total number of baskets sold (transactions) in one week is 7501.
The largest basket holds 20 items, since the data frame has one column per item slot.
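The column count only gives the size of the largest basket. As a minimal sketch, the size of every basket can be read off a data frame like this one by counting the non-empty cells in each row. The toy data frame below is a hypothetical stand-in for the real file, and it assumes that short baskets are padded with empty strings, which is how read.csv typically fills missing text fields.

```r
# Hypothetical stand-in for the data frame returned by read.csv(..., header = FALSE).
# Assumption: baskets shorter than the widest one are padded with "".
toy <- data.frame(
  V1 = c("shrimp", "burgers", "chutney"),
  V2 = c("almonds", "meatballs", ""),
  V3 = c("avocado", "eggs", ""),
  stringsAsFactors = FALSE
)

# Count the non-empty cells in each row to get the basket sizes.
basket_sizes <- rowSums(toy != "")
basket_sizes        # sizes 3, 3 and 1
max(basket_sizes)   # 3, the largest basket
```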
Data Preprocessing
The “arules” package is used for the Apriori algorithm. It does not take a data frame or the CSV file as input directly. We have to turn the transaction file into a sparse matrix (a matrix in which very few entries are non-zero). Run the following command and see for yourself what it gives.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
dataset = read.transactions('C:\\Users\\admin\\Desktop\\shubhangi\\Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE) #read help file for read.transactions.
## distribution of transactions with duplicates:
## 1
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17
## 16 18 19 20
## 4 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
Let me take you through the summary of the dataset:
1. The 7501 transactions are the rows of the sparse matrix, and the 119 unique items across all transactions make up the columns.
2. The density is about 3.3%. That means only about 3.3% of the entries in the matrix are non-zero.
3. Most frequent items:
Mineral water is bought the highest number of times. It is in 1788 baskets, followed by eggs in 1348 baskets.
4. Itemset length distribution:
1754 baskets have only 1 item. 1358 baskets have 2 items.
5. The minimum number of items in a basket is 1 and the maximum is 20. Out of 7501 baskets, only one has the maximum of 20 items.
6. On average people buy about 4 items, since the mean is 3.9.
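The density and the mean basket size in the summary are two views of the same count of non-zero entries. The following sketch reproduces both from numbers printed in the summary output above; the variable names are only illustrative.

```r
# Item counts taken from the summary: the top five items plus "(Other)".
item_counts <- c(1788, 1348, 1306, 1282, 1229, 22405)
n_transactions <- 7501   # rows of the sparse matrix
n_items <- 119           # columns of the sparse matrix

# Every purchased item is one non-zero entry in the matrix.
nonzero <- sum(item_counts)

# Density: share of non-zero entries among all cells.
density <- nonzero / (n_transactions * n_items)
density                    # about 0.0329

# Mean basket size: non-zero entries per transaction.
mean_size <- nonzero / n_transactions
mean_size                  # about 3.914
```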
Plotting 10 most frequently bought items
itemFrequencyPlot(dataset, topN = 10, col = 1:10) #plotting top 10 items only.
Mineral water is the most frequently bought item, followed by eggs, spaghetti and french fries.
The y-axis of the plot shows the relative item frequency. In market basket analysis terms, it is called support.
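As a quick check on the y-axis values, the support of a single item is its basket count divided by the number of transactions. The sketch below uses the five counts reported by summary(dataset); it does not need the arules package.

```r
# Basket counts of the five most frequent items, from the summary output.
counts <- c("mineral water" = 1788, eggs = 1348, spaghetti = 1306,
            "french fries" = 1282, chocolate = 1229)
n_transactions <- 7501

# Support = relative item frequency.
support <- counts / n_transactions
round(support, 3)
```

Mineral water, for instance, comes out at a support of roughly 0.24, meaning it appears in about a quarter of all baskets.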
The rules are mined in a single line of code using the apriori function, as shown below. Please go through the help file of “apriori”.
Note that we give the minimum value of support. Hence, the itemsets that appear in the rules will have a support at least as high as the one specified below.
We don’t want to consider products with low support, because they are bought less frequently. We want to increase revenue, so we only consider products that are bought often.
Choosing the support depends entirely on business goals. Suppose we want to consider products that are bought at least 3 or 4 times a day. If we can find association rules around these products, we can place them together in the mart, and this can boost sales.
Say we want products purchased 3 times a day. That means 7 x 3 = 21 times a week. Remember that the dataset holds the transactions of the mart over a period of one week.
Hence the support of a product purchased 3 times a day is 7*3/7500 = 0.0028, which we round to 0.003. Let us start with this value of support and change it if we don’t get good results. We don’t want a very high confidence, because then we will get very obvious rules: rules that don’t need ML to understand.
Let’s start with the default value of confidence, 0.8. This is a very high confidence, so we may get only obvious rules, if any.
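The support arithmetic above can be wrapped in a few lines so that it is easy to redo for a different business goal. This is only a sketch; purchases_per_day and the other names are illustrative and not part of arules.

```r
# Business goal (assumption for illustration): at least 3 purchases per day.
purchases_per_day <- 3
days_in_data <- 7        # the dataset covers one week
n_transactions <- 7501

# Minimum number of baskets an itemset must appear in during the week.
min_count <- purchases_per_day * days_in_data

# The same threshold expressed as a relative support.
min_support <- min_count / n_transactions
round(min_support, 4)    # 0.0028, rounded up to 0.003 in the text
```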
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8)) #read helpfile
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The output shows a minimum support of 0.003 and a confidence of 0.8 (as specified by us in the code). The minimum length here is 1, i.e. a rule is based on at least 1 product in the basket. We could also change it to two or more. The number of rules is zero, that is, when we trained our Apriori model the algorithm found no rules. This is because of the 0.8 confidence. It requires every rule produced by the algorithm to have a confidence of at least 80%, so each rule has to hold in at least 80% of the transactions it applies to. In layman’s terms, of the baskets that contain the products on the left-hand side of a rule, at least 80% must also contain the product on the right-hand side. That is too much to ask for.
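To make support and confidence concrete, here is a small base R sketch on five made-up baskets. The baskets and the helper contains() are hypothetical and are not taken from the dataset; the point is only to show how the two measures are counted for one rule.

```r
# Five made-up baskets (illustration only).
baskets <- list(
  c("spaghetti", "tomato sauce", "ground beef"),
  c("spaghetti", "tomato sauce"),
  c("spaghetti", "ground beef"),
  c("mineral water"),
  c("spaghetti", "tomato sauce", "ground beef", "mineral water")
)

# Hypothetical helper: TRUE if the basket contains every item in `items`.
contains <- function(basket, items) all(items %in% basket)

lhs <- c("spaghetti", "tomato sauce")   # left-hand side of the rule
rhs <- "ground beef"                    # right-hand side of the rule

n_lhs  <- sum(sapply(baskets, contains, items = lhs))          # baskets with the lhs
n_both <- sum(sapply(baskets, contains, items = c(lhs, rhs)))  # baskets with lhs and rhs

support    <- n_both / length(baskets)  # share of all baskets with the full itemset
confidence <- n_both / n_lhs            # share of lhs baskets that also hold the rhs
c(support = support, confidence = confidence)
```

In this toy data the rule holds in 2 of the 3 baskets that contain its left-hand side, so it would pass a confidence threshold of 0.4 but not one of 0.8.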
Let us change confidence to 40 %
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.02s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Note that 281 rules are created.
Visualization of the rules
We are only interested in the most significant rules, so we arrange the rules in descending order of lift.
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence lift count
## [1] {mineral water,
## whole wheat pasta} => {olive oil} 0.003866151 0.4027778 6.115863 29
## [2] {spaghetti,
## tomato sauce} => {ground beef} 0.003066258 0.4893617 4.980600 23
## [3] {french fries,
## herb & pepper} => {ground beef} 0.003199573 0.4615385 4.697422 24
## [4] {cereals,
## spaghetti} => {ground beef} 0.003066258 0.4600000 4.681764 23
## [5] {frozen vegetables,
## mineral water,
## soup} => {milk} 0.003066258 0.6052632 4.670863 23
## [6] {chocolate,
## herb & pepper} => {ground beef} 0.003999467 0.4411765 4.490183 30
## [7] {chocolate,
## mineral water,
## shrimp} => {frozen vegetables} 0.003199573 0.4210526 4.417225 24
## [8] {frozen vegetables,
## mineral water,
## olive oil} => {milk} 0.003332889 0.5102041 3.937285 25
## [9] {cereals,
## ground beef} => {spaghetti} 0.003066258 0.6764706 3.885303 23
## [10] {frozen vegetables,
## soup} => {milk} 0.003999467 0.5000000 3.858539 30
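Lift is the confidence of a rule divided by the support of its right-hand side, so a lift above 1 means the right-hand side is bought more often together with the left-hand side than on its own. As a sketch, the columns of the first rule in the table can be tied back together with a little arithmetic; only the numbers printed above are used.

```r
n_transactions <- 7501

# Values printed for rule [1]: {mineral water, whole wheat pasta} => {olive oil}
support    <- 0.003866151
confidence <- 0.4027778
lift       <- 6.115863

# Support times the number of transactions gives the count column.
rule_count <- round(support * n_transactions)   # 29

# confidence = support(rule) / support(lhs), so the lhs support is implied.
lhs_support <- support / confidence             # about 0.0096
lhs_count   <- round(lhs_support * n_transactions)

# lift = confidence / support(rhs), so the rhs support is implied as well.
rhs_support <- confidence / lift                # about 0.066
```

Read this way, olive oil is in roughly 6.6% of all baskets but in about 40% of the baskets that hold mineral water and whole wheat pasta, which is where the lift of about 6 comes from.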
1. Rule 1: If people buy mineral water and whole wheat pasta, they also buy olive oil in 40% of cases.
2. Rule 2: If people buy spaghetti and tomato sauce, they also buy ground beef in 49% of cases.
3. Rule 6: People who bought chocolate and herb & pepper also buy ground beef. That doesn’t make much sense. This rule probably ranks high because chocolate has high support: it is the fifth most bought item (see the item frequency plot). To avoid this we can change the support or the confidence. So let us reduce the confidence, so that rules which do not involve the most purchased items can also enter the ranking.
Let us change the confidence to 20%:
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1348 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Note that 1348 rules are created. That is a lot of rules. As we reduce the confidence threshold, many more rules pass it.
Let us look at the top 10 rules.
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support
## [1] {mineral water,whole wheat pasta} => {olive oil} 0.003866151
## [2] {frozen vegetables,milk,mineral water} => {soup} 0.003066258
## [3] {fromage blanc} => {honey} 0.003332889
## [4] {spaghetti,tomato sauce} => {ground beef} 0.003066258
## [5] {light cream} => {chicken} 0.004532729
## [6] {pasta} => {escalope} 0.005865885
## [7] {french fries,herb & pepper} => {ground beef} 0.003199573
## [8] {cereals,spaghetti} => {ground beef} 0.003066258
## [9] {frozen vegetables,mineral water,soup} => {milk} 0.003066258
## [10] {french fries,ground beef} => {herb & pepper} 0.003199573
## confidence lift count
## [1] 0.4027778 6.115863 29
## [2] 0.2771084 5.484407 23
## [3] 0.2450980 5.164271 25
## [4] 0.4893617 4.980600 23
## [5] 0.2905983 4.843951 34
## [6] 0.3728814 4.700812 44
## [7] 0.4615385 4.697422 24
## [8] 0.4600000 4.681764 23
## [9] 0.6052632 4.670863 23
## [10] 0.2307692 4.665768 24
1. Rule 1: If people buy mineral water and whole wheat pasta, they also buy olive oil in 40% of cases.
This rule makes sense. It relates to people who want a healthy diet: mineral water, whole wheat pasta and, of course, olive oil rather than ordinary oil. Olive oil should be placed not too far from whole wheat pasta.
2. Rule 2: If people buy frozen vegetables, milk and mineral water, they also buy soup. Again this represents the buying habits of people who want healthy meals. French people add milk to their soup.
3. Rule 9: If people buy frozen vegetables, soup and mineral water, they also buy milk. Rules 2 and 9 form a triangle-like situation. The same can be seen in rules 7 and 10. This can be clearly seen in the diagram below.
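Rules 2 and 9 are built from the same four items, which is why they share the same support and count but differ in confidence. A short sketch with the printed values shows how many baskets contain each left-hand side; the variable names are illustrative.

```r
# Both rules cover the itemset {frozen vegetables, milk, mineral water, soup}.
rule_count <- 23          # count column, identical for rules 2 and 9

conf_rule2 <- 0.2771084   # {frozen vegetables, milk, mineral water} => {soup}
conf_rule9 <- 0.6052632   # {frozen vegetables, mineral water, soup} => {milk}

# confidence = count(rule) / count(lhs), so the lhs counts are implied.
lhs_count_rule2 <- round(rule_count / conf_rule2)
lhs_count_rule9 <- round(rule_count / conf_rule9)
c(rule2 = lhs_count_rule2, rule9 = lhs_count_rule9)
```

The left-hand side of rule 2 appears in more baskets than that of rule 9, so the same 23 joint purchases translate into a lower confidence for rule 2.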
Diagrammatic representation of top 10 rules
For this we need the package called “arulesViz”.
subrules=head(sort(rules, by = 'lift'),10) #top 10 rules only
# install.packages("arulesViz")
library(arulesViz)
## Loading required package: grid
plot(subrules, method="graph", control=list(main=" ")) #the unnamed "items" entry was not a valid control parameter and only triggered a warning
The size of each circle grows with the number of baskets in the dataset for which the rule holds true. Rule number 6 has the highest count (44) and the largest circle in the above diagram.
Conclusion
1. Frozen vegetables, mineral water, soup and milk should be kept close to one another in the store, and then we can see if there is any increase in the joint sales of those four.
2. Light cream and chicken should be placed together. French people have a habit of putting light cream on chicken.
3. Fromage blanc and honey must not be placed far apart.
If the above arrangement is implemented and there is not much increase in the revenue of the mart, the parameters can be changed and new rules can be identified. The arrangement can then be redone based on these newly found association rules. This is how data analysts and mart supervisors actually work together to increase sales. It is more of a trial-and-implementation kind of method.
Note
1. This analysis can be further improved by taking into consideration demographic parameters (gender, age, etc.) of the people buying the baskets.
2. Personas can be identified from people’s buying habits, for example: health conscious, fast food lover, health freak, sweet lover, etc. Based on these personas, offer-based campaigns can be launched by the mart management.