Association analysis is the task of finding interesting relationships in large datasets. These relationships can take two forms: frequent item sets or association rules. A frequent item set is a collection of items that frequently occur together, while an association rule states that transactions containing one set of items tend to also contain another.
This section creates association rules that identify relationships between items in the dataset and provides insights from the analysis.
We start by loading the ‘arules’ library, which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns.
# Loading the arules library
#
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
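# Reading the CSV file as transactions; rm.duplicates drops items repeated within a transaction
path <- "/home/oppy/Downloads/Supermarket_Sales_Dataset II.csv"
df <- read.transactions(path, sep = ",", rm.duplicates = TRUE)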
## distribution of transactions with duplicates:
## 1
## 5
df
## transactions in sparse format with
## 7501 transactions (rows) and
## 119 items (columns)
# Verifying the object's class
# ---
# This should show us transactions as the type of data that we will need
# ---
#
class(df)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing the items in the first 5 transactions
inspect(df[1:5])
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
# Generating a summary of the transaction dataset
# ---
# This would give us some information such as the most purchased items,
# distribution of the item sets (no. of items purchased in each transaction), etc.
# ---
#
summary(df)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
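The density and the mean basket size reported in the summary can be cross-checked by hand. The sketch below is plain base R and uses only figures printed in the output above; the variable names are illustrative and not part of the original analysis.

```r
# Figures copied from the summary(df) output
n_transactions <- 7501
n_items <- 119

# Counts of the five most frequent items plus the "(Other)" bucket
item_counts <- c(1788, 1348, 1306, 1282, 1229, 22405)

# Total number of item occurrences across all transactions
total_items <- sum(item_counts)

# Density = filled cells / all cells of the transactions x items matrix
density <- total_items / (n_transactions * n_items)

# Mean number of items per transaction
mean_basket <- total_items / n_transactions

print(total_items)
print(round(density, 8))
print(round(mean_basket, 3))
```

Both values agree with the reported density of 0.03288973 and mean transaction length of 3.914, which confirms that the matrix is very sparse: on average a basket holds about 4 of the 119 items.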
# Exploring the frequency of some articles
# i.e. absolute counts for items 1 to 10, then items 5 to 10
# expressed as a percentage of the total transactions
#
itemFrequency(df[, 1:10], type = "absolute")
## almonds antioxydant juice asparagus avocado
## 153 67 36 250
## babies food bacon barbecue sauce black tea
## 34 65 81 107
## blueberries body spray
## 69 86
round(itemFrequency(df[, 5:10], type = "relative") * 100, 2)
## babies food bacon barbecue sauce black tea blueberries
## 0.45 0.87 1.08 1.43 0.92
## body spray
## 1.15
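The relative frequencies are simply the absolute counts divided by the number of transactions. As a small check in base R, using the counts printed above (the vector name `abs_counts` is illustrative):

```r
# Absolute counts for items 5 to 10, copied from the output above
n_transactions <- 7501
abs_counts <- c("babies food" = 34, "bacon" = 65, "barbecue sauce" = 81,
                "black tea" = 107, "blueberries" = 69, "body spray" = 86)

# Share of transactions containing each item, in percent
rel_pct <- round(abs_counts / n_transactions * 100, 2)
print(rel_pct)
```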
# Producing a chart of frequencies and filtering
# to consider only items with a minimum percentage
# of support / considering a top x of items
# ---
# Displaying the top 10 most common items in the transactions dataset
# and the items whose support is at least 10%
#
par(mfrow = c(1, 2))
# plot the frequency of items
itemFrequencyPlot(df, topN = 10, col = "cornflowerblue")
itemFrequencyPlot(df, support = 0.1, col = "pink")
Mineral water was the most purchased item.
# Building a model based on association rules
# using the apriori function
# ---
# We use Min Support as 0.001 and confidence as 0.8
# ---
#
rulez <- apriori(df, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rulez
## set of 74 rules
With a minimum support of 0.001 and a minimum confidence of 0.8, we obtained 74 rules.
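To get a feel for how permissive a support of 0.001 is, it helps to translate it into a number of transactions. The base R sketch below does that arithmetic; it assumes the count of 7 printed in the log is the truncated value of 0.001 × 7501, which is consistent with the rule counts reported by summary(rulez) but is not confirmed here.

```r
n_transactions <- 7501
min_support <- 0.001

# Support threshold expressed as a (fractional) number of transactions
threshold <- min_support * n_transactions

# Smallest whole count whose relative support reaches the threshold
min_count <- ceiling(threshold)

# Check which of the counts 7 and 8 satisfy support >= 0.001
candidate_counts <- c(7, 8)
passes <- candidate_counts / n_transactions >= min_support

print(threshold)
print(min_count)
print(passes)
```

In other words, an item set only needs to appear in 8 of the 7501 baskets to qualify, so the resulting rules rest on very few transactions and should be read with caution.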
# We can perform an exploration of our model
summary(rulez)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## df 7501 0.001 0.8
## call
## apriori(data = df, parameter = list(supp = 0.001, conf = 0.8))
# Observing a sample of the rules built in our model, i.e. rules 10 to 15
# ---
#
inspect(rulez[10:15])
## lhs rhs support confidence
## [1] {red wine, tomato sauce} => {chocolate} 0.001066524 0.8000000
## [2] {pancakes, tomato sauce} => {mineral water} 0.001066524 0.8000000
## [3] {chicken, protein bar} => {spaghetti} 0.001199840 0.8181818
## [4] {meatballs, whole wheat pasta} => {milk} 0.001333156 0.8333333
## [5] {red wine, soup} => {mineral water} 0.001866418 0.9333333
## [6] {turkey, whole wheat pasta} => {mineral water} 0.001466471 0.8461538
## coverage lift count
## [1] 0.001333156 4.882669 8
## [2] 0.001333156 3.356152 8
## [3] 0.001466471 4.699220 9
## [4] 0.001599787 6.430898 10
## [5] 0.001999733 3.915511 14
## [6] 0.001733102 3.549776 11
Interpretation of the above: if a shopper buys both red wine and tomato sauce, they are likely to also buy chocolate (confidence 0.8), although only 8 of the 7501 transactions contain all three items.
Since mineral water, eggs and spaghetti were the top 3 most purchased items, we create a subset of rules concerning each of them, which could inform a promotion around these items.
First we look at rules with these items on the right-hand side, which tell us the items customers bought together with each of them.
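The quality measures for the first rule, {red wine, tomato sauce} => {chocolate}, can be recomputed from raw counts. The sketch below is base R only; the counts are taken from the outputs above (8 transactions contain all three items, the coverage of 0.001333156 corresponds to 10 transactions containing the lhs, and chocolate appears in 1229 transactions). Variable names are illustrative.

```r
n_transactions <- 7501
count_rule <- 8     # transactions containing lhs and rhs together
count_lhs <- 10     # transactions containing red wine and tomato sauce
count_rhs <- 1229   # transactions containing chocolate

# Support: share of all transactions that contain the whole rule
support <- count_rule / n_transactions

# Coverage: share of all transactions that contain the lhs
coverage <- count_lhs / n_transactions

# Confidence: share of lhs transactions that also contain the rhs
confidence <- count_rule / count_lhs

# Lift: confidence relative to the baseline frequency of the rhs
lift <- confidence / (count_rhs / n_transactions)

print(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift))
```

A lift of about 4.88 means chocolate is bought roughly 4.9 times more often in baskets with red wine and tomato sauce than in baskets overall.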
# ---
#
mineral <- subset(rulez, subset = rhs %pin% "mineral water")
# Then order by confidence
mineral <- sort(mineral, by = "confidence", decreasing = TRUE)
inspect(mineral[1:5])
## lhs rhs support confidence coverage lift count
## [1] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [2] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [4] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 0.001466471 3.813809 10
## [5] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
Customers who bought the items in the lhs column were more likely to buy mineral water.
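A note on the `%pin%` operator used in the subset: in arules it matches item labels partially, whereas `%in%` matches them exactly. For a label such as "mineral water" this makes little difference, but for a short label such as "oil", which appears in this dataset next to "olive oil" and "cooking oil", the two would select different rules. The sketch below only emulates the two matching styles with base R string functions on a hand-picked vector of labels; it does not call arules itself.

```r
# A few item labels that occur in the rules above
labels <- c("oil", "olive oil", "cooking oil", "mineral water")

# Partial matching (the behaviour %pin% is emulating here):
# any label that contains the pattern is selected
partial_match <- grepl("oil", labels, fixed = TRUE)

# Exact matching (the behaviour of %in%): only the identical label is selected
exact_match <- labels %in% "oil"

print(labels[partial_match])
print(labels[exact_match])
```

When exact item names are known, `%in%` is the safer choice for avoiding accidental matches.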
# ---
#
eggs <- subset(rulez, subset = rhs %pin% "eggs")
# Then order by confidence
eggs <- sort(eggs, by = "confidence", decreasing = TRUE)
inspect(eggs)
## lhs rhs support confidence coverage
## [1] {black tea, spaghetti, turkey} => {eggs} 0.001066524 0.8888889 0.001199840
## [2] {mineral water, pasta, shrimp} => {eggs} 0.001333156 0.8333333 0.001599787
## lift count
## [1] 4.946258 8
## [2] 4.637117 10
Customers who bought the items in the lhs column were more likely to buy eggs.
# ---
#
spaghetti <- subset(rulez, subset = rhs %pin% "spaghetti")
# Then order by confidence
spaghetti <- sort(spaghetti, by = "confidence", decreasing = TRUE)
inspect(spaghetti)
## lhs rhs support confidence coverage lift count
## [1] {light cream,
## mineral water,
## shrimp} => {spaghetti} 0.001066524 0.8888889 0.001199840 5.105326 8
## [2] {ground beef,
## salmon,
## shrimp} => {spaghetti} 0.001066524 0.8888889 0.001199840 5.105326 8
## [3] {burgers,
## milk,
## salmon} => {spaghetti} 0.001066524 0.8888889 0.001199840 5.105326 8
## [4] {frozen vegetables,
## ground beef,
## mineral water,
## shrimp} => {spaghetti} 0.001733102 0.8666667 0.001999733 4.977693 13
## [5] {burgers,
## frozen vegetables,
## pancakes} => {spaghetti} 0.001466471 0.8461538 0.001733102 4.859877 11
## [6] {frozen vegetables,
## olive oil,
## tomatoes} => {spaghetti} 0.002133049 0.8421053 0.002532996 4.836624 16
## [7] {green tea,
## ground beef,
## tomato sauce} => {spaghetti} 0.001333156 0.8333333 0.001599787 4.786243 10
## [8] {frozen vegetables,
## tomatoes,
## whole wheat rice} => {spaghetti} 0.001333156 0.8333333 0.001599787 4.786243 10
## [9] {chicken,
## protein bar} => {spaghetti} 0.001199840 0.8181818 0.001466471 4.699220 9
## [10] {frozen vegetables,
## ground beef,
## mineral water,
## tomatoes} => {spaghetti} 0.001199840 0.8181818 0.001466471 4.699220 9
## [11] {bacon,
## pancakes} => {spaghetti} 0.001733102 0.8125000 0.002133049 4.666587 13
## [12] {milk,
## mineral water,
## parmesan cheese} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [13] {cooking oil,
## mineral water,
## red wine} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [14] {avocado,
## burgers,
## milk} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [15] {frozen vegetables,
## mineral water,
## olive oil,
## tomatoes} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [16] {chocolate,
## french fries,
## mineral water,
## olive oil} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
Customers who bought the items in the lhs column were more likely to buy spaghetti.
Next, we determine which items customers are likely to buy when their basket already contains one of the top 3 most common items, i.e. rules with these items on the left-hand side.
# Subset the rules
mineral1 <- subset(rulez, subset = lhs %pin% "mineral water")
# Order by confidence
mineral1 <- sort(mineral1, by = "confidence", decreasing = TRUE)
# Inspect all matching rules
inspect(mineral1)
## lhs rhs support confidence coverage lift count
## [1] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.0000000 0.001066524 7.717078 8
## [2] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [3] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
## [4] {light cream,
## mineral water,
## shrimp} => {spaghetti} 0.001066524 0.8888889 0.001199840 5.105326 8
## [5] {grated cheese,
## mineral water,
## rice} => {ground beef} 0.001066524 0.8888889 0.001199840 9.046887 8
## [6] {escalope,
## hot dogs,
## mineral water} => {milk} 0.001066524 0.8888889 0.001199840 6.859625 8
## [7] {chocolate,
## ground beef,
## milk,
## mineral water,
## spaghetti} => {frozen vegetables} 0.001066524 0.8888889 0.001199840 9.325253 8
## [8] {frozen vegetables,
## ground beef,
## mineral water,
## shrimp} => {spaghetti} 0.001733102 0.8666667 0.001999733 4.977693 13
## [9] {mineral water,
## pasta,
## shrimp} => {eggs} 0.001333156 0.8333333 0.001599787 4.637117 10
## [10] {frozen vegetables,
## ground beef,
## mineral water,
## tomatoes} => {spaghetti} 0.001199840 0.8181818 0.001466471 4.699220 9
## [11] {milk,
## mineral water,
## parmesan cheese} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [12] {cooking oil,
## mineral water,
## red wine} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [13] {frozen vegetables,
## mineral water,
## olive oil,
## tomatoes} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [14] {chocolate,
## french fries,
## mineral water,
## olive oil} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
Customers whose baskets contained mineral water (together with the other lhs items) were also likely to buy milk, shrimp, ground beef, spaghetti, etc.
# Subset the rules
eggs1 <- subset(rulez, subset = lhs %pin% "eggs")
# Order by confidence
eggs1 <- sort(eggs1, by = "confidence", decreasing = TRUE)
# Inspect all matching rules
inspect(eggs1)
## lhs rhs support confidence coverage lift count
## [1] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [2] {brownies,
## eggs,
## ground beef} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [3] {chocolate,
## eggs,
## frozen vegetables,
## ground beef} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [4] {chocolate,
## eggs,
## olive oil,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [5] {cooking oil,
## eggs,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [6] {cake,
## eggs,
## milk,
## turkey} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [7] {chocolate,
## eggs,
## milk,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
Customers whose baskets contained eggs (together with the other lhs items) were also likely to buy shrimp or mineral water.
# Subset the rules
spaghetti1 <- subset(rulez, subset = lhs %pin% "spaghetti")
# Order by confidence
spaghetti1 <- sort(spaghetti1, by = "confidence", decreasing = TRUE)
# Inspect all matching rules
inspect(spaghetti1)
## lhs rhs support confidence coverage lift count
## [1] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
## [2] {black tea,
## spaghetti,
## turkey} => {eggs} 0.001066524 0.8888889 0.001199840 4.946258 8
## [3] {chocolate,
## ground beef,
## milk,
## mineral water,
## spaghetti} => {frozen vegetables} 0.001066524 0.8888889 0.001199840 9.325253 8
## [4] {chocolate,
## frozen vegetables,
## shrimp,
## spaghetti} => {mineral water} 0.001733102 0.8666667 0.001999733 3.635831 13
## [5] {frozen vegetables,
## milk,
## shrimp,
## spaghetti} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [6] {chocolate,
## eggs,
## olive oil,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [7] {chocolate,
## milk,
## shrimp,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [8] {milk,
## spaghetti,
## strong cheese} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [9] {oil,
## shrimp,
## spaghetti} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [10] {french fries,
## milk,
## pancakes,
## spaghetti} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
Customers whose baskets contained spaghetti (together with the other lhs items) were also likely to buy mineral water, eggs or frozen vegetables.
Conclusions
Mineral water, eggs and spaghetti were the top 3 most purchased items.
Customers who bought ground beef, light cream and olive oil were more likely to buy mineral water.
Customers who bought black tea, spaghetti and turkey were more likely to buy eggs.
Customers who bought light cream, mineral water and shrimp were more likely to buy spaghetti.
Customers whose baskets contained eggs were also likely to buy shrimp or mineral water.
Customers whose baskets contained spaghetti were also likely to buy mineral water, eggs or frozen vegetables.
Customers whose baskets contained mineral water were also likely to buy milk, shrimp, ground beef, etc.
Recommendations
Curate marketing strategies around the most commonly purchased items, such as:
Package deals for the items bought together
Place the aisles for the items most commonly bought together closer to each other
Discount the prices of the most commonly bought items
Advertise the items that are most likely to be bought together