Association rule mining is an unsupervised learning technique. It is used to check whether an item in a dataset depends on another item within the same dataset. It can be described as a descriptive method for discovering existing relationships in a large dataset. These relationships are expressed in the form of rules.
Association rule mining is a major area of machine learning that is used for market basket analysis. The analysis in this project is carried out on a shopping mall dataset. The aim is to analyse the shopping behaviour of different customers and understand the relationships that exist between the various products bought.
library(data.table)
library(arules)
## Warning: package 'arules' was built under R version 4.1.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(readr)
library(arulesViz)
The data used in this analysis was obtained from Kaggle (https://www.kaggle.com/fanatiks/shopping-cart). It contains lists of items bought by different customers, and each observation shows the contents of one customer's shopping cart.
cart <- read_csv2("~/Downloads/dataset (1).csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 1499 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (34): Item1, Item2, Item3, Item4, Item5, Item6, Item7, Item8, Item9, Ite...
## lgl (1): Item35
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cart)
## # A tibble: 6 × 35
## Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11 Item12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1/1/2… pork sandw… lunch… all-… flour soda butt… vege… beef alumi… all- …
## 2 1/1/2… shamp… hand … waffl… vege… chee… mixes milk sand… laund… dishw… waffl…
## 3 2/1/2… pork soap ice c… toil… dinn… hand… spag… milk ketch… sandw… poult…
## 4 2/1/2… juice lunch… soda toil… all-… <NA> <NA> <NA> <NA> <NA> <NA>
## 5 2/1/2… pasta torti… mixes hand… toil… vege… vege… pape… veget… flour veget…
## 6 2/1/2… toile… eggs toile… vege… bage… dish… cere… pape… laund… butter cerea…
## # … with 23 more variables: Item13 <chr>, Item14 <chr>, Item15 <chr>,
## # Item16 <chr>, Item17 <chr>, Item18 <chr>, Item19 <chr>, Item20 <chr>,
## # Item21 <chr>, Item22 <chr>, Item23 <chr>, Item24 <chr>, Item25 <chr>,
## # Item26 <chr>, Item27 <chr>, Item28 <chr>, Item29 <chr>, Item39 <chr>,
## # Item31 <chr>, Item32 <chr>, Item33 <chr>, Item34 <chr>, Item35 <lgl>
From the dataset, we can see that the first column begins with the date of each transaction, attached to the first item of each observation. We will tidy up the data by removing the dates so that we have just the items bought per customer. We will also remove the last column since it is empty.
# Drop the leading date characters from the first column
cart$Item1 <- substr(cart$Item1, 9, nchar(cart$Item1))
# Strip any leftover date digits (0, 1 or 2) in a single pass
cart$Item1 <- gsub("[012]", "", cart$Item1)
# Item35 is empty, so remove it
cart$Item35 <- NULL
head(cart)
## # A tibble: 6 × 34
## Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11 Item12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 yogurt pork sandw… lunch… all-… flour soda butt… vege… beef alumi… all- …
## 2 toile… shamp… hand … waffl… vege… chee… mixes milk sand… laund… dishw… waffl…
## 3 soda pork soap ice c… toil… dinn… hand… spag… milk ketch… sandw… poult…
## 4 cerea… juice lunch… soda toil… all-… <NA> <NA> <NA> <NA> <NA> <NA>
## 5 sandw… pasta torti… mixes hand… toil… vege… vege… pape… veget… flour veget…
## 6 laund… toile… eggs toile… vege… bage… dish… cere… pape… laund… butter cerea…
## # … with 22 more variables: Item13 <chr>, Item14 <chr>, Item15 <chr>,
## # Item16 <chr>, Item17 <chr>, Item18 <chr>, Item19 <chr>, Item20 <chr>,
## # Item21 <chr>, Item22 <chr>, Item23 <chr>, Item24 <chr>, Item25 <chr>,
## # Item26 <chr>, Item27 <chr>, Item28 <chr>, Item29 <chr>, Item39 <chr>,
## # Item31 <chr>, Item32 <chr>, Item33 <chr>, Item34 <chr>
tail(cart)
## # A tibble: 6 × 34
## Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11 Item12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ice c… pasta toile… pork chee… coff… <NA> <NA> <NA> <NA> <NA> <NA>
## 2 sugar beef sandw… hand … pape… pape… all-… beef frui… coffe… beef shamp…
## 3 coffe… dinne… lunch… spagh… pasta vege… cere… dinn… soap milk eggs poult…
## 4 beef lunch… eggs poult… vege… tort… beef beef indi… dishw… shamp… dishw…
## 5 sandw… ketch… milk poult… chee… soap toil… yogu… beef waffl… sugar spagh…
## 6 soda laund… veget… shamp… vege… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # … with 22 more variables: Item13 <chr>, Item14 <chr>, Item15 <chr>,
## # Item16 <chr>, Item17 <chr>, Item18 <chr>, Item19 <chr>, Item20 <chr>,
## # Item21 <chr>, Item22 <chr>, Item23 <chr>, Item24 <chr>, Item25 <chr>,
## # Item26 <chr>, Item27 <chr>, Item28 <chr>, Item29 <chr>, Item39 <chr>,
## # Item31 <chr>, Item32 <chr>, Item33 <chr>, Item34 <chr>
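The date stripping above can also be expressed as a single regular expression. The following is a minimal sketch on made-up strings, not the actual data: it assumes that each value starts with a date made only of digits and slashes and that the item name follows immediately. The vector `raw` and its values are hypothetical.

```r
# Hypothetical raw values: a date fused to the first item name
raw <- c("1/1/2000yogurt", "12/10/2001toilet paper", "2/1/2002soda")

# Remove the leading run of digits and slashes in one pass
items <- sub("^[0-9/]+", "", raw)
print(items)
```

Anchoring the pattern with `^` means only the date prefix is removed, so digits elsewhere in a value would be left alone.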
We will view basic statistics of the data to gain a better understanding of its contents.
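# Write the cleaned items as a headerless, semicolon-separated basket file, then read it back as transactions
write.table(cart, file = "~/Downloads/cart.csv", sep = ";", row.names = FALSE, col.names = FALSE, quote = FALSE, na = "")
shopcart <- read.transactions("~/Downloads/cart.csv", format = "basket", sep = ";", skip = 0)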
## Warning in readLines(file, encoding = encoding): incomplete final line found on
## '~/Downloads/cart.csv'
## Warning in asMethod(object): removing duplicated items in transactions
nrow(shopcart)
## [1] 1499
ncol(shopcart)
## [1] 38
length(shopcart)
## [1] 1499
LIST(head(shopcart))
## [[1]]
## [1] "all- purpose" "aluminum foil" "beef"
## [4] "butter" "dinner rolls" "flour"
## [7] "ice cream" "laundry detergent" "lunch meat"
## [10] "mixes" "pork" "sandwich bags"
## [13] "shampoo" "soap" "soda"
## [16] "vegetables" "yogurt"
##
## [[2]]
## [1] "aluminum foil" "cereals"
## [3] "cheeses" "dishwashing liquid/detergent"
## [5] "hand soap" "individual meals"
## [7] "laundry detergent" "milk"
## [9] "mixes" "sandwich bags"
## [11] "shampoo" "toilet paper"
## [13] "tortillas" "vegetables"
## [15] "waffles" "yogurt"
##
## [[3]]
## [1] "bagels" "cereals" "cheeses"
## [4] "dinner rolls" "eggs" "hand soap"
## [7] "ice cream" "ketchup" "laundry detergent"
## [10] "lunch meat" "milk" "pork"
## [13] "poultry" "sandwich loaves" "shampoo"
## [16] "soap" "soda" "spaghetti sauce"
## [19] "toilet paper" "vegetables"
##
## [[4]]
## [1] "all- purpose" "cereals" "juice" "lunch meat" "soda"
## [6] "toilet paper"
##
## [[5]]
## [1] "all- purpose" "dinner rolls" "eggs" "flour"
## [5] "hand soap" "individual meals" "milk" "mixes"
## [9] "paper towels" "pasta" "pork" "poultry"
## [13] "sandwich loaves" "soda" "spaghetti sauce" "toilet paper"
## [17] "tortillas" "vegetables" "waffles" "yogurt"
##
## [[6]]
## [1] "all- purpose" "aluminum foil"
## [3] "bagels" "butter"
## [5] "cereals" "coffee/tea"
## [7] "dishwashing liquid/detergent" "eggs"
## [9] "ketchup" "laundry detergent"
## [11] "milk" "paper towels"
## [13] "pasta" "poultry"
## [15] "shampoo" "soap"
## [17] "spaghetti sauce" "toilet paper"
## [19] "vegetables"
itemFrequency(shopcart, type="relative")
## all- purpose aluminum foil
## 0.3702468 0.3862575
## bagels beef
## 0.3822548 0.3689126
## butter cereals
## 0.3729153 0.3829219
## cheeses coffee/tea
## 0.3869246 0.3795864
## dinner rolls dishwashing liquid/detergent
## 0.3782522 0.3902602
## eggs flour
## 0.3815877 0.3535690
## fruits hand soap
## 0.3682455 0.3488993
## ice cream individual meals
## 0.3895931 0.3742495
## juice ketchup
## 0.3702468 0.3555704
## laundry detergent lunch meat
## 0.3755837 0.3882588
## milk mixes
## 0.3782522 0.3749166
## paper towels pasta
## 0.3655771 0.3769179
## pork poultry
## 0.3595730 0.4089393
## sandwich bags sandwich loaves
## 0.3602402 0.3569046
## shampoo soap
## 0.3582388 0.3775851
## soda spaghetti sauce
## 0.3862575 0.3675784
## sugar toilet paper
## 0.3669113 0.3755837
## tortillas vegetables
## 0.3635757 0.7264843
## waffles yogurt
## 0.3915944 0.3782522
itemFrequency(shopcart, type="absolute")
## all- purpose aluminum foil
## 555 579
## bagels beef
## 573 553
## butter cereals
## 559 574
## cheeses coffee/tea
## 580 569
## dinner rolls dishwashing liquid/detergent
## 567 585
## eggs flour
## 572 530
## fruits hand soap
## 552 523
## ice cream individual meals
## 584 561
## juice ketchup
## 555 533
## laundry detergent lunch meat
## 563 582
## milk mixes
## 567 562
## paper towels pasta
## 548 565
## pork poultry
## 539 613
## sandwich bags sandwich loaves
## 540 535
## shampoo soap
## 537 566
## soda spaghetti sauce
## 579 551
## sugar toilet paper
## 550 563
## tortillas vegetables
## 545 1089
## waffles yogurt
## 587 567
itemFrequencyPlot(shopcart, topN=15, type="relative", main="Item frequency-Relative", col="darkmagenta")
itemFrequencyPlot(shopcart, topN=15, type="absolute", main="Item frequency-Absolute", col="darkturquoise")
image(shopcart[1:5])
image(sample(shopcart, 100))
summary(shopcart)
## transactions as itemMatrix in sparse format with
## 1499 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.3836242
##
## most frequent items:
## vegetables poultry
## 1089 613
## waffles dishwashing liquid/detergent
## 587 585
## ice cream (Other)
## 584 18394
##
## element (itemset/transaction) length distribution:
## sizes
## 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 15 57 56 53 71 74 72 79 67 72 89 86 84 105 95 94 114 78 67 36
## 24 25 26 27
## 24 7 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 10.00 15.00 14.58 19.00 27.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
From the tables, we can see that the dataset consists of 1499 rows and 38 columns, that is, 1499 transactions and 38 items that might appear in a customer’s shopping cart. Based on the relative and absolute item frequency plots, we can also see that vegetables, poultry and waffles are the three most frequently bought items.
There are three major metrics that help us gain better insight into the strength of the association between an antecedent and a consequent, both of which are lists of items. These three measures used in association rule mining are briefly discussed below.
Support gives us an idea of how frequently an item occurs across the transactions. Mathematically, support is calculated as the fraction of the total number of transactions in which the item occurs.
Support(X) = (Number of transactions containing X) / (Total number of transactions)
The result obtained from this measure helps us focus on rules that deserve further analysis. An itemset with low support means we do not have enough data on the relationship between its items, so no reliable conclusion can be drawn.
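To make the definition concrete, here is a minimal sketch in base R. The baskets are made up for illustration and the helper `support()` is a hypothetical name, not a function from `arules`.

```r
# Toy baskets (made up for illustration)
baskets <- list(
  c("bread", "eggs", "milk"),
  c("bread", "butter"),
  c("eggs", "milk"),
  c("bread", "eggs", "butter", "milk")
)

# Support of an itemset: share of baskets that contain every item in it
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

support("bread", baskets)             # bread is in 3 of the 4 baskets
support(c("bread", "eggs"), baskets)  # bread and eggs together are in 2 of 4
```

The same function works for single items and for larger itemsets, because it only checks whether all listed items are present in a basket.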
Confidence indicates how often an item appears in a transaction together with another item. For example, suppose we have baskets containing bread, eggs, butter and milk, and bread has the highest support. With confidence, we aim to find the percentage of transactions containing bread that also contain each of the other items. In summary, confidence indicates how often the rule has been found to be true: it is the percentage of transactions satisfying the antecedent in which the consequent is also satisfied. Technically, confidence is the conditional probability of the consequent given the antecedent.
confidence(X→Y)=support(X,Y)/support(X)
Lift shows the strength of a rule. It measures how much the presence of the antecedent increases the probability of finding the consequent in the cart. It is the ratio of the observed support of X and Y together to the support expected if X and Y were independent of each other.
lift(X→Y) = confidence(X→Y) / support(Y)
If lift is equal to 1, the antecedent and the consequent occur independently of each other. If lift is less than 1, the items are negatively associated: the presence of one item in a shopping basket makes the other less likely to be present, which may indicate that one is a substitute for the other. If lift is greater than 1, the items are positively associated, and the value shows how much more often they occur together than would be expected under independence.
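The confidence and lift formulas can be tried out on a small example. The sketch below uses a made-up incidence matrix (rows are baskets, columns are items); the helper names `supp()`, `conf()` and `lift()` are hypothetical and written here only for illustration.

```r
# Toy incidence matrix (made up): one row per basket, one column per item
m <- rbind(
  c(bread = TRUE,  eggs = TRUE,  butter = FALSE, milk = TRUE),
  c(bread = TRUE,  eggs = FALSE, butter = TRUE,  milk = FALSE),
  c(bread = FALSE, eggs = TRUE,  butter = FALSE, milk = TRUE),
  c(bread = TRUE,  eggs = TRUE,  butter = TRUE,  milk = TRUE)
)

# Support: share of baskets in which all the given items are present
supp <- function(items) mean(rowSums(m[, items, drop = FALSE]) == length(items))
# Confidence of x -> y: support of both divided by support of x
conf <- function(x, y) supp(c(x, y)) / supp(x)
# Lift of x -> y: confidence divided by support of y
lift <- function(x, y) conf(x, y) / supp(y)

conf("eggs", "milk")    # every basket with eggs also contains milk
lift("eggs", "milk")    # above 1: positive association
lift("bread", "eggs")   # below 1: negative association
```

Note that bread and eggs still appear together in two baskets even though their lift is below 1, so a lift under 1 signals a weaker-than-expected pairing rather than mutual exclusion.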
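We will apply the Apriori algorithm for this analysis.
Apriori is used to find the frequent itemsets in the dataset from which association rules are generated. It relies on prior knowledge of frequent itemset properties: every subset of a frequent itemset must itself be frequent, so larger candidate itemsets are built only from smaller itemsets already found to be frequent.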
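The level-wise idea behind Apriori can be sketched in a few lines of base R. This is an illustrative toy version on made-up baskets, not the optimised implementation in `arules`; the names `baskets`, `min_support`, `support()` and `contains_set()` are assumptions introduced for the example.

```r
# Toy baskets (made up for illustration)
baskets <- list(
  c("bread", "eggs", "milk"),
  c("bread", "butter"),
  c("eggs", "milk"),
  c("bread", "eggs", "butter", "milk")
)
min_support <- 0.5

# Share of baskets that contain every item of an itemset
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}
# Is itemset s one of the itemsets in the list sets?
contains_set <- function(sets, s) {
  any(vapply(sets, function(f) setequal(f, s), logical(1)))
}

# Level 1: single items that meet the support threshold
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support(s) >= min_support, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 0) {
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  # Candidate step: keep a k-itemset only if all its (k-1)-subsets are frequent
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, function(s) contains_set(frequent, s), logical(1)))
  }, combn(pool, k, simplify = FALSE))
  # Counting step: keep the candidates that meet the threshold themselves
  frequent <- Filter(function(s) support(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

# Print each frequent itemset with its support
for (s in all_frequent) {
  cat("{", paste(s, collapse = ", "), "} support =", support(s), "\n")
}
```

The pruning in the candidate step is what saves work: an itemset such as {bread, butter, eggs} is never counted, because its subset {butter, eggs} already failed the threshold.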
We will apply the Apriori algorithm without any item constraints, with minimum support set at 15% and minimum confidence at 65%.
shopcartRules <- apriori(shopcart, parameter = list(support = 0.15, confidence = 0.65, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.65 0.1 1 none FALSE TRUE 5 0.15 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 224
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1499 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [39 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
shopcartRules
## set of 39 rules
summary(shopcartRules)
## set of 39 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 37 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.051 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1514 Min. :0.7434 Min. :0.1841 Min. :1.023
## 1st Qu.:0.2829 1st Qu.:0.7665 1st Qu.:0.3646 1st Qu.:1.055
## Median :0.2875 Median :0.7755 Median :0.3749 Median :1.067
## Mean :0.2836 Mean :0.7791 Mean :0.3646 Mean :1.072
## 3rd Qu.:0.3002 3rd Qu.:0.7838 3rd Qu.:0.3819 3rd Qu.:1.079
## Max. :0.3202 Max. :0.8406 Max. :0.4089 Max. :1.157
## count
## Min. :227.0
## 1st Qu.:424.0
## Median :431.0
## Mean :425.1
## 3rd Qu.:450.0
## Max. :480.0
##
## mining info:
## data ntransactions support confidence
## shopcart 1499 0.15 0.65
## call
## apriori(data = shopcart, parameter = list(support = 0.15, confidence = 0.65, minlen = 2))
inspect(shopcartRules[1:10])
## lhs rhs support confidence coverage lift
## [1] {hand soap} => {vegetables} 0.2668446 0.7648184 0.3488993 1.052766
## [2] {flour} => {vegetables} 0.2708472 0.7660377 0.3535690 1.054445
## [3] {pork} => {vegetables} 0.2675117 0.7439703 0.3595730 1.024069
## [4] {ketchup} => {vegetables} 0.2728486 0.7673546 0.3555704 1.056258
## [5] {tortillas} => {vegetables} 0.2755170 0.7577982 0.3635757 1.043103
## [6] {sandwich loaves} => {vegetables} 0.2835223 0.7943925 0.3569046 1.093475
## [7] {shampoo} => {vegetables} 0.2768512 0.7728119 0.3582388 1.063770
## [8] {sandwich bags} => {vegetables} 0.2848566 0.7907407 0.3602402 1.088448
## [9] {fruits} => {vegetables} 0.2841895 0.7717391 0.3682455 1.062293
## [10] {butter} => {vegetables} 0.2855237 0.7656530 0.3729153 1.053915
## count
## [1] 400
## [2] 406
## [3] 401
## [4] 409
## [5] 413
## [6] 425
## [7] 415
## [8] 427
## [9] 426
## [10] 428
Support at 15% and confidence at 65% resulted in 39 rules.
plot_shopcart_rules <- plot(shopcartRules, measure=c("support","lift"), shading="confidence", main="Shop Cart Rules")
plot_shopcart_rules
plot(shopcartRules, method="graph", max.overlaps = Inf, col="darkorchid")
## Warning: Unknown control parameters: max.overlaps
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The rules with the highest values of support, confidence and lift are displayed above. Vegetables appear in all of the displayed rules because they were present in the majority of the transactions (about 73%).
inspect(sort(shopcartRules, by = "support")[1:10])
## lhs rhs support confidence
## [1] {poultry} => {vegetables} 0.3202135 0.7830343
## [2] {eggs} => {vegetables} 0.3108739 0.8146853
## [3] {aluminum foil} => {vegetables} 0.3095397 0.8013817
## [4] {yogurt} => {vegetables} 0.3082055 0.8148148
## [5] {waffles} => {vegetables} 0.3048699 0.7785349
## [6] {dishwashing liquid/detergent} => {vegetables} 0.3048699 0.7811966
## [7] {laundry detergent} => {vegetables} 0.3042028 0.8099467
## [8] {cheeses} => {vegetables} 0.3035357 0.7844828
## [9] {lunch meat} => {vegetables} 0.3015344 0.7766323
## [10] {ice cream} => {vegetables} 0.3008672 0.7722603
## coverage lift count
## [1] 0.4089393 1.077841 480
## [2] 0.3815877 1.121408 466
## [3] 0.3862575 1.103096 464
## [4] 0.3782522 1.121586 462
## [5] 0.3915944 1.071647 457
## [6] 0.3902602 1.075311 457
## [7] 0.3755837 1.114885 456
## [8] 0.3869246 1.079834 455
## [9] 0.3882588 1.069028 452
## [10] 0.3895931 1.063010 451
Vegetables are the most popular item in the store. They appear most often with poultry (480 transactions, i.e. 32%). Other rules with the highest support indicate that vegetables are frequently bought with eggs (466 transactions, i.e. 31.1%) and aluminum foil (464 transactions, i.e. 31%).
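The support and confidence values in the table above can be recomputed by hand from the counts reported in this analysis. The sketch below copies the figures from the output (1499 transactions, the absolute item frequencies, and the rule counts); the variable names are chosen here for illustration.

```r
n_transactions <- 1499  # from nrow(shopcart)

# Transactions containing the left-hand-side item (absolute item frequency)
lhs_count <- c(poultry = 613, eggs = 572, `aluminum foil` = 579)
# Transactions containing both that item and vegetables (rule count)
both_count <- c(poultry = 480, eggs = 466, `aluminum foil` = 464)

support    <- both_count / n_transactions  # share of all transactions
confidence <- both_count / lhs_count       # share of the lhs transactions
print(round(cbind(support, confidence), 7))
```

This shows why poultry has the highest support but not the highest confidence: it co-occurs with vegetables most often in absolute terms, but it is also the most frequent left-hand-side item, which lowers the ratio.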
inspect(sort(shopcartRules, by = "confidence")[1:10])
## lhs rhs support
## [1] {dishwashing liquid/detergent, poultry} => {vegetables} 0.1547698
## [2] {dinner rolls, poultry} => {vegetables} 0.1514343
## [3] {yogurt} => {vegetables} 0.3082055
## [4] {eggs} => {vegetables} 0.3108739
## [5] {laundry detergent} => {vegetables} 0.3042028
## [6] {aluminum foil} => {vegetables} 0.3095397
## [7] {sugar} => {vegetables} 0.2935290
## [8] {sandwich loaves} => {vegetables} 0.2835223
## [9] {sandwich bags} => {vegetables} 0.2848566
## [10] {cheeses} => {vegetables} 0.3035357
## confidence coverage lift count
## [1] 0.8405797 0.1841227 1.157051 232
## [2] 0.8224638 0.1841227 1.132115 227
## [3] 0.8148148 0.3782522 1.121586 462
## [4] 0.8146853 0.3815877 1.121408 466
## [5] 0.8099467 0.3755837 1.114885 456
## [6] 0.8013817 0.3862575 1.103096 464
## [7] 0.8000000 0.3669113 1.101194 440
## [8] 0.7943925 0.3569046 1.093475 425
## [9] 0.7907407 0.3602402 1.088448 427
## [10] 0.7844828 0.3869246 1.079834 455
The rules with the highest confidence show that if a customer buys dishwashing liquid/detergent and poultry, dinner rolls and poultry, or yogurt, they will also buy vegetables with a probability of 84.1%, 82.2% and 81.5% respectively.
inspect(sort(shopcartRules, by = "lift")[1:10])
## lhs rhs support
## [1] {dishwashing liquid/detergent, poultry} => {vegetables} 0.1547698
## [2] {dinner rolls, poultry} => {vegetables} 0.1514343
## [3] {yogurt} => {vegetables} 0.3082055
## [4] {eggs} => {vegetables} 0.3108739
## [5] {laundry detergent} => {vegetables} 0.3042028
## [6] {aluminum foil} => {vegetables} 0.3095397
## [7] {sugar} => {vegetables} 0.2935290
## [8] {sandwich loaves} => {vegetables} 0.2835223
## [9] {sandwich bags} => {vegetables} 0.2848566
## [10] {cheeses} => {vegetables} 0.3035357
## confidence coverage lift count
## [1] 0.8405797 0.1841227 1.157051 232
## [2] 0.8224638 0.1841227 1.132115 227
## [3] 0.8148148 0.3782522 1.121586 462
## [4] 0.8146853 0.3815877 1.121408 466
## [5] 0.8099467 0.3755837 1.114885 456
## [6] 0.8013817 0.3862575 1.103096 464
## [7] 0.8000000 0.3669113 1.101194 440
## [8] 0.7943925 0.3569046 1.093475 425
## [9] 0.7907407 0.3602402 1.088448 427
## [10] 0.7844828 0.3869246 1.079834 455
All 10 rules with the highest lift have a lift value exceeding 1. This means that vegetables are more likely to appear in a transaction that contains dishwashing liquid/detergent and poultry, dinner rolls and poultry, or yogurt than in a transaction chosen at random. The lift values are modest, however (at most about 1.16), so these associations are weak.
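The ranking by lift is identical to the ranking by confidence because every rule has vegetables as its consequent, so each lift is simply the confidence divided by the same constant, the support of vegetables. A short check using figures copied from the output above (the variable names are chosen here for illustration):

```r
# Support of vegetables: 1089 of 1499 transactions
supp_vegetables <- 1089 / 1499

# Confidence and lift of the top four rules, as reported above
confidence    <- c(0.8405797, 0.8224638, 0.8148148, 0.8146853)
reported_lift <- c(1.157051, 1.132115, 1.121586, 1.121408)

# lift = confidence / support of the consequent
lift <- confidence / supp_vegetables
print(round(lift, 6))

# Dividing by a positive constant preserves the order of the rules
identical(order(-confidence), order(-lift))
```

With a single shared consequent, lift therefore adds no ranking information beyond confidence; it only rescales it.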
The discovery of association rules is very interesting and has provided great insights not just for grocery stores but also in medicine, retail, user experience (UX) design, online platforms such as AliExpress and Instagram, and entertainment services such as Netflix, Apple Music, Amazon Prime, Spotify and YouTube.
The dataset, which details the transactions carried out by various customers, shows that the most frequently bought item is vegetables. This gives the shop management better insight into running the store and arranging the products in the grocery store.