This document is a basic analysis of association rules existing in a set of transactions of a random grocery shop. The data is taken from kaggle. The project explores association rules in the data based on the 3 key assumptions of the Apriori algorithm, those being:
Libraries
library(dplyr)
library(knitr)
library(tidyr)
library(readr)
library(lubridate)
library(ggplot2)
library(Matrix)
library(arules)
library(arulesViz)
First, the data is loaded into R. It has a clear structure, however not all rows follow it and it required somewhat significant wrangling to convert the existing format into a more usable one.
cart <- read.csv("dataset.csv", header = FALSE)
str(cart)
## 'data.frame': 1637 obs. of 32 variables:
## $ V1 : chr "1/1/2000yogurt" "1/1/2000toilet paper" "2/1/2000soda" "2/1/2000cereals" ...
## $ V2 : chr " pork" " shampoo" " pork" " juice" ...
## $ V3 : chr " sandwich bags" " hand soap" " soap" " lunch meat" ...
## $ V4 : chr " lunch meat" " waffles" " ice cream" " soda" ...
## $ V5 : chr " all- purpose" " vegetables" " toilet paper" " toilet paper" ...
## $ V6 : chr " flour" " cheeses" " dinner rolls" " all- purpose" ...
## $ V7 : chr " soda" " mixes" " hand soap" " " ...
## $ V8 : chr " butter" " milk" " spaghetti sauce" "" ...
## $ V9 : chr " vegetables" " sandwich bags" " milk" "" ...
## $ V10: chr " beef" " laundry detergent" " ketchup" "" ...
## $ V11: chr " aluminum foil" " dishwashing liquid/detergent" " sandwich loaves" "" ...
## $ V12: chr " all- purpose" " waffles" " poultry" "" ...
## $ V13: chr " dinner rolls" " individual meals" " toilet paper" "" ...
## $ V14: chr " shampoo" " hand soap" " ice cream" "" ...
## $ V15: chr " all- purpose" " vegetables" " ketchup" "" ...
## $ V16: chr " mixes" " individual meals" " vegetables" "" ...
## $ V17: chr " soap" " yogurt" " laundry detergent" "" ...
## $ V18: chr " laundry detergent" " cereals" " spaghetti sauce" "" ...
## $ V19: chr " ice cream" " shampoo" " bagels" "" ...
## $ V20: chr " dinner rolls" " vegetables" " soap" "" ...
## $ V21: chr " " " aluminum foil" " ice cream" "" ...
## $ V22: chr "" " tortillas" " shampoo" "" ...
## $ V23: chr "" " mixes" " lunch meat" "" ...
## $ V24: chr "" " " " cereals" "" ...
## $ V25: chr "" "" " spaghetti sauce" "" ...
## $ V26: chr "" "" " pork" "" ...
## $ V27: chr "" "" " vegetables" "" ...
## $ V28: chr "" "" " cheeses" "" ...
## $ V29: chr "" "" " eggs" "" ...
## $ V30: chr "" "" " vegetables" "" ...
## $ V31: chr "" "" " vegetables" "" ...
## $ V32: chr "" "" " " "" ...
Separating the date from the items in the first column. It’s unsuccessful for 138 rows where the date does not exist. Those rows need to be omitted as the transaction cannot be properly assigned to a date, or even trusted to have occurred.
# separating the date from the product in the first column -> adding transaction id
cart_clean <- cart %>%
mutate(V1 = as.character(V1)) %>%
separate(V1, into = c("Date", "V1"), sep = "(?<=\\d{4})")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 138 rows [11, 22, 29, 46,
## 62, 74, 101, 112, 123, 128, 146, 156, 169, 193, 195, 204, 221, 227, 234, 236,
## ...].
cart_clean <- cart_clean %>%
mutate(Date = if_else(!is.na(dmy(Date)), as.character(Date), NA_character_))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Date = if_else(!is.na(dmy(Date)), as.character(Date),
## NA_character_)`.
## Caused by warning:
## ! 138 failed to parse.
cart_clean <- na.omit(cart_clean) # 138 omissions
cart_clean <- cart_clean %>%
mutate(transaction_id = row_number()) %>%
select(transaction_id, everything())
head(cart_clean)
## transaction_id Date V1 V2 V3
## 1 1 1/1/2000 yogurt pork sandwich bags
## 2 2 1/1/2000 toilet paper shampoo hand soap
## 3 3 2/1/2000 soda pork soap
## 4 4 2/1/2000 cereals juice lunch meat
## 5 5 2/1/2000 sandwich loaves pasta tortillas
## 6 6 2/1/2000 laundry detergent toilet paper eggs
## V4 V5 V6 V7
## 1 lunch meat all- purpose flour soda
## 2 waffles vegetables cheeses mixes
## 3 ice cream toilet paper dinner rolls hand soap
## 4 soda toilet paper all- purpose
## 5 mixes hand soap toilet paper vegetables
## 6 toilet paper vegetables bagels dishwashing liquid/detergent
## V8 V9 V10
## 1 butter vegetables beef
## 2 milk sandwich bags laundry detergent
## 3 spaghetti sauce milk ketchup
## 4
## 5 vegetables paper towels vegetables
## 6 cereals paper towels laundry detergent
## V11 V12 V13 V14
## 1 aluminum foil all- purpose dinner rolls shampoo
## 2 dishwashing liquid/detergent waffles individual meals hand soap
## 3 sandwich loaves poultry toilet paper ice cream
## 4
## 5 flour vegetables pork poultry
## 6 butter cereals bagels paper towels
## V15 V16 V17 V18
## 1 all- purpose mixes soap laundry detergent
## 2 vegetables individual meals yogurt cereals
## 3 ketchup vegetables laundry detergent spaghetti sauce
## 4
## 5 eggs vegetables pork spaghetti sauce
## 6 shampoo toilet paper soap soap
## V19 V20 V21 V22 V23
## 1 ice cream dinner rolls
## 2 shampoo vegetables aluminum foil tortillas mixes
## 3 bagels soap ice cream shampoo lunch meat
## 4
## 5 vegetables milk waffles individual meals vegetables
## 6 pasta coffee/tea poultry bagels aluminum foil
## V24 V25 V26 V27 V28 V29
## 1
## 2
## 3 cereals spaghetti sauce pork vegetables cheeses eggs
## 4
## 5 dinner rolls all- purpose soda yogurt
## 6 butter spaghetti sauce ketchup all- purpose milk
## V30 V31 V32
## 1
## 2
## 3 vegetables vegetables
## 4
## 5
## 6
Transforming the cleaned data into a format accepted by the “arules” library. Each product in one row is assigned an artificial transaction ID and the whole data frame is transformed so that the items in transactions are tracked vertically by that ID, rather than horizontally. Additionally, unnecessary spaces are removed from appearing before the names of the first products in a transaction. A left over from the previous transformation.
# creating a transaction like data frame
transactions <- cart_clean %>%
pivot_longer(
cols = starts_with("V"),
names_to = "product_key",
values_to = "product" ) %>%
select(-product_key) %>%
mutate(product = trimws(product, which = "left")) %>%
filter(product != "")
head(transactions)
## # A tibble: 6 × 3
## transaction_id Date product
## <int> <chr> <chr>
## 1 1 1/1/2000 yogurt
## 2 1 1/1/2000 pork
## 3 1 1/1/2000 sandwich bags
## 4 1 1/1/2000 lunch meat
## 5 1 1/1/2000 all- purpose
## 6 1 1/1/2000 flour
str(transactions)
## tibble [28,936 × 3] (S3: tbl_df/tbl/data.frame)
## $ transaction_id: int [1:28936] 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr [1:28936] "1/1/2000" "1/1/2000" "1/1/2000" "1/1/2000" ...
## $ product : chr [1:28936] "yogurt" "pork" "sandwich bags" "lunch meat" ...
Finally, a transaction object is created
transaction_list <- split(transactions[["product"]], transactions[["transaction_id"]])
transaction_obj <- as(transaction_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
inspect(transaction_obj[1:5])
## items transactionID
## [1] {all- purpose,
## aluminum foil,
## beef,
## butter,
## dinner rolls,
## flour,
## ice cream,
## laundry detergent,
## lunch meat,
## mixes,
## pork,
## sandwich bags,
## shampoo,
## soap,
## soda,
## vegetables,
## yogurt} 1
## [2] {aluminum foil,
## cereals,
## cheeses,
## dishwashing liquid/detergent,
## hand soap,
## individual meals,
## laundry detergent,
## milk,
## mixes,
## sandwich bags,
## shampoo,
## toilet paper,
## tortillas,
## vegetables,
## waffles,
## yogurt} 2
## [3] {bagels,
## cereals,
## cheeses,
## dinner rolls,
## eggs,
## hand soap,
## ice cream,
## ketchup,
## laundry detergent,
## lunch meat,
## milk,
## pork,
## poultry,
## sandwich loaves,
## shampoo,
## soap,
## soda,
## spaghetti sauce,
## toilet paper,
## vegetables} 3
## [4] {all- purpose,
## cereals,
## juice,
## lunch meat,
## soda,
## toilet paper} 4
## [5] {all- purpose,
## dinner rolls,
## eggs,
## flour,
## hand soap,
## individual meals,
## milk,
## mixes,
## paper towels,
## pasta,
## pork,
## poultry,
## sandwich loaves,
## soda,
## spaghetti sauce,
## toilet paper,
## tortillas,
## vegetables,
## waffles,
## yogurt} 5
Item frequency plots of all unique items present in the every transaction and of the top 15 most frequently occuring ones, show that one clearly dominates above all others - vegetables.
itemFrequencyPlot(transaction_obj, type="absolute", main="Occurences of all items", col = "royalblue2")
itemFrequencyPlot(transaction_obj, topN = 10, type="absolute", main="Top 15 most occuring items", col = "orange")
This is the first hint that the rules of association will most
probably revolve around it. To examine that, the Apriori algorithm is
run on the data and the first set of rules is created, with a rather
high support of 10% and a confidence of 80%.
rules <- apriori(transaction_obj, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))
summary(rules)
## set of 420 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 3 417
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.993 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1074 Min. :0.8000 Min. :0.1328 Min. :1.102
## 1st Qu.:0.1214 1st Qu.:0.8099 1st Qu.:0.1481 1st Qu.:1.116
## Median :0.1264 Median :0.8217 Median :0.1541 Median :1.132
## Mean :0.1282 Mean :0.8235 Mean :0.1557 Mean :1.135
## 3rd Qu.:0.1321 3rd Qu.:0.8333 3rd Qu.:0.1601 3rd Qu.:1.148
## Max. :0.3095 Max. :0.8755 Max. :0.3809 Max. :1.206
## count
## Min. :161.0
## 1st Qu.:182.0
## Median :189.5
## Mean :192.2
## 3rd Qu.:198.0
## Max. :464.0
##
## mining info:
## data ntransactions support confidence
## transaction_obj 1499 0.1 0.8
## call
## apriori(data = transaction_obj, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))
inspect(rules[1:10])
## lhs rhs support confidence
## [1] {laundry detergent} => {vegetables} 0.3028686 0.8092692
## [2] {eggs} => {vegetables} 0.3095397 0.8126095
## [3] {yogurt} => {vegetables} 0.3082055 0.8148148
## [4] {hand soap, sandwich loaves} => {vegetables} 0.1154103 0.8357488
## [5] {hand soap, sugar} => {vegetables} 0.1167445 0.8101852
## [6] {hand soap, paper towels} => {vegetables} 0.1074049 0.8009950
## [7] {hand soap, juice} => {vegetables} 0.1114076 0.8067633
## [8] {hand soap, individual meals} => {vegetables} 0.1140761 0.8260870
## [9] {hand soap, pasta} => {vegetables} 0.1187458 0.8054299
## [10] {dinner rolls, hand soap} => {vegetables} 0.1180787 0.8009050
## coverage lift count
## [1] 0.3742495 1.114977 454
## [2] 0.3809206 1.119579 464
## [3] 0.3782522 1.122617 462
## [4] 0.1380921 1.151459 173
## [5] 0.1440961 1.116239 175
## [6] 0.1340894 1.103577 161
## [7] 0.1380921 1.111524 167
## [8] 0.1380921 1.138147 171
## [9] 0.1474316 1.109687 178
## [10] 0.1474316 1.103453 177
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules, method="grouped")
At these levels of support and confidence, all created rules have vegetables as the supercedent and with the number of rules created, evidence points to the fact that buying just about anything, leads to buying vegetables. Computing an even bigger rule set is impractical, as an extrapolation can be made that most, if not all of it, will still point to vegetables. Out of this set of rules, the most relevant ones should be considered.
With that said, only 3 rules of length 2 were created, which give an idea of the most direct precedent to supercedent relationship, with support of over 30% and both confidence and lift values in the neighborhood of other most highly rated rules, they are most common type of rules in the set. According to this data, whenever laundry detergent, eggs or yogurt appear in a basket, people also buy vegetables over 80% of the time.
Narrowing this rule set down to exclude the first 3 rules of
length 2, raising the support level and ordering by lift, paints a
clearer picture of the rest of the set.
rules2 <- apriori(transaction_obj, parameter = list(support = 0.14, confidence = 0.8, minlen = 3))
inspect(sort(rules2, by = "lift")[1:10])
## lhs rhs support confidence
## [1] {cheeses, eggs} => {vegetables} 0.1454303 0.8755020
## [2] {dinner rolls, eggs} => {vegetables} 0.1420947 0.8729508
## [3] {milk, yogurt} => {vegetables} 0.1400934 0.8677686
## [4] {eggs, poultry} => {vegetables} 0.1434290 0.8600000
## [5] {eggs, yogurt} => {vegetables} 0.1414276 0.8582996
## [6] {dishwashing liquid/detergent, eggs} => {vegetables} 0.1414276 0.8582996
## [7] {aluminum foil, yogurt} => {vegetables} 0.1454303 0.8549020
## [8] {cheeses, sandwich bags} => {vegetables} 0.1447632 0.8509804
## [9] {eggs, soda} => {vegetables} 0.1487658 0.8479087
## [10] {aluminum foil, bagels} => {vegetables} 0.1407605 0.8473896
## coverage lift count
## [1] 0.1661107 1.206229 218
## [2] 0.1627752 1.202714 213
## [3] 0.1614410 1.195575 210
## [4] 0.1667779 1.184871 215
## [5] 0.1647765 1.182529 212
## [6] 0.1647765 1.182529 212
## [7] 0.1701134 1.177847 218
## [8] 0.1701134 1.172444 217
## [9] 0.1754503 1.168213 223
## [10] 0.1661107 1.167497 211
plot(rules2)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules2, method="paracoord", control=list(reorder=TRUE))
plot(rules2, method="graph")
There is a bigger mix of rules, however eggs and yogurt appear again, this time alongside cheese, dinner rolls, milk or each other, with the highest lift values out of the entire set.
While it seems that whenever laundry detergent
appears in a basket in this shop, vegetables are purchased too, the best
predictors seem to be eggs and yogurt,
in that order, having the highest lift values of all other items,
whether by themselves or in combination with other products.
Whenever people buy eggs or yogurt, they are much more likely to buy
vegetables.