Introduction
Dataset and preprocessing
Data Description
Data Initial Visualization
Item Analysis
Apriori algorithm
Optimal of support and confidence
Execution of support and confidence
Visualize association rules
Conclusions
Market Basket Analysis is “one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy” (from the article “A Gentle Introduction on Market Basket Analysis”). Across all orders, we look for the most frequent “if-then” rules, where the “if” part is the antecedent and the “then” part is the consequent. Several indicators describe a rule:
support: the share of transactions that contain the whole item set, i.e. both the antecedent and the consequent (arules also reports the raw number of such transactions as count).
confidence: the number of transactions containing both the antecedent and the consequent, divided by the number of transactions containing the antecedent.
benchmark confidence: the number of transactions containing the consequent, divided by the number of all baskets.
lift: confidence / benchmark confidence. A lift above 1 means the consequent is more likely when the antecedent is in the basket than it is in general.
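The four indicators can be computed by hand on a tiny example. The sketch below uses base R only and five made-up baskets (the item names and the `support()` helper are illustrative assumptions, not part of the Kaggle data or of arules):

```r
# Toy baskets (hypothetical data, not taken from the Kaggle file)
baskets <- list(
  c("tea", "sugar", "milk"),
  c("tea", "sugar"),
  c("tea", "bread"),
  c("bread", "milk"),
  c("sugar", "bread")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

antecedent <- "tea"
consequent <- "sugar"

rule_support   <- support(c(antecedent, consequent))  # share with tea and sugar
confidence     <- rule_support / support(antecedent)  # share of tea baskets with sugar
benchmark_conf <- support(consequent)                 # share of all baskets with sugar
lift           <- confidence / benchmark_conf

round(c(support = rule_support, confidence = confidence,
        benchmark = benchmark_conf, lift = lift), 3)
```

Here the rule tea => sugar has support 0.4, confidence 0.667, benchmark confidence 0.6, and a lift of about 1.11, so sugar is slightly more likely in baskets that contain tea.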
library(arules) #association rules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz) #visualization for rules
library(stringr)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ tidyr::unpack() masks Matrix::unpack()
library(gridExtra) #grid graphics
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:arules':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Sys.setLanguage('en')
The market data come from Kaggle (https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis) and contain 522,064 observations of 7 variables:
BillNo: 6-digit number assigned to each transaction. Nominal.
Itemname: Product name. Nominal.
Quantity: The quantities of each product per transaction. Numeric.
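Date: The day and time when each transaction was generated. Date-time, stored as text in the form dd.mm.yyyy HH:MM.
Price: Product price. Numeric, written with a decimal comma (so it is read as character when dec = '.' is used).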
CustomerID: 5-digit number assigned to each customer. Nominal.
Country: Name of the country where each customer resides. Nominal.
setwd('D:/unsupervised_learning/market basket analysis')
data=read.csv('Assignment-1_Data.csv',sep = ';',dec = '.')
head(data,10)
## BillNo Itemname Quantity Date Price
## 1 536365 WHITE HANGING HEART T-LIGHT HOLDER 6 01.12.2010 08:26 2,55
## 2 536365 WHITE METAL LANTERN 6 01.12.2010 08:26 3,39
## 3 536365 CREAM CUPID HEARTS COAT HANGER 8 01.12.2010 08:26 2,75
## 4 536365 KNITTED UNION FLAG HOT WATER BOTTLE 6 01.12.2010 08:26 3,39
## 5 536365 RED WOOLLY HOTTIE WHITE HEART. 6 01.12.2010 08:26 3,39
## 6 536365 SET 7 BABUSHKA NESTING BOXES 2 01.12.2010 08:26 7,65
## 7 536365 GLASS STAR FROSTED T-LIGHT HOLDER 6 01.12.2010 08:26 4,25
## 8 536366 HAND WARMER UNION JACK 6 01.12.2010 08:28 1,85
## 9 536366 HAND WARMER RED POLKA DOT 6 01.12.2010 08:28 1,85
## 10 536367 ASSORTED COLOUR BIRD ORNAMENT 32 01.12.2010 08:34 1,69
## CustomerID Country
## 1 17850 United Kingdom
## 2 17850 United Kingdom
## 3 17850 United Kingdom
## 4 17850 United Kingdom
## 5 17850 United Kingdom
## 6 17850 United Kingdom
## 7 17850 United Kingdom
## 8 17850 United Kingdom
## 9 17850 United Kingdom
## 10 13047 United Kingdom
str(data)
## 'data.frame': 522064 obs. of 7 variables:
## $ BillNo : chr "536365" "536365" "536365" "536365" ...
## $ Itemname : chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ Date : chr "01.12.2010 08:26" "01.12.2010 08:26" "01.12.2010 08:26" "01.12.2010 08:26" ...
## $ Price : chr "2,55" "3,39" "2,75" "3,39" ...
## $ CustomerID: int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
The data cover December 2010 to December 2011. I group them by month, day of the week, and hour to analyze the number of orders, the number of distinct items, the sales quantity, and the number of customers.
data$Date=as.POSIXlt(data$Date,tz='','%d.%m.%Y %H:%M')
data=data%>%mutate(Month=as.factor(month(Date)))%>%mutate(Day=as.factor(wday(Date)))%>%mutate(Hour=as.factor(hour(Date)))
#month data analysis
data_month=data%>%group_by(Month)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))
#Day data analysis
data_day=data%>%group_by(Day)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))
#Hour data analysis
data_hour=data%>%group_by(Hour)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))
data_month
## # A tibble: 12 × 5
## Month Orders Items Quantities CustomerNo
## <fct> <int> <int> <int> <int>
## 1 1 1205 2531 378849 736
## 2 2 1151 2342 271019 749
## 3 3 1631 2490 341075 965
## 4 4 1487 2439 297659 849
## 5 5 1822 2460 377135 1050
## 6 6 1646 2602 331965 982
## 7 7 1623 2658 372133 941
## 8 8 1430 2582 405136 926
## 9 9 1946 2720 531240 1256
## 10 10 2225 2898 574038 1348
## 11 11 2965 2947 732610 1654
## 12 12 2532 3442 654994 1255
data_day
## # A tibble: 6 × 5
## Day Orders Items Quantities CustomerNo
## <fct> <int> <int> <int> <int>
## 1 1 2182 3311 455928 1219
## 2 2 3337 3602 811161 1577
## 3 3 3907 3586 1026577 1687
## 4 4 4071 3628 957119 1763
## 5 5 4599 3630 1161823 1985
## 6 6 3567 3537 855245 1543
data_hour
## # A tibble: 15 × 5
## Hour Orders Items Quantities CustomerNo
## <fct> <int> <int> <int> <int>
## 1 6 1 1 1 1
## 2 7 26 261 11026 25
## 3 8 537 1969 145983 418
## 4 9 1560 3070 491374 880
## 5 10 2487 3157 737498 1248
## 6 11 2601 3331 640118 1291
## 7 12 3405 3419 814060 1623
## 8 13 2897 3427 682242 1565
## 9 14 2684 3476 560618 1374
## 10 15 2643 3533 605693 1272
## 11 16 1541 3443 319643 745
## 12 17 881 3148 158068 408
## 13 18 242 2333 62114 140
## 14 19 141 1514 29828 96
## 15 20 18 611 9587 15
month01=data_month%>%ggplot(aes(x=Month,y=Orders))+
geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
labs(title = 'Orders per month')+theme_bw()
month02=data_month%>%ggplot(aes(x=Month,y=Items))+
geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
labs(title = 'Items per month')+theme_bw()
month03=data_month%>%ggplot(aes(x=Month,y=Quantities))+
geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
labs(title = 'Sales quantity per month')+theme_bw()
month04=data_month%>%ggplot(aes(x=Month,y=CustomerNo))+
geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
labs(title = 'Customer per month')+theme_bw()
grid.arrange(month01, month02, month03,month04, ncol=2)
day01=data_day%>%ggplot(aes(x=Day,y=Orders))+
geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
labs(title = 'Orders per day')+theme_bw()
day02=data_day%>%ggplot(aes(x=Day,y=Items))+
geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
labs(title = 'Items per day')+theme_bw()
day03=data_day%>%ggplot(aes(x=Day,y=Quantities))+
geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
labs(title = 'Sales quantity per day')+theme_bw()
day04=data_day%>%ggplot(aes(x=Day,y=CustomerNo))+
geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
labs(title = 'Customers per day')+theme_bw()
grid.arrange(day01, day02,day03,day04, ncol=2)
hour01=data_hour%>%ggplot(aes(x=Hour,y=Orders))+
geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
labs(title = 'Orders per hour')+theme_bw()
hour02=data_hour%>%ggplot(aes(x=Hour,y=Items))+
geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
labs(title = 'Items per hour')+theme_bw()
hour03=data_hour%>%ggplot(aes(x=Hour,y=Quantities))+
geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
labs(title = 'Sales quantity per hour')+theme_bw()
hour04=data_hour%>%ggplot(aes(x=Hour,y=CustomerNo))+
geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
labs(title = 'Customers per hour')+theme_bw()
grid.arrange(hour01, hour02, hour03, hour04, ncol=2)
data_new=data%>%drop_na()
data_new=data_new%>%group_by(BillNo)%>%summarize(paste(Itemname,collapse = ','))
data_new$BillNo=NULL
colnames(data_new)=c('items')
Next, I convert the data into transaction format so that association rules can be applied: all items bought together under the same BillNo are collapsed into a single comma-separated row, which is then written to a file and read back as transactions.
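One caveat with a comma-separated round trip: read.transactions() with sep=',' splits each line on every comma, so a product name that itself contains a comma would be read back as several separate items. A minimal sketch of an alternative that keeps names intact, using a small hypothetical data frame `toy` with the same BillNo and Itemname columns (the rows are invented for illustration):

```r
# Hypothetical rows shaped like the cleaned data (BillNo, Itemname)
toy <- data.frame(
  BillNo   = c("536365", "536365", "536366", "536366"),
  Itemname = c("WHITE METAL LANTERN", "TEA,COFFEE,SUGAR SET",
               "HAND WARMER UNION JACK", "WHITE METAL LANTERN"),
  stringsAsFactors = FALSE
)

# One character vector of unique item names per bill; commas stay inside names
baskets <- lapply(split(toy$Itemname, toy$BillNo), unique)

lengths(baskets)
# With arules attached, as(baskets, "transactions") coerces such a list
# into a transactions object without writing an intermediate file.
```

Whether this matters for the real data depends on whether any Itemname contains a comma; checking with something like `any(grepl(",", data$Itemname))` would settle it.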
#save(data_new,'transcation.csv')
write.csv(data_new,'transcationdata.csv',quote = F, row.names = F)
transaction=read.transactions('transcationdata.csv',format = 'basket',sep=',')
summary(transaction)
## transactions as itemMatrix in sparse format with
## 18164 rows (elements/itemsets/transactions) and
## 7699 columns (items) and a density of 0.002293202
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 1717 1468
## JUMBO BAG RED RETROSPOT PARTY BUNTING
## 1394 1244
## ASSORTED COLOUR BIRD ORNAMENT (Other)
## 1226 313643
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1541 857 742 741 742 693 642 632 631 565 598 517 494 519 530 509
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 457 429 467 406 385 307 303 267 233 246 226 210 212 208 164 153
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 135 139 131 106 110 87 108 91 87 86 84 62 59 67 59 58
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 57 47 61 39 39 47 41 34 27 37 29 26 27 16 24 25
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 20 27 24 21 14 20 19 13 16 15 11 15 12 6 8 14
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 13 10 8 8 11 10 13 8 6 5 5 11 5 4 4 3
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 5 5 2 4 1 4 4 2 2 2 6 3 4 3 2 1
## 113 114 116 117 118 120 121 122 123 125 126 127 131 132 133 134
## 3 1 3 3 3 1 2 2 1 3 2 2 1 1 2 1
## 140 141 142 143 145 146 147 150 154 157 168 171 177 178 180 202
## 1 2 2 1 1 2 1 1 3 2 2 2 1 1 1 1
## 204 228 236 249 250 285 320 400 419
## 1 1 1 1 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 13.00 17.66 23.00 419.00
##
## includes extended item information - examples:
## labels
## 1 1 HANGER
## 2 10 COLOUR SPACEBOY PEN
## 3 12 COLOURED PARTY BALLOONS
The summary() output has four parts:
The overall size: the transaction data contain 18,164 transactions involving 7,699 distinct items.
The most frequent items in the shopping baskets: “WHITE HANGING HEART T-LIGHT HOLDER” appears in 1,717 transactions, “REGENCY CAKESTAND 3 TIER” in 1,468 transactions, and so on.
The distribution of basket sizes: for example, 1,541 transactions contain a single item, 857 contain two items, and so on.
Summary statistics of the basket size, including quartiles and the mean. The mean shows that a shopping basket contains 17.66 items on average.
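The figures in the summary are consistent with each other, which can be checked with a few lines of arithmetic. In this sketch the three inputs are copied from the output above; the variable names are my own, and density is taken to be the share of filled cells in the transaction-by-item matrix:

```r
n_transactions <- 18164
n_items        <- 7699
density        <- 0.002293202

# Average number of items per basket implied by the density
mean_basket_size <- density * n_items

# Total number of (transaction, item) pairs in the sparse matrix
total_item_cells <- mean_basket_size * n_transactions

# The five most frequent items plus "(Other)" should add up to the same total
listed_counts <- sum(c(1717, 1468, 1394, 1244, 1226, 313643))

round(mean_basket_size, 2)
round(total_item_cells)
listed_counts
```

The implied mean basket size matches the reported mean of 17.66, and both totals come to 320,692 item occurrences.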
Before applying the Apriori algorithm, I use the itemFrequencyPlot() function to look at the most frequent items and learn more about the transactions.
#the most frequent in transactions
itemFrequencyPlot(transaction,topN=15, type='absolute',col='blue',main="Absolute Item Frequency Plot",cex.names=0.7,xlab='Item name', ylab='Frequency(absolute)')
itemFrequencyPlot(transaction,topN=15, type='relative',col='lightblue',main="Relative Item Frequency Plot",cex.names=0.7,xlab='Item name', ylab='Frequency(relative)')
The two plots show the 15 most frequent items in absolute and in relative terms.
My first step is to determine suitable thresholds for support and confidence. If the values are too high, we may get no rules at all; if they are too low, the algorithm takes longer to run and returns a large number of rules, most of which are useless. To find reasonable thresholds, I try different support and confidence values and plot how many rules each combination generates.
Support levels: 0.03, 0.02, 0.01, 0.005. According to the relative frequency plot, the most frequent single item has a support of roughly 0.09, so useful thresholds must lie well below that.
Confidence levels: 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1.
#support and confidence levels
supportlevels=c( 0.03, 0.02, 0.01, 0.005)
confidencelevels=c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
#initial rule counts, one vector per support level
rules_0.03=integer(length=9)
rules_0.02=integer(length=9)
rules_0.01=integer(length=9)
rules_0.005=integer(length=9)
#Apriori algorithm with support 0.03
for (i in 1:9){
  rules_0.03[i]= length(apriori(transaction,parameter = list(sup=supportlevels[1], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.02
for (i in 1:9){
  rules_0.02[i]= length(apriori(transaction,parameter = list(sup=supportlevels[2], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.01
for (i in 1:9){
  rules_0.01[i]= length(apriori(transaction,parameter = list(sup=supportlevels[3], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.005
for (i in 1:9){
  rules_0.005[i]= length(apriori(transaction,parameter = list(sup=supportlevels[4], conf=confidencelevels[i],target='rules')))
}
rules_0.03
## [1] 0 0 0 0 0 0 0 0 0
rules_0.02
## [1] 0 1 4 10 16 25 30 32 32
rules_0.01
## [1] 9 19 41 80 159 258 357 408 446
rules_0.005
## [1] 180 290 590 1072 1591 2086 2560 3096 3638
Next, I use qplot() to visualize how many rules are generated at each support and confidence level.
library(ggplot2)
# support level of 0.005-0.03 + different confidence level 0.1-0.9
plot1 = qplot(confidencelevels, rules_0.03, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 0.03") +
theme_bw()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
plot2 = qplot(confidencelevels, rules_0.02, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 0.02") +
theme_bw()
plot3 = qplot(confidencelevels, rules_0.01, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 0.01") +
theme_bw()
plot4 = qplot(confidencelevels, rules_0.005, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 0.005") +
theme_bw()
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
From these four plots, we can see:
Support level of 0.03: no rules are found at any confidence level.
Support level of 0.02: rules start to appear; a confidence of 0.5 gives 16 rules.
Support level of 0.01: many more rules appear; a confidence of 0.5 gives 159 rules.
Support level of 0.005: there are many rules to analyze; a confidence of 0.8 still gives 290 rules.
Therefore, I use a support level of 0.005 with a confidence level of 0.8.
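To compare the combinations at a glance, the rule counts can also be arranged in a small lookup table. This is only a sketch: the counts are copied from the four printed vectors above, and the object name `rule_counts` is my own.

```r
confidence_levels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# Rule counts copied from the printed vectors; one row per support level
rule_counts <- rbind(
  "0.03"  = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
  "0.02"  = c(0, 1, 4, 10, 16, 25, 30, 32, 32),
  "0.01"  = c(9, 19, 41, 80, 159, 258, 357, 408, 446),
  "0.005" = c(180, 290, 590, 1072, 1591, 2086, 2560, 3096, 3638)
)
colnames(rule_counts) <- confidence_levels

# Number of rules for the chosen thresholds (support 0.005, confidence 0.8)
rule_counts["0.005", "0.8"]

# Number of rules at confidence 0.5 for every support level
rule_counts[, "0.5"]
```

The chosen combination yields 290 rules, few enough to inspect while keeping the confidence high.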
Then, I’ll execute the Apriori algorithm with 0.5% support and 80% confidence.
rules=apriori(transaction, parameter = list(sup=0.005,conf=0.8,maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 90
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7699 item(s), 18164 transaction(s)] done [0.07s].
## sorting and recoding items ... [1055 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [290 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 290 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 49 115 81 39 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.441 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005010 Min. :0.8000 Min. :0.005230 Min. : 9.035
## 1st Qu.:0.006221 1st Qu.:0.8575 1st Qu.:0.006717 1st Qu.: 59.554
## Median :0.006717 Median :0.9186 Median :0.007322 Median :103.677
## Mean :0.007084 Mean :0.9079 Mean :0.007824 Mean : 84.181
## 3rd Qu.:0.007432 3rd Qu.:0.9548 3rd Qu.:0.008093 3rd Qu.:107.991
## Max. :0.021526 Max. :1.0000 Max. :0.026481 Max. :116.436
## count
## Min. : 91.0
## 1st Qu.:113.0
## Median :122.0
## Mean :128.7
## 3rd Qu.:135.0
## Max. :391.0
##
## mining info:
## data ntransactions support confidence
## transaction 18164 0.005 0.8
## call
## apriori(data = transaction, parameter = list(sup = 0.005, conf = 0.8, maxlen = 10))
The summary of the 290 rules tells us that:
49 rules involve 2 items, 115 rules involve 3 items, and so on, up to 6 rules with 6 items.
Across the 290 rules, each rule contains 3.44 items on average (left-hand and right-hand side together).
The mean support is 0.007, the mean confidence is 0.91, and the mean lift is 84.2.
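The apriori() log above reports an absolute minimum support count of 90. The sketch below reproduces that number by truncating support times the number of transactions; that arules computes it in exactly this way is an assumption on my part, and `min_count()` is a hypothetical helper.

```r
n_transactions <- 18164

# Relative support threshold expressed as a whole number of transactions
min_count <- function(support) floor(support * n_transactions)

min_count(0.005)   # 90, as reported in the log for support 0.005
min_count(0.002)   # 36, the value reported for the later runs with supp=0.002

# Smallest rule count in the summary and its relative support
91 / n_transactions
```

The smallest count among the 290 rules is 91, which corresponds to a support of about 0.00501, just above the 0.005 threshold.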
#Top 10 rules sorted by support
inspect(sort(rules, by = list('support'))[1:10])
## lhs rhs support confidence coverage lift count
## [1] {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.02152610 0.8128898 0.02648095 24.94144 391
## [2] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01794759 0.8337596 0.02152610 22.70526 326
## [3] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01794759 0.8882834 0.02020480 27.25469 326
## [4] {GREEN REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01409381 0.8126984 0.01734200 22.13172 256
## [5] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01233209 0.8682171 0.01420392 26.63901 224
## [6] {PINK REGENCY TEACUP AND SAUCER,
## REGENCY CAKESTAND 3 TIER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01194671 0.8410853 0.01420392 22.90476 217
## [7] {SHED} => {KEY FOB} 0.01161638 1.0000000 0.01161638 59.55410 211
## [8] {SET 3 RETROSPOT TEA} => {SUGAR} 0.01156133 1.0000000 0.01156133 86.49524 210
## [9] {SUGAR} => {SET 3 RETROSPOT TEA} 0.01156133 1.0000000 0.01156133 86.49524 210
## [10] {SET 3 RETROSPOT TEA} => {COFFEE} 0.01156133 1.0000000 0.01156133 65.57401 210
From the output sorted by support, we can see that:
2.15% of transactions (391) contain both “PINK REGENCY TEACUP AND SAUCER” and “GREEN REGENCY TEACUP AND SAUCER”, and the top six rules all involve Regency teacups and saucers.
1.16% of transactions (211) contain both “SHED” and “KEY FOB”.
“SUGAR” appears twice among the last rules, so in a later part I analyze how sugar is related to other products.
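The quality columns are tied together: confidence is support divided by coverage, where coverage is the support of the left-hand side. A short check on rule [1], with the numbers copied from the table above (variable names are my own):

```r
n_transactions <- 18164
count_both     <- 391          # count column of rule [1]
coverage       <- 0.02648095   # share of transactions containing the lhs

support    <- count_both / n_transactions   # share containing lhs and rhs
confidence <- support / coverage            # share of lhs baskets that also hold the rhs

# Number of baskets containing the lhs item (pink teacup and saucer)
count_lhs <- round(coverage * n_transactions)

round(support, 4)
round(confidence, 4)
count_lhs
```

So 391 of the 481 baskets that contain the pink teacup and saucer also contain the green one, which gives the reported confidence of about 0.81.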
#Top 10 rules sorted by confidence
inspect(sort(rules, by = list('confidence'))[1:10])
## lhs rhs support confidence
## [1] {ELEPHANT} => {BIRTHDAY CARD} 0.005945827 1
## [2] {RETRO SPOT} => {BIRTHDAY CARD} 0.005945827 1
## [3] {HOT PINK} => {FEATHER PEN} 0.006386259 1
## [4] {FRONT DOOR} => {KEY FOB} 0.006826690 1
## [5] {AIRLINE LOUNGE} => {METAL SIGN} 0.005725611 1
## [6] {SET 3 RETROSPOT TEA} => {SUGAR} 0.011561330 1
## [7] {SUGAR} => {SET 3 RETROSPOT TEA} 0.011561330 1
## [8] {SET 3 RETROSPOT TEA} => {COFFEE} 0.011561330 1
## [9] {SUGAR} => {COFFEE} 0.011561330 1
## [10] {BACK DOOR} => {KEY FOB} 0.010680467 1
## coverage lift count
## [1] 0.005945827 88.17476 108
## [2] 0.005945827 88.17476 108
## [3] 0.006386259 98.71739 116
## [4] 0.006826690 59.55410 124
## [5] 0.005725611 116.43590 104
## [6] 0.011561330 86.49524 210
## [7] 0.011561330 86.49524 210
## [8] 0.011561330 65.57401 210
## [9] 0.011561330 65.57401 210
## [10] 0.010680467 59.55410 194
The top 10 rules sorted by confidence all have a confidence of 1, which means that every transaction containing the item on the left also contains the item on the right. For example, every basket with “ELEPHANT” also contains “BIRTHDAY CARD”, and every basket with “HOT PINK” also contains “FEATHER PEN”. In addition, “BIRTHDAY CARD”, “KEY FOB”, and “COFFEE” each appear as the consequent of two of these rules.
#Top 10 rules sorted by lift
inspect(sort(rules, by = list('lift'))[1:10])
## lhs rhs support confidence coverage lift count
## [1] {AIRLINE LOUNGE} => {METAL SIGN} 0.005725611 1.0000000 0.005725611 116.4359 104
## [2] {HERB MARKER BASIL,
## HERB MARKER MINT,
## HERB MARKER PARSLEY,
## HERB MARKER ROSEMARY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006166043 0.9180328 0.006716582 115.7996 112
## [3] {PINK VINTAGE SPOT BEAKER} => {BLUE VINTAGE SPOT BEAKER} 0.005340233 0.8738739 0.006110989 115.0221 97
## [4] {HERB MARKER MINT,
## HERB MARKER PARSLEY,
## HERB MARKER ROSEMARY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006606474 0.9090909 0.007267122 114.6717 120
## [5] {HERB MARKER BASIL,
## HERB MARKER PARSLEY,
## HERB MARKER ROSEMARY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006441312 0.9069767 0.007101960 114.4050 117
## [6] {HERB MARKER BASIL,
## HERB MARKER MINT,
## HERB MARKER PARSLEY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006221097 0.9040000 0.006881744 114.0296 113
## [7] {HERB MARKER BASIL,
## HERB MARKER CHIVES,
## HERB MARKER MINT,
## HERB MARKER ROSEMARY,
## HERB MARKER THYME} => {HERB MARKER PARSLEY} 0.006166043 0.9911504 0.006221097 113.9447 112
## [8] {HERB MARKER PARSLEY,
## HERB MARKER ROSEMARY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006881744 0.8992806 0.007652499 113.4343 125
## [9] {HERB MARKER BASIL,
## HERB MARKER MINT,
## HERB MARKER PARSLEY,
## HERB MARKER ROSEMARY} => {HERB MARKER CHIVES} 0.006276151 0.8976378 0.006991852 113.2270 114
## [10] {HERB MARKER MINT,
## HERB MARKER PARSLEY,
## HERB MARKER THYME} => {HERB MARKER CHIVES} 0.006716582 0.8970588 0.007487338 113.1540 122
From the output sorted by lift, we can see that:
The highest lift, about 116, belongs to the rule “AIRLINE LOUNGE” => “METAL SIGN”.
The rule “PINK VINTAGE SPOT BEAKER” => “BLUE VINTAGE SPOT BEAKER” has a lift of about 115.
The remaining eight rules in the top 10 all involve HERB MARKER items.
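Because lift is confidence divided by the support of the right-hand side, a reported lift can be turned back into the number of baskets that contain the consequent. A sketch using values copied from the tables above (`rhs_count()` is a hypothetical helper):

```r
n_transactions <- 18164

# lift = confidence / support(rhs), hence support(rhs) = confidence / lift
rhs_count <- function(confidence, lift) {
  round(confidence / lift * n_transactions)
}

rhs_count(1, 116.4359)          # baskets containing METAL SIGN
rhs_count(1, 86.49524)          # baskets containing SUGAR
rhs_count(0.8738739, 115.0221)  # baskets containing BLUE VINTAGE SPOT BEAKER
```

A very high lift therefore mostly reflects a rare consequent: “METAL SIGN” appears in only 156 of the 18,164 baskets. “SUGAR” appears in 210 baskets, the same number as the count of the rule “SET 3 RETROSPOT TEA” => “SUGAR”, so in these transactions sugar never occurs without that tea set.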
To sum up, these are the rules obtained with the Apriori algorithm after selecting an appropriate support and confidence level. Finally, I visualize these association rules.
The interactive scatter-plot visualization
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
#plot top 10 rules
plot(head(rules,n=10,by='confidence'),main="Scatter plot for 10 confidence rules",engine='plotly')
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(head(rules,n=10,by='support'),main="Scatter plot for 10 support rules",engine='plotly')
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(head(rules,n=10,by='lift'),main="Scatter plot for 10 lift rules",engine='plotly')
plot(rules, method='grouped')
Next, I analyze the rules that lead to “COFFEE” and “SUGAR” to see how these two products are related to the others. Finding the rules is only part of the task. Should these products be placed on the same shelf for easier access, or in opposite corners of the shop so that clients walk through the whole store? This is where data science meets substantive expertise.
# rules to coffee
rules.to.coffee=apriori(data=transaction, parameter = list(supp=0.002,conf=0.8),
appearance = list(default='lhs',rhs='COFFEE'))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 36
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[7699 item(s), 18164 transaction(s)] done [0.08s].
## sorting and recoding items ... [1901 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.06s].
## writing ... [23 rule(s)] done [0.01s].
## creating S4 object ... done [0.00s].
inspect(head(sort(rules.to.coffee,by='support')))
## lhs rhs support confidence coverage lift count
## [1] {SET 3 RETROSPOT TEA} => {COFFEE} 0.011561330 1 0.011561330 65.57401 210
## [2] {SUGAR} => {COFFEE} 0.011561330 1 0.011561330 65.57401 210
## [3] {SET 3 RETROSPOT TEA,
## SUGAR} => {COFFEE} 0.011561330 1 0.011561330 65.57401 210
## [4] {SUGAR JARS} => {COFFEE} 0.003908831 1 0.003908831 65.57401 71
## [5] {RED SPOTTY BISCUIT TIN,
## SET 3 RETROSPOT TEA} => {COFFEE} 0.003413345 1 0.003413345 65.57401 62
## [6] {RED SPOTTY BISCUIT TIN,
## SUGAR} => {COFFEE} 0.003413345 1 0.003413345 65.57401 62
“COFFEE” is bought by consumers who also buy:
SET 3 RETROSPOT TEA -> COFFEE
SUGAR -> COFFEE
SET 3 RETROSPOT TEA + SUGAR -> COFFEE
# rules to sugar
rules.to.sugar=apriori(data=transaction, parameter = list(supp=0.002,conf=0.8),
appearance = list(default='lhs',rhs='SUGAR'),
control=list(verbose=F))
inspect(head(sort(rules.to.sugar,by='support')))
## lhs rhs support confidence coverage lift count
## [1] {SET 3 RETROSPOT TEA} => {SUGAR} 0.011561330 1 0.011561330 86.49524 210
## [2] {COFFEE,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.011561330 1 0.011561330 86.49524 210
## [3] {RED SPOTTY BISCUIT TIN,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.003413345 1 0.003413345 86.49524 62
## [4] {COFFEE,
## RED SPOTTY BISCUIT TIN} => {SUGAR} 0.003413345 1 0.003413345 86.49524 62
## [5] {COFFEE,
## RED SPOTTY BISCUIT TIN,
## SET 3 RETROSPOT TEA} => {SUGAR} 0.003413345 1 0.003413345 86.49524 62
## [6] {SET 3 RETROSPOT TEA,
## SET/5 RED RETROSPOT LID GLASS BOWLS} => {SUGAR} 0.002587536 1 0.002587536 86.49524 47
“SUGAR” is bought by consumers who also buy:
SET 3 RETROSPOT TEA -> SUGAR
SET 3 RETROSPOT TEA + COFFEE -> SUGAR
SET 3 RETROSPOT TEA + RED SPOTTY BISCUIT TIN -> SUGAR
The visualization for rhs COFFEE
plot(rules.to.coffee, method='paracoord')
plot(rules.to.coffee, method='graph',shading="lift")
plot(rules.to.coffee, method='grouped')
The visualization for rhs sugar
plot(rules.to.sugar, method="paracoord")
plot(rules.to.sugar, method="graph", shading="lift")
plot(rules.to.sugar, method="grouped")
The rules and visualizations obtained from these shopping baskets can serve as suggestions for retail owners when arranging product catalogs or planning product marketing. Used this way, association rules can help identify customer behavior, increase customer engagement, and improve the customer experience.