Introduction
Dataset and preprocessing
- Data Description
- Data Initial Visualization
- Item Analysis
Apriori algorithm
- Optimal support and confidence
- Execution of support and confidence
- Visualize association rules
Conclusions

1 Introduction

According to the article “A Gentle Introduction on Market Basket Analysis”, “Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.” Across all orders, we look for the most frequent rules of the form “if” (antecedent) “then” (consequent). The key indicators are:

support: the proportion of transactions that contain the whole item set (antecedent and consequent together).

confidence: the number of transactions containing the whole item set / the number of transactions containing the antecedent.

benchmark confidence: the number of transactions containing the consequent / the number of all baskets.

lift: confidence / benchmark confidence. A lift above 1 means the consequent appears together with the antecedent more often than its overall frequency would suggest.
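The four indicators can be checked by hand on a tiny example. The sketch below uses base R only; the baskets and the helper names `contains_all()` and `rule_metrics()` are made up for illustration and are not part of arules or of the Kaggle data.

```r
# Toy baskets (hypothetical data, not taken from the Kaggle file)
baskets <- list(
  c("tea", "sugar", "cup"),
  c("tea", "sugar"),
  c("tea", "cup"),
  c("sugar"),
  c("cup", "plate")
)

# TRUE for every basket that contains all of the given items
contains_all <- function(basket_list, items) {
  vapply(basket_list, function(b) all(items %in% b), logical(1))
}

# Indicators for the rule {antecedent} => {consequent}
rule_metrics <- function(basket_list, antecedent, consequent) {
  n      <- length(basket_list)
  n_both <- sum(contains_all(basket_list, c(antecedent, consequent)))
  n_lhs  <- sum(contains_all(basket_list, antecedent))
  n_rhs  <- sum(contains_all(basket_list, consequent))
  confidence <- n_both / n_lhs
  benchmark  <- n_rhs / n
  c(support = n_both / n,
    confidence = confidence,
    benchmark = benchmark,
    lift = confidence / benchmark)
}

m <- rule_metrics(baskets, "tea", "sugar")
print(round(m, 3))
```

In this toy data, tea and sugar occur together in 2 of 5 baskets (support 0.4), 2 of the 3 tea baskets contain sugar (confidence 2/3), and sugar is in 3 of 5 baskets (benchmark 0.6), so the lift is about 1.11.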

library(arules) #association rules

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz) #visualization for rules
library(stringr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ forcats 0.5.2 
## ✔ readr   2.1.3      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ tidyr::unpack() masks Matrix::unpack()

library(gridExtra) #grid graphics

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:arules':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Sys.setLanguage('en')

2 Dataset and preprocessing

2.1 Data Description

The market data comes from Kaggle (https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis). It contains 522,064 observations and 7 variables:

BillNo: 6-digit number assigned to each transaction. Nominal.
Itemname: Product name. Nominal.
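Quantity: The quantity of each product per transaction. Numeric.
Date: The day and time when each transaction was generated. Read as character in the format dd.mm.yyyy hh:mm and parsed later.
Price: Product price. Numeric, but written with a decimal comma, so read.csv() below reads it as character.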
CustomerID: 5-digit number assigned to each customer. Nominal.
Country: Name of the country where each customer resides. Nominal.

setwd('D:/unsupervised_learning/market basket analysis')
data=read.csv('Assignment-1_Data.csv',sep = ';',dec = '.')
head(data,10)

##    BillNo                            Itemname Quantity             Date Price
## 1  536365  WHITE HANGING HEART T-LIGHT HOLDER        6 01.12.2010 08:26  2,55
## 2  536365                 WHITE METAL LANTERN        6 01.12.2010 08:26  3,39
## 3  536365      CREAM CUPID HEARTS COAT HANGER        8 01.12.2010 08:26  2,75
## 4  536365 KNITTED UNION FLAG HOT WATER BOTTLE        6 01.12.2010 08:26  3,39
## 5  536365      RED WOOLLY HOTTIE WHITE HEART.        6 01.12.2010 08:26  3,39
## 6  536365        SET 7 BABUSHKA NESTING BOXES        2 01.12.2010 08:26  7,65
## 7  536365   GLASS STAR FROSTED T-LIGHT HOLDER        6 01.12.2010 08:26  4,25
## 8  536366              HAND WARMER UNION JACK        6 01.12.2010 08:28  1,85
## 9  536366           HAND WARMER RED POLKA DOT        6 01.12.2010 08:28  1,85
## 10 536367       ASSORTED COLOUR BIRD ORNAMENT       32 01.12.2010 08:34  1,69
##    CustomerID        Country
## 1       17850 United Kingdom
## 2       17850 United Kingdom
## 3       17850 United Kingdom
## 4       17850 United Kingdom
## 5       17850 United Kingdom
## 6       17850 United Kingdom
## 7       17850 United Kingdom
## 8       17850 United Kingdom
## 9       17850 United Kingdom
## 10      13047 United Kingdom

str(data)

## 'data.frame':    522064 obs. of  7 variables:
##  $ BillNo    : chr  "536365" "536365" "536365" "536365" ...
##  $ Itemname  : chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity  : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ Date      : chr  "01.12.2010 08:26" "01.12.2010 08:26" "01.12.2010 08:26" "01.12.2010 08:26" ...
##  $ Price     : chr  "2,55" "3,39" "2,75" "3,39" ...
##  $ CustomerID: int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country   : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

2.2 Data initial visualization

The data covers December 2010 to December 2011. I group it by month, day of the week, and hour to analyze the number of orders, the number of distinct items, the quantity sold, and the number of customers.

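#parse the timestamp (POSIXct is a plain vector, which suits data frames and dplyr)
data$Date=as.POSIXct(data$Date,tz='',format='%d.%m.%Y %H:%M')
#derive the grouping factors; wday() numbers the days from 1 = Sunday
data=data%>%mutate(Month=as.factor(month(Date)),
                   Day=as.factor(wday(Date)),
                   Hour=as.factor(hour(Date)))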
#month data analysis
data_month=data%>%group_by(Month)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))
#Day data analysis
data_day=data%>%group_by(Day)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))
#Hour data analysis
data_hour=data%>%group_by(Hour)%>%summarize(Orders=n_distinct(BillNo),Items=n_distinct(Itemname),Quantities=sum(Quantity),CustomerNo=n_distinct(CustomerID))

data_month

## # A tibble: 12 × 5
##    Month Orders Items Quantities CustomerNo
##    <fct>  <int> <int>      <int>      <int>
##  1 1       1205  2531     378849        736
##  2 2       1151  2342     271019        749
##  3 3       1631  2490     341075        965
##  4 4       1487  2439     297659        849
##  5 5       1822  2460     377135       1050
##  6 6       1646  2602     331965        982
##  7 7       1623  2658     372133        941
##  8 8       1430  2582     405136        926
##  9 9       1946  2720     531240       1256
## 10 10      2225  2898     574038       1348
## 11 11      2965  2947     732610       1654
## 12 12      2532  3442     654994       1255

data_day

## # A tibble: 6 × 5
##   Day   Orders Items Quantities CustomerNo
##   <fct>  <int> <int>      <int>      <int>
## 1 1       2182  3311     455928       1219
## 2 2       3337  3602     811161       1577
## 3 3       3907  3586    1026577       1687
## 4 4       4071  3628     957119       1763
## 5 5       4599  3630    1161823       1985
## 6 6       3567  3537     855245       1543

data_hour

## # A tibble: 15 × 5
##    Hour  Orders Items Quantities CustomerNo
##    <fct>  <int> <int>      <int>      <int>
##  1 6          1     1          1          1
##  2 7         26   261      11026         25
##  3 8        537  1969     145983        418
##  4 9       1560  3070     491374        880
##  5 10      2487  3157     737498       1248
##  6 11      2601  3331     640118       1291
##  7 12      3405  3419     814060       1623
##  8 13      2897  3427     682242       1565
##  9 14      2684  3476     560618       1374
## 10 15      2643  3533     605693       1272
## 11 16      1541  3443     319643        745
## 12 17       881  3148     158068        408
## 13 18       242  2333      62114        140
## 14 19       141  1514      29828         96
## 15 20        18   611       9587         15

Visualization of the monthly data

month01=data_month%>%ggplot(aes(x=Month,y=Orders))+
  geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
  labs(title = 'Orders per month')+theme_bw()

month02=data_month%>%ggplot(aes(x=Month,y=Items))+
  geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
  labs(title = 'Items per month')+theme_bw()

month03=data_month%>%ggplot(aes(x=Month,y=Quantities))+
  geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
  labs(title = 'Sales quantity per month')+theme_bw()

month04=data_month%>%ggplot(aes(x=Month,y=CustomerNo))+
  geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
  labs(title = 'Customers per month')+theme_bw()

grid.arrange(month01, month02, month03,month04, ncol=2)

Visualization of the daily data

day01=data_day%>%ggplot(aes(x=Day,y=Orders))+
  geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
  labs(title = 'Orders per day')+theme_bw()

day02=data_day%>%ggplot(aes(x=Day,y=Items))+
  geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
  labs(title = 'Items per day')+theme_bw()

day03=data_day%>%ggplot(aes(x=Day,y=Quantities))+
  geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
  labs(title = 'Sales quantity per day')+theme_bw()

day04=data_day%>%ggplot(aes(x=Day,y=CustomerNo))+
  geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
  labs(title = 'Customers per day')+theme_bw()

grid.arrange(day01, day02,day03,day04, ncol=2)

Visualization of the hourly data

#the four ggplot objects are arranged with grid.arrange(); par(mfrow) has no effect on ggplot2
hour01=data_hour%>%ggplot(aes(x=Hour,y=Orders))+
  geom_bar(stat = 'identity',fill='steelblue1',show.legend = T,colour='black')+geom_label(aes(label=Orders))+
  labs(title = 'Orders per hour')+theme_bw()

hour02=data_hour%>%ggplot(aes(x=Hour,y=Items))+
  geom_bar(stat = 'identity',fill='peachpuff2',show.legend = T,colour='black')+geom_label(aes(label=Items))+
  labs(title = 'Items per hour')+theme_bw()

hour03=data_hour%>%ggplot(aes(x=Hour,y=Quantities))+
  geom_bar(stat = 'identity',fill='pink',show.legend = T,colour='black')+geom_label(aes(label=Quantities))+
  labs(title = 'Sales quantity per hour')+theme_bw()

hour04=data_hour%>%ggplot(aes(x=Hour,y=CustomerNo))+
  geom_bar(stat = 'identity',fill='purple',show.legend = T,colour='black')+geom_label(aes(label=CustomerNo))+
  labs(title = 'Customers per hour')+theme_bw()

grid.arrange(hour01, hour02, hour03, hour04, ncol=2)

data_new=data%>%drop_na()
data_new=data_new%>%group_by(BillNo)%>%summarize(paste(Itemname,collapse = ','))
data_new$BillNo=NULL
colnames(data_new)=c('items')

To apply association rules, the data must be in transaction format. The code above collapses all items that share the same BillNo into one comma-separated row; the code below writes these rows to a file and reads them back as transactions.

#save(data_new,'transcation.csv')
write.csv(data_new,'transcationdata.csv',quote = F, row.names = F)
transaction=read.transactions('transcationdata.csv',format = 'basket',sep=',')
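One caveat about this file round trip: because the baskets are written and read back with ',' as the separator, any Itemname that itself contains a comma would be split into several pseudo-items. Whether that happens in this dataset is an assumption worth checking, for example with grepl(',', data$Itemname, fixed = TRUE). The base R sketch below uses a small made-up data frame (`sales` and its item names are hypothetical) to show the effect and an alternative that keeps one element per product.

```r
# Hypothetical long-format data; one item name contains commas on purpose
sales <- data.frame(
  BillNo   = c("1001", "1001", "1002", "1002"),
  Itemname = c("WHITE METAL LANTERN", "PARTY BUNTING",
               "TEA,COFFEE,SUGAR TIN", "PARTY BUNTING"),
  stringsAsFactors = FALSE
)

# Round trip through comma-separated text, as in the write.csv() step:
# the embedded commas turn one product into three "items"
lines_csv   <- vapply(split(sales$Itemname, sales$BillNo),
                      paste, character(1), collapse = ",")
split_items <- strsplit(lines_csv, ",", fixed = TRUE)
print(lengths(split_items))

# Alternative: keep each basket as a character vector, one element per product
baskets <- lapply(split(sales$Itemname, sales$BillNo), unique)
print(lengths(baskets))

# With arules attached, such a list can be coerced without writing a file:
# transaction <- as(baskets, "transactions")
```

In the toy data, basket 1002 holds two products, but the comma-separated round trip reports four items for it. Rules between such fragments would have 100% confidence by construction, so they say nothing about customer behaviour.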

summary(transaction)

## transactions as itemMatrix in sparse format with
##  18164 rows (elements/itemsets/transactions) and
##  7699 columns (items) and a density of 0.002293202 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                               1717                               1468 
##            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
##                               1394                               1244 
##      ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
##                               1226                             313643 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1541  857  742  741  742  693  642  632  631  565  598  517  494  519  530  509 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  457  429  467  406  385  307  303  267  233  246  226  210  212  208  164  153 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  135  139  131  106  110   87  108   91   87   86   84   62   59   67   59   58 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   57   47   61   39   39   47   41   34   27   37   29   26   27   16   24   25 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   20   27   24   21   14   20   19   13   16   15   11   15   12    6    8   14 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   13   10    8    8   11   10   13    8    6    5    5   11    5    4    4    3 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    5    5    2    4    1    4    4    2    2    2    6    3    4    3    2    1 
##  113  114  116  117  118  120  121  122  123  125  126  127  131  132  133  134 
##    3    1    3    3    3    1    2    2    1    3    2    2    1    1    2    1 
##  140  141  142  143  145  146  147  150  154  157  168  171  177  178  180  202 
##    1    2    2    1    1    2    1    1    3    2    2    2    1    1    1    1 
##  204  228  236  249  250  285  320  400  419 
##    1    1    1    1    1    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00   13.00   17.66   23.00  419.00 
## 
## includes extended item information - examples:
##                       labels
## 1                   1 HANGER
## 2     10 COLOUR SPACEBOY PEN
## 3 12 COLOURED PARTY BALLOONS

The summary() output has four parts:

It reports that the transaction data contains 18,164 transactions involving 7,699 distinct items, with a density of about 0.0023 (the share of non-empty cells in the transaction-by-item matrix).
It lists the items that appear most frequently in the baskets: “WHITE HANGING HEART T-LIGHT HOLDER” appears in 1,717 transactions, “REGENCY CAKESTAND 3 TIER” in 1,468 transactions, etc.
It lists how many transactions contain each number of items: 1,541 transactions contain only 1 item, 857 transactions contain 2 items, etc.
It summarizes the basket sizes with quartiles and the mean. The mean shows that a basket contains 17.66 items on average, which matches density × number of items (0.002293202 × 7,699 ≈ 17.66).

2.3 Item analysis

Before applying the Apriori algorithm, I will use the itemFrequencyPlot() function to look at the most frequent items and learn more about the transactions.

#the most frequent in transactions
itemFrequencyPlot(transaction,topN=15, type='absolute',col='blue',main="Absolute Item Frequency Plot",cex.names=0.7,xlab='Item name', ylab='Frequency(absolute)')

itemFrequencyPlot(transaction,topN=15, type='relative',col='lightblue',main="Relative Item Frequency Plot",cex.names=0.7,xlab='Item name', ylab='Frequency(relative)')

The two plots show the 15 most frequent items in absolute and in relative terms.
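The relative frequency of an item is its absolute frequency divided by the number of transactions, and it equals the support of the single item. As a small check, the counts below are copied from the summary(transaction) output above; the variable names are only illustrative.

```r
# Counts of the five most frequent items, copied from summary(transaction)
n_transactions <- 18164
top_counts <- c(
  "WHITE HANGING HEART T-LIGHT HOLDER" = 1717,
  "REGENCY CAKESTAND 3 TIER"           = 1468,
  "JUMBO BAG RED RETROSPOT"            = 1394,
  "PARTY BUNTING"                      = 1244,
  "ASSORTED COLOUR BIRD ORNAMENT"      = 1226
)

# Relative frequency = absolute frequency / number of transactions
relative_freq <- top_counts / n_transactions
print(round(relative_freq, 4))
```

The most frequent item has a relative frequency of roughly 0.09, so no item set can have a support above that value.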

3 The Apriori Algorithm

3.1 Optimal support and confidence

My first step is to determine suitable thresholds for support and confidence. If the values are too high, we may get no rules at all; if they are too low, the algorithm takes longer to run and produces a large number of rules, most of which are useless. So which thresholds should we use? I will try different support and confidence values and see graphically how many rules are generated for each combination.

Support levels: 0.03, 0.02, 0.01, 0.005. Based on the relative frequency plot, the highest single-item support is around 0.09.
Confidence levels: 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1

#support and confidence levels
supportlevels=c( 0.03, 0.02, 0.01, 0.005)
confidencelevels=c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

#initialize one integer vector of rule counts per support level
rules_0.03=integer(length=9)
rules_0.02=integer(length=9)
rules_0.01=integer(length=9)
rules_0.005=integer(length=9)

#Apriori algorithm with support 0.03
for (i in 1:9){
  rules_0.03[i]= length(apriori(transaction,parameter = list(sup=supportlevels[1], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.02
for (i in 1:9){
  rules_0.02[i]= length(apriori(transaction,parameter = list(sup=supportlevels[2], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.01
for (i in 1:9){
  rules_0.01[i]= length(apriori(transaction,parameter = list(sup=supportlevels[3], conf=confidencelevels[i],target='rules')))
}
#Apriori algorithm with support 0.005
for (i in 1:9){
  rules_0.005[i]= length(apriori(transaction,parameter = list(sup=supportlevels[4], conf=confidencelevels[i],target='rules')))
}

rules_0.03

## [1] 0 0 0 0 0 0 0 0 0

rules_0.02

## [1]  0  1  4 10 16 25 30 32 32

rules_0.01

## [1]   9  19  41  80 159 258 357 408 446

rules_0.005

## [1]  180  290  590 1072 1591 2086 2560 3096 3638
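The four loops above follow the same pattern, so the grid search could also be written once. The sketch below is one possible way to do it, not the code that produced the numbers above. `count_rules_grid()` and `fake_count()` are hypothetical names, and `fake_count()` is only a stand-in so that the example runs without arules.

```r
# Count rules for every (support, confidence) pair.
# count_fn is a placeholder for the real mining call, for example
#   function(s, conf) length(apriori(transaction,
#     parameter = list(sup = s, conf = conf, target = 'rules')))
count_rules_grid <- function(supports, confidences, count_fn) {
  counts <- sapply(supports, function(s) {
    sapply(confidences, function(conf) count_fn(s, conf))
  })
  # rows = confidence levels, columns = support levels
  dimnames(counts) <- list(confidence = as.character(confidences),
                           support    = as.character(supports))
  counts
}

# Stand-in counter (hypothetical, these are not real rule counts)
fake_count <- function(s, conf) as.integer(round((1 - conf) / s))

grid <- count_rules_grid(c(0.03, 0.02, 0.01, 0.005),
                         c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1),
                         fake_count)
print(grid)
```

Passing the mining call in as a function keeps the looping logic separate from arules, and the resulting matrix holds all 36 counts in one object instead of four vectors.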

Next, I will use qplot() to visualize the number of rules generated at the different support and confidence levels.

library(ggplot2)
# support level of 0.005-0.03 + different confidence level 0.1-0.9
plot1 = qplot(confidencelevels, rules_0.03, geom=c("point", "line"), 
               xlab="Confidence level", ylab="Number of rules found", 
               main="Apriori with a support level of 0.03") +
  theme_bw()

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

plot2 = qplot(confidencelevels, rules_0.02, geom=c("point", "line"), 
               xlab="Confidence level", ylab="Number of rules found", 
               main="Apriori with a support level of 0.02") +
  theme_bw()

plot3 = qplot(confidencelevels, rules_0.01, geom=c("point", "line"), 
               xlab="Confidence level", ylab="Number of rules found", 
               main="Apriori with a support level of 0.01") +
  theme_bw()

plot4 = qplot(confidencelevels, rules_0.005, geom=c("point", "line"), 
               xlab="Confidence level", ylab="Number of rules found", 
               main="Apriori with a support level of 0.005") +
  theme_bw()

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
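Since qplot() is deprecated as of ggplot2 3.4.0 (see the warning above), the same four panels could be drawn with ggplot() and facets. This is a sketch of one alternative, not the code used for the figure: the rule counts are copied from the output printed earlier, and the data frame name `rule_counts` is an assumption introduced here.

```r
# Rule counts copied from the printed rules_0.03 ... rules_0.005 vectors
confidencelevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
rule_counts <- data.frame(
  confidence = rep(confidencelevels, times = 4),
  support    = factor(rep(c(0.03, 0.02, 0.01, 0.005), each = 9)),
  rules      = c(rep(0, 9),
                 0, 1, 4, 10, 16, 25, 30, 32, 32,
                 9, 19, 41, 80, 159, 258, 357, 408, 446,
                 180, 290, 590, 1072, 1591, 2086, 2560, 3096, 3638)
)

# Draw one panel per support level (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rule_counts, aes(x = confidence, y = rules)) +
    geom_point() +
    geom_line() +
    facet_wrap(~ support, scales = "free_y", labeller = label_both) +
    labs(x = "Confidence level", y = "Number of rules found") +
    theme_bw()
  print(p)
}
```

Keeping the counts in one long data frame means a single ggplot() call replaces the four qplot() calls and the grid.arrange() step.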

From these 4 plots, we can see:

Support level of 0.03: There are no rules.
Support level of 0.02: Rules start to appear; with 0.5 confidence we get 16 rules.
Support level of 0.01: We start to get many rules; with 0.5 confidence we get 159 rules.
Support level of 0.005: Many rules to analyze; with 0.8 confidence we get 290 rules.

Therefore, I will use a support level of 0.005 with a confidence level of 0.8.

3.2 Execution of support and confidence

Then, I’ll execute the Apriori algorithm with 0.5% support and 80% confidence.

rules=apriori(transaction, parameter = list(sup=0.005,conf=0.8,maxlen=10))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 90 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7699 item(s), 18164 transaction(s)] done [0.07s].
## sorting and recoding items ... [1055 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [290 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 290 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6 
##  49 115  81  39   6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.441   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.005010   Min.   :0.8000   Min.   :0.005230   Min.   :  9.035  
##  1st Qu.:0.006221   1st Qu.:0.8575   1st Qu.:0.006717   1st Qu.: 59.554  
##  Median :0.006717   Median :0.9186   Median :0.007322   Median :103.677  
##  Mean   :0.007084   Mean   :0.9079   Mean   :0.007824   Mean   : 84.181  
##  3rd Qu.:0.007432   3rd Qu.:0.9548   3rd Qu.:0.008093   3rd Qu.:107.991  
##  Max.   :0.021526   Max.   :1.0000   Max.   :0.026481   Max.   :116.436  
##      count      
##  Min.   : 91.0  
##  1st Qu.:113.0  
##  Median :122.0  
##  Mean   :128.7  
##  3rd Qu.:135.0  
##  Max.   :391.0  
## 
## mining info:
##         data ntransactions support confidence
##  transaction         18164   0.005        0.8
##                                                                                 call
##  apriori(data = transaction, parameter = list(sup = 0.005, conf = 0.8, maxlen = 10))

The summary of the 290 rules tells us:

49 rules contain 2 items (lhs + rhs), 115 rules contain 3 items, etc.
Across the 290 rules, a rule contains 3.441 items on average.

The mean support is 0.007, the mean confidence is 0.91, and the mean lift is 84.2.

#Top 10 rules sorted by support
inspect(sort(rules, by = list('support'))[1:10])

##      lhs                                   rhs                                  support confidence   coverage     lift count
## [1]  {PINK REGENCY TEACUP AND SAUCER}   => {GREEN REGENCY TEACUP AND SAUCER} 0.02152610  0.8128898 0.02648095 24.94144   391
## [2]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                      
##       PINK REGENCY TEACUP AND SAUCER}   => {ROSES REGENCY TEACUP AND SAUCER} 0.01794759  0.8337596 0.02152610 22.70526   326
## [3]  {PINK REGENCY TEACUP AND SAUCER,                                                                                       
##       ROSES REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER} 0.01794759  0.8882834 0.02020480 27.25469   326
## [4]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                      
##       REGENCY CAKESTAND 3 TIER}         => {ROSES REGENCY TEACUP AND SAUCER} 0.01409381  0.8126984 0.01734200 22.13172   256
## [5]  {PINK REGENCY TEACUP AND SAUCER,                                                                                       
##       REGENCY CAKESTAND 3 TIER}         => {GREEN REGENCY TEACUP AND SAUCER} 0.01233209  0.8682171 0.01420392 26.63901   224
## [6]  {PINK REGENCY TEACUP AND SAUCER,                                                                                       
##       REGENCY CAKESTAND 3 TIER}         => {ROSES REGENCY TEACUP AND SAUCER} 0.01194671  0.8410853 0.01420392 22.90476   217
## [7]  {SHED}                             => {KEY FOB}                         0.01161638  1.0000000 0.01161638 59.55410   211
## [8]  {SET 3 RETROSPOT TEA}              => {SUGAR}                           0.01156133  1.0000000 0.01156133 86.49524   210
## [9]  {SUGAR}                            => {SET 3 RETROSPOT TEA}             0.01156133  1.0000000 0.01156133 86.49524   210
## [10] {SET 3 RETROSPOT TEA}              => {COFFEE}                          0.01156133  1.0000000 0.01156133 65.57401   210

From the output sorted by support, we can see:

2.15% of transactions (391 in total) contain both “PINK REGENCY TEACUP AND SAUCER” and “GREEN REGENCY TEACUP AND SAUCER”, and the top 6 rules all involve teacups and saucers.
1.16% of transactions (211 in total) contain both “SHED” and “KEY FOB”.
“SUGAR” appears twice among the last rules, so later in the analysis I will look at how sugar is related to other products.
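The columns of the inspect() output are linked by simple arithmetic, which can be verified for rule [1]. The values below are copied from the table above; the variable names are only illustrative.

```r
# Values of rule [1]: {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}
n_transactions <- 18164       # from summary(transaction)
count_both     <- 391         # count column
support        <- 0.02152610
confidence     <- 0.8128898
coverage       <- 0.02648095  # support of the lhs
lift           <- 24.94144

# support = count / number of transactions
support_check <- count_both / n_transactions

# confidence = support / coverage
confidence_check <- support / coverage

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
rhs_count   <- rhs_support * n_transactions

print(c(support = support_check, confidence = confidence_check,
        rhs_support = rhs_support, rhs_count = rhs_count))
```

The last value implies that the consequent appears in roughly 592 baskets, that is, in about 3.3% of all transactions. A confidence of 81% is therefore about 25 times the benchmark confidence.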

#Top 10 rules sorted by confidence
inspect(sort(rules, by = list('confidence'))[1:10])

##      lhs                      rhs                   support     confidence
## [1]  {ELEPHANT}            => {BIRTHDAY CARD}       0.005945827 1         
## [2]  {RETRO SPOT}          => {BIRTHDAY CARD}       0.005945827 1         
## [3]  {HOT PINK}            => {FEATHER PEN}         0.006386259 1         
## [4]  {FRONT  DOOR}         => {KEY FOB}             0.006826690 1         
## [5]  {AIRLINE LOUNGE}      => {METAL SIGN}          0.005725611 1         
## [6]  {SET 3 RETROSPOT TEA} => {SUGAR}               0.011561330 1         
## [7]  {SUGAR}               => {SET 3 RETROSPOT TEA} 0.011561330 1         
## [8]  {SET 3 RETROSPOT TEA} => {COFFEE}              0.011561330 1         
## [9]  {SUGAR}               => {COFFEE}              0.011561330 1         
## [10] {BACK DOOR}           => {KEY FOB}             0.010680467 1         
##      coverage    lift      count
## [1]  0.005945827  88.17476 108  
## [2]  0.005945827  88.17476 108  
## [3]  0.006386259  98.71739 116  
## [4]  0.006826690  59.55410 124  
## [5]  0.005725611 116.43590 104  
## [6]  0.011561330  86.49524 210  
## [7]  0.011561330  86.49524 210  
## [8]  0.011561330  65.57401 210  
## [9]  0.011561330  65.57401 210  
## [10] 0.010680467  59.55410 194

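The top 10 rules sorted by confidence all have 100% confidence, which means that every transaction containing the item on the left also contains the item on the right. For example, every basket with “ELEPHANT” also contains “BIRTHDAY CARD”, and every basket with “HOT PINK” also contains “FEATHER PEN”. Besides that, “BIRTHDAY CARD”, “KEY FOB”, “SUGAR” and “COFFEE” appear repeatedly. Perfect rules with identical counts (for example 210 for the tea, sugar and coffee rules) may also arise if read.transactions() split a single product name at its commas, so they are worth checking against the raw Itemname column.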

#Top 10 rules sorted by lift
inspect(sort(rules, by = list('lift'))[1:10])

##      lhs                           rhs                            support confidence    coverage     lift count
## [1]  {AIRLINE LOUNGE}           => {METAL SIGN}               0.005725611  1.0000000 0.005725611 116.4359   104
## [2]  {HERB MARKER BASIL,                                                                                       
##       HERB MARKER MINT,                                                                                        
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER ROSEMARY,                                                                                    
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006166043  0.9180328 0.006716582 115.7996   112
## [3]  {PINK VINTAGE SPOT BEAKER} => {BLUE VINTAGE SPOT BEAKER} 0.005340233  0.8738739 0.006110989 115.0221    97
## [4]  {HERB MARKER MINT,                                                                                        
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER ROSEMARY,                                                                                    
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006606474  0.9090909 0.007267122 114.6717   120
## [5]  {HERB MARKER BASIL,                                                                                       
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER ROSEMARY,                                                                                    
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006441312  0.9069767 0.007101960 114.4050   117
## [6]  {HERB MARKER BASIL,                                                                                       
##       HERB MARKER MINT,                                                                                        
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006221097  0.9040000 0.006881744 114.0296   113
## [7]  {HERB MARKER BASIL,                                                                                       
##       HERB MARKER CHIVES,                                                                                      
##       HERB MARKER MINT,                                                                                        
##       HERB MARKER ROSEMARY,                                                                                    
##       HERB MARKER THYME}        => {HERB MARKER PARSLEY}      0.006166043  0.9911504 0.006221097 113.9447   112
## [8]  {HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER ROSEMARY,                                                                                    
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006881744  0.8992806 0.007652499 113.4343   125
## [9]  {HERB MARKER BASIL,                                                                                       
##       HERB MARKER MINT,                                                                                        
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER ROSEMARY}     => {HERB MARKER CHIVES}       0.006276151  0.8976378 0.006991852 113.2270   114
## [10] {HERB MARKER MINT,                                                                                        
##       HERB MARKER PARSLEY,                                                                                     
##       HERB MARKER THYME}        => {HERB MARKER CHIVES}       0.006716582  0.8970588 0.007487338 113.1540   122

From the output sorted by lift, we can see:

The highest lift, 116, belongs to the rule “AIRLINE LOUNGE” => “METAL SIGN”.
“PINK VINTAGE SPOT BEAKER” => “BLUE VINTAGE SPOT BEAKER” has a lift of 115.
The other 8 rules all involve herb markers.

To sum up, these are the rules we get from the Apriori algorithm after selecting appropriate support and confidence levels. Finally, I want to visualize these association rules.

3.3 Visualize association rules

The interactive scatter-plot visualization

plot(rules)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

#plot top 10 rules
plot(head(rules,n=10,by='confidence'),main="Scatter plot for 10 confidence rules",engine='plotly')

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(head(rules,n=10,by='support'),main="Scatter plot for 10 support rules",engine='plotly')

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(head(rules,n=10,by='lift'),main="Scatter plot for 10 lift rules",engine='plotly')

plot(rules, method='grouped')

Next, I will analyze the rules that lead to “COFFEE” and “SUGAR” to see how these two products are related to others. The rules themselves are only part of the task: should we put related products on the same shelf for easier access, or in opposite corners of the shop so that customers walk through the whole store? This is where data science meets substantive expertise.

# rules to coffee
rules.to.coffee=apriori(data=transaction, parameter = list(supp=0.002,conf=0.8),
                        appearance = list(default='lhs',rhs='COFFEE'))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 36 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[7699 item(s), 18164 transaction(s)] done [0.08s].
## sorting and recoding items ... [1901 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.06s].
## writing ... [23 rule(s)] done [0.01s].
## creating S4 object  ... done [0.00s].

inspect(head(sort(rules.to.coffee,by='support')))

##     lhs                          rhs          support confidence    coverage     lift count
## [1] {SET 3 RETROSPOT TEA}     => {COFFEE} 0.011561330          1 0.011561330 65.57401   210
## [2] {SUGAR}                   => {COFFEE} 0.011561330          1 0.011561330 65.57401   210
## [3] {SET 3 RETROSPOT TEA,                                                                  
##      SUGAR}                   => {COFFEE} 0.011561330          1 0.011561330 65.57401   210
## [4] {SUGAR JARS}              => {COFFEE} 0.003908831          1 0.003908831 65.57401    71
## [5] {RED SPOTTY BISCUIT TIN,                                                               
##      SET 3 RETROSPOT TEA}     => {COFFEE} 0.003413345          1 0.003413345 65.57401    62
## [6] {RED SPOTTY BISCUIT TIN,                                                               
##      SUGAR}                   => {COFFEE} 0.003413345          1 0.003413345 65.57401    62

Coffee is chosen by customers who also buy:

SET 3 RETROSPOT TEA -> COFFEE
SUGAR -> COFFEE
SET 3 RETROSPOT TEA + SUGAR -> COFFEE

# rules to sugar
rules.to.sugar=apriori(data=transaction, parameter = list(supp=0.002,conf=0.8),
                        appearance = list(default='lhs',rhs='SUGAR'), 
                        control=list(verbose=F))
inspect(head(sort(rules.to.sugar,by='support')))

##     lhs                                      rhs         support confidence    coverage     lift count
## [1] {SET 3 RETROSPOT TEA}                 => {SUGAR} 0.011561330          1 0.011561330 86.49524   210
## [2] {COFFEE,                                                                                          
##      SET 3 RETROSPOT TEA}                 => {SUGAR} 0.011561330          1 0.011561330 86.49524   210
## [3] {RED SPOTTY BISCUIT TIN,                                                                          
##      SET 3 RETROSPOT TEA}                 => {SUGAR} 0.003413345          1 0.003413345 86.49524    62
## [4] {COFFEE,                                                                                          
##      RED SPOTTY BISCUIT TIN}              => {SUGAR} 0.003413345          1 0.003413345 86.49524    62
## [5] {COFFEE,                                                                                          
##      RED SPOTTY BISCUIT TIN,                                                                          
##      SET 3 RETROSPOT TEA}                 => {SUGAR} 0.003413345          1 0.003413345 86.49524    62
## [6] {SET 3 RETROSPOT TEA,                                                                             
##      SET/5 RED RETROSPOT LID GLASS BOWLS} => {SUGAR} 0.002587536          1 0.002587536 86.49524    47

Sugar is chosen by customers who also buy:

SET 3 RETROSPOT TEA -> SUGAR
SET 3 RETROSPOT TEA + COFFEE -> SUGAR
SET 3 RETROSPOT TEA + RED SPOTTY BISCUIT TIN -> SUGAR

The visualization for rhs COFFEE

plot(rules.to.coffee, method='paracoord')

plot(rules.to.coffee, method='graph',shading="lift")

plot(rules.to.coffee, method='grouped')

The visualization for rhs sugar

plot(rules.to.sugar, method="paracoord")

plot(rules.to.sugar, method="graph", shading="lift")

plot(rules.to.sugar, method="grouped")

4 Conclusions

The association rules and visualizations obtained from these shopping baskets can serve as suggestions for retail owners when arranging product catalogs or improving product marketing. Using the association rules can help increase customer engagement, improve the customer experience, and identify customer behavior.

Market Basket Analysis on customers data

Ting_Wei