library(viridis)

## Loading required package: viridisLite

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(arules)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

INTRO

Market Basket Analysis is like a detective tool for understanding what customers buy together. By looking at purchase records, businesses can learn what items tend to be bought at the same time. In this blog post, we’ll use a real online shopping dataset to learn how this analysis works. We’ll cover how to understand the data, get it ready for analysis, find patterns in customer purchases, understand the rules we find, and see how businesses can use this information to improve their strategies. Original dataset can be found here: https://archive.ics.uci.edu/dataset/352/online+retail

Dataset Information:

This is a transnational data set that contains all the transactions occurring `between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Most customers of the company are wholesalers.

head(Online_Retail)

## # A tibble: 6 × 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANGING HEAR…        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE METAL LANTERN        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPID HEARTS…        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UNION FLAG…        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY HOTTIE …        6 2010-12-01 08:26:00      3.39
## 6 536365    22752     SET 7 BABUSHKA NES…        2 2010-12-01 08:26:00      7.65
## # ℹ 2 more variables: CustomerID <dbl>, Country <chr>

Attribute Information:

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.** If this code starts with the letter ‘c’, it indicates a cancellation.**

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, Product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.

#Data preparation

summary(Online_Retail)

##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##   InvoiceDate                       UnitPrice           CustomerID    
##  Min.   :2010-12-01 08:26:00.00   Min.   :-11062.06   Min.   :12346   
##  1st Qu.:2011-03-28 11:34:00.00   1st Qu.:     1.25   1st Qu.:13953   
##  Median :2011-07-19 17:17:00.00   Median :     2.08   Median :15152   
##  Mean   :2011-07-04 13:34:57.16   Mean   :     4.61   Mean   :15288   
##  3rd Qu.:2011-10-19 11:27:00.00   3rd Qu.:     4.13   3rd Qu.:16791   
##  Max.   :2011-12-09 12:50:00.00   Max.   : 38970.00   Max.   :18287   
##                                                       NA's   :135080  
##    Country         
##  Length:541909     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

This dataset contains 8 columns and 541909 observation. By looking into the information of the dataset, we see some NA values present in a couple of columns; Description and CustomerID.

Deal with Cancelled transaction:

Before checking NA value it is better omit the rows that related to cancelled transaction because maybe so missing value be handle in this way.

Online_Retail <- Online_Retail[!grepl("C", Online_Retail$InvoiceNo), ]

Something else that we can do is that remove the rows that quantity is equal to zero, which it means there is no transaction.

Online_Retail <- Online_Retail %>% filter(Quantity > 0,
                        UnitPrice > 0,
                        Description != "")

Checking missing value

sum(is.na(Online_Retail))

## [1] 132220

As we can see the number of NA in customerID was decreased but it is still quite a lot. we have 132220 NA in CustomerID. this number of missing value are significant.but I decided to remove the missing value and find an association among products.

Online_Retail <-  na.omit(Online_Retail)
sum(is.na(Online_Retail))

## [1] 0

separate Date and Time in InvoiceData:

Online_Retail$Date <- sapply(strsplit(as.character(Online_Retail$InvoiceDate), " "), "[", 1)
Online_Retail$Time <- sapply(strsplit(as.character(Online_Retail$InvoiceDate), " "), "[", 2)
Online_Retail <- Online_Retail[-5]
library(dplyr)
library(tidyr)

Transactions format

After cleaning our data it is time to converting data format to a format that is aproperiate for Apriori algorithm.

data.list <- split(x = Online_Retail$Description,
                   f = Online_Retail$InvoiceNo)

## Transactions formatting
data.transactions <- as(object = data.list,
                        Class = "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

summary(data.transactions)

## transactions as itemMatrix in sparse format with
##  18532 rows (elements/itemsets/transactions) and
##  3866 columns (items) and a density of 0.005410881 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                               1971                               1703 
##            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
##                               1600                               1379 
##      ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
##                               1375                             379633 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1407  732  630  625  661  598  588  588  596  530  536  488  498  513  542  546 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  457  434  482  429  391  334  342  304  242  251  239  238  267  219  194  172 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  159  169  131  121  125  117  130  116  118   97   91   96   90   83   75   80 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   78   79   52   58   68   67   63   44   59   41   32   54   36   25   40   35 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   33   36   30   35   24   30   30   20   22   26   24   20   18   20   11   13 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   18   19   15   20   15   14    9   13   11    9    9   15   12    7    4    9 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    8   12    4   11    8    3    6    7    2    3    6    4    2    4    4    3 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    3    6    6    8    3    4    5    5    5    7    3    4    3    2    5    1 
##  129  130  131  132  134  135  136  137  138  139  140  141  142  143  144  145 
##    1    2    3    2    2    2    2    2    2    1    1    4    1    1    1    2 
##  146  148  149  150  151  153  154  156  157  163  165  169  170  175  176  178 
##    2    1    3    1    1    1    1    1    1    1    2    2    1    1    2    1 
##  179  180  181  184  187  192  193  195  202  204  208  210  219  227  249  259 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  262  270  280  333  347  352  363  375  386  419  434  439  525  529  541 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   20.92   27.00  541.00 
## 
## includes extended item information - examples:
##                       labels
## 1     10 COLOUR SPACEBOY PEN
## 2 12 COLOURED PARTY BALLOONS
## 3  12 DAISY PEGS IN WOOD BOX
## 
## includes extended transaction information - examples:
##   transactionID
## 1        536365
## 2        536366
## 3        536367

class(data.transactions)

## [1] "transactions"
## attr(,"package")
## [1] "arules"

inspect(head(data.transactions))

##     items                                  transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,                    
##      GLASS STAR FROSTED T-LIGHT HOLDER,                 
##      KNITTED UNION FLAG HOT WATER BOTTLE,               
##      RED WOOLLY HOTTIE WHITE HEART.,                    
##      SET 7 BABUSHKA NESTING BOXES,                      
##      WHITE HANGING HEART T-LIGHT HOLDER,                
##      WHITE METAL LANTERN}                         536365
## [2] {HAND WARMER RED POLKA DOT,                         
##      HAND WARMER UNION JACK}                      536366
## [3] {ASSORTED COLOUR BIRD ORNAMENT,                     
##      BOX OF 6 ASSORTED COLOUR TEASPOONS,                
##      BOX OF VINTAGE ALPHABET BLOCKS,                    
##      BOX OF VINTAGE JIGSAW BLOCKS,                      
##      DOORMAT NEW ENGLAND,                               
##      FELTCRAFT PRINCESS CHARLOTTE DOLL,                 
##      HOME BUILDING BLOCK WORD,                          
##      IVORY KNITTED MUG COSY,                            
##      LOVE BUILDING BLOCK WORD,                          
##      POPPY'S PLAYHOUSE BEDROOM,                         
##      POPPY'S PLAYHOUSE KITCHEN,                         
##      RECIPE BOX WITH METAL HEART}                 536367
## [4] {BLUE COAT RACK PARIS FASHION,                      
##      JAM MAKING SET WITH JARS,                          
##      RED COAT RACK PARIS FASHION,                       
##      YELLOW COAT RACK PARIS FASHION}              536368
## [5] {BATH BUILDING BLOCK WORD}                    536369
## [6] {ALARM CLOCK BAKELIKE GREEN,                        
##      ALARM CLOCK BAKELIKE PINK,                         
##      ALARM CLOCK BAKELIKE RED,                          
##      CHARLOTTE BAG DOLLY GIRL DESIGN,                   
##      CIRCUS PARADE LUNCH BOX,                           
##      INFLATABLE POLITICAL GLOBE,                        
##      LUNCH BOX I LOVE LONDON,                           
##      MINI JIGSAW CIRCUS PARADE,                         
##      MINI JIGSAW SPACEBOY,                              
##      MINI PAINT SET VINTAGE,                            
##      PANDA AND BUNNIES STICKER SHEET,                   
##      POSTAGE,                                           
##      RED TOADSTOOL LED NIGHT LIGHT,                     
##      ROUND SNACK BOXES SET OF4 WOODLAND,                
##      SET 2 TEA TOWELS I LOVE LONDON,                    
##      SET/2 RED RETROSPOT TEA TOWELS,                    
##      SPACEBOY LUNCH BOX,                                
##      STARS GIFT TAPE,                                   
##      VINTAGE HEADS AND TAILS CARD GAME,                 
##      VINTAGE SEASIDE JIGSAW PUZZLES}              536370

Plotting the most purchased items

# Plotting the most purchased items
itemFrequencyPlot(x = data.transactions,
                  type = "relative",
                  topN = 20,
                  horiz = T,
                  col = rainbow(20))

Apriori

The Apriori algorithm is a classic data mining technique used for association rule mining. It identifies frequent item sets in a dataset and generates association rules based on these item sets. The algorithm works by iteratively discovering frequent item sets by pruning infrequent ones, leveraging the “Apriori principle” which states that any subset of a frequent item set must also be frequent. Initially, the Apriori algorithm was applied using default parameters: a minimum support of 0.1 and a minimum confidence of 0.8. These parameters determine the threshold levels for item set frequency and rule confidence, respectively, guiding the algorithm’s discovery of frequent item sets and generation of association rules.

rules <- apriori(data.transactions)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1853 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3866 item(s), 18532 transaction(s)] done [0.14s].
## sorting and recoding items ... [1 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules

## set of 0 rules

Well as we can find from: writing … [0 rule(s)] done [0.00s]. there is no rule, so we should change the trashholds defult until we fine some association rules. I changed in this way: minimum support of 0.02 and a minimum confidence of 0.5.

rules <- apriori(data.transactions, parameter = list(support = 0.02, confidence = 0.5, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 370 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3866 item(s), 18532 transaction(s)] done [0.14s].
## sorting and recoding items ... [207 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [30 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules

## set of 30 rules

By lowering the minimum support and confidence values, 30 association rules were identified from the dataset. Here we can see the first 10 of these rules:

inspect(head(rules, n = 10))

##      lhs                                  rhs                                    support confidence   coverage      lift count
## [1]  {PINK REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER}   0.02482193  0.8273381 0.03000216 22.188466   460
## [2]  {GREEN REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER}    0.02482193  0.6657019 0.03728686 22.188466   460
## [3]  {PINK REGENCY TEACUP AND SAUCER}  => {ROSES REGENCY TEACUP AND SAUCER}   0.02352687  0.7841727 0.03000216 18.559754   436
## [4]  {ROSES REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER}    0.02352687  0.5568327 0.04225124 18.559754   436
## [5]  {JUMBO BAG STRAWBERRY}            => {JUMBO BAG RED RETROSPOT}           0.02233974  0.6330275 0.03529031  7.332041   414
## [6]  {ALARM CLOCK BAKELIKE PINK}       => {ALARM CLOCK BAKELIKE RED}          0.02136844  0.6460033 0.03307792 13.650778   396
## [7]  {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER}   0.02919275  0.7829233 0.03728686 18.530185   541
## [8]  {ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}   0.02919275  0.6909323 0.04225124 18.530185   541
## [9]  {GREEN REGENCY TEACUP AND SAUCER} => {REGENCY CAKESTAND 3 TIER}          0.02018131  0.5412446 0.03728686  5.889809   374
## [10] {WOODEN FRAME ANTIQUE WHITE}      => {WOODEN PICTURE FRAME WHITE FINISH} 0.02525362  0.5770654 0.04376214 12.207962   468

but how we can analysis this table?

Support:

Support tells us how often a rule is true.

Confidence:

Confidence tells us how likely the rule is true when the antecedent occurs.

Lift:

Lift tells us how much more likely the consequent is given the antecedent, compared to its usual frequency. A lift greater than 1 indicates a positive association, close to 1 indicates independence, and less than 1 indicates a negative association

we pick up the rules that their confidence is more than 0.6. which represent us the rules which is more relaiable.

Rules<-rules[quality(rules)$confidence>0.6]

visualisation the top 19 rules:

Here we can see the plot of 19 rules which their confidence is more than 0.6.

plot(Rules)

Interpretaion of result

inspect(sort(Rules, by = "support")[1:2], linebreak = FALSE)

##     lhs                                  rhs                              
## [1] {JUMBO BAG PINK POLKADOT}         => {JUMBO BAG RED RETROSPOT}        
## [2] {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER}
##     support    confidence coverage   lift      count
## [1] 0.02946255 0.6268657  0.04699978  7.260672 546  
## [2] 0.02919275 0.7829233  0.03728686 18.530185 541

“JUMBO BAG RED RETROSPOT” is the most popular item in the our online shop. this product most often is appears together with “JUMBO BAG PINK POLKADOT” (546 transactions). Other rule with the highest support indicate that “ROSES REGENCY TEACUP AND SAUCER” is being willingly bought with “GREEN REGENCY TEACUP AND SAUCER” (2.3% of all transactions).

inspect(sort(Rules, by = "confidence")[1:2], linebreak = FALSE)

##     lhs                                                                 
## [1] {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER} =>
## [2] {GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY TEACUP AND SAUCER} =>
##     rhs                               support    confidence coverage   lift    
## [1] {GREEN REGENCY TEACUP AND SAUCER} 0.02104468 0.8944954  0.02352687 23.98956
## [2] {ROSES REGENCY TEACUP AND SAUCER} 0.02104468 0.8478261  0.02482193 20.06630
##     count
## [1] 390  
## [2] 390

Rules for our data with the highest confidence show that if customer buys {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER}, s/he will also buy {GREEN REGENCY TEACUP AND SAUCER} with the probability of 89%.

inspect(sort(Rules, by = "lift")[1:2], linebreak = FALSE)

##     lhs                                                                  
## [1] {GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER} =>
## [2] {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER}  =>
##     rhs                               support    confidence coverage   lift    
## [1] {PINK REGENCY TEACUP AND SAUCER}  0.02104468 0.7208872  0.02919275 24.02785
## [2] {GREEN REGENCY TEACUP AND SAUCER} 0.02104468 0.8944954  0.02352687 23.98956
##     count
## [1] 390  
## [2] 390

Lift for this rules is close to 25>1 which show us it is more likely if we choose {PINK REGENCY TEACUP AND SAUCER} then choose {GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER}. and the same interpretation for other rule.

Conclusion

In conclusion, Market Basket Analysis has helped uncover valuable patterns and associations within our data, which can be used to inform strategic decisions such as product placement, and promotional campaigns. Further analysis and interpretation of the association rules may lead to actionable insights for optimizing sales and enhancing customer experience.

Market Basket Analysis

Shokoufeh Naseri

2024