library(viridis)
## Loading required package: viridisLite
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
Market Basket Analysis is like a detective tool for understanding what customers buy together. By looking at purchase records, businesses can learn what items tend to be bought at the same time. In this blog post, we’ll use a real online shopping dataset to learn how this analysis works. We’ll cover how to understand the data, get it ready for analysis, find patterns in customer purchases, understand the rules we find, and see how businesses can use this information to improve their strategies. Original dataset can be found here: https://archive.ics.uci.edu/dataset/352/online+retail
This is a transnational data set that contains all the transactions occurring `between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Most customers of the company are wholesalers.
head(Online_Retail)
## # A tibble: 6 × 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 536365 85123A WHITE HANGING HEAR… 6 2010-12-01 08:26:00 2.55
## 2 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39
## 3 536365 84406B CREAM CUPID HEARTS… 8 2010-12-01 08:26:00 2.75
## 4 536365 84029G KNITTED UNION FLAG… 6 2010-12-01 08:26:00 3.39
## 5 536365 84029E RED WOOLLY HOTTIE … 6 2010-12-01 08:26:00 3.39
## 6 536365 22752 SET 7 BABUSHKA NES… 2 2010-12-01 08:26:00 7.65
## # ℹ 2 more variables: CustomerID <dbl>, Country <chr>
Attribute Information:
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.** If this code starts with the letter ‘c’, it indicates a cancellation.**
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
#Data preparation
summary(Online_Retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00.00 Min. :-11062.06 Min. :12346
## 1st Qu.:2011-03-28 11:34:00.00 1st Qu.: 1.25 1st Qu.:13953
## Median :2011-07-19 17:17:00.00 Median : 2.08 Median :15152
## Mean :2011-07-04 13:34:57.16 Mean : 4.61 Mean :15288
## 3rd Qu.:2011-10-19 11:27:00.00 3rd Qu.: 4.13 3rd Qu.:16791
## Max. :2011-12-09 12:50:00.00 Max. : 38970.00 Max. :18287
## NA's :135080
## Country
## Length:541909
## Class :character
## Mode :character
##
##
##
##
This dataset contains 8 columns and 541909 observation. By looking into the information of the dataset, we see some NA values present in a couple of columns; Description and CustomerID.
Deal with Cancelled transaction:
Before checking NA value it is better omit the rows that related to cancelled transaction because maybe so missing value be handle in this way.
Online_Retail <- Online_Retail[!grepl("C", Online_Retail$InvoiceNo), ]
Something else that we can do is that remove the rows that quantity is equal to zero, which it means there is no transaction.
Online_Retail <- Online_Retail %>% filter(Quantity > 0,
UnitPrice > 0,
Description != "")
Checking missing value
sum(is.na(Online_Retail))
## [1] 132220
As we can see the number of NA in customerID was decreased but it is still quite a lot. we have 132220 NA in CustomerID. this number of missing value are significant.but I decided to remove the missing value and find an association among products.
Online_Retail <- na.omit(Online_Retail)
sum(is.na(Online_Retail))
## [1] 0
separate Date and Time in InvoiceData:
Online_Retail$Date <- sapply(strsplit(as.character(Online_Retail$InvoiceDate), " "), "[", 1)
Online_Retail$Time <- sapply(strsplit(as.character(Online_Retail$InvoiceDate), " "), "[", 2)
Online_Retail <- Online_Retail[-5]
library(dplyr)
library(tidyr)
After cleaning our data it is time to converting data format to a format that is aproperiate for Apriori algorithm.
data.list <- split(x = Online_Retail$Description,
f = Online_Retail$InvoiceNo)
## Transactions formatting
data.transactions <- as(object = data.list,
Class = "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(data.transactions)
## transactions as itemMatrix in sparse format with
## 18532 rows (elements/itemsets/transactions) and
## 3866 columns (items) and a density of 0.005410881
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 1971 1703
## JUMBO BAG RED RETROSPOT PARTY BUNTING
## 1600 1379
## ASSORTED COLOUR BIRD ORNAMENT (Other)
## 1375 379633
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1407 732 630 625 661 598 588 588 596 530 536 488 498 513 542 546
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 457 434 482 429 391 334 342 304 242 251 239 238 267 219 194 172
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 159 169 131 121 125 117 130 116 118 97 91 96 90 83 75 80
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 78 79 52 58 68 67 63 44 59 41 32 54 36 25 40 35
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 33 36 30 35 24 30 30 20 22 26 24 20 18 20 11 13
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 18 19 15 20 15 14 9 13 11 9 9 15 12 7 4 9
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 8 12 4 11 8 3 6 7 2 3 6 4 2 4 4 3
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 3 6 6 8 3 4 5 5 5 7 3 4 3 2 5 1
## 129 130 131 132 134 135 136 137 138 139 140 141 142 143 144 145
## 1 2 3 2 2 2 2 2 2 1 1 4 1 1 1 2
## 146 148 149 150 151 153 154 156 157 163 165 169 170 175 176 178
## 2 1 3 1 1 1 1 1 1 1 2 2 1 1 2 1
## 179 180 181 184 187 192 193 195 202 204 208 210 219 227 249 259
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 262 270 280 333 347 352 363 375 386 419 434 439 525 529 541
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 20.92 27.00 541.00
##
## includes extended item information - examples:
## labels
## 1 10 COLOUR SPACEBOY PEN
## 2 12 COLOURED PARTY BALLOONS
## 3 12 DAISY PEGS IN WOOD BOX
##
## includes extended transaction information - examples:
## transactionID
## 1 536365
## 2 536366
## 3 536367
class(data.transactions)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
inspect(head(data.transactions))
## items transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,
## GLASS STAR FROSTED T-LIGHT HOLDER,
## KNITTED UNION FLAG HOT WATER BOTTLE,
## RED WOOLLY HOTTIE WHITE HEART.,
## SET 7 BABUSHKA NESTING BOXES,
## WHITE HANGING HEART T-LIGHT HOLDER,
## WHITE METAL LANTERN} 536365
## [2] {HAND WARMER RED POLKA DOT,
## HAND WARMER UNION JACK} 536366
## [3] {ASSORTED COLOUR BIRD ORNAMENT,
## BOX OF 6 ASSORTED COLOUR TEASPOONS,
## BOX OF VINTAGE ALPHABET BLOCKS,
## BOX OF VINTAGE JIGSAW BLOCKS,
## DOORMAT NEW ENGLAND,
## FELTCRAFT PRINCESS CHARLOTTE DOLL,
## HOME BUILDING BLOCK WORD,
## IVORY KNITTED MUG COSY,
## LOVE BUILDING BLOCK WORD,
## POPPY'S PLAYHOUSE BEDROOM,
## POPPY'S PLAYHOUSE KITCHEN,
## RECIPE BOX WITH METAL HEART} 536367
## [4] {BLUE COAT RACK PARIS FASHION,
## JAM MAKING SET WITH JARS,
## RED COAT RACK PARIS FASHION,
## YELLOW COAT RACK PARIS FASHION} 536368
## [5] {BATH BUILDING BLOCK WORD} 536369
## [6] {ALARM CLOCK BAKELIKE GREEN,
## ALARM CLOCK BAKELIKE PINK,
## ALARM CLOCK BAKELIKE RED,
## CHARLOTTE BAG DOLLY GIRL DESIGN,
## CIRCUS PARADE LUNCH BOX,
## INFLATABLE POLITICAL GLOBE,
## LUNCH BOX I LOVE LONDON,
## MINI JIGSAW CIRCUS PARADE,
## MINI JIGSAW SPACEBOY,
## MINI PAINT SET VINTAGE,
## PANDA AND BUNNIES STICKER SHEET,
## POSTAGE,
## RED TOADSTOOL LED NIGHT LIGHT,
## ROUND SNACK BOXES SET OF4 WOODLAND,
## SET 2 TEA TOWELS I LOVE LONDON,
## SET/2 RED RETROSPOT TEA TOWELS,
## SPACEBOY LUNCH BOX,
## STARS GIFT TAPE,
## VINTAGE HEADS AND TAILS CARD GAME,
## VINTAGE SEASIDE JIGSAW PUZZLES} 536370
# Plotting the most purchased items
itemFrequencyPlot(x = data.transactions,
type = "relative",
topN = 20,
horiz = T,
col = rainbow(20))
The Apriori algorithm is a classic data mining technique used for association rule mining. It identifies frequent item sets in a dataset and generates association rules based on these item sets. The algorithm works by iteratively discovering frequent item sets by pruning infrequent ones, leveraging the “Apriori principle” which states that any subset of a frequent item set must also be frequent. Initially, the Apriori algorithm was applied using default parameters: a minimum support of 0.1 and a minimum confidence of 0.8. These parameters determine the threshold levels for item set frequency and rule confidence, respectively, guiding the algorithm’s discovery of frequent item sets and generation of association rules.
rules <- apriori(data.transactions)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1853
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3866 item(s), 18532 transaction(s)] done [0.14s].
## sorting and recoding items ... [1 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 0 rules
Well as we can find from: writing … [0 rule(s)] done [0.00s]. there is no rule, so we should change the trashholds defult until we fine some association rules. I changed in this way: minimum support of 0.02 and a minimum confidence of 0.5.
rules <- apriori(data.transactions, parameter = list(support = 0.02, confidence = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.02 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 370
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3866 item(s), 18532 transaction(s)] done [0.14s].
## sorting and recoding items ... [207 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [30 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 30 rules
By lowering the minimum support and confidence values, 30 association rules were identified from the dataset. Here we can see the first 10 of these rules:
inspect(head(rules, n = 10))
## lhs rhs support confidence coverage lift count
## [1] {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.02482193 0.8273381 0.03000216 22.188466 460
## [2] {GREEN REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.02482193 0.6657019 0.03728686 22.188466 460
## [3] {PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.02352687 0.7841727 0.03000216 18.559754 436
## [4] {ROSES REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.02352687 0.5568327 0.04225124 18.559754 436
## [5] {JUMBO BAG STRAWBERRY} => {JUMBO BAG RED RETROSPOT} 0.02233974 0.6330275 0.03529031 7.332041 414
## [6] {ALARM CLOCK BAKELIKE PINK} => {ALARM CLOCK BAKELIKE RED} 0.02136844 0.6460033 0.03307792 13.650778 396
## [7] {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.02919275 0.7829233 0.03728686 18.530185 541
## [8] {ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.02919275 0.6909323 0.04225124 18.530185 541
## [9] {GREEN REGENCY TEACUP AND SAUCER} => {REGENCY CAKESTAND 3 TIER} 0.02018131 0.5412446 0.03728686 5.889809 374
## [10] {WOODEN FRAME ANTIQUE WHITE} => {WOODEN PICTURE FRAME WHITE FINISH} 0.02525362 0.5770654 0.04376214 12.207962 468
but how we can analysis this table?
Support:
Support tells us how often a rule is true.
Confidence:
Confidence tells us how likely the rule is true when the antecedent occurs.
Lift:
Lift tells us how much more likely the consequent is given the antecedent, compared to its usual frequency. A lift greater than 1 indicates a positive association, close to 1 indicates independence, and less than 1 indicates a negative association
we pick up the rules that their confidence is more than 0.6. which represent us the rules which is more relaiable.
Rules<-rules[quality(rules)$confidence>0.6]
visualisation the top 19 rules:
Here we can see the plot of 19 rules which their confidence is more than 0.6.
plot(Rules)
inspect(sort(Rules, by = "support")[1:2], linebreak = FALSE)
## lhs rhs
## [1] {JUMBO BAG PINK POLKADOT} => {JUMBO BAG RED RETROSPOT}
## [2] {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER}
## support confidence coverage lift count
## [1] 0.02946255 0.6268657 0.04699978 7.260672 546
## [2] 0.02919275 0.7829233 0.03728686 18.530185 541
“JUMBO BAG RED RETROSPOT” is the most popular item in the our online shop. this product most often is appears together with “JUMBO BAG PINK POLKADOT” (546 transactions). Other rule with the highest support indicate that “ROSES REGENCY TEACUP AND SAUCER” is being willingly bought with “GREEN REGENCY TEACUP AND SAUCER” (2.3% of all transactions).
inspect(sort(Rules, by = "confidence")[1:2], linebreak = FALSE)
## lhs
## [1] {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER} =>
## [2] {GREEN REGENCY TEACUP AND SAUCER, PINK REGENCY TEACUP AND SAUCER} =>
## rhs support confidence coverage lift
## [1] {GREEN REGENCY TEACUP AND SAUCER} 0.02104468 0.8944954 0.02352687 23.98956
## [2] {ROSES REGENCY TEACUP AND SAUCER} 0.02104468 0.8478261 0.02482193 20.06630
## count
## [1] 390
## [2] 390
Rules for our data with the highest confidence show that if customer buys {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER}, s/he will also buy {GREEN REGENCY TEACUP AND SAUCER} with the probability of 89%.
inspect(sort(Rules, by = "lift")[1:2], linebreak = FALSE)
## lhs
## [1] {GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER} =>
## [2] {PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER} =>
## rhs support confidence coverage lift
## [1] {PINK REGENCY TEACUP AND SAUCER} 0.02104468 0.7208872 0.02919275 24.02785
## [2] {GREEN REGENCY TEACUP AND SAUCER} 0.02104468 0.8944954 0.02352687 23.98956
## count
## [1] 390
## [2] 390
Lift for this rules is close to 25>1 which show us it is more likely if we choose {PINK REGENCY TEACUP AND SAUCER} then choose {GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER}. and the same interpretation for other rule.
In conclusion, Market Basket Analysis has helped uncover valuable patterns and associations within our data, which can be used to inform strategic decisions such as product placement, and promotional campaigns. Further analysis and interpretation of the association rules may lead to actionable insights for optimizing sales and enhancing customer experience.