This study analyzes 19,559 transactions from a Market Basket Analysis dataset to identify key purchasing patterns using the Apriori and ECLAT algorithms. A Pareto analysis indicated that roughly 20% of products generate 80% of total revenue. Strong associations were observed, with rule confidence often exceeding 90%. Targeted filtering also identified a significant cross-selling opportunity between wooden picture frames and the Regency teacup and saucer series. These insights provide actionable strategies for optimized product bundling and placement to enhance overall profitability.
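library(dplyr)      # data manipulation verbs and %>%
library(ggplot2)    # plotting
library(lubridate)  # dmy_hm(), wday()
library(arules)     # transactions, apriori(), eclat()
# Read the semicolon-separated file; decimals use a comma
df <- read.csv("Assignment-1_Data.csv", sep = ";", dec = ",")
# Drop rows with missing identifiers or non-positive quantity/price
df <- df %>%
  filter(!is.na(Itemname), !is.na(BillNo)) %>%
  filter(Quantity > 0, Price > 0) %>%
  mutate(Itemname = trimws(Itemname))
head(df)
The data set has 19,559 unique transactions and 4,006 unique items.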
summary_info <- df %>%
  summarise(
    total_records = n(),
    unique_bills = n_distinct(BillNo),
    unique_items = n_distinct(Itemname)
  )
print(summary_info)
##   total_records unique_bills unique_items
## 1        519551        19559         4006
# Calculate sales performance per item
product_performance <- df %>%
  mutate(TotalValue = Quantity * Price) %>%
  group_by(Itemname) %>%
  summarise(TotalSalesValue = sum(TotalValue)) %>%
  arrange(desc(TotalSalesValue)) %>%
  mutate(
    CumulativeSales = cumsum(TotalSalesValue),
    PercSales = CumulativeSales / sum(TotalSalesValue),
    ProductRank = row_number(),
    PercProducts = ProductRank / n()
  )
# Visualize the Pareto Curve
ggplot(product_performance, aes(x = PercProducts, y = PercSales)) +
  geom_line(color = "steelblue", size = 1) +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 0.2, linetype = "dashed", color = "red") +
  labs(title = "Pareto Analysis: Sales Value vs. Products",
       subtitle = "Red lines indicate the 80/20 rule threshold",
       x = "Percentage of Products",
       y = "Percentage of Total Sales Value") +
  theme_minimal()
The Pareto chart illustrates that approximately 20% of key products generate 80% of the total sales value, confirming a high concentration of revenue within a small segment of the assortment. This distribution highlights that the vast majority of financial performance is driven by a very limited number of top-performing items.
The largest single contributor to total sales value is postage. Apart from that, the highest revenue was generated by “Paper Craft, Little Birdie”.
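The 80/20 reading of the Pareto curve can also be checked numerically rather than by eye. The following is a minimal sketch on hypothetical toy data (`toy_sales` and its values are invented for illustration and are not taken from the dataset); it reuses the same cumulative-share logic as `product_performance`.

```r
library(dplyr)

# Hypothetical toy data: revenue per item, already aggregated
toy_sales <- data.frame(
  Itemname = paste0("item_", 1:10),
  TotalSalesValue = c(500, 320, 40, 35, 30, 25, 20, 15, 10, 5)
)

pareto_check <- toy_sales %>%
  arrange(desc(TotalSalesValue)) %>%
  mutate(
    PercSales = cumsum(TotalSalesValue) / sum(TotalSalesValue),
    PercProducts = row_number() / n()
  )

# Smallest share of products whose cumulative revenue reaches 80%
share_for_80 <- min(pareto_check$PercProducts[pareto_check$PercSales >= 0.8])

# Share of revenue covered by the top 20% of products
sales_at_20 <- max(pareto_check$PercSales[pareto_check$PercProducts <= 0.2])

print(c(share_for_80 = share_for_80, sales_at_20 = sales_at_20))
```

Applying the same two lines to `product_performance` would give the exact product share behind the "approximately 20%" statement.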
list_of_transactions <- split(df$Itemname, df$BillNo)
transactions <- as(list_of_transactions, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
## 19559 rows (elements/itemsets/transactions) and
## 4006 columns (items) and a density of 0.006493659
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER            JUMBO BAG RED RETROSPOT
##                               2198                               2061
##           REGENCY CAKESTAND 3 TIER                      PARTY BUNTING
##                               1904                               1655
##            LUNCH BAG RED RETROSPOT                            (Other)
##                               1541                             499441
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1609 828 675 646 674 607 604 597 603 530 542 487 498 523 535 546
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 455 433 475 424 392 340 343 304 243 254 236 237 262 216 189 182
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 158 175 133 134 127 120 132 119 117 103 93 99 92 89 80 90
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 84 83 53 60 78 70 69 46 63 46 35 58 38 27 41 38
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 38 41 32 38 28 33 37 23 25 34 26 20 18 26 15 11
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 18 20 15 22 16 17 9 17 11 12 9 14 16 7 4 10
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 9 12 5 10 10 3 6 9 2 4 7 3 4 4 7 3
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 5 6 6 8 6 4 8 4 6 11 4 5 3 5 7 1
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 4 3 3 2 5 4 2 6 6 2 5 6 2 1 5
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 5 3 2 4 5 3 5 3 6 2 3 2 3 4 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 3 3 3 2 5 4 1 4 5 2 1 4 3 4 2 5
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 5 4 2 4 2 6 4 3 3 3 2 3 4 4 2 3
## 193 194 195 196 197 198 199 202 203 204 205 206 207 208 210 211
## 2 3 3 4 2 2 3 2 5 5 1 2 1 4 1 4
## 212 213 214 215 216 217 218 219 220 222 223 224 225 226 227 228
## 1 1 2 1 2 4 2 2 2 1 1 3 3 1 1 1
## 229 230 232 233 234 235 237 238 239 241 242 243 244 247 249 250
## 2 1 1 1 1 1 3 3 1 2 1 2 2 2 3 2
## 253 254 255 257 259 261 262 263 264 266 267 270 273 279 280 282
## 1 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1
## 283 285 286 288 289 291 292 295 296 298 299 301 309 310 315 319
## 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1
## 320 331 332 333 334 339 341 344 345 347 348 349 352 354 357 358
## 1 1 4 1 1 1 1 1 1 2 1 1 2 1 1 1
## 363 369 375 376 379 382 386 388 399 404 408 411 414 415 416 419
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1
## 420 428 433 434 438 439 443 449 453 455 458 460 463 471 482 486
## 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## 487 488 494 499 503 506 514 515 517 518 520 522 524 525 527 529
## 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1
## 531 536 539 541 543 552 561 567 572 578 585 588 589 593 595 599
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 601 607 622 629 635 645 647 649 661 673 676 687 703 720 731 748
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1108
## 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 26.01 29.00 1108.00
##
## includes extended item information - examples:
## labels
## 1 *Boombox Ipod Classic
## 2 *USB Office Mirror Ball
## 3 10 COLOUR SPACEBOY PEN
##
## includes extended transaction information - examples:
## transactionID
## 1 536365
## 2 536366
## 3 536367
The most frequently purchased items include the “White Hanging Heart T-Light Holder” and “Jumbo Bag Red Retrospot”, which serve as primary anchors in customer baskets. On average, a transaction contains about 26 items. The median is 15, meaning that half of the transactions contain 15 or fewer items and the other half contain 15 or more. The gap between the mean and the median reflects a right-skewed distribution with a few very large baskets.
The data reveals a broad range of shopping behaviors, from single-item purchases to a massive outlier of 1108 items in a single basket. These initial metrics confirm a sparse but rich transactional environment suitable for robust association rule mining.
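The density reported in the summary ties the other figures together: density multiplied by the number of item columns gives the mean basket size, and multiplied by the full matrix size gives the number of distinct item-per-bill entries. A short arithmetic sketch, using only numbers printed in the summary above (the variable names are illustrative):

```r
# Values copied from the summary(transactions) output
n_transactions <- 19559
n_items <- 4006
density <- 0.006493659

# Mean number of distinct items per transaction
mean_basket_size <- density * n_items

# Total number of filled cells in the transaction matrix
total_item_occurrences <- density * n_transactions * n_items

print(round(mean_basket_size, 2))
print(round(total_item_occurrences))
```

The second figure is somewhat lower than the 519,551 records in the data frame. A likely explanation, stated here as an assumption, is that an item listed more than once on the same bill is stored only once in the binary transaction matrix.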
# Plot relative frequency of top 20 items
itemFrequencyPlot(transactions, topN = 20, type = "relative",
                  col = "steelblue", main = "Top 20 Frequent Items",
                  cex.names = 0.7)
# Temporal analysis: Transactions by Day of Week
# Note: geom_bar() counts rows of df (line items), not distinct bills
df_time <- df %>%
  mutate(DateProper = dmy_hm(Date),
         DayOfWeek = wday(DateProper, label = TRUE))
ggplot(df_time, aes(x = DayOfWeek)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Transaction Volume by Day of Week",
       x = "Day of Week", y = "Count") +
  theme_minimal()
# CrossTable for first 5 items to see raw co-occurrences
ct <- crossTable(transactions[,1:5], sort = TRUE)
print(ct)
## 10 COLOUR SPACEBOY PEN 12 COLOURED PARTY BALLOONS
## 10 COLOUR SPACEBOY PEN 312 10
## 12 COLOURED PARTY BALLOONS 10 160
## 12 DAISY PEGS IN WOOD BOX 7 1
## *USB Office Mirror Ball 0 0
## *Boombox Ipod Classic 0 0
## 12 DAISY PEGS IN WOOD BOX *USB Office Mirror Ball
## 10 COLOUR SPACEBOY PEN 7 0
## 12 COLOURED PARTY BALLOONS 1 0
## 12 DAISY PEGS IN WOOD BOX 73 0
## *USB Office Mirror Ball 0 2
## *Boombox Ipod Classic 0 0
## *Boombox Ipod Classic
## 10 COLOUR SPACEBOY PEN 0
## 12 COLOURED PARTY BALLOONS 0
## 12 DAISY PEGS IN WOOD BOX 0
## *USB Office Mirror Ball 0
## *Boombox Ipod Classic 1
\[support(X)=\frac{count(X)}{N}\]
\[confidence(X \to Y)=\frac{support(X,Y)}{support(X)}\]
\[lift(X \to Y)=\frac{confidence(X\to Y)}{support(Y)}\]
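As a worked example of the three measures, the sketch below recomputes them for one rule that appears in the results of this report, {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION}. The basket counts are back-calculated from the support and coverage values reported for that rule and its reverse, so they are derived figures rather than direct output.

```r
# Counts back-calculated from the reported support and coverage values
n_total <- 19559  # all transactions (N)
n_bell  <- 125    # baskets containing X = CHRISTMAS TREE DECORATION WITH BELL
n_heart <- 180    # baskets containing Y = CHRISTMAS TREE HEART DECORATION
n_both  <- 110    # baskets containing both X and Y

support_x  <- n_bell / n_total
support_y  <- n_heart / n_total
support_xy <- n_both / n_total

confidence_xy <- support_xy / support_x   # support(X,Y) / support(X)
lift_xy <- confidence_xy / support_y      # confidence(X -> Y) / support(Y)

print(round(c(support = support_xy,
              confidence = confidence_xy,
              lift = lift_xy), 5))
```

The results agree with the values reported for this rule: support of about 0.0056, confidence of 0.88, and lift of about 95.6.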
The Apriori algorithm is used to find strong relationships between products. The minimum support was lowered from 0.01 to 0.005 to surface less frequent rules, with a minimum confidence of 0.6.
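The chunk that produced the log that follows is not shown. The call recorded in that log is `apriori(data = transactions, parameter = list(supp = 0.005, conf = 0.6))`. How `rules_clean` was derived from the raw rules is not stated; the drop from 16,056 written rules to 15,229 summarized rules would be consistent with a redundancy filter such as `is.redundant()`, but that is an assumption. A minimal sketch on hypothetical toy baskets (`toy_baskets` is invented for illustration):

```r
library(arules)

# Hypothetical toy baskets standing in for the real `transactions` object
toy_baskets <- list(
  c("cup", "saucer", "frame"),
  c("cup", "saucer"),
  c("cup", "saucer", "frame"),
  c("bag", "lunch bag"),
  c("cup", "saucer", "bag"),
  c("frame", "bag")
)
toy_trans <- as(toy_baskets, "transactions")

# Same thresholds as the logged call: supp = 0.005, conf = 0.6
rules <- apriori(toy_trans,
                 parameter = list(supp = 0.005, conf = 0.6),
                 control = list(verbose = FALSE))

# Assumption: rules_clean drops rules that a more general rule already covers
rules_clean <- rules[!is.redundant(rules)]
rules_clean_sorted <- sort(rules_clean, by = "lift", decreasing = TRUE)

inspect(head(rules_clean_sorted, 5))
```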
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 97
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4006 item(s), 19559 transaction(s)] done [0.08s].
## sorting and recoding items ... [1554 item(s)] done [0.01s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.09s].
## writing ... [16056 rule(s)] done [0.01s].
## creating S4 object ... done [0.00s].
## set of 15229 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7
## 313 6173 6665 1904 169 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.702 4.000 7.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005010 Min. :0.6000 Min. :0.005010 Min. : 5.35
## 1st Qu.:0.005266 1st Qu.:0.6711 1st Qu.:0.006800 1st Qu.: 11.53
## Median :0.005675 Median :0.7465 Median :0.007874 Median : 16.10
## Mean :0.006270 Mean :0.7585 Mean :0.008448 Mean : 18.42
## 3rd Qu.:0.006595 3rd Qu.:0.8333 3rd Qu.:0.009152 3rd Qu.: 21.77
## Max. :0.041873 Max. :1.0000 Max. :0.061915 Max. :110.65
## count
## Min. : 98.0
## 1st Qu.:103.0
## Median :111.0
## Mean :122.6
## 3rd Qu.:129.0
## Max. :819.0
##
## mining info:
## data ntransactions support confidence
## transactions 19559 0.005 0.6
## call
## apriori(data = transactions, parameter = list(supp = 0.005, conf = 0.6))
# Sort by lift to see the strongest associations
rules_clean_sorted <- sort(rules_clean, by = "lift", decreasing = TRUE)
# Inspect top 10 rules
inspect(head(rules_clean_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {CHRISTMAS TREE HEART DECORATION,
## CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005061608 0.7071429 0.007157830 110.64806 99
## [2] {CHRISTMAS TREE DECORATION WITH BELL,
## CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005061608 0.9519231 0.005317245 103.43702 99
## [3] {DOLLY GIRL CHILDRENS CUP,
## SPACEBOY CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS BOWL} 0.005163863 0.9805825 0.005266118 102.56264 101
## [4] {DOLLY GIRL CHILDRENS CUP,
## SPACEBOY CHILDRENS BOWL} => {DOLLY GIRL CHILDRENS BOWL} 0.005777391 0.9576271 0.006033028 100.16165 113
## [5] {DOLLY GIRL CHILDRENS BOWL,
## SPACEBOY CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS CUP} 0.005163863 0.9528302 0.005419500 98.08635 101
## [6] {DOLLY GIRL CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS CUP} 0.005572882 0.9396552 0.005930774 96.73008 109
## [7] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION} 0.005624009 0.8800000 0.006390920 95.62178 110
## [8] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005624009 0.6111111 0.009202924 95.62178 110
## [9] {PINK VINTAGE SPOT BEAKER,
## RED VINTAGE SPOT BEAKER} => {GREEN VINTAGE SPOT BEAKER} 0.005163863 0.8487395 0.006084156 91.71545 101
## [10] {BLUE VINTAGE SPOT BEAKER,
## GREEN VINTAGE SPOT BEAKER} => {PINK VINTAGE SPOT BEAKER} 0.005777391 0.8496241 0.006799939 89.34299 113
The Apriori algorithm generated over 15,000 rules with a mean confidence of about 76%, well above the 60% minimum threshold. Most of these rules involve 3 to 4 items, suggesting that customers frequently buy coordinated sets rather than just individual pairs. The top results reveal exceptionally strong connections in seasonal Christmas decorations and themed children’s bowls and cups, where lift values exceed 100. A lift this high means these items appear together roughly 100 times more often than would be expected if they were purchased independently. It does not by itself mean they are never purchased alone: for example, only 61% of baskets containing the heart decoration also contain the bell decoration.
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 97
##
## create itemset ...
## set transactions ...[4006 item(s), 19559 transaction(s)] done [0.09s].
## sorting and recoding items ... [1554 item(s)] done [0.01s].
## creating sparse bit matrix ... [1554 row(s), 19559 column(s)] done [0.00s].
## writing ... [18590 set(s)] done [1.51s].
## Creating S4 object ... done [0.00s].
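The call that produced the ECLAT log above is not shown either. Based on the logged parameter specification (support 0.005, maxlen 5, target frequent itemsets), it was presumably along the lines of the sketch below; the exact arguments are an assumption, and `toy_baskets` is hypothetical data used only to keep the example runnable.

```r
library(arules)

# Hypothetical toy baskets standing in for the real `transactions` object
toy_baskets <- list(
  c("jumbo bag red", "jumbo bag pink"),
  c("jumbo bag red", "jumbo bag pink", "lunch bag red"),
  c("teacup green", "teacup roses"),
  c("teacup green", "teacup roses", "teacup pink"),
  c("jumbo bag red", "lunch bag red")
)
toy_trans <- as(toy_baskets, "transactions")

# Parameters taken from the logged specification: support 0.005, maxlen 5
frequent_items <- eclat(toy_trans,
                        parameter = list(supp = 0.005, maxlen = 5),
                        control = list(verbose = FALSE))

# Keep only itemsets with at least two items
frequent_items_size2 <- frequent_items[size(frequent_items) > 1]
inspect(head(sort(frequent_items_size2, by = "support"), 3))
```

Unlike `apriori()` with `target = "rules"`, `eclat()` returns frequent itemsets only, which is why the report needs a separate `ruleInduction()` step to obtain rules from them.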
frequent_items_size2 <- frequent_items[size(frequent_items) > 1]
inspect(head(sort(frequent_items_size2, by = "support"), 10))
## items support count
## [1] {JUMBO BAG PINK POLKADOT,
## JUMBO BAG RED RETROSPOT} 0.04187331 819
## [2] {GREEN REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} 0.03732297 730
## [3] {JUMBO BAG RED RETROSPOT,
## JUMBO STORAGE BAG SUKI} 0.03686283 721
## [4] {JUMBO BAG RED RETROSPOT,
## JUMBO SHOPPER VINTAGE RED PAISLEY} 0.03456209 676
## [5] {ALARM CLOCK BAKELIKE GREEN,
## ALARM CLOCK BAKELIKE RED} 0.03236362 633
## [6] {LUNCH BAG BLACK SKULL.,
## LUNCH BAG RED RETROSPOT} 0.03231249 632
## [7] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} 0.03088092 604
## [8] {LUNCH BAG PINK POLKADOT,
## LUNCH BAG RED RETROSPOT} 0.03031852 593
## [9] {JUMBO BAG BAROQUE BLACK WHITE,
## JUMBO BAG RED RETROSPOT} 0.02965387 580
## [10] {JUMBO BAG RED RETROSPOT,
## LUNCH BAG RED RETROSPOT} 0.02929598 573
# Induce rules from ECLAT results
rules_from_eclat <- ruleInduction(frequent_items, transactions, confidence = 0.6)
inspect(head(sort(rules_from_eclat, by="lift"), 10))
## lhs rhs support confidence lift itemset
## [1] {CHRISTMAS TREE HEART DECORATION,
## CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005061608 0.7071429 110.64806 1031
## [2] {CHRISTMAS TREE DECORATION WITH BELL,
## CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005061608 0.9519231 103.43702 1031
## [3] {DOLLY GIRL CHILDRENS CUP,
## SPACEBOY CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS BOWL} 0.005163863 0.9805825 102.56264 198
## [4] {DOLLY GIRL CHILDRENS CUP,
## SPACEBOY CHILDRENS BOWL} => {DOLLY GIRL CHILDRENS BOWL} 0.005777391 0.9576271 100.16165 200
## [5] {DOLLY GIRL CHILDRENS BOWL,
## SPACEBOY CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS CUP} 0.005163863 0.9528302 98.08635 198
## [6] {DOLLY GIRL CHILDRENS BOWL,
## SPACEBOY CHILDRENS CUP} => {DOLLY GIRL CHILDRENS CUP} 0.005572882 0.9396552 96.73008 242
## [7] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005624009 0.6111111 95.62178 1034
## [8] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION} 0.005624009 0.8800000 95.62178 1034
## [9] {PINK VINTAGE SPOT BEAKER,
## RED VINTAGE SPOT BEAKER} => {GREEN VINTAGE SPOT BEAKER} 0.005163863 0.8487395 91.71545 210
## [10] {BLUE VINTAGE SPOT BEAKER,
## GREEN VINTAGE SPOT BEAKER} => {PINK VINTAGE SPOT BEAKER} 0.005777391 0.8496241 89.34299 211
The highest-support itemsets show a strong preference for purchasing matching pairs within the same product lines, particularly for storage bags, teacups, and lunch bags. These itemsets indicate that customers frequently buy multiple color variations of the same item to complete a set, representing the most common shopping behaviors in the store. The rules induced from the ECLAT itemsets reproduce the top Apriori rules with identical support, confidence, and lift values.
# Subset rules for specific criteria
selected_rules <- subset(rules,
                         items %pin% "PICTURE" &
                           !(rhs %pin% "POSTAGE") &
                           !(lhs %pin% "POSTAGE"))
selected_rules_sorted <- sort(selected_rules, by = "lift", decreasing = TRUE)
# Inspect the filtered rules
inspect(head(selected_rules_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {PICTURE DOMINOES,
## VINTAGE SNAP CARDS} => {VINTAGE HEADS AND TAILS CARD GAME} 0.006135283 0.6936416 0.008845033 22.83996 120
## [2] {GREEN REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {PINK REGENCY TEACUP AND SAUCER} 0.005163863 0.7214286 0.007157830 19.19785 101
## [3] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {GREEN REGENCY TEACUP AND SAUCER} 0.005163863 0.9266055 0.005572882 18.62639 101
## [4] {PINK REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {GREEN REGENCY TEACUP AND SAUCER} 0.005930774 0.8854962 0.006697684 17.80002 116
## [5] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {ROSES REGENCY TEACUP AND SAUCER} 0.005163863 0.8706897 0.005930774 16.82788 101
## [6] {ROSES REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {PINK REGENCY TEACUP AND SAUCER} 0.005572882 0.6264368 0.008896160 16.67004 109
## [7] {PICTURE DOMINOES,
## VINTAGE HEADS AND TAILS CARD GAME} => {VINTAGE SNAP CARDS} 0.006135283 0.7407407 0.008282632 16.44512 120
## [8] {GREEN REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {PINK REGENCY TEACUP AND SAUCER} 0.005930774 0.6137566 0.009663071 16.33261 116
## [9] {ROSES REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {GREEN REGENCY TEACUP AND SAUCER} 0.007157830 0.8045977 0.008896160 16.17382 140
## [10] {PINK REGENCY TEACUP AND SAUCER,
## WOODEN PICTURE FRAME WHITE FINISH} => {ROSES REGENCY TEACUP AND SAUCER} 0.005572882 0.8320611 0.006697684 16.08131 109
The picture category analysis reveals two distinct purchasing patterns: a vintage game cluster and a decorative association between white wooden frames and the Regency teacup collection. In the Regency rules the frame always appears on the antecedent side, with lift values of roughly 16 to 19 and confidence of up to 93%. This suggests that the white frame is a frequent companion for customers building a complete Regency-themed set, making it a candidate for cross-promotional bundling.
The analysis confirmed a strong concentration of revenue, highlighting the role of a narrow group of products in driving sales. The identified association rules offer immediate opportunities for sales growth through strategic cross-selling and attractive product bundling.