In this project, I will analyse association rules among subcategories of supermarket products. The main aim is to give insights to marketing department for shelf design and promotions.
The data has been acquired from the Kaggle website (V. Chowdhury 2022). The data consists of several columns about customers, branches, and product. However for Market Basket Analysis we need only two columns from the data:
First one is Sub-Category. I use Sub-Category column instead of product because there are extensive range of products. This much detail will not give us a clear insights. In further steps, one can filter out for several sub-categories and make a market basket analysis with product names.
Second is Customer.ID. As the transaction of each subcategory has been shown in separate rows, we need Customer.ID to detect the transaction pathes of customers.
data <- read.csv2("Sample - Superstore.csv", sep = ",")
str(data)
## 'data.frame': 9994 obs. of 21 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
## $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Customer.Name: chr "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category : chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : chr "261.96" "731.94" "14.62" "957.5775" ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : chr "0" "0" "0" "0.45" ...
## $ Profit : chr "41.9136" "219.582" "6.8714" "-383.031" ...
data <- subset(data, select = c(Customer.ID, Sub.Category))
str(data)
## 'data.frame': 9994 obs. of 2 variables:
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
summary(data)
## Customer.ID Sub.Category
## Length:9994 Length:9994
## Class :character Class :character
## Mode :character Mode :character
As Apriori algorithm use special type of data (transaction), we need to convert our data.
trans <- as(split(data[,"Sub.Category"], data[,"Customer.ID"]), "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(trans)
## transactions as itemMatrix in sparse format with
## 793 rows (elements/itemsets/transactions) and
## 17 columns (items) and a density of 0.4452192
##
## most frequent items:
## Binders Paper Furnishings Storage Phones (Other)
## 650 611 528 514 511 3188
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 7 15 27 46 71 103 119 114 99 94 53 22 20 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.000 7.569 9.000 15.000
##
## includes extended item information - examples:
## labels
## 1 Accessories
## 2 Appliances
## 3 Art
##
## includes extended transaction information - examples:
## transactionID
## 1 AA-10315
## 2 AA-10375
## 3 AA-10480
inspect(trans[1:10])
## items transactionID
## [1] {Accessories,
## Appliances,
## Binders,
## Fasteners,
## Furnishings,
## Paper,
## Phones,
## Storage,
## Supplies} AA-10315
## [2] {Accessories,
## Art,
## Binders,
## Furnishings,
## Paper,
## Phones,
## Storage} AA-10375
## [3] {Accessories,
## Art,
## Furnishings,
## Paper,
## Phones,
## Storage,
## Tables} AA-10480
## [4] {Art,
## Binders,
## Bookcases,
## Chairs,
## Envelopes,
## Furnishings,
## Paper,
## Phones,
## Storage} AA-10645
## [5] {Art,
## Bookcases,
## Chairs,
## Phones,
## Storage} AB-10015
## [6] {Accessories,
## Appliances,
## Binders,
## Chairs,
## Furnishings,
## Paper,
## Supplies,
## Tables} AB-10060
## [7] {Accessories,
## Art,
## Binders,
## Furnishings,
## Labels,
## Machines,
## Phones,
## Storage,
## Tables} AB-10105
## [8] {Accessories,
## Art,
## Binders,
## Furnishings,
## Paper,
## Supplies} AB-10150
## [9] {Accessories,
## Art,
## Binders,
## Chairs,
## Furnishings,
## Paper,
## Storage} AB-10165
## [10] {Accessories,
## Art,
## Binders,
## Copiers,
## Fasteners,
## Furnishings,
## Paper,
## Phones,
## Storage,
## Supplies} AB-10255
Before making analyzes, let’s first look for our sub-category frequency distribution.
itemFrequencyPlot(trans, topN = 17, type="absolute", main="Item Frequency")
itemFrequencyPlot(trans, topN = 17, type="relative", main="Item Frequency")
As you can see, the most purchased sub-categories are binders and papers.
rules <- apriori(trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 793 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans, parameter = list(support = 0.006, confidence = 0.25, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.02s].
## writing ... [176470 rule(s)] done [0.04s].
## creating S4 object ... done [0.11s].
inspect(rules[1:10])
## lhs rhs support confidence coverage lift
## [1] {Copiers} => {Bookcases} 0.02143758 0.265625 0.08070618 1.0802083
## [2] {Copiers} => {Envelopes} 0.02774275 0.343750 0.08070618 1.3232706
## [3] {Copiers} => {Tables} 0.02269861 0.281250 0.08070618 0.8545259
## [4] {Copiers} => {Labels} 0.02648172 0.328125 0.08070618 0.9259898
## [5] {Copiers} => {Appliances} 0.03404792 0.421875 0.08070618 0.9397384
## [6] {Copiers} => {Chairs} 0.04665826 0.578125 0.08070618 1.1264205
## [7] {Copiers} => {Accessories} 0.04539723 0.562500 0.08070618 0.9410601
## [8] {Copiers} => {Art} 0.05800757 0.718750 0.08070618 1.1537829
## [9] {Copiers} => {Phones} 0.06052963 0.750000 0.08070618 1.1638943
## [10] {Copiers} => {Storage} 0.05548550 0.687500 0.08070618 1.0606761
## count
## [1] 17
## [2] 22
## [3] 18
## [4] 21
## [5] 27
## [6] 37
## [7] 36
## [8] 46
## [9] 48
## [10] 44
Explanation of result:
lhs - The first product(s) in purchasing path or IF product also known as antecedent.
rhs - The following product(s) in purchasing path or THEN product also know as consequent.
Support - “how frequent an itemsetis in all the transactions”. (A. Garg 2018)
Confidence - “the likeliness of occurrence of consequent on the cart given that the cart already has the antecedents.” (A. Garg 2018)
Lift - “the support (frequency) of consequent while calculating the conditional probability of occurrence of {Y} given {X}.” (A. Garg 2018)
Coverage - “the support of the left-hand-side of the rule X => Y.” (rdrr.io)
In the below tables you can see the top 5 rules sorted for each element of result.
inspect(sort(rules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Chairs,
## Envelopes,
## Fasteners,
## Paper,
## Phones,
## Supplies} => {Machines} 0.007566204 0.8571429 0.008827238 6.865801 6
## [2] {Appliances,
## Chairs,
## Envelopes,
## Fasteners,
## Paper,
## Phones,
## Supplies} => {Machines} 0.007566204 0.8571429 0.008827238 6.865801 6
## [3] {Binders,
## Chairs,
## Envelopes,
## Fasteners,
## Paper,
## Phones,
## Supplies} => {Machines} 0.007566204 0.8571429 0.008827238 6.865801 6
## [4] {Appliances,
## Binders,
## Chairs,
## Envelopes,
## Fasteners,
## Paper,
## Phones,
## Supplies} => {Machines} 0.007566204 0.8571429 0.008827238 6.865801 6
## [5] {Appliances,
## Chairs,
## Envelopes,
## Fasteners,
## Phones,
## Storage,
## Supplies} => {Machines} 0.006305170 0.8333333 0.007566204 6.675084 5
inspect(sort(rules, by = "confidence")[1:5])
## lhs rhs support confidence coverage
## [1] {Copiers, Machines} => {Binders} 0.015132409 1 0.015132409
## [2] {Copiers, Supplies} => {Binders} 0.017654477 1 0.017654477
## [3] {Copiers, Labels, Machines} => {Phones} 0.007566204 1 0.007566204
## [4] {Copiers, Labels, Machines} => {Paper} 0.007566204 1 0.007566204
## [5] {Copiers, Labels, Machines} => {Binders} 0.007566204 1 0.007566204
## lift count
## [1] 1.220000 12
## [2] 1.220000 14
## [3] 1.551859 6
## [4] 1.297872 6
## [5] 1.220000 6
inspect(sort(rules, by = "support")[1:5])
## lhs rhs support confidence coverage lift
## [1] {Paper} => {Binders} 0.6506936 0.8445172 0.7704918 1.030311
## [2] {Binders} => {Paper} 0.6506936 0.7938462 0.8196721 1.030311
## [3] {Furnishings} => {Binders} 0.5535939 0.8314394 0.6658260 1.014356
## [4] {Binders} => {Furnishings} 0.5535939 0.6753846 0.8196721 1.014356
## [5] {Storage} => {Binders} 0.5460277 0.8424125 0.6481715 1.027743
## count
## [1] 516
## [2] 516
## [3] 439
## [4] 439
## [5] 433
inspect(sort(rules, by = "count")[1:5])
## lhs rhs support confidence coverage lift
## [1] {Paper} => {Binders} 0.6506936 0.8445172 0.7704918 1.030311
## [2] {Binders} => {Paper} 0.6506936 0.7938462 0.8196721 1.030311
## [3] {Furnishings} => {Binders} 0.5535939 0.8314394 0.6658260 1.014356
## [4] {Binders} => {Furnishings} 0.5535939 0.6753846 0.8196721 1.014356
## [5] {Storage} => {Binders} 0.5460277 0.8424125 0.6481715 1.027743
## count
## [1] 516
## [2] 516
## [3] 439
## [4] 439
## [5] 433
The better way to see results is to use graphs.
plot(sort(rules, by = "confidence")[1:100], method="graph")
plot(rules, shading="order", control=list(main="Two-key plot"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.