This is a market basket analysis based on a sample dataset from Kaggle.
Source: https://www.kaggle.com/carrie1/ecommerce-data#data.csv
#load packages
library(tidyverse)
library(arules)
library(arulesViz)
library(htmlwidgets)
library(plotly)
#load dataset
library(readr)
data <- read_csv("~/Downloads/data.csv")
#view glimpse
glimpse(data)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo <chr> "536365", "536365", "536365", "536365", "536365", "5363...
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, ...
## $ InvoiceDate <chr> "12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:26", "...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1...
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "...
#create a transactional object and view the first few rows
df_list = split(data$Description,
data$InvoiceNo)
df_trx = as(df_list, "transactions")
inspect(head(df_trx))
## items transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,
## GLASS STAR FROSTED T-LIGHT HOLDER,
## KNITTED UNION FLAG HOT WATER BOTTLE,
## RED WOOLLY HOTTIE WHITE HEART.,
## SET 7 BABUSHKA NESTING BOXES,
## WHITE HANGING HEART T-LIGHT HOLDER,
## WHITE METAL LANTERN} 536365
## [2] {HAND WARMER RED POLKA DOT,
## HAND WARMER UNION JACK} 536366
## [3] {ASSORTED COLOUR BIRD ORNAMENT,
## BOX OF 6 ASSORTED COLOUR TEASPOONS,
## BOX OF VINTAGE ALPHABET BLOCKS,
## BOX OF VINTAGE JIGSAW BLOCKS,
## DOORMAT NEW ENGLAND,
## FELTCRAFT PRINCESS CHARLOTTE DOLL,
## HOME BUILDING BLOCK WORD,
## IVORY KNITTED MUG COSY,
## LOVE BUILDING BLOCK WORD,
## POPPY'S PLAYHOUSE BEDROOM,
## POPPY'S PLAYHOUSE KITCHEN,
## RECIPE BOX WITH METAL HEART} 536367
## [4] {BLUE COAT RACK PARIS FASHION,
## JAM MAKING SET WITH JARS,
## RED COAT RACK PARIS FASHION,
## YELLOW COAT RACK PARIS FASHION} 536368
## [5] {BATH BUILDING BLOCK WORD} 536369
## [6] {ALARM CLOCK BAKELIKE GREEN,
## ALARM CLOCK BAKELIKE PINK,
## ALARM CLOCK BAKELIKE RED,
## CHARLOTTE BAG DOLLY GIRL DESIGN,
## CIRCUS PARADE LUNCH BOX,
## INFLATABLE POLITICAL GLOBE,
## LUNCH BOX I LOVE LONDON,
## MINI JIGSAW CIRCUS PARADE,
## MINI JIGSAW SPACEBOY,
## MINI PAINT SET VINTAGE,
## PANDA AND BUNNIES STICKER SHEET,
## POSTAGE,
## RED TOADSTOOL LED NIGHT LIGHT,
## ROUND SNACK BOXES SET OF4 WOODLAND,
## SET 2 TEA TOWELS I LOVE LONDON,
## SET/2 RED RETROSPOT TEA TOWELS,
## SPACEBOY LUNCH BOX,
## STARS GIFT TAPE,
## VINTAGE HEADS AND TAILS CARD GAME,
## VINTAGE SEASIDE JIGSAW PUZZLES} 536370
#most popular
itemFrequencyPlot(df_trx,
type = 'absolute',
topN = 15,
horiz = TRUE,
col = 'indianred4',
main = 'Most popular items (absolute)')
itemFrequencyPlot(df_trx,
type = 'relative',
topN = 15,
horiz = TRUE,
col = 'indianred4',
main = 'Most popular items (relative)')
To find appropriate parameters for rule mining, I will test support levels of 1%, 2% and 3%, loop through a range of confidence levels for each, and visualize the results. (The roles could equally be reversed: fix a few confidence levels and loop through support levels instead.)
# Set of confidence levels
confidenceLevels = seq(from=0.95, to=0.5, by=-0.05)
# Create empty vector
rules_sup1 = NULL
# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)) {
rules_sup1[i] =
length(apriori(df_trx,
parameter=list(sup=0.01,
conf=confidenceLevels[i],
target="rules")))
}
# Create empty vector
rules_sup2 = NULL
# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels)) {
rules_sup2[i] =
length(apriori(df_trx,
parameter=list(sup=0.02,
conf=confidenceLevels[i],
target="rules")))
}
# Create empty vector
rules_sup3 = NULL
# Apriori algorithm with a support level of 3%
for (i in 1:length(confidenceLevels)) {
rules_sup3[i] =
length(apriori(df_trx,
parameter=list(sup=0.03,
conf=confidenceLevels[i],
target="rules")))
}
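The three loops above are nearly identical. As a more compact alternative (a sketch; count_rules is a helper name introduced here, not part of the arules package), the counting can be wrapped in a function and applied over the confidence levels with sapply, producing the same vectors with less repetition:
# helper: count rules mined at a given support level across all confidence levels
count_rules = function(sup) {
  sapply(confidenceLevels, function(conf) {
    length(apriori(df_trx,
                   parameter = list(sup = sup, conf = conf, target = "rules"),
                   control = list(verbose = FALSE)))
  })
}
# e.g. rules_sup1 = count_rules(0.01); rules_sup2 = count_rules(0.02); rules_sup3 = count_rules(0.03)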
# Create data frame with all metrics to be plotted
nb_rules = data.frame(rules_sup1, rules_sup2, rules_sup3,
confidenceLevels)
# Number of rules found with a support level of 1% and 2%
rules_plot <- ggplot(data=nb_rules, aes(x=confidenceLevels)) +
# Lines and points for rules_sup1
geom_line(aes(y=rules_sup1, colour="Support level of 1%")) +
geom_point(aes(y=rules_sup1,colour="Support level of 1%")) +
# Lines and points for rules_sup2
geom_line(aes(y=rules_sup2, colour="Support level of 2%")) +
geom_point(aes(y=rules_sup2,colour="Support level of 2%")) +
# Lines and points for rules_sup3
geom_line(aes(y=rules_sup3, colour="Support level of 3%")) +
geom_point(aes(y=rules_sup3,colour="Support level of 3%")) +
# Polishing the graph
theme_bw() + ylab("") +
ggtitle("Number of extracted rules with apriori")
ggplotly(rules_plot)
Based on this graph, a minimum support of 1% with a confidence level of 80% results in 25 rules, whereas a minimum support of 2% with a confidence level of 60% results in 18 rules. Lowering the support and confidence thresholds produces more rules: the fewer transactions an itemset must appear in, and the less often the rule must hold, the more associations are found.
Therefore, when choosing parameters for the apriori algorithm, a few different approaches may be considered. For instance, if I want at least 100 rules, I might choose a minimum support of 1% and a confidence level of 65%. If, instead, I want at least 25 rules at the highest possible support and confidence levels, I might choose a support level of 1% and a confidence level of 80%.
Here I will choose a support level of 1% and a confidence level of 60%, which results in 179 rules.
#Apply the apriori algorithm to find rules:
#minimum support: 0.01 (must occur in at least 1% of all transactions)
#minimum confidence: 0.6 (rule holds in at least 60% of the transactions containing the left-hand side)
#minlen = 2 (at least two items per rule, i.e. no rules with an empty left-hand side)
rules = apriori(df_trx,
parameter = list(supp = 0.01,
conf = 0.6,
minlen = 2,
target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 259
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4211 item(s), 25900 transaction(s)] done [0.19s].
## sorting and recoding items ... [590 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.03s].
## writing ... [179 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# find redundant rules
redundant_rules = is.redundant(rules)
# leave out redundant rules
rules1 = rules[!redundant_rules]
summary(rules1)
## set of 179 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 62 106 11
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.715 3.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01000 Min. :0.6013 Min. : 7.404 Min. :259.0
## 1st Qu.:0.01064 1st Qu.:0.6458 1st Qu.:10.371 1st Qu.:275.5
## Median :0.01174 Median :0.6880 Median :17.844 Median :304.0
## Mean :0.01316 Mean :0.7048 Mean :19.428 Mean :340.7
## 3rd Qu.:0.01301 3rd Qu.:0.7558 3rd Qu.:22.041 3rd Qu.:337.0
## Max. :0.03216 Max. :0.8991 Max. :60.260 Max. :833.0
##
## mining info:
## data ntransactions support confidence
## df_trx 25900 0.01 0.6
After removing redundant rules, the same number of rules (179) remains, i.e. none were flagged as redundant. The minimum support and confidence values match the parameters specified for the apriori algorithm above.
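As a quick check, the rules flagged as redundant can be counted directly; since all 179 rules remain, this should be zero:
# number of rules flagged by is.redundant()
sum(redundant_rules)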
The median and maximum values provide interesting insights into the retrieved rules. On average, the itemsets appear in about 1% of all transactions, with a maximum of about 3%. This suggests that a variety of products are often bought from the retailer in combination with each other, but even the most frequently co-purchased itemsets appear in no more than 3% of all transactions.
However, it must be noted that market basket analysis is not concerned with the quantities purchased, only with whether an item appears at least once in a transaction.
rules_df <- as(rules1, 'data.frame')
head(rules_df)
## rules
## 1 {PAINTED METAL PEARS ASSORTED} => {ASSORTED COLOUR BIRD ORNAMENT}
## 2 {PINK HAPPY BIRTHDAY BUNTING} => {BLUE HAPPY BIRTHDAY BUNTING}
## 3 {BLUE HAPPY BIRTHDAY BUNTING} => {PINK HAPPY BIRTHDAY BUNTING}
## 4 {CANDLEHOLDER PINK HANGING HEART} => {WHITE HANGING HEART T-LIGHT HOLDER}
## 5 {ALARM CLOCK BAKELIKE ORANGE} => {ALARM CLOCK BAKELIKE RED}
## 6 {SET OF 6 TEA TIME BAKING CASES} => {SET OF 12 FAIRY CAKE BAKING CASES}
## support confidence lift count
## 1 0.01003861 0.6989247 12.339571 260
## 2 0.01019305 0.6502463 42.853382 264
## 3 0.01019305 0.6717557 42.853382 264
## 4 0.01096525 0.7047146 7.928805 284
## 5 0.01196911 0.6709957 16.076585 310
## 6 0.01007722 0.6796875 31.604859 261
Here we can see that PAINTED METAL PEARS ASSORTED and ASSORTED COLOUR BIRD ORNAMENT appear together in about 1% of all transactions, and roughly 69% of customers who bought the first item also bought the second. A lift well above 1 (about 12 here) means the two items appear together far more often than would be expected if they were purchased independently, indicating a strong positive association.
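As an illustrative sanity check (not part of the original output): lift is the rule's confidence divided by the support of the consequent, so dividing the reported confidence by the item frequency of the consequent should reproduce a lift of roughly 12.
# lift = confidence / support(rhs); the item label must match exactly
0.6989247 / itemFrequency(df_trx)["ASSORTED COLOUR BIRD ORNAMENT"]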
plot(rules1, engine = 'plotly')
The rule with the highest lift involves REGENCY TEA PLATE PINK and REGENCY TEA PLATE GREEN.
The two items appear together in about 1% of all baskets, and roughly 90% of customers who purchased REGENCY TEA PLATE PINK also purchased REGENCY TEA PLATE GREEN. The high lift value indicates that the two purchases are strongly associated: someone who buys the first item is far more likely than average to buy the second.
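To see these top rules in text form rather than only in the interactive plot, the rules can be sorted by lift and inspected; the REGENCY TEA PLATE rules should appear at the top:
inspect(head(sort(rules1, by = "lift"), 5))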
plot(head(sort(rules1, by= "support"), 5),
method = "graph",
engine = "htmlwidget")
plot(head(sort(rules1, by= "confidence"), 5),
method = "graph",
engine = "htmlwidget")
plot(head(sort(rules1, by= "lift"), 5),
method = "graph",
engine = "htmlwidget")
For further analysis, we could restrict the rules to a specific item or itemset on the left-hand side to understand which other items people also bought when they bought it. Likewise, fixing a specific item or itemset on the right-hand side gives us the rules describing which purchases led to buying that particular item, as sketched below.
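For example (a sketch using items that appear in the rules above), subset() restricts the mined rules to those containing a given item on the left- or right-hand side, and the appearance argument of apriori() mines rules with a fixed right-hand side directly:
# rules with a specific item in the antecedent (left-hand side)
inspect(subset(rules1, lhs %in% "REGENCY TEA PLATE PINK"))
# rules leading to the purchase of a specific item (right-hand side)
inspect(subset(rules1, rhs %in% "WHITE HANGING HEART T-LIGHT HOLDER"))
# mine only rules with that item as the consequent
rules_rhs = apriori(df_trx,
                    parameter = list(supp = 0.01, conf = 0.6, minlen = 2),
                    appearance = list(default = "lhs",
                                      rhs = "WHITE HANGING HEART T-LIGHT HOLDER"))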
This information has many practical applications, for instance arranging product placement in a retail setting, clustering or segmenting customers based on their shopping patterns, or sending personalized offers based on what customers have bought in the past and might be interested in purchasing in the future.
Additionally, we may find products that are popular on their own, so that whether they are purchased does not depend on other items. On the other hand, we may also discover items that tend to be purchased mostly in combination with other products, which can help explain why a particular item sells better or worse than another.
Let’s imagine two hypothetical scenarios in a supermarket setting with products A and B.
In the first scenario, the factory that produces product A announces that, due to an unexpected logistical issue, it will not be able to send new stock to the supermarket for two weeks. The supermarket's data analyst finds that product B is often bought together with product A, with B the consequent of purchasing A. Since product B is perishable with a very short shelf life and is likely to sell far less while product A is unavailable, the supermarket can plan ahead to reduce the cost of waste: for example, it may run a campaign to promote product B on its own, or order less stock while product A has reduced availability.
In the second scenario, the sales of product B drop unexpectedly at the beginning of the week and we want to know why. The data analyst finds that product B is often bought in combination with product A, which currently has reduced availability; because customers cannot buy product A, the sales of product B drop as well. Again, the supermarket may want to plan ahead so the product does not go to waste.
Finally, an interactive table is a convenient way to explore all rules based on their metrics.
inspectDT(rules1)
If there is a very large number of products that are classified into several different product categories, we may want to find out how items in different categories influence each other.
One approach would be to run the apriori algorithm separately for each category, but then the retrieved rules would be limited to items within the same category, ignoring the fact that products in different categories may also be strongly linked.
The second approach would be to run the apriori algorithm on product categories rather than individual items in order to find connections between different types of products. This would probably not work well with very broad categories such as food vs. housewares, as the resulting rules would be too general to be useful. However, it could work well with smaller, more specific categories or sub-categories, for example to discover whether dairy products are often purchased together with meat products, or whether someone who buys a magazine is also likely to buy stationery. A rough sketch of this approach is shown below.
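As a sketch of the second approach (item_categories is a hypothetical lookup table mapping each Description to a Category; no such table exists in this dataset), the item descriptions could be replaced with their categories before building the transactions:
# join a hypothetical Description-to-Category lookup table onto the data
data_cat = data %>%
  left_join(item_categories, by = "Description")
# build transactions at the category level and mine rules as before
cat_trx = as(split(data_cat$Category, data_cat$InvoiceNo), "transactions")
cat_rules = apriori(cat_trx,
                    parameter = list(supp = 0.01, conf = 0.6, minlen = 2))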