# Load libraries
library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(gridExtra)
library(lubridate)
setwd("C:\\Users\\hi\\Documents")
# Read the data
trans <- read.transactions("BreadBasket_DMS.csv", format="single", cols=c(3,4), sep=",", rm.duplicates=TRUE)

Let's get an idea of what we're working with.
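The output below presumably comes from printing, summarizing and inspecting the structure of the transactions object; a minimal sketch of those calls:

# Print, summarize and inspect the structure of the transactions object
trans
summary(trans)
str(trans)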
## transactions in sparse format with
## 6614 transactions (rows) and
## 104 items (columns)
## transactions as itemMatrix in sparse format with
## 6614 rows (elements/itemsets/transactions) and
## 104 columns (items) and a density of 0.02008705
##
## most frequent items:
## Coffee Bread Tea Cake Pastry (Other)
## 3188 2146 941 694 576 6272
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 2556 2154 1078 546 187 67 18 3 2 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.089 3.000 10.000
##
## includes extended item information - examples:
## labels
## 1 Adjustment
## 2 Afternoon with the baker
## 3 Alfajores
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 10
## 3 1000
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:13817] 11 63 80 19 80 11 93 14 45 11 ...
## .. .. ..@ p : int [1:6615] 0 1 3 5 7 9 11 15 16 17 ...
## .. .. ..@ Dim : int [1:2] 104 6614
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 104 obs. of 1 variable:
## .. ..$ labels: chr [1:104] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
## ..@ itemsetInfo:'data.frame': 6614 obs. of 1 variable:
## .. ..$ transactionID: Factor w/ 6614 levels "1","10","1000",..: 1 2 3 4 5 6 7 8 9 10 ...
The data set contains 15,010 observations and the following columns:
Date. Categorical variable that tells us the date of the transactions (YYYY-MM-DD format). The column includes dates from 30/10/2016 to 09/04/2017.
Time. Categorical variable that tells us the time of the transactions (HH:MM:SS format).
Transaction. Quantitative variable that allows us to differentiate the transactions. Rows that share the same value in this field belong to the same transaction, which is why the data set has fewer transactions than observations.
Item. Categorical variable containing the products.
Before applying the Apriori algorithm to the data set, we are going to show some visualizations to learn more about the transactions. For example, we can use itemFrequencyPlot() to create an item frequency bar plot and view the distribution of products.
# Absolute Item Frequency Plot
itemFrequencyPlot(trans, topN=15, type="absolute", col="wheat2", xlab="Item name",
ylab="Frequency (absolute)", main="Absolute Item Frequency Plot")

itemFrequencyPlot() can display absolute or relative frequencies. With type="absolute" it plots the raw count of each item independently; with type="relative" it plots each item's frequency as a proportion of the total number of transactions, as shown in the following plot.
# Relative Item Frequency Plot
itemFrequencyPlot(trans, topN=15, type="relative", col="lightcyan2", xlab="Item name",
ylab="Frequency (relative)", main="Relative Item Frequency Plot")

Coffee is the best-selling product by far, followed by bread and tea. Let's display some other visualizations describing the time distribution of transactions using the ggplot() function:
Transactions per month
Transactions per weekday
Transactions per hour
# Load data
setwd("C:\\Users\\hi\\Documents")
trans_csv <- read.csv("BreadBasket_DMS.csv")
trans_csv$Date <- as.Date(trans_csv$Date)  # dates are in YYYY-MM-DD, the default format for as.Date()
# Transactions per month
trans_csv %>%
mutate(Month=as.factor(month(Date))) %>%
group_by(Month) %>%
summarise(Transactions=n_distinct(Transaction)) %>%
ggplot(aes(x=Month, y=Transactions)) +
geom_bar(stat="identity", fill="mistyrose2", show.legend=FALSE, colour="black") +
geom_label(aes(label=Transactions)) +
labs(title="Transactions per month") +
theme_bw()

The data set includes dates from 30/10/2016 to 09/04/2017, which is why we have so few transactions in October and April.
# Transactions per weekday
trans_csv %>%
mutate(WeekDay=as.factor(weekdays(Date))) %>%
group_by(WeekDay) %>%
summarise(Transactions=n_distinct(Transaction)) %>%
ggplot(aes(x=WeekDay, y=Transactions)) +
geom_bar(stat="identity", fill="peachpuff2", show.legend=FALSE, colour="black") +
geom_label(aes(label=Transactions)) +
labs(title="Transactions per weekday") +
scale_x_discrete(limits=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")) +
theme_bw()

As we can see, Saturday is the busiest day in the bakery. Conversely, Wednesday is the day with the fewest transactions.
# Transactions per hour
trans_csv %>%
mutate(Hour=as.factor(hour(hms(Time)))) %>%
group_by(Hour) %>%
summarise(Transactions=n_distinct(Transaction)) %>%
ggplot(aes(x=Hour, y=Transactions)) +
geom_bar(stat="identity", fill="steelblue1", show.legend=FALSE, colour="black") +
geom_label(aes(label=Transactions)) +
labs(title="Transactions per hour") +
theme_bw()There’s not much to discuss with this visualization. The results are logical and expected.
The first step in creating a set of association rules is to determine the optimal support and confidence thresholds. If we set these values too low, the algorithm will take longer to execute and we will get a lot of rules, most of them not useful. So which values should we choose? We can try several combinations of support and confidence and see graphically how many rules each one generates.
# Support and confidence values
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
# Integer vectors to store the number of rules found at each confidence level
rules_sup10 <- integer(length=9)
rules_sup5 <- integer(length=9)
rules_sup1 <- integer(length=9)
rules_sup0.5 <- integer(length=9)
# Apriori algorithm with a support level of 10%
for (i in 1:length(confidenceLevels)) {
rules_sup10[i] <- length(apriori(trans, parameter=list(sup=supportLevels[1],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 5%
for (i in 1:length(confidenceLevels)) {
rules_sup5[i] <- length(apriori(trans, parameter=list(sup=supportLevels[2],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)) {
rules_sup1[i] <- length(apriori(trans, parameter=list(sup=supportLevels[3],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 0.5%
for (i in 1:length(confidenceLevels)) {
rules_sup0.5[i] <- length(apriori(trans, parameter=list(sup=supportLevels[4],
conf=confidenceLevels[i], target="rules")))
}

In the following graphs we can see the number of rules generated with support levels of 10%, 5%, 1% and 0.5%.
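As an aside, the four loops above differ only in the support level; an equivalent, more compact sketch computes all the counts in one pass:

# Equivalent one-pass version: a matrix of rule counts
# (rows = confidence levels, columns = support levels)
rule_counts <- sapply(supportLevels, function(s)
  sapply(confidenceLevels, function(cf)
    length(apriori(trans, parameter=list(sup=s, conf=cf, target="rules")))))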
# Number of rules found with a support level of 10%
plot1 <- qplot(confidenceLevels, rules_sup10, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 10%") +
theme_bw()
# Number of rules found with a support level of 5%
plot2 <- qplot(confidenceLevels, rules_sup5, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 5%") +
scale_y_continuous(breaks=seq(0, 10, 2)) +
theme_bw()
# Number of rules found with a support level of 1%
plot3 <- qplot(confidenceLevels, rules_sup1, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 1%") +
scale_y_continuous(breaks=seq(0, 50, 10)) +
theme_bw()
# Number of rules found with a support level of 0.5%
plot4 <- qplot(confidenceLevels, rules_sup0.5, geom=c("point", "line"),
xlab="Confidence level", ylab="Number of rules found",
main="Apriori with a support level of 0.5%") +
scale_y_continuous(breaks=seq(0, 130, 20)) +
theme_bw()
# Subplot
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

We can join the four lines in a single plot to improve the visualization.
# Data frame
num_rules <- data.frame(rules_sup10, rules_sup5, rules_sup1, rules_sup0.5, confidenceLevels)
# Number of rules found with a support level of 10%, 5%, 1% and 0.5%
ggplot(data=num_rules, aes(x=confidenceLevels)) +
# Plot line and points (support level of 10%)
geom_line(aes(y=rules_sup10, colour="Support level of 10%")) +
geom_point(aes(y=rules_sup10, colour="Support level of 10%")) +
# Plot line and points (support level of 5%)
geom_line(aes(y=rules_sup5, colour="Support level of 5%")) +
geom_point(aes(y=rules_sup5, colour="Support level of 5%")) +
# Plot line and points (support level of 1%)
geom_line(aes(y=rules_sup1, colour="Support level of 1%")) +
geom_point(aes(y=rules_sup1, colour="Support level of 1%")) +
# Plot line and points (support level of 0.5%)
geom_line(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
geom_point(aes(y=rules_sup0.5, colour="Support level of 0.5%")) +
# Labs and theme
labs(x="Confidence levels", y="Number of rules found",
title="Apriori algorithm with different support levels") +
theme_bw() +
theme(legend.title=element_blank())

Let's analyze the results:
Support level of 10%. We only identify a few rules with very low confidence levels, which means there are no relatively frequent associations in our data set. We can't choose this value: the resulting rules would be unrepresentative.
Support level of 5%. We identify only one rule with a confidence of at least 50%. It seems we have to look for support levels below 5% to obtain a greater number of rules with reasonable confidence.
Support level of 1%. We start to get dozens of rules, 13 of which have a confidence of at least 50%.
Support level of 0.5%. Too many rules to analyze!
To sum up, we are going to use a support level of 1% and a confidence level of 50%.
Let’s execute the Apriori algorithm with the values obtained in the previous section.
# Apriori algorithm execution with a support level of 1% and a confidence level of 50%
rules_sup1_conf50 <- apriori(trans, parameter=list(sup=supportLevels[3],
conf=confidenceLevels[5], target="rules"))

The generated association rules are the following:
## lhs rhs support confidence lift count
## [1] {Tiffin} => {Coffee} 0.01058361 0.5468750 1.134577 70
## [2] {Spanish Brunch} => {Coffee} 0.01406108 0.6326531 1.312537 93
## [3] {Scone} => {Coffee} 0.01844572 0.5422222 1.124924 122
## [4] {Toast} => {Coffee} 0.02570305 0.7296137 1.513697 170
## [5] {Alfajores} => {Coffee} 0.02237678 0.5522388 1.145705 148
## [6] {Juice} => {Coffee} 0.02131842 0.5300752 1.099723 141
## [7] {Hot chocolate} => {Coffee} 0.02721500 0.5263158 1.091924 180
## [8] {Medialuna} => {Coffee} 0.03296039 0.5751979 1.193337 218
## [9] {Cookies} => {Coffee} 0.02978530 0.5267380 1.092800 197
## [10] {NONE} => {Coffee} 0.04172966 0.5810526 1.205484 276
## [11] {Sandwich} => {Coffee} 0.04233444 0.5679513 1.178303 280
## [12] {Pastry} => {Coffee} 0.04868461 0.5590278 1.159790 322
## [13] {Cake} => {Coffee} 0.05654672 0.5389049 1.118042 374
We can also create an HTML table widget using the inspectDT() function from the arulesViz package. Rules can be interactively filtered and sorted.
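A minimal sketch of the call (the widget renders only in HTML output):

# Interactive HTML table of the rules
inspectDT(rules_sup1_conf50)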
How do we interpret these rules? The confidence of a rule {X} => {Y} is the share of transactions containing X that also contain Y; for example, {Toast} => {Coffee} has a count of 170 and toast appears in 233 transactions, giving a confidence of 170/233 ≈ 0.73. So:
About 53% of the customers who bought hot chocolate also bought coffee.
About 63% of the customers who bought a Spanish brunch also bought coffee.
About 73% of the customers who bought toast also bought coffee.
And so on. It seems that this bakery has many coffee lovers.
We are going to use the arulesViz package to create the visualizations. Let’s begin with a simple scatter plot with different measures of interestingness on the axes (lift and support) and a third measure (confidence) represented by the color of the points.
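A sketch of the corresponding arulesViz call (measure sets the axes, shading the point colour):

# Scatter plot: support and lift on the axes, confidence as colour
plot(rules_sup1_conf50, measure=c("support", "lift"), shading="confidence")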
The following visualization represents the rules as a graph with items as labeled vertices, and rules represented as vertices connected to items using arrows.
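A sketch of the call:

# Graph-based visualization: items as vertices, rules as connecting arrows
plot(rules_sup1_conf50, method="graph")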
We can also change the graph layout.
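For instance, a circular layout (a sketch; the available layout options depend on the arulesViz version, and igraph::in_circle() is one possibility with the igraph engine):

# Same graph with a circular layout
plot(rules_sup1_conf50, method="graph", engine="igraph",
control=list(layout=igraph::in_circle()))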
What else can we do? We can represent the rules as a grouped matrix-based visualization. The support and lift measures are represented by the size and color of the balloons, respectively. In this case it's not a very useful visualization, since we only have coffee on the right-hand side of the rules.
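The corresponding call looks like this:

# Grouped matrix-based visualization
plot(rules_sup1_conf50, method="grouped")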
There’s an awesome function called ruleExplorer() that explores association rules using interactive manipulations and visualization using shiny. Unfortunately, R Markdown still doesn’t support shiny app objects.
We have executed the Apriori algorithm with the appropriate support and confidence values. What happens if we execute it with low values? How do the visualizations change? Let’s try with a support level of 0.5% and a confidence level of 10%.
# Apriori algorithm execution with a support level of 0.5% and a confidence level of 10%
rules_sup0.5_conf10 <- apriori(trans, parameter=list(sup=supportLevels[4], conf=confidenceLevels[9], target="rules"))

The resulting visualizations are impossible to analyze! Visual analysis becomes difficult for larger rule sets, and most of the rules are useless anyway. That's why we have to carefully select the right values of support and confidence.
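For instance, the same scatter plot as before, now on the much larger rule set (a sketch):

# Scatter plot of the large rule set: far too many rules to read
plot(rules_sup0.5_conf10, measure=c("support", "lift"), shading="confidence")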
# Parallel coordinates plot
plot(rules_sup0.5_conf10, method="paracoord", control=list(reorder=TRUE))

In this kernel we have learned about the Apriori algorithm, one of the most widely used algorithms in data mining. We have reviewed some statistical concepts (support, confidence, lift and conviction) used to select interesting rules, chosen appropriate values to execute the algorithm and, finally, visualized the resulting association rules.