In this experiment, we use the categories dataset provided by the Coursera course, “Pattern Discovery in Data Mining” for the assignment. First, we load the dataset by the read.transactions command provided by the arules package.
# setwd("./R/data")
suppressMessages(require('arules', quietly = TRUE))
categories <- suppressMessages(read.transactions("categories.txt", sep = ";"))
summary(categories)
## transactions as itemMatrix in sparse format with
## 77185 rows (elements/itemsets/transactions) and
## 1048 columns (items) and a density of 0.002769815
##
## most frequent items:
## Restaurants Shopping Food Beauty & Spas
## 25071 10281 9249 6490
## Health & Medical (Other)
## 5120 167839
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 1978 35278 19556 12068 5750 2128 371 45 9 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.903 4.000 10.000
##
## includes extended item information - examples:
## labels
## 1 Accessories
## 2 Accountants
## 3 Active Life
It includes 77185 transactions. We first make an item frequency plot for the top ten items.
itemFrequencyPlot(categories, topN = 10, col = rainbow(4))
# itemmost <- itemFrequency(categories) # default type:"relative"
Then we extract by some data manipulation queries the 1048 items and their absolute support. We display the first five results in alphabetical order.
| score | |
|---|---|
| Accessories | 352 |
| Accountants | 69 |
| Active Life | 3100 |
| Acupuncture | 126 |
| Adult | 44 |
| Adult Education | 24 |
We select the items with a relative minimum support to 0.01 (771 for absolute support) as required in the assignment. We display the first five items with a higher support also ranked in alphabetical order.
| score | |
|---|---|
| Active Life | 3100 |
| American (New) | 1593 |
| American (Traditional) | 2416 |
| Arts & Entertainment | 2271 |
| Auto Repair | 1716 |
| Automotive | 4208 |
Finally, we print the first five results in descending order.
| score | labels | |
|---|---|---|
| Restaurants | 25071 | Restaurants |
| Shopping | 10281 | Shopping |
| Food | 9249 | Food |
| Beauty & Spas | 6490 | Beauty & Spas |
| Health & Medical | 5120 | Health & Medical |
| Nightlife | 5088 | Nightlife |
For the second part of the assignment, we load the data in basket format.
basket <- suppressMessages(read.transactions("categories.txt", format = 'basket', sep = ';'))
We then make an item frequency plot for the top twenty items.
itemFrequencyPlot(basket, topN = 20, type = 'absolute', col = rainbow(4))
We make some basic exploratory search on this basket dataset. These are the first itemsets.
## items
## [1] {American (Traditional),
## Breakfast & Brunch,
## Restaurants}
## [2] {Restaurants,
## Sandwiches}
## [3] {IT Services & Computer Repair,
## Local Services}
## [4] {Italian,
## Restaurants}
## [5] {Coffee & Tea,
## Food}
size(basket[1:5]) # number of items in each observation
## [1] 3 2 2 2 2
We can observe that we have 1048 items and 77185 transactions. We use ECLAT instead of Apriori since “ECLAT improves Apriori in the step of Extracting frequent itemsets. As you know Apriori has to scan the Database multiple times, but with ECLAT there is no need to scan the database for countig the support for k-itemsets (k>=1). In R, apriori() could have as an output the frequent itemsets or association rules. Although eclat() has as an output just the frequent itemsets. You have after that use the command ruleInduction() to extract rules from those itemsets.” See this post in StackExchange Forum.
frequentItems <- eclat(basket, parameter = list(supp = 0.01, maxlen = 100))
We get an itemMatrix of size 1048 by 97. Then, we select the 97 itemsets and sort them by their relative support. We then display the last ranked itemsets.
| items | support | |
|---|---|---|
| [1] | {Bars,Sports Bars} | 0.0105979 |
| [2] | {Sports Bars} | 0.0105979 |
| [3] | {Restaurants,Sushi Bars} | 0.0103388 |
| [4] | {Sushi Bars} | 0.0103388 |
| [5] | {Burgers,Fast Food,Restaurants} | 0.0100279 |
| [6] | {Burgers,Fast Food} | 0.0100279 |
We then extract a list of 97 itemsets separated by a semi-colon as required by the assignment and get their support.
Finally, we obtain a dataframe with 2 variables: the frequent category sets and their absolute support. We show the last six results.
| support_abs | |
|---|---|
| Bars;Sports Bars | 818 |
| Sports Bars | 818 |
| Restaurants;Sushi Bars | 798 |
| Sushi Bars | 798 |
| Burgers;Fast Food;Restaurants | 774 |
| Burgers;Fast Food | 774 |
Lastly, we write the output in a table.
# support = 0,01 & confidence = 0,2
categoriesrules <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 771
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1048 item(s), 77185 transaction(s)] done [0.06s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# categoriesrules_max <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, maxlen = 2))
summary(categoriesrules)
# 76 rules
# rule length distribution (lhs + rhs):sizes 2 (59) & 3 (17)
We get a set of 76 rules with 59 rules for a length distribution (lhs + rhs) of size 2 and 17 rules for a length distribution (lhs + rhs) of size 3. Alas, with this quick query, we can not observe anything very informative.
| lhs | rhs | support | confidence | lift | ||
|---|---|---|---|---|---|---|
| [1] | {Bars} | => | {Nightlife} | 0.0560731 | 1.0000000 | 15.170008 |
| [2] | {Nightlife} | => | {Bars} | 0.0560731 | 0.8506289 | 15.170008 |
| [3] | {Fast Food} | => | {Restaurants} | 0.0369372 | 1.0000000 | 3.078657 |
| [4] | {Pizza} | => | {Restaurants} | 0.0344238 | 1.0000000 | 3.078657 |
| [5] | {Nightlife} | => | {Restaurants} | 0.0328173 | 0.4978381 | 1.532672 |
| [6] | {Mexican} | => | {Restaurants} | 0.0325841 | 1.0000000 | 3.078657 |
| [7] | {Bars} | => | {Restaurants} | 0.0313921 | 0.5598429 | 1.723564 |
| [8] | {Bars,Nightlife} | => | {Restaurants} | 0.0313921 | 0.5598429 | 1.723564 |
| [9] | {Bars,Restaurants} | => | {Nightlife} | 0.0313921 | 1.0000000 | 15.170008 |
| [10] | {Nightlife,Restaurants} | => | {Bars} | 0.0313921 | 0.9565732 | 17.059405 |
Sometimes, rules will repeat. Redundancy indicates that one item might be a given. As an analyst you can elect to drop the item from the dataset. Alternatively, you can remove redundant rules generated.
We can eliminate these duplicate rules using the follow snippet of code:
subset.matrix <- is.subset(categoriesrules, categoriesrules)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
categoriesrules.pruned <- categoriesrules[!redundant]
rules<-categoriesrules.pruned
# converting the rule set to a data frame
rules_df <- as(rules, "data.frame")
Finally, we get a set of 42 rules that we choose to keep as a dataframe.
str(rules_df)
## 'data.frame': 42 obs. of 4 variables:
## $ rules : Factor w/ 42 levels "{American (New)} => {Restaurants}",..: 42 29 27 19 20 33 10 4 35 36 ...
## $ support : num 0.0103 0.011 0.0132 0.0107 0.0107 ...
## $ confidence: num 1 1 1 1 1 1 1 1 1 1 ...
## $ lift : num 3.08 3.08 8.35 64.59 15.08 ...
We can still play around with selecting another value for confidence parameter.
# idem avec confidence = 0.5
categoriesrules2 <- apriori(categories,
parameter = list(support = 0.01, confidence = 0.5, minlen = 2))
summary(categoriesrules2)
We get a set of 57 rules with 43 rules for a length distribution (lhs + rhs) of size 2 and 14 rules for a length distribution (lhs + rhs) of size 3.
| lhs | rhs | support | confidence | lift | ||
|---|---|---|---|---|---|---|
| [1] | {Bars} | => | {Nightlife} | 0.0560731 | 1.0000000 | 15.170008 |
| [2] | {Nightlife} | => | {Bars} | 0.0560731 | 0.8506289 | 15.170008 |
| [3] | {Fast Food} | => | {Restaurants} | 0.0369372 | 1.0000000 | 3.078657 |
| [4] | {Pizza} | => | {Restaurants} | 0.0344238 | 1.0000000 | 3.078657 |
| [5] | {Mexican} | => | {Restaurants} | 0.0325841 | 1.0000000 | 3.078657 |
| [6] | {Bars} | => | {Restaurants} | 0.0313921 | 0.5598429 | 1.723564 |
| [7] | {Bars,Nightlife} | => | {Restaurants} | 0.0313921 | 0.5598429 | 1.723564 |
| [8] | {Bars,Restaurants} | => | {Nightlife} | 0.0313921 | 1.0000000 | 15.170008 |
| [9] | {Nightlife,Restaurants} | => | {Bars} | 0.0313921 | 0.9565732 | 17.059405 |
| [10] | {American (Traditional)} | => | {Restaurants} | 0.0313014 | 1.0000000 | 3.078657 |
There is not any very informative rules.
Now that we know how to generate rules, limit the output, let’s say we wanted to target items to generate rules. There are two types of targets we might be interested in that are illustrated with an example about “restaurants” since it is the largest item:
This essentially means we want to set either the Left Hand Side and Right Hand Side.
Answering the first question we adjust our apriori() function as follows:
rules_restaurants_rhs <-apriori(data=categories, parameter=list(supp=0.01,conf = 0.2),
appearance = list(default="lhs",rhs="Restaurants"),
control = list(verbose=F))
rules_restaurants_rhs <-sort(rules_restaurants_rhs,decreasing=TRUE,by="confidence")
rules_restaurants_rhs_inspect <- inspect(rules_restaurants_rhs[1:5])
# 19 rules
| lhs | rhs | support | confidence | lift | ||
|---|---|---|---|---|---|---|
| [1] | {Sushi Bars} | => | {Restaurants} | 0.0103388 | 1 | 3.078657 |
| [2] | {Japanese} | => | {Restaurants} | 0.0109866 | 1 | 3.078657 |
| [3] | {Cafes} | => | {Restaurants} | 0.0129818 | 1 | 3.078657 |
| [4] | {Chinese} | => | {Restaurants} | 0.0211051 | 1 | 3.078657 |
| [5] | {Breakfast & Brunch} | => | {Restaurants} | 0.0177366 | 1 | 3.078657 |
Likewise, we can set the left hand side to be “restaurants” and find its antecedents. We choose a confidence equal to 0.1.
rules_restaurants_lhs<-apriori(data=categories, parameter=list(supp=0.01,conf = 0.1,minlen=2),
appearance = list(default="rhs",lhs="Restaurants"),
control = list(verbose=F))
rules_restaurants_lhs <-sort(rules_restaurants_lhs, decreasing=TRUE,by="confidence")
rules_restaurants_lhs_inspect <- inspect(rules_restaurants_lhs[1:4])
# Only 4 rules
| lhs | rhs | support | confidence | lift | ||
|---|---|---|---|---|---|---|
| [1] | {Restaurants} | => | {Fast Food} | 0.0369372 | 0.1137170 | 3.078657 |
| [2] | {Restaurants} | => | {Pizza} | 0.0344238 | 0.1059790 | 3.078657 |
| [3] | {Restaurants} | => | {Nightlife} | 0.0328173 | 0.1010331 | 1.532672 |
| [4] | {Restaurants} | => | {Mexican} | 0.0325841 | 0.1003151 | 3.078657 |
Last step, we can plot the rhs rules with the arulesViz package.
library(arulesViz)
plot(rules_restaurants_rhs,method="graph",interactive=FALSE,shading=NA)
# plot(rules_restaurants_rhs,method="graph",interactive=TRUE,shading=NA)
The interactive mode performs better but can not be displayed in knit mode.
– The Official documentation of arules package.
– A very nice analyzis on the groceries dataset.
– Association Rules about the titanic dataset.