Frequent Single Item Mining

In this experiment, we use the categories dataset provided by the Coursera course, “Pattern Discovery in Data Mining” for the assignment. First, we load the dataset by the read.transactions command provided by the arules package.

# setwd("./R/data")
suppressMessages(require('arules', quietly = TRUE))
categories <- suppressMessages(read.transactions("categories.txt", sep = ";"))
summary(categories)
## transactions as itemMatrix in sparse format with
##  77185 rows (elements/itemsets/transactions) and
##  1048 columns (items) and a density of 0.002769815 
## 
## most frequent items:
##      Restaurants         Shopping             Food    Beauty & Spas 
##            25071            10281             9249             6490 
## Health & Medical          (Other) 
##             5120           167839 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##  1978 35278 19556 12068  5750  2128   371    45     9     2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.903   4.000  10.000 
## 
## includes extended item information - examples:
##        labels
## 1 Accessories
## 2 Accountants
## 3 Active Life

It includes 77185 transactions. We first make an item frequency plot for the top ten items.

itemFrequencyPlot(categories, topN = 10, col = rainbow(4))

# itemmost <- itemFrequency(categories) # default type:"relative"

Then we extract by some data manipulation queries the 1048 items and their absolute support. We display the first five results in alphabetical order.

score
Accessories 352
Accountants 69
Active Life 3100
Acupuncture 126
Adult 44
Adult Education 24

We select the items with a relative minimum support to 0.01 (771 for absolute support) as required in the assignment. We display the first five items with a higher support also ranked in alphabetical order.

score
Active Life 3100
American (New) 1593
American (Traditional) 2416
Arts & Entertainment 2271
Auto Repair 1716
Automotive 4208

Finally, we print the first five results in descending order.

score labels
Restaurants 25071 Restaurants
Shopping 10281 Shopping
Food 9249 Food
Beauty & Spas 6490 Beauty & Spas
Health & Medical 5120 Health & Medical
Nightlife 5088 Nightlife

Frequent Itemsets Mining

For the second part of the assignment, we load the data in basket format.

basket <- suppressMessages(read.transactions("categories.txt", format = 'basket', sep = ';'))

We then make an item frequency plot for the top twenty items.

itemFrequencyPlot(basket, topN = 20, type = 'absolute', col = rainbow(4))

We make some basic exploratory search on this basket dataset. These are the first itemsets.

##     items                          
## [1] {American (Traditional),       
##      Breakfast & Brunch,           
##      Restaurants}                  
## [2] {Restaurants,                  
##      Sandwiches}                   
## [3] {IT Services & Computer Repair,
##      Local Services}               
## [4] {Italian,                      
##      Restaurants}                  
## [5] {Coffee & Tea,                 
##      Food}
size(basket[1:5]) # number of items in each observation
## [1] 3 2 2 2 2

We can observe that we have 1048 items and 77185 transactions. We use ECLAT instead of Apriori since “ECLAT improves Apriori in the step of Extracting frequent itemsets. As you know Apriori has to scan the Database multiple times, but with ECLAT there is no need to scan the database for countig the support for k-itemsets (k>=1). In R, apriori() could have as an output the frequent itemsets or association rules. Although eclat() has as an output just the frequent itemsets. You have after that use the command ruleInduction() to extract rules from those itemsets.” See this post in StackExchange Forum.

frequentItems <- eclat(basket, parameter = list(supp = 0.01, maxlen = 100))

We get an itemMatrix of size 1048 by 97. Then, we select the 97 itemsets and sort them by their relative support. We then display the last ranked itemsets.

items support
[1] {Bars,Sports Bars} 0.0105979
[2] {Sports Bars} 0.0105979
[3] {Restaurants,Sushi Bars} 0.0103388
[4] {Sushi Bars} 0.0103388
[5] {Burgers,Fast Food,Restaurants} 0.0100279
[6] {Burgers,Fast Food} 0.0100279

We then extract a list of 97 itemsets separated by a semi-colon as required by the assignment and get their support.

Finally, we obtain a dataframe with 2 variables: the frequent category sets and their absolute support. We show the last six results.

support_abs
Bars;Sports Bars 818
Sports Bars 818
Restaurants;Sushi Bars 798
Sushi Bars 798
Burgers;Fast Food;Restaurants 774
Burgers;Fast Food 774

Lastly, we write the output in a table.

Frequent Itemsets Mining Using Apriori

# support = 0,01 & confidence = 0,2
categoriesrules <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 771 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1048 item(s), 77185 transaction(s)] done [0.06s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
# categoriesrules_max <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, maxlen = 2))
summary(categoriesrules)
# 76 rules
# rule length distribution (lhs + rhs):sizes 2 (59) & 3 (17) 

We get a set of 76 rules with 59 rules for a length distribution (lhs + rhs) of size 2 and 17 rules for a length distribution (lhs + rhs) of size 3. Alas, with this quick query, we can not observe anything very informative.

lhs rhs support confidence lift
[1] {Bars} => {Nightlife} 0.0560731 1.0000000 15.170008
[2] {Nightlife} => {Bars} 0.0560731 0.8506289 15.170008
[3] {Fast Food} => {Restaurants} 0.0369372 1.0000000 3.078657
[4] {Pizza} => {Restaurants} 0.0344238 1.0000000 3.078657
[5] {Nightlife} => {Restaurants} 0.0328173 0.4978381 1.532672
[6] {Mexican} => {Restaurants} 0.0325841 1.0000000 3.078657
[7] {Bars} => {Restaurants} 0.0313921 0.5598429 1.723564
[8] {Bars,Nightlife} => {Restaurants} 0.0313921 0.5598429 1.723564
[9] {Bars,Restaurants} => {Nightlife} 0.0313921 1.0000000 15.170008
[10] {Nightlife,Restaurants} => {Bars} 0.0313921 0.9565732 17.059405

Sometimes, rules will repeat. Redundancy indicates that one item might be a given. As an analyst you can elect to drop the item from the dataset. Alternatively, you can remove redundant rules generated.

We can eliminate these duplicate rules using the follow snippet of code:

subset.matrix <- is.subset(categoriesrules, categoriesrules)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
categoriesrules.pruned <- categoriesrules[!redundant]
rules<-categoriesrules.pruned
# converting the rule set to a data frame
rules_df <- as(rules, "data.frame")

Finally, we get a set of 42 rules that we choose to keep as a dataframe.

str(rules_df)
## 'data.frame':    42 obs. of  4 variables:
##  $ rules     : Factor w/ 42 levels "{American (New)} => {Restaurants}",..: 42 29 27 19 20 33 10 4 35 36 ...
##  $ support   : num  0.0103 0.011 0.0132 0.0107 0.0107 ...
##  $ confidence: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ lift      : num  3.08 3.08 8.35 64.59 15.08 ...

We can still play around with selecting another value for confidence parameter.

# idem avec confidence = 0.5
categoriesrules2 <- apriori(categories, 
        parameter = list(support = 0.01, confidence = 0.5, minlen = 2))
summary(categoriesrules2)

We get a set of 57 rules with 43 rules for a length distribution (lhs + rhs) of size 2 and 14 rules for a length distribution (lhs + rhs) of size 3.

lhs rhs support confidence lift
[1] {Bars} => {Nightlife} 0.0560731 1.0000000 15.170008
[2] {Nightlife} => {Bars} 0.0560731 0.8506289 15.170008
[3] {Fast Food} => {Restaurants} 0.0369372 1.0000000 3.078657
[4] {Pizza} => {Restaurants} 0.0344238 1.0000000 3.078657
[5] {Mexican} => {Restaurants} 0.0325841 1.0000000 3.078657
[6] {Bars} => {Restaurants} 0.0313921 0.5598429 1.723564
[7] {Bars,Nightlife} => {Restaurants} 0.0313921 0.5598429 1.723564
[8] {Bars,Restaurants} => {Nightlife} 0.0313921 1.0000000 15.170008
[9] {Nightlife,Restaurants} => {Bars} 0.0313921 0.9565732 17.059405
[10] {American (Traditional)} => {Restaurants} 0.0313014 1.0000000 3.078657

There is not any very informative rules.

Now that we know how to generate rules, limit the output, let’s say we wanted to target items to generate rules. There are two types of targets we might be interested in that are illustrated with an example about “restaurants” since it is the largest item:

This essentially means we want to set either the Left Hand Side and Right Hand Side.

Answering the first question we adjust our apriori() function as follows:

rules_restaurants_rhs <-apriori(data=categories, parameter=list(supp=0.01,conf = 0.2), 
               appearance = list(default="lhs",rhs="Restaurants"),
               control = list(verbose=F))
rules_restaurants_rhs <-sort(rules_restaurants_rhs,decreasing=TRUE,by="confidence")
rules_restaurants_rhs_inspect <- inspect(rules_restaurants_rhs[1:5])
# 19 rules
lhs rhs support confidence lift
[1] {Sushi Bars} => {Restaurants} 0.0103388 1 3.078657
[2] {Japanese} => {Restaurants} 0.0109866 1 3.078657
[3] {Cafes} => {Restaurants} 0.0129818 1 3.078657
[4] {Chinese} => {Restaurants} 0.0211051 1 3.078657
[5] {Breakfast & Brunch} => {Restaurants} 0.0177366 1 3.078657

Likewise, we can set the left hand side to be “restaurants” and find its antecedents. We choose a confidence equal to 0.1.

rules_restaurants_lhs<-apriori(data=categories, parameter=list(supp=0.01,conf = 0.1,minlen=2), 
               appearance = list(default="rhs",lhs="Restaurants"),
               control = list(verbose=F))
rules_restaurants_lhs <-sort(rules_restaurants_lhs, decreasing=TRUE,by="confidence")
rules_restaurants_lhs_inspect <- inspect(rules_restaurants_lhs[1:4])
# Only 4 rules
lhs rhs support confidence lift
[1] {Restaurants} => {Fast Food} 0.0369372 0.1137170 3.078657
[2] {Restaurants} => {Pizza} 0.0344238 0.1059790 3.078657
[3] {Restaurants} => {Nightlife} 0.0328173 0.1010331 1.532672
[4] {Restaurants} => {Mexican} 0.0325841 0.1003151 3.078657

Last step, we can plot the rhs rules with the arulesViz package.

library(arulesViz)
plot(rules_restaurants_rhs,method="graph",interactive=FALSE,shading=NA)

# plot(rules_restaurants_rhs,method="graph",interactive=TRUE,shading=NA)

The interactive mode performs better but can not be displayed in knit mode.