MARKET BASKET ANALYZIS

Frequent Single Item Mining

In this experiment, we use the categories dataset provided by the Coursera course, “Pattern Discovery in Data Mining” for the assignment. First, we load the dataset by the read.transactions command provided by the arules package.

# setwd("./R/data")
suppressMessages(require('arules', quietly = TRUE))
categories <- suppressMessages(read.transactions("categories.txt", sep = ";"))
summary(categories)

## transactions as itemMatrix in sparse format with
##  77185 rows (elements/itemsets/transactions) and
##  1048 columns (items) and a density of 0.002769815 
## 
## most frequent items:
##      Restaurants         Shopping             Food    Beauty & Spas 
##            25071            10281             9249             6490 
## Health & Medical          (Other) 
##             5120           167839 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##  1978 35278 19556 12068  5750  2128   371    45     9     2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.903   4.000  10.000 
## 
## includes extended item information - examples:
##        labels
## 1 Accessories
## 2 Accountants
## 3 Active Life

It includes 77185 transactions. We first make an item frequency plot for the top ten items.

itemFrequencyPlot(categories, topN = 10, col = rainbow(4))

# itemmost <- itemFrequency(categories) # default type:"relative"

Then we extract by some data manipulation queries the 1048 items and their absolute support. We display the first five results in alphabetical order.

	score
Accessories	352
Accountants	69
Active Life	3100
Acupuncture	126
Adult	44
Adult Education	24

We select the items with a relative minimum support to 0.01 (771 for absolute support) as required in the assignment. We display the first five items with a higher support also ranked in alphabetical order.

	score
Active Life	3100
American (New)	1593
American (Traditional)	2416
Arts & Entertainment	2271
Auto Repair	1716
Automotive	4208

Finally, we print the first five results in descending order.

	score	labels
Restaurants	25071	Restaurants
Shopping	10281	Shopping
Food	9249	Food
Beauty & Spas	6490	Beauty & Spas
Health & Medical	5120	Health & Medical
Nightlife	5088	Nightlife

Frequent Itemsets Mining

For the second part of the assignment, we load the data in basket format.

basket <- suppressMessages(read.transactions("categories.txt", format = 'basket', sep = ';'))

We then make an item frequency plot for the top twenty items.

itemFrequencyPlot(basket, topN = 20, type = 'absolute', col = rainbow(4))

We make some basic exploratory search on this basket dataset. These are the first itemsets.

##     items                          
## [1] {American (Traditional),       
##      Breakfast & Brunch,           
##      Restaurants}                  
## [2] {Restaurants,                  
##      Sandwiches}                   
## [3] {IT Services & Computer Repair,
##      Local Services}               
## [4] {Italian,                      
##      Restaurants}                  
## [5] {Coffee & Tea,                 
##      Food}

size(basket[1:5]) # number of items in each observation

## [1] 3 2 2 2 2

We can observe that we have 1048 items and 77185 transactions. We use ECLAT instead of Apriori since “ECLAT improves Apriori in the step of Extracting frequent itemsets. As you know Apriori has to scan the Database multiple times, but with ECLAT there is no need to scan the database for countig the support for k-itemsets (k>=1). In R, apriori() could have as an output the frequent itemsets or association rules. Although eclat() has as an output just the frequent itemsets. You have after that use the command ruleInduction() to extract rules from those itemsets.” See this post in StackExchange Forum.

frequentItems <- eclat(basket, parameter = list(supp = 0.01, maxlen = 100))

We get an itemMatrix of size 1048 by 97. Then, we select the 97 itemsets and sort them by their relative support. We then display the last ranked itemsets.

	items	support
[1]	{Bars,Sports Bars}	0.0105979
[2]	{Sports Bars}	0.0105979
[3]	{Restaurants,Sushi Bars}	0.0103388
[4]	{Sushi Bars}	0.0103388
[5]	{Burgers,Fast Food,Restaurants}	0.0100279
[6]	{Burgers,Fast Food}	0.0100279

We then extract a list of 97 itemsets separated by a semi-colon as required by the assignment and get their support.

Finally, we obtain a dataframe with 2 variables: the frequent category sets and their absolute support. We show the last six results.

	support_abs
Bars;Sports Bars	818
Sports Bars	818
Restaurants;Sushi Bars	798
Sushi Bars	798
Burgers;Fast Food;Restaurants	774
Burgers;Fast Food	774

Lastly, we write the output in a table.

Frequent Itemsets Mining Using Apriori

# support = 0,01 & confidence = 0,2
categoriesrules <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 771 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1048 item(s), 77185 transaction(s)] done [0.06s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

# categoriesrules_max <- apriori(categories, parameter = list(support = 0.01, confidence = 0.2, maxlen = 2))

summary(categoriesrules)
# 76 rules
# rule length distribution (lhs + rhs):sizes 2 (59) & 3 (17)

We get a set of 76 rules with 59 rules for a length distribution (lhs + rhs) of size 2 and 17 rules for a length distribution (lhs + rhs) of size 3. Alas, with this quick query, we can not observe anything very informative.

	lhs		rhs	support	confidence	lift
[1]	{Bars}	=>	{Nightlife}	0.0560731	1.0000000	15.170008
[2]	{Nightlife}	=>	{Bars}	0.0560731	0.8506289	15.170008
[3]	{Fast Food}	=>	{Restaurants}	0.0369372	1.0000000	3.078657
[4]	{Pizza}	=>	{Restaurants}	0.0344238	1.0000000	3.078657
[5]	{Nightlife}	=>	{Restaurants}	0.0328173	0.4978381	1.532672
[6]	{Mexican}	=>	{Restaurants}	0.0325841	1.0000000	3.078657
[7]	{Bars}	=>	{Restaurants}	0.0313921	0.5598429	1.723564
[8]	{Bars,Nightlife}	=>	{Restaurants}	0.0313921	0.5598429	1.723564
[9]	{Bars,Restaurants}	=>	{Nightlife}	0.0313921	1.0000000	15.170008
[10]	{Nightlife,Restaurants}	=>	{Bars}	0.0313921	0.9565732	17.059405

Sometimes, rules will repeat. Redundancy indicates that one item might be a given. As an analyst you can elect to drop the item from the dataset. Alternatively, you can remove redundant rules generated.

We can eliminate these duplicate rules using the follow snippet of code:

subset.matrix <- is.subset(categoriesrules, categoriesrules)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
categoriesrules.pruned <- categoriesrules[!redundant]
rules<-categoriesrules.pruned
# converting the rule set to a data frame
rules_df <- as(rules, "data.frame")

Finally, we get a set of 42 rules that we choose to keep as a dataframe.

str(rules_df)

## 'data.frame':    42 obs. of  4 variables:
##  $ rules     : Factor w/ 42 levels "{American (New)} => {Restaurants}",..: 42 29 27 19 20 33 10 4 35 36 ...
##  $ support   : num  0.0103 0.011 0.0132 0.0107 0.0107 ...
##  $ confidence: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ lift      : num  3.08 3.08 8.35 64.59 15.08 ...

We can still play around with selecting another value for confidence parameter.

# idem avec confidence = 0.5
categoriesrules2 <- apriori(categories, 
        parameter = list(support = 0.01, confidence = 0.5, minlen = 2))

summary(categoriesrules2)

We get a set of 57 rules with 43 rules for a length distribution (lhs + rhs) of size 2 and 14 rules for a length distribution (lhs + rhs) of size 3.

	lhs		rhs	support	confidence	lift
[1]	{Bars}	=>	{Nightlife}	0.0560731	1.0000000	15.170008
[2]	{Nightlife}	=>	{Bars}	0.0560731	0.8506289	15.170008
[3]	{Fast Food}	=>	{Restaurants}	0.0369372	1.0000000	3.078657
[4]	{Pizza}	=>	{Restaurants}	0.0344238	1.0000000	3.078657
[5]	{Mexican}	=>	{Restaurants}	0.0325841	1.0000000	3.078657
[6]	{Bars}	=>	{Restaurants}	0.0313921	0.5598429	1.723564
[7]	{Bars,Nightlife}	=>	{Restaurants}	0.0313921	0.5598429	1.723564
[8]	{Bars,Restaurants}	=>	{Nightlife}	0.0313921	1.0000000	15.170008
[9]	{Nightlife,Restaurants}	=>	{Bars}	0.0313921	0.9565732	17.059405
[10]	{American (Traditional)}	=>	{Restaurants}	0.0313014	1.0000000	3.078657

There is not any very informative rules.

Now that we know how to generate rules, limit the output, let’s say we wanted to target items to generate rules. There are two types of targets we might be interested in that are illustrated with an example about “restaurants” since it is the largest item:

What are customers likely to buy before going to restaurants?
What are customers likely to buy if they go to restaurants?

This essentially means we want to set either the Left Hand Side and Right Hand Side.

Answering the first question we adjust our apriori() function as follows:

rules_restaurants_rhs <-apriori(data=categories, parameter=list(supp=0.01,conf = 0.2), 
               appearance = list(default="lhs",rhs="Restaurants"),
               control = list(verbose=F))
rules_restaurants_rhs <-sort(rules_restaurants_rhs,decreasing=TRUE,by="confidence")
rules_restaurants_rhs_inspect <- inspect(rules_restaurants_rhs[1:5])
# 19 rules

	lhs		rhs	support	confidence	lift
[1]	{Sushi Bars}	=>	{Restaurants}	0.0103388	1	3.078657
[2]	{Japanese}	=>	{Restaurants}	0.0109866	1	3.078657
[3]	{Cafes}	=>	{Restaurants}	0.0129818	1	3.078657
[4]	{Chinese}	=>	{Restaurants}	0.0211051	1	3.078657
[5]	{Breakfast & Brunch}	=>	{Restaurants}	0.0177366	1	3.078657

Likewise, we can set the left hand side to be “restaurants” and find its antecedents. We choose a confidence equal to 0.1.

rules_restaurants_lhs<-apriori(data=categories, parameter=list(supp=0.01,conf = 0.1,minlen=2), 
               appearance = list(default="rhs",lhs="Restaurants"),
               control = list(verbose=F))
rules_restaurants_lhs <-sort(rules_restaurants_lhs, decreasing=TRUE,by="confidence")
rules_restaurants_lhs_inspect <- inspect(rules_restaurants_lhs[1:4])
# Only 4 rules

	lhs		rhs	support	confidence	lift
[1]	{Restaurants}	=>	{Fast Food}	0.0369372	0.1137170	3.078657
[2]	{Restaurants}	=>	{Pizza}	0.0344238	0.1059790	3.078657
[3]	{Restaurants}	=>	{Nightlife}	0.0328173	0.1010331	1.532672
[4]	{Restaurants}	=>	{Mexican}	0.0325841	0.1003151	3.078657

Last step, we can plot the rhs rules with the arulesViz package.

library(arulesViz)
plot(rules_restaurants_rhs,method="graph",interactive=FALSE,shading=NA)

# plot(rules_restaurants_rhs,method="graph",interactive=TRUE,shading=NA)

The interactive mode performs better but can not be displayed in knit mode.

References

Related to the Apriori implementation:

– The Official documentation of arules package.

– A very nice analyzis on the groceries dataset.

– Association Rules about the titanic dataset.

About Markdown and knit

– Daring Fireball documentation.

– A resource website.

MARKET BASKET ANALYZIS

emmanuelq2

10 avril 2017

Frequent Single Item Mining

Frequent Itemsets Mining

Frequent Itemsets Mining Using Apriori

References