In this section, I build association rules to identify relationships between the items that customers buy together in the supermarket dataset.

# calling the library
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
# Loading the dataset
df <- read.transactions('http://bit.ly/SupermarketDatasetII',sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
# Checking the class of the dataset
class(df)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing the dataset

inspect(df[1:5])
##     items               
## [1] {almonds,           
##      antioxydant juice, 
##      avocado,           
##      cottage cheese,    
##      energy drink,      
##      frozen smoothie,   
##      green grapes,      
##      green tea,         
##      honey,             
##      low fat yogurt,    
##      mineral water,     
##      olive oil,         
##      salad,             
##      salmon,            
##      shrimp,            
##      spinach,           
##      tomato juice,      
##      vegetables mix,    
##      whole weat flour,  
##      yams}              
## [2] {burgers,           
##      eggs,              
##      meatballs}         
## [3] {chutney}           
## [4] {avocado,           
##      turkey}            
## [5] {energy bar,        
##      green tea,         
##      milk,              
##      mineral water,     
##      whole wheat rice}
# Generating the statistical summary of the data

summary(df)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Observation: the five most frequent items are mineral water, eggs, spaghetti, french fries and chocolate.

# Displaying the 10 most frequent items, and the items with a support (relative frequency) of at least 10%

par(mfrow = c(1, 2))

# plot the frequency of items
itemFrequencyPlot(df, topN = 10, col="blue")
itemFrequencyPlot(df, support = 0.1, col="darkred")

The items in the second plot each have a support of at least 10%, meaning they appear in at least 10% of all transactions.
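To make the 10% cut-off concrete, here is a minimal sketch in base R of how the support of single items can be computed. The `baskets` list and the 0.5 threshold are hypothetical toy values, not the supermarket data; the `support` argument of `itemFrequencyPlot()` applies the same kind of filter to the real transactions.

```r
# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  c("mineral water", "eggs"),
  c("mineral water", "spaghetti"),
  c("eggs"),
  c("mineral water", "eggs", "chocolate")
)

# Count the baskets containing each item (unique() guards against duplicates)
item_counts <- table(unlist(lapply(baskets, unique)))

# Support = share of baskets that contain the item
item_support <- item_counts / length(baskets)

# Keep only the items that reach a chosen minimum support
min_support <- 0.5
frequent_items <- names(item_support)[item_support >= min_support]

print(sort(item_support, decreasing = TRUE))
print(frequent_items)
```

In this toy example only eggs and mineral water pass the threshold, because each appears in 3 of the 4 baskets.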

# Building a model based on association rules 
# Using Min Support as 0.001 and confidence as 0.6
 
rules <- apriori (df, parameter = list(supp = 0.001, conf = 0.6))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [545 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules
## set of 545 rules

Building the model with a minimum support of 0.001 and a minimum confidence of 60% generates a set of 545 rules.
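The relative support threshold can be translated into a number of transactions. The sketch below uses only the values printed above (7501 transactions, support 0.001); the variable names are my own. The log reports an absolute minimum support count of 7, which looks like the product rounded down, while a rule needs at least 8 transactions to actually reach the threshold.

```r
# Numbers taken from the output above
n_transactions <- 7501
min_support <- 0.001

# Relative threshold expressed as a number of transactions
threshold <- min_support * n_transactions

# Smallest whole number of transactions that reaches the threshold
min_count <- ceiling(threshold)

# Support of an itemset that occurs in exactly min_count transactions
smallest_support <- min_count / n_transactions

print(threshold)
print(min_count)
print(round(smallest_support, 6))
```

With such a low threshold, many rules rest on fewer than ten baskets, which is worth keeping in mind when reading their confidence values.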

# Exploring the model
summary(rules)
## set of 545 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
## 146 329  67   3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.000   3.866   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.6000   Min.   :0.001067   Min.   : 2.517  
##  1st Qu.:0.001067   1st Qu.:0.6250   1st Qu.:0.001600   1st Qu.: 2.797  
##  Median :0.001200   Median :0.6667   Median :0.001866   Median : 3.446  
##  Mean   :0.001409   Mean   :0.6893   Mean   :0.002081   Mean   : 3.889  
##  3rd Qu.:0.001466   3rd Qu.:0.7273   3rd Qu.:0.002266   3rd Qu.: 4.177  
##  Max.   :0.005066   Max.   :1.0000   Max.   :0.007999   Max.   :34.970  
##      count      
##  Min.   : 8.00  
##  1st Qu.: 8.00  
##  Median : 9.00  
##  Mean   :10.57  
##  3rd Qu.:11.00  
##  Max.   :38.00  
## 
## mining info:
##  data ntransactions support confidence
##    df          7501   0.001        0.6
##                                                            call
##  apriori(data = df, parameter = list(supp = 0.001, conf = 0.6))

Most rules contain 3 or 4 items (146 and 329 rules respectively), and only 3 rules contain 6 items. The summary also reports the distribution of the quality measures: support, confidence, coverage, lift and count.

# observing the first 10 rules built
inspect(rules[1:10])
##      lhs                           rhs              support     confidence
## [1]  {cookies, shallot}         => {low fat yogurt} 0.001199840 0.6000000 
## [2]  {low fat yogurt, shallot}  => {cookies}        0.001199840 0.6923077 
## [3]  {cookies, shallot}         => {green tea}      0.001199840 0.6000000 
## [4]  {cookies, shallot}         => {french fries}   0.001199840 0.6000000 
## [5]  {low fat yogurt, shallot}  => {french fries}   0.001066524 0.6153846 
## [6]  {burger sauce, chicken}    => {mineral water}  0.001066524 0.6666667 
## [7]  {frozen smoothie, spinach} => {mineral water}  0.001066524 0.8888889 
## [8]  {milk, spinach}            => {mineral water}  0.001066524 0.6666667 
## [9]  {spaghetti, spinach}       => {mineral water}  0.001333156 0.7142857 
## [10] {olive oil, strong cheese} => {spaghetti}      0.001066524 0.7272727 
##      coverage    lift     count
## [1]  0.001999733 7.840767  9   
## [2]  0.001733102 8.611940  9   
## [3]  0.001999733 4.541473  9   
## [4]  0.001999733 3.510608  9   
## [5]  0.001733102 3.600624  8   
## [6]  0.001599787 2.796793  8   
## [7]  0.001199840 3.729058  8   
## [8]  0.001599787 2.796793  8   
## [9]  0.001866418 2.996564 10   
## [10] 0.001466471 4.177085  8

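Interpretation of a few of these rules:

- Rule 7, {frozen smoothie, spinach} => {mineral water}, has a confidence of 0.889: about 89% of the baskets containing frozen smoothie and spinach also contain mineral water. The rule is based on only 8 transactions.
- Rule 2, {low fat yogurt, shallot} => {cookies}, has a lift of 8.61: cookies appear about 8.6 times more often in baskets containing low fat yogurt and shallot than in baskets overall.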

# ordering the rules by a criterion, starting with confidence
rules <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(rules[1:5])
##     lhs                        rhs                 support confidence    coverage      lift count
## [1] {french fries,                                                                               
##      mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.001066524       1.00 0.001066524 12.606723     8
## [2] {ground beef,                                                                                
##      light cream,                                                                                
##      olive oil}             => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [3] {cake,                                                                                       
##      meatballs,                                                                                  
##      mineral water}         => {milk}          0.001066524       1.00 0.001066524  7.717078     8
## [4] {cake,                                                                                       
##      olive oil,                                                                                  
##      shrimp}                => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [5] {mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.002532996       0.95 0.002666311 11.976387    19
# ordering the rules by support
rules <- sort(rules, by = "support", decreasing = TRUE)
inspect(rules[1:5])
##     lhs                     rhs                 support confidence    coverage     lift count
## [1] {frozen vegetables,                                                                      
##      soup}               => {mineral water} 0.005065991  0.6333333 0.007998933 2.656954    38
## [2] {olive oil,                                                                              
##      tomatoes}           => {spaghetti}     0.004399413  0.6111111 0.007199040 3.509912    33
## [3] {pancakes,                                                                               
##      soup}               => {mineral water} 0.004266098  0.6274510 0.006799093 2.632276    32
## [4] {chocolate,                                                                              
##      eggs,                                                                                   
##      ground beef}        => {mineral water} 0.003999467  0.6122449 0.006532462 2.568484    30
## [5] {frozen vegetables,                                                                      
##      ground beef,                                                                            
##      milk}               => {mineral water} 0.003732836  0.6511628 0.005732569 2.731752    28
# ordering the rules by lift
rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:5])
##     lhs                        rhs                        support confidence    coverage     lift count
## [1] {escalope,                                                                                         
##      french fries,                                                                                     
##      pasta}                 => {mushroom cream sauce} 0.001066524  0.6666667 0.001599787 34.96970     8
## [2] {fresh tuna,                                                                                       
##      fromage blanc}         => {honey}                0.001599787  0.6666667 0.002399680 14.04682    12
## [3] {eggs,                                                                                             
##      mineral water,                                                                                    
##      pasta}                 => {shrimp}               0.001333156  0.9090909 0.001466471 12.72218    10
## [4] {french fries,                                                                                     
##      mushroom cream sauce,                                                                             
##      pasta}                 => {escalope}             0.001066524  1.0000000 0.001066524 12.60672     8
## [5] {milk,                                                                                             
##      pasta}                 => {shrimp}               0.001599787  0.8571429 0.001866418 11.99520    12

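Interpretation:

- Ordering by confidence in descending order, the top 5 consist of 4 rules with 100% confidence and 1 rule with 95% confidence. A confidence of 100% means that every basket containing the left-hand side also contained the right-hand side, although each of those rules rests on only 8 or 9 transactions.
- Ordering by support in descending order, the first rule has a support of 0.005: its items appear together in about 0.5% of all transactions (38 of 7501).
- Ordering by lift in descending order, the first rule has a lift of 34.97: mushroom cream sauce is about 35 times more likely to be in a basket that contains escalope, french fries and pasta than in a basket chosen at random.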
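The three measures can be checked by hand. The sketch below recomputes support, confidence and lift for {frozen vegetables, soup} => {mineral water} from numbers printed above. The variable names are my own, and the left-hand-side count of 60 is not printed anywhere: it is inferred from the rule's coverage, so treat it as an assumption.

```r
n <- 7501            # transactions in the dataset
count_rule <- 38     # baskets containing lhs and rhs (the rule's count)
count_rhs <- 1788    # baskets containing mineral water (from summary(df))

# Assumption: lhs count recovered from the printed coverage
count_lhs <- round(0.007998933 * n)

support <- count_rule / n               # share of all baskets with lhs and rhs
confidence <- count_rule / count_lhs    # share of lhs baskets that also hold rhs
lift <- confidence / (count_rhs / n)    # confidence relative to the rhs base rate

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The results agree with the values reported by `inspect()` for this rule, which confirms that lift is the confidence divided by the overall frequency of the right-hand side.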

# Visualizing the rules

# calling the library
library(arulesViz)

# plotting
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules, method = 'grouped')

Case Example

Suppose we are interested in running a promotion on chocolate. We can take the subset of rules that have chocolate on the right-hand side. This tells us which items customers tend to buy together with chocolate in the same transaction (the rules carry no information about purchase order):

chocolate <- subset(rules, subset = rhs %pin% "chocolate")
 
# Then order by confidence
chocolate <- sort(chocolate, by="confidence", decreasing=TRUE)
inspect(chocolate[1:5])
##     lhs                                 rhs         support     confidence
## [1] {escalope, french fries, shrimp} => {chocolate} 0.001066524 0.8888889 
## [2] {red wine, tomato sauce}         => {chocolate} 0.001066524 0.8000000 
## [3] {burgers, olive oil, pancakes}   => {chocolate} 0.001199840 0.7500000 
## [4] {almonds, olive oil, spaghetti}  => {chocolate} 0.001066524 0.7272727 
## [5] {almonds, milk, spaghetti}       => {chocolate} 0.001066524 0.7272727 
##     coverage    lift     count
## [1] 0.001199840 5.425188 8    
## [2] 0.001333156 4.882669 8    
## [3] 0.001599787 4.577502 9    
## [4] 0.001466471 4.438790 8    
## [5] 0.001466471 4.438790 8

Observation:

Chocolate was most reliably bought in the same transaction as the following sets of items: {escalope, french fries, shrimp}, {red wine, tomato sauce}, {burgers, olive oil, pancakes}, {almonds, olive oil, spaghetti} and {almonds, milk, spaghetti}. Each of these rules is based on only 8 or 9 transactions.
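Before building a promotion on these rules, it helps to check how many chocolate purchases they actually cover. The sketch below does this for the first rule using numbers printed above; the variable names are my own, and the left-hand-side count of 9 is inferred from the rule's coverage rather than read from the output.

```r
n <- 7501                # transactions in the dataset
count_rule <- 8          # baskets with escalope, french fries, shrimp and chocolate
count_chocolate <- 1229  # baskets containing chocolate (from summary(df))

# Assumption: lhs count recovered from the printed coverage
count_lhs <- round(0.001199840 * n)

# Confidence of the mined rule: lhs => chocolate
conf_forward <- count_rule / count_lhs

# Confidence in the opposite direction: chocolate => lhs
conf_reverse <- count_rule / count_chocolate

# Lift is symmetric, so it is the same for both directions
lift <- conf_forward / (count_chocolate / n)

print(round(c(forward = conf_forward, reverse = conf_reverse, lift = lift), 6))
```

The rule is reliable when its left-hand side occurs, but it accounts for fewer than 1% of the baskets that contain chocolate, so a promotion built on it alone would reach very few chocolate buyers.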

Conclusion