library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
path_trans <- "http://bit.ly/SupermarketDatasetII"
df <- read.transactions(path_trans, sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
df
## transactions in sparse format with
##  7501 transactions (rows) and
##  119 items (columns)
# Verifying the object's class
# ---
# This should show us transactions as the type of data that we will need
# ---
# 
class(df)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
# Previewing our first 5 transactions
#
inspect(df[1:5])
##     items               
## [1] {almonds,           
##      antioxydant juice, 
##      avocado,           
##      cottage cheese,    
##      energy drink,      
##      frozen smoothie,   
##      green grapes,      
##      green tea,         
##      honey,             
##      low fat yogurt,    
##      mineral water,     
##      olive oil,         
##      salad,             
##      salmon,            
##      shrimp,            
##      spinach,           
##      tomato juice,      
##      vegetables mix,    
##      whole weat flour,  
##      yams}              
## [2] {burgers,           
##      eggs,              
##      meatballs}         
## [3] {chutney}           
## [4] {avocado,           
##      turkey}            
## [5] {energy bar,        
##      green tea,         
##      milk,              
##      mineral water,     
##      whole wheat rice}
itemFrequencyPlot(df, topN=10,  cex.names=1)

# This would give us some information such as the most purchased items, 
# distribution of the item sets (no. of items purchased in each transaction), etc.
# ---
# 
summary(df)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
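The density and the mean transaction size reported by summary(df) can be reproduced by hand from the item counts in that output. The following is a minimal base R sketch; the variable names are illustrative and the numbers are copied from the summary above, not recomputed from the data.

```r
# Figures copied from the summary(df) output above.
n_transactions <- 7501
n_items <- 119

# Counts of the five most frequent items plus the "(Other)" bucket.
item_counts <- c(1788, 1348, 1306, 1282, 1229, 22405)

# Total number of (transaction, item) pairs, i.e. filled cells in the sparse matrix.
total_purchases <- sum(item_counts)

# Density: share of filled cells in the 7501 x 119 matrix.
density <- total_purchases / (n_transactions * n_items)

# Mean number of items per transaction.
mean_basket_size <- total_purchases / n_transactions

print(total_purchases)
print(round(density, 8))
print(round(mean_basket_size, 3))
```

The rounded results agree with the density of 0.03288973 and the mean of 3.914 shown in the summary.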
# Exploring the frequency of some items,
# i.e. the first 10 items (columns 1 to 10), first as absolute counts
# and then as a percentage of the total transactions
itemFrequency(df[, 1:10],type = "absolute")
##           almonds antioxydant juice         asparagus           avocado 
##               153                67                36               250 
##       babies food             bacon    barbecue sauce         black tea 
##                34                65                81               107 
##       blueberries        body spray 
##                69                86
round(itemFrequency(df[, 1:10],type = "relative")*100,2)
##           almonds antioxydant juice         asparagus           avocado 
##              2.04              0.89              0.48              3.33 
##       babies food             bacon    barbecue sauce         black tea 
##              0.45              0.87              1.08              1.43 
##       blueberries        body spray 
##              0.92              1.15
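The relative frequencies are simply the absolute counts divided by the number of transactions. As a small check in base R, using four of the counts printed above (the vector name abs_freq is illustrative):

```r
# Absolute counts copied from the itemFrequency(..., type = "absolute") output.
n_transactions <- 7501
abs_freq <- c(almonds = 153, "antioxydant juice" = 67, asparagus = 36, avocado = 250)

# Relative frequency (support) of each item, expressed as a percentage.
rel_freq_pct <- round(abs_freq / n_transactions * 100, 2)
print(rel_freq_pct)
```

These values match the percentages in the relative output for the same four items (2.04, 0.89, 0.48 and 3.33).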
# Producing a chart of item frequencies and filtering
# to consider only items with a minimum level of support,
# or only the top x items
# ---
# Displaying the 10 most common items in the transactions dataset
# and the items whose support (relative frequency) is at least 10%
# 
par(mfrow = c(1, 2))

# plot the frequency of items
itemFrequencyPlot(df, topN = 10,col="darkgreen")
itemFrequencyPlot(df, support = 0.1,col="darkred")

# We can see that mineral water was the most common item
# Building a model based on association rules 
# using the apriori function 
# ---
# We use a minimum support of 0.001 and a minimum confidence of 0.8.
# These values were chosen because changing them leads to a large rise or drop in the number of rules.
# ---
# 
rules <- apriori (df, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules
## set of 74 rules
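The line "Absolute minimum support count: 7" in the log relates the relative support threshold to a number of transactions. The sketch below shows the arithmetic in base R. That the reported 7 comes from truncating 0.001 * 7501 is an assumption based on the numbers, since the source does not show how arules computes it.

```r
n_transactions <- 7501
min_support <- 0.001

# Relative threshold expressed as a number of transactions.
threshold <- min_support * n_transactions

# Assumption: the count printed in the log is this value truncated to an integer.
reported_count <- trunc(threshold)

# By the definition of support, an itemset must occur in at least this many
# transactions for count / n_transactions to reach min_support.
smallest_qualifying_count <- ceiling(threshold)

print(c(threshold = threshold,
        reported = reported_count,
        smallest_qualifying = smallest_qualifying_count))
```

Every rule printed in this document has a count of 8 or more.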
# Observing rules built in our model i.e. first 5 model rules
# ---
# 
inspect(rules[1:5])
##     lhs                              rhs             support     confidence
## [1] {frozen smoothie, spinach}    => {mineral water} 0.001066524 0.8888889 
## [2] {bacon, pancakes}             => {spaghetti}     0.001733102 0.8125000 
## [3] {nonfat milk, turkey}         => {mineral water} 0.001199840 0.8181818 
## [4] {ground beef, nonfat milk}    => {mineral water} 0.001599787 0.8571429 
## [5] {mushroom cream sauce, pasta} => {escalope}      0.002532996 0.9500000 
##     coverage    lift      count
## [1] 0.001199840  3.729058  8   
## [2] 0.002133049  4.666587 13   
## [3] 0.001466471  3.432428  9   
## [4] 0.001866418  3.595877 12   
## [5] 0.002666311 11.976387 19
# Interpretation of the first rule:
# ---
# Of the transactions that contain frozen smoothie and spinach, about 89% (confidence 0.889) also contain mineral water
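To make the columns of the rule table concrete, the measures for the first rule can be recomputed from raw counts. This is a base R sketch with illustrative variable names. The count of 9 transactions containing the left-hand side is an assumption derived from the reported coverage (0.001199840 * 7501), and 1788 is the mineral water count from summary(df).

```r
n_transactions <- 7501
count_rule <- 8     # transactions with frozen smoothie, spinach and mineral water
count_lhs  <- 9     # assumption: derived from coverage 0.001199840 * 7501
count_rhs  <- 1788  # transactions with mineral water, from summary(df)

support    <- count_rule / n_transactions   # share of all transactions matching the whole rule
coverage   <- count_lhs / n_transactions    # share of all transactions matching the LHS
confidence <- count_rule / count_lhs        # share of LHS transactions that also contain the RHS
lift       <- confidence / (count_rhs / n_transactions)  # confidence relative to the RHS base rate

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 6))
```

A lift of about 3.73 means mineral water appears roughly 3.7 times as often in these baskets as in the dataset overall, although the rule rests on only 8 transactions.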
# Ordering these rules by a criterion such as the level of confidence,
# then looking at the first five rules.
# We can also use different criteria, such as by = "lift" or by = "support"
# 
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:5])
##     lhs                        rhs                 support confidence    coverage      lift count
## [1] {french fries,                                                                               
##      mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.001066524       1.00 0.001066524 12.606723     8
## [2] {ground beef,                                                                                
##      light cream,                                                                                
##      olive oil}             => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [3] {cake,                                                                                       
##      meatballs,                                                                                  
##      mineral water}         => {milk}          0.001066524       1.00 0.001066524  7.717078     8
## [4] {cake,                                                                                       
##      olive oil,                                                                                  
##      shrimp}                => {mineral water} 0.001199840       1.00 0.001199840  4.195190     9
## [5] {mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.002532996       0.95 0.002666311 11.976387    19
# Interpretation
# ---
# The top 4 rules have a confidence of 1.00 (100%), but each is based on only 8 or 9 transactions
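Confidence alone can favor rules that rest on very few transactions. As an illustration, the five rules printed above can be re-ranked by lift in base R. The data frame below is typed in by hand from that output, so it covers only these five rows and not the full set of 74 rules.

```r
# The five rules shown above, copied from the inspect() output.
top5 <- data.frame(
  rule = c("{french fries, mushroom cream sauce, pasta} => {escalope}",
           "{ground beef, light cream, olive oil} => {mineral water}",
           "{cake, meatballs, mineral water} => {milk}",
           "{cake, olive oil, shrimp} => {mineral water}",
           "{mushroom cream sauce, pasta} => {escalope}"),
  confidence = c(1.00, 1.00, 1.00, 1.00, 0.95),
  lift = c(12.606723, 4.195190, 7.717078, 4.195190, 11.976387),
  count = c(8, 9, 8, 9, 19),
  stringsAsFactors = FALSE
)

# Re-rank the same five rules by lift instead of confidence.
by_lift <- top5[order(top5$lift, decreasing = TRUE), ]
print(by_lift[, c("rule", "lift", "count")], row.names = FALSE)
```

Ranked this way, the rule with confidence 0.95 moves up to second place, and it is also the one supported by the most transactions (19).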
library(arulesViz)
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The plot shows support on the x-axis and confidence on the y-axis. Lift is shown as a color, with levels ranging from grey to red.

We could also use only “confidence” as a specific measure of interest:

plot(rules, measure = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

There is a special value for shading called “order”, which colors the points by the length (order) of the rule. Selecting method = “two-key plot” produces such a plot; it is basically a scatterplot with shading = “order”:

plot(rules, method = "two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
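The order of a rule is the number of items it contains, counting both sides. As a small base R sketch, two of the rules printed earlier are written out by hand as item vectors (the list name rule_items is illustrative):

```r
# Items of two rules from the output above, LHS items followed by the RHS item.
rule_items <- list(
  c("frozen smoothie", "spinach", "mineral water"),
  c("french fries", "mushroom cream sauce", "pasta", "escalope")
)

# Order (length) of each rule: total number of items on both sides.
rule_order <- lengths(rule_items)
print(rule_order)
```

In the two-key plot, these two rules would therefore fall into different color groups (order 3 and order 4).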

# Interactive Scatter plot

plot(rules, engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Graph-based visualization

Graph-based techniques concentrate on the relationship between individual items in the rule set. They represent the rules (or itemsets) as a graph with items as labeled vertices, and rules (or itemsets) represented as vertices connected to items using arrows.

For rules, the LHS items are connected with arrows pointing to the vertex representing the rule and the RHS has an arrow pointing to the item.
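The structure described above can be sketched as an edge list in base R. This is only an illustration of the idea, not how arulesViz builds its graph internally. The rule ids "rule1" and "rule2" are hypothetical labels, and the two rules are copied from the inspect() output earlier in the document.

```r
# Two rules from the output above, with hypothetical ids for the rule vertices.
rules_df <- data.frame(
  id  = c("rule1", "rule2"),
  lhs = c("frozen smoothie, spinach", "bacon, pancakes"),
  rhs = c("mineral water", "spaghetti"),
  stringsAsFactors = FALSE
)

# For each rule: one arrow from every LHS item to the rule vertex,
# and one arrow from the rule vertex to the RHS item.
edges <- do.call(rbind, lapply(seq_len(nrow(rules_df)), function(i) {
  lhs_items <- strsplit(rules_df$lhs[i], ", ", fixed = TRUE)[[1]]
  rbind(
    data.frame(from = lhs_items, to = rules_df$id[i], stringsAsFactors = FALSE),
    data.frame(from = rules_df$id[i], to = rules_df$rhs[i], stringsAsFactors = FALSE)
  )
}))

print(edges, row.names = FALSE)
```

Each rule with two LHS items contributes three arrows, so the two rules give six edges in total.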

Several engines are available. The default engine uses igraph (plot.igraph and tkplot for the interactive visualization). Additional arguments given via ... are passed on to the respective plotting function (use for color, etc.).

The network graph below shows associations between selected items. Larger circles imply higher support, while red circles imply higher lift. Graphs only work well with very few rules, which is why we only use a subset of 10 rules from our data:

subrules <- head(rules, n = 10, by = "confidence")

plot(subrules, method = "graph",  engine = "htmlwidget")

# Parallel coordinate plot

plot(subrules, method="paracoord")