The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The dataset is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, and your top 10 rules by lift.
Import all necessary libraries
library(arules)        # association rule mining: read.transactions(), apriori()
library(pander)        # formatted tables
library(arulesViz)     # visualizations of association rules
library(fpp2)          # loaded alongside the other course packages; not used below
library(RColorBrewer)  # color palettes for the plots
Load the CSV file into R and start the exploratory data analysis (EDA).
The purpose of market basket analysis is to let retailers and businesses see which items customers buy together and use that information to make profitable decisions.
Load the data from the CSV file using read.transactions() from the arules package.
grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")
grocery_df
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Use summary() to get an overview of the data.
summary(grocery_df)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary gives the number of rows and columns in the data. It lists the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below that is the length distribution of the transactions: basket sizes range from 1 to 32 items, with 2159 baskets containing exactly one item, 1643 baskets containing two items, and so on. Adding up all of these counts gives the total number of transactions, 9835. Looking at the distribution, the mean is 4.4, which means there are on average about four items per basket. A quick sanity check of this reading follows below.
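A minimal sketch of that check; size() from arules returns the number of items in each transaction:
table(size(grocery_df)) # how many baskets contain 1, 2, 3, ... items
sum(table(size(grocery_df))) # the counts add up to the 9835 transactions
mean(size(grocery_df)) # about 4.4 items per basket on average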
In itemFrequencyPlot(grocery_df, topN = 10, type = "absolute"), the first argument is the transaction object to be plotted, grocery_df. topN plots the N highest-frequency items. type can be "absolute" or "relative": with "absolute" the plot shows the raw number of transactions containing each item, while with "relative" it shows the proportion of transactions containing each item, making the items easy to compare against one another.
itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')
The above plot shows the same top five items as we get from summary(); a relative-frequency version is sketched below.
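For comparison, a sketch of the same plot with type = "relative", which shows the share of baskets containing each item instead of raw counts:
itemFrequencyPlot(grocery_df, topN = 10, type = "relative", col = brewer.pal(8, 'Pastel2'), main = 'Top 10 items purchased (relative)')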
bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)
The above plot shows the bottom 10 items purchased.
Item distribution across baskets
hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200),
main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)
We can see that the number of baskets decreases as the number of items per basket increases.
In this section we create rules using the Apriori algorithm and interpret how it works.
The next step is to mine the rules with apriori() from the arules package.
# Min Support as 0.001, confidence as 0.8.
association_rules <- apriori(grocery_df, parameter = list(supp=0.001, conf=0.8,maxlen=10), control=list(verbose=F))
apriori() takes as its first argument the transaction object on which mining is to be applied. The parameter argument sets the minimum support and minimum confidence; the defaults are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen). The sketch below shows how a stricter support threshold shrinks the rule set.
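A minimal sketch with a stricter minimum support (the supp = 0.002 value here is an illustrative assumption, not part of the assignment):
# Raising the minimum support prunes rules whose itemsets appear in too few baskets.
stricter_rules <- apriori(grocery_df, parameter = list(supp = 0.002, conf = 0.8, maxlen = 10), control = list(verbose = F))
length(stricter_rules) # far fewer than the 410 rules found with supp = 0.001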
summary(association_rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery_df 9835 0.001 0.8
Above, summary() shows the following:
Parameter specification: min_sup = 0.001 and min_confidence = 0.8, with a maximum of 10 items per rule.
Total number of rules: a set of 410 rules.
Distribution of rule lengths: rules of length 4 are the most common (229) and rules of length 6 are the least common (12).
Summary of quality measures: min, quartile, mean, and max values for support, confidence, coverage, and lift.
Mining info: the data, support, and confidence we provided to the algorithm.
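Since pander is loaded, the quality measures can also be shown as a formatted table; a small sketch (quality() from arules returns the support/confidence/coverage/lift/count columns as a data frame):
pander(head(quality(association_rules)))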
Since there are 410 rules, let's inspect only the first 10:
inspect(association_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 0.001525165 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
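The assignment asks for the top 10 rules by lift; a minimal sketch using sort() from arules on the lift quality measure:
top10_by_lift <- head(sort(association_rules, by = "lift"), 10)
inspect(top10_by_lift)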
The inspect() output shows lhs, rhs, support, confidence, coverage, lift, and count. Briefly, what these terms mean: support is the fraction of all transactions containing both the lhs and the rhs; confidence is the probability of seeing the rhs in a basket given that the basket contains the lhs; coverage is the support of the lhs alone; lift is the confidence divided by the support of the rhs, so values well above 1 mean the lhs makes the rhs much more likely; count is the number of transactions containing the rule.
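As a worked check, take rule [10] {rice,sugar} => {whole milk}: summary() earlier reported whole milk in 2513 of the 9835 baskets, and lift is confidence divided by the support of the rhs:
2513 / 9835 # support(whole milk), about 0.2555
1.0 / (2513 / 9835) # lift = confidence / support(rhs), about 3.9136, matching the table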
Using the above output, we can make observations such as: every customer who bought rice and sugar also bought whole milk (rule [10], confidence 1.0), and customers who bought liquor and red/blush wine were very likely to also buy bottled beer (rule [1], lift 11.2).
We can remove redundant rules, i.e., rules that contain another, more general rule as a subset. Use the code below to find such rules:
# indices of rules that have another rule as a subset (redundant rules)
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1)
length(subset_rules)
## [1] 91
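A sketch of the follow-up step, dropping the flagged rules by index (this assumes we simply keep the remainder of association_rules for reporting):
association_rules_pruned <- association_rules[-subset_rules] # drop the 91 redundant rules
length(association_rules_pruned) # 410 - 91 = 319 rules remain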