Question

The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The dataset is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.

Response

Import all necessary libraries

library(arules)
library(pander)
library(arulesViz)
library(fpp2)
library(RColorBrewer)

The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. Load the CSV file into R and let's get into EDA.

The purpose of market basket analysis is to let retailers and other businesses see which items customers buy together and use that information to make profitable decisions.

Load the data from the CSV file using read.transactions() from the arules package.

grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")

grocery_df
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
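
To make the "basket" format concrete, here is a minimal sketch that writes three made-up receipts to a temporary file and reads them back with read.transactions(). The item names and the temporary file are illustrative assumptions, not part of the grocery data.

```r
library(arules)

# Hypothetical toy receipts: one line per transaction, items separated by commas.
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,bread",
             "milk",
             "bread,butter,milk"), tmp)

# format = "basket" means each line holds all the items of one transaction.
toy <- read.transactions(tmp, sep = ",", format = "basket")

inspect(toy)   # one row per receipt
size(toy)      # number of items in each receipt: 2 1 3

unlink(tmp)    # remove the temporary file
```

The real file is read the same way, only with many more lines and item labels.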

Exploratory Data Analysis

Use summary() to get an overview of the data.

summary(grocery_df)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The summary gives the number of rows (transactions) and columns (items) in the data, along with its density: about 2.6% of the cells in the 9835-by-169 item matrix are filled. It then lists the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below this is the distribution of transaction lengths, that is, the number of items per basket. Baskets range from 1 to 32 items: 2159 baskets contain exactly 1 item, 1643 contain exactly 2 items, and so on up to a single basket with 32 items. Adding up all these counts gives the total number of rows, 9835. The mean of the distribution is 4.4, so there are on average about 4 items per basket, while the median is 3.
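
The figures in the summary can be cross-checked against each other. The sketch below uses only base R, with the basket sizes and counts copied by hand from the length distribution printed above; the variable names are illustrative.

```r
# Length distribution from summary(grocery_df):
# basket size -> number of transactions with that many items.
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

n_transactions <- sum(counts)           # every basket is counted exactly once
n_item_entries <- sum(sizes * counts)   # total item entries across all baskets

n_transactions                           # 9835
n_item_entries                           # 43367
n_item_entries / n_transactions          # mean basket size, about 4.409
n_item_entries / (n_transactions * 169)  # density, about 0.0261
```

The total of 43367 item entries also equals the sum of the "most frequent items" counts including (Other), which is a useful sanity check that the file was parsed as intended.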

In itemFrequencyPlot(grocery_df, topN = 10, type = "absolute"), the first argument is the transaction object to be plotted, here grocery_df. topN restricts the plot to the N most frequent items. type can be "absolute" or "relative". With "absolute", the plot shows the number of transactions that contain each item. With "relative", it shows the proportion of all transactions that contain each item, that is, the item's support.

itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')

The above plot shows the same top five items as the summary() output.
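
The relationship between the two type values can be checked numerically with itemFrequency(), the function behind the plot. This is a small sketch on made-up baskets (the items and the object names are assumptions for illustration), built with the list-to-transactions coercion from arules.

```r
library(arules)

# Hypothetical toy baskets, not taken from the grocery data.
baskets <- list(c("milk", "bread"),
                c("milk"),
                c("milk", "butter"),
                c("bread", "butter"))
toy <- as(baskets, "transactions")

abs_freq <- itemFrequency(toy, type = "absolute")  # transactions containing each item
rel_freq <- itemFrequency(toy, type = "relative")  # the same counts divided by length(toy)

abs_freq[["milk"]]   # 3
rel_freq[["milk"]]   # 0.75

# Relative frequency is absolute frequency over the number of transactions.
all.equal(rel_freq, abs_freq / length(toy))
```

Applied to the grocery data, the same arithmetic gives whole milk a relative frequency of 2513 / 9835, roughly 0.26.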

bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)

The above plot shows the Bottom 10 items purchased.

Items distribution in basket

hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200), 
     main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)

We can see that the number of baskets decreases as the number of items per basket increases.

Model Building

In this section we use the Apriori algorithm to generate rules and interpret them.

Generating Rules

The next step is to mine the rules using the Apriori algorithm. The function apriori() is from the arules package.

# Minimum support 0.001, minimum confidence 0.8, at most 10 items per rule.
association_rules <- apriori(grocery_df, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10), control = list(verbose = FALSE))

apriori() takes the transaction object to be mined as its data argument. The parameter argument sets the minimum support and minimum confidence. Its default values are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen).

summary(association_rules)
## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001247   Mean   :0.8663   Mean   :0.001449   Mean   : 3.951  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.341  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.27  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##        data ntransactions support confidence
##  grocery_df          9835   0.001        0.8

The summary() output above shows the following:

Parameter Specification: min_sup=0.001 and min_confidence=0.8, with a maximum of 10 items in a rule.

Total number of rules: The set of 410 rules

Distribution of rule length: Rules of length 4 are the most common (229) and rules of length 6 are the least common (12).

Summary of Quality measures: The minimum, quartiles, mean and maximum of support, confidence, coverage, lift and count.

Information used for creating rules: The data, support, and confidence we provided to the algorithm.
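
The minimum support and the minimum count in the quality summary are two views of the same threshold. A short base R sketch, using the transaction total from summary(grocery_df) and the support passed to apriori() (variable names are illustrative):

```r
n_transactions <- 9835    # from summary(grocery_df)
min_support    <- 0.001   # supp passed to apriori()

# Smallest number of transactions an itemset needs to reach min_support.
min_count <- ceiling(min_support * n_transactions)

min_count                    # 10
min_count / n_transactions   # about 0.001017, the smallest attainable support
```

This is why the lowest count reported is 10 and the lowest support is 0.001017 rather than exactly 0.001: with so few supporting baskets, each rule rests on only a handful of receipts.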

Since there are 410 rules, let’s print only the first 10:

inspect(association_rules[1:10])
##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {cereals,                                                                                        
##       curd}                   => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [3]  {cereals,                                                                                        
##       yogurt}                 => {whole milk}       0.001728521  0.8095238 0.002135231  3.168192    17
## [4]  {butter,                                                                                         
##       jam}                    => {whole milk}       0.001016777  0.8333333 0.001220132  3.261374    10
## [5]  {bottled beer,                                                                                   
##       soups}                  => {whole milk}       0.001118454  0.9166667 0.001220132  3.587512    11
## [6]  {house keeping products,                                                                         
##       napkins}                => {whole milk}       0.001321810  0.8125000 0.001626843  3.179840    13
## [7]  {house keeping products,                                                                         
##       whipped/sour cream}     => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [8]  {pastry,                                                                                         
##       sweet spreads}          => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [9]  {curd,                                                                                           
##       turkey}                 => {other vegetables} 0.001220132  0.8000000 0.001525165  4.134524    12
## [10] {rice,                                                                                           
##       sugar}                  => {whole milk}       0.001220132  1.0000000 0.001220132  3.913649    12
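
Indexing with [1:10] returns the first ten rules in stored order, not the ten strongest. Since the assignment asks for the top 10 rules by lift, one way to get them is to sort on lift before taking the head. The sketch below is self-contained and makes one assumption: it uses the Groceries data bundled with arules as a stand-in for GroceryDataSet.csv so that it runs without a download. The object names rules and top_lift are illustrative.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, maxlen = 10),
                 control = list(verbose = FALSE))

# sort() orders rules by a quality measure, highest first by default.
top_lift <- head(sort(rules, by = "lift"), n = 10)
inspect(top_lift)
```

In the session above, the same call on association_rules, head(sort(association_rules, by = "lift"), n = 10), should give the ranking for the CSV data.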

The rule table above shows lhs, rhs, support, confidence, coverage, lift and count. Here is what these terms mean.

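  • lhs: the antecedent, the items already present in the basket
  • rhs: the consequent, the item likely to be bought together with the lhs
  • support: fraction of all transactions that contain every item in the rule (lhs and rhs together)
  • confidence: for a rule A=>B, the fraction of transactions containing A that also contain B
  • coverage: fraction of all transactions that contain the lhs, that is, the support of A
  • lift: for a rule A=>B, the confidence divided by the support of B. A lift above 1 means A and B occur together more often than expected if they were independent; a lift of 1 indicates no association.
  • count: number of transactions that contain every item in the rule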
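
As a worked example, the measures for rule [10], {rice, sugar} => {whole milk}, can be recomputed in base R from numbers already printed above: the transaction total and the whole milk count from summary(grocery_df), and the count and coverage from inspect(). The variable names are illustrative.

```r
n_transactions <- 9835         # from summary(grocery_df)
n_whole_milk   <- 2513         # transactions containing whole milk
count_rule     <- 12           # transactions containing rice, sugar and whole milk
coverage       <- 0.001220132  # support of the lhs {rice, sugar}

support    <- count_rule / n_transactions
count_lhs  <- round(coverage * n_transactions)   # transactions containing the lhs
confidence <- count_rule / count_lhs
lift       <- confidence / (n_whole_milk / n_transactions)

c(support = support, confidence = confidence, lift = lift)
# support about 0.00122, confidence 1, lift about 3.9136
```

The results agree with the table: every one of the 12 baskets with rice and sugar also contains whole milk, and that is about 3.9 times the rate at which whole milk appears in baskets overall.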

Using the above output, we can make analysis such as:

  • 100% of the customers who bought {rice,sugar} also bought {whole milk}.
  • 92% of the customers who bought {house keeping products,whipped/sour cream} also bought {whole milk}.

Removing redundant rules

We can remove rules that are subsets of larger rules. Use the code below to remove such rules:

# get subset rules in vector
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1) 
length(subset_rules)
## [1] 91
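
As a point of comparison, arules also provides is.redundant(), which uses a different criterion from the is.subset() approach above: it flags a rule when a more general rule already reaches the same or higher confidence. The sketch below is an alternative technique, not the method used in this analysis, so the number of rules it removes need not match the 91 found above. It assumes the Groceries data bundled with arules as a stand-in for GroceryDataSet.csv; the object names are illustrative.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, maxlen = 10),
                 control = list(verbose = FALSE))

# Logical vector, one entry per rule: TRUE when a more general rule
# has equal or higher confidence (the default measure).
redundant <- is.redundant(rules)
pruned    <- rules[!redundant]

length(rules)    # rules before pruning
length(pruned)   # rules after pruning
```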