The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The dataset is in GroceryDataSet.csv (a comma-separated file). Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, and your top 10 rules by lift.
Import all necessary libraries
library(arules)        # association rule mining: read.transactions(), apriori()
library(pander)        # formatted tables
library(arulesViz)     # visualizations of association rules
library(fpp2)          # loaded alongside the other course packages; not used below
library(RColorBrewer)  # color palettes for the plots
Load the CSV file into R and start the exploratory data analysis (EDA).
The purpose of market basket analysis is to let retailers and businesses see which items customers buy together and use that information to make profitable decisions.
Load the data from the CSV file using read.transactions() from the arules package.
grocery_df <- read.transactions('https://raw.githubusercontent.com/SubhalaxmiRout002/DATA624/main/Week4/GroceryDataSet.csv', sep = ",", format = "basket")
grocery_df
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Use summary() to get an overview of the data.
summary(grocery_df)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary gives the number of rows and columns in the data. It lists the most frequently purchased items: whole milk is at the top, other vegetables is second, rolls/buns is third, and so on. Below that is the length distribution of the transactions: basket sizes range from 1 to 32 items, with 2159 baskets containing exactly one item, 1643 baskets containing two items, and so on. Adding up all of these counts gives the total number of transactions, 9835. Looking at the distribution, the mean is 4.4, which means there are on average about four items per basket. A quick sanity check of this reading follows below.
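A minimal sketch of that check; size() from arules returns the number of items in each transaction:
table(size(grocery_df)) # how many baskets contain 1, 2, 3, ... items
sum(table(size(grocery_df))) # the counts add up to the 9835 transactions
mean(size(grocery_df)) # about 4.4 items per basket on average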
In itemFrequencyPlot(grocery_df, topN = 10, type = "absolute"), the first argument is the transaction object to be plotted, grocery_df. topN plots the N highest-frequency items. type can be "absolute" or "relative": with "absolute" the plot shows the raw number of transactions containing each item, while with "relative" it shows the proportion of transactions containing each item, making the items easy to compare against one another.
itemFrequencyPlot(grocery_df, topN = 10, type="absolute", col=brewer.pal(8,'Pastel2'), main = 'Top 10 items purchased')
The above plot shows the same top five items as we get from summary(); a relative-frequency version is sketched below.
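For comparison, a sketch of the same plot with type = "relative", which shows the share of baskets containing each item instead of raw counts:
itemFrequencyPlot(grocery_df, topN = 10, type = "relative", col = brewer.pal(8, 'Pastel2'), main = 'Top 10 items purchased (relative)')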
bottom_10 <- head(sort(itemFrequency(grocery_df, type="absolute"), decreasing=FALSE), n=10)
par(mar=c(10.5,3,2, 0.3))
barplot(bottom_10, ylab = "Frequency", main = "Bottom 10 items purchased", col=brewer.pal(8,'Pastel2'), las = 2)
The above plot shows the bottom 10 items purchased.
Item distribution across baskets
hist(size(grocery_df), breaks = 0:35, xaxt="n", ylim=c(0,2200),
main = "Number of items in particular baskets", xlab = "Items", col = brewer.pal(8,'Pastel2'))
axis(1, at=seq(0,33,by=1), cex.axis=0.8)
We can see that the number of baskets decreases as the number of items per basket increases.
In this section we create rules using the Apriori algorithm and interpret how it works.
The next step is to mine the rules with apriori() from the arules package.
# Min Support as 0.001, confidence as 0.8.
association_rules <- apriori(grocery_df, parameter = list(supp=0.001, conf=0.8,maxlen=10), control=list(verbose=F))
apriori() takes as its first argument the transaction object on which mining is to be applied. The parameter argument sets the minimum support and minimum confidence; the defaults are a minimum support of 0.1, a minimum confidence of 0.8, and a maximum of 10 items per rule (maxlen). The sketch below shows how a stricter support threshold shrinks the rule set.
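A minimal sketch with a stricter minimum support (the supp = 0.002 value here is an illustrative assumption, not part of the assignment):
# Raising the minimum support prunes rules whose itemsets appear in too few baskets.
stricter_rules <- apriori(grocery_df, parameter = list(supp = 0.002, conf = 0.8, maxlen = 10), control = list(verbose = F))
length(stricter_rules) # far fewer than the 410 rules found with supp = 0.001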
summary(association_rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery_df 9835 0.001 0.8
Above, summary() shows the following:
Parameter specification: min_sup = 0.001 and min_confidence = 0.8, with a maximum of 10 items per rule.
Total number of rules: a set of 410 rules.
Distribution of rule lengths: rules of length 4 are the most common (229) and rules of length 6 are the least common (12).
Summary of quality measures: min, quartile, mean, and max values for support, confidence, coverage, and lift.
Mining info: the data, support, and confidence we provided to the algorithm.
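Since pander is loaded, the quality measures can also be shown as a formatted table; a small sketch (quality() from arules returns the support/confidence/coverage/lift/count columns as a data frame):
pander(head(quality(association_rules)))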
Since there are 410 rules, let's inspect only the first 10:
inspect(association_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 0.001525165 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
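The assignment asks for the top 10 rules by lift; a minimal sketch using sort() from arules on the lift quality measure:
top10_by_lift <- head(sort(association_rules, by = "lift"), 10)
inspect(top10_by_lift)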
The inspect() output shows lhs, rhs, support, confidence, coverage, lift, and count. Briefly, what these terms mean: support is the fraction of all transactions containing both the lhs and the rhs; confidence is the probability of seeing the rhs in a basket given that the basket contains the lhs; coverage is the support of the lhs alone; lift is the confidence divided by the support of the rhs, so values well above 1 mean the lhs makes the rhs much more likely; count is the number of transactions containing the rule.
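As a worked check, take rule [10] {rice,sugar} => {whole milk}: summary() earlier reported whole milk in 2513 of the 9835 baskets, and lift is confidence divided by the support of the rhs:
2513 / 9835 # support(whole milk), about 0.2555
1.0 / (2513 / 9835) # lift = confidence / support(rhs), about 3.9136, matching the table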
Using the above output, we can make observations such as: every customer who bought rice and sugar also bought whole milk (rule [10], confidence 1.0), and customers who bought liquor and red/blush wine were very likely to also buy bottled beer (rule [1], lift 11.2).
We can remove redundant rules, i.e., rules that contain another, more general rule as a subset. Use the code below to find such rules:
# indices of rules that have another rule as a subset (redundant rules)
subset_rules <- which(colSums(is.subset(association_rules, association_rules)) > 1)
length(subset_rules)
## [1] 91
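A sketch of the follow-up step, dropping the flagged rules by index (this assumes we simply keep the remainder of association_rules for reporting):
association_rules_pruned <- association_rules[-subset_rules] # drop the 91 redundant rules
length(association_rules_pruned) # 410 - 91 = 319 rules remain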