The Groceries Data Set contains a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.
library(arules)
library(arulesViz)
library(caret)
library(caTools)
library(cluster)
library(corrplot)
library(plyr)   # loaded before dplyr so that plyr does not mask dplyr verbs such as mutate and summarise
library(dplyr)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(knitr)
library(lattice)
library(lubridate)
library(mice)
library(RColorBrewer)
library(reshape2)
library(readxl)
library(tidyr)
library(tidyverse)
library(utils)
We start by reading the CSV file containing the grocery transactions. We use the \(read.transactions\) function from the \(arules\) package, which reads basket-format data (one transaction per line, items separated by commas) into a transactions object.
tr = read.transactions("GroceryDataSet.csv", format = 'basket', sep=',')
summary(tr)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
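The figures in this summary are mutually consistent, which allows a quick sanity check. The following is a minimal base R sketch that uses only numbers copied from the output above; the variable names are illustrative.

```r
# Figures copied from the summary(tr) output
n_transactions <- 9835
n_items        <- 169

# Counts from the "most frequent items" table, including the "(Other)" bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Total number of purchased items, i.e. non-zero cells in the sparse matrix
total_items <- sum(item_counts)

# Density is the share of non-zero cells in the transaction-by-item matrix
density_check <- total_items / (n_transactions * n_items)

# Mean basket size is the number of items per transaction
mean_size_check <- total_items / n_transactions

print(total_items)
print(density_check)
print(mean_size_check)
```

The two derived values should agree with the reported density (0.02609146) and mean transaction length (4.409) up to rounding.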
The most common items can be shown using both absolute and relative frequency plots.
itemFrequencyPlot(tr, topN = 20, type = "absolute",
col = brewer.pal(8,'Pastel2'), main = "Absolute Item Frequency Plot")
itemFrequencyPlot(tr, topN = 20, type = "relative",
col = brewer.pal(8,'Pastel2'), main = "Relative Item Frequency Plot")
In the context of association rule mining, the following terms have specific definitions.
\(Support:\) This is a measure of how frequently an itemset appears in the dataset. For example, an itemset that appears in half of all transactions has \(Support\) = 0.5. This can also be thought of as how popular an itemset is.
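\(Confidence:\) For a rule \(X \Rightarrow Y\), this is the proportion of transactions containing \(X\) that also contain \(Y\): \(\text{Confidence}(X \Rightarrow Y) = \text{Support}(X \cup Y) / \text{Support}(X)\). It indicates how often the rule has been found to be true.
\(Lift:\) For a pair of itemsets \(X\) and \(Y\), Lift equals the Support of the two itemsets together, divided by the product of their individual Supports: \(\text{Lift}(X \Rightarrow Y) = \text{Support}(X \cup Y) / (\text{Support}(X) \cdot \text{Support}(Y))\). For itemsets that occur independently of each other, the value of Lift would be 1. For itemsets that occur together more often than independence would predict, Lift would be greater than 1; values below 1 indicate that the itemsets occur together less often than expected.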
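To make the three definitions concrete, here is a small base R sketch on five made-up receipts. The data and the helper functions support, confidence and lift are hypothetical and written only for illustration; they are not part of \(arules\).

```r
# Five toy receipts (hypothetical data, not taken from the grocery file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "bread", "eggs")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support of both sides divided by support of lhs
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# Lift of lhs => rhs: joint support divided by the product of the two supports
lift <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) /
    (support(lhs, baskets) * support(rhs, baskets))
}

# Measures for the rule {butter} => {bread}
supp_rule <- support(c("butter", "bread"), baskets)
conf_rule <- confidence("butter", "bread", baskets)
lift_rule <- lift("butter", "bread", baskets)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy data butter appears in 2 of 5 baskets, always together with bread, and bread appears in 4 of 5 baskets, so the rule has support 0.4, confidence 1, and lift 0.4 / (0.4 × 0.8) = 1.25.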
The \(arules\) package is a widely used R package for mining association rules. We use its apriori function to generate rules as follows.
# Minimum support 0.002, minimum confidence 0.5, at most 10 items per rule.
association.rules = arules::apriori(tr, parameter=list(supp=0.002, conf=0.5, maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(association.rules)
## set of 1098 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 6 576 471 45
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.505 4.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.002034 Min. :0.5000 Min. :1.957 Min. : 20.00
## 1st Qu.:0.002237 1st Qu.:0.5263 1st Qu.:2.194 1st Qu.: 22.00
## Median :0.002644 Median :0.5676 Median :2.584 Median : 26.00
## Mean :0.003289 Mean :0.5845 Mean :2.668 Mean : 32.35
## 3rd Qu.:0.003559 3rd Qu.:0.6223 3rd Qu.:2.899 3rd Qu.: 35.00
## Max. :0.022267 Max. :0.8857 Max. :7.154 Max. :219.00
##
## mining info:
## data ntransactions support confidence
## tr 9835 0.002 0.5
Inspect the first 10 rules, in the order in which apriori returned them (they are not yet ranked by any quality measure):
inspect(association.rules[1:10])
## lhs rhs support
## [1] {cereals} => {whole milk} 0.003660397
## [2] {jam} => {whole milk} 0.002948653
## [3] {specialty cheese} => {other vegetables} 0.004270463
## [4] {rice} => {other vegetables} 0.003965430
## [5] {rice} => {whole milk} 0.004677173
## [6] {baking powder} => {whole milk} 0.009252669
## [7] {specialty cheese,yogurt} => {whole milk} 0.002033554
## [8] {specialty cheese,whole milk} => {yogurt} 0.002033554
## [9] {other vegetables,specialty cheese} => {whole milk} 0.002236909
## [10] {specialty cheese,whole milk} => {other vegetables} 0.002236909
## confidence lift count
## [1] 0.6428571 2.515917 36
## [2] 0.5471698 2.141431 29
## [3] 0.5000000 2.584078 42
## [4] 0.5200000 2.687441 39
## [5] 0.6133333 2.400371 46
## [6] 0.5229885 2.046793 91
## [7] 0.7142857 2.795464 20
## [8] 0.5405405 3.874793 20
## [9] 0.5238095 2.050007 22
## [10] 0.5945946 3.072957 22
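The reported measures can be checked by hand against the definitions given earlier. As a minimal sketch, the base R snippet below recomputes the support and lift of the first rule listed, {cereals} => {whole milk}, using only figures printed in this report; the variable names are illustrative.

```r
# Figures copied from the outputs above
n_transactions   <- 9835       # number of transactions, from summary(tr)
whole_milk_count <- 2513       # transactions containing whole milk, from summary(tr)
rule_count       <- 36         # count reported for {cereals} => {whole milk}
rule_confidence  <- 0.6428571  # confidence reported for the same rule

# Support of the rule: share of all transactions containing both sides
rule_support <- rule_count / n_transactions

# Support of the right-hand side on its own
rhs_support <- whole_milk_count / n_transactions

# Lift: confidence divided by the support of the right-hand side
rule_lift <- rule_confidence / rhs_support

print(rule_support)
print(rule_lift)
```

Both values should match the support (0.003660397) and lift (2.515917) reported for rule [1] up to rounding.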
Here we keep only the rules whose confidence exceeds 0.6 (60%), then select the 10 remaining rules with the highest \(lift\).
subRules = association.rules[quality(association.rules)$confidence > 0.6]
top10RulesByLift = head(subRules, n = 10, by = "lift")
inspect(top10RulesByLift)
## lhs rhs support confidence lift count
## [1] {beef,
## citrus fruit,
## other vegetables} => {root vegetables} 0.002135231 0.6363636 5.838280 21
## [2] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238 31
## [3] {citrus fruit,
## frozen vegetables,
## other vegetables} => {root vegetables} 0.002033554 0.6250000 5.734025 20
## [4] {beef,
## other vegetables,
## tropical fruit} => {root vegetables} 0.002745297 0.6136364 5.629770 27
## [5] {butter,
## other vegetables,
## tropical fruit,
## whole milk} => {yogurt} 0.002338587 0.6969697 4.996135 23
## [6] {citrus fruit,
## root vegetables,
## tropical fruit,
## whole milk} => {other vegetables} 0.003152008 0.8857143 4.577509 31
## [7] {other vegetables,
## rolls/buns,
## tropical fruit,
## whole milk} => {yogurt} 0.002541942 0.6250000 4.480230 25
## [8] {rolls/buns,
## tropical fruit,
## whipped/sour cream} => {yogurt} 0.002135231 0.6176471 4.427521 21
## [9] {curd,
## tropical fruit,
## whole milk} => {yogurt} 0.003965430 0.6093750 4.368224 39
## [10] {grapes,
## tropical fruit,
## whole milk} => {other vegetables} 0.002033554 0.8000000 4.134524 20
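The filter-then-rank step can be illustrated without \(arules\). The sketch below applies the same logic in base R to a plain data frame holding the confidence and lift of the ten rules printed earlier by inspect(association.rules[1:10]). It assumes that head with by = "lift" returns rules in decreasing order of lift; the data frame and its column names are illustrative.

```r
# Confidence and lift of the first ten rules, copied from the earlier output
rules_df <- data.frame(
  id = 1:10,
  confidence = c(0.6428571, 0.5471698, 0.5000000, 0.5200000, 0.6133333,
                 0.5229885, 0.7142857, 0.5405405, 0.5238095, 0.5945946),
  lift = c(2.515917, 2.141431, 2.584078, 2.687441, 2.400371,
           2.046793, 2.795464, 3.874793, 2.050007, 3.072957)
)

# Step 1: keep rules whose confidence is strictly greater than 0.6
confident <- rules_df[rules_df$confidence > 0.6, ]

# Step 2: sort the survivors by lift, highest first, and take up to 10
top_by_lift <- head(confident[order(confident$lift, decreasing = TRUE), ], n = 10)

print(top_by_lift)
```

Only rules 7, 1 and 5 pass the confidence filter in this small subset. Rule 8 has the highest lift of the ten but is dropped because its confidence is below 0.6, which shows that the filter changes which rules can appear in a top-by-lift list.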
plot(top10RulesByLift, main="Scatter plot for Top 10 rules by Lift")
plot(top10RulesByLift, method="paracoord")
plot(top10RulesByLift, engine = "plotly")  # replaces plotly_arules(), which is deprecated in recent arulesViz releases
plot(top10RulesByLift, method = "graph", engine = "htmlwidget")
In this section we perform a cluster analysis on the data using the K-Means algorithm. For this, we first convert the transactions into a matrix with one row per transaction and one column per item, then standardize each column (center and scale) and store the result as a data frame.
tr_data = as(tr, "matrix")
norm_data = as.data.frame(scale(tr_data))
dim(norm_data)
## [1] 9835 169
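To show what the standardization does to item-indicator columns, here is a minimal base R sketch on a hypothetical four-basket matrix. It assumes the converted matrix holds TRUE/FALSE indicators; the toy data and names are made up for illustration.

```r
# Toy indicator matrix: rows are baskets, columns are items (hypothetical data)
toy <- matrix(
  c(TRUE,  FALSE,
    TRUE,  TRUE,
    FALSE, FALSE,
    TRUE,  FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread"))
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)

print(toy_scaled)
print(col_means)
print(col_sds)
```

One consequence worth keeping in mind: an item that appears in few baskets receives a large positive value wherever it is present, so rare items can dominate the Euclidean distances that K-Means relies on.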
set.seed(1234)
kmfit = kmeans(norm_data, centers=5, nstart = 25)
str(kmfit)
## List of 9
## $ cluster : int [1:9835] 4 4 4 4 4 1 4 4 4 4 ...
## $ centers : num [1:5, 1:169] 0.1246 -0.0598 0.3498 -0.0418 0.6703 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## $ totss : num 1661946
## $ withinss : num [1:5] 823255 16544 9517 741302 16571
## $ tot.withinss: num 1607189
## $ betweenss : num 54757
## $ size : int [1:5] 2277 17 41 7477 23
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
summary(kmfit)
## Length Class Mode
## cluster 9835 -none- numeric
## centers 845 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
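The sums of squares reported by str(kmfit) give a rough indication of how well the five clusters separate the data. The base R sketch below uses only the rounded values printed above; the variable names are illustrative.

```r
# Values copied (as printed, rounded) from the str(kmfit) output
totss        <- 1661946
tot_withinss <- 1607189
betweenss    <- 54757
sizes        <- c(2277, 17, 41, 7477, 23)

# The total sum of squares splits into a within-cluster and a between-cluster part
decomposition_holds <- (tot_withinss + betweenss) == totss

# Share of the total sum of squares that lies between clusters
between_share <- betweenss / totss

# Share of transactions assigned to each cluster
size_share <- sizes / sum(sizes)

print(decomposition_holds)
print(round(between_share, 4))
print(round(size_share, 4))
```

On these figures only about 3.3% of the total sum of squares lies between clusters, and clusters 1 and 4 together hold more than 99% of the transactions, which suggests that this five-cluster solution separates the baskets only weakly.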
The following plot shows the clusters found by the K-Means algorithm. Since the data has 169 dimensions, the observations are projected onto the first two principal components for display.
#print(kmfit$centers)
factoextra::fviz_cluster(kmfit, data = norm_data)
norm_data %>%
mutate(Cluster = kmfit$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 5 x 170
## Cluster `abrasive clean… `artif. sweeten… `baby cosmetics` `baby food`
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.125 0.113 0.0464 -0.0101
## 2 2 -0.0598 0.976 -0.0247 5.82
## 3 3 0.350 -0.0571 -0.0247 -0.0101
## 4 4 -0.0418 -0.0360 -0.0139 -0.0101
## 5 5 0.670 -0.0571 -0.0247 -0.0101
## # … with 165 more variables: bags <dbl>, `baking powder` <dbl>, `bathroom
## # cleaner` <dbl>, beef <dbl>, berries <dbl>, beverages <dbl>, `bottled
## # beer` <dbl>, `bottled water` <dbl>, brandy <dbl>, `brown bread` <dbl>,
## # butter <dbl>, `butter milk` <dbl>, `cake bar` <dbl>, candles <dbl>,
## # candy <dbl>, `canned beer` <dbl>, `canned fish` <dbl>, `canned
## # fruit` <dbl>, `canned vegetables` <dbl>, `cat food` <dbl>,
## # cereals <dbl>, `chewing gum` <dbl>, chicken <dbl>, chocolate <dbl>,
## # `chocolate marshmallow` <dbl>, `citrus fruit` <dbl>, cleaner <dbl>,
## # `cling film/bags` <dbl>, `cocoa drinks` <dbl>, coffee <dbl>,
## # `condensed milk` <dbl>, `cooking chocolate` <dbl>, cookware <dbl>,
## # cream <dbl>, `cream cheese` <dbl>, curd <dbl>, `curd cheese` <dbl>,
## # decalcifier <dbl>, `dental care` <dbl>, dessert <dbl>,
## # detergent <dbl>, `dish cleaner` <dbl>, dishes <dbl>, `dog food` <dbl>,
## # `domestic eggs` <dbl>, `female sanitary products` <dbl>, `finished
## # products` <dbl>, fish <dbl>, flour <dbl>, `flower (seeds)` <dbl>,
## # `flower soil/fertilizer` <dbl>, frankfurter <dbl>, `frozen
## # chicken` <dbl>, `frozen dessert` <dbl>, `frozen fish` <dbl>, `frozen
## # fruits` <dbl>, `frozen meals` <dbl>, `frozen potato products` <dbl>,
## # `frozen vegetables` <dbl>, `fruit/vegetable juice` <dbl>,
## # grapes <dbl>, `hair spray` <dbl>, ham <dbl>, `hamburger meat` <dbl>,
## # `hard cheese` <dbl>, herbs <dbl>, honey <dbl>, `house keeping
## # products` <dbl>, `hygiene articles` <dbl>, `ice cream` <dbl>, `instant
## # coffee` <dbl>, `Instant food products` <dbl>, jam <dbl>,
## # ketchup <dbl>, `kitchen towels` <dbl>, `kitchen utensil` <dbl>, `light
## # bulbs` <dbl>, liqueur <dbl>, liquor <dbl>, `liquor (appetizer)` <dbl>,
## # `liver loaf` <dbl>, `long life bakery product` <dbl>, `make up
## # remover` <dbl>, `male cosmetics` <dbl>, margarine <dbl>,
## # mayonnaise <dbl>, meat <dbl>, `meat spreads` <dbl>, `misc.
## # beverages` <dbl>, mustard <dbl>, napkins <dbl>, newspapers <dbl>, `nut
## # snack` <dbl>, `nuts/prunes` <dbl>, oil <dbl>, onions <dbl>, `organic
## # products` <dbl>, `organic sausage` <dbl>, `other vegetables` <dbl>,
## # `packaged fruit/vegetables` <dbl>, …
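The projection behind the cluster plot can be approximated in base R. The sketch below is not the factoextra implementation; it runs K-Means and prcomp on hypothetical random data and extracts the first two principal component scores, which is the kind of two-dimensional view such a plot relies on. All names and data here are made up for illustration.

```r
set.seed(1234)

# Hypothetical toy data: 60 observations, 5 numeric features, two shifted groups
toy <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 5),
  matrix(rnorm(150, mean = 3), ncol = 5)
)
toy_scaled <- scale(toy)

# Cluster in the full 5-dimensional space
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Principal component analysis of the same data
pca <- prcomp(toy_scaled)

# Coordinates of each observation on the first two principal components
scores <- pca$x[, 1:2]

# Share of the total variance captured by those two components
var_share <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
print(var_share)

# Scatter plot of the projection, coloured by cluster label (interactive sessions only)
if (interactive()) {
  plot(scores, col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
}
```

The variance share is a useful companion to any such plot: when the first two components capture only a small part of the variance, clusters that look overlapping in the plot may still be separated in the full space, and the reverse can also happen.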