Imagine 10000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
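For reference, the three measures named in the assignment can be computed by hand. The following is a minimal base R sketch on made-up baskets (the data and the helper `support()` are hypothetical and are not taken from the Groceries file or from arules).

```r
# Toy baskets (hypothetical data, not taken from the Groceries file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"
rhs <- "milk"

# Rule {bread} => {milk}
supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with both sides
conf_rule <- supp_rule / support(lhs, baskets)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs, baskets)  # confidence relative to the rhs base rate

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift above 1 means the right-hand side is more likely when the left-hand side is present than it is overall; a lift below 1, as in this toy rule, means it is less likely.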
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
# get the grocery dataset as a transactional dataset
gr_transaction<-read.transactions("F:\\CUNY masters\\Data 624\\HW10\\GroceryDataSet.csv",format="basket",sep=",")
str(gr_transaction)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 1 variable:
## .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
#inspect(gr_transaction)
# Convert to a data frame for tabular view
#trans_df <- as(gr_transaction, "data.frame")
#head(trans_df)
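The file path above is specific to one machine. As a possible alternative for readers without the CSV, the arules package ships a built-in `Groceries` data set with the same dimensions as the `str()` output above (9835 transactions, 169 items). Whether it is identical to the attached CSV is an assumption, not something checked here.

```r
library(arules)

# Built-in data set bundled with arules
# (assumed, not verified, to match the attached CSV)
data("Groceries")

# Rows are transactions, columns are items
dim(Groceries)
```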
# plot the 10 most frequently purchased items
itemFrequencyPlot(gr_transaction,topN=10,type="absolute")
The frequency plot above shows that the 10 most frequently purchased items are whole milk, other vegetables, rolls/buns, soda, yogurt, bottled water, root vegetables, tropical fruit, shopping bags, and sausage.
Apply the Apriori algorithm to get the association rules, using a minimum support of 0.001 and a minimum confidence of 0.8:
# get association rules
rules<-apriori(gr_transaction,parameter = list(supp=0.001,conf=0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 410 rules
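The log line "Absolute minimum support count: 9" can be related to the `supp=0.001` setting with a little arithmetic. The sketch below assumes the reported value is the truncated product of the support threshold and the number of transactions; that reading is inferred from the numbers in the log, not confirmed from the arules source.

```r
n_trans  <- 9835    # transactions, from the apriori log
min_supp <- 0.001   # supp parameter passed to apriori()

# Relative threshold expressed as a number of transactions
raw_count <- min_supp * n_trans

# Truncating gives the value printed in the log (assumed explanation)
reported_count <- as.integer(raw_count)

# Counts are whole numbers, so this is the smallest count
# whose relative support is at least 0.001
smallest_passing_count <- ceiling(raw_count)

c(raw = raw_count, reported = reported_count, passing = smallest_passing_count)
```

This is consistent with the rules listed further down, where the smallest count is 10 (support 0.001016777).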
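Remove redundant rules. A rule is redundant when a more general rule (same right-hand side, fewer items on the left-hand side) has the same or higher confidence. This reduces the set from 410 to 392 rules: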
# remove redundant rules
rules <- rules[!is.redundant(rules)]
rules
## set of 392 rules
Get the top 10 rules sorted by lift:
# rules sorted by lift
sorted_rules<-sort(rules,by="lift",decreasing=TRUE)
# get top 10 rules
inspect(sorted_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {oil,
## other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [4] {citrus fruit,
## fruit/vegetable juice,
## grapes} => {tropical fruit} 0.001118454 0.8461538 0.001321810 8.063879 11
## [5] {other vegetables,
## rice,
## whole milk,
## yogurt} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [6] {oil,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [7] {ham,
## other vegetables,
## pip fruit,
## yogurt} => {tropical fruit} 0.001016777 0.8333333 0.001220132 7.941699 10
## [8] {beef,
## citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.001016777 0.8333333 0.001220132 7.645367 10
## [9] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [10] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
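The columns in the table above are tied together by simple ratios. As a sanity check, the sketch below recomputes them for rule [1] using only values copied from the output; the variable names are illustrative.

```r
# Values copied from the inspect() output for rule [1]:
# {liquor, red/blush wine} => {bottled beer}
n_trans    <- 9835       # number of transactions
count      <- 19         # baskets containing all three items
confidence <- 0.9047619
lift       <- 11.235269

# support = count / number of transactions
support <- count / n_trans

# coverage = support of the left-hand side alone
coverage  <- support / confidence
lhs_count <- round(coverage * n_trans)  # baskets with liquor and red/blush wine

# lift = confidence / support(rhs), so the base rate of bottled beer is implied
rhs_support <- confidence / lift

c(support = support, coverage = coverage,
  lhs_count = lhs_count, rhs_support = rhs_support)
```

Read this way, the rule says that 19 of the 21 baskets containing both liquor and red/blush wine also contain bottled beer, against a base rate of roughly 8% of all baskets.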
top10<-sorted_rules[1:10]
plot(top10, method="graph")
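The plot above shows the top 10 rules. The rule {liquor, red/blush wine} => {bottled beer} shares no items with the other nine rules, so it is drawn apart from the dense cluster formed by the produce and dairy rules. It is also the strongest rule in the set, with the highest lift (about 11.24).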