library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
groceries <- read.transactions("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/GroceryDataSet.csv", sep = ",")
Trying to work with the arules packages, I found that I needed to upload my data as transactions rather than a data set for the functions within the package to work, such as itemFrequencyPlot, to quickly see which were the most commonly purchased items
itemFrequencyPlot(groceries, topN = 20, main = 'Top items by Relative Frequency')
I’m choosing .002 and .5 to be my parameters in generating the rules. At our value of support, the order must occur once in every five hundred orders, so we do not waste time on combinations that do not happen frequently enough. My confidence threshold for the rule is .5, a naive starting point since one has two options, to buy something or not.
grocery_rules <- apriori(data = groceries,
parameter = list(support = 0.002,
confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(grocery_rules, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {butter,
## hard cheese} => {whipped/sour cream} 0.002033554 0.5128205 0.003965430 7.154028 20
## [2] {beef,
## citrus fruit,
## other vegetables} => {root vegetables} 0.002135231 0.6363636 0.003355363 5.838280 21
## [3] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [4] {citrus fruit,
## frozen vegetables,
## other vegetables} => {root vegetables} 0.002033554 0.6250000 0.003253686 5.734025 20
## [5] {beef,
## other vegetables,
## tropical fruit} => {root vegetables} 0.002745297 0.6136364 0.004473818 5.629770 27
## [6] {bottled water,
## root vegetables,
## yogurt} => {tropical fruit} 0.002236909 0.5789474 0.003863752 5.517391 22
## [7] {herbs,
## other vegetables,
## whole milk} => {root vegetables} 0.002440264 0.6000000 0.004067107 5.504664 24
## [8] {grapes,
## pip fruit} => {tropical fruit} 0.002135231 0.5675676 0.003762074 5.408941 21
## [9] {herbs,
## yogurt} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
## [10] {beef,
## other vegetables,
## soda} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
plot(grocery_rules, method = "graph",
measure = "confidence", shading = "lift")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
top_ten <- head(grocery_rules, n = 10, by = "lift")
plot(top_ten, method = "grouped", control = list(k = 10))
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
For cluster analysis, the produced rules need to be in the form of a data frame. Lift and count are the variables associated with the rules that are of interest to us, but their scales are different and need to be standardized.
grocery_rules_df <- as(grocery_rules, "data.frame")
head(grocery_rules_df)
## rules support confidence coverage
## 1 {cereals} => {whole milk} 0.003660397 0.6428571 0.005693950
## 2 {jam} => {whole milk} 0.002948653 0.5471698 0.005388917
## 3 {specialty cheese} => {other vegetables} 0.004270463 0.5000000 0.008540925
## 4 {rice} => {other vegetables} 0.003965430 0.5200000 0.007625826
## 5 {rice} => {whole milk} 0.004677173 0.6133333 0.007625826
## 6 {baking powder} => {whole milk} 0.009252669 0.5229885 0.017691917
## lift count
## 1 2.515917 36
## 2 2.141431 29
## 3 2.584078 42
## 4 2.687441 39
## 5 2.400371 46
## 6 2.046793 91
grocery_rules_scaled <- grocery_rules_df |>
select(lift, count) |>
scale()
head(grocery_rules_scaled)
## lift count
## 1 -0.23329033 0.2045431
## 2 -0.80616474 -0.1874086
## 3 -0.12902110 0.5405018
## 4 0.02909958 0.3725225
## 5 -0.41004763 0.7644742
## 6 -0.95093679 3.2841641
The next step would be to optimize the number of clusters according to our rules. We can compare the results of the elbow and silhouette methods.
fviz_nbclust(grocery_rules_scaled, kmeans, method = "wss")
Looking at this output, it seems that 3 is the optimal number of
clusters since it has the most dramatic shift in slope.
fviz_nbclust(grocery_rules_scaled, kmeans, method = "silhouette")
We have agreement between the two methods, so I am more confident that the optimal model uses three clusters.
set.seed(9682)
k_3 <- kmeans(grocery_rules_scaled, centers = 3, nstart = 25)
k_3$size
## [1] 859 101 138
fviz_cluster(k_3, data = grocery_rules_scaled, repel = TRUE)
With the amount of rules (1098) contained in our data set, it seems that the graph might be too noisy to interpret in this form. But we can see extremes contained in the green and blue clusters, with red being the most central and largest of the groups.