Homework 10

Daniel DeBonis

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of what went into a customer’s basket, hence ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
groceries <- read.transactions("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/GroceryDataSet.csv", sep = ",")

Working with the arules package, I found that I needed to read my data in as transactions rather than as a regular data set for the functions within the package, such as itemFrequencyPlot(), to work. That plot gives a quick view of the most commonly purchased items.

itemFrequencyPlot(groceries, topN = 20, main = 'Top items by Relative Frequency') 
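
Before mining, it is worth confirming the shape of the transactions object. The sketch below (not part of my original output) uses standard arules accessors to report the number of transactions and items and the most frequent items in tabular form.

summary(groceries)   # 9835 transactions, 169 items, density, most frequent items
dim(groceries)       # transactions x items
head(sort(itemFrequency(groceries, type = "relative"), decreasing = TRUE), 5)  # top 5 item shares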

I’m choosing 0.002 and 0.5 as my parameters for generating the rules. At this support level, an itemset must appear in at least one of every five hundred transactions, so we do not waste time on combinations that occur too rarely. My confidence threshold is 0.5, a naive starting point that treats each purchase as a binary choice: buy the item or not.
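
As a sanity check on that support threshold, it can be translated into an absolute transaction count; the value should match the absolute minimum support count that apriori reports below. A one-line sketch:

floor(0.002 * length(groceries))   # 0.002 support over 9835 transactions ~ 19 transactions minimum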

grocery_rules <- apriori(data = groceries,  
                        parameter = list(support = 0.002,  
                                         confidence = 0.5)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(sort(grocery_rules, by = 'lift')[1:10]) 
##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {butter,                                                                                      
##       hard cheese}        => {whipped/sour cream} 0.002033554  0.5128205 0.003965430 7.154028    20
## [2]  {beef,                                                                                        
##       citrus fruit,                                                                                
##       other vegetables}   => {root vegetables}    0.002135231  0.6363636 0.003355363 5.838280    21
## [3]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit,                                                                              
##       whole milk}         => {root vegetables}    0.003152008  0.6326531 0.004982206 5.804238    31
## [4]  {citrus fruit,                                                                                
##       frozen vegetables,                                                                           
##       other vegetables}   => {root vegetables}    0.002033554  0.6250000 0.003253686 5.734025    20
## [5]  {beef,                                                                                        
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.002745297  0.6136364 0.004473818 5.629770    27
## [6]  {bottled water,                                                                               
##       root vegetables,                                                                             
##       yogurt}             => {tropical fruit}     0.002236909  0.5789474 0.003863752 5.517391    22
## [7]  {herbs,                                                                                       
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.002440264  0.6000000 0.004067107 5.504664    24
## [8]  {grapes,                                                                                      
##       pip fruit}          => {tropical fruit}     0.002135231  0.5675676 0.003762074 5.408941    21
## [9]  {herbs,                                                                                       
##       yogurt}             => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
## [10] {beef,                                                                                        
##       other vegetables,                                                                            
##       soda}               => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
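
All ten of these rules have lift above 5. If I wanted to keep working with only the strongest rules, or with rules predicting a particular item, arules supports subsetting by quality measure or by item; a brief sketch (output not shown):

high_lift_rules <- subset(grocery_rules, subset = lift > 5)                     # strongest associations only
root_veg_rules  <- subset(grocery_rules, subset = rhs %in% "root vegetables")   # rules predicting root vegetables
inspect(head(sort(root_veg_rules, by = "confidence"), 5))                       # their 5 most confident rules
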
plot(grocery_rules, method = "graph",  
     measure = "confidence", shading = "lift") 
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

top_ten <- head(grocery_rules, n = 10, by = "lift")
plot(top_ten, method = "grouped", control = list(k = 10))

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

For cluster analysis, the generated rules need to be converted to a data frame. Lift and count are the rule measures of interest to us, but they are on different scales and need to be standardized.

grocery_rules_df <- as(grocery_rules, "data.frame")
head(grocery_rules_df)
##                                      rules     support confidence    coverage
## 1                {cereals} => {whole milk} 0.003660397  0.6428571 0.005693950
## 2                    {jam} => {whole milk} 0.002948653  0.5471698 0.005388917
## 3 {specialty cheese} => {other vegetables} 0.004270463  0.5000000 0.008540925
## 4             {rice} => {other vegetables} 0.003965430  0.5200000 0.007625826
## 5                   {rice} => {whole milk} 0.004677173  0.6133333 0.007625826
## 6          {baking powder} => {whole milk} 0.009252669  0.5229885 0.017691917
##       lift count
## 1 2.515917    36
## 2 2.141431    29
## 3 2.584078    42
## 4 2.687441    39
## 5 2.400371    46
## 6 2.046793    91
grocery_rules_scaled <- grocery_rules_df |>
  select(lift, count) |>
  scale()

head(grocery_rules_scaled)
##          lift      count
## 1 -0.23329033  0.2045431
## 2 -0.80616474 -0.1874086
## 3 -0.12902110  0.5405018
## 4  0.02909958  0.3725225
## 5 -0.41004763  0.7644742
## 6 -0.95093679  3.2841641
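
As a quick check that the standardization behaved as expected (a small sketch, output not shown), each scaled column should now have mean zero and standard deviation one:

round(colMeans(grocery_rules_scaled), 10)   # ~0 for both lift and count
apply(grocery_rules_scaled, 2, sd)          # 1 for both columns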

The next step is to choose the optimal number of clusters for our rules. We can compare the results of the elbow and silhouette methods.

fviz_nbclust(grocery_rules_scaled, kmeans, method = "wss")

Looking at this output, 3 appears to be the optimal number of clusters, since that is where the within-cluster sum of squares curve shows the most dramatic change in slope.

fviz_nbclust(grocery_rules_scaled, kmeans, method = "silhouette")

We have agreement between the two methods, so I am more confident that the optimal model uses three clusters.
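
As a cross-check on what fviz_nbclust computes, average silhouette widths can also be calculated directly with the cluster package. This is only a sketch that duplicates the plot above; results vary slightly across runs because of the random starts.

library(cluster)  # for silhouette()
d <- dist(grocery_rules_scaled)
avg_sil <- function(k) {
  km <- kmeans(grocery_rules_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])  # mean silhouette width for this k
}
sapply(2:5, avg_sil)  # average silhouette width for k = 2 through 5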

set.seed(9682)
k_3 <- kmeans(grocery_rules_scaled, centers = 3, nstart = 25)
k_3$size
## [1] 859 101 138
fviz_cluster(k_3, data = grocery_rules_scaled, repel = TRUE)

With the number of rules (1098) contained in our data set, it seems the graph is too noisy to interpret in this form. But we can see extremes contained in the green and blue clusters, with red being the most central and largest of the groups.
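
Since the scatter of all 1098 rules is hard to read, a numeric summary of each cluster may be easier to interpret. The sketch below (output not shown) attaches the cluster assignments back to the rules data frame and compares the average lift and count per cluster, which makes it easier to see whether the smaller clusters correspond to high-lift or high-count rules.

grocery_rules_df$cluster <- factor(k_3$cluster)  # label each rule with its cluster
grocery_rules_df |>
  group_by(cluster) |>
  summarise(rules = n(),
            mean_lift = mean(lift),
            mean_count = mean(count))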