DATA624 - Homework 10
Overview
Market basket analysis is used in this project to identify sets of goods that are often bought together. Association rules are evaluated by their support, confidence, and lift, and the ten rules with the highest lift are reported in descending order of lift.
Load Libraries
library(arules)
library(arulesViz)
library(tidyverse)
library(cluster)
library(factoextra)
Load the Grocery Transactions
groceries <- read.transactions(
"GroceryDataSet.csv",
format = "basket",
sep = ","
)
summary(groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146
most frequent items:
      whole milk other vegetables       rolls/buns             soda
            2513             1903             1809             1715
          yogurt          (Other)
            1372            34055
element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
  17   18   19   20   21   22   23   24   26   27   28   29   32
  29   14   14    9   11    4    6    1    1    1    1    3    1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
labels
1 abrasive cleaner
2 artif. sweetener
3 baby cosmetics
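As a quick sanity check on the summary above, the reported density and mean basket size can be reproduced from the item counts alone. The sketch below uses only base R and copies its numbers from the summary output; the variable names are illustrative and not part of the original analysis.

```r
# Numbers copied from the summary(groceries) output above.
n_transactions <- 9835
n_items        <- 169

# Counts of the five most frequent items plus the "(Other)" bucket.
item_counts <- c(
  "whole milk"       = 2513,
  "other vegetables" = 1903,
  "rolls/buns"       = 1809,
  "soda"             = 1715,
  "yogurt"           = 1372,
  "(Other)"          = 34055
)

# Total number of item occurrences across all baskets.
total_items <- sum(item_counts)

# Density: share of filled cells in the transaction-by-item matrix.
density <- total_items / (n_transactions * n_items)

# Mean basket size: item occurrences per transaction.
mean_basket_size <- total_items / n_transactions

print(round(density, 8))
print(round(mean_basket_size, 3))
```

Both results agree with the density (0.02609146) and the mean basket size (4.409) reported by summary(groceries).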
Inspect the First Five Transactions
inspect(groceries[1:5])
    items
[1] {citrus fruit,
margarine,
ready soups,
semi-finished bread}
[2] {coffee,
tropical fruit,
yogurt}
[3] {whole milk}
[4] {cream cheese,
meat spreads,
pip fruit,
yogurt}
[5] {condensed milk,
long life bakery product,
other vegetables,
whole milk}
Item Frequency Plot
itemFrequencyPlot(
groceries,
topN = 20,
type = "absolute",
main = "Top 20 Most Frequently Purchased Items"
)
Generate Association Rules
rules <- apriori(
groceries,
parameter = list(
supp = 0.001,
conf = 0.3,
minlen = 2
)
)
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.3 0.1 1 none FALSE TRUE 5 0.001 2
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 9
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.00s].
writing ... [13770 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
summary(rules)
set of 13770 rules
rule length distribution (lhs + rhs):
sizes
   2    3    4    5    6
 239 5000 6942 1530   59
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   3.000   4.000   3.722   4.000   6.000
summary of quality measures:
    support            confidence        coverage             lift
 Min.   :0.001017   Min.   :0.3000   Min.   :0.001017   Min.   : 1.174
 1st Qu.:0.001118   1st Qu.:0.3654   1st Qu.:0.002237   1st Qu.: 2.206
 Median :0.001322   Median :0.4545   Median :0.003152   Median : 2.840
 Mean   :0.001889   Mean   :0.4833   Mean   :0.004337   Mean   : 3.132
 3rd Qu.:0.001830   3rd Qu.:0.5758   3rd Qu.:0.004474   3rd Qu.: 3.670
 Max.   :0.074835   Max.   :1.0000   Max.   :0.193493   Max.   :35.716
     count
 Min.   : 10.00
 1st Qu.: 11.00
 Median : 13.00
 Mean   : 18.58
 3rd Qu.: 18.00
 Max.   :736.00
mining info:
data ntransactions support confidence
groceries 9835 0.001 0.3
call
apriori(data = groceries, parameter = list(supp = 0.001, conf = 0.3, minlen = 2))
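The support threshold is easier to interpret as a transaction count. The base R sketch below converts supp = 0.001 into counts using the 9,835 transactions reported above; the variable names are illustrative.

```r
# Values taken from the apriori() call and the transaction summary.
n_transactions <- 9835
min_support    <- 0.001

# Threshold expressed as a (fractional) number of transactions.
threshold <- min_support * n_transactions

# Smallest whole count whose support reaches the threshold.
min_count <- ceiling(threshold)

# Support of a rule that is seen exactly min_count times.
min_rule_support <- min_count / n_transactions

print(threshold)
print(min_count)
print(round(min_rule_support, 6))
```

The log reports an absolute minimum support count of 9, which is the whole-number part of 9.835, while the smallest count in the rule summary is 10, with support 10 / 9835, or about 0.001017. Rules this rare rest on very few baskets, so their lift values should be read with some caution.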
Top 10 Rules by Lift
rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
top10_rules <- rules_lift[1:10]
inspect(top10_rules)
     lhs                                rhs                support
[1]  {bottled beer, red/blush wine}  => {liquor}           0.001931876
[2]  {ham, white bread}              => {processed cheese} 0.001931876
[3]  {bottled beer, liquor}          => {red/blush wine}   0.001931876
[4]  {Instant food products, soda}   => {hamburger meat}   0.001220132
[5]  {curd, sugar}                   => {flour}            0.001118454
[6]  {baking powder, sugar}          => {flour}            0.001016777
[7]  {processed cheese, white bread} => {ham}              0.001931876
[8]  {popcorn, soda}                 => {salty snack}      0.001220132
[9]  {baking powder, flour}          => {sugar}            0.001016777
[10] {ham, processed cheese}         => {white bread}      0.001931876
     confidence coverage    lift     count
[1]  0.3958333  0.004880529 35.71579 19
[2]  0.3800000  0.005083884 22.92822 19
[3]  0.4130435  0.004677173 21.49356 19
[4]  0.6315789  0.001931876 18.99565 12
[5]  0.3235294  0.003457041 18.60767 11
[6]  0.3125000  0.003253686 17.97332 10
[7]  0.4634146  0.004168785 17.80345 19
[8]  0.6315789  0.001931876 16.69779 12
[9]  0.5555556  0.001830198 16.40807 10
[10] 0.6333333  0.003050330 15.04549 19
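The top association rules are ranked by lift, which measures how much more often the items in a rule are bought together than would be expected if they were purchased independently. A lift above 1 indicates a positive association, and the larger the value, the stronger the association; the top rule, {bottled beer, red/blush wine} => {liquor}, has a lift of about 35.7. These rules are also rare: each of the top ten appears in only 10 to 19 of the 9,835 transactions.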
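To make the three measures concrete, the following base R sketch computes support, confidence, and lift by hand for a single rule. The five baskets and the helper functions item_support() and rule_measures() are hypothetical and exist only for illustration; they are not the grocery data and not part of arules.

```r
# Hypothetical toy baskets (not the grocery data).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("bread", "butter", "jam")
)

# Share of baskets that contain every item in `items`.
item_support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence, and lift for the rule lhs => rhs.
rule_measures <- function(lhs, rhs) {
  supp_rule  <- item_support(c(lhs, rhs))       # P(lhs and rhs)
  confidence <- supp_rule / item_support(lhs)   # P(rhs | lhs)
  lift       <- confidence / item_support(rhs)  # P(rhs | lhs) / P(rhs)
  c(support = supp_rule, confidence = confidence, lift = lift)
}

m <- rule_measures("bread", "butter")
print(m)
```

In this toy data, bread and butter appear together in 3 of 5 baskets (support 0.6), butter is in 3 of the 4 baskets that contain bread (confidence 0.75), and butter is in 3 of 5 baskets overall, so the lift is 0.75 / 0.6 = 1.25.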
Association Rule Graph
plot(
rules_lift[1:10],
method = "graph",
engine = "htmlwidget"
)
The association rule graph shows the ten highest-lift rules as a network. Items that take part in the same rule are linked, so closely connected items are those that tend to be purchased together. Several groups of connected products stand out, including alcoholic beverages, snacks, baking ingredients, and processed foods.
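Lift measures the association between the two sides of a rule by comparing how often they actually occur together with how often they would occur together by chance if they were independent. A lift of 1 means no association, and values above 1 mean the items co-occur more often than chance alone would predict.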
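The quality measures in the top-10 table are tied together by simple identities, which can be checked for rule [1] in base R. The numbers below are copied from the table; the variable names are illustrative, and the item frequency of liquor is inferred from the rule rather than read from the data.

```r
# Values copied from rule [1]: {bottled beer, red/blush wine} => {liquor}
n_transactions <- 9835
support    <- 0.001931876
confidence <- 0.3958333
lift       <- 35.71579
count      <- 19

# support = count / number of transactions
support_from_count <- count / n_transactions

# coverage = support(lhs) = support / confidence
coverage <- support / confidence

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
rhs_count   <- rhs_support * n_transactions

print(signif(support_from_count, 7))
print(signif(coverage, 7))
print(round(rhs_count))
```

The recomputed support and coverage match the table, and the figures imply that liquor appears in roughly 109 of the 9,835 transactions, or about 1.1%. Because liquor follows the bottled beer and red/blush wine combination about 39.6% of the time, the rule's lift is close to 36.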
Extra Credit: Cluster Analysis
binary_matrix <- as(groceries, "matrix")
sample_data <- binary_matrix[1:500, ]
sample_data <- sample_data[, apply(sample_data, 2, var) > 0]
set.seed(123)
kmeans_result <- kmeans(sample_data, centers = 3, nstart = 25)
fviz_cluster(
kmeans_result,
data = sample_data,
labelsize = 0
)
A rudimentary k-means clustering analysis was run on a subset of the data, namely the first 500 grocery transactions. After items with zero variance in this subset were dropped, the transactions were clustered into three groups according to their purchasing behavior.
The plot shows that most transactions sit close to the origin, while a few lie far from the main clusters, which suggests that those transactions reflect more unusual purchasing behavior. Because the first two principal components explain only a small share of the variance in the data, the clusters overlap considerably in this two-dimensional view, as is typical of sparse transaction data such as grocery baskets.
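One way to see why a two-dimensional view of sparse basket data overlaps so much is to measure how much variance the first two principal components capture. The sketch below does this in base R on simulated 0/1 data. The matrix toy, its dimensions, and the 5% purchase probability are assumptions made for illustration, and the computation is not necessarily the one performed inside fviz_cluster().

```r
set.seed(123)

# Hypothetical sparse 0/1 data: 200 baskets x 40 items, each item bought
# with probability 0.05 (a stand-in, not the grocery data).
n_baskets <- 200
n_items   <- 40
toy <- matrix(rbinom(n_baskets * n_items, size = 1, prob = 0.05),
              nrow = n_baskets, ncol = n_items)

# Drop items that never vary, as in the analysis above.
toy <- toy[, apply(toy, 2, var) > 0, drop = FALSE]

# Principal components of the centred (unscaled) data.
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Share of total variance captured by the first two components.
var_share  <- pca$sdev^2 / sum(pca$sdev^2)
pc12_share <- sum(var_share[1:2])

# k-means with three centres, mirroring the settings above.
km <- kmeans(toy, centers = 3, nstart = 25)

print(round(pc12_share, 3))
print(table(km$cluster))
```

When many items each contribute a little independent variance, no pair of components can summarize the data well, so cluster boundaries drawn in the first two components tend to look blurred even if the clusters differ in the full item space.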
In summary, the clustering analysis offers a rudimentary grouping of consumer purchasing behavior.
Conclusion
Association rule mining was used in this study to extract patterns from grocery purchases. Rules were ranked by lift because lift highlights item sets that occur together more often than would be expected by chance. The cluster analysis added a rough grouping of transactions, although the clusters overlap substantially.