1 Introduction

Source: Kaggle Groceries Dataset

Consumer purchasing behavior plays a critical role in the success of retail businesses, as understanding how customers shop can support higher profitability and sustained competitive performance. Data mining methods, particularly the Apriori algorithm, provide an effective approach for uncovering relationships within large transactional datasets. In this study, the Apriori algorithm is applied to the Groceries dataset, which consists of customer purchase records collected from a grocery store.

The analysis focuses on identifying meaningful patterns and associations among products that are frequently bought together. These insights can assist retailers in improving store layout, arranging products strategically, and designing targeted marketing campaigns to enhance sales performance and customer satisfaction. Additionally, this study highlights the significant influence of product placement on purchasing decisions within supermarket environments.

This report presents the analytical process, key observations, and results obtained from applying the Apriori algorithm in R using the arules package to the Groceries dataset.

2 Setup and Library Loading

# Load required libraries
library(tidyverse)
library(lubridate)
library(magrittr)
library(arules)
library(Matrix)
library(arulesViz)
library(knitr)
library(kableExtra)

3 Dataset and Exploratory Data Analysis

This study utilizes the Groceries dataset, which consists of 38,765 records. Each record represents a single item purchased by a member on a given date, so one shopping trip can span several records.

# Load the dataset
groceries <- read.csv("Groceries_dataset.csv")

# Display first few rows
head(groceries) %>%
  kable(caption = "First 6 rows of Groceries Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

First 6 rows of Groceries Dataset

Member_number  Date        itemDescription
1808           21-07-2015  tropical fruit
2552           05-01-2015  whole milk
2300           19-09-2015  pip fruit
1187           12-12-2015  other vegetables
3037           01-02-2015  whole milk
4941           14-02-2015  rolls/buns

3.1 Dataset Dimensions

# Check dimensions
dim(groceries)
## [1] 38765     3
cat("The dataset contains", nrow(groceries), "rows and", ncol(groceries), "columns.\n")
## The dataset contains 38765 rows and 3 columns.

3.2 Missing Values Check

# Checking missing values
missing_values <- colSums(is.na(groceries))
missing_values %>%
  as.data.frame() %>%
  setNames("Missing Values") %>%
  kable(caption = "Missing Values per Column") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Missing Values per Column

Column           Missing Values
Member_number    0
Date             0
itemDescription  0

No column contains missing values: colSums(is.na(groceries)) returns 0 for each of the three columns.

3.3 Dataset Structure

str(groceries)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : chr  "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
##  $ itemDescription: chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...

4 Data Pre-processing

# Converting datetime format
groceries$Date <- as.Date(groceries$Date, format = "%d-%m-%Y")

# Extracting year, month, day, and weekday (%w: 0 = Sunday, ..., 6 = Saturday)
# Date is already of class Date, so no second as.Date() call is needed
groceries$year <- format(groceries$Date, "%Y")
groceries$month <- format(groceries$Date, "%m")
groceries$day <- format(groceries$Date, "%d")
groceries$weekday <- format(groceries$Date, "%w")

# Rearranging the columns
groceries <- groceries[c("Member_number", "Date", "year", "month", "day", "weekday", "itemDescription")]

# Display processed data
head(groceries) %>%
  kable(caption = "Processed Groceries Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Processed Groceries Dataset

Member_number  Date        year  month  day  weekday  itemDescription
1808           2015-07-21  2015  07     21   2        tropical fruit
2552           2015-01-05  2015  01     05   1        whole milk
2300           2015-09-19  2015  09     19   6        pip fruit
1187           2015-12-12  2015  12     12   6        other vegetables
3037           2015-02-01  2015  02     01   0        whole milk
4941           2015-02-14  2015  02     14   6        rolls/buns
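
The weekday column relies on the %w conversion of format(), which returns the day of the week as a number from 0 (Sunday) to 6 (Saturday). A quick check against two dates from the table above, using only base R (the names d, weekday_code, day_names, and weekday_name are illustrative and not part of the original analysis):

```r
# Parse two dates from the table using the same day-month-year format
d <- as.Date(c("21-07-2015", "01-02-2015"), format = "%d-%m-%Y")

# %w gives the weekday as a number: 0 = Sunday, ..., 6 = Saturday
weekday_code <- format(d, "%w")

# Map the codes to names without depending on the system locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday_name <- day_names[as.integer(weekday_code) + 1]

print(data.frame(date = d, weekday_code, weekday_name))
```

The codes agree with the weekday column shown for those two rows (2 and 0), which is also the mapping used for the axis labels in the weekday heatmap.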

5 Sales Analysis by Year

5.1 Monthly Sales Comparison (2014 vs 2015)

# Filtering data by year 2014 and 2015
df1 <- groceries %>% filter(year == "2014")
df2 <- groceries %>% filter(year == "2015")

# Counting the number of items purchased per month in 2014 and 2015
sales_2014 <- df1 %>% group_by(month) %>% summarize(count = n())
sales_2015 <- df2 %>% group_by(month) %>% summarize(count = n())

# Adding a year column to the data frames
sales_2014$year <- 2014
sales_2015$year <- 2015

# Combining both data frames
sales_combined <- rbind(sales_2014, sales_2015)

# Plotting the data
ggplot(sales_combined, aes(x = month, y = count, fill = factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    x = "Months",
    y = "Number of products",
    title = "Monthly Products Sold in Years 2014 and 2015",
    fill = "Year"
  ) +
  scale_fill_manual(values = c("2014" = "cornflowerblue", "2015" = "bisque3")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_text(angle = 0, hjust = 0.5)
  )

5.1.1 Key Findings

  • Sales Growth: Every month of 2015 recorded more items sold than the same month of 2014, indicating year-over-year growth.
  • Weak Performance: The weakest month of each year was March in 2014 and December in 2015.
  • Peak Sales: The highest monthly total occurred in August 2015, with approximately 1,920 items sold.

6 Weekday Purchase Analysis

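# Create a temporary data frame with a qty_purchased column:
# the total number of item rows recorded for each member.
# add_count() computes this in one grouped pass instead of comparing
# every row against every other row.
temp <- groceries %>%
  add_count(Member_number, name = "qty_purchased")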

# Keep only the first 5000 rows (the weekday counts below describe this subset, not the full dataset)
temp1 <- temp[1:5000,]

# Converting weekday variable to category
temp1$weekday <- as.factor(temp1$weekday)

# Creating a new data frame which has the frequency of weekdays
weekday_bin <- data.frame(table(temp1$weekday))
colnames(weekday_bin) <- c("weekday", "count")

# Creating a heatmap
heatmap <- ggplot(weekday_bin, aes(x = weekday, y = 1, fill = count)) +
  geom_tile() +
  labs(title = "Number of Purchases Across Weekdays") +
  scale_fill_gradient(low = "#FFFFFF", high = "cornflowerblue") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(
    expand = c(0, 0),
    labels = c("0" = "Sunday", "1" = "Monday", "2" = "Tuesday",
               "3" = "Wednesday", "4" = "Thursday", "5" = "Friday", "6" = "Saturday")
  ) +
  scale_y_continuous(name = "") +
  geom_text(aes(label = count), vjust = 0.5, size = 5)

print(heatmap)

6.0.1 Key Findings

  • Peak Shopping Day: Sunday had the highest number of purchases (750) among the first 5,000 records.
  • Midweek Shopping: Wednesday follows with 749 purchases, only one fewer than Sunday, so the data do not show a clear preference for weekend shopping.
  • Lowest Activity: Monday has the fewest purchases (692).
  • Overall Pattern: The distribution is relatively balanced across weekdays; the busiest and quietest days differ by only 58 purchases.

7 Customer Analysis

7.1 Top Customers by Purchase Frequency

# Getting the top 500 rows by quantity purchased
# (temp has one row per item, so each member appears once per item row)
top_customers <- temp %>%
  select(Member_number, qty_purchased, year) %>%
  arrange(desc(qty_purchased)) %>%
  head(500)

# Converting the datatype of id and year
top_customers$Member_number <- as.factor(top_customers$Member_number)
top_customers$year <- as.factor(top_customers$year)

# Plotting with ggplot2
ggplot(top_customers, aes(x = qty_purchased, y = Member_number, fill = year)) +
  geom_bar(stat = "identity", color = "black") +
  ggtitle("Top Customers by Purchase Frequency") +
  xlab("Quantity Purchased") +
  ylab("Customer ID") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "bottom"
  ) +
  scale_fill_manual(values = c("2014" = "cornflowerblue", "2015" = "bisque3"))

7.1.1 Key Findings

  • Customer Concentration: The chart shows only the most active buyers. Across all customers, purchases are spread widely: the summary statistics in Section 7.2 give a median of 9 and a maximum of 36 items per customer, out of 38,765 records in total.
  • Top Overall Customer: Customer 3180 demonstrates the highest purchase frequency across both 2014 and 2015, making them the most active customer in the dataset.
  • Strong Repeat Buyers: Customers such as 3737, 3050, 2051, and 2625 consistently show high purchase volumes, suggesting loyal purchasing behavior.
  • Year-over-Year Increase: For most top customers, purchase quantities in 2015 exceed those in 2014, reflecting increased shopping activity in the later year.
  • Variability in Behavior: Several customers exhibit noticeable differences in purchasing frequency between the two years, indicating inconsistent or changing shopping patterns over time.

7.2 Purchase Distribution Analysis

# Aggregating and sorting the data
groceries_agg <- groceries %>%
  group_by(Member_number) %>%
  summarise(count = n()) %>%
  arrange(count)

# Descriptive statistics
summary_stats <- summary(groceries_agg$count)

summary_df <- data.frame(
  Statistic = names(summary_stats),
  Value = as.numeric(summary_stats)
)

summary_df %>%
  kable(caption = "Summary Statistics of Customer Purchases") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Summary Statistics of Customer Purchases

Statistic  Value
Min.       2.000000
1st Qu.    6.000000
Median     9.000000
Mean       9.944843
3rd Qu.    13.000000
Max.       36.000000

# Creating a histogram with a density plot
ggplot(groceries_agg, aes(x = count)) +
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 1,
    colour = "black",
    fill = "#FFFFFF"
  ) +
  geom_density(alpha = 0.2, fill = "#FF6666") +
  ggtitle("Distribution of Grocery Purchases by Customer") +
  xlab("Number of Purchases") +
  ylab("Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

8 Association Rule Mining

8.1 Introduction to Association Rules

Supermarkets often organize their aisles and shelves in ways that group related products together, encouraging customers to purchase complementary items. For instance, items such as bread, butter, honey, cheese, and eggs are typically placed close to one another, just as toothbrushes are commonly positioned near toothpaste. This arrangement reflects the natural relationship between products that are frequently bought together.

Such product placement strategies are intentionally designed to influence purchasing behavior and increase overall sales. Retailers may also promote bundled items or offer discounts on related products to motivate customers to add multiple associated items to their shopping carts. By leveraging these associations, businesses aim to enhance revenue while improving the shopping experience for customers.

8.1.1 Key Metrics

  • Support: Represents how frequently an item or itemset appears in the dataset; items with very low occurrence are typically excluded from rule generation.
  • Confidence: Measures the probability that item Y is purchased given that item X has already been purchased.
  • Lift: Evaluates the strength of an association by comparing the observed confidence to what would be expected if the items were independent. A lift value greater than one suggests a positive relationship, while a value less than one indicates a negative association.
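
Written as formulas, the three metrics are commonly defined as follows. The symbols are notation introduced here for illustration: N is the total number of transactions, t ranges over transactions, and X and Y are disjoint itemsets.

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{\lvert \{\, t : X \subseteq t \,\} \rvert}{N} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align*}
```

Because the last expression is symmetric in X and Y, a rule and its reverse share the same lift. Rules [16] and [17] in the rule table of Section 8.3 ({brown bread} => {canned beer} and its reverse) illustrate this: both have a lift of 1.362937 but different confidence values.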

8.1.2 Apriori Algorithm

Apriori is a widely used algorithm in association rule mining that identifies frequent combinations of items within a dataset. It first determines the itemsets that appear in at least a minimum share of transactions, relying on the fact that every subset of a frequent itemset must itself be frequent, and then uses these frequent itemsets to construct association rules, which describe how likely one item is to be purchased when another item is present in the same transaction.
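
The level-wise idea can be sketched in base R on a handful of toy baskets. This is a simplified illustration of the first two levels only, not the arules implementation; the baskets, the 0.5 threshold, and the helper itemset_support() are made up for the example.

```r
# Toy baskets (hypothetical data, not taken from the Groceries dataset)
baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk")
)
min_support <- 0.5

# Support of an itemset = share of baskets that contain every item in it
itemset_support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: keep the single items that reach the minimum support
items <- sort(unique(unlist(baskets)))
support_1 <- vapply(items, itemset_support, numeric(1), baskets = baskets)
frequent_1 <- names(support_1)[support_1 >= min_support]

# Level 2: candidate pairs are built only from frequent single items,
# because any itemset containing an infrequent item cannot be frequent
candidates_2 <- combn(frequent_1, 2, simplify = FALSE)
support_2 <- vapply(candidates_2, itemset_support, numeric(1), baskets = baskets)
frequent_2 <- candidates_2[support_2 >= min_support]

print(frequent_1)
print(frequent_2)
```

In this toy data all three items are frequent, but the pair {butter, milk} occurs in only one of four baskets and is pruned, so no three-item candidate would need to be counted at the next level.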

8.2 Data Preparation for Apriori

To apply the Apriori algorithm, the dataset must first be transformed into a transactional structure. This can be accomplished using two different approaches:

Method 1: Using ddply and CSV

# Sorting the groceries by customer id
dataset <- groceries[order(groceries$Member_number),]

# Converting Member_number column to numeric
dataset$Member_number <- as.numeric(dataset$Member_number)

# Create a new data frame called items that contains the concatenated itemDescription
# for each Member_number and Date combination
items <- plyr::ddply(
  dataset,
  c("Member_number", "Date"),
  function(df) paste(df$itemDescription, collapse = ",")
)

# Remove Member_number and Date columns from items
items$Member_number <- NULL
items$Date <- NULL

# Rename the column in items to "items"
colnames(items) <- c("items")

# Write items to a CSV file called "Items.csv" (one comma-separated basket per line)
write.csv(items, file = "Items.csv", quote = FALSE, row.names = FALSE)
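
Method 1 stops after writing the CSV file. One way to turn such a file into a transactions object is read.transactions() from arules. The sketch below uses a small temporary file in the same layout as Items.csv (a header line, then one comma-separated basket per line) so that it runs on its own; the file contents and the names basket_file and tr are assumptions for illustration.

```r
library(arules)

# Stand-in for Items.csv: a header line followed by one basket per line
basket_file <- tempfile(fileext = ".csv")
writeLines(
  c("items",
    "whole milk,pastry,salty snack",
    "soda,pickled vegetables",
    "whole milk,whole milk,sausage"),
  basket_file
)

# skip = 1 drops the header line; rm.duplicates = TRUE collapses items
# that occur more than once within the same basket
tr <- read.transactions(
  basket_file,
  format = "basket",
  sep = ",",
  skip = 1,
  rm.duplicates = TRUE
)

inspect(tr)
```

For the real file, basket_file would be replaced by "Items.csv". Item names containing a comma would break this layout, which is one reason Method 2 may be preferable.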

Method 2: Direct Transaction Creation

# Create transaction list grouped by Member_number and Date
trans_list <- groceries %>%
  group_by(Member_number, Date) %>%
  summarise(items = list(itemDescription), .groups = 'drop') %>%
  pull(items)

# Convert to transactions object directly
data <- as(trans_list, "transactions")

# Display basic information
cat("Number of transactions:", length(data), "\n")
## Number of transactions: 14963
cat("Number of unique items:", length(itemLabels(data)), "\n")
## Number of unique items: 167
# Show first few transactions
inspect(head(data, 6))
##     items                                             
## [1] {pastry, salty snack, whole milk}                 
## [2] {sausage, semi-finished bread, whole milk, yogurt}
## [3] {pickled vegetables, soda}                        
## [4] {canned beer, misc. beverages}                    
## [5] {hygiene articles, sausage}                       
## [6] {rolls/buns, sausage, whole milk}
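
Both methods rest on the same grouping step: every row that shares a Member_number and a Date is treated as part of one basket. A base-R sketch of that step on a toy data frame (the rows below are made up for illustration and are not taken from the dataset):

```r
# Toy purchase records in the same three-column layout as the dataset
toy <- data.frame(
  Member_number = c(1001, 1001, 1001, 1002, 1002),
  Date = as.Date(c("2015-01-05", "2015-01-05", "2015-03-12",
                   "2015-01-05", "2015-01-05")),
  itemDescription = c("whole milk", "rolls/buns", "soda",
                      "yogurt", "whole milk")
)

# One list element per (member, date) combination that actually occurs
baskets <- split(
  toy$itemDescription,
  list(toy$Member_number, toy$Date),
  drop = TRUE
)

print(baskets)
```

Five item rows collapse into three baskets here, in the same way that the 38,765 rows of the dataset collapse into 14,963 transactions.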

8.3 Mining Association Rules

# Mine association rules from data using Apriori algorithm
# Parameters:
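#   - minlen: minimum number of items in a rule (2 = at least one item on each side)
#   - sup: minimum support (0.001 = the itemset appears in at least 0.1% of transactions)
#   - conf: minimum confidence (0.05 = the rule holds in at least 5% of transactions containing the left-hand side)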
rules <- arules::apriori(
  data,
  parameter = list(
    minlen = 2,
    sup = 0.001,
    conf = 0.05,
    target = "rules"
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Print the number of association rules mined
cat("Number of association rules mined:", length(rules), "\n\n")
## Number of association rules mined: 450
# Summarize the association rules
summary(rules)
## set of 450 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 423  27 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    2.00    2.06    2.00    3.00 
## 
## summary of quality measures:
##     support           confidence         coverage             lift       
##  Min.   :0.001002   Min.   :0.05000   Min.   :0.005347   Min.   :0.5195  
##  1st Qu.:0.001270   1st Qu.:0.06397   1st Qu.:0.015973   1st Qu.:0.7672  
##  Median :0.001938   Median :0.08108   Median :0.023592   Median :0.8349  
##  Mean   :0.002760   Mean   :0.08759   Mean   :0.033725   Mean   :0.8859  
##  3rd Qu.:0.003342   3rd Qu.:0.10482   3rd Qu.:0.043708   3rd Qu.:0.9600  
##  Max.   :0.014837   Max.   :0.25581   Max.   :0.157923   Max.   :2.1829  
##      count      
##  Min.   : 15.0  
##  1st Qu.: 19.0  
##  Median : 29.0  
##  Mean   : 41.3  
##  3rd Qu.: 50.0  
##  Max.   :222.0  
## 
## mining info:
##  data ntransactions support confidence
##  data         14963   0.001       0.05
##                                                                                                    call
##  arules::apriori(data = data, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
# Display top 20 rules by lift
cat("\n\nTop 20 Rules by Lift:\n")
## 
## 
## Top 20 Rules by Lift:
top_rules <- head(sort(rules, by = "lift"), 20)
inspect(top_rules)
##      lhs                         rhs               support     confidence
## [1]  {whole milk, yogurt}     => {sausage}         0.001470293 0.13173653
## [2]  {sausage, whole milk}    => {yogurt}          0.001470293 0.16417910
## [3]  {specialty chocolate}    => {citrus fruit}    0.001403462 0.08786611
## [4]  {sausage, yogurt}        => {whole milk}      0.001470293 0.25581395
## [5]  {flour}                  => {tropical fruit}  0.001069304 0.10958904
## [6]  {beverages}              => {sausage}         0.001537125 0.09274194
## [7]  {soda, whole milk}       => {sausage}         0.001069304 0.09195402
## [8]  {napkins}                => {pastry}          0.001737619 0.07854985
## [9]  {processed cheese}       => {root vegetables} 0.001069304 0.10526316
## [10] {hard cheese}            => {pip fruit}       0.001069304 0.07272727
## [11] {soft cheese}            => {yogurt}          0.001269799 0.12666667
## [12] {curd}                   => {sausage}         0.002940587 0.08730159
## [13] {detergent}              => {yogurt}          0.001069304 0.12403101
## [14] {sugar}                  => {bottled water}   0.001470293 0.08301887
## [15] {white bread}            => {canned beer}     0.001537125 0.06406685
## [16] {brown bread}            => {canned beer}     0.002405935 0.06394316
## [17] {canned beer}            => {brown bread}     0.002405935 0.05128205
## [18] {chewing gum}            => {yogurt}          0.001403462 0.11666667
## [19] {rolls/buns, whole milk} => {sausage}         0.001136136 0.08133971
## [20] {rolls/buns, sausage}    => {whole milk}      0.001136136 0.21250000
##      coverage    lift     count
## [1]  0.011160863 2.182917 22   
## [2]  0.008955423 1.911760 22   
## [3]  0.015972733 1.653762 21   
## [4]  0.005747511 1.619866 22   
## [5]  0.009757402 1.617141 16   
## [6]  0.016574216 1.536764 23   
## [7]  0.011628684 1.523708 16   
## [8]  0.022121232 1.518529 26   
## [9]  0.010158391 1.513019 16   
## [10] 0.014702934 1.482586 16   
## [11] 0.010024728 1.474952 19   
## [12] 0.033683085 1.446615 44   
## [13] 0.008621266 1.444261 16   
## [14] 0.017710352 1.368074 22   
## [15] 0.023992515 1.365573 23   
## [16] 0.037626144 1.362937 36   
## [17] 0.046915725 1.362937 36   
## [18] 0.012029673 1.358508 21   
## [19] 0.013967787 1.347825 17   
## [20] 0.005346521 1.345594 17
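
The columns of this table are tied together by the metric definitions, which can be checked by hand. The sketch below copies the three highest-lift rules from the output above into a small data frame and recomputes count, confidence, and the support of the right-hand side using base R only; the names top3, n_transactions, and the *_check columns are illustrative.

```r
# Values copied from rules [1], [2], and [4] of the table above
top3 <- data.frame(
  rule = c("{whole milk, yogurt} => {sausage}",
           "{sausage, whole milk} => {yogurt}",
           "{sausage, yogurt} => {whole milk}"),
  support    = c(0.001470293, 0.001470293, 0.001470293),
  coverage   = c(0.011160863, 0.008955423, 0.005747511),
  confidence = c(0.13173653, 0.16417910, 0.25581395),
  lift       = c(2.182917, 1.911760, 1.619866)
)
n_transactions <- 14963

# count = support * number of transactions
top3$count_check <- round(top3$support * n_transactions)

# confidence = support of the whole rule / support of the left-hand side (coverage)
top3$confidence_check <- top3$support / top3$coverage

# lift = confidence / support of the right-hand side, so the
# right-hand-side support can be recovered as confidence / lift
top3$rhs_support <- top3$confidence / top3$lift

print(top3[, c("rule", "count_check", "confidence_check", "rhs_support")])
```

All three rules come from the same 22 transactions containing sausage, whole milk, and yogurt together; they differ only in which item is placed on the right-hand side, which is why their support is identical while confidence and lift vary.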

9 Visualizing Association Rules

9.1 Item Frequency Plot

itemFrequencyPlot(
  data,
  topN = 10,
  col = "cornflowerblue",
  main = "Top 10 Most Frequent Items",
  ylab = "Frequency"
)

9.2 Metric Relationships

# Convert rules to data frame
rules_df <- as(rules, "data.frame")

# Support vs Confidence
ggplot(rules_df, aes(x = support, y = confidence)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Support vs Confidence") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Finding: The scatter plot shows no strong linear relationship between support and confidence. Most association rules are concentrated at low support values while spanning the full range of observed confidence levels (0.05 to 0.26), indicating that the higher-confidence rules often involve less frequent item combinations.

# Support vs Lift
ggplot(rules_df, aes(x = support, y = lift)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Support vs Lift") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Confidence vs Lift
ggplot(rules_df, aes(x = confidence, y = lift)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Confidence vs Lift") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

9.3 Grouped Matrix Visualization

plot(rules, method = "grouped", control = list(k = 5))

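9.4 Interactive Network Graph (First 20 Rules)

# rules[1:20] selects the first 20 rules in the order they were mined,
# not the 20 rules with the highest lift (those are stored in top_rules)
plot(rules[1:20], method = "graph", engine = "htmlwidget")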

9.4.1 Example Analysis: Whole Milk

Looking at rules involving whole milk:

  • {flour} => {whole milk}: support = 0.00134, confidence = 0.137, lift = 0.867
  • {pasta} => {whole milk}: support = 0.00107, confidence = 0.132, lift = 0.837
  • {semi-finished bread} => {whole milk}: support = 0.00167, confidence = 0.176, lift = 1.11

Findings:

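  • Semi-finished bread is the only one of the three with a lift above 1 (1.11), indicating a weak positive association with whole milk; its support (0.00167) and confidence (0.176) are also the highest of the three.
  • Flour has a lift of 0.867, so whole milk appears slightly less often in baskets containing flour than in baskets overall.
  • Pasta has the lowest lift (0.837) as well as the lowest support and confidence, making it the weakest of the three rules.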

9.5 Interactive Network Graph (First 30 Rules)

# rules[1:30] selects the first 30 rules in mining order, not the 30 with the highest lift
plot(rules[1:30], method = "graph", engine = "htmlwidget")

9.5.1 Example Analysis: Rolls/Buns

Looking at rules involving rolls/buns:

  • {packaged fruit/vegetables} => {rolls/buns}: support = 0.0012, confidence = 0.142, lift = 1.29
  • {seasonal products} => {rolls/buns}: support = 0.001, confidence = 0.142, lift = 1.29
  • {processed cheese} => {rolls/buns}: support = 0.00147, confidence = 0.145, lift = 1.32
  • {detergent} => {rolls/buns}: support = 0.001, confidence = 0.116, lift = 1.06
  • {soft cheese} => {rolls/buns}: support = 0.001, confidence = 0.1, lift = 0.909
  • {cat food} => {rolls/buns}: support = 0.00107, confidence = 0.0904, lift = 0.822
  • {red/blush wine} => {rolls/buns}: support = 0.00134, confidence = 0.127, lift = 1.16

Findings:

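  • Strongest Positive Associations: Processed cheese (lift 1.32), packaged fruit/vegetables (1.29), and seasonal products (1.29) show the clearest positive associations with rolls/buns (lift > 1.2).
  • Weak Positive Associations: Red/blush wine (1.16) and detergent (1.06) have lift values only slightly above 1.
  • Negative Associations: Soft cheese (0.909) and cat food (0.822) have lift values below 1, meaning rolls/buns appear less often in baskets containing these items than in baskets overall.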
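
Per-item listings like the ones in Sections 9.4.1 and 9.5.1 can be produced by filtering a rule set on its right-hand side with subset() from arules. How the figures above were actually extracted is not stated, so the following is only a sketch on made-up toy transactions; toy, toy_rules, and buns_rules are hypothetical names, and the thresholds are chosen to suit five baskets.

```r
library(arules)

# Toy transactions (hypothetical, not taken from the Groceries dataset)
toy <- as(list(
  c("rolls/buns", "processed cheese"),
  c("rolls/buns", "processed cheese", "soda"),
  c("rolls/buns", "cat food"),
  c("cat food", "soda"),
  c("soda")
), "transactions")

# Mine rules with at least two items, silencing the progress output
toy_rules <- apriori(
  toy,
  parameter = list(support = 0.2, confidence = 0.05, minlen = 2),
  control = list(verbose = FALSE)
)

# Keep only the rules whose right-hand side is rolls/buns
buns_rules <- subset(toy_rules, rhs %in% "rolls/buns")

# Highest lift first, as in the listings above
inspect(sort(buns_rules, by = "lift"))
```

In this toy data, processed cheese always appears with rolls/buns and therefore has a lift above 1, while cat food appears with rolls/buns less often than the overall rate and has a lift below 1. Applied to the rules object from Section 8.3, the same subset() call would return every mined rule that predicts rolls/buns.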

9.6 Parallel Coordinates Plot

# rules[1:10] selects the first 10 rules in mining order, not the 10 with the highest lift
plot(rules[1:10], method = "paracoord")

10 Conclusion

Applying the Apriori algorithm to the Groceries dataset produced 450 association rules from 14,963 transactions, with lift values ranging from 0.52 to 2.18. Most of these rules have a lift below 1 (the median is 0.83), so only a minority describe items that are bought together more often than independence would suggest. Apriori's ability to surface such patterns from raw purchase records has contributed to its widespread use in transactional data analysis.

Nevertheless, the rules obtained in this study were not evaluated using separate testing data, indicating the need for additional validation to establish their reliability. Further improvements could include examining rare items or unusual purchasing behaviors, which may offer opportunities for targeted promotions involving associated products. In addition, careful adjustment of the support and confidence thresholds is necessary to prevent overemphasis on specific rules.

The importance of visual representation should not be underestimated either. Visualizing association rules allows for clearer and more intuitive interpretation of relationships, supporting practical decision-making. Overall, although the Apriori algorithm proved valuable for identifying associations, incorporating validation techniques, anomaly analysis, balanced parameter selection, and effective visualizations can further strengthen the practical applicability of the results.

11 References

  1. Kaggle Groceries Dataset: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset
  2. Association Rule Mining: https://medium.com/hengky-sanjaya-blog/association-rule-mining-74c8256a04fb
  3. Implementing Apriori Algorithm in R: https://www.r-bloggers.com/2016/07/implementing-apriori-algorithm-in-r/
  4. Association Rules with Apriori Algorithm: https://towardsdatascience.com/association-rules-with-apriori-algorithm-574593e35223

Report Generated: 31.01.2026
Author: Mariyam Babayeva