Source: Kaggle Groceries Dataset
Consumer purchasing behavior plays a critical role in the success of retail businesses, as understanding how customers shop can support higher profitability and sustained competitive performance. Data mining methods, particularly the Apriori algorithm, provide an effective approach for uncovering relationships within large transactional datasets. In this study, the Apriori algorithm is applied to the Groceries dataset, which consists of customer purchase records collected from a grocery store.
The analysis focuses on identifying meaningful patterns and associations among products that are frequently bought together. These insights can assist retailers in improving store layout, arranging products strategically, and designing targeted marketing campaigns to enhance sales performance and customer satisfaction. Additionally, this study highlights the significant influence of product placement on purchasing decisions within supermarket environments.
This report presents the analytical process, key observations, and results obtained from applying the Apriori algorithm in R using the arules package to the Groceries dataset.
This study utilizes the Groceries dataset, which contains 38,765 records from a grocery store. Each record is a single item purchased by one member on one date, so one shopping trip can span several rows.
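# Load the packages used throughout the report
library(tidyverse)   # dplyr, purrr, ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(arules)      # transactions, apriori(), itemFrequencyPlot()
# Load the dataset
groceries <- read.csv("Groceries_dataset.csv")
# Display first few rows
head(groceries) %>%
  kable(caption = "First 6 rows of Groceries Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Member_number | Date | itemDescription |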
|---|---|---|
| 1808 | 21-07-2015 | tropical fruit |
| 2552 | 05-01-2015 | whole milk |
| 2300 | 19-09-2015 | pip fruit |
| 1187 | 12-12-2015 | other vegetables |
| 3037 | 01-02-2015 | whole milk |
| 4941 | 14-02-2015 | rolls/buns |
## [1] 38765 3
## The dataset contains 38765 rows and 3 columns.
# Checking missing values
missing_values <- colSums(is.na(groceries))
missing_values %>%
as.data.frame() %>%
setNames("Missing Values") %>%
kable(caption = "Missing Values per Column") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | Missing Values |
|---|---|
| Member_number | 0 |
| Date | 0 |
| itemDescription | 0 |
Missing values: none (based on colSums(is.na())).
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : chr "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
## $ itemDescription: chr "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
# Converting datetime format
groceries$Date <- as.Date(groceries$Date, format = "%d-%m-%Y")
# Extracting year, month, day, and weekday
groceries$year <- format(as.Date(groceries$Date), "%Y")
groceries$month <- format(as.Date(groceries$Date), "%m")
groceries$day <- format(as.Date(groceries$Date), "%d")
groceries$weekday <- format(as.Date(groceries$Date), "%w")
# Rearranging the columns
groceries <- groceries[c("Member_number", "Date", "year", "month", "day", "weekday", "itemDescription")]
# Display processed data
head(groceries) %>%
kable(caption = "Processed Groceries Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Member_number | Date | year | month | day | weekday | itemDescription |
|---|---|---|---|---|---|---|
| 1808 | 2015-07-21 | 2015 | 07 | 21 | 2 | tropical fruit |
| 2552 | 2015-01-05 | 2015 | 01 | 05 | 1 | whole milk |
| 2300 | 2015-09-19 | 2015 | 09 | 19 | 6 | pip fruit |
| 1187 | 2015-12-12 | 2015 | 12 | 12 | 6 | other vegetables |
| 3037 | 2015-02-01 | 2015 | 02 | 01 | 0 | whole milk |
| 4941 | 2015-02-14 | 2015 | 02 | 14 | 6 | rolls/buns |
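
Because `as.Date()` returns `NA` instead of raising an error when a string does not match the given format, it can be worth checking the parsing and the `%w` weekday coding (0 = Sunday) on a few sample values. The sketch below uses three dates from the table above plus one deliberately malformed string; the variable names are illustrative and not part of the original analysis.

```r
# Sample date strings; the last one does not follow the day-month-year layout
raw_dates <- c("21-07-2015", "05-01-2015", "01-02-2015", "2015/02/14")

# Strings that do not match "%d-%m-%Y" become NA rather than causing an error
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# %w encodes the weekday as a digit string, "0" (Sunday) to "6" (Saturday)
weekday_code <- format(parsed, "%w")

# Map the codes to names without depending on the system locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday_name <- day_names[as.integer(weekday_code) + 1]

print(data.frame(raw_dates, parsed, weekday_code, weekday_name))

# Number of strings that failed to parse
sum(is.na(parsed))
```

Running the same `sum(is.na(...))` check on `groceries$Date` after conversion would reveal any rows whose dates were silently lost.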
# Filtering data by year 2014 and 2015
df1 <- groceries %>% filter(year == "2014")
df2 <- groceries %>% filter(year == "2015")
# Counting the number of items purchased per month in 2014 and 2015
sales_2014 <- df1 %>% group_by(month) %>% summarize(count = n())
sales_2015 <- df2 %>% group_by(month) %>% summarize(count = n())
# Adding a year column to the data frames
sales_2014$year <- 2014
sales_2015$year <- 2015
# Combining both data frames
sales_combined <- rbind(sales_2014, sales_2015)
# Plotting the data
ggplot(sales_combined, aes(x = month, y = count, fill = factor(year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Months",
y = "Number of products",
title = "Monthly Products Sold in Years 2014 and 2015",
fill = "Year"
) +
scale_fill_manual(values = c("2014" = "cornflowerblue", "2015" = "bisque3")) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 0, hjust = 0.5)
)

# Create a temporary data frame with a quantity purchased column:
# every row receives the total number of rows recorded for its customer
temp <- groceries %>%
  add_count(Member_number, name = "qty_purchased")
# Slice first 5000 rows
temp1 <- temp[1:5000,]
# Converting weekday variable to category
temp1$weekday <- as.factor(temp1$weekday)
# Creating a new data frame which has the frequency of weekdays
weekday_bin <- data.frame(table(temp1$weekday))
colnames(weekday_bin) <- c("weekday", "count")
# Creating a heatmap
heatmap <- ggplot(weekday_bin, aes(x = weekday, y = 1, fill = count)) +
geom_tile() +
labs(title = "Number of Quantity Purchases Across Weekdays") +
scale_fill_gradient(low = "#FFFFFF", high = "cornflowerblue") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
scale_x_discrete(
expand = c(0, 0),
labels = c("0" = "Sunday", "1" = "Monday", "2" = "Tuesday",
"3" = "Wednesday", "4" = "Thursday", "5" = "Friday", "6" = "Saturday")
) +
scale_y_continuous(name = "") +
geom_text(aes(label = count), vjust = 0.5, size = 5)
print(heatmap)

# Getting the top customers based on quantity purchased
top_customers <- temp %>%
select(Member_number, qty_purchased, year) %>%
arrange(desc(qty_purchased)) %>%
head(500)
# Converting the datatype of id and year
top_customers$Member_number <- as.factor(top_customers$Member_number)
top_customers$year <- as.factor(top_customers$year)
# Plotting with ggplot2
ggplot(top_customers, aes(x = qty_purchased, y = Member_number, fill = year)) +
geom_bar(stat = "identity", color = "black") +
ggtitle("Top Customers by Purchase Frequency") +
xlab("Quantity Purchased") +
ylab("Customer ID") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom"
) +
  scale_fill_manual(values = c("2014" = "cornflowerblue", "2015" = "bisque3"))

# Aggregating and sorting the data
groceries_agg <- groceries %>%
group_by(Member_number) %>%
summarise(count = n()) %>%
arrange(count)
# Descriptive statistics
summary_stats <- summary(groceries_agg$count)
summary_df <- data.frame(
Statistic = names(summary_stats),
Value = as.numeric(summary_stats)
)
summary_df %>%
kable(caption = "Summary Statistics of Customer Purchases") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Statistic | Value |
|---|---|
| Min. | 2.000000 |
| 1st Qu. | 6.000000 |
| Median | 9.000000 |
| Mean | 9.944843 |
| 3rd Qu. | 13.000000 |
| Max. | 36.000000 |
# Creating a histogram with a density plot
ggplot(groceries_agg, aes(x = count)) +
geom_histogram(
aes(y = after_stat(density)),
binwidth = 1,
colour = "black",
fill = "#FFFFFF"
) +
geom_density(alpha = 0.2, fill = "#FF6666") +
ggtitle("Distribution of Grocery Purchases by Customer") +
xlab("Number of Purchases") +
ylab("Density") +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Supermarkets often organize their aisles and shelves in ways that group related products together, encouraging customers to purchase complementary items. For instance, items such as bread, butter, honey, cheese, and eggs are typically placed close to one another, just as toothbrushes are commonly positioned near toothpaste. This arrangement reflects the natural relationship between products that are frequently bought together.
Such product placement strategies are intentionally designed to influence purchasing behavior and increase overall sales. Retailers may also promote bundled items or offer discounts on related products to motivate customers to add multiple associated items to their shopping carts. By leveraging these associations, businesses aim to enhance revenue while improving the shopping experience for customers.
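Apriori is a widely used algorithm in association rule mining that identifies frequent combinations of items within a dataset. It first finds the itemsets whose support (the share of transactions that contain them) meets a minimum threshold, then builds association rules from those frequent itemsets. Each rule is scored by its confidence, the estimated probability that the right-hand item is purchased given that the left-hand items are in the basket, and by its lift, which compares that confidence with the overall purchase rate of the right-hand item.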
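
To make the three rule measures concrete, the following base R sketch computes support, confidence, and lift by hand for one rule. The five baskets are invented for illustration and are not taken from the Groceries data; the helper `support()` is a hypothetical name, not an arules function.

```r
# Five toy baskets (invented for illustration)
baskets <- list(
  c("whole milk", "yogurt", "sausage"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "sausage"),
  c("whole milk", "yogurt"),
  c("soda")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "yogurt"
rhs <- "whole milk"

supp_rule  <- support(c(lhs, rhs), baskets)       # P(lhs and rhs together)
confidence <- supp_rule / support(lhs, baskets)   # P(rhs given lhs)
lift       <- confidence / support(rhs, baskets)  # confidence relative to P(rhs)

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

For the rule {yogurt} => {whole milk}, two of the five baskets contain both items (support 0.4), two of the three yogurt baskets also contain whole milk (confidence 2/3), and whole milk appears in three of five baskets overall, giving a lift of about 1.11. A lift above 1 means the items co-occur more often than they would if purchased independently.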
To apply the Apriori algorithm, the dataset must first be transformed into a transactional structure. This can be accomplished using two different approaches:
Method 1: Using ddply and CSV
# Sorting the groceries by customer id
dataset <- groceries[order(groceries$Member_number),]
# Converting Member_number column to numeric
dataset$Member_number <- as.numeric(dataset$Member_number)
# Create a new data frame called items that contains the concatenated itemDescription
# for each Member_number and Date combination
items <- plyr::ddply(
dataset,
c("Member_number", "Date"),
function(df) paste(df$itemDescription, collapse = ",")
)
# Remove Member_number and Date columns from items
items$Member_number <- NULL
items$Date <- NULL
# Rename the column in items to "items"
colnames(items) <- c("items")
# Write items to a CSV file called "Items.csv"
write.csv(items, file = "Items.csv", quote = FALSE, row.names = FALSE)

Method 2: Direct Transaction Creation
# Create transaction list grouped by Member_number and Date
trans_list <- groceries %>%
group_by(Member_number, Date) %>%
summarise(items = list(itemDescription), .groups = 'drop') %>%
pull(items)
# Convert to transactions object directly
data <- as(trans_list, "transactions")
# Display basic information
cat("Number of transactions:", length(data), "\n")
## Number of transactions: 14963
## Number of unique items: 167
## items
## [1] {pastry, salty snack, whole milk}
## [2] {sausage, semi-finished bread, whole milk, yogurt}
## [3] {pickled vegetables, soda}
## [4] {canned beer, misc. beverages}
## [5] {hygiene articles, sausage}
## [6] {rolls/buns, sausage, whole milk}
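
The grouping step treats every distinct member and date pair as one basket. As a rough illustration of what that produces, the base R sketch below builds baskets from a small invented table in the same three-column layout. The data, the `toy` names, and the explicit de-duplication step are assumptions made for this example rather than part of the original pipeline.

```r
# Toy long-format purchases in the dataset's layout (values invented)
toy <- data.frame(
  Member_number   = c(1000L, 1000L, 1000L, 1001L, 1001L),
  Date            = as.Date(c("2015-03-15", "2015-03-15", "2015-06-01",
                              "2015-03-15", "2015-03-15")),
  itemDescription = c("whole milk", "yogurt", "soda", "sausage", "sausage")
)

# One key per member and date, so each key identifies one shopping trip
basket_key <- paste(toy$Member_number, toy$Date)

# Collect the items belonging to each key
toy_baskets <- split(toy$itemDescription, basket_key)

# An itemset records presence, not quantity, so repeats within a basket are dropped
toy_baskets <- lapply(toy_baskets, unique)

print(length(toy_baskets))  # number of baskets
print(toy_baskets)
```

The five rows collapse into three baskets, which mirrors how the 38,765 rows of the full dataset reduce to 14,963 transactions.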
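# Mine association rules from data using the Apriori algorithm
# Parameters:
# - minlen: minimum number of items in a rule (2 = at least one item on each side)
# - sup: minimum support (0.001 = the rule's items appear together in at least 0.1% of transactions)
# - conf: minimum confidence (0.05 = at least 5% of transactions containing the left-hand side also contain the right-hand side)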
rules <- arules::apriori(
data,
parameter = list(
minlen = 2,
sup = 0.001,
conf = 0.05,
target = "rules"
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
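
The "Absolute minimum support count" in the log translates the relative support threshold into a number of transactions. The short sketch below reproduces that arithmetic from the figures in the log; it assumes the logged value is the truncated product, which is consistent with the 14 shown but is not confirmed here against the package source.

```r
n_transactions <- 14963   # from the apriori() log
min_support    <- 0.001   # relative support threshold passed to apriori()

# Relative threshold expressed as a number of transactions
threshold <- min_support * n_transactions

# Truncating the product gives the count shown in the log (assumed behaviour)
reported_count <- as.integer(threshold)

# Counts are whole numbers, so an itemset needs at least this many
# transactions for its relative support to reach the threshold
smallest_qualifying_count <- ceiling(threshold)

print(c(threshold = threshold,
        reported = reported_count,
        smallest_qualifying = smallest_qualifying_count))
```

The product is 14.963, so the logged count is 14, while any rule that actually passes the 0.001 threshold must occur in at least 15 transactions.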
# Print the number of association rules mined
cat("Number of association rules mined:", length(rules), "\n\n")
## Number of association rules mined: 450
## set of 450 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 423 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.06 2.00 3.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.05000 Min. :0.005347 Min. :0.5195
## 1st Qu.:0.001270 1st Qu.:0.06397 1st Qu.:0.015973 1st Qu.:0.7672
## Median :0.001938 Median :0.08108 Median :0.023592 Median :0.8349
## Mean :0.002760 Mean :0.08759 Mean :0.033725 Mean :0.8859
## 3rd Qu.:0.003342 3rd Qu.:0.10482 3rd Qu.:0.043708 3rd Qu.:0.9600
## Max. :0.014837 Max. :0.25581 Max. :0.157923 Max. :2.1829
## count
## Min. : 15.0
## 1st Qu.: 19.0
## Median : 29.0
## Mean : 41.3
## 3rd Qu.: 50.0
## Max. :222.0
##
## mining info:
## data ntransactions support confidence
## data 14963 0.001 0.05
## call
## arules::apriori(data = data, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
##
##
## Top 20 Rules by Lift:
## lhs rhs support confidence
## [1] {whole milk, yogurt} => {sausage} 0.001470293 0.13173653
## [2] {sausage, whole milk} => {yogurt} 0.001470293 0.16417910
## [3] {specialty chocolate} => {citrus fruit} 0.001403462 0.08786611
## [4] {sausage, yogurt} => {whole milk} 0.001470293 0.25581395
## [5] {flour} => {tropical fruit} 0.001069304 0.10958904
## [6] {beverages} => {sausage} 0.001537125 0.09274194
## [7] {soda, whole milk} => {sausage} 0.001069304 0.09195402
## [8] {napkins} => {pastry} 0.001737619 0.07854985
## [9] {processed cheese} => {root vegetables} 0.001069304 0.10526316
## [10] {hard cheese} => {pip fruit} 0.001069304 0.07272727
## [11] {soft cheese} => {yogurt} 0.001269799 0.12666667
## [12] {curd} => {sausage} 0.002940587 0.08730159
## [13] {detergent} => {yogurt} 0.001069304 0.12403101
## [14] {sugar} => {bottled water} 0.001470293 0.08301887
## [15] {white bread} => {canned beer} 0.001537125 0.06406685
## [16] {brown bread} => {canned beer} 0.002405935 0.06394316
## [17] {canned beer} => {brown bread} 0.002405935 0.05128205
## [18] {chewing gum} => {yogurt} 0.001403462 0.11666667
## [19] {rolls/buns, whole milk} => {sausage} 0.001136136 0.08133971
## [20] {rolls/buns, sausage} => {whole milk} 0.001136136 0.21250000
## coverage lift count
## [1] 0.011160863 2.182917 22
## [2] 0.008955423 1.911760 22
## [3] 0.015972733 1.653762 21
## [4] 0.005747511 1.619866 22
## [5] 0.009757402 1.617141 16
## [6] 0.016574216 1.536764 23
## [7] 0.011628684 1.523708 16
## [8] 0.022121232 1.518529 26
## [9] 0.010158391 1.513019 16
## [10] 0.014702934 1.482586 16
## [11] 0.010024728 1.474952 19
## [12] 0.033683085 1.446615 44
## [13] 0.008621266 1.444261 16
## [14] 0.017710352 1.368074 22
## [15] 0.023992515 1.365573 23
## [16] 0.037626144 1.362937 36
## [17] 0.046915725 1.362937 36
## [18] 0.012029673 1.358508 21
## [19] 0.013967787 1.347825 17
## [20] 0.005346521 1.345594 17
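
The columns of the rules table are linked by simple ratios, which can be verified for a single rule. The sketch below recomputes the measures for rule [4], {sausage, yogurt} => {whole milk}, using only the numbers printed in the table; the variable names are illustrative.

```r
n_transactions <- 14963

# Values printed for rule [4]: {sausage, yogurt} => {whole milk}
rule_count    <- 22           # transactions containing all three items
rule_coverage <- 0.005747511  # share of transactions containing the left-hand side
rule_lift     <- 1.619866

# support = count / number of transactions
rule_support <- rule_count / n_transactions

# coverage * number of transactions = transactions containing the left-hand side
lhs_count <- round(rule_coverage * n_transactions)

# confidence = share of left-hand-side baskets that also contain the right-hand side
confidence <- rule_count / lhs_count

# lift = confidence / support(rhs), so the support of whole milk can be recovered
rhs_support <- confidence / rule_lift

print(c(support = rule_support, lhs_count = lhs_count,
        confidence = confidence, rhs_support = rhs_support))
```

The recomputed support and confidence agree with the table, and the implied support of whole milk (about 0.158) matches the maximum coverage reported in the rule summary.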
itemFrequencyPlot(
data,
topN = 10,
col = "cornflowerblue",
main = "Top 10 Most Frequent Items",
ylab = "Frequency"
)

# Convert rules to data frame
rules_df <- as(rules, "data.frame")
# Support vs Confidence
ggplot(rules_df, aes(x = support, y = confidence)) +
geom_point(color = "red", alpha = 0.6) +
labs(title = "Support vs Confidence") +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Finding: The scatter plot shows no strong linear relationship between support and confidence. Most association rules are concentrated at low support values while displaying a wide range of confidence levels, indicating that the highest-confidence rules tend to involve infrequent item combinations.
# Support vs Lift
ggplot(rules_df, aes(x = support, y = lift)) +
geom_point(color = "red", alpha = 0.6) +
labs(title = "Support vs Lift") +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Confidence vs Lift
ggplot(rules_df, aes(x = confidence, y = lift)) +
geom_point(color = "red", alpha = 0.6) +
labs(title = "Confidence vs Lift") +
theme_minimal() +
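  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Looking at rules involving whole milk:
- {flour} => {whole milk}: support = 0.00134, confidence = 0.137, lift = 0.867
- {pasta} => {whole milk}: support = 0.00107, confidence = 0.132, lift = 0.837
- {semi-finished bread} => {whole milk}: support = 0.00167, confidence = 0.176, lift = 1.11

Findings: Of these three rules, only {semi-finished bread} has a lift above 1, and only slightly. Flour and pasta have lift below 1, meaning they appear with whole milk less often than independent purchasing would predict. With support between 0.00107 and 0.00167, each rule rests on roughly 16 to 25 transactions.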
Looking at rules involving rolls/buns:
- {packaged fruit/vegetables} => {rolls/buns}: support = 0.0012, confidence = 0.142, lift = 1.29
- {seasonal products} => {rolls/buns}: support = 0.001, confidence = 0.142, lift = 1.29
- {processed cheese} => {rolls/buns}: support = 0.00147, confidence = 0.145, lift = 1.32
- {detergent} => {rolls/buns}: support = 0.001, confidence = 0.116, lift = 1.06
- {soft cheese} => {rolls/buns}: support = 0.001, confidence = 0.1, lift = 0.909
- {cat food} => {rolls/buns}: support = 0.00107, confidence = 0.0904, lift = 0.822
- {red/blush wine} => {rolls/buns}: support = 0.00134, confidence = 0.127, lift = 1.16

Findings: Processed cheese, packaged fruit/vegetables, and seasonal products show the highest lift (1.29 to 1.32), while soft cheese and cat food have lift below 1. All seven rules sit close to the minimum support of 0.001, so each is based on a small number of transactions.
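
Reading a list of rules by eye gets harder as it grows, so it can help to flag and rank them programmatically. The base R sketch below does this for the rolls/buns rules listed above, using their rounded values; the data frame and column names are illustrative. On the full rule set, comparable filtering is typically done with the `subset()` and `sort()` methods that arules provides for rule objects.

```r
# Rules with {rolls/buns} on the right-hand side (rounded values as listed)
rolls_rules <- data.frame(
  lhs = c("packaged fruit/vegetables", "seasonal products", "processed cheese",
          "detergent", "soft cheese", "cat food", "red/blush wine"),
  confidence = c(0.142, 0.142, 0.145, 0.116, 0.1, 0.0904, 0.127),
  lift = c(1.29, 1.29, 1.32, 1.06, 0.909, 0.822, 1.16)
)

# lift > 1: bought together more often than independence predicts
# lift < 1: bought together less often than independence predicts
rolls_rules$direction <- ifelse(rolls_rules$lift > 1, "above chance", "below chance")

# Rank the rules from highest to lowest lift
ranked <- rolls_rules[order(-rolls_rules$lift), ]
print(ranked)
```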
The Apriori algorithm proved effective for discovering association rules in the Groceries data: with a minimum support of 0.001 and a minimum confidence of 0.05, it produced 450 rules. Most of these rules are weak, however. The median lift is 0.83 and the highest confidence is about 0.26, so only a minority of rules describe items that co-occur more often than chance would suggest.

The rules were not evaluated on separate testing data, so additional validation is needed before relying on them. Further improvements could include examining rare items or unusual purchasing behaviors, which may offer opportunities for targeted promotions involving associated products. Support and confidence thresholds should also be adjusted carefully, with lift considered alongside them, to prevent overemphasis on specific rules.

Visual representation matters as well. Visualizing association rules allows for clearer and more intuitive interpretation of relationships, supporting practical decision-making. Overall, although the Apriori algorithm proved valuable for identifying associations, incorporating validation techniques, anomaly analysis, balanced parameter selection, and effective visualizations can further strengthen the practical applicability of the results.
Report Generated: 31.01.2026 Author: Mariyam Babayeva