Introduction

Understanding consumer purchasing behavior is essential for organizations in the retail industry to boost profitability and keep a competitive advantage. One way to gain insight into customer behavior is through data mining techniques such as the Apriori algorithm. In this paper, I have applied the Apriori algorithm to analyze the Groceries dataset, which contains customers’ purchase orders from a grocery store. By analyzing the data, I aim to discover interesting patterns and associations among items customers purchase together. This information can be used by businesses to optimize their store layout, product placement, and marketing strategies to increase sales and customer satisfaction. Through this paper, I have also gained a better understanding of the importance of item placement in supermarkets, which can significantly impact a business’s bottom line. This report will present my analysis, insights, and findings from the Groceries dataset using the Apriori algorithm implemented in R using the ‘arules’ library.

Data Preparation

Necessary Libraries:

library(tidyverse)
library(lubridate)
library(magrittr)
library(arules)
library(Matrix)
library(arulesViz)

Getting the Data

In this paper, I used the Groceries dataset, which contains 38,765 rows of customers' purchase records from a grocery store.

groceries <- read.csv("Groceries_dataset.csv")
head(groceries)
##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 01-05-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 02-01-2015       whole milk
## 6          4941 14-02-2015       rolls/buns
dim(groceries)
## [1] 38765     3
#checking missing values
colSums(is.na(groceries))
##   Member_number            Date itemDescription 
##               0               0               0

There are no missing values in the dataset.

str(groceries)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : chr  "21-07-2015" "01-05-2015" "19-09-2015" "12-12-2015" ...
##  $ itemDescription: chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...

Data Pre-processing

# converting the Date column from character to Date format
groceries$Date <- as.Date(groceries$Date, format= "%d-%m-%Y")
# Extracting year, month, day, and weekday
groceries$year <- format(as.Date(groceries$Date), "%Y")
groceries$month <- format(as.Date(groceries$Date), "%m")
groceries$day <- format(as.Date(groceries$Date), "%d")
groceries$weekday <- format(as.Date(groceries$Date), "%w")  # %w: weekday as 0-6, Sunday = 0
# Rearranging the columns
groceries <- groceries[c("Member_number", "Date", "year", "month", "day", "weekday", "itemDescription")]
head(groceries)
##   Member_number       Date year month day weekday  itemDescription
## 1          1808 2015-07-21 2015    07  21       2   tropical fruit
## 2          2552 2015-05-01 2015    05  01       5       whole milk
## 3          2300 2015-09-19 2015    09  19       6        pip fruit
## 4          1187 2015-12-12 2015    12  12       6 other vegetables
## 5          3037 2015-01-02 2015    01  02       5       whole milk
## 6          4941 2015-02-14 2015    02  14       6       rolls/buns

Exploratory Data Analysis

#Filtering data by year 2014 and 2015  
df1 <- groceries %>% filter(year == 2014)
df2 <- groceries %>% filter(year == 2015)

# Summarizing the number of products purchased per month in 2014 and 2015
sales_2014 <- df1 %>% group_by(month) %>% summarize(count = n())
sales_2015 <- df2 %>% group_by(month) %>% summarize(count = n())

#Adding a year column to the data frames
sales_2014$year <- 2014
sales_2015$year <- 2015

# Combining both data frames
sales_combined <- rbind(sales_2014, sales_2015)

#Plotting the data
ggplot(sales_combined, aes(x = month, y = count, fill = factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Months", y = "Number of products", title = "Monthly products sold in years 2014 and 2015", fill = "Year") +
  scale_fill_manual(values = c("2014" = "cornflowerblue", "2015" = "bisque3")) +
  theme_minimal()

Findings

Since average sales in 2015 were higher than in 2014, we can conclude that store revenue is increasing. According to the data, the weakest monthly sales were recorded in September 2014 and February 2015. The record month was October 2015, when approximately 2,000 products were sold.

# Create a temporary data frame with a quantity-purchased column:
# the total number of items each member has bought
temp <- groceries %>%
  group_by(Member_number) %>%
  mutate(qty_purchased = n()) %>%
  ungroup()

# Slice first 5000 rows
temp1 <- temp[1:5000,]

# Converting weekday variable to category
temp1$weekday <- as.factor(temp1$weekday)

# Creating a new data frame which has the frequency of weekdays
weekday_bin <- data.frame(table(temp1$weekday))
colnames(weekday_bin) <- c("weekday", "count")

# Creating a heatmap
heatmap <- ggplot(weekday_bin, aes(x=weekday, y=1, fill=count)) +
  geom_tile() +
  labs(title="Number of purchases across weekdays") +
  scale_fill_gradient(low="#FFFFFF", high="cornflowerblue") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_x_discrete(expand = c(0,0)) +
  scale_y_continuous(name="") +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

# Adding data labels
text <- heatmap +
  geom_text(aes(label=count), vjust=-0.5, size=5)

# Printing the plot
print(text)

Findings

Upon analyzing the heatmap, it can be seen that the highest number of purchases across weekdays was made on Sunday with a value of 759. This indicates that customers tend to do more shopping on weekends than on weekdays. Wednesday follows closely behind with 756 purchases, while Monday has the least number of purchases with a value of 668. Overall, the heatmap portrays a relatively balanced distribution of purchases across the weekdays.

# Getting the top customers based on quantity purchased
top_customers <- temp %>% 
  select(Member_number, qty_purchased, year) %>% 
  arrange(desc(qty_purchased)) %>% 
  head(500)

# Converting the datatype of id and year
top_customers$Member_number <- as.factor(top_customers$Member_number)
top_customers$year <- as.factor(top_customers$year)

# Plotting with ggplot2
ggplot(top_customers, aes(x = qty_purchased, y = Member_number, fill = year)) +
  geom_bar(stat = "identity", color = "black") +
  ggtitle("Top Customers") +
  xlab("Quantity Purchased") +
  ylab("Customer ID") +
  theme_minimal() +
  theme(legend.position = "bottom")

Findings

The plot, with Customer ID on the y-axis and Quantity Purchased on the x-axis, shows the purchase frequency of the eighteen selected top customers. In 2014, customers 3737 and 3018 were the most devoted; in 2015, customers 3180, 3050, and 2433 made the most purchases, with customer 3180 topping the list overall. When judging customer loyalty, note that a few customers appear inconsistent, having made numerous purchases in 2014 but none in 2015, or vice versa. Moreover, with only two years of data, it is hard to say anything about each customer's lifetime with the store.

# Aggregating and sorting the data
groceries_agg <- groceries %>% 
  group_by(Member_number) %>% 
  summarise(count = n()) %>% 
  arrange(count)

# Descriptive statistics
summary(groceries_agg$count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   6.000   9.000   9.945  13.000  36.000
# Creating a histogram with a density plot
ggplot(groceries_agg, aes(x = count)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, colour = "black", fill = "#FFFFFF") +
  geom_density(alpha = .2, fill = "#FF6666") +
  ggtitle("Distribution of Grocery Purchases by Customer") +
  xlab("Number of Purchases") +
  ylab("Density")

Association Rule Mining

Have you ever wondered why the sections and racks in a supermarket are arranged so that associated products sit near each other? Bread, butter, and honey, or cheese and eggs, for example, can be found in nearby racks, as can toothbrushes and toothpaste. These products are linked: if you buy a brush, you are more likely to buy the paste. These are marketing strategies designed to get you to fill your basket with a product and the items linked to it, thereby increasing sales revenue. Many businesses offer discounts on the connected item, or bundle both items at a lower price, to entice you to buy an item together with its associated item.

Association Rule

Association rules are a type of statistical rule-based approach used in data mining and machine learning. The goal of association rule mining is to identify frequent patterns, associations, or correlations between different items in a dataset.

  • Support: the fraction of transactions in which an itemset appears. It denotes the item's popularity; if an item is not regularly purchased, it will not be considered in the association.

  • Confidence: the probability that Y will be purchased given that X is purchased.

  • Lift: combines support and confidence. A lift greater than 1 indicates that the presence of the antecedent raises the likelihood of the consequent occurring in the same transaction; a lift below 1 indicates that the antecedent and consequent are less likely to be bought together. The standard definitions are given below.
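
In symbols, for a rule X => Y mined from N transactions, these three measures are defined as:

$$\mathrm{Support}(X \Rightarrow Y) = \frac{\lvert\{\, t : X \cup Y \subseteq t \,\}\rvert}{N}, \qquad \mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}, \qquad \mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Confidence}(X \Rightarrow Y)}{\mathrm{Support}(Y)}$$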

Apriori Algorithm

Apriori is a popular algorithm used in association rule mining, a data mining technique used to identify patterns or relationships between variables in large datasets.

The Apriori algorithm works by generating frequent itemsets from the given dataset, where an itemset is a collection of one or more items. The algorithm then uses these frequent itemsets to derive association rules that indicate the likelihood of the occurrence of one item based on the presence or absence of another item.
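
To make this concrete, here is a minimal, self-contained sketch of frequent-itemset and rule mining with 'arules', run on a handful of hypothetical toy transactions (illustrative only, not the Groceries data):

library(arules)
# five toy transactions (hypothetical)
toy <- list(
  c("bread", "butter"),
  c("bread", "butter", "honey"),
  c("toothbrush", "toothpaste"),
  c("bread", "honey"),
  c("toothbrush", "toothpaste", "bread")
)
toy_txn <- as(toy, "transactions")
# frequent itemsets appearing in at least 40% of transactions
freq <- apriori(toy_txn, parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(freq)
# association rules derived from those itemsets, e.g. {butter} => {bread}
toy_rules <- apriori(toy_txn, parameter = list(supp = 0.4, conf = 0.6, minlen = 2, target = "rules"))
inspect(toy_rules)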

# sorting the groceries by customer id 
dataset <- groceries[order(groceries$Member_number),]
# converting Member_number column to numeric
dataset$Member_number <- as.numeric(dataset$Member_number)
# display the structure of the sorted data frame
str(dataset)
## 'data.frame':    38765 obs. of  7 variables:
##  $ Member_number  : num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ Date           : Date, format: "2015-05-27" "2015-07-24" ...
##  $ year           : chr  "2015" "2015" "2015" "2015" ...
##  $ month          : chr  "05" "07" "03" "11" ...
##  $ day            : chr  "27" "24" "15" "25" ...
##  $ weekday        : chr  "3" "5" "0" "3" ...
##  $ itemDescription: chr  "soda" "canned beer" "sausage" "sausage" ...
# create a new data frame called items that contains the concatenated itemDescription for each Member_number and Date combination
items <- plyr::ddply(dataset, c("Member_number", "Date"), function(df) paste(df$itemDescription, collapse = ","))
# remove Member_number and Date columns from items
items$Member_number <- NULL
items$Date <- NULL
# rename the column in items to "items"
colnames(items) <- c("items")
# write items to a CSV file called "Items.csv"
write.csv(items, file = "Items.csv", quote = FALSE, row.names = TRUE)
# display the first 6 rows of items
head(items)
##                                           items
## 1                 whole milk,pastry,salty snack
## 2 sausage,whole milk,semi-finished bread,yogurt
## 3                       soda,pickled vegetables
## 4                   canned beer,misc. beverages
## 5                      sausage,hygiene articles
## 6                 sausage,whole milk,rolls/buns
# read the CSV file back in as a transactions object called data
data <- arules::read.transactions(file = "Items.csv", rm.duplicates = TRUE, format = "basket", sep = ",", cols = 1)
## distribution of transactions with duplicates:
## items
##   1   2   3   4 
## 662  39   5   1
# remove double quotes from item labels in data
data@itemInfo$labels <- gsub("\"", "", data@itemInfo$labels)
# mine association rules from data using Apriori algorithm
rules <- arules::apriori(data, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# print the number of association rules mined
print(length(rules))
## [1] 450
# summarize the association rules
summary(rules)
## set of 450 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 423  27 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    2.00    2.06    2.00    3.00 
## 
## summary of quality measures:
##     support           confidence         coverage             lift       
##  Min.   :0.001002   Min.   :0.05000   Min.   :0.005346   Min.   :0.5195  
##  1st Qu.:0.001270   1st Qu.:0.06397   1st Qu.:0.015972   1st Qu.:0.7673  
##  Median :0.001938   Median :0.08108   Median :0.023590   Median :0.8350  
##  Mean   :0.002760   Mean   :0.08759   Mean   :0.033723   Mean   :0.8859  
##  3rd Qu.:0.003341   3rd Qu.:0.10482   3rd Qu.:0.043705   3rd Qu.:0.9601  
##  Max.   :0.014836   Max.   :0.25581   Max.   :0.157912   Max.   :2.1831  
##      count      
##  Min.   : 15.0  
##  1st Qu.: 19.0  
##  Median : 29.0  
##  Mean   : 41.3  
##  3rd Qu.: 50.0  
##  Max.   :222.0  
## 
## mining info:
##  data ntransactions support confidence
##  data         14964   0.001       0.05
##                                                                                                    call
##  arules::apriori(data = data, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
# inspect the first 20 association rules
arules::inspect(rules[1:20])
##      lhs                            rhs                support     confidence
## [1]  {frozen fish}               => {whole milk}       0.001069233 0.1568627 
## [2]  {seasonal products}         => {rolls/buns}       0.001002406 0.1415094 
## [3]  {pot plants}                => {other vegetables} 0.001002406 0.1282051 
## [4]  {pot plants}                => {whole milk}       0.001002406 0.1282051 
## [5]  {pasta}                     => {whole milk}       0.001069233 0.1322314 
## [6]  {pickled vegetables}        => {whole milk}       0.001002406 0.1119403 
## [7]  {packaged fruit/vegetables} => {rolls/buns}       0.001202887 0.1417323 
## [8]  {detergent}                 => {yogurt}           0.001069233 0.1240310 
## [9]  {detergent}                 => {rolls/buns}       0.001002406 0.1162791 
## [10] {detergent}                 => {whole milk}       0.001403368 0.1627907 
## [11] {semi-finished bread}       => {other vegetables} 0.001002406 0.1056338 
## [12] {semi-finished bread}       => {whole milk}       0.001670676 0.1760563 
## [13] {red/blush wine}            => {rolls/buns}       0.001336541 0.1273885 
## [14] {red/blush wine}            => {other vegetables} 0.001136060 0.1082803 
## [15] {flour}                     => {tropical fruit}   0.001069233 0.1095890 
## [16] {flour}                     => {whole milk}       0.001336541 0.1369863 
## [17] {herbs}                     => {yogurt}           0.001136060 0.1075949 
## [18] {herbs}                     => {whole milk}       0.001136060 0.1075949 
## [19] {processed cheese}          => {root vegetables}  0.001069233 0.1052632 
## [20] {processed cheese}          => {rolls/buns}       0.001470195 0.1447368 
##      coverage    lift      count
## [1]  0.006816359 0.9933534 16   
## [2]  0.007083667 1.2864807 15   
## [3]  0.007818765 1.0500611 15   
## [4]  0.007818765 0.8118754 15   
## [5]  0.008086073 0.8373723 16   
## [6]  0.008954825 0.7088763 15   
## [7]  0.008487036 1.2885066 18   
## [8]  0.008620690 1.4443580 16   
## [9]  0.008620690 1.0571081 15   
## [10] 0.008620690 1.0308929 21   
## [11] 0.009489441 0.8651911 15   
## [12] 0.009489441 1.1148993 25   
## [13] 0.010491847 1.1581057 20   
## [14] 0.010491847 0.8868668 17   
## [15] 0.009756750 1.6172489 16   
## [16] 0.009756750 0.8674833 20   
## [17] 0.010558674 1.2529577 17   
## [18] 0.010558674 0.6813587 17   
## [19] 0.010157712 1.5131200 16   
## [20] 0.010157712 1.3158214 22

Visualizing the Association Rules

itemFrequencyPlot(data, topN = 10, col = "cornflowerblue")

rules_df <- as(rules, "data.frame")

# Plotting the relationship between the metrics
ggplot(rules_df, aes(x = support, y = confidence)) +
  geom_point(color = "red") +
  labs(title = "Support vs Confidence") +
  theme(plot.title = element_text(hjust = 0.5))

The relationship between support and confidence is only weakly linear: the most frequently purchased items are accompanied by many less common ones, so high support does not imply high confidence.

ggplot(rules_df, aes(x = support, y = lift)) +
  geom_point(color = "red") +
  labs(title = "Support vs Lift") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(rules_df, aes(x = confidence, y = lift)) +
  geom_point(color = "red") +
  labs(title = "Confidence vs Lift") +
  theme(plot.title = element_text(hjust = 0.5))

plot(rules, method = "grouped", control = list(k = 5))

plot(rules[1:20], method="graph",engine="htmlwidget")

For example, we can examine the rules that have whole milk as the consequent:

{flour} => {whole milk} support = 0.00134 confidence = 0.137 coverage = 0.00976 lift = 0.867

{pasta} => {whole milk} support = 0.00107 confidence = 0.132 coverage = 0.00809 lift = 0.837

{semi-finished bread} => {whole milk} support = 0.00167 confidence = 0.176 coverage = 0.00949 lift = 1.11
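
These rules can also be pulled out programmatically rather than read off the inspected output. A minimal sketch using arules' subset() on the 'rules' object mined above:

# extract every rule whose consequent (rhs) is whole milk
milk_rules <- subset(rules, subset = rhs %in% "whole milk")
# show them ordered by lift, strongest association first
arules::inspect(sort(milk_rules, by = "lift"))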

Of these, {semi-finished bread} => {whole milk} has the highest support, confidence, and lift, and with a lift above 1 it is the only rule here indicating a positive association with whole milk. {flour} => {whole milk} and {pasta} => {whole milk} both have lift values below 1, meaning these items appear with whole milk slightly less often than would be expected if the purchases were independent.

plot(rules[1:30], method="graph",engine="htmlwidget")

Looking at rolls/buns:

{packaged fruit/vegetables} => {rolls/buns} support = 0.0012 confidence = 0.142 coverage = 0.00849 lift = 1.29

{seasonal products} => {rolls/buns} support = 0.001 confidence = 0.142 coverage = 0.00708 lift = 1.29

{processed cheese} => {rolls/buns} support = 0.00147 confidence = 0.145 coverage = 0.0102 lift = 1.32

{detergent} => {rolls/buns} support = 0.001 confidence = 0.116 coverage = 0.00862 lift = 1.06

{soft cheese} => {rolls/buns} support = 0.001 confidence = 0.1 coverage = 0.01 lift = 0.909

{cat food} => {rolls/buns} support = 0.00107 confidence = 0.0904 coverage = 0.0118 lift = 0.822

{red/blush wine} => {rolls/buns} support = 0.00134 confidence = 0.127 coverage = 0.0105 lift = 1.16

I have observed that processed cheese, seasonal products, and packaged fruit/vegetables all pair with rolls/buns at lift values around 1.3, indicating a positive association: these items are commonly purchased together. Red/blush wine shows a milder positive association (lift 1.16), and detergent sits close to independence (lift 1.06).

On the other hand, cat food and soft cheese have lift values below 1, indicating a negative association with rolls/buns: customers buy them in the same transaction less often than chance would suggest.
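
More generally, sorting the rule set by lift surfaces the strongest positive associations directly. A minimal sketch, again using the 'rules' object from above:

# the ten rules with the highest lift in the mined rule set
arules::inspect(head(sort(rules, by = "lift"), n = 10))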

plot(rules[1:10], method="paracoord")

Conclusion

The Apriori algorithm proved to be a useful tool for constructing the model to identify association rules. Its ability to consistently produce the same results has made it a popular choice for discovering patterns in transactional data.

However, it’s important to note that the model constructed in this analysis was not assessed using any test data. Therefore, it may be useful to add further evaluation techniques to ensure the accuracy and validity of the model.
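
One possible evaluation step, sketched below under the assumption that the 'data' transactions object created earlier is available (the 80/20 split proportion is an arbitrary choice), is to mine rules on a training split and then recompute their quality measures on the held-out transactions with arules' interestMeasure():

# hold out 20% of transactions for evaluation
set.seed(42)
n <- length(data)
train_idx <- sample(n, size = floor(0.8 * n))
train_txn <- data[train_idx]
test_txn <- data[-train_idx]
# mine rules on the training transactions only
train_rules <- arules::apriori(train_txn, parameter = list(minlen = 2, sup = 0.001, conf = 0.05, target = "rules"))
# recompute support, confidence, and lift on the held-out transactions
holdout_quality <- arules::interestMeasure(train_rules, measure = c("support", "confidence", "lift"), transactions = test_txn)
summary(holdout_quality)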

One approach to further evaluate the model would be to identify anomalies or rare items, since the items associated with an anomaly can then be promoted alongside it. Additionally, it is essential to balance the values of lift and support to avoid bias towards a particular rule.

Lastly, the visual interpretability of the model is an essential factor that should not be overlooked. Visualizing the association rules can help us understand the patterns and relationships more intuitively, making it easier to interpret and apply them in real-life scenarios.

In conclusion, while the Apriori algorithm has proven to be a useful technique for constructing the model and identifying association rules, further evaluation and consideration of additional factors such as anomalies, lift, support, and visual interpretability can help improve the accuracy and usefulness of the model in real-life applications.

Resources

https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

https://medium.com/hengky-sanjaya-blog/association-rule-mining-74c8256a04fb

https://www.r-bloggers.com/2016/07/implementing-apriori-algorithm-in-r/

https://towardsdatascience.com/association-rules-with-apriori-algorithm-574593e35223