The goal of this analysis is to uncover associations between items that are frequently bought together.
We performed basic Exploratory Data Analysis to understand sales patterns across the two years of data.
The following packages are required for the analysis:
library(readr)        # read the csv files
library(stringr)      # string manipulation (e.g., str_replace)
library(tidyr)        # creating tidy data
library(plyr)         # tools for splitting, applying and combining data
library(dplyr)        # flexible data manipulation
library(lubridate)    # easy date and time manipulation
library(magrittr)     # piping capabilities for neat code
library(ggplot2)      # visualizing data
library(gridExtra)    # arranging plots in a grid
library(plotly)       # interactive (3D) plotting
library(arules)       # mining transaction data for frequent itemsets and association rules
library(arulesViz)    # visualization techniques for association rules and itemsets
library(RColorBrewer) # color schemes for plotting
This dataset contains household-level transactions over two years from a group of 2,500 households who shop at a retailer. It contains all of each household's purchases, not just those from a limited number of categories. For certain households, demographic information and direct marketing contact history are also included.
The data is loaded from 8 CSV files into the R workspace.
file <- c("product", "hh_demographic", "coupon", "campaign_table", "coupon_redempt",
          "transaction_data", "campaign_desc", "causal_data")
for (i in seq_along(file)) {
  file_name <- paste0("data/", file[i], ".csv")
  assign(paste0("df_", file[i]), read_csv(file_name))
}
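As a quick sanity check that every table loaded, a minimal sketch that prints the row count of each data frame created by the loop:
# Row counts of the loaded tables
sapply(paste0("df_", file), function(nm) nrow(get(nm)))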
#Missing value check
keep_dept <- which(!is.na(df_product$department)) # rows where department is present
df_product <- df_product[keep_dept, ]             # 15 rows with NA department removed
sum(is.na(df_product$department))                 # 0
#creating levels for age
df_hh_demographic$age_desc <- factor(df_hh_demographic$age_desc)
#creating income levels in ascending order
df_hh_demographic$income_desc <- factor(df_hh_demographic$income_desc,
                                        levels = c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K",
                                                   "100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))
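A quick check that the income ordering took effect:
levels(df_hh_demographic$income_desc)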
#Check whether transaction_data contains products missing from the product table
anti_join(df_transaction_data, df_product, by = "product_id") # 0 rows: every product_id is covered
#creating year and quarter columns
df_transaction_data <- df_transaction_data %>%
  as.data.frame() %>%
  mutate(year = paste0("Y", ceiling(week_no / 52))) %>%
  mutate(quarter = case_when(
    week_no >= 1  & week_no <= 13 ~ "Y1-Q1",
    week_no >= 14 & week_no <= 26 ~ "Y1-Q2",
    week_no >= 27 & week_no <= 39 ~ "Y1-Q3",
    week_no >= 40 & week_no <= 52 ~ "Y1-Q4",
    week_no >= 53 & week_no <= 65 ~ "Y2-Q1",
    week_no >= 66 & week_no <= 78 ~ "Y2-Q2",
    week_no >= 79 & week_no <= 91 ~ "Y2-Q3",
    TRUE ~ "Y2-Q4"
  ))
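The quarter labels can also be derived arithmetically instead of being hard-coded, which would scale to further years of data; a minimal sketch that should reproduce the case_when labels for weeks 1-104:
# Sketch: quarter within a year is ceiling(week-within-year / 13)
df_transaction_data %>%
  mutate(quarter2 = paste0("Y", ceiling(week_no / 52), "-Q",
                           ceiling(((week_no - 1) %% 52 + 1) / 13))) %>%
  count(quarter, quarter2) # the two labelings should agree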
#Joining cleaned product and transaction data for further analysis
df_td_prod <- left_join(df_transaction_data, df_product, by = "product_id")
glimpse(df_td_prod)
## Observations: 2,595,732
## Variables: 20
## $ household_key <int> 2375, 2375, 2375, 2375, 2375, 2375, 2375,...
## $ basket_id <dbl> 26984851472, 26984851472, 26984851472, 26...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ product_id <int> 1004906, 1033142, 1036325, 1082185, 81604...
## $ quantity <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales_value <dbl> 1.39, 0.82, 0.99, 1.21, 1.50, 1.98, 1.57,...
## $ store_id <int> 364, 364, 364, 364, 364, 364, 364, 364, 3...
## $ retail_disc <dbl> -0.60, 0.00, -0.30, 0.00, -0.39, -0.60, -...
## $ trans_time <chr> "1631", "1631", "1631", "1631", "1631", "...
## $ week_no <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ year <chr> "Y1", "Y1", "Y1", "Y1", "Y1", "Y1", "Y1",...
## $ quarter <chr> "Y1-Q1", "Y1-Q1", "Y1-Q1", "Y1-Q1", "Y1-Q...
## $ manufacturer <int> 69, 2, 69, 2, 69, 69, 321, 586, 1266, 586...
## $ department <chr> "PRODUCE", "PRODUCE", "PRODUCE", "PRODUCE...
## $ brand <chr> "Private", "National", "Private", "Nation...
## $ commodity_desc <chr> "POTATOES", "ONIONS", "VEGETABLES - ALL O...
## $ sub_commodity_desc <chr> "POTATOES RUSSET (BULK&BAG)", "ONIONS SWE...
## $ curr_size_of_product <chr> "5 LB", "40 LB", NA, "40 LB", "1 LB", "12...
We first started by looking at the data from a high level: total sales across the eight quarters of the two years. Total sales were lower in Y1-Q1 and Y2-Q4. The low Y2-Q4 figure can be attributed to incomplete data, since transactions only run up to week 102 (which can be checked with max(df_transaction_data$week_no)). Our hypothesis for the low Y1-Q1 sales is that the company could have been at its inception stage. For the analysis we picked Y2-Q2 and Y2-Q3, as they have roughly consistent total sales and are the most recent complete quarters available.
df_transaction_data %>%
  group_by(quarter) %>%
  summarize(sales_value = sum(sales_value)) %>%
  ggplot(aes(x = quarter, y = sales_value, group = 1)) +
  geom_line(color = "black", size = 1.0, alpha = 0.5) +
  ylim(380000, 1200000) +
  ylab("Total Sales Value") +
  ggtitle("Overall Sales by Quarters")
We then look at the top 10 departments by sales. GROCERY is the department with the highest sales, accounting for almost 50% of the total.
# Helper to print axis labels in scientific notation (e.g. 2 %*% 10^5)
fancy_scientific <- function(l) {
  l <- format(l, scientific = TRUE)  # turn values into e-notation
  l <- gsub("^(.*)e", "'\\1'e", l)   # quote the mantissa
  l <- gsub("e", "%*%10^", l)        # e+05 -> %*% 10^5
  parse(text = l)                    # return as plotmath expressions
}
color.function <- colorRampPalette( c( "#CCCCCC" , "#104E8B" ) )
df_td_prod %>%
  filter(quarter %in% c("Y2-Q2", "Y2-Q3")) %>%
  group_by(department) %>%
  summarize(sales_value = sum(sales_value)) %>%
  arrange(desc(sales_value)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(department, sales_value), y = sales_value, fill = as.factor(sales_value))) +
  xlab("Department") +
  geom_bar(stat = "identity") +
  scale_y_sqrt(name = "Total Sales Value", labels = fancy_scientific) +
  scale_fill_manual(values = color.function(10)) +
  guides(fill = FALSE) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  ggtitle("Top 10 Departments by Total Sales")
We next look at the top 10 commodities under GROCERY by revenue, alongside the top 10 most frequently purchased commodities under GROCERY. Plotting the two side by side, we notice that FLUID MILK PRODUCTS is bought most frequently and also contributes the most to sales. Milk is evidently a staple purchase, so it may not need coupons: it is already the highest-selling commodity on its own.
color.function1 <- colorRampPalette( c( "#FBEEE6" , "#6E2C00" ))
top_grocery_subcom <- df_td_prod %>%
  filter(department == "GROCERY") %>%
  filter(quarter %in% c("Y2-Q2", "Y2-Q3")) %>%
  group_by(commodity_desc) %>%
  summarise(sales_value = sum(sales_value)) %>%
  arrange(desc(sales_value)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(commodity_desc, sales_value), y = sales_value, fill = as.factor(sales_value))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = color.function1(10)) +
  guides(fill = FALSE) +
  xlab("Commodities") +
  ylab("Total Sales Value") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  ggtitle("Top 10 Commodities-Sales")
color.function2 <- colorRampPalette( c( "#E8F8F5" , "#0E6251" ))
top_grocery_trans <- df_td_prod %>%
  filter(department == "GROCERY") %>%
  filter(quarter %in% c("Y2-Q2", "Y2-Q3")) %>%
  group_by(commodity_desc) %>%
  summarise(purchase = n_distinct(basket_id)) %>%
  arrange(desc(purchase)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(commodity_desc, purchase), y = purchase, fill = as.factor(purchase))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = color.function2(10)) +
  guides(fill = FALSE) +
  xlab("Commodities") +
  ylab("No. of transactions it occurred in") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  ggtitle("Top 10 Commodities-Transactions")
grid.arrange(top_grocery_subcom, top_grocery_trans, ncol=2)
We next study the demographics of the people who purchase the most from these top commodities; this tells us which group to target with coupons. Hovering over the plot, we observe that the most transactions come from people in the 50-74K income group with a household size of 2-3.
dim3_plot <- df_td_prod %>%
  filter(department == "GROCERY") %>%
  group_by(commodity_desc) %>%
  summarise(purchase = n_distinct(basket_id)) %>%
  arrange(desc(purchase)) %>%
  head(10) %>%
  inner_join(df_td_prod, by = "commodity_desc") %>%
  inner_join(df_hh_demographic, by = "household_key") %>%
  group_by(age_desc, income_desc, household_size_desc) %>%
  summarise(purchase = n_distinct(basket_id))
plot_ly(dim3_plot,
        x = ~income_desc, y = ~age_desc, z = ~household_size_desc,
        marker = list(color = dim3_plot$purchase,
                      colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Income Description'),
                      yaxis = list(title = 'Age Description'),
                      zaxis = list(title = 'Household Size')))
We will perform Market Basket Analysis on the transaction data for quarters 2 and 3 of year 2 (Y2-Q2 and Y2-Q3), restricted to the GROCERY department, at the commodity level.
# Filtering data
mba_transaction <- df_transaction_data %>%
  filter(quarter %in% c("Y2-Q2", "Y2-Q3")) %>%
  inner_join(df_product, by = "product_id") %>%
  filter(department == "GROCERY") %>%
  select(basket_id, day, commodity_desc)
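Before collapsing to baskets, it is worth checking how many distinct baskets the filter leaves; assuming each basket_id occurs on a single day, this should equal the number of rows after the collapse below:
# Number of distinct baskets in the filtered data
n_distinct(mba_transaction$basket_id)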
For rule mining, the transaction data must first be reshaped into basket format, in which all items bought together on one invoice appear in a single row.
#Converting the transaction data to basket format: one row per basket
mba_transaction <- ddply(mba_transaction,
                         c("basket_id", "day"),
                         function(df) paste(df$commodity_desc, collapse = ","))
# Basket ID and Day play no part in rule mining, so we drop both columns
mba_transaction$basket_id <- NULL
mba_transaction$day <- NULL
#Rename the remaining column to Items
colnames(mba_transaction) <- c("Items")
glimpse(mba_transaction)
## Observations: 60,569
## Variables: 1
## $ Items <chr> "BAG SNACKS", "BAKED BREAD/BUNS/ROLLS", "SOFT DRINKS", "...
We now have the transaction data in basket format. The next step is to load it into an object of the arules transactions class.
# Store the transaction data in a .csv; row.names = FALSE so that row numbers
# are not written out and later parsed as items
write.csv(mba_transaction, "data/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)
# Read the .csv back in as an object of the transactions class
tr <- read.transactions("data/market_basket_transactions.csv", format = 'basket', sep = ',')
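As an aside, the CSV round trip can be avoided: arules can coerce a list of item vectors directly into a transactions object. A minimal sketch, where mba_filtered stands for a copy of the filtered data kept before the ddply collapse (a hypothetical name, not created above):
# Sketch: in-memory conversion; mba_filtered has columns basket_id and commodity_desc
tr_alt <- as(split(mba_filtered$commodity_desc, mba_filtered$basket_id), "transactions")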
Below is a plot of the relative item frequencies for the 20 most frequent commodities.
# Create an item frequency plot for the top 20 items
itemFrequencyPlot(tr, topN = 20, type = "relative", col = brewer.pal(8, 'Pastel2'), cex.names = 0.7, main = "Relative Item Frequency Plot")
FLUID MILK PRODUCTS is the most frequently purchased commodity; SOFT DRINKS, BAKED BREAD/BUNS/ROLLS, and CHEESE follow closely.
We use the Apriori algorithm to mine the association rules, setting a minimum support of 0.001 and a minimum confidence of 0.8 (support is the share of baskets that contain all items in a rule; confidence is the share of baskets containing the left-hand side that also contain the right-hand side), and capping rule length at 10 items via maxlen.
total.association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, maxlen=10))
summary(total.association.rules)
## set of 494349 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6 7 8 9
## 27 15423 125377 210812 120774 21453 483
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.018 7.000 9.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001007 Min. :0.8000 Min. : 2.463 Min. : 61.00
## 1st Qu.:0.001106 1st Qu.:0.8214 1st Qu.: 2.679 1st Qu.: 67.00
## Median :0.001255 Median :0.8438 Median : 2.976 Median : 76.00
## Mean :0.001459 Mean :0.8490 Mean : 3.168 Mean : 88.35
## 3rd Qu.:0.001568 3rd Qu.:0.8712 3rd Qu.: 3.681 3rd Qu.: 95.00
## Max. :0.028645 Max. :1.0000 Max. :17.369 Max. :1735.00
##
## mining info:
## data ntransactions support confidence
## tr 60570 0.001 0.8
There are 494,349 rules in total. Of these, 27 have rule length 3 (two items on the left-hand side, one on the right); let's look at these more closely by re-mining with maxlen = 3.
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, maxlen=3))
Let us inspect the first 10 of these rules.
#Print the first 10 rules
inspect(association.rules[1:10])
## lhs rhs support confidence lift count
## [1] {FRZN JCE CONC/DRNKS,
## HISPANIC} => {CHEESE} 0.001056629 0.8205128 3.705798 64
## [2] {FROZEN PIZZA,
## FRZN JCE CONC/DRNKS} => {FLUID MILK PRODUCTS} 0.001518904 0.8000000 2.463196 92
## [3] {CANNED MILK,
## HISPANIC} => {CHEESE} 0.001221727 0.8222222 3.713519 74
## [4] {CANNED MILK,
## HISPANIC} => {FLUID MILK PRODUCTS} 0.001205217 0.8111111 2.497407 73
## [5] {OLIVES,
## PASTA SAUCE} => {CHEESE} 0.001370315 0.8300000 3.748647 83
## [6] {MARGARINES,
## OLIVES} => {CHEESE} 0.001551923 0.8034188 3.628594 94
## [7] {MEAT - SHELF STABLE,
## OLIVES} => {CHEESE} 0.001684002 0.8160000 3.685416 102
## [8] {BLEACH,
## REFRGRATD DOUGH PRODUCTS} => {FLUID MILK PRODUCTS} 0.001684002 0.8031496 2.472894 102
## [9] {COCOA MIXES,
## HOUSEHOLD CLEANG NEEDS} => {FLUID MILK PRODUCTS} 0.001056629 0.8101266 2.494376 64
## [10] {COCOA MIXES,
## PNT BTR/JELLY/JAMS} => {FLUID MILK PRODUCTS} 0.002080238 0.8025478 2.471041 126
From this output we can read off individual rules; for example, rule [2] says that 80% of baskets containing FROZEN PIZZA and FRZN JCE CONC/DRNKS also contain FLUID MILK PRODUCTS (support 0.0015, lift 2.46).
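To rank the rules by a quality measure instead of mining order, the head() method of arules accepts a by argument; a small sketch:
# Sketch: the 10 three-item rules with the highest lift
inspect(head(association.rules, n = 10, by = "lift"))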
As we saw in the frequency plot, FLUID MILK PRODUCTS is the most frequently purchased commodity, so it is interesting to look at the association rules that conclude in it, as sketched below.
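A minimal sketch of how such rules could be mined directly, using the appearance argument of apriori to pin the right-hand side (same thresholds as above):
# Sketch: only mine rules that conclude in FLUID MILK PRODUCTS
milk.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.8),
                      appearance = list(default = "lhs", rhs = "FLUID MILK PRODUCTS"))
inspect(head(milk.rules, n = 5, by = "confidence"))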
Below is an interactive scatter plot of the support and confidence of the mined rules. Hover over a rule to see all of its quality measures.
# Keep rules with confidence above 0.4; since every mined rule already has
# confidence >= 0.8, this keeps all of them (the filter only matters if the
# mining threshold is lowered)
subRules <- total.association.rules[quality(total.association.rules)$confidence > 0.4]
#Plot interactive rules
plotly_arules(subRules)
The plot above shows that rules with high lift tend to have low support.
Let’s now look at a graphical representation of the top 10 rules. Select a commodity to see the associated rules, or directly select the rule number.
#Graphical representation
top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget")
Rule 1 states that 100% of the people who buy DRY NOODLES/PASTA, MILK BY-PRODUCTS, MISC. DAIRY, and VEGETABLES also buy CHEESE. Hence, we recommend creating a bundle of products from these five commodities and applying a discount to the bundle, to further improve the sales of cheese.
The other rules can be used in the same way to create product bundles and apply discounts to the targeted commodities.
This analysis can also inform product placement: products that appear together in a rule can be shelved near each other to encourage co-purchase.
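Finally, the mined rules can be exported as a plain table for the merchandising team; a minimal sketch (rules_for_merch.csv is a hypothetical file name):
# Sketch: coerce the top rules to a data frame and write them out
rules_df <- as(top10subRules, "data.frame")
write.csv(rules_df, "data/rules_for_merch.csv", row.names = FALSE)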