R-Project Final Report

Synopsis

Problem Statement:

To uncover the association between items that are frequently bought together.

  • This will help us provide appropriate discount coupons strategically to the frequently bought products.
  • This will also help us in formulating campaigns and distributing coupons to the customers by studying the demographics in relation with the products.

Implementation:

We performed basic Exploratory Data Analysis to understand sales patterns across the two years of data.

  • We looked at total sales by department, and narrowed down our focus to the highest sales revenue generating department. We further analysed the top products by sales and transactions.
  • We then explored the customer data in association with the top products by transaction.
  • We performed Market Basket Analysis on transaction data for two quarters and generated rules which helped us identify most frequently bought product combinations.

Summary:

  • We see from our analysis that sales were consistent across all quarters except for the first and last. We hypothesise that, this pattern is due to incomplete data for the last quarter and a possible slow sales pick-up due to the store being in its inception stage for the first quarter.
  • Department-wise analysis shows that Grocery is the topmost revenue generating department, accounting for almost 50% of the total sales.
  • Customer demographic analysis shows that the most active customers for the company are aged between 25 and 34, have an income of 50-74K per annum with an average household size of 2-3.
  • Using the rules identified from Market Basket Analysis, we can create the right product bundles and give appropriate discounts on the same.

Packages Required

Following are the packages required in the course of our analysis:

library(readr)       #to read the csv files
library(stringr)     #for string manipulations like str_replace etc
library(tidyr)       #creating tidy data
library(plyr)        #Tools for Splitting, Applying and Combining Data
library(dplyr)       #for a flexibility in data manipulation
library(lubridate)   #for easy time and date manipulation
library(magrittr)    #for piping capabilities and to write neat code
library(ggplot2)     #visualizing data
library(gridExtra)   #Arranging items in grid
library(plotly)      #3D Plotting
library(arules)      #for analyzing transaction data and patterns (frequent itemsets and association rules).
library(arulesViz)   #Extends package 'arules' with various visualization techniques for association rules and item-sets. 
library(RColorBrewer)#Color schemes for plotting

Data Wrangling

Data Source: 84.51-Customer Journey study, http://www.8451.com/area51/

This dataset contains household level transactions over two years from a group of 2,500 households who shop at a retailer. It contains all of the purchases from each household’s purchases, not just those form a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

Data Loading:

The data is loaded from 9 CSV files into the R workspace.

file <- c("product", "hh_demographic", "coupon", "campaign_table", "coupon_redempt",
          "transaction_data", "campaign_table", "campaign_desc", "causal_data")
for (i in seq_along(file)) {
  file_name <- paste0( "data/", file[i], ".csv")
  assign(paste0("df_",file[i]), read_csv(file_name))
}

Data Cleaning

#Missing value check
miss_dept <- which(!is.na(df_product$department))
df_product <- df_product[miss_dept,] #15 null values removed
sum(is.na(df_product$department)) #0

#cleaning the hh_comp_desc column
age_desc <- factor(df_hh_demographic$age_desc)
#creating levels for income
income_desc <- factor(df_hh_demographic$income_desc, levels = c("Under 15K", "15-24K", "25-34K", "35-49K","50-74K", "75-99K" , "100-124K", "125-149K" ,"150-174K","175-199K", "200-249K" ,"250K+"))

#Check if transaction_data has products outside the product table
anti_join(df_transaction_data, df_product, by = "product_id") #0

#creating year and quarter column
df_transaction_data <- df_transaction_data %>% 
                      as.data.frame() %>% 
                      mutate(year = paste0('Y',ceiling(week_no/52))) %>% 
                      mutate(quarter = case_when(
                              .$week_no >= 1 & .$week_no <= 13   ~ "Y1-Q1",
                              .$week_no >= 14 & .$week_no <= 26  ~ "Y1-Q2",
                              .$week_no >= 27 & .$week_no <= 39  ~ "Y1-Q3",
                              .$week_no >= 40 & .$week_no <= 52  ~ "Y1-Q4",
                              .$week_no >= 53 & .$week_no <= 65  ~ "Y2-Q1",
                              .$week_no >= 66 & .$week_no <= 78  ~ "Y2-Q2",
                              .$week_no >= 79 & .$week_no <= 91  ~ "Y2-Q3",
                              TRUE ~ "Y2-Q4"
                              ))

#Joining cleaned product and transaction data for further analysis
df_td_prod <- left_join(df_transaction_data, df_product, by = "product_id")

A glimpse of the clean and tidy data

glimpse(df_td_prod)
## Observations: 2,595,732
## Variables: 20
## $ household_key        <int> 2375, 2375, 2375, 2375, 2375, 2375, 2375,...
## $ basket_id            <dbl> 26984851472, 26984851472, 26984851472, 26...
## $ day                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ product_id           <int> 1004906, 1033142, 1036325, 1082185, 81604...
## $ quantity             <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales_value          <dbl> 1.39, 0.82, 0.99, 1.21, 1.50, 1.98, 1.57,...
## $ store_id             <int> 364, 364, 364, 364, 364, 364, 364, 364, 3...
## $ retail_disc          <dbl> -0.60, 0.00, -0.30, 0.00, -0.39, -0.60, -...
## $ trans_time           <chr> "1631", "1631", "1631", "1631", "1631", "...
## $ week_no              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ coupon_disc          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupon_match_disc    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ year                 <chr> "Y1", "Y1", "Y1", "Y1", "Y1", "Y1", "Y1",...
## $ quarter              <chr> "Y1-Q1", "Y1-Q1", "Y1-Q1", "Y1-Q1", "Y1-Q...
## $ manufacturer         <int> 69, 2, 69, 2, 69, 69, 321, 586, 1266, 586...
## $ department           <chr> "PRODUCE", "PRODUCE", "PRODUCE", "PRODUCE...
## $ brand                <chr> "Private", "National", "Private", "Nation...
## $ commodity_desc       <chr> "POTATOES", "ONIONS", "VEGETABLES - ALL O...
## $ sub_commodity_desc   <chr> "POTATOES RUSSET (BULK&BAG)", "ONIONS SWE...
## $ curr_size_of_product <chr> "5 LB", "40 LB", NA, "40 LB", "1 LB", "12...

Exploratory Data Analysis

High-level sales analysis

Overall Sales by Quarter

We first started by looking at the data from a high level. We looked at the total sales across the 4 quarters of both the years. We observe that the total sales was lesser for Y1-Q1 and Y2-Q4. The low sales in Y2-Q4 could be attributed to incomplete data, as we have data only until week - 102. Our hypothesis for the low sales for Y1-Q1 is that the company could be at its inception stage. For the purpose of analysis we have picked “Y2-Q2” and “Y3-Q3”, as they have almost consistent total sales, and are the most recent data we have.

  df_transaction_data %>% 
  group_by(quarter) %>% 
  summarize(sales_value = sum(sales_value)) %>%
  ggplot(aes(x = quarter, y = sales_value, group = 1)) + 
  geom_line(color = "black", size = 1.0, alpha = 0.5) +
  ylim(380000,1200000) + 
  ylab("Total Sales Value") +
  ggtitle("Overall Sales by Quarters")

Overall Sales by Department for Y2-Q2 & Y2-Q3

We then look at the top 10 departments by sale. We observe grocery to be the department with highest sales (almost 50% of the total sales).

fancy_scientific <- function(l) {
  l <- format(l, scientific = TRUE)
  l <- gsub("^(.*)e", "'\\1'e", l)
  l <- gsub("e", "%*%10^", l)
  parse(text=l)
}
  color.function <- colorRampPalette( c( "#CCCCCC" , "#104E8B" ) )
  df_td_prod %>% 
  filter(quarter == "Y2-Q1" | quarter == "Y3-Q2") %>%
  group_by(department) %>%
  summarize(sales_value = sum(sales_value)) %>%
  arrange(desc(sales_value)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(department, sales_value), y = sales_value, fill = as.factor(sales_value))) + 
  xlab("Department") +
  geom_bar(stat = "identity") + scale_y_sqrt(name = "Total Sales Value", labels = fancy_scientific) +
  scale_fill_manual(values = color.function(10)) + guides(fill = FALSE) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))+
  ggtitle("Top 10 Departments by Total Sales")  

Top 10 Commodities under Grocery by Total Sales and Total Transactions

We want to look at the top 10 commodties under Grocery that add to the revenue. We also want to look at the top 10 frequently purchased commodities under Grocery. Plotting the two, we notice that FLUID MILK PRODUCTS is bought most frequently and it also contributes to the sales the most. This shows that people purchase milk always, therefore it would be okay if we do not give coupons for the same as it is a highest selling product in itself.

color.function1 <- colorRampPalette( c( "#FBEEE6" , "#6E2C00" ))
top_grocery_subcom <- df_td_prod %>% filter(department == "GROCERY") %>%
  filter(quarter == "Y2-Q1" | quarter == "Y3-Q2") %>%
  group_by(commodity_desc) %>%
  summarise(sales_value = sum(sales_value)) %>%
  arrange(desc(sales_value)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(commodity_desc, sales_value), y = sales_value, fill = as.factor(sales_value))) +
  geom_bar(stat = "identity") + scale_fill_manual(values = color.function1(10)) +
  guides(fill = FALSE) +
  xlab("Commodities") +
  ylab("Total Sales Value") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  ggtitle("Top 10 Commodities-Sales")

color.function2 <- colorRampPalette( c( "#E8F8F5" , "#0E6251" ))
top_grocery_trans <- df_td_prod %>% filter(department == "GROCERY") %>%
  filter(quarter == "Y2-Q1" | quarter == "Y3-Q2") %>%
  group_by(commodity_desc) %>%
  summarise(purchase = n_distinct(basket_id)) %>%
  arrange(desc(purchase)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(commodity_desc, purchase), y = purchase, fill = as.factor(purchase))) +
  geom_bar(stat = "identity") + scale_fill_manual(values = color.function2(10)) +
  guides(fill = FALSE) +
  xlab("Commodities") +
  ylab("No. of Transactions it occured in") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  ggtitle("Top 10 Commodities-Transactions")

grid.arrange(top_grocery_subcom, top_grocery_trans, ncol=2)

Sales analysis by demographics

Demographics of the people that buy the top 10 Commodites under Grocery

We want to study the demographics of the people who make the most purchases from the top commodities. This will tell us the group of people we should target for giving coupons. If we hover over the plot, we observe that maximum transactions are done by people from the income group of 50-74K and with a household size of 2-3 people.

dim3_plot <- 
  df_td_prod %>% 
  filter(department == "GROCERY") %>%
  group_by(commodity_desc) %>%
  summarise(purchase = n_distinct(basket_id)) %>%
  arrange(desc(purchase)) %>%
  head(10) %>%
  inner_join(df_td_prod,by = "commodity_desc") %>%
  inner_join(df_hh_demographic) %>%
  group_by(age_desc, income_desc, household_size_desc) %>%
  summarise(purchase = n_distinct(basket_id))

plot_ly(dim3_plot, 
        x = as.factor(dim3_plot$income_desc), y = as.factor(dim3_plot$age_desc), z = dim3_plot$household_size_desc, 
        marker = list(color = dim3_plot$purchase, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>%
        add_markers() %>%
        layout(scene = list(xaxis = list(title = 'Income Description'),
                            yaxis = list(title = 'Age Description'),
                            zaxis = list(title = 'Household Size')))

Market Basket Analysis

We will perform Market Basket Analysis for the transaction data for quarters 2 and 3 of year 2, for the department grocery at the product level commodity.

Data pre-processing

# Filtering data 
mba_transaction <- df_transaction_data %>% 
  filter(quarter == 'Y2-Q2' | quarter == 'Y2-Q3') %>% 
  inner_join(df_product, by = "product_id") %>% 
  filter(department == "GROCERY") %>% 
  select(basket_id, day, commodity_desc)

The atomic transaction data should be in a format so that all items that are bought together in one invoice are in one row.

#Converting transaction data to atomic format
mba_transaction <- ddply (mba_transaction, 
                          c("basket_id","day"),
                          function(df) paste(df$commodity_desc, collapse = ","))

# As Basket ID and Day will not be of any use in the rule mining, we are setting them to NULL
# Set column basket_id to NULL 
mba_transaction$basket_id <- NULL
# Set column basket_id to NULL 
mba_transaction$day <- NULL
#Rename column to items
colnames(mba_transaction) <- c("Items")

Glimpse of final basket format transaction data for Market Basket Analysis

glimpse(mba_transaction)
## Observations: 60,569
## Variables: 1
## $ Items <chr> "BAG SNACKS", "BAKED BREAD/BUNS/ROLLS", "SOFT DRINKS", "...

Creating Transaction Object

We now have the transaction data in the basket format. The next step is to load this transaction data into an object of the transaction class.

# Store the transaction data into a .csv
write.csv(mba_transaction,"Data/market_basket_transactions.csv", quote = FALSE, row.names = TRUE)
# Convert the .csv into an object of the transaction class
tr <- read.transactions("Data/market_basket_transactions.csv", format = 'basket', sep=',')

Item Frequency Plot

Below is a plot of the relative item frequencies for the top 20 frequent commodities.

# Create an item frequency plot for the top 20 items
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Pastel2'), cex.names = 0.7, main="Relative Item Frequency Plot")

Fluid Milk Products is the most frequently purchased commodity. Soft drinks, Baked Bread/Buns/Rolls, Cheese closely follow.

Generating Association Rules

We are using the APRIORI algorithm to mine the association rules. We are setting a threshold of min support to 0.001, confidence to 0.8.

total.association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, maxlen=10))
summary(total.association.rules)
## set of 494349 rules
## 
## rule length distribution (lhs + rhs):sizes
##      3      4      5      6      7      8      9 
##     27  15423 125377 210812 120774  21453    483 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   6.018   7.000   9.000 
## 
## summary of quality measures:
##     support           confidence          lift            count        
##  Min.   :0.001007   Min.   :0.8000   Min.   : 2.463   Min.   :  61.00  
##  1st Qu.:0.001106   1st Qu.:0.8214   1st Qu.: 2.679   1st Qu.:  67.00  
##  Median :0.001255   Median :0.8438   Median : 2.976   Median :  76.00  
##  Mean   :0.001459   Mean   :0.8490   Mean   : 3.168   Mean   :  88.35  
##  3rd Qu.:0.001568   3rd Qu.:0.8712   3rd Qu.: 3.681   3rd Qu.:  95.00  
##  Max.   :0.028645   Max.   :1.0000   Max.   :17.369   Max.   :1735.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr         60570   0.001        0.8

There are a total of 494349 rules. Out of these, there are 27 rules with itemlist length 3, let’s closely look at these rules.

association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, maxlen=3))

Let us take a look at the top 10 rules.

#Print top 10 rules 
inspect(association.rules[1:10])
##      lhs                           rhs                       support confidence     lift count
## [1]  {FRZN JCE CONC/DRNKS,                                                                    
##       HISPANIC}                 => {CHEESE}              0.001056629  0.8205128 3.705798    64
## [2]  {FROZEN PIZZA,                                                                           
##       FRZN JCE CONC/DRNKS}      => {FLUID MILK PRODUCTS} 0.001518904  0.8000000 2.463196    92
## [3]  {CANNED MILK,                                                                            
##       HISPANIC}                 => {CHEESE}              0.001221727  0.8222222 3.713519    74
## [4]  {CANNED MILK,                                                                            
##       HISPANIC}                 => {FLUID MILK PRODUCTS} 0.001205217  0.8111111 2.497407    73
## [5]  {OLIVES,                                                                                 
##       PASTA SAUCE}              => {CHEESE}              0.001370315  0.8300000 3.748647    83
## [6]  {MARGARINES,                                                                             
##       OLIVES}                   => {CHEESE}              0.001551923  0.8034188 3.628594    94
## [7]  {MEAT - SHELF STABLE,                                                                    
##       OLIVES}                   => {CHEESE}              0.001684002  0.8160000 3.685416   102
## [8]  {BLEACH,                                                                                 
##       REFRGRATD DOUGH PRODUCTS} => {FLUID MILK PRODUCTS} 0.001684002  0.8031496 2.472894   102
## [9]  {COCOA MIXES,                                                                            
##       HOUSEHOLD CLEANG NEEDS}   => {FLUID MILK PRODUCTS} 0.001056629  0.8101266 2.494376    64
## [10] {COCOA MIXES,                                                                            
##       PNT BTR/JELLY/JAMS}       => {FLUID MILK PRODUCTS} 0.002080238  0.8025478 2.471041   126

From the above output, we can make analysis such as:

  • 83% of people who bought Olives and Pasta Sauce also bought Cheese.
  • 80% of people who bought Frozen Pizza and Frozen Juice also bought Fluid Milk Products.

As we saw in the frequency plot, Fluid milk products are the most frequently purchased commodity. It will be interesting to see the association rules specific to this product.

Visualising Associatiion Rules

Below is an interactive scatter plot of the support and confidence of the rules. Hover over a rule to see all its quality measures.

# Filter rules with confidence greater than 0.4 or 40%
subRules<-total.association.rules[quality(total.association.rules)$confidence>0.4]

#Plot interactive rules
plotly_arules(subRules)

The plot above shows that rules with a high lift have a low support.

Let’s now look at a graphical representation of the top 10 rules. Select a commodity to see the associated rules, or directly select the rule number.

#Graphical representation
top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph",  engine = "htmlwidget")
  • Rule 1 states that 100% of the people who buy DRY NOODLES/PASTA, MILK BY-PRODUCTS, MISC. DAIRY, VEGETABLES also buy CHEESE. Hence, we recommed creating a product bundle of products belonging to these 5 commodities, and applying a discount on this bundle, to further improve the sales of cheese.

  • We can use the rules similarly, to create product bundles and apply discount on the tagerted commodities.

  • This analysis can also be used for product placement, all products belonging to one bundle can be placed strategically to increase the lift.