1. Introduction

1.1 Research Question

This project aims to explore the effect on sales, units and visits due to:

  1. In-store display of a product
  2. Whether the product was featured in the weekly mailer or not

For years, retailers have been investing in these two strategies to boost their revenues. At the same time, for years, consumers have been influenced by these two strategies without even knowing it. This project aims to quantify this effect and help both consumers and retailers.

1.2 Data

The analysis will make use of transactional data and the causal data for a major retailer for a period of 2 years. The transactional data, as the name suggests, records all the transactions for 2500 households. The causal data records the display and mailer information for 68377 products in 115 stores.

1.3 Methodology

Calculate the effect of in-store display and weekly mailer information on the sales, units and visits between Year 1 and Year 2 for the same:

  1. Product
  2. Store
  3. Week ID

Once we have the information for distinct (a), (b) and (c) we can make generalized statements for all 68,377 products in 115 stores for the last 102 weeks. Using these results which are obtained from the sampled data, we will make use of central limit theorem and confidence intervals to make inferences about the population.

1.4 Practical Usage

This project aims to quantify the effect of strategies discussed in 1.1 and help:

  1. Retailers optimize their in-store display
  2. Consumers optimize their spending on the right products with minimum influence

2. Packages Required

Load Files

This package allows you to read and write data at blazing fast speed

library(data.table)

This package is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges like mutating, selecting, filtering, summarizing and arranging columns

library(dplyr)

This packages provides fast and effecient implementations of common string manipulations

library(stringr)

This package is a system for declaratively creating graphics, based on The Grammar of Graphics

library(ggplot2)

This package makes it easier to work with dates and times

library(lubridate)

This package helps in printing datasets fited to page

library(printr)

This package helps in creating tidy data

library(tidyr)

This package helps to render data objects in R as HTML tables using the JavaScript library ‘DataTables’

library(DT)

3. Data Preparation

3.1 Source

This dataset contains household level transactions of over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

  1. hh_demographic (801x8) - This table contains demographic information for a portion of households. Due to nature of the data, the demographic information is not available for all households.
  2. transaction_data (2595732x12) - This table contains all products purchased by households within this study. Each line found in this table is essentially the same line that would be found on a store receipt.
  3. campaign_table (7208x3) - This table lists the campaigns received by each household in the study. Each household received a different set of campaigns.
  4. campaign_desc (30x4) - This table gives the length of time for which a campaign runs. So, any coupons received as part of a campaign are valid within the dates contained in this table.
  5. product (92353x7) - This table contains information on each product sold such as type of product, national or private label and a brand identifier.
  6. coupon (124548x3) - This table lists all the coupons sent to customers as part of a campaign, as well as the products for which each coupon is redeemable. Some coupons are redeemable for multiple products. One example is a coupon for any private label frozen vegetable. There are a large number of products where this coupon could be redeemed.
  7. coupon_redempt (2318x4) - This table identifies the coupons that each household redeemed.
  8. causal_data (36786524x5) - This table signifies whether a given product was featured in the weekly mailer or was part of an in-store display (other than regular product placement).

Link: https://github.com/akshaykher/Data-Wrangling-Project/blob/master/The_Complete_Journey_guide.pdf

3.2 Data Importing and Cleaning

Importing All Tables

# names of the data files to import
names <- c("product", "transaction_data", "causal_data")

# product table link - https://www.dropbox.com/s/2hq56ye5tjp8ryb/product.csv?dl=0
# transaction_data link - https://www.dropbox.com/s/kqn5ontiyp3hvo0/transaction_data.csv?dl=0
# causal_data link - https://www.dropbox.com/s/n6x0ugtdna6av3t/causal_data.csv?dl=0
# importing all data files at once
for(i in seq_along(names))
{
  df = fread(paste0("data/", names[i], ".csv")) 
  assign(names[i], df)
}

Product Table

datatable(head(product, 100))
# Replacing Empty Strings with NA
product$department[which(product$department == "")] = NA
product$commodity_desc[which(product$commodity_desc == "")] = NA
product$sub_commodity_desc[which(product$sub_commodity_desc == "")] = NA
product$curr_size_of_product[which(product$curr_size_of_product == "")] = NA

# Converting Some String Variables to Factor Variables
product <- mutate(product, 
                  department = factor(department),
                  brand = factor(brand),
                  commodity_desc = factor(commodity_desc),
                  sub_commodity_desc = factor(sub_commodity_desc))

Causal Table

datatable(head(causal_data, 100))
# Converting Some String Variables to Factor Variables
causal_data <- mutate(causal_data, display = factor(display), mailer = factor(mailer))

Transaction Table

datatable(head(transaction_data, 100))
########## Quantity Variable ##########

# 99% of the quantities are less 10. However there is huge spike in quantity from 99% to 100%. We will need to examine this!
q1<- c(quantile(transaction_data$quantity, probs = seq(0, 0.9, 0.1)),
quantile(transaction_data$quantity, probs = seq(0.91, 1, 0.01)))

df1 <- as.data.frame(q1)
names(df1) <- "quantity"
datatable(df1)
########## Quantity Variable ##########

# For all quantities greater than 200, the sub_commodity_desc is "GASOLINE-REG UNLEADED". The quantity of gas may be calculated in gallons giving it a high number. 
transaction_data %>%
filter(quantity>200) %>%
select(product_id) %>%
unique() %>%
inner_join(select(product, product_id, department, commodity_desc
                  , sub_commodity_desc), by="product_id") %>%
select(sub_commodity_desc) %>%
unique()
sub_commodity_desc
GASOLINE-REG UNLEADED
########## Quantity Variable ##########

# Flagging Gasoline and Non-Gasoline Transactions
transaction_data$gasoline_transaction <- 
  ifelse(transaction_data$product_id %in% 
           product[product$sub_commodity_desc == "GASOLINE-REG UNLEADED",]
                $product_id, 1, 0)

# Max Quantity for a Non-Gasoline Transactions is 144. This seems legitimate.
q2 <- quantile(filter(transaction_data, gasoline_transaction == 0)$quantity, 
         seq(0, 1, 0.1))

df2 <- as.data.frame(q2)
names(df2) <- "quantity"
datatable(df2)
########## Retail Discount Variable ##########

# As discount can never be negative, we will convert positive discounts to 0
transaction_data[transaction_data$retail_disc > 0]$retail_disc = 0

3.3 Final Data

Product Table - First 20 Rows

# Summary of Product Table
glimpse(product)
## Observations: 92,353
## Variables: 7
## $ product_id           <int> 25671, 26081, 26093, 26190, 26355, 26426,...
## $ manufacturer         <int> 2, 2, 69, 69, 69, 69, 69, 69, 69, 16, 69,...
## $ department           <fct> GROCERY, MISC. TRANS., PASTRY, GROCERY, G...
## $ brand                <fct> National, National, Private, Private, Pri...
## $ commodity_desc       <fct> FRZN ICE, NO COMMODITY DESCRIPTION, BREAD...
## $ sub_commodity_desc   <fct> ICE - CRUSHED/CUBED, NO SUBCOMMODITY DESC...
## $ curr_size_of_product <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 O...

Causal Table

# Summary of Causal Table
glimpse(causal_data)
## Observations: 36,786,524
## Variables: 5
## $ product_id <int> 26190, 26190, 26190, 26190, 26190, 26190, 26190, 26...
## $ store_id   <int> 286, 288, 289, 292, 293, 295, 296, 297, 298, 299, 3...
## $ week_no    <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,...
## $ display    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, ...
## $ mailer     <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, ...

Transaction Table- First 20 Rows

# Summary of Transaction Table
glimpse(transaction_data)
## Observations: 2,595,732
## Variables: 13
## $ household_key        <int> 2375, 2375, 2375, 2375, 2375, 2375, 2375,...
## $ basket_id            <dbl> 26984851472, 26984851472, 26984851472, 26...
## $ day                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ product_id           <int> 1004906, 1033142, 1036325, 1082185, 81604...
## $ quantity             <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales_value          <dbl> 1.39, 0.82, 0.99, 1.21, 1.50, 1.98, 1.57,...
## $ store_id             <int> 364, 364, 364, 364, 364, 364, 364, 364, 3...
## $ retail_disc          <dbl> -0.60, 0.00, -0.30, 0.00, -0.39, -0.60, -...
## $ trans_time           <int> 1631, 1631, 1631, 1631, 1631, 1642, 1642,...
## $ week_no              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ coupon_disc          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupon_match_disc    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ gasoline_transaction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

3.4 Summary

Product Table

# Total unique counts of each variable
product %>%
  summarize(tot_products = n(),
            tot_manufacturer = n_distinct(manufacturer),
            tot_department = n_distinct(department),
            tot_brand = n_distinct(brand),
            tot_commodity_desc = n_distinct(commodity_desc),
            tot_sub_commodity_desc = n_distinct(sub_commodity_desc))
tot_products tot_manufacturer tot_department tot_brand tot_commodity_desc tot_sub_commodity_desc
92353 6476 44 2 308 2383
# Total NAs in each variable
c1 <- colSums(is.na(product))

df1 <- as.data.frame(c1)
names(df1) <- "total_NAs"
df1
total_NAs
product_id 0
manufacturer 0
department 15
brand 0
commodity_desc 15
sub_commodity_desc 15
curr_size_of_product 30607

Causal Table

# Total unique counts/min-max of each variable
causal_data %>%
  summarize(total_observations = n(),
            tot_products = n_distinct(product_id),
            min_week = min(week_no),
            max_week = max(week_no),
            tot_display = n_distinct(display),
            tot_mailer = n_distinct(mailer))
total_observations tot_products min_week max_week tot_display tot_mailer
36786524 68377 9 101 10 11
# Total NAs in each variable
c2 <- colSums(is.na(causal_data))

df2 <- as.data.frame(c2)
names(df2) <- "total_NAs"
df2
total_NAs
product_id 0
store_id 0
week_no 0
display 0
mailer 0
# Explanation and Total Counts of the Display Variable
table_display <- table(causal_data$display)
names(table_display) <- 
              c("Not on Display", "Store Front", "Store Rear", "Front End Cap",
               "Mid-Aisle End Cap", "Rear End Cap", "Side-Aisle End Cap",
               "In-Aisle", "Secondary Location Display", "In-Shelf")

df_display <- as.data.frame(table_display)
names(df_display) <- c("Display","Count")

df_display %>%
  arrange(Count) %>%                                # sort by counts
  mutate(Display = factor(Display,Display)) %>%     # reset factor
  ggplot(aes(x=Display, y=Count)) +                 # plot 
  geom_bar(stat="identity") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma)

# Explanation and Total Counts of the Mailer Variable
table_mailer <- table(causal_data$mailer)
names(table_mailer) <- 
            c("Not on Ad", "Interior Page Feature", "Interior Page Line Item",
              "Front Page Feature","Back Page Feature", "Wrap Front Feature", 
              "Wrap Interior Coupon","Wrap Back Feature", "Interior Page Coupon", 
              "Free on Interior Page", "Free on Front Page, Back Page or Wrap")

df_mailer <- as.data.frame(table_mailer)
names(df_mailer) <- c("Mailer","Count")

df_mailer %>%
  arrange(Count) %>%                                # sort by counts
  mutate(Mailer = factor(Mailer,Mailer)) %>%        # reset factor
  ggplot(aes(x=Mailer, y=Count)) +                  # plot 
  geom_bar(stat="identity") +
  coord_flip()+
  scale_y_continuous(labels = scales::comma)

Transaction Table

# Total unique counts of each variable grouped by Gasoline Transactions
transaction_data %>%
  group_by(gasoline_transaction) %>%
  summarize(total_visits = n_distinct(basket_id),
  total_observations = n(),
  total_households = n_distinct(household_key),
  total_products = n_distinct(product_id),
  total_stores = n_distinct(store_id))
gasoline_transaction total_visits total_observations total_households total_products total_stores
0 252543 2570770 2500 92304 557
1 24952 24962 1438 35 164
# Min, Max of each variable grouped by Gasonline Transactions
transaction_data %>%
  group_by(gasoline_transaction) %>%
  summarize(min_quantity = min(quantity),
  max_quantity = max(quantity),
  min_sales = min(sales_value),
  max_sales = max(sales_value),
  min_day = min(day),
  max_day = max(day),
  min_week = min(week_no),
  max_week = max(week_no))
gasoline_transaction min_quantity max_quantity min_sales max_sales min_day max_day min_week max_week
0 0 144 0.00 840 1 711 1 102
1 1 89638 0.03 250 12 711 2 102
# Total NAs in each variable
c3 <-colSums(is.na(transaction_data))

df3 <- as.data.frame(c3)
names(df3) <- "total_NAs"
df3
total_NAs
household_key 0
basket_id 0
day 0
product_id 0
quantity 0
sales_value 0
store_id 0
retail_disc 0
trans_time 0
week_no 0
coupon_disc 0
coupon_match_disc 0
gasoline_transaction 0

4. Exploratory Data Analysis

4.1 Data Wrangling

First, we’ll join Causal Data with Transaction Data on:

  • product_id - which accounts for product level variation
  • store_id - which accounts for store level variation
  • week_no - which accounts for seasonal variation

Second, we’ll find the total sales, units and visits by aggregating on:

  • product_id
  • store_id
  • week_no
  • display
  • mailer

Thus, in the coming analysis, we’ll compare sales, units and visits for same products, for the same stores and in the same week. This will ensure that any differences that we detect can be attributed to display and mailer level changes only. The variation due to product, store and week number will already be accounted for.

# Joining Causal Data with Transaction Data and then aggregating it.
mailer_display_transactions <-
  causal_data %>%
  inner_join(select(transaction_data,household_key,basket_id,
                    quantity,sales_value,product_id,store_id,week_no),
             by = c("product_id","store_id","week_no")) %>% 
  group_by(store_id,product_id,week_no,display,mailer) %>% 
  summarize(total_customers = n_distinct(household_key),
            sum_sales=sum(sales_value),
            sum_quantity=sum(quantity),
            sum_visits=n_distinct(basket_id))

datatable(head(mailer_display_transactions,20)) # First 20 Rows

Now we’ll separate the above dataset into Year 1 and Year 2 based on the week_no. For the Year 2 dataset we’ll recode the week_no’s from 54-102 to 1-52.

The purpose of doing this is to compare metrics of Year 1 vs Year 2.

# Year 1 dataset
mailer_display_transactions_year1 <-
  mailer_display_transactions %>% 
  filter(week_no <=54)

# Year 2 dataset
mailer_display_transactions_year2 <-
  mailer_display_transactions %>% 
  filter(week_no >54) %>% 
  mutate(week_no_modified = week_no-54)

# Recoding the week_no's for Year 2
mailer_display_transactions_year2$week_no <-
  mailer_display_transactions_year2$week_no_modified

mailer_display_transactions_year2$week_no_modified = NULL

Now, we’ll join Year 1 and Year 2 dataset on:

  • product_id
  • store_id
  • week_no

Then, we’ll separate the above data into:

  • mailer_year1_vs_year2 - This contains all the rows where:
    • mailer changed from 0 to not 0 from Year 1 to Year 2
    • mailer changed from not 0 to 0 from Year 1 to Year 2
  • display_year1_vs_year2 - This contains all the rows where:
    • display changed from 0 to not 0 from Year 1 to Year 2
    • display changed from not 0 to 0 from Year 1 to Year 2

The purpose of doing this is to quantify the effect of:

  • Shifting a product not on display to on-display from Year 1 to Year 2 and vice-versa
  • Put a product from not on Ad to on-Ad from Year 1 to Year 2 and vice-versa
# Joining Year 1 and Year 2 datasets
mailer_display_transactions_year1_vs_year2 <- 
  mailer_display_transactions_year1 %>% 
  inner_join(mailer_display_transactions_year2,
             by=c("product_id","store_id","week_no"),
             suffix=c("_Yr1","_Yr2")) # as both tables have same colnames

# Mailer Dataset - where there is a difference between Year 1 and Year 2
mailer_year1_vs_year2 <- 
  mailer_display_transactions_year1_vs_year2 %>% 
  filter( (mailer_Yr1==0 & mailer_Yr2 !=0) | (mailer_Yr2==0 & mailer_Yr1 !=0) )

# Display Dataset - where there is a difference between Year 1 and Year 2
display_year1_vs_year2 <- 
  mailer_display_transactions_year1_vs_year2 %>% 
  filter( (display_Yr1==0 & display_Yr2 !=0) | (display_Yr2==0 & display_Yr1 !=0) )

datatable(head(mailer_display_transactions_year1_vs_year2,20)) # First 20 Rows

Now that we’ve already taken into account the product level variation, we’ll aggregate the mailer and display datasets by store and week number.

# Aggregating Mailer dataset
mailer_year1_vs_year2 <-
  mailer_year1_vs_year2 %>%
  group_by(store_id,week_no,mailer_Yr1,mailer_Yr2) %>% 
  summarize(sum_sales_Yr1=sum(sum_sales_Yr1),
            sum_quantity_Yr1=sum(sum_quantity_Yr1),
            sum_visits_Yr1=sum(sum_visits_Yr1),
            sum_sales_Yr2=sum(sum_sales_Yr2),
            sum_quantity_Yr2=sum(sum_quantity_Yr2),
            sum_visits_Yr2=sum(sum_visits_Yr2))

# Aggregating Display dataset
display_year1_vs_year2 <-
  display_year1_vs_year2 %>%
  group_by(store_id,week_no,display_Yr1,display_Yr2) %>% 
  summarize(sum_sales_Yr1=sum(sum_sales_Yr1),
            sum_quantity_Yr1=sum(sum_quantity_Yr1),
            sum_visits_Yr1=sum(sum_visits_Yr1),
            sum_sales_Yr2=sum(sum_sales_Yr2),
            sum_quantity_Yr2=sum(sum_quantity_Yr2),
            sum_visits_Yr2=sum(sum_visits_Yr2))

datatable(head(mailer_year1_vs_year2,20)) # First 20 Rows

Now, in future analysis, we want to conduct a hypothesis test to detect any difference in sales, units and visits for a particular store & week between Year 1 and Year 2 due to display/mailer variables.

The quantities between Year 1 and Year 2 are paired, thus we should calculate their difference and conduct the hypothesis on the average of difference.

The diff_of_avg_yr1_vs_yr2 function does exactly that. It takes the arguments:

  • df - display or mailer dataset
  • val_yr1 - mailer or display value in Year 1
  • val_yr2 - mailer or display value in Year 2
  • var_interest - calculate for sales, units or visits or all of them
  • mailer - 1 for mailer dataset, 0 otherwise
  • display - 1 for display dataset, 0 otherwise

Using these arguments the above function calculates the average of difference between the Year 1 and Year 2 called the Point Estimate. Further, it also calculates the standard deviation and number of observation, both of which will be used to conduct a hypothesis test.

diff_of_avg_yr1_vs_yr2 <- 
  function(df = display_year1_vs_year2,
           val_yr1 = 0,
           val_yr2 = "A",
           var_interest = c("diff_sales","diff_quantity","diff_visits"),
           mailer = 0,
           display = 1)
  {
    if(mailer == 1) # if mailer dataset
    {
      calc_diff <-
        df %>% 
        # calculate all differences for specified mailer variables
        mutate(diff_sales = sum_sales_Yr2-sum_sales_Yr1,
               diff_quantity = sum_quantity_Yr2-sum_quantity_Yr1,
               diff_visits = sum_visits_Yr2-sum_visits_Yr1) %>% 
        filter(mailer_Yr1 == val_yr1,mailer_Yr2 == val_yr2)
    } else if (display == 1)  # if display dataset
    {
      calc_diff <-
        df %>%
        # calculate all differences for specified display variables
        mutate(diff_sales = sum_sales_Yr2-sum_sales_Yr1,
               diff_quantity = sum_quantity_Yr2-sum_quantity_Yr1,
               diff_visits = sum_visits_Yr2-sum_visits_Yr1) %>% 
        filter(display_Yr1 == val_yr1,display_Yr2 == val_yr2)
    } else  # if not a mailer or display dataset
    {
      print("Correct Value Not Selected") 
    }
    
    # calculate average of difference for all var_interest specified by user
    l1 = lapply(calc_diff[,names(calc_diff) %in% var_interest],mean,na.rm=TRUE)
    
    # calculate sd of difference for all var_interest specified by user
    l2 = lapply(calc_diff[,names(calc_diff) %in% var_interest],sd,na.rm=TRUE)
    
    # calculate number of observations
    l3 = list(n_obs=nrow(calc_diff))
    
    # append mean, sd and number of observations
    point_estimate <- append(append(l1,l2),l3)
    names(point_estimate) <- c("diff_sales","diff_quantity","diff_visits",
                   "sd_sales","sd_quantity","sd_visits","n_obs")
    
    return(point_estimate)
  }

# Sample Output of Function
diff_of_avg_yr1_vs_yr2(df = display_year1_vs_year2,
           val_yr1 = 0,
           val_yr2 = "A",
           var_interest = c("diff_sales","diff_quantity","diff_visits"),
           mailer = 0,
           display = 1)
## $diff_sales
## [1] 0.1145652
## 
## $diff_quantity
## [1] 0.08695652
## 
## $diff_visits
## [1] -0.173913
## 
## $sd_sales
## [1] 1.696525
## 
## $sd_quantity
## [1] 1.847835
## 
## $sd_visits
## [1] 0.5697698
## 
## $n_obs
## [1] 46

We have 10 unique mailer categories and 11 unique display categories. If we run diff_of_avg_yr1_vs_yr2 function on each combination then we’ll need to call the function 38 times manually. To avoid this, we’ll create a new function called calculate_point_estimates which will do this automatically.

The calculate_point_estimates function takes the arguments:

  • mailer - 1 for mailer dataset, 0 otherwise
  • display - 1 for display dataset, 0 otherwise
  • df - display or mailer dataset
calculate_point_estimates <- function(mailer=0,display=0,df=mailer_year1_vs_year2)
{
  if(display == 1) # if display dataset
  {
    names = c("1","2","3","4","5","6","7","9","A")
  } else if(mailer==1) # if mailer dataset
  {
    names = c("A","C","D","F","H","J","L","P","X","Z")
  }
  
  df_point_estimates = data.frame()
  
  for(i in 1:length(names)) 
  {
    # Yr 1 vs Yr 2 where Yr 1 Variable = 0 and Yr 2 Variable !=0
    point_estimate <-
      diff_of_avg_yr1_vs_yr2(df = df,
                             val_yr1 = 0, # Yr 1 = 0
                             val_yr2 = names[i],# Yr 2 not 0
                             var_interest = c("diff_sales","diff_quantity","diff_visits"),
                             mailer = mailer,
                             display = display)
    df_point_estimates=bind_rows(df_point_estimates,point_estimate)
    df_point_estimates$val_yr1[i] = 0
    df_point_estimates$val_yr2[i] = names[i]
  }
  
  for(i in 1:length(names)) 
  {
    # Yr 1 vs Yr 2 where Yr 1 Variable != 0 and Yr 2 Variable !=0
    point_estimate <-
      diff_of_avg_yr1_vs_yr2(df = df, 
                             val_yr1 = names[i],# Yr 1 not 0
                             val_yr2 = 0, # Yr = 0
                             var_interest = c("diff_sales","diff_quantity","diff_visits"),
                             mailer = mailer,
                             display = display)
    df_point_estimates=bind_rows(df_point_estimates,point_estimate)
    df_point_estimates$val_yr1[i+length(names)] = names[i]
    df_point_estimates$val_yr2[i+length(names)] = 0
  }
  
  return(na.omit(df_point_estimates))
}

Running the function calculate_point_estimates on display and mailer dataset

display_point_estimates <- 
  calculate_point_estimates(display=1,mailer=0,df=display_year1_vs_year2) %>%
  select(val_yr1,val_yr2,everything())

mailer_point_estimates <- 
  calculate_point_estimates(display=0,mailer=1,df=mailer_year1_vs_year2) %>% 
  select(val_yr1,val_yr2,everything())

datatable(mailer_point_estimates) # mailer_point_estimates dataset

Now we will create a function calc_p_value_conf_int function which will calculate the final datasets. It takes 1 dataset as an argument and does the following steps:

  1. Tidy the dataset so that the columns diff_sales, diff_units, diff_visits, sd_sales, sd_units and sd_visits are gathered into three columns: Variables, diff and sd
  2. Calculate the p-value
  3. Calculate the upper and lower bounds of the confidence interval
  4. Filter all rows with p-value > 5%
calc_p_value_conf_int <- function(df)
{
  df1 <-
    df %>% 
    select(val_yr1,val_yr2,starts_with("diff"),n_obs) %>% 
    gather(Variable, diff, c("diff_sales","diff_quantity","diff_visits")) %>% 
    mutate(Variable = str_replace(Variable,"diff_",""))
  
  df2 <-
    df %>% 
    select(val_yr1,val_yr2,starts_with("sd")) %>% 
    gather(Variable, sd, c("sd_sales","sd_quantity","sd_visits")) %>% 
    mutate(Variable = str_replace(Variable,"sd_",""))
  
  df3 <-
  df1 %>% 
    inner_join(df2,by = c("val_yr1","val_yr2","Variable")) %>% 
    
    # p-value
    mutate(p_value = pnorm(abs(diff/(sd/sqrt(n_obs))),lower.tail = FALSE)*2) %>%
    
    # Lower bound of CI
    mutate(ci_lower = diff-1.96*(sd/sqrt(n_obs))) %>% 
    
    # Upper bound of CI
    mutate(ci_upper = diff+1.96*(sd/sqrt(n_obs))) %>% 
    
    # Keep rows with p-value < 5%
    filter(p_value <= 0.05)
  
  return(df3)
}

4.2 Analysis

Mailer

For a particular product in a particular store in a particular week, we are 95% confident that:

  • The percentage of sales increases by 6% to 84% if the product is shifted from Not on Ad to Front Page Feature
  • The percentage of units increases by 8% to 46% if the product is shifted from Not on Ad to Interior Page Feature
  • The percentage of units increases by 33% to 103% if the product is shifted from Not on Ad to Front Page Feature
  • The percentage of visits increases by 4% to 22% if the product is shifted from Not on Ad to Interior Page Feature
  • The percentage of visits increases by 23% to 53% if the product is shifted from Not on Ad to Front Page Feature
  • The percentage of visits decreases by 0% to 15% if the product is shifted from Interior Page Feature to Not on Ad
  • The percentage of units decreases by 36% to 50% if the product is shifted from Front Page Feature to Not on Ad
  • The percentage of visits decreases by 14% to 36% if the product is shifted from Front Page Feature to Not on Ad
# running the 'calc_p_value_conf_int' function on mailer data
mailer_analysis <- 
  calc_p_value_conf_int(mailer_point_estimates) %>% 
  mutate(avg_sales=mean(mailer_display_transactions$sum_sales),
         avg_units=mean(mailer_display_transactions$sum_quantity),
         avg_visits=mean(mailer_display_transactions$sum_visits))

# calculating percentage increase in sales, units and visits instead of absolute
for (i in 1:nrow(mailer_analysis))
{
  if(mailer_analysis$Variable[i] == "sales")
  {
    mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_sales[i] 
  } else if (mailer_analysis$Variable[i] == "quantity")
  {
    mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_units[i]
  } else if (mailer_analysis$Variable[i] == "visits")
  {
    mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_visits[i]
  }
}

# output mailer yr1, display yr2, n, variable, p-value, confidence interval
mailer_analysis %>%  
  select(val_yr1,val_yr2,n_obs,Variable,p_value,ci_lower,ci_upper) %>% 
  arrange(val_yr1) %>% 
  mutate(ci_lower = ci_lower*100, ci_upper = 100*ci_upper)
val_yr1 val_yr2 n_obs Variable p_value ci_lower ci_upper
0 D 569 sales 0.0022946 5.683548 84.218476
0 A 491 quantity 0.0006097 7.632072 46.100164
0 D 569 quantity 0.0000000 33.983106 103.320578
0 A 491 visits 0.0017194 4.383105 22.177247
0 D 569 visits 0.0000000 23.868319 53.698171
A 0 664 visits 0.0471979 -15.650275 -0.113509
D 0 481 quantity 0.0000004 -50.142834 -36.428163
D 0 481 visits 0.0000859 -36.060256 -14.059690
Display

For a particular product in a particular store in a particular week, we are 95% confident that:

  • The percentage of sales increases by 3% to 185% if the product is shifted from Not on Display to Mid-Aisle End Cap
  • The percentage of units decreases by 5% to 61% if the product is shifted from Not on Display to Front End Cap
  • The percentage of units increases by 0% to 126% if the product is shifted from Not on Display to Mid-Aisle End Cap
  • The percentage of visits decreases by 2% to 50% if the product is shifted from Not on Display to Front End Cap
  • The percentage of visits decreases by 1% to 29% if the product is shifted from Not on Display to In-Shelf
  • The percentage of visits increases by 10% to 79% if the product is shifted from Front End Cap to Not on Display
  • The percentage of visits increases by 1% to 41% if the product is shifted from Secondary Location Display to Not on Display
# running the 'calc_p_value_conf_int' function on display data
display_analysis <- 
  calc_p_value_conf_int(display_point_estimates) %>% 
  mutate(avg_sales=mean(mailer_display_transactions$sum_sales), # avg sales
         avg_units=mean(mailer_display_transactions$sum_quantity), # avg units
         avg_visits=mean(mailer_display_transactions$sum_visits)) # avg visits

# calculating percentage increase in sales, units and visits instead of absolute
for (i in 1:nrow(display_analysis))
{
  if(display_analysis$Variable[i] == "sales")
  {
    display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_sales[i] 
  } else if (display_analysis$Variable[i] == "quantity")
  {
    display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_units[i]
  } else if (display_analysis$Variable[i] == "visits")
  {
    display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_visits[i]
  }
}

# output display yr1, display yr2, n, variable, p-value, confidence interval
display_analysis %>%  
  select(val_yr1,val_yr2,n_obs,Variable,p_value,ci_lower,ci_upper) %>% 
  arrange(val_yr1) %>% 
  mutate(ci_lower = ci_lower*100, ci_upper = 100*ci_upper)
val_yr1 val_yr2 n_obs Variable p_value ci_lower ci_upper
0 4 25 sales 0.0306214 2.812311 184.938788
0 3 110 quantity 0.0301836 -61.030992 -5.051511
0 4 25 quantity 0.0460634 0.678906 126.883121
0 3 110 visits 0.0307633 -50.528878 -2.863526
0 A 46 visits 0.0384339 -29.018057 -0.925740
3 0 101 visits 0.0077055 10.323282 79.044402
9 0 163 visits 0.0348656 1.307634 41.419101

5 Summary

Problem Statement

Quantify the impact on sales, units and visits due to:

  1. In-store display of a product
  2. Whether the product was featured in the weekly mailer or not

Rationale

Data Used
  • causal_data - This table signifies whether a given product was featured in the weekly mailer or was part of an in-store display.
  • transaction_data - This table contains all products purchased by households within this study. Each line found in this table is essentially the same line that would be found on a store receipt.
  • product - This table contains information on each product sold such as type of product, national or private label and a brand identifier.
Methodology
  1. Calculate sales, units and visits for each store, week and product combination
  2. Divide data into Year1 and Year2
  3. Calculate average difference in sales, units and visits between Year1 and Year2 for each combination of display and mailer
  4. Keep only those display and mailer combinations which go from no mailer/no display in Yr1 to some mailer/some display in Yr2 or vice versa
  5. Conduct a paired t-test on the average difference for each combination
  6. Calculate p-value and confidence intervals
  7. Keep combinations with p-value < 5%

Insights

  • The percentage of sales, units and visits increase if a product is shifted from Not on Ad to Interior/Front Page Feature
  • The percentage of sales, units and visits increase if a product is shifted from Not on Display to Mid-Aisle Cap. Surprisingly, we see a decrease if a product is shifted from Not on Display to Front-End Cap.

For exact values please refer to the tab 4.2

Implications

The retailer can perform the following steps to increase their sales, units and visits for a particular product:

  • Put products on Interior or Front-Page Feature of the mailer
  • Put products on the Mid-Aisle Cap display

Limitations

The data consists of only 2500 households. For a more in-depth analysis, we will need to capture more households.