Introduction

The purpose of this report is to provide an overview of the Trade Activity Effectiveness analysis. Given the household transaction data for the Pasta, Pasta Sauce, Pancake Mix and Syrup categories, this document covers the end-to-end journey of the analysis.

Problem Statement

Feature and Display is part of in-store trade activity that are used to increase sales of products. Using the data provided,

Can we understand what the impact of Feature and Display on the sales of four categories is?

What is the penetration of different products at a weekly store level?
What are the products for which Feature and/or Display has increased its units sold per week in the 2 years?
Did the sales increase because of the trade activity?
How does the interaction between complimentary categories play a role in this impact?

Why should we solve this problem?

Two reasons why this problem should be solved:

Improve trade activity efficiency by spending trade on the right products
Increase profit from products that have had positive effect from trade activity

Approach

We will use trade activity data and transactions of household data to understand the impact of trade on sales of products
Perform EDA to understand the effect of Feature and Display on sales of the product along with other factors
Perform pre-post analysis to check if the change in sales is statistically significant because of trade of activity

How can the results be consumed?

In case there has been a positive impact on certain products, Kroger can increase its retail margin with the manufacturers
In case there has been no impact and the sales of product have been decreasing, Kroger can increase the cost of Feature and/or Display through product price for manufacturers
For complimentary categories, use Feature and Display for cross promotions to increase sales of products

Packages Required

Packages used:

sas7bdat - to load the SAS datasets
tidyverse - for data cleaning
dpylr - to transform raw data into analytical dataset
png - to read PNG images
grid - for graphical modifications to images

# install.packages("sas7bdat",repos = "http://cran.us.r-project.org")
# install.packages("dplyr",repos = "http://cran.us.r-project.org")
# install.packages("tidyverse",repos = "http://cran.us.r-project.org")

library(sas7bdat)
library(dplyr)
library(tidyverse)
library(png)
library(grid)

Data Preparation

Step 1: Understanding the context

Data Source: Carbo-loading data

Carbo-Loading contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied.

Step 2: Data Importing

# It's not a good idea to import data in a RMD file, so we will only load the already imported data. However, code for importing data is below. 
# Note: Download data from the above link and change working directory to your local machine

setwd("C:/Users/varsh/OneDrive/Documents/R/Final Project/data")

causal_lookup <- read.sas7bdat("causal_lookup.sas7bdat")
product_lookup <- read.sas7bdat("product_lookup.sas7bdat")
store_lookup <- read.sas7bdat("store_lookup.sas7bdat")
transactions <- read.sas7bdat("transactions.sas7bdat")

Step 3: Data Cleaning

As part of Data Cleaning, we will first check the dimensions and structure of the datasets.

## Observations: 351,372
## Variables: 6
## $ upc          <fct> 7680850108, 5100001212, 5100002792, 3620000300, 4...
## $ store        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ week         <dbl> 68, 66, 72, 55, 68, 55, 66, 55, 55, 76, 66, 66, 6...
## $ feature_desc <fct> Wrap Interior Feature, Wrap Back Feature, Interio...
## $ display_desc <fct> Not on Display, Not on Display, Not on Display, N...
## $ geography    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

## Observations: 927
## Variables: 5
## $ upc                 <fct> 111112360, 566300023, 566300028, 566300029...
## $ product_description <fct> VINCENT S ORIG MARINARA S, PINE MOUNTAIN S...
## $ commodity           <fct> pasta sauce, syrups, syrups, syrups, syrup...
## $ brand               <fct> Vincent's, Pine Mountain, Miller, Miller, ...
## $ product_size        <fct> 25 OZ, 40 OZ, 19 OZ, 12 OZ, 19 OZ, 26    O...

## Observations: 387
## Variables: 2
## $ store          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ store_zip_code <fct> 37865, 30084, 30039, 31210, 30044, 31204, 30064...

## Observations: 5,197,681
## Variables: 11
## $ upc                 <fct> 7680850106, 3620000470, 1800028064, 999998...
## $ dollar_sales        <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, ...
## $ units               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <fct> 1100, 1100, 1137, 1148, 1323, 1323, 1415, ...
## $ geography           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ week                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ household           <dbl> 125434, 125434, 108320, 162016, 89437, 894...
## $ store               <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket              <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, ...
## $ day                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ coupon              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Summary of dimensions of the four datasets

Table name	Dimensions (Row X Columns)
Causal data	351372, 6
Product data	927, 5
Store data	387, 2
Transactions	5197681, 11

Step 4: Checking for duplicate entries

Before joining the datasets into one, we should check for duplicate entries in each dataset at the right level.

Below is the expected level (combination of columns) of data at which each row should be unique:

product_lookup : Since it contains product information, it should be unique at “upc” level
store_lookup : Since it contains store information, it should be unique at “store” level
causal_lookup : For each product, there should ideally be one unique Feature and Display entry for a week in a store. So the level would be “upc”,“week”,“store”,“feature_desc”,“display_desc”
transactions : At household transaction level, this dataset should be unique at “household”, “basket” and “upc” level

Here’s the code for removing duplicates:

causal_dedup <- distinct(as_tibble(causal_lookup), upc, store, week, .keep_all = TRUE)

product_dedup <- distinct(as_tibble(product_lookup), upc, .keep_all = TRUE)

store_dedup <- distinct(as_tibble(store_lookup), store, .keep_all = TRUE)

transactions_dedup <- distinct(as_tibble(transactions), household, basket,
                               upc, .keep_all = TRUE)

Summary of dimensions (Row X Columns) of the four datasets - Before and after removing duplicates

Table name	Dimensions Before	Dimensions (After)
Causal data	351372, 6	351372, 6
Product data	927, 5	927, 5
Store data	387, 2	387, 2
Transactions	5197681, 11	5197681, 11

Conclusion: There are no duplicate rows for the respective levels in all four datasets.

Step 5: Identifying and treating missing values

We need to check for missing values (NA values) in each dataset individually and then check for NAs, once again, after they have been joined into one dataset.

Here is the code to check for missing values in each dataset:

# Checking missing values in all four datasets using apply function
# The resulting table will have a TRUE or FALSE indicator to confirm if 
# the respective column has a missing value or not

apply(is.na(causal_lookup), 2, any)

##          upc        store         week feature_desc display_desc 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##    geography 
##        FALSE

apply(is.na(causal_lookup), 2, any)

##          upc        store         week feature_desc display_desc 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##    geography 
##        FALSE

apply(is.na(causal_lookup), 2, any)

##          upc        store         week feature_desc display_desc 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##    geography 
##        FALSE

apply(is.na(causal_lookup), 2, any)

##          upc        store         week feature_desc display_desc 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##    geography 
##        FALSE

The four datasets don’t have missing values in them separately. But once we join them into one dataset using appropriate keys, we may notice missing values.

Below is the code snippet for joining the tables

# Converting some of the columns from factors to character before joining # the tables
transactions$upc <- as.character(transactions$upc)
product_lookup$upc <- as.character(product_lookup$upc)

# Joining transactions table with product_lookup on key "upc"
trans_product <- left_join(transactions, product_lookup, by = "upc")

# Joining the above table with store_lookup on key "store"
trans_prod_store <- left_join(trans_product, store_lookup, by = "store")

# Changing key columns from factor to character in causal data
causal_lookup$upc <- as.character(causal_lookup$upc)
causal_lookup$feature_desc <- as.character(causal_lookup$feature_desc)
causal_lookup$display_desc <- as.character(causal_lookup$display_desc)

# Joining the previously created table with causal data on 
# "upc","geography","store" and "week"
trans_prod_sto_causal <- left_join(trans_prod_store, causal_lookup, 
                              by = c("upc","geography","store","week"))

# Viewing the first 6 rows of the final dataset
head(trans_prod_sto_causal)

##          upc dollar_sales units time_of_transaction geography week
## 1 7680850106         0.80     1                1100         2    1
## 2 3620000470         3.59     1                1100         2    1
## 3 1800028064         2.25     1                1137         2    1
## 4 9999985067         0.85     1                1148         2    1
## 5 9999985131         2.19     1                1323         2    1
## 6 5100002794         2.19     1                1323         2    1
##   household store basket day coupon            product_description
## 1    125434   244      1   1      0             BARILLA ANGEL HAIR
## 2    125434   244      1   1      0       BERTOLLI TOM&BASIL SAUCE
## 3    108320   244      2   1      0      H J PANCK BTRMLK COMP MIX
## 4    162016   244      3   1      0       PRIVATE LABEL VERMICELLI
## 5     89437   244      4   1      0 PRIVATE LABEL IMPORTED LASAGNA
## 6     89437   244      4   1      0          PREGO SPAG SAUCE MEAT
##       commodity                 brand product_size store_zip_code
## 1         pasta               Barilla        16 OZ          40222
## 2   pasta sauce              Bertolli        24 OZ          40222
## 3 pancake mixes           Hungry Jack                       40222
## 4         pasta         Private Label        16 OZ          40222
## 5         pasta Private Label Premium        16 OZ          40222
## 6   pasta sauce                 Prego        26 OZ          40222
##   feature_desc display_desc
## 1         <NA>         <NA>
## 2         <NA>         <NA>
## 3         <NA>         <NA>
## 4         <NA>         <NA>
## 5         <NA>         <NA>
## 6         <NA>         <NA>

# Checking for NA values in trans_prod_sto_causal table

na_count <- apply(is.na(trans_prod_sto_causal), 2, any)

na_count

##                 upc        dollar_sales               units 
##               FALSE               FALSE               FALSE 
## time_of_transaction           geography                week 
##               FALSE               FALSE               FALSE 
##           household               store              basket 
##               FALSE               FALSE               FALSE 
##                 day              coupon product_description 
##               FALSE               FALSE                TRUE 
##           commodity               brand        product_size 
##                TRUE                TRUE                TRUE 
##      store_zip_code        feature_desc        display_desc 
##               FALSE                TRUE                TRUE

Following are the reasons for missing values:

Product table is not updated to include new products from the transactions (plausible reason)
Causal data has data only from week 43 to week 104
Products didn’t have any trade activity for a particular week in a store

Null values in the Feature and Display columns were imputed in 2 steps:

Since causal data is not available from week 1 to week 42, replace NA is such rows with “No Data”
For rest of the rows with NA values, replace them with “No Trade” since no trade was done for those products

Below is the code snippet for treating null values in Feature and Display columns

# Replacing null values from week 1 to week 42 with "No Data"
for (i in 1:42) {
  trans_prod_sto_causal$feature_desc[trans_prod_sto_causal$week == i] <- "No Data"
  trans_prod_sto_causal$display_desc[trans_prod_sto_causal$week == i] <- "No Data"
}

# Replacing rest of the null values with "No Trade"
trans_prod_sto_causal$feature_desc[is.na(trans_prod_sto_causal$feature_desc)] <- "No Trade"
trans_prod_sto_causal$display_desc[is.na(trans_prod_sto_causal$display_desc)] <- "No Trade"

Missing values in Feature and Display have been treated.

Decision to treat missing values in product description, commodity, brand and product_size can be done after the proposed EDA.

Step 6: Revisiting the dataset post cleaning

# Checking for NA values in the treated trans_prod_sto_causal table

na_count <- apply(is.na(trans_prod_sto_causal), 2, any)

na_count

##                 upc        dollar_sales               units 
##               FALSE               FALSE               FALSE 
## time_of_transaction           geography                week 
##               FALSE               FALSE               FALSE 
##           household               store              basket 
##               FALSE               FALSE               FALSE 
##                 day              coupon product_description 
##               FALSE               FALSE                TRUE 
##           commodity               brand        product_size 
##                TRUE                TRUE                TRUE 
##      store_zip_code        feature_desc        display_desc 
##               FALSE               FALSE               FALSE

Proposed Exploratory Data Analysis

Below is an overview of the analytical approach that will be followed to answer the problem statement.

Next steps:

Need a thorough understanding of conducting pre-post analysis
Need to know the technique to counter any interaction between complimentary categories

Trade Activity Effectiveness

Varsha S Kumar

November 10, 2018