2018-12-09
The purpose of this report is to provide an overview of the Trade Activity Effectiveness analysis. This document covers the end-to-end journey of the analysis of how the products under Pasta, Pasta Sauce, Pancake mix and Syrups are impacted by trade activities like Feature and Display.
Feature and Display are part of in-store trade activity that are used to increase sales of products every week. Using the household transaction data, Can we understand what the impact of Feature and Display on the sales of four categories is?
Two reasons why this problem should be solved:
Packages used:
# Loading all the packages
library(sas7bdat)
library(tidyverse)
library(png)
library(grid)
library(GGally)
library(DT)
Data Source: Carbo-loading data
Carbo-Loading contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied.
There are four datasets in Carbo-loading:
# Iteration for reading the SAS datasets
file_1 <- c("causal_lookup","product_lookup","store_lookup","transactions")
file_2 <- ".sas7bdat"
for (i in seq_along(file_1)) {
filename <- paste0("data/",file_1[i],file_2)
if (file.exists(filename)) {
assign(paste0("df_",file_1[i],file_2),read.sas7bdat(filename))
} else {
print("File doesn't exist")
}
}
As part of Data Cleaning, we will first check the dimensions and structure of the datasets.
Below is the summary from each table.
causal_lookup :
## Observations: 351,372
## Variables: 6
## $ upc <fct> 7680850108, 5100001212, 5100002792, 3620000300, 4...
## $ store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ week <dbl> 68, 66, 72, 55, 68, 55, 66, 55, 55, 76, 66, 66, 6...
## $ feature_desc <fct> Wrap Interior Feature, Wrap Back Feature, Interio...
## $ display_desc <fct> Not on Display, Not on Display, Not on Display, N...
## $ geography <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
product_lookup :
## Observations: 927
## Variables: 5
## $ upc <fct> 111112360, 566300023, 566300028, 566300029...
## $ product_description <fct> VINCENT S ORIG MARINARA S, PINE MOUNTAIN S...
## $ commodity <fct> pasta sauce, syrups, syrups, syrups, syrup...
## $ brand <fct> Vincent's, Pine Mountain, Miller, Miller, ...
## $ product_size <fct> 25 OZ, 40 OZ, 19 OZ, 12 OZ, 19 OZ, 26 O...
store_lookup :
## Observations: 387
## Variables: 2
## $ store <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ store_zip_code <fct> 37865, 30084, 30039, 31210, 30044, 31204, 30064...
transactions :
## Observations: 5,197,681
## Variables: 11
## $ upc <fct> 7680850106, 3620000470, 1800028064, 999998...
## $ dollar_sales <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, ...
## $ units <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <fct> 1100, 1100, 1137, 1148, 1323, 1323, 1415, ...
## $ geography <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ household <dbl> 125434, 125434, 108320, 162016, 89437, 894...
## $ store <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, ...
## $ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ coupon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Before joining the datasets into one, check for duplicate entries in each dataset at the right level.
Summary of dimensions (Row X Columns) of the four datasets - Before and after removing duplicates
| Table name | Dimensions Before | Dimensions (After) |
|---|---|---|
| Causal data | 351372, 6 | 351372, 6 |
| Product data | 927, 5 | 927, 5 |
| Store data | 387, 2 | 387, 2 |
| Transactions | 5197681, 11 | 5197681, 11 |
Conclusion: There are no duplicate rows for the respective levels in all four datasets.
We need to check for missing values (NA values) in each dataset individually and then check for NAs, once again, after they have been joined into one dataset.
Missing values in causal_lookup :
Missing values in product_lookup :
Missing values in store_lookup :
Missing values in transactions :
The four datasets don’t have missing values in them separately. But once we join them into one dataset using appropriate keys, we may notice missing values.
Sample of the joined dataset :
Missing values in joined dataset :
Following are the reasons for missing values:
Missing values in the Feature and Display columns were imputed in 2 steps:
To understand the data better, we will perform univariate and bivariate analysis on the variables through plots.
Below is an overview of the analytical approach that will be followed to answer the problem statement.
After cleaning the data, the following steps were performed on the data:
The final results table:
datatable(final_prod_results)
Key Findings:
Point to note
Paired t-tests removes paired values which are exactly the same or is just one observation at the product-week level, so a lot of products may have gotten eliminated from t-test process above. The results shown are for only product week sales that conformed to the requirements of the paired t-test.
To summarize the analysis:
Yes, we have used a paired t-test to understand the effects of weekly sales of each product in stores with and without trade activities.
Pasta Sauce has had the highest penetration across all weeks, given it’s wide variety of product choices.
Given that the sales was compared at weekly level (time aligned) for each product, we can say at 0.05 level of significance that the sales of products increased because of trade activity.
Although there is high correlation between the sales of complimentary categories, the impact of trade activity doesn’t seem to have been affected by this correlation (Refer to finding 3)