The purpose of this report is to provide an overview of the Trade Activity Effectiveness analysis. Given the household transaction data for the Pasta, Pasta Sauce, Pancake Mix and Syrup categories, this document covers the end-to-end journey of the analysis.
Feature and Display is part of in-store trade activity that are used to increase sales of products. Using the data provided,
Can we understand what the impact of Feature and Display on the sales of four categories is?
Two reasons why this problem should be solved:
Packages used:
# install.packages("sas7bdat",repos = "http://cran.us.r-project.org")
# install.packages("dplyr",repos = "http://cran.us.r-project.org")
# install.packages("tidyverse",repos = "http://cran.us.r-project.org")
library(sas7bdat)
library(dplyr)
library(tidyverse)
library(png)
library(grid)
Data Source: Carbo-loading data
Carbo-Loading contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied.
# It's not a good idea to import data in a RMD file, so we will only load the already imported data. However, code for importing data is below.
# Note: Download data from the above link and change working directory to your local machine
setwd("C:/Users/varsh/OneDrive/Documents/R/Final Project/data")
causal_lookup <- read.sas7bdat("causal_lookup.sas7bdat")
product_lookup <- read.sas7bdat("product_lookup.sas7bdat")
store_lookup <- read.sas7bdat("store_lookup.sas7bdat")
transactions <- read.sas7bdat("transactions.sas7bdat")
As part of Data Cleaning, we will first check the dimensions and structure of the datasets.
## Observations: 351,372
## Variables: 6
## $ upc <fct> 7680850108, 5100001212, 5100002792, 3620000300, 4...
## $ store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ week <dbl> 68, 66, 72, 55, 68, 55, 66, 55, 55, 76, 66, 66, 6...
## $ feature_desc <fct> Wrap Interior Feature, Wrap Back Feature, Interio...
## $ display_desc <fct> Not on Display, Not on Display, Not on Display, N...
## $ geography <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## Observations: 927
## Variables: 5
## $ upc <fct> 111112360, 566300023, 566300028, 566300029...
## $ product_description <fct> VINCENT S ORIG MARINARA S, PINE MOUNTAIN S...
## $ commodity <fct> pasta sauce, syrups, syrups, syrups, syrup...
## $ brand <fct> Vincent's, Pine Mountain, Miller, Miller, ...
## $ product_size <fct> 25 OZ, 40 OZ, 19 OZ, 12 OZ, 19 OZ, 26 O...
## Observations: 387
## Variables: 2
## $ store <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ store_zip_code <fct> 37865, 30084, 30039, 31210, 30044, 31204, 30064...
## Observations: 5,197,681
## Variables: 11
## $ upc <fct> 7680850106, 3620000470, 1800028064, 999998...
## $ dollar_sales <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, ...
## $ units <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <fct> 1100, 1100, 1137, 1148, 1323, 1323, 1415, ...
## $ geography <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ household <dbl> 125434, 125434, 108320, 162016, 89437, 894...
## $ store <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, ...
## $ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ coupon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Summary of dimensions of the four datasets
| Table name | Dimensions (Row X Columns) |
|---|---|
| Causal data | 351372, 6 |
| Product data | 927, 5 |
| Store data | 387, 2 |
| Transactions | 5197681, 11 |
Before joining the datasets into one, we should check for duplicate entries in each dataset at the right level.
Below is the expected level (combination of columns) of data at which each row should be unique:
Here’s the code for removing duplicates:
causal_dedup <- distinct(as_tibble(causal_lookup), upc, store, week, .keep_all = TRUE)
product_dedup <- distinct(as_tibble(product_lookup), upc, .keep_all = TRUE)
store_dedup <- distinct(as_tibble(store_lookup), store, .keep_all = TRUE)
transactions_dedup <- distinct(as_tibble(transactions), household, basket,
upc, .keep_all = TRUE)
Summary of dimensions (Row X Columns) of the four datasets - Before and after removing duplicates
| Table name | Dimensions Before | Dimensions (After) |
|---|---|---|
| Causal data | 351372, 6 | 351372, 6 |
| Product data | 927, 5 | 927, 5 |
| Store data | 387, 2 | 387, 2 |
| Transactions | 5197681, 11 | 5197681, 11 |
Conclusion: There are no duplicate rows for the respective levels in all four datasets.
We need to check for missing values (NA values) in each dataset individually and then check for NAs, once again, after they have been joined into one dataset.
Here is the code to check for missing values in each dataset:
# Checking missing values in all four datasets using apply function
# The resulting table will have a TRUE or FALSE indicator to confirm if
# the respective column has a missing value or not
apply(is.na(causal_lookup), 2, any)
## upc store week feature_desc display_desc
## FALSE FALSE FALSE FALSE FALSE
## geography
## FALSE
apply(is.na(causal_lookup), 2, any)
## upc store week feature_desc display_desc
## FALSE FALSE FALSE FALSE FALSE
## geography
## FALSE
apply(is.na(causal_lookup), 2, any)
## upc store week feature_desc display_desc
## FALSE FALSE FALSE FALSE FALSE
## geography
## FALSE
apply(is.na(causal_lookup), 2, any)
## upc store week feature_desc display_desc
## FALSE FALSE FALSE FALSE FALSE
## geography
## FALSE
The four datasets don’t have missing values in them separately. But once we join them into one dataset using appropriate keys, we may notice missing values.
Below is the code snippet for joining the tables
# Converting some of the columns from factors to character before joining # the tables
transactions$upc <- as.character(transactions$upc)
product_lookup$upc <- as.character(product_lookup$upc)
# Joining transactions table with product_lookup on key "upc"
trans_product <- left_join(transactions, product_lookup, by = "upc")
# Joining the above table with store_lookup on key "store"
trans_prod_store <- left_join(trans_product, store_lookup, by = "store")
# Changing key columns from factor to character in causal data
causal_lookup$upc <- as.character(causal_lookup$upc)
causal_lookup$feature_desc <- as.character(causal_lookup$feature_desc)
causal_lookup$display_desc <- as.character(causal_lookup$display_desc)
# Joining the previously created table with causal data on
# "upc","geography","store" and "week"
trans_prod_sto_causal <- left_join(trans_prod_store, causal_lookup,
by = c("upc","geography","store","week"))
# Viewing the first 6 rows of the final dataset
head(trans_prod_sto_causal)
## upc dollar_sales units time_of_transaction geography week
## 1 7680850106 0.80 1 1100 2 1
## 2 3620000470 3.59 1 1100 2 1
## 3 1800028064 2.25 1 1137 2 1
## 4 9999985067 0.85 1 1148 2 1
## 5 9999985131 2.19 1 1323 2 1
## 6 5100002794 2.19 1 1323 2 1
## household store basket day coupon product_description
## 1 125434 244 1 1 0 BARILLA ANGEL HAIR
## 2 125434 244 1 1 0 BERTOLLI TOM&BASIL SAUCE
## 3 108320 244 2 1 0 H J PANCK BTRMLK COMP MIX
## 4 162016 244 3 1 0 PRIVATE LABEL VERMICELLI
## 5 89437 244 4 1 0 PRIVATE LABEL IMPORTED LASAGNA
## 6 89437 244 4 1 0 PREGO SPAG SAUCE MEAT
## commodity brand product_size store_zip_code
## 1 pasta Barilla 16 OZ 40222
## 2 pasta sauce Bertolli 24 OZ 40222
## 3 pancake mixes Hungry Jack 40222
## 4 pasta Private Label 16 OZ 40222
## 5 pasta Private Label Premium 16 OZ 40222
## 6 pasta sauce Prego 26 OZ 40222
## feature_desc display_desc
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
# Checking for NA values in trans_prod_sto_causal table
na_count <- apply(is.na(trans_prod_sto_causal), 2, any)
na_count
## upc dollar_sales units
## FALSE FALSE FALSE
## time_of_transaction geography week
## FALSE FALSE FALSE
## household store basket
## FALSE FALSE FALSE
## day coupon product_description
## FALSE FALSE TRUE
## commodity brand product_size
## TRUE TRUE TRUE
## store_zip_code feature_desc display_desc
## FALSE TRUE TRUE
Following are the reasons for missing values:
Null values in the Feature and Display columns were imputed in 2 steps:
Below is the code snippet for treating null values in Feature and Display columns
# Replacing null values from week 1 to week 42 with "No Data"
for (i in 1:42) {
trans_prod_sto_causal$feature_desc[trans_prod_sto_causal$week == i] <- "No Data"
trans_prod_sto_causal$display_desc[trans_prod_sto_causal$week == i] <- "No Data"
}
# Replacing rest of the null values with "No Trade"
trans_prod_sto_causal$feature_desc[is.na(trans_prod_sto_causal$feature_desc)] <- "No Trade"
trans_prod_sto_causal$display_desc[is.na(trans_prod_sto_causal$display_desc)] <- "No Trade"
Missing values in Feature and Display have been treated.
Decision to treat missing values in product description, commodity, brand and product_size can be done after the proposed EDA.
# Checking for NA values in the treated trans_prod_sto_causal table
na_count <- apply(is.na(trans_prod_sto_causal), 2, any)
na_count
## upc dollar_sales units
## FALSE FALSE FALSE
## time_of_transaction geography week
## FALSE FALSE FALSE
## household store basket
## FALSE FALSE FALSE
## day coupon product_description
## FALSE FALSE TRUE
## commodity brand product_size
## TRUE TRUE TRUE
## store_zip_code feature_desc display_desc
## FALSE FALSE FALSE
Below is an overview of the analytical approach that will be followed to answer the problem statement.
Next steps: