Carbo-Loading Data

1.Introduction and Objective

Heard of Pasta and Pancakes ? Do you like them ? Ever wondered the science behind keeping the complementary items such as pasta, pasta sauce in the nearby aisles or promotions that are applied to those items? Which pasta/pasta sauce brand is the most consumed in USA?

The humongous data that is generated with each transcation of these items can help us answer such questions, which makes this dataset pretty interesting.

The Carbo-Loading dataset hosts the household transaction over a period of 2 years from 4 categories of food items consumed by customers, which are: Pasta, Pasta Sauce, Syrup, Pancake Mix.

The objective of this dataset is to understand the variation of units of a commodity sold and revenue generated with various metrics such as display of a product on mailer or store, brand , coupon usage, time.** This will enable us to deliver some insights to boost sales of a commodity.

1.1 Data

The dataset chosen here is the carbo-loading dataset , that was shared by Professor Brad Boehmke.

The Carbo-Loading Dataset is a relational database comprising of four different tables:

Product_Lookup - Provides information about each product corresponding to a unique UPC(10-digit unique code).
Causal_Lookup - Provides information about the trade activity for each UPC.
Transactions - Contains the details of all the transactions made through the loyalty card program of a leading US grocer.
Store_Lookup - Provides each store’s zip code.

1.2 Approach

I intend to use visualization to derive relations between the diferent categories and their consumptions. I will use histogram to plot the frequency of the occurence of a brand/ product type in the data.Box plot, bar charts can also help us understand the distribution of products, brands on the basis of demographics, coupon usage et cetera.At a later stage, I think logistic regression can help me predict the sales of a product or estimate the success rate of a marketing campaign.

1.3 How will this analysis help ?

The commodities that are the ground of analysis are something that is consumed in daily life. Falling short of any of these groceries or over stocking of a not so popular product can result in the loss of the grocer and identifying the products that are most consumed in a particular category will enable the grocer to boost sales, run suitable promotions, ensuring a win-win situation for the consumer of those commodities and the seller.

2. Packages Required

2.1 Libraries Used

collection of packages in R such as dplyr, readr, ggplot2 which are used for EDA

library(tidyverse)

Import Sas7bdat files in R

library(haven)

Makes it easier to work with date and time

library(lubridate)

Arranging Plots

library(gridExtra)

Making interactive graphs

library(plotly)

3. Data Preparation

3.1 Data Source/ Variable Description

The original dataset contains 4 relational database, with a primary key or foreign keys that will enable the joins in the 4 databases.After importing and cleaning, I will use inner join on all 4 to get a database containing only the rows that had values in all four of them.

Causal_Lookup

week - Expresses the week of transaction. Values are between 1-104.
upc- Standard 10-digit UPC
store- Identifies unique store
feature_desc- Describes location of product on weekly mailer
display_desc- location of in-store display containing the product.
geography- Defines the region of sale of product. It is divided in 2 major regions - 1 and 2

Product_Lookup

product_description- Gives the description of the product
commodity- Specifies one of the 4 categories to which the product belongs- Pasta, Pasta sauce, Syrups, Pancake Mixes.
brand- Brand to which the commodity belongs.
product_size- Package size of the product.

Transactions

dollar_sales- Amount of dollars spent by the customer
units- Number of products purchased by the customer
time of transaction- Time of transcation(military time)
household- Identifies the unique household
basket- Indentifies the unique trips made to the store
coupon- Coupon used(1) or not (0)
day- Day of transaction. Values between 1-728. If value is between 1-7, week will be 1 and so on.

Store_Lookup

Store zip Code - 5 Digit zip code.
Store- Identifies the unique store.

3.2 Dataset 1 - causal_lookup

Importing the data

causal_lookup <- read_sas("causal_lookup.sas7bdat")

To clean the data, we need to understand the structure of the data first, which is done as follows:

Having a look at the structure of the causal_lookup

dim(causal_lookup)
head(causal_lookup)
glimpse(causal_lookup)
colnames(causal_lookup)

Contains 351372 rows and 6 columns.

Column-wise identification of missing values

colSums(is.na(causal_lookup))

After concluding that there are no missing values,we decide whether the current datatype of column is appropriate or not, this can be decided using the glimpse function. If it is not appropriate, we cast it into the correct datatype.

causal_lookup <- causal_lookup %>% 
  mutate(
    upc = as.numeric(upc),
    feature_desc = as.factor(feature_desc),
    display_desc = as.factor(display_desc),
    geography = as.factor(geography)
  )

Now, we can proceed to the validations that were given as the part of the variable descriptions.

Check for the uniqueness records, this can be done using unique function. Since no rows were duplicate, I proceeded with the originally imported dataset.
Other validations that were required, such as lenght of UPC should be 10, value of geography should either be 1 or 2

casual_1 <- causal_lookup %>% 
  filter(nchar(as.character(upc)) == 10) %>%
  filter(geography == 1 | geography == 2)

Since, after running the code above, the data turned out to be relatively clean, I decided to proceed with the original set.

The final requirement corresponding to this dataset was, arranging week in the chronological order:

causal_lookup <- causal_lookup %>%
  arrange(week)

The causal_lookup dataset is now ready to be used for ‘joining’ operation.

3.3 Dataset 2 - Product_Lookup

Importing the data

product_lookup <- read_sas("product_lookup.sas7bdat")

Taking a look at the structure of the data in product_lookup:

head(product_lookup)
glimpse(product_lookup)
colnames(product_lookup)

Contains 927 rows and 5 columns.

Checking if there are missing values or not

colSums(is.na(product_lookup))

Now that the data does not contain missing values, we can proceed with casting the columns into the correct datatype, for instance commodity should be considered as factor here.

product_lookup <- product_lookup %>% mutate(
  upc = as.numeric(upc),
  commodity = as.factor(commodity),
  brand = as.factor(brand)
)

The product size of the product_lookup will be dealt separately because it contains lot of problems, which need to be specifically dealt with. I will taking some elementary steps to handle this column, but wherever I will be using this column in the analysis, I will make sure that proper cleaning is done before using this data.

#Removal of white spaces, for example one of the value was '26   OZ'
product_lookup$product_size <- str_replace_all(product_lookup$product_size, fixed(" "),"")
#Other issues are that some values are in LB, some are in OZ, so proper calculations need to be done there to deal with this problem
#Although product type should be numeric but we are not going to change the type because NAs will be forced at certain places.

After the cleaning steps, I would like to validate the data as per the guidelines:

The data should be unique, free from duplicate entries.
Commodity should have values among these four only: Pasta, Pasta Sauce, Syrup, Pancake Mix.

unique(product_lookup$commodity)

Unique gives us four values, which satisfies our criterion for commodity.

Now, our dataset is ready to be used for joining.

3.4 Dataset 3- Transactions

Importing the data

transactions <- read_sas("transactions.sas7bdat")

Taking a look at the structure of the data:

head(transactions)
glimpse(transactions)
colnames(transactions)

Contains 5197681 rows and 11 columns.

Checking whethere our dataset has any missing values or not:

colSums(is.na(transactions))

Since the data has no missing values, we can clearly proceed with the checking of the appropriateness of the datatype of the variables.

transactions <- transactions %>% mutate(
  upc = as.numeric(upc),
  geography = as.factor(geography),
  coupon = as.factor(coupon)

  )

One more variable needs to be taken care of here, which is time_of_transaction. It is originally given in military time and it needs to be converted, I converted it to 24-hour format and the rest will taken care care of when this column will be used in analysis.

transactions$time_of_transaction <- sub("([[:digit:]]{2,2})$", ":\\1", transactions$time_of_transaction)

The next step needs to be validation of data, in accordance with the decsriptions given in the data description.

There are no duplicate records in the table.
Since there is a condition that if days are between 1-7, week will be 1;if days are between 8-14, week will be 2, and so on. We need to cross check that the corresponding values are correct in the week column or not.
Sales should be greater than 0

transactions <- transactions %>%
  filter(transactions$dollar_sales >0)

Mutate

transactions <- transactions %>% mutate(year = ifelse(week <= 52, 1, 2))
transactions$year <- as.factor(transactions$year)

3.5 Dataset 4 - Store_Lookup

Importing the data

store_lookup <- read_sas("store_lookup.sas7bdat")

Glimpse of the data

head(store_lookup)
glimpse(store_lookup)

Contains 387 rows and 2 columns.

Checking if there are missing values present or not

colSums(is.na(store_lookup))

Validating the data based on the condition given in the data description fil
The length of the store_zip_code should be 5

store_1 <- store_lookup %>% 
  filter(nchar(as.character(store_zip_code))==5)

setequal(store_1, store_lookup)
#The two datasets are equal, which means that the zip codes fulfill the condition.

This finishes the cleaning part of the datasets. Since we were given the relational database, we need to work on the analysis part after joining the 4 datasets, on the basis of some primary key or foreign key.

3.6 Joining the Datasets

After having done with the cleaning of the data, we proceed to build a connection between the 4 tables so that we can join them. I preferred inner join, since it will give me a dataset where values are present in all four tables.

table_1 <- inner_join(causal_lookup, product_lookup, by = "upc")

table_2 <- inner_join(table_1, transactions,by = c("upc", "store", "week", "geography") )

final_db <- inner_join(table_2, store_lookup, by = "store")

final_db contains the values from all four tables.

Preview of the final dataset:

final_db <- final_db %>% group_by(commodity, brand)
head(final_db)

## # A tibble: 6 x 19
## # Groups:   commodity, brand [3]
##      upc store  week feature_desc display_desc geography product_descrip~
##    <dbl> <dbl> <dbl> <fct>        <fct>        <fct>     <chr>           
## 1 3.62e9     1    55 Wrap Interi~ Not on Disp~ 1         RAGU OWS SPAG S~
## 2 4.11e9     1    68 Wrap Interi~ Not on Disp~ 1         CLASSICO SPICY ~
## 3 4.42e9     1    66 Interior Pa~ Not on Disp~ 1         MRS BUTTERWORTH~
## 4 3.62e9     1    56 Interior Pa~ Not on Disp~ 1         RAGU ROBUSTO SI~
## 5 4.11e9     1    68 Wrap Interi~ Not on Disp~ 1         CLSCO SAUCE TOM~
## 6 3.62e9     1    56 Interior Pa~ Not on Disp~ 1         RAGU LT PARM AL~
## # ... with 12 more variables: commodity <fct>, brand <fct>,
## #   product_size <chr>, dollar_sales <dbl>, units <dbl>,
## #   time_of_transaction <chr>, household <dbl>, basket <dbl>, day <dbl>,
## #   coupon <fct>, year <fct>, store_zip_code <chr>

Checking the structure of final_db and looking for missing values

glimpse(final_db)

## Observations: 381,746
## Variables: 19
## $ upc                 <dbl> 3620000300, 4112907742, 4420979129, 362000...
## $ store               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ week                <dbl> 55, 68, 66, 56, 68, 56, 67, 67, 56, 56, 56...
## $ feature_desc        <fct> Wrap Interior Feature, Wrap Interior Featu...
## $ display_desc        <fct> Not on Display, Not on Display, Not on Dis...
## $ geography           <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ product_description <chr> "RAGU OWS SPAG SAUCE MEAT", "CLASSICO SPIC...
## $ commodity           <fct> pasta sauce, pasta sauce, syrups, pasta sa...
## $ brand               <fct> Ragu, Classico, Mrs Butterworth, Ragu, Cla...
## $ product_size        <chr> "26 OZ", "26 OZ", "24 OZ", "26    OZ", "26...
## $ dollar_sales        <dbl> 1.50, 2.00, 2.50, 1.50, 2.00, 3.00, 0.80, ...
## $ units               <dbl> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <chr> "2215", "1541", "1334", "1317", "1402", "2...
## $ household           <dbl> 437312, 237038, 430585, 402670, 425090, 75...
## $ basket              <dbl> 1665200, 2088329, 2031119, 1718312, 209759...
## $ day                 <dbl> 381, 471, 458, 392, 473, 387, 464, 465, 38...
## $ coupon              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ year                <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ store_zip_code      <chr> "37865", "37865", "37865", "37865", "37865...

colSums(is.na(final_db))

Quick summary of the entire data frame

summary(final_db)

4. Proposed Exploratory Data Analysis

As decided initially, I will try to develop some relations and deliver insights mainly by using visualization and some statistical analysis.

4.1 Display of a product and Sales Relationship

final_db %>% group_by(commodity,display_desc, feature_desc) %>% summarise(sales_total = sum(dollar_sales)) %>% 
  plot_ly( x = ~display_desc, y = ~sales_total, type = "bar", text = text, 
           name = 'Display Description') %>% 
    add_trace( x = ~feature_desc, y = ~sales_total, type = "bar", text = text, 
             name = 'Feature Description') %>%
  layout(title = "Variation in Sales with Display in Store and Mailer")

The sales by products “Not on Display” is quite high, which implies that the product must have been sold online. Hence, online sales are hgihly favored here.
Interior page feature has garnered the highest sales here, followed by Front page features.

4.2 Commodity-Brand-Sales Relationship

Commodity-Sales Relationship

g1 <-final_db %>% group_by(geography, commodity) %>% summarise(sum_sales = sum(dollar_sales)) %>% 
  ggplot(aes(x= reorder(commodity, -sum_sales), y= sum_sales, fill= geography))+
  geom_col(position ="dodge" )+
  coord_flip()+
  xlab("Commodity")+
  ylab("Sales in $")

#commodity and units sold  
g2 <-final_db %>% group_by(geography, commodity) %>% summarise(sum_units = sum(units)) %>% 
  ggplot(aes(x= reorder(commodity, -sum_units), y= sum_units, fill= geography))+
  geom_col(position ="dodge")+
  coord_flip()+
  xlab("Commodity")+
  ylab("Units Sold")

grid.arrange(g1, g2)

Some insights:

Pasta Sauce is the most consumed commodity (supported by the histogram as well).With region 2 consuming greater amount of pasta sauce.
Although the sales of pasta is lower in region 2 but their units sold are almost equal for both the regions.

Brand-Sales Relationship

Revenue and Unit sales for different brands

final_db %>% group_by(brand) %>% summarise(sales_total = sum(dollar_sales), unit_Sales = sum(units)) %>% 
  plot_ly( x = ~brand, y = ~sales_total, type = "bar", text = text, 
           name = 'Total_Sales') %>% 
  add_trace( x = ~brand, y = ~unit_Sales, type = "bar", text = text, 
             name = 'Unit_Sales') %>%
  layout(title = "Brand-wise metrics: Sales and Units Sold")

Ragu and Private Label have the highest unit sales and revenue as well. This needs to be analysed further to identify the reason behind such n abnormal high value.

4.3 Coupon Usage and Sales Relationship

Relation between units sold and revenue and usage of coupons

final_db %>% group_by(coupon) %>% summarise(sales_1 = sum(dollar_sales), sales_2 = sum(units)) %>%
  plot_ly(x = ~coupon, y = ~sales_1, type = "bar", name = "Total_Sales") %>%
  add_trace(x = ~coupon, y = ~sales_2, type = "bar", name = "Total Units") %>%
  layout(title = "Coupons and Sales Relations ")

0- No coupons used , 1 coupons used

There are only 2% incidents where coupons have been used.It can be inferred from here that the campaign did not reach as many people as it should or the promotion was not good enough to be considered.

Commodity on which coupon was used:

final_db %>% filter(coupon == 1) %>% group_by(commodity, brand) %>% summarise(sum_sales = sum(dollar_sales), unit_sales = sum(units)) %>%
  plot_ly(x = ~commodity, y = ~sum_sales, type = "bar", name = "Total Sales" ) %>%
  add_trace(x = ~commodity, y = ~unit_sales, type = "bar", name = "Unit Sales" )

The above graph shows that the maximum usage of coupons was done for pasta sauce, which is the most sought after commodity according to the dataset.

4.4 Sales over a period of two years

final_db %>% group_by(year) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
  plot_ly(x= ~year, y= ~unit_sales, type="bar", name="Unit Sales") %>%
  add_trace(x= ~year, y= ~sales, type="bar", name="Revenue") %>%
  layout(title="Sales over a period of 2 years")

g1 <- final_db %>% group_by(year, commodity) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
  ggplot(aes(x=year, y=unit_sales, fill=commodity))+
  geom_col(position="dodge")+
  ggtitle("Units over two years")


g2 <- final_db %>% group_by(year, commodity) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
  ggplot(aes(x=year, y=sales, fill=commodity))+
  geom_col(position="dodge")+
  ggtitle("Revenue over two years")

grid.arrange(g1, g2)

Since the data for year 1 is given for the last leg, we can say that difference in the sales for the two years is naturally very high. We can scale down the sales for year 2 according to year 1 or we can generate the missing sales for year 1 to define a relationship between the sales over a period of time.

5. Summary

To summarise I would like state couple of findings:

The sales given for the period of week 43-104:

A. Pasta Sauce has the highest sales with Ragu brand being the top contributor.

B. The online sales (Not on Display) is the highest contributor to the sales.

C. Geographical region 1 and 2 do not have very high difference in sales of all commodities except pasta sauce.

D. Promotions such as coupons are not widely used for any geographical region or commodity or brand. This means that there is a need to work on that front.

E. For mailer, Interior page feature is the most profitable source of propogation, with Front Page feature, being the second most favored.

Sales over the period of 2 years

In this case, we need to either scale down the data from year 2 to match the levels of year 1 , or we need to extrapolate the data of week 1-42 for year 1.

To conclude, I would like to state that a lot more can be done on this data. I would like to work on drilling down the metrics to find the correlations as well.

Thank You