Heard of Pasta and Pancakes ? Do you like them ? Ever wondered the science behind keeping the complementary items such as pasta, pasta sauce in the nearby aisles or promotions that are applied to those items? Which pasta/pasta sauce brand is the most consumed in USA?
The humongous data that is generated with each transcation of these items can help us answer such questions, which makes this dataset pretty interesting.
The Carbo-Loading dataset hosts the household transaction over a period of 2 years from 4 categories of food items consumed by customers, which are: Pasta, Pasta Sauce, Syrup, Pancake Mix.
The objective of this dataset is to understand the variation of units of a commodity sold and revenue generated with various metrics such as display of a product on mailer or store, brand , coupon usage, time.** This will enable us to deliver some insights to boost sales of a commodity.
The dataset chosen here is the carbo-loading dataset , that was shared by Professor Brad Boehmke.
The Carbo-Loading Dataset is a relational database comprising of four different tables:
I intend to use visualization to derive relations between the diferent categories and their consumptions. I will use histogram to plot the frequency of the occurence of a brand/ product type in the data.Box plot, bar charts can also help us understand the distribution of products, brands on the basis of demographics, coupon usage et cetera.At a later stage, I think logistic regression can help me predict the sales of a product or estimate the success rate of a marketing campaign.
The commodities that are the ground of analysis are something that is consumed in daily life. Falling short of any of these groceries or over stocking of a not so popular product can result in the loss of the grocer and identifying the products that are most consumed in a particular category will enable the grocer to boost sales, run suitable promotions, ensuring a win-win situation for the consumer of those commodities and the seller.
collection of packages in R such as dplyr, readr, ggplot2 which are used for EDA
library(tidyverse)
Import Sas7bdat files in R
library(haven)
Makes it easier to work with date and time
library(lubridate)
Arranging Plots
library(gridExtra)
Making interactive graphs
library(plotly)
The original dataset contains 4 relational database, with a primary key or foreign keys that will enable the joins in the 4 databases.After importing and cleaning, I will use inner join on all 4 to get a database containing only the rows that had values in all four of them.
Causal_Lookup
Product_Lookup
Transactions
Store_Lookup
causal_lookup <- read_sas("causal_lookup.sas7bdat")
To clean the data, we need to understand the structure of the data first, which is done as follows:
dim(causal_lookup)
head(causal_lookup)
glimpse(causal_lookup)
colnames(causal_lookup)
Contains 351372 rows and 6 columns.
colSums(is.na(causal_lookup))
After concluding that there are no missing values,we decide whether the current datatype of column is appropriate or not, this can be decided using the glimpse function. If it is not appropriate, we cast it into the correct datatype.
causal_lookup <- causal_lookup %>%
mutate(
upc = as.numeric(upc),
feature_desc = as.factor(feature_desc),
display_desc = as.factor(display_desc),
geography = as.factor(geography)
)
Now, we can proceed to the validations that were given as the part of the variable descriptions.
Check for the uniqueness records, this can be done using unique function. Since no rows were duplicate, I proceeded with the originally imported dataset.
Other validations that were required, such as lenght of UPC should be 10, value of geography should either be 1 or 2
casual_1 <- causal_lookup %>%
filter(nchar(as.character(upc)) == 10) %>%
filter(geography == 1 | geography == 2)
Since, after running the code above, the data turned out to be relatively clean, I decided to proceed with the original set.
The final requirement corresponding to this dataset was, arranging week in the chronological order:
causal_lookup <- causal_lookup %>%
arrange(week)
The causal_lookup dataset is now ready to be used for ‘joining’ operation.
product_lookup <- read_sas("product_lookup.sas7bdat")
head(product_lookup)
glimpse(product_lookup)
colnames(product_lookup)
Contains 927 rows and 5 columns.
colSums(is.na(product_lookup))
Now that the data does not contain missing values, we can proceed with casting the columns into the correct datatype, for instance commodity should be considered as factor here.
product_lookup <- product_lookup %>% mutate(
upc = as.numeric(upc),
commodity = as.factor(commodity),
brand = as.factor(brand)
)
The product size of the product_lookup will be dealt separately because it contains lot of problems, which need to be specifically dealt with. I will taking some elementary steps to handle this column, but wherever I will be using this column in the analysis, I will make sure that proper cleaning is done before using this data.
#Removal of white spaces, for example one of the value was '26 OZ'
product_lookup$product_size <- str_replace_all(product_lookup$product_size, fixed(" "),"")
#Other issues are that some values are in LB, some are in OZ, so proper calculations need to be done there to deal with this problem
#Although product type should be numeric but we are not going to change the type because NAs will be forced at certain places.
After the cleaning steps, I would like to validate the data as per the guidelines:
unique(product_lookup$commodity)
Unique gives us four values, which satisfies our criterion for commodity.
Now, our dataset is ready to be used for joining.
transactions <- read_sas("transactions.sas7bdat")
head(transactions)
glimpse(transactions)
colnames(transactions)
Contains 5197681 rows and 11 columns.
colSums(is.na(transactions))
Since the data has no missing values, we can clearly proceed with the checking of the appropriateness of the datatype of the variables.
transactions <- transactions %>% mutate(
upc = as.numeric(upc),
geography = as.factor(geography),
coupon = as.factor(coupon)
)
One more variable needs to be taken care of here, which is time_of_transaction. It is originally given in military time and it needs to be converted, I converted it to 24-hour format and the rest will taken care care of when this column will be used in analysis.
transactions$time_of_transaction <- sub("([[:digit:]]{2,2})$", ":\\1", transactions$time_of_transaction)
The next step needs to be validation of data, in accordance with the decsriptions given in the data description.
There are no duplicate records in the table.
Since there is a condition that if days are between 1-7, week will be 1;if days are between 8-14, week will be 2, and so on. We need to cross check that the corresponding values are correct in the week column or not.
Sales should be greater than 0
transactions <- transactions %>%
filter(transactions$dollar_sales >0)
transactions <- transactions %>% mutate(year = ifelse(week <= 52, 1, 2))
transactions$year <- as.factor(transactions$year)
store_lookup <- read_sas("store_lookup.sas7bdat")
head(store_lookup)
glimpse(store_lookup)
Contains 387 rows and 2 columns.
colSums(is.na(store_lookup))
Validating the data based on the condition given in the data description fil
The length of the store_zip_code should be 5
store_1 <- store_lookup %>%
filter(nchar(as.character(store_zip_code))==5)
setequal(store_1, store_lookup)
#The two datasets are equal, which means that the zip codes fulfill the condition.
This finishes the cleaning part of the datasets. Since we were given the relational database, we need to work on the analysis part after joining the 4 datasets, on the basis of some primary key or foreign key.
After having done with the cleaning of the data, we proceed to build a connection between the 4 tables so that we can join them. I preferred inner join, since it will give me a dataset where values are present in all four tables.
table_1 <- inner_join(causal_lookup, product_lookup, by = "upc")
table_2 <- inner_join(table_1, transactions,by = c("upc", "store", "week", "geography") )
final_db <- inner_join(table_2, store_lookup, by = "store")
final_db contains the values from all four tables.
final_db <- final_db %>% group_by(commodity, brand)
head(final_db)
## # A tibble: 6 x 19
## # Groups: commodity, brand [3]
## upc store week feature_desc display_desc geography product_descrip~
## <dbl> <dbl> <dbl> <fct> <fct> <fct> <chr>
## 1 3.62e9 1 55 Wrap Interi~ Not on Disp~ 1 RAGU OWS SPAG S~
## 2 4.11e9 1 68 Wrap Interi~ Not on Disp~ 1 CLASSICO SPICY ~
## 3 4.42e9 1 66 Interior Pa~ Not on Disp~ 1 MRS BUTTERWORTH~
## 4 3.62e9 1 56 Interior Pa~ Not on Disp~ 1 RAGU ROBUSTO SI~
## 5 4.11e9 1 68 Wrap Interi~ Not on Disp~ 1 CLSCO SAUCE TOM~
## 6 3.62e9 1 56 Interior Pa~ Not on Disp~ 1 RAGU LT PARM AL~
## # ... with 12 more variables: commodity <fct>, brand <fct>,
## # product_size <chr>, dollar_sales <dbl>, units <dbl>,
## # time_of_transaction <chr>, household <dbl>, basket <dbl>, day <dbl>,
## # coupon <fct>, year <fct>, store_zip_code <chr>
glimpse(final_db)
## Observations: 381,746
## Variables: 19
## $ upc <dbl> 3620000300, 4112907742, 4420979129, 362000...
## $ store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ week <dbl> 55, 68, 66, 56, 68, 56, 67, 67, 56, 56, 56...
## $ feature_desc <fct> Wrap Interior Feature, Wrap Interior Featu...
## $ display_desc <fct> Not on Display, Not on Display, Not on Dis...
## $ geography <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ product_description <chr> "RAGU OWS SPAG SAUCE MEAT", "CLASSICO SPIC...
## $ commodity <fct> pasta sauce, pasta sauce, syrups, pasta sa...
## $ brand <fct> Ragu, Classico, Mrs Butterworth, Ragu, Cla...
## $ product_size <chr> "26 OZ", "26 OZ", "24 OZ", "26 OZ", "26...
## $ dollar_sales <dbl> 1.50, 2.00, 2.50, 1.50, 2.00, 3.00, 0.80, ...
## $ units <dbl> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, ...
## $ time_of_transaction <chr> "2215", "1541", "1334", "1317", "1402", "2...
## $ household <dbl> 437312, 237038, 430585, 402670, 425090, 75...
## $ basket <dbl> 1665200, 2088329, 2031119, 1718312, 209759...
## $ day <dbl> 381, 471, 458, 392, 473, 387, 464, 465, 38...
## $ coupon <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ year <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ store_zip_code <chr> "37865", "37865", "37865", "37865", "37865...
colSums(is.na(final_db))
summary(final_db)
As decided initially, I will try to develop some relations and deliver insights mainly by using visualization and some statistical analysis.
final_db %>% group_by(commodity,display_desc, feature_desc) %>% summarise(sales_total = sum(dollar_sales)) %>%
plot_ly( x = ~display_desc, y = ~sales_total, type = "bar", text = text,
name = 'Display Description') %>%
add_trace( x = ~feature_desc, y = ~sales_total, type = "bar", text = text,
name = 'Feature Description') %>%
layout(title = "Variation in Sales with Display in Store and Mailer")
The sales by products “Not on Display” is quite high, which implies that the product must have been sold online. Hence, online sales are hgihly favored here.
Interior page feature has garnered the highest sales here, followed by Front page features.
Commodity-Sales Relationship
g1 <-final_db %>% group_by(geography, commodity) %>% summarise(sum_sales = sum(dollar_sales)) %>%
ggplot(aes(x= reorder(commodity, -sum_sales), y= sum_sales, fill= geography))+
geom_col(position ="dodge" )+
coord_flip()+
xlab("Commodity")+
ylab("Sales in $")
#commodity and units sold
g2 <-final_db %>% group_by(geography, commodity) %>% summarise(sum_units = sum(units)) %>%
ggplot(aes(x= reorder(commodity, -sum_units), y= sum_units, fill= geography))+
geom_col(position ="dodge")+
coord_flip()+
xlab("Commodity")+
ylab("Units Sold")
grid.arrange(g1, g2)
Some insights:
Brand-Sales Relationship
Revenue and Unit sales for different brands
final_db %>% group_by(brand) %>% summarise(sales_total = sum(dollar_sales), unit_Sales = sum(units)) %>%
plot_ly( x = ~brand, y = ~sales_total, type = "bar", text = text,
name = 'Total_Sales') %>%
add_trace( x = ~brand, y = ~unit_Sales, type = "bar", text = text,
name = 'Unit_Sales') %>%
layout(title = "Brand-wise metrics: Sales and Units Sold")
final_db %>% group_by(coupon) %>% summarise(sales_1 = sum(dollar_sales), sales_2 = sum(units)) %>%
plot_ly(x = ~coupon, y = ~sales_1, type = "bar", name = "Total_Sales") %>%
add_trace(x = ~coupon, y = ~sales_2, type = "bar", name = "Total Units") %>%
layout(title = "Coupons and Sales Relations ")
There are only 2% incidents where coupons have been used.It can be inferred from here that the campaign did not reach as many people as it should or the promotion was not good enough to be considered.
Commodity on which coupon was used:
final_db %>% filter(coupon == 1) %>% group_by(commodity, brand) %>% summarise(sum_sales = sum(dollar_sales), unit_sales = sum(units)) %>%
plot_ly(x = ~commodity, y = ~sum_sales, type = "bar", name = "Total Sales" ) %>%
add_trace(x = ~commodity, y = ~unit_sales, type = "bar", name = "Unit Sales" )
The above graph shows that the maximum usage of coupons was done for pasta sauce, which is the most sought after commodity according to the dataset.
final_db %>% group_by(year) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
plot_ly(x= ~year, y= ~unit_sales, type="bar", name="Unit Sales") %>%
add_trace(x= ~year, y= ~sales, type="bar", name="Revenue") %>%
layout(title="Sales over a period of 2 years")
g1 <- final_db %>% group_by(year, commodity) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
ggplot(aes(x=year, y=unit_sales, fill=commodity))+
geom_col(position="dodge")+
ggtitle("Units over two years")
g2 <- final_db %>% group_by(year, commodity) %>% summarise(unit_sales = sum(units), sales = sum(dollar_sales)) %>%
ggplot(aes(x=year, y=sales, fill=commodity))+
geom_col(position="dodge")+
ggtitle("Revenue over two years")
grid.arrange(g1, g2)
To summarise I would like state couple of findings:
A. Pasta Sauce has the highest sales with Ragu brand being the top contributor.
B. The online sales (Not on Display) is the highest contributor to the sales.
C. Geographical region 1 and 2 do not have very high difference in sales of all commodities except pasta sauce.
D. Promotions such as coupons are not widely used for any geographical region or commodity or brand. This means that there is a need to work on that front.
E. For mailer, Interior page feature is the most profitable source of propogation, with Front Page feature, being the second most favored.
In this case, we need to either scale down the data from year 2 to match the levels of year 1 , or we need to extrapolate the data of week 1-42 for year 1.
To conclude, I would like to state that a lot more can be done on this data. I would like to work on drilling down the metrics to find the correlations as well.
Thank You