We are looking at carbo-loading transactions over a period of two years across four categories: Pasta, Pasta Sauce, Syrups, and Pancake Mixes.
We want to know:
Are complementary products bought together? If yes, which ones? Are there any specific brands that are bought together?
Are commodity sales (units) impacted when they are featured in the weekly mailer?
We will primarily perform basket analysis by observing items bought in the same basket, using the transactions and product-lookup data. We will group the data by basket and concatenate the brands in the same basket into one row.
We will also slice the data at the commodity, week, and store level, and then perform a hypothesis test to check whether the unit sales of a commodity are impacted when it is featured in the mailer.
By identifying relationships among commodities, the retailer can focus on brands that are often bought together. Brands can be appropriately placed in stores to lure customers to buy combos. Further, the ‘good combinations’ can be promoted among customers who might have stopped purchasing the commodities after buying some ‘bad combinations’ of those commodities.
Weekly mailer analysis will help us determine the usefulness of the mailer in promoting sales. If there is no significant impact, the marketing budget may be better used elsewhere. Additionally, the analysis will help us determine whether the impact of the mailer is the same across commodities.
library(tidyverse) #to visualize, transform, input, tidy and join data
library(haven) #to input data from SAS
library(dplyr) #data wrangling
library(stringr) #string related functions
library(kableExtra) #to create HTML Table
library(DT) #to preview the data sets
library(tibble) #to better preview tibbles
library(scales) # to clear cluttering of labels in plots
We use the carbo-loading household-level data, collected through the loyalty card program of a leading US grocer. It contains four datasets and was obtained from here. Please see the code book at the same location.
The transactions table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| dollar_sales | Amount of dollars spent by the consumer |
| units | Number of products purchased by the consumer |
| time_of_transaction | The time of transaction(military time) |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
| week | Week of the transaction, values are from 1 to 104 |
| household | Unique households |
| store | Unique stores |
| basket | Unique baskets/trips to store |
| day | Day of the transaction, possible values are from 1 to 728 |
| coupon | Indicates coupon usage, 1 if used, 0 otherwise |
The product_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| product_description | Description of product |
| commodity | Specifies the four product categories under consideration |
| brand | Specific brand of item |
| product_size | Specifies package size of product |
The causal_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| store | Identifies unique store |
| week | Week of transaction, possible values are 1 through 104 |
| feature_desc | Describes product location on weekly mailer |
| display_desc | Describes temporary in-store display |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
The store_lookup table:

| Variable | Description |
|---|---|
| store | Identifies unique stores |
| store_zip_code | 5-digit zip code |
We load the four datasets:
# Import data files
data.file_name <- c("causal_lookup", "product_lookup", "store_lookup","transactions")
df_name <- c("causal_lookup", "prd_lookup", "store_lookup","transactions")
# looping the loading process
for(i in seq_along(data.file_name))
{
df <- read_sas(paste0("data/", data.file_name[i], ".sas7bdat"))
assign(df_name[i], df)
}
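A quick sanity check (a sketch) to confirm that all four tables were created:

# Row counts of the loaded tables (names taken from df_name above)
sapply(df_name, function(nm) nrow(get(nm)))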
First, we check the length of the UPC strings in the product lookup table:

unique(nchar(x = prd_lookup$upc))

## [1] 9 10

Some UPCs have only 9 characters, while the standard UPC is 10 digits; the leading zero has been dropped. We left-pad them with zeros so they match the UPCs in the transactions table:

prd_lookup_crtd <- prd_lookup
prd_lookup_crtd$upc <- str_pad(prd_lookup$upc, 10, side = "left", pad = '0')
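A quick re-check (a one-line sketch) should now report a single UPC length of 10:

# Verify the padding: every UPC should now be 10 characters long
unique(nchar(prd_lookup_crtd$upc))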
Next, we examine the distribution of dollar_sales in the transactions table to spot outliers:

quantile(transactions$dollar_sales,
         probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
## -11.76 0.44 0.50 0.50 0.52 0.55 0.59 0.67 0.69 0.75
## 10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
## 0.79 0.79 0.79 0.79 0.80 0.85 0.88 0.89 0.89 0.90
## 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
## 1.00 1.00 1.03 1.09 1.09 1.18 1.19 1.19 1.19 1.25
## 40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
## 1.25 1.29 1.30 1.37 1.39 1.45 1.49 1.50 1.50 1.50
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1.50 1.53 1.58 1.58 1.58 1.59 1.59 1.60 1.65 1.67
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
## 1.69 1.78 1.79 1.85 1.90 1.98 1.98 1.99 1.99 1.99
## 70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
## 1.99 1.99 2.00 2.00 2.18 2.19 2.25 2.29 2.38 2.45
## 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 2.49 2.50 2.55 2.69 2.69 2.79 2.89 2.99 2.99 2.99
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
## 3.00 3.18 3.19 3.34 3.49 3.59 3.98 4.00 4.77 5.70
## 100%
## 153.14
A few data points are inconsistent with the rest: the minimum dollar_sales is negative (likely a return), and the maximum (153.14) is far above the 99th percentile (5.70). To see whether the data is similarly skewed elsewhere, we ran the quantile function on units as well.
quantile(transactions$units,
probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
## 2 2 2 2 2 2 2 2 3 3 156
We observe the same pattern in units: the 99th percentile is 3, yet the maximum is 156.
Finally, we looked at the transactions with more than 100 units:
transactions %>%
filter( units > 100) %>% left_join(prd_lookup_crtd, by = "upc")
## # A tibble: 3 x 15
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 7 more variables: basket <dbl>, day <dbl>, coupon <dbl>,
## # product_description <chr>, commodity <chr>, brand <chr>,
## # product_size <chr>
There are exactly three entries in which a household bought more than 100 units of pasta sauce (mapped with the prd_lookup_crtd table to get the commodity name) in a single transaction, which is not expected. These look like data-entry issues, so we drop the three rows for the rest of the analysis.
#identifying the data index
which(transactions$units > 100)
## [1] 3127414 3522595 4020790
#cross-validating data at the index
transactions[c(3127414,3522595,4020790),]
## # A tibble: 3 x 11
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 3 more variables: basket <dbl>, day <dbl>, coupon <dbl>
# Deleting the rows and creating another table so as to not modify the original
transactions_crtd <- transactions[-c(3127414,3522595,4020790),]
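For reference, an equivalent, order-independent way to drop these rows is a filter on units (a sketch; the index-based removal above is what the rest of the analysis uses):

# Keep only rows with plausible unit counts (equivalent to removing the 3 rows)
transactions_crtd <- transactions %>% filter(units <= 100)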
With the identified issues fixed, the tables are ready for further analysis.
Please note: instead of creating a master dataset with more than 15 columns, we plan to slice the data and apply joins at the required levels.
A brief overview of the transactions data, which will be used throughout the analysis:
The number of unique UPCs that we will primarily use to join our tables: 927
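For reference, this count can be computed in one line (a sketch):

# Number of distinct UPCs in the cleaned transactions table; should be 927
n_distinct(transactions_crtd$upc)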
All of these UPCs are present in the product_lookup table, so we can fetch product and brand information; the anti-join below returns zero rows, confirming complete coverage:
transactions_crtd %>% anti_join(prd_lookup_crtd , by = "upc")
## # A tibble: 0 x 11
## # ... with 11 variables: upc <chr>, dollar_sales <dbl>, units <dbl>,
## # time_of_transaction <chr>, geography <dbl>, week <dbl>,
## # household <dbl>, store <dbl>, basket <dbl>, day <dbl>, coupon <dbl>
The different feature placements available in the data and of interest are: Wrap Interior Feature, Wrap Back Feature, Interior Page Feature, Not on Feature, Interior Page Line Item, Wrap Front Feature, Front Page Feature, Back Page Feature.
The sales and units by feature_desc are summarized below.
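One way such a summary could be produced (a sketch; it assumes causal_lookup joins to the transactions on upc, week, and store, so exact totals may differ slightly with other filtering choices):

# Total dollar sales and units by mailer placement
transactions_crtd %>%
  inner_join(causal_lookup, by = c("upc", "week", "store")) %>%
  group_by(feature_desc) %>%
  summarise(total_sales = sum(dollar_sales),
            total_units = sum(units)) %>%
  arrange(desc(total_sales))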
## # A tibble: 8 x 3
## feature_desc total_sales total_units
## <chr> <dbl> <dbl>
## 1 Interior Page Feature 240347. 196841
## 2 Not on Feature 172678. 136526
## 3 Front Page Feature 105429. 106489
## 4 Wrap Interior Feature 22507. 19338
## 5 Back Page Feature 17707. 17476
## 6 Wrap Front Feature 11098. 11210
## 7 Wrap Back Feature 9492. 7467
## 8 Interior Page Line Item 9378. 5965
Objective:
The objective is to check whether there is any relationship among the given commodities: pasta, pasta sauce, syrups, and pancake mixes
If a relationship exists, the top brands of the commodities most commonly bought together will be identified
We have the following commodities:
# Commodities
categories <- c("pasta","pasta sauce","pancake mixes","syrups")
# Declaring a matrix that will contain complementary category data
basket_analysis_table <- matrix(NA,nrow=length(categories),
ncol=length(categories),byrow=T)
Tidying Data:
To attain the objective, the data frames transactions and prd_lookup_crtd will be joined. This lets us determine the commodity name of a UPC.
A loop will filter a combination of 2 commodities at a time from the categories defined above. The data will then be grouped by basket to count the number of different commodities in each basket.
Code for tidying data and checking the relationship between commodity sales (units):
# Selecting upc's that were sold. Returned items have been excluded.
transactions1 <- transactions %>%
filter(dollar_sales>0)
#The loop will pick all combinations of commodities and populate the basket_analysis_table matrix
for (i in seq_along(categories))
{
for(j in seq_along(categories))
{
#Joining transactions1 and prd_lookup_crtd tables
# and choosing baskets that had the ith and/or jth commodities in them
filtered_data1 <- transactions1 %>%
left_join(prd_lookup_crtd, by="upc") %>%
filter((commodity %in% c(categories[i],categories[j])))
# Grouping baskets to check if they had only one of the ith or jth commodity or both of them
cat1.cat2 <- filtered_data1 %>%
select(basket,commodity) %>%
group_by(basket) %>%
summarize(n_commodities=n_distinct(commodity))
# 1 would mean that either ith or jth commodity was in the basket but not
# both. 2 would mean that the basket contained both of the commodities. Note
# that there might be other commodities in the basket as well. But we are
# only interested in checking for ith and jth categories.
# When i = j, that would mean we are simply counting the total baskets
# in which the ith commodity was present. i != j would focus on counting
# complementary sales (units) of the ith commodity with the jth commodity.
if(i==j){
basket_analysis_table[i,j] <- table(cat1.cat2$n_commodities)[1]
}
else{
basket_analysis_table[i,j] <- table(cat1.cat2$n_commodities)[2]
}
}
}
# Adding col names and row names to matrix
colnames(basket_analysis_table) <- categories
rownames(basket_analysis_table) <- categories
# Displaying the results
basket_analysis_table
## pasta pasta sauce pancake mixes syrups
## pasta 1958161 818343 49887 109502
## pasta sauce 818343 1543463 43563 94276
## pancake mixes 49887 43563 263796 107652
## syrups 109502 94276 107652 658595
Well, well! Something in the above table clearly requires our attention. But isn't it difficult to read such large numbers in a matrix? Let us look at plots that help in checking the relationships between commodities.
Note: there can be other commodities in a basket when we filter for the ith and jth commodities. However, that does not impact our analysis because we check the relationship between only 2 commodities at a time.
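Before plotting, the raw counts can also be normalized to make them easier to compare (a sketch): dividing each row by its diagonal turns entry (i, j) into the share of baskets containing commodity i that also contain commodity j. For example, roughly 42% of pasta baskets also contain pasta sauce (818,343 / 1,958,161).

# Row-wise shares: what fraction of commodity i's baskets also contain commodity j?
round(basket_analysis_table / diag(basket_analysis_table), 2)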
Code for plotting data in basket_analysis_table:
# Converting the matrix to a data frame
rownames(basket_analysis_table) <- NULL
bask_anl_df <- as.data.frame(basket_analysis_table)
# Adding categories to the df
bask_anl_df <- add_column(bask_anl_df, category1 = categories , .before = 1)
# Tidying data for plot
basket_anl_plot_tdf <- bask_anl_df %>%
gather(category2, values,-category1) %>%
arrange(category1)
# Shortening name of pancake mixes to pcake mixes and converting chr columns to factors
basket_anl_plot_tdf$category1 <- str_replace(basket_anl_plot_tdf$category1,
"pancake mixes", "pcake mixes")
basket_anl_plot_tdf$category2 <- str_replace(basket_anl_plot_tdf$category2,
"pancake mixes", "pcake mixes")
basket_anl_plot_tdf$category1 <- as.factor(basket_anl_plot_tdf$category1)
basket_anl_plot_tdf$category2 <- as.factor(basket_anl_plot_tdf$category2)
# Plot
# For making x labels to fit in 2 lines to avoid cluttering
levels(basket_anl_plot_tdf$category2) <- gsub(" ", "\n",
levels(basket_anl_plot_tdf$category2))
# Creating a plot for different comodity combinations
plot1 <- basket_anl_plot_tdf %>% ggplot(aes(x=category2, y=values)) +
geom_bar(position= "dodge",stat="identity",fill = "#FF6666") +
facet_grid(. ~ category1) +
ggtitle("Basket Analysis - Commodity Combinations", subtitle =
"Number of baskets containing any 2 product Commodity purchased at the same time") +
xlab("Commodity") +
ylab("Number of Baskets")
plot1
Results:
For each commodity (written at the top of each panel), we can see how many baskets contained that commodity together with each of the remaining commodities.
Analysis:
Tidying Data:
# We will only consider baskets in which both pasta and pasta sauce (at least) were present
filtered_data1 <- transactions1 %>%
left_join(prd_lookup_crtd, by="upc") %>%
filter((commodity %in% c("pasta","pasta sauce")))
# Grouping baskets to count the number of different commodities (pasta, pasta sauce) in them, i.e., 1 or 2
pasta_pasta.sauce <- filtered_data1 %>%
select(basket,commodity) %>%
group_by(basket) %>%
summarize(n_commodities=n_distinct(commodity))
rel_baskets_p.ps <- pasta_pasta.sauce %>%
filter(n_commodities==2) %>%
select(basket)
# The following code arranges brands in alphabetical order within the same basket and removes duplicate brands in the same basket
# Special case: when the brand of pasta and pasta sauce is the same, those baskets occur only once and are extracted later in the code
brands <- filtered_data1 %>%
filter(basket %in% rel_baskets_p.ps$basket) %>%
arrange(basket,brand) %>%
select(basket, brand) %>%
distinct()
# Concatenating brands in the same basket in different rows to same row and removing resulting duplicates
brands_concat <- brands %>%
group_by(basket) %>%
mutate(count=n()) %>%
mutate(combo = paste0(brand, collapse = ",")) %>%
select(-brand) %>%
distinct()
brands_concat
## # A tibble: 818,343 x 3
## # Groups: basket [818,343]
## basket count combo
## <dbl> <int> <chr>
## 1 1 2 Barilla,Bertolli
## 2 4 2 Prego,Private Label Premium
## 3 8 2 Classico,San Giorgio
## 4 9 2 Prego,Private Label
## 5 11 2 Creamette,Ragu
## 6 12 2 Newman's,Private Label
## 7 14 2 Prego,Private Label
## 8 15 2 Newman's,Private Label Premium
## 9 18 2 Creamette,Hunt's
## 10 19 2 Hunt's,Private Label
## # ... with 818,333 more rows
# Gives count of different brands in a basket
num_brands_per_bask <- table(brands_concat$count)
num_brands_per_bask
##
## 1 2 3 4 5 6 7 8
## 71115 661549 75730 8836 981 120 11 1
# basket plot
brands_concat %>% select(-combo) %>%
ggplot(aes(x=as.factor(count))) +
geom_bar(fill = "#FF6666") +
ggtitle("Count of Unique Brands in Basket - Pasta and Pasta Sauce") +
xlab("Different Brands in Basket") +
ylab("Number of baskets") +
scale_y_continuous(limits=c(0, 800000),labels = comma)
Analysis: The above plot displays the number of distinct brands (with at least one each of pasta and pasta sauce) in a basket. But what does one on the x axis mean? It means that the pasta brand and the pasta sauce brand were the same. Similarly, two means the two brands were different; 3 means one brand of pasta, one brand of pasta sauce, and a third distinct brand of either commodity, and so on.
There are 818,343 baskets containing both pasta and pasta sauce. Out of these, 71,115 contain a single brand, 661,549 contain exactly two distinct brands, and the remaining 85,679 contain three or more.
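For readability, the same breakdown can be expressed as shares of all 818,343 baskets (a quick sketch using the num_brands_per_bask table computed above):

# Proportion of baskets by number of distinct brands
round(prop.table(num_brands_per_bask), 3)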
Here we go!!! The following code identifies the top brand combinations for pasta and pasta sauce:
# We only look at combinations of up to 2 brands of pasta and pasta sauce in a basket
# Case1: When same brand of p and psauce present in the basket
same_p_ps_brand <- brands_concat %>%
filter(count==1 )
table(same_p_ps_brand$combo)
##
## Barilla Colavita Private Label
## 7805 4 62325
## Private Label Premium
## 981
# 62,325 baskets had the Private Label brand of both pasta and pasta sauce
# Case 2: When 2 different brands (one of pasta and one of pasta sauce) were present in the basket
brands_concat_c2 <- brands_concat %>%
filter(count==2) %>%
select(-count) %>%
group_by(combo) %>%
summarise(freq=n()) %>%
arrange(desc(freq))
head(brands_concat_c2,10)
## # A tibble: 10 x 2
## combo freq
## <chr> <int>
## 1 Private Label,Ragu 141965
## 2 Prego,Private Label 65860
## 3 Hunt's,Private Label 43978
## 4 Mueller,Ragu 34421
## 5 Barilla,Ragu 31074
## 6 Creamette,Ragu 30458
## 7 Barilla,Prego 17833
## 8 Ragu,Ronzoni 17158
## 9 Classico,Private Label 17099
## 10 Mueller,Prego 16710
# Top brands plot
# This plot gives the number of baskets that had at least 1 brand
# of pasta and 1 brand of pasta sauce
single_brand <- data.frame(combo="Private Label,Private Label",freq=62325)
top_brand_comb <- rbind(head(brands_concat_c2,4),single_brand)
top_brand_comb1 <- top_brand_comb %>% arrange(desc(freq))
options(scipen=5000)
top_brand_comb1 %>% ggplot(aes(x=reorder(combo,freq),y=freq)) +
geom_bar(position= "dodge",stat="identity",fill = "#FF6666") +
coord_flip() +
ggtitle("Top Brand Combinations - Pasta and Pasta Sauce", subtitle =
"Number of baskets containing given brand combination of pasta and pasta sauce") +
xlab("Brand Combination") + ylab("Number of baskets") +
scale_y_continuous(limits=c(0, 150000),labels = comma)
Results:
The Private Label brand dominates the pasta segment. However, a wide variety of pasta sauces are bought together with Private Label pasta. The Private Label, Ragu combination clearly dominates the other combos present in the stores! So, does the retailer promote such combinations to its customers? We don't know. We do know that the retailer sends out weekly mailers for promotions. If we can somehow confirm that those mailers have an impact on the sales of an individual commodity, wouldn't it be beneficial to promote the above combinations together in those mailers? People already like them, which means these combinations do well with each other. So, if other people who have not tried these combinations are prompted to buy them, we just might see some new happy customers!
Objective: The objective of this analysis is to check whether the sales (number of units) of a commodity are impacted when it is featured in the mailer, across stores. We will perform a paired t-test to check this.
Tidying Data: To perform the hypothesis test, the data in the transactions1 data frame needs to be tidied.
4 tests will be performed, one for each of the given commodities: Pasta, Pasta Sauce, Syrups, and Pancake Mixes
2 groups will be created, Featured and Not Featured, indicating whether a specific commodity was featured in the mailer for a particular week at a particular store
If a commodity featured in the mailer, irrespective of its placement, it is grouped into the Featured category for that week and store
The data will be grouped by store to calculate the mean weekly sales (units) when each commodity was featured in the mailer and when it was not
The difference in mean weekly sales will be calculated across stores for each commodity and checked for approximate Normality
A paired t-test will be applied to the difference in mean weekly sales (units) between the Featured and Not Featured categories, with the following hypothesis:
Hypothesis to be checked: H0: the mean per-store difference in weekly unit sales between Featured and Not Featured weeks is zero, vs. Ha: the mean difference is not zero (two-sided test, alpha = 0.05).
We start by creating vectors to capture the output of the hypothesis tests:
# Creating vectors that will contain the output of the hypothesis testing
p_val <- rep(NA,length(categories))
estimate <- rep(NA,length(categories))
conf_inter.upper <- rep(NA,length(categories))
conf_inter.lower <- rep(NA,length(categories))
Filtering data:
# Selecting data for week >= 43, as causal_lookup has no data for weeks < 43
table_join <- transactions1 %>%
filter(week>=43) %>%
select(upc,units,week,store) %>%
left_join(prd_lookup_crtd, by=c("upc")) %>%
select(upc,units,week,store,commodity)
head(table_join)
## # A tibble: 6 x 5
## upc units week store commodity
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 4112907763 2 43 244 pasta sauce
## 2 9999985004 1 43 244 pasta
## 3 3340061127 2 43 244 pasta
## 4 9999985046 1 43 244 pasta
## 5 9999985040 1 43 244 pasta
## 6 9999985067 1 43 244 pasta
Our data is in the above format after joining and keeping the selected columns. Since we have to tidy the data for each commodity, a function has been created to avoid repetition in the code. The tidied data is returned by this function.
tidy_data <- function(x) {
filterd_table_comm_i <- table_join %>%
filter(commodity == x) %>%
inner_join(causal_lookup,by=c("upc","week","store")) %>%
select(units,week,store, feature_desc) %>%
mutate(mailer=ifelse(feature_desc=="Not on Feature",
"Not Featured","Featured")) %>%
select(-feature_desc)
# Separating data when commodity was featured and when it was not featured in mailer
table_featured <- filterd_table_comm_i %>%
filter(mailer=="Featured")
table_not_featured <- filterd_table_comm_i %>%
filter(mailer=="Not Featured")
# Getting weekly mean sales (units) by store for featured items
table_feature_sum <- table_featured %>%
select(-mailer) %>%
group_by(store,week) %>%
summarise(sum_fsales=sum(units)) %>%
select(-week) %>%
group_by(store) %>%
summarise(mean_weekly_fsales=mean(sum_fsales))
# Getting weekly mean sales (units) by store for non_featured items
table_non_feature_sum <-table_not_featured %>%
select(-mailer) %>%
group_by(store,week) %>%
summarise(sum_nfsales=sum(units)) %>%
select(-week) %>%
group_by(store) %>%
summarise(mean_weekly_nfsales=mean(sum_nfsales))
# Joining tables for hypothesis testing
table_f_n.f <- table_feature_sum %>% inner_join(table_non_feature_sum, by=c("store"))
# Calculating difference between mean weekly sales for different stores
# when the product featured in mailer and when it did not.
table_f_n.f1 <- table_f_n.f %>% mutate(diff=mean_weekly_fsales-mean_weekly_nfsales)
# Returning data
return(table_f_n.f1)
}
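For instance, calling the function for a single commodity returns one row per store with the two means and their difference (a usage sketch):

# Example: tidied per-store data for syrups
head(tidy_data("syrups"))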
The following function has been created to perform paired t-test on tidied data for each commodity:
# Function for Paired t Test
paired.t.test <- function(y,z){
t.test(y, z,
paired=TRUE,
conf.level=0.95)
}
The following loop passes each commodity to the tidy_data function, which returns the tidied data. Running the loop produces 4 histograms, one per commodity. The paired t-test should only be applied if the distribution of the difference in mean weekly sales between the Featured and Not Featured groups is approximately Normal.
# categories <- c("pasta","pasta sauce", "pancake mixes","syrups") ...for reference
# For loop will apply paired t-test to all commodities in categories vector
par(mfrow=c(2,2))
for (i in seq_along(categories)){
final_table <- tidy_data(categories[i])
# Checking for normality of the mean weekly difference before hypothesis test
title <-paste("Normality check for", categories[i])
# Plotting the distribution for difference
norm_plot_check <- hist(final_table$diff, main=title,
xlab= "Difference in weekly mean unit sales",
col= "darkmagenta")
# Sending the tidied data to the function paired.t.test and storing the result
output <- paired.t.test(final_table$mean_weekly_fsales,final_table$mean_weekly_nfsales)
p_val[i] <- output$p.value
estimate[i] <- output$estimate
conf_inter.lower[i] <- output$conf.int[1]
conf_inter.upper[i] <- output$conf.int[2]
}
The histograms look approximately Normal, so proceeding with the paired t-test is reasonable.
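As an optional, more formal check that was not part of the original analysis, a Shapiro-Wilk test could be run on the per-store differences; note, too, that with a large number of stores the paired t-test is fairly robust to mild departures from Normality:

# Hypothetical extra check: Shapiro-Wilk normality test on the differences
# for one commodity (a small p-value would suggest non-Normality)
shapiro.test(tidy_data("pasta")$diff)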
The tidied data (shown here for the commodity syrups) contains one row per store, with the mean weekly featured sales, mean weekly non-featured sales, and their difference; it has the same format for all other commodities.
Results and Analysis:
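The output vectors populated in the loop can be gathered into a single summary table (a sketch using the vectors defined earlier):

# Collecting the paired t-test output for all four commodities
results_df <- tibble(commodity = categories,
                     estimate = estimate,
                     conf_lower = conf_inter.lower,
                     conf_upper = conf_inter.upper,
                     p_value = p_val)
results_df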
The p-values for all the commodities are less than our alpha value (0.05). Hence, we reject the null hypothesis for all the categories: we are 95% confident that the mean weekly sales when featured and when not featured are not the same.
The estimates of the difference in mean weekly sales (Featured - Not Featured), along with the confidence intervals, for pasta, pasta sauce, and pancake mixes indicate that sales of these commodities increased by roughly 6.6, 12.5, and 5.9 units per store on average when the commodity was featured in the mailer.
At the same time, even though the null hypothesis is also rejected for syrups, the estimate and confidence interval indicate that very little difference exists in the mean weekly unit sales of syrups between featured and non-featured weeks.
Conclusion: Featuring commodities in the weekly mailer generally tends to increase mean weekly unit sales. However, for categories such as syrups there is almost no impact. So, it would be beneficial for the retailer to revisit all the commodities it features in its weekly mailers: it can drop commodities such as syrups and focus on those whose mean weekly sales actually increase.
What I did:
Joined transactions and product lookup tables. Filtered data for each commodity combination.
Grouped data by baskets and counted baskets that had the commodity combination
Pursued brand analysis for commodities that showed a complementary sales relationship
Filtered data for those commodities, grouped by basket, concatenated brands into one row, grouped by brand combination, and counted combination frequency
Joined transactions, product lookup, and causal_lookup tables. Filtered data for week >= 43
Created a function to tidy data for hypothesis testing; it creates a new variable, mailer (derived from feature_desc), to categorize whether a UPC was featured or not featured in the mailer for a specific week and store
Separated data for featured and not featured categories
Calculated mean weekly sales for both categories for all stores for all weeks and joined tables together
Performed paired t-test
Results?
We finally have the answers. And some interesting questions that our answers generated! That means more interesting work!!!
1a. Are complementary products bought together?
Ans. Yes!
1b. If yes, are there any specific brands that are bought together?
Ans. Yes!
Top 5 brand combinations (pasta, pasta sauce), with the number of baskets containing each: Private Label, Ragu (141,965); Prego, Private Label (65,860); Private Label, Private Label (62,325); Hunt's, Private Label (43,978); and Mueller, Ragu (34,421).
2. Are commodity sales (units) impacted when they are featured in the weekly mailer?
Ans. Yes!
The confidence intervals obtained from the paired t-tests revealed that mean weekly sales for pasta, pasta sauce, and pancake mixes increased by roughly 6.6, 12.5, and 5.9 units per store on average when the commodity was featured in the mailer.
However, unit sales of one commodity, Syrups, did not show much difference. Thus, we conclude that the retailer should revisit the commodities it features in its mailer and keep only those for which the mailer has a positive impact on mean sales.
Implications of the analysis to stakeholders:
The retailer should be happy that the money it has been spending on its weekly mailers has indeed helped increase sales! Phew! It could also consider removing syrups from the mailers, as featuring them does not really impact sales.
And now that we know the top brands for pasta and pasta sauce, wouldn’t it be interesting to test if sales of those brands increase when placed side by side in mailer?
It could also send promotional offers to lure customers to buy the top combinations. Lastly, we noted in the Complementary Category Analysis that there are a few experimental households that buy more than 2 brands of pasta or pasta sauce. It would be interesting for the retailer to study their buying habits and send appropriate promotional offers to this lot!
Limitations/Scope of improvement in the analysis
I only considered combinations of 2 commodities at a time. It would be worthwhile to study combinations of 3 commodities as well. For example, do certain brands of pasta, pasta sauce, and syrups sell together?
Since we know that the mailer impacts weekly unit sales, the next step is to examine the impact of the ad's placement within the mailer on the weekly unit sales of a commodity. Hypothesis testing could be used for this problem as well.
I performed the hypothesis testing for all stores together. Classifying the stores by sales volume or geography and then testing each group separately might have been more prudent.