This project aims to explore the effect on sales, units and visits due to:
For years, retailers have been investing in these two strategies to boost their revenues. At the same time, for years, consumers have been influenced by these two strategies without even knowing it. This project aims to quantify this effect and help both consumers and retailers.
The analysis will make use of transactional data and the causal data for a major retailer for a period of 2 years. The transactional data, as the name suggests, records all the transactions for 2500 households. The causal data records the display and mailer information for 68377 products in 115 stores.
Calculate the effect of in-store display and weekly mailer information on the sales, units and visits between Year 1 and Year 2 for the same:
Once we have the information for distinct (a), (b) and (c) we can make generalized statements for all 68,377 products in 115 stores for the last 102 weeks. Using these results which are obtained from the sampled data, we will make use of central limit theorem and confidence intervals to make inferences about the population.
This project aims to quantify the effect of strategies discussed in 1.1 and help:
Load Files
This package allows you to read and write data at blazing fast speed
library(data.table)
This package is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges like mutating, selecting, filtering, summarizing and arranging columns
library(dplyr)
This packages provides fast and effecient implementations of common string manipulations
library(stringr)
This package is a system for declaratively creating graphics, based on The Grammar of Graphics
library(ggplot2)
This package makes it easier to work with dates and times
library(lubridate)
This package helps in printing datasets fited to page
library(printr)
This package helps in creating tidy data
library(tidyr)
This package helps to render data objects in R as HTML tables using the JavaScript library ‘DataTables’
library(DT)
This dataset contains household level transactions of over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.
Link: https://github.com/akshaykher/Data-Wrangling-Project/blob/master/The_Complete_Journey_guide.pdf
Importing All Tables
# names of the data files to import
names <- c("product", "transaction_data", "causal_data")
# product table link - https://www.dropbox.com/s/2hq56ye5tjp8ryb/product.csv?dl=0
# transaction_data link - https://www.dropbox.com/s/kqn5ontiyp3hvo0/transaction_data.csv?dl=0
# causal_data link - https://www.dropbox.com/s/n6x0ugtdna6av3t/causal_data.csv?dl=0
# importing all data files at once
for(i in seq_along(names))
{
df = fread(paste0("data/", names[i], ".csv"))
assign(names[i], df)
}
Product Table
datatable(head(product, 100))
# Replacing Empty Strings with NA
product$department[which(product$department == "")] = NA
product$commodity_desc[which(product$commodity_desc == "")] = NA
product$sub_commodity_desc[which(product$sub_commodity_desc == "")] = NA
product$curr_size_of_product[which(product$curr_size_of_product == "")] = NA
# Converting Some String Variables to Factor Variables
product <- mutate(product,
department = factor(department),
brand = factor(brand),
commodity_desc = factor(commodity_desc),
sub_commodity_desc = factor(sub_commodity_desc))
Causal Table
datatable(head(causal_data, 100))
# Converting Some String Variables to Factor Variables
causal_data <- mutate(causal_data, display = factor(display), mailer = factor(mailer))
Transaction Table
datatable(head(transaction_data, 100))
########## Quantity Variable ##########
# 99% of the quantities are less 10. However there is huge spike in quantity from 99% to 100%. We will need to examine this!
q1<- c(quantile(transaction_data$quantity, probs = seq(0, 0.9, 0.1)),
quantile(transaction_data$quantity, probs = seq(0.91, 1, 0.01)))
df1 <- as.data.frame(q1)
names(df1) <- "quantity"
datatable(df1)
########## Quantity Variable ##########
# For all quantities greater than 200, the sub_commodity_desc is "GASOLINE-REG UNLEADED". The quantity of gas may be calculated in gallons giving it a high number.
transaction_data %>%
filter(quantity>200) %>%
select(product_id) %>%
unique() %>%
inner_join(select(product, product_id, department, commodity_desc
, sub_commodity_desc), by="product_id") %>%
select(sub_commodity_desc) %>%
unique()
| sub_commodity_desc |
|---|
| GASOLINE-REG UNLEADED |
########## Quantity Variable ##########
# Flagging Gasoline and Non-Gasoline Transactions
transaction_data$gasoline_transaction <-
ifelse(transaction_data$product_id %in%
product[product$sub_commodity_desc == "GASOLINE-REG UNLEADED",]
$product_id, 1, 0)
# Max Quantity for a Non-Gasoline Transactions is 144. This seems legitimate.
q2 <- quantile(filter(transaction_data, gasoline_transaction == 0)$quantity,
seq(0, 1, 0.1))
df2 <- as.data.frame(q2)
names(df2) <- "quantity"
datatable(df2)
########## Retail Discount Variable ##########
# As discount can never be negative, we will convert positive discounts to 0
transaction_data[transaction_data$retail_disc > 0]$retail_disc = 0
Product Table - First 20 Rows
# Summary of Product Table
glimpse(product)
## Observations: 92,353
## Variables: 7
## $ product_id <int> 25671, 26081, 26093, 26190, 26355, 26426,...
## $ manufacturer <int> 2, 2, 69, 69, 69, 69, 69, 69, 69, 16, 69,...
## $ department <fct> GROCERY, MISC. TRANS., PASTRY, GROCERY, G...
## $ brand <fct> National, National, Private, Private, Pri...
## $ commodity_desc <fct> FRZN ICE, NO COMMODITY DESCRIPTION, BREAD...
## $ sub_commodity_desc <fct> ICE - CRUSHED/CUBED, NO SUBCOMMODITY DESC...
## $ curr_size_of_product <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 O...
Causal Table
# Summary of Causal Table
glimpse(causal_data)
## Observations: 36,786,524
## Variables: 5
## $ product_id <int> 26190, 26190, 26190, 26190, 26190, 26190, 26190, 26...
## $ store_id <int> 286, 288, 289, 292, 293, 295, 296, 297, 298, 299, 3...
## $ week_no <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,...
## $ display <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, ...
## $ mailer <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, ...
Transaction Table- First 20 Rows
# Summary of Transaction Table
glimpse(transaction_data)
## Observations: 2,595,732
## Variables: 13
## $ household_key <int> 2375, 2375, 2375, 2375, 2375, 2375, 2375,...
## $ basket_id <dbl> 26984851472, 26984851472, 26984851472, 26...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ product_id <int> 1004906, 1033142, 1036325, 1082185, 81604...
## $ quantity <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales_value <dbl> 1.39, 0.82, 0.99, 1.21, 1.50, 1.98, 1.57,...
## $ store_id <int> 364, 364, 364, 364, 364, 364, 364, 364, 3...
## $ retail_disc <dbl> -0.60, 0.00, -0.30, 0.00, -0.39, -0.60, -...
## $ trans_time <int> 1631, 1631, 1631, 1631, 1631, 1642, 1642,...
## $ week_no <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ gasoline_transaction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Product Table
# Total unique counts of each variable
product %>%
summarize(tot_products = n(),
tot_manufacturer = n_distinct(manufacturer),
tot_department = n_distinct(department),
tot_brand = n_distinct(brand),
tot_commodity_desc = n_distinct(commodity_desc),
tot_sub_commodity_desc = n_distinct(sub_commodity_desc))
| tot_products | tot_manufacturer | tot_department | tot_brand | tot_commodity_desc | tot_sub_commodity_desc |
|---|---|---|---|---|---|
| 92353 | 6476 | 44 | 2 | 308 | 2383 |
# Total NAs in each variable
c1 <- colSums(is.na(product))
df1 <- as.data.frame(c1)
names(df1) <- "total_NAs"
df1
| total_NAs | |
|---|---|
| product_id | 0 |
| manufacturer | 0 |
| department | 15 |
| brand | 0 |
| commodity_desc | 15 |
| sub_commodity_desc | 15 |
| curr_size_of_product | 30607 |
Causal Table
# Total unique counts/min-max of each variable
causal_data %>%
summarize(total_observations = n(),
tot_products = n_distinct(product_id),
min_week = min(week_no),
max_week = max(week_no),
tot_display = n_distinct(display),
tot_mailer = n_distinct(mailer))
| total_observations | tot_products | min_week | max_week | tot_display | tot_mailer |
|---|---|---|---|---|---|
| 36786524 | 68377 | 9 | 101 | 10 | 11 |
# Total NAs in each variable
c2 <- colSums(is.na(causal_data))
df2 <- as.data.frame(c2)
names(df2) <- "total_NAs"
df2
| total_NAs | |
|---|---|
| product_id | 0 |
| store_id | 0 |
| week_no | 0 |
| display | 0 |
| mailer | 0 |
# Explanation and Total Counts of the Display Variable
table_display <- table(causal_data$display)
names(table_display) <-
c("Not on Display", "Store Front", "Store Rear", "Front End Cap",
"Mid-Aisle End Cap", "Rear End Cap", "Side-Aisle End Cap",
"In-Aisle", "Secondary Location Display", "In-Shelf")
df_display <- as.data.frame(table_display)
names(df_display) <- c("Display","Count")
df_display %>%
arrange(Count) %>% # sort by counts
mutate(Display = factor(Display,Display)) %>% # reset factor
ggplot(aes(x=Display, y=Count)) + # plot
geom_bar(stat="identity") +
coord_flip() +
scale_y_continuous(labels = scales::comma)
# Explanation and Total Counts of the Mailer Variable
table_mailer <- table(causal_data$mailer)
names(table_mailer) <-
c("Not on Ad", "Interior Page Feature", "Interior Page Line Item",
"Front Page Feature","Back Page Feature", "Wrap Front Feature",
"Wrap Interior Coupon","Wrap Back Feature", "Interior Page Coupon",
"Free on Interior Page", "Free on Front Page, Back Page or Wrap")
df_mailer <- as.data.frame(table_mailer)
names(df_mailer) <- c("Mailer","Count")
df_mailer %>%
arrange(Count) %>% # sort by counts
mutate(Mailer = factor(Mailer,Mailer)) %>% # reset factor
ggplot(aes(x=Mailer, y=Count)) + # plot
geom_bar(stat="identity") +
coord_flip()+
scale_y_continuous(labels = scales::comma)
Transaction Table
# Total unique counts of each variable grouped by Gasoline Transactions
transaction_data %>%
group_by(gasoline_transaction) %>%
summarize(total_visits = n_distinct(basket_id),
total_observations = n(),
total_households = n_distinct(household_key),
total_products = n_distinct(product_id),
total_stores = n_distinct(store_id))
| gasoline_transaction | total_visits | total_observations | total_households | total_products | total_stores |
|---|---|---|---|---|---|
| 0 | 252543 | 2570770 | 2500 | 92304 | 557 |
| 1 | 24952 | 24962 | 1438 | 35 | 164 |
# Min, Max of each variable grouped by Gasonline Transactions
transaction_data %>%
group_by(gasoline_transaction) %>%
summarize(min_quantity = min(quantity),
max_quantity = max(quantity),
min_sales = min(sales_value),
max_sales = max(sales_value),
min_day = min(day),
max_day = max(day),
min_week = min(week_no),
max_week = max(week_no))
| gasoline_transaction | min_quantity | max_quantity | min_sales | max_sales | min_day | max_day | min_week | max_week |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 144 | 0.00 | 840 | 1 | 711 | 1 | 102 |
| 1 | 1 | 89638 | 0.03 | 250 | 12 | 711 | 2 | 102 |
# Total NAs in each variable
c3 <-colSums(is.na(transaction_data))
df3 <- as.data.frame(c3)
names(df3) <- "total_NAs"
df3
| total_NAs | |
|---|---|
| household_key | 0 |
| basket_id | 0 |
| day | 0 |
| product_id | 0 |
| quantity | 0 |
| sales_value | 0 |
| store_id | 0 |
| retail_disc | 0 |
| trans_time | 0 |
| week_no | 0 |
| coupon_disc | 0 |
| coupon_match_disc | 0 |
| gasoline_transaction | 0 |
First, we’ll join Causal Data with Transaction Data on:
Second, we’ll find the total sales, units and visits by aggregating on:
Thus, in the coming analysis, we’ll compare sales, units and visits for same products, for the same stores and in the same week. This will ensure that any differences that we detect can be attributed to display and mailer level changes only. The variation due to product, store and week number will already be accounted for.
# Joining Causal Data with Transaction Data and then aggregating it.
mailer_display_transactions <-
causal_data %>%
inner_join(select(transaction_data,household_key,basket_id,
quantity,sales_value,product_id,store_id,week_no),
by = c("product_id","store_id","week_no")) %>%
group_by(store_id,product_id,week_no,display,mailer) %>%
summarize(total_customers = n_distinct(household_key),
sum_sales=sum(sales_value),
sum_quantity=sum(quantity),
sum_visits=n_distinct(basket_id))
datatable(head(mailer_display_transactions,20)) # First 20 Rows
Now we’ll separate the above dataset into Year 1 and Year 2 based on the week_no. For the Year 2 dataset we’ll recode the week_no’s from 54-102 to 1-52.
The purpose of doing this is to compare metrics of Year 1 vs Year 2.
# Year 1 dataset
mailer_display_transactions_year1 <-
mailer_display_transactions %>%
filter(week_no <=54)
# Year 2 dataset
mailer_display_transactions_year2 <-
mailer_display_transactions %>%
filter(week_no >54) %>%
mutate(week_no_modified = week_no-54)
# Recoding the week_no's for Year 2
mailer_display_transactions_year2$week_no <-
mailer_display_transactions_year2$week_no_modified
mailer_display_transactions_year2$week_no_modified = NULL
Now, we’ll join Year 1 and Year 2 dataset on:
Then, we’ll separate the above data into:
The purpose of doing this is to quantify the effect of:
# Joining Year 1 and Year 2 datasets
mailer_display_transactions_year1_vs_year2 <-
mailer_display_transactions_year1 %>%
inner_join(mailer_display_transactions_year2,
by=c("product_id","store_id","week_no"),
suffix=c("_Yr1","_Yr2")) # as both tables have same colnames
# Mailer Dataset - where there is a difference between Year 1 and Year 2
mailer_year1_vs_year2 <-
mailer_display_transactions_year1_vs_year2 %>%
filter( (mailer_Yr1==0 & mailer_Yr2 !=0) | (mailer_Yr2==0 & mailer_Yr1 !=0) )
# Display Dataset - where there is a difference between Year 1 and Year 2
display_year1_vs_year2 <-
mailer_display_transactions_year1_vs_year2 %>%
filter( (display_Yr1==0 & display_Yr2 !=0) | (display_Yr2==0 & display_Yr1 !=0) )
datatable(head(mailer_display_transactions_year1_vs_year2,20)) # First 20 Rows
Now that we’ve already taken into account the product level variation, we’ll aggregate the mailer and display datasets by store and week number.
# Aggregating Mailer dataset
mailer_year1_vs_year2 <-
mailer_year1_vs_year2 %>%
group_by(store_id,week_no,mailer_Yr1,mailer_Yr2) %>%
summarize(sum_sales_Yr1=sum(sum_sales_Yr1),
sum_quantity_Yr1=sum(sum_quantity_Yr1),
sum_visits_Yr1=sum(sum_visits_Yr1),
sum_sales_Yr2=sum(sum_sales_Yr2),
sum_quantity_Yr2=sum(sum_quantity_Yr2),
sum_visits_Yr2=sum(sum_visits_Yr2))
# Aggregating Display dataset
display_year1_vs_year2 <-
display_year1_vs_year2 %>%
group_by(store_id,week_no,display_Yr1,display_Yr2) %>%
summarize(sum_sales_Yr1=sum(sum_sales_Yr1),
sum_quantity_Yr1=sum(sum_quantity_Yr1),
sum_visits_Yr1=sum(sum_visits_Yr1),
sum_sales_Yr2=sum(sum_sales_Yr2),
sum_quantity_Yr2=sum(sum_quantity_Yr2),
sum_visits_Yr2=sum(sum_visits_Yr2))
datatable(head(mailer_year1_vs_year2,20)) # First 20 Rows
Now, in future analysis, we want to conduct a hypothesis test to detect any difference in sales, units and visits for a particular store & week between Year 1 and Year 2 due to display/mailer variables.
The quantities between Year 1 and Year 2 are paired, thus we should calculate their difference and conduct the hypothesis on the average of difference.
The diff_of_avg_yr1_vs_yr2 function does exactly that. It takes the arguments:
Using these arguments the above function calculates the average of difference between the Year 1 and Year 2 called the Point Estimate. Further, it also calculates the standard deviation and number of observation, both of which will be used to conduct a hypothesis test.
diff_of_avg_yr1_vs_yr2 <-
function(df = display_year1_vs_year2,
val_yr1 = 0,
val_yr2 = "A",
var_interest = c("diff_sales","diff_quantity","diff_visits"),
mailer = 0,
display = 1)
{
if(mailer == 1) # if mailer dataset
{
calc_diff <-
df %>%
# calculate all differences for specified mailer variables
mutate(diff_sales = sum_sales_Yr2-sum_sales_Yr1,
diff_quantity = sum_quantity_Yr2-sum_quantity_Yr1,
diff_visits = sum_visits_Yr2-sum_visits_Yr1) %>%
filter(mailer_Yr1 == val_yr1,mailer_Yr2 == val_yr2)
} else if (display == 1) # if display dataset
{
calc_diff <-
df %>%
# calculate all differences for specified display variables
mutate(diff_sales = sum_sales_Yr2-sum_sales_Yr1,
diff_quantity = sum_quantity_Yr2-sum_quantity_Yr1,
diff_visits = sum_visits_Yr2-sum_visits_Yr1) %>%
filter(display_Yr1 == val_yr1,display_Yr2 == val_yr2)
} else # if not a mailer or display dataset
{
print("Correct Value Not Selected")
}
# calculate average of difference for all var_interest specified by user
l1 = lapply(calc_diff[,names(calc_diff) %in% var_interest],mean,na.rm=TRUE)
# calculate sd of difference for all var_interest specified by user
l2 = lapply(calc_diff[,names(calc_diff) %in% var_interest],sd,na.rm=TRUE)
# calculate number of observations
l3 = list(n_obs=nrow(calc_diff))
# append mean, sd and number of observations
point_estimate <- append(append(l1,l2),l3)
names(point_estimate) <- c("diff_sales","diff_quantity","diff_visits",
"sd_sales","sd_quantity","sd_visits","n_obs")
return(point_estimate)
}
# Sample Output of Function
diff_of_avg_yr1_vs_yr2(df = display_year1_vs_year2,
val_yr1 = 0,
val_yr2 = "A",
var_interest = c("diff_sales","diff_quantity","diff_visits"),
mailer = 0,
display = 1)
## $diff_sales
## [1] 0.1145652
##
## $diff_quantity
## [1] 0.08695652
##
## $diff_visits
## [1] -0.173913
##
## $sd_sales
## [1] 1.696525
##
## $sd_quantity
## [1] 1.847835
##
## $sd_visits
## [1] 0.5697698
##
## $n_obs
## [1] 46
We have 10 unique mailer categories and 11 unique display categories. If we run diff_of_avg_yr1_vs_yr2 function on each combination then we’ll need to call the function 38 times manually. To avoid this, we’ll create a new function called calculate_point_estimates which will do this automatically.
The calculate_point_estimates function takes the arguments:
calculate_point_estimates <- function(mailer=0,display=0,df=mailer_year1_vs_year2)
{
if(display == 1) # if display dataset
{
names = c("1","2","3","4","5","6","7","9","A")
} else if(mailer==1) # if mailer dataset
{
names = c("A","C","D","F","H","J","L","P","X","Z")
}
df_point_estimates = data.frame()
for(i in 1:length(names))
{
# Yr 1 vs Yr 2 where Yr 1 Variable = 0 and Yr 2 Variable !=0
point_estimate <-
diff_of_avg_yr1_vs_yr2(df = df,
val_yr1 = 0, # Yr 1 = 0
val_yr2 = names[i],# Yr 2 not 0
var_interest = c("diff_sales","diff_quantity","diff_visits"),
mailer = mailer,
display = display)
df_point_estimates=bind_rows(df_point_estimates,point_estimate)
df_point_estimates$val_yr1[i] = 0
df_point_estimates$val_yr2[i] = names[i]
}
for(i in 1:length(names))
{
# Yr 1 vs Yr 2 where Yr 1 Variable != 0 and Yr 2 Variable !=0
point_estimate <-
diff_of_avg_yr1_vs_yr2(df = df,
val_yr1 = names[i],# Yr 1 not 0
val_yr2 = 0, # Yr = 0
var_interest = c("diff_sales","diff_quantity","diff_visits"),
mailer = mailer,
display = display)
df_point_estimates=bind_rows(df_point_estimates,point_estimate)
df_point_estimates$val_yr1[i+length(names)] = names[i]
df_point_estimates$val_yr2[i+length(names)] = 0
}
return(na.omit(df_point_estimates))
}
Running the function calculate_point_estimates on display and mailer dataset
display_point_estimates <-
calculate_point_estimates(display=1,mailer=0,df=display_year1_vs_year2) %>%
select(val_yr1,val_yr2,everything())
mailer_point_estimates <-
calculate_point_estimates(display=0,mailer=1,df=mailer_year1_vs_year2) %>%
select(val_yr1,val_yr2,everything())
datatable(mailer_point_estimates) # mailer_point_estimates dataset
Now we will create a function calc_p_value_conf_int function which will calculate the final datasets. It takes 1 dataset as an argument and does the following steps:
calc_p_value_conf_int <- function(df)
{
df1 <-
df %>%
select(val_yr1,val_yr2,starts_with("diff"),n_obs) %>%
gather(Variable, diff, c("diff_sales","diff_quantity","diff_visits")) %>%
mutate(Variable = str_replace(Variable,"diff_",""))
df2 <-
df %>%
select(val_yr1,val_yr2,starts_with("sd")) %>%
gather(Variable, sd, c("sd_sales","sd_quantity","sd_visits")) %>%
mutate(Variable = str_replace(Variable,"sd_",""))
df3 <-
df1 %>%
inner_join(df2,by = c("val_yr1","val_yr2","Variable")) %>%
# p-value
mutate(p_value = pnorm(abs(diff/(sd/sqrt(n_obs))),lower.tail = FALSE)*2) %>%
# Lower bound of CI
mutate(ci_lower = diff-1.96*(sd/sqrt(n_obs))) %>%
# Upper bound of CI
mutate(ci_upper = diff+1.96*(sd/sqrt(n_obs))) %>%
# Keep rows with p-value < 5%
filter(p_value <= 0.05)
return(df3)
}
For a particular product in a particular store in a particular week, we are 95% confident that:
# running the 'calc_p_value_conf_int' function on mailer data
mailer_analysis <-
calc_p_value_conf_int(mailer_point_estimates) %>%
mutate(avg_sales=mean(mailer_display_transactions$sum_sales),
avg_units=mean(mailer_display_transactions$sum_quantity),
avg_visits=mean(mailer_display_transactions$sum_visits))
# calculating percentage increase in sales, units and visits instead of absolute
for (i in 1:nrow(mailer_analysis))
{
if(mailer_analysis$Variable[i] == "sales")
{
mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_sales[i]
} else if (mailer_analysis$Variable[i] == "quantity")
{
mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_units[i]
} else if (mailer_analysis$Variable[i] == "visits")
{
mailer_analysis$ci_lower[i] = mailer_analysis$ci_lower[i]/mailer_analysis$avg_visits[i]
}
}
# output mailer yr1, display yr2, n, variable, p-value, confidence interval
mailer_analysis %>%
select(val_yr1,val_yr2,n_obs,Variable,p_value,ci_lower,ci_upper) %>%
arrange(val_yr1) %>%
mutate(ci_lower = ci_lower*100, ci_upper = 100*ci_upper)
| val_yr1 | val_yr2 | n_obs | Variable | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|
| 0 | D | 569 | sales | 0.0022946 | 5.683548 | 84.218476 |
| 0 | A | 491 | quantity | 0.0006097 | 7.632072 | 46.100164 |
| 0 | D | 569 | quantity | 0.0000000 | 33.983106 | 103.320578 |
| 0 | A | 491 | visits | 0.0017194 | 4.383105 | 22.177247 |
| 0 | D | 569 | visits | 0.0000000 | 23.868319 | 53.698171 |
| A | 0 | 664 | visits | 0.0471979 | -15.650275 | -0.113509 |
| D | 0 | 481 | quantity | 0.0000004 | -50.142834 | -36.428163 |
| D | 0 | 481 | visits | 0.0000859 | -36.060256 | -14.059690 |
For a particular product in a particular store in a particular week, we are 95% confident that:
# running the 'calc_p_value_conf_int' function on display data
display_analysis <-
calc_p_value_conf_int(display_point_estimates) %>%
mutate(avg_sales=mean(mailer_display_transactions$sum_sales), # avg sales
avg_units=mean(mailer_display_transactions$sum_quantity), # avg units
avg_visits=mean(mailer_display_transactions$sum_visits)) # avg visits
# calculating percentage increase in sales, units and visits instead of absolute
for (i in 1:nrow(display_analysis))
{
if(display_analysis$Variable[i] == "sales")
{
display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_sales[i]
} else if (display_analysis$Variable[i] == "quantity")
{
display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_units[i]
} else if (display_analysis$Variable[i] == "visits")
{
display_analysis$ci_lower[i] = display_analysis$ci_lower[i]/display_analysis$avg_visits[i]
}
}
# output display yr1, display yr2, n, variable, p-value, confidence interval
display_analysis %>%
select(val_yr1,val_yr2,n_obs,Variable,p_value,ci_lower,ci_upper) %>%
arrange(val_yr1) %>%
mutate(ci_lower = ci_lower*100, ci_upper = 100*ci_upper)
| val_yr1 | val_yr2 | n_obs | Variable | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|
| 0 | 4 | 25 | sales | 0.0306214 | 2.812311 | 184.938788 |
| 0 | 3 | 110 | quantity | 0.0301836 | -61.030992 | -5.051511 |
| 0 | 4 | 25 | quantity | 0.0460634 | 0.678906 | 126.883121 |
| 0 | 3 | 110 | visits | 0.0307633 | -50.528878 | -2.863526 |
| 0 | A | 46 | visits | 0.0384339 | -29.018057 | -0.925740 |
| 3 | 0 | 101 | visits | 0.0077055 | 10.323282 | 79.044402 |
| 9 | 0 | 163 | visits | 0.0348656 | 1.307634 | 41.419101 |
Quantify the impact on sales, units and visits due to:
For exact values please refer to the tab 4.2
The retailer can perform the following steps to increase their sales, units and visits for a particular product:
The data consists of only 2500 households. For a more in-depth analysis, we will need to capture more households.