Regork is one of the largest retailers in the continental United States and is a company that uses data to optimize their day to day functionalities. As part of the Regork Data analysis team we are always looking at different ways in which we can improve the processes at the organization.
As part of this, we decided to analyze whether coupons are being used effectively in generating sales and also provide suggestions to optimize how coupons are currently being targeted. Coupons are an important part of the marketing and sales strategies of any retail company, and effectively using them will not only improve bottom line numbers but also increase customer engagement.
The approach that we have taken to address the business objective is ‘Descriptive Analysis’. Through descriptive analysis, we perform investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
In order to perform EDA, we have decided to group our data in three different ways:
Customer demographic based grouping - To understand the current coupon usage by demography, we have taken data for income range and age, and grouped the sales we have generated from coupon discounted purchases and other purchases without coupons based on age and income range. We have generated the average and total values for both sets of data.
Product category based grouping - We are also analyzing how the sales penetration differs for different product categories with and without coupon usage since at Regork, different product categories are managed by individual managers.
Store based grouping - And finally, we want to see which are the stores that have the largest average coupon discounted sales to optimize their coupon usage further.
Based on the results we get from the above analysis, we can decide which section of customers, in respect to age, income range, product purchase pattern, geography etc. that we can target to improve our overall sales using coupons.
completejourney - Gives us the data-set we are analyzing.
ggplot2 - Provides us with the different graphs that we can use to plot our results.
plotly - Used to make our graphs more interactive.
dplyr - For flexibility in data manipulation.
tidyr - To clean up messy data.
lubridate - To easily format timestamps to date for transaction data.
scales - To map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.
library(completejourney)
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyr)
library(lubridate)
library(scales)
The original data set is from the ‘completejourney’ package. From this package we have used three tables.
Transactions
Demographics
Products
This data set contains shopping information for different households at Rogerk in 2017. For the purpose of our analysis, we create a base table ‘dataset’ which contains a flag for coupon usage, which tells us whether coupon discount was availed by a household.
dataset <- get_transactions() %>%
inner_join(demographics) %>%
mutate(Total_Coupon_Discount = coupon_disc + coupon_match_disc) %>%
group_by(basket_id) %>%
mutate(Total_Coupon_Disc_Basket = sum(Total_Coupon_Discount)) %>%
ungroup() %>%
mutate(Coupon_Flag = if_else(Total_Coupon_Disc_Basket > 0, 1,0))
For analyzing the sales penetration for each product category where coupons were used, we have conducted some additional data preparation steps. The base data frame ’dataset’ is modified to reflect coupon usage at a product level, and joined with products table. Two tables, for coupon usage, and non-coupon usage are created separately.
First we create the data frame with sales penetration for each product category where coupons were used.
Note: The product category, “COUPON/MISC ITEMS’ is excluded from the data preparation as these are miscellaneous products with no coupons of any sort are provided to the customers.
dataset_modified <- get_transactions() %>%
mutate(Total_Coupon_Discount = coupon_disc + coupon_match_disc) %>%
mutate(Coupon_Flag = if_else(Total_Coupon_Discount > 0, 1,0))
product_coupon_usage <- dataset_modified %>%
inner_join(products, on = product_id) %>%
filter(product_category != 'COUPON/MISC ITEMS') %>%
select(product_category, Coupon_Flag, sales_value) %>%
filter(Coupon_Flag == 1) %>%
mutate(tot_sales = sum(sales_value)) %>%
group_by(product_category) %>%
summarise(sales_penetration = sum(sales_value)/tot_sales) %>%
unique() %>%
mutate(Coupon_Usage = 'Coupon Usage') %>%
arrange(desc(sales_penetration)) %>%
ungroup() %>%
mutate(cumulative_penetration = cumsum(sales_penetration))
Now, we create the data frame with sales penetration for each product category where coupons were not used.
product_non_coupon_usage <- dataset_modified %>%
inner_join(products, on = product_id) %>%
filter(product_category != 'COUPON/MISC ITEMS') %>%
select(product_category, Coupon_Flag, sales_value) %>%
filter(Coupon_Flag == 0) %>%
mutate(tot_sales = sum(sales_value)) %>%
group_by(product_category) %>%
summarise(sales_penetration = sum(sales_value)/tot_sales) %>%
unique() %>%
mutate(Coupon_Usage = 'No Coupon Usage') %>%
arrange(desc(sales_penetration)) %>%
ungroup() %>%
mutate(cumulative_penetration = cumsum(sales_penetration))
The original data set is from the completejourney package. We have used the entire data set of the transaction data for this analysis.
dataset_store <- get_transactions()%>%
mutate(Total_Coupon_Discount = coupon_disc + coupon_match_disc)%>%
group_by(basket_id)%>%
mutate(Total_Coupon_Disc_Basket = sum(Total_Coupon_Discount))%>%
ungroup()%>%
mutate(Coupon_Flag = if_else(Total_Coupon_Disc_Basket > 0, 1,0))
For our coupon performance by demographic analysis, we need to make use of 2 tables - the transactions table and the demographic table. Before getting started, we first wanted to check the data completeness for these two datasets.
Q : What % of data is available in the demographics table for all the households in the transactions table?
cat("There are ", length(unique(get_transactions()[['household_id']])), " number of households in the transactions table\n")
## There are 2469 number of households in the transactions table
cat("Out of these, we have demographic data only for ", length(unique(dataset[['household_id']])), " households\n")
## Out of these, we have demographic data only for 801 households
cat("This translaes to a data completeness of ~ 32%")
## This translaes to a data completeness of ~ 32%
As we can see, we have data for 801 households. Going forward, we will continue to use this dataset for the coupon - demographic analysis.
In the demographic dataset, we will be making use of the household_id, income and age columns mainly for our analysis. Hence we need to check if there are any missing values in these columns.
colSums(is.na(dataset %>% select(household_id, age, income)))
## household_id age income
## 0 0 0
There are no missing value that concerns us.
We want to check how the coupon usages are distributed across households. To do this, we can plot the number of coupons used in the last year vs the number of households that fall into that bucket. We first count the number of coupons used by household using the transactions table and then use this dataframe to create another summary where we group by each coupon used bucket and count the number of households
eda_3_df <- dataset %>% group_by(household_id, income, age) %>% summarise(Total_Coupons = sum(Coupon_Flag)) %>%
mutate(Total_Coupon_Bucket = cut(Total_Coupons, breaks = c(-1,10,50,100,200,400,800,1000,2000), labels = c("0-10", "10-50", "50-100", "100-200", "200-400", "400-800", "800-1000", "1000+")))
eda_3_df <- eda_3_df %>% group_by(Total_Coupon_Bucket, income, age) %>% summarise(Total_Households = n_distinct(household_id))
acet_plot <- ggplot(eda_3_df, aes(x = Total_Coupon_Bucket, y = Total_Households, fill = income)) +
geom_bar(stat = "identity") +
xlab("Coupon Usage Count")+
ylab("Number of Households") +
facet_wrap(. ~ age, ncol = 2) +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face = "bold", colour = "blue", size = 12),
axis.title.y = element_text(face = "bold", colour = "blue", size = 12),
legend.title = element_text(face = "bold", size = 10),
strip.background = element_rect(fill = "lightgreen", colour = "black", size = 1),
strip.text = element_text(face = "bold",size = rel(1.2)))+
theme(plot.title = element_text(size = 10, vjust = -.5))
ggplotly(acet_plot)
rm(acet_plot, eda_3_df)
Next, we wanted to see if there’s a certain seasonality in the coupon usage trend and the revenue the coupons generate. To do this, we created a line chart with Months on the x-axis and total sales on the y-axis. The total sales is split between coupons used and coupons not used
line_graph <- dataset %>% mutate(Month = month(transaction_timestamp, label = T)) %>% group_by(Month) %>%
summarise(Sales_W_Coupon = sum(if_else(Coupon_Flag == 1, sales_value,0)),
Sales_Wo_Coupon = sum(if_else(Coupon_Flag == 0, sales_value,0))) %>% arrange(Month)
line_graph <- line_graph %>% gather("Metric", "Value_Spend", -c(Month))
line_graph <- ggplot(line_graph, aes(x = Month, y = Value_Spend, group = Metric)) +
geom_line(aes(y = Value_Spend, col = Month)) +
labs(title = "Time Series of Monthly Coupon Sales vs Non-Coupon Sales",
subtitle = "Drawn from Long Data format",
caption = "Source: CompleteJourney",
y = "Sales",
x = "Month",
color = NULL) + # title and caption
# scale_x_date(labels = lbls, breaks = brks) + # change to monthly ticks and labels
scale_color_manual(labels = c("Sales without Coupon", "Sales With Coupon"),
values = c("Sales without Coupon" = "#00ba38", "Sales With Coupon" = "#f8766d")) + # line color
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) +
theme_minimal()
line_graph
rm(line_graph)
From the two graphs we generated below, we can come to certain conclusions. Some of them are as given below :
The demographic group, that has the largest shift between average spend using coupons and without coupons is the 25-34 age group earning 200-249k per year. If you look at the total sales contribution of this group, it is clear that currently they are not a large part of our customer base. But, we can also see that even without coupons their average spend is $74, which is much higher than many other group’s average spending with coupons.
The demographic group with the lowest shift is in the 19-24 age group earning 200-249k per year. This group also doesn’t contribute much to the total sales. From this we can conclude that targeting coupons at them is a waste of resources and they can be targeted in a different manner.
The demographic contributing the most to total sales is in the 45-54 age group earning 50-74k per year. This is our core customer base. Their average spend with and without coupons doesn’t have much of a difference, so currently we can conclude that we don’t need to change the strategy we are using in targeting them.
summary_df <- dataset %>%
group_by(income, age, basket_id, Coupon_Flag) %>%
summarise(Total_Sales = sum(sales_value))
summary_df <- summary_df %>%
group_by(age, income) %>%
summarise(baskets_w_coupon = sum(if_else(Coupon_Flag == 1,1,0)),
baskets_wo_coupon = sum(if_else(Coupon_Flag != 1,1,0)),
total_sales_w_coupon = sum(if_else(Coupon_Flag == 1,
Total_Sales,0)),
total_sales_wo_coupon = sum(if_else(Coupon_Flag == 1,
0,Total_Sales)))
summary_df <- summary_df %>%
mutate(avg_spend_w_coupon = total_sales_w_coupon/baskets_w_coupon,
avg_spend_wo_coupon = total_sales_wo_coupon/baskets_wo_coupon,
Lift = avg_spend_w_coupon / avg_spend_wo_coupon) %>%
mutate(avg_spend_w_coupon = round(avg_spend_w_coupon, 2)) %>%
mutate(avg_spend_wo_coupon = round(avg_spend_wo_coupon, 2))
plot_df <- summary_df %>%
select(income, age, avg_spend_w_coupon, avg_spend_wo_coupon) %>%
arrange() %>%
gather("Metric", "Average_Spend", -c(income,age)) %>%
mutate(x_axis = paste(income, age))%>%
mutate(Metric = if_else(Metric == "avg_spend_w_coupon","Average Spend with Coupon", "Average Spend without Coupon" ))
plot1 <- ggplot(plot_df, aes(Average_Spend, income)) +
geom_line(aes(group = x_axis)) +
geom_point(aes(color = Metric), size = 2) +
labs(title = "Comparison of average spend for customers in different age and income groups",
x = "Average Spend",
y = "Income and Age groups",
color = "Average spend with and without coupons") +
scale_x_continuous(limits = c(0,200), labels = scales::dollar) +
scale_color_brewer(name = "Metric", labels = c('Average Spend with Coupon', 'Average Spend without Coupon'), palette = "Set1") +
facet_wrap(. ~ age) +
theme_gray()
ggplotly(plot1)
plot_df2 <- summary_df %>% select(income, age, total_sales_w_coupon, total_sales_wo_coupon) %>%
gather("Metric", "Total_Sales", -c(income,age)) %>% mutate(x_axis = paste(income, age))%>%
mutate(Metric = if_else(Metric == "total_sales_w_coupon","Total Sales with Coupon", "Total Sales without Coupon" ))
plot2 <- ggplot(plot_df2, aes(Total_Sales, income, fill = Metric)) +
geom_bar(stat = "identity") +
labs(title = "Comparison of total spend for customers in different age and income groups",
x = "Total Spend",
y = "Income and Age groups",
color = "Total with and without coupons") +
facet_wrap(. ~ age) +
scale_fill_brewer(name = "Metric", labels = c("Total Sales With Coupon", "Total Sales Without Coupon"), palette = "Set2")
ggplotly(plot2)
Upon understanding the coupon usage for different demographics, we now move on to analyse the relationship between coupon usage and different product categories of Regork.
The base data frame, ’dataset’ is modified to obtain coupon usage on product level which is then combined with the product table so as to understand which product categories have observed coupon usage on them.
Sales penetration for each product category is defined as the percentage of sales for a product category upon the total sales at Regork.
The sales penetration is compared across product categories between coupon usage and non-coupon usage.
When the sales penetration by product category with coupon usage is compared against the list of product categories without coupon usage, we notice that there are only 183 product categories which observe coupon usage out of the total 301 product categories.
Meaning not all the product categories provide coupon discounts.
The combined list of product categories with and without coupon usage is 301, as shown in the data frame summary below,
combined_list <- merge(product_non_coupon_usage, product_coupon_usage,
by.x = "product_category", by.y = "product_category",
all.x = TRUE)
colnames(combined_list) <- c("product_category", "Sales_Penetration_with_no_Coupon_Usage",
"coupon_usage_indicator1", "csum_non_coupon_usage",
"Sales_Penetration_with_Coupon_Usage",
"coupon_usage_indicator2", "csum_coupon_usage")
summary(combined_list)
## product_category Sales_Penetration_with_no_Coupon_Usage
## Length:301 Min. :1.010e-06
## Class :character 1st Qu.:4.246e-04
## Mode :character Median :1.650e-03
## Mean :3.322e-03
## 3rd Qu.:4.135e-03
## Max. :4.281e-02
##
## coupon_usage_indicator1 csum_non_coupon_usage
## Length:301 Min. :0.04281
## Class :character 1st Qu.:0.71726
## Mode :character Median :0.92119
## Mean :0.81848
## 3rd Qu.:0.98865
## Max. :1.00000
##
## Sales_Penetration_with_Coupon_Usage coupon_usage_indicator2 csum_coupon_usage
## Min. :0.00002 Length:301 Min. :0.09174
## 1st Qu.:0.00042 Class :character 1st Qu.:0.77927
## Median :0.00187 Mode :character Median :0.95158
## Mean :0.00546 Mean :0.85441
## 3rd Qu.:0.00578 3rd Qu.:0.99321
## Max. :0.09174 Max. :1.00000
## NA's :118 NA's :118
combined_list_final <- combined_list %>% filter(is.na(Sales_Penetration_with_Coupon_Usage) == FALSE)
summary(combined_list_final)
## product_category Sales_Penetration_with_no_Coupon_Usage
## Length:183 Min. :8.525e-05
## Class :character 1st Qu.:1.291e-03
## Mode :character Median :2.669e-03
## Mean :4.425e-03
## 3rd Qu.:5.936e-03
## Max. :4.245e-02
## coupon_usage_indicator1 csum_non_coupon_usage
## Length:183 Min. :0.08526
## Class :character 1st Qu.:0.59041
## Mode :character Median :0.83613
## Mean :0.75103
## 3rd Qu.:0.94531
## Max. :0.99913
## Sales_Penetration_with_Coupon_Usage coupon_usage_indicator2 csum_coupon_usage
## Min. :1.526e-05 Length:183 Min. :0.09174
## 1st Qu.:4.235e-04 Class :character 1st Qu.:0.77927
## Median :1.869e-03 Mode :character Median :0.95158
## Mean :5.464e-03 Mean :0.85441
## 3rd Qu.:5.779e-03 3rd Qu.:0.99321
## Max. :9.174e-02 Max. :1.00000
combined_list_final_1 <- combined_list_final %>%
mutate(sales_penetration_index = round((Sales_Penetration_with_no_Coupon_Usage - Sales_Penetration_with_Coupon_Usage), 3)) %>%
filter(sales_penetration_index > 0) %>%
arrange(desc(sales_penetration_index)) %>%
head(30)
plot_a <- combined_list_final_1 %>% select(product_category, Sales_Penetration_with_no_Coupon_Usage,
Sales_Penetration_with_Coupon_Usage) %>%
gather("Metric", "Sales_Penetration", -c(product_category))
plot_b <- plot_a %>% mutate(Metric =
(ifelse(Metric == "Sales_Penetration_with_no_Coupon_Usage", "Sales Penetration with no Coupon Usage",
"Sales Penetration with Coupon Usage")))
plot <- ggplot(plot_b, aes(x = Sales_Penetration, y = reorder(product_category,+Sales_Penetration), fill = Metric)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("orange","purple")) +
scale_x_continuous(labels = scales::percent) +
theme_classic() +
xlab("Sales Penetration") +
ylab("Product Category") +
ggtitle("Difference In Sales Penetration by Product Category") +
theme(panel.grid.major.x = element_blank(),
plot.title = element_text(size = 10, vjust = -.5))
ggplotly(plot)
From the chart above, we can infer the following,
The sales penetration without coupon usage is higher than with coupon usage.
This difference on an average is upto 2% for each product category.
Meaning, there are product categories that perform well even without any coupons offered.
From the data, we can infer that there are 111 such product categories whose sales penetration is higher without coupon usage than with coupon usage.
combined_list_final_2 <- combined_list_final %>%
mutate(sales_penetration_index = round((Sales_Penetration_with_no_Coupon_Usage - Sales_Penetration_with_Coupon_Usage), 3)) %>%
filter(sales_penetration_index > 0) %>%
arrange(desc(sales_penetration_index))
No_Coupon_Usage_List <- combined_list_final_2 %>% select(product_category)
No_Coupon_Usage_List
Since, we observed 111 product categories out of 301 total product categories with better sales penetration without coupon usage, it is safe to say that Regork’s customers purchase the products within these categories irrespective of whether a coupon can be availed or not.
Our insights are on a product category level, hence our recommendations to the product category managers at Regork is that,
The below graph provides us with an information on the number of top stores that accounts to 50% of the total sales among all the stores present in the data set.
plot_store1 <- dataset_store%>%
group_by(store_id)%>%
summarise(Total_sales_by_store = sum(sales_value))%>%
arrange(desc(Total_sales_by_store))%>%
mutate(Total_sales = sum(Total_sales_by_store),
penetration = Total_sales_by_store*100/Total_sales)%>%
mutate(penetration = cumsum(penetration),
No_of_stores = row_number())%>%
ggplot(aes(x = No_of_stores , y = penetration)) +
geom_line(color = "blue") +
geom_vline(xintercept=35, size=0.5, color="red", linetype="dotdash") +
scale_x_continuous(n.breaks = 40, limits = c(0,150)) +
xlab("Number of Stores") +
ylab("Cumulative Sales Penetration") +
ggtitle("Cumulative Sales Penetration for top 150 Stores")
ggplotly(plot_store1)
From the graph we can find out that the top 35 stores with the highest total sales value accounts to 50% of the total sales. Hence we are considering the top 35 stores for our analysis.
new_dataset <- dataset_store %>%
group_by(store_id) %>%
summarise(Total_Sales = sum(sales_value))%>%
arrange(desc(Total_Sales))%>%
top_n(35)
store_ts <- dataset_store%>%
inner_join(new_dataset, by = "store_id")%>%
group_by(store_id, basket_id, Coupon_Flag) %>%
summarise(Total_Sales = sum(sales_value))
store_basket <- store_ts %>%
group_by(store_id) %>%
summarise(baskets_w_coupon = sum(if_else(Coupon_Flag == 1,1,0)),
baskets_wo_coupon = sum(if_else(Coupon_Flag != 1,1,0)),
total_sales_w_coupon = sum(if_else(Coupon_Flag ==1,Total_Sales,0)),
total_sales_wo_coupon = sum(if_else(Coupon_Flag ==1,0,Total_Sales)))
store_avg <- store_basket %>%
mutate(avg_spend_w_coupon = total_sales_w_coupon/baskets_w_coupon,
avg_spend_wo_coupon = total_sales_wo_coupon/baskets_wo_coupon)
Below are the conclusions drawn from the two graphs that are generated:
For the store ID 367 which has the highest total sales value, the average sales with coupon is greater than the average sales without coupon. We can increase the number of coupons in this store to further increase total sales value.
The largest gap between the average spend with and without coupon is observed for the store ID 31862, which has a total sales of $60,580. Though the average sales with coupon for this store is significantly higher compared to that without coupon, there is very less total sales generated with coupon. This store can stratergise to increase providing coupons that may result in increase in the total sales value.
In store ID 343, the difference between average spend with and without coupon is not significant. It has quite a good total sales with coupon, so we can retain the number of coupons that is being provided in 2017.
plot_store2 <- store_avg%>%
select(store_id,avg_spend_w_coupon, avg_spend_wo_coupon)%>%
gather("Metric", "Average_Spend", -c(store_id))%>%
mutate(Metric = if_else(Metric == "avg_spend_w_coupon","Average Spend with Coupon", "Average Spend without Coupon" ))
plot2 <- ggplot(plot_store2, aes(x = Average_Spend, y = reorder(store_id,+Average_Spend))) +
geom_line(aes(group = store_id)) +
geom_point(aes(color = Metric), size = 2) +
scale_x_continuous(limits = c(0,100), labels = scales::dollar) +
theme_minimal() +
xlab("Average Spend") +
ylab("Store ID") +
ggtitle("Average spend by Store ID with and without coupon") +
xlab("Average Spend") +
ylab("Store ID") +
ggtitle("Average spend by Store ID with and without coupon")
ggplotly(plot2)
plot_store3 <- store_avg %>%
select(store_id,total_sales_w_coupon, total_sales_wo_coupon)%>%
gather("Metric", "Total_Sales", -c(store_id))%>%
mutate(Metric = if_else(Metric == "total_sales_w_coupon","Total Sales with Coupon", "Total Sales without Coupon" ))
plot3 <- ggplot(plot_store3, aes(x = Total_Sales, y = reorder(store_id,+Total_Sales), fill = Metric)) +
geom_bar(stat = "identity") +
theme_minimal()+
xlab("Total Sales") +
ylab("Store ID") +
ggtitle("Total Sales by Store ID with and without coupon")
ggplotly(plot3)
Based on the findings from the three analyses done above, please find given below some of target areas for the marketing team to target using coupons:
Demographically, the biggest untapped segment for us to target in the future should be the 25-34 age group earning 200-249k per year. This group seems to be spending on an average much higher than the other groups (with and without coupons) but still contribute a tiny fraction to our overall sales.
From a product category perspective, we can see that there a good chunk of product categories that perform better without coupons being provided for them. Hence, the coupon marketing for these categories can be either reduced as a whole or optimized in such a way that the cost of marketing is decreased.
From a store-level, we need to target those stores where we see a significant shift in average sales with and without coupons. If we drill down further, we can target specific customers shopping specific products in a particular store using the analysis done.
The primary limitation for this dataset is the limited time period for which this data-set is valid i.e.,for the year of 2017.
The demographics table has very few data points. Only 32% of the transaction data-set is mapped to the demographics table.
Continuing on from the previous point, there are many NA values in the marital status and home ownership data in the demographics table which reduces the analysis points even further.
Doing purchasing power analysis on this data is difficult because there is no geographical data on store locations except for store id which is not descriptive enough.
10% of all the stores mentioned in the dataset have less than $10 in sales. This will again contribute in skewing the data further.
There is no detailed information on the promotions being run such as the percentage of discount being offered etc.