Regork is one of the largest retailers in United States and is a company that uses data to optimize their daily operations. As part of the Regork Data Analytics team we are always looking at different ways in which help Regork’s business to grow and get maximum value from the investments in the company.
Business Question 1:
The project aims to explore new avenues of revenue growth for Regork. To study the existing product portfolio of Regork and identify potential opportunities for business expansion, cross-selling and maximizing basket value for each customer.
Once we identify the category, it is also important to run a campaign for the new product. Hence, our second business question is to analyse the existing campaign data and provide recommendations on improving campaign strategy.
Business Question 2:
As part of analysis, we decided to analyze campaigns which yields in highest sales during which month of the year and time of the day and also provide suggestions to optimize the product display locations.Campaigns are an important part of the marketing and sales strategies for any retail company, and effectively using them will not only improve bottom line numbers but also increase customer engagement.
For this analysis, we have used the records of Regork’s household level transactions of 2469 households, who are frequent shoppers in the year 2017. The data includes customer demographics, transaction data, product meta data and campaign information.
The approach that we have taken to address the business objective is ‘Descriptive Analysis’. Through descriptive analysis, we perform investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
For Business question 1, we analysed the least selling categories to identify potential opportunities, checked for their potential customer base, further examined the existing data about the category and then explored best bundling categories for cross-selling the selected category.
Similarly, for campaign analysis, we studied multiple aspects of the campaigns to spot the factors leading to campaign success.
The proposed solutions will help Regork expand their business, and thus increase revenue. They include insights to assure that these ideas have the potential to increase Regork profits. In multiple ways, these proposed solutions aim to maximize the resource utilization and improve campaign design.
completejourney - Gives us the data-set we are analyzing.
ggplot2 - Provides us with the different graphs that we can use to plot our results.
plotly - Used to make our graphs more interactive.
dplyr - For flexibility in data manipulation.
tidyr - To clean up messy data.
lubridate - To easily format time stamps to date for transaction data.
scales - To map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.
treemapify - Allows creating treemaps, which are used for displaying hierarchical data using nested rectangles
library(completejourney)
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyr)
library(lubridate)
library(tidyverse)
library(scales)
library(ggtext)
library(here)
library(treemapify)
The original data set is from the ‘completejourney’ package. This data set has the records of 2017 household level transactions of 2469 households who are frequent shoppers at Regork. The data includes customer demographics, transaction data, product meta data and campaign information.
For our project we have used below data sets to answer the chosen business questions.
Transactions
Promotions
Campaigns
Campaign_descriptions
Products
Demographics
transactions <- get_transactions()
promotions <- get_promotions()
products <- products
demographics <- demographics
coupons <- coupons
coupon_redemptions <- coupon_redemptions
#Checking for Missing Value Check
sum(is.na(campaigns))
## [1] 0
sum(is.na(campaign_descriptions))
## [1] 0
#Join Campaign tables
campaign <- full_join(campaigns, campaign_descriptions, by = "campaign_id")
#Sanity check : end_day > start_day
sum(campaign$end_day < campaign$start_day)
## [1] 0
## No further data cleaning needed for our analysis at this stage. We will assess the data at each stage and tidy if needed, as our analysis progresses.
EXPLORING NEW AVENUES FOR REVENUE GROWTH
What untapped market can Regork expand its business to? How to use existing resources to market such new products?
We start by narrowing down the under-performing product categories in terms of sales volume and total value.
### Finding Under-performing Product Categories #####################################
least <- products %>%
inner_join(transactions, by = "product_id") %>%
group_by(product_category)%>%
summarize(total_qty= sum(quantity), total_sales =sum(sales_value))%>%
filter(total_sales < 100 & total_qty <100)%>%
mutate(puvalue = total_sales/total_qty)%>%
arrange(total_sales)
ggplot(least, aes(x = product_category, y = total_sales)) +
geom_point(aes(color = puvalue,size = total_qty), shape = 17) +
scale_x_discrete(guide = guide_axis(angle = 90)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Under Performing Product Categories",
subtitle = "Data represents the total sales volume and value of least selling products. \nRegork 2017",
x = "Product Category",
y = "Total Sales Value",
size = "Quantity Sold",
color = "Per Unit Cost",
caption = "Data sources: https://bradleyboehmke.github.io/completejourney/articles/completejourney.html",
tag = "Fig.1")
It can be observed from the plot Fig.1 that the category “Toys” is the least selling category. Before deciding to move ahead with this category, we would like to examine the potential market for Toys. Hence we analysed the customer demographics to look for potential household with kids.
### Market Potential for Toys #####################################
market <- demographics %>%
filter(kids_count >= 1)%>%
inner_join(transactions, by = "household_id")%>%
group_by(household_id,income)%>%
summarize(household_sales =sum(sales_value),visits = n_distinct(basket_id))%>%
arrange(desc(household_sales))
## `summarise()` has grouped output by 'household_id'. You can override using the
## `.groups` argument.
ggplot(market, aes(x=visits,y=household_sales,color=income))+
geom_point(shape=19,size = 2)+
labs(title = "Potential of Untapped Market",
subtitle = "Data represents the income demographics, sales volume of CUSTOMERS WITH KIDS",
x = "Store Visits",
y = "Household Sales Value",
color = "Income Group",
caption = "Data sources: https://bradleyboehmke.github.io/completejourney/articles/completejourney.html",
tag = "Fig.2")
As shown in the plot Fig.2 , there is a diverse pool of customers who have kids, and are frequent shoppers at Regork. Some of these households even make hundreds of purchases that value as high as $15000 and above. As the color of the points indicate, there are customers with income above 200k as well. Hence, there is a sizeable market for premium products also which can help gain higher margins. Although, this is a tangent that needs further deep-dive, we would like to continue our analysis on the product category “Toys”.
### Studying multiple aspects of the selected category "TOYS" #####################################
## Toys Product Portfolio
Options <- products %>%
filter(product_category == "TOYS")%>%
select(product_category,product_id,product_type,brand)
Options
## # A tibble: 3 × 4
## product_category product_id product_type brand
## <chr> <chr> <chr> <fct>
## 1 TOYS 831764 CARD GAMES National
## 2 TOYS 9193084 DIECAST MINI National
## 3 TOYS 9674186 CARD GAMES National
## Toys Campaigns
campaigns <- coupons %>%
filter(product_id == "831764" | product_id=="9193084"| product_id=="9674186")
n_distinct(campaigns$campaign_id)
## [1] 0
## Toys Display Locations
location <- promotions%>%
filter(product_id == "831764" | product_id=="9193084"| product_id=="9674186")%>%
group_by(display_location)%>%
mutate(display_location = fct_recode(display_location,
"In Shelf" = "A",
"Store Front" = "1",
"Store Rear" = "2",
"Front End" = "3",
"Mid Aisle End" = "4",
"Rear End" = "5",
"Side Aisle" = "6",
"In Aisle" = "7",
"Secondary Location Display" = "9",
))
ggplot(location,aes(fct_infreq(display_location)))+
geom_bar(fill = "gold")+
scale_x_discrete(guide = guide_axis(angle = 90)) +
labs(title = "Promotions on Toys through Display locations",
subtitle = "Data represents the display locations used for Toys in various stores",
x = "Display Location",
y = "Frequency",
caption = "Data sources: https://bradleyboehmke.github.io/completejourney/articles/completejourney.html",
tag = "Fig.3")
Quick observations on the category “Toys”, from the tibble Campaigns, tibble Options and the plot Fig.3, are as follows:
Thus, our final recommendations are based on the insights found so far. Additionally, to recommend how to push sales for Toys, if introduced, we analysed the product categories by sales value.
### Choosing product categories to bundle Toys for promotional offers #####################################
bundle <- demographics %>%
inner_join(transactions, by = "household_id")%>%
inner_join(products, by = "product_id")%>%
group_by(product_category,kids_count)%>%
filter(product_category != "COUPON/MISC ITEMS")%>%
mutate(kids_count = fct_collapse(kids_count,Families_with_kids = c("1","2","3+")))%>%
mutate(kids_count = fct_collapse(kids_count,No_kids = "0"))%>%
summarize(category_sales =sum(sales_value))%>%
filter(category_sales > 10000)%>%
arrange(desc(category_sales))
## `summarise()` has grouped output by 'product_category'. You can override using
## the `.groups` argument.
ggplot(bundle,aes(area = category_sales, fill = product_category, label = paste(product_category))) +
geom_treemap() +
geom_treemap_text(grow = TRUE,reflow = TRUE) +
theme(legend.position = "none") +
facet_grid(~kids_count)+
labs(title = "Indentifying Product Categories to Bundle promotional offers for Toys",
subtitle = "Data represents the top product categories by sales value, among families with and without kids.",
x = "Product Category",
y = "Category Sales Value",
fill = "Household Composition",
caption = "Data sources: https://bradleyboehmke.github.io/completejourney/articles/completejourney.html",
tag = "Fig.4")
The above plot Fig.4 helps identify the top selling category in each grid. Although the top categories are Soft drinks, beef, Cheese etc, they are common for all households. We can notice that the families with kids purchase “Diapers & Disposables” category with a total sales value of about $11000.This is the category we would want to bundle toys to, for our promotional offers.
This leads to our next question on how to improve our campaign strategy. This strategy will help our new venture of expanding Toys business to ultimate success.
CAMPAIGN STRATEGY
For the purpose of our analysis, we create a base table ‘datasets’ which contains a details of campaign resulting in highest sales and product category analysis purchased during that campaign period.
We start with analyzing campaign data, and tried to find out campaign with highest sales made. ‘Campaign id - 18’ had the highest sales. To deep drive further, we analyzed month of the campaign and time of the day which had the highest sales.
To explore more on the sales analysis, we tried to find if any particular store was making highest sales for top selling product categories based on any external factor.
#Join Transactions and Campaign tables
df1 <- campaign %>%
group_by(campaign_id) %>%
inner_join(transactions, by = 'household_id') %>%
mutate(daterange = as.Date(transaction_timestamp)) %>%
filter(between(daterange, min(start_date), max(end_date))) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
x=as.POSIXct(strptime(c("050000","105959","110000","155959",
"160000",
"185959"),"%H%M%S"),"UTC")
df2 <- campaign %>%
filter(campaign_id == '18') %>%
inner_join(transactions, by = 'household_id') %>%
mutate(time = format(as.POSIXct(transaction_timestamp), format = "%H:%M:%S")) %>%
mutate(daterange = as.Date(transaction_timestamp)) %>%
mutate(Month = month(daterange)) %>%
filter(between(daterange, min(start_date), max(end_date)))
df2$hour <- hour(hms(df2$time))
hr_breaks = seq(0, 24, by = 4)
hr_labels <- c("00-04", "04-08", "08-12", "12-16",
"16-20", "20-00")
df2$tod = cut(df2$hour, breaks = hr_breaks,
labels = hr_labels,
include.lowest = T, right = F)
df3 <- df2 %>%
group_by(month(daterange), tod) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
df3 <- df2 %>%
mutate(Monthname = case_when(Month == 10 ~ 'October',
Month == 11 ~ 'November',
Month == 12 ~ 'December')) %>%
group_by(Monthname, tod) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
plot1 <- ggplot(df3, aes(x = tod, y = total_sales, fill = Monthname)) +
geom_bar(stat = "identity") +
xlab("Time of the Day(24Hrs)") +
ylab("Total Sales") +
#facet_wrap(. ~ age, ncol = 2) +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face = "bold", colour = "blue", size = 12),
axis.title.y = element_text(face = "bold", colour = "blue", size = 12),
legend.title = element_text(face = "bold", size = 12),
strip.background = element_rect(fill = "lightgreen", colour = "black", size = 1),
strip.text = element_text(face = "bold",size = rel(1.2)))+
theme(plot.title = element_text(size = 5, vjust = -.5))
ggplotly(plot1)
After this we tried to find pattern for top 3 store ids and product category.
df4 <- df2 %>%
inner_join(products, by = "product_id") %>%
group_by(product_category) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
top_prod <- df4 %>%
slice(1:10)
dis_product_category = n_distinct(df4$product_category, na.rm = TRUE)
uni_prod <- unique(df2$product_id)
top10prod <- df2 %>%
inner_join(products, by = 'product_id') %>%
filter(product_category == top_prod$product_category)
df5 <- promotions %>%
inner_join(top10prod)
store <- df5 %>%
group_by(product_category, store_id) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
top_n(n = 3)
# Plot 2 ----------------------------------------------------------
store %>%
ggplot(aes(total_sales, store_id, label = round(total_sales,0),angle = 45)) +
geom_segment(aes(x = 0, y = store_id, xend = total_sales, yend = store_id)) +
geom_point(aes(color = product_category),size = 3) +
theme(panel.background = element_rect(fill = "white", colour = "grey50")) +
theme(plot.title.position = "plot",
plot.caption.position = "plot" ,
plot.title = element_markdown(),
plot.subtitle = element_text(size = 9, margin = margin(b = 25, l= -25)),
plot.caption = element_markdown()) +
labs(title = 'Sales of Top 3 Stores based on Product Category',tag = "Fig.6")
We also tried to find pattern for display location based on top 10 product category.
display <- df5 %>%
group_by(product_category, display_location) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
top_n(n = 3)
# Plot 3 ------------------------------------------------------------------
display %>%
ggplot(aes(total_sales, display_location, label = round(total_sales,0),angle = 45)) +
geom_segment(aes(x = 0, y = display_location, xend = total_sales, yend = display_location)) +
geom_point(aes(color = product_category),size = 3) +
#scale_color_manual(values=c("#FFFF33", "#33FF99")) +
theme(panel.background = element_rect(fill = "white", colour = "grey50")) +
theme(plot.title.position = "plot",
plot.caption.position = "plot" ,
plot.title = element_markdown(),
plot.subtitle = element_text(size = 9, margin = margin(b = 25, l= -25)),
plot.caption = element_markdown()) +
labs(title = "Sales of Top 3 Display Location based on Product Category",
tag = "Fig.7")+
facet_wrap(~ product_category)
#### {- .leave-tabset}
Regork is a national grocery chain. We are trying to identify potential areas of growth where the company could invest future resources to increase revenue and profits. For this, we chose to identify under performing product categories and thus building revenue streams from new avenues. This has been the category Toys, after analysis. Similarly, we chose to examine of the effectiveness of existing campaign data and made recommendations that will help in revenue growth.
The approach that we have taken to address the business objective is ‘Descriptive Analysis’. Through descriptive analysis, we perform investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
Based on the findings from the analyses done above, we have the following recommendations:
Business Question 1:
Expand the Product Portfolio for Toys. Get more brands onboarded and include Private Brands. Include product types that suit boys and girls, with a mix of premium and economy options, for the income demographics of our target households vary
Seasonally move selected brands or themed products to better display locations to increase visibility for this category
Include bundled promotions to push sales for these new products among the target customers. As shown in Fig.4, bundling with products like Diapers & Disposables will help reach our target customers
Business Question 2:
From a sales perspective, we can see that October to December is the peak time for sales and Soft drinks and Beer are one of the most popular purchase, while bag snacks are still a little low on the bar. Thus, the team can run more campaigns around the festive seasons will help to increase sales.
From a display-level, we need to target those stores where we see a high sales for display location 0. So, we can keep probably move the closer soft drinks, milk products and beer providing more visibility and consequently more sales
This analysis suggests Regork to invest funds in building the product category of Toys and build a new revenue channel from the existing customer base itself, by increasing their basket value. Further, this analysis should help the marketing teams of Regork to improvise their campaign design, thus driving our new venture of expanding Toys business to ultimate success.
The primary limitation for this data set is the limited time period for which this data was collected. Year on year comparison of campaigns during the same holiday seasons would yield better insights than comparing within the year. Seasonality of the sales also needs yearly data
Geographical data on store locations or any other information on store_id s’ will help validating the analysis done and avoiding any misinterpretations
Display location data is helpful in analyzing the promotions but planogram of the store - which brands are placed at eye-level of target customers is also equally important
Campaign types are identified as A,B and C. However, additional information on what each of them define would help getting additional insights. The product types and categories in each campaign type are widely diverse to make any assumptions.