Regork Targeted Growth Opportunity

Introduction

Problem Statement

Working as a data scientist for a national grocery chain named Regork. I have been tasked with identifying a potential area of growth where the company could invest future resources to increase revenue and profits. My manager has asked that I prepare a report that outlines my findings.

This analysis identifies purchasing patterns across income segments to target products for targeted promotions and marketing campaigns.

Methodology

To address the business problem, I used the completejourney dataset to produce data driven insights for the following:

  1. Average number of products purchased per transaction
  2. Product with higher sales value
  3. Product purchased regularly based on low, medium, high income segment

Proposed Solution

This analysis will surface key findings to make data driven decisions. Regork leadership will be empowered to target specific customers when developing marketing plans to incease revenue and profits.

Data Acqusition, Munge, Analysis Setup

In this step, I load the data, make sure it’s clean and can be used. This includes removing any empty cells and making sure values are in correct format.

**Packages/Libraries Required & Setup

I used the packages below in the analysis

library(completejourney) # shopping transaction data 
library(tidyverse)       # tidy data
library(ggplot2)         # data visualization
library(dplyr)           # data manipulation and transform
library(RColorBrewer)    # colors for plots
library(reactable)       # data tables tabs
library(scales)          # visualization- map & display data values 

Variables

  1. household_id: unique identifier for each customer to link transactions to demographics

  2. basket_id: unique identifier for transaction by customer

  3. product_id: unique identifier for specific items

  4. quantity: number of units of specific product in transactin

  5. sales_value: revenue from sale of specific product in transaction

  6. income: income segment of household from demographic data

  7. department: grouping of similar products

  8. brand: manuacturer name for product by national or private brand.

  9. product_category: grouping of specific products a level below department

  10. product_type: Describes specific items with grouping level below product category.

Data Preparation (Wrangle & Mutate) This step is where I create data frame, load dataset, process, join data (transactions, products, demographcis), pick variables, reorder, create income categories.

library(tidyverse)
library(completejourney)

setwd("C:/Users/edietz/OneDrive - Blue Origin, LLC/Desktop/Learning RStudio Desktop Files/Final")

full_transactions_data <- readRDS("transactions.rds")

df <- full_transactions_data %>%
  inner_join(demographics, by = "household_id") %>%
  left_join(products, by = "product_id") %>%
  select(household_id, basket_id, product_id, quantity, sales_value, income, department, brand, product_category, product_type) %>% 
  mutate(income = fct_collapse(
    income,
    "Regork Income Segment Low" = c("Under 15K", "15-24K", "25-34K", "35-49K"),
    "Regork Income Segment Medium" = c("50-74K", "75-99K", "100-124K", "125-149K"),
    "Regork Income Segment High" =c("150-174K", "175-199K", "200-249K", "250K+")
  ))

Exploratory Data Analysis

Analysis 1

Analysis 1: Average Number of Products Purchased Per Transaction

This analysis is meant to test if specific customer income segments purchase more product types per transaction than other income segements. To get this, I calculated the average number of products by transaction (basket) to see if there was an association between income level and number of products in a transaction.

I filtered outliers out from product_category column to get better representation of avg number of products purchases.

It showed that high income segment customers averaged the most products (19.1) per transaction/ basket. Regork could use this data to target high income segments with promos and marketing campaigns to try to increase the average products sold per transaction.

df <- df %>%
  filter(product_category != "COUPON/MISC ITEMS") %>%
  filter(product_category != "FUEL")

Average Number of Products Purchased Per Transaction

df %>%
  group_by(basket_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
  summary(total_quantity)
##   basket_id         total_quantity  
##  Length:72992       Min.   :  0.00  
##  Class :character   1st Qu.:  3.00  
##  Mode  :character   Median :  8.00  
##                     Mean   : 14.59  
##                     3rd Qu.: 19.00  
##                     Max.   :220.00

Preparing Data for Plotting

I calculate average quantity per transaction (basket) for the full dataset, then I filtered outlider, grouped by income and transaction, calculated the quanitty of products by transaction and ordered the income levels to plot findings.

avg_products_by_income <- df %>%
  filter(product_category != "COUPON/MISC ITEMS", product_category != "FUEL") %>%
  group_by(income, basket_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop_last') %>%
  summarize(avg_products = mean(total_quantity, na.rm = TRUE), .groups = 'drop') %>%
  mutate(income = factor(income, 
                         levels = c("Regork Income Segment Low",
                                    "Regork Income Segment Medium",
                                    "Regork Income Segment High")))
print(avg_products_by_income)
## # A tibble: 3 × 2
##   income                       avg_products
##   <ord>                               <dbl>
## 1 Regork Income Segment Low            13.0
## 2 Regork Income Segment Medium         15.7
## 3 Regork Income Segment High           19.1

Average Number of Products Purchased Per Transaction by Income Level

Average number of products purhcased per transaction by income level are:

  1. Regork Income Segment Low: 13.0
  2. Regork Income Segment Medium: 15.7
  3. Regork Income Segment High: 19.1

Visualization: Avg Number of Products Purchase Per Transaction by Income

library(ggplot2)
library(scales)

ggplot(avg_products_by_income, aes(x = income, y = avg_products)) +
  geom_segment(aes(xend = income, yend = 0),
               color = "magenta") +
  geom_point(size = 5, aes(color = income)) +
  geom_text(aes(label = round(avg_products, 2)),
            vjust = 1.5,
            size = 4) +
  labs(
    title = "Average Products Per Transaction by Income",
    x = "Regork Income Segment",
    y = "Average Products Per Basket"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust =0.5, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
  ) +
  scale_y_continuous(limits = c(0, max(avg_products_by_income$avg_products) * 1.15))

Regork Income Segment Low

df%>%
  filter(income == "Regork Income Segment Low") %>%
  group_by(basket_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
  summary(total_quantity)
##   basket_id         total_quantity  
##  Length:36715       Min.   :  0.00  
##  Class :character   1st Qu.:  3.00  
##  Mode  :character   Median :  7.00  
##                     Mean   : 13.03  
##                     3rd Qu.: 16.00  
##                     Max.   :220.00

Regork Income Segment Medium

df%>%
  filter(income == "Regork Income Segment Medium") %>%
  group_by(basket_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
  summary(total_quantity)
##   basket_id         total_quantity  
##  Length:31169       Min.   :  0.00  
##  Class :character   1st Qu.:  3.00  
##  Mode  :character   Median :  8.00  
##                     Mean   : 15.68  
##                     3rd Qu.: 21.00  
##                     Max.   :219.00

Regork Income Segment High

df%>%
  filter(income == "Regork Income Segment High") %>%
  group_by(basket_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
  summary(total_quantity)
##   basket_id         total_quantity 
##  Length:5108        Min.   :  0.0  
##  Class :character   1st Qu.:  5.0  
##  Mode  :character   Median : 12.0  
##                     Mean   : 19.1  
##                     3rd Qu.: 26.0  
##                     Max.   :140.0

Analysis 2

2. Product Brands With Higher Sales Value This analysis compares avg sales value of National vs Private brand products based on income. I wanted to compare sales of products by brand to analyze customer preference. By doing this, it showed me National brands generate higher revenue for Regorks than the private brand.

Preparing data for Plotting

Top 10 Product Categories by Sales Value

sales_by_segment <- df %>%
  filter(product_category !="COUPON/MISC ITEMS",
         product_category !="FUEL") %>%
  filter(brand %in% c("National", "Private")) %>%
  group_by(brand,product_category, income) %>%
  summarize(
    total_sales_value = sum(sales_value, na.rm = TRUE),
    .groups = 'drop'
  )

top_categories_list <- sales_by_segment %>%
  group_by(product_category) %>%
  summarize(grand_total = sum(total_sales_value), .groups = 'drop') %>%
  arrange(desc(grand_total)) %>%
  slice_head(n = 10) %>%
  pull(product_category)

top_combined_sales <- sales_by_segment %>%
  filter(product_category %in% top_categories_list) %>%
  mutate(income = factor(income, 
                         levels = c("Regork Income Segment Low", 
                                    "Regork Income Segment Medium",
                                    "Regork Income Segment High")))

print(top_combined_sales)
## # A tibble: 59 × 4
##    brand    product_category       income                      total_sales_value
##    <fct>    <chr>                  <ord>                                   <dbl>
##  1 National BAG SNACKS             Regork Income Segment Low              18541.
##  2 National BAG SNACKS             Regork Income Segment Medi…            20313.
##  3 National BAG SNACKS             Regork Income Segment High              3903.
##  4 National BAKED BREAD/BUNS/ROLLS Regork Income Segment Low              11891.
##  5 National BAKED BREAD/BUNS/ROLLS Regork Income Segment Medi…            15575.
##  6 National BAKED BREAD/BUNS/ROLLS Regork Income Segment High              3386.
##  7 National BEEF                   Regork Income Segment Low              42153.
##  8 National BEEF                   Regork Income Segment Medi…            38250.
##  9 National BEEF                   Regork Income Segment High              8619.
## 10 National BEERS/ALES             Regork Income Segment Low              18209.
## # ℹ 49 more rows

**Visualization: Top 10 Product Categories Sold by Brand”

library(ggplot2)
library(scales)

ggplot(top_combined_sales,
       aes(x = product_category, y = total_sales_value, fill = income)) +
  geom_col()+
  facet_wrap(~ brand, scales = "free_x", ncol = 1) +
  scale_fill_manual(values = c("Regork Income Segment Low" = "magenta",
                               "Regork Income Segment Medium" = "blue",
                               "Regork Income Segment High" = "orange")) +
  scale_y_continuous(labels = scales:: comma_format(prefix = "$")) +
  labs(
    title = "Top 10 Product Categories Sold by Brand",
    subtitle = "Total Sales by Income Segment broken down by Brand",
    x = "Product Category",
    y = "Total Sales Value ($)",
    fill = "Income Segment"
  ) + 
  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Priate Brand

Private Brand Average Sales

asv_private <-df%>%
  filter(brand =="Private") %>%
  summarize(
    total_sale = sum(sales_value, na.rm = TRUE),
    total_quantity = sum(quantity, na.rm = TRUE)
  ) %>%
  mutate(revenue_per_item = round((total_sale / total_quantity), 2))
reactable(asv_private, bordered = TRUE, striped = TRUE, compact = TRUE)

National Brand

National Brand Average Sales

asv_national <- df %>%
  filter(brand =="National") %>%
  summarize(
    total_sale = sum(sales_value, na.rm = TRUE),
    total_quantity = sum(quantity, na.rm = TRUE)
  )%>%
  mutate(revenue_per_item = round((total_sale /total_quantity), 2))
  reactable(asv_national, bordered = TRUE, striped = TRUE, compact = TRUE)

Analysis 3

3. Top Products Regularly Purchased by Income Segment (Low, Medium, High)

This analysis captures the average top 20 most frequently purchased products per transaction (basket)

# Prep data for visualization, calculations, filter names, combine

# Low Income Segment
top_low_prods <- df %>%
  filter(income =="Regork Income Segment Low") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size)) %>%
  mutate(income_segment = "Low")

# Medium Income Segment

top_mid_prods <- df %>%
  filter(income =="Regork Income Segment Medium") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size)) %>%
  mutate(income_segment = "Medium")

# High Income Segment

top_up_prods <- df %>%
  filter(income =="Regork Income Segment High") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size)) %>%
  mutate(income_segment = "High")

master_top_prods <- bind_rows(top_low_prods, top_mid_prods, top_up_prods) %>%
  mutate(income_segment = factor (income_segment, levels = c("Low", "Medium", "High"))) %>%
  select(income_segment, product_id, total_quantity, brand, product_category, product_type) %>%
  arrange(income_segment, desc(total_quantity))

#Prep Table

reactable(
  master_top_prods,
  filterable = TRUE, 
  searchable = TRUE,
  showPageSizeOptions = TRUE, 
  striped = TRUE, 
  highlight = TRUE, 
  defaultPageSize = 10,
  columns = list(
    income_segment = colDef (name = "Income Segment", minWidth = 100),
    product_id = colDef(name = "Product ID", minWidth = 80),
    total_quantity = colDef(name = "Total Qty (Top 20)", minWidth = 100),
    brand = colDef(name = "Brand", minWidth = 80),
    product_category = colDef(name = "Product Category", minWidth = 130),
    product_type = colDef(name = "Product Type", minWidth =150)
  )
)
library(ggplot2)
library(scales)

plot_data <- master_top_prods %>%
  mutate(product_category = reorder (product_category, total_quantity))

ggplot(plot_data,
       aes(x = total_quantity, y = product_category)) +
  
  geom_segment(aes(xend = 0, yend = product_category),
               color = "magenta",
               linewidth = 0.5) +
  geom_point(aes(color = brand), size = 3) +
  
  facet_wrap(~ income_segment, scales ="free_y", ncol =1) +
  labs(
    title = "Top 20 Retail Products by Income Segment",
    subtitle = "Product Rank by Total Quantity Purchased",
    x = "Total Quantity Purchased",
    y = "Product Category", 
    color = "Bran Type"
  ) +
  scale_x_continuous(labels = scales::comma)+
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face ="bold"),
    legend.position = "bottom"
  )

Top 20 Product Purchased by Low Income Segment

top_low_prods <- df %>%
  filter(income == "Regork Income Segment Low") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity)) %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size))

reactable(top_low_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)

Top 20 Product Purchased by Medium Income Segment

top_mid_prods <- df %>%
  filter(income == "Regork Income Segment Medium") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity)) %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size))

reactable(top_mid_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)

Top 20 Product Purchased by High Income Segment

top_up_prods <- df %>%
  filter(income == "Regork Income Segment High") %>%
  group_by(product_id) %>%
  summarize(total_quantity = sum(quantity)) %>%
  slice_max(total_quantity, n = 20) %>%
  left_join(products, by = "product_id") %>%
  select(-c(manufacturer_id, package_size))

reactable(top_up_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)

Summary

Summarize Problem Statement

The objective was to identify a specific data driven areas for growth where Regork could invest resources to increase revenue an dprofits. To do this, I analyzed customer purchase patters across income segments to determine high demand products for targeted promotions and marketing campaigns.

Methodology

The method I used surfaces data points on specific products that leadership can take immediate action on to increase revenue. In the immediate term, leadership can spin up marketing campaigns and invest resources where product purchases are going well, while simultaneously place resources where there are opportunities for growth.

I used the completejourney dataset to process and joine transaction, product and demographic data. Key variables were leveraged for targeted data analysis.

The methodology focused on three areas to provide a recommendation to Regork leadership:

  1. Average number of products purchased per transaction based income segment to determine most valuabe consumer segment to invest in immediately

  2. Top products pruchased and reviewed product brands with highest sales for Regork leadership to determine where to invest for immediate revenue growth

  3. Top 20 product most frequently purchased by income segment to create list of products for targeted promotions

Interesting Insights from Analysis

Looking at products prchused most often, Regork could identify which of those products are essential items or nice to have desirable items based on income segment. By doing this, Regork can better understand why products are in high demand and expand to similar products in the same grouping.

Implications to Consumer Based on Analysis

Consumer who are high income may notice increase in individualized marketing for products they purchase frequently. This could make them feel seen and create a deeper emotional bond with Regork. Consumers in the low income segment may notice promotions targeting essential items, which could help their savings go further and create loyalty.

Proposal to Regork CEO

This analysis shares targeted opportunities for revenue and profit growth, giving the CEO the ability to make fully infomred business decisions with immediacy.

With these new data insights on consumer purchases by income segment, Regork can:

  1. Begin targeted marketing campaigns to generate sales on high value and essential items. These campaigns could drive foot traffic in the stores with the potential to generate additional sales on items not advertised.

  2. Focus on product placement by verifying highly desired items are always in stock and placed in areas of high visibility or high traffic in the stores. Store layout and product placement could increase average items per transaction.

  3. Focus on Targetd Promotional Coupons and bogo deals through digital mail, paper mail to drive customers into the store. And then offer targeted promotions at the point of sale as a last opportunity to maximize their spending or to keep them coming back.

  4. Focus on all income segments and what drives their behavior to purhcase more. Lower income segments want deals on staple items, whereas high income segments may be drawn to shop for high value items.

Limitations of Analysis & How it can be Improved

The data used in the analysis is stale and does not allow for real time insights based on any economic shifts. To improve this, having real time data that shows customer spending would allow Regorks to act fast to prevent negative impacts to profit margins.

The data is not segmented by store location or region, leaving less information on demographic data to understand customer behavior like product preferences by culture and spedning habits. It would be valuable to have store location data to target customer segmenets with marketing campaigns to grow the Regork national footprint.