The grocery industry is highly competitive, with thin margins and increasing pressure from discount retailers and e-commerce platforms. One major strategy that retailers use to increase profitability and customer loyalty is investing in private label (store-brand) products.
According to Nielsen (2023), private label sales in the U.S. account for over 18% of total grocery sales, with European markets seeing even higher penetration levels of 30-40%. Private label products typically yield higher profit margins than national brands due to lower marketing costs and vertical integration in supply chains. However, consumer perception varies—while some shoppers actively seek private labels for affordability and quality, others remain loyal to national brands, often associating them with higher trust and consistency.
This analysis aims to answer the key business question:
How do customers choose between private label and national brand products, and how can Regork drive private label adoption to maximize profitability?
To analyze customer behavior and private label preference, we use the Complete Journey dataset, which includes:
Transaction Data: Captures purchase history, allowing us to segment sales into private label vs. national brands.
Product Data: Identifies whether an item is a private label or a national brand.
Demographic Data: Provides insights into customer segments (income, household size, etc.).
Data Preparation: Clean and merge datasets to classify products into private label vs. national brand.
Exploratory Data Analysis: Identify sales trends, customer demographics, and price sensitivity.
Statistical & Visual Analysis:
Compare private label adoption across different income groups.
Analyze price sensitivity—do customers switch to private labels when national brands are more expensive?
Evaluate promotion impact—do discounts significantly increase private label purchases?
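The income-group comparison in the steps above can be sketched on a small scale. The snippet below is a minimal illustration, not the analysis itself: the `purchases` data frame and its income groups are hypothetical toy values, and base R is used so that it runs without extra packages.

```r
# Hypothetical toy data: one row per purchase (not taken from the Complete Journey data)
purchases <- data.frame(
  income      = c("Under 35K", "Under 35K", "Under 35K", "35-74K", "35-74K", "75K+", "75K+", "75K+"),
  brand       = c("Private", "National", "Private", "National", "Private", "National", "National", "Private"),
  sales_value = c(2, 3, 5, 6, 4, 8, 7, 5)
)

# Total spending per income group, and the private label part of it
total_by_income   <- tapply(purchases$sales_value, purchases$income, sum)
private_rows      <- purchases[purchases$brand == "Private", ]
private_by_income <- tapply(private_rows$sales_value, private_rows$income, sum)

# Private label share of spending (%), aligned by income group name
private_share <- round(100 * private_by_income[names(total_by_income)] / total_by_income, 1)
print(private_share)
```

A share per group, rather than raw dollars, makes income groups of different sizes comparable.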
By understanding private label adoption trends, Regork can:
Enhance marketing strategies by targeting demographics most likely to convert to private label.
Optimize pricing & promotions to maximize private label sales while maintaining profitability.
Improve store layout & bundling strategies to increase visibility and preference for private label items.
This data-driven approach will provide actionable insights for Regork’s CEO and marketing team, enabling them to make strategic decisions that boost revenue, enhance customer loyalty, and drive higher margins.
Below are the key R packages used in this analysis:
tidyverse – A collection of essential R packages for data manipulation, visualization, and analysis. Includes ggplot2, dplyr, tidyr, readr, and more.
dplyr – Provides a fast, intuitive, and consistent grammar for data wrangling (filtering, summarizing, joining, etc.).
ggplot2 – Used for data visualization, creating high-quality and customizable graphs.
lubridate – Simplifies working with dates and times, crucial for analyzing time-based purchasing behavior.
stringr – Provides functions for working with text data, useful for handling product names and categories.
knitr – Converts R Markdown into formatted HTML reports.
kableExtra – Enhances tables in reports, allowing for styled and formatted tables in HTML.
scales – Improves number and date formatting for better data visualization.
gridExtra – Helps arrange multiple plots in a single output.
arules – Used for association rule mining (e.g., identifying which products are frequently purchased together).
arulesViz – Helps visualize association rules from market basket analysis.
janitor – Simplifies data cleaning, including removing duplicate column names and handling messy datasets.
To use these packages, we need to load them in R using the library() function. Below is the code to load all necessary libraries for our analysis:
# load libraries
library(tidyverse) # Data manipulation & visualization
library(dplyr) # Efficient data wrangling
library(ggplot2) # High-quality visualizations
library(lubridate) # Handling date & time data
library(stringr) # String/text manipulation
library(knitr) # Markdown reporting
library(kableExtra) # Enhanced table formatting
library(scales) # Improves axis scaling in plots
library(gridExtra) # Arranges multiple plots
library(arules) # Market basket analysis
library(arulesViz) # Association rule visualization
library(janitor) # Data cleaning
library(completejourney) # Data set
The data for this R project can be accessed from the CompleteJourney website. The CompleteJourney datasets are based on grocery shopping transactions from a group of 2,469 households. Entities such as demographics, products, coupons, campaigns, etc., were collected over a one-year timeframe from January 2017 - December 2017.
We utilize three key datasets: transactions, products, and demographics. The tables below outline the variables used in our analysis.
All data preparation, including joins, slices, and new variables, is included in the setup code for the exploratory data analysis.
The transactions dataset records customer purchases, including details about sales and discounts.
# define transactions table structure
transactions_table <- data.frame(
"Variable Name" = c("household_id", "product_id", "quantity", "sales_value", "retail_disc", "coupon_disc"),
"Data Type" = c("character", "character", "numeric", "numeric", "numeric", "numeric"),
"Variable Description" = c(
"Uniquely identifies each household",
"Uniquely identifies each product",
"Number of the product purchased during the trip",
"Amount of dollars the retailer receives from sale",
"Discount applied due to the retailer’s loyalty card program",
"Discount applied due to a manufacturer coupon"
)
)
# display transactions table
kable(transactions_table, caption = "Transactions Table Definitions") %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE)
Variable.Name | Data.Type | Variable.Description |
---|---|---|
household_id | character | Uniquely identifies each household |
product_id | character | Uniquely identifies each product |
quantity | numeric | Number of the product purchased during the trip |
sales_value | numeric | Amount of dollars the retailer receives from sale |
retail_disc | numeric | Discount applied due to the retailer’s loyalty card program |
coupon_disc | numeric | Discount applied due to a manufacturer coupon |
# load data sets
library(completejourney)
transactions <- get_transactions() # download the full transactions data set
data("products") # bundled with completejourney
data("demographics") # bundled with completejourney
The products dataset provides information about each product, such as its category, brand, and whether it’s a private label or national brand.
# define products table structure
products_table <- data.frame(
"Variable Name" = c("product_id", "department", "brand"),
"Data Type" = c("character", "character", "character"),
"Variable Description" = c(
"Uniquely identifies each product",
"Department/category the product belongs to",
"Indicates whether the product is a private label or national brand"
)
)
# display products table
kable(products_table, caption = "Products Table Definitions") %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE)
Variable.Name | Data.Type | Variable.Description |
---|---|---|
product_id | character | Uniquely identifies each product |
department | character | Department/category the product belongs to |
brand | character | Indicates whether the product is a private label or national brand |
The demographics dataset provides information on household demographic data such as age, income, family size, and more.
# define demographics table structure
demographics_table <- data.frame(
"Variable Name" = c("household_id", "age", "income", "household_size"),
"Data Type" = c("character", "character", "character", "character"),
"Variable Description" = c(
"Uniquely identifies each household",
"Age group of the primary shopper",
"Income bracket of the household",
"Household size category"
)
)
# display demographics table
kable(demographics_table, caption = "Demographics Table Definitions") %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE)
Variable.Name | Data.Type | Variable.Description |
---|---|---|
household_id | character | Uniquely identifies each household |
age | character | Age group of the primary shopper |
income | character | Income bracket of the household |
household_size | character | Household size category |
# load libraries
library(tidyverse)
# join: transactions + products + demographics
# summarize income levels and brand type
spending_data <- transactions %>%
inner_join(products, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
mutate(
label_type = ifelse(brand == "Private", "Private Label", "National Brand"),
total_spending = sales_value
) %>%
filter(!is.na(income), total_spending > 0) %>% # Remove missing income and zero spending
group_by(household_id, income, label_type) %>%
summarise(total_spending = sum(total_spending), .groups = "drop") # Sum spending per household
# convert income to ordered factor for sorting
spending_data$income_bracket <- factor(spending_data$income,
levels = c("Under 15K", "15-24K", "25-34K", "35-49K",
"50-74K", "75-99K", "100-124K", "125-149K", "150K+"))
ggplot(spending_data, aes(x = income_bracket, y = total_spending / 1000, fill = label_type)) +
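geom_boxplot(outlier.shape = NA, alpha = 0.7) + # Hide outlier points
# show the y-axis in thousands; zoom with coord_cartesian() so households above
# the axis limit still count toward the box statistics instead of being dropped
scale_y_continuous(labels = scales::label_comma(suffix = "K")) +
coord_cartesian(ylim = c(0, 9)) +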
# graph components
scale_fill_manual(values = c("Private Label" = "#2E86C1", "National Brand" = "#E74C3C")) +
labs(
title = "Distribution of Spending by Income Level",
subtitle = "Comparing spending behavior for private and national brands",
x = "Income Level",
y = "Total Spending ($ Thousands)",
fill = "Brand Type"
) +
# formatting theme
theme_minimal() +
theme(
text = element_text(size = 10),
plot.title = element_text(size = 14, hjust = 0, face = "bold"),
plot.subtitle = element_text(size = 10, hjust = 0),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.position = "bottom"
)
The box plot illustrates the distribution of total spending across different income levels, comparing private label and national brand purchases. Across all income brackets, national brands consistently show higher spending compared to private labels, as reflected in the higher medians and wider interquartile ranges (IQRs). This suggests that, regardless of income level, consumers tend to allocate more of their budget to national brands.
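The medians and interquartile ranges read off the box plot can also be computed directly. The sketch below assumes a hypothetical `spending` data frame with invented household totals; it only illustrates the calculation and does not reproduce the figures behind the plot.

```r
# Hypothetical household-level totals (illustrative values only)
spending <- data.frame(
  label_type     = rep(c("National Brand", "Private Label"), each = 5),
  total_spending = c(1000, 2000, 3000, 4000, 5000, 500, 1000, 1500, 2000, 2500)
)

# Median and interquartile range per brand type, the quantities a box plot draws
med <- tapply(spending$total_spending, spending$label_type, median)
iqr <- tapply(spending$total_spending, spending$label_type, IQR)

box_stats <- data.frame(
  label_type = names(med),
  median     = as.vector(med),
  iqr        = as.vector(iqr)
)
print(box_stats)
```

Adding the income bracket as a second grouping variable would give one row per box in the plot.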
Interestingly, lower-income households (under $15K) exhibit a more concentrated spending range with lower overall variance, indicating that their spending behavior is relatively constrained. As income levels rise, spending increases, particularly on national brands, with upper-income groups (e.g., $125K+) showing a greater spread, suggesting diverse purchasing habits within this segment. Because the plot suppresses individual outlier points, this spread is visible only in the wider boxes and whiskers, which suggest that some higher-income households allocate significantly more to national brands than others in the same income group.
Despite the trend favoring national brands, private label spending remains relatively stable across income levels, suggesting a baseline demand that does not significantly fluctuate with income. This could imply that private label products serve as a consistent budget-friendly option for all consumers, while national brand purchases increase as discretionary income rises.
Retailers should focus on promoting private label products as a cost-effective alternative, especially for lower-income consumers who demonstrate steady spending patterns on these brands. Introducing premium-tier private label options for higher-income shoppers could capture additional market share and raise profit margins, particularly in categories where national brands dominate.
# load libraries
library(ggplot2)
library(dplyr)
# join: transactions + products
# summarize discounts by brand
data_joined <- transactions %>%
inner_join(products, by = "product_id") %>%
filter(coupon_disc > 0 & coupon_disc < quantile(coupon_disc, 0.99)) # Trim the upper tail; note the 99th percentile is taken over all joined rows, zero-discount ones included
# create the density plot
ggplot(data_joined, aes(x = coupon_disc, fill = brand)) +
geom_density(alpha = 0.5, adjust = 1.2) + # Density plot with transparency
scale_fill_manual(values = c("Private" = "#2E86C1", "National" = "#E74C3C")) + # Custom colors
# graph components
labs(
title = "Distribution of Discounts for Private Label vs. National Brands",
subtitle = "Comparing the frequency of discount levels across brand types",
x = "Coupon Discount ($)",
y = "Density",
fill = "Brand Type"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, hjust = 0),
plot.subtitle = element_text(size = 10, hjust = 0, face = "italic"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
legend.position = "bottom"
)
The density plot reveals distinct discounting patterns between national brands and private labels. Private label discounts tend to cluster around higher values, particularly in the range of $0.30 to $0.40, with a sharper peak than national brands. This suggests that private labels rely more heavily on structured discounts to attract price-sensitive consumers. Meanwhile, national brands show a broader distribution of discount values, with a notable presence at lower discount levels. This could indicate that while national brands do offer discounts, they are less reliant on them compared to private labels, potentially leveraging brand recognition and loyalty instead. Note that the plot covers only the low end of the coupon discount range: the 99th-percentile cutoff in the filter is computed over all joined transactions, most of which have no coupon discount, so the cutoff falls well below the largest discounts.
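How much the percentile cutoff depends on whether zero-discount rows are still present can be seen on toy data. The sketch below uses a hypothetical vector of coupon discounts (980 zeros and 20 positive values); the two-step variant is shown only as an alternative for comparison, not as the method behind the plot.

```r
# Hypothetical coupon discounts: 980 transactions without a coupon, 20 with one
coupon_disc <- c(rep(0, 980), seq(0.25, 5, by = 0.25))

# One-step filter: the cutoff is the 99th percentile of ALL rows, zeros included
cutoff_all    <- quantile(coupon_disc, 0.99)
kept_one_step <- coupon_disc[coupon_disc > 0 & coupon_disc < cutoff_all]

# Two-step filter: keep positive discounts first, then take the percentile of those
positive        <- coupon_disc[coupon_disc > 0]
cutoff_positive <- quantile(positive, 0.99)
kept_two_step   <- positive[positive < cutoff_positive]

# Compare how many discounted transactions survive each filter
print(c(one_step = length(kept_one_step), two_step = length(kept_two_step)))
```

In this toy case the one-step filter keeps only the lower half of the positive discounts, while the two-step filter drops just the single largest value.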
The overlap between private label and national brand discounts in the higher range suggests that when national brands discount aggressively, they compete more directly with private labels. However, the higher density of private label discounts in specific price points implies a more predictable and structured pricing strategy designed to convert price-sensitive shoppers consistently.
Retailers and private label managers should take advantage of these findings by optimizing their discounting strategy. Since private labels are already benefiting from a structured discount model, they should explore whether deeper or more frequent promotions would lead to higher conversions, particularly in categories where national brands also compete aggressively on price. Additionally, private labels should test loyalty-driven discounts to encourage repeat purchases rather than one-time conversions.
# Load necessary libraries
library(dplyr)
library(knitr)
library(kableExtra)
# Join transactions with product data to link brands
discount_data <- transactions %>%
inner_join(products, by = "product_id") %>%
filter(coupon_disc > 0) # Keep only transactions where a discount was applied
# Count all joined transactions once (the shared denominator for both brand types)
total_transactions <- nrow(inner_join(transactions, products, by = "product_id"))
# Create a summary table of discounts by brand type
discount_summary <- discount_data %>%
group_by(brand) %>%
summarise(
`Avg Discount` = round(mean(coupon_disc, na.rm = TRUE), 2),
`Median Discount` = round(median(coupon_disc, na.rm = TRUE), 2),
`Discounted Transactions` = n(),
`Total Transactions` = total_transactions,
`Percentage of Discounted Transactions` = round((`Discounted Transactions` / `Total Transactions`) * 100, 2)
) %>%
arrange(desc(`Avg Discount`))
# Display as a formatted table
discount_summary %>%
kable("html", escape = FALSE) %>% # Prevents unwanted escape characters
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>% # Make brand names bold
row_spec(0, bold = TRUE, background = "#f7f7f7") # Style header row
brand | Avg Discount | Median Discount | Discounted Transactions | Total Transactions | Percentage of Discounted Transactions |
---|---|---|---|---|---|
Private | 2.05 | 1.50 | 271 | 1464471 | 0.02 |
National | 1.03 | 0.75 | 18649 | 1464471 | 1.27 |
The table reveals a stark difference in discounting strategies between Private Label and National Brands. Private Label products receive a higher average ($2.05) and median ($1.50) discount per transaction compared to National Brands ($1.03 average, $0.75 median). However, despite the higher discount per transaction, the number of transactions that received discounts for Private Label is extremely low (271 transactions out of 1.46M total), making up only 0.02% of total transactions. In contrast, National Brands have significantly more discounted transactions (18,649), comprising 1.27% of all transactions, despite offering lower individual discounts.
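This suggests that Private Label products are either infrequently discounted or that when discounts are applied, they are targeted at a much smaller customer base. Meanwhile, National Brands appear to leverage discounts more broadly, potentially as a volume-based promotional strategy. Note that coupon_disc records manufacturer coupon discounts only; loyalty card discounts are recorded separately in retail_disc and are not counted here.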
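The percentages in the table use all transactions as the denominator for both brand types. A complementary view is the share of each brand's own transactions that carried a coupon discount. The sketch below shows that calculation on a hypothetical `joined` data frame with invented values.

```r
# Hypothetical joined transactions (illustrative values only)
joined <- data.frame(
  brand       = c(rep("National", 8), rep("Private", 4)),
  coupon_disc = c(0.5, 0, 0, 1.0, 0, 0, 0, 0.75, 0, 0, 1.5, 0)
)

# Share of each brand's own transactions that carried a coupon discount (%)
discount_rate <- tapply(joined$coupon_disc > 0, joined$brand, mean) * 100
print(discount_rate)
```

A within-brand rate separates how often a brand type is discounted from how many transactions that brand type has in total.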
Given the data, Regork should consider increasing the frequency of Private Label discounts to better compete with National Brands. While Private Label products already offer a higher discount per transaction, the low volume of discounted transactions means that price-sensitive customers may not be incentivized to switch. Implementing more frequent but slightly lower-value discounts on Private Label could encourage trial and retention among customers accustomed to National Brands.
Additionally, targeting discount campaigns to specific demographics (such as lower-income households, identified in previous analyses) could be an effective strategy to drive conversion. Given the low rate of Private Label discounts, it’s also worth testing bundling or loyalty-based discount strategies to increase engagement and long-term switching behavior. Finally, A/B testing different discount structures (e.g., percentage-based vs. fixed-value discounts) could provide insights into the most effective promotional strategies for driving Private Label adoption.
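As a rough sketch of how such an A/B test could be evaluated, the snippet below compares conversion rates between two discount structures with a two-sample test of proportions. All counts are invented for illustration, and the arm labels (percentage-based vs. fixed-value) are assumptions, not results from this analysis.

```r
# Hypothetical A/B test results (invented counts, for illustration only)
# Arm A: percentage-based discount; Arm B: fixed-value discount
conversions <- c(A = 120, B = 90)     # households that bought the private label item
exposed     <- c(A = 1000, B = 1000)  # households that received each offer

# Two-sample test of equal proportions (stats::prop.test, part of base R)
ab_test <- prop.test(x = conversions, n = exposed)

# Observed conversion rate per arm and the test's p-value
conversion_rate <- conversions / exposed
print(conversion_rate)
print(ab_test$p.value)
```

A small p-value would indicate that the difference in conversion rates is unlikely to be due to chance alone; in practice the arms would also need to be randomized across comparable households.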
The primary objective of this analysis was to assess the performance of Private Label brands compared to National Brands, identify the factors influencing consumer purchasing behavior, and determine strategies to improve Private Label sales. Specifically, we aimed to understand Private Label’s market share across different product categories, how income levels influence brand choice, the impact of discounts on consumer preference, and potential brand loyalty insights.
To address these questions, we integrated three key datasets:
Transactions: Contained purchase-level data, including product_id, household_id, total purchase amount, and coupon discounts.
Products: Merged using product_id to associate each transaction with its respective brand (Private Label or National Brand).
Demographics: Merged using household_id to link consumer transactions with household income levels.
We applied the following analytic approach:
Overall Market Share Analysis: A pie chart was created to visualize the proportion of total sales accounted for by Private Label vs. National Brands.
Category-Level Performance: A stacked bar chart was used to compare Private Label penetration across different product departments.
Income-Based Purchasing Behavior: A box plot was employed to assess total spending distribution by income level for both brand types.
Brand Market Share by Income Level: Another stacked bar chart analyzed Private Label vs. National Brand sales distribution across income groups.
Impact of Discounts: A density plot was used to examine how discount distribution differed between Private Label and National Brands, identifying whether Private Label products receive more or deeper discounts.
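The category-level comparison in this list comes down to a share-of-sales calculation per department. The sketch below is a minimal illustration on a hypothetical `sales` data frame; the department names and values are invented, and base R's xtabs() and prop.table() are used so that it runs without extra packages.

```r
# Hypothetical sales by department and brand type (illustrative values only)
sales <- data.frame(
  department  = c("GROCERY", "GROCERY", "MEAT", "MEAT", "COSMETICS", "COSMETICS"),
  brand       = c("Private", "National", "Private", "National", "Private", "National"),
  sales_value = c(40, 60, 10, 90, 5, 95)
)

# Department-by-brand sales totals
sales_matrix <- xtabs(sales_value ~ department + brand, data = sales)

# Row-wise shares: each department's sales split between brand types (%)
share_by_department <- round(100 * prop.table(sales_matrix, margin = 1), 1)
print(share_by_department)
```

Each row sums to 100, which is the quantity a 100% stacked bar chart displays for one department.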
The analysis revealed that Private Label products account for only 28.3% of total sales, indicating that despite their affordability, consumers still favor National Brands. This suggests that brand perception, product quality, or marketing influence may play a significant role in driving purchasing decisions. However, Private Label is not without its strengths—it has successfully captured meaningful market share in select categories, proving that price-sensitive consumers do exist, but their engagement is not uniform across all product types.
Looking deeper into category-level performance, we found that certain departments have strong Private Label penetration, particularly in Grocery and Packaged Goods, where store brands are more common and accepted by consumers. However, categories such as Meat, Seafood, and Cosmetics overwhelmingly favor National Brands, likely due to perceived quality differences or brand loyalty in these segments. This highlights an opportunity for Private Label to expand strategically into categories where consumer trust is a key purchasing factor, whether through improved product positioning or targeted marketing efforts.
Income-based spending behavior also provided an interesting dynamic. While higher-income households allocate greater total spending to National Brands, Private Label’s share remains relatively stable across income brackets. This suggests that while wealthier consumers are willing to pay a premium for established brands, budget-conscious shoppers continue to rely on Private Label at similar rates regardless of earnings. However, because Private Label purchases are typically associated with lower transaction values, this raises the question of whether these consumers are making repeat purchases or opting for Private Label only in specific cases.
Lastly, the impact of discounting strategies was critical in understanding consumer price sensitivity. The density plot showed that Private Label coupon discounts cluster at a few discount levels, and the summary table showed that they are rare (271 discounted transactions versus 18,649 for National Brands). National Brands, by contrast, discount far more often. These frequent National Brand discounts may be neutralizing Private Label’s primary advantage: price. If consumers see comparable savings on National Brands, the incentive to switch to Private Label diminishes, particularly if brand trust and perceived quality remain differentiating factors.
Together, these insights tell a compelling story about the nuanced challenges Private Label faces. It has carved out a steady share of the market, but growth opportunities remain largely in category expansion, differentiated marketing, and strategic discounting to further shift consumer preference.
The business implications of this analysis suggest that Regork should refine its Private Label strategy to drive growth without directly competing in areas where National Brands hold an unshakable advantage. Instead of trying to win across all categories, Regork should prioritize expansion in departments where Private Label already has traction, such as Grocery, Packaged Goods, and Store Supplies, while exploring strategic entry into high-margin but underpenetrated categories like Meat and Seafood, where consumers may be open to alternatives with the right positioning.
The findings on discounting behavior reveal that frequent National Brand discounts may be diluting Private Label’s price advantage. To counteract this, Regork should consider building Private Label’s discounting strategy around targeted promotions that drive trial and long-term adoption rather than broad markdowns. For example, instead of general discounts, Regork could implement loyalty-based discounts for repeat Private Label purchases or bundle Private Label items with high-demand National Brand products to increase exposure and adoption.
Income-based spending behavior suggests that Private Label loyalty does not significantly increase with rising income, meaning that price-conscious shopping exists across all income levels. This presents an opportunity to tailor marketing strategies based on consumer segments rather than just price-based messaging. For lower-income shoppers, promotions should reinforce the cost savings of Private Label, while for mid-to-upper-income consumers, messaging should focus on quality assurance, product innovation, and ethical sourcing to remove stigma around Private Label as a purely budget option.
The department-level market share analysis suggests that brand trust is a key barrier in higher-end categories like Cosmetics, Meat, and Seafood. To increase adoption in these areas, Regork should invest in elevating the perceived quality of Private Label through better branding, transparent sourcing, and enhanced packaging. Additionally, exclusive product lines that differentiate Private Label from National Brands—rather than simply imitating them—could drive interest and make Private Label a more attractive alternative.
Regork should double down on its strongest categories, where Private Label is already winning, while strategically expanding into high-margin but underpenetrated areas with the right product positioning. Instead of relying solely on price-driven competition, the company should refine its discounting strategy to drive repeat purchases, market Private Label as a trusted alternative rather than a budget option, and create differentiation through innovation and branding. These efforts will help Regork grow its Private Label market share in a sustainable way rather than by competing on price alone.
While this analysis provides actionable insights, there are limitations. First, it focuses on transactional data without incorporating qualitative factors such as brand perception or consumer sentiment. Second, product availability was not considered, meaning National Brands may dominate certain categories due to stocking constraints rather than preference. Lastly, further analysis on repeat purchase behavior and long-term customer retention would strengthen insights into Private Label loyalty. Future research could integrate loyalty program data, competitor benchmarking, and customer feedback to refine recommendations further.