National vs Private Label Comparison

Midterm Project

1. Introduction

Problem Statement and Summary:

Our team has been tasked with exploring the differences between private label brands (Regork), and national labels. This will provide valuable insight for Regork by breaking down the private and national labels to see where the majority of sales is coming from. Regork will be able to take our analysis and findings and properly distribute resources and analyze the performance of each label type. Regork will also be able to see its product sales compared to the sales of national label products.

2. Packages Required

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(lubridate)

We call the completejourney dataset which includes the data that we are going to work with including transactions, demographics, products, etc

We have used the tidyverse package which includes a set of tools for quick data analysis like dplyer or tibble

We also use lubridate which helps with formatting and converting time format data

3. Data Prep

transactions <- get_transactions()
products 
coupon_redemptions
campaigns
transactions
demographics

Calling the necessary data sets that will be used

data <- transactions %>%
  inner_join(products) %>%
  inner_join(demographics)

## Joining with `by = join_by(product_id)`
## Joining with `by = join_by(household_id)`

Joining and filtering data

4. Data Analysis

Sales by Department

sales_summary <- transactions %>%
  inner_join(products, by = "product_id") %>%
  group_by(department, brand) %>%
  summarise(Total_Sales = sum(sales_value) / 1000, .groups = 'drop') %>%
  arrange(desc(Total_Sales))
# Create a plot for both National and Private brands
p <- ggplot(sales_summary %>% filter(brand %in% c("National", "Private")), 
            aes(x = department, y = Total_Sales, fill = brand)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar_format())+
  labs(title = "Total Sales by Department for National and Private Brands",
       x = "Department",
       y = "Total Sales (In Thousands of $)") +
  theme_minimal() +
    coord_flip()
print(p)

Summary/Explanation: This graph shows the total sales of both national and private brands broken into each department. We can interpret that for majority of departments, the national label produces more sales than the private label. The top 3 departments for overall sales are, grocery, drugs gm, and produce.

Number of Transactions by Brand Type

data %>%
  group_by(brand) %>%
  summarise(transaction_count = n()/1000) %>%
  ggplot(aes(x = brand, y = transaction_count, fill = brand)) +
    geom_bar(stat = "identity") +
    labs(title = "Number of transactions by Brand type Comparison (in thousands)",
         x = "Brand Type",
         y = "Number of Transactions") +
    scale_fill_manual(values = c("National" = "lightblue", "Private" = "orange")) +
    theme_minimal()

Summary/Explanation: This graph compares the number of transactions of the national and private label products. The national label has significantly more transactions than the private label. This is likely because there are more national products than private products.

Percentage of Transactions by Coupon Usage

data %>%
  filter(brand == "Private") %>%
  mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
  group_by(department, coupon_used) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department) %>%
  mutate(total_department_transactions = sum(total_transactions),
         percentage_coupon_used = (total_transactions / total_department_transactions) * 100) %>%
  filter(coupon_used == "yes") %>%
  ungroup() %>%
  ggplot(aes(x = department, y = percentage_coupon_used)) +
    geom_bar(stat = "identity", fill = "lightblue") +
    geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
              vjust = -0.5,
              color =  "black",
              size = 3.5) +
    labs(title = "Percentage of Transactions using coupons by Department (Private)",
         x = "Department",
         y = "Percentage of Transactions") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.text.y = element_blank())

## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

Summary/Explanation: Using the data provided, we calculated the percentage of transactions by department that utilized discount coupons and then separated them into 2 graphs by Brands.

Private brands rarely offer coupons with the percentage in the range of 0.1 ~ 0.2%.

data %>%
  filter(brand == "National") %>%
  mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
  group_by(department, coupon_used) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department) %>%
  mutate(total_department_transactions = sum(total_transactions),
         percentage_coupon_used = (total_transactions / total_department_transactions)*100) %>%
  filter(coupon_used == "yes") %>%
  ungroup() %>%
  ggplot(aes(x = department, y = percentage_coupon_used)) +
    geom_bar(stat = "identity", fill = "lightblue") +
    geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
              vjust = -0.5,
              color =  "black",
              size = 3.5) +
    labs(title = "Percentage of Transactions using coupons by Department (National)",
         x = "Department",
         y = "Percentage of Transactions") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.text.y = element_blank())

## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

Summary/Explanation: On the other hand, National Brands seems to have more discounted sales with Photo & Video Department having 20% of its sales attracted by coupons followed up by Packaged Food at 4.6%. Additionally, the percentages seem to fall between 0.1 ~ 4.6%

Retention Rate

retention_national <- data %>%
  filter(brand == "National") %>%
  mutate(transaction_date = as.Date(transaction_timestamp)) %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%
  group_by(department, household_id, year_month) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department, household_id) %>%
  arrange(department, household_id, year_month) %>%
  mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
  ungroup() %>%
  mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
  ungroup() %>%
  group_by(department, year_month) %>%
  summarise(total_customers = n(),
            retained_customers = sum(retained, na.rm = TRUE),
            retention_rate = retained_customers / total_customers * 100) %>%
  ungroup() %>%
  group_by(department) %>%
  summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
  mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
  arrange(desc(average_retention_rate))

## `summarise()` has grouped output by 'department', 'household_id'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

knitr::kable(retention_national, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "National Brand")

National Brand
Department	Retention Rate
GROCERY	91.76%
DRUG GM	90.98%
PRODUCE	90.56%
MEAT-PCKGD	88.76%
MEAT	88.62%
DELI	84.67%
PASTRY	79.82%
NUTRITION	78.65%
SALAD BAR	70.91%
PROD-WHS SALES	66.67%
SEAFOOD	66%
COSMETICS	65.08%
SEAFOOD-PCKGD	61.15%
FLORAL	57.32%
RESTAURANT	47.62%
SPIRITS	46.28%
MISCELLANEOUS	41.57%
GARDEN CENTER	40.55%
CHEF SHOPPE	36.96%
COUPON	34.06%
GM MERCH EXP	31.25%
TRAVEL & LEISURE	30.58%
FROZEN GROCERY	27.2%
CNTRL/STORE SUP	0%
PHOTO & VIDEO	0%
POSTAL CENTER	0%

Summary/Explanation: Assuming that a customer is considered “retained” if they purchase a product of either National of Private brand for two consecutive months, we calculated the monthly retention rate for both Brand categories for the year of 2017 and then take the average.

For Table 1 (National): Grocery, Meat, Seafood and Produce enjoy a relatively high retention rate of +60% while miscellaneous items like agriculture or equipment have a lower rate

retention_private <- data %>%
  filter(brand == "Private") %>%
  mutate(transaction_date = as.Date(transaction_timestamp)) %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%
  group_by(department, household_id, year_month) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department, household_id) %>%
  arrange(department, household_id, year_month) %>%
  mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
  ungroup() %>%
  mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
  ungroup() %>%
  group_by(department, year_month) %>%
  summarise(total_customers = n(),
            retained_customers = sum(retained, na.rm = TRUE),
            retention_rate = retained_customers / total_customers * 100) %>%
  ungroup() %>%
  group_by(department) %>%
  summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
  mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
  arrange(desc(average_retention_rate))

## `summarise()` has grouped output by 'department', 'household_id'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

knitr::kable(retention_private, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "Private Brand")

Private Brand
Department	Retention Rate
GROCERY	91.57%
PRODUCE	84.98%
FUEL	83.62%
DRUG GM	82.33%
MEAT-PCKGD	80.88%
PASTRY	77.2%
DELI	73.63%
SEAFOOD-PCKGD	68.03%
MEAT	67.94%
MISCELLANEOUS	60.81%
NUTRITION	58.35%
FLORAL	38.2%
COSMETICS	35.5%
AUTOMOTIVE	2%
SEAFOOD	19.88%
SALAD BAR	0%

Summary/Explanation: For Table 2 (Private): Grocery, Meat, Produce and Fuel have higher rates than other items. Interestingly, the Seafood department has a retention rate of only ~20% compared to National brands

Sales by Income Range

data %>%
  mutate(transaction_date = as.Date(transaction_timestamp),
         year_month = format(transaction_date, "%m")) %>%
  group_by(year_month, brand, income) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  ggplot(aes(x = year_month, y = total_sales, color = brand, group = brand)) +
  geom_line(size = 1) +
  facet_wrap(~ income) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Sales: National vs Private by Income Range",
       x = "Month",
       y = "Sales",
       color = "Brand") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7, hjust = 1),
        legend.position = "top")

## `summarise()` has grouped output by 'year_month', 'brand'. You can override
## using the `.groups` argument.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Summary/Explanation: Using the data provided, we calculated the amount of national and private label sales in dollars for specific income ranges. We found that the incomes ranging from 35k-99k shop the most at Regork and the income range with the biggest difference between national and private label purchases is 50k-74k.

Shopping Trend

basket_brand <- data %>%
  group_by(basket_id) %>%
  summarise(distinct_brand = n_distinct(brand),
            first_brand = first(brand)) %>%
  ungroup()

only_nationals <- basket_brand %>%
  filter(distinct_brand == 1 & first_brand == "National")
only_privates <- basket_brand %>%
  filter(distinct_brand == 1 & first_brand == "Private")
both <- basket_brand %>%
  filter(distinct_brand == 2)
total_baskets <- basket_brand %>%
  summarise(total_basket_count = n())

percentage_national_baskets <- (nrow(only_nationals) / total_baskets$total_basket_count) * 100
percentage_private_baskets <- (nrow(only_privates) / total_baskets$total_basket_count) * 100
percentage_both <- (nrow(both) / total_baskets$total_basket_count) * 100

result1 <- tibble(
  Type = c("Only National", "Only Private", "Both"),
  Count = c(nrow(only_nationals), nrow(only_privates), nrow(both)),
  Percentage = c(percentage_national_baskets, percentage_private_baskets, percentage_both)
)
result2 <- result1 %>%
  mutate(Percentage = paste0(round(Percentage, 2), "%"))
knitr::kable(result2, col.names = c("Basket Type", "Count", "Percentage"), align = "lcr")

Basket Type	Count	Percentage
Only National	22300	27.29%
Only Private	12222	14.96%
Both	47187	57.75%

Summary/Explanation: With the data provided, we calculated the percentage of customers that purchase only national, only private, or a mixture of both during their Regork visits. We found that the majority of customers have a mixture of both national and private label products in their baskets.

5. Summary

Problem Statement:

The problem statement that we addressed was to explore the relationship between National and Private labels and propose some strategies to boost Private Label Sales.

Addressing the Problem:

We used the complete journey data set to analyze the difference in sales between the national and private label products. We also compared the percentage of overall transactions that used a coupon. We also found the percentage of customers that are brand loyal by finding the monthly retention rate of 2017 for each department type. The third analysis that we did was calculating the national vs. private label sales for specific income ranges. The calculation we did was finding the percentage of customers that purchased only national, only private, or a mixture of both, to analyze their shopping trends.

Interesting Insights

Most departments generate higher sales for national brands compared to private brands.

National brands have significantly more transactions compared to private brands.

The national brands have a higher percentage of coupon usage compared to private brands.

For seafood, the national label has a high retention rate of 61.15% and the private label has a low retention rate of 19.88%.

The income range of 35k-99k is the majority of sales at Regork.

The income range of 200k-249k rarely purchases from Regork.

Majority of customers purchase both national and private products when they shop at Regork.

Implications:

Since little to no sales were made using coupons for Private labels. Regork could launch a few promotions using coupons to boost sales especially in departments with lower retention rate like Seafood or Meat. By utilizing a coupon or discount promotion, customers will be more likely to buy private labels, and ultimately increase retention on specific departments.

Limitations of analysis:

We mainly analyzed the overall trend of National vs Private Labels without deep diving in specific products which might skew the results of some underrepresented items that are not bought frequently. Secondly, the dataset was only for the year 2017 which limited the abilitiy to forecast or realize trends on a time series.

A lot of assumptions were made, especially the definition of retention of shopping trends which might undermine other factors.

One way to improve would be to look at specific products for bundle analysis (i.e. a customer might buy national brand cereal and private label milk) and create pricing or product placement strategies. Another important factor is also time. We can assume the cycle of sales in 1 year and make that the average but having more data means we can also account for the external factors that affect sales in different years.