Midterm Project

1. Introduction

Problem Statement and Summary:

Our team has been tasked with exploring the differences between private label brands (Regork), and national labels. This will provide valuable insight for Regork by breaking down the private and national labels to see where the majority of sales is coming from. Regork will be able to take our analysis and findings and properly distribute resources and analyze the performance of each label type. Regork will also be able to see its product sales compared to the sales of national label products.

2. Packages Required

library(tidyverse)
library(completejourney)
library(lubridate)

We call the completejourney dataset which includes the data that we are going to work with including transactions, demographics, products, etc

We have used the tidyverse package which includes a set of tools for quick data analysis like dplyer or tibble

We also use lubridate which helps with formatting and converting time format data

3. Data Prep

transactions <- get_transactions()
products 
coupon_redemptions
campaigns
transactions
demographics

Calling the necessary data sets that will be used

data <- transactions %>%
  inner_join(products) %>%
  inner_join(demographics)

Joining and filtering data

4. Data Analysis

Sales by Department

sales_summary <- transactions %>%
  inner_join(products, by = "product_id") %>%
  group_by(department, brand) %>%
  summarise(Total_Sales = sum(sales_value) / 1000, .groups = 'drop') %>%
  arrange(desc(Total_Sales))
# Create a plot for both National and Private brands
p <- ggplot(sales_summary %>% filter(brand %in% c("National", "Private")), 
            aes(x = department, y = Total_Sales, fill = brand)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar_format())+
  labs(title = "Total Sales by Department for National and Private Brands",
       x = "Department",
       y = "Total Sales (In Thousands of $)") +
  theme_minimal() +
    coord_flip()
print(p)

Summary/Explanation: This graph shows the total sales of both national and private brands broken into each department. We can interpret that for majority of departments, the national label produces more sales than the private label. The top 3 departments for overall sales are, grocery, drugs gm, and produce.

Number of Transactions by Brand Type

data %>%
  group_by(brand) %>%
  summarise(transaction_count = n()/1000) %>%
  ggplot(aes(x = brand, y = transaction_count, fill = brand)) +
    geom_bar(stat = "identity") +
    labs(title = "Number of transactions by Brand type Comparison (in thousands)",
         x = "Brand Type",
         y = "Number of Transactions") +
    scale_fill_manual(values = c("National" = "lightblue", "Private" = "orange")) +
    theme_minimal()

Summary/Explanation: This graph compares the number of transactions of the national and private label products. The national label has significantly more transactions than the private label. This is likely because there are more national products than private products.

Percentage of Transactions by Coupon Usage

data %>%
  filter(brand == "Private") %>%
  mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
  group_by(department, coupon_used) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department) %>%
  mutate(total_department_transactions = sum(total_transactions),
         percentage_coupon_used = (total_transactions / total_department_transactions) * 100) %>%
  filter(coupon_used == "yes") %>%
  ungroup() %>%
  ggplot(aes(x = department, y = percentage_coupon_used)) +
    geom_bar(stat = "identity", fill = "lightblue") +
    geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
              vjust = -0.5,
              color =  "black",
              size = 3.5) +
    labs(title = "Percentage of Transactions using coupons by Department (Private)",
         x = "Department",
         y = "Percentage of Transactions") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.text.y = element_blank())

Summary/Explanation: Using the data provided, we calculated the percentage of transactions by department that utilized discount coupons and then separated them into 2 graphs by Brands.

Private brands rarely offer coupons with the percentage in the range of 0.1 ~ 0.2%.

data %>%
  filter(brand == "National") %>%
  mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
  group_by(department, coupon_used) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department) %>%
  mutate(total_department_transactions = sum(total_transactions),
         percentage_coupon_used = (total_transactions / total_department_transactions)*100) %>%
  filter(coupon_used == "yes") %>%
  ungroup() %>%
  ggplot(aes(x = department, y = percentage_coupon_used)) +
    geom_bar(stat = "identity", fill = "lightblue") +
    geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
              vjust = -0.5,
              color =  "black",
              size = 3.5) +
    labs(title = "Percentage of Transactions using coupons by Department (National)",
         x = "Department",
         y = "Percentage of Transactions") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.text.y = element_blank())

Summary/Explanation: On the other hand, National Brands seems to have more discounted sales with Photo & Video Department having 20% of its sales attracted by coupons followed up by Packaged Food at 4.6%. Additionally, the percentages seem to fall between 0.1 ~ 4.6%

Retention Rate

retention_national <- data %>%
  filter(brand == "National") %>%
  mutate(transaction_date = as.Date(transaction_timestamp)) %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%
  group_by(department, household_id, year_month) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department, household_id) %>%
  arrange(department, household_id, year_month) %>%
  mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
  ungroup() %>%
  mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
  ungroup() %>%
  group_by(department, year_month) %>%
  summarise(total_customers = n(),
            retained_customers = sum(retained, na.rm = TRUE),
            retention_rate = retained_customers / total_customers * 100) %>%
  ungroup() %>%
  group_by(department) %>%
  summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
  mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
  arrange(desc(average_retention_rate))

knitr::kable(retention_national, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "National Brand")
National Brand
Department Retention Rate
GROCERY 91.76%
DRUG GM 90.98%
PRODUCE 90.56%
MEAT-PCKGD 88.76%
MEAT 88.62%
DELI 84.67%
PASTRY 79.82%
NUTRITION 78.65%
SALAD BAR 70.91%
PROD-WHS SALES 66.67%
SEAFOOD 66%
COSMETICS 65.08%
SEAFOOD-PCKGD 61.15%
FLORAL 57.32%
RESTAURANT 47.62%
SPIRITS 46.28%
MISCELLANEOUS 41.57%
GARDEN CENTER 40.55%
CHEF SHOPPE 36.96%
COUPON 34.06%
GM MERCH EXP 31.25%
TRAVEL & LEISURE 30.58%
FROZEN GROCERY 27.2%
CNTRL/STORE SUP 0%
PHOTO & VIDEO 0%
POSTAL CENTER 0%

Summary/Explanation: Assuming that a customer is considered “retained” if they purchase a product of either National of Private brand for two consecutive months, we calculated the monthly retention rate for both Brand categories for the year of 2017 and then take the average.

For Table 1 (National): Grocery, Meat, Seafood and Produce enjoy a relatively high retention rate of +60% while miscellaneous items like agriculture or equipment have a lower rate

retention_private <- data %>%
  filter(brand == "Private") %>%
  mutate(transaction_date = as.Date(transaction_timestamp)) %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%
  group_by(department, household_id, year_month) %>%
  summarise(total_transactions = n()) %>%
  ungroup() %>%
  group_by(department, household_id) %>%
  arrange(department, household_id, year_month) %>%
  mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
  ungroup() %>%
  mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
  ungroup() %>%
  group_by(department, year_month) %>%
  summarise(total_customers = n(),
            retained_customers = sum(retained, na.rm = TRUE),
            retention_rate = retained_customers / total_customers * 100) %>%
  ungroup() %>%
  group_by(department) %>%
  summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
  mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
  arrange(desc(average_retention_rate))

knitr::kable(retention_private, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "Private Brand")
Private Brand
Department Retention Rate
GROCERY 91.57%
PRODUCE 84.98%
FUEL 83.62%
DRUG GM 82.33%
MEAT-PCKGD 80.88%
PASTRY 77.2%
DELI 73.63%
SEAFOOD-PCKGD 68.03%
MEAT 67.94%
MISCELLANEOUS 60.81%
NUTRITION 58.35%
FLORAL 38.2%
COSMETICS 35.5%
AUTOMOTIVE 2%
SEAFOOD 19.88%
SALAD BAR 0%

Summary/Explanation: For Table 2 (Private): Grocery, Meat, Produce and Fuel have higher rates than other items. Interestingly, the Seafood department has a retention rate of only ~20% compared to National brands

Sales by Income Range

data %>%
  mutate(transaction_date = as.Date(transaction_timestamp),
         year_month = format(transaction_date, "%m")) %>%
  group_by(year_month, brand, income) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  ggplot(aes(x = year_month, y = total_sales, color = brand, group = brand)) +
  geom_line(size = 1) +
  facet_wrap(~ income) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Sales: National vs Private by Income Range",
       x = "Month",
       y = "Sales",
       color = "Brand") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7, hjust = 1),
        legend.position = "top")

Summary/Explanation: Using the data provided, we calculated the amount of national and private label sales in dollars for specific income ranges. We found that the incomes ranging from 35k-99k shop the most at Regork and the income range with the biggest difference between national and private label purchases is 50k-74k.

Shopping Trend

basket_brand <- data %>%
  group_by(basket_id) %>%
  summarise(distinct_brand = n_distinct(brand),
            first_brand = first(brand)) %>%
  ungroup()

only_nationals <- basket_brand %>%
  filter(distinct_brand == 1 & first_brand == "National")
only_privates <- basket_brand %>%
  filter(distinct_brand == 1 & first_brand == "Private")
both <- basket_brand %>%
  filter(distinct_brand == 2)
total_baskets <- basket_brand %>%
  summarise(total_basket_count = n())

percentage_national_baskets <- (nrow(only_nationals) / total_baskets$total_basket_count) * 100
percentage_private_baskets <- (nrow(only_privates) / total_baskets$total_basket_count) * 100
percentage_both <- (nrow(both) / total_baskets$total_basket_count) * 100

result1 <- tibble(
  Type = c("Only National", "Only Private", "Both"),
  Count = c(nrow(only_nationals), nrow(only_privates), nrow(both)),
  Percentage = c(percentage_national_baskets, percentage_private_baskets, percentage_both)
)
result2 <- result1 %>%
  mutate(Percentage = paste0(round(Percentage, 2), "%"))
knitr::kable(result2, col.names = c("Basket Type", "Count", "Percentage"), align = "lcr")
Basket Type Count Percentage
Only National 22300 27.29%
Only Private 12222 14.96%
Both 47187 57.75%

Summary/Explanation: With the data provided, we calculated the percentage of customers that purchase only national, only private, or a mixture of both during their Regork visits. We found that the majority of customers have a mixture of both national and private label products in their baskets.

5. Summary

Problem Statement:

The problem statement that we addressed was to explore the relationship between National and Private labels and propose some strategies to boost Private Label Sales.

Addressing the Problem:

We used the complete journey data set to analyze the difference in sales between the national and private label products. We also compared the percentage of overall transactions that used a coupon. We also found the percentage of customers that are brand loyal by finding the monthly retention rate of 2017 for each department type. The third analysis that we did was calculating the national vs. private label sales for specific income ranges. The calculation we did was finding the percentage of customers that purchased only national, only private, or a mixture of both, to analyze their shopping trends.

Interesting Insights

Most departments generate higher sales for national brands compared to private brands.

National brands have significantly more transactions compared to private brands.

The national brands have a higher percentage of coupon usage compared to private brands.

For seafood, the national label has a high retention rate of 61.15% and the private label has a low retention rate of 19.88%.

The income range of 35k-99k is the majority of sales at Regork.

The income range of 200k-249k rarely purchases from Regork.

Majority of customers purchase both national and private products when they shop at Regork.

Implications:

Since little to no sales were made using coupons for Private labels. Regork could launch a few promotions using coupons to boost sales especially in departments with lower retention rate like Seafood or Meat. By utilizing a coupon or discount promotion, customers will be more likely to buy private labels, and ultimately increase retention on specific departments.

Limitations of analysis:

We mainly analyzed the overall trend of National vs Private Labels without deep diving in specific products which might skew the results of some underrepresented items that are not bought frequently. Secondly, the dataset was only for the year 2017 which limited the abilitiy to forecast or realize trends on a time series.

A lot of assumptions were made, especially the definition of retention of shopping trends which might undermine other factors.

One way to improve would be to look at specific products for bundle analysis (i.e. a customer might buy national brand cereal and private label milk) and create pricing or product placement strategies. Another important factor is also time. We can assume the cycle of sales in 1 year and make that the average but having more data means we can also account for the external factors that affect sales in different years.