Problem Statement and Summary:
Our team has been tasked with exploring the differences between private label brands (Regork), and national labels. This will provide valuable insight for Regork by breaking down the private and national labels to see where the majority of sales is coming from. Regork will be able to take our analysis and findings and properly distribute resources and analyze the performance of each label type. Regork will also be able to see its product sales compared to the sales of national label products.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(lubridate)
We call the completejourney dataset which includes the data that we are going to work with including transactions, demographics, products, etc
We have used the tidyverse package which includes a set of tools for quick data analysis like dplyer or tibble
We also use lubridate which helps with formatting and converting time format data
transactions <- get_transactions()
products
coupon_redemptions
campaigns
transactions
demographics
Calling the necessary data sets that will be used
data <- transactions %>%
inner_join(products) %>%
inner_join(demographics)
## Joining with `by = join_by(product_id)`
## Joining with `by = join_by(household_id)`
Joining and filtering data
sales_summary <- transactions %>%
inner_join(products, by = "product_id") %>%
group_by(department, brand) %>%
summarise(Total_Sales = sum(sales_value) / 1000, .groups = 'drop') %>%
arrange(desc(Total_Sales))
# Create a plot for both National and Private brands
p <- ggplot(sales_summary %>% filter(brand %in% c("National", "Private")),
aes(x = department, y = Total_Sales, fill = brand)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::dollar_format())+
labs(title = "Total Sales by Department for National and Private Brands",
x = "Department",
y = "Total Sales (In Thousands of $)") +
theme_minimal() +
coord_flip()
print(p)
Summary/Explanation: This graph shows the total sales of both national and private brands broken into each department. We can interpret that for majority of departments, the national label produces more sales than the private label. The top 3 departments for overall sales are, grocery, drugs gm, and produce.
data %>%
group_by(brand) %>%
summarise(transaction_count = n()/1000) %>%
ggplot(aes(x = brand, y = transaction_count, fill = brand)) +
geom_bar(stat = "identity") +
labs(title = "Number of transactions by Brand type Comparison (in thousands)",
x = "Brand Type",
y = "Number of Transactions") +
scale_fill_manual(values = c("National" = "lightblue", "Private" = "orange")) +
theme_minimal()
Summary/Explanation: This graph compares the number of transactions of the national and private label products. The national label has significantly more transactions than the private label. This is likely because there are more national products than private products.
data %>%
filter(brand == "Private") %>%
mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
group_by(department, coupon_used) %>%
summarise(total_transactions = n()) %>%
ungroup() %>%
group_by(department) %>%
mutate(total_department_transactions = sum(total_transactions),
percentage_coupon_used = (total_transactions / total_department_transactions) * 100) %>%
filter(coupon_used == "yes") %>%
ungroup() %>%
ggplot(aes(x = department, y = percentage_coupon_used)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
vjust = -0.5,
color = "black",
size = 3.5) +
labs(title = "Percentage of Transactions using coupons by Department (Private)",
x = "Department",
y = "Percentage of Transactions") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
Summary/Explanation: Using the data provided, we calculated the percentage of transactions by department that utilized discount coupons and then separated them into 2 graphs by Brands.
Private brands rarely offer coupons with the percentage in the range of 0.1 ~ 0.2%.
data %>%
filter(brand == "National") %>%
mutate(coupon_used = if_else(coupon_disc > 0, "yes", "no")) %>%
group_by(department, coupon_used) %>%
summarise(total_transactions = n()) %>%
ungroup() %>%
group_by(department) %>%
mutate(total_department_transactions = sum(total_transactions),
percentage_coupon_used = (total_transactions / total_department_transactions)*100) %>%
filter(coupon_used == "yes") %>%
ungroup() %>%
ggplot(aes(x = department, y = percentage_coupon_used)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_text(aes(label = paste0(round(percentage_coupon_used, 1), "%")),
vjust = -0.5,
color = "black",
size = 3.5) +
labs(title = "Percentage of Transactions using coupons by Department (National)",
x = "Department",
y = "Percentage of Transactions") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_blank())
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
Summary/Explanation: On the other hand, National Brands seems to have more discounted sales with Photo & Video Department having 20% of its sales attracted by coupons followed up by Packaged Food at 4.6%. Additionally, the percentages seem to fall between 0.1 ~ 4.6%
retention_national <- data %>%
filter(brand == "National") %>%
mutate(transaction_date = as.Date(transaction_timestamp)) %>%
mutate(year_month = format(transaction_date, "%Y-%m")) %>%
group_by(department, household_id, year_month) %>%
summarise(total_transactions = n()) %>%
ungroup() %>%
group_by(department, household_id) %>%
arrange(department, household_id, year_month) %>%
mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
ungroup() %>%
mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
ungroup() %>%
group_by(department, year_month) %>%
summarise(total_customers = n(),
retained_customers = sum(retained, na.rm = TRUE),
retention_rate = retained_customers / total_customers * 100) %>%
ungroup() %>%
group_by(department) %>%
summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
arrange(desc(average_retention_rate))
## `summarise()` has grouped output by 'department', 'household_id'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
knitr::kable(retention_national, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "National Brand")
| Department | Retention Rate |
|---|---|
| GROCERY | 91.76% |
| DRUG GM | 90.98% |
| PRODUCE | 90.56% |
| MEAT-PCKGD | 88.76% |
| MEAT | 88.62% |
| DELI | 84.67% |
| PASTRY | 79.82% |
| NUTRITION | 78.65% |
| SALAD BAR | 70.91% |
| PROD-WHS SALES | 66.67% |
| SEAFOOD | 66% |
| COSMETICS | 65.08% |
| SEAFOOD-PCKGD | 61.15% |
| FLORAL | 57.32% |
| RESTAURANT | 47.62% |
| SPIRITS | 46.28% |
| MISCELLANEOUS | 41.57% |
| GARDEN CENTER | 40.55% |
| CHEF SHOPPE | 36.96% |
| COUPON | 34.06% |
| GM MERCH EXP | 31.25% |
| TRAVEL & LEISURE | 30.58% |
| FROZEN GROCERY | 27.2% |
| CNTRL/STORE SUP | 0% |
| PHOTO & VIDEO | 0% |
| POSTAL CENTER | 0% |
Summary/Explanation: Assuming that a customer is considered “retained” if they purchase a product of either National of Private brand for two consecutive months, we calculated the monthly retention rate for both Brand categories for the year of 2017 and then take the average.
For Table 1 (National): Grocery, Meat, Seafood and Produce enjoy a relatively high retention rate of +60% while miscellaneous items like agriculture or equipment have a lower rate
retention_private <- data %>%
filter(brand == "Private") %>%
mutate(transaction_date = as.Date(transaction_timestamp)) %>%
mutate(year_month = format(transaction_date, "%Y-%m")) %>%
group_by(department, household_id, year_month) %>%
summarise(total_transactions = n()) %>%
ungroup() %>%
group_by(department, household_id) %>%
arrange(department, household_id, year_month) %>%
mutate(previous_month_transactions = lag(total_transactions, 1)) %>%
ungroup() %>%
mutate(retained = ifelse(!is.na(previous_month_transactions) & previous_month_transactions > 0, 1, 0)) %>%
ungroup() %>%
group_by(department, year_month) %>%
summarise(total_customers = n(),
retained_customers = sum(retained, na.rm = TRUE),
retention_rate = retained_customers / total_customers * 100) %>%
ungroup() %>%
group_by(department) %>%
summarise(average_retention_rate = mean(retention_rate, na.rm = TRUE)) %>%
mutate(average_retention_rate = paste0(round(average_retention_rate, 2), "%")) %>%
arrange(desc(average_retention_rate))
## `summarise()` has grouped output by 'department', 'household_id'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
knitr::kable(retention_private, col.names = c("Department", "Retention Rate"), align = "lcr", caption = "Private Brand")
| Department | Retention Rate |
|---|---|
| GROCERY | 91.57% |
| PRODUCE | 84.98% |
| FUEL | 83.62% |
| DRUG GM | 82.33% |
| MEAT-PCKGD | 80.88% |
| PASTRY | 77.2% |
| DELI | 73.63% |
| SEAFOOD-PCKGD | 68.03% |
| MEAT | 67.94% |
| MISCELLANEOUS | 60.81% |
| NUTRITION | 58.35% |
| FLORAL | 38.2% |
| COSMETICS | 35.5% |
| AUTOMOTIVE | 2% |
| SEAFOOD | 19.88% |
| SALAD BAR | 0% |
Summary/Explanation: For Table 2 (Private): Grocery, Meat, Produce and Fuel have higher rates than other items. Interestingly, the Seafood department has a retention rate of only ~20% compared to National brands
data %>%
mutate(transaction_date = as.Date(transaction_timestamp),
year_month = format(transaction_date, "%m")) %>%
group_by(year_month, brand, income) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
ggplot(aes(x = year_month, y = total_sales, color = brand, group = brand)) +
geom_line(size = 1) +
facet_wrap(~ income) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Sales: National vs Private by Income Range",
x = "Month",
y = "Sales",
color = "Brand") +
theme_minimal() +
theme(axis.text.x = element_text(size = 7, hjust = 1),
legend.position = "top")
## `summarise()` has grouped output by 'year_month', 'brand'. You can override
## using the `.groups` argument.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Summary/Explanation: Using the data provided, we calculated the amount of national and private label sales in dollars for specific income ranges. We found that the incomes ranging from 35k-99k shop the most at Regork and the income range with the biggest difference between national and private label purchases is 50k-74k.
basket_brand <- data %>%
group_by(basket_id) %>%
summarise(distinct_brand = n_distinct(brand),
first_brand = first(brand)) %>%
ungroup()
only_nationals <- basket_brand %>%
filter(distinct_brand == 1 & first_brand == "National")
only_privates <- basket_brand %>%
filter(distinct_brand == 1 & first_brand == "Private")
both <- basket_brand %>%
filter(distinct_brand == 2)
total_baskets <- basket_brand %>%
summarise(total_basket_count = n())
percentage_national_baskets <- (nrow(only_nationals) / total_baskets$total_basket_count) * 100
percentage_private_baskets <- (nrow(only_privates) / total_baskets$total_basket_count) * 100
percentage_both <- (nrow(both) / total_baskets$total_basket_count) * 100
result1 <- tibble(
Type = c("Only National", "Only Private", "Both"),
Count = c(nrow(only_nationals), nrow(only_privates), nrow(both)),
Percentage = c(percentage_national_baskets, percentage_private_baskets, percentage_both)
)
result2 <- result1 %>%
mutate(Percentage = paste0(round(Percentage, 2), "%"))
knitr::kable(result2, col.names = c("Basket Type", "Count", "Percentage"), align = "lcr")
| Basket Type | Count | Percentage |
|---|---|---|
| Only National | 22300 | 27.29% |
| Only Private | 12222 | 14.96% |
| Both | 47187 | 57.75% |
Summary/Explanation: With the data provided, we calculated the percentage of customers that purchase only national, only private, or a mixture of both during their Regork visits. We found that the majority of customers have a mixture of both national and private label products in their baskets.
Problem Statement:
The problem statement that we addressed was to explore the relationship between National and Private labels and propose some strategies to boost Private Label Sales.
Addressing the Problem:
We used the complete journey data set to analyze the difference in sales between the national and private label products. We also compared the percentage of overall transactions that used a coupon. We also found the percentage of customers that are brand loyal by finding the monthly retention rate of 2017 for each department type. The third analysis that we did was calculating the national vs. private label sales for specific income ranges. The calculation we did was finding the percentage of customers that purchased only national, only private, or a mixture of both, to analyze their shopping trends.
Interesting Insights
Most departments generate higher sales for national brands compared to private brands.
National brands have significantly more transactions compared to private brands.
The national brands have a higher percentage of coupon usage compared to private brands.
For seafood, the national label has a high retention rate of 61.15% and the private label has a low retention rate of 19.88%.
The income range of 35k-99k is the majority of sales at Regork.
The income range of 200k-249k rarely purchases from Regork.
Majority of customers purchase both national and private products when they shop at Regork.
Implications:
Since little to no sales were made using coupons for Private labels. Regork could launch a few promotions using coupons to boost sales especially in departments with lower retention rate like Seafood or Meat. By utilizing a coupon or discount promotion, customers will be more likely to buy private labels, and ultimately increase retention on specific departments.
Limitations of analysis:
We mainly analyzed the overall trend of National vs Private Labels without deep diving in specific products which might skew the results of some underrepresented items that are not bought frequently. Secondly, the dataset was only for the year 2017 which limited the abilitiy to forecast or realize trends on a time series.
A lot of assumptions were made, especially the definition of retention of shopping trends which might undermine other factors.
One way to improve would be to look at specific products for bundle analysis (i.e. a customer might buy national brand cereal and private label milk) and create pricing or product placement strategies. Another important factor is also time. We can assume the cycle of sales in 1 year and make that the average but having more data means we can also account for the external factors that affect sales in different years.