Regork’s Growth Opportunities

Introduction

Problem Statement

We are working in Regork company, which is an industry leader, is devoted to long-term growth an constant progress. To strengthen its market position and increase profitability, Regork intends to seek and capitalize on new growth prospects. This project will analyze market trends, consumer demands, and competitive landscapes to identify a key area for future investment. Regork aims to increase its product/service offerings or enter new markets by employing data-driven insights and inventive thinking, resulting in revenue growth. This project’s collaborative and strategic strategy seeks to position Regork for long-term success and sustainable expansion.

Overview

In this analysis, we:

Identify customer segments of five loyalty levels including: Loyal, Regular, Repeat, Occasional, Prospective. We want to focus non-loyal customers and product they typically purchase and current marketing campaigns to target these top products bought by newer customers.
Determine the highest quantity of item sales for prospective customers and compare current marketing strategies among these items
Find compliments to those items currently missing marketing campaigns among all transactions.
Identify when and how to host a new marketing campaign

Packages and Libraries Required

The following R packages are required in order to run this R project analysis:

library(completejourney) # data frames describing transactions from Regork in 2017

## Warning: package 'completejourney' was built under R version 4.3.3

library(ggplot2)         # visualization of data

## Warning: package 'ggplot2' was built under R version 4.3.3

library(tidyverse)       # packages for data manipulation and exploration

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

library(stringr)         # manipulation of string characters
library(dplyr)           # data manipulation tools

Data Preparation & Tidy Data

Gather Datasets

The data for this R project is accessed through completejourney. The completejourney data sets are historical data frames of data based on grocery transactions during the year 2017.

To gain access to the full datasets, the following code must be run. These are very large datasets and therefore may take time to load.

# Accessing the full transactions dataset
transactions <- get_transactions()
transactions

## # A tibble: 1,469,307 × 11
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>

# Accessing the full promotions dataset
promotions <- get_promotions()
promotions

## # A tibble: 20,940,529 × 5
##    product_id store_id display_location mailer_location  week
##    <chr>      <chr>    <fct>            <fct>           <int>
##  1 1000050    316      9                0                   1
##  2 1000050    337      3                0                   1
##  3 1000050    441      5                0                   1
##  4 1000092    292      0                A                   1
##  5 1000092    293      0                A                   1
##  6 1000092    295      0                A                   1
##  7 1000092    298      0                A                   1
##  8 1000092    299      0                A                   1
##  9 1000092    304      0                A                   1
## 10 1000092    306      0                A                   1
## # ℹ 20,940,519 more rows

# These are the other datasets used in this exploratory analysis
campaigns

## # A tibble: 6,589 × 2
##    campaign_id household_id
##    <chr>       <chr>       
##  1 1           105         
##  2 1           1238        
##  3 1           1258        
##  4 1           1483        
##  5 1           2200        
##  6 1           293         
##  7 1           529         
##  8 1           536         
##  9 1           568         
## 10 1           630         
## # ℹ 6,579 more rows

coupons

## # A tibble: 116,204 × 3
##    coupon_upc  product_id campaign_id
##    <chr>       <chr>      <chr>      
##  1 10000085207 9676830    26         
##  2 10000085207 9676943    26         
##  3 10000085207 9676944    26         
##  4 10000085207 9676947    26         
##  5 10000085207 9677008    26         
##  6 10000085207 9677052    26         
##  7 10000085207 9677385    26         
##  8 10000085207 9677479    26         
##  9 10000085207 9677791    26         
## 10 10000085207 9677878    26         
## # ℹ 116,194 more rows

products

## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>

Consolidating Dataframes

In this section we present datasets and lists that we created, combined, and organized for our further analysis in the next section.

First, we classified the customers into five levels loyalty: Loyal, Regular, Repeat, Occasional, Prospective.

loyalty_levels <- list(
  "Loyal", "Regular", "Repeat", "Occasional", "Prospective"
)

We used these levels to create two new data frames. One for all customer loyalty levels and one specifically for those prospective customers. We define the loyalty level based on five quantiles of the data set. For example, the customers with a transaction frequency in the top 20% will be considered a ‘Loyal’ customer.

loyalty_df <- transactions %>%
  group_by(household_id) %>%
  summarise(
    first_date = as.POSIXct(min(transaction_timestamp)),
    last_date = as.POSIXct(max(transaction_timestamp)),
    baskets = n_distinct(basket_id)
  ) %>%
  mutate(
    difference = as.numeric(difftime(last_date, first_date, units = "days"))
  ) %>%
  group_by(household_id) %>%
  summarise(freq = case_when(
    difference <= 1 ~ 365,
    TRUE ~ mean(difference / (baskets), na.rm = TRUE)
  )) %>%
  arrange((freq)) %>%
  mutate(loyalty = factor(ntile(freq, 5), labels = loyalty_levels))

prospective_cust <- loyalty_df %>%
  filter(loyalty == "Prospective")

Exploratory Data Analysis

Our Exploratory Data Analysis section provides an outline for the data manipulation and visualization performed to address the problem statement.

Prospective Customers’ Frequent Purchase History

To find the ten most common purchases for the ‘prospective’ customers in 2017, we used the transactions dataframe alongside the filtered loyalty dataframe and products. Using the quantity for each product type, we found the most common items.

top_ten <- transactions %>%
  right_join(prospective_cust, by = "household_id") %>%
  inner_join(products, by = "product_id") %>%
  group_by(product_type) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  slice_head(n = 10)

top_ten

## # A tibble: 10 × 2
##    product_type                   total
##    <chr>                          <int>
##  1 FLUID MILK WHITE ONLY           1103
##  2 SOFT DRINKS 12/18&15PK CAN CAR   712
##  3 YOGURT NOT MULTI-PACKS           673
##  4 SFT DRNK 2 LITER BTL CARB INCL   628
##  5 SHREDDED CHEESE                  571
##  6 BANANAS                          473
##  7 MAINSTREAM WHITE BREAD           438
##  8 POTATO CHIPS                     437
##  9 CANDY BARS (SINGLES)(INCLUDING   403
## 10 SS ECONOMY ENTREES/DINNERS ALL   364

From that data, we segmented those products with current marketing strategies and those without.

coupon_count <- coupons %>%
  inner_join(products, by = "product_id") %>%
  right_join(top_ten, by = "product_type") %>%
  group_by(product_type) %>%
  summarise(coupon = n_distinct(campaign_id, na.rm = TRUE))

coupon_check <- function(coupon) {
  ifelse(coupon > 0, "Yes", "No")
}

top_ten %>%
  inner_join(coupon_count, by = "product_type") %>%
  mutate(coupon_status = coupon_check(coupon)) %>%
  ggplot(aes(x = product_type, y = total, fill = coupon_status)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Product Quantity for Top Items", x = "Product Type", y = "Total Product Quantity") +
  coord_flip()

The bar graph shows top ten products are bought by ‘prospective’ customers. The ‘coupon_status’ shows that only one product was left without a marketing campaign, but was still high on the ‘prospective’ customers purchase list. This lead us to believe that there is an opportunity to exploit this want of candy bars.

Candy Bar Compliments

We figured out the patterns in the purchases made by all customers of the less targeted candy bar.

# collecting all transactions involving "CANDY BARS (SINGLES)"
candy <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(
    str_detect(product_type, regex("CANDY BARS", ignore_case = T)),
    str_detect(product_type, regex("(SINGLES)", ignore_case = T)),
  ) %>%
  group_by(basket_id) %>%
  summarise(total = n_distinct(basket_id))

#compile the top 10 complimentary products for candy bars
compliments <- transactions %>%
  right_join(candy, by = "basket_id") %>%
  inner_join(products, by = "product_id") %>%
  filter(
    !str_detect(product_type, regex("CANDY BARS", ignore_case = T)),
    !str_detect(product_type, regex("(SINGLES)", ignore_case = T))
  ) %>%
  group_by(product_type) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  slice_head(n = 10)

compliments

## # A tibble: 10 × 2
##    product_type                   total
##    <chr>                          <int>
##  1 FLUID MILK WHITE ONLY           2268
##  2 SOFT DRINKS 12/18&15PK CAN CAR  1663
##  3 SFT DRNK 2 LITER BTL CARB INCL  1581
##  4 MAINSTREAM WHITE BREAD          1175
##  5 SFT DRNK SNGL SRV BTL CARB (EX  1166
##  6 YOGURT NOT MULTI-PACKS          1148
##  7 SHREDDED CHEESE                 1089
##  8 BANANAS                         1024
##  9 POTATO CHIPS                     981
## 10 SNACK CAKE - MULTI PACK          949

# check the number of coupons for each product among the top 10 complimentary list
coupon_compliments <- coupons %>%
  inner_join(products, by = "product_id") %>%
  right_join(compliments, by = "product_type") %>%
  group_by(product_type) %>%
  summarise(coupon = n_distinct(campaign_id, na.rm = TRUE))

# validate the 'coupon_status' for each product
compliments %>%
  inner_join(coupon_compliments, by = "product_type") %>%
  mutate(coupon_status = coupon_check(coupon)) %>%
  ggplot(aes(x = product_type, y = total, fill = coupon_status)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Product Quantity for Top Compliment Items", x = "Product Type", y = "Total Product Quantity") +
  coord_flip()

The bar graph shows top ten complimentary products bought by all customers. The ‘coupon_status’ shows that two products are left without a marketing campaign. This leads us to believe that these should become the focus of your new marketing campaign.

Recommendation for New Marketing Strategies

In order to capitalize on your ‘prospective’ customer-base, we believe your new marketing campaign should focus on the two complimentary products: Snack Cake - Multi-Pack and Soft Drink Singles. These two products are commonly bought simultaneously with Candy Bars, already determined to be a common product for ‘prospective’ customers. Combining these two pieces, a marketing strategy focused on the complimentary products will hopefully attract the ‘prospective’ customers to the deals they may receive with their loyalty to Regork.

The final piece is to figure out when to host a marketing campaign.

To do this, we summarized the daily sales of candy bars.

daily_candy <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(
    str_detect(product_type, regex("CANDY BARS", ignore_case = TRUE)),
    str_detect(product_type, regex("(SINGLES)", ignore_case = TRUE))
  ) %>%
  mutate(day = yday(transaction_timestamp)) %>%
  group_by(day) %>%
  summarise(total = sum(quantity, na.rm = TRUE),
            sales_value = sum(sales_value, na.rm = TRUE))

daily_candy

## # A tibble: 364 × 3
##      day total sales_value
##    <dbl> <dbl>       <dbl>
##  1     1    49        22.6
##  2     2    36        16.5
##  3     3    97        39.8
##  4     4    47        21.2
##  5     5    51        23.8
##  6     6    33        12.4
##  7     7   141        54.3
##  8     8    66        27.8
##  9     9    63        25.6
## 10    10    36        16.0
## # ℹ 354 more rows

With this information gathered, we ran a simulation for each possible monthly combination of sales. We wanted to find the highest sales revenue in a period of 30 days to find what time of year the most candy bars are being bought. We wanted the highest time period since this is most likely when these “prospective customers” are in the store buying candy bars.

# Create an empty data frame to store monthly results
monthly_results_list <- list()

# Iterate through each day and calculate the monthly total for the highest month
for (start_day in 1:(365-29)) {
  month_campaign <- daily_candy %>%
    filter(day %in% (start_day:(start_day + 29)))
  
  # Calculate the monthly total
  total <- sum(month_campaign$sales_value, na.rm = TRUE)
  
  # Store the results in the data frame
  monthly_results_list[[start_day]] <- data.frame(
    month_start = as.numeric(start_day),
    month_end = start_day + 29,
    month_total = total)
  
}

# combine the list into a data frame
monthly_results <- do.call(rbind, monthly_results_list)

# Find the top month
top_month <- monthly_results %>%
  arrange(desc(month_total)) %>%
  slice(1)

print(top_month)

##   month_start month_end month_total
## 1          70        99     1021.62

This tells us that the greatest number of transactions took place between 03/12 and 04/10.

To visualize the daily sales with the suggested timeline for marketing these products, we created a line graph.

# determines the days within the specified timeline
daily_candy <- daily_candy %>%
  mutate(in_top_month = between(day, top_month$month_start, top_month$month_end))

# creating the plot
daily_candy_plot <- daily_candy %>%
  ggplot(aes(day, total)) +
  geom_line() +
  
  # add a shaded region for the top month
  geom_rect(data = daily_candy %>% filter(in_top_month),
            aes(xmin = min(day), xmax = max(day), ymin = -Inf, ymax = 250),
            fill = "lightblue", alpha = 0.03) + 
  
  # labels the start and end dates for the campaign
  geom_text(data = top_month,
            aes(x = (month_start + month_end) / 2, y = -Inf, label = paste("Start:", format(as.Date(month_start, format="%j"), "%m/%d"), "\nEnd:", format(as.Date(month_end, format="%j"), "%m/%d"))),
            vjust = -1, hjust = 0.5, color = "red", size = 3) +
  labs(title = "Daily Candy Sales",
       x = "Day",
       y = "Total Sales") + 
  theme_minimal() +
  coord_cartesian(ylim = c(0, 250))

daily_candy_plot

We aligned when we thought a fresh marketing campaign should concentrate on Snack Cake - Multi-Pack and Soft Drink Singles. By comparing the sales of the non-marketed potential client product, Candy Bars, with the non-marketed complementing products, this new endeavor should take place around 03/12 and 04/10 during any year.

Sales Revenue Contribution by Customers Loyalty

It is also important to find if this campaign is worth the trouble of starting before beginning efforts to focus on ‘prospective’ customers.

To find how much impact this campaign would have on total sales revenue, we computed the percentage of total sales revenue that each loyalty group contributes and then created a bar plot where each bar represents the total sales revenue for a loyalty level, with the bars filled based on loyalty level.

transactions %>%
  inner_join(loyalty_df) %>%
  group_by(loyalty) %>%
  summarise(sales_revenue = sum(sales_value)) %>%
  mutate(freq = paste(round(sales_revenue / sum(sales_revenue) * 100, 2), "%" )) %>%
  ggplot(aes(x = loyalty, y = sales_revenue, fill = loyalty)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = freq)) +
  labs(
    title = "Sales Revenue by Loyalty Level",
    x = "Loyalty"
  ) +
  scale_y_continuous(name = "Total Sales Revenue", labels = scales::dollar)

This reveals what we already knew before starting: that ‘loyal’ customers generate the highest percentage of yearly sales revenue, followed by ‘regular’ customers and ‘repeat’ customers, all the way down to ‘prospective’ customers.

While it is crucial to recognize the potential when trying to capitalize on new market segments, in this case, the ‘prospective’ customers, doing so may prove challenging. Despite constituting 20% of all total customers, they contribute only 3.37% to total yearly sales revenue.

This group obviously has the most room to grow. However, given the unpredictability associated with customers who do not frequently shop with Regork, persuading them with coupons for products they may or may not purchase in a given transaction can be a sporadic process. Customers who are not currently loyal may find it challenging to work with coupons they are not accustomed to, making it difficult to encourage them to buy additional products.

It may be more beneficial to focus on customers within the ‘regular’ and ‘repeat’ segments, as they have more concrete data on their transactions. This would make finding complementary products and persuading them to buy new products easier than with ‘prospective’ customers. Additionally, these segments account for a higher share of total revenue and are more willing to engage in higher transactions with us.

Summary

(i) Problem Statement:

The problem statement addressed in this report is related to understanding the impact of focusing on new customers and the frequency of non-loyal customers on revenue generation. Specifically, the goal is to assess how much more revenue is generated as the frequency between visits decreases.

(ii) Methodology:

To address this problem statement, we segmented customers into various levels of loyalty, focusing on the frequency of ‘prospective’ customers and the products they typically purchase. We also analyzed marketing campaigns associated with those products and assessed the sales value contribution by each level of customer loyalty.

(iii) Insights from Analysis:

A product usually bought by ‘prospective’ customers and is a non-marketed product is single-wrapped candy bars. There is a clear trend that customers usually buy candy bars with the multipack of snack cakes and single soft drinks.
Our analysis also showed the percentage contribution to total sales revenue by each loyalty level, and based on the numbers we believe there will not be much change in the total sales revenue if we choose to target prospective customers.

(iv) Our Recommendation:

Targeting Customers in Different Loyalty Levels: Based on the analysis, it is recommended to target customers in different loyalty levels other than ‘prospective’ customers, as they can generate more sales and contribute more to the total sales revenue. By focusing on these customers, we can potentially increase its revenue and overall profitability.
Targeting ‘Prospective’ Customers: If we choose to target ‘prospective’ customers, marketing campaigns should focus on snack cakes and soft drinks. Since there is already a trend of buying these items together without campaigns, adding campaigns will hopefully show ‘prospective’ customers the kinds of deals they can achieve by maintaining their loyalty to Regork.

(v) Limitations and Suggestions for Improvement:

The analysis depends solely on historical data and assumes that customer behavior remains stable, which may not be the case in a rapidly evolving market environment.
The analysis does not consider external variables, such as changes in market conditions or competitor strategies, which can significantly influence revenue generation.
To enhance the analysis, it is advisable to gather more recent data and integrate external factors into the modeling. Furthermore, employing advanced machine learning algorithms could provide a more precise understanding of customer behavior and revenue generation.

BANA 4080 | Mid-term Project Group 3

Owen Gerdes, Loi Ly, Nhi Tran

2024-02-24