Introduction

Consumers today are showing a growing preference for healthier, natural, and ethically produced foods — a shift highlighted by Alam et al. (2025) in Emerging Trends in Food Process Engineering: Integrating Sensing Technologies for Health, Sustainability, and Consumer Preferences. With rising interest in trends like organic, low-carb, gluten-free, and protein-rich products, understanding this segment is critical to staying competitive.

While broader market data and consumer research outline these emerging trends, this analysis seeks to reveal which health products currently drive revenue at Regork and to identify the demographic segments leading this growth.

By combining market insights with transaction-level data, Regork can develop targeted strategies that meet evolving consumer demands and capitalize on these promising product and customer segments.

This study assesses

Packages Required

To execute the code in this R project, the following R packages are required:

library(completejourney)  # dataset access
library(dplyr)            # data manipulation & transformation
library(ggplot2)          # data visualization & plotting
library(stringr)          # string detection & manipulation
library(lubridate)        # date handling
library(scales)           # formatting axis labels

Data Preparation

This report draws on the completejourney dataset, which captures one year of transaction data at the household level from 2,469 frequent grocery shoppers. It offers comprehensive records of each household’s purchases across all product categories, along with demographic information and direct marketing history for select households. You can access the full user guide here.

Loading the data

This section imports three completejourney datasets (transactions, products, and demographics) that form the foundation for subsequent analysis.

# Load the completejourney transactions dataset
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
# Load the completejourney products dataset
products <- products
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
# Load the completejourney demographics dataset
demographics <- demographics
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>

Dataframe Creation & Transformation

This section filters and joins the data, creates new variables, and aggregates the results into the key dataframes used in the subsequent analysis of health product sales and demographics.

# Identify health-related products based on keywords and selected categories
health_keywords <- c("organic", "gluten free", "no sugar", "protein", "fitness&diet")
selected_categories <- c("ORGANICS FRUIT & VEGETABLES", "FITNESS&DIET", "NATURAL VITAMINS")

products <- products %>%
  mutate(
    is_health_product = str_detect(str_to_lower(product_type), str_c(health_keywords, collapse = "|")) |
      product_category %in% selected_categories
  )
health_products <- products %>% filter(is_health_product)

# Filter transactions to include only health products and create weekly timestamps
transactions_health <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(is_health_product) %>%
  filter(!is.na(product_category)) %>%
  mutate(week = floor_date(transaction_timestamp, "week"))

# Aggregate total weekly sales for health products
weekly_health_sales <- transactions_health %>%
  group_by(week) %>%
  summarise(health_sales = sum(sales_value, na.rm = TRUE))

# Aggregate total weekly sales for all products
weekly_total_sales <- transactions %>%
  mutate(week = floor_date(transaction_timestamp, "week")) %>%
  group_by(week) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE))

# Calculate weekly percentage share of health product sales relative to total sales
health_share <- weekly_health_sales %>%
  left_join(weekly_total_sales, by = "week") %>%
  mutate(health_sales_pct = health_sales / total_sales)

# Summarize total sales by product category across all health product transactions
# (the category bar chart later filters this to the three predefined categories)
health_categories <- transactions_health %>%
  group_by(product_category) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales))


# Identify top 10 health product types by total sales
top_products <- transactions_health %>%
  group_by(product_type) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales)) %>%
  slice_head(n = 10)

# Join demographic info with health product transactions for segmentation analysis
transactions_health_demo <- transactions_health %>%
  left_join(demographics, by = "household_id")

# Summarize health product spend by income group
income_spend <- transactions_health_demo %>%
  filter(!is.na(income), !is.na(sales_value)) %>%
  group_by(income) %>%
  summarise(total_spend = sum(sales_value, na.rm = TRUE)) %>%
  arrange(income)

# Summarize health product spend by age group
age_spend <- transactions_health_demo %>%
  filter(!is.na(age), !is.na(sales_value)) %>%
  group_by(age) %>%
  summarise(total_spend = sum(sales_value, na.rm = TRUE)) %>%
  arrange(age)

# Summarize health product spend by age and income groups
demo_spend <- transactions_health_demo %>%
  filter(!is.na(age), !is.na(income)) %>%
  group_by(age, income) %>%
  summarise(total_spend = sum(sales_value, na.rm = TRUE)) %>%
  ungroup()

Exploratory Data Analysis

This section visualizes key trends in health product sales over time, highlights top health products by revenue, and examines spending patterns on health products across income and age demographic groups to uncover insights.

📈 Current Market Penetration

This section explores how health product sales are evolving week-over-week as a share of total store sales.

# Line plot: Health product sales as a share of total weekly sales
ggplot(health_share, aes(x = week, y = health_sales_pct)) +
  geom_line(color = "forestgreen", linewidth = 1.3) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    title = "Health Product Sales as % of Total Weekly Sales",
    subtitle = "How is the health category trending as a share of total weekly sales?",
    x = "Week",
    y = "Health Sales Share (%)",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

🍎 Product Performance

This section examines sales patterns across health product categories and types to reveal how revenue is distributed.

# Bar chart: Total sales by health product category (horizontal)
health_categories %>%
  filter(product_category %in% selected_categories) %>%
  ggplot(aes(x = reorder(product_category, total_sales), y = total_sales, fill = total_sales)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Total Sales Performance of Health Product Categories",
    subtitle = "How much revenue do dedicated health product categories generate overall?",
    x = "Product Category",
    y = "Total Sales",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  ) +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen")

# Dot plot: Top 10 health product types ranked by total sales
ggplot(top_products, aes(x = reorder(product_type, total_sales), y = total_sales)) +
  geom_point(color = "forestgreen", size = 4) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Top 10 Health Product Types by Total Sales",
    subtitle = "Which health product types contribute most to overall sales?",
    x = "Product Type", 
    y = "Total Sales",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

👥 Customer Segmentation

Here, we explore how health product spending varies across income and age to surface patterns in consumer behavior.

# Dot plot: Total health product spend by income group
ggplot(income_spend, aes(x = reorder(income, total_spend), y = total_spend)) +
  geom_point(color = "darkblue", size = 5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Total Health Product Spend by Income Group",
    subtitle = "How does spending on health products vary across income groups?",
    x = "Income Group",
    y = "Total Spend",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

# Box plot: Distribution of per-transaction spend by income group
ggplot(transactions_health_demo %>% filter(!is.na(income)), aes(x = income, y = sales_value)) +
  geom_boxplot(fill = "lightblue") +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Health Product Spend Distribution by Income Group",
    subtitle = "What is the range of spend per transaction across different income groups?",
    x = "Income Group", 
    y = "Spend per Transaction",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

# Violin plot: Distribution of per-transaction spend by age group
ggplot(transactions_health_demo %>% filter(!is.na(age)), aes(x = age, y = sales_value)) +
  geom_violin(fill = "lightgreen") +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Health Product Spend Distribution by Age Group",
    subtitle = "How does spend per transaction on health products vary across age groups?",
    x = "Age Group", 
    y = "Spend per Transaction",
    caption = "Source: completejourney dataset"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

# Heatmap: Total health product spend by income and age group
ggplot(demo_spend, aes(x = income, y = age, fill = total_spend)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkgreen", labels = scales::dollar_format()) +
  labs(
    title = "Health Product Total Spend By Income and Age Group",
    subtitle = "How is overall health product spend distributed across income and age segments?",
    x = "Income Group", 
    y = "Age Group", 
    caption = "Source: completejourney dataset",
    fill = "Total Spend") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 15, margin = margin(b = 10)),
    plot.subtitle = element_text(face = "italic", color = "gray40", size = 13, margin = margin(b = 15)),
    plot.caption = element_text(hjust = 1, size = 9, color = "gray60", margin = margin(t = 10)),
    plot.title.position = "plot",
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )
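
The totals plotted above depend partly on how many households fall into each group, so a larger segment can look like a heavier spender simply because of its size. As a rough sketch, total spend could be divided by the number of distinct purchasing households in each group. The toy data and the name `income_per_household` below are hypothetical, invented only to illustrate the calculation, and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical line items already joined to demographics (values invented)
toy_health_demo <- tibble(
  household_id = c("1", "1", "2", "3", "3", "4"),
  income       = c("35-49K", "35-49K", "35-49K", "50-74K", "50-74K", "50-74K"),
  sales_value  = c(2.00, 3.00, 5.00, 4.00, 4.00, 12.00)
)

income_per_household <- toy_health_demo %>%
  group_by(income) %>%
  summarise(
    total_spend = sum(sales_value, na.rm = TRUE),
    # Counts only households with at least one health purchase,
    # not every household in the income group
    households = n_distinct(household_id),
    spend_per_household = total_spend / households
  )

income_per_household
```

Applied to transactions_health_demo, the same pattern would give a per-household view to read alongside the totals. Households with no health purchases are absent from that dataframe, so this measures spend per purchasing household rather than per household in the segment.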

Summary

(i) Problem Statement

The analysis aims to understand the impact of health and wellness products within Regork’s overall sales landscape and uncover opportunities for growth. As consumer interest in healthier options continues to grow, this analysis explores how significantly these products contribute to weekly sales, which offerings are gaining traction, and which customer groups are driving sales.

(ii) Approach (Data & Methodology)

To address the problem statement, the analysis leverages the completejourney dataset, which contains transaction-level grocery purchase data and household demographics.

The methodology involved:

(iii) Key Insights

(iv) Consumer Implications & CEO Recommendations

(v) Limitations & Future Improvements

The analysis is based on one year of historical data from a limited set of households, which may not represent broader or future market trends.

Health products were identified using keyword and category filters, which may miss some relevant products or include gray area items.
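
One way to gauge this limitation is to audit which keyword matched each product type, so gray area items stand out for manual review. The sketch below uses invented product types, and `keyword_audit` and `keyword_counts` are hypothetical names. It also shows one possible way to handle missing product types explicitly, which is an assumption rather than the approach used in the analysis above.

```r
library(dplyr)
library(stringr)

# Hypothetical product types, invented for illustration only
toy_products <- tibble(
  product_type = c("ORGANIC CARROTS", "PROTEIN BARS", "GLUTEN FREE BREAD",
                   "SUGAR COOKIES", NA)
)

# Same keyword list as the analysis, collapsed into one regex alternation
health_keywords <- c("organic", "gluten free", "no sugar", "protein", "fitness&diet")
pattern <- str_c(health_keywords, collapse = "|")

keyword_audit <- toy_products %>%
  mutate(
    # str_extract() returns the matched keyword, or NA when nothing matches
    matched_keyword = str_extract(str_to_lower(product_type), pattern),
    # str_detect() returns NA for a missing product_type;
    # coalesce() turns that NA into an explicit FALSE
    is_health_product = coalesce(str_detect(str_to_lower(product_type), pattern), FALSE)
  )

# How many flagged product types each keyword accounts for
keyword_counts <- keyword_audit %>%
  filter(is_health_product) %>%
  count(matched_keyword, sort = TRUE)

keyword_audit
```

Reviewing the rows behind each keyword is a quick way to spot matches that are technically correct but arguably not health products.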

Income and age groupings are broad, and more granular segmentation could reveal deeper insights.

Using promotional campaign data could help show how marketing impacts health product sales.
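
As a minimal sketch of what such an extension might look like, the example below flags health product purchases that fall inside a campaign window for the purchasing household. All data are invented. The table shapes are assumed to resemble the campaigns and campaign_descriptions datasets in completejourney, and the column names should be checked against the package documentation before reuse.

```r
library(dplyr)

# Hypothetical health product line items (values invented)
toy_health_tx <- tibble(
  household_id = c("1", "1", "2", "2"),
  tx_date      = as.Date(c("2017-03-05", "2017-06-10", "2017-03-07", "2017-03-20")),
  sales_value  = c(4.00, 2.50, 3.00, 1.50)
)

# Assumed shape: which households received which campaign
toy_campaigns <- tibble(
  campaign_id  = c("A", "A"),
  household_id = c("1", "2")
)

# Assumed shape: campaign start and end dates
toy_campaign_descriptions <- tibble(
  campaign_id = "A",
  start_date  = as.Date("2017-03-01"),
  end_date    = as.Date("2017-03-15")
)

campaign_spend <- toy_health_tx %>%
  left_join(toy_campaigns, by = "household_id") %>%
  left_join(toy_campaign_descriptions, by = "campaign_id") %>%
  mutate(
    # TRUE when the purchase date falls inside the campaign window;
    # households with no campaign produce NA, which becomes FALSE
    in_campaign = coalesce(tx_date >= start_date & tx_date <= end_date, FALSE)
  ) %>%
  group_by(in_campaign) %>%
  summarise(total_spend = sum(sales_value), line_items = n())

campaign_spend
```

In the toy data each household belongs to a single campaign. With real data a household may receive several campaigns, so the joins would duplicate line items and a deduplication step would be needed before summing. The comparison is also descriptive only and does not by itself show that a campaign caused any change in spend.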