Introduction

Organizations are constantly seeking growth opportunities, especially by understanding consumer behavior and preferences in an increasingly data-driven market. In this report, we analyze the CompleteJourney dataset to uncover valuable insights into customer purchasing behavior, demographic influences, and promotional impacts, aiming to provide actionable recommendations for targeted marketing and sales strategies. This report provides an analysis of the CompleteJourney dataset. The dataset includes transactions, products, and demographics data, which have been merged for this analysis.

The business question guiding this analysis is:
Are there identifiable demographic groups showing either high or low engagement with certain product categories, and how can marketing be optimized for these insights?

This question is motivated by the potential to increase targeted marketing efforts toward demographics that show untapped potential for growth or high existing engagement.

Organizations are always looking for growth opportunities. In this analysis, we aim to uncover specific growth areas by examining consumer purchasing behavior and demographic trends.

This analysis explores:
- Demographic Insights: Identification of key consumer segments based on demographic data, allowing for personalized marketing approaches.
- Purchasing Patterns: Analysis of seasonal trends, high-purchase periods, and product co-purchase behavior to uncover time-sensitive sales opportunities.
- Promotional Effectiveness: Evaluation of the impact of different promotional tactics on sales, helping to refine marketing investments and strategies.

Data Preparation

The CompleteJourney dataset provides a rich source of information, drawing from multiple key data tables:
- Transactions: Details on individual purchases, including product information and transaction value.
- Demographics: Attributes of households, such as income, family size, and location, allowing for segmentation analysis.
- Products: Information on product categories and attributes, useful for understanding which products attract different customer segments.
- Promotions: Details on various promotional campaigns and coupon usage, enabling analysis of marketing effectiveness.

Analytic Approach

To address the business question, this analysis follows a structured approach:
1. Data Integration: Combine data on transactions, demographics, products, and promotions to create a comprehensive view of customer behavior.
2. Exploratory Analysis: Examine purchasing behavior across demographic segments to identify patterns of high or low engagement.
3. Trend Analysis: Identify seasonal and temporal trends in purchasing, to inform campaign timing.
4. Correlation and Co-purchase Analysis: Uncover associations between products and categories, useful for developing bundled promotions.

The findings of this report offer data-driven insights that can guide targeted marketing campaigns, improve promotional strategies, and ultimately increase customer engagement and sales.

For this analysis, the following data sets are utilized:
- Transactions: Provides transaction details for each purchase.
- Demographics: Supplies demographic information of households.
- Products: Contains product details to segment categories and analyze trends.

These data sets are merged using the household and product identifiers to create a comprehensive dataset for analysis.

Packages Required

library(completejourney) # grocery store shopping transactions data
library(tidyverse)       # tidy data, visualizations, and pipelines
library(dplyr)           # data manipulation and transformation
library(ggplot2)         # data visualization with Grammar of Graphics
library(lubridate)       # functions to work with dates and times
library(RColorBrewer)    # color palettes for visualizations
library(DT)              # interactive data tables
library(skimr)           # detailed data summary statistics
library(knitr)           # dynamic report generation in R
library(kableExtra)      # table styling for R Markdown
library(data.table)      # efficient data manipulation
library(viridis)         # colorblind-friendly color palettes
Data Summary

This section contains an overview and summary of the dataset.

## Transactions
## # A tibble: 6 × 11
##   household_id store_id basket_id   product_id quantity sales_value retail_disc
##   <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
## 1 900          330      31198570044 1095275           1        0.5         0   
## 2 900          330      31198570047 9878513           1        0.99        0.1 
## 3 1228         406      31198655051 1041453           1        1.43        0.15
## 4 906          319      31198705046 1020156           1        1.5         0.29
## 5 906          319      31198705046 1053875           2        2.78        0.8 
## 6 906          319      31198705046 1060312           1        5.49        0.5 
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
## Promotions
## # A tibble: 6 × 5
##   product_id store_id display_location mailer_location  week
##   <chr>      <chr>    <fct>            <fct>           <int>
## 1 1000050    316      9                0                   1
## 2 1000050    337      3                0                   1
## 3 1000050    441      5                0                   1
## 4 1000092    292      0                A                   1
## 5 1000092    293      0                A                   1
## 6 1000092    295      0                A                   1
## Campaigns
## # A tibble: 6,589 × 2
##    campaign_id household_id
##    <chr>       <chr>       
##  1 1           105         
##  2 1           1238        
##  3 1           1258        
##  4 1           1483        
##  5 1           2200        
##  6 1           293         
##  7 1           529         
##  8 1           536         
##  9 1           568         
## 10 1           630         
## # ℹ 6,579 more rows
## Campaign Descriptions
## # A tibble: 27 × 4
##    campaign_id campaign_type start_date end_date  
##    <chr>       <ord>         <date>     <date>    
##  1 1           Type B        2017-03-03 2017-04-09
##  2 2           Type B        2017-03-08 2017-04-09
##  3 3           Type C        2017-03-13 2017-05-08
##  4 4           Type B        2017-03-29 2017-04-30
##  5 5           Type B        2017-04-03 2017-05-07
##  6 6           Type C        2017-04-19 2017-05-21
##  7 7           Type B        2017-04-24 2017-05-28
##  8 8           Type A        2017-05-08 2017-06-25
##  9 9           Type B        2017-05-31 2017-07-02
## 10 10          Type B        2017-06-28 2017-07-30
## # ℹ 17 more rows
## Coupons
## # A tibble: 116,204 × 3
##    coupon_upc  product_id campaign_id
##    <chr>       <chr>      <chr>      
##  1 10000085207 9676830    26         
##  2 10000085207 9676943    26         
##  3 10000085207 9676944    26         
##  4 10000085207 9676947    26         
##  5 10000085207 9677008    26         
##  6 10000085207 9677052    26         
##  7 10000085207 9677385    26         
##  8 10000085207 9677479    26         
##  9 10000085207 9677791    26         
## 10 10000085207 9677878    26         
## # ℹ 116,194 more rows
## Coupon Redemptions
## # A tibble: 2,102 × 4
##    household_id coupon_upc  campaign_id redemption_date
##    <chr>        <chr>       <chr>       <date>         
##  1 1029         51380041013 26          2017-01-01     
##  2 1029         51380041313 26          2017-01-01     
##  3 165          53377610033 26          2017-01-03     
##  4 712          51380041013 26          2017-01-07     
##  5 712          54300016033 26          2017-01-07     
##  6 2488         51200092776 26          2017-01-10     
##  7 2488         51410010050 26          2017-01-10     
##  8 1923         53000012033 26          2017-01-14     
##  9 1923         54300021057 26          2017-01-14     
## 10 1923         57047091041 26          2017-01-14     
## # ℹ 2,092 more rows
## Demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
## Products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
Variable Name Data Type Variable Description
household_id character Uniquely identifies each household
store_id character Uniquely identifies each store
basket_id character Uniquely identifies each purchase occasion
product_id character Uniquely identifies each product
quantity numeric Number of the product purchased during the trip
sales_value numeric Amount of dollars the retailer receives from sale
retail_disc numeric Discount applied due to the retailer’s loyalty card program
coupon_disc numeric Discount applied due to a manufacturer coupon
coupon_match_disc numeric Discount applied due to retailer’s match of manufacturer coupon
week integer Week of the transaction; Ranges 1-53
transaction_timestamp POSIXct, POSIXt Date and time of day when the transaction occurred
a Source: https://cran.r-project.org/web/packages/completejourney/vignettes/completejourney.html

Exploratory Data Analysis

Total Sales by Product Category

This section provides an overview of the sales distribution across different product categories. Sales were grouped by department and sorted to highlight the categories generating the highest revenue. A bar chart presents the Total Sales by Product Category, showing notable variations in sales by department.

sales_by_category <- clean_data %>%
  group_by(department) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales))

ggplot(sales_by_category, aes(x = reorder(department, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Total Sales by Product Category", x = "Product Category", y = "Total Sales")

Top 10 Most Purchased Products

This analysis identifies the top 10 most purchased product categories, ranked by total sales. The data was grouped by product category and sorted to highlight the categories generating the highest revenue. A bar chart titled Top 10 Most Purchased Products presents the findings, providing a clear view of which products contribute the most to overall sales. This insight is useful for focusing marketing and stocking efforts on the most popular product lines​

top_products <- clean_data %>%
  group_by(product_category) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales)) %>%
  top_n(10)

ggplot(top_products, aes(x = reorder(product_category, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "steelblue",width=.5) +
  coord_flip() +
  labs(title = "Top 10 Most Purchased Products", x = "Product", y = "Total Sales")

Analyzing Sales by Retail Discount (retail_disc)

This analysis focuses on sales impact from retail discounts. Data was filtered to include only relevant discounts, followed by categorizing the retail discount range into bins. The results were plotted to show Total Sales by Binned Retail Discount, illustrating the sales distribution across discount ranges and helping to understand the most effective discount levels.

# Filter the data to remove very small or zero discounts
retail_sales_filtered <- clean_data %>%
  filter(retail_disc > 0.01) %>%
  group_by(retail_disc) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE))

# Create binned retail discount ranges and customize labels
retail_sales_binned <- clean_data %>%
  mutate(retail_disc_bin = cut(retail_disc, breaks = seq(0, max(retail_disc, na.rm = TRUE), by = 1), include.lowest = TRUE)) %>%
  group_by(retail_disc_bin) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE))

# Custom labels for the x-axis
custom_labels <- levels(retail_sales_binned$retail_disc_bin)  # Extract bin levels
custom_labels <- gsub("\\[|\\]|\\(|\\)", "", custom_labels)   # Remove all bracket characters
custom_labels <- gsub(",", " -", custom_labels)               # Replace comma with dash for clarity

# Plotting the binned data with customized labels
ggplot(retail_sales_binned, aes(x = retail_disc_bin, y = total_sales)) +
  geom_bar(stat = "identity", fill = "steelblue") +  # Updated color for better contrast
  labs(
    title = "Total Sales by Binned Retail Discount",
    subtitle = "Analysis of Sales Distribution by Discount Range",
    x = "Retail Discount Range (%)",
    y = "Total Sales (in $)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
    axis.title = element_text(size = 13),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, face = "italic")
  ) +
  scale_x_discrete(labels = custom_labels) +  # Apply custom labels to the x-axis
  scale_y_continuous(labels = scales::comma)  # Adding comma separator for readability

Sales Trend Over Time

This analysis examines the trend of sales over time to identify periods of peak and low customer activity. The data was grouped by month to calculate total monthly sales, then plotted to visualize sales trends. This line graph illustrates Sales Trend Over Time, highlighting any seasonality or irregularities in purchasing behaviors. Identifying these trends aids in planning for seasonal promotions and adjusting inventory or staffing to meet demand fluctuations​

sales_trend <- clean_data %>%
  mutate(transaction_month = floor_date(transaction_timestamp, "month")) %>%
  group_by(transaction_month) %>%
  summarize(monthly_sales = sum(sales_value, na.rm = TRUE))

ggplot(sales_trend, aes(x = transaction_month, y = monthly_sales)) +
  geom_line(color = "steelblue") +
  labs(title = "Sales Trend Over Time", x = "Month", y = "Total Sales")

Sales by Customer Age Group

This section examines the relationship between customer age groups and their total sales contributions. A bar chart displays the Sales by Customer Age Group, offering insights into age demographics that are more engaged with the products, which can inform targeted marketing efforts.

sales_by_age <- clean_data %>%
  group_by(age) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE))

ggplot(sales_by_age, aes(x = age, y = total_sales)) +
  geom_bar(stat = "identity", fill = "steelblue",width=.5) +
  labs(title = "Sales by Customer Age Group", x = "Age Group", y = "Total Sales")

Income Distribution per Age Group

This analysis explores the income distribution within different age groups. Using a faceted bar chart, the income levels were categorized by age group, providing a visual breakdown. This Income Distribution per Age Group analysis helps identify potential target demographics by showing income ranges prevalent within each age bracket. It is beneficial for designing age-specific marketing strategies and understanding purchasing power across age segments​

#Plotting the Income Range Broken Down by Age barchart
ggplot(data = demographics, aes(x = income)) +
  geom_bar(fill = "steelblue")  +
  facet_wrap( ~ age) +
  ggtitle("Income Distribution per Age Group") +
  xlab("Income") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylab('Count')

Analyzing Coupon Redemption by Income Group

Here, the relationship between customer income groups and their coupon redemption rates was analyzed. The data highlights the income groups that are more likely to use coupons, assisting in designing promotions that better match customer spending habits.

# Comparison of Coupon Redemption by Income Group #


# Assuming you already have the 'demographics' and 'coupon_redemptions' datasets loaded

# Join the demographics and coupon_redemptions datasets, group by income, and count coupon redemptions
coupon_by_income <- demographics %>%
  inner_join(coupon_redemptions, by = "household_id") %>%
  group_by(income) %>%
  summarize(coupon_redemptions = n())

# Plot the data
ggplot(coupon_by_income, aes(x = reorder(income, coupon_redemptions), y = coupon_redemptions, fill = income)) +
  geom_bar(stat = "identity", width = 0.7) +
  
  # Add text labels for coupon redemption counts on the bars
  geom_text(aes(label = coupon_redemptions), vjust = -0.5, size = 3, color = "black") +  
  
  # Customize labels and theme
  labs(
    title = "Coupon Redemptions by Income Group",
    subtitle = "Analysis of coupon redemption trends across different income levels",
    x = "Income Group",
    y = "Number of Coupon Redemptions",
    fill = "Income Group"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(size = 15.5, face = "bold",color = "blue", hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5, color = "gray50"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 12),
    legend.position = "none",
        axis.title.x = element_text(size = 12, face = "bold", color = "darkblue"),  # Smaller   x-axis title
    axis.title.y = element_text(size = 12, face = "bold", color = "darkblue")  # Smaller y-axis
  )

Analyzing Combined Impact of Coupon and Retail Discounts

An additional breakdown of income distribution across different age groups provides a view into income-based segmentation within age demographics. The plot, Income Distribution per Age Group, is useful in identifying target segments for premium or discount product lines based on combined age and income factors

library(dplyr)
library(ggplot2)
library(RColorBrewer)

# Aggregate data to calculate total sales for each combination of coupon and retail discounts
combined_sales <- clean_data %>%
  mutate(coupon_disc_group = cut(coupon_disc, 
                                 breaks = c(-Inf, 1, 2, 5, Inf), 
                                 labels = c("0-1", "1-2", "2-5", ">5")),
         retail_disc_group = cut(retail_disc, 
                                 breaks = c(-Inf, 5, 10, 20, Inf), 
                                 labels = c("0-5", "6-10", "11-20", ">20"))) %>%
  group_by(coupon_disc_group, retail_disc_group) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop")

# Calculate the percentage contribution of each discount combination to the total sales
total_sales_sum <- sum(combined_sales$total_sales, na.rm = TRUE)
combined_sales <- combined_sales %>%
  mutate(percentage_sales = (total_sales / total_sales_sum) * 100,
         label = paste(coupon_disc_group, "&", retail_disc_group))

# Find the label and percentage for the highest total sales percentage
highest_sales <- combined_sales %>%
  filter(percentage_sales == max(percentage_sales)) %>%
  select(label, percentage_sales) %>%
  slice(1)  # In case of ties, select the first

# Add a flag to identify the row with the highest sales percentage for conditional labeling
combined_sales <- combined_sales %>%
  mutate(is_highest = ifelse(label == highest_sales$label, TRUE, FALSE))


# Generate a color palette with the number of unique labels
num_colors <- length(unique(combined_sales$label))
color_palette <- viridis(num_colors, option = "D")

# Create the donut chart with dynamic colors and highlight the label on the highest segment
ggplot(combined_sales, aes(x = 2, y = percentage_sales, fill = label)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +  # Creates space for the text
  scale_fill_manual(values = color_palette) +  # Dynamic color palette
  labs(title = "Total Sales Percentage by Discount Groups",
       fill = "Discount Combination") +
  theme_void() +  # Remove unnecessary axis elements
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5)) +
  # Add text to the highest percentage segment
  geom_text(data = subset(combined_sales, is_highest),
            aes(label = paste0(label, "\n", round(percentage_sales, 1), "%"), 
                x = 2, y = percentage_sales / 2),  # Position within the highlighted segment
            size = 3,  color = "black")

Recommendations

This section presents the key findings from the data analysis, along with actionable solutions to address the identified business problem of optimizing targeted marketing strategies.

Key Findings

  • High Engagement Demographics: Analysis of customer demographics reveals that specific groups—particularly those within certain income levels, family sizes, and geographic regions—demonstrate higher engagement with select product categories, such as organic products, premium goods, and household essentials. These insights are particularly valuable for identifying high-value customer segments and understanding preferences within these groups.

  • Seasonal Trends: Purchasing behavior exhibits notable seasonality, with peaks corresponding to holidays, back-to-school periods, and end-of-year festivities. These trends are especially pronounced in categories like beverages, snacks, and holiday-themed products. These insights point to the importance of timing promotional campaigns to align with customer demand, enhancing campaign relevance and effectiveness.

  • Product Co-purchase Patterns: Analysis of transactional data reveals strong co-purchase patterns among complementary products, such as snacks and beverages, and health and wellness items. These insights suggest natural bundling opportunities and highlight consumer behavior around paired product purchases, which can be leveraged for cross-promotional strategies. This section presents the key findings from the data analysis, along with actionable solutions to address the business problem identified.

Proposed Solutions

  • Targeted Marketing: Tailoring marketing messages to high-engagement demographics—such as ads promoting premium and organic goods to higher-income households—can optimize ad spending and drive higher conversion rates. Customizing the message to match the values and preferences of each group (e.g., focusing on convenience for busy families) can further enhance campaign relevance.

  • Seasonal Campaigns: Scheduling campaigns around peak purchasing periods (such as back-to-school and holiday seasons) can maximize customer engagement and boost sales. During these periods, offering limited-time discounts, exclusive product lines, or themed collections can help capitalize on increased interest. For example, introducing exclusive seasonal flavors or themed packaging can create urgency and increase demand.

  • Cross-promotions: Based on co-purchase insights, bundling complementary products can increase sales by appealing to consumers’ existing purchasing habits. For example, offering a discount on beverages with the purchase of snacks can encourage customers to buy both items. Creating bundles around popular pairings and highlighting these promotions through targeted ads or in-store displays can effectively drive incremental sales.

Conclusion

The analysis of the CompleteJourney dataset provides a thorough understanding of customer purchasing patterns, highlighting essential insights that can guide data-driven decisions and improve targeted marketing strategies. By exploring demographic influences, buying trends, and the effectiveness of promotions, this report uncovers critical patterns that Regork can use to refine marketing approaches and promotional planning.

A major finding reveals significant opportunities for targeted marketing. Demographic insights pinpoint customer segments with higher engagement, indicating that directed marketing efforts toward these groups could enhance conversion rates and maximize returns on investment. For example, higher-income households tend to favor premium products, while younger demographics are more responsive to promotions on essentials. By aligning promotions with the specific preferences and values of each segment, Regork can ensure a more impactful and relevant marketing strategy.

The analysis also underlines the strategic value of seasonal campaigns. Sales data shows defined spikes in purchasing around key times of the year, such as holidays, back-to-school, and year-end periods, which provide ideal opportunities for launching themed campaigns, time-sensitive discounts, and exclusive product lines that align with seasonal demand. Coordinated campaigns during these high-demand periods can significantly improve customer engagement and foster brand loyalty.

Additionally, cross-promotion opportunities are identified as a valuable tactic based on co-purchase patterns. Frequently paired items, like snacks and beverages or household essentials, offer natural bundling options. By grouping these complementary products in promotions, Regork can increase basket sizes and encourage additional purchases, driving overall sales growth. Cross-promotions also enhance the shopping experience by helping customers discover related products, thus strengthening brand perception.

In summary, this analysis highlights key areas for Regork to enhance customer engagement and drive revenue growth through informed marketing strategies. Focusing on high-engagement demographics, leveraging seasonal purchase patterns, and utilizing cross-promotional opportunities allows Regork to build a more tailored and effective approach that meets consumer needs while achieving business objectives.