Final Project: Group 8

By: Quynh Tran, Clarence Turley, Nikhil Mothe

Introduction

This analysis will be performed using Exploratory Data Analysis from the Complete Journey dataset and will focus on identifying purchasing trends, demographic patterns related to household size and income, and the impact of specific marketing campaigns (based on their IDs) on candy sales. By uncovering these insights, we aim to provide actionable recommendations to boost candy sales and enhance customer engagement.

Data Used:

Data Tables used from complete Journey:

Transactions
Demographics
Products

Using these datasets, data frames will be created and visualized in various graph styles, which will be explained in detail throughout this report.

Packages Needed:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(dplyr)
library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(ggCorpIdent)

## Loading required package: extrafont
## Registering fonts with R
## 
## Attaching package: 'ggCorpIdent'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     ggplot, scale_colour_discrete, scale_fill_discrete

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(scales)
library(sunburstR)
library(RColorBrewer) 
library(extrafont)
library(paletteer)

# Tidy up the data into a single dataframe

transactions <- get_transactions()
prod <- products
demo <- demographics
coupon <- coupon_redemptions

df <- transactions %>%
  inner_join(prod) %>%
  inner_join(demo) %>%
  inner_join(coupon)

## Joining with `by = join_by(product_id)`
## Joining with `by = join_by(household_id)`
## Joining with `by = join_by(household_id)`

## Warning in inner_join(., coupon): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 1549 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

df

df_forcandy <- transactions %>%
  inner_join(prod) %>%
  inner_join(demo)

## Joining with `by = join_by(product_id)`
## Joining with `by = join_by(household_id)`

# bottom 10 products in Grocery
df_grocery <- df_forcandy %>%
  filter(department == "GROCERY") %>%  
  group_by(product_type) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(total_sales) 


df_bottom_grocery <- df_grocery %>%
  slice_head(n = 10)

# Plot
ggplot(df_bottom_grocery, aes(x = reorder(product_type, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() + 
  scale_y_continuous(labels = label_comma()) + 
  labs(title = "Lowest-Performing Products in Grocery by Total Sales",
       x = "Product Type",
       y = "Total Sales") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    axis.text.y = element_text(size = 14),  
    axis.title = element_text(size = 12),  
    axis.title.y = element_text(margin = margin(r = 20))  
  )

The lowest performing product-categories in Regork’s sales portfolio. The underwhelming sales performance of the candy-category in 2017 is surprising and disappointing, but also gives us an opportunity to develop practical solutions for tangible sales growth.

# DF candy
candy_df <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))

# timestamp
candy_df <- candy_df %>%
  mutate(month = month(transaction_timestamp))

# DF monthly quantity
candy_by_month <- candy_df %>%
  group_by(month) %>%
  summarise(quantity = sum(quantity, na.rm = TRUE)) %>%
  ungroup()

candy_by_month <- candy_by_month %>%
  mutate(month_name = factor(month.name[month], levels = month.name))

candy_by_month <- candy_by_month %>%
  arrange(month)

# Bar plot
ggplot(candy_by_month, aes(x = month_name, y = quantity, fill = month_name)) +
  geom_bar(stat = "identity") +
  scale_fill_paletteer_d("colorBlindness::Blue2DarkRed12Steps") +  
  labs(
    title = "Amount of Candy Purchased Monthly in 2017",
    x = "Month",
    y = "Total Quantity Sold"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12),
    plot.title = element_text(size = 16)
  ) +
  geom_text(
    aes(label = scales::comma(quantity)), 
    vjust = -0.5, 
    color = "black",
    size = 4
  )

A Monthly comparison analysis of candy sales in 2017.

Key Takeaway: Monthly changes do not significantly impact the overall total amount of candy purchased by consumers. Therefore, an effective mechanism implemented to improve the category (overall) should result in continual and robust performance.

# df
candy_df <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE)) %>%
  select(income, transaction_timestamp, sales_value, product_type) %>%
  mutate(month = month(transaction_timestamp))

# Monthly candy sales by income group
candy_by_month_income <- candy_df %>%
  group_by(month, income) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(month) %>%
  mutate(month_name = factor(month.name[month], levels = month.name))

# Plot 
ggplot(candy_by_month_income, aes(x = month_name, y = total_sales, fill = income)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_paletteer_d("impressionist.colors::korenveld_onder_onweerslucht") +  
  labs(
    title = "Monthly Trends in Candy Sales by Income Group",
    x = "Month",
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12),
    plot.title = element_text(size = 16),
    plot.margin = margin(10, 10, 10, 10)
  )

An illustration of the monthly candy sales by Income Group.

Key Takeaways: Overall, candy sales spike during the holiday months (Halloween & Christmas) for all income groups. A drop in candy purchases for the other months is an opportunity to significantly improve the annual sales total.

Monthly Trends in Candy Sales

# Candy DF
candy_df <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))

candy_df <- candy_df %>%
  mutate(month = month(transaction_timestamp))

# Sum of sales of Candy
candy_by_month <- candy_df %>%
  group_by(month) %>%
  summarise(sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ungroup()

candy_by_month <- candy_by_month %>%
  mutate(month_name = factor(month.name[month], levels = month.name))

# Months sort
candy_by_month <- candy_by_month %>%
  arrange(month)

# Line plot 
ggplot(candy_by_month, aes(x = month_name, y = sales_value, group = 1)) +
  geom_line(color = "#F9FFAF", linewidth = 1.5) +  
  geom_point(color = "#00A3A8", size = 4) +  
  labs(
    title = "Monthly Trends in Candy Sales (2017)",
    x = "Month",
    y = "Total Quantity Sold"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 12),
    plot.title = element_text(size = 16)
  ) +
  geom_text(
    aes(label = scales::comma(sales_value)),  
    vjust = -0.5, 
    color = "black",
    size = 4
  ) +
  theme(
    plot.margin = margin(10, 10, 10, 10)
  )

Another illustration showing the effect of seasonal patterns on sales performance (especially in relation to the Holidays: Halloween/October, Christmas/December). These observations suggest that candy sales peak during certain months; and then dip after holiday-celebrations have concluded, which suggests that consumers than choose alternative food options for health (or New Years Resolution) reasons.

# Candy DF
df_candy <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))
candy_sales_by_income_household <- df_candy %>%
  group_by(income, household_size) %>%
  summarise(total_sales_value = sum(sales_value, na.rm = TRUE), .groups = "drop")

pivot_table <- candy_sales_by_income_household %>%
  pivot_wider(names_from = household_size, values_from = total_sales_value, values_fill = list(total_sales_value = 0))

melted_data <- melt(pivot_table, id.vars = "income", variable.name = "household_size", value.name = "total_sales_value")

# Heatmap
plot_ly(melted_data, x = ~household_size, y = ~income, z = ~total_sales_value, type = "heatmap", colors = "Blues", 
        colorbar = list(title = "Total Sales ($)")) %>%
  layout(
    title = "Sales of Candy Products by Income Group and Household Size",
    xaxis = list(title = "Household Size"),
    yaxis = list(title = "Income Group")
  )

An illustration of candy sales based on household (family) size & income group.

Key Takeaways: The largest demographic of candy consumers is comprised of a household size of 1 or 2 people, and earn $50 - $74K per year. This demographic should be our target audience in marketing & promotional campaigns.

# Candy DF
df_candy <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))

candy_sales_by_income_kids <- df_candy %>%
  group_by(income, kids_count) %>%
  summarise(total_sales_value = sum(sales_value, na.rm = TRUE), .groups = "drop")

pivot_table <- candy_sales_by_income_kids %>%
  pivot_wider(names_from = kids_count, values_from = total_sales_value, values_fill = list(total_sales_value = 0))

melted_data <- melt(pivot_table, id.vars = "income", variable.name = "kids_count", value.name = "total_sales_value")

# Heatmap using plotly
plot_ly(melted_data, x = ~kids_count, y = ~income, z = ~total_sales_value, type = "heatmap", colors = "Blues", 
        colorbar = list(title = "Total Sales ($)")) %>%
  layout(
    title = "Sales of Candy Products by Income Group and Kids Count",
    xaxis = list(title = "Kids Count"),
    yaxis = list(title = "Income Group")
  )

An illustration of candy sales based on household (family) size & # of Kids.

Key Takeaways: Consumers who make an annual income of $50 - $74K and have no kids are the largest spending group of candy. This is our #1 target audience.

Create Targeted Bundling Promotions

df_candy <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))  

candy_baskets <- unique(df_candy$basket_id)

purchases <- df %>%
  filter(basket_id %in% candy_baskets)

purchases_grocery <- purchases %>%
  filter(department == "GROCERY") %>%
  filter(!(grepl("candy", product_type, ignore.case = TRUE)))  


top_purchases <- purchases_grocery %>%
  group_by(product_type) %>%
  summarise(total_quantity = sum(quantity, na.rm = TRUE)) %>%
  arrange(desc(total_quantity)) %>%
  head(10)

#plot
ggplot(top_purchases, aes(x = reorder(product_type, total_quantity), y = total_quantity, fill = total_quantity)) +
  geom_bar(stat = "identity") +
  coord_flip() +  
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  
  labs(
    title = "Top 10 Products Bought with Candy in the GROCERY Department",
    x = "Product Type",
    y = "Total Quantity Purchased"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10)
  ) +
  theme(
    plot.margin = margin(10, 10, 10, 10)
  )

Sale Strategy - By bundling candy with [milk, yogurt, and soft drinks], Regork is ensured to increase annual sales and encourage customers to purchase more store items as a result. A “Bundling Strategy” also enhances the perceived value of the offer, and thus makes it even more attractive to customers.

df_candy <- df %>%
  filter(grepl("candy", product_type, ignore.case = TRUE))

# sales and quantity by campaign_id
candy_sales_by_campaign <- df_candy %>%
  group_by(campaign_id) %>%
  summarise(
    total_sales_value = sum(sales_value, na.rm = TRUE),
    total_quantity = sum(quantity, na.rm = TRUE),
    .groups = 'drop'  
  )

# top 10 campaigns 
candy_sales_campaign <- candy_sales_by_campaign %>%
  arrange(desc(total_sales_value)) %>%
  slice_head(n = 10)  


candy_sales_campaign <- candy_sales_campaign %>%
  mutate(
    campaign_id = as.character(campaign_id), 
    total_sales_value = as.character(total_sales_value)  
  ) %>%
  select(campaign_id, total_sales_value)  

# pie chart
sunburst(candy_sales_campaign, count = TRUE) %>%
  ggtitle("Top 10 Campaigns for Candy Sales")

## $title
## 
## $subtitle
## [1] "Top 10 Campaigns for Candy Sales"
## 
## attr(,"class")
## [1] "labels"

# df
df_top_sellers <- df %>%
  group_by(income, store_id) %>%
  summarize(Avg_total_sales = mean(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Avg_total_sales)) %>%  
  group_by(income) %>%
  slice_head(n = 4)

# Plot
ggplot(df_top_sellers, aes(reorder(store_id, Avg_total_sales), 
        y = Avg_total_sales, fill = income)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_paletteer_d("impressionist.colors::korenveld_onder_onweerslucht") + 
  labs(title = "Most Popular & Profitable Stores (per Income Group)",
       x = "Store ID",
       y = "Average Total Sales",
       fill = "Income Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1),
        axis.text.y = element_text(size = 10),
        axis.title = element_text(size = 12),
        axis.title.y = element_text(margin = margin(r = 20)))

An Analysis of the Most Popular & Profitable Stores (per Income Group)

Key Takeaways: Consumer prefer specific stores to spend their money;and these stores should be our target locations for product placement, and marketing campaigns. Placing our product where consumers spend the most maximizes product visibility & presence, therefore increasing consumer sales as a result.

# df
df_top_sellers <- df %>%
  group_by(store_id) %>%
  summarize(Avg_total_sales = mean(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Avg_total_sales)) %>%  
  slice_head(n = 12)  

# Plot
ggplot(df_top_sellers, aes(reorder(store_id, Avg_total_sales), 
        y = Avg_total_sales, fill = store_id)) +  
  geom_bar(stat = "identity") +  
  coord_flip() + 
  scale_y_continuous(labels = label_comma()) +  
  scale_fill_paletteer_d("palettetown::dragonite") +  
  labs(title = "Most Popular & Profitable Stores",
       x = "Store ID",
       y = "Average Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1),
        axis.text.y = element_text(size = 10),
        axis.title = element_text(size = 12),
        axis.title.y = element_text(margin = margin(r = 20)))

An Identification of the Most Popular & Profitable Stores (based on Story ID) Key Takeaways: The most profitable stores (based on the Average Sales metric & Store ID) are the following: 901, 3149, 602, 197, 3163. A targeted marketing campaign on these stores will improve annual (candy) sales.

Summary:

Our project focused on Regork’s promotional campaigns for Candy product. Since sales for Candy is one of the lowest perform (bottom 5 sales in Grocery) and saw potential in growth. We focus our business question to: How can Regork optimize candy sales by identifying key product pairings, seasonal demand patterns, and effective marketing campaigns?

Key insights:

Top Performing Stores:

The stores with the highest average candy sales are:
- Store 901
- Store 3149
- Store 602
- Store 197
- Store 3163
These stores consistently outperform others in terms of candy sales, suggesting that there may be specific factors contributing to their success.

Target Customer Segments:

Candy sales are highest in households with:
- Household Size 2 and Income Group 50-74K.
- Household Size 1 and Income Group 50-74K.
- Households with 0 Kids and Income Group 50-74K, followed by those with an income of 75-99K.

Top Products Purchased with Candy:

Milk, yogurt, and soft drinks are the most frequently bought products along with candy, indicating potential cross-selling opportunities.

Seasonal Demand Patterns:

Candy sales peak in October (likely due to Halloween) and December (due to holiday season), while January sees the lowest sales.

Conclusion:

To optimize candy sales and maximize revenue, Regork should prioritize marketing efforts on the highest-performing stores (901, 3149, 602, 197, 3163) while also tailoring campaigns to the specific customer segments that are most likely to purchase candy. Seasonal promotions, strategic cross-selling, and bundled product offerings with milk, yogurt, and soft drinks can also drive increased sales. By leveraging data and aligning marketing strategies with the store’s performance and customer demographics, Regork can significantly enhance candy sales and overall profitability.