This report explores a potential area of growth where Regork could invest future resources to increase revenue and profits. The main business question addressed is:
Are there opportunities to increase sales by introducing more targeted marketing campaigns?
Marketing campaigns can have a direct impact on both growth and profitability. This analysis helps Regork make more informed decisions by showing whether additional campaigns are likely to bring real results. It can also help the company avoid spending money on campaigns that don’t deliver returns and justify marketing investments that do.
To perform this analysis, The Complete Journey dataset was used. This dataset includes one year of transaction data from 2,469 households who are frequent shoppers at a grocery store. It contains detailed records of all purchases made by each household. For some households, the data also includes demographic information and a history of direct marketing contacts, allowing for more targeted analysis. Below are the specific data sets within The Complete Journey utilized in the analysis:
The analysis was carried out using R. The process involved importing and exploring the data, combining datasets through joins, filtering and creating new variables and calculations, and generating multiple plots to visualize the findings and trends.
Imported and explored the data
Performed joins between datasets
Filtered and created new variables and calculations
Built several plots to visualize key findings
To answer the business question, the analysis focused on understanding the performance of each type of campaign to determine which ones generated more sales. Demographic data was used to identify customer segments and to analyze which groups responded best to different campaigns. The performance of targeted versus non-targeted groups was compared, and the main products associated with the most successful campaign was identified.
This analysis provides valuable insights that can help Regork target new audiences, promote specific products more effectively, and take advantage of seasonal trends. By understanding which strategies work best, Regork can improve the effectiveness of its marketing efforts and make smarter decisions about where to invest in future campaigns.
suppressMessages(suppressWarnings(library(completejourney)))# Source of the dataset used in the study
suppressMessages(suppressWarnings(library(dplyr))) # For joins, filters, creation of new variables
suppressMessages(suppressWarnings(library(lubridate))) # Data manipulation
suppressMessages(suppressWarnings(library(ggplot2))) # Creation of plots
suppressMessages(suppressWarnings(library(scales))) # Used to define plot scales
suppressMessages(suppressWarnings(library(forcats))) # Used with fct_reorder
Initially, it was assessed how was the distribution of the total sales over the months of 2017 for the group of 2,469 households who were frequent shoppers at Regork by marketing campaign types (Type A, Type B, Type C, No Campaign). To accomplish that, the transactions and campaigns/ campaign_descriptions data frames were joined. A “left_join” was used, so that the final data frame also contained the observations without a campaign associated. A separated data frame (“transaction_not_campaign”) was created with an “anti_join” that included only transactions of households not associated with any campaign. Later, both tables containing all the information (transactions with and without campaign) were united to generate a unique data frame (“trans_campaign_demo”), in which rows without a campaign were transformed in “No campaign”.
# Full transaction data
transactions <- get_transactions()
# Joined with campaign and campaign_description
transactions_campaign <- transactions %>%
mutate(transaction_date = as_date(transaction_timestamp)) %>%
left_join(campaigns, by = "household_id", relationship = "many-to-many") %>%
left_join(campaign_descriptions, by = "campaign_id") %>%
filter(
transaction_date >= start_date,
transaction_date <= end_date,
)
# Created a data frame without campaign
transaction_not_campaign <- transactions %>%
anti_join(transactions_campaign, by = "household_id")
# Joined with campaigns again
transactions_no_campaign <- transaction_not_campaign %>%
mutate(transaction_date = as_date(transaction_timestamp)) %>%
left_join(campaigns, by = "household_id", relationship = "many-to-many") %>%
left_join(campaign_descriptions, by = "campaign_id")
# Union of both transactions_campaign and transactions_no_campaign
campaign_df <- rbind(transactions_campaign, transactions_no_campaign)
# Joined with demographics
trans_campaign_demo <- campaign_df %>%
inner_join(demographics, by = "household_id") %>%
mutate(campaign_type = case_when(
is.na(campaign_type) ~ "No Campaign",
campaign_type == "Type A" ~ "Type A",
campaign_type == "Type B" ~ "Type B",
campaign_type == "Type C" ~ "Type C"))
The initial visualization focused on analyzing how total sales varied across campaign types over time. A line chart was chosen as the most effective method to illustrate the performance of each campaign throughout the study period.
# Plot total sales over time by campaign
trans_campaign_demo %>%
group_by(transaction_date, campaign_type) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
ggplot(aes(x = transaction_date, y = total_sales, color = campaign_type)) +
geom_line(linewidth = 1) + # Replaced deprecated `size` with `linewidth`
scale_color_manual(values = c(
"Type A" = "maroon",
"Type B" = "palegreen4",
"Type C" = "orange3",
"No Campaign" = "honeydew4")) +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Total Sales Over Time by Campaign Type",
x = "Transaction Date",
y = "Total Sales ($)",
color = "Campaign Type"
) +
theme_minimal(base_size = 16) +
theme(
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
In terms of total sales, Type A campaigns were the most impactful, followed by Type B and C. In contrast, the No Campaign group recorded the lowest total sales. Another interesting insight is that sales generaly increased over time for households exposed to campaigns, whereas sales for households without any campaign remained relatively stable. Additionally, campaigns Type A and B appear to alternate throughout the year, with the highest point occurred around the December holiday season.
This part of the analysis aimed to identify the demographic groups that generated higher sales during campaigns. By segmenting the audience based on their response to marketing efforts, we can better personalize messaging, improve conversion rates, and reduce spending on less responsive segments.
The heatmap below visualizes total sales associated with any campaign, segmented by the demographic variables age and income. These two variables were chosen due to their strong influence on consumer behavior, purchasing power, and product preferences.
Note that a logarithmic transformation was applied to the data to improve the scaling and readability of the axes.
# Aggregated total sales, filtered just with campaigns, and applied log1p transformation
heatmap_data <- trans_campaign_demo %>%
filter(!is.na(campaign_type)) %>%
group_by(age, income) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
mutate(log_sales = log1p(total_sales))
# Plot heatmap
ggplot(heatmap_data, aes(x = income, y = age, fill = log_sales)) +
geom_tile(color = "white") +
scale_fill_gradient(
low = "lightyellow", high = "darkorange",
name = "Total Sales",
labels = function(x) label_dollar()(expm1(x))
) +
labs(
title = "Total Sales (With Campaign) by Age and Income",
subtitle = "Color shows total sales per group (log-scaled)",
x = "Income Bracket",
y = "Age Group"
) +
theme_minimal(base_size = 16) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 12)
)
Based on the visualization above, we identified our target group as households with an income between 50-74K and an age range of 45-54.
To complement this, box plots for both age and income were created to further illustrate the distribution of sales and to detect any potential outliers.
# Plot boxplot total sales by campaign type and age
trans_campaign_demo %>%
group_by(household_id, campaign_type, age) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
filter(total_sales <= 5000) %>% # Adjusted y axis scale for better visibility
ggplot(aes(x = campaign_type, y = total_sales, fill = campaign_type)) +
geom_boxplot() +
facet_wrap(~ age) +
scale_y_continuous(labels = scales::label_dollar()) +
scale_fill_manual(values = c(
"Type A" = "maroon",
"Type B" = "palegreen4",
"Type C" = "orange3",
"No Campaign" = "honeydew4")) +
labs(
title = "Total Sales per Household by Campaign Type and Age",
x = "Campaign Type",
y = "Total Sales",
fill = "Campaign Type"
) +
theme_minimal(base_size = 16) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
Interestingly, Campaign Type A consistently showed a higher median total
sales across all age groups compared to Types B and C. However, in some
age groups, sales without any campaign actually exceeded those generated
by the campaigns during the study period. Additionally, sales associated
with campaigns exhibited a greater number of outliers.
Below is the income box plot:
trans_campaign_demo %>%
group_by(household_id, campaign_type, income) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
filter(total_sales <= 5000) %>% # Adjusted y axis scale for better visibility
ggplot(aes(x = campaign_type, y = total_sales, fill = campaign_type)) +
geom_boxplot() +
facet_wrap(~ income) +
scale_y_continuous(labels = scales::label_dollar()) +
scale_fill_manual(values = c(
"Type A" = "maroon",
"Type B" = "palegreen4",
"Type C" = "orange3",
"No Campaign" = "honeydew4")) +
labs(
title = "Total Sales per Household by Campaign Type and Income",
x = "Campaign Type",
y = "Total Sales",
fill = "Campaign Type"
) +
theme_minimal(base_size = 16) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
This visualization clearly shows how campaign Type A had a strong impact on certain income groups, particularly those earning 250K and above.
For this part of the analysis, the coupons and coupon_redemptions data frames were merged with the existing dataset. An “inner_join” was applied to include only households that received coupons, since not all campaigns distributed coupons to every household. This ensured the resulting data frame contained observations with coupons that may or may not have been redeemed (indicated by redemption_status). Then, a “left_join” was performed with the coupon_redemptions data frame to incorporate the redemption date. In the code below, a new variable called target was created.
# Joined with coupons data and created variables "target" and "redemption status"
coupon_campaign_df <- suppressWarnings({
trans_campaign_demo %>%
inner_join(coupons, by = c("product_id", "campaign_id")) %>%
left_join(coupon_redemptions, by = c("household_id", "campaign_id", "coupon_upc")) %>%
mutate(
redemption_status = ifelse(is.na(redemption_date), "No redemption", "Redemption"),
target = case_when(
age == "45-54" ~ "Target",
income == "50-74K" ~ "Target",
TRUE ~ "Other"),
)})
The plot below illustrates the coupon redemption rate for each campaign type, comparing the Target group to all Other groups. The redemption rate is calculated as the percentage of purchases that resulted in a coupon redemption.
\[ \text{Redemption Rate} = \left( \frac{\text{Number of Redemptions}}{\text{Total Purchases}} \right) \times 100 \]
coupon_campaign_df %>%
group_by(target, campaign_type, redemption_status) %>%
summarise(purchases = n(), .groups = "drop") %>%
# Calculate redemption rate within each target & campaign_type
group_by(target, campaign_type) %>%
mutate(
total_purchases = sum(purchases),
redemption_rate = 100 * purchases / total_purchases
) %>%
filter(redemption_status == "Redemption") %>%
ungroup() %>%
mutate(target = factor(target, levels = c("Target", "Other"))) %>%
# Plot redemption rate by campaign type and target group
ggplot(aes(x = campaign_type, y = redemption_rate, fill = target)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7), width = 0.6) +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
scale_fill_manual(values = c("Target" = "darkorange3", "Other" = "honeydew4")) +
labs(
title = "Coupon Redemption Rate by Campaign Type and Target Group",
subtitle = "% of purchases that had a coupon redeemed",
x = "Campaign Type",
y = "Redemption Rate (%)",
fill = "Group"
) +
theme_minimal(base_size = 16) +
theme(
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 12)
)
Despite the fact that campaign Type A generated higher total
sales, the visualization above shows that Types B and C
achieved higher coupon redemption rates — with Type B reaching
nearly 15%. This suggests that, while Type A drove more
spending overall, Types B and C were more effective in
encouraging coupon usage. Overall, the redemption rates appear to be
relatively low across all campaign types.
Next, the analysis focused on average sales per household for both the Target and Other groups, in order to evaluate sales performance across the weeks during and after the campaign period. The relative_week variable was created to measure time relative to each household’s campaign start date, allowing comparison among different households regardless of when the campaign began.
trans_campaign_demo <- trans_campaign_demo %>%
mutate(
relative_week = floor(as.numeric(difftime(transaction_date, start_date, units = "weeks")))
)
households_campaign <- trans_campaign_demo %>%
filter(transaction_date >= start_date & transaction_date <= end_date) %>%
distinct(household_id)
weekly_purchases <- trans_campaign_demo %>%
filter(household_id %in% households_campaign$household_id) %>%
mutate(target = case_when(
age == "45-54" ~ "Target",
income == "50-74K" ~ "Target",
TRUE ~ "Other")) %>%
group_by(relative_week, target) %>%
summarise(
avg_sales_per_household = sum(sales_value) / n_distinct(household_id),
.groups = "drop"
)
avg_campaign_duration <- mean(
as.numeric(difftime(trans_campaign_demo$end_date, trans_campaign_demo$start_date, units = "weeks")),
na.rm = TRUE
)
ggplot(weekly_purchases, aes(x = relative_week, y = avg_sales_per_household, color = target)) +
geom_line(linewidth = 1.2) + # updated to avoid warning
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
geom_vline(xintercept = ceiling(avg_campaign_duration), linetype = "dashed", color = "red") +
labs(
title = "Average Weekly Sales per Household: Target vs. Others",
subtitle = "Week 0 = Campaign Start; Red dashed lines = Campaign Start & End (avg)",
x = "Weeks Relative to Campaign Start",
y = "Avg Sales per Household ($)",
color = "Group"
) +
theme_minimal(base_size = 16) +
theme(
plot.title = element_text(face = "bold"),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12)
) +
scale_color_manual(values = c("Target" = "darkorange3", "Other" = "honeydew4")) +
scale_y_continuous(labels = scales::label_dollar())
The visualization shows that during the campaign period, the
Target group had higher average sales compared to the
Other groups. However, after the campaign ended, the
Other groups demonstrated a slight improvement in performance.
Additionally, after the end of the campaigns the average sales remained
lower than those observed during the campaign period.
To incorporate product details, the dataset was merged with the products data frame using an “inner join”, ensuring that only observations with matching product information were included.
This analysis focused specifically on the Type A campaign, as previous results indicated it had the strongest performance in terms of total sales.
# Identified top 5 products within Type A
df_top_type_a <- trans_campaign_demo %>%
inner_join(products, by = "product_id") %>%
filter(campaign_type == "Type A") %>%
group_by(product_type) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
arrange(desc(total_sales)) %>%
slice_max(total_sales, n = 5)
top5_product_types <- df_top_type_a$product_type
# Prepared comparison by target group
df_top5_target_compare <- trans_campaign_demo %>%
inner_join(products, by = "product_id") %>%
filter(campaign_type == "Type A", product_type %in% top5_product_types) %>%
mutate(
target_group = case_when(
age == "45-54" ~ "Target",
income == "50-74K" ~ "Target",
TRUE ~ "Non-Target"
)
) %>%
group_by(product_type, target_group) %>%
summarise(total_sales = sum(sales_value), .groups = "drop") %>%
mutate(
target_group = factor(target_group, levels = c("Non-Target", "Target"))
)
# Plot bar
ggplot(df_top5_target_compare, aes(x = total_sales, y = fct_reorder(product_type, total_sales), fill = target_group)) +
geom_col(position = position_dodge(width = 0.8), width = 0.7) +
scale_x_continuous(labels = label_dollar()) +
scale_fill_manual(values = c("Target" = "darkorange3", "Non-Target" = "honeydew4")) +
labs(
title = "Top 5 Products in Type A Campaign",
subtitle = "Sales: Target vs Non-Target Households",
x = "Total Sales ($)",
y = "Product Type",
fill = "Group"
) +
theme_minimal(base_size = 16) +
theme(
axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12),
plot.title = element_text(face = "bold", size = 14)
)
The visualization above highlights the top five product types, with
gasoline leading the list. It also shows that the
Target group generated higher sales compared to non-target
groups during the assessed period. This finding reinforces the
recommendation that Regork should prioritize this group to optimize
future campaign strategies.
In summary, this report explores the potential to boost sales by introducing more targeted marketing campaigns.
To conduct this analysis, we used The Complete Journey datasets—specifically the transactions, demographics, campaign, campaign_description, coupon, coupon_redemptions, and products datasets. All data manipulation, including inner, left, and anti joins, filtering, calculations, and creation of new variables, was performed within the R environment.
The main insights revealed that the majority of the 2017 sales for the group of 2,469 households who were frequent shoppers at Regork occurred among households exposed to any campaigns, particularly those targeted by Type A campaigns, followed by Types B and C. Sales showed a consistent upward trend toward the end of the year, peaking around the December holiday season.
The age group 45–54 with an income range of 50–74K recorded the highest volume of campaign-related sales. Notably, campaign Type A consistently delivered higher median total sales across all age groups compared to Types B and C. Campaign-related sales also exhibited a greater number of outliers. Campaign Type A had a pronounced effect on high-income groups, especially those earning $250K and above.
While Type A campaigns drove the highest total sales, the coupon redemption rates were relatively low across the board. Interestingly, Types B and C outperformed Type A in redemption effectiveness, with Type B achieving a peak rate of nearly 15%. It suggestst that while Type A drove more spending overall, Types B and C were more successful in encouraging coupon usage.
During the campaign period, the Target group demonstrated higher average sales compared to “Other” groups. However, post-campaign, the “Other” groups showed a slight performance improvement when compared to “Target”, though average sales still remained below campaign-period levels.
In terms of products, gasoline was the top-selling item associated with Type A campaign. Overall, the Target group consistently generated higher sales than non-target groups during the evaluation period for the top products sold for Type A campaign.
This analysis confirms the original business question: there are clear opportunities to increase sales by implementing more targeted marketing campaigns. Regork demonstrated strong sales performance when campaigns were in place. Based on the findings, we recommend the following actions:
The analysis shows that more campaigns lead to more sales, but not necessarily more profit. Therefore, a deeper investigation is recommended to evaluate the balance between campaign volume and marketing effectiveness. Specifically, calculating Return on Investment (ROI) for each campaign type would help identify the most cost-effective strategies.
This study covers a single year of data. Analyzing a longer time period would provide more robust insights into performance trends, seasonality, and long-term campaign impact.
Additionally, further analysis is needed to understand why Campaign Types B and C achieved higher coupon redemption rates compared to Type A. Uncovering the drivers behind this effectiveness could inform improvements to Type A strategies or support a hybrid approach to optimize both sales volume and coupon engagement.