We are working in Regork company, which is an industry leader, is devoted to long-term growth an constant progress. To strengthen its market position and increase profitability, Regork intends to seek and capitalize on new growth prospects. This project will analyze market trends, consumer demands, and competitive landscapes to identify a key area for future investment. Regork aims to increase its product/service offerings or enter new markets by employing data-driven insights and inventive thinking, resulting in revenue growth. This project’s collaborative and strategic strategy seeks to position Regork for long-term success and sustainable expansion.
In this analysis, we:
Identify customer segments of five loyalty levels including: Loyal, Regular, Repeat, Occasional, Prospective. We want to focus non-loyal customers and product they typically purchase and current marketing campaigns to target these top products bought by newer customers.
Determine the highest quantity of item sales for prospective customers and compare current marketing strategies among these items
Find compliments to those items currently missing marketing campaigns among all transactions.
Identify when and how to host a new marketing campaign
The following R packages are required in order to run this R project analysis:
library(completejourney) # data frames describing transactions from Regork in 2017
## Warning: package 'completejourney' was built under R version 4.3.3
library(ggplot2) # visualization of data
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyverse) # packages for data manipulation and exploration
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
library(stringr) # manipulation of string characters
library(dplyr) # data manipulation tools
The data for this R project is accessed through completejourney. The completejourney data sets are historical data frames of data based on grocery transactions during the year 2017.
To gain access to the full datasets, the following code must be run. These are very large datasets and therefore may take time to load.
# Accessing the full transactions dataset
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
## household_id store_id basket_id product_id quantity sales_value retail_disc
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 900 330 31198570044 1095275 1 0.5 0
## 2 900 330 31198570047 9878513 1 0.99 0.1
## 3 1228 406 31198655051 1041453 1 1.43 0.15
## 4 906 319 31198705046 1020156 1 1.5 0.29
## 5 906 319 31198705046 1053875 2 2.78 0.8
## 6 906 319 31198705046 1060312 1 5.49 0.5
## 7 906 319 31198705046 1075313 1 1.5 0.29
## 8 1058 381 31198676055 985893 1 1.88 0.21
## 9 1058 381 31198676055 988791 1 1.5 1.29
## 10 1058 381 31198676055 9297106 1 2.69 0
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## # transaction_timestamp <dttm>
# Accessing the full promotions dataset
promotions <- get_promotions()
promotions
## # A tibble: 20,940,529 × 5
## product_id store_id display_location mailer_location week
## <chr> <chr> <fct> <fct> <int>
## 1 1000050 316 9 0 1
## 2 1000050 337 3 0 1
## 3 1000050 441 5 0 1
## 4 1000092 292 0 A 1
## 5 1000092 293 0 A 1
## 6 1000092 295 0 A 1
## 7 1000092 298 0 A 1
## 8 1000092 299 0 A 1
## 9 1000092 304 0 A 1
## 10 1000092 306 0 A 1
## # ℹ 20,940,519 more rows
# These are the other datasets used in this exploratory analysis
campaigns
## # A tibble: 6,589 × 2
## campaign_id household_id
## <chr> <chr>
## 1 1 105
## 2 1 1238
## 3 1 1258
## 4 1 1483
## 5 1 2200
## 6 1 293
## 7 1 529
## 8 1 536
## 9 1 568
## 10 1 630
## # ℹ 6,579 more rows
coupons
## # A tibble: 116,204 × 3
## coupon_upc product_id campaign_id
## <chr> <chr> <chr>
## 1 10000085207 9676830 26
## 2 10000085207 9676943 26
## 3 10000085207 9676944 26
## 4 10000085207 9676947 26
## 5 10000085207 9677008 26
## 6 10000085207 9677052 26
## 7 10000085207 9677385 26
## 8 10000085207 9677479 26
## 9 10000085207 9677791 26
## 10 10000085207 9677878 26
## # ℹ 116,194 more rows
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
In this section we present datasets and lists that we created, combined, and organized for our further analysis in the next section.
First, we classified the customers into five levels loyalty: Loyal, Regular, Repeat, Occasional, Prospective.
loyalty_levels <- list(
"Loyal", "Regular", "Repeat", "Occasional", "Prospective"
)
We used these levels to create two new data frames. One for all customer loyalty levels and one specifically for those prospective customers. We define the loyalty level based on five quantiles of the data set. For example, the customers with a transaction frequency in the top 20% will be considered a ‘Loyal’ customer.
loyalty_df <- transactions %>%
group_by(household_id) %>%
summarise(
first_date = as.POSIXct(min(transaction_timestamp)),
last_date = as.POSIXct(max(transaction_timestamp)),
baskets = n_distinct(basket_id)
) %>%
mutate(
difference = as.numeric(difftime(last_date, first_date, units = "days"))
) %>%
group_by(household_id) %>%
summarise(freq = case_when(
difference <= 1 ~ 365,
TRUE ~ mean(difference / (baskets), na.rm = TRUE)
)) %>%
arrange((freq)) %>%
mutate(loyalty = factor(ntile(freq, 5), labels = loyalty_levels))
prospective_cust <- loyalty_df %>%
filter(loyalty == "Prospective")
Our Exploratory Data Analysis section provides an outline for the data manipulation and visualization performed to address the problem statement.
To find the ten most common purchases for the ‘prospective’ customers in 2017, we used the transactions dataframe alongside the filtered loyalty dataframe and products. Using the quantity for each product type, we found the most common items.
top_ten <- transactions %>%
right_join(prospective_cust, by = "household_id") %>%
inner_join(products, by = "product_id") %>%
group_by(product_type) %>%
summarise(total = n()) %>%
arrange(desc(total)) %>%
slice_head(n = 10)
top_ten
## # A tibble: 10 × 2
## product_type total
## <chr> <int>
## 1 FLUID MILK WHITE ONLY 1103
## 2 SOFT DRINKS 12/18&15PK CAN CAR 712
## 3 YOGURT NOT MULTI-PACKS 673
## 4 SFT DRNK 2 LITER BTL CARB INCL 628
## 5 SHREDDED CHEESE 571
## 6 BANANAS 473
## 7 MAINSTREAM WHITE BREAD 438
## 8 POTATO CHIPS 437
## 9 CANDY BARS (SINGLES)(INCLUDING 403
## 10 SS ECONOMY ENTREES/DINNERS ALL 364
From that data, we segmented those products with current marketing strategies and those without.
coupon_count <- coupons %>%
inner_join(products, by = "product_id") %>%
right_join(top_ten, by = "product_type") %>%
group_by(product_type) %>%
summarise(coupon = n_distinct(campaign_id, na.rm = TRUE))
coupon_check <- function(coupon) {
ifelse(coupon > 0, "Yes", "No")
}
top_ten %>%
inner_join(coupon_count, by = "product_type") %>%
mutate(coupon_status = coupon_check(coupon)) %>%
ggplot(aes(x = product_type, y = total, fill = coupon_status)) +
geom_bar(stat = "identity") +
labs(title = "Total Product Quantity for Top Items", x = "Product Type", y = "Total Product Quantity") +
coord_flip()
The bar graph shows top ten products are bought by ‘prospective’ customers. The ‘coupon_status’ shows that only one product was left without a marketing campaign, but was still high on the ‘prospective’ customers purchase list. This lead us to believe that there is an opportunity to exploit this want of candy bars.
We figured out the patterns in the purchases made by all customers of the less targeted candy bar.
# collecting all transactions involving "CANDY BARS (SINGLES)"
candy <- transactions %>%
inner_join(products, by = "product_id") %>%
filter(
str_detect(product_type, regex("CANDY BARS", ignore_case = T)),
str_detect(product_type, regex("(SINGLES)", ignore_case = T)),
) %>%
group_by(basket_id) %>%
summarise(total = n_distinct(basket_id))
#compile the top 10 complimentary products for candy bars
compliments <- transactions %>%
right_join(candy, by = "basket_id") %>%
inner_join(products, by = "product_id") %>%
filter(
!str_detect(product_type, regex("CANDY BARS", ignore_case = T)),
!str_detect(product_type, regex("(SINGLES)", ignore_case = T))
) %>%
group_by(product_type) %>%
summarise(total = n()) %>%
arrange(desc(total)) %>%
slice_head(n = 10)
compliments
## # A tibble: 10 × 2
## product_type total
## <chr> <int>
## 1 FLUID MILK WHITE ONLY 2268
## 2 SOFT DRINKS 12/18&15PK CAN CAR 1663
## 3 SFT DRNK 2 LITER BTL CARB INCL 1581
## 4 MAINSTREAM WHITE BREAD 1175
## 5 SFT DRNK SNGL SRV BTL CARB (EX 1166
## 6 YOGURT NOT MULTI-PACKS 1148
## 7 SHREDDED CHEESE 1089
## 8 BANANAS 1024
## 9 POTATO CHIPS 981
## 10 SNACK CAKE - MULTI PACK 949
# check the number of coupons for each product among the top 10 complimentary list
coupon_compliments <- coupons %>%
inner_join(products, by = "product_id") %>%
right_join(compliments, by = "product_type") %>%
group_by(product_type) %>%
summarise(coupon = n_distinct(campaign_id, na.rm = TRUE))
# validate the 'coupon_status' for each product
compliments %>%
inner_join(coupon_compliments, by = "product_type") %>%
mutate(coupon_status = coupon_check(coupon)) %>%
ggplot(aes(x = product_type, y = total, fill = coupon_status)) +
geom_bar(stat = "identity") +
labs(title = "Total Product Quantity for Top Compliment Items", x = "Product Type", y = "Total Product Quantity") +
coord_flip()
The bar graph shows top ten complimentary products bought by all
customers. The ‘coupon_status’ shows that two products are left without
a marketing campaign. This leads us to believe that these should become
the focus of your new marketing campaign.
In order to capitalize on your ‘prospective’ customer-base, we believe your new marketing campaign should focus on the two complimentary products: Snack Cake - Multi-Pack and Soft Drink Singles. These two products are commonly bought simultaneously with Candy Bars, already determined to be a common product for ‘prospective’ customers. Combining these two pieces, a marketing strategy focused on the complimentary products will hopefully attract the ‘prospective’ customers to the deals they may receive with their loyalty to Regork.
The final piece is to figure out when to host a marketing campaign.
To do this, we summarized the daily sales of candy bars.
daily_candy <- transactions %>%
inner_join(products, by = "product_id") %>%
filter(
str_detect(product_type, regex("CANDY BARS", ignore_case = TRUE)),
str_detect(product_type, regex("(SINGLES)", ignore_case = TRUE))
) %>%
mutate(day = yday(transaction_timestamp)) %>%
group_by(day) %>%
summarise(total = sum(quantity, na.rm = TRUE),
sales_value = sum(sales_value, na.rm = TRUE))
daily_candy
## # A tibble: 364 × 3
## day total sales_value
## <dbl> <dbl> <dbl>
## 1 1 49 22.6
## 2 2 36 16.5
## 3 3 97 39.8
## 4 4 47 21.2
## 5 5 51 23.8
## 6 6 33 12.4
## 7 7 141 54.3
## 8 8 66 27.8
## 9 9 63 25.6
## 10 10 36 16.0
## # ℹ 354 more rows
With this information gathered, we ran a simulation for each possible monthly combination of sales. We wanted to find the highest sales revenue in a period of 30 days to find what time of year the most candy bars are being bought. We wanted the highest time period since this is most likely when these “prospective customers” are in the store buying candy bars.
# Create an empty data frame to store monthly results
monthly_results_list <- list()
# Iterate through each day and calculate the monthly total for the highest month
for (start_day in 1:(365-29)) {
month_campaign <- daily_candy %>%
filter(day %in% (start_day:(start_day + 29)))
# Calculate the monthly total
total <- sum(month_campaign$sales_value, na.rm = TRUE)
# Store the results in the data frame
monthly_results_list[[start_day]] <- data.frame(
month_start = as.numeric(start_day),
month_end = start_day + 29,
month_total = total)
}
# combine the list into a data frame
monthly_results <- do.call(rbind, monthly_results_list)
# Find the top month
top_month <- monthly_results %>%
arrange(desc(month_total)) %>%
slice(1)
print(top_month)
## month_start month_end month_total
## 1 70 99 1021.62
This tells us that the greatest number of transactions took place between 03/12 and 04/10.
To visualize the daily sales with the suggested timeline for marketing these products, we created a line graph.
# determines the days within the specified timeline
daily_candy <- daily_candy %>%
mutate(in_top_month = between(day, top_month$month_start, top_month$month_end))
# creating the plot
daily_candy_plot <- daily_candy %>%
ggplot(aes(day, total)) +
geom_line() +
# add a shaded region for the top month
geom_rect(data = daily_candy %>% filter(in_top_month),
aes(xmin = min(day), xmax = max(day), ymin = -Inf, ymax = 250),
fill = "lightblue", alpha = 0.03) +
# labels the start and end dates for the campaign
geom_text(data = top_month,
aes(x = (month_start + month_end) / 2, y = -Inf, label = paste("Start:", format(as.Date(month_start, format="%j"), "%m/%d"), "\nEnd:", format(as.Date(month_end, format="%j"), "%m/%d"))),
vjust = -1, hjust = 0.5, color = "red", size = 3) +
labs(title = "Daily Candy Sales",
x = "Day",
y = "Total Sales") +
theme_minimal() +
coord_cartesian(ylim = c(0, 250))
daily_candy_plot
We aligned when we thought a fresh marketing campaign should concentrate on Snack Cake - Multi-Pack and Soft Drink Singles. By comparing the sales of the non-marketed potential client product, Candy Bars, with the non-marketed complementing products, this new endeavor should take place around 03/12 and 04/10 during any year.
It is also important to find if this campaign is worth the trouble of starting before beginning efforts to focus on ‘prospective’ customers.
To find how much impact this campaign would have on total sales revenue, we computed the percentage of total sales revenue that each loyalty group contributes and then created a bar plot where each bar represents the total sales revenue for a loyalty level, with the bars filled based on loyalty level.
transactions %>%
inner_join(loyalty_df) %>%
group_by(loyalty) %>%
summarise(sales_revenue = sum(sales_value)) %>%
mutate(freq = paste(round(sales_revenue / sum(sales_revenue) * 100, 2), "%" )) %>%
ggplot(aes(x = loyalty, y = sales_revenue, fill = loyalty)) +
geom_bar(stat = "identity") +
geom_label(aes(label = freq)) +
labs(
title = "Sales Revenue by Loyalty Level",
x = "Loyalty"
) +
scale_y_continuous(name = "Total Sales Revenue", labels = scales::dollar)
This reveals what we already knew before starting: that ‘loyal’ customers generate the highest percentage of yearly sales revenue, followed by ‘regular’ customers and ‘repeat’ customers, all the way down to ‘prospective’ customers.
While it is crucial to recognize the potential when trying to capitalize on new market segments, in this case, the ‘prospective’ customers, doing so may prove challenging. Despite constituting 20% of all total customers, they contribute only 3.37% to total yearly sales revenue.
This group obviously has the most room to grow. However, given the unpredictability associated with customers who do not frequently shop with Regork, persuading them with coupons for products they may or may not purchase in a given transaction can be a sporadic process. Customers who are not currently loyal may find it challenging to work with coupons they are not accustomed to, making it difficult to encourage them to buy additional products.
It may be more beneficial to focus on customers within the ‘regular’ and ‘repeat’ segments, as they have more concrete data on their transactions. This would make finding complementary products and persuading them to buy new products easier than with ‘prospective’ customers. Additionally, these segments account for a higher share of total revenue and are more willing to engage in higher transactions with us.
(i) Problem Statement:
The problem statement addressed in this report is related to understanding the impact of focusing on new customers and the frequency of non-loyal customers on revenue generation. Specifically, the goal is to assess how much more revenue is generated as the frequency between visits decreases.
(ii) Methodology:
To address this problem statement, we segmented customers into various levels of loyalty, focusing on the frequency of ‘prospective’ customers and the products they typically purchase. We also analyzed marketing campaigns associated with those products and assessed the sales value contribution by each level of customer loyalty.
(iii) Insights from Analysis:
A product usually bought by ‘prospective’ customers and is a non-marketed product is single-wrapped candy bars. There is a clear trend that customers usually buy candy bars with the multipack of snack cakes and single soft drinks.
Our analysis also showed the percentage contribution to total sales revenue by each loyalty level, and based on the numbers we believe there will not be much change in the total sales revenue if we choose to target prospective customers.
(iv) Our Recommendation:
Targeting Customers in Different Loyalty Levels: Based on the analysis, it is recommended to target customers in different loyalty levels other than ‘prospective’ customers, as they can generate more sales and contribute more to the total sales revenue. By focusing on these customers, we can potentially increase its revenue and overall profitability.
Targeting ‘Prospective’ Customers: If we choose to target ‘prospective’ customers, marketing campaigns should focus on snack cakes and soft drinks. Since there is already a trend of buying these items together without campaigns, adding campaigns will hopefully show ‘prospective’ customers the kinds of deals they can achieve by maintaining their loyalty to Regork.
(v) Limitations and Suggestions for Improvement:
The analysis depends solely on historical data and assumes that customer behavior remains stable, which may not be the case in a rapidly evolving market environment.
The analysis does not consider external variables, such as changes in market conditions or competitor strategies, which can significantly influence revenue generation.
To enhance the analysis, it is advisable to gather more recent data and integrate external factors into the modeling. Furthermore, employing advanced machine learning algorithms could provide a more precise understanding of customer behavior and revenue generation.