Problem Statement
Working as a data scientist for a national grocery chain named Regork. I have been tasked with identifying a potential area of growth where the company could invest future resources to increase revenue and profits. My manager has asked that I prepare a report that outlines my findings.
This analysis identifies purchasing patterns across income segments to target products for targeted promotions and marketing campaigns.
Methodology
To address the business problem, I used the completejourney dataset to produce data driven insights for the following:
Proposed Solution
This analysis will surface key findings to make data driven decisions. Regork leadership will be empowered to target specific customers when developing marketing plans to incease revenue and profits.
In this step, I load the data, make sure it’s clean and can be used. This includes removing any empty cells and making sure values are in correct format.
**Packages/Libraries Required & Setup
I used the packages below in the analysis
library(completejourney) # shopping transaction data
library(tidyverse) # tidy data
library(ggplot2) # data visualization
library(dplyr) # data manipulation and transform
library(RColorBrewer) # colors for plots
library(reactable) # data tables tabs
library(scales) # visualization- map & display data values
Variables
household_id: unique identifier for each customer to link transactions to demographics
basket_id: unique identifier for transaction by customer
product_id: unique identifier for specific items
quantity: number of units of specific product in transactin
sales_value: revenue from sale of specific product in transaction
income: income segment of household from demographic data
department: grouping of similar products
brand: manuacturer name for product by national or private brand.
product_category: grouping of specific products a level below department
product_type: Describes specific items with grouping level below product category.
Data Preparation (Wrangle & Mutate) This step is where I create data frame, load dataset, process, join data (transactions, products, demographcis), pick variables, reorder, create income categories.
library(tidyverse)
library(completejourney)
setwd("C:/Users/edietz/OneDrive - Blue Origin, LLC/Desktop/Learning RStudio Desktop Files/Final")
full_transactions_data <- readRDS("transactions.rds")
df <- full_transactions_data %>%
inner_join(demographics, by = "household_id") %>%
left_join(products, by = "product_id") %>%
select(household_id, basket_id, product_id, quantity, sales_value, income, department, brand, product_category, product_type) %>%
mutate(income = fct_collapse(
income,
"Regork Income Segment Low" = c("Under 15K", "15-24K", "25-34K", "35-49K"),
"Regork Income Segment Medium" = c("50-74K", "75-99K", "100-124K", "125-149K"),
"Regork Income Segment High" =c("150-174K", "175-199K", "200-249K", "250K+")
))
Analysis 1: Average Number of Products Purchased Per Transaction
This analysis is meant to test if specific customer income segments purchase more product types per transaction than other income segements. To get this, I calculated the average number of products by transaction (basket) to see if there was an association between income level and number of products in a transaction.
I filtered outliers out from product_category column to get better representation of avg number of products purchases.
It showed that high income segment customers averaged the most products (19.1) per transaction/ basket. Regork could use this data to target high income segments with promos and marketing campaigns to try to increase the average products sold per transaction.
df <- df %>%
filter(product_category != "COUPON/MISC ITEMS") %>%
filter(product_category != "FUEL")
Average Number of Products Purchased Per Transaction
df %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:72992 Min. : 0.00
## Class :character 1st Qu.: 3.00
## Mode :character Median : 8.00
## Mean : 14.59
## 3rd Qu.: 19.00
## Max. :220.00
Preparing Data for Plotting
I calculate average quantity per transaction (basket) for the full dataset, then I filtered outlider, grouped by income and transaction, calculated the quanitty of products by transaction and ordered the income levels to plot findings.
avg_products_by_income <- df %>%
filter(product_category != "COUPON/MISC ITEMS", product_category != "FUEL") %>%
group_by(income, basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop_last') %>%
summarize(avg_products = mean(total_quantity, na.rm = TRUE), .groups = 'drop') %>%
mutate(income = factor(income,
levels = c("Regork Income Segment Low",
"Regork Income Segment Medium",
"Regork Income Segment High")))
print(avg_products_by_income)
## # A tibble: 3 × 2
## income avg_products
## <ord> <dbl>
## 1 Regork Income Segment Low 13.0
## 2 Regork Income Segment Medium 15.7
## 3 Regork Income Segment High 19.1
Average Number of Products Purchased Per Transaction by Income Level
Average number of products purhcased per transaction by income level are:
Visualization: Avg Number of Products Purchase Per Transaction by Income
library(ggplot2)
library(scales)
ggplot(avg_products_by_income, aes(x = income, y = avg_products)) +
geom_segment(aes(xend = income, yend = 0),
color = "magenta") +
geom_point(size = 5, aes(color = income)) +
geom_text(aes(label = round(avg_products, 2)),
vjust = 1.5,
size = 4) +
labs(
title = "Average Products Per Transaction by Income",
x = "Regork Income Segment",
y = "Average Products Per Basket"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust =0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
) +
scale_y_continuous(limits = c(0, max(avg_products_by_income$avg_products) * 1.15))
df%>%
filter(income == "Regork Income Segment Low") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:36715 Min. : 0.00
## Class :character 1st Qu.: 3.00
## Mode :character Median : 7.00
## Mean : 13.03
## 3rd Qu.: 16.00
## Max. :220.00
df%>%
filter(income == "Regork Income Segment Medium") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:31169 Min. : 0.00
## Class :character 1st Qu.: 3.00
## Mode :character Median : 8.00
## Mean : 15.68
## 3rd Qu.: 21.00
## Max. :219.00
df%>%
filter(income == "Regork Income Segment High") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:5108 Min. : 0.0
## Class :character 1st Qu.: 5.0
## Mode :character Median : 12.0
## Mean : 19.1
## 3rd Qu.: 26.0
## Max. :140.0
2. Product Brands With Higher Sales Value This analysis compares avg sales value of National vs Private brand products based on income. I wanted to compare sales of products by brand to analyze customer preference. By doing this, it showed me National brands generate higher revenue for Regorks than the private brand.
Preparing data for Plotting
Top 10 Product Categories by Sales Value
sales_by_segment <- df %>%
filter(product_category !="COUPON/MISC ITEMS",
product_category !="FUEL") %>%
filter(brand %in% c("National", "Private")) %>%
group_by(brand,product_category, income) %>%
summarize(
total_sales_value = sum(sales_value, na.rm = TRUE),
.groups = 'drop'
)
top_categories_list <- sales_by_segment %>%
group_by(product_category) %>%
summarize(grand_total = sum(total_sales_value), .groups = 'drop') %>%
arrange(desc(grand_total)) %>%
slice_head(n = 10) %>%
pull(product_category)
top_combined_sales <- sales_by_segment %>%
filter(product_category %in% top_categories_list) %>%
mutate(income = factor(income,
levels = c("Regork Income Segment Low",
"Regork Income Segment Medium",
"Regork Income Segment High")))
print(top_combined_sales)
## # A tibble: 59 × 4
## brand product_category income total_sales_value
## <fct> <chr> <ord> <dbl>
## 1 National BAG SNACKS Regork Income Segment Low 18541.
## 2 National BAG SNACKS Regork Income Segment Medi… 20313.
## 3 National BAG SNACKS Regork Income Segment High 3903.
## 4 National BAKED BREAD/BUNS/ROLLS Regork Income Segment Low 11891.
## 5 National BAKED BREAD/BUNS/ROLLS Regork Income Segment Medi… 15575.
## 6 National BAKED BREAD/BUNS/ROLLS Regork Income Segment High 3386.
## 7 National BEEF Regork Income Segment Low 42153.
## 8 National BEEF Regork Income Segment Medi… 38250.
## 9 National BEEF Regork Income Segment High 8619.
## 10 National BEERS/ALES Regork Income Segment Low 18209.
## # ℹ 49 more rows
**Visualization: Top 10 Product Categories Sold by Brand”
library(ggplot2)
library(scales)
ggplot(top_combined_sales,
aes(x = product_category, y = total_sales_value, fill = income)) +
geom_col()+
facet_wrap(~ brand, scales = "free_x", ncol = 1) +
scale_fill_manual(values = c("Regork Income Segment Low" = "magenta",
"Regork Income Segment Medium" = "blue",
"Regork Income Segment High" = "orange")) +
scale_y_continuous(labels = scales:: comma_format(prefix = "$")) +
labs(
title = "Top 10 Product Categories Sold by Brand",
subtitle = "Total Sales by Income Segment broken down by Brand",
x = "Product Category",
y = "Total Sales Value ($)",
fill = "Income Segment"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Private Brand Average Sales
asv_private <-df%>%
filter(brand =="Private") %>%
summarize(
total_sale = sum(sales_value, na.rm = TRUE),
total_quantity = sum(quantity, na.rm = TRUE)
) %>%
mutate(revenue_per_item = round((total_sale / total_quantity), 2))
reactable(asv_private, bordered = TRUE, striped = TRUE, compact = TRUE)
National Brand Average Sales
asv_national <- df %>%
filter(brand =="National") %>%
summarize(
total_sale = sum(sales_value, na.rm = TRUE),
total_quantity = sum(quantity, na.rm = TRUE)
)%>%
mutate(revenue_per_item = round((total_sale /total_quantity), 2))
reactable(asv_national, bordered = TRUE, striped = TRUE, compact = TRUE)
3. Top Products Regularly Purchased by Income Segment (Low, Medium, High)
This analysis captures the average top 20 most frequently purchased products per transaction (basket)
# Prep data for visualization, calculations, filter names, combine
# Low Income Segment
top_low_prods <- df %>%
filter(income =="Regork Income Segment Low") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size)) %>%
mutate(income_segment = "Low")
# Medium Income Segment
top_mid_prods <- df %>%
filter(income =="Regork Income Segment Medium") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size)) %>%
mutate(income_segment = "Medium")
# High Income Segment
top_up_prods <- df %>%
filter(income =="Regork Income Segment High") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE), .groups = 'drop') %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size)) %>%
mutate(income_segment = "High")
master_top_prods <- bind_rows(top_low_prods, top_mid_prods, top_up_prods) %>%
mutate(income_segment = factor (income_segment, levels = c("Low", "Medium", "High"))) %>%
select(income_segment, product_id, total_quantity, brand, product_category, product_type) %>%
arrange(income_segment, desc(total_quantity))
#Prep Table
reactable(
master_top_prods,
filterable = TRUE,
searchable = TRUE,
showPageSizeOptions = TRUE,
striped = TRUE,
highlight = TRUE,
defaultPageSize = 10,
columns = list(
income_segment = colDef (name = "Income Segment", minWidth = 100),
product_id = colDef(name = "Product ID", minWidth = 80),
total_quantity = colDef(name = "Total Qty (Top 20)", minWidth = 100),
brand = colDef(name = "Brand", minWidth = 80),
product_category = colDef(name = "Product Category", minWidth = 130),
product_type = colDef(name = "Product Type", minWidth =150)
)
)
library(ggplot2)
library(scales)
plot_data <- master_top_prods %>%
mutate(product_category = reorder (product_category, total_quantity))
ggplot(plot_data,
aes(x = total_quantity, y = product_category)) +
geom_segment(aes(xend = 0, yend = product_category),
color = "magenta",
linewidth = 0.5) +
geom_point(aes(color = brand), size = 3) +
facet_wrap(~ income_segment, scales ="free_y", ncol =1) +
labs(
title = "Top 20 Retail Products by Income Segment",
subtitle = "Product Rank by Total Quantity Purchased",
x = "Total Quantity Purchased",
y = "Product Category",
color = "Bran Type"
) +
scale_x_continuous(labels = scales::comma)+
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face ="bold"),
legend.position = "bottom"
)
top_low_prods <- df %>%
filter(income == "Regork Income Segment Low") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_low_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
top_mid_prods <- df %>%
filter(income == "Regork Income Segment Medium") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_mid_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
top_up_prods <- df %>%
filter(income == "Regork Income Segment High") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 20) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_up_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
Summarize Problem Statement
The objective was to identify a specific data driven areas for growth where Regork could invest resources to increase revenue an dprofits. To do this, I analyzed customer purchase patters across income segments to determine high demand products for targeted promotions and marketing campaigns.
Methodology
The method I used surfaces data points on specific products that leadership can take immediate action on to increase revenue. In the immediate term, leadership can spin up marketing campaigns and invest resources where product purchases are going well, while simultaneously place resources where there are opportunities for growth.
I used the completejourney dataset to process and joine transaction, product and demographic data. Key variables were leveraged for targeted data analysis.
The methodology focused on three areas to provide a recommendation to Regork leadership:
Average number of products purchased per transaction based income segment to determine most valuabe consumer segment to invest in immediately
Top products pruchased and reviewed product brands with highest sales for Regork leadership to determine where to invest for immediate revenue growth
Top 20 product most frequently purchased by income segment to create list of products for targeted promotions
Interesting Insights from Analysis
Looking at products prchused most often, Regork could identify which of those products are essential items or nice to have desirable items based on income segment. By doing this, Regork can better understand why products are in high demand and expand to similar products in the same grouping.
Implications to Consumer Based on Analysis
Consumer who are high income may notice increase in individualized marketing for products they purchase frequently. This could make them feel seen and create a deeper emotional bond with Regork. Consumers in the low income segment may notice promotions targeting essential items, which could help their savings go further and create loyalty.
Proposal to Regork CEO
This analysis shares targeted opportunities for revenue and profit growth, giving the CEO the ability to make fully infomred business decisions with immediacy.
With these new data insights on consumer purchases by income segment, Regork can:
Begin targeted marketing campaigns to generate sales on high value and essential items. These campaigns could drive foot traffic in the stores with the potential to generate additional sales on items not advertised.
Focus on product placement by verifying highly desired items are always in stock and placed in areas of high visibility or high traffic in the stores. Store layout and product placement could increase average items per transaction.
Focus on Targetd Promotional Coupons and bogo deals through digital mail, paper mail to drive customers into the store. And then offer targeted promotions at the point of sale as a last opportunity to maximize their spending or to keep them coming back.
Focus on all income segments and what drives their behavior to purhcase more. Lower income segments want deals on staple items, whereas high income segments may be drawn to shop for high value items.
Limitations of Analysis & How it can be Improved
The data used in the analysis is stale and does not allow for real time insights based on any economic shifts. To improve this, having real time data that shows customer spending would allow Regorks to act fast to prevent negative impacts to profit margins.
The data is not segmented by store location or region, leaving less information on demographic data to understand customer behavior like product preferences by culture and spedning habits. It would be valuable to have store location data to target customer segmenets with marketing campaigns to grow the Regork national footprint.