As a data scientist as Regork, I have been asked to identify potential growth opportunities that Regork can invest in to grow revenue. Regork prides itself on being a premier grocer that adds value to consumer’s lives. This analysis explores which products Regork can target based on consumer income level for increased sales through marketing activities that will add value for Regork and its consumers.
Analysis Question: Which products can we target based on the customer’s income level to increase the quantity of items purchased per basket (basket size)?
To answer this question fully, I’ll address these additional questions along the way:
Definitions of key variables:
Libraries
library(tidyverse, quietly = TRUE) # compilation of packages used for tidying and visualizing data
library(completejourney, quietly = TRUE) # grocery store transaction, product, and demographic data
library(RColorBrewer) # improved colors for plots
library(reactable) # for displaying data tables
palette <- brewer.pal(n = 9, name = "Pastel1")
Tidying Data: Joining, Filtering, Mutating, and Factor Collapse
# Retrieving and creating data frame from completejourney library, then
# joining transactions, products, and demographics data tables
df <- get_transactions() %>%
inner_join(demographics, by = "household_id") %>%
left_join(products, by = "product_id") %>%
select(household_id, basket_id, product_id, quantity, sales_value, income, department, brand, product_category, product_type) %>% # narrow down to variables of interest
mutate(income = fct_collapse(
income,
"low income" = c("Under 15K", "15-24K", "25-34K", "35-49K"),
"middle income" = c("50-74K", "75-99K", "100-124K"),
"upper income" = c("125-149K", "150-174K", "175-199K", "200-249K", "250K+")
)) # recode income levels for ease of analysis
Starting Point
# starting point
tibble(df)
What is the average quantity of products purchased per basket?
There are high outliers in the quantity field of this data set relating to two categories in the product_category column. The observations relating to COUPON/MISC ITEMS and FUEL have been filtered out. From this analysis we learn that on average, customers purchase 15 items (rounded up) on each trip to Regork.
df <- df %>%
filter(product_category != "COUPON/MISC ITEMS") %>%
filter(product_category != "FUEL")
# what is the average quantity of products per basket?
df %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:72992 Min. : 0.00
## Class :character 1st Qu.: 3.00
## Mode :character Median : 8.00
## Mean : 14.59
## 3rd Qu.: 19.00
## Max. :220.00
The following analysis demonstrates that average basket size does change by income level. The average basket sizes based on income levels are:
df %>%
filter(income == "low income") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:36715 Min. : 0.00
## Class :character 1st Qu.: 3.00
## Mode :character Median : 7.00
## Mean : 13.03
## 3rd Qu.: 16.00
## Max. :220.00
df %>%
filter(income == "middle income") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:27340 Min. : 0.0
## Class :character 1st Qu.: 3.0
## Mode :character Median : 9.0
## Mean : 15.6
## 3rd Qu.: 21.0
## Max. :219.0
df %>%
filter(income == "upper income") %>%
group_by(basket_id) %>%
summarize(total_quantity = sum(quantity, na.rm = TRUE)) %>%
summary(total_quantity)
## basket_id total_quantity
## Length:8937 Min. : 0.00
## Class :character 1st Qu.: 4.00
## Mode :character Median : 10.00
## Mean : 17.89
## 3rd Qu.: 25.00
## Max. :210.00
Do larger basket sizes lead to higher sales?
We can see from this scatter plot a positive correlation between basket size and sales value. This affirms that if we can increase basket size through targeted marketing campaigns, revenue will increase.
df %>%
group_by(basket_id) %>%
summarize(
basket_size = sum(quantity, na.rm = TRUE),
total_sale = sum(sales_value, na.rm = TRUE)) %>%
ggplot(aes(total_sale, basket_size)) +
geom_point(alpha = .1) +
geom_smooth(method = "lm", se = FALSE, color = palette[1]) +
geom_smooth(method = "auto", se = FALSE, color = palette[2]) +
coord_flip() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Relationship Between Basket Size & Total Sale",
subtitle = "Based on Regork transaction data from 2016-2017",
y = "Basket Size",
x= "Total Sale"
)
Which brand of products has a higher sales value?
As we work to determine which products we could target for consumers to purchase more of, it is relevant to ask how we might maximize the increase in sales value. One way to do that is to compare the sales_value of products based on whether it comes from a national or private brand.
Sales_value represents the dollar amount Regork receives from the sale of a particular product. We can see from this analysis that Regork makes more on average from selling national brand products.
# Average Sales Value per National Brand Product
asv_national <- df %>%
filter(brand == "National") %>%
summarise(
total_sale = sum(sales_value, na.rm = TRUE),
total_quantity = sum(quantity, na.rm = TRUE)
) %>%
mutate(revenue_per_item = round((total_sale / total_quantity), 2))
reactable(asv_national, bordered = TRUE, striped = TRUE, compact = TRUE)
# Average Sales Value per Private Brand Product
asv_private <- df %>%
filter(brand == "Private") %>%
summarise(
total_sale = sum(sales_value, na.rm = TRUE),
total_quantity = sum(quantity, na.rm = TRUE)
) %>%
mutate(revenue_per_item = round((total_sale / total_quantity), 2))
reactable(asv_private, bordered = TRUE, striped = TRUE, compact = TRUE)
What are the most common products purchased by each income level?
This section of the analysis identifies the top 30 products purchased by each income level. Why 30? 30 represents roughly double the average number of products purchased on each visit to Regork. Remember that the average basket size across all income levels is ~15 items. By examining the top 30 products, we can identify the following:
top_low_prods <- df %>%
filter(income == "low income") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 30) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_low_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
p1 <- top_low_prods %>%
slice(1:15) %>%
ggplot(aes(x = total_quantity, y = product_type)) +
geom_point(aes(color = department), size = 3) +
geom_segment(
aes(x = 0, xend = total_quantity, y = product_type, yend = product_type, color = department),
linewidth = 0.1
) +
labs(title = "Top 15 Products Purchased by Low Income Households",
subtitle = "Based on Regork transaction data from 2016-2017",
x = "Total Quantity Purchased",
y = "Product"
)
p1
top_mid_prods <- df %>%
filter(income == "middle income") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 30) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_mid_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
p2 <- top_mid_prods %>%
slice(1:15) %>%
ggplot(aes(x = total_quantity, y = product_type)) +
geom_point(aes(color = department), size = 3) +
geom_segment(
aes(x = 0, xend = total_quantity, y = product_type, yend = product_type, color = department),
linewidth = 0.1
) +
labs(title = "Top 15 Products Purchased by Middle Income Households",
subtitle = "Based on Regork transaction data from 2016-2017",
x = "Total Quantity Purchased",
y = "Product"
)
p2
top_up_prods <- df %>%
filter(income == "upper income") %>%
group_by(product_id) %>%
summarize(total_quantity = sum(quantity)) %>%
slice_max(total_quantity, n = 30) %>%
left_join(products, by = "product_id") %>%
select(-c(manufacturer_id, package_size))
reactable(top_up_prods, filterable = TRUE, searchable = TRUE, defaultPageSize = 5, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
p3 <- top_up_prods %>%
slice(1:15) %>%
ggplot(aes(x = total_quantity, y = product_type)) +
geom_point(aes(color = department), size = 3) +
geom_segment(
aes(x = 0, xend = total_quantity, y = product_type, yend = product_type, color = department),
linewidth = 0.1
) +
labs(title = "Top 15 Products Purchased by Upper Income Households",
subtitle = "Based on Regork transaction data from 2016-2017",
x = "Total Quantity Purchased",
y = "Product"
)
p3
My goal was to analyze which products Regork could target in marketing campaigns to increase overall revenue by increasing basket size. By utilizing the available transaction data I was able to identify 15 products for each income level Regork can target for marketing and promotion offerings.
In order to address the question of this analysis I broke the question down into sub-sections. I first sought to understand how many items customers tend to purchase in a single visit, the average of which is ~15 (this varies slightly by income level). Although it may seem obvious, I then confirmed that larger basket sizes do in fact correlate with larger sales_values, which means greater revenue to Regork. Next, I sought to understand if national brand or private brand goods yield greater revenue for Regork. This analysis showed that national brand product sales have a greater positive impact to our revenue. Finally, I sought to understand the 30 most common products purchased by each income level.
In this methodology is an assumption that it is easier to get customers to purchase more of products they already purchase than to push a customer to a new product altogether.
Based on this analysis, Regork should target the following products for cross-promotions to each income level:
top_low_prods %>%
slice(16:30) %>%
select(product_id, product_type) %>%
reactable(defaultPageSize = 10, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
top_mid_prods %>%
slice(16:30) %>%
select(product_id, product_type) %>%
reactable(defaultPageSize = 10, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
top_up_prods %>%
slice(16:30) %>%
select(product_id, product_type) %>%
reactable(defaultPageSize = 10, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = TRUE)
Based on this analysis, I propose that Regork designs promotions (i.e. discounts) in the following structure: Offer discounts so that when a customer purchases an item in the top 15 for their income level, they receive a discount on an item from the remaining items in the top 30 products, provided that this purchase is made during the same trip. Such a promotion would have the effect of increasing basket_size and therefore revenue for Regork; and for the consumer it would help them save on the products they already know and love. The paired item in the discount should be from a national brand to maximize revenue generation.
This analysis makes an assumption that is worth testing to determine if it holds. Further analysis could also be done to maximize this approach and make targeted promotions even more individualized to the consumer. For example, the pairing of products for discounts could be changed by store_id to improve offerings at a local level. It would also be worth noting if there are significant changes if income levels were split into more subgroups (e.g. 5 groups, instead of 3).