Targetting Low Income Regork Consumers

Introduction

Business Problem

In the pursuit of increasing sales for Regork, I focused in on the group with the lowest spending: the low income demographic. In the data analysis, I defined the low income group as having an annual income range of $30k or lower. With less disposable income, it comes to reason that this group would also have lower spending. This demographic is also more likely to seek sales discounts due to their income level. In considering how to increase spending from this group, I analysed the impact of sales promotions, specifically those that offer a coupon discount for certain products, and how they could be utilized in a way that appealed to the low income demographic and ultimately lead to them increasing the number of purchases they make.

Addressing the Problem

To adress the problem and possible solutions, I analyzed the purchasing trends of different income demographics. In this report, it was separated into four levels, including “low” (as mentioned, this was at a range of 0-$30k), “middle” (defined as $30-$60k), “high-middle” ($60-$100k), and “high” (anything above $100k). I compared these levels on the basis of their top 10 most purchased products for the year of 2017 and how often they use coupons in their transactions. Other data analysis involved only the low income demographic, which looked into their purchasing habits, such as how often coupons were used in their most purchased products, when they make the most purchases over the course of a year, and spikes in purchases during the times when promotional campaigns were active. The results from this analysis shows not only the most opportune times to offer discounts on products, but also which products have the most potential for sales growth among the low income demographic.

Proposed Solution

Based on the results of the analysis in this report, I propose that Regork discounts three specific products (fluid milk products, frozen meat/meat dinners, and frozen pizzas) at their peak purchase times during the year (mid-June, October, and August, respectively). By doing so, the low income demographic will be more inclined to purchase these products, based on their existing purchasing trends during the year. With these products having the highest price among the most popular products for the low income demographic, encouraging the sales of these products will increase profit for the company and spending for low income households.

Packages Required

tidyverse - Collection of packages used for easier data manipulation, visualization, and analysis.

completejourney - Provided the datasets used in this analysis.

dplyr - Used for data wrangling, including filtering, summarizing, and joining datasets.

stringr - Provides functions to use with character strings.

lubridate - Used to manage date and time data.

ggplot2 - Creates visual charts for data analysis.

plotly - Enhances charts created with ggplot2 by adding interactive elements.

slider - Used to calculate rolling averages.

knitr - Not used in the analysis, but helped to show data sets as tables in this report.

Data Preparation

Many data sets were created over the course of this analysis. All can be viewed in the tables below. The main data set used was income_trans, which was created by first joining the transaction data set with the demographic and product data sets. Afterwards, I used mutate to separate income levels and then broke the income ranges into “Low”, “Middle”, “Upper-Middle”, and “High”. This way, we can define and separate the low income range in order to perform the analysis. The next step was to clean up the data by getting rid of unnecessary columns and removing NA values. From this data set, I created multiple others.

library(tidyverse) 
library(completejourney)
library(dplyr)
library(stringr) 
library(lubridate)
library(ggplot2) 
library(plotly) 
library(slider) 
library(knitr)


invisible(ls("package:completejourney"))

transactions <- completejourney::get_transactions() %>%
left_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id")

transactions <- transactions %>%
  mutate(
  income_lower = ifelse(str_detect(income, "Under"), 0, str_extract(income, "\\d{2,}") %>% as.numeric()),
    income_upper = ifelse(str_detect(income, "Under"), str_extract(income, "\\d{2,}") %>% as.numeric(), str_extract(income, "(?<=-)\\d{2,}") %>% as.numeric()),
    income_numeric = ifelse(!is.na(income_upper) & !is.na(income_lower) & income_upper != income_lower,
                            (income_lower + income_upper) / 2,
                            ifelse(!is.na(income_upper), income_upper, income_lower)),
    income_range = cut(income_numeric,
                       breaks = c(0, 30, 60, 100, Inf),
                       labels = c("Low", "Middle", "Upper-Middle", "High"),
                       include.lowest = TRUE)  
  )
  
  
 
income_trans <- transactions %>%
select(income_range, age, package_size, product_type, product_category, department, transaction_timestamp, coupon_disc, sales_value, quantity, product_id, household_id)

income_trans <- na.omit(income_trans)

knitr::kable(head(income_trans))
income_range age package_size product_type product_category department transaction_timestamp coupon_disc sales_value quantity product_id household_id
Middle 35-44 4 OZ ROLLS: BAGELS ROLLS PASTRY 2017-01-01 06:53:26 0 0.50 1 1095275 900
Middle 35-44 85 CT FACIAL TISSUE & PAPER HANDKE FACIAL TISS/DNR NAPKIN GROCERY 2017-01-01 07:10:28 0 0.99 1 9878513 900
High 45-54 11.5 OZ POTATO CHIPS BAG SNACKS GROCERY 2017-01-01 07:26:30 0 1.43 1 1041453 1228
Low 55-64 17.1 OZ REFRIGERATED BAGELS REFRGRATD DOUGH PRODUCTS GROCERY 2017-01-01 07:30:27 0 1.50 1 1020156 906
Low 55-64 5.0 OZ TUNA SEAFOOD - SHELF STABLE GROCERY 2017-01-01 07:30:27 0 2.78 2 1053875 906
Low 55-64 30 OZ FRZN BREADED PREPARED CHICK FRZN MEAT/MEAT DINNERS GROCERY 2017-01-01 07:30:27 0 5.49 1 1060312 906

The next data set was created by grouping the income ranges and product categories, then finding the top products by summarizing the total quantities purchased. This data set was used in creating chart 1.

 top_products <- income_trans %>%
  group_by(income_range, product_category) %>%
  summarise(total_quantity = sum(quantity), .groups = "drop") %>%
  group_by(income_range) %>%
  slice_max(order_by = total_quantity, n = 10) %>%
  ungroup()

knitr::kable(head(top_products))
income_range product_category total_quantity
Low SOFT DRINKS 15234
Low FRZN MEAT/MEAT DINNERS 8426
Low BAKED BREAD/BUNS/ROLLS 7656
Low CHEESE 7530
Low VEGETABLES - SHELF STABLE 6937
Low SOUP 6783

For the second chart, I filtered the income transactions data set to only have the low income range’s purchases. I then created a data set for the top purchased products by grouping the data by product category and sales value, then finding the total quantity and using the slice_max function to find the top ten. I used the distinct () function to ensure product categories weren’t repeated in the chart.

low_income_trans <- income_trans %>%
  filter(income_range == "Low")
  
top_purchased_products <- low_income_trans %>%
  group_by(product_category, sales_value) %>%
  summarise(total_quantity = sum(quantity, na.rm = TRUE), .groups = "drop") %>%
  slice_max(order_by = total_quantity, n = 10)
  
top_purchased_products <- top_purchased_products %>%
  distinct(product_category, .keep_all = TRUE)

knitr::kable(head(top_purchased_products))
product_category sales_value total_quantity
FRZN MEAT/MEAT DINNERS 2.00 1756
SOFT DRINKS 0.59 1102
CANDY - CHECKLANE 1.00 1026
VEGETABLES - SHELF STABLE 1.00 1020
FROZEN PIZZA 2.00 852
BAKED BREAD/BUNS/ROLLS 0.88 849

The third chart was created by making a data set that shows the coupon use for all groups by creating a new column that showed whether a coupon was used, based on the coupon discount being greater than 0. I then found the percentage of the mean time a coupon was used for each income range’s transactions.

coupon_use <- income_trans %>%
  mutate(used_coupon = ifelse(coupon_disc > 0, "Yes", "No")) %>%
  group_by(income_range) %>%
  summarise(percentage = mean(used_coupon == "Yes") * 100, .groups = "drop")

knitr::kable(head(coupon_use))
income_range percentage
Low 1.333945
Middle 1.427174
Upper-Middle 2.365926
High 1.614144

Similarly, I created a data set that shows the coupon usage for top products purchased by the low income group only. This was used in the fourth chart.

coupon_usage <- low_income_trans %>%
  filter(product_category %in% top_products$product_category) %>%
  mutate(used_coupon = ifelse(coupon_disc > 0, "Yes", "No")) %>%
  group_by(product_category) %>%
  summarize(percentage = mean(used_coupon == "Yes") * 100, .groups = "drop")
  
  coupon_usage <- left_join(coupon_usage, top_products, by = "product_category")
  
knitr::kable(head(coupon_usage))
product_category percentage income_range total_quantity
BAG SNACKS 0.6717502 Low 6717
BAG SNACKS 0.6717502 Middle 5155
BAG SNACKS 0.6717502 Upper-Middle 8907
BAG SNACKS 0.6717502 High 4512
BAKED BREAD/BUNS/ROLLS 0.4442329 Low 7656
BAKED BREAD/BUNS/ROLLS 0.4442329 Middle 6736

The data set for the fifth chart involved joining campaign data with coupon redemption data, then finding the total count of coupons and the product category they were used for.

coupon_campaign_data <- coupon_redemptions %>%
  left_join(campaigns, by = "household_id")
  
low_income_with_campaign <- low_income_trans %>%
  left_join(coupon_campaign_data, by = "household_id", relationship = "many-to-many")
  
low_income_with_campaign <- na.omit(low_income_with_campaign)

coupon_count <- low_income_with_campaign %>%
  group_by(coupon_upc, product_category) %>%
  summarise(coupon_count = n(), .groups = "drop")  


category_coupon_count <- coupon_count %>%
  group_by(product_category) %>%
  summarise(total_coupon_count = sum(coupon_count), .groups = "drop") %>%
  arrange(desc(total_coupon_count))

top_10_couponns <- category_coupon_count %>%
  top_n(10, total_coupon_count)

top_10_couponns$total_coupon_count_thousands <- top_10_couponns$total_coupon_count / 1000


knitr::kable(head(top_10_couponns))
product_category total_coupon_count total_coupon_count_thousands
SOFT DRINKS 133521 133.521
FRZN MEAT/MEAT DINNERS 125058 125.058
BAKED BREAD/BUNS/ROLLS 94796 94.796
CHEESE 90850 90.850
BAG SNACKS 70895 70.895
YOGURT 69901 69.901

The sixth chart used a data set that was created by joining the coupon redemptions and campaign descriptions to a new data set. I then used mutate() to create a new column which shows whether a campaign was active or not. I then created a second data set which showed both total daily purchases as well as a 7-day rolling average for purchases. Both data sets were used in the chart.

low_at_campaign <- low_income_trans %>%
  left_join(coupon_redemptions, by = ("household_id"), relationship = "many-to-many") %>%
  left_join(campaign_descriptions, by = "campaign_id")

low_at_campaign <- low_at_campaign %>%
  mutate(campaign_active = ifelse(transaction_timestamp >= start_date & transaction_timestamp <= end_date, "Active", "Inactive"))
  
low_at_campaign <- low_at_campaign %>%
  mutate(transaction_timestamp = as.Date(transaction_timestamp))
  
daily_purchases <- low_at_campaign %>%
  mutate(day = floor_date(transaction_timestamp, "day")) %>%
  group_by(day) %>%
  summarise(total_purchases = n(), .groups = "drop") %>%
  mutate(rolling_avg = slide_mean(total_purchases, before = 6, complete = TRUE))

  
 campaign_periods <- low_at_campaign %>%
  mutate(day = floor_date(transaction_timestamp, "day")) %>%
  filter(campaign_active == "Active") %>%
  distinct(day) %>%
  mutate(ymin = 0, ymax = max(daily_purchases$total_purchases, na.rm = TRUE))
  
 
knitr::kable(head(daily_purchases))
day total_purchases rolling_avg
2017-01-01 1305 NA
2017-01-02 2115 NA
2017-01-03 1840 NA
2017-01-04 1004 NA
2017-01-05 1340 NA
2017-01-06 756 NA
knitr::kable(head(campaign_periods))
day ymin ymax
2017-01-01 0 5536
2017-01-03 0 5536
2017-01-04 0 5536
2017-01-05 0 5536
2017-01-06 0 5536
2017-01-07 0 5536

The final data set was created by taking the low income purchases during campaign’s data set and grouping it by total purchases per day and product category. By doing so, chart 7 is able to show total purchases per day over the course of a year, along with the product category of those purchases.

  top_products_daily <- low_at_campaign %>%
  mutate(day = floor_date(transaction_timestamp, "day")) %>%  
  group_by(day, product_category) %>%
  summarise(total_purchases = n(), .groups = "drop") %>%  
  group_by(day) %>%
  slice_max(order_by = total_purchases, n = 1, with_ties = FALSE) %>%  
  ungroup()
  
  top_products_daily <- top_products_daily %>%
  mutate(day = as.Date(day))
  
knitr::kable(head(top_products_daily))
day product_category total_purchases
2017-01-01 FRZN MEAT/MEAT DINNERS 118
2017-01-02 SOFT DRINKS 121
2017-01-03 BAKED BREAD/BUNS/ROLLS 123
2017-01-04 BAKED BREAD/BUNS/ROLLS 70
2017-01-05 SOFT DRINKS 94
2017-01-06 FRZN MEAT/MEAT DINNERS 48

Exploratory Data Analysis

To begin, I looked into the top 10 most purchased products for each income level. The purpose of this was to see what items the low income level demographic purchases the most, as well as how they compare to other income levels. By doing so, the unique purchases in the low income level can be focused on.

You can hover your cursor over the bars in this chart to see the exact numbers purchased.

 chart1 <- ggplot(top_products, aes(x = reorder(product_category, total_quantity), y = total_quantity, fill = income_range, text = paste("Total: ", total_quantity))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ income_range, scales = "free_y") +
  coord_flip() +
  labs(title = "Top 10 Most Purchased Products by Income Level",
       x = "Product category",
       y = "Total Quantity Purchased") +
  theme(axis.text.x = element_blank())  

ggplotly(chart1, tooltip = "text") %>%
  layout(
    width = 1000,   
    height = 500   
  )

As demonstrated in the graphs, the low income level has lower spending overall, which is to be expected due to them having less disposable income. The income levels also share very similar top purchased products, but with varying quantities of these categories purchased. To see which products sell for the most, and therefore would be most profitable if purchases increased, I created the following chart comparing sales values.

plot2 <- ggplot(top_purchased_products, aes(x = reorder(product_category, sales_value), y = sales_value, 
                                                  text = paste0("Price: $", round(sales_value, 2)))) +
  geom_col(fill = "blue") +  
  coord_flip() +  
  labs(title = "Price of Top Purchased Products",
       x = "Product Category",
       y = "Price ($)") +
  scale_y_continuous(labels = scales::dollar_format()) +  
  scale_x_discrete(drop = FALSE) +
  theme_minimal()

ggplotly(plot2, tooltip = "text")

Based on the chart, fluid milk products would provide the greatest return if sales were to increase. Frozen meat/meat dinners and frozen pizzas would also be profitable with an increase in sales. The next step would be to consider coupon usage by each income level.

plot3 <- ggplot(coupon_use, aes(x = income_range, y = percentage, fill = income_range, text = paste0("Coupon Use: ", round(percentage, 1), "%"))) +
  geom_col(show.legend = FALSE) +
  labs(title = "Percentage of Transactions Using a Coupon by Income Level",
       x = "Income Level",
       y = "Percentage of Transactions") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  
  theme_minimal()
 
 ggplotly(plot3, tooltip = "text")

As shown in the chart, the low income level has the lowest percentage of coupon usage. However, this may be in part due to having lower spending overall. To dive deeper into the low income demographic’s coupon usage, let’s examine how often the low income demographic uses coupons for their top purchased items.

plot4 <- ggplot(coupon_usage, aes(x = reorder(product_category, percentage), y = percentage, fill = product_category, text = paste0("Coupon Use: ", round(percentage, 1), "%"))) +
  geom_col(fill = "pink") +
  coord_flip() +
  labs(title = "Coupon Usage for Top 10 Purchased Products (Low Income)",
       x = "Product Category",
       y = "Percentage of Transactions Using a Coupon") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  
  theme_minimal()

 
 ggplotly(plot4, tooltip = "text") %>%
   layout(
    width = 1000,   
    height = 500
   )

The chart shows that yogurt, frozen pizza, and soup are most often paid for with coupons by low income households. However, it would also be important to consider which products have the most coupons offered.

  plot_coupon <- ggplot(top_10_couponns, aes(x = reorder(product_category, total_coupon_count/1000), y = total_coupon_count_thousands, fill = product_category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Products with Most Offered Coupons (Low Income)",
       x = "Product Category",
       y = "Total Coupon Count (in thousands)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


ggplotly(plot_coupon, tooltip = "text") %>%
  layout(
    showlegend = FALSE,
    width = 1000,   
    height = 500,
    xaxis = list(
      tickformat = ',.0f',  
      range = c(0, max(top_10_couponns$total_coupon_count_thousands) + 10)
    )
   )

The chart shows that many of the products that were purchased by low income households were products where coupons were offered the most. Soft drinks are the most purchased product for the lower income demographic and also have the most coupons available for it. As previously mentioned, fluid milk products, frozen/meat meat dinners, and frozen pizzas have the highest prices out of the mostly commonly purchased items by low income households. Focusing in on these, there are fewer coupons available compared to the other most purchased products. To see if there’s a link between coupons being available and total purchases by low income households, I created the following graph.

line_chart2 <- ggplot() + 
geom_tile(data = campaign_periods, aes(x = day, y = max(daily_purchases$total_purchases, na.rm = TRUE) / 2, 
                                         height = max(daily_purchases$total_purchases, na.rm = TRUE), 
                                         width = 1, fill = "Campaign Active"), alpha = 0.2) +
    geom_line(data = daily_purchases, aes(x = day, y = total_purchases, color = "Total Purchases"), linewidth = 1) +
  geom_line(data = daily_purchases, aes(x = day, y = rolling_avg, color = "7-Day Rolling Avg"), linewidth = 1) +
labs(title = "Daily Purchases over 2017",
       x = "Month",
       y = "Total Purchases",
       color = "Legend",
       fill = "Legend") +
       scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  scale_color_manual(values = c("Total Purchases" = "gray", "7-Day Rolling Avg" = "black")) +  
  scale_fill_manual(values = c("Campaign Active" = "blue")) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  ggplotly(line_chart2, tooltip = "text")

The chart shows that the highest spikes in purchases tend to occur during active marketing campaigns, further evidencing that there is an increase in sales among the low income demographic when discounts are offered. It would be helpful to look at the purchasing trends of the lower income demographic over time in order decide the best time to start and end a marketing campaign for specific products where a discount is provided. The following chart show the top product purchased per day by the low income households over the course of 2017.

Hover cursor over line to see top product category per day.

line_chart_top_products <- ggplot(top_products_daily, aes(x = day, y = total_purchases, group = 1, text = paste("Product:", product_category))) +
  geom_line(linewidth = 1, color = "blue") +
  labs(
    title = "Top Purchased Product Category Per Day",
    x = "Date",
    y = "Total Purchases"
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(line_chart_top_products, tooltip = "text")

This chart provides a better description on when the best time to put out promotions for certain products would be, based on the purchasing trends of the low income demographic. To circle back to chart 2, fluid milk products, frozen meat/meat dinners, and frozen pizzas have the most potential to boost profit due to their high sales value among popular products for the low income demographic. Based on the chart, the ideal time to discount these products would be in August for frozen pizzas, October for frozen meat, and mid-June for fluid milk products.

Summary

This analysis came about in order to address how to increase sales among the low income demographic, and whether discounts offered in the form of coupons could result in higher spending.

I analyzed the problem statement and possible solutions by first dividing the income ranges into four distinct levels, thereby defining the low income level as a certain range. I analyzed the most purchased products for each income range to distinguish what products are most popular among the low income range, as well as products that could appeal to other income groups if discounted (therefore provoking an increase in sales for other groups as well). Analyzing the coupon usage for all four income groups also provided meaningful data on how often each group uses coupons in their purchases, and whether a coupon discount would be enticing to multiple ranges. Although this report focuses on how to increase purchases within the low income range, the data on coupon usage suggests that the other three ranges are often more likely to use coupon discounts, meaning that discounting products at strategic times throughout the year could increase spending overall, not just for specific income ranges. Comparing the prices of the most purchased products for the low income range also provided insight on what products would generate the most revenue with an increase in purchases, which I then focused on in further analysis. Furthermore, comparing coupon usage to how often coupons are available for certain products helped to reveal products where sales could potentially increase if coupons were made available more often. My analysis concluded by analyzing the daily purchases during the course of 2017, with the periods of active marketing campaigns highlighted to show when spikes in sales occurred during these times. I also provided a chart showing what the most purchased product was each day over the course of the year, which revealed the times that a product was most likely to be purchased, which could be further increased if a discount was offered.

Implications

Focusing in on the three most purchased products with the highest price among the low income demographic (frozen pizzas, frozen meat dinners, and fluid milk products), my analysis showed that the top purchasing times for these products would be August, October, and mid-June, respectively. By discounting these products at these times, Regork could see an increase in sales by the low income demographic. My analysis also shows that the other three income levels (middle, upper-middle, and high) favor similar products and have higher coupon usage, suggesting that discounting these products at peak purchase times would increase spending among these other three income ranges as well, overall increasing profit for the company.

Limitations

This analysis is subject to several limitations. First of all, I was only provided data for the year of 2017. This analysis could be improved with access to longer-term trends, which would show how often consumer tastes change and whether peak purchasing times for the low income demographic vary significantly per year. This analysis only looks at the demographic of income level, and doesn’t take into account age, household size, number of kids, and other factors which may also play a part in top products or purchase times.