Business Problem
In the pursuit of increasing sales for Regork, I focused in on the group with the lowest spending: the low income demographic. In the data analysis, I defined the low income group as having an annual income range of $30k or lower. With less disposable income, it comes to reason that this group would also have lower spending. This demographic is also more likely to seek sales discounts due to their income level. In considering how to increase spending from this group, I analysed the impact of sales promotions, specifically those that offer a coupon discount for certain products, and how they could be utilized in a way that appealed to the low income demographic and ultimately lead to them increasing the number of purchases they make.
Addressing the Problem
To adress the problem and possible solutions, I analyzed the purchasing trends of different income demographics. In this report, it was separated into four levels, including “low” (as mentioned, this was at a range of 0-$30k), “middle” (defined as $30-$60k), “high-middle” ($60-$100k), and “high” (anything above $100k). I compared these levels on the basis of their top 10 most purchased products for the year of 2017 and how often they use coupons in their transactions. Other data analysis involved only the low income demographic, which looked into their purchasing habits, such as how often coupons were used in their most purchased products, when they make the most purchases over the course of a year, and spikes in purchases during the times when promotional campaigns were active. The results from this analysis shows not only the most opportune times to offer discounts on products, but also which products have the most potential for sales growth among the low income demographic.
Proposed Solution
Based on the results of the analysis in this report, I propose that Regork discounts three specific products (fluid milk products, frozen meat/meat dinners, and frozen pizzas) at their peak purchase times during the year (mid-June, October, and August, respectively). By doing so, the low income demographic will be more inclined to purchase these products, based on their existing purchasing trends during the year. With these products having the highest price among the most popular products for the low income demographic, encouraging the sales of these products will increase profit for the company and spending for low income households.
tidyverse - Collection of packages used for easier data manipulation, visualization, and analysis.
completejourney - Provided the datasets used in this analysis.
dplyr - Used for data wrangling, including filtering, summarizing, and joining datasets.
stringr - Provides functions to use with character strings.
lubridate - Used to manage date and time data.
ggplot2 - Creates visual charts for data analysis.
plotly - Enhances charts created with ggplot2 by adding interactive elements.
slider - Used to calculate rolling averages.
knitr - Not used in the analysis, but helped to show data sets as tables in this report.
Many data sets were created over the course of this analysis. All can be viewed in the tables below. The main data set used was income_trans, which was created by first joining the transaction data set with the demographic and product data sets. Afterwards, I used mutate to separate income levels and then broke the income ranges into “Low”, “Middle”, “Upper-Middle”, and “High”. This way, we can define and separate the low income range in order to perform the analysis. The next step was to clean up the data by getting rid of unnecessary columns and removing NA values. From this data set, I created multiple others.
library(tidyverse)
library(completejourney)
library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(plotly)
library(slider)
library(knitr)
invisible(ls("package:completejourney"))
transactions <- completejourney::get_transactions() %>%
left_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id")
transactions <- transactions %>%
mutate(
income_lower = ifelse(str_detect(income, "Under"), 0, str_extract(income, "\\d{2,}") %>% as.numeric()),
income_upper = ifelse(str_detect(income, "Under"), str_extract(income, "\\d{2,}") %>% as.numeric(), str_extract(income, "(?<=-)\\d{2,}") %>% as.numeric()),
income_numeric = ifelse(!is.na(income_upper) & !is.na(income_lower) & income_upper != income_lower,
(income_lower + income_upper) / 2,
ifelse(!is.na(income_upper), income_upper, income_lower)),
income_range = cut(income_numeric,
breaks = c(0, 30, 60, 100, Inf),
labels = c("Low", "Middle", "Upper-Middle", "High"),
include.lowest = TRUE)
)
income_trans <- transactions %>%
select(income_range, age, package_size, product_type, product_category, department, transaction_timestamp, coupon_disc, sales_value, quantity, product_id, household_id)
income_trans <- na.omit(income_trans)
knitr::kable(head(income_trans))
| income_range | age | package_size | product_type | product_category | department | transaction_timestamp | coupon_disc | sales_value | quantity | product_id | household_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Middle | 35-44 | 4 OZ | ROLLS: BAGELS | ROLLS | PASTRY | 2017-01-01 06:53:26 | 0 | 0.50 | 1 | 1095275 | 900 |
| Middle | 35-44 | 85 CT | FACIAL TISSUE & PAPER HANDKE | FACIAL TISS/DNR NAPKIN | GROCERY | 2017-01-01 07:10:28 | 0 | 0.99 | 1 | 9878513 | 900 |
| High | 45-54 | 11.5 OZ | POTATO CHIPS | BAG SNACKS | GROCERY | 2017-01-01 07:26:30 | 0 | 1.43 | 1 | 1041453 | 1228 |
| Low | 55-64 | 17.1 OZ | REFRIGERATED BAGELS | REFRGRATD DOUGH PRODUCTS | GROCERY | 2017-01-01 07:30:27 | 0 | 1.50 | 1 | 1020156 | 906 |
| Low | 55-64 | 5.0 OZ | TUNA | SEAFOOD - SHELF STABLE | GROCERY | 2017-01-01 07:30:27 | 0 | 2.78 | 2 | 1053875 | 906 |
| Low | 55-64 | 30 OZ | FRZN BREADED PREPARED CHICK | FRZN MEAT/MEAT DINNERS | GROCERY | 2017-01-01 07:30:27 | 0 | 5.49 | 1 | 1060312 | 906 |
The next data set was created by grouping the income ranges and product categories, then finding the top products by summarizing the total quantities purchased. This data set was used in creating chart 1.
top_products <- income_trans %>%
group_by(income_range, product_category) %>%
summarise(total_quantity = sum(quantity), .groups = "drop") %>%
group_by(income_range) %>%
slice_max(order_by = total_quantity, n = 10) %>%
ungroup()
knitr::kable(head(top_products))
| income_range | product_category | total_quantity |
|---|---|---|
| Low | SOFT DRINKS | 15234 |
| Low | FRZN MEAT/MEAT DINNERS | 8426 |
| Low | BAKED BREAD/BUNS/ROLLS | 7656 |
| Low | CHEESE | 7530 |
| Low | VEGETABLES - SHELF STABLE | 6937 |
| Low | SOUP | 6783 |
For the second chart, I filtered the income transactions data set to only have the low income range’s purchases. I then created a data set for the top purchased products by grouping the data by product category and sales value, then finding the total quantity and using the slice_max function to find the top ten. I used the distinct () function to ensure product categories weren’t repeated in the chart.
low_income_trans <- income_trans %>%
filter(income_range == "Low")
top_purchased_products <- low_income_trans %>%
group_by(product_category, sales_value) %>%
summarise(total_quantity = sum(quantity, na.rm = TRUE), .groups = "drop") %>%
slice_max(order_by = total_quantity, n = 10)
top_purchased_products <- top_purchased_products %>%
distinct(product_category, .keep_all = TRUE)
knitr::kable(head(top_purchased_products))
| product_category | sales_value | total_quantity |
|---|---|---|
| FRZN MEAT/MEAT DINNERS | 2.00 | 1756 |
| SOFT DRINKS | 0.59 | 1102 |
| CANDY - CHECKLANE | 1.00 | 1026 |
| VEGETABLES - SHELF STABLE | 1.00 | 1020 |
| FROZEN PIZZA | 2.00 | 852 |
| BAKED BREAD/BUNS/ROLLS | 0.88 | 849 |
The third chart was created by making a data set that shows the coupon use for all groups by creating a new column that showed whether a coupon was used, based on the coupon discount being greater than 0. I then found the percentage of the mean time a coupon was used for each income range’s transactions.
coupon_use <- income_trans %>%
mutate(used_coupon = ifelse(coupon_disc > 0, "Yes", "No")) %>%
group_by(income_range) %>%
summarise(percentage = mean(used_coupon == "Yes") * 100, .groups = "drop")
knitr::kable(head(coupon_use))
| income_range | percentage |
|---|---|
| Low | 1.333945 |
| Middle | 1.427174 |
| Upper-Middle | 2.365926 |
| High | 1.614144 |
Similarly, I created a data set that shows the coupon usage for top products purchased by the low income group only. This was used in the fourth chart.
coupon_usage <- low_income_trans %>%
filter(product_category %in% top_products$product_category) %>%
mutate(used_coupon = ifelse(coupon_disc > 0, "Yes", "No")) %>%
group_by(product_category) %>%
summarize(percentage = mean(used_coupon == "Yes") * 100, .groups = "drop")
coupon_usage <- left_join(coupon_usage, top_products, by = "product_category")
knitr::kable(head(coupon_usage))
| product_category | percentage | income_range | total_quantity |
|---|---|---|---|
| BAG SNACKS | 0.6717502 | Low | 6717 |
| BAG SNACKS | 0.6717502 | Middle | 5155 |
| BAG SNACKS | 0.6717502 | Upper-Middle | 8907 |
| BAG SNACKS | 0.6717502 | High | 4512 |
| BAKED BREAD/BUNS/ROLLS | 0.4442329 | Low | 7656 |
| BAKED BREAD/BUNS/ROLLS | 0.4442329 | Middle | 6736 |
The data set for the fifth chart involved joining campaign data with coupon redemption data, then finding the total count of coupons and the product category they were used for.
coupon_campaign_data <- coupon_redemptions %>%
left_join(campaigns, by = "household_id")
low_income_with_campaign <- low_income_trans %>%
left_join(coupon_campaign_data, by = "household_id", relationship = "many-to-many")
low_income_with_campaign <- na.omit(low_income_with_campaign)
coupon_count <- low_income_with_campaign %>%
group_by(coupon_upc, product_category) %>%
summarise(coupon_count = n(), .groups = "drop")
category_coupon_count <- coupon_count %>%
group_by(product_category) %>%
summarise(total_coupon_count = sum(coupon_count), .groups = "drop") %>%
arrange(desc(total_coupon_count))
top_10_couponns <- category_coupon_count %>%
top_n(10, total_coupon_count)
top_10_couponns$total_coupon_count_thousands <- top_10_couponns$total_coupon_count / 1000
knitr::kable(head(top_10_couponns))
| product_category | total_coupon_count | total_coupon_count_thousands |
|---|---|---|
| SOFT DRINKS | 133521 | 133.521 |
| FRZN MEAT/MEAT DINNERS | 125058 | 125.058 |
| BAKED BREAD/BUNS/ROLLS | 94796 | 94.796 |
| CHEESE | 90850 | 90.850 |
| BAG SNACKS | 70895 | 70.895 |
| YOGURT | 69901 | 69.901 |
The sixth chart used a data set that was created by joining the coupon redemptions and campaign descriptions to a new data set. I then used mutate() to create a new column which shows whether a campaign was active or not. I then created a second data set which showed both total daily purchases as well as a 7-day rolling average for purchases. Both data sets were used in the chart.
low_at_campaign <- low_income_trans %>%
left_join(coupon_redemptions, by = ("household_id"), relationship = "many-to-many") %>%
left_join(campaign_descriptions, by = "campaign_id")
low_at_campaign <- low_at_campaign %>%
mutate(campaign_active = ifelse(transaction_timestamp >= start_date & transaction_timestamp <= end_date, "Active", "Inactive"))
low_at_campaign <- low_at_campaign %>%
mutate(transaction_timestamp = as.Date(transaction_timestamp))
daily_purchases <- low_at_campaign %>%
mutate(day = floor_date(transaction_timestamp, "day")) %>%
group_by(day) %>%
summarise(total_purchases = n(), .groups = "drop") %>%
mutate(rolling_avg = slide_mean(total_purchases, before = 6, complete = TRUE))
campaign_periods <- low_at_campaign %>%
mutate(day = floor_date(transaction_timestamp, "day")) %>%
filter(campaign_active == "Active") %>%
distinct(day) %>%
mutate(ymin = 0, ymax = max(daily_purchases$total_purchases, na.rm = TRUE))
knitr::kable(head(daily_purchases))
| day | total_purchases | rolling_avg |
|---|---|---|
| 2017-01-01 | 1305 | NA |
| 2017-01-02 | 2115 | NA |
| 2017-01-03 | 1840 | NA |
| 2017-01-04 | 1004 | NA |
| 2017-01-05 | 1340 | NA |
| 2017-01-06 | 756 | NA |
knitr::kable(head(campaign_periods))
| day | ymin | ymax |
|---|---|---|
| 2017-01-01 | 0 | 5536 |
| 2017-01-03 | 0 | 5536 |
| 2017-01-04 | 0 | 5536 |
| 2017-01-05 | 0 | 5536 |
| 2017-01-06 | 0 | 5536 |
| 2017-01-07 | 0 | 5536 |
The final data set was created by taking the low income purchases during campaign’s data set and grouping it by total purchases per day and product category. By doing so, chart 7 is able to show total purchases per day over the course of a year, along with the product category of those purchases.
top_products_daily <- low_at_campaign %>%
mutate(day = floor_date(transaction_timestamp, "day")) %>%
group_by(day, product_category) %>%
summarise(total_purchases = n(), .groups = "drop") %>%
group_by(day) %>%
slice_max(order_by = total_purchases, n = 1, with_ties = FALSE) %>%
ungroup()
top_products_daily <- top_products_daily %>%
mutate(day = as.Date(day))
knitr::kable(head(top_products_daily))
| day | product_category | total_purchases |
|---|---|---|
| 2017-01-01 | FRZN MEAT/MEAT DINNERS | 118 |
| 2017-01-02 | SOFT DRINKS | 121 |
| 2017-01-03 | BAKED BREAD/BUNS/ROLLS | 123 |
| 2017-01-04 | BAKED BREAD/BUNS/ROLLS | 70 |
| 2017-01-05 | SOFT DRINKS | 94 |
| 2017-01-06 | FRZN MEAT/MEAT DINNERS | 48 |
To begin, I looked into the top 10 most purchased products for each income level. The purpose of this was to see what items the low income level demographic purchases the most, as well as how they compare to other income levels. By doing so, the unique purchases in the low income level can be focused on.
You can hover your cursor over the bars in this chart to see the exact numbers purchased.
chart1 <- ggplot(top_products, aes(x = reorder(product_category, total_quantity), y = total_quantity, fill = income_range, text = paste("Total: ", total_quantity))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ income_range, scales = "free_y") +
coord_flip() +
labs(title = "Top 10 Most Purchased Products by Income Level",
x = "Product category",
y = "Total Quantity Purchased") +
theme(axis.text.x = element_blank())
ggplotly(chart1, tooltip = "text") %>%
layout(
width = 1000,
height = 500
)
As demonstrated in the graphs, the low income level has lower spending overall, which is to be expected due to them having less disposable income. The income levels also share very similar top purchased products, but with varying quantities of these categories purchased. To see which products sell for the most, and therefore would be most profitable if purchases increased, I created the following chart comparing sales values.
plot2 <- ggplot(top_purchased_products, aes(x = reorder(product_category, sales_value), y = sales_value,
text = paste0("Price: $", round(sales_value, 2)))) +
geom_col(fill = "blue") +
coord_flip() +
labs(title = "Price of Top Purchased Products",
x = "Product Category",
y = "Price ($)") +
scale_y_continuous(labels = scales::dollar_format()) +
scale_x_discrete(drop = FALSE) +
theme_minimal()
ggplotly(plot2, tooltip = "text")
Based on the chart, fluid milk products would provide the greatest return if sales were to increase. Frozen meat/meat dinners and frozen pizzas would also be profitable with an increase in sales. The next step would be to consider coupon usage by each income level.
plot3 <- ggplot(coupon_use, aes(x = income_range, y = percentage, fill = income_range, text = paste0("Coupon Use: ", round(percentage, 1), "%"))) +
geom_col(show.legend = FALSE) +
labs(title = "Percentage of Transactions Using a Coupon by Income Level",
x = "Income Level",
y = "Percentage of Transactions") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_minimal()
ggplotly(plot3, tooltip = "text")
As shown in the chart, the low income level has the lowest percentage of coupon usage. However, this may be in part due to having lower spending overall. To dive deeper into the low income demographic’s coupon usage, let’s examine how often the low income demographic uses coupons for their top purchased items.
plot4 <- ggplot(coupon_usage, aes(x = reorder(product_category, percentage), y = percentage, fill = product_category, text = paste0("Coupon Use: ", round(percentage, 1), "%"))) +
geom_col(fill = "pink") +
coord_flip() +
labs(title = "Coupon Usage for Top 10 Purchased Products (Low Income)",
x = "Product Category",
y = "Percentage of Transactions Using a Coupon") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_minimal()
ggplotly(plot4, tooltip = "text") %>%
layout(
width = 1000,
height = 500
)
The chart shows that yogurt, frozen pizza, and soup are most often paid for with coupons by low income households. However, it would also be important to consider which products have the most coupons offered.
plot_coupon <- ggplot(top_10_couponns, aes(x = reorder(product_category, total_coupon_count/1000), y = total_coupon_count_thousands, fill = product_category)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Products with Most Offered Coupons (Low Income)",
x = "Product Category",
y = "Total Coupon Count (in thousands)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(plot_coupon, tooltip = "text") %>%
layout(
showlegend = FALSE,
width = 1000,
height = 500,
xaxis = list(
tickformat = ',.0f',
range = c(0, max(top_10_couponns$total_coupon_count_thousands) + 10)
)
)
The chart shows that many of the products that were purchased by low income households were products where coupons were offered the most. Soft drinks are the most purchased product for the lower income demographic and also have the most coupons available for it. As previously mentioned, fluid milk products, frozen/meat meat dinners, and frozen pizzas have the highest prices out of the mostly commonly purchased items by low income households. Focusing in on these, there are fewer coupons available compared to the other most purchased products. To see if there’s a link between coupons being available and total purchases by low income households, I created the following graph.
line_chart2 <- ggplot() +
geom_tile(data = campaign_periods, aes(x = day, y = max(daily_purchases$total_purchases, na.rm = TRUE) / 2,
height = max(daily_purchases$total_purchases, na.rm = TRUE),
width = 1, fill = "Campaign Active"), alpha = 0.2) +
geom_line(data = daily_purchases, aes(x = day, y = total_purchases, color = "Total Purchases"), linewidth = 1) +
geom_line(data = daily_purchases, aes(x = day, y = rolling_avg, color = "7-Day Rolling Avg"), linewidth = 1) +
labs(title = "Daily Purchases over 2017",
x = "Month",
y = "Total Purchases",
color = "Legend",
fill = "Legend") +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
scale_color_manual(values = c("Total Purchases" = "gray", "7-Day Rolling Avg" = "black")) +
scale_fill_manual(values = c("Campaign Active" = "blue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(line_chart2, tooltip = "text")
The chart shows that the highest spikes in purchases tend to occur during active marketing campaigns, further evidencing that there is an increase in sales among the low income demographic when discounts are offered. It would be helpful to look at the purchasing trends of the lower income demographic over time in order decide the best time to start and end a marketing campaign for specific products where a discount is provided. The following chart show the top product purchased per day by the low income households over the course of 2017.
Hover cursor over line to see top product category per day.
line_chart_top_products <- ggplot(top_products_daily, aes(x = day, y = total_purchases, group = 1, text = paste("Product:", product_category))) +
geom_line(linewidth = 1, color = "blue") +
labs(
title = "Top Purchased Product Category Per Day",
x = "Date",
y = "Total Purchases"
) +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(line_chart_top_products, tooltip = "text")
This chart provides a better description on when the best time to put out promotions for certain products would be, based on the purchasing trends of the low income demographic. To circle back to chart 2, fluid milk products, frozen meat/meat dinners, and frozen pizzas have the most potential to boost profit due to their high sales value among popular products for the low income demographic. Based on the chart, the ideal time to discount these products would be in August for frozen pizzas, October for frozen meat, and mid-June for fluid milk products.
This analysis came about in order to address how to increase sales among the low income demographic, and whether discounts offered in the form of coupons could result in higher spending.
I analyzed the problem statement and possible solutions by first dividing the income ranges into four distinct levels, thereby defining the low income level as a certain range. I analyzed the most purchased products for each income range to distinguish what products are most popular among the low income range, as well as products that could appeal to other income groups if discounted (therefore provoking an increase in sales for other groups as well). Analyzing the coupon usage for all four income groups also provided meaningful data on how often each group uses coupons in their purchases, and whether a coupon discount would be enticing to multiple ranges. Although this report focuses on how to increase purchases within the low income range, the data on coupon usage suggests that the other three ranges are often more likely to use coupon discounts, meaning that discounting products at strategic times throughout the year could increase spending overall, not just for specific income ranges. Comparing the prices of the most purchased products for the low income range also provided insight on what products would generate the most revenue with an increase in purchases, which I then focused on in further analysis. Furthermore, comparing coupon usage to how often coupons are available for certain products helped to reveal products where sales could potentially increase if coupons were made available more often. My analysis concluded by analyzing the daily purchases during the course of 2017, with the periods of active marketing campaigns highlighted to show when spikes in sales occurred during these times. I also provided a chart showing what the most purchased product was each day over the course of the year, which revealed the times that a product was most likely to be purchased, which could be further increased if a discount was offered.
Implications
Focusing in on the three most purchased products with the highest price among the low income demographic (frozen pizzas, frozen meat dinners, and fluid milk products), my analysis showed that the top purchasing times for these products would be August, October, and mid-June, respectively. By discounting these products at these times, Regork could see an increase in sales by the low income demographic. My analysis also shows that the other three income levels (middle, upper-middle, and high) favor similar products and have higher coupon usage, suggesting that discounting these products at peak purchase times would increase spending among these other three income ranges as well, overall increasing profit for the company.
Limitations
This analysis is subject to several limitations. First of all, I was only provided data for the year of 2017. This analysis could be improved with access to longer-term trends, which would show how often consumer tastes change and whether peak purchasing times for the low income demographic vary significantly per year. This analysis only looks at the demographic of income level, and doesn’t take into account age, household size, number of kids, and other factors which may also play a part in top products or purchase times.