Packages loaded:
completejourney - data that is used for this project, all data from a grocery store
ggplot2 - package used to graph and plot in r
dplyr - package to help data manipulation (filtering, joining, etc.)
tidyverse - general package in for data manipulation that includes dplyr and ggplot2 inside it
The business problem I am addressing in this project is the question of what young people are buying at the grocery store and what products they are using coupons on. First, we must define what we call a young person. I am considered a young person to be someone in the age buckets of 19-24 and 25-34. There are definitely differences in lifestyles in those two buckets but they are still considered young adults.
I addressed this problem by using data in ways to form graphs showing the most purchased products, home ownership status comparisons, and most discounted products via coupon usage.
My analysis will help the Regork CEO because I am able to use my findings to give out a plan that will increase profits for the company. The overarching idea of the plan is to either increase or decrease the number of discounts we give out, increase the price of specific products, or target coupons at certain demographics based on household status.
The graph below shows the top 10 most purchased products by customers aged 19-24. I excluded purchases of gas from the data because that is an obvious top 10 product and it was the number one by a good amount.
You can learn a lot about the demographic by looking at what they buy the most.
Could we create coupons and discounts for items on this list in order to generate even more sales of the products?
Or do you want to lessen the coupons and discounts on the items listed in order to see if the customers are buying these products because of the coupon or because of the product itself.
Lessening the coupons and discounts could lead to greater profit margins.
library(completejourney)
library(tidyverse)
library(ggplot2)
library(dplyr)
transactions <- get_transactions()
promotions <- get_promotions()
products <- products
demographics <- demographics
coupons <- coupons
campaigns <- campaigns
all_joined_data <- transactions %>%
left_join(products, by = 'product_id') %>%
left_join(demographics, by = 'household_id')
young_data <- transactions %>%
left_join(products, by = 'product_id') %>%
left_join(demographics, by = 'household_id') %>%
filter(age == '19-24')
most_purchased <- young_data %>%
group_by(product_category) %>%
summarise(total_sales_value = sum(sales_value, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(total_sales_value)) %>%
filter(product_category != 'COUPON/MISC ITEMS') %>%
head(10)
ggplot(most_purchased, aes(x = reorder(product_category, total_sales_value), y = total_sales_value, fill = product_category)) +
geom_bar(stat = 'identity', fill = 'steelblue') + labs(title = 'Top 10 Most Purchased Products by Customers Age 19-24', caption = '*Gas excluded from data*', x = 'Product Category', y = 'Total Sales Value') + coord_flip()
This pie chart is a confirmation of the assumption that most people in this age bucket would be renters. You can search online and find that the vast majority rent in this age bucket.
A way to corner this market of renting 19-24 year old could be to push coupons and discounts on stores that are in the area of apartment buildings. That would generate more sales for those products.
homeowner_renter_status <- young_data %>%
group_by(home_ownership) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(percentage = count / sum(count) * 100) %>%
filter(!is.na(home_ownership))
ggplot(homeowner_renter_status, aes(x = '', y = percentage, fill = home_ownership)) +
geom_bar(stat = 'identity', width = 1) +
coord_polar(theta = 'y') +
labs(title = 'Homeowner or Renter Distribution (Age 19-24)', fill = 'Homeownership Status', caption = 'Close to 60% of the data was N/A. Assume the majority is renting.') + theme_minimal() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.ticks = element_blank(), axis.text.x = element_blank()) + geom_text(aes(label = paste0(round(percentage, 1), '%')), position = position_stack(vjust = 0.5), color = 'white', size = 4)
Here you can see the most popular products for people aged 25-34. This graph does look similar to the 19-24 age one at the top.
BUT, you notice the beginning of a change at the end with diapers and disposables coming in. Beer is also higher here because everyone in this age bucket is legal to drink.
After seeing both top 10 graphs, you can see the similarities between the two. It could be a wise idea to create some combo deals with the items that appear in both top 10s.
two_three_data <- transactions %>%
left_join(products, by = 'product_id') %>%
left_join(demographics, by = 'household_id') %>%
filter(age == '25-34')
most_purchased_two <- two_three_data %>%
group_by(product_category) %>%
summarise(total_sales_value = sum(sales_value, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(total_sales_value)) %>%
filter(product_category != 'COUPON/MISC ITEMS') %>%
head(10)
ggplot(most_purchased_two, aes(x = reorder(product_category, total_sales_value), y = total_sales_value, fill = product_category)) +
geom_bar(stat = 'identity', fill = 'steelblue') + labs(title = 'Top 10 Most Purchased Products by Customers Age 25-34', caption = '*Gas excluded from data*', x = 'Product Category', y = 'Total Sales Value') + coord_flip()
The overwhelming majority of people this age are homeowners. That can be attributed to the idea of people starting to settle down into families and or they advanced in their career becoming more successful. Thus giving them the ability to buy a house.
Diapers and disposables creeping into the top 10 above goes hand in hand with the majority homeowners. Put those two together and you come to the conclusion of people starting families in this age bucket.
homeowner_renter_status_twothree <- two_three_data %>%
group_by(home_ownership) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(percentage = count / sum(count) * 100) %>%
filter(!is.na(home_ownership))
ggplot(homeowner_renter_status_twothree, aes(x = '', y = percentage, fill = home_ownership)) +
geom_bar(stat = 'identity', width = 1) +
coord_polar(theta = 'y') +
labs(title = 'Homeowner or Renter Distribution (Age 25-34)', fill = 'Homeownership Status', caption = 'Rest of data is N/A. Assume the majority is homeowner.') + theme_minimal() + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), axis.ticks = element_blank(), axis.text.x = element_blank()) + geom_text(aes(label = paste0(round(percentage, 1), '%')), position = position_stack(vjust = 0.5), color = 'white', size = 4)
Below you see the most discounted products. One might notice that there is similarities between this graph and the most purchased graph for this age range.
The max total discount here is just above two thousand dollars. That shows that there aren’t many coupons being used by people ages 19-24. That reinforces the idea that we need to push coupons out to these 19-24 year old renters that most likely live in apartments in cities.
But when they use coupons, they use it for the products they buy the most.
coupon_analysis_19_24 <- all_joined_data %>%
left_join(coupons, by = 'product_id') %>%
group_by(product_category) %>%
filter(age == '19-24', !is.na(home_ownership)) %>%
summarise(total_discount = sum(retail_disc, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(total_discount)) %>%
head(10)
ggplot(coupon_analysis_19_24, aes(x = reorder(product_category, total_discount), y = total_discount, fill = product_category)) + geom_bar(stat = 'identity', fill = 'darkgreen') + coord_flip() + labs(title = 'Top 10 Total Discounted Items via Coupons (Ages 19-24 Demographic)', x = 'Product Category', y = 'Total Discount') + theme_minimal()
Below shows, for ages 25-34, the 10 most discounted products. There are three main things I want you to get out of this graph.
First: Chicken appears on this list seemingly out of nowhere. Chicken is not on the top 10 most purchased products list but here it is number one (!) on the most discounted via coupons. This means we may need to pull back coupons on chicken.
Secondly: For ages 19-24 the max total discount was around two thousand. Here the max is close to eight thousand. That is very telling because it shows that people aged 25-34 are using WAY more coupons than people aged 19-24. Maybe we should think about pulling back on coupons in suburban areas and any stores surrounded by homes.
Lastly: This is more of a minor point but diapers does not appear on the most discounted but it crept into the most purchased at number 10. We could consider giving more coupons for diapers in stores around homes.
Also, when it comes to coupons they come in the mail as well so it can be easier to target people based on them being homeowners or renters by using the mail.
coupon_analysis_25_34 <- all_joined_data %>%
left_join(coupons, by = 'product_id') %>%
group_by(product_category) %>%
filter(age == '25-34', !is.na(home_ownership)) %>%
summarise(total_discount = sum(retail_disc, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(total_discount)) %>%
head(10)
ggplot(coupon_analysis_25_34, aes(x = reorder(product_category, total_discount), y = total_discount, fill = product_category)) + geom_bar(stat = 'identity', fill = 'purple') + coord_flip() + labs(title = 'Top 10 Total Discounted Items via Coupons (Ages 25-34 Demographic)', x = 'Product Category', y = 'Total Discount') + theme_minimal()
To conclude, it is time to give my final recommendation plan in hopes of it increasing profits for the company. The plan is a simple push-pull method. We need to push coupons and discounts at people ages 19-24 in order to generate more revenue. They aren’t using nearly as many coupons as people aged 25-34 (remember 2,000 vs 8,000). Give them more coupons and they will buy more products. We need to pull coupons and discounts from people ages 25-34. We are getting beat on chicken, it is the most discounted item but not a top purchased item overall. Either pull back on chicken coupons or raise the overall price of chicken. Another reason to pull is that they are using way more coupons than 19-24. Over 8,000 dollars in coupons used on chicken when it does even crack the top 10 most purchased is unacceptable.
| Age Group | Push or Pull? | Why? |
|---|---|---|
| 19-24 | PUSH! |
|
| 25-34 | PULL! |
|