The Super Bowl is one of the biggest days of the year for many Americans. Most people choose to watch the big game by gathering for parties and spending large amounts of money on food. I believe that capitalizing on Super Bowl sales is a large area of growth for Regork. I will be focusing on frozen pizza and soft drink sales in particular. The question that I am trying to solve is:
What demographics would be best to market to for soda and frozen pizza products for the Super Bowl?
I used data from three different data sets: demographic data, product data, and transaction data. All of these data sets are related through a primary key of the customers household id. By dissecting this data, I was able to uncover insights on the purchases of frozen pizzas and soft drinks for the Super Bowl.
suppressWarnings(suppressMessages(library(completejourney)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(lubridate)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(DT)))
demographics <- demographics
transactions <- get_transactions()
promotions <- get_promotions()
joineddf <- transactions %>%
inner_join(products, by = "product_id")
pizza_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
group_by(income) %>%
summarize(total_sales = sum(sales_value)) %>%
arrange(desc(total_sales))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(pizza_demographics, 12), options = list(pageLength = 10))
specific_day <- "2017-02-05"
pizza_SB_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
filter(date(transaction_timestamp) == specific_day) %>%
group_by(income) %>%
summarize(SB_sales = sum(sales_value)) %>%
arrange(desc(SB_sales))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(pizza_SB_demographics, 7), options = list(pageLength = 10))
pizza_Non_SB_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
filter(date(transaction_timestamp) != specific_day) %>%
group_by(income) %>%
summarize(total_sales = sum(sales_value)) %>%
arrange(desc(total_sales)) %>%
mutate(average_daily_sales = total_sales/364) %>%
select(income, average_daily_sales) %>%
slice_head(n = 9)
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(pizza_Non_SB_demographics, 100), options = list(pageLength = 9))
ggplot(pizza_SB_demographics, aes(x = income, y = SB_sales)) +
geom_bar(stat = "identity", color = "red") +
labs(title = "Customer Spending On Frozen Pizzas By Income Level",
subtitle = "On February 5th 2017 (Super Bowl 51)") +
ylim(0,65)
ggplot(pizza_Non_SB_demographics, aes(x = income, y = average_daily_sales)) +
geom_bar(stat = "identity", color = "blue") +
labs(title = "Average Daily Sales On Frozen Pizzas By Income Level",
subtitle = "On Every Day Except February 5th 2017 (Super Bowl 51)") +
ylim(0,65)
pizzamultiplier <- pizza_SB_demographics %>%
full_join(pizza_Non_SB_demographics) %>%
mutate(SB_multiplier = SB_sales/average_daily_sales) %>%
arrange(desc(SB_multiplier)) %>%
slice_head(n = 7)
## Joining, by = "income"
datatable(head(pizzamultiplier, 100), options = list(pageLength = 5))
ggplot(pizzamultiplier, aes(income, SB_multiplier)) +
geom_point() +
labs(title = "How Much More Customers Spend On Pizza For The Super Bowl",
subtitle = "Super Bowl Multiplier Rating")
SB_Pizza_house <- transactions %>%
inner_join(demographics) %>%
inner_join(products) %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
filter(date(transaction_timestamp) == specific_day) %>%
group_by(household_size) %>%
summarize(SB_sales = sum(sales_value))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(SB_Pizza_house, 100), options = list(pageLength = 5))
NON_SB_PIZZA_HOUSE <- transactions %>%
inner_join(demographics) %>%
inner_join(products) %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
filter(date(transaction_timestamp) != specific_day) %>%
group_by(household_size) %>%
summarize(sales = sum(sales_value)) %>%
mutate(average_daily_sales = sales/364) %>%
select(household_size, average_daily_sales)
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(NON_SB_PIZZA_HOUSE, 100), options = list(pageLength = 5))
ggplot(SB_Pizza_house, aes(x = household_size, y = SB_sales)) +
geom_bar(stat = "identity", color = "red") +
labs(title = "Customer Spending On Frozen Pizzas By Household Size",
subtitle = "On February 5th 2017 (Super Bowl 51)") +
ylim(0,130)
ggplot(NON_SB_PIZZA_HOUSE, aes(x = household_size, y = average_daily_sales)) +
geom_bar(stat = "identity", color = "blue") +
labs(title = "Customer Average Daily Spending On Frozen Pizzas By Household Size",
subtitle = "On everyday except February 5th 2017 (Super Bowl 51)") +
ylim(0,130)
joineddf$transaction_timestamp <- as.POSIXct(joineddf$transaction_timestamp)
SBweek <- subset(joineddf, transaction_timestamp >= as.POSIXct("2017-01-30") &
transaction_timestamp <= as.POSIXct("2017-02-06"))
SBweek$date <- as.Date(SBweek$transaction_timestamp)
weekpizzasales <- SBweek %>%
filter(str_detect(product_category, "FROZEN PIZZA")) %>%
group_by(date) %>%
summarize(daily_sales = sum(sales_value)) %>%
slice_head(n = 7)
weekpizzasales %>%
ggplot(aes(date, daily_sales)) +
geom_point() +
geom_smooth() +
ylim(0, 500) +
scale_x_date(date_breaks = "1 day", date_labels = "%b %d") +
labs(title = "Pizza Spending By Day Leading Up To The Super Bowl",
subtitle = "(Super Bowl is February 5th)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
There are many observations that can be taken away from this analysis. The first observation that stuck out to me was how much some income levels increased their sales on the day of the Super Bowl. Despite the income level of 50-74k leading frozen pizza sales across the year, The income group of 35-49k led frozen pizza sales on the day of the Super Bowl by a large margin. The 35-49k income group had a multiplier factor of 2.69, while the 50-74k income group had a multiplier factor of 1.49. The income range of 150-174k had a multiplier of around 7, and the range of 125-149k had a multiplier of around 3. I am going to ignore these two income ranges because despite having a high multiplier rating, their sales were very low when compared to the previously discussed income ranges of 35-49k and 50-74k.
Regork can use this information in the future to now know to place an emphasis on the income range of 35-49k when thinking about future promotions or campaigns and choosing what demographics to focus on.
When comparing the relation between household size and Super Bowl frozen pizza sales, it was found that households with 2 people living in them had a very large increase in frozen pizza sales. It was found that frozen pizza sales were somewhat consistent across the year regardless of household size, but on the day of the Super Bowl, sales grew exponentially in 2 person households. Frozen pizza sales in 2 person households had 3.26 times more sales on the day of Super Bowl 51 over the next leading household size, which was 1.
Regork can further use this information along with the previous statement to concur that the ideal demographic for a potential customer looking to buy a frozen pizza for the Super Bowl would be someone living in a household size of 2, with an income level of 35-49k, as both of these demographics had large sales across the year, led sales on the day of the Super Bowl, and had large Super Bowl multiplier factors.
When looking at the exact day that customers bought frozen pizzas, it was to no surprise that sales increased every day in the 5 days leading up to the Super Bowl. This information is of not much importance to Regork, as it just confirms as most would expect: Customers have more time to shop on the weekends, and also tend to purchase last minute as it gets closer to the big game.
soda_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
group_by(income) %>%
summarize(total_sales = sum(sales_value)) %>%
arrange(desc(total_sales))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(soda_demographics, 12), options = list(pageLength = 10))
specific_day <- "2017-02-05"
soda_SB_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
filter(date(transaction_timestamp) == specific_day) %>%
group_by(income) %>%
summarize(SB_sales = sum(sales_value)) %>%
arrange(desc(SB_sales))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(soda_SB_demographics, 12), options = list(pageLength = 10))
datatable(head(soda_SB_demographics, 100), options = list(pageLength = 11))
soda_Non_SB_demographics <- demographics %>%
inner_join(transactions) %>%
inner_join(products) %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
filter(date(transaction_timestamp) != specific_day) %>%
group_by(income) %>%
summarize(total_sales = sum(sales_value)) %>%
arrange(desc(total_sales)) %>%
mutate(average_daily_sales = total_sales/364) %>%
select(income, average_daily_sales) %>%
slice_head(n = 9)
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(soda_Non_SB_demographics, 100), options = list(pageLength = 12))
graph111 <- soda_SB_demographics %>%
slice_head(n = 8)
ggplot(graph111, aes(x = income, y = SB_sales)) +
geom_bar(stat = "identity", color = "red") +
labs(title = "Customer Spending On soft drinks By Income Level",
subtitle = "On February 5th 2017 (Super Bowl 51)") +
ylim(0,250)
ggplot(soda_Non_SB_demographics, aes(x = income, y = average_daily_sales)) +
geom_bar(stat = "identity", color = "blue") +
labs(title = "Average Daily Sales On Soft Drinks By Income Level",
subtitle = "On Every Day Except February 5th 2017 (Super Bowl 51)") +
ylim(0,250)
sodamultiplier <- soda_SB_demographics %>%
full_join(soda_Non_SB_demographics) %>%
mutate(SB_multiplier = SB_sales/average_daily_sales) %>%
arrange(desc(SB_multiplier)) %>%
slice_head(n = 9)
## Joining, by = "income"
datatable(head(sodamultiplier, 100), options = list(pageLength = 9))
ggplot(sodamultiplier, aes(income, SB_multiplier)) +
geom_point() +
labs(title = "How Much More Customers Spend On Soda For The Super Bowl")
SB_soda_house <- transactions %>%
inner_join(demographics) %>%
inner_join(products) %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
filter(date(transaction_timestamp) == specific_day) %>%
group_by(household_size) %>%
summarize(SB_sales = sum(sales_value))
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(SB_soda_house, 100), options = list(pageLength = 9))
NON_SB_SODA_HOUSE <- transactions %>%
inner_join(demographics) %>%
inner_join(products) %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
filter(date(transaction_timestamp) != specific_day) %>%
group_by(household_size) %>%
summarize(sales = sum(sales_value)) %>%
mutate(average_daily_sales = sales/364) %>%
select(household_size, average_daily_sales)
## Joining, by = "household_id"
## Joining, by = "product_id"
datatable(head(NON_SB_SODA_HOUSE, 100), options = list(pageLength = 9))
ggplot(SB_soda_house, aes(x = household_size, y = SB_sales)) +
geom_bar(stat = "identity", color = "red") +
labs(title = "Customer Spending On Soft Drinks By Household Size",
subtitle = "On February 5th 2017 (Super Bowl 51)") +
ylim(0,250)
ggplot(NON_SB_SODA_HOUSE, aes(x = household_size, y = average_daily_sales)) +
geom_bar(stat = "identity", color = "blue") +
labs(title = "Average Daily Soft Drink Sales By Household Size",
subtitle = "On every day except February 5th 2017 (Super Bowl 51)") +
ylim(0,250)
joineddf$transaction_timestamp <- as.POSIXct(joineddf$transaction_timestamp)
SBweek <- subset(joineddf, transaction_timestamp >= as.POSIXct("2017-01-30") &
transaction_timestamp <= as.POSIXct("2017-02-06"))
SBweek$date <- as.Date(SBweek$transaction_timestamp)
weeksodasales <- SBweek %>%
filter(str_detect(product_category, "SOFT DRINKS")) %>%
group_by(date) %>%
summarize(daily_sales = sum(sales_value)) %>%
slice_head(n = 7)
weeksodasales %>%
ggplot(aes(date, daily_sales)) +
geom_point() +
geom_smooth() +
ylim(0, 1500) +
scale_x_date(date_breaks = "1 day", date_labels = "%b %d") +
labs(title = "Soft Drink Spending By Day Leading Up To The Super Bowl",
subtitle = "(Super Bowl is February 5th)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
After analyzing the soft drink sales across all of 2017 as well as soft drink sales for the Super Bowl, I found that the results were similar to to that of the frozen pizza sales analysis.
The income range of 35-49k had the largest Super Bowl multiplier and also led sales on the day of the Super Bowl. I would conclude that this is the ideal income range for soft drink sales
The ideal household size for selling soft drinks was different from the results of the frozen pizza analysis. I found that the ideal household size is a one person household. Despite being second in average daily sales, one person households led all others in Super Bowl soft drink sales and had the highest multiplier.
Ideal frozen pizza demographic: 35-49k income range in a 2 person household.
Ideal soft drink demographic: 35-49k income range in a 1 person household.
I came to these conclusions based on average sales across the year, Super Bowl sales, and the Super Bowl multiplier (By how many times do total sales increase on the day of the Super Bowl). In terms of my analysis on what days these products were bought, it was to no surprise that the closer it got to the Super Bowl, the more frozen pizza and soft drink sales increased. Although this fact may not be a surprise to Regork, I belive my figures will help them with future inventory forecasting.
I believe that there are a few limitations to my analysis that could be improved upon. The first limitation that I belive is that I do not have access to multiple years worth of data. To truly confirm what I believe to be the ideal demographics, I would need to look at many years of data to see if if my conclusion holds true over time. Another limitation that I found with my analysis is the data lost when joining between different data sets. Prior to my previous point, as I am using less and less data as I join different data sets, my analysis becomes less and less accurate. The last limitation to my analysis is that there is no mention of geographical area within my data. Super Bowl 51 was played between the Atlanta Falcons and the New England Patriots. Because I am not aware of each stores geographical location, I can not take into account that if there were to be a large number of stores near either Boston or Atlanta, it could have skewed my results, as it is likely that people near those cities have a higher chance of watching the game, thus spending more.