My motivation for this project is to see if online buzz or being searched more in google actually translates into sales for big beverage brands. I want to find out if Google Search spikes during major events like the Super Bowl or the summertime months can predict when a brand is about to see a jump in retail sales. To do this, I’ll be using two different data sources: a CSV dataset of beverage sales records from Kaggle and live search data pulled directly from gtrendsR. The workflow will involve cleaning up the brand names and standardizing the different date formats, like daily or weekly, so I can join the datasets properly. I’ll then run a statistical correlation analysis to see if online buzz and the sales actually move together. For the project feature not covered in class, I’ll be using gTrendsR which provides an interface for retrieving and visualizing trend data without needing an API key.
Source: Willmann, S. (2023). Beverage Sales Dataset. Retrieved from Kaggle: https://www.kaggle.com/datasets/sebastianwillmann/beverage-sales
References: Veylinx. (2023). Which Brands Won the Super Bowl? Retrieved from: https://veylinx.com/blog/which-brands-won-the-super-bowl
Introduction: For this project, I wanted to see if three biggest soda brands Coca-Cola, Pepsi, and Sprite actually see a payoff from hype they generate during big events and campaigns during those event. I’m narrowing the focus down to the 2023 calendar year to see how these brands compete during huge marketing moments like the Super Bowl. Using the OSEMN framework, I’ll be digging into about 9 million rows of sales data to find out if Google Search buzz actually leads to more people buying drinks, or if the market leaders stay on top regardless of what’s trending online.
Loading Dataset:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(googlesheets4)
library(lubridate)
beverage_url <- "https://raw.githubusercontent.com/Kristoffgit/Superbowl-Beverages/refs/heads/main/superbowl_sales_2023.csv"
beverage_data <- read_csv(beverage_url)
## Rows: 231776 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Order_ID, Customer_ID, Customer_Type, Product, Category, Region
## dbl (4): Unit_Price, Quantity, Discount, Total_Price
## date (1): Order_Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
beverage_data$Order_Date <- as.Date(beverage_data$Order_Date)
Although Coca-Cola is the most dominant brand in the U.S. with the largest market share, I wanted to see if that dominance actually holds up on Super Bowl Sunday. After reviewing three of the most popular soda brands, the data shows some surprising results.
On the actual day of the Super Bowl (February 12th), Sprite jumped ahead to lead the pack in units sold. However, the most interesting part is what happened the day after the game. On February 13th, Pepsi managed to beat out both Coca-Cola and its sibling, Sprite, by a significant margin. While Sprite stayed competitive, it couldn’t overcome the momentum Pepsi gained right after the event, which could be a result of Pepsi’s Zero Sugar Ad. Based on consumer demand metrics, the Zero Sugar ad campaign performed the best and achieved an 18% increase in purchase demand. With thanks to its campaign ad starring Ben Stiller and Steve Martin, it propelled consumer interest. This helps explain why on February 13th, Pepsi had beat both Sprite and Coca-Cola in units sold the day after the super bowl with 4606 units.
market_leaders_2023 <- beverage_data %>%
filter(Product %in% c("Coca-Cola", "Pepsi", "Sprite"),
year(Order_Date) == 2023) %>%
group_by(Order_Date, Product) %>%
summarize(Daily_Quantity = sum(Quantity)) %>%
ungroup()
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by Order_Date and Product.
## ℹ Output is grouped by Order_Date.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(Order_Date, Product))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
market_leaders_2023 %>%
group_by(Product) %>%
summarize(Total_Volume = sum(Daily_Quantity))
## # A tibble: 3 × 2
## Product Total_Volume
## <chr> <dbl>
## 1 Coca-Cola 121285
## 2 Pepsi 105275
## 3 Sprite 115975
ggplot(market_leaders_2023, aes(x = Product, y = Daily_Quantity, fill = Product)) +
geom_boxplot() +
scale_fill_manual(values = c("Coca-Cola" = "red", "Pepsi" = "blue", "Sprite" = "gray")) +
labs(title = "2023 Sales Volume: Coke vs. Pepsi vs. Sprite",
x = "Brand",
y = "Daily Units Sold") +
theme_minimal() +
theme(legend.position = "none")
superbowl_battle_2023 <- beverage_data %>%
filter(Product %in% c("Coca-Cola", "Pepsi", "Sprite"),
Order_Date >= "2023-02-06",
Order_Date <= "2023-02-13") %>%
group_by(Order_Date, Product) %>%
summarize(Daily_Quantity = sum(Quantity)) %>%
ungroup()
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by Order_Date and Product.
## ℹ Output is grouped by Order_Date.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(Order_Date, Product))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
ggplot(superbowl_battle_2023, aes(x = Order_Date, y = Daily_Quantity, color = Product)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
scale_color_manual(values = c("Coca-Cola" = "red", "Pepsi" = "blue", "Sprite" = "gray")) +
labs(title = "The 2023 Super Bowl Battle: Daily Sales",
x = "Date",
y = "Units Sold") +
theme_minimal()
For my second data source, I used the googlesheets4 library to pull data directly from the Google Sheets API. This let me connect web-hosted search data with my local sales CSV. I decided to go this route instead of using a direct gtrendsR call because the API wasn’t being stable and wouldn’t let me pull the trends I needed. Using the Google Sheets API was a much more reliable way to make sure the two different data types could actually talk to each other.
For context when reviewing Google Trends, search interest over a specific time period, is displayed on a relative scale from 0 to 100, where 100 signifies the peak interest for the time period of the chart. A value of 50 indicates half the popularity of the peak, and 0 suggests insufficient data. When reviewing the data across Coca-Cola, Pepsi, and Sprite, we can see that after the day of the Super Bowl, Pepsi’s Ad Campaign hit the peak interest on the scale at 100. This aligns with our original proposal that online buzz actually translated into an increase in sales.
gs4_deauth()
google_trends_raw <- read_sheet("https://docs.google.com/spreadsheets/d/1ItSA5wJN8_ARueFMtAv7JL3O5cNRTZKHQqBAsQQ6-Bg/edit?usp=sharing")
## ✔ Reading from "Google Trends February 2023".
## ✔ Range 'time_series_US_20230206-0000_20260505-2215'.
trends_scrubbed <- google_trends_raw %>%
rename(Date = 1,
`Coca-Cola` = 2,
Pepsi = 3,
Sprite = 4) %>%
mutate(Date = as.Date(Date))
trends_long <- trends_scrubbed %>%
pivot_longer(cols = c("Coca-Cola", "Pepsi", "Sprite"),
names_to = "Product",
values_to = "Search_Hits")
head(trends_long)
## # A tibble: 6 × 3
## Date Product Search_Hits
## <date> <chr> <dbl>
## 1 2023-02-06 Coca-Cola 7
## 2 2023-02-06 Pepsi 62
## 3 2023-02-06 Sprite 4
## 4 2023-02-07 Coca-Cola 9
## 5 2023-02-07 Pepsi 68
## 6 2023-02-07 Sprite 4
final_df <- market_leaders_2023 %>%
inner_join(trends_long, by = c("Order_Date" = "Date", "Product" = "Product"))
head(final_df)
## # A tibble: 6 × 4
## Order_Date Product Daily_Quantity Search_Hits
## <date> <chr> <dbl> <dbl>
## 1 2023-02-06 Coca-Cola 3657 7
## 2 2023-02-06 Pepsi 3943 62
## 3 2023-02-06 Sprite 4523 4
## 4 2023-02-07 Coca-Cola 4248 9
## 5 2023-02-07 Pepsi 3579 68
## 6 2023-02-07 Sprite 4172 4
When reviewing the pearson correlation, it appears there wasn’t much of a relationship. The p-value gave us 0.087 which is greater than 0.05, so it’s not statistically significant for the month of February 2023, including the Super Bowl. The lack of significance can be that since Pepsi’s rise wasn’t synchronized perfectly due to the nature of social media and ads taking time to reach audiences, it may not be able to track the significance with only just Google trends and sales.
To do another investigation, I did a linear regression to see if search interest can predict sales volume. The intercept is 4222.576, which can signal to us that even if there was not a buzz online for Pepsi, Sprite, or Coca-Cola, these brands have a stable fanbase that already would purchases their sodas regardless of additional marketing campaigns. So even without a fresh campaign, all three of these brands can be expected to maintain a baseline of about 4200 units a day with their already established market presence.
cor_test <- cor.test(final_df$Search_Hits, final_df$Daily_Quantity)
cor_test
##
## Pearson's product-moment correlation
##
## data: final_df$Search_Hits and final_df$Daily_Quantity
## t = -1.7736, df = 28, p-value = 0.08701
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6084110 0.0479591
## sample estimates:
## cor
## -0.3178012
sales_model <- lm(Daily_Quantity ~ Search_Hits, data = final_df)
summary(sales_model)
##
## Call:
## lm(formula = Daily_Quantity ~ Search_Hits, data = final_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -958.26 -308.75 -32.56 230.25 1358.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4222.576 141.811 29.776 <2e-16 ***
## Search_Hits -5.618 3.168 -1.774 0.087 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 584.8 on 28 degrees of freedom
## Multiple R-squared: 0.101, Adjusted R-squared: 0.06889
## F-statistic: 3.146 on 1 and 28 DF, p-value: 0.08701
gtrendsR (429 Error): In my proposal, I mentioned using gtrendsR to pull the google trends report for these brands into R but I ran into an unanticipated 429 Error. To navigate around it, I pulled the csv from google trends of the search requests, added it into a google sheets document and utilized googlesheets4 to pull the data.
Github couldnt handle the beverage dataset:
Do to the vast amount of information in the beverages dataset, github wasn’t capable of hosting a file that large. I found an alternative by creating a smaller version of this dataset but using only the information in February 2023.