Red Bull Aviators Virtual Experience Hypothetical Account Dataset

Objective 🎯

The purpose of this project is to guide readers through the process of:

🗄️ Database Creation: Establishing a PostgreSQL database in RStudio, configured within RMarkdown.
📑 Table Creation & Data Population: Designing and populating tables with SQL queries, enabling seamless data entry and efficient structure for analysis.
🛠️ Data Wrangling & Analysis: Conducting data manipulation, transformation, and analysis using SQL (Postgres) syntax within RStudio to gain valuable insights.

Key Questions 🔍

This analysis seeks to address the following questions:

What is the Total Sales by Account Type and Year?
How has Sales Grown or Trended by Account Type across the years?
What are the Annual Sales Trends over the five-year period?
Who are the Best & Worst Performing Accounts overall and by account type?
What is the Impact of Product Lines on Sales over time?
How Effective are Marketing Programs in driving sales?

These questions will help us evaluate sales patterns, the effectiveness of different account types, and the role of promotional efforts.

Procedure 🔁

:::: Workflow:
Data Import ➔ Data Exploration ➔ Data Cleaning & Transformation ➔ Visualization ➔ Reporting ➔ Sharing ➔ Data Archiving ::::

This sequence ensures a structured approach to data handling, allowing for an iterative process to refine insights as necessary.

Additional Variables 📝

The dataset includes variables that add depth to the analysis:

Product Lines: We assess three product variants, namely Regular, Sugar Free, and Yellow Edition, for their impact on sales.
Marketing Programs: Analyzes the effect of four marketing programs, including Cooler, Digital Screen, Menu Inclusion, and Posters.
Variable Code (rsycdmp): A string code that summarizes the presence of these product lines and marketing programs (e.g., YYYYYYY for all active programs, YYYYYYN if the last program is inactive).

# Setup for Knitr options to control output and avoid unwanted messages or warnings
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  results = "markup"
)

Libraries 📚

The following R packages are used for this analysis:

dplyr and tidyr: For data manipulation, filtering, and reshaping. ggplot2: For creating visualisations. skimr: To provide a quick overview of data characteristics. janitor: Useful for cleaning column names and summarising data. readxl and openxlsx: For reading and writing Excel files. forcats: For managing factor variables, especially in plotting.

# Load packages
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(skimr)
library(readxl)
library(janitor)
library(scales)
library(forcats)
library(openxlsx)
library(tidyverse)

Loading the Data 📂

The initial step is to load and clean the data for analysis. We load the data from an Excel file and rename columns to standardize names.

df <- read_excel('C:/Users/Account/Documents/Projects/Virtual Work/Red Bull/Account Sales Data for Analysis v2.xlsx',
                 sheet = 'original_data',
                 skip = 3) |> 
  janitor::clean_names() |> 
  mutate(rsycdmp = paste0(regular, sugar_free, yellow_edition, cooler, digital_screen, menu_inclusion, posters)) |> 
  select(-c(regular, sugar_free, yellow_edition, cooler, digital_screen, menu_inclusion, posters)) |> 
  pivot_longer(names_to = 'f_year', values_to = 'sales',
               cols = c('x2017', 'x2018', 'x2019', 'x2020', 'x2021')) |> 
  select(account_name, account_type, rsycdmp, f_year, sales) |> 
  distinct()

glimpse(df)

## Rows: 300
## Columns: 5
## $ account_name <chr> "Bar 1", "Bar 1", "Bar 1", "Bar 1", "Bar 1", "Bar 2", "Ba…
## $ account_type <chr> "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "…
## $ rsycdmp      <chr> "YesYesYesYesYesYesYes", "YesYesYesYesYesYesYes", "YesYes…
## $ f_year       <chr> "x2017", "x2018", "x2019", "x2020", "x2021", "x2017", "x2…
## $ sales        <dbl> 1982, 5388, 7063, 7208, 9093, 2786, 3804, 4121, 6210, 690…

Explanation

Data Cleaning: Renamed columns for readability. Pivot Transformation: Changed year columns (e.g., x2017, x2018) to rows, creating a single f_year column. **rsycdmp Column**: Encodes presence/absence of marketing programs, making it easier to analyze patterns by program combinations.

# Data Exploration
skim(df)  # Summarized statistics of the cleaned data

Data summary
Name	df
Number of rows	300
Number of columns	5
_______________________
Column type frequency:
character	4
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
account_name	1	5	14	60
account_type	1	3	10	4
rsycdmp	1	15	21	19
f_year	1	5	5	5

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
sales	0	1	4936.16	2956.9	24	2319.25	4932	7580.5	9983	▇▆▇▆▇

df1 <- df |> 
  select(f_year) |> 
  summary()

df1[1]

## [1] "Length:300        "

df1[6]

## [1] NA

The dataframe only contains 300 records of five columns (with the sales amount column being the only numeric data type and the rest are character data types).

Checking for any missing values.

# Check for any missing values
sum(is.na(df))

## [1] 0

We have none. Therefore the exploration should be easier.

# Format the year column to remove the 'x' letter at the beginning
df <- df %>% 
  mutate(f_year = as.integer(str_replace(f_year, 'x', '')),
         rsycdmp = str_replace_all(rsycdmp, 'Yes', 'Y'), # Replace all Yes with Y
         rsycdmp = str_replace_all(rsycdmp, 'No', 'N'),
         on_premise_sales = case_when(account_type == 'Bar' ~ 'Yes',
                                      account_type == 'Restaurant' ~ 'Yes',
                                      account_type == 'Club' ~ 'Yes',
                                      TRUE ~ 'No')) %>% 
  distinct()

str(df)

## tibble [300 × 6] (S3: tbl_df/tbl/data.frame)
##  $ account_name    : chr [1:300] "Bar 1" "Bar 1" "Bar 1" "Bar 1" ...
##  $ account_type    : chr [1:300] "Bar" "Bar" "Bar" "Bar" ...
##  $ rsycdmp         : chr [1:300] "YYYYYYY" "YYYYYYY" "YYYYYYY" "YYYYYYY" ...
##  $ f_year          : int [1:300] 2017 2018 2019 2020 2021 2017 2018 2019 2020 2021 ...
##  $ sales           : num [1:300] 1982 5388 7063 7208 9093 ...
##  $ on_premise_sales: chr [1:300] "Yes" "Yes" "Yes" "Yes" ...

Below is the data summary.

df |> 
  summary()

##  account_name       account_type         rsycdmp              f_year    
##  Length:300         Length:300         Length:300         Min.   :2017  
##  Class :character   Class :character   Class :character   1st Qu.:2018  
##  Mode  :character   Mode  :character   Mode  :character   Median :2019  
##                                                           Mean   :2019  
##                                                           3rd Qu.:2020  
##                                                           Max.   :2021  
##      sales      on_premise_sales  
##  Min.   :  24   Length:300        
##  1st Qu.:2319   Class :character  
##  Median :4932   Mode  :character  
##  Mean   :4936                     
##  3rd Qu.:7580                     
##  Max.   :9983

Check the total Sales

sum(df$sales) # Correct sales volume total

## [1] 1480848

From 2017 the minimum sale was for 24 items with a high sale of 9983 at an average 4936 of sales YoY to 2021.

options(scipen = '999')
df %>% 
  ggplot(aes(sales))+
  geom_histogram(bins = 75)+
  labs(title = "The Distribution of Sales",
       subtitle = "Year 2017 - Year 2021",
       y = "Number of Occurances",
       x = "Total Sale")

df |> 
  group_by(sales) |> 
  summarise(count = n()) |> 
  arrange(desc(count)) |> 
  ggplot(aes(sales, count))+
  geom_histogram(stat = 'identity')

df |> 
  ggplot(aes(x = 1, y = sales)) +
  geom_point(position = 'jitter')

There seem to be no uniform distribution of provided in the data interms of volumes sold to date. We have a pick of sales at

# Generate a boxplot of the sales
b <- ggplot(data = df, mapping = aes(sales)) +
  geom_boxplot()+
  labs(title = 'Periodic Sales Volumes',
       x = 'Total Sales',
       caption = 'Source: Red Bull Aviators Virtual Experience Hypothetical Account Dataset')+
  facet_grid(f_year~.)

# Save the plot
ggsave("plot.png", plot = b)

b + facet_grid(account_type~.)

There seem to be no outliers in the dataset. With sales ranging from around $24 to $9,983 and the median being approx $5,000.00

Exploration

df %>% 
  ggplot(aes(fct_reorder(on_premise_sales, sales), sales, fill = account_type, group = account_type))+
  geom_col(position = 'dodge')+
  theme(axis.text = element_text(size = rel(0.5)))+
  labs(title = 'Yearly Account Sales',
       x = 'On Premise',
       y = 'Total Sales ($)',
       caption = 'Source: Red Bull Aviators Virtual Experience Hypothetical Account Dataset')+
  coord_flip()+
  facet_grid(f_year~.)

# Yearly contribution by product
df |> 
  group_by(account_type, f_year) |> 
  summarise(amount = sum(sales)) |>  
  pivot_wider(names_from = f_year, values_from = amount) |> 
  mutate(cagr = percent((`2021`/`2017`)**(1/2)-1),
        ste = percent((`2018`/`2017`)**(1/2)-1),
        etn = percent((`2019`/`2018`)**(1/2)-1),
        ntt = percent((`2020`/`2019`)**(1/2)-1),
        ttto = percent((`2021`/`2020`)**(1/2)-1)) |> 
  arrange(desc(cagr)) |> 
  mutate(`2017` = dollar(`2017`),
         `2018` = dollar(`2018`),
         `2019` = dollar(`2019`),
         `2020` = dollar(`2020`),
         `2021` = dollar(`2021`)
         )

## # A tibble: 4 × 11
## # Groups:   account_type [4]
##   account_type `2017`  `2018` `2019` `2020` `2021` cagr  ste   etn   ntt   ttto 
##   <chr>        <chr>   <chr>  <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>
## 1 Club         $47,259 $67,2… $79,6… $102,… $112,… 54%   19%   9%    13%   5%   
## 2 Hotel        $44,888 $50,5… $70,3… $82,5… $100,… 50%   6%    18%   8%    10%  
## 3 Restaurant   $46,025 $65,0… $77,7… $89,5… $102,… 49%   19%   9%    7%    7%   
## 4 Bar          $51,804 $60,1… $60,7… $75,9… $94,1… 35%   8%    1%    12%   11%

The data shows that Club has significantly contributed to revenues since the year 2017, by 19% as compared to other accounts such as Hotel with 18%, and Restaurant earning 19% cum since 2017 and last the Bar, which has contributed 13% on average. With a huge jump having been experienced in 2020. Club has seen a huge boost in sales during the 2017-2018 year (19%) and that says it for Restaurant, Hotel contributing 19%, 18% in sales growth during the 2017-2018 and 2018-2019 respectively. Despite all these booms, it is imperative to note that There is need to focus on promoting Club and Restaurant, which have seen a drop in sales during 2020-2021 year, dropping from the 13% and 7% from the prior year reported values. Bar has seen a negative drop since 2019-2020 year which had boosted by 11% to 12% dropping off with 1% in 2020-2021.

ggplot(df, aes(f_year, sales))+
  geom_smooth(aes(col = account_type, fill = account_type), se = FALSE)+
  labs(title = 'Sales for the year 2017 - 2021',
       x = '',
       y = '$',
       caption = 'Sales growth/trends by Account Type')

Both on Premise and Off-Premise sales have seen a growth stance, with the Club growing by 54% since 2017 at 19%, 9% and 13% growth YoY to 2021. Hotel sales have grown by 50%, Restaurant at 49% and, 35% respectively for Bar sales.

ggplot(df, aes(f_year, sales))+
  geom_col()+
  labs(title = 'YoY Sales',
       x = 'Year',
       y = 'Revenue ($)',
       caption = 'Source: RedBull Account Sales Data')

Sales have have seen a steady growth since 2017, from around USD189,976.00 to USD409,194.00 in 2021, which translates into 16% growth.

Best and worst performing accounts (overall, and by account_type)

df |> 
  group_by(account_name, account_type) |> 
  summarise(amount = sum(sales)) |> 
  ggplot(aes(fct_reorder(account_name, amount), amount))+
  geom_histogram(stat = 'identity')+
  coord_flip()+
  facet_grid(~account_type)+
  theme(axis.text = element_text(size = rel(0.6)))+
  labs(title = 'Best Performing Accounts',
       caption = '4.Best and worst performing accounts (overall, and by account_type)',
       x = '')

ggplot(df, aes(fct_reorder(rsycdmp, sales), sales)) +
  geom_col()+
  scale_color_brewer(palette = 'Set2')+
  coord_flip() +
  ggtitle(label = 'Promotional Sales')+
  facet_wrap(~account_type)

ggplot(df, aes(fct_reorder(rsycdmp, sales), sales)) +
  geom_col()+
  scale_color_brewer(palette = 'Set2')+
  coord_flip() +
  ggtitle(label = 'Promotional Sales')+
  facet_wrap(account_type~f_year)+
  theme(axis.text = element_text(size = rel(0.4)))

`Total Sales` <- sum(df['sales'])

paste0('Total Sales: ', dollar(`Total Sales`))

## [1] "Total Sales: $1,480,848"

Which marketing efforts are driving sales by atleast a percentage of total sales each year

df |> 
  group_by(f_year, rsycdmp) |> 
  summarise(total_sales = sum(sales),
            count = n()) |> 
  ggplot(aes(f_year, total_sales, color = rsycdmp, size = count))+
  geom_point(position = 'jitter')

df |> 
  group_by(f_year, rsycdmp) |> 
  summarise(total_sales = sum(sales),
            count = n()) |> 
  arrange(desc(total_sales), f_year, desc(count)) |> 
  group_by(f_year) |> 
  summarise(max_sale = max(total_sales)) |> 
  mutate(max_sale = round(max_sale, 2)) |> 
  ggplot(aes(fct_reorder(as.character(f_year), max_sale), max_sale))+
  geom_col()+
  labs(title = 'Maximum Volume Sold YoY',
       x = '',
       y = 'Total Volume')

## Insights

Sales have seen a steady growth since 2017 from 190k to above 400k in 2021. This has been widely driven by the marginal sales for Club.