The purpose of this project is to guide readers through the process of:
This analysis seeks to address the following questions:
These questions will help us evaluate sales patterns, the effectiveness of different account types, and the role of promotional efforts.
:::: Workflow:
Data Import ➔ Data Exploration ➔ Data Cleaning & Transformation ➔
Visualization ➔ Reporting ➔ Sharing ➔ Data Archiving ::::
This sequence ensures a structured approach to data handling, allowing for an iterative process to refine insights as necessary.
The dataset includes variables that add depth to the analysis:
Regular, Sugar Free, and
Yellow Edition, for their impact on sales.Cooler,
Digital Screen, Menu Inclusion, and
Posters.YYYYYYY for all active programs,
YYYYYYN if the last program is inactive).# Setup for Knitr options to control output and avoid unwanted messages or warnings
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
results = "markup"
)
The following R packages are used for this analysis:
dplyr and tidyr: For data manipulation,
filtering, and reshaping. ggplot2: For creating
visualisations. skimr: To provide a quick overview of data
characteristics. janitor: Useful for cleaning column names
and summarising data. readxl and openxlsx: For
reading and writing Excel files. forcats: For managing
factor variables, especially in plotting.
# Load packages
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(skimr)
library(readxl)
library(janitor)
library(scales)
library(forcats)
library(openxlsx)
library(tidyverse)
The initial step is to load and clean the data for analysis. We load the data from an Excel file and rename columns to standardize names.
df <- read_excel('C:/Users/Account/Documents/Projects/Virtual Work/Red Bull/Account Sales Data for Analysis v2.xlsx',
sheet = 'original_data',
skip = 3) |>
janitor::clean_names() |>
mutate(rsycdmp = paste0(regular, sugar_free, yellow_edition, cooler, digital_screen, menu_inclusion, posters)) |>
select(-c(regular, sugar_free, yellow_edition, cooler, digital_screen, menu_inclusion, posters)) |>
pivot_longer(names_to = 'f_year', values_to = 'sales',
cols = c('x2017', 'x2018', 'x2019', 'x2020', 'x2021')) |>
select(account_name, account_type, rsycdmp, f_year, sales) |>
distinct()
glimpse(df)
## Rows: 300
## Columns: 5
## $ account_name <chr> "Bar 1", "Bar 1", "Bar 1", "Bar 1", "Bar 1", "Bar 2", "Ba…
## $ account_type <chr> "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "Bar", "…
## $ rsycdmp <chr> "YesYesYesYesYesYesYes", "YesYesYesYesYesYesYes", "YesYes…
## $ f_year <chr> "x2017", "x2018", "x2019", "x2020", "x2021", "x2017", "x2…
## $ sales <dbl> 1982, 5388, 7063, 7208, 9093, 2786, 3804, 4121, 6210, 690…
Data Cleaning: Renamed columns for readability.
Pivot Transformation: Changed year columns (e.g.,
x2017, x2018) to rows, creating a single
f_year column. **rsycdmp Column**: Encodes
presence/absence of marketing programs, making it easier to analyze
patterns by program combinations.
# Data Exploration
skim(df) # Summarized statistics of the cleaned data
| Name | df |
| Number of rows | 300 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| account_name | 0 | 1 | 5 | 14 | 0 | 60 | 0 |
| account_type | 0 | 1 | 3 | 10 | 0 | 4 | 0 |
| rsycdmp | 0 | 1 | 15 | 21 | 0 | 19 | 0 |
| f_year | 0 | 1 | 5 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sales | 0 | 1 | 4936.16 | 2956.9 | 24 | 2319.25 | 4932 | 7580.5 | 9983 | ▇▆▇▆▇ |
df1 <- df |>
select(f_year) |>
summary()
df1[1]
## [1] "Length:300 "
df1[6]
## [1] NA
The dataframe only contains 300 records of five columns (with the sales amount column being the only numeric data type and the rest are character data types).
Checking for any missing values.
# Check for any missing values
sum(is.na(df))
## [1] 0
We have none. Therefore the exploration should be easier.
# Format the year column to remove the 'x' letter at the beginning
df <- df %>%
mutate(f_year = as.integer(str_replace(f_year, 'x', '')),
rsycdmp = str_replace_all(rsycdmp, 'Yes', 'Y'), # Replace all Yes with Y
rsycdmp = str_replace_all(rsycdmp, 'No', 'N'),
on_premise_sales = case_when(account_type == 'Bar' ~ 'Yes',
account_type == 'Restaurant' ~ 'Yes',
account_type == 'Club' ~ 'Yes',
TRUE ~ 'No')) %>%
distinct()
str(df)
## tibble [300 × 6] (S3: tbl_df/tbl/data.frame)
## $ account_name : chr [1:300] "Bar 1" "Bar 1" "Bar 1" "Bar 1" ...
## $ account_type : chr [1:300] "Bar" "Bar" "Bar" "Bar" ...
## $ rsycdmp : chr [1:300] "YYYYYYY" "YYYYYYY" "YYYYYYY" "YYYYYYY" ...
## $ f_year : int [1:300] 2017 2018 2019 2020 2021 2017 2018 2019 2020 2021 ...
## $ sales : num [1:300] 1982 5388 7063 7208 9093 ...
## $ on_premise_sales: chr [1:300] "Yes" "Yes" "Yes" "Yes" ...
Below is the data summary.
df |>
summary()
## account_name account_type rsycdmp f_year
## Length:300 Length:300 Length:300 Min. :2017
## Class :character Class :character Class :character 1st Qu.:2018
## Mode :character Mode :character Mode :character Median :2019
## Mean :2019
## 3rd Qu.:2020
## Max. :2021
## sales on_premise_sales
## Min. : 24 Length:300
## 1st Qu.:2319 Class :character
## Median :4932 Mode :character
## Mean :4936
## 3rd Qu.:7580
## Max. :9983
Check the total Sales
sum(df$sales) # Correct sales volume total
## [1] 1480848
From 2017 the minimum sale was for 24 items with a high sale of 9983 at an average 4936 of sales YoY to 2021.
options(scipen = '999')
df %>%
ggplot(aes(sales))+
geom_histogram(bins = 75)+
labs(title = "The Distribution of Sales",
subtitle = "Year 2017 - Year 2021",
y = "Number of Occurances",
x = "Total Sale")
df |>
group_by(sales) |>
summarise(count = n()) |>
arrange(desc(count)) |>
ggplot(aes(sales, count))+
geom_histogram(stat = 'identity')
df |>
ggplot(aes(x = 1, y = sales)) +
geom_point(position = 'jitter')
There seem to be no uniform distribution of provided in the data interms of volumes sold to date. We have a pick of sales at
# Generate a boxplot of the sales
b <- ggplot(data = df, mapping = aes(sales)) +
geom_boxplot()+
labs(title = 'Periodic Sales Volumes',
x = 'Total Sales',
caption = 'Source: Red Bull Aviators Virtual Experience Hypothetical Account Dataset')+
facet_grid(f_year~.)
# Save the plot
ggsave("plot.png", plot = b)
b + facet_grid(account_type~.)
There seem to be no outliers in the dataset. With sales ranging from around $24 to $9,983 and the median being approx $5,000.00
df %>%
ggplot(aes(fct_reorder(on_premise_sales, sales), sales, fill = account_type, group = account_type))+
geom_col(position = 'dodge')+
theme(axis.text = element_text(size = rel(0.5)))+
labs(title = 'Yearly Account Sales',
x = 'On Premise',
y = 'Total Sales ($)',
caption = 'Source: Red Bull Aviators Virtual Experience Hypothetical Account Dataset')+
coord_flip()+
facet_grid(f_year~.)
# Yearly contribution by product
df |>
group_by(account_type, f_year) |>
summarise(amount = sum(sales)) |>
pivot_wider(names_from = f_year, values_from = amount) |>
mutate(cagr = percent((`2021`/`2017`)**(1/2)-1),
ste = percent((`2018`/`2017`)**(1/2)-1),
etn = percent((`2019`/`2018`)**(1/2)-1),
ntt = percent((`2020`/`2019`)**(1/2)-1),
ttto = percent((`2021`/`2020`)**(1/2)-1)) |>
arrange(desc(cagr)) |>
mutate(`2017` = dollar(`2017`),
`2018` = dollar(`2018`),
`2019` = dollar(`2019`),
`2020` = dollar(`2020`),
`2021` = dollar(`2021`)
)
## # A tibble: 4 × 11
## # Groups: account_type [4]
## account_type `2017` `2018` `2019` `2020` `2021` cagr ste etn ntt ttto
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Club $47,259 $67,2… $79,6… $102,… $112,… 54% 19% 9% 13% 5%
## 2 Hotel $44,888 $50,5… $70,3… $82,5… $100,… 50% 6% 18% 8% 10%
## 3 Restaurant $46,025 $65,0… $77,7… $89,5… $102,… 49% 19% 9% 7% 7%
## 4 Bar $51,804 $60,1… $60,7… $75,9… $94,1… 35% 8% 1% 12% 11%
The data shows that Club has significantly contributed
to revenues since the year 2017, by 19% as compared to other accounts
such as Hotel with 18%, and Restaurant earning 19% cum since 2017 and
last the Bar, which has contributed 13% on average. With a huge jump
having been experienced in 2020. Club has seen a huge
boost in sales during the 2017-2018 year (19%) and that says it for Restaurant, Hotel contributing 19%, 18% in sales growth
during the 2017-2018 and 2018-2019 respectively. Despite all these
booms, it is imperative to note that There is need to focus on promoting
Club and Restaurant, which have seen a
drop in sales during 2020-2021 year, dropping from the 13% and 7% from
the prior year reported values. Bar has seen a negative drop since
2019-2020 year which had boosted by 11% to 12% dropping off with 1% in
2020-2021.
ggplot(df, aes(f_year, sales))+
geom_smooth(aes(col = account_type, fill = account_type), se = FALSE)+
labs(title = 'Sales for the year 2017 - 2021',
x = '',
y = '$',
caption = 'Sales growth/trends by Account Type')
Both on Premise and Off-Premise sales have seen a growth stance, with the Club growing by 54% since 2017 at 19%, 9% and 13% growth YoY to 2021. Hotel sales have grown by 50%, Restaurant at 49% and, 35% respectively for Bar sales.
ggplot(df, aes(f_year, sales))+
geom_col()+
labs(title = 'YoY Sales',
x = 'Year',
y = 'Revenue ($)',
caption = 'Source: RedBull Account Sales Data')
Sales have have seen a steady growth since 2017, from around
USD189,976.00 to USD409,194.00 in 2021, which
translates into 16% growth.
df |>
group_by(account_name, account_type) |>
summarise(amount = sum(sales)) |>
ggplot(aes(fct_reorder(account_name, amount), amount))+
geom_histogram(stat = 'identity')+
coord_flip()+
facet_grid(~account_type)+
theme(axis.text = element_text(size = rel(0.6)))+
labs(title = 'Best Performing Accounts',
caption = '4.Best and worst performing accounts (overall, and by account_type)',
x = '')
ggplot(df, aes(fct_reorder(rsycdmp, sales), sales)) +
geom_col()+
scale_color_brewer(palette = 'Set2')+
coord_flip() +
ggtitle(label = 'Promotional Sales')+
facet_wrap(~account_type)
ggplot(df, aes(fct_reorder(rsycdmp, sales), sales)) +
geom_col()+
scale_color_brewer(palette = 'Set2')+
coord_flip() +
ggtitle(label = 'Promotional Sales')+
facet_wrap(account_type~f_year)+
theme(axis.text = element_text(size = rel(0.4)))
`Total Sales` <- sum(df['sales'])
paste0('Total Sales: ', dollar(`Total Sales`))
## [1] "Total Sales: $1,480,848"
Which marketing efforts are driving sales by atleast a percentage of total sales each year
df |>
group_by(f_year, rsycdmp) |>
summarise(total_sales = sum(sales),
count = n()) |>
ggplot(aes(f_year, total_sales, color = rsycdmp, size = count))+
geom_point(position = 'jitter')
df |>
group_by(f_year, rsycdmp) |>
summarise(total_sales = sum(sales),
count = n()) |>
arrange(desc(total_sales), f_year, desc(count)) |>
group_by(f_year) |>
summarise(max_sale = max(total_sales)) |>
mutate(max_sale = round(max_sale, 2)) |>
ggplot(aes(fct_reorder(as.character(f_year), max_sale), max_sale))+
geom_col()+
labs(title = 'Maximum Volume Sold YoY',
x = '',
y = 'Total Volume')
| ## Insights |
Sales have seen a steady growth since 2017 from 190k to above 400k in 2021. This has been widely driven by the marginal sales for Club.