Introduction

Maven is a website with a diversity of datasets for training or data analysis.

Objective

Analyze the dataset “Coffee Store” and take insights about the store’s financial health and advise the manager on how to improve profit.

Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at /mnt/ssd/projetos/dataviz/Coffee+Shop+Sales/R
library(readxl)
library(glue)
library(ggtext)
library(lubridate)
library(skimr)

Import dataframe

df=read_xlsx(here('../dados/Coffee Shop Sales.xlsx'))

Glimpse dataframe

df |> glimpse()
## Rows: 149,116
## Columns: 11
## $ transaction_id   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ transaction_date <dttm> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023…
## $ transaction_time <dttm> 1899-12-31 07:06:11, 1899-12-31 07:08:56, 1899-12-31…
## $ transaction_qty  <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1,…
## $ store_id         <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8,…
## $ store_location   <chr> "Lower Manhattan", "Lower Manhattan", "Lower Manhatta…
## $ product_id       <dbl> 32, 57, 59, 22, 57, 77, 22, 28, 39, 58, 56, 33, 51, 5…
## $ unit_price       <dbl> 3.00, 3.10, 4.50, 2.00, 3.10, 3.00, 2.00, 2.00, 4.25,…
## $ product_category <chr> "Coffee", "Tea", "Drinking Chocolate", "Coffee", "Tea…
## $ product_type     <chr> "Gourmet brewed coffee", "Brewed Chai tea", "Hot choc…
## $ product_detail   <chr> "Ethiopia Rg", "Spicy Eye Opener Chai Lg", "Dark choc…

Resume of dataframe

df |> skim()
Data summary
Name df
Number of rows 149116
Number of columns 11
_______________________
Column type frequency:
character 4
numeric 5
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
store_location 0 1 7 15 0 3 0
product_category 0 1 3 18 0 9 0
product_type 0 1 5 21 0 29 0
product_detail 0 1 5 28 0 80 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
transaction_id 0 1 74737.37 43153.60 1.0 37335.75 74727.5 112094.25 149456 ▇▇▇▇▇
transaction_qty 0 1 1.44 0.54 1.0 1.00 1.0 2.00 8 ▇▁▁▁▁
store_id 0 1 5.34 2.07 3.0 3.00 5.0 8.00 8 ▇▇▁▁▇
product_id 0 1 47.92 17.93 1.0 33.00 47.0 60.00 87 ▁▆▇▆▅
unit_price 0 1 3.38 2.66 0.8 2.50 3.0 3.75 45 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
transaction_date 0 1 2023-01-01 00:00:00 2023-06-30 00:00:00 2023-04-24 00:00:00 181
transaction_time 0 1 1899-12-31 06:00:00 1899-12-31 20:59:32 1899-12-31 11:15:28 25762

Create the new column ‘money’

The ‘money’ column is calculated by multiplying transaction_qty with unit_price. This will help to look the revenue of each store.

df$money=df$transaction_qty*df$unit_price

What are the coffee categories?

Product Categories (Created with the help of ChatGPT)

Category Description
Coffee A hot drink made from coffee beans.
Tea A drink made from tea leaves.
Drinking Chocolate A sweet chocolate drink, usually hot.
Bakery Baked foods like bread and cakes.
Flavours Things added to change or improve taste.
Loose Tea Tea leaves sold without bags or packaging.
Coffee beans Whole beans used to make coffee.
Packaged Chocolate Chocolate sold in bars or boxes.
Branded Products with a known company name.

How much does each product category contribute to the semester revenue?

for( product in unique(df$product_category)){
  total_money=sum(df$money)
  valor_total_do_produto=df|> filter(product_category==product) |> summarise(sum(money))
  valor_percentual=round((valor_total_do_produto/total_money)*100,2)
  cores = setNames(c("#000000", "#E69F00"), c("Others", product))
  g=df |> 
    mutate(geral= ifelse(product_category == product,product,'Others'),
           geral = factor(geral, levels = c("Others",product))  ) |>
    group_by(geral,transaction_date) |>
    summarise(total=sum(money)) |> 
    ggplot() +
    geom_area(aes(x=transaction_date,y=total,fill=geral))+
    scale_fill_manual(values = cores) +
    labs(subtitle = glue('The total revenue between January and July was 
<b>{total_money}<b>.'),
         title = glue('The product category <i><b>{product}</b></i> contributes with 
<b>{valor_percentual}%</b> of the revenue')) +
    theme(
      plot.title = element_markdown(size = 14),
      plot.subtitle = element_markdown(size = 10),
      panel.background = element_rect(fill = 'white'),
      panel.grid = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      legend.title = element_blank(),
      legend.position = 'top',
      legend.margin = margin(t=1)
      
    ) +
    scale_y_sqrt() +
    ylab('') +
    xlab('')
  print(g)
}
## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'geral'. You can override using the
## `.groups` argument.

## Conclusion

Coffee and tea sales together account for more than 50% of total revenue.

How much does each location contribute to the annual revenue?

cores=c("#000000", "#E69F00",'#808080')
df$year=lubridate::year(df$transaction_date) |> as.factor()
df |> 
  group_by(store_location,year) |>
  summarise(total=sum(money)) |> 
  ggplot() +
  geom_col(aes(x=year,y=total,fill=store_location),position = position_dodge(width = 1),width  = 0.7 )+
  scale_fill_manual(values = cores) +
  labs(title = 'Each store location contributes similarly to annual revenue.')+
  theme(
    plot.title = element_markdown(size = 14),
    plot.subtitle = element_markdown(size = 10),
    panel.background = element_rect(fill = 'white'),
    panel.grid = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.title = element_blank(),
    legend.position = 'top',
    legend.margin = margin(t=1)
  ) +
  scale_y_sqrt() +
  ylab('') +
  xlab('')
## `summarise()` has grouped output by 'store_location'. You can override using
## the `.groups` argument.

# Which product category contributes the most in each store?

for( place in unique(df$store_location)){
  category_colors <- c(
  "Coffee"             = "#6F4E37",  
  "Tea"                = "#386641",  
  "Drinking Chocolate" = "#A0522D",  
  "Bakery"             = "#C06014",  
  "Flavours"           = "#7B1FA2",  
  "Loose Tea"          = "#1E88E5",  
  "Coffee beans"       = "#3E2723",  
  "Packaged Chocolate" = "#8E44AD",
  "Branded"            = "#B71C1C"   
)

  df_g=df |> 
     filter( store_location == place)
  
  total_money=sum(df_g$money)
  
  g= df_g|>
    group_by(transaction_date,product_category  ) |>
    summarise(total=sum(money)) |> 
    ggplot() +
    geom_area(aes(x=transaction_date,y=total,fill=reorder(product_category,total)))+
    scale_fill_manual(values = category_colors) +
    labs(subtitle = glue('The total revenue between January and July was 
<b>{total_money}<b>.'),
         title = glue('How much does each category contribute to the {place} store')) +
    theme(
      plot.title = element_markdown(size = 14),
      plot.subtitle = element_markdown(size = 10),
      panel.background = element_rect(fill = 'white'),
      panel.grid = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      legend.title = element_blank(),
      legend.position = 'top',
      legend.margin = margin(t=1)
      
    ) +
    scale_y_sqrt() +
    ylab('') +
    xlab('')
  print(g)
}
## `summarise()` has grouped output by 'transaction_date'. You can override using
## the `.groups` argument.

## `summarise()` has grouped output by 'transaction_date'. You can override using
## the `.groups` argument.

## `summarise()` has grouped output by 'transaction_date'. You can override using
## the `.groups` argument.

Coffee and tea sales were the categories of products with the most sales.

What are the top 5 types of products for each store?

for( place in unique(df$store_location)){

  df_g=df |> 
     filter( store_location == place)
  
  g= df_g|>
    group_by(product_type,product_category) |>
    summarise(mean_unit_price=mean(unit_price),mean_quantity=mean(transaction_qty),total=sum(money))|> 
    arrange(-total) |> 
    head(5)

  print(g)
}
## `summarise()` has grouped output by 'product_type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 5
## # Groups:   product_type [5]
##   product_type          product_category   mean_unit_price mean_quantity  total
##   <chr>                 <chr>                        <dbl>         <dbl>  <dbl>
## 1 Barista Espresso      Coffee                        3.64          1.60 31051 
## 2 Brewed Chai tea       Tea                           2.93          1.62 24009.
## 3 Gourmet brewed coffee Coffee                        2.71          1.64 23201.
## 4 Hot chocolate         Drinking Chocolate            4.14          1.60 22494.
## 5 Brewed Black tea      Tea                           2.75          1.64 16216.
## `summarise()` has grouped output by 'product_type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 5
## # Groups:   product_type [5]
##   product_type          product_category   mean_unit_price mean_quantity  total
##   <chr>                 <chr>                        <dbl>         <dbl>  <dbl>
## 1 Barista Espresso      Coffee                        3.55          1.47 32420.
## 2 Brewed Chai tea       Tea                           2.93          1.50 25645.
## 3 Hot chocolate         Drinking Chocolate            4.16          1.51 23586.
## 4 Gourmet brewed coffee Coffee                        2.72          1.50 23010.
## 5 Brewed Black tea      Tea                           2.75          1.51 15576.
## `summarise()` has grouped output by 'product_type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 5
## # Groups:   product_type [5]
##   product_type          product_category   mean_unit_price mean_quantity  total
##   <chr>                 <chr>                        <dbl>         <dbl>  <dbl>
## 1 Barista Espresso      Coffee                        3.80          1.49 27935 
## 2 Brewed Chai tea       Tea                           2.95          1.48 27428.
## 3 Hot chocolate         Drinking Chocolate            4.14          1.48 26335.
## 4 Gourmet brewed coffee Coffee                        2.67          1.48 23823.
## 5 Brewed herbal tea     Tea                           2.74          1.49 16282

The top products in the three stores are: - Barista Espresso - Brewed Chai tea - Hot chocolate - Gourmet brewed coffee

“Brewed Black and herbal tea profits are too low compared to the others!”

“How much does each top product contribute to sales each month?”

beverage_colors <- c(
  "Barista Espresso"      = "#3B1F1B", 
  "Brewed Chai tea"       = "#B86023", 
  "Hot chocolate"         = "#7B4D45",  
  "Gourmet brewed coffee" = "#A67C52",
  "Others"                = "#6E5A75"   
)

for (store in df$store_location |> unique()){
  df_g = df |> filter(store_location == store)
  
  g = df_g |>
    mutate(product= ifelse(product_type %in% c('Barista Espresso','Brewed Chai tea','Hot chocolate','Gourmet brewed coffee'),product_type, 'Others'),
           product= factor(product, levels = c("Others",'Barista Espresso','Brewed Chai tea','Hot chocolate','Gourmet brewed coffee'))) |>
    mutate(month=month(transaction_date,label = TRUE)) |> 
    group_by(month,product) |> 
    summarise(total_quantity=sum(transaction_qty)) |> 
    ggplot() +
    xlab('')+
    ylab('')+
    geom_col(aes(x=month,y=total_quantity, fill=product)) +
    scale_fill_manual(values=beverage_colors) +
    labs(title=glue('The contribution proportion of each product for sales by month in {store}'))+
    theme(
      plot.title = element_text(size = 14,margin = margin(0,0,3,0)),
      panel.background = element_rect(fill = 'white'),
      panel.grid = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      legend.title = element_blank(),
      legend.position = 'top',
      legend.margin = margin(t=1),
      legend.text = element_text(size=8)
      
    ) 
  
  print(g)
    
             
}
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

Sales have increased significantly along the year. Barista Espresso, Brewed Chai tea, and Gourmet brewed coffee have similar sales quantities. Hot chocolate sales are significantly lower.

Conclusion

There is a growing trend to sell more. The stores should increase the proportion of products: Barista Espresso,Brewed Chai tea,Hot chocolate,Gourmet brewed coffee.