Introduction

In October 2023, James Hoffmann ran a taste test with around 4,000 people in the US, in which each participant received four unlabeled coffee samples. A month later, he released the survey data from the taste test. The survey offers interesting statistical insights, so let’s explore it and build some engaging visualizations.

source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-14/readme.md

Business Questions

1. At what age is coffee consumption highest?

2. What is the best coffee recommendation?

3. Where is the favorite place to enjoy coffee?

1. Data Preparation

1.1 Prerequisites

1.2 Importing Libraries

# for data wrangling
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
# visualization
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(DT)
library(png)
library(grid)
library(prettydoc)
library(rmarkdown)

1.3 Importing Dataset

Coffe <- read.csv("data_input/Coffe.csv")

2. Data Processing

2.1 Duplicates

sum(duplicated(Coffe))
## [1] 0

2.2 Missing Values

sum(is.na(Coffe))
## [1] 56764
Coffe1 <- subset(Coffe, select = c(age, where_drink, cups, favorite, additions, style, strength, roast_level, caffeine))
Coffe1[Coffe1 == ""] <- NA

Let’s check the selected columns for missing values.

colSums(is.na(x = Coffe1))
##         age where_drink        cups    favorite   additions       style 
##          31          70          93          62          83          84 
##    strength roast_level    caffeine 
##         126         102         125
Coffe2 <- Coffe1 %>%
  drop_na()
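
The reason for recoding blanks before counting can be seen on a small example. The sketch below uses a made-up data frame (the values are hypothetical, not taken from the survey file) and base R only; `complete.cases()` is used here as a stand-in for `drop_na()` with no columns specified.

```r
# Toy data (hypothetical values, not taken from the survey file)
toy <- data.frame(
  age   = c("25-34 years old", "", "35-44 years old", "18-24 years old"),
  style = c("Fruity", "Juicy", "", "Bright"),
  stringsAsFactors = FALSE
)

# Empty strings are not NA, so nothing is counted as missing yet
na_before <- sum(is.na(toy))

# Recode empty strings as NA, as done for Coffe1 above
toy[toy == ""] <- NA
na_after <- colSums(is.na(toy))

# Keep only the rows with no NA in any column
toy_complete <- toy[complete.cases(toy), ]

na_before
na_after
nrow(toy_complete)
```

Counting with `is.na()` before the recoding step would miss blank answers in character columns, which is why the per-column check comes after it.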

2.3 Data Types

To make processing easier, we will convert all character columns in the selected data into factors. First, let’s inspect the current column types.

summary(Coffe2)
##      age            where_drink            cups             favorite        
##  Length:3811        Length:3811        Length:3811        Length:3811       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   additions            style             strength         roast_level       
##  Length:3811        Length:3811        Length:3811        Length:3811       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    caffeine        
##  Length:3811       
##  Class :character  
##  Mode  :character
str(Coffe2)
## 'data.frame':    3811 obs. of  9 variables:
##  $ age        : chr  "<18 years old" "25-34 years old" "<18 years old" ">65 years old" ...
##  $ where_drink: chr  "On the go, At a cafe, At home, At the office" "At the office" "At home, At the office, At a cafe" "At the office, At a cafe" ...
##  $ cups       : chr  "2" "1" "3" "3" ...
##  $ favorite   : chr  "Other" "Pourover" "Pourover" "Cortado" ...
##  $ additions  : chr  "Milk, dairy alternative, or coffee creamer, Sugar or sweetener, Flavor syrup" "No - just black" "No - just black, Milk, dairy alternative, or coffee creamer" "No - just black" ...
##  $ style      : chr  "Full Bodied" "Fruity" "Bright" "Fruity" ...
##  $ strength   : chr  "Somewhat light" "Somewhat strong" "Medium" "Somewhat strong" ...
##  $ roast_level: chr  "Light" "Light" "Light" "Blonde" ...
##  $ caffeine   : chr  "Full caffeine" "Full caffeine" "Full caffeine" "Full caffeine" ...

3. Exploratory Data Analysis

Coffe3 <- Coffe2 %>%
  mutate(across(where(is.character), as.factor))
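
`as.factor()` puts levels in sorted order, which is not necessarily age order. If the age groups should appear chronologically in tables and plots, one option is to set the levels explicitly. This is a minimal sketch, not part of the original analysis; `age_levels`, `age_chr`, and `age_fct` are illustrative names and `age_chr` is a made-up sample.

```r
# Age labels as they appear in the survey output above
age_levels <- c("<18 years old", "18-24 years old", "25-34 years old",
                "35-44 years old", "45-54 years old", "55-64 years old",
                ">65 years old")

# Toy vector (hypothetical sample of responses)
age_chr <- c("35-44 years old", "<18 years old", ">65 years old", "25-34 years old")

# Explicit levels keep the groups in chronological order
age_fct <- factor(age_chr, levels = age_levels, ordered = TRUE)

levels(age_fct)
sort(age_fct)
```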

3.1 Age group with the most coffee drinkers

oage <- Coffe3 %>%
  group_by(age) %>%
  summarise(total = n()) %>%
  arrange(desc(total))
oage
## # A tibble: 7 × 2
##   age             total
##   <fct>           <int>
## 1 25-34 years old  1900
## 2 35-44 years old   914
## 3 18-24 years old   424
## 4 45-54 years old   291
## 5 55-64 years old   176
## 6 >65 years old      89
## 7 <18 years old      17
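
Raw counts are easier to compare as shares of the total. The sketch below re-enters the counts from the table above by hand (in the analysis itself they would come from `oage$total`) and converts them to percentages; `age_counts` and `age_share` are illustrative names.

```r
# Counts copied from the `oage` table above
age_counts <- c(
  "25-34 years old" = 1900, "35-44 years old" = 914, "18-24 years old" = 424,
  "45-54 years old" = 291, "55-64 years old" = 176, ">65 years old" = 89,
  "<18 years old" = 17
)

# Share of each age group in percent, rounded to one decimal place
age_share <- round(100 * prop.table(age_counts), 1)
age_share
```

The 25-34 group accounts for roughly half of the 3,811 respondents kept after cleaning.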

3.2 View the best recipes for all ages

allrecipe <- Coffe3 %>%
  group_by(favorite, style, additions, strength, roast_level, caffeine) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  head(5)
## `summarise()` has grouped output by 'favorite', 'style', 'additions',
## 'strength', 'roast_level'. You can override using the `.groups` argument.
allrecipe
## # A tibble: 5 × 7
##   favorite style  additions       strength        roast_level caffeine     count
##   <fct>    <fct>  <fct>           <fct>           <fct>       <fct>        <int>
## 1 Pourover Fruity No - just black Medium          Light       Full caffei…   123
## 2 Pourover Fruity No - just black Somewhat strong Light       Full caffei…   118
## 3 Pourover Juicy  No - just black Medium          Light       Full caffei…    43
## 4 Pourover Juicy  No - just black Somewhat strong Light       Full caffei…    42
## 5 Pourover Bright No - just black Somewhat strong Light       Full caffei…    34

Based on the table above, the most popular combination across all ages is Pourover + Fruity + No - just black + Medium + Light + Full caffeine, chosen by 123 respondents.

Next, let’s find the best recipe for each age group by adding age to the aggregation.

age <- Coffe3 %>%
  group_by(age, favorite, style, additions, strength, roast_level, caffeine) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()
## `summarise()` has grouped output by 'age', 'favorite', 'style', 'additions',
## 'strength', 'roast_level'. You can override using the `.groups` argument.
age
## # A tibble: 2,521 × 8
##    age             favorite style  additions strength roast_level caffeine count
##    <fct>           <fct>    <fct>  <fct>     <fct>    <fct>       <fct>    <int>
##  1 25-34 years old Pourover Fruity No - jus… Medium   Light       Full ca…    72
##  2 25-34 years old Pourover Fruity No - jus… Somewha… Light       Full ca…    63
##  3 35-44 years old Pourover Fruity No - jus… Medium   Light       Full ca…    36
##  4 25-34 years old Pourover Juicy  No - jus… Somewha… Light       Full ca…    28
##  5 35-44 years old Pourover Fruity No - jus… Somewha… Light       Full ca…    28
##  6 25-34 years old Pourover Juicy  No - jus… Medium   Light       Full ca…    18
##  7 25-34 years old Pourover Floral No - jus… Somewha… Light       Full ca…    17
##  8 18-24 years old Pourover Fruity No - jus… Somewha… Light       Full ca…    15
##  9 25-34 years old Pourover Bright No - jus… Medium   Light       Full ca…    15
## 10 25-34 years old Pourover Bright No - jus… Somewha… Light       Full ca…    15
## # ℹ 2,511 more rows

Group the data by age and keep the most popular recipe in each group

agerecipe <- age %>%
  group_by(age) %>%
  slice_max(order_by = count, n = 1) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  head(5)
agerecipe
## # A tibble: 5 × 8
##   age             favorite style  additions  strength roast_level caffeine count
##   <fct>           <fct>    <fct>  <fct>      <fct>    <fct>       <fct>    <int>
## 1 25-34 years old Pourover Fruity No - just… Medium   Light       Full ca…    72
## 2 35-44 years old Pourover Fruity No - just… Medium   Light       Full ca…    36
## 3 18-24 years old Pourover Fruity No - just… Somewha… Light       Full ca…    15
## 4 45-54 years old Pourover Fruity No - just… Somewha… Light       Full ca…     6
## 5 55-64 years old Pourover Fruity No - just… Somewha… Light       Full ca…     6
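
Note that `slice_max()` keeps ties by default (`with_ties = TRUE`), so a group whose top count is shared by several recipes can return more than one row. The base R sketch below illustrates the same "top row per group, ties kept" logic on a made-up table; the data and the names `toy`, `group_max`, and `top` are hypothetical.

```r
# Toy aggregated table (hypothetical counts)
toy <- data.frame(
  age    = c("A", "A", "B", "B", "B"),
  recipe = c("r1", "r2", "r1", "r2", "r3"),
  count  = c(10, 4, 7, 7, 2)
)

# Maximum count within each age group, repeated on every row of that group
group_max <- ave(toy$count, toy$age, FUN = max)

# Keep rows that reach their group's maximum; tied rows are all kept
top <- toy[toy$count == group_max, ]
top
```

If exactly one row per age group is required, passing `with_ties = FALSE` to `slice_max()` is one way to enforce it.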

Combine the recipe columns into a single label so that each age group is shown with its top recipe and the number of respondents who chose it

x <- data.frame(
  age = c("25-34 years old", "35-44 years old", "18-24 years old", "45-54 years old", "55-64 years old"),
  favorite = c("Pourover", "Pourover", "Pourover", "Pourover", "Pourover"),
  style = c("Fruity", "Fruity", "Fruity", "Fruity", "Fruity"),
  additions = c("No - just black", "No - just black", "No - just black", "No - just black", "No - just black"),
  strength = c("Medium", "Medium", "Somewhat strong", "Somewhat strong", "Somewhat strong"),
  roast_level = c("Light", "Light", "Light", "Light", "Light"),
  caffeine = c("Full caffeine", "Full caffeine", "Full caffeine", "Full caffeine", "Full caffeine"),
  count = c(72, 36, 15, 6, 6)
)

# Combine the values of every column except age and count into a single 'recipe' column
age_united <- x %>%
  unite("recipe", -c(age, count), sep = ", ")
age_united
##               age
## 1 25-34 years old
## 2 35-44 years old
## 3 18-24 years old
## 4 45-54 years old
## 5 55-64 years old
##                                                                     recipe
## 1          Pourover, Fruity, No - just black, Medium, Light, Full caffeine
## 2          Pourover, Fruity, No - just black, Medium, Light, Full caffeine
## 3 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
## 4 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
## 5 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
##   count
## 1    72
## 2    36
## 3    15
## 4     6
## 5     6
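
Retyping the table by hand works, but the label can also be built from the columns of an existing data frame, which avoids copy errors. The following is a base R sketch under that assumption; `top` is a two-row stand-in for `agerecipe` with values copied from the output above, and `recipe_cols` is an illustrative name.

```r
# Two-row stand-in for `agerecipe` (values copied from the output above)
top <- data.frame(
  age = c("25-34 years old", "18-24 years old"),
  favorite = "Pourover", style = "Fruity", additions = "No - just black",
  strength = c("Medium", "Somewhat strong"),
  roast_level = "Light", caffeine = "Full caffeine",
  count = c(72, 15)
)

# Columns that make up the recipe: everything except age and count
recipe_cols <- setdiff(names(top), c("age", "count"))

# Paste those columns row by row into a single label
top$recipe <- do.call(paste, c(top[recipe_cols], sep = ", "))
top[c("age", "recipe", "count")]
```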

3.3 View data based on best location

Let’s take another look at the Coffe3 data

Coffe3 %>%  head(5)
##               age                                  where_drink cups
## 1   <18 years old On the go, At a cafe, At home, At the office    2
## 2 25-34 years old                                At the office    1
## 3   <18 years old            At home, At the office, At a cafe    3
## 4   >65 years old                     At the office, At a cafe    3
## 5 25-34 years old            At home, At the office, On the go    1
##              favorite
## 1               Other
## 2            Pourover
## 3            Pourover
## 4             Cortado
## 5 Regular drip coffee
##                                                                      additions
## 1 Milk, dairy alternative, or coffee creamer, Sugar or sweetener, Flavor syrup
## 2                                                              No - just black
## 3                  No - just black, Milk, dairy alternative, or coffee creamer
## 4                                                              No - just black
## 5               Milk, dairy alternative, or coffee creamer, Sugar or sweetener
##         style        strength roast_level      caffeine
## 1 Full Bodied  Somewhat light       Light Full caffeine
## 2      Fruity Somewhat strong       Light Full caffeine
## 3      Bright          Medium       Light Full caffeine
## 4      Fruity Somewhat strong      Blonde Full caffeine
## 5       Sweet Somewhat strong      Medium Full caffeine

Apart from the recipe, the place where coffee is consumed can also influence how much it is enjoyed. Let’s look at where_drink in boxplot form to get a first impression of the data structure.

boxplot(Coffe3$where_drink)

The boxplot above shows that where_drink contains many different combinations of locations, so let’s list the count for each combination to get more relevant data.

data <- c(
  "At a cafe                           63",
  "At a cafe, At home            80",
  "At a cafe, At home, At the office            16",
  "At a cafe, At home, At the office, On the go                5",
  "At a cafe, At home, On the go            14",
  "At a cafe, At home, On the go, At the office            6",
  "At a cafe, At the office            4",
  "At a cafe, At the office, At home            18",
  "At a cafe, At the office, At home, On the go            3",
  "At a cafe, At the office, On the go            1",
  "At a cafe, On the go, At home            6",
  "At a cafe, On the go, At home, At the office            3",
  "At a cafe, On the go, At the office            1",
  "At home            1558",
  "At home, At a cafe            405",
  "At home, At a cafe, At the office            47",
  "At home, At a cafe, At the office, On the go            8",
  "At home, At a cafe, On the go            35",
  "At home, At a cafe, On the go, At the office            20",
  "At home, At the office            389",
  "At home, At the office, At a cafe            76",
  "At home, At the office, At a cafe, On the go            18",
  "At home, At the office, None of these            1",
  "At home, At the office, On the go            45",
  "At home, At the office, On the go, At a cafe            23",
  "At home, None of these            1",
  "At home, On the go            118",
  "At home, On the go, At a cafe            33",
  "At home, On the go, At a cafe, At the office            14",
  "At home, On the go, At the office            40",
  "At home, On the go, None of these            1",
  "At the office            135",
  "At the office, At a cafe            10",
  "At the office, At a cafe, At home            18",
  "At the office, At a cafe, At home, On the go            8",
  "At the office, At a cafe, On the go            2",
  "At the office, At home, At a cafe            60",
  "At the office, At home, At a cafe, On the go            12",
  "At the office, At home, On the go            29",
  "At the office, None of these            1",
  "At the office, On the go            12",
  "At the office, On the go, At a cafe            3",
  "At the office, On the go, At home            13",
  "None of these            29",
  "None of these, At a cafe            1",
  "On the go            31",
  "On the go, At a cafe            6",
  "On the go, At a cafe, At home            7",
  "On the go, At a cafe, At home, At the office            7",
  "On the go, At a cafe, At the office            2",
  "On the go, At a cafe, At the office, At home            34",
  "On the go, At home, At a cafe            15",
  "On the go, At home, At a cafe, At the office            5",
  "On the go, At home, At the office            11",
  "On the go, At home, At the office, At a cafe            4",
  "On the go, At the office            7",
  "On the go, At the office, At a cafe, At home            1",
  "On the go, At the office, At home            14"
)

# Extract the count for the entry that lists only the given location
extract_count <- function(data, location) {
  pattern <- paste0("^", location, "\\s+([0-9]+)")
  count <- grep(pattern, data, value = TRUE)
  if (length(count) > 0) {
    count <- as.numeric(sub(pattern, "\\1", count))
  } else {
    count <- 0
  }
  return(count)
}

# Apply the function to each location and build the plotting data
locations <- c("At a cafe", "At home", "At the office", "On the go", "None of these")
loccounts <- sapply(locations, function(loc) extract_count(data, loc))
plot_data <- data.frame(location = locations, count = loccounts)
plot_data <- plot_data[order(plot_data$count, decreasing = TRUE), ]
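
Because the pattern in `extract_count()` only matches entries that consist of a single location, the counts above cover respondents who picked exactly one place. An alternative view, shown here as a sketch on made-up answers rather than the survey data, is to split each multi-select answer and count every mention; `answers`, `mentions`, and `mention_counts` are illustrative names.

```r
# Toy multi-select answers (hypothetical, in the same format as where_drink)
answers <- c("At home", "At home, At a cafe", "At the office, At home", "On the go")

# Split each answer on ", " and flatten into one vector of single locations
mentions <- unlist(strsplit(answers, ", ", fixed = TRUE))

# Count how many respondents mentioned each location
mention_counts <- sort(table(mentions), decreasing = TRUE)
mention_counts
```

The two views answer different questions: exclusive choices versus any mention of a location.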

4. Visualization

sorted_counts <- loccounts[order(loccounts, decreasing = TRUE)]
sorted_locations <- locations[order(loccounts, decreasing = TRUE)]

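# coord_flip() belongs to ggplot2 and cannot be added to a base barplot; horiz = TRUE flips the bars instead
barplot(sorted_counts, names.arg = sorted_locations, col = "skyblue", main = "Frequency of Locations", xlab = "Frequency", horiz = TRUE, las = 1)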

To make the visualization more appealing, let’s rebuild it with ggplot2

plot_data <- data.frame(location = locations, count = loccounts)
plot_data <- plot_data[order(plot_data$count, decreasing = TRUE), ]

# Create the plot using ggplot2
ggplot(plot_data, aes(x = reorder(location, count), y = count, fill = location)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Frequency of Locations", x = NULL, y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()+
  scale_fill_brewer(palette = "Set1")

ggplot(data = oage, aes(x = reorder(age, total), y = total, fill = age)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Distribution of Respondents by Age Group",
       x = NULL,
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  scale_fill_brewer(palette = "Set3")

Based on the top recipe in each age group, here is a visualization of how many respondents in each group chose it.

ggplot(agerecipe, aes(x = reorder(age, count), y = count, fill = age)) +
  geom_col(color = "black") +
  labs(title = "Top 5 Ages with Highest Count", x = NULL, y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1")+
  coord_flip()

Here is the table of the best recipes. To prepare the visualization, let’s first label each recipe group.

allrecipe <- allrecipe %>%
  mutate(group_recipe = paste0("Coffe Recipes", row_number()))
allrecipe
## # A tibble: 5 × 8
##   favorite style  additions     strength roast_level caffeine count group_recipe
##   <fct>    <fct>  <fct>         <fct>    <fct>       <fct>    <int> <chr>       
## 1 Pourover Fruity No - just bl… Medium   Light       Full ca…   123 Coffe Recip…
## 2 Pourover Fruity No - just bl… Somewha… Light       Full ca…   118 Coffe Recip…
## 3 Pourover Juicy  No - just bl… Medium   Light       Full ca…    43 Coffe Recip…
## 4 Pourover Juicy  No - just bl… Somewha… Light       Full ca…    42 Coffe Recip…
## 5 Pourover Bright No - just bl… Somewha… Light       Full ca…    34 Coffe Recip…
ggplot(allrecipe, aes(x = reorder(group_recipe, count), y = count, fill = group_recipe)) +
  geom_col(color = "black") +  # Use geom_col() to draw the filled bars
  labs(x = NULL, y = "Count", title = "Recipes by Group") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +  # Remove the legend
  coord_flip()

5. Conclusion

From the data above, we can conclude several things. First, the largest group of coffee drinkers in the survey is aged 25-34, with 1,900 of the 3,811 respondents analyzed. Second, the most popular recipe is Pourover + Fruity + No - just black + Medium + Light + Full caffeine, and there are four more popular recipes that we can recommend. Finally, home is the favorite place to enjoy coffee: 1,558 respondents selected home as their only coffee-drinking location.

6. Table