In October 2023, James Hoffmann ran a coffee taste test with around 4,000 people in the US, in which each participant received four unlabeled coffee samples. A month later, he released the survey data from the test. The survey offers some interesting statistical insights, so let’s explore it and visualize the results.
source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-14/readme.md
1. At what age is coffee consumption highest?
2. What is the best coffee recommendation?
3. Where is the favorite place to enjoy coffee?
# for data wrangling
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
# visualization
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(DT)
library(png)
library(grid)
library(prettydoc)
library(rmarkdown)
Coffe <- read.csv("data_input/Coffe.csv")
sum(duplicated(Coffe))
## [1] 0
sum(is.na(Coffe))
## [1] 56764
Coffe1 <- subset(Coffe, select = c(age, where_drink, cups, favorite, additions, style, strength, roast_level, caffeine))
Coffe1[Coffe1 == ""] <- NA
Let’s check the selected columns for missing values:
colSums(is.na(x = Coffe1))
## age where_drink cups favorite additions style
## 31 70 93 62 83 84
## strength roast_level caffeine
## 126 102 125
Coffe2 <- Coffe1 %>%
  drop_na()
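The cleaning steps above (blank strings to `NA`, then dropping incomplete rows) can be sketched on a tiny example. The data frame `toy` is hypothetical and only stands in for the survey columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for two survey columns
toy <- data.frame(
  age      = c("25-34 years old", "", "35-44 years old"),
  favorite = c("Pourover", "Latte", ""),
  stringsAsFactors = FALSE
)

# Blank strings are not counted as missing until they are converted
n_missing_before <- sum(is.na(toy))

# Same replacement as in the analysis: empty string becomes NA
toy[toy == ""] <- NA
missing_per_column <- colSums(is.na(toy))

# Keep only rows with no missing value in any column
toy_complete <- toy %>% drop_na()
nrow(toy_complete)
```

An alternative worth considering is passing `na.strings = c("", "NA")` to `read.csv()`, which marks blanks as missing at import time.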
To make processing easier, let’s convert all character columns in the selection to factors. First, let’s inspect the current column types:
summary(Coffe2)
## age where_drink cups favorite
## Length:3811 Length:3811 Length:3811 Length:3811
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## additions style strength roast_level
## Length:3811 Length:3811 Length:3811 Length:3811
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## caffeine
## Length:3811
## Class :character
## Mode :character
str(Coffe2)
## 'data.frame': 3811 obs. of 9 variables:
## $ age : chr "<18 years old" "25-34 years old" "<18 years old" ">65 years old" ...
## $ where_drink: chr "On the go, At a cafe, At home, At the office" "At the office" "At home, At the office, At a cafe" "At the office, At a cafe" ...
## $ cups : chr "2" "1" "3" "3" ...
## $ favorite : chr "Other" "Pourover" "Pourover" "Cortado" ...
## $ additions : chr "Milk, dairy alternative, or coffee creamer, Sugar or sweetener, Flavor syrup" "No - just black" "No - just black, Milk, dairy alternative, or coffee creamer" "No - just black" ...
## $ style : chr "Full Bodied" "Fruity" "Bright" "Fruity" ...
## $ strength : chr "Somewhat light" "Somewhat strong" "Medium" "Somewhat strong" ...
## $ roast_level: chr "Light" "Light" "Light" "Blonde" ...
## $ caffeine : chr "Full caffeine" "Full caffeine" "Full caffeine" "Full caffeine" ...
Coffe3 <- Coffe2 %>%
  mutate(across(where(is.character), as.factor))
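As a small illustration of what this conversion does, the sketch below applies the same `mutate(across(...))` call to a hypothetical data frame `toy`. Note that a column such as `cups`, which holds numbers stored as text, also becomes a factor rather than a number:

```r
library(dplyr)

# Hypothetical toy data: two character columns and one integer column
toy <- data.frame(
  cups        = c("2", "1", "3", "1"),
  roast_level = c("Light", "Medium", "Light", "Dark"),
  id          = 1:4,
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; other columns are untouched
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

sapply(toy_fct, class)
levels(toy_fct$roast_level)
```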
oage <- Coffe3 %>%
  group_by(age) %>%
  summarise(total = n()) %>%
  arrange(desc(total))
oage
## # A tibble: 7 × 2
## age total
## <fct> <int>
## 1 25-34 years old 1900
## 2 35-44 years old 914
## 3 18-24 years old 424
## 4 45-54 years old 291
## 5 55-64 years old 176
## 6 >65 years old 89
## 7 <18 years old 17
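To put these counts in proportion, the sketch below converts them to percentage shares. The vector `age_counts` is a hypothetical helper that simply retypes the totals printed above; in the analysis itself the same shares could be derived from `oage$total`.

```r
# Totals copied from the oage table above (hypothetical helper vector)
age_counts <- c(
  "25-34 years old" = 1900,
  "35-44 years old" = 914,
  "18-24 years old" = 424,
  "45-54 years old" = 291,
  "55-64 years old" = 176,
  ">65 years old"   = 89,
  "<18 years old"   = 17
)

# Total number of complete responses
total_respondents <- sum(age_counts)

# Share of each age group in percent, rounded to one decimal
age_share <- round(100 * prop.table(age_counts), 1)
age_share
```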
allrecipe <- Coffe3 %>%
  group_by(favorite, style, additions, strength, roast_level, caffeine) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  head(5)
## `summarise()` has grouped output by 'favorite', 'style', 'additions',
## 'strength', 'roast_level'. You can override using the `.groups` argument.
allrecipe
## # A tibble: 5 × 7
## favorite style additions strength roast_level caffeine count
## <fct> <fct> <fct> <fct> <fct> <fct> <int>
## 1 Pourover Fruity No - just black Medium Light Full caffei… 123
## 2 Pourover Fruity No - just black Somewhat strong Light Full caffei… 118
## 3 Pourover Juicy No - just black Medium Light Full caffei… 43
## 4 Pourover Juicy No - just black Somewhat strong Light Full caffei… 42
## 5 Pourover Bright No - just black Somewhat strong Light Full caffei… 34
Based on the table above, the most popular combination across all respondents is Pourover + Fruity + No - just black + Medium + Light + Full caffeine, chosen by 123 people.
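The `summarise()` message printed above is informational only. As a sketch of two ways to get the same ranking without it, the example below uses `.groups = "drop"` and the `count()` shorthand on a hypothetical data frame `toy` with just two of the recipe columns:

```r
library(dplyr)

# Hypothetical toy data with two of the recipe columns
toy <- data.frame(
  favorite = c("Pourover", "Pourover", "Pourover", "Latte", "Latte", "Pourover"),
  style    = c("Fruity", "Fruity", "Fruity", "Sweet", "Sweet", "Juicy"),
  stringsAsFactors = FALSE
)

# Option 1: drop the grouping explicitly, so no message is printed
top_combo <- toy %>%
  group_by(favorite, style) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# Option 2: count() groups, tallies, and sorts in one step
top_combo2 <- toy %>%
  count(favorite, style, name = "count", sort = TRUE)

top_combo
```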
Next, let’s find the most popular recipe within each age group by adding age to the aggregation.
age <- Coffe3 %>%
  group_by(age, favorite, style, additions, strength, roast_level, caffeine) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()
## `summarise()` has grouped output by 'age', 'favorite', 'style', 'additions',
## 'strength', 'roast_level'. You can override using the `.groups` argument.
age
## # A tibble: 2,521 × 8
## age favorite style additions strength roast_level caffeine count
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int>
## 1 25-34 years old Pourover Fruity No - jus… Medium Light Full ca… 72
## 2 25-34 years old Pourover Fruity No - jus… Somewha… Light Full ca… 63
## 3 35-44 years old Pourover Fruity No - jus… Medium Light Full ca… 36
## 4 25-34 years old Pourover Juicy No - jus… Somewha… Light Full ca… 28
## 5 35-44 years old Pourover Fruity No - jus… Somewha… Light Full ca… 28
## 6 25-34 years old Pourover Juicy No - jus… Medium Light Full ca… 18
## 7 25-34 years old Pourover Floral No - jus… Somewha… Light Full ca… 17
## 8 18-24 years old Pourover Fruity No - jus… Somewha… Light Full ca… 15
## 9 25-34 years old Pourover Bright No - jus… Medium Light Full ca… 15
## 10 25-34 years old Pourover Bright No - jus… Somewha… Light Full ca… 15
## # ℹ 2,511 more rows
Now keep only the top recipe in each age group and show the five groups with the most fans:
agerecipe <- age %>%
  group_by(age) %>%
  slice_max(order_by = count, n = 1) %>%
  arrange(desc(count)) %>%
  ungroup() %>%
  head(5)
agerecipe
## # A tibble: 5 × 8
## age favorite style additions strength roast_level caffeine count
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int>
## 1 25-34 years old Pourover Fruity No - just… Medium Light Full ca… 72
## 2 35-44 years old Pourover Fruity No - just… Medium Light Full ca… 36
## 3 18-24 years old Pourover Fruity No - just… Somewha… Light Full ca… 15
## 4 45-54 years old Pourover Fruity No - just… Somewha… Light Full ca… 6
## 5 55-64 years old Pourover Fruity No - just… Somewha… Light Full ca… 6
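One detail of `slice_max()` is worth keeping in mind: by default it keeps ties, so an age group whose top recipes share the same count can contribute more than one row. A minimal sketch on a hypothetical data frame `toy`, comparing the default with `with_ties = FALSE`:

```r
library(dplyr)

# Hypothetical toy data: the 45-54 group has two recipes tied at 6
toy <- data.frame(
  age    = c("45-54", "45-54", "55-64"),
  recipe = c("A", "B", "A"),
  count  = c(6, 6, 6),
  stringsAsFactors = FALSE
)

# Default behaviour: tied rows are all kept
top_with_ties <- toy %>%
  group_by(age) %>%
  slice_max(order_by = count, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per group
top_no_ties <- toy %>%
  group_by(age) %>%
  slice_max(order_by = count, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(top_with_ties)
nrow(top_no_ties)
```

Whether to keep or drop ties is a judgment call; keeping them is more faithful to the data, while dropping them guarantees one bar per age group in a plot.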
To make the table easier to read, let’s combine the recipe columns into a single column next to the number of fans:
x <- data.frame(
  age = c("25-34 years old", "35-44 years old", "18-24 years old", "45-54 years old", "55-64 years old"),
  favorite = c("Pourover", "Pourover", "Pourover", "Pourover", "Pourover"),
  style = c("Fruity", "Fruity", "Fruity", "Fruity", "Fruity"),
  additions = c("No - just black", "No - just black", "No - just black", "No - just black", "No - just black"),
  strength = c("Medium", "Medium", "Somewhat strong", "Somewhat strong", "Somewhat strong"),
  roast_level = c("Light", "Light", "Light", "Light", "Light"),
  caffeine = c("Full caffeine", "Full caffeine", "Full caffeine", "Full caffeine", "Full caffeine"),
  count = c(72, 36, 15, 6, 6)
)
# Combine the values of every column except age and count into a single 'recipe' column
age_united <- x %>%
  unite("recipe", -c(age, count), sep = ", ")
age_united
## age
## 1 25-34 years old
## 2 35-44 years old
## 3 18-24 years old
## 4 45-54 years old
## 5 55-64 years old
## recipe
## 1 Pourover, Fruity, No - just black, Medium, Light, Full caffeine
## 2 Pourover, Fruity, No - just black, Medium, Light, Full caffeine
## 3 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
## 4 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
## 5 Pourover, Fruity, No - just black, Somewhat strong, Light, Full caffeine
## count
## 1 72
## 2 36
## 3 15
## 4 6
## 5 6
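The hand-typed data frame `x` repeats values that already exist in `agerecipe`, so the same table can likely be produced by piping `agerecipe` straight into `unite()`. A minimal sketch, assuming a small stand-in data frame `toy_agerecipe` (hypothetical, with fewer columns than the real one):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for agerecipe, with factor columns as in Coffe3
toy_agerecipe <- data.frame(
  age      = factor(c("25-34 years old", "18-24 years old")),
  favorite = factor(c("Pourover", "Pourover")),
  style    = factor(c("Fruity", "Fruity")),
  strength = factor(c("Medium", "Somewhat strong")),
  count    = c(72L, 15L)
)

# Paste every column except age and count into one 'recipe' column
toy_united <- toy_agerecipe %>%
  unite("recipe", -c(age, count), sep = ", ")

toy_united$recipe
```

Working from the computed table avoids retyping values and keeps the display in sync if the underlying data changes.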
Let’s take another look at the Coffe3 data.
Coffe3 %>% head(5)
## age where_drink cups
## 1 <18 years old On the go, At a cafe, At home, At the office 2
## 2 25-34 years old At the office 1
## 3 <18 years old At home, At the office, At a cafe 3
## 4 >65 years old At the office, At a cafe 3
## 5 25-34 years old At home, At the office, On the go 1
## favorite
## 1 Other
## 2 Pourover
## 3 Pourover
## 4 Cortado
## 5 Regular drip coffee
## additions
## 1 Milk, dairy alternative, or coffee creamer, Sugar or sweetener, Flavor syrup
## 2 No - just black
## 3 No - just black, Milk, dairy alternative, or coffee creamer
## 4 No - just black
## 5 Milk, dairy alternative, or coffee creamer, Sugar or sweetener
## style strength roast_level caffeine
## 1 Full Bodied Somewhat light Light Full caffeine
## 2 Fruity Somewhat strong Light Full caffeine
## 3 Bright Medium Light Full caffeine
## 4 Fruity Somewhat strong Blonde Full caffeine
## 5 Sweet Somewhat strong Medium Full caffeine
Apart from the recipe, the place where you drink coffee can also affect how much you enjoy it. Let’s start with a quick boxplot of where_drink to get a feel for the data.
boxplot(Coffe3$where_drink)
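Because where_drink is categorical, the boxplot gives only a rough picture, but it does show that the column contains many different combinations, since respondents could choose more than one place. Let’s list each combination with its count to get more relevant data.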
data <- c(
"At a cafe 63",
"At a cafe, At home 80",
"At a cafe, At home, At the office 16",
"At a cafe, At home, At the office, On the go 5",
"At a cafe, At home, On the go 14",
"At a cafe, At home, On the go, At the office 6",
"At a cafe, At the office 4",
"At a cafe, At the office, At home 18",
"At a cafe, At the office, At home, On the go 3",
"At a cafe, At the office, On the go 1",
"At a cafe, On the go, At home 6",
"At a cafe, On the go, At home, At the office 3",
"At a cafe, On the go, At the office 1",
"At home 1558",
"At home, At a cafe 405",
"At home, At a cafe, At the office 47",
"At home, At a cafe, At the office, On the go 8",
"At home, At a cafe, On the go 35",
"At home, At a cafe, On the go, At the office 20",
"At home, At the office 389",
"At home, At the office, At a cafe 76",
"At home, At the office, At a cafe, On the go 18",
"At home, At the office, None of these 1",
"At home, At the office, On the go 45",
"At home, At the office, On the go, At a cafe 23",
"At home, None of these 1",
"At home, On the go 118",
"At home, On the go, At a cafe 33",
"At home, On the go, At a cafe, At the office 14",
"At home, On the go, At the office 40",
"At home, On the go, None of these 1",
"At the office 135",
"At the office, At a cafe 10",
"At the office, At a cafe, At home 18",
"At the office, At a cafe, At home, On the go 8",
"At the office, At a cafe, On the go 2",
"At the office, At home, At a cafe 60",
"At the office, At home, At a cafe, On the go 12",
"At the office, At home, On the go 29",
"At the office, None of these 1",
"At the office, On the go 12",
"At the office, On the go, At a cafe 3",
"At the office, On the go, At home 13",
"None of these 29",
"None of these, At a cafe 1",
"On the go 31",
"On the go, At a cafe 6",
"On the go, At a cafe, At home 7",
"On the go, At a cafe, At home, At the office 7",
"On the go, At a cafe, At the office 2",
"On the go, At a cafe, At the office, At home 34",
"On the go, At home, At a cafe 15",
"On the go, At home, At a cafe, At the office 5",
"On the go, At home, At the office 11",
"On the go, At home, At the office, At a cafe 4",
"On the go, At the office 7",
"On the go, At the office, At a cafe, At home 1",
"On the go, At the office, At home 14"
)
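# Extract the count for entries that consist of exactly one location
extract_count <- function(data, location) {
  # Match the location at the start of the string, followed only by whitespace and a number
  pattern <- paste0("^", location, "\\s+([0-9]+)")
  count <- grep(pattern, data, value = TRUE)
  if (length(count) > 0) {
    # Keep only the captured number
    count <- as.numeric(sub(pattern, "\\1", count))
  } else {
    count <- 0
  }
  return(count)
}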
# Example usage
locations <- c("At a cafe", "At home", "At the office", "On the go", "None of these")
loccounts <- sapply(locations, function(loc) extract_count(data, loc))
plot_data <- data.frame(location = locations, count = loccounts)
plot_data <- plot_data[order(plot_data$count, decreasing = TRUE), ]
sorted_counts <- loccounts[order(loccounts, decreasing = TRUE)]
sorted_locations <- locations[order(loccounts, decreasing = TRUE)]
# Base R barplot; horiz = TRUE draws horizontal bars (coord_flip() only applies to ggplot2 objects)
barplot(sorted_counts, names.arg = sorted_locations, col = "skyblue",
        main = "Frequency of Locations", xlab = "Frequency", horiz = TRUE, las = 1)
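Note that `extract_count()` only matches entries made up of a single location, so respondents who selected several places are not included in these counts. As a sketch of one way to count every mention instead, the example below splits each answer on commas; the vector `responses` is hypothetical and stands in for `Coffe3$where_drink`:

```r
# Hypothetical answers in the same comma-separated format as where_drink
responses <- c(
  "At home",
  "At home, At the office",
  "At a cafe, At home",
  "On the go",
  "At the office"
)

# Split each answer into individual places and trim surrounding spaces
places <- trimws(unlist(strsplit(responses, ",", fixed = TRUE)))

# Count how often each place is mentioned, most frequent first
mention_counts <- sort(table(places), decreasing = TRUE)
mention_counts
```

This works here because none of the location labels contains a comma; a column such as `additions`, whose options include commas, would need a different approach.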
To make the visualization more appealing, let’s use ggplot2.
plot_data <- data.frame(location = locations, count = loccounts)
plot_data <- plot_data[order(plot_data$count, decreasing = TRUE), ]
# Build the plot with ggplot2
ggplot(plot_data, aes(x = reorder(location, count), y = count, fill = location)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Frequency of Locations", x = NULL, y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1")
Next, let’s plot the number of respondents in each age group.
ggplot(data = oage, aes(x = reorder(age, total), y = total, fill = age)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Number of Respondents by Age Group",
       x = NULL,
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  scale_fill_brewer(palette = "Set3")
Based on the top recipe in each age group, here is a visualization of how many fans it has per age group.
ggplot(agerecipe, aes(x = reorder(age, count), y = count, fill = age)) +
  geom_col(color = "black") + # one bar layer is enough; geom_col() uses count as the bar height
  labs(title = "Top 5 Ages with Highest Count", x = NULL, y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1") +
  coord_flip()
Here is the best recipe table
Let’s prepare the visualization based on the recipe group.
allrecipe <- allrecipe %>%
  mutate(group_recipe = paste0("Coffe Recipes", row_number()))
allrecipe
## # A tibble: 5 × 8
## favorite style additions strength roast_level caffeine count group_recipe
## <fct> <fct> <fct> <fct> <fct> <fct> <int> <chr>
## 1 Pourover Fruity No - just bl… Medium Light Full ca… 123 Coffe Recip…
## 2 Pourover Fruity No - just bl… Somewha… Light Full ca… 118 Coffe Recip…
## 3 Pourover Juicy No - just bl… Medium Light Full ca… 43 Coffe Recip…
## 4 Pourover Juicy No - just bl… Somewha… Light Full ca… 42 Coffe Recip…
## 5 Pourover Bright No - just bl… Somewha… Light Full ca… 34 Coffe Recip…
ggplot(allrecipe, aes(x = reorder(group_recipe, count), y = count, fill = group_recipe)) +
  geom_col(color = "black") + # geom_col() draws bars whose height is the count value
  labs(x = NULL, y = "Count", title = "Recipes by Group") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + # remove the legend
  coord_flip()
From the analysis above, we can conclude several things. First, respondents aged 25-34 form the largest group of coffee drinkers in the survey (1,900 of the 3,811 complete responses). Second, the most popular recipe is Pourover + Fruity + No - just black + Medium + Light + Full caffeine, and the four runner-up recipes in the table above are also worth recommending. Third, the favorite place to enjoy coffee is at home: 1,558 respondents named home as the only place they drink coffee.