Data loading

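library(ggplot2)  # plotting functions used in the visualization sections
library(stringr)  # str_replace_all() used in the variable transformation

tuesdata <- tidytuesdayR::tt_load(2024, week = 20)

coffee_survey <- tuesdata$coffee_survey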

Introduction

This report analyzes coffee consumption data from the TidyTuesday dataset for 2024, week 20. Viewers filled out a survey about the 4 coffees they ordered from Cometeer.

Preview dataset

head(coffee_survey)
## # A tibble: 6 × 57
##   submission_id age   cups  where_drink brew  brew_other purchase purchase_other
##   <chr>         <chr> <chr> <chr>       <chr> <chr>      <chr>    <chr>         
## 1 gMR29l        18-2… <NA>  <NA>        <NA>  <NA>       <NA>     <NA>          
## 2 BkPN0e        25-3… <NA>  <NA>        Pod/… <NA>       <NA>     <NA>          
## 3 W5G8jj        25-3… <NA>  <NA>        Bean… <NA>       <NA>     <NA>          
## 4 4xWgGr        35-4… <NA>  <NA>        Coff… <NA>       <NA>     <NA>          
## 5 QD27Q8        25-3… <NA>  <NA>        Pour… <NA>       <NA>     <NA>          
## 6 V0LPeM        55-6… <NA>  <NA>        Pod/… <NA>       <NA>     <NA>          
## # ℹ 49 more variables: favorite <chr>, favorite_specify <chr>, additions <chr>,
## #   additions_other <chr>, dairy <chr>, sweetener <chr>, style <chr>,
## #   strength <chr>, roast_level <chr>, caffeine <chr>, expertise <dbl>,
## #   coffee_a_bitterness <dbl>, coffee_a_acidity <dbl>,
## #   coffee_a_personal_preference <dbl>, coffee_a_notes <chr>,
## #   coffee_b_bitterness <dbl>, coffee_b_acidity <dbl>,
## #   coffee_b_personal_preference <dbl>, coffee_b_notes <chr>, …

Data cleaning

Filtering out missing values

data_filtered <- coffee_survey[!is.na(coffee_survey$gender) & !is.na(coffee_survey$cups) &
                               !is.na(coffee_survey$education_level) & !is.na(coffee_survey$ethnicity_race), ]
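The same row filter can be written more compactly with base R's complete.cases(). The sketch below uses a small made-up data frame (toy, with invented values) rather than the real survey, and checks that both approaches keep the same rows.

```r
# Toy data standing in for coffee_survey; all values are invented for illustration
toy <- data.frame(
  gender          = c("Male", NA, "Female", "Male"),
  cups            = c("2", "1", NA, "3"),
  education_level = c("A", "B", "C", "A"),
  ethnicity_race  = c("X", "Y", "X", NA),
  stringsAsFactors = FALSE
)

required_cols <- c("gender", "cups", "education_level", "ethnicity_race")

# Keep only rows with no NA in any of the required columns
toy_filtered <- toy[complete.cases(toy[, required_cols]), ]

# Equivalent filter written as chained !is.na() checks
manual <- toy[!is.na(toy$gender) & !is.na(toy$cups) &
              !is.na(toy$education_level) & !is.na(toy$ethnicity_race), ]

print(toy_filtered)
```

Listing the required columns in one vector makes it easier to add or drop a column from the filter later.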

Aggregation by gender

Compared to the other genders, men have the highest self-rating for coffee expertise.

aggregate(expertise ~ gender, data = data_filtered, FUN = mean)

Variable transformation

To keep the axis labels short when age is on the horizontal axis, the " years old" suffix is removed from the age values.

data_filtered$age <- stringr::str_replace_all(data_filtered$age, " years old", "")
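Once the suffix is removed, age is still a character vector, so plots will order the groups alphabetically rather than by age. A minimal sketch of one way to fix the order is shown below. It assumes the stripped labels are "<18", "18-24", ..., ">65" (an assumption about the data, not verified here) and uses a made-up vector toy_age.

```r
# Assumed age labels after removing " years old", in the intended display order
age_levels <- c("<18", "18-24", "25-34", "35-44", "45-54", "55-64", ">65")

# Made-up sample of age values
toy_age <- c("25-34", ">65", "<18", "18-24")

# A factor with explicit levels keeps the groups in age order on a plot axis
age_factor <- factor(toy_age, levels = age_levels)

# Sorting the factor follows the level order, not alphabetical order
print(as.character(sort(age_factor)))
```

Applied to the report's data, the same idea would be `data_filtered$age <- factor(data_filtered$age, levels = age_levels)`, provided the labels match.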

Data Visualization

Boxplot: Coffee Consumption by Age Group

Participants in each age group typically consumed about 2 cups of coffee per day.

ggplot(data_filtered, aes(x = age, y = as.numeric(cups), fill = age)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Boxplot of Coffee Consumption by Age Group",
       x = "Age Group",
       y = "Cups of Coffee per Day") +
  theme_bw()
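The plots in this section convert cups with as.numeric(). Since cups is a character column, any label that is not a plain number becomes NA and is dropped from the plot with a warning. The sketch below illustrates this on a made-up vector, assuming the column contains labels such as "Less than 1" and "More than 4" (an assumption about the data). The numeric stand-ins 0.5 and 5 are arbitrary placeholders, not values from the survey.

```r
# Made-up sample of cups labels; the text labels are an assumption about the data
toy_cups <- c("Less than 1", "1", "2", "3", "4", "More than 4", NA)

# Direct coercion turns the text labels into NA
coerced <- suppressWarnings(as.numeric(toy_cups))
print(sum(is.na(coerced)))

# Explicit mapping; 0.5 and 5 are arbitrary placeholder values for the open-ended labels
cups_map <- c("Less than 1" = 0.5, "1" = 1, "2" = 2,
              "3" = 3, "4" = 4, "More than 4" = 5)
cups_num <- unname(cups_map[toy_cups])
print(cups_num)
```

An explicit mapping keeps the lightest and heaviest drinkers in the plots, at the cost of assigning them an approximate value.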

Boxplot: Coffee Consumption by Gender Group

Women drink less coffee on average than the other genders.

ggplot(data_filtered, aes(x = gender, y = as.numeric(cups), fill = gender)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Boxplot of Coffee Consumption by Gender Group",
       x = "Gender Group",
       y = "Cups of Coffee per Day") +
  theme_bw()

Density Plot: Coffee Consumption Distribution by ethnicity_race

Many people’s coffee consumption is concentrated between 1 and 2 cups per day.

ggplot(data_filtered, aes(x = as.numeric(cups), fill = ethnicity_race)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Density Plot of Coffee Consumption",
       x = "Cups per Day",
       y = "Density") +
  theme_minimal()

Density Plot: Coffee Consumption Distribution by education_level

Respondents with more education (e.g., master's or doctoral degrees) had distributions that extended further toward 3 or more cups per day.

ggplot(data_filtered, aes(x = as.numeric(cups), fill = education_level)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Density Plot of Coffee Consumption",
       x = "Cups per Day",
       y = "Density") +
  theme_classic()

Histogram: Participants’ self-assessed coffee expertise

Scores of 6-7 were the most frequent, and the distribution was left-skewed (most of its mass sits at the higher scores, with a tail toward the lower ones), indicating that most people felt they had at least a general knowledge of coffee.

ggplot(data_filtered, aes(x = as.numeric(expertise))) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = "gray", linewidth = 0.5) +  # after_stat() replaces the deprecated ..count..
  labs(title = "Participants' self-assessed coffee expertise",
       x = "Expertise assessment",
       y = "Frequency") +
  theme_bw()

Jitter Plot: Age vs Coffee Consumption

Coffee consumption is highest among 25-44 year olds and lower among the youngest and oldest age groups.

ggplot(data_filtered, aes(x = age, y = as.numeric(cups))) +
  geom_jitter(width = 0.2, height = 0.2, color = "gray", alpha = 0.5) +  # Jitter points for better visibility
  labs(title = "Age vs Coffee Consumption",
       x = "Age",
       y = "Cups of Coffee per Day") +
  theme_bw()

Stacked Bar Chart: Coffee Expertise by Gender

Men are more inclined than women to rate their own coffee expertise highly.

df_filtered <- subset(data_filtered, gender %in% c("Male", "Female"))
df_filtered$expertise_cat <- cut(df_filtered$expertise, 
                                 breaks = c(0, 5, Inf),
                                 labels = c("Low_expertise", "High_expertise"))
ggplot(df_filtered, aes(x = gender, fill = expertise_cat)) +
  geom_bar() +
  labs(title = "Distribution of Expertise Categories by Gender",  
    x = "Gender",  
    y = "Count",  
    fill = "Expertise Category" ) +
  theme_minimal() 
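Because the stacked bars show raw counts and the gender groups differ in size, the share of each expertise category within a gender can be easier to compare than the bar heights. The sketch below computes those shares in base R on a small made-up data frame (toy, with invented values), using the same cut() breaks as above.

```r
# Made-up data for illustration; not taken from the survey
toy <- data.frame(
  gender    = c("Male", "Male", "Male", "Male", "Female", "Female"),
  expertise = c(7, 8, 3, 6, 4, 9)
)

# Same categorization as in the report: scores 1-5 are low, 6 and above are high
toy$expertise_cat <- cut(toy$expertise,
                         breaks = c(0, 5, Inf),
                         labels = c("Low_expertise", "High_expertise"))

# Counts per gender and category, then row-wise proportions (each row sums to 1)
counts <- table(toy$gender, toy$expertise_cat)
shares <- prop.table(counts, margin = 1)
print(shares)
```

In the plot itself, `geom_bar(position = "fill")` would draw the same within-gender shares instead of counts.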

Conclusion

1. Coffee consumption patterns of different groups: most people drink 1-3 cups per day, with 2 cups being the most common. Consumption habits vary by race and education level, but the overall trend is similar. Consumption also differs by age group: 25-44 year olds are the main group of drinkers, usually at 2-3 cups/day, while people under 18 and over 65 drink less coffee.

2. Self-assessment of coffee expertise: men are more likely than women to consider themselves coffee experts, with a higher percentage of high expertise in the male group and a higher percentage of low expertise in the female group. However, since more men than women participated in the survey, the sample may be gender-biased.