Introduction

The aim of this project was to provide insight into the Starbucks dataset, published on 21 December 2021 as part of the Tidy Tuesday initiative. The dataset included detailed nutritional information on 93 drinks, such as calories, fat/sugar/caffeine content and many more. The main goal of the analysis is to show which options are the best for health conscious coffee lovers and what are the products and product categories that should be avoided.

Load packages & data

rm(list = ls())
library(data.table)
library(ggplot2)
library(gganimate)
library(animation)
#install.packages("ggmap")
library(ggmap)
library(GGally)
library(dplyr)
library(readr)
library(reshape2)
#install.packages("AER")
library(AER)
library(tidyverse)
library(lspline)
library(fixest)
library(modelsummary)
library(skimr)
library(RColorBrewer)
library(cowplot)
#install.packages("rowr")
library(factoextra)
library(kableExtra)
starbucks <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv')

Data cleaning & munging

In order to be able to conduct further analysis on the data, I created a factor to capture vegan drinks (drinks that don’t contain any version of cow milk) and computed the categorization of drinks according to the attached data sheet. Next, I changed the type of the “milk” variable to factor, as here the number values correspond to the different types of milk available in Starbucks:

As the last step of cleaning, I corrected product names where it was needed.

Exploratory data analysis

As it can be seen from the descriptive statistics table, 48% of the drinks are vegan and the average drink contains 228.39 calories, while the average serving sizes are also relatively big, with a mean value of 461.34 ml and a maximum of 887 ml. When we look at the sugar content of drinks, the average drink contains as much as 35 grams. To put this number in context, the AHA suggests an added-sugar limit of no more than 100 calories per day (about 6 teaspoons or 24 grams of sugar) for most women and no more than 150 calories per day (about 9 teaspoons or 36 grams of sugar) for most men. As for caffeine, the U.S. Food and Drug Administration considers 400 milligrams a safe amount of caffeine for healthy adults to consume daily, so bot the median (75 mg) and the average (91.86 mg) values seem to be safe to drink in moderation.

t1 <- datasummary( ("Calories" = calories ) + 
             ("Vegan" = vegan_bin ) + 
             ("Whip" = whip ) +
             ("Serving size (ml)" = serv_size_m_l) +
             ("Total fat (g)" = total_fat_g) +
             ("Staurated fat (g)" = saturated_fat_g) +
             ("Trans fat (g)" = trans_fat_g) +
             ("Cholesterol (mg)" = cholesterol_mg) +
             ("Sodium (mg)" = sodium_mg) +
             ("Total carbs (g)" = total_carbs_g) +
             ("Fiber (g)" = fiber_g) +
             ("Sugar (g)" = sugar_g) +
             ("Caffeine (mg)" = caffeine_mg) ~
             Mean + Median + SD + Min + Max, 
             data = starbucks) %>% 
             kable_minimal(full_width = F, position = "center" , font_size = 11)

t1
Mean Median SD Min Max
“Calories” 228.39 220.00 137.67 0 640
“Vegan” 0.48 0.00 0.50 0.00 1.00
“Whip” 0.25 0.00 0.43 0 1
“Serving size (ml)” 461.34 473.00 172.18 0 887
“Total fat (g)” 6.19 4.50 5.97 0.00 28.00
“Staurated fat (g)” 3.88 2.50 4.01 0.00 20.00
“Trans fat (g)” 0.12 0.00 0.16 0.00 2.00
“Cholesterol (mg)” 15.24 5.00 17.97 0 75
“Sodium (mg)” 139.65 135.00 93.09 0 370
“Total carbs (g)” 37.72 37.00 23.26 0 96
“Fiber (g)” 0.87 0.00 1.55 0 9
“Sugar (g)” 34.99 34.00 22.46 0 89
“Caffeine (mg)” 91.86 75.00 78.11 0 475

When we take a closer look to the distribution of sugar content in drinks, we can see that in most categories the majority of options exceed the officially recommended values. However, we have a better chance to find healthier options in the Refresher, Tea, Espresso, Smoothie and Coffee categories, while hot chocolates and frappuccinos will most likely out of the range.

ggplot(starbucks, aes(sugar_g)) +
  geom_histogram(fill = "seagreen", color = "white",
                 alpha = 0.4) +
  geom_vline(aes(xintercept = 24), colour="black", linetype = "dashed") +
  geom_vline(aes(xintercept = 36), colour="black", linetype = "dashed") +
  annotate("text", x = c(10,50), y = c(50,50), 
          label = c("Recommended daily sugar intake for women", "Recommended daily sugar intake for men") , color="black", 
         size=2) +
   theme_bw() + 
  labs( x= "Sugar (g)", y="Count", title = "Distribution of sugar content", subtitle = "Category: {closest_state}") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  transition_states(category)

In a similar matter, we can see that in most categories we can easily find an option below the recommended 400mg caffeine content, as only some drinks from the Coffee category exceed this value. Therefore we can conclude that we can safely have one cup of our favourite drink without worrying about our caffeine consumption.

ggplot(starbucks, aes(caffeine_mg)) +
  geom_histogram(fill = "seagreen", color = "white",
                 alpha = 0.4) +
  geom_vline(aes(xintercept = 400), colour="black", linetype = "dashed") +
  annotate("text", x = c(340), y = c(50), 
          label = c("Recommended daily caffeine intake") , color="black", 
         size=2) +
   theme_bw() + 
  labs( x= "Caffeine (mg)", y="Count", title = "Distribution of caffeine content", subtitle = "Category: {closest_state}") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  transition_states(category)

To check the pattern of association between nutritients and calories, I created a set of scatterplots including four main nutritient variables: sugar, total fat, sodium and caffeine. As expected, higher amounts of total fat and sugar come with high calorie values, and sodium follows the same almost linear pattern. In case of caffeine, there is no linear pattern: drinks between ~0-50 mg of caffeine seem to have decreasing amount of calories, while calorie content starts to increase for drinks between ~50 and 120-130 mgs. above 130mgs of caffeine, drinks tend to have lower amount of calories again, which is probably due to that high caffeine drinks belong typically to the Espresso and Coffee categories, that are less likely to include high-sugar syrups. In general, vegan options tend to have better nutritional values.

# Check the association between caffeine and calories
p1 <- ggplot(starbucks, aes(caffeine_mg, calories)) +
 geom_point(alpha = 0.4) +
  geom_smooth(method = "loess" , aes(color = vegan), se = F, size = 2, alpha = 0.4) +
  theme_bw() + 
  labs( x= "Caffeine (mg)", y="Calories", color = "Vegan") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  scale_color_manual(values = c("seagreen", "plum3"))

# Check the association between total fat and calories
 p2 <- ggplot(starbucks, aes(total_fat_g, calories)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess",aes(color = vegan), se = F, size = 2, alpha = 0.4) +
  theme_bw() + 
  labs( x= "Total fat (g)", y="Calories", color = "Vegan") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  scale_color_manual(values = c("seagreen", "plum3"))

# Check the association between sugar and calories
 p3<- ggplot(starbucks, aes(sugar_g, calories)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", aes(color = vegan), se = F, size = 2, alpha = 0.4) +
  theme_bw() + 
  labs( x= "Sugar (g)", y="Calories", color = "Vegan") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  scale_color_manual(values = c("seagreen", "plum3"))
 
 # Check the association between sodium and calories
 p4 <- ggplot(starbucks, aes(sodium_mg, calories)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", aes(color = vegan), se = F, size = 2, alpha = 0.4) +
  theme_bw() + 
  labs( x= "Sodium (mg)", y="Calories", color = "grey80") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())+
  scale_color_manual(values = c("seagreen", "plum3"))

# create summary plot grid
g0 <- plot_grid(p1,p2,p3,p4, nrow = 2, ncol = 2)
g0

Analytical questions

How does calorie density differ in different drink categories?

As the graph below represents, Hot chocolate, Espresso and Frappuccino categories represent the majority of high-calorie drinks. The difference between vegan an not-vegan options is that the average calorie content seem to be lower in case of vegan drinks, and vegan hot chocolates have significantly lower density of high calorie options which might be due to the lower calorie plant-based milk alternatives.

g1 <- ggplot(starbucks, aes(x=calories, group=category, fill=category)) +
    geom_density(adjust=1.5, alpha=.4,  position = "fill") +
    theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank()) +
    labs(x = "Calories", y = "Density", fill = "Category", title = "Calorie density of {closest_state} drinks") +
  transition_states(vegan)
    
g1

How does the calorie content of drinks differ in different serving sizes?

Among all categories, larger serving sizes come with higher calorie levels. The only exception is the Hot chocolate & other and the Tea categories, where the calorie range is smaller for the last boxes. In case of Hot chocolate & other category, this might be because there are some non-chocolate drinks involved that come in large serving size but doesn’t include many calories, such as lemonade.

g2 <- ggplot(starbucks, aes(factor(serv_size_m_l), calories)) +
  geom_boxplot(fill = "seagreen", alpha = 0.4, na.rm=T, bandwidth =3, notch = T) +
  geom_jitter(shape=16, position=position_jitter(0.2), alpha = 0.4) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())+
  transition_states(category) +
  labs( x = "Serving size (ml)", y = "Calories", title = "Calories per serving size", subtitle = "Category: {closest_state}") 

g2

How does the level of caffeine differ in different serving sizes?

Except for Coffee and Espresso categories, larger servings tend to include more caffeine. In case of Espresso, the smallest serving size is 0ml and includes espresso variations that traditionally have the highest level of caffeine. In the Coffee category, the two largest serving sizes include somewhat lower levels of caffeine which might be due to the higher water ratio of large drinks (just think about ice coffee and batch brew variations that all belong to this category at Starbucks).

g3 <- ggplot(starbucks, aes(factor(serv_size_m_l), caffeine_mg)) +
  geom_boxplot(fill = "seagreen", alpha = 0.4, na.rm=T, bandwidth =3, notch = T) +
  geom_jitter(shape=16, position=position_jitter(0.2), alpha = 0.4) +
  labs( x = "Serving size (ml)", y = "Caffeine (mg)", title = "Caffeine (mg) per serving size", subtitle = "Category: {closest_state}") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  transition_states(category)
g3

How does the average calorie level of different products look like?

From the graph below we can see how calorie levels differ in different products and categories. Not surprisingly products from the Frappuccino and Hot chocolate categories seem to dominate the high-calorie drinks, but there are also significant amount of products from the Tea category among them on the graph.

data <- starbucks %>% 
  filter (calories > 0) %>% 
  group_by(product_name, category) %>% 
  summarize (avg_cal = mean(calories, na.rm = T))

label_data <- data

label_data$id <- seq(1, nrow(label_data))
 
# calculate the ANGLE of the labels
number_of_bar <- nrow(label_data)
angle <-  90 - 360 * (label_data$id-0.5) /number_of_bar     # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0)
 
# calculate the alignment of labels: right or left
# If I am on the left part of the plot, my labels have currently an angle < -90
label_data$hjust<-ifelse( angle < -90, 1, 0)
 
# flip angle BY to make them readable
label_data$angle<-ifelse(angle < -90, angle+180, angle)


g4 <-ggplot(data, aes(x=as.factor(product_name), y=avg_cal, fill= category)) +
  geom_bar(stat="identity", alpha = 0.4) +
  ylim(-400,450) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-2,4), "cm")) +
  coord_polar(start = 0) + 
   geom_text(data=label_data, aes(x=id, y=avg_cal+10, label=product_name, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=1.5, angle= label_data$angle, inherit.aes = FALSE ) +
  scale_fill_brewer(palette = "PRGn") +
  labs(fill = "Category")
g4

Is there any correlation between the various nutritients and serving sizes?

From the correlation heatmap below, we can conclude that nutritients that are considered to be less-or non-healthy, such as fats, sodium, sugar and carbs are highly correlated with each other. (The very high correlation between fats as well as sugar and carbs is probably because trans fat and saturated fat are subcategories of total fat and sugar is a subcategory of total carbs.) Interestingly, caffeine content only seem to correlate with serving size based on the heatmap.

reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}

starbucks_num <- keep( starbucks , is.numeric )
cormat <- round( cor( starbucks_num , use = "complete.obs") , 2 )
cormat <- reorder_cormat(cormat)


# Put it into a tibble format
melted_cormat <- melt( cormat , na.rm = TRUE)

# Now we can create a heat-map
g5 <- ggplot( data = starbucks )+
      geom_tile( data = melted_cormat, aes( Var2 , Var1 , fill = value ) ) +
      scale_fill_gradient2(low = "white", high = "seagreen", mid = "#edf8e9", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
      theme_bw()+ 
      theme( axis.text.x = element_text(angle = 45, vjust = 1, 
                                    size = 10, hjust = 1))+
      labs(y="",x="", title = "Correlation matrix")+
      coord_fixed() +
      scale_x_discrete(limit = c("caffeine_mg", "total_fat_g", "saturated_fat_g", "trans_fat_g", "cholesterol_mg", "fiber_g", "serv_size_m_l", "sodium_mg", "calories", "total_carbs_g", "sugar_g"),
                     labels = c("Caffeine (mg)", "Total fat (g)", "Saturated fat (g)", "Trans fat (g)", "Cholesterol (mg)", "Fiber (g)", "Serving size (ml)", "Sodium (mg)", "Calories", "Total carbs (g)", "Sugar (g)")) +
      scale_y_discrete(limit = c("caffeine_mg", "total_fat_g", "saturated_fat_g", "trans_fat_g", "cholesterol_mg", "fiber_g", "serv_size_m_l", "sodium_mg", "calories", "total_carbs_g", "sugar_g"),
                     labels = c("Caffeine (mg)",  "Total fat (g)", "Saturated fat (g)", "Trans fat (g)", "Cholesterol (mg)", "Fiber (g)", "Serving size (ml)", "Sodium (mg)", "Calories", "Total carbs (g)", "Sugar (g)"))

g5

How are the different drinks clustered?

The clustering below is based on average values of calories, total fat, caffeine and sugar among product types.

# create matrix for dendogram
s <- starbucks %>% 
  group_by(product_name) %>% 
  summarize(avg_cal = mean(calories, na.rm = T),
            avg_total_fat = mean(total_fat_g, na.rm = T),
            avg_caffeine = mean(caffeine_mg, na.rm = T),
            avg_sugar = mean(sugar_g, na.rm = T))

s <- as.data.frame(s)
rownames(s) <- c(s$product_name)

newdata <- s[,-1]
df <- newdata

df.scaled <- scale(df)
res.dist <- dist ( x = df, method = "euclidian")

x <- as.matrix(res.dist) [1:93, 1:93]

hc <- hclust( d = res.dist, method = "complete")

## plot the dendogram

Circ <- fviz_dend(hc, cex = 0.5, lwd = 0.8, k = 4,
                 rect = TRUE,
                 k_colors = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"),
                 rect_border = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"),
                 rect_fill = TRUE,
                 type = "circular",
                 axes = F
                 ) 

Circ

Conclusion

Based on the analysis, health conscious customers should be cautious when making their choice at Starbucks, since the majority of products exceeds the recommended sugar intake level and the average drink contains ~230 calories. The healthier product categories are Refreshers, Coffee and Espresso-based drinks, and usually smaller portions and vegan options have better nutritional values in general. In terms of caffeine, the vast majority of products is within the recommended caffeine intake value. All things considered, Starbucks is probably not the best option for health conscious coffee lovers for daily consumption, as traditional and new wave (“specialty”) coffee shops provide a wide range of syrup-free and healthier coffee variations. If you are on a diet, smaller serving sizes and more traditional coffee/espresso and tea variations (such as pure espresso, brewed coffee or green tea) are safe options at Starbucks as well, but drinks from the Hot Chocolate and Frappuccino categories are better be saved for your “cheat days”. However, having your favorite drink at Starbucks - regardless of its category - every now and then won’t negatively effect your health/diet as long as you have an active and balanced lifestyle. :)