Starbucks Nutrition Facts Exploratory Data Analysis

The “starbucks.csv” dataset contains nutrition facts for several Starbucks food items. The dataset used in this document was extracted from Kaggle (https://www.kaggle.com/datasets/utkarshx27/starbucks-nutrition).

Preliminary

We will use the tidyverse collection of packages to explore the data. It includes dplyr for data manipulation and ggplot2 for creating the visualizations, among others.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

# importing the file
starbucks <- read.csv(file = 'starbucks.csv')
head(starbucks)
##   X                        item calories fat carb fiber protein   type
## 1 1                8-Grain Roll      350   8   67     5      10 bakery
## 2 2           Apple Bran Muffin      350   9   64     7       6 bakery
## 3 3               Apple Fritter      420  20   59     0       5 bakery
## 4 4             Banana Nut Loaf      490  19   75     4       7 bakery
## 5 5 Birthday Cake Mini Doughnut      130   6   17     0       0 bakery
## 6 6           Blueberry Oat Bar      370  14   47     5       6 bakery

We can remove the X column, which merely duplicates the row index.

sbux <- subset(starbucks, select = -c(X))
str(sbux) # viewing the data structure
## 'data.frame':    77 obs. of  7 variables:
##  $ item    : chr  "8-Grain Roll" "Apple Bran Muffin" "Apple Fritter" "Banana Nut Loaf" ...
##  $ calories: int  350 350 420 490 130 370 460 370 310 420 ...
##  $ fat     : num  8 9 20 19 6 14 22 14 18 25 ...
##  $ carb    : int  67 64 59 75 17 47 61 55 32 39 ...
##  $ fiber   : int  5 7 0 4 0 5 2 0 0 0 ...
##  $ protein : int  10 6 5 7 0 6 7 6 5 7 ...
##  $ type    : chr  "bakery" "bakery" "bakery" "bakery" ...

There are 77 observations in the dataset, described through the following 7 variables:

item: name of the food item, string

calories: calories per item, numerical

fat: grams of fat in the food item, numerical

carb: grams of carbohydrates in the food item, numerical

fiber: grams of fiber in the food item, numerical

protein: grams of protein in the food item, numerical

type: food category, stored as a string (not a factor) with the values bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich
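Since str() reports type as a character column, it can be converted to a factor whenever an explicit set of levels is useful, for example to control the order of categories in a plot. A minimal sketch, assuming a short stand-in vector (type_chr, a name chosen here for illustration) in place of sbux$type:

```r
# Stand-in for sbux$type: the seven categories listed above, with "bakery" repeated once.
type_chr <- c("bakery", "bistro box", "hot breakfast", "parfait",
              "petite", "salad", "sandwich", "bakery")

# Convert the character vector to a factor; levels are sorted alphabetically by default.
type_fct <- factor(type_chr)

levels(type_fct)   # the distinct categories
nlevels(type_fct)  # how many categories there are
table(type_fct)    # number of items per category
```

On the real data the same conversion would be sbux$type <- factor(sbux$type).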

Next, we find out how many items there are of each type and what share of the menu each type represents:

count_type <- sbux %>% 
  group_by(type) %>%                  # one group per food type
  count() %>%                         # number of items in each group
  ungroup() %>% 
  mutate(perc = n / sum(n)) %>%       # share of all items
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

count_type
## # A tibble: 7 × 4
##   type              n   perc labels
##   <chr>         <int>  <dbl> <chr> 
## 1 salad             1 0.0130 1.3%  
## 2 parfait           3 0.0390 3.9%  
## 3 sandwich          7 0.0909 9.1%  
## 4 bistro box        8 0.104  10.4% 
## 5 hot breakfast     8 0.104  10.4% 
## 6 petite            9 0.117  11.7% 
## 7 bakery           41 0.532  53.2%
# pie chart of results
ggplot(count_type, aes(x = "", y = perc, fill = type)) +
  geom_col(color = NA) +
  geom_label(aes(label = labels),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  guides(fill = guide_legend(title = "Food Types")) +
  coord_polar(theta = "y") + 
  theme_void() +
  labs(title = "Breakdown of Starbucks Food Menu Items by Type")
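The same counts and shares can be cross-checked without dplyr. The sketch below uses base R only and rebuilds the type column from the counts printed in count_type above, so it does not need the CSV file; the object names counts and shares are chosen here for illustration.

```r
# Rebuild the type column from the counts printed in count_type above.
type <- rep(c("salad", "parfait", "sandwich", "bistro box",
              "hot breakfast", "petite", "bakery"),
            times = c(1, 3, 7, 8, 8, 9, 41))

counts <- sort(table(type))    # items per type, in ascending order
shares <- prop.table(counts)   # proportions that sum to 1

round(100 * shares, 1)         # percentages with one decimal place
```

With the full data frame loaded, table(sbux$type) would take the place of the rebuilt vector.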

Food Item Calories: Highest, Lowest & Outliers

One of the first things we can explore in this dataset is which foods have the highest and lowest calorie counts. Since the dataset has 77 items, we will focus on the 5 items with the most calories and the 5 with the fewest. We can do so by sorting with arrange() and picking columns with select(), both from dplyr, and then keeping the first rows with head().

high_cal <- sbux %>% 
  arrange(desc(calories)) %>% 
  select(item, calories, type) %>% 
  head(5)
high_cal
##                                           item calories          type
## 1 Sausage & Cheddar Classic Breakfast Sandwich      500 hot breakfast
## 2                              Banana Nut Loaf      490        bakery
## 3                       Cranberry Orange Scone      490        bakery
## 4                        Iced Lemon Pound Cake      490        bakery
## 5                      Zucchini Walnut Muffin       490        bakery

The Sausage & Cheddar Classic Breakfast Sandwich has the most calories of all the items on the menu (500), but the next four items come very close at 490 each. Another interesting observation is that every item after the sandwich is of the bakery type. This might suggest that bakery items have a higher average caloric content than all other types. We will check this in the next section.
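Because head(5) keeps exactly five rows, any further items tied at the cut-off value would be dropped silently. The sketch below shows one way to keep ties, using only the five rows printed above; top_with_ties is a hypothetical helper written for this illustration, not a function from any package.

```r
# The five rows printed in high_cal above.
high_cal <- data.frame(
  item = c("Sausage & Cheddar Classic Breakfast Sandwich", "Banana Nut Loaf",
           "Cranberry Orange Scone", "Iced Lemon Pound Cake",
           "Zucchini Walnut Muffin"),
  calories = c(500, 490, 490, 490, 490)
)

# Hypothetical helper: keep the n highest values and every row tied with the n-th one.
top_with_ties <- function(df, col, n) {
  cutoff <- sort(df[[col]], decreasing = TRUE)[min(n, nrow(df))]
  df[df[[col]] >= cutoff, , drop = FALSE]
}

# Asking for the top 2 returns all five rows, because four items share 490 calories.
top_with_ties(high_cal, "calories", 2)
```

In a dplyr pipeline, slice_max() with its with_ties argument offers comparable behavior.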

low_cal <- sbux %>% 
  arrange(calories) %>% 
  select(item, calories, type) %>% 
  head(5) 
low_cal
##                          item calories          type
## 1          Deluxe Fruit Blend       80         salad
## 2 Birthday Cake Mini Doughnut      130        bakery
## 3  Double Fudge Mini Doughnut      130        bakery
## 4   Petite Vanilla Bean Scone      140        bakery
## 5   Starbucks Perfect Oatmeal      140 hot breakfast

The Deluxe Fruit Blend has noticeably fewer calories than the other low-calorie items (80 versus 130 or more), suggesting that it may be an outlier in the dataset. Calorie outliers will be covered in a later section.

Average Calorie Content of Each Food Item Type

In the previous section, we observed that 4 out of the 5 items with the most calories belong to the bakery food type. This raises the question: on average, how does each type of food compare, calorie-wise?

To find out, we group the data by type, summarize it using the mean function, and arrange it in descending order from highest average calories to lowest.

avg_cal <- sbux %>% 
  group_by(type) %>% 
  summarise(avg_cal = mean(calories)) %>%
  arrange(desc(avg_cal))
avg_cal
## # A tibble: 7 × 2
##   type          avg_cal
##   <chr>           <dbl>
## 1 sandwich         396.
## 2 bistro box       378.
## 3 bakery           369.
## 4 hot breakfast    325 
## 5 parfait          300 
## 6 petite           178.
## 7 salad             80

It turns out that sandwiches, not bakery items, contain the most calories on average (about 396), followed by bistro boxes and bakery items. We can visualize the data above using a bar chart:

ggplot(avg_cal, aes(x=reorder(type, avg_cal), y=avg_cal, fill=type)) +
  geom_bar(stat='identity') +
  coord_flip() +
  labs(title='Average Calories by Type', x='Type', y='Average Calories')
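A group average is easier to judge alongside the number of items behind it; the salad average, for instance, rests on a single item. The following base R sketch reports both, using nine rows copied from the outputs above as a stand-in for the full dataset (so the averages differ from the table above):

```r
# Nine rows taken from the head(), high_cal and low_cal outputs above.
sample_items <- data.frame(
  calories = c(350, 350, 420, 490, 130, 370, 80, 140, 500),
  type = c(rep("bakery", 6), "salad", "hot breakfast", "hot breakfast")
)

avg <- tapply(sample_items$calories, sample_items$type, mean)    # mean per type
n   <- tapply(sample_items$calories, sample_items$type, length)  # items per type

by_type <- data.frame(type = names(avg), avg_cal = as.vector(avg), n = as.vector(n))
by_type[order(-by_type$avg_cal), ]   # highest average first
```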

Outliers in Calorie Information

When exploring data, it is important to detect any possible outliers, since they can have a large influence on statistics derived from the dataset. We want to make sure that the average calorie data by type we extracted in the previous step is not skewed by extreme values. To find any outliers, we can plot the distribution of the data using a box plot.

# a single box does not display group names, so label the axis instead
boxplot(sbux$calories,
        ylab = "Calories",
        main = "Calorie Distribution in Starbucks Food Items")

There is only one outlier in the dataset, which lies at the lower extreme of the data. We can identify that particular food item using the code below:

c_out <- boxplot.stats(starbucks$calories)$out
c_out_index <- which(starbucks$calories %in% c_out)
c_out_item <- starbucks[c_out_index, "item"]
c_out_item
## [1] "Deluxe Fruit Blend"

When we selected the bottom 5 items in terms of caloric content, Deluxe Fruit Blend was the item with the fewest calories. The box plot confirms our earlier suspicion that this item is an outlier in the dataset. Deluxe Fruit Blend is also the only item that belongs to the “salad” type.
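The rule behind the box plot can be worked through by hand: a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the calorie quartiles reported by summary(sbux) later in this document (300 and 420). Note that boxplot.stats() works with Tukey's hinges, which can differ slightly from quartiles in general.

```r
# Quartiles of calories as reported by summary(sbux).
q1 <- 300
q3 <- 420
iqr <- q3 - q1                  # 120

lower_fence <- q1 - 1.5 * iqr   # 120
upper_fence <- q3 + 1.5 * iqr   # 600

# A few of the lowest and highest calorie values seen in the tables above.
calories <- c(80, 130, 140, 490, 500)

# Values outside the fences are flagged as outliers; here only 80 qualifies.
calories[calories < lower_fence | calories > upper_fence]
```

This also explains why the mini doughnuts at 130 calories are not flagged: they sit just above the lower fence of 120.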

low_cal <- sbux %>% 
  arrange(calories) %>% 
  select(item, calories, type) %>% 
  head(1) 
low_cal
##                 item calories  type
## 1 Deluxe Fruit Blend       80 salad

Food Items with the Highest Levels of Each Nutritional Property

Next, we can check which food items have the highest levels of each nutritional property: fat, carbohydrates, fiber, and protein, by using a similar expression to the one in the previous section.

  1. Fat
high_fat <- sbux %>% arrange(desc(fat)) %>% select(item, fat, type) %>% head(5)
high_fat
##                                           item fat          type
## 1                      Zucchini Walnut Muffin   28        bakery
## 2                               Cheese & Fruit  28    bistro box
## 3 Sausage & Cheddar Classic Breakfast Sandwich  28 hot breakfast
## 4                          Egg Salad Sandwich   27      sandwich
## 5                              Salumi & Cheese  26    bistro box
  2. Carbohydrates
high_carb <- sbux %>% arrange(desc(carb)) %>% select(item, carb, type) %>% head(5)
high_carb
##                                            item carb   type
## 1 Reduced-Fat Banana Chocolate Chip Coffee Cake   80 bakery
## 2                                Pumpkin Scone    78 bakery
## 3                               Banana Nut Loaf   75 bakery
## 4                        Cranberry Orange Scone   73 bakery
## 5                           Cinnamon Chip Scone   70 bakery
  3. Fiber
high_fiber <- sbux %>% arrange(desc(fiber)) %>% select(item, fiber, type) %>% head(5)
high_fiber
##                     item fiber       type
## 1      Apple Bran Muffin     7     bakery
## 2       Multigrain Bagel     6     bakery
## 3         Cheese & Fruit     6 bistro box
## 4       Chicken & Hummus     6 bistro box
## 5 Chipotle Chicken Wraps     6 bistro box
  4. Protein
high_protein <- sbux %>% arrange(desc(protein)) %>% select(item, protein, type) %>% head(5)
high_protein
##                              item protein       type
## 1         Turkey & Swiss Sandwich      34   sandwich
## 2 Tarragon Chicken Salad Sandwich      32   sandwich
## 3              Ham & Swiss Panini      28   sandwich
## 4          Chipotle Chicken Wraps      26 bistro box
## 5         Chicken Santa Fe Panini      26   sandwich
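The four pipelines above differ only in the column they sort by, so the pattern can be wrapped in a small function. This is a sketch in base R: top_by is a hypothetical helper name, and the data frame holds just the six rows shown in the head() output, so its rankings differ from the full-data results above.

```r
# First six rows of the dataset, copied from the head() output above.
sample_items <- data.frame(
  item = c("8-Grain Roll", "Apple Bran Muffin", "Apple Fritter",
           "Banana Nut Loaf", "Birthday Cake Mini Doughnut", "Blueberry Oat Bar"),
  fat = c(8, 9, 20, 19, 6, 14),
  carb = c(67, 64, 59, 75, 17, 47),
  fiber = c(5, 7, 0, 4, 0, 5),
  protein = c(10, 6, 5, 7, 0, 6),
  type = "bakery"
)

# Hypothetical helper: the n rows with the largest values in column `col`.
top_by <- function(df, col, n = 5) {
  ranked <- df[order(df[[col]], decreasing = TRUE), c("item", col, "type")]
  head(ranked, n)
}

# One call per nutrient instead of four copies of the same pipeline.
lapply(c("fat", "carb", "fiber", "protein"),
       function(col) top_by(sample_items, col, 3))
```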

Food Item Types with the Highest Average Fat, Carbohydrate, Fiber, and Protein Content

In the previous section, we made two interesting observations:

  1. Bakery items seem in general to have more carbohydrates than items of any other type.

  2. Sandwiches seem to have the highest amount of protein.

One way to test these assumptions is to calculate the average fat, carbohydrate, fiber and protein content of each item type.

avg_by_type <- sbux %>%
  group_by(type) %>%
  summarize(avg_fat = mean(fat),
            avg_carb = mean(carb),
            avg_fiber = mean(fiber),
            avg_protein = mean(protein))
avg_by_type
## # A tibble: 7 × 5
##   type          avg_fat avg_carb avg_fiber avg_protein
##   <chr>           <dbl>    <dbl>     <dbl>       <dbl>
## 1 bakery          14.6      54.3      1.95        5.93
## 2 bistro box      18.4      33.6      5.12       19.1 
## 3 hot breakfast   13.7      33.2      2.25       16.1 
## 4 parfait          6.5      53.7      2           8.33
## 5 petite           9.33     23.3      0           1.11
## 6 salad            0        20        2           0   
## 7 sandwich        14.7      43        3.43       24.3

We can plot each macronutrient in a separate bar graph:

ggplot(avg_by_type, aes(x=reorder(type, -avg_fat), y=avg_fat, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Fat by Type', x='Type', y='Average Fat (g)')

ggplot(avg_by_type, aes(x=reorder(type, -avg_carb), y=avg_carb, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Carb by Type', x='Type', y='Average Carb (g)')

ggplot(avg_by_type, aes(x=reorder(type, -avg_fiber), y=avg_fiber, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Fiber by Type', x='Type', y='Average Fiber (g)')

ggplot(avg_by_type, aes(x=reorder(type, -avg_protein), y=avg_protein, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Protein by Type', x='Type', y='Average Protein (g)')

Bakery items do indeed have the most carbohydrates on average (54.3 g), but they are closely followed by the parfait type (53.7 g). Sandwiches also have the highest average protein content (24.3 g) of all types. When it comes to fat and fiber, the bistro box category has the highest average amount of both.

Are there any food items that are particularly rich in fiber but low in fat and carbohydrates?

To answer this question, we first need to define what counts as “rich in fiber” or “low in fat/carbs”. To do so, we can look at the summary statistics for each macronutrient.

summary(sbux)
##      item              calories          fat             carb      
##  Length:77          Min.   : 80.0   Min.   : 0.00   Min.   :16.00  
##  Class :character   1st Qu.:300.0   1st Qu.: 9.00   1st Qu.:31.00  
##  Mode  :character   Median :350.0   Median :13.00   Median :45.00  
##                     Mean   :338.8   Mean   :13.77   Mean   :44.87  
##                     3rd Qu.:420.0   3rd Qu.:18.00   3rd Qu.:59.00  
##                     Max.   :500.0   Max.   :28.00   Max.   :80.00  
##      fiber          protein           type          
##  Min.   :0.000   Min.   : 0.000   Length:77         
##  1st Qu.:0.000   1st Qu.: 5.000   Class :character  
##  Median :2.000   Median : 7.000   Mode  :character  
##  Mean   :2.221   Mean   : 9.481                     
##  3rd Qu.:4.000   3rd Qu.:15.000                     
##  Max.   :7.000   Max.   :34.000

We will treat an item as “rich in fiber” if it has more than the 3rd quartile value of 4 grams. For fat, we will use the 1st quartile (Q1) value of 9 grams as the upper limit, and we will do the same for carbohydrates with the Q1 value of 31 grams. Even though 31 grams of carbohydrates is not technically low, if we pick a lower value we might not be able to find any items that fit our criteria.

q3 <- sbux %>%
  filter(fiber > 4, fat < 9, carb < 31) %>%
  select(item, fiber, fat, carb)
q3
##               item fiber fat carb
## 1 Chicken & Hummus     6   8   29

Correlation between Caloric Content and Macronutrient Amount

To find out whether there is a correlation between the caloric content of Starbucks’ food items and their nutritional values, we can create a correlation matrix and then visualize its values using a heatmap.

cor_matrix <- cor(sbux[, c("calories", "fat", "carb", "fiber", "protein")])
cor_matrix
##           calories         fat        carb       fiber     protein
## calories 1.0000000  0.75868250  0.67499902  0.26064508  0.41039771
## fat      0.7586825  1.00000000  0.14454651 -0.02854851  0.22347000
## carb     0.6749990  0.14454651  1.00000000  0.21304449 -0.05078924
## fiber    0.2606451 -0.02854851  0.21304449  1.00000000  0.48856400
## protein  0.4103977  0.22347000 -0.05078924  0.48856400  1.00000000
# suppress the dendrograms and turn off row scaling so the colors reflect the raw correlations
heatmap(cor_matrix, Rowv = NA, Colv = NA, scale = "none")

Calories are strongly correlated with the fat (r ≈ 0.76) and carbohydrate (r ≈ 0.67) content of an item, but only moderately with protein (r ≈ 0.41) and weakly with fiber (r ≈ 0.26).
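Each entry of the matrix is a Pearson correlation coefficient, which is cor()'s default method. To make explicit what that number measures, the sketch below computes it by hand for calories and fat on the six rows from the head() output and compares it with cor(); because of the small sample, the value will not match the 0.76 obtained on the full dataset.

```r
# Calories and fat for the first six items, copied from the head() output above.
calories <- c(350, 350, 420, 490, 130, 370)
fat      <- c(8, 9, 20, 19, 6, 14)

# Pearson's r: the summed products of deviations from the means,
# divided by the square root of the product of the summed squared deviations.
dev_cal <- calories - mean(calories)
dev_fat <- fat - mean(fat)
r_manual <- sum(dev_cal * dev_fat) / sqrt(sum(dev_cal^2) * sum(dev_fat^2))

r_builtin <- cor(calories, fat)   # same quantity via the built-in function
all.equal(r_manual, r_builtin)
```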

Differences in the Nutritional Profiles of Food Items Across Different Categories

So far, we have observed that different food types have different average compositions. It is helpful to map these compositions in a stacked bar chart, to observe the differences across the types simultaneously.

avg_composition <- aggregate(cbind(fat, carb, fiber, protein) ~ type, sbux, mean)

# Reshape the data into long format
avg_composition_long <- tidyr::gather(avg_composition, nutrient, value, -type)

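# Create the stacked bar chart
# nutrient is a character column, so ggplot2 orders it alphabetically (carb, fat, ...);
# breaks pins the order so that each label and color is attached to the right nutrient
compositions <- ggplot(avg_composition_long, aes(x = type, y = value, fill = nutrient)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Composition by Food Type", x = "Food Type", y = "Average Amount (g)") +
  scale_fill_manual(breaks = c("fat", "carb", "fiber", "protein"),
                    values = c(fat = "steelblue", carb = "lightgreen",
                               fiber = "orange", protein = "pink"),
                    labels = c("Fat", "Carbohydrates", "Fiber", "Protein")) +
  theme_minimal()
compositions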

Food Items that Provide a Good Balance of Macronutrients

According to medical professionals (https://www.prospectmedical.com/resources/wellness-center/macronutrients-fats-carbs-protein#:~:text=In%20general%2C%20most%20adults%20should,30%2D40%25%20Fat.), most adults should aim for a diet consisting of 45-65% carbohydrates, 10-35% protein and 20-35% fat. Such ranges are typically expressed as shares of calorie intake; as a simplification, the analysis below applies them to each item's composition by weight in grams. Keeping that in mind, we can determine which items from the Starbucks dataset can serve as better meal options.

First, we need to figure out the composition of each item from the list:

sum_composition <- aggregate(cbind(fat, carb, fiber, protein) ~ item, sbux, sum)

# Calculate the ratio of each macronutrient within each food item
ratio_composition <- transform(sum_composition, 
                               fat_ratio = round(fat / (fat + carb + fiber + protein), 3),
                               carb_ratio = round(carb / (fat + carb + fiber + protein), 3),
                               fiber_ratio = round(fiber / (fat + carb + fiber + protein), 3),
                               protein_ratio = round(protein / (fat + carb + fiber + protein), 3))

# Display the ratio composition
composition_ratios <- subset(ratio_composition, select = -c(fat, carb, fiber, protein))
head(composition_ratios)
##                                       item fat_ratio carb_ratio fiber_ratio
## 1                             8-Grain Roll     0.089      0.744       0.056
## 2                        Apple Bran Muffin     0.105      0.744       0.081
## 3                            Apple Fritter     0.238      0.702       0.000
## 4                                Apple Pie     0.194      0.750       0.000
## 5 Bacon & Gouda Artisan Breakfast Sandwich     0.277      0.462       0.000
## 6                          Banana Nut Loaf     0.181      0.714       0.038
##   protein_ratio
## 1         0.111
## 2         0.070
## 3         0.060
## 4         0.056
## 5         0.262
## 6         0.067

Now that we have the list, we can pick out the items that have macronutrients spread across those percentages:

balanced_food <- composition_ratios[
  composition_ratios$fat_ratio >= 0.2 & composition_ratios$fat_ratio <= 0.35 &
    composition_ratios$carb_ratio >= 0.45 &
    composition_ratios$carb_ratio <= 0.65 &
    composition_ratios$protein_ratio >= 0.1 & composition_ratios$protein_ratio <= 0.35,
]
balanced_food
##                                                 item fat_ratio carb_ratio
## 5           Bacon & Gouda Artisan Breakfast Sandwich     0.277      0.462
## 38          Ham & Cheddar Artisan Breakfast Sandwich     0.239      0.463
## 52                                           Protein     0.257      0.500
## 62                Roasted Tomato & Mozzarella Panini     0.225      0.550
## 65      Sausage & Cheddar Classic Breakfast Sandwich     0.318      0.466
## 76 Veggie & Monterey Jack Artisan Breakfast Sandwich     0.277      0.462
##    fiber_ratio protein_ratio
## 5        0.000         0.262
## 38       0.000         0.299
## 52       0.068         0.176
## 62       0.038         0.188
## 65       0.000         0.216
## 76       0.000         0.262
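The range checks in the filter can be made easier to read with a small predicate. The sketch below is a base R illustration: in_range is a hypothetical helper, and the data frame contains only the six rows printed by head(composition_ratios) above, of which one item passes all three checks.

```r
# The six rows printed by head(composition_ratios) above (fiber_ratio omitted).
composition_ratios <- data.frame(
  item = c("8-Grain Roll", "Apple Bran Muffin", "Apple Fritter", "Apple Pie",
           "Bacon & Gouda Artisan Breakfast Sandwich", "Banana Nut Loaf"),
  fat_ratio     = c(0.089, 0.105, 0.238, 0.194, 0.277, 0.181),
  carb_ratio    = c(0.744, 0.744, 0.702, 0.750, 0.462, 0.714),
  protein_ratio = c(0.111, 0.070, 0.060, 0.056, 0.262, 0.067)
)

# Hypothetical helper: TRUE where x lies within [lo, hi], bounds included.
in_range <- function(x, lo, hi) x >= lo & x <= hi

keep <- in_range(composition_ratios$fat_ratio, 0.20, 0.35) &
  in_range(composition_ratios$carb_ratio, 0.45, 0.65) &
  in_range(composition_ratios$protein_ratio, 0.10, 0.35)

balanced <- composition_ratios[keep, ]
balanced$item
```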

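Using the criteria specified at the beginning of this section, there are 6 items in the Starbucks food list that can be considered balanced meals in terms of macronutrient composition. All of them are sandwiches by name (though not necessarily by type: the Sausage & Cheddar Classic Breakfast Sandwich, for instance, is listed as hot breakfast), with the exception of “Protein”, which falls under the bistro box type.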
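The ratios above are shares by weight. Since a gram of fat carries more energy than a gram of carbohydrate or protein, a calorie-based view can look quite different. The sketch below estimates calorie shares using the commonly cited general factors of 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. These factors are an assumption introduced here, not part of the dataset, and the estimated totals will not exactly match the calories column.

```r
# Three rows copied from the head() output above.
sample_items <- data.frame(
  item = c("8-Grain Roll", "Apple Bran Muffin", "Apple Fritter"),
  fat = c(8, 9, 20),
  carb = c(67, 64, 59),
  protein = c(10, 6, 5)
)

# Assumed energy per gram: fat 9 kcal, carbohydrate 4 kcal, protein 4 kcal.
kcal <- with(sample_items,
             cbind(fat = 9 * fat, carb = 4 * carb, protein = 4 * protein))

# Share of the estimated calories contributed by each macronutrient (rows sum to 1).
kcal_share <- kcal / rowSums(kcal)

cbind(sample_items["item"], round(kcal_share, 3))
```

For the 8-Grain Roll, for example, fat contributes 72 of an estimated 380 kcal, a noticeably larger share than its 8.9% by weight.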