The “starbucks.csv” dataset contains nutrition facts for several Starbucks food items. The dataset used in this document was extracted from Kaggle (https://www.kaggle.com/datasets/utkarshx27/starbucks-nutrition).
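We will use the tidyverse collection of packages to explore the data. It includes dplyr for data manipulation and ggplot2 for the visualizations, among others. Because ggplot2 is part of the core tidyverse, the separate library(ggplot2) call below is harmless but not strictly needed.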
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
# importing the file
starbucks <- read.csv(file = 'starbucks.csv')
head(starbucks)
## X item calories fat carb fiber protein type
## 1 1 8-Grain Roll 350 8 67 5 10 bakery
## 2 2 Apple Bran Muffin 350 9 64 7 6 bakery
## 3 3 Apple Fritter 420 20 59 0 5 bakery
## 4 4 Banana Nut Loaf 490 19 75 4 7 bakery
## 5 5 Birthday Cake Mini Doughnut 130 6 17 0 0 bakery
## 6 6 Blueberry Oat Bar 370 14 47 5 6 bakery
We can remove the X column, which serves as a secondary index.
sbux <- subset(starbucks, select = -c(X))
str(sbux) # viewing the data structure
## 'data.frame': 77 obs. of 7 variables:
## $ item : chr "8-Grain Roll" "Apple Bran Muffin" "Apple Fritter" "Banana Nut Loaf" ...
## $ calories: int 350 350 420 490 130 370 460 370 310 420 ...
## $ fat : num 8 9 20 19 6 14 22 14 18 25 ...
## $ carb : int 67 64 59 75 17 47 61 55 32 39 ...
## $ fiber : int 5 7 0 4 0 5 2 0 0 0 ...
## $ protein : int 10 6 5 7 0 6 7 6 5 7 ...
## $ type : chr "bakery" "bakery" "bakery" "bakery" ...
There are 77 observations in the dataset, described through the following 7 variables:
item
: food item, string
calories
: calories, numerical
fat
: grams of fat in item, numerical
carb
: grams of carbohydrates in item, numerical
fiber
: grams of fiber in food item, numerical
protein
: grams of protein in food item, numerical
type
: food category, stored as a string; one of bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich
Finding out how many items there are of each type:
count_type <- sbux %>%
  group_by(type) %>%              # group the items by food type
  count() %>%                     # number of items per type
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%   # share of all items
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
count_type
## # A tibble: 7 × 4
## type n perc labels
## <chr> <int> <dbl> <chr>
## 1 salad 1 0.0130 1.3%
## 2 parfait 3 0.0390 3.9%
## 3 sandwich 7 0.0909 9.1%
## 4 bistro box 8 0.104 10.4%
## 5 hot breakfast 8 0.104 10.4%
## 6 petite 9 0.117 11.7%
## 7 bakery 41 0.532 53.2%
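As a side note, the same tally can be written more compactly with dplyr's count(), which groups, counts, and ungroups in one step. The sketch below is only an illustration: it rebuilds just the type column from the counts printed above rather than reading starbucks.csv, and the names types and count_type_alt are made up for this example.

```r
library(dplyr)

# Rebuild only the `type` column from the counts printed above (77 items in total)
types <- tibble(
  type = rep(c("salad", "parfait", "sandwich", "bistro box",
               "hot breakfast", "petite", "bakery"),
             times = c(1, 3, 7, 8, 8, 9, 41))
)

# count() replaces the group_by() / count() / ungroup() sequence
count_type_alt <- types %>%
  count(type, name = "n") %>%
  mutate(perc = n / sum(n)) %>%
  arrange(perc)

count_type_alt
```

On the real data, sbux %>% count(type) should give the same n column as the longer pipeline.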
# pie chart of results
ggplot(count_type, aes(x = "", y = perc, fill = type)) +
  geom_col(color = NA) +
  geom_label(aes(label = labels),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  guides(fill = guide_legend(title = "Food Types")) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Breakdown of Starbucks Food Menu Items by Type")
One of the first things we can explore in this dataset is which foods have the most and the fewest calories. Since the dataset has 77 items, we focus on the top 5 and the bottom 5. We can do so by sorting the rows with arrange() and keeping the relevant columns with select(), both from dplyr, and then taking the first five rows with head().
high_cal <- sbux %>%
  arrange(desc(calories)) %>%
  select(item, calories, type) %>%
  head(5)
high_cal
## item calories type
## 1 Sausage & Cheddar Classic Breakfast Sandwich 500 hot breakfast
## 2 Banana Nut Loaf 490 bakery
## 3 Cranberry Orange Scone 490 bakery
## 4 Iced Lemon Pound Cake 490 bakery
## 5 Zucchini Walnut Muffin 490 bakery
The Sausage & Cheddar Classic Breakfast Sandwich has the most calories of all the items on the menu, but the next four items come within 10 calories of it. Interestingly, all four of those items are of the bakery type. This might suggest that bakery items have a higher average caloric content than the other types. We will check this in the next section.
low_cal <- sbux %>%
  arrange(calories) %>%
  select(item, calories, type) %>%
  head(5)
low_cal
## item calories type
## 1 Deluxe Fruit Blend 80 salad
## 2 Birthday Cake Mini Doughnut 130 bakery
## 3 Double Fudge Mini Doughnut 130 bakery
## 4 Petite Vanilla Bean Scone 140 bakery
## 5 Starbucks Perfect Oatmeal 140 hot breakfast
The Deluxe Fruit Blend has noticeably fewer calories than the other low-calorie items (80 versus 130 or more), suggesting that it may be an outlier in the dataset. Calorie outliers are covered in a later section.
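One caveat with the arrange() plus head(5) pattern is that it cuts the list at exactly five rows, even when several items are tied at the cut-off (four items share 490 calories above). A possible alternative, sketched here under the assumption that dplyr 1.0.0 or later is installed, is slice_max() and slice_min(), which keep ties by default. The sample_items tibble is a hypothetical name and contains only the ten rows printed above, not the full dataset.

```r
library(dplyr)

# The ten items printed in the high_cal and low_cal outputs above
sample_items <- tibble(
  item = c("Sausage & Cheddar Classic Breakfast Sandwich", "Banana Nut Loaf",
           "Cranberry Orange Scone", "Iced Lemon Pound Cake",
           "Zucchini Walnut Muffin", "Deluxe Fruit Blend",
           "Birthday Cake Mini Doughnut", "Double Fudge Mini Doughnut",
           "Petite Vanilla Bean Scone", "Starbucks Perfect Oatmeal"),
  calories = c(500, 490, 490, 490, 490, 80, 130, 130, 140, 140)
)

# Asking for the top 2 returns every item tied for second place as well
top_with_ties <- sample_items %>% slice_max(calories, n = 2)

# with_ties = FALSE cuts at exactly n rows, like head(n) after arrange()
top_strict <- sample_items %>% slice_max(calories, n = 2, with_ties = FALSE)

# The single lowest-calorie item
lowest <- sample_items %>% slice_min(calories, n = 1)

top_with_ties
top_strict
lowest
```

Whether ties should be kept depends on the question being asked; for a fixed-length "top 5" table the original approach is fine.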
In the previous section, we observed that 4 of the 5 items with the most calories belong to the bakery type. This raises the question: on average, how do the food types compare calorie-wise?
To find out, we group the data by type, summarize it with the mean function, and arrange the result in descending order of average calories.
avg_cal <- sbux %>%
  group_by(type) %>%
  summarise(avg_cal = mean(calories)) %>%
  arrange(desc(avg_cal))
avg_cal
## # A tibble: 7 × 2
## type avg_cal
## <chr> <dbl>
## 1 sandwich 396.
## 2 bistro box 378.
## 3 bakery 369.
## 4 hot breakfast 325
## 5 parfait 300
## 6 petite 178.
## 7 salad 80
It turns out that sandwiches, not bakery items, contain the most calories on average; bakery items rank third, behind bistro boxes. We can visualize the data above using a bar chart:
ggplot(avg_cal, aes(x=reorder(type, avg_cal), y=avg_cal, fill=type)) +
  geom_bar(stat='identity') +
  coord_flip() +
  labs(title='Average Calories by Type', x='Type', y='Average Calories')
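When reading these averages, it helps to keep the group sizes in mind: the salad average rests on a single item, while the bakery average rests on 41. The base R sketch below combines the counts and the rounded averages printed earlier in this document (it does not read the CSV, and the variable names are made up) to show that the overall mean is a count-weighted mean of the group means, not their simple average.

```r
# Counts and rounded averages copied from the tables printed above
type    <- c("sandwich", "bistro box", "bakery", "hot breakfast",
             "parfait", "petite", "salad")
n       <- c(7, 8, 41, 8, 3, 9, 1)
avg_cal <- c(396, 378, 369, 325, 300, 178, 80)

# Simple average of the seven group means: every type counts equally
unweighted_mean <- mean(avg_cal)

# Count-weighted average: every item counts equally
weighted_mean <- sum(n * avg_cal) / sum(n)

unweighted_mean
weighted_mean
```

The weighted figure comes out at roughly 339, close to the overall mean of 338.8 reported by summary(); the small gap comes from using rounded group averages. The unweighted figure is noticeably lower because small groups such as salad and petite pull it down.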
When exploring data, it is important to detect possible outliers, since they can have a large influence on statistics derived from the dataset. We want to make sure that the average calories by type computed in the previous step are not distorted by extreme values. To look for outliers, we can plot the distribution of the calorie data as a box plot.
boxplot(starbucks$calories,
        names = c("Calories"),
        main = "Calorie Distribution in Starbucks Food Items")
There is only one outlier in the dataset, which lies at the lower extreme of the data. We can identify that particular food item using the code below:
c_out <- boxplot.stats(starbucks$calories)$out
c_out_index <- which(starbucks$calories %in% c_out)
c_out_item <- starbucks[c_out_index, "item"]
c_out_item
## [1] "Deluxe Fruit Blend"
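To see why this item is flagged, recall that boxplot.stats() by default marks points lying more than 1.5 times the interquartile range beyond the hinges. The sketch below reproduces that rule by hand. It assumes the hinges match the quartiles reported by summary() elsewhere in this document (300 and 420), and it uses only the five lowest calorie values printed earlier.

```r
# Quartiles of calories as reported by summary(sbux) in this document
q1 <- 300
q3 <- 420
iqr <- q3 - q1                    # 120

# Fences of the 1.5 * IQR rule (coef = 1.5 is the boxplot.stats() default)
lower_fence <- q1 - 1.5 * iqr     # 120
upper_fence <- q3 + 1.5 * iqr     # 600

# The five lowest calorie values from the low_cal output
lowest_calories <- c(80, 130, 130, 140, 140)

# Values below the lower fence are flagged as outliers
flagged <- lowest_calories[lowest_calories < lower_fence]
flagged
```

Only the value 80 falls below the lower fence of 120, and no item can exceed the upper fence of 600 because the maximum is 500.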
When we selected the bottom 5 items in terms of caloric content, Deluxe Fruit Blend was the item with the fewest calories. This confirms our earlier suspicion that it is an outlier in the dataset. Deluxe Fruit Blend is also the only item that belongs to the “salad” type.
low_cal <- sbux %>%
  arrange(calories) %>%
  select(item, calories, type) %>%
  head(1)
low_cal
## item calories type
## 1 Deluxe Fruit Blend 80 salad
Next, we can check which food items have the highest levels of each nutritional property: fat, carbohydrates, fiber, and protein, by using a similar expression to the one in the previous section.
high_fat <- sbux %>% arrange(desc(fat)) %>% select(item, fat, type) %>% head(5)
high_fat
## item fat type
## 1 Zucchini Walnut Muffin 28 bakery
## 2 Cheese & Fruit 28 bistro box
## 3 Sausage & Cheddar Classic Breakfast Sandwich 28 hot breakfast
## 4 Egg Salad Sandwich 27 sandwich
## 5 Salumi & Cheese 26 bistro box
high_carb <- sbux %>% arrange(desc(carb)) %>% select(item, carb, type) %>% head(5)
high_carb
## item carb type
## 1 Reduced-Fat Banana Chocolate Chip Coffee Cake 80 bakery
## 2 Pumpkin Scone 78 bakery
## 3 Banana Nut Loaf 75 bakery
## 4 Cranberry Orange Scone 73 bakery
## 5 Cinnamon Chip Scone 70 bakery
high_fiber <- sbux %>% arrange(desc(fiber)) %>% select(item, fiber, type) %>% head(5)
high_fiber
## item fiber type
## 1 Apple Bran Muffin 7 bakery
## 2 Multigrain Bagel 6 bakery
## 3 Cheese & Fruit 6 bistro box
## 4 Chicken & Hummus 6 bistro box
## 5 Chipotle Chicken Wraps 6 bistro box
high_protein <- sbux %>% arrange(desc(protein)) %>% select(item, protein, type) %>% head(5)
high_protein
## item protein type
## 1 Turkey & Swiss Sandwich 34 sandwich
## 2 Tarragon Chicken Salad Sandwich 32 sandwich
## 3 Ham & Swiss Panini 28 sandwich
## 4 Chipotle Chicken Wraps 26 bistro box
## 5 Chicken Santa Fe Panini 26 sandwich
In the previous section, we made two interesting observations:
Bakery items seem to have, in general, more carbohydrates than items of any other type.
Sandwiches seem to have the highest amount of protein.
One way to check these observations is to calculate the average fat, carbohydrate, fiber and protein content of each item type.
avg_by_type <- sbux %>%
  group_by(type) %>%
  summarize(avg_fat = mean(fat),
            avg_carb = mean(carb),
            avg_fiber = mean(fiber),
            avg_protein = mean(protein))
avg_by_type
## # A tibble: 7 × 5
## type avg_fat avg_carb avg_fiber avg_protein
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bakery 14.6 54.3 1.95 5.93
## 2 bistro box 18.4 33.6 5.12 19.1
## 3 hot breakfast 13.7 33.2 2.25 16.1
## 4 parfait 6.5 53.7 2 8.33
## 5 petite 9.33 23.3 0 1.11
## 6 salad 0 20 2 0
## 7 sandwich 14.7 43 3.43 24.3
We can plot each macronutrient in a separate bar graph:
ggplot(avg_by_type, aes(x=reorder(type, -avg_fat), y=avg_fat, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Fat by Type', x='Type', y='Average Fat (g)')
ggplot(avg_by_type, aes(x=reorder(type, -avg_carb), y=avg_carb, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Carb by Type', x='Type', y='Average Carb (g)')
ggplot(avg_by_type, aes(x=reorder(type, -avg_fiber), y=avg_fiber, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Fiber by Type', x='Type', y='Average Fiber (g)')
ggplot(avg_by_type, aes(x=reorder(type, -avg_protein), y=avg_protein, fill=type)) +
  geom_bar(stat='identity') +
  labs(title='Average Protein by Type', x='Type', y='Average Protein (g)')
Bakery items do indeed have the most carbohydrates on average, but they are closely followed by the parfait type (54.3 g versus 53.7 g). Sandwiches have the highest average protein content of all types. When it comes to fat and fiber, the bistro box category has the highest average amount of both.
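As an alternative to four separate charts, the averages could also be shown in a single faceted figure. The sketch below is one possible way to do that, assuming ggplot2 and tidyr are installed. It rebuilds avg_by_type from the rounded values printed above so that it runs on its own; avg_long and facet_plot are names introduced only for this example.

```r
library(ggplot2)
library(tidyr)

# Rounded averages copied from the avg_by_type table above
avg_by_type <- data.frame(
  type        = c("bakery", "bistro box", "hot breakfast", "parfait",
                  "petite", "salad", "sandwich"),
  avg_fat     = c(14.6, 18.4, 13.7, 6.5, 9.33, 0, 14.7),
  avg_carb    = c(54.3, 33.6, 33.2, 53.7, 23.3, 20, 43),
  avg_fiber   = c(1.95, 5.12, 2.25, 2, 0, 2, 3.43),
  avg_protein = c(5.93, 19.1, 16.1, 8.33, 1.11, 0, 24.3)
)

# Long format: one row per (type, nutrient) pair, with the "avg_" prefix dropped
avg_long <- pivot_longer(avg_by_type,
                         cols = -type,
                         names_to = "nutrient",
                         names_prefix = "avg_",
                         values_to = "grams")

# One panel per nutrient, each with its own y-axis range
facet_plot <- ggplot(avg_long, aes(x = type, y = grams, fill = type)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ nutrient, scales = "free_y") +
  labs(title = "Average Nutrient Content by Type", x = "Type", y = "Average (g)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

facet_plot
```

The trade-off is that bars can no longer be sorted independently in each panel without extra work, which the separate charts handle with reorder().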
To find items that are rich in fiber but low in fat and carbs, we first need to define what counts as “rich in fiber” or “low in fat/carbs”. To do so, we can look at the summary statistics for each macronutrient.
summary(sbux)
## item calories fat carb
## Length:77 Min. : 80.0 Min. : 0.00 Min. :16.00
## Class :character 1st Qu.:300.0 1st Qu.: 9.00 1st Qu.:31.00
## Mode :character Median :350.0 Median :13.00 Median :45.00
## Mean :338.8 Mean :13.77 Mean :44.87
## 3rd Qu.:420.0 3rd Qu.:18.00 3rd Qu.:59.00
## Max. :500.0 Max. :28.00 Max. :80.00
## fiber protein type
## Min. :0.000 Min. : 0.000 Length:77
## 1st Qu.:0.000 1st Qu.: 5.000 Class :character
## Median :2.000 Median : 7.000 Mode :character
## Mean :2.221 Mean : 9.481
## 3rd Qu.:4.000 3rd Qu.:15.000
## Max. :7.000 Max. :34.000
For “rich in fiber” we take the 3rd quartile value of 4 grams as the threshold, so an item must contain more than 4 grams of fiber. For “low in fat” we take the Q1 value of 9 grams, and for “low in carbs” the Q1 value of 31 grams. Even though 31 grams of carbohydrates is not low in absolute terms, a stricter threshold might leave no items that meet all three criteria.
q3 <- sbux %>%
  filter(fiber > 4, fat < 9, carb < 31) %>%
  select(item, fiber, fat, carb)
q3
## item fiber fat carb
## 1 Chicken & Hummus 6 8 29
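Because the result depends entirely on the chosen cut-offs, it can be useful to treat them as parameters and see how the answer changes. The sketch below wraps the filter in a hypothetical helper, find_fiber_rich(), written in base R. It runs on seven rows copied from outputs shown in this document rather than on the full dataset, so its results are illustrative only.

```r
# Rows copied from outputs shown in this document (values in grams)
sample_items <- data.frame(
  item  = c("8-Grain Roll", "Apple Bran Muffin", "Apple Fritter",
            "Banana Nut Loaf", "Birthday Cake Mini Doughnut",
            "Blueberry Oat Bar", "Chicken & Hummus"),
  fiber = c(5, 7, 0, 4, 0, 5, 6),
  fat   = c(8, 9, 20, 19, 6, 14, 8),
  carb  = c(67, 64, 59, 75, 17, 47, 29)
)

# Hypothetical helper: the defaults are the thresholds chosen in the text
find_fiber_rich <- function(df, min_fiber = 4, max_fat = 9, max_carb = 31) {
  df[df$fiber > min_fiber & df$fat < max_fat & df$carb < max_carb, ]
}

strict  <- find_fiber_rich(sample_items)                 # thresholds from the text
relaxed <- find_fiber_rich(sample_items, max_carb = 70)  # looser carb limit

strict
relaxed
```

With the thresholds from the text, only Chicken & Hummus qualifies in this sample; relaxing the carbohydrate limit to 70 grams also lets the 8-Grain Roll through.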
To find out if there is a correlation between the caloric content of Starbucks’ food items and their nutritional values, we can create a correlation matrix and then visualize its values using a heatmap.
cor_matrix <- cor(sbux[, c("calories", "fat", "carb", "fiber", "protein")])
cor_matrix
## calories fat carb fiber protein
## calories 1.0000000 0.75868250 0.67499902 0.26064508 0.41039771
## fat 0.7586825 1.00000000 0.14454651 -0.02854851 0.22347000
## carb 0.6749990 0.14454651 1.00000000 0.21304449 -0.05078924
## fiber 0.2606451 -0.02854851 0.21304449 1.00000000 0.48856400
## protein 0.4103977 0.22347000 -0.05078924 0.48856400 1.00000000
# Rowv = NA and Colv = NA remove the dendrograms
heatmap(cor_matrix, Rowv = NA, Colv = NA)
Calories are strongly correlated with fat (r ≈ 0.76) and carbohydrates (r ≈ 0.67), but only moderately with protein (r ≈ 0.41) and weakly with fiber (r ≈ 0.26).
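For readers unfamiliar with what cor() computes, its default is the Pearson coefficient: the covariance of two variables divided by the product of their standard deviations. The short check below uses only the six rows shown in the head(starbucks) output, so the resulting value is not the 0.76 reported for the full dataset; it merely illustrates the formula.

```r
# Calories and fat for the six items shown in the head(starbucks) output
calories <- c(350, 350, 420, 490, 130, 370)
fat      <- c(8, 9, 20, 19, 6, 14)

# Pearson correlation computed from its definition
r_manual <- cov(calories, fat) / (sd(calories) * sd(fat))

# The built-in function (method = "pearson" is the default)
r_builtin <- cor(calories, fat)

all.equal(r_manual, r_builtin)
```

Both computations agree, and the coefficient always lies between -1 and 1.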
So far in our analysis we have spotted a couple of trends, such as:
More than half of the food items at Starbucks belong to the bakery type.
The middle 50% of the food items have a caloric content between 300 and 420 kcal.
The higher the number of calories in an item, the more likely its fat and carbohydrate content is high as well.
Bakery items have the most carbs, sandwiches the most protein, and bistro box items the most fat & fiber.
Something else we can look into is the distribution of macronutrients among all the food items.
boxplot(sbux$fat, sbux$carb, sbux$fiber, sbux$protein,
        names = c("Fat", "Carbohydrates", "Fiber", "Protein"),
        main = "Nutritional Properties Box Plot")
From this box plot we can confirm that Starbucks food items contain more grams of carbohydrates than of any other nutrient. The chart also shows two outliers in protein, which we can identify as follows:
p_out <- boxplot.stats(sbux$protein)$out
p_out_indices <- which(sbux$protein %in% p_out)
p_out_items <- sbux[p_out_indices, "item"]
p_out_items
## [1] "Tarragon Chicken Salad Sandwich" "Turkey & Swiss Sandwich"
So far, we have observed that different food types have different average compositions. It is helpful to map these compositions in a stacked bar chart, to observe the differences across the types simultaneously.
avg_composition <- aggregate(cbind(fat, carb, fiber, protein) ~ type, sbux, mean)
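# Reshape the data into long format (pivot_longer() supersedes gather())
avg_composition_long <- tidyr::pivot_longer(avg_composition, -type,
                                            names_to = "nutrient", values_to = "value")
# Create the stacked bar chart. Breaks, colors, and labels are tied to the nutrient
# names, because ggplot2 would otherwise order the nutrients alphabetically and
# attach the "Fat" label to the carb bars
nutrient_order <- c("fat", "carb", "fiber", "protein")
compositions <- ggplot(avg_composition_long, aes(x = type, y = value, fill = nutrient)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Composition by Food Type", x = "Food Type", y = "Average Composition (g)") +
  scale_fill_manual(breaks = nutrient_order,
                    values = c(fat = "steelblue", carb = "lightgreen",
                               fiber = "orange", protein = "pink"),
                    labels = c("Fat", "Carbohydrates", "Fiber", "Protein")) +
  theme_minimal()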
compositions
According to medical professionals (https://www.prospectmedical.com/resources/wellness-center/macronutrients-fats-carbs-protein#:~:text=In%20general%2C%20most%20adults%20should,30%2D40%25%20Fat.), most adults should aim for a diet made up of 45-65% carbohydrates, 10-35% protein and 20-35% fat. Keeping that in mind, we can determine which items from the Starbucks dataset make better meal options. Note that such ranges are normally expressed as shares of calories, whereas the ratios computed below are shares of gram weight, so the comparison is only approximate.
First, we need to figure out the composition of each item from the list:
sum_composition <- aggregate(cbind(fat, carb, fiber, protein) ~ item, sbux, sum)
# Calculate each macronutrient's share of the item's total grams
ratio_composition <- transform(sum_composition,
  fat_ratio = round(fat / (fat + carb + fiber + protein), 3),
  carb_ratio = round(carb / (fat + carb + fiber + protein), 3),
  fiber_ratio = round(fiber / (fat + carb + fiber + protein), 3),
  protein_ratio = round(protein / (fat + carb + fiber + protein), 3))
# Display the ratio composition
composition_ratios <- subset(ratio_composition, select = -c(fat, carb, fiber, protein))
head(composition_ratios)
## item fat_ratio carb_ratio fiber_ratio
## 1 8-Grain Roll 0.089 0.744 0.056
## 2 Apple Bran Muffin 0.105 0.744 0.081
## 3 Apple Fritter 0.238 0.702 0.000
## 4 Apple Pie 0.194 0.750 0.000
## 5 Bacon & Gouda Artisan Breakfast Sandwich 0.277 0.462 0.000
## 6 Banana Nut Loaf 0.181 0.714 0.038
## protein_ratio
## 1 0.111
## 2 0.070
## 3 0.060
## 4 0.056
## 5 0.262
## 6 0.067
Now that we have the ratios, we can pick out the items whose macronutrient shares fall within the recommended ranges:
balanced_food <- composition_ratios[
  composition_ratios$fat_ratio >= 0.2 & composition_ratios$fat_ratio <= 0.35 &
  composition_ratios$carb_ratio >= 0.45 & composition_ratios$carb_ratio <= 0.65 &
  composition_ratios$protein_ratio >= 0.1 & composition_ratios$protein_ratio <= 0.35,
]
balanced_food
## item fat_ratio carb_ratio
## 5 Bacon & Gouda Artisan Breakfast Sandwich 0.277 0.462
## 38 Ham & Cheddar Artisan Breakfast Sandwich 0.239 0.463
## 52 Protein 0.257 0.500
## 62 Roasted Tomato & Mozzarella Panini 0.225 0.550
## 65 Sausage & Cheddar Classic Breakfast Sandwich 0.318 0.466
## 76 Veggie & Monterey Jack Artisan Breakfast Sandwich 0.277 0.462
## fiber_ratio protein_ratio
## 5 0.000 0.262
## 38 0.000 0.299
## 52 0.068 0.176
## 62 0.038 0.188
## 65 0.000 0.216
## 76 0.000 0.262
Using the criteria specified at the beginning of this section, 6 items in the Starbucks food list can be considered balanced meals in terms of macronutrient composition. Five of them are sandwiches or paninis by name; the exception is “Protein”, which falls under the bistro box type.
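The ratios above are based on grams, while dietary ranges of this kind are usually stated as shares of calories. A rough way to convert is to weight each macronutrient by its energy content. The sketch below assumes the commonly used general factors of 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein, and it leaves fiber out on the assumption that it is already counted within total carbohydrates. It uses four rows from the head(starbucks) output and is not part of the original analysis.

```r
# Rows copied from the head(starbucks) output (grams, plus labelled calories)
items <- data.frame(
  item     = c("8-Grain Roll", "Apple Bran Muffin", "Apple Fritter", "Banana Nut Loaf"),
  calories = c(350, 350, 420, 490),
  fat      = c(8, 9, 20, 19),
  carb     = c(67, 64, 59, 75),
  protein  = c(10, 6, 5, 7)
)

# Assumed energy factors in kcal per gram: fat 9, carbohydrate 4, protein 4
kcal_fat     <- 9 * items$fat
kcal_carb    <- 4 * items$carb
kcal_protein <- 4 * items$protein
kcal_total   <- kcal_fat + kcal_carb + kcal_protein

# Share of estimated calories coming from each macronutrient
calorie_shares <- data.frame(
  item          = items$item,
  fat_share     = round(kcal_fat / kcal_total, 3),
  carb_share    = round(kcal_carb / kcal_total, 3),
  protein_share = round(kcal_protein / kcal_total, 3),
  est_calories  = kcal_total,        # estimate from the factors above
  label_calories = items$calories    # value reported in the dataset
)

calorie_shares
```

Because fat carries more than twice the energy per gram, its calorie share is noticeably larger than its gram share, so a calorie-based filter could select a different set of "balanced" items than the gram-based one used above.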