In this lab exercise you will drive your own data analysis and data science report. Your job is to develop a question to pursue and use data to support some preliminary conclusions. You should also find time to reflect on your results and identify possible errors or concerns you have about the data and analysis.
# Load standard libraries
library(tidyverse)
library(openintro)
library(knitr) # this will keep code on the page!
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
In this lab we will be working with data from fast food restaurants. This dataset contains nutritional information for 515 menu items from some of the most popular fast food restaurants worldwide. You can use the following code to load and inspect this data.
# Load data and inspect it
data(fastfood)
ls()
## [1] "fastfood"
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
head(fastfood)
Check if there are repeated items in data
fastfood %>%
group_by(restaurant, item) %>%
summarise(count_of_restaurant_item = n()) %>%
filter(count_of_restaurant_item > 1)
## `summarise()` has grouped output by 'restaurant'. You can override using the
## `.groups` argument.
Two items are repeated in the data frame.
Check whether the repeated rows have the same values in every column
fastfood %>%
filter(restaurant == "Taco Bell") %>%
filter(item == "Chili Cheese Burrito" | item == "Express Taco Salad w/ Chips")
They are the same so we can delete the duplicates.
Creating a new dataframe with no repeated items
distinct_fastfood <- fastfood %>%
distinct()
dim(distinct_fastfood)
## [1] 513 17
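A minimal sketch of why the row count matters here, using a hypothetical toy data frame (`toy` and its values are illustrative, not taken from the lab data): `distinct()` with no arguments drops only rows that match in every column, while keying on `restaurant` and `item` drops repeats even when other columns differ.

```r
library(dplyr)

# Hypothetical toy data: one exact duplicate and one near-duplicate
toy <- data.frame(
  restaurant = c("A", "A", "B", "B"),
  item       = c("Burrito", "Burrito", "Salad", "Salad"),
  sodium     = c(900, 900, 400, 450)
)

# distinct() with no arguments removes rows identical in every column
exact_dedup <- toy %>% distinct()

# Keying on restaurant and item keeps only the first row of each pair
key_dedup <- toy %>% distinct(restaurant, item, .keep_all = TRUE)

nrow(exact_dedup)  # 3: both "Salad" rows survive because sodium differs
nrow(key_dedup)    # 2
```

If both approaches return the same number of rows on `fastfood`, the repeated items were exact copies, which would be consistent with the manual check above.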
Looking for null values
summary(distinct_fastfood)
## restaurant item calories cal_fat
## Length:513 Length:513 Min. : 20.0 Min. : 0.0
## Class :character Class :character 1st Qu.: 330.0 1st Qu.: 120.0
## Mode :character Mode :character Median : 490.0 Median : 210.0
## Mean : 531.1 Mean : 238.9
## 3rd Qu.: 690.0 3rd Qu.: 310.0
## Max. :2430.0 Max. :1270.0
##
## total_fat sat_fat trans_fat cholesterol
## Min. : 0.0 Min. : 0.000 Min. :0.000 Min. : 0.00
## 1st Qu.: 14.0 1st Qu.: 4.000 1st Qu.:0.000 1st Qu.: 35.00
## Median : 23.0 Median : 7.000 Median :0.000 Median : 60.00
## Mean : 26.6 Mean : 8.152 Mean :0.463 Mean : 72.55
## 3rd Qu.: 35.0 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.: 95.00
## Max. :141.0 Max. :47.000 Max. :8.000 Max. :805.00
##
## sodium total_carb fiber sugar
## Min. : 15 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 800 1st Qu.: 28.00 1st Qu.: 2.000 1st Qu.: 3.000
## Median :1110 Median : 44.00 Median : 3.000 Median : 6.000
## Mean :1247 Mean : 45.65 Mean : 4.128 Mean : 7.273
## 3rd Qu.:1550 3rd Qu.: 57.00 3rd Qu.: 5.000 3rd Qu.: 9.000
## Max. :6080 Max. :156.00 Max. :17.000 Max. :87.000
## NA's :12
## protein vit_a vit_c calcium
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 15.75 1st Qu.: 4.00 1st Qu.: 4.00 1st Qu.: 8.00
## Median : 25.00 Median : 10.00 Median : 10.00 Median : 20.00
## Mean : 27.92 Mean : 18.86 Mean : 20.17 Mean : 24.85
## 3rd Qu.: 36.00 3rd Qu.: 20.00 3rd Qu.: 30.00 3rd Qu.: 30.00
## Max. :186.00 Max. :180.00 Max. :400.00 Max. :290.00
## NA's :1 NA's :212 NA's :208 NA's :208
## salad
## Length:513
## Class :character
## Mode :character
##
##
##
##
There are some rows with NAs in protein, fiber, vit_a, vit_c, and calcium.
distinct_fastfood %>% filter_all(any_vars(is.na(.)))
Here we can see that the NAs are not scattered randomly among rows: a row that has an NA for one variable usually also has an NA for another. I want to know how many observations with NAs come from each restaurant.
# Check how many items per restaurant
restaurant_items <- distinct_fastfood %>%
group_by(restaurant) %>%
summarise(num_of_items = n())
# Check NAs on any column per restaurant
restaurant_NAs <- distinct_fastfood %>%
filter_all(any_vars(is.na(.))) %>%
group_by(restaurant) %>%
summarise(rows_with_NA = n())
restaurant_items_NA <- restaurant_items %>%
left_join(restaurant_NAs, by = "restaurant")
restaurant_items_NA
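To see which columns drive the missing values, one option is to count NAs per column within each restaurant. The sketch below assumes a hypothetical toy data frame (`toy` and its values are made up); the same pipeline could be tried on `distinct_fastfood`.

```r
library(dplyr)

# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  restaurant = c("A", "A", "B", "B"),
  fiber      = c(3, NA, 2, 5),
  vit_a      = c(NA, NA, 10, 4)
)

# Count NAs in every non-grouping column, one row per restaurant
na_by_column <- toy %>%
  group_by(restaurant) %>%
  summarise(across(everything(), ~ sum(is.na(.x))), .groups = "drop")

na_by_column
```

Unlike a row-level count, this table separates a restaurant that is missing one nutrient everywhere from one that is missing several nutrients on a few items.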
According to the FDA (Food and Drug Administration), Americans eat too much sodium. The Centers for Disease Control and Prevention says that Americans consume more than 3,400 mg of sodium per day on average. Diets higher in sodium are associated with an increased risk of developing high blood pressure, which can lead to heart disease and stroke.
Next, practice using your data science skills to answer your question. Follow the steps below in your data science process. In your analysis, use at least one data visualization to help you communicate your findings.
From the previous EDA we saw that there are no NAs in the sodium column.
Check the distribution of sodium in items
summary(distinct_fastfood$sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15 800 1110 1247 1550 6080
ggplot(distinct_fastfood, aes(x = sodium)) +
geom_histogram(binwidth = 100) +
ggtitle("Number of items by sodium content") +
xlab("Sodium (mg)") + ylab("Items")
This looks roughly bell-shaped, but it is skewed to the right by a few
extreme values. The most extreme points are noisy because we do not know
whether they are errors or genuine outliers. Let’s look for items with
0 mg of sodium, if there are any, and items with more than 4000 mg.
distinct_fastfood %>%
filter(sodium == 0 | sodium > 4000)
The data makes sense and looks accurate.
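A quick numeric check can complement the visual reading of the histogram. The sketch below reuses the mean and median from the `summary()` output above and defines a small helper, `skewness()`. That helper name is introduced here for illustration and is not part of base R, and the toy vector is made up.

```r
# Values taken from the summary() output above
sodium_mean   <- 1247
sodium_median <- 1110

# A mean well above the median is a common sign of right skew
sodium_mean - sodium_median  # 137

# Moment-based sample skewness; positive values indicate a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical toy vector with one large value on the right
toy_sodium <- c(500, 800, 1000, 1100, 1300, 6000)
skewness(toy_sodium) > 0  # TRUE
```

Calling `skewness(distinct_fastfood$sodium)` would give the corresponding figure for the lab data.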
ggplot(distinct_fastfood, aes(x = sodium, fill = restaurant)) +
geom_density(alpha = .5)
Box plot for all the restaurants.
ggplot(distinct_fastfood, aes(x = restaurant, y = sodium)) +
geom_boxplot()
All restaurants appear to have roughly bell-shaped distributions except
Arbys, whose distribution looks bimodal, which could mean it has two
distinct groups of menu items.
distinct_fastfood %>%
group_by(restaurant) %>%
summarise(mean_sodium_mg = mean(sodium), median_sodium_mg = median(sodium), standard_dev_sodium = sd(sodium), num_of_items = n()) %>%
arrange(median_sodium_mg)
I expected McDonalds to be the worst option because of its high-sodium food, but the median shows that it also offers options with less sodium.
The best restaurant for ingesting less sodium is Taco Bell. It has 113 items, and its median and mean sodium are the lowest among the fast food restaurants. Chick Fil-A comes second, with a median only 40 mg higher. Since it has just 27 items, the variability of its estimates is higher, so we should not rule out eating at Chick Fil-A.
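The point about item counts can be made concrete with the standard error of the mean, `sd / sqrt(n)`, which grows as the number of items shrinks. This is a sketch on hypothetical toy data (the restaurant labels `Small` and `Large` and all values are made up), not a result from the lab data.

```r
library(dplyr)

# Hypothetical toy data: same spread of values, different numbers of items
toy <- data.frame(
  restaurant = c(rep("Small", 4), rep("Large", 16)),
  sodium     = c(c(600, 900, 1200, 1500), rep(c(600, 900, 1200, 1500), 4))
)

# Standard error of the mean per restaurant
se_by_restaurant <- toy %>%
  group_by(restaurant) %>%
  summarise(
    num_of_items = n(),
    mean_sodium  = mean(sodium),
    se_sodium    = sd(sodium) / sqrt(n()),
    .groups = "drop"
  )

se_by_restaurant
```

Both toy restaurants have the same mean, but the one with 4 items has a standard error more than twice as large as the one with 16, so its mean is pinned down less precisely.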
####(b) Challenge Your Solution
From ChatGPT on sodium-to-calorie ratio:
High ratio: If a food or meal has a high sodium-to-calorie ratio, it means it contains a significant amount of sodium relative to the number of calories it provides. This can be a concern because it indicates that the food may be high in salt and may contribute to excessive sodium intake if consumed regularly.
Low ratio: A low sodium-to-calorie ratio suggests that the food or meal contains relatively less sodium compared to the number of calories it provides. Foods with a lower ratio are generally healthier choices, especially when combined with other nutrient-dense foods.
While the sodium-to-calorie ratio can provide some helpful information about the salt content of a food, it is crucial to consider other aspects of nutrition, such as the overall nutrient profile, including vitamins, minerals, and macronutrients (carbohydrates, proteins, and fats). Additionally, focusing on a balanced and varied diet that includes whole, unprocessed foods is essential for promoting overall health and well-being.
Creating sodium-to-calorie ratio statistic
# Create new dataframe
distinct_fastfood_sodium_cal_ratio <- distinct_fastfood
distinct_fastfood_sodium_cal_ratio$sodium_to_calorie <- distinct_fastfood_sodium_cal_ratio$sodium / distinct_fastfood_sodium_cal_ratio$calories
head(distinct_fastfood_sodium_cal_ratio)
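As a sanity check on the new column, the ratio can be worked by hand for the first three McDonalds items, using the `calories` and `sodium` values shown in the `glimpse()` output above.

```r
# First three items from the glimpse() output above
calories <- c(380, 840, 1130)
sodium   <- c(1110, 1580, 1920)

# Sodium-to-calorie ratio in mg of sodium per calorie
sodium_to_calorie <- sodium / calories
round(sodium_to_calorie, 2)  # 2.92 1.88 1.70
```

The first item has the least sodium of the three but the highest ratio, which is why ranking by the ratio can differ from ranking by sodium alone.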
ggplot(distinct_fastfood_sodium_cal_ratio, aes(x = sodium_to_calorie, fill = restaurant)) +
geom_density(alpha = 0.5) +
xlim(0, 5)
## Warning: Removed 5 rows containing non-finite values (`stat_density()`).
These are the statistics for the ratio in mg/cal
distinct_fastfood_sodium_cal_ratio %>%
group_by(restaurant) %>%
summarise(mean_scr = mean(sodium_to_calorie), median_scr = median(sodium_to_calorie), standard_dev_scr = sd(sodium_to_calorie), num_of_items = n()) %>%
arrange(median_scr)
The standard deviation for Sonic looks unusually large, so let's take a closer look.
sonic_data <- distinct_fastfood_sodium_cal_ratio %>%
filter(restaurant == "Sonic")
ggplot(sonic_data, aes(x = sodium_to_calorie)) +
geom_histogram(binwidth = 1)
distinct_fastfood_sodium_cal_ratio %>%
filter(restaurant == "Sonic") %>%
filter(sodium_to_calorie > 20)
This looks like a data-entry error, so we exclude it from the summary.
distinct_fastfood_sodium_cal_ratio %>%
filter(sodium_to_calorie < 10) %>%
group_by(restaurant) %>%
summarise(mean_scr = mean(sodium_to_calorie), median_scr = median(sodium_to_calorie), standard_dev_scr = sd(sodium_to_calorie), num_of_items = n()) %>%
arrange(median_scr)
The statistics now make sense, and Sonic is still in second position.
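A restaurant can keep its place in a median-based ranking after an extreme value is removed because the median is much less sensitive to a single outlier than the standard deviation is. A small sketch with hypothetical ratios (the values are made up, not Sonic's actual items):

```r
# Hypothetical sodium-to-calorie ratios for a handful of items
ratios       <- c(1.5, 1.8, 2.0, 2.2, 2.5)
with_outlier <- c(ratios, 30)  # one implausibly large value added

# The standard deviation jumps by more than an order of magnitude
sd(ratios)        # about 0.38
sd(with_outlier)  # about 11.4

# The median shifts only slightly
median(ratios)        # 2
median(with_outlier)  # 2.1
```

This is also a reason to sort the summary table by the median rather than the mean when a data-entry error is suspected.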
In my first analysis I concluded that eating at Taco Bell is the healthiest choice for people choosing among the 8 restaurants. In the second analysis, after reviewing the sodium-to-calorie ratio to better understand how much salt an item has relative to its calories, I realized that the healthiest restaurant would be Burger King. Even so, it is not really healthy: taking 1 mg per calorie as the reference ratio, the median for this restaurant is 2.
####(c) What Next?
We understand that sodium in large quantities is bad, and from the first analysis we saw that the best restaurant for people taking care of their heart health is Taco Bell. This holds if we care only about individual items and how much sodium they contain. But the main reason we go to a restaurant is that we are hungry. We can eat at a fast food place and ask for a low-sodium item, but that might not be enough. So what makes you less hungry?
Fiber is a type of carbohydrate that the body can’t digest, so it cannot be broken down into sugar molecules. This helps keep hunger and blood sugar in check. According to the Harvard School of Public Health, children and adults need 25 to 35 grams of fiber per day for good health. Among its benefits, fiber appears to lower the risk of heart disease, which makes it especially important for people with heart problems.
One thing I would like to do is to determine statistically whether Burger King is the healthiest restaurant, but for this we would need to have more data and to group the items by type of food.
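One possible way to frame such a test, offered only as a sketch and not as the analysis performed in this report, is a Kruskal-Wallis rank test across restaurants followed by pairwise Wilcoxon comparisons. A rank-based test is chosen here on the assumption that the ratio distributions are skewed. The data below are simulated stand-ins (restaurants `A`, `B`, `C` and all values are hypothetical).

```r
set.seed(1)

# Simulated stand-in data: three hypothetical restaurants, 20 items each
toy <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 20),
  sodium_to_calorie = c(
    rnorm(20, mean = 1.8, sd = 0.3),
    rnorm(20, mean = 2.2, sd = 0.3),
    rnorm(20, mean = 2.6, sd = 0.3)
  )
)

# Do the ratio distributions differ between restaurants at all?
kw <- kruskal.test(sodium_to_calorie ~ restaurant, data = toy)
kw$p.value

# Which pairs differ, with Holm correction for multiple comparisons
pairwise.wilcox.test(toy$sodium_to_calorie, toy$restaurant,
                     p.adjust.method = "holm")
```

On the lab data, `toy` would be replaced by `distinct_fastfood_sodium_cal_ratio`, ideally after grouping items by food type as noted above, since comparing salads with burgers could blur any restaurant effect.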
A balanced diet is important. We can eat fast food as long as we make sure to eat other, healthier food to keep the balance. For future research, it would be good to explore a 2-for-1 meal in which fast food chains give you two meals, so that you can enjoy your unhealthy meal without feeling too bad later. This could combine a high-calorie, high-sodium food with other foods that could counter the high intake of certain substances or elements.