Objectives

In this lab exercise you will drive your own data analysis and data science report. Your job is to develop a question to pursue and use data to support some preliminary conclusions. You should also find time to reflect on your results and identify possible errors or concerns you have about the data and analysis.

# Load standard libraries
library(tidyverse)
library(openintro)
library(knitr) # this will keep code on the page!
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)

Data

In this lab we will be working with data from fast food restaurants. This dataset contains nutritional information for 515 menu items from some of the most popular fast food restaurants worldwide. You can use the following code to load and inspect this data.

# Load data and inspect it
data(fastfood)
ls()
## [1] "fastfood"
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
head(fastfood)

Check if there are repeated items in data

fastfood %>%
  group_by(restaurant, item) %>%
  summarise(count_of_restaurant_item = n()) %>%
  filter(count_of_restaurant_item > 1)
## `summarise()` has grouped output by 'restaurant'. You can override using the
## `.groups` argument.

Two items are repeated in the data frame.

Check whether the repeated rows have the same values in every variable

fastfood %>%
  filter(restaurant == "Taco Bell") %>%
  filter(item == "Chili Cheese Burrito" | item == "Express Taco Salad w/ Chips")

The repeated rows are identical, so we can delete the duplicates.

Creating a new dataframe with no repeated items

distinct_fastfood <- fastfood %>%
  distinct()
dim(distinct_fastfood)
## [1] 513  17
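As a side note, `distinct()` with no arguments only drops rows that match an earlier row in every column, which is why comparing the repeated rows first matters. A minimal base R sketch of that difference, using a made-up `menu` data frame (hypothetical values, not taken from `fastfood`):

```r
# Toy data: hypothetical rows for illustration only
menu <- data.frame(
  restaurant = c("A", "A", "B", "B"),
  item       = c("Burrito", "Burrito", "Salad", "Salad"),
  sodium     = c(900, 900, 400, 450),
  stringsAsFactors = FALSE
)

# TRUE for rows identical in every column to an earlier row
exact_dupes <- duplicated(menu)

# TRUE for rows that repeat an earlier restaurant/item pair,
# even when other columns disagree
key_dupes <- duplicated(menu[, c("restaurant", "item")])

sum(exact_dupes)  # exact copies only
sum(key_dupes)    # repeated keys, including conflicting rows

# Dropping exact copies keeps the conflicting "Salad" rows
deduped <- menu[!exact_dupes, ]
nrow(deduped)
```

If the two counts differ, some repeated items carry conflicting values and need a decision about which row to keep.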

Looking for null values

summary(distinct_fastfood)
##   restaurant            item              calories         cal_fat      
##  Length:513         Length:513         Min.   :  20.0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.: 330.0   1st Qu.: 120.0  
##  Mode  :character   Mode  :character   Median : 490.0   Median : 210.0  
##                                        Mean   : 531.1   Mean   : 238.9  
##                                        3rd Qu.: 690.0   3rd Qu.: 310.0  
##                                        Max.   :2430.0   Max.   :1270.0  
##                                                                         
##    total_fat        sat_fat         trans_fat      cholesterol    
##  Min.   :  0.0   Min.   : 0.000   Min.   :0.000   Min.   :  0.00  
##  1st Qu.: 14.0   1st Qu.: 4.000   1st Qu.:0.000   1st Qu.: 35.00  
##  Median : 23.0   Median : 7.000   Median :0.000   Median : 60.00  
##  Mean   : 26.6   Mean   : 8.152   Mean   :0.463   Mean   : 72.55  
##  3rd Qu.: 35.0   3rd Qu.:11.000   3rd Qu.:1.000   3rd Qu.: 95.00  
##  Max.   :141.0   Max.   :47.000   Max.   :8.000   Max.   :805.00  
##                                                                   
##      sodium       total_carb         fiber            sugar       
##  Min.   :  15   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 800   1st Qu.: 28.00   1st Qu.: 2.000   1st Qu.: 3.000  
##  Median :1110   Median : 44.00   Median : 3.000   Median : 6.000  
##  Mean   :1247   Mean   : 45.65   Mean   : 4.128   Mean   : 7.273  
##  3rd Qu.:1550   3rd Qu.: 57.00   3rd Qu.: 5.000   3rd Qu.: 9.000  
##  Max.   :6080   Max.   :156.00   Max.   :17.000   Max.   :87.000  
##                                  NA's   :12                       
##     protein           vit_a            vit_c           calcium      
##  Min.   :  1.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 15.75   1st Qu.:  4.00   1st Qu.:  4.00   1st Qu.:  8.00  
##  Median : 25.00   Median : 10.00   Median : 10.00   Median : 20.00  
##  Mean   : 27.92   Mean   : 18.86   Mean   : 20.17   Mean   : 24.85  
##  3rd Qu.: 36.00   3rd Qu.: 20.00   3rd Qu.: 30.00   3rd Qu.: 30.00  
##  Max.   :186.00   Max.   :180.00   Max.   :400.00   Max.   :290.00  
##  NA's   :1        NA's   :212      NA's   :208      NA's   :208     
##     salad          
##  Length:513        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

There are some rows with NAs in protein, fiber, vit_a, vit_c, and calcium.
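The same per-column NA counts can be read off directly instead of scanning the `summary()` output. A small base R sketch on a made-up `toy` data frame (hypothetical values):

```r
# Toy data: hypothetical values for illustration only
toy <- data.frame(
  fiber   = c(3, NA, 2, NA),
  protein = c(37, 46, NA, 25),
  sodium  = c(1110, 1580, 1920, 1940)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with at least one missing value
rows_with_any_na <- sum(!complete.cases(toy))
rows_with_any_na
```

Applied to `distinct_fastfood`, `colSums(is.na(distinct_fastfood))` should give one count per variable.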

distinct_fastfood %>% filter_all(any_vars(is.na(.)))

Here we can see that the NAs are not scattered randomly across rows. Rows that have an NA in one variable tend to have NAs in other variables as well. I want to know how many observations with NAs come from each restaurant.

# Check how many items per restaurant
restaurant_items <- distinct_fastfood %>%
  group_by(restaurant) %>%
  summarise(num_of_items = n())

# Check NAs on any column per restaurant
restaurant_NAs <- distinct_fastfood %>%
  filter_all(any_vars(is.na(.))) %>%
  group_by(restaurant) %>%
  summarise(rows_with_NA = n())

restaurant_items_NA <- restaurant_items %>%
  left_join(restaurant_NAs, by = "restaurant")

restaurant_items_NA
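One detail of a left join like this: a restaurant with no incomplete rows has no match in the NA table, so its count comes back as NA rather than 0. A hedged base R sketch of handling that and turning counts into shares, where `merge(..., all.x = TRUE)` stands in for `left_join()` and all values are made up:

```r
# Toy tables: hypothetical counts for illustration only
items <- data.frame(
  restaurant   = c("A", "B", "C"),
  num_of_items = c(10, 20, 5)
)
nas <- data.frame(
  restaurant   = c("A", "C"),
  rows_with_NA = c(4, 5)
)

# Left join: keep every restaurant from `items`
joined <- merge(items, nas, by = "restaurant", all.x = TRUE)

# Restaurants missing from `nas` had no incomplete rows
joined$rows_with_NA[is.na(joined$rows_with_NA)] <- 0

# Share of each restaurant's items with at least one NA
joined$share_with_NA <- joined$rows_with_NA / joined$num_of_items
joined
```

Comparing shares instead of raw counts keeps restaurants with many menu items from looking worse just because they have more rows.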

Problem 1: Formulate a Question

Background and problem statement

According to the FDA (Food and Drug Administration), Americans eat too much sodium. The Centers for Disease Control and Prevention says that Americans consume more than 3,400 mg of sodium per day on average. Diets higher in sodium are associated with an increased risk of developing high blood pressure, which can lead to heart disease and stroke.

Data Science Question

Which fast food restaurants offer food with less sodium, so that people predisposed to heart disease can eat there?

Problem 2: Data Analysis

Next, practice using your data science skills to answer your question. Follow the steps below in your data science process. In your analysis, use at least one data visualization to help you communicate your findings.

(a) Try the Easy Solution First

From the previous EDA we saw that there are no NAs in the sodium column.

Check the distribution of sodium in items

summary(distinct_fastfood$sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15     800    1110    1247    1550    6080
ggplot(distinct_fastfood, aes(x = sodium)) +
  geom_histogram(binwidth = 100) +
  ggtitle("Number of items by sodium range") +
  xlab("Sodium") + ylab("Items")

The distribution looks roughly bell-shaped, with a long tail of outliers to the right. Two data points are suspicious because we do not know whether they are errors or genuine outliers. Let’s look at any items with 0 mg of sodium (the summary above shows a minimum of 15 mg, so none are expected) and the ones with more than 4000 mg.

distinct_fastfood %>%
  filter(sodium == 0 | sodium > 4000)

The data makes sense and looks accurate.
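The shape and outlier judgments above were made by eye. They can also be checked numerically: a mean well above the median suggests right skew, and the common 1.5 × IQR rule flags unusually large values. A minimal sketch on a made-up `sodium` vector (hypothetical values, not the real column); the 1.5 × IQR cutoff is a swapped-in convention, not the 4000 mg threshold used in the report:

```r
# Toy vector: hypothetical sodium values in mg
sodium <- c(15, 800, 950, 1110, 1250, 1550, 1900, 6080)

# Mean above median is a quick sign of right skew
mean(sodium)
median(sodium)

# 1.5 * IQR rule for flagging extreme values
q <- quantile(sodium, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- sodium[sodium < lower_fence | sodium > upper_fence]
outliers
```

Flagged values are candidates for inspection, not automatic deletions; an item can be extreme and still be correct.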

ggplot(distinct_fastfood, aes(x = sodium, fill = restaurant)) +
  geom_density(alpha = .5)

Box plot for all the restaurants.

ggplot(distinct_fastfood, aes(x = restaurant, y = sodium)) +
  geom_boxplot()

They all appear to have roughly normal distributions except Arbys, which looks bimodal and may contain two distinct groups of food.

distinct_fastfood %>%
  group_by(restaurant) %>%
  summarise(
    mean_sodium_mg = mean(sodium),
    median_sodium_mg = median(sodium),
    standard_dev_sodium = sd(sodium),
    num_of_items = n()
  ) %>%
  arrange(median_sodium_mg)

I thought that McDonalds was going to be the worst option because of its high-sodium food, but looking at the median we can see that it also offers options with less sodium.

Conclusion for easy solution

The best restaurant to eat at in order to ingest less sodium is Taco Bell. It has 113 items, and its median and mean sodium are the lowest among the fast food restaurants. Chick Fil-A comes second, with a median only 40 mg of sodium higher. Since we have just 27 items for Chick Fil-A, its estimate is less certain, so we should not rule out eating there.
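The point about 27 versus 113 items can be made concrete with the standard error of the mean, `sd / sqrt(n)`. A rough sketch, assuming for illustration that both restaurants share the same standard deviation (the value 500 is hypothetical; the item counts come from the text above):

```r
# Assumed common standard deviation in mg (hypothetical)
assumed_sd <- 500

n_taco_bell  <- 113
n_chick_fila <- 27

# Standard error of the mean shrinks with the square root of n
se_taco_bell  <- assumed_sd / sqrt(n_taco_bell)
se_chick_fila <- assumed_sd / sqrt(n_chick_fila)

# How much wider the uncertainty is for the smaller menu
se_ratio <- se_chick_fila / se_taco_bell
se_ratio
```

Under that assumption, the estimate for the smaller menu is roughly twice as uncertain, which is why a 40 mg gap between medians is not decisive on its own. This applies to the mean; the median's uncertainty behaves similarly but is not given by this formula.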

(b) Challenge Your Solution

From ChatGPT on sodium-to-calorie ratio:

High ratio: If a food or meal has a high sodium-to-calorie ratio, it means it contains a significant amount of sodium relative to the number of calories it provides. This can be a concern because it indicates that the food may be high in salt and may contribute to excessive sodium intake if consumed regularly.

Low ratio: A low sodium-to-calorie ratio suggests that the food or meal contains relatively less sodium compared to the number of calories it provides. Foods with a lower ratio are generally healthier choices, especially when combined with other nutrient-dense foods.

While the sodium-to-calorie ratio can provide some helpful information about the salt content of a food, it is crucial to consider other aspects of nutrition, such as the overall nutrient profile, including vitamins, minerals, and macronutrients (carbohydrates, proteins, and fats). Additionally, focusing on a balanced and varied diet that includes whole, unprocessed foods is essential for promoting overall health and well-being.

Creating sodium-to-calorie ratio statistic

# Create new dataframe
distinct_fastfood_sodium_cal_ratio <- distinct_fastfood

distinct_fastfood_sodium_cal_ratio$sodium_to_calorie <- distinct_fastfood_sodium_cal_ratio$sodium / distinct_fastfood_sodium_cal_ratio$calories

head(distinct_fastfood_sodium_cal_ratio)
ggplot(distinct_fastfood_sodium_cal_ratio, aes(x = sodium_to_calorie, fill = restaurant)) +
  geom_density(alpha = 0.5) +
  xlim(0, 5)
## Warning: Removed 5 rows containing non-finite values (`stat_density()`).
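To make the ratio concrete, here is the same division worked by hand for the first three rows shown in the `glimpse()` output earlier (calories 380, 840, 1130 and sodium 1110, 1580, 1920). The zero-calorie guard is an added precaution, not part of the original code:

```r
# First three rows of calories and sodium from the glimpse() output
calories <- c(380, 840, 1130)
sodium   <- c(1110, 1580, 1920)

# mg of sodium per calorie; NA if calories were ever 0,
# to avoid Inf from division by zero
sodium_to_calorie <- ifelse(calories > 0, sodium / calories, NA_real_)

round(sodium_to_calorie, 2)
```

The first item has the lowest calorie count of the three but the highest ratio, which is exactly the kind of reordering this statistic is meant to expose.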

These are the statistics for the ratio in mg/cal

distinct_fastfood_sodium_cal_ratio %>%
  group_by(restaurant) %>%
  summarise(
    mean_scr = mean(sodium_to_calorie),
    median_scr = median(sodium_to_calorie),
    standard_dev_scr = sd(sodium_to_calorie),
    num_of_items = n()
  ) %>%
  arrange(median_scr)

The standard deviation for Sonic looks odd, so let’s take a closer look.

sonic_data <- distinct_fastfood_sodium_cal_ratio %>%
  filter(restaurant == "Sonic")

ggplot(sonic_data, aes(x = sodium_to_calorie)) +
  geom_histogram(binwidth = 1)

distinct_fastfood_sodium_cal_ratio %>%
  filter(restaurant == "Sonic") %>%
  filter(sodium_to_calorie > 20)

This looks like a typo in the data, so we exclude it.

distinct_fastfood_sodium_cal_ratio %>%
  filter(sodium_to_calorie < 10) %>%
  group_by(restaurant) %>%
  summarise(
    mean_scr = mean(sodium_to_calorie),
    median_scr = median(sodium_to_calorie),
    standard_dev_scr = sd(sodium_to_calorie),
    num_of_items = n()
  ) %>%
  arrange(median_scr)

The statistics now make sense, and Sonic is still in second position.
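This also explains why Sonic's rank did not move: a single extreme value inflates the standard deviation a lot but barely shifts the median. A small sketch on a made-up `ratios` vector (hypothetical values, with one extreme entry standing in for the suspected typo):

```r
# Toy vector: hypothetical ratios in mg/cal, last value is extreme
ratios <- c(1.5, 2.0, 2.2, 2.5, 3.0, 25)

# With the extreme value included
sd_all     <- sd(ratios)
median_all <- median(ratios)

# Same cutoff as the filter above: keep ratios below 10
kept <- ratios[ratios < 10]
sd_kept     <- sd(kept)
median_kept <- median(kept)

c(sd_all = sd_all, sd_kept = sd_kept)
c(median_all = median_all, median_kept = median_kept)
```

Because the ranking above sorts by the median, removing the suspect row mostly changes the spread column, not the order.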

Conclusion:

From my first analysis I stated that eating at Taco Bell is the healthiest choice for people choosing among the 8 restaurants. But in the second analysis, after reviewing the “sodium_to_calorie” ratio to better understand how much salt an item has relative to its calories, I found that Burger King would be the healthiest restaurant. It is still not really healthy, though: measured against a ratio of 1 mg per calorie, the median for this restaurant is 2.

(c) What Next?

We understand that sodium in large quantities is bad, and from the first analysis we saw that, by sodium per item, the best restaurant for people taking care of their heart health is Taco Bell. This is useful if we care only about items and how much sodium they contain. But the main reason we go to a restaurant is that we are hungry. We can eat at a fast food place and ask for a low-sodium item, but that might not be enough food. So what makes you less hungry?

Fiber is a type of carbohydrate that the body can’t digest, so it cannot be broken down into sugar molecules. This helps to keep hunger and blood sugar in check. Children and adults need 25 to 35 grams of fiber per day for good health, according to the Harvard School of Public Health. Among its benefits, fiber appears to lower the risk of heart disease, which makes it especially important for people with heart problems.

  • One thing I would like to do is to determine statistically whether Burger King is the healthiest restaurant, but for this we would need to have more data and to group items by type of food.

  • A balanced diet is important. We can eat fast food as long as we make sure to eat other, healthier food to keep the balance. For future research, it would be good to explore a 2-for-1 meal, in which fast food chains serve two meals together so that you can enjoy the less healthy one without feeling too bad later. This could combine a high-calorie, high-sodium food with other foods that counter the high intake of certain substances.
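For the first idea, one possible starting point (not something this report ran) is a Kruskal-Wallis rank-sum test, which asks whether the ratio differs across restaurants without assuming normality. A minimal sketch on simulated data; the `toy` data frame, restaurant labels, and group means are all hypothetical:

```r
# Simulated data: hypothetical ratios for three made-up restaurants
set.seed(1)
toy <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 10),
  ratio = c(
    rnorm(10, mean = 2.0, sd = 0.3),
    rnorm(10, mean = 2.1, sd = 0.3),
    rnorm(10, mean = 3.5, sd = 0.3)
  )
)

# Rank-based test of whether the groups share one distribution
result <- kruskal.test(ratio ~ restaurant, data = toy)

result$statistic  # Kruskal-Wallis chi-squared
result$parameter  # degrees of freedom (groups - 1)
result$p.value
```

A small p-value would only say that at least one restaurant differs; identifying which one, and whether the difference holds within food types, would take follow-up comparisons.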