Introduction

Image from Getty Images:

https://www.gettyimages.com/detail/photo/mcdonalds-royalty-free-image/517499813?adppopup=true

Determining the nutritional content of food is important because the obesity rate among American adults is 42.5%. Obesity increases the risk of heart failure, cancer, and many other medical conditions. Lowering the obesity rate would improve the health of the country. One way to work toward that is to understand the strongest predictors of calorie content, so that people can eat a healthy, balanced diet that stays within a calorie limit and provides sufficient macronutrients to lose or maintain weight.

The data come from fastfoodnutrition.org and cover only entrees from 2018. The dataset was later packaged in a GitHub repository for the OpenIntro project and is available as fastfood in the openintro R package.

The variables I am using are:

calories - total calories in an item
total_fat - total fat in an item
sat_fat - total saturated fat in an item
trans_fat - total trans fat in an item
cholesterol - total cholesterol in an item
sodium - total sodium in an item
total_carb - total carbs in an item
fiber - total fiber in an item
protein - total protein in an item

The question I would like to answer is: which of the variables selected above is the most significant predictor of the calorie content of a meal? I would also like to visualize the relationship between calories and the significant variables.

I chose this topic because I am interested in nutrition, and I would like to better understand the best predictors of calorie content in food for my own health. I also hope it helps other people better understand calorie content and what influences it.

# Load necessary libraries

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
summary(fastfood)
##   restaurant            item              calories         cal_fat      
##  Length:515         Length:515         Min.   :  20.0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.: 330.0   1st Qu.: 120.0  
##  Mode  :character   Mode  :character   Median : 490.0   Median : 210.0  
##                                        Mean   : 530.9   Mean   : 238.8  
##                                        3rd Qu.: 690.0   3rd Qu.: 310.0  
##                                        Max.   :2430.0   Max.   :1270.0  
##                                                                         
##    total_fat         sat_fat         trans_fat      cholesterol    
##  Min.   :  0.00   Min.   : 0.000   Min.   :0.000   Min.   :  0.00  
##  1st Qu.: 14.00   1st Qu.: 4.000   1st Qu.:0.000   1st Qu.: 35.00  
##  Median : 23.00   Median : 7.000   Median :0.000   Median : 60.00  
##  Mean   : 26.59   Mean   : 8.153   Mean   :0.465   Mean   : 72.46  
##  3rd Qu.: 35.00   3rd Qu.:11.000   3rd Qu.:1.000   3rd Qu.: 95.00  
##  Max.   :141.00   Max.   :47.000   Max.   :8.000   Max.   :805.00  
##                                                                    
##      sodium       total_carb         fiber            sugar       
##  Min.   :  15   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 800   1st Qu.: 28.50   1st Qu.: 2.000   1st Qu.: 3.000  
##  Median :1110   Median : 44.00   Median : 3.000   Median : 6.000  
##  Mean   :1247   Mean   : 45.66   Mean   : 4.137   Mean   : 7.262  
##  3rd Qu.:1550   3rd Qu.: 57.00   3rd Qu.: 5.000   3rd Qu.: 9.000  
##  Max.   :6080   Max.   :156.00   Max.   :17.000   Max.   :87.000  
##                                  NA's   :12                       
##     protein           vit_a            vit_c           calcium      
##  Min.   :  1.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 16.00   1st Qu.:  4.00   1st Qu.:  4.00   1st Qu.:  8.00  
##  Median : 24.50   Median : 10.00   Median : 10.00   Median : 20.00  
##  Mean   : 27.89   Mean   : 18.86   Mean   : 20.17   Mean   : 24.85  
##  3rd Qu.: 36.00   3rd Qu.: 20.00   3rd Qu.: 30.00   3rd Qu.: 30.00  
##  Max.   :186.00   Max.   :180.00   Max.   :400.00   Max.   :290.00  
##  NA's   :1        NA's   :214      NA's   :210      NA's   :210     
##     salad          
##  Length:515        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
sapply(fastfood, function(x) sum(is.na(x)))  # Check for amount of NA values
##  restaurant        item    calories     cal_fat   total_fat     sat_fat 
##           0           0           0           0           0           0 
##   trans_fat cholesterol      sodium  total_carb       fiber       sugar 
##           0           0           0           0          12           0 
##     protein       vit_a       vit_c     calcium       salad 
##           1         214         210         210           0
# Calculate the proportion of missing values for each column
na_proportion <- colSums(is.na(fastfood)) / nrow(fastfood)

# Display the proportion of NA values for each column
na_proportion
##  restaurant        item    calories     cal_fat   total_fat     sat_fat 
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 
##   trans_fat cholesterol      sodium  total_carb       fiber       sugar 
## 0.000000000 0.000000000 0.000000000 0.000000000 0.023300971 0.000000000 
##     protein       vit_a       vit_c     calcium       salad 
## 0.001941748 0.415533981 0.407766990 0.407766990 0.000000000
# Select the variables that will be used in the linear regression model
model_data <- fastfood |>
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, protein)

# Run the linear regression model
cal_model <- lm(calories ~ total_fat + sat_fat + trans_fat + cholesterol + sodium + total_carb + fiber + protein, data = model_data)

# Display the summary of the linear regression model
summary(cal_model)
## 
## Call:
## lm(formula = calories ~ total_fat + sat_fat + trans_fat + cholesterol + 
##     sodium + total_carb + fiber + protein, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -877.49   -4.73    0.25    7.47  182.02 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.055672   4.860595   0.217   0.8281    
## total_fat    8.273373   0.268327  30.833   <2e-16 ***
## sat_fat      1.573103   0.793779   1.982   0.0481 *  
## trans_fat   -0.291658   4.536887  -0.064   0.9488    
## cholesterol -0.130581   0.108309  -1.206   0.2285    
## sodium       0.008746   0.005969   1.465   0.1435    
## total_carb   3.900347   0.152043  25.653   <2e-16 ***
## fiber       -0.550708   0.935828  -0.588   0.5565    
## protein      4.285273   0.351456  12.193   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.34 on 494 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9742, Adjusted R-squared:  0.9738 
## F-statistic:  2335 on 8 and 494 DF,  p-value: < 2.2e-16

Model equation: calories = 1.06 + 8.27(total_fat) + 1.57(sat_fat) − 0.29(trans_fat) − 0.13(cholesterol) + 0.009(sodium) + 3.90(total_carb) − 0.55(fiber) + 4.29(protein)
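
As a worked example of the model equation, the sketch below plugs a made-up entree into the fitted coefficients. The coefficients are copied from the summary(cal_model) output above. The nutrient values of the item and the names coefs, item, and predicted are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary(cal_model) output
coefs <- c(
  intercept   =  1.055672,
  total_fat   =  8.273373,
  sat_fat     =  1.573103,
  trans_fat   = -0.291658,
  cholesterol = -0.130581,
  sodium      =  0.008746,
  total_carb  =  3.900347,
  fiber       = -0.550708,
  protein     =  4.285273
)

# Hypothetical entree (assumed values, not a row from the dataset)
item <- c(
  intercept   = 1,     # multiplies the intercept
  total_fat   = 20,
  sat_fat     = 7,
  trans_fat   = 0,
  cholesterol = 60,
  sodium      = 1000,
  total_carb  = 40,
  fiber       = 3,
  protein     = 25
)

# Predicted calories = sum of coefficient * value, matched by name
predicted <- sum(coefs * item[names(coefs)])
print(round(predicted, 1))
```

With the fitted model object in the session, predict(cal_model, newdata = ...) should give the same kind of prediction from the unrounded coefficients.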

P values:

Total fat, total carbs, and protein each have a highly significant p-value (less than 2e-16). Saturated fat has a p-value of 0.0481, which is just below the 0.05 significance level. Trans fat, cholesterol, sodium, and fiber all have p-values above 0.05.

Adjusted R squared:

The adjusted R-squared is 0.9738. This means that about 97% of the variation in calories can be explained by this model, after adjusting for the number of predictors. This shows the model is a very strong fit.
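
The adjusted R-squared can be reproduced by hand from the multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the values printed in the model summary. The variable names are illustrative, and the result is only as precise as the rounded inputs.

```r
# Values read from the summary(cal_model) output
r_squared <- 0.9742   # multiple R-squared
resid_df  <- 494      # residual degrees of freedom
p         <- 8        # number of predictors, excluding the intercept
n         <- resid_df + p + 1   # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

Because the sample is large relative to the number of predictors, the adjustment barely changes the value.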

Diagnostic plots:

par(mfrow = c(2, 2))
plot(cal_model)
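
The diagnostic plots can be supplemented with a numeric check for influential points. The sketch below is one possible approach, not part of the original analysis: flag_influential is a hypothetical helper, and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a strict test. The demo uses the built-in mtcars data only so the block runs on its own.

```r
# Hypothetical helper: list observations whose Cook's distance exceeds a cutoff
flag_influential <- function(model, threshold = NULL) {
  d <- cooks.distance(model)            # one value per observation used in the fit
  if (is.null(threshold)) {
    threshold <- 4 / length(d)          # rule-of-thumb cutoff (an assumption)
  }
  flagged <- d[which(d > threshold)]    # which() drops any NA comparisons
  flagged <- sort(flagged, decreasing = TRUE)
  data.frame(row = names(flagged),
             cooks_d = unname(flagged),
             stringsAsFactors = FALSE)
}

# Stand-in demo on a built-in dataset so the example is self-contained
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
print(flag_influential(demo_model))
```

In this report the call would be flag_influential(cal_model), and the flagged rows could then be looked up in fastfood to see which items they are.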

Analysis of Linear Regression model:

This is a good model: the adjusted R-squared of 0.9738 means the model explains most of the variability in calories. Total fat, saturated fat, total carbs, and protein are significant predictors of calories. This makes sense because fat, carbohydrates, and protein are the macronutrients that supply the calories in food. Trans fat, cholesterol, sodium, and fiber were not statistically significant. The diagnostic plots look acceptable overall, although there are some high-leverage, influential points, and the minimum residual of -877.49 shows that at least one item is fit very poorly.
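
One way to check that the significant coefficients are plausible is to compare them with the commonly used conversion factors of about 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. These reference factors are not part of the dataset or the original analysis; they are brought in here only as a sanity check. The estimates are copied from the model summary, and the variable names are illustrative.

```r
# Estimated calories per gram, copied from the summary(cal_model) output
estimate <- c(total_fat = 8.273373, total_carb = 3.900347, protein = 4.285273)

# Commonly cited approximate reference factors (kcal per gram); an outside assumption
reference <- c(total_fat = 9, total_carb = 4, protein = 4)

comparison <- data.frame(
  nutrient   = names(estimate),
  estimate   = unname(estimate),
  reference  = unname(reference),
  difference = unname(estimate - reference)
)
print(comparison)
```

Each estimate is within one calorie per gram of its reference factor. Saturated fat is already counted inside total fat, so its coefficient acts as an extra adjustment on top of the total fat coefficient, which may be part of why the total fat estimate sits a little below 9.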

library(ggplot2)

point_color <- "#00BFC4"  # teal
line_color <- "#F8766D"   # red

# Scatter plot: Total Fat vs Calories
ggplot(model_data, aes(x = total_fat, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Total Fat (grams) vs Calories", 
       x = "Total Fat (g)", 
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

This graph shows that total fat (in grams) and calories have a positive relationship: as total fat increases, calories also increase.

# Scatter plot: Total Carbs vs Calories
ggplot(model_data, aes(x = total_carb, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Total Carbs (grams) vs Calories",
       x = "Total Carbohydrates (g)",
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

This graph shows that total carbs (in grams) and calories have a positive relationship: as total carbs increase, calories also increase.

# Scatter plot: Protein vs Calories
ggplot(model_data, aes(x = protein, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Protein (grams) vs Calories",
       x = "Protein (g)", 
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

This graph shows that protein (in grams) and calories have a positive relationship: as protein increases, calories also increase.

# Calculate average calories for each restaurant
avg_calories <- fastfood |>
  group_by(restaurant) |>
  summarise(mean_calories = mean(calories, na.rm = TRUE)) |>
  arrange(desc(mean_calories))

# Bar plot
ggplot(avg_calories, aes(x = reorder(restaurant, mean_calories), y = mean_calories, fill = restaurant)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Calories by Restaurant",
    x = "Restaurant",
    y = "Average Calories",
    caption = "Data Source: fastfoodnutrition.org 2018"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 12)

This graph shows that the average calorie content of entrees differs between restaurants. McDonalds has the highest average calorie content in its entrees. Although the averages differ, restaurant is not a significant predictor of calories, because customers can order very different items at the same restaurant and get very different calorie counts. This is why macronutrients are the best predictors of calorie content.

# Select the same variables as before, plus restaurant
model_data <- fastfood |>
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, protein, restaurant)

# Run the linear regression model
cal_model_rest <- lm(calories ~ total_fat + sat_fat + trans_fat + cholesterol + sodium + total_carb + fiber + protein + restaurant, data = model_data)

# Display the summary of the linear regression model
summary(cal_model_rest)
## 
## Call:
## lm(formula = calories ~ total_fat + sat_fat + trans_fat + cholesterol + 
##     sodium + total_carb + fiber + protein + restaurant, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -870.12   -5.68   -0.26    6.50  183.47 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.465854   7.816433  -0.955   0.3400    
## total_fat              8.187367   0.291779  28.060   <2e-16 ***
## sat_fat                1.565689   0.822580   1.903   0.0576 .  
## trans_fat              0.421681   4.605720   0.092   0.9271    
## cholesterol           -0.155717   0.109776  -1.418   0.1567    
## sodium                 0.010720   0.006212   1.726   0.0850 .  
## total_carb             3.902376   0.154398  25.275   <2e-16 ***
## fiber                 -0.624672   1.112939  -0.561   0.5749    
## protein                4.373081   0.370905  11.790   <2e-16 ***
## restaurantBurger King 14.045829   8.937974   1.571   0.1167    
## restaurantChick Fil-A  4.997054  11.605955   0.431   0.6670    
## restaurantDairy Queen 15.466077   9.547419   1.620   0.1059    
## restaurantMcdonalds   11.293339   9.122994   1.238   0.2164    
## restaurantSonic        1.572321   9.128868   0.172   0.8633    
## restaurantSubway       3.300923   8.874641   0.372   0.7101    
## restaurantTaco Bell   10.234317   8.459196   1.210   0.2269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.38 on 487 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9738 
## F-statistic:  1243 on 15 and 487 DF,  p-value: < 2.2e-16

I included the linear regression model above as an extra test of whether restaurant is a significant predictor of calories. It is not: none of the restaurant coefficients has a p-value below 0.05. The significant predictors stayed the same, except that the p-value for saturated fat rose slightly above the 0.05 significance level (0.0576). I still treat saturated fat as a significant predictor in the conclusion because the first model is my official model, and it is the one I would like you to grade. The second model is only there to make the point that restaurant is not a statistically significant factor for calories.
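
The individual restaurant p-values test each restaurant against the baseline one at a time. A joint test of all restaurant terms can be approximated with a nested-model F-test built from the two summaries. The sketch below is an approximation under stated assumptions: it uses the rounded residual standard errors printed above, and the variable names are illustrative.

```r
# Values read from the two model summaries
rse_small <- 45.34; df_small <- 494   # model without restaurant
rse_large <- 45.38; df_large <- 487   # model with restaurant

# Residual sum of squares = (residual standard error)^2 * residual df
rss_small <- rse_small^2 * df_small
rss_large <- rse_large^2 * df_large

# Number of extra coefficients added by restaurant (7 indicator variables)
extra_terms <- df_small - df_large

# F statistic for the nested-model comparison and its upper-tail p-value
f_stat  <- ((rss_small - rss_large) / extra_terms) / (rss_large / df_large)
p_value <- pf(f_stat, extra_terms, df_large, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The F statistic comes out below 1, so the restaurant terms as a group do not improve the fit at the 0.05 level, which agrees with the individual p-values. With both fitted objects in the session, anova(cal_model, cal_model_rest) runs the same comparison on the unrounded fits.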

Conclusion:

The model and visualizations show that total fat, saturated fat, total carbs, and protein are significant predictors of calories in fast food entrees, and each has a positive relationship with calories. There are also differences in average calorie content between restaurants, but the restaurant itself is not a significant predictor of calorie content, which I tested in an extra linear model. I would have liked to include vitamins and calcium in the calorie model, but roughly 41% of the values for these variables are missing, so I left them out. I was happy with this analysis and would like to thank you for giving such a good introduction to data visualization.