Introduction
Image from Getty Images:
https://www.gettyimages.com/detail/photo/mcdonalds-royalty-free-image/517499813?adppopup=true
Determining the nutritional content of food is important because the obesity rate among American adults is 42.5%. Obesity increases the risk of heart failure, cancer, and many other medical conditions. Lowering the obesity rate would improve the nation's health. One way to help is to understand the biggest predictors of calorie content, so that people can eat a healthy, balanced diet within a calorie limit, with sufficient macronutrients, to lose or maintain weight.
The dataset comes from fastfoodnutrition.org and covers only entrees from 2018. It was packaged in a GitHub repository for the OpenIntro project and is loaded here from the openintro R package.
The variables I am using are:
calories - total calories of an item
total_fat - total fat of an item
sat_fat - total saturated fat of an item
trans_fat - total trans fat in an item
cholesterol - total cholesterol in an item
sodium - total sodium in an item
total_carb - total carbs in an item
fiber - total fiber in an item
protein - total protein in an item
The question I would like to answer is: which of the variables selected above is the most significant predictor of a meal's calorie content? I would also like to visualize the relationship between calories and the significant variables.
I chose this topic because I am interested in nutrition, and I would like to better understand the best predictors of calorie content in food for my own health. I also hope it helps other people understand calorie content and what influences it.
# Load necessary libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
summary(fastfood)
## restaurant item calories cal_fat
## Length:515 Length:515 Min. : 20.0 Min. : 0.0
## Class :character Class :character 1st Qu.: 330.0 1st Qu.: 120.0
## Mode :character Mode :character Median : 490.0 Median : 210.0
## Mean : 530.9 Mean : 238.8
## 3rd Qu.: 690.0 3rd Qu.: 310.0
## Max. :2430.0 Max. :1270.0
##
## total_fat sat_fat trans_fat cholesterol
## Min. : 0.00 Min. : 0.000 Min. :0.000 Min. : 0.00
## 1st Qu.: 14.00 1st Qu.: 4.000 1st Qu.:0.000 1st Qu.: 35.00
## Median : 23.00 Median : 7.000 Median :0.000 Median : 60.00
## Mean : 26.59 Mean : 8.153 Mean :0.465 Mean : 72.46
## 3rd Qu.: 35.00 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.: 95.00
## Max. :141.00 Max. :47.000 Max. :8.000 Max. :805.00
##
## sodium total_carb fiber sugar
## Min. : 15 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 800 1st Qu.: 28.50 1st Qu.: 2.000 1st Qu.: 3.000
## Median :1110 Median : 44.00 Median : 3.000 Median : 6.000
## Mean :1247 Mean : 45.66 Mean : 4.137 Mean : 7.262
## 3rd Qu.:1550 3rd Qu.: 57.00 3rd Qu.: 5.000 3rd Qu.: 9.000
## Max. :6080 Max. :156.00 Max. :17.000 Max. :87.000
## NA's :12
## protein vit_a vit_c calcium
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 16.00 1st Qu.: 4.00 1st Qu.: 4.00 1st Qu.: 8.00
## Median : 24.50 Median : 10.00 Median : 10.00 Median : 20.00
## Mean : 27.89 Mean : 18.86 Mean : 20.17 Mean : 24.85
## 3rd Qu.: 36.00 3rd Qu.: 20.00 3rd Qu.: 30.00 3rd Qu.: 30.00
## Max. :186.00 Max. :180.00 Max. :400.00 Max. :290.00
## NA's :1 NA's :214 NA's :210 NA's :210
## salad
## Length:515
## Class :character
## Mode :character
##
##
##
##
sapply(fastfood, function(x) sum(is.na(x))) # Count the NA values in each column
## restaurant item calories cal_fat total_fat sat_fat
## 0 0 0 0 0 0
## trans_fat cholesterol sodium total_carb fiber sugar
## 0 0 0 0 12 0
## protein vit_a vit_c calcium salad
## 1 214 210 210 0
# Calculate the proportion of missing values for each column
na_proportion <- colSums(is.na(fastfood)) / nrow(fastfood)
# Display the proportion of NA values for each column
na_proportion
## restaurant item calories cal_fat total_fat sat_fat
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## trans_fat cholesterol sodium total_carb fiber sugar
## 0.000000000 0.000000000 0.000000000 0.000000000 0.023300971 0.000000000
## protein vit_a vit_c calcium salad
## 0.001941748 0.415533981 0.407766990 0.407766990 0.000000000
# Select the response (calories) and the predictor variables used in the model
model_data <- fastfood |>
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, protein)
# Run the linear regression model
cal_model <- lm(calories ~ total_fat + sat_fat + trans_fat + cholesterol + sodium + total_carb + fiber + protein, data = model_data)
# Display the summary of the linear regression model
summary(cal_model)
##
## Call:
## lm(formula = calories ~ total_fat + sat_fat + trans_fat + cholesterol +
## sodium + total_carb + fiber + protein, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -877.49 -4.73 0.25 7.47 182.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.055672 4.860595 0.217 0.8281
## total_fat 8.273373 0.268327 30.833 <2e-16 ***
## sat_fat 1.573103 0.793779 1.982 0.0481 *
## trans_fat -0.291658 4.536887 -0.064 0.9488
## cholesterol -0.130581 0.108309 -1.206 0.2285
## sodium 0.008746 0.005969 1.465 0.1435
## total_carb 3.900347 0.152043 25.653 <2e-16 ***
## fiber -0.550708 0.935828 -0.588 0.5565
## protein 4.285273 0.351456 12.193 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.34 on 494 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9742, Adjusted R-squared: 0.9738
## F-statistic: 2335 on 8 and 494 DF, p-value: < 2.2e-16
Model equation: calories = 1.06 + 8.27(total_fat) + 1.57(sat_fat) − 0.29(trans_fat) − 0.13(cholesterol) + 0.009(sodium) + 3.90(total_carb) − 0.55(fiber) + 4.29(protein)
P values:
Total fat, total carbs, and protein each have a highly significant p-value (less than 2e-16). Saturated fat has a p-value of 0.0481, which is just below the 0.05 significance level. Trans fat, cholesterol, sodium, and fiber all have p-values above 0.05.
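One way to rank predictors by strength of evidence is to sort the coefficient table by the absolute t value. In the summary above, total_fat (t = 30.8) ranks first, followed by total_carb (25.7) and protein (12.2). The sketch below shows the idea on simulated data; `rank_predictors`, `demo`, `x1`, and `x2` are hypothetical names that are not part of the original analysis.

```r
# Sort a fitted lm's coefficient table by |t value|, largest first.
# rank_predictors is a hypothetical helper written for illustration.
rank_predictors <- function(model) {
  tab <- coef(summary(model))                       # Estimate, Std. Error, t value, Pr(>|t|)
  tab <- tab[rownames(tab) != "(Intercept)", , drop = FALSE]
  tab[order(abs(tab[, "t value"]), decreasing = TRUE), , drop = FALSE]
}

# Simulated stand-in data: x1 has a strong effect, x2 a very weak one.
set.seed(7)
demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
demo$y <- 5 * demo$x1 + 0.1 * demo$x2 + rnorm(100)

ranked <- rank_predictors(lm(y ~ x1 + x2, data = demo))
print(ranked)
```

Assuming the model object from this report is still in the session, `rank_predictors(cal_model)` would apply the same ordering to the real coefficients. A large t value reflects how precisely an effect is estimated, not how large the effect is per gram.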
Adjusted R squared:
Our adjusted R-squared is 0.9738. This means that about 97% of the variation in calories is explained by this model, after adjusting for the number of predictors. This indicates a very strong fit.
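As a quick check, the adjusted R-squared can be recomputed from the multiple R-squared with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below takes its inputs from the summary output: R² = 0.9742 (as printed, so already rounded), n = 515 - 12 = 503 rows used, and p = 8 predictors. The variable names are chosen here for illustration.

```r
# Recompute adjusted R-squared from the values printed by summary(cal_model).
r2 <- 0.9742     # Multiple R-squared (rounded in the printed output)
n  <- 515 - 12   # items in the data minus the 12 rows dropped for missing values
p  <- 8          # number of predictors in the model

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

With 503 rows and only 8 predictors, the adjustment is tiny, which is why the two R-squared values in the output are almost identical.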
Diagnostic plots:
par(mfrow = c(2, 2))
plot(cal_model)
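The diagnostic plots flag influential points visually. A numeric follow-up is to compute Cook's distance for each row. The sketch below uses simulated data with one deliberately distorted point; `flag_influential` is a hypothetical helper, and the 4/n cutoff is only a common rule of thumb, not a threshold used in the original analysis.

```r
# Return the indices of observations whose Cook's distance exceeds 4/n.
# flag_influential is a hypothetical helper; 4/n is a rule-of-thumb cutoff.
flag_influential <- function(model) {
  cd <- cooks.distance(model)
  which(cd > 4 / length(cd))
}

# Simulated stand-in data with one distorted observation at the end.
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30)
y[30] <- y[30] - 60   # hypothetical mis-recorded value

fit <- lm(y ~ x)
flagged <- flag_influential(fit)
print(flagged)
```

Assuming `cal_model` is in the session, `flag_influential(cal_model)` would list the rows worth inspecting by hand before deciding whether they are data errors or genuinely unusual menu items.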
Analysis of Linear Regression model:
This is a good model: the adjusted R-squared of 0.9738 means it explains most of the variability in calories. Total fat, saturated fat, total carbs, and protein are significant predictors of calories. This makes sense because fat, carbohydrates, and protein are the macronutrients that supply calories. Trans fat, cholesterol, sodium, and fiber were not statistically significant. The diagnostic plots look good overall, although there are some high-leverage, influential points; the minimum residual of -877.49 in the summary above points to at least one item that the model fits poorly.
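The fitted coefficients for total fat (8.27), total carbs (3.90), and protein (4.29) sit close to the commonly cited general Atwater factors of about 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. The sketch below compares the two on a made-up item; the function names and the item's gram values are hypothetical, and `model_calories` is a simplification that sets the remaining predictors to zero.

```r
# Calories from the general Atwater factors (approximate textbook values).
atwater_calories <- function(fat_g, carb_g, protein_g) {
  9 * fat_g + 4 * carb_g + 4 * protein_g
}

# Calories from the rounded model equation, keeping only the three
# macronutrient terms (all other predictors treated as zero: a simplification).
model_calories <- function(fat_g, carb_g, protein_g) {
  1.06 + 8.27 * fat_g + 3.90 * carb_g + 4.29 * protein_g
}

# Hypothetical entree: 20 g fat, 40 g carbs, 25 g protein.
atwater_est <- atwater_calories(20, 40, 25)
model_est   <- model_calories(20, 40, 25)
print(c(atwater = atwater_est, model = model_est))
```

The two estimates land within a few percent of each other for this item, which is consistent with the model having largely rediscovered how calories are computed from macronutrients.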
library(ggplot2)
point_color <- "#00BFC4" # teal
line_color <- "#F8766D" # red
# Scatter plot: Total Fat vs Calories
ggplot(model_data, aes(x = total_fat, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Total Fat (grams) vs Calories",
       x = "Total Fat (g)",
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
This graph shows a positive relationship between total fat (in grams) and calories: as total fat increases, calories also increase.
# Scatter plot: Total Carbs vs Calories
ggplot(model_data, aes(x = total_carb, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Total Carbs (grams) vs Calories",
       x = "Total Carbohydrates (g)",
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
This graph shows a positive relationship between total carbs (in grams) and calories: as total carbs increase, calories also increase.
# Scatter plot: Protein vs Calories
ggplot(model_data, aes(x = protein, y = calories)) +
  geom_point(color = point_color, alpha = 0.6) +
  geom_smooth(method = "lm", color = line_color) +
  labs(title = "Protein (grams) vs Calories",
       x = "Protein (g)",
       y = "Calories",
       caption = "Data Source: fastfoodnutrition.org 2018") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
This graph shows a positive relationship between protein (in grams) and calories: as protein increases, calories also increase.
# Calculate average calories for each restaurant
avg_calories <- fastfood |>
  group_by(restaurant) |>
  summarise(mean_calories = mean(calories, na.rm = TRUE)) |>
  arrange(desc(mean_calories))
# Bar plot
ggplot(avg_calories, aes(x = reorder(restaurant, mean_calories), y = mean_calories, fill = restaurant)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Calories by Restaurant",
    x = "Restaurant",
    y = "Average Calories",
    caption = "Data Source: fastfoodnutrition.org 2018"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 12)
We can see from this graph that average entree calories differ between restaurants. McDonald's has the highest average calorie content in its entrees. However, restaurant is not a significant predictor of calories, because customers can order different meals with very different calorie contents at the same restaurant. This is why macronutrients are the best predictors of calorie content.
# Select the same variables as before, plus restaurant, for a second model
model_data <- fastfood |>
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium, total_carb, fiber, protein, restaurant)
# Run the linear regression model
cal_model_rest <- lm(calories ~ total_fat + sat_fat + trans_fat + cholesterol + sodium + total_carb + fiber + protein + restaurant, data = model_data)
# Display the summary of the linear regression model
summary(cal_model_rest)
##
## Call:
## lm(formula = calories ~ total_fat + sat_fat + trans_fat + cholesterol +
## sodium + total_carb + fiber + protein + restaurant, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -870.12 -5.68 -0.26 6.50 183.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.465854 7.816433 -0.955 0.3400
## total_fat 8.187367 0.291779 28.060 <2e-16 ***
## sat_fat 1.565689 0.822580 1.903 0.0576 .
## trans_fat 0.421681 4.605720 0.092 0.9271
## cholesterol -0.155717 0.109776 -1.418 0.1567
## sodium 0.010720 0.006212 1.726 0.0850 .
## total_carb 3.902376 0.154398 25.275 <2e-16 ***
## fiber -0.624672 1.112939 -0.561 0.5749
## protein 4.373081 0.370905 11.790 <2e-16 ***
## restaurantBurger King 14.045829 8.937974 1.571 0.1167
## restaurantChick Fil-A 4.997054 11.605955 0.431 0.6670
## restaurantDairy Queen 15.466077 9.547419 1.620 0.1059
## restaurantMcdonalds 11.293339 9.122994 1.238 0.2164
## restaurantSonic 1.572321 9.128868 0.172 0.8633
## restaurantSubway 3.300923 8.874641 0.372 0.7101
## restaurantTaco Bell 10.234317 8.459196 1.210 0.2269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.38 on 487 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9738
## F-statistic: 1243 on 15 and 487 DF, p-value: < 2.2e-16
I included the linear regression model above as an extra test of whether restaurant is a significant predictor of calories. It was not: none of the restaurant terms has a p-value below 0.05, and the adjusted R-squared stays at 0.9738. The significant predictors remained the same, except that the p-value for saturated fat rose slightly above the 0.05 significance level (0.0576). I will still include saturated fat as a significant predictor in the conclusion, because the first model is my official model, and it is the one I would like you to grade. The second model is only there to show that restaurant is not a statistically significant factor for calories.
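The individual t-tests above compare each restaurant with the baseline restaurant one at a time. To test the restaurant variable as a whole, a common approach is a partial F-test on the two nested models using `anova()`. This was not part of the original analysis; the sketch below illustrates it on simulated data, and `toy`, `chain`, `small`, and `full` are hypothetical names. Both models must be fitted to the same rows, which holds in this report because each model dropped the same 12 observations.

```r
# Simulated stand-in data: calories depend on fat and carbs, not on chain.
set.seed(42)
n <- 120
toy <- data.frame(
  fat   = runif(n, 0, 50),
  carb  = runif(n, 0, 80),
  chain = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$calories <- 9 * toy$fat + 4 * toy$carb + rnorm(n, sd = 20)

# Nested models: the second adds the categorical variable.
small <- lm(calories ~ fat + carb, data = toy)
full  <- lm(calories ~ fat + carb + chain, data = toy)

# Partial F-test: does adding chain reduce the residual sum of squares
# by more than chance would explain?
cmp <- anova(small, full)
print(cmp)
```

With the report's objects in the session, the equivalent call would be `anova(cal_model, cal_model_rest)`; a large p-value in the second row would support leaving restaurant out of the model.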
Conclusion:
The regression model and visualizations show that total fat, saturated fat, total carbs, and protein are significant predictors of calories, and each has a positive relationship with calories in fast food entrees. There are also differences in average calorie content between restaurants, but the restaurant itself is not a significant predictor of calorie content; I tested this in an extra linear model. I wish I had been able to include vitamins and calcium in the calorie model, but more than 40% of the values for these variables are missing, so I left them out. I was happy with this analysis and would like to thank you for giving such a good introduction to data visualization.