library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")

fast_food <- read_csv("fastfood.csv")
## Rows: 515 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): restaurant, item, salad
## dbl (14): calories, cal_fat, total_fat, sat_fat, trans_fat, cholesterol, sod...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What contributes to the amount of calories found in fast food items?

Introduction

Every day, people all over the world eat fast food, whether as a quick snack or on a family outing. Fast food chains offer many different products with varying nutritional value. I wanted to know what determines the number of calories in these foods and which factors play a role.

To answer this question, I used a data set called fastfood, which has 515 observations and 17 variables. It contains nutritional information, such as calories, fats, sugar, sodium, and carbs, for menu items from different fast food chains. This information will allow me to answer my question about what contributes to the number of calories in fast food items. The data set comes from openintro.org and can be found at https://www.openintro.org/data/index.php?data=fastfood.

Data Analysis

The variables I will be using are calories, total_fat, sodium, protein, sugar, and cholesterol. All of these variables are quantitative.

  1. calories: This is the number of calories of the item

  2. total_fat: This is the amount of total fat in the item

  3. sodium: This is the amount of sodium in the item

  4. protein: This is the amount of protein in the item

  5. sugar: This is the amount of sugar in the item

  6. cholesterol: This is the amount of cholesterol in the item

First, I checked the head and structure of the data set; the column names look good and there is nothing out of the ordinary. I then checked for NAs in the variables I planned to use. None of them had NAs except protein, which had one. I imputed that NA with the mean protein value (rounded to the nearest whole number) so I wouldn’t have any missing information when I created my model. I rechecked the columns and confirmed that none of the columns I planned to use had NAs.

head(fast_food)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
str(fast_food)
## spc_tbl_ [515 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ restaurant : chr [1:515] "Mcdonalds" "Mcdonalds" "Mcdonalds" "Mcdonalds" ...
##  $ item       : chr [1:515] "Artisan Grilled Chicken Sandwich" "Single Bacon Smokehouse Burger" "Double Bacon Smokehouse Burger" "Grilled Bacon Smokehouse Chicken Sandwich" ...
##  $ calories   : num [1:515] 380 840 1130 750 920 540 300 510 430 770 ...
##  $ cal_fat    : num [1:515] 60 410 600 280 410 250 100 210 190 400 ...
##  $ total_fat  : num [1:515] 7 45 67 31 45 28 12 24 21 45 ...
##  $ sat_fat    : num [1:515] 2 17 27 10 12 10 5 4 11 21 ...
##  $ trans_fat  : num [1:515] 0 1.5 3 0.5 0.5 1 0.5 0 1 2.5 ...
##  $ cholesterol: num [1:515] 95 130 220 155 120 80 40 65 85 175 ...
##  $ sodium     : num [1:515] 1110 1580 1920 1940 1980 950 680 1040 1040 1290 ...
##  $ total_carb : num [1:515] 44 62 63 62 81 46 33 49 35 42 ...
##  $ fiber      : num [1:515] 3 2 3 2 4 3 2 3 2 3 ...
##  $ sugar      : num [1:515] 11 18 18 18 18 9 7 6 7 10 ...
##  $ protein    : num [1:515] 37 46 70 55 46 25 15 25 25 51 ...
##  $ vit_a      : num [1:515] 4 6 10 6 6 10 10 0 20 20 ...
##  $ vit_c      : num [1:515] 20 20 20 25 20 2 2 4 4 6 ...
##  $ calcium    : num [1:515] 20 20 50 20 20 15 10 2 15 20 ...
##  $ salad      : chr [1:515] "Other" "Other" "Other" "Other" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   restaurant = col_character(),
##   ..   item = col_character(),
##   ..   calories = col_double(),
##   ..   cal_fat = col_double(),
##   ..   total_fat = col_double(),
##   ..   sat_fat = col_double(),
##   ..   trans_fat = col_double(),
##   ..   cholesterol = col_double(),
##   ..   sodium = col_double(),
##   ..   total_carb = col_double(),
##   ..   fiber = col_double(),
##   ..   sugar = col_double(),
##   ..   protein = col_double(),
##   ..   vit_a = col_double(),
##   ..   vit_c = col_double(),
##   ..   calcium = col_double(),
##   ..   salad = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
colSums(is.na(fast_food))
##  restaurant        item    calories     cal_fat   total_fat     sat_fat 
##           0           0           0           0           0           0 
##   trans_fat cholesterol      sodium  total_carb       fiber       sugar 
##           0           0           0           0          12           0 
##     protein       vit_a       vit_c     calcium       salad 
##           1         214         210         210           0
meanff <- mean(fast_food$protein, na.rm = TRUE)
meanff
## [1] 27.89105
fast_food <- fast_food |>
  mutate(protein = ifelse(is.na(protein), round(meanff, 0), protein))

colSums(is.na(fast_food))
##  restaurant        item    calories     cal_fat   total_fat     sat_fat 
##           0           0           0           0           0           0 
##   trans_fat cholesterol      sodium  total_carb       fiber       sugar 
##           0           0           0           0          12           0 
##     protein       vit_a       vit_c     calcium       salad 
##           0         214         210         210           0

Here I created a new data set with only the variables I planned to use in my multiple regression model. I also ran summary() on it to see the minimum, maximum, median, and mean of each column.

fastfood <- fast_food |>
  select(calories, total_fat, sodium, protein, sugar, cholesterol)

summary(fastfood)
##     calories        total_fat          sodium        protein      
##  Min.   :  20.0   Min.   :  0.00   Min.   :  15   Min.   :  1.00  
##  1st Qu.: 330.0   1st Qu.: 14.00   1st Qu.: 800   1st Qu.: 16.00  
##  Median : 490.0   Median : 23.00   Median :1110   Median : 25.00  
##  Mean   : 530.9   Mean   : 26.59   Mean   :1247   Mean   : 27.89  
##  3rd Qu.: 690.0   3rd Qu.: 35.00   3rd Qu.:1550   3rd Qu.: 36.00  
##  Max.   :2430.0   Max.   :141.00   Max.   :6080   Max.   :186.00  
##      sugar         cholesterol    
##  Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 3.000   1st Qu.: 35.00  
##  Median : 6.000   Median : 60.00  
##  Mean   : 7.262   Mean   : 72.46  
##  3rd Qu.: 9.000   3rd Qu.: 95.00  
##  Max.   :87.000   Max.   :805.00

Regression Analysis

To help me answer my question I planned to create a multiple linear regression model using the lm() function to predict the number of calories in fast food items using total_fat, sodium, protein, sugar, and cholesterol as predictors.

food_model <- lm(calories ~ total_fat + sodium + protein + sugar + cholesterol, data = fastfood)
food_model
## 
## Call:
## lm(formula = calories ~ total_fat + sodium + protein + sugar + 
##     cholesterol, data = fastfood)
## 
## Coefficients:
## (Intercept)    total_fat       sodium      protein        sugar  cholesterol  
##    41.08998     10.64781      0.06652      6.23538      4.82198     -1.17550
summary(food_model)
## 
## Call:
## lm(formula = calories ~ total_fat + sodium + protein + sugar + 
##     cholesterol, data = fastfood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -943.75  -33.06    1.45   32.47  337.59 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.089975   7.180818   5.722  1.8e-08 ***
## total_fat   10.647810   0.338548  31.451  < 2e-16 ***
## sodium       0.066522   0.008749   7.604  1.4e-13 ***
## protein      6.235377   0.494222  12.617  < 2e-16 ***
## sugar        4.821980   0.543820   8.867  < 2e-16 ***
## cholesterol -1.175496   0.135617  -8.668  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.73 on 509 degrees of freedom
## Multiple R-squared:  0.9307, Adjusted R-squared:   0.93 
## F-statistic:  1367 on 5 and 509 DF,  p-value: < 2.2e-16

The adjusted R² = 0.93, which means about 93% of the variance in calories is explained by the model. This is a very strong fit. All predictors are significant, as their p-values are all less than the significance level of 0.05. The intercept is 41.089975, so the model predicts about 41 calories when all predictors are zero. All of the coefficients are positive except for cholesterol. Each interpretation below holds the other predictors constant. Every 1 unit increase in total_fat is expected to increase the number of calories by 10.65. Every 1 unit increase in sodium is expected to increase the number of calories by 0.07. Every 1 unit increase in protein is expected to increase the number of calories by 6.24. Every 1 unit increase in sugar is expected to increase the number of calories by 4.82. Finally, every 1 unit increase in cholesterol is expected to decrease the number of calories by 1.18.
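As a worked example of how the coefficients combine, the sketch below plugs the Big Mac row from the head() and str() output above into the fitted equation. The coefficient values are copied from the summary() output; the names `coefs`, `big_mac`, and `predicted` are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary(food_model) output above
coefs <- c(intercept = 41.089975, total_fat = 10.647810, sodium = 0.066522,
           protein = 6.235377, sugar = 4.821980, cholesterol = -1.175496)

# Big Mac values as printed by head() and str() (row 6 of fast_food)
big_mac <- c(total_fat = 28, sodium = 950, protein = 25, sugar = 9,
             cholesterol = 80)

# Prediction = intercept + sum of (coefficient * predictor value)
predicted <- unname(coefs["intercept"] + sum(coefs[names(big_mac)] * big_mac))
predicted        # about 507.7 calories

# Residual = observed - predicted; the Big Mac is listed at 540 calories
540 - predicted  # about 32.3
```

With the fitted model in memory, predict(food_model, newdata = ...) should give the same number up to rounding of the printed coefficients.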

Model Assumptions and Diagnostics

For my multiple linear regression model I need to check the assumptions of linearity, independence, homoscedasticity, normality, and no multicollinearity. This helps ensure that my model is reliable in predicting the number of calories in fast food items. The assumptions for multiple linear regression are:

  - Linearity: The relationship between the predictors and the response is linear. We want the trend (loess) line to closely follow the fitted line with no clear curves or clusters.

  - Independence: Observations are independent. We look at the Residuals vs Order plot to see if the residuals bounce randomly around zero without patterns.

  - Homoscedasticity: The residuals have constant variance. We look at the Residuals vs Fitted plot to see if the residuals are scattered randomly around zero with a consistent spread across the line. We also look at the Scale-Location plot, where we want the line to stay roughly horizontal and the spread of points to be even.

  - Normality: The residuals are normally distributed. We check the Q-Q plot to see if the residuals follow a straight diagonal line with little to no deviation.

  - No multicollinearity: We check a correlation matrix to see if any predictors are highly correlated with each other. We do not want any high correlations.

Linearity

crPlots(food_model)

Looking at the Component + Residual plots I notice that they all look good and the linearity assumption is met. Looking at total_fat the loess line seems to follow the dashed line almost exactly showing linearity. For sodium the loess line follows the dashed line only deviating slightly near the end showing some non-linearity but not too much. The same can be said for the protein, sugar, and cholesterol plots where their loess lines follow the dashed lines as well only deviating slightly near the ends showing good linearity.

Independence

plot(resid(food_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Looking at the Residuals vs Order plot, there is a single burst around index 130, but everywhere else the residuals appear randomly scattered around zero. This suggests that the independence assumption is met.
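A numeric check can complement the visual one. One common choice, which is not part of the original analysis, is the Durbin-Watson statistic: it is close to 2 when neighbouring residuals are uncorrelated. The sketch below computes it by hand in base R on simulated residuals; `dw_stat` and `e_sim` are hypothetical names, and in this report the input would be resid(food_model).

```r
# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw_stat <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

# Simulated, independent residuals as a stand-in for resid(food_model)
set.seed(101)
e_sim <- rnorm(515, mean = 0, sd = 75)
dw_stat(e_sim)            # should land near 2 for independent residuals

# Perfectly alternating residuals push the statistic well above 2
dw_stat(c(1, -1, 1, -1))  # 3
```

The statistic only makes sense if the row order carries meaning (for example, time). The car package loaded at the top also provides durbinWatsonTest(), which adds a p-value.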

Core Diagnostics

par(mfrow=c(2,2)); plot(food_model); par(mfrow=c(1,1))

Residuals vs Fitted: The residuals seem to be randomly scattered around zero for the lower fitted values; however, at the higher fitted values there is a slight downward curve, showing some non-linearity. Overall, linearity is acceptable.

Scale-Location: The residuals seem evenly spread around the line and the red line shows a slight curve up toward the higher fitted values showing some heteroscedasticity but it is not too severe.

Q-Q Residuals: The residuals mostly follow the line, with a big deviation for two residuals near the beginning and a slight deviation near the end, showing that the residuals are not exactly normally distributed.

Residuals vs Leverage: Most of the points lie well inside the Cook's distance lines; only three observations (206, 193, and 128) appear near a Cook's distance line, showing some moderate influence.
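The points flagged in the Residuals vs Leverage plot can also be listed numerically with cooks.distance(). The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in so that it runs on its own, and the 4/n cutoff is a common rule of thumb rather than a formal test. The names `fit`, `cd`, `cutoff`, and `flagged` are hypothetical.

```r
# Stand-in model on a built-in data set; in this report the model
# would be food_model and the data fastfood
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# 4/n is a common rule-of-thumb cutoff, not a formal test
cutoff <- 4 / nrow(mtcars)
flagged <- cd[cd > cutoff]

# Most influential observations first
sort(flagged, decreasing = TRUE)
```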

Multicollinearity

cor(fastfood[, c("total_fat", "sodium", "protein", "sugar", "cholesterol")], use = "complete.obs")
##             total_fat    sodium   protein     sugar cholesterol
## total_fat   1.0000000 0.6691816 0.7136267 0.2593702   0.8013520
## sodium      0.6691816 1.0000000 0.7659422 0.4229934   0.5961644
## protein     0.7136267 0.7659422 1.0000000 0.3894746   0.8660469
## sugar       0.2593702 0.4229934 0.3894746 1.0000000   0.2982589
## cholesterol 0.8013520 0.5961644 0.8660469 0.2982589   1.0000000

There are high correlations between total_fat-cholesterol (0.801) and protein-cholesterol (0.866), and moderate correlations between total_fat-protein (0.714), sodium-protein (0.766), sodium-total_fat (0.669), and sodium-cholesterol (0.596). Correlated predictors inflate the standard errors of the coefficients, which makes the individual estimates and their p-values less reliable and harder to interpret, even when the overall predictions remain accurate. It would probably be best to remove some predictors so that each remaining coefficient is easier to interpret.
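Pairwise correlations can miss a predictor that is explained by several others combined, so a variance inflation factor (VIF) is a common follow-up. This was not part of the original analysis. For a model with only numeric predictors, each VIF equals the matching diagonal entry of the inverse of the predictor correlation matrix, so the sketch below reuses the matrix printed above. Values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a convention. The names `vars`, `R`, and `vifs` are hypothetical.

```r
# Correlation matrix of the predictors, copied from the cor() output above
vars <- c("total_fat", "sodium", "protein", "sugar", "cholesterol")
R <- matrix(c(
  1.0000000, 0.6691816, 0.7136267, 0.2593702, 0.8013520,
  0.6691816, 1.0000000, 0.7659422, 0.4229934, 0.5961644,
  0.7136267, 0.7659422, 1.0000000, 0.3894746, 0.8660469,
  0.2593702, 0.4229934, 0.3894746, 1.0000000, 0.2982589,
  0.8013520, 0.5961644, 0.8660469, 0.2982589, 1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# For numeric predictors, each VIF is the matching diagonal entry
# of the inverse correlation matrix
vifs <- diag(solve(R))
round(vifs, 2)
```

With the model in memory, vif(food_model) from the car package loaded at the top should report the same values.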

RMSE

residuals_multiple <- resid(food_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 74.2949

The RMSE = 74.2949, which means that the model's predictions typically miss by about 74.3 calories.
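The RMSE is slightly smaller than the residual standard error of 74.73 reported by summary(food_model). Both are built from the same squared residuals, but the residual standard error divides by the residual degrees of freedom while the RMSE computed here divides by the number of observations. The sketch below checks that relationship using only the printed values; the variable names are hypothetical.

```r
# Values copied from the summary(food_model) output
rse <- 74.73   # residual standard error
df  <- 509     # residual degrees of freedom (n minus 6 estimated coefficients)
n   <- 515     # number of observations

# RSE divides the squared residuals by df, RMSE divides them by n,
# so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse  # about 74.29, matching rmse_multiple up to rounding
```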

Conclusion and Future Directions

This multiple regression model helps to answer what factors play a key role in determining the number of calories in fast food items. Total fat, protein, and sugar play a large role in increasing the number of calories in fast food, while cholesterol is associated with fewer calories once the other predictors are held constant. Sodium doesn’t make a huge change, as each 1 unit increase raises the number of calories by only a small amount.

When checking the assumptions and diagnostics, there were some assumptions that were not met completely: there was a slight trace of heteroscedasticity and the residuals were not exactly normal. Overall, this model is able to explain 93% of the variance in calories, which means the fit is strong. However, the RMSE is fairly large and there is some high multicollinearity, meaning this model's predictions can vary and its individual coefficients can be unreliable. In the future, since some predictors are highly correlated, we could try removing some of them to see whether that changes how accurate the predictions are and makes the model easier to interpret.