library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")
fast_food <- read_csv("fastfood.csv")
## Rows: 515 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): restaurant, item, salad
## dbl (14): calories, cal_fat, total_fat, sat_fat, trans_fat, cholesterol, sod...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What contributes to the amount of calories found in fast food items?
Every day, many people all over the world eat fast food, whether as a quick snack or on a family outing. Fast food chains offer many different products with widely varying nutritional content. I wanted to know which nutritional factors determine the number of calories in these foods.
To help me answer this question I decided to use a data set called fastfood. The fastfood data set has 515 observations and 17 variables. It contains information about different fast food items from different fast food chains and their nutritional contents, such as calories, fats, sugar, sodium, and carbs. This information will allow me to answer my question about what contributes to the number of calories in fast food items. I got this data set from the website openintro.org, where it can be found at https://www.openintro.org/data/index.php?data=fastfood.
The variables I will be using are calories, total_fat, sodium, protein, sugar, and cholesterol. All of these variables are quantitative.
calories: This is the number of calories of the item
total_fat: This is the amount of total fat in the item
sodium: This is the amount of sodium in the item
protein: This is the amount of protein in the item
sugar: This is the amount of sugar in the item
cholesterol: This is the amount of cholesterol in the item
First, I checked the head and structure of the data set and saw that the column names looked good and nothing was out of the ordinary. I then checked for NAs in the variables I planned to use. None of them had NAs except protein, which had one. I imputed that NA with the mean protein value (rounded to the nearest whole number) so that no observation would be dropped when I fit my model. I then rechecked the columns and confirmed that none of the variables I planned to use had NAs.
head(fast_food)
## # A tibble: 6 × 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Artisan G… 380 60 7 2 0 95
## 2 Mcdonalds Single Ba… 840 410 45 17 1.5 130
## 3 Mcdonalds Double Ba… 1130 600 67 27 3 220
## 4 Mcdonalds Grilled B… 750 280 31 10 0.5 155
## 5 Mcdonalds Crispy Ba… 920 410 45 12 0.5 120
## 6 Mcdonalds Big Mac 540 250 28 10 1 80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## # protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
str(fast_food)
## spc_tbl_ [515 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ restaurant : chr [1:515] "Mcdonalds" "Mcdonalds" "Mcdonalds" "Mcdonalds" ...
## $ item : chr [1:515] "Artisan Grilled Chicken Sandwich" "Single Bacon Smokehouse Burger" "Double Bacon Smokehouse Burger" "Grilled Bacon Smokehouse Chicken Sandwich" ...
## $ calories : num [1:515] 380 840 1130 750 920 540 300 510 430 770 ...
## $ cal_fat : num [1:515] 60 410 600 280 410 250 100 210 190 400 ...
## $ total_fat : num [1:515] 7 45 67 31 45 28 12 24 21 45 ...
## $ sat_fat : num [1:515] 2 17 27 10 12 10 5 4 11 21 ...
## $ trans_fat : num [1:515] 0 1.5 3 0.5 0.5 1 0.5 0 1 2.5 ...
## $ cholesterol: num [1:515] 95 130 220 155 120 80 40 65 85 175 ...
## $ sodium : num [1:515] 1110 1580 1920 1940 1980 950 680 1040 1040 1290 ...
## $ total_carb : num [1:515] 44 62 63 62 81 46 33 49 35 42 ...
## $ fiber : num [1:515] 3 2 3 2 4 3 2 3 2 3 ...
## $ sugar : num [1:515] 11 18 18 18 18 9 7 6 7 10 ...
## $ protein : num [1:515] 37 46 70 55 46 25 15 25 25 51 ...
## $ vit_a : num [1:515] 4 6 10 6 6 10 10 0 20 20 ...
## $ vit_c : num [1:515] 20 20 20 25 20 2 2 4 4 6 ...
## $ calcium : num [1:515] 20 20 50 20 20 15 10 2 15 20 ...
## $ salad : chr [1:515] "Other" "Other" "Other" "Other" ...
## - attr(*, "spec")=
## .. cols(
## .. restaurant = col_character(),
## .. item = col_character(),
## .. calories = col_double(),
## .. cal_fat = col_double(),
## .. total_fat = col_double(),
## .. sat_fat = col_double(),
## .. trans_fat = col_double(),
## .. cholesterol = col_double(),
## .. sodium = col_double(),
## .. total_carb = col_double(),
## .. fiber = col_double(),
## .. sugar = col_double(),
## .. protein = col_double(),
## .. vit_a = col_double(),
## .. vit_c = col_double(),
## .. calcium = col_double(),
## .. salad = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(fast_food))
## restaurant item calories cal_fat total_fat sat_fat
## 0 0 0 0 0 0
## trans_fat cholesterol sodium total_carb fiber sugar
## 0 0 0 0 12 0
## protein vit_a vit_c calcium salad
## 1 214 210 210 0
meanff <- mean(fast_food$protein, na.rm = TRUE)
meanff
## [1] 27.89105
fast_food <- fast_food |>
  mutate(protein = ifelse(is.na(protein), round(meanff, 0), protein))
colSums(is.na(fast_food))
## restaurant item calories cal_fat total_fat sat_fat
## 0 0 0 0 0 0
## trans_fat cholesterol sodium total_carb fiber sugar
## 0 0 0 0 12 0
## protein vit_a vit_c calcium salad
## 0 214 210 210 0
Next, I created a new data set containing only the variables I planned to use in my multiple regression model. I also ran summary() on it to see the minimum, maximum, median, and mean of each column.
fastfood <- fast_food |>
  select(calories, total_fat, sodium, protein, sugar, cholesterol)
summary(fastfood)
## calories total_fat sodium protein
## Min. : 20.0 Min. : 0.00 Min. : 15 Min. : 1.00
## 1st Qu.: 330.0 1st Qu.: 14.00 1st Qu.: 800 1st Qu.: 16.00
## Median : 490.0 Median : 23.00 Median :1110 Median : 25.00
## Mean : 530.9 Mean : 26.59 Mean :1247 Mean : 27.89
## 3rd Qu.: 690.0 3rd Qu.: 35.00 3rd Qu.:1550 3rd Qu.: 36.00
## Max. :2430.0 Max. :141.00 Max. :6080 Max. :186.00
## sugar cholesterol
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 35.00
## Median : 6.000 Median : 60.00
## Mean : 7.262 Mean : 72.46
## 3rd Qu.: 9.000 3rd Qu.: 95.00
## Max. :87.000 Max. :805.00
To help me answer my question I planned to create a multiple linear regression model using the lm() function to predict the number of calories in fast food items using total_fat, sodium, protein, sugar, and cholesterol as predictors.
food_model <- lm(calories ~ total_fat + sodium + protein + sugar + cholesterol, data = fastfood)
food_model
##
## Call:
## lm(formula = calories ~ total_fat + sodium + protein + sugar +
## cholesterol, data = fastfood)
##
## Coefficients:
## (Intercept) total_fat sodium protein sugar cholesterol
## 41.08998 10.64781 0.06652 6.23538 4.82198 -1.17550
summary(food_model)
##
## Call:
## lm(formula = calories ~ total_fat + sodium + protein + sugar +
## cholesterol, data = fastfood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -943.75 -33.06 1.45 32.47 337.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.089975 7.180818 5.722 1.8e-08 ***
## total_fat 10.647810 0.338548 31.451 < 2e-16 ***
## sodium 0.066522 0.008749 7.604 1.4e-13 ***
## protein 6.235377 0.494222 12.617 < 2e-16 ***
## sugar 4.821980 0.543820 8.867 < 2e-16 ***
## cholesterol -1.175496 0.135617 -8.668 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.73 on 509 degrees of freedom
## Multiple R-squared: 0.9307, Adjusted R-squared: 0.93
## F-statistic: 1367 on 5 and 509 DF, p-value: < 2.2e-16
The adjusted R² is 0.93, which means the model explains about 93% of the variance in calories, a very strong fit. All predictors are significant, since every p-value is below the 0.05 significance level. The intercept of 41.09 means the model predicts about 41 calories when all predictors are zero. All of the coefficients are positive except for cholesterol. Holding the other predictors constant, every 1-unit increase in total_fat is expected to increase the number of calories by 10.65. Every 1-unit increase in sodium is expected to increase the number of calories by 0.07. Every 1-unit increase in protein is expected to increase the number of calories by 6.24. Every 1-unit increase in sugar is expected to increase the number of calories by 4.82. Finally, every 1-unit increase in cholesterol is expected to decrease the number of calories by 1.18. That negative sign applies only with the other predictors held fixed, and it should be read with caution because cholesterol is strongly correlated with total_fat and protein.
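To make the coefficient interpretation concrete, here is a small sketch that plugs the first row of the data (the Artisan Grilled Chicken Sandwich shown in the head() and str() output) into the fitted equation by hand. The object names `coefs`, `item`, `predicted`, and `observed` are made up for this illustration; the numbers are copied from the output above.

```r
# Coefficients copied from summary(food_model)
coefs <- c("(Intercept)" = 41.089975,
           total_fat     = 10.647810,
           sodium        = 0.066522,
           protein       = 6.235377,
           sugar         = 4.821980,
           cholesterol   = -1.175496)

# First row of the data; the leading 1 multiplies the intercept
item <- c("(Intercept)" = 1,
          total_fat     = 7,
          sodium        = 1110,
          protein       = 37,
          sugar         = 11,
          cholesterol   = 95)

# Predicted calories = sum of coefficient * value, matched by name
predicted <- sum(coefs * item[names(coefs)])
observed  <- 380

round(predicted, 2)             # about 361.54 predicted calories
round(observed - predicted, 2)  # residual of about 18.46 calories
```

With the fitted model object available, `predict(food_model, newdata = fastfood[1, ])` should give the same prediction up to rounding of the printed coefficients.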
For my multiple linear regression model I need to check the assumptions of linearity, independence, homoscedasticity, normality, and no multicollinearity. This will help ensure that my model is reliable in predicting the number of calories in fast food items. The assumptions for multiple linear regression are:
- Linearity: The relationship between the predictors and the response is linear. We want the trend (loess) line to closely follow the fitted line with no clear curves or clusters.
- Independence: Observations are independent. We look at the Residuals vs Order plot to see if the residuals bounce randomly around zero without patterns.
- Homoscedasticity: The residuals have constant variance. We look at the Residuals vs Fitted plot to see if the residuals are scattered randomly around zero with a consistent spread across the line. We also look at the Scale-Location plot, where we want the line to stay roughly horizontal and the spread of points to be even.
- Normality: The residuals are normally distributed. We check the Q-Q plot to see if the residuals follow a straight diagonal line with little to no deviation.
- No multicollinearity: We check a correlation matrix to see if any predictors are highly correlated. We do not want any high correlations.
crPlots(food_model)
Looking at the Component + Residual plots I notice that they all look good and the linearity assumption is met. Looking at total_fat the loess line seems to follow the dashed line almost exactly showing linearity. For sodium the loess line follows the dashed line only deviating slightly near the end showing some non-linearity but not too much. The same can be said for the protein, sugar, and cholesterol plots where their loess lines follow the dashed lines as well only deviating slightly near the ends showing good linearity.
plot(resid(food_model), type = "b",
     main = "Residuals vs Order", ylab = "Residuals")
abline(h = 0, lty = 2)
Looking at the Residuals vs Order plot there is a single burst around the 130 index but everywhere else seems to have the residuals scattered randomly around zero. This shows that the independence assumption is met.
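A numeric complement to the Residuals vs Order plot is the Durbin-Watson statistic, which compares each residual with the one before it. The sketch below computes it by hand on simulated data, which is an assumption made only so the example runs on its own; `x`, `y`, `fit`, and `dw` are illustrative names and are not part of the analysis above.

```r
# Simulated stand-in data with independent errors (NOT the fastfood data)
set.seed(101)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)
e   <- resid(fit)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals. It always lies between 0 and 4.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values near 2 are consistent with uncorrelated neighboring residuals, while values well below 2 suggest positive autocorrelation. For the model in this report, the same formula could be applied to `resid(food_model)`, keeping in mind that the row order of the file determines which residuals count as neighbors.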
par(mfrow=c(2,2)); plot(food_model); par(mfrow=c(1,1))
Residuals vs Fitted: The residuals are scattered randomly around zero at the lower fitted values, but at the higher fitted values there is a slight downward curve, showing some non-linearity. Overall, linearity is acceptable.
Scale-Location: The residuals are spread fairly evenly around the line, and the red line curves up slightly toward the higher fitted values, showing some heteroscedasticity, but it is not too severe.
Q-Q Residuals: The residuals mostly follow the line, with a large deviation for two residuals near the beginning and a slight deviation near the end, showing that the residuals are not exactly normally distributed.
Residuals vs Leverage: Most of the points lie well within the Cook's distance contours. Only three observations (206, 193, and 128) appear near a Cook's distance line, suggesting moderate influence.
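The Cook's distance values behind the Residuals vs Leverage plot can also be listed directly instead of being read off the plot. The sketch below uses the built-in mtcars data as a stand-in so that it runs without the CSV file; the 4/n cutoff is a common rule of thumb and an assumption here, not a value taken from this report.

```r
# Stand-in model on built-in data (NOT the fastfood model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's distance per observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): flag observations above 4 / n
n         <- nrow(mtcars)
threshold <- 4 / n
flagged   <- sort(cd[cd > threshold], decreasing = TRUE)

print(flagged)
```

For this report, replacing `fit` with `food_model` and `mtcars` with `fastfood` would show whether observations 206, 193, and 128 stand out numerically as well as visually.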
cor(fastfood[, c("total_fat", "sodium", "protein", "sugar", "cholesterol")], use = "complete.obs")
## total_fat sodium protein sugar cholesterol
## total_fat 1.0000000 0.6691816 0.7136267 0.2593702 0.8013520
## sodium 0.6691816 1.0000000 0.7659422 0.4229934 0.5961644
## protein 0.7136267 0.7659422 1.0000000 0.3894746 0.8660469
## sugar 0.2593702 0.4229934 0.3894746 1.0000000 0.2982589
## cholesterol 0.8013520 0.5961644 0.8660469 0.2982589 1.0000000
There are high correlations between total_fat and cholesterol (0.801) and between protein and cholesterol (0.866). There are moderate correlations between total_fat and protein (0.714), sodium and protein (0.766), sodium and total_fat (0.669), and sodium and cholesterol (0.596). Correlations like these inflate the standard errors of the coefficients, which makes the individual estimates and their p-values less reliable and the coefficients harder to interpret. It would probably be best to remove some predictors to reduce this overlap.
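Pairwise correlations only look at two predictors at a time. A variance inflation factor (VIF) summarizes how strongly each predictor is explained by all of the others together. For a linear model with only numeric predictors, the VIFs are the diagonal of the inverse of the predictor correlation matrix, so they can be sketched from the matrix printed above. The names `vars`, `R`, and `vif_values` are made up for this illustration.

```r
vars <- c("total_fat", "sodium", "protein", "sugar", "cholesterol")

# Correlation matrix copied from the cor() output above
R <- matrix(c(
  1.0000000, 0.6691816, 0.7136267, 0.2593702, 0.8013520,
  0.6691816, 1.0000000, 0.7659422, 0.4229934, 0.5961644,
  0.7136267, 0.7659422, 1.0000000, 0.3894746, 0.8660469,
  0.2593702, 0.4229934, 0.3894746, 1.0000000, 0.2982589,
  0.8013520, 0.5961644, 0.8660469, 0.2982589, 1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# VIF for each predictor = diagonal of the inverse correlation matrix
vif_values <- diag(solve(R))
print(round(vif_values, 2))
```

A VIF of 1 means no inflation, and values above roughly 5 or 10 are often treated as a warning sign, although those cutoffs are conventions rather than rules. Since the car package is already loaded in this report, `vif(food_model)` should report the same quantities directly from the fitted model.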
residuals_multiple <- resid(food_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 74.2949
The RMSE is 74.2949, which means the model's predictions typically miss by about 74.3 calories.
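The RMSE here is slightly smaller than the residual standard error of 74.73 reported by summary(food_model) because the two divide the same sum of squared residuals by different numbers: the RMSE divides by the 515 observations, while the residual standard error divides by the 509 residual degrees of freedom. The sketch below reproduces that relationship from the printed values and also expresses the RMSE relative to the mean of calories; the variable names are illustrative only.

```r
n        <- 515     # observations in the data set
df_resid <- 509     # residual degrees of freedom from summary(food_model)
rse      <- 74.73   # residual standard error from summary(food_model)
mean_cal <- 530.9   # mean of calories from summary(fastfood)

# Recover the sum of squared residuals, then divide by n instead of df
sse  <- rse^2 * df_resid
rmse <- sqrt(sse / n)
round(rmse, 2)            # about 74.29, matching rmse_multiple up to rounding

# RMSE as a share of the average item's calories
round(rmse / mean_cal, 2) # about 0.14
```

Read this way, a typical prediction error is roughly 14% of the average item's calories.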
This multiple regression model helps answer which factors play a key role in determining the number of calories in fast food items. Holding the other predictors constant, total fat, protein, and sugar are associated with the largest increases in calories per unit, while cholesterol is associated with a decrease. Sodium has the smallest effect per unit, increasing calories by only about 0.07 for each 1-unit increase.
When checking the assumptions and diagnostics, some assumptions were not met completely: there was a slight trace of heteroscedasticity, and the residuals were not exactly normal. Overall, the model explains 93% of the variance in calories, which means the fit is strong. However, the RMSE of about 74 calories is fairly large relative to the mean of about 531 calories, and there is some high multicollinearity, so the model's predictions can vary and the individual coefficients can be unreliable. In the future, since some predictors are highly correlated, we could try removing some of them to see whether the model becomes more accurate and easier to interpret.
Data set: https://www.openintro.org/data/index.php?data=fastfood