#Introduction
Does the nutritional content of fast food items predict the total number of calories in an item?
The dataset used for this analysis is the Nutrition in Fast Food Dataset from OpenIntro Statistics, containing 515 menu items from eight restaurant chains. Each observation corresponds to a menu item and includes quantitative nutritional information (calories, total fat, saturated fat, trans fat, cholesterol, sodium, total carbohydrates, fiber, sugar, protein) and categorical variables (salad, restaurant). Items with missing nutritional values are excluded from the analysis, which leaves 503 items for modeling.
Source: OpenIntro Fast Food Dataset
#Data Analysis
Exploratory Data Analysis (EDA) was performed to understand distributions, detect missing values, and examine relationships among predictors and the outcome (calories). Plots such as scatterplots, histograms, and boxplots were generated to visualize patterns.
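# Load packages used below (dplyr, tidyr, ggplot2)
library(tidyverse)
# Load dataset (adjust the working directory to where fastfood.csv is stored)
setwd("C:/Users/FAHIMA KAMAL/Documents")
fastfood <- read.csv("fastfood.csv", stringsAsFactors = FALSE)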
# Inspect dataset
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories <int> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat <int> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat <int> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <int> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium <int> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb <int> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber <int> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar <int> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein <int> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a <int> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c <int> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium <int> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
# Select relevant variables and remove missing values
fastfood_clean <- fastfood %>%
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium,
         total_carb, fiber, sugar, protein, salad, restaurant) %>%
  drop_na()
# Convert categorical variables to factors
fastfood_clean$salad <- factor(fastfood_clean$salad)
fastfood_clean$salad <- droplevels(fastfood_clean$salad)
fastfood_clean$restaurant <- factor(fastfood_clean$restaurant)
fastfood_clean$restaurant <- droplevels(fastfood_clean$restaurant)
# Remove factors with only one level (prevents contrasts error)
# Test each column directly; mixing logical vectors of different lengths relies on recycling
# In this dataset salad has the single level "Other", so it is dropped
single_level_factors <- names(fastfood_clean)[
  sapply(fastfood_clean, function(x) is.factor(x) && nlevels(x) < 2)
]
fastfood_model_data <- fastfood_clean %>%
  select(-all_of(single_level_factors))
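To see why this step matters, the following minimal sketch uses a small made-up data frame (the values and the restaurant labels "A" and "B" are illustrative, not taken from the dataset). It shows how a one-level factor can be detected and how `lm()` fails when such a factor is left in.

```r
# Toy data: 'salad' has one level, 'restaurant' has two (illustrative values only)
toy <- data.frame(
  calories   = c(380, 840, 540, 300),
  salad      = factor(c("Other", "Other", "Other", "Other")),
  restaurant = factor(c("A", "A", "B", "B"))
)

# Flag factor columns with fewer than two levels
is_single_level <- sapply(toy, function(x) is.factor(x) && nlevels(x) < 2)
single_level <- names(toy)[is_single_level]
print(single_level)

# Fitting with the one-level factor fails; dropping it first succeeds
fit_bad <- try(lm(calories ~ ., data = toy), silent = TRUE)
fit_ok  <- lm(calories ~ ., data = toy[, !is_single_level])
print(inherits(fit_bad, "try-error"))
```

Testing each column with `is.factor(x) && nlevels(x) < 2` keeps the result the same length as the data frame, so the column names line up without any recycling.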
# Summary statistics
fastfood_clean %>%
  summarise(
    mean_cal = mean(calories),
    sd_cal = sd(calories),
    min_cal = min(calories),
    max_cal = max(calories)
  )
## mean_cal sd_cal min_cal max_cal
## 1 524.4533 280.1736 20 2430
# Scatterplot: Calories vs Total Fat
ggplot(fastfood_clean, aes(x = total_fat, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Calories vs Total Fat", x = "Total Fat (g)", y = "Calories")
## `geom_smooth()` using formula = 'y ~ x'
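The other EDA checks mentioned above (missing values and relationships among variables) can be sketched in base R. This example uses only the first 11 rows shown in the `glimpse()` preview, so it illustrates the mechanics rather than describing the full dataset; the data frame name `preview` is hypothetical.

```r
# First 11 rows as shown in the glimpse() preview above
preview <- data.frame(
  calories  = c(380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380),
  total_fat = c(7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18),
  protein   = c(37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15)
)

# Count missing values per column
na_counts <- colSums(is.na(preview))
print(na_counts)

# Pairwise correlations among the outcome and two predictors
cor_mat <- cor(preview)
print(round(cor_mat, 2))
```

On the full data the same two calls could be applied to `fastfood` and to the numeric columns of `fastfood_clean`, respectively.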
#Regression Analysis
Multiple linear regression was performed with calories as the outcome variable and the nutritional variables plus restaurant as predictors.
# Fit multiple linear regression using valid variables
mlr_model <- lm(calories ~ ., data = fastfood_model_data)
# Model summary
summary(mlr_model)
##
## Call:
## lm(formula = calories ~ ., data = fastfood_model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -868.70 -5.73 -0.12 6.54 186.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.246432 7.828279 -0.926 0.3551
## total_fat 8.141509 0.300297 27.111 <2e-16 ***
## sat_fat 1.637233 0.830342 1.972 0.0492 *
## trans_fat 0.113796 4.632549 0.025 0.9804
## cholesterol -0.140106 0.112419 -1.246 0.2133
## sodium 0.010898 0.006222 1.752 0.0805 .
## total_carb 3.955423 0.174588 22.656 <2e-16 ***
## fiber -0.695000 1.118803 -0.621 0.5348
## sugar -0.266917 0.409213 -0.652 0.5145
## protein 4.340156 0.374541 11.588 <2e-16 ***
## restaurantBurger King 14.472444 8.967136 1.614 0.1072
## restaurantChick Fil-A 4.677410 11.623142 0.402 0.6876
## restaurantDairy Queen 15.366829 9.554267 1.608 0.1084
## restaurantMcdonalds 12.035107 9.198944 1.308 0.1914
## restaurantSonic 1.339037 9.141257 0.146 0.8836
## restaurantSubway 3.621285 8.893453 0.407 0.6841
## restaurantTaco Bell 9.281463 8.589327 1.081 0.2804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.4 on 486 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9737
## F-statistic: 1164 on 16 and 486 DF, p-value: < 2.2e-16
Interpretation of Coefficients:
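Each coefficient represents the expected change in calories for a one-unit increase in the predictor, holding all other variables constant. For example, each additional gram of total fat is associated with about 8.14 more calories.
The categorical variable salad does not appear in the model: it has a single level in this dataset and was removed before fitting. For restaurant, each coefficient is the difference in expected calories between that chain and the baseline chain omitted from the table, holding the nutritional variables constant; none of these differences is significant at the 0.05 level.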
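As a worked illustration of how the coefficients combine, the sketch below computes the fitted calories for the first item in the `glimpse()` preview (a Mcdonalds item listed at 380 calories). It assumes that this row has no missing values and therefore was part of the fitted data; the vector names `coefs` and `item` are introduced here for illustration only.

```r
# Coefficients copied from the summary(mlr_model) output above
coefs <- c(intercept = -7.246432, total_fat = 8.141509, sat_fat = 1.637233,
           trans_fat = 0.113796, cholesterol = -0.140106, sodium = 0.010898,
           total_carb = 3.955423, fiber = -0.695000, sugar = -0.266917,
           protein = 4.340156, restaurantMcdonalds = 12.035107)

# Predictor values for the first row shown by glimpse()
# (1 multiplies the intercept and the Mcdonalds indicator)
item <- c(intercept = 1, total_fat = 7, sat_fat = 2.0, trans_fat = 0.0,
          cholesterol = 95, sodium = 1110, total_carb = 44, fiber = 3,
          sugar = 11, protein = 37, restaurantMcdonalds = 1)

# Fitted value is the sum of coefficient * predictor value
predicted <- sum(coefs * item[names(coefs)])
observed  <- 380
print(round(predicted, 1))             # fitted calories
print(round(observed - predicted, 1))  # residual
```

The fitted value comes out near 393 calories against the listed 380, a residual of roughly -13, which is small relative to the residual standard error of 45.4.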
#Model Assumptions and Diagnostics
Linearity & Homoscedasticity
par(mfrow = c(2,2))
plot(mlr_model)
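Residuals vs Fitted: Check linearity and constant variance. The residual summary above (a minimum of -868.70 against quartiles of -5.73 and 6.54) points to at least one item whose calories the model greatly overpredicts, so outliers should be inspected here.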
Normal Q-Q: Check normality of residuals.
Scale-Location: Check homoscedasticity.
Residuals vs Leverage: Identify influential points.
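The plots can be complemented with numeric checks. The following sketch uses simulated data with one planted outlier (not the fast food data) to show how standardized residuals and Cook's distance single out an unusual observation; the cutoff of 3 for standardized residuals is a common rule of thumb, not something taken from the analysis above.

```r
set.seed(1)

# Simulated data with one planted outlier (not the fast food data)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)
y[10] <- y[10] - 50

fit <- lm(y ~ x)

# Standardized residuals and Cook's distance for each observation
std_res <- rstandard(fit)
cooks   <- cooks.distance(fit)

# Observation with the largest standardized residual, and all beyond the cutoff
worst_residual <- unname(which.max(abs(std_res)))
flagged <- which(abs(std_res) > 3)
print(worst_residual)
print(head(sort(cooks, decreasing = TRUE), 3))
```

Applied to `mlr_model`, the same calls would help locate the item behind the large negative residual so that its data entry can be checked.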
Multicollinearity
# Function to calculate VIF manually
vif_manual <- function(model) {
X <- model.matrix(model)[,-1] # Remove intercept
vif_values <- sapply(1:ncol(X), function(i){
r2 <- summary(lm(X[,i] ~ X[,-i]))$r.squared
1 / (1 - r2)
})
names(vif_values) <- colnames(X)
return(vif_values)
}
vif_manual(mlr_model)
## total_fat sat_fat trans_fat
## 7.174164 6.609146 3.577680
## cholesterol sodium total_carb
## 11.822911 4.484357 4.669897
## fiber sugar protein
## 2.812386 1.884654 10.719967
## restaurantBurger King restaurantChick Fil-A restaurantDairy Queen
## 2.061279 1.557003 1.704588
## restaurantMcdonalds restaurantSonic restaurantSubway
## 2.074723 1.922094 2.980444
## restaurantTaco Bell
## 3.174843
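A Variance Inflation Factor (VIF) above 5 may indicate problematic multicollinearity. Here total_fat (7.17), sat_fat (6.61), cholesterol (11.82), and protein (10.72) exceed that threshold, so their individual coefficient estimates should be interpreted with caution.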
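The threshold check can be done programmatically. This sketch copies the nutrient VIF values from the output above (the restaurant indicators are left out because all of them are below 5) and also backs out the R² of each auxiliary regression using the relation VIF = 1 / (1 - R²), the same formula used in `vif_manual()`.

```r
# VIF values copied from the vif_manual(mlr_model) output above
vif_values <- c(total_fat = 7.174164, sat_fat = 6.609146, trans_fat = 3.577680,
                cholesterol = 11.822911, sodium = 4.484357, total_carb = 4.669897,
                fiber = 2.812386, sugar = 1.884654, protein = 10.719967)

# Keep only predictors above the rule-of-thumb threshold, largest first
threshold <- 5
high_vif <- sort(vif_values[vif_values > threshold], decreasing = TRUE)
print(high_vif)

# Back out the R-squared of each auxiliary regression: R2 = 1 - 1 / VIF
aux_r2 <- 1 - 1 / high_vif
print(round(aux_r2, 3))
```

A VIF of 5 corresponds to an auxiliary R² of 0.8, meaning 80% of that predictor's variance is explained by the other predictors.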
#Conclusion and Future Directions
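The multiple linear regression model shows how different nutritional components contribute to calorie content in fast food items. Total fat, total carbohydrates, and protein are the strongest predictors (all p < 2e-16), with saturated fat marginally significant (p = 0.0492); the remaining nutrients and the restaurant indicators are not significant at the 0.05 level.
Model fit: R² = 0.9746 and adjusted R² = 0.9737, so the predictors explain about 97% of the variability in calories.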
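The relationship between the two fit statistics can be checked by hand. This sketch uses only numbers printed in the model summary; because the printed R² is rounded, the result should be expected to agree with the reported adjusted R² to about three decimals only.

```r
# Values read from the summary(mlr_model) output above
r2       <- 0.9746  # Multiple R-squared (rounded as printed)
df_resid <- 486     # residual degrees of freedom
n_pred   <- 16      # model degrees of freedom from the F-statistic line

# Sample size implied by the degrees of freedom: n = df_resid + n_pred + 1
n <- df_resid + n_pred + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
print(n)
print(round(adj_r2, 3))
```

With 16 estimated slopes and several hundred observations, the penalty is small, which is why the two statistics are nearly identical here.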
Limitations:
Some nutritional variables may be correlated (e.g., total fat and saturated fat).
Dataset may not include all restaurant items or account for portion size differences.
Future Directions:
Add interaction terms (e.g., fat × protein).
Consider regularized regression (ridge or lasso) to handle multicollinearity.
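Expand the treatment of categorical predictors: restaurant is already included as a main effect, and salad could not be used because it has a single level in this dataset, so a more informative item-type variable would be needed.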
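The regularized regression idea listed above can be illustrated with the closed-form ridge estimate in base R. This is a minimal sketch on simulated, deliberately collinear predictors (not the fast food data); the helper `ridge_coef()` and the penalty value of 10 are assumptions made for illustration, and in practice the penalty would be chosen by cross-validation, for example with a package such as glmnet.

```r
set.seed(42)

# Simulated collinear predictors (illustrative only, not the fast food data)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
y  <- 3 * x1 + 2 * x2 + rnorm(n)

# Standardize predictors and center the outcome so no intercept is needed
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, yc, lambda = 0)   # ordinary least squares
beta_ridge <- ridge_coef(X, yc, lambda = 10)  # shrunken coefficients
print(rbind(ols = beta_ols, ridge = beta_ridge))
```

The ridge penalty shrinks the coefficient vector toward zero, trading a little bias for more stable estimates when predictors overlap as strongly as total fat and saturated fat do.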
#References
OpenIntro Statistics. Nutrition in Fast Food Dataset. https://www.openintro.org/data/index.php?data=fastfood