Project 3 Final Paper

#Introduction

Does the nutritional content of fast food items predict the total number of calories in an item?

The dataset used for this analysis is the Nutrition in Fast Food Dataset from OpenIntro Statistics, containing 515 menu items across multiple restaurants. Each observation corresponds to a menu item and includes quantitative nutritional information (calories, total fat, saturated fat, trans fat, cholesterol, sodium, total carbohydrates, fiber, sugar, protein) and categorical variables (salad, restaurant). Items with missing nutritional data are excluded from the analysis.

Source: OpenIntro Fast Food Dataset

#Data Analysis

Exploratory Data Analysis (EDA) was performed to understand distributions, detect missing values, and examine relationships among predictors and the outcome (calories). Plots such as scatterplots, histograms, and boxplots were generated to visualize patterns.

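# Load packages and dataset

library(dplyr)    # glimpse(), select(), summarise(), %>%
library(tidyr)    # drop_na()
library(ggplot2)  # ggplot()

setwd("C:/Users/FAHIMA KAMAL/Documents")
fastfood <- read.csv("fastfood.csv", stringsAsFactors = FALSE)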

# Inspect dataset

glimpse(fastfood)

## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories    <int> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat     <int> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat   <int> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <int> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium      <int> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb  <int> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber       <int> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar       <int> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein     <int> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a       <int> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c       <int> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium     <int> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…

# Select relevant variables and remove missing values

fastfood_clean <- fastfood %>%
  select(calories, total_fat, sat_fat, trans_fat, cholesterol, sodium,
         total_carb, fiber, sugar, protein, salad, restaurant) %>%
  drop_na()

# Convert categorical variables to factors

fastfood_clean$salad <- factor(fastfood_clean$salad)
fastfood_clean$salad <- droplevels(fastfood_clean$salad)

fastfood_clean$restaurant <- factor(fastfood_clean$restaurant)
fastfood_clean$restaurant <- droplevels(fastfood_clean$restaurant)

# Remove factors with only one level (prevents contrasts error)
# Test every column directly so the logical index lines up with the column names

single_level <- sapply(fastfood_clean, function(x) is.factor(x) && nlevels(x) < 2)
single_level_factors <- names(fastfood_clean)[single_level]

fastfood_model_data <- fastfood_clean %>%
  select(-all_of(single_level_factors))
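
The contrasts error mentioned in the comment above can be reproduced on a toy data frame. The sketch below is illustrative only: `toy`, `y`, `x`, and `group` are hypothetical names, not part of the fastfood data.

```r
# Toy data (hypothetical): `group` is a factor with a single level
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2),
  x = c(1, 2, 3, 4),
  group = factor(rep("Other", 4))
)

# lm() cannot build contrasts for a one-level factor, so it stops with an error
fit_error <- tryCatch(
  lm(y ~ x + group, data = toy),
  error = function(e) conditionMessage(e)
)
print(fit_error)

# Dropping the one-level factor lets the model fit
keep <- !sapply(toy, function(col) is.factor(col) && nlevels(col) < 2)
fit_ok <- lm(y ~ ., data = toy[, keep, drop = FALSE])
print(names(coef(fit_ok)))
```

The same column-by-column test is what removes a one-level factor from the modelling data before `lm()` is called.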

# Summary statistics

fastfood_clean %>%
  summarise(
    mean_cal = mean(calories),
    sd_cal = sd(calories),
    min_cal = min(calories),
    max_cal = max(calories)
  )

##   mean_cal   sd_cal min_cal max_cal
## 1 524.4533 280.1736      20    2430

# Scatterplot: Calories vs Total Fat

ggplot(fastfood_clean, aes(x = total_fat, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Calories vs Total Fat", x = "Total Fat (g)", y = "Calories")

## `geom_smooth()` using formula = 'y ~ x'

#Regression Analysis

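Multiple linear regression was performed with calories as the outcome variable and the nutritional variables and restaurant as predictors. The model was fit to 503 complete observations (486 residual degrees of freedom plus 17 estimated coefficients), so 12 of the 515 items were dropped because of missing values.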

# Fit multiple linear regression using valid variables

mlr_model <- lm(calories ~ ., data = fastfood_model_data)

# Model summary

summary(mlr_model)

## 
## Call:
## lm(formula = calories ~ ., data = fastfood_model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -868.70   -5.73   -0.12    6.54  186.25 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.246432   7.828279  -0.926   0.3551    
## total_fat              8.141509   0.300297  27.111   <2e-16 ***
## sat_fat                1.637233   0.830342   1.972   0.0492 *  
## trans_fat              0.113796   4.632549   0.025   0.9804    
## cholesterol           -0.140106   0.112419  -1.246   0.2133    
## sodium                 0.010898   0.006222   1.752   0.0805 .  
## total_carb             3.955423   0.174588  22.656   <2e-16 ***
## fiber                 -0.695000   1.118803  -0.621   0.5348    
## sugar                 -0.266917   0.409213  -0.652   0.5145    
## protein                4.340156   0.374541  11.588   <2e-16 ***
## restaurantBurger King 14.472444   8.967136   1.614   0.1072    
## restaurantChick Fil-A  4.677410  11.623142   0.402   0.6876    
## restaurantDairy Queen 15.366829   9.554267   1.608   0.1084    
## restaurantMcdonalds   12.035107   9.198944   1.308   0.1914    
## restaurantSonic        1.339037   9.141257   0.146   0.8836    
## restaurantSubway       3.621285   8.893453   0.407   0.6841    
## restaurantTaco Bell    9.281463   8.589327   1.081   0.2804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.4 on 486 degrees of freedom
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9737 
## F-statistic:  1164 on 16 and 486 DF,  p-value: < 2.2e-16

Interpretation of Coefficients:

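Each coefficient represents the expected change in calories for a one-unit increase in the predictor, holding all other variables constant. For example, each additional gram of total fat is associated with about 8.14 more calories, each gram of total carbohydrates with about 3.96, and each gram of protein with about 4.34.

For the categorical variable restaurant, each coefficient is the difference in expected calories between that restaurant and the reference restaurant (the level that does not appear in the coefficient table), holding the nutritional variables constant; none of these differences is significant at the 0.05 level. The variable salad has no coefficient because it has only one level after cleaning and was removed before fitting.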
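
As a worked illustration of this interpretation, the sketch below plugs three of the estimates reported in the model summary into a short calculation. The vector `b` and the helper `delta_calories()` are names introduced here for illustration; the numbers are copied from the coefficient table, and all other predictors are assumed to be held fixed.

```r
# Estimates copied from the summary(mlr_model) output above
b <- c(total_fat = 8.141509, total_carb = 3.955423, protein = 4.340156)

# Hypothetical helper: expected change in calories for given changes (in grams)
# in the three predictors, holding everything else constant
delta_calories <- function(d_fat = 0, d_carb = 0, d_protein = 0) {
  unname(b["total_fat"] * d_fat + b["total_carb"] * d_carb + b["protein"] * d_protein)
}

delta_calories(d_fat = 10)                 # 10 more grams of fat
delta_calories(d_carb = 5, d_protein = 5)  # 5 more grams each of carbs and protein
```

Under this model, 10 additional grams of fat correspond to roughly 81 more calories.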

#Model Assumptions and Diagnostics

Linearity & Homoscedasticity

par(mfrow = c(2,2))
plot(mlr_model)

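Residuals vs Fitted: Check linearity and constant variance. The minimum residual of -868.70 in the model summary, against first and third quartiles of -5.73 and 6.54, points to at least one heavily over-predicted item that should stand out in this plot.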

Normal Q-Q: Check normality of residuals.

Scale-Location: Check homoscedasticity.

Residuals vs Leverage: Identify influential points.
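
A numeric complement to the Residuals vs Leverage plot is Cook's distance. The sketch below is a minimal illustration on simulated data, not on the fastfood model: `sim`, `fit`, and the planted outlier are hypothetical, and the 4/n cutoff is a common rule of thumb rather than a threshold taken from this analysis.

```r
set.seed(1)

# Simulated stand-in data (hypothetical), with one deliberately distorted row
n <- 50
sim <- data.frame(fat = runif(n, 0, 60), carb = runif(n, 0, 100))
sim$calories <- 9 * sim$fat + 4 * sim$carb + rnorm(n, sd = 10)
sim$fat[n] <- 150                          # extreme predictor value (high leverage)
sim$calories[n] <- sim$calories[n] - 800   # response far below the trend

fit <- lm(calories ~ fat + carb, data = sim)

# Cook's distance combines residual size and leverage for each observation
cd <- cooks.distance(fit)

# Rule of thumb (assumption): flag observations with Cook's distance above 4 / n
flagged <- which(cd > 4 / n)
print(flagged)
```

Applied to the fitted regression, the same two calls (`cooks.distance()` and `which()`) would list the row numbers of candidate influential items for closer inspection.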

Multicollinearity

# Function to calculate VIF manually

vif_manual <- function(model) {
  X <- model.matrix(model)[, -1] # Remove intercept
  vif_values <- sapply(seq_len(ncol(X)), function(i) {
    r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
    1 / (1 - r2)
  })
  names(vif_values) <- colnames(X)
  return(vif_values)
}

vif_manual(mlr_model)

##             total_fat               sat_fat             trans_fat 
##              7.174164              6.609146              3.577680 
##           cholesterol                sodium            total_carb 
##             11.822911              4.484357              4.669897 
##                 fiber                 sugar               protein 
##              2.812386              1.884654             10.719967 
## restaurantBurger King restaurantChick Fil-A restaurantDairy Queen 
##              2.061279              1.557003              1.704588 
##   restaurantMcdonalds       restaurantSonic      restaurantSubway 
##              2.074723              1.922094              2.980444 
##   restaurantTaco Bell 
##              3.174843

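A Variance Inflation Factor (VIF) above 5 may indicate multicollinearity. Here total_fat (7.17), sat_fat (6.61), protein (10.72), and cholesterol (11.82) exceed that threshold, so their coefficient estimates should be interpreted with caution.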
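
The quantity computed by `vif_manual()` can be written compactly. In the notation below, which is introduced here rather than taken from the text, $R_j^2$ is the R-squared obtained by regressing the j-th column of the design matrix on all the other columns:

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]
```

For instance, a VIF of 5 corresponds to $R_j^2 = 0.8$, and a VIF of 10 corresponds to $R_j^2 = 0.9$.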

#Conclusion and Future Directions

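The multiple linear regression model shows how different nutritional components contribute to calorie content in fast food items. Total fat, total carbohydrates, and protein are the strongest predictors (all p < 2e-16), and saturated fat is significant at the 0.05 level (p = 0.0492); the remaining nutritional variables and the restaurant indicators are not significant at that level.

Model fit: R² = 0.9746 and adjusted R² = 0.9737, so the predictors explain about 97% of the variability in calories, with a residual standard error of 45.4 calories.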
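
The relationship between the two fit statistics can be checked by hand from the model summary. The sketch below uses the standard adjusted R-squared formula; n = 503 is inferred from the reported degrees of freedom (486 residual degrees of freedom plus 17 estimated coefficients), and the variable names are introduced here for illustration.

```r
r2 <- 0.9746      # Multiple R-squared as printed in the summary (rounded)
df_resid <- 486   # residual degrees of freedom
p <- 16           # number of predictors, excluding the intercept
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
print(n)
print(adj_r2)
```

Because the printed R-squared is itself rounded, the result agrees with the reported adjusted value only to about three decimal places.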

Limitations:

Several nutritional variables are correlated (e.g., total fat and saturated fat), as the VIF values above 5 suggest, which makes the individual coefficient estimates less stable.

Dataset may not include all restaurant items or account for portion size differences.

Future Directions:

Add interaction terms (e.g., fat × protein).

Consider regularized regression (ridge or lasso) to handle multicollinearity.

Expand the analysis to other categorical predictors. Restaurant is already in the model, and salad could not be used here because it has only one level.
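
As a rough sketch of the first direction, an interaction term can be added with R's `*` formula operator and compared against the main-effects model. The data below are simulated stand-ins (hypothetical), not the fastfood dataset, so the output says nothing about whether the interaction matters for real menu items.

```r
set.seed(42)

# Simulated stand-in data (hypothetical), not the fastfood dataset
n <- 200
sim <- data.frame(total_fat = runif(n, 0, 60), protein = runif(n, 0, 50))
sim$calories <- 9 * sim$total_fat + 4 * sim$protein + rnorm(n, sd = 20)

# Main-effects model and a model that adds the fat x protein interaction
fit_main <- lm(calories ~ total_fat + protein, data = sim)
fit_int  <- lm(calories ~ total_fat * protein, data = sim)

# Partial F-test: does the interaction term improve the fit?
print(anova(fit_main, fit_int))
```

In the formula, `total_fat * protein` expands to both main effects plus the `total_fat:protein` product term.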

#References

OpenIntro Statistics. Nutrition in Fast Food Dataset. https://www.openintro.org/data/index.php?data=fastfood

James Chowdhury

2025-12-16