In my project titled “Nutritional Composition”, the goal is to understand the impact of macronutrients on the calories of foods. For this purpose, I study the dataset nutrition_food.csv, which comes from the Yazio website (https://www.yazio.com/en/foods/hard-candy.html). It contains 48 variables, of which two are categorical and 46 are quantitative. Since I want to show the caloric contribution of each macronutrient, I will use a total of 7 variables. Data.Kilocalories is my dependent variable and measures the energy content in kilocalories (kcal). The predictors are Data.Fat.Total Lipid (total lipids in the food), Data.Protein (proteins), Data.Carbohydrate (carbohydrates), Data.Sugar Total (total sugars), and Data.Fiber (fiber). The seventh variable, Category, is kept as the food name.
Load the Libraries and Upload the Dataset
library(tidyverse)
library(ggfortify)
library(plotly)
setwd("C:/Users/monik/OneDrive/Desktop/DATA 110")
foods <- read_csv('nutrition_food_kaggle.csv')
head(foods) # show the first six rows of the dataset
# replace spaces in variable names with dots
colnames(foods) <- gsub(" ", ".", colnames(foods))
# rename the variables that I will use
foods_new <- foods |>
  rename(foods_name = Category,
         calories = Data.Kilocalories,
         protein = Data.Protein,
         fat_lipid = Data.Fat.Total.Lipid,
         carbohydrates = Data.Carbohydrate,
         sugar = Data.Sugar.Total,
         fiber = Data.Fiber)
# select only the variables needed
foods_new1 <- foods_new |>
  select(foods_name, calories, protein, fat_lipid, carbohydrates, sugar, fiber)
head(foods_new1)
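To make the renaming step concrete, here is a minimal sketch on a toy data frame. The two column names mirror the ones mentioned in the introduction, but the values are invented for illustration, and `make.names` is shown only as a base R alternative, not as the method used in this project.

```r
# Toy data frame standing in for the real file (values are invented)
toy <- data.frame(
  "Data.Fat.Total Lipid" = c(0.2, 31.5),
  "Data.Sugar Total"     = c(62.9, 0.1),
  check.names = FALSE  # keep the spaces so there is something to fix
)

# Same idea as in the project: replace every space with a dot
colnames(toy) <- gsub(" ", ".", colnames(toy))
print(colnames(toy))

# Base R alternative: make.names() turns invalid characters into dots
fixed <- make.names(c("Data.Fat.Total Lipid", "Data.Sugar Total"))
print(fixed)
```

Without this step the columns are still usable, but every reference needs backticks, for example `` toy$`Data.Sugar Total` ``.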
fit1 <- lm(calories ~ protein + fat_lipid + carbohydrates + sugar + fiber, data = foods_new1)
summary(fit1)
Call:
lm(formula = calories ~ protein + fat_lipid + carbohydrates +
    sugar + fiber, data = foods_new1)
Residuals:
     Min       1Q   Median       3Q      Max
-184.925   -4.286   -0.055    3.147  290.374
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.62624    0.36307  12.742   <2e-16 ***
protein        4.08597    0.01882 217.134   <2e-16 ***
fat_lipid      8.79237    0.01155 761.463   <2e-16 ***
carbohydrates  3.92083    0.01005 390.248   <2e-16 ***
sugar         -0.01096    0.01748  -0.627    0.531
fiber         -2.18147    0.05027 -43.398   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.3 on 7407 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.991
F-statistic: 1.63e+05 on 5 and 7407 DF, p-value: < 2.2e-16
autoplot(fit1, 1:4, nrow=2, ncol=2) # or plot(fit1)
Looking at the Plots to Diagnose Whether the Linear Model Is Appropriate
The blue line in the Residuals vs Fitted plot is relatively horizontal, so a linear model is appropriate. The Normal Q-Q plot indicates that the residuals are roughly normally distributed, apart from a few outliers, which are labeled by their row number. The Scale-Location plot indicates homogeneous variance (homoscedasticity) because the line is almost straight, meaning that the points are evenly spread.
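The row numbers printed on the diagnostic plots can also be retrieved numerically. The sketch below uses the built-in `mtcars` data as a stand-in for the food data, and the cutoff of 2 for standardized residuals is a common rule of thumb that I am assuming here, not a value taken from this project.

```r
# Stand-in model on a built-in dataset (not the food data)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: residuals rescaled to have roughly unit variance
std_res <- rstandard(fit_demo)

# Flag observations whose standardized residual exceeds the assumed cutoff
flagged <- which(abs(std_res) > 2)

# Show the flagged rows so they can be inspected by hand
print(mtcars[flagged, c("mpg", "wt", "hp")])
```

The same three lines would apply to `fit1` once the model has been fitted, which gives a list of candidate outliers to inspect instead of reading labels off the plot.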
First Backward Elimination
I am trying to predict calories. Note that the adjusted R-squared value is 0.991 (99.1%), which is very high, but one variable does not appear to be as significant as the others: sugar, with a p-value of 0.531 > 0.05. So I drop the sugar variable because it is not statistically significant and re-run the model.
fit2 <- lm(calories ~ protein + fat_lipid + carbohydrates + fiber, data = foods_new1)
summary(fit2)
Call:
lm(formula = calories ~ protein + fat_lipid + carbohydrates +
    fiber, data = foods_new1)
Residuals:
     Min       1Q   Median       3Q      Max
-184.628   -4.271   -0.062    3.155  290.384
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.616426   0.362720   12.73   <2e-16 ***
protein        4.086994   0.018747  218.01   <2e-16 ***
fat_lipid      8.792096   0.011538  762.01   <2e-16 ***
carbohydrates  3.917046   0.008038  487.31   <2e-16 ***
fiber         -2.174567   0.049043  -44.34   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.29 on 7408 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.991
F-statistic: 2.038e+05 on 4 and 7408 DF, p-value: < 2.2e-16
autoplot(fit2, 1:4, nrow=2, ncol=2) # or plot(fit2)
The adjusted R-squared value is still 0.991, which indicates that 99.1% of the variation in calories is explained by the model, and all p-values are less than 0.05, meaning that all variables included in the model are statistically significant predictors of calories. This suggests that each macronutrient contributes meaningfully to explaining the variation in caloric content.
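One backward-elimination step can be sketched in a few lines. The data below are simulated, the variable names `fat` and `noise` are hypothetical, and `update()` is one possible way to refit without a term; the project itself simply retyped the formula.

```r
set.seed(110)
n <- 200

# Simulated stand-in data: two real predictors and one unrelated column
sim <- data.frame(
  protein = runif(n, 0, 30),
  fat     = runif(n, 0, 40),
  noise   = rnorm(n)
)
sim$calories <- 4 * sim$protein + 9 * sim$fat + rnorm(n, sd = 5)

# Full model with every candidate predictor
full <- lm(calories ~ protein + fat + noise, data = sim)

# p-values of the predictors (the intercept row is dropped)
p_values <- summary(full)$coefficients[-1, "Pr(>|t|)"]

# The candidate for removal is the predictor with the largest p-value
weakest <- names(which.max(p_values))
print(weakest)

# Refit the model without that predictor
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))

# Compare adjusted R-squared before and after the step
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

In practice the term is only dropped when its p-value is above the chosen threshold (0.05 in this project), and the step is repeated until every remaining predictor is significant.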
Holding the other variables constant: if protein increases by 1 gram, calories increase by 4.087 kilocalories; if fat increases by 1 gram, calories increase by 8.792 kilocalories; if carbohydrates increase by 1 gram, calories increase by 3.917 kilocalories; and if fiber increases by 1 gram, calories decrease by 2.175 kilocalories.
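As a worked example of these coefficients, the sketch below predicts the calories of one food by hand. The coefficients are copied from the summary(fit2) output; the food itself (10 g protein, 5 g fat, 20 g carbohydrates, 3 g fiber) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(fit2) output
coefs <- c(
  intercept     = 4.616426,
  protein       = 4.086994,
  fat_lipid     = 8.792096,
  carbohydrates = 3.917046,
  fiber         = -2.174567
)

# Hypothetical food, in grams (values invented for illustration)
food <- c(protein = 10, fat_lipid = 5, carbohydrates = 20, fiber = 3)

# Intercept plus each coefficient multiplied by the matching amount
predicted_kcal <- unname(coefs["intercept"] + sum(coefs[names(food)] * food))
print(round(predicted_kcal, 2))
```

The result is about 161.26 kcal. With the fitted object available, `predict(fit2, newdata = data.frame(protein = 10, fat_lipid = 5, carbohydrates = 20, fiber = 3))` should give the same value up to rounding of the copied coefficients.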
Scatterplot: Relationship Between fat_lipid and Calories, Colored by Protein
plot1 <- ggplot(foods_new, aes(x = fat_lipid, y = calories, color = protein)) +
  geom_point(alpha = 0.6) +
  xlim(0, 100) +
  ylim(0, 1000) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Relationship between Fat and Calories",
       x = "Fat (g)",
       y = "Calories",
       caption = "Source: Yazio nutrition website https://www.yazio.com/en/foods/hard-candy.html",
       color = "protein") +
  theme_minimal(base_size = 12)
plot1
Scatterplot Sizing by Carbohydrates
plot2 <- plot1 + aes(size = carbohydrates)
plot2
Reduce the Number of Food Items to Make the Scatterplot More Readable
foods_small <- head(foods_new1, 200) # use only the first 200 rows of the dataset
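Taking the first 200 rows keeps whatever order the file happens to have, so the subset may over-represent the foods listed first. A possible alternative, not the one used in this project, is a random sample of the same size. The sketch below uses a simulated table with hypothetical column names as a stand-in for foods_new1.

```r
set.seed(110)  # fixed seed so the same rows are drawn on every run

# Simulated stand-in for foods_new1 (1000 rows, invented values)
toy_foods <- data.frame(
  id        = seq_len(1000),
  fat_lipid = runif(1000, 0, 100)
)

# Draw 200 distinct row indices at random, then subset
picked_rows <- sample(nrow(toy_foods), 200)
toy_small   <- toy_foods[picked_rows, ]

print(nrow(toy_small))
# A tidyverse equivalent would be dplyr::slice_sample(toy_foods, n = 200)
```

Either approach reduces overplotting; the random sample is more likely to cover the whole range of foods, while `head` is simpler and fully reproducible without a seed.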
My Final Plot: Relationship Between Macronutrients and Calories: A Multivariate Scatterplot Without Interactivity
My Final Plot: Relationship Between Macronutrients and Calories: A Multivariate Scatterplot With Interactivity
I used the ggplotly function to add interactivity, and in the interactive version the carbohydrates (size) legend is lost.
plot_reduce <- ggplot(foods_small,
                      aes(x = fat_lipid,
                          y = calories,
                          color = protein,
                          size = carbohydrates,
                          text = paste("Foods:", foods_name, "\nFiber:", fiber))) + # add the food name and fiber to the interactive tooltip
  geom_point(alpha = 0.6) +
  xlim(0, 100) +
  ylim(0, 1000) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Relationship Between Macronutrients and Calories",
       x = "Fat (g)",
       y = "Calories",
       color = "protein",
       size = "Carbohydrates",
       caption = "Source: Yazio nutrition website") +
  theme_minimal(base_size = 12)
ggplotly(plot_reduce)
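One way to compensate for the missing size legend is to put the carbohydrate value into the hover text and show only that text in the tooltip. This is a minimal sketch on a three-row toy table with invented values, assuming the ggplot2 and plotly packages are installed; it is a possible workaround, not the approach used for the final plot.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for foods_small (values invented for illustration)
toy <- data.frame(
  food          = c("A", "B", "C"),
  fat_lipid     = c(1, 20, 50),
  calories      = c(60, 250, 520),
  carbohydrates = c(12, 5, 30)
)

# The text aesthetic carries the information the size legend would show
p <- ggplot(toy, aes(x = fat_lipid, y = calories, size = carbohydrates,
                     text = paste0("Food: ", food,
                                   "\nCarbohydrates (g): ", carbohydrates))) +
  geom_point(alpha = 0.6)

# tooltip = "text" limits the hover box to the text aesthetic
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
```

Hovering over a point then reports its carbohydrate content directly, so the reader no longer has to estimate it from the point size.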
Essay
My dataset originally contained a total of 48 variables. First, I noticed that the variables were not properly formatted, as some of them contained spaces. I therefore used the gsub function to replace spaces with dots, since variable names with spaces are awkward to use in R (they must be wrapped in backticks). Next, I selected only the relevant nutritional variables for the study, specifically six variables, and removed the unnecessary columns. After that, I checked for missing values. I also renamed some variables because their original names were too long and difficult to use. Before producing the final visualization, I created a subset of the data (foods_small) containing the first 200 observations using the head function. This was done to improve readability and reduce overplotting.
The final visualization is a multivariate scatterplot showing the relationship between macronutrients and food calories. The x-axis represents fat content, while the y-axis represents calories. The color of the points represents protein, and the size of the points represents carbohydrates. From the graph, a clear positive relationship can be observed between macronutrients and calories: foods that are rich in fat, carbohydrates, and protein tend to have the highest caloric values.
I encountered difficulties during the backward elimination process: once sugar was dropped, all remaining variables had p-values lower than 0.05 and the R² value was very high. This made me question whether there was an error in my model. I also faced challenges in combining multiple aesthetic mappings (color, size, etc.) in a single ggplot visualization. Finally, the large number of data points sometimes made the visualization more difficult to interpret.