DATA 110 Project 1: Nutritional Composition

Author

M Madinko

Nutritional Composition

Introduction: About the Dataset

This project, titled “Nutritional Composition”, examines how macronutrients affect the calorie content of foods. For this purpose, I study the dataset nutrition_food_kaggle.csv, which comes from the Yazio website (https://www.yazio.com/en/foods/hard-candy.html). It contains 48 variables, of which two are categorical and 46 are quantitative. Since I want to show each nutrient's caloric contribution, I will use 7 variables in total: Category, which holds the name of the food; Data.Kilocalories, my dependent variable, which measures energy in kilocalories (kcal); Data.Fat.Total Lipid, the total lipids (fat) in the food; Data.Protein, the proteins; Data.Carbohydrate, the carbohydrates; Data.Sugar Total, the total sugars; and Data.Fiber, the fiber.
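
The split between categorical and quantitative columns can be checked directly rather than counted by hand. The following is a minimal sketch on a small made-up data frame; `toy` is a hypothetical stand-in, and with the real data the same lines would be applied to `foods` after it is loaded.

```r
# Hypothetical three-column stand-in for the real dataset
toy <- data.frame(
  Category = c("BUTTER", "CHEESE"),
  Data.Kilocalories = c(717, 353),
  Data.Protein = c(0.85, 21.4),
  stringsAsFactors = FALSE
)

# Flag each column as numeric or not, then count both groups
is_num <- vapply(toy, is.numeric, logical(1))
type_counts <- c(quantitative = sum(is_num), categorical = sum(!is_num))
type_counts
```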

Load the Libraries and Upload the Dataset

library(tidyverse)
library(ggfortify)
library(plotly)
setwd("C:/Users/monik/OneDrive/Desktop/DATA 110")
foods <- read_csv('nutrition_food_kaggle.csv')
head(foods) # show the first six rows of the dataset
# A tibble: 6 × 48
  Category   Description   Nutrient Data Bank N…¹ `Data.Alpha Carotene` Data.Ash
  <chr>      <chr>                          <dbl>                 <dbl>    <dbl>
1 BUTTER     BUTTER,WITH …                   1001                     0     2.11
2 BUTTER     BUTTER,WHIPP…                   1002                     0     2.11
3 BUTTER OIL BUTTER OIL,A…                   1003                     0     0   
4 CHEESE     CHEESE,BLUE                     1004                     0     5.11
5 CHEESE     CHEESE,BRICK                    1005                     0     3.18
6 CHEESE     CHEESE,BRIE                     1006                     0     2.7 
# ℹ abbreviated name: ¹​`Nutrient Data Bank Number`
# ℹ 43 more variables: `Data.Beta Carotene` <dbl>,
#   `Data.Beta Cryptoxanthin` <dbl>, Data.Carbohydrate <dbl>,
#   Data.Cholesterol <dbl>, Data.Choline <dbl>, Data.Fiber <dbl>,
#   Data.Kilocalories <dbl>, `Data.Lutein and Zeaxanthin` <dbl>,
#   Data.Lycopene <dbl>, Data.Manganese <dbl>, Data.Niacin <dbl>,
#   `Data.Pantothenic Acid` <dbl>, Data.Protein <dbl>, …

Data Cleaning

# replace spaces in variable names with dots
colnames(foods) <- gsub(" ", ".", colnames(foods))
# rename the variables that I will use
foods_new <- foods |>
  rename(foods_name = Category,
         calories = Data.Kilocalories,
         protein = Data.Protein,
         fat_lipid = Data.Fat.Total.Lipid,
         carbohydrates = Data.Carbohydrate,
         sugar = Data.Sugar.Total,
         fiber = Data.Fiber)
# select only the variables needed
foods_new1 <- foods_new |>
  select(foods_name, calories, protein, fat_lipid, carbohydrates, sugar, fiber)
head(foods_new1)
# A tibble: 6 × 7
  foods_name calories protein fat_lipid carbohydrates  sugar fiber
  <chr>         <dbl>   <dbl>     <dbl>         <dbl>  <dbl> <dbl>
1 BUTTER          717    0.85      81.1          0.06 0.0600     0
2 BUTTER          717    0.85      81.1          0.06 0.0600     0
3 BUTTER OIL      876    0.28      99.5          0    0          0
4 CHEESE          353   21.4       28.7          2.34 0.5        0
5 CHEESE          371   23.2       29.7          2.79 0.510      0
6 CHEESE          334   20.8       27.7          0.45 0.450      0
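
The essay at the end of this report mentions a check for missing values, but the code for it is not shown. One possible way to do such a check is sketched below in base R. The data frame `toy` is a hypothetical stand-in with one missing value added on purpose; this is not necessarily the check that was actually run.

```r
# Hypothetical stand-in for foods_new1, with one NA added on purpose
toy <- data.frame(
  calories = c(717, 353, NA),
  protein  = c(0.85, 21.4, 23.2),
  fiber    = c(0, 0, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

With the real data, `colSums(is.na(foods_new1))` would give the same per-column count.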

Making a Multiple Regression Model

fit1 <- lm(calories ~ protein + fat_lipid + carbohydrates + sugar + fiber,
           data = foods_new1)
summary(fit1)

Call:
lm(formula = calories ~ protein + fat_lipid + carbohydrates + 
    sugar + fiber, data = foods_new1)

Residuals:
     Min       1Q   Median       3Q      Max 
-184.925   -4.286   -0.055    3.147  290.374 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.62624    0.36307  12.742   <2e-16 ***
protein        4.08597    0.01882 217.134   <2e-16 ***
fat_lipid      8.79237    0.01155 761.463   <2e-16 ***
carbohydrates  3.92083    0.01005 390.248   <2e-16 ***
sugar         -0.01096    0.01748  -0.627    0.531    
fiber         -2.18147    0.05027 -43.398   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.3 on 7407 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.991 
F-statistic: 1.63e+05 on 5 and 7407 DF,  p-value: < 2.2e-16

autoplot(fit1, 1:4, nrow = 2, ncol = 2) # or plot(fit1)

Looking at the Plots to Diagnose Whether the Linear Model Is Appropriate

The blue line on the Residuals vs Fitted plot is relatively horizontal, so a linear model is appropriate. The Q-Q plot indicates that the residuals are relatively normally distributed, apart from a few outliers, which are labeled with their row numbers. The Scale-Location plot indicates homogeneous variance (homoscedasticity): the line is almost straight, meaning that the points are evenly spread.
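
The visual reading of the diagnostic plots can be backed up with numbers. The sketch below uses simulated data, not the real foods dataset, and all names in it (`sim`, `fit_sim`) are hypothetical. It flags rows with large standardized residuals and lists the rows with the largest Cook's distance, the quantity shown in the fourth diagnostic plot. The cutoff of 3 is a common rule of thumb, not a fixed standard.

```r
set.seed(110)  # arbitrary seed so the simulated data is reproducible

# Simulated data, not the real foods dataset
n <- 200
sim <- data.frame(fat = runif(n, 0, 100), protein = runif(n, 0, 30))
sim$calories <- 9 * sim$fat + 4 * sim$protein + rnorm(n, sd = 15)

fit_sim <- lm(calories ~ fat + protein, data = sim)

# Standardized residuals: rows beyond about +/-3 are often treated as outliers
std_res <- rstandard(fit_sim)
outlier_rows <- which(abs(std_res) > 3)
outlier_rows

# Cook's distance: how strongly each row influences the fitted coefficients
cooks <- cooks.distance(fit_sim)
head(sort(cooks, decreasing = TRUE), 3)
```

Applied to the real model, `rstandard(fit1)` and `cooks.distance(fit1)` would identify the same row numbers that the plots label.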

First Backward Elimination

I am trying to predict calories. The adjusted R-squared value is 0.991 (99.1%), which is high, but one variable does not appear to be as significant as the others: sugar, with a p-value of 0.531 > 0.05. I therefore drop the sugar variable, because it is not statistically significant, and re-run the model.
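
The elimination step described above can also be done without retyping the whole formula. The sketch below uses simulated data in which the `noise` column has no real effect, standing in for a non-significant predictor. `drop1()` and `update()` are base R functions from the stats package; the data and model names are hypothetical.

```r
set.seed(110)  # arbitrary seed so the simulated data is reproducible

# Simulated data: `noise` has no real effect on `calories`
n <- 300
sim <- data.frame(
  fat = runif(n, 0, 100),
  protein = runif(n, 0, 30),
  noise = rnorm(n)
)
sim$calories <- 9 * sim$fat + 4 * sim$protein + rnorm(n, sd = 15)

full <- lm(calories ~ fat + protein + noise, data = sim)

# F-test for dropping each term in turn; for single-coefficient terms the
# p-values match the t-test p-values printed by summary(full)
drop1(full, test = "F")

# Remove one term while keeping the rest of the formula unchanged
reduced <- update(full, . ~ . - noise)
names(coef(reduced))
```

For the model in this report, the equivalent call would be `update(fit1, . ~ . - sugar)`.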

fit2 <- lm(calories ~ protein + fat_lipid + carbohydrates + fiber, data = foods_new1)
summary(fit2)

Call:
lm(formula = calories ~ protein + fat_lipid + carbohydrates + 
    fiber, data = foods_new1)

Residuals:
     Min       1Q   Median       3Q      Max 
-184.628   -4.271   -0.062    3.155  290.384 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.616426   0.362720   12.73   <2e-16 ***
protein        4.086994   0.018747  218.01   <2e-16 ***
fat_lipid      8.792096   0.011538  762.01   <2e-16 ***
carbohydrates  3.917046   0.008038  487.31   <2e-16 ***
fiber         -2.174567   0.049043  -44.34   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.29 on 7408 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.991 
F-statistic: 2.038e+05 on 4 and 7408 DF,  p-value: < 2.2e-16

autoplot(fit2, 1:4, nrow = 2, ncol = 2) # or plot(fit2)

The adjusted R-squared value is still 0.991, which indicates that 99.1% of the variation in calories is explained by the model, and all p-values are less than 0.05, meaning that every variable included in the model is a statistically significant predictor of calories. This suggests that each remaining nutrient contributes meaningfully to explaining the variation in caloric content.
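
The claim that dropping a predictor leaves the fit essentially unchanged can be checked by comparing the two models side by side. The following sketch does this on simulated data (all names are hypothetical): it extracts the adjusted R-squared from each model and runs a nested-model F-test with `anova()`.

```r
set.seed(110)  # arbitrary seed so the simulated data is reproducible

# Simulated data: `noise` has no real effect on `calories`
n <- 300
sim <- data.frame(fat = runif(n, 0, 100),
                  protein = runif(n, 0, 30),
                  noise = rnorm(n))
sim$calories <- 9 * sim$fat + 4 * sim$protein + rnorm(n, sd = 15)

full    <- lm(calories ~ fat + protein + noise, data = sim)
reduced <- lm(calories ~ fat + protein, data = sim)

# Adjusted R-squared of both models, side by side
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
round(adj_r2, 4)

# Nested-model F-test: a large p-value means the dropped term added little
anova(reduced, full)
```

With the models from this report, the corresponding call would be `anova(fit2, fit1)`.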

Linear Equation

calories = 4.616 + 4.087(protein) + 8.792(fat) + 3.917(carbs) - 2.175(fiber)

Holding the other variables constant: if protein increases by 1 gram, calories increase by 4.087 kilocalories; if fat increases by 1 gram, calories increase by 8.792 kilocalories; if carbohydrates increase by 1 gram, calories increase by 3.917 kilocalories; and if fiber increases by 1 gram, calories decrease by 2.175 kilocalories.
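
As a worked example, the equation can be applied by hand to the first butter row of the cleaned table. The helper `predict_calories` below is hypothetical and simply hard-codes the rounded coefficients from the equation. The inputs are the values as displayed in the table, where fat is rounded to 81.1, so the result is only approximate.

```r
# Coefficients copied from the fitted equation (rounded to three decimals)
coefs <- c(intercept = 4.616, protein = 4.087, fat = 8.792,
           carbs = 3.917, fiber = -2.175)

# Hypothetical helper: predicted kilocalories from grams of each nutrient
predict_calories <- function(protein, fat, carbs, fiber) {
  coefs[["intercept"]] + coefs[["protein"]] * protein + coefs[["fat"]] * fat +
    coefs[["carbs"]] * carbs + coefs[["fiber"]] * fiber
}

# First butter row as displayed in the cleaned table
butter <- predict_calories(protein = 0.85, fat = 81.1, carbs = 0.06, fiber = 0)
butter

# Effect of one extra gram of fiber with everything else unchanged
predict_calories(0.85, 81.1, 0.06, 1) - butter
```

The prediction comes out at about 721.4 kcal, against the 717 kcal recorded for butter. The slopes for protein, fat, and carbohydrates are also close to the conventional factors of 4, 9, and 4 kcal per gram, which is one reason a very high R-squared is plausible for this kind of data.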

Scatterplot: Relationship Between fat_lipid and Calories, Colored by Protein

plot1 <- ggplot(foods_new, aes(x = fat_lipid, y = calories, color = protein)) +
  geom_point(alpha = 0.6) +
  xlim(0, 100) +
  ylim(0, 1000) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "Relationship between Fat and Calories",
    x = "Fat (g)",
    y = "Calories",
    caption = "Source: Yazio nutrition website\nhttps://www.yazio.com/en/foods/hard-candy.html",
    color = "Protein"
  ) +
  theme_minimal(base_size = 12)
plot1

Scatterplot Sizing by Carbohydrates

plot2 <- plot1 +
  aes(size = carbohydrates)
plot2

Reduce the Food Items to Make the Scatterplot More Readable

foods_small <- head(foods_new1, 200) # use only the first 200 rows of the dataset
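
Because `head()` keeps rows in file order, the subset may over-represent whichever food groups come first (the first rows shown earlier are all butter and cheese). A random sample is one alternative. The sketch below illustrates the difference on a made-up data frame; `toy` and its four food groups are hypothetical.

```r
set.seed(110)  # arbitrary seed so the same rows are drawn each time

# Hypothetical stand-in for foods_new1: 1000 rows ordered by food group
toy <- data.frame(
  foods_name = rep(c("BUTTER", "CHEESE", "BREAD", "APPLE"), each = 250),
  calories = round(runif(1000, 0, 900))
)

# head() keeps only the groups that come first in file order
table(head(toy, 200)$foods_name)

# A random sample of 200 rows draws from every group
toy_sample <- toy[sample(nrow(toy), 200), ]
table(toy_sample$foods_name)
```

With the tidyverse already loaded, `slice_sample(foods_new1, n = 200)` from dplyr would do the same on the real data.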

My Final Plot: Relationship Between Macronutrients and Calories: A Multivariate Scatterplot without Interactivity

plot_reduce <- ggplot(foods_small,
                      aes(x = fat_lipid,
                          y = calories,
                          color = protein,
                          size = carbohydrates)) +
  geom_point(alpha = 0.6) +
  xlim(0,100) +
  ylim(0,1000) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "Relationship Between Macronutrients and Calories",
    x = "Fat (g)",
    y = "Calories",
    color = "Protein",
    size = "Carbohydrates",
    caption = "Source: Yazio nutrition website"
  ) +
  theme_minimal(base_size = 12)
plot_reduce

My Final Plot: Relationship Between Macronutrients and Calories: A Multivariate Scatterplot With Interactivity

I used the ggplotly function to add interactivity, which caused the carbohydrates (size) legend to be lost.

plot_reduce <- ggplot(foods_small,
                      aes(x = fat_lipid,
                          y = calories,
                          color = protein,
                          size = carbohydrates,
                          # tooltip text: food name and fiber content
                          text = paste("Foods:", foods_name, "\nFiber:", fiber))) +
  geom_point(alpha = 0.6) +
  xlim(0,100) +
  ylim(0,1000) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "Relationship Between Macronutrients and Calories",
    x = "Fat (g)",
    y = "Calories",
    color = "Protein",
    size = "Carbohydrates",
    caption = "Source: Yazio nutrition website"
  ) +
  theme_minimal(base_size = 12)

ggplotly(plot_reduce)

Essay

My dataset originally contained a total of 48 variables. First, I noticed that some variable names contained spaces. I therefore used the gsub function to replace the spaces with dots, since names with spaces are awkward to use in R and must be wrapped in backticks. Next, I selected only the variables relevant to the study, namely the food name and six nutritional variables, and removed the unnecessary columns. After that, I checked for missing values. I also renamed some variables because their original names were too long and difficult to use. Before producing the final visualization, I created a subset of the data (foods_small) containing the first 200 observations using the head function. This was done to improve readability and reduce overplotting.

The final visualization is a multivariate scatterplot showing the relationship between macronutrients and food calories. The x-axis represents fat content, while the y-axis represents calories. The color of the points represents protein, and the size of the points represents carbohydrates. From the graph, a clear positive relationship can be observed between macronutrients and calories: foods that are rich in fat, carbohydrates, and protein tend to have the highest caloric values.

I encountered difficulties during the backward elimination process: once sugar was removed, all remaining variables had p-values lower than 0.05 and the adjusted R² was very high. This made me question whether there was an error in my model. I also faced challenges in integrating multiple aesthetic mappings (color, size, etc.) in a single ggplot visualization. Finally, the large number of data points sometimes made the visualization more difficult to interpret.