Final project

Author

Jaiden Soto

Source: Gerber. (n.d.). Organic 1st foods carrot baby food [Product photograph] by Gerber Products Company, 2026, Gerber.

(https://www.gerber.com/gerber-organic-1st-foods-carrot-baby.)

Fair Use

Introduction:

Which nutritional component present in baby food is the strongest predictor of its sugar concentration? That is the research question I have chosen for my final project. The question comes from changes in my extended family’s life, as numerous cousins of mine have been born in the last decade. Visiting and taking care of them has made me interested in the nutritional components of the food they are given. Specifically, my focus is on the sugar concentration of the foods they receive, as it deeply affects daily life for me and for other people alike. To answer my question, I used data from the CORGIS Dataset Repository. The repository sourced the data from the United States Department of Agriculture’s Food Composition Database, referred to as the USDA Composition Database. The USDA sources the nutritional data from “What We Eat in America,” a sub-survey of the larger National Health and Nutrition Examination Survey run by the USDA and the United States Department of Health and Human Services. The survey samples around 5,000 people to estimate food metrics for the average American. The USDA takes the foods they eat, converts them into a metric format, and then quantifies their nutritional value. The only item sourced from the Dataset Repository rather than the USDA is the description of each food, which comes from Wikipedia summaries.

The variables used in my analysis include the category variable, a categorical variable that identifies whether a food item is baby food. I then sorted the nutritional variables into two groups: variables that could be used as they were, and variables that needed to be combined into summary categories. The CORGIS Dataset comments and the Merck Manual’s informative articles allowed me to distinguish between the two groups. The variables that did not need to be grouped were the Water, Lipid, Fiber, and Protein concentrations, the Sugar total, and the variables for Vitamins C, E, and K. The Water concentration did not need adjusting, as it simply records the concentration of water. The total Lipids variable already accounts for all of the Fat variables in the dataset, which made it the best choice, as it was exactly the grouping I needed. The Fiber and Protein concentrations follow suit with the Water concentration, as they simply record the total amount of those nutrients in each food. I used the Sugar total, the total amount of sugar present in each food, as my response variable. The Vitamin C, E, and K variables had no subgroups containing modified forms of the original vitamins, so I left them unchanged.

For the remaining variables, I created six new summary variables from original nutrient concentrations that were not already combined. This made the analysis easier and helped me understand my results better. I first sorted the Major Minerals into Macrominerals and Trace Minerals. Macrominerals are needed in large amounts, while Trace Minerals are needed only in small amounts (Merck Manual Consumer Version, n.d.). I then sorted the Carotenoids into two categories, Carotenes and other Carotenoids. The distinction is that Carotenes do not contain oxygen, while the other Carotenoids do (Anthony, 2018). The other Carotenoids would be more efficiently used when grouped together, so I combined them. Finally, I sorted the Vitamin A and B variables into groupings of the vitamins themselves together with the nutrients that are modified forms of them. This allowed a more complete view of how the two vitamins relate to the sugar concentration.

Background Information:

Baby food is a type of food introduced into the diets of infants at around six months of age (American Academy of Pediatrics, n.d.). It allows them to begin the transition to whole foods, which becomes necessary with age. The purpose of the nutrients in baby food is to give babies a good, balanced supply of nutrients as they mature. One specific nutrient, sugar, is extremely important to the health and development of babies. Natural sugars help spur development and provide energy, as glucose is an important nutrient for regular cell growth and stability (American Academy of Pediatrics, n.d.). However, added sugars can have a deep, negative impact on the health of babies (Square Baby, n.d.). They can cause sugar dependencies and increase the chances of diabetes as babies get older. My project investigates other nutritional factors to see how they play into the health of the baby, and whether unhealthy metrics can be used to quantify the sugar concentration.

Loading necessary libraries and dataset

# Loading necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2) # exploration plots
library(dplyr)
library(DataExplorer)
library(highcharter) # interactive visualization

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(RColorBrewer)

# Loading the dataset
baby_food <- read_csv("ingredients.csv")

Rows: 2332 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Category, Description
dbl (36): Nutrient Data Bank Number, Data.Alpha Carotene, Data.Beta Carotene...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning, wrangling, and exploring

Data Cleaning

# Removing null values 
baby_food <- baby_food %>% filter(complete.cases(.))

# Uses the 'rename_with' function to remove the "Data." prefix from all of the column names.
# fixed = TRUE matches the literal text "Data." (an unescaped "." is a regex wildcard)
baby_food <- baby_food %>% rename_with(~gsub("Data.", "", ., fixed = TRUE))

# Uses the 'rename_with' function to remove the "Fat." prefix from the fat column names
baby_food <- baby_food %>% rename_with(~gsub("Fat.", "", ., fixed = TRUE))
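
A note on the renaming pattern: `gsub()` treats its pattern as a regular expression by default, so an unescaped `.` matches any character, while `fixed = TRUE` matches the text literally. The minimal sketch below uses three column names that appear in the file's header to show the difference. In this analysis the one affected column is dropped later, so the results are the same either way.

```r
# Three column names as they appear in the raw file
cols <- c("Nutrient Data Bank Number", "Data.Alpha Carotene", "Data.Beta Carotene")

# Default behavior: "." is a wildcard, so "Data " (with a space) also matches
regex_names <- gsub("Data.", "", cols)

# fixed = TRUE: only the literal text "Data." is removed
literal_names <- gsub("Data.", "", cols, fixed = TRUE)

print(regex_names)
print(literal_names)
```

With the default, the first name loses the word "Data"; with `fixed = TRUE` it is left untouched.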

Data Wrangling

# Creates a new variable for Macrominerals, by finding the sum of all of the macromineral quantities
baby_food <- mutate(baby_food, `Macrominerals` = `Major Minerals.Calcium` + `Major Minerals.Magnesium` +
                      `Major Minerals.Phosphorus` + `Major Minerals.Potassium` +`Major Minerals.Sodium`)

# Creates a new variable for Trace Minerals, by finding the sum of all of the trace mineral quantities
baby_food <- mutate(baby_food, `Trace Minerals` = `Major Minerals.Copper` + `Major Minerals.Iron` +
                      `Major Minerals.Zinc` + `Selenium`)

# Creates a new variable for the total amount of carotenes, by combining the alpha and beta carotenes together with the
# Cryptoxanthin
baby_food <- mutate(baby_food, `Total Carotenes` = `Alpha Carotene` + `Beta Carotene` + 
                      `Beta Cryptoxanthin`)

# Creates a new variable for the total amount of remaining carotenoids by summing the two other carotenoid nutrients
baby_food <- mutate(baby_food, `Other Carotenoids` = `Lutein and Zeaxanthin` + `Lycopene`)

# Creates a new variable for Vitamin A and its subtype
baby_food <- mutate(baby_food, `Vitamin A` = `Vitamins.Vitamin A - RAE` + `Retinol`)

# Creates a new variable for Vitamin B and its subtypes
baby_food <- mutate(baby_food, `Vitamin B` = `Vitamins.Vitamin B12` + `Vitamins.Vitamin B6`+ `Niacin` +
                      `Riboflavin` + `Thiamin` + `Choline`)

# Removes the "Vitamins." prefix from the Vitamin variables. Although this is a data cleaning step, it runs after the
# two new vitamin variables are created, so that the other three vitamin variables keep clean names
baby_food <- baby_food %>% rename_with(~gsub("Vitamins.", "", ., fixed = TRUE))

# Filters the dataset to only include observations that involve baby food
baby_food <- filter(baby_food, Category == "Babyfood")

# Uses the select statement to filter the entire dataset to just the relevant variables needed for the analysis. This
# removes the leftover variables that were used in the creation of new variables.
baby_food <- select(baby_food, `Category`, `Fiber`, `Protein`, `Total Lipid`, `Water`, `Macrominerals`,
                    `Trace Minerals`, `Total Carotenes`, `Other Carotenoids`, `Vitamin A`, `Vitamin B`,
                    `Vitamin C`, `Vitamin E`, `Vitamin K`, `Sugar Total`)
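
The summary variables above are built by adding columns with `+`. As a minimal sketch of an equivalent approach, `rowSums()` can add a named set of columns in one call. The toy values below are hypothetical and not taken from the dataset; only the column names follow the ones used in this report.

```r
# Hypothetical toy values (NOT from the dataset), two foods
toy <- data.frame(
  `Major Minerals.Calcium`    = c(10, 20),
  `Major Minerals.Magnesium`  = c(1, 2),
  `Major Minerals.Phosphorus` = c(5, 6),
  `Major Minerals.Potassium`  = c(100, 200),
  `Major Minerals.Sodium`     = c(3, 4),
  check.names = FALSE  # keep the spaces and dots in the column names
)

# Columns that make up the Macrominerals summary variable
macro_cols <- c("Major Minerals.Calcium", "Major Minerals.Magnesium",
                "Major Minerals.Phosphorus", "Major Minerals.Potassium",
                "Major Minerals.Sodium")

# Row-wise sum across the listed columns
toy$Macrominerals <- rowSums(toy[macro_cols])
print(toy$Macrominerals)
```

One difference to keep in mind: `+` returns `NA` if any term is missing, while `rowSums()` can skip missing values with `na.rm = TRUE`. Since rows with missing values were already removed during cleaning, the two approaches should agree here.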

Data Exploration

# Shows the baby_food table
head(baby_food)

# A tibble: 6 × 15
  Category Fiber Protein `Total Lipid` Water Macrominerals `Trace Minerals`
  <chr>    <dbl>   <dbl>         <dbl> <dbl>         <dbl>            <dbl>
1 Babyfood     0     0            0.02  12.9           171             2.21
2 Babyfood     0    12.0          2.52  82.4           337             6.25
3 Babyfood     0    12.0          2.52  82.4           337             6.25
4 Babyfood     0    13.1          2.45  82.4           324             6.91
5 Babyfood     0    11.3          3.8   80.5           348            17.5 
6 Babyfood     0    14.1          3.41  81.1           360             5.97
# ℹ 8 more variables: `Total Carotenes` <dbl>, `Other Carotenoids` <dbl>,
#   `Vitamin A` <dbl>, `Vitamin B` <dbl>, `Vitamin C` <dbl>, `Vitamin E` <dbl>,
#   `Vitamin K` <dbl>, `Sugar Total` <dbl>

All of the relevant variables appear in the table, with the new summary variables holding the sums computed during wrangling. Because the data are in tidy format, they are ready for the analysis.

Statistical Analysis: Multiple Linear Regression

Multiple Linear Regression Model

lr <- lm(formula = `Sugar Total` ~ `Protein` + `Total Lipid` + `Water` + `Macrominerals` +
                   `Trace Minerals` + `Total Carotenes` + `Other Carotenoids` + `Vitamin A` + `Vitamin B` + 
                    `Vitamin C` + `Vitamin E` + `Vitamin K` + `Fiber`, data = baby_food)
summary(lr)


Call:
lm(formula = `Sugar Total` ~ Protein + `Total Lipid` + Water + 
    Macrominerals + `Trace Minerals` + `Total Carotenes` + `Other Carotenoids` + 
    `Vitamin A` + `Vitamin B` + `Vitamin C` + `Vitamin E` + `Vitamin K` + 
    Fiber, data = baby_food)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.361  -2.542   0.236   2.780  29.242 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         37.7690702  4.1654517   9.067 2.80e-16 ***
Protein             -1.6669585  0.3201251  -5.207 5.45e-07 ***
`Total Lipid`       -0.4871335  0.1989219  -2.449   0.0153 *  
Water               -0.3479821  0.0432151  -8.052 1.32e-13 ***
Macrominerals       -0.0016570  0.0032710  -0.507   0.6131    
`Trace Minerals`    -0.2806047  0.0659082  -4.258 3.40e-05 ***
`Total Carotenes`   -0.0004538  0.0026597  -0.171   0.8647    
`Other Carotenoids` -0.0017297  0.0009942  -1.740   0.0837 .  
`Vitamin A`         -0.0002386  0.0376028  -0.006   0.9949    
`Vitamin B`          0.3432403  0.0826824   4.151 5.21e-05 ***
`Vitamin C`          0.0395842  0.0208368   1.900   0.0592 .  
`Vitamin E`          0.6027434  0.4620098   1.305   0.1938    
`Vitamin K`          0.0077192  0.0384746   0.201   0.8412    
Fiber               -0.2642323  0.5209144  -0.507   0.6126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.746 on 171 degrees of freedom
Multiple R-squared:  0.538, Adjusted R-squared:  0.5029 
F-statistic: 15.32 on 13 and 171 DF,  p-value: < 2.2e-16

Statistical Analysis

Adjusted R-squared: 0.5029

P-Value: < 2.2e-16

Equation: Sugar Total = 37.77 - 1.67(protein) - 0.49(total lipid) - 0.35(water) - 0.002(macrominerals) - 0.28(trace minerals) - 0.0005(total carotenes) - 0.002(other carotenoids) - 0.0002(vitamin A) + 0.34(vitamin B) + 0.04(vitamin C) + 0.60(vitamin E) + 0.008(vitamin K) - 0.26(fiber)
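
To make the equation concrete, here is a minimal sketch that plugs a nutrient profile into the fitted coefficients. The coefficients are copied from the model summary above; the nutrient values for the example food are hypothetical and chosen only for illustration, and `predict_sugar()` is a helper name introduced here.

```r
# Coefficients copied from the summary(lr) output
coefs <- c(
  "(Intercept)" = 37.7690702, "Protein" = -1.6669585, "Total Lipid" = -0.4871335,
  "Water" = -0.3479821, "Macrominerals" = -0.0016570, "Trace Minerals" = -0.2806047,
  "Total Carotenes" = -0.0004538, "Other Carotenoids" = -0.0017297,
  "Vitamin A" = -0.0002386, "Vitamin B" = 0.3432403, "Vitamin C" = 0.0395842,
  "Vitamin E" = 0.6027434, "Vitamin K" = 0.0077192, "Fiber" = -0.2642323
)

# Hypothetical helper: intercept plus the sum of coefficient * nutrient value
predict_sugar <- function(food) {
  unname(coefs[["(Intercept)"]] + sum(coefs[names(food)] * food))
}

# Hypothetical nutrient profile (made-up values, not a row from the dataset)
food <- c(
  "Protein" = 1, "Total Lipid" = 0.5, "Water" = 85, "Macrominerals" = 200,
  "Trace Minerals" = 0.5, "Total Carotenes" = 0, "Other Carotenoids" = 0,
  "Vitamin A" = 0, "Vitamin B" = 5, "Vitamin C" = 10, "Vitamin E" = 0.5,
  "Vitamin K" = 1, "Fiber" = 1
)

predicted_sugar <- predict_sugar(food)
print(predicted_sugar)
```

With the fitted model object available, `predict(lr, newdata = ...)` would give the same kind of estimate without copying coefficients by hand.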

Analysis:

The adjusted R-squared of about 0.50 indicates that roughly 50% of the variance in total sugar is explained by these variables. This makes the model moderately useful for predicting sugar concentration in baby foods, although other variables not included here still need to be accounted for. The model itself is statistically significant by a wide margin, as its p-value (< 2.2e-16) is far smaller than 0.05. Among the individual variables, many were not statistically significant predictors of sugar: Macrominerals, Total Carotenes, Vitamin A, Vitamin E, Vitamin K, and Fiber all had p-values well above 0.05. Protein, in contrast, was a strong negative predictor: the higher the protein concentration in the baby food, the lower the sugar concentration tends to be. Each one-unit increase in protein is associated with a decrease of about 1.67 in total sugar when the other variables are held constant.
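
Because the predictors are measured on different scales, the raw coefficients cannot be compared directly. One rough way to compare them is to rank the variables by the absolute value of their t statistics. The sketch below does this with the t values copied from the model summary; treating the t statistic as the measure of predictor strength is an assumption of this sketch, not the only possible choice.

```r
# t values copied from the summary(lr) output (intercept omitted)
t_values <- c(
  "Protein" = -5.207, "Total Lipid" = -2.449, "Water" = -8.052,
  "Macrominerals" = -0.507, "Trace Minerals" = -4.258,
  "Total Carotenes" = -0.171, "Other Carotenoids" = -1.740,
  "Vitamin A" = -0.006, "Vitamin B" = 4.151, "Vitamin C" = 1.900,
  "Vitamin E" = 1.305, "Vitamin K" = 0.201, "Fiber" = -0.507
)

# Sort from the largest to the smallest absolute t value
ranked <- t_values[order(abs(t_values), decreasing = TRUE)]
print(ranked)
```

By this measure, Water, Protein, Trace Minerals, and Vitamin B stand out, matching the four predictors marked `***` in the summary. With the model object available, the same values can be read from `summary(lr)$coefficients[, "t value"]`.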

Diagnostic Plot

plot(lr, 1)

The Residuals vs Fitted plot highlights issues with the original model, showing how outliers drastically affect its performance. The model performs well when predicting lower sugar concentrations but becomes less precise as the sugar concentration increases. This ties back to the drastic outliers: as long as the sugar concentration stays within reasonable levels, the model predicts it accurately from these variables.
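
Outliers like these can also be listed rather than read off the plot. The minimal sketch below flags observations whose standardized residual exceeds 2 in absolute value. It uses R's built-in `mtcars` data as a stand-in so that it runs on its own, and the cutoff of 2 is a common rule of thumb assumed here, not a value taken from this analysis.

```r
# Stand-in model on built-in data; replace `fit` with `lr` to check the baby food model
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals put every residual on a comparable scale
std_res <- rstandard(fit)

# Flag observations beyond the assumed cutoff of 2
flagged <- which(abs(std_res) > 2)
print(flagged)
```

Rows flagged this way are candidates for a closer look, not automatic removals.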

Visualizations:

Visualization 1

# Scatter plot of sugar against protein, colored by total lipid
p1 <- ggplot(baby_food, aes(
  x = Protein,
  y = `Sugar Total`,
  color = `Total Lipid`)) +
  geom_point(alpha = 0.8) +
  labs(
    title = "Correlation of Sugar Concentration with Nutritional Influence",
    x = "Total Proteins",  # matches the x aesthetic (Protein)
    y = "Total Sugars",    # matches the y aesthetic (Sugar Total)
    color = "Total Fat Amount",
    caption = "Source: Corgis Dataset Repository, sourced from the United States Department of Agriculture Food Composition Database"
  ) +
  theme_minimal()

p1

While this graph doesn’t show too much, it gives a good idea of how sugar is spread across protein and fat levels. It helped me better understand my data, in which the Fat, Protein, and Sugar concentrations are three central figures for understanding the nutritional value of baby food. I also chose these variables because I wanted to see their relationship with added versus non-added sugar, but the dataset does not record that distinction. That is something I would like to test in the future.

Tableau Visualization

Tableau Public Profile

For my Tableau visualization, I used a histogram to show the frequency of sugar concentrations as well as the sum of the sugar in total. This gave me an idea of how sugar is distributed across baby foods, both together and individually. I also included summary statistics in the interactivity, allowing the user to see the averages for Fats, Proteins, Fibers, Cholesterol, and Water for each sugar concentration bin. What I found is that sugar concentrations peak around nine grams, which gives a general sense of typical sugar levels in baby food.

Bibliography:

CORGIS Dataset Project. (n.d.). Ingredients dataset. Retrieved May 15, 2026, from CORGIS Ingredients Dataset

Saidi, R. (2022). Data visualization and analysis. RPubs. RPubs Article

Merck Manual Consumer Version. (n.d.). Overview of minerals. Merck Manuals Overview of Minerals

American Academy of Pediatrics. (n.d.). Infant food and feeding. AAP Infant Food and Feeding

Square Baby. (n.d.). Baby food: All about sugar. Square Baby Sugar Article

GeeksforGeeks. (n.d.). filter() in Python. GeeksforGeeks filter() in Python