Project 2

Introduction

In this project, I analyze a nutrition dataset that contains information on 7,413 food items across 48 variables. It includes quantitative variables (such as carbohydrate content, protein content, and calorie values) and categorical variables (such as food category).

The variables analyzed include:

  • Data.Kilocalories (calories): The number of calories in a food item.
  • Data.Carbohydrate (carbohydrates): The amount of carbohydrates in grams.
  • Data.Protein (protein): The amount of protein in grams.
  • Category: The type or category of the food (e.g., dairy, meat, grains).

I chose this dataset to explore the relationship between macronutrients (carbohydrates and protein) and their caloric content, which is a key aspect of nutrition analysis.

Background Research

Nutrition plays a crucial role in maintaining health and preventing disease. Carbohydrates and protein are two of the main macronutrients that provide energy to the body. The relationship between carbohydrate intake and protein content in foods is important because these nutrients support different physiological functions. While carbohydrates are the body’s primary energy source, proteins are essential for tissue repair and muscle growth.

The dataset used in this project provides an opportunity to explore how these macronutrients correlate with the calorie content of various food items, helping us understand food composition better.

To filter the data, I first looked at the 10 most common categories in the dataset. I counted how often each category appeared and then selected the top 10. After that, I filtered the dataset to keep only the rows that correspond to these top 10 categories. This way, I focused the analysis and the scatter plot just on the most frequent food categories, ensuring that the results were representative of the most common items in the dataset.
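
The counting and filtering steps described above can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the project's actual code: the `foods` table, its values, and the `n_top` variable are assumptions chosen to keep the example readable (the report itself keeps the top 10 categories).

```r
library(dplyr)

# Toy stand-in for the nutrition data (hypothetical rows, not the real file)
foods <- data.frame(
  Category = c("BEEF", "BEEF", "BEEF", "PORK", "PORK", "SOUP", "BUTTER"),
  Data.Protein = c(26, 24, 27, 22, 21, 3, 1)
)

n_top <- 2  # the report uses 10; 2 keeps the toy example small

# Count rows per category and keep the n_top most frequent ones
top_counts <- foods %>%
  count(Category, sort = TRUE) %>%
  slice_max(n, n = n_top, with_ties = FALSE)

top_categories <- top_counts$Category

# Keep only the rows that belong to those categories
foods_filtered <- foods %>%
  filter(Category %in% top_categories)

top_counts
foods_filtered
```

Deriving the category vector from the counts, instead of typing the names by hand, keeps the filter in sync with the data if the file changes.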

The 10 most frequent categories in the dataset, with the number of occurrences for each, are:

  • BEEF: 457
  • PORK: 304
  • CEREALS RTE: 291
  • BABYFOOD: 221
  • LAMB: 198
  • No Category: 198
  • SOUP: 188
  • CHICKEN: 163
  • CEREALS: 132
  • FAST FOODS: 130

These counts show which food types dominate the dataset. The “No Category” group likely represents items that did not fit into a specific category or were missing category information. In the scatter plot, points are colored by category, which shows how food types with different macronutrient compositions are distributed across carbohydrate and protein content.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/alassanefaye/Library/Mobile Documents/com~apple~CloudDocs/DATA110 ")
data <- read_csv("nutrition_food_kaggle.csv")
Rows: 7413 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Category, Description, Data.Household Weights.1st Household Weight...
dbl (44): Nutrient Data Bank Number, Data.Alpha Carotene, Data.Ash, Data.Bet...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 48
  Category   Description   Nutrient Data Bank N…¹ `Data.Alpha Carotene` Data.Ash
  <chr>      <chr>                          <dbl>                 <dbl>    <dbl>
1 BUTTER     BUTTER,WITH …                   1001                     0     2.11
2 BUTTER     BUTTER,WHIPP…                   1002                     0     2.11
3 BUTTER OIL BUTTER OIL,A…                   1003                     0     0   
4 CHEESE     CHEESE,BLUE                     1004                     0     5.11
5 CHEESE     CHEESE,BRICK                    1005                     0     3.18
6 CHEESE     CHEESE,BRIE                     1006                     0     2.7 
# ℹ abbreviated name: ¹​`Nutrient Data Bank Number`
# ℹ 43 more variables: `Data.Beta Carotene` <dbl>,
#   `Data.Beta Cryptoxanthin` <dbl>, Data.Carbohydrate <dbl>,
#   Data.Cholesterol <dbl>, Data.Choline <dbl>, Data.Fiber <dbl>,
#   Data.Kilocalories <dbl>, `Data.Lutein and Zeaxanthin` <dbl>,
#   Data.Lycopene <dbl>, Data.Manganese <dbl>, Data.Niacin <dbl>,
#   `Data.Pantothenic Acid` <dbl>, Data.Protein <dbl>, …
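# The 10 most frequent categories (from the counts listed above)
top_categories <- c("BEEF", "PORK", "CEREALS RTE", "BABYFOOD", "LAMB", 
                    "No Category", "SOUP", "CHICKEN", "CEREALS", "FAST FOODS")
# Keep only the rows that belong to those categories
data_filtered <- data %>%
  filter(Category %in% top_categories)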

ggplot(data_filtered, aes(x = `Data.Carbohydrate`, y = `Data.Protein`, color = Category)) + 
  geom_point() +  # Add points to the plot
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Add a linear fit without a confidence band
  labs(
    title = "Scatter Plot of Carbohydrates vs Protein with Regression Line (Top 10 Categories)",
    x = "Carbohydrates (g)",
    y = "Protein (g)",
    color = "Food Category"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

In the scatter plot, each point represents a food item, with carbohydrate content on the x-axis and protein content on the y-axis. The points show a moderate negative association: as carbohydrate content increases, protein content tends to decrease. The points do not fall on a single line, but the overall trend is downward. This suggests that foods high in carbohydrates (such as grains and starchy foods) often have less protein, while protein-rich foods (such as meats) tend to have fewer carbohydrates.

# Calculate the correlation coefficient between Carbohydrates and Protein
correlation <- cor(data_filtered$`Data.Carbohydrate`, data_filtered$`Data.Protein`, use = "complete.obs")

# Print the correlation coefficient
correlation
[1] -0.4315321
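
To put the coefficient above in perspective, it can help to square it. For a straight-line fit with a single predictor, the squared correlation equals the share of variance in one variable that the line accounts for. The sketch below uses the reported value of -0.4315321. The small `x` and `y` vectors are made-up numbers, included only to check the identity between `cor()` squared and the R-squared reported by `lm()`.

```r
# Correlation coefficient reported in the output above
r <- -0.4315321

# Share of variance accounted for by a simple linear fit
r_squared <- r^2
round(r_squared, 3)  # 0.186

# Check the identity on toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(9, 7, 8, 4, 3)
fit <- lm(y ~ x)
all.equal(cor(x, y)^2, summary(fit)$r.squared)  # TRUE
```

Under this reading, a linear fit on carbohydrates would account for roughly 19% of the variation in protein among the filtered items, leaving most of it unexplained.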

Conclusion

The analysis found a moderate negative correlation between carbohydrates and protein (correlation coefficient = -0.43) among the top 10 food categories. In general, as the carbohydrate content of a food item increases, its protein content tends to decrease, but this is a tendency and not a strict rule. The pattern is plausible: grains and starchy foods tend to be high in carbohydrates and low in protein, while meats are high in protein and low in carbohydrates.

However, the points in the scatter plot are widely spread, which suggests that other factors, such as the type of food (e.g., dairy, meat, grains), could have a bigger impact on the balance of carbohydrates and protein. The correlation is therefore too weak to predict one nutrient reliably from the other. How food category shapes this relationship would be an interesting area to explore further.
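
One way to probe the role of food type mentioned above is to compute the correlation within each category and compare it with the pooled value. The sketch below is an illustration, not part of the original analysis. The `foods` data frame and its two categories are made-up values, chosen so that the pooled and within-category correlations clearly differ. It says nothing about what the real dataset would show.

```r
library(dplyr)

# Toy data (hypothetical): two categories with very different compositions
foods <- data.frame(
  Category = rep(c("MEAT", "GRAIN"), each = 4),
  Data.Carbohydrate = c(0, 1, 2, 3, 60, 65, 70, 75),
  Data.Protein = c(20, 22, 24, 26, 8, 9, 10, 11)
)

# Pooled correlation across all rows
overall <- cor(foods$Data.Carbohydrate, foods$Data.Protein)

# Correlation computed separately within each category
by_category <- foods %>%
  group_by(Category) %>%
  summarise(
    n = n(),
    r = cor(Data.Carbohydrate, Data.Protein),
    .groups = "drop"
  )

overall
by_category
```

In this toy data the pooled correlation is negative even though the relationship inside each category is positive, which is why a per-category breakdown can change the interpretation of a pooled coefficient.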
