DATA 606 Data Project: Calorie Content in High-Protein vs. Low-Protein Foods

Part 1 – Introduction

Many people associate high-protein foods with “clean eating” or weight management, but the relationship between protein content and caloric density is not straightforward. A grilled chicken breast and a handful of mixed nuts can both qualify as high-protein foods, yet they differ dramatically in fat and calorie content.

Research Question: Is there a statistically significant difference in the average calorie content (kcal per 100g) between high-protein foods (>15g protein per 100g) and low-protein foods (≤15g protein per 100g)?

This analysis uses data from the USDA National Nutrient Database to test whether protein group membership predicts caloric content, using a two-sample t-test. A second explanatory variable — total fat content — is also explored to understand the role it plays in the calorie differences observed between groups.

Part 2 – Data

Data Source:
The dataset is sourced from the USDA National Nutrient Database for Standard Reference, compiled into CSV format and hosted on Kaggle by user Viktorija Zezere:
https://www.kaggle.com/datasets/viktorzezere/usda-food-nutritional-values

Cases:
Each case (row) represents a unique food item (e.g., “Butter, with salt”, “Mackerel, salted”). The full dataset contains 8790 food items. Filtering out missing values removed no rows, so the analysis uses all 8790 cases.

Type of Study:
This is an observational study. Nutritional measurements were collected from existing food items; no experimental manipulation was conducted.

Variables:

Variable	Role	Type	Description
`calories`	Response	Numerical	Energy content in kcal per 100g of food
`protein_group`	Explanatory 1	Categorical	“high” if protein > 15g/100g, “low” otherwise
`fat`	Explanatory 2	Numerical	Total fat content in grams per 100g
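
The construction of `food_clean` is not shown in this report, so the following is only a minimal sketch of how the grouping rule in the table could be applied. The raw data frame `food_raw`, its column names, and its four rows are hypothetical and purely illustrative; none of the values come from the USDA file.

```r
# Hypothetical raw rows; values are illustrative only, not USDA data.
food_raw <- data.frame(
  name     = c("item_a", "item_b", "item_c", "item_d"),
  protein  = c(25.0, 15.0, 2.1, NA),   # g per 100g (assumed column name)
  fat      = c(3.0, 10.0, 0.2, 1.0),   # g per 100g
  calories = c(150, 200, 40, 90)       # kcal per 100g
)

# Keep only rows where all three analysis variables are present.
keep <- complete.cases(food_raw[, c("protein", "fat", "calories")])
food_clean <- food_raw[keep, ]

# Strictly more than 15 g/100g is "high"; exactly 15 g falls in "low".
food_clean$protein_group <- ifelse(food_clean$protein > 15, "high", "low")

print(table(food_clean$protein_group))
```

Note the boundary case: an item with exactly 15g of protein per 100g lands in the low group, matching the "> 15g" rule above.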

Part 3 – Exploratory Data Analysis

3.1 Group Sizes

food_clean %>%
  count(protein_group) %>%
  rename(Group = protein_group, Count = n) %>%
  kable(caption = "Sample sizes by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sample sizes by protein group
Group	Count
high	2966
low	5824

3.2 Summary Statistics

summary_stats <- food_clean %>%
  group_by(protein_group) %>%
  summarise(
    n         = n(),
    mean_cal  = round(mean(calories), 1),
    median_cal= round(median(calories), 1),
    sd_cal    = round(sd(calories), 1),
    min_cal   = min(calories),
    max_cal   = max(calories)
  )

summary_stats %>%
  rename(
    Group = protein_group, N = n,
    Mean = mean_cal, Median = median_cal,
    SD = sd_cal, Min = min_cal, Max = max_cal
  ) %>%
  kable(caption = "Summary statistics for calories by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Summary statistics for calories by protein group
Group	N	Mean	Median	SD	Min	Max
high	2966	223.5	198	106.8	69	669
low	5824	227.8	165	194.3	0	902

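The two groups have nearly identical mean calorie content (223.5 vs. 227.8 kcal per 100g), with the high-protein mean slightly lower. The medians tell a different story: 198 kcal for the high-protein group versus 165 kcal for the low-protein group. The low-protein group is also far more variable (SD 194.3 vs. 106.8) and ranges from 0 to 902 kcal, so its mean is pulled upward by a long right tail.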

3.3 Distributions: Boxplot

ggplot(food_clean, aes(x = protein_group, y = calories, fill = protein_group)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title    = "Distribution of Calorie Content by Protein Group",
    subtitle = "USDA National Nutrient Database",
    x        = "Protein Group",
    y        = "Calories (kcal per 100g)",
    fill     = "Protein Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Both groups show right skew with notable outliers. The high-protein group’s median is higher and its distribution is more compact, while the low-protein group spans a much wider range (up to 902 kcal per 100g). The presence of high-fat, high-protein foods (e.g., nuts, processed meats) likely accounts for the upper tail of the high-protein group.

3.4 Distributions: Density Plot

ggplot(food_clean, aes(x = calories, fill = protein_group, color = protein_group)) +
  geom_density(alpha = 0.4, linewidth = 1) +
  scale_fill_manual(values  = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  scale_color_manual(values = c("high" = "#C0522B", "low" = "#3A6A96")) +
  labs(
    title    = "Density of Calorie Content by Protein Group",
    x        = "Calories (kcal per 100g)",
    y        = "Density",
    fill     = "Protein Group",
    color    = "Protein Group"
  ) +
  theme_minimal(base_size = 13)

The density plot confirms right skew in both groups. The high-protein group peaks at a higher calorie value. The low-protein group peaks at a low calorie value, since many low-protein foods (e.g., fruits, vegetables) are very low calorie, and it has a long right tail, consistent with its larger standard deviation.

3.5 Fat Content by Protein Group

Since fat is the most calorie-dense macronutrient (9 kcal/g), we examine whether the high-protein group also contains more fat on average.

ggplot(food_clean, aes(x = protein_group, y = fat, fill = protein_group)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title    = "Total Fat Content by Protein Group",
    x        = "Protein Group",
    y        = "Total Fat (g per 100g)",
    fill     = "Protein Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

food_clean %>%
  group_by(protein_group) %>%
  summarise(
    mean_fat   = round(mean(fat), 1),
    median_fat = round(median(fat), 1),
    sd_fat     = round(sd(fat), 1)
  ) %>%
  rename(Group = protein_group, `Mean Fat` = mean_fat,
         `Median Fat` = median_fat, `SD Fat` = sd_fat) %>%
  kable(caption = "Fat content summary by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Fat content summary by protein group
Group	Mean Fat	Median Fat	SD Fat
high	11.8	8.8	10.6
low	9.9	2.7	17.9

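High-protein foods have a much higher typical fat content (median 8.8 vs. 2.7 g per 100g) and a somewhat higher mean (11.8 vs. 9.9 g), which helps explain their higher median calorie content. Fat content is more variable in the low-protein group (SD 17.9 vs. 10.6), mirroring that group’s wider calorie spread.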

3.6 Calories vs. Fat (Scatter Plot)

ggplot(food_clean, aes(x = fat, y = calories, color = protein_group)) +
  geom_point(alpha = 0.3, size = 1.2) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  scale_color_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title  = "Calories vs. Fat Content by Protein Group",
    x      = "Total Fat (g per 100g)",
    y      = "Calories (kcal per 100g)",
    color  = "Protein Group"
  ) +
  theme_minimal(base_size = 13)

## `geom_smooth()` using formula = 'y ~ x'

The strong positive relationship between fat and calories is apparent in both groups, reinforcing that fat content is a key driver of caloric density.
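
To make the role of fat concrete, here is a small worked sketch using the general Atwater factors. Only the 9 kcal/g value for fat appears in this report; the 4 kcal/g values for protein and carbohydrate are standard general factors assumed here. The helper `atwater_kcal` and both example foods are hypothetical.

```r
# General Atwater factors (kcal per gram). Only the fat value (9) is stated
# in the report; the protein and carbohydrate values (4) are assumed.
atwater_kcal <- function(protein_g, carb_g, fat_g) {
  4 * protein_g + 4 * carb_g + 9 * fat_g
}

# Two hypothetical foods with identical protein but different fat content.
lean  <- atwater_kcal(protein_g = 20, carb_g = 0, fat_g = 2)
fatty <- atwater_kcal(protein_g = 20, carb_g = 0, fat_g = 30)

print(c(lean = lean, fatty = fatty))  # 98 and 350 kcal per 100g
```

Both foods would fall in the high-protein group, yet they differ by more than a factor of three in estimated calories, entirely because of fat.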

Part 4 – Inference

4.1 Hypotheses

H₀: There is no difference in mean calorie content between high-protein and low-protein foods.
μ_high − μ_low = 0
H₁: There is a difference in mean calorie content between the two groups.
μ_high − μ_low ≠ 0

Significance level: α = 0.05

4.2 Checking Assumptions

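1. Independence: Each food item is a distinct entry in the USDA database, so we treat observations as independent within and between groups. This is an approximation, since closely related entries (e.g., variants of the same food) may not be fully independent. ✅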

2. Nearly Normal Distribution (or large n): Both groups are heavily right-skewed (as seen in the density plots). However, with n > 30 in both groups, the Central Limit Theorem ensures the sampling distribution of the mean is approximately normal. ✅

protein_group	n
high	2966
low	5824

3. Equal Variance: We use Welch’s two-sample t-test (default in R), which does not assume equal variances between groups. ✅
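
The Central Limit Theorem argument in item 2 can be illustrated with a small simulation. This sketch draws from a synthetic right-skewed gamma distribution, not from the USDA data; the shape and scale values, the replicate count, and the helper names `draw_sample` and `skewness` are all assumptions chosen for illustration.

```r
set.seed(606)

# Synthetic right-skewed "calorie" population (gamma); NOT the USDA data.
draw_sample <- function(n) rgamma(n, shape = 1.5, scale = 150)

# Simple moment-based sample skewness.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

n_per_sample <- 2966   # size of the smaller group in this report
n_reps       <- 1000   # number of simulated samples

# Skewness of one raw sample versus skewness of many sample means.
raw_skew     <- skewness(draw_sample(n_per_sample))
sample_means <- replicate(n_reps, mean(draw_sample(n_per_sample)))
means_skew   <- skewness(sample_means)

print(round(c(raw_skew = raw_skew, means_skew = means_skew), 2))
```

Under these assumptions the raw values are strongly skewed while the simulated sample means are close to symmetric, which is the behaviour the t-test relies on.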

4.3 Two-Sample t-Test

t_result <- t.test(calories ~ protein_group, data = food_clean, var.equal = FALSE)
print(t_result)

## 
##  Welch Two Sample t-test
## 
## data:  calories by protein_group
## t = -1.3357, df = 8739.8, p-value = 0.1817
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
##  -10.591672   2.007139
## sample estimates:
## mean in group high  mean in group low 
##           223.4737           227.7660

4.4 Results Summary

Two-sample Welch t-test results
Statistic	Value
t-statistic	-1.336
Degrees of freedom	8739.8
p-value	0.1817
95% CI (lower)	-10.59
95% CI (upper)	2.01
Mean calories – high protein	223.5
Mean calories – low protein	227.8
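
As a cross-check, the Welch statistic can be approximately reproduced by hand from the group sizes, means, and standard deviations reported above. This is a sketch rather than the original computation: the standard deviations are taken from the summary table, where they are rounded to one decimal, so the results match the `t.test()` output only up to rounding.

```r
# Group summaries as reported in the tables and t-test output of this report.
n_high <- 2966; mean_high <- 223.4737; sd_high <- 106.8
n_low  <- 5824; mean_low  <- 227.7660; sd_low  <- 194.3

# Variance of each group's sample mean.
v_high <- sd_high^2 / n_high
v_low  <- sd_low^2  / n_low

diff_means <- mean_high - mean_low
se_diff    <- sqrt(v_high + v_low)
t_stat     <- diff_means / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_high + v_low)^2 /
  (v_high^2 / (n_high - 1) + v_low^2 / (n_low - 1))

# Two-sided p-value and 95% confidence interval for mu_high - mu_low.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci      <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(round(c(t = t_stat, df = df_welch, p = p_value,
              ci_lower = ci[1], ci_upper = ci[2]), 4))
```

The p-value exceeds 0.05 and the interval spans zero, so the hand calculation agrees with the reported test output.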

4.5 Interpretation

The Welch two-sample t-test yields a p-value of 0.1817, which is above α = 0.05, so we fail to reject the null hypothesis. The data do not provide evidence of a statistically significant difference in mean calorie content between high-protein and low-protein foods.

The 95% confidence interval for the difference in means (μ_high − μ_low) is (−10.59, 2.01) kcal per 100g. It includes zero, which is consistent with the test result. We are 95% confident that the true difference lies between high-protein foods averaging about 10.6 fewer calories and about 2.0 more calories per 100g than low-protein foods.

Part 5 – Conclusion

Summary: This analysis tested whether high-protein foods (>15g protein per 100g) differ in mean caloric content from low-protein foods using the USDA National Nutrient Database (n = 8790). The two-sample Welch t-test was not statistically significant (t = −1.336, p = 0.1817), so we fail to reject the null hypothesis of no difference.

Key Finding: The popular belief that high-protein foods are lower in calories is not supported by these data. Mean calorie content is nearly identical in the two groups (223.5 vs. 227.8 kcal per 100g). The groups differ in shape rather than in mean: the typical (median) high-protein food is more calorie-dense than the typical low-protein food (198 vs. 165 kcal), while the low-protein group is far more variable. Exploratory analysis suggests that fat content plays a large role, since many high-protein foods (e.g., nuts, seeds, processed meats, cheese) are also high in fat, which is the most calorie-dense macronutrient.

Importance: These findings have practical implications for dietary planning. Protein content alone says little about how many calories a food contains, so individuals selecting foods based solely on protein may consume more calories than expected. This underscores the importance of evaluating the full macronutrient profile rather than using protein content as a proxy for caloric “lightness.”

Limitations:

The 15g/100g threshold for “high protein” is arbitrary; different cutoffs could yield different group compositions.
This is an observational study, so we cannot infer causation. Any association between protein group and calorie content is confounded by fat content.
The USDA database includes raw, processed, and restaurant foods without weighting for actual consumption frequency, so the groups may not be representative of real dietary patterns.
The dataset does not account for serving size, which matters more for actual caloric intake than per-100g values.

Future Directions: A follow-up analysis could use multiple regression with both protein and fat as continuous predictors of calorie content, or stratify by food category (e.g., meats vs. dairy vs. legumes) to better understand within-group variation.
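
The regression follow-up could look roughly like the sketch below. It runs on simulated data, not the USDA dataset: the sample size, predictor ranges, noise level, and the generating coefficients (4 kcal/g for protein, 9 kcal/g for fat) are all assumptions made so the example is self-contained.

```r
set.seed(606)

# Simulated data for illustration only; NOT the USDA data.
n       <- 500
protein <- runif(n, min = 0, max = 30)   # g per 100g
fat     <- runif(n, min = 0, max = 40)   # g per 100g

# Assumed generating model: 4 kcal/g protein, 9 kcal/g fat, plus noise.
calories <- 4 * protein + 9 * fat + rnorm(n, mean = 0, sd = 20)

sim <- data.frame(calories, protein, fat)

# Both predictors enter as continuous variables, with no 15 g cutoff.
fit <- lm(calories ~ protein + fat, data = sim)
print(round(coef(fit), 2))
```

Treating protein as continuous avoids the arbitrary 15g threshold noted in the limitations, and including fat in the same model separates its contribution from that of protein.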

References

Zezere, V. (n.d.). USDA food nutritional values [Dataset]. Kaggle. Retrieved from https://www.kaggle.com/datasets/viktorzezere/usda-food-nutritional-values

U.S. Department of Agriculture, Agricultural Research Service. USDA National Nutrient Database for Standard Reference. Retrieved from https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/nutrient-data-laboratory/

R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Wickham H, et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Appendix

# QQ plots to visualize departure from normality in each group
par(mfrow = c(1, 2))

qqnorm(food_clean$calories[food_clean$protein_group == "high"],
       main = "Q-Q Plot: High Protein")
qqline(food_clean$calories[food_clean$protein_group == "high"], col = "#E07B54", lwd = 2)

qqnorm(food_clean$calories[food_clean$protein_group == "low"],
       main = "Q-Q Plot: Low Protein")
qqline(food_clean$calories[food_clean$protein_group == "low"], col = "#5B8DB8", lwd = 2)

Both groups show deviation from normality in the tails (right skew), as expected for nutritional data. The large sample sizes in both groups (n >> 30) mean the Central Limit Theorem applies and the t-test remains valid.