Many people associate high-protein foods with “clean eating” or weight management, but the relationship between protein content and caloric density is not straightforward. A grilled chicken breast and a handful of mixed nuts can both qualify as high-protein foods, yet they differ dramatically in fat and calorie content.
Research Question: Is there a statistically significant difference in the average calorie content (kcal per 100g) between high-protein foods (>15g protein per 100g) and low-protein foods (≤15g protein per 100g)?
This analysis uses data from the USDA National Nutrient Database to test whether protein group membership predicts caloric content, using a two-sample t-test. A second explanatory variable — total fat content — is also explored to understand the role it plays in the calorie differences observed between groups.
Data Source:
The dataset is sourced from the USDA National Nutrient Database for
Standard Reference, compiled into CSV format and hosted on Kaggle by
user Viktorija Zezere:
https://www.kaggle.com/datasets/viktorzezere/usda-food-nutritional-values
Cases:
Each case (row) represents a unique food item (e.g., “Butter, with
salt”, “Mackerel, salted”). The full dataset contains 8790 food
items. Filtering out missing values removed no rows, so the analysis
uses all 8790 cases.
Type of Study:
This is an observational study. Nutritional
measurements were collected from existing food items; no experimental
manipulation was conducted.
Variables:
| Variable | Role | Type | Description |
|---|---|---|---|
| calories | Response | Numerical | Energy content in kcal per 100g of food |
| protein_group | Explanatory 1 | Categorical | “high” if protein > 15g/100g, “low” otherwise |
| fat | Explanatory 2 | Numerical | Total fat content in grams per 100g |
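The data preparation step is not shown in this report, so the following is only a minimal sketch of how a `protein_group` variable like the one described above could be derived. The `toy` data frame, its values, and the column name `protein` are assumptions made for illustration; the real CSV may use different column names. Base R is used so the sketch runs without additional packages.

```r
# Toy data standing in for the real dataset; all values are made up.
toy <- data.frame(
  food     = c("food A", "food B", "food C", "food D"),
  protein  = c(31.0, 15.0, 0.9, 20.5),   # g per 100g
  calories = c(165, 120, 717, 600),      # kcal per 100g
  fat      = c(3.6, 2.0, 81.1, 50.0)     # g per 100g
)

# Strictly more than 15 g is "high"; exactly 15 g falls in "low".
toy$protein_group <- ifelse(toy$protein > 15, "high", "low")

# Keep only rows with no missing values in the analysis variables.
toy_clean <- toy[complete.cases(toy[, c("calories", "protein", "fat")]), ]

table(toy_clean$protein_group)
```

Note that the cutoff is a strict inequality, so a food with exactly 15g of protein per 100g is assigned to the low group.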
food_clean %>%
  count(protein_group) %>%
  rename(Group = protein_group, Count = n) %>%
  kable(caption = "Sample sizes by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Group | Count |
|---|---|
| high | 2966 |
| low | 5824 |
summary_stats <- food_clean %>%
  group_by(protein_group) %>%
  summarise(
    n          = n(),
    mean_cal   = round(mean(calories), 1),
    median_cal = round(median(calories), 1),
    sd_cal     = round(sd(calories), 1),
    min_cal    = min(calories),
    max_cal    = max(calories)
  )
summary_stats %>%
  rename(
    Group = protein_group, N = n,
    Mean = mean_cal, Median = median_cal,
    SD = sd_cal, Min = min_cal, Max = max_cal
  ) %>%
  kable(caption = "Summary statistics for calories by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Group | N | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|---|
| high | 2966 | 223.5 | 198 | 106.8 | 69 | 669 |
| low | 5824 | 227.8 | 165 | 194.3 | 0 | 902 |
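The two groups have nearly identical mean calorie content (223.5 vs. 227.8 kcal per 100g), but the high-protein group has the higher median (198 vs. 165) and a much smaller standard deviation (106.8 vs. 194.3). High-protein foods are therefore not “lighter” on average, despite the common perception.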
ggplot(food_clean, aes(x = protein_group, y = calories, fill = protein_group)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title = "Distribution of Calorie Content by Protein Group",
    subtitle = "USDA National Nutrient Database",
    x = "Protein Group",
    y = "Calories (kcal per 100g)",
    fill = "Protein Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
Both groups show right skew with notable outliers. The high-protein group’s median is higher and its distribution is more compact. The presence of high-fat, high-protein foods (e.g., nuts, processed meats) likely contributes to the higher median in the high-protein group.
ggplot(food_clean, aes(x = calories, fill = protein_group, color = protein_group)) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_density(alpha = 0.4, linewidth = 1) +
  scale_fill_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  scale_color_manual(values = c("high" = "#C0522B", "low" = "#3A6A96")) +
  labs(
    title = "Density of Calorie Content by Protein Group",
    x = "Calories (kcal per 100g)",
    y = "Density",
    fill = "Protein Group",
    color = "Protein Group"
  ) +
  theme_minimal(base_size = 13)
The density plot confirms right skew in both groups. Consistent with the summary statistics, the high-protein group is concentrated in a narrower range around a higher typical value, while the low-protein group is more spread out: many low-protein foods (e.g., fruits, vegetables) are very low in calories, but the group's right tail extends to about 900 kcal per 100g.
Since fat is the most calorie-dense macronutrient (9 kcal/g), we examine whether the high-protein group also contains more fat on average.
ggplot(food_clean, aes(x = protein_group, y = fat, fill = protein_group)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title = "Total Fat Content by Protein Group",
    x = "Protein Group",
    y = "Total Fat (g per 100g)",
    fill = "Protein Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
food_clean %>%
  group_by(protein_group) %>%
  summarise(
    mean_fat   = round(mean(fat), 1),
    median_fat = round(median(fat), 1),
    sd_fat     = round(sd(fat), 1)
  ) %>%
  rename(Group = protein_group, `Mean Fat` = mean_fat,
         `Median Fat` = median_fat, `SD Fat` = sd_fat) %>%
  kable(caption = "Fat content summary by protein group") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Group | Mean Fat | Median Fat | SD Fat |
|---|---|---|---|
| high | 11.8 | 8.8 | 10.6 |
| low | 9.9 | 2.7 | 17.9 |
High-protein foods tend to be higher in fat (median 8.8 vs. 2.7 g per 100g; mean 11.8 vs. 9.9), which helps explain why their typical calorie content is not lower than that of low-protein foods.
ggplot(food_clean, aes(x = fat, y = calories, color = protein_group)) +
  geom_point(alpha = 0.3, size = 1.2) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  scale_color_manual(values = c("high" = "#E07B54", "low" = "#5B8DB8")) +
  labs(
    title = "Calories vs. Fat Content by Protein Group",
    x = "Total Fat (g per 100g)",
    y = "Calories (kcal per 100g)",
    color = "Protein Group"
  ) +
  theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
The strong positive relationship between fat and calories is apparent in both groups, reinforcing that fat content is a key driver of caloric density.
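To make the link between fat and calories concrete, the sketch below applies the general Atwater factors (4 kcal/g for protein, 4 kcal/g for carbohydrate, 9 kcal/g for fat) to two hypothetical foods with the same protein content. The function name `atwater_kcal` and the food values are illustrative assumptions, and the calorie values in the dataset may have been derived with food-specific factors rather than these general ones.

```r
# General Atwater factors (kcal per gram); an approximation, not
# necessarily the factors used to compute the values in the dataset.
atwater_kcal <- function(protein_g, carb_g, fat_g) {
  4 * protein_g + 4 * carb_g + 9 * fat_g
}

# Two hypothetical foods with equal protein but different fat (g per 100g).
lean  <- atwater_kcal(protein_g = 25, carb_g = 0, fat_g = 3)
fatty <- atwater_kcal(protein_g = 25, carb_g = 0, fat_g = 30)

lean          # 127
fatty         # 370
fatty - lean  # 243, i.e. 9 kcal for each of the 27 extra grams of fat
```

Both foods would land in the high-protein group, yet the fattier one has nearly three times the calories, which is the pattern the scatterplot suggests.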
H₀: Mean calorie content is the same for high-protein and low-protein foods.
μ_high − μ_low = 0
H₁: Mean calorie content differs between high-protein and low-protein foods.
μ_high − μ_low ≠ 0
Significance level: α = 0.05
1. Independence: Each food item is a distinct entry in the USDA database. Observations are independent within and between groups. ✅
2. Nearly Normal Distribution (or large n): Both groups are heavily right-skewed (as seen in the density plots). However, with n > 30 in both groups, the Central Limit Theorem ensures the sampling distribution of the mean is approximately normal. ✅
| protein_group | n |
|---|---|
| high | 2966 |
| low | 5824 |
3. Equal Variance: We use Welch’s two-sample t-test (default in R), which does not assume equal variances between groups. ✅
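The Central Limit Theorem argument in condition 2 can be illustrated with a small simulation. This is a sketch on synthetic data, not on the USDA values: the gamma distribution, its parameters, and the helper names `draw_sample` and `skewness` are assumptions chosen only to produce a right-skewed population.

```r
set.seed(1)

# Hypothetical right-skewed "population": gamma with shape 2.
draw_sample <- function(n) rgamma(n, shape = 2, scale = 100)

# Simple moment-based skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sampling distribution of the mean for n = 30 and n = 3000.
means_30   <- replicate(2000, mean(draw_sample(30)))
means_3000 <- replicate(2000, mean(draw_sample(3000)))

skew_raw  <- skewness(draw_sample(10000))
skew_30   <- skewness(means_30)
skew_3000 <- skewness(means_3000)

round(c(raw = skew_raw, n30 = skew_30, n3000 = skew_3000), 2)
```

The raw draws are strongly skewed, while the sample means are much closer to symmetric, and more so at n = 3000, which is roughly the size of the smaller group here.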
t_result <- t.test(calories ~ protein_group, data = food_clean, var.equal = FALSE)
print(t_result)
##
## Welch Two Sample t-test
##
## data: calories by protein_group
## t = -1.3357, df = 8739.8, p-value = 0.1817
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
## -10.591672 2.007139
## sample estimates:
## mean in group high mean in group low
## 223.4737 227.7660
| Statistic | Value |
|---|---|
| t-statistic | -1.336 |
| Degrees of freedom | 8739.8 |
| p-value | 0.1817 |
| 95% CI (lower) | -10.59 |
| 95% CI (upper) | 2.01 |
| Mean calories – high protein | 223.5 |
| Mean calories – low protein | 227.8 |
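As a check, the test statistic, degrees of freedom, p-value, and confidence interval can be approximately reproduced by hand from the group sizes, means, and standard deviations reported earlier. The sketch below uses the rounded standard deviations from the summary table, so its results agree with the `t.test()` output only up to rounding; the variable names are illustrative.

```r
# Summary statistics copied from the output above (SDs are rounded).
n_high <- 2966; mean_high <- 223.4737; sd_high <- 106.8
n_low  <- 5824; mean_low  <- 227.7660; sd_low  <- 194.3

# Variance of each group's sample mean.
v_high <- sd_high^2 / n_high
v_low  <- sd_low^2  / n_low

# Welch standard error and t-statistic.
se     <- sqrt(v_high + v_low)
t_stat <- (mean_high - mean_low) / se

# Welch-Satterthwaite degrees of freedom.
df <- (v_high + v_low)^2 / (v_high^2 / (n_high - 1) + v_low^2 / (n_low - 1))

# Two-sided p-value and 95% confidence interval for mu_high - mu_low.
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean_high - mean_low) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 3)
```

The standard error is about 3.2 kcal per 100g, so the observed difference of roughly 4.3 kcal is only about 1.3 standard errors from zero.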
The Welch two-sample t-test yields a p-value of 0.18, which is above α = 0.05, so we fail to reject the null hypothesis. The data do not provide evidence of a difference in mean calorie content between high-protein and low-protein foods.
The 95% confidence interval for the difference in means (μ_high − μ_low) runs from -10.59 to 2.01 kcal per 100g and includes zero, which is consistent with this result. We are 95% confident that the mean calorie content of high-protein foods is somewhere between 10.59 kcal per 100g lower and 2.01 kcal per 100g higher than that of low-protein foods.
Summary: This analysis tested whether high-protein foods (>15g protein per 100g) differ significantly in caloric content from low-protein foods using the USDA National Nutrient Database (n = 8790). The two-sample Welch t-test did not produce a statistically significant result (t = -1.34, p = 0.18), so we fail to reject the null hypothesis of no difference.
Key Finding: The popular belief that high-protein foods are necessarily lower in calories is not supported by the USDA data. Mean calorie content is statistically indistinguishable between the two groups (223.5 vs. 227.8 kcal per 100g), and the median is actually higher for high-protein foods (198 vs. 165). Exploratory analysis suggests that fat content plays a role: many high-protein foods (e.g., nuts, seeds, processed meats, cheese) are also high in fat, which is the most calorie-dense macronutrient.
Importance: These findings have practical implications for dietary planning. Individuals selecting foods based solely on protein content may unintentionally consume more calories than expected. This underscores the importance of evaluating the full macronutrient profile rather than using protein content as a proxy for caloric “lightness.”
Limitations:
Future Directions: A follow-up analysis could use multiple regression with both protein and fat as continuous predictors of calorie content, or stratify by food category (e.g., meats vs. dairy vs. legumes) to better understand within-group variation.
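A minimal sketch of the proposed regression, run on synthetic data rather than the USDA dataset, might look like the following. The data frame `sim`, its distributions, and the use of general Atwater factors to generate calories are all assumptions made for illustration, so the coefficients say nothing about the real data.

```r
set.seed(42)
n <- 500

# Synthetic macronutrients in g per 100g (made up for illustration).
sim <- data.frame(
  protein = runif(n, 0, 30),
  fat     = runif(n, 0, 40),
  carb    = runif(n, 0, 30)
)

# Calories generated from general Atwater factors plus noise.
sim$calories <- 4 * sim$protein + 4 * sim$carb + 9 * sim$fat + rnorm(n, sd = 10)

# Protein and fat as continuous predictors; carbohydrate is left out on
# purpose, mirroring the two-predictor model proposed in the text.
fit <- lm(calories ~ protein + fat, data = sim)
round(coef(fit), 2)
```

Because the simulated predictors are independent, the protein and fat coefficients land near the values used to generate the data. With real foods the macronutrients are correlated, so omitting carbohydrate could bias these estimates.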
Zezere, V. (n.d.). USDA food nutritional values [Dataset]. Kaggle. Retrieved from https://www.kaggle.com/datasets/viktorzezere/usda-food-nutritional-values
U.S. Department of Agriculture, Agricultural Research Service. USDA National Nutrient Database for Standard Reference. Retrieved from https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/nutrient-data-laboratory/
R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Wickham H, et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.
# QQ plots to visualize departure from normality in each group
par(mfrow = c(1, 2))
qqnorm(food_clean$calories[food_clean$protein_group == "high"],
       main = "Q-Q Plot: High Protein")
qqline(food_clean$calories[food_clean$protein_group == "high"], col = "#E07B54", lwd = 2)
qqnorm(food_clean$calories[food_clean$protein_group == "low"],
       main = "Q-Q Plot: Low Protein")
qqline(food_clean$calories[food_clean$protein_group == "low"], col = "#5B8DB8", lwd = 2)
Both groups show deviation from normality in the tails (right skew), as expected for nutritional data. The large sample sizes in both groups (n >> 30) mean the Central Limit Theorem applies and the t-test remains valid.