Calories and Sugar in Breakfast Cereals: Frequentist and Bayesian Analysis
library(tidyverse)
Introduction
Breakfast cereals vary widely in nutritional content. Sugar is often added to improve taste, but cereals with more sugar may also contain more calories. This project examines the relationship between Sugar (grams per serving) and Calories (per serving) using the Cereal dataset. We use both frequentist and Bayesian methods to answer:
Research Question:
Do cereals with higher sugar content tend to have higher calories?
Cereal <- read.csv("Cereal (2).csv")
head(Cereal)
                 Cereal Calories Sugar Fiber
1 Common Sense Oat Bran      100     6     3
2            Product 19      100     3     1
3   All Bran Xtra Fiber       50     0    14
4            Just Right      140     9     2
5     Original Oat Bran       70     5    10
6             Heartwise       90     5     6
The preview shows cereal names along with their nutritional values. The variable Cereal is a character variable, while Calories, Sugar, and Fiber are numeric. Even in these six rows, calories range from 50 to 140 and sugar from 0 to 9 grams per serving, so there is enough variation in both variables to study how they relate.
summary(Cereal$Calories)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   50.0    90.0   104.0   101.6   110.0   160.0
summary(Cereal$Sugar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.750   5.000   5.714   9.075  15.000
sd(Cereal$Calories)
[1] 22.16394
sd(Cereal$Sugar)
[1] 4.604666
Calories and sugar values show substantial variability. Calories range from 50 to 160 per serving (Mean = 101.6, SD = 22.16), while sugar ranges from 0 to 15 grams per serving (Mean = 5.71, SD = 4.60). The calorie mean (101.6) is close to the median (104.0), suggesting an approximately symmetric distribution. The sugar mean (5.71) exceeds the median (5.00), suggesting slight right skew.
ggplot(Cereal, aes(x = Sugar, y = Calories)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Scatterplot of Sugar vs Calories")
`geom_smooth()` using formula = 'y ~ x'
The plot shows a positive linear trend, indicating that cereals with higher sugar content generally tend to have higher calorie values. Although some variability is present, the upward-sloping regression line suggests a moderate positive association. This visual pattern supports the use of correlation analysis and simple linear regression.
cor.test(Cereal$Calories, Cereal$Sugar)
Pearson's product-moment correlation
data: Cereal$Calories and Cereal$Sugar
t = 3.5069, df = 34, p-value = 0.001296
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2249563 0.7217280
sample estimates:
cor
0.5154008
The Pearson correlation test showed a moderate positive correlation between sugar and calories (r = 0.515, 95% CI [0.225, 0.722]), which was statistically significant (t(34) = 3.51, p = 0.0013).
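The t statistic and p-value in the output above follow directly from the sample correlation and the sample size. The sketch below recomputes them from the printed values only (it does not read the data file). The variable names are illustrative, and n = 36 is inferred from df = 34.

```r
# Values copied from the cor.test() output above
r <- 0.5154008   # sample Pearson correlation
n <- 36          # df = n - 2 = 34, so n = 36

# t statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 4)     # should agree with the reported t = 3.5069
signif(p_value, 4)   # should agree with the reported p-value
```

Because the test depends only on r and n, the same correlation in a smaller sample would give a smaller t statistic and a larger p-value.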
model <- lm(Calories ~ Sugar, data = Cereal)
summary(model)
Call:
lm(formula = Calories ~ Sugar, data = Cereal)
Residuals:
    Min      1Q  Median      3Q     Max
-37.428  -9.832   0.245   8.909  40.322
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  87.4277     5.1627  16.935   <2e-16 ***
Sugar         2.4808     0.7074   3.507   0.0013 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.27 on 34 degrees of freedom
Multiple R-squared: 0.2656, Adjusted R-squared: 0.244
F-statistic: 12.3 on 1 and 34 DF, p-value: 0.001296
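The regression analysis showed that sugar is a statistically significant predictor of calories (b = 2.48, SE = 0.71, t(34) = 3.51, p = 0.0013). Each additional gram of sugar per serving is associated with an average increase of about 2.48 calories, and the intercept implies roughly 87 calories for a cereal with no sugar. Sugar explains about 26.6% of the variation in calories (R² = 0.266), so most of the variation remains unexplained by sugar alone.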
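In simple linear regression, the slope, intercept, and R² are tied to the correlation and the summary statistics reported earlier. The sketch below checks that link using only printed values. The variable names are illustrative, and the means are the rounded values from summary(), so the intercept agrees only approximately. The 10 g cereal in the last step is a hypothetical example.

```r
# Values copied from the output earlier in this report
r        <- 0.5154008   # Pearson correlation
sd_cal   <- 22.16394    # sd(Cereal$Calories)
sd_sug   <- 4.604666    # sd(Cereal$Sugar)
mean_cal <- 101.6       # rounded mean from summary()
mean_sug <- 5.714       # rounded mean from summary()

# Slope of the least-squares line: r * (sd of y) / (sd of x)
slope <- r * sd_cal / sd_sug

# The fitted line passes through the point of means
intercept <- mean_cal - slope * mean_sug

# With one predictor, R-squared is the squared correlation
r_squared <- r^2

# Predicted calories for a hypothetical cereal with 10 g of sugar,
# using the coefficients printed by summary(model)
pred_10 <- 87.4277 + 2.4808 * 10

round(c(slope = slope, intercept = intercept, r_squared = r_squared), 4)
pred_10
```

With the fitted object available, predict(model, newdata = data.frame(Sugar = 10)) would return the same kind of prediction without typing the coefficients by hand.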
par(mfrow = c(1,1))
plot(model)
Residuals vs Fitted:
The residuals appear randomly scattered around zero, suggesting that the linearity assumption is reasonable and no strong patterns are present.
Normal Q-Q Plot:
The points fall approximately along the reference line, indicating that the residuals are approximately normally distributed.
Scale-Location Plot:
The spread of residuals is relatively constant across fitted values, supporting the assumption of homoscedasticity (constant variance).
Residuals vs Leverage:
Most observations fall within the Cook’s distance boundaries, suggesting no extreme influential outliers that unduly affect the model.
The diagnostic plots indicate that regression assumptions (linearity, normality, constant variance, and absence of influential outliers) are reasonably satisfied.
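Visual checks like these can be complemented with numeric ones. The following is a minimal sketch on simulated stand-in data, not the Cereal dataset: the objects sim and fit are hypothetical, and the simulation parameters are only loosely modeled on the fitted line above. With the real data, the same base R calls would be applied to model.

```r
# Hypothetical stand-in data: simulated, NOT the Cereal dataset
set.seed(1)
sugar    <- runif(36, min = 0, max = 15)
calories <- 87 + 2.5 * sugar + rnorm(36, sd = 19)
sim      <- data.frame(Sugar = sugar, Calories = calories)

fit <- lm(Calories ~ Sugar, data = sim)

# Normality of residuals: Shapiro-Wilk test
# (a small p-value would suggest departure from normality)
sw <- shapiro.test(residuals(fit))

# Influence: Cook's distance for each observation.
# plot.lm draws its Cook's distance contours at 0.5 and 1 by default,
# so count the observations beyond the first contour.
cd        <- cooks.distance(fit)
n_flagged <- sum(cd > 0.5)

sw$p.value
n_flagged
```

A numeric check does not replace the plots, but it gives a reproducible value to report alongside them.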
library(BayesianFirstAid)
Loading required package: rjags
Loading required package: coda
Linked to JAGS 4.3.2
Loaded modules: basemod,bugs
bayes_cor <- bayes.cor.test(Cereal$Calories, Cereal$Sugar)
bayes_cor
Bayesian First Aid Pearson's Correlation Coefficient Test
data: Cereal$Calories and Cereal$Sugar (n = 36)
Estimated correlation:
0.49
95% credible interval:
0.21 0.73
The correlation is more than 0 by a probability of 0.998
and less than 0 by a probability of 0.002
The Bayesian analysis estimated a moderate positive correlation (r = 0.49, 95% credible interval [0.21, 0.73]), with a 99.8% posterior probability that the true correlation is positive.
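A statement such as "the correlation is more than 0 by a probability of 0.998" is the share of posterior draws that fall above zero. The sketch below illustrates that calculation on stand-in draws generated from a normal approximation on the Fisher z scale. This is an assumption made for illustration and is not the model that bayes.cor.test() fits through JAGS, so the numbers will differ somewhat from the output above. The names z_draws and rho_draws are hypothetical.

```r
# Assumption: approximate posterior draws via a normal distribution on the
# Fisher z scale. This is a stand-in for MCMC output, not the package's model.
set.seed(42)
r <- 0.5154008   # sample correlation from cor.test()
n <- 36          # sample size

z_draws   <- rnorm(20000, mean = atanh(r), sd = 1 / sqrt(n - 3))
rho_draws <- tanh(z_draws)   # back-transform to the correlation scale

# Posterior probability that the correlation is positive
prob_positive <- mean(rho_draws > 0)

# 95% credible interval from the 2.5% and 97.5% quantiles of the draws
cred_int <- quantile(rho_draws, c(0.025, 0.975))

prob_positive
round(cred_int, 2)
```

The same two lines, mean(draws > 0) and quantile(draws, ...), apply to any vector of posterior draws for the correlation, whatever model produced them.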
test <- cor.test(Cereal$Calories, Cereal$Sugar, conf.level = 0.95)
test
Pearson's product-moment correlation
data: Cereal$Calories and Cereal$Sugar
t = 3.5069, df = 34, p-value = 0.001296
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2249563 0.7217280
sample estimates:
cor
0.5154008
round(test$conf.int, 3)
[1] 0.225 0.722
attr(,"conf.level")
[1] 0.95
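The confidence interval above is asymmetric around r because cor.test() builds it on the Fisher z scale and transforms it back. The sketch below reproduces the interval from the printed correlation and sample size; the variable names are illustrative.

```r
# Values copied from the cor.test() output above
r <- 0.5154008
n <- 36

z    <- atanh(r)           # Fisher z transform of r
se   <- 1 / sqrt(n - 3)    # approximate standard error on the z scale
crit <- qnorm(0.975)       # critical value for a 95% interval

# Interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * crit * se)

round(ci, 3)   # should agree with round(test$conf.int, 3)
```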
Hypotheses
H₀: There is no linear correlation between Sugar and Calories (ρ = 0)
H₁: There is a linear correlation between Sugar and Calories (ρ ≠ 0)
p = 0.0013 < 0.05
Since the p-value (0.0013) is less than α = 0.05, we reject the null hypothesis. This indicates that there is a statistically significant linear correlation between sugar content and calorie values among breakfast cereals.
Conclusion
Both frequentist and Bayesian analyses indicate a moderate positive relationship between Sugar and Calories in breakfast cereals. The Pearson correlation test showed a statistically significant association (r = 0.515, p = 0.0013), supported by a 95% confidence interval of [0.225, 0.722]. The Bayesian estimate was similar (r = 0.49, 95% credible interval [0.21, 0.73]), with a 0.998 posterior probability that the correlation is positive. The regression model further showed that sugar content significantly predicts calories (about 2.48 additional calories per gram of sugar, R² = 0.266). These findings consistently suggest that cereals with higher sugar content tend to contain more calories.