                 Cereal Calories Sugar Fiber
1 Common Sense Oat Bran      100     6     3
2            Product 19      100     3     1
3   All Bran Xtra Fiber       50     0    14
4            Just Right      140     9     2
5     Original Oat Bran       70     5    10
6             Heartwise       90     5     6
Part (a): Create a Scatterplot
plot(Cereal$Sugar, Cereal$Calories,
     main = "Scatterplot of Calories vs Sugar",
     xlab = "Sugar (grams per serving)",
     ylab = "Calories",
     pch = 19)
Calculate Correlation
cor(Cereal$Sugar, Cereal$Calories)
[1] 0.5154008
Interpretation
The scatterplot shows a positive linear relationship between sugar and calories in breakfast cereals. As the sugar content per serving increases, the number of calories per serving also tends to increase.
The sample correlation coefficient is r = 0.5154, which indicates a moderate positive linear relationship between sugar and calories.
Part (b): Hypothesis Test for Correlation
We test whether there is a linear relationship between sugar and calories.
The hypotheses are:
\[
H_0: \rho = 0
\]
\[
H_a: \rho \ne 0
\]
where \(\rho\) represents the true population correlation between sugar and calories.
Since the p-value (0.001296) is less than 0.05, we reject \(H_0\).
There is sufficient evidence to conclude that there is a statistically significant linear relationship between sugar and calories in breakfast cereals.
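The p-value used in this test can be reproduced by hand from the sample correlation. The following is a minimal sketch, assuming the standard t test for a Pearson correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n - 2\) degrees of freedom. The values of r and n are copied from the reported results (df = 34 implies n = 36) rather than recomputed from the Cereal data.

```r
# Values copied from the reported results (not recomputed from the data)
r <- 0.5154008   # sample correlation
n <- 36          # sample size, since df = n - 2 = 34

# t statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value
```

Both values should agree closely with the t statistic and p-value printed by cor.test().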
The 95% confidence interval for the true population correlation is:
cor.test(Cereal$Sugar, Cereal$Calories)
    Pearson's product-moment correlation
data:  Cereal$Sugar and Cereal$Calories
t = 3.5069, df = 34, p-value = 0.001296
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2249563 0.7217280
sample estimates:
      cor
0.5154008
\[
(0.22,\; 0.72)
\]
Since 0 is not in this interval, this further supports our conclusion that there is a significant positive linear relationship.
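The confidence interval reported by cor.test() is based on Fisher's z transformation, so it can be checked by hand. The sketch below assumes r = 0.5154008 and n = 36, both taken from the reported output, and builds the interval on the z scale before transforming back with tanh.

```r
# Values copied from the reported results (not recomputed from the data)
r <- 0.5154008
n <- 36

z_hat  <- atanh(r)           # Fisher transformation of r
se     <- 1 / sqrt(n - 3)    # approximate standard error on the z scale
z_crit <- qnorm(0.975)       # critical value for a 95% interval

# Interval on the z scale, transformed back to the correlation scale
ci <- tanh(z_hat + c(-1, 1) * z_crit * se)
ci
```

The two endpoints should round to the same 0.22 and 0.72 shown above.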
Part (c): Bayesian Correlation Analysis
# Sample size
n <- nrow(Cereal)
# Sample correlation
r <- cor(Cereal$Sugar, Cereal$Calories)
# Fisher transformation
z_hat <- atanh(r)
# Standard error
se <- 1 / sqrt(n - 3)
# Simulate posterior distribution
set.seed(123)
z_sim <- rnorm(10000, mean = z_hat, sd = se)
# Transform back to correlation scale
rho_sim <- tanh(z_sim)
# Posterior mean
mean(rho_sim)
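Using a Bayesian approach, we approximated the posterior distribution of the population correlation \(\rho\) by simulation: we drew 10,000 values on the Fisher \(z\) scale from a normal distribution with mean \(\tanh^{-1}(r)\) and standard deviation \(1/\sqrt{n-3}\), then transformed them back to the correlation scale with \(\tanh\).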
Posterior Results
The posterior mean of the correlation is
\[
\hat{\rho}_{\text{posterior}} = 0.5041
\]
The 95% credible interval for the population correlation is
\[
(0.2221,\; 0.7210)
\]
The probability that the true correlation is positive is
\[
P(\rho > 0) = 0.9995
\]
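The code shown for this part prints only the posterior mean. One way the credible interval and the probability \(P(\rho > 0)\) could be obtained from the same kind of draws is sketched below. It assumes an equal-tailed interval taken from the 2.5% and 97.5% quantiles, which is an assumption about how the interval was computed, and it uses r = 0.5154008 and n = 36 from the reported output in place of the Cereal data. The names post_mean, cred_int, and prob_pos are illustrative.

```r
# Values copied from the reported results (not recomputed from the data)
r <- 0.5154008
n <- 36

# Simulate on the Fisher z scale and transform back, as in the code above
set.seed(123)
z_sim   <- rnorm(10000, mean = atanh(r), sd = 1 / sqrt(n - 3))
rho_sim <- tanh(z_sim)

# Posterior summaries
post_mean <- mean(rho_sim)                               # posterior mean
cred_int  <- quantile(rho_sim, probs = c(0.025, 0.975))  # equal-tailed 95% interval (assumed)
prob_pos  <- mean(rho_sim > 0)                           # share of draws with rho > 0

post_mean
cred_int
prob_pos
```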
Interpretation
The posterior mean of approximately 0.504 indicates a moderate positive relationship between sugar and calories.
The 95% credible interval means that, given the data and the model, there is a 95% probability that the true population correlation lies between 0.2221 and 0.7210.
Since the probability that \(\rho > 0\) is 0.9995, there is extremely strong evidence that sugar and calories are positively correlated.
Part (d): Comparison of Frequentist and Bayesian Results
In the frequentist analysis, we tested
\[
H_0: \rho = 0
\]
and obtained a p-value of 0.001296. Since this is less than 0.05, we rejected the null hypothesis and concluded that there is a statistically significant linear relationship between sugar and calories.
The 95% confidence interval did not include 0, which also supported this conclusion.
In the Bayesian analysis, we obtained a posterior mean of 0.5041 and a 95% credible interval of (0.2221, 0.7210). The probability that the true correlation is positive was 0.9995.
While both approaches lead to the same practical conclusion, that sugar and calories are positively correlated, the interpretations differ.
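The frequentist confidence interval means that if we repeatedly drew samples of the same size and constructed a 95% interval from each one, about 95% of those intervals would contain the true correlation. It is a statement about the procedure, not about this one interval.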
In contrast, the Bayesian credible interval means that there is a 95% probability that the true population correlation lies within the interval (0.2221, 0.7210).
Thus, the Bayesian approach allows us to directly state the probability that the correlation is positive, whereas the frequentist approach does not.
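The repeated-sampling interpretation of the confidence interval can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true correlation rho_true = 0.5 is a hypothetical value chosen for illustration, the data are simulated as bivariate normal, and the seed and number of replications are arbitrary. The sample size of 36 matches the df = 34 reported by cor.test().

```r
set.seed(1)        # arbitrary seed for reproducibility
rho_true <- 0.5    # hypothetical true correlation (not estimated from the data)
n        <- 36     # same sample size as the cereal data
n_rep    <- 2000   # number of simulated samples

# For each simulated sample, check whether the 95% interval contains rho_true
covered <- replicate(n_rep, {
  x  <- rnorm(n)
  y  <- rho_true * x + sqrt(1 - rho_true^2) * rnorm(n)  # cor(x, y) = rho_true in the population
  ci <- cor.test(x, y)$conf.int
  ci[1] <= rho_true && rho_true <= ci[2]
})

# Proportion of intervals that contain the true correlation
coverage <- mean(covered)
coverage
```

The proportion should land near 0.95, which is what the 95% confidence level describes: a long-run property of the procedure rather than a probability statement about any single interval.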