1. Abstract

This project compares two methods of analyzing the correlation between two variables. Using R, we select two variables from the Cereal dataset, estimate their correlation with a frequentist method and a Bayesian method, and compare what each result says about the data.

2. Introduction

When considering breakfast cereals, we often think of a sugary meal with little to no nutritional value. This project uses R to estimate the correlation between calories and sugar in breakfast cereals using Pearson's correlation test (a t-test) and a Bayesian method. We then compare the p-value to the Bayesian posterior probability and the confidence interval to the Bayesian credible interval.

3. Methodology

Data source: Stat2 Models for a World of Data (Cereal dataset).

Tools: R programming language, base packages, ggplot2, BayesianFirstAid.

Data visualization: scatterplots created with ggplot2 and base R.

4. Results

To begin our analysis of the correlation methods, we first visualize our data with Sugar on the x-axis and Calories on the y-axis, where each point represents a cereal.

The scatterplot shows a pattern: the points are not scattered at random, and calories tend to rise with sugar. This justifies the next step of our analysis, which is to calculate the correlation coefficient between Calories and Sugar.

## Correlation coefficient is: 0.5154008
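As a minimal sketch of what `cor()` computes by default, Pearson's coefficient divides the sum of cross-products of the deviations from the means by the square root of the product of the summed squared deviations. The vectors `x` and `y` below are made-up illustrative values, not rows from Cereal.csv, and `pearson_r` is a hypothetical helper name introduced only for this example.

```r
# Made-up illustrative values; NOT taken from Cereal.csv.
x <- c(1, 3, 6, 9, 12)
y <- c(90, 110, 100, 120, 150)

# Hypothetical helper: Pearson's r computed from its definition.
pearson_r <- function(x, y) {
  dx <- x - mean(x)  # deviations of x from its mean
  dy <- y - mean(y)  # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y)  # base R, Pearson by default

print(c(manual = r_manual, builtin = r_builtin))
```

The two printed values should agree up to floating-point error, which is a quick way to confirm the definition.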

Next we conduct a t-test on the correlation to obtain its confidence interval and a p-value. The p-value tells us whether to reject or fail to reject the null hypothesis that there is no correlation between calories and sugar in our Cereal dataset.

## 
##  Pearson's product-moment correlation
## 
## data:  Cereal$Calories and Cereal$Sugar
## t = 3.5069, df = 34, p-value = 0.001296
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2249563 0.7217280
## sample estimates:
##       cor 
## 0.5154008

Using the standard threshold of 0.05, our p-value of 0.001296 indicates that we should reject the null hypothesis, meaning that calories and sugar are significantly correlated.
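The reported t statistic and p-value can be reproduced from the correlation coefficient and the sample size alone. The sketch below assumes the usual relationship t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; `r` and `n` are copied from the output above, and the variable names are chosen only for this example.

```r
# Values copied from the reported output (r = 0.5154008, n = 36, df = 34).
r <- 0.5154008
n <- 36

# t statistic for testing a correlation of zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

cat("t =", round(t_stat, 4), " p-value =", signif(p_value, 4), "\n")
```

Because the p-value is two-sided, it counts correlations that are at least this extreme in either direction.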

Additionally, our confidence interval tells us with 95% confidence that the correlation coefficient of the population made up of all cereals falls between 0.2249563 and 0.7217280.
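The confidence interval printed by `cor.test()` is based on Fisher's z transformation, so it can be approximately reconstructed by hand. This sketch assumes the standard large-sample form: transform r with `atanh()`, use a standard error of 1 / sqrt(n - 3), and map the limits back with `tanh()`. The inputs are copied from the output above.

```r
# Values copied from the reported output.
r <- 0.5154008
n <- 36

z      <- atanh(r)          # Fisher z transform of r
se     <- 1 / sqrt(n - 3)   # approximate standard error on the z scale
z_crit <- qnorm(0.975)      # two-sided 95% critical value

# Interval on the z scale, mapped back to the correlation scale.
ci <- tanh(z + c(-1, 1) * z_crit * se)

print(round(ci, 4))
```

The interval is not symmetric around r because the back-transformation compresses values as they approach 1.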

Turning to our alternative method, we use a Bayesian correlation test to estimate the population correlation and its 95% credible interval.

## 
##  Bayesian First Aid Pearson's Correlation Coefficient Test
## 
## data: Sugar and Calories (n = 36)
## Estimated correlation:
##   0.49 
## 95% credible interval:
##   0.22 0.72 
## The correlation is more than 0 by a probability of 0.998 
## and less than 0 by a probability of 0.002

The Bayesian correlation test estimates that the population correlation is 0.49. The 95% credible interval tells us that there is a 95% probability that the correlation of the population is between 0.22 and 0.72.

Our last step is to plot the results of the Bayesian analysis.

The histogram above the scatter plot shows the posterior distribution of the population correlation. The posterior distribution represents what we know about the parameter, in this case, correlation, after considering the data, combined with previous beliefs and knowledge before observing the data. The bars spread across the number line in the histogram show a 0.2% chance that the population correlation is negative and a 99.8% chance that it is positive.
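To make the reading of the histogram concrete, the sketch below shows how such probabilities follow from a set of posterior draws: the probability that the correlation is negative is simply the share of draws below zero. The `draws` vector is made up for illustration and is not output from `bayes.cor.test()`; `summarize_posterior` is a hypothetical helper, and its median and quantile-based interval are simplifying assumptions, since the package may use a different point estimate or a highest-density interval.

```r
# Hypothetical helper: summarize posterior draws of a correlation.
summarize_posterior <- function(draws, level = 0.95) {
  tail <- (1 - level) / 2
  list(
    estimate      = median(draws),                               # point estimate
    interval      = unname(quantile(draws, c(tail, 1 - tail))),  # equal-tailed interval
    prob_negative = mean(draws < 0),                             # share of draws below 0
    prob_positive = mean(draws > 0)                              # share of draws above 0
  )
}

# Made-up draws for illustration only; NOT output from bayes.cor.test().
draws <- c(-0.05, 0.30, 0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.65, 0.70)

s <- summarize_posterior(draws)
print(s)
```

With real MCMC output the vector would hold thousands of draws, but the arithmetic is the same.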

As mentioned earlier, the scatterplot shows the relationship between the two variables. The spikes on the margins of the scatterplot represent the distributions of the x-axis and y-axis variables, that is, the frequency with which each value occurs in the data. The curves smooth out these spikes and give an approximation of each distribution.

5. Discussion

Let us first compare our p-value to the Bayesian probability that the population correlation is negative. Our p-value of 0.001296 indicates that, if the null hypothesis of no correlation between calories and sugar were true, there would be only a 0.1296% chance of observing a correlation at least as extreme as ours. This value suggests strong evidence against the null hypothesis.

On the other hand, the Bayesian probability of 0.002 tells us that there is a 0.2% chance that the correlation between calories and sugar in the population is negative, that is, less than zero.

Both measurements suggest that calories and sugar are very likely positively correlated. However, they answer different questions. The p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true. The Bayesian probability is the probability, given the data and the prior, that the population correlation is negative. A negative correlation would indicate that as calories increase, sugar decreases, which seems counterintuitive for breakfast cereals.
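The contrast between the two quantities can be written compactly. The notation here is introduced only for this illustration: $T$ is the test statistic, $t_{\text{obs}}$ its observed value, and $\rho$ the population correlation.

$$
p\text{-value} = P\left(|T| \ge |t_{\text{obs}}| \,\middle|\, \rho = 0\right),
\qquad
P(\rho < 0 \mid \text{data}) = \int_{-1}^{0} p(\rho \mid \text{data}) \, d\rho
$$

The first is a statement about hypothetical data with the parameter held fixed at zero. The second is a statement about the parameter with the observed data held fixed.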

Let us now compare our 95% confidence interval to the Bayesian 95% credible interval. Our confidence interval of 0.2249563 to 0.7217280 indicates that we can have 95% confidence that the population correlation lies within this range. The Bayesian 95% credible interval indicates a 95% probability that the population correlation falls between 0.22 and 0.72.

While these two intervals cover similar ranges and may seem to convey the same information, the difference lies in how we interpret them. The confidence interval describes the procedure: if we repeated the sampling many times, about 95% of the intervals constructed this way would contain the true population correlation. The credible interval, by contrast, is a direct probability statement about the population correlation, given the data and prior beliefs.

Both measurements communicate related concepts but highlight different aspects of statistical interpretation.

6. Conclusion

In this project, we used the Cereal dataset to calculate the correlation between calories and sugar and, using two different statistical methods, to draw conclusions about the population correlation. We compared and discussed the differences between the p-value and the Bayesian posterior probability, and between the confidence interval and the credible interval.

7. References

Cereal Dataset: https://www.stat2.org/datasets/Cereal.csv

8. Appendices

8.1 Setup Code

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(BayesianFirstAid)

Cereal <- read.csv('Cereal.csv')

8.2 Cereal Scatter Plot Code

ggplot(Cereal,
       mapping = aes(x = Sugar, y = Calories)) +
  geom_point() +
  labs(title = "Calories vs Sugar in Breakfast Cereals")

8.3 Correlation Coefficient Code

correlation_coefficient <- cor(Cereal$Calories, Cereal$Sugar)

cat("Correlation coefficient is:", correlation_coefficient)

8.4 T-test Code

hypothesis_test_result <- cor.test(Cereal$Calories, Cereal$Sugar)

print(hypothesis_test_result)

8.5 Bayes Cor Test and Plot

Sugar <- Cereal$Sugar 
Calories <- Cereal$Calories

bayes_cortest_cereal <- bayes.cor.test(Sugar, Calories)

print(bayes_cortest_cereal)

plot(bayes_cortest_cereal)