Project 2 - Correlation using Bayesian Analysis

Author

Sabirin Muuse

Introduction

In this project, we analyze the relationship between Calories and Sugar in breakfast cereals using both frequentist and Bayesian methods.

We will:

  1. Create a scatterplot and compute the correlation.
  2. Perform a hypothesis test for correlation.
  3. Conduct a Bayesian correlation analysis.
  4. Compare the frequentist and Bayesian results.

Load Packages

install.packages("ggplot2")
install.packages("tidyverse")
library(ggplot2)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Cereal <- read.csv("https://www.stat2.org/datasets/Cereal.csv")
head(Cereal)
                 Cereal Calories Sugar Fiber
1 Common Sense Oat Bran      100     6     3
2            Product 19      100     3     1
3   All Bran Xtra Fiber       50     0    14
4            Just Right      140     9     2
5     Original Oat Bran       70     5    10
6             Heartwise       90     5     6

Part (a): Create a Scatterplot

plot(Cereal$Sugar, Cereal$Calories,
     main = "Scatterplot of Calories vs Sugar",
     xlab = "Sugar (grams per serving)",
     ylab = "Calories",
     pch = 19)

Calculate Correlation

cor(Cereal$Sugar, Cereal$Calories)
[1] 0.5154008

Interpretation

The scatterplot shows a positive linear relationship between sugar and calories in breakfast cereals. As the sugar content per serving increases, the number of calories per serving also tends to increase.

The sample correlation coefficient is r = 0.5154, which indicates a moderate positive linear relationship between sugar and calories.
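
To make the definition behind cor() concrete, the following minimal sketch computes Pearson's r by hand for the six cereals printed by head(Cereal) above. The vector names sugar and calories are illustrative, and because only six of the rows are used, the result will differ from the full-sample value of 0.5154.

```r
# First six rows of the Cereal data, copied from the head(Cereal) output
sugar    <- c(6, 3, 0, 9, 5, 5)
calories <- c(100, 100, 50, 140, 70, 90)

# Deviations of each variable from its own mean
dx <- sugar - mean(sugar)
dy <- calories - mean(calories)

# Pearson's r: sum of cross-deviations divided by the
# square root of the product of the summed squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function should agree with the hand computation
r_builtin <- cor(sugar, calories)
print(c(manual = r_manual, builtin = r_builtin))
```

The same formula applied to all rows of Cereal is what cor(Cereal$Sugar, Cereal$Calories) evaluates.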

Part (b): Hypothesis Test for Correlation

We test whether there is a linear relationship between sugar and calories.

The hypotheses are:

\[ H_0: \rho = 0 \]

\[ H_a: \rho \ne 0 \]

where \(\rho\) represents the true population correlation between sugar and calories.

cor.test(Cereal$Sugar, Cereal$Calories)

    Pearson's product-moment correlation

data:  Cereal$Sugar and Cereal$Calories
t = 3.5069, df = 34, p-value = 0.001296
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2249563 0.7217280
sample estimates:
      cor 
0.5154008 

The test statistic is t = 3.5069 with 34 degrees of freedom, giving a p-value of 0.001296. Since the p-value is less than 0.05, we reject \(H_0\).

There is sufficient evidence to conclude that there is a statistically significant linear relationship between sugar and calories in breakfast cereals.

The 95% confidence interval for the true population correlation is:

\[ (0.2250,\; 0.7217) \]

Since 0 is not in this interval, this further supports our conclusion that there is a significant positive linear relationship.
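
As a check on where the reported numbers come from, the sketch below recomputes the test statistic, the p-value, and the confidence interval from the summary values alone. It assumes n = 36 (implied by df = 34) and uses the standard t statistic for a correlation together with the Fisher z interval; the variable names are illustrative.

```r
# Summary values taken from the cor.test output
r <- 0.5154008
n <- 36            # df = n - 2 = 34

# t statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4))
```

The values should agree with the cor.test output up to rounding of r.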

Part (c): Bayesian Correlation Analysis

# Sample size
n <- nrow(Cereal)

# Sample correlation
r <- cor(Cereal$Sugar, Cereal$Calories)

# Fisher transformation
z_hat <- atanh(r)

# Standard error
se <- 1 / sqrt(n - 3)

# Simulate posterior distribution
set.seed(123)
z_sim <- rnorm(10000, mean = z_hat, sd = se)

# Transform back to correlation scale
rho_sim <- tanh(z_sim)

# Posterior mean
mean(rho_sim)
[1] 0.5040621
# 95% credible interval
quantile(rho_sim, c(0.025, 0.975))
     2.5%     97.5% 
0.2220941 0.7209906 
# Probability correlation > 0
mean(rho_sim > 0)
[1] 0.9995

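Using a Bayesian approach, we approximated the posterior distribution of the population correlation \(\rho\) by simulation. On the Fisher z scale, \(z = \operatorname{atanh}(\rho)\) is treated as approximately normal with mean \(\operatorname{atanh}(r)\) and standard deviation \(1/\sqrt{n-3}\), which corresponds to a flat prior on \(z\). We drew 10,000 values of \(z\) and transformed them back to the correlation scale with \(\tanh\).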

Posterior Results

The posterior mean of the correlation is

\[ \hat{\rho}_{\text{posterior}} = 0.5041 \]

The 95% credible interval for the population correlation is

\[ (0.2221,\; 0.7210) \]

The probability that the true correlation is positive is

\[ P(\rho > 0) = 0.9995 \]

Interpretation

The posterior mean of approximately 0.504 indicates a moderate positive relationship between sugar and calories.

The 95% credible interval means that, given the observed data and the model above, there is a 95% probability that the true population correlation lies between 0.2221 and 0.7210.

Since the probability that \(\rho > 0\) is 0.9995, there is extremely strong evidence that sugar and calories are positively correlated.
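
Because the simulated posterior is a normal distribution on the Fisher z scale, its quantiles and tail probability can also be obtained in closed form, which gives a cross-check on the Monte Carlo results. The sketch below assumes r = 0.5154008 and n = 36 from the earlier output; the names ci_exact, p_positive, and median_exact are illustrative.

```r
r <- 0.5154008
n <- 36

# Approximate posterior on the z scale: Normal(z_hat, se)
z_hat <- atanh(r)
se    <- 1 / sqrt(n - 3)

# Closed-form 95% credible interval: normal quantiles on the z scale,
# mapped back to the correlation scale with tanh
ci_exact <- tanh(qnorm(c(0.025, 0.975), mean = z_hat, sd = se))

# P(rho > 0) equals P(z > 0), since tanh is increasing and tanh(0) = 0
p_positive <- pnorm(0, mean = z_hat, sd = se, lower.tail = FALSE)

# The posterior median of rho is tanh(z_hat), which is r itself
median_exact <- tanh(z_hat)

print(round(c(lower = ci_exact[1], upper = ci_exact[2],
              p_positive = p_positive, median = median_exact), 4))
```

Small differences from the simulated values reflect Monte Carlo error in the 10,000 draws. The posterior mean (0.5041) sits slightly below the median because the back-transformed distribution is skewed to the left.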

Part (d): Comparison of Frequentist and Bayesian Results

In the frequentist analysis, we tested

\[ H_0: \rho = 0 \]

and obtained a p-value of 0.001296. Since the p-value was less than 0.05, we rejected the null hypothesis and concluded that there is a statistically significant linear relationship between sugar and calories.

The 95% confidence interval, (0.2250, 0.7217), did not include 0, which also supported this conclusion.

In the Bayesian analysis, we obtained a posterior mean of 0.5041 and a 95% credible interval of (0.2221, 0.7210). The probability that the true correlation is positive was 0.9995.

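While both approaches lead to the same practical conclusion, that sugar and calories are positively correlated, the interpretations differ. The two intervals are also nearly identical numerically, because cor.test builds its confidence interval from the same Fisher z transformation with standard error \(1/\sqrt{n-3}\) that the Bayesian simulation uses.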

The frequentist confidence interval means that if we repeatedly drew samples of the same size and constructed an interval from each, about 95% of those intervals would contain the true correlation. It does not assign a probability to the one interval we actually computed.

In contrast, the Bayesian credible interval means that, given the data and the model, there is a 95% probability that the true population correlation lies within the interval (0.2221, 0.7210).

Thus, the Bayesian approach allows us to directly state the probability that the correlation is positive, whereas the frequentist approach does not.
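
The repeated-sampling reading of the confidence interval can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the cereal analysis: the true correlation rho_true = 0.5 and the bivariate normal data are hypothetical choices, and the interval is the Fisher z interval with n = 36.

```r
set.seed(2024)

rho_true <- 0.5    # assumed true correlation, chosen only for illustration
n        <- 36     # same sample size as the cereal data
n_rep    <- 2000   # number of simulated samples

covered <- logical(n_rep)
for (i in seq_len(n_rep)) {
  # Draw a bivariate normal sample whose population correlation is rho_true
  x <- rnorm(n)
  y <- rho_true * x + sqrt(1 - rho_true^2) * rnorm(n)

  # Fisher z 95% confidence interval for this sample
  ci <- tanh(atanh(cor(x, y)) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

  # Record whether the interval contains the true correlation
  covered[i] <- ci[1] <= rho_true && rho_true <= ci[2]
}

# Proportion of intervals that contain rho_true
coverage <- mean(covered)
print(coverage)
```

The printed proportion should be close to 0.95, which is the long-run property the confidence level describes. Any single interval either contains the true value or does not.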