Chapter 4 - Main Concepts

Linear Regression Analogy to the Geocentric Model

The Ptolemaic geocentric model using the additive combination of circles and epicycles was able to predict the movement of the planets with some accuracy. Despite its inaccuracies, the model worked within certain boundaries and had its uses.

Similarly, Linear regression is like the geocentric model of the solar system in many ways:

  • Simple Yet Powerful: uses an additive combination of measurements (like variables) to predict an outcome. It’s a straightforward tool that can describe a wide variety of phenomena.

  • Descriptive but Limited: can accurately describe relationships between variables but may not capture the full complexity of the real-world processes if taken too literally.

  • Useful Approximation: helps in making predictions and understanding data, even though it might not always capture every nuance.

Example to Illustrate

Imagine you’re a farmer trying to predict the yield of your crop based on the amount of rainfall and fertilizer used. You gather data over several years and use linear regression to create a model:

\[ \text{Crop Yield} = \beta_0 + \beta_1 \times \text{Rainfall} + \beta_2 \times \text{Fertilizer} \]

Epicycles in Farming: Just like epicycles were circles on circles to predict planet positions, you’re using rainfall and fertilizer (variables) to predict crop yield.

Approximation: Your linear regression model gives a good approximation of crop yield based on the data you have, similar to how the geocentric model approximated planetary positions.

Limits: If you tried to predict crop yield in a completely different climate with different soil, the model might fail, just as the geocentric model would fail to plot a Mars probe.

Bayesian Linear Regression

The chapter introduces linear regression as a Bayesian procedure, which means using probability distributions to describe uncertainty. Instead of just finding the best-fitting line, Bayesian regression provides a distribution of possible lines, giving a more nuanced understanding of the uncertainty in predictions.


Call:
lm(formula = scores ~ hours, data = data)

Residuals:
      1       2       3       4       5 
-3.5366  1.0061  5.0915 -0.8232 -1.7378 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  47.6220     3.8381  12.408  0.00113 **
hours         5.4573     0.6621   8.242  0.00374 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.792 on 3 degrees of freedom
Multiple R-squared:  0.9577,    Adjusted R-squared:  0.9436 
F-statistic: 67.93 on 1 and 3 DF,  p-value: 0.00374

                2.5 %    97.5 %
(Intercept) 35.407486 59.836417
hours        3.350122  7.564513
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Side-by-Side Comparison

  • Frequentist Plot:

    • Shows a single best-fit line with a shaded area representing the confidence interval.
    • The confidence interval provides an estimate of the uncertainty around the best-fit line.
  • Bayesian Plot:

    • Shows multiple lines sampled from the posterior distribution, indicating a range of plausible models.
    • The spread of these lines gives a visual representation of the uncertainty in the parameter estimates.

Why normal distributions are normal

The familiar “bell” curve of the Gaussian distribution is emerging from the randomness. Where does it come from? Why is it so common?

Normality by addition When you add together many random values from the same distribution, the result tends to form a bell curve, or normal distribution. Here’s an easy way to understand why this happens:

  1. Average and Fluctuations:
  • Think of the average value of the distribution as the center point.
  • Each random value you pick is like a fluctuation or deviation from this average. Some values will be above the average, and some will be below.
  1. Adding Fluctuations:
  • When you start adding these random values, the deviations tend to balance out. A value above the average can be offset by a value below the average.
  • For example, if one value is higher than the average by a certain amount, another value might be lower by about the same amount, canceling each other out.
  1. More Values, More Cancellation:
  • The more values you add together, the more they cancel each other out. This is because there are more opportunities for positive deviations to be balanced by negative deviations.
  1. Resulting Sum:
  • After adding many values, the most likely outcome is one where these fluctuations have balanced out, leading to a sum that is close to zero relative to the average.
  • This balancing act creates a bell curve because there are many ways to achieve sums near the average (due to many combinations of small positive and negative deviations) and fewer ways to achieve extreme sums.

Visual Example

Imagine rolling a die:

  • Each roll gives you a number between 1 and 6.
  • The average (mean) roll is 3.5.
  • If you roll the die many times and add the results, the sum of these rolls will form a bell curve around the mean value because the high and low rolls balance out over many rolls.

Why Use the Gaussian Distribution in Hypotheses?

In the context of statistical modeling, the Gaussian (or normal) distribution is often used as a foundation for building hypotheses. The justifications for this choice fall into two broad categories: ontological and epistemological.

1. Ontological Justification The ontological reason for using the Gaussian distribution is that it appears frequently in nature. While we might never encounter a perfect Gaussian distribution, we often observe patterns that approximate it across various domains and scales. Some examples include:

  • Measurement Errors: Errors in measurements tend to follow a normal distribution.
  • Biological Variations: Growth rates and other biological measures often exhibit Gaussian patterns.
  • Physical Phenomena: The velocities of molecules in a gas, for instance, are normally distributed.

The underlying reason for this prevalence is that many natural processes involve the addition of small, random fluctuations. When these fluctuations are added together repeatedly, the result tends to be a Gaussian distribution. This is because the process of summing these fluctuations sheds all detailed information about the individual contributions, leaving only the mean and spread (variance).

However, it’s important to note that the Gaussian distribution is not the only pattern in nature. There are other significant distributions, like exponential, gamma, and Poisson, which also arise from natural processes. The Gaussian is part of the broader exponential family of distributions, all of which are fundamental in the natural world.

2. Epistemological Justification The epistemological justification for using the Gaussian distribution is based on the concept of maximum entropy and information theory. When we only know or are willing to assume the mean and variance of a distribution, the Gaussian distribution is the most reasonable choice. This is because:

  • State of Ignorance: The Gaussian distribution represents a state of minimal assumptions beyond knowing the mean and variance. It does not introduce any additional assumptions about the data.
  • Maximum Entropy: Given the constraints of finite variance, the Gaussian distribution is the shape that can be realized in the largest number of ways. It is the least surprising and least informative assumption, making it the most consistent with our limited knowledge. In essence, if we don’t have specific information suggesting a different distribution, the Gaussian distribution is the safest, most neutral assumption to make. It aligns with the idea of making the least informative assumption that still fits our known constraints.

Practical Implications Using the Gaussian distribution as a starting point for building models doesn’t mean we are committing to it as the only or best model. It is a practical tool that helps us begin modeling continuous measurements. As we gather more information or develop specific knowledge about the data, we can refine our models and possibly choose other distributions that better fit our data’s characteristics.

In summary, the Gaussian distribution is a powerful tool in statistical modeling due to its natural occurrence in many processes and its role as a default assumption when we have limited information. It provides a solid foundation for building and refining our understanding of data.

Simple Explanation with Examples

Main Idea When we don’t have much information about a set of data, but we do know its average (mean) and how spread out the values are (variance), the Gaussian distribution is the most neutral choice. It’s like saying, “Given what little we know, this is the safest bet.”

Example 1: Heights of Students Imagine you’re a school principal, and you want to understand the heights of students in your school. You don’t have detailed data about every student’s height, but you do know the average height (mean) and the general variability in heights (variance).

Known Information:

  • Average height (mean): 150 cm
  • Variability in heights (variance): 25 cm²

Given just this information, you choose to assume that the heights follow a Gaussian distribution because it is the least biased choice. It doesn’t make any extra assumptions about the specific shape of the data beyond what you know.

Ontological Justification

  • Normal Distribution:
    • You assume the heights follow a normal distribution with the given mean and variance. This is reasonable because it implies that most people have heights around the average, with fewer people being much taller or shorter. This aligns well with how human heights are typically distributed.
  • Uniform Distribution:
    • Alternatively, if you assume the heights follow a uniform distribution, you imply that any height within a certain range is equally likely. This assumption doesn’t match the natural clustering of human heights around an average value and can be considered less realistic and more informative than necessary.

Epistemological Justification

  • Normal Distribution:
    • The normal distribution is the most unbiased choice when you only know the mean and variance of the data. It is the least informative assumption you can make about the data beyond these two parameters.
  • Uniform Distribution:
    • Assumes all outcomes in the interval [a, b] are equally likely, which is a strong assumption compared to knowing just the mean and variance. This makes it a less neutral choice when you have limited information.

Framework for Describing Models

Step 1: Recognize a Set of Variables

Observable Variable (Data): - Heights of individuals (let’s denote it as \(y\)).

Unobservable Variables (Parameters): - Mean height (\(\mu\)). - Standard deviation of heights (\(\sigma\)).

Step 2: Define Each Variable

Heights (\(y\)): The heights are distributed according to a normal (Gaussian) distribution with mean \(\mu\) and standard deviation \(\sigma\).

\[ y_i \sim \text{Normal}(\mu, \sigma) \]

Mean Height (\(\mu\)): This parameter can be modeled with a prior distribution if we have some prior knowledge. Let’s assume a normal prior with mean 178 cm and standard deviation 20 cm.

\[ \mu \sim \text{Normal}(178, 20) \]

Standard Deviation (\(\sigma\)): This parameter can also be modeled with a prior distribution. Let’s assume a uniform prior from 0 to 50 cm, indicating we believe the standard deviation could reasonably be anywhere in this range.

\[ \sigma \sim \text{Uniform}(0, 50) \]

Step 3: Joint Generative Model

Combining these variables and their probability distributions, we define a joint generative model. This model allows us to simulate hypothetical observations and analyze real data.

  1. Specify priors: \[ \mu \sim \text{Normal}(178, 20) \] \[ \sigma \sim \text{Uniform}(0, 50) \]

  2. Specify likelihood: \[ y_i \sim \text{Normal}(\mu, \sigma) \]

This model describes how the observed data (heights) are generated from the underlying parameters (mean and standard deviation).

Step 4: Compute the Posterior

To compute the posterior distribution, we apply Bayes’ theorem:

\[ P(\mu, \sigma \mid y) \propto P(y \mid \mu, \sigma) P(\mu) P(\sigma) \]

where:

  • \(P(\mu, \sigma \mid y)\) is the posterior distribution.
  • \(P(y \mid \mu, \sigma)\) is the likelihood of the observed data.
  • \(P(\mu)\) and \(P(\sigma)\) are the priors.

Prior Choices

Computing the posterior distribution

The chapter investigates two ways of computing the posterior: grid approximation and Quadratic approximation.

  • Grid approximation:
    • Pros: Simple, intuitive, and easy to understand.
    • Cons: Computationally expensive and may not scale well to high-dimensional problems.
  • Quadratic approximation:
    • Pros: Fast and efficient for high-dimensional problems.
    • Cons: Requires more advanced mathematical techniques and may be less intuitive.

Questions to discuss

1. Normal Distribution Justification:

  • Why is the Gaussian distribution often used as the skeleton for hypotheses in Bayesian regression? Discuss the ontological and epistemological justifications presented in the notebook.

2. Priors and Their Influence:

  • How do the choices of priors (Normal vs. Uniform) influence the results of Bayesian regression?

3. Visualizing Posterior Distributions:

  • The notebook includes plots of posterior distributions for 𝜇 and 𝜎 using both grid and quadratic approximations. What insights can you draw from these plots about the behavior of the posterior distributions?

4. Computational Challenges:

  • Bayesian methods, especially those involving complex models, can be computationally intensive. How do address these challenges?
---
title: "Statistical Rethinking with R"
output: html_notebook
---



```{r, echo=FALSE}
# Import the necessary libraries for the analysis.

# Install necessary packages if not already installed
if (!requireNamespace("gganimate", quietly = TRUE)) install.packages("gganimate")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("gifski", quietly = TRUE)) install.packages("gifski")
if (!requireNamespace("av", quietly = TRUE)) install.packages("av")
if (!requireNamespace("base64enc", quietly = TRUE)) install.packages("base64enc")
if (!requireNamespace("magick", quietly = TRUE)) install.packages("magick")

# Load necessary libraries
library(rethinking)
library(ggplot2)
library(gridExtra)

library(ggplot2)
library(gganimate)
library(dplyr)

library(magick)
library(htmltools)
library(base64enc)



```

# Chapter 4 - Main Concepts

## Linear Regression Analogy to the Geocentric Model

The Ptolemaic geocentric model using the additive combination of circles
and epicycles was able to predict the movement of the planets with some
accuracy. Despite its inaccuracies, the model worked within certain
boundaries and had its uses.

Similarly, Linear regression is like the geocentric model of the solar
system in many ways:

-   **Simple Yet Powerful**: uses an additive combination of
    measurements (like variables) to predict an outcome. It's a
    straightforward tool that can describe a wide variety of phenomena.

-   **Descriptive but Limited**: can accurately describe relationships
    between variables but may not capture the full complexity of the
    real-world processes if taken too literally.

-   **Useful Approximation**: helps in making predictions and
    understanding data, even though it might not always capture every
    nuance.

### Example to Illustrate

Imagine you’re a farmer trying to predict the yield of your crop based
on the amount of rainfall and fertilizer used. You gather data over
several years and use linear regression to create a model:

$$ \text{Crop Yield} = \beta_0 + \beta_1 \times \text{Rainfall} + \beta_2 \times \text{Fertilizer} $$

**Epicycles in Farming**: Just like epicycles were circles on circles to
predict planet positions, you’re using rainfall and fertilizer
(variables) to predict crop yield.

**Approximation**: Your linear regression model gives a good
approximation of crop yield based on the data you have, similar to how
the geocentric model approximated planetary positions.

**Limits**: If you tried to predict crop yield in a completely different
climate with different soil, the model might fail, just as the
geocentric model would fail to plot a Mars probe.

### Bayesian Linear Regression

The chapter introduces linear regression as a Bayesian procedure, which
means using probability distributions to describe uncertainty. Instead
of just finding the best-fitting line, Bayesian regression provides a
distribution of possible lines, giving a more nuanced understanding of
the uncertainty in predictions.

```{r, echo=FALSE}
# Example data
data <- data.frame(hours = c(2, 3, 5, 7, 9), scores = c(55, 65, 80, 85, 95))

# Frequentist regression
# Fit a frequentist linear model
freq_model <- lm(scores ~ hours, data = data)

# Summary of the frequentist model
summary(freq_model)

# Predicted values from the frequentist model
data$freq_pred <- predict(freq_model)

# Confidence intervals for the frequentist model
confint(freq_model)


# Bayesian regression
# Define the Bayesian model using the map function
bayes_model <- map(
  alist(
    scores ~ dnorm(mu, sigma),          # Likelihood: exam scores are normally distributed
    mu <- a + b * hours,                # Linear model: mean of scores depends on hours studied
    a ~ dnorm(50, 10),                  # Prior for the intercept
    b ~ dnorm(0, 10),                   # Prior for the slope
    sigma ~ dunif(0, 10)                # Prior for the standard deviation
  ),
  data = data
)

# Summarize the Bayesian model
precis(bayes_model)

# Extract posterior samples
post <- extract.samples(bayes_model, n = 1000)


# Create combined plot
# Create the frequentist plot
freq_plot <- ggplot(data, aes(x = hours, y = scores)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
  labs(title = "Frequentist Linear Regression", x = "Hours Studied", y = "Exam Score")

# Create the Bayesian plot
bayes_plot <- ggplot(data, aes(x = hours, y = scores)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, col = "lightgrey") +
  geom_abline(intercept = mean(post$a), slope = mean(post$b), color = "blue", size = 1.2) +
  labs(title = "Bayesian Linear Regression", x = "Hours Studied", y = "Exam Score")

# Add multiple posterior lines to the Bayesian plot
for (i in 1:100) {
  bayes_plot <- bayes_plot +
    geom_abline(intercept = post$a[i], slope = post$b[i], color = alpha("blue", 0.1))
}

# Combine the two plots side by side
grid.arrange(freq_plot, bayes_plot, ncol = 2)


```

### Side-by-Side Comparison

-   **Frequentist Plot**:

    -   Shows a single best-fit line with a shaded area representing the
        confidence interval.
    -   The confidence interval provides an estimate of the uncertainty
        around the best-fit line.

-   **Bayesian Plot**:

    -   Shows multiple lines sampled from the posterior distribution,
        indicating a range of plausible models.
    -   The spread of these lines gives a visual representation of the
        uncertainty in the parameter estimates.

## Why normal distributions are normal

The familiar “bell” curve of the Gaussian distribution is emerging from
the randomness. Where does it come from? Why is it so common?

**Normality by addition** When you add together many random values from
the same distribution, the result tends to form a bell curve, or normal
distribution. Here’s an easy way to understand why this happens:

1.  **Average and Fluctuations**:

-   Think of the average value of the distribution as the center point.
-   Each random value you pick is like a fluctuation or deviation from
    this average. Some values will be above the average, and some will
    be below.

2.  **Adding Fluctuations**:

-   When you start adding these random values, the deviations tend to
    balance out. A value above the average can be offset by a value
    below the average.
-   For example, if one value is higher than the average by a certain
    amount, another value might be lower by about the same amount,
    canceling each other out.

3.  **More Values, More Cancellation**:

-   The more values you add together, the more they cancel each other
    out. This is because there are more opportunities for positive
    deviations to be balanced by negative deviations.

4.  **Resulting Sum**:

-   After adding many values, the most likely outcome is one where these
    fluctuations have balanced out, leading to a sum that is close to
    zero relative to the average.
-   This balancing act creates a bell curve because there are many ways
    to achieve sums near the average (due to many combinations of small
    positive and negative deviations) and fewer ways to achieve extreme
    sums.

**Visual Example**

Imagine rolling a die:

-   Each roll gives you a number between 1 and 6.
-   The average (mean) roll is 3.5.
-   If you roll the die many times and add the results, the sum of these
    rolls will form a bell curve around the mean value because the high
    and low rolls balance out over many rolls.

```{r, eval=FALSE, echo=FALSE}


# Set seed for reproducibility
set.seed(123)

# Parameters
num_simulations <- 1000
num_rolls <- 30

# Function to simulate dice rolls and calculate sums
simulate_rolls <- function(num_simulations, num_rolls) {
  data <- data.frame()
  for (i in 1:num_simulations) {
    rolls <- sample(1:6, num_rolls, replace = TRUE)
    sum_rolls <- cumsum(rolls)
    temp <- data.frame(roll_number = 1:num_rolls, sum = sum_rolls, simulation = i)
    data <- rbind(data, temp)
  }
  return(data)
}

# Simulate the dice rolls
data <- simulate_rolls(num_simulations, num_rolls)

# Calculate mean and standard deviation for normalization
data <- data %>%
  group_by(roll_number) %>%
  mutate(mean_sum = mean(sum), sd_sum = sd(sum)) %>%
  ungroup()

# Create animated plot
p <- ggplot(data, aes(x = sum)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "blue", alpha = 0.5) +
  geom_density(color = "red", size = 1) +
  labs(title = 'Distribution of Sums of Dice Rolls: Roll {closest_state}',
       x = 'Sum of Dice Rolls', y = 'Density') +
  theme_minimal() +
  transition_states(roll_number, transition_length = 2, state_length = 1) +
  ease_aes('sine-in-out')

# Save the animation
anim_save("dice_rolls_animation.gif", p, renderer = gifski_renderer())


```

```{r, echo=FALSE}
# Display the saved GIF inline
gif_path <- "dice_rolls_animation.gif"
gif_data <- base64enc::dataURI(file = gif_path, mime = "image/gif")

# Use htmltools to embed the GIF in the notebook
htmltools::tags$img(src = gif_data)

```

```{r, echo=FALSE}

# Number of simulations
num_simulations <- 10000

# Number of dice rolls in each simulation
num_rolls <- 30

# Simulate the sums of dice rolls
sums <- replicate(num_simulations, sum(sample(1:6, num_rolls, replace = TRUE)))

# Plot the distribution of the sums
ggplot(data.frame(sums = sums), aes(x = sums)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) +
  geom_density(color = "red") +
  labs(title = "Distribution of Sums of 30 Dice Rolls",
       x = "Sum of Dice Rolls",
       y = "Density") +
  theme_minimal()

```

```{r, echo=FALSE}
# Simulate the sums of random values from a uniform distribution
sums_uniform <- replicate(num_simulations, sum(runif(num_rolls, 0, 1)))

# Plot the distribution of the sums
ggplot(data.frame(sums = sums_uniform), aes(x = sums)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "green", alpha = 0.5) +
  geom_density(color = "red") +
  labs(title = "Distribution of Sums of 30 Uniform Random Values",
       x = "Sum of Uniform Values",
       y = "Density") +
  theme_minimal()
```

### Why Use the Gaussian Distribution in Hypotheses?
In the context of statistical modeling, the Gaussian (or normal) distribution is often used as a foundation for building hypotheses. The justifications for this choice fall into two broad categories: ontological and epistemological.

**1. Ontological Justification**
The ontological reason for using the Gaussian distribution is that it appears frequently in nature. While we might never encounter a perfect Gaussian distribution, we often observe patterns that approximate it across various domains and scales. Some examples include:

- **Measurement Errors**: Errors in measurements tend to follow a normal distribution.
- **Biological Variations**: Growth rates and other biological measures often exhibit Gaussian patterns.
- **Physical Phenomena**: The velocities of molecules in a gas, for instance, are normally distributed.

The underlying reason for this prevalence is that many natural processes involve the addition of small, random fluctuations. When these fluctuations are added together repeatedly, the result tends to be a Gaussian distribution. This is because the process of summing these fluctuations sheds all detailed information about the individual contributions, leaving only the mean and spread (variance).

However, it's important to note that the Gaussian distribution is not the only pattern in nature. There are other significant distributions, like exponential, gamma, and Poisson, which also arise from natural processes. The Gaussian is part of the broader exponential family of distributions, all of which are fundamental in the natural world.

**2. Epistemological Justification**
The epistemological justification for using the Gaussian distribution is based on the concept of maximum entropy and information theory. When we only know or are willing to assume the mean and variance of a distribution, the Gaussian distribution is the most reasonable choice. This is because:

- **State of Ignorance**: The Gaussian distribution represents a state of minimal assumptions beyond knowing the mean and variance. It does not introduce any additional assumptions about the data.
- **Maximum Entropy**: Given the constraints of finite variance, the Gaussian distribution is the shape that can be realized in the largest number of ways. It is the least surprising and least informative assumption, making it the most consistent with our limited knowledge.
In essence, if we don't have specific information suggesting a different distribution, the Gaussian distribution is the safest, most neutral assumption to make. It aligns with the idea of making the least informative assumption that still fits our known constraints.

**Practical Implications**
Using the Gaussian distribution as a starting point for building models doesn't mean we are committing to it as the only or best model. It is a practical tool that helps us begin modeling continuous measurements. As we gather more information or develop specific knowledge about the data, we can refine our models and possibly choose other distributions that better fit our data's characteristics.

In summary, the Gaussian distribution is a powerful tool in statistical modeling due to its natural occurrence in many processes and its role as a default assumption when we have limited information. It provides a solid foundation for building and refining our understanding of data.


**Simple Explanation with Examples**

**Main Idea**
When we don't have much information about a set of data, but we do know its average (mean) and how spread out the values are (variance), the Gaussian distribution is the most neutral choice. It's like saying, "Given what little we know, this is the safest bet."

**Example 1: Heights of Students**
Imagine you're a school principal, and you want to understand the heights of students in your school. You don't have detailed data about every student's height, but you do know the average height (mean) and the general variability in heights (variance).

**Known Information:**

- Average height (mean): 150 cm
- Variability in heights (variance): 25 cm²

Given just this information, you choose to assume that the heights follow a Gaussian distribution because it is the least biased choice. It doesn't make any extra assumptions about the specific shape of the data beyond what you know.


**Ontological Justification**

- Normal Distribution:
  - You assume the heights follow a normal distribution with the given mean and variance. This is reasonable because it implies that most people have heights around the average, with fewer people being much taller or shorter. This aligns well with how human heights are typically distributed.
  
- Uniform Distribution:
  - Alternatively, if you assume the heights follow a uniform distribution, you imply that any height within a certain range is equally likely. This assumption doesn't match the natural clustering of human heights around an average value and can be considered less realistic and more informative than necessary.
  
**Epistemological Justification**

- Normal Distribution:
  - The normal distribution is the most unbiased choice when you only know the mean and variance of the data. It is the least informative assumption you can make about the data beyond these two parameters.
  
- Uniform Distribution:
  - Assumes all outcomes in the interval [a, b] are equally likely, which is a strong assumption compared to knowing just the mean and variance. This makes it a less neutral choice when you have limited information.
  
### Framework for Describing Models

#### Step 1: Recognize a Set of Variables

**Observable Variable (Data)**:
- Heights of individuals (let's denote it as \( y \)).

**Unobservable Variables (Parameters)**:
- Mean height (\( \mu \)).
- Standard deviation of heights (\( \sigma \)).

#### Step 2: Define Each Variable

**Heights (\( y \))**: The heights are distributed according to a normal (Gaussian) distribution with mean \( \mu \) and standard deviation \( \sigma \).

\[
y_i \sim \text{Normal}(\mu, \sigma)
\]

**Mean Height (\( \mu \))**: This parameter can be modeled with a prior distribution if we have some prior knowledge. Let's assume a normal prior with mean 178 cm and standard deviation 20 cm.

\[
\mu \sim \text{Normal}(178, 20)
\]

**Standard Deviation (\( \sigma \))**: This parameter can also be modeled with a prior distribution. Let's assume a uniform prior from 0 to 50 cm, indicating we believe the standard deviation could reasonably be anywhere in this range.

\[
\sigma \sim \text{Uniform}(0, 50)
\]

#### Step 3: Joint Generative Model

Combining these variables and their probability distributions, we define a joint generative model. This model allows us to simulate hypothetical observations and analyze real data.

1. **Specify priors**:
   \[
   \mu \sim \text{Normal}(178, 20)
   \]
   \[
   \sigma \sim \text{Uniform}(0, 50)
   \]

2. **Specify likelihood**:
   \[
   y_i \sim \text{Normal}(\mu, \sigma)
   \]

This model describes how the observed data (heights) are generated from the underlying parameters (mean and standard deviation).

### Step 4: Compute the Posterior

To compute the posterior distribution, we apply Bayes' theorem:

\[
P(\mu, \sigma \mid y) \propto P(y \mid \mu, \sigma) P(\mu) P(\sigma)
\]

where:

- \( P(\mu, \sigma \mid y) \) is the posterior distribution.
- \( P(y \mid \mu, \sigma) \) is the likelihood of the observed data.
- \( P(\mu) \) and \( P(\sigma) \) are the priors.


**Prior Choices**

```{r, echo=False}


# Set up the plotting area to have 1 row and 2 columns
par(mfrow=c(1,2))

# Plot density curves for prior means
plot(NULL, xlim=c(0, 350), ylim=c(0, 0.025), xlab="x", ylab="Density", main="Prior Mean")
curve(dnorm(x, 178, 20), from=0, to=350, col="blue", add=TRUE)
curve(dnorm(x, 178, 50), from=0, to=350, col="red", add=TRUE)
legend("topright", legend=c("Normal(178, 20)", "Normal(178, 50)"), col=c("blue", "red"), lty=1)

# Plot density curves for prior standard deviations
plot(NULL, xlim=c(0, 100), ylim=c(0, 0.05), xlab="x", ylab="Density", main="Prior Standard Deviation")
curve(dunif(x, 0, 50), from=0, to=50, col="blue", add=TRUE)
curve(dunif(x, 0, 100), from=0, to=100, col="red", add=TRUE)
legend("topright", legend=c("Uniform(0, 50)", "Uniform(0, 100)"), col=c("blue", "red"), lty=1)


```
```{r, echo=FALSE}


# Install necessary packages if not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra")

# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Set seed for reproducibility
set.seed(123)

# Define prior parameters
prior_means <- c(178, 178)
prior_sds <- c(20, 50)
prior_sigma_ranges <- list(c(0, 50), c(0, 100))

# Function to compute predictive priors
compute_predictive_prior <- function(mean, sd, sigma_range) {
  sample_mu <- rnorm(1e4, mean, sd)
  sample_sigma <- runif(1e4, sigma_range[1], sigma_range[2])
  prior_h <- rnorm(1e4, sample_mu, sample_sigma)
  return(prior_h)
}

# Compute predictive priors for each combination
prior_h_1 <- compute_predictive_prior(178, 20, c(0, 50))
prior_h_2 <- compute_predictive_prior(178, 20, c(0, 100))
prior_h_3 <- compute_predictive_prior(178, 50, c(0, 50))
prior_h_4 <- compute_predictive_prior(178, 50, c(0, 100))

# Create plots for each combination with reduced title font size
plot1 <- ggplot(data.frame(prior_h_1), aes(x = prior_h_1)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Normal(178, 20) & Uniform(0, 50)", x = "Height", y = "Density") +
  coord_cartesian(xlim = c(0, 350)) +
  theme(plot.title = element_text(size = 10))

plot2 <- ggplot(data.frame(prior_h_2), aes(x = prior_h_2)) +
  geom_density(fill = "red", alpha = 0.5) +
  labs(title = "Normal(178, 20) & Uniform(0, 100)", x = "Height", y = "Density") +
  coord_cartesian(xlim = c(0, 350)) +
  theme(plot.title = element_text(size = 10))

plot3 <- ggplot(data.frame(prior_h_3), aes(x = prior_h_3)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Normal(178, 50) & Uniform(0, 50)", x = "Height", y = "Density") +
  coord_cartesian(xlim = c(0, 350)) +
  theme(plot.title = element_text(size = 10))

plot4 <- ggplot(data.frame(prior_h_4), aes(x = prior_h_4)) +
  geom_density(fill = "purple", alpha = 0.5) +
  labs(title = "Normal(178, 50) & Uniform(0, 100)", x = "Height", y = "Density") +
  coord_cartesian(xlim = c(0, 350)) +
  theme(plot.title = element_text(size = 10))

# Arrange plots in a 2x2 grid
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)

```

### Computing the posterior distribution

The chapter investigates two ways of computing the posterior: grid approximation and Quadratic approximation.

- Grid approximation:
  - Pros: Simple, intuitive, and easy to understand.
  - Cons: Computationally expensive and may not scale well to high-dimensional problems.

- Quadratic approximation:
  - Pros: Fast and efficient for high-dimensional problems.
  - Cons: Requires more advanced mathematical techniques and may be less intuitive.
 


```{r, echo=FALSE}

# Load data
data(Howell1)
d <- Howell1
d2 <- d[d$age >= 18, ]

# Grid approximation
mu.list <- seq(from = 150, to = 160, length.out = 500)
sigma.list <- seq(from = 7, to = 9, length.out = 500)
post <- expand.grid(mu = mu.list, sigma = sigma.list)
post$LL <- sapply(1:nrow(post), function(i) sum(dnorm(d2$height, post$mu[i], post$sigma[i], log = TRUE)))
post$prod <- post$LL + dnorm(post$mu, 178, 20, TRUE) + dunif(post$sigma, 0, 50, TRUE)
post$prob <- exp(post$prod - max(post$prod))

# Sampling from the posterior for grid approximation
sample.rows <- sample(1:nrow(post), size = 1e4, replace = TRUE, prob = post$prob)
sample.mu <- post$mu[sample.rows]
sample.sigma <- post$sigma[sample.rows]

# Quadratic approximation
flist <- alist(
  height ~ dnorm(mu, sigma),
  mu ~ dnorm(178, 20),
  sigma ~ dunif(0, 50)
)
m4.1 <- quap(flist, data = d2)
post_quap <- extract.samples(m4.1, n = 1e4)

# Create density plots for mu and sigma for both methods
dens_mu_grid <- ggplot(data.frame(mu = sample.mu), aes(x = mu)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Posterior of Mu (Grid Approx)", x = expression(mu), y = "Density")

dens_sigma_grid <- ggplot(data.frame(sigma = sample.sigma), aes(x = sigma)) +
  geom_density(fill = "blue", alpha = 0.5) +
  labs(title = "Posterior of Sigma (Grid Approx)", x = expression(sigma), y = "Density")

dens_mu_quap <- ggplot(data.frame(mu = post_quap$mu), aes(x = mu)) +
  geom_density(fill = "red", alpha = 0.5) +
  labs(title = "Posterior of Mu (Quadratic Approx)", x = expression(mu), y = "Density")

dens_sigma_quap <- ggplot(data.frame(sigma = post_quap$sigma), aes(x = sigma)) +
  geom_density(fill = "red", alpha = 0.5) +
  labs(title = "Posterior of Sigma (Quadratic Approx)", x = expression(sigma), y = "Density")

# Arrange plots in a 2x2 grid
grid.arrange(dens_mu_grid, dens_mu_quap, dens_sigma_grid, dens_sigma_quap, ncol = 2)

# Compute 95% credible intervals for both methods
ci_mu_grid <- PI(sample.mu, prob = 0.95)
ci_sigma_grid <- PI(sample.sigma, prob = 0.95)

ci_mu_quap <- PI(post_quap$mu, prob = 0.95)
ci_sigma_quap <- PI(post_quap$sigma, prob = 0.95)

# Round the credible intervals to two decimal places
ci_mu_grid <- round(ci_mu_grid, 2)
ci_sigma_grid <- round(ci_sigma_grid, 2)
ci_mu_quap <- round(ci_mu_quap, 2)
ci_sigma_quap <- round(ci_sigma_quap, 2)

# Create a data frame to compare the credible intervals
comparison_table <- data.frame(
  Parameter = c("Mu", "Sigma"),
  Method = c("Grid Approximation", "Grid Approximation", "Quadratic Approximation", "Quadratic Approximation"),
  Lower = c(ci_mu_grid[1], ci_sigma_grid[1], ci_mu_quap[1], ci_sigma_quap[1]),
  Upper = c(ci_mu_grid[2], ci_sigma_grid[2], ci_mu_quap[2], ci_sigma_quap[2])
)

# Display the comparison table
print(comparison_table)


```

## Questions to discuss

**1. Normal Distribution Justification**:

  - Why is the Gaussian distribution often used as the skeleton for hypotheses in Bayesian regression? Discuss the ontological and epistemological justifications presented in the notebook.

**2. Priors and Their Influence**:

  - How do the choices of priors (Normal vs. Uniform) influence the results of Bayesian regression? 

**3. Visualizing Posterior Distributions**:

  - The notebook includes plots of posterior distributions for 𝜇 and 𝜎 using both grid and quadratic approximations. What insights can you draw from these plots about the behavior of the posterior distributions?

**4. Computational Challenges**:

  - Bayesian methods, especially those involving complex models, can be computationally intensive. How do address these challenges?





