2024-10-21

What is the Normal Distribution?

  • The normal distribution, also known as the Gaussian distribution, is one of the most important probability distributions in statistics.
  • It is characterized by its bell-shaped curve and is symmetric around the mean.

Properties of the Normal Distribution

The probability density function (PDF) of a Normal Distribution is given by:

\[ f(x|\mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

Where \(\mu\) is the mean and \(\sigma\) is the standard deviation.

  • The mean \(\mu\) determines the center of the distribution.
  • The standard deviation \(\sigma\) determines the width of the bell curve.
    • A larger \(\sigma\) results in a wider and flatter curve. A smaller \(\sigma\) results in a narrower and taller curve.
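The formula above can be checked directly in R. This is a minimal sketch: `normal_pdf` is a hypothetical helper name (not part of base R) that transcribes the PDF term by term and compares it with the built-in `dnorm()`.

```r
# Hypothetical helper: a direct transcription of the PDF formula
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# The hand-written version should agree with base R's dnorm()
x <- c(-2, 0, 1.5)
all.equal(normal_pdf(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))

# Peak height at the mean: doubling sigma halves the peak (wider, flatter curve)
normal_pdf(0, mu = 0, sigma = 1)  # about 0.399
normal_pdf(0, mu = 0, sigma = 2)  # about 0.199
```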

Properties of the Normal Distribution Cont.

68-95-99.7 Rule (Empirical Rule)

  • In a normal distribution:
    • About 68% of data falls within 1 standard deviation of the mean (\(\mu \pm 1\sigma\)),
    • About 95% of data falls within 2 standard deviations of the mean (\(\mu \pm 2\sigma\)),
    • About 99.7% of data falls within 3 standard deviations of the mean (\(\mu \pm 3\sigma\)).
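These percentages can be reproduced from the cumulative distribution function. The sketch below uses base R's `pnorm()` on the standard normal; the variable names `k` and `coverage` are only illustrative.

```r
# Probability mass within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)
# 0.6827 0.9545 0.9973
```

The rule's 68, 95 and 99.7 figures are rounded versions of these values.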

Visualizing the Normal Distribution

We can visualize the Normal Distribution using ggplot2. Below is the plot of the Standard Normal Distribution (mean = 0 and standard deviation = 1).
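One way to draw such a plot is sketched below. The grid range of -4 to 4, the colors, and the object names `standard_normal` and `p` are illustrative choices, not necessarily those used for the original figure.

```r
library(ggplot2)

# Evaluate the standard normal density on an evenly spaced grid
x <- seq(-4, 4, length.out = 400)
standard_normal <- data.frame(x = x, Density = dnorm(x, mean = 0, sd = 1))

# Draw the bell curve
p <- ggplot(standard_normal, aes(x, Density)) +
  geom_line(color = "purple", linewidth = 1) +
  labs(title = "Standard Normal Distribution", x = "z", y = "Density") +
  theme_minimal()
p
```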

Standard Normal Distribution

The Standard Normal Distribution is a special case of the Normal Distribution. It has a mean of 0 and a standard deviation of 1.

Any Normal Distribution can be transformed into a Standard Normal Distribution using the formula:

\[ Z = \frac{X - \mu}{\sigma} \]

Where \(Z\) is the standard score (z-score), \(X\) is the value from the original Normal Distribution, \(\mu\) is the mean of the original distribution, and \(\sigma\) is the standard deviation of the original distribution.

The Z-score represents how many standard deviations a data point \(X\) is from the mean \(\mu\).
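As a small worked example, the values below (a score of 85 from a distribution with mean 70 and standard deviation 10) are made up for illustration. Standardizing leaves probabilities unchanged, which is why a single standard normal table or `pnorm()` call suffices.

```r
# Illustrative values, not taken from any dataset
x     <- 85
mu    <- 70
sigma <- 10

# Standardize: how many standard deviations x lies above the mean
z <- (x - mu) / sigma
z
# 1.5

# P(X <= 85) under Normal(70, 10) equals P(Z <= 1.5) under the standard normal
all.equal(pnorm(x, mean = mu, sd = sigma), pnorm(z))
```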

Applying the Normal Distribution

The Normal Distribution is widely used to model real-world measurements. Let’s use ggplot2 to visualize sepal lengths from the iris dataset, modeling them as approximately normally distributed with the sample mean and standard deviation.

R Code to Generate the Plot:

# Load the iris dataset
data(iris)

# Evaluate a normal density, fitted with the sample mean and sd,
# at each sorted sepal length
sepal_length <- iris$Sepal.Length
density_sepal_length <- dnorm(
  sort(sepal_length), mean = mean(sepal_length), sd = sd(sepal_length))
data_sepal_length <- data.frame(
  Sepal.Length = sort(sepal_length),
  Density = density_sepal_length)

R Code Cont.

library(ggplot2)
ggplot(data_sepal_length, aes(Sepal.Length, Density)) +
  geom_line(color="purple", linewidth=1) +
  labs(title="Distribution of Sepal Length in Iris Dataset",
       x="Sepal Length",
       y="Density") +
  theme_minimal() +
  geom_vline(xintercept = mean(sepal_length), linetype = "dashed", 
             color = "red") +
  geom_vline(xintercept = c(mean(sepal_length) - sd(sepal_length), 
                            mean(sepal_length) + sd(sepal_length)), 
             linetype = "dashed", color = "green") +
  geom_vline(xintercept = c(mean(sepal_length) - 2 * 
                              sd(sepal_length), mean(sepal_length) + 
                              2 * sd(sepal_length)), 
             linetype = "dashed", color = "orange") +
  annotate("text", x = mean(sepal_length), y = 0.05, 
           label = paste("Mean =", round(mean(sepal_length), 2)), 
           vjust = -1, color = "red", size = 4)

Result of R Code: Plot

3D Visualization Using Plotly

Let’s use Plotly to chart a 3D visualization of the mtcars dataset, plotting miles per gallon against horsepower. Adding a Z axis for the PDF density lets us view a bivariate Normal Distribution fitted to the two variables as a 3D surface.
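A possible sketch of such a chart follows. It assumes a bivariate normal fitted with the sample means and covariance of `hp` and `mpg`; the helper `bvn_density`, the 50-point grid, and the axis titles are illustrative and may differ from the original slide's code.

```r
library(plotly)

data(mtcars)
hp  <- mtcars$hp
mpg <- mtcars$mpg

# Fit a bivariate normal using the sample means and covariance matrix
mu         <- c(mean(hp), mean(mpg))
Sigma      <- cov(cbind(hp, mpg))
Sigma_inv  <- solve(Sigma)
norm_const <- 1 / (2 * pi * sqrt(det(Sigma)))

# Hypothetical helper: bivariate normal density, vectorized over (h, m)
bvn_density <- function(h, m) {
  dh <- h - mu[1]
  dm <- m - mu[2]
  q  <- Sigma_inv[1, 1] * dh^2 + 2 * Sigma_inv[1, 2] * dh * dm +
        Sigma_inv[2, 2] * dm^2
  norm_const * exp(-q / 2)
}

# Evaluate the density on a grid covering the observed ranges
hp_grid  <- seq(min(hp), max(hp), length.out = 50)
mpg_grid <- seq(min(mpg), max(mpg), length.out = 50)
z <- outer(hp_grid, mpg_grid, bvn_density)

# A surface trace reads z with rows along y and columns along x, hence t(z)
fig <- plot_ly(x = hp_grid, y = mpg_grid, z = t(z)) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "Horsepower"),
    yaxis = list(title = "Miles per Gallon"),
    zaxis = list(title = "Density")))
fig
```

The surface shows the fitted model rather than evidence that the data are normal; overlaying the observed points would be needed to judge the fit.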

Thank You!
