Understanding the Multivariate Normal Distribution

Author

Trevor Caughlin

Overview

This tutorial introduces the multivariate normal (MVN) distribution using simulation in R. We will:

  • Generate data using mvrnorm()
  • Visualize multivariate structure
  • Interpret the variance-covariance matrix
  • Use density ellipses and contour plots to build intuition about joint density

1. Setup

We use the MASS package for simulation and ggplot2 for plotting.

library(MASS)
library(ggplot2)

2. The Multivariate Normal Distribution

A multivariate normal distribution is defined by:

  • A mean vector \(\boldsymbol{\mu}\)
  • A variance-covariance matrix \(\boldsymbol{\Sigma}\)

For a 2D case:

\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \]

The mean vector determines the center of the distribution, while the variance-covariance matrix determines its shape, spread, and orientation.
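The role of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) is easiest to see in the density itself. For a \(k\)-dimensional multivariate normal with positive-definite \(\boldsymbol{\Sigma}\), the standard form of the density is:

\[ f(\mathbf{x}) = (2\pi)^{-k/2} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

As a minimal sketch, the formula can be evaluated directly in base R. The helper name dmvn_manual is illustrative and is not part of any package; the values of mu and Sigma are the ones used later in this tutorial.

```r
# Evaluate the MVN density at a single point x
# (illustrative helper, not from a package)
dmvn_manual <- function(x, mu, Sigma) {
  k <- length(mu)
  dev <- x - mu
  # Quadratic form (x - mu)' Sigma^{-1} (x - mu); solve() avoids forming the inverse
  q <- sum(dev * solve(Sigma, dev))
  (2 * pi)^(-k / 2) * det(Sigma)^(-1 / 2) * exp(-q / 2)
}

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)

dens_at_mean   <- dmvn_manual(c(0, 0), mu, Sigma)   # peak of the density
dens_on_ridge  <- dmvn_manual(c(1, 1), mu, Sigma)   # along the positive-covariance direction
dens_off_ridge <- dmvn_manual(c(1, -1), mu, Sigma)  # same Euclidean distance, against it

c(dens_at_mean, dens_on_ridge, dens_off_ridge)
```

The points (1, 1) and (1, -1) are equally far from the mean, yet the density is much higher at (1, 1) because the positive covariance concentrates probability along that diagonal.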


3. Simulating Data

Let’s simulate 500 observations from a 2D MVN distribution.

set.seed(123)

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)

sim_data <- as.data.frame(mvrnorm(n = 500, mu = mu, Sigma = Sigma))
names(sim_data) <- c("X1", "X2")
head(sim_data)
           X1         X2
1 -0.34137865 -0.7220491
2  0.09586955 -0.5326006
3  1.15402260  1.8034185
4 -0.17061630  0.3043966
5  0.59989348 -0.3545872
6  1.65714177  1.5969652
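A quick sanity check is to compare the sample mean vector and the sample covariance matrix with the values used in the simulation. The sketch below repeats the simulation so that it runs on its own; the estimates should be near, but not exactly equal to, mu and Sigma.

```r
library(MASS)

set.seed(123)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
sim_data <- as.data.frame(mvrnorm(n = 500, mu = mu, Sigma = Sigma))
names(sim_data) <- c("X1", "X2")

sample_mean <- colMeans(sim_data)  # estimate of the mean vector
sample_cov  <- cov(sim_data)       # estimate of the variance-covariance matrix

sample_mean
sample_cov
```

If exact agreement is wanted for a demonstration, mvrnorm() also accepts empirical = TRUE, which rescales the draws so that the sample mean and covariance match mu and Sigma.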

4. Visualizing the Data

ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_point(alpha = 0.35) +
  theme_minimal() +
  labs(title = "Simulated Multivariate Normal Data")

Interpretation

  • The cloud of points is centered near the mean vector \(\boldsymbol{\mu} = (0,0)\)
  • The elliptical shape reflects joint variability
  • The upward tilt suggests a positive relationship between the variables

5. Density Ellipses and Contour Plots

A helpful way to understand the multivariate normal distribution is to look at lines of equal density.

Density ellipse

ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  theme_minimal() +
  labs(title = "Simulated Data with Normal-Theory Density Ellipse")

Contour plot of the joint density

ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  labs(title = "Contour Plot of the Joint Density")

Interpretation

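  • The density ellipse summarizes the overall shape expected under a multivariate normal model; by default, stat_ellipse() draws the 95% ellipse (level = 0.95)
  • The contour lines connect locations with similar estimated joint density, like elevation contours on a map; geom_density_2d() computes them from a kernel density estimate of the data, so they look somewhat irregular
  • For a bivariate normal distribution, the true density contours are elliptical
  • When the variances are equal, as here, a more stretched ellipse means a stronger linear association between the variables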
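One way to make the idea of a 95% ellipse concrete is through the squared Mahalanobis distance, which measures distance from the mean in units set by \(\boldsymbol{\Sigma}\). The sketch below is an idealized check that uses the known population mu and Sigma; it is not a reproduction of what stat_ellipse() computes, since that function estimates the center and covariance from the data. The seed and sample size are arbitrary choices for illustration.

```r
library(MASS)

set.seed(42)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
x <- mvrnorm(n = 5000, mu = mu, Sigma = Sigma)

# Squared Mahalanobis distance of each point from the mean
d2 <- mahalanobis(x, center = mu, cov = Sigma)

# For a bivariate normal, d2 follows a chi-square distribution with 2 df,
# so this cutoff defines the ellipse that contains 95% of the probability
cutoff <- qchisq(0.95, df = 2)
inside_frac <- mean(d2 <= cutoff)
inside_frac
```

The proportion of simulated points inside the ellipse should be close to 0.95, with some sampling variation.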

6. Understanding the Variance-Covariance Matrix

Our matrix was:

Sigma
     [,1] [,2]
[1,]  1.0  0.8
[2,]  0.8  1.0

Diagonal elements (variances)

  • \(\sigma_1^2 = 1\): variance of \(X_1\)
  • \(\sigma_2^2 = 1\): variance of \(X_2\)

These control how spread out each variable is individually.

  • Larger variance means a variable is more dispersed
  • Smaller variance means observations cluster more tightly around the mean

Off-diagonal elements (covariance)

  • \(\sigma_{12} = 0.8\)

This measures how \(X_1\) and \(X_2\) vary together.

  • Positive covariance: variables tend to increase together
  • Negative covariance: one variable tends to decrease as the other increases
  • Zero covariance: no linear relationship (for jointly normal variables, this also implies independence)

In a scatterplot, covariance affects the orientation of the point cloud. In density ellipses and contour plots, covariance affects the tilt of the ellipse.
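To see what "vary together" means numerically, the sample covariance can be computed by hand as the average product of deviations from the means. This is a minimal sketch with an arbitrary seed; it only confirms that the hand calculation agrees with cov().

```r
library(MASS)

set.seed(1)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
x <- mvrnorm(n = 500, mu = c(0, 0), Sigma = Sigma)
x1 <- x[, 1]
x2 <- x[, 2]

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (length(x1) - 1)

cov_builtin <- cov(x1, x2)
c(manual = cov_manual, builtin = cov_builtin)
```

When both variables tend to be above (or below) their means at the same time, the products of deviations are mostly positive, which is what produces a positive covariance.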

A geometric interpretation of \(\boldsymbol{\Sigma}\)

The variance-covariance matrix controls:

  1. How wide the distribution is in each direction (variances)
  2. Whether the distribution is tilted (covariance)
  3. How elongated the contours are (strength of association, along with any difference between the variances)
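This geometry can be read off from the eigendecomposition of \(\boldsymbol{\Sigma}\): the eigenvectors give the directions of the ellipse axes, and the square roots of the eigenvalues give their relative lengths. The following sketch applies base R's eigen() to the matrix used in this tutorial.

```r
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)

eig <- eigen(Sigma, symmetric = TRUE)

# Eigenvalues: variance along each principal axis of the ellipse
axis_var <- eig$values

# Relative half-lengths of the axes are the square roots of the eigenvalues
axis_len <- sqrt(axis_var)

# Angle of the major axis in degrees, reduced modulo 180 because the
# sign of an eigenvector is arbitrary
major <- eig$vectors[, 1]
angle_deg <- (atan2(major[2], major[1]) * 180 / pi) %% 180

axis_var
axis_len
angle_deg
```

For this matrix the eigenvalues are 1.8 and 0.2, and the major axis lies at 45 degrees. With equal variances the axes always fall on the two diagonals; the size of the covariance then changes how elongated the ellipse is rather than its angle.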

7. From Covariance to Correlation

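Correlation standardizes covariance by dividing it by the product of the two standard deviations, \(\rho = \sigma_{12} / (\sigma_1 \sigma_2)\), so it always lies between -1 and 1. The sample correlation matrix for the simulated data is: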

cor(sim_data)
          X1        X2
X1 1.0000000 0.7861308
X2 0.7861308 1.0000000

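Because both variances are 1 in this example, the population covariance and correlation are both 0.8. The sample correlation (0.786) differs slightly from 0.8 because it is estimated from 500 random draws.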
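The distinction matters as soon as the variances differ from 1. The sketch below converts a covariance matrix to a correlation matrix with cov2cor() from base R and repeats the calculation by hand. The matrix Sigma_unequal is a hypothetical example that does not appear elsewhere in this tutorial.

```r
# Equal unit variances: covariance and correlation matrices coincide
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
cov2cor(Sigma)

# Unequal variances (hypothetical): sd of X1 is 2, sd of X2 is 1
Sigma_unequal <- matrix(c(4, 1.6,
                          1.6, 1), nrow = 2)

# Correlation by hand: covariance divided by the product of standard deviations
rho_manual <- Sigma_unequal[1, 2] /
  sqrt(Sigma_unequal[1, 1] * Sigma_unequal[2, 2])

# Same result from the built-in conversion
R_unequal <- cov2cor(Sigma_unequal)

rho_manual
R_unequal
```

Here the covariance is 1.6 but the correlation is still 0.8, because the covariance scales with the units of the variables while the correlation does not.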


8. Exploring Different Covariance Structures

Case 1: No correlation

Sigma_uncorr <- matrix(c(1, 0,
                         0, 1), 2)

sim_uncorr <- as.data.frame(mvrnorm(500, mu, Sigma_uncorr))
names(sim_uncorr) <- c("X1", "X2")

ggplot(sim_uncorr, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("No Correlation: Ellipse and Contours")

Interpretation: The cloud is roughly circular, with no preferred direction. The contours are not tilted.


Case 2: Strong positive correlation

Sigma_pos <- matrix(c(1, 0.9,
                      0.9, 1), 2)

sim_pos <- as.data.frame(mvrnorm(500, mu, Sigma_pos))
names(sim_pos) <- c("X1", "X2")

ggplot(sim_pos, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("Strong Positive Correlation: Ellipse and Contours")

Interpretation: The ellipse is long and narrow, tilted upward. This indicates that high values of one variable tend to occur with high values of the other.


Case 3: Negative correlation

Sigma_neg <- matrix(c(1, -0.8,
                      -0.8, 1), 2)

sim_neg <- as.data.frame(mvrnorm(500, mu, Sigma_neg))
names(sim_neg) <- c("X1", "X2")

ggplot(sim_neg, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("Negative Correlation: Ellipse and Contours")

Interpretation: The ellipse is tilted downward, showing that larger values of one variable tend to occur with smaller values of the other.
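The three cases differ only in the off-diagonal entry of the matrix. To explore further structures, it can help to build \(\boldsymbol{\Sigma}\) from standard deviations and a correlation instead of typing covariances directly. The helper make_sigma below is an illustrative sketch and is not part of MASS or any other package.

```r
# Build a 2 x 2 covariance matrix from standard deviations and a correlation
# (illustrative helper, not from a package)
make_sigma <- function(sd1, sd2, rho) {
  stopifnot(sd1 > 0, sd2 > 0, abs(rho) < 1)
  D <- diag(c(sd1, sd2))            # standard deviations on the diagonal
  R <- matrix(c(1, rho,
                rho, 1), nrow = 2)  # correlation matrix
  D %*% R %*% D
}

Sigma_neg  <- make_sigma(1, 1, -0.8)  # same matrix as Case 3
Sigma_wide <- make_sigma(3, 1, 0)     # hypothetical: unequal spread, no correlation

# A valid covariance matrix has no negative eigenvalues
eigen(Sigma_neg, symmetric = TRUE)$values
eigen(Sigma_wide, symmetric = TRUE)$values
```

Passing Sigma_wide to mvrnorm() gives a cloud that is elongated along the X1 axis but not tilted, which shows that elongation alone does not imply correlation.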


9. Why Ellipses Appear

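For a bivariate normal distribution, contours of equal density form ellipses. The density depends on \(\mathbf{x}\) only through the quadratic form \((\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\), and the set of points where this quantity equals a constant is an ellipse.

  • The center of the ellipse is determined by the mean vector
  • The extent along each coordinate axis is determined by the variances
  • The tilt is determined by the covariance together with the variances: the ellipse axes point along the eigenvectors of \(\boldsymbol{\Sigma}\), with lengths proportional to the square roots of its eigenvalues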

This is one reason the variance-covariance matrix is so important: it determines the geometry of the multivariate distribution.
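An equal-density contour can also be constructed directly by stretching and rotating a unit circle with a matrix square root of \(\boldsymbol{\Sigma}\). The sketch below uses the Cholesky factor for this and then checks that every point on the resulting curve has the same squared Mahalanobis distance from the mean. The 95% level and the number of points on the curve are arbitrary choices for illustration.

```r
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
c_level <- qchisq(0.95, df = 2)  # squared Mahalanobis radius of the contour

# Points on the unit circle, one per column
theta <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))

# chol() returns upper-triangular U with t(U) %*% U equal to Sigma
L <- t(chol(Sigma))

# Stretch and shear the circle, then shift it to the mean
# (mu is recycled down each column, so mu[1] is added to row 1 and mu[2] to row 2)
ellipse <- mu + sqrt(c_level) * (L %*% circle)

# Every point on the curve has the same squared Mahalanobis distance
d2 <- mahalanobis(t(ellipse), center = mu, cov = Sigma)
range(d2)
```

Both ends of range(d2) equal c_level up to floating-point error, which confirms that the curve is a single contour of the density. Converting t(ellipse) to a data frame allows the curve to be overlaid on a scatterplot with geom_path().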


10. Key Takeaways

  • The mean vector sets the center of the distribution
  • The variance-covariance matrix controls:
    • spread in each variable
    • the orientation of the cloud
    • the shape of density contours
  • Covariance determines how variables move together
  • In two dimensions, multivariate normal contours are ellipses
  • Density ellipses and contour plots help make the abstract matrix (\(\boldsymbol{\Sigma}\)) visually intuitive