library(MASS)
library(ggplot2)
Understanding the Multivariate Normal Distribution
Overview
This tutorial introduces the multivariate normal (MVN) distribution using simulation in R. We will:
- Generate data using mvrnorm()
- Visualize multivariate structure
- Interpret the variance-covariance matrix
- Use density ellipses and contour plots to build intuition about joint density
1. Setup
We use the MASS package for simulation and ggplot2 for plotting.
2. The Multivariate Normal Distribution
A multivariate normal distribution is defined by:
- A mean vector \(\boldsymbol{\mu}\)
- A variance-covariance matrix \(\boldsymbol{\Sigma}\)
For a 2D case:
\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \]
The mean vector determines the center of the distribution, while the variance-covariance matrix determines its shape, spread, and orientation.
3. Simulating Data
Let’s simulate 500 observations from a 2D MVN distribution.
set.seed(123)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
0.8, 1), nrow = 2)
sim_data <- as.data.frame(mvrnorm(n = 500, mu = mu, Sigma = Sigma))
names(sim_data) <- c("X1", "X2")
head(sim_data)
           X1         X2
1 -0.34137865 -0.7220491
2  0.09586955 -0.5326006
3  1.15402260  1.8034185
4 -0.17061630  0.3043966
5  0.59989348 -0.3545872
6  1.65714177  1.5969652
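As a quick sanity check, the sample mean and sample covariance of the simulated data should land near the mu and Sigma passed to mvrnorm(). The following is a minimal sketch that regenerates the data so it runs on its own; the names sample_mean and sample_cov are illustrative and not part of the original code.

```r
library(MASS)

set.seed(123)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
sim_data <- as.data.frame(mvrnorm(n = 500, mu = mu, Sigma = Sigma))
names(sim_data) <- c("X1", "X2")

# Sample estimates of the mean vector and the variance-covariance matrix
sample_mean <- colMeans(sim_data)
sample_cov  <- cov(sim_data)

sample_mean   # should be close to c(0, 0)
sample_cov    # should be close to Sigma

# Largest absolute gap between estimated and true covariance entries
max(abs(sample_cov - Sigma))
```

With 500 observations the estimates will not match exactly, but the gaps should be small relative to the variances.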
4. Visualizing the Data
ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_point(alpha = 0.35) +
  theme_minimal() +
  labs(title = "Simulated Multivariate Normal Data")
Interpretation
- The cloud of points is centered near the mean vector \(\boldsymbol{\mu} = (0,0)\)
- The elliptical shape reflects joint variability
- The upward tilt suggests a positive relationship between the variables
5. Density Ellipses and Contour Plots
A helpful way to understand the multivariate normal distribution is to look at lines of equal density.
Density ellipse
ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  theme_minimal() +
  labs(title = "Simulated Data with Normal-Theory Density Ellipse")
Contour plot of the joint density
ggplot(sim_data, aes(x = X1, y = X2)) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  labs(title = "Contour Plot of the Joint Density")
Interpretation
- The density ellipse summarizes the overall shape expected under a multivariate normal model
- The contour lines connect locations with similar joint density, like elevation contours on a map
- For a bivariate normal distribution, these contours are elliptical
- When the variances are equal, as they are here, a more stretched ellipse indicates a stronger linear association between the variables
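One way to make the idea of a density ellipse concrete is to count how many simulated points fall inside it. The sketch below assumes a 95% ellipse built from the true mu and Sigma and uses the squared Mahalanobis distance, which follows a chi-squared distribution with 2 degrees of freedom for a bivariate normal. stat_ellipse() estimates the center and covariance from the data and may use a slightly different quantile, so treat this as an approximation to the plotted ellipse rather than a reproduction of it.

```r
library(MASS)

set.seed(123)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)
x <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

# Squared Mahalanobis distance of each point from the true mean
d2 <- mahalanobis(x, center = mu, cov = Sigma)

# Cutoff that defines the ellipse containing 95% of the probability
cutoff <- qchisq(0.95, df = 2)

# Share of simulated points inside that ellipse (expected to be near 0.95)
inside <- mean(d2 <= cutoff)
inside
```

Points with the same Mahalanobis distance lie on the same density contour, which is why a single cutoff on d2 traces out an ellipse.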
6. Understanding the Variance-Covariance Matrix
Our matrix was:
Sigma
     [,1] [,2]
[1,]  1.0  0.8
[2,]  0.8  1.0
Diagonal elements (variances)
- \(\sigma_1^2 = 1\): variance of \(X_1\)
- \(\sigma_2^2 = 1\): variance of \(X_2\)
These control how spread out each variable is individually.
- Larger variance means a variable is more dispersed
- Smaller variance means observations cluster more tightly around the mean
Off-diagonal elements (covariance)
- \(\sigma_{12} = 0.8\)
This measures how \(X_1\) and \(X_2\) vary together.
- Positive covariance: variables tend to increase together
- Negative covariance: one variable tends to decrease as the other increases
- Zero covariance: no linear relationship (for jointly normal variables, this also implies independence)
In a scatterplot, covariance affects the orientation of the point cloud. In density ellipses and contour plots, covariance affects the tilt of the ellipse.
A geometric interpretation of \(\boldsymbol{\Sigma}\)
The variance-covariance matrix controls:
- How wide the distribution is in each direction (variances)
- Whether the distribution is tilted (covariance)
- How elongated the contours are (strength of association)
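The geometric reading above can be made explicit with an eigendecomposition of \(\boldsymbol{\Sigma}\). In this sketch the eigenvectors give the directions of the ellipse axes and the eigenvalues give the variance along those directions. The names eig, major_axis, and slope are illustrative only.

```r
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)

# Eigenvectors give the directions of the ellipse axes;
# eigenvalues give the variance along each of those directions
eig <- eigen(Sigma)

eig$values          # variances along the major and minor axes
sqrt(eig$values)    # axis half-lengths are proportional to these

# Direction of the major axis; the sign of an eigenvector is arbitrary,
# so only its slope is meaningful
major_axis <- eig$vectors[, 1]
slope <- major_axis[2] / major_axis[1]
slope
```

For this Sigma the eigenvalues are 1.8 and 0.2, and the major axis has slope 1, which matches the 45-degree upward tilt seen in the plots. The large gap between the two eigenvalues is what makes the ellipse long and narrow.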
7. From Covariance to Correlation
Correlation standardizes covariance:
cor(sim_data)
          X1        X2
X1 1.0000000 0.7861308
X2 0.7861308 1.0000000
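Because both variances are 1 in this example, the population covariance and the population correlation are both 0.8. The sample correlation of about 0.79 is an estimate of that value, so it differs slightly.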
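When the variances are not 1, covariance and correlation differ. The correlation is the covariance divided by the product of the standard deviations, \(\rho = \sigma_{12} / (\sigma_1 \sigma_2)\). The sketch below uses a hypothetical matrix, Sigma_unequal, chosen only for illustration, and compares the manual calculation with base R's cov2cor().

```r
# Hypothetical covariance matrix with unequal variances (4 and 9)
Sigma_unequal <- matrix(c(4,   2.4,
                          2.4, 9), nrow = 2)

# Manual standardization: covariance divided by the product of the SDs
sd1 <- sqrt(Sigma_unequal[1, 1])
sd2 <- sqrt(Sigma_unequal[2, 2])
rho_manual <- Sigma_unequal[1, 2] / (sd1 * sd2)

# Base R helper that converts a covariance matrix to a correlation matrix
R <- cov2cor(Sigma_unequal)

rho_manual
R
```

Here the covariance is 2.4 but the correlation is 2.4 / (2 * 3) = 0.4, so the two numbers coincide only when both standard deviations equal 1.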
8. Exploring Different Covariance Structures
Case 1: No correlation
Sigma_uncorr <- matrix(c(1, 0,
0, 1), 2)
sim_uncorr <- as.data.frame(mvrnorm(500, mu, Sigma_uncorr))
names(sim_uncorr) <- c("X1", "X2")
ggplot(sim_uncorr, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("No Correlation: Ellipse and Contours")
Interpretation: The cloud is roughly circular, with no preferred direction. The contours are not tilted.
Case 2: Strong positive correlation
Sigma_pos <- matrix(c(1, 0.9,
0.9, 1), 2)
sim_pos <- as.data.frame(mvrnorm(500, mu, Sigma_pos))
names(sim_pos) <- c("X1", "X2")
ggplot(sim_pos, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("Strong Positive Correlation: Ellipse and Contours")
Interpretation: The ellipse is long and narrow, tilted upward. This indicates that high values of one variable tend to occur with high values of the other.
Case 3: Negative correlation
Sigma_neg <- matrix(c(1, -0.8,
-0.8, 1), 2)
sim_neg <- as.data.frame(mvrnorm(500, mu, Sigma_neg))
names(sim_neg) <- c("X1", "X2")
ggplot(sim_neg, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  stat_ellipse(type = "norm", linewidth = 1) +
  geom_density_2d(linewidth = 0.7) +
  theme_minimal() +
  ggtitle("Negative Correlation: Ellipse and Contours")
Interpretation: The ellipse is tilted downward, showing that larger values of one variable tend to occur with smaller values of the other.
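When experimenting with other covariance structures, keep in mind that not every symmetric matrix is a valid variance-covariance matrix: it must have no negative eigenvalues, which in two dimensions means the covariance cannot exceed the product of the standard deviations in absolute value. The sketch below uses a hypothetical helper, is_valid_cov(), written for this example and not part of MASS, to check a matrix before simulating from it.

```r
library(MASS)

# Hypothetical helper: TRUE if S is symmetric with no negative eigenvalues
is_valid_cov <- function(S, tol = 1e-8) {
  isSymmetric(S) &&
    all(eigen(S, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}

Sigma_ok  <- matrix(c(1, 0.9,
                      0.9, 1), 2)
Sigma_bad <- matrix(c(1, 1.2,
                      1.2, 1), 2)   # implied correlation of 1.2 is impossible

is_valid_cov(Sigma_ok)
is_valid_cov(Sigma_bad)

# mvrnorm() is expected to stop with an error for Sigma_bad;
# tryCatch() captures the outcome instead of halting the script
result <- tryCatch(
  mvrnorm(10, mu = c(0, 0), Sigma = Sigma_bad),
  error = function(e) "error"
)
```

Checking validity first gives a clearer message than waiting for the simulation call to fail.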
9. Why Ellipses Appear
For a bivariate normal distribution, contours of equal density form ellipses.
- The center of the ellipse is determined by the mean vector
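- The spread along each coordinate axis is determined by the variances
- The rotation is determined by the covariance together with the variances: the ellipse axes point along the eigenvectors of \(\boldsymbol{\Sigma}\)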
This is one reason the variance-covariance matrix is so important: it determines the geometry of the multivariate distribution.
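The elliptical shape can be read off the density itself. For a \(p\)-dimensional multivariate normal with positive definite \(\boldsymbol{\Sigma}\), the standard form of the density is:

\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \, |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

The density depends on \(\mathbf{x}\) only through the quadratic form in the exponent, so every contour of equal density satisfies

\[ (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c \]

for some constant \(c > 0\). With \(p = 2\) this is the equation of an ellipse centered at \(\boldsymbol{\mu}\). Here \(\mathbf{x}\), \(p\), and \(c\) are notation introduced for this illustration; \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) are as defined in Section 2.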
10. Key Takeaways
- The mean vector sets the center of the distribution
- The variance-covariance matrix controls:
- spread in each variable
- the orientation of the cloud
- the shape of density contours
- Covariance determines how variables move together
- In two dimensions, multivariate normal contours are ellipses
- Density ellipses and contour plots help make the abstract matrix (\(\boldsymbol{\Sigma}\)) visually intuitive