Correlation and covariance are both measures of the relationship between two variables in statistics.

1 Covariance:

1.1 Definition:

Covariance is a measure of how much two variables change together. It indicates the direction of the linear relationship between two variables (whether they tend to increase or decrease together).

1.2 Formula:

For two variables X and Y, the covariance (cov) is calculated as follows:

\[ \text{cov}(X, Y) = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{N} \]

where \(X_i\) and \(Y_i\) are the individual data points, \(\bar{X}\) and \(\bar{Y}\) are the means of X and Y, and N is the number of data points.
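
As a minimal sketch of this formula (assuming Python with NumPy and two small made-up samples x and y), the population covariance can be computed directly and checked against numpy.cov:

```python
import numpy as np

# Hypothetical example data, used only for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

# Population covariance, following the formula above: the mean of the
# products of deviations from each variable's mean (divide by N)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)  # 8.0

# numpy.cov defaults to the sample covariance (divide by N - 1);
# bias=True switches to the population version used in the formula
print(np.cov(x, y, bias=True)[0, 1])  # 8.0
```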

1.3 Interpretation:

  • Positive covariance: Indicates a positive relationship (as one variable increases, the other tends to increase).

  • Negative covariance: Indicates a negative relationship (as one variable increases, the other tends to decrease).

  • Covariance close to zero: Indicates a weak or no linear relationship.

2 Correlation:

2.1 Definition:

Correlation is a standardized measure of the strength and direction of the linear relationship between two variables.

It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

2.2 Formula:

The correlation coefficient (often denoted by \(r\)) is calculated as the covariance divided by the product of the standard deviations of the two variables.

\[ r = \frac{\text{cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]

where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of X and Y.
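
Continuing the sketch (same made-up x and y as above, NumPy assumed), \(r\) follows from the covariance and the two standard deviations, and agrees with numpy.corrcoef:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

# Population covariance and population standard deviations; both divide
# by N, so the normalization cancels in the ratio
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (np.std(x) * np.std(y))
print(r)  # ~0.956

# numpy.corrcoef gives the same value (the N vs N - 1 choice cancels)
print(np.corrcoef(x, y)[0, 1])  # ~0.956
```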

2.3 Interpretation:

  • \(r = 1\): Perfect positive linear relationship.

  • \(r = -1\): Perfect negative linear relationship.

  • \(r = 0\): No linear relationship.

  • \(0 < |r| < 1\): A linear relationship of intermediate strength (the closer \(r\) is to 1 or -1, the stronger the relationship).

In summary, while covariance provides a measure of how two variables vary together, correlation standardizes this measure, making it easier to interpret and compare relationships between different pairs of variables.

Correlation is preferred in many cases because it is independent of the scales of the variables and always ranges between -1 and 1.
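
A quick sketch of that last point (NumPy assumed, same illustrative data as above): rescaling one of the variables changes the covariance but leaves the correlation unchanged:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

# Express x in different units, e.g. multiply by 100
x_scaled = 100 * x

# Covariance scales with the data ...
print(np.cov(x, y, bias=True)[0, 1])         # 8.0
print(np.cov(x_scaled, y, bias=True)[0, 1])  # 800.0

# ... while correlation stays the same and within [-1, 1]
print(np.corrcoef(x, y)[0, 1])         # ~0.956
print(np.corrcoef(x_scaled, y)[0, 1])  # ~0.956
```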

3 Differences Summarized in a Table

Difference between Correlation and Covariance:

| Aspect | Covariance | Correlation |
| --- | --- | --- |
| What it measures | Direction of the linear relationship between two variables | Strength and direction of the linear relationship |
| Range | Unbounded; depends on the scales of the variables | Always between -1 and 1 |
| Standardization | Expressed in the units of X times the units of Y | Unitless: the covariance divided by the product of the standard deviations |

4 Covariance vs Correlation: Applications

4.1 Application of correlation

  • When working with large volumes of data, the objective is to uncover patterns. A correlation matrix is therefore used to search for patterns in the data and to assess whether the variables are highly correlated (see the sketch after this list).

  • A correlation matrix is often used as input for exploratory factor analysis, confirmatory factor analysis, structural equation models, and linear regression when missing values are excluded pairwise.

  • A correlation matrix is also used as a diagnostic when checking other analyses. For example, high correlations among the predictors in a linear regression (multicollinearity) indicate that the regression estimates will be unreliable.
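
As an illustrative sketch of the first point above (pandas and NumPy assumed, with made-up column names), a correlation matrix over a DataFrame shows at a glance which variables are highly correlated; pandas' corr() also excludes missing values pairwise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical dataset: 'spend' is constructed to track 'income',
# while 'age' is generated independently of both
n = 200
income = rng.normal(50, 10, n)
df = pd.DataFrame({
    "income": income,
    "spend": 0.6 * income + rng.normal(0, 3, n),
    "age": rng.normal(40, 12, n),
})

# Pairwise Pearson correlation matrix (missing values excluded pairwise)
corr = df.corr()
print(corr.round(2))
# Highly correlated pairs stand out immediately (income vs spend here),
# while age stays close to 0 with both
```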

4.2 Application of covariance

  • The Cholesky decomposition is used to simulate systems with many interrelated variables. Because a covariance matrix is positive semi-definite, it admits a Cholesky decomposition: the matrix is factored into the product of a lower triangular matrix and its transpose (see the sketch after this list).

  • Principal component analysis (PCA) reduces the dimensionality of large data sets. PCA is carried out by performing an eigendecomposition of the covariance matrix.
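
A brief sketch of both points (NumPy assumed, with an arbitrary illustrative covariance structure): the covariance matrix is factored with a Cholesky decomposition for simulation and with an eigendecomposition for PCA:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 500 observations of 3 interrelated variables
data = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[4.0, 2.0, 0.5],
         [2.0, 3.0, 1.0],
         [0.5, 1.0, 2.0]],
    size=500,
)
cov = np.cov(data, rowvar=False)  # 3x3 covariance matrix of the columns

# Cholesky: cov = L @ L.T with L lower triangular; L turns uncorrelated
# standard-normal draws into draws with the desired covariance
L = np.linalg.cholesky(cov)
simulated = rng.standard_normal((1000, 3)) @ L.T

# PCA: eigendecomposition of the covariance matrix; eigenvectors give the
# principal directions, eigenvalues the variance explained by each
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])  # variances of the principal components
scores = (data - data.mean(axis=0)) @ eigenvectors[:, order]
```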

5 EXTRA: Covariance vs Correlation Matrix

Source: https://www.simplilearn.com/covariance-vs-correlation-article

5.1 What Is A Covariance Matrix?

A covariance matrix is a square matrix that holds the variance of each variable along its diagonal and the covariance between each pair of variables in its off-diagonal cells. Variance is a measure of dispersion: how far the data spread from the dataset's mean. Covariance measures how two variables fluctuate together.
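
A minimal sketch (NumPy assumed, small made-up variables): numpy.cov returns a square matrix with the variances on the diagonal and the pairwise covariances off the diagonal:

```python
import numpy as np

# Three hypothetical variables; np.cov treats each row as one variable
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])
z = np.array([9.0, 7.0, 4.0, 2.0])

cov_matrix = np.cov([x, y, z])  # sample covariances (divide by N - 1)
print(cov_matrix)
# Diagonal entries: the variances of x, y and z
# Off-diagonal entries: the covariance of each pair,
# e.g. cov_matrix[0, 1] is cov(x, y)
```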

5.2 What Is A Correlation Matrix?

A correlation matrix is a matrix of the correlation coefficients between different variables; each cell holds the correlation between one pair of variables. A correlation matrix can be used to summarise data, as an input to a more advanced analysis, or as a diagnostic for further analyses.
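
And a matching sketch for the correlation matrix (NumPy assumed, same illustrative variables): numpy.corrcoef produces the matrix of pairwise correlation coefficients, with ones on the diagonal and every entry between -1 and 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])
z = np.array([9.0, 7.0, 4.0, 2.0])

corr_matrix = np.corrcoef([x, y, z])
print(corr_matrix.round(3))
# Each cell [i, j] is the correlation between variable i and variable j;
# the diagonal is 1.0 (each variable with itself), x and y correlate
# positively, and z (which decreases as x increases) correlates negatively
```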