Correlation and covariance are both measures of the relationship between two variables in statistics.
Covariance measures how two variables vary together. Its sign indicates the direction of their linear relationship: whether the variables tend to move in the same direction or in opposite directions.
For two variables X and Y, the covariance (cov) is calculated as follows:
\[ \text{cov}(X, Y) = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{N} \]
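where \(X_i\) and \(Y_i\) are individual data points, \(\bar X\) and \(\bar Y\) are the means of X and Y, and N is the number of data points. This is the population covariance; when estimating from a sample, the sum is usually divided by \(N - 1\) instead of \(N\).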
Positive covariance: Indicates a positive relationship (as one variable increases, the other tends to increase).
Negative covariance: Indicates a negative relationship (as one variable increases, the other tends to decrease).
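Covariance close to zero: Indicates a weak or no linear relationship. Because the magnitude of covariance depends on the units of X and Y, "close to zero" has to be judged relative to the scales of the variables.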
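The covariance formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `population_covariance` and the `hours`/`scores` data are made up for the example, and the code divides by N exactly as the formula does.

```python
def population_covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of (X_i - mean_x) * (Y_i - mean_y), divided by N
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data, chosen only to illustrate the sign of the result
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

cov_up = population_covariance(hours, scores)                  # 1.2, positive
cov_down = population_covariance(hours, [-s for s in scores])  # -1.2, negative

print(cov_up, cov_down)
```

Flipping the sign of one variable flips the sign of the covariance, which matches the positive and negative cases listed above.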
Correlation is a standardized measure of the strength and direction of the linear relationship between two variables.
It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
The correlation coefficient (often denoted by \(r\)) is calculated as the covariance divided by the product of the standard deviations of the two variables.
\[ r = \frac{\text{cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]
where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of X and Y, computed with the same divisor as the covariance (N in the formula above).
r=1: Perfect positive linear relationship.
r=−1: Perfect negative linear relationship.
r=0: No linear relationship (a nonlinear relationship may still exist).
\(0<|r|<1\): An imperfect linear relationship; the closer r is to 1 or -1, the stronger the relationship.
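As a rough sketch of the correlation formula, the Python below divides the covariance by the product of the two standard deviations. The function name `pearson_r` and the sample data are assumptions made for illustration; both the covariance and the standard deviations use the divisor N, so the divisor cancels in the ratio.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r = cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        # r is undefined when either variable is constant
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sd_x * sd_y)


# Hypothetical data for the three kinds of outcome
xs = [1, 2, 3, 4, 5]
r_perfect_pos = pearson_r(xs, [2, 4, 6, 8, 10])  # close to 1 (up to rounding)
r_perfect_neg = pearson_r(xs, [10, 8, 6, 4, 2])  # close to -1 (up to rounding)
r_partial = pearson_r(xs, [2, 4, 5, 4, 5])       # strictly between 0 and 1

print(r_perfect_pos, r_perfect_neg, r_partial)
```

The explicit check for a zero standard deviation is a design choice for this sketch: the formula has no defined value when one variable never changes, so raising an error seems safer than returning a number.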
In summary, while covariance provides a measure of how two variables vary together, correlation standardizes this measure, making it easier to interpret and compare relationships between different pairs of variables.
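Correlation is preferred in many cases because it is unitless: rescaling either variable by a positive constant (for example, converting meters to centimeters) changes the covariance but leaves r unchanged, and r always lies between -1 and 1.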
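The scale dependence of covariance, and the scale independence of correlation, can be checked with a small experiment. The sketch below is only illustrative: the helper names `cov` and `corr` and the height/weight numbers are invented for the example, and both helpers use the population (divide by N) definitions.

```python
import math


def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def corr(xs, ys):
    """Correlation: covariance divided by the product of standard deviations."""
    # cov(xs, xs) is the population variance of xs
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


# Hypothetical measurements: the same heights in meters and in centimeters
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50, 58, 63, 72, 80]

cov_m = cov(heights_m, weights_kg)
cov_cm = cov(heights_cm, weights_kg)   # about 100 times cov_m
r_m = corr(heights_m, weights_kg)
r_cm = corr(heights_cm, weights_kg)    # same as r_m, up to rounding

print(cov_m, cov_cm)
print(r_m, r_cm)
```

Changing the unit of height multiplies the covariance by the conversion factor, while the correlation stays the same, which is why r values can be compared across different pairs of variables.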