2026-02-22
\[Cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}\]
1. This is the same notation we used for the mean and variance; we just have a second variable, $Y$.
2. The Numerator: This part of the equation sums the cross products. For each observation, it takes the deviation of $X_i$ from $\bar{X}$ and the deviation of $Y_i$ from $\bar{Y}$, multiplies the two deviations, and adds the products across all $n$ observations.
3. The Denominator: We divide by $n - 1$ to account for degrees of freedom. This adjustment ensures the sample covariance is an unbiased estimate of the population covariance.
4. Interpretation: The sign of the result indicates the direction of the relationship. A positive value means the variables tend to move together. A negative value means they tend to move in opposite directions.
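The steps in the list above can be sketched in Python with NumPy. This is a minimal illustration, not course code: the data values and the helper name `sample_covariance` are invented for the example.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

def sample_covariance(x, y):
    """Sample covariance: sum of cross products divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()              # deviations of X from its mean
    dy = y - y.mean()              # deviations of Y from its mean
    cross_products = dx * dy       # one product per observation
    return cross_products.sum() / (len(x) - 1)

cov_xy = sample_covariance(x, y)
print(cov_xy)                      # positive: x and y tend to move together
print(np.cov(x, y)[0, 1])          # NumPy's built-in, which also divides by n - 1 by default
```

The off-diagonal entry of `np.cov(x, y)` is the covariance between the two variables, so it should agree with the hand-rolled version.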
In Variance, we square deviations \((x_i - \bar{x})^2\) so they do not sum to zero.
- Squaring makes everything positive, which hides the direction of the relationship.

In Covariance, we multiply \((x_i - \bar{x})\) by \((y_i - \bar{y})\) instead of squaring. Unlike raw deviations, these products do not automatically sum to zero, and their signs let us see direction.
- If both variables are above their means, the product is positive $(+ \times +)$.
- If both are below their means, the product is also positive $(- \times -)$.
- If one variable is above its mean while the other is below, the product is negative $(+ \times -)$.

The correlation coefficient standardizes the covariance:
\[r = \frac{Cov(x, y)}{s_x s_y}\]
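- Dividing by the product of the standard deviations, $s_x s_y$, puts the covariance on a standardized scale.
The expanded version of the formula shows that we are comparing the shared variation of \(x\) and \(y\) (the numerator) to their total variation (the denominator). The \(n - 1\) terms in the covariance and in the standard deviations cancel: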
\[r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}\]
- Because we are dividing by the product of the standard deviations, the resulting coefficient is unitless and bounded. The value of $r$ will always fall within the range of -1.0 to +1.0. A result of +1.0 indicates a perfect positive linear relationship, while -1.0 represents a perfect negative linear relationship. A result near zero suggests little or no linear relationship between the variables, although a nonlinear relationship may still exist.
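As a rough check on the two forms of the formula, the sketch below computes \(r\) both ways in Python with NumPy and then rescales the data to show that the result does not depend on units. The data and the function names `pearson_r` and `pearson_r_expanded` are hypothetical, chosen only for this example.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

def pearson_r(x, y):
    """r as the sample covariance divided by the product of sample standard deviations."""
    cov_xy = np.cov(x, y)[0, 1]    # sample covariance (divides by n - 1)
    s_x = x.std(ddof=1)            # sample standard deviation of x
    s_y = y.std(ddof=1)            # sample standard deviation of y
    return cov_xy / (s_x * s_y)

def pearson_r_expanded(x, y):
    """r from the expanded formula, where the n - 1 terms have cancelled."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r = pearson_r(x, y)
r_expanded = pearson_r_expanded(x, y)

# Changing units (multiply x by 100, shift y by 50) leaves r unchanged
r_rescaled = pearson_r(x * 100.0, y + 50.0)

print(r, r_expanded, r_rescaled)
print(np.corrcoef(x, y)[0, 1])     # NumPy's built-in correlation for comparison
```

All four printed values should match up to floating-point rounding, which is the sense in which \(r\) is unitless.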
The correlation coefficient (\(r\)) is a standardized measure that always falls between -1.0 and +1.0. This mathematical property ensures that we can compare the strength of relationships across different datasets regardless of their original units of measurement. This boundary is not arbitrary; it is a direct result of the geometric and algebraic relationship between the two variables.
One way to understand this boundary is to treat our data as vectors. If we center our variables by subtracting their means, we can represent \(X\) and \(Y\) as two vectors in \(n\)-dimensional space, with one dimension per observation. In this context, the correlation coefficient is identical to the cosine of the angle (\(\theta\)) between these two vectors.
\[r = \cos(\theta)\]
Because the cosine of any angle must fall between -1.0 and +1.0, the correlation coefficient is restricted to that same range.
When the two vectors point in exactly the same direction, the angle is 0 degrees and the cosine is 1. This represents a perfect positive correlation. When the vectors point in exactly opposite directions, the angle is 180 degrees and the cosine is -1, representing a perfect negative correlation. If the vectors are perpendicular, the angle is 90 degrees and the cosine is 0, indicating the variables are linearly unrelated.
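The geometric view can be checked numerically. The sketch below, in Python with NumPy, centers each variable and computes the cosine of the angle between the centered vectors as their dot product divided by the product of their lengths. The data and the helper name `cosine_of_centered` are assumptions made for the example.

```python
import numpy as np

def cosine_of_centered(x, y):
    """Cosine of the angle between the mean-centered versions of x and y."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return np.dot(x_c, y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))

# Hypothetical data, invented for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cos_theta = cosine_of_centered(x, y)
# Clip guards against tiny floating-point overshoot before taking arccos
theta_degrees = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# The three special cases described above
cos_same = cosine_of_centered(x, 3.0 * x + 1.0)        # same direction
cos_opposite = cosine_of_centered(x, -2.0 * x + 5.0)   # opposite direction
cos_perpendicular = cosine_of_centered(x, np.array([1.0, 0.0, 2.0, 0.0, 1.0]))

print(cos_theta, np.corrcoef(x, y)[0, 1])   # the two values should agree
print(theta_degrees)
print(cos_same, cos_opposite, cos_perpendicular)
```

Any exact linear function of `x` with a positive slope gives a cosine of 1, and a negative slope gives -1, because centering removes the intercept and scaling does not change a vector's direction.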
UH POLS3316, Spring 2026, Instructor: Tom Hanna