Covariance and bivariate correlation
Correlation and covariance measure the (linear) relationship between two or more variables.
This figure describes the covariance or correlation between two variables. As discussed in the EFA theory section, the double-headed arrow connecting a variable to itself represents its variance. Covariance describes how one variable varies in relation to another, and is defined as:
\[cov_{XY}= \frac{\sum(X-M_X)(Y-M_Y)}{N-1}\]
where \(M\) is the mean, and \(N\) is the sample size.
Variance is obtained by:
\[var=\frac{\sum(X-M)^2}{N-1}\]
The values of a covariance have no limits. Positive covariance indicates that as one variable increases, so does the other. Negative covariance indicates the opposite.
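To make these formulas concrete, here is a minimal sketch (using illustrative simulated variables, not the data used later) that computes the covariance and variance by hand and compares them with R's built-in cov() and var():
# Hedged sketch: covariance and variance 'by hand' versus the built-ins
X <- rnorm(100, 50, 10)
Y <- X + rnorm(100, 0, 10)
N <- length(X)
sum((X - mean(X)) * (Y - mean(Y))) / (N - 1) # manual covariance
cov(X, Y)                                    # should match
sum((X - mean(X))^2) / (N - 1)               # manual variance of X
var(X)                                       # should match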
In contrast to covariance, correlation measures the relationship between two variables after they have been standardized (centred to have a mean of 0, and standard deviation of 1). This standardization into z-scores is done by:
\[Z_{X}=\frac{X-M}{\sigma}\]
Where \(Z_X\) is the standardized score of the variable \(X\), \(M\) is the mean of \(X\), and \(\sigma\) is the standard deviation (SD) of \(X\) (which is \(\sigma=\sqrt{var} =\sqrt{\frac{\sum(X-M)^2}{N-1}}\), expressed in the same units as the variable).
The correlation between the two standardized variables is calculated as:
\[r_{XY}=\frac{\sum{(Z_X)(Z_Y)}}{N-1}\] Where \(Z_X\) and \(Z_Y\) are the standardized values of the variables \(X\) and \(Y\) respectively. Correlation is easier to interpret than covariance, and is often represented by the Bravais-Pearson coefficient \(r\). This tells us both the direction and strength of a correlation. The correlation matrix is simply a matrix describing the correlation of two or more variables. There are also other types of correlation; the use of tetrachoric and polychoric correlation coefficients is common because they are suitable for non-interval data. Tetrachoric correlation is used to estimate the association between two dichotomous variables, and polychoric correlation is used for ordinal-level data.
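Returning to the Pearson formula, a quick sketch (again with illustrative data) shows that standardizing the variables and then applying the covariance formula reproduces cor():
# Hedged sketch: correlation as the covariance of z-scores
X <- rnorm(100, 50, 10)
Y <- X + rnorm(100, 0, 10)
Z.X <- (X - mean(X)) / sd(X)     # standardize X
Z.Y <- (Y - mean(Y)) / sd(Y)     # standardize Y
sum(Z.X * Z.Y) / (length(X) - 1) # manual Pearson correlation
cor(X, Y)                        # should match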
Partial correlation
What about when we observe a correlation between two variables, but this correlation is actually the consequence of a confounding variable? Let's look at an example in which wellbeing and chocolate consumption appear correlated because they are confounded by wealth.
# Load the packages used below
library(knitr)      # kable()
library(kableExtra) # kable_styling()
library(magrittr)   # the %>% pipe
# Create the DF - Z.Wealth is the confounder, X.Well and Y.Choc are wellbeing and chocolate
Z.Wealth <- seq(1, 101, 1)
df <- data.frame("X.Well" = rnorm(101, Z.Wealth, 20), "Y.Choc" = rnorm(101, Z.Wealth, 20), Z.Wealth)
# Correlation matrix
round(cor(df), 2) %>%
  kable(format = "html", table.attr = "style='width:50%;'") %>%
  kable_styling()
           X.Well  Y.Choc  Z.Wealth
X.Well       1.00    0.68      0.88
Y.Choc       0.68    1.00      0.83
Z.Wealth     0.88    0.83      1.00
Here we can use partial correlations to estimate the relationship between X and Y accounting for Z, which will be held constant. This correlation is written as \(r_{XY.Z}\), calculated as:
\[r_{XY.Z}=\frac{r_{XY}-(r_{XZ})(r_{YZ})}{\sqrt{(1-r^2_{XZ})(1-r^2_{YZ})}}\]
In this equation, \(Z\) (here, wealth) is held constant; we remove the relationship of \(Z\) with both \(X\) and \(Y\), leaving only the relationship between \(X\) and \(Y\).
Let's apply this to the data presented above.
# Function to calculate partial correlations
partial_func <- function(X, Y, Z){
  # Bivariate correlations
  R.XY <- cor(X, Y)
  R.XZ <- cor(X, Z)
  R.YZ <- cor(Y, Z)
  # Numerator: remove the part of the X-Y correlation that is due to Z
  numerator <- R.XY - (R.XZ * R.YZ)
  # Denominator: scale by the variance in X and Y not shared with Z
  denominator <- sqrt((1 - R.XZ^2) * (1 - R.YZ^2))
  R.XY.Z <- numerator / denominator
  return(R.XY.Z)
}
# Implement the partial correlation
round(partial_func(X = df$X.Well, Y = df$Y.Choc, Z = df$Z.Wealth), 2)
## [1] -0.17
# # Double check the equation is correct
# pcor(df)
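The commented-out check above presumably refers to pcor() from the ppcor package; assuming that package is installed, the cross-check would look something like this:
# Hedged sketch: cross-check with ppcor (assumed to be the source of pcor())
library(ppcor)
round(pcor(df)$estimate, 2) # partial correlation matrix; the X.Well/Y.Choc entry should match partial_func()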
If we implement the partial correlation, we can now see that the correlation between \(X\) and \(Y\) is no longer strongly positive, since we’ve accounted for \(Z\).
It is unwise to interpret the correlation or partial correlation as indicating causality. Causality may be considered according to three criteria:
- The association rule: two variables must be statistically associated.
- There must be a causal order, often one event happening before another.
- The non-artificiality rule: the association between two variables must not disappear when removing another variable higher in the causal order (above the two variables).
Linear regression analysis
Regression is called “simple” when there is only one predictor variable describing a criterion (dependent) variable, such as when \(X\) predicts \(Y\). In such a model the regression coefficient can be either non-standardized (\(B\)) or standardized (\(\beta\)), and \(e\) is the residual error. This can be graphically expressed as:

Here the arrow connecting the explanatory variable to the response describes the coefficient, the circular arrow at the explanatory variable is the variance in that variable, and the circular arrow at the response is the residual variance in the response variable. This model is written as:
\[Y=\alpha+\beta X + e\] where \(\alpha\) is the intercept, and \(\beta\) is the coefficient. This model is often estimated using OLS, where estimates are found by minimising the sum of squared differences between the observed values and the values predicted by the linear function.
This yields the correlation \(R\) between the observed and predicted values; its square, \(R^2\), describes the proportion of variance in the response variable attributable to the explanatory variable or variables - this is an indicator of model fit.
Remember, non-standardized coefficients reflect the original units of the explanatory variable, whereas standardized coefficients are on a correlation-like scale (in simple regression, between -1 and 1). A coefficient \(\beta\) indicates the expected increase in \(Y\) in standard deviation units for a 1 SD increase in \(X\). If all variables are standardized, then you can compare the \(\beta\) associated with different variables.
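As an illustration of the difference between \(B\) and \(\beta\), the same model can be fitted to the raw variables and to their z-scores (illustrative data only; scale() does the standardization):
# Hedged sketch: non-standardized (B) versus standardized (beta) coefficients
X <- rnorm(200, 10, 2)
Y <- 3 * X + rnorm(200, 0, 5)
coef(lm(Y ~ X))               # B: change in Y (original units) per one-unit change in X
coef(lm(scale(Y) ~ scale(X))) # beta: change in Y (in SDs) per one-SD change in X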
When the model has at least two explanatory variables, we call it multiple regression. Here we describe a model where the explanatory variables are uncorrelated.
library(lavaan) # for sem() and inspect()
# Create df
X <- rnorm(100, 0, 1)
Z <- rnorm(100, 0, 1)
Z <- Z - lm(Z ~ X)$fitted # make the two variables orthogonal
Y <- X * 1 + Z * 1 + rnorm(100, 0, .1)
df1 <- data.frame("Y" = Y, "X" = X, "Z" = Z)
# The model
model_1 <- 'Y ~ X + Z' # regression
# Run the model
fit_1 <- sem(model_1, df1)
With uncorrelated variables, we can see that the results of the simple linear regression (only \(Y\) and \(X\)) and multiple regression (\(Y\), \(X\) and \(Z\)) are approximately the same.
# The simple regression
summary(lm(df1$Y~df1$X))
##
## Call:
## lm(formula = df1$Y ~ df1$X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2426 -0.5985 -0.2032 0.6668 3.2660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001888 0.099040 0.019 0.985
## df1$X 0.986123 0.101203 9.744 4.35e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9858 on 98 degrees of freedom
## Multiple R-squared: 0.4921, Adjusted R-squared: 0.4869
## F-statistic: 94.95 on 1 and 98 DF, p-value: 4.35e-16
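For comparison (not shown in the original output), the multiple regression on the same uncorrelated data can be inspected either with lm() or from the lavaan fit created above; the coefficient for \(X\) should be close to the simple-regression estimate:
# Multiple regression on df1, for comparison with the simple regression above
coef(lm(Y ~ X + Z, data = df1))
# The same estimates from the lavaan model
summary(fit_1)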

However, when they are correlated we find that the results of the simple regression differ from those of the multiple regression; the multiple regression approximates the true coefficient, but the results of the simple regression are clearly off.
# Create df - 'shared' is a common component that makes X and Z correlated
shared <- rnorm(100, 0, .5)
X <- shared + rnorm(100, 0, 1)
Z <- shared + rnorm(100, 0, 1)
Y <- X * 1 + Z * 1 + rnorm(100, 0, .1)
df2 <- data.frame("Y" = Y, "X" = X, "Z" = Z)
# The model
model_2 <- 'Y ~ X + Z' # regression
# Run the model
fit_2 <- sem(model_2, df2)
# Simple regression with a correlated variable
summary(lm(df2$Y~df2$X))
##
## Call:
## lm(formula = df2$Y ~ df2$X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.38507 -0.78822 0.09523 0.65450 2.28799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.22643 0.09824 -2.305 0.0233 *
## df2$X 1.13976 0.07687 14.828 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9817 on 98 degrees of freedom
## Multiple R-squared: 0.6917, Adjusted R-squared: 0.6886
## F-statistic: 219.9 on 1 and 98 DF, p-value: < 2.2e-16
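Again for comparison, the multiple regression on the correlated data should recover coefficients close to the true value of 1 for both predictors, unlike the simple regression above:
# Multiple regression on df2; both coefficients should be near 1
coef(lm(Y ~ X + Z, data = df2))
# The same estimates from the lavaan model
summary(fit_2)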

At a more basic level, we can see this difference when running correlations and partial correlations with uncorrelated and correlated explanatory variables. With uncorrelated explanatory variables, the bivariate value and the partial correlation between \(X\) and \(Y\) are fairly similar (note that the first value below is actually a covariance, since the code calls cov() rather than cor(); the partial correlation is somewhat higher because it also removes the variance in \(Y\) explained by \(Z\)).
# Covariance (note: cov() rather than cor())
cov(df1$X,df1$Y)
## [1] 0.9450399
# Partial correlation
round(partial_func(df1$Y,df1$X,df1$Z),3)
## [1] 0.994
But they differ when the explanatory variables are correlated - the correlation is less than the partial correlation.
# Correlation
cor(df2$X,df2$Y)
## [1] 0.8316846
# Partial correlation
round(partial_func(df2$Y,df2$X,df2$Z),3)
## [1] 0.998
Because we set the true coefficients to 1, we expect that increasing \(X\) by one unit will result in a one-unit higher value of \(Y\), holding \(Z\) constant. We also added some noise to the data, but this is approximately the result we found.
We can calculate the \(R^2\) of our model, to see how well the results fit the data. The \(R^2\) is calculated by:
\[R^2=\beta_1 r_{YX}+\beta_2 r_{YZ}\] where \(\beta_1\) and \(\beta_2\) are the standardized coefficients. We can look at the \(R^2\) of our second model, and find it is very high - unsurprisingly, since we've accounted for everything but the random noise we added to \(Y\).
inspect(fit_2, 'r2')
## Y
## 0.997
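As a check on the formula above, the \(R^2\) can be reconstructed by hand from the standardized coefficients and the bivariate correlations (a sketch using lm() on the scaled variables to obtain the \(\beta\)s):
# Hedged sketch: rebuild R-squared from standardized betas and bivariate correlations
betas <- coef(lm(scale(Y) ~ scale(X) + scale(Z), data = df2))[-1] # drop the intercept
r.YX <- cor(df2$Y, df2$X)
r.YZ <- cor(df2$Y, df2$Z)
betas["scale(X)"] * r.YX + betas["scale(Z)"] * r.YZ # should be close to inspect(fit_2, 'r2')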
If we ran the model with only one explanatory variable, we’d get a much lower fit:
# The model
model_3 <- 'Y ~ X ' # regression
# Run the model
inspect(sem(model_3, df2), 'r2')
## Y
## 0.692
Two points to consider with multiple regression:
- First, coefficients for highly collinear variables are challenging to estimate. This is a perennial problem (including within SEM). However, in some cases it might be appropriate to estimate the common cause of variation between highly correlated variables, and use this as an explanatory variable (e.g. using factor analysis within SEM).
- Second, multiple regression is additive, which precludes modelling both the direct and indirect (through a mediator variable, for instance) effects of an explanatory variable. This second limitation can be addressed through path analysis, and more generally through SEM, as in the sketch below.
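As a pointer to where this is going, a minimal path-analysis sketch in lavaan with a hypothetical mediator M (variable names and data are purely illustrative) shows how a direct and an indirect effect can be estimated in one model:
# Hedged sketch: a simple mediation/path model in lavaan (illustrative data)
library(lavaan)
X <- rnorm(200)
M <- 0.5 * X + rnorm(200)
Y <- 0.4 * M + 0.3 * X + rnorm(200)
df_med <- data.frame(X, M, Y)
model_med <- '
  M ~ a * X          # X -> M
  Y ~ b * M + c * X  # M -> Y, plus the direct effect of X
  indirect := a * b  # indirect effect of X through M
'
fit_med <- sem(model_med, df_med)
summary(fit_med)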