Let’s see what they are.
The Model
X causes Y, given the covariates C1 and C2.
Let's set up the model and simulate data from it.
library(MASS)  # needed for mvrnorm()
n.sample <- 10000
rho <- 0.05
dat <- mvrnorm(n.sample, c(0,0), matrix(c(1,rho,rho,1),2))
C1 <- dat[,1]
C2 <- dat[,2]
X <- rnorm(n.sample) + C1 + C2
# Y depends on X and on the covariates. The exact coefficients were not given here,
# so these are assumed values, chosen to roughly reproduce the output below.
Y <- rnorm(n.sample) + X + 0.5*C1 + C2
dat <- data.frame(Y, X, C1, C2)  # collect all four variables for cor()/pcor()/spcor()
See how the data are.
# scatterplots of the simulated data; the alpha keeps the 10,000 points readable
plot(dat, col=rgb(0,0,0,alpha=min(1, 1000/n.sample)))
We'll use the 'ppcor' package by Seongho Kim for the partial and semi-partial correlation analysis.
#install.packages('ppcor')
library(ppcor)
Correlations between each pair of variables.
round(cor(dat), 03)
Y X C1 C2
Y 1.000 0.908 0.543 0.713
X 0.908 1.000 0.596 0.597
C1 0.543 0.596 1.000 0.046
C2 0.713 0.597 0.046 1.000
Look at the correlation between X and Y: Y is quite predictable from X alone. Now look at the correlation between C1 and C2: almost nothing.
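As a quick sketch with the simulated data above, the squared correlation between X and Y is just the R-squared of regressing Y on X alone:
cor(X, Y)^2                     # squared simple correlation
summary(lm(Y ~ X))$r.squared    # the same quantity as R-squared, about 0.82 here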
Let’s see what the partial correlations are like.
round(pcor(dat)$estimate, 03)
Y X C1 C2
Y 1.000 0.706 0.331 0.582
X 0.706 1.000 0.240 -0.001
C1 0.331 0.240 1.000 -0.562
C2 0.582 -0.001 -0.562 1.000
The squared partial correlation is "the proportion of the variance in Y not explained by the covariate(s) that can be uniquely explained by X" (Hayes, 2013).
The partial correlation between X and Y is now 0.706, down from the raw correlation of 0.908. Squared, it says that about half of the variance in Y that is not explained by the covariates (C1, C2) can be uniquely explained by X. Wordy!
Let me summarize. 1. Take the variance in Y that is not explainable by the covariates C1 and C2. 2. Ask what proportion of that can be uniquely explained by X.
So the number and the kind of covariates are crucial for the partial correlation, because it deals with the variation that is left over after explaining X and Y with the covariates.
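Those two steps can be checked directly. Here is a minimal sketch, using the objects simulated above, that recovers the partial correlation as a ratio of R-squared values:
r2.cov  <- summary(lm(Y ~ C1 + C2))$r.squared      # 1. variance in Y explained by the covariates alone
r2.full <- summary(lm(Y ~ X + C1 + C2))$r.squared  # adding X to the model
# squared partial correlation = (extra variance from X) / (variance left after the covariates)
(r2.full - r2.cov) / (1 - r2.cov)
sqrt((r2.full - r2.cov) / (1 - r2.cov))            # magnitude of the partial correlation, ~0.706 here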
Another way to calculate the partial correlation is to correlate the residuals of Y with the residuals of X, where each is regressed on the covariates.
resY <- residuals(lm(Y ~ C1 + C2))
resX <- residuals(lm(X ~ C1 + C2))
round(cor(resX, resY), 03)
[1] 0.706
See, they coincide.
The squared semi-partial correlation (a.k.a. part correlation) is the proportion of the total variance in Y that is uniquely explainable by X (Hayes, 2013).
Unlike the partial correlation, it is not a proportion of the leftover variance but of the total variance in Y.
round(spcor(dat)$estimate, 03)
Y X C1 C2
Y 1.000 0.339 0.120 0.244
X 0.399 1.000 0.099 -0.001
C1 0.233 0.165 1.000 -0.451
C2 0.409 -0.001 -0.388 1.000
We get the squared semi-partial correlation as the difference between the R-squared of the multiple linear regression model (with the other covariates) with X and without X.
It’s typically written as \(\Delta R^2\).
sum1 <- summary(lm(Y ~ X+C1+C2))
sum2 <- summary(lm(Y ~ C1+C2))
round(sqrt(sum1$r.squared - sum2$r.squared), 03)
[1] 0.339
Notice that C1 and C2 have high partial and semi-partial correlations even though they are (nearly) causally independent: conditioning on X and Y, which are both driven by C1 and C2, induces an association between them. Correlation does not imply causation!
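For instance, conditioning on X alone is already enough. A quick sketch with ppcor's pcor.test() and the simulated vectors:
# partial correlation of the two covariates given only X:
# clearly negative, even though their raw correlation is only ~0.05
pcor.test(C1, C2, X)$estimate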
Another way to conceptualize the semi-partial correlation is to see it as the correlation between "the residuals of the regression of X on C1 and C2" and Y. You can also get the semi-partial correlation from the multiple regression of standardized Y on C1, C2 and resX.
resX <- residuals(lm(X ~ C1+C2))
cor(Y, resX)
[1] 0.3391886
summary(lm(scale(Y) ~ C1 + C2 + resX))
Call:
lm(formula = scale(Y) ~ C1 + C2 + resX)
Residuals:
Min 1Q Median 3Q Max
-1.30418 -0.22589 0.00136 0.22521 1.37979
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004687 0.003408 -1.375 0.169
C1 0.504899 0.003369 149.867 <2e-16 ***
C2 0.679658 0.003364 202.022 <2e-16 ***
resX 0.337469 0.003391 99.531 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3408 on 9996 degrees of freedom
Multiple R-squared: 0.8839, Adjusted R-squared: 0.8839
F-statistic: 2.537e+04 on 3 and 9996 DF, p-value: < 2.2e-16
Actually, you do not even need to regress Y on the other covariates.
summary(lm(scale(Y) ~ resX))
Call:
lm(formula = scale(Y) ~ resX)
Residuals:
Min 1Q Median 3Q Max
-3.6313 -0.6292 0.0008 0.6284 3.5553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.742e-16 9.408e-03 0.00 1
resX 3.375e-01 9.360e-03 36.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9408 on 9998 degrees of freedom
Multiple R-squared: 0.115, Adjusted R-squared: 0.115
F-statistic: 1300 on 1 and 9998 DF, p-value: < 2.2e-16
This is quite a surprise, because normally the coefficient from a simple regression and the coefficient from a multiple regression are different!
The residuals resX above cannot be predicted by any linear combination of C1 and C2; in other words, cor(resX, C1) and cor(resX, C2) are 0:
cor(resX, C1);
[1] 4.423245e-18
cor(resX, C2)
[1] -4.787394e-17
The coefficients doesn’t get changed by including any of C1 and C2. So it’s just the correlation between resX and Y!
And we can think of it as the correlation between X and Y in which X stands for what is left of X after being explained by C1 and C2.
In contrast, the partial correlation is the correlation between X and Y in which both X and Y stand for what is left after being explained by C1 and C2.
So the name "semi-partial" fits. For the partial correlation, variance is stripped (explained) away from both X and Y by the covariates. For the semi-partial correlation, variance is stripped away only from X.
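In code, the only difference is which variable gets residualized (recapping the computations above):
cor(residuals(lm(Y ~ C1 + C2)), residuals(lm(X ~ C1 + C2)))  # partial: residualize both, ~0.706
cor(Y, residuals(lm(X ~ C1 + C2)))                           # semi-partial: residualize only X, ~0.339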
Let me draw a diagram that compares the two.
[Figure: Partial Correlation]
[Figure: Semi-partial Correlation]