It’s possible to perform a correlation analysis with lm(). There are a couple of reasons why that may be desirable: we can test the correlation against values other than 0, and we can pool results across multiply imputed data sets.
Let’s work with the trees data set built into R. From the help page, “this data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees.”
data(trees)
pairs(trees)
cor(trees) |> round(2)
## Girth Height Volume
## Girth 1.00 0.52 0.97
## Height 0.52 1.00 0.60
## Volume 0.97 0.60 1.00
It appears all three are positively correlated with one another.
The cor.test function allows us to get a 95% confidence interval and test whether the correlation is different from 0. For example, here’s a confidence interval and test for Girth and Height.
cor.test(~ Girth + Height, data = trees)
##
## Pearson's product-moment correlation
##
## data: Girth and Height
## t = 3.2722, df = 29, p-value = 0.002758
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2021327 0.7378538
## sample estimates:
## cor
## 0.5192801
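For reference, cor.test derives this interval using Fisher’s z transformation, and we can reproduce it by hand. Here’s a quick sketch, assuming the default two-sided 95% level.

r <- cor(trees$Girth, trees$Height)
z <- atanh(r)                     # Fisher z transformation
se <- 1 / sqrt(nrow(trees) - 3)   # standard error of z
tanh(z + c(-1, 1) * qnorm(0.975) * se)
## [1] 0.2021327 0.7378538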
But what if we wanted to test whether the correlation was different from a certain value, say 0.20? We can’t do that with cor.test. However, we can do it fairly easily if we fit a linear model using lm.
To do that we need to standardize both Girth and Height, which we can do using the scale function. When both variables have mean 0 and standard deviation 1, the slope from a simple regression of one on the other equals their Pearson correlation.
# scale returns a one-column matrix; [,1] extracts it as a plain vector
trees$GirthZ <- scale(trees$Girth)[,1]
trees$HeightZ <- scale(trees$Height)[,1]
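As a quick check, the standardized columns should now have mean 0 and standard deviation 1.

sapply(trees[, c("GirthZ", "HeightZ")], \(x) c(mean = mean(x), sd = sd(x))) |> round(3)
##      GirthZ HeightZ
## mean      0       0
## sd        1       1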
Now we’re ready to calculate correlation using lm.
# the intercept is dropped (~ 0 +); both variables are centered, so it would be 0 anyway
m <- lm(GirthZ ~ 0 + HeightZ, data = trees)
coef(summary(m)) |> round(3)
## Estimate Std. Error t value Pr(>|t|)
## HeightZ 0.519 0.156 3.328 0.002
We see the coefficient for HeightZ matches the correlation we obtained from cor.test. The hypothesis test, which checks whether the coefficient is different from 0, is essentially the same as well; the t statistic differs slightly because the no-intercept model has 30 residual degrees of freedom rather than 29. But now that we have this as a model object, we can do further work. For example, using the linearHypothesis function in the car package we can test whether the correlation is different from 0.2 instead of 0.
library(car)
linearHypothesis(m, "HeightZ = 0.2")
## Linear hypothesis test
##
## Hypothesis:
## HeightZ = 0.2
##
## Model 1: restricted model
## Model 2: GirthZ ~ 0 + HeightZ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 24.969
## 2 30 21.910 1 3.0582 4.1873 0.04958 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
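Since the hypothesis involves a single coefficient, the F statistic above is just the square of a t statistic, so we can reproduce the p-value by hand from the fitted model. A minimal sketch:

b <- coef(m)["HeightZ"]
se <- sqrt(vcov(m)["HeightZ", "HeightZ"])
t_stat <- (b - 0.2) / se
(2 * pt(abs(t_stat), df = df.residual(m), lower.tail = FALSE)) |> round(5)
## HeightZ 
## 0.04958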
Now let’s randomly delete some values so we can demonstrate how this approach extends to multiple imputation.
set.seed(1)
i <- sample(nrow(trees), size = 6)
trees$HeightZ[i] <- NA
summary(trees[,4:5])
## GirthZ HeightZ
## Min. :-1.5769 Min. :-2.0402
## 1st Qu.:-0.7005 1st Qu.:-0.1569
## Median :-0.1110 Median : 0.4708
## Mean : 0.0000 Mean : 0.2009
## 3rd Qu.: 0.6378 3rd Qu.: 0.7847
## Max. : 2.3427 Max. : 1.7264
## NA's :6
Now we use multiple imputation to fill in the missing values. For this we’ll use the mice package, performing 5 imputations using predictive mean matching (PMM).
library(mice)
imp <- mice(trees[,4:5], m = 5, method = "pmm",
print = FALSE, seed = 1)
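If you’d like to inspect the imputed values, the complete function in mice returns a filled-in data set; for example, the first of the five:

head(complete(imp, 1))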
We would like to calculate the correlation on all 5 imputed data sets and pool the results. There is no pooling method for cor.test results, but there is one for models fit with lm. Below we use the with and pool functions to fit the regression to each imputed data set and combine the results.
fit <- with(imp, lm(GirthZ ~ 0 + HeightZ))
pool(fit) |> summary()
## term estimate std.error statistic df p.value
## 1 HeightZ 0.3375085 0.1905754 1.770997 22.07142 0.09037167
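We can also request a pooled confidence interval; the summary method for pooled mice results accepts a conf.int argument (assuming a reasonably recent version of mice):

pool(fit) |> summary(conf.int = TRUE)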