It’s possible to perform a correlation analysis with lm(). There are a couple of reasons why that may be desirable: we can test the correlation against values other than 0, and we can pool results across multiply imputed data sets.
Let’s work with the trees data set built into R. From the help page, “this data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees.”
data(trees)
pairs(trees)
cor(trees) |> round(2)
## Girth Height Volume
## Girth 1.00 0.52 0.97
## Height 0.52 1.00 0.60
## Volume 0.97 0.60 1.00
It appears all three are positively correlated with one another.
The cor.test function allows us to get a 95% confidence interval and test whether the correlation is different from 0. For example, here’s a confidence interval and test for Girth and Height.
cor.test(~ Girth + Height, data = trees)
##
## Pearson's product-moment correlation
##
## data: Girth and Height
## t = 3.2722, df = 29, p-value = 0.002758
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2021327 0.7378538
## sample estimates:
## cor
## 0.5192801
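For reference, cor.test derives this interval using Fisher’s z transformation, and we can reproduce it by hand. Here’s a quick sketch, assuming the default two-sided 95% level.

r <- cor(trees$Girth, trees$Height)
z <- atanh(r)                     # Fisher z transformation
se <- 1 / sqrt(nrow(trees) - 3)   # standard error of z
tanh(z + c(-1, 1) * qnorm(0.975) * se)
## [1] 0.2021327 0.7378538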
But what if we wanted to test whether the correlation was different from a certain value, say 0.20? We can’t do that with cor.test. However, we can do it fairly easily if we fit a linear model using lm.
To do that we need to standardize both Girth and Height, which we can do using the scale function. When both variables have mean 0 and standard deviation 1, the slope from a simple regression of one on the other equals their Pearson correlation.
# scale returns a one-column matrix; [,1] extracts it as a plain vector
trees$GirthZ <- scale(trees$Girth)[,1]
trees$HeightZ <- scale(trees$Height)[,1]
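As a quick check, the standardized columns should now have mean 0 and standard deviation 1.

sapply(trees[, c("GirthZ", "HeightZ")], \(x) c(mean = mean(x), sd = sd(x))) |> round(3)
##      GirthZ HeightZ
## mean      0       0
## sd        1       1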
Now we’re ready to calculate correlation using lm.
# the intercept is dropped (~ 0 +); both variables are centered, so it would be 0 anyway
m <- lm(GirthZ ~ 0 + HeightZ, data = trees)
coef(summary(m)) |> round(3)
## Estimate Std. Error t value Pr(>|t|)
## HeightZ 0.519 0.156 3.328 0.002
We see the coefficient for HeightZ matches the correlation we obtained from cor.test. The hypothesis test, which checks whether the coefficient is different from 0, is essentially the same as well; the t statistic differs slightly because the no-intercept model has 30 residual degrees of freedom rather than 29. But now that we have this as a model object, we can do further work. For example, using the linearHypothesis function in the car package we can test whether the correlation is different from 0.2 instead of 0.
library(car)
linearHypothesis(m, "HeightZ = 0.2")
## Linear hypothesis test
##
## Hypothesis:
## HeightZ = 0.2
##
## Model 1: restricted model
## Model 2: GirthZ ~ 0 + HeightZ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 24.969
## 2 30 21.910 1 3.0582 4.1873 0.04958 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
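Since the hypothesis involves a single coefficient, the F statistic above is just the square of a t statistic, so we can reproduce the p-value by hand from the fitted model. A minimal sketch:

b <- coef(m)["HeightZ"]
se <- sqrt(vcov(m)["HeightZ", "HeightZ"])
t_stat <- (b - 0.2) / se
(2 * pt(abs(t_stat), df = df.residual(m), lower.tail = FALSE)) |> round(5)
## HeightZ 
## 0.04958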
Now let’s randomly delete some values so we can demonstrate how this approach extends to multiple imputation.
set.seed(1)
i <- sample(nrow(trees), size = 6)
trees$HeightZ[i] <- NA
summary(trees[,4:5])
## GirthZ HeightZ
## Min. :-1.5769 Min. :-2.0402
## 1st Qu.:-0.7005 1st Qu.:-0.1569
## Median :-0.1110 Median : 0.4708
## Mean : 0.0000 Mean : 0.2009
## 3rd Qu.: 0.6378 3rd Qu.: 0.7847
## Max. : 2.3427 Max. : 1.7264
## NA's :6
Now we use multiple imputation to fill in the missing values. For this we’ll use the mice package, performing 5 imputations using predictive mean matching (PMM).
library(mice)
imp <- mice(trees[,4:5], m = 5, method = "pmm",
print = FALSE, seed = 1)
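If you’d like to inspect the imputed values, the complete function in mice returns a filled-in data set; for example, the first of the five:

head(complete(imp, 1))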
We would like to calculate the correlation on all 5 imputed data sets and pool the results. There is no pooling method for cor.test results, but there is one for models fit with lm. Below we use the with and pool functions to fit the regression to each imputed data set and combine the results.
fit <- with(imp, lm(GirthZ ~ 0 + HeightZ))
pool(fit) |> summary()
## term estimate std.error statistic df p.value
## 1 HeightZ 0.3375085 0.1905754 1.770997 22.07142 0.09037167
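We can also request a pooled confidence interval; the summary method for pooled mice results accepts a conf.int argument (assuming a reasonably recent version of mice):

pool(fit) |> summary(conf.int = TRUE)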