Linear Regression II

Author

Dr Andrew Dalby

Introduction

In my previous post I talked about ordinary least squares linear regression and described the specific conditions when this should be used. i.e. when the values for the x-axis has been set by the experimenter.

Now I want to demonstrate how to use principle/major axis regression which is the method that you should use when there are sources of variation in both axes.

I am going to use Galton’s historical data set on heights of parents and children. Galton is a very problematic figure as he was a racist and eugenicist. There is currently a resurgence of this genetic determinism of racial profile which was started by people like Galton.

It is important that we learn from history that science is not above the ethical and moral weaknesses of the scientists who create it. I can use his data without agreeing with any of his ideas. In discovering regression to the mean Galton actually showed why Eugenics can never work.

Galton’s Parent Child Data

I am firstly going to analyse the data correctly with major axis regression and then I am going to carry out the analysis with ordinary least squares to see what the differences are.

There are two possible libraries for carrying out major axis regression smatr and lmodel2. I am only going to use lmodel2 for the graphical examples. The main strength of smatr is that it has diagnostics for checking the assumptions.

library(HistData)
library(ggplot2)
library(ggpmisc)
library(smatr)
library(lmodel2)
data(Galton)
ggplot(Galton, aes(x=parent, y=child))+
  geom_point()+
  geom_smooth(method=lm)+
  stat_poly_eq(mapping = use_label("eq", "R2", "n", "P"),colour="#4b0082")+
  stat_ma_line(colour="darkorange") +
  stat_ma_eq(mapping = use_label("eq", "R2", "n", "P"), orientation = "y",label.y = 0.9, colour="darkorange")+
  theme(aspect.ratio = 1)

While the r-squared value is the same the coefficients are completely different and they produce completely different lines. Visually this does not seem a good fit. But this is in part because of the differing axis scales.

The major axis regression does not take into account if there are differences in scale and variability between the two axes. Standardised major axis regression and ranged axis regression account for this issue.

ggplot(Galton, aes(x=parent, y=child))+
  geom_point()+
  stat_ma_line(colour="#4b0082")+
  stat_poly_eq(mapping = use_label("eq", "R2", "n", "P"),colour="#4b0082")+
  stat_ma_line(method="SMA", colour="darkorange")+
  stat_ma_eq(mapping = use_label("eq", "R2", "n", "P"), orientation = "y",label.y = 1.0, colour="darkorange")+
  stat_ma_line(method="RMA", range.x = "interval", range.y = "interval",colour="darkturquoise")+
  stat_ma_eq(mapping = use_label("eq", "R2", "n", "P"), orientation = "y",label.y = 0.90, colour="darkturquoise")+
  theme(aspect.ratio = 1)

Here there is very little difference between the standardised and ranged major axis regression and so you could choose to use either.

Note that the p-values are calculated from permutation tests and not from means and standard errors.

If you want to calculate the model and see the coefficients and the diagnostics then you would run the regression fitting.

fit.Galton <- lmodel2(child ~ parent, Galton, nperm=99)

RMA was not requested: it will not be computed.

fit.Galton


Model II regression

Call: lmodel2(formula = child ~ parent, data = Galton, nperm = 99)

n = 928   r = 0.4587624   r-square = 0.2104629 
Parametric P-values:   2-tailed = 1.732509e-49    1-tailed = 8.662546e-50 
Angle between the two OLS regression lines = 39.08808 degrees

Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
A permutation test of r is equivalent to a permutation test of the OLS slope
P-perm for SMA = NA because the SMA slope cannot be tested

Regression results
  Method Intercept     Slope Angle (degrees) P-perm (1-tailed)
1    OLS  23.94153 0.6462906        32.87421              0.01
2     MA -69.81524 2.0188460        63.64929              0.01
3    SMA -28.14205 1.4087698        54.63138                NA

Confidence intervals
  Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
1    OLS       18.42510        29.45796  0.5655602   0.7270209
2     MA      -88.98805       -54.14195  1.7893963   2.2995269
3    SMA      -33.81447       -22.78538  1.3303507   1.4918114

Eigenvalues: 7.362699 2.17189 

H statistic used for computing C.I. of MA: 0.002468459

You do not need to use the summary method to view the output and the confidence intervals are calculated automatically.

What is surprising is the size of the difference between the OLS and major axis regression in this case.

You can fit the model in smatr to check that it gives the same results and also to produce the diagnostic plots of the residuals and the QQ-plot of the residuals.

fit.Galton2 <- sma(child ~ parent, Galton)
fit.Galton2

Call: sma(formula = child ~ parent, data = Galton) 

Fit using Standardized Major Axis 

------------------------------------------------------------
Coefficients:
            elevation    slope
estimate    -28.14205 1.408770
lower limit -33.65918 1.330351
upper limit -22.62492 1.491811

H0 : variables uncorrelated
R-squared : 0.2104629 
P-value : < 2.22e-16

plot(fit.Galton2,which="res")

plot(fit.Galton2,which="qq")

Here the residuals are normally distributed in both axes and the assumptions hold.

There is one other type of linear regression that needs to be considered and this is Deming Regression. This is most often used in comparing two methods of measurement but it can be used in any case when the x and y axes could be switched, when x is not causing y. They are correlated/associated but not through a causal relationship. This would not apply for the Galton data but it would apply to other allometric studies such as the Fisher/Anderson iris data.