father <- c(1.66,1.70,1.62,1.64,1.64,1.66,1.66,rep(1.68,3),1.70,1.60,1.60,1.62,
1.62,rep(1.64,3),rep(1.66,3),rep(1.68,5),1.70,1.70,1.72,1.74,1.76,
rep(1.60,2),rep(1.62,4),rep(1.64,6),rep(1.66,5),rep(1.68,4),rep(1.70,3),
rep(1.72,2),1.76,1.58,rep(1.60,3),rep(1.62,3),rep(1.64,6),rep(1.66,9),
rep(1.68,8),rep(1.70,7),rep(1.72,4),1.74,1.76,1.80, rep(1.58,2),
rep(1.60,4),rep(1.62,5),rep(1.64,8),rep(1.66,9),rep(1.68,13),rep(1.70,9)
,rep(1.72,5),rep(1.74,6),rep(1.76,3),rep(1.78,4),1.82, rep(1.60,2),
rep(1.62,3),rep(1.64,6),rep(1.66,7),rep(1.68,9),rep(1.70,17),
rep(1.72,13),rep(1.74,11),rep(1.76,4),rep(1.78,5),rep(1.80,2),1.82,
1.60, rep(1.62,2),rep(1.64,5),rep(1.66,8),rep(1.68,8),rep(1.70,11),
rep(1.72,14),rep(1.74,9),rep(1.76,6),rep(1.78,2),rep(1.80,2),1.82,
1.62,rep(1.64,4),rep(1.66,5),rep(1.68,6),rep(1.70,9),rep(1.72,11),
rep(1.74,8),rep(1.76,6),rep(1.78,3),rep(1.80,2),1.82,1.84,1.62,1.64,
rep(1.66,5),rep(1.68,4),rep(1.70,4),rep(1.72,9),rep(1.74,11),
rep(1.76,3),rep(1.78,3),1.80,1.82,1.82,1.64,rep(1.68,2),rep(1.68,3),
rep(1.70,4),rep(1.72,4),rep(1.74,5),rep(1.76,6),rep(1.78,3),1.80,
1.82,1.82,1.84,1.66,1.68,1.70,1.70,rep(1.72,2),rep(1.74,3),rep(1.76,4),
rep(1.78,4),rep(1.80,2),rep(1.82,3),1.84,1.66,1.72,1.72,1.74,
rep(1.76,3),1.78,1.80,1.74,1.76,1.76,1.78,1.80, 1.78,1.80,1.84
)
son <- c(rep(1.58,2),rep(1.60,9),rep(1.62,20),rep(1.64,27),rep(1.66,44),
rep(1.68,69), rep(1.70,80), rep(1.72,69),rep(1.74,57),rep(1.76,44),
rep(1.78,32),rep(1.80,23),rep(1.82,9),rep(1.84,5),rep(1.86,3))
data <- data.frame(father,son)
Father Son Height Data
Background
This set of data was reconstructed from the joint distribution of heights in Table 14 of Statistical Methods in Biology (2nd Edition) by Norman T. J. Bailey, Hodder and Stoughton, 1983. The data were recorded at 2 cm intervals.
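The long rep() calls above simply expand the published frequency counts into one row per father-son pair. A minimal sketch of the same idea, using a small hypothetical 3 x 3 frequency table rather than the real Bailey table:
freq <- matrix(c(2, 1, 0,
                 1, 3, 1,
                 0, 1, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(father = c(1.60, 1.64, 1.68),
                               son = c(1.60, 1.64, 1.68)))
# every combination of the two height classes, father varying fastest
pairs <- expand.grid(father = as.numeric(rownames(freq)),
                     son = as.numeric(colnames(freq)))
# repeat each (father, son) pair as many times as its cell count
toy <- pairs[rep(seq_len(nrow(pairs)), times = as.vector(freq)), ]
head(toy)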
Building the Data
As always, the most important first step of data analysis is creating graphs of the data. This is continuous data for two variables, so the appropriate graphs are:
Histograms, to show whether the data are normally distributed.
A scatterplot, to look for a relationship between the two variables.
hist(data$son, col="lightblue", xlab="height(m)", main = "Histogram of Son's Heights")
hist(data$father, col="dodgerblue", xlab="height(m)", main = "Histogram of Father's Heights")
library(ggplot2)
ggplot(data, aes(x=father, y=son)) +
  geom_point(size=2, shape=23) +
  geom_smooth(method=lm)
`geom_smooth()` using formula = 'y ~ x'
From the histograms the data can be considered normally distributed for both variables, so you can also calculate the correlations between the two variables.
cor(data, method="spearman") father son
father 1.0000000 0.4919055
son 0.4919055 1.0000000
cor(data, method="pearson") father son
father 1.0000000 0.4950375
son 0.4950375 1.0000000
cor(data, method="kendall") father son
father 1.0000000 0.3725771
son 0.3725771 1.0000000
You can also carry out correlation tests to check whether the correlations are significantly different from zero.
cor.test(data$father, data$son, method="spearman")Warning in cor.test.default(data$father, data$son, method = "spearman"): Cannot
compute exact p-value with ties
Spearman's rank correlation rho
data: data$father and data$son
S = 10146873, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4919055
cor.test(data$father, data$son, method="pearson")
Pearson's product-moment correlation
data: data$father and data$son
t = 12.625, df = 491, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4253200 0.5589147
sample estimates:
cor
0.4950375
cor.test(data$father, data$son, method="kendall")
Kendall's rank correlation tau
data: data$father and data$son
z = 11.232, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.3725771
Model II Regression
The usual version of linear regression is often called OLS, ordinary least squares regression; it is also known as Model I regression. It applies when you have carried out an experiment in which you pick selected values of an explanatory variable and measure an outcome variable that results from a causative relationship. Because the explanatory variable is set by the experimenter, the errors are assumed to lie in the outcome variable and to be normally distributed.
In this case there is just a correlation between the two variables: they are observations, not part of an experiment, and we have no control over the values of the fathers' heights. For this case we can use Model II regression. In R this is called major axis (MA) regression; Holmes et al. call it Principal Axis Regression.
See Research Methods for the Biosciences for a more complete explanation.
Holmes, D., Moody, P., Dine, D. and Trueman, L., 2017. Research Methods for the Biosciences. Oxford University Press.
library(lmodel2)
model2 <- lmodel2(data$son~data$father, data=data, nperm=99)
RMA was not requested: it will not be computed.
model2
Model II regression
Call: lmodel2(formula = data$son ~ data$father, data = data, nperm = 99)
n = 493 r = 0.4950375 r-square = 0.2450621
Parametric P-values: 2-tailed = 7.692509e-32 1-tailed = 3.846254e-32
Angle between the two OLS regression lines = 37.3256 degrees
Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
A permutation test of r is equivalent to a permutation test of the OLS slope
P-perm for SMA = NA because the SMA slope cannot be tested
Regression results
Method Intercept Slope Angle (degrees) P-perm (1-tailed)
1 OLS 0.87113981 0.4935331 26.26786 0.01
2 MA 0.01867827 0.9938707 44.82387 0.01
3 SMA 0.01341301 0.9969610 44.91281 NA
Confidence intervals
Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
1 OLS 0.7402107 1.0020690 0.4167240 0.5703423
2 MA -0.2678878 0.2639857 0.8498917 1.1620657
3 SMA -0.1224860 0.1392447 0.9231063 1.0767246
Eigenvalues: 0.004228506 0.00142819
H statistic used for computing C.I. of MA: 0.006055039
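The eigenvalues quoted above hint at what MA regression is doing: the MA line is the major (first principal) axis of the covariance matrix of the two variables. As a rough check, not part of lmodel2's own code, you can recover the MA slope and intercept directly from that eigen-decomposition:
S <- cov(data)
ev <- eigen(S)
# slope of the first eigenvector (the major axis): rise over run
ma_slope <- ev$vectors[2, 1] / ev$vectors[1, 1]
# the MA line passes through the centroid of the data
ma_intercept <- mean(data$son) - ma_slope * mean(data$father)
c(intercept = ma_intercept, slope = ma_slope)
# should reproduce the MA row above (intercept ~ 0.019, slope ~ 0.994),
# and ev$values should match the two eigenvalues printed by lmodel2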
op <- par(mfrow = c(1,2))
plot(model2, "OLS", xlab="Father's Height (m)", ylab="Son's Height (m)")
plot(model2, "MA", xlab="Father's Height (m)", ylab="Son's Height (m)")par(op)The model is verified using permutation tests and not a T-test as in the Model I regression.
You can compare this to the usual OLS (Model I) regression in R.
model1 <- lm(son~father, data=data)
summary (model1)
Call:
lm(formula = son ~ father, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.130146 -0.030146 -0.000275 0.030113 0.129595
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.87114 0.06664 13.07 <2e-16 ***
father 0.49353 0.03909 12.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04619 on 491 degrees of freedom
Multiple R-squared: 0.2451, Adjusted R-squared: 0.2435
F-statistic: 159.4 on 1 and 491 DF, p-value: < 2.2e-16
confint.lm(model1)
                2.5 %    97.5 %
(Intercept) 0.7402107 1.0020690
father      0.4167240 0.5703423
plot(model1, 1)
plot(model1, 2)
plot(model1, 3)
plot(model1, 4)
plot(model1, 5)
Note that Major Axis regression is not available in SPSS, although it can be implemented by coding the model.