father <- c(1.66,1.70,1.62,1.64,1.64,1.66,1.66,rep(1.68,3),1.70,1.60,1.60,1.62,
1.62,rep(1.64,3),rep(1.66,3),rep(1.68,5),1.70,1.70,1.72,1.74,1.76,
rep(1.60,2),rep(1.62,4),rep(1.64,6),rep(1.66,5),rep(1.68,4),rep(1.70,3),
rep(1.72,2),1.76,1.58,rep(1.60,3),rep(1.62,3),rep(1.64,6),rep(1.66,9),
rep(1.68,8),rep(1.70,7),rep(1.72,4),1.74,1.76,1.80, rep(1.58,2),
rep(1.60,4),rep(1.62,5),rep(1.64,8),rep(1.66,9),rep(1.68,13),rep(1.70,9)
,rep(1.72,5),rep(1.74,6),rep(1.76,3),rep(1.78,4),1.82, rep(1.60,2),
rep(1.62,3),rep(1.64,6),rep(1.66,7),rep(1.68,9),rep(1.70,17),
rep(1.72,13),rep(1.74,11),rep(1.76,4),rep(1.78,5),rep(1.80,2),1.82,
1.60, rep(1.62,2),rep(1.64,5),rep(1.66,8),rep(1.68,8),rep(1.70,11),
rep(1.72,14),rep(1.74,9),rep(1.76,6),rep(1.78,2),rep(1.80,2),1.82,
1.62,rep(1.64,4),rep(1.66,5),rep(1.68,6),rep(1.70,9),rep(1.72,11),
rep(1.74,8),rep(1.76,6),rep(1.78,3),rep(1.80,2),1.82,1.84,1.62,1.64,
rep(1.66,5),rep(1.68,4),rep(1.70,4),rep(1.72,9),rep(1.74,11),
rep(1.76,3),rep(1.78,3),1.80,1.82,1.82,1.64,rep(1.68,2),rep(1.68,3),
rep(1.70,4),rep(1.72,4),rep(1.74,5),rep(1.76,6),rep(1.78,3),1.80,
1.82,1.82,1.84,1.66,1.68,1.70,1.70,rep(1.72,2),rep(1.74,3),rep(1.76,4),
rep(1.78,4),rep(1.80,2),rep(1.82,3),1.84,1.66,1.72,1.72,1.74,
rep(1.76,3),1.78,1.80,1.74,1.76,1.76,1.78,1.80, 1.78,1.80,1.84
)
son <- c(rep(1.58,2),rep(1.60,9),rep(1.62,20),rep(1.64,27),rep(1.66,44),
rep(1.68,69), rep(1.70,80), rep(1.72,69),rep(1.74,57),rep(1.76,44),
rep(1.78,32),rep(1.80,23),rep(1.82,9),rep(1.84,5),rep(1.86,3))
data <- data.frame(father,son)
Father Son Height Data
Background
This set of data was reconstructed from the joint distribution of heights in Table 14 of Statistical Methods in Biology (2nd Edition) by Norman T. J. Bailey, Hodder and Stoughton, 1983. The data were recorded at 2 cm intervals.
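The long rep() calls above simply expand the published frequency counts into one row per father-son pair. A minimal sketch of the same idea, using a small hypothetical 3 x 3 frequency table rather than the real Bailey table:
freq <- matrix(c(2, 1, 0,
                 1, 3, 1,
                 0, 1, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(father = c(1.60, 1.64, 1.68),
                               son = c(1.60, 1.64, 1.68)))
# every combination of the two height classes, father varying fastest
pairs <- expand.grid(father = as.numeric(rownames(freq)),
                     son = as.numeric(colnames(freq)))
# repeat each (father, son) pair as many times as its cell count
toy <- pairs[rep(seq_len(nrow(pairs)), times = as.vector(freq)), ]
head(toy)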
Building the Data
As always, the most important first step of data analysis is creating graphs of the data. This is continuous data for two variables, so the appropriate graphs are:
Histograms, to show whether the data are normally distributed.
A scatterplot, to look for a relationship between the two variables.
hist(data$son, col="lightblue", xlab="height(m)", main = "Histogram of Son's Heights")
hist(data$father, col="dodgerblue", xlab="height(m)", main = "Histogram of Father's Heights")
library(ggplot2)
ggplot(data, aes(x=father, y=son)) +
  geom_point(size=2, shape=23) +
  geom_smooth(method=lm)
`geom_smooth()` using formula = 'y ~ x'
From the histograms the data can be considered normally distributed for both variables, so you can also calculate the correlations between the two variables.
cor(data, method="spearman") father son
father 1.0000000 0.4919055
son 0.4919055 1.0000000
cor(data, method="pearson") father son
father 1.0000000 0.4950375
son 0.4950375 1.0000000
cor(data, method="kendall") father son
father 1.0000000 0.3725771
son 0.3725771 1.0000000
You can also carry out correlation tests to check whether the correlations are significantly different from zero.
cor.test(data$father, data$son, method="spearman")Warning in cor.test.default(data$father, data$son, method = "spearman"): Cannot
compute exact p-value with ties
Spearman's rank correlation rho
data: data$father and data$son
S = 10146873, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4919055
cor.test(data$father, data$son, method="pearson")
Pearson's product-moment correlation
data: data$father and data$son
t = 12.625, df = 491, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4253200 0.5589147
sample estimates:
cor
0.4950375
cor.test(data$father, data$son, method="kendall")
Kendall's rank correlation tau
data: data$father and data$son
z = 11.232, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.3725771
Model II Regression
The usual version of linear regression is often called OLS, ordinary least squares regression; it is also known as Model I regression. It applies when you have carried out an experiment in which you pick selected values of an explanatory variable and measure an outcome variable that results from a causative relationship. Because the explanatory variable is set by the experimenter, the errors are assumed to lie in the outcome variable and to be normally distributed.
In this case there is just a correlation between the two variables: they are observations, not part of an experiment, and we have no control over the values of the fathers' heights. For this case we can use Model II regression. In R this is called major axis (MA) regression; Holmes et al. call it Principal Axis Regression.
See Research Methods for the Biosciences for a more complete explanation.
Holmes, D., Moody, P., Dine, D. and Trueman, L., 2017. Research Methods for the Biosciences. Oxford University Press.
library(lmodel2)
model2 <- lmodel2(data$son~data$father, data=data, nperm=99)
RMA was not requested: it will not be computed.
model2
Model II regression
Call: lmodel2(formula = data$son ~ data$father, data = data, nperm = 99)
n = 493 r = 0.4950375 r-square = 0.2450621
Parametric P-values: 2-tailed = 7.692509e-32 1-tailed = 3.846254e-32
Angle between the two OLS regression lines = 37.3256 degrees
Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
A permutation test of r is equivalent to a permutation test of the OLS slope
P-perm for SMA = NA because the SMA slope cannot be tested
Regression results
Method Intercept Slope Angle (degrees) P-perm (1-tailed)
1 OLS 0.87113981 0.4935331 26.26786 0.01
2 MA 0.01867827 0.9938707 44.82387 0.01
3 SMA 0.01341301 0.9969610 44.91281 NA
Confidence intervals
Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
1 OLS 0.7402107 1.0020690 0.4167240 0.5703423
2 MA -0.2678878 0.2639857 0.8498917 1.1620657
3 SMA -0.1224860 0.1392447 0.9231063 1.0767246
Eigenvalues: 0.004228506 0.00142819
H statistic used for computing C.I. of MA: 0.006055039
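The eigenvalues quoted above hint at what MA regression is doing: the MA line is the major (first principal) axis of the covariance matrix of the two variables. As a rough check, not part of lmodel2's own code, you can recover the MA slope and intercept directly from that eigen-decomposition:
S <- cov(data)
ev <- eigen(S)
# slope of the first eigenvector (the major axis): rise over run
ma_slope <- ev$vectors[2, 1] / ev$vectors[1, 1]
# the MA line passes through the centroid of the data
ma_intercept <- mean(data$son) - ma_slope * mean(data$father)
c(intercept = ma_intercept, slope = ma_slope)
# should reproduce the MA row above (intercept ~ 0.019, slope ~ 0.994),
# and ev$values should match the two eigenvalues printed by lmodel2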
op <- par(mfrow = c(1,2))
plot(model2, "OLS", xlab="Father's Height (m)", ylab="Son's Height (m)")
plot(model2, "MA", xlab="Father's Height (m)", ylab="Son's Height (m)")par(op)The model is verified using permutation tests and not a T-test as in the Model I regression.
You can compare this to the usual OLS (Model I) regression in R.
model1 <- lm(son~father, data=data)
summary (model1)
Call:
lm(formula = son ~ father, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.130146 -0.030146 -0.000275 0.030113 0.129595
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.87114 0.06664 13.07 <2e-16 ***
father 0.49353 0.03909 12.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04619 on 491 degrees of freedom
Multiple R-squared: 0.2451, Adjusted R-squared: 0.2435
F-statistic: 159.4 on 1 and 491 DF, p-value: < 2.2e-16
confint.lm(model1)
                2.5 %    97.5 %
(Intercept) 0.7402107 1.0020690
father      0.4167240 0.5703423
plot(model1, 1)
plot(model1, 2)
plot(model1, 3)
plot(model1, 4)
plot(model1, 5)
Note that Major Axis regression is not available in SPSS, although it can be implemented by coding the model.