When one trait varies in proportion to another variable, it is said to vary allometrically. We will explore two common allometric relationships between body size variables as examples of allometry in morphology. There are also many examples of allometry in physiology and other areas, so this lesson can apply to other disciplines.1 This lesson also demonstrates different types of linear models, along with some considerations in choosing between a transformed linear model and a non-linear or curvilinear model.
It is not uncommon in the practice of fisheries science to estimate one variable from another. A very simple example is when fish come in gutted, headed, finned, or otherwise processed at sea, but one needs to estimate the full size of the individual fish.2 These relationships are typically allometric and can be fitted to the linear model, \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\), where \(y_i\) is the response (or dependent variable), \(x_i\) is the predictor (or independent variable), \(\beta_0\) is the intercept parameter, \(\beta_1\) is the slope parameter, and \(\epsilon_i\) represents all the unexplained sources of variability (we will impose this using our friend the rnorm() function).
Here we simulate an example where whole weight and gutted weight are taken from fish at uniformly spaced size intervals.
gutted.wt = seq (1, 45, 1) #Start simulation with 45 fish
gutted.wt #Each fish weighs (gutted) from 1 to 45 kg, at 1 kg intervals
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
The whole weight is certainly heavier than the gutted weight, and we expect the difference to be a roughly constant percentage, perhaps about 10%. This can be simulated to include some random variation.
set.seed (1357) # this statement will make your random numbers match mine, but you can turn this off, too
whole.wt = gutted.wt*1.1 + rnorm(length(gutted.wt), 0, 0.1*gutted.wt) # 10% heavier, plus error that scales with fish size
whole.wt
## [1] 1.136582 2.553266 3.005511 4.548515 4.577318 7.041212 7.569947
## [8] 9.722813 10.097628 10.019610 11.378110 13.572520 15.501668 14.404081
## [15] 16.303020 18.004447 17.193424 16.437165 19.536711 23.086182 20.139767
## [22] 25.009925 24.975206 29.715893 20.754357 31.925544 22.608315 28.905087
## [29] 28.660591 37.275577 32.411425 32.936039 37.755929 40.152286 43.129705
## [36] 37.414998 39.821553 41.294264 38.993305 42.990916 42.071364 40.922536
## [43] 48.312831 44.974683 54.467134
A simple linear model confirms that the two weight variables are positively associated.
Wts = lm (whole.wt~gutted.wt) #Run a regression and correlation analysis
summary.lm (Wts) #Which calculates an intercept (a), slope (b), r2, and df
##
## Call:
## lm(formula = whole.wt ~ gutted.wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4525 -1.2359 0.1123 1.1850 5.9634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10346 0.80262 -0.129 0.898
## gutted.wt 1.08016 0.03039 35.547 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.647 on 43 degrees of freedom
## Multiple R-squared: 0.9671, Adjusted R-squared: 0.9663
## F-statistic: 1264 on 1 and 43 DF, p-value: < 2.2e-16
cl95<-confint(Wts)
cl95
## 2.5 % 97.5 %
## (Intercept) -1.722096 1.515178
## gutted.wt 1.018878 1.141440
The regression slope is significantly different from zero (P < 0.001), which is a trivial result, because we simulated the data to have a slope of about 1.1. The estimated slope for this particular simulation is 1.08. This is not exactly 1.1, but it is only an estimate. Importantly, the 95% confidence limits of this estimate (1.02 – 1.14) do not bracket a value of 1.0, which would implausibly indicate no difference between gutted and whole weight (and would more likely be due to the imprecision of our measurements). In fact, the 95% confidence limits bracket our target of 1.1. We want that, at a minimum, and would prefer that these limits be as narrow as possible around the target value.
Remember, this is one particular realization of our simulation approach; you can try other set.seed() values to observe other realizations, which will show slopes that vary around, and should average near, 1.1 (see the sketch below).
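As a minimal sketch of that check (not part of the original lesson's output), we can refit the model across many fresh realizations and confirm that the slope estimates average near the true value:
slopes <- replicate(1000, {
  w <- gutted.wt*1.1 + rnorm(length(gutted.wt), 0, 0.1*gutted.wt)
  coef(lm(w ~ gutted.wt))[2]
}) # refit the regression to 1000 simulated realizations
mean(slopes) # should land very close to 1.1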
The intercept is not significantly different from zero (P = 0.90), which is also reasonable, in that the whole weight should be zero when the gutted weight is zero (95% c.l. = -1.72 – 1.52).
The coefficient of determination (\(r^2\)) is very high (0.967), indicating a strong linear relationship in the paired, simulated data. You can lower this value by increasing the error in the simulated data, as the sketch below shows.
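For example (a hypothetical variant, not part of the lesson's seeded output), tripling the error standard deviation in the simulation noticeably lowers \(r^2\):
whole.wt.noisy <- gutted.wt*1.1 + rnorm(length(gutted.wt), 0, 0.3*gutted.wt) # 3x the error sd
summary(lm(whole.wt.noisy ~ gutted.wt))$r.squared # noticeably lower than 0.967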
In general, the simulated data behave as we expect, so we continue with some simple diagnostics using the predicted values and residuals. Below, these values are retrieved from the fitted model object before plotting.
signif(predict (Wts), digits=3) #These are the predicted values, with 3 sign. digits
## 1 2 3 4 5 6 7 8 9 10
## 0.977 2.060 3.140 4.220 5.300 6.380 7.460 8.540 9.620 10.700
## 11 12 13 14 15 16 17 18 19 20
## 11.800 12.900 13.900 15.000 16.100 17.200 18.300 19.300 20.400 21.500
## 21 22 23 24 25 26 27 28 29 30
## 22.600 23.700 24.700 25.800 26.900 28.000 29.100 30.100 31.200 32.300
## 31 32 33 34 35 36 37 38 39 40
## 33.400 34.500 35.500 36.600 37.700 38.800 39.900 40.900 42.000 43.100
## 41 42 43 44 45
## 44.200 45.300 46.300 47.400 48.500
In terms of plotting, we would like to see: 1) the data, 2) a predictive line, and 3) some measure of variability, which we build in the following code.
plot (gutted.wt, whole.wt) #This is a plot of the raw data (Fig. 2A)
abline (Wts$coef) #This overlays the predicted equation (solid line)
vxb=data.frame(gutted.wt=seq(1,45,1)) # this creates a df for your 95% cl
tmb=predict(Wts, newdata=vxb, se.fit=TRUE) #creates se.fit (as part of a list)
cl95upb=(tmb$fit+qnorm(0.975)*tmb$se.fit) # creates the upper cl vector
cl95lob=(tmb$fit-qnorm(0.975)*tmb$se.fit) # creates the lower cl vector
lines (gutted.wt, cl95upb, lty =2) # plots the upper cl vector (dashed line)
lines (gutted.wt, cl95lob, lty =2) # plots the lower cl vector (dashed line)
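As a side note (a sketch, not part of the original figure), the dashed lines above bound the mean fit; limits for individual new fish are wider, and predict() can return them directly:
pi95 <- predict(Wts, newdata=vxb, interval="prediction", level=0.95) # prediction, not confidence, limits
head(pi95) # columns: fit, lwr, upr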
We should also be interested in diagnosing whether there is any trend in the residuals, as large deviations, or a pattern in them, can affect our confidence in modeling these data.
signif(residuals (Wts), digits=3) #Tabulates the residual values
## 1 2 3 4 5 6 7 8 9
## 0.1600 0.4960 -0.1320 0.3310 -0.7200 0.6640 0.1120 1.1800 0.4800
## 10 11 12 13 14 15 16 17 18
## -0.6790 -0.4000 0.7140 1.5600 -0.6150 0.2040 0.8250 -1.0700 -2.9000
## 19 20 21 22 23 24 25 26 27
## -0.8830 1.5900 -2.4400 1.3500 0.2350 3.9000 -6.1500 3.9400 -6.4500
## 28 29 30 31 32 33 34 35 36
## -1.2400 -2.5600 4.9700 -0.9700 -1.5300 2.2100 3.5300 5.4300 -1.3700
## 37 38 39 40 41 42 43 44 45
## -0.0409 0.3520 -3.0300 -0.1120 -2.1100 -4.3400 1.9700 -2.4500 5.9600
plot (gutted.wt, Wts$residuals) #This plots of the residuals
abline (0, 0) #This adds a horizontal line at zero
segments (gutted.wt, 0, gutted.wt, Wts$residuals) #This connects each residual to the prediction line (0)
In this case, the residuals fit relatively well. If these were real (rather than simulated) data, then we could double-check any suspicious outliers as potential typos. As is, you can change the set.seed() value and see how outliers come and go. The pattern of increasing residual values with increasing fish size is not unexpected. In terms of modeling, it does violate the assumption of 'homogeneity of variance,' but it does not demand mitigation here. If it looked more severe, it could be mitigated by transforming the data – a log-log transformation should correct this – which is something we will explore in example #2, where the pattern of residuals is more problematic. A one-line sketch of that transformation follows.
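For reference, the mitigation mentioned above would look like this (a sketch only; example #2 develops the log-log approach fully):
Wts.log <- lm(log(whole.wt) ~ log(gutted.wt)) # refit on the log-log scale to stabilize the variance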
Technically, one of the assumptions of Model I regression is that the independent variable is measured without error. In practice, the size of an animal (here, the gutted weight) has some observation error, which may be small or not so small, depending on the conditions of measurement (i.e., resolution of the calipers or scale, condition of the specimen, etc.). If there is reason to believe that there is notable error in the measurements of the independent variable, then there are alternative regression methods to use.3
R has a package (lmodel2) that compares ordinary least squares (OLS) with three Model II methods: major axis (MA), standard major axis (SMA), and ranged major axis (RMA). This package needs to be installed and loaded for you to follow along with the subsequent coding.
library(lmodel2)
Wts2 = lmodel2 (whole.wt~gutted.wt, range.x= c(0,50), range.y=c(0,50), nperm=10) # note: lmodel2 expects range.x/range.y as the keywords "relative" or "interval"; the numeric ranges here trigger the warnings below
## Warning in if (range.y == "relative") {: the condition has length > 1 and
## only the first element will be used
## Warning in if (range.x == "relative") {: the condition has length > 1 and
## only the first element will be used
Wts2
##
## Model II regression
##
## Call: lmodel2(formula = whole.wt ~ gutted.wt, range.y = c(0, 50),
## range.x = c(0, 50), nperm = 10)
##
## n = 45 r = 0.9834072 r-square = 0.9670897
## Parametric P-values: 2-tailed = 1.630232e-33 1-tailed = 8.151158e-34
## Angle between the two OLS regression lines = 0.9544239 degrees
##
## Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
## A permutation test of r is equivalent to a permutation test of the OLS slope
## P-perm for SMA = NA because the SMA slope cannot be tested
##
## Regression results
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS -0.1034590 1.080159 47.20680 0.09090909
## 2 MA -0.5625517 1.100120 47.72941 0.09090909
## 3 SMA -0.5226403 1.098384 47.68439 NA
## 4 RMA -0.4808363 1.096567 47.63715 0.09090909
##
## Confidence intervals
## Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
## 1 OLS -1.722096 1.5151781 1.018878 1.141440
## 2 MA -2.044693 0.8302441 1.039563 1.164561
## 3 SMA -1.971390 0.8475339 1.038812 1.161373
## 4 RMA -1.949804 0.9149494 1.035880 1.160435
##
## Eigenvalues: 377.4825 3.129837
##
## H statistic used for computing C.I. of MA: 0.0007973839
All three 'Model II' methods produced very similar estimates of the slope. All Model II estimates of the slope were higher than the OLS estimate, and they were closer to the 'true' value (1.1).
It is worth noting that these 45 values of gutted weight were fixed in the simulation, yet when they were treated as if random error existed, the Model II methods still 'improved' the estimates of slope. This is something worthy of further simulation and consideration when using a real dataset of this type (see the sketch below). Regardless, given the availability of this R package, and the theoretical reasoning to use it when both variables are measured with some error, Model II regression will undoubtedly be used more frequently in this type of situation.
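As a minimal sketch of that further simulation (hypothetical, not part of the lesson), we can add observation error to the x-variable as well, and compare the OLS slope, which is attenuated toward zero by error in x, with the major axis (MA) slope:
set.seed(99) # an arbitrary seed for this side experiment
x.true <- seq(1, 45, 1)
y.obs <- x.true*1.1 + rnorm(45, 0, 0.1*x.true)
x.obs <- x.true + rnorm(45, 0, 0.1*x.true) # now x is measured with error, too
coef(lm(y.obs ~ x.obs))[2] # OLS slope: biased low
lmodel2(y.obs ~ x.obs, nperm=0)$regression.results[2, "Slope"] # MA slope: closer to 1.1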
Often we know that data should fit a specific model, for example, that weight (\(W\)) is proportional to length (\(L\)). We even know that weight should be approximately a cubic function of \(L\), because \(W\) is a proxy for volume (\(V\)), and \(V\) is a cubic function of \(L\) (i.e., \(V = L \times L \times L = L^3\), so doubling the length increases the expected weight roughly eight-fold). To begin, we set a seed value so your results will match those below:
set.seed (2468) #using a different set.seed value this time
We simulate a length-weight relationship according to a power function, \(y = \beta x^ \alpha\), with some random error, by writing the following code:
length.mm <- rnorm (100, 200, 50) # creates random numbers, normal distribution
error <- rnorm (100, 0, 50) # creates a norm. dist. of error
weight.g <- 0.0001 * length.mm^3.0 + error # creates weights with random error
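One caution worth a quick check (an aside, not in the original lesson): with additive error (sd = 50 g), very small simulated fish could receive negative weights, which would be biologically impossible and would break the log-transformation used later:
sum(weight.g <= 0) # zero in this seeded run, so we can proceed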
These 100 values of length and weight can be plotted as a scattergram, just as in Example 1; however, the plot can also be enhanced with a little extra code that combines the scattergram with two histograms.4
To do so, we need to evaluate and set the range and breaks of length and weight values, as follows.
c(range(length.mm), range(weight.g))
## [1] 56.99967 325.09632 3.91785 3309.21423
length.range <- c(0,350) # set these limits for the plot
weight.range <- c(0,3500)
length.hist <- hist (length.mm, breaks = seq(0,350,25), plot = F)
weight.hist <- hist (weight.g, breaks = seq (0, 3500, 250), plot = F)
top <- max (c(length.hist$counts, weight.hist$counts))
The original layout – a single panel – is saved, a 3-part layout is defined, and then the different pieces are assembled in series:
def.par <- par(no.readonly = TRUE) # save default, for resetting later
nf <-layout (matrix (c(2,0,1,3), 2,2, byrow=T), c(3,1), c(1,3), T)
layout.show(nf)
#1
par (mar=c(3,3,1,1))
plot (length.mm,weight.g, xlim=length.range, ylim = weight.range, xlab="Length (mm)", ylab="Weight (g)")
#2
par(mar=c(0,3,1,1))
barplot (length.hist$counts, axes=F, ylim=c(0,top), space=0)
#3
par(mar=c(3,0,1,1))
barplot (weight.hist$counts, axes=F, xlim=c(0,top), space=0, horiz=T)
This enhanced graph depicts not only the individual length-weight pairs but also the relative frequencies of the length and weight values. This is a nice graphical diversion, but the full analysis requires more attention. We will want to return to a single-graph layout by typing:
par(def.par) #- reset to default
#layout.show() # you could check with this statement (not done)
In terms of interpretation, the paired, raw data do not appear to be linearly related. This is all the more obvious by overlaying the predicted line and by plotting the residuals.
par (mfrow = c(1,2)) # this puts two graphs side by side
plot (length.mm, weight.g)
L_W = lm (weight.g~length.mm) # This is the same code in Ex. 1
abline (L_W$coef)
plot (length.mm, L_W$residuals)
abline (0, 0)
segments (length.mm, 0, length.mm, L_W$residuals)
This U-shaped pattern in the residuals is a clear violation of homogeneity of variance. There are two ways to fix this. First, we can transform the data. This may be a reasonable approach since the nonlinear form of the power function, \(y_i = \beta x_i^\alpha\), can be expressed in a linear form, \(y' = \beta_0 + \alpha x'\), by log-transforming both sides of the equation, such that \(y' = \log(y)\), \(\beta_0 = \log(\beta)\), and \(x' = \log(x)\). The error term, \(\epsilon_i\), may be either additive or multiplicative, but we will not address that here.
log.L_W = lm (log(weight.g)~log(length.mm)) #log the data and refit to a linear model
summary.lm (log.L_W)
##
## Call:
## lm(formula = log(weight.g) ~ log(length.mm))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27688 -0.06202 -0.01511 0.07888 0.47411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.39682 0.36769 -28.28 <2e-16 ***
## log(length.mm) 3.22510 0.07014 45.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1849 on 98 degrees of freedom
## Multiple R-squared: 0.9557, Adjusted R-squared: 0.9553
## F-statistic: 2114 on 1 and 98 DF, p-value: < 2.2e-16
par (mfrow = c(1,2))
plot (log(length.mm), log(weight.g))
abline (log.L_W$coef)
plot (log(length.mm), log.L_W$residuals)
abline (0, 0)
segments (log(length.mm), 0, log(length.mm), log.L_W$residuals)
This doesn’t look particularly good. There is a high-flying outlier and a declining pattern in the residuals among larger fish.
As an alternative, we can still model the raw data using a nonlinear power function. This is not the place to fully outline non-linear modeling, but you can see the code (below) requires an explicit statement of the model to be used, and some initial values. In this case, we actually know the values because we simulated the data with specific parameters and a power function. Even if we did not, we expect \(\beta\) to be very small and \(\alpha\) to approximate 3, because of the cubic relationship between length and weight in most animals (as pointed out earlier).
nlmodel<-nls(weight.g~beta*length.mm^alpha, start=list(beta=0.0001, alpha=3))
sumnlmod<-summary(nlmodel)
sumnlmod
##
## Formula: weight.g ~ beta * length.mm^alpha
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta 1.344e-04 1.837e-05 7.316 7.08e-11 ***
## alpha 2.946e+00 2.476e-02 118.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.02 on 98 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 2.652e-08
par (mfrow = c(1,2))
plot (length.mm,weight.g, xlim=length.range, ylim = weight.range, xlab="Length (mm)", ylab="Weight (g)")
av<-seq(75,325,2)
aa<-sumnlmod$coefficients[1]
ab<-sumnlmod$coefficients[2]
bv<-aa*av^ab
lines(av,bv)
plot (length.mm, sumnlmod$residuals)
abline (0, 0)
segments (length.mm, 0, length.mm, sumnlmod$residuals)
Ah, that looks good. We can finish here with a quantitative assessment of these models.
AIC(L_W)
## [1] 1334.635
AIC(nlmodel)
## [1] 1062.111
The linear model fitted to the untransformed data does poorly (i.e., has the higher AIC value). This is expected, because the data were not simulated to be linear; in general, one would not expect a length-weight relationship to be linear for a typical 3-dimensional body form. (Note that the log-log model is excluded from this comparison: AIC values are not directly comparable across models fitted to differently transformed responses.)
The power function – a curvilinear equation – fits the data better. This is obvious here, as this is the function we used to simulate the data. The plots of the residuals showed no pattern and indicate the level of deviation from the model's predictions. Again, the level of error was specified in the simulation, but ordinarily this might be something to examine in a real dataset, to check for outliers as well as to evaluate observation error in measurements of length or weight or both.
This example also explored the use of log-transformed data to fit a linear model. This was more common in the past, before computing power was readily available to solve non-analytical, non-linear problems. Although the transformation 'corrects' the curvilinear nature of the data, the residuals indicated an unfortunate bias (e.g., heavier weights are predicted at longer lengths), and, as would be the case with any transformation, the back-transformed values would not be the same as estimates on the raw scale. A sketch of a standard bias correction follows.
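For reference (a sketch assuming lognormal errors, not part of the original lesson), naively exponentiating the log-scale predictions underestimates the mean weight; a standard correction multiplies by exp(\(\sigma^2/2\)), where \(\sigma^2\) is the residual variance of the log-log fit:
sigma2 <- summary(log.L_W)$sigma^2 # residual variance on the log scale
pred.naive <- exp(predict(log.L_W)) # biased low for the mean
pred.corrected <- pred.naive * exp(sigma2/2) # lognormal bias correction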
Fishery work often involves allometric relationships. In the case of paired body size data, there are ways to plot the data in an interesting manner, either to audit it for outliers or to present it to an audience. It is often necessary to consider carefully alternatives to ordinary least squares linear regression; Model II and non-linear modeling were well suited to the allometric data presented here. An examination of the residuals is often appropriate to evaluate the level of measurement error and to select a model that best predicts the value of one body measurement from another.
This is a well-worn problem; see, for example, Zar, J. H. (1968). Calculation and miscalculation of the allometric equation as a model in biological data. BioScience 18, 1118-1120; or Hafley, W. L. (1969). Calculation and miscalculation of the allometric equation reconsidered. BioScience 19, 974-975, 983.
A specific example is Almeida, F. P., Hartley, D. L. & Burnett, J. (1995). Length-weight relationships and sexual maturity of goosefish off the northeast coast of the United States. North American Journal of Fisheries Management 15, 14-25.
An example of this is Downs, D. E. & Cheng, Y. W. (2013). Length-length and width-length conversion of longnose skate and big skate off the Pacific coast: implications for the choice of alternative measurement units in fisheries stock assessment. North American Journal of Fisheries Management 33, 887-893.
This fancy plotting was modified from an example of the layout function (pp. 860-862) in Crawley, M. J. (2007). The R Book. John Wiley & Sons, Ltd.