This chapter expands on the concepts of modeling fish growth, outlined in Part I, where we introduced a common form of the von Bertalanffy growth model (VBGM) and applied it to a published dataset (Kimura 1980)1. In Part II, the goals are to:
The data in Kimura (1980) was summarized, with only one length value (i.e., a mean length per sex) per age. The computing power available on your desk is more powerful today than in 1980, so if you had a similarly large sample as Kimura (which was in the 1000s of fish) then you will likely avoid working with summarized data and instead include all individual lengths and ages in the analysis.
We simulate such a sample here. First by fitting the female Pacific hake data from Kimura (1980) to a VBGM and accepting the mean parameter estimates as the true population mean values.
Then we simulate a series of samples that should have the same parameter estimates, trying to inject more realism in each succeeding dataset.
Finally, we use the likelihood ratio test to examine if the most realistic simulated data differs from the original data in terms of parametric values of the VBGM.
Kimura (1980) reported an average size of male and female Pacific hake. There were 13 age classes of females and 11 age classes of males. You can upload these data from the Gary Nelson’s ‘fishmethods’ package using the commands library (fishmethods) and data (Kimura), or enter the data by hand as shown below.
Kimura_F <- data.frame(age = c(1,2,seq(3.3,13.3,1)), length=c(15.40, 28.03, 41.18, 46.20, 48.23, 50.26, 51.82, 54.27, 56.98, 58.93, 59.00, 60.91, 61.83))
Recall we will use the common form of the VBGM:
\[L_t = L_∞(1 - e^{(-K(t-t_o)}),\]
where \(L_t\) is the length at age (t, years); \(L_∞\) is the average size at the maximum age; \(K\) is the Brody coefficient (yr\(^{-1}\)); and \(t_o\) is the age when the average size is zero.
attach(Kimura_F)
KF_<-nls(Kimura_F$length ~ Linf * (1 - exp(-K*(Kimura_F$age-t0))),
start = list (Linf = 60, K = 0.3, t0 = 0))
detach(Kimura_F)
The predicted values can be added to our data set:
KF_P<-data.frame(cbind(Kimura_F, preds=predict(KF_)))
KF_P
## age length preds
## 1 1.0 15.40 16.46621
## 2 2.0 28.03 27.94454
## 3 3.3 41.18 38.58487
## 4 4.3 46.20 44.39195
## 5 5.3 48.23 48.71009
## 6 6.3 50.26 51.92105
## 7 7.3 51.82 54.30872
## 8 8.3 54.27 56.08419
## 9 9.3 56.98 57.40442
## 10 10.3 58.93 58.38615
## 11 11.3 59.00 59.11616
## 12 12.3 60.91 59.65899
## 13 13.3 61.83 60.06265
These predicted values can be viewed for our purposes as the ‘true’ parametric mean length at age. In nature, individual lengths at age will vary in relation to genetic or environmental conditions. We can simulate this variability by adding random values to the predicted parameters using the rnorm() function2. A set.seed value is used so that this text will match your console if you are following along.
set.seed(1357)
KF_R<-data.frame(cbind(KF_P, rands=predict(KF_)+rnorm(13)))
KF_R
## age length preds rands
## 1 1.0 15.40 16.46621 16.83203
## 2 2.0 28.03 27.94454 29.71087
## 3 3.3 41.18 38.58487 37.60324
## 4 4.3 46.20 44.39195 44.76324
## 5 5.3 48.23 48.71009 46.86473
## 6 6.3 50.26 51.92105 52.65641
## 7 7.3 51.82 54.30872 54.12293
## 8 8.3 54.27 56.08419 57.23770
## 9 9.3 56.98 57.40442 57.62401
## 10 10.3 58.93 58.38615 57.40576
## 11 11.3 59.00 59.11616 58.45990
## 12 12.3 60.91 59.65899 59.96943
## 13 13.3 61.83 60.06265 60.98701
The plot (below, left panel) combines the data reported by Kimura (1980; female Pacific hake, open circles) versus the VBGM prediction (solid line) and one simulated value for each age (filled circles).
par(mfrow=c(1,3))
plot(KF_R$age,KF_R$length, pch=1, xlab="Age(years)", ylab="Length (cm)", ylim=c(0,70))
lines(KF_R$age, KF_R$preds, lty=1)
points(KF_R$age, KF_R$rands, pch=20)
legend ("bottomright", inset=0.05, c("raw data", "random"), pch=c(1, 20))
#Some histograms to examine residuals
hist(residuals(KF_), xlim=c(-4,4), ylim=c(0,6), main="Raw data", xlab="Model residuals", ylab="Frequency")
hist(KF_R$preds-KF_R$rands, xlim=c(-4,4), ylim=c(0,6), main="Random data", xlab="Preds-Rands", ylab="Frequency")
As a diagnostic, histograms are added (above) for the model residuals (center panel) and for the difference between the model predicted values and the simulated random values (right panel). At these small sample sizes, these residual data do not obviously deviate from a normal distribution, which is good; however, the right panel suggests that the default value of variability in rnorm() (i.e., SD=1) needs to be larger. In the next series we will use a value of 1.5 to spread out these simulated values and better match the raw data.
The goal here, however, is to work with a dataset where there are many lengths per age. Kimura’s (1980) summarized dataset was based on 1000s of aged fish, but let’s keep it simple, say 30 lengths per age for a total of 390 fish aged.
KF_R.30a<-data.frame(age.1=rep(Kimura_F$age,30), rands.1=predict(KF_)+rnorm(13*30)) #SD still is 1.0
Here is a tabulation of the randomly assigned lengths by age class. Cells with no lengths in an an age class are normally designated as zero with the table function, but I have stripped out the zeros (i.e., assigned as ‘NA’) to make the table cleaner.
KF_tab<-table(Length.bins=round(KF_R.30a$rands),Ages=round(KF_R.30a$age))
KF_tab[KF_tab == 0] <- NA
KF_tab
## Ages
## Length.bins 1 2 3 4 5 6 7 8 9 10 11 12 13
## 13 1
## 14 2
## 15 2
## 16 12
## 17 8
## 18 4
## 19 1
## 26 3
## 27 9
## 28 11
## 29 7
## 36 1
## 37 9
## 38 7
## 39 9
## 40 4
## 42 1
## 43 7
## 44 8
## 45 11
## 46 3
## 47 6
## 48 10
## 49 10
## 50 3 1
## 51 1 8
## 52 8 1
## 53 11 4
## 54 2 11
## 55 14 5
## 56 9 5 1
## 57 13 9 2 3 1
## 58 3 8 17 5 3
## 59 8 8 12 17 8
## 60 2 9 10 10
## 61 1 2 6
## 62 3
As discussed, above, the default values of rnorm() do not appear to represent the variability of the raw data. In this next plot, we will set a new value of SD to 1.5. In addition, it seems artificial that the random deviate at all ages has the same variability. Instead, it seems likely that growth variations between individuals will accummulate each year, where a faster growing fish in year 1 is likely to be a faster growing fish in later years. If so, then we should increase not only the base standard deviation (Simulation a; SD=1.5) of the random deviate, but we can also increase the SD with each age class. Here, we will set the latter at an increase in SD of 0.2 for each year, topping out at SD=4.1 for age 13 (Simulation b).
par(mfrow=c(1,3))
age.13<-c(1,2,seq(3.3,13.3,1)); age.390<-rep(age.13,30)
KF_R.390a<-data.frame(age.1=age.390, rands.1=predict(KF_)+rnorm(390, mean=0, sd=1.5))
KF_R.390b<-data.frame(age.1=age.390, rands.1=predict(KF_)+rnorm(390, mean=0, sd=1.5+(0.2*rank(KF_P$age))) )
plot(KF_P$age, KF_P$preds, type='l', xlab="Age(years)", ylab="Length (cm)", ylim=c(0,70))
points(jitter(KF_R.390a$age), KF_R.390a$rands, pch=20)
points(jitter(KF_R.390b$age), KF_R.390b$rands, pch=21)
legend("bottomright", inset=0.05, bty='n', title = "Random data" , c("SD fixed (1.5)", "SD varies with age (1.5-4.1)"), pch=c(20, 21), cex=0.8 )
#Histograms to examine residuals
hist(KF_R$preds-KF_R.390a$rands, breaks = seq(-12, 12, 1), xlim=c(-12,12), ylim=c(0,100), main="SD fixed", xlab="Preds-Rands", ylab="Frequency")
hist(KF_R$preds-KF_R.390b$rands, breaks = seq(-12, 12, 1), xlim=c(-12,12), ylim=c(0,100), main="SD varies", xlab="Preds-Rands", ylab="Frequency")
A jitter function is used to make the individual lengths per age more apparent. It is evident that the simulated values with a fixed SD=1.5 (filled circles) are a tight grouping such that the individual points are hard to identify. We could see individual points better by generating fewer lengths at age, but that is not the point, because in simulation, you will generally want to simulate large datasets to avoid small sample artefacts. Instead, the point here is to show how the simulated values with larger SD at older ages (open circles) spread out nicely. In general, such ‘heteroscedastic’ variance with age is a realistic pattern, although the exact values chosen for SD are likely species-specific.
If we are shooting for realism, then an even more realistic example would be a sample with more younger fish than older fish. That is the pattern seen in nature because of cummulative mortality among age classes. It may also be caused by sampling gear that does not collect the older, larger fish very well (i.e., selectivity). Regardless, it is possible in simulation to vary the sample sizes with each age class.
Here, we will keep the same sample size (\(n\) = 390) but sample randomly from an array with 13 age-1 fish, 12 age-2 fish, 11 age-3 fish, etc., to mimic the decreasing numbers of fish that will be available in nature at each age class.
par(mfrow=c(1,3))
#Set up a procedure to select a variable number of fish per age (same total)
age = c(1,2,seq(3.3,13.3,1))
age.loop<-1 #Start the for loop with a value of 1
for (i in 2:13) { stub <-age[1:i]; age.loop<-c(age.loop, stub) }
sampled.ages<-sort(sample(age.loop, 390, replace=TRUE))
predicted.lens<-61.23329*(1 - exp(-0.29625*(sampled.ages+0.05726)))
#Generate predicted lengths for each age, add a standard deviate
KF_R.390aa<-data.frame(age.2=sampled.ages, preds.2=predicted.lens, rands.2=predicted.lens+rnorm(130, mean=0, sd=1.5))
KF_R.390bb<-data.frame(age.2=sampled.ages, preds.2=predicted.lens, rands.2=predicted.lens+rnorm(130, mean=0, sd=1.5+(0.2*rank(KF_P$age))))
#Plot the contrasting samples relative to the true curve
plot(KF_P$age, KF_P$preds, type='l', xlab="Age(years)", ylab="Length (cm)", ylim=c(0,70))
points(jitter(KF_R.390aa$age.2), KF_R.390aa$rands.2, pch=20)
points(jitter(KF_R.390bb$age.2), KF_R.390bb$rands.2, pch=21)
legend("bottomright", inset=0.05, title = "Unequal n per age class", c("SD fixed (1.5)", "SD varies (1.5-4.1)"), pch=c(20, 21) )
#A quick diagnostic by looking at the residuals
hist(KF_R.390aa$preds-KF_R.390aa$rands, breaks = seq(-12, 12, 1), xlim=c(-12,12), ylim=c(0,120), main="SD fixed", xlab="Preds-Rands", ylab="Frequency")
hist(KF_R.390bb$preds-KF_R.390bb$rands, breaks = seq(-12, 12, 1), xlim=c(-12,12), ylim=c(0,120), main="SD varies", xlab="Preds-Rands", ylab="Frequency")
Now, the sample with age-specific heteroscedastic variance and sample size is looking more like a real dataset.
How will the parameter estimates from these various simulated datasets compare to our initial parameter estimates that we regarded as ‘true’ parametric values? So far, we have been evaluating the simulation results by plotting the data to evaluate it visually. Let’s compare the parameters estimates of each step.
Base model: These parameters fit the original, raw data for female Pacific hake from Kimura (1980).
## Estimate Std. Error
## Linf 61.23329 1.21409
## K 0.29625 0.02874
## t0 -0.05726 0.17534
Simulation a: These next parameters fit the first simulated example, where 30 females per age class were simulated by adding a standard normal deviate, rnorm(x, mean=0, sd=1.5), to the predicted values from the base model (i.e., SD fixed). The standard errors of the estimates have become reduced, presumably because of the increase in sample size relative to the base model (i.e., 390 vs. 13)
## Estimate Std. Error
## Linf 61.55526 0.19524
## K 0.28881 0.00439
## t0 -0.07807 0.02779
Simulation b: These next parameters fit the alternative to the first simulated example. Now, 30 females per age class were simulated by adding a standard normal deviate with increasing SD as the fish grow older (i.e., SD varied). A higher SD used to simulate the standard deviates leads to a higher SE in all three parameter estimates, relative to ‘Simulation a’, and thus greater uncertainty to the true value.
## Estimate Std. Error
## Linf 61.09523 0.392369
## K 0.30389 0.009744
## t0 -0.03519 0.057218
Simulation aa: In this next example, the parameters were fit to a sample with a fixed SD, like in ‘simulation a’, but more younger fish were represented. Relative to ‘simulation a,’ the SE of \(L_∞\) and \(K\) has increased and the SE of \(t_o\) is about the same. This is not too surprising considering the data density is higher near the origin but declines as you get to older ages, where the certainty of \(L_∞\) will be determined.
## Estimate Std. Error
## Linf 61.06773 0.299171
## K 0.29726 0.005218
## t0 -0.04017 0.024043
Simulation bb: In this final example, the parameters were fit to a sample with a variable SD and more younger fish were represented. As in ‘Simulation b’ the SE’s are higher than the simulations with a fixed SD of 1.5. The SE of \(L_∞\) is particularly high, but not so high as the base model, because of the higher SD (i.e., 4.1 by age 13) and the lower sample size at older ages.
## Estimate Std. Error
## Linf 60.77896 0.501716
## K 0.29713 0.008831
## t0 -0.05781 0.041054
In general the SE was lower for all four simulated samples, compared to the original, summarized data, because of the large numbers of ‘fish’ generated by simulation. These SE’s can be examined to do a quick check on the certainty surrounding these estimates.
Among all five outputs, the range in estimates of \(L_∞\), (60.779, 61.5553), is 0.7763 cm. The minimum SE, (0.1952), if multiplied by 4 as a proxy for the minimum 95% confidence interval, is (0.781), which is very similar to the difference between the \(L_∞\) estimates. As a first approximation, there is not much room for statistical differences between these estimates of \(L_∞\) fitted to the four simulated datasets or the raw data.
The range in estimates of \(K\), (0.2888, 0.3039), is 0.0151 cm. The minimum SE, (0.0044), if multiplied by 4 as a proxy for the minimum 95% confidence interval, is (0.0176), which is also very similar to the difference between the \(K\) estimates. Again, but only as a first approximation, there is not much room for statistical differences between these estimates of \(K\) fitted to the four simulated datasets or the raw data.
The range in estimates of \(t_o\), (-0.0781, -0.0352), is 0.0429 cm. The minimum SE, (0.024), if multiplied by 4 as a proxy for the minimum 95% confidence interval, is (0.0962), which is larger than difference between the \(t_o\) estimates. Once more, there is little room for statistical differences between these estimates of \(t_o\) fitted to the four simulated datasets or the raw data.
This approach to compare the approximate 95% confidence limits to the differences of means is only meant to be a back of the envelop method to evaluate the results. It is useful to get a better sense of the data and the model fit, but a more powerful statistical approach is needed to compare model fits.
In Part I, we introduced this test for comparing the parameters of 2 groups. Let’s use this statistical approach to determine if one of the simulated data differs from the raw data in parameter estimates. Our expection is to not reject the null hypothesis, \(H_o\), because the purpose of the simulation was to generate a sample similar to the original data (Kimura, 1980).
If the R package ‘fishmethods’ is not already open on your console, you need to call it us with the library function, as shown in Part I.
We need to build a dataset by stripping out the relevant variabiles (age, length) from the two datasets: the raw data from Kimura (1980) and the last simulated dataset (i.e., Simulated bb), which was deemed to be most realistic. Finally, using cbind we add a dummy variable to identify each dataset because we will combined them together with rbind.
KFraw<-data.frame(cbind(ages=KF_R$age, lengths=KF_R$length))
KFraw<-cbind(KFraw, group="Raw")
KFsim<-data.frame(cbind(ages=KF_R.390bb$age.2, lengths=KF_R.390bb$rands.2))
KFsim<-cbind(KFsim, group="Sim")
KFboth<-rbind(KFraw,KFsim)
The package “fishmethods’ has a command, vblrt, to test for differences between nested VBGMs. As in Part I, the default code by Gary Nelson, the package’s author, is used.
Kimura_LLR<-vblrt(len=KFboth$lengths, age = KFboth$ages, group=KFboth$group, error=2, select=1, plottype=2)
The plots, above, show much lower variability of the residuals of the simulated data relative to the raw data. This is particularly true for the young and middle ages but less so among the oldest age classes. Overall, these patterns show you the effect of sample size on the residuals.
The ratio values, below, are tested using a \(\chi^2\) stastistic to see if models that have one or more common parameters (H1-H4) are different than the general or full model (Ho) that allowed all three parameters to vary by group. If a particular hypothesis is rejected, then group-specific values, are warranted.
Here, as expected, we find no significant probability values, (range of \(p\): 0.862, 0.934). Thus, we do not reject the hypotheses that parameters are equal among groups (H1-H4, \(\alpha\) = 0.05), in favor of an alternative model that all parameters differed (Ho) .
Kimura_LLR$results
## tests hypothesis chisq df p
## 1 Ho vs H1 Linf1=Linf2 0.03 1 0.862
## 2 Ho vs H2 K1=K2 0.02 1 0.888
## 3 Ho vs H3 t01=t02 0.01 1 0.920
## 4 Ho vs H4 Linf1=Linf2,K1=K2,t01=t02 0.43 3 0.934
This result, or non-result, is consistent with our purpose to have simulated a realistic dataset that does not differ in parametric estimates from the data presented in Kimura (1980). Further exploration of our success would benefit from a statistical power analysis, which we will leave to another chapter.
In this chapter the power of R is used to fit a model to data and simulate four samples with the same parameter estimates. The different samples simulated varying degrees of realistic patterns, allowing us to explore, for example, the effects of sample size, variance, and mortality.
The specifics presented here only briefly introduce the process of simulating fish growth data. Other patterns, such as gear selectivity for young fish, could be added. In this manner, one can simulate a sample that behaves like a dataset you possess, but you can vary it to see how sensitive it is to manipulation.
The relevance of the specific values used here, for example for SD, was not so important. These depend on the natural history of the species. A plot of residuals was used to approximate how much variability there was among age classes, but information on the variability within age classes was not available from the Kimura (1980) dataset. If a species is data poor than these may be merely guesses, but if there is sufficient data, then you can simulate data quite realistically. Regardless, the emphasis was to show the general approach in a step-by-step manner.
Kimura, (1980). Fish. Bull. 77: 765-776; http://fishbull.noaa.gov/77-4/kimura.pdf↩
If you are unfamiliar with the rnorm function, then see the corresponding chapter on ‘Data simulation and hypothesis testing.’↩