Maturity schedules: matching data with models

Introduction

Fishery scientists separate the immature from the mature components of a stock for various reasons. In particular, most actions towards rebuilding or maintaining sustainability require keeping the mature or spawning component of the population above some threshold. This component is often expressed in terms of aggregate weight, typically called the spawning stock biomass.

Maturity can occur as a multi-step precess over the life-time of a fish, but fundamentally, these steps can be aggregated into binary categories (immature, mature). Fish are initially immature and later they become mature. The general linear model can address binary responses, but it requires that you no longer assume a continuous distribution of the response variable. This is not particularly complicated, as there are several well-known binomial distributions to use in modeling maturity data; however, such models are not necessarily taught in a basic statistics course, so we will go though this in detail here.

Our motivation is to be able to effectively demonstrate to an audience not only a point estimate, such as the median size or age at maturity, but the confidence we have in such an estimate. We will also introduce an example in which an important covariate, such as sex or stock area, explains enough of the variability in maturity data that it should be accounted for in indentifying stock structure or in building an assessment model.

Simulated Data

The lesson here is based on simulated data for a fish that is modestly long-lived, with a maximum age in the teens, that matures at about a third of its maximum age. In simulating a data set that will be exactly the same on everyone’s computer, we need to set a ‘seed’ value in the R envrionment. If you are following along in this manner, begin your R session with the same seed value as shown here, and simulate ages of 250 immature and another 250 mature fish with the rbinom function. Initially, we will not consider sex, or any other covariate, that could affect the maturity schedule of this sample of 500 fish.

set.seed (2468) # This 'seed' fixes the simulated values
imm <- rbinom (250, 10, .30) #immature fish
mat <- rbinom (250, 30, .30) # mature fish
Age <- c(imm, mat) #concantenate the ages together
I.M <- c(rep (0, 250), rep (1, 250)) #assign 0=immature & 1 = mature
SimData <- data.frame (Age, I.M) #assemble both arrays into a data frame
table (SimData) # examine the data

##     I.M
## Age   0  1
##   0   2  0
##   1  32  0
##   2  64  1
##   3  72  3
##   4  51  7
##   5  19 13
##   6   6 14
##   7   4 38
##   8   0 37
##   9   0 51
##   10  0 31
##   11  0 24
##   12  0 12
##   13  0 12
##   14  0  7

These data were carefully selected to have a wide range of fish ages from 0 to 14, with a reasonable overlap in ages of immature and mature fish. A quick plot also makes it obvious that immature fish are younger than mature fish.

par (mfrow = c(1,2)); par (las=1) #These are graphic parameters
hist(imm, xlab='Age (years)'); hist(mat, xlab='Age (years)')

If we want to combine these histograms into one figure, the following code walks you through a rather colorful way to present your data. The first two statements assign different colors to the immature and mature fish groups. These colors are actually transparent, so when they are overlapped, a third color will appear to depict the ages that are both immature and mature in this sample.

trred <- rgb(1,0,0,0.5)  #this is a transparent red
trblue <- rgb(0,0,1,0.5) #this is a transparent blue
par (las=1) # this puts the y-axis labels 'upright'
hist(imm,col=trred, breaks=seq(0, 15, 1), right = F, main="", xlab="Age (years)", ylab="Frequency", ylim=c(0,100))
par(new=TRUE) # redraws the other dataset on the same axis
hist(mat,col=trblue,breaks=seq(0, 15, 1), right = F, main="", xlab="", ylab="", ylim=c(0,100))
legend("topright",pch=c(15, 15), col=c(trred, trblue),  legend=c("Immature", "Mature"), xjust=0, bty="n",  title=expression(""), inset = 0.1, 0.05)

Figure legend: Immature fish are red, mature fish are blue, and both maturity classes are purple (i.e. these histograms are overlapping and not stacked).

The binomial model

When using a binary response variable, such that we are predicting which of two possible events is likely, a binomial model is appropriate. The two figures below show how the familar form of the linear model, \(y_i = \alpha + \beta x_i\), will not provide any meaningful analyis to these data.

par (mfrow = c(1,2)); par (las=1)
plot (Age, I.M) # Plot the data
I.M.pred = lm (I.M~Age) # Estimate the coefficients
abline (I.M.pred$coef) # Plot the predicted slope and intercept
plot (Age, I.M.pred$residuals) # New plot for residuals
abline (0, 0) # Plot a reference line of no model error
segments (Age, 0, Age, I.M.pred$residuals) #Connect residuals with the reference line

While the general trend is captured by the regression slope (i.e. the probability of maturity increases with age), there is no predictive precision to the regression. Worse, the residuals are obviously out of wack, indicating a new model structure is needed.

A binomial model is not only appropriate, the data can still fit a general linear model. This is possible because, while the predicted curve of a binomial model may not be linear, the relationship between the data and a binomial link function can be linear. Consider logistic regression, which is a common binomial model, as an example of this linear link function. In mathmatical terms, its general, non-linear form, \(P_i = (1 + e\) ^{-(\(\alpha\)+\(\beta x_i\))})^-1, can be rewritten as the natural log of an odds ratio, \(ln (P_i/1-P_i) = \alpha+\beta x_i\), which has a linear form: \(y_i = \alpha + \beta x_i\). An odds ratio of 50 means that an event will occur half the time. In the original, non-linear form of logistic regression, this is the inflection point of a symetrical s-shaped curve. In our specific case here, the inflection is the median age of maturity, \(A_{50}\).

In the R environment, the ‘logit’ link can be accessed with the general linear model (glm) statement. The logit link is the default of family ‘binomial’.

SimData.mat <- glm(I.M~Age, data = SimData, family=binomial); summary(SimData.mat)

## 
## Call:
## glm(formula = I.M ~ Age, family = binomial, data = SimData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.25877  -0.26611  -0.01400   0.09655   3.09185  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.6678     0.7330  -10.46   <2e-16 ***
## Age           1.4482     0.1416   10.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 693.15  on 499  degrees of freedom
## Residual deviance: 175.22  on 498  degrees of freedom
## AIC: 179.22
## 
## Number of Fisher Scoring iterations: 7

You may recall that the quotient of the coefficients (-7.67, 1.45) calcuates the inflection point (\(A_{50}\)). Although simple in concept, the equation in R is cumbersome:

A50 <- abs (signif(SimData.mat$coefficients[1]/SimData.mat$coefficients[2], digits=3))

This equation correctly calculates the \(A_{50}\) as 5.29, but there are other ways to do this.

If you load the MASS package, you can estimate any point on the predicted curve, shown here as estimates at 1, 5, 50, 95 and 99% age at maturity.

library (MASS) # for p.dose
doses <- signif(dose.p (SimData.mat, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3); doses

##           Dose        SE
## p = 0.01: 2.12 0.3116284
## p = 0.05: 3.26 0.2171880
## p = 0.50: 5.29 0.1404376
## p = 0.95: 7.33 0.2669555
## p = 0.99: 8.47 0.3670347

As a quick check, you can also input various ages into the parameterized, non-linear logistic equation: \(P_i = 1/(1 + e\)^{-(-7.67+1.45\(x_{Age}\))}). If we input \(A_{50}\)=5.29 then we expect that this resulting value, 0.498, will match 0.50, which it does.

Graphic display guidelines

In terms of best practices of modeling maturity data, I would assert that you want to display and report on several components of your analysis:

the data,
the predicted curve (the ogive),
specifically meaningful points on the ogive,
uncertainty around the prediction.

The following code does all of these things for our simulated data, all in a single image.

par (las=1)

#Add or calculate some elements to plot 
vxb=data.frame(Age=seq(0,15,0.1))
tmb=predict(SimData.mat, newdata=vxb, se.fit=TRUE) #2 vectors

# this is phat, the estimated prob. of maturity
vyb= 1/(1+exp(-tmb$fit))  

# these are the confidence limits assuming a normal distribution around the predicted fit
cl95upb=1/(1+exp(-(tmb$fit+qnorm(0.975)*tmb$se.fit)))
cl95lob=1/(1+exp(-(tmb$fit-qnorm(0.975)*tmb$se.fit)))

#Plot the predicted line
plot (vxb$Age, vyb, type = 'l', xlab = "Age (years)", ylab = "Probability",  xlim = c(0,15))

#Plot the data as tick marks (rugs), and 'jitter' them a bit so they are not all on top of each other
rug (jitter(SimData$Age[SimData$I.M==0]))
rug (jitter(SimData$Age[SimData$I.M==1]), side=3)

#Plot the 95% confidence limits
lines (vxb$Age, cl95upb, lty =2)
lines (vxb$Age, cl95lob, lty =2)

#Add some point estimates
dose95 <- doses[4]
text (12.5, 0.5, as.expression(substitute(italic(A)[95] == dose95, list (dose95=dose95))))
dose50 <- doses[3]
text (12.5, 0.4, as.expression(substitute(italic(A)[50] == dose50, list (dose50=dose50))))
dose05 <- doses[2]
text (12.5, 0.3, as.expression(substitute(italic(A)[05] == dose05, list (dose05=dose05))))

The data are shown as tick marks along the lower (immature fish) and upper (mature fish) axis. Note, the tick marks are ‘jiggered’ to depict the density of the data at each age. This may not be needed when plotting length data or when the age data are in decimal values. The predicted curve is the solid ogive, and three key points (5, 50, 95% maturity) are written in text to the side. The uncertainty around the predicted curve is plotted as 95% confidence limits with a dashed line format. Here there is little uncertainty around any point estimate for this sample; however, this is a simulated, large sample size of both immature and mature fish, so it is a rather idealized example.

More realistic examples

It is common that small, immature fish are difficult to collect because of gear selectivity or regulations such as minimum size limits. This poses a problem, because the model cannot fit properly to where there is no data. How much of a problem is this? Here we explore two examples: 1) a knife edge selection at age 4, and 2) the same knife edge selection, plus a much lower sample size overall.

Knife-edge selection

First, we subset the existing, simulated data set by deleting all fish (immature or mature) that are less than four years old, and we re-examine the resulting model fit and selected parameter estimates. We also plot the result, although for brevity, we do not show the plotting code.

SimData2 <- subset (SimData, Age>3); table (SimData2)

##     I.M
## Age   0  1
##   4  51  7
##   5  19 13
##   6   6 14
##   7   4 38
##   8   0 37
##   9   0 51
##   10  0 31
##   11  0 24
##   12  0 12
##   13  0 12
##   14  0  7

SimData2.mat <- glm(I.M~Age, data = SimData2, family=binomial)
SimData2.mat # Or use the summary statement to get a full summary

## 
## Call:  glm(formula = I.M ~ Age, family = binomial, data = SimData2)
## 
## Coefficients:
## (Intercept)          Age  
##      -8.072        1.518  
## 
## Degrees of Freedom: 325 Total (i.e. Null);  324 Residual
## Null Deviance:       363.3 
## Residual Deviance: 139   AIC: 143

dos2 <- signif(dose.p (SimData2.mat, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3); dos2

##           Dose        SE
## p = 0.01: 2.29 0.3954110
## p = 0.05: 3.38 0.2751920
## p = 0.50: 5.32 0.1432918
## p = 0.95: 7.26 0.2808743
## p = 0.99: 8.35 0.4015980

(plotting code not shown, but see previous example for format)

The specific point estimates along the predicted curve and the 95% confidence envelop have changed very litte as a result; however, we only cropped off ages from the extreme left ‘tail’ of the data range. Also, although the subset procedure reduced the sample size from 500 to 326, our sample size is still quite large.

Knife-edge selection & additional subsampling

What if we reduced the sample size further, to say 10% of the original sample size? Again, we will simulate a knife-edge selectivity at age 4 and then randomly subsample the immature and mature fish from the original data arrays. We will also plot the data and model fit, with only minor modifications of the code (code not shown, for brevity)

# delete immature and mature fish 3 years or less
imm2 <- subset (imm, imm>3) #immature fish
mat2 <- subset (mat, mat>3) # mature fish
# subsample each string to 10% of the original sample size
set.seed (1357)
imm25 <- sample (imm2, 25)
mat25 <- sample (mat2, 25)
Age25 <- c(imm25, mat25) #concantenate the ages together
I.M25 <- c(rep (0, 25), rep (1, 25)) #assign 0=immature & 1 = mature
SimData25 <- data.frame (Age25, I.M25); table (SimData25)

##      I.M25
## Age25  0  1
##    4  14  1
##    5   6  3
##    6   3  0
##    7   2  6
##    8   0  2
##    9   0  4
##    10  0  3
##    11  0  3
##    13  0  1
##    14  0  2

SimData25.mat <- glm(I.M25~Age25, data = SimData25, family=binomial);  SimData25.mat

## 
## Call:  glm(formula = I.M25 ~ Age25, family = binomial, data = SimData25)
## 
## Coefficients:
## (Intercept)        Age25  
##      -7.351        1.203  
## 
## Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
## Null Deviance:       69.31 
## Residual Deviance: 33.06     AIC: 37.06

dos25 <- signif(dose.p (SimData25.mat, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3); dos25

##           Dose        SE
## p = 0.01: 2.29 1.0224982
## p = 0.05: 3.66 0.6853840
## p = 0.50: 6.11 0.3796820
## p = 0.95: 8.55 0.8470337
## p = 0.99: 9.93 1.1966992

(Plot code not shown)

In these more realistic examples, we can see the effect of gear selectivity is minor when sample sizes are high, especially when the fish selected covered the majority of the ogive. However, When sample size was dramatically reduced, the parameter estimates (e.g., \(A_{50}\)) changed considerably, but so did our confidence in the estimates. We have less confidence about any particular predicted point as indicated by the much wider 95% confidence limits. All things being equal we are more confident about the parameter estimates when sample size is larger, but the examples here are not unrealistic.

Binomial model selection

So far we have limited ourselves to the logit cummulative density function (CDF). Other symmetric functions are the probit and cauchit, which have ‘thinner’ or ‘fatter’ tails than the logit function. The complementary log-log function or the Gompertz growth function (not shown) are also sigmoid in shape but not symmetrical around an inflection point.

One may ask if specific functions demonstrate less uncertainty than another for your sample. This can be checked with Akaike’s information criterion (AIC), which is the 11th element of the glm output, so you will see us call ‘[11]’ as we build a list of AIC values for each CDF, below.

mod.logit <- glm(I.M25~Age25, data = SimData25, family=binomial(link = "logit"))
mod.probit <- glm(I.M25~Age25, data = SimData25, family=binomial(link = "probit"))
mod.cauchit <- glm(I.M25~Age25, data = SimData25, family=binomial(link = "cauchit"))
mod.cloglog <- glm(I.M25~Age25, data = SimData25, family=binomial(link = "cloglog"))
#This will print out the AIC values of each model 
AIC_values <- list(logit=mod.logit[11], probit=mod.probit[11], cauchit=mod.cauchit[11], cloglog=mod.cloglog[11])
AIC_values

## $logit
## $logit$aic
## [1] 37.05537
## 
## 
## $probit
## $probit$aic
## [1] 36.82318
## 
## 
## $cauchit
## $cauchit$aic
## [1] 38.3881
## 
## 
## $cloglog
## $cloglog$aic
## [1] 36.20294

The maximum difference in AIC values in this list is ~ 2. Such a small difference indicates that no particular model is less certain than the others. However, AIC only compares one model against the other, and it cannot say if any model is particularly realistic or helpful to our needs. To do this, we should also check how model selection affects the parameter estimates for each CDF.

signif(dose.p (mod.logit, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3)

##           Dose        SE
## p = 0.01: 2.29 1.0224982
## p = 0.05: 3.66 0.6853840
## p = 0.50: 6.11 0.3796820
## p = 0.95: 8.55 0.8470337
## p = 0.99: 9.93 1.1966992

signif(dose.p (mod.probit, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3)

##           Dose        SE
## p = 0.01: 2.79 0.7964751
## p = 0.05: 3.76 0.5901389
## p = 0.50: 6.08 0.3661814
## p = 0.95: 8.41 0.7704293
## p = 0.99: 9.37 0.9906066

signif(dose.p (mod.cauchit, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3)

##             Dose        SE
## p = 0.01: -13.50 9.3198227
## p = 0.05:   2.45 1.8994648
## p = 0.50:   6.39 0.4206968
## p = 0.95:  10.30 1.8873781
## p = 0.99:  26.30 9.3074390

signif(dose.p (mod.cloglog, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3)

##           Dose        SE
## p = 0.01: 1.58 1.3719383
## p = 0.05: 3.39 0.8852475
## p = 0.50: 6.28 0.3467655
## p = 0.95: 7.91 0.5803044
## p = 0.99: 8.39 0.6935260

The cauchit model had unrealistic values, specifically a early maturation point estimate (\(A_{01}\)) that was < 0. The other three models all had similar median point estimates (e.g., the range of \(A_{50}\) estimates was ~ 0.2 year), but differed noticably in point estimates near the tails (e.g., \(A_{01}\), \(A_{99}\)).

This commentary is true only for this particular dataset. It is possible that one function may work better with one sample, stock, sex, or another. However, the logit function is most commonly used, and if you do use a different CDF, then that can confound your ability to compare your parameters to past studies that used the logit.

Using AIC to identify important covariates

One final consideration is that you may want to compare maturity curves, such as between two sexes, or two datasets taken at different times or from different stocks. We would treat sex, time, or stock as covariates to age or length as a predictor variable. Again, we will modify the original, simulated data set. The focus here is on one covariate: stock. We simulate a northern (N) stock that matures one year later than a southern (S) stock. We also add some random variation to the ages to simulate integer values, and we set a sample size of 100 fish for both stocks (i.e., 50 immature and 50 mature fish per stock).

set.seed (13579)
Age <- c(imm [1:50]+rnorm(50,0,0.2), mat[1:50]+rnorm(50,0,0.2),         
imm[1:50]+1+rnorm(50,0,0.2), mat[1:50]+1+rnorm(50,0,0.2))
I.M <- c(rep (0, 50), rep (1, 50), rep (0, 50), rep (1, 50)) 
S.N <- c(rep ("S", 100), rep ("N", 100))
SimData.SN  <- data.frame (S.N, Age, I.M) # The whole dataset
SimData.S <- subset (SimData.SN, S.N == "S") # southern fish only
SimData.N <- subset (SimData.SN, S.N == "N") # northern fish only

We anticipate that the parameters will not only be different, but the \(A_{50}\) and other estimates will be approximately one year apart. We check that with the now (hopefully) familiar glm and dose statements.

SimDataS.mat <- glm(I.M~Age, data = SimData.S, family=binomial)
SimDataN.mat <- glm(I.M~Age, data = SimData.N, family=binomial)

dosS <- signif(dose.p (SimDataS.mat, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3); dosS #Southern stock

##           Dose        SE
## p = 0.01: 2.91 0.6861394
## p = 0.05: 3.81 0.4833923
## p = 0.50: 5.44 0.2865610
## p = 0.95: 7.06 0.5276355
## p = 0.99: 7.97 0.7352517

dosN <- signif(dose.p (SimDataN.mat, p = c(0.01, 0.05, 0.50, 0.95, 0.99)), digits = 3); dosN #Northern stock

##           Dose        SE
## p = 0.01: 3.63 0.6872551
## p = 0.05: 4.61 0.4870239
## p = 0.50: 6.36 0.3024432
## p = 0.95: 8.10 0.5445208
## p = 0.99: 9.08 0.7515851

We can also plot these differences, although for brevity, I will not show the plotting code (see above examples).

We are curious if these known, simulated differences in the maturity ogives can be exposed quantitatively.

To do this, we can use Akaike’s Information Criterion (AIC) again to examine the certainty of the full model compared to reduced models. The full model includes all effects, the main effects and their possible interactions, and their error terms. The reduced models are considered without an interaction term or even without one of the main effects.

In plain English, we are asking if adding stock information, or a stock & fish age interaction, is more predictive of maturity status than fish age alone. The predictive power of stock alone is included as a forth model, but it is a trivial hypothesis, since we do not expect that modeling the data without age will be meaningful.

We will use a corrected form of AIC (AICc). This is more relevant at smaller sample sizes, which are typical in fisheries research, so take note of the package AICcmodavg for future use.

library (AICcmodavg) # Using the second order AIC = AICc
# First, fit the 4 possible models
Mod.1 <- glm(I.M ~ Age*as.factor(S.N),'binomial', data = SimData.SN)
Mod.2 <- glm(I.M ~ Age+as.factor(S.N),'binomial', data = SimData.SN)
Mod.3 <- glm(I.M ~ Age,'binomial', data = SimData.SN)
Mod.4 <- glm(I.M ~as.factor(S.N),'binomial', data = SimData.SN)
# Second, compare these models
aictab(list(Mod.1, Mod.2, Mod.3, Mod.4))

## 
## Model selection based on AICc:
## 
##      K   AICc Delta_AICc AICcWt Cum.Wt      LL
## Mod2 3  58.60       0.00   0.62   0.62  -26.24
## Mod1 4  60.63       2.04   0.22   0.84  -26.21
## Mod3 2  61.31       2.71   0.16   1.00  -28.62
## Mod4 2 281.32     222.72   0.00   1.00 -138.63

The four AIC values evaluated are:

the full model (Mod.1) has terms for both main effects (a continuous variable for age and a factorial covariate of stock area), as well as an interactive effect of these main effects,
a partly reduced model (Mod.2) has only the main effects (no interaction term),
a fully reduced model that considers only the effect of age (Mod.3), and
the last possible reduced model, which considers only the effect of stock area (Mod.4).

The lowest AICc value (58.6) suggest that Mod.2, a partly reduced model that includes age and stock area as main effects, provides the least uncertainty. This model is 2.04 AICc points less then the full model, and in this simulated example, we happen to know that there was no interaction simulated in the data. It is also 2.71 AICc points less than the reduced model that only uses age as a predictor variable. This makes sense, because we know that we simulated a stock difference in the two maturity ogives. Finally, a reduce model of ‘stock’ as a predictor alone is not an informative model (\(\Delta AIC_c\) = 223), as we expected.

A \(\Delta AIC_c\) value of at least 2-3 is generally considered enough to say that the covariate is worth considering, simply from an information theoretical point of view. From a biological point of view, we should go back to the biological outcome itself. In this case, the stock-specific time step difference in \(A_{50}\) is a whole year, which would be a significant period in most fishery management contexts.

Once the covariate, as a main or interactive effect, is identified as important, then estimation of maturity should account for the covariate. This can be done with a single model equation that explicity states the covariate, such as:

summary(Mod.2)

## 
## Call:
## glm(formula = I.M ~ Age + as.factor(S.N), family = "binomial", 
##     data = SimData.SN)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.54975  -0.09254  -0.00584   0.07985   1.81613  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -11.0802     1.9626  -5.646 1.65e-08 ***
## Age               1.7449     0.3055   5.712 1.12e-08 ***
## as.factor(S.N)S   1.5873     0.7697   2.062   0.0392 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.259  on 199  degrees of freedom
## Residual deviance:  52.475  on 197  degrees of freedom
## AIC: 58.475
## 
## Number of Fisher Scoring iterations: 8

We can relate this back to a parameterized logistic equation, now with a covariate, which is: \(P_i\) = 1/(1 + exp(-1*(-11.08+1.74\(x_{Age}\)+1.59\(x_{S.N}\))).

If we input \(A_{50}\) of the southern stock = 5.44 and a stock value of 1, then we expect that this resulting value, 0.5, will match 0.50, which it does.

If we input \(A_{50}\) of the northern stock = 6.36, and a stock value of 0, then we expect that this resulting value, 0.504, will match 0.50, which it does.

Conclusions

This lesson demonstrates logistic regression as applied to model selection and parameter estimation of fish maturity. Using a simulated data set it looked at several issues:

A colorful way to plot the raw data as a histogram
A simple demonstration as to why a typical linear model will not fit binomial data
Explaining how a link function allows the odds ratio to fit the linear model (e.g., logistic regression)
Best practices in terms of plotting the data and the model results in a single image
The effects of gear selectivity and sample size
Alternative cummulative density functions
And how to identify an important covariate