Simple Crop Yield Modelling

Author

Affiliation

Dan Bebber

Published

February 21, 2025

Crop modelling

Types of crop models

There are many ways to model crop yields, from physiologically-based process models to statistical pattern-matching approaches. Process models ideally simulate plant responses to instantaneous environmental conditions and physiological status, but are commonly driven by hourly or daily data on key predictors. For example, APSIM and DSSAT use daily mean temperature data for thermal time models of crop phenology and leaf area. Clearly this is a statistical approximation of the physiology of plant growth, but one that avoids the need for very high temporal resolution weather data while retaining good predictive power.

Process models need a lot of data

While models driven by daily climate data provide a useful statistical approximation of physiological processes, they still require a very large amount of input data (hundreds of days through the growing season) and process modules (e.g. LAI accumulation and loss, photosynthetic rate, photosynthate allocation to plant organs, nutrient uptake) each with multiple parameters. Such models typically require setting of hundreds of parameters and may be impractical for application in GBCL, where high resolution global gridded estimates, or estimates for hundreds of thousands of samples, are required. In addition, process-based models may not be available for some crops.

Statistical models should approximate biology

An alternative is to fit more abstract statistical models of key predictors to yield observations. There are many ways to do this, including highly abstracted machine learning approaches such as Random Forest. Statistical models can have a more-or-less realistic structure. For example, physiological processes tend to be most rapid at some optimal level of an environmental variable, and decline to zero as the environment shifts away from that optimum. These physiological responses, aggregated to the whole organism, give rise to the ecological niche which is the location of the organism within a multidimensional space defined by responses to all environmental factors that influence growth, survival and reproduction.

Tractability-realism trade offs

As we average environmental conditions over space (from cell to leaf to field to grid cell) and time (from second to hour to day to year) for the sake of computational tractability and data availability, we move away from true organismal responses to statistical correlations. Fortunately, in many cases these correlations remain informative (for example in the use of climatologies to predict species distributions). For crops, mean growing season conditions may be relatively well correlated with yields because of temporal autocorrelation in weather - there are years with warmer seasons and years with cooler seasons, with crops responding to elevated or depressed temperatures across many days. The effects of transient fluctuations (weather shocks) that do not markedly influence the mean will not be captured. Annual mean conditions will be less well correlated than growing season conditions since they include potentially very different, and sub-optimal, off-season weather.

Conditions for realistic models

As a first approximation, growing season average conditions can explain a large fraction of variability in yield. However, it is imperative that appropriate functions of environmental variables are fitted to data which acknowledge the unimodal nature of most biological responses. Thus, we should not expect crop yields to increase or decrease linearly with temperature, for example, but rather peak at some intermediate optimal range. We expect this to be the case even when conditions are averaged over space and/or time, within reason. Additionally, the model should be formulated such that yield estimates are bounded between zero and a maximum. When considering multiple predictors, the Law of the Minimum must be obeyed. If an environmental factor is suboptimal, then yield is restricted even if all other factors are optimal.

A simple model

Modifying maximum yield by limiting factors

These three conditions suggest that simple linear and polynomial models (e.g., quadratic) are inadequate, as they do not inherently enforce unimodal responses, bounded yields, or limiting factors. Instead, the observed yield \(Y_{obs}\) can be modelled as the product of an assumed maximum yield \(Y_{max}\), which could occur under perfectly optimal conditions, and a scalar \(\mathbf{S}\) which represents the combined influence of some number of limiting factors:

\[ Y_{obs} = Y_{max} \times \mathbf{S} \]

The scalar \(\mathbf{S}\) is itself the product of functions of environmental variables bounded in \((0,1)\):

\[ \mathbf{S} = \prod_{i=1}^N{f(v_i)} \]

where \(f(v_i)\) is a function of the \(i\)th environmental variable \(v\), of which \(N\) are considered in the model.

Modelling banana yield using beta functions

An example of this approach relates to banana yields modelled using beta functions of mean annual temperature and precipitation. Use of annual means is defensible for banana for two reasons: firstly, banana is a perennial crop with continuous production through the year; secondly, banana is grown in relatively aseasonal tropical climates therefore the annual variability is limited and the mean, particularly for temperature, is a reasonable representation of daily conditions. The relationship between yield and these climate variables was modelled using beta functions determined by minimum (\(v_{\min}\)), optimum (\(v_{\mathrm{opt}}\)) and maximum (\(v_{\max}\)) values of predictors:

\[ f(v) = \left(\frac{v_{\max}-v}{v_{\max}-v_{\mathrm{opt}}}\right)\left(\frac{v-v_{\min}} {v_{\mathrm{opt}}-v_{\min}}\right)^\frac{v_{\mathrm{opt}}-v_{\min}}{v_{\max}-v_{\mathrm{opt}}} \]

The model, fitted by optimization to reported annual yields across Latin America and the Caribbean, gave banana yield predictions which were highly correlated with observed values (\(r=0.6\)). When beta function responses are multiplied, the resulting scalar has a single optimum (Figure 1).

Figure 1: Product of beta functions of two predictors (\(v_1, v_2\)).

Simple response functions

Numerous different functions, besides the beta function, can be employed to model the influence of environmental limiting facts on yield. Another option is the trapezoid, defined by four parameters, which provides for a wider range of optimal conditions (Figure 2 a). In some cases, responses to environmental variables might appear to be monotonic (i.e. increasing or decreasing) without revealing the optimal range. For example, crop yields tend to increase with nitrogen fertilizer application, and toxic levels of N are seldom reached. In this case, a simpler two-parameter bounded linear function can be employed (Figure 2 b). These response functions are utilized to model limiting factors to crop growth in the UN FAO ECOCROP model.

Figure 2: Simple yield modifier functions. a) Trapezoid. b) Bounded linear.

Modelling crop yields in R

Crop modelling

The R statistical programming environment has a vast library of methods for both fitting models to data, and for predicting yields using process-based and statistical models.

R contains implementations of process-based crop models including APSIM (apsimx package), DSSAT (DSSAT) and WOFOST (Rwofost). These can be used to predict what crop yields should be for a particular site, providing that sufficient information is available to parameterize them.

The ECOCROP model is implemented in the Recocrop package. This package could be used to supply parameters for limiting factor functions of numerous crops. For example, ECOCROP provides parameters for trapezoid, bounded linear and step functions for maize (Table 1). These parameters could be applied in a statistical fit to observed yields, leaving \(Y_{max}\) as the parameter to be fitted.

Table 1: ECOCROP parameters for maize (Zea mays).

Variable	Min	Opt1	Opt2	Max
Temperature	10	18	33	47
Rainfall (annual)	400	600	1200	1800
Latitude	40			48
Altitude				4000
Soil pH	4.5	5.0	7.0	8.5

Fitting models to data

Various optimization methods are available in R to enable yield modifier function parameters to be updated in the light of new observations.

Fitting methods include:

Least-squares fits, e.g. (generalized) linear and non-linear modelling in packages like MASS, nlme and nlstools
Maximimum likelihood and Bayesian methods, e.g. Bayesian regression models (brms) and integrated nested Laplacian approximation (R-INLA).
General optimization methods, of which there is a large and growing selection.

When large amounts of training data (i.e. observed yields) are available along with many predictors, a potentially interesting way of estimating response functions is through Boundary Line Analysis (BLA), implemented in the BLA package. BLA takes advantage of the idea that \(Y_{obs}=Y_{max} \times \mathbf{S}\), which means that when \(Y_{obs}\) is plotted against any particular predictor \(v_i\), the maximum values of \(Y_{obs}\) along \(v\) should follow \(Y_{max} \times f(v_i)\) because all other predictor functions should be non-limiting:

\[ \prod^{N-1}_i{f(v_i)} = 1 \]

Simulated example

# Required packages
library(tidyverse) # data handling
library(ggplot2) # plotting
library(cowplot) # plotting support

# Reproducible results
set.seed(0)

The relatively simple case of fitting the product of two beta functions to observed yield by optimization, through minimization of least squares, could proceed as follows:

Simulate a dataset

Some assumptions were made regarding errors and uncertainties.

Firstly, that the observed values of predictor variables \(v_1\) and \(v_2\) are not the true values which determine the observed yield, i.e. we don’t know the true levels of the predictors: \(v_{obs}=v_{true}+\epsilon\) where \(\epsilon \sim \mathcal{N}(0, \sigma^2)\).

Secondly, that the expected yield is determined by the true levels of the predictors: \(Y_{exp} = f(v_{true})\).

Thirdly, that the observed yield has lognormal error proportional to the expected yield: \(Y_{obs} = Y_{exp} \times e^\eta\) where \(\eta \sim \mathcal{N}(1,\psi^2)\).

# Beta function with three parameters (min, opt, max)
beta3 <- function(v, vmin, vopt, vmax) {
  vf <- ((vmax - v) / (vmax - vopt)) *
    ((v - vmin) / (vopt - vmin)) ^
    ((vopt - vmin) / (vmax - vopt))
  vf[v <= vmin | v >= vmax] <- 0 # Set values outside range to zero
  vf
}

# Function for predicted yield (Y_max * scalar)
y_pred <- function(y_max, v1_min, v1_opt, v1_max, v2_min, v2_opt, v2_max, v1, v2) {
  y_max *
    beta3(v1, v1_min, v1_opt, v1_max) *
    beta3(v2, v2_min, v2_opt, v2_max)
}

# Simulate observed yield
Y_dat <- data.frame(
  v1 = runif(10000, 0, 1), # True predictor v1
  v2 = runif(10000, 0, 1), # True predictor v2
  v1_err = rnorm(10000, 0, 0.06), # Normal error in measuring v1
  v2_err = rnorm(10000, 0, 0.06), # Normal error in measuring v2
  y_err = rlnorm(10000, meanlog = 0, sdlog = 0.1) # Lognormal error in observed yield
) %>%
  mutate(v1_obs = v1 + v1_err,
         v2_obs = v2 + v2_err)

# True parameters
# y_max, v1_min, v1_opt, v1_max, v2_min, v2_opt, v2_max
par_true <- c(5, 0.2, 0.5, 0.8, 0.3, 0.6, 0.9) # True parameters

# Modelled and observed yields
Y_dat <- Y_dat %>%
  mutate(
    # Modelled yield
    Y_mod = y_pred(par_true[1], par_true[2], par_true[3], par_true[4], 
                      par_true[5], par_true[6], par_true[7], v1, v2),
    # Observed yield (with error)
    Y_obs = Y_mod * y_err) %>% # Multiplicative error
  dplyr::filter(Y_obs > 0) # Zero yields go unreported

# Plot observed yield against predictors
theme_set(theme_cowplot(font_size=10))
fv1 <- ggplot(Y_dat, aes(v1_obs, Y_obs)) + 
  geom_point(size = 0.1) +
  labs(x = expression(v[1]), y = expression(Y[obs]))
fv2 <- ggplot(Y_dat, aes(v2_obs, Y_obs)) + 
  geom_point(size = 0.1)+
  labs(x = expression(v[1]), y = expression(Y[obs]))
plot_grid(fv1, fv2, nrow = 1, labels = "auto", label_size=10)

Figure 3: Observed yield vs. observed predictors.

The simulated dataset appears to be very noisy, but a unimodal relationship between the predictors and observed yield is evident (Figure 3).

Optimize model parameters

We’ll optimize model parameters by least squares fitting using optim(). Parameter estimates will be constrained within reasonable bounds by setting method = "L-BFGS-L". Initial parameter estimates were taken by inspection of Figure 3.

# Objective function to minimize (Sum of Squared Errors)
sse <- function(params, v1, v2, Y_obs) {
  y_max <- params[1]
  v1_min <- params[2]; v1_opt <- params[3]; v1_max <- params[4]
  v2_min <- params[5]; v2_opt <- params[6]; v2_max <- params[7]
  
  # Predicted yield
  Y_pred <- y_pred(y_max, v1_min, v1_opt, v1_max, v2_min, v2_opt, v2_max, v1, v2)
  
  # Sum of squared errors
  sum((Y_obs - Y_pred)^2, na.rm = TRUE)
}

# Initial parameter estimates
# y_max, v1_min, v1_opt, v1_max, v2_min, v2_opt, v2_max
par_init <- c(5.5, 0, 0.5, 1, 0.1, 0.5, 1)

# Optimization
optim_results <- optim(
  par = par_init,
  fn = sse,
  v1 = Y_dat$v1_obs,
  v2 = Y_dat$v2_obs,
  Y_obs = Y_dat$Y_obs,
  method = "L-BFGS-B",  # Constrained optimization
  lower = c(0, 0, 0, 0, 0, 0, 0),  # Lower bounds
  upper = c(10, 1, 1, 1, 1, 1, 1)   # Upper bounds
)
par_opt <- optim_results$par

# Print results
cat("Fitted values = ", round(optim_results$par, 2))

Fitted values =  4.23 0.13 0.51 0.84 0.21 0.61 0.93

cat("True values = ", par_true)

True values =  5 0.2 0.5 0.8 0.3 0.6 0.9

The optimized parameter estimates are reasonable approximations of the ‘true’ values, but the curves are wider and flatter (Figure 4).

# Plot observed yield against predictors
theme_set(theme_cowplot(font_size = 10))
fv1 <- ggplot(Y_dat, aes(v1_obs, Y_obs)) +
  geom_point(size = 0.1, col = "grey") +
  geom_function(
    fun = function(x)
      par_true[1] *
      beta3(x, par_true[2], par_true[3], par_true[4]),
    col = "blue"
  ) +
  geom_function(
    fun = function(x)
      par_opt[1] *
      beta3(x, par_opt[2], par_opt[3], par_opt[4]),
    col = "red"
  ) +
  labs(x = expression(v[1]), y = expression(Y[obs]))
fv2 <- ggplot(Y_dat, aes(v2_obs, Y_obs)) +
  geom_point(size = 0.1, col = "grey") +
  geom_function(
    fun = function(x)
      par_true[1] *
      beta3(x, par_true[5], par_true[6], par_true[7]),
    col = "blue"
  ) +
  geom_function(
    fun = function(x)
      par_opt[1] *
      beta3(x, par_opt[5], par_opt[6], par_opt[7]),
    col = "red"
  ) +
  labs(x = expression(v[2]), y = expression(Y[obs]))
plot_grid(fv1,
          fv2,
          nrow = 1,
          labels = "auto",
          label_size = 10)

Figure 4: Fitted (red) and ‘true’ (blue) beta functions.

Check the model fit

We can now predict yield using the fitted parameter estimates. The observed and fitted values are strongly correlated (Figure 5).

# Calculate fitted yields from parameters
Y_dat$Y_fit <- y_pred(par_opt[1], par_opt[2], par_opt[3], par_opt[4],
                     par_opt[5], par_opt[6], par_opt[7], Y_dat$v1_obs, Y_dat$v2_obs)

# Plot observed vs. fitted yields
ggplot(Y_dat, aes(Y_fit, Y_obs)) +
  geom_point(size = 0.1) +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = expression(Y[fit]), y = expression(Y[obs]))
  
# Calculate correlation
cor.test(~Y_fit + Y_obs, Y_dat)


    Pearson's product-moment correlation

data:  Y_fit and Y_obs
t = 74.209, df = 3567, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7658001 0.7916121
sample estimates:
     cor 
0.779036

Parameter uncertainty estimation

There are a number of ways to estimate parameter uncertainty, including profile likelihood, bootstrapping and Bayesian MCMC. to be completed

Boundary Line Analysis

The BLA package can be used to fit trapezoid and other functions for crop yield responses using various algorithms.

The BOLIDES algorithm finds upper (y) bounding points of an x-y scatter plot (Figure 6). The bolides() function provides for numerous modelling fitting options for the selected points, including trapezoid, linear, quadratic, and logistic. Custom functions may also be used (e.g. beta function).

# Load Boundary Line Analyis package
library(BLA)

# Show the boundary line
par(mar = c(3,3,0.1,0.1), las = 1, mgp = c(2, 0.5, 0), tcl = -0.25)
v1_bol <- bolides(Y_dat$v1_obs, Y_dat$Y_obs, model = "explore", plot = TRUE,
        pch = 16, col = "grey", cex = 0.5,
        xlab = expression(v[1]),
        ylab = expression(Y[obs]))

Figure 6: Explore the boundary line for \(v_1\).

bolides() requires an estimate of starting parameters for fitting models by optimization. These can be supplied as vectors, or calculated from selecting points plot that outline the desired response function using startValues() depending on the model to be fitted. We used simulated annealing optimization method = "SANN" to search for global optima.

# Rewrite beta function
beta3bla <- function(x,a,b,c,d){
  y <- ((c - x) / (c - b)) *
    ((x - a) / (b - a)) ^
    ((b - a) / (c - b))
  y[x <= a | x >= c] <- 0 # Set values outside range to zero
  y <- y*d
}

# Fit the model to the boundary line
par(mar = c(3,3,0.1,0.1), las = 1, mgp = c(2, 0.5, 0), tcl = -0.25)
mod_bol <- bolides(Y_dat$v1_obs, Y_dat$Y_obs, 
                    model = "other",
                    method = "SANN",
                    equation = beta3bla,
                    start = c(0.1, 0.6, 0.9, 5),
                    pch = 16, col = "grey", cex = 0.5,
                    xlab = expression(v[1]),
                    ylab = expression(Y[obs]))

# View parameters
mod_bol

$Model
[1] "other"

$Equation
function (x, a, b, c, d) 
{
    y <- ((c - x)/(c - b)) * ((x - a)/(b - a))^((b - a)/(c - 
        b))
    y[x <= a | x >= c] <- 0
    y <- y * d
}
<bytecode: 0x7f9bc3fd9108>

$Parameters
   Estimate
a 0.0785031
b 0.4976271
c 0.9539801
d 6.6061399

$RMS
[1] 0.1446928

Figure 7: Fitted BOLIDES beta function for \(v_1\).

The fitted function (Figure 7) looks somewhat unconvincing because we know that in our simulation, both \(v_1\) and \(Y_{obs}\) have error which both widens the range of \(v_1\) within which the crop grows, and elevates estimated \(Y_{max}\).

To help overcome these biases, it may be better to select an upper quantile fraction of the boundary data points. The blqr() function does this using quantile regression. In this case, we’ll fit the beta function, adapted for use in blqr().

# Fit the model to the boundary line to upper 95%
par(mar = c(3,3,0.1,0.1), las = 1, mgp = c(2, 0.5, 0), tcl = -0.25)
mod_blqr <- blqr(Y_dat$v1_obs, Y_dat$Y_obs, 
                  model = "other",
                  method = "SANN",
                  equation = beta3bla,
                  start = c(0.1, 0.6, 0.9, 5),
                  tau = 0.95,
                  pch = 16, col = "grey", cex = 0.5,
                  xlab = expression(v[1]),
                  ylab = expression(Y[obs]))

# Show fitted model results
print(mod_blqr)

$Model
[1] "other"

$Equation
function (x, a, b, c, d) 
{
    y <- ((c - x)/(c - b)) * ((x - a)/(b - a))^((b - a)/(c - 
        b))
    y[x <= a | x >= c] <- 0
    y <- y * d
}
<bytecode: 0x7f9bc3fd9108>

$Parameters
    Estimate
a 0.06746607
b 0.49391954
c 0.94699374
d 5.11484069

$RSS
[1] 423.0159

Figure 8: Fitted 95% quantile beta function for \(v_1\).

The results look good, but again give a wider range of \(v_1\) than the ‘true’ values, due to noise in the dataset.