Book Chapter Example 1

Solution:

Fitting procedure

As the first step, we need to start with getting familiar with the dataset used in this example. It contains, yield in Mg/ha in different days of the year during the growth season and for different levels of agriculture inputs. It is called ‘sm’ and it’s also part of the ‘nlraa’ R package. We need to always plot the observe data and see how does it look like and make sure that the shape of desired function is sufficiently good to represent the observed data. According to Figure 1 (A), the observed yield for both low and high agricultural input follow a kind of S shape pattern and it makes the Beta function (from the sigmoid group of functions) an options to be used in this example.

So after calling the required packages (ggplot2 and nlraa), we import the dataset using the built-in function ‘data’.

data(sm)

In the Beta growth function (\(Y=Y_{max}(1+\frac{t_e-t}{t_e-t_m})(\frac{t}{t_e})^{(\frac{t_e}{t_e-t_m})}\)) there is one predictor and response variable while it has 3 coefficients that they need to be estimated. In this example, \(Y\) is the response variable and it stands for Yield, while \(t\) is the predictor and it represents the time (Day of the year).

Using the ‘nls’ function, we can estimate the parameters of a nonlinear model by least-squares method. The first argument of ‘nls’ function needs to be the nonlinear model and it should be described in the form of: \(Respons\)~\(f(predictor)\). The second argument needs to address the dataset containing the data and the third argument feeds the ‘nls’ function with an estimation of starting values for each coefficient. There are also some additional arguments for determining the algorithm to use, constraining the parameters and etc (For further options and details about the nls function, look for nls manual using ‘?nls’). The following is an example of how this function needs to be called :

# Non linear regression for the low input data - B1=Block 1
reg.low <- nls(Y.low ~ SSbgf(X.low, w.max, t.e, t.m),data=Maize.B1.low, start=list(w.max=13,t.e=230,t.m=180))

# Non linear regression for the High input data - B1=Block 1
reg.high <- nls(Y.high ~ SSbgf(X.high, w.max, t.e, t.m),data=Maize.B1.high, start=list(w.max=20,t.e=240,t.m=210))

In general, by checking out the observed plot we can guess some initial values for our non-linear model. In the case of Beta function, \(Y_{max}\) represent the maximum value the model can reach and it can be estimated by finding the maximum of the observed yield, while \(t_e\) shows what day of the year it happnes. Finally, \(t_m\) is when the inflection happens and can be estimated by finding the ‘X’ with the maximum slope. Using the initial values estimated from the observed plot and performing the non-linear regression, the following coefficients were found as the result:

Input	\(Y_{max}\)	\(t_e\)	\(t_m\)
Low	14.4	249.12	220.35
High	25	251.26	225.78

By feeding the ‘summary’ function with the object generated as the results of fitting process, we can get more detailed information about the standard error (consequently confidence interval) and the p-value of each coefficient. P-value presented in the summary of regression results shows how likely is to reject the hypothesis that the value of coefficient is equal to zero (null hypothesis). P-value less than some threshold like 0.05 shows that the coeffcient is statisticly different than zero.

summary(reg.low)

## 
## Formula: Y.low ~ SSbgf(X.low, w.max, t.e, t.m)
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## w.max    14.40       2.21     6.5   0.0013 ** 
## t.e     249.12       3.27    76.1  7.4e-09 ***
## t.m     220.35       4.75    46.4  8.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.5 on 5 degrees of freedom
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 3.74e-06

As a result of fitting process, the measured yield (points) versus simulated values (lines) looks like the following plot (Figure 1 (B)):

Figure 1: A) Observed yield during the growth season B) Simulated yield versus observed for both agricultral inputs

Checking the model assumptions

As it has been already stated, there are different assumptions we need to make sure about their validity after performing a non-linear regression. The first assumption is that the expected value of the residuals is zero. In addition, they need to be normally distributed and its variance should be homogeneous.

Residuals mean

We start with checking the mean of residuals, which is not staisfying for the high input data and it makes us suspicious about the performance of the model in that case.

Input	Low	High
Mean of residual	-0.23	-0.27

Equality of the variance

For different level of agricultural input we can check the homogeneity of variance by looking at the scatter plot of the residuals against the predicted values. If the variation of the residuals was changing with the size of predctions it shows inhomogeneity and we could use some transformation like log-transform to solve this problem.

Clearly for both low and high agricultral input, the beta growth function tend to overstimate at early growth stages and result more negetaive residuals in the early growth period.

Normality

Even though having a small sample size of residuals woudn’t provide us a strong evidance about the normality of residuals, we can still use a density plot to see how their distribution look like.

Figure 2: Here you see some interesting stuff about cars and such.

In this specific example since we work with time dependant variable, we can also test and make sure that the resdiuals are not autocorrolated. In order to do that we can simply use ‘acf’ function which finds and plots the corrolation between the residuals in different time lags.

Statistical descriptors

Statistical descriptors were calculated using eq(.) for better understanding of our model’s performance. The first, two indices (RMSE and NRMSE), show the overall error and also its magnitude in relative to the mean of observed value. While, Bias helps to find out weather our model over predicts or under predicts. On the other hand, the last two indices (\(R^2\) and EF) would show the correlation or how well the model understand the pattern exist in observed data.

Input	\(RMSE\)	\(NRMSE\)	\(Bias\)	\(R^2\)	\(EF\)
Low	1.22	18.05	-0.23	0.95	0.94
High	1.1	11.29	-0.27	0.99	0.98

As we expected \(R^2\) and EF are close to one and it is because the Beta growth function follows the ‘S’ shape of the obeserved yield. But our investivation is not over and we need to find out how far are the predicted values from their corresponding observed points. Bias helps us to know that the model is overpredicting in general and NRMSE shows that the mean of predicted values may have 18% error in relative to mean of observed yield.

Discussion

After checking out the assumptions and statistical descriptors, we can start to discuss the validity of the model used in this example and the results obtained. The original Beta growth function doesn’t seem to be the best option for both high and low input data, because the residuals have a large mean, sinousod pattern against the predicted values as well as relativly large error generated by the model.

As it was seen, the beta growth function was not able to fully capture the initial phases of growth and as a result it generated high mean of residuals as well as large error in statistical indices. But this issue can be covered by modifying the Beta function by delaying the growth in the model at the cost of two extra parameters. The modified version of the Beta function (\(Y=Y_b+(Y_{max}-Y_b)(1+\frac{t_e-t}{t_e-t_m})(\frac{t-t_b}{t_e-t_b})^{(\frac{t_e-t_b}{t_e-t_m})}\)) allows for two offsets in both x, y axes and they are called \(t_b\) and \(Y_b\) respectively. These two parameter (\(t_b\) and \(Y_b\)) can represent the planting date and yield at sowing date respectively (Yin et al., 2003). The modified version of Beta function will be tested in the next example as an alternative function with the aim of finding the best descriptor model.