1 Introduction

This investigation builds upon the Data Geeks group project that investigated combining agricultural commodity data with environmental factors such as climate and soil conditions to model the influence these geographical based factors have on agricultural yield. The previous investigation was limited to data for grapes and simple linear regression models.

This study furthers that work by expanding the dataset to 21 different agricultural commodities grown widely in Australia. I investigate methods of analysing and modelling this dataset which is divided up by commodities that each respond differently to environmental variables. The modelling tools developed could be used to predict how changes in environmental factors could impact the output of various agricultural commodities in Australia. Alternatively, they could be used to identify areas that are well suited for growing a particular commodity or vice versa. I have sought to answer the following research question:

Can multilevel models be used to analyse the interaction between geographic environmental factors and the yields of different agricultural commodities grown in Australia?

In this study I have first modelled the commodity groups as a fixed effect in a linear regression model - the baseline. I have then introduced the commodity groups as random effects in a multilevel model. This study assesses the performance of a number of multilevel models, discusses the advantages of such an approach and identifies future steps to be explored.

2 Background

2.1 Agricultural commodities and yield

Crop yield is a measure of how efficiently a commodity is produced within an area. In this dataset the yield is either measured in t/ha or kg/tree, depending on the commodity type. The agricultural yield may be affected by a range of environmental, societal and biological factors. The initial study which focused on grapes on the effect of climate, soil conditions, irrigation and the grape varieties used. As the irrigation and variety data is data only collected by the wine industry in Australia and is not available for other commodities the new study focuses only on climate and soil variables, however the variables from these sources have been expanded.

It is to be expected that different agricultural commodities will have significantly different yields due to the inherent biological differences and farming practices. Because of this yields will differ in both absolute amount and also how they respond to various external factors.

2.2 Multilevel (or mixed effect) models

Multilevel models, also known as linear mixed effect models amongst other things, are well suited for modelling data with a clustered or a hierarchical structure. The input agricultural data included variables that explicitly defined a structure of this kind so such a model seems appropriate. This structure is demonstrated with a sample of the categories in the Figure 1 diagram. Level 1 of the structure is the observations which are each associated with an SA2 region. Level 2 is the agricultural commodity. For each observation the yield is of a particular commodity and the predictor variables are extracted from areas within that region where the commodity is grown. Using these two levels we can create a two level nested model. We could also add Level 3 to create a three level nested model, for example the broad type of commodity as defined in the input source.

Note that in this case Level 1 will feature multiple observations in a single region, but for different commodities. We could create an additional (non-nested) level based on region.

Figure 1: Hierarchical structure of the agriculture commodity dataset

Another way of thinking about how these types of models work is to consider their other name - mixed effect models. The mixed effects refer to a combination of fixed and random effects. Whether variables are designated as fixed or random effects will depend on the dataset and goals of the investigation. In multilevel models the groups as shown in Figure 1 are designated as random" effects, allowing each group to have its own mean and variance.

The use of multilevel models can have the following advantages:

  • Through designation of random effects each group in the models can have different intercepts (random intercept) or slopes (random slope). This can reduce error compared to simple linear regression model that forces all the data to share a common global intercept and slope.
  • Using the data from all groups in a single model can cause group means to drift towards the global mean and reduce errors, compared to running separate models for each group. This is known as shrinkage and is particular advantageous for groups with small sample sizes, as is the case with this dataset.
  • Multilevel models are well suited for analysis variation between groups of interest.

3 Data preparation

The data was prepared using the same procedures employed during the group project stage. The basic steps are as follows:

The dataset was expanded to 21 different agricultural commodities by merging the yield data for each commodity with the predictor variables at the landuse areas where these commodities are grown. The commodities with the most landuse areas were selected to maximise the number of observations within each group and improve the potential performance of the model.

Figure 2 shows number of observations (SA2 regions) in the merged dataset for each commodity. Grapes which was used in the initial model are present in the most regions. Most of the commodities have between 10-30 observations and four commodities have less than 10. For multilevel models to perform well at least 5 groups is required. However repeated observations within groups is also necessary for good performance. To best satisfy these constraints the top 5 commodities that are measured in t/ha have been selected for modelling (grapes, olives, sugar cane, cotton and wheat).

Figure 2: Number of commodity region observations

Figure 3 shows that the yield for sugar cane are significantly higher than the other commodities. This difference may pose an issue for modelling.

Figure 3: Boxplot of the selected commodity yields

The geographic spread of the observations available in the dataset for the chosen commodities is shown in Figure 4. The grape data is most widespread which was part of the reason for selection for the initial model. While the wheat region in South Australia is present in the data Western Australia’s wheat belt is not and landuse data is was not available for this commodity and region. From this geographic spread we can expect a large range of environmental variables.

## Warning: Duplicated aesthetics after name standardisation: colour
Figure 4: Map of the commodity yield observations

4 Analysis

4.1 Linear regression model - baseline

First a linear regression model was run on the whole dataset to serve as a baseline. By doing this we are designating commodity as a fixed effect (along with all the other predictors). An Elastic Net regression model was run with 70% of the observations randomly sampled to create the train set and the remaining 30% used for an out-of-sample test set. In both models the predictor variables were centred and scaled and a five-fold cross validation run. The variables used, after filtering for multicollinearity, are shown in Figure 5.

As expected certain commodity categories appear to be the most influential features, Sugar cane is having the most significant influence as it produces much higher yields. A range of different climate and soil features also appear to be significant including soil pH and depth, the minimum temperature during winter months, the variability or rainfall and the total rainfall of particular months.

Figure 5: Top variables by importance used in the Elastic Net regression model

The results of the models are shown in Table 1 and the diagnostic residual plots are shown in Figure 6. This model’s performance is not as good as what was achieved in the previous model that only included grapes (scatter index = 0.36). The diagnostic plots indicate the model may be influenced outliers from the sugar cane subset and that the data is somewhat skewed. How these issues are handled by the multilevel models will be investigated in the next section.

Metric Value
rmse for test set 11.3
rmse train/test ratio 0.84
Scatter index 0.56
Figure 6: Diagnostic residual plots from the Elastic Net regression model

4.2 Linear multilevel model

The same test/train setup and pre-processing steps were used for all multilevel models. The same features were initially selected as used in the baseline model.

# preprocess by scaling and centering numeric features
yield_preproc_ha <- yield_top_ha %>%
  select(-yield) %>% 
  mutate_if(is.numeric, scale) %>% 
  mutate(yield = yield_top_ha$yield)

# set train and test
train_size <- floor(0.7 * nrow(yield_preproc_ha))
set.seed(11) 
train_n <- sample(seq_len(nrow(yield_preproc_ha)), size = train_size)

train <- yield_preproc_ha[train_n, ]
test <- yield_preproc_ha[-train_n, ]

# check sizes
nrow(yield_top_ha) == nrow(train) + nrow(test)
## [1] TRUE
# selected fixed effect features
fixed_feat <- "ph + minmaysep_temp + soil_depth + clay + varmay_rain + rainjul + bulk_density + phosphorus"

4.2.1 Model 1 - random intercept

Model 1 was a random intercept model with a fixed mean. This models random effect intercepts for each commodity group while using one global slope across all models and is represented by: \(yield \sim environmental\_features + (1|commodity)+ \epsilon\)

Here yield is response variable, the environmental features are fixed effects, the commodity intercept is a random effect and \(\epsilon\) is the error.

The following observations can be made from the summary outputs:

  • The scaled residuals indicate the data is still skewed slightly
  • The Random effects section shows the commodity is describing a large proportion of the variance, relative to the remaining residual
  • pH appears to be the most significant fixed effect
  • There is no significant multicollinearity between the fixed effects
  • It can be seen in the coefficients that the intercept changes for each commodity group and the slopes stay constant
# run the multilevel model
lme  <-  lmer(as.formula(paste("yield  ~", fixed_feat, " + (1 | commodity)")), data=train)

# summary outputs
summary(lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yield ~ ph + minmaysep_temp + soil_depth + clay + varmay_rain +  
##     rainjul + bulk_density + phosphorus + (1 | commodity)
##    Data: train
## 
## REML criterion at convergence: 2122
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.5195 -0.4653 -0.0383  0.3943  7.8169 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  commodity (Intercept) 1768.05  42.048  
##  Residual                93.75   9.683  
## Number of obs: 287, groups:  commodity, 5
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)     21.0583    18.8266   1.119
## ph               5.5442     0.9202   6.025
## minmaysep_temp  -4.1843     1.0857  -3.854
## soil_depth       3.2725     0.8720   3.753
## clay            -2.5426     0.9826  -2.588
## varmay_rain      2.5599     1.0301   2.485
## rainjul          2.5558     0.9492   2.693
## bulk_density    -1.7669     0.7318  -2.414
## phosphorus       1.6851     0.9772   1.724
## 
## Correlation of Fixed Effects:
##             (Intr) ph     mnmys_ sl_dpt clay   vrmy_r rainjl blk_dn
## ph          -0.009                                                 
## minmysp_tmp -0.012 -0.231                                          
## soil_depth   0.000 -0.177 -0.162                                   
## clay        -0.012 -0.060  0.127 -0.045                            
## varmay_rain -0.008  0.094  0.106 -0.141  0.016                     
## rainjul     -0.004  0.487 -0.167 -0.146  0.319  0.444              
## bulk_densty  0.002 -0.220 -0.150  0.096  0.204 -0.054  0.156       
## phosphorus   0.000 -0.030  0.172 -0.221 -0.353 -0.096 -0.115  0.157
coef(lme)
## $commodity
##            (Intercept)       ph minmaysep_temp soil_depth      clay
## cotton       -6.441424 5.544188      -4.184349   3.272528 -2.542639
## grapes        7.438580 5.544188      -4.184349   3.272528 -2.542639
## olives        6.037166 5.544188      -4.184349   3.272528 -2.542639
## sugar cane   95.502917 5.544188      -4.184349   3.272528 -2.542639
## wheat         2.754218 5.544188      -4.184349   3.272528 -2.542639
##            varmay_rain  rainjul bulk_density phosphorus
## cotton        2.559921 2.555772    -1.766887   1.685055
## grapes        2.559921 2.555772    -1.766887   1.685055
## olives        2.559921 2.555772    -1.766887   1.685055
## sugar cane    2.559921 2.555772    -1.766887   1.685055
## wheat         2.559921 2.555772    -1.766887   1.685055
## 
## attr(,"class")
## [1] "coef.mer"

Predicting on the test set the results are essential the same as the baseline model.

# Make predictions on train
train_pred <- predict(lme, train)

# Make predictions on test
test_pred <- predict(lme, test)

# calculate rmse and scatter index
rmse_test <- rmse(test_pred, test$yield)
rmse_train <- rmse(train_pred, train$yield)
SI <- rmse_test/mean(yield_top_ha$yield)
rmse_ratio <- rmse_train/rmse_test

round(rmse_test,1)
## [1] 11.3
round(rmse_ratio,2)
## [1] 0.84
round(SI,2)
## [1] 0.56

4.2.2 Model 2 - correlated random intercept and slope

Model 2 was a random intercept and slope model and is represented by: \(yield \sim environmental\_features + (environmental\_features|commodity)+ \epsilon\)

Such a formula allows the slope to vary between groups and allows correlation between all the random effect terms.

The following observations can be made from the summary outputs:

  • Firstly the “boundary (singular) fit” is indicating that the model is too complicated for the number observations in the dataset.
  • We can see correlations starting to approach the boundaries of 1 and -1 which indicates convergence and the limits of the model.
  • We can see that both the intercept and slopes now vary with commodity. The coefficients sometimes vary from positive to negative between commodity groups which might suggest a breaking down of the model.
# run the multilevel model
lme  <-  lmer(as.formula(paste("yield  ~", fixed_feat, " + (", fixed_feat, " | commodity)")), data=train)
## boundary (singular) fit: see ?isSingular
# summary outputs
summary(lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: yield ~ ph + minmaysep_temp + soil_depth + clay + varmay_rain +  
##     rainjul + bulk_density + phosphorus + (ph + minmaysep_temp +  
##     soil_depth + clay + varmay_rain + rainjul + bulk_density +  
##     phosphorus | commodity)
##    Data: train
## 
## REML criterion at convergence: 2016.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8172 -0.3871 -0.0214  0.3728  7.1770 
## 
## Random effects:
##  Groups    Name           Variance Std.Dev. Corr                         
##  commodity (Intercept)    1725.221 41.536                                
##            ph              165.607 12.869    1.00                        
##            minmaysep_temp    3.861  1.965    0.83  0.85                  
##            soil_depth       12.184  3.491    0.99  0.99  0.77            
##            clay             64.421  8.026   -0.97 -0.98 -0.94 -0.94      
##            varmay_rain       2.145  1.465   -0.03 -0.07 -0.57  0.07  0.25
##            rainjul           8.136  2.852    0.94  0.92  0.58  0.96 -0.83
##            bulk_density      6.216  2.493   -0.98 -0.99 -0.92 -0.96  1.00
##            phosphorus       64.229  8.014    0.98  0.99  0.93  0.95 -1.00
##  Residual                   62.714  7.919                                
##                   
##                   
##                   
##                   
##                   
##                   
##                   
##   0.33            
##   0.22 -0.85      
##  -0.23  0.84 -1.00
##                   
## Number of obs: 287, groups:  commodity, 5
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)     22.5402    18.5897   1.213
## ph               8.6352     5.8138   1.485
## minmaysep_temp  -0.5640     1.1963  -0.471
## soil_depth       3.5428     1.7219   2.057
## clay            -3.1775     3.7085  -0.857
## varmay_rain      0.3083     1.1316   0.272
## rainjul          1.5245     1.5022   1.015
## bulk_density    -1.2986     1.2718  -1.021
## phosphorus       1.7518     3.6976   0.474
## 
## Correlation of Fixed Effects:
##             (Intr) ph     mnmys_ sl_dpt clay   vrmy_r rainjl blk_dn
## ph           0.988                                                 
## minmysp_tmp  0.602  0.599                                          
## soil_depth   0.902  0.879  0.427                                   
## clay        -0.940 -0.947 -0.662 -0.829                            
## varmay_rain -0.015 -0.050 -0.261 -0.003  0.172                     
## rainjul      0.797  0.801  0.293  0.723 -0.633  0.369              
## bulk_densty -0.857 -0.873 -0.650 -0.740  0.877  0.114 -0.585       
## phosphorus   0.945  0.942  0.700  0.807 -0.962 -0.157  0.668 -0.834
## convergence code: 0
## boundary (singular) fit: see ?isSingular
coef(lme)
## $commodity
##            (Intercept)        ph minmaysep_temp soil_depth        clay
## cotton       0.5568035  2.325689     -0.4385049   1.359922  -0.7447301
## grapes       8.0824024  4.256282     -0.9894888   2.329800  -0.7008599
## olives       6.2811129  2.936844     -2.4870605   2.592695   2.0746608
## sugar cane  96.3862081 31.532079      2.4192567   9.679973 -17.1904031
## wheat        1.3944771  2.125199     -1.3243577   1.751564   0.6738451
##            varmay_rain    rainjul bulk_density phosphorus
## cotton      -0.9481073 -0.7784803  -0.43010019 -0.8627614
## grapes       0.0670259  0.4332518  -0.53622526 -0.7781438
## olives       2.0478461  1.6667540   0.23584304 -3.2794961
## sugar cane   0.1374287  6.1989974  -5.67826762 15.8063536
## wheat        0.2375129  0.1022161  -0.08430904 -2.1267522
## 
## attr(,"class")
## [1] "coef.mer"

The out of sample prediction has gotten worse and the rmse ratio also suggests we are overfitting. The complexity of the model needs to be reduced.

# Make predictions on train
train_pred <- predict(lme, train)

# Make predictions on test
test_pred <- predict(lme, test)

# calculate rmse and scatter index
rmse_test <- rmse(test_pred, test$yield)
rmse_train <- rmse(train_pred, train$yield)
SI <- rmse_test/mean(yield_top_ha$yield)
rmse_ratio <- rmse_train/rmse_test

round(rmse_test,1)
## [1] 13.6
round(rmse_ratio,2)
## [1] 0.57
round(SI,2)
## [1] 0.67

4.2.3 Model 3 - uncorrelated random intercept and slope with less variables

One way to reduce the complexity is to remove the correlations between random effects introduced in Model 2. Another way is to remove some less significant fixed effects. In the this model both are done. The random effect correlation terms can be removed with \(||\) like so: \(yield \sim environmental\_features + (environmental\_features||commodity)+ \epsilon\)

The following observations can be made from the summary outputs:

  • Reducing the complexity has removed the boundary problems seen in the previous model
  • The slope coefficients are now mostly all positive or negative, as expected. A notable exception is sugar cane and wheat appear to be far more negatively influenced by the variability of January rain than the other commodities.
less_feat <- "ph + minmaysep_temp + soil_depth + varmay_rain + bulk_density + varjan_rain"

# run the multilevel model
lme  <-  lmer(as.formula(paste("yield  ~", less_feat, " + (", less_feat, " || commodity)")), data=train)

# summary outputs
summary(lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## yield ~ ph + minmaysep_temp + soil_depth + varmay_rain + bulk_density +  
##     varjan_rain + ((1 | commodity) + (0 + ph | commodity) + (0 +  
##     minmaysep_temp | commodity) + (0 + soil_depth | commodity) +  
##     (0 + varmay_rain | commodity) + (0 + bulk_density | commodity) +  
##     (0 + varjan_rain | commodity))
##    Data: train
## 
## REML criterion at convergence: 2096
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.8298 -0.4209 -0.0334  0.3539  7.6525 
## 
## Random effects:
##  Groups      Name           Variance  Std.Dev.
##  commodity   (Intercept)    1024.6556 32.0102 
##  commodity.1 ph                0.6933  0.8326 
##  commodity.2 minmaysep_temp   16.8032  4.0992 
##  commodity.3 soil_depth       13.0538  3.6130 
##  commodity.4 varmay_rain      24.6245  4.9623 
##  commodity.5 bulk_density      0.9052  0.9514 
##  commodity.6 varjan_rain     107.8399 10.3846 
##  Residual                     76.4747  8.7450 
## Number of obs: 287, groups:  commodity, 5
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)     16.9418    14.4524   1.172
## ph               3.1724     0.9179   3.456
## minmaysep_temp  -3.0974     2.3351  -1.326
## soil_depth       3.0710     2.0857   1.472
## varmay_rain      2.9575     2.7726   1.067
## bulk_density    -1.0968     0.9150  -1.199
## varjan_rain     -5.1452     5.0912  -1.011
## 
## Correlation of Fixed Effects:
##             (Intr) ph     mnmys_ sl_dpt vrmy_r blk_dn
## ph          -0.034                                   
## minmysp_tmp  0.005 -0.026                            
## soil_depth   0.011 -0.024 -0.035                     
## varmay_rain -0.006 -0.060  0.040 -0.057              
## bulk_densty -0.003 -0.132 -0.015  0.048 -0.002       
## varjan_rain  0.012 -0.028  0.030  0.015 -0.016 -0.050
coef(lme)
## $commodity
##            (Intercept)       ph minmaysep_temp soil_depth varmay_rain
## cotton       -5.192554 2.996451     -0.8739626   1.000673    1.820914
## grapes        8.197225 3.589697     -0.7960007   2.332757   -0.452600
## olives        6.335993 2.872791     -2.8920372   1.485937    1.806244
## sugar cane   72.862169 3.344031     -8.4372862   7.773372    9.608682
## wheat         2.506085 3.058863     -2.4878514   2.762216    2.004369
##            bulk_density varjan_rain
## cotton       -0.8780934   0.6541116
## grapes       -0.8473394   0.7076168
## olives       -0.8946445   0.8843795
## sugar cane   -1.7683223 -20.8553368
## wheat        -1.0954760  -7.1168109
## 
## attr(,"class")
## [1] "coef.mer"

The out of sample predictions have improved to better than the baseline model (rmse 9.3 compared to 11.3).

# Make predictions on train
train_pred <- predict(lme, train)

# Make predictions on test
test_pred <- predict(lme, test)

# calculate rmse and scatter index
rmse_test <- rmse(test_pred, test$yield)
rmse_train <- rmse(train_pred, train$yield)
SI <- rmse_test/mean(yield_top_ha$yield)
rmse_ratio <- rmse_train/rmse_test

round(rmse_test,1)
## [1] 9.3
round(rmse_ratio,2)
## [1] 0.9
round(SI,2)
## [1] 0.46

5 Discussion

A number of advantages for applying multilevel models to this type of data are evident. Having direct control over the fixed and random effects and essentially defining the structure of the model is very powerful and allows the user to adapt the model to suit the dataset and aim. It also provides more meaningful feedback on what factors influence the response. The results demonstrate that a multilevel model can outperform a linear regression model on this dataset and through further work could likely be improved further. However it is clear that multilevel models have limitations based on group and observation sizes that need to be considered. In this case the lack of data points meant that more complicated models with additional correlations terms or more nested levels could not be used.

For this dataset running separate elastic net models for each commodity may deliver improved performance. As the dataset is very broad with a selection of environmental variables this would allow the models to select new predictors for each commodity. Multilevel models appear to be better suited to narrower models with a very high number of repeated measurements.

Another alternative approach could be to perform logistic regression instead and predict for commodity, rather than yield. This would still deliver on the broad research aim but would eliminate the need to summarise the results by SA2 region. Instead the whole landuse area dataset could be used which contains over 50,000 observations, compared to region dataset which has 622. It would also allow for more sophisticated geospatial based analysis.

Investigating multilevel models has given me a greater appreciation for complicated interactions that are involved with the agricultural dataset we have created. It has also affirmed that we made the correct decsion selecting grapes for our initial model as it is clearly the commodity best suited for modelling.

6 Conclusion

Multilevel modelling has been performed on the agricultural dataset to predict the influence of environmental variables of the yield of a selection of commodities. It was found that a two level nested model (with random intercept and slope) was able to outperform a baseline simple regression model. However the complexity of multilevel models needs to be limited when working with datasets with a relatively small number of observations. Further investigation into alternative approaches is needed to overcome this.

7 References

D Bates, M Mächler, B Bolker, S Walker 2014. Fitting linear mixed-effects models using lme4, view 8 November 2020, https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

B Winter 2014. A very basic tutorial for performing linear mixed effects analyses, view 8 November 2020, http://www.bodowinter.com/uploads/1/2/9/3/129362560/bw_lme_tutorial2.pdf

X A Harrison, L Donaldson, M E Correa-Cano, J Evans, D N Fisher, C E D Goodwin, B S Robinson, D J Hodgson, R Inger 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology, view 8 November 2020, https://peerj.com/articles/4794/